
Networking Todo List

These are the tasks that need to be completed. Move tasks from here to New Stuff when completed.

IPSEC

KAME handles this by just dropping the first packet. TCP then retransmits over and over until the IPSEC route is resolved. This behavior isn't very nice either. The current design is to implement something like ARP, which queues packets until neighbour discovery is complete and then transmits them.
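
The ARP-like queueing described above can be sketched roughly as follows. This is a minimal user-space illustration, not kernel code: struct packet stands in for an sk_buff, and pending_route, route_output(), route_resolved() and the PENDING_MAX cap are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define PENDING_MAX 8           /* hypothetical cap on parked packets */

struct packet {
    int id;                     /* stand-in for a real sk_buff */
};

enum resolve_state { UNRESOLVED, RESOLVED };

struct pending_route {
    enum resolve_state state;
    struct packet *queue[PENDING_MAX];
    size_t queued;              /* packets waiting for resolution */
    size_t sent;                /* packets handed to the transmit path */
    size_t dropped;             /* packets dropped because the queue was full */
};

/* A real stack would hand the packet to the device here. */
void route_transmit(struct pending_route *r, struct packet *p)
{
    (void)p;
    r->sent++;
}

/* Send at once if resolved; otherwise park the packet instead of dropping it. */
void route_output(struct pending_route *r, struct packet *p)
{
    assert(r != NULL && p != NULL);
    if (r->state == RESOLVED)
        route_transmit(r, p);
    else if (r->queued < PENDING_MAX)
        r->queue[r->queued++] = p;
    else
        r->dropped++;
}

/* Called once resolution completes: flush parked packets in arrival order. */
void route_resolved(struct pending_route *r)
{
    size_t i;

    assert(r != NULL);
    r->state = RESOLVED;
    for (i = 0; i < r->queued; i++)
        route_transmit(r, r->queue[i]);
    r->queued = 0;
}
```

With a bounded queue the first packets of a flow survive resolution, so TCP does not have to rely on retransmission; the cap is there only so an unresolvable route cannot pin memory forever.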

Patrick McHardy and Herbert Xu came up with some draft ideas wrt. implementation.

Whilst policy and security updates properly show up, future route and socket route lookups, and the implementation of the necessary flushing, are suboptimal. Also, the policy→bundle lookup can be improved by using something other than a linked list.

As above. In particular, PF_KEY reliability hacks in the kernel (similar to NetBSD) would make Linux a reliable, production-ready VPN concentrator today. While porting to and improving Netlink is the optimal future, quality PF_KEY-based IKE implementations exist today (e.g., racoon), but they are unreliable on Linux without kernel hacks that make PF_KEY a reliable interface.

struct sk_buff

Several spots check things like skb->users and skb_shared() on transmit, conditions which can never be true these days. Known offenders are tunnel devices, ipmr.c and the loopback device. Alexey says that the ipmr.c case is so bad it should be rewritten instead of trying to repair the existing xmit code.

where appropriate to handle non-linear skbs.

There are two sets of applications that want more flexible sk_buff handling for device drivers. One is SKB recycling, as experimented with by Robert Olsson. The other is network devices which use pools of large and small buffers (typically the large buffers are page sized and the small ones are 256 bytes).

The way the smart devices work is they watch TCP flows and accumulate data contiguously into pages. The header portions go into the small buffers. With these devices it is pretty easy to implement receive zero-copy.

With these clever devices the big question is what exactly is the header portion. Implementations I (DaveM) am aware of allow one to teach it the basics of various protocols. For example, you can tell it what a SunRPC header looks like after the TCP part. This is all important so that the data part accumulated into page sized chunks can be flipped directly into the file system cache. Otherwise, if the data is not really page aligned, we can't zero-copy it.
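
To make the header/data split concrete, here is a small user-space sketch of what such a device does per flow. Everything in it is an assumption made for illustration: the flow_rx structure, the function names, and the 4096-byte page size. The hdr_len argument represents whatever the device has been taught about the protocol headers (for example TCP plus a SunRPC header).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SMALL_BUF_SIZE 256      /* small buffer for headers, as in the text */
#define PAGE_BUF_SIZE  4096     /* assumed page size for the large buffer */

struct flow_rx {
    unsigned char hdr[SMALL_BUF_SIZE];  /* most recent header portion */
    size_t hdr_len;
    unsigned char page[PAGE_BUF_SIZE];  /* payload accumulated contiguously */
    size_t page_fill;
};

/*
 * Split one frame: the first hdr_len bytes go to the small buffer, the rest
 * is appended to the page. Returns false if the header does not fit the
 * small buffer or the payload does not fit in what is left of the page.
 */
bool flow_rx_frame(struct flow_rx *f, const unsigned char *frame,
                   size_t frame_len, size_t hdr_len)
{
    size_t payload_len;

    assert(f != NULL && frame != NULL);
    if (hdr_len > frame_len || hdr_len > SMALL_BUF_SIZE)
        return false;
    payload_len = frame_len - hdr_len;
    if (payload_len > PAGE_BUF_SIZE - f->page_fill)
        return false;

    memcpy(f->hdr, frame, hdr_len);
    f->hdr_len = hdr_len;
    memcpy(f->page + f->page_fill, frame + hdr_len, payload_len);
    f->page_fill += payload_len;
    return true;
}

/* Only a completely filled page could be flipped into the page cache. */
bool flow_rx_page_ready(const struct flow_rx *f)
{
    return f->page_fill == PAGE_BUF_SIZE;
}
```

If hdr_len is wrong, header bytes leak into the page and the data is no longer page aligned, which is exactly the case where zero-copy is lost.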

Start moving towards optional IP routing cache

INET sock

This is very non-intuitive and trips up people all the time.

IPV4

IPV6

Its GC replicates the very first variant of its IPv4 counterpart, which was proven to be broken ages ago.

UDP

TCP

Multiple hw queues can be used to spread receive processing across CPUs; this will eliminate CPU utilization as the main bottleneck for 10GbE performance.

Using a NIC that supports multiple hw queues and MSI-X, a network driver can do a decent job of distributing the kernel part of receive traffic processing across CPUs - as long as it is not important which session lands on which CPU. This part doesn't require any changes outside of the driver.

This scheme can be further improved upon if the host tells the driver which CPU it wishes to run a particular session on. With this information, the driver can steer a session to the same CPU that the scheduler runs the socket reads on, and achieve the best cache locality for both kernel and user level rx processing. Much of the newer hardware supports this (see RSS Hash).

Another idea for doing this seems to be the one that Andi came up with - adding a new callback in the netdevice structure that is invoked every time a scheduler migrates socket reads to a different cpu. This would allow the driver to migrate the kernel part of rx processing to the same cpu that the read is running on. In addition to the cpu number, it will be beneficial to get priority for the socket as well. This is because NIC capacity for explicit “session to cpu” steering may not be unlimited.
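
The two steering ideas above (hash by default, explicit steering when the scheduler migrates a reader) can be combined in a sketch like the one below. This is user-space illustration only: the steer_table layout, the STEER_SLOTS limit, and the function names are hypothetical, and flow_hash() is a simple mixing function standing in for whatever hash the hardware actually computes. The priority argument models the point that explicit "session to cpu" slots may be scarce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STEER_SLOTS 4           /* hypothetical limit on explicit rules */

struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct steer_rule {
    struct flow_key key;
    unsigned cpu;
    int prio;
    int used;
};

struct steer_table {
    struct steer_rule rules[STEER_SLOTS];
    unsigned ncpus;
};

/* Simple mixing hash; a stand-in, not the hash a real NIC would use. */
uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = k->saddr;

    h = h * 31u + k->daddr;
    h = h * 31u + (((uint32_t)k->sport << 16) | k->dport);
    h ^= h >> 16;
    return h;
}

int flow_key_eq(const struct flow_key *a, const struct flow_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Pick the rx CPU: an explicit rule wins, otherwise fall back to the hash. */
unsigned steer_cpu(const struct steer_table *t, const struct flow_key *k)
{
    size_t i;

    assert(t != NULL && k != NULL && t->ncpus > 0);
    for (i = 0; i < STEER_SLOTS; i++)
        if (t->rules[i].used && flow_key_eq(&t->rules[i].key, k))
            return t->rules[i].cpu;
    return flow_hash(k) % t->ncpus;
}

/*
 * Hypothetical callback for "the scheduler moved this socket's reader".
 * Returns 1 if a rule was installed or updated, 0 if every slot is held
 * by a flow of equal or higher priority.
 */
int steer_migrate(struct steer_table *t, const struct flow_key *k,
                  unsigned cpu, int prio)
{
    struct steer_rule *slot = NULL;
    size_t i;

    assert(t != NULL && k != NULL && cpu < t->ncpus);
    for (i = 0; i < STEER_SLOTS && !slot; i++)      /* existing rule? */
        if (t->rules[i].used && flow_key_eq(&t->rules[i].key, k))
            slot = &t->rules[i];
    for (i = 0; i < STEER_SLOTS && !slot; i++)      /* free slot? */
        if (!t->rules[i].used)
            slot = &t->rules[i];
    if (!slot)                                      /* lowest-priority victim */
        for (i = 0; i < STEER_SLOTS; i++)
            if (t->rules[i].prio < prio &&
                (!slot || t->rules[i].prio < slot->prio))
                slot = &t->rules[i];
    if (!slot)
        return 0;
    slot->key = *k;
    slot->cpu = cpu;
    slot->prio = prio;
    slot->used = 1;
    return 1;
}
```

A flow that loses its slot simply falls back to hash-based placement, so running out of explicit rules degrades locality rather than correctness.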

RSS hash is simpler than adding scheduler tweaks; the problem is getting it to work with MSI-X.

This can arguably be left for now to the driver-only implementation, since the support needed from the stack - the ability to accept a fragmented skb that is bigger than the MTU - is already there. The only other thing to consider may be forcing an ACK per LRO frame; not sure if this is worthwhile… LRO implementations will need to be implemented carefully if we ever add support for the ECN nonce bit.

In addition to distributing rx processing across multiple CPUs (#1 above), hw queues can be used for other things, like QoS for incoming traffic. In this case, separate queues for higher priority traffic will guarantee things like lower latency, better bandwidth, better DoS protection and more fine-tuned (per queue, not per NIC) interrupt moderation.

This part needs more discussion. Possibly NAPI can make some changes to utilize the feature, and some common user-level configuration options (via do_ioctl) may be useful too.


DCCP

Status: Harald Welte attached an (untested) patch for basic iptables support. Please review (esp. the option matching part) and consider applying it to your tree (or tell me to submit it to davem). Current iptables from svn.netfilter.org has the required userspace support (and even a manpage snippet).

To the best of my knowledge, we're the only stateful packet filter that does SCTP so far… would be great to have DCCP support, too. Since you know the state transitions and other aspects of the DCCP protocol well, it would be great to see ip_conntrack_proto_dccp.c (or even better: nf_conntrack_proto_dccp.c) at some point. Requested by Harald Welte.

The issue is that the CCID3 receiver on Linux sends ACKs with the time stamp option and does not send ACKs with the elapsed time option (as far as I tested). It seems to me that the CCID3 receiver must use the elapsed time option, or can use the time stamp echo option in case the CCID3 sender uses the time stamp option (according to section 8.2 in draft-ietf-dccp-ccid3-11.txt). Raised by Nishida-san

Look at changing away from struct timeval/do_gettimeofday as these waste 4 bytes per instance on 64 bit machines. Raised by ArnaldoMelo

One thing that I found out is that we're not accounting the packet history allocations to the socket, which is very wrong and I'll work on fixing in the coming days. Raised by ArnaldoMelo

The service code in the REQUEST and RESPONSE packets is in network byte order and TcpDump is not using ntohl on it. Below is the dump for a session where I used service=1=ntohl(16777216). One other idea is to check whether the 4 bytes that compose the service code are in the ASCII printable range and, if so, present them as “names” as suggested in the draft. Raised by ArnaldoMelo
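
A sketch of the decoding described above, in portable C. The function names are hypothetical and this is not TcpDump's code; the printable range used (0x20 to 0x7e) and the choice to fall back to decimal are assumptions made for the example rather than something taken from the draft.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Read the 32-bit service code from its four wire bytes. Shifting byte by
 * byte gives the same result on any host, which is what ntohl() achieves
 * for a value loaded straight from the packet. The wire bytes 00 00 00 01
 * decode to 1; read as a little-endian integer without conversion they
 * would show up as 16777216.
 */
uint32_t service_code_from_wire(const unsigned char wire[4])
{
    return ((uint32_t)wire[0] << 24) | ((uint32_t)wire[1] << 16) |
           ((uint32_t)wire[2] << 8)  |  (uint32_t)wire[3];
}

/* True if all four bytes are printable ASCII (assumed range 0x20..0x7e). */
int service_code_is_name(const unsigned char wire[4])
{
    int i;

    for (i = 0; i < 4; i++)
        if (wire[i] < 0x20 || wire[i] > 0x7e)
            return 0;
    return 1;
}

/* Format as a four-character name when printable, otherwise as decimal. */
void service_code_format(const unsigned char wire[4], char *out, size_t outlen)
{
    assert(out != NULL && outlen >= 11);    /* "4294967295" plus NUL */
    if (service_code_is_name(wire))
        snprintf(out, outlen, "%c%c%c%c", wire[0], wire[1], wire[2], wire[3]);
    else
        snprintf(out, outlen, "%lu",
                 (unsigned long)service_code_from_wire(wire));
}
```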

The idea is mainly to have a TLV like concept to allow managing all the simple id=value settings.
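
As a rough illustration of the TLV concept, the sketch below packs id=value settings into a buffer and looks them up again. The layout (16-bit type, 16-bit payload length, payload padded to 4 bytes, big-endian header fields) and the names tlv_put() and tlv_find() are assumptions made for this example; it is not the kernel's attribute format or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_HDR 4                                   /* type + length */
#define TLV_ALIGN(n) (((n) + 3) & ~(size_t)3)       /* pad payload to 4 bytes */

/*
 * Append one attribute at offset off. Returns the new offset, or 0 if the
 * attribute does not fit (a valid new offset is never 0).
 */
size_t tlv_put(unsigned char *buf, size_t cap, size_t off,
               uint16_t type, const void *val, uint16_t len)
{
    size_t padded = TLV_ALIGN((size_t)len);
    size_t need = TLV_HDR + padded;

    assert(buf != NULL && off <= cap);
    if (need > cap - off)
        return 0;
    buf[off]     = (unsigned char)(type >> 8);
    buf[off + 1] = (unsigned char)(type & 0xff);
    buf[off + 2] = (unsigned char)(len >> 8);
    buf[off + 3] = (unsigned char)(len & 0xff);
    memset(buf + off + TLV_HDR, 0, padded);         /* zero the padding */
    if (len)
        memcpy(buf + off + TLV_HDR, val, len);
    return off + need;
}

/*
 * Find the first attribute of the given type in buf[0..used). Returns a
 * pointer to its payload and stores the payload length in *len_out, or
 * returns NULL if the type is absent or the buffer is malformed.
 */
const unsigned char *tlv_find(const unsigned char *buf, size_t used,
                              uint16_t type, uint16_t *len_out)
{
    size_t off = 0;

    assert(buf != NULL && len_out != NULL);
    while (used - off >= TLV_HDR) {
        uint16_t t = (uint16_t)((buf[off] << 8) | buf[off + 1]);
        uint16_t l = (uint16_t)((buf[off + 2] << 8) | buf[off + 3]);
        size_t step = TLV_HDR + TLV_ALIGN((size_t)l);

        if (step > used - off)
            return NULL;                            /* truncated attribute */
        if (t == type) {
            *len_out = l;
            return buf + off + TLV_HDR;
        }
        off += step;
    }
    return NULL;
}
```

The appeal of this shape is that a receiver can skip attribute types it does not know, so new settings can be added without changing a fixed structure.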

Done - part of 2.6.16. Jamal has some doc in progress.

Other emerging users of this are: MPLS and process accounting.

Packet Classifiers

Depends on the work going on to remove the route cache so this is on hold.

It's a pretty useless algorithm given KMP and BM already outclass it. We want to use it to validate Thomas' callbacks etc (Read: How fast can you do it the LinuxWay?) as well as giving us a laugh test check (we have it too - a really bad excuse, but i hope to have fun).

Why not make qdiscs real kobjects? Then they could be linked into sysfs and it would be easier to see the interrelationships of classifiers, qdiscs, and devices.

xfrm

In IPSEC trees route based NAT is broken; the code needs to be converted to use the xfrm engine. Actually, it's been entirely deleted from the tree now. Thomas Graf supposedly has some code coming which will reintroduce this feature.

Misc Optimizations

MPLS support is really needed for us to be taken seriously as a full VPN solution in some environments. DaveM wrote a skeletal implementation long ago and passed it on to Jamal, who enhanced the netlink layer significantly in order to support configuration of things like MPLS much better. Unfortunately, we all ended up in a spat with the maintainer of another MPLS Linux implementation, nobody yielded and everything ended up stuck in the mud. Update: Steve Whitehouse is working with James Leu - so we expect to see some good stuff Real Soon Now.

It is argued that a saner way needs to exist in order to implement correct module unload for non-trivial modules such as IPV6.

Alexey has proposed a multi-stage unload sequence. In the first stage, the module removes all of its public interfaces. In the second stage, we wait for references to existing objects to go away. Rusty is in general agreement, although he wants us to exercise caution before we go down any avenue at all. He also wants us to be aware of the good points about the current counter based system in 2.5.x.
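
The two stages can be modelled with a toy state machine. This is only a single-threaded sketch of the idea as described above, with hypothetical names throughout; a real implementation would sleep in the second stage rather than poll, and would need proper locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mod_state { MOD_LIVE, MOD_UNREGISTERED, MOD_GONE };

struct toy_module {
    enum mod_state state;
    unsigned live_objects;      /* objects still referenced by someone */
};

/* New objects can only be created while public interfaces are registered. */
bool mod_object_create(struct toy_module *m)
{
    assert(m != NULL);
    if (m->state != MOD_LIVE)
        return false;
    m->live_objects++;
    return true;
}

/* Drop the last reference to one object. */
void mod_object_release(struct toy_module *m)
{
    assert(m != NULL && m->live_objects > 0);
    m->live_objects--;
}

/* Stage 1: remove all public interfaces so no new users can arrive. */
void mod_unload_stage1(struct toy_module *m)
{
    assert(m != NULL && m->state == MOD_LIVE);
    m->state = MOD_UNREGISTERED;
}

/*
 * Stage 2: succeed only once existing objects have gone away. Returns
 * false while objects remain, in which case the caller tries again later.
 */
bool mod_unload_stage2(struct toy_module *m)
{
    assert(m != NULL);
    if (m->state != MOD_UNREGISTERED || m->live_objects != 0)
        return false;
    m->state = MOD_GONE;
    return true;
}
```

Note that the only counter here is the module's own object count, which is the point made in the argument about module refcounting being redundant with the object management a complex module already has to do.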

Alexey and I (davem) fear that when using the counter system in a complex module, the whole thing would be polluted with module_{get,put}() calls everywhere. We also argue that, because a module has to do its own object management and reference counting, the module refcounting facility is superfluous.