Networking Todo List

These are the tasks that need to be completed. Move tasks from here to New Stuff when completed.

IPSEC

ARP-like resolution of IPSEC rules. Currently, if a policy needs to be resolved by a key manager during connect() we behave as follows:
- O_NONBLOCK: Continually return -EAGAIN until resolution is complete.
- not-O_NONBLOCK: We sleep until the key manager resolves the policy or we time out.

KAME handles this by just dropping the first packet. TCP retransmits over and over until the IPSEC route is resolved. This behavior isn't very nice either. The currently designed solution is to implement something like ARP: ARP queues packets until neighbour discovery is complete, then transmits them.
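The ARP-style behaviour described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: `pending_queue`, `pq_enqueue`, `pq_resolve`, `count_xmit` and the bound `PQ_MAX` are hypothetical names invented here, not kernel interfaces, and the draft ideas mentioned in this section may look quite different.

```c
#include <assert.h>
#include <stddef.h>

#define PQ_MAX 3  /* hypothetical bound on packets held while resolving */

struct packet { int id; };

struct pending_queue {
    int resolved;                 /* set once the key manager has answered */
    size_t len;                   /* number of packets currently held */
    struct packet *held[PQ_MAX];  /* packets waiting for resolution */
};

/* Transmit callback; stands in for the real output path. */
typedef void (*xmit_fn)(struct packet *pkt, void *ctx);

/* Example callback: counts transmitted packets through ctx. */
void count_xmit(struct packet *pkt, void *ctx)
{
    (void)pkt;
    ++*(int *)ctx;
}

/* Returns 1 if sent immediately, 0 if queued, -1 if dropped (queue full). */
int pq_enqueue(struct pending_queue *q, struct packet *pkt,
               xmit_fn xmit, void *ctx)
{
    assert(q != NULL && pkt != NULL && xmit != NULL);
    if (q->resolved) {
        xmit(pkt, ctx);
        return 1;
    }
    if (q->len == PQ_MAX)
        return -1;
    q->held[q->len++] = pkt;
    return 0;
}

/* Called when resolution completes: flush held packets in arrival order.
 * Returns the number of packets flushed. */
size_t pq_resolve(struct pending_queue *q, xmit_fn xmit, void *ctx)
{
    size_t i, n;

    assert(q != NULL && xmit != NULL);
    n = q->len;
    q->resolved = 1;
    for (i = 0; i < n; i++)
        xmit(q->held[i], ctx);
    q->len = 0;
    return n;
}
```

With this shape the caller neither spins on -EAGAIN nor sleeps: the first packets are held and go out as soon as resolution finishes.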

Patrick McHardy and Herbert Xu came up with some draft ideas wrt. implementation.

Policy and Security Associations

While policy and security association updates properly show up in future route and socket route lookups, the implementation of the necessary flushing is suboptimal. Also, the policy→bundle lookup can be improved by using something other than a linked list.

PF_KEY reliability.

As with above, in particular, PF_KEY reliability hacks in the kernel (similar to NetBSD) will make Linux a reliable and production-ready VPN concentrator today. While porting to and improving Netlink is the optimal future, quality PF_KEY-based IKE implementations exist today (e.g., racoon), but these implementations are unreliable on Linux without such kernel hacks to make PF_KEY a reliable interface.

struct sk_buff

SKBs are too big, ongoing work… See pages such as this one for some ideas and analysis:
- skb→h is really useless and can be eliminated immediately. The only place where it is really used is checksumming offload on output. skb→h is used there to mark the beginning of area to checksum, the idea was to support offload for protocols other than TCP and UDP. Given that this generality is not used, it can be replaced with direct parsing of IP header.
- skb→mac.raw can be removed easily provided skb→mac_len is left intact. They cannot both be removed: skb→mac.raw is used in packet socket to return the MAC header back. This information could be passed as an argument to the ptype handler. Unfortunately, the MAC header is removed inside the device driver, so it would require lots of changes. Another use is for logging and filtering by MAC address in netfilter and in some other places (for example even net/ipv4/route.c uses it). It is not clear how to remove this without reducing functionality.
- net.raw has similar issues: IP header is used in recvmsg() to fetch, for example, IP addresses. Essentially, to remove it we have to hold skb→data at IP header and then reparse the packet in recvmsg(). Honestly, it may not be worth the effort.
- skb→input_dev can be made optional under CONFIG_NET_CLS_ACT. No reference counting is done for input_dev and thus references to them outside of the softirq handler are illegal. One idea is to use ifindex of input device.
- skb→dev is also used for interlevel argument passing. It could be killed in theory, but in practice it would be a lot of fuss.
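As an illustration of the "direct parsing of IP header" idea in the first item above, the start of the area to checksum can be derived from the IPv4 header length field. This is a minimal userspace sketch; `ipv4_transport_offset` is a hypothetical helper name, and it assumes a plain IPv4 packet with the header at the start of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the offset of the transport header from the start of an IPv4
 * header, or -1 if the buffer cannot hold a valid header.
 */
int ipv4_transport_offset(const uint8_t *iph, size_t len)
{
    unsigned int version, ihl;

    assert(iph != NULL);
    if (len < 20)
        return -1;                  /* shorter than the minimum header */
    version = iph[0] >> 4;          /* high nibble of the first byte */
    ihl = (iph[0] & 0x0fu) * 4u;    /* IHL counts 32-bit words */
    if (version != 4 || ihl < 20 || ihl > len)
        return -1;
    return (int)ihl;
}
```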

Fix skb→users and skb_shared() bogosity on transmit

Several spots check things like skb→users and skb_shared() on transmit, which can never be true these days. Known offenders are tunnel devices, ipmr.c and loopback device. Alexey says that ipmr.c case is so bad it should be rewritten instead of trying to repair existing xmit code.

Use the new zero-copy sequential skb data read interface

where appropriate to handle non-linear skbs.

Need an API so a device can manage its receive buffer memory

There are two sets of applications that want more flexible sk_buff handling for device drivers. One is SKB recycling as experimented with by Robert Olsson. The other is for network devices which use pools of large and small buffers (typically the large buffers are page sized and the small ones are 256 bytes).

The way the smart devices work is they watch TCP flows and accumulate data contiguously into pages. The header portions go into the small buffers. With these devices it is pretty easy to implement receive zero-copy.

With these clever devices the big question is what exactly is the header portion. Implementations I (DaveM) am aware of allow one to teach it the basics of various protocols. For example, you can tell it what a SunRPC header looks like after the TCP part. This is all important so that the data part accumulated into page sized chunks can be flipped directly into the file system cache. Otherwise, if the data is not really page aligned, we can't zero-copy it.

Start moving towards optional IP routing cache

Get rid of use of source addresses and information depending upon source address in dst entry: rt_src etc. It could be optionally “cached” there, but retrieved by another callback to routing, when “cached” result is not available. This would allow radical reduction of routing cache pressure at least when routing does not depend on source address.
Get rid of use of destination addresses in dst entries. This would allow aggregating dst entries and using direct references to the underlying fib_info instead.

INET sock

Make inet_sock→cork a pointer, this will make inet_sock (and all sub-class structures) nearly 100 bytes smaller.
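The effect of turning an embedded member into a pointer can be sketched as below. The structures are hypothetical stand-ins (the 96-byte `cork_state` is an assumption chosen only to be in the "nearly 100 bytes" range mentioned above); the real inet_sock layout differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the corking state; the real layout differs. */
struct cork_state {
    unsigned char opaque[96];
};

/* Corking state embedded: every socket pays for it. */
struct sock_embedded {
    int num;
    struct cork_state cork;
};

/* Corking state behind a pointer: only corking sockets pay for it. */
struct sock_pointer {
    int num;
    struct cork_state *cork;
};

/* Allocate the corking state on first use; returns NULL on failure. */
struct cork_state *sock_get_cork(struct sock_pointer *sk)
{
    assert(sk != NULL);
    if (sk->cork == NULL)
        sk->cork = calloc(1, sizeof(*sk->cork));
    return sk->cork;
}

/* Release the corking state once the socket no longer needs it. */
void sock_put_cork(struct sock_pointer *sk)
{
    assert(sk != NULL);
    free(sk->cork);
    sk->cork = NULL;
}
```

The trade-off is an extra allocation and a NULL check on the corking path in exchange for a smaller structure in the common, non-corking case.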

The 'dst' argument to request_sock_ops→rtx_syn_ack is always given as NULL and thus unused. We should delete it.

inet_lookup and __inet_lookup have strange handling of the local port in order to accommodate the fact that inet_sock→num is in cpu-endianness and that is what is used for hashing and port comparisons on packet input.

This is very non-intuitive and trips up people all the time.

IPV4

IPV6

IPV6 gc engine is broken and needs revision

Its gc replicates the very first variant of its IPv4 counterpart, which was proven to be broken ages ago.

UDP

Add locking on connect(2) path. Right now two threads can call connect(2) simultaneously with undefined results. The same thing probably applies to all datagram protocols.

Make send(2) path lockless again if corking is not used (suggested by Alexey).

TCP

Investigate TCP traffic steering, i.e. TCP flow association
Add real Async I/O networking support (i.e. Linux aio over sockets)
Investigate various receive side offloads
Distributing RX processing across multiple CPUs

Multiple hw queues can be used to spread receive processing across CPUs; this will eliminate main cpu% as a bottleneck for 10GbE performance.

Using a NIC that supports multiple hw queues and MSI-X, a network driver can do a decent job of distributing the kernel part of receive traffic processing across CPUs - as long as it is not important which session lands on which cpu. This part doesn't require any changes outside of the driver.

This scheme can be further improved upon if the host tells the driver which CPU it wishes to run a particular session on. With this information, the driver can steer a session to the same CPU that the scheduler runs the socket reads on, and achieve the best cache locality for both kernel and user level rx processing. Much of the newer hardware supports this (see RSS Hash).

Another idea for doing this seems to be the one that Andi came up with - adding a new callback in the netdevice structure that is invoked every time a scheduler migrates socket reads to a different cpu. This would allow the driver to migrate the kernel part of rx processing to the same cpu that the read is running on. In addition to the cpu number, it will be beneficial to get priority for the socket as well. This is because NIC capacity for explicit “session to cpu” steering may not be unlimited.

RSS hash is simpler than adding scheduler tweaks; the problem is getting it to work with MSI-X.

This can arguably be left for now to the driver-only implementation, since the support needed from the stack - the ability to accept a fragmented skb that is bigger than the MTU - is already there. The only other thing to consider may be forcing an ACK per LRO frame; not sure if this is worthwhile… LRO implementations will need to be implemented carefully if we ever add support for the ECN nonce bit.

Additional Support for Multiple HW Queues

In addition to distributing rx processing across multiple CPUs (#1 above), hw queues can be used for other things, like QoS for incoming traffic. In this case, separate queues for higher priority traffic will guarantee things like lower latency, better bandwidth, better DoS protection and more fine-tuned (per queue, not per NIC) interrupt moderation.

This part needs more discussion. Possibly NAPI can make some changes to utilize the feature, and some common user-level configuration options (via do_ioctl) may be useful too.

Alexey suggests that we should walk through the non-FACK retransmit handling paths and make them follow RFC3517 more accurately.

There are still lots of possibilities for reducing the number of lock roundtrips necessary to send and receive a packet. Also, reader/writer locks are overused in the networking code; on modern hardware a reader/writer lock is significantly more expensive than a spinlock.

DCCP

CCID3
- Packet history allocations have to be accounted to the socket.
- Feedback packets are not being sent once per RTT as per spec
- Using CCID3 for heavy bidirectional data traffic (example: RFC 862 Echo service) currently does not work well, it is recommended to use two unidirectional connections instead.
- RFC 3448 says “When calculating the average loss interval we need to decide whether to include the interval since the most recent packet loss event. We only do this if it is sufficiently large to increase the average loss interval.” The effect of us doing this is that if we get one loss it can bring our rate down and it never recovers if we don't get another loss later on.
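For reference, the rule quoted from RFC 3448 amounts to taking the larger of two weighted sums, one including the open interval since the most recent loss event and one excluding it. The sketch below follows RFC 3448 section 5.4 with n = 8; the function and array names are hypothetical, and it is not the CCID3 code in the tree.

```c
#include <assert.h>
#include <stddef.h>

#define LI_N 8  /* number of loss intervals averaged (n = 8 in RFC 3448) */

/* Weights from RFC 3448 section 5.4 for n = 8, most recent first. */
static const double li_weight[LI_N] = {
    1.0, 1.0, 1.0, 1.0, 0.8, 0.6, 0.4, 0.2
};

/*
 * interval[0] is the open interval since the most recent loss event;
 * interval[1..LI_N] are the closed loss intervals, newest first.
 */
double avg_loss_interval(const double interval[LI_N + 1])
{
    double tot0 = 0.0, tot1 = 0.0, wtot = 0.0;
    size_t i;

    assert(interval != NULL);
    for (i = 0; i < LI_N; i++) {
        tot0 += interval[i] * li_weight[i];      /* includes the open interval */
        tot1 += interval[i + 1] * li_weight[i];  /* excludes the open interval */
        wtot += li_weight[i];
    }
    /* The open interval only counts if it raises the average. */
    return (tot0 > tot1 ? tot0 : tot1) / wtot;
}
```

With this rule, a long loss-free period makes the open interval dominate and the average grows again, which is the recovery the item above says is not happening.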
Memory usage. Can run a machine out of memory on rx briefly - partially related to above?? - but need to put a limit on buffers.
CCID3 VoIP variant
PMTU
timestamps - Stop using do_gettimeofday with offsets: it's not monotonic, so admins can change the system's time and the offsets get invalidated.
Fixme/bug comments in code
Share congestion code with TCP
Implement the remaining options processing
Implement iptables header matching for DCCP

Status: Harald Welte attached an (untested) patch for basic iptables support. Please review (esp. the option matching part) and consider applying it to your tree (or tell me to submit it to davem). Current iptables from svn.netfilter.org has the required userspace support (and even a manpage snippet).

Implement connection tracking and NAT for DCCP in netfilter/iptables

To the best of my knowledge, we're the only stateful packet filter that does SCTP so far… would be great to have DCCP support, too. Since you know the state transitions and other aspects of the DCCP protocol well, it would be great to see ip_conntrack_proto_dccp.c (or even better: nf_conntrack_proto_dccp.c) at some point. Requested by Harald Welte.

Timestamp options

The issue is that the CCID3 receiver on Linux sends ACKs with the time stamp option and does not send ACKs with the elapsed time option (as far as I tested). It seems to me that the CCID3 receiver must use the elapsed time option, or can use the time stamp echo option in case the CCID3 sender uses the time stamp option (according to section 8.2 in draft-ietf-dccp-ccid3-11.txt). Raised by Nishida-san

Misc

Look at changing away from struct timeval/do_gettimeofday as these waste 4 bytes per instance on 64 bit machines. Raised by ArnaldoMelo

One thing that I found out is that we're not accounting the packet history allocations to the socket, which is very wrong and I'll work on fixing in the coming days. Raised by ArnaldoMelo

The service code in the REQUEST and RESPONSE packets is in network byte order and TcpDump is not using ntohl on it, below is the dump for a session where I used service=1=ntohl(16777216), one other idea is to look if the 4 bytes that compose the service are in the ASCII printable range and present it as “names” like suggested in the draft. Raised by ArnaldoMelo
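A minimal sketch of what such a decoder could do is shown below: convert the four wire bytes with ntohl, and print them as a name when all four are printable ASCII. `format_service_code` is a hypothetical helper, not code from tcpdump, and the printable range used (0x20 to 0x7e) is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

/*
 * Format a DCCP service code taken from the wire (network byte order).
 * Writes the four bytes as a name if all are printable ASCII, otherwise
 * the host-order number. Returns the host-order value.
 */
uint32_t format_service_code(const uint8_t wire[4], char *out, size_t outlen)
{
    uint32_t be, host;
    int i, printable = 1;

    assert(wire != NULL && out != NULL && outlen > 0);
    memcpy(&be, wire, sizeof(be));  /* avoid unaligned access */
    host = ntohl(be);
    for (i = 0; i < 4; i++)
        if (wire[i] < 0x20 || wire[i] > 0x7e)
            printable = 0;
    if (printable)
        snprintf(out, outlen, "%.4s", (const char *)wire);
    else
        snprintf(out, outlen, "%lu", (unsigned long)host);
    return host;
}
```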

Add handling of IPV6_PKTOPTIONS in net/dccp/ipv6.c, similar to the handling in net/ipv6/tcp_ipv6.c. Raised by Gerrit Renker

Identify empty loss intervals in a way different from using ~0U, or audit to make its use safe. Raised by Eddie. Ian has audited it as OK but says it should be stored as an array.

Netlink

Fix 64bit netlink alignment issues (gen_stats, …)
Generic netlink attribute macros (NLA_*)
devconfig via rtnetlink

The idea is mainly to have a TLV like concept to allow managing all the simple id=value settings.

Generic netlink family to be used first by TIPC

Done - part of 2.6.16. Jamal has some doc in progress.

Other emerging users of this are: MPLS and process accounting.

Packet Classifiers

Cleanup locking in net/sched/
Add routing attributes to meta ematch

Depends on the work going on to remove the route cache so this is on hold.

Rabin fingerprints using the ematch stuff

It's a pretty useless algorithm given KMP and BM already outclass it. We want to use it to validate Thomas' callbacks etc (Read: How fast can you do it the LinuxWay?) as well as giving us a laugh test check (we have it too - a really bad excuse, but i hope to have fun).

Meta action
Qdisc in sysfs

Why not make qdisc's real kobjects? Then they could be linked into sysfs and it would be easier to see the interrelationships of classifiers, qdisc, and devices.

xfrm

CONFIG_IP_ROUTE_NAT needs to be converted over to xfrm engine

In IPSEC trees route based NAT is broken, the code needs to be converted to use the xfrm engine. Actually, it's been entirely deleted from the tree now. Thomas Graf supposedly has some code coming which will reintroduce this feature.

xfrm2

Misc Optimizations

Scan networking code for __read_mostly candidates
MPLS stack really desirable for real VPN support

MPLS support is really needed for us to be taken seriously as a full VPN solution in some environments. DaveM wrote a skeletal implementation long ago and passed it on to Jamal who enhanced the netlink layer significantly in order to support configuration of things like MPLS much better. Unfortunately, we all ended up in a spat with the maintainer of another MPLS Linux implementation, nobody yielded and everything ended up stuck in the mud. Update: Steve Whitehouse is working with James Leu - so we expect to see some good stuff Real Soon Now.

Need ability to handle non-trivial modules sanely

It is argued that a saner way needs to exist in order to implement correct module unload for non-trivial modules such as IPV6.

Alexey has proposed a multi-stage unload sequence. In the first stage, the module removes all of its public interfaces. In the second stage, we wait for references to existing objects to go away. Rusty is in general agreement, although he wants us to exercise caution before we go down any avenue at all. He also wants us to be aware of the good points about the current counter based system in 2.5.x.

Alexey and I (davem) fear that when using the counter system in a complex module, the whole thing would be polluted with module_{get,put}() calls everywhere. We also argue that, because a module has to do its own object management and reference counting, the module refcounting facility is superfluous.
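The multi-stage unload sequence proposed above can be sketched as a tiny state machine. Everything here is hypothetical (`demo_module` and its helpers are not kernel interfaces) and there is no locking; it only shows the ordering of the two stages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical module state; not the kernel's struct module. */
struct demo_module {
    int interfaces_public;   /* 1 while new users can still find the module */
    unsigned int live_refs;  /* references to objects the module still owns */
};

/* Stage 1: remove public interfaces so no new references can be taken. */
void demo_unload_stage1(struct demo_module *m)
{
    assert(m != NULL);
    m->interfaces_public = 0;
}

/* Take a reference; fails (-1) once stage 1 has run. */
int demo_get(struct demo_module *m)
{
    assert(m != NULL);
    if (!m->interfaces_public)
        return -1;
    m->live_refs++;
    return 0;
}

/* Drop a reference taken earlier. */
void demo_put(struct demo_module *m)
{
    assert(m != NULL && m->live_refs > 0);
    m->live_refs--;
}

/* Stage 2 is complete once the last existing reference has gone away. */
int demo_unload_stage2_done(const struct demo_module *m)
{
    assert(m != NULL);
    return !m->interfaces_public && m->live_refs == 0;
}
```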

Fix remaining abuses of IFF_RUNNING (syncppp, s390/net)

sys_getsockname and sys_getpeername are lame functions. They both do exactly the same thing in two different structural ways. The only difference between them is the setting of the “peer” fourth argument to sock→ops→getname() which is zero for sys_getsockname and non-zero for sys_getpeername. The common logic screams to be separated out into a separate function.
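The suggested factoring can be sketched in userspace as below. The `demo_*` types and functions are hypothetical stand-ins for the socket, address and getname() operation; only the shape (one common helper, two thin wrappers differing in the peer flag) reflects the point above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_addr {
    char text[32];
};

struct demo_sock {
    struct demo_addr local, remote;
    int connected;
};

/* Stand-in for sock->ops->getname(): peer selects remote vs local. */
static int demo_getname(const struct demo_sock *sk, struct demo_addr *out,
                        int peer)
{
    if (peer && !sk->connected)
        return -1;  /* the kernel would report -ENOTCONN here */
    *out = peer ? sk->remote : sk->local;
    return 0;
}

/* The common logic, written once. */
static int demo_get_name_common(const struct demo_sock *sk,
                                struct demo_addr *out, int peer)
{
    assert(out != NULL);
    if (sk == NULL)
        return -1;
    return demo_getname(sk, out, peer);
}

int demo_getsockname(const struct demo_sock *sk, struct demo_addr *out)
{
    return demo_get_name_common(sk, out, 0);
}

int demo_getpeername(const struct demo_sock *sk, struct demo_addr *out)
{
    return demo_get_name_common(sk, out, 1);
}
```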

sock_close has this ugly if (!inode) test and a totally inaccurate comment above it. Maybe a very long time ago sock_close could be called on half-built sockets, but not any longer. All the error paths of socket creation use sock_release. At the very least, this thing should be reduced to a WARN_ON so we would at least have a backtrace if this ever triggered.
