by Arnout Vandecappelle, Mind
This article describes the control flow (and the associated data buffering) of the Linux networking kernel. The picture on the left gives an overview of the flow. Open it in a separate window and use it as a reference for the explanation below. This article is based on the 2.6.20 kernel. Please feel free to update it for newer kernels.
Another article gives a similar description based on a 2.4.20 kernel. Unfortunately, that one is not on a Wiki so it can't be updated…
Refer to Net:Network Overview for an overview of all aspects of the networking kernel: routing, neighbour discovery, NAPI, filtering, …
The network data (including headers etc.) is managed through the sk_buff data structure. This minimizes copying overhead when going through the networking layers. A basic understanding of sk_buff is required to understand the networking kernel.
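As a rough illustration of why this avoids copying, the following user-space sketch models only the headroom idea: payload is written once, and each layer prepends or strips its header by moving an offset. The names toy_skb, toy_skb_reserve() and so on are hypothetical. They loosely mirror the kernel helpers skb_reserve(), skb_put(), skb_push() and skb_pull(), but the real struct sk_buff tracks far more state than this.

```c
#include <stddef.h>
#include <string.h>

/* Toy user-space model of the sk_buff data area (hypothetical names).
 * Only the offsets needed to show headroom handling are kept. */
struct toy_skb {
    unsigned char buf[256]; /* backing storage ("data space") */
    size_t data;            /* offset of the first valid byte */
    size_t tail;            /* offset one past the last valid byte */
};

/* Leave room at the front for headers that lower layers will prepend. */
void toy_skb_reserve(struct toy_skb *skb, size_t headroom)
{
    skb->data = headroom;
    skb->tail = headroom;
}

/* Append payload at the tail; returns 0 on success, -1 if it does not fit. */
int toy_skb_put(struct toy_skb *skb, const void *src, size_t len)
{
    if (len > sizeof(skb->buf) - skb->tail)
        return -1;
    memcpy(skb->buf + skb->tail, src, len);
    skb->tail += len;
    return 0;
}

/* Prepend a header into the headroom; the payload itself is not moved. */
int toy_skb_push(struct toy_skb *skb, const void *hdr, size_t len)
{
    if (len > skb->data)
        return -1;
    skb->data -= len;
    memcpy(skb->buf + skb->data, hdr, len);
    return 0;
}

/* Strip a header on the receive path by advancing the data offset. */
int toy_skb_pull(struct toy_skb *skb, size_t len)
{
    if (len > skb->tail - skb->data)
        return -1;
    skb->data += len;
    return 0;
}

/* Number of valid bytes currently in the buffer. */
size_t toy_skb_len(const struct toy_skb *skb)
{
    return skb->tail - skb->data;
}
```

On the transmit side a caller would reserve headroom, put the payload, and let each layer push its header. On the receive side each layer pulls its header before handing the buffer upwards.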
The kernel as a whole makes heavy use of virtual methods. These are recorded as function pointers in data structures. In the figure these are indicated with diamonds. This article never shows all possible implementations for these virtual methods, just the main ones.
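The function-pointer pattern can be sketched in plain C as follows. This is a simplified user-space model, not kernel code: toy_sock, toy_proto_ops and both implementations are hypothetical names, and the 1472-byte limit in the datagram variant is an arbitrary value chosen for the example.

```c
#include <stddef.h>

struct toy_sock;

/* Table of "virtual methods": one function pointer per operation. */
struct toy_proto_ops {
    const char *name;
    int (*sendmsg)(struct toy_sock *sk, const char *buf, size_t len);
};

struct toy_sock {
    const struct toy_proto_ops *ops; /* chosen when the socket is created */
    size_t bytes_queued;
};

/* Stream-like implementation: accepts whatever it is given. */
int toy_stream_sendmsg(struct toy_sock *sk, const char *buf, size_t len)
{
    (void)buf;
    sk->bytes_queued += len;
    return (int)len;
}

/* Datagram-like implementation: rejects messages above an arbitrary limit. */
int toy_dgram_sendmsg(struct toy_sock *sk, const char *buf, size_t len)
{
    (void)buf;
    if (len > 1472)
        return -1;
    sk->bytes_queued += len;
    return (int)len;
}

const struct toy_proto_ops toy_stream_ops = { "stream", toy_stream_sendmsg };
const struct toy_proto_ops toy_dgram_ops  = { "dgram",  toy_dgram_sendmsg };

/* Generic layer: dispatches through the pointer without knowing
 * which implementation is behind it. */
int toy_sock_sendmsg(struct toy_sock *sk, const char *buf, size_t len)
{
    return sk->ops->sendmsg(sk, buf, len);
}
```

Each diamond in the figure corresponds to a dispatch of this kind: the caller only knows the pointer's signature, and the object being operated on decides which function actually runs.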
This article only discusses TCP over IPv4 over Ethernet connections. Of course, many combinations of the different networking layers are possible, as well as tunnelling, bridging, etc.
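====== Transmit flow ======
===== Layer 5: Session layer (sockets and files)=====
There are three system calls that can send data over the network:
  * write (memory data to a file descriptor)
  * sendto (memory data to a socket)
  * sendmsg (a composite message to a socket)
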
All of these eventually end up in sock_sendmsg(), which does security_sock_sendmsg() to check permissions and then forwards the message to the next layer using the socket's sendmsg virtual method.

===== Layer 4: Transport layer (TCP)=====
  - tcp_sendmsg: for each segment in the message
    - find an sk_buff with space available (use the one at the end if space left, otherwise allocate and append a new one)
    - copy data from user space to sk_buff data space (kernel space, probably DMA-able space) using skb_add_data().
      * The buffer space is pre-allocated for each socket. If the buffer runs out of space, communication stalls: the data remains in user space until buffer space becomes available again (or the call returns with an error immediately if it was non-blocking).
      * The size of allocated sk_buff space is equal to the MSS (Maximum Segment Size) + headroom (MSS may change during connection, and is modified by user options).
      * Segmentation (or coalescing of individual writes) happens at this level. Whatever ends up in the same sk_buff will become a single TCP segment. Still, the segments can be fragmented further at IP level.
  - The TCP queue is activated; packets are sent with tcp_transmit_skb() (called multiple times if there are more active buffers).
  - tcp_transmit_skb() builds the TCP header (the allocation of the sk_buff has left space for it). It clones the skb in order to pass control to the network layer. The network layer is called through the queue_xmit virtual function of the socket's address family (inet_connection_sock→icsk_af_ops).

===== Layer 3: Network layer (IPv4)=====
  - ip_queue_xmit() does routing (if necessary) and creates the IPv4 header.
    * nf_hook() is called in several places to perform network filtering (firewall, NAT, …). This hook may modify the datagram or discard it.
    * The routing decision results in a destination (dst_entry) object. This destination models the receiving IP address of the datagram. The dst_entry's output virtual method is called to perform actual output.
  - The sk_buff is passed on to ip_output() (or another output mechanism, e.g. in case of tunneling).
  - ip_output() does post-routing filtering, re-outputs it on a new destination if necessary due to netfiltering, fragments the datagram into packets if necessary, and finally sends it to the output device.
    * Fragmentation tries to reuse existing fragment buffers, if possible. This happens when forwarding an already fragmented incoming IP packet. The fragment buffers are special sk_buff objects, pointing in the same data space (no copy required).
    * If no fragment buffers are available, new sk_buff objects with new data space are allocated, and the data is copied.
    * Note that TCP already makes sure the packets are smaller than MTU, so normally fragmentation is not required.
  - Device-specific output is again through a virtual method call, to output of the dst_entry's neighbour data structure. This usually is dev_queue_xmit. There is some optimisation for packets with a known destination (hh_cache).

===== Layer 2: Link layer (e.g. Ethernet)=====
The main function of the kernel at the link layer is scheduling the packets to be sent out. For this purpose, Linux uses the queueing discipline (struct Qdisc) abstraction. For detailed information, see Chapter 9 (Queueing Disciplines for Bandwidth Management) of the Linux Advanced Routing & Traffic Control HOWTO and Documentation/networking/multiqueue.txt.

dev_queue_xmit puts the sk_buff on the device queue using the qdisc→enqueue virtual method.
  * If necessary (when the device doesn't support scattered data) the data is linearised into the sk_buff. This requires copying.
  * Devices which don't have a Qdisc (e.g. loopback) go directly to dev_hard_start_xmit().
  * Several Qdisc scheduling policies exist. The basic and most used one is pfifo_fast, which has three priorities.

The device output queue is immediately triggered with qdisc_run().
It calls qdisc_restart(), which takes an skb from the queue using the qdisc→dequeue virtual method. Specific queueing disciplines may delay sending by not returning any skb, and setting up a qdisc_watchdog_timer() instead. When the timer expires, netif_schedule() is called to start transmission.

Eventually, the sk_buff is sent with dev_hard_start_xmit() and removed from the Qdisc. If sending fails, the skb is re-queued. netif_schedule() is called to schedule a retry.

netif_schedule() raises a software interrupt, which causes net_tx_action() to be called when the NET_TX_SOFTIRQ is run by ksoftirqd. net_tx_action() calls qdisc_run() for each device with an active queue.

dev_hard_start_xmit() calls the hard_start_xmit virtual method for the net_device. But first, it calls dev_queue_xmit_nit(), which checks if a packet handler has been registered for the ETH_P_ALL protocol. This is used for tcpdump.

The device driver's hard_start_xmit function will generate one or more commands to the network device for scheduling transfer of the buffer. After a while, the network device replies that it's done. This triggers freeing of the sk_buff. If the sk_buff is freed from interrupt context, dev_kfree_skb_irq() is used. This delays the actual freeing until the next NET_TX_SOFTIRQ run, by putting the skb on the softnet_data completion_queue. This avoids doing frees from interrupt context.

====== Receive flow ======
===== Layer 2: Link layer (e.g. Ethernet)=====
The network device pre-allocates a number of sk_buffs for reception. How many is configured per device. Usually, the addresses to the data space in these sk_buffs are configured directly as DMA area for the device.

The device interrupt handler takes the sk_buff and performs reception handling on it. Before NAPI, this was done using netif_rx(). In NAPI, it is done in two phases.
  - From the interrupt handler, the device driver just calls netif_rx_schedule() and returns from interrupt. netif_rx_schedule() adds the device to softnet_data's poll_list and raises the NET_RX_SOFTIRQ software interrupt.
  - ksoftirqd runs net_rx_action(), which calls the device's poll virtual method. The poll method does device-specific buffer management, calls netif_receive_skb() for each sk_buff, allocates new sk_buffs as required, and terminates with netif_rx_complete().

netif_receive_skb() finds out how to pass the sk_buff to upper layers.
  - netpoll_rx() is called, to support the Netpoll API
  - Call packet handlers for ETH_P_ALL protocol (for tcpdump)
  - Call handle_ing() for ingress queueing
  - Call handle_bridge() for bridging
  - Call handle_macvlan() for virtual LAN
  - Call the packet handler registered for the L3 protocol specified by the packet.

The packet handlers are called with the deliver_skb() function, which calls the protocol's func virtual method to handle the packet.

===== Layer 3: Network layer (IPv4, ARP)=====
==== ARP====
ARP packets are handled with arp_rcv(). It processes the ARP information, stores it in the neighbour cache, and sends a reply if required. In the latter case, a new sk_buff (with new data space) is allocated for the reply.

==== IPv4====
IPv4 packets are handled with ip_rcv(). It parses headers, checks for validity, and sends an ICMP reply or error message if required. It also looks up the destination address using ip_route_input(). The destination's input virtual method is called with the sk_buff.
  * ip_mr_input() is called for multicast addresses. The packet may be forwarded using ip_mr_forward(), and it may be delivered locally using ip_local_deliver().
  * ip_forward() is called for packets with a different destination for which we have a route. It directly calls the neighbour's output virtual method.
  * ip_local_deliver() is called if this machine is the destination of the packet. Datagram fragments are collected here.

ip_local_deliver() delivers to any raw sockets for this connection first, using raw_local_deliver().
Then, it calls the L4 protocol handler for the protocol specified in the datagram. The L4 protocol is called even if a raw socket exists.

Throughout, xfrm4_policy_check calls are included to support IPSec.

===== Layer 4: Transport layer (TCP)=====
The net_protocol handler for TCP is tcp_v4_rcv(). Most of the code here deals with the protocol processing in TCP, for setting up connections, performing flow control, etc.

A received TCP packet may include an acknowledgement of a previously sent packet, which may trigger further sending of packets (tcp_data_snd_check()) or of acknowledgements (tcp_ack_snd_check()).

Passing the incoming packet to an upper layer is done in tcp_rcv_established() and tcp_data_queue(). These functions maintain the TCP connection's out_of_order_queue, and the socket's sk_receive_queue and sk_async_wait_queue. If a user process is already waiting for data to arrive, the data is immediately copied to user space using skb_copy_datagram_iovec(). Otherwise, the sk_buff is appended to one of the socket's queues and will be copied later.

Finally, the receive functions call the socket's sk_data_ready virtual method to signal that data is available. This wakes up waiting processes.

===== Layer 5: Session layer (sockets and files)=====
There are three system calls that can receive data from the network:
  * read (memory data from a file descriptor)
  * recvfrom (memory data from a socket)
  * recvmsg (a composite message from a socket)

All of these eventually end up in __sock_recvmsg(), which does security_sock_recvmsg() to check permissions and then requests the message from the next layer using the socket's recvmsg virtual method. This is often sock_common_recvmsg(), which calls the recvmsg virtual method of the socket's protocol.

tcp_recvmsg() either copies data from the socket's queue using skb_copy_datagram_iovec(), or waits for data to arrive using sk_wait_data(). The latter blocks and is woken up by the layer 4 processing.
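
From user space, the three receive calls listed above look as follows. This is a minimal sketch under one loud assumption: to stay self-contained it uses a local AF_UNIX stream socket pair instead of a TCP connection, so the kernel path below the socket layer is not the TCP one described in this article. Only the system call entry points are the same. The function name demo_receive_calls() is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Writes nine bytes into one end of a local stream socket pair and reads
 * them back in three chunks with read(), recvfrom() and recvmsg().
 * Returns the total number of bytes received, or -1 on any error.
 * ASSUMPTION: AF_UNIX is used instead of TCP to avoid needing a peer. */
int demo_receive_calls(char out[10])
{
    int sv[2];
    int total = -1;
    ssize_t n1, n2, n3;
    struct iovec iov[2];
    struct msghdr msg;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    if (write(sv[0], "abcdefghi", 9) != 9)
        goto done;

    /* 1. read(): memory data from a file descriptor */
    n1 = read(sv[1], out, 3);

    /* 2. recvfrom(): memory data from a socket (peer address not requested) */
    n2 = recvfrom(sv[1], out + 3, 3, 0, NULL, NULL);

    /* 3. recvmsg(): a composite message, scattered over two buffers */
    iov[0].iov_base = out + 6;
    iov[0].iov_len = 1;
    iov[1].iov_base = out + 7;
    iov[1].iov_len = 2;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    n3 = recvmsg(sv[1], &msg, 0);

    if (n1 == 3 && n2 == 3 && n3 == 3) {
        out[9] = '\0';
        total = (int)(n1 + n2 + n3);
    }

done:
    close(sv[0]);
    close(sv[1]);
    return total;
}
```

Because the data is already queued on the socket when the receive calls run, none of them has to block. With an empty queue, a blocking socket would sleep in the kernel until data arrives.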