This shows you the differences between two versions of the page.
networking:kernel_flow [2016/10/05 12:38] dcarrel Another escaping %%//%% which was bloating the formatting. |
networking:kernel_flow [2022/10/15 00:37] (current) q2ven [Layer 2: Link layer (e.g. Ethernet)] |
||
---|---|---|---|
Line 9: | Line 9: | ||
=====Contents===== | =====Contents===== | ||
- | * [[https://www.linuxfoundation.org/#Preliminaries|1 Preliminaries]] | + | * [[#preliminaries|1 Preliminaries]] |
- | * [[https://www.linuxfoundation.org/#Transmission_path|2 Transmission path]] | + | * [[#transmission-path|2 Transmission path]] |
- | * [[https://www.linuxfoundation.org/#Layer_5:_Session_layer_.28sockets_and_files.29|2.1 Layer 5: Session layer (sockets and files)]] | + | * [[#layer-5session-layer-sockets-and-files|2.1 Layer 5: Session layer (sockets and files)]] |
- | * [[https://www.linuxfoundation.org/#Layer_4:_Transport_layer_.28TCP.29|2.2 Layer 4: Transport layer (TCP)]] | + | * [[#layer-4transport-layer-tcp|2.2 Layer 4: Transport layer (TCP)]] |
- | * [[https://www.linuxfoundation.org/#Layer_3:_Network_layer_.28IPv4.29|2.3 Layer 3: Network layer (IPv4)]] | + | * [[#layer-3network-layer-ipv4|2.3 Layer 3: Network layer (IPv4)]] |
- | * [[https://www.linuxfoundation.org/#Layer_2:_Link_layer_.28e.g._Ethernet.29|2.4 Layer 2: Link layer (e.g. Ethernet)]] | + | * [[#layer-2link-layer-eg-ethernet|2.4 Layer 2: Link layer (eg Ethernet)]] |
- | * [[https://www.linuxfoundation.org/#Receive_flow|3 Receive flow]] | + | * [[#receive-flow|3 Receive flow]] |
- | * [[https://www.linuxfoundation.org/#Layer_2:_Link_layer_.28e.g._Ethernet.29_2|3.1 Layer 2: Link layer (e.g. Ethernet)]] | + | * [[#layer-2link-layer-eg-ethernet1|3.1 Layer 2: link layer (eg ethernet)]] |
- | * [[https://www.linuxfoundation.org/#Layer_3:_Network_layer_.28IPv4.2C_ARP.29|3.2 Layer 3: Network layer (IPv4, ARP)]] | + | * [[#layer-3network-layer-ipv4-arp|3.2 Layer 3: Network layer (IPv4, ARP)]] |
- | * [[https://www.linuxfoundation.org/#ARP|3.2.1 ARP]] | + | * [[#arp|3.2.1 ARP]] |
- | * [[https://www.linuxfoundation.org/#IPv4|3.2.2 IPv4]] | + | * [[#ipv4|3.2.2 IPv4]] |
- | * [[https://www.linuxfoundation.org/#Layer_4:_Transport_layer_.28TCP.29_2|3.3 Layer 4: Transport layer (TCP)]] | + | * [[#layer-4transport-layer-tcp1|3.3 Layer 4: Transport layer (TCP)]] |
- | * [[https://www.linuxfoundation.org/#Layer_5:_Session_layer_.28sockets_and_files.29_2|3.4 Layer 5: Session layer (sockets and files)]] | + | * [[#layer-5session-layer-sockets-and-files1|3.4 Layer 5: Session layer (sockets and files)]] |
====== Preliminaries====== | ====== Preliminaries====== | ||
Line 45: | Line 46: | ||
===== Layer 4: Transport layer (TCP)===== | ===== Layer 4: Transport layer (TCP)===== | ||
- | [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/tcp.c#L661|tcp_sendmsg]]: for each segment in the message | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/tcp.c#L661|tcp_sendmsg]]: for each segment in the message |
- find an sk_buff with space available (use the one at the end if space left, otherwise allocate and append a new one) | - find an sk_buff with space available (use the one at the end if space left, otherwise allocate and append a new one) | ||
Line 52: | Line 53: | ||
* The size of allocated sk_buff space is equal to the MSS (Maximum Segment Size) + headroom (MSS may change during connection, and is modified by user options). | * The size of allocated sk_buff space is equal to the MSS (Maximum Segment Size) + headroom (MSS may change during connection, and is modified by user options). | ||
* Segmentation (or coalescing of individual writes) happens at this level. Whatever ends up in the same sk_buff will become a single TCP segment. Still, the segments can be fragmented further at IP level. | * Segmentation (or coalescing of individual writes) happens at this level. Whatever ends up in the same sk_buff will become a single TCP segment. Still, the segments can be fragmented further at IP level. | ||
- | - The TCP queue is activated; packets are sent with [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/tcp_output.c#L389|tcp_transmit_skb()]] (called multiple times if there are more active buffers). | + | - The TCP queue is activated; packets are sent with [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/tcp_output.c#L389|tcp_transmit_skb()]] (called multiple times if there are more active buffers). |
- | - [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/tcp_output.c#L389|tcp_transmit_skb()]] builds the TCP header (the allocation of the sk_buff has left space for it). It clones the skb in order to pass control to the network layer. The network layer is called through the queue_xmit virtual function of the socket's address family (inet_connection_sock->icsk_af_ops). | + | - [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/tcp_output.c#L389|tcp_transmit_skb()]] builds the TCP header (the allocation of the sk_buff has left space for it). It clones the skb in order to pass control to the network layer. The network layer is called through the queue_xmit virtual function of the socket's address family (inet_connection_sock->icsk_af_ops). |
===== Layer 3: Network layer (IPv4)===== | ===== Layer 3: Network layer (IPv4)===== | ||
- | - [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L284|ip_queue_xmit()]] does routing (if necessary), creates the IPv4 header | + | - [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L284|ip_queue_xmit()]] does routing (if necessary), creates the IPv4 header |
- nf_hook() is called in several places to perform network filtering (firewall, NAT, ...). This hook may modify the datagram or discard it. | - nf_hook() is called in several places to perform network filtering (firewall, NAT, ...). This hook may modify the datagram or discard it. | ||
- The routing decision results in a destination (dst_entry) object. This destination models the receiving IP address of the datagram. The dst_entry's output virtual method is called to perform actual output. | - The routing decision results in a destination (dst_entry) object. This destination models the receiving IP address of the datagram. The dst_entry's output virtual method is called to perform actual output. | ||
- | - The sk_buff is passed on to [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L270|ip_output()]] (or another output mechanism, e.g. in case of tunneling). | + | - The sk_buff is passed on to [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L270|ip_output()]] (or another output mechanism, e.g. in case of tunneling). |
- | - [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L270|ip_output()]] does post-routing filtering, [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L202|re-outputs it on a new destination if necessary due to netfiltering]], [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L415|fragments the datagram into packets if necessary]], and finally [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L164|sends it to the output device]]. | + | - [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L270|ip_output()]] does post-routing filtering, [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L202|re-outputs it on a new destination if necessary due to netfiltering]], [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L415|fragments the datagram into packets if necessary]], and finally [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L164|sends it to the output device]]. |
* Fragmentation tries to reuse existing fragment buffers, if possible. This happens when forwarding an already fragmented incoming IP packet. The fragment buffers are special sk_buff objects, pointing in the same data space (no copy required). | * Fragmentation tries to reuse existing fragment buffers, if possible. This happens when forwarding an already fragmented incoming IP packet. The fragment buffers are special sk_buff objects, pointing in the same data space (no copy required). | ||
* If no fragment buffers are available, new sk_buff objects with new data space are allocated, and the data is copied. | * If no fragment buffers are available, new sk_buff objects with new data space are allocated, and the data is copied. | ||
* Note that TCP already makes sure the packets are smaller than MTU, so normally fragmentation is not required. | * Note that TCP already makes sure the packets are smaller than MTU, so normally fragmentation is not required. | ||
- | - Device-specific output is again through a virtual method call, to output of the dst_entry's neighbour data structure. This usually is [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1421|dev_queue_xmit]]. There is some optimisation for packets with a known destination (hh_cache). | + | - Device-specific output is again through a virtual method call, to output of the dst_entry's neighbour data structure. This usually is [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1421|dev_queue_xmit]]. There is some optimisation for packets with a known destination (hh_cache). |
===== Layer 2: Link layer (e.g. Ethernet)===== | ===== Layer 2: Link layer (e.g. Ethernet)===== | ||
Line 71: | Line 72: | ||
The main function of the kernel at the link layer is scheduling the packets to be sent out. For this purpose, Linux uses the queueing discipline (struct Qdisc) abstraction. For detailed information, see [[http://lartc.org/howto/lartc.qdisc.html|Chapter 9 (Queueing Disciplines for Bandwidth Management)]] of [[http://lartc.org/howto/index.html|the Linux Advanced Routing & Traffic Control HOWTO]] and Documentation%%//%%networking/multiqueue.txt. | The main function of the kernel at the link layer is scheduling the packets to be sent out. For this purpose, Linux uses the queueing discipline (struct Qdisc) abstraction. For detailed information, see [[http://lartc.org/howto/lartc.qdisc.html|Chapter 9 (Queueing Disciplines for Bandwidth Management)]] of [[http://lartc.org/howto/index.html|the Linux Advanced Routing & Traffic Control HOWTO]] and Documentation%%//%%networking/multiqueue.txt. | ||
- | [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1421|dev_queue_xmit]] puts the sk_buff on the device queue using the qdisc->enqueue virtual method. | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1421|dev_queue_xmit]] puts the sk_buff on the device queue using the qdisc->enqueue virtual method. |
* If necessary (when the device doesn't support scattered data) the data is linearised into the sk_buff. This requires copying. | * If necessary (when the device doesn't support scattered data) the data is linearised into the sk_buff. This requires copying. | ||
- | * Devices which don't have a Qdisc (e.g. loopback) go directly to [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1343|dev_hard_start_xmit()]]. | + | * Devices which don't have a Qdisc (e.g. loopback) go directly to [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1343|dev_hard_start_xmit()]]. |
* Several Qdisc scheduling policies exist. The basic and most used one is pfifo_fast, which has three priorities. | * Several Qdisc scheduling policies exist. The basic and most used one is pfifo_fast, which has three priorities. | ||
- | The device output queue is immediately triggered with [[http://lxr.linux.no/linux+v2.6.20/include/net/pkt_sched.h#L223|qdisc_run()]]. It calls [[http://lxr.linux.no/linux+v2.6.20/net/sched/sch_generic.c#L91|qdisc_restart()]], which takes an skb from the queue using the qdisc->dequeue virtual method. Specific queueing disciplines may delay sending by not returning any skb, and setting up a qdisc_watchdog_timer() instead. When the timer expires, [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L635|netif_schedule()]] is called to start transmission. | + | The device output queue is immediately triggered with [[https://elixir.bootlin.com/linux/v2.6.20/source/include/net/pkt_sched.h#L223|qdisc_run()]]. It calls [[https://elixir.bootlin.com/linux/v2.6.20/source/net/sched/sch_generic.c#L91|qdisc_restart()]], which takes an skb from the queue using the qdisc->dequeue virtual method. Specific queueing disciplines may delay sending by not returning any skb, and setting up a qdisc_watchdog_timer() instead. When the timer expires, [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L635|netif_schedule()]] is called to start transmission. |
- | Eventually, the sk_buff is sent with [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1343|dev_hard_start_xmit()]] and removed from the Qdisc. If sending fails, the skb is re-queued. [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L635|netif_schedule()]] is called to schedule a retry. | + | Eventually, the sk_buff is sent with [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1343|dev_hard_start_xmit()]] and removed from the Qdisc. If sending fails, the skb is re-queued. [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L635|netif_schedule()]] is called to schedule a retry. |
- | [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L635|netif_schedule()]] raises a software interrupt, which causes [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1644|net_tx_action()]] to be called when the NET_TX_SOFTIRQ is run by ksoftirqd. net_tx_action() calls qdisc_run() for each device with an active queue. | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L635|netif_schedule()]] raises a software interrupt, which causes [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1644|net_tx_action()]] to be called when the NET_TX_SOFTIRQ is run by ksoftirqd. net_tx_action() calls qdisc_run() for each device with an active queue. |
- | [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1343|dev_hard_start_xmit()]] calls the hard_start_xmit virtual method for the net_device. But first, it calls dev_queue_xmit_nit(), which checks if a packet handler has been registered for the ETH_P_ALL protocol. This is used for tcpdump. | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1343|dev_hard_start_xmit()]] calls the hard_start_xmit virtual method for the net_device. But first, it calls dev_queue_xmit_nit(), which checks if a packet handler has been registered for the ETH_P_ALL protocol. This is used for tcpdump. |
- | The device driver's hard_start_xmit function will generate one or more commands to the network device for scheduling transfer of the buffer. After a while, the network device replies that it's done. This triggers freeing of the sk_buff. If the sk_buff is freed from interrupt context, [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L679|dev_kfree_skb_irq()]] is used. This delays the actual freeing until the next NET_TX_SOFTIRQ run, by putting the skb on the softnet_data completion_queue. This avoids doing frees from interrupt context. | + | The device driver's hard_start_xmit function will generate one or more commands to the network device for scheduling transfer of the buffer. After a while, the network device replies that it's done. This triggers freeing of the sk_buff. If the sk_buff is freed from interrupt context, [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L679|dev_kfree_skb_irq()]] is used. This delays the actual freeing until the next NET_TX_SOFTIRQ run, by putting the skb on the softnet_data completion_queue. This avoids doing frees from interrupt context. |
====== Receive flow ====== | ====== Receive flow ====== | ||
Line 93: | Line 94: | ||
The network device pre-allocates a number of sk_buffs for reception; how many is configured per device. Usually, the addresses of the data space in these sk_buffs are configured directly as the DMA area for the device. The device interrupt handler takes the sk_buff and performs reception handling on it. Before NAPI, this was done using netif_rx(). In NAPI, it is done in two phases. | The network device pre-allocates a number of sk_buffs for reception; how many is configured per device. Usually, the addresses of the data space in these sk_buffs are configured directly as the DMA area for the device. The device interrupt handler takes the sk_buff and performs reception handling on it. Before NAPI, this was done using netif_rx(). In NAPI, it is done in two phases. | ||
- | - From the interrupt handler, the device driver just calls [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L858|netif_rx_schedule()]] and returns from interrupt. netif_rx_schedule() adds the device to sofnet_data's poll_list and raises the NET_RX_SOFTIRQ software interrupt. | + | - From the interrupt handler, the device driver just calls [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L858|netif_rx_schedule()]] and returns from interrupt. netif_rx_schedule() adds the device to softnet_data's poll_list and raises the NET_RX_SOFTIRQ software interrupt. |
- | - ksoftirqd runs [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1904|net_rx_action()]], which calls the device's poll virtual method. The poll method does device-specific buffer management, calls netif_receive_skb() for each sk_buff, allocates new sk_buffs as required, and terminates with netif_rx_complete(). | + | - ksoftirqd runs [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1904|net_rx_action()]], which calls the device's poll virtual method. The poll method does device-specific buffer management, calls netif_receive_skb() for each sk_buff, allocates new sk_buffs as required, and terminates with netif_rx_complete(). |
- | [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1764|netif_receive_skb()]] finds out how to pass the sk_buff to upper layers. | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1764|netif_receive_skb()]] finds out how to pass the sk_buff to upper layers. |
- | - [[http://lxr.linux.no/linux+v2.6.20/include/linux/netpoll.h#L49|netpoll_rx()]] is called, to support the [[http://people.redhat.com/~jmoyer/netpoll-linux_kongress-2005.pdf|Netpoll API]] | + | - [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netpoll.h#L49|netpoll_rx()]] is called, to support the [[http://people.redhat.com/~jmoyer/netpoll-linux_kongress-2005.pdf|Netpoll API]] |
- Call packet handlers for ETH_P_ALL protocol (for tcpdump) | - Call packet handlers for ETH_P_ALL protocol (for tcpdump) | ||
- Call handle_ing() for ingress queueing | - Call handle_ing() for ingress queueing | ||
Line 105: | Line 106: | ||
- Call the packet handler registered for the L3 protocol specified by the packet. | - Call the packet handler registered for the L3 protocol specified by the packet. | ||
- | The packet handlers are called with the [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1690|deliver_skb()]] function, which calls the protocol's func virtual method to handle the packet. | + | The packet handlers are called with the [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1690|deliver_skb()]] function, which calls the protocol's func virtual method to handle the packet. |
===== Layer 3: Network layer (IPv4, ARP)===== | ===== Layer 3: Network layer (IPv4, ARP)===== | ||
==== ARP==== | ==== ARP==== | ||
- | ARP packets are handled with [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/arp.c#L930|arp_rcv()]]. It processes the ARP information, stores it in the neighbour cache, and sends a reply if required. In the latter case, a new sk_buff (with new data space) is allocated for the reply. | + | ARP packets are handled with [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/arp.c#L930|arp_rcv()]]. It processes the ARP information, stores it in the neighbour cache, and sends a reply if required. In the latter case, a new sk_buff (with new data space) is allocated for the reply. |
==== IPv4==== | ==== IPv4==== | ||
- | IPv4 packets are handled with [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_input.c#L373|ip_rcv()]]. It parses headers, checks for validity, sends an ICMP reply or error message if required. It also looks up the destination address using [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/route.c#L2090|ip_route_input()]]. The destination's input virtual method is called with the sk_buff. | + | IPv4 packets are handled with [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_input.c#L373|ip_rcv()]]. It parses headers, checks for validity, sends an ICMP reply or error message if required. It also looks up the destination address using [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/route.c#L2090|ip_route_input()]]. The destination's input virtual method is called with the sk_buff. |
- | * [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ipmr.c#L1338|ip_mr_input()]] is called for multicast addresses. The packet may be forwarded using ip_mr_forward(), and it may be delivered locally using ip_local_deliver(). | + | * [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ipmr.c#L1338|ip_mr_input()]] is called for multicast addresses. The packet may be forwarded using ip_mr_forward(), and it may be delivered locally using ip_local_deliver(). |
- | * [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_forward.c#L56|ip_forward()]] is called for packets with a different destination for which we have a route. It directly calls the neighbour's output virtual method. | + | * [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_forward.c#L56|ip_forward()]] is called for packets with a different destination for which we have a route. It directly calls the neighbour's output virtual method. |
- | * [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_input.c#L263|ip_local_deliver()]] is called if this machine is the destination of the packet. Datagram fragments are collected here. | + | * [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_input.c#L263|ip_local_deliver()]] is called if this machine is the destination of the packet. Datagram fragments are collected here. |
ip_local_deliver() delivers to any raw sockets for this connection first, using raw_local_deliver(). Then, it calls the L4 protocol handler for the protocol specified in the datagram. The L4 protocol is called even if a raw socket exists. | ip_local_deliver() delivers to any raw sockets for this connection first, using raw_local_deliver(). Then, it calls the L4 protocol handler for the protocol specified in the datagram. The L4 protocol is called even if a raw socket exists. | ||
Line 123: | Line 124: | ||
===== Layer 4: Transport layer (TCP)===== | ===== Layer 4: Transport layer (TCP)===== | ||
- | The net_protocol handler for TCP is [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/tcp_ipv4.c#L1611|tcp_v4_rcv()]]. Most of the code here deals with the protocol processing in TCP, for setting up connections, performing flow control, etc. | + | The net_protocol handler for TCP is [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/tcp_ipv4.c#L1611|tcp_v4_rcv()]]. Most of the code here deals with the protocol processing in TCP, for setting up connections, performing flow control, etc. |
A received TCP packet may include an acknowledgement of a previously sent packet, which may trigger further sending of packets (tcp_data_snd_check()) or of acknowledgements (tcp_ack_snd_check()). | A received TCP packet may include an acknowledgement of a previously sent packet, which may trigger further sending of packets (tcp_data_snd_check()) or of acknowledgements (tcp_ack_snd_check()). |
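
The receive-side dispatch described above (taps registered for ETH_P_ALL, a packet handler selected by L3 protocol, raw sockets served before the L4 handler) can be modelled in a few lines of userspace Python. This is only an illustrative sketch, not kernel code. The registries, the `register_packet_handler` helper and the dict-based packet are assumptions made for the example; the function names `netif_receive_skb`, `ip_rcv`, `ip_local_deliver` and `tcp_v4_rcv` are borrowed from the text only to show which step each function models.

```python
# Userspace model of the receive-path dispatch; illustrative only, not kernel code.
ETH_P_ALL = 0x0003   # pseudo-protocol that matches every frame (packet taps)
ETH_P_IP = 0x0800    # EtherType for IPv4
IPPROTO_TCP = 6      # IP protocol number for TCP

trace = []           # order in which handlers saw the packet

# Hypothetical registries standing in for the kernel's handler tables.
tap_handlers = []    # handlers registered for ETH_P_ALL
l3_handlers = {}     # EtherType -> L3 packet handler
l4_handlers = {}     # IP protocol number -> L4 protocol handler
raw_sockets = {}     # IP protocol number -> list of raw-socket callbacks


def register_packet_handler(ethertype, func):
    """Register an L3 handler, or a tap when ethertype is ETH_P_ALL."""
    if ethertype == ETH_P_ALL:
        tap_handlers.append(func)
    else:
        l3_handlers[ethertype] = func


def netif_receive_skb(skb):
    """Hand the packet to every tap, then to the matching L3 handler."""
    for func in tap_handlers:
        func(skb)
    handler = l3_handlers.get(skb["ethertype"])
    if handler is not None:
        handler(skb)


def ip_rcv(skb):
    """Model of the IPv4 handler; assumes this host is the destination."""
    trace.append("ip_rcv")
    ip_local_deliver(skb)


def ip_local_deliver(skb):
    """Deliver to raw sockets first, then to the L4 handler if one exists."""
    for callback in raw_sockets.get(skb["protocol"], []):
        callback(skb)
    handler = l4_handlers.get(skb["protocol"])
    if handler is not None:
        handler(skb)


def tcp_v4_rcv(skb):
    """Stand-in for TCP protocol processing; only records the call."""
    trace.append("tcp_v4_rcv")


# Wire up one tap, the IPv4 handler, the TCP handler and one raw socket.
register_packet_handler(ETH_P_ALL, lambda skb: trace.append("tap"))
register_packet_handler(ETH_P_IP, ip_rcv)
l4_handlers[IPPROTO_TCP] = tcp_v4_rcv
raw_sockets[IPPROTO_TCP] = [lambda skb: trace.append("raw_socket")]

# Receive one IPv4/TCP packet and show the order of the calls.
netif_receive_skb({"ethertype": ETH_P_IP, "protocol": IPPROTO_TCP})
print(trace)
```

The table lookups play the role of the "virtual method" calls in the text: each layer finds the next handler by a key carried in the packet instead of calling a fixed function. A frame with an EtherType that has no registered handler is still seen by the tap, but goes no further.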