This shows you the differences between two versions of the page.
networking:kernel_flow [2020/09/13 16:07] foxhlchen replace outdated linux.no with bootlin |
networking:kernel_flow [2022/10/15 00:37] (current) q2ven [Layer 2: Link layer (e.g. Ethernet)] |
||
---|---|---|---|
Line 42: | Line 42: | ||
* sendmsg (a composite message to a socket) | * sendmsg (a composite message to a socket) | ||
- | All of these eventually end up in sock_sendmsg(), which does security_sock_sendmsg() to check permissions and then forwards the message to the next layer using the socket's sendmsg virtual method. | + | All of these eventually end up in %%__%%sock_sendmsg(), which does security_sock_sendmsg() to check permissions and then forwards the message to the next layer using the socket's sendmsg virtual method. |
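The check-then-dispatch pattern described above can be pictured with a toy Python model. This is not kernel code: the `Socket` class, the `allowed` flag and the `tcp_sendmsg` stand-in are assumptions made for illustration only, and the permission rule is invented.

```python
import errno

# Toy model of the socket-layer send path: a permission check followed by
# dispatch through a per-socket "virtual method". All names are illustrative.


class Socket:
    def __init__(self, sendmsg_op, allowed=True):
        self.sendmsg_op = sendmsg_op  # stands in for the socket's sendmsg method
        self.allowed = allowed        # stands in for the security policy outcome


def security_check(sock):
    # Hypothetical stand-in for security_sock_sendmsg(): 0 means "permitted".
    return 0 if sock.allowed else -errno.EACCES


def sock_sendmsg(sock, data):
    err = security_check(sock)
    if err:
        return err                       # refused before reaching the protocol
    return sock.sendmsg_op(sock, data)   # forward to the next layer


def tcp_sendmsg(sock, data):
    # Hypothetical protocol handler: report how many bytes were accepted.
    return len(data)
```

The point of the indirection is that the socket layer never needs to know which protocol sits below it; it only calls whatever method the socket carries.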
===== Layer 4: Transport layer (TCP)===== | ===== Layer 4: Transport layer (TCP)===== | ||
Line 58: | Line 58: | ||
===== Layer 3: Network layer (IPv4)===== | ===== Layer 3: Network layer (IPv4)===== | ||
- | - [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L284|ip_queue_xmit()]] does routing (if necessary), creates the IPv4 header | + | - [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L284|ip_queue_xmit()]] does routing (if necessary), creates the IPv4 header |
- nf_hook() is called in several places to perform network filtering (firewall, NAT, ...). This hook may modify the datagram or discard it. | - nf_hook() is called in several places to perform network filtering (firewall, NAT, ...). This hook may modify the datagram or discard it. | ||
- The routing decision results in a destination (dst_entry) object. This destination models the receiving IP address of the datagram. The dst_entry's output virtual method is called to perform actual output. | - The routing decision results in a destination (dst_entry) object. This destination models the receiving IP address of the datagram. The dst_entry's output virtual method is called to perform actual output. | ||
- | - The sk_buff is passed on to [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L270|ip_output()]] (or another output mechanism, e.g. in case of tunneling). | + | - The sk_buff is passed on to [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L270|ip_output()]] (or another output mechanism, e.g. in case of tunneling). |
- | - [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L270|ip_output()]] does post-routing filtering, [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L202|re-outputs it on a new destination if necessary due to netfiltering]], [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L415|fragments the datagram into packets if necessary]], and finally [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_output.c#L164|sends it to the output device]]. | + | - [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L270|ip_output()]] does post-routing filtering, [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L202|re-outputs it on a new destination if necessary due to netfiltering]], [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L415|fragments the datagram into packets if necessary]], and finally [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_output.c#L164|sends it to the output device]]. |
* Fragmentation tries to reuse existing fragment buffers, if possible. This happens when forwarding an already fragmented incoming IP packet. The fragment buffers are special sk_buff objects, pointing in the same data space (no copy required). | * Fragmentation tries to reuse existing fragment buffers, if possible. This happens when forwarding an already fragmented incoming IP packet. The fragment buffers are special sk_buff objects, pointing in the same data space (no copy required). | ||
* If no fragment buffers are available, new sk_buff objects with new data space are allocated, and the data is copied. | * If no fragment buffers are available, new sk_buff objects with new data space are allocated, and the data is copied. | ||
* Note that TCP already makes sure the packets are smaller than MTU, so normally fragmentation is not required. | * Note that TCP already makes sure the packets are smaller than MTU, so normally fragmentation is not required. | ||
- | - Device-specific output is again through a virtual method call, to output of the dst_entry's neighbour data structure. This usually is [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1421|dev_queue_xmit]]. There is some optimisation for packets with a known destination (hh_cache). | + | - Device-specific output is again through a virtual method call, to output of the dst_entry's neighbour data structure. This usually is [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1421|dev_queue_xmit]]. There is some optimisation for packets with a known destination (hh_cache). |
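As a rough illustration of the fragmentation step in the list above, the following Python sketch computes how a datagram payload would be split for a given MTU. It only models the arithmetic (the IPv4 fragment offset field counts 8-byte units, so every non-final fragment carries a multiple of 8 bytes); the function name `fragment` and its parameters are assumptions for this example and do not correspond to the kernel's buffer handling.

```python
def fragment(payload_len, mtu, ihl=20):
    """Return a list of (offset_in_8_byte_units, length, more_fragments)."""
    # Space left for data after the IP header, rounded down to a multiple of 8.
    max_data = (mtu - ihl) & ~7
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    frags = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        more = offset + length < payload_len   # the "more fragments" flag
        frags.append((offset // 8, length, more))
        offset += length
    return frags
```

A payload that already fits in one packet yields a single entry with the flag cleared, which matches the note that TCP traffic normally does not need fragmentation.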
===== Layer 2: Link layer (e.g. Ethernet)===== | ===== Layer 2: Link layer (e.g. Ethernet)===== | ||
Line 72: | Line 72: | ||
The main function of the kernel at the link layer is scheduling the packets to be sent out. For this purpose, Linux uses the queueing discipline (struct Qdisc) abstraction. For detailed information, see [[http://lartc.org/howto/lartc.qdisc.html|Chapter 9 (Queueing Disciplines for Bandwidth Management)]] of [[http://lartc.org/howto/index.html|the Linux Advanced Routing & Traffic Control HOWTO]] and Documentation%%//%%networking/multiqueue.txt. | The main function of the kernel at the link layer is scheduling the packets to be sent out. For this purpose, Linux uses the queueing discipline (struct Qdisc) abstraction. For detailed information, see [[http://lartc.org/howto/lartc.qdisc.html|Chapter 9 (Queueing Disciplines for Bandwidth Management)]] of [[http://lartc.org/howto/index.html|the Linux Advanced Routing & Traffic Control HOWTO]] and Documentation%%//%%networking/multiqueue.txt. | ||
- | [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1421|dev_queue_xmit]] puts the sk_buff on the device queue using the qdisc->enqueue virtual method. | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1421|dev_queue_xmit]] puts the sk_buff on the device queue using the qdisc->enqueue virtual method. |
* If necessary (when the device doesn't support scattered data) the data is linearised into the sk_buff. This requires copying. | * If necessary (when the device doesn't support scattered data) the data is linearised into the sk_buff. This requires copying. | ||
- | * Devices which don't have a Qdisc (e.g. loopback) go directly to [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1343|dev_hard_start_xmit()]]. | + | * Devices which don't have a Qdisc (e.g. loopback) go directly to [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1343|dev_hard_start_xmit()]]. |
* Several Qdisc scheduling policies exist. The basic and most used one is pfifo_fast, which has three priorities. | * Several Qdisc scheduling policies exist. The basic and most used one is pfifo_fast, which has three priorities. | ||
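A three-band priority FIFO in the spirit of pfifo_fast can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: the class name `PfifoFast` is illustrative, and the caller passes the band directly, whereas the real qdisc derives the band from the packet's priority.

```python
from collections import deque


class PfifoFast:
    BANDS = 3

    def __init__(self):
        # One FIFO per band; band 0 is the most urgent.
        self.bands = [deque() for _ in range(self.BANDS)]

    def enqueue(self, skb, band):
        self.bands[band].append(skb)

    def dequeue(self):
        # Always serve the lowest-numbered non-empty band first.
        for queue in self.bands:
            if queue:
                return queue.popleft()
        return None  # nothing to send
```

Within a band packets leave in arrival order; a higher band is only served once every lower-numbered band is empty.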
- | The device output queue is immediately triggered with [[http://lxr.linux.no/linux+v2.6.20/include/net/pkt_sched.h#L223|qdisc_run()]]. It calls [[http://lxr.linux.no/linux+v2.6.20/net/sched/sch_generic.c#L91|qdisc_restart()]], which takes an skb from the queue using the qdisc->dequeue virtual method. Specific queueing disciplines may delay sending by not returning any skb, and setting up a qdisc_watchdog_timer() instead. When the timer expires, [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L635|netif_schedule()]] is called to start transmission. | + | The device output queue is immediately triggered with [[https://elixir.bootlin.com/linux/v2.6.20/source/include/net/pkt_sched.h#L223|qdisc_run()]]. It calls [[https://elixir.bootlin.com/linux/v2.6.20/source/net/sched/sch_generic.c#L91|qdisc_restart()]], which takes an skb from the queue using the qdisc->dequeue virtual method. Specific queueing disciplines may delay sending by not returning any skb, and setting up a qdisc_watchdog_timer() instead. When the timer expires, [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L635|netif_schedule()]] is called to start transmission. |
- | Eventually, the sk_buff is sent with [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1343|dev_hard_start_xmit()]] and removed from the Qdisc. If sending fails, the skb is re-queued and [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L635|netif_schedule()]] is called to schedule a retry. | + | Eventually, the sk_buff is sent with [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1343|dev_hard_start_xmit()]] and removed from the Qdisc. If sending fails, the skb is re-queued and [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L635|netif_schedule()]] is called to schedule a retry. |
- | [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L635|netif_schedule()]] raises a software interrupt, which causes [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1644|net_tx_action()]] to be called when the NET_TX_SOFTIRQ is run by ksoftirqd. net_tx_action() calls qdisc_run() for each device with an active queue. | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L635|netif_schedule()]] raises a software interrupt, which causes [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1644|net_tx_action()]] to be called when the NET_TX_SOFTIRQ is run by ksoftirqd. net_tx_action() calls qdisc_run() for each device with an active queue. |
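The dequeue, transmit, re-queue and retry cycle described in the last few paragraphs can be modelled roughly as follows. This is a simplified sketch, not the kernel's control flow: `xmit` and `schedule` are hypothetical callbacks standing in for the driver transmit routine and for netif_schedule(), and locking is ignored entirely.

```python
from collections import deque


def qdisc_run(queue, xmit, schedule):
    """Drain `queue` through xmit(); on failure re-queue and request a retry.

    Returns the number of packets handed to the device successfully.
    """
    sent = 0
    while queue:
        skb = queue.popleft()
        if not xmit(skb):
            queue.appendleft(skb)  # put the packet back at the head of the queue
            schedule()             # ask for another run later
            break
        sent += 1
    return sent
```

Re-queueing at the head keeps packet order intact, so the retry resumes exactly where the failed attempt stopped.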
- | [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1343|dev_hard_start_xmit()]] calls the hard_start_xmit virtual method for the net_device. But first, it calls dev_queue_xmit_nit(), which checks if a packet handler has been registered for the ETH_P_ALL protocol. This is used for tcpdump. | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1343|dev_hard_start_xmit()]] calls the hard_start_xmit virtual method for the net_device. But first, it calls dev_queue_xmit_nit(), which checks if a packet handler has been registered for the ETH_P_ALL protocol. This is used for tcpdump. |
- | The device driver's hard_start_xmit function will generate one or more commands to the network device for scheduling transfer of the buffer. After a while, the network device replies that it's done. This triggers freeing of the sk_buff. If the sk_buff is freed from interrupt context, [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L679|dev_kfree_skb_irq()]] is used. This delays the actual freeing until the next NET_TX_SOFTIRQ run, by putting the skb on the softnet_data completion_queue. This avoids doing frees from interrupt context. | + | The device driver's hard_start_xmit function will generate one or more commands to the network device for scheduling transfer of the buffer. After a while, the network device replies that it's done. This triggers freeing of the sk_buff. If the sk_buff is freed from interrupt context, [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L679|dev_kfree_skb_irq()]] is used. This delays the actual freeing until the next NET_TX_SOFTIRQ run, by putting the skb on the softnet_data completion_queue. This avoids doing frees from interrupt context. |
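The deferred-free idea can be shown with a small Python model. It is only a sketch: the `SoftnetData` class here is a stand-in holding the two lists the example needs, the `freed` list and the `tx_softirq_pending` flag are illustrative inventions, and "freeing" is represented by moving a buffer to `freed`.

```python
class SoftnetData:
    def __init__(self):
        self.completion_queue = []       # buffers waiting to be freed
        self.tx_softirq_pending = False  # stands in for a pending NET_TX_SOFTIRQ
        self.freed = []                  # illustrative record of completed frees


def dev_kfree_skb_irq(sd, skb):
    # Interrupt context: only queue the buffer and request softirq processing.
    sd.completion_queue.append(skb)
    sd.tx_softirq_pending = True


def net_tx_action(sd):
    # Softirq context: take the whole list and do the actual frees here.
    pending, sd.completion_queue = sd.completion_queue, []
    sd.tx_softirq_pending = False
    for skb in pending:
        sd.freed.append(skb)
```

The interrupt-time work is reduced to a list append, while the potentially more expensive release happens later in a less time-critical context.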
====== Receive flow ====== | ====== Receive flow ====== | ||
Line 94: | Line 94: | ||
The network device pre-allocates a number of sk_buffs for reception; the number is configured per device. Usually, the addresses of the data space in these sk_buffs are configured directly as the DMA area for the device. The device interrupt handler takes the sk_buff and performs reception handling on it. Before NAPI, this was done using netif_rx(). In NAPI, it is done in two phases. | The network device pre-allocates a number of sk_buffs for reception; the number is configured per device. Usually, the addresses of the data space in these sk_buffs are configured directly as the DMA area for the device. The device interrupt handler takes the sk_buff and performs reception handling on it. Before NAPI, this was done using netif_rx(). In NAPI, it is done in two phases. | ||
- | - From the interrupt handler, the device driver just calls [[http://lxr.linux.no/linux+v2.6.20/include/linux/netdevice.h#L858|netif_rx_schedule()]] and returns from interrupt. netif_rx_schedule() adds the device to sofnet_data's poll_list and raises the NET_RX_SOFTIRQ software interrupt. | + | - From the interrupt handler, the device driver just calls [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netdevice.h#L858|netif_rx_schedule()]] and returns from interrupt. netif_rx_schedule() adds the device to softnet_data's poll_list and raises the NET_RX_SOFTIRQ software interrupt. |
- | - ksoftirqd runs [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1904|net_rx_action()]], which calls the device's poll virtual method. The poll method does device-specific buffer management, calls netif_receive_skb() for each sk_buff, allocates new sk_buffs as required, and terminates with netif_rx_complete(). | + | - ksoftirqd runs [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1904|net_rx_action()]], which calls the device's poll virtual method. The poll method does device-specific buffer management, calls netif_receive_skb() for each sk_buff, allocates new sk_buffs as required, and terminates with netif_rx_complete(). |
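The two NAPI phases can be sketched as a toy Python model. Everything here is an assumption made for illustration: the `Device` class, the explicit `poll_list` argument, the `deliver` callback standing in for netif_receive_skb(), and the budget value. Removing the device from the list plays the role of netif_rx_complete().

```python
from collections import deque


class Device:
    def __init__(self, name, frames):
        self.name = name
        self.rx_ring = deque(frames)  # frames already placed in memory by the NIC


def netif_rx_schedule(poll_list, dev):
    # Phase 1 (interrupt handler): just mark the device as needing a poll.
    if dev not in poll_list:
        poll_list.append(dev)


def poll(dev, budget, deliver):
    # Device poll method: hand up at most `budget` frames.
    done = 0
    while dev.rx_ring and done < budget:
        deliver(dev.rx_ring.popleft())
        done += 1
    return done


def net_rx_action(poll_list, deliver, budget=4):
    # Phase 2 (softirq): poll scheduled devices until the budget is used up.
    while poll_list and budget > 0:
        dev = poll_list[0]
        budget -= poll(dev, budget, deliver)
        if not dev.rx_ring:
            poll_list.popleft()  # device is drained; stop polling it
```

A device that still has frames when the budget runs out stays on the list and is picked up again on the next softirq run, which bounds the time spent in a single pass.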
- | [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1764|netif_receive_skb()]] finds out how to pass the sk_buff to upper layers. | + | [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1764|netif_receive_skb()]] finds out how to pass the sk_buff to upper layers. |
- | - [[http://lxr.linux.no/linux+v2.6.20/include/linux/netpoll.h#L49|netpoll_rx()]] is called, to support the [[http://people.redhat.com/~jmoyer/netpoll-linux_kongress-2005.pdf|Netpoll API]] | + | - [[https://elixir.bootlin.com/linux/v2.6.20/source/include/linux/netpoll.h#L49|netpoll_rx()]] is called, to support the [[http://people.redhat.com/~jmoyer/netpoll-linux_kongress-2005.pdf|Netpoll API]] |
- Call packet handlers for ETH_P_ALL protocol (for tcpdump) | - Call packet handlers for ETH_P_ALL protocol (for tcpdump) | ||
- Call handle_ing() for ingress queueing | - Call handle_ing() for ingress queueing | ||
Line 106: | Line 106: | ||
- Call the packet handler registered for the L3 protocol specified by the packet. | - Call the packet handler registered for the L3 protocol specified by the packet. | ||
- | The packet handlers are called with the [[http://lxr.linux.no/linux+v2.6.20/net/core/dev.c#L1690|deliver_skb()]] function, which calls the protocol's func virtual method to handle the packet. | + | The packet handlers are called with the [[https://elixir.bootlin.com/linux/v2.6.20/source/net/core/dev.c#L1690|deliver_skb()]] function, which calls the protocol's func virtual method to handle the packet. |
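The handler lookup can be pictured as a table keyed by protocol number, as in this Python sketch. The `handlers` dictionary and the `deliver` function are hypothetical; the protocol constants are the standard EtherType values, and the intermediate steps (ingress queueing and the like) are left out.

```python
ETH_P_ALL = 0x0003  # pseudo-protocol matching every packet (used by sniffers)
ETH_P_IP = 0x0800
ETH_P_ARP = 0x0806


def deliver(handlers, proto, skb):
    """Call the ETH_P_ALL handlers first, then those registered for `proto`.

    Returns the handlers' results in call order.
    """
    results = []
    for func in handlers.get(ETH_P_ALL, []):
        results.append(func(skb))
    for func in handlers.get(proto, []):
        results.append(func(skb))
    return results
```

A packet with a protocol nobody registered for still reaches the ETH_P_ALL taps, which is why a sniffer sees traffic the stack itself ignores.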
===== Layer 3: Network layer (IPv4, ARP)===== | ===== Layer 3: Network layer (IPv4, ARP)===== | ||
==== ARP==== | ==== ARP==== | ||
- | ARP packets are handled with [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/arp.c#L930|arp_rcv()]]. It processes the ARP information, stores it in the neighbour cache, and sends a reply if required. In the latter case, a new sk_buff (with new data space) is allocated for the reply. | + | ARP packets are handled with [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/arp.c#L930|arp_rcv()]]. It processes the ARP information, stores it in the neighbour cache, and sends a reply if required. In the latter case, a new sk_buff (with new data space) is allocated for the reply. |
==== IPv4==== | ==== IPv4==== | ||
- | IPv4 packets are handled with [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_input.c#L373|ip_rcv()]]. It parses headers, checks for validity, sends an ICMP reply or error message if required. It also looks up the destination address using [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/route.c#L2090|ip_route_input()]]. The destination's input virtual method is called with the sk_buff. | + | IPv4 packets are handled with [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_input.c#L373|ip_rcv()]]. It parses headers, checks for validity, sends an ICMP reply or error message if required. It also looks up the destination address using [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/route.c#L2090|ip_route_input()]]. The destination's input virtual method is called with the sk_buff. |
- | * [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ipmr.c#L1338|ip_mr_input()]] is called for multicast addresses. The packet may be forwarded using ip_mr_forward(), and it may be delivered locally using ip_local_deliver(). | + | * [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ipmr.c#L1338|ip_mr_input()]] is called for multicast addresses. The packet may be forwarded using ip_mr_forward(), and it may be delivered locally using ip_local_deliver(). |
- | * [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_forward.c#L56|ip_forward()]] is called for packets with a different destination for which we have a route. It directly calls the neighbour's output virtual method. | + | * [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_forward.c#L56|ip_forward()]] is called for packets with a different destination for which we have a route. It directly calls the neighbour's output virtual method. |
- | * [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/ip_input.c#L263|ip_local_deliver()]] is called if this machine is the destination of the packet. Datagram fragments are collected here. | + | * [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/ip_input.c#L263|ip_local_deliver()]] is called if this machine is the destination of the packet. Datagram fragments are collected here. |
ip_local_deliver() delivers to any raw sockets for this connection first, using raw_local_deliver(). Then, it calls the L4 protocol handler for the protocol specified in the datagram. The L4 protocol is called even if a raw socket exists. | ip_local_deliver() delivers to any raw sockets for this connection first, using raw_local_deliver(). Then, it calls the L4 protocol handler for the protocol specified in the datagram. The L4 protocol is called even if a raw socket exists. | ||
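The input decision and the local delivery order can be summarised in a toy Python model. It simplifies heavily and rests on assumptions: `route_input` and `local_deliver` are illustrative names, a route is assumed to exist for every non-local destination, and handlers and sockets are represented by plain strings.

```python
import ipaddress


def route_input(dst, local_addrs):
    """Pick the input method for a destination address (toy routing decision)."""
    if ipaddress.ip_address(dst).is_multicast:
        return "ip_mr_input"
    if dst in local_addrs:
        return "ip_local_deliver"
    return "ip_forward"  # assumes a route to dst exists


def local_deliver(protocol, raw_sockets, l4_handlers):
    """Return who receives the datagram: raw sockets first, then the L4 handler."""
    receivers = [f"raw:{sock}" for sock in raw_sockets.get(protocol, [])]
    handler = l4_handlers.get(protocol)
    if handler is not None:
        receivers.append(handler)  # called even when a raw socket matched
    return receivers
```

For example, with a raw socket open on protocol 6 the datagram is copied to that socket and still handed to the TCP handler afterwards.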
Line 124: | Line 124: | ||
===== Layer 4: Transport layer (TCP)===== | ===== Layer 4: Transport layer (TCP)===== | ||
- | The net_protocol handler for TCP is [[http://lxr.linux.no/linux+v2.6.20/net/ipv4/tcp_ipv4.c#L1611|tcp_v4_rcv()]]. Most of the code here deals with the protocol processing in TCP, for setting up connections, performing flow control, etc. | + | The net_protocol handler for TCP is [[https://elixir.bootlin.com/linux/v2.6.20/source/net/ipv4/tcp_ipv4.c#L1611|tcp_v4_rcv()]]. Most of the code here deals with the protocol processing in TCP, for setting up connections, performing flow control, etc. |
A received TCP packet may include an acknowledgement of a previously sent packet, which may trigger further sending of packets (tcp_data_snd_check()) or of acknowledgements (tcp_ack_snd_check()). | A received TCP packet may include an acknowledgement of a previously sent packet, which may trigger further sending of packets (tcp_data_snd_check()) or of acknowledgements (tcp_ack_snd_check()). |