
NAPI

NAPI (“New API”) is an extension to the device driver packet processing framework, designed to improve the performance of high-speed networking. NAPI works through:

- Interrupt mitigation: high-speed networking can create thousands of interrupts per second, all of which tell the system the same thing: there are lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.
- Packet throttling: when the system is overwhelmed and must drop packets, it is better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.

New drivers should use NAPI if the hardware can support it. However, NAPI additions to the kernel do not break backward compatibility and drivers may still process completions directly in interrupt context if necessary.

NAPI Driver design

The following is a whirlwind tour of what must be done to create a NAPI-compliant network driver.

For each interrupt vector, the driver must allocate an instance of struct napi_struct. This does not require calling any special function, and the structure is typically embedded in the driver's private structure. Each napi_struct must be initialised and registered before the net device itself, using netif_napi_add(), and unregistered after the net device, using netif_napi_del().

The next step is to make some changes to your driver's interrupt handler. If your driver has been interrupted because a new packet is available, that packet should not be processed at that time. Instead, your driver should disable any further “packet available” interrupts and tell the networking subsystem to poll your driver shortly to pick up all available packets. Disabling interrupts, of course, is a hardware-specific matter between the driver and the adaptor. Arranging for polling is done with a call to:

   void napi_schedule(struct napi_struct *napi);

An alternative form you'll see in some drivers is:

   if (napi_schedule_prep(napi))
       __napi_schedule(napi);

The end result is the same either way. (If napi_schedule_prep() returns zero, it means that there was already a poll scheduled, and you should not have received another interrupt).

The next step is to create a poll() method for your driver; its job is to obtain packets from the network interface and feed them into the kernel. The poll() prototype is:

   int (*poll)(struct napi_struct *napi, int budget);

The poll() function should process all available incoming packets, much as your interrupt handler might have done in the pre-NAPI days. There are some exceptions, however:

- Packets should not be passed to netif_rx(); instead, use:

   int netif_receive_skb(struct sk_buff *skb);

- The budget parameter places a limit on the amount of work poll() may do. Your driver should process no more than budget packets, and the return value of poll() is the number of packets which were actually processed. If it processes fewer than the full budget, your driver should re-enable interrupts, remove itself from polled mode, and tell the networking subsystem that polling is complete with a call to:

   void napi_complete(struct napi_struct *napi);

The networking subsystem promises that poll() will not be invoked simultaneously (for the same napi_struct) on multiple processors.

The final step is to tell the networking subsystem about your poll() method. This is done in your initialization code when registering the napi_struct:

   netif_napi_add(dev, &napi, my_poll, 16);

The last parameter, weight, is a measure of the importance of this interface; the number stored here will turn out to be the same number your driver finds in the budget argument to poll(). Gigabit and faster adaptor drivers tend to set weight to 64; smaller values can be used for slower media.

Hardware Architecture

NAPI, however, requires the following features to be available:

- A DMA ring, or enough RAM to store packets in software devices.
- The ability to turn off interrupts (or at least those events that send packets up the stack).

NAPI processes packet events in what is known as the napi->poll() method. Typically, only packet receive events are processed in napi->poll(). The rest of the events may be processed by the regular interrupt handler to reduce processing latency (justified also because there are not that many of them).

Note, however, that NAPI does not enforce that napi->poll() only processes receive events. Tests with the tulip driver indicated slightly increased latency if all of the interrupt handling is moved to napi->poll(). Also, MII/PHY handling gets a little trickier.

The example used in this document moves only the receive processing to napi->poll(); this is shown with the patch for the tulip driver. For an example of code that moves all the interrupt-driven work to napi->poll(), look at other drivers (tg3, e1000, sky2). There are caveats that might force you to move everything to napi->poll(): different NICs work differently depending on their status/event acknowledgement setup.

There are two types of event register ACK mechanisms:

- Clear-on-read (COR): reading the status/event register clears everything. In this case your only choice is to move all event processing to napi->poll().
- Clear-on-write (COW): either i) you clear the status by writing a 1 in the bit location you want cleared — these are the majority of NICs, and they work best with NAPI (tulip falls under this category); or ii) any write to the register clears everything. Can't seem to find any NICs supported by Linux which do the latter; if they exist, NAPI will not work well with them.

This is a very important topic and appendix 2 is dedicated to more discussion.

Locking rules and environmental guarantees

For the rest of this text, we'll assume that napi→poll() only processes receive events.

NAPI API

The core functions used in this document are:

- netif_napi_add() / netif_napi_del(): attach a napi_struct (and its poll() method) to a net device, and detach it again.
- napi_schedule(), or napi_schedule_prep() followed by __napi_schedule(): called from the interrupt handler to arrange for a poll.
- napi_complete(): called from poll() when all pending work has been handled, to leave polled mode.
- napi_reschedule(): re-add the instance to the poll list after completion, used when new input is found during the race window discussed below.

Advantages


Performance under high packet load

NAPI provides an “inherent mitigation” which is bound by system capacity, as can be seen from the following data collected by Robert Olsson's tests on Gigabit Ethernet (e1000):

 Psize   Ipps      Tput     Rxint   Txint   Done    Ndone
 ---------------------------------------------------------
    60   890000    409362      17   27622      7     6823
   128   758150    464364      21    9301     10     7738
   256   445632    774646      42   15507     21    12906
   512   232666    994445  241292   19147    241  1921062
  1024   119061   1000003  872519   19258    872     5110
  1440    85193   1000003  946576   19505    946     5690

Legend:

- Psize: packet size in bytes.
- Ipps: input packets per second.
- Tput: packets (out of 1M generated) that made it out.
- Rxint: receive interrupts seen.
- Txint: transmit-completion interrupts seen.
- Done: the number of times poll() managed to pull all packets out of the rx ring (the lower the load, the more often the ring is fully cleaned).
- Ndone: the opposite of Done: poll() exhausted its budget with packets still in the ring.

Observe that when the NIC receives 890K packets/sec, only 17 rx interrupts are generated: the system can't handle processing at 1 interrupt/packet at that load level. At lower rates, on the other hand, rx interrupts go up, and therefore the interrupt/packet ratio goes up (as observable from the table). So it is possible that, under low enough input, you get one poll call for each input packet, caused by a single interrupt each time. And if the system can't handle an interrupt/packet ratio of 1, it will simply chug along in polled mode.

Use of softirq for other optimizations

NAPI usage does not have to be limited only to receiving packets. With many devices the poll() routine can also be used to manage transmit completion or PHY interface state changes. By moving this processing out of the hardware interrupt service routine, there may be less latency and better performance.

Hardware Flow control

Most chips with flow control only send a pause packet when they run out of Rx buffers. Since packets are pulled off the DMA ring by a softirq in NAPI, if the system is slow in grabbing them and we have a high input rate (faster than the system's capacity to remove packets), then theoretically there will only be one rx interrupt for all packets during a given packet storm. Under low load, we might have a single interrupt per packet. Flow control should be programmed to apply when the system can't pull out packets fast enough, i.e., send a pause only when you run out of rx buffers.

There are some tradeoffs with hardware flow control. If the driver makes receive buffers available to the hardware one by one, then under load up to 50% of the packets can end up being flow control packets. Flow control works better if the hardware is notified about buffers in larger bursts.

Disadvantages


Latency

In some cases, NAPI may introduce additional software IRQ latency.

IRQ masking

On some devices, changing the IRQ mask may be a slow operation, or require additional locking. This overhead may negate any performance benefits observed with NAPI.


Issues

IRQ race a.k.a rotting packet

There are two common race issues that a driver may have to deal with. These are cases where it is possible for the receiver to stop because of interaction between the hardware and the driver logic.

IRQ mask and level-triggered

If a status bit for receive or rxnobuff is set and the corresponding interrupt-enable bit is not on, then no interrupts will be generated. However, as soon as the “interrupt-enable” bit is unmasked, an immediate interrupt is generated (assuming the status bit was not turned off). Generally the concept of level triggered IRQs in association with a status and interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip: “pending work” is indicated by the status bit (CSR5 in tulip). The corresponding interrupt bit (CSR7 in tulip) might be turned off (but the CSR5 will continue to be turned on with new packet arrivals even if we clear it the first time). Very important is the fact that if we turn on the interrupt bit when status is set, then an immediate irq is triggered.

Suppose we cleared the rx ring, proclaimed there was “no more work to be done”, and then went on to do a few other things; when we then re-enable interrupts, there is a possibility that a new packet might have sneaked in during that phase. It helps to look at the pseudo code for the tulip poll routine:

         do {
                 ACK;
                 while (ring_is_not_empty()) {
                         work-work-work
                         if quota is exceeded: exit, no touching irq status/mask
                 }
                 /* No packets, but new ones can arrive while we are doing this */
                 CSR5 := read
                 if (CSR5 is not set) {
                         /* If something arrives in this narrow window here,
                          * where the comments are ;-> an irq will be generated */
                         unmask irqs;
                         exit poll;
                 }
         } while (rx_status_is_set);

CSR5 bit of interest is only the rx status.

Look at the last if statement: you have just finished grabbing all the packets from the rx ring, and you check whether the status bit says more packets have just come in; it says none. You then re-enable rx interrupts. If a new packet arrived during this check, we are counting on CSR5 being set in that small window of opportunity, so that re-enabling interrupts will immediately trigger an interrupt to register the new packet for processing.

non-level sensitive IRQs

Some systems have hardware that does not do level-triggered IRQs properly. In that case, IRQs may be lost while being masked, and the only way to leave poll is to do a double check for new input after napi_complete() is invoked, and to re-enable polling (after seeing this new input):

 	.
 	.
 restart_poll:
 	while (ring_is_not_empty()) {
 		work-work-work
 		if budget is exceeded: exit, not touching irq status/mask
 	}
 	.
 	.
 	.
 	enable_rx_interrupts()
 	napi_complete(napi);
 	if (ring_has_new_packet() && napi_reschedule(napi)) {
 		disable_rx_and_rxnobufs()
 		goto restart_poll
 	}


Basically, napi_complete() removes us from the poll list; but a new packet might have arrived in the race window and would otherwise never be noticed, so we check the ring for new input and, if there is any, attempt to re-add ourselves to the poll list.

Scheduling issues

As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the general solution to schedule softirqs to run before the next interrupt, putting them under scheduler control; this also prevents consecutive softirqs from monopolizing the CPU. A consequence is that the priority of ksoftirqd needs to be considered when running very CPU-intensive applications alongside networking, to get the proper softirq/user balance. Increasing ksoftirqd priority to 0 (and eventually higher) is reported to cure problems with low network performance at high CPU load.

Most used processes in a GIGE router:

 USER  PID  %CPU %MEM  SIZE   RSS TTY STAT START     TIME COMMAND
 root    3  0.2  0.0     0     0  ?   RWN  Aug 15  602:00 (ksoftirqd_CPU0)
 root  232  0.0  7.9 41400 40884  ?   S    Aug 15   74:12 gated