I/OAT (I/O Acceleration Technology) is Intel's name for a collection of techniques that improve network throughput. The most significant of these is the DMA engine, which offloads the copying of SKB data into user buffers from the CPU. This is not a zero-copy receive, but it does allow the CPU to do other work while the DMA engine performs the copy operations.
The I/OAT patch series consists of three general areas. First, it adds a DMA subsystem to the kernel, which abstracts the DMA engine hardware from users of it. Second, it adds the I/OAT hardware driver, which plugs into the DMA subsystem and handles controlling the actual hardware. Finally, it implements a series of modifications to the network stack to make use of asynchronous copy offload.
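The layering described above can be illustrated with a small user-space sketch. This is not the kernel's DMA subsystem API: `copy_engine_ops`, `copy_channel`, `channel_memcpy()`, and the software-only `sw_engine` driver are hypothetical names chosen for illustration. The sketch only shows the general shape, in which clients call through a channel while a driver supplies the hardware-specific operations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model; none of these names are the kernel's. */
typedef int copy_cookie_t;

/* Operations a hardware driver would provide to the subsystem. */
struct copy_engine_ops {
    /* Queue a copy and return a cookie identifying the request. */
    copy_cookie_t (*submit)(void *ctx, void *dst, const void *src, size_t len);
    /* Return nonzero once the request identified by cookie is done. */
    int (*is_complete)(void *ctx, copy_cookie_t cookie);
};

/* What a client (e.g. the network stack) holds: ops plus driver state. */
struct copy_channel {
    const struct copy_engine_ops *ops;
    void *ctx;
};

/* Client-facing calls; they never touch driver internals directly. */
static copy_cookie_t channel_memcpy(struct copy_channel *ch, void *dst,
                                    const void *src, size_t len)
{
    assert(ch && ch->ops);
    return ch->ops->submit(ch->ctx, dst, src, len);
}

static int channel_is_complete(struct copy_channel *ch, copy_cookie_t cookie)
{
    return ch->ops->is_complete(ch->ctx, cookie);
}

/* Stand-in "driver": copies synchronously with memcpy, so every
 * request is already complete by the time submit returns. */
struct sw_engine {
    copy_cookie_t last_done;
};

static copy_cookie_t sw_submit(void *ctx, void *dst, const void *src, size_t len)
{
    struct sw_engine *e = ctx;
    memcpy(dst, src, len);
    return ++e->last_done;
}

static int sw_is_complete(void *ctx, copy_cookie_t cookie)
{
    struct sw_engine *e = ctx;
    return cookie <= e->last_done;
}

static const struct copy_engine_ops sw_ops = {
    .submit = sw_submit,
    .is_complete = sw_is_complete,
};
```

A client would fill in a `struct copy_channel` with `&sw_ops` and a `struct sw_engine`, call `channel_memcpy()`, and later poll `channel_is_complete()` with the returned cookie. A real hardware driver would replace only the two `sw_*` functions.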
The net stack modifications, given that they touch very important code, have received the most scrutiny. Significant changes:
Data structure changes (struct sock_common most notably)
sk_eat_skb() has an added parameter
tcp_recvmsg(): Code added to pin user buffer memory on entry. Code added to wait for async copies to complete, and unpin memory, before exiting.
tcp_rcv_established(): Code added to initiate async copies if possible.
dma_try_early_copy() added to tcp.c.
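The receive-path changes listed above follow a pin, queue, wait, unpin pattern, which can be sketched in user space as follows. This is a simulation under stated assumptions, not the kernel code: `fake_engine`, `user_buf`, `engine_submit()`, `engine_run()`, and `recv_segments()` are hypothetical names, "pinning" is reduced to a flag, and the "hardware" is a queue that performs its copies only when `engine_run()` is called.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 8

/* One queued copy request (hypothetical model of a DMA descriptor). */
struct pending_copy {
    void *dst;
    const void *src;
    size_t len;
};

struct fake_engine {
    struct pending_copy queue[MAX_PENDING];
    int submitted;   /* copies queued so far */
    int completed;   /* copies the simulated hardware has finished */
};

struct user_buf {
    char *base;
    size_t len;
    int pinned;      /* stand-in for pinning the user pages */
};

static void pin_user_buf(struct user_buf *ub)   { ub->pinned = 1; }
static void unpin_user_buf(struct user_buf *ub) { ub->pinned = 0; }

/* Queue a copy; returns a 1-based cookie, or -1 if the queue is full. */
static int engine_submit(struct fake_engine *e, void *dst,
                         const void *src, size_t len)
{
    if (e->submitted == MAX_PENDING)
        return -1;
    e->queue[e->submitted] = (struct pending_copy){ dst, src, len };
    return ++e->submitted;
}

/* Stand-in for hardware progress: perform every queued copy. */
static void engine_run(struct fake_engine *e)
{
    while (e->completed < e->submitted) {
        struct pending_copy *p = &e->queue[e->completed++];
        memcpy(p->dst, p->src, p->len);
    }
}

/* Receive path: pin on entry, queue one copy per segment,
 * wait for completion, unpin before returning. */
static size_t recv_segments(struct fake_engine *e, struct user_buf *ub,
                            const char *const segs[],
                            const size_t seg_len[], int nsegs)
{
    size_t copied = 0;
    int last_cookie = 0;

    pin_user_buf(ub);
    for (int i = 0; i < nsegs && copied + seg_len[i] <= ub->len; i++) {
        int cookie = engine_submit(e, ub->base + copied, segs[i], seg_len[i]);
        if (cookie < 0)
            break;
        last_cookie = cookie;
        copied += seg_len[i];
    }

    /* With real hardware the CPU could do other work here. */
    engine_run(e);                       /* simulated wait for the engine */
    assert(e->completed >= last_cookie); /* all queued copies finished */

    unpin_user_buf(ub);
    return copied;
}
```

The ordering is the part that matters: the buffer stays pinned from before the first copy is queued until after the last one has completed, since the engine writes to it outside the caller's control.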
Updated to ioat-1.7 and netdev latest git (20060508)
Intel presented technical information (but no code) at OLS 2005. All code except the hardware driver was posted for review in November 2005. An updated patch including the hardware driver was posted on March 3, 2006, and again, incorporating developer community feedback, on March 29, 2006.
I/OAT has been queued for 2.6.18.
This is the initial data we posted to netdev March 16 2006.
This is more Chariot data, but it also includes results with Chariot's data verification turned on, which touches the data. The CPU gap is narrower (especially with 8 ports) but still noteworthy.
This data shows that I/OAT really benefits from larger application buffer sizes. There is a spike in CPU utilization at 2K, although throughput also increases. The spike could be eliminated by increasing the tcp_dma_copybreak sysctl (“echo 4096 > /proc/sys/net/ipv4/tcp_dma_copybreak”), which disables I/OAT at or below that application buffer size.
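The effect of the copybreak threshold can be sketched as a simple predicate. This is an illustration of the behavior described above, not the kernel's code: `use_dma_copy()` and `write_copybreak()` are hypothetical helper names, and only the /proc path comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical predicate: offload only when a DMA channel is available
 * and the application buffer is strictly larger than the copybreak, so
 * I/OAT is disabled at or below that size. */
int use_dma_copy(size_t app_buf_len, size_t copybreak, int have_channel)
{
    return have_channel && app_buf_len > copybreak;
}

/* Hypothetical helper: write a new threshold to a sysctl file such as
 * /proc/sys/net/ipv4/tcp_dma_copybreak. Returns 0 on success, -1 on
 * failure (for example, when not running as root). */
int write_copybreak(const char *path, unsigned int value)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fprintf(f, "%u\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

With a copybreak of 4096, a 2K application buffer would fall back to the ordinary CPU copy, while an 8K buffer would still be offloaded.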
This shows netperf performance. Notice that we are using fewer clients than in the Chariot tests. There is a slight CPU savings at larger application buffer sizes, but it is less noteworthy than with Chariot.
This data shows 6 individual runs of Tbench, with a 7-10% drop in CPU utilization.
These are results from SPECWeb. Since this is a TX test, I/OAT should not impact performance, and the results indicate that it doesn't.
This data shows results with different numbers of ports. It includes both standard netperf data and results using a new option, present only in netperf's SVN repo, that touches the data after it is received.