CPUs can be partitioned so that tasks and interrupts with different purposes do not compete for the same processor resources. In a real-time system, CPU partitioning can be used to dedicate CPUs to real-time tasks and their corresponding interrupts.
The base technology for CPU partitioning is CPU affinity. On top of this mechanism further Linux kernel facilities for CPU partitioning are implemented. User space tooling is available as well.
This article gives a short overview of these facilities and tools. Follow the links for detailed information.
The processing of tasks or interrupts can be restricted to a specified set of CPUs by setting their affinity. The task CPU affinity constrains the scheduler so that a given task is executed only on the CPUs in the task's affinity set. The IRQ affinity specifies the CPUs to which an interrupt may be routed.
In an SMP system, the property by which the OS scheduler binds processes or tasks to one or more processors is known as CPU affinity. The ability to override how the scheduler assigns processes or tasks to a particular set of processors is a feature available in several OSes. The idea is to say “always run this process/task on processor one” or “run these processes/tasks on all processors but processor zero”. The scheduler then places the processes/tasks only on the CPUs contained in the affinity set.
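As a minimal sketch of task affinity from user space, assuming a Linux system and Python's standard `os` module (whose `sched_setaffinity()` wraps the system call of the same name), the following restricts the calling thread to one of its currently allowed CPUs and then restores the original set. The helper name `pin_to_cpus` is hypothetical and used only for illustration.

```python
import os

def pin_to_cpus(cpus, pid=0):
    """Restrict `pid` (0 means the calling thread) to `cpus`.

    Returns the affinity set that was in effect before the change,
    so that the caller can restore it later.
    """
    previous = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, cpus)
    return previous

# CPUs the scheduler may currently use for this thread.
allowed = os.sched_getaffinity(0)

# "Always run this task on one processor": pick the lowest allowed CPU.
target = {min(allowed)}
previous = pin_to_cpus(target)
current = os.sched_getaffinity(0)
print("pinned to", sorted(current))

# Restore the original affinity set.
pin_to_cpus(previous)
```

Changing the affinity of a task owned by another user normally requires additional privileges.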
Task affinity management can be utilized via the following mechanisms:
The CPU affinity of per-CPU threads like ksoftirqd/n and kworker/n (where n is the core number) cannot be changed. Other threads like kswapd/n are per-NUMA node and can only be pinned within the cores of their node.
Kworker threads and the workqueue tasks which they perform are a special case. While it is possible to rely on taskset and sched_setaffinity() to manage kworkers, doing so is of little utility since the threads are often short-lived and, at any rate, often perform a wide variety of work. The paradigm with workqueues is instead to associate an affinity setting with the workqueue itself. “Unbound” is the name for workqueues which are not per-CPU. These workqueues consume a lot of CPU time on many systems and tend to present the greatest management challenge for latency control. Those unbound workqueues which appear in /sys/devices/virtual/workqueue are configurable from userspace. The parameters affinity_scope, affinity_strict and cpumask together determine on which cores the kworker which executes the work function will run. Many unbound workqueues are not configurable via sysfs. Making their properties visible there requires an additional WQ_SYSFS flag in the kernel source.
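To illustrate how the sysfs-visible workqueues can be inspected, here is a small sketch that lists each workqueue directory under /sys/devices/virtual/workqueue together with its allowed CPUs. It assumes that the per-workqueue mask is exposed in a file named `cpumask` holding a hexadecimal mask, optionally split into comma-separated 32-bit groups. The function names are hypothetical.

```python
from pathlib import Path

WQ_SYSFS = Path("/sys/devices/virtual/workqueue")

def parse_cpumask(mask):
    """Convert a hex CPU mask such as 'f' or '00000001,00000000'
    into a set of CPU numbers (bit 0 is CPU 0)."""
    value = int(mask.strip().replace(",", ""), 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}

def configurable_workqueues(root=WQ_SYSFS):
    """Map each workqueue directory under `root` to its allowed CPUs.

    Returns an empty dict if the directory does not exist, for example
    on a kernel without sysfs-visible workqueues.
    """
    result = {}
    if not root.is_dir():
        return result
    for entry in sorted(root.iterdir()):
        mask_file = entry / "cpumask"  # assumed attribute name
        if not mask_file.is_file():
            continue
        try:
            result[entry.name] = parse_cpumask(mask_file.read_text())
        except (OSError, ValueError):
            continue  # unreadable or unexpected content, skip it
    return result

for name, cpus in configurable_workqueues().items():
    print(f"{name}: CPUs {sorted(cpus)}")
```

Changing a workqueue's mask means writing a new value to the same file, which requires root privileges.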
Since kernel 6.5, the tools/workqueue/wq_monitor.py Python script is available in-tree, and since 6.6, wq_dump.py has joined it. These Python scripts require the drgn debugger, which is packaged by major Linux distributions. Another recent addition of particular interest for the realtime project is wqlat.py, which is part of the bcc/tools suite. Both sets of tools may require special kernel configuration settings.
Hardware interrupts can interrupt kernel and user space computations at any time, except when the kernel disables interrupt processing to protect resources. When a hardware interrupt is handled, the CPU switches into a separate context, executes the handler code, then switches back to the interrupted context and resumes execution.
Depending on the interrupt hardware, interrupts can be routed to any CPU or delivery can be rotated between CPUs. Most interrupt controllers allow restricting the set of CPUs to which a particular interrupt can be delivered by setting the IRQ affinity.
When a CPU receives an interrupt, it switches to interrupt context and the current task has to wait until the IRQ has been handled. The ability to restrict the handling of a given IRQ to a set of CPUs is called IRQ affinity. Setting it changes the hardware routing of the interrupt to the CPUs.
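As a hedged sketch of how IRQ affinity can be read and changed from user space, the following uses the /proc/irq/<n>/smp_affinity_list file, which holds a CPU list such as `0-3,8`. The function names and the `proc_root` parameter are hypothetical; the parameter exists only so that the code can be tried against a scratch directory instead of the live /proc.

```python
from pathlib import Path

def parse_cpu_list(text):
    """Convert a kernel CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list or stray separator
        first, _, last = part.partition("-")
        # A single number "8" is treated as the range 8-8.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus

def irq_affinity(irq, proc_root=Path("/proc")):
    """Return the set of CPUs to which IRQ number `irq` may be delivered."""
    path = proc_root / "irq" / str(irq) / "smp_affinity_list"
    return parse_cpu_list(path.read_text())

def set_irq_affinity(irq, cpus, proc_root=Path("/proc")):
    """Request that IRQ number `irq` be delivered only to `cpus`.

    On a live system this needs root, and the kernel or the interrupt
    controller may reject the request.
    """
    path = proc_root / "irq" / str(irq) / "smp_affinity_list"
    path.write_text(",".join(str(cpu) for cpu in sorted(cpus)))

print(sorted(parse_cpu_list("0-3,8")))
```

For example, `set_irq_affinity(42, {0})` would ask the kernel to route the (hypothetical) IRQ 42 to CPU 0 only.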
IRQ affinity management can be utilized via the following mechanisms:
A common paradigm with realtime systems is to pin latency-insensitive kernel and userspace tasks to a designated “housekeeping” core. For example, taskset can pin kernel threads like kswapd and kauditd. Applications whose network traffic latency is not critical may wish to pin network IRQs there as well. Userspace threads which are sometimes CPU-intensive, like systemd and rsyslog, may also be pinned on the housekeeping core. Pinning userspace threads will not have the desired effect if much of their work is performed by unbound workqueues, which may migrate to any core.
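The housekeeping pattern can be sketched in a few lines, assuming a Linux /proc filesystem and sufficient privileges. The helper names are hypothetical, and the task names and CPU number in the usage note are placeholders. Note that `os.sched_setaffinity()` changes only the thread whose ID equals the given PID, so a multi-threaded process would need each of its threads handled separately.

```python
import os
from pathlib import Path

def pids_by_comm(name, proc_root=Path("/proc")):
    """Return the PIDs whose command name (/proc/<pid>/comm) equals `name`."""
    pids = []
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)

def pin_to_housekeeping(names, housekeeping_cpus):
    """Try to pin every task listed in `names` to `housekeeping_cpus`.

    Returns {pid: True/False}. A failure is expected for per-CPU kernel
    threads and for tasks the caller is not permitted to modify.
    """
    outcome = {}
    for name in names:
        for pid in pids_by_comm(name):
            try:
                os.sched_setaffinity(pid, housekeeping_cpus)
                outcome[pid] = True
            except OSError:
                outcome[pid] = False
    return outcome
```

A call such as `pin_to_housekeeping(["kauditd", "rsyslogd"], {0})`, run as root, would attempt to move those tasks to CPU 0 and report which attempts succeeded.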
Softirqs are a deferred-work mechanism in the kernel which is often challenging to manage on realtime systems. Softirqs may run in atomic context immediately following a hard IRQ which “raises” them, or they may be executed in process context by per-CPU kernel threads called ksoftirqd/n, where n is the core number. There are 10 kinds of softirqs, which perform diverse tasks for the networking, block, scheduling, timer and RCU subsystems and also execute callbacks for a large number of device drivers via the tasklet mechanism. Only one softirq of any kind may be active at any given time on a core. Thus if ksoftirqd is preempted by a hard IRQ, the soft interrupt raised by that IRQ cannot run immediately after it and must wait for ksoftirqd. This unfortunate situation has been called “the new Big Kernel Lock” by realtime Linux maintainers.
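To see which softirq kinds are busy on which core, the per-CPU counters in /proc/softirqs can be parsed. The sketch below assumes the usual layout of that file: a header line naming the CPUs, followed by one `NAME: count count ...` line per softirq kind. The function names are hypothetical and the sample text is made up for illustration.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs content into {softirq_name: [count per CPU]}."""
    lines = text.strip().splitlines()
    counts = {}
    for line in lines[1:]:  # the first line is the CPU header
        name, _, values = line.partition(":")
        counts[name.strip()] = [int(value) for value in values.split()]
    return counts

def read_softirqs(path="/proc/softirqs"):
    """Read and parse the live counters."""
    with open(path) as handle:
        return parse_softirqs(handle.read())

# Made-up sample in the assumed format, for a two-CPU system.
SAMPLE = """
                    CPU0       CPU1
          HI:          0          1
       TIMER:        100        200
      NET_RX:         50          5
"""

sample_counts = parse_softirqs(SAMPLE)
print(sample_counts["NET_RX"])
```

Sampling the live counters twice and subtracting shows how many softirqs of each kind ran on each core during the interval.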
Kernel configuration allows system managers to move NET_RX processing and RCU callbacks out of softirqs and into their own kthreads. Since kernel 5.12, moving NET_RX processing into its own kthread is possible by echoing '1' into the threaded sysfs attribute associated with a network device. The process table will afterwards include a new kthread called napi/xxx, where xxx is the interface name. [Read more about the NAPI mechanism in the networking wiki.] Userspace may employ taskset to pin this kthread on any core. Moving the softirq into its own kthread incurs a context-switch penalty, but even so may be worthwhile on systems where bursts of network traffic unacceptably delay applications. RCU Callback Offloading produces a new set of kthreads, and can be accomplished via a combination of compile-time configuration and boot-time command-line parameters.
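Enabling threaded NAPI from a script might look like the following sketch. It assumes that the attribute lives at /sys/class/net/<interface>/threaded. The function name and the `sysfs_net` parameter are hypothetical; the parameter exists so that the code can be exercised against a scratch directory. The interface name in the usage note is a placeholder.

```python
from pathlib import Path

def set_threaded_napi(interface, enabled=True, sysfs_net=Path("/sys/class/net")):
    """Write '1' or '0' to the interface's `threaded` attribute.

    Returns the value read back afterwards. On a live system this
    requires root and a kernel that provides the attribute.
    """
    attr = sysfs_net / interface / "threaded"  # assumed attribute path
    attr.write_text("1" if enabled else "0")
    return attr.read_text().strip()
```

Running `set_threaded_napi("eth0")` as root would request a NAPI kthread for the (placeholder) interface eth0, after which the new kthread can be pinned like any other task.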