CPU idle power saving methods for real-time workloads

Most configurations created for real-time applications disable power management completely to avoid any impact on latency. It is, however, possible to enable power management to a degree to which the impact on latency is tolerable based on application requirements. This document addresses how CPU idle states can be enabled and tuned to allow power savings while running real-time applications.

CPU idle states and their impact on latencies

A CPU idle state is a hardware feature to save power while the CPU is doing nothing. Different architectures support different types of CPU idle states. They vary in the degree of power savings, target residency and exit latency. Target residency is the amount of time the CPU needs to be in that idle state to justify the power consumed to enter and exit that state. Exit latency is the time the hardware takes to exit from that idle state.
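
The exit latency and target residency that the kernel uses for each idle state are exposed through the cpuidle sysfs interface (when the kernel is built with cpuidle support). The following sketch, which assumes CPU 3 as an example, prints each state's name together with its exit latency and target residency, both in microseconds:

#include <stdio.h>
#include <string.h>
 
/* Read one line from a sysfs file into buf, stripping the trailing newline. */
static int read_line(const char *path, char *buf, int len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, len, f))
        buf[0] = '\0';
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
 
int main(void)
{
    /* CPU 3 is only an example; adjust the path for the CPU of interest. */
    const char *base = "/sys/devices/system/cpu/cpu3/cpuidle";
    char path[256], name[64], latency[32], residency[32];
 
    for (int state = 0; ; state++) {
        snprintf(path, sizeof(path), "%s/state%d/name", base, state);
        if (read_line(path, name, sizeof(name)))
            break;                            /* no more idle states */
 
        snprintf(path, sizeof(path), "%s/state%d/latency", base, state);
        if (read_line(path, latency, sizeof(latency)))
            strcpy(latency, "?");
 
        snprintf(path, sizeof(path), "%s/state%d/residency", base, state);
        if (read_line(path, residency, sizeof(residency)))
            strcpy(residency, "?");
 
        printf("state%d: %s, exit latency %s us, target residency %s us\n",
               state, name, latency, residency);
    }
    return 0;
}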

CPU idle states in Intel CPUs are referred to as C states. Each C state has a name, starting from C0 up to the deepest state the processor supports. C states are generally per core; however, a package can also enter a C state when all cores in the package have entered a certain C state. The CPU is in C0 when it is fully active and is put into one of the other C states when the kernel idles it.

C states with higher numbers are referred to as “deeper C states.” These states save more power but also have higher exit latencies. Typically, the deeper the idle state, the more components are turned off or run at reduced voltage, and turning these components back on when the CPU wakes from a deeper C state takes time. The delays vary with platform components, kernel configuration, active devices, the kernel's wake-up path, and the state of caches and TLBs. The kernel must also disable interrupts while it turns components and clocks back on and updates the scheduler state. As a result, the delays can vary a lot; see the intel_idle.c driver source for the Intel-specific details.

The following sections discuss how we can tune the system so that we can limit the power saving capabilities to the point where these variable latencies (jitter) are contained within the tolerance of the real-time application design.

Configurations to guard critical cores from interference

It helps to understand some basic configurations used in real-time environments to reduce interference on the cores that run the real-time applications. These configurations are set through kernel boot parameters. Real-time applications can be run in “mixed mode,” where some cores, referred to as “critical cores,” run real-time applications while the other cores run regular tasks. If you are not running in mixed mode, all cores run real-time applications and some of the configurations discussed below may not be necessary. See CPU partitioning and cpu lists in The kernel's command-line parameters for details; a combined example command line is shown after the list below.

isolcpus=list of critical cores – isolate the critical cores so that the kernel scheduler will not migrate tasks from other cores into them.

irqaffinity=list of non-critical cores – protect the critical cores from IRQs.

rcu_nocbs=list of critical cores – prevent RCU callbacks from being invoked on the critical cores.

nohz=off – The kernel's “dynamic ticks” mode of managing scheduling-clock ticks is known to impact latencies while exiting CPU idle states. This option turns that mode off. Refer to NO_HZ: Reducing Scheduling-Clock Ticks for more information about this setting.

nohz_full=list of critical cores – activate the dynamic-ticks mode of managing scheduling-clock ticks on the listed cores. The cores in the list will not receive scheduling-clock ticks while only a single task is running or while the core is idle. The kernel must be built with CONFIG_NO_HZ_FULL enabled.
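
As an illustration, on a hypothetical four-CPU system where CPUs 2 and 3 are the critical cores and CPUs 0 and 1 run everything else, these parameters could be combined on the kernel command line as follows (the CPU lists are assumptions for this example; use either nohz=off or nohz_full, not both):

isolcpus=2,3 irqaffinity=0,1 rcu_nocbs=2,3 nohz_full=2,3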

Power Management Quality of Service (PM QoS)

PM QoS is a kernel infrastructure that can be used to constrain CPU idle state selection so that only idle states below a latency tolerance threshold are used. It has both a user-level and a kernel-level interface, and it can be used to limit C states either system wide or per core. The following sections explain the user-level interface. Details: PM Quality Of Service Interface.

Specifying system wide latency tolerance

You can specify a system-wide latency tolerance by writing a latency tolerance value, in microseconds, to /dev/cpu_dma_latency. A value of 0 disables C states completely. An application can write a constraint during critical operations and restore the default value by closing the file descriptor for that entry.

Example setting system wide latency tolerance:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
 
int32_t latency = 0;
int fd = open("/dev/cpu_dma_latency", O_RDWR);
 
/* writing 0 disables C states */
write(fd, &latency, sizeof(latency));
 
/* do critical operations */
 
/* closing fd restores the default value */
close(fd);

Specifying per-core latency tolerance

You can specify the latency tolerance of each core by writing the latency tolerance value into /sys/devices/system/cpu/cpu<cpu number>/power/pm_qos_resume_latency_us. The cpuidle governor compares this value with the exit latency of each C state and selects the ones that meet the latency requirement. A value of “0” means “no restriction” and a value of “n/a” means disable all C states for that core.

Example setting per-core latency tolerance from command line:

To disable all CPU idle states in CPU 3:

$echo "n/a" > /sys/devices/system/cpu/cpu3/power/pm_qos_resume_latency_us

To limit latency to 20 us:

$echo 20 > /sys/devices/system/cpu/cpu3/power/pm_qos_resume_latency_us

To remove all restrictions or revert to default:

$echo 0 > /sys/devices/system/cpu/cpu3/power/pm_qos_resume_latency_us

Example setting per-core latency tolerance from application:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
 
char latency_str[10];
int fd = open("/sys/devices/system/cpu/cpu3/power/pm_qos_resume_latency_us", O_RDWR);
 
/* "n/a" disables all C states on this core */
strcpy(latency_str, "n/a");
write(fd, latency_str, strlen(latency_str));
 
/* do critical operations */
 
/* "0" reverts to "no restriction"; the sysfs value persists until rewritten */
strcpy(latency_str, "0");
write(fd, latency_str, strlen(latency_str));
 
/* set latency tolerance to 20us */
snprintf(latency_str, sizeof(latency_str), "%d", 20);
write(fd, latency_str, strlen(latency_str));
 
/* do operations tolerant of 20us latency */
 
close(fd);
 
/* do operations tolerant of 20us latency */

Note: The per-core user interface was changed in kernel version 4.16, while the current RT Linux kernel is 4.14. Pull in commits 704d2ce, 0759e80 and c523c68 from 4.16 to get the interface described above.

Tools used to measure latencies

Cyclictest is used to measure the latencies while turbostat is used to identify the C states that are selected and their residencies. See cyclictest manpage.

Some parameters:

-a – Set affinity to CPU running real-time workload

-h or -H – generate a histogram. Takes a parameter specifying the maximum latency, in microseconds, to be tracked.

-t – number of threads to use

-p – priority of thread

-i – interval in microseconds. This is the time the application is idle between operations.

-m – lock memory, preventing it from being paged out

-D – duration to run the test.

--laptop – by default cyclictest disables all C states using PM QoS. This option stops it from doing that.

Cyclictest will be used in the tuning methods described below. The first tuning method uses PM QoS to specify a latency tolerance. The second tuning method uses the -i option of cyclictest to modify the interval and thereby control CPU idle state selection.

Tuning the latency using PM QoS

As explained above, you can use PM QoS to control the type of CPU idle states that the kernel selects when it goes to idle. The cpuidle governor compares the latency tolerance value registered through PM QoS with the hardware exit latencies of each CPU idle state and picks the one that meets the latency requirement.

Run cyclictest with the histogram option, and check the histogram for tolerable variations in the latency. Also check the other outputs such as maximum and average latencies to verify that they meet the requirements of the application.

If the histogram and the latency values do not meet the application requirements, reduce the PM QoS per-core latency tolerance value for the critical cores. Repeat this process until the results from cyclictest meet the application requirements.

Tuning the latency by adjusting application idle time

This section explains the tuning of CPU idle state selection by adjusting the interval for which the application goes to idle.

Each CPU idle state has a target residency value associated with it. Entering and exiting a CPU idle state consumes some power. The target residency of a CPU idle state is the amount of time, in microseconds, that the CPU must stay idle in that state to save enough power to justify the power consumed by entering and exiting it. The cpuidle governor in the kernel compares the time the CPU is predicted to stay idle, that is, the time for which no task is expected to run on that CPU, with the target residencies of the different CPU idle states, and picks the deepest state whose target residency is less than the predicted idle time.

You can design your application so that it never idles longer than the interval that would let the governor select CPU idle states whose latency variations exceed the latency tolerance of the application. This method first determines the maximum idle interval at which the latency variations stay within the tolerance threshold.

This can be done by running cyclictest with different values for the “i” (interval) parameter and checking the histogram and latency results. For example, run cyclictest with an interval value of 1000 us and check the results. If they are not acceptable, decrease the value until an acceptable histogram and acceptable latency results are reached.

Tuning example

This example uses an Intel® NUC kit with Intel® Celeron® Processor J3455.


CPU 3 is the critical core running real-time workloads. It is isolated and protected as described above.

At each step we can use turbostat to check the C states entered by each CPU as follows:

$turbostat --debug
Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7
   -       -      74   13.22     561    1498   20123       0    3.67    0.00   83.11    0.00
   0       0      78   13.76     567    1498    5051       0    3.79    0.00   82.46    0.00
   1       1      78   13.56     572    1498    5039       0    3.63    0.00   82.80    0.00
   2       2      75   13.42     559    1498    5030       0    3.70    0.00   82.89    0.00
   3       3      66   12.13     543    1498    5003       0    3.58    0.00   84.29    0.00

Calibrate worst case latency

Set PM QoS resume latency constraint to 0 (“no restrictions”). Run cyclictest with a high interval and capture histogram data in a file.

$cyclictest -a3 -n -q -H1000 -t4 -p80 -i200 -m -D5m --laptop

Generate a graph from the histogram data using any graphing tool, for example, gnuplot.

The following example graph shows very high jitter:


Note the maximum latency from this run; it is used in the strategies to save more power discussed below. It is the worst-case latency the application has to take into account: when its idle times are longer than this value, it can decide to relax the restrictions and save more power.

Calibrate PM QoS resume latency constraint

Try some latency constraint values until a desired jitter level is reached. For the purpose of this demonstration, we will specify 49 us as a latency constraint to PM QoS.

$echo 49 > /sys/devices/system/cpu/cpu3/power/pm_qos_resume_latency_us   
$cyclictest -a3 -n -q -H1000 -t4 -p80 -i200 -m -D5m --laptop 
$echo 0 > /sys/devices/system/cpu/cpu3/power/pm_qos_resume_latency_us (Revert back to "no restriction" when done)

Following is the graph generated from the histogram:


The cyclictest latency results and the histogram show that the latencies hardly vary. Assuming this is an acceptable jitter level for the application, the application should specify the corresponding PM QoS latency constraint value during critical operations and remove the restriction at other times to save more power.

Calibrate idle interval

For the purpose of demonstration, this example sets the maximum sleep time of the workload to 100 us. Only CPU idle states with target residency less than that will be allowed.

Note that the PM QoS latency tolerance value was reverted to “no restriction” for this run.

$cyclictest -a3 -n -q -H1000 -t4 -p80 -i100 -m -D5m --laptop

Following is the graph generated from the histogram:


Latency results and histogram show that the latency impact is reduced compared to the initial run. If this jitter level is acceptable, the application should use the corresponding interval as the maximum time to idle during critical phases.

Conclusion

Using the PM QoS method, the application specifies a PM QoS resume latency constraint that ensures jitter stays within the maximum tolerance level of the application. To save more power, it reverts back to “no restriction” when there are no critical operations to be done.

In the idle interval method, the application uses the calibrated safe idle interval as the maximum period for which it idles at a time during critical phases. If it needs to idle for longer periods, it makes sure that the idle period immediately before the deadline does not exceed the safe idle interval, taking into account the worst-case maximum latency found in the “no restriction” run.

As an example, let us assume the worst case latency is 400 us and the safe idle interval is 100 us. If the application is waiting for 1000 us, it will wake up early enough before reaching the deadline to make sure there is room for the worst-case latency. In this example, it would need to keep a buffer of 400 us before reaching the 1000 us deadline. First it would wait for 600 us (1000 - 400). Once woken, it would check how much time is left to reach the deadline. It will sleep in chunks of 100 us or less for the remaining time to block C states with target residencies higher than the calibrated safe idle interval.

In the same example, if PM QoS method is used, then the application can increase the restriction in PM QoS during the critical phase (400 us before the deadline) and wait for the remaining time.

An application can optimize power saving during long idle times by reducing the restriction and allowing more power to be saved when it can. It can also save power while performing non-critical operations. A combination of the PM QoS and idle interval methods will facilitate different strategies to save power without compromising the application's real-time constraints.
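
The following sketch illustrates the idle interval strategy from the example above. It assumes the calibrated values from that example (a 400 us worst-case latency, a 100 us safe idle interval and a deadline 1000 us away); the helper functions are illustrative, not part of any existing API.

#define _POSIX_C_SOURCE 200809L
#include <time.h>
 
#define NSEC_PER_USEC 1000L
 
/* Calibrated values taken from the example above. */
static const long worst_case_latency_us = 400;
static const long safe_idle_interval_us = 100;
 
static void timespec_add_us(struct timespec *t, long us)
{
    t->tv_nsec += us * NSEC_PER_USEC;
    while (t->tv_nsec >= 1000000000L) { t->tv_nsec -= 1000000000L; t->tv_sec++; }
    while (t->tv_nsec < 0)            { t->tv_nsec += 1000000000L; t->tv_sec--; }
}
 
static void sleep_until(const struct timespec *t)
{
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, t, NULL);
}
 
/* Wait for 'deadline' without ever idling longer than the safe interval
 * once we are inside the worst-case-latency window before the deadline. */
static void wait_for_deadline(struct timespec deadline)
{
    struct timespec now, wake = deadline;
 
    /* Coarse wait: wake up worst_case_latency_us before the deadline. */
    timespec_add_us(&wake, -worst_case_latency_us);
    sleep_until(&wake);
 
    /* Fine wait: idle in chunks no longer than the safe interval, so the
     * governor cannot select C states with higher target residencies. */
    for (;;) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        long remaining_us = (deadline.tv_sec - now.tv_sec) * 1000000L +
                            (deadline.tv_nsec - now.tv_nsec) / NSEC_PER_USEC;
        if (remaining_us <= 0)
            break;
        wake = now;
        timespec_add_us(&wake, remaining_us < safe_idle_interval_us ?
                               remaining_us : safe_idle_interval_us);
        sleep_until(&wake);
    }
}
 
int main(void)
{
    struct timespec deadline;
 
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    timespec_add_us(&deadline, 1000);   /* deadline 1000 us away, as in the example */
    wait_for_deadline(deadline);
    /* ... perform the time-critical work here ... */
    return 0;
}

A combined strategy could additionally tighten the PM QoS constraint for the critical core just before the coarse wake-up and relax it again after the critical work, as shown in the PM QoS examples earlier.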

Strategies for effective power savings considering CPU topology and caching behavior

CPU topology plays an important role in how the processor uses the power saving capabilities of the different C states. Processors have multiple cores, and each core can contain multiple logical CPUs. Each of these groupings has shared resources that can be turned off only when all the processing units in that group reach a certain C state. If one logical CPU in a core could enter a deep C state while other logical CPUs in that core are still running or in a shallower C state, that CPU is held in a less power saving state, because turning off the shared resources would stop the CPUs that are still running. The same applies to package C states: a package can enter a deep C state, allowing package-level components to be turned off, only when all the cores in that package have entered a certain deep C state.

When designing a multi-core real-time application, assign tasks to a cluster of cores that can go idle at the same time. This may require some static configuration and knowledge of processor topology. Tools like turbostat can be used to get an idea of the groupings.
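
For example, the sibling groupings of a core can be inspected through the topology entries in sysfs (the CPU number is illustrative):

$cat /sys/devices/system/cpu/cpu3/topology/thread_siblings_list
$cat /sys/devices/system/cpu/cpu3/topology/core_siblings_list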

Another area to consider is cache optimization. Deeper C states cause caches and TLBs to be flushed. Upon resume, the caches need to be reloaded for optimal performance. This reloading can cause latencies in places where, based on earlier calibrations, they were not expected. It can be avoided by adding logic to the methods described above that repopulates the cache with the critical memory regions. Since the application wakes up from deeper C states ahead of the approaching critical phase, it can touch the memory regions it will need to reference during that phase, forcing them to be reloaded into the cache. This cache repopulating technique can be incorporated into any general cache optimization scheme the real-time application may already use, and it applies not only to C states but to any situation where the cache must be repopulated.
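
As a sketch of this cache repopulating technique, the code below touches a critical buffer at cache-line granularity right after the early wake-up and before the critical phase; the buffer, its size and the 64-byte line size are assumptions for the example.

#include <stddef.h>
#include <stdint.h>
 
#define CACHE_LINE_SIZE 64   /* assumed line size; query the CPU for the real value */
 
/* Touch every cache line of the memory the critical phase will reference,
 * so the lines are reloaded before the deadline rather than during it. */
static void warm_cache(const void *buf, size_t len)
{
    const volatile uint8_t *p = buf;   /* volatile keeps the loads from being optimized away */
 
    for (size_t off = 0; off < len; off += CACHE_LINE_SIZE)
        (void)p[off];
    if (len)
        (void)p[len - 1];              /* make sure the last line is touched too */
}

The application would call warm_cache() on its critical data structures after waking up early, before the critical phase starts, and can combine this with whatever cache optimization scheme it already uses.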
