The following page proposes a general structure for the latency debugging process, along with a few steps and suggestions that help debug latencies methodically and efficiently.
The steps are not meant to be a procedure that must be followed to the letter. The text below contains general guiding principles, but when it comes to latency debugging the variations and exceptions are innumerable. Because of this, latency debugging requires both creative and critical thinking. It is important to consider whether a particular suggestion is applicable to the latency that is being debugged.
Debugging latencies is also an iterative process, so the following steps can be combined and repeated depending on the latency that is being debugged.
An important step in debugging a latency is understanding how to reproduce it. More specifically, it is valuable to determine the situations in which the latency happens or the conditions in the system that produce the latency.
When just starting the debugging process, the problematic situation can be described at a very high level. For example, the description could be “the latency happens when this button is pressed” or “the latency happens often when all the CPUs are fully loaded”.
While understanding how to reproduce the issue, try to figure out the simplest situation in which the latency is triggered. Reproducing the latency in a simpler situation will make it easier to later isolate its source because there will be fewer possibilities to eliminate. For example, does the latency only trigger when several different tasks are running or does it trigger when only one particular task is running? If an issue happens very infrequently or is only triggered in complex situations, it may not be a good use of time to start by trying to simplify the situation used to trigger the latency. With these kinds of latencies, reproducing the issue in a simple way can require a deeper understanding of the latency's cause. So, when debugging these latencies, it can be best to move on to measuring the latency and isolating the source before attempting to simplify the situation.
To check if the latency is being reproduced, use a tool to measure the length of the latencies occurring on the system. Knowing the approximate length of the latency that is being debugged is important because this information makes it easier to distinguish this latency from the others. Being able to distinguish the latency makes it possible to isolate it. The measuring step is usually done while understanding how to reproduce the latency.
It is important to remember that latencies caused by the same problem will not necessarily be measured as exactly the same length. Their measured values will frequently vary within the range of a few tens of microseconds. This is not only because the source of the latency might not always cause latencies of identical length, but also because the measured length can vary depending on the measuring tool and when the measurement is taken. For example, a latency of 180 us and a latency of 210 us could be caused by the same problem, but a latency of 130 us and a latency of 1200 us are probably not caused by the same problem.
If it is possible, also pay attention to when the latency happens. If the measured latencies are of a similar length and they happen in similar situations, then they are most likely caused by the same issue.
A tool that is frequently used for measuring latencies is Cyclictest. Using Cyclictest correctly can be challenging at first, but if the tool is configured correctly and is run for a sufficient amount of time, then it can provide reasonably accurate measurements for most latencies. The tool does have some limitations which are described in various places in its documentation such as on the Cyclictest test design page.
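To make the measurement principle concrete, here is a minimal sketch in C of the kind of loop Cyclictest is built around (not Cyclictest's actual implementation): a thread running at a real-time priority repeatedly sleeps until an absolute deadline and records how much later than that deadline it actually woke up. The one-millisecond interval, the priority value of 80, and the iteration count are illustrative choices for this sketch only.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define INTERVAL_NS 1000000L   /* 1 ms measurement period (illustrative) */

/* Advance a timespec by ns nanoseconds (ns assumed to be below one second). */
static struct timespec add_ns(struct timespec t, long ns)
{
    t.tv_nsec += ns;
    if (t.tv_nsec >= 1000000000L) {
        t.tv_nsec -= 1000000000L;
        t.tv_sec += 1;
    }
    return t;
}

int main(void)
{
    struct sched_param sp = { .sched_priority = 80 };  /* illustrative RT priority */
    struct timespec next, now;
    int64_t max_ns = 0;

    /* Run at a real-time priority so that ordinary tasks do not dominate
     * the measurement (requires root or CAP_SYS_NICE). */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < 100000; i++) {
        next = add_ns(next, INTERVAL_NS);

        /* Sleep until the absolute deadline, then check how late we are. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);

        int64_t late = (now.tv_sec - next.tv_sec) * 1000000000LL
                     + (now.tv_nsec - next.tv_nsec);
        if (late > max_ns) {
            max_ns = late;
            printf("new max wake-up latency: %lld ns\n", (long long)max_ns);
        }
    }
    return 0;
}
```

Like Cyclictest, a loop of this kind only reports the wake-up latency it experiences itself; it says nothing about where the delay comes from, which is what the tracing steps described below are for.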
After measuring the latency's length and exploring what triggers it, the next step is to narrow down the possible sources of the latency. The goal of this step is to identify the specific area in which the latency originates, such as a specific function.
A good way to isolate the source of a latency is to use tracing. Tracing collects information about the execution of a program in traces. These traces can be more or less detailed. For example, the trace can be very detailed and include all of the function calls that were made by the operating system, or it can be less detailed and only indicate the times that interrupts were disabled. Usually, tracing can only see what the operating system is doing, but if the userspace tasks running on a system have been instrumented appropriately then they can also be traced.
When using tracing to isolate the source of a latency, it is generally best to start with less detailed traces and to gradually increase the level of detail over several iterations. The source of the latency can be isolated more precisely after each iteration. There are a few reasons for gradually increasing the detail in the traces. First of all, starting with simpler traces makes it easier to understand what is happening at a more abstract level and to eliminate the areas that are not the source of the latency. Additionally, as the level of detail in the trace is increased, the tracing overhead will also increase. A higher overhead means that the tracing has more of a chance of changing the system's observed behavior. For example, when using function tracing to analyze certain latencies, the overhead may have a large enough effect that the latency may not be observed at all.
While isolating the latency, keep in mind that problems in the application or in the operating system are more common than problems in the firmware or the hardware. It is best to explore the most likely possibilities first. Start by eliminating obvious possibilities, such as disabling interrupts or preemption for too long. Afterwards, switch to enabling a few tracepoints to start building a high-level understanding of what is happening in the system. For example, trace all the schedule switches that happen on a system and try to determine which task is running when the latency happens. After a few iterations of tracing, it may even be possible to further simplify the situation that reproduces the latency. This will facilitate the analysis in the later tracing iterations.
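As a rough illustration of this progression, the sketch below drives two early tracing iterations through the tracefs control files. The first pass uses the irqsoff tracer, which only records the longest sections that run with interrupts disabled; the second pass disables that tracer and enables just the sched_switch tracepoint to see which tasks are running around the latency. The tracefs_write helper is made up for the example, the mount point is assumed to be /sys/kernel/tracing (older kernels expose it under /sys/kernel/debug/tracing), the irqsoff tracer is only present if the kernel was configured with it, and the program must run with sufficient privileges to write to tracefs.

```c
#include <stdio.h>

/* Write a short value to a tracefs control file.  The mount point is an
 * assumption for this sketch; adjust it to match the system. */
static int tracefs_write(const char *file, const char *value)
{
    char path[256];
    FILE *f;
    int ok;

    snprintf(path, sizeof(path), "/sys/kernel/tracing/%s", file);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    ok = (fputs(value, f) != EOF);
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        perror(path);
        return -1;
    }
    return 0;
}

int main(void)
{
    /* Iteration 1: check for overly long sections with interrupts disabled. */
    tracefs_write("current_tracer", "irqsoff");
    tracefs_write("tracing_on", "1");
    /* ...reproduce the latency, then read "trace" and "tracing_max_latency"... */

    /* Iteration 2: still low overhead, but more context - trace only the
     * schedule switches to see which task runs when the latency happens. */
    tracefs_write("tracing_on", "0");
    tracefs_write("current_tracer", "nop");
    tracefs_write("events/sched/sched_switch/enable", "1");
    tracefs_write("tracing_on", "1");
    /* ...reproduce the latency again, then read the "trace" file... */

    return 0;
}
```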
Because the tracing overhead can affect the behavior that is observed, it is important to make sure that the same latency is being triggered and analyzed each time tracing is performed. Make sure that the tracing iterations are allowing the latency to be isolated and not simply changing what latencies are happening.
Having irrelevant information in the trace makes it more difficult to read, so it is good practice to stop the tracing immediately after the latency occurs. This prevents the trace from becoming longer than it needs to be.
One way to do this is by using Cyclictest. Cyclictest has an option that causes tracing to stop after a specified latency limit is exceeded. There are instructions in the documentation about how to adjust the latency detection limit as a function of the tracing overhead so that tracing is stopped at the correct time. Additionally, several different types of Ftrace tracers can be used through Cyclictest.
Ftrace can also be used by writing to the files in tracefs that are used to control the tracing (e.g. current_tracer, tracing_on). The writes to the files can be added directly to the programs that are running on the system.
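As a minimal sketch of that approach, and assuming tracefs is mounted at /sys/kernel/tracing (older kernels use /sys/kernel/debug/tracing), a function like the one below could be called by the application as soon as it notices a missed deadline: it writes a note into trace_marker and then clears tracing_on so the trace ends exactly where the latency occurred. The function name and the call site are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Stop Ftrace as soon as the application detects the latency so that the
 * trace ends right where the problem occurred.  The tracefs path is an
 * assumption for this sketch. */
static void stop_tracing(const char *reason)
{
    int fd;

    /* Leave a marker in the trace explaining why it was stopped. */
    fd = open("/sys/kernel/tracing/trace_marker", O_WRONLY);
    if (fd >= 0) {
        dprintf(fd, "latency detected: %s\n", reason);
        close(fd);
    }

    /* Turn tracing off; the collected trace remains readable in "trace". */
    fd = open("/sys/kernel/tracing/tracing_on", O_WRONLY);
    if (fd >= 0) {
        if (write(fd, "0", 1) != 1)
            perror("tracing_on");
        close(fd);
    }
}

int main(void)
{
    /* Hypothetical call site: in a real application this would sit in the
     * periodic loop and fire when a measured wake-up time exceeds the
     * chosen latency limit. */
    stop_tracing("cycle overran its deadline");
    return 0;
}
```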
Now that a more specific section of code or a function has been identified as the source of the latency, it is time to look at the code and figure out why the latency occurs. The difficulty of this step in the debugging process can vary greatly. Sometimes it is very easy to identify the cause of a latency, such as when the code explicitly disables interrupts for a period that is clearly too long, and sometimes it can be virtually impossible to identify the cause without the help of the system vendor.
Once again, it is best to eliminate the most obvious possible causes before moving on to the more complex ones. Bugs in the application or in the operating system are much more common than bugs in the firmware and in the hardware. So, start by confirming that interrupts and preemption are not disabled for too long and then explore other possibilities if necessary. Latencies can have many possible causes: the task could be waiting for a resource, waiting for a lock, waiting for a device, and so on.
If after looking at the code there does not seem to be anything that explains why the latency happens, then the latency could be caused by the firmware or the hardware. The documentation about identifying SMI latencies with function tracing can help confirm whether this is the case. If the latency is indeed caused by the firmware or the hardware, then determining exactly why it is happening can become extremely difficult, because there is often very little documentation available about the behavior of these parts of a system.
The appropriate solution for fixing a latency really depends on the situation and the system. The general principle that applies when minimizing latencies is that the appropriate task must be able to run when it needs to run. This means that interrupts or preemption should be disabled only when necessary and for as little time as possible. It also means that issues like priority inversion should be avoided so that low priority tasks do not block a higher priority task from executing. It is also important to assign task priorities correctly. In general, tasks that have shorter periods or more strict deadlines should be given higher priority.
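What the right priorities are depends entirely on the task set, but as a small sketch of these last two points, the code below gives a shorter-period thread a higher SCHED_FIFO priority than a longer-period one and protects their shared data with a priority-inheritance mutex, so that a low-priority holder of the lock is boosted instead of keeping a high-priority task waiting behind unrelated work. The priority values, the periods mentioned in the comments, and the thread bodies are placeholders, and creating SCHED_FIFO threads requires the appropriate privileges.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t shared_lock;

static void *short_period_task(void *arg)
{
    (void)arg;
    /* ...placeholder for a short-period (e.g. 1 ms) control loop that
     * briefly takes shared_lock in each cycle... */
    pthread_mutex_lock(&shared_lock);
    pthread_mutex_unlock(&shared_lock);
    return NULL;
}

static void *long_period_task(void *arg)
{
    (void)arg;
    /* ...placeholder for a longer-period (e.g. 100 ms) housekeeping loop
     * that also takes shared_lock... */
    pthread_mutex_lock(&shared_lock);
    pthread_mutex_unlock(&shared_lock);
    return NULL;
}

/* Create a thread with an explicit SCHED_FIFO priority.  Fails with EPERM
 * if the process lacks real-time scheduling privileges. */
static int create_rt_thread(pthread_t *t, void *(*fn)(void *), int prio)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    return pthread_create(t, &attr, fn, NULL);
}

int main(void)
{
    pthread_mutexattr_t ma;
    pthread_t short_task, long_task;

    /* Priority inheritance: a lower-priority task holding the lock is
     * temporarily boosted, so it cannot keep a higher-priority task
     * waiting behind unrelated medium-priority work (priority inversion). */
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&shared_lock, &ma);

    /* Shorter period, stricter deadline: higher priority.  The values 80
     * and 40 are purely illustrative. */
    if (create_rt_thread(&short_task, short_period_task, 80) != 0 ||
        create_rt_thread(&long_task, long_period_task, 40) != 0) {
        fprintf(stderr, "failed to create real-time threads\n");
        return 1;
    }

    pthread_join(short_task, NULL);
    pthread_join(long_task, NULL);
    return 0;
}
```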
Fixing hardware and firmware issues can be very challenging. Sometimes it is possible to resolve them by changing something in the kernel configuration or in the firmware configuration. However, when there is very little documentation and it is not quite clear exactly why the latency is occurring, this can be difficult. The Internet can sometimes help with identifying the problem, but eventually it could be appropriate to ask the system vendor.
After applying the supposed fix, test the system in the original conditions that caused the latency. If the latency does not occur, then this suggests that the latency has been fixed, or at least that it occurs much less frequently under those conditions. However, if the latency is still observed, then it could mean that the problem was correctly identified but that the solution is wrong. It could also mean that a different latency was resolved because the tracing overhead changed the behavior of the system.
More information