Differences

This shows you the differences between two versions of the page.

--- realtime:documentation:howto:debugging:cyclictest-smi-ftrace [2018/07/13 08:02]
ebugden
+++ — (current)
@@ Line 1: / Line 1: @@
-====== Cyclictest - SMI detection with function tracing ======
-One way to identify where SMIs happen is to look for inexplicably long gaps in a kernel function trace, as these usually indicate that an abnormally long amount of time was spent handling an SMI. [[realtime:documentation:howto:tools:cyclictest|Cyclictest]] can be used to produce a trace which can then be analyzed to find latencies that could be caused by SMIs. This trace analyzing technique is interesting because it can be used to identify firmware or hardware related latencies on any architecture that supports tracing, not just SMIs on x86 processors.
-This page explains how to correctly produce a function trace using Cyclictest and how to identify SMI related latencies in the trace. Additional details that clarify some of the topics mentioned more briefly on this page can be found in the general documentation about debugging using Cyclictest [[realtime:documentation:howto:tools:cyclictest:debugging|here]].
-===== Importance =====
-Analyzing a function trace that contains the problematic latency, as well as the kernel code that is executing when the latency happens, is an important step in debugging it. Latencies caused by the kernel are much more common than latencies caused by firmware or hardware, so the possibility of a kernel related latency should be eliminated before any further investigation. On the other hand, if a latency is indeed SMI related then knowing the function calls that are made around the SMI can help determine why it is triggered.
-===== Setup =====
-Before tracing the kernel to identify latencies, it is important to make sure that the CPUs will never go idle. When a CPU enters an idle state there can be long periods of time where the processor effectively does nothing. These idle periods show up in function traces as additional latencies. Since the source of these latencies  is already known, having them in the trace does not add information that would help debug other latencies. Instead of having to filter these irrelevant latencies during the trace analysis, it is easier to just avoid having them in the trace in the first place by preventing the CPUs from going idle. Some simple ways of keeping the CPUs busy are described [[realtime:documentation:howto:debugging:no-cpu-idle|here]].
-===== Tracing =====
-Once the CPUs are sure to never idle, a trace can be obtained using Cyclictest with the break trace and the function tracing options. The break trace option (--breaktrace) will stop tracing and end Cyclictest if it detects a latency that is longer than a specified limit. This option is intended to be used with one of the Cyclictest options that enables a specific Ftrace tracer. For this SMI detection technique, the function tracing option (--ftrace) is used in order to include all the kernel function calls in the resulting trace.
-<WRAP center round important 60%>
-Ftrace might not behave the same way after using Cyclictest tracing options. When these options are used, Cyclictest changes some Ftrace options to get the desired trace format. The Ftrace options that were set before are not restored after Cyclictest runs.
-</WRAP>
-Below is an example of a Cyclictest command that includes the break trace and function tracing options. The '#' before the command is to indicate that the command must be run as root, with sudo, or as a member of the realtime group.
-<code bash>
-# cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 --breaktrace=150 --ftrace
-</code>
-The combination of options in this example is not appropriate for every test situation. More details about how to choose the right options for measuring a specific latency on a given system can be found in the Cyclictest documentation [[realtime:documentation:howto:tools:cyclictest:debugging|here]].
-<WRAP center round important 60%>
-Remember to adjust for the tracing overhead when specifying the break trace limit.
-</WRAP>
-Function tracing slows down the execution of the program, so the target latency will most likely be longer after enabling tracing. It is important to adjust for this additional overhead because otherwise the tracing may not break on the correct latency and the trace may not contain the latency that was supposed to be observed. More details about how to adjust the break trace limit can be found in the documentation about debugging using Cyclictest [[realtime:documentation:howto:tools:cyclictest:debugging|here]].
-===== Analysis =====
-Once a trace that includes the correct latency is obtained, it needs to be analyzed critically in order to determine the latencies that could be caused by SMIs. First, the latencies in the trace must be identified. Because the trace contains every function call made by the kernel, the timestamps for consecutive tracepoints should only be a few microseconds apart. In theory, any gap between function calls that is longer than a few microseconds could potentially indicate a problem, but it is generally best to start by finding and analyzing the longest latencies in the trace.
-==== Identifying long latencies ====
-The longer latencies in the trace should have certain special characters next to their timestamps. For example, a latency of more than 50 us is indicated with a '+' and a latency of more than 100 us is indicated with a '!'. In the following trace excerpt there is an exclamation mark next to the timestamp of the second function call because the second and the third native_io_delay calls are more than 100 us apart.
-<code>
-bash-960     0d...111 38067191us : native_io_delay <-pcspkr_event
-bash-960     0d...111 38067200us!: native_io_delay <-pcspkr_event
-bash-960     0d...111 38067346us : native_io_delay <-pcspkr_event
-</code>
-The second call was made at 38067200 us and the third call was at 38067346 us, so there was an unusually long 146 us delay between the two calls. The Ftrace documentation can be consulted to see the characters that indicate even longer latencies.
-==== Identifying SMI related latencies ====
-The next step in the trace analysis is to distinguish the unsuspicious long latencies from the long latencies that could be caused by SMIs. In other words, eliminate all the long latencies that can be explained by something. To do this, analyze the situation in which the latency occurs and have a look at the function code that was executing when the latency happened. Can anything explain the latency? Some possible latency causes could be that the CPU is idle, that uninstrumented userspace code is executing, that the executing kernel function disables interrupts and then does something that takes a long time, etc. If the latency cannot be explained by the situation or by the code and the OS appears to lose the CPU for no reason, then the latency is probably caused by an SMI.

Wiki

User Tools

Site Tools

Differences

Page Tools