User Tools

Site Tools


realtime:documentation:howto:debugging:cyclictest-smi-ftrace

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Next revision Both sides next revision
realtime:documentation:howto:debugging:cyclictest-smi-ftrace [2018/07/13 08:02]
ebugden
realtime:documentation:howto:debugging:cyclictest-smi-ftrace [2018/07/16 10:54]
ebugden Clarify warning message about trace overhead
Line 15: Line 15:
 ===== Tracing ===== ===== Tracing =====
  
-Once the CPUs are sure to never idle, a trace can be obtained using Cyclictest with the break trace and the function tracing options. The break trace option (--breaktrace) will stop tracing and end Cyclictest if it detects a latency that is longer than a specified limit. This option is intended to be used with one of the Cyclictest options that enables a specific Ftrace tracer. For this SMI detection technique, the function tracing option (--ftrace) is used in order to include ​all the kernel function calls in the resulting trace.+Once the CPUs are sure to never idle, a trace can be obtained using Cyclictest with the break trace and the function tracing options. The break trace option (--breaktrace) will stop tracing and end Cyclictest if it detects a latency that is longer than a specified limit. This option is intended to be used with one of the Cyclictest options that enables a specific Ftrace tracer. For this SMI detection technique, the function tracing option (--ftrace) is used so that the trace includes ​all the kernel function calls.
  
 <WRAP center round important 60%> <WRAP center round important 60%>
Line 30: Line 30:
  
 <WRAP center round important 60%> <WRAP center round important 60%>
-Remember to adjust for the tracing overhead when specifying the break trace limit.+Remember to adjust for the tracing overhead when specifying the Cyclictest ​break trace limit.
 </​WRAP>​ </​WRAP>​
  
Line 39: Line 39:
 Once a trace that includes the correct latency is obtained, it needs to be analyzed critically in order to determine the latencies that could be caused by SMIs. First, the latencies in the trace must be identified. Because the trace contains every function call made by the kernel, the timestamps for consecutive tracepoints should only be a few microseconds apart. In theory, any gap between function calls that is longer than a few microseconds could potentially indicate a problem, but it is generally best to start by finding and analyzing the longest latencies in the trace. Once a trace that includes the correct latency is obtained, it needs to be analyzed critically in order to determine the latencies that could be caused by SMIs. First, the latencies in the trace must be identified. Because the trace contains every function call made by the kernel, the timestamps for consecutive tracepoints should only be a few microseconds apart. In theory, any gap between function calls that is longer than a few microseconds could potentially indicate a problem, but it is generally best to start by finding and analyzing the longest latencies in the trace.
  
-==== Identifying long latencies ====+==== Identifying ​all long latencies ====
  
 The longer latencies in the trace should have certain special characters next to their timestamps. For example, a latency of more than 50 us is indicated with a '​+'​ and a latency of more than 100 us is indicated with a '​!'​. In the following trace excerpt there is an exclamation mark next to the timestamp of the second function call because the second and the third native_io_delay calls are more than 100 us apart. The longer latencies in the trace should have certain special characters next to their timestamps. For example, a latency of more than 50 us is indicated with a '​+'​ and a latency of more than 100 us is indicated with a '​!'​. In the following trace excerpt there is an exclamation mark next to the timestamp of the second function call because the second and the third native_io_delay calls are more than 100 us apart.
Line 51: Line 51:
 The second call was made at 38067200 us and the third call was at 38067346 us, so there was an unusually long 146 us delay between the two calls. The Ftrace documentation can be consulted to see the characters that indicate even longer latencies. The second call was made at 38067200 us and the third call was at 38067346 us, so there was an unusually long 146 us delay between the two calls. The Ftrace documentation can be consulted to see the characters that indicate even longer latencies.
  
-==== Identifying ​SMI related latencies ====+==== Isolating ​SMI related latencies ====
  
 The next step in the trace analysis is to distinguish the unsuspicious long latencies from the long latencies that could be caused by SMIs. In other words, eliminate all the long latencies that can be explained by something. To do this, analyze the situation in which the latency occurs and have a look at the function code that was executing when the latency happened. Can anything explain the latency? Some possible latency causes could be that the CPU is idle, that uninstrumented userspace code is executing, that the executing kernel function disables interrupts and then does something that takes a long time, etc. If the latency cannot be explained by the situation or by the code and the OS appears to lose the CPU for no reason, then the latency is probably caused by an SMI.    The next step in the trace analysis is to distinguish the unsuspicious long latencies from the long latencies that could be caused by SMIs. In other words, eliminate all the long latencies that can be explained by something. To do this, analyze the situation in which the latency occurs and have a look at the function code that was executing when the latency happened. Can anything explain the latency? Some possible latency causes could be that the CPU is idle, that uninstrumented userspace code is executing, that the executing kernel function disables interrupts and then does something that takes a long time, etc. If the latency cannot be explained by the situation or by the code and the OS appears to lose the CPU for no reason, then the latency is probably caused by an SMI.