realtime:documentation:howto:debugging:debug-steps

Differences

This shows you the differences between two versions of the page.

realtime:documentation:howto:debugging:debug-steps [2018/08/21 08:50]
ebugden Change page name from debugging/steps
realtime:documentation:howto:debugging:debug-steps [2023/10/03 05:39] (current)
costa.shul Latency detection tools
Line 1: Line 1:
 ====== Steps for Debugging Latencies ======
  
-The following page proposes a general structure for the latency debugging process. It proposes a couple steps and suggestions that help debug latencies methodically and efficiently.
+The following page proposes a general structure for the latency debugging process. It proposes a few steps and suggestions that help debug latencies methodically and efficiently.
  
 The steps are not meant to be a procedure that must be followed to the letter. The text below contains general guiding principles, but when it comes to latency debugging the variations and exceptions are innumerable. Because of this, latency debugging requires both creative and critical thinking. It is important to think about whether or not a particular suggestion is applicable for the latency that is being debugged.
Line 23: Line 23:
 If it is possible, also pay attention to when the latency happens. If the measured latencies are of a similar length and they happen in similar situations, then they are most likely caused by the same issue.
  
-A tool that is frequently used for measuring latencies is [[realtime:documentation:howto:debugging:cyclictest:start|Cyclictest]]. Using Cyclictest correctly can be challenging at first, but if the tool is configured correctly and is run for a sufficient amount of time, then it can provide reasonably accurate measurements for most latencies. The tool does have some limitations which are described in various places in its documentation such as on the Cyclictest [[realtime:documentation:howto:debugging:cyclictest:test-design:start|test design]] page.
+A tool that is frequently used for measuring latencies is [[realtime:documentation:howto:tools:cyclictest:start|Cyclictest]]. Using Cyclictest correctly can be challenging at first, but if the tool is configured correctly and is run for a sufficient amount of time, then it can provide reasonably accurate measurements for most latencies. The tool does have some limitations which are described in various places in its documentation such as on the Cyclictest [[realtime:documentation:howto:tools:cyclictest:test-design|test design]] page.
  
 ===== Isolate the source =====
Line 43: Line 43:
 Having irrelevant information in the trace makes it more difficult to read, so it is good practice to stop the tracing immediately after the latency occurs. This prevents the trace from becoming longer than it needs to be.
  
-One way to do this is by using [[realtime:documentation:howto:debugging:cyclictest:start|Cyclictest]]. Cyclictest has an option that causes tracing to stop after a certain specified latency limit is exceeded. There are instructions in the documentation about how to adjust the latency detection limit as a function of the tracing overhead so that tracing is stopped at the correct time. Additionally, several different types of Ftrace tracers can be used through Cyclictest.
+One way to do this is by using [[realtime:documentation:howto:tools:cyclictest:tracing|Cyclictest]]. Cyclictest has an option that causes tracing to stop after a certain specified latency limit is exceeded. There are instructions in the documentation about how to adjust the latency detection limit as a function of the tracing overhead so that tracing is stopped at the correct time. Additionally, several different types of Ftrace tracers can be used through Cyclictest.
  
 Ftrace can also be used by writing to the files in tracefs that are used to control the tracing (e.g. current_tracer, tracing_on). The writes to the files can be added directly to the programs that are running on the system.
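
As a rough illustration of controlling Ftrace from inside a program, the sketch below writes to the tracefs control files named above and turns tracing off as soon as a measured latency exceeds a limit. It is only a minimal sketch under several assumptions: the helper names (write_control, start_tracing, stop_if_late) are hypothetical, the tracefs mount point is assumed to be /sys/kernel/tracing (some systems expose it under /sys/kernel/debug/tracing instead), the "function" tracer is just one possible choice, and the program is assumed to have permission to write to these files.

```python
import os

# Assumed tracefs mount point; adjust for the system under test.
TRACEFS = "/sys/kernel/tracing"


def write_control(tracefs, name, value):
    """Write a value to one tracefs control file (e.g. current_tracer)."""
    with open(os.path.join(tracefs, name), "w") as f:
        f.write(value)


def start_tracing(tracefs, tracer="function"):
    """Select a tracer and switch tracing on."""
    write_control(tracefs, "current_tracer", tracer)
    write_control(tracefs, "tracing_on", "1")


def stop_if_late(tracefs, elapsed_us, limit_us):
    """Switch tracing off if the measured latency exceeds the limit.

    Returns True when tracing was stopped, False otherwise.
    """
    if elapsed_us > limit_us:
        write_control(tracefs, "tracing_on", "0")
        return True
    return False
```

A program could call start_tracing(TRACEFS) once during setup and then call stop_if_late(TRACEFS, measured_us, limit_us) after each timed operation, so that the trace ends close to the latency of interest.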
Line 53: Line 53:
 Once again, it is best to eliminate the most obvious possible causes before moving on to the more complex possible causes. Bugs in the application or in the operating system are much more common than bugs in the firmware and in the hardware. So, start by confirming that interrupts and preemption are not disabled for too long and then explore other possibilities if necessary. Latencies can be caused by so many things. The task could be waiting for a resource, waiting for a lock, waiting for a device, etc.
  
-If after looking at the code there does not seem to be anything that explains why the latency happens, then the latency could be caused by the firmware or the hardware. The documentation about identifying [[realtime:documentation:howto:debugging:cyclictest-smi-ftrace|SMI latencies]] with function tracing can help confirm if this is the case. If the latency is indeed caused by the firmware or the hardware, then determining exactly why the latency is happening can become extremely difficult. This is because there is often very little documentation available about the behavior of these parts of a system so it is sometimes challenging to understand exactly what causes the latency.
+If after looking at the code there does not seem to be anything that explains why the latency happens, then the latency could be caused by the firmware or the hardware. The documentation about identifying [[realtime:documentation:howto:debugging:smi-latency:cyclictest-tracing|SMI latencies]] with function tracing can help confirm if this is the case. If the latency is indeed caused by the firmware or the hardware, then determining exactly why the latency is happening can become extremely difficult. This is because there is often very little documentation available about the behavior of these parts of a system so it is sometimes challenging to understand exactly what causes the latency.
  
 ===== Fix the problem =====
Line 64: Line 64:
  
 After applying the supposed fix, test the system in the original conditions that caused the latency. If the latency does not occur, then this confirms that the latency has been fixed, or at least that it does not occur under those conditions as frequently. However, if the latency is still observed, then it could mean that the problem was correctly identified but that the solution is wrong. It could also mean that a different latency was resolved because the tracing overhead changed the behavior of the system.
+
+
+More information
+  * [[realtime:documentation:howto:tools:start#latency_detection|Latency detection tools]]
realtime/documentation/howto/debugging/debug-steps.1534841410.txt.gz · Last modified: 2018/08/21 08:50 by ebugden