Using rtla with perf processor trace

Using rtla with perf processor trace

The rtla tool allows extensive tracing using the capabilities of tracefs subsystem, specifically, tracepoints. Those can be used to track various latency-related events, including IRQs, thread interference, scheduling delays, IPIs, and so on.

Tracepoints trace specific points inside the Linux kernel, and as such, they fail to reveal the latency source in two cases:

latencies outside the kernel (firmware-side NMIs, SMIs)
latencies coming from unpredictable, or untraceable parts of the Linux kernel (e.g. contention in early IRQ code)

On some CPUs, namely some models of Intel x86 and ARM architectures, a feature called processor trace is available. This feature allows one to trace, without extra overhead, all instructions that are executed in a time period, and collect them into a ringbuffer.

On Linux, this feature is supported by perf's Intel Processor Trace and CoreSight support. Using actions, rtla is able to call perf and capture all instructions that were executed during a latency measurement cycle when tracing is stopped.

Prerequisites

To use rtla with processor trace, you need:

rtla version 6.17 or higher
Linux kernel version 6.17 or higher, with BPF enabled
perf tool with Intel Processor Trace support

Currently, this guide only covers processor trace on Intel x86 platforms. This method is also currently only supported with the timerlat tool.

Measurement

To start a rtla measurement with processor trace, rtla is run as a child process under perf record. For rtla, sending SIGUSR2 to its parent process (that is, perf) on threshold overflow is enabled; for perf, Intel PT is enabled, together with perf aux buffer snapshot on signal.

This makes perf snapshot the ringbuffer at the moment of a latency spike, with a short delay: the kernel has to wake up rtla, then rtla signals perf, and perf does the snapshot.

The general template is as follows:

$ perf record --no-switch-events \
              --kcore -m 1G,1M \
              -e intel_pt/cyc,noretcomp/k \
              -e osnoise:timerlat_sample --filter 'timer_latency >= <threshold * 1000>' \
              -S --cpu <cpulist> -- \
              rtla timerlat <tool> -c <cpulist> --warm-up 5 \
                                   -a <threshold> --on-threshold signal,num=12,pid=parent

–no-switch-events is used to disable collecting switch events, which can make the perf data very large if the measurement lasts for several hours. -e osnoise:timerlat_sample is passed to perf so that the latency from rtla data can be matched with perf data. –warm-up 5 is used to make rtla wait 5 seconds before processing samples to make sure all measurement is seen by perf.

–filter <threshold * 1000> is used to filter only for timerlat samples that are over the threshold, that is, the spike and anything that is captured after the spike. Without this, there is a risk of the Intel PT data overwriting the timerlat sample entries.

Based on the number of cores you are running the measurement on, you might need to tweak the -m option argument: the first number is the size of the perf buffer, the second one is the size of the perf auxilliary buffer. If the auxilliary buffer is too big compared to the perf buffer, it will overwrite the osnoise:timerlat_sample entries, and the results will be more difficult to interpret.

See Intel Processor Trace manual at Perf Wiki for more information about Intel PT options.

Analysis

Once rtla detects a spike, it will stop the measurement and print out results from auto-analysis. At the same time, perf will capture auxiliary buffer data that contains entries from processor trace. The results include the CPU or CPUs on which the spike was measured:

  Saving trace to timerlat_trace.txt
...
rtla timerlat hit stop tracing
## CPU 10 hit stop tracing, analyzing it ##
  IRQ handler delay:                  (exit from idle)      43.27 us (37.89 %)
  IRQ latency:                                              53.43 us
  Timerlat IRQ duration:                                    36.33 us (31.82 %)
  Blocking thread:                                           9.85 us (8.62 %)
                      swapper/10:0                           9.85 us
    Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> __sysvec_apic_timer_interrupt
                -> sysvec_apic_timer_interrupt
                -> asm_sysvec_apic_timer_interrupt
                -> cpuidle_enter_state
                -> cpuidle_enter
                -> cpuidle_idle_call
                -> do_idle
                -> cpu_startup_entry
                -> start_secondary
                -> common_startup_64
  Softirq interference                                       3.07 us (2.69 %)
                     SCHED:7                                 3.07 us
------------------------------------------------------------------------
  Thread latency:                                          114.19 us (100%)
 
Max timerlat IRQ latency from idle: 53.43 us in cpu 10
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 3.259 MB perf.data ]

By examining timerlat_trace.txt, the sequence numbers of the spike in thread context (context=1; grep for context=0 for IRQ context and context=2 for user context) and the measurement before it can be extracted:

$ grep -F '[010]' timerlat_trace.txt | grep 'timerlat_sample:.*context=1' | tail -n1
    timerlatu/10-16522   [010] ....1 17457.915257: timerlat_sample: timer_latency=114189 seqnum=4995 context=1

From this, matching timestamp from perf data may be determined:

$ perf script --cpu 10 | grep 'seqnum=4995 context=1'
          :16522   16522 [010] 17457.331712:  osnoise:timerlat_sample: timer_latency=114189 seqnum=4995 context=1

Since the latency was 114 us, and the period 1000 us, the expected start of the timer is 17457.329484. perf script can now be used to examine everything that was happening on CPU 10 during the time between the start of the timer and its expiration:

$ perf script --cpu 10 --time 17457.329484,17457.331713
         swapper       0 [010] 17457.329672:                      cbr:  cbr:  8 freq:  798 MHz ( 38%)     ffffffff89864216 intel_idle_ibrs+0xb6 ([kernel.kallsyms])
         swapper       0 [010] 17457.329672:          1               branches:k:  ffffffff8986342d cpuidle_enter_state+0x26d ([kernel.kallsyms]) => ffffffff89862b70 ct_idle_exit+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329672:          1               branches:k:  ffffffff89862b7d ct_idle_exit+0xd ([kernel.kallsyms]) => ffffffff898629c0 ct_kernel_enter.isra.0+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329672:          1               branches:k:  ffffffff898629f8 ct_kernel_enter.isra.0+0x38 ([kernel.kallsyms]) => ffffffff89862930 ct_kernel_exit_state+0x0 ([kernel.kallsyms])
...
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88c3dcca __hrtimer_run_queues+0x12a ([kernel.kallsyms]) => ffffffff8986ed80 _raw_spin_unlock_irqrestore+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88c3dcd7 __hrtimer_run_queues+0x137 ([kernel.kallsyms]) => ffffffff88ce5710 timerlat_irq+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88ce5759 timerlat_irq+0x49 ([kernel.kallsyms]) => ffffffff88c413d0 ktime_get+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88c41408 ktime_get+0x38 ([kernel.kallsyms]) => ffffffff88aabc90 read_tsc+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88ce5765 timerlat_irq+0x55 ([kernel.kallsyms]) => ffffffff88cb6400 trace_clock_local+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88cb640b trace_clock_local+0xb ([kernel.kallsyms]) => ffffffff88aacb20 sched_clock+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88aacb2b sched_clock+0xb ([kernel.kallsyms]) => ffffffff8985fa30 sched_clock_noinstr+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88ce57c6 timerlat_irq+0xb6 ([kernel.kallsyms]) => ffffffff88ce4600 record_timerlat_sample+0x0 ([kernel.kallsyms])
         swapper       0 [010] 17457.329678:          1               branches:k:  ffffffff88ce46f9 record_timerlat_sample+0xf9 ([kernel.kallsyms]) => ffffffff88ce1b00 __traceiter_timerlat_sample+0x0 ([kernel.kallsyms])
...
          :16522   16522 [010] 17457.331711:          1               branches:k:  ffffffff88ce484b timerlat_fd_read+0x11b ([kernel.kallsyms]) => ffffffff88ce4600 record_timerlat_sample+0x0 ([kernel.kallsyms])
          :16522   16522 [010] 17457.331711:          1               branches:k:  ffffffff88ce46f9 record_timerlat_sample+0xf9 ([kernel.kallsyms]) => ffffffff88ce1b00 __traceiter_timerlat_sample+0x0 ([kernel.kallsyms])
          :16522   16522 [010] 17457.331711:          1               branches:k:  ffffffff88ce1b23 __traceiter_timerlat_sample+0x23 ([kernel.kallsyms]) => ffffffff88ce1f40 perf_trace_timerlat_sample+0x0 ([kernel.kallsyms])
          :16522   16522 [010] 17457.331711:          1               branches:k:  ffffffff88ce1f99 perf_trace_timerlat_sample+0x59 ([kernel.kallsyms]) => ffffffff88cf4a30 perf_trace_buf_alloc+0x0 ([kernel.kallsyms])
          :16522   16522 [010] 17457.331712:  osnoise:timerlat_sample: timer_latency=114189 seqnum=4995 context=1
...

Synchronization points to look for in the output are timerlat_irq, timerlat_fd_read (in case of user mode RTLA), and record_timerlat_sample. In this particular example, it can also be seen that the CPU was only running with 38% of frequency. This is a consequence of idle states, and can be mitigated with –dma-latency 0 (see System tuning section of Real-time Linux Analysis tool).

Table of Contents

Using rtla with perf processor trace

Prerequisites

Measurement

Analysis