Linux Foundation Wiki

Meeting Notes

March 25, 2015

Present:

  • Srikar Dronamraju
  • Deirdré Straughan
  • Brendan Gregg
  • Jérémie Galarneau
  • Mathieu Desnoyers
  • Antoine Busque
  • Jonathan Rajotte
  • Masami Hiramatsu
  • Bernd Huffmann
  • Matthew Kouzam
  • Petru Lauric
  • Razvan Ionescu
  • Andi Kleen
  • Rudolph Zimmel
  • Dominique Toupin

News

Tracing Summit

Mathieu Desnoyers has contacted the Linux Foundation to settle the date and location of the 2015 edition of the Tracing Summit.

The 2015 edition of the Tracing Summit will be held on August 20th in Seattle, a day after LinuxCon.

The registration fee is $60. The CFP will be sent very soon.

Babeltrace is now a DiaMon Workgroup project

Babeltrace has been brought into the DiaMon Workgroup. The Linux Foundation provides bugzilla and git hosting.

Official announcement

Brendan Gregg's presentation on industry use-cases

Background about tracing at Netflix

Brendan Gregg has posted a list of features he'd like to see in tracing tools to the DiaMon Discuss mailing list.

https://lists.linuxfoundation.org/pipermail/diamon-discuss/2015-March/000047.html

Although tracers might provide low-level information, it's important to understand that performance gains of 1-2% have a significant impact at Netflix's scale.

Given that things change so fast, the low-hanging fruit for Netflix has typically been spotting application mistakes.

Netflix runs a micro-service infrastructure. Currently, developers identify the performance metrics they think are relevant, and these are then tracked by Netflix's internal performance tools.

New kernel versions introduce new tunable parameters or new defaults which the performance team has to track down and debug constantly. They typically use Ftrace and perf events to diagnose these problems. Brendan has also been toying around with eBPF.

One thing he is missing right now is syscall rates (for all syscalls, not just reads and writes).

Sysdig seems to offer this, but its overhead prevented him from deploying it in production. He wants facilities to find event time spans, frequency distributions, counts, etc.
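As a rough illustration of the counts and rates mentioned above, here is a minimal Python sketch. It assumes syscall events have already been collected as (timestamp in ns, syscall name) pairs; the event layout, the function names and the sample values are hypothetical and do not come from any specific tracer.

```python
from collections import Counter

NS_PER_SEC = 1_000_000_000


def syscall_counts(events):
    """Count events per syscall name for (timestamp_ns, name) pairs."""
    return Counter(name for _timestamp_ns, name in events)


def syscall_rates(events, window_ns):
    """Return calls per second for each syscall over a window of window_ns."""
    if window_ns <= 0:
        raise ValueError("window_ns must be positive")
    counts = syscall_counts(events)
    return {name: n * NS_PER_SEC / window_ns for name, n in counts.items()}


# Hypothetical sample covering a 2-second observation window.
sample = [
    (0, "read"),
    (500_000_000, "futex"),
    (1_000_000_000, "read"),
    (1_500_000_000, "openat"),
]
rates = syscall_rates(sample, window_ns=2 * NS_PER_SEC)
```

In a production setting the counting would have to happen close to the event source to keep overhead low; this sketch only shows the post-processing step.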

Ideally, on-demand instrumentation would be tied into the analysis.

Another desired feature is latency heatmaps in which you could drill down from the syscall to the driver or block-layer level.
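A latency heatmap is essentially a two-dimensional histogram: time on one axis, latency on the other, and the number of events per cell as the color. The sketch below assumes samples are available as (timestamp in ns, latency in ns, layer) tuples; the layer tag is a hypothetical way to support drilling down from the syscall level to the block or driver level, not a feature of an existing tool.

```python
from collections import Counter


def latency_heatmap(samples, time_step_ns, latency_step_ns):
    """Bucket (timestamp_ns, latency_ns, layer) samples into a sparse heatmap.

    Returns a Counter keyed by (layer, time_bucket, latency_bucket).
    """
    if time_step_ns <= 0 or latency_step_ns <= 0:
        raise ValueError("bucket sizes must be positive")
    heat = Counter()
    for timestamp_ns, latency_ns, layer in samples:
        cell = (layer, timestamp_ns // time_step_ns, latency_ns // latency_step_ns)
        heat[cell] += 1
    return heat


# Hypothetical samples: two fast syscalls, one slow syscall, one block-layer event.
samples = [
    (100, 40, "syscall"),
    (900, 45, "syscall"),
    (1200, 40, "block"),
    (1500, 260, "syscall"),
]
heat = latency_heatmap(samples, time_step_ns=1000, latency_step_ns=100)
```

Filtering the keys on a given layer yields the heatmap for that level only, which is one possible way to implement the drill-down.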

Debug information management

Given Netflix's requirements, it is not possible to deploy all debug information on targets as provisioning times must remain as low as possible. This information can account for hundreds of megabytes.

Andi Kleen asked if it would be possible to share the debug info on a central server. Brendan answered that, in his experience, this never works. These solutions have proven non-trivial to set up and maintain: too much setup work for use cases which happen “once in a blue moon”.

Andi proposed exploring ways to make the debug info smaller. Brendan mentioned that a couple of megabytes would be fine. He had success gzip'ing debug info. We need a stripped-down “tracing debug info” which only provides the essential information needed to determine function names, arguments, variable names, etc.

Masami Hiramatsu has done some work on this and got down to around 321kb for kernel debug info.

Ease of use of tools

At Netflix, only people on the performance team will use traces. The rest of the company relies on the analysis data provided by tools developed by the performance team. The tools need to be easy enough for the performance team to figure out, but it is not too bad if they remain complex.

SystemTap USDT probes were tried with Node.js. Stability issues were encountered.

IO.js now has LTTng instrumentation and the Netflix Node.js team will be looking into trying it out.

Right now, visual correlation of latencies works well, but the future really is to programmatically break down interesting latencies in terms of lock usage, scheduling wakeups, etc.
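To make the idea of a programmatic breakdown concrete, here is a minimal Python sketch. It assumes an analysis has already split one request into non-overlapping (start in ns, end in ns, category) intervals; the category names and the sample values are hypothetical.

```python
from collections import defaultdict


def breakdown(intervals):
    """Sum the time spent per category for (start_ns, end_ns, category) intervals."""
    totals = defaultdict(int)
    for start_ns, end_ns, category in intervals:
        if end_ns < start_ns:
            raise ValueError("interval ends before it starts")
        totals[category] += end_ns - start_ns
    return dict(totals)


def fractions(totals):
    """Express each category as a fraction of the total time."""
    whole = sum(totals.values())
    if whole == 0:
        return {}
    return {category: t / whole for category, t in totals.items()}


# Hypothetical 1000 ns request: mostly waiting on a lock.
request = [
    (0, 300, "running"),
    (300, 800, "lock_wait"),
    (800, 900, "sched_wakeup_delay"),
    (900, 1000, "running"),
]
totals = breakdown(request)
shares = fractions(totals)
```

The hard part, which this sketch does not cover, is deriving the intervals from trace events in the first place.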

Brendan will try to come up with a list of categories of problems faced at Netflix.

Mathieu thinks that creating specialized modules targeting problem classes is the way to go. DTrace had a great generic infrastructure for writing custom scripts. In the end, far fewer than 100 people ended up writing such scripts, mostly people working on Solaris kernel code.

Most people are reusing Brendan's scripts, in part because they are documented, have examples, etc. Brendan identifies the lack of documentation as a current problem. For instance, he can't make sense of most SystemTap scripts.

It is essential that scripts be documented and that information on how to interpret their output be made available. The output might seem obvious to the author, but in reality it is often very hard to make sense of what is provided.

Future Work

New version of the Common Trace Format specification: Mathieu Desnoyers has posted the proposed changes on the DiaMon mailing list to drive the discussion forward. https://lists.linuxfoundation.org/pipermail/diamon-discuss/2015-March/000049.html

There is ongoing work to produce a JSON-based exchange format between LTTng analysis scripts and viewers.

November 12, 2014

Present:

  • Dominique Toupin
  • Agustin Vega-Frias
  • Mathieu Desnoyers
  • Christian Babeux
  • Masami Hiramatsu
  • Christoph Lameter
  • Thomas Gleixner
  • Petru Lauric
  • Ed Martinez
  • (and others whose names we missed)

Diamon Introduction

  • Dominique introduced the reasons why the workgroup was created.
  • The current diamon draft is largely based on Linux use in telecom companies (Ericsson); other requirements are welcome.
  • Overview of the website, used to share information with other companies and users.

Overview of mini-coredumps

  • Linutronix
  • Trigger a memory dump when an application crashes.
  • Dump only a small portion of memory instead of the whole memory.
  • Configurable: set up watches, per-application fine-tuning.
  • Libminicoredump: applications can register data structures to be dumped.
  • Capture the state of the application
  • Snapshot in time, asynchronous dump to debug
  • The plan is to host mini-coredump at diamon.org

How to integrate mini-coredump with tracing? Snapshot? → Application registration mechanism with mini-coredump

How to integrate those tools? Distribution to users and packaging not ready.

Overview of TraceCompass (UI for Linux tracing)

  • Import text logs, libpcap traces, LTTng traces
  • Show information from mini-coredump
  • Standalone application similar to Wireshark

How to integrate this standalone tool in another toolkit? → Java based toolkit should be easy

  • Used by Intel for trace tool, Sourcery Analyser from Mentor, etc.

Overview of the reason for the creation of the workgroup

  • In tracing, everyone is scratching their own itch; the workgroup is a reason to share and collaborate
  • The LTTng project is focusing on trace analysis usable by non-experts (e.g. non-kernel developers)
  • CTF, a trace format to be able to correlate multiple tracing data sources

Christoph Lameter: Finding latencies in their own applications. Multiple tools, difficult to use. Get user feedback on what they would like to see in those tools.

Interest from perf/ftrace to CTF?

Linutronix + Red Hat: work on a perf to CTF module is ongoing. It post-processes the perf output file to convert it to CTF. It is mostly functional, with a few rough edges; the Eclipse viewer still needs to be modified to adapt to the event semantics exposed by perf.

People asked for Jiri Olsa's presentation from this year's Tracing Summit: http://www.tracingsummit.org/w/images/9/98/TracingSummit2014-Perf-CTF.pdf

Steven Rostedt from ftrace interested in looking into CTF.

Masami: Where can the CTF specification be found?

The CTF specification is available at http://www.efficios.com/ctf and the plan is to eventually move it to the Diamon workgroup. Babeltrace and libbabeltrace can be used to convert and read CTF.

Work on Common Trace Clocks

Thomas Gleixner:

  • Provide an NMI-safe accessor to the monotonic clock. This provides a clock that can be correlated with the user-space monotonic clock.
  • Correlating timestamps across the network becomes possible.
  • Merged in 3.17

How to use this clock?

  • extern u64 ktime_get_mono_fast_ns(void);
  • Can change the clock in ftrace, via debugfs.
  • Perf still has to figure out how to expose this.
  • Using the monotonic clock allows correlation across the network, which is harder to do when using the TSC directly.
  • Correlation between clock monotonic and clock realtime can be done with a single correlation point, since both clocks are adjusted in the same way by NTP.
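The single-correlation-point idea from the last bullet can be sketched in user space as follows. This is only an illustration in Python, not kernel code: the function names are hypothetical, the two clock reads are not atomic (so the offset carries a small error), and the conversion assumes the realtime clock is not stepped (for example by a manual time change) between the correlation point and the event.

```python
import time


def sample_correlation_point():
    """Read the monotonic and realtime clocks back to back (Unix only)."""
    monotonic_ns = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    realtime_ns = time.clock_gettime_ns(time.CLOCK_REALTIME)
    return monotonic_ns, realtime_ns


def correlation_offset_ns(monotonic_ns, realtime_ns):
    """Offset to add to a monotonic timestamp to obtain a realtime timestamp."""
    return realtime_ns - monotonic_ns


def monotonic_to_realtime_ns(event_monotonic_ns, offset_ns):
    """Convert a monotonic event timestamp using a previously sampled offset."""
    return event_monotonic_ns + offset_ns


# Hypothetical correlation point and event timestamp, in nanoseconds.
offset = correlation_offset_ns(monotonic_ns=5_000,
                               realtime_ns=1_427_300_000_000_005_000)
wall_ns = monotonic_to_realtime_ns(7_500, offset)
```

On a live system, `sample_correlation_point()` would supply the two values passed to `correlation_offset_ns()`; the hard-coded numbers above only keep the example deterministic.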

HW assisted tracing

Freescale:

  • Trace from the CPU, HW port, ETM/PTM-like trace.
  • HW trace from the SoC, DDR controllers
  • At the Tracing Summit, they presented a tool for SoC tracing, aimed at network developers
  • Issue about having a common clock: traces come from multiple HW sources with different HW clocks
  • How to correlate hardware and software traces
  • HW traces tend to be proprietary; there is a need for a common generic format.
  • Consuming CTF file, analysis in TraceCompass
  • DMTF format to CTF?
    • Could probably use babeltrace to convert. Semantic about the clocks could be missing from DMTF though.

The diamon workgroup is a lean workgroup with straightforward governance. You do not need to be a member of the LF to join. If you want your company logo on diamon.org (this is optional), send the request to Mike Dolan <mdolan at linuxfoundation.org>.

How to correspond with diamon? “Diamon discuss” mailing list. https://lists.linuxfoundation.org/mailman/listinfo/diamon-discuss

Next meeting: Contribution from other users to see what they would like to be improved in tooling.

Christoph Lameter: will bring someone from his team to present.

Masami Hiramatsu: Interested in cloud logging/tracing. Integration of app logging with tracing.

diamon/meeting_notes.txt · Last modified: 2015/03/25 20:40 by jgalar