diamon:meeting_notes [2015/03/19 21:27]
abusque Create Meeting Notes page, with minutes from nov 12
diamon:meeting_notes [2015/03/25 20:40] (current)
jgalar
Line 1:
 ====== Meeting Notes ======
 +
 +===== March 25, 2015 =====
 +Were present:
 +  * Srikar Dronamraju
 +  * Deirdré Straughan
 +  * Brendan Gregg
 +  * Jérémie Galarneau
 +  * Mathieu Desnoyers
 +  * Antoine Busque
 +  * Jonathan Rajotte
 +  * Masami Hiramatsu
 +  * Bernd Hufmann
 +  * Matthew Khouzam
 +  * Petru Lauric
 +  * Razvan Ionescu
 +  * Andi Kleen
 +  * Rudolph Zimmel
 +  * Dominique Toupin
 +
 +==== News ====
 +
 +=== Tracing Summit ===
 +Mathieu Desnoyers has contacted the Linux Foundation to settle the date and location of the 2015 edition of the [[http://tracingsummit.org/|Tracing Summit]].
 +
 +The 2015 edition of the [[http://tracingsummit.org/|Tracing Summit]] will be held on August 20th in Seattle, a day after [[http://events.linuxfoundation.org/events/linuxcon-north-america|LinuxCon]].
 +
 +The registration fee is $60. The CFP will be sent very soon.
 +
 +
 +=== Babeltrace is now a DiaMon Workgroup project ===
 +
 +Babeltrace has been brought into the DiaMon Workgroup. The Linux Foundation provides Bugzilla and Git hosting.
 +
 +[[http://diamon.org/2015/03/babeltrace-now-hosted-by-the-linux-foundation/|Official announcement]]
 +
 +
 +==== Brendan Gregg's presentation on industry use-cases ====
 +
 +=== Background about tracing at Netflix ===
 +
 +Brendan Gregg has posted a list of features he'd like to see in tracing tools to the DiaMon Discuss mailing list.
 +
 +[[https://lists.linuxfoundation.org/pipermail/diamon-discuss/2015-March/000047.html]]
 +
 +
 +Although tracers provide low-level information, it's important to understand that performance gains of even 1-2% have a significant impact at Netflix's scale.
 +
 +Given that things change so fast, the low-hanging fruit for Netflix has typically been spotting application mistakes.
 +
 +Netflix runs a microservice infrastructure. Currently, developers identify the performance metrics they think are relevant, and Netflix's internal performance tools then track them.
 +
 +New kernel versions introduce new tunable parameters or new defaults which the performance team has to track down and debug constantly. They typically use Ftrace and perf events to diagnose these problems. Brendan has also been toying around with eBPF.
 +
 +One thing currently missing is visibility into syscall rates for all syscalls, not just reads and writes.
 +
 +Sysdig seems to offer this, but its overhead prevented him from deploying it in production.
 +He also wants facilities to find event time spans, frequency distributions, counts, etc.
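
As a rough illustration of the kind of summary requested above, the following Python sketch computes per-syscall rates and a power-of-two latency distribution from a list of events. The event tuple layout, the sample values, and the function names are assumptions made for this example; they do not reflect the output format of Sysdig, perf, Ftrace, or any other tracer.

```python
from collections import Counter

# Hypothetical events: (timestamp in seconds, syscall name, latency in microseconds).
# This layout and these values are invented for illustration only.
events = [
    (0.10, "read", 12),
    (0.20, "write", 40),
    (0.35, "openat", 150),
    (0.90, "read", 9),
    (1.10, "futex", 3000),
    (1.50, "read", 17),
]


def syscall_rates(events, window_s):
    """Return calls per second for each syscall over a window of window_s seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    counts = Counter(name for _, name, _ in events)
    return {name: n / window_s for name, n in counts.items()}


def log2_histogram(latencies_us):
    """Count integer latencies per power-of-two bucket, keyed by the bucket's lower bound."""
    histogram = Counter()
    for latency in latencies_us:
        # Values below 1 go to bucket 0; otherwise use the highest set bit.
        lower = 0 if latency < 1 else 1 << (latency.bit_length() - 1)
        histogram[lower] += 1
    return dict(histogram)


rates = syscall_rates(events, window_s=2.0)
histogram = log2_histogram(latency for _, _, latency in events)

for name, rate in sorted(rates.items()):
    print(f"{name:8s} {rate:.2f} calls/s")
for lower, count in sorted(histogram.items()):
    print(f">= {lower:5d} us: {count}")
```

Power-of-two buckets keep the distribution compact across several orders of magnitude, which is why they are a common choice for latency summaries; a production tool would aggregate in the tracer rather than post-process a list in memory.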
 +
 +Ideally, on-demand instrumentation would be tied into the analysis.
 +
 +Another desired feature is latency heatmaps in which you could drill down from the syscall to the driver or block-layer level.
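
The binning step behind a latency heatmap can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the sample data, the cell sizes, and the `latency_heatmap` name are hypothetical, and the drill-down from syscall to driver or block layer discussed above is not modelled here.

```python
from collections import defaultdict


def latency_heatmap(samples, time_step_s, latency_step_us):
    """Count samples per (time column, latency row) cell.

    samples is an iterable of (timestamp_s, latency_us) pairs.
    """
    cells = defaultdict(int)
    for timestamp_s, latency_us in samples:
        column = int(timestamp_s // time_step_s)   # x axis: elapsed time
        row = int(latency_us // latency_step_us)   # y axis: latency range
        cells[(column, row)] += 1                  # cell value drives the colour
    return dict(cells)


# Hypothetical samples, invented for illustration.
samples = [(0.1, 120), (0.4, 180), (0.7, 950), (1.2, 130), (1.8, 110), (1.9, 2400)]
heatmap = latency_heatmap(samples, time_step_s=1.0, latency_step_us=500)

for (column, row), count in sorted(heatmap.items()):
    print(f"t=[{column}s,{column + 1}s) latency=[{row * 500},{(row + 1) * 500})us count={count}")
```

Each cell count maps to a colour intensity when rendered, so outliers such as the single 2400 us sample stay visible instead of being averaged away.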
 +
 +=== Debug information management ===
 +
 +Given Netflix's requirements, it is not possible to deploy all debug information on targets, as provisioning times must remain as low as possible. This information can amount to hundreds of megabytes.
 +
 +Andi Kleen asked if it would be possible to share the debug information on a central server. Brendan answered that, in his experience, this never works. These solutions have proven non-trivial to set up and maintain: too much setup work for use cases which happen "once in a blue moon".
 +
 +Andi proposed exploring ways to make the debug information smaller.
 +Brendan mentioned that a couple of megabytes would be fine. He had success gzipping debug information. We need a stripped-down "tracing debug info" which only provides the essential information to determine function names, arguments, variable names, etc.
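
To get a feel for how much gzip can save on a given file, a small Python sketch using the standard `gzip` module is shown below. The `gzip_ratio` helper and the stand-in payload are assumptions for illustration; highly repetitive text like this compresses far better than real debug information will, so the printed ratio says nothing about actual kernel or application debug info.

```python
import gzip


def gzip_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower means better compression)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(gzip.compress(data)) / len(data)


# Stand-in payload (hypothetical); replace with the bytes of a real debug info file,
# e.g. open(path, "rb").read(), to measure an actual ratio.
sample = b"function do_sys_open args: dfd filename flags mode\n" * 1000
ratio = gzip_ratio(sample)
print(f"original: {len(sample)} bytes, ratio: {ratio:.3f}")
```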
 +
 +Masami Hiramatsu has done some work on this and got down to around 321kb for kernel debug info.
 +
 +=== Ease of use of tools ===
 +
 +At Netflix, only people on the performance team will use traces. The rest of the company relies on the analysis data provided by tools developed by the perf team.
 +The tools need to be easy enough for the performance team to figure out, but it is acceptable if they remain somewhat complex.
 +
 +SystemTap USDT probes were tried with Node.js, but stability issues were encountered.
 +
 +IO.js now has LTTng instrumentation and the Netflix Node.js team will be looking into trying it out.
 +
 +Right now, visual correlation of latencies works well, but the future really is to programmatically break down interesting latencies in terms of lock usage, scheduling wakeups, etc.
 +
 +Brendan will try to come up with a list of categories of problems faced at Netflix.
 +
 +Mathieu thinks that creating specialized modules targeting problem classes is the way to go. DTrace had a great generic infrastructure for writing custom scripts. In the end, far fewer than 100 people ended up writing such scripts, mostly people working on Solaris kernel code.
 +
 +Most people are reusing Brendan'​s scripts, in part because they are documented, have examples, etc.
 +Brendan identifies the lack of documentation as a current problem. For instance, he can't make sense of most SystemTap scripts.
 +
 +It is essential that scripts be documented and that information on how to interpret their output be made available. The meaning of the output might seem obvious to the author, but in reality it is often very hard to make sense of what's provided.
 +
 +==== Future Work ====
 +
 +New version of the Common Trace Format specification: Mathieu Desnoyers has posted the proposed changes on the DiaMon mailing list to drive the discussion forward.
 +[[https://lists.linuxfoundation.org/pipermail/diamon-discuss/2015-March/000049.html]]
 +
 +There is ongoing work to produce a JSON-based exchange format between LTTng analysis scripts and viewers.
 +
  
 ===== November 12, 2014 =====
diamon/meeting_notes.1426800479.txt.gz · Last modified: 2015/03/19 21:27 by abusque