Mathieu Desnoyers has contacted the Linux Foundation to settle the date and location of the 2015 edition of the Tracing Summit.
The registration fee is $60. The CFP will be sent very soon.
Babeltrace has been brought into the DiaMon Workgroup. The Linux Foundation provides Bugzilla and Git hosting.
Brendan Gregg has posted a list of features he'd like to see in tracing tools to the DiaMon Discuss mailing list.
Although tracers might provide low-level information, it's important to understand that performance gains of 1-2% have a significant impact at Netflix's scale.
Given that things change so fast, the low-hanging fruit for Netflix has typically been spotting application mistakes.
Netflix runs a micro-service infrastructure. Currently, developers identify the performance metrics they think are relevant, which are then tracked by Netflix's internal performance tools.
New kernel versions introduce new tunable parameters or new defaults which the performance team has to track down and debug constantly. They typically use Ftrace and perf events to diagnose these problems. Brendan has also been toying around with eBPF.
One thing missing right now is that he'd like to see syscall rates (not just read/writes).
Sysdig seems to offer this, but the overhead involved prevented him from deploying it in production. He also wants facilities to find event time spans, frequency distributions, counts, etc.
Ideally, on-demand instrumentation would be tied into the analysis.
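As an illustration of the kind of summaries requested above (syscall rates, counts and frequency distributions), here is a minimal Python sketch. The event tuples and the function names `syscall_rates` and `log2_histogram` are hypothetical; they do not come from any tracer's API, and a real tool would read events from a trace instead of a hard-coded list.

```python
from collections import Counter

# Hypothetical records: (timestamp_us, syscall_name, duration_us).
EVENTS = [
    (100_000, "read", 4),
    (200_000, "write", 30),
    (350_000, "read", 120),
    (900_000, "futex", 2000),
    (1_100_000, "read", 7),
]

def syscall_rates(events):
    """Return calls per second for each syscall over the traced time span."""
    timestamps = [ts for ts, _, _ in events]
    span_s = (max(timestamps) - min(timestamps)) / 1_000_000
    if span_s <= 0:
        return {}
    counts = Counter(name for _, name, _ in events)
    return {name: n / span_s for name, n in counts.items()}

def log2_histogram(events):
    """Count durations per power-of-two bucket.

    Bucket b holds durations in [2**(b-1), 2**b) microseconds;
    bucket 0 holds zero-length durations.
    """
    hist = Counter(duration.bit_length() for _, _, duration in events)
    return dict(sorted(hist.items()))

print(syscall_rates(EVENTS))
print(log2_histogram(EVENTS))
```

Power-of-two buckets are only one possible choice; they keep the histogram compact when durations span several orders of magnitude.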
Another desired feature is latency heatmaps in which you could drill down from the syscall to the driver or block-layer level.
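A latency heatmap boils down to counting samples in a grid of time columns and latency rows. The sketch below shows only that binning step, under assumed inputs; the sample data and the names `latency_heatmap` and `render` are hypothetical, and the drill-down from syscall to driver or block layer is not modelled.

```python
from collections import Counter

def latency_heatmap(samples, time_step_us, latency_step_us):
    """Count samples per (time column, latency row) cell.

    samples: iterable of (completion_time_us, latency_us) integer pairs.
    """
    cells = Counter()
    for when_us, latency_us in samples:
        cells[(when_us // time_step_us, latency_us // latency_step_us)] += 1
    return cells

def render(cells):
    """Render the grid as text, highest latency row first.

    '.' marks an empty cell; counts above 9 are shown as 9.
    """
    cols = max(c for c, _ in cells) + 1
    rows = max(r for _, r in cells) + 1
    lines = []
    for r in reversed(range(rows)):
        counts = (cells.get((c, r), 0) for c in range(cols))
        lines.append("".join("." if n == 0 else str(min(n, 9)) for n in counts))
    return "\n".join(lines)

# Hypothetical samples: (completion_time_us, latency_us).
SAMPLES = [(100, 50), (150, 60), (900, 250), (1200, 40), (1900, 30), (1950, 45)]

cells = latency_heatmap(SAMPLES, time_step_us=1000, latency_step_us=100)
print(render(cells))
```

A graphical viewer would map each cell count to a colour intensity instead of a digit.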
Given Netflix's requirements, it is not possible to deploy all debug information on targets, as provisioning times must remain as low as possible. This info can account for hundreds of megabytes.
Andi Kleen asked if it would be possible to share the debug info on a central server. Brendan answered that, in his experience, this never works. These solutions have proven non-trivial to set up and maintain: too much setup work for use cases which happen “once in a blue moon”.
Andi proposed exploring ways to make the debug info smaller. Brendan mentioned that a couple of megabytes would be fine. He had success gzip'ing debug info. We need a stripped-down “tracing debug info” which only provides the essential info to determine function names, arguments, variable names, etc.
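To get a rough idea of what gzip'ing buys, the size before and after compression can be compared with Python's standard `gzip` module. This is only a sketch: the payload below is a synthetic stand-in for symbol-like data, not real debug info, and the helper name `gzip_savings` is hypothetical. Actual ratios depend entirely on the input.

```python
import gzip

def gzip_savings(raw):
    """Return (original_size, compressed_size) in bytes for a bytes object."""
    return len(raw), len(gzip.compress(raw, compresslevel=9))

# Synthetic stand-in for repetitive, symbol-like data (NOT real debug info).
payload = b"".join(b"function_%d\x00arg_count\x00int\x00" % i for i in range(10_000))

before, after = gzip_savings(payload)
print(f"{before} bytes -> {after} bytes")
```

To measure a real file, read its contents with `open(path, "rb").read()` and pass the result to the same helper.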
Masami Hiramatsu has done some work on this and got down to around 321kb for kernel debug info.
At Netflix, only people on the performance team will use traces. The rest of the company relies on the analysis data provided by tools developed by the perf team. The tooling needs to be easy enough for the performance team to figure out, but it's not too bad if it remains complex.
SystemTap USDT probes were tried with Node.js. Stability issues were encountered.
IO.js now has LTTng instrumentation and the Netflix Node.js team will be looking into trying it out.
Right now, visual correlation of latencies works well, but the future really is to programmatically break down interesting latencies in terms of lock usage, scheduling wakeups, etc.
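One way to picture such a programmatic breakdown: given the time window of a slow request and the states of the thread that served it, sum the overlap per state. The sketch below assumes the state intervals have already been extracted from a trace and do not overlap each other; the state names, the data and the function `breakdown` are all hypothetical.

```python
from collections import defaultdict

def breakdown(request, intervals):
    """Split a request's latency into time spent per thread state.

    request: (start_us, end_us) of the latency being analysed.
    intervals: list of (start_us, end_us, state), assumed non-overlapping.
    Time not covered by any interval is reported as "unaccounted".
    """
    req_start, req_end = request
    totals = defaultdict(int)
    for start, end, state in intervals:
        # Only the part of the interval inside the request window counts.
        overlap = min(end, req_end) - max(start, req_start)
        if overlap > 0:
            totals[state] += overlap
    totals["unaccounted"] = (req_end - req_start) - sum(totals.values())
    return dict(totals)

# Hypothetical thread states around a 1000 us request.
REQUEST = (1000, 2000)
INTERVALS = [
    (900, 1200, "running"),
    (1200, 1500, "lock_wait"),
    (1500, 1600, "runnable_wait"),
    (1600, 1900, "running"),
    (2100, 2200, "running"),
]

print(breakdown(REQUEST, INTERVALS))
```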
Brendan will try to come up with a list of categories of problems faced at Netflix.
Mathieu thinks that creating specialized modules targeting problem classes is the way to go. DTrace had a great generic infrastructure to write custom scripts. In the end, far fewer than 100 people ended up writing such scripts, mostly people working on Solaris kernel code.
Most people are reusing Brendan's scripts, in part because they are documented, have examples, etc. Brendan identifies the lack of documentation as a current problem. For instance, he can't make sense of most SystemTap scripts.
It is essential that scripts be documented and that information on how to interpret their output be made available. The output might seem obvious to the author, but in reality it is often very hard to make sense of what's provided.
New version of the Common Trace Format specification: Mathieu Desnoyers has posted the proposed changes on the DiaMon mailing list to drive the discussion forward. https://lists.linuxfoundation.org/pipermail/diamon-discuss/2015-March/000049.html
There is ongoing work to produce a JSON-based exchange format between LTTng analysis scripts and viewers.
How to integrate mini-coredump with tracing? Snapshot? → Application registration mechanism with mini-coredump
How to integrate those tools? Distribution to users and packaging not ready.
How to integrate this standalone tool in another toolkit? → Java based toolkit should be easy
Christoph Lameter: Finding latencies in their own applications. Multiple tools, difficult to use. Get user feedback on what they would like to see in those tools.
Linutronix + Red Hat: perf to CTF module, work is ongoing. Post-processing of the perf output file to convert it to CTF. Mostly functional, with a few rough edges; the Eclipse viewer still needs to be modified to adapt to the event semantics exposed by perf.
People are asking for Jiri Olsa's presentation from this year's Tracing Summit. http://www.tracingsummit.org/w/images/9/98/TracingSummit2014-Perf-CTF.pdf
Steven Rostedt (Ftrace) is interested in looking into CTF.
Masami: Where can the CTF specification be found? The CTF specification is available at http://www.efficios.com/ctf and the plan is to eventually move it to the DiaMon workgroup. Babeltrace and libbabeltrace can be used to convert/read CTF.
How to use this clock?
The DiaMon workgroup is a lean workgroup with straightforward governance. You do not need to be a member of the LF to join. If you want your company logo on diamon.org (this is optional), send the request to Mike Dolan <mdolan at linuxfoundation.org>.
How to correspond with DiaMon? Use the “DiaMon discuss” mailing list: https://lists.linuxfoundation.org/mailman/listinfo/diamon-discuss
Next meeting: contributions from other users on what they would like to see improved in tooling.
Christoph Lameter: Will bring someone from his team to present.
Masami Hiramatsu: Interested in cloud logging/tracing. Integration of app logging with tracing.