
3. Trace Collection


Trace quality is one of the essential requirements for analysis. Low-quality traces may lead to incorrect or misleading conclusions. There are two main issues in trace quality. One is timing drift, where the recorded arrival time of an event lags behind its actual arrival time. The other is missing events, such as when the tool cannot capture all the required events. Thus, proper tools shall be applied to guarantee the correctness of the traces. Both software tools and hardware devices are introduced in this chapter.

Collection Techniques

Many techniques have been proposed to monitor and capture system or component traces. Generally, there are four techniques:
  • Hardware-based monitoring entails the modification of the testbed hardware so that as a program is executed, a record of all instructions and/or data addresses is created.

  • Software-based tracing can achieve similar goals as hardware to a certain degree, but instead of altering the system hardware, software is modified or inserted.

  • Emulation-based tracing constructs a layer between the host machine and the OS under evaluation, like QEMU1 and SimOS.2 The layer only emulates enough components to allow the OS to run correctly. While this approach provides a flexible interface to collect operating system-dependent traces, the accuracy of the captured trace is sometimes dubious: since emulation is performed, execution will be perturbed.

  • Microcode-based tracing utilizes microcode modification to capture trace information, introducing minimal slowdown, like PALcode (Privileged Architecture Library code).3

However, the third and fourth techniques are not popular due to their high complexity and dubious accuracy. Therefore, only hardware- and software-based techniques are discussed in this chapter.

Hardware Trace Collection

By the hardware method, we refer to trace collection approaches that use a dedicated hardware device/system, separate from the targeted storage device, to capture the IOs, although some software may still be required to manage the traces [38, 45]. There are many types of hardware for collecting block-level traces. One of the most common devices is the bus analyzer, although its use is not limited to block-level IOs for disk drives; it can also capture network traffic, DDR/CPU caching/stall/latency/throughput measurements, etc. Some products can capture rather accurate traces, such as the Xgig 6G SAS/SATA analyzer from Viavi Solutions, the BusXpert Micro II Series SAS/SATA analyzer from SerialTek, the trace analyzer from TI, the protocol analyzer from LeCroy, the Emulator XL-100 from Arium, and the SuperTrace Probe from Green Hills Software.

Bus analyzers often provide multiple communication interfaces for users. Take the devices in Figure 3-1 as an example: they provide USB, Ethernet, SCSI, etc. These devices usually achieve reliable and accurate link-ups via multiple mechanisms, offering higher resolution (e.g., time precision and capture frequency) and capturing more information than software-based tools.
Figure 3-1 Bus analyzers from LeCroy, Xgig, and SerialTek

Figure 3-2 shows an example of SAS IO access in BusXpert, which provides almost all the basic information related to SAS protocols. Users can easily trace the command status, such as response times and connections, from the detailed logs.
Figure 3-2 Plentiful protocol information from BusXpert

Figure 3-3 provides another example of SATA command analysis. You can see that the host issued the COMWAKE command after around 5 seconds, and the drive acknowledged it almost immediately. At time 5.59 seconds, SMART READ DATA was transferred to the host.
Figure 3-3 SATA command analysis

Although bus analyzers have no difficulty capturing almost all the essential protocol information, the software used to analyze the captured traces includes no advanced metrics of IO properties.

Software Trace Collection

In terms of accuracy, a software trace collector may not be as good as a hardware device. In particular, for applications requiring time precision at the nanosecond level or finer, software may not work well. For example, debugging a disk feature related to the SAS/SATA protocol may call for a bus analyzer, since it may involve the disk drive’s SoC clock. However, disk drive IO performance is generally measured at the millisecond level (precision), which is well within the capability of the modern processors and operating systems inside a common server or workstation.

There are many IO tools available [45, 35]:
  • Linux/Unix: DTrace [46], LTTng, BCC,4 iostat, dstat, tracefs,5 iotop, hdparm, ionice, Ctrace,6 iogrind, POSIX Test Suite, ioprofile, SystemTap, IOR, PCP, and swtrace

  • Windows: Xperf,7 TraceWPP/TraceView/Tracelog/Logman,8 Vtrace, Oracle trace collector, Bus analyzer module,9 and PatchWrx10

However, not all of these tools can provide event details. In fact, general-purpose monitoring tools, like iostat and iotop, cannot provide detailed information on a per-IO basis.
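For instance, iostat only reports statistics aggregated over a sampling interval. A minimal invocation is sketched below; the device name sda and the 1-second interval are arbitrary choices for illustration.

# Extended device statistics for sda, refreshed every second (aggregated, not per-IO)
iostat -dx sda 1

The output columns (r/s, w/s, await, etc.) summarize many IOs per interval, so individual request arrival times and addresses are lost; recovering those requires a per-IO tracer such as blktrace.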

These tools can be divided into two classes: static and dynamic. Static tools view the binary image of a program as a black box that is never modified. Dynamic tools instead rely on binary-level alterations to facilitate the gathering of statistical data from an application. For example, all the Windows tools listed above and iotop/iostat/dstat/hdparm/ionice/iogrind/ioprofile are static tools, while SystemTap, DTrace, and LTTng are dynamic tools. In particular, DTrace and LTTng use a mechanism called probing, which selectively activates instrumentation routines embedded within software at all levels of abstraction, so that performance-related statistics can be obtained not only from an application but also from the various libraries and kernel routines associated with its execution.
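As a sketch of what probing looks like in practice, the following DTrace one-liner attaches an action to the io provider’s start probe, which fires for each block IO issued. The io provider is native to Solaris/illumos; treat its availability and argument layout as an assumption on other ports.

# Print the issuing process and the IO size for every block IO start event
dtrace -n 'io:::start { printf("%s issued %d bytes", execname, args[0]->b_bcount); }'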

Blktrace

Blktrace is a static tool that has been embedded in the Linux kernel since version 2.6.17-rc1. This tool is lightweight and easy to use. It only considers device accesses after the OS/file system cache. When an IO enters the block IO layer (request queue), events are emitted into per-CPU relay channels, and blktrace then captures the events from these channels. More details can be found in Appendix B.
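A typical session, sketched below, captures events for a fixed period and then converts the per-CPU binary files into readable text with blkparse. The device /dev/sdb, the output basename mytrace, and the 30-second window are placeholders.

# Capture 30 seconds of block-layer events from /dev/sdb
sudo blktrace -d /dev/sdb -o mytrace -w 30
# Decode the per-CPU relay files (mytrace.blktrace.*) into a readable log
blkparse -i mytrace -o mytrace.txt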

Dtrace, SystemTap, and LTTng

As mentioned, dynamic tracing tools embed tracing code into running user programs or kernels, without the need for recompilation or a reboot. Since any processor instruction may be patched, such a tool can access virtually any information you need at any place. I will discuss several dynamic tracing tools next.

DTrace [47] originated from Solaris.11 Its development began in 1999, and it became part of the Solaris 10 release. Nowadays, DTrace is open source as a part of OpenSolaris, although it has not been merged into the Linux kernel due to license incompatibility. There exist several ports without proper support. A toolkit based on DTrace that simplifies its use has been developed by B. Gregg.12 But the essential limitation remains unsolved. A few attempts led to the development of another clone of DTrace called DProbes, but it seems to have been unsuccessful.

Therefore, three major Linux players, Red Hat, Hitachi, and IBM, presented another dynamic tracing system for Linux called SystemTap.13 SystemTap is one of the most powerful tracers so far. However, it has to generate and compile a native kernel module for each script it runs, which imposes a huge performance penalty. Ktap14 was further developed to reduce this overhead by using Lua and LuaJIT internally. Another similar implementation is sysdig,15 which is scriptless.
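As an illustration, a SystemTap one-liner can hook the block IO layer through the ioblock tapset. The sketch below assumes the standard tapset variables devname and size are present in your SystemTap release; note that SystemTap will translate the script into C, compile it as a kernel module, and load it, which is exactly the per-script overhead mentioned above.

# Print the issuing process, target device, and request size for each block IO request
stap -e 'probe ioblock.request { printf("%s %s %d\n", execname(), devname, size) }'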

LTTng16 is also a widely used open source tracing framework for Linux. It used static tracing and required kernel recompilation until version 2.0; it currently utilizes the ftrace and kprobe subsystems in the Linux kernel. It helps users understand the interactions among multiple system components: the Linux kernel (via existing or user-defined instrumentation points), C/C++ applications, Java applications, Python applications, or any other user-space application using the LTTng logger. It may outperform other tracers because of its optimized event collection. It also supports numerous event types, including USDT (user-level statically defined tracing).
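A minimal LTTng kernel-tracing session for block IO might look like the sketch below. The tracepoints block_rq_issue and block_rq_complete are standard Linux kernel tracepoints, but verify that they are exposed on your kernel version.

lttng create blkio-session                                     # create a tracing session
lttng enable-event --kernel block_rq_issue,block_rq_complete   # block IO tracepoints
lttng start                                                    # begin recording
sleep 30                                                       # run the workload of interest
lttng stop                                                     # stop recording
lttng view                                                     # print the recorded events
lttng destroy                                                  # tear down the session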

These tools play a significant role when you examine overall system performance instead of only storage IO. In Chapter 9, you will use Ceph as an example to find the performance bottleneck from an overall system view.

Trace Warehouse

Mainly for research purposes, some real/synthetic traces are available online for download. The following are a few examples:

Together with the source code for the analysis tool, I also provide sample trace data on GitHub.

This chapter discussed both hardware and software tools for trace collection. The former generally offer higher precision and more information than the latter, although they are more expensive. However, in many scenarios, precision is only required at the millisecond level, so software-only tools are widely applied in both industry and academia. Note that different tools exist for different purposes; to identify overall system performance, you should employ multiple tools or an integrated tool set.