
3. Trace Collection


Trace quality is one of the essential requirements for analysis. Low-quality traces may lead to incorrect or misleading conclusions. There are two main issues in trace quality. One is timing drift, where the recorded arrival time of an event lags behind its actual arrival time. The other is missing events, such as when the tool cannot capture all the required events. Thus, proper tools shall be applied to guarantee the correctness of the traces. Both software tools and hardware devices are introduced in this chapter.

Collection Techniques

Many techniques have been proposed to monitor and capture system or component traces. Generally, there are four techniques:
  • Hardware-based monitoring entails the modification of the testbed hardware so that as a program is executed, a record of all instructions and/or data addresses is created.

  • Software-based tracing can achieve similar goals as hardware to a certain degree, but instead of altering the system hardware, software is modified or inserted.

  • Emulation-based tracing constructs a layer between the host machine and the OS under evaluation, like QEMU1 and SimOS.2 The layer only emulates enough components to allow the OS to run correctly. While this approach provides a flexible interface to collect operating system-dependent traces, the accuracy of the captured trace is sometimes dubious: since emulation is performed, execution will be perturbed.

  • Microcode-based tracing utilizes microcode modification to capture trace information, introducing minimal slowdown, like PALcode (Privileged Architecture Library code).3

However, the third and fourth techniques are not popular due to their high complexity and dubious accuracy. Therefore, only hardware- and software-based techniques are discussed in this chapter.

Hardware Trace Collection

By the hardware method, we refer to trace collection approaches that use a dedicated hardware device/system, separate from the targeted storage device, to capture the IOs, although some software may still be required to manage the traces [38, 45]. There are many types of hardware for collecting block-level traces. One of the most common devices is the bus analyzer, although its use is not limited to block-level IOs for disk drives; it can also capture network traffic, DDR/CPU caching/stall/latency/throughput measurements, etc. Some products can capture rather accurate traces, such as the Xgig 6G SAS/SATA analyzer from Viavi Solutions, the BusXpert Micro II Series SAS/SATA analyzer from SerialTek, the trace analyzer from TI, the protocol analyzer from LeCroy, the Emulator XL-100 from Arium, and the SuperTrace Probe from Green Hills Software.

Bus analyzers often provide multiple communication interfaces for users. Take the devices in Figure 3-1 as an example: they provide USB, Ethernet, SCSI, etc. These devices usually achieve reliable and accurate link-ups via multiple mechanisms, offering higher resolution (e.g., time precision and capture frequency) and capturing more information than software-based tools.
Figure 3-1 Bus analyzers from LeCroy, Xgig, and SerialTek

Figure 3-2 shows an example of SAS IO access in BusXpert, which provides almost all the basic information related to SAS protocols. Users can easily trace the command status, such as response times and connections, from the detailed logs.
Figure 3-2 Plentiful protocol information from BusXpert

Figure 3-3 provides another example of SATA command analysis. You can see that the host issued the COMWAKE command after around 5 seconds, and the drive acknowledged it almost immediately. At time 5.59 seconds, SMART READ DATA was transferred to the host.
Figure 3-3 SATA command analysis

Although bus analyzers have no difficulty capturing almost all the essential protocol information, the software used to analyze the captured traces includes no advanced metrics of IO properties.

Software Trace Collection

In terms of accuracy, a software trace collector may not be as good as a hardware device. In particular, for applications requiring time precision at the nanosecond level or finer, software may not work well. For example, debugging a disk feature related to the SAS/SATA protocol may call for a bus analyzer, since it may involve the disk drive’s SoC clock. However, disk drive IO performance is generally measured at the millisecond level (precision), which is well within the capability of the modern processors and operating systems inside a common server or workstation.

There are many IO tools available [45, 35]:
  • Linux/Unix: DTrace [46], LTTng, BCC,4 iostat, dstat, tracefs,5 iotop, hdparm, ionice, Ctrace,6 iogrind, POSIX Test Suite, ioprofile, SystemTap, IOR, PCP, and swtrace

  • Windows: Xperf,7 TraceWPP/TraceView/Tracelog/Logman,8 Vtrace, Oracle trace collector, Bus analyzer module,9 and PatchWrx10

However, not all of these tools can provide event details. In fact, general-purpose monitoring tools, like iostat and iotop, cannot provide detailed information on a per-IO basis.
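For instance, iostat only reports statistics aggregated over a sampling interval. A minimal invocation is sketched below; the device name sda and the 1-second interval are arbitrary choices for illustration.

# Extended device statistics for sda, refreshed every second (aggregated, not per-IO)
iostat -dx sda 1

The output columns (r/s, w/s, await, etc.) summarize many IOs per interval, so individual request arrival times and addresses are lost; recovering those requires a per-IO tracer such as blktrace.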

These tools can be divided into two classes: static and dynamic. Static tools view the binary image of a program as a black box that is never modified. Dynamic tools instead rely on binary-level alterations to facilitate the gathering of statistical data from an application. For example, all the Windows tools listed above and iotop/iostat/dstat/hdparm/ionice/iogrind/ioprofile are static tools, while SystemTap, DTrace, and LTTng are dynamic tools. In particular, DTrace and LTTng use a mechanism called probing, which selectively activates instrumentation routines embedded within software at all levels of abstraction, so that performance-related statistics can be obtained not only from an application but also from the various libraries and kernel routines associated with its execution.
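As a sketch of what probing looks like in practice, the following DTrace one-liner attaches an action to the io provider’s start probe, which fires for each block IO issued. The io provider is native to Solaris/illumos; treat its availability and argument layout as an assumption on other ports.

# Print the issuing process and the IO size for every block IO start event
dtrace -n 'io:::start { printf("%s issued %d bytes", execname, args[0]->b_bcount); }'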

Blktrace

Blktrace is a static tool that has been embedded in the Linux kernel since version 2.6.17-rc1. This tool is lightweight and easy to use. It only considers device accesses after the OS/file system cache. When an IO enters the block IO layer (request queue), events are emitted into per-CPU relay channels, and blktrace then captures the events from these channels. More details can be found in Appendix B.
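A typical session, sketched below, captures events for a fixed period and then converts the per-CPU binary files into readable text with blkparse. The device /dev/sdb, the output basename mytrace, and the 30-second window are placeholders.

# Capture 30 seconds of block-layer events from /dev/sdb
sudo blktrace -d /dev/sdb -o mytrace -w 30
# Decode the per-CPU relay files (mytrace.blktrace.*) into a readable log
blkparse -i mytrace -o mytrace.txt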

Dtrace, SystemTap, and LTTng

As mentioned, dynamic tracing tools embed tracing code into running user programs or kernels, without the need for recompilation or a reboot. Since any processor instruction may be patched, such a tool can access virtually any information you need at any place. I will discuss several dynamic tracing tools next.

DTrace [47] originated from Solaris.11 Its development began in 1999, and it became part of the Solaris 10 release. Nowadays, DTrace is open source as a part of OpenSolaris, although it has not been merged into the Linux kernel due to license incompatibility. There exist several ports without proper support. A toolkit based on DTrace that simplifies its use has been developed by B. Gregg.12 But the essential limitation remains unsolved. A few attempts led to the development of another clone of DTrace called DProbes, but it seems to have been unsuccessful.

Therefore, three major Linux players, Red Hat, Hitachi, and IBM, presented another dynamic tracing system for Linux called SystemTap.13 SystemTap is one of the most powerful tracers so far. However, it has to generate and compile a native kernel module for each script it runs, which imposes a huge performance penalty. Ktap14 was further developed to reduce this overhead by using Lua and LuaJIT internally. Another similar implementation is sysdig,15 which is scriptless.
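As an illustration, a SystemTap one-liner can hook the block IO layer through the ioblock tapset. The sketch below assumes the standard tapset variables devname and size are present in your SystemTap release; note that SystemTap will translate the script into C, compile it as a kernel module, and load it, which is exactly the per-script overhead mentioned above.

# Print the issuing process, target device, and request size for each block IO request
stap -e 'probe ioblock.request { printf("%s %s %d\n", execname(), devname, size) }'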

LTTng16 is also a widely used open source tracing framework for Linux. It used static tracing and required kernel recompilation until version 2.0; it currently utilizes the ftrace and kprobe subsystems in the Linux kernel. It helps users understand the interactions among multiple system components: the Linux kernel (via existing or user-defined instrumentation points), C/C++ applications, Java applications, Python applications, or any other user-space application using the LTTng logger. It may outperform other tracers because of its optimized event collection. It also supports numerous event types, including USDT (user-level statically defined tracing).
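A minimal LTTng kernel-tracing session for block IO might look like the sketch below. The tracepoints block_rq_issue and block_rq_complete are standard Linux kernel tracepoints, but verify that they are exposed on your kernel version.

lttng create blkio-session                                     # create a tracing session
lttng enable-event --kernel block_rq_issue,block_rq_complete   # block IO tracepoints
lttng start                                                    # begin recording
sleep 30                                                       # run the workload of interest
lttng stop                                                     # stop recording
lttng view                                                     # print the recorded events
lttng destroy                                                  # tear down the session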

These tools play a significant role when you examine overall system performance instead of only storage IO. In Chapter 9, you will use Ceph as an example to find the performance bottleneck from an overall system view.

Trace Warehouse

Mainly for research purposes, some real/synthetic traces are available online for download. The following are a few examples:

Together with the source code for the analysis tool, I also provide sample trace data on GitHub.

This chapter discussed both hardware and software tools for trace collection. The former generally offer higher precision and more information than the latter, although they are more expensive. However, in many scenarios, precision is only required at the millisecond level, so software-only tools are widely applied in both industry and academia. Note that different tools exist for different purposes; to identify overall system performance, you should employ multiple tools or an integrated tool set.