Introducing Ftrace

The kernel function tracer, Ftrace, evolved from work done by Steven Rostedt, and many others, as they were tracking down the causes of high latency. Ftrace appeared in Linux 2.6.27 and has been actively developed since then. There are a number of documents describing kernel tracing in the kernel source in Documentation/trace.

Ftrace consists of a number of tracers that can log various types of activity in the kernel. Here, I am going to talk about the function and function_graph tracers, and about the event tracepoints. In Chapter 14, Real-time Programming, I will revisit Ftrace and use it to show real-time latencies.

The function tracer instruments each kernel function so that calls can be recorded and timestamped. As a matter of interest, it compiles the kernel with the -pg switch to inject the instrumentation, but there the resemblance to gprof ends. The function_graph tracer goes further and records both the entry and exit of functions so that it can create a call graph. The event tracepoints feature also records parameters associated with the call.

Ftrace has a very embedded-friendly user interface that is entirely implemented through virtual files in the debugfs filesystem, meaning that you do not have to install any tools on the target to make it work. Nevertheless, there are other user interfaces if you prefer: trace-cmd is a command-line tool which records and views traces and is available in Buildroot (BR2_PACKAGE_TRACE_CMD) and the Yocto Project (trace-cmd). There is a graphical trace viewer named KernelShark which is available as a package for the Yocto Project.

Ftrace and its various options are configured in the kernel configuration menu. You will need the following as a minimum:

For reasons that will become clear later, you would be well advised to turn on these options as well:

Since the whole thing is hosted in the kernel, there is no user space configuration to be done.

Using Ftrace
Before you can use Ftrace, you have to mount the debugfs filesystem which, by convention, goes in the /sys/kernel/debug directory:

All the controls for Ftrace are in the /sys/kernel/debug/tracing directory; there is even a mini HOWTO in the README file.

This is the list of tracers available in the kernel:

The active tracer is shown by current_tracer, which, initially, will be the null tracer, nop.

To capture a trace, select the tracer by writing the name of one of the available_tracers to current_tracer, then enable tracing for a short while, as shown here:

In that one second, the trace buffer will have been filled with the details of every function called by the kernel. The format of the trace buffer is plain text, as described in Documentation/trace/ftrace.txt. You can read the trace buffer from the trace file:

You can capture a large number of data points in just one second.

As with profilers, it is difficult to make sense of a flat function list like this. If you select the function_graph tracer, Ftrace captures call graphs like this:

Now you can see the nesting of the function calls, delimited by parentheses, { and }. At the terminating brace, there is a measurement of the time taken in the function, annotated with a plus sign, +, if it takes more than 10 µs, and an exclamation mark, !, if it takes more than 100 µs.

You are often only interested in the kernel activity caused by a single process or thread, in which case you can restrict the trace to the one thread by writing the thread ID to set_ftrace_pid.

Enabling CONFIG_DYNAMIC_FTRACE allows Ftrace to modify the function trace sites at runtime, which has a couple of benefits. Firstly, it triggers additional build-time processing of the trace function probes which allows the Ftrace subsystem to locate them at boot time and overwrite them with NOP instructions, thus reducing the overhead of the function trace code to almost nothing. You can then enable Ftrace in production or near production kernels with no impact on performance.

The second advantage is that you can selectively enable function trace sites rather than trace everything. The list of functions is put into available_filter_functions; there are several tens of thousands of them. You can selectively enable function traces as you need them by copying the name from available_filter_functions to set_ftrace_filter, and then stop tracing that function by writing the name to set_ftrace_notrace. You can also use wildcards and append names to the list. For example, suppose you are interested in tcp handling:

Run some tests and then look at the trace:

set_ ftrace_filter can also contain commands, for example, to start and stop tracing when certain functions are executed. There isn't space to go into these details here but, if you want to find out more, please read the Filter commands section in Documentation/trace/ftrace.txt.

The function and function_graph tracers described in the preceding section record only the time at which the function was executed. The trace events feature also records parameters associated with the call, making the trace more readable and informative. For example, instead of just recording that the function kmalloc had been called, a trace event will record the number of bytes requested and the returned pointer. Trace events are used in perf and LTTng as well as Ftrace, but the development of the trace events subsystem was prompted by the LTTng project.

It takes effort from kernel developers to create trace events since each one is different. They are defined in the source code using the TRACE_EVENT macro: there are over a thousand of them now. You can see the list of events available at runtime in /sys/kernel/debug/tracing/available_events. They are named subsystem:function, for example, kmem:kmalloc. Each event is also represented by a subdirectory in tracing/events/[subsystem]/[function], as demonstrated here:

The files are as follows:

Event tracing does not depend on the function tracers, so begin by selecting the nop tracer:

Next, select the events to trace by enabling each one individually:

You can also write the event names to set_event, as shown here:

Now, when you read the trace, you can see the functions and their parameters:

Exactly the same trace events are visible in perf as tracepoint events.