Introducing Ftrace

The kernel function tracer, Ftrace, evolved from work done by Steven Rostedt, and many others, as they were tracking down the causes of high latency. Ftrace appeared in Linux 2.6.27 and has been actively developed since then. There are a number of documents describing kernel tracing in the kernel source in Documentation/trace.

Ftrace consists of a number of tracers that can log various types of activity in the kernel. Here, I am going to talk about the function and function_graph tracers, and about the event tracepoints. In Chapter 14, Real-time Programming, I will revisit Ftrace and use it to show real-time latencies.

The function tracer instruments each kernel function so that calls can be recorded and timestamped. As a matter of interest, it compiles the kernel with the -pg switch to inject the instrumentation, but there the resemblance to gprof ends. The function_graph tracer goes further and records both the entry and exit of functions so that it can create a call graph. The event tracepoints feature also records parameters associated with the call.

Ftrace has a very embedded-friendly user interface that is entirely implemented through virtual files in the debugfs filesystem, meaning that you do not have to install any tools on the target to make it work. Nevertheless, there are other user interfaces if you prefer: trace-cmd is a command-line tool which records and views traces and is available in Buildroot (BR2_PACKAGE_TRACE_CMD) and the Yocto Project (trace-cmd). There is a graphical trace viewer named KernelShark which is available as a package for the Yocto Project.

Preparing to use Ftrace

Ftrace and its various options are configured in the kernel configuration menu. You will need the following as a minimum:

CONFIG_FUNCTION_TRACER in the menu Kernel hacking | Tracers | Kernel Function Tracer

For reasons that will become clear later, you would be well advised to turn on these options as well:

CONFIG_FUNCTION_GRAPH_TRACER in the menu Kernel hacking | Tracers | Kernel Function Graph Tracer
CONFIG_DYNAMIC_FTRACE in the menu Kernel hacking | Tracers | enable/disable function tracing dynamically

Since the whole thing is hosted in the kernel, there is no user space configuration to be done.

Using Ftrace Before you can use Ftrace, you have to mount the debugfs filesystem which, by convention, goes in the /sys/kernel/debug directory:

# mount –t debugfs none /sys/kernel/debug

All the controls for Ftrace are in the /sys/kernel/debug/tracing directory; there is even a mini HOWTO in the README file.

This is the list of tracers available in the kernel:

# cat /sys/kernel/debug/tracing/available_tracers
blk function_graph function nop

The active tracer is shown by current_tracer, which, initially, will be the null tracer, nop.

To capture a trace, select the tracer by writing the name of one of the available_tracers to current_tracer, then enable tracing for a short while, as shown here:

# echo function > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /sys/kernel/debug/tracing/tracing_on
# sleep 1
# echo 0 > /sys/kernel/debug/tracing/tracing_on

In that one second, the trace buffer will have been filled with the details of every function called by the kernel. The format of the trace buffer is plain text, as described in Documentation/trace/ftrace.txt. You can read the trace buffer from the trace file:

# cat /sys/kernel/debug/tracing/trace
# tracer: function
#
# entries-in-buffer/entries-written: 40051/40051   #P:1
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
              sh-361   [000] ...1   992.990646: mutex_unlock <-rb_simple_write
              sh-361   [000] ...1   992.990658: __fsnotify_parent <-vfs_write
              sh-361   [000] ...1   992.990661: fsnotify <-vfs_write
              sh-361   [000] ...1   992.990663: __srcu_read_lock <-fsnotify
              sh-361   [000] ...1   992.990666: preempt_count_add <-__srcu_read_lock
              sh-361   [000] ...2   992.990668: preempt_count_sub <-__srcu_read_lock
              sh-361   [000] ...1   992.990670: __srcu_read_unlock <-fsnotify
              sh-361   [000] ...1   992.990672: __sb_end_write <-vfs_write
              sh-361   [000] ...1   992.990674: preempt_count_add <-__sb_end_write
[...]

You can capture a large number of data points in just one second.

As with profilers, it is difficult to make sense of a flat function list like this. If you select the function_graph tracer, Ftrace captures call graphs like this:

# tracer: function_graph
#
# CPU  DURATION            FUNCTION CALLS
#|     |   |               |   |   |   |
 0) + 63.167 us   |              } /* cpdma_ctlr_int_ctrl */
 0) + 73.417 us   |            } /* cpsw_intr_disable */
 0)               |            disable_irq_nosync() {
 0)               |              __disable_irq_nosync() {
 0)               |                __irq_get_desc_lock() {
 0)   0.541 us    |                  irq_to_desc();
 0)   0.500 us    |                  preempt_count_add();
 0) + 16.000 us   |                }
 0)               |                __disable_irq() {
 0)   0.500 us    |                  irq_disable();
 0)   8.208 us    |                }
 0)               |                __irq_put_desc_unlock() {
 0)   0.459 us    |                  preempt_count_sub();
 0)   8.000 us    |                }
 0) + 55.625 us   |              }
 0) + 63.375 us   |            }

Now you can see the nesting of the function calls, delimited by parentheses, { and }. At the terminating brace, there is a measurement of the time taken in the function, annotated with a plus sign, +, if it takes more than 10 µs, and an exclamation mark, !, if it takes more than 100 µs.

You are often only interested in the kernel activity caused by a single process or thread, in which case you can restrict the trace to the one thread by writing the thread ID to set_ftrace_pid.

Dynamic Ftrace and trace filters

Enabling CONFIG_DYNAMIC_FTRACE allows Ftrace to modify the function trace sites at runtime, which has a couple of benefits. Firstly, it triggers additional build-time processing of the trace function probes which allows the Ftrace subsystem to locate them at boot time and overwrite them with NOP instructions, thus reducing the overhead of the function trace code to almost nothing. You can then enable Ftrace in production or near production kernels with no impact on performance.

The second advantage is that you can selectively enable function trace sites rather than trace everything. The list of functions is put into available_filter_functions; there are several tens of thousands of them. You can selectively enable function traces as you need them by copying the name from available_filter_functions to set_ftrace_filter, and then stop tracing that function by writing the name to set_ftrace_notrace. You can also use wildcards and append names to the list. For example, suppose you are interested in tcp handling:

# cd /sys/kernel/debug/tracing
# echo "tcp*" > set_ftrace_filter
# echo function > current_tracer
# echo 1 > tracing_on

Run some tests and then look at the trace:

# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 590/590   #P:1
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
        dropbear-375   [000] ...1 48545.022235: tcp_poll <-sock_poll
        dropbear-375   [000] ...1 48545.022372: tcp_poll <-sock_poll
        dropbear-375   [000] ...1 48545.022393: tcp_sendmsg <-inet_sendmsg
        dropbear-375   [000] ...1 48545.022398: tcp_send_mss <-tcp_sendmsg
        dropbear-375   [000] ...1 48545.022400: tcp_current_mss <-tcp_send_mss
[...]

set_ ftrace_filter can also contain commands, for example, to start and stop tracing when certain functions are executed. There isn't space to go into these details here but, if you want to find out more, please read the Filter commands section in Documentation/trace/ftrace.txt.

Trace events

The function and function_graph tracers described in the preceding section record only the time at which the function was executed. The trace events feature also records parameters associated with the call, making the trace more readable and informative. For example, instead of just recording that the function kmalloc had been called, a trace event will record the number of bytes requested and the returned pointer. Trace events are used in perf and LTTng as well as Ftrace, but the development of the trace events subsystem was prompted by the LTTng project.

It takes effort from kernel developers to create trace events since each one is different. They are defined in the source code using the TRACE_EVENT macro: there are over a thousand of them now. You can see the list of events available at runtime in /sys/kernel/debug/tracing/available_events. They are named subsystem:function, for example, kmem:kmalloc. Each event is also represented by a subdirectory in tracing/events/[subsystem]/[function], as demonstrated here:

# ls events/kmem/kmalloc
enable   filter   format   id   trigger

The files are as follows:

enable: You write a 1 to this file to enable the event.
filter: This is an expression which must evaluate to true for the event to be traced.
format: This is the format of the event and parameters.
id: This is a numeric identifier.
trigger: This is a command that is executed when the event occurs using the syntax defined in the Filter commands section of Documentation/trace/ftrace.txt. I will show you a simple example involving kmalloc and kfree.

Event tracing does not depend on the function tracers, so begin by selecting the nop tracer:

# echo nop > current_tracer

Next, select the events to trace by enabling each one individually:

# echo 1 > events/kmem/kmalloc/enable
# echo 1 > events/kmem/kfree/enable

You can also write the event names to set_event, as shown here:

# echo "kmem:kmalloc kmem:kfree" > set_event

Now, when you read the trace, you can see the functions and their parameters:

# tracer: nop
#
# entries-in-buffer/entries-written: 359/359   #P:1
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
             cat-382   [000] ...1  2935.586706: kmalloc: call_site=c0554644 ptr=de515a00 bytes_req=384 bytes_alloc=512 gfp_flags=GFP_ATOMIC|GFP_NOWARN|GFP_NOMEMALLOC
             cat-382   [000] ...1  2935.586718: kfree: call_site=c059c2d8 ptr=  (null)

Exactly the same trace events are visible in perf as tracepoint events.