The kernel function tracer, Ftrace
, evolved from work done by Steven Rostedt, and many others, as they were tracking down the causes of high latency. Ftrace
appeared in Linux 2.6.27 and has been actively developed since then. There are a number of documents describing kernel tracing in the kernel source in Documentation/trace
.
Ftrace
consists of a number of tracers that can log various types of activity in the kernel. Here, I am going to talk about the function
and function_graph
tracers, and about the event tracepoints. In Chapter 14, Real-time Programming, I will revisit Ftrace and use it to show real-time latencies.
The function
tracer instruments each kernel function so that calls can be recorded and timestamped. As a matter of interest, it compiles the kernel with the -pg
switch to inject the instrumentation, but there the resemblance to gprof ends. The function_graph
tracer goes further and records both the entry and exit of functions so that it can create a call graph. The event tracepoints feature also records parameters associated with the call.
Ftrace
has a very embedded-friendly user interface that is entirely implemented through virtual files in the debugfs
filesystem, meaning that you do not have to install any tools on the target to make it work. Nevertheless, there are other user interfaces if you prefer: trace-cmd
is a command-line tool which records and views traces and is available in Buildroot (BR2_PACKAGE_TRACE_CMD
) and the Yocto Project (trace-cmd
). There is a graphical trace viewer named KernelShark which is available as a package for the Yocto Project.
Ftrace
and its various options are configured in the kernel configuration menu. You will need the following as a minimum:
For reasons that will become clear later, you would be well advised to turn on these options as well:
Since the whole thing is hosted in the kernel, there is no user space configuration to be done.
Using Ftrace
Before you can use Ftrace,
you have to mount the debugfs
filesystem which, by convention, goes in the /sys/kernel/debug
directory:
# mount –t debugfs none /sys/kernel/debug
All the controls for Ftrace
are in the /sys/kernel/debug/tracing
directory; there is even a mini HOWTO
in the README
file.
This is the list of tracers available in the kernel:
# cat /sys/kernel/debug/tracing/available_tracers blk function_graph function nop
The active tracer is shown by current_tracer
, which, initially, will be the null tracer, nop
.
To capture a trace, select the tracer by writing the name of one of the available_tracers
to current_tracer
, then enable tracing for a short while, as shown here:
# echo function > /sys/kernel/debug/tracing/current_tracer # echo 1 > /sys/kernel/debug/tracing/tracing_on # sleep 1 # echo 0 > /sys/kernel/debug/tracing/tracing_on
In that one second, the trace buffer will have been filled with the details of every function called by the kernel. The format of the trace buffer is plain text, as described in Documentation/trace/ftrace.txt
. You can read the trace buffer from the trace
file:
# cat /sys/kernel/debug/tracing/trace # tracer: function # # entries-in-buffer/entries-written: 40051/40051 #P:1 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | sh-361 [000] ...1 992.990646: mutex_unlock <-rb_simple_write sh-361 [000] ...1 992.990658: __fsnotify_parent <-vfs_write sh-361 [000] ...1 992.990661: fsnotify <-vfs_write sh-361 [000] ...1 992.990663: __srcu_read_lock <-fsnotify sh-361 [000] ...1 992.990666: preempt_count_add <-__srcu_read_lock sh-361 [000] ...2 992.990668: preempt_count_sub <-__srcu_read_lock sh-361 [000] ...1 992.990670: __srcu_read_unlock <-fsnotify sh-361 [000] ...1 992.990672: __sb_end_write <-vfs_write sh-361 [000] ...1 992.990674: preempt_count_add <-__sb_end_write [...]
You can capture a large number of data points in just one second.
As with profilers, it is difficult to make sense of a flat function list like this. If you select the function_graph
tracer, Ftrace captures call graphs like this:
# tracer: function_graph # # CPU DURATION FUNCTION CALLS #| | | | | | | 0) + 63.167 us | } /* cpdma_ctlr_int_ctrl */ 0) + 73.417 us | } /* cpsw_intr_disable */ 0) | disable_irq_nosync() { 0) | __disable_irq_nosync() { 0) | __irq_get_desc_lock() { 0) 0.541 us | irq_to_desc(); 0) 0.500 us | preempt_count_add(); 0) + 16.000 us | } 0) | __disable_irq() { 0) 0.500 us | irq_disable(); 0) 8.208 us | } 0) | __irq_put_desc_unlock() { 0) 0.459 us | preempt_count_sub(); 0) 8.000 us | } 0) + 55.625 us | } 0) + 63.375 us | }
Now you can see the nesting of the function calls, delimited by parentheses, {
and }
. At the terminating brace, there is a measurement of the time taken in the function, annotated with a plus sign, +
, if it takes more than 10 µs
, and an exclamation mark, !
, if it takes more than 100 µs
.
You are often only interested in the kernel activity caused by a single process or thread, in which case you can restrict the trace to the one thread by writing the thread ID to set_ftrace_pid
.
Enabling CONFIG_DYNAMIC_FTRACE
allows Ftrace to modify the function trace
sites at runtime, which has a couple of benefits. Firstly, it triggers additional build-time processing of the trace function probes which allows the Ftrace subsystem to locate them at boot time and overwrite them with NOP instructions, thus reducing the overhead of the function trace code to almost nothing. You can then enable Ftrace in production or near production kernels with no impact on performance.
The second advantage is that you can selectively enable function trace sites
rather than trace everything. The list of functions is put into available_filter_functions
; there are several tens of thousands of them. You can selectively enable function traces as you need them by copying the name from available_filter_functions
to set_ftrace_filter
, and then stop tracing that function by writing the name to set_ftrace_notrace
. You can also use wildcards and append names to the list. For example, suppose you are interested in tcp
handling:
# cd /sys/kernel/debug/tracing # echo "tcp*" > set_ftrace_filter # echo function > current_tracer # echo 1 > tracing_on
Run some tests and then look at the trace:
# cat trace # tracer: function # # entries-in-buffer/entries-written: 590/590 #P:1 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | dropbear-375 [000] ...1 48545.022235: tcp_poll <-sock_poll dropbear-375 [000] ...1 48545.022372: tcp_poll <-sock_poll dropbear-375 [000] ...1 48545.022393: tcp_sendmsg <-inet_sendmsg dropbear-375 [000] ...1 48545.022398: tcp_send_mss <-tcp_sendmsg dropbear-375 [000] ...1 48545.022400: tcp_current_mss <-tcp_send_mss [...]
set_ ftrace_filter
can also contain commands, for example, to start and stop tracing when certain functions are executed. There isn't space to go into these details here but, if you want to find out more, please read the Filter commands section in Documentation/trace/ftrace.txt
.
The function and function_graph
tracers described in the preceding section record only the time at which the function was executed. The trace events feature also records parameters associated with the call, making the trace more readable and informative. For example, instead of just recording that the function kmalloc
had been called, a trace event will record the number of bytes requested and the returned pointer. Trace events are used in perf and LTTng as well as Ftrace, but the development of the trace events subsystem was prompted by the LTTng project.
It takes effort from kernel developers to create trace events since each one is different. They are defined in the source code using the TRACE_EVENT
macro: there are over a thousand of them now. You can see the list of events available at runtime in /sys/kernel/debug/tracing/available_events
. They are named subsystem:function
, for example, kmem:kmalloc
. Each event is also represented by a subdirectory in tracing/events/[subsystem]/[function]
, as demonstrated here:
# ls events/kmem/kmalloc enable filter format id trigger
The files are as follows:
enable
: You write a 1
to this file to enable the event.filter
: This is an expression which must evaluate to true for the event to be traced.format
: This is the format of the event and parameters.id
: This is a numeric identifier.trigger
: This is a command that is executed when the event occurs using the syntax defined in the Filter commands section of Documentation/trace/ftrace.txt
. I will show you a simple example involving kmalloc
and kfree
.Event tracing does not depend on the function tracers, so begin by selecting the nop
tracer:
# echo nop > current_tracer
Next, select the events to trace by enabling each one individually:
# echo 1 > events/kmem/kmalloc/enable # echo 1 > events/kmem/kfree/enable
You can also write the event names to set_event
, as shown here:
# echo "kmem:kmalloc kmem:kfree" > set_event
Now, when you read the trace, you can see the functions and their parameters:
# tracer: nop # # entries-in-buffer/entries-written: 359/359 #P:1 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | cat-382 [000] ...1 2935.586706: kmalloc: call_site=c0554644 ptr=de515a00 bytes_req=384 bytes_alloc=512 gfp_flags=GFP_ATOMIC|GFP_NOWARN|GFP_NOMEMALLOC cat-382 [000] ...1 2935.586718: kfree: call_site=c059c2d8 ptr= (null)
Exactly the same trace events are visible in perf as tracepoint events.