Callgrind

Callgrind is a call-graph-generating profiler that can also collect information about processor cache hit rates and branch prediction. It is only useful if your bottleneck is the CPU. It does not help if the problem involves heavy I/O or interactions between multiple processes.

Callgrind is one of the tools in Valgrind. Valgrind does not require any kernel configuration, but it does need debug symbols. It is available as a target package in both the Yocto Project and Buildroot (BR2_PACKAGE_VALGRIND).

You run Callgrind in Valgrind on the target, like so:

# valgrind --tool=callgrind <program>

This produces a file called callgrind.out.<PID>, which you can copy to the host and analyze with callgrind_annotate.
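As a rough sketch of the round trip, the two steps could be wrapped in small shell functions, one for the target and one for the host. The function names are made up for illustration. The --auto=yes option asks callgrind_annotate to annotate source lines, so the host needs access to the program's sources.

```shell
# Sketch only: the function names are hypothetical.

# On the target: run a program under Callgrind.
# Writes callgrind.out.<PID> into the current directory.
profile_on_target() {
    valgrind --tool=callgrind "$@"
}

# On the host: print the per-function summary, then annotate the
# source files that are named in the profile.
annotate_on_host() {
    callgrind_annotate --auto=yes "$1"
}
```

For example, you might run profile_on_target ./myprog on the target (myprog being a placeholder), copy the resulting file across, and then run annotate_on_host callgrind.out.<PID> on the host.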

The default is to capture data for all the threads together in a single file. If you add the --separate-threads=yes option when capturing, there will be a separate profile for each thread, in files named callgrind.out.<PID>-<thread id>, for example, callgrind.out.122-01, callgrind.out.122-02, and so on.
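With one file per thread, it can be convenient to annotate them in a loop. The following is a minimal sketch that assumes the per-thread files follow the naming shown above and sit in the current directory. The helper name annotate_threads is hypothetical.

```shell
# Sketch only: annotate_threads is a hypothetical helper name.
# Usage: annotate_threads <PID>   (run in the directory holding the profiles)
annotate_threads() {
    pid=$1
    for f in callgrind.out."$pid"-*; do
        # Skip the unexpanded pattern when no per-thread files exist.
        [ -e "$f" ] || continue
        echo "== $f =="
        callgrind_annotate "$f"
    done
}
```

Calling annotate_threads 122 would print one annotated report per thread of process 122, each preceded by the name of the file it came from.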

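Callgrind can simulate the processor L1/L2 cache and report on cache misses. Capture the trace with the --simulate-cache=yes option (documented as --cache-sim=yes in recent Valgrind releases). L2 misses are much more expensive than L1 misses, so pay attention to code with high D2mr or D2mw counts, which are the L2 data read and data write misses. Recent releases model a last-level cache instead and name these counters DLmr and DLmw.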
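A cache-simulation run might be sketched as follows. The function names are hypothetical, and the event name passed to --sort is an assumption that depends on the Valgrind version: the "events:" line near the top of the profile file lists the counters that were actually recorded.

```shell
# Sketch only: the function names are hypothetical.

# On the target: profile with the cache simulator switched on.
# --simulate-cache=yes is the spelling used in the text.
profile_cache_on_target() {
    valgrind --tool=callgrind --simulate-cache=yes "$@"
}

# On the host: list functions ordered by one cache-miss counter.
# $1 = profile file, $2 = event name (defaults to D2mr, as in the text).
worst_cache_misses() {
    callgrind_annotate --sort="${2:-D2mr}" "$1"
}
```

For instance, worst_cache_misses callgrind.out.<PID> D2mw would order the report by L2 data write misses, provided that counter appears in the profile.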