Callgrind is a call-graph generating profiler that also collects information about processor cache hit rates and branch prediction. Callgrind is useful only when your bottleneck is CPU-bound. It is of little help when heavy I/O or multiple processes are involved.
Valgrind does not require kernel configuration, but it does need debug symbols. It is available as a target package in both the Yocto Project and Buildroot (BR2_PACKAGE_VALGRIND).
You run Callgrind in Valgrind on the target, like so:
# valgrind --tool=callgrind <program>
This produces a file called callgrind.out.<PID>, which you can copy to the host and analyze with callgrind_annotate.
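As a minimal sketch of the capture step, the wrapper below fixes the output file name so that it is easy to find and copy afterwards. The function name profile_with_callgrind and the paths in the comments are hypothetical; --callgrind-out-file is a Callgrind option that overrides the default callgrind.out.<PID> name.

```shell
# Hypothetical wrapper: profile a command and write the result to a known file.
# Usage: profile_with_callgrind <output-file> <program> [args...]
profile_with_callgrind() {
    out_file="$1"
    shift
    # Everything after the output file name is the program and its arguments
    valgrind --tool=callgrind --callgrind-out-file="$out_file" "$@"
}

# Example, on the target (./myprog is a placeholder):
#   profile_with_callgrind /tmp/callgrind.out.myprog ./myprog
# Then, on the host, after copying the file across:
#   callgrind_annotate callgrind.out.myprog
```

Naming the output explicitly avoids having to look up the PID of the profiled process before copying the file.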
The default is to capture data for all the threads together in a single file. If you add the option --separate-threads=yes when capturing, there will be a profile for each thread in files named callgrind.out.<PID>-<thread id>, for example, callgrind.out.122-01, callgrind.out.122-02, and so on.
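With one file per thread, it can be convenient to annotate them in a loop. The following is a sketch only: annotate_threads is a hypothetical helper, and it assumes the per-thread files follow the callgrind.out.<PID>-<thread id> naming described above and have been copied into a single directory on the host.

```shell
# Hypothetical helper: run callgrind_annotate on every per-thread profile
# belonging to one PID.
# Usage: annotate_threads <pid> [directory]
annotate_threads() {
    pid="$1"
    dir="${2:-.}"
    for profile in "$dir"/callgrind.out."$pid"-*; do
        # If nothing matched, the pattern is left unexpanded; skip it
        [ -e "$profile" ] || continue
        echo "== $profile =="
        callgrind_annotate "$profile"
    done
}

# Example: annotate_threads 122 ./profiles
```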
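Callgrind can simulate the processor L1/L2 cache and report on cache misses. Capture the trace with the --simulate-cache=yes option (current Valgrind documentation names this option --cache-sim=yes). L2 misses are much more expensive than L1 misses, so pay attention to code with high D2mr or D2mw counts, that is, L2 data read and write misses. Newer Valgrind releases report these as last-level misses under the names DLmr and DLmw.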
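A cache-simulation run might be wrapped as shown below. This is a sketch under assumptions: profile_cache is a hypothetical name, and it uses --cache-sim=yes, the spelling in the current Callgrind manual, rather than the older --simulate-cache=yes. The event names passed to callgrind_annotate depend on the Valgrind version (D2mr/D2mw in older releases, DLmr/DLmw in newer ones), so check the events line in your own output file first.

```shell
# Hypothetical wrapper: profile a command with cache simulation enabled.
# Usage: profile_cache <program> [args...]
profile_cache() {
    valgrind --tool=callgrind --cache-sim=yes "$@"
}

# Example, on the target (./myprog is a placeholder):
#   profile_cache ./myprog
# On the host, sort the annotated output by last-level data read misses
# (assumes a Valgrind version that reports the DLmr event):
#   callgrind_annotate --sort=DLmr callgrind.out.<PID>
```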