This is the story of the birth of the p4 profiler. The story is written in the second person. If you are a programmer and have ever had to optimize code running on several devices in parallel, then hopefully you will recognize yourself in it.

You are a software engineer working with image processing, covering both computer vision and graphics. One day your client asks for the inevitable feature: can we make it faster? The code base is in C++ and already fairly well optimized, but it contains some trivially parallelizable tasks, so off-loading work to a co-processor comes to mind. The GPU in particular is a good candidate (the platforms you work with all have one). You decide to profile first (because that is what you should do, right?) and use perf, a reliable statistical profiler. The profiler confirms your hunch: some image processing loops are indeed the bottlenecks, and they can easily be off-loaded to the GPU.
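Such a hot loop might look something like the sketch below (purely illustrative, not taken from the actual code base): every iteration touches only its own pixel, so the work maps directly onto one GPU shader invocation per pixel.

#include <cstdint>
#include <cstddef>

// Illustrative per-pixel loop: each iteration is independent of the others,
// which is what makes it trivially parallelizable and a good GPU candidate.
void brighten(std::uint8_t* pixels, std::size_t count, int offset)
{
    for (std::size_t i = 0; i < count; ++i) {
        int value = pixels[i] + offset;
        if (value < 0) value = 0;
        if (value > 255) value = 255;
        pixels[i] = static_cast<std::uint8_t>(value);
    }
}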

You start to rewrite parts of the C++ code as GLSL compute shaders and pick a cross-platform compute and graphics API (namely Vulkan) to do the job. A new bottleneck quickly emerges: the transfer of images back and forth between the CPU and GPU. Nevertheless, you work around it with a hack, remarking that it's premature optimization to optimize while optimizing; you haven't even gotten the workload onto the device yet! So, you sprinkle a few fences in some critical places and perform pixel readbacks using plain memory mapping of device buffers visible from the host. After some weeks of coding, the optimization is completed and the code runs around 10x faster; everyone is happy. What used to be a batch job has turned into a function usable on a mobile device.
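The readback hack boils down to something like the sketch below (a minimal illustration, not the actual code; the handles and sizes are assumed to be created elsewhere): wait on a fence so the GPU is done writing, map the host-visible device memory, and copy the pixels out.

#include <vulkan/vulkan.h>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Wait for the GPU work guarded by `fence`, then read the result back from a
// buffer bound to host-visible (and host-coherent) device memory.
std::vector<std::uint8_t> readbackPixels(VkDevice device, VkFence fence,
                                         VkDeviceMemory hostVisibleMemory,
                                         VkDeviceSize byteSize)
{
    // Block the CPU until the GPU signals the fence.
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

    // Map the device memory into the host address space and copy the pixels out.
    void* mapped = nullptr;
    vkMapMemory(device, hostVisibleMemory, 0, byteSize, 0, &mapped);

    std::vector<std::uint8_t> pixels(static_cast<std::size_t>(byteSize));
    std::memcpy(pixels.data(), mapped, pixels.size());
    vkUnmapMemory(device, hostVisibleMemory);
    return pixels;
}

This is exactly the kind of synchronous wait that will come back to haunt you later.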

Time goes by, and one year later the client asks again: can we make it even faster? You knew this day was inevitable; after all, you implemented a hack to postpone rewriting some of the trickiest algorithms in GLSL, and you still haven't implemented an asynchronous callback framework to properly handle high-performance inter-device communication. Now the client asks not just for running the code on a mobile device in feasible time (a few seconds is usually the limit before a normal person's patience runs out), but for running it in reflex time, that is, close to the 250 milliseconds it takes for a human to react to anything.

So, you are going to finish off what you started and fully optimize the code. Again, you decide to profile it first to confirm your suspicion that the program is choking on fences and memory transfers. Using perf again reveals that the program spends more than 50% of its time in an already super optimized tight loop on the CPU, with the rest spent in various math function calls and other floating point operations that cannot be optimized further without changing the behavior of the algorithm. Yet, running your program through time shows that more than 50% is spent on something else: real time is about twice the user time, with almost no time spent in the kernel. In other words, the process spends about half its wall-clock time blocked on something that never shows up as CPU work. Annoyed by the fact that your most trusted profiler fails to find the real bottleneck (at least according to your gut feeling), you try out some tracing profilers instead. You generally don't like tracing your programs, because it almost never gives accurate profiles (though it does give accurate function call counts!). Trying out a static tracer like gprof, and some dynamic tracers like uftrace, callgrind (part of valgrind) and google-pprof, the reports give the same conclusion as perf: the program spends most of its time in some well designed loops that simply cannot be the real bottleneck! Now you start to tear your hair out a bit and resort to reading various forums on how to profile heterogeneous compute code. You bump into this thread on Stack Overflow (SO), which basically says to just use the debugger. The idea stems from the poor man’s profiler (PMP), and the answer on SO neatly explains why it works using Bayesian statistics.

It seems that there is an undocumented, unholy trick to profile programs. It’s so hacky that the best implementation of it became nothing more than a meme site. Yet, people secretly apply the hack to profile real-world programs… a hack that can be summarized as follows:

  1. Compile your program with debug symbols
  2. Run your program under a debugger
  3. Repeat: interrupt, backtrace, continue

This will collect stack traces of the program in a way that no other profiler seems to do. And it’s accurate, way more accurate than the profiles given by the others. However, due to the low sampling frequency when doing this manually, the statistical sample has a high variance. The PMP is an automation of this using some bash and awk. The implementation described on the PMP website is, however, horribly inefficient. Not only does the continual interruption of the process bias the statistical profile, it slows down the program to an unacceptable extent, making sampling rates over 0.2 Hz unrealistic. How to improve this? You start by reading the GDB manual, which gets exciting around chapter 23: Extending GDB. Here you learn about GDB’s own scripting language (GDB-script?), and you quickly scribble together something like this:

define profile
    while (1)
        shell sleep 0.01
        interrupt
        backtrace
        continue &
    end
end

Usage:

  1. gdb -ex 'source SCRIPT' PROGRAM
  2. run &
  3. profile

However, the interrupt causes GDB to return control to you, so this doesn’t work as an automation. You can still collect stack traces by rapidly pressing enter, which will reschedule the next interrupt 10 ms into the future. Now you remember there is a good tool for automating manual processes: Expect. You rewrite the GDB-script as an automated interaction instead:

while {1} {
    sleep 0.01
    send "interrupt\r"
    expect "stopped."
    send "backtrace\r"
    expect "(gdb)"
    send "continue&\r"
    expect "Continuing."
}

When everything goes alright, this increases the possible sampling rate by at least two orders of magnitude compared with the original PMP! It is only limited by how fast Expect can communicate with GDB and how fast GDB can communicate with the underlying process. You try it out on your optimization problem, and this time the program does get stopped, almost every time, at a fence or a memory transfer operation! Finally, you have managed to get a truthful statistical profile, giving you confidence for the optimization journey you have yet to start.

There are several special cases one needs to consider, including free parameters (such as the sleep time), handling of errors, collecting stack traces with C++ symbols, reporting in a readable format, etc. The resulting scripts become long enough that you think it’s worthwhile to document them and package them as a small tool for others to use. Inspired by the Bayesian description in the SO thread, you want the word probabilistic to be part of the name. You also think the tool will be useful for humans regardless of gender. Hence, you conjure the name poor person’s probabilistic profiler, abbreviated as pppp, compressed as p4. You contribute your findings to the PMP community by sharing the code on GitHub: https://github.com/cdeln/p4.
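A side note on the C++ symbol handling: GDB normally demangles names in its backtraces for you, but if a raw mangled symbol ever slips through into a report, the C++ ABI exposes a demangler you can call yourself. A minimal sketch (the mangled name below is just an example):

#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    // An example mangled name; in practice it would come from a backtrace line.
    const char* mangled = "_ZNSt6vectorIiSaIiEE9push_backERKi";
    int status = 0;
    char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status == 0 && demangled != nullptr) {
        // Prints something like: std::vector<int, std::allocator<int> >::push_back(int const&)
        std::printf("%s\n", demangled);
    }
    std::free(demangled);  // the buffer is allocated with malloc by the demangler
    return 0;
}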