While putting together a workload-intensive cloud task, I realized that there are two fundamental properties of concurrent systems that you can optimize for:

  • Throughput
  • Latency

and these are addressed with processes and threads respectively. That's it: if you want to handle as much data as possible over time (i.e. high throughput), you should use processes. On the other hand, if you need the result returned as quickly as possible from the time of dispatch, you should use threads. Now, many high-performance libraries out there (OpenCV, PyTorch, …) are multi-threaded by default. Turning that off and manually splitting the data into shards that can be processed in parallel by different processes has given me a huge increase in throughput in some cloud tasks (the effect depends on how many cores you run on; the improvement grows with the core count!). I find this very interesting, because it means that many of the optimizations done by these libraries are actually counterproductive, and I do not think many developers know about it!

OpenCV can use different parallelization backends, and multi-threading behaviour can be configured through cv.setNumThreads for some of them (but not all!). In particular, parallelization done by OpenMP is not affected by this call… 1 To be sure your code runs sequentially, you can set various environment variables such as OMP_NUM_THREADS=1. Note that other libraries such as PyTorch, TensorFlow, etc. have their own APIs and sets of environment variables for configuring parallelization.
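Something along these lines pins the common backends to a single thread; which variables actually matter depends on how your particular builds were configured, so treat it as a sketch rather than an exhaustive list:

import os

# These variables are typically read when the respective runtime initializes,
# so set them as early as possible (before the heavy imports is safest).
os.environ['OMP_NUM_THREADS'] = '1'       # OpenMP
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # OpenBLAS
os.environ['MKL_NUM_THREADS'] = '1'       # Intel MKL

import cv2 as cv
cv.setNumThreads(1)  # OpenCV's own thread pool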

To illustrate the effect, let us write a Python script that performs the fairly compute-intensive task of extracting SIFT 2 feature descriptors from a set of images. The OpenCV implementation of SIFT is multi-threaded, while multi-processing is done with Python's built-in multiprocessing library. The script is written so that we can tune the number of processes and threads individually.

import os
import argparse
import multiprocessing as mp

import cv2 as cv


def split(xs, n):
    # Split xs into n shards of (nearly) equal size.
    step = len(xs) / n
    inds = [round(i * step) for i in range(n + 1)]
    return [xs[i:j] for i, j in zip(inds[:-1], inds[1:])]


def concat(xs):
    # Flatten a list of lists.
    return [y for x in xs for y in x]


def pmap(func, args, num_processes):
    # Shard the arguments and hand each shard to its own worker process.
    shards = split(args, num_processes)
    with mp.Pool(num_processes) as pool:
        return concat(pool.map(func, shards))


def compute(paths):
    sift = cv.SIFT_create()
    result = []
    for path in paths:
        image = cv.imread(path)
        _, descriptors = sift.detectAndCompute(image, None)
        result.append(descriptors)
    return result


def run(args):
    # Limit intra-process parallelism before the worker pool is forked.
    cv.setNumThreads(args.threads)
    os.environ['OMP_NUM_THREADS'] = str(args.threads)
    paths = [os.path.join(args.folder, filename) for filename in os.listdir(args.folder)]
    pmap(compute, paths, args.processes)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('folder')
    parser.add_argument('--threads', type=int, default=1)
    parser.add_argument('--processes', type=int, default=1)
    args = parser.parse_args()
    run(args)


if __name__ == '__main__':
    main()

I collected a short 9-second 1080 x 1920 resolution video @ 30 FPS on my phone and extracted the frames into a folder with FFMPEG 3. Maxing out the number of threads in one process on my 12-core machine by running

/usr/bin/time -f 'Time: %es\nCPU : %P' python3 SCRIPT FOLDER --threads $(nproc)

gives me

Time: 68.19s
CPU : 251%

which is a mere ~21% CPU utilization (1200% means 100% utilization on a 12-core machine)! In contrast, maxing out the number of processes, each with a single thread, by running

/usr/bin/time -f 'Time: %es\nCPU : %P' python3 SCRIPT FOLDER --processes $(nproc)

gives me

Time: 20.50s
CPU : 1103%

which is about 90% CPU utilization and around a 3x speed-up compared with the multi-threaded version!

To increase confidence in my hypothesis, I rewrote the original Python snippet in C++, just to rule out potential scripting inefficiencies

#include <cstdlib>
#include <filesystem>
#include <iostream>

#include <opencv2/imgcodecs.hpp>
#include <opencv2/features2d.hpp>

using namespace std;
using namespace cv;

namespace fs = filesystem;

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        cerr << "Usage: " << argv[0] << " folder" << endl;
        return EXIT_FAILURE;
    }

    auto folder = argv[1];
    auto sift = SIFT::create();
    vector<KeyPoint> keypoints;
    vector<Mat> result;

    for (const auto &entry : fs::directory_iterator(folder))
    {
        if (entry.is_regular_file())
        {
            Mat image = imread(entry.path());
            Mat descriptors;
            sift->detectAndCompute(image, noArray(), keypoints, descriptors);
            result.push_back(descriptors);
        }
    }
    return EXIT_SUCCESS;
}

Yes, I am using namespaces; it is completely OK in a source file (but please never do that in your headers!). I compile the code with the following command

bear -- g++ main.cxx -std=c++17 -I /usr/local/include/opencv4/ -L /usr/local/lib -lopencv_{core,imgcodecs,features2d}

The bear 4 command wraps a compile command and produces a compile_commands.json for your language server as a side effect. Very nice tool.
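The generated file is just a JSON compilation database with roughly one entry per translation unit; abbreviated and with illustrative paths it looks something like this:

[
  {
    "directory": "/home/user/project",
    "arguments": ["/usr/bin/g++", "-std=c++17", "-I", "/usr/local/include/opencv4/", "-c", "main.cxx"],
    "file": "main.cxx"
  }
]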

Statistics from running the compiled executable are

Time: 45.47s
CPU : 316%

which is about 50% faster than the original Python script, but CPU utilization is still quite poor. I wanted to optimize further by using the brand new overload of cv::imread which takes an output parameter (I am surprised that this was introduced as late as 4.10), but skipped it to keep the code similar to the Python version, plus I don't have the energy to upgrade my OpenCV installation tonight…

The pmap trick is of course only easy to apply to trivially parallelizable tasks like the feature-extraction example in this post. For tasks requiring fine-grained synchronization and data exchange, threads have the advantage (it would be interesting to explore the boundary between threads and processes more!).
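One way to make that exploration convenient is to let the caller choose the executor. Here is a sketch using the standard library's concurrent.futures (the helper name pmap_with is made up for illustration; split and concat are the ones from the script above, and this is not what the script itself does):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def pmap_with(executor_cls, func, args, num_workers):
    # Same sharding as before, but the executor decides whether the shards
    # are handled by worker processes or by worker threads.
    shards = split(args, num_workers)
    with executor_cls(max_workers=num_workers) as executor:
        return concat(executor.map(func, shards))

# descriptors = pmap_with(ProcessPoolExecutor, compute, paths, 12)  # throughput
# descriptors = pmap_with(ThreadPoolExecutor, compute, paths, 12)   # shared memory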

Memory consumption when using multi-processing tends to be much higher than for the multi-threaded version. In the example above, the multi-threaded version consumes about 3 GB and the multi-processing version about 6 GB. This is non-negligible, and for very memory-heavy operations such as running deep networks you cannot max out on processes but have to settle for a compromise between threads and processes.
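With the script above, such a compromise is just a matter of picking a split whose product matches the core count, for example three processes with four threads each on the 12-core machine:

/usr/bin/time -f 'Time: %es\nCPU : %P' python3 SCRIPT FOLDER --processes 3 --threads 4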

It is also worth noting that using threads seems to be more energy efficient; that is, the total CPU time (wall-clock time multiplied by CPU utilization) is lower: roughly 68.19 s × 2.51 ≈ 171 CPU-seconds for the multi-threaded run versus 20.50 s × 11.03 ≈ 226 CPU-seconds for the multi-processing run. However, this observation might just be because the returns from using more cores are not linear and decay with CPU utilization. It would be interesting to see whether the multi-threaded example could max out all CPUs and give even faster running times than the multi-processing version!

As a final note, the “pmap trick” is nothing but “map reduce” 5 on a single host, with reduction set to concatenation (did I mention that map actually is just reduce?).
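In Python terms, that remark just says map is a fold whose accumulator is a list; a minimal illustration:

from functools import reduce

def map_via_reduce(func, xs):
    # Build the mapped list by folding over xs, appending func(x) at each step.
    return reduce(lambda acc, x: acc + [func(x)], xs, [])

assert map_via_reduce(str, [1, 2, 3]) == ['1', '2', '3']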

The conclusion is that designing your code so that you can choose the balance between these two types of parallelization can greatly improve the performance of your tasks.

  1. cv::setNumThreads 

  2. Scale Invariant Feature Transform: the most successful and influential computer vision feature extraction algorithm, a must-have in any computer visionist's toolbox 

  3. FFMPEG: Swiss army knife for video processing 

  4. Build EAR 

  5. MapReduce: Distributed programming model