- Introduction
- Installing dispenso
- Building dispenso
- Testing dispenso
- Benchmarking dispenso
- Known Issues
- Licensing
Latin: To dispense, distribute, manage
Dispenso is a library for working with sets of tasks in parallel. It provides mechanisms for thread pools, task sets, parallel for loops, futures, pipelines, and more. Dispenso is a well-tested C++14 library designed to have minimal dependencies (some dependencies are required for the tests and benchmarks), and designed to be clean with compiler sanitizers (ASAN, TSAN). Dispenso is currently being used in hundreds of projects and hundreds of C++ files at Meta (formerly Facebook). Dispenso also aims to avoid major disruption at every release. Releases will be made such that major versions are created when a backward incompatibility is introduced, and minor versions are created when substantial features have been added or bugs have been fixed, and the aim would be to only very rarely bump major versions. That should make the project suitable for use from main
branch, or if you need a harder requirement, you can base code on a specific version.
Dispenso has the following features
AsyncRequest
: Asynchronous request/response facilities for lightweight constrained message passingCompletionEvent
: A notifiable event type with wait and timed waitConcurrentObjectArena
: An object arena for fast allocation of objects of the same typeConcurrentVector
: A vector-like type with a superset of the TBB concurrent_vector APIfor_each
: Parallel version ofstd::for_each
andstd::for_each_n
Future
: A futures implementation that strives for interface similarity with std::experimental::future, but with dispenso types as backing thread poolsGraph
: Utilities for fast and lightweight execution of task graphs. Task graphs may be reused to avoid setup costs, and subgraph execution is automated based on updated and downstream nodes.OnceFunction
: A lightweight function-like interface forvoid()
functions that can only be called onceparallel_for
: Parallel for loops over indices that can be blocking or non-blockingpipeline
: Parallel pipelining of workloadsPoolAllocator
: A pool allocator with facilities to supply a backing allocation/deallocation, making this suitable for use with e.g. CUDA allocationResourcePool
: A type that acts similar to a semaphore around guarded objectsRWLock
: A minimal reader-writer spin lock that outperforms std::shared_mutex under low write contentionSmallBufferAllocator
: An allocator that enables fast concurrent allocation for temporary objectsTaskSet
: Sets of tasks that can be waited on togetherThreadPool
: The backing thread pool type used by many other dispenso features
TBB has significant overlap with dispenso, though TBB has more functionality, and is likely to continue having more utilities for some time. We chose to build and use dispenso for a few primary reasons like
- TBB is built on older C++ standards, and doesn't deal well with compiler sanitizers
- TBB lacks an interface for futures
- We wanted to ensure we could control performance and availability on non-Intel hardware
Dispenso is faster than TBB in some scenarios and slower in other scenarios. For example, with parallel for loops, dispenso tends to be faster for small and medium loops, and on-par with TBB for large loops. When many loops can run independently of one another, dispenso shines and can perform significantly better than TBB. Anecdotally speaking, we have seen one workload with independent parallel for loops at Meta where porting to dispenso lead to a 50% speedup.
OpenMP has very simple semantics for parallelizing simple for loops, but gets quite complex for more complicated loops and constructs. OpenMP wasn't as portable in the past, though the number of compiler supporting it is increasing. If not used carefully, nesting of OpenMP constructs inside of other threads (e.g. nested parallel for) can lead to large number of threads, which can exhaust machines.
Performance-wise, dispenso tends to outperform simple OpenMP for loops for medium and large workloads, but OpenMP has a significant advantage for small loops. This is because it has direct compiler support and can understand the cost of the code it is running. This allows it to forgo running in parallel if the tradeoffs aren't worthwhile.
Folly is a library from Meta that has several concurrency utilities including thread pools and futures. The library has very good support for new C++ coroutines functionality, and makes writing asynchronous code (e.g. I/O) easy and performant. Folly as a library can be tricky to work with. For example, the forward/backward compatibility of code isn't a specific goal of the project.
Folly does not have a parallel loop concept, nor task sets and parallel pipelines. When comparing Folly's futures against dispenso's, dispenso tries to maintain an API that is closely matched to a combination of std::experimental::future and std::experimental::shared_future (dispenso's futures are all shared). Additionally, for compute-bound applications, dispenso's futures tend to be much faster and lighter-weight than Folly's.
TaskFlow is a library that seems to have been initially designed for parallel execution of task graphs. It has some similarities with Dispenso: It has a backing thread pool, it has a parallel_for-like functionality, task graphs, and pipelines. There are also utilities in each library that don't overlap (yet). TaskFlow task graphs are pretty high performance, and are sometimes faster than dispenso for full graph execution (depending on platform). Dispenso has higher-performance task graph building times and has the ability to run partially-modified task graphs much faster. Dispenso's parallel_for seems to be much lower overhead than TaskFlow's for_each_index (Dispenso is 10x to 100x faster in overhead benchmarks). Dispenso's pipelines are also much faster and much simpler to construct than TaskFlow's in our benchmarks.
We haven't done a strong comparison vs these other mechanisms. GCD is an Apple technology used by many people for Mac and iOS platforms, and there are ports to other platforms (though the mechanism for submitting closures is different). Much of the C++ parallel algorithms work is still TBD, but we would be very interested to enable dispenso to be a basis for parallelization of those algorithms. Additionally, we have interest in enabling dispenso to back the new coroutines interface. We'd be interested in any contributions people would like to make around benchmarking/summarizing other task parallelism libraries, and also integration with C++ parallel algorithms and coroutines.
Dispenso isn't really designed for high-latency task offload, it works best for compute-bound tasks. Using the thread pool for networking, disk, or in cases with frequent TLB misses (really any scenario with kernel context switches) may result in less than ideal performance.
In these kernel context switch scenarios, dispenso::Future
can be used with dispeno::NewThreadInvoker
, which should be roughly equivalent with std::future performance.
If you need async I/O, Folly is likely a good choice (though it still doesn't fix e.g. TLB misses).
Documentation can be found here
Here are some simple examples of what you can do in dispenso. See tests and benchmarks for more examples.
for(size_t j = 0; j < kLoops; ++j) {
vec[j] = someFunction(j);
}
Becomes
dispenso::parallel_for(0, kLoops, [&vec] (size_t j) {
vec[j] = someFunction(j);
});
void randomWorkConcurrently() {
dispenso::TaskSet tasks(dispenso::globalThreadPool());
tasks.schedule([&stateA]() { stateA = doA(); });
tasks.schedule([]() { doB(); });
// Do some work on current thread
tasks.wait(); // After this, A, B done.
tasks.schedule(doC);
tasks.schedule([&stateD]() { doD(stateD); });
} // TaskSet's destructor waits for all scheduled tasks to finish
struct Node {
int val;
std::unique_ptr<Node> left, right;
};
void buildTree(dispenso::ConcurrentTaskSet& tasks, std::unique_ptr<Node>& node, int depth) {
if (depth) {
node = std::make_unique<Node>();
node->val = depth;
tasks.schedule([&tasks, &left = node->left, depth]() { buildTree(tasks, left, depth - 1); });
tasks.schedule([&tasks, &right = node->right, depth]() { buildTree(tasks, right, depth - 1); });
}
}
void buildTreeParallel() {
std::unique_ptr<Node> root;
dispenso::ConcurrentTaskSet tasks(dispenso::globalThreadPool());
buildTree(tasks, root, 20);
tasks.wait(); // tasks would also wait here in destructor if we omitted this line
}
dispenso::Future<size_t> ThingProcessor::processThings() {
auto expensiveFuture = dispenso::async([this]() {
return processExpensiveThing(expensive_);
});
auto futureOfManyCheap = dispenso::async([this]() {
size_t sum = 0;
for (auto &thing : cheapThings_) {
sum += processCheapThing(thing);
}
return sum;
});
return dispenso::when_all(expensiveFuture, futureOfManyCheap).then([](auto &&tuple) {
return std::get<0>(tuple).get() + std::get<1>(tuple).get();
});
}
auto result = thingProc->processThings();
useResult(result.get());
ConcurrentVector<std::unique_ptr<int>> values;
dispenso::parallel_for(
dispenso::makeChunkedRange(0, length, dispenso::ParForChunking::kStatic),
[&values](int i, int end) {
values.grow_by_generator(end - i, [i]() mutable { return std::make_unique<int>(i++); });
});
Binary builds of Dispenso are available for various Linux distributions through their native package managers, as well as for Windows, macOS, and Linux via the Conda package manager. If your distribution or platform is not on the list, see the next section for instructions to build it yourself.
Internally to Meta, we use the Buck build system, but as that relies on a monorepo for relevant dependencies, we do not (yet) ship our BUCK build files. To enable easy use outside of Meta monorepos, we ship a CMake build. Improvements to the CMake build and build files for additional build systems are welcome, as are instructions for building on other platforms, including BSD variants, Windows+Clang, etc...
sudo dnf install cmake
brew install cmake
Install CMake from https://cmake.org/download/
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
make -j
Install Build Tools for Visual Studio. All commands should be run from the Developer Command Prompt.
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
cmake --build . --config Release
Once built, the library can be installed by building the "install" target. Typically on Linux and MacOS, this is done with
make install
On Windows (and works on any platfrom), instead do
cmake --build . --target install
Once installed, a downstream CMake project can be pointed to it by using
CMAKE_PREFIX_PATH
or Dispenso_DIR
, either as an environment variable or
CMake variable. All that is required to use the library is link the imported
CMake target Dispenso::dispenso
, which might look like
find_package(Dispenso REQUIRED)
target_link_libraries(myDispensoApp Dispenso::dispenso)
This brings in all required include paths, library files to link, and any other
properties to the myDispensoApp
target (your library or application).
To keep dependencies to an absolute minimum, we do not build tests or benchmarks by default, but only the core library. Building tests requires GoogleTest.
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT -DDISPENSO_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Release
make -j
ctest
All commands should be run from the Developer Command Prompt.
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT -DDISPENSO_BUILD_TESTS=ON
cmake --build . --config Release
ctest
Dispenso has several benchmarks, and some of these can benchmark against OpenMP, TBB, and/or folly variants. If benchmarks are turned on via -DDISPENSO_BUILD_BENCHMARKS=ON
, the build will attempt to find these libraries, and if found, will enable those variants in the benchmarks. It is important to note that none of these dependencies are dependencies of the dispenso library, but only the benchmark binaries.
The folly variant is turned off by default, because unfortunately it appears to be common to find build issues in many folly releases; note however that the folly code does run and provide benchmark data on our internal Meta platform.
OpenMP should already be available on most platforms that support it (it must be partially built into the compiler after all), but TBB can be had by e.g. sudo dnf install tbb-devel
.
After you have the deps you want, you can build and run:
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT -DDISPENSO_BUILD_BENCHMARKS=ON -DCMAKE_BUILD_TYPE=Release
make -j
- (e.g.)
bin/once_function_benchmark
Not currently supported through CMake.
Here are some limited benchmark results. Unless otherwise noted, these were run on a dual Epyc Rome machine with 128 cores and 256 threads. One benchmark here was repeated on a Threadripper 2990WX with 32 cores and 64 threads.
Some additional notes about the benchmarks: Your mileage may vary based on compiler, OS/platform, and processor. These benchmarks were run with default glibc malloc, but use of tcmalloc or jemalloc can significantly boost performance, especially for ConcurrentVector growth operations (grow_by
and push_back
).
- A subset of dispenso tests are known to fail on 32-bit PPC Mac. If you have access to such a machine and are willing to help debug, it would be appreciated!
- Enable Windows benchmarks through CMake.