If you’re building software tuned for performance (in C++ or a similar language), you probably know this as a golden rule: Avoid memory allocations in your performance-critical code at (almost) all costs. However, sometimes that’s just not feasible and you need to make sure that your allocations are as efficient as possible.
In a recent project of mine (where performance is not of utmost importance, but also not negligible), I not only need to allocate lots of small-ish objects at high frequency, but I also want to use `std::shared_ptr` for memory safety reasons. In this article, I look into how pool allocators (especially `boost::pool_allocator`) can help in this situation.
With a question like this, I think the most useful approach is: Benchmark, benchmark, and benchmark again. Benchmark some test scenarios. Benchmark your final application. Just benchmark it all. Yes, as a developer you usually develop (no pun intended) an intuition for how something affects your memory allocations, cache efficiency, etc. - but a good benchmark usually is the gold standard in my eyes.
Since my project is probably different from yours (and I cannot share the code of my project), benchmarks on the actual project are not meaningful in the context of this article. Instead, I’ll work with a synthetic benchmark designed to measure the performance impact of the allocator used to allocate `std::shared_ptr`s.
To keep performance overhead as low as possible, we’re using a very simple data structure for our benchmark: a singly-linked list. The list pointers are `std::shared_ptr`s. Because we never hold pointers to list elements except for the list heads, deleting a list head ‘cascades’ down the list and causes all the list elements to be deallocated. The performance of allocations can vary with the size of the allocations (and with the difference between the sizes of different allocations), thus our list nodes look like this:
```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

template <std::size_t SIZE>
struct ListNode
{
    constexpr static std::size_t BYTES_PER_STEP = 16;

    std::shared_ptr<ListNode> next;

    ListNode() : next(nullptr) {}

    std::uint8_t data[SIZE * BYTES_PER_STEP];
};
```
This allows us to define list nodes of various sizes in 16-byte increments.
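As a quick sanity check, the resulting node sizes can be verified at compile time. This is only a sketch under the assumption of a typical 64-bit platform where `std::shared_ptr` occupies 16 bytes and the compiler adds no padding here:

```cpp
// Assumption: 64-bit platform, std::shared_ptr is two pointers (16 bytes),
// and the byte array adds exactly SIZE * 16 bytes without extra padding.
static_assert(sizeof(ListNode<1>) == 32, "16-byte shared_ptr + 1 * 16 bytes of data");
static_assert(sizeof(ListNode<4>) == 80, "16-byte shared_ptr + 4 * 16 bytes of data");
```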
Our benchmark works roughly like this: We create 1,000 of these singly-linked lists. In a (not measured) ramp-up phase, we generate a total of 1,000,000 list nodes and insert each into a randomly picked list, so that the lists have an average length of 1,000. We then perform 50,000 iterations, where iteration *i* starts by completely deallocating the *i*-th list and then generates 1,000 fresh list nodes, again inserting each node into a random list. All random decisions (i.e., the list into which each node is inserted) are precomputed before performance measurement starts.
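To make this concrete, here is a minimal sketch of the benchmark structure for a single node size. The names (`push_front`, `run_benchmark`, `targets`, …) are mine and not taken from the actual benchmark code, which additionally interleaves several node sizes as described below:

```cpp
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

constexpr std::size_t NUM_LISTS = 1'000;
constexpr std::size_t NODES_PER_ITERATION = 1'000;
constexpr std::size_t NUM_ITERATIONS = 50'000;

using Node = ListNode<1>; // one of the benchmarked node sizes

// Allocate a fresh node (the allocation strategy under test) and prepend it to a list.
void push_front(std::shared_ptr<Node>& head)
{
    auto node = std::make_shared<Node>();
    node->next = head;
    head = std::move(node);
}

void run_benchmark()
{
    std::vector<std::shared_ptr<Node>> lists(NUM_LISTS);
    std::mt19937 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, NUM_LISTS - 1);

    // Ramp-up phase (not measured): 1,000,000 nodes spread randomly over the lists.
    for (std::size_t n = 0; n < 1'000'000; ++n)
        push_front(lists[pick(rng)]);

    // Precompute all random list choices so the RNG is not part of the measurement.
    std::vector<std::size_t> targets(NUM_ITERATIONS * NODES_PER_ITERATION);
    for (auto& t : targets)
        t = pick(rng);

    // Measured phase: drop one whole list, then insert 1,000 fresh nodes.
    std::size_t next = 0;
    for (std::size_t i = 0; i < NUM_ITERATIONS; ++i) {
        lists[i % NUM_LISTS].reset(); // cascades down the list, deallocating every node
        for (std::size_t n = 0; n < NODES_PER_ITERATION; ++n)
            push_front(lists[targets[next++]]);
    }
}
```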
The benchmark as explained in the paragraph above, i.e., using only list nodes of a single size, would give an unfair advantage to any pool-based allocator. Pool allocators usually work by allocating a larger chunk of memory and then chopping it up into pieces of the pooled object’s size. This means that when objects of multiple sizes are in play, the allocator needs to maintain several of these chopped-up memory regions, which makes its life harder. Thus, we run the whole benchmark for five different sizes, setting `SIZE` from `0` to `4`, resulting (on my machine) in node sizes between 16 bytes (a single `std::shared_ptr`) and 80 bytes (the `std::shared_ptr` plus `4 * 16` bytes of data). We maximally interleave the allocations and deallocations of objects of different sizes by first allocating (and inserting) the first node for each of the sizes `0`, `1`, `2`, `3` and `4`, then allocating (and inserting) the second node for each size, and so on.
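A rough sketch of what this interleaving looks like (again with hypothetical names, not the actual benchmark code): every “round” allocates exactly one node of every size before moving on to the next precomputed random target.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// One set of singly-linked lists per benchmarked node size.
struct BenchmarkState
{
    std::vector<std::shared_ptr<ListNode<0>>> size0_lists;
    std::vector<std::shared_ptr<ListNode<1>>> size1_lists;
    std::vector<std::shared_ptr<ListNode<2>>> size2_lists;
    std::vector<std::shared_ptr<ListNode<3>>> size3_lists;
    std::vector<std::shared_ptr<ListNode<4>>> size4_lists;
};

// Allocate one node of the given size and prepend it to the chosen list.
template <typename Node>
void insert_one(std::vector<std::shared_ptr<Node>>& lists, std::size_t target)
{
    auto node = std::make_shared<Node>();
    node->next = lists[target];
    lists[target] = std::move(node);
}

// One interleaved round: one node of every size, in order, so that allocations
// of different sizes alternate as much as possible.
void insert_round(BenchmarkState& state, std::size_t target)
{
    insert_one(state.size0_lists, target);
    insert_one(state.size1_lists, target);
    insert_one(state.size2_lists, target);
    insert_one(state.size3_lists, target);
    insert_one(state.size4_lists, target);
}
```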
For details, you can also have a look at the code I published (see the GitHub link at the end of this article).
We want to benchmark different ways of allocating `std::shared_ptr`s. The first two candidates are obvious:
```cpp
// (1) The most inefficient way
auto ptr = std::shared_ptr<ListNode>(new ListNode());

// (2) The usual way
auto ptr = std::make_shared<ListNode>();
```
I am mostly interested in evaluating Boost.Pool’s `boost::fast_pool_allocator`, so the third variant tested is:
```cpp
// (3) boost::fast_pool_allocator

static boost::fast_pool_allocator<void> alloc;

auto ptr = std::allocate_shared<ListNode>(alloc);
```
Why `fast_pool_allocator<void>`? Why not `fast_pool_allocator<ListNode>`?

The pool allocator used to perform the actual allocation must be specialized on the class that is actually going to be allocated. However - we do not know what that class will be! A `std::allocate_shared<Foo>()` call does not allocate a `Foo` object, but some other (implementation-defined) class that contains both a `Foo` and the control block for the `std::shared_ptr`.

Fortunately, the C++ allocator system solves that problem for us: When an allocator for an unfitting type (like `void`…) is used to allocate some other type (like our implementation-defined type), the `std::allocator_traits<Allocator>::rebind_alloc<T>` template is used to create a new, “fitting” allocator.
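To illustrate the mechanism (this is a sketch of the idea, not what any standard library literally does internally; `ControlBlockAndObject` and `sketch_allocate_shared` are made-up names):

```cpp
#include <memory>
#include <boost/pool/pool_alloc.hpp>

// Stand-in for the implementation-defined type that allocate_shared really allocates:
// the payload object and the shared_ptr control block in one piece of memory.
template <typename T>
struct ControlBlockAndObject
{
    long use_count;
    long weak_count;
    T object;
};

template <typename T, typename Allocator>
void sketch_allocate_shared(Allocator alloc)
{
    // Rebind the user-supplied allocator (here specialized on void) to the type
    // that actually needs to be allocated.
    using Rebound = typename std::allocator_traits<Allocator>::template rebind_alloc<ControlBlockAndObject<T>>;
    Rebound rebound(alloc);

    auto* block = std::allocator_traits<Rebound>::allocate(rebound, 1);
    // ... construct object and control block in place, hand them to the shared_ptr ...
    std::allocator_traits<Rebound>::deallocate(rebound, block, 1);
}

// Usage: exactly like in variant (3), the allocator is specialized on void.
void example()
{
    boost::fast_pool_allocator<void> alloc;
    sketch_allocate_shared<int>(alloc);
}
```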
`boost::fast_pool_allocator` vs. `boost::pool_allocator`

If you look at the Boost.Pool allocators documentation, you might wonder: Why is there a `boost::pool_allocator` and a `boost::fast_pool_allocator`? Why would I ever want a non-fast allocator?

The difference seems to be: `boost::fast_pool_allocator<T>` is optimized to allocate single objects of type `T`, while `boost::pool_allocator<T>` seems to be optimized for the allocation of whole chunks of `T`s, like a `std::vector<T>` would need - see the `fast_pool_allocator` documentation here.
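In short (my reading of the documentation, as a rough guideline rather than a hard rule), you pick the allocator according to your allocation pattern, for example like this:

```cpp
#include <vector>
#include <boost/pool/pool_alloc.hpp>

// fast_pool_allocator: tuned for many single-object allocations,
// as performed by std::allocate_shared.
static boost::fast_pool_allocator<void> single_object_alloc;

// pool_allocator: tuned for allocating contiguous chunks of T,
// as a std::vector<T> does when it grows.
std::vector<int, boost::pool_allocator<int>> pooled_vector;
```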
`malloc()` implementations

All three variants laid out above are (some more, some less) dependent on the performance of `malloc` / `new`, which depends on the standard library implementation. Thus, we try both of the large free implementations: GCC’s `libstdc++` implementation and LLVM/Clang’s `libc++` implementation.

Additionally, there are ‘external’ `malloc` / `new` implementations, which are not bound to a certain standard library, but override the `malloc` / `new` functions provided by the standard library. We try two such implementations (both with the GCC-based build): `tcmalloc` and `jemalloc`.
I’ve run the experiments on a machine with an AMD Ryzen 7 5800H and 32 GB of dual-channel DDR4 memory at 3200 MHz, running Ubuntu 22.04. The system had plenty of free memory at the time.
Now for the part we’ve all been waiting for. What are the benchmark results?
The following table shows the time in seconds it took to execute the iterations explained above. The columns correspond to the three allocation variants from above, and the rows to the different standard library / `malloc` implementations.
| | `new` | `make_shared` | `fast_pool_allocator` |
|---|---|---|---|
| GCC | 69.0 | 38.1 | 34.0 |
| Clang + libstdc++ | 69.2 | 38.6 | 35.7 |
| Clang + libc++ | 76.5 | 40.8 | 40.4 |
| GCC + tcmalloc | 57.2 | 30.3 | 33.8 |
| GCC + jemalloc | 86.7 | 42.7 | 33.9 |
The most important insights seem to be:

- `new` is slower than `make_shared` by about a factor of 1.8. This is consistent with the fact that `std::shared_ptr<Foo>(new Foo())` needs not one but two allocations. The most interesting takeaway from this is: Since the slowdown is close to two, we may assume that most of the work done in our benchmarks is indeed memory (de)allocation, and everything else that happens (e.g. reference counting inside the `std::shared_ptr`) does not skew our results too heavily.
- The choice of standard library does not make a big difference; the `libc++` implementation seems to be a little slower than the `libstdc++` implementation, however.
- `tcmalloc` is very fast, even in our single-threaded tests! The `make_shared` case with `tcmalloc` is about ten percent faster than anything else we’ve seen.
- `jemalloc`, on the other hand, seems to be rather slow in this single-threaded benchmark.
- Unless you use `tcmalloc`, `fast_pool_allocator` gives you a ten percent speedup against `make_shared`.

If you want to replicate and verify my results, all code is published on GitHub. The folder also contains a short Readme.md file which explains how to build and run the benchmark.