Efficiently allocating lots of std::shared_ptr

If you’re building software tuned for performance (in C++ or a similar language), you probably know this as a golden rule: Avoid memory allocations in your performance-critical code at (almost) all costs. However, sometimes that’s just not feasible and you need to make sure that your allocations are as efficient as possible.

In a recent project of mine (where performance is not of utmost importance, but also not negligible), I am not only in the situation that I need to allocate lots of small-ish objects with high frequency, but I also want to use std::shared_ptr for memory safety reasons.

In this article, I look into how pool allocators (especially Boost.Pool's boost::fast_pool_allocator) can help in this situation.

With a question like this, I think the most useful approach is: Benchmark, benchmark, and benchmark again. Benchmark some test scenarios. Benchmark your final application. Just benchmark it all. Yes, as a developer you usually develop (no pun intended) an intuition for how something affects your memory allocations, cache efficiency, etc. - but a good benchmark usually is the gold standard in my eyes.

Benchmark Setup

Since my project is probably different from yours (and I cannot share the code of my project), benchmarks on the actual project would not be meaningful in the context of this article. Instead, I'll work with a synthetic benchmark designed to measure the performance impact of the allocator used to allocate std::shared_ptrs.

The benchmarks in this article are all single-threaded. Running meaningful multi-threaded benchmarks is a challenge and not in the scope of this article.

To keep performance overhead as low as possible, we’re using a very simple data structure for our benchmark: A singly-linked list. The list pointers are std::shared_ptr. Because we never hold pointers to list elements except for the list heads, deleting a list head ‘cascades’ down the list and causes all the list elements to be deallocated. The performance of allocations can vary with the size of the allocations (and the difference between sizes of different allocations), thus our list nodes look like this:

#include <cstddef>
#include <cstdint>
#include <memory>

template <size_t SIZE>
struct ListNode
{
    constexpr static size_t BYTES_PER_STEP = 16;

    std::shared_ptr<ListNode> next;

    ListNode() : next(nullptr) {}

    uint8_t data[SIZE * BYTES_PER_STEP];
};

This allows us to define list nodes of various sizes in 16-byte increments.
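
For illustration, on a typical 64-bit platform where sizeof(std::shared_ptr<ListNode>) is 16 bytes, the instantiations work out as follows (note that SIZE = 0 produces a zero-length data array, which GCC and Clang accept as an extension):

// Sizes assume a typical 64-bit build; adjust if your platform differs.
static_assert(sizeof(ListNode<0>) == 16, "just the next pointer");
static_assert(sizeof(ListNode<1>) == 32, "next pointer + 16 bytes of data");
static_assert(sizeof(ListNode<4>) == 80, "next pointer + 64 bytes of data");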

Our benchmark works roughly like this: We create 1,000 of these singly-linked lists. In a (not measured) ramp-up phase, we generate a total of 1,000,000 list nodes and insert each into a randomly picked list, so that the lists have an average length of 1,000. We then perform 50,000 iterations, where iteration i starts by completely deallocating the i-th list, and then generates 1,000 fresh list nodes and again inserts each node into a random list. All random decisions (i.e., the list into which each node is inserted) are precomputed before performance measurement starts.
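
In code, the measured phase looks roughly like the following sketch. The names make_node and targets are hypothetical placeholders (make_node stands for whichever allocation method is under test, targets holds the precomputed random list indices), and with 50,000 iterations over 1,000 lists the deallocated list index wraps around:

constexpr size_t NUM_LISTS = 1000;
constexpr size_t NODES_PER_ITER = 1000;
constexpr size_t ITERATIONS = 50000;

std::vector<std::shared_ptr<ListNode<0>>> heads(NUM_LISTS);

for (size_t i = 0; i < ITERATIONS; ++i) {
    heads[i % NUM_LISTS].reset();          // 'cascades' down the whole list
    for (size_t j = 0; j < NODES_PER_ITER; ++j) {
        auto node = make_node();           // whichever allocation method is under test
        auto& head = heads[targets[i][j]]; // precomputed random list choice
        node->next = std::move(head);      // push-front into the chosen list
        head = std::move(node);
    }
}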

The benchmark as explained in the paragraph above, i.e., using only list nodes of a single size, would give an unfair advantage to any pool-based allocator. Pool allocators usually work by allocating a larger chunk of memory and then chopping it up into the size of the pooled object. This means that when objects of multiple sizes are in play, the allocator needs to maintain multiple of these chopped-up memory regions, making its life harder. Thus, we run the whole benchmark for five different sizes, setting SIZE from 0 to 4, resulting (on my machine) in node sizes between 16 bytes (a single std::shared_ptr) and 80 bytes (the std::shared_ptr plus 4 * 16 bytes of data). We maximally interleave the allocations and deallocations of objects of different sizes by first allocating (and inserting) the first node for each of the sizes 0, 1, 2, 3 and 4, then allocating (and inserting) the second node for each of the sizes, and so on.
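
Extending the inner loop of the sketch above to all five size classes, the interleaving can be pictured like this (insert_one_node<SIZE>() is a hypothetical helper that allocates one ListNode<SIZE> and inserts it into a random list):

for (size_t j = 0; j < NODES_PER_ITER; ++j) {
    insert_one_node<0>(); // 16-byte nodes
    insert_one_node<1>(); // 32-byte nodes
    insert_one_node<2>(); // 48-byte nodes
    insert_one_node<3>(); // 64-byte nodes
    insert_one_node<4>(); // 80-byte nodes
}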

For details, you can also look at the code I've published (see "Code for Replication" below).

Allocators Under Test

We want to benchmark different ways of allocating std::shared_ptr. The first two candidates are obvious:

// (1) The most inefficient way
auto ptr = std::shared_ptr<ListNode>(new ListNode());

// (2) The usual way
auto ptr = std::make_shared<ListNode>();

I am mostly interested in evaluating Boost.Pool’s boost::fast_pool_allocator, so the third variant tested is:

// (3) boost::fast_pool_allocator
// (requires #include <boost/pool/pool_alloc.hpp>)

static boost::fast_pool_allocator<void> alloc;

auto ptr = std::allocate_shared<ListNode>(alloc);

Why fast_pool_allocator<void>? Why not fast_pool_allocator<ListNode>?

The pool allocator used to perform the actual allocation must be specialized on the class that is actually going to be allocated. However - we do not know what that class will be! A std::allocate_shared<Foo>() call does not allocate a Foo object, but some other (implementation-defined) class that contains both a Foo and the control block for the std::shared_ptr.

However, the C++ allocator system solves that problem for us: When an allocator for an unfitting type (like void…) is used to allocate some other type (like our implementation-defined type), the std::allocator_traits<Allocator>::rebind_alloc<T> template is used to create a new, "fitting" allocator.
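
To make the rebinding visible, here is a small self-contained sketch (TracingAllocator is an illustration only, not part of the benchmark): passing a void-specialized allocator to std::allocate_shared results in a single allocation whose size exceeds sizeof(Foo), because the control block lives in the same block of memory.

#include <cstddef>
#include <cstdio>
#include <memory>
#include <typeinfo>

// Illustration only: a minimal allocator that logs what it is asked to do.
template <typename T>
struct TracingAllocator {
    using value_type = T;

    TracingAllocator() = default;
    template <typename U>
    TracingAllocator(const TracingAllocator<U>&) {}

    T* allocate(std::size_t n) {
        // Prints the (implementation-defined) rebound type and a size
        // larger than sizeof(Foo): object and control block share one
        // allocation.
        std::printf("allocating %zu bytes for %s\n",
                    n * sizeof(T), typeid(T).name());
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template <typename T, typename U>
bool operator==(const TracingAllocator<T>&, const TracingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const TracingAllocator<T>&, const TracingAllocator<U>&) { return false; }

struct Foo { long x; };

int main() {
    // std::allocator_traits rebinds TracingAllocator<void> to the combined
    // object/control-block type before allocating.
    auto ptr = std::allocate_shared<Foo>(TracingAllocator<void>{});
}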

A note on boost::fast_pool_allocator vs boost::pool_allocator

If you look at the Boost.Pool allocators documentation, you might wonder: Why is there a boost::pool_allocator and a boost::fast_pool_allocator? Why would I ever want a non-fast allocator?

The difference seems to be: boost::fast_pool_allocator<T> is optimized to allocate single objects of type T, while boost::pool_allocator<T> seems to be optimized for allocation of whole chunks of Ts, like a std::vector<T> would need - see the fast_pool_allocator documentation here.
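
A quick sketch of the two intended use cases (again, an illustration rather than benchmark code):

#include <boost/pool/pool_alloc.hpp>
#include <memory>
#include <vector>

int main() {
    // pool_allocator: aimed at contiguous chunk requests,
    // e.g. the backing storage of a vector.
    std::vector<int, boost::pool_allocator<int>> values(1000, 0);

    // fast_pool_allocator: aimed at one-object-at-a-time requests,
    // e.g. what std::allocate_shared performs.
    static boost::fast_pool_allocator<void> alloc;
    auto ptr = std::allocate_shared<int>(alloc, 42);
}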

malloc() implementations

All three variants laid out above depend (some more, some less) on the performance of malloc / new, which in turn depends on the standard library implementation. Thus, we try both of the big freely available implementations:

  • GCC 12.3.0 with its libstdc++ implementation
  • Clang 17.0.6 with GCC’s libstdc++ implementation and LLVM/Clang’s libc++ implementation

Additionally, there are ’external’ malloc / new implementations, which are not bound to a certain standard library, but override the malloc / new functions provided by the standard library. We try two such implementations (both with the GCC-based build):

  • tcmalloc 2.9.1 (the “minimal” version shipped by Ubuntu, which does not contain the profiling code)
  • jemalloc 5.2.1

Experiments

Environment

I’ve run the experiments on a machine with an AMD Ryzen 7 5800H and 32 GB dual-channel DDR4 memory at 3200 MHz. The system had plenty of free memory at the time. The system runs Ubuntu 22.04.

Results

Now for the part we’ve all been waiting for. What are the benchmark results?

The following table shows the time in seconds it took to execute the iterations explained above. The columns correspond to the three allocation variants from above, and the rows to the different standard library / malloc implementations.

                     new    make_shared    fast_pool_allocator
GCC                 69.0           38.1                   34.0
Clang + libstdc++   69.2           38.6                   35.7
Clang + libc++      76.5           40.8                   40.4
GCC + tcmalloc      57.2           30.3                   33.8
GCC + jemalloc      86.7           42.7                   33.9

The most important insights seem to be:

  • Not very surprisingly, new is slower than make_shared by about a factor of 1.8. This is consistent with the fact that std::shared_ptr<Foo>(new Foo()) needs not one but two allocations. The most interesting takeaway from this is: Since the slowdown is close to two, we may assume that most of the work done in our benchmarks is indeed memory (de)allocation, and everything else that happens (e.g. reference counting inside the std::shared_ptr) does not skew our results too heavily.
  • Whether you use GCC or Clang has little to no impact. The libc++ implementation seems to be a little slower than the libstdc++ implementation, however.
  • tcmalloc is very fast, even in our single-threaded tests! The make_shared case with tcmalloc is about ten percent faster than anything else we’ve seen.
  • jemalloc on the other hand seems to be rather slow in this single-threaded benchmark.
  • If you cannot use tcmalloc, fast_pool_allocator gives you a roughly ten percent speedup over make_shared.

Code for Replication

If you want to replicate and verify my results, all code is published on GitHub. The folder also contains a short Readme.md file which explains how to build and run the benchmark.
