I’ve been implementing an algorithm that works on a graph. That algorithm needs to create and destroy lots of nodes and edges to keep track of the algorithm state correctly. The algorithm also needs to be fast in order to be competitive against a similar algorithm that uses sets of stacks instead of graphs. Profiles show that memory allocations are negatively impacting performance, so maybe a memory pool can help.
Don't reject malloc/free or new/delete lightly
malloc/free and, in C++, new/delete provide the foundations of dynamic memory. Quite often our applications need dynamic memory because the amount of memory they will require is not known at compile time. Instead of creating big arrays and limiting the sizes of our inputs, which may not be acceptable, dynamic memory allows us to allocate the memory we need at runtime.
These facilities give good performance in general because most applications do a lot of work between the allocations and deallocations. They are general enough to perform well in a broad spectrum of applications. There is usually little incentive not to use them.
That said, there are applications where the ratio of allocation/deallocation to “real work” is very small. I found myself in that situation: I have an algorithm that uses a graph to represent its state, and it creates and destroys many nodes and edges during the computation. Sadly, the amount of work between updates to the graph is really small, which means that allocations and deallocations show up in the profile as pure overhead. In these circumstances we are forced to use less general strategies for our dynamic memory. I decided to implement a very simple pool allocator.
In an algorithm like the one I’m trying to make competitive, there is a large number of allocations and deallocations, all of the same kind of object. This implies that all these objects take the same amount of storage in memory, so if we are willing to be less general than plain malloc, we can do faster allocations.
A pool allocator is a typical approach to this. The program allocates a chunk of memory, which I will call an arena, with enough storage for up to a given number of elements. We can then serve allocations of up to that number of elements from the arena. As the arena itself is nothing but an array, it is very easy to find storage for the allocations. A pool that only uses one arena has a hard limit on the number of elements it can allocate, so I am interested in a pool allocator that can use more than one arena. This means that when all arenas are full, the pool allocator should allocate another arena (of course, using regular dynamic memory).
Perhaps the trickiest part of memory allocation is keeping track of what storage is available for later allocations. We can use two techniques: bitmaps and free lists. A bitmap assigns each bit of an integer (32 or 64 bits) to a position of the arena. When the i-th bit of the bitmap is set to 1, the i-th element in the arena is allocated and can be freed; if set to 0, that element in the arena is free and can be allocated. By using specific instructions like “count leading zeroes” and similar, it is possible to use bitmaps efficiently to tell where the next free element is. There are some downsides: bitmaps take extra storage and they are limited to the largest integer we can use efficiently, which in practice means our arenas will be 32 or 64 elements long. Bitmaps also force us to be a bit clever when allocating memory: the first arena we check should always have available storage (otherwise it should mean we need a new arena), to avoid scanning several bitmaps when looking for free storage.
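As a sketch of the bitmap technique (not the approach I ended up using), a hypothetical 64-element arena could track its slots like this. I use __builtin_ctzll, a GCC/Clang builtin that counts trailing zeros; other compilers have their own equivalents.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bitmap-tracked arena of 64 slots: bit i == 1 means slot i
// is allocated, bit i == 0 means slot i is free.
struct bitmap_arena {
    std::uint64_t used = 0;

    // Returns the index of a free slot and marks it used, or -1 if full.
    int allocate_slot() {
        if (used == ~std::uint64_t(0))
            return -1; // arena is full, the caller needs another arena
        // The lowest 0 bit of `used` is the first free slot, so counting
        // the trailing zeros of the complement finds it in one instruction.
        int slot = __builtin_ctzll(~used);
        used |= std::uint64_t(1) << slot;
        return slot;
    }

    // Clears the bit so the slot becomes available again.
    void free_slot(int slot) { used &= ~(std::uint64_t(1) << slot); }
};
```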
Free lists are a different technique. A link is threaded through the free elements themselves, forming a list that gives us the next free item. A free list is easy to maintain during allocation and deallocation. When the free list is empty, we simply allocate a new arena and initialize its free list; for arenas with lots of elements this means that allocating a new arena may take longer. Also, as we can free data from different arenas, the free list may end up threading through several arenas. This can harm the spatial locality of successive allocations after a bunch of deallocations of objects in different arenas.
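A minimal sketch of the free-list idea, with hypothetical names, threading the links through the elements’ own storage:

```cpp
#include <cassert>
#include <cstddef>

// Each element is a union: while free it stores a link to the next free
// element, once allocated the same bytes hold the payload.
union slot {
    slot *next;   // valid while the slot is free
    double value; // the payload while the slot is allocated
};

// Thread the list through all `n` slots of an arena; returns the head.
inline slot *init_free_list(slot *arena, int n) {
    for (int i = 0; i < n - 1; ++i)
        arena[i].next = &arena[i + 1];
    arena[n - 1].next = nullptr;
    return arena;
}

// Allocation pops the head of the free list...
inline slot *pop_free(slot *&head) {
    slot *p = head;
    head = head->next;
    return p;
}

// ...and deallocation pushes the slot back on the front.
inline void push_free(slot *&head, slot *p) {
    p->next = head;
    head = p;
}
```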
A very simple implementation in C++11
As this is C++, the pool will be parameterized by the type it allocates.
An arena is formed by several items. Each item is a union of properly sized and aligned storage to hold a T, and a pointer to the next free item (which can be null).
This item will have a few methods that we will use later. We have to be able to get and set the next item, and also get the storage. We also need a function that, given a pointer to a T, returns a pointer to the item that contains it; we will use this when freeing a T. As this is a union, a simple cast will do.
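A sketch of such an item union; the names (minipool_item, get_storage and so on) are my own choices:

```cpp
#include <cassert>
#include <type_traits>

template <typename T>
union minipool_item {
private:
    using StorageType =
        typename std::aligned_storage<sizeof(T), alignof(T)>::type;

    minipool_item *next; // points to the next free item, or null
    StorageType datum;   // properly sized and aligned storage for a T

public:
    minipool_item *get_next_item() const { return next; }
    void set_next_item(minipool_item *n) { next = n; }

    // Storage for the T, usable with placement new.
    T *get_storage() { return reinterpret_cast<T *>(&datum); }

    // Given a T* previously returned by get_storage, recover the item
    // that contains it. As this is a union, a cast is enough.
    static minipool_item *storage_to_item(T *t) {
        return reinterpret_cast<minipool_item *>(t);
    }
};
```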
Our pool will be made of one or more arenas. An arena is just an array of items and a pointer to the next arena.
When creating an arena, all the elements inside are free, so they all have to be in the free list. We need to chain each item to the consecutive one, except the last. The arena itself is created using the usual operator new from C++.
When creating an arena we need to get the storage so we can correctly update the free list.
Finally, when creating a new arena because the current one is full, we need to link the new one to the current one, so the arenas are linked.
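A sketch of the arena along these lines, repeating the item union so the snippet stands alone (all names are again my own):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

// The item union from before, repeated for self-containment.
template <typename T>
union minipool_item {
private:
    using StorageType =
        typename std::aligned_storage<sizeof(T), alignof(T)>::type;
    minipool_item *next;
    StorageType datum;

public:
    minipool_item *get_next_item() const { return next; }
    void set_next_item(minipool_item *n) { next = n; }
    T *get_storage() { return reinterpret_cast<T *>(&datum); }
};

// An arena: an array of items plus a link to the next (older) arena.
template <typename T>
class minipool_arena {
    std::unique_ptr<minipool_item<T>[]> storage; // created with operator new[]
    std::unique_ptr<minipool_arena> next;        // the arena that got full

public:
    explicit minipool_arena(std::size_t arena_size)
        : storage(new minipool_item<T>[arena_size]) {
        // All items start out free: chain each one to the next,
        // leaving the last one with a null link.
        for (std::size_t i = 1; i < arena_size; ++i)
            storage[i - 1].set_next_item(&storage[i]);
        storage[arena_size - 1].set_next_item(nullptr);
    }

    // First item of the arena, used to splice it into the free list.
    minipool_item<T> *get_storage() const { return storage.get(); }

    // Link a freshly created arena to the current (full) one.
    void set_next_arena(std::unique_ptr<minipool_arena> &&n) {
        assert(!next);
        next = std::move(n);
    }
};
```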
Data members of the pool
Our pool basically needs to know the current arena and the free list. As this pool allocates arenas all of the same size, we need to store the size of the arenas.
Creation of the pool
When creating a pool we need a first arena. So this is what the constructor does.
To allocate a new T, we will simply forward the arguments to the constructor of T. But first we need to find storage for the new object.
If our free list is empty (that is, the current arena is full), we need to create a new arena. We link the new arena to the current one and then make it the current one. The new arena brings new free items, so we update the free list with the new storage of the arena.
Now that we have storage, it is just a matter of using the free list (which we now know is not empty). We are going to use the first item of the free list, so we retrieve the current item and update the free list. Then we only have to get the storage of the item and initialize the object using a placement new. And return the new object, of course.
When deallocating an object, we need to put it back in the free list after we have destroyed the object. For simplicity we are going to put it at the beginning of the free list, so it is just a matter of retrieving the item related to the freed pointer and linking it to the free list (which could be empty at this point, but it does not matter). Finally, we update the free list to start at the freed item.
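Putting all the pieces together, the whole pool might look like this self-contained sketch (minipool and every other name here are my own, not a standard facility):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>

template <typename T>
class minipool {
    // A free item stores a link to the next free item; an allocated
    // item stores the T itself in the same bytes.
    union minipool_item {
    private:
        using StorageType =
            typename std::aligned_storage<sizeof(T), alignof(T)>::type;
        minipool_item *next;
        StorageType datum;

    public:
        minipool_item *get_next_item() const { return next; }
        void set_next_item(minipool_item *n) { next = n; }
        T *get_storage() { return reinterpret_cast<T *>(&datum); }
        static minipool_item *storage_to_item(T *t) {
            return reinterpret_cast<minipool_item *>(t);
        }
    };

    // An arena: an array of items plus a link to the previous arena.
    struct minipool_arena {
        std::unique_ptr<minipool_item[]> storage;
        std::unique_ptr<minipool_arena> next;

        explicit minipool_arena(std::size_t arena_size)
            : storage(new minipool_item[arena_size]) {
            // Chain every item to the following one; the last stays null.
            for (std::size_t i = 1; i < arena_size; ++i)
                storage[i - 1].set_next_item(&storage[i]);
            storage[arena_size - 1].set_next_item(nullptr);
        }
    };

    std::size_t arena_size;                // number of items per arena
    std::unique_ptr<minipool_arena> arena; // current arena
    minipool_item *free_list;              // head of the free list

public:
    explicit minipool(std::size_t sz)
        : arena_size(sz), arena(new minipool_arena(sz)),
          free_list(arena->storage.get()) {}

    template <typename... Args>
    T *alloc(Args &&... args) {
        if (free_list == nullptr) {
            // Current arena is full: create a new one, link the full
            // one behind it, and adopt its items as the free list.
            std::unique_ptr<minipool_arena> new_arena(
                new minipool_arena(arena_size));
            new_arena->next = std::move(arena);
            arena = std::move(new_arena);
            free_list = arena->storage.get();
        }
        // Pop the first free item and construct the T in its storage.
        minipool_item *item = free_list;
        free_list = item->get_next_item();
        T *result = item->get_storage();
        new (result) T(std::forward<Args>(args)...);
        return result;
    }

    void free(T *t) {
        t->~T(); // destroy the object first
        // Then push its item back on the front of the free list.
        minipool_item *item = minipool_item::storage_to_item(t);
        item->set_next_item(free_list);
        free_list = item;
    }
};
```

Note that destroying the pool releases the arenas through the chained unique_ptrs, one after another.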
And that’s it. Really simple.
Did it help?
In my case, yes, it did help: using malloc/free instead of the pool makes the program two times slower.
But your mileage may vary so make sure you benchmark your program before and after. Few things are sadder than optimizing something that won’t have any impact.
Where to go from here?
OK. This is very simple and it fulfills my original needs. But it can be improved of course. Some ideas follow:
- It is easy to make mistakes using this allocator, in particular double frees. So we may want to extend it with consistency checks. For instance, we could add a Policy template parameter, defaulted to no operations (for efficiency), that could be set to a hypothetical DebugMemory policy that checks the operations.
- What about memory leaks? We could add a memory leak checker (also policy-controlled, for efficiency reasons).
- What about bulk destructions? If we destroy the memory pool the memory is released but the destructors of the remaining objects are not invoked! Options include making this an error (as if this were a memory leak) or bulk destroy all the objects. The last option is very convenient for computations that need to allocate temporary data and want to release it at once. This is not trivial because we have a free list but not a "used list"!
- What about exceptions?
- What about returning a minipool<T>::ptr (or similar) instead of a raw T*? This would allow us to safely call delete with these pointers, but comes at the cost that each pointer now needs to know its pool.