BIRD Journey to Threads. Chapter 4: Memory and other resource management.

BIRD is mostly a large specialized database engine, storing mega/gigabytes of Internet routing data in memory. To keep accounts of every byte of allocated data, BIRD has its own resource management system which must be adapted to the multithreaded environment. The resource system has not changed much between v2 and v3, yet it deserves a short chapter.

BIRD is a fast, robust and memory-efficient routing daemon designed and implemented at the end of 20th century. We’re doing a significant amount of BIRD’s internal structure changes to make it run in multiple threads in parallel.

We’ve already released version 3.0-alpha0 and our fellow users immediately found a way how to crash it just after the release. The development continues and we’re aiming to release version 3.0-alpha1 as soon as possible, fixing all the previous crashes and merging the 2.0 branch. Currently, most of the code can be seen in branch thread-next.

Resources

Inside BIRD, (almost) every piece of allocated memory belongs to a resource. The most straightforward way to achieve this is to include a generic struct resource header. (Yes, this is object inheritance.) The resource node is enlisted inside a linked list of a resource pool (see below), the class pointer defines basic operations done on resources.

typedef struct resource {
  node n;               /* Inside resource pool */
  struct resclass *class;       /* Resource class */
} resource;

struct resclass {
  char *name;               /* Resource class name */
  unsigned size;            /* Standard size of single resource */
  void (*free)(resource *);     /* Freeing function */
  void (*dump)(resource *);     /* Dump to debug output */
  resource *(*lookup)(resource *, unsigned long);   /* Look up address (only for debugging) */
  struct resmem (*memsize)(resource *); /* Return size of memory used by the resource, may be NULL */
};

void *ralloc(pool *, struct resclass *);

Resource cycle begins with an allocation of a resource. To do that, you should call ralloc(), passing the parent pool and the appropriate resource class as arguments. BIRD allocates a memory block of size given by the class member size. Beginning of the block is reserved for struct resource itself and initialized by the given arguments. Therefore, you may sometimes see an idiom where a structure has a first member resource r;, indicating that this item should be allocated as a resource.

The counterpart is resource freeing. This may be implicit (by resource pool freeing) or explicit (by rfree()). In both cases, the free() function of the appropriate class is called to cleanup the resource before final freeing; when calling rfree(), the resource is removed from its parent pool before the appropriate free() function is called. (Some resources rely on this order of execution, don’t try to change it.)

To account for dump and memsize calls, there are CLI commands dump resources and show memory, using these to dump resources or show memory usage as perceived by BIRD. The memory usage may be sometimes misleading, anyway if the number reported is too far from what the kernel claims, there may be something broken inside BIRD.

The last, lookup, is quite an obsolete way to identify a specific pointer from a debug interface. You may call rlookup(pointer) and BIRD should dump that resource to the debug output. This mechanism is probably incomplete as no developer uses it actively for debugging. We may even remove this mechanism in future and update our GDB extension to provide this functionality instead.

Resources can be also moved between pools by rmove when needed. You probably don’t want to do it unless you have to.

Resource pools

The first internal resource class is a recursive resource – a resource pool. In the singlethreaded version, this is just a simple structure:

struct pool {
  resource r;
  list inside;
  struct birdloop *loop;  /* In multithreaded version only */
  const char *name;
};

Resource pools are used for grouping resources together. There are pools everywhere and it is a common idiom inside BIRD to just rfree the appropriate pool when e.g. a protocol or table is going down. Everything inside the pool is properly destroyed (called its free method) and freed.

The pools generally make a tree with root in a root_pool. Therefore if you ask for memory size or dumping of root_pool, BIRD counts or dumps everything it knows about.

Thread safety in resource pools

In the multithreaded version, every resource pool is bound to a specific IO loop and therefore includes an IO loop pointer. This is important for allocations as the resource list inside the pool is thread-unsafe. All pool operations require the IO loop to be entered to do anything with them, if possible. (In case of rfree, the pool data structure is not accessed at all so no assert is possible. We’re currently relying on the caller to ensure proper locking. In future, this will probably change.)

Each IO loop also has its base resource pool for its allocations. All pools inside the IO loop pool must belong to the same loop or to a loop with a subordinate lock (see the previous chapter for lock ordering). If there is a shared data structure accessed from multiple IO loops, it should be allocated from a pool which is not bound to any of these loops. Typically, it’s best to have a separate lock (subordinate to the accessor loops) and a separate pool for such a data structure.

The pool structure should follow the locking order. Any pool should belong to either the same loop as its parent or its loop lock should be after its parent loop lock in the locking order. Otherwise the show memory and dump resources commands would crash BIRD on locking order violation.

When writing this text, in branch thread-next, this has not been merged yet and in one place the principle is also violated. If you’re reading this text, it’s either already fixed or in progress.

Resource pools in the wilderness

Root pool contains (among others):

route attributes and sources
routing tables
protocols
interfaces
configuration data

Each table has its IO loop and uses the loop base pool for allocations. The same holds for protocols. Each protocol has its pool; it is either its IO loop base pool or an ordinary pool bound to main loop.

Memory block allocators

Resource pool is a resource management tool; ralloc() allocates memory by malloc() but in many cases, the whole resource management overhead is unnecessarily complex. Therefore we have also three block allocators; they provide allocation of just memory blocks for general purpose usage.

Simple memory block

When just a chunk of memory is needed, mb_alloc() or mb_allocz() is used to get it. The first with malloc() semantics, the other is also zeroed. There is also mb_realloc() available, mb_free() to explicitly free such a memory and mb_move() to move that memory to another pool.

This API is intended to be used instead of malloc and free, with a difference that mb_alloc and mb_free is tracked, accounted for and can be freed at once when the parent pool is freed. You shouldn’t use malloc and free directly in BIRD; if you feel like you need these functions, go and find (or create) the appropriate pool for such resources and use the mb_* API.

Simple memory blocks are, in fact, also resources, yet the resource header is hidden from the user. They consume a fixed amount of overhead memory for the resource header (32 bytes on systems with 64-bit pointers) so they are suitable mostly for big chunks, taking advantage of the default stdlib allocator which is used by this allocation strategy. There are anyway some parts of BIRD (in all versions) where this allocator is used for little blocks. This will be fixed some day.

Linear pools

Sometimes, we need to allocate memory by chunks of different sizes (e.g. when parsing config or when processing filters) but we don’t know in advance the total memory needed. Anyway, we know that all the memory is going to be freed at once when some procedure ends or when some data structure is at end of its life. For these purposes, there are linpools.

This data structure allocates memory blocks of requested size with negligible overhead in functions lp_alloc() (uninitialized) or lp_allocz() (zeroed). There is no realloc and no free call; to reallocate larger chunk, you have to either use mb_alloc and mb_realloc or just allocate a larger block from the same linpool and abandon the old one.

Freeing the linpool contents is done at once by lp_flush(). After this call, all memory chunks allocated from this linpool must be considered invalid, yet the linpool retains some of the memory to be reused.

Internally, linpools allocate blocks of a predetermined size (typically it is one memory page, at least in the multithreaded version) and in these blocks, they keep a pointer to the first available byte and an end pointer. The fast path when the last chunk has enough available bytes looks like this:

void * lp_alloc(linpool *m, uint size)
{
  byte *a = (byte *) BIRD_ALIGN((unsigned long) m->ptr, CPU_STRUCT_ALIGN);
  byte *e = a + size;

  if (e <= m->end)
    {
      m->ptr = e;
      return a;
    }
  else
    /* Here is the slow path when a new chunk is needed */
    ...
}

This is quite inefficient when the allocated blocks are just 2048 bytes long (with 4096B page size): the linpool chunk has a little header (one pointer), therefore two 2048B blocks can’t fit and the consumed memory overall is 200%. Anyway, this is not a problem: most of the blocks allocated by linpools are lesser than 100 bytes.

Note: If the requested block size is bigger than page size, linpool allocates it by malloc. This approach isn’t much efficient, yet we don’t have any better solution for now.

In some cases, memory is allocated temporarily. For developers’ convenience and also for better memory management, there is tmp_linpool for every allocation with a task-long lifetime. After an event hook, socket hook or timer hook has returned, tmp_linpool is autoflushed. If you are tempted to use alloca to allocate something big on stack, consider using tmp_alloc instead. Stacks have to be the same size for all threads and can’t be freed before the thread ends. Temporary linpool is much more flexible than stack allocation.

It is also possible to locally store the linpool state in struct lp_state and subsequently restore it to the stored state. This is used e.g. in BGP packet processing.

Slabs

To allocate lots of same-sized objects, a slab allocator is an ideal choice.

Slab allocates full pages as the linpool does. Every page hosts a slab head with a bitmap of available/used blocks and then the blocks themselves one after another. When allocation is requested, Slab takes the first head with available blocks, marks one block used and returns it. With this approach, there is the least possible allocation overhead we can currently have.

Freeing blocks is tricky. BIRD relies on the fact that page addresses are divisible by page size. If you know about a system where this isn’t true, please let us know. When a block should be freed, the 13 (for 4096B-sized pages) least significant bits of its address are zeroed to get the slab head. From this, we can also recompute the position of the block and set the appropriate bit back to available.

In versions until 2.0.8 (incl.), our slab allocator used blocks allocated by malloc(), every object included a slab head pointer and free objects were linked into a single-linked list. This led to memory inefficiency and to anti-intuitive behavior where a use-after-free bug could do lots of damage before finally crashing.

Raw memory pages

Until 2.0.8 (incl.), BIRD allocated all memory by malloc(). This method is suitable for lots of use cases, yet when gigabytes of memory should be allocated by little pieces, BIRD uses its internal allocators to keep track about everything. This brings some ineffectivity as stdlib allocator has its own overhead and may not allocate well-aligned memory unless explicitly asked for.

To save some memory and time, slabs and linpools got backed by memory pages allocated by mmap() directly. Yet if we had to do this syscall for every block acquisition and release, it would be no help. This is even worse when BIRD needs to free gigabytes of memory at once.

To minimize the needed number of syscalls, there is a page cache, keeping pages for future use:

When a new page is requested:
- First the local thread page cache is tried.
- Then there may be some page in the global page cache.
- If no page is in cache, the thread mmaps several pages at once, puts them into its local cache and returns one to use.
When a page is freed:
- If the local page cache is small enough, it’s left there.
- Otherwise it’s put into the global page cache.
- If there are too many pages in the global page cache, a cleanup routine is scheduled to free excessive pages.

The global page cache is lockless. Its limits are now fixed at 32 (minimum) and 512 (maximum) number of pages retained. With lots of threads, this range may be too little, therefore it may get either configurable, or at least dependent on number of running threads. This is a future task to do.

On shutdown, both the cleanup routine and returning to global page cache is inhibited. All pages are returned to the local page caches. Therefore this method gives the multithreaded BIRD not only faster memory management than ever before but also almost immediate shutdown times.

Other resources

Among others, some basic objects are resources (or may be!) and belong to pools. Notable resources are events and timers. These two, if allocated as a resource, are automatically cancelled by their free methods.

Some objects are not only a piece of memory; notable items are sockets, owning the underlying mechanism of I/O, and object locks, owning the right to use a specific I/O. This ensures that collisions on e.g. TCP port numbers and addresses are resolved in a predictable and recoverable way. These objects heavily benefit from the resource free() method.

All these resources should be used with the same locking principles as the memory blocks. There aren’t many checks inside BIRD code to ensure that yet, nevertheless violating this recommendation may lead to multiple-access issues. This will probably change soon to much stricter access checking.

It’s still a long road to the version 3. This series of texts document what is needed to be changed, why we do it and how. I also try to describe some underlying principles, algorithms and data structures used inside BIRD. The previous chapter showed the locking system and how the parallel execution is done. Recently, we implemented a lockless event queue so the next chapter will be a detailed explanation of that.

Autor: Maria Matějka

Blog zaměstnanců CZ.NIC

@ IN SOA domains.dns.enum.mojeid.internet. nic.cz.