November 2019 – wfbsoftware

Introduction

When paging is used by an OS, the OS can provide pages to a process that needs memory. Pages are often times 4kB in size (the size can vary). If a process needs only a few bytes, handing out an entire page is a waste of space.

To use memory more efficiently, another layer of memory management on top of paging is used by operating systems. That layer can be implemented by using a heap. A heap provides the malloc() and free() primitives which allocate memory of a certain size and also allow an application to return memory to the operating system when it does not need it any more.

One well-known algorithm for heaps is the buddy system described by Donald E. Knuth in his first Volume of The Art of Computer Programming.

This article contains my notes on the Buddy System Allocator implementation by evanw. The implementation can manage a linear address space starting at an arbitrary location. It speeds up operations by maintaining a binary tree of the buddy memory areas. The tree allows for fast split and merge operations because it naturally encodes the buddy relationship between memory areas by storing buddies as siblings in the binary tree.

Content

This is a explanation of the allocator code available here:
https://github.com/evanw/buddy-malloc/blob/master/buddy-malloc.c

The allocator manages a linear address space with malloc() and free() operations
to allow the reservation and the return of memory.

The underlying algorithm is the buddy memory system.
buddies are a pair of two adjacent memory regions.

Start and End of the memory space

The allocator will work on a linear block of memory which can be extended when more memory is needed.

The start of the linear block is marked by the base_ptr

static uint8_t *base_ptr;

Then current end of the linear block is marked by max_ptr (a.k.a. the break)

static uint8_t *max_ptr;

The method

static int update_max_ptr(uint8_t *new_value)

is used to advance the max_ptr in order to reserve additional memory for the heap. update_max_ptr() uses the brk() system call to retrieve more memory from the operating system. brk() stands for break which is a common term used for the current end of the heap.

The circular linked list datatype

Again this is the element definition:

typedef struct list_t {
    struct list_t *prev, *next;
} list_t;

List operations are:

static void list_init(list_t *list) – initializes a list so the element points to itself with both prev and next pointers which denotes an empty list
static void list_push(list_t *list, list_t *entry) – adds an element to the end of the list
static void list_remove(list_t *entry) – removes an element from a list
static list_t *list_pop(list_t *list) – removes the element at the end of the list

There is one list per bucket to manage all the free elements that have the buckets size.
The buckets are an array of list_t structures.

// array of lists
static list_t buckets[BUCKET_COUNT];

buckets[0] —> circular linked list of free blocks of whole memory space
buckets[1] —> circular linked list of free blocks of 1/2 memory space
buckets[2] —> circular linked list of free blocks of 1/4 memory space
buckets[3] —> circular linked list of free blocks of 1/8 memory space
buckets[4] —> circular linked list of free blocks of 1/2^4 memory space
buckets[5] —> circular linked list of free blocks of 1/2^5 memory space
…

The list_t structures in the bucket are part of their respective linked list but they are treated specially because they are merely
elements to manage those lists not elements that denote a free block of memory.

If a bucket of size 2 ^ 4 currently manages five blocks of free memory, the circular linked list actually contains six elements.
The first element in the list is just the list_t element that is used to point to the list.

The linearized bit tree

The linearized bit tree is a binary tree, where every node has zero or two children.

Linearized means that the tree is stored in memory as an array of bits.
It is not stored as nodes that store pointers to other nodes to form a tree.

The tree stores a bit per node (except for it’s leafs (therefore BUCKET_COUNT – 1) because leaves cannot be split).
A node stands for a block of memory.
The state of that block of memory is encoded in the bit.
A value of 0 means that the block was not split.
A value of 1 menas that the block was split.

The array that stores the linearized binary bit tree

The array is defined by the array called ‘node_is_split’:

static uint8_t node_is_split[(1 << (BUCKET_COUNT - 1)) / 8];

BUCKET_COUNT is the amount of buckets which are lists of free memory blocks of certain sizes. The sizes are multiples of 2.
BUCKET_COUNT -1 is used because the leaves (smallest possible block) can never be split so that information does not have to be managed.

1 << (BUCKET_COUNT – 1) is the amount of memory areas that the tree has to keep track of at most.
The division by 8 returns the bytes that are required to manage all the bits.

Finding children, parents and siblings

Because there are no pointers between nodes and a node is represented by a single bit,
how does the tree know, where the child nodes of a parent node are in the bit array?

The answer is that there is a convention on how to compute the left and right child’s indexes given the index of a parent:

Left child: index = index * 2 + 1;
Right child: index = index * 2 + 2;

Example: the root node has an index of 0.
Using the formula above, it’s left child has an index of 1 and it’s right child has an index of 2.

The left root child has an index of 1.
It’s left child has an index of 3, it’s right child has an index of 4.

The right root child has an index of 2.
It’s left child has an index of 5, it’s right child has an index of 6.

Going from a child to a parent is achieved by the formula:
index = (index – 1) / 2;

For example, the child with index 6 has a parent with an index of 5 / 2 = 2.
Remember this formula is computed on integer datatypes where floating part is just cut off without any rounding. Therefore 5 / 2 results in the value 2 not 2.5 or 3.

The child with the index 2 has a parent with the index 0.

Going from a node to it’s sibling is achieved by:
index = ((index – 1) ^ 1) + 1;

^ 1 inverts the last bit in the number

Nodes 5 and 6 are siblings.

Putting 5 into the formula yields (binary numbers are flagged with a ‘b’ as a suffix)
((5 – 1) ^ 1) + 1 = (4 ^ 1) + 1 = (100b ^ 1b) + 1b = 101b + 1b = 110b = 6

Putting 6 into the formula yields (binary numbers are flagged with a ‘b’ as a suffix)
((6 – 1) ^ 1) + 1 = (5 ^ 1) + 1 = (101b ^ 1b) + 1b = 100b + 1b = 101b = 5

The meaning of the bit

The blocks of memory are represented as nodes in the bit tree.
The bit can either be 0 or 1. The question is what is encoded by the bit?

A value of 0 means that the block was not split.
A value of 1 means that the block was split.

If a block of memory was split, the respective node in the tree has two children.
A left child and a right child.

Converting an index to a pointer to the respective block

An index in the tree denotes a memory block.
The operation that computes a pointer to that block is:

static uint8_t *ptr_for_node(size_t index, size_t bucket) {
     return base_ptr + ((index - (1 << bucket) + 1) << (MAX_ALLOC_LOG2 - bucket));
}

The first parameter is the index, the second parameter is the bucket the memory block pointed to by the index belongs into.

The formula that the function uses is:

base_ptr + ((index – (1 << bucket) + 1) << (MAX_ALLOC_LOG2 – bucket))

It computes a amount of bytes (offset) than adds that amount of bytes to the absolute base_ptr to retrieve an absolute pointer again.

The amount of bytes are the bytes located before the block indexed by index starts. This is the definition of an offset. The addition between base_ptr and the offset returns an absolute pointer to the start of the block.

The amount of bytes (offset) is computed by:

(index – (1 << bucket) + 1) << (MAX_ALLOC_LOG2 – bucket)

The binary left shift operator is the same as repeatedly multiplying by two.
The number that is multiplied by two is:

[1] index – (1 << bucket) + 1

The amount to multiply by is

[2] MAX_ALLOC_LOG2 – bucket

Looking at formula [1], this will compute the zero based index of the block amongst its all siblings at that level of the binary tree.

Given the tree

          -------------------------
bucket 0: |           0           |
          -------------------------
bucket 1: |     1     |     2     |
          -------------------------
bucket 2: |  3  |  4  |  5  |  6  |
          -------------------------
bucket 3: | 7| 8| 9|10|11|12|13|14|

when using the index 12 (in bucket 3) in the formula:
index – (1 << bucket) + 1
we get
12 – (1 << 3) + 1 = 12 – 8 + 1 = 5.
5 is the zero based index of the node 12 inside it’s level of the tree. (Count 7, 8, 9, 10, 11, 12 using zero based indexes and you will arrive at node twelve when you counted from 0 to 5. 5 is node 12’s zero-based index inside it’s level of the tree).

The formula basically works by subtracting the amount of nodes below the node’s level in the tree. For the node 12, there are 7 nodes below it’s level, so 12 – 7 = 12 – 8 + 1 = 5

Once the formula [1] has determined the zero-based index among all children, the formula [2] will compute the size of a block on that level of the tree (= bucket) and multiply this amount of bytes by the zero based index.

The size of a block at level/bucket n in the tree is (2 ^ (MAX_ALLOC_LOG2 – bucket))

(MAX_ALLOC_LOG2 – bucket) is a larger number, the higher up in the tree the node is located. A larger number results in a larger block of bytes since the number is used as an exponent. This makes sense since a block higher up in the tree is twice as large as it’s direct children. The root node is as large as the entire memory area.

MAX_ALLOC_LOG2 is the exponent of the largest block used by the allocators configuration. The root node in the tree, which is the largest block is of size 2 ^ MAX_ALLOC_LOG2. If MAX_ALLOC_LOG2 is set to 32 for example, the largest block is 2 ^ MAX_ALLOC_LOG2. Bucket 0 stores those blocks.

Blocks in bucket 1 are 2 ^ (MAX_ALLOC_LOG2 – 1)
Blocks in bucket n are 2 ^ (MAX_ALLOC_LOG2 – n)

This is exactly what formula [2] computes.

The computed number from [2] is used as an exponent to 2 to compute the size of the block at that level in the tree in bytes.

The bitwise shift between the results of [1] and [2] will multiply the blocksize times the zero base index. The result is the offset to the start of the memory block.

Adding that offset to the absolute address stored in base_ptr yields the absolute address of the block of free memory.

The inverse operation is:

static size_t node_for_ptr(uint8_t *ptr, size_t bucket) {
    return ((ptr - base_ptr) >> (MAX_ALLOC_LOG2 - bucket)) + (1 << bucket) - 1;
}

This first computes the offset of the ptr (ptr – base_ptr).
This offset is an amount of bytes which is then divided by 2 repeatedly by bitwise right shifting.
The amount of right shifts is (MAX_ALLOC_LOG2 – bucket) which is taking the size of blocks at the bucket-level into account.

We arrive at the zero-based index of the node relative to its level.
The last step is to add the amount of nodes in all lower levels of the tree which is achieved by (1 << bucket) – 1

You can see that this are all the steps of the opposite operation (outlined above) in reverse order.

Operations for managing ‘node split or not’ concerns

The operation parent_is_split() figures out if the parent of a node (NOT THE NODE ITSELF) is split or not.

static int parent_is_split(size_t index) {
    index = (index - 1) / 2;
    return (node_is_split[index / 8] >> (index % 8)) & 1;
}

Line [2] is the operation to compute the parent index of a node.
Line [3] a straigh-forward bitmap operation. It returns the value at a specific index of a bitmap.
The & 1 operation returns true if the bit is set and hence if the node is split.
It returns 0 otherwise.

The operation flip_parent_is_split() toggles the parent’s split status.

static void flip_parent_is_split(size_t index) {
    index = (index - 1) / 2;
    node_is_split[index / 8] ^= 1 << (index % 8);
}

Operations: malloc() and free()

This is the meat and potatoes of the implementation! Finally!

The Header of Blocks

There is a header at the start of each memory block. That header is always 8 byte in size. A memory can either be free or used.

If the memory block is free, it is part of a circular linked list of a bucket. In that case, the header contains a list_t struct. That struct stores a pointer to it’s predecessor free block (4 byte) and it’s successor free block (4 byte).

If the memory block is used the malloc() method stores the size of that block in the first 4 byte of the header. The following 4 byte are basically unused although they still contain the 4 byte successor pointer from the list_t struct that the header contained when the block was on the free list. The size is a multiple of 2. Given a pointer to the memory block, traversing back 8 byte, the header and within it the size of the block can be retrieved. Given the size value, it is possible to compute the bucket the block belongs to and from the bucket it is possible to compute the node index in the tree using:

ptr = (uint8_t *)ptr - HEADER_SIZE;
bucket = bucket_for_request(*(size_t *)ptr + HEADER_SIZE);
i = node_for_ptr((uint8_t *)ptr, bucket);

Once the node index is known, the block can be merged with it’s buddy during a call to free().

How Memory is Layed Out in the Linearized Binary Bit Tree

The implementation does not create a root node spanning the entire available memory since, according to the author, the first request for memory would immediately reserve the entire memory space and break down memory blocks repeatedly until the first allocation can be served.

This approach is expensive, especially if the application only requests small blocks of memory. Instead another approach is choosen.

The tree starts out with the smallest possible block size. That means it’s root node starts at the bucket with the highest index which. The bucket with the highest index manages the smallest possible memory blocks. That means that the used bits in the linearized binary tree are not starting at the index 0 but rather somewhere at the end or the middle of the bitmap.

If memory is requested that does not fit into the smallest memory block, the tree is grown. The root will be placed at the next bucket. The next bucket has a smaller index than the bucket before. That is why the operation is called lower_bucket_limit(). lower_bucket_limit() will put the root onto the next smaller bucket and by doing so, it will double the amount of memory that the tree manages. lower_bucket_limit() will add free memory areas into buckets and provide more space to allocate. In the linearized bitmap, the tree will use more bits towards the start of the bitmap. That means the tree continuoulsy grows towards the first bit in the bitmap and if it shrinks, it will shrink back towards the end of the bitmap.

How Memory is Reserved When Growing the Heap

When the tree cannot satisfy a request for memory, it has to grow. In order to grow there may be the point where the max_ptr has to be moved to reserve more memory. The function update_max_ptr() does exactly that. It will call the operating system using the brk() call. If brk() manages to reserve the memory, max_ptr is moved to the newly allocated border.

Implementation of malloc()

The idea of malloc() is that it receives an amount of bytes to reserve and it returns a pointer to the address at which the bytes are available in a contiguous block. The pointer can later be used in a call to free() and free() is then able to return the block back onto a free list of it’s corresponding bucket.

malloc() starts of by initialization if malloc() has never been called before.

malloc() than looks at the amount of bytes it has to reserve. It will compute the next larger multiple of 2 that is larger than the amount of bytes to reserve. The computed multiple of 2 corresponds to a certain bucket. malloc() will grow the tree, until it’s root is located at a bucket that can serve the requested amount of bytes.

malloc() contains a while loop which iterates over all buckets from the highest index towards the lower indexes. That means it iterates from small buckets to ever larger ones until it is able to locate a matching block to serve the request.

For each bucket index, it will call lower_bucket_limit() to grow the tree to the current bucket size.

NOTE: TODO: I have not understood the rest of the method completely. It kind of walks up the tree, than splits the nodes down again until it finds a matching block. If a matching block is found, it will write the size of the block into the header, killing the linked list struct in the process that was store in the header before. It will then return a pointer to after the header to the use as a return value of malloc(). The user can than use the space and return the block using the pointer as a parameter to free() later.

Implementation of free()

TODO

Month: November 2019

A Buddy System Memory Allocator Explained