OperatingSystems – wfbsoftware

A Buddy System Memory Allocator Explained

Introduction

When paging is used by an OS, the OS can provide pages to a process that needs memory. Pages are often times 4kB in size (the size can vary). If a process needs only a few bytes, handing out an entire page is a waste of space.

To use memory more efficiently, another layer of memory management on top of paging is used by operating systems. That layer can be implemented by using a heap. A heap provides the malloc() and free() primitives which allocate memory of a certain size and also allow an application to return memory to the operating system when it does not need it any more.

One well-known algorithm for heaps is the buddy system described by Donald E. Knuth in his first Volume of The Art of Computer Programming.

This article contains my notes on the Buddy System Allocator implementation by evanw. The implementation can manage a linear address space starting at an arbitrary location. It speeds up operations by maintaining a binary tree of the buddy memory areas. The tree allows for fast split and merge operations because it naturally encodes the buddy relationship between memory areas by storing buddies as siblings in the binary tree.

Content

This is a explanation of the allocator code available here:
https://github.com/evanw/buddy-malloc/blob/master/buddy-malloc.c

The allocator manages a linear address space with malloc() and free() operations
to allow the reservation and the return of memory.

The underlying algorithm is the buddy memory system.
buddies are a pair of two adjacent memory regions.

Start and End of the memory space

The allocator will work on a linear block of memory which can be extended when more memory is needed.

The start of the linear block is marked by the base_ptr

static uint8_t *base_ptr;

Then current end of the linear block is marked by max_ptr (a.k.a. the break)

static uint8_t *max_ptr;

The method

static int update_max_ptr(uint8_t *new_value)

is used to advance the max_ptr in order to reserve additional memory for the heap. update_max_ptr() uses the brk() system call to retrieve more memory from the operating system. brk() stands for break which is a common term used for the current end of the heap.

The circular linked list datatype

Again this is the element definition:

typedef struct list_t {
    struct list_t *prev, *next;
} list_t;

List operations are:

static void list_init(list_t *list) – initializes a list so the element points to itself with both prev and next pointers which denotes an empty list
static void list_push(list_t *list, list_t *entry) – adds an element to the end of the list
static void list_remove(list_t *entry) – removes an element from a list
static list_t *list_pop(list_t *list) – removes the element at the end of the list

There is one list per bucket to manage all the free elements that have the buckets size.
The buckets are an array of list_t structures.

// array of lists
static list_t buckets[BUCKET_COUNT];

buckets[0] —> circular linked list of free blocks of whole memory space
buckets[1] —> circular linked list of free blocks of 1/2 memory space
buckets[2] —> circular linked list of free blocks of 1/4 memory space
buckets[3] —> circular linked list of free blocks of 1/8 memory space
buckets[4] —> circular linked list of free blocks of 1/2^4 memory space
buckets[5] —> circular linked list of free blocks of 1/2^5 memory space
…

The list_t structures in the bucket are part of their respective linked list but they are treated specially because they are merely
elements to manage those lists not elements that denote a free block of memory.

If a bucket of size 2 ^ 4 currently manages five blocks of free memory, the circular linked list actually contains six elements.
The first element in the list is just the list_t element that is used to point to the list.

The linearized bit tree

The linearized bit tree is a binary tree, where every node has zero or two children.

Linearized means that the tree is stored in memory as an array of bits.
It is not stored as nodes that store pointers to other nodes to form a tree.

The tree stores a bit per node (except for it’s leafs (therefore BUCKET_COUNT – 1) because leaves cannot be split).
A node stands for a block of memory.
The state of that block of memory is encoded in the bit.
A value of 0 means that the block was not split.
A value of 1 menas that the block was split.

The array that stores the linearized binary bit tree

The array is defined by the array called ‘node_is_split’:

static uint8_t node_is_split[(1 << (BUCKET_COUNT - 1)) / 8];

BUCKET_COUNT is the amount of buckets which are lists of free memory blocks of certain sizes. The sizes are multiples of 2.
BUCKET_COUNT -1 is used because the leaves (smallest possible block) can never be split so that information does not have to be managed.

1 << (BUCKET_COUNT – 1) is the amount of memory areas that the tree has to keep track of at most.
The division by 8 returns the bytes that are required to manage all the bits.

Finding children, parents and siblings

Because there are no pointers between nodes and a node is represented by a single bit,
how does the tree know, where the child nodes of a parent node are in the bit array?

The answer is that there is a convention on how to compute the left and right child’s indexes given the index of a parent:

Left child: index = index * 2 + 1;
Right child: index = index * 2 + 2;

Example: the root node has an index of 0.
Using the formula above, it’s left child has an index of 1 and it’s right child has an index of 2.

The left root child has an index of 1.
It’s left child has an index of 3, it’s right child has an index of 4.

The right root child has an index of 2.
It’s left child has an index of 5, it’s right child has an index of 6.

Going from a child to a parent is achieved by the formula:
index = (index – 1) / 2;

For example, the child with index 6 has a parent with an index of 5 / 2 = 2.
Remember this formula is computed on integer datatypes where floating part is just cut off without any rounding. Therefore 5 / 2 results in the value 2 not 2.5 or 3.

The child with the index 2 has a parent with the index 0.

Going from a node to it’s sibling is achieved by:
index = ((index – 1) ^ 1) + 1;

^ 1 inverts the last bit in the number

Nodes 5 and 6 are siblings.

Putting 5 into the formula yields (binary numbers are flagged with a ‘b’ as a suffix)
((5 – 1) ^ 1) + 1 = (4 ^ 1) + 1 = (100b ^ 1b) + 1b = 101b + 1b = 110b = 6

Putting 6 into the formula yields (binary numbers are flagged with a ‘b’ as a suffix)
((6 – 1) ^ 1) + 1 = (5 ^ 1) + 1 = (101b ^ 1b) + 1b = 100b + 1b = 101b = 5

The meaning of the bit

The blocks of memory are represented as nodes in the bit tree.
The bit can either be 0 or 1. The question is what is encoded by the bit?

A value of 0 means that the block was not split.
A value of 1 means that the block was split.

If a block of memory was split, the respective node in the tree has two children.
A left child and a right child.

Converting an index to a pointer to the respective block

An index in the tree denotes a memory block.
The operation that computes a pointer to that block is:

static uint8_t *ptr_for_node(size_t index, size_t bucket) {
     return base_ptr + ((index - (1 << bucket) + 1) << (MAX_ALLOC_LOG2 - bucket));
}

The first parameter is the index, the second parameter is the bucket the memory block pointed to by the index belongs into.

The formula that the function uses is:

base_ptr + ((index – (1 << bucket) + 1) << (MAX_ALLOC_LOG2 – bucket))

It computes a amount of bytes (offset) than adds that amount of bytes to the absolute base_ptr to retrieve an absolute pointer again.

The amount of bytes are the bytes located before the block indexed by index starts. This is the definition of an offset. The addition between base_ptr and the offset returns an absolute pointer to the start of the block.

The amount of bytes (offset) is computed by:

(index – (1 << bucket) + 1) << (MAX_ALLOC_LOG2 – bucket)

The binary left shift operator is the same as repeatedly multiplying by two.
The number that is multiplied by two is:

[1] index – (1 << bucket) + 1

The amount to multiply by is

[2] MAX_ALLOC_LOG2 – bucket

Looking at formula [1], this will compute the zero based index of the block amongst its all siblings at that level of the binary tree.

Given the tree

          -------------------------
bucket 0: |           0           |
          -------------------------
bucket 1: |     1     |     2     |
          -------------------------
bucket 2: |  3  |  4  |  5  |  6  |
          -------------------------
bucket 3: | 7| 8| 9|10|11|12|13|14|

when using the index 12 (in bucket 3) in the formula:
index – (1 << bucket) + 1
we get
12 – (1 << 3) + 1 = 12 – 8 + 1 = 5.
5 is the zero based index of the node 12 inside it’s level of the tree. (Count 7, 8, 9, 10, 11, 12 using zero based indexes and you will arrive at node twelve when you counted from 0 to 5. 5 is node 12’s zero-based index inside it’s level of the tree).

The formula basically works by subtracting the amount of nodes below the node’s level in the tree. For the node 12, there are 7 nodes below it’s level, so 12 – 7 = 12 – 8 + 1 = 5

Once the formula [1] has determined the zero-based index among all children, the formula [2] will compute the size of a block on that level of the tree (= bucket) and multiply this amount of bytes by the zero based index.

The size of a block at level/bucket n in the tree is (2 ^ (MAX_ALLOC_LOG2 – bucket))

(MAX_ALLOC_LOG2 – bucket) is a larger number, the higher up in the tree the node is located. A larger number results in a larger block of bytes since the number is used as an exponent. This makes sense since a block higher up in the tree is twice as large as it’s direct children. The root node is as large as the entire memory area.

MAX_ALLOC_LOG2 is the exponent of the largest block used by the allocators configuration. The root node in the tree, which is the largest block is of size 2 ^ MAX_ALLOC_LOG2. If MAX_ALLOC_LOG2 is set to 32 for example, the largest block is 2 ^ MAX_ALLOC_LOG2. Bucket 0 stores those blocks.

Blocks in bucket 1 are 2 ^ (MAX_ALLOC_LOG2 – 1)
Blocks in bucket n are 2 ^ (MAX_ALLOC_LOG2 – n)

This is exactly what formula [2] computes.

The computed number from [2] is used as an exponent to 2 to compute the size of the block at that level in the tree in bytes.

The bitwise shift between the results of [1] and [2] will multiply the blocksize times the zero base index. The result is the offset to the start of the memory block.

Adding that offset to the absolute address stored in base_ptr yields the absolute address of the block of free memory.

The inverse operation is:

static size_t node_for_ptr(uint8_t *ptr, size_t bucket) {
    return ((ptr - base_ptr) >> (MAX_ALLOC_LOG2 - bucket)) + (1 << bucket) - 1;
}

This first computes the offset of the ptr (ptr – base_ptr).
This offset is an amount of bytes which is then divided by 2 repeatedly by bitwise right shifting.
The amount of right shifts is (MAX_ALLOC_LOG2 – bucket) which is taking the size of blocks at the bucket-level into account.

We arrive at the zero-based index of the node relative to its level.
The last step is to add the amount of nodes in all lower levels of the tree which is achieved by (1 << bucket) – 1

You can see that this are all the steps of the opposite operation (outlined above) in reverse order.

Operations for managing ‘node split or not’ concerns

The operation parent_is_split() figures out if the parent of a node (NOT THE NODE ITSELF) is split or not.

static int parent_is_split(size_t index) {
    index = (index - 1) / 2;
    return (node_is_split[index / 8] >> (index % 8)) & 1;
}

Line [2] is the operation to compute the parent index of a node.
Line [3] a straigh-forward bitmap operation. It returns the value at a specific index of a bitmap.
The & 1 operation returns true if the bit is set and hence if the node is split.
It returns 0 otherwise.

The operation flip_parent_is_split() toggles the parent’s split status.

static void flip_parent_is_split(size_t index) {
    index = (index - 1) / 2;
    node_is_split[index / 8] ^= 1 << (index % 8);
}

Operations: malloc() and free()

This is the meat and potatoes of the implementation! Finally!

The Header of Blocks

There is a header at the start of each memory block. That header is always 8 byte in size. A memory can either be free or used.

If the memory block is free, it is part of a circular linked list of a bucket. In that case, the header contains a list_t struct. That struct stores a pointer to it’s predecessor free block (4 byte) and it’s successor free block (4 byte).

If the memory block is used the malloc() method stores the size of that block in the first 4 byte of the header. The following 4 byte are basically unused although they still contain the 4 byte successor pointer from the list_t struct that the header contained when the block was on the free list. The size is a multiple of 2. Given a pointer to the memory block, traversing back 8 byte, the header and within it the size of the block can be retrieved. Given the size value, it is possible to compute the bucket the block belongs to and from the bucket it is possible to compute the node index in the tree using:

ptr = (uint8_t *)ptr - HEADER_SIZE;
bucket = bucket_for_request(*(size_t *)ptr + HEADER_SIZE);
i = node_for_ptr((uint8_t *)ptr, bucket);

Once the node index is known, the block can be merged with it’s buddy during a call to free().

How Memory is Layed Out in the Linearized Binary Bit Tree

The implementation does not create a root node spanning the entire available memory since, according to the author, the first request for memory would immediately reserve the entire memory space and break down memory blocks repeatedly until the first allocation can be served.

This approach is expensive, especially if the application only requests small blocks of memory. Instead another approach is choosen.

The tree starts out with the smallest possible block size. That means it’s root node starts at the bucket with the highest index which. The bucket with the highest index manages the smallest possible memory blocks. That means that the used bits in the linearized binary tree are not starting at the index 0 but rather somewhere at the end or the middle of the bitmap.

If memory is requested that does not fit into the smallest memory block, the tree is grown. The root will be placed at the next bucket. The next bucket has a smaller index than the bucket before. That is why the operation is called lower_bucket_limit(). lower_bucket_limit() will put the root onto the next smaller bucket and by doing so, it will double the amount of memory that the tree manages. lower_bucket_limit() will add free memory areas into buckets and provide more space to allocate. In the linearized bitmap, the tree will use more bits towards the start of the bitmap. That means the tree continuoulsy grows towards the first bit in the bitmap and if it shrinks, it will shrink back towards the end of the bitmap.

How Memory is Reserved When Growing the Heap

When the tree cannot satisfy a request for memory, it has to grow. In order to grow there may be the point where the max_ptr has to be moved to reserve more memory. The function update_max_ptr() does exactly that. It will call the operating system using the brk() call. If brk() manages to reserve the memory, max_ptr is moved to the newly allocated border.

Implementation of malloc()

The idea of malloc() is that it receives an amount of bytes to reserve and it returns a pointer to the address at which the bytes are available in a contiguous block. The pointer can later be used in a call to free() and free() is then able to return the block back onto a free list of it’s corresponding bucket.

malloc() starts of by initialization if malloc() has never been called before.

malloc() than looks at the amount of bytes it has to reserve. It will compute the next larger multiple of 2 that is larger than the amount of bytes to reserve. The computed multiple of 2 corresponds to a certain bucket. malloc() will grow the tree, until it’s root is located at a bucket that can serve the requested amount of bytes.

malloc() contains a while loop which iterates over all buckets from the highest index towards the lower indexes. That means it iterates from small buckets to ever larger ones until it is able to locate a matching block to serve the request.

For each bucket index, it will call lower_bucket_limit() to grow the tree to the current bucket size.

NOTE: TODO: I have not understood the rest of the method completely. It kind of walks up the tree, than splits the nodes down again until it finds a matching block. If a matching block is found, it will write the size of the block into the header, killing the linked list struct in the process that was store in the header before. It will then return a pointer to after the header to the use as a return value of malloc(). The user can than use the space and return the block using the pointer as a parameter to free() later.

Implementation of free()

TODO

Multiboot Memory Map

Read this article to learn about common pitfals.

The multiboot specification contains a memory map. The memory map has some characteristics that are hard to understand.

A trick for debugging is to type c when GRUB shows the selection screen to enter a console mode. Once in console mode, Type lsmmap to tell GRUB to print the multiboot memory map.

The Memory Map Contains More Addresses than are Physically Available!

The memory map describes more addresses than are actually available. Some of the addresses in the memory map are not backed by physical memory! Those areas are marked as reserved.

For example if you start qemu and emulate a machine with 1G of RAM,

qemu-system-i386 -m 1G -cdrom image.iso

the memory map will contain entries with addresses for up to 4G. The memory areas in the memory map that are not backed by physical memory are marked as reserved. That means if you ignore all reserved areas in the memory map and only concentrate on the available memory areas, you will have no problems. The areas marked as available are always backed by physical memory as far is I can tell from my personal experiments.

Available Memory Is Not The Same As Free Memory!

The kernel can read and use that memory map to find available regions of physical memory. Available means usable by the the BIOS, the bootloader and the operating system! All these components make use of the available memory, so some of the available memory will be occupied by the time your operating system takes over control from the bootloader! Available and “Not Occupied” or “Free” are two different things! The information that the multiboot memory map gives you is about theoretically available memory not about free memory. That memory might already be filled with important data!

Lower 1 Megabyte

For GRUB, a memory area is available, if you do have access to it and can write into it. The lower MB of RAM typically contains things like the Video RAM at 0xB8000 placed there by the BIOS. You can choose to store data in this section of the memory but being video memory, it will write those characters to the screen! You should just not use this part of the memory although it is available RAM! GRUB does not care! GRUB will just tell you that this Area of memory is available.

The lowest 1 Megabyte also contains interupt vectors amongst other things.

A good approach would be to treat the first MB of RAM is memory that is in use and to just start using the available memory starting from 1MB.

Dealing with memory occupied by GRUB Modules

If you tell GRUB to load modules, those modules are loaded into available memory areas by GRUB. GRUB still says those memory areas are available! You can decide to reuse that space and just write new data over the modules rendering the modules unusable. GRUB does not care!

That means you have to take a close look at the modules section of the multiboot information data struct to learn about which locations in available memory are occupied by the modules in order to not accidentely write data into those sections..

Creating an ISO with GRUB and Your kernel as an elf File

Source Material and Usage

The following article contains a very good explanation about how to create an iso image that contains the GRUB bootloader to load your own kernel which has the elf file format.

Just type make in the folder. It will run the makefile that creates an iso image. You should create a cross compiler for 32 bit x86 code using the System V ABI for the kernel binary to be created correctly. See this Article.

You can then run that iso image on Ubuntu with either VirtualBox, bochs or qemu.

For qemu, first install qemu:

sudo apt-get install qemu
sudo apt-get install qemu-system-i386

Then start qemu using the iso file:

qemu-system-i386 -m 2G -cdrom image.iso

For VirtualBox, you can install VirtualBox via the Ubuntu activities search box. Then in the graphical user interface create a virtual machine and configure it to have your iso loaded in the cdrom drive and start it.

Bootloader and Kernel

The GRUB bootloader implements the Multiboot specification. The multiboot specification describes how a bootloader and a kernel can interact with each other. Any bootloader implementing the Multiboot specification can load any operating system that also adheres to the Multiboot specification.

What does it mean for an operating system to be multiboot compliant? The specification in section “3.1 OS image format” states that the operating system kernel binary must contain a Multiboot header structure in its first 8192 bytes. The structure must be contained in the text segment. Only if after scanning the text segment this structure can be found by a multiboot bootloader, the bootloader will recognize the binary as a kernel and list it in the list of bootable operating systems for example.

Do not confuse the Multiboot header structure with the Multiboot Information Data structure. The Multiboot header is part of the OS kernel binary and is a service for the bootloader by the kernel, whereas the Multiboot Information Data structure is passed to the kernel main function as a parameter by the bootloader and is a service by the bootloader for the operating system.

Multiboot Information Data Structure

The Multiboot specification under section “3.3 Boot information format” says

Upon entry to the operating system, the EBX register contains the physical address of a Multiboot information data structure, through which the boot loader communicates vital information to the operating system. The operating system can use or ignore any parts of the structure as it chooses; all information passed by the boot loader is advisory only.

The Multiboot Information Data structure is a way to transfer information from the bootloader into the kernel. This way, the kernel can learn for example how much physical memory is available.

To use it, you must define the structure. The easiest way to define the structure is to insert the official header file into your codebase. The bottom of the multiboot specification contains the header file in raw text form along with a small example operating system that shows how to use that structure.

Kernel Start Address – At which address will the kernel be loaded by GRUB?

There is no fixed value, it depends on what the elf file instructs GRUB to do.

The kernel binary created from the article is an elf file. The elf file format contains the address it expects to be loaded at. After loading an elf file the elf file should be under the requested address in virtual/physical memory so that all absolute addresses in the application actually point to the correct location.

GRUB is capable of loading elf binary files. If it loads a elf binary, it will load it at the physical address that the elf file requests. The linker script that creates the elf file looks like this:

OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
SECTIONS
 {
   . = 1M;
   .text BLOCK(4K) : ALIGN(4K)
   {
       *(.multiboot)
       *(.text)
   }
   .data : { *(.data) }
   .bss  : { *(.bss)  }
 }

It first tells the linker to create a elf binary. Then, under the sections block, it lets the text segment start at 1M which is one megabyte = 0x00100000. The elf file will now specify that it wants the text segment to be loaded at 1M. The text segment contains the kernel’s executable code. So the kernel is loaded to 1M by GRUB in this example. You could also choose another physical address to load your kernel to.

Kernel End Address – How does one know how large the kernel is and where usable memory starts

The kernel itself is a binary and is placed into memory. As the binary has a certain size in bytes, it will take up a certain amount of memory. So far we know where the kernel binary starts in memory (defined by the linker, contained in the elf binary).

How do we figure out, where the kernel binary ends and where free space starts after the kernel? Another thing to keep in mind is that the kernel binary will keep getting larger and larger the more features and therefore code you add to your kernel codebase. It is inconvenient to make an assumption about an upper bound of the kernel’s size.

Also how does a bootloader know how many sectors to copy from the elf binary into RAM? If the bootloader copies too many sectors, that is not a problem aside from a waste of space. If the bootloader copies too few sectors, only a part of the kernel is available in RAM which will have fatal consequences. Your functions will work just fine until in the midst of execution, your code will have incorrect behaviour such as not outputting the log statements you expect or just total fault of all operations. The reason is that the instruction pointer just moves into parts of the memory that should contain more kernel code but just have not been loaded by the bootloader.

GRUB as a bootloader will determine how large your kernel is by looking at the metadata in the elf file. You do not have to worry about GRUB. If you write your own custom bootloader that is a problem you have to solve. Also if your kernel is a flat binary file and not an elf binary with metadata, how does GRUB or your custom bootloader know how many sectors to loader into RAM to load the entire kernel?

A problem you have to solve in your kernel (!= bootloader) is to figure out, where the first address is that can be use to store data in RAM (= placement address).

After the kernel boots, you will be in a state where paging is disabled and no heap is set up. In this phase you will use placement memory which means you put a unsigned byte pointer to the start address (= placement address) and whenever you require n bytes of memory, you increment the placement address pointer by n. The problem is, that this approach is so simple and basic that it lacks a lot of features that a heap has. For example you cannot free memory with the placement memory system because you have no metadata where objects start and where they end. From this lack of features it follows that one way to deal with this situation is to accept the fact that the kernel will never free memory that it has allocated in the phase before paging and a heap have been activated.

How does the kernel learn about the placement address? The kernel code can use a variable that contains an address set by the linker script. If the linker script sets the address after the kernel binary and all kernel segments, the kernel code suddenly knows about an address where placement memory can start. The linker can set the address correctly because he constructs the binaries and segments and hence knows their sizes. An example of such a linker script is James Molloys linker script. Check out the end label in the linker script. end is the address where the placement memory could start.

/* Link.ld -- Linker script for the kernel - ensure everything goes in the */
/*            Correct place.  */
/*            Original file taken from Bran's Kernel Development */
/*            tutorials: http://www.osdever.net/bkerndev/index.php. */

ENTRY(start)
SECTIONS
{

    .text 0x100000 :
    {
        code = .; _code = .; __code = .;
        *(.text)
        . = ALIGN(4096);
    }

    .data :
    {
        data = .; _data = .; __data = .;
        *(.data)
        *(.rodata)
        . = ALIGN(4096);
    }

    .bss :
    {
        bss = .; _bss = .; __bss = .;
        *(.bss)
        . = ALIGN(4096);
    }

    end = .; _end = .; __end = .;
}

Now the kernel code can now use C code to make use of the end label:

// end is defined in the linker script.
extern u32int end;
u32int placement_address = (u32int)&end

If you look closely, the end variable’s value is not used at all! It is the end variable’s address that is used to retrieve the end of the kernel! (See here)

Working with GRUB Modules

GRUB can, besides loading your kernel, put so called modules in memory. Modules are files (code binaries or just any arbitrary file) that the kernel can use to provide additional functionality or to read configuration from or do anything with in general. A module could be a binary program such as a hello world test program as an elf binary that you want to run as a test of your kernel elf loader for example. It could be a module that allows the kernel to understand how to read a FAT12 file system. It could also be a prepared image file of a filesystem that the kernel can use during it’s early stages of operation.

If you make use of GRUB’s module loader feature, it is not enough to just know where the kernel binary ends, you need to know where in memory GRUB has put the modules and also how much memory those modules occupy. GRUB will choose a memory address to put the modules. There is no well-known memory location where GRUB puts the modules, you have to retrieve the memory locations from GRUB somehow. You can learn that information from the memory map stored inside the multiboot information data. (Example code at the bottom of the multiboot specification).

Knowing which parts of the memory is occupied by your kernel’s binary, the placement memory and the modules is important because you do not want to override that memory in order to guarantee stable operation of your OS. The OS has to have a way to mark the memory as occupied. As the example OS that is build throughout those articles, will use paging, the most straightforward way is to maintain a bitmap of occupied phyiscal frames as outlined in James Molloy’s article about paging.

How does the kernel interact with the multiboot loader to learn about the end address of the modules? The information can be read from the multiboot information data. To retrieve that structure, the assembler boot code has to push the ebx register onto the stack because ebx contains the address of the multiboot information data which was put into ebx by the multiboot loader.

Let’s implement this idea using GRUB and a custom kernel! In this test, the module is a plain ASCII text file that contains the sentence “This is a plain text file module test.”. Create a file called “test” in the folder that contains the Makefile. Into the file “test” enter the following text:

This is a plain text file module test.

Update the Makefile to copy the “test” file into the boot folder. You can use another folder if you want.

CP := cp
RM := rm -rf
MKDIR := mkdir -pv

BIN = kernel
CFG = grub.cfg
ISO_PATH := iso
BOOT_PATH := $(ISO_PATH)/boot
GRUB_PATH := $(BOOT_PATH)/grub

#GCC := gcc
GCC := ~/dev/cross/install/bin/i386-elf-gcc

#LD := ld
LD := ~/dev/cross/install/bin/i386-elf-ld


.PHONY: all

all: bootloader kernel linker modules iso
  @echo Make has completed.

bootloader: boot.asm
  nasm -f elf32 boot.asm -o boot.o

kernel: kernel.c
  $(GCC) -m32 -c kernel.c -o kernel.o

linker: linker.ld boot.o kernel.o
  $(LD) -m elf_i386 -T linker.ld -o kernel boot.o kernel.o

iso: kernel
  $(MKDIR) $(GRUB_PATH)
  $(CP) $(BIN) $(BOOT_PATH)
  $(CP) $(CFG) $(GRUB_PATH)
  grub-file --is-x86-multiboot $(BOOT_PATH)/$(BIN)
  grub-mkrescue -o image.iso $(ISO_PATH)

modules:
  $(MKDIR) $(GRUB_PATH)
  $(CP) test $(BOOT_PATH)

.PHONY: clean
clean:
  $(RM) *.o $(BIN) *iso

If you build using the command “make”, the “test” file will be part of the created iso image. (It is not part of the kernel itself but part of the iso image).

Let GRUB know about the module so it is loaded alongside your kernel. To configure GRUB, change the menuentry in the grub.cfg file and add the “module” keyword and pass in the path to where the “test” file is contained in the iso image (/boot/test in this example).

# timeout in seconds, -1 waits indefinitely without timing out ever
set timeout=-1

# first entry is the default entry to boot after a timeout
set default=0

# custom kernel
# https://www.gnu.org/software/grub/manual/grub/grub.html#menuentry
menuentry "The worst kernel ever" {
        multiboot /boot/kernel
        module /boot/test /boot/test
}

An explanation of all parameters allowed by the menuentry is contained in https://www.gnu.org/software/grub/manual/grub/grub.html#menuentry.

In your assembler file which eventually calls the kernel’s main function, right before calling main, push ebx and eax in this order onto the stack. The multibootloader will write the address of the multiboot information data structure into ebx and the multiboot magic number into eax. Pushing data onto the stack will actually make those pushed bytes available as parameters to the called function in your C code! This is part of the Application Binary Interface (ABI) used which defines this behaviour.

bits 32

section .multiboot               ;according to multiboot spec
        dd 0x1BADB002            ;set magic number for bootloader
        dd 0x0                   ;set flags
        dd - (0x1BADB002 + 0x0)  ;set checksum

section .text
global start
extern main                      ;defined in the C file

start:
        cli                      ;block interrupts
        mov esp, stack_space     ;set stack pointer

        
        push   ebx             ;Push the pointer to the Multiboot information structure.
        push   eax             ;Push the magic value.

        call main                ; call main
        hlt                      ;halt the CPU

section .bss
resb 8192                        ;8KB for stack
stack_space:

Now update your kernel’s main function. An example is contained at the bottom of the multiboot specification. I will only list excerpts from there because there is quite a bit of code involved for printing strings via the BIOS.

void main(unsigned long magic, unsigned long addr) {
    
  multiboot_info_t *mbi;

  terminal_buffer = (unsigned short *)VGA_ADDRESS;

  // clear_screen();
  cls();

  vga_index = 0;
  // print_string("Hello World!", WHITE_COLOR);
  printf("Hello World!\n\n");

  /* Am I booted by a Multiboot-compliant boot loader? */
  if (magic != MULTIBOOT_BOOTLOADER_MAGIC) {
    vga_index = 160;
    // print_string("Invalid magic number", RED);
    printf("Invalid magic number!\n");

    return;
  }

  /* Set MBI to the address of the Multiboot information structure. */
  mbi = (multiboot_info_t *)addr;

  /* Are mods_* valid? */
  if (CHECK_FLAG(mbi->flags, 3)) {

    module_t *mod;
    int i;
    int j;

    printf("mods_count = %d, mods_addr = 0x%x\n", mbi->mods_count,
           mbi->mods_addr);

    for (i = 0, mod = (module_t *)mbi->mods_addr; i < mbi->mods_count;
         i++, mod += sizeof(module_t)) {
      printf(" mod_start = 0x%x, mod_end = 0x%x, string = %s\n", mod->mod_start,
             mod->mod_end, (char *)mod->string);

      // output the first characters from the test module
      char *character = mod->mod_start;
      for (j = 0; j < 37; j++) {
        // putchar(&mod->mod_start);
        putchar((*character));
        character++;
      }

      printf("\n");
    }
  } else {
    printf("No mods found!\n");
  }

  /* Is the section header table of ELF valid? */
  if (CHECK_FLAG(mbi->flags, 5)) {
    elf_section_header_table_t *elf_sec = &(mbi->u.elf_sec);

    printf("elf_sec: num = %d, size = 0x%x,"
           " addr = 0x%x, shndx = 0x%x\n",
           elf_sec->num, elf_sec->size, elf_sec->addr, elf_sec->shndx);
  }

  /* Are mmap_* valid? */
  if (CHECK_FLAG(mbi->flags, 6)) {

    memory_map_t *mmap;

    printf("mmap_addr = 0x%x, mmap_length = 0x%x\n", mbi->mmap_addr,
           mbi->mmap_length);

    for (mmap = (memory_map_t *)mbi->mmap_addr;
         (unsigned long)mmap < mbi->mmap_addr + mbi->mmap_length;
         mmap = (memory_map_t *)((unsigned long)mmap + mmap->size +
                                 sizeof(mmap->size))) {

      printf(" size = 0x%x, base_addr = 0x%x%x,"
             " length = 0x%x%x, type = 0x%x",
             mmap->size, mmap->base_addr_high, mmap->base_addr_low,
             mmap->length_high, mmap->length_low, mmap->type);

      // https://www.gnu.org/software/grub/manual/multiboot/multiboot.html#Boot-modules
      //
      // ‘type’ is the variety of address range represented, where a value of 1
      // indicates available RAM, value of 3 indicates usable memory holding
      // ACPI information, value of 4 indicates reserved memory which needs to
      // be preserved on hibernation, value of 5 indicates a memory which is
      // occupied by defective RAM modules and all other values currently
      // indicated a reserved area.

      switch (mmap->type) {

      case 1:
        printf("Available RAM\n");
        break;

      case 3:
        printf("Usable memory holding ACPI information\n");
        break;

      case 4:
        printf("Reserved memory which needs to be preserved on hibernation\n");
        break;

      case 5:
        printf("Defective RAM\n");
        break;

      default:
        printf("Reserved Area\n");
        break;
      }
    }
  }

  vga_index = 80;
  // print_string("Goodbye World!", WHITE_COLOR);
  printf("Goodbye World!\n");

  return;
}

You can see that the main function now does not have a void parameter any more but it contains the magic number as first parameter and a pointer to the multiboot information data structure as the second parameter.

The kernel’s main function makes use of these parameters to output data about the modules and the memory map. Our goal was to use the custom module which is the plain ASCII text file “test”. The above main function does cheat quite a bit! It loops over all available modules and outputs the first 37 bytes contained in each module. This code assumes that there is a module that contains our “test” file containing a sentence consisting of 37 characters (“This is a plain text file module test.”)!

When you make this project and start the iso using qemu, you will see the text contained in “test” be printed.

As a general reminder: This is not production ready code! This code is merely here to illustrate concepts! It is not good code by any stretch of the imagination! Do not use this code in any project you plan to use for production purposes! Learn the concepts, write your own code, write tests to verify your code and prevent it from regression errors, let your colleagues or peers review your code and tests and only then use your own code for any important purpose! Do yourself a favor and cover your own back.

The Stack

The multiboot specification in “3.2 Machine State” states that the register ESP has an undefined value after booting and then ads:

The OS image must create its own stack as soon as it needs one.

Maybe GRUB will initialize a valid stack but according to the specification it does not have to. The kernel should therefore always creates a stack for itself. This is advisable because if your OS is loaded by a multiboot bootloader other than GRUB, this bootloader might not set a stack pointer and your OS has to be prepared for that situation.

Writing the ISO to a bootable USB Stick

In order to use your operating system on a real machine outside an emulator, you have to get the machine to boot your operating system. The most straightforward way is to boot from USB.

You need a USB stick which contains no relevant data as the USB stick will be erased in the process. You also need your operating system packaged as an ISO. On the machine in the BIOS settings, make sure that the machine does use USB as one of the entries in it’s boot order.

Creating a bootable ISO cannot be achieved by just manually copying the ISO file to an existing USB stick, as your ISO file is then just contained on the stick just like a regular file in the filesystem. Instead, you can use a tool that correctly lays out the ISO on the USB stick so that the machine can recognize the USB stick as a bootable media.

On ubuntu you can use the Startup Disk Creator application which comes preinstalled with the standard ubuntu installation.

The Startup Disk Creator is a little bit on in the sense that it will not accept your custom iso image. I had to rename the image to give it the .img extension. So instead of image.iso, I had to rename it to image.img

mv image.iso image.img

After you have your image.img file, load it in Startup Disk Creator, select the USB stick to write it to and start the process. The USB stick now can be used to boot your operating system.

If you prefer to use command line only to create the USB stick, you should look at this post. It maybe contains a command line only solution.

C Cross Compiler on Ubuntu Linux

Sources

Introduction

The cross compiler will use the System V ABI (Application Binary Interface). An ABI defines how to machine language programs interface with each other. In this case the kernel and applications that will run on it. Also you can write libraries for your operating system which you can compile applications against. The library and the applications using the libraries have to be able to talk to each other. The ABI defines amongst other things how parameters to a function are put into registers and onto the stack. It that interface is defined, two applications adhering to the interface can talk to each other.

Switching to a certain version of GCC via the alternatives system

I compiled the sources of gcc-4.9.2 on Ubuntu 19.04 after installing and changing the alternatives to gcc-6 as outlined in https://askubuntu.com/questions/26498/how-to-choose-the-default-gcc-and-g-version and applying the patch from https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ec1cc0263f156f70693a62cf17b254a0029f4852

setterm -linewrap off

sudo apt-get install gcc-6 g++-6
sudo apt install libc6-dev

dpkg -l | grep gcc | awk '{print $2}'

sudo update-alternatives --remove-all gcc
sudo update-alternatives --remove-all g++

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-6 10
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-6 10
sudo update-alternatives --install /usr/bin/cc cc /usr/bin/gcc 30
sudo update-alternatives --set cc /usr/bin/gcc
sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++ 30
sudo update-alternatives --set c++ /usr/bin/g++
sudo update-alternatives --config gcc
sudo update-alternatives --config g++

export CC=/usr/bin/gcc
export LD=/usr/bin/ld

You have to apply this patch!!!!! https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ec1cc0263f156f70693a62cf17b254a0029f4852

Installing the prerequisits

According to https://packages.ubuntu.com/search?keywords=libcloog-isl-dev the libcloog-isl-dev package is not part of the ubuntu 19.04 disco dingo release and it is not advised to install packages from older releases. cloog is optional anyways so it is not used in this explanation.

sudo apt-get update
sudo apt-get install build-essential flex bison libgmp3-dev libmpc-dev libmpfr-dev texinfo

Install and build:

####################################
echo Stage 1 - Building Dependencies
####################################

setterm -linewrap off

# make a working directory
cd $HOME/dev
rm -rf cross
mkdir cross
cd cross

# install or update all apt-get dependencies
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install gcc                   # not cross
sudo apt-get install g++
sudo apt-get install make
sudo apt-get install bison
sudo apt-get install flex
sudo apt-get install gawk
sudo apt-get install libgmp3-dev
sudo apt-get install libmpfr-dev libmpfr-doc 
#sudo apt-get install libmpfr4 libmpfr4-dbg
sudo apt-get install mpc
sudo apt-get install libmpc-dev
sudo apt-get install texinfo               # optional
#sudo apt-get install libcloog-isl-dev      # optional
sudo apt-get install build-essential
sudo apt-get install glibc-devel
sudo apt-get -y install gcc-multilib libc6-i386

# download and unpack necessary files
wget http://ftpmirror.gnu.org/binutils/binutils-2.25.1.tar.gz
wget http://ftpmirror.gnu.org/gcc/gcc-5.3.0/gcc-5.3.0.tar.gz
wget http://ftpmirror.gnu.org/gcc/gcc-4.9.2/gcc-4.9.2.tar.gz
wget http://ftpmirror.gnu.org/gcc/gcc-4.9.0/gcc-4.9.0.tar.gz
wget http://ftpmirror.gnu.org/gcc/gcc-4.8.3/gcc-4.8.3.tar.gz
wget http://ftpmirror.gnu.org/mpc/mpc-1.0.3.tar.gz

# unzip all archives
#for f in *.tar*; do tar zvxf $f; done

rm -rf binutils-2.25.1
tar zvxf binutils-2.25.1.tar.gz

rm -rf gcc-4.8.3
tar zvxf gcc-4.8.3.tar.gz

rm -rf gcc-4.9.2
tar zvxf gcc-4.9.2.tar.gz

# create installation directory
cd $HOME/dev/cross
mkdir install
export PREFIX="$HOME/dev/cross/install"
#export TARGET=i686-elf
export TARGET=i386-elf
export PATH="$PREFIX/bin:$PATH"

################################
echo Stage 2 - Building Compiler
################################

## install mpc
#cd $HOME/dev/cross
#mkdir build-mpc
#cd build-mpc
#../mpc-1.0.3/configure --prefix="$PREFIX"
#make -j2
#make -j2 check
#make -j2 install
#cd ..

# install binutils
cd $HOME/dev/cross
rm -rf build-binutils
mkdir build-binutils
cd build-binutils
../binutils-2.25.1/configure --target=$TARGET --prefix="$PREFIX" --with-sysroot --disable-nls --disable-werror
make -j2
make -j2 install
cd ..

# install gcc
cd $HOME/dev/cross
rm -rf build-gcc
mkdir build-gcc
cd build-gcc

#../gcc-4.8.3/configure --target=$TARGET --prefix="$PREFIX" --disable-nls --enable-languages=c,c++ --without-headers --with-mpc="$PREFIX"

#../gcc-4.8.3/configure --target=$TARGET --prefix="$PREFIX" --disable-nls --enable-languages=c,c++ --without-headers



../gcc-4.9.2/configure --target=$TARGET --prefix="$PREFIX" --disable-nls --enable-languages=c,c++ --without-headers

make -j2 all-gcc
make -j2 all-target-libgcc
make -j2 install-gcc
make -j2 install-target-libgcc

Build Errors:

../../mpc-1.0.3/src/mul.c:175:1: error: conflicting types for ‘mpfr_fmma’

cfns.gperf:101:1: error: ‘const char* libc_name_p(const char*, unsigned int)’ redeclared inline with ‘gnu_inline’ attribute

The cross compilers are in

~/dev/cross/install/bin

Design of the Operating System

Purpose of this Article

As a beginner in operating system development it is difficult to find your way through the material that is out there. There are no definite guides in implementing an operating system because most of the authors have thought themselves by reading Intel’s developer manuals. A lot of the books are therefore opinionated. Operating Systems become a very personal subject all of a sudden.

In general free tutorials teaching os development in a concise manner that go beyond a hello world boot loader are scarce. James Molloy wrote an excellent set of articles that even explain paging and a heap implementation from scratch line by line! James articles do explain the implementation of more concepts than any other article on the internet.

The problem is that some other sites on the internet attack his implementation mentioning problems in James’s code. James also mentions concepts but does not thouroughly explain why he wants to adhere to those concepts. For example he says kmalloc should not be called after identity mapping but before a heap is initialized. He never says what errors occur should this concept be ignored!

On the other hand you have the heaps of academic books that explain broad concept in nicely written and well organized fashion! The problem is that academic material is too high level to implement anything. Implementation deatils are left out of the picture. A student could leave a course with a broad understanding of what an operating system does but at the same time that student is not able to implement any of the concepts because the understanding is too high level. In a sense the student knows everything and nothing at all at the same time! The knowledge becomes useless beyond passing an exam. Only the most brilliant minds will be able to transform an academic book into a working operating system. Us mere mortals, we need material that gets us started with a basic implementation so we can learn the steps it takes to write an operating sytem.

I think the problem is that in order to be able to write an operating system, you have to be able to find a way to dive down from the level of abstraction displayed in academic books to the very low level tutorials of the web. Articles or books that provide those intermediary steps do not exist. You have to wade through the heaps of opinionated articles and find your own why. You have to establish a plan of what to implement in what order. In order to create an implementation plan, you have to manifest all the concepts in your brain into a concrete architecture not knowing yet if the architecture will hold up. You should not let yourself get discouraged by opinionated, elitist posts on the internet and keep on working on your own implementation.

This article outlines my personal architecture and ideas of how an operating system could be implemented. It is heavily influenced by James Molloys articles and tries to solve problems that might arise due to James implementation as mentioned on the OSDev wiki.

I have no idea if all that is written in this post is correct or not. If I find myself in a situation that I cannot solve based on my current understanding, I will update this architecture with changed architecture which does not have the problems of the architecture before it.

Besides James Molloy’s influence, the architecture is heavily influenced by Unix concepts, because almost all books on operating systems are based on the Unix operating system, I almost exclusively think in Unix terms when I think about operating systems at this point. Although things like fork before exec do not intuitively make sense to me, (why fork? Why not just exec?) I can see however that an implementation of fork and exec is possible and leads to new processes. So I will implement this concept. Especially since it aligns with James Molloys ideas.

Overall Design

The operating system will use paging. It will not swap to a hard drive in the early versions. If all physical memory is used, no process will receive more memory. A process has to be able to deal with this situation. It will use paging to secure the kernel frames from being written to by user mode code and also to facilitate the creation of new processes by copying physical frames for isolation of processes while maintaining the same virtual address space for the copied process. It will also use paging to map the operating system’s frames to the bottom of every running process.

The kernel’s page directory and page tables are stored in identity mapped frames, although I still do not 100% understand why identity mapping is needed. That is not true! The page directory and page tables have to be managed by the heap and they have to belong to the init process. They should be copied (not mapped) on fork(). Because fork() will copy everything above the last used kernel frame, the page directory and page tables have to be located above the last identity mapped kernel frame.

The kernel’s frames are located at the bottom of the physical memory starting from 0 (Where does GRUB put the kernel? Also where is the stack placed by GRUB?). Also code placed by the BIOS is contained in the low frames of the memory map. See http://www.cs.bham.ac.uk/~exr/lectures/opsys/10_11/lectures/os-dev.pdf – Figure 3.4) The kernel’s frames and the BIOS’s frames have to be marked as occupied in the bitmap of used frames to save them from being used twice. This is how the kernel prevents code from overriding it in the early stages of the boot process.

Before activating a heap, the kernel will just use static memory (placement memory system). It will start with an address (read from GRUB’s multiboot information) and it will move that pointer (placement_address) up the address space whenever it needs memory. This memory is never returned as long as the OS runs and hence no heap is necessary for this simple placement memory system.

Calls to fork() will map (not copy) all lower frames into the newly forked processes memory space. Overall fork will map

The frames created by the BIOS
- contains the interrupt vector table
The kernel code installed by the bootloader GRUB
The modules that where placed into memory by GRUB above the kernel
The kernel’s placement memory created via the placement memory system
- The placement memory contains the bitmap of used frames

Program Flow and the Init Process

When the OS starts, it eventually creates the init process. (It also creates an entry for the init process in its list of running processes. This list is the array of process descriptor data structures. That list is also updated by fork(). Because the list is maintained in the kernel memory section at the bottom of the memory map, it has to be static in size, because once the kernel initializes init’s heap, its memory remains fixed to keep it from growing up the address space into any other area of the memory trashing processes along the way. The kernel can therefore only run a fixed amount of processes. Maximum 16 for the start maybe?)

The program flow starts after the bootloader (GRUB) has set the CPU’s instruction pointer onto the kernel code that it has placed into memory. The CPU starts to execute the kernel code. The kernel at this early stage has no processes running, it is the only part executing.

The kernel will

Set GDT and interrupt tables
Create its own stack
Prepare the array of process descriptors
Create the bitmap of frames. All frames are free yet.
Prepare frames (above the static kernel memory) and put Page Directory and Page Tables into those frames (The page table entries do not point to frames yet!)
Identity map the kernels frames and mark them used in the bitmap of frames.
Assign frames to the Page Table Entries prepared above and marks the frames as used in the bitmap of frames
Initialize the heap
Start the init process and hands the program flow over to the init process

Once the init process has received the program flow, it is taking care of all further tasks. It will read the harddrive to find a configuration file. The configuration file describes, which process init should start. In most cases, the new process will be the console process (user space) which reads from stdin and outputs to stdout and can fork() and exec() new processes via system calls.

At some point the program flow has to go into a scheduler which assigns processing time to all running processes. It will run the init process, the console process and all forked and execed processes in their respective time slices.

Back to the init process that just took over the flow from the kernel. The init process will inherit and use the operating system’s stack and its first Page Directory and Page Tables. In the future, whenever init forks itself (maybe to prepare a call to exec later on), the kernel’s/init’s stack will be copied to the new processes by in-memory copying the frames that the kernel stack occupies. The newly copied stack will then be available under the same virtual memory address as the kernel’s/init’s stack (no pointers have to be moved around). The new process has all pointers functioning because the virtual memory looks the same as init’s virtual memory. The frames underneath the copied stack are copies of init’s frames and the new process cannot affect init’s original stack. James Molloy copies frames which is fine but then proceeds to moves pointers around, I think this is not necessary. OSDev wiki also say that it is harmful.

During fork() the new process will not make a copy of the frames that the operating system uses (except the stack as explained above) but it will just map them into the new processes virtual address space. Mapping means that some of the Page Table Entries will receive a pointer to the kernel’s physical memory. That pointer is copied from the parent process as is without any changes.

Overview of Frames

The operating system uses frames for:

The frames where GRUB loaded the kernel sections into.
The frames that where allocated by the OS before a heap was activated (= identity mapped frames)
The frames used to manage running process information. This is the fixed size array of process descriptor data structures.
The frames used by the bitmap of free or occupied physical frames.
The frames used by the operating systems Page Directory and Page Tables (more or less owned by the init process).
Its own stack (more or less owned by the init process also).
The frames used by the heap (more or less owned by the init process).

The latter three types of frames are passed over by the operating system to the init process. The reason is that every process should have

it’s own virtual memory address space (Page Directory and Page Tables)
it’s own stack
It’s own heap

By assigning those three objects to the init process, they can be in-memory copied for new, forked processes and they are not merely mapped.

Because the criteria for fork() to decide whether to copy or to fork is the last frame used by the OS before the heap was activated. Everything below that frame is mapped instead of copied. Everything above that frame is copied (cloned, duplicated). The stack, heap and the Page Directory and Page Tables have to be place above the last used OS frame in the memory map so they are copied and not mapped.

Also frames used by the Page Directory and Page Table will not be statically allocated by the OS! Instead they should be allocated via the heap. The reason is that a process has to have its own virtual address space and has to be able to grow or shrink its own Page Directory and Page Tables (= virtual address space) via the heap. That means that before adding the first Page Directory and Page Tables by the OS for the init proces, a heap implementation has to be activated so init immediately starts to behave like a normal process. init has to be a fully functional template for all other processes forked from it.

The reason for mapping the OS frames is that the operating systems code, static resource data and the memory that it allocated before the heap is active will remain unchanged as long as the OS is running. Because those frames are static, there is no reason to make physical copies of those frames. How does fork() know which frames to copy and which to map? All frames from 0 to the last frame allocated before the heap was activated are mapped. Those are the frames the operating system uses. All other frames (Stack, the processes data) are copied.

Process Descriptors, List of Process Information

What is stored in a ProcessDescriptor? A ProcessDescriptor is created for each running process. The ProcessDescriptors are stored in the kernels static memory, hence their number must be limited to a maximum value (16 for the beginning). Each Process Descriptor contains:

The process identifier PID, a numeric value identifying the process amongst all processes. 0 is not a valid PID because fork returns 0 in the child process after fork.
A pointer to a physical address. That physical address contains the processe’s Page Directory Table. This pointer is needed to enlarge the memory managment structures Page Directory and Page Table if the process requests more memory. Also the memory management structures have to be in-memory copied during a fork().
A pointer to the heap.
A pointer to the stack.
A place in memory where registers can be stored to conserve the CPU state.

Running a Program

The Console

Executing a program starts with the user typing a command in the console/shell. According to the comet book (Operating Systems – Three easy Pieces) the shell is a normal user mode program.

It will by some means find the executable binary file that implements the command. It calls the kernel function fork() to copy itself. It will then call the kernel function exec() to replace the code segment of the copy with the code segment read from the executable program that implements the command. It will the let the new forked process run and wait on its termination. It will then show the commands return value on the console.

No console

To make things easier during development, interaction with the operating system is not needed and the console can be left out. Identifying a program binary to execute, calling fork() and exec() can just be done in the kernel_main() method. The console can be left out of the picture during early developments.

Under Linux, the first process that is started is the init process. A first process could be started from the kernel_main() method similar to init.

Why fork() and exec()?

The question is could an application not be started by creation of a completely new process without forking an existing process and without using exec()? The comet book says, the approach of fork() and exec() is just the right thing to do. Is that true or not? Could it be done easier?

James Molloys Paging and Heap

Why is it not safe to call kmalloc between identity mapping and once the heap is active

James Molloys tutorials are the best source I can find about implementing paging including a heap.

According to osdev wiki the code has some flaws but there is no other write up which goes into the same detail as James Molloy so I read his tutorials quite a bit and I take away a lot from them.

I had a hard time understanding why James Molloy states that between identity mapping the frames and activating the heap, calls to his kmalloc() function are prohibited and the placement_address memory pointer should not be moved.

I think I finally figured out, why he organizes his heap setup code the way he did it and why he does not want to call kmalloc after identity mapping frames and before the heap is functional.

The reason is that in his initialize_paging() function, during identity mapping he iterates over all frames from 0x0000 up to the current value of placement_address. The area between 0x0000 up to placement_address contains everything that was loaded into memory by grub, that means the kernel code and all the data the kernel uses. The area also contains the Page Directory and Page Tables created so far. It also contains the heap’s Page Tables and Page Table Entries. This are should not be overridden ever otherwise the system will crash. To prevent the area from being overridden, the frames covering this area are allocated by calling alloc_frame(). Once a frame is allocated, it is marked as used and will never be handed out to any other program by the heap. Allocating frames from 0x0000 to placement_address is also called the identity mapping loop:

i = 0;
while (i < placement_address+0x1000)
{
    // Kernel code is readable but not writeable from userspace.
    alloc_frame( get_page(i, 1, kernel_directory), 0, 0 );
    i += 0x1000;
}

If at this point, kmalloc is called, kmalloc would move placement_address further and it would use memory that is not secured by a frame and hance could be overriden if a frame is created over that data and handed out to an application by the heap. That is why after identity mapping, James Molloy does not want kmalloc to be called.

In his tutorials he then changes the code of kmalloc to use the heap if the heap is finally activated. That means if the heap is initialized, kmalloc will not use placement_address any more but it will use the heap and it is safe to call kmalloc again.

That is why he says: kmalloc should not be called between identity_mapping and the point when the heap is ready.

Why is the heap paging code split in two parts?

If you look at initialise_paging() you can see that pages are created for the heap before identity mapping and frames are assigned to the pages after identity mapping().

By creating pages for the heap, I mean that a Page Table Entry is requested. Requesting a Page Table Entry can potentially trigger a call to kmalloc as if a Page Table is full while a new page is requested, a new block of memory has to be allocated to house a new Page Table. This basically means that in order to use paging, there is a overhead for management/meta data which is Page Directory, Page Directory Entry, Page Tables and Page Table Entries.

James Molloy first creates Page Tables and Page Table Entries for the heap, without allocating frames for the Page Tables and Page Table Entries. He then identity maps the area from 0x0000 up to placement_address and after that allocates frames for the heap entries.

The reason is that he wants the heap page tables and entries to be stored within the area 0x0000 up to placement_address. He wants that data to be located in the identity mapped frames. He does not assign frames to the heaps Page Table Entries because if he would assign frames, they would be placed at the address 0x0000 because no frames are used yet.

To understand why the first allocated frame goes to 0x0000, you have to understand how James Molloy decides which frames should be used next. To decide which frame to use, he maintains a bitmap that points to all frames from 0 to MAX_FRAME in that order and contains the information whether that frame is used or still available. Whenever a frame is needed, the next free frame in that order is used.

Because no frames have ever been used before the heap is created, the first free frame is frame 0x0000. The heap frames should not be located at 0x0000 because 0x0000 has to contain the kernels frames so they are identity mapped. That is why the heap frames are only assigned after the identity mapping has used all the frames it needs to cover the kernel. The heap frames are then taken from some part of the physical memory but not from the kernels identity mapped area.

The Memory Map

A diagram helps to understand the situation. The diagram shows the memory map used by James Molloy. The memory starts from 0x0000 at the bottom and goes up to where placement_address is currently located. The linker script puts the kernel code .text, .bss and all other sections into the memory between 0x0000 and placement_address.

The kernel will have the information where the linker has put the last section. The kernel will that initialize placement_address to a memory location after the last loaded section. placement_address will then grow upwards, whenever kmalloc is called before the heap is active. All kmalloc does without heap is just to move placement_address upwards so that the caller can use the memory without the memory being used twice by some other program. Once the heap is active, the next free frame in the bitmap of frames is used when someone requests memory and a free frame could be anywhere in RAM.

What happens during identity mapping can be visualized on the diagram pretty well. identity mapping is basically looping over the memory from 0x0000 up to placement_address, covering the entire RAM by frames along the way. The frames are marked as used. Once the iteration is done, all kernel code, kernel data and all files and modules loaded by grub are secured by frames with a set used flag and noone will override that data because the heap will not return those frames to any other process.

Realtek rtl8139 Network Interface Card

Introduction

This post will explain all I know about developing a driver for the Realtek rtl8139 network adapter. It is a network interface card that is capable of 10 / 100 Mbit/s network speeds. It is emulated by qemu which makes it a prime target for learning about driver development.

To enable the rtl8139 in qemu, use

qemu-system-i386 -net nic,model=rtl8139 -fda <your_image>

The osdev wiki says:

If you find your driver suddenly freezes and stops receiving interrupts and you’re using kvm/qemu. Try the option -no-kvm-irqchip

Initialization Process

In order to initialize the network card, there are several settings to configure. There are two types of locations where settings have to be applied:

PCI Configuration Space
ioaddr

Finding the Device in the System

The rtl8139 is connected to the PCI bus. With PCI, every device is identified by a pair of IDs. The pair consists of a vendor ID and a device ID. The rtl8139 has a vendor ID of 0x10ec and a device ID of 0x8139. You can check any pair of vendor ID and device ID on https://www.pcilookup.com or http://pciengine.com/.

First, you have to check if the system has the rtl8139 build into it (or if qemu does emulate the card) by listing all the PCI devices of the system and searching for the vendor and device ID. PCI device listing is described here.

PCI Configuration Space

PCI (Peripheral Component Interconnect) is a way to configure hardware on a pure software basis. Extension cards you put into your PC via a PCI slot, are part of the PCI system.

A PCI system consists of up to 256 busses, each bus can contain up to 32 devices, every device can be a package of up to 8 functions. That means a PCI extension card can act as up to 8 devices when plugged into the PC. Each of these devices will get it’s own function via PCI.

A PC usually only contains a single PCI bus, so instead of using 256 busses it only contains 1.

One of these function in one of the devices on one of the busses will be the RTL 8139 but it is not predefined which one it is. That means the tuple of (bus, device, function) is unknown and the driver has to find the device.

In order to find the touple, there are several ways to do it. The simplest way is to iterate over all busses, devices and functions. On each function, using the current touple (bus, device, function) it is possible to read data from the device at those coordinates. If the touple does not point to an existing device, contine with the next touple. If there is a device at the touple, it is possible to read the so called PCI configuration space of that device. The configuration space contains several registers on the PCI Hardware. Two important registers are the vendor and device registers.

Here is a general depiction of the PCI configuration space of PCI card. You can see that the first four byte contain the vendor Id and the device Id.

Reading and writing the PCI configuration space is done via ports. A port is a memory address that points to hardware instead of a memory cell. Ports can be used to write and read data and to communicate with hardware instead of writing and reading to memory. First you have to specify the address you want to manipulate by writing data to the configuration address 0xCF8. Once that location is configured, you can read or write data by reading and writing to the configuration data address 0xCFC.

The RTL 8319 (and every PCI card) has specific values for vendor and device. Knowing these values, the card can be identified and the touple (bus, device, function) can be found.

Having the touple (bus, device, function) the driver can start with the configuration.

The code to iterate and find the RTL 8139 is listed here:

const u32int PCI_ENABLE_BIT = 0x80000000;
const u32int PCI_CONFIG_ADDRESS = 0xCF8;
const u32int PCI_CONFIG_DATA = 0xCFC;

// func - 0-7
// slot - 0-31
// bus - 0-255
//
// described here: https://en.wikipedia.org/wiki/PCI_configuration_space under
// the section "software implementation"
// parameter pcireg: 0 will read the first 32bit dword of the pci control space
// which is DeviceID and Vendor ID
// pcireg = 1 will read the second 32bit dword which is status and command
// and so on...
u32int r_pci_32(u8int bus, u8int device, u8int func, u8int pcireg) {

  // compute the index
  //
  // pcireg is left shifted twice to multiply it by 4 because each register
  // is 4 byte long (32 bit registers)
  u32int index = PCI_ENABLE_BIT | (bus << 16) | (device << 11) | (func << 8) |
                 (pcireg << 2);

  // write the index value onto the index port
  outl(index, PCI_CONFIG_ADDRESS);

  // read a value from the data port
  return inl(PCI_CONFIG_DATA);
}

int realtek8319Found = 0;

unsigned char pci_bus = 0;
unsigned char pci_device = 0;
unsigned char pci_device_fn = 0;

// there are 256 busses allowed
for (bus = 0; bus != 0xff; bus++) {

// per bus there can be at most 32 devices
for (device = 0; device < 32; device++) {

  // every device can be multi function device of up to 8 functions
  for (func = 0; func < 8; func++) {

    // read the first dword (index 0) from the PCI configuration space
    // dword 0 contains the vendor and device values from the PCI configuration space
    data = r_pci_32(bus, device, func, 0);
    if (data != 0xffffffff) {
       
       // parse the values
       u16int device_value = (data >> 16);
       u16int vendor = data & 0xFFFF;

       // check vendor and device against the values of the RTL 8139 PCI device
       realtek8319Found = 0;
       if (vendor == 0x10ec && device_value == 0x8139) {

        realtek8319Found = 1;

        pci_bus = bus;
        pci_device = device;
        pci_device_fn = func;

        k_printf("RTL8139 found! bus: %d", pci_bus);
        k_printf(" device: %d", pci_device);
        k_printf(" func: %d \n", pci_device_fn);
      }

    }
  }
}

ioaddr

If the Realtek 8139 is build into a PC, it gets a ioaddr assigned during system boot. The device is mapped into memory at that ioaddr. By writing or reading data from memory at that ioaddr, the operating system can configure the card.

The ioaddr can be read from the PCI configuration space at the byte 4. Byte 4 is where the command register starts. The ioaddr is stored in the lowest three bits of the command register.

// read the ioaddr/base_address
u32int pci_ioaddr = r_pci_32(pci_bus, pci_device, pci_device_fn, 4);
k_printf("pci_ioaddr: 0x%x \n", pci_ioaddr);

unsigned long ioaddr = pci_ioaddr & ~3;
k_printf("ioaddr: 0x%x \n", ioaddr);

Using the ioaddr, the driver can power up the card.

Powering up the card

Write the value 0 into the config1 address via the ioaddr.

// write a byte out to the specified port.
void outb(u8int value, u16int port) {

  __asm__ __volatile__("outb %1, %0" : : "dN"(port), "a"(value));
}

outb(0x00, ioaddr + Config1);
k_printf("starting chip done.\n");

Bus Mastering

Next step is to enable bus mastering. If you do not enable bus mastering, qemu will not transfer any data between the memory of the operating system and the memory on the RTL 8139 network card but it will transfer zeroes instead.

A transfer of data is necessery to send a packet or to receive packets. To send a packet, first the data is copied from the memory of the operating system into a buffer on the PCI card. From the buffer the card transfers the data onto the wire.

The transfer of data is performed via DMA (Direct Memory Access). If a PCI card is not assigned rights to be the bus master, it cannot perform DMA. Only the bus master is allowed to perform DMA. (Sidenote: It was reported that on some real hardware, enabling bus mastering is not needed. qemu was updated to make bus mastering mandatory. If you test on qemu, you need this step)

If bus mastering is turned off, qemu will not copy any data to the card but it will only copy zeroes.

The same goes for receiving. The PCI card receives data from the wire and writes that data into a buffer. The operating system will copy data from the buffer into the memory of the operating system via DMA. If bus mastering is turned off, qemu will only transfer zeroes instead of the real data.

To enable bus mastering, you have to set bit 3 (zero indexed, bit3 is actually the fourth bit if you start counting from 1 instead from 0) inside the command register.

The bit is set by reading the command register, flipping bit 3 and writing the value back into the command register.

// https://wiki.osdev.org/RTL8139
// enable bus mastering in the command register
// Some BIOS may enable Bus Mastering at startup, but some versions
// of qemu don't. You should thus be careful about this step.
k_printf("BUS mastering ...\n");

u16int command_register =
    pci_read_word(pci_bus, pci_device, pci_device_fn, 0x04);

k_printf("BUS mastering command_register = %x\n", command_register);

command_register |= 0x04;

pci_write_word(pci_bus, pci_device, pci_device_fn, 0x04, command_register);

command_register = pci_read_word(pci_bus, pci_device, pci_device_fn, 0x04);

k_printf("BUS mastering command_register = %x\n", command_register);

Software Reset

Next is a software reset

// software reset
// https://wiki.osdev.org/RTL8139
// Sending 0x10 to the Command register (0x37) will send the RTL8139 into a
// software reset. Once that byte is sent, the RST bit must be checked to
// make sure that the chip has finished the reset. If the RST bit is high
// (1), then the reset is still in operation.

// ChipCmd is the Command Register 0x37 = 55
// 0x10 == 0001 0000 == bit 5
// k_printf("Reset the chip %d ...\n", i);
outb(0x10, ioaddr + ChipCmd);
while ((inb(ioaddr + ChipCmd) & 0x10) != 0) {
  k_printf("waiting for reset!\n");
}
k_printf("Reset done.\n");

Enable Receiver and Transmitter

// enable receiver and transmitter
// Sets the RE and TE bits high
// k_printf("Enable receiver and transmitter %d...\n", i);
// 0x0C = 1100 = bit 2 und bit 3
outb(0x0C, ioaddr + ChipCmd);
k_printf("Enable receiver and transmitter done.\n");

Set Transmit and Receive Configuration Registers

// https://www.lowlevel.eu/wiki/RTL8139
// CR (Transmit Configuration Register, 0x40, 4 Bytes) und RCR
// (Receive Configuration Register, 0x44, 4 Bytes) setzen.
outl(0x03000700, ioaddr + TxConfig);
outl(0x0000070a, ioaddr + RxConfig);

Configuration Done

At this point the RTL 8139 is ready to send and receive data. Next the Sending of data is explained.

Sending Data

The data to send is written into a buffer (byte array) in operating system memory. Then the buffer is transferred over to the card via DMA (which is why the driver enables bus mastering). You have to specify the physical address for DMA! The PCI card does not understand paging! It only reads from memory at physical locations and does no go through the memory management unit.

My tip for you is to turn off paging during your initial tests with the RTL 8139 just to rule out that source of error.

TSAD and TSD

The way that the RTL 8139 accepts data for sending is explained in this section. On a more abstract level, the card has four hardware buffers for sending. Those buffers are also called descriptors. At any one point in time, there is only a single hardware buffer active. After the reset of the card during initialization, the buffer with index 0 is the active buffer.

The card will send the data stored in the currently active hardware buffer and then make the next hardware buffer in line the active buffer. Once data has been send from buffer 3, the index is reset to 0 and 0 is active again.

Each one of the four hardware buffers is implemented via registers which are available via two memory locations. There is a memory location called TSAD and one called TSD per hardware buffer.

TSAD is the transmission start register. It has to contain the physical address of the buffer that contains the data that the operating system wants to send. The data is transferred between the operating system and the card via DMA in the first step. Once the data is stored in the card’s internal memory, it is transferred onto the wire from there.

TSD is the transmission status or transmission control register and has to be set to contain the length of the data to send in bits 0 to 12 which is the length of the buffer in TSAD in bytes. Also the bit 13 (OWN bit) has to be set to 0. If the OWN bit is zero (low), the hardware on the RTL 8139 card will start to transmit the data to the card and from the card onto the wire. If the DMA transfer between the operationg system and the card was successfull, the OWN bit is set to 1 (high) by the hardware. Once the OWN bit is high, the card will start to transfer the data from the cards internal memory over the wire. I think that the name OWN was choosen to tell the user that the card now owns the data to transfer.

For each of the four buffers there is a pair of TSAD and TSD. The addresses are:

// TSAD = Transmit Start Registers = 32bit = Physical Address of data to
// be sent
u8int TSAD_array[4] = {0x20, 0x24, 0x28, 0x2C};

// TSD - Transmit Status / Command Registers = 32bit
u8int TSD_array[4] = {0x10, 0x14, 0x18, 0x1C};

The operating system has to remember which is the currently active buffer because it is not possible to ask the RTL 8139 card about which buffer is active at the moment. The variable tx_cur is used to store the index of the active buffer.

int tx_cur = 0;

The operating system prepares a buffer (array of byte / char) of data to send. For this example, let’s send 256 bytes containing the ASCII character ‘A’ which has the hex code 0x41 or decimal code 65.

int len = 256;
unsigned char tx_buffer[len];
for (int i = 0; i < len; i++) {
    tx_buffer[i] = 'A';
}

The variable len stores the size of the buffer.

Fill TSAD and TSD of the currently active buffer with the data to send.

// Second, fill in physical address of data to TSAD
outl(tx_buffer, ioaddr + TSAD_array[tx_cur]);

// Fill the length to TSD and start the transmission by setting the OWN
// bit to 0 Start https://wiki.osdev.org/RTL8139#Transmitting_Packets
u32int status = 0;
status |= len & 0x1FFF; // 0-12: Length
status |= 0 << 13;      // 13: OWN bit

outl(status, ioaddr + TSD_array[tx_cur]);

Wait until the OK bit (bit 15) is high. This signals that the transmission is completed. The OWN bit will tell you, when the data was transferred between the operating system and the card. Once the data is stored on the card, it will start to transmit that data over the wire. Once the wire transfer is complete, the card will set the OK bit in the TSD to high which means that the transfer is done and the next transfer buffer is active.

u32int transmit_ok = inl(ioaddr + TSD_array[tx_cur]);
while (transmit_ok & (1 << 15) == 0) {
    k_printf("Waiting for transmit_ok ...\n");
    transmit_ok = inl(ioaddr + TSD_array[tx_cur]);
}
k_printf("Waiting for transmit_ok done. transmit_ok = %d\n", transmit_ok);

Tell the operating system which buffer is active after the last buffer was used. In order to do that, increment tx_cur and wrap around back to zero if the last buffer was used in the prior send operation.

tx_cur++;
if (tx_cur > 3) {
    tx_cur = 0;
}

Now that you are able to send an arbitray byte array into the network, you have to learn how to construct valid ethernet frames for a protocol such as ARP, ICMP, DHPC, TCP, IP, HTTP or anything else. This is not the RTL 8139 driver’s job so the details are not explained in this article.

Constructing the frames for a specific protocol in the OSI model is the job of the so called IP-stack.

Retrieving the MAC Address

The RTL 8139 sends and receives data and is therefore a part of a network. As such it needs an address so packets can be sent point to point between the sender and the receiver.

On the lower levels of the OSI stack where Ethernet frames are sent, the MAC address is used for this purpose. A MAC address is a unique address assigned to a RLT 8139 during manufacturing.

When implementing ARP for example, you need to know the MAC address of your card. This section explains how to retrieve the NIC’s MAC address.

On qemu, you can specify the MAC address on the command line. Knowing the MAC address when testing code is a big advantage because as soon as you retrieve the expected MAC address, it is proven that the code works correctly.

The qemu command line parameter mac specifies the mac address.

/home/<user>/dev/qemu/build/i386-softmmu/qemu-system-i386 \
-monitor stdio \
-cdrom image.iso \
-netdev user,id=network0 \
-device rtl8139,netdev=network0,mac=52:54:00:12:34:56 \
-object filter-dump,id=network_filter_object,netdev=network0,file=dump.dat

Here 52:54:00:12:34:56 is used as a mac address.

The MAC address is stored in a EEPROM chip on the card. To read the EEPROM you need a function.

// Delay between EEPROM clock transitions.
// No extra delay is needed with 33Mhz PCI, but 66Mhz may change this.
#define eeprom_delay() inl(ee_addr)

// The EEPROM commands include the alway-set leading bit.
#define EE_WRITE_CMD (5 << 6)
#define EE_READ_CMD (6 << 6)
#define EE_ERASE_CMD (7 << 6)

static int read_eeprom(long ioaddr, int location) {

  unsigned retval = 0;
  long ee_addr = ioaddr + Cfg9346;
  int read_cmd = location | EE_READ_CMD;

  outb(EE_ENB & ~EE_CS, ee_addr);
  outb(EE_ENB, ee_addr);

  // Shift the read command bits out.
  for (int i = 10; i >= 0; i--) {

    int dataval = (read_cmd & (1 << i)) ? EE_DATA_WRITE : 0;

    outb(EE_ENB | dataval, ee_addr);
    eeprom_delay();

    outb(EE_ENB | dataval | EE_SHIFT_CLK, ee_addr);
    eeprom_delay();
  }

  outb(EE_ENB, ee_addr);
  eeprom_delay();

  for (int i = 16; i > 0; i--) {

    outb(EE_ENB | EE_SHIFT_CLK, ee_addr);
    eeprom_delay();

    retval = (retval << 1) | ((inb(ee_addr) & EE_DATA_READ) ? 1 : 0);

    outb(EE_ENB, ee_addr);
    eeprom_delay();
  }

  // Terminate the EEPROM access.
  outb(~EE_CS, ee_addr);

  return retval;
}

Using this function, the MAC can be read and stored into an array. The array is then output to show that the correct MAC address is read.

// prepare mac address read
int mac_address_index = 0;
u32int mac_address[6];
for (int i = 0; i < 6; i++) {
  mac_address[i] = 0;
}

// Read EEPROM
//
// Read the MAC Addresses from the NIC's EEPROM memory chip
// k_printf("read_eeprom() ...\n");

int readEEPROMResult = read_eeprom(ioaddr, 0) != 0xffff;
if (readEEPROMResult) {

  // loop three times to read three int (= 32 bit)
  for (int i = 0; i < 3; i++) {

    u16int data = read_eeprom(ioaddr, i + 7);

    mac_address[mac_address_index] = data & 0xFF;
    mac_address[mac_address_index + 1] = data >> 8;

    mac_address_index += 2;
  }

} else {

  // loop six times
  for (int i = 0; i < 6; i++) {

    u16int data = inb(ioaddr + i);

    mac_address_index += 1;
  }
}

// DEBUG: print MAC Address
k_printf("MAC: ");
for (int i = 0; i < 6; i++) {
  k_printf("%x:", mac_address[i]);
}
k_printf("\n");

Debugging

This section will introduce you to two ways of debugging the process of sending data using the RTL 8139.

The first method is telling qemu to dump all incoming and outgoing packages to a file. The file is in the pcap format which makes it possible to open the file in wireshark. wireshark is a networking tool that can display all field in ethernet packages and knows a large array of protocols for detailed display of all fields in packets.

If your RTL driver sends data, you can look at what data is send by loading the dump file and looking at the send packets using wireshark.

The second method is to compile qemu and enable the debug output in the emulation layer of the RTL 8139 card. Sadly there is no command line parameter to enable the RTL 8139 debug output. You can only enable the debug output by changing qemu’s code and and compiling qemu. This sounds hard but it actually is pretty easy. If I managed to do it, you will easily be able to do it as well. This method was only tested on a Ubuntu linux. The steps to compile on windows or mac are unknown to me. You can follow method 2 on Ubuntu linux easily.

Dumping Network Traffic with qemu

qemu internally contains so called objects for diverse purposes. One of those objects is the filter-dump object. You can apply the filter-dump object to one of the network interface cards to dump all packets into a file.

/home/<user>/dev/qemu/build/i386-softmmu/qemu-system-i386 \
-monitor stdio \
-cdrom image.iso \
-netdev user,id=network0 \
-device rtl8139,netdev=network0,mac=52:54:00:12:34:56 \
-object filter-dump,id=network_filter_object,netdev=network0,file=dump.dat

The filter-dump object is pointed to the netdev. It will capture traffic on that netdev. The netdev is the RTL 8139 NIC. The output file is called dump.dat it is written into the folder where you start qemu.

Open dump.dat in qemu. You should see the packet you have sent! If the RTL 8139 only sends zeroes, check that you are specifying virtual addresses and check the code that enables bis mastering.

Compile qemu and Enable RTL Debug Output

https://forum.osdev.org/viewtopic.php?f=1&t=28285

In qemu/hw/net/rtl8139.c

#define DEBUG_RTL8139 1
replace by
define DEBUG_RTL8139 1

Build Qemu
https://en.wikibooks.org/wiki/QEMU/Installing_QEMU

0. sudo apt-get install libglib2.0-dev libpango1.0-dev libatk1.0-dev libsdl2-dev
1. git clone git://git.qemu-project.org/qemu.git
2. cd qemu
3. git submodule init
4. git submodule update --recursive
5. git submodule status --recursive
6. git checkout stable-4.1
7. mkdir build
8. cd build
9. ../configure --disable-kvm --prefix=PFX --target-list="i386-softmmu x86_64-softmmu" --enable-sdl
10. make

In step 6, replace the version number with the most current qemu release.
In step 9, the command specifies targets and only lists i386. That way only x86 32 bit qemu is built.
If you call ../configure without additional parameters, qemu will be build for all possible targets which will take forever.

The qemu executable will be placed inside build folder. For example in /home/<user>/dev/qemu/build/i386-softmmu/qemu-system-i386

Now qemu will output debug statements to the command line. You should see lines like these:

RTL8139: +++ transmitting from descriptor 0
RTL8139: +++ transmit reading 42 bytes from host memory at 0x0010504a
RTL8139: +++ transmitted 42 bytes from descriptor 0

FAT Filesystem

Creating a FAT12 sample file

On Ubuntu the commands (https://superuser.com/questions/668485/creating-a-fat-file-system-and-save-it-into-a-file-in-gnu-linux) create a FAT12 formatted file on your harddrive. Replace the SIZE placeholder by e.g. 2048 for a 2 MB file. of determines the file’s filename. The last command mounts the file and is optional.

dd if=/dev/zero of=fat.fs bs=1024 count=SIZE

mkfs.vfat fat.fs

mount -o loop <image_name> /mnt

On macos which is a BSD-derived Unix system, the command newfs_type is more commonly used than mkfs. The type can be one of hfs, msdos, exfat or udf.

To create a FAT12 file use

dd if=/dev/zero of=floppy.img bs=1024 count=1440

Now attach the floppy.img to a system file (without mounting it, as it does not have a filesystem yet and mounting can only be done on a filesystem)

hdiutil attach -nomount floppy.img
The command above will output the file that the iamge was attached to e.g. /dev/disk2

Now, using the file descriptor (e.g. /dev/disk2), you can call newfs_msdos to create a filesystem on the attached image.

newfs_msdos -F 12 -v vollabel /dev/disk2

Detach the image from the file again

hdiutil unmount /dev/disk2

Check the image

hdiutil attach -readonly floppy.img

Now mount the image

diskutil list 
mount -t msdos /dev/disk2 ./mnt
mount_msdos: /dev/disk2 on /Users/bischowg/dev/osdev/fat/resources/mnt: Invalid argument

The file contains data that matches the description for FAT12 given in

The filesystem on the file is initially empty. To store a file, mount the file and copy a file to it.

FAT 12 structure

FAT12 was only ever used on floppy disks. It is only meant for small filesystems. FAT12 is the only FAT filesystem out of FAT12, FAT16 and FAT32 that has a sector to cluster ratio of one. That means a cluster contains only a single sector. That means that when working with FAT12, there is no need to ever distinguish the concept of sectors and clusters. Sectors and clusters can be used interchangeably! In other flavors of FAT, a cluster consists of several sectors.

A FAT12 file system is made up of four parts.

Reserved Sectors – the first sector is the boot sector and contains the bios paramater block (BPB) (see below)
File Allocation Table (FAT) – the bios parameter block describes how many copies of the FAT are stored. FATs are stored redundendly to prevent unreadable disks if a FAT gets corrupted.
Root Directory – the top level directory of the volume
Data Area – stores the raw data of the files and directories

The Boot Sector and the BIOS Parameter Block

The FAT12 filesystem starts of with reserved sectors. There is usually only a single reserved sector which is the boot sector. The boot sector stores the BIOS Parameter Block (BPB) which contains general information that is necessary to know to navigate the FAT12 volume.

The first three bytes contain an assembler jump instruction which makes the CPU jump over the boot sector should it ever be told to execute the contents of the first sector.

The next eight bytes contain the OEM Name, a label that is padded with zeroes should the content be smaller than eight bytes. I think the content is not relevant and can be ignored when reading a FAT12 image.

The following bytes contain the BIOS Parameter Block (BPB). A very good visualization of the BPB is given in https://thestarman.pcministry.com/asm/mbr/GRUBbpb.htm

https://jdebp.eu/FGA/bios-parameter-block.html says:

Because they were originally designed for use on IBM PC compatible machines with Intel CPUs, all of the (integer) fields in BPBs are little-endian.

https://en.wikipedia.org/wiki/Endianness says:

A little-endian ordering places the least significant byte first and the most significant byte last, while a big-endian ordering does the opposite.

Integers in this structure are stored little endian on the disk. That means if you read a word and the word contains the bytes 0x00 0x02, you have to assemble a value of 0x02 0x00 = 512 (decimal) because the byte order is little-endian, and the byte with the highest value is 0x02 whereas the second byte 0x00 follows.

The macros __bswap_16() and __bswap_32() from byteswap.h can be used to convert endianess if needed. On Intel and AMD, there is no need to convert, it will automatically read the bytes in the correct order.

A packed structure that describes the Jump, the OEM Name and the BPB is:

typedef struct __attribute__((packed))
{
    unsigned char jmpBoot[3];
    unsigned char oemName[8];
    uint16_t bytesPerSec; // Bytes per logical sector
    uint8_t secPerClus; // Logical sectors per cluster
    uint16_t rsvdSecCnt; // Reserved logical sectors 
    uint8_t numFats; // Number of FATs 
    uint16_t rootEntCnt; // Root directory entries 
    uint16_t totSec16; // Total logical sectors 
    int8_t media; // Media descriptor 
    int16_t fatSz16; // Logical sectors per FAT 
    int16_t secPerTrk;
    int16_t numHeads;
    int32_t hiddSec;
    int32_t totSec32;

} bios_parameter_block ;

Data Area, Clusters, Sectors, FAT

Files and Directories are stored the same way in FAT, they are stored in sectors within clusters. A directory contains directory entries. In a directory entry, a directory has a directory flag set which distinguishes it from a regular file. A directory maintains a table of the files and directories it contains, it contains directory entries to store the files and folders it contains.

Files and folders are organized in one or more Clusters connected to each other (cluster chain). Clusters contain Sectors. (FAT12 has a sector to cluster ratio of one, that means a cluster contains only a single sector) If a file or folder fits into one Cluster, one cluster suffices. If a file or folder is larger than one cluster, the clusters are chained together that means a cluster maintains a pointer to the next cluster. The FAT can be indexed with a cluster id and stores if there is a pointer to the next cluster, if the cluster is faulty or if it is the last cluster in a cluster chain.

A File Allocation Table (FAT) maintains a list of all the clusters that pertain to files and directories. The FAT is a map, that maps logical cluster indexes to logical cluster indexes. If you put a logical cluster index into the FAT, the FAT gives you the next logical cluster index in the chain. That means the FAT describes a chain of clusters. If a file or a folder is too large for one cluster, it is split up and stored into several clusters. The FAT stores the entire file or folder by storing the file’s or folder’s cluster chain.

The File Allocation Table is stored redundantly (more than once, several copies) in order to keep the files accessible even if one of the copies of the FAT gets corrupted. If files are changed in size, created or deleted, all copies of the FAT have to be updated.

Reading a file from FAT12

The strategy for reading a file from a FAT12 file system is as follows:

Read the boot sector and the bios parameter block therein to get general information about the file system
Make sure that the file system is FAT12 and not FAT16 or FAT32
Compute the offset to the root directory using the count of reserved sectors, the amount of FAT table copies, the size of a FAT in sectors, the size of a sector in bytes. All this information is contained in the BPB
Read the top level entries from the root directory. The root directory is one of the four major parts of a FAT12 volume. (Boot Sector, FATs, root directory, data area). The root directory contains several directory entries. A directory entry points to a file or a folder or a volume label. After reading one of the directory entries, you get the index of the first cluster of the cluster chain that stores the file or folder that the entry points to. Using the first clusters index (which is a logical index), you do two things: You can index the FAT to follow the cluster chain. The second thing you can do is, you can read clusters and sectors from the data area after converting the logical index to a physical index. Reading from the data area allows you to access a file’s raw data or the directory entries of a sub-directory.
Index the FAT to follow the cluster chain that starts with the cluster referenced by the root directory entry.
Read the data from the data area. The data is either a volume description, a file or a folder. In order to read from the data area, you have to convert the logical cluster index into a physical cluster index. Given that pysical cluster index, you can compute an offset in bytes from the start of the volume and read the bytes from that cluster. For a file, the clusters contain the raw data stored in the file. For a directory, the clusters store an array of directory entries.
If the data is a directory, it contains the same kind of directory entries that are also stored in the root directory. You can use the directory entries to dive deeper into the dir tree, to move up the dir tree (by changing directory to the entry called .. which denotes the parent folder) or to access files stored in the current directory. The root directory does not have a .. entry. For folders located in the root directory, the .. directory entry stores a logical cluster index zero, with zero beeing a placeholder for the fact that the root directory is not stored in the data area and hence that there is no physical cluster index to compute.
For a file, visit all the clusters in the file’s cluster chain and read the bytes into a buffer. Return the buffer to the caller.

Following James Molloy’s tutorial

1. You have to download VGABIOS-elpin-2.40 from here https://github.com/nickplee/BochsWatchOS/blob/master/Bochs/bios/VGABIOS-elpin-2.40 and put it to /usr/share/bochs/VGABIOS-elpin-2.40. If you have compiled bochs yourself from the latest SVN Snapshot from SourceForge, the elpin bios is already contained in the bios folder!

2. you have to install bochs and bochs-x

sudo apt-get install bochs bochs-x

3. You have to fix the Makefile

# Makefile for JamesM's kernel tutorials.
# The C and C++ rules are already setup by default.
# The only one that needs changing is the assembler 
# rule, as we use nasm instead of GNU as.

SOURCES=boot.o main.o

# 64bit
#ASFLAGS=-felf64
#LDFLAGS=-Tlink.ld -o 64bit
#CFLAGS=-nostdlib -nostdinc -fno-builtin -fno-stack-protector

# 32bit
ASFLAGS=-felf32
LDFLAGS=-Tlink.ld -melf_i386
CFLAGS=-m32 -nostdlib -nostdinc -fno-builtin -fno-stack-protector

#ASFLAGS=-felf

#LDFLAGS=-Tlink.ld -melf_i386
#LDFLAGS=-Tlink.ld -m32
#LDFLAGS=-Tlink.ld

all: $(SOURCES) link

clean:
-rm *.o kernel

link:
ld $(LDFLAGS) -o kernel $(SOURCES)

.s.o:
nasm $(ASFLAGS) $<

4.You have to fix bochsrc.txt

megs: 32

#romimage: file=/usr/share/bochs/BIOS-bochs-latest, address=0xf0000
#romimage: file=/usr/share/bochs/BIOS-bochs-latest, address=0xe0000
romimage: file=/usr/share/bochs/BIOS-bochs-latest

vgaromimage: file=/usr/share/bochs/VGABIOS-elpin-2.40
#vgaromimage: file=/usr/share/bochs/VGABIOS-lgpl-latest

floppya: 1_44=/dev/loop0, status=inserted

boot: a

mouse: enabled=0

clock: sync=realtime

#cpu: ips=500000
cpu: ips=1000000

#display_library: x, options="gui_debug"

log: bochsout.txt

5. when bochs starts and hangs, type

continue

Introduction

Content

Start and End of the memory space

The circular linked list datatype

The linearized bit tree

The array that stores the linearized binary bit tree

Finding children, parents and siblings

The meaning of the bit

Converting an index to a pointer to the respective block

Operations for managing ‘node split or not’ concerns

Operations: malloc() and free()

The Header of Blocks

How Memory is Layed Out in the Linearized Binary Bit Tree

The Memory Map Contains More Addresses than are Physically Available!

Available Memory Is Not The Same As Free Memory!

Lower 1 Megabyte

Dealing with memory occupied by GRUB Modules

Source Material and Usage

Bootloader and Kernel

Multiboot Information Data Structure

Kernel Start Address – At which address will the kernel be loaded by GRUB?

Kernel End Address – How does one know how large the kernel is and where usable memory starts

Working with GRUB Modules

The Stack

Writing the ISO to a bootable USB Stick

Sources

Introduction

Switching to a certain version of GCC via the alternatives system

Installing the prerequisits

Purpose of this Article

Overall Design

Program Flow and the Init Process

Overview of Frames

Process Descriptors, List of Process Information

The Console

No console

Why fork() and exec()?

Why is it not safe to call kmalloc between identity mapping and once the heap is active

Why is the heap paging code split in two parts?

The Memory Map

Introduction

Links

Initialization Process

Finding the Device in the System

PCI Configuration Space

ioaddr

Powering up the card

Bus Mastering

Software Reset

Enable Receiver and Transmitter

Set Transmit and Receive Configuration Registers

Configuration Done

Sending Data

TSAD and TSD

Retrieving the MAC Address

Debugging

Dumping Network Traffic with qemu

Compile qemu and Enable RTL Debug Output

Creating a FAT12 sample file

FAT 12 structure

The Boot Sector and the BIOS Parameter Block

Data Area, Clusters, Sectors, FAT

Reading a file from FAT12