tmmrs.com

Compiling Brainfuck to x86-64 machine code


Introduction

In this post we will create a compiler that compiles Brainfuck to bare x86-64 machine code (not assembly) without the use of libraries. While many introductions to compilers and processors target simplified machine code or assembly language, we will see it’s surprisingly easy to target machine code for a real-world architecture.

Brainfuck is an esoteric language in which the current state of a program consists of sequential memory cells that store integers and a pointer to one of the cells (the current cell). There are eight instructions to interact with this memory:

InstructionMeaning
+Increment the value in the current cell
-Decrement the value in the current cell
>Increment the pointer
<Decrement the pointer
.Prints the ASCII character corresponding to the value of the current cell
,Reads a character of input and stores the corresponding ASCII code in the current cell
[If the current cell is zero, jump to instruction after the matching closing bracket
]If the current cell is nonzero, jump to instruction after the matching opening bracket

Even though this is a limited set of instructions, Brainfuck can do arbitrary computations (it’s Turing-complete). For example, if we store two numbers in adjacent cells, we can “easily” add the value of the first to the second:

+++++    Store 5 in the first cell
>+++<    Store 3 in the second cell (keep the pointer at the first)
[->+<]   Subtract 1 from the first cell and add 1 to the second until the first is zero
53000000000

Note that any character not one of the eight listed above is simply ignored, so comments don’t require special syntax (although the above snippet did need a few edits to weed out punctuation). One can easily visualize the execution of the above snippet or try programs yourself using an online tool such as this one.

While Brainfuck is clearly impractical, as even a naive “Hello world!” program would use many hundreds of characters (most of which being +), it is very simple while still being powerful. For example, it is possible to visualize a Mandelbrot set.

Brainfuck is not only horrible to write, it’s also horribly slow (for obvious reasons). Rather than simply using another language, we will create a compiler for this simple language, targeting bare x86-64 machine code. We will use Rust on Linux, but it should be possible to follow along on most operating systems and in most languages.1 Targeting other architectures, including regular x86, should also be possible.2 After compilation we will execute the resulting machine code immediately, so that we don’t have to bother writing an executable.3 4

There are many ways to accomplish the same task in assembly, but we will choose the most straightforward one (read: the first approach I got to work). This also means that our output will be suboptimal, but by getting a better feeling for machine code it will also be easier to understand the output of optimizing compilers.

A 28-line interpreter

To ensure we have something to compare to with regard to performance (why else compile to machine code?) and to get the semantics clear, let’s take a quick look at a straightforward interpreter. The state consists of two parts: the memory (with the index of the current cell) and a list of instructions (and the current index, or instruction pointer). To execute a string, we convert it to a list of instructions (where the loop start/end instructions are annotated with the end/start index) using parse5, and then simply loop and match the current instruction until we finish.

fn run(code: &str) {
    let mut memory = vec![0; 1024];
    let mut memory_index = 0;

    let instructions = parse(code);
    let mut instruction_index = 0;
    while instruction_index < instructions.len() {
        match instructions[instruction_index] {
            Instruction::Increment => memory[memory_index] += 1,
            Instruction::Decrement => memory[memory_index] -= 1,
            Instruction::Forward => memory_index += 1,
            Instruction::Backward => memory_index -= 1,
            Instruction::Output => print!("{}", memory[memory_index] as char),
            Instruction::Input => memory[memory_index] = stdin().bytes().next().unwrap().unwrap(),
            Instruction::LoopStart(index) => {
                if memory[memory_index] == 0 {
                    instruction_index = index;
                }
            }
            Instruction::LoopEnd(index) => {
                if memory[memory_index] != 0 {
                    instruction_index = index;
                }
            }
        }
        instruction_index += 1;
    }
}

Running the below example from Wikipedia will, as you would expect (but maybe not based on the code), print “Hello World!“:

fn main() {
    run("++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.");
}

Running the Mandelbrot program referenced earlier takes 15s, which will be our reference point.

Executing some bits

We will now look at the interesting part: executing bytes of code. We start off simple: we create a function that just returns 47 and execute it. As we are using Linux, the standard calling convention is part of the System V ABI. The only important part for now is that the return value of a function is passed in the rax register. Therefore, the following machine code should work:

0xB8                    ; Move a constant into RAX
0x2F 0x00 0x00 0x00     ; The 4-byte constant
0xC3                    ; Return

We will dive into the details in a moment, but before you understand something you should execute it. This can be done as follows:

use memmap::{Mmap, MmapMut};

/// We cannot simply execute data, we first have to copy it to a particular address space which we
/// can mark as executable. We do this using the `mmap` system call, wrapped in a nice library.
unsafe fn mmap(instructions: &[u8]) -> Mmap {
    // Create an in-memory buffer
    let mut map = MmapMut::map_anon(instructions.len()).unwrap();
    // Copy the instructions
    map.copy_from_slice(&instructions);
    // Mark the buffer as executable
    map.make_exec().unwrap()
}

fn main() {
    let instructions = [0xB8, 0x2F, 0x00, 0x00, 0x00, 0xC3];
    // Rust requires marking operations which could result in crashes as "unsafe".
    unsafe {
        let map = mmap(&instructions);
        // Rust notation for converting a pointer to the memory buffer to a C function pointer.
        // (We need a C function pointer because Rust leaves its calling convention undefined.)
        let f: unsafe extern "C" fn() -> i32 = mem::transmute(map.as_ptr());
        // Print the result
        println!("{}", f());
    }
}

Running the above snippet indeed print 47. Now, let’s take a look at the raw bytes once more:

0xB8                    ; Move a constant into EAX
0x2F 0x00 0x00 0x00     ; The 4-byte constant
0xC3                    ; Return

Without further information, the first thing that might catch your eye is that the constant 47 (0x2F) is stored in the first byte instead of the latter as one might expect. This is because x86-64 uses a little-endian encoding.6

As for the opcodes, we obviously don’t know these from memory, but we have a few resources to help us read and write them:

The first byte, 0xB8, is the instruction code for moving a constant into eax. The constant to be moved into the register is specified inline as an instruction operand. Moving a constant into a register is apparently so important that every one of the eight main output registers gets its own opcode: 0xB8 moves into eax, 0xB9 into ecx, 0xBA into edx, and so on.

Lastly, there is the single-byte 0xC3 return instruction, which pops the return address from the stack and resumes execution at that address. The return address is pushed onto the stack by the caller of the function.

Intermezzo: a code generation API

Instead of becoming fluent in hexadecimal opcodes, we will wrap an instruction buffer and add functions to emit opcodes for a move, a return, and any other construct we need. First, we create a union type for the eight main registers: 7

#[derive(Copy, Clone, Debug)]
enum Register {
    EAX,
    ECX,
    EDX,
    EBX,
    ESP,
    EBP,
    ESI,
    EDI,
}

impl Register {
    fn to_integer(self) -> u8 {
        match self {
            Register::EAX => 0,
            Register::ECX => 1,
            Register::EDX => 2,
            Register::EBX => 3,
            Register::ESP => 4,
            Register::EBP => 5,
            Register::ESI => 6,
            Register::EDI => 7,
        }
    }
}

We will use register names starting with E here. When instruction sets expanded from 8-bit to 16-bit, to 32-bit, and to 64-bit, existing registers were extended. The register eax (“extended” ax) references the least significant half of the 64-bit “register” (?) rax. When we move a 32-bit constant into eax, the most significant half of rax is implicitly zeroed.

Even if we reference, for example, rax later when dealing with a 64-bit value, we will use Register::EAX as eax and rax refer to the same physical register. We will also occasionally use the names interchangeably in the text.

Next, we create the wrapper:

struct Assembler {
    instructions: Vec<u8>,
}

impl Assembler {
    fn new() -> Assembler {
        Assembler { instructions: Vec::new() }
    }

    fn emit(&mut self, data: u8) {
        self.instructions.push(data);
    }

    fn get_instructions(&self) -> &[u8] {
        &self.instructions
    }
}

Let’s now add some helper functions to generate the bytes we wrote out manually in the previous section:

impl Assembler {
    // ...

    fn emit_const<T>(&mut self, constant: T) {
        unsafe {
            let mut pointer = mem::transmute::<_, *const u8>(&constant);
            for _ in 0..mem::size_of::<T>() {
                self.emit(*pointer);
                pointer = pointer.add(1);
            }
        }
    }

    // We will stick to the convention that the destination operand goes first.
    fn emit_move_const<T>(&mut self, destination: Register, value: T) {
        if mem::size_of::<T>() == 4 {
            self.emit(0xB8 + destination.to_integer());
        } else {
            unimplemented!("unsupported constant size {}", mem::size_of::<T>());
        }
        self.emit_const(value);
    }

    fn emit_return(&mut self) {
        self.emit(0xC3);
    }
}

With these helper functions, we can rewrite our previous example as follows:

fn main() {
    let mut assembler = Assembler::new();
    assembler.emit_move_const(Register::EAX, 47);
    assembler.emit_return();

    unsafe {
        let map = mmap(assembler.get_instructions());
        let f: unsafe extern "C" fn() -> i32 = mem::transmute(map.as_ptr());
        println!("{}", f());
    }
}

We still need a little more before we can convert Brainfuck to machine code: arithmetic, interacting with memory, input/output, and loops. In the following sections, we will extend our Assembler abstraction to support these operations based on a few examples.

Basic instructions

We will first look at the most basic Brainfuck instructions: >, <, +, and -. If we have allocated memory, we can store the address of this memory in a register (say rbx, or Register::EBX below). The first two operators then correspond to increasing or decreasing the value in that register. The + and - operators require us to read the value pointed to from memory, modify it, and finally write it back. Moving data between memory and registers is accomplished using move instructions. We will store the value we are editing in register eax.

Let’s start with how we want the result to look:

let mut memory = vec![0; 1024];
let mut assembler = Assembler::new();

// We store the pointer in `ebx`.
assembler.emit_move_const(Register::EBX, memory.as_mut_ptr());

// `>`
assembler.emit_add_const(Register::EBX, 4);
// `<`
assembler.emit_add_const(Register::EBX, -4);
// `+`
assembler.emit_move_read(Register::EAX, Register::EBX);
assembler.emit_inc(Register::EAX);
assembler.emit_move_write(Register::EBX, Register::EAX);
// `-`
assembler.emit_move_read(Register::EAX, Register::EBX);
assembler.emit_dec(Register::EAX);
assembler.emit_move_write(Register::EBX, Register::EAX);
4230700000030x7F...2B10EBXEAXinc / decmov

So we require five new helpers, as well as one subtle change: pointers are 64-bit on x86-64, but our emit_move_const function was specifically written for 32-bit constants; operations run in 32-bit mode by default on x86-64. Luckily, as can be seen in the reference, the 0xB8 move opcode we used does support 64-bit operands. If we want to use one of those, we can prefix it with a 0x48 byte to enable 64-bit operand sizes (which can also be found in the reference).

fn emit_move_const<T>(&mut self, destination: Register, value: T) {
    if mem::size_of::<T>() == 4 {
        self.emit(0xB8 + destination.to_integer());
    } else if mem::size_of::<T>() == 8 {
        self.emit(0x48);
        self.emit(0xB8 + destination.to_integer());
    } else {
        unimplemented!("unsupported constant size {}", mem::size_of::<T>());
    }
    self.emit_const(value);
}

Writing the new helpers is somewhat more complicated: they all require a so-called ModR/M byte. When operations require multiple operands, such as a move of a value from one register to another, the most straightforward way of encoding this is as multiple bytes following the opcode. However, with only eight registers previously available, this would have wasted a lot of space!

If we are addressing only eight registers, we can encode them in just 3 bits, therefore easily fitting two register operands in a single byte following the opcode. Even better, we have two bits left to encode other possibilities more efficiently. As an example, we can encode that we want to store the value of eax into the memory location pointed to by rbx: mov [rbx], eax. The ModR/M byte has three fields and looks like this:

modregr/m

The three fields are as follows:

  1. The mod field stores the mode of the ModR/M byte, for example, to refer to the memory pointed to by a register.
  2. The reg field usually stores a register operand, but can also be used to extend the opcode (then the value is called an opcode extension, used for e.g. an add as we will see soon).
  3. The r/m field can specify a register, but can also have other uses depending on the mod field.

Below is a simplified version of the table from the Intel software developer manuals describing the meaning of different values of the ModR/M byte:

eaxecxedxebxespebpesiedi
reg000001010011100101110111
Effective addressmodr/mValue of ModR/M byte
[eax]
[ecx]
[edx]
[ebx]
[-]
disp32
[esi]
[edi]
00000
001
010
011
100
101
110
111
00
01
02
03
04
05
06
07
08
09
0A
0B
0C
0D
0E
0F
10
11
12
13
14
15
16
17
18
19
1A
1B
1C
1D
1E
1F
20
21
22
23
24
25
26
27
28
29
2A
2B
2C
2D
2E
2F
30
31
32
33
34
35
36
37
38
39
3A
3B
3C
3D
3E
3F
[eax] + disp8
[ecx] + disp8
[edx] + disp8
[ebx] + disp8
[-] + disp8
[ebp] + disp8
[esi] + disp8
[edi] + disp8
01000
001
010
011
100
101
110
111
40
41
42
43
44
45
46
47
48
49
4A
4B
4C
4D
4E
4F
50
51
52
53
54
55
56
57
58
59
5A
5B
5C
5D
5E
5F
60
61
62
63
64
65
66
67
68
69
6A
6B
6C
6D
6E
6F
70
71
72
73
74
75
76
77
78
79
7A
7B
7C
7D
7E
7F
[eax] + disp32
[ecx] + disp32
[edx] + disp32
[ebx] + disp32
[-] + disp32
[ebp] + disp32
[esi] + disp32
[edi] + disp32
10000
001
010
011
100
101
110
111
80
81
82
83
84
85
86
87
88
89
8A
8B
8C
8D
8E
8F
90
91
92
93
94
95
96
97
98
99
9A
9B
9C
9D
9E
9F
A0
A1
A2
A3
A4
A5
A6
A7
A8
A9
AA
AB
AC
AD
AE
AF
B0
B1
B2
B3
B4
B5
B6
B7
B8
B9
BA
BB
BC
BD
BE
BF
eax
ecx
edx
ebx
esp
ebp
esi
edi
11000
001
010
011
100
101
110
111
C0
C1
C2
C3
C4
C5
C6
C7
C8
C9
CA
CB
CC
CD
CE
CF
D0
D1
D2
D3
D4
D5
D6
D7
D8
D9
DA
DB
DC
DD
DE
DF
E0
E1
E2
E3
E4
E5
E6
E7
E8
E9
EA
EB
EC
ED
EE
EF
F0
F1
F2
F3
F4
F5
F6
F7
F8
F9
FA
FB
FC
FD
FE
FF

The [-] value indicates that a so-called SIB byte follows the ModR/M byte, which can be used to reference, for example, the address of one register plus eight times the value of another register. Since we won’t need the SIB byte, we will not go into further details.

Let’s now take a look at how we could encode the earlier example:

mov [rbx], eax

Our favorite reference lists the following opcode (with a few more empty columns):

poomnemonicop1op2description, notes
89rMOVr/m16/32/64r/16/32/64Move

The r in the o column means that a normal ModR/M byte follows the 0x89 opcode. Predictably, the move operation has two operands: a destination register (destination registers are formatted in bold) specified in the r/m field and a source register specified in the reg field. The destination of the move should be [rbx], so we use 0b00 as value for the mod field and 0b011 as value for the r/m field. Finally, our source register is eax, so we use 0b000 for the reg field. This results in a machine code translation of 0x8903:

modregr/m0x8900000011opcode

With this example, writing the emit_move_write helper becomes easy:

fn emit_move_write(&mut self, address_register: Register, source: Register) {
    self.emit(0x89);
    self.emit(0 + (source.to_integer() << 3) + address_register.to_integer());
}

Similarly, we can implement emit_move_read (to emit code for statements like mov eax, [rbx]) using the 0x8B opcode. Note that the operands in the reference are swapped and the r16/32/64 (in the reg field) operand is marked in bold and therefore the destination, so we only need to swap the operands in our function as well:

fn emit_move_read(&mut self, destination: Register, address_register: Register) {
    self.emit(0x8B);
    self.emit(0 + (destination.to_integer() << 3 )+ address_register.to_integer());
}

For the last three helpers, emit_add_const, emit_inc and emit_dec, we will need the opcodes from the following excerpt of the reference:

poomnemonicop1op2description, notes
810ADDr/m16/32/64imm16/32Add
FF0INCr/m16/32/64Increment by 1
FF1DECr/m16/32/64Decrement by 1

But note that the increment and decrement operations have the same opcode! To distinguish them, we need to use the value in the o column: the opcode extension mentioned earlier. We store this value in the reg field of the ModR/M byte that follows the opcode. For example, let’s examine the following operation:

dec ebx

The opcode for decrement is 0xFF and uses an opcode extension of 1, which we store in the reg field. According to the reference, the destination register is specified in the r/m field. As we can see in the ModR/M byte reference, ebx requires a mode of 11 and a value of 011 in the r/m field. Therefore, we obtain 0xFFCB:

modregr/m0xFF11001011opcodeopcode extension

The increment operation can be used in the same way, and the only difference for the add operation is the constant following the opcode and ModR/M byte:

fn emit_add_const(&mut self, destination: Register, constant: i32) {
    self.emit(0x48); // 64-bit mode
    self.emit(0x81);
    self.emit(0b11000000 + destination.to_integer());
    self.emit_const(constant);
}

fn emit_inc(&mut self, register: Register) {
    self.emit(0xFF);
    self.emit(0b11000000 + register.to_integer());
}

fn emit_dec(&mut self, register: Register) {
    self.emit(0xFF);
    self.emit(0b11001000 + register.to_integer());
}

We can now already emit code for four out of the eight Brainfuck instructions! This is sufficient to create machine code for (very) basic arithmetic.

Input and output

We will now examine the machine code for the , and . operators, used for input and output respectively. We will outsource the hard work of actual I/O to other functions, so we can reduce the problem to that of calling a function such as:

extern "C" fn get_input() -> u8 {
    stdin().bytes().next().unwrap().unwrap()
}

As we generate code at runtime, we can easily obtain a pointer to the function: get_input as extern "C" fn() -> u8. Now, we just need a way to call a pointer. Unfortunately, there doesn’t seem to be an opcode which can be followed by a 64-bit address to call. Instead, we can store the pointer to the function in a register and call the function pointer stored in the register using the call opcode:

poomnemonicop1op2description, notes
FF2CALLr/m64Call Procedure

The same opcode as for the increment and decrement instructions! But now the opcode extension is set to 2, completely changing its meaning. It takes a single 64-bit operand stored in the r/m field of the ModR/M byte following the opcode. The mod field is set to 2 as we the argument is the register itself and the reg field takes the opcode:

fn emit_call(&mut self, register: Register) {
    self.emit(0xFF);
    self.emit(0b11010000 + register.to_integer());
}

We now have all we need to emit machine code for the input instruction:

// `,`: ask input
assembler.emit_move_const(
    Register::EDX,
    get_input as extern "C" fn() -> u8,
);
assembler.emit_call(Register::EDX);
assembler.emit_move_write(Register::EBX, Register::EAX);

For the output function, we will try to immediately call printf from libc. It takes a format string as first argument and any values to insert into it as additional arguments. But how do pass arguments? Besides specifying return values go in eax, the System V ABI also has conventions for this purpose:

This is a 64-bit platform. The stack grows downwards. Parameters to functions are passed in the registers rdi, rsi, rdx, rcx, r8, r9, and further values are passed on the stack in reverse order. Parameters passed on the stack may be modified by the called function.

So we pass a pointer to our format string in rdi (denoted Register::EDI in our code), and the value to print from memory in esi. Finally, we store the printf function pointer in a register and emit a call instruction:

// `.`: echo back the value
let format_string = CString::new("%c").unwrap();
assembler.emit_move_const(Register::EDI, format_string.as_ptr());
assembler.emit_move_read(Register::ESI, Register::EBX);
assembler.emit_move_const(
    Register::EDX,
    libc::printf as unsafe extern "C" fn(format: *const c_char, ...) -> c_int,
);
assembler.emit_call(Register::EDX);

Unfortunately, concatenating the above two snippets (input and output, wrapped in the main function from earlier) doesn’t produce the desired result:

    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/main`
X
fish: Job 1, 'cargo run' terminated by signal SIGSEGV (Address boundary error)

As you can see, we are asked for input, enter X, and the program proceeds to crash. In case you’ve never programmed in C(++): a segmentation fault occurs when a program makes an attempt to access restricted memory. In practice, this happens pretty quickly once you start doing operations that don’t make sense:

int main() {
    void (*f)() = 0;
    f(); // './a.out' terminated by signal SIGSEGV (Address boundary error)
}

But where did we go wrong? We encoded the function pointer into the machine code without making any modifications. Compiling the following Rust example to assembly using the compiler explorer provides us with a hint:

pub fn dummy() {}

pub fn main() {
    dummy();
}
example::dummy:
        ret

example::main:
        push    rax
        call    qword ptr [rip + example::dummy@GOTPCREL]
        pop     rax
        ret

Note the seemingly useless push and pop before and after the function call. It turns out that these are required because the System V ABI specifies that the stack should be aligned to 16 bytes before a call instruction, so the callee can assume a specific alignment. This means that if the top of the stack has an address that is not 0 modulo 16, we need to ensure that it is before we execute the call instruction.8

Note that our code is also executed using a call instruction, and remember that a call instruction pushes the return address onto the stack. As we are on x86-64, the return address is 64 bits, and because the stack had to be 16-byte aligned before the call, we are now misaligned. We can fix this by introducing helpers for push and pop and calling them with a register such as rbx (note that the instructions push and pop 64-bit values):

fn emit_push(&mut self, register: Register) {
    self.emit(0x50 + register.to_integer());
}

fn emit_pop(&mut self, register: Register) {
    self.emit(0x58 + register.to_integer());
}
assembler.emit_push(Register::EBX);
// ...
assembler.emit_pop(Register::EBX);
return addressreturn addressreturn addressrbx16-byte16-byte8-bytealignmentcallpushpopret...call...

If we now run our example again, we will get the expected result and our character will be echoed back to us.

This may also be a good time to pay attention to another “detail” of the System V ABI:

Functions preserve the registers rbx, rsp, rbp, r12, r13, r14, and r15; while rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11 are scratch registers.

This means that we need to make sure (if we ignore the numbered registers) the registers rbx, rsp and rbp are the same when we return as they were when our function was called. So storing the pointer to the Brainfuck memory in ebx, without restoring its value before returning, was actually illegal! But luckily, we fixed this as well by pushing rbx at the start of our function and popping it at the end.

Loops

The remaining two operations are for looping:

[If the current cell is zero, jump to instruction after the matching closing bracket
]If the current cell is nonzero, jump to instruction after the matching opening bracket

So how do we jump? We could, for example, use the following opcode:

pomnemonicop1op2description, notes
E9CALLrel16/32Jump

An E9 opcode is followed by a 32-bit address to jump to, relative to the current instruction pointer. The instruction pointer points at the instruction following the currently executing instruction. That means the following shows how to skip the next two no-ops (which are instructions that do nothing, or “no operation”, when executed):

0xE90x900x900x020x000x000x00rel32jmpnopnop

Recall that the non-zero byte of the 2 constant comes first, as x86-64 uses a little-endian encoding.

So jumping to another related piece of code is actually pretty straightforward, but this doesn’t yet allow us to encode the above Brainfuck instructions: how to jump conditionally?

There are other opcodes to conditionally jump, all of which are two-byte opcodes: 0x0F, followed by another opcode byte to specify the actual operation. The conditional jumps consist of 0x0F followed by 0x80 through 0x8F. For example, “jump near if zero/equal” is encoded as 0x0F84. We’ll only need two of these (as described by our favorite reference):

poomnemonicop1op2description, notes
0F84JZ / JErel16/32Jump near if zero/equal (ZF=1)
0F85JNZ / JNErel16/32Jump near if not zero/equal (ZF=0)

So to jump if a value is zero, we replace the 0xE9 from the previous example with 0x0F84, and we have encoded a conditional jump. But what is the condition here exactly? It jumps if ZF=1, but what does this mean?

All x86 processors contain a 32-bit EFLAGS register, with each bit (flag) of the register having a specific meaning. One such flag is the “zero flag” (ZF). If that flag is set (the bit is set to 1), the jump of the 0x0F84 instruction would be executed. If the flag is not set, the instruction would be skipped.

What determines whether a flag is set? The flags that are relevant to us, called status flags, are set by arithmetic instructions. For example:

Next to these there is a carry flag, parity flag, auxiliary carry flag and overflow flag. However, we will only need the zero flag.

As for what qualifies as an arithmetic instruction, these include add, sub, mul, div, inc, and dec. For all details about the EFLAGS register and interactions with instructions, see the Intel software developer manuals.

To implement loop instructions, we can read the value pointed to from memory and jump based on its value. To set the flags based on the value, we can execute a bitwise or with itself:

// Start of loop
assembler.emit_move_read(Register::ESI, Register::EBX);
assembler.emit_or(Register::ESI, Register::ESI);
assembler.emit_jump_z(&loop_end_label);
assembler.label(&loop_start_label);

// End of loop
assembler.emit_move_read(Register::ESI, Register::EBX);
assembler.emit_or(Register::ESI, Register::ESI);
assembler.emit_jump_nz(&loop_start_label);
assembler.label(&loop_end_label);

As you can see, we use a helper to mark a position with a label and pass a label to our jump helper. We do this instead of manually calculating relative positions while generating assembly, as this would be cumbersome. Instead, we make our Assembler store jumps and labels: instead of writing the relative address when emitting a jump, we emit a placeholder. After we have emitted all instructions, we can go back and replace these placeholders with the actual values.

The Assembler type now looks like this:

struct Assembler {
    instructions: Vec<u8>,
    /// The indices of labels.
    labels: HashMap<String, usize>,
    /// A tuple `(x, l)` denotes that the four bytes starting at index `l` should contain a
    /// relative reference to the position of label `l`.
    label_references: Vec<(usize, String)>,
}

And we add the following helpers:

fn emit_or(&mut self, destination: Register, source: Register) {
    self.emit(0x09);
    self.emit(0b11000000 + (source.to_integer() << 3) + destination.to_integer());
}

fn emit_jump_z(&mut self, label: &str) {
    self.emit(0x0F);
    self.emit(0x84);
    self.emit_label_reference(label);
}

fn emit_jump_nz(&mut self, label: &str) {
    self.emit(0x0F);
    self.emit(0x85);
    self.emit_label_reference(label);
}

fn emit_label_reference(&mut self, label: &str) {
    // Keep track of placeholders
    self.label_references
        .push((self.instructions.len(), label.to_string()));

    // Zero bytes as placeholder
    self.emit(0);
    self.emit(0);
    self.emit(0);
    self.emit(0);
}

fn label(&mut self, label: &str) {
    self.labels.insert(label.to_string(), self.instructions.len());
}

fn assemble(&mut self) {
    for (location, label) in &self.label_references {
        let label_location = self.labels[label];
        let relative = label_location as i32 - *location as i32 - 4;
        self.instructions[*location] = relative as u8;
        self.instructions[*location + 1] = (relative >> 8) as u8;
        self.instructions[*location + 2] = (relative >> 16) as u8;
        self.instructions[*location + 3] = (relative >> 24) as u8;
    }
}

Putting the pieces together

We now have all the necessary pieces to compile and execute a Brainfuck program:

fn compile_and_run(brainfuck: &str) {
    let instructions = parse(brainfuck);

    let mut memory = vec![0; 1024];
    let mut assembler = Assembler::new();
    let format_string = CString::new("%c").unwrap();

    assembler.emit_push(Register::EBX);
    assembler.emit_move_const(Register::EBX, memory.as_mut_ptr());

    for instruction in instructions {
        match instruction {
            Instruction::Forward => assembler.emit_add_const(Register::EBX, 4),
            Instruction::Backward => assembler.emit_add_const(Register::EBX, -4),
            Instruction::Increment => {
                assembler.emit_move_read(Register::EAX, Register::EBX);
                assembler.emit_inc(Register::EAX);
                assembler.emit_move_write(Register::EBX, Register::EAX);
            }
            Instruction::Decrement => {
                assembler.emit_move_read(Register::EAX, Register::EBX);
                assembler.emit_dec(Register::EAX);
                assembler.emit_move_write(Register::EBX, Register::EAX);
            }
            Instruction::LoopStart(label) => {
                let loop_start_label = format!("loop_start_{}", label);
                let loop_end_label = format!("loop_end_{}", label);
                assembler.emit_move_read(Register::ESI, Register::EBX);
                assembler.emit_or(Register::ESI, Register::ESI);
                assembler.emit_jump_z(&loop_end_label);
                assembler.label(&loop_start_label);
            }
            Instruction::LoopEnd(label) => {
                let loop_start_label = format!("loop_start_{}", label);
                let loop_end_label = format!("loop_end_{}", label);
                assembler.emit_move_read(Register::ESI, Register::EBX);
                assembler.emit_or(Register::ESI, Register::ESI);
                assembler.emit_jump_nz(&loop_start_label);
                assembler.label(&loop_end_label);
            }
            Instruction::Output => {
                assembler.emit_move_const(Register::EDI, format_string.as_ptr());
                assembler.emit_move_read(Register::ESI, Register::EBX);
                assembler.emit_move_const(
                    Register::EDX,
                    libc::printf as unsafe extern "C" fn(format: *const c_char, ...) -> c_int,
                );
                assembler.emit_call(Register::EDX);
            }
            Instruction::Input => {
                assembler.emit_move_const(
                    Register::EDX,
                    get_input as extern "C" fn() -> u8,
                );
                assembler.emit_call(Register::EDX);
                assembler.emit_move_write(Register::EBX, Register::EAX);
            }
        }
    }

    assembler.emit_pop(Register::EBX);
    assembler.emit_return();

    assembler.assemble();

    unsafe {
        let map = mmap(assembler.get_instructions());
        let f: unsafe extern "C" fn() -> i32 = mem::transmute(map.as_ptr());
        f();
    }
}

Note that the parse function now returns LoopStart and LoopEnd values with a label (where the start and end of a loop have the same label). This is easier for us to generate machine code for and is actually also easier to generate (using a stack to keep track of the labels of the currently opened loops).

And indeed, the following program now prints the most famous phrase in computer science:

fn main() {
    let hello_world = "++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.";

    compile_and_run(hello_world);
}

Performance

If we now run our Mandelbrot example again, we don’t actually need the time command to tell us it’s much faster than using the interpreter:

$ time target/release/example6
...
________________________________________________________
Executed in    2.68 secs    fish           external
   usr time    2.68 secs    0.00 micros    2.68 secs
   sys time    0.00 secs  700.00 micros    0.00 secs

So we achieved a >5x speedup compared to the interpreter’s 15s! And notably, we didn’t really optimize anything: we still execute individual forward and backward instructions, repeated increments and other inefficient sequences.

Optimization

While we won’t turn our compiler into an optimizing one here, it might be interesting to talk a bit about how you could go about it. However, you can probably skip this section if you have general knowledge about compilers.

Let’s look at a pretty (but not completely) random excerpt of the Mandelbrot program:

<<<<<<<<<<<<<<+>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]>>>>>
>]<<<<<<<<<[<<<<<<<<<]>>>>>[-]>>>>++++++++++++++[[>>>>>>>
>>]<<<<<<<<<-<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+[>>>[-<<<->>
>]+<<<[->>>->[-<<<<+>>>>]<<<<[->>>>+<<<<<<<<<<<<<[<<<<<<<
<<]>>>>[-]+>>>>>[>>>>>>>>>]>+<]]+>>>>[-<

Look at all of those repeating instructions! For instance, the backward and forward instructions are compiled into separate add 4 or add -4 instructions, when they could be trivially represented using a single add! The same holds for the increment and decrement operations, which currently each map to three x86-64 instructions.9

> > [ - ] > > > > > > > + +7>

If we want to optimize this, we need to determine which representation of the program we want to optimize. Brainfuck does not have an instruction to encode adding 64 to the current value, so optimizing the Brainfuck code does not make much sense in this instance. We could optimize the generated machine code, but that’s extremely cumbersome. Optimizing while emitting the machine code is also guaranteed to turn into a mess.

So we need an intermediate representation (IR): instead of messing with representations which are not meant for it, we transform the input into an IR more susceptible to optimization. We have actually already used an IR: we parsed the Brainfuck string into a list of Instruction values, and only then compiled these to machine code. We can add a similar IR in between for optimization:

/// A high-level instruction
enum HLInstruction {
    Increment,
    Decrement,
    Forward,
    Backward,
    Output,
    Input,
    LoopStart(usize),
    LoopEnd(usize),

    // Higher-level instructions
    Add(i32), // used to represent multiple `Increment` / `Decrement` instructions
    Move(i32), // used to represent multiple `Forward` / `Backward` instructions
}

If we convert our list of Instruction values to a list of HLInstruction values, we can combine increment, decrement, forward and backward instructions into single add and move instructions. Conceptually, with Rust notation, we can implement the following transformations: 10

StringVec<Instruction>Vec<HLInstruction>Vec<HLInstruction>Vec<u8>parseoptimizegeneratetransform

In this case, the optimize function turns the list of HLInstruction values into a faster version (in this example meaning fewer values), for example:

vec![HLInstruction::Increment, 5] // 5 increment instructions
// becomes
vec![HLInstruction::Add(5)]

The generate function can then use more specific x86-64 instructions to compile the higher-level instructions. It is easy to imagine we could add more passes of different optimization functions that use different techniques to make the instructions more efficient.

This is actually an example of general compiler architecture: we have multiple phases transforming the input through IRs until the final output is reached. For example, many compilers and even interpreters (such as clang, rustc and Julia) use an IR from the LLVM project: LLVM IR. The advantage of using such a common IR is that you can compile to this IR and then apply countless existing optimizations. The optimization above could be part of a peephole optimization pass: optimizations applied locally to specific patterns.11

Final notes

We have seen how x86-64 machine code is actually still pretty understandable, or at least the parts we need to encode a few simple operations. It would be much more difficult if we wanted to add support for concurrency primitives or optimize better.

If you want to learn more about machine code generation, you should probably look up books on writing assemblers, as that’s pretty much what we have done (most compiler references teach how to produce assembly as output, not machine code). I haven’t read any, but this Stack Overflow answer has some references.

If you want to learn more about the other parts of the compiler (including code generation with a higher-level target), a seminal reference is Compilers: Principles, Techniques, and Tools (also known as the dragon book). And while I haven’t read it myself, I’ve read good things about Crafting Interpreters (kindly offered for free online, and look at the effort that goes into the illustrations!).

Appendix

Parsing

The final parse version is pretty short. We use a stack to match opening and closing brackets.

fn parse(code: &str) -> Vec<Instruction> {
    let mut instructions = Vec::new();
    let chars: Vec<char> = code.chars().collect();
    let mut loop_stack = Vec::new();
    for (index, char) in chars.iter().enumerate() {
        instructions.push(match char {
            '+' => Instruction::Increment,
            '-' => Instruction::Decrement,
            '>' => Instruction::Forward,
            '<' => Instruction::Backward,
            '.' => Instruction::Output,
            ',' => Instruction::Input,
            '[' => {
                loop_stack.push(index);
                Instruction::LoopStart(index)
            }
            ']' => Instruction::LoopEnd(loop_stack.pop().unwrap()),
            _ => continue,
        });
    }
    instructions
}

Disassembling

We can easily disassemble our generated machine code to inspect it for errors using the iced-x86 crate by adding the following function to our Assembler:

// Modified version of the code from the readme of the `iced-x86` crate.
fn print_disassembly(&self) {
    let mut decoder = Decoder::new(64, &self.instructions, DecoderOptions::NONE);
    static COLUMN_LENGTH: usize = 10;
    let mut formatter = NasmFormatter::new();

    formatter.options_mut().set_digit_separator("`");
    formatter
        .options_mut()
        .set_first_operand_char_index(COLUMN_LENGTH as _);

    let mut output = String::new();
    let mut instruction = iced_x86::Instruction::default();
    while decoder.can_decode() {
        decoder.decode_out(&mut instruction);
        output.clear();
        formatter.format(&instruction, &mut output);

        // Eg. "00007FFAC46ACDB2 488DAC2400FFFFFF     lea       rbp,[rsp-100h]"
        print!("{:016X} ", instruction.ip());
        let start_index = instruction.ip() as usize;
        let instr_bytes = &self.instructions[start_index..start_index + instruction.len()];
        for b in instr_bytes.iter() {
            print!("{:02X}", b);
        }
        if instr_bytes.len() < COLUMN_LENGTH {
            for _ in 0..COLUMN_LENGTH - instr_bytes.len() {
                print!("  ");
            }
        }
        println!(" {}", output);
    }
}

An excerpt of the output generated for the “Hello World!” example:

0000000000000035 8B03                 mov       eax,[rbx]
0000000000000037 FFC0                 inc       eax
0000000000000039 8903                 mov       [rbx],eax
000000000000003B 8B33                 mov       esi,[rbx]
000000000000003D 09F6                 or        esi,esi
000000000000003F 0F8414010000         je        near 0000`0000`0000`0159h
0000000000000045 4881C304000000       add       rbx,4
000000000000004C 8B03                 mov       eax,[rbx]
000000000000004E FFC0                 inc       eax
0000000000000050 8903                 mov       [rbx],eax
0000000000000052 8B03                 mov       eax,[rbx]
0000000000000054 FFC0                 inc       eax
0000000000000056 8903                 mov       [rbx],eax

Footnotes

  1. The only library functions we will use are for input/output and to mark the generated machine code’s memory as executable so that we can actually run it. However, calling conventions may differ for other operating systems.

  2. However, you will need to look up the references and likely modify all machine code. There are also other things to keep in mind, for example, Apple silicon chips apparently don’t have instruction and data cache coherency; this means that you should call sys_icache_invalidate to prevent the processor from executing stale data when invoking the dynamically constructed machine code.

  3. It’s tempting to call this a JIT-compiler, but this usually means you also compile after the program has already started running. Julia also compiles and then executes on every run, and in that community, it’s referred to as a just-ahead-of-time compiler according to Wikipedia.

  4. If we would like to write out an executable, we would also have to do memory allocation, input, and output ourselves. We can now defer this to our host language.

  5. The parse function simply converts the characters in the input into Instruction values one by one (ignoring non-instructions) and finds the end/start indices for loops by running forward/backwards counting brackets until the matching one is found. Depending on your definition of an interpreter, this also means the interpreter isn’t exactly 28 lines. The parse function we use for the interpreter is similar to the one in the appendix, except that matching LoopStart and LoopEnd values contain each other’s indices instead of the same label.

  6. Little-endian means that the byte storing the least significant “digits” is placed at the position in memory with the lower address. When writing a number as a series of bytes, this means that the least significant digits come first. See Wikipedia for more.

  7. Rust separates the definition of data structures from the definition of accompanying functions, but functions taking a self parameter as first argument will still be available as methods on the data structure.

  8. One might wonder why this alignment is necessary. It appears to be for performance reasons. Having a certain stack alignment can be required for certain operations or can improve performance, but if a function does not start with a fixed alignment, it needs to conditionally align the stack (which is more expensive).

  9. Compressing a sequence of many identical operations using a single operation is actually an example of run-length encoding.

  10. The transform pass that simply transforms instructions into their equivalent high-level instruction could be easily combined with the optimization pass in this case, but we separate them to illustrate the more general case with multiple optimization passes and a less trivial IR.

  11. At first thought, you might think that all optimizations are of this form. However, many optimizations need more context. In the Brainfuck case, you might be able to detect that a specific memory cell is never used so that you can replace >> by > and << by < in some instances, but this requires much more information than looking at a few isolated instructions.