Compiling Brainfuck to x86-64 machine code
Introduction
In this post we will create a compiler that compiles Brainfuck to bare x86-64 machine code (not assembly) without the use of libraries. While many introductions to compilers and processors target simplified machine code or assembly language, we will see it’s surprisingly easy to target machine code for a real-world architecture.
Brainfuck is an esoteric language in which the current state of a program consists of sequential memory cells that store integers and a pointer to one of the cells (the current cell). There are eight instructions to interact with this memory:
Instruction | Meaning |
---|---|
+ | Increment the value in the current cell |
- | Decrement the value in the current cell |
> | Increment the pointer |
< | Decrement the pointer |
. | Print the ASCII character corresponding to the value of the current cell |
, | Read a character of input and store the corresponding ASCII code in the current cell |
[ | If the current cell is zero, jump to the instruction after the matching closing bracket |
] | If the current cell is nonzero, jump to the instruction after the matching opening bracket |
Even though this is a limited set of instructions, Brainfuck can do arbitrary computations (it’s Turing-complete). For example, if we store two numbers in adjacent cells, we can “easily” add the value of the first to the second:
+++++ Store 5 in the first cell
>+++< Store 3 in the second cell (keep the pointer at the first)
[->+<] Subtract 1 from the first cell and add 1 to the second until the first is zero
Note that any character that is not one of the eight listed above is simply ignored, so comments don’t require special syntax (although the above snippet did need a few edits to weed out punctuation). You can easily visualize the execution of the above snippet or try programs yourself using an online tool such as this one.
While Brainfuck is clearly impractical, as even a naive “Hello world!” program would use many hundreds of characters (most of which are +), it is very simple while still being powerful. For example, it is even possible to render the Mandelbrot set.
Brainfuck is not only horrible to write, it’s also horribly slow (for obvious reasons). Rather than simply using another language, we will create a compiler for this simple language, targeting bare x86-64 machine code. We will use Rust on Linux, but it should be possible to follow along on most operating systems and in most languages.1 Targeting other architectures, including regular x86, should also be possible.2 After compilation we will execute the resulting machine code immediately, so that we don’t have to bother writing an executable.3 4
There are many ways to accomplish the same task in assembly, but we will choose the most straightforward one (read: the first approach I got to work). This also means that our output will be suboptimal, but getting a better feel for machine code will also make it easier to understand the output of optimizing compilers.
A 28-line interpreter
To ensure we have something to compare against performance-wise (why else compile to machine code?) and to pin down the semantics, let’s take a quick look at a straightforward interpreter. The state consists of two parts: the memory (and the index of the current cell) and a list of instructions (and the current index, or instruction pointer). To execute a string, we convert it to a list of instructions (where the loop start/end instructions are annotated with the end/start index) using parse5, and then simply loop and match the current instruction until we finish.
use std::io::{stdin, Read};

fn run(code: &str) {
    let mut memory = vec![0; 1024];
    let mut memory_index = 0;
    let instructions = parse(code);
    let mut instruction_index = 0;
    while instruction_index < instructions.len() {
        match instructions[instruction_index] {
            Instruction::Increment => memory[memory_index] += 1,
            Instruction::Decrement => memory[memory_index] -= 1,
            Instruction::Forward => memory_index += 1,
            Instruction::Backward => memory_index -= 1,
            Instruction::Output => print!("{}", memory[memory_index] as char),
            Instruction::Input => memory[memory_index] = stdin().bytes().next().unwrap().unwrap(),
            Instruction::LoopStart(index) => {
                if memory[memory_index] == 0 {
                    instruction_index = index;
                }
            }
            Instruction::LoopEnd(index) => {
                if memory[memory_index] != 0 {
                    instruction_index = index;
                }
            }
        }
        instruction_index += 1;
    }
}
Running the example below, taken from Wikipedia, will print “Hello World!” as you would expect (though maybe not obviously so from the code):
fn main() {
    run("++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.");
}
Running the Mandelbrot program referenced earlier takes 15s, which will be our reference point.
Executing some bits
We will now look at the interesting part: executing bytes of code. We start off simple: we create a function that just returns 47
and execute it. As we are using Linux, the standard calling convention is part of the System V ABI. The only important part for now is that the return value of a function is passed in the rax
register. Therefore, the following machine code should work:
0xB8 ; Move a constant into RAX
0x2F 0x00 0x00 0x00 ; The 4-byte constant
0xC3 ; Return
We will dive into the details in a moment, but before you understand something you should execute it. This can be done as follows:
use std::mem;

use memmap::{Mmap, MmapMut};

/// We cannot simply execute data: we first have to copy it to a memory region which we
/// can mark as executable. We do this using the `mmap` system call, wrapped in a nice library.
unsafe fn mmap(instructions: &[u8]) -> Mmap {
    // Create an in-memory buffer
    let mut map = MmapMut::map_anon(instructions.len()).unwrap();
    // Copy the instructions
    map.copy_from_slice(&instructions);
    // Mark the buffer as executable
    map.make_exec().unwrap()
}

fn main() {
    let instructions = [0xB8, 0x2F, 0x00, 0x00, 0x00, 0xC3];
    // Rust requires marking operations which could result in crashes as "unsafe".
    unsafe {
        let map = mmap(&instructions);
        // Rust notation for converting a pointer to the memory buffer to a C function pointer.
        // (We need a C function pointer because Rust leaves its calling convention undefined.)
        let f: unsafe extern "C" fn() -> i32 = mem::transmute(map.as_ptr());
        // Print the result
        println!("{}", f());
    }
}
Running the above snippet indeed prints 47. Now, let’s take a look at the raw bytes once more:
0xB8 ; Move a constant into EAX
0x2F 0x00 0x00 0x00 ; The 4-byte constant
0xC3 ; Return
Without further information, the first thing that might catch your eye is that the constant 47 (0x2F) is stored in the first of the four constant bytes instead of the last, as one might expect. This is because x86-64 uses a little-endian encoding.6
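As a quick sanity check (a minimal illustration, unrelated to the compiler itself), we can reproduce those four constant bytes with Rust’s to_le_bytes:

fn main() {
    // On a little-endian architecture such as x86-64 the least significant byte comes first.
    assert_eq!(47u32.to_le_bytes(), [0x2F, 0x00, 0x00, 0x00]);
}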
As for the opcodes, we obviously don’t know these from memory, but we have a few resources to help us read and write them:
- A nice compact reference for x86-64 opcodes.
- The online Godbolt compiler explorer, where we can enter assembly code such as mov eax, 47 or code in a higher-level language, and immediately obtain the corresponding machine code. This helps us generate sample code, but our goal is to understand the opcodes themselves, not just to dynamically mix and match assembler outputs.
- If we get stuck, we can look at the disassembly of our generated machine code.
The first byte, 0xB8, is the opcode for moving a constant into eax. The constant to be moved into the register is specified inline as an instruction operand. Moving a constant into a register is apparently so important that each of the eight main registers gets its own opcode: 0xB8 moves into eax, 0xB9 into ecx, 0xBA into edx, and so on.
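For instance, only the opcode byte changes if we want to move the same constant into ecx instead:

0xB9                ; Move a constant into ECX (0xB8 + 1)
0x2F 0x00 0x00 0x00 ; The 4-byte constant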
Lastly, there is the single-byte 0xC3
return instruction, which pops the return address from the stack and resumes execution at that address. The return address is pushed onto the stack by the caller of the function.
Intermezzo: a code generation API
Instead of becoming fluent in hexadecimal opcodes, we will wrap an instruction buffer and add functions to emit opcodes for a move, a return, and any other construct we need. First, we create an enum for the eight main registers: 7
#[derive(Copy, Clone, Debug)]
enum Register {
    EAX,
    ECX,
    EDX,
    EBX,
    ESP,
    EBP,
    ESI,
    EDI,
}

impl Register {
    fn to_integer(self) -> u8 {
        match self {
            Register::EAX => 0,
            Register::ECX => 1,
            Register::EDX => 2,
            Register::EBX => 3,
            Register::ESP => 4,
            Register::EBP => 5,
            Register::ESI => 6,
            Register::EDI => 7,
        }
    }
}
We will use register names starting with E here. When instruction sets expanded from 8-bit to 16-bit, then to 32-bit, and finally to 64-bit, the existing registers were extended along the way: the register eax (“extended” ax) refers to the least significant half of the 64-bit register rax. When we move a 32-bit constant into eax, the most significant half of rax is implicitly zeroed.
Even when we later deal with 64-bit values in rax, we will keep using Register::EAX, as eax and rax refer to the same physical register. We will also occasionally use the names interchangeably in the text.
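As an illustration (in the assembly syntax we will also see in the compiler explorer later), writing to the 32-bit half clears the upper half of the 64-bit register:

mov rax, 0x1122334455667788 ; rax = 0x1122334455667788
mov eax, 1                  ; rax = 0x0000000000000001: the upper 32 bits are zeroed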
Next, we create the wrapper:
struct Assembler {
    instructions: Vec<u8>,
}

impl Assembler {
    fn new() -> Assembler {
        Assembler { instructions: Vec::new() }
    }

    fn emit(&mut self, data: u8) {
        self.instructions.push(data);
    }

    fn get_instructions(&self) -> &[u8] {
        &self.instructions
    }
}
Let’s now add some helper functions to generate the bytes we wrote out manually in the previous section:
impl Assembler {
    // ...

    fn emit_const<T>(&mut self, constant: T) {
        unsafe {
            let mut pointer = mem::transmute::<_, *const u8>(&constant);
            for _ in 0..mem::size_of::<T>() {
                self.emit(*pointer);
                pointer = pointer.add(1);
            }
        }
    }

    // We will stick to the convention that the destination operand goes first.
    fn emit_move_const<T>(&mut self, destination: Register, value: T) {
        if mem::size_of::<T>() == 4 {
            self.emit(0xB8 + destination.to_integer());
        } else {
            unimplemented!("unsupported constant size {}", mem::size_of::<T>());
        }
        self.emit_const(value);
    }

    fn emit_return(&mut self) {
        self.emit(0xC3);
    }
}
With these helper functions, we can rewrite our previous example as follows:
fn main() {
    let mut assembler = Assembler::new();
    assembler.emit_move_const(Register::EAX, 47);
    assembler.emit_return();

    unsafe {
        let map = mmap(assembler.get_instructions());
        let f: unsafe extern "C" fn() -> i32 = mem::transmute(map.as_ptr());
        println!("{}", f());
    }
}
We still need a little more before we can convert Brainfuck to machine code: arithmetic, interacting with memory, input/output, and loops. In the following sections, we will extend our Assembler
abstraction to support these operations based on a few examples.
Basic instructions
We will first look at the most basic Brainfuck instructions: >
, <
, +
, and -
. If we have allocated memory, we can store the address of this memory in a register (say rbx
, or Register::EBX
below). The first two operators then correspond to increasing or decreasing the value in that register. The +
and -
operators require us to read the value pointed to from memory, modify it, and finally write it back. Moving data between memory and registers is accomplished using move instructions. We will store the value we are editing in register eax
.
Let’s start with how we want the result to look:
let mut memory = vec![0; 1024];
let mut assembler = Assembler::new();
// We store the pointer in `ebx`.
assembler.emit_move_const(Register::EBX, memory.as_mut_ptr());
// `>`
assembler.emit_add_const(Register::EBX, 4);
// `<`
assembler.emit_add_const(Register::EBX, -4);
// `+`
assembler.emit_move_read(Register::EAX, Register::EBX);
assembler.emit_inc(Register::EAX);
assembler.emit_move_write(Register::EBX, Register::EAX);
// `-`
assembler.emit_move_read(Register::EAX, Register::EBX);
assembler.emit_dec(Register::EAX);
assembler.emit_move_write(Register::EBX, Register::EAX);
So we require five new helpers, as well as one subtle change: pointers are 64 bits wide on x86-64, but our emit_move_const function was specifically written for 32-bit constants, as instructions use 32-bit operands by default on x86-64. Luckily, as can be seen in the reference, the 0xB8 move opcode we used does support 64-bit operands. If we want to use one of those, we can prefix the opcode with a 0x48 byte (the REX.W prefix, which can also be found in the reference) to enable 64-bit operand sizes.
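For example, moving a 64-bit constant into rbx is encoded as follows (with an arbitrary example value; note again the little-endian byte order):

0x48                                    ; Prefix enabling 64-bit operand size
0xBB                                    ; Move a constant into RBX (0xB8 + 3)
0x88 0x77 0x66 0x55 0x44 0x33 0x22 0x11 ; The 8-byte constant (0x1122334455667788)

Our emit_move_const helper then becomes: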
fn emit_move_const<T>(&mut self, destination: Register, value: T) {
    if mem::size_of::<T>() == 4 {
        self.emit(0xB8 + destination.to_integer());
    } else if mem::size_of::<T>() == 8 {
        self.emit(0x48);
        self.emit(0xB8 + destination.to_integer());
    } else {
        unimplemented!("unsupported constant size {}", mem::size_of::<T>());
    }
    self.emit_const(value);
}
Writing the new helpers is somewhat more complicated: they all require a so-called ModR/M byte. When operations require multiple operands, such as a move of a value from one register to another, the most straightforward way of encoding this is as multiple bytes following the opcode. However, with only eight registers previously available, this would have wasted a lot of space!
If we are addressing only eight registers, we can encode a register in just 3 bits, therefore easily fitting two register operands in a single byte following the opcode. Even better, we have two bits left to encode other possibilities more efficiently. As an example, we can encode that we want to store the value of eax into the memory location pointed to by rbx: mov [rbx], eax. The ModR/M byte consists of three fields: from most to least significant bits, a 2-bit mod field, a 3-bit reg field, and a 3-bit r/m field.
The three fields are as follows:
- The mod field stores the mode of the ModR/M byte, for example, to refer to the memory pointed to by a register.
- The reg field usually stores a register operand, but can also be used to extend the opcode (the value is then called an opcode extension, used for e.g. an add, as we will see soon).
- The r/m field can specify a register, but can also have other uses depending on the mod field.
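Since the ModR/M byte is just these three bit fields packed together, we could make the packing explicit with a small helper (a sketch only; the name modrm is ours, and the code below will simply write out the bit arithmetic inline):

fn modrm(mod_bits: u8, reg: u8, rm: u8) -> u8 {
    // mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0
    (mod_bits << 6) | (reg << 3) | rm
}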
Below is a simplified version of the table from the Intel software developer manuals describing the meaning of different values of the ModR/M byte:
Each cell contains the resulting value of the ModR/M byte:

Effective address | mod | r/m | reg=000 (eax) | reg=001 (ecx) | reg=010 (edx) | reg=011 (ebx) | reg=100 (esp) | reg=101 (ebp) | reg=110 (esi) | reg=111 (edi) |
---|---|---|---|---|---|---|---|---|---|---|
[eax] | 00 | 000 | 00 | 08 | 10 | 18 | 20 | 28 | 30 | 38 |
[ecx] | 00 | 001 | 01 | 09 | 11 | 19 | 21 | 29 | 31 | 39 |
[edx] | 00 | 010 | 02 | 0A | 12 | 1A | 22 | 2A | 32 | 3A |
[ebx] | 00 | 011 | 03 | 0B | 13 | 1B | 23 | 2B | 33 | 3B |
[-] | 00 | 100 | 04 | 0C | 14 | 1C | 24 | 2C | 34 | 3C |
disp32 | 00 | 101 | 05 | 0D | 15 | 1D | 25 | 2D | 35 | 3D |
[esi] | 00 | 110 | 06 | 0E | 16 | 1E | 26 | 2E | 36 | 3E |
[edi] | 00 | 111 | 07 | 0F | 17 | 1F | 27 | 2F | 37 | 3F |
[eax] + disp8 | 01 | 000 | 40 | 48 | 50 | 58 | 60 | 68 | 70 | 78 |
[ecx] + disp8 | 01 | 001 | 41 | 49 | 51 | 59 | 61 | 69 | 71 | 79 |
[edx] + disp8 | 01 | 010 | 42 | 4A | 52 | 5A | 62 | 6A | 72 | 7A |
[ebx] + disp8 | 01 | 011 | 43 | 4B | 53 | 5B | 63 | 6B | 73 | 7B |
[-] + disp8 | 01 | 100 | 44 | 4C | 54 | 5C | 64 | 6C | 74 | 7C |
[ebp] + disp8 | 01 | 101 | 45 | 4D | 55 | 5D | 65 | 6D | 75 | 7D |
[esi] + disp8 | 01 | 110 | 46 | 4E | 56 | 5E | 66 | 6E | 76 | 7E |
[edi] + disp8 | 01 | 111 | 47 | 4F | 57 | 5F | 67 | 6F | 77 | 7F |
[eax] + disp32 | 10 | 000 | 80 | 88 | 90 | 98 | A0 | A8 | B0 | B8 |
[ecx] + disp32 | 10 | 001 | 81 | 89 | 91 | 99 | A1 | A9 | B1 | B9 |
[edx] + disp32 | 10 | 010 | 82 | 8A | 92 | 9A | A2 | AA | B2 | BA |
[ebx] + disp32 | 10 | 011 | 83 | 8B | 93 | 9B | A3 | AB | B3 | BB |
[-] + disp32 | 10 | 100 | 84 | 8C | 94 | 9C | A4 | AC | B4 | BC |
[ebp] + disp32 | 10 | 101 | 85 | 8D | 95 | 9D | A5 | AD | B5 | BD |
[esi] + disp32 | 10 | 110 | 86 | 8E | 96 | 9E | A6 | AE | B6 | BE |
[edi] + disp32 | 10 | 111 | 87 | 8F | 97 | 9F | A7 | AF | B7 | BF |
eax | 11 | 000 | C0 | C8 | D0 | D8 | E0 | E8 | F0 | F8 |
ecx | 11 | 001 | C1 | C9 | D1 | D9 | E1 | E9 | F1 | F9 |
edx | 11 | 010 | C2 | CA | D2 | DA | E2 | EA | F2 | FA |
ebx | 11 | 011 | C3 | CB | D3 | DB | E3 | EB | F3 | FB |
esp | 11 | 100 | C4 | CC | D4 | DC | E4 | EC | F4 | FC |
ebp | 11 | 101 | C5 | CD | D5 | DD | E5 | ED | F5 | FD |
esi | 11 | 110 | C6 | CE | D6 | DE | E6 | EE | F6 | FE |
edi | 11 | 111 | C7 | CF | D7 | DF | E7 | EF | F7 | FF |
The [-]
value indicates that a so-called SIB byte follows the ModR/M byte, which can be used to reference, for example, the address of one register plus eight times the value of another register. Since we won’t need the SIB byte, we will not go into further details.
Let’s now take a look at how we could encode the earlier example:
mov [rbx], eax
Our favorite reference lists the following opcode (with a few more empty columns):
po | o | mnemonic | op1 | op2 | description, notes |
---|---|---|---|---|---|
89 | r | MOV | **r/m16/32/64** | r16/32/64 | Move |
The r
in the o
column means that a normal ModR/M byte follows the 0x89
opcode. Predictably, the move operation has two operands: a destination register (destination registers are formatted in bold) specified in the r/m
field and a source register specified in the reg
field. The destination of the move should be [rbx]
, so we use 0b00
as value for the mod
field and 0b011
as value for the r/m
field. Finally, our source register is eax
, so we use 0b000
for the reg
field. This results in a machine code translation of 0x8903: the opcode 0x89 followed by the ModR/M byte 0b00_000_011 = 0x03.
With this example, writing the emit_move_write
helper becomes easy:
fn emit_move_write(&mut self, address_register: Register, source: Register) {
    self.emit(0x89);
    self.emit(0 + (source.to_integer() << 3) + address_register.to_integer());
}
Similarly, we can implement emit_move_read
(to emit code for statements like mov eax, [rbx]
) using the 0x8B
opcode. Note that the operands in the reference are swapped and the r16/32/64
(in the reg
field) operand is marked in bold and therefore the destination, so we only need to swap the operands in our function as well:
fn emit_move_read(&mut self, destination: Register, address_register: Register) {
    self.emit(0x8B);
    self.emit(0 + (destination.to_integer() << 3) + address_register.to_integer());
}
For the last three helpers, emit_add_const
, emit_inc
and emit_dec
, we will need the opcodes from the following excerpt of the reference:
po | o | mnemonic | op1 | op2 | description, notes |
---|---|---|---|---|---|
81 | 0 | ADD | r/m16/32/64 | imm16/32 | Add |
FF | 0 | INC | r/m16/32/64 | Increment by 1 | |
FF | 1 | DEC | r/m16/32/64 | Decrement by 1 |
But note that the increment and decrement operations have the same opcode! To distinguish them, we need to use the value in the o
column: the opcode extension mentioned earlier. We store this value in the reg
field of the ModR/M byte that follows the opcode. For example, let’s examine the following operation:
dec ebx
The opcode for decrement is 0xFF
and uses an opcode extension of 1
, which we store in the reg
field. According to the reference, the destination register is specified in the r/m
field. As we can see in the ModR/M byte reference, ebx
requires a mode of 11
and a value of 011
in the r/m
field. Therefore, we obtain 0xFFCB: the opcode 0xFF followed by the ModR/M byte 0b11_001_011 = 0xCB.
The increment operation can be used in the same way, and the only difference for the add
operation is the constant following the opcode and ModR/M byte:
fn emit_add_const(&mut self, destination: Register, constant: i32) {
    self.emit(0x48); // 64-bit mode
    self.emit(0x81);
    self.emit(0b11000000 + destination.to_integer());
    self.emit_const(constant);
}

fn emit_inc(&mut self, register: Register) {
    self.emit(0xFF);
    self.emit(0b11000000 + register.to_integer());
}

fn emit_dec(&mut self, register: Register) {
    self.emit(0xFF);
    self.emit(0b11001000 + register.to_integer());
}
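For instance, emit_add_const(Register::EBX, 4) emits the following bytes (the same sequence appears in the disassembly excerpt at the end of this post):

0x48                ; Prefix enabling 64-bit operand size
0x81                ; Add a constant to a register or memory operand
0xC3                ; ModR/M byte: mod = 11, opcode extension 0, r/m = 011 (rbx)
0x04 0x00 0x00 0x00 ; The 4-byte constant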
We can now already emit code for four out of the eight Brainfuck instructions! This is sufficient to create machine code for (very) basic arithmetic.
Input and output
We will now examine the machine code for the ,
and .
operators, used for input and output respectively. We will outsource the hard work of actual I/O to other functions, so we can reduce the problem to that of calling a function such as:
extern "C" fn get_input() -> u8 {
    stdin().bytes().next().unwrap().unwrap()
}
As we generate code at runtime, we can easily obtain a pointer to the function: get_input as extern "C" fn() -> u8
. Now, we just need a way to call a pointer. Unfortunately, there doesn’t seem to be an opcode which can be followed by a 64-bit address to call. Instead, we can store the pointer to the function in a register and call the function pointer stored in the register using the call opcode:
po | o | mnemonic | op1 | op2 | description, notes |
---|---|---|---|---|---|
FF | 2 | CALL | r/m64 | Call Procedure |
The same opcode as for the increment and decrement instructions! But now the opcode extension is set to 2, completely changing its meaning. It takes a single 64-bit operand specified in the r/m field of the ModR/M byte following the opcode. As the operand is the register itself, the mod field is set to 0b11, and the reg field holds the opcode extension:
fn emit_call(&mut self, register: Register) {
    self.emit(0xFF);
    self.emit(0b11010000 + register.to_integer());
}
We now have all we need to emit machine code for the input instruction:
// `,`: ask input
assembler.emit_move_const(
    Register::EDX,
    get_input as extern "C" fn() -> u8,
);
assembler.emit_call(Register::EDX);
assembler.emit_move_write(Register::EBX, Register::EAX);
For the output function, we will try to immediately call printf from libc. It takes a format string as its first argument and any values to insert into it as additional arguments. But how do we pass arguments? Besides specifying that return values go in rax, the System V ABI also has conventions for this purpose:
This is a 64-bit platform. The stack grows downwards. Parameters to functions are passed in the registers rdi, rsi, rdx, rcx, r8, r9, and further values are passed on the stack in reverse order. Parameters passed on the stack may be modified by the called function.
So we pass a pointer to our format string in rdi
(denoted Register::EDI
in our code), and the value to print from memory in esi
. Finally, we store the printf
function pointer in a register and emit a call instruction:
// `.`: echo back the value
let format_string = CString::new("%c").unwrap();
assembler.emit_move_const(Register::EDI, format_string.as_ptr());
assembler.emit_move_read(Register::ESI, Register::EBX);
assembler.emit_move_const(
    Register::EDX,
    libc::printf as unsafe extern "C" fn(format: *const c_char, ...) -> c_int,
);
assembler.emit_call(Register::EDX);
Unfortunately, concatenating the above two snippets (input and output, wrapped in the main
function from earlier) doesn’t produce the desired result:
Finished dev [unoptimized + debuginfo] target(s) in 0.00s
Running `target/debug/main`
X
fish: Job 1, 'cargo run' terminated by signal SIGSEGV (Address boundary error)
As you can see, we are asked for input, enter X
, and the program proceeds to crash. In case you’ve never programmed in C(++): a segmentation fault occurs when a program makes an attempt to access restricted memory. In practice, this happens pretty quickly once you start doing operations that don’t make sense:
int main() {
    void (*f)() = 0;
    f(); // './a.out' terminated by signal SIGSEGV (Address boundary error)
}
But where did we go wrong? We encoded the function pointer into the machine code without making any modifications. Compiling the following Rust example to assembly using the compiler explorer provides us with a hint:
pub fn dummy() {}

pub fn main() {
    dummy();
}

example::dummy:
        ret
example::main:
        push rax
        call qword ptr [rip + example::dummy@GOTPCREL]
        pop rax
        ret
Note the seemingly useless push
and pop
before and after the function call. It turns out that these are required because the System V ABI specifies that the stack should be aligned to 16 bytes before a call instruction, so the callee can assume a specific alignment. This means that if the top of the stack has an address that is not 0
modulo 16
, we need to ensure that it is before we execute the call instruction.8
Note that our code is also executed using a call instruction, and remember that a call instruction pushes the return address onto the stack. As we are on x86-64, the return address is 64 bits, and because the stack had to be 16-byte aligned before the call, we are now misaligned by 8 bytes. We can fix this by pushing (and later popping) one more 64-bit value ourselves, which restores the 16-byte alignment. So we introduce helpers for push and pop and call them with a register such as rbx (note that these instructions push and pop 64-bit values):
fn emit_push(&mut self, register: Register) {
    self.emit(0x50 + register.to_integer());
}

fn emit_pop(&mut self, register: Register) {
    self.emit(0x58 + register.to_integer());
}
assembler.emit_push(Register::EBX);
// ...
assembler.emit_pop(Register::EBX);
If we now run our example again, we will get the expected result and our character will be echoed back to us.
This may also be a good time to pay attention to another “detail” of the System V ABI:
Functions preserve the registers rbx, rsp, rbp, r12, r13, r14, and r15; while rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11 are scratch registers.
This means that we need to make sure (if we ignore the numbered registers) the registers rbx
, rsp
and rbp
are the same when we return as they were when our function was called. So storing the pointer to the Brainfuck memory in ebx
, without restoring its value before returning, was actually illegal! But luckily, we fixed this as well by pushing rbx
at the start of our function and popping it at the end.
Loops
The remaining two operations are for looping:
[ | If the current cell is zero, jump to the instruction after the matching closing bracket |
] | If the current cell is nonzero, jump to the instruction after the matching opening bracket |
So how do we jump? We could, for example, use the following opcode:
po | mnemonic | op1 | op2 | description, notes |
---|---|---|---|---|
E9 | JMP | rel16/32 | | Jump |
An E9 opcode is followed by a 32-bit offset to jump by, relative to the instruction pointer; at that point the instruction pointer already points at the instruction following the currently executing one. That means the following shows how to skip the next two no-ops (0x90, an instruction that does nothing, or “no operation”, when executed):
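0xE9                ; Jump relative to the next instruction
0x02 0x00 0x00 0x00 ; The 4-byte relative offset (2)
0x90                ; A skipped no-op
0x90                ; A skipped no-op
...                 ; Execution continues here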
Recall that the non-zero byte of the 2
constant comes first, as x86-64 uses a little-endian encoding.
So jumping to another (nearby) piece of code is actually pretty straightforward, but this doesn’t yet allow us to encode the above Brainfuck instructions: how do we jump conditionally?
There are other opcodes to conditionally jump, all of which are two-byte opcodes: 0x0F
, followed by another opcode byte to specify the actual operation. The conditional jumps consist of 0x0F
followed by 0x80
through 0x8F
. For example, “jump near if zero/equal” is encoded as 0x0F84
. We’ll only need two of these (as described by our favorite reference):
po | o | mnemonic | op1 | op2 | description, notes |
---|---|---|---|---|---|
0F | 84 | JZ / JE | rel16/32 | Jump near if zero/equal (ZF=1) | |
0F | 85 | JNZ / JNE | rel16/32 | Jump near if not zero/equal (ZF=0) |
So to jump if a value is zero, we replace the 0xE9
from the previous example with 0x0F84
, and we have encoded a conditional jump. But what is the condition here exactly? It jumps if ZF=1
, but what does this mean?
All x86 processors contain a 32-bit EFLAGS
register, with each bit (flag) of the register having a specific meaning. One such flag is the “zero flag” (ZF
). If that flag is set (the bit is set to 1
), the jump of the 0x0F84
instruction would be executed. If the flag is not set, the instruction would be skipped.
What determines whether a flag is set? The flags that are relevant to us, called status flags, are set by arithmetic instructions. For example:
- The zero flag (ZF, bit 6) is set when the result is zero.
- The sign flag (SF, bit 7) is set when the result is negative (the flag equals the most significant bit of the result, which in a two’s complement representation is set if and only if the number is negative).
Next to these there are also a carry flag, a parity flag, an auxiliary carry flag, and an overflow flag; however, we will only need the zero flag.
As for what qualifies as an arithmetic instruction, these include add
, sub
, mul
, div
, inc
, and dec
. For all details about the EFLAGS
register and interactions with instructions, see the Intel software developer manuals.
To implement loop instructions, we can read the value pointed to from memory and jump based on its value. To set the flags based on the value, we can execute a bitwise or
with itself:
// Start of loop
assembler.emit_move_read(Register::ESI, Register::EBX);
assembler.emit_or(Register::ESI, Register::ESI);
assembler.emit_jump_z(&loop_end_label);
assembler.label(&loop_start_label);
// End of loop
assembler.emit_move_read(Register::ESI, Register::EBX);
assembler.emit_or(Register::ESI, Register::ESI);
assembler.emit_jump_nz(&loop_start_label);
assembler.label(&loop_end_label);
As you can see, we use a helper to mark a position with a label, and we pass a label to our jump helpers. We do this because manually calculating relative offsets while generating the machine code would be cumbersome. Instead, we make our Assembler keep track of jumps and labels: when emitting a jump, we emit a placeholder rather than the relative address, and after we have emitted all instructions, we go back and replace these placeholders with the actual values.
The Assembler
type now looks like this:
struct Assembler {
    instructions: Vec<u8>,
    /// The indices of labels.
    labels: HashMap<String, usize>,
    /// A tuple `(x, l)` denotes that the four bytes starting at index `x` should contain a
    /// relative reference to the position of label `l`.
    label_references: Vec<(usize, String)>,
}
And we add the following helpers:
fn emit_or(&mut self, destination: Register, source: Register) {
    self.emit(0x09);
    self.emit(0b11000000 + (source.to_integer() << 3) + destination.to_integer());
}

fn emit_jump_z(&mut self, label: &str) {
    self.emit(0x0F);
    self.emit(0x84);
    self.emit_label_reference(label);
}

fn emit_jump_nz(&mut self, label: &str) {
    self.emit(0x0F);
    self.emit(0x85);
    self.emit_label_reference(label);
}

fn emit_label_reference(&mut self, label: &str) {
    // Keep track of placeholders
    self.label_references
        .push((self.instructions.len(), label.to_string()));
    // Zero bytes as placeholder
    self.emit(0);
    self.emit(0);
    self.emit(0);
    self.emit(0);
}

fn label(&mut self, label: &str) {
    self.labels.insert(label.to_string(), self.instructions.len());
}

fn assemble(&mut self) {
    for (location, label) in &self.label_references {
        let label_location = self.labels[label];
        let relative = label_location as i32 - *location as i32 - 4;
        self.instructions[*location] = relative as u8;
        self.instructions[*location + 1] = (relative >> 8) as u8;
        self.instructions[*location + 2] = (relative >> 16) as u8;
        self.instructions[*location + 3] = (relative >> 24) as u8;
    }
}
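As a concrete example, in the disassembly excerpt at the end of this post a je at index 0x3F jumps to a label at index 0x159. Its two opcode bytes are followed by four placeholder bytes starting at index 0x41, so assemble patches in 0x159 - 0x41 - 4 = 0x114, which is stored little-endian as 14 01 00 00.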
Putting the pieces together
We now have all the necessary pieces to compile and execute a Brainfuck program:
fn compile_and_run(brainfuck: &str) {
    let instructions = parse(brainfuck);

    let mut memory = vec![0; 1024];
    let mut assembler = Assembler::new();
    let format_string = CString::new("%c").unwrap();

    assembler.emit_push(Register::EBX);
    assembler.emit_move_const(Register::EBX, memory.as_mut_ptr());
    for instruction in instructions {
        match instruction {
            Instruction::Forward => assembler.emit_add_const(Register::EBX, 4),
            Instruction::Backward => assembler.emit_add_const(Register::EBX, -4),
            Instruction::Increment => {
                assembler.emit_move_read(Register::EAX, Register::EBX);
                assembler.emit_inc(Register::EAX);
                assembler.emit_move_write(Register::EBX, Register::EAX);
            }
            Instruction::Decrement => {
                assembler.emit_move_read(Register::EAX, Register::EBX);
                assembler.emit_dec(Register::EAX);
                assembler.emit_move_write(Register::EBX, Register::EAX);
            }
            Instruction::LoopStart(label) => {
                let loop_start_label = format!("loop_start_{}", label);
                let loop_end_label = format!("loop_end_{}", label);
                assembler.emit_move_read(Register::ESI, Register::EBX);
                assembler.emit_or(Register::ESI, Register::ESI);
                assembler.emit_jump_z(&loop_end_label);
                assembler.label(&loop_start_label);
            }
            Instruction::LoopEnd(label) => {
                let loop_start_label = format!("loop_start_{}", label);
                let loop_end_label = format!("loop_end_{}", label);
                assembler.emit_move_read(Register::ESI, Register::EBX);
                assembler.emit_or(Register::ESI, Register::ESI);
                assembler.emit_jump_nz(&loop_start_label);
                assembler.label(&loop_end_label);
            }
            Instruction::Output => {
                assembler.emit_move_const(Register::EDI, format_string.as_ptr());
                assembler.emit_move_read(Register::ESI, Register::EBX);
                assembler.emit_move_const(
                    Register::EDX,
                    libc::printf as unsafe extern "C" fn(format: *const c_char, ...) -> c_int,
                );
                assembler.emit_call(Register::EDX);
            }
            Instruction::Input => {
                assembler.emit_move_const(
                    Register::EDX,
                    get_input as extern "C" fn() -> u8,
                );
                assembler.emit_call(Register::EDX);
                assembler.emit_move_write(Register::EBX, Register::EAX);
            }
        }
    }
    assembler.emit_pop(Register::EBX);
    assembler.emit_return();
    assembler.assemble();

    unsafe {
        let map = mmap(assembler.get_instructions());
        let f: unsafe extern "C" fn() -> i32 = mem::transmute(map.as_ptr());
        f();
    }
}
Note that the parse
function now returns LoopStart
and LoopEnd
values with a label (where the start and end of a loop have the same label). This is easier for us to generate machine code for and is actually also easier to generate (using a stack to keep track of the labels of the currently opened loops).
And indeed, the following program now prints the most famous phrase in computer science:
fn main() {
    let hello_world = "++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.";
    compile_and_run(hello_world);
}
Performance
If we now run our Mandelbrot example again, we don’t actually need the time
command to tell us it’s much faster than using the interpreter:
$ time target/release/example6
...
________________________________________________________
Executed in 2.68 secs fish external
usr time 2.68 secs 0.00 micros 2.68 secs
sys time 0.00 secs 700.00 micros 0.00 secs
So we achieved a >5x speedup compared to the interpreter’s 15s! And notably, we didn’t really optimize anything: we still execute individual forward and backward instructions, repeated increments and other inefficient sequences.
Optimization
While we won’t turn our compiler into an optimizing one here, it might be interesting to talk a bit about how you could go about it. However, you can probably skip this section if you have general knowledge about compilers.
Let’s look at a pretty (but not completely) random excerpt of the Mandelbrot program:
<<<<<<<<<<<<<<+>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]>>>>>
>]<<<<<<<<<[<<<<<<<<<]>>>>>[-]>>>>++++++++++++++[[>>>>>>>
>>]<<<<<<<<<-<<<<<<<<<[<<<<<<<<<]>>>>>>>>>-]+[>>>[-<<<->>
>]+<<<[->>>->[-<<<<+>>>>]<<<<[->>>>+<<<<<<<<<<<<<[<<<<<<<
<<]>>>>[-]+>>>>>[>>>>>>>>>]>+<]]+>>>>[-<
Look at all of those repeating instructions! For instance, the backward and forward instructions are compiled into separate add 4
or add -4
instructions, when they could be trivially represented using a single add
! The same holds for the increment and decrement operations, which currently each map to three x86-64 instructions.9
If we want to optimize this, we need to determine which representation of the program we want to optimize. Brainfuck does not have an instruction to encode adding 64
to the current value, so optimizing the Brainfuck code does not make much sense in this instance. We could optimize the generated machine code, but that’s extremely cumbersome. Optimizing while emitting the machine code is also guaranteed to turn into a mess.
So we need an intermediate representation (IR): instead of messing with representations which are not meant for it, we transform the input into an IR more susceptible to optimization. We have actually already used an IR: we parsed the Brainfuck string into a list of Instruction
values, and only then compiled these to machine code. We can add a similar IR in between for optimization:
/// A high-level instruction
enum HLInstruction {
    Increment,
    Decrement,
    Forward,
    Backward,
    Output,
    Input,
    LoopStart(usize),
    LoopEnd(usize),
    // Higher-level instructions
    Add(i32),  // used to represent multiple `Increment` / `Decrement` instructions
    Move(i32), // used to represent multiple `Forward` / `Backward` instructions
}
If we convert our list of Instruction
values to a list of HLInstruction
values, we can combine increment, decrement, forward and backward instructions into single add and move instructions. Conceptually, with Rust notation, we can implement the following transformations: 10
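// A sketch of the pipeline; the exact signatures are illustrative, but the
// names match the text below (and footnote 10).
fn transform(instructions: Vec<Instruction>) -> Vec<HLInstruction> {
    // Convert each Instruction into the equivalent HLInstruction, one by one.
    unimplemented!()
}

fn optimize(instructions: Vec<HLInstruction>) -> Vec<HLInstruction> {
    // Combine runs of increments/decrements and forward/backward moves.
    unimplemented!()
}

fn generate(instructions: Vec<HLInstruction>) -> Vec<u8> {
    // Emit machine code, much like our Assembler-based code already does.
    unimplemented!()
}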
In this case, the optimize
function turns the list of HLInstruction
values into a faster version (in this example meaning fewer values), for example:
vec![HLInstruction::Increment; 5] // 5 increment instructions
// becomes
vec![HLInstruction::Add(5)]
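Filling in the optimize stub from above, a minimal sketch of such a folding pass (not the post’s actual implementation; it only assumes the HLInstruction enum shown earlier) could look like this:

fn optimize(instructions: Vec<HLInstruction>) -> Vec<HLInstruction> {
    let mut result: Vec<HLInstruction> = Vec::new();
    for instruction in instructions {
        // Express +, -, > and < as a signed delta on the current cell (Add) or the pointer (Move).
        let folded = match instruction {
            HLInstruction::Increment => HLInstruction::Add(1),
            HLInstruction::Decrement => HLInstruction::Add(-1),
            HLInstruction::Forward => HLInstruction::Move(1),
            HLInstruction::Backward => HLInstruction::Move(-1),
            other => {
                // Everything else is kept as-is (and acts as a barrier for folding).
                result.push(other);
                continue;
            }
        };
        // Merge with the previous instruction if it is of the same kind.
        let merged = match (result.last_mut(), &folded) {
            (Some(HLInstruction::Add(n)), HLInstruction::Add(m)) => { *n += *m; true }
            (Some(HLInstruction::Move(n)), HLInstruction::Move(m)) => { *n += *m; true }
            _ => false,
        };
        if !merged {
            result.push(folded);
        }
    }
    result
}

A later pass (or generate itself) could additionally drop any resulting Add(0) or Move(0).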
The generate
function can then use more specific x86-64 instructions to compile the higher-level instructions. It is easy to imagine we could add more passes of different optimization functions that use different techniques to make the instructions more efficient.
This is actually an example of general compiler architecture: we have multiple phases transforming the input through IRs until the final output is reached. For example, many compilers and even interpreters (such as clang
, rustc
and Julia) use an IR from the LLVM project: LLVM IR. The advantage of using such a common IR is that you can compile to this IR and then apply countless existing optimizations. The optimization above could be part of a peephole optimization pass: optimizations applied locally to specific patterns.11
Final notes
We have seen how x86-64 machine code is actually still pretty understandable, or at least the parts we need to encode a few simple operations. It would be much more difficult if we wanted to add support for concurrency primitives or optimize better.
If you want to learn more about machine code generation, you should probably look up books on writing assemblers, as that’s pretty much what we have done (most compiler references teach how to produce assembly as output, not machine code). I haven’t read any, but this Stack Overflow answer has some references.
If you want to learn more about the other parts of the compiler (including code generation with a higher-level target), a seminal reference is Compilers: Principles, Techniques, and Tools (also known as the dragon book). And while I haven’t read it myself, I’ve read good things about Crafting Interpreters (kindly offered for free online, and look at the effort that goes into the illustrations!).
Appendix
Parsing
The final parse
version is pretty short. We use a stack to match opening and closing brackets.
fn parse(code: &str) -> Vec<Instruction> {
    let mut instructions = Vec::new();
    let chars: Vec<char> = code.chars().collect();
    let mut loop_stack = Vec::new();
    for (index, char) in chars.iter().enumerate() {
        instructions.push(match char {
            '+' => Instruction::Increment,
            '-' => Instruction::Decrement,
            '>' => Instruction::Forward,
            '<' => Instruction::Backward,
            '.' => Instruction::Output,
            ',' => Instruction::Input,
            '[' => {
                loop_stack.push(index);
                Instruction::LoopStart(index)
            }
            ']' => Instruction::LoopEnd(loop_stack.pop().unwrap()),
            _ => continue,
        });
    }
    instructions
}
Disassembling
We can easily disassemble our generated machine code to inspect it for errors using the iced-x86
crate by adding the following function to our Assembler
:
// Modified version of the code from the readme of the `iced-x86` crate.
fn print_disassembly(&self) {
    let mut decoder = Decoder::new(64, &self.instructions, DecoderOptions::NONE);

    static COLUMN_LENGTH: usize = 10;

    let mut formatter = NasmFormatter::new();
    formatter.options_mut().set_digit_separator("`");
    formatter
        .options_mut()
        .set_first_operand_char_index(COLUMN_LENGTH as _);

    let mut output = String::new();
    let mut instruction = iced_x86::Instruction::default();
    while decoder.can_decode() {
        decoder.decode_out(&mut instruction);

        output.clear();
        formatter.format(&instruction, &mut output);

        // Eg. "00007FFAC46ACDB2 488DAC2400FFFFFF lea rbp,[rsp-100h]"
        print!("{:016X} ", instruction.ip());
        let start_index = instruction.ip() as usize;
        let instr_bytes = &self.instructions[start_index..start_index + instruction.len()];
        for b in instr_bytes.iter() {
            print!("{:02X}", b);
        }
        if instr_bytes.len() < COLUMN_LENGTH {
            for _ in 0..COLUMN_LENGTH - instr_bytes.len() {
                print!(" ");
            }
        }
        println!(" {}", output);
    }
}
An excerpt of the output generated for the “Hello World!” example:
0000000000000035 8B03 mov eax,[rbx]
0000000000000037 FFC0 inc eax
0000000000000039 8903 mov [rbx],eax
000000000000003B 8B33 mov esi,[rbx]
000000000000003D 09F6 or esi,esi
000000000000003F 0F8414010000 je near 0000`0000`0000`0159h
0000000000000045 4881C304000000 add rbx,4
000000000000004C 8B03 mov eax,[rbx]
000000000000004E FFC0 inc eax
0000000000000050 8903 mov [rbx],eax
0000000000000052 8B03 mov eax,[rbx]
0000000000000054 FFC0 inc eax
0000000000000056 8903 mov [rbx],eax
Footnotes
-
The only library functions we will use are for input/output and to mark the generated machine code’s memory as executable so that we can actually run it. However, calling conventions may differ for other operating systems. ↩
-
However, you will need to look up the references and likely modify all machine code. There are also other things to keep in mind, for example, Apple silicon chips apparently don’t have instruction and data cache coherency; this means that you should call
sys_icache_invalidate
to prevent the processor from executing stale data when invoking the dynamically constructed machine code. ↩ -
It’s tempting to call this a JIT-compiler, but this usually means you also compile after the program has already started running. Julia also compiles and then executes on every run, and in that community, it’s referred to as a just-ahead-of-time compiler according to Wikipedia. ↩
-
If we wanted to write out an executable, we would also have to do memory allocation, input, and output ourselves; now we can defer this to our host language. ↩
-
The
parse
function simply converts the characters in the input intoInstruction
values one by one (ignoring non-instructions) and finds the end/start indices for loops by running forward/backwards counting brackets until the matching one is found. Depending on your definition of an interpreter, this also means the interpreter isn’t exactly 28 lines. Theparse
function we use for the interpreter is similar to the one in the appendix, except that matchingLoopStart
andLoopEnd
values contain each other’s indices instead of the same label. ↩ -
Little-endian means that the byte storing the least significant “digits” is placed at the position in memory with the lower address. When writing a number as a series of bytes, this means that the least significant digits come first. See Wikipedia for more. ↩
-
Rust separates the definition of data structures from the definition of accompanying functions, but functions taking a
self
parameter as first argument will still be available as methods on the data structure. ↩ -
One might wonder why this alignment is necessary. It appears to be for performance reasons. Having a certain stack alignment can be required for certain operations or can improve performance, but if a function does not start with a fixed alignment, it needs to conditionally align the stack (which is more expensive). ↩
-
Compressing a sequence of many identical operations using a single operation is actually an example of run-length encoding. ↩
-
The transform pass that simply transforms instructions into their equivalent high-level instruction could be easily combined with the optimization pass in this case, but we separate them to illustrate the more general case with multiple optimization passes and a less trivial IR. ↩
-
At first thought, you might think that all optimizations are of this form. However, many optimizations need more context. In the Brainfuck case, you might be able to detect that a specific memory cell is never used so that you can replace
>>
by>
and<<
by<
in some instances, but this requires much more information than looking at a few isolated instructions. ↩