How the Rust Compiler Compiles Itself

If you look at the source code for the Rust compiler, you notice something strange: the compiler itself is written in Rust.

To compile Rust code, you need the Rust compiler. But to build the Rust compiler, you also need a Rust compiler. It is the classic chicken-and-egg problem. And yet, this is normal. Python is mostly written in Python. Go’s compiler is written in Go.

This is called self-hosting, but to understand how it works, you have to face a paradox that most programmers never consider.

The Bootstrapping Problem ¶

Self-hosting depends on a process called bootstrapping. You do not need the language to write itself from scratch. You just need to get it started, far enough that it can take over.

Rust’s origin story shows how this works. The first Rust compiler was written in OCaml. The team used the OCaml compiler to build a simple Rust compiler. Then they rewrote the compiler in Rust, used the OCaml-built version to compile it, and got a pure Rust binary. After that, the OCaml code was thrown away. Rust had pulled itself up by its own bootstraps.

After that first leap, the process becomes self-sustaining.

You have rustc (a binary) and src.rs (the compiler’s source code).
You feed src.rs into rustc.
You get a new rustc binary with all the latest changes baked in.
Repeat forever.

The compiler becomes its own ancestor. Each generation builds the next.

If you follow this chain back through history, before OCaml, before C, before FORTRAN, you end up with people sitting at machines, flipping switches, and punching programs onto cards by hand. Each generation built on the last, adding new layers of abstraction, one at a time.

Building a Self-Hosted Language From Scratch ¶

It’s one thing to read about self-hosting. Building it yourself is another matter. Let’s try making the smallest self-hosted language we can.

We’ll call it Q. It has just one command: the letter q. When the compiler sees q, it prints its own source code.

Our compiler will be written in C.

The goal is simple: write a C program that reads Q files and outputs C code that does the same thing.

The Paradox ¶

Here’s where things get interesting. To output our own source code, we might try something like this:

if (*s == 'q') {
  printf("#include <stdio.h>\nint main() { ... printf(\"...wait, what goes here?\"); }");
}

But what do you put inside the printf? The printf itself. Which would contain another printf, and so on, forever.

This is the quine paradox. How do you write a program that prints itself, without getting stuck in an endless loop of self-reference?

The Anchor Trick ¶

The solution is to separate data from logic. You need an anchor.

We introduce a string variable called self that holds the entire source code of our program:

const char *self = "...the entire program, escaped as a string...";

Now, when we see a q, we don’t reproduce logic, we just print data:

if (*s == 'q') {
  printf("%s", self);
}

But there is still a catch. If you just print self, you print the string’s contents. You do not print the const char *self = "..."; declaration that wraps it. To reproduce itself completely, the program has to print self twice: once as raw code, and once as an escaped string literal.

We handle this with an anchor character, a question mark, placed exactly where the string literal belongs:

const char *self = "?";

When outputting itself, the compiler walks through self character by character. Most characters get printed as-is. When it hits the ?, it prints the entire contents of self with quotes and backslashes properly escaped.

if (self[j] == 63) { // 63 = ASCII for '?'
  for (int i = 0; i < n; ++i) {
    switch (self[i]) {
      case '\n': printf("\\n"); break;
      case '"': printf("\\\""); break;
      case '\\': printf("\\\\"); break;
      default: printf("%c", self[i]);
    }
  }
} else {
  printf("%c", self[j]);
}

Why use 63 instead of just writing a question mark? If we wrote a literal question mark, it would trigger the anchor logic when we don’t want it to. So we use its ASCII value instead. This is a classic trick for hiding a character from your own parser.

The final step: paste the escaped source code blob into self where the ? is. Compile. Feed the binary a file containing a single q. Capture the output as q1.c. Then run:

Bash

diff q.c q1.c

There is no output or differences. The compiler has reproduced itself exactly.

The Thompson Hack ¶

Now, here’s an interesting consequence of a self-hosted compiler.

In 1984, Ken Thompson, co-creator of Unix, gave a Turing Award lecture titled Reflections on Trusting Trust. In it, he described a theoretical backdoor so elegant and so invisible that it’s been tormenting security researchers ever since.

Imagine you want to backdoor Unix’s login program, letting you bypass password checks. You could modify login.c, but a code reviewer would find it. So instead, you modify the C compiler. You add a rule: if the file being compiled is login.c, silently inject the backdoor into the output binary.

Now login.c looks clean. But every compiled login binary is compromised.

There’s still a problem: your malicious rule is sitting in the compiler’s source code, where a reviewer could find it. So Thompson proposed a second modification: if the file being compiled is the compiler itself, inject both rules, the login backdoor, and this self-replication rule, into the new binary.

Compile this poisoned compiler once. Then delete the malicious source code.

What you’re left with:

login.c - completely clean.
The compiler’s source code - completely clean.
Every binary compiled by this compiler - compromised.

The backdoor now lives only in the binary. It’s completely separated from the source code. No code review will ever find it. You would have to take apart the compiler binary itself

What This Actually Means ¶

There’s a real lesson hiding in all of this.

When you are starting out, language choice feels important. Python or C, Rust or Go, these debates eat up a lot of energy. But once you have built a self-hosting compiler, even a simple one, the argument falls apart. After you have seen the same computation run in C, then Python, then across 128 languages in a row, the syntax starts to look like what it really is: a human-readable skin over the same machine underneath.

The Thompson Hack shows the same thing, but from a darker angle. The binary is the ground truth. Source code is just a story we tell ourselves about what the binary does. Sometimes, those two things drift far apart.

To understand self-hosting is to realize that the tools you use to build software are themselves software, built by earlier tools, passed down through generations of compiled binaries, all the way back to people punching cards by hand. This is the real nature of the stack you are standing on.