Compilers & Parser Engineer — Build Programming Languages, Interpreters, VMs, JITs, and MLIR Pipelines From Scratch
"A compiler is a function from strings to behavior. Everything else is engineering."
A lab-based curriculum for becoming a senior compiler engineer by building the systems you'll one day extend, optimize, and ship: a tree-walking interpreter, a stack-based bytecode VM, an SSA-based optimizer, an LLVM native backend, an ORC JIT runtime, an MLIR dialect with progressive lowering, and a production runtime with a GC and FFI — all implemented from scratch in C++17/20 on macOS (Apple Silicon supported and verified).
Why This Repo Exists
Most engineers treat compilers as black boxes. This curriculum makes them transparent. You will:
- Take a programming language all the way from a flat character stream to a Mach-O executable.
- Implement every classical IR: AST, three-address code, control-flow graph, SSA, LLVM IR, MLIR dialects.
- Understand the classical optimizations: constant folding, dead code elimination, mem2reg, loop-invariant code motion.
- Build runtime systems: stack frames, calling conventions, mark-sweep GC, C FFI.
- Reason about hardware tradeoffs: ICache behavior of bytecode dispatchers, JIT compile-time vs run-time, register pressure, ABI boundaries.
- Compare compilation strategies: tree-walker vs bytecode vs JIT vs AOT, the same spectrum of execution models that separates bytecode interpreters such as CPython, JIT-based runtimes such as V8 and the JVM, and AOT compilers such as Clang.
- Build the same language repeatedly with progressively more sophisticated backends to internalize design over syntax.
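To make the "bytecode" rung of that ladder concrete, here is a minimal sketch of a stack-based VM with a `switch` dispatch loop. The opcode names, the one-byte immediate encoding, and the `run` and `demo_program` functions are assumptions made for illustration; they are not taken from the labs, which define their own instruction set and may use a different dispatch technique.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical opcode set for illustration only.
enum Op : std::uint8_t { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

// Executes a flat byte stream. OP_PUSH is followed by a one-byte immediate.
std::int64_t run(const std::vector<std::uint8_t>& code) {
    std::vector<std::int64_t> stack;
    std::size_t pc = 0;  // index of the next byte to decode
    auto pop = [&stack]() {
        if (stack.empty()) throw std::runtime_error("stack underflow");
        const std::int64_t v = stack.back();
        stack.pop_back();
        return v;
    };
    for (;;) {
        if (pc >= code.size()) throw std::runtime_error("ran off the end of the code");
        switch (code[pc++]) {  // fetch, decode, dispatch
            case OP_PUSH:
                if (pc >= code.size()) throw std::runtime_error("missing immediate");
                stack.push_back(code[pc++]);
                break;
            case OP_ADD: { const auto b = pop(); const auto a = pop(); stack.push_back(a + b); break; }
            case OP_MUL: { const auto b = pop(); const auto a = pop(); stack.push_back(a * b); break; }
            case OP_HALT: {
                const std::int64_t result = pop();
                assert(stack.empty());  // debug-only invariant: exactly one value is left at halt
                return result;
            }
            default:
                throw std::runtime_error("bad opcode");
        }
    }
}

// Bytecode for (2 + 3) * 4, written by hand.
std::vector<std::uint8_t> demo_program() {
    return {OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT};
}
```

Calling `run(demo_program())` evaluates `(2 + 3) * 4`. A tree-walker would compute the same value by chasing pointers through an AST, whereas this loop reads a contiguous byte array, which is where the dispatch-cost and instruction-cache questions listed above come from.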
Curriculum at a Glance
| Phase | Theme | Labs |
|---|---|---|
| 1 | Frontend Foundations | cp-01 … cp-03 |
| 2 | Static Semantics & Type Systems | cp-04 … cp-05 |
| 3 | Bytecode Virtual Machines | cp-06 … cp-07 |
| 4 | Compiler Middle-End (IR & Optimization) | cp-08 … cp-09 |
| 5 | LLVM Backend (Industry Core) | cp-10 … cp-11 |
| 6 | JIT Compilation (LLVM ORC) | cp-12 |
| 7 | MLIR (Multi-Level IR) | cp-13 |
| 8 | Runtime Systems (GC, ABI, FFI) | cp-14 |
| 9 | Tooling & Diagnostics | cp-15 |
| Capstones | Production-grade demonstrations | cp-16 … cp-18 |
See PHASES.md for the full breakdown with learning objectives per lab.
How To Use This Repo
- Read TOOLS.md and complete `cp-01-environment-setup/`. This is the mandatory first step: even if you already have the tools, the verification process teaches you what each tool does and why.
- Move through the labs in order. Each lab is self-contained and has the same shape:

      cp-NN-<name>/
      ├── CONCEPTS.md            # The "why": read this FIRST (8-part framework)
      ├── references.md          # Papers, source-code links, suggested reading
      ├── docs/
      │   ├── analysis.md        # Design tradeoffs (performance, engineering)
      │   ├── broader-ideas.md   # Extensions, alternatives, where this goes in production
      │   ├── execution.md       # Toolchain versions + quick-start commands
      │   ├── observation.md     # How to debug, profile, and inspect output
      │   └── verification.md    # Pass/fail checks for your implementation
      ├── steps/                 # Numbered, sequential implementation guides
      │   ├── 01-*.md
      │   ├── 02-*.md
      │   └── ...
      └── src/cpp/               # CMake project, reference implementation

- Read `CONCEPTS.md` first, then work through `steps/` in order. The reference code in `src/cpp/` is a target: try to write your own first, then compare.
- Run the checks in `docs/verification.md` before moving on. If anything fails, see `docs/observation.md` for debugging guidance.
What You Will Build
By the end of the curriculum you will have implemented:
- A complete arithmetic evaluator with a hand-written lexer, recursive-descent parser, AST, and tree-walking interpreter.
- A full MiniLang v0 frontend with Pratt parsing, variables, control flow, functions, closures, and an interactive REPL.
- A symbol-table-driven semantic analyzer with lexical scoping and a Hindley-Milner-lite static type system.
- A stack-based bytecode VM with a hand-tuned dispatch loop, instruction encoding, and a disassembler.
- A three-address-code SSA IR with a control-flow-graph builder, dominator computation, and classical optimization passes (constant folding, DCE, mem2reg).
- An LLVM native code generator that produces optimized Mach-O executables for Apple Silicon.
- An ORC JIT engine for lazy on-demand compilation with function caching.
- A custom MLIR dialect (`minilang`) with progressive lowering to the LLVM dialect.
- A production runtime with stack frames, a mark-sweep garbage collector, and a C FFI.
- A complete compiler CLI toolchain with diagnostic spans, source-pointing errors, and module support.
- Three capstone projects stitching everything together.
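As a rough picture of the first item in that list, the sketch below wires a hand-written lexer, a recursive-descent parser, an AST, and a tree-walking interpreter into one small arithmetic evaluator. Every name in it (`Token`, `lex`, `Parser`, `Expr`, `eval`, `evaluate`) and the grammar itself are assumptions chosen for illustration, not the labs' reference implementation; it handles only unsigned integer literals, `+ - * /`, and parentheses.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// All names below are hypothetical, not taken from the labs' reference code.
constexpr char kNum = 'n', kEnd = '\0';
struct Token { char kind; double value; };  // kind: kNum, kEnd, or an operator char

// Lexer: flat character stream -> tokens (unsigned integer literals only).
std::vector<Token> lex(const std::string& src) {
    const std::string ops = "+-*/()";
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (std::isdigit(static_cast<unsigned char>(c))) {
            double v = 0;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                v = v * 10 + (src[i++] - '0');
            out.push_back({kNum, v});
            continue;
        }
        if (ops.find(c) == std::string::npos) throw std::runtime_error("unexpected character");
        out.push_back({c, 0.0});
        ++i;
    }
    out.push_back({kEnd, 0.0});
    return out;
}

struct Expr;
using ExprPtr = std::unique_ptr<Expr>;
struct Expr { char op; double value; ExprPtr lhs, rhs; };  // op: kNum or + - * /
ExprPtr make(char op, double v, ExprPtr l = nullptr, ExprPtr r = nullptr) {
    return ExprPtr(new Expr{op, v, std::move(l), std::move(r)});
}

class Parser {
public:
    explicit Parser(std::vector<Token> toks) : toks_(std::move(toks)) {
        assert(!toks_.empty() && toks_.back().kind == kEnd);  // lex() always appends kEnd
    }
    ExprPtr parse() {
        ExprPtr e = expr();
        if (peek() != kEnd) throw std::runtime_error("trailing input");
        return e;
    }
private:
    char peek() const { return toks_[pos_].kind; }
    ExprPtr expr() {  // expr := term (('+' | '-') term)*
        ExprPtr lhs = term();
        while (peek() == '+' || peek() == '-') {
            const char op = toks_[pos_++].kind;
            ExprPtr rhs = term();
            lhs = make(op, 0, std::move(lhs), std::move(rhs));
        }
        return lhs;
    }
    ExprPtr term() {  // term := primary (('*' | '/') primary)*
        ExprPtr lhs = primary();
        while (peek() == '*' || peek() == '/') {
            const char op = toks_[pos_++].kind;
            ExprPtr rhs = primary();
            lhs = make(op, 0, std::move(lhs), std::move(rhs));
        }
        return lhs;
    }
    ExprPtr primary() {  // primary := NUMBER | '(' expr ')'
        if (peek() == kNum) return make(kNum, toks_[pos_++].value);
        if (peek() != '(') throw std::runtime_error("expected number or '('");
        ++pos_;
        ExprPtr e = expr();
        if (peek() != ')') throw std::runtime_error("expected ')'");
        ++pos_;
        return e;
    }
    std::vector<Token> toks_;
    std::size_t pos_ = 0;
};

// Tree-walking interpreter: evaluate by recursing over the AST.
double eval(const Expr& e) {
    switch (e.op) {
        case kNum: return e.value;
        case '+': return eval(*e.lhs) + eval(*e.rhs);
        case '-': return eval(*e.lhs) - eval(*e.rhs);
        case '*': return eval(*e.lhs) * eval(*e.rhs);
        case '/': return eval(*e.lhs) / eval(*e.rhs);
    }
    throw std::logic_error("unknown node kind");
}
double evaluate(const std::string& src) { return eval(*Parser(lex(src)).parse()); }
```

With this grammar, `evaluate("1 + 2 * 3")` parses the multiplication below the addition, so precedence falls out of which rule calls which, and the `while` loops make both operator levels left-associative. Pratt parsing, listed in the next item, is a different way to get the same result from a table of binding powers instead of one function per precedence level.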
Prerequisites
- Comfortable reading and writing modern C++17/20 (the curriculum assumes you can write a class and use STL containers).
- Familiarity with trees, recursion, and basic data structures (graphs, hash tables).
- Basic command-line and `git`.
- Not required: prior compiler, LLVM, parsing, or type-system knowledge. We build it all from the ground up.
Pedagogical Style
Modeled after distributed-systems-engineer/ — every CONCEPTS.md follows the same 8-part framework:
- What Is It — one-paragraph executive summary
- Why It Matters — concrete benefits
- How It Works — ASCII architecture diagram
- Core Terminology — table of precise definitions
- Mental Models — analogies for intuition
- Common Misconceptions — myths corrected
- Interview Talking Points — what to say in a senior compiler interview
- Connections to Other Labs — how this fits the bigger picture
Every lab produces observable, runnable, testable output. No pseudo-code, no hand-waving, no abstract-only sections.
Status
| Phase | Status |
|---|---|
| Phase 1 — Frontend Foundations | cp-01 complete · cp-02 complete · cp-03 scaffolded |
| Phase 2 — Static Semantics | Scaffolded |
| Phase 3 — Bytecode VMs | Scaffolded |
| Phase 4 — IR & Optimization | Scaffolded |
| Phase 5 — LLVM Backend | Scaffolded |
| Phase 6 — JIT | Scaffolded |
| Phase 7 — MLIR | Scaffolded |
| Phase 8 — Runtime | Scaffolded |
| Phase 9 — Tooling | Scaffolded |
| Capstones | Scaffolded |
See PHASES.md for per-lab status.
License
MIT — see source headers in each implementation.