Compilers & Parser Engineer — Build Programming Languages, Interpreters, VMs, JITs, and MLIR Pipelines From Scratch

"A compiler is a function from strings to behavior. Everything else is engineering."

A lab-based curriculum for becoming a senior compiler engineer by building the systems you'll one day extend, optimize, and ship: a tree-walking interpreter, a stack-based bytecode VM, an SSA-based optimizer, an LLVM native backend, an ORC JIT runtime, an MLIR dialect with progressive lowering, and a production runtime with a GC and FFI — all implemented from scratch in C++17/20 on macOS (Apple Silicon supported and verified).

Why This Repo Exists

Most engineers treat compilers as black boxes. This curriculum makes them transparent. You will:

  • Take a programming language from a flat character stream to a Mach-O executable.
  • Implement the classical IRs: AST, three-address code, control-flow graph, SSA, LLVM IR, MLIR dialects.
  • Understand the classical optimizations: constant folding, dead code elimination, mem2reg (promoting stack slots to SSA values), loop-invariant code motion.
  • Build runtime systems: stack frames, calling conventions, mark-sweep GC, C FFI.
  • Reason about hardware tradeoffs: ICache behavior of bytecode dispatchers, JIT compile-time vs run-time, register pressure, ABI boundaries.
  • Compare compilation strategies: tree-walker vs bytecode vs JIT vs AOT, the same spectrum that separates CPython's bytecode interpreter, the JIT compilers in V8 and the JVM, and Clang's ahead-of-time compilation.
  • Build the same language repeatedly with progressively more sophisticated backends to internalize design over syntax.
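
To make the strategy comparison concrete, here is a minimal sketch (not the repo's reference implementation) that runs one arithmetic expression through two backends: a recursive tree-walker and a tiny stack-based bytecode VM. The names `Expr`, `Instr`, `lit`, `bin`, `evalTree`, `compile`, and `runVM` are hypothetical and chosen only for illustration; the sketch assumes integer operands and the three operators `+`, `-`, `*`.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical AST node: a literal when op == '\0', otherwise a binary operation.
struct Expr {
    char op = '\0';
    std::int64_t value = 0;            // used only by literals
    std::unique_ptr<Expr> lhs, rhs;    // used only by binary operations
};

std::unique_ptr<Expr> lit(std::int64_t v) {
    auto e = std::make_unique<Expr>();
    e->value = v;
    return e;
}

std::unique_ptr<Expr> bin(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->op = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Shared arithmetic so both backends agree on semantics.
std::int64_t apply(char op, std::int64_t a, std::int64_t b) {
    switch (op) {
        case '+': return a + b;
        case '-': return a - b;
        default:  return a * b;        // '*' (the only other operator in this sketch)
    }
}

// Strategy 1: tree-walking interpreter. Recurses over the AST on every evaluation.
std::int64_t evalTree(const Expr& e) {
    if (e.op == '\0') return e.value;
    return apply(e.op, evalTree(*e.lhs), evalTree(*e.rhs));
}

// Strategy 2: compile once to flat stack bytecode, then run a dispatch loop.
struct Instr {
    char op;                           // 'P' pushes operand; '+', '-', '*' pop two and push one
    std::int64_t operand;
};

// Post-order traversal: operands are emitted before the operator that consumes them.
void compile(const Expr& e, std::vector<Instr>& code) {
    if (e.op == '\0') {
        code.push_back({'P', e.value});
        return;
    }
    compile(*e.lhs, code);
    compile(*e.rhs, code);
    code.push_back({e.op, 0});
}

std::int64_t runVM(const std::vector<Instr>& code) {
    std::vector<std::int64_t> stack;
    for (const Instr& in : code) {
        if (in.op == 'P') {
            stack.push_back(in.operand);
            continue;
        }
        assert(stack.size() >= 2);     // a binary operator needs two operands
        std::int64_t b = stack.back(); stack.pop_back();
        std::int64_t a = stack.back(); stack.pop_back();
        stack.push_back(apply(in.op, a, b));
    }
    assert(stack.size() == 1);         // well-formed code leaves exactly one result
    return stack.back();
}
```

For `1 + 2 * 3`, built as `bin('+', lit(1), bin('*', lit(2), lit(3)))`, both `evalTree` and `runVM` yield 7, and `compile` emits five instructions (three pushes, a multiply, an add). The tree-walker chases pointers on every evaluation, while the VM iterates over a flat array. That difference in layout and dispatch is the kind of tradeoff the strategy comparison above is about.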

Curriculum at a Glance

Phase     | Theme                                    | Labs
----------|------------------------------------------|---------------
1         | Frontend Foundations                     | cp-01 to cp-03
2         | Static Semantics & Type Systems          | cp-04 to cp-05
3         | Bytecode Virtual Machines                | cp-06 to cp-07
4         | Compiler Middle-End (IR & Optimization)  | cp-08 to cp-09
5         | LLVM Backend (Industry Core)             | cp-10 to cp-11
6         | JIT Compilation (LLVM ORC)               | cp-12
7         | MLIR (Multi-Level IR)                    | cp-13
8         | Runtime Systems (GC, ABI, FFI)           | cp-14
9         | Tooling & Diagnostics                    | cp-15
Capstones | Production-grade demonstrations          | cp-16 to cp-18

See PHASES.md for the full breakdown with learning objectives per lab.

How To Use This Repo

  1. Read TOOLS.md and complete cp-01-environment-setup/. This is the mandatory first step — even if you already have the tools, the verification process teaches you what each tool does and why.
  2. Move through the labs in order. Each lab is self-contained and has the same shape:
    cp-NN-<name>/
    ├── CONCEPTS.md       # The "why" — read this FIRST (8-part framework)
    ├── references.md     # Papers, source-code links, suggested reading
    ├── docs/
    │   ├── analysis.md       # Design tradeoffs (performance, engineering)
    │   ├── broader-ideas.md  # Extensions, alternatives, where this goes in production
    │   ├── execution.md      # Toolchain versions + quick-start commands
    │   ├── observation.md    # How to debug, profile, and inspect output
    │   └── verification.md   # Pass/fail checks for your implementation
    ├── steps/            # Numbered, sequential implementation guides
    │   ├── 01-*.md
    │   ├── 02-*.md
    │   └── ...
    └── src/cpp/          # CMake project — reference implementation
    
  3. Read CONCEPTS.md first, then work steps/ in order. The reference code in src/cpp/ is a target — try to write your own first, then compare.
  4. Run the checks in docs/verification.md before moving on. If anything fails, see docs/observation.md for debugging guidance.

What You Will Build

By the end of the curriculum you will have implemented:

  • A complete arithmetic evaluator with a hand-written lexer, recursive-descent parser, AST, and tree-walking interpreter.
  • A full MiniLang v0 frontend with Pratt parsing, variables, control flow, functions, closures, and an interactive REPL.
  • A symbol-table-driven semantic analyzer with lexical scoping and a Hindley-Milner-lite static type system.
  • A stack-based bytecode VM with a hand-tuned dispatch loop, instruction encoding, and a disassembler.
  • A three-address-code SSA IR with a control-flow-graph builder, dominator computation, and classical optimization passes (constant folding, DCE, mem2reg).
  • An LLVM native code generator that produces optimized Mach-O executables for Apple Silicon.
  • An ORC JIT engine for lazy on-demand compilation with function caching.
  • A custom MLIR dialect (minilang) with progressive lowering to the LLVM dialect.
  • A production runtime with stack frames, a mark-sweep garbage collector, and a C FFI.
  • A complete compiler CLI toolchain with diagnostic spans, source-pointing errors, and module support.
  • Three capstone projects stitching everything together.
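
As a taste of the middle-end work in that list, the following is a minimal sketch of constant folding with constant propagation over three-address code. It is not the repo's reference implementation: the `Operand` and `Tac` types and the `constant`, `temp`, and `foldConstants` helpers are hypothetical names invented for this example. It assumes straight-line code in which every name is assigned exactly once (as in SSA), and it ignores signed-overflow concerns.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical operand: either an integer constant or a named temporary.
struct Operand {
    bool isConst;
    std::int64_t value;    // meaningful when isConst is true
    std::string name;      // meaningful when isConst is false
};

// Hypothetical three-address instruction: dst = lhs op rhs, or dst = lhs when op == '='.
struct Tac {
    std::string dst;
    char op;               // '=', '+', '-', '*'
    Operand lhs, rhs;      // rhs is ignored for '='
};

Operand constant(std::int64_t v) { return {true, v, ""}; }
Operand temp(std::string n) { return {false, 0, std::move(n)}; }

// One forward pass: substitute names with known constant values, then
// evaluate any instruction whose operands are both constants.
std::vector<Tac> foldConstants(std::vector<Tac> code) {
    std::unordered_map<std::string, std::int64_t> known;  // name -> constant value

    auto resolve = [&known](Operand& o) {
        if (o.isConst) return;
        auto it = known.find(o.name);
        if (it != known.end()) o = constant(it->second);
    };

    for (Tac& t : code) {
        assert(t.op == '=' || t.op == '+' || t.op == '-' || t.op == '*');
        resolve(t.lhs);
        if (t.op == '=') {
            // A copy of a constant makes dst a known constant too.
            if (t.lhs.isConst) known[t.dst] = t.lhs.value;
            continue;
        }
        resolve(t.rhs);
        if (t.lhs.isConst && t.rhs.isConst) {
            const std::int64_t a = t.lhs.value;
            const std::int64_t b = t.rhs.value;
            const std::int64_t r = t.op == '+' ? a + b
                                 : t.op == '-' ? a - b
                                               : a * b;
            // Rewrite the instruction as a plain constant assignment.
            t.op = '=';
            t.lhs = constant(r);
            t.rhs = constant(0);
            known[t.dst] = r;
        }
    }
    return code;
}
```

Given `t0 = 2 * 3`, `t1 = t0 + 4`, `t2 = x - t1`, the pass rewrites the sequence to `t0 = 6`, `t1 = 10`, `t2 = x - 10`. After folding, `t0` and `t1` are no longer read by any instruction, which is exactly the situation a subsequent dead code elimination pass would clean up.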

Prerequisites

  • Comfort reading and writing modern C++ (C++17/20); the curriculum assumes you can write a class and use STL containers.
  • Familiarity with trees, recursion, and basic data structures (graphs, hash tables).
  • Basic command-line and git.
  • Not required: prior compiler, LLVM, parsing, or type-system knowledge. We build it all from the ground up.

Pedagogical Style

Modeled after distributed-systems-engineer/ — every CONCEPTS.md follows the same 8-part framework:

  1. What Is It — one-paragraph executive summary
  2. Why It Matters — concrete benefits
  3. How It Works — ASCII architecture diagram
  4. Core Terminology — table of precise definitions
  5. Mental Models — analogies for intuition
  6. Common Misconceptions — myths corrected
  7. Interview Talking Points — what to say in a senior compiler interview
  8. Connections to Other Labs — how this fits the bigger picture

Every lab produces observable, runnable, testable output. No pseudo-code, no hand-waving, no abstract-only sections.

Status

Phase                         | Status
------------------------------|----------------------------------------------------
Phase 1: Frontend Foundations | cp-01 complete · cp-02 complete · cp-03 scaffolded
Phase 2: Static Semantics     | Scaffolded
Phase 3: Bytecode VMs         | Scaffolded
Phase 4: IR & Optimization    | Scaffolded
Phase 5: LLVM Backend         | Scaffolded
Phase 6: JIT                  | Scaffolded
Phase 7: MLIR                 | Scaffolded
Phase 8: Runtime              | Scaffolded
Phase 9: Tooling              | Scaffolded
Capstones                     | Scaffolded

See PHASES.md for per-lab status.

License

MIT — see source headers in each implementation.