Compilers & Parser Engineer — Build Programming Languages, Interpreters, VMs, JITs, and MLIR Pipelines From Scratch
"A compiler is a function from strings to behavior. Everything else is engineering."
A lab-based curriculum for becoming a senior compiler engineer by building the systems you'll one day extend, optimize, and ship: a tree-walking interpreter, a stack-based bytecode VM, an SSA-based optimizer, an LLVM native backend, an ORC JIT runtime, an MLIR dialect with progressive lowering, and a production runtime with a GC and FFI — all implemented from scratch in C++17/20 on macOS (Apple Silicon supported and verified).
Why This Repo Exists
Most engineers treat compilers as black boxes. This curriculum makes them transparent. You will:
- Write a programming language from a flat character stream to a Mach-O executable.
- Implement every classical IR: AST, three-address code, control-flow graph, SSA, LLVM IR, MLIR dialects.
- Understand every classical optimization: constant folding, dead code elimination, mem2reg, loop invariant hoisting.
- Build runtime systems: stack frames, calling conventions, mark-sweep GC, C FFI.
- Reason about hardware tradeoffs: ICache behavior of bytecode dispatchers, JIT compile-time vs run-time, register pressure, ABI boundaries.
- Compare compilation strategies: tree-walker vs bytecode vs JIT vs AOT, the same fundamental performance hierarchy that distinguishes Python, V8, JVM, and Clang.
- Build the same language repeatedly with progressively more sophisticated backends to internalize design over syntax.
Curriculum at a Glance
| Phase | Theme | Labs |
|---|---|---|
| 1 | Frontend Foundations | cp-01 … cp-03 |
| 2 | Static Semantics & Type Systems | cp-04 … cp-05 |
| 3 | Bytecode Virtual Machines | cp-06 … cp-07 |
| 4 | Compiler Middle-End (IR & Optimization) | cp-08 … cp-09 |
| 5 | LLVM Backend (Industry Core) | cp-10 … cp-11 |
| 6 | JIT Compilation (LLVM ORC) | cp-12 |
| 7 | MLIR (Multi-Level IR) | cp-13 |
| 8 | Runtime Systems (GC, ABI, FFI) | cp-14 |
| 9 | Tooling & Diagnostics | cp-15 |
| Capstones | Production-grade demonstrations | cp-16 … cp-18 |
See PHASES.md for the full breakdown with learning objectives per lab.
How To Use This Repo
- Read TOOLS.md and complete cp-01-environment-setup/. This is the mandatory first step: even if you already have the tools, the verification process teaches you what each tool does and why.
- Move through the labs in order. Each lab is self-contained and has the same shape:
cp-NN-<name>/
├── CONCEPTS.md          # The "why": read this FIRST (8-part framework)
├── references.md        # Papers, source-code links, suggested reading
├── docs/
│   ├── analysis.md      # Design tradeoffs (performance, engineering)
│   ├── broader-ideas.md # Extensions, alternatives, where this goes in production
│   ├── execution.md     # Toolchain versions + quick-start commands
│   ├── observation.md   # How to debug, profile, and inspect output
│   └── verification.md  # Pass/fail checks for your implementation
├── steps/               # Numbered, sequential implementation guides
│   ├── 01-*.md
│   ├── 02-*.md
│   └── ...
└── src/cpp/             # CMake project: reference implementation
- Read CONCEPTS.md first, then work steps/ in order. The reference code in src/cpp/ is a target: try to write your own first, then compare.
- Run the checks in docs/verification.md before moving on. If anything fails, see docs/observation.md for debugging guidance.
What You Will Build
By the end of the curriculum you will have implemented:
- A complete arithmetic evaluator with a hand-written lexer, recursive-descent parser, AST, and tree-walking interpreter.
- A full MiniLang v0 frontend with Pratt parsing, variables, control flow, functions, closures, and an interactive REPL.
- A symbol-table-driven semantic analyzer with lexical scoping and a Hindley-Milner-lite static type system.
- A stack-based bytecode VM with a hand-tuned dispatch loop, instruction encoding, and a disassembler.
- A three-address-code SSA IR with a control-flow-graph builder, dominator computation, and classical optimization passes (constant folding, DCE, mem2reg).
- An LLVM native code generator that produces optimized Mach-O executables for Apple Silicon.
- An ORC JIT engine for lazy on-demand compilation with function caching.
- A custom MLIR dialect (minilang) with progressive lowering to the LLVM dialect.
- A production runtime with stack frames, a mark-sweep garbage collector, and a C FFI.
- A complete compiler CLI toolchain with diagnostic spans, source-pointing errors, and module support.
- Three capstone projects stitching everything together.
Prerequisites
- Comfortable reading and writing modern C++17/20 (the curriculum assumes you can write a class and use STL containers).
- Familiarity with trees, recursion, and basic data structures (graphs, hash tables).
- Basic command-line and git.
- Not required: prior compiler, LLVM, parsing, or type-system knowledge. We build it all from the ground up.
Pedagogical Style
Modeled after distributed-systems-engineer/ — every CONCEPTS.md follows the same 8-part framework:
- What Is It — one-paragraph executive summary
- Why It Matters — concrete benefits
- How It Works — ASCII architecture diagram
- Core Terminology — table of precise definitions
- Mental Models — analogies for intuition
- Common Misconceptions — myths corrected
- Interview Talking Points — what to say in a senior compiler interview
- Connections to Other Labs — how this fits the bigger picture
Every lab produces observable, runnable, testable output. No pseudo-code, no hand-waving, no abstract-only sections.
Status
| Phase | Status |
|---|---|
| Phase 1 — Frontend Foundations | cp-01 complete · cp-02 complete · cp-03 scaffolded |
| Phase 2 — Static Semantics | Scaffolded |
| Phase 3 — Bytecode VMs | Scaffolded |
| Phase 4 — IR & Optimization | Scaffolded |
| Phase 5 — LLVM Backend | Scaffolded |
| Phase 6 — JIT | Scaffolded |
| Phase 7 — MLIR | Scaffolded |
| Phase 8 — Runtime | Scaffolded |
| Phase 9 — Tooling | Scaffolded |
| Capstones | Scaffolded |
See PHASES.md for per-lab status.
License
MIT — see source headers in each implementation.
cp-01 — Environment Setup & Toolchain
Install and verify the C++/LLVM toolchain. Mandatory first lab — every other lab depends on these tools and the concepts taught here.
Read First
- CONCEPTS.md: the 8-part framework, covering toolchains, target triples, Apple Clang vs upstream LLVM, Mach-O vs ELF, sysroots, CMake.
- docs/analysis.md: design tradeoffs and what to choose when.
- references.md: official docs and further reading.
Then Walk The Steps
- steps/01-verify-xcode-clt.md: Apple's command-line toolchain.
- steps/02-install-cmake-and-ninja.md: build-system generator and executor.
- steps/03-install-homebrew-llvm.md: upstream LLVM with full SDK.
- steps/04-verify-end-to-end.md: single script that confirms everything.
Quick Verify (TL;DR)
If you're already comfortable with everything in CONCEPTS.md and just want to confirm your setup:
./scripts/verify.sh
If green, move to ../cp-02-arithmetic-evaluator/. If anything's red, open the relevant step.
Lab-Specific Docs
- docs/execution.md: every install command in one place.
- docs/observation.md: how to inspect Mach-O binaries, find LLVM include paths, understand clang -v output.
- docs/verification.md: the exact checks verify.sh performs, with expected outputs.
- docs/broader-ideas.md: cross-compilation, Docker for Linux validation, LLVM-from-source.
Outcomes
You leave this lab with:
- A working C++17/20 toolchain via Apple Clang.
- A working LLVM 18+ installation via Homebrew, plus llvm-config, opt, llc, lli, mlir-opt.
- CMake 3.20+ and Ninja installed.
- Environment variables (LLVM_HOME, LLVM_DIR, MLIR_DIR) configured.
- The mental model for what a toolchain is and the difference between a compiler driver, frontend, optimizer, backend, assembler, and linker.
Step 1 — Verify Xcode Command Line Tools
Goal
Confirm Apple's C++ toolchain is installed and discoverable. This gives us clang++, make, system headers, git, and lldb — everything needed for Phases 1–4.
Why This Step Exists
Apple ships a C/C++ toolchain as part of "Xcode Command Line Tools" (CLT) — a minimal subset of full Xcode. Almost every macOS development workflow starts here. Without it, even git is missing.
The CLT also installs the macOS SDK (the sysroot containing system headers like <stdio.h> and frameworks like Foundation). When clang compiles, it implicitly looks here.
Check What's Installed
# Where is the CLT installed?
xcode-select -p
# Expected (one of):
# /Library/Developer/CommandLineTools ← CLT-only install
# /Applications/Xcode.app/Contents/Developer ← full Xcode install
If you see "No developer tools were found" or similar, install:
xcode-select --install
This opens a GUI dialog. Click "Install". Wait ~5 minutes.
Verify The Compiler
clang++ --version
Expected:
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.0.0
Thread model: posix
InstalledDir: /usr/bin
Three things to notice in the output, each a teaching moment:
- "Apple clang" means this is Apple's fork, not upstream LLVM. The version number ("16.0.0") tracks Xcode releases, not LLVM major releases. (Apple Clang 16 ≈ LLVM 17 under the hood.)
- Target: arm64-apple-darwin24.0.0 is your default target triple. arm64 is the architecture (Apple Silicon), apple is the vendor, and darwin24 is the OS with its kernel version (macOS 15 = Darwin 24). The triple is what every LLVM backend uses to know what code to emit.
- InstalledDir: /usr/bin shows that Apple installs into /usr/bin, which is on your PATH by default. Compare with Homebrew LLVM in Step 3, which installs elsewhere.
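To make the arch/vendor/OS reading of a triple concrete, here is a minimal sketch that splits a triple string at its first two dashes. The `Triple` struct and `splitTriple` function are hypothetical names for this illustration only; LLVM's real `llvm::Triple` class normalizes and validates far more than this.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: the three fields named in the text above.
struct Triple {
    std::string arch;    // e.g. "arm64"
    std::string vendor;  // e.g. "apple"
    std::string os;      // e.g. "darwin24.0.0" (everything after the second '-')
};

// Split "arch-vendor-os" at the first two dashes. Missing fields stay empty.
Triple splitTriple(const std::string& s) {
    Triple t;
    const std::size_t a = s.find('-');
    if (a == std::string::npos) {
        t.arch = s;
        return t;
    }
    t.arch = s.substr(0, a);
    const std::size_t b = s.find('-', a + 1);
    if (b == std::string::npos) {
        t.vendor = s.substr(a + 1);
        return t;
    }
    t.vendor = s.substr(a + 1, b - a - 1);
    t.os = s.substr(b + 1);
    return t;
}
```

Calling `splitTriple("arm64-apple-darwin24.0.0")` yields the three fields described in the list above.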
See The SDK Location
xcrun --show-sdk-path
Expected (something like):
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
This is your sysroot — Clang's -isysroot default. Inside it:
ls "$(xcrun --show-sdk-path)/usr/include" | head
# stdio.h, stdlib.h, string.h, ...
ls "$(xcrun --show-sdk-path)/System/Library/Frameworks" | head
# Foundation.framework, AppKit.framework, ...
This is where #include <stdio.h> actually resolves. Without the SDK, the compiler doesn't know where the system headers live.
Quick Sanity Test
cat > /tmp/hello.cpp <<'EOF'
#include <cstdio>
int main() { std::puts("hello from " __VERSION__); return 0; }
EOF
clang++ -std=c++17 /tmp/hello.cpp -o /tmp/hello && /tmp/hello
Expected (something like):
hello from 4.2.1 Compatible Apple LLVM 16.0.0 (clang-1600.0.26.6)
If this works, your Apple Clang installation is healthy.
What Just Happened
- xcode-select -p confirmed the CLT prefix.
- clang++ --version printed the version, the default target triple (which equals your machine's host triple), and the install path.
- xcrun --show-sdk-path revealed your sysroot: the directory tree that the compiler treats as the target's / when looking for headers and frameworks.
- Compiling hello.cpp exercised the full pipeline: the preprocessor expanded <cstdio>, the frontend parsed the C++, the backend emitted arm64 machine code, the assembler converted it to a Mach-O object file, and ld linked it into an executable.
Next
→ 02-install-cmake-and-ninja.md
Step 2 — Install CMake and Ninja
Goal
Install CMake (build-system generator) and Ninja (fast parallel build executor). Every lab in this curriculum builds via CMake. Ninja becomes important starting at Phase 5 (LLVM) where build times grow.
Why CMake?
Because the core projects of the LLVM ecosystem (LLVM itself, Clang, MLIR, Swift) all build with CMake. Learning the CMake idioms here pays off in every later lab and every real LLVM contribution.
CMake is a generator, not a builder. It reads CMakeLists.txt, inspects your system (which compiler, which libs), and writes platform-appropriate build files: a Makefile (default on Unix) or a build.ninja. Then cmake --build . invokes whichever was generated.
Why Ninja?
Make is single-threaded by default and re-stats every file on every build. Ninja was designed at Google specifically for Chromium and adopted by LLVM. Differences:
| | GNU Make | Ninja |
|---|---|---|
| Designed for | hand-edited | machine-generated |
| Default parallelism | -j1 | all cores |
| Incremental rebuild speed | O(targets) | O(changed files) |
| LLVM build time (clean, 8 cores) | ~25 min | ~12 min |
For Phases 1–4 you can use either. From Phase 5 onward Ninja makes a noticeable difference.
Install
# If you don't have Homebrew, install it first:
# /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install cmake ninja
Verify
cmake --version | head -1
# cmake version 3.28.x or newer
ninja --version
# 1.11.x or newer
Minimum required:
- CMake ≥ 3.20 (we use modern target syntax)
- Ninja ≥ 1.10 (any recent version works)
Try It — A Tiny CMake Project
This isn't part of any lab; it's a 30-second sanity check.
mkdir -p /tmp/cmake-smoke && cd /tmp/cmake-smoke
cat > CMakeLists.txt <<'EOF'
cmake_minimum_required(VERSION 3.20)
project(smoke LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
add_executable(smoke main.cpp)
EOF
cat > main.cpp <<'EOF'
#include <cstdio>
int main() { std::puts("cmake + ninja: ok"); return 0; }
EOF
cmake -B build -G Ninja
cmake --build build
./build/smoke
Expected:
cmake + ninja: ok
What Just Happened
- cmake -B build -G Ninja ran the configure step. CMake inspected your system (found Clang, the SDK, ninja), wrote build/build.ninja, and cached the decisions in build/CMakeCache.txt.
- cmake --build build ran the build step. CMake dispatched to Ninja, which compiled main.cpp and linked the executable.
- The whole thing took less than 2 seconds. Most of that was CMake's first-time configure; subsequent rebuilds are milliseconds.
Debugging Tips
- cmake -B build -G Ninja --debug-find prints the resolution of every find_package call. Invaluable when LLVM can't be found in Step 3.
- cmake -B build --fresh wipes the cache and reconfigures (useful if you changed PATH or installed a new tool). This flag requires CMake 3.24 or newer.
- ninja -C build -v prints every command Ninja runs, so you can see the exact clang++ invocation if a build is mysteriously failing.
Next
→ 03-install-homebrew-llvm.md
Step 3 — Install Homebrew LLVM (The Full Toolchain)
Goal
Install upstream LLVM via Homebrew. This gives us the headers, libraries, and command-line tools (opt, llc, lli, mlir-opt, llvm-config) that Apple Clang does not include. Required from Phase 5 onward.
Why Apple Clang Isn't Enough
You already have Apple Clang from Step 1. It compiles C++ fine. But it lacks:
- libLLVM.dylib and the LLVM C++ headers: needed to write a compiler that emits LLVM IR programmatically (Phase 5).
- llvm-config: the helper that prints LLVM's include paths, library paths, and link flags.
- opt: the LLVM optimizer driver, e.g. opt -O3 -S input.ll -o output.ll (without -S, opt writes binary bitcode rather than textual IR).
- llc: the LLVM static compiler. Lowers .ll to target assembly.
- lli: the LLVM IR interpreter/JIT. Runs .ll files directly.
- mlir-opt, mlir-translate: MLIR's counterpart to opt, and its bridge to LLVM IR (Phase 7).
- lld: LLVM's linker. Faster than the system ld, and usable for cross-linking.
Apple intentionally omits these because Apple Clang is a product, not a development SDK. Homebrew LLVM fills the gap.
Install
brew install llvm
This installs to /opt/homebrew/opt/llvm/ on Apple Silicon (or /usr/local/opt/llvm/ on Intel Macs). The formula is keg-only: Homebrew deliberately does NOT symlink it into its bin directory (/opt/homebrew/bin or /usr/local/bin), to avoid shadowing Apple Clang.
Disk: ~3 GB. Time: 5–10 minutes (cached binary download).
Make It Discoverable
Homebrew prints instructions; the relevant parts:
# Add to your ~/.zshrc (or ~/.bashrc):
export LLVM_HOME="/opt/homebrew/opt/llvm"
export PATH="$LLVM_HOME/bin:$PATH"
# For CMake's find_package(LLVM) and find_package(MLIR):
export LLVM_DIR="$LLVM_HOME/lib/cmake/llvm"
export MLIR_DIR="$LLVM_HOME/lib/cmake/mlir"
# For dyld to find LLVM libraries at runtime (rarely needed; CMake usually handles rpath):
# export DYLD_LIBRARY_PATH="$LLVM_HOME/lib:$DYLD_LIBRARY_PATH"
After editing, reload:
source ~/.zshrc
Why prepend $LLVM_HOME/bin to PATH? So clang++ and clang refer to Homebrew LLVM during compiler-engineering work. Apple Clang is still at /usr/bin/clang++ for any tool that hard-codes that path.
Verify
which clang++
# Expected: /opt/homebrew/opt/llvm/bin/clang++
clang++ --version
# Expected: clang version 18.x.x or 19.x.x or 20.x.x
# (NOT "Apple clang")
llvm-config --version
# Expected: 18.x.x / 19.x.x / 20.x.x (matches clang++)
llvm-config --prefix
# Expected: /opt/homebrew/opt/llvm
llc --version | head -3
opt --version | head -3
lli --version | head -3
mlir-opt --version | head -3
All four should print a version line. If mlir-opt: command not found, your Homebrew LLVM is too old or wasn't built with MLIR. brew upgrade llvm should fix it (current Homebrew LLVM bottles include MLIR by default).
Inspect What's Available
ls "$LLVM_HOME/bin" | head -30
# clang, clang++, lld, ld64.lld, llc, lli, llvm-ar, llvm-as, llvm-cov, llvm-dis,
# llvm-dwarfdump, llvm-link, llvm-mc, llvm-nm, llvm-objcopy, llvm-objdump,
# llvm-readelf, llvm-readobj, llvm-rtdyld, llvm-symbolizer, llvm-undname,
# mlir-cpu-runner, mlir-opt, mlir-tblgen, mlir-translate, opt, ...
Every one of these is a tool you'll meet at some point.
The llvm-config Story
llvm-config is a small helper that prints LLVM build information. Try:
llvm-config --includedir
# /opt/homebrew/opt/llvm/include
llvm-config --libdir
# /opt/homebrew/opt/llvm/lib
llvm-config --libs core support irreader
# -lLLVMIRReader -lLLVMBitReader -lLLVMAsmParser -lLLVMCore -lLLVMRemarks ...
llvm-config --cxxflags
# -I/opt/homebrew/opt/llvm/include -std=c++17 -fno-exceptions -fno-rtti ...
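CMake's find_package(LLVM) does not call llvm-config. It loads LLVMConfig.cmake from LLVM_DIR, which exposes the same information as CMake variables such as LLVM_INCLUDE_DIRS and LLVM_DEFINITIONS. As long as LLVM_DIR is set (or the LLVM prefix is on CMAKE_PREFIX_PATH), CMake "just works"; llvm-config remains useful for Makefiles and one-off command lines.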
Try It — Compile and Run LLVM IR
Save this as /tmp/hello.ll:
@.str = private unnamed_addr constant [14 x i8] c"hello, llvm!\0A\00"
declare i32 @printf(ptr, ...)
define i32 @main() {
%1 = call i32 (ptr, ...) @printf(ptr @.str)
ret i32 0
}
Run it three different ways — each demonstrates a different tool:
# 1. Interpret/JIT directly (lli)
lli /tmp/hello.ll
# → hello, llvm!
# 2. Compile to assembly (llc), then assemble + link (clang++)
llc /tmp/hello.ll -o /tmp/hello.s
clang++ /tmp/hello.s -o /tmp/hello-static && /tmp/hello-static
# → hello, llvm!
# 3. Optimize, then compile (opt + llc)
opt -O3 /tmp/hello.ll -S -o /tmp/hello.opt.ll
llc /tmp/hello.opt.ll -o /tmp/hello.opt.s
clang++ /tmp/hello.opt.s -o /tmp/hello-opt && /tmp/hello-opt
# → hello, llvm!
You just performed all three roles of LLVM: as an interpreter (lli), as a static compiler (llc), and as an optimizer (opt). Every later lab will revisit these tools.
What Just Happened
- You installed upstream LLVM separately from Apple Clang.
- You prepended its bin to PATH so clang++ now refers to Homebrew's, NOT Apple's.
- You exported LLVM_DIR so CMake's find_package(LLVM) works without extra flags.
- You wrote raw LLVM IR by hand, then exercised lli, llc, and opt: proof that the toolchain is wired correctly.
Common Pitfalls
- clang++ --version still says "Apple clang": you forgot to source ~/.zshrc, or you opened a new terminal that doesn't load it. Check with echo $PATH | tr : '\n' | head.
- mlir-opt: command not found: very old Homebrew LLVM (pre-15). Run brew upgrade llvm.
- dyld: Library not loaded: @rpath/libLLVM.dylib: a binary you built can't find its libs at runtime. Either add $LLVM_HOME/lib to DYLD_LIBRARY_PATH, or have CMake embed an rpath, e.g. set(CMAKE_BUILD_RPATH "$ENV{LLVM_HOME}/lib") for binaries run from the build tree (CMAKE_INSTALL_RPATH is the equivalent for installed binaries). Note that LLVM_HOME is an environment variable, so CMake reads it as $ENV{LLVM_HOME}, not ${LLVM_HOME}.
- CMake can't find_package(LLVM): confirm LLVM_DIR is exported. Try cmake -DLLVM_DIR="$LLVM_HOME/lib/cmake/llvm" ... as a one-off.
Next
→ 04-verify-end-to-end.md
Step 4 — End-to-End Verification
Goal
Run a single verification script that exercises every tool we'll use across the curriculum and confirms versions meet the minimums. If this passes, you're ready for Lab cp-02.
What Gets Verified
| Tool | Required Version | Used In |
|---|---|---|
| Xcode CLT | any current | all phases |
| Apple Clang (/usr/bin/clang++) | 14+ | phases 1–4 (optional) |
| Homebrew Clang++ | 18+ | phases 5+ |
| llvm-config | matches Clang | phases 5+ |
| llc, opt, lli | matches Clang | phases 5+ |
| mlir-opt, mlir-translate | matches Clang | phase 7 |
| CMake | 3.20+ | all phases |
| Ninja | 1.10+ | phases 5+ (optional) |
| LLDB | any current | all phases |
| Git | any current | all phases |
Run
From this lab directory:
cd cp-01-environment-setup
./scripts/verify.sh
Expected output (with your specific versions):
== compilers-parser-engineer / cp-01 — environment verification ==
[1/9] Xcode CLT prefix : /Library/Developer/CommandLineTools OK
[2/9] Apple Clang : Apple clang version 16.0.0 OK
[3/9] Homebrew Clang++ : clang version 20.1.8 OK
[4/9] llvm-config : 20.1.8 OK
[5/9] LLVM tools : llc opt lli OK
[6/9] MLIR tools : mlir-opt mlir-translate OK
[7/9] CMake : 4.1.0 OK (need >=3.20)
[8/9] Ninja : 1.11.1 OK (optional)
[9/9] LLDB : lldb-1600.0.39.109 OK
Architecture : arm64 (Apple Silicon)
macOS : 15.0
SDK path : /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
LLVM prefix : /opt/homebrew/opt/llvm
LLVM_DIR : /opt/homebrew/opt/llvm/lib/cmake/llvm
MLIR_DIR : /opt/homebrew/opt/llvm/lib/cmake/mlir
== all critical tools present; you are ready for cp-02 ==
If anything is marked MISSING or OLD, see the matching step:
| Failure | Fix |
|---|---|
| Xcode CLT missing | Step 1 |
| Apple Clang missing | Step 1 |
| Homebrew clang++ missing | Step 3 |
| llvm-config missing | Step 3 |
| MLIR tools missing | brew upgrade llvm (Step 3) |
| CMake missing/old | Step 2 |
| Ninja missing | Step 2 (optional for now) |
| LLVM_DIR/MLIR_DIR unset | Step 3 |
A Mini Build To Confirm find_package(LLVM)
After the script passes, do this final check to confirm CMake can pull in LLVM (the linchpin for Phase 5):
cd scripts/llvm-smoke
cmake -B build -G Ninja
cmake --build build
./build/llvm-smoke
Expected:
LLVM version: 20.1.8
target triple: arm64-apple-macosx<version>
created module: smoke; declared one function: int main()
This compiles a tiny program that links against libLLVM and uses the IRBuilder API to construct a function. If this works, cp-11 (LLVM Codegen) will work.
What Just Happened
verify.sh ran ~30 commands across your toolchain, each printing a tool version. It then printed environment variables CMake needs (LLVM_DIR, MLIR_DIR).
The llvm-smoke mini-project then did the only thing that actually matters for compiler engineering: it pulled in the LLVM C++ headers and emitted IR programmatically. This is the API you'll spend Phases 5–9 writing against.
Next
→ Mark this lab complete and move to ../cp-02-arithmetic-evaluator/.
cp-02 — Arithmetic Evaluator
The first real compiler. Read source text. Produce a number. Use every classical frontend technique compressed into ~400 lines of C++17.
Why This Lab
This is the foundational lab of the entire curriculum. Almost every concept in the rest of the course is a generalization of something built here:
| Built here | Generalizes to (later) |
|---|---|
| Hand-written DFA lexer | Clang's lexer; Phase 3 bytecode-compiler lexer |
| Recursive-descent parser | Pratt parser (cp-03); Clang's parser |
| AST + Visitor pattern | Every IR in every later lab |
| Post-order tree-walk eval | Bytecode VM (cp-06); LLVM IRBuilder traversal |
| EBNF grammar | All MiniLang grammar throughout the curriculum |
| Operator precedence via grammar nesting | Pratt's "binding powers" (cp-03); MLIR op verifiers |
Read First
- CONCEPTS.md: the 8-part deep dive into lexers, parsers, AST, EBNF, precedence, associativity, the Visitor pattern.
- references.md: Crafting Interpreters chapters, Clang lexer source, V8 parser source.
- docs/analysis.md: design tradeoffs (DFA vs regex, recursive descent vs Pratt vs LR, Visitor vs std::variant).
Walk The Steps
- steps/01-the-lexer.md: tokens, DFA, single-char lookahead.
- steps/02-the-ast.md: node hierarchy, ownership, the Visitor pattern.
- steps/03-recursive-descent-parser.md: grammar, precedence-via-nesting, associativity-via-recursion-direction.
- steps/04-the-evaluator.md: post-order tree walk implemented as a Visitor returning double.
- steps/05-repl-tests-and-cli.md: wire it into a usable tool with a test suite.
- steps/06-extensions.md: extension exercises (right-associative ^, AST printer, source-location errors, constant folding).
Lab Docs
- docs/execution.md: exact build/run/test commands.
- docs/verification.md: the 19 unit tests + 4 manual REPL checks.
- docs/observation.md: how to inspect the AST, debug parser failures, trace evaluation.
- docs/broader-ideas.md: Pratt parsing preview, LR parsing, parser combinators, error recovery.
Code
src/cpp/— full reference implementation (CMake project, ~450 lines of C++).
Build & Run (TL;DR)
cd src/cpp
cmake -B build && cmake --build build
./build/eval "1 + 2 * 3" # 7
./build/eval # REPL
ctest --test-dir build # 19 tests
Outcomes
You leave this lab able to:
- Hand-write a lexer for an arbitrary regular language (no lex/flex).
- Write a recursive-descent parser for any LL(1) grammar.
- Design an AST with the Visitor pattern.
- Explain why the grammar shape determines operator precedence and associativity.
- Recognize what changes (and what doesn't) when you swap a tree-walker for a bytecode VM or LLVM backend in later labs.
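The claim that grammar shape determines precedence and associativity can be seen in a compact sketch. The `MiniParser` class and `evaluate` function below are hypothetical and deliberately simplified: they scan characters directly and compute the value while parsing, whereas the lab's reference implementation lexes into tokens, builds an AST, and evaluates it with a Visitor. What carries over is the nesting: `expr` calls `term`, so `*` binds tighter than `+`, and the loops fold left to right, so `-` and `/` are left-associative.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

class MiniParser {
public:
    explicit MiniParser(std::string src) : src_(std::move(src)) {}

    double parse() {
        const double v = expr();
        skipWs();
        if (pos_ != src_.size()) throw std::runtime_error("trailing input");
        return v;
    }

private:
    // expr := term (('+' | '-') term)*     lowest precedence, folds left
    double expr() {
        double v = term();
        for (;;) {
            if (eat('+')) v += term();
            else if (eat('-')) v -= term();
            else return v;
        }
    }
    // term := unary (('*' | '/') unary)*   one level deeper, so it binds tighter
    double term() {
        double v = unary();
        for (;;) {
            if (eat('*')) v *= unary();
            else if (eat('/')) v /= unary();
            else return v;
        }
    }
    // unary := '-' unary | primary
    double unary() { return eat('-') ? -unary() : primary(); }
    // primary := NUMBER | '(' expr ')'
    double primary() {
        if (eat('(')) {
            const double v = expr();
            if (!eat(')')) throw std::runtime_error("expected ')'");
            return v;
        }
        const std::size_t start = pos_;  // eat() already skipped whitespace
        while (pos_ < src_.size() && isDigit(src_[pos_])) ++pos_;
        if (pos_ == start) throw std::runtime_error("expected a number");
        if (pos_ < src_.size() && src_[pos_] == '.') {
            ++pos_;
            while (pos_ < src_.size() && isDigit(src_[pos_])) ++pos_;
        }
        return std::stod(src_.substr(start, pos_ - start));
    }

    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    void skipWs() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }
    // Consume c if it is the next non-whitespace character.
    bool eat(char c) {
        skipWs();
        if (pos_ < src_.size() && src_[pos_] == c) { ++pos_; return true; }
        return false;
    }

    std::string src_;
    std::size_t pos_ = 0;
};

double evaluate(const std::string& src) { return MiniParser(src).parse(); }
```

Swapping the roles of `expr` and `term` would invert the precedence without touching any other code, which is the sense in which the grammar, not the evaluator, decides how `1 + 2 * 3` groups.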
Step 1 — The Lexer
Goal: turn a string of characters into a sequence of tokens.
Why A Lexer?
Imagine writing a parser that operated directly on characters. Every parsing rule would have to also skip whitespace, parse numeric literals, distinguish == from =, etc. The result would be unreadable.
Splitting into lexer → parser is the same principle as separating tokenization from semantics in any text processing: each stage works on the unit it cares about. Tokens are the unit the grammar cares about.
For arithmetic, our token vocabulary is tiny:
| Token | Matches |
|---|---|
| Number | one or more digits, optionally followed by . and more digits |
| Plus, Minus, Star, Slash | +, -, *, / |
| LParen, RParen | (, ) |
| Eof | synthetic end-of-stream marker |
| Error | anything else (e.g., letters), carrying a message |
The DFA Mental Model
Even our tiny lexer is a deterministic finite automaton (DFA):
                   digit                digit
                  ┌─────┐              ┌─────┐
                  ▼     │              ▼     │
[START] ─digit─► [IN_NUMBER] ───.───► [AFTER_DOT]
   │                  │                    │
   │               (other)              (other)
   │                  ▼                    ▼
   │             emit Number          emit Number
   │
   ├── + - * / ( ) ──► emit Plus, Minus, ..., RParen
   ├── ws ───────────► skip → loop
   └── EOF ──────────► emit Eof
Position in the source string serves as our state. We never need a separate "state" variable for the arithmetic lexer; for more complex lexers (Python's indentation tracking, JavaScript's / as division vs the start of a regex literal) you'd track explicit modes.
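For contrast with the implicit-state style used in this lab, the number automaton can also be written with an explicit state variable. This is only a sketch under assumed names (`State`, `step`, `isNumberLexeme` are not part of the reference code); it recognizes whole strings rather than lexing a stream, and it adds a `Reject` sink state that the diagram leaves implicit.

```cpp
#include <cctype>
#include <string>

// States of the number sub-automaton; Reject is an added sink state.
enum class State { Start, InNumber, AfterDot, Reject };

// One transition: current state plus input character gives the next state.
State step(State s, char c) {
    const bool digit = std::isdigit(static_cast<unsigned char>(c)) != 0;
    switch (s) {
        case State::Start:
            return digit ? State::InNumber : State::Reject;
        case State::InNumber:
            if (digit) return State::InNumber;
            return c == '.' ? State::AfterDot : State::Reject;
        case State::AfterDot:
            return digit ? State::AfterDot : State::Reject;
        case State::Reject:
            return State::Reject;
    }
    return State::Reject;
}

// True when the whole string is exactly one Number lexeme.
// InNumber and AfterDot are the accepting states.
bool isNumberLexeme(const std::string& text) {
    State s = State::Start;
    for (char c : text) s = step(s, c);
    return s == State::InNumber || s == State::AfterDot;
}
```

Because `AfterDot` is accepting, a string like `1.` is accepted with an empty fraction, while `.5` is rejected at the first transition.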
Single-Character Lookahead
char Lexer::peek() const { return atEnd() ? '\0' : src_[pos_]; }
char Lexer::advance() { return src_[pos_++]; }
peek() looks at the current character without consuming it. advance() returns it and moves on. Almost every hand-written lexer follows this pattern.
For most operators we need zero lookahead (+ is Plus, full stop). For numbers we need a single character of lookahead to decide whether more digits follow. In later labs we'll add 2-character lookahead for == vs =.
The Number Sub-DFA
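// Needs <cctype>, <cstddef>, and <string>.
Token Lexer::lexNumber() {
    std::size_t start = pos_;
    // Cast before std::isdigit: passing a negative char is undefined behavior.
    while (!atEnd() && std::isdigit(static_cast<unsigned char>(peek()))) advance();
    if (!atEnd() && peek() == '.') {
        advance();  // consume the '.'
        while (!atEnd() && std::isdigit(static_cast<unsigned char>(peek()))) advance();
    }
    std::string text = src_.substr(start, pos_ - start);
    Token t;
    t.kind = TokenKind::Number;
    t.text = text;
    t.value = std::stod(text);
    return t;
}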
Reads:
- As many digits as possible.
- If a . follows, consume it and as many more digits as possible.
- Slice the substring; convert to double via std::stod.
This accepts 42, 3.14, 0, and 0.5. It also accepts 1. (empty fraction; harmless). It rejects .5 (no leading digit) and does not support 1e5: the lexer reads 1 as a Number and then hits e, which becomes an Error token. Scientific notation is left as a Step 6 extension.
Why std::stod instead of manual parsing? It's correct (handles negative zero, denormals, and the exponents we plan to add later), and the lexer is not the performance bottleneck. Compare with Clang, which converts literals through llvm::APFloat::convertFromString because it needs exact, host-independent floating-point semantics for whatever target it is compiling for.
Eager vs Lazy
std::vector<Token> Lexer::tokenize() { ... } // eager
We lex the entire input into a vector<Token> upfront. The parser then consumes from this vector.
Why eager? Simpler control flow. The parser doesn't need to hold a reference to the lexer.
Why production compilers go lazy: memory. Lexing a 100k-line C++ file into a vector of tokens is megabytes; lazy lexing keeps the working set small. Clang is lazy.
For arithmetic — and for everything through cp-09 — eager is fine.
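The difference between the two strategies is mostly one of interface. The sketch below is an assumption-laden illustration, not the lab's code: `LazyLexer`, `Tok`, `Kind`, and `tokenizeEagerly` are hypothetical names, and the token kinds are reduced to three so the example stays short. A lazy lexer hands out one token per `next()` call and buffers nothing; the eager version is just that loop drained into a vector.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Number, Op, Eof };  // simplified kinds for this sketch only
struct Tok {
    Kind kind;
    std::string text;
};

class LazyLexer {
public:
    explicit LazyLexer(std::string src) : src_(std::move(src)) {}

    // Produce exactly one token per call; no token buffer is kept.
    Tok next() {
        while (pos_ < src_.size() && isSpace(src_[pos_])) ++pos_;
        if (pos_ >= src_.size()) return {Kind::Eof, ""};
        const std::size_t start = pos_;
        if (isDigit(src_[pos_])) {
            while (pos_ < src_.size() && isDigit(src_[pos_])) ++pos_;
            return {Kind::Number, src_.substr(start, pos_ - start)};
        }
        ++pos_;  // any other character is a one-character operator here
        return {Kind::Op, src_.substr(start, 1)};
    }

private:
    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }

    std::string src_;
    std::size_t pos_ = 0;
};

// The eager strategy: drain the lazy lexer into a vector, Eof included.
std::vector<Tok> tokenizeEagerly(const std::string& src) {
    LazyLexer lex(src);
    std::vector<Tok> out;
    do {
        out.push_back(lex.next());
    } while (out.back().kind != Kind::Eof);
    return out;
}
```

With the lazy form the parser must hold a reference to the lexer and ask for tokens as it goes; with the eager form it only needs the vector, which is the simpler control flow the text refers to.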
Error Tokens
default: {
Token e;
e.kind = TokenKind::Error;
e.text = std::string("unexpected character '") + c + "'";
return e;
}
We don't throw from the lexer. We emit an Error token. The parser then decides what to do (we throw there). This separation lets a future error-recovery layer skip the bad token and keep parsing — important for IDE diagnostics where you want to see all errors at once, not just the first.
Try It
After building (next step covers CMake), the lexer is hidden behind eval. To inspect tokens directly, you can add a quick dump_tokens helper:
for (auto& t : Lexer("1 + 2 * 3").tokenize())
std::cout << kindName(t.kind) << " " << t.text << "\n";
Expected:
NUMBER 1
PLUS +
NUMBER 2
STAR *
NUMBER 3
EOF
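For readers who want to run the loop above before reaching the reference code, here is a self-contained sketch of a lexer with the same surface (`Lexer`, `tokenize`, `Token`, `TokenKind`, `kindName`). The member layout, the `make` and `next` helpers, and the spellings `MINUS`, `SLASH`, `LPAREN`, `RPAREN`, and `ERROR` are assumptions; the reference implementation in src/cpp/ may differ in all of them.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class TokenKind { Number, Plus, Minus, Star, Slash, LParen, RParen, Eof, Error };

struct Token {
    TokenKind kind = TokenKind::Eof;
    std::string text;    // lexeme, or the message for Error tokens
    double value = 0.0;  // meaningful only for Number
};

// NUMBER, PLUS, STAR and EOF match the expected output above;
// the other spellings are assumptions.
const char* kindName(TokenKind k) {
    switch (k) {
        case TokenKind::Number: return "NUMBER";
        case TokenKind::Plus:   return "PLUS";
        case TokenKind::Minus:  return "MINUS";
        case TokenKind::Star:   return "STAR";
        case TokenKind::Slash:  return "SLASH";
        case TokenKind::LParen: return "LPAREN";
        case TokenKind::RParen: return "RPAREN";
        case TokenKind::Eof:    return "EOF";
        case TokenKind::Error:  return "ERROR";
    }
    return "?";
}

class Lexer {
public:
    explicit Lexer(std::string src) : src_(std::move(src)) {}

    // Eager: lex the whole input up front. Error tokens are emitted, not
    // thrown, and lexing continues so later errors are still reported.
    std::vector<Token> tokenize() {
        std::vector<Token> out;
        do {
            out.push_back(next());
        } while (out.back().kind != TokenKind::Eof);
        return out;
    }

private:
    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
    bool atEnd() const { return pos_ >= src_.size(); }
    char peek() const { return atEnd() ? '\0' : src_[pos_]; }
    char advance() { return src_[pos_++]; }

    static Token make(TokenKind kind, std::string text) {
        Token t;
        t.kind = kind;
        t.text = std::move(text);
        return t;
    }

    Token next() {
        while (isSpace(peek())) advance();
        if (atEnd()) return make(TokenKind::Eof, "");
        if (isDigit(peek())) return lexNumber();
        const char c = advance();
        switch (c) {
            case '+': return make(TokenKind::Plus, "+");
            case '-': return make(TokenKind::Minus, "-");
            case '*': return make(TokenKind::Star, "*");
            case '/': return make(TokenKind::Slash, "/");
            case '(': return make(TokenKind::LParen, "(");
            case ')': return make(TokenKind::RParen, ")");
            default:
                return make(TokenKind::Error,
                            std::string("unexpected character '") + c + "'");
        }
    }

    Token lexNumber() {
        const std::size_t start = pos_;
        while (isDigit(peek())) advance();
        if (peek() == '.') {
            advance();
            while (isDigit(peek())) advance();
        }
        Token t = make(TokenKind::Number, src_.substr(start, pos_ - start));
        t.value = std::stod(t.text);
        return t;
    }

    std::string src_;
    std::size_t pos_ = 0;
};
```

Feeding it `"2 $ 3"` produces a Number, an Error, another Number, and Eof, which shows the recovery property described under Error Tokens: the bad character costs one token, not the rest of the input.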
Next
→ 02-the-ast.md — design the tree the parser will build.
Step 2 — The AST
Goal: design the tree data structure the parser will build and the evaluator will walk.
Three Node Types Cover All Of Arithmetic
| Node | Represents | Children |
|---|---|---|
| NumberExpr | 42, 3.14 | none |
| BinaryExpr | a + b, a * b, … | lhs, rhs |
| UnaryExpr | -a | operand |
That's it. Three classes, no surprises.
Class Hierarchy + Visitor
struct NumberExpr; struct BinaryExpr; struct UnaryExpr;
template <typename R>
struct ExprVisitor {
virtual ~ExprVisitor() = default;
virtual R visit(NumberExpr&) = 0;
virtual R visit(BinaryExpr&) = 0;
virtual R visit(UnaryExpr&) = 0;
};
struct Expr {
virtual ~Expr() = default;
virtual double accept(ExprVisitor<double>& v) = 0;
};
struct NumberExpr : Expr {
double value;
explicit NumberExpr(double v) : value(v) {}
double accept(ExprVisitor<double>& v) override { return v.visit(*this); }
};
// BinaryExpr, UnaryExpr follow the same accept pattern
The dance:
- Each node has accept(Visitor&): one virtual call site.
- accept immediately calls visitor.visit(*this), and because *this has its concrete static type at the call site, the right overload is selected at compile time.
- The visitor's visit(NumberExpr&) etc. are user-defined per pass.
This is the double dispatch trick: virtual dispatch picks the node's concrete type; static dispatch (overloading) picks the operation on it.
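Putting the pieces together, the following sketch fills in the node types the snippet above elides and adds an `Evaluator`, so the double dispatch can be followed end to end. The `TokenKind` enum is cut down to the four operators, and `evalSample` is a hypothetical helper that builds `1 + 2 * 3` by hand because no parser exists yet at this point in the lab.

```cpp
#include <memory>
#include <stdexcept>
#include <utility>

enum class TokenKind { Plus, Minus, Star, Slash };  // operator subset for this sketch

struct NumberExpr; struct BinaryExpr; struct UnaryExpr;

template <typename R>
struct ExprVisitor {
    virtual ~ExprVisitor() = default;
    virtual R visit(NumberExpr&) = 0;
    virtual R visit(BinaryExpr&) = 0;
    virtual R visit(UnaryExpr&) = 0;
};

struct Expr {
    virtual ~Expr() = default;
    virtual double accept(ExprVisitor<double>& v) = 0;
};
using ExprPtr = std::unique_ptr<Expr>;

struct NumberExpr : Expr {
    double value;
    explicit NumberExpr(double v) : value(v) {}
    double accept(ExprVisitor<double>& v) override { return v.visit(*this); }
};

struct BinaryExpr : Expr {
    TokenKind op;
    ExprPtr lhs, rhs;
    BinaryExpr(TokenKind o, ExprPtr l, ExprPtr r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    double accept(ExprVisitor<double>& v) override { return v.visit(*this); }
};

struct UnaryExpr : Expr {
    ExprPtr operand;  // the only unary operator here is negation
    explicit UnaryExpr(ExprPtr e) : operand(std::move(e)) {}
    double accept(ExprVisitor<double>& v) override { return v.visit(*this); }
};

// Post-order walk: evaluate children first, then combine.
struct Evaluator : ExprVisitor<double> {
    double visit(NumberExpr& n) override { return n.value; }
    double visit(BinaryExpr& b) override {
        const double l = b.lhs->accept(*this);
        const double r = b.rhs->accept(*this);
        switch (b.op) {
            case TokenKind::Plus:  return l + r;
            case TokenKind::Minus: return l - r;
            case TokenKind::Star:  return l * r;
            case TokenKind::Slash: return l / r;
        }
        throw std::logic_error("unknown operator");
    }
    double visit(UnaryExpr& u) override { return -u.operand->accept(*this); }
};

// Hypothetical helper: builds 1 + 2 * 3 by hand and evaluates it.
double evalSample() {
    ExprPtr mul = std::make_unique<BinaryExpr>(
        TokenKind::Star, std::make_unique<NumberExpr>(2.0), std::make_unique<NumberExpr>(3.0));
    ExprPtr add = std::make_unique<BinaryExpr>(
        TokenKind::Plus, std::make_unique<NumberExpr>(1.0), std::move(mul));
    Evaluator ev;
    return add->accept(ev);
}
```

Each `accept` body is textually identical, yet each selects a different `visit` overload, because `*this` has a different static type in each class.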
Why Not Just A Switch / std::variant?
Two viable alternatives:
Alternative A — std::variant<Number, Binary, Unary>
using Expr = std::variant<NumberExpr, BinaryExpr, UnaryExpr>;
double eval(const Expr& e) {
return std::visit(overloaded{
[](const NumberExpr& n) { return n.value; },
[](const BinaryExpr& b) { /* … */ },
[](const UnaryExpr& u) { /* … */ }
}, e);
}
Pros: no virtuals, slightly faster, more "modern C++".
Cons: every new node type requires updating every visit site (or using if constexpr); recursive types need extra wrapping (std::variant<NumberExpr, std::unique_ptr<BinaryExpr>, …>); error messages are worse.
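The snippet above leaves out two things a compiling version needs: the `overloaded` helper (it is a common idiom, not part of the standard library) and the wrapping that makes the recursive alternatives have a known size. A possible complete form, assuming `char` operators and a hypothetical `sample` builder for brevity:

```cpp
#include <memory>
#include <utility>
#include <variant>

struct BinaryExpr;
struct UnaryExpr;
struct NumberExpr { double value; };

// Recursive alternatives are held through unique_ptr so the variant's
// size is known before BinaryExpr and UnaryExpr are complete.
using Expr = std::variant<NumberExpr,
                          std::unique_ptr<BinaryExpr>,
                          std::unique_ptr<UnaryExpr>>;

struct BinaryExpr { char op; Expr lhs; Expr rhs; };
struct UnaryExpr { Expr operand; };

// The usual C++17 "overloaded lambdas" idiom.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

double eval(const Expr& e) {
    return std::visit(overloaded{
        [](const NumberExpr& n) { return n.value; },
        [](const std::unique_ptr<BinaryExpr>& b) {
            const double l = eval(b->lhs);
            const double r = eval(b->rhs);
            switch (b->op) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                default:  return l / r;  // '/' is the only remaining operator
            }
        },
        [](const std::unique_ptr<UnaryExpr>& u) { return -eval(u->operand); }
    }, e);
}

// Hypothetical helper: builds 1 + 2 * 3.
Expr sample() {
    Expr mul = std::make_unique<BinaryExpr>(
        BinaryExpr{'*', NumberExpr{2.0}, NumberExpr{3.0}});
    return std::make_unique<BinaryExpr>(
        BinaryExpr{'+', NumberExpr{1.0}, std::move(mul)});
}
```

Every lambda must return the same type for `std::visit` to compile, and adding a fourth alternative to `Expr` breaks every such call until a matching lambda is added, which is the maintenance cost listed under Cons.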
Alternative B — Switch On Tag
struct Expr { ExprKind kind; /* … */ };
double eval(const Expr& e) {
switch (e.kind) {
case ExprKind::Number: /* … */;
case ExprKind::Binary: /* … */;
}
}
Pros: zero virtual overhead; single allocation possible (one Expr struct sized for the largest variant).
Cons: every pass touches every node type's data layout; no encapsulation.
Which does production use?
- Clang's AST: a class hierarchy with a kind tag (Stmt::getStmtClass()) plus visitors (RecursiveASTVisitor).
- rustc's AST: enums (Rust's std::variant equivalent).
- LLVM IR's Instruction: tagged class hierarchy + visitor.
We use the virtual + Visitor pattern here because it's the most teachable and the most directly comparable to Clang's design. The trade-offs are real — we discuss them in docs/analysis.md.
Ownership: std::unique_ptr
using ExprPtr = std::unique_ptr<Expr>;
struct BinaryExpr : Expr {
TokenKind op;
ExprPtr lhs;
ExprPtr rhs;
BinaryExpr(TokenKind o, ExprPtr l, ExprPtr r)
: op(o), lhs(std::move(l)), rhs(std::move(r)) {}
};
Children are owned via unique_ptr. A tree is not shared — destroying the root recursively destroys the whole tree.
Why not shared_ptr? Nothing in this lab needs shared ownership of AST nodes. Sharing would imply two parents, and an AST is a tree: every node has exactly one parent.
Why not raw pointers + manual delete? Memory leak waiting to happen, especially with parser exceptions. unique_ptr is RAII for ownership.
Why not arena allocation? It's the technique production compilers use (LLVM has BumpPtrAllocator; Clang's ASTContext allocates from a single arena). We'll add arena allocation in cp-09 when we have many AST nodes per program. For arithmetic, individual unique_ptr allocations are fine.
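The single-owner claim is easy to check with a counting destructor. The sketch below uses a hypothetical Node type (not the lab's Expr classes) whose static counter tracks how many nodes are alive; buildAndDrop is likewise an illustrative helper.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for an AST node; `live` counts nodes currently alive.
struct Node {
    static inline int live = 0;              // C++17 inline static data member
    std::unique_ptr<Node> lhs, rhs;
    Node() { ++live; }
    Node(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : lhs(std::move(l)), rhs(std::move(r)) { ++live; }
    ~Node() { --live; }
};

// Builds a five-node tree, lets the root go out of scope, and reports how
// many nodes were alive while the tree existed.
int buildAndDrop() {
    int peak = 0;
    {
        auto inner = std::make_unique<Node>(std::make_unique<Node>(),
                                            std::make_unique<Node>());
        auto root  = std::make_unique<Node>(std::move(inner),
                                            std::make_unique<Node>());
        peak = Node::live;                   // root, inner, and three leaves
    }                                        // root destroyed here; children follow
    return peak;
}
```

After buildAndDrop returns, Node::live is back to zero without a single explicit delete, which is the property the parser relies on when an exception unwinds through a half-built tree.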
The accept Method, Concretely
double NumberExpr::accept(ExprVisitor<double>& v) { return v.visit(*this); }
When the evaluator runs someExpr->accept(eval):
- Virtual dispatch goes to the right accept (e.g., NumberExpr::accept).
- That calls eval.visit(*this), and because *this is a NumberExpr&, overload resolution chooses Evaluator::visit(NumberExpr&).
The combination is sometimes called "double dispatch". Conceptually it is just this: the virtual accept call routes to the right node type, static overload resolution picks the visit overload for that type, and the virtual visit call runs the concrete visitor's implementation.
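Put together, the mechanism fits in one small file. The sketch below is a simplification under stated assumptions: the visitor is fixed to double rather than templated on R, the operator is a char instead of TokenKind, only + and * are handled, and evalFivePlusThree is a hypothetical helper for trying it out.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct NumberExpr;
struct BinaryExpr;

struct ExprVisitor {                          // fixed to double for this sketch
    virtual ~ExprVisitor() = default;
    virtual double visit(NumberExpr&) = 0;
    virtual double visit(BinaryExpr&) = 0;
};

struct Expr {
    virtual ~Expr() = default;
    virtual double accept(ExprVisitor& v) = 0;   // first dispatch: node type
};
using ExprPtr = std::unique_ptr<Expr>;

struct NumberExpr : Expr {
    double value;
    explicit NumberExpr(double v) : value(v) {}
    // *this is NumberExpr& here, so visit(NumberExpr&) is chosen statically.
    double accept(ExprVisitor& v) override { return v.visit(*this); }
};

struct BinaryExpr : Expr {
    char op;                                  // the lab uses TokenKind
    ExprPtr lhs, rhs;
    BinaryExpr(char o, ExprPtr l, ExprPtr r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    double accept(ExprVisitor& v) override { return v.visit(*this); }
};

struct Evaluator : ExprVisitor {              // second dispatch: which pass
    double visit(NumberExpr& n) override { return n.value; }
    double visit(BinaryExpr& b) override {
        const double l = b.lhs->accept(*this);
        const double r = b.rhs->accept(*this);
        return b.op == '+' ? l + r : l * r;   // only + and * in this sketch
    }
};

// Hypothetical helper: builds 5 + 3 by hand and evaluates it.
double evalFivePlusThree() {
    ExprPtr tree = std::make_unique<BinaryExpr>(
        '+', std::make_unique<NumberExpr>(5.0), std::make_unique<NumberExpr>(3.0));
    Evaluator e;
    return tree->accept(e);
}
```

A second pass would be another struct deriving from ExprVisitor; none of the node classes would change.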
The Template Parameter R
template <typename R>
struct ExprVisitor {
virtual R visit(NumberExpr&) = 0;
virtual R visit(BinaryExpr&) = 0;
virtual R visit(UnaryExpr&) = 0;
};
R is the return type. We define Evaluator : ExprVisitor<double>. A printer would be Printer : ExprVisitor<std::string>. A type-checker would be TypeChecker : ExprVisitor<Type>. Code generation in cp-11 will be LLVMGen : ExprVisitor<llvm::Value*>.
In this lab Expr::accept is hard-wired to ExprVisitor<double> — sufficient for our one visitor. In production you'd either define multiple accept overloads or use std::any/erased return types. Clang uses a non-templated visitor + side-effects to a Result member; rustc uses Rust enums and match.
Try It
After we build, this minimal main exercises the AST manually:
auto five = std::make_unique<NumberExpr>(5.0);
auto three = std::make_unique<NumberExpr>(3.0);
auto plus = std::make_unique<BinaryExpr>(TokenKind::Plus,
std::move(five),
std::move(three));
Evaluator e;
std::cout << e.eval(*plus) << "\n"; // → 8
(For the actual interactive evaluator we go through the parser; this is just to confirm the AST machinery works in isolation.)
Next
→ 03-recursive-descent-parser.md — build the AST from tokens.
Step 3 — The Recursive-Descent Parser
Goal: turn a token stream into the AST you designed in Step 2, respecting precedence and associativity.
The Grammar (Recap From CONCEPTS.md)
expr = term { ('+' | '-') term }
term = factor { ('*' | '/') factor }
factor = NUMBER
| '(' expr ')'
| '-' factor
Three rules, three functions. Recursive descent is a literal transcription.
The Cursor Helpers
const Token& Parser::peek() const { return tokens_[pos_]; }
const Token& Parser::advance() { return tokens_[pos_++]; }
bool Parser::check(TokenKind k) const { return peek().kind == k; }
bool Parser::match(TokenKind k) { if (check(k)) { advance(); return true; } return false; }
const Token& Parser::expect(TokenKind k, const char* msg) {
if (!check(k)) throw ParseError(/* … */);
return advance();
}
Five helpers cover every recursive-descent parser ever written:
- peek: look without consuming
- advance: consume one
- check: does peek's kind match?
- match: if so, consume and return true
- expect: must match, or throw
This vocabulary scales: Clang's parser uses essentially the same primitives, just with richer diagnostics in expect.
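The five helpers can be exercised on their own, without a grammar. The sketch below is an assumption-laden stand-in: Cursor is a hypothetical class name, the TokenKind set is trimmed, std::runtime_error replaces ParseError, and advance is clamped at the Eof sentinel so that peek can never index past the end (the one-line versions above rely on the caller never advancing past Eof).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

enum class TokenKind { Number, Plus, LParen, RParen, Eof };
struct Token { TokenKind kind; double value = 0.0; };

class Cursor {
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
public:
    // Assumption of this sketch: the stream is terminated by an Eof token;
    // one is appended if the caller forgot it.
    explicit Cursor(std::vector<Token> toks) : tokens_(std::move(toks)) {
        if (tokens_.empty() || tokens_.back().kind != TokenKind::Eof)
            tokens_.push_back({TokenKind::Eof});
    }
    const Token& peek() const { return tokens_[pos_]; }
    const Token& advance() {
        const Token& t = tokens_[pos_];
        if (t.kind != TokenKind::Eof) ++pos_;    // never step past the sentinel
        return t;
    }
    bool check(TokenKind k) const { return peek().kind == k; }
    bool match(TokenKind k) {
        if (check(k)) { advance(); return true; }
        return false;
    }
    const Token& expect(TokenKind k, const char* msg) {
        if (!check(k)) throw std::runtime_error(msg);
        return advance();
    }
};
```

With the clamp in place, a parser that keeps calling advance at end of input sees Eof repeatedly instead of reading out of bounds.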
parseExpr — Lowest Precedence
ExprPtr Parser::parseExpr() {
ExprPtr left = parseTerm();
while (check(TokenKind::Plus) || check(TokenKind::Minus)) {
TokenKind op = peek().kind; advance();
ExprPtr right = parseTerm();
left = std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));
}
return left;
}
Reads aloud: "parse a term, then while the next token is + or -, consume the operator, parse another term, and wrap the previous result as the new left."
Why The While Loop = Left Associativity
Trace 5 - 3 - 1:
Initial: left = parseTerm() = (5)
Iter 1: token = -, consume
right = parseTerm() = (3)
left = BinaryExpr(-, (5), (3)) ← (5 - 3)
Iter 2: token = -, consume
right = parseTerm() = (1)
left = BinaryExpr(-, (5-3), (1)) ← ((5 - 3) - 1)
Done. evaluate → 1
Each iteration nests the previous left on the inside. The tree leans left. That's left-associativity.
To make - right-associative (5 - 3 - 1 = 5 - (3 - 1) = 3), you'd write:
ExprPtr right = parseExpr(); // recurse instead of loop
return std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));
Recursion-on-the-right ⇒ right-associativity. This is exactly how we'd handle ^ (exponentiation) in Step 6's extension.
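The loop-versus-recursion difference can be isolated from the parser entirely. In the sketch below the operands of a chain such as 5 - 3 - 1 are assumed to be already lexed into a non-empty vector; subtractLeftAssoc and subtractRightAssoc are hypothetical names that mirror the two parsing shapes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop form: each step wraps the running result as the new left operand,
// exactly like the while loop in parseExpr. Assumes xs is non-empty.
double subtractLeftAssoc(const std::vector<double>& xs) {
    double left = xs[0];
    for (std::size_t i = 1; i < xs.size(); ++i) left = left - xs[i];
    return left;
}

// Recursive form: the right operand is "the rest of the chain", like calling
// parseExpr for the right-hand side. Assumes xs is non-empty.
double subtractRightAssoc(const std::vector<double>& xs, std::size_t i = 0) {
    if (i + 1 == xs.size()) return xs[i];
    return xs[i] - subtractRightAssoc(xs, i + 1);
}
```

For the operands 5, 3, 1 the loop computes (5 - 3) - 1 and the recursion computes 5 - (3 - 1), so the two functions disagree on the same input.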
parseTerm — Same Pattern, Higher Precedence
ExprPtr Parser::parseTerm() {
ExprPtr left = parseFactor();
while (check(TokenKind::Star) || check(TokenKind::Slash)) {
TokenKind op = peek().kind; advance();
ExprPtr right = parseFactor();
left = std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));
}
return left;
}
Identical shape; the only changes are the operators consumed and the next-level call (parseFactor). Adding a new precedence level (modulo, comparison, etc.) means inserting another function in the call chain.
Why This = Higher Precedence Than +/-
When parseExpr calls parseTerm, control descends into parseTerm's while loop. That loop consumes all the * and / operators before returning. By the time parseTerm returns to parseExpr, the multiplication has already been packaged into a sub-tree. parseExpr then attaches that sub-tree as either left or right of its +/- node.
* binds tighter because parseTerm "grabs" its operators first — they're inside its loop, not parseExpr's.
parseFactor — Atoms and Recursion Back to parseExpr
ExprPtr Parser::parseFactor() {
if (check(TokenKind::Number)) {
double v = peek().value; advance();
return std::make_unique<NumberExpr>(v);
}
if (match(TokenKind::LParen)) {
ExprPtr inner = parseExpr(); // back to the top
expect(TokenKind::RParen, "expected ')'");
return inner;
}
if (check(TokenKind::Minus)) {
advance();
ExprPtr operand = parseFactor(); // right-recursive
return std::make_unique<UnaryExpr>(TokenKind::Minus, std::move(operand));
}
if (check(TokenKind::Error)) { /* propagate lex error */ }
throw ParseError(/* … */);
}
Three productions:
- NUMBER: consume and wrap.
- '(' expr ')': consume (, recurse all the way back to parseExpr (precedence resets!), expect ). This is how parens override precedence.
- '-' factor: unary minus. --5 works because the recursion is on parseFactor, not directly creating a Number.
Why Parens Override Precedence
(1 + 2) * 3:
- parseExpr → parseTerm → parseFactor
- parseFactor sees (, calls parseExpr again.
- That parseExpr (the inner one) parses 1 + 2 → BinaryExpr(+, 1, 2).
- Outer parseFactor consumes ), returns BinaryExpr(+, 1, 2).
- Control returns to outer parseTerm. Its left is (1+2). Loop sees *, parses 3. Builds BinaryExpr(*, (1+2), 3).
The parens temporarily gave us "expr-level reset" inside what would have been factor-level parsing. The grammar isn't doing anything magical; the recursion-back-to-parseExpr is.
Trip Through The Full Pipeline
For 2 * (3 + 4):
Tokens: NUMBER(2) STAR LPAREN NUMBER(3) PLUS NUMBER(4) RPAREN EOF
parseExpr
└─ parseTerm
   ├─ parseFactor → NumberExpr(2)
   ├─ sees STAR, consume; parseFactor:
   │  ├─ sees LPAREN, consume
   │  ├─ parseExpr (inner)
   │  │  ├─ parseTerm
   │  │  │  └─ parseFactor → NumberExpr(3)   (no * or / follows)
   │  │  ├─ sees PLUS, consume; parseTerm → parseFactor → NumberExpr(4)
   │  │  └─ returns BinaryExpr(+, 3, 4)
   │  └─ consume RPAREN; returns BinaryExpr(+, 3, 4)
   └─ wrap into BinaryExpr(*, 2, BinaryExpr(+, 3, 4))
Result: BinaryExpr(*, NumberExpr(2), BinaryExpr(+, NumberExpr(3), NumberExpr(4)))
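To watch precedence, parentheses, and unary minus interact without the Token and AST types, the three functions can be collapsed into a compact variant that works on characters and computes the value while parsing. This is a hypothetical sketch, not the lab's parser: MiniParser and evalString are illustrative names, std::runtime_error stands in for ParseError, and division by zero is not checked.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

class MiniParser {
    std::string src_;
    std::size_t pos_ = 0;

    void skipSpaces() {
        while (pos_ < src_.size() &&
               std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }
    // Next significant character, or '\0' at end of input.
    char peek() { skipSpaces(); return pos_ < src_.size() ? src_[pos_] : '\0'; }

    double parseFactor() {
        const char c = peek();
        if (c == '(') {
            ++pos_;
            const double v = parseExpr();        // precedence resets inside parens
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos_;
            return v;
        }
        if (c == '-') { ++pos_; return -parseFactor(); }
        if (std::isdigit(static_cast<unsigned char>(c))) {
            std::size_t used = 0;
            const double v = std::stod(src_.substr(pos_), &used);
            pos_ += used;
            return v;
        }
        throw std::runtime_error("expected number, '(' or '-'");
    }
    double parseTerm() {
        double left = parseFactor();
        for (char c = peek(); c == '*' || c == '/'; c = peek()) {
            ++pos_;
            const double right = parseFactor();
            left = (c == '*') ? left * right : left / right;
        }
        return left;
    }
    double parseExpr() {
        double left = parseTerm();
        for (char c = peek(); c == '+' || c == '-'; c = peek()) {
            ++pos_;
            const double right = parseTerm();
            left = (c == '+') ? left + right : left - right;
        }
        return left;
    }
public:
    explicit MiniParser(std::string s) : src_(std::move(s)) {}
    double parse() {
        const double v = parseExpr();
        if (peek() != '\0') throw std::runtime_error("unexpected token after expression");
        return v;
    }
};

double evalString(const std::string& s) { return MiniParser(s).parse(); }
```

The call structure is identical to the AST-building parser: replacing each arithmetic step with a std::make_unique<BinaryExpr> gives back the tree-producing version.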
Error Cases
| Input | What Happens |
|---|---|
| "" | parse() sees Eof immediately → throws "empty input" |
| 1 + | parseExpr parses 1, consumes +, calls parseTerm → parseFactor, which sees Eof and throws |
| (1 + 2 | parseFactor matches (, recurses, then expect(RParen) fails → throws "expected ')'" |
| 1 2 | parseExpr parses 1, no +/-, returns. parse() checks for Eof, finds NUMBER(2) → throws "unexpected token" |
| 1 / 0 | parses fine; the evaluator throws (Step 4) |
| 1 + abc | lexer emits Error token; parseFactor propagates it |
All six error cases in the table live in this lab. Phase 9 (Diagnostics) replaces these one-line throws with Clang-style source spans + fix-it hints, but the detection logic is identical.
Why Is This Called "LL(1)"?
- Left-to-right scan of input.
- Leftmost derivation produced.
- 1 token of lookahead (we only ever call peek() once per decision; never peek(2)).
LL(1) is the most common parsing class for hand-written compilers. Some real languages need LL(2) or even unbounded lookahead (C++ template parsing) — Clang uses tentative parsing in those spots.
Next
→ 04-the-evaluator.md — walk the tree and produce a number.
Step 4 — The Evaluator (Tree-Walking Interpreter)
Goal: given the AST, compute the number it represents.
Three Lines Per Node
double Evaluator::visit(NumberExpr& n) { return n.value; }
double Evaluator::visit(BinaryExpr& b) {
double l = b.lhs->accept(*this);
double r = b.rhs->accept(*this);
switch (b.op) {
case TokenKind::Plus: return l + r;
case TokenKind::Minus: return l - r;
case TokenKind::Star: return l * r;
case TokenKind::Slash:
if (r == 0.0) throw EvalError("division by zero");
return l / r;
default: throw EvalError("unknown binary operator");
}
}
double Evaluator::visit(UnaryExpr& u) {
double v = u.operand->accept(*this);
switch (u.op) {
case TokenKind::Minus: return -v;
default: throw EvalError("unknown unary operator");
}
}
That's the entire interpreter for arithmetic. ~25 lines.
Post-Order Traversal
visit(BinaryExpr&) evaluates both children first, then combines. This is post-order traversal, the only correct order for expression evaluation. (Pre-order would try to "+" before knowing what to "+".)
        BinaryExpr(*)
        /           \
  BinaryExpr(+)   NumberExpr(3)
    /     \
  (1)     (2)
Evaluation order (post-order):
1. NumberExpr(1) → 1
2. NumberExpr(2) → 2
3. BinaryExpr(+) → 1 + 2 = 3
4. NumberExpr(3) → 3
5. BinaryExpr(*) → 3 * 3 = 9
In visit(BinaryExpr&), the two accept(*this) lines recursively evaluate the subtrees. The C++ call stack mirrors the tree shape: each level of nesting adds an accept frame and a visit frame, so a 10-deep expression makes roughly 20 stack frames.
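The post-order claim can be checked by recording each node as its evaluation completes. The sketch below uses a hypothetical plain-struct tree (no visitor, op == 0 marks a leaf) so that the traversal order is the only thing on display; num, bin, and evalTraced are illustrative names, and only + and * are handled.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical plain-struct tree; op == 0 marks a number leaf.
struct Node {
    char op = 0;
    double value = 0.0;
    std::unique_ptr<Node> lhs, rhs;
};

std::unique_ptr<Node> num(double v) {
    auto n = std::make_unique<Node>();
    n->value = v;
    return n;
}
std::unique_ptr<Node> bin(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->op = op;
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

// Evaluates post-order and appends each node to `trace` as it completes.
double evalTraced(const Node& n, std::string& trace) {
    if (n.op == 0) {
        trace += std::to_string(static_cast<int>(n.value)) + " ";
        return n.value;
    }
    const double l = evalTraced(*n.lhs, trace);   // left subtree first
    const double r = evalTraced(*n.rhs, trace);   // then the right subtree
    trace += n.op;                                // the operator comes last
    trace += ' ';
    return n.op == '+' ? l + r : l * r;           // only + and * in this sketch
}
```

For the tree drawn above, the trace lists the leaves 1 and 2, then +, then the leaf 3, then *, which is also the order a stack-based bytecode compiler would emit its instructions in.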
What accept(*this) Does, Step By Step
double l = b.lhs->accept(*this);
- b.lhs is std::unique_ptr<Expr>; dereferencing gives Expr&.
- Expr::accept is virtual; dispatched to (say) NumberExpr::accept.
- That calls visitor.visit(*this) with *this = NumberExpr&.
- Overload resolution picks Evaluator::visit(NumberExpr&).
- Returns n.value.
For each node visit there's: 2 virtual calls (accept, then visit through the ExprVisitor interface), 1 overload resolution (done at compile time, so free at run time), 1 chain of work. The indirect calls are the main cost of tree-walking and are exactly what bytecode VMs eliminate.
Errors Surface Bottom-Up
Division by zero at any depth throws EvalError, which unwinds through every accept/visit frame back to the top-level eval. The CLI catches it and prints a message.
This works because C++ exceptions propagate transparently through virtual calls. Manual error-handling — say, returning a sentinel — would require checking after every recursive call. Exceptions handle the "anywhere in this subtree" case for free.
Modern compilers like Clang largely don't use exceptions in their AST passes (the LLVM project disables them with -fno-exceptions). They use Expected<T> / ErrorOr<T> value types that force explicit handling. We use exceptions here for simplicity; future labs will switch when the costs/benefits flip.
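For contrast, here is roughly what the value-based style looks like on the same evaluation problem. Everything in this sketch is hypothetical: EvalResult is a deliberately tiny stand-in (llvm::Expected<T> is considerably richer), and Node is a plain struct rather than the lab's visitor-based AST.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical result type: a value, or a non-empty error message.
struct EvalResult {
    double value = 0.0;
    std::string error;                       // empty means success
    bool ok() const { return error.empty(); }
};

struct Node {                                // op == 0 marks a number leaf
    char op = 0;
    double value = 0.0;
    std::unique_ptr<Node> lhs, rhs;
};

EvalResult evalChecked(const Node& n) {
    if (n.op == 0) return {n.value, ""};
    EvalResult l = evalChecked(*n.lhs);
    if (!l.ok()) return l;                   // explicit propagation after every call
    EvalResult r = evalChecked(*n.rhs);
    if (!r.ok()) return r;
    switch (n.op) {
        case '+': return {l.value + r.value, ""};
        case '-': return {l.value - r.value, ""};
        case '*': return {l.value * r.value, ""};
        case '/':
            if (r.value == 0.0) return {0.0, "division by zero"};
            return {l.value / r.value, ""};
        default:  return {0.0, "unknown binary operator"};
    }
}
```

The two "if (!x.ok()) return x;" lines are the cost the exception-based evaluator avoids; in exchange, no error can cross a call boundary unnoticed.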
What Happens For 3 * -4?
Tokens: NUMBER(3) STAR MINUS NUMBER(4) EOF
Parser:
- parseExpr → parseTerm
- parseTerm → parseFactor → NumberExpr(3)
- sees *, consume
- parseFactor → sees -, consume → parseFactor recurse → NumberExpr(4) → UnaryExpr(-, 4)
- builds BinaryExpr(*, NumberExpr(3), UnaryExpr(-, NumberExpr(4)))
Evaluation:
- visit(BinaryExpr *):
  - left = visit(NumberExpr 3) = 3
  - right = visit(UnaryExpr -):
    - operand = visit(NumberExpr 4) = 4
    - return -4
  - return 3 * -4 = -12
Same shape as any other binary op — the unary minus is just one extra level of recursion.
Why Tree-Walking Is Slow
Per node visit:
- 2 virtual dispatches, accept and then visit (cheap when the indirect branch is predicted; tens of cycles each on a mispredict or when the branch predictor is cold)
- 1 pointer-indirection to the next node (likely cache miss for big trees)
- C++ function-call overhead (stack frame, saved registers)
As an illustrative order of magnitude for a 1000-node expression: ~30-50µs evaluating in a tree-walker, ~5µs in a bytecode VM, ~1µs in native code.
Numbers vary wildly by language and CPU; the ratio is consistent. We'll measure precisely in cp-06 when we have a bytecode VM to compare against.
Why It's Still Worth Building
For learning, tree-walkers win:
- Smallest code surface — every concept is local.
- Easy to debug: print the tree, watch it walk.
- Easy to add features: new node type → new visit overload.
Tree-walkers are also adequate for many real workloads — config languages, shell scripts, build files, query languages. Performance-critical paths get bytecode (cp-06+); the rest stay tree-walked.
Try It
cd src/cpp
cmake -B build && cmake --build build
./build/eval "1 + 2 * 3"
# 7
./build/eval "((1 + 2) * (3 + 4) - 5) / 2"
# 8
./build/eval "1 / 0"
# division by zero
Next
→ 05-repl-tests-and-cli.md — wire the front-end into a usable CLI + tests.
Step 5 — REPL, Tests, and CLI
Goal: wire lexer + parser + evaluator into a real tool, with a test suite and an interactive REPL.
The CLI Driver — main.cpp
int main(int argc, char** argv) {
if (argc == 1) return arith::repl();
// gather argv[1..] into one expression for unquoted use
std::ostringstream oss;
for (int i = 1; i < argc; ++i) {
if (i > 1) oss << ' ';
oss << argv[i];
}
return arith::evaluateOnce(oss.str());
}
Two modes:
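- One-shot: ./eval "1 + 2 * 3" or ./eval 1 + 2 + 3. The unquoted form only works while the shell leaves the arguments alone; anything containing * or parentheses needs quotes.
- REPL: ./eval with no args.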
The REPL Loop
static int repl() {
std::cout << "arith repl (cp-02). type 'quit' or Ctrl-D to exit.\n";
std::string line;
for (;;) {
std::cout << "> " << std::flush;
if (!std::getline(std::cin, line)) { std::cout << "\n"; return 0; }
if (line == "quit" || line == "exit") return 0;
if (line.empty()) continue;
evaluateOnce(line);
}
}
Every iteration: read a line, send it through the whole pipeline, print. No persistent state between lines — variables and bindings come in cp-03.
Building It
cd src/cpp
cmake -B build # configure once
cmake --build build # build (incremental)
CMake breakdown:
- arithlib: static library with lexer.cpp, parser.cpp, evaluator.cpp.
- eval: executable that links arithlib and main.cpp.
- test_eval: executable that links arithlib and the test file.
Separating library from main lets us reuse the same compiler internals from tests — a pattern we keep in every later lab.
The Test Suite — test_eval.cpp
Assert-based, no test-framework dependency. 19 tests across 5 categories:
// arithmetic
assert(APPROX(eval("1 + 2"), 3.0));
assert(APPROX(eval("10 - 4"), 6.0));
// precedence
assert(APPROX(eval("1 + 2 * 3"), 7.0));
// left-associativity
assert(APPROX(eval("10 - 3 - 2"), 5.0));
// parens
assert(APPROX(eval("(1 + 2) * 3"), 9.0));
// unary
assert(APPROX(eval("-(3 + 4)"), -7.0));
// floats / whitespace
assert(APPROX(eval("3.14 + 0.86"), 4.0));
// error cases
assert(throws(""));
assert(throws("1 +"));
assert(throws("1 / 0"));
assert(throws("(1 + 2"));
The macro APPROX(a, b) uses std::fabs((a) - (b)) < 1e-9 because comparing floats with == is fragile. throws(...) runs a snippet inside a try/catch and returns whether any std::exception was thrown.
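The two helpers described above might look like the following. This is a sketch under assumptions: the real test file is not shown here, so evalStub is a hypothetical stand-in for the lab's eval that understands only a single number (via std::stod, which throws on input it cannot convert).

```cpp
#include <cassert>
#include <cmath>
#include <exception>
#include <string>

// As described above: absolute-difference comparison with a fixed tolerance.
#define APPROX(a, b) (std::fabs((a) - (b)) < 1e-9)

// Hypothetical stand-in for eval(): parses one number and nothing else.
// std::stod throws std::invalid_argument when no conversion is possible.
double evalStub(const std::string& src) { return std::stod(src); }

// True if evaluating `src` throws anything derived from std::exception.
bool throws(const std::string& src) {
    try {
        (void)evalStub(src);
    } catch (const std::exception&) {
        return true;
    }
    return false;
}
```

A fixed absolute tolerance is adequate for the small magnitudes in this test suite; for very large or very small results a relative tolerance would be the safer choice.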
Running The Tests
ctest --test-dir build
Or directly:
./build/tests/test_eval
# cp-02 tests: 19/19 PASS
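Expected: 19/19 tests pass. If anything fails, the failing assert aborts immediately with a line number. Note that assert is compiled out when NDEBUG is defined, which CMake's Release configuration does by default with GCC and Clang, so run the tests from a build that does not define it.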
Why No gtest / catch2?
Pulling in a framework would mean a find_package (often a git submodule), a CMake config file, and 10 MB of headers. For 19 trivial tests, plain assert is shorter, faster to build, and removes a teaching distraction.
We'll graduate to a real framework around cp-08, when individual test cases benefit from structured fixtures and parameterization.
Manual Sanity Checks
The classic checklist:
./build/eval "1 + 2 * 3" # 7 (precedence)
./build/eval "(1 + 2) * 3" # 9 (parens override)
./build/eval "10 - 3 - 2" # 5 (left associativity)
./build/eval "3 * -4" # -12 (unary in factor)
./build/eval "((((42))))" # 42 (parens nest)
./build/eval "1 / 0" 2>&1 # division by zero
./build/eval "1 +" 2>&1 # parse error
If any of these don't match, suspect: parser precedence (Step 3), evaluator dispatch (Step 4), error propagation (parser → eval).
A Tour Of A Failure
Suppose you accidentally swapped left-associativity for right in parseExpr:
// WRONG:
ExprPtr right = parseExpr(); // recursion instead of loop
Then:
- 10 - 3 - 2 → tree: (10 - (3 - 2)) = 10 - 1 = 9 (not 5!).
- Test assert(APPROX(eval("10 - 3 - 2"), 5.0)); fails immediately.
This is exactly why we write tests for associativity — the bug is subtle and silent without them.
Outcomes
You now have:
- A working evaluator binary, both REPL and one-shot.
- A 19-test test suite proving correctness across precedence, associativity, parens, unary, floats, whitespace, and 4 error categories.
- A CMake project structured for reuse: the same arithlib will be linked from cp-03's expanded language.
Next
→ 06-extensions.md — optional extension exercises that deepen each concept.
Step 6 — Extension Exercises
Goal: consolidate the concepts by extending the evaluator in small, focused ways. Each extension is self-contained — pick one or more. Solutions live nowhere in this repo on purpose; struggle is the point.
Exercise 1 — Add % (Modulo)
Difficulty: ★☆☆☆☆ (1 minute)
Add % with the same precedence as * and /.
Hints:
- Add Percent to TokenKind.
- Add the lexer case ('%').
- Add the parser check inside parseTerm's while-loop condition.
- Add the evaluator case: use std::fmod (not %, because the operands are double).
Verify:
10 % 3 = 1
10 % 3 % 2 = 1 (left-assoc)
Exercise 2 — Add ^ (Exponentiation, Right-Associative)
Difficulty: ★★☆☆☆
Add ^ with higher precedence than * and right-associativity.
Hints:
- Add a new precedence level between term and factor: call it power.
- Grammar:
term = power { ('*' | '/') power }
power = factor [ '^' power ] ; recursion on the right ⇒ right-assoc
- Note the [ ... ]: an exponent may be absent. The power body recurses into power, not loops.
Verify:
2 ^ 3 = 8
2 ^ 3 ^ 2 = 512 (i.e. 2 ^ (3 ^ 2) = 2 ^ 9 — NOT (2^3)^2 = 64)
2 * 3 ^ 2 = 18 (^ binds tighter than *)
Exercise 3 — AST Pretty-Printer
Difficulty: ★★☆☆☆
Add a Printer : ExprVisitor<std::string> that returns a string representation. Two flavors:
(a) S-expression:
2 * (3 + 4) → (* 2 (+ 3 4))
(b) Indented tree:
BinaryExpr(*)
├── NumberExpr(2)
└── BinaryExpr(+)
├── NumberExpr(3)
└── NumberExpr(4)
This exercise proves the Visitor pattern: no AST classes change, just a new visitor. Production debug tools (-ast-dump in Clang) do exactly this.
Hint: accept is currently ExprVisitor<double>-only. Either:
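- Templatize accept (template<typename R> R accept(ExprVisitor<R>&)). A member template cannot be virtual in C++, so this route needs a non-virtual accept that dispatches some other way, for example on a kind tag, or
- Add a separate accept_string(ExprVisitor<std::string>&), or
- Use a non-visitor print(Expr&) function with a switch on a tag (simpler).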
Exercise 4 — Source Locations in Errors
Difficulty: ★★★☆☆
Make errors report the column they occurred at:
> 1 + ?
^
parse error: expected number, '(' or '-' (got ERROR)
Hints:
- Add std::size_t pos to Token.
- Lexer records pos_ at the start of each token.
- ParseError carries the position. The catch site prints the source line + a ^ at that column.
This is a tiny taste of cp-15 (Diagnostics), where we'd add file:line:col, fix-it hints, and color.
Exercise 5 — Constant Folding Pass
Difficulty: ★★★☆☆
Write an Optimizer : ExprVisitor<ExprPtr> that returns a new AST with constants folded:
BinaryExpr(+, NumberExpr(2), NumberExpr(3))
→ NumberExpr(5)
Optimize the children first. If both optimized children are NumberExpr, compute the result; otherwise return a new BinaryExpr with the optimized children.
Verify by:
- Printing the AST before and after.
- Confirming the evaluator still returns the same number.
This is your first compiler optimization pass. The same pattern appears in LLVM's -instcombine, JVM's C1 constant folding, and Clang's EvaluatedExprVisitor.
Exercise 6 — Disallow Trailing Garbage
Difficulty: ★☆☆☆☆
The parser already does this (parse() checks for Eof). Confirm with:
./eval "1 + 2 3"
# parse error: unexpected token '3' after expression
Now: make the error message highlight where the garbage starts — combine with Exercise 4.
Exercise 7 — REPL History and Last Result
Difficulty: ★★☆☆☆
Make _ in the REPL refer to the previous result:
> 1 + 2
3
> _ * 4
12
Hint: the REPL keeps a lastResult variable. The lexer recognizes _ as a special token. The parser allows it as a factor.
Exercise 8 — Bench Visitor vs Tagged Switch vs std::variant
Difficulty: ★★★★☆ (more about C++ than compilers)
Build the tree once, evaluate it 1M times. Measure with <chrono>:
- via the Visitor (accept + virtual);
- via a tagged switch (replace virtuals with kind switch);
- via std::variant + std::visit.
Predict which is fastest. Verify. Discuss why.
(Spoiler: tagged switch usually wins by ~2× on hot trees because the branch predictor can latch on. Virtual dispatch loses to indirect-branch mispredicts. This is the exact reason bytecode VMs win over tree-walkers.)
Done?
When you've internalized the concepts (even without doing every extension):
- Mark this lab complete.
- Move to ../cp-03-minilang-frontend/, where the language grows to include variables, control flow, functions, and we switch to Pratt parsing.
cp-03 — MiniLang Frontend (Statements, Variables, Control Flow, Functions)
Status: ✅ Built — 34/34 tests pass.
Replaces the arithmetic grammar from cp-02 with a real language: let/var bindings, if/while, functions, blocks, closures. Switches the parser from hand-rolled recursive descent to a Pratt parser for expressions.
What You'll Build
- A full Pratt expression parser driven by a binding-power table.
- A recursive-descent statement parser: let/var, blocks, if/else, while, return, print, fn.
- Two-tier AST (Stmt + templated-return Expr<R> visitors).
- Scope-chain Environment (parent-linked maps) and lexical closures.
- A tree-walking interpreter with first-class functions, recursion, higher-order calls.
- An mli REPL and file runner.
Reading Order
- CONCEPTS.md — Pratt parsing, binding powers, statement/expression split, closures.
- src/cpp/steps/ — seven guided steps (tokens → lexer → AST → Pratt → environment/interpreter → functions → REPL+tests).
- src/cpp/src/ — the full source.
Prereqs
- cp-02 complete (recursive descent + AST + Visitor internalized).
Outcomes
- Hand-write a Pratt parser for any precedence-rich expression language.
- Understand the statement/expression distinction and where each fits in the AST.
- Implement lexical scope via a parent-linked environment chain.
- Build first-class functions and closures using shared environments.
Build & Run
cd src/cpp
cmake -S . -B build && cmake --build build -j
ctest --test-dir build --output-on-failure
./build/mli # REPL
./build/mli script.ml # run a file
Sample
fn fact(n) { if (n <= 1) return 1; return n * fact(n - 1); }
fn fib(n) { if (n < 2) return n; return fib(n-1) + fib(n-2); }
print fact(10); // 3628800
print fib(15); // 610
fn makeAdder(a) { fn add(b) { return a + b; } return add; }
let plus5 = makeAdder(5);
print plus5(10); // 15
01 — Tokens and the Lexer
The lexer is the boundary between raw text and structured data. Its job is single-responsibility: consume a character stream and emit a flat stream of typed tokens. Every subsequent phase sees tokens, never characters.
The token inventory for MiniLang
cp-03 extends the arithmetic token set from cp-02 with keywords and punctuation that support a full statement language:
// literals
NUMBER "123" STRING "hello" IDENT "myVar"
// keywords
LET VAR FN IF ELSE WHILE RETURN PRINT TRUE FALSE NIL
// arithmetic & comparison
PLUS MINUS STAR SLASH PERCENT
EQ EQ_EQ BANG BANG_EQ LT LT_EQ GT GT_EQ
// logical
AND OR
// delimiters
LPAREN RPAREN LBRACE RBRACE COMMA SEMICOLON
// end-of-file sentinel
EOF
Key lexer decisions
One character of lookahead is enough for all MiniLang tokens.
= vs ==, ! vs !=, < vs <=, > vs >= all resolve with one
peek() call after consuming the first character.
Maximal munch: always consume the longest valid token. The lexer loop
calls advance() and then decides, not the other way round.
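The one-character-lookahead rule can be sketched for the comparison operators alone. The function below is a hypothetical miniature, not the lab's Lexer: scanOperators returns lexemes as strings, and every character that is not one of = ! < > is simply skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: scans only the one- and two-character operators and returns their
// lexemes. All other characters (including whitespace) are skipped.
std::vector<std::string> scanOperators(const std::string& src) {
    std::vector<std::string> out;
    std::size_t cur = 0;
    // match(): consume the next character only if it is the expected one.
    auto match = [&](char expected) {
        if (cur < src.size() && src[cur] == expected) { ++cur; return true; }
        return false;
    };
    while (cur < src.size()) {
        const char c = src[cur++];               // advance() first, then decide
        switch (c) {
            case '=': out.push_back(match('=') ? "==" : "="); break;
            case '!': out.push_back(match('=') ? "!=" : "!"); break;
            case '<': out.push_back(match('=') ? "<=" : "<"); break;
            case '>': out.push_back(match('=') ? ">=" : ">"); break;
            default:  break;
        }
    }
    return out;
}
```

Maximal munch falls out of the structure: on the input === the first two characters become == and the third becomes =, because match is always tried before the shorter token is emitted.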
Keyword recognition via a hash-map at the identifier stage:
const std::unordered_map<std::string, TokKind> keywords = {
{"let", TokKind::Let},
{"var", TokKind::Var},
{"fn", TokKind::Fn},
// ...
};
When an identifier is scanned, look it up in the table. If it's there, emit
the keyword token; otherwise emit IDENT. This keeps the lexer loop clean:
no per-keyword branches in the main switch.
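A self-contained version of the lookup might look like this. The sketch makes assumptions the text does not confirm: identifierKind is a hypothetical helper name, the TokKind enumerators are trimmed to identifiers and keywords, and the keyword spellings beyond those visible in the sample program (let, fn, if, return, print) are guesses.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

enum class TokKind { Ident, Let, Var, Fn, If, Else, While, Return, Print, True, False, Nil };

// Classifies an already-scanned identifier lexeme: keyword if it is in the
// table, plain identifier otherwise. Spellings are assumed to be lowercase.
TokKind identifierKind(const std::string& lexeme) {
    static const std::unordered_map<std::string, TokKind> keywords = {
        {"let", TokKind::Let},       {"var", TokKind::Var},
        {"fn", TokKind::Fn},         {"if", TokKind::If},
        {"else", TokKind::Else},     {"while", TokKind::While},
        {"return", TokKind::Return}, {"print", TokKind::Print},
        {"true", TokKind::True},     {"false", TokKind::False},
        {"nil", TokKind::Nil},
    };
    const auto it = keywords.find(lexeme);
    return it == keywords.end() ? TokKind::Ident : it->second;
}
```

Because the lookup happens only after the whole identifier has been scanned, a name such as letter is never mistaken for the keyword let followed by ter.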
Character classification helpers
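// The <cctype> functions require a value representable as unsigned char (or EOF);
// passing a negative char is undefined behavior, hence the casts.
static bool isAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) || c == '_'; }
static bool isAlNum(char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }
static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)); }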
_ is part of identifiers in MiniLang (as in most mainstream languages), so it's
included in both isAlpha and isAlNum.
Lexer structure
class Lexer {
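std::string source_; // owned copy: a const std::string& member would dangle
                     // when the Lexer is built from a temporary or a literal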
size_t start_ = 0; // start of current token
size_t cur_ = 0; // current scan position
int line_ = 1; // for error reporting
char advance();
char peek() const;
char peekNext() const;
bool match(char expected);
Token makeToken(TokKind);
Token scanToken();
public:
Lexer(const std::string& source);
std::vector<Token> scanAll();
};
The scanAll() loop calls scanToken() until it sees the source end, then
appends an EOF token and returns. Callers get a vector<Token> — a flat,
random-access stream. This is important: the Pratt parser (cp-03 step 03)
peeks and consumes non-linearly.
Source location on every token
struct Token {
TokKind kind;
std::string lexeme;
int line;
};
line tracks the source line number. The lexer increments line_ on every
\n. Later phases use line for error messages. A real production lexer
stores column too; for now line is enough.
Try it
After writing the lexer, scan a source string and print the token stream:
Lexer lex("let x = 2 + 3;\nif (x > 4) { print x; }");
for (auto& tok : lex.scanAll())
std::cout << tok.line << "\t" << tok.lexeme << "\n";
Expected output:
1 let
1 x
1 =
1 2
1 +
1 3
1 ;
2 if
...
This linear token dump is the best first debugging tool for any lexer.
02 — The AST
The AST is the contract between the parser and every downstream phase. Get it right and the interpreter, type checker, resolver, and codegen all become straightforward tree walks. Get it wrong and every phase carries workarounds.
Two-tier design: Stmt + Expr
MiniLang separates the grammar cleanly into statements (things with side effects but no value) and expressions (things that produce a value). Some languages blur this line (expressions as statements, last-expression-is-the-return-value, etc.); MiniLang keeps them distinct so the AST shape guides the interpreter.
Statement nodes
struct LetStmt { std::string name; bool immutable; ExprPtr init; int line; };
struct AssignStmt{ std::string name; ExprPtr value; int line; };
struct IfStmt { ExprPtr cond; StmtPtr then; StmtPtr else_; int line; };
struct WhileStmt{ ExprPtr cond; StmtPtr body; int line; };
struct ReturnStmt{ ExprPtr value; int line; };
struct PrintStmt { ExprPtr value; int line; };
struct ExprStmt { ExprPtr expr; int line; };
struct BlockStmt { std::vector<StmtPtr> body; int line; };
Expression nodes
struct NumberExpr { double value; int line; };
struct StringExpr { std::string value; int line; };
struct BoolExpr { bool value; int line; };
struct NilExpr { int line; };
struct VarExpr { std::string name; int line; };
struct UnaryExpr { TokKind op; ExprPtr operand; int line; };
struct BinaryExpr { TokKind op; ExprPtr left; ExprPtr right; int line; };
struct CallExpr { ExprPtr callee; std::vector<ExprPtr> args; int line; };
struct FnExpr { std::vector<std::string> params; StmtPtr body; int line; };
FnExpr is an anonymous function literal (fn(x, y) { ... }). Named
functions (fn foo(x) { ... }) desugar into let foo = fn(x) { ... } at
parse time.
Ownership with unique_ptr
Every node owns its children:
using ExprPtr = std::unique_ptr<Expr>;
using StmtPtr = std::unique_ptr<Stmt>;
No parent pointers, no shared ownership. The AST is a tree (DAG-free), so a single-owner, depth-first-owned hierarchy matches its shape exactly. Destruction is recursive and automatic when the root goes out of scope.
The Visitor pattern
Every downstream phase (interpreter, resolver, type checker) needs to walk the AST without modifying the node types. The Visitor pattern provides that extension point:
// Returns T per expression node.
template<typename T>
struct ExprVisitor {
virtual T visitNumber(NumberExpr&) = 0;
virtual T visitVar(VarExpr&) = 0;
virtual T visitBinary(BinaryExpr&) = 0;
// ... one method per expression node kind
};
Each expression node implements:
template<typename T>
T Expr::accept(ExprVisitor<T>& v);
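The interpreter inherits ExprVisitor<Value>, the type-checker inherits
ExprVisitor<TypePtr>, the printer inherits ExprVisitor<std::string>.
Adding a new pass doesn't change any AST file. (A member template cannot
be virtual in C++, so an accept declared this way has to dispatch on
something other than a vtable slot, for example a kind tag in the node.)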
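One way a templated accept can be made to work is sketched below: the base class stores a kind tag, and a single non-virtual accept template switches on it and downcasts. This is an assumption about a possible implementation, not a description of the lab's source; the Kind enum, the char operator, and the NodeCounter pass are all hypothetical, and only + and * are evaluated.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

struct NumberExpr;
struct BinaryExpr;

template <typename T>
struct ExprVisitor {
    virtual ~ExprVisitor() = default;
    virtual T visitNumber(NumberExpr&) = 0;
    virtual T visitBinary(BinaryExpr&) = 0;
};

struct Expr {
    enum class Kind { Number, Binary };
    const Kind kind;
    explicit Expr(Kind k) : kind(k) {}
    virtual ~Expr() = default;
    template <typename T> T accept(ExprVisitor<T>& v);   // non-virtual template
};
using ExprPtr = std::unique_ptr<Expr>;

struct NumberExpr : Expr {
    double value;
    explicit NumberExpr(double v) : Expr(Kind::Number), value(v) {}
};
struct BinaryExpr : Expr {
    char op;
    ExprPtr left, right;
    BinaryExpr(char o, ExprPtr l, ExprPtr r)
        : Expr(Kind::Binary), op(o), left(std::move(l)), right(std::move(r)) {}
};

// Defined after the node types so the static_casts see complete classes.
template <typename T>
T Expr::accept(ExprVisitor<T>& v) {
    switch (kind) {
        case Kind::Number: return v.visitNumber(static_cast<NumberExpr&>(*this));
        case Kind::Binary: return v.visitBinary(static_cast<BinaryExpr&>(*this));
    }
    throw std::logic_error("unknown Expr kind");
}

struct Evaluator : ExprVisitor<double> {
    double visitNumber(NumberExpr& n) override { return n.value; }
    double visitBinary(BinaryExpr& b) override {
        const double l = b.left->accept(*this);
        const double r = b.right->accept(*this);
        return b.op == '+' ? l + r : l * r;
    }
};

struct NodeCounter : ExprVisitor<int> {        // a second pass, different T
    int visitNumber(NumberExpr&) override { return 1; }
    int visitBinary(BinaryExpr& b) override {
        return 1 + b.left->accept(*this) + b.right->accept(*this);
    }
};
```

Both passes walk the same tree through the same accept, one returning double and the other int; the trade-off is that the switch in accept must be updated whenever a node kind is added.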
Line numbers on every node
Every node stores an int line. This is the source line where the node
began. Passes that emit errors use it:
throw RuntimeError("[line " + std::to_string(node.line) + "] ...");
A production compiler would store a full source span (start/end offset); a line number is sufficient for teaching and for cp-03–cp-05 diagnostics.
Design check: where do FnStmt and FnExpr differ?
fn foo(x) { ... } is syntactic sugar for let foo = fn(x) { ... }.
At parse time the parser sees the fn keyword followed by an identifier,
converts it to a LetStmt wrapping a FnExpr, and the rest of the
compiler never needs a FnDecl node. This simplification:
- Keeps the namespace of a function as a variable binding (consistent with let semantics).
- Makes closures and first-class functions automatic: a fn literal is just a value.
- Means the resolver treats function declarations identically to variable declarations (both create a binding; both allow shadowing at new scope).
03 — Pratt Parsing: Expression Precedence Without Grammar Rules
Recursive-descent parsers grow one function per precedence tier:
parseExpr → parseAddSub → parseMulDiv → parseUnary → parsePrimary.
With 5 levels that's fine. With 15 it becomes a maintenance burden:
adding ** (exponentiation) means writing a new tier function, rewiring
its neighbours in the call chain, and getting the associativity and slot
correct by hand.
Pratt parsing collapses all expression precedence levels into one function controlled by a numeric binding-power table.
Binding powers
Each operator has a left-binding power (how tightly it binds to the left)
and a right-binding power (how tightly it binds to the right). With the
strict > comparison in the loop below, an operator is left-associative
when rbp ≥ lbp (the table uses rbp = lbp + 1) and right-associative when
rbp < lbp (rbp = lbp - 1).
| Operator | Left BP | Right BP | Assoc |
|---|---|---|---|
| = | 2 | 1 | right |
| or | 3 | 4 | left |
| and | 5 | 6 | left |
| == != | 7 | 8 | left |
| < <= > >= | 9 | 10 | left |
| + - | 11 | 12 | left |
| * / % | 13 | 14 | left |
| unary - ! | n/a | 15 | n/a |
| call ( | 17 | 18 | left |
Higher numbers bind tighter. The loop condition while (lbp(peek) > minBp)
continues consuming operators as long as the next one binds tighter than
the caller's minimum.
The Pratt loop
ExprPtr parseExpr(int minBp = 0) {
    auto left = parsePrimary();              // nud: no left context yet
    while (lbp(peek()) > minBp) {
        Token op = advance();
        auto right = parseExpr(rbp(op));     // led: parse the right operand at the operator's rbp
        left = makeBinary(op, std::move(left), std::move(right));
    }
    return left;
}
parsePrimary handles prefix forms (numbers, identifiers, (expr),
unary -, !, fn). The loop handles infix forms by taking the current
left side and calling parseExpr(rbp) for the right.
Parsing function calls in Pratt
Function call f(x, y) is an infix operator with a ( left-denotation:
// Inside the while loop, when op.kind == LPAREN:
std::vector<ExprPtr> args;
while (peek().kind != RPAREN) {
args.push_back(parseExpr(0));
if (!match(COMMA)) break;
}
expect(RPAREN);
left = makeCall(move(left), move(args));
No special parseCallExpr function — it's handled by the left-binding
power of ( being high (17/18), making call tighter than any arithmetic.
Why associativity matters
Given a = b = c:
- Right-associativity (rbp = lbp - 1): parses as a = (b = c). Assignment chains work because the innermost assignment runs first and its result is stored outward.
- Left-associativity (rbp = lbp + 1): would parse as (a = b) = c, trying to assign to a temporary, which is wrong.
Given a - b - c:
- Left-assoc: (a - b) - c, which is correct for subtraction.
- Right-assoc: a - (b - c), which is wrong.
The binding power table encodes this precisely: no per-operator branches in the Pratt loop.
Testing the expression parser in isolation
Before connecting it to the statement parser, write a small test driver:
std::string src = "1 + 2 * 3 - -4";
Lexer lex(src);
Parser p(lex.scanAll());
auto e = p.parseExpr(0);
// Pretty-print: "(- (+ 1 (* 2 3)) (- 4))" — left-associative and
// unary applied before binary
If the parenthesisation matches your expectations, the binding-power table is correct.
04 — Recursive-Descent Statement Parsing
Expressions handle values and operators. Statements handle control flow, declarations, and side-effects. The two halves live in different parser methods and produce different AST node types.
Statement dispatch
The top-level parse method peeks at the current token and dispatches:
StmtPtr parseStmt() {
switch (peek().kind) {
case TokKind::Let: return parseLet();
case TokKind::Var: return parseLet();
case TokKind::If: return parseIf();
case TokKind::While: return parseWhile();
case TokKind::Return: return parseReturn();
case TokKind::Print: return parsePrint();
case TokKind::Fn: return parseFnDecl();
case TokKind::LBrace: return parseBlock();
default: return parseExprStmt();
}
}
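Each branch consumes exactly the tokens it owns. Statements that end in ;
(let/var, return, print, and expression statements) consume it themselves;
block-bodied statements (if, while, fn, and bare blocks) end at their closing }.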
Blocks create scope
parseBlock reads {, a list of statements, then }:
StmtPtr parseBlock() {
int line = advance().line; // consume {
std::vector<StmtPtr> body;
while (peek().kind != RBrace && peek().kind != Eof)
body.push_back(parseStmt());
expect(RBrace);
return std::make_unique<BlockStmt>(move(body), line);
}
The interpreter will create a new Environment child for every
BlockStmt, so blocks naturally scope variable declarations.
if with optional else
StmtPtr parseIf() {
int line = advance().line; // consume 'if'
expect(LParen);
auto cond = parseExpr(0);
expect(RParen);
auto then = parseBlock(); // always a block
StmtPtr else_;
if (match(Else)) else_ = peek().kind == If ? parseIf() : parseBlock();
return std::make_unique<IfStmt>(move(cond), move(then), move(else_), line);
}
else if chains are implemented by letting else consume another if
statement, producing a right-recursive tree. No special elif keyword.
while is simpler
StmtPtr parseWhile() {
int line = advance().line;
expect(LParen); auto cond = parseExpr(0); expect(RParen);
auto body = parseBlock();
return std::make_unique<WhileStmt>(move(cond), move(body), line);
}
Named function declaration → desugar
StmtPtr parseFnDecl() {
int line = advance().line; // consume 'fn'
auto name = expect(Ident).lexeme;
auto fn = parseFnBody(line); // parses (params) { body }
// Desugar into: let name = fn(params) { body }
return std::make_unique<LetStmt>(name, /*immutable=*/true, move(fn), line);
}
This is the key simplification: the interpreter's visitLet handles both
variable declarations and function declarations uniformly. A function is
just a value bound to a name.
Panic-mode error recovery
When the parser hits something unexpected, it throws or calls a sync
function that skips tokens until it finds a synchronisation point:
void sync() {
while (peek().kind != Eof) {
if (previous().kind == Semicolon) return;
switch (peek().kind) {
case Fn: case Let: case Var: case If:
case While: case Return: return;
default: advance();
}
}
}
After sync, parsing resumes at the next statement boundary. In a REPL
this means one bad expression doesn't lock up the session. In a file run
it means a single error doesn't suppress everything downstream.
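A common variant of this routine advances once before scanning, so the parser always makes progress even when the error was raised without consuming the offending token. The sketch below shows that variant over a bare token vector; TokenCursor and the Tok enum are hypothetical stand-ins for the lab's Parser and TokKind, and the vector is assumed to end with Tok::Eof.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Tok { Ident, Number, RParen, Semicolon,
                 Let, Var, Fn, If, While, Return, Print, Eof };

// Hypothetical cursor; `toks` must end with Tok::Eof.
struct TokenCursor {
    std::vector<Tok> toks;
    std::size_t pos = 0;

    Tok peek() const { return toks[pos]; }
    Tok previous() const { return toks[pos - 1]; }
    void advance() { if (peek() != Tok::Eof) ++pos; }

    // Skip the offending token, then discard tokens until a statement boundary:
    // either just after a ';' or just before a statement keyword.
    void sync() {
        advance();
        while (peek() != Tok::Eof) {
            if (previous() == Tok::Semicolon) return;
            switch (peek()) {
                case Tok::Fn: case Tok::Let: case Tok::Var: case Tok::If:
                case Tok::While: case Tok::Return: case Tok::Print:
                    return;
                default:
                    advance();
            }
        }
    }
};
```

The trade-off is that a statement keyword sitting exactly at the error position is skipped too; whether that is acceptable depends on where the parser reports errors relative to the tokens it has consumed.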
The expression-statement bridge
StmtPtr parseExprStmt() {
int line = peek().line;
auto e = parseExpr(0);
expect(Semicolon);
return std::make_unique<ExprStmt>(move(e), line);
}
Function calls at statement position (foo(42);) hit this path. The
expression is evaluated for side effects; its value is discarded.
05 — Environments and Lexical Scope
The environment model answers one question: when a variable is used, which binding does it refer to? MiniLang uses lexical scope — the binding is determined by where the code is written, not where it is called.
The Environment structure
class Environment {
std::unordered_map<std::string, Value> vars_;
std::shared_ptr<Environment> parent_;
public:
explicit Environment(std::shared_ptr<Environment> parent = nullptr);
void define(const std::string& name, Value v);
Value& get(const std::string& name);
void set(const std::string& name, Value v);
};
get walks up the parent_ chain until it finds the name or reaches the
top-level (null parent) and throws "undefined variable". set does the
same but writes back instead of reading.
Chain creation for blocks and calls
When entering a block:
void Interpreter::visitBlock(BlockStmt& b) {
auto child = std::make_shared<Environment>(env_);
std::swap(env_, child);
for (auto& s : b.body) execute(*s);
std::swap(env_, child); // restore on exit (RAII alternative below)
}
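One way to write the RAII alternative is a small guard object whose destructor restores the previous environment, so env_ is put back even when an exception (a runtime error, or a return signal) unwinds through the block. This is a sketch under assumptions: ScopedEnv is a hypothetical name, and Environment is reduced to its parent link.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Stand-in for the lab's Environment: only the parent link matters here.
struct Environment {
    std::shared_ptr<Environment> parent;
    explicit Environment(std::shared_ptr<Environment> p = nullptr)
        : parent(std::move(p)) {}
};

// Installs a fresh child environment on construction and restores the
// previous one on destruction, including during stack unwinding.
class ScopedEnv {
    std::shared_ptr<Environment>& slot_;
    std::shared_ptr<Environment> saved_;
public:
    explicit ScopedEnv(std::shared_ptr<Environment>& slot)
        : slot_(slot), saved_(slot) {
        slot_ = std::make_shared<Environment>(saved_);
    }
    ~ScopedEnv() { slot_ = std::move(saved_); }
    ScopedEnv(const ScopedEnv&) = delete;
    ScopedEnv& operator=(const ScopedEnv&) = delete;
};

// Returns true if the environment pointer is restored after a throwing body.
bool restoresAfterThrow() {
    auto env = std::make_shared<Environment>();
    auto before = env;
    try {
        ScopedEnv guard(env);  // env now points at a child of `before`
        if (env->parent != before) return false;
        throw std::runtime_error("simulated runtime error");
    } catch (const std::runtime_error&) {
        // guard's destructor has already run by the time we get here
    }
    return env == before;
}
```

In visitBlock the body would shrink to constructing the guard and running the statement loop; the second swap disappears.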
When calling a function, a fresh environment is created with the function's closure (the captured enclosing environment) as parent — not the caller's current environment. This is what makes lexical scope different from dynamic scope.
Why shared_ptr?
A closure can outlive the scope that created it:
fn makeCounter() {
var n = 0;
fn inc() { n = n + 1; return n; }
return inc;
}
let c = makeCounter();
print c(); // 1
print c(); // 2
Here inc captures the environment created when makeCounter ran, and
that environment holds n. After makeCounter returns, the n binding
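is still alive because the closure inc holds a shared_ptr to it. When the
last reference to the closure goes away, the shared_ptr ref-count can drop
to zero and the environment is freed. (Caveat: inc is itself stored in the
environment it captures, so the two form a shared_ptr cycle that plain
reference counting does not reclaim on its own.)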
If environments were stored on the stack by value, the closure would hold
a dangling reference. shared_ptr is the minimal-complexity solution;
real VMs use heap-allocated scope frames instead.
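The lifetime argument can be checked directly in C++. The sketch below is an analogue of makeCounter, not the interpreter's code: Env is a hypothetical stripped-down environment, and the optional observer parameter exists only so a caller can watch when the frame is destroyed.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

struct Env {
    std::unordered_map<std::string, double> vars;
    std::shared_ptr<Env> parent;
};

// Plays the role of makeCounter: the returned callable keeps `frame` alive
// after this function has returned.
std::function<double()> makeCounter(std::weak_ptr<Env>* observer = nullptr) {
    auto frame = std::make_shared<Env>();  // makeCounter's call frame
    frame->vars["n"] = 0;
    if (observer) *observer = frame;       // lets the caller watch the lifetime
    return [frame]() {                     // the closure captures the frame
        return frame->vars["n"] += 1;
    };
}
```

Calling the result twice yields 1 and then 2, and the frame stays alive for exactly as long as the callable does. Unlike the interpreter, this analogue does not store the closure inside the frame it captures, so no reference cycle arises here.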
The define vs set distinction
- define always writes in the current environment (creates a new slot).
- set walks the parent chain to find an existing binding and updates it.
This matters for:
var x = 1;
{
var x = 2; // define: creates a NEW x in the inner scope
x = 3; // set: updates the INNER x
}
print x; // still 1 — the outer x was never touched
Without the distinction, x = 3 inside the block would climb to the outer
x, breaking scope.
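A minimal sketch of the two operations, with values simplified to double so the block stands alone. Scope is a hypothetical name; the lab's Environment stores Value instead.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

class Scope {
    std::unordered_map<std::string, double> vars_;
    std::shared_ptr<Scope> parent_;
public:
    explicit Scope(std::shared_ptr<Scope> parent = nullptr)
        : parent_(std::move(parent)) {}

    // define: always creates (or overwrites) a slot in *this* scope.
    void define(const std::string& name, double v) { vars_[name] = v; }

    // set: updates the nearest existing binding, walking outward.
    void set(const std::string& name, double v) {
        for (Scope* s = this; s; s = s->parent_.get()) {
            auto it = s->vars_.find(name);
            if (it != s->vars_.end()) { it->second = v; return; }
        }
        throw std::runtime_error("undefined variable '" + name + "'");
    }

    // get: reads the nearest binding, walking outward.
    double get(const std::string& name) const {
        for (const Scope* s = this; s; s = s->parent_.get()) {
            auto it = s->vars_.find(name);
            if (it != s->vars_.end()) return it->second;
        }
        throw std::runtime_error("undefined variable '" + name + "'");
    }
};
```

Defining x in an inner scope and then calling set("x", 3) there changes only the inner slot; calling set on a name that exists only in the outer scope updates the outer slot.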
Scope chain depth
Every define at block entry and every scope exit is O(1). Every get and
set is O(depth) — proportional to the nesting depth of scopes. In
practice depth is small (rarely > 10 for real programs), so this is
acceptable.
cp-04 introduces depth annotations on variable uses that let the interpreter do one hash-map lookup at the right depth instead of walking every parent:
Value& getAt(int depth, const std::string& name); // cp-04 addition
For cp-03, the naive walk is fine and pedagogically clearer.
06 — First-Class Functions and Closures
A language has first-class functions when functions can be:
- Stored in variables
- Passed as arguments
- Returned from functions
- Constructed at runtime
MiniLang supports all four, and the implementation is a small addition to what step 05 already built.
The Value type
struct FnValue {
std::vector<std::string> params;
Stmt* body; // non-owning pointer into the AST; AST is stable
std::shared_ptr<Environment> closure; // captured scope
};
using Value = std::variant<
std::monostate, // nil
double, // number
bool,
std::string,
std::shared_ptr<FnValue>
>;
FnValue holds the parameter list, a pointer to the function body in the
AST, and the captured environment at definition time. The closure field
is the one shared_ptr chain that makes closures work.
Calling a function
Value Interpreter::visitCall(CallExpr& call) {
Value callee = evaluate(*call.callee);
auto fn = std::get<std::shared_ptr<FnValue>>(callee); // or throw type error
if (call.args.size() != fn->params.size()) throw ...;
// Build the call frame with the closure as parent.
auto frame = std::make_shared<Environment>(fn->closure);
for (size_t i = 0; i < fn->params.size(); ++i)
frame->define(fn->params[i], evaluate(*call.args[i]));
auto saved = env_;
env_ = frame;
try { execute(*fn->body); }
catch (ReturnSignal& r) { env_ = saved; return r.value; }
env_ = saved;
return std::monostate{}; // nil if no return statement
}
ReturnSignal is a C++ exception used as a non-local transfer out of
visitBlock chains when a return statement fires. It's not an error —
it's a structured jump. This avoids threading a "should I keep executing?"
flag through every interpreter method.
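The control-flow shape can be isolated from the interpreter. In this sketch statements are modelled as plain callables (the Stmt alias here is a hypothetical stand-in, not the lab's AST type), and the call boundary is the only place that catches the signal.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Deliberately not derived from std::exception: it is a jump, not an error.
struct ReturnSignal { double value; };

using Stmt = std::function<void()>;  // hypothetical stand-in for an AST statement

// Runs statements in order; a "return" unwinds out of any nesting depth.
void runBlock(const std::vector<Stmt>& body) {
    for (const auto& s : body) s();
}

// Mirrors visitCall: catch the signal at the function boundary only.
double callFunction(const std::vector<Stmt>& body) {
    try {
        runBlock(body);
    } catch (const ReturnSignal& r) {
        return r.value;
    }
    return 0.0;  // stands in for nil when no return statement fires
}
```

A statement that throws ReturnSignal{42} from inside a nested block stops every enclosing block at once, and callFunction returns 42 without any block having to check a flag. One consequence worth noting: a handler written as catch (const std::exception&) will not intercept this type.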
Closures capture the environment, not variables
fn adder(x) {
fn add(y) { return x + y; }
return add;
}
let add5 = adder(5);
let add10 = adder(10);
print add5(3); // 8
print add10(3); // 13
When adder(5) is called:
- A frame for adder is created with x = 5.
- The FnExpr for add captures that frame as its closure.
- adder returns the FnValue.
When add5(3) is called:
- A new frame is created with add5's closure (the adder frame) as parent.
- y = 3 is defined in that frame.
- x + y resolves: y is in the current frame, x is in the captured adder frame.
add10 has its own separate adder frame with x = 10. The two closures
don't share state.
The mutation case
fn makeCounter() {
var count = 0;
fn inc() { count = count + 1; return count; }
return inc;
}
let c = makeCounter();
print c(); // 1
print c(); // 2
Here count = count + 1 inside inc calls env_->set("count", ...).
set walks the parent chain, finds count in the captured makeCounter
frame, and updates it in place. The next call to c() builds a new call
frame with the same closure, sees the updated count, and returns 2.
Mutable closures work for free with the parent-chain set semantics.
Tail calls and stack overflow
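cp-03 does not implement tail-call optimisation. Every MiniLang call nests
several C++ calls (visitCall, execute, visitBlock, and so on), so a deep
recursion, such as a function that calls itself 100,000 times before
returning, can exhaust the C++ stack and segfault. (fib(40) is not such a
case: its recursion is only about 40 frames deep, so it is slow rather than
stack-hungry.) This is a known limitation; the solution is
continuation-passing or an explicit stack in the interpreter, deferred to
later labs.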
07 — REPL, Tests, and Extensions
The final step wires the components into a usable tool and verifies the interpreter with automated tests.
The REPL loop
void repl(std::istream& in, std::ostream& out) {
auto global = std::make_shared<Environment>();
Interpreter interp(global);
std::string line;
while (true) {
out << "> ";
if (!std::getline(in, line)) break;
try {
Lexer lex(line);
Parser p(lex.scanAll());
auto stmts = p.parse();
for (auto& s : stmts) interp.execute(*s);
} catch (const std::exception& e) {
out << "error: " << e.what() << "\n";
}
}
}
The key points:
- The same global environment persists across REPL lines: you can define a function on one line and call it on the next.
- Errors are caught and printed, not propagated, so the REPL doesn't die on a bad expression.
- Each line is re-lexed and re-parsed; no incremental state.
File execution
int main(int argc, char** argv) {
if (argc == 1) { repl(std::cin, std::cout); return 0; }
std::ifstream f(argv[1]);
if (!f) { std::cerr << "cannot open " << argv[1] << "\n"; return 74; }
std::string src((std::istreambuf_iterator<char>(f)),
std::istreambuf_iterator<char>());
Lexer lex(src);
Parser p(lex.scanAll());
auto stmts = p.parse();
Interpreter interp(std::make_shared<Environment>());
for (auto& s : stmts) interp.execute(*s);
return 0;
}
The test harness
cp-03 uses a hand-rolled test harness consistent with later labs. Each test runs a source string through the full pipeline and checks the output or thrown message:
static int g_checks = 0, g_passed = 0;
#define CHECK_EQ(a, b) do { ++g_checks; \
if ((a) == (b)) ++g_passed; \
else std::cerr << "FAIL " << __LINE__ << ": " << (a) << " != " << (b) << "\n"; \
} while(0)
// Example test
void test_closure() {
auto out = run("fn f(a) { fn g(b) { return a + b; } return g; } print f(3)(4);");
CHECK_EQ(out, "7\n");
}
run(src) lex-parses-interprets and returns the captured stdout. This
lets every test be written as a one-liner source program.
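The "captured stdout" half of run can be built by temporarily swapping the stream buffer of std::cout. The helper below is a sketch: captureStdout is a hypothetical name, and the lex/parse/interpret pipeline that run would invoke inside the callable is omitted.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Runs `body` with std::cout redirected into a buffer and returns what it printed.
template <typename F>
std::string captureStdout(F&& body) {
    std::ostringstream buffer;
    std::streambuf* old = std::cout.rdbuf(buffer.rdbuf());
    try {
        body();
    } catch (...) {
        std::cout.rdbuf(old);  // always restore the real stream
        throw;
    }
    std::cout.rdbuf(old);
    return buffer.str();
}
```

This only works if the interpreter's print statement writes to std::cout; an interpreter that takes an std::ostream& can be handed an std::ostringstream directly instead.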
Extending MiniLang — next steps
These extensions each add one focused concept without rewriting the interpreter:
1. Native functions
Add a NativeValue variant to Value holding a std::function<Value(vector<Value>)>.
Register built-ins like clock(), sqrt(), len() in the global environment
before executing user code.
2. Arrays
Add VecValue = shared_ptr<vector<Value>>. Add arr[i] subscript as a
special CallExpr-like AST node, arr.push(x) as a method call.
3. Classes
Add class Foo { ... } syntax → a ClassDef node. Instances are
environments whose parent is the class's method map. this is a binding
in the method's call frame pointing to the instance environment.
4. for loops
Desugar into while at parse time:
for (let i = 0; i < n; i = i + 1) { body }
→
{ let i = 0; while (i < n) { body; i = i + 1; } }
No new interpreter support needed.
5. Continuation-based non-local flow
Replace ReturnSignal exception with a Continuation value that wraps
the rest of the execution as a callable — the basis for coroutines and
generators.
Each extension exercises a different compiler engineering concept. The interpreter's visitor-based architecture absorbs new node types without touching existing ones — which is the point of choosing Visitor over ad-hoc dispatch in step 02.
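As an illustration of extension 1 (native functions), the sketch below adds a native-function alternative to a Value variant and registers two built-ins. It is one possible shape, not the lab's implementation: NativeFn, Globals, installBuiltins and callNative are hypothetical names, and the global environment is reduced to a plain map.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

struct NativeFn;  // forward declaration so Value can hold a pointer to it
using Value = std::variant<std::monostate, double, bool, std::string,
                           std::shared_ptr<NativeFn>>;

struct NativeFn {
    std::size_t arity;
    std::function<Value(const std::vector<Value>&)> impl;
};

using Globals = std::unordered_map<std::string, Value>;

// Registers built-ins before any user code runs.
void installBuiltins(Globals& g) {
    g["sqrt"] = std::make_shared<NativeFn>(NativeFn{1,
        [](const std::vector<Value>& args) -> Value {
            return std::sqrt(std::get<double>(args[0]));
        }});
    g["len"] = std::make_shared<NativeFn>(NativeFn{1,
        [](const std::vector<Value>& args) -> Value {
            return static_cast<double>(std::get<std::string>(args[0]).size());
        }});
}

// What visitCall would do once the callee turns out to be native.
Value callNative(const Globals& g, const std::string& name,
                 const std::vector<Value>& args) {
    const auto& fn = std::get<std::shared_ptr<NativeFn>>(g.at(name));
    if (args.size() != fn->arity)
        throw std::runtime_error("arity mismatch calling " + name);
    return fn->impl(args);
}
```

Because natives are ordinary values in the global environment, user code can pass them around exactly like MiniLang functions; only the call path needs a second branch.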
cp-04 — Symbol Tables & the Resolver Pass
Status: ✅ Built · all tests passing
A second compiler pass that runs between parsing and execution. The resolver:
- maintains a scope stack (lexical scopes seen so far),
- annotates every variable use with the lexical depth at which its binding lives,
- statically detects a class of bugs the cp-03 interpreter would only catch at runtime — or worse, miss entirely.
The interpreter is then changed to use the annotation: it hops depth parents and does a single hash lookup in the right scope, instead of probing every hash map up the parent chain.
What's new vs cp-03
| Aspect | cp-03 | cp-04 |
|---|---|---|
| Phases | lex → parse → interpret | lex → parse → resolve → interpret |
| Variable lookup | dynamic walk of parent envs | static depth + one hash lookup at the right scope |
| let vs var | parsed but treated identically | let immutable (resolver rejects assignment), var mutable |
| Errors caught | mostly at runtime | undefined, redecl, self-init, assign-to-let, top-level return |
Source layout (src/cpp/)
src/
token.hpp / lexer.{hpp,cpp} # unchanged from cp-03
value.hpp # unchanged
ast.hpp # adds DeclKind + depth fields
parser.{hpp,cpp} # records DeclKind & node line numbers
environment.hpp # adds getAt(depth)/assignAt(depth)
resolver.{hpp,cpp} # NEW — the static-analysis pass
interpreter.{hpp,cpp} # uses depth when available
main.cpp # runs resolver before interpreter
tests/test_resolver.cpp # 15 assertions (regression + new diagnostics)
Build & test
cd src/cpp
cmake -S . -B build -G "Unix Makefiles"
cmake --build build -j
ctest --test-dir build --output-on-failure
Expected: cp-04 tests: 15/15 PASS.
Sample diagnostics
$ cat > bad.ml <<'EOF'
{ let x = 1; let x = 2; x = 3; return 99; }
EOF
$ build/mli bad.ml
[line 1] resolver: redeclaration of 'x' in the same scope (previous at line 1)
[line 1] resolver: cannot assign to immutable binding 'x' (declared with 'let')
[line 1] resolver: 'return' outside of a function
The resolver reports all problems at once, then main exits 1 without
running a single instruction. This is the same shape as a C compiler:
fail in the front end, never reach codegen.
See src/cpp/steps/ for the build-up
- 01-why-a-separate-pass.md: what runtime-only resolution costs us
- 02-ast-changes.md: adding DeclKind and the depth slot
- 03-the-resolver-walk.md: visitor over Stmt/Expr, scope stack
- 04-declare-then-define.md: the trick that catches let x = x;
- 05-let-vs-var.md: immutability check on assignment
- 06-getAt-and-fast-lookup.md: wiring depth into the interpreter
- 07-error-recovery.md: collecting diagnostics instead of throwing
01 — The Resolver Pass: Why and What
The interpreter in cp-03 resolves variable names at runtime, by walking the environment chain on every access. That works but has two problems:
- Performance: every read or write probes a hash map in each scope along the chain until the name is found.
- Correctness: a name-based walk picks whichever binding is nearest at run time, so the interpreter can silently read from the wrong scope when names shadow each other across closure boundaries.
The resolver pass is a static analysis pass that runs after parsing and
before interpreting. It walks the AST once, builds a map of (VarExpr* → depth), and annotates every variable use with the exact environment
depth at which its binding lives. The interpreter can then hop directly to
that environment and do a single hash lookup there.
The scope of the problem
var a = "global";
fn outer() {
fn inner() { print a; }
inner();
}
outer();
With the cp-03 runtime chain walk:
- When print a runs, the interpreter looks in inner's frame, then outer's frame, then global, finds a = "global" and prints it.
- This works correctly.
But:
var a = "global";
fn showA() {
print a;
}
fn test() {
var a = "local";
showA();
}
test();
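Lexical scope says showA should print "global": a in its body
refers to the a visible where showA was defined, not where it is called.
A runtime chain walk starting from the call-site environment would
incorrectly find "local" from the caller test. cp-03 avoids this particular
failure by parenting each call frame to the closure rather than the caller,
but a name-based walk still picks whichever binding is nearest at run time,
so a declaration added to a captured scope after the closure was created
can change what the closure reads.
The resolver removes the ambiguity: it records, at resolve time, which
binding each use of a name refers to. Here a is a global, so it gets no
depth annotation and is always read from globals_; a local gets a fixed hop
count, measured from the environment in which the use executes and followed
along the closure's parent chain, never the caller's.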
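One situation where a name-based walk and a resolved depth disagree is a declaration added to a captured scope after the closure was created. The sketch below reproduces that situation with a stripped-down environment. Frame and lateShadowDemo are hypothetical names, values are plain strings, and the call frame is assumed to be the environment in which the use of a executes.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

struct Frame {
    std::unordered_map<std::string, std::string> vars;
    std::shared_ptr<Frame> parent;

    // cp-03 style: probe every map from here outward.
    const std::string& get(const std::string& name) const {
        for (const Frame* f = this; f; f = f->parent.get()) {
            auto it = f->vars.find(name);
            if (it != f->vars.end()) return it->second;
        }
        throw std::out_of_range("undefined variable " + name);
    }

    // cp-04 style: hop exactly `depth` parents, then one lookup.
    const std::string& getAt(int depth, const std::string& name) const {
        const Frame* f = this;
        for (int i = 0; i < depth; ++i) f = f->parent.get();
        return f->vars.at(name);
    }
};

// Returns {name-walk result, depth-lookup result} for a use of `a` inside a
// closure, after a later declaration of `a` was added to the captured block.
std::pair<std::string, std::string> lateShadowDemo() {
    auto outer = std::make_shared<Frame>();
    outer->vars["a"] = "outer";
    auto block = std::make_shared<Frame>();
    block->parent = outer;        // the closure is created here: no `a` in block yet
    block->vars["a"] = "block";   // a later declaration in the same block
    Frame call;                   // call frame, parented to the closure's scope
    call.parent = block;
    // 2 hops (call -> block -> outer) is what was visible when the use was resolved.
    return { call.get("a"), call.getAt(2, "a") };
}
```

The name walk returns "block" because that binding is now nearest, while the depth lookup still returns "outer", the binding that was in scope where the closure was written.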
What the resolver produces
A std::unordered_map<Expr*, int> locals_ in the interpreter:
class Resolver {
std::vector<std::unordered_map<std::string, bool>> scopes_;
Interpreter& interp_; // writes into interp_.locals_
public:
void resolve(std::vector<StmtPtr>& stmts);
private:
void resolveLocal(Expr* e, const std::string& name);
void beginScope();
void endScope();
};
resolveLocal(expr, name) searches scopes_ from the innermost outward.
When it finds the name in scopes_[i], it records scopes_.size() - 1 - i (the
"distance" from the current scope) in interp_.locals_[expr].
The interpreter's lookUpVariable
Value Interpreter::lookUpVariable(const std::string& name, Expr* e) {
auto it = locals_.find(e);
if (it != locals_.end())
return env_->getAt(it->second, name); // depth hops + one hash lookup
return globals_->get(name);
}
Variables not found in locals_ are globals — they're not annotated
because they live at depth 0 from the global environment, which the
interpreter always has a direct pointer to.
When to run the resolver
After parsing the full source, before execution:
auto stmts = parser.parse();
Resolver resolver(interp);
resolver.resolve(stmts);
for (auto& s : stmts) interp.execute(*s);
The resolver pass is one linear traversal of the AST. It touches every
node exactly once. After it runs, the interpreter's locals_ map is
fully populated and every subsequent local lookup goes straight to the
annotated depth with a single hash lookup.
02 — The Scope Stack
The resolver needs to track which names are in scope at each point in the
AST. It does this with a scope stack: a vector of maps, each map
representing one lexical scope.
The structure
std::vector<std::unordered_map<std::string, bool>> scopes_;
Each map entry has type bool:
- false: the variable has been declared (the slot exists) but its initialiser hasn't finished evaluating yet.
- true: the variable has been fully defined (initialiser is done).
This two-stage state catches the self-initialisation error:
let x = x + 1; // error: x used in its own initialiser
When the resolver sees let x = ..., it first declares x (inserts
"x" → false), resolves the initialiser (during which x is still false),
then defines x (sets "x" → true). If the initialiser references x,
visitVarExpr sees false and reports an error before anything is interpreted.
Scope boundaries
void Resolver::beginScope() {
scopes_.emplace_back(); // push empty map
}
void Resolver::endScope() {
scopes_.pop_back();
}
Called around:
- Every BlockStmt
- Every function body (one scope for the function's parameters, one for the body block, or combined)
void Resolver::visitBlock(BlockStmt& b) {
beginScope();
for (auto& s : b.body) resolve(*s);
endScope();
}
Looking up across the stack
void Resolver::resolveLocal(Expr* e, const std::string& name) {
for (int i = (int)scopes_.size() - 1; i >= 0; --i) {
if (scopes_[i].count(name)) {
interp_.locals_[e] = (int)scopes_.size() - 1 - i;
return;
}
}
// Not found → global; no annotation needed
}
The depth (size - 1 - i) is 0 for the innermost scope, 1 for one level
up, and so on. The interpreter's getAt(depth, name) walks the environment
chain exactly depth steps:
Value& Environment::getAt(int depth, const std::string& name) {
Environment* env = this;
for (int i = 0; i < depth; ++i) env = env->parent_.get();
return env->vars_.at(name);
}
Why a vector of maps, not a single map?
A single global map can't track shadowing: if x is declared at depth 2
and redeclared at depth 0, both need entries that point to different depths.
A vector of maps makes the stack structure explicit and indexable. The
outermost scope is scopes_[0], the innermost is scopes_.back().
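The distance computation can be exercised on its own, without an AST. In this sketch ScopeStack is a hypothetical wrapper around the same vector of maps, and distance returns the value resolveLocal would store, or no value when the name falls through to the globals.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

class ScopeStack {
    std::vector<std::unordered_map<std::string, bool>> scopes_;
public:
    void beginScope() { scopes_.emplace_back(); }
    void endScope() { scopes_.pop_back(); }
    void declare(const std::string& name) {
        if (!scopes_.empty()) scopes_.back()[name] = false;
    }
    void define(const std::string& name) {
        if (!scopes_.empty()) scopes_.back()[name] = true;
    }

    // Distance from the innermost scope to the scope holding `name`,
    // or nullopt when the name is not local (treated as a global).
    std::optional<int> distance(const std::string& name) const {
        for (int i = static_cast<int>(scopes_.size()) - 1; i >= 0; --i)
            if (scopes_[i].count(name))
                return static_cast<int>(scopes_.size()) - 1 - i;
        return std::nullopt;
    }
};
```

Declaring x in an outer scope and y in an inner one gives distance 0 for y and 1 for x; pushing a third scope that redeclares x drops its distance back to 0 until that scope is popped.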
The global scope
The top-level scope is not pushed onto scopes_. Global variables are
resolved by falling through the entire stack search without finding the
name. The interpreter then looks them up directly in globals_. This
separation means globals can be referenced before they're declared (e.g.
mutual recursion at the top level), which is a valid and useful feature.
03 — Declaring and Resolving Names
The resolver's two core operations are declaring a name (introducing a new binding) and resolving a reference (computing its depth).
Declaring a name
void Resolver::declare(const std::string& name) {
if (scopes_.empty()) return; // global — skip
auto& scope = scopes_.back();
if (scope.count(name))
reportError("Variable '" + name + "' already declared in this scope.");
scope[name] = false; // declared, not yet defined
}
void Resolver::define(const std::string& name) {
if (scopes_.empty()) return; // global — skip
scopes_.back()[name] = true; // fully defined
}
The declare/define split means:
- let x = x; → declare("x") sets x → false, resolve initialiser x → finds false → error "can't read local in its own initialiser".
- let x = 1; → declare("x"), resolve initialiser 1 (no names), define("x") → no error.
The visitLet method
void Resolver::visitLet(LetStmt& s) {
declare(s.name);
if (s.init) resolve(*s.init); // initialiser can NOT see s.name yet
define(s.name);
}
The visitVar / visitFunction
var works the same as let for the resolver — mutability is an
interpreter-level concern (cp-04 step 05), not a resolution concern.
Functions:
void Resolver::resolveFunction(FnExpr& fn) {
beginScope();
for (auto& param : fn.params) {
declare(param);
define(param); // params are immediately defined
}
resolve(*fn.body);
endScope();
}
Parameters are both declared and defined before the body is resolved. There's no initialiser for parameters — they're always provided by the caller.
The visitVarExpr method
void Resolver::visitVarExpr(VarExpr& e) {
if (!scopes_.empty()) {
auto it = scopes_.back().find(e.name);
if (it != scopes_.back().end() && it->second == false)
reportError("Can't read '" + e.name + "' in its own initialiser.");
}
resolveLocal(&e, e.name);
}
The self-initialiser check only looks at scopes_.back() — the current
scope. If the name is in an outer scope and is false, it's a different
(outer) variable in mid-initialisation, not a problem for the current
reference.
Block scopes don't "hoist"
In JavaScript, var declarations are hoisted to the top of the function
scope. In MiniLang, let and var are not hoisted: a reference
before the declaration never binds to the later declaration:
print x; // error: x is not defined yet
let x = 1;
The resolver only adds a name to scopes_.back() when it encounters the
let/var statement. Any VarExpr seen before that point falls through
the entire scope stack and is treated as a global. If x is not a global
either, the lookup fails. With the resolver as sketched here, which skips
globals in declare, that failure surfaces as an "undefined variable" error
from globals_->get at run time; reporting it at resolve time additionally
requires the resolver to record top-level declarations.
04 — Depth Annotations: O(1) Environment Lookups
After the resolver runs, every local variable reference has an integer
depth stored in interp_.locals_. This step shows how the interpreter
uses that information to avoid chain walking.
The getAt and setAt methods
Value& Environment::getAt(int depth, const std::string& name) {
Environment* e = this;
for (int i = 0; i < depth; ++i) e = e->parent_.get();
return e->vars_.at(name);
}
void Environment::setAt(int depth, const std::string& name, Value v) {
Environment* e = this;
for (int i = 0; i < depth; ++i) e = e->parent_.get();
e->vars_[name] = std::move(v);
}
The loop here is O(depth), but depth is a small constant fixed at resolve
time for each use site. More importantly, it avoids probing vars_ at every
intermediate scope: the only hash lookup happens in the target scope.
lookUpVariable and assignVariable
Value Interpreter::lookUpVariable(VarExpr& e) {
auto it = locals_.find(&e);
if (it != locals_.end())
return env_->getAt(it->second, e.name);
return globals_->get(e.name); // fallback to globals
}
void Interpreter::assignVariable(AssignExpr& e, Value v) {
auto it = locals_.find(&e);
if (it != locals_.end())
env_->setAt(it->second, e.name, std::move(v));
else
globals_->set(e.name, std::move(v));
}
The locals_ map key is the pointer to the AST node, not the variable
name. This is intentional: two different uses of x in the source (two
VarExpr nodes) can refer to bindings at different depths, and the
pointer uniquely identifies which AST node is being evaluated.
Why pointer keying is correct
let x = 1;
fn f() {
let x = 2;
print x; // VarExpr node A — depth 0 in f's scope
}
print x; // VarExpr node B — depth 0 in globals
Both uses are named x, but they are different AST nodes. The resolver
annotated A as depth=0 and B as not-in-locals (global). The interpreter
uses the pointer to distinguish them.
If the key were the name string, both would map to the same entry and one would be wrong.
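A minimal sketch of the side table makes the point concrete. VarExpr is cut down to a name, and Locals and depthOf are hypothetical names; -1 stands for "not annotated, so look in globals".

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct VarExpr { std::string name; };  // minimal stand-in for the AST node

// Side table keyed by node identity; an absent key means "global".
using Locals = std::unordered_map<const VarExpr*, int>;

// Returns the annotated depth, or -1 when the use refers to a global.
int depthOf(const Locals& locals, const VarExpr& use) {
    auto it = locals.find(&use);
    return it == locals.end() ? -1 : it->second;
}
```

Two VarExpr nodes that both carry the name x get independent answers because the table never looks at the name. The scheme relies on AST nodes keeping a stable address between the resolver pass and execution.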
Measuring the improvement
Without annotations, get is O(depth) and calls find at every
intermediate scope. With annotations, getAt is still O(depth) for the
pointer-chasing loop but does only one find, at the target scope. The real
saving is correctness (each use is pinned to one binding) more than raw speed.
For a program with average nesting depth 3:
- Without annotations: 3 × (hash_lookup + parent_deref) per variable use.
- With annotations: 3 × parent_deref + 1 × hash_lookup.
At scale this matters. Production VMs resolve names ahead of time for the same reason: JVM bytecode addresses local variables by slot number rather than by name, and V8's bytecode addresses registers and context slots by index, so no name lookup is needed for locals at run time.
Environment structure with slots
Production VMs go one step further: instead of named maps, each scope is
a slot array and each variable use has a slot index. getAt(depth, slot) is just pointer arithmetic:
// production VM sketch
env[depth].slots[slot]
cp-03/04 uses named maps for clarity. cp-06+ introduces bytecode with explicit stack slots, which is the array-indexed equivalent.
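To make the slot idea concrete, here is a sketch of an environment whose variables live in an array and are addressed by (depth, slot). SlotEnv is a hypothetical type, values are simplified to double, and the slot indices are assumed to have been assigned by an earlier pass.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct SlotEnv {
    std::vector<double> slots;  // one entry per variable declared in this scope
    std::shared_ptr<SlotEnv> parent;

    SlotEnv(std::size_t count, std::shared_ptr<SlotEnv> p)
        : slots(count), parent(std::move(p)) {}

    // (depth, slot) is fixed ahead of time, so no hashing happens at run time.
    double& at(int depth, std::size_t slot) {
        SlotEnv* e = this;
        for (int i = 0; i < depth; ++i) e = e->parent.get();
        return e->slots[slot];
    }
};
```

A read or write becomes depth pointer hops plus one array index. The cost is that every scope's size must be known before it is created, which is exactly the information a resolver pass can compute.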
05 — Immutability: let vs var
MiniLang distinguishes let (immutable binding) from var (mutable
binding). This distinction is enforced by the interpreter after the
resolver has already annotated depths.
Tracking mutability in the environment
The environment stores a mutability flag alongside each value:
struct Slot {
Value value;
bool mutable_;
};
std::unordered_map<std::string, Slot> vars_;
define stores the slot:
void Environment::define(const std::string& name, Value v, bool mutable_) {
vars_[name] = {std::move(v), mutable_};
}
set / setAt checks the flag:
void Environment::setAt(int depth, const std::string& name, Value v) {
Environment* e = this;
for (int i = 0; i < depth; ++i) e = e->parent_.get();
auto& slot = e->vars_.at(name);
if (!slot.mutable_)
throw RuntimeError("Cannot reassign 'let' binding '" + name + "'.");
slot.value = std::move(v);
}
The assignment check
When the interpreter visits an assignment expression:
Value Interpreter::visitAssign(AssignExpr& e) {
Value v = evaluate(*e.value);
assignVariable(e, v);
return v;
}
assignVariable → setAt → immutability check. If the target was bound
with let, a RuntimeError is thrown with a clear message. This is a
runtime check, not a static one.
Why not static?
Making immutability a static error (checked by the resolver) would require
tracking whether each name was declared as let or var in the scope
stack. That's doable: the scope map could store {bool defined, bool mutable},
and the sample diagnostics in the cp-04 README ("cannot assign to immutable binding 'x'") show the resolver reporting exactly this kind of error.
The choice in this step is pragmatic: static checking is strictly better for user experience (error before running), but it requires threading the mutability flag through two more data structures. For the curriculum, a runtime check demonstrates the concept clearly. cp-05 introduces the type-checker pass which is a static pass and shows how static checks are structured.
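For comparison, a static version of the check could look roughly like the sketch below: the scope map stores a mutability flag next to the defined flag, and assignments are checked against the nearest binding. BindingInfo, MutabilityScopes and checkAssign are hypothetical names, and globals are left out.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct BindingInfo { bool defined; bool isMutable; };

class MutabilityScopes {
    std::vector<std::unordered_map<std::string, BindingInfo>> scopes_;
    std::vector<std::string> errors_;
public:
    void beginScope() { scopes_.emplace_back(); }
    void endScope() { scopes_.pop_back(); }

    // isMutable is false for `let`, true for `var`.
    void declare(const std::string& name, bool isMutable) {
        if (!scopes_.empty()) scopes_.back()[name] = BindingInfo{true, isMutable};
    }

    // Called where the resolver visits an assignment `name = ...`.
    void checkAssign(const std::string& name, int line) {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found == it->end()) continue;
            if (!found->second.isMutable)
                errors_.push_back("[line " + std::to_string(line) +
                                  "] cannot assign to immutable binding '" + name + "'");
            return;  // the nearest binding decides; outer ones are shadowed
        }
        // Not local: globals would need their own table (omitted here).
    }

    const std::vector<std::string>& errors() const { return errors_; }
};
```

Because the nearest binding decides, a var x that shadows an outer let x can be assigned freely, and the outer binding becomes protected again once the inner scope ends.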
Shadowing across scopes
let x = 1;
{
var x = 2; // OK — new binding in inner scope, different slot
x = 3; // OK — this x is mutable
}
print x; // 1 — outer let x unchanged
Each let/var creates a new slot in its scope. Shadowing is allowed:
a var x in an inner scope doesn't make the outer let x mutable.
The resolver assigns separate depths to each, so the inner assignment
never reaches the outer slot.
The let design in practice
In real languages:
- Rust: let is immutable, let mut is mutable.
- JavaScript: const is immutable, let is mutable (confusingly opposite to MiniLang).
- Swift: let is immutable, var is mutable (same as MiniLang).
- Haskell: Everything is let-bound and immutable by default.
MiniLang follows Swift/Rust semantics. The pedagogical point is that
immutability is a property of the binding, not the value. A let
binding to a mutable array still allows mutating the array's contents;
it prevents rebinding the name to a different array.
06 — Static Errors: Redeclaration, Self-Init, Top-Level Return
The resolver catches three categories of semantic errors before the program runs, producing precise messages that point to the exact source location.
Redeclaration in the same scope
let x = 1;
let x = 2; // error: already declared in this scope
In the resolver's declare:
void Resolver::declare(const std::string& name, int line) {
if (scopes_.empty()) return;
auto& scope = scopes_.back();
if (scope.count(name))
throw ResolveError("[line " + std::to_string(line) +
"] Variable '" + name + "' already declared in this scope.");
scope[name] = false;
}
Redeclaration in a nested scope is allowed (shadowing). Only the same scope triggers the error:
let x = 1;
{
let x = 2; // OK — different scope
}
Self-referential initialiser
let x = x + 1; // error: can't read 'x' in its own initialiser
Detected in visitVarExpr when the found entry is false (declared but
not yet defined):
if (!scopes_.empty()) {
auto it = scopes_.back().find(name);
if (it != scopes_.back().end() && it->second == false)
throw ResolveError("[line " + std::to_string(line) +
"] Can't read '" + name + "' in its own initialiser.");
}
This distinguishes the bad case from the legitimate recursive case:
fn fib(n) {
if (n <= 1) return n;
return fib(n-1) + fib(n-2); // OK: fib is not in the innermost scope here
}
While the resolver is inside fib's body, scopes_.back() is the function's
own scope, which holds n but not fib, so the check never fires (at top
level, fib is a global and is not tracked in scopes_ at all). At run time
the let fib = fn... binding is complete before any call executes the body,
so the recursive reference finds a fully defined fib.
return outside a function
return 42; // error at top level
The resolver tracks whether it is currently inside a function:
enum class FunctionType { None, Function };
FunctionType currentFunction_ = FunctionType::None;
void Resolver::visitReturn(ReturnStmt& s) {
if (currentFunction_ == FunctionType::None)
throw ResolveError("[line " + std::to_string(s.line) +
"] Can't return from top-level code.");
if (s.value) resolve(*s.value);
}
void Resolver::resolveFunction(FnExpr& fn) {
auto enclosing = currentFunction_;
currentFunction_ = FunctionType::Function;
// ... resolve body ...
currentFunction_ = enclosing;
}
currentFunction_ is a scoped state flag, saved and restored when entering
and leaving each function. Nested functions work correctly because the
save/restore is a stack discipline.
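One way to make the save/restore hard to forget is an RAII guard. This is a sketch under assumptions: FunctionScope and nestedFunctionsRestore are hypothetical names, and the lab's resolveFunction does the save/restore by hand as shown above.

```cpp
#include <cassert>

enum class FunctionType { None, Function };

// Hypothetical RAII helper: sets the flag on construction and restores the
// previous value on destruction, even if resolving the body throws.
class FunctionScope {
    FunctionType& slot_;
    FunctionType  saved_;
public:
    FunctionScope(FunctionType& slot, FunctionType entering)
        : slot_(slot), saved_(slot) { slot_ = entering; }
    ~FunctionScope() { slot_ = saved_; }
    FunctionScope(const FunctionScope&) = delete;
    FunctionScope& operator=(const FunctionScope&) = delete;
};

// Simulates: top level -> fn outer -> fn inner -> back out.
// True if the flag is Function inside both bodies and None afterwards.
bool nestedFunctionsRestore() {
    FunctionType current = FunctionType::None;
    bool insideOuter = false, insideInner = false, afterInner = false;
    {
        FunctionScope outer(current, FunctionType::Function);
        insideOuter = (current == FunctionType::Function);
        {
            FunctionScope inner(current, FunctionType::Function);
            insideInner = (current == FunctionType::Function);
        }
        afterInner = (current == FunctionType::Function);  // still in outer
    }
    return insideOuter && insideInner && afterInner &&
           current == FunctionType::None;
}
```

The guard matters most once ResolveError can propagate out of a function body: the destructor still restores the flag during unwinding.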
Error recovery
Each error throws a ResolveError exception. In a production compiler
you'd collect all errors and report them together. For the curriculum
the first error terminates resolution with a clear message. Improving
this to collect-and-continue is a good exercise: change the resolver to
push errors into a vector<ResolveError> and only throw at the end of
resolve(stmts).
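A possible shape for that exercise is sketched below. ErrorSink, ResolveErrors, and countCollected are hypothetical names, and this ResolveError is a stand-in derived from std::runtime_error; the lab's own class may differ.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the lab's ResolveError.
struct ResolveError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical aggregate thrown once, at the end of resolution.
struct ResolveErrors : std::runtime_error {
    std::vector<ResolveError> all;
    explicit ResolveErrors(std::vector<ResolveError> errs)
        : std::runtime_error(std::to_string(errs.size()) + " resolve error(s)"),
          all(std::move(errs)) {}
};

class ErrorSink {
    std::vector<ResolveError> errors_;
public:
    // Record instead of throwing, so the AST walk can continue.
    void report(int line, const std::string& msg) {
        errors_.emplace_back("[line " + std::to_string(line) + "] " + msg);
    }
    // Call once after the whole program has been resolved.
    void throwIfAny() {
        if (!errors_.empty()) throw ResolveErrors(std::move(errors_));
    }
};

// Simulates a program with two independent mistakes and returns how many
// errors arrive together in the final exception.
std::size_t countCollected() {
    ErrorSink sink;
    sink.report(1, "Variable 'x' already declared in this scope.");
    sink.report(4, "Can't return from top-level code.");
    try { sink.throwIfAny(); }
    catch (const ResolveErrors& e) { return e.all.size(); }
    return 0;
}
```

In the resolver, each throw ResolveError(...) site would become a report(...) call followed by a sensible fallback, such as skipping the declaration.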
The three errors together — a test
void test_static_errors() {
// Redeclaration
CHECK_THROWS(run("let x = 1; let x = 2;"), "already declared");
// Self-init
CHECK_THROWS(run("let x = x + 1;"), "own initialiser");
// Top-level return
CHECK_THROWS(run("return 42;"), "top-level");
}
These three static analyses, taken together, eliminate a whole class of runtime crashes that would otherwise only manifest as obscure interpreter bugs deep into execution.
07 — O(depth) Interpreter Lookups and Testing
This final step ties everything together, reviews the performance model, and verifies the full resolver + interpreter integration.
The complete variable lookup chain
From source text to value:
Source → Lexer → [Token stream] → Parser → [AST]
→ Resolver → [locals_ map: Expr* → depth] → Interpreter
→ lookUpVariable(name, expr*) → env_->getAt(depth, name) → Value
The resolver runs once. After it populates locals_, every variable
reference in every execution of any function body is a direct-depth lookup
with no chain scanning.
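The direct-depth lookup can be sketched as follows. This is a simplified, hypothetical Environment (values are plain doubles, and ancestor and sumViaDepths are illustrative names); only getAt(depth, name) is named in the text.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified Environment: values are plain doubles here.
struct Environment {
    std::unordered_map<std::string, double> values;
    std::shared_ptr<Environment> enclosing;

    explicit Environment(std::shared_ptr<Environment> parent = nullptr)
        : enclosing(std::move(parent)) {}

    // Hop exactly `depth` links outward; no name lookup on the way.
    Environment& ancestor(int depth) {
        Environment* env = this;
        for (int i = 0; i < depth; ++i) {
            assert(env->enclosing && "resolver depth exceeds environment chain");
            env = env->enclosing.get();
        }
        return *env;
    }

    // One hash lookup, in the environment the resolver pointed at.
    double getAt(int depth, const std::string& name) {
        return ancestor(depth).values.at(name);
    }
};

// Builds outer{b=20} <- inner{c=30} and reads both through getAt.
double sumViaDepths() {
    auto outer = std::make_shared<Environment>();
    outer->values["b"] = 20;
    auto inner = std::make_shared<Environment>(outer);
    inner->values["c"] = 30;
    return inner->getAt(0, "c") + inner->getAt(1, "b");
}
```

The assert in ancestor is the runtime symptom of the resolver and interpreter disagreeing about how many scopes exist.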
When depth can be wrong
There is one subtle case: if the interpreter creates extra environment
layers that the resolver didn't see, getAt(depth) overshoots. This can
happen if you add an implicit scope (e.g. around a single-expression if
body without braces). The resolver must create a beginScope/endScope
wherever the interpreter creates a new Environment. Keep them in sync.
The test suite for cp-04 includes a "depth sync" test:
void test_depth_sync() {
// Deep nesting should still resolve correctly
auto out = run(R"(
let a = 10;
fn f() {
let b = 20;
fn g() {
let c = 30;
return a + b + c;
}
return g();
}
print f();
)");
CHECK_EQ(out, "60\n");
}
This exercises a 3-level closure chain. If a has depth 2 from inside
g, getAt(2) climbs: g's frame → f's frame → f's closure (global),
finds a = 10. Any off-by-one in the depth computation fails this test.
Testing the static errors
void test_redeclare() {
bool threw = false;
try { run("let x = 1; let x = 2;"); }
catch (const ResolveError& e) { threw = true; }
CHECK_EQ(threw, true);
}
void test_self_init() {
bool threw = false;
try { run("let x = x + 1;"); }
catch (const ResolveError& e) { threw = true; }
CHECK_EQ(threw, true);
}
void test_top_level_return() {
bool threw = false;
try { run("return 42;"); }
catch (const ResolveError& e) { threw = true; }
CHECK_EQ(threw, true);
}
Testing closure correctness
void test_closure_captures_definition_site() {
auto out = run(R"(
var a = "global";
fn showA() { print a; }
fn test() {
var a = "local";
showA();
}
test();
)");
CHECK_EQ(out, "global\n"); // lexical scope, not dynamic
}
Without the resolver, a dynamic-scope interpreter prints "local".
With the resolver, the a reference inside showA is annotated as
a global (depth not in locals_), so the interpreter looks in globals_,
finds "global", and prints it correctly.
Summary: what the resolver gives you
| Feature | Without resolver | With resolver |
|---|---|---|
| Variable lookup | Walks the chain with a hash probe at every level | Hops exactly depth links, then one hash probe |
| Closure semantics | Accidental dynamic scope possible | Lexical scope enforced |
| Self-init bug | Crashes at runtime | Static error |
| Redeclaration | Silent shadowing | Static error |
| Top-level return | Runtime crash (ReturnSignal uncaught) | Static error |
The resolver is a small investment — ~150 lines — that pays dividends in every subsequent phase. The bytecode compiler in cp-06 uses the same scope-stack technique to allocate stack slots; the type checker in cp-05 uses it to track type annotations.
cp-05 — Static Type Checker (gradual)
Status: ✅ Built · all tests passing
A third compiler pass added between resolver and interpreter:
lex → parse → resolve → typecheck → interpret
The checker walks the AST a second time, this time with a TypePtr
visitor. It computes a static type for every expression and validates
operands, conditions, arities, return values, and assignments.
Gradual typing
Annotations are optional. A bare var x = 1; is fine; so is
let y: int = 1;. Wherever the source omits a type, the slot is
any, a wildcard that:
- satisfies any constraint (any + int = int, if (anyVal) accepted),
- propagates through unknown operations (any * any = any),
- lets cp-04 programs (and the recursive fact test) keep working unchanged, while fully-annotated programs get strict checks.
What's new vs cp-04
| Aspect | cp-04 | cp-05 |
|---|---|---|
| Phases | lex → parse → resolve → interpret | lex → parse → resolve → typecheck → interpret |
| Tokens | — | adds : and -> |
| Annotations | none | let x: int, fn f(a: int, b: int) -> int { ... } |
| Type ADT | — | int / bool / string / nil / fn(...) -> T / any |
| Errors caught | resolution-only (undef, redecl, …) | + operand types, arity, arg types, return type, condition type, assign-T |
Source layout (src/cpp/)
src/
token.hpp # adds Colon, Arrow
lexer.{hpp,cpp} # handles ':' and '->'
value.hpp # unchanged
type.hpp # NEW — Type ADT and tyInt/tyBool/...
ast.hpp # adds declaredType, paramTypes, returnType, checkedType
# and a second `accept` overload returning TypePtr
parser.{hpp,cpp} # parseType() + optional annotations
environment.hpp # unchanged
resolver.{hpp,cpp} # unchanged
typecheck.{hpp,cpp} # NEW — gradual type checker
interpreter.{hpp,cpp} # unchanged
main.cpp # invokes TypeChecker between resolver and interpreter
tests/test_typecheck.cpp # regression + new annotated programs + 11 negative cases
Build & test
cd src/cpp
cmake -S . -B build -G "Unix Makefiles"
cmake --build build -j
ctest --test-dir build --output-on-failure
Sample diagnostics
$ cat > bad.ml <<'EOF'
let x: int = true;
if (5) print 1;
fn h(a: int) -> int { return true; }
h(true, 2);
EOF
$ build/mli bad.ml
[line 1] type error: initializer of 'x' has type bool, expected int
[line 2] type error: if-condition must be bool (got int)
[line 3] type error: return type mismatch: function returns int, got bool
[line 4] type error: call to function expected 1 arg(s), got 2
All diagnostics are collected before the program is rejected — same recovery model as the resolver.
See src/cpp/steps/ for the walk-through
- 01-why-static-types.md - what dynamic-only typing costs
- 02-type-syntax.md - : annotations, -> returns, function types
- 03-type-representation.md - the Type ADT and Any wildcard
- 04-the-checker-walk.md - visitor, scope stack, currentReturn
- 05-inference-vs-annotation.md - when the checker fills the gaps
- 06-function-types-and-calls.md - arity, arg-by-arg, return matching
- 07-error-recovery.md - collecting diagnostics like the resolver
01 — Why Static Types
A type system is a lightweight formal method that proves a class of runtime errors cannot occur — without running the program. The guarantees it provides depend on how expressive the type language is.
The cost of untyped code
In cp-03/cp-04, MiniLang is fully dynamic: every value carries its type at
runtime (std::variant<nil, double, bool, string, FnValue>). A type error
like "hello" + 42 raises a RuntimeError when the + operator tries to
add a string and a number. This is correct, but:
- The error appears at runtime, not compile time.
- It may appear on a rarely-executed code path — you might not see it until production.
- Error messages say "expected number" without knowing the programmer's intent.
What the type checker provides
The type checker runs after parsing and before execution. It annotates
every expression with a type (Num, Bool, Str, Nil, Fn<...>) and
reports mismatches statically:
let x: Num = "hello"; // error at line 1: expected Num, got Str
This moves the failure from runtime to compile time.
MiniLang's type language
Type ::= Num | Bool | Str | Nil | Any
| Fn(Type, ...) -> Type
- Num, Bool, Str, Nil: the four base types from the value set.
- Any: the gradual escape hatch (step 06). A value of type Any can be used anywhere without a static error; correctness is checked at runtime.
- Fn(T1, T2, ...) -> R: a function type.
Soundness vs completeness
A type system is sound if every accepted program is safe at runtime. A type system is complete if it accepts every safe program.
For a Turing-complete language, no decidable type system can be both sound and complete (a consequence of Rice's theorem). The trade-off is:
- Accept some unsafe programs (be unsound) → miss bugs.
- Reject some safe programs (be incomplete) → reject correct programs.
MiniLang's checker is sound for the base types: a program that passes
type-checking with no Any types will not produce a type error at runtime.
The Any type trades soundness for usability on the parts of the program
that can't be typed statically.
Checked vs unchecked features
In cp-05, the following are statically checked:
- Arithmetic (+, -, *, /) requires Num operands.
- Comparison (<, <=, etc.) requires Num or Str operands.
- Logical (and, or) requires Bool operands.
- Negation ! requires Bool; unary - requires Num.
- Function call arity is checked against the declared parameter count.
- Function return type must match the declared return type (if annotated).
- let/var type annotations must match the initialiser.
The following are not statically checked in cp-05:
- Array element types (no array type in the type language).
- Object field access (no class types yet).
These are left for extensions.
02 — The Type ADT
The type representation is the data model for the entire type checker. It must be:
- Recursive: function types contain other types.
- Comparable: unify and equality checks must work.
- Printable: error messages need "expected Num, got Bool".
The Type representation
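struct NumType {};
struct BoolType {};
struct StrType {};
struct NilType {};
struct AnyType {};
struct FnType;  // defined below; forward-declared so Type can name it

// Naming the variant does not instantiate it, so FnType may still be
// incomplete here. shared_ptr also tolerates an incomplete pointee.
using Type = std::variant<NumType, BoolType, StrType, NilType, AnyType, FnType>;
using TypePtr = std::shared_ptr<Type>;

struct FnType {
    std::vector<TypePtr> params;
    TypePtr ret;
};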
Alternatively, use an inheritance hierarchy with a TypeKind enum. The
variant approach avoids virtual dispatch and is exhaustively checkable
with std::visit:
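// Helper for std::visit with lambdas (C++17 needs the deduction guide).
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

std::string typeToStr(const Type& t) {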
return std::visit(overloaded{
[](const NumType&) { return std::string("Num"); },
[](const BoolType&) { return std::string("Bool"); },
[](const StrType&) { return std::string("Str"); },
[](const NilType&) { return std::string("Nil"); },
[](const AnyType&) { return std::string("Any"); },
[](const FnType& f) {
std::string s = "Fn(";
for (size_t i=0; i<f.params.size(); ++i) {
if (i) s += ", ";
s += typeToStr(*f.params[i]);
}
s += ") -> " + typeToStr(*f.ret);
return s;
},
}, t);
}
Type equality
Two types are equal if they are structurally identical:
bool typeEq(const Type& a, const Type& b) {
if (a.index() != b.index()) return false;
if (auto* f = std::get_if<FnType>(&a)) {
auto& g = std::get<FnType>(b);
if (f->params.size() != g.params.size()) return false;
for (size_t i = 0; i < f->params.size(); ++i)
if (!typeEq(*f->params[i], *g.params[i])) return false;
return typeEq(*f->ret, *g.ret);
}
return true; // same variant index, non-FnType
}
The "gradual" compatibility check
For gradual typing (step 06), we need compatible(a, b) which is weaker
than typeEq: Any is compatible with everything.
bool compatible(const Type& a, const Type& b) {
if (std::holds_alternative<AnyType>(a)) return true;
if (std::holds_alternative<AnyType>(b)) return true;
return typeEq(a, b);
}
Type factory helpers
TypePtr mkNum() { return std::make_shared<Type>(NumType{}); }
TypePtr mkBool() { return std::make_shared<Type>(BoolType{}); }
TypePtr mkStr() { return std::make_shared<Type>(StrType{}); }
TypePtr mkNil() { return std::make_shared<Type>(NilType{}); }
TypePtr mkAny() { return std::make_shared<Type>(AnyType{}); }
TypePtr mkFn(std::vector<TypePtr> params, TypePtr ret) {
return std::make_shared<Type>(FnType{std::move(params), std::move(ret)});
}
These reduce noise in the checker: mkNum() is clearer than
std::make_shared<Type>(NumType{}) at every call site.
Why shared_ptr?
Function types contain nested types. A type like Fn(Fn(Num)->Bool, Str) -> Nil
has a FnType whose first param is another FnType. Shared ownership
means the inner types can be cheaply aliased across multiple places in the
checker's environment without copying.
For the curriculum, shared_ptr is clear and correct. Production type
checkers use arenas (bump allocators) so all types live in one allocation
region and are freed together at compile-time end.
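To make the arena idea concrete, here is a small sketch. It is not a true bump allocator: it leans on std::deque, which keeps pointers to existing elements valid across push_back. TypeNode, TypeArena, and pointersStayValid are hypothetical names, and the enum-based node is a simplification of the variant-based Type above.

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

// Simplified type node for illustration; not the variant-based Type.
enum class Kind { Num, Bool, Str, Nil, Any, Fn };

struct TypeNode {
    Kind kind;
    std::vector<const TypeNode*> params;  // used only when kind == Fn
    const TypeNode* ret = nullptr;        // used only when kind == Fn
};

// Hypothetical arena: all nodes live in one container and die together.
// push_back on a deque never relocates existing elements, so the raw
// pointers handed out stay valid for the arena's lifetime.
class TypeArena {
    std::deque<TypeNode> nodes_;
public:
    const TypeNode* make(Kind k) {
        nodes_.push_back(TypeNode{k, {}, nullptr});
        return &nodes_.back();
    }
    const TypeNode* makeFn(std::vector<const TypeNode*> params,
                           const TypeNode* ret) {
        nodes_.push_back(TypeNode{Kind::Fn, std::move(params), ret});
        return &nodes_.back();
    }
    std::size_t size() const { return nodes_.size(); }
};

// Builds Fn(Num) -> Bool, then allocates many more nodes and checks that
// the earlier pointers still refer to the same nodes.
bool pointersStayValid() {
    TypeArena arena;
    const TypeNode* num = arena.make(Kind::Num);
    const TypeNode* boolean = arena.make(Kind::Bool);
    const TypeNode* fn = arena.makeFn({num}, boolean);
    for (int i = 0; i < 1000; ++i) arena.make(Kind::Any);  // force growth
    return fn->params[0] == num && fn->ret == boolean &&
           num->kind == Kind::Num && arena.size() == 1003;
}
```

The trade-off against shared_ptr is the usual one: no per-node reference counting, but no node can outlive the arena.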
03 — Type Annotations in the AST and Parser
Type annotations let programmers express intent:
let x: Num = 42;
fn add(a: Num, b: Num): Num { return a + b; }
The parser must recognise the : Type syntax and store the annotation
in the AST.
AST changes
The LetStmt and function parameter nodes gain an optional type annotation:
struct LetStmt {
std::string name;
bool immutable;
ExprPtr init;
TypePtr annotation; // nullptr if absent
int line;
};
struct Param {
std::string name;
TypePtr annotation; // nullptr if absent
};
struct FnExpr {
std::vector<Param> params;
TypePtr retAnnotation; // nullptr if absent
StmtPtr body;
int line;
};
Parsing type annotations
A parseType helper handles the type grammar:
// Type ::= "Num" | "Bool" | "Str" | "Nil" | "Any"
// | "Fn" "(" TypeList ")" "->" Type
TypePtr Parser::parseType() {
Token t = advance();
if (t.lexeme == "Num") return mkNum();
if (t.lexeme == "Bool") return mkBool();
if (t.lexeme == "Str") return mkStr();
if (t.lexeme == "Nil") return mkNil();
if (t.lexeme == "Any") return mkAny();
if (t.lexeme == "Fn") {
expect(LParen);
std::vector<TypePtr> params;
while (peek().kind != RParen) {
params.push_back(parseType());
if (!match(Comma)) break;
}
expect(RParen);
expect(Arrow); // "->"
auto ret = parseType();
return mkFn(std::move(params), std::move(ret));
}
throw ParseError("[line " + std::to_string(t.line) +
"] Expected type annotation, got '" + t.lexeme + "'.");
}
Parsing let with annotation
StmtPtr Parser::parseLet() {
int line = advance().line; // consume 'let' / 'var'
bool immutable = (previous().kind == Let);
auto name = expect(Ident).lexeme;
TypePtr ann;
if (match(Colon)) ann = parseType(); // optional ": Type"
expect(Eq);
auto init = parseExpr(0);
expect(Semicolon);
return std::make_unique<LetStmt>(name, immutable, std::move(init), std::move(ann), line);
}
Parsing function parameters with annotations
std::vector<Param> Parser::parseParams() {
expect(LParen);
std::vector<Param> params;
while (peek().kind != RParen) {
auto name = expect(Ident).lexeme;
TypePtr ann;
if (match(Colon)) ann = parseType();
params.push_back({name, std::move(ann)});
if (!match(Comma)) break;
}
expect(RParen);
return params;
}
Token additions
The lexer needs two new token kinds:
- Colon for : separating name from type.
- Arrow for -> separating parameter types from return type.
-> is a two-character token; the lexer handles it in the - branch:
case '-':
return makeToken(match('>') ? Arrow : Minus);
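The one-character lookahead can be exercised on its own. The scanner below is a hypothetical miniature (Tok and scanPunct are illustrative names) that covers only the punctuation this step adds and lumps everything else into Other.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Tok { Minus, Arrow, Colon, Other };

// Hypothetical mini-scanner covering only the characters this step adds.
std::vector<Tok> scanPunct(const std::string& src) {
    std::vector<Tok> out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (c == ' ') continue;
        if (c == '-') {
            // Look one character ahead: "->" is one token, not Minus then '>'.
            if (i + 1 < src.size() && src[i + 1] == '>') {
                out.push_back(Tok::Arrow);
                ++i;  // consume the '>'
            } else {
                out.push_back(Tok::Minus);
            }
        } else if (c == ':') {
            out.push_back(Tok::Colon);
        } else {
            out.push_back(Tok::Other);
        }
    }
    return out;
}
```

Note that "- >" with a space in between scans as two tokens; the arrow is recognised only when the characters are adjacent.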
Annotations are optional
All annotations are TypePtr defaulting to nullptr. Code without any
annotations is valid — it's treated as fully Any-typed (step 06). This
means cp-03/cp-04 programs are valid cp-05 programs without modification.
04 — The Type Checker Pass
The type checker walks the AST after the resolver and before execution. It visits every expression and statement, computing or verifying types.
The TypeChecker class
class TypeChecker : public ExprVisitor<TypePtr>, public StmtVisitor<void> {
// Type environment: name → TypePtr
std::vector<std::unordered_map<std::string, TypePtr>> scopes_;
TypePtr currentReturnType_; // expected return type of current function
void beginScope();
void endScope();
void declare(const std::string& name, TypePtr t);
TypePtr lookup(const std::string& name);
void checkCompatible(TypePtr expected, TypePtr actual, int line);
public:
void check(std::vector<StmtPtr>& stmts);
// ExprVisitor
TypePtr visitNumber(NumberExpr&) override;
TypePtr visitBool(BoolExpr&) override;
TypePtr visitString(StringExpr&) override;
TypePtr visitNil(NilExpr&) override;
TypePtr visitVar(VarExpr&) override;
TypePtr visitBinary(BinaryExpr&) override;
TypePtr visitUnary(UnaryExpr&) override;
TypePtr visitCall(CallExpr&) override;
TypePtr visitFn(FnExpr&) override;
// StmtVisitor
void visitLet(LetStmt&) override;
void visitBlock(BlockStmt&) override;
void visitIf(IfStmt&) override;
void visitWhile(WhileStmt&) override;
void visitReturn(ReturnStmt&) override;
void visitPrint(PrintStmt&) override;
};
Expression type rules
TypePtr TypeChecker::visitBinary(BinaryExpr& e) {
auto L = check(*e.left);
auto R = check(*e.right);
switch (e.op) {
case Plus: case Minus: case Star: case Slash:
checkCompatible(mkNum(), L, e.line);
checkCompatible(mkNum(), R, e.line);
return mkNum();
case EqEq: case BangEq:
// any two compatible types may be compared for equality
return mkBool();
case Lt: case LtEq: case Gt: case GtEq:
checkCompatible(mkNum(), L, e.line);
checkCompatible(mkNum(), R, e.line);
return mkBool();
case And: case Or:
checkCompatible(mkBool(), L, e.line);
checkCompatible(mkBool(), R, e.line);
return mkBool();
// ...
}
}
The checkCompatible(expected, actual, line) function throws a
TypeCheckError if !compatible(expected, actual) (see step 06 for the
compatible definition):
void TypeChecker::checkCompatible(TypePtr expected, TypePtr actual, int line) {
if (!compatible(*expected, *actual))
throw TypeCheckError("[line " + std::to_string(line) +
"] Expected " + typeToStr(*expected) +
", got " + typeToStr(*actual) + ".");
}
Statement rules
void TypeChecker::visitLet(LetStmt& s) {
TypePtr initType = s.init ? check(*s.init) : mkNil();
if (s.annotation)
checkCompatible(s.annotation, initType, s.line);
TypePtr declaredType = s.annotation ? s.annotation : initType;
declare(s.name, declaredType);
}
If no annotation is given, the type is inferred from the initialiser. This is simple local (bottom-up) inference, the same idea as C++ auto; it is much weaker than Hindley-Milner, which solves for type variables by unification:
let x = 42; // inferred: Num
let y = "hello"; // inferred: Str
let z = x + y; // error: expected Num, got Str (for y)
Function type checking
TypePtr TypeChecker::visitFn(FnExpr& fn) {
beginScope();
std::vector<TypePtr> paramTypes;
for (auto& p : fn.params) {
TypePtr t = p.annotation ? p.annotation : mkAny();
declare(p.name, t);
paramTypes.push_back(t);
}
TypePtr retType = fn.retAnnotation ? fn.retAnnotation : mkAny();
auto saved = currentReturnType_;
currentReturnType_ = retType;
check(*fn.body);
currentReturnType_ = saved;
endScope();
return mkFn(std::move(paramTypes), retType);
}
Return type checking
void TypeChecker::visitReturn(ReturnStmt& s) {
TypePtr t = s.value ? check(*s.value) : mkNil();
checkCompatible(currentReturnType_, t, s.line);
}
If the function has Any return type (no annotation), any return value
is accepted.
05 — Function Types and Call Checking
Function types form the most complex part of the type system: they are recursive (parameters and return types can themselves be function types), and call-site checking must verify arity and argument types.
Checking a call expression
TypePtr TypeChecker::visitCall(CallExpr& call) {
TypePtr calleeType = check(*call.callee);
// If the callee is Any, we can't check — return Any
if (std::holds_alternative<AnyType>(*calleeType)) {
for (auto& arg : call.args) check(*arg); // still check arg types
return mkAny();
}
// Must be a function type
auto* fn = std::get_if<FnType>(calleeType.get());
if (!fn)
throw TypeCheckError("[line " + std::to_string(call.line) +
"] Cannot call non-function value of type " +
typeToStr(*calleeType) + ".");
// Check arity
if (call.args.size() != fn->params.size())
throw TypeCheckError("[line " + std::to_string(call.line) +
"] Expected " + std::to_string(fn->params.size()) +
" arguments, got " + std::to_string(call.args.size()) + ".");
// Check argument types
for (size_t i = 0; i < call.args.size(); ++i) {
TypePtr argType = check(*call.args[i]);
checkCompatible(fn->params[i], argType, call.line);
}
return fn->ret;
}
Higher-order functions
Function types compose naturally:
fn apply(f: Fn(Num)->Num, x: Num): Num {
return f(x);
}
fn double(n: Num): Num { return n * 2; }
print apply(double, 21); // 42
Type checking apply(double, 21):
- apply has type Fn(Fn(Num)->Num, Num) -> Num.
- Arg 0: double has type Fn(Num)->Num. Compatible with param 0 (Fn(Num)->Num). ✓
- Arg 1: 21 has type Num. Compatible with param 1 (Num). ✓
- Return type: Num.
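The walk-through above can be replayed with a small self-contained model. Everything here is a simplification for illustration: Ty is an enum-tagged struct rather than the variant-based Type, checkCall returns nullptr instead of throwing TypeCheckError, and base and fn are hypothetical factory names.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Ty;
using TyPtr = std::shared_ptr<Ty>;

// Simplified stand-in for the step's Type ADT.
struct Ty {
    enum Kind { Num, Bool, Str, Nil, Any, Fn } kind;
    std::vector<TyPtr> params;  // Fn only
    TyPtr ret;                  // Fn only
};

TyPtr base(Ty::Kind k) { return std::make_shared<Ty>(Ty{k, {}, nullptr}); }
TyPtr fn(std::vector<TyPtr> ps, TyPtr r) {
    return std::make_shared<Ty>(Ty{Ty::Fn, std::move(ps), std::move(r)});
}

// Any matches everything; function types compare member-wise.
bool compatible(const Ty& a, const Ty& b) {
    if (a.kind == Ty::Any || b.kind == Ty::Any) return true;
    if (a.kind != b.kind) return false;
    if (a.kind != Ty::Fn) return true;
    if (a.params.size() != b.params.size()) return false;
    for (std::size_t i = 0; i < a.params.size(); ++i)
        if (!compatible(*a.params[i], *b.params[i])) return false;
    return compatible(*a.ret, *b.ret);
}

// Returns the call's result type, or nullptr on a non-callable callee,
// an arity mismatch, or an incompatible argument.
TyPtr checkCall(const Ty& callee, const std::vector<TyPtr>& args) {
    if (callee.kind == Ty::Any) return base(Ty::Any);
    if (callee.kind != Ty::Fn) return nullptr;
    if (args.size() != callee.params.size()) return nullptr;
    for (std::size_t i = 0; i < args.size(); ++i)
        if (!compatible(*callee.params[i], *args[i])) return nullptr;
    return callee.ret;
}
```

Calling checkCall with apply's type and the argument types {Fn(Num)->Num, Num} yields Num; dropping an argument or passing Str for the function parameter yields nullptr.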
Recursive functions
fn fib(n: Num): Num {
if (n <= 1) return n;
return fib(n-1) + fib(n-2);
}
For the recursive call fib(n-1) to type-check, fib must already be in scope with type Fn(Num)->Num when visitFn checks the body (fn fib ... desugars to let fib = fn(n: Num): Num {...}, so this is visitLet's job). The visitLet from step 04 gets the order wrong: it calls declare only after checking the initialiser, so the lookup of fib inside its own body fails. declare must run before the body is checked. The fix:
void TypeChecker::visitLet(LetStmt& s) {
// For function literals, pre-declare with the annotated type
// before checking the body (enables recursion).
if (auto* fn = dynamic_cast<FnExpr*>(s.init.get())) {
if (fn->retAnnotation) {
auto preType = buildFnType(*fn); // params + ret from annotations
declare(s.name, preType); // pre-declare
check(*s.init); // body can now see s.name
return;
}
}
// Non-function or unannotated: infer normally
TypePtr initType = s.init ? check(*s.init) : mkNil();
if (s.annotation) checkCompatible(s.annotation, initType, s.line);
declare(s.name, s.annotation ? s.annotation : initType);
}
The interaction with Any
let f: Any = fn(x: Num): Num { return x + 1; };
f(42); // accepted — callee is Any, no type checking on call
f("hi"); // accepted — but will crash at runtime
The Any escape hatch means the checker accepts the call (returns Any)
while the interpreter's runtime check catches the actual type mismatch.
This is the defining property of gradual typing: you can opt specific
sites out of static checking at the cost of runtime safety guarantees.
Testing function types
void test_function_types() {
// Correct types — should type-check clean
CHECK_NOTHROW(typeCheck(R"(
fn add(a: Num, b: Num): Num { return a + b; }
print add(1, 2);
)"));
// Wrong arg type
CHECK_THROWS(typeCheck("fn f(x: Num): Num { return x; } f(\"hi\");"),
"Expected Num");
// Wrong arity
CHECK_THROWS(typeCheck("fn f(x: Num): Num { return x; } f(1, 2);"),
"Expected 1 arguments");
}
06 — Gradual Typing with Any
Gradual typing lets programmers mix typed and untyped code in the same
program. The Any type is the mechanism: a value of type Any bypasses
static checks but retains runtime checks.
The key rule: compatibility not equality
The type checker uses compatible(A, B) instead of typeEq(A, B) for
all mismatch checks:
bool compatible(const Type& a, const Type& b) {
if (std::holds_alternative<AnyType>(a)) return true;
if (std::holds_alternative<AnyType>(b)) return true;
if (a.index() != b.index()) return false;
if (auto* fa = std::get_if<FnType>(&a)) {
auto& fb = std::get<FnType>(b);
if (fa->params.size() != fb.params.size()) return false;
for (size_t i = 0; i < fa->params.size(); ++i)
if (!compatible(*fa->params[i], *fb.params[i])) return false;
return compatible(*fa->ret, *fb.ret);
}
return true; // same non-Fn variant
}
Any is compatible with every type in both directions. This is what
allows unannotated code to coexist with annotated code.
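One consequence worth seeing in code: because Any matches in both directions, compatibility is not transitive, so it cannot be treated like equality. The sketch below restricts itself to base types; T and hasTransitivityGap are hypothetical names.

```cpp
#include <cassert>

// Base types only; function types are left out to keep the sketch small.
enum class T { Num, Bool, Str, Nil, Any };

// Same rule as compatible() for non-function types.
bool compatible(T a, T b) {
    return a == T::Any || b == T::Any || a == b;
}

// Exhaustively searches for a triple (a, b, c) with a ~ b and b ~ c
// but not a ~ c.
bool hasTransitivityGap() {
    const T all[] = {T::Num, T::Bool, T::Str, T::Nil, T::Any};
    for (T a : all)
        for (T b : all)
            for (T c : all)
                if (compatible(a, b) && compatible(b, c) && !compatible(a, c))
                    return true;
    return false;
}
```

Num is compatible with Any and Any with Str, yet Num is not compatible with Str. The practical upshot is that each check must compare against the declared type at that site; compatibility results cannot be chained.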
Where Any flows
- Unannotated parameters: fn f(x) { ... } → x has type Any.
- Unannotated return: fn f() { return expr; } → return type is Any.
- Unannotated variable with complex initialiser: if the checker can't infer a concrete type, it falls back to Any.
- Explicit annotation: let x: Any = ... always.
The flow problem
Gradual typing introduces a subtlety: Any can flow through operations
and infect computed types:
let x = f(); // f is unannotated → x has type Any
let y: Num = x + 1; // x is Any, so x + 1 passes the operand check → accepted
At runtime, if f() returns a string, x + 1 crashes. This is the
fundamental trade-off: accepting Any means losing static guarantees
for all downstream computations that use that value.
Gradual guarantee
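For fully annotated code (no Any), passing the checker implies the program
raises no type error at runtime. This is the soundness property from step 01;
the converse does not hold, because a sound checker also rejects some safe
programs. The gradual guarantee proper is a statement about precision: taking
a program that type-checks and replacing annotations with Any never introduces
a new static error. As soon as you introduce Any, you give up the runtime
guarantee for the paths touched by Any values.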
In practice this means: start with Any everywhere (fully dynamic), add
annotations incrementally, and the type checker's coverage grows with
each annotation.
Runtime blame
In research gradual type systems (Typed Racket, Reticulated Python), the
runtime inserts casts at Any ↔ typed boundaries and reports blame
precisely: "this function was typed but received an untyped argument from
line 42". cp-05 does not implement blame tracking — the runtime check is
just the existing std::get<double> in the interpreter throwing a
RuntimeError. The blame extension is left for exploration.
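As a starting point for that exploration, a boundary cast might look like the sketch below. It is an assumption-laden simplification: real blame tracking carries richer labels than a line number, Value here has fewer alternatives than the interpreter's, and BlameError, castToNum, and blamedLine are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <variant>

// Simplified runtime value; the real interpreter's Value has more alternatives.
using Value = std::variant<std::monostate, double, bool, std::string>;

struct BlameError : std::runtime_error {
    int boundaryLine;  // where the Any value crossed into typed code
    BlameError(int line, const std::string& msg)
        : std::runtime_error("[line " + std::to_string(line) + "] " + msg),
          boundaryLine(line) {}
};

// Hypothetical cast the checker could insert at an Any -> Num boundary.
// It checks the runtime tag once, at the boundary, and blames that line.
double castToNum(const Value& v, int boundaryLine) {
    if (const double* d = std::get_if<double>(&v)) return *d;
    throw BlameError(boundaryLine, "value flowing into Num is not a number");
}

// Returns the blamed line, or -1 if the cast succeeded.
int blamedLine(const Value& v, int boundaryLine) {
    try { castToNum(v, boundaryLine); }
    catch (const BlameError& e) { return e.boundaryLine; }
    return -1;
}
```

The difference from the current behaviour is where the failure surfaces: at the line where the untyped value entered typed code, rather than wherever an operator later trips over it.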
Testing gradual typing
void test_gradual() {
// Unannotated fn — no type error
CHECK_NOTHROW(typeCheck("fn f(x) { return x + 1; }"));
// Annotated caller, unannotated callee — OK (Any is compatible with Num)
CHECK_NOTHROW(typeCheck(R"(
fn id(x) { return x; }
let n: Num = id(42);
)"));
// Annotated parameter, unannotated return: the call is still checked.
// The callee is fn(x: Num), so passing a Str is an error.
CHECK_THROWS(typeCheck(R"(
fn f(x: Num) { return x + 1; }
f("hello");
)"), "Expected Num");
}
07 — Error Messages and Recovery
A type system is only useful if its error messages are actionable. This step covers how to produce clear diagnostics and how to continue checking after the first error.
Anatomy of a good type error
error[E001]: type mismatch
--> main.ml:3:15
|
3 | let x: Num = "hello";
| ^^^^^^^^^^ expected Num, found Str
|
= note: variable 'x' is declared as Num at line 3
The key components:
- Error code — makes errors searchable in documentation.
- Location — file, line, column (or at least line).
- Expected vs actual — always say what was expected, not just what was found.
- Context — which variable, which function, which call.
cp-05's error format is simpler but includes the essentials:
TypeCheckError at line 3: expected Num, got Str (in binding 'x')
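For readers who want the richer layout, the sketch below renders a header, the source line, and a caret underline. It assumes column information that cp-05 does not track; Diagnostic and render are hypothetical names, and the exact spacing is a choice made for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical diagnostic record; cp-05 itself only tracks the line.
struct Diagnostic {
    std::string code;   // e.g. "E001"
    std::string title;  // e.g. "type mismatch"
    std::string file;
    int line;           // 1-based
    int column;         // 1-based column of the first underlined character
    int length;         // number of characters to underline
    std::string label;  // e.g. "expected Num, found Str"
};

// Renders the header, the offending source line, and a caret underline.
std::string render(const Diagnostic& d, const std::string& sourceLine) {
    assert(d.column >= 1 && d.length >= 1);
    const std::string num = std::to_string(d.line);
    const std::string gutter(num.size(), ' ');  // aligns with the line number
    std::string out;
    out += "error[" + d.code + "]: " + d.title + "\n";
    out += gutter + "--> " + d.file + ":" + num + ":" +
           std::to_string(d.column) + "\n";
    out += gutter + " |\n";
    out += num + " | " + sourceLine + "\n";
    out += gutter + " | " + std::string(d.column - 1, ' ') +
           std::string(d.length, '^') + " " + d.label + "\n";
    return out;
}
```

Keeping the record separate from its rendering means the same Diagnostic could later feed a JSON or editor-protocol output without touching the checker.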
The TypeCheckError structure
struct TypeCheckError {
std::string message;
int line;
TypeCheckError(std::string msg, int line)
: message(std::move(msg)), line(line) {}
const char* what() const { return message.c_str(); }
};
Error messages are assembled at the site of detection:
void TypeChecker::checkCompatible(TypePtr expected, TypePtr actual,
int line, const std::string& context) {
if (!compatible(*expected, *actual))
throw TypeCheckError(
"expected " + typeToStr(*expected) +
", got " + typeToStr(*actual) +
(context.empty() ? "" : " (" + context + ")"),
line);
}
Error recovery: collect-and-continue
Throwing on the first error is the simplest strategy. The downside is that one typo hides all downstream errors. The collect-and-continue strategy accumulates errors:
class TypeChecker {
std::vector<TypeCheckError> errors_;
void reportError(const std::string& msg, int line) {
errors_.emplace_back(msg, line);
}
TypePtr checkCompatibleSoft(TypePtr expected, TypePtr actual, int line,
const std::string& ctx) {
if (!compatible(*expected, *actual))
reportError("expected " + typeToStr(*expected) +
", got " + typeToStr(*actual) + " " + ctx, line);
return expected; // return expected type to continue checking
}
};
After reportError, the checker returns the expected type and continues.
Downstream code sees a "correct" type and may or may not produce spurious
secondary errors. Primary errors (the first one) are always reliable;
secondary errors may be false positives caused by recovery.
The "error type" sentinel
Some type checkers introduce an ErrorType variant. Any operation on an
ErrorType operand silently returns ErrorType without reporting another
error. This prevents cascading:
let x: Num = "str"; // error: expected Num, got Str → x gets ErrorType
let y = x + 1; // x is ErrorType → suppress the type error for +
Without ErrorType, a checker that falls back to the initialiser's type (Str)
for x would report a second error on x + 1 ("expected Num, got Str"), even
though the only real mistake is on the first line.
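The suppression rule is small enough to sketch in full. The example uses a slightly different program than the one above (an unannotated let whose initialiser is itself ill-typed) so that the cascade is visible; T, Checker, expect, add, and errorsWithSentinel are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified type tags; Error is the sentinel and is never user-written.
enum class T { Num, Str, Any, Error };

struct Checker {
    std::vector<std::string> errors;

    // Checks `actual` against `expected`. Reports at most one error and
    // returns Error on failure so that dependent expressions stay silent.
    T expect(T expected, T actual, int line) {
        if (actual == T::Error) return T::Error;  // already reported upstream
        if (expected == T::Any || actual == T::Any || expected == actual)
            return expected;
        errors.push_back("[line " + std::to_string(line) + "] type mismatch");
        return T::Error;
    }

    // Type of `lhs + rhs`: Num if both operands check, Error otherwise.
    T add(T lhs, T rhs, int line) {
        T l = expect(T::Num, lhs, line);
        T r = expect(T::Num, rhs, line);
        return (l == T::Error || r == T::Error) ? T::Error : T::Num;
    }
};

// Models:  let x = "str" + 1;   (line 1, the real mistake)
//          let y = x + 1;       (line 2, depends on x)
std::size_t errorsWithSentinel() {
    Checker c;
    T x = c.add(T::Str, T::Num, 1);
    c.add(x, T::Num, 2);
    return c.errors.size();
}
```

Only line 1 is reported. The sentinel behaves like Any for checking purposes but, unlike Any, records that a diagnostic has already been issued.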
The complete pipeline check
void runProgram(const std::string& src) {
Lexer lex(src);
auto tokens = lex.scanAll();
Parser parser(tokens);
auto stmts = parser.parse();
// interp is the Interpreter instance, assumed to be declared at file scope.
Resolver resolver(interp);
resolver.resolve(stmts);
TypeChecker checker;
checker.check(stmts);
if (!checker.errors().empty()) {
for (auto& e : checker.errors())
std::cerr << "[line " << e.line << "] " << e.message << "\n";
return;
}
for (auto& s : stmts) interp.execute(*s);
}
Testing error messages
The test suite verifies not just that errors are reported but that the message contains the right content:
void test_error_messages() {
try {
typeCheck("let x: Num = true;");
std::cerr << "FAIL: expected error\n";
} catch (const TypeCheckError& e) {
CHECK_CONTAINS(e.message, "Num");
CHECK_CONTAINS(e.message, "Bool");
}
try {
typeCheck("fn f(x: Num) {} f(\"hi\");");
std::cerr << "FAIL: expected error\n";
} catch (const TypeCheckError& e) {
CHECK_CONTAINS(e.message, "Num");
CHECK_CONTAINS(e.message, "Str");
}
}
What you've built by the end of cp-05
- A type ADT (Num, Bool, Str, Nil, Any, Fn(...) → ...).
- Optional type annotations in the parser without breaking old code.
- A type-checker pass that infers and verifies types for all expressions.
- Gradual typing via Any for incremental annotation of large codebases.
- Clear error messages with expected-vs-actual and source locations.
- A complete pipeline: parse → resolve → type-check → interpret.
This is the foundation that production front-ends build on. The gap between cp-05 and, say, TypeScript's checker is not concept but scale: more types, more inference, generics. The core loop is the same: walk the AST, compute a type for each expression, and check it against what the context expects.
cp-06 — Bytecode Compiler (AST → Stack-VM Chunks)
Status: ✅ Implemented.
Replaces the tree-walking model with compile-then-run: AST → flat array of bytecodes ("chunk"). The chunks are executed in cp-07.
What's Built
- Op enum: a 32-instruction bytecode ISA covering stack manipulation, globals/locals access, arithmetic/logic/comparison, control flow (JUMP, JUMP_IF_FALSE, LOOP), I/O, plus reserved opcodes (CALL, RETURN, CLOSURE, upvalues) that cp-07 will activate.
- Chunk: bytecode array + deduplicated constants pool + parallel line table.
- Compiler: AST visitor (both ExprVisitor<void> and StmtVisitor) that emits bytecode while tracking lexical locals as stack slots.
- disassembler: human-readable dump for debugging and unit testing.
- mlc CLI: mlc file.ml compiles a file and prints the chunk; mlc alone reads stdin.
Architecture
source → Lexer → Parser → Resolver → TypeChecker → Compiler → Chunk
│
└─→ Disassembler → text
The frontend (lex/parse/resolve/typecheck) is unchanged from cp-05; we re-use it. The interpreter was deleted. The new backend stages are Compiler and Disassembler. The tree-walker's Environment chain is gone — locals are stack slots, globals live in a (future) runtime hash table keyed by name strings interned in the constants pool.
Reading Order
- CONCEPTS.md - stack machines, bytecode design, operand encoding, why this is faster than tree-walking.
- steps/01-instruction-set-design.md
- steps/02-the-chunk.md
- steps/03-emit-helpers-and-jumps.md
- steps/04-locals-vs-globals.md
- steps/05-control-flow.md
- steps/06-short-circuit-logic.md
- steps/07-disassembler-and-testing.md
- src/cpp/ - actual code.
Build & Run
cd src/cpp
cmake -S . -B build -G "Unix Makefiles"
cmake --build build -j
ctest --test-dir build --output-on-failure
Then disassemble a program:
echo 'let n = 10; print n * (n + 1) / 2;' | ./build/mlc
Outcomes
After reading the code and steps you can:
- Design a bytecode instruction set from first principles, justifying every operand width.
- Compile a typed AST to a flat, executable form using a single forward pass.
- Encode if/else, while, and short-circuit &&/|| using only conditional jumps.
- Resolve identifier references to stack slots (locals) vs hash lookups (globals).
- Disassemble chunks for debugging and assert on the byte stream in unit tests.
- Articulate the trade-offs between stack VMs (this) and register VMs (Lua, Dalvik).
- Identify what's deferred to cp-07 (call frames, closures, CALL/RETURN, GC) and why each requires a runtime.
Limitations (revisited in cp-07)
- No execution. We compile, we disassemble, we stop. The VM is cp-07's job.
- No function bodies, calls, or return. Closures need call frames and upvalues — both runtime concepts.
- Constants are capped at 256 per chunk (1-byte index). cp-07 will add CONSTANT_LONG with a 3-byte index for chunks that need more.
- No source spans for error reporting beyond line numbers. cp-15 expands this.
Step 1 — Instruction Set Design
Goal
Design a bytecode instruction set for MiniLang that:
- Is easy to emit from the AST (one or two opcodes per node).
- Is easy to execute with a tight dispatch loop.
- Encodes operands compactly (typical instruction = 1–3 bytes).
- Leaves headroom for cp-07's call frames, closures, and upvalues.
The output of this step is opcode.hpp: a single enum and its opName() lookup.
Stack Machine vs Register Machine
Two dominant designs:
| | Stack machine | Register machine |
|---|---|---|
| Examples | JVM, .NET CLR, CPython, WASM | Lua 5+, Dalvik, V8 Ignition |
| Operand encoding | Implicit (top of stack) | Explicit (register indices) |
| Instruction count for a = b + c | 4 (GetB, GetC, Add, SetA) | 1 (Add a, b, c) |
| Compiler complexity | Low | High (register allocation) |
| Per-op dispatch overhead | Higher (more ops) | Lower |
| Total dispatch overhead | Comparable in practice | Comparable in practice |
We pick stack because (a) compilation is dead simple — every expression node "leaves its result on the stack", and (b) the visitor pattern compiles cleanly to stack opcodes (you don't need to thread destination registers through the visitor).
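To make point (a) concrete, here is a minimal sketch of a post-order walk that emits stack opcodes. The Expr struct, the leaf/binary helpers, and the textual opcode names are hypothetical stand-ins invented for this example; they are not the lab's AST or opcode.hpp.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mini-AST: a leaf names a variable, an interior node is a binary op.
struct Expr {
    char op = 0;                      // 0 for a leaf, '+' or '*' for a binary node
    std::string name;                 // leaf only
    std::unique_ptr<Expr> lhs, rhs;   // binary node only
};

std::unique_ptr<Expr> leaf(std::string name) {
    auto e = std::make_unique<Expr>();
    e->name = std::move(name);
    return e;
}

std::unique_ptr<Expr> binary(char op, std::unique_ptr<Expr> lhs, std::unique_ptr<Expr> rhs) {
    auto e = std::make_unique<Expr>();
    e->op = op;
    e->lhs = std::move(lhs);
    e->rhs = std::move(rhs);
    return e;
}

// Post-order walk: compile both operands, then the operator. Each call
// leaves exactly one (conceptual) value on the stack, so no destination
// register has to be passed down the recursion.
void compileExpr(const Expr& e, std::vector<std::string>& out) {
    if (e.op == 0) {
        out.push_back("Get " + e.name);
        return;
    }
    compileExpr(*e.lhs, out);
    compileExpr(*e.rhs, out);
    out.push_back(e.op == '+' ? "Add" : "Mul");
}

// Compiles (a + b) * c and returns the textual opcode sequence.
std::vector<std::string> compileSample() {
    auto tree = binary('*', binary('+', leaf("a"), leaf("b")), leaf("c"));
    std::vector<std::string> out;
    compileExpr(*tree, out);
    return out;
}
```

For (a + b) * c the walk produces Get a, Get b, Add, Get c, Mul. A register-machine version of compileExpr would need an extra parameter or return value naming the register that holds each sub-result.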
Byte-Aligned, Variable-Length Encoding
Each instruction is one opcode byte followed by zero or more inline operand bytes:
+--------+--------+--------+
| OPCODE | (operands…) |
+--------+--------+--------+
Operand widths used in cp-06:
- 1 byte for constant-pool indices (max 256 constants per chunk).
- 1 byte for local-slot indices (max 256 locals per function).
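- 2 bytes (big-endian) for jump offsets. The operand is unsigned, so a single jump can span at most 65,535 bytes; the direction comes from the opcode (Jump and JumpIfFalse add the offset to ip, Loop subtracts it).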
The trade-off: variable length means the disassembler has to know each opcode's operand size, but the byte stream is smaller than a fixed-width 32-bit encoding (a 4-byte format such as Lua's spends the full word even on instructions that need little or no operand data).
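The size argument can be checked with a few lines of arithmetic. The sketch below uses a hypothetical MiniOp subset with the operand widths listed above; the enum and function names are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical opcode subset; operand widths follow the list above.
enum class MiniOp { Constant, GetLocal, Add, Pop, Jump, Return };

std::size_t operandBytes(MiniOp op) {
    switch (op) {
        case MiniOp::Constant: return 1;   // constant-pool index
        case MiniOp::GetLocal: return 1;   // local-slot index
        case MiniOp::Jump:     return 2;   // 16-bit offset
        case MiniOp::Add:
        case MiniOp::Pop:
        case MiniOp::Return:   return 0;   // opcode byte only
    }
    return 0;
}

// Size of the sequence in the variable-length encoding: 1 opcode byte + operands.
std::size_t variableLengthSize(const std::vector<MiniOp>& ops) {
    std::size_t bytes = 0;
    for (MiniOp op : ops) bytes += 1 + operandBytes(op);
    return bytes;
}

// Size of the same sequence if every instruction were padded to 32 bits.
std::size_t fixedWidthSize(const std::vector<MiniOp>& ops) {
    return ops.size() * 4;
}
```

For Constant, Constant, Add, Pop, Return this gives 7 bytes against 20. The comparison holds the instruction count fixed, which flatters the stack encoding: a real register machine would need fewer instructions for the same source.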
The cp-06 ISA at a glance
The 32 opcodes break into six functional groups plus a reserved set:
Stack/literal: Constant Nil True False Pop
Variables: DefGlobal GetGlobal SetGlobal GetLocal SetLocal
Arithmetic: Add Sub Mul Div Mod Neg
Logic/compare: Not Eq Ne Lt Le Gt Ge
Control flow: Jump JumpIfFalse Loop
I/O: Print
Reserved (cp-07): Call Return Closure GetUpvalue SetUpvalue CloseUpvalue
Notice:
- No dedicated And/Or. Short-circuit logic compiles to control flow (JumpIfFalse + Pop). See step 6.
- Loop is just an unconditional backward jump with a u16 operand. Every jump operand is unsigned; the opcode decides the direction (Jump and JumpIfFalse add it to ip, Loop subtracts it). Forward jumps are emitted with a placeholder and back-patched, whereas a backward jump's target is already known when the instruction is emitted.
- No typed arithmetic ops. Add works on both numbers (sum) and strings (concat) — the compile-time type check guarantees both operands have the same type, so the runtime only has to pick the matching operation. Other VMs (JVM with iadd/fadd/…) split by type for performance; we trade that for simplicity.
- Reserved opcodes are decoded by the disassembler but, apart from the Return that terminates the script chunk (visible in the step 7 listings), never produced by the cp-06 compiler. cp-07 wires them up.
Why 32 Opcodes and Not 200?
Crafting Interpreters' Lox has ~30 opcodes. Python has ~120, Java has ~200. Each extra opcode is one more branch in your dispatch switch — and as opcodes proliferate, micro-ops dilute the hot ones. Modern VMs (V8) actually go the other direction: a few big "polymorphic" opcodes that internally specialise.
For learning, fewer opcodes means less to memorise. cp-08 (TAC IR) and cp-09 (SSA) re-explore the design space; cp-13 (MLIR) lets you define your own dialects.
Self-Check
After this step you should be able to:
- Pick an instruction set design (stack vs register, variable vs fixed) and defend it.
- Predict, for a source like a + b * c, the exact opcode sequence.
- Explain why we don't have OpAnd.
- List which opcodes are reserved and what runtime concept each needs.
Step 2 — The Chunk
A chunk is a self-contained compiled unit: every byte the VM will execute, every constant those bytes refer to, and the source-line metadata for diagnostics. In cp-06 there is exactly one chunk — the top-level script. cp-07 will add a chunk-per-function model.
See chunk.hpp.
Data Layout
struct Chunk {
std::vector<uint8_t> code; // flat byte stream
std::vector<Value> constants; // referenced by 1-byte index
std::vector<int> lines; // lines[i] = source line for code[i]
std::string name = "<script>";
};
Three parallel structures:
code
The byte stream. Opcode and operand bytes are mixed: e.g., [CONSTANT, 0, ADD] is three bytes, two instructions. The VM advances ip by 1 for opcode then by N more for operands.
constants
A pool of Values. Literals in the source (42, "hello") are interned here. The compiler emits OpConstant ix where ix is the pool index. Up to 256 entries (1-byte index). Deduplication is by value-equality so print 1; print 1; uses one slot.
lines
Parallel array: lines[i] is the source line of byte code[i]. When the VM throws a runtime error, it consults lines[ip-1] to print "RuntimeError at line 17". The disassembler suppresses repeated line numbers visually so consecutive bytes on the same line read as a group (|).
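A minimal sketch of how the code/lines pairing can be kept consistent, assuming all appends go through a single writeByte function. MiniChunk and lineForFault are hypothetical names for this example; the constants pool and chunk name are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Trimmed-down stand-in for the lab's Chunk (constants and name omitted).
struct MiniChunk {
    std::vector<std::uint8_t> code;
    std::vector<int> lines;   // lines[i] is the source line of code[i]

    // Appending through one function keeps the two vectors the same length.
    void writeByte(std::uint8_t byte, int line) {
        code.push_back(byte);
        lines.push_back(line);
    }

    // ip is the offset of the NEXT byte to read, so the byte that was just
    // consumed (and belongs to the faulting instruction) is at ip - 1.
    int lineForFault(std::size_t ip) const { return lines[ip - 1]; }
};
```

Because operand bytes are tagged with the same line as their opcode, the lookup gives the right answer whether the VM faults right after reading the opcode or after reading its operands.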
Why Parallel Vectors?
The alternative is a vector of Instruction { opcode; operand; line; } structs. That would be cache-cleaner per instruction but each struct is 8+ bytes vs 1 byte for a packed stream. For a typical chunk (hundreds to thousands of bytes), the byte stream pulls more instructions per cache line.
Real VMs make the same choice: HotSpot interprets the JVM's packed, variable-length bytecode, and V8's Ignition also uses a variable-length byte stream (one opcode byte plus operands, with prefix bytecodes that widen the operands when needed). Neither uses one-struct-per-instruction in production.
addConstant and Deduplication
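uint8_t addConstant(const Value& v) {
// Reuse an existing slot when an equal constant is already in the pool.
for (size_t i = 0; i < constants.size(); ++i)
if (valuesEqual(constants[i], v)) return static_cast<uint8_t>(i);
// Pool full: return 255 and let the compiler report "too many constants in chunk".
if (constants.size() >= 256) return 255;
constants.push_back(v);
return static_cast<uint8_t>(constants.size() - 1);
}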
O(n²) over a chunk's compilation but n is small (constants typically <50 per script). For real workloads you'd hash; we keep the linear scan for clarity and zero dependencies.
valuesEqual is structural: same kind, same payload. For strings this is == on the contained std::string. For functions (cp-07) we'll compare by pointer identity since two fn () {} declarations are different closures even with identical source.
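For comparison, a hashed pool might look like the sketch below. It is not the lab's code: MiniValue (a std::variant of double and std::string) and ConstantPool are hypothetical stand-ins, the 256-entry cap is ignored, and functions are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

using MiniValue = std::variant<double, std::string>;  // hypothetical stand-in for Value

class ConstantPool {
public:
    // Returns the existing index when an equal constant was already added.
    std::size_t add(const MiniValue& v) {
        if (const double* d = std::get_if<double>(&v)) return intern(numbers_, *d, v);
        return intern(strings_, std::get<std::string>(v), v);
    }
    std::size_t size() const { return values_.size(); }

private:
    // One hash lookup per constant instead of a scan over the whole pool.
    template <class Map, class Key>
    std::size_t intern(Map& map, const Key& key, const MiniValue& v) {
        auto [it, inserted] = map.try_emplace(key, values_.size());
        if (inserted) values_.push_back(v);
        return it->second;
    }

    std::vector<MiniValue> values_;
    std::unordered_map<double, std::size_t> numbers_;       // one index map per kind
    std::unordered_map<std::string, std::size_t> strings_;
};
```

Keeping one map per value kind avoids writing a custom hash for the variant. One behavioural difference to be aware of: NaN never compares equal to itself, so each NaN constant would get a fresh slot here.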
Overflow
If constants.size() >= 256, addConstant returns 255 and the compiler emits a diagnostic ("too many constants in chunk"). cp-07 introduces OpConstantLong with a 3-byte (24-bit) operand to lift this to 16M.
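A 24-bit operand is just three bytes written in a fixed order. The sketch below assumes big-endian order to match the jump operands; the actual byte order of OpConstantLong is not stated here, and writeU24/readU24 are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 24-bit index as three big-endian bytes (assumed byte order).
void writeU24(std::vector<std::uint8_t>& code, std::uint32_t index) {
    code.push_back(static_cast<std::uint8_t>((index >> 16) & 0xff));
    code.push_back(static_cast<std::uint8_t>((index >> 8) & 0xff));
    code.push_back(static_cast<std::uint8_t>(index & 0xff));
}

// Read the three bytes starting at `at` back into an index.
std::uint32_t readU24(const std::vector<std::uint8_t>& code, std::size_t at) {
    return (static_cast<std::uint32_t>(code[at]) << 16) |
           (static_cast<std::uint32_t>(code[at + 1]) << 8) |
           static_cast<std::uint32_t>(code[at + 2]);
}
```

The largest encodable index is 0xFFFFFF = 16,777,215, which is where the "16M" figure comes from.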
Lifetimes
The Chunk owns its constants by value. Strings are std::strings on the heap inside Value::s. cp-07 will introduce a GC for runtime-allocated strings (string concat results) but constant strings live for the chunk's lifetime — they're effectively read-only and could be interned across chunks in a future optimisation pass.
Self-Check
- Why three parallel arrays and not one array of structs?
- How would you change Chunk to support more than 256 constants?
- What invariant must hold between code.size() and lines.size()?
Step 3 — Emit Helpers and Back-Patching Jumps
The compiler is one long sequence of emit(byte, line) calls. Most are trivial; the interesting case is forward jumps, where you have to emit an instruction whose target you don't know yet.
See emit*, emitJump, patchJump, emitLoop in compiler.cpp.
Emit Primitives
void emit(Op op, int line);
void emit(uint8_t byte, int line);
void emitConstant(const Value& v, int line);
These all reduce to chunk_->writeByte (append + track line). emitConstant is one helper because every constant emission is [OpConstant, ix].
The Back-Patching Problem
Consider if (cond) thenBranch else elseBranch. We want bytecode roughly:
<cond>
JumpIfFalse ELSE_START
Pop ; drop condition on then-path
<thenBranch>
Jump END
ELSE_START:
Pop ; drop condition on else-path
<elseBranch>
END:
When we emit JumpIfFalse ELSE_START, we don't yet know what ELSE_START is — it depends on how big the <thenBranch> bytecode turns out to be. Same for Jump END.
Solution: emit a placeholder operand (0xff 0xff), remember its offset, and write the real value once the target is known.
size_t emitJump(Op op, int line) {
emit(op, line);
emit(0xff, line);
emit(0xff, line);
return chunk_->code.size() - 2; // offset of the placeholder bytes
}
void patchJump(size_t at, int line) {
size_t target = chunk_->code.size();
size_t jumpFrom = at + 2; // ip after the operand
size_t off = target - jumpFrom;
if (off > 0xffff) error(line, "branch too far");
chunk_->code[at] = (off >> 8) & 0xff;
chunk_->code[at + 1] = off & 0xff;
}
Usage:
size_t thenJump = emitJump(Op::JumpIfFalse, line);
emit(Op::Pop, line);
visit(thenBranch);
size_t endJump = emitJump(Op::Jump, line);
patchJump(thenJump, line);
emit(Op::Pop, line);
visit(elseBranch);
patchJump(endJump, line);
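The same pattern can be exercised in isolation. The sketch below re-implements emitJump and patchJump over a bare byte vector and lowers an if/else whose branch bodies are filler bytes. The opcode byte values, the Lowered struct, and jumpTarget are assumptions made for this example, and an assert stands in for the compiler's "branch too far" diagnostic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Code = std::vector<std::uint8_t>;

// Opcode byte values are arbitrary placeholders for this sketch.
constexpr std::uint8_t kNop = 0x00, kJumpIfFalse = 0x10, kJump = 0x11, kPop = 0x12;

// Emits the opcode plus a two-byte placeholder; returns the placeholder's offset.
std::size_t emitJump(Code& code, std::uint8_t op) {
    code.push_back(op);
    code.push_back(0xff);
    code.push_back(0xff);
    return code.size() - 2;
}

// Points the jump whose operand starts at `at` to the current end of the code.
void patchJump(Code& code, std::size_t at) {
    std::size_t off = code.size() - (at + 2);
    assert(off <= 0xffff && "branch too far");
    code[at]     = static_cast<std::uint8_t>((off >> 8) & 0xff);
    code[at + 1] = static_cast<std::uint8_t>(off & 0xff);
}

// Decodes the operand the way a VM would: ip sits past the operand, then adds it.
std::size_t jumpTarget(const Code& code, std::size_t at) {
    std::size_t off = (static_cast<std::size_t>(code[at]) << 8) | code[at + 1];
    return at + 2 + off;
}

struct Lowered {
    Code code;
    std::size_t thenJump;   // operand offset of the JumpIfFalse
    std::size_t endJump;    // operand offset of the Jump over the else-branch
};

// if/else lowering with kNop bytes standing in for the two branch bodies.
Lowered lowerIfElse(std::size_t thenBytes, std::size_t elseBytes) {
    Lowered r{};
    r.thenJump = emitJump(r.code, kJumpIfFalse);
    r.code.push_back(kPop);
    r.code.insert(r.code.end(), thenBytes, kNop);
    r.endJump = emitJump(r.code, kJump);
    patchJump(r.code, r.thenJump);          // else-branch starts here
    r.code.push_back(kPop);
    r.code.insert(r.code.end(), elseBytes, kNop);
    patchJump(r.code, r.endJump);           // end of the whole statement
    return r;
}
```

With a 3-byte then-body and a 2-byte else-body, the JumpIfFalse resolves to offset 10 (the else-path Pop) and the Jump resolves to offset 13, one past the last byte.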
Backward Jumps — Loop
Loops are easier because the target (loop start) was already emitted:
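void emitLoop(size_t loopStart, int line) {
emit(Op::Loop, line);
// +2 accounts for the two operand bytes that follow the opcode.
size_t off = chunk_->code.size() - loopStart + 2;
if (off > 0xffff) error(line, "branch too far");
emit((off >> 8) & 0xff, line);
emit(off & 0xff, line);
}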
No patching needed — the offset is computed inline. The VM reads the unsigned operand and subtracts it from ip (which now points past the operand bytes).
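To see that the +2 lines up with what the VM does, the sketch below emits a Loop over a bare byte vector and then replays the VM's ip arithmetic. The byte values and the ipAfterLoop helper are hypothetical; only the offset formula is taken from emitLoop above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Code = std::vector<std::uint8_t>;
constexpr std::uint8_t kNop = 0x00, kLoop = 0x13;   // placeholder byte values

// Same offset formula as emitLoop above, without line tracking.
void emitLoop(Code& code, std::size_t loopStart) {
    code.push_back(kLoop);
    std::size_t off = code.size() - loopStart + 2;   // +2: the operand bytes
    code.push_back(static_cast<std::uint8_t>((off >> 8) & 0xff));
    code.push_back(static_cast<std::uint8_t>(off & 0xff));
}

// Mimics the VM: fetch the opcode, read the operand, then ip -= off.
std::size_t ipAfterLoop(const Code& code, std::size_t ip) {
    ++ip;                                            // opcode consumed
    std::size_t off = (static_cast<std::size_t>(code[ip]) << 8) | code[ip + 1];
    ip += 2;                                         // operand consumed
    return ip - off;
}
```

With the loop starting at offset 5 and the Loop opcode at offset 9, the encoded operand is 7: ip is 12 after the operand has been read, and 12 - 7 lands back on 5.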
Why Two-Byte Offsets?
A u16 gives each jump a reach of 65,535 bytes (just under 64 KB): forward for Jump/JumpIfFalse, backward for Loop. Real programs in MiniLang rarely have functions larger than a few KB. If we hit the limit (the compiler emits "branch too far"), cp-07 will either:
- introduce a JumpLong with a u24/u32 operand, or
- bounce through a trampoline (emit a short jump to an intermediate JumpLong that does the real work).
JVM uses goto_w for the same reason: long jumps are an opcode flavour, not a switch.
Sentinel Bytes — 0xff 0xff
Why fill placeholders with 0xff 0xff rather than 0x00 0x00? It's a defensive habit: if we forget to patchJump, the VM will read a 65535-byte jump and trip an obvious bug rather than a subtle off-by-one (jumping zero bytes). A linter / asan could be configured to flag this further.
Self-Check
- For an if with no else, you only need one jump. Why?
- Why do we Pop twice in the if/else lowering (once on each branch)?
- Could emitJump return an Op* instead of a size_t? What problem would that cause?
Step 4 — Locals vs Globals
The compiler resolves every variable reference to one of two operations:
- GET_GLOBAL / SET_GLOBAL / DEF_GLOBAL — a hash lookup at runtime, keyed by the name string in the constants pool. Used for any binding declared at the top level.
- GET_LOCAL / SET_LOCAL — a direct stack-slot fetch. Used for any binding declared inside a block.
The split mirrors Lox, CPython, and Lua. Globals are slow-but-flexible (you can monkey-patch them, late-bind them, redefine them); locals are fast-but-strict (their slot is baked into the bytecode at compile time).
Tracking Locals at Compile Time
The compiler keeps a flat std::vector<Local> parallel to the runtime stack layout:
struct Local {
std::string name;
int depth; // scope depth at declaration
bool isConst; // `let` is true, `var` is false
};
std::vector<Local> locals_;
int scopeDepth_ = 0;
The index in locals_ is the stack slot. When the compiler emits OpGetLocal n, the runtime will compute frame.slots[n] and push that.
Entering and leaving scope
void beginScope() { ++scopeDepth_; }
void endScope(int line) {
while (!locals_.empty() && locals_.back().depth >= scopeDepth_) {
emit(Op::Pop, line);
locals_.pop_back();
}
--scopeDepth_;
}
Every local that leaves the scope must be popped off the runtime stack, so we emit one Op::Pop per local being removed. The compiler's local table is the source of truth for runtime stack layout.
Declaring
void addLocal(const std::string& name, bool isConst, int line) {
for (int i = locals_.size() - 1; i >= 0 && locals_[i].depth == scopeDepth_; --i) {
if (locals_[i].name == name)
error(line, "variable '" + name + "' already declared in this scope");
}
locals_.push_back({name, scopeDepth_, isConst});
}
Same-scope redeclaration is forbidden. Cross-scope shadowing is allowed — the resolver in cp-04 already enforced this, but the compiler double-checks because it's the one assigning slots.
Resolving
int resolveLocal(const std::string& name) const {
for (int i = locals_.size() - 1; i >= 0; --i)
if (locals_[i].name == name) return i;
return -1;
}
We walk backwards so inner shadowing wins (the most recently declared x is the one in scope).
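The scope bookkeeping can be tried out on its own. The Scopes class below is a cut-down, hypothetical version of the compiler's local tracking: it drops isConst and error reporting, and endScope returns the number of Pop instructions that would be emitted instead of emitting them.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Local {
    std::string name;
    int depth;   // scope depth at declaration
};

class Scopes {
public:
    void beginScope() { ++depth_; }

    // Returns how many Pop instructions the compiler would emit on exit.
    int endScope() {
        int pops = 0;
        while (!locals_.empty() && locals_.back().depth >= depth_) {
            locals_.pop_back();
            ++pops;
        }
        --depth_;
        return pops;
    }

    void addLocal(const std::string& name) { locals_.push_back({name, depth_}); }

    // Back-to-front search, so the innermost declaration wins.
    int resolveLocal(const std::string& name) const {
        for (int i = static_cast<int>(locals_.size()) - 1; i >= 0; --i)
            if (locals_[i].name == name) return i;
        return -1;   // not a local: the caller falls back to a global lookup
    }

private:
    std::vector<Local> locals_;
    int depth_ = 0;
};
```

Declaring x and y in an outer block and a second x in an inner block makes x resolve to slot 2 inside the inner block and to slot 0 again once that block has been closed.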
How Globals Are Encoded
void Compiler::visit(IdentExpr& e) {
int slot = resolveLocal(e.name);
if (slot >= 0) {
emit(Op::GetLocal, e.line);
emit(static_cast<uint8_t>(slot), e.line);
} else {
uint8_t ix = makeConstant(Value::makeString(e.name), e.line);
emit(Op::GetGlobal, e.line);
emit(ix, e.line);
}
}
The global name ("x", "foo") is interned into the constants pool. At runtime, cp-07's VM will do globals[constants[ix].s] — a hash lookup. Note the same name string is reused for DEF_GLOBAL / GET_GLOBAL / SET_GLOBAL thanks to addConstant deduplication.
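On the runtime side, the three global opcodes reduce to three operations on a hash table. The sketch below is a guess at the shape, not cp-07's code: values are plain doubles, GlobalTable is a hypothetical name, and the rule that SET_GLOBAL fails on an undefined name is an assumption borrowed from Lox.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

// Values are plain doubles here to keep the sketch short.
class GlobalTable {
public:
    // DEF_GLOBAL: create or overwrite the binding.
    void define(const std::string& name, double value) { table_[name] = value; }

    // GET_GLOBAL: an empty result models an "undefined variable" runtime error.
    std::optional<double> get(const std::string& name) const {
        auto it = table_.find(name);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }

    // SET_GLOBAL: assumed (as in Lox) to fail when the name was never defined.
    bool set(const std::string& name, double value) {
        auto it = table_.find(name);
        if (it == table_.end()) return false;
        it->second = value;
        return true;
    }

private:
    std::unordered_map<std::string, double> table_;
};
```

In the VM the name argument would come from constants[ix], so the string is hashed on every access. That per-access cost is the "slow" half of the slow-but-flexible trade-off described earlier in this step.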
Init Expression and the Stack
void Compiler::visit(LetStmt& s) {
const int line = s.line; // assumes LetStmt records its line like IfStmt does
if (s.init) s.init->accept(*this); else emit(Op::Nil, line);
if (scopeDepth_ == 0) {
uint8_t ix = makeConstant(Value::makeString(s.name), line);
emit(Op::DefGlobal, line); emit(ix, line);
} else {
addLocal(s.name, s.kind == DeclKind::Let, line);
// Init value is already on top of the stack — that's our local slot.
}
}
A subtle invariant: for a local, the init expression leaves the value on the stack and we just say "from now on, that stack slot is named s.name". No OpDefLocal opcode is needed — the local exists implicitly the moment we record it in locals_.
Why Globals Survive endScope
endScope pops every local with depth >= scopeDepth_. Globals are declared at depth 0, but they are never entered into locals_: OpDefGlobal copies the initialiser into the runtime hash table and pops it off the stack, so no stack slot is left behind. They survive scope exit by living somewhere the scope-exit pop loop doesn't touch.
Self-Check
- Why do we resolve locals back-to-front?
- What runtime data structure does DEF_GLOBAL write to (refer ahead to cp-07)?
- If you wanted to allow let x = x; legally (reading the outer x to init the inner one), where would you change the compiler?
- How would you add OpGetLocalLong to support more than 256 locals?
Step 5 — Control Flow (if / while)
All conditional execution in MiniLang's bytecode is achieved with three opcodes:
- JumpIfFalse off — if !truthy(peek()), advance ip by off. Value remains on the stack.
- Jump off — unconditional forward branch.
- Loop off — unconditional backward branch (subtracts off from ip).
Pop is the other star of the show — without it, the runtime stack would slowly fill with leftover conditions.
Lowering if (cond) then [else other]
<cond>
JumpIfFalse L_ELSE
Pop ; drop condition on the then-path
<then>
Jump L_END
L_ELSE:
Pop ; drop condition on the else-path
<other> ; (omitted if no else, but the Pop still emits)
L_END:
Two pops, one per branch. JumpIfFalse leaves the value on the stack (it has to — if we popped first we couldn't both consult the value AND keep the stack discipline aligned across both branches). Each branch is then responsible for popping it once it's chosen.
If the else branch is missing, the structure simplifies but the second Pop still has to run — otherwise the runtime stack drifts upward by one each time we hit a "no-else" if.
In code (compiler.cpp):
void Compiler::visit(IfStmt& s) {
s.cond->accept(*this);
size_t thenJump = emitJump(Op::JumpIfFalse, s.line);
emit(Op::Pop, s.line);
s.thenBranch->accept(*this);
size_t elseJump = emitJump(Op::Jump, s.line);
patchJump(thenJump, s.line);
emit(Op::Pop, s.line);
if (s.elseBranch) s.elseBranch->accept(*this);
patchJump(elseJump, s.line);
}
Lowering while (cond) body
L_START:
<cond>
JumpIfFalse L_EXIT
Pop ; drop condition on the body-path
<body>
Loop L_START
L_EXIT:
Pop ; drop condition on the exit-path
void Compiler::visit(WhileStmt& s) {
size_t loopStart = chunk_->code.size();
s.cond->accept(*this);
size_t exitJump = emitJump(Op::JumpIfFalse, s.line);
emit(Op::Pop, s.line);
s.body->accept(*this);
emitLoop(loopStart, s.line);
patchJump(exitJump, s.line);
emit(Op::Pop, s.line);
}
Note the same pop-on-both-branches dance. The Loop instruction takes an unsigned 16-bit operand and the VM does ip -= off. Because we know loopStart upfront, no back-patching is needed.
Truthiness
JumpIfFalse consults Value::isTruthy():
bool Value::isTruthy() const {
switch (kind) {
case K::Nil: return false;
case K::Bool: return b;
default: return true; // numbers (including 0!), strings, fns
}
}
This matches Lua and Ruby: only nil and false are falsy. 0 is truthy, "" is truthy. The alternatives are different design choices: Python treats zero and empty containers as falsy, while JavaScript treats zero and the empty string (but not empty arrays or objects) as falsy. Each is coherent.
Nested Control Flow
Each if/while is independent — its own jumps, its own pops. Nesting Just Works because every visit method maintains the "expression leaves exactly one value on the stack" / "statement is stack-neutral" invariants.
For example, if (a) { while (b) { ... } } compiles to a while lowering nested inside the then-branch of the if. The if's Pop (then-path) drops a; the while's pops drop b. No interference.
break and continue?
Not yet — MiniLang has no break/continue keywords. Adding them is a fun exercise:
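- Compile-time: keep a stack of vector<size_t> breakSites, one per active loop. break does emitJump(Op::Jump) and pushes the placeholder offset. On loop exit, patch all of them to land just past the exit-path Pop: a break leaves from inside the body, where the condition has already been popped, so jumping to L_EXIT itself would pop one value too many.
- continue jumps to L_START (a backward Loop, computed inline).
- Both must first pop any locals declared inside the loop body, since they skip the endScope that would normally do it.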
cp-15 adds these.
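One way the break bookkeeping could look, as a sketch only: LoopContext, LoopCompiler, and the byte values are hypothetical, locals inside the body are ignored, and filler bytes stand in for the condition. The sketch patches break sites to the offset just past the exit-path Pop, because a break leaves from the body path where the condition has already been dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Code = std::vector<std::uint8_t>;
constexpr std::uint8_t kNop = 0x00, kJumpIfFalse = 0x10, kJump = 0x11,
                       kPop = 0x12, kLoop = 0x13;   // placeholder byte values

struct LoopContext {
    std::size_t loopStart;
    std::vector<std::size_t> breakSites;   // operand offsets awaiting a target
};

struct LoopCompiler {
    Code code;
    std::vector<LoopContext> loops;        // innermost loop is loops.back()

    std::size_t emitJump(std::uint8_t op) {
        code.push_back(op);
        code.push_back(0xff);
        code.push_back(0xff);
        return code.size() - 2;
    }
    void patchJump(std::size_t at) {
        std::size_t off = code.size() - (at + 2);
        code[at]     = static_cast<std::uint8_t>((off >> 8) & 0xff);
        code[at + 1] = static_cast<std::uint8_t>(off & 0xff);
    }
    void emitLoop(std::size_t loopStart) {
        code.push_back(kLoop);
        std::size_t off = code.size() - loopStart + 2;
        code.push_back(static_cast<std::uint8_t>((off >> 8) & 0xff));
        code.push_back(static_cast<std::uint8_t>(off & 0xff));
    }

    void emitBreak()    { loops.back().breakSites.push_back(emitJump(kJump)); }
    void emitContinue() { emitLoop(loops.back().loopStart); }
};

// Decodes a forward-jump operand into its absolute target offset.
std::size_t jumpTarget(const Code& code, std::size_t at) {
    std::size_t off = (static_cast<std::size_t>(code[at]) << 8) | code[at + 1];
    return at + 2 + off;
}

// Compiles `while (<cond>) { break; }` with kNop standing in for <cond>.
Code compileWhileWithBreak() {
    LoopCompiler c;
    c.loops.push_back({c.code.size(), {}});
    c.code.push_back(kNop);                              // <cond>
    std::size_t exitJump = c.emitJump(kJumpIfFalse);
    c.code.push_back(kPop);                              // body path drops the condition
    c.emitBreak();                                       // body: break;
    c.emitLoop(c.loops.back().loopStart);
    c.patchJump(exitJump);                               // L_EXIT
    c.code.push_back(kPop);                              // exit path drops the condition
    for (std::size_t site : c.loops.back().breakSites)   // breaks land after that Pop
        c.patchJump(site);
    c.loops.pop_back();
    return c.code;
}
```

In the resulting 12 bytes the JumpIfFalse targets offset 11 (the exit-path Pop) while the break's Jump targets offset 12, one instruction later.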
Self-Check
- Why does if produce two Pop opcodes but a single <cond>?
- What happens if you remove the post-loop Pop from while's lowering?
- How does JumpIfFalse differ from a hypothetical BranchIfFalse that pops the value? Which would you prefer for short-circuit logic in step 6?
Step 6 — Short-Circuit Logic
&& and || are not arithmetic operators — they have to avoid evaluating the right-hand side when the left settles the answer. That's "short-circuiting", and it's expressed entirely with jumps.
a && b
<a>
JumpIfFalse L_END
Pop ; drop the (truthy) <a>
<b>
L_END: ; either <a> (if it was falsy) or <b> sits on the stack
If a is falsy, we skip the Pop <b> block and the falsy value of a remains as the result — exactly what a && b should return when a is falsy. If a is truthy, we pop it and evaluate b, so b is the result.
void Compiler::visit(LogicalExpr& e) {
e.lhs->accept(*this);
if (e.op == TokenKind::AmpAmp) {
size_t endJump = emitJump(Op::JumpIfFalse, e.line);
emit(Op::Pop, e.line);
e.rhs->accept(*this);
patchJump(endJump, e.line);
} else if (e.op == TokenKind::PipePipe) {
// see below
}
}
a || b
<a>
JumpIfFalse L_RHS
Jump L_END ; <a> was truthy; keep it as result
L_RHS:
Pop ; drop the (falsy) <a>
<b>
L_END:
If a is truthy, we jump straight to L_END leaving a on the stack. If a is falsy, we pop it and evaluate b so b becomes the result.
size_t elseJump = emitJump(Op::JumpIfFalse, e.line);
size_t endJump = emitJump(Op::Jump, e.line);
patchJump(elseJump, e.line);
emit(Op::Pop, e.line);
e.rhs->accept(*this);
patchJump(endJump, e.line);
Why JumpIfFalse Doesn't Pop
If JumpIfFalse popped its condition, encoding &&/|| would need a Dup opcode (push another copy first) to preserve the value as the result. By keeping JumpIfFalse non-popping and emitting an explicit Pop on the "consume-the-condition" branch only, we save the Dup and one byte per logical operator.
Lox's compiler in Crafting Interpreters makes the same call.
Truthiness Reminder
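Because JumpIfFalse uses Value::isTruthy and leaves the deciding operand on the stack, && and || return one of their operands rather than a coerced boolean (as in Lua or JavaScript, but with MiniLang's truthiness rules):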
print 0 || 5; // 0 (since 0 is truthy in MiniLang — Lua semantics)
print nil || "hi"; // "hi"
print false || 42; // 42
print 3 && 4; // 4
To get &&/|| that always yield a boolean (the closest analogue of C's 0-or-1 result) you'd append Op::Not Op::Not to coerce the result, at the cost of two extra truthiness tests.
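Since cp-06 has no VM, a throwaway interpreter is a convenient way to convince yourself that the two lowerings return the right operand. Everything in the sketch below is hypothetical scaffolding: Val models only nil, booleans, and numbers; instructions are structs with absolute instruction-index targets instead of byte offsets; and PushA/PushB stand in for the compiled operands.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Val {
    enum Kind { Nil, Bool, Num } kind = Nil;
    bool b = false;
    double n = 0;
};
Val nil()           { return {}; }
Val boolean(bool b) { Val v; v.kind = Val::Bool; v.b = b; return v; }
Val num(double n)   { Val v; v.kind = Val::Num;  v.n = n; return v; }

// Same rule as Value::isTruthy: only nil and false are falsy.
bool truthy(const Val& v) {
    if (v.kind == Val::Nil)  return false;
    if (v.kind == Val::Bool) return v.b;
    return true;   // every number, including 0
}

enum class Op { PushA, PushB, JumpIfFalse, Jump, Pop };
struct Ins { Op op; std::size_t target; };   // target: instruction index (jumps only)

Val run(const std::vector<Ins>& prog, const Val& a, const Val& b) {
    std::vector<Val> stack;
    std::size_t ip = 0;
    while (ip < prog.size()) {
        const Ins in = prog[ip++];
        switch (in.op) {
            case Op::PushA:       stack.push_back(a); break;
            case Op::PushB:       stack.push_back(b); break;
            case Op::JumpIfFalse: if (!truthy(stack.back())) ip = in.target; break;  // peeks, no pop
            case Op::Jump:        ip = in.target; break;
            case Op::Pop:         stack.pop_back(); break;
        }
    }
    return stack.back();   // exactly one value remains
}

// a && b: <a> JumpIfFalse END; Pop; <b>; END
Val evalAnd(const Val& a, const Val& b) {
    const std::vector<Ins> prog = {
        {Op::PushA, 0}, {Op::JumpIfFalse, 4}, {Op::Pop, 0}, {Op::PushB, 0}};
    return run(prog, a, b);
}

// a || b: <a> JumpIfFalse RHS; Jump END; RHS: Pop; <b>; END
Val evalOr(const Val& a, const Val& b) {
    const std::vector<Ins> prog = {
        {Op::PushA, 0}, {Op::JumpIfFalse, 3}, {Op::Jump, 5}, {Op::Pop, 0}, {Op::PushB, 0}};
    return run(prog, a, b);
}
```

evalOr(num(0), num(5)) returns the number 0 because 0 is truthy, and evalAnd(num(3), num(4)) returns 4, matching the examples above.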
Verifying the Compilation
The unit tests assert the exact opcode sequence:
void test_short_circuit_and() {
auto out = compileSource("print false && true;");
CHECK(opsMatch(out.chunk, {
Op::False, Op::JumpIfFalse, Op::Pop, Op::True,
Op::Print, Op::Return
}));
}
This is exactly the lowering we sketched.
Self-Check
- Why don't we need a Dup opcode given how JumpIfFalse is defined?
- Walk through the compiled bytecode for a && b || c. (Hint: || has the lowest precedence; && binds tighter.)
- How would you implement ?? (null-coalescing, "use rhs if lhs is nil else lhs")?
Step 7 — Disassembler and Testing
Without a VM yet (cp-07's job), the only way to know the compiler is right is to read the bytecode it produces. The disassembler is therefore both a debugging tool and the primary test surface.
See disassembler.cpp and tests/test_compiler.cpp.
Output Format
== <script> ==
0000 1 CONSTANT 0 ; 10
0002 | DEF_GLOBAL 1 ; n
0004 | GET_GLOBAL 1 ; n
0006 | CONSTANT 2 ; 1
0008 | ADD
0009 | PRINT
000a | RETURN
Per line:
- 4-hex-digit byte offset.
- Source line number, or | if same as previous (visual grouping).
- Opcode name.
- Operand (right-aligned), with a ; comment showing the resolved constant or scope info.
The format owes everything to Crafting Interpreters' Lox disassembler.
The Dispatcher
Each opcode falls into one of four "shapes":
case Op::Constant: consumed = constantInstr("CONSTANT", chunk, offset, os); break;
case Op::Pop: consumed = simple("POP", os); break;
case Op::GetLocal: consumed = byteInstr("GET_LOCAL", chunk, offset, os); break;
case Op::Jump: consumed = jumpInstr("JUMP", +1, chunk, offset, os); break;
case Op::Loop: consumed = jumpInstr("LOOP", -1, chunk, offset, os); break;
- simple — opcode only, advances 1 byte.
- byteInstr — opcode + 1 operand byte, advances 2.
- constantInstr — opcode + 1 operand byte (constants index), advances 2, resolves and prints the value.
- jumpInstr — opcode + 2 operand bytes (big-endian), advances 3, computes and prints the target offset.
disassembleInstruction returns consumed so the outer loop knows how many bytes to skip. This is the same shape your VM dispatch loop will have (cp-07) — the disassembler and VM are isomorphic structurally; the VM swaps "print this" for "execute this".
Test Strategy
Two complementary approaches:
(a) Exact opcode sequence
For short programs where the lowering is fully predictable:
auto out = compileSource("print 1 + 2 * 3;");
CHECK(opsMatch(out.chunk,
{Op::Constant, Op::Constant, Op::Constant, Op::Mul, Op::Add,
Op::Print, Op::Return}));
opsMatch walks the byte stream, extracts only the opcodes (skipping operand bytes by knowing the size of each opcode), and compares the resulting vector<Op> to your expected list. It's robust to operand-value churn — if Op::Constant's constant-pool slot changes, the test still passes; only the opcode shape matters.
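The idea behind opsMatch fits in a few lines. The sketch below is a hypothetical reimplementation over a raw byte vector with a six-opcode MiniOp enum; the lab's real helper takes a Chunk and knows the full opcode set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcode subset with explicit byte values.
enum class MiniOp : std::uint8_t { Constant = 0, Add = 1, Mul = 2, Print = 3, Return = 4, Jump = 5 };

// Total encoded length of an instruction: opcode byte plus operand bytes.
std::size_t instructionLength(MiniOp op) {
    switch (op) {
        case MiniOp::Constant: return 2;   // 1-byte pool index
        case MiniOp::Jump:     return 3;   // 2-byte offset
        default:               return 1;
    }
}

// Walks the stream, keeping opcodes and skipping their operand bytes.
std::vector<MiniOp> extractOps(const std::vector<std::uint8_t>& code) {
    std::vector<MiniOp> ops;
    for (std::size_t i = 0; i < code.size();) {
        MiniOp op = static_cast<MiniOp>(code[i]);
        ops.push_back(op);
        i += instructionLength(op);
    }
    return ops;
}

bool opsMatchSketch(const std::vector<std::uint8_t>& code, const std::vector<MiniOp>& expected) {
    return extractOps(code) == expected;
}
```

Two byte streams that differ only in their constant-pool indices yield the same opcode vector, which is the robustness property described above.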
(b) Substring match on the disassembly
For control flow where exact jump offsets are noisy but landmark opcodes matter:
CHECK_CONTAINS(out.disasm, "LOOP");
CHECK_CONTAINS(out.disasm, "JUMP_IF_FALSE");
Use this when the presence of an opcode is the assertion, not the exact byte sequence.
Negative tests
auto out = compileSource("{ let a = 1; a = 2; }");
CHECK(!out.compiledOk);
CHECK(/* "immutable" appears in some diagnostic */);
The compiler collects diagnostics without throwing, so tests can verify both the failure and the message text.
What the Tests Cover
- Arithmetic / unary / logical operators emit the right opcodes in the right order.
- Globals get DEF_GLOBAL/GET_GLOBAL/SET_GLOBAL; locals get GET_LOCAL/SET_LOCAL.
- Block scope correctly emits per-local Pop on exit.
- let immutability is enforced.
- if/else and while emit Jump/JumpIfFalse/Loop correctly with paired Pops.
- String constants are deduplicated in the pool.
- The line table has the same length as the code stream and contains real source line numbers.
- fn declarations and calls emit clear "deferred to cp-07" diagnostics.
- Same-scope local redeclaration errors at resolve or compile time.
- Short-circuit && lowers exactly as expected.
That's 15 tests covering all currently-supported features and the principled subset of unsupported ones.
Using the Disassembler at the REPL
echo 'var x = 0; while (x < 3) { x = x + 1; print x; }' | ./build/mlc
Read the output as a sanity check on any compiler change you make. cp-07's VM will replay these same bytes.
Self-Check
- What does a typical disassembled IfStmt look like? Predict the line count.
- Why is opsMatch (a) preferable to a string compare on the disassembly for short programs?
- How would you extend the disassembler to print stack-effect estimates next to each opcode?
cp-07 — Stack-VM Execution Engine
Status: ✅ Built — mlr runs MiniLang programs end-to-end on the stack VM.
This lab takes the bytecode Chunk produced by cp-06 and gives it a runtime:
an operand stack, call frames, globals, and a dispatch loop. After cp-07 you
have a real compiler + virtual machine pair — source goes in, output comes out,
no AST walking at runtime.
Pipeline
source ──► Lexer ──► Parser ──► Resolver ──► TypeChecker ──► Compiler ──► Function(script) ──► VM
│
(with --dump:
Disassembler)
The compiler emits a single top-level Function named <script> whose chunk is
just the program's bytecode. The VM bootstraps execution by calling that
function on an empty stack. Every subsequent function call follows the same
mechanism — the script is just function #0.
What You'll Build
- A Value tagged union with a new Fn case that carries a shared_ptr<Function>.
- A Function record: name, arity, owned Chunk.
- The VM itself:
  - operand stack (std::vector<Value>),
  - call-frame stack (std::vector<CallFrame>),
  - globals table (std::unordered_map<string,Value>),
  - a big switch-based dispatch loop.
- A Compiler that nests one FunctionState per function being compiled, so fn foo() { … } compiles to a fresh chunk with its own locals.
- A driver mlr that compiles then runs.
Reading Order
- CONCEPTS.md — the why: stack vs registers, dispatch cost, call frames, why closures need a separate mechanism.
- steps/ — implementation walkthrough.
- src/cpp/ — the implementation.
Build & Run
cd src/cpp
cmake -S . -B build -G "Unix Makefiles"
cmake --build build -j
# Run a program from a file
./build/mlr program.ml
# …or pipe it
echo 'print 1 + 2 * 3;' | ./build/mlr
# Disassemble the script chunk before running
echo 'fn add(a, b) { return a + b; } print add(3, 4);' | ./build/mlr --dump
# Run tests
ctest --test-dir build --output-on-failure
Status
- ✅ 19 VM tests, 41 sub-checks, 100% green.
- ✅ Arithmetic, locals, globals, control flow, functions, recursion.
- ✅ Compile-time errors (typecheck) flagged; runtime errors print line + message.
- ⏳ Closures over outer locals are deliberately rejected at compile time. Capturing globals works fine; the upvalue machinery lands in cp-12 when we build the JIT.
Prereqs
- cp-06 complete (chunks, opcodes, compiler).
Outcomes
- Run a compiled MiniLang program with no AST present at runtime.
- Explain why CPython, JVM bytecode, V8 Ignition, and Lua share the same architectural sketch (stack + frames + dispatch loop + constant pool).
- Diagnose and reason about stack-balance bugs — the single biggest source of hard-to-debug VM crashes.
Step 1 — Operand Stack, Values, and the Dispatch Loop
Goal
Build the minimum VM that can execute a flat sequence of opcodes acting on a
stack of Values. At the end of this step:
push(Constant 3); push(Constant 4); Add; Print;
prints 7.
The Operand Stack
A stack virtual machine evaluates expressions by pushing intermediates to a LIFO buffer and popping them when consumed. This mirrors how a tree-walker recurses, but flattens the call sequence into a linear instruction stream.
class VM {
std::vector<Value> stack_;
void push(Value v) { stack_.push_back(std::move(v)); }
Value pop() { Value v = stack_.back(); stack_.pop_back(); return v; }
Value& peek(size_t off=0) { return stack_[stack_.size()-1-off]; }
};
Important properties:
- The stack is vector<Value> (not array<Value, 256>). Real VMs pre-allocate, but vector is simpler and correct.
- Every instruction has a known stack effect — e.g. Add is -2 / +1. The compiler tracks this implicitly; bugs here manifest as crashes deep in unrelated code because the wrong value is on top.
The Value Union
A VM Value is a tagged sum, NOT a pointer. Cache-friendly, no allocation for
primitives:
enum class ValueKind : uint8_t { Nil, Bool, Number, Str, Fn };
struct Value {
ValueKind kind;
union {
bool b;
double n;
};
std::string s; // outside union — has a non-trivial destructor
FunctionPtr fn; // shared_ptr — Fn case
};
Aside — why std::string outside the union? C++ unions can hold non-trivial members, but you must placement-new and manually destruct them. Folding the string into the union would shrink Value, a real win for production VMs (every cache miss matters), but it adds 40 lines of error-prone code. cp-07 chooses clarity; cp-12 swaps in a NaN-boxed 64-bit payload for the JIT.
The Dispatch Loop
The dispatch loop is the single hottest piece of code in any interpreter. It reads an opcode, jumps to its handler, executes, repeats:
for (;;) {
auto op = Op(readByte());
switch (op) {
case Op::Constant: push(readConstant()); break;
case Op::Add: {
Value b = pop(), a = pop();
push(Value::makeNum(a.n + b.n));
break;
}
case Op::Print: {
(*out) << pop().toString() << "\n";
break;
}
case Op::Return: return Status::Ok;
// …
}
}
Why a switch?
GCC and Clang typically compile a dense switch over a uint8_t to a jump table:
one indirect branch per dispatch. It is hard to beat in portable C++.
What about computed goto?
Computed goto (&&Op_Add labels in a static const void* table[256]) reduces
branch predictor pressure by giving each handler its own predictor entry —
each opcode's "what comes next?" is a separate prediction. Speedups are real
(15–30%) but the code is GNU-only and harder to read. We use switch here
and revisit computed goto when we profile in cp-12.
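For reference, this is roughly what the computed-goto shape looks like on a three-opcode toy machine. It is a sketch, not cp-12's dispatcher: the opcodes, the fixed 64-slot int stack, and runComputedGoto are invented for the example, and it relies on the GNU "labels as values" extension, so it builds with GCC and Clang but is not standard C++.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy opcode set, used only by this sketch.
enum : std::uint8_t { OP_PUSH1, OP_ADD, OP_HALT };

// Requires the GNU "labels as values" extension (GCC and Clang).
int runComputedGoto(const std::vector<std::uint8_t>& code) {
    // One table entry per opcode, indexed by the opcode byte.
    static void* table[] = { &&do_push1, &&do_add, &&do_halt };
    const std::uint8_t* ip = code.data();
    int stack[64];
    int sp = 0;

// Every handler ends with its own indirect jump instead of returning to
// a shared switch, so each handler's jump is predicted separately.
#define DISPATCH() goto *table[*ip++]

    DISPATCH();
do_push1:
    stack[sp++] = 1;
    DISPATCH();
do_add:
    --sp;
    stack[sp - 1] += stack[sp];
    DISPATCH();
do_halt:
    return stack[sp - 1];

#undef DISPATCH
}
```

The stream PUSH1, PUSH1, ADD, PUSH1, ADD, HALT evaluates (1 + 1) + 1 and returns 3. There is no bounds checking: the sketch assumes well-formed code that ends in HALT.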
Reading Operands
Each opcode may consume immediate bytes (operands) from the instruction stream:
uint8_t readByte() { return *ip_++; }
uint16_t readShort() { uint16_t hi = *ip_++; uint16_t lo = *ip_++; return (hi<<8)|lo; }
Value readConstant() { return chunk()->constants[readByte()]; }
The instruction pointer (ip_) is a const uint8_t* into the current chunk's
code buffer. Advancing it past the buffer end is undefined behavior — the
compiler is responsible for terminating every code path with Return.
Stack-Balance Discipline
Every opcode handler in cp-07 obeys a stack discipline:
| Opcode | Pops | Pushes |
|---|---|---|
| Constant | 0 | 1 |
| Add/Sub/... | 2 | 1 |
| Neg/Not | 1 | 1 |
| Print | 1 | 0 |
| Pop | 1 | 0 |
| Jump | 0 | 0 |
| JumpIfFalse | 0 | 0 |
| Call N | N+1 | 1 |
| Return | 1 | 0 (then push result into caller frame) |
When you introduce a new opcode, document its stack effect first; the compiler tracks scope depth and locals based on this contract.
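The contract in the table can be turned into a small checker for straight-line code. The sketch below is hypothetical (MiniOp, Effect, and simulateDepth are not part of the lab) and covers only five opcodes, with no jumps or calls.

```cpp
#include <cassert>
#include <vector>

enum class MiniOp { Constant, Add, Neg, Print, Pop };

struct Effect { int pops; int pushes; };

// Stack effects copied from the table above.
Effect effectOf(MiniOp op) {
    switch (op) {
        case MiniOp::Constant: return {0, 1};
        case MiniOp::Add:      return {2, 1};
        case MiniOp::Neg:      return {1, 1};
        case MiniOp::Print:    return {1, 0};
        case MiniOp::Pop:      return {1, 0};
    }
    return {0, 0};
}

// Returns the final stack depth, or -1 if some instruction would pop
// more values than the stack holds at that point.
int simulateDepth(const std::vector<MiniOp>& ops) {
    int depth = 0;
    for (MiniOp op : ops) {
        Effect e = effectOf(op);
        if (depth < e.pops) return -1;
        depth += e.pushes - e.pops;
    }
    return depth;
}
```

A compiled statement should come out at depth 0 (for example Constant, Constant, Add, Print), while an expression on its own should come out at depth 1.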
Try It
cd src/cpp && cmake --build build -j
echo 'print 2 + 3 * 4;' | ./build/mlr --dump
You should see the disassembly of the script chunk followed by 14.
Pitfalls
- Popping in the wrong order. For binary ops the right-hand side is on top, so pop b first and a second: b = pop(); a = pop(); Reverse it and 1 - 2 becomes 1.
- push(pop() + …) evaluation order. C++ does not specify the order in which the operands of + are evaluated, so sequence the pops into named locals before pushing.
- Re-reading from stack_ after push. vector::push_back can reallocate, invalidating references. Always copy Values out before pushing.
Step 2 — Call Frames and Slot-Base Addressing
Goal
Make fn add(a, b) { return a + b; } print add(3, 4); work. Two questions:
- Where do a and b live during execution?
- How does add "return" without losing the rest of the program?
Answer to both: call frames.
A Call Frame
struct CallFrame {
FunctionPtr fn; // function being executed
const uint8_t* ip; // instruction pointer into fn->chunk.code
size_t slotBase; // index into VM::stack_ where this frame's slot 0 lives
};
std::vector<CallFrame> frames_;
The crucial field is slotBase. It says: "all local variables in this
function are accessed relative to stack_[slotBase]."
The Invariant at Op::Call N
When the compiler emits Op::Call N, the runtime stack looks like:
                                            ┌── top
[ … caller's stuff, <fn>, arg1, arg2, …, argN ]
                    ▲
                    └── this is `slotBase` for the new frame
So:
- slotBase = stack_.size() - N - 1 (the -1 accounts for the function itself).
- frame.slots[0] = <fn>: the function value lives at slot 0 of its own frame. This is a convenient invariant that lets us implement closures cheaply later (the closure object is always reachable from inside its body).
- frame.slots[1..N] = arg1..argN: parameters are already in place because the call ABI deliberately leaves the args on the stack.
This is why the compiler emits addLocal(p, …) for each parameter when it
opens a function: parameters get slot indices 1, 2, … N, and the runtime
fulfils that mapping for free.
Op::Return
                                        ┌── top
[ … caller's stuff, <fn>, args/locals…, RESULT ]
                    ▲
                    └── slotBase of returning frame
The handler:
Value result = pop();
stack_.resize(frame.slotBase); // drops <fn> + all locals + args
frames_.pop_back();
if (frames_.empty()) return Status::Ok; // returning from <script>
push(result); // caller sees the value on top
Because slotBase includes the function value's slot, the resize cleans
everything in one operation. No per-local pops.
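The call and return invariants can be exercised on a toy model. This is a sketch under simplifying assumptions: values are plain ints, the callee is a placeholder value, and MiniVM, Frame, and simulateAdd are hypothetical names; unlike the real handler it omits the empty-frames check for returning from the script.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: values are ints, and the callee value is a placeholder int.
struct Frame { std::size_t slotBase; };

struct MiniVM {
    std::vector<int> stack;
    std::vector<Frame> frames;

    // Precondition: the stack ends with <fn>, arg1..argN.
    void call(std::size_t argc) {
        assert(stack.size() >= argc + 1);
        frames.push_back({stack.size() - argc - 1});
    }

    // Slot 0 is <fn>; slots 1..N are the arguments.
    int& local(std::size_t slot) { return stack[frames.back().slotBase + slot]; }

    // Precondition: the result is on top of the stack.
    void ret() {
        int result = stack.back();
        stack.resize(frames.back().slotBase);  // drops <fn>, args, locals, result
        frames.pop_back();
        stack.push_back(result);               // caller sees the value on top
    }
};

// Simulates add(3, 4) with `callerDepth` unrelated caller values already
// on the stack; returns the stack size after the call has returned.
inline std::size_t simulateAdd(std::size_t callerDepth, int& resultOut) {
    MiniVM vm;
    vm.stack.assign(callerDepth, 0);   // caller's temporaries
    vm.stack.push_back(-1);            // <fn> placeholder
    vm.stack.push_back(3);
    vm.stack.push_back(4);
    vm.call(2);
    int sum = vm.local(1) + vm.local(2);   // body: a + b, copied out first
    vm.stack.push_back(sum);
    vm.ret();
    resultOut = vm.stack.back();
    return vm.stack.size();
}
```

After the return, the caller's stack is exactly one value deeper than before the call sequence started, which matches the "Call N pops N+1, pushes 1" row of the stack-effect table.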
Op::GetLocal slot / Op::SetLocal slot
These are now trivial:
case Op::GetLocal: {
uint8_t slot = readByte();
push(stack_[frame.slotBase + slot]);
break;
}
case Op::SetLocal: {
uint8_t slot = readByte();
stack_[frame.slotBase + slot] = peek(); // assignment is an expression: leaves value on stack
break;
}
Why doesn't SetLocal pop? Because a = 1 + (b = 2) is a valid expression in many languages: the assigned value is the expression's result. MiniLang is more conservative, but keeping the value on the stack lets expr-stmt use a uniform Pop afterwards.
callValue
The unified call entry point:
void callValue(Value callee, int argc, int line) {
if (callee.kind != ValueKind::Fn)
throw RuntimeError(line, "can only call functions");
auto fn = callee.fn;
if (fn->arity != argc)
throw RuntimeError(line, "function '" + fn->name + "' expects "
+ std::to_string(fn->arity) + " argument(s)");
if (frames_.size() == kMaxFrames)
throw RuntimeError(line, "stack overflow (max call depth)");
frames_.push_back({fn, fn->chunk.code.data(),
stack_.size() - argc - 1});
}
That's it. After callValue, control falls back into the dispatch loop with
a fresh frame = frames_.back(), and execution begins at fn->chunk byte 0.
Recursion Just Works
fn fact(n) {
if (n <= 1) { return 1; }
return n * fact(n - 1);
}
Each recursive call pushes a new frame. No special case. The slot-base
trick is the entire reason recursion works without renaming variables —
each frame has its own slice of stack_.
Bootstrapping the Script
The top-level program is itself a function called <script> with arity 0.
The VM bootstraps it by pushing the function value and calling it:
push(Value::makeFn(script));
callValue(stack_.back(), 0, 0);
run_loop();
The compiler always ends <script> with Nil; Return, so the loop terminates
gracefully when the program finishes.
Try It
echo 'fn fib(n) { if (n < 2) { return n; } return fib(n-1) + fib(n-2); }
print fib(10);' | ./build/mlr
# 55
Add --dump to inspect the bytecode. In cp-07 the framework only dumps the
script chunk; to see fib's separate chunk you have to add a disassembly call
for it yourself (cp-12 dumps every function).
Pitfalls
- Off-by-one in slotBase. Omitting the -1 for the <fn> slot shifts every access by one (slot 1 reads arg2 instead of arg1): silent data corruption.
- frames_.push_back invalidates frame references. We always re-acquire frame = frames_.back() at the top of each iteration.
- Returning from <script>. Without the empty-frames check the VM would push the script's "result" for a non-existent caller and keep running with no frame.
Step 3 — Compiling Functions as Nested Compilers
Goal
Extend the cp-06 single-chunk compiler so that fn foo(...) { ... } emits a
separate Function with its own chunk and its own local-variable bookkeeping
— while staying able to resume compiling the outer code afterwards.
The Mental Model
A function body is just another little program. When the parser hands the
compiler a FnDeclStmt, the compiler temporarily switches its target from the
current chunk to a fresh chunk owned by a new Function. When the body
finishes, the compiler:
- Emits Nil; Return (so a body without an explicit return does the right thing; see step 5 for control-flow specifics).
- Pops back to the outer compiler state.
- Records the new Function as a constant in the outer chunk's constant pool.
- Emits Closure <const-ix> at the outer cursor, which loads the function value onto the operand stack.
- Stores that value as a global or as a new local in the outer scope.
Crucially the outer compiler doesn't need to know anything about the inner body — it just sees a single opaque value.
State
class Compiler {
struct Local { std::string name; int depth; bool isConst; };
struct FunctionState {
FunctionPtr fn;
std::vector<Local> locals;
int scopeDepth = 0;
bool isScript;
};
std::vector<FunctionState> states_;
Chunk& chunk() { return states_.back().fn->chunk; }
std::vector<Local>& locals() { return states_.back().locals; }
int& scopeDepth() { return states_.back().scopeDepth; }
};
The whole "current compilation context" is the top of states_. Push to
enter a function, pop to leave.
void pushFunction(std::string name, int arity, bool isScript) {
auto fn = std::make_shared<Function>();
fn->name = std::move(name);
fn->arity = arity;
FunctionState fs;
fs.fn = fn;
fs.isScript = isScript;
// Reserve slot 0 for the function value itself (matches the VM's call ABI).
fs.locals.push_back({"", 0, true});
states_.push_back(std::move(fs));
}
FunctionPtr popFunction() {
auto fn = states_.back().fn;
states_.pop_back();
return fn;
}
That reserved slot 0 is the link to step 2 — the runtime puts the callable there, and the compiler must not accidentally allocate it to a user variable.
Compiling a FnDeclStmt
void visit(FnDeclStmt& s) override {
pushFunction(s.name, s.params.size(), /*isScript=*/false);
for (auto& p : s.params) addLocal(p, /*isConst=*/false, s.line);
beginScope();
for (auto& stmt : s.body) stmt->accept(*this);
endScope();
emit(Op::Nil);
emit(Op::Return);
auto fn = popFunction();
// Outer scope: load the function value as a constant, then bind it.
uint8_t ix = makeConstant(Value::makeFn(fn));
emit(Op::Closure); emit(ix);
if (scopeDepth() == 0) {
uint8_t nameIx = makeConstant(Value::makeStr(s.name));
emit(Op::DefGlobal); emit(nameIx);
} else {
addLocal(s.name, /*isConst=*/true, s.line);
}
}
A few things worth noting:
- We pass isConst=true for the binding itself but isConst=false for the parameters: assigning to a parameter inside its function body is legal.
- The body opens its own block scope so endScope() cleans up any lets declared inside; the parameters are above this scope and persist for the entire function (correctly).
- Op::Closure is currently a synonym for Op::Constant. We give it a distinct opcode so cp-12 can graft upvalue handling on without touching every call site.
Why addLocal(p, ...) Just Works
The cp-06 local table is indexed by insertion order, which matches the
runtime slot numbering. Because we reserved slot 0 in pushFunction,
the first parameter ends up at slot 1, the second at slot 2, … exactly what
the call ABI delivers.
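The slot numbering can be checked in isolation. The following is a stripped-down sketch, assuming a MiniCompiler that keeps only the locals bookkeeping (no chunk, no isConst, no diagnostics); MiniCompiler and resolveLocal are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct Local { std::string name; int depth; };

struct FunctionState { std::vector<Local> locals; int scopeDepth = 0; };

struct MiniCompiler {
    std::vector<FunctionState> states;

    void pushFunction() {
        FunctionState fs;
        fs.locals.push_back({"", 0});   // slot 0: reserved for the callee value
        states.push_back(std::move(fs));
    }
    void popFunction() {
        assert(!states.empty());
        states.pop_back();
    }

    // Returns the slot index assigned to the new local.
    int addLocal(const std::string& name) {
        assert(!states.empty());
        auto& ls = states.back().locals;
        ls.push_back({name, states.back().scopeDepth});
        return static_cast<int>(ls.size()) - 1;
    }

    // Searches innermost-first; returns -1 when the name is not a local
    // of the function currently being compiled. Slot 0 is skipped.
    int resolveLocal(const std::string& name) const {
        assert(!states.empty());
        const auto& ls = states.back().locals;
        for (int i = static_cast<int>(ls.size()) - 1; i >= 1; --i)
            if (ls[i].name == name) return i;
        return -1;
    }
};
```

Because the slot index is just the insertion position, the first parameter lands on slot 1 without any extra arithmetic, and popping the function state makes the inner names invisible to the outer function again.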
Forbidding Closure Capture (for now)
Without an upvalue system, inner can't see outer's local a:
fn outer(a) {
fn inner() { return a; } // ← capture
return inner();
}
The compiler must detect this at compile time and refuse, rather than emit broken bytecode. Helper:
bool isOuterLocal(const std::string& name) {
for (int i = (int)states_.size() - 2; i >= 0; --i) {
const auto& ls = states_[i].locals;
for (int j = (int)ls.size() - 1; j >= 1; --j)
if (ls[j].name == name) return true;
}
return false;
}
IdentExpr and AssignExpr consult isOuterLocal after their normal local
lookup misses but before they fall back to globals. If true, they emit a
diagnostic pointing the user to cp-12.
<script> Is a Function Too
Result compile(Program& p) {
pushFunction("<script>", 0, /*isScript=*/true);
for (auto& s : p.statements) s->accept(*this);
emit(Op::Nil); emit(Op::Return);
auto script = popFunction();
return Result{script, diagnostics_};
}
Everything composes. No special case for top-level — the VM just calls
<script> like any other function.
Compiling CallExpr
void visit(CallExpr& e) override {
e.callee->accept(*this); // pushes <fn>
for (auto& a : e.args) a->accept(*this); // pushes args
if (e.args.size() > 255)
error(e.line, "too many arguments to a single call (>255)");
emit(Op::Call);
emit(uint8_t(e.args.size()));
}
The shape on the stack at Op::Call N is exactly what callValue expects —
this is how the static side and runtime side cooperate.
Compiling ReturnStmt
void visit(ReturnStmt& s) override {
if (states_.back().isScript)
error(s.line, "'return' outside a function");
if (s.value) s.value->accept(*this);
else emit(Op::Nil);
emit(Op::Return);
}
Pitfalls
- Forgetting the reserved slot 0. Parameters get the wrong slot numbers.
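- pushFunction after starting to emit prelude. The fresh FunctionState's chunk is empty by design; emit nothing into it before the body.
- Capturing the inner Chunk& reference across pushFunction/popFunction. A cached Chunk& keeps targeting the previous function's chunk, and states_.push_back can reallocate the vector, invalidating references obtained from locals() and scopeDepth(). Always go through the chunk()/locals() accessors.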
Step 4 — Globals (Hash) vs. Locals (Slots) at Runtime
Two Worlds, One Stack
The compiler already decides per identifier whether it is local (resolved during compilation to a slot index) or global (resolved at runtime by name). Step 4 implements the runtime half.
| Kind | Storage | Access cost |
|---|---|---|
| Local | stack_[slotBase + slot] | O(1), 1 load |
| Global | unordered_map<string,Value> | O(1) avg, hash |
The compiler emits GetLocal slot / SetLocal slot for locals (resolved at
compile time), and GetGlobal nameIx / SetGlobal nameIx / DefGlobal nameIx
for globals — where nameIx is an index into the chunk's constant pool whose
value is a Value::makeStr(name).
VM Side
case Op::DefGlobal: {
Value name = readConstant();
globals_[name.s] = pop();
break;
}
case Op::GetGlobal: {
Value name = readConstant();
auto it = globals_.find(name.s);
if (it == globals_.end())
throw RuntimeError(currentLine(),
"undefined variable '" + name.s + "'");
push(it->second);
break;
}
case Op::SetGlobal: {
Value name = readConstant();
auto it = globals_.find(name.s);
if (it == globals_.end())
throw RuntimeError(currentLine(),
"undefined variable '" + name.s + "'");
it->second = peek(); // assignment is an expression; leaves value on stack
break;
}
Why SetGlobal errors if the variable doesn't exist
This distinguishes declaration from assignment. let x = 1; and
var x = 1; declare; x = 2; assigns. Without this check, typos silently
create new globals — exactly the JavaScript footgun we don't want.
DefGlobal, in contrast, unconditionally inserts. If the user shadows an
existing global with another let, the resolver already complained.
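The declare-versus-assign split can be sketched without the rest of the VM. This illustration makes two substitutions that are not in the lab code: double stands in for Value, and status codes replace the RuntimeError exceptions so the behavior is easy to assert on. Globals, GlobalStatus, define, assign, and read are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

enum class GlobalStatus { Ok, Undefined };

struct Globals {
    std::unordered_map<std::string, double> table;  // double stands in for Value

    // DefGlobal: unconditional insert or overwrite.
    void define(const std::string& name, double v) {
        assert(!name.empty());  // names come from identifiers
        table[name] = v;
    }

    // SetGlobal: assignment only; never creates a new entry.
    GlobalStatus assign(const std::string& name, double v) {
        auto it = table.find(name);
        if (it == table.end()) return GlobalStatus::Undefined;
        it->second = v;
        return GlobalStatus::Ok;
    }

    // GetGlobal: uses find() so a miss does not insert a default entry.
    GlobalStatus read(const std::string& name, double& out) const {
        auto it = table.find(name);
        if (it == table.end()) return GlobalStatus::Undefined;
        out = it->second;
        return GlobalStatus::Ok;
    }
};
```

Note that a failed assign or read leaves the table untouched, which is the property that keeps typos from silently creating globals.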
Why store names, not numeric ids?
Three reasons:
- REPL friendliness. In an interactive session, each entered statement is a separate compilation. Numeric ids would not survive across compilations.
- Dynamic globals. Future built-ins (print, clock, FFI bindings) inject themselves into globals_ by name without coordinating with the compiler.
- Cheap. String hashing on short identifiers is a few ns; the access pattern is dominated by cache misses in the hash table, not the hash itself.
Real production VMs (V8, LuaJIT) cache name-id pairs in inline caches at the call site so subsequent accesses skip the hash. cp-15 covers ICs.
Locals — the entire implementation
case Op::GetLocal: {
uint8_t slot = readByte();
push(stack_[frame.slotBase + slot]);
break;
}
case Op::SetLocal: {
uint8_t slot = readByte();
stack_[frame.slotBase + slot] = peek();
break;
}
Two array indirections, zero hashing. This is why locals exist as a separate notion: the dominant performance gap between a "scripting" VM and a "systems" VM is whether identifier resolution is a slot read or a hash probe.
Stack Discipline on Block Exit
When a block scope closes:
void endScope() {
while (!locals().empty() && locals().back().depth > scopeDepth() - 1) {
emit(Op::Pop);
locals().pop_back();
}
--scopeDepth();
}
This issues a runtime Pop for every local going out of scope. At runtime
the stack shrinks back to the size it had at beginScope, restoring the
invariant that stack depth = number of live locals + temporaries currently on
top.
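To see how many Pops a nested block produces, the bookkeeping can be modeled with a counter in place of the emitter. This is a sketch only: ScopeTracker, declare, and popsEmitted are hypothetical names, and the loop condition is the same test as above rewritten as depth >= scopeDepth.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Local { std::string name; int depth; };

struct ScopeTracker {
    std::vector<Local> locals;
    int scopeDepth = 0;
    int popsEmitted = 0;   // stands in for emit(Op::Pop)

    void beginScope() { ++scopeDepth; }
    void declare(const std::string& name) { locals.push_back({name, scopeDepth}); }

    void endScope() {
        assert(scopeDepth > 0);
        // Drop every local declared at the scope being closed.
        while (!locals.empty() && locals.back().depth >= scopeDepth) {
            ++popsEmitted;
            locals.pop_back();
        }
        --scopeDepth;
    }
};
```

Closing an inner block with two locals emits two Pops and leaves the outer block's local in place, so the outer slot indices stay valid.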
Functions on Globals
Top-level functions live in globals_ like any other value. Function calls
do:
GetGlobal "fact" ; pushes the Fn value
Constant 5 ; pushes the arg
Call 1
Recursion works because GetGlobal happens each time — by the time fact
calls itself, the global table already contains it.
Mutable vs Immutable
The compiler tracks isConst on each Local/FunctionState::locals[i] and
emits a compile-time diagnostic for let-bound writes. The VM is uniform: it
has no notion of const at runtime. This is the standard tradeoff: catch
errors as early as possible.
Pitfalls
- Forgetting to pop locals in endScope. The stack grows monotonically through the program; nested blocks would corrupt parent locals' indices.
- SetGlobal accepting unknown names. Silent globals are a tooling nightmare. Always require DefGlobal first.
- Using [] on globals_ in GetGlobal. operator[] creates default-constructed entries on miss. Use find and report the error.
Step 5 — Control Flow on a Stack Machine
The compiler already emits Jump, JumpIfFalse, and Loop (cp-06, step 5).
This step explains the runtime half: how ip_ moves through the chunk.
The Three Jump Opcodes
| Opcode | Layout | Stack effect | Action |
|---|---|---|---|
| Jump | [op][hi][lo] | 0 | ip_ += offset16 |
| JumpIfFalse | [op][hi][lo] | 0 (peek only) | if !isTruthy(peek()) then ip_ += offset16 |
| Loop | [op][hi][lo] | 0 | ip_ -= offset16 (backwards jump) |
All offsets are unsigned 16-bit numbers measured from the byte after the operand; the opcode fixes the direction. A two-byte operand gives a maximum distance of 65,535 bytes forward (Jump, JumpIfFalse) or backward (Loop), which is plenty for human code and easily extended to 24 bits if anything pathological appears.
Why doesn't JumpIfFalse pop? Because if/else and short-circuit operators want the test value in different states after the branch. The compiler emits an explicit Pop after JumpIfFalse in cases (like if-stmt) where the condition value is no longer needed.
Runtime Implementation
case Op::Jump: {
uint16_t off = readShort();
ip_ += off;
break;
}
case Op::JumpIfFalse: {
uint16_t off = readShort();
if (!isTruthy(peek())) ip_ += off;
break;
}
case Op::Loop: {
uint16_t off = readShort();
ip_ -= off;
break;
}
isTruthy is the language's falsiness rule:
bool isTruthy(const Value& v) {
switch (v.kind) {
case ValueKind::Nil: return false;
case ValueKind::Bool: return v.b;
default: return true; // 0 is truthy
}
}
This decision is a language design choice. Lua agrees (only nil and
false are falsy). Python disagrees (empty containers, 0, 0.0, "" are
all falsy). We follow Lua/Lox for simplicity.
if-else at Runtime
Recall the compiled pattern:
<cond>
JumpIfFalse else
Pop
<then-body>
Jump end
else:
Pop
<else-body>
end:
At runtime:
- The condition pushes true/false.
- JumpIfFalse peeks, leaves it alone, optionally skips the then-branch.
- The branch starts with Pop (the condition is consumed exactly once).
- The unconditional Jump over the else-arm leaves the stack unchanged.
- The else-arm also begins with Pop (matching the other side).
Exactly one of the two Pops runs per execution, so the condition is consumed
once and the stack ends each branch in the same state. This is the
stack-balance proof for the construct.
while at Runtime
start: ◄── loopStart
<cond>
JumpIfFalse end
Pop
<body>
Loop start
end:
Pop
Loop re-evaluates the condition. The compiler computes the backward offset
at chunk-emit time as (chunk.code.size() + 3) - loopStart, where the +3
accounts for the Loop opcode + 2-byte operand we are about to emit.
Short-Circuit && / ||
print false && something_expensive(); // never calls
print true || something_expensive(); // never calls
&& compiles to:
<lhs>
JumpIfFalse end
Pop
<rhs>
end:
If lhs is falsy, JumpIfFalse leaves it on the stack and jumps to end
— the expression's result. Otherwise we Pop it and evaluate rhs,
leaving its value as the result.
|| is the mirror: JumpIfTrue end, except we don't have a JumpIfTrue
opcode. Two implementation choices:
- Add Op::JumpIfTrue.
- Reuse JumpIfFalse with the branch inverted in a tiny code template: JumpIfFalse two-ahead; Jump end; Pop; <rhs>; end: (the JumpIfFalse skips only the Jump and lands on the Pop).
cp-06 takes the second route to keep the opcode set minimal. Same runtime semantics.
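The || template can be checked end to end on a toy VM. The sketch below is illustrative only: the six opcodes, compileOr, run, and RunResult are hypothetical, both operands are restricted to boolean literals, and the emitJump/patchJump helpers are simplified copies of the patch-back helpers described later in this step. Counting pushed literals stands in for "was the right-hand side evaluated?".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum : uint8_t { OP_TRUE, OP_FALSE, OP_POP, OP_JUMP_IF_FALSE, OP_JUMP, OP_HALT };

using Code = std::vector<uint8_t>;

// Appends `op` with a 16-bit placeholder; returns the operand's index.
inline std::size_t emitJump(Code& c, uint8_t op) {
    c.push_back(op); c.push_back(0xff); c.push_back(0xff);
    return c.size() - 2;
}
// Points the jump whose operand starts at `slot` at the current end of code.
inline void patchJump(Code& c, std::size_t slot) {
    std::size_t d = c.size() - slot - 2;
    assert(d <= 0xFFFF);
    c[slot] = static_cast<uint8_t>(d >> 8);
    c[slot + 1] = static_cast<uint8_t>(d & 0xff);
}

// Compiles `lhs || rhs` where each side is a boolean literal.
inline Code compileOr(bool lhs, bool rhs) {
    Code c;
    c.push_back(lhs ? OP_TRUE : OP_FALSE);
    std::size_t toRhs = emitJump(c, OP_JUMP_IF_FALSE);  // falsy: evaluate rhs
    std::size_t toEnd = emitJump(c, OP_JUMP);           // truthy: lhs is the result
    patchJump(c, toRhs);
    c.push_back(OP_POP);                                // discard the falsy lhs
    c.push_back(rhs ? OP_TRUE : OP_FALSE);              // <rhs>
    patchJump(c, toEnd);
    c.push_back(OP_HALT);
    return c;
}

struct RunResult { bool value; std::size_t depth; int literalsPushed; };

inline RunResult run(const Code& c) {
    std::vector<bool> stack;
    int pushed = 0;
    std::size_t ip = 0;
    for (;;) {
        uint8_t op = c[ip++];
        switch (op) {
            case OP_TRUE:  stack.push_back(true);  ++pushed; break;
            case OP_FALSE: stack.push_back(false); ++pushed; break;
            case OP_POP:   stack.pop_back(); break;
            case OP_JUMP_IF_FALSE:
            case OP_JUMP: {
                std::size_t off = (std::size_t(c[ip]) << 8) | c[ip + 1];
                ip += 2;
                bool top = !stack.empty() && stack.back();  // peek, never pop
                if (op == OP_JUMP || !top) ip += off;
                break;
            }
            default: {  // OP_HALT
                bool result = stack.back();
                return {result, stack.size(), pushed};
            }
        }
    }
}
```

With a truthy left-hand side only one literal is ever pushed, so the right-hand side is skipped; in every case exactly one value remains on the stack.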
Patch-Back
The compiler emits jumps with a placeholder offset (0xFFFF) and patches
the real distance once the target is known:
size_t emitJump(Op op) {
emit(op);
emit(uint8_t(0xff));
emit(uint8_t(0xff));
return chunk().code.size() - 2; // index of high byte
}
void patchJump(size_t offsetSlot) {
size_t jumpDist = chunk().code.size() - offsetSlot - 2;
if (jumpDist > 0xFFFF) error(..., "jump too large");
chunk().code[offsetSlot] = (jumpDist >> 8) & 0xff;
chunk().code[offsetSlot + 1] = jumpDist & 0xff;
}
Loop is symmetric but the offset is known at emit time (we always loop
backwards to a previously-seen loopStart).
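The forward and backward offset arithmetic can be verified together on a raw byte vector. This is a sketch with free functions instead of Compiler members; emitLoop and jumpTarget are hypothetical helpers added for the illustration, and opcode values are passed as plain bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Code = std::vector<uint8_t>;

// Emits `op` plus a two-byte placeholder; returns the index of the high byte.
inline std::size_t emitJump(Code& code, uint8_t op) {
    code.push_back(op);
    code.push_back(0xff);
    code.push_back(0xff);
    return code.size() - 2;
}

// Points the jump at the current end of `code`.
inline void patchJump(Code& code, std::size_t offsetSlot) {
    std::size_t dist = code.size() - offsetSlot - 2;
    assert(dist <= 0xFFFF);
    code[offsetSlot]     = static_cast<uint8_t>((dist >> 8) & 0xff);
    code[offsetSlot + 1] = static_cast<uint8_t>(dist & 0xff);
}

// Emits a backward jump to `loopStart`; the +3 covers the opcode and operand.
inline void emitLoop(Code& code, uint8_t op, std::size_t loopStart) {
    std::size_t dist = code.size() + 3 - loopStart;
    assert(dist <= 0xFFFF);
    code.push_back(op);
    code.push_back(static_cast<uint8_t>((dist >> 8) & 0xff));
    code.push_back(static_cast<uint8_t>(dist & 0xff));
}

// Decodes the operand at `operandIx` and returns the index the VM would
// land on: forward for Jump/JumpIfFalse, backward for Loop.
inline std::size_t jumpTarget(const Code& code, std::size_t operandIx, bool backward) {
    std::size_t off = (std::size_t(code[operandIx]) << 8) | code[operandIx + 1];
    std::size_t after = operandIx + 2;   // where ip_ sits after readShort()
    return backward ? after - off : after + off;
}
```

For a loop whose exit jump occupies bytes 0..2, whose body is two bytes, and whose Loop instruction occupies bytes 5..7, the forward offset is 5 and the backward offset is 8: the exit lands on byte 8 and the loop lands back on byte 0.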
Pitfalls
- Forgetting the Pop after JumpIfFalse. The condition value stays on the stack forever, then collides with the next statement's expectations; the bug surfaces much later as a wrong value in a totally different opcode.
- Wrong direction for Loop. ip_ -= off not +=. The disassembler prints the target address, so sanity check by inspection.
- Patch-back arithmetic. Off-by-two errors are common; write the emitJump/patchJump pair once and reuse it religiously.
Step 6 — Closures and Upvalues (Sketch, Deferred to cp-12)
Why This Is a Step at All
cp-07 deliberately rejects programs that capture a local from an enclosing function:
fn outer(a) {
fn inner() { return a; } // ❌ compile error in cp-07
return inner();
}
with a clear message pointing to cp-12. This step explains why the restriction exists and what the implementation will look like when we lift it.
The Problem
A function value can outlive the stack frame in which it was defined:
fn make_counter() {
var n = 0;
fn step() { n = n + 1; return n; }
return step; // ← step references `n` AFTER make_counter returns
}
let c = make_counter();
print c(); print c(); print c(); // 1 2 3
When make_counter returns, its stack frame is destroyed — yet step still
needs n. The variable has escaped from the stack.
Possible Solutions
| Strategy | Cost | Used by |
|---|---|---|
| Disallow it (cp-07) | Free. Limits expressiveness. | Early C, embedded DSLs |
| Boxed locals everywhere | Every local is heap-allocated and ref-counted. | Pre-V8 JS, Scheme R6RS |
| Upvalues (Crafting Interpreters) | Stack-allocated locals; promoted to heap lazily when a closure captures them. | Lua, Lox, our cp-12 |
| Lambda lifting (compile-time) | Inner function rewritten to take captures as extra args. No runtime support. | OCaml, Haskell middle-ends |
| Full first-class environments | Each scope is a heap object linked to its parent. | Scheme, Smalltalk |
cp-12 will implement the upvalue approach because:
- It keeps non-capturing locals as cheap stack slots (no boxing tax).
- It scales to mutable captures without aliasing footguns.
- It is what Lua 5.x, Lox, and many embedded VMs do — well-documented.
Sketch of the Upvalue Mechanism
Add two new runtime structures (plus a new Value kind that wraps them):
struct Upvalue {
Value* location; // points into the stack while OPEN
Value closed; // takes ownership when the slot is closed
bool isOpen;
Upvalue* next; // intrusive list, head per VM, kept sorted by stack address
};
struct Closure {
FunctionPtr fn;
std::vector<Upvalue*> upvalues; // resolved by index in bytecode
};
A Value::Fn evolves into Value::Closure carrying a shared_ptr<Closure>.
Closure holds the function plus a vector of upvalues, one per captured
variable.
New Opcodes (cp-12)
| Opcode | Operands | Meaning |
|---|---|---|
| Closure | [const-ix] then per upvalue: [isLocal:1][index:1] | Allocate closure; capture each upvalue. |
| GetUpvalue | [slot] | Push the value the upvalue points to. |
| SetUpvalue | [slot] | Store top into the upvalue's location. |
| CloseUpvalue | none | Promote top-of-stack local to heap, remove it from the open-upvalue list. |
Compile Side
When the compiler sees a reference inside inner to a name declared in
outer's locals:
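- Walk outwards through states_ until it finds the name.
- In every intermediate function, add an upvalue entry: isLocal=true in the function nested directly inside the declaring one (the capture is a local of its enclosing function), isLocal=false in functions nested deeper (the capture is an upvalue of their enclosing function).
- Replace the bytecode with GetUpvalue idx / SetUpvalue idx.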
When the compiler emits a Closure for a function with k upvalues, it
emits k (isLocal, index) pairs following the opcode. At runtime, the VM
reads these and either:
- (isLocal=true) captures the surrounding frame's slot directly (calls captureUpvalue(&stack_[frame.slotBase + index])), or
- (isLocal=false) reuses one of the enclosing closure's upvalues (enclosing->upvalues[index]).
Run Side
captureUpvalue(loc) walks the open-upvalue list, returns an existing one if
some closure already captured the same address, otherwise allocates a new
Upvalue{loc, …, isOpen=true} and links it in sorted order.
When a local goes out of scope, the compiler emits CloseUpvalue; when a frame
returns, the VM scans for any open upvalues at addresses ≥ the popping
threshold. Either way the upvalue is closed by copying the value into closed
and re-pointing location at &closed. From that moment on, the closure
transparently sees the value through the heap copy.
The result: captured locals only pay the heap cost when they are actually captured, and only once per (variable × set of capturing closures).
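A minimal sketch of the capture and close operations follows, under several assumptions that are not in the source: Value is a plain double, the stack is storage that never moves (raw pointers into a std::vector would dangle after reallocation), and UpvalueList owns every upvalue through unique_ptr instead of the closures owning them. UpvalueList, capture, and closeFrom are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Value = double;   // stand-in for the VM's Value type

struct Upvalue {
    Value* location;    // stack slot while open, &closed afterwards
    Value closed = 0;
    bool isOpen = true;
    Upvalue* next = nullptr;
    explicit Upvalue(Value* loc) : location(loc) {}
};

struct UpvalueList {
    Upvalue* head = nullptr;                      // open upvalues, highest address first
    std::vector<std::unique_ptr<Upvalue>> owned;  // ownership, simplified for the sketch

    // Returns the existing open upvalue for `slot`, or links a new one
    // in address order so closeFrom() can stop early.
    Upvalue* capture(Value* slot) {
        assert(slot != nullptr);
        Upvalue** link = &head;
        while (*link && (*link)->location > slot) link = &(*link)->next;
        if (*link && (*link)->location == slot) return *link;
        owned.push_back(std::make_unique<Upvalue>(slot));
        Upvalue* created = owned.back().get();
        created->next = *link;
        *link = created;
        return created;
    }

    // Closes every open upvalue whose slot is at or above `threshold`.
    void closeFrom(Value* threshold) {
        while (head && head->location >= threshold) {
            Upvalue* up = head;
            head = up->next;              // unlink from the open list
            up->closed = *up->location;   // copy the value off the stack
            up->location = &up->closed;
            up->isOpen = false;
            up->next = nullptr;
        }
    }
};
```

Keeping the list sorted with the highest address first is what lets closeFrom stop at the first upvalue below the threshold, and deduplicating in capture is what makes two closures over the same variable share one Upvalue.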
Why Defer All This?
- The cp-07 lab is large enough already.
- Most of the interesting engineering — slot allocation, dispatch, frame management — is independent of closures and easier to learn in isolation.
- cp-12's JIT motivates closures: a closure becomes a useful unit for inlining and specialisation.
What cp-07 Actually Does
Op::GetUpvalue / Op::SetUpvalue / Op::CloseUpvalue exist in the opcode
table (so cp-12 can drop in changes without renumbering) but the VM throws a
runtime error if it ever executes one:
case Op::GetUpvalue:
case Op::SetUpvalue:
case Op::CloseUpvalue:
throw RuntimeError(currentLine(),
"upvalues not supported in cp-07 (see cp-12)");
And the compiler refuses to emit them — the isOuterLocal helper in
steps/03 detects the attempted capture and emits
a friendly diagnostic at compile time, well before the user sees any opaque
runtime error.
Pitfalls (for cp-12)
- Forgetting to sort open-upvalues by stack address. The close operation relies on stopping at the first upvalue below the threshold.
- Double-closing. An upvalue captured by N closures must close exactly once; the open list dedupes by address.
- Calling convention coupling. The Closure opcode's variable-length operand encoding is awkward to disassemble; budget extra time on the disassembler.
Step 7 — Runtime Errors and (Mini) Stack Traces
What Counts as a Runtime Error
cp-07 catches at runtime:
| Cause | Where | Message template |
|---|---|---|
| Undefined global read | Op::GetGlobal | undefined variable '<name>' |
| Undefined global write | Op::SetGlobal | undefined variable '<name>' |
| Type mismatch in binary + | Op::Add | operands to '+' must be two numbers or two strings |
| Wrong type in numeric op | Op::Sub/Mul/… | operands must be numbers |
| Division by zero | Op::Div, Op::Mod | division by zero |
| Wrong type in unary - | Op::Neg | operand to unary '-' must be a number |
| Calling a non-function | callValue | can only call functions |
| Arity mismatch | callValue | function '<n>' expects K argument(s) |
| Stack overflow | callValue | stack overflow (max call depth) |
| Use of unsupported Op::*Upvalue | dispatch loop | upvalues not supported in cp-07 (see cp-12) |
The RuntimeError Type
struct RuntimeError : std::runtime_error {
int line;
RuntimeError(int l, std::string m)
: std::runtime_error(std::move(m)), line(l) {}
};
A single exception type keeps the dispatch loop's error path uniform. The
public VM::run catches it at the top, prints a one-line message + the call
chain, and returns Status::RuntimeError.
Source-Line Tracking
The compiler stuffs a lines parallel array into each Chunk:
struct Chunk {
std::vector<uint8_t> code;
std::vector<int> lines; // 1:1 with code
std::vector<Value> constants;
};
This is wasteful — one int per byte — and a real VM compresses lines via run-length encoding. We trade memory for clarity; cp-15 covers debug-info compression.
At runtime the VM computes the current line as
int currentLine() {
auto& fr = frames_.back();
size_t off = fr.ip - fr.fn->chunk.code.data() - 1; // -1: we already advanced past the op
return fr.fn->chunk.lines[off];
}
The -1 is the subtle bit — readByte() post-increments ip, so by the time
we're handling an opcode ip_ points to the next byte.
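For reference, the run-length encoding mentioned above could look roughly like this. It is a sketch of one possible layout, not the cp-15 design: LineRun, LineTable, add, and lineAt are hypothetical names, and lookup is a linear scan for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One run: `count` consecutive code bytes that share a source line.
struct LineRun { int line; std::size_t count; };

struct LineTable {
    std::vector<LineRun> runs;

    // Call once per emitted byte, in emission order.
    void add(int line) {
        assert(line > 0);  // source lines are 1-based
        if (!runs.empty() && runs.back().line == line) ++runs.back().count;
        else runs.push_back({line, 1});
    }

    // Returns the line for byte `offset`, or -1 if the offset is out of range.
    int lineAt(std::size_t offset) const {
        for (const LineRun& r : runs) {
            if (offset < r.count) return r.line;
            offset -= r.count;
        }
        return -1;
    }
};
```

Nine bytes spread over three source lines collapse into three runs; the trade is that lookup is no longer a single array index, which is acceptable because lines are only needed on the error path.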
A Mini Stack Trace
VM::run's catch block walks frames_ from the top down:
catch (const RuntimeError& e) {
(*err) << "runtime error [line " << e.line << "]: " << e.what() << "\n";
for (auto it = frames_.rbegin(); it != frames_.rend(); ++it) {
const auto& fn = it->fn;
(*err) << " in " << (fn->name.empty() ? "<script>" : fn->name) << "\n";
}
return Status::RuntimeError;
}
So:
fn boom() { return 1 / 0; }
fn caller() { return boom(); }
print caller();
prints (on stderr):
runtime error [line 1]: division by zero
in boom
in caller
in <script>
This is the minimum-viable stack trace. Real implementations also include file/line per frame, source snippets, and column ranges. cp-15 (Tooling & Diagnostics) is the big upgrade.
Why Not assert?
A core principle: runtime errors are not crashes. They are recoverable
from the embedder's standpoint — a REPL must keep running after a typo, an
embedded interpreter must report to its host, etc. Using assert or
std::terminate would couple the VM's lifecycle to the user's mistake.
This is also why we don't std::exit() from inside the VM; the driver
(main.cpp) decides the exit code from Status.
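The embedder's view of this principle can be sketched with a stand-in for the dispatch loop. Everything except RuntimeError's shape and the message format is an assumption here: execute is a hypothetical function that only divides two ints, and run takes its streams as parameters instead of reading VM members.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

enum class Status { Ok, RuntimeError };

struct RuntimeError : std::runtime_error {
    int line;
    RuntimeError(int l, std::string m)
        : std::runtime_error(std::move(m)), line(l) {}
};

// Stand-in for the dispatch loop: divides and "prints" the quotient.
inline void execute(int lhs, int rhs, std::ostream& out) {
    if (rhs == 0) throw RuntimeError(1, "division by zero");
    out << lhs / rhs << "\n";
}

// The only place RuntimeError is caught; other exceptions propagate.
inline Status run(int lhs, int rhs, std::ostream& out, std::ostream& err) {
    try {
        execute(lhs, rhs, out);
        return Status::Ok;
    } catch (const RuntimeError& e) {
        assert(e.line >= 0);  // line numbers are never negative
        err << "runtime error [line " << e.line << "]: " << e.what() << "\n";
        return Status::RuntimeError;
    }
}
```

A REPL-style caller can invoke run again after a failure: the first call reports the error and returns a status, and the second call proceeds normally.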
Compile-Time vs Runtime Errors
After cp-04 (resolver) and cp-05 (typechecker), most static errors fire before we ever build a chunk:
| Error | Caught by | Phase |
|---|---|---|
| Undefined name (local) | Resolver | static |
| Use before init | Resolver | static |
| Assignment to let | Resolver / Compiler | static |
| Calling non-callable type | TypeChecker | static |
| Arity mismatch (known fn) | TypeChecker | static |
| Type mismatch on operator | TypeChecker | static |
| Bad cast at runtime | VM | dynamic |
| Undefined global | VM | dynamic |
| Division by zero on user input | VM | dynamic |
This split is intentional — static = before any code runs; dynamic = unavoidable.
Pitfalls
- Catching std::exception too broadly. Other exceptions (std::bad_alloc, malformed UTF-8 in string display, …) deserve different handling. We catch RuntimeError specifically and let everything else propagate to the driver.
- Forgetting to flush err. When the VM is embedded in another process, using std::cerr is fine; when the test harness uses ostringstream, buffered output is captured automatically.
- Reusing line indices after chunk mutation. Chunk::lines must stay 1:1 with code; any opcode emit must push exactly one line entry per byte. Helper functions handle this, so never push to code directly.
cp-08 — Three-Address Code IR
A new compiler middle-end that lowers the resolved/type-checked AST into Three-Address Code (TAC), the canonical compiler IR taught in the Dragon Book. The bytecode VM of cp-07 was a great way to run code, but a poor representation for reasoning about code: an operand stack hides def/use relationships, and locals are addressed by slot rather than name.
TAC reverses these tradeoffs. Each instruction has at most one operation
and writes its result into one named destination — t3 = add t1, t2.
Control flow lives in a control-flow graph (CFG) of basic blocks
connected by explicit jumps. This is exactly the shape an SSA construction
algorithm wants in cp-09, and it's the shape LLVM IR will demand in cp-11.
What's in the box
| File | Purpose |
|---|---|
| src/ir.hpp/cpp | Operand, Op enum, Instr, BasicBlock, Function, Module |
| src/ir_printer.* | Textual IR pretty-printer (the "assembly" we read in tests) |
| src/ir_builder.* | AST → IR lowering pass |
| src/main.cpp | mltac CLI driver: source → IR text on stdout |
| tests/test_ir.cpp | String-level golden tests over the printed IR |
The pipeline is now:
source ─► lexer ─► parser ─► resolver ─► typecheck ─► ir::Builder ─► Module
There is no execution stage in cp-08. cp-09 wires up an interpreter that walks this IR directly (and adds SSA + a couple of optimisation passes).
Build & run
cmake -S src/cpp -B src/cpp/build
cmake --build src/cpp/build -j
ctest --test-dir src/cpp/build --output-on-failure
echo 'fn add(a,b){return a+b;} print add(3,4);' | ./src/cpp/build/mltac
Expected output:
fn @__script__() {
bb0 (entry):
t0 = call @add(3, 4)
print t0
ret
}
fn @add(%a, %b) {
bb0 (entry):
t0 = add %a, %b
ret t0
}
What's new conceptually
- Three operand kinds. t<n> temps (SSA-friendly), %name named storage (local variables / params), and immediate constants.
- One op per instruction. Compound expressions are flattened by introducing fresh temps for each subexpression result.
- Globals through memory ops. ldg @x / stg @x, v make global reads and writes explicit, paralleling LLVM's load/store.
- Explicit control flow. Every block ends in a terminator (jmp, cjmp, ret). No fall-through. No implicit "next instruction".
- Short-circuit lowered to branches. a && b becomes a cjmp plus a join block, just as cp-07 did with patchable jumps, but now the join lives in the CFG, ready for phi insertion in cp-09.
Reading order
The seven step docs in steps/ follow the same progression as the code:
- 01-tac-and-three-address-form.md
- 02-operands-and-instructions.md
- 03-basic-blocks-and-cfg.md
- 04-lowering-expressions.md
- 05-lowering-statements-and-control-flow.md
- 06-short-circuit-and-phi-preview.md
- 07-printer-and-debugging.md
Step 1 — TAC and three-address form
Why a new IR?
The cp-07 bytecode VM was a perfectly good interpreter. But the moment you try to do interesting things to the program — eliminate dead code, fold constants, allocate registers, prove that two pointers don't alias — the stack representation fights you at every turn.
Three things make a stack machine awkward for analysis:
- Implicit operands. ADD doesn't say what it adds; you have to simulate the stack to find out. Every analysis becomes an abstract interpretation.
- Position-dependent. Reordering two adjacent instructions changes what's on the stack. You can't pattern-match peepholes locally.
- No SSA story. SSA wants every value to have a name; stack slots are anonymous and transient.
TAC fixes all three by writing each operation in the form
dst = op src1, src2
— at most one operation per instruction, every operand explicit, every result named.
A first example
Source:
print (1 + 2) * 3;
Stack bytecode (cp-07-style):
PUSH 1
PUSH 2
ADD
PUSH 3
MUL
PRINT
TAC:
t0 = add 1, 2
t1 = mul t0, 3
print t1
The TAC form has fewer instructions here (three against six), though each
one is wider. More importantly, each line is a fully self-contained def
with explicit uses. You can ask "what defines t1?" without simulating
anything: just grep.
Three operand kinds
Our Operand is a tagged struct with four cases: the three kinds above plus a None placeholder.
- None: placeholder for instructions that produce no value.
- Temp(n): t0, t1, ... Compiler-generated intermediates, each written once (the short-circuit slot in step 6 is the one exception). In cp-09 these become SSA values.
- Constant(Value): an immediate, printed inline.
- Named(name): %x, %a. A local variable or parameter. Named operands behave like alloca'd memory slots (read and write via Move), but at TAC level we expose them as first-class operands. cp-09's mem2reg pass promotes these into SSA temps.
Globals do not get an Operand form; they are accessed through explicit
ldg @x / stg @x instructions. That mirrors LLVM's model where
"variable" really means "memory cell" and SSA only describes local
register flow.
The deal we're making
TAC introduces verbosity and a small constant-factor compile-time hit compared to direct bytecode generation. In exchange, we get:
- a uniform substrate for all later analyses (cp-09 SSA, cp-10 passes, cp-11 LLVM, cp-13 MLIR),
- a printable, diff-able IR that makes compiler bugs visible,
- a CFG abstraction we can use to talk precisely about reachability, dominance, and loop structure.
That's the trade every production compiler makes. cp-08 just makes us pay the price explicitly so cp-09 onwards can spend the proceeds.
Step 2 — Operands and instructions
Operand
struct Operand {
enum class Kind { None, Temp, Constant, Named };
Kind kind = Kind::None;
int tempId = -1; // Temp
Value constVal; // Constant
std::string name; // Named (stored without the leading % sigil)
// factories: none(), temp(id), constant(v), named(name)
};
We use a single struct with a Kind tag rather than std::variant to
keep the type simple to construct and copy and (more importantly) easy
to print in a debugger. When you're chasing an IR bug at 1 a.m. you want
p ins.srcs[0] to show something, not a variant index.
tempId, constVal, and name are independent fields; only one is
meaningful for any given Kind. The constructors mirror that:
Operand::temp(3); // t3
Operand::constant(Value::makeInt(42)); // immediate
Operand::named("x"); // %x
Operand::none(); // placeholder
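The factories and the tag can be exercised in isolation. What follows is a minimal sketch, not the lab's actual source: Value is reduced to an integer-only stand-in, and the free function toString is a hypothetical helper that follows the operand notation used in this lab (t<n>, %name, inline constants, _ for None).

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in for the lab's Value type (assumption: integers only).
struct Value {
    long long i = 0;
    static Value makeInt(long long v) { return Value{v}; }
    std::string toString() const { return std::to_string(i); }
};

struct Operand {
    enum class Kind { None, Temp, Constant, Named };
    Kind kind = Kind::None;
    int tempId = -1;     // meaningful only for Temp
    Value constVal;      // meaningful only for Constant
    std::string name;    // meaningful only for Named, stored without '%'

    static Operand none() { return Operand{}; }
    static Operand temp(int id) {
        Operand o; o.kind = Kind::Temp; o.tempId = id; return o;
    }
    static Operand constant(Value v) {
        Operand o; o.kind = Kind::Constant; o.constVal = v; return o;
    }
    static Operand named(std::string n) {
        Operand o; o.kind = Kind::Named; o.name = std::move(n); return o;
    }
};

// Hypothetical printer helper: one switch on the tag, no visitor needed.
std::string toString(const Operand& o) {
    switch (o.kind) {
        case Operand::Kind::Temp:     return "t" + std::to_string(o.tempId);
        case Operand::Kind::Constant: return o.constVal.toString();
        case Operand::Kind::Named:    return "%" + o.name;
        case Operand::Kind::None:     return "_";
    }
    return "_";
}
```

Because every field is a plain member, a debugger shows kind, tempId and name directly, which is the property the paragraph above argues for.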
Op — the opcode enum
| Group | Opcodes |
|---|---|
| arithmetic | Add Sub Mul Div Mod Neg |
| comparison | Eq Ne Lt Le Gt Ge |
| logical | Not (and/or are lowered, not opcodes) |
| move/load | Move LoadGlobal StoreGlobal |
| control | Jump CondJump Return |
| effects | Print Call |
Notable design choices:
- No And/Or opcode. Short-circuit semantics demand control flow; we lower them to CondJump (see step 6).
- Move rather than Copy. Same idea as RISC-V or MIPS pseudo-ops: one instruction that says "write the source into the destination, unchanged." The mem2reg pass in cp-09 will eliminate most of these.
- Call is a regular instruction. It has a destination temp (for the return value), an opcode-level callee name in ins.name for direct calls, and operands [callee, arg0, arg1, ...]. Indirect calls (cp-12 closures) will store <indirect> in the name and use srcs[0] for the callee operand.
Instr
struct Instr {
Op op;
Operand dst; // None if the op produces no value
std::vector<Operand> srcs; // 0..N source operands
std::string name; // global name / function name
int bbT = -1; // jmp target / cjmp true target
int bbF = -1; // cjmp false target
int line = 0; // source line for diagnostics
};
One struct fits all instruction kinds. The alternative — a
discriminated hierarchy with AddInstr, JumpInstr, CallInstr, ... —
is dogmatically purer, but cripplingly painful to walk in passes. Every
pass would need a giant visitor or a type-switch. A flat struct lets
passes loop over instrs and switch on ins.op.
The cost: each instruction carries unused fields. For TAC at this scale
that's a sub-megabyte overhead even for large programs, and it's the
shape MLIR uses (an Operation* with attributes, results, operands,
successors). Compiler IRs converge on this design for a reason.
Why constants are inline operands
In some IRs (notably LLVM) constants are first-class Values, distinct
from instructions. We took the simpler route: a constant is just an
Operand::Constant, printed inline. Pros: trivial printer, no constant
pool to manage. Cons: you can't dyn_cast a constant the way you can
in LLVM. For a teaching IR that's the right trade.
Step 3 — Basic blocks and the CFG
What is a basic block?
A basic block is a maximal sequence of instructions such that
- control enters only at the top (no branches into the middle), and
- control leaves only at the bottom (no branches out of the middle).
Formally: a straight-line code fragment that ends in exactly one
terminator — Jump, CondJump, or Return. Nothing fancier than
that.
The set of basic blocks plus their successor edges form the control-flow graph (CFG) of a function. Almost every interesting analysis — dominance, reachability, liveness, loop detection — operates on this graph.
Our BasicBlock
struct BasicBlock {
int id; // small integer name
std::string label; // human-readable hint
std::vector<Instr> instrs;
};
The label is purely cosmetic ("entry", "if.then", "while.cond") — the
real identifier is id. We use small dense integer ids because most
analyses will index into per-block bit-vectors.
Function::blocks is the vector of all blocks in creation order.
blocks[0] is always the entry by convention; we don't store it
separately.
Block creation
BasicBlock& Function::newBlock(std::string label) {
blocks.push_back({nextBlock++, std::move(label), {}});
return blocks.back();
}
newBlock returns a reference to the new block. The builder copies the id out of it immediately, because the reference is invalidated the next time blocks grows; we never hold it across another newBlock call.
Terminator discipline
The builder enforces a simple invariant: every emitted instruction
goes into a non-terminated block. If you try to emit into a
terminated block, the builder silently opens a fresh "unreachable"
block and emits there instead:
void Builder::emit(Instr ins) {
if (currentBlockTerminated()) {
auto& nb = fn().newBlock("unreachable");
setBlock(nb.id);
}
block().instrs.push_back(std::move(ins));
}
This keeps the lowering code free of if (terminated) return; clutter
and preserves dead code as visible IR. cp-09's simplifyCFG pass will
prune unreachable blocks.
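The invariant is easy to check in a toy setting. The sketch below is an illustration under simplifying assumptions: instructions are plain strings, terminators are recognised by their mnemonic prefix, and EmitSketch is a hypothetical name rather than the lab's Builder.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Block { std::string label; std::vector<std::string> instrs; };

// Only the three terminator mnemonics from this lab are recognised.
bool isTerminator(const std::string& ins) {
    return ins.rfind("jmp", 0) == 0 || ins.rfind("cjmp", 0) == 0 ||
           ins.rfind("ret", 0) == 0;
}

struct EmitSketch {
    std::vector<Block> blocks;
    int cur = 0;
    EmitSketch() { blocks.push_back({"entry", {}}); }

    bool currentBlockTerminated() const {
        const std::vector<std::string>& is = blocks[cur].instrs;
        return !is.empty() && isTerminator(is.back());
    }

    void emit(const std::string& ins) {
        if (currentBlockTerminated()) {   // never append after a terminator
            blocks.push_back({"unreachable", {}});
            cur = static_cast<int>(blocks.size()) - 1;
        }
        blocks[cur].instrs.push_back(ins);
    }
};
```

Emitting "ret t0" and then "print t0" leaves the entry block ending in its ret and puts the print into a second block labelled unreachable, so the dead code stays visible in the printed IR.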
CFG edges
We don't store CFG edges explicitly. Successors are recoverable from the terminator:
| Terminator | Successors |
|---|---|
| Jump bbT | {bbT} |
| CondJump bbT bbF | {bbT, bbF} |
| Return | {} |
A trivial helper (introduced in cp-09) reads a block's final
instruction (bb.instrs.back()) and returns the successor set.
Predecessors are computed by inverting that map once per pass: cheap,
and it avoids the bookkeeping pain of keeping bidirectional edges in
sync during construction.
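A sketch of that helper pair, assuming a cut-down Instr that carries only the opcode and the two branch targets. The names successors, predecessors and diamond are illustrative and are not taken from the lab's source.

```cpp
#include <cassert>
#include <vector>

enum class Op { Add, Jump, CondJump, Return };

struct Instr { Op op = Op::Add; int bbT = -1; int bbF = -1; };
struct BasicBlock { int id = 0; std::vector<Instr> instrs; };

// Successor ids are read off the block's final instruction (its terminator).
std::vector<int> successors(const BasicBlock& bb) {
    if (bb.instrs.empty()) return {};
    const Instr& t = bb.instrs.back();
    switch (t.op) {
        case Op::Jump:     return {t.bbT};
        case Op::CondJump: return {t.bbT, t.bbF};
        default:           return {};   // Return, or no terminator yet
    }
}

// Invert the successor map once; preds[b] lists every block that branches to b.
std::vector<std::vector<int>> predecessors(const std::vector<BasicBlock>& blocks) {
    std::vector<std::vector<int>> preds(blocks.size());
    for (const BasicBlock& bb : blocks)
        for (int s : successors(bb))
            preds[s].push_back(bb.id);
    return preds;
}

// Small diamond for illustration: bb0 -> {bb1, bb2} -> bb3.
std::vector<BasicBlock> diamond() {
    std::vector<BasicBlock> blocks(4);
    for (int i = 0; i < 4; ++i) blocks[i].id = i;
    blocks[0].instrs.push_back({Op::CondJump, 1, 2});
    blocks[1].instrs.push_back({Op::Jump, 3});
    blocks[2].instrs.push_back({Op::Jump, 3});
    blocks[3].instrs.push_back({Op::Return});
    return blocks;
}
```

On the diamond, the join block bb3 ends up with predecessors {1, 2} and the entry with none. This relies on block ids being dense indices into the blocks vector, which is the convention described above.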
Why a vector of blocks (and not a linked structure)?
Industrial IRs (LLVM, MLIR) use intrusive linked lists for instructions within blocks, because passes need cheap O(1) removal. At the block level they store a list of blocks per function for the same reason: inserting a block in the middle of a function is common (loop rotation, critical-edge splitting).
We use std::vector because (a) we're not implementing those passes
yet, and (b) a vector<BasicBlock> is dramatically nicer to print and
debug. The cost is amortised — append is O(1) — and the only real
limitation is that we can't store stable pointers to blocks. We work
around that with integer ids, which is the correct compiler-engineering
discipline anyway.
Step 4 — Lowering expressions
Expression lowering in ir_builder.cpp follows one rule: every visit
method writes the expression's result operand into result_, and
returns void. A helper eval(Expr&) runs accept and reads back
result_:
Operand eval(Expr& e) { e.accept(*this); return result_; }
This dance is a workaround for the AST's typed-visitor design (which
specialises only ExprVisitor<TypePtr> and ExprVisitor<void>), and
gives us a useful invariant: the IR Builder reuses the exact same
visitor base class as the bytecode compiler.
Constants and identifiers
void Builder::visit(LiteralExpr& e) { result_ = Operand::constant(e.value); }
void Builder::visit(IdentExpr& e) {
if (isLocal(e.name)) { result_ = Operand::named(e.name); return; }
Operand dst = freshTemp();
emit({Op::LoadGlobal, dst, {}, e.name, -1, -1, e.line});
result_ = dst;
}
- Literals are pure data: no instruction, just an operand.
- Locals become named operands; no instruction is emitted on a read. Reads happen for free at the use site, which is correct for an alloca-style memory model.
- Globals require an explicit LoadGlobal to a fresh temp.
This asymmetry between locals and globals matters: in the SSA promotion
pass of cp-09 we will recognise alloca-like patterns over %name
operands and promote them. Globals will stay as memory because they
cross function boundaries.
Unary and binary operators
void Builder::visit(BinaryExpr& e) {
Operand a = eval(*e.lhs);
Operand b = eval(*e.rhs);
Operand dst = freshTemp();
emit({binOpFor(e.op), dst, {a, b}, "", -1, -1, e.line});
result_ = dst;
}
Subexpressions are recursively lowered first, producing fresh temps, and the parent's instruction binds those temps. This is the literal "flatten compound expressions into single ops" recipe — the whole point of TAC.
Constant folding could happen here (add 1, 2 → Operand::constant(3))
but we don't do it. Folding is a pass, not a lowering concern. cp-09
introduces a constant-folder pass that consumes the unfolded IR cp-08
emits — and being able to see the pre-fold IR makes the pass's effect
obvious in diff tests.
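The flattening recipe can be tried without the visitor machinery. This is a minimal sketch under stated assumptions: Expr, lit, bin and Lowering are hypothetical stand-ins for the lab's AST and Builder, and instructions are emitted as already-printed strings.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Tiny expression tree: a leaf holds an integer, an inner node an opcode name.
struct Expr {
    std::string op;                  // "" for a literal, else "add", "mul", ...
    long long value = 0;             // used when op is empty
    std::unique_ptr<Expr> lhs, rhs;  // used when op is non-empty
};

std::unique_ptr<Expr> lit(long long v) {
    auto e = std::make_unique<Expr>();
    e->value = v;
    return e;
}

std::unique_ptr<Expr> bin(std::string op, std::unique_ptr<Expr> l,
                          std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->op = std::move(op);
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

struct Lowering {
    int nextTemp = 0;
    std::vector<std::string> code;   // one printed TAC instruction per entry

    // Returns the operand naming e's value, emitting instructions as needed.
    std::string lower(const Expr& e) {
        if (e.op.empty()) return std::to_string(e.value);  // literal: no instruction
        std::string a = lower(*e.lhs);                      // left operand first
        std::string b = lower(*e.rhs);
        std::string dst = "t" + std::to_string(nextTemp++); // fresh temp per result
        code.push_back(dst + " = " + e.op + " " + a + ", " + b);
        return dst;
    }
};
```

Lowering (1 + 2) * 3 with this sketch yields the two instructions from step 1, t0 = add 1, 2 followed by t1 = mul t0, 3, with no folding applied.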
Calls
void Builder::visit(CallExpr& e) {
std::string calleeName;
Operand calleeOp;
if (auto* id = dynamic_cast<IdentExpr*>(e.callee.get())) {
calleeName = id->name; // direct call
calleeOp = Operand::named(id->name);
} else {
calleeOp = eval(*e.callee); // indirect (no closures yet though)
calleeName = "<indirect>";
}
std::vector<Operand> args { calleeOp };
for (auto& a : e.args) args.push_back(eval(*a));
Operand dst = freshTemp();
Instr ins{ Op::Call, dst, std::move(args), calleeName, -1, -1, e.line };
emit(std::move(ins));
result_ = dst;
}
The callee always lives at srcs[0], even for direct calls. Carrying
the name separately is redundant but useful: it makes the printed IR
read as t0 = call @add(3, 4) instead of t0 = call %add, 3, 4, and
it lets passes filter direct vs indirect calls without parsing operands.
Argument evaluation order is left-to-right, matching the language's semantics.
Assignments
void Builder::visit(AssignExpr& e) {
Operand val = eval(*e.value);
if (isLocal(e.name)) {
emit({Op::Move, Operand::named(e.name), {val}, "", -1, -1, e.line});
} else {
emit({Op::StoreGlobal, Operand::none(), {val}, e.name, -1, -1, e.line});
}
result_ = val;
}
Note that assignment returns the assigned value (as an expression),
which is needed for let y = (x = 3);. Locals use Move; globals use
StoreGlobal. The Move's destination is the named operand itself, while
the expression's result is the value operand we just produced. That
means cp-09's mem2reg can treat each %x = v move as a "definition of x"
point during SSA construction.
Step 5 — Lowering statements and control flow
Where expression lowering produced an Operand, statement lowering
produces blocks. The recipe is always the same:
- allocate the blocks you'll need,
- emit a terminator into the current block to enter the structure,
- lower the body into the relevant blocks,
- emit terminators stitching them together,
- set currentBlock to the join block so the next statement continues there.
if / else
┌─── cjmp cond ──→ if.then ──jmp── if.cont
pre ─┤ ↑
└─── cjmp cond ──→ if.else ──jmp────┘
Code:
void Builder::visit(IfStmt& s) {
Operand cond = eval(*s.cond);
// Capture each id immediately: the next newBlock may reallocate blocks
// and invalidate any reference returned earlier.
int thenId = fn().newBlock("if.then").id;
int elseId = fn().newBlock(s.elseBranch ? "if.else" : "if.cont").id;
emitCondJump(cond, thenId, elseId, s.line);
setBlock(thenId);
s.thenBranch->accept(*this);
bool thenTerm = currentBlockTerminated();
if (s.elseBranch) {
auto& cont = fn().newBlock("if.cont");
int joinId = cont.id;
if (!thenTerm) emitJump(joinId, s.line);
setBlock(elseId);
s.elseBranch->accept(*this);
if (!currentBlockTerminated()) emitJump(joinId, s.line);
setBlock(joinId);
} else {
if (!thenTerm) emitJump(elseId, s.line);
setBlock(elseId); // elseB *is* the join when no else exists
}
}
Two subtleties:
- When there is no else, we reuse elseB as the join (if.cont). Wasting a block is harmless but ugly in diff tests, and merging them gives the more natural printed IR.
- We only emit the jump to the join if the branch didn't already terminate (e.g. with return). Without this guard you'd get a ret followed by a jmp, which is malformed: a block may have only one terminator.
while
pre ──jmp── while.cond ──cjmp── while.body ──jmp── while.cond (back-edge)
│
└──cjmp── while.cont
void Builder::visit(WhileStmt& s) {
// Take each id before the next newBlock call; the returned reference
// does not survive a reallocation of blocks.
int condId = fn().newBlock("while.cond").id;
int bodyId = fn().newBlock("while.body").id;
int contId = fn().newBlock("while.cont").id;
emitJump(condId, s.line); // pre → cond
setBlock(condId);
Operand c = eval(*s.cond); // (re-evaluated each iteration)
emitCondJump(c, bodyId, contId, s.line);
setBlock(bodyId);
s.body->accept(*this);
if (!currentBlockTerminated()) emitJump(condId, s.line); // back-edge
setBlock(contId);
}
The back-edge is what makes this a loop in CFG terms: an edge whose target dominates its source. cp-09's loop-detection pass will find it.
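The definition can be turned into a check with very little code. The sketch below is an assumption-laden illustration, not the lab's pass: it computes full dominator sets with the simple iterative data-flow formulation, stores each set as a 64-bit mask (so it handles at most 64 blocks, all reachable from block 0), and the names dominators and backEdges are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// succ[b] lists the successors of block b; block 0 is the entry.
// Returns dom, where bit d of dom[b] is set when block d dominates block b.
std::vector<std::uint64_t> dominators(const std::vector<std::vector<int>>& succ) {
    const std::size_t n = succ.size();
    if (n == 0) return {};
    std::vector<std::vector<int>> pred(n);
    for (std::size_t b = 0; b < n; ++b)
        for (int s : succ[b]) pred[s].push_back(static_cast<int>(b));

    const std::uint64_t all = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
    std::vector<std::uint64_t> dom(n, all);
    dom[0] = 1ULL;                          // the entry is dominated only by itself
    for (bool changed = true; changed; ) {
        changed = false;
        for (std::size_t b = 1; b < n; ++b) {
            std::uint64_t d = all;
            for (int p : pred[b]) d &= dom[p];   // intersect over predecessors
            d |= (1ULL << b);                    // every block dominates itself
            if (d != dom[b]) { dom[b] = d; changed = true; }
        }
    }
    return dom;
}

// An edge src -> dst is a back-edge when dst dominates src.
std::vector<std::pair<int, int>> backEdges(const std::vector<std::vector<int>>& succ) {
    std::vector<std::uint64_t> dom = dominators(succ);
    std::vector<std::pair<int, int>> out;
    for (std::size_t b = 0; b < succ.size(); ++b)
        for (int s : succ[b])
            if (dom[b] & (1ULL << s)) out.push_back({static_cast<int>(b), s});
    return out;
}
```

For the while shape above (bb0 pre, bb1 while.cond, bb2 while.body, bb3 while.cont) the only edge reported is bb2 → bb1, because while.cond dominates while.body.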
block and scoping
void Builder::visit(BlockStmt& s) {
beginScope();
for (auto& st : s.body) st->accept(*this);
endScope();
}
A block does not introduce its own basic block. Scope and block
are orthogonal — a single { } may contain several BBs (because of an
embedded if), and a single BB may span several { } (because the
inner block had no control flow). Conflating the two is a beginner's
mistake worth flagging.
Variables declared inside { } are recorded in a local scope stack used
only for the isLocal predicate. No backing storage is emitted; the
local is a named operand.
return
void Builder::visit(ReturnStmt& s) {
if (ctx().isScript) { error(s.line, "'return' outside a function"); return; }
Operand v = s.value ? eval(*s.value) : Operand::none();
emitReturn(v, s.line);
}
We deliberately reject top-level returns even though the resolver probably already did — defence in depth.
print
void Builder::visit(PrintStmt& s) {
Operand v = eval(*s.expr);
emit({Op::Print, Operand::none(), {v}, "", -1, -1, s.line});
}
print is the language's only built-in side effect, so it gets a
dedicated opcode rather than going through Call. Treating it as
Call @print would be cleaner but would force the interpreter and
the LLVM backend to special-case the name later. A dedicated opcode is
more honest.
fn declarations
A fn declaration opens a new Function, lowers its body, and queues
the function into nestedFns_ to be appended to the module after the
script. Nested fns (declared inside another fn) are rejected — they'd
require closure capture, which lands in cp-12.
Step 6 — Short-circuit and a preview of phi
a && b and a || b are not arithmetic operators — they have
short-circuit semantics. b is only evaluated when the result is
not yet determined by a. That's a control-flow property; expressing
it as a regular binary op would either evaluate both eagerly (wrong) or
require a magic "lazy" flag (gross).
The right move is to lower logical operators into the same shape as a
hand-written if:
result = a;
if ( result_truthy) result = b; // for &&
if (!result_truthy) result = b; // for ||
use result
…which, expressed in TAC blocks, looks like this:
┌── cjmp slot ──→ and.eval ──jmp──┐
pre ────┤ (true case) ▼
└─── (false case) ──────────► and.join
(slot = a && b)
The implementation
void Builder::visit(LogicalExpr& e) {
Operand lhs = eval(*e.lhs);
Operand slot = freshTemp();
emit({Op::Move, slot, {lhs}, "", -1, -1, e.line});
// Read .id straight away; the second newBlock may invalidate the first reference.
int evalId = fn().newBlock(e.op == TokenKind::AmpAmp ? "and.eval" : "or.eval").id;
int joinId = fn().newBlock(e.op == TokenKind::AmpAmp ? "and.join" : "or.join").id;
if (e.op == TokenKind::AmpAmp) emitCondJump(slot, evalId, joinId, e.line);
else emitCondJump(slot, joinId, evalId, e.line);
setBlock(evalId);
Operand rhs = eval(*e.rhs);
emit({Op::Move, slot, {rhs}, "", -1, -1, e.line});
emitJump(joinId, e.line);
setBlock(joinId);
result_ = slot;
}
The crucial detail: both predecessors of the join block write to the
same temp slot. That's why we treat the temp as a write-many
named slot in cp-08 — temps are not yet SSA values.
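To see the resulting block shape concretely, here is a small sketch of the same lowering over string instructions. It is a simplification: MiniBuilder is a hypothetical stand-in for the lab's Builder, both operands are assumed to be already-lowered names, and newBlock returns an id rather than a reference.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Block { std::string label; std::vector<std::string> instrs; };

struct MiniBuilder {
    std::vector<Block> blocks;
    int cur = 0;
    int nextTemp = 0;
    MiniBuilder() { blocks.push_back({"entry", {}}); }

    int newBlock(const std::string& label) {   // returns the id, never a reference
        blocks.push_back({label, {}});
        return static_cast<int>(blocks.size()) - 1;
    }
    void emit(const std::string& s) { blocks[cur].instrs.push_back(s); }
    std::string freshTemp() { return "t" + std::to_string(nextTemp++); }

    // Lowers `lhs && rhs` (isAnd) or `lhs || rhs` into three blocks.
    std::string lowerLogical(bool isAnd, const std::string& lhs,
                             const std::string& rhs) {
        std::string slot = freshTemp();
        emit(slot + " = " + lhs);                  // first write to slot
        int evalId = newBlock(isAnd ? "and.eval" : "or.eval");
        int joinId = newBlock(isAnd ? "and.join" : "or.join");
        std::string e = "bb" + std::to_string(evalId);
        std::string j = "bb" + std::to_string(joinId);
        // && evaluates rhs only when slot is truthy; || only when it is falsy.
        emit("cjmp " + slot + ", " + (isAnd ? e + ", " + j : j + ", " + e));
        cur = evalId;
        emit(slot + " = " + rhs);                  // second write to the same slot
        emit("jmp " + j);
        cur = joinId;
        return slot;
    }
};
```

For a && b the entry block ends in cjmp t0, bb1, bb2; for a || b the two targets swap. In both cases t0 is written once in the entry block and once in the eval block, which is the double definition the paragraph above describes.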
Preview: this is where phi nodes come in
In SSA form, every value must have exactly one definition. Our
slot violates that — it's defined twice, once on each path into the
join. The classical fix is the phi node, a pseudo-instruction at
the top of a block that selects a value based on which predecessor
arrived:
join:
slot = phi [lhs_value, pre], [rhs_value, and.eval]
cp-09's SSA construction pass will scan for our slot pattern and emit
exactly such a phi. In fact, every multi-write temp our cp-08 builder
produces — including locals — becomes either a phi or gets folded into
a single dominating definition.
For now we model the join with explicit moves because it lets cp-08 stay strictly imperative. No worklist algorithms, no dominator trees, no iterated dominance frontier. Those are cp-09's burden, and seeing the pre-SSA form here is what makes their effect legible later.
Why not just do SSA construction now?
We considered it. But: SSA construction is genuinely intricate (Cytron's algorithm depends on dominator trees, which depend on the CFG, which depends on lowering being done…), and lumping the two concepts into one lab obscured both. Splitting them by step gives a cleaner narrative: cp-08 builds the grammar; cp-09 imposes the invariant.
Step 7 — Printer and debugging
A pretty-printer is the single highest-leverage tool in any compiler codebase. It is the difference between guessing what your IR looks like and seeing it. Every test in this lab compares the printed IR against expected substrings; every pass in cp-09+ will use the printer to log "before" / "after" snapshots.
The format
fn @<name>(<params>) {
bb0 (<label>):
<instr>
<instr>
...
bb1 (<label>):
...
}
- Functions are introduced with fn @name(...) (the @ sigil mirrors LLVM globals).
- Parameters are named operands: %a, %b.
- Blocks render their label in parens for readability; the integer id is the primary identifier.
- Instructions are indented four spaces.
- No blank lines inside a function, one blank line between functions.
Operand syntax
| Form | Notation |
|---|---|
| Temp | t<n> |
| Named | %<name> |
| Global ref | @<name> |
| Constant int | 42 |
| Constant str | "hello" |
| Constant nil | nil |
| None | _ |
Constants delegate to Value::toString(), the same formatter the cp-07
VM uses for print. That gives us a single source of truth for
literal representation.
Instruction syntax
t0 = add %x, 1 binop with explicit dst, two srcs
%x = 1 move into named local
stg @x, t1 store to global
t2 = ldg @x load from global
print t0 side-effect, no dst
t0 = call @add(3, 4) direct call
t0 = call <indirect>(t1) indirect call (cp-12)
cjmp t0, bb3, bb4 conditional branch
jmp bb5 unconditional branch
ret t0 return with value
ret return without value
We chose = over := because it matches LLVM textual IR and reads
more naturally. Comparisons render with mnemonic ops (lt, ge)
rather than C-style symbols (<, >=) so that print a < b doesn't
get confusing.
Why this matters
When cp-09's mem2reg pass turns
%x = 1
%x = add %x, 1
print %x
into
t10 = 1
t11 = add t10, 1
print t11
we want that diff to be a one-line change in a golden test. String-level printer assertions are a coarse tool but they catch regressions in lowering exactly when humans care about them — when the printed IR changes shape.
Debugging tactics
- Pipe through mltac. echo '...' | ./build/mltac is the fastest feedback loop for "what does this lower to?".
- Look at the unreachable blocks. Stray unreachable: blocks in the output often indicate the lowering forgot to advance to a join block, a sign of a missing setBlock(joinId).
- Check the label hints. if.cont, while.body, and.join are deliberately chosen to make IR readable in the absence of source lines. If you see unreachable where you expected if.cont, the order of operations is off.
The printer has no semantic content — it's pure formatting. But it is the most-read file in the IR layer. Spend time on it. Future you will be grateful.
cp-09 — SSA Construction & Optimisation Passes
Status: ✅ Built (41/41 checks)
This lab takes the Three-Address Code from cp-08 and adds the middle-end: an IR interpreter that can execute TAC directly, a small pass pipeline (constant folding + propagation, dead-code elimination, CFG simplification), and a fix-point driver. The conceptual material covers full SSA — phi nodes, dominance, mem2reg — even though the implementation stops at a simpler whole-function constant-propagation scheme. The progression to real SSA happens when we hand off to LLVM in cp-10/11.
Layout
| File | Purpose |
|---|---|
| src/ir.{hpp,cpp} | TAC IR types (unchanged from cp-08). |
| src/ir_builder.{hpp,cpp} | AST → TAC lowering. Fixed dangling-reference bug from cp-08 (newBlock returns a vector reference invalidated by subsequent inserts; capture .id immediately). |
| src/ir_printer.{hpp,cpp} | Module pretty-printer. |
| src/passes.{hpp,cpp} | constFold, dce, simplifyCFG, runAll. |
| src/ir_interp.{hpp,cpp} | Direct interpreter for the IR: the semantics oracle. |
| src/main.cpp | mlopt driver: prints before/after IR, then runs. |
| tests/test_passes.cpp | 11 tests / 41 checks. |
Pipeline
source → tokens → AST → resolver → typechecker → IR (TAC) →
[print "before"] → constFold → dce → simplifyCFG → [fixpoint] →
[print "after"] → IR interpreter → stdout
Build & run
cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build --output-on-failure
CLI:
./build/mlopt examples/loop.ml # print before/after IR + run
./build/mlopt --quiet examples/loop.ml # just run
./build/mlopt --no-run examples/loop.ml # IR only
What each pass does
- constFold: substitute single-def tN = <const> chains, fold binary/unary on constant operands into Move, collapse cjmp on a constant condition into jmp.
- dce: drop pure value-producing instructions whose temp dst is never used. print, stg, call, and terminators are always preserved (no effect-purity analysis in this lab).
- simplifyCFG: flood-fill reachability from bb0, drop unreachable blocks. After a cjmp is rewritten to a jmp, this is what actually deletes the abandoned branch.
The driver alternates them until a full sweep changes nothing.
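The reachability half of simplifyCFG is a plain worklist flood-fill. A minimal sketch, assuming the CFG has already been reduced to a successor list per block; reachableFromEntry is an illustrative name, and actually erasing the unreachable blocks (and keeping ids consistent) is left out.

```cpp
#include <cassert>
#include <vector>

// succ[b] lists the successor ids of block b; block 0 is the entry.
// Returns one flag per block: true when the block is reachable from the entry.
std::vector<bool> reachableFromEntry(const std::vector<std::vector<int>>& succ) {
    std::vector<bool> seen(succ.size(), false);
    if (succ.empty()) return seen;
    std::vector<int> work{0};       // worklist seeded with the entry block
    seen[0] = true;
    while (!work.empty()) {
        int b = work.back();
        work.pop_back();
        for (int s : succ[b]) {
            if (!seen[s]) {         // first visit: mark and schedule
                seen[s] = true;
                work.push_back(s);
            }
        }
    }
    return seen;
}
```

With bb0 rewritten from cjmp to jmp bb1, a successor list of {{1}, {}, {1}} marks bb0 and bb1 reachable and bb2 not, even though bb2 still has an outgoing edge.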
Conceptual highlights (see steps/)
- SSA as the universal IR shape used by LLVM, V8 TurboFan, HotSpot, GraalVM, JavaScriptCore, and Cranelift, and why every modern compiler bottoms out in dominance/dominance-frontier algorithms.
- mem2reg as the canonical SSA-construction algorithm: every named local becomes a series of versioned values, joined by phi nodes at dominance frontiers.
- Why we stopped short of mem2reg here: classical SSA construction requires dominator computation, dominance-frontier sets, iterated insertion, and renaming. The book-keeping is well-documented (Cytron et al. 1991) and worth implementing — but at the cost of much more code than is needed for our optimisation set. The constant-propagation scheme we use here gets ~80% of the practical wins.
Test coverage
- Constant folding of arithmetic and comparisons.
- DCE preserves side effects (print, call, stg).
- simplifyCFG drops unreachable: blocks emitted after return.
- Branches on constants collapse to a single arm.
- Interpreter executes loops, function calls, mutual recursion, short-circuit operators.
- Round-trip property: running the interpreter before and after the pass pipeline produces identical output.
Step 1 — Why an optimising middle-end
A compiler that emits unoptimised IR straight to a backend gets you correct code; that is all. Everything good about a modern compiler — inlining, escape analysis, vectorisation, devirtualisation — happens in the middle-end, after the frontend has built an IR and before the backend lowers it to machine code.
The single most important pattern
IR_in → pass_1 → IR_intermediate → pass_2 → IR_intermediate' → ...
Each pass is a function IR → IR that preserves the program's observable
behaviour. The middle-end's job is to thread enough passes that the IR
that comes out the other side is small, dense, and easy for the backend.
A pass manager schedules and re-runs passes. LLVM's new pass manager (2020) tracks per-pass invalidation of analyses; we use a much simpler "run everything to a fixed point" loop. Both designs share the same invariants:
- Passes do not change observable behaviour (no I/O reordering, no new observable allocations, no new exceptions).
- Passes are robust: degenerate input (e.g. a cjmp on a constant) should be cleaned up, not crashed on.
- Passes are composable: running pass A then pass B should produce the same IR as running them as a single fused pass would.
Why ordering matters
Our pipeline runs constFold → dce → simplifyCFG, and one pass's output
exposes work for another:
- constFold rewrites cjmp t0, bb1, bb2 (with t0 constant) to jmp bb1. The bb2 block is now unreachable, but that fact isn't visible inside constFold.
- simplifyCFG drops the unreachable block. Temps defined elsewhere whose only uses lived in that block are now unused.
- dce drops those orphaned defs.
- Some of those defs were the only uses of upstream temps. So we run the whole pipeline again. Repeat to fixed point.
This is the classic "phase-ordering problem". LLVM ships hundreds of passes, and finding a good static order is genuinely hard. Our loop is the brute-force version: keep going until nobody changes anything.
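A fixed-point driver of this kind fits in a few lines. The sketch below makes several assumptions: Function is a toy stand-in (a list of integers), the two passes are invented purely to show one pass creating work for another, and maxSweeps is a safety cap added for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the lab's Function type: just a list of numbers.
struct Function { std::vector<int> data; };

using Pass = std::function<bool(Function&)>;   // returns true if it changed anything

// Runs every pass in order and repeats until one full sweep changes nothing.
// Returns the number of sweeps performed, including the final no-op sweep.
int runToFixedPoint(Function& fn, const std::vector<Pass>& passes,
                    int maxSweeps = 100) {
    int sweeps = 0;
    bool changed = true;
    while (changed && sweeps < maxSweeps) {
        changed = false;
        for (const Pass& p : passes) changed |= p(fn);
        ++sweeps;
    }
    return sweeps;
}

// Toy pass 1: halve every non-zero even number.
bool halveEvens(Function& fn) {
    bool changed = false;
    for (int& x : fn.data)
        if (x != 0 && x % 2 == 0) { x /= 2; changed = true; }
    return changed;
}

// Toy pass 2: delete every 1. Halving can create new 1s, so a single
// sweep is not always enough.
bool dropOnes(Function& fn) {
    std::size_t before = fn.data.size();
    std::vector<int> kept;
    for (int x : fn.data)
        if (x != 1) kept.push_back(x);
    fn.data = kept;
    return fn.data.size() != before;
}
```

Starting from {8, 3, 1}, the driver needs four sweeps: three that make progress (8 shrinks to 1 and is dropped) and a last one that confirms nothing changes, leaving {3}.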
What we'll cover
- Step 2 — How an IR interpreter works and why it's the semantic oracle.
- Step 3 — SSA: what it is, why every modern compiler uses it, why we don't fully implement it here.
- Step 4 — Constant folding + propagation, including the single-def temp substitution trick that approximates SSA.
- Step 5 — Dead-code elimination and why purity matters.
- Step 6 — CFG simplification.
- Step 7 — The pass-manager loop and how to think about phase ordering.
Step 2 — IR interpreter as oracle
Before writing any optimisation pass, we built an IR interpreter.
That is the most important file in src/: ir_interp.cpp.
Why interpret IR at all?
In production, IR is consumed by a backend that lowers it to machine code. We are not yet ready to do that — cp-10 introduces LLVM IR emission, cp-11 native codegen. But to test our middle-end now, we need a way to ask "does this IR module mean the same thing it did before I ran the passes?"
That is the role of runProgram(module). Given an IR module, it
executes it and returns the stream of print outputs. Two invocations
on semantically-equivalent modules must produce identical strings.
This is the strongest test we can write for a pass. It avoids the "golden file" trap (matching exact instruction sequences is brittle: any harmless permutation breaks the test) and instead checks the only thing that matters — meaning.
Implementation shape
struct Interp {
const Module& mod;
unordered_map<string, Value> globals;
ostringstream out;
Value callFunction(const Function& fn, const vector<Value>& args);
};
A Frame is conjured per call: temps map from tempId → Value, named
locals from string → Value. Globals live on the interpreter.
Execution is a while (true) over blocks. Within a block we walk
instrs sequentially:
- Value-producing ops (add, lt, move, neg…) compute a Value and write to the dst operand.
- print formats the operand and appends to out.
- ldg / stg read or write a global.
- call looks up a function by name in the module, recursively invokes callFunction, and stores the result in the dst.
- jmp / cjmp set currentId and goto next_block.
- ret returns from callFunction.
A safety budget (safety = 1e6 instructions) prevents tests from
hanging on infinite loops — see test test_const_fold_comparison
which would have spun forever if we hadn't fixed the IR-builder bug.
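The execution loop and the budget can be sketched for a single function. Everything here is a simplification made for the example: values are integers, operands are either temps or constants (no named locals, globals or calls), and run, Block and countingLoop are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

enum class Op { Move, Add, Lt, Print, Jump, CondJump, Return };

// An operand is either a temp id (isTemp) or an integer constant.
struct Operand { bool isTemp; long long v; };
struct Instr { Op op; int dst; std::vector<Operand> srcs; int bbT = -1; int bbF = -1; };
using Block = std::vector<Instr>;

std::string run(const std::vector<Block>& blocks, long budget = 1000000) {
    std::map<int, long long> temps;          // the frame: tempId -> value
    std::ostringstream out;
    auto val = [&](const Operand& o) {
        return o.isTemp ? temps[static_cast<int>(o.v)] : o.v;
    };
    int cur = 0;
    while (true) {
        bool jumped = false;
        for (const Instr& ins : blocks[cur]) {
            if (--budget < 0) throw std::runtime_error("instruction budget exhausted");
            switch (ins.op) {
                case Op::Move:  temps[ins.dst] = val(ins.srcs[0]); break;
                case Op::Add:   temps[ins.dst] = val(ins.srcs[0]) + val(ins.srcs[1]); break;
                case Op::Lt:    temps[ins.dst] = val(ins.srcs[0]) < val(ins.srcs[1]); break;
                case Op::Print: out << val(ins.srcs[0]) << "\n"; break;
                case Op::Jump:  cur = ins.bbT; jumped = true; break;
                case Op::CondJump:
                    cur = val(ins.srcs[0]) ? ins.bbT : ins.bbF; jumped = true; break;
                case Op::Return: return out.str();
            }
            if (jumped) break;               // terminator reached: switch blocks
        }
        if (!jumped) throw std::runtime_error("block has no terminator");
    }
}

// bb0: t0 = 0; jmp bb1      bb1: t1 = lt t0, 3; cjmp t1, bb2, bb3
// bb2: print t0; t0 = add t0, 1; jmp bb1      bb3: ret
std::vector<Block> countingLoop() {
    Operand t0{true, 0}, t1{true, 1};
    auto c = [](long long v) { return Operand{false, v}; };
    std::vector<Block> b(4);
    b[0] = { Instr{Op::Move, 0, {c(0)}}, Instr{Op::Jump, -1, {}, 1} };
    b[1] = { Instr{Op::Lt, 1, {t0, c(3)}}, Instr{Op::CondJump, -1, {t1}, 2, 3} };
    b[2] = { Instr{Op::Print, -1, {t0}}, Instr{Op::Add, 0, {t0, c(1)}},
             Instr{Op::Jump, -1, {}, 1} };
    b[3] = { Instr{Op::Return, -1, {}} };
    return b;
}
```

Running countingLoop() prints 0, 1 and 2 on separate lines. If bb2's back-edge pointed at itself instead of bb1, the budget check would raise an exception rather than let the test hang.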
Why named operands work
A typical SSA interpreter only has temps. Ours has both temps and
named locals (%i, %x) because cp-08's lowering keeps source-level
variables as memory cells. That's deliberate: cp-09's passes never
need to reason about them.
When we move to LLVM in cp-10, those named locals become allocas and
LLVM's own mem2reg pass converts them into SSA temps. We simulate the
same final result by using the named-local convention as a "loadable
storage slot" model.
How tests use the interpreter
auto preOut = ir::runProgram(module); // before any pass
ir::runAll(module); // mutate
auto postOut = ir::runProgram(module); // after all passes
CHECK(preOut.output == postOut.output);
If a pass ever breaks semantics, that assertion fails before the golden-string checks do. It is the single most valuable assertion in the test file.
Step 3 — SSA: the universal IR shape
Static Single Assignment is the IR shape that essentially every modern
compiler uses internally. LLVM IR is SSA. HotSpot's C2 is SSA. V8's
TurboFan is SSA. Cranelift is SSA. WebKit's B3 is SSA.
MLIR is SSA. Even compilers that predate LLVM (gcc ≥ 4.0) added SSA in
the form of "GIMPLE" + tree-SSA.
The core idea
Every variable is assigned exactly once. If source code re-assigns, we create a new SSA name:
// source IR (SSA)
x = 1; %x.1 = 1
x = x + 2; %x.2 = add %x.1, 2
print x; print %x.2
Every %name has exactly one definition. That is the magic property:
to find the value of %x.2, you don't need to scan blocks looking for
the most recent assignment — there is no most-recent, there is the
one. Every analysis that classical IR does with backwards walks becomes
a one-step lookup in SSA.
The hard part: control flow
if (cond) { x = 1; } else { x = 2; }
print x;
In the join block, what is x? Neither %x.1 nor %x.2 alone. SSA
solves this with phi nodes:
bb_then: %x.1 = 1; jmp bb_join
bb_else: %x.2 = 2; jmp bb_join
bb_join: %x.3 = phi [%x.1, bb_then], [%x.2, bb_else]
print %x.3
A phi instruction at the top of a block picks which incoming value to use based on which predecessor control flow arrived from. It is the SSA analogue of "the variable's most recent definition".
Where do phi nodes go?
The classical answer is the dominance frontier:
A block B is in the dominance frontier of A if A dominates a predecessor of B but does not strictly dominate B itself.
Every block where two control paths from different definitions can join is in some definition's dominance frontier — and that's exactly where we need a phi.
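The definition translates directly into the frontier computation from Cooper, Harvey and Kennedy's paper. The sketch below rests on simplifying assumptions that are not part of the lab: block ids are already a reverse postorder, every block is reachable, and the entry block has no predecessors. The function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Graph = std::vector<std::vector<int>>;   // succ[b] = successor ids, entry is 0

Graph invert(const Graph& succ) {
    Graph pred(succ.size());
    for (std::size_t b = 0; b < succ.size(); ++b)
        for (int s : succ[b]) pred[s].push_back(static_cast<int>(b));
    return pred;
}

// Immediate dominators via the iterative "intersect" walk.
std::vector<int> immediateDominators(const Graph& pred) {
    const int n = static_cast<int>(pred.size());
    std::vector<int> idom(n, -1);
    idom[0] = 0;
    auto intersect = [&](int a, int b) {
        while (a != b) {                       // climb from the later block
            while (a > b) a = idom[a];
            while (b > a) b = idom[b];
        }
        return a;
    };
    for (bool changed = true; changed; ) {
        changed = false;
        for (int b = 1; b < n; ++b) {
            int cand = -1;
            for (int p : pred[b]) {
                if (idom[p] == -1) continue;   // predecessor not processed yet
                cand = (cand == -1) ? p : intersect(p, cand);
            }
            if (cand != idom[b]) { idom[b] = cand; changed = true; }
        }
    }
    return idom;
}

// df[a] = blocks b such that a dominates a predecessor of b
// but does not strictly dominate b.
std::vector<std::set<int>> dominanceFrontiers(const Graph& succ) {
    Graph pred = invert(succ);
    std::vector<int> idom = immediateDominators(pred);
    std::vector<std::set<int>> df(succ.size());
    for (int b = 0; b < static_cast<int>(succ.size()); ++b) {
        if (pred[b].size() < 2) continue;      // only join points matter
        for (int p : pred[b])
            for (int r = p; r != idom[b]; r = idom[r]) df[r].insert(b);
    }
    return df;
}
```

On the if/else diamond both arms have the join block in their frontier, which is where the phi for x belongs. On a while loop the header appears in its own frontier, which is why the definition says "strictly".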
The Cytron et al. (1991) algorithm computes dominance frontiers, then
inserts phi nodes, then renames every variable use to refer to its
unique SSA definition. This is mem2reg in LLVM terminology —
"promote memory locations (allocas) into SSA registers".
Why we're not implementing it here
Full mem2reg is ~600 lines including dominator-tree construction
(Lengauer–Tarjan or Cooper-Harvey-Kennedy iterative), DF computation,
phi insertion, and the dominator-tree renaming walk. It is a worthy
exercise, and the standard reference implementation (LLVM's
llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp) is the right
thing to read.
Our constFold cheats: it only propagates temps that have exactly one
definition. The TAC builder satisfies that for almost every temp, since
each tN is fresh; the short-circuit slot temps from cp-08, which are
written on two paths, are the exception and are left alone. That gets us
constant folding through binary-op chains without ever computing a
dominator tree.
When we move to LLVM in cp-10/11, mem2reg comes for free as a built-in optimisation pass. The point of this step is conceptual: you should be able to draw the dominance frontier of an arbitrary CFG on a whiteboard, and you should know what a phi node is and where it goes.
Practical SSA in real compilers
- LLVM: SSA from frontend; mem2reg promotes allocas; instcombine, GVN, EarlyCSE, SROA all assume SSA.
- HotSpot C2: "Ideal Graph" is a sea-of-nodes SSA variant.
- V8 TurboFan: sea-of-nodes graph IR; SSA invariant.
- Cranelift: strict SSA, block parameters instead of phi nodes (semantically equivalent but easier to manipulate).
- GCC: GIMPLE → tree-SSA → RTL; one of the first major production compilers to adopt SSA in mainline.
Read next
- Cytron, Ferrante, Rosen, Wegman, Zadeck (1991): Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.
- Cooper, Harvey, Kennedy (2001): A Simple, Fast Dominance Algorithm. The iterative version, and the easiest one to implement first.
- Braun et al. (2013): Simple and Efficient Construction of Static Single Assignment Form. Constructs SSA on the fly during IR generation, without computing dominance first; Cranelift's frontend builder follows this approach.
Step 4 — Constant folding and propagation
What it is
Constant folding evaluates expressions at compile time when all the inputs are known:
t0 = add 3, 4 → t0 = 7
t1 = lt 10, 20 → t1 = true
cjmp t1, bb1, bb2 → jmp bb1
Constant propagation takes folded constants and substitutes them forward, so the next pass can fold further:
t0 = 7
t1 = add t0, 1 → (after propagation) t1 = add 7, 1
→ (after folding) t1 = 8
These two are conceptually one optimisation but practically two passes: fold uses what's already constant, propagation makes more things constant.
Why it's the first pass everyone implements
- Cheap. Linear scan; no analyses required.
- Cascades. Folding one operation usually enables folding several more — the compiler equivalent of compound interest.
- High value. Source code is rich in constants: (width * 4 + padding * 2) / 8. Every operation here can fold once the user picks values.
- Foundation for everything. Inlining + constant folding + propagation is how you get the C++ template-meta-programming style of zero-cost abstraction.
Implementation walkthrough
bool constFold(Function& fn) {
bool changed = false;
// Step 1: propagate.
unordered_map<int, Value> constMap;
buildConstMap(fn, constMap); // tN -> Value if sole def is Move-of-const
for (auto& bb : fn.blocks)
for (auto& ins : bb.instrs)
for (auto& s : ins.srcs)
if (substituteOperand(s, constMap)) changed = true;
// Step 2: fold what's now foldable.
for (auto& bb : fn.blocks) {
for (auto& ins : bb.instrs) {
// binary, unary, cjmp-on-const ...
}
}
return changed;
}
The crucial check is buildConstMap: a temp is eligible for
substitution only if it has exactly one definition. In real SSA
that's automatic. In our TAC IR it's true for builder-generated
temps but we double-check because future passes might break the
invariant.
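A sketch of what that single-definition check might look like. The IR types here (`Operand`, `Instr`, `Block`, `Function`) are pared-down stand-ins invented for illustration, and the map stores plain `long long` constants rather than the lab's `Value`; only the name `buildConstMap` comes from the walkthrough above.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical, pared-down IR types; the lab's real ones carry more fields.
enum class Op { Move, Add };
struct Operand {
    enum Kind { None, Temp, Const };
    Kind kind = None;
    int tempId = -1;      // valid when kind == Temp
    long long value = 0;  // valid when kind == Const
};
struct Instr { Op op; Operand dst; std::vector<Operand> srcs; };
struct Block { std::vector<Instr> instrs; };
struct Function { std::vector<Block> blocks; };

Operand temp(int id) { Operand o; o.kind = Operand::Temp; o.tempId = id; return o; }
Operand constant(long long v) { Operand o; o.kind = Operand::Const; o.value = v; return o; }

// Records tN -> constant only when tN has exactly one definition in the
// whole function and that definition is a Move of a literal.
void buildConstMap(const Function& fn, std::unordered_map<int, long long>& constMap) {
    // First pass: count definitions per temp.
    std::unordered_map<int, int> defCount;
    for (const auto& bb : fn.blocks)
        for (const auto& ins : bb.instrs)
            if (ins.dst.kind == Operand::Temp) ++defCount[ins.dst.tempId];
    // Second pass: keep single-def Moves whose source is a constant.
    for (const auto& bb : fn.blocks)
        for (const auto& ins : bb.instrs) {
            if (ins.op != Op::Move || ins.dst.kind != Operand::Temp) continue;
            if (defCount[ins.dst.tempId] != 1) continue;
            assert(ins.srcs.size() == 1 && "Move takes one source");
            if (ins.srcs[0].kind == Operand::Const)
                constMap[ins.dst.tempId] = ins.srcs[0].value;
        }
}
```

A temp that is assigned twice never enters the map, so substituting from it stays safe even if a later pass breaks the one-definition habit of builder-generated temps.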
What we don't do (yet)
- Algebraic identities: x + 0 → x, x * 1 → x, x - x → 0. Textbooks file these under algebraic simplification (strength reduction proper swaps an expensive op for a cheaper one, e.g. x * 2 → x << 1), and they're trivial to add to the fold loop.
- Comparison identities: x < x → false, x == x → true.
- GVN (Global Value Numbering): recognising that t1 = add a, b and t2 = add a, b always compute the same value, so t2 can be replaced with t1.
- Conditional constant propagation (SCCP): Wegman-Zadeck (1991). Folds across control flow: if every reaching def of x is the same constant, treat x as that constant.
LLVM's InstCombine is the production-grade version of this pass:
tens of thousands of lines spread across several source files, and
still among the most frequently modified code in the LLVM tree.
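To show how small the algebraic-identity step listed above can be, here is a hedged sketch. `Operand`, `Op`, and `simplifyAlgebraic` are hypothetical names, not part of the lab's code, and the rules assume integer arithmetic only (x - x → 0 is not safe for floats because of NaN and infinities). The x * 0 → 0 rule is an extra beyond the three identities named above.

```cpp
#include <cassert>
#include <optional>

// Hypothetical operand: either a known integer constant or an opaque temp id.
struct Operand {
    bool isConst;
    long long value;  // meaningful when isConst
    int tempId;       // meaningful when !isConst
};
enum class Op { Add, Sub, Mul };

inline long long constValue(const Operand& o) {
    assert(o.isConst && "constValue called on a temp");
    return o.value;
}
inline bool isConstEq(const Operand& o, long long v) {
    return o.isConst && constValue(o) == v;
}
inline bool sameTemp(const Operand& a, const Operand& b) {
    return !a.isConst && !b.isConst && a.tempId == b.tempId;
}

// Returns the simpler operand that `op a, b` equals, or nullopt if no rule applies.
std::optional<Operand> simplifyAlgebraic(Op op, const Operand& a, const Operand& b) {
    const Operand zero{true, 0, -1};
    switch (op) {
    case Op::Add:
        if (isConstEq(b, 0)) return a;  // x + 0 -> x
        if (isConstEq(a, 0)) return b;  // 0 + x -> x
        break;
    case Op::Sub:
        if (isConstEq(b, 0)) return a;  // x - 0 -> x
        if (sameTemp(a, b)) return zero;  // x - x -> 0
        break;
    case Op::Mul:
        if (isConstEq(b, 1)) return a;  // x * 1 -> x
        if (isConstEq(a, 1)) return b;  // 1 * x -> x
        if (isConstEq(a, 0) || isConstEq(b, 0)) return zero;  // x * 0 -> 0
        break;
    }
    return std::nullopt;
}
```

In a fold loop, a non-empty result would turn the instruction into a Move of the returned operand, which the propagation step can then pick up.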
The cjmp rewrite is special
cjmp <const>, bbT, bbF
→ jmp bbT // if const is truthy
→ jmp bbF // otherwise
This is the single most important fold for any compiler because it
exposes dead code to simplifyCFG. if (false) { huge_block }
disappears entirely once cjmp folds and simplifyCFG drops the
unreachable side.
This pattern ("fold the predicate, then simplify the CFG") is the
optimiser-level counterpart of what if constexpr does inside the C++17
frontend, and it is a large part of why monomorphised generics in Rust
compile down to reasonable code.
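The cjmp rewrite itself is only a few lines. The sketch below uses a hypothetical `Terminator` struct whose `bbT` / `bbF` fields mirror the names used in this lab's snippets; everything else is an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical terminator shape; the lab's Instr stores more than this.
enum class Op { Jump, CondJump };
struct Terminator {
    Op op;
    bool condIsConst;     // true if the condition operand is a literal
    long long condValue;  // meaningful when condIsConst
    int bbT;              // truthy target (or the sole Jump target)
    int bbF;              // falsy target; unused (-1) for Jump
};

// Rewrites `cjmp <const>, bbT, bbF` into `jmp <taken>`.
// Returns true if it changed anything, so a fixed-point driver can react.
bool foldCondJump(Terminator& t) {
    if (t.op != Op::CondJump || !t.condIsConst) return false;
    assert(t.bbT >= 0 && t.bbF >= 0 && "cjmp needs two targets");
    const int taken = (t.condValue != 0) ? t.bbT : t.bbF;
    t.op = Op::Jump;
    t.bbT = taken;
    t.bbF = -1;            // the other edge is gone
    t.condIsConst = false; // a Jump has no condition
    return true;
}
```

Once the untaken edge is gone, the block it pointed at may have no remaining predecessors, which is exactly what simplifyCFG looks for.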
Step 5 — Dead-code elimination
DCE removes instructions whose results are not used. The principle is simple but the bookkeeping is subtle.
The naïve version
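bool dce(Function& fn) {
    bool changed = false;
    // Pass 1: record every temp that some instruction reads.
    unordered_set<int> used;
    for (auto& bb : fn.blocks)
        for (auto& ins : bb.instrs)
            for (auto& s : ins.srcs)
                if (s.isTemp()) used.insert(s.tempId);
    // Pass 2: erase pure instructions whose destination temp is never read.
    for (auto& bb : fn.blocks) {
        auto& v = bb.instrs;
        auto endIt = remove_if(v.begin(), v.end(), [&](const Instr& ins) {
            if (!isPureValueOp(ins.op)) return false;
            if (!ins.dst.isTemp()) return false;
            return used.count(ins.dst.tempId) == 0;
        });
        if (endIt != v.end()) changed = true;
        v.erase(endIt, v.end());
    }
    return changed;  // runAll's fixed-point loop depends on this
}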
Two checks:
- The op must be pure (no side effects). print, stg, call, jmp, cjmp, ret always survive. (We're being conservative on call: in reality some calls are pure and could be elided. Doing that safely requires effect analysis, which we punt to cp-14.)
- The dst must be an unused temp. Named operands are storage; we don't DCE writes to them because they may be loaded later (or externally visible). Globals are out of scope entirely.
What "side effect" really means
A side effect is any change observable outside the local computation:
- I/O: print, read, file writes, system calls.
- Mutation of memory aliased by anyone else: globals, fields, references.
- Synchronisation: atomics, fences, locks.
- Exceptions / traps: div by zero, signed overflow in some languages, dereferencing null.
The trickiest is the last: in most languages, integer division is
conditionally effectful because it traps on zero. LLVM handles this by
making division by zero undefined behaviour: an unused udiv or sdiv may
be deleted, but the optimiser will not hoist or speculate one unless it
can prove the divisor nonzero. The point: "purity" is not a property of
the opcode alone but of the opcode and the values that flow through it.
We dodge all of this by treating Div and Mod as pure even though
they can trap. A real compiler would track this in a side-table.
Why DCE matters
It's not just hygienic cleanup. DCE is what makes other passes affordable. After inlining or partial evaluation, the IR is bloated with intermediate temps that no longer matter. Without DCE every subsequent pass would walk all of that dead material on every iteration. With DCE the IR stays roughly the size of the live program.
LLVM's ADCE (Aggressive DCE) goes further: it starts from the
program's observable sinks (ret, stores, calls) and works
backwards, considering any instruction not reachable from a sink to be
dead. This catches things our naïve forward-pass misses (mutually
dead temps that nominally use each other).
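The backward, sink-driven idea can be sketched in a few dozen lines. This is an illustration in the spirit of ADCE, not LLVM's algorithm: `Instr` here is a hypothetical flat record, control flow is ignored, and the definition lookup is a deliberately naive linear scan.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical flat instruction list; dst == -1 means "no result".
struct Instr {
    bool hasSideEffect;     // print, store, call, ret, branches...
    int dst;                // temp id defined, or -1
    std::vector<int> srcs;  // temp ids read
};

// Mark: seed with temps read by side-effecting instructions, then follow
// definitions backwards. Sweep: drop every pure instruction whose dst was
// never marked. Assumes each temp has a single definition.
bool markSweepDce(std::vector<Instr>& code) {
    std::unordered_set<int> live;
    std::vector<int> work;
    for (const auto& ins : code) {
        if (!ins.hasSideEffect) continue;
        for (int s : ins.srcs) {
            assert(s >= 0 && "temp ids are non-negative");
            if (live.insert(s).second) work.push_back(s);
        }
    }
    while (!work.empty()) {
        const int t = work.back();
        work.pop_back();
        for (const auto& ins : code)  // naive lookup of t's definition
            if (ins.dst == t)
                for (int s : ins.srcs)
                    if (live.insert(s).second) work.push_back(s);
    }
    const std::size_t before = code.size();
    std::vector<Instr> kept;
    for (auto& ins : code)
        if (ins.hasSideEffect || live.count(ins.dst) > 0)
            kept.push_back(std::move(ins));
    code = std::move(kept);
    return code.size() != before;
}
```

Two temps that only feed each other are never reached from a sink, so both are swept in one pass, whereas a use-count pass sees each as "used" and keeps both.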
The DCE-CFG interaction
After simplifyCFG drops a block, every use that lived in that block
disappears with it, so temps defined elsewhere but read only there
become unused. DCE catches that on the next iteration. This is why our
runAll re-runs the whole pipeline to a fixed point.
The inverse interaction also matters: after DCE removes a Move t1 = 7
that nobody read, the constant 7 is no longer propagated anywhere. So
the next constFold has nothing new to do — fix point reached.
Step 6 — CFG simplification
After constFold rewrites a cjmp to a jmp, one of the targets
becomes unreachable. The instructions in that block — prints, function
calls, everything — are still there in fn.blocks. They take no
runtime time (nothing branches to them) but they bloat the IR, confuse
later passes, and confuse humans reading the dump.
simplifyCFG is the pass that throws them away.
The algorithm
unordered_set<int> reachable;
vector<int> stack{ fn.blocks[0].id };
while (!stack.empty()) {
int id = stack.back(); stack.pop_back();
if (!reachable.insert(id).second) continue;
// For each successor of bb id (jmp: one, cjmp: two, ret: none), push it onto the stack.
}
// Drop blocks not in `reachable`.
Pure flood-fill from the entry block (always bb0). Successors come
from inspecting the terminator of each block — the one place in the
IR where control flow is encoded explicitly. That's the entire point
of the basic-block model: control flow is structural, not embedded in
straight-line instructions.
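Filling in the two commented steps gives something like the following sketch. The `Block` and `Function` types are hypothetical stand-ins that keep only the terminator, and `pruneUnreachable` is an illustrative name rather than the lab's `simplifyCFG`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical block that keeps only what reachability needs.
enum class Term { Jump, CondJump, Ret };
struct Block {
    int id;
    Term term;
    int bbT = -1;  // Jump target, or CondJump true-target
    int bbF = -1;  // CondJump false-target
};
struct Function { std::vector<Block> blocks; };

// Successors come only from the terminator.
std::vector<int> successors(const Block& b) {
    switch (b.term) {
    case Term::Jump:     return {b.bbT};
    case Term::CondJump: return {b.bbT, b.bbF};
    case Term::Ret:      return {};
    }
    return {};
}

// Flood-fill from the entry block, then erase everything not visited.
bool pruneUnreachable(Function& fn) {
    assert(!fn.blocks.empty() && "function needs an entry block");
    std::unordered_map<int, const Block*> byId;
    for (const auto& b : fn.blocks) byId[b.id] = &b;

    std::unordered_set<int> reachable;
    std::vector<int> stack{fn.blocks[0].id};
    while (!stack.empty()) {
        const int id = stack.back();
        stack.pop_back();
        if (!reachable.insert(id).second) continue;  // already visited
        for (int s : successors(*byId.at(id))) stack.push_back(s);
    }

    const std::size_t before = fn.blocks.size();
    fn.blocks.erase(
        std::remove_if(fn.blocks.begin(), fn.blocks.end(),
                       [&](const Block& b) { return reachable.count(b.id) == 0; }),
        fn.blocks.end());
    return fn.blocks.size() != before;
}
```

The id-to-block map is built once up front and is only read before the erase, so the pointers it holds are never used after the vector is modified.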
Other things simplifyCFG could do
We only do reachability pruning. Production-grade implementations also:
- Block merging. If bb1 ends in jmp bb2 and bb2 has only bb1 as predecessor, merge them.
- Empty-block bypass. If bb1's only instruction is jmp bb2, rewrite every predecessor of bb1 to target bb2 directly.
- Branch threading. If bb1 ends in cjmp t, bb2, bb3 and bb2 is a known-constant block (e.g. cjmp false, ...), thread directly.
- Hoisting common code. If both arms of a cjmp begin with the same instruction, hoist it before the branch.
LLVM's SimplifyCFG pass implements all of these and quite a few
more.
A subtle correctness rule
Whenever you delete a block, you must also check whether any remaining block referenced it as a successor — and if so, that reference is now invalid. Our reachability flood guarantees this: nothing reaches a deleted block, by definition, so no remaining terminator points to one.
In a more aggressive pass (block merging), you must rewrite phi nodes' incoming-block lists when you merge predecessors. We don't have phi nodes, so we sidestep that landmine for now; it returns with mem2reg in cp-10/11.
Renumbering, or not?
We don't renumber block ids after pruning. bb0, bb2, bb4 is fine in
the printer — readers know it's a sparse sequence. If we did renumber,
every bbT / bbF field in every terminator would need updating.
That's fragile and offers no semantic benefit.
LLVM and Cranelift both keep sparse block numbering for the same reason.
Composition with other passes
while (true) {
    bool a = constFold(fn);    // may rewrite cjmp → jmp
    bool b = dce(fn);          // may drop now-unused temps
    bool c = simplifyCFG(fn);  // may drop now-unreachable blocks
    if (!a && !b && !c) break;
}
The order matters: constFold first (creates work for the others),
then dce, then simplifyCFG. Swap any pair and you'll get the same
fixed point eventually, but more iterations to reach it.
This kind of phase ordering is one of the long-standing research problems in compiler engineering — see Whitfield & Soffa (1997) for a formal treatment.
Step 7 — Pass manager and phase ordering
A pass manager is the small piece of glue that decides which passes to run when. Ours is twenty lines:
PassStats runAll(Module& m) {
PassStats st;
for (auto& f : m.functions) {
for (int i = 0; i < 16; ++i) {
++st.iterations;
bool a = constFold(*f);
bool b = dce(*f);
bool c = simplifyCFG(*f);
if (!a && !b && !c) break;
...
}
}
return st;
}
That is enough. But it raises three real questions.
1. Why fixed-point at all?
Each pass can enable the next:
- constFold rewrites cjmp t, bbT, bbF to jmp bbT when t is a known constant.
- simplifyCFG drops the now-unreachable bbF.
- dce drops the temps that were only used inside bbF.
- Some of those temps' definitions were the only uses of even earlier temps. Drop them too.
- Now constFold may see a new chain of single-def Moves. Re-run.
The fuel cap (i < 16) is paranoia: if some pass interaction caused
oscillation, we'd notice in tests rather than hanging. In practice
real programs reach fix point in 1–3 iterations.
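The fuel-capped loop generalises to any list of passes. The driver below is a hypothetical generic version (the lab's runAll hard-codes its three passes); it is shown mainly to make the termination behaviour concrete.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical generic driver. Each pass returns true if it changed the IR.
// Returns the number of rounds executed, capped at `fuel`.
template <typename IR>
int runToFixedPoint(IR& ir,
                    const std::vector<std::function<bool(IR&)>>& passes,
                    int fuel = 16) {
    assert(fuel > 0 && "need at least one round");
    int rounds = 0;
    while (rounds < fuel) {
        ++rounds;
        bool changed = false;
        for (const auto& pass : passes)
            changed = pass(ir) || changed;  // call first so every pass runs each round
        if (!changed) break;                // fixed point: nobody changed anything
    }
    return rounds;
}
```

Note that the final round is always a "nothing changed" round unless the fuel runs out, so a pipeline that needs k rounds of real work reports k + 1. A pass that always claims to have changed something stops at exactly `fuel` rounds instead of hanging.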
2. Why this order?
constFold → dce → simplifyCFG puts the creator of opportunities
first and the consumer last. The reverse order would still reach the
fixed point but slower:
| Order | Iterations on if(10<20) { print "a"; } |
|---|---|
| constFold, dce, simplify | 2 |
| simplify, dce, constFold | 3 |
Multiply by every program in the test suite and the difference
compounds. LLVM's pass pipelines (O0, O1, O2, O3, Os) are
extensively tuned for exactly this kind of ordering.
3. Why not analysis caching?
LLVM's "new pass manager" tracks analyses — dominator trees, alias sets, loop info — separately from transforms. Analyses are computed lazily and invalidated when a transform mutates the IR. This avoids re-computing a dominator tree just because some unrelated pass tweaked a branch.
Our compiler has no analyses to cache (no dominator tree, no alias info), so we get away without the machinery. When cp-11/12 introduce LLVM properly, the new pass manager is what schedules everything.
How real compilers structure this
- LLVM opt -O2 runs a sequence of ~70 passes including InstCombine (constant folding + tons of peepholes), GVN, SCCP, LoopUnroll, Inline, SimplifyCFG, DCE, EarlyCSE, ...
- HotSpot C2 runs roughly four phases: GVN, LoopOpts, Macro Expand, Optimisation. Each is itself a fix-point of smaller passes.
- V8 Turbofan runs a graph-based scheduler instead: passes mutate a sea-of-nodes representation, and a final "schedule" pass linearises back to blocks.
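The common thread: every modern compiler schedules its transforms so that each one exposes work for the next, repeating passes or iterating locally to a fixed point where that pays off, then hands off to the backend.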
Where this lab takes us
After cp-09 we have:
- A correctness oracle (runProgram).
- A small but real optimisation pipeline.
- Tests that both verify pass behaviour (golden IR) and verify semantic equivalence (interpret before vs after).
cp-10 takes the same IR shape but emits LLVM IR text instead of
running our own interpreter. cp-11 hands that off to llc. After
that, the pass manager that runs over our IR is LLVM's, not ours.
cp-10 — LLVM IR Fundamentals
Status: ✅ Built (30/30 checks). Emitted IR runs under lli.
This lab takes the TAC IR from cp-08/09 and emits textual LLVM IR.
The emitter is hand-rolled — no LLVM library dependency — so the lab
builds anywhere a C++17 compiler is available. cp-11 replaces this
with the real C++ IRBuilder API and integrates find_package(LLVM).
Restriction
Programs in cp-10 must be numeric-only. All MiniLang values lower
to i64; booleans are 0 / 1 in i64. Strings (and function
values) raise an error from the emitter — the runtime for them lands
in cp-14.
Layout
| File | Purpose |
|---|---|
| src/llvm_emit.{hpp,cpp} | TAC IR → textual LLVM IR. |
| src/main.cpp | mlllvm driver. -O enables our cp-09 passes first. |
| tests/test_llvm_emit.cpp | 30 assertions covering module shape, arithmetic, control flow, function definition, globals, and the strings-rejected error path. |
Build, test, smoke-run
cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build --output-on-failure
# Use the system LLVM toolchain to actually execute the emitted module:
echo 'let i=0; while(i<3){print i; i=i+1;}' | ./build/mlllvm -O > /tmp/m.ll
/opt/homebrew/opt/llvm/bin/lli /tmp/m.ll # prints 0 1 2
What the IR looks like
; ModuleID = 'minilang'
target triple = "arm64-apple-macosx"
@.fmt = private constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(i8*, ...)
define i32 @main() {
%i.addr = alloca i64
store i64 0, i64* %i.addr
br label %L0
L0:
%v0 = load i64, i64* %i.addr
%v1 = icmp slt i64 %v0, 3
...
}
Why hand-roll the emitter?
- Forces engagement with the textual format — every LLVM IR document starts there, and you should be able to read it.
- Keeps cp-10 dependency-free and portable.
- Demonstrates the lowering decisions without LLVM's IRBuilder doing them silently: alloca for locals, load/store everywhere, explicit icmp + zext for booleans, explicit printf call.
- Sets the conceptual baseline so cp-11 can focus on what the C++ API gives you for free (block management, name uniquing, type-erasure of constants, type verification, optimisation passes).
Step docs
- LLVM IR overview
- Types, values, and constants
- The module/function/block structure
- Memory model: alloca / load / store
- Lowering arithmetic and comparisons
- Lowering control flow
- The runtime ABI: printf and main
Step 1 — LLVM IR overview
LLVM IR is a strongly-typed, SSA-form, three-address virtual ISA. It sits between the frontend (which generates it) and the backend (which lowers it to a real machine). It is the lingua franca of modern production compilers: Clang, Rust, Swift, Julia, Numba, GHC's LLVM backend, ldc (D), Crystal, Pony, Zig (in part), and a hundred research languages all emit LLVM IR and let LLVM handle codegen.
Three forms, same content
LLVM IR exists in three equivalent encodings:
- Textual (.ll): human-readable; what we emit in this lab.
- Bitcode (.bc): compact binary form; what gets serialised.
- In-memory: llvm::Module* etc.; what the C++ API manipulates.
They are losslessly interconvertible: llvm-as (text → bitcode),
llvm-dis (bitcode → text), the IRBuilder constructs in-memory IR,
Module::print() dumps text. cp-11 will switch to the in-memory form.
Three shapes within the text
module
└─ globals (constants, mutable globals, function declarations)
└─ functions
└─ basic blocks (labelled, single-entry single-exit)
└─ instructions (one per line)
Almost every line in a function body has the form:
%name = <opcode> <type> <operands>
That <type> is mandatory — LLVM IR is strongly typed, and the type
system is part of the verifier's job. The verifier (opt -verify,
also run implicitly by lli/llc) rejects modules whose types don't
align. This is what makes generating LLVM IR feel slightly tedious
the first time and very pleasant thereafter: errors are caught early
and precisely.
SSA
Every value (%name) is defined exactly once. We avoid grappling with
SSA construction directly by using alloca + load/store for every
mutable variable; LLVM's mem2reg pass promotes those to SSA later.
This is the canonical strategy for frontends — the "alloca trick" is
how Clang emits IR for C local variables.
What we will NOT do in cp-10
- phi nodes: needed only if you construct SSA in the frontend yourself. With alloca/mem2reg you never write a phi yourself.
- Aggregate types (struct, array as value).
- Garbage-collection statepoints.
- Debug info.
- Metadata.
- TBAA / aliasing annotations.
Each is essential for a full production frontend; each is a separable concern that we can layer on in cp-14+.
Why the textual form first?
Because LLVM's error messages, dumps, and documentation all speak in textual IR. Even if you write a frontend that only ever calls the C++ API, you will read textual IR every day for the rest of your compiler career. Get fluent in it now.
Step 2 — Types, values, and constants
The type system
LLVM IR types come in two flavours: primitive and derived.
Primitive
- Integer: i1, i8, i16, i32, i64, i128, ... up to any width. There is no separate bool; i1 plays that role.
- Floating point: half (16-bit), float (32-bit), double (64-bit), x86_fp80 / fp128 (extended).
- Void: void (only valid as a function return type).
- Label: label (block reference; rarely written explicitly).
- Metadata: metadata (debug info etc.).
Derived
- Pointer: T* (legacy) or ptr (modern LLVM ≥ 15, "opaque pointers"). We use T* here because it's clearer for teaching; modern LLVM auto-converts.
- Array: [N x T]. Fixed-size, allocated as a value.
- Vector: <N x T>. SIMD lane group.
- Struct: { T1, T2, ... } (literal) or %S (named).
- Function: R (A1, A2, ...).
We use only i64, i1, i8*, and one array ([6 x i8] for the
format string).
Value categories
Every operand in LLVM IR is one of:
- A constant: 42, true, 0.5, or a getelementptr of a global.
- A register: %name or %number, the result of a previous instruction in the same function. Function parameters are also registers (%arg0 …).
- A global: @name, a top-level symbol. Includes function references, mutable globals, and constant data like our @.fmt.
The leading sigil is meaningful: % is local to a function, @ is
module-global. There is no other namespace.
How we lower MiniLang values
We picked the simplest possible mapping: everything is i64.
| MiniLang | LLVM | Notes |
|---|---|---|
| Number | i64 | We discard the fractional part on purpose. |
| Bool | i64 | 0 or 1. We zext i1 ... to i64 after icmp. |
| Nil | i64 0 | Same as false. |
| String | ✗ | Not supported; emitter errors out. |
This is a pedagogical choice: it makes the IR easy to read and
type-uniform, at the cost of disallowing mixed-type expressions.
cp-14 introduces a real boxed Value representation that supports
the full type set.
Constants
A constant in LLVM IR has both a type and a value:
i64 42 ; integer
[6 x i8] c"%lld\0A\00" ; byte array
@.fmt ; symbol (type is [6 x i8]*)
getelementptr ([6 x i8], [6 x i8]* @.fmt, i64 0, i64 0)
; "address of first byte of @.fmt"
The getelementptr (GEP) form is how you compute addresses without
emitting an actual instruction — it's a constant expression. We use
it to convert our array-of-bytes format string into a pointer to its
first byte, which is what printf expects.
GEP is one of the most-misunderstood parts of LLVM IR. The
two-index form [6 x i8]* @.fmt, i64 0, i64 0 reads as: "starting
from @.fmt (which points to a [6 x i8]), advance by zero arrays,
then advance to byte index zero". The result is an i8*. Why two
indices? Because @.fmt is a pointer: the first index is pointer
arithmetic in units of whole [6 x i8] objects (like p + 0 in C), and
the second selects an element inside the array it points at. GEP never
loads from memory; it only computes an address. This catches everyone
the first time.
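One way to internalise the two-index form is to write out the arithmetic. The helpers below are illustrative only (both function names are made up): the first computes the byte offset a two-index GEP over `[N x i8]` would produce, the second is the closest C++ spelling of the same address computation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offset of `getelementptr [N x i8], [N x i8]* %p, i64 i0, i64 i1`.
// i0 steps over whole [N x i8] objects; i1 steps over i8 elements inside one.
// Nothing is loaded from memory: this is pure address arithmetic.
constexpr std::int64_t gepArrayOfI8Offset(std::int64_t n, std::int64_t i0, std::int64_t i1) {
    const std::int64_t sizeofArray = n * 1;  // sizeof([N x i8]) is N bytes
    const std::int64_t sizeofElem = 1;       // sizeof(i8)
    return i0 * sizeofArray + i1 * sizeofElem;
}

// The same computation in C++ terms: p is a pointer to an array of 6 chars.
inline const char* gepLikeCpp(const char (*p)[6], std::ptrdiff_t i0, std::ptrdiff_t i1) {
    return &p[i0][i1];  // step i0 arrays, then i1 bytes; still no load
}

static_assert(gepArrayOfI8Offset(6, 0, 0) == 0, "first byte of the first array");
static_assert(gepArrayOfI8Offset(6, 1, 0) == 6, "i0 moves by whole arrays");
```

With indices (0, 0) the offset is zero, so the result is the address of the first byte of the format string, typed as a pointer to a byte rather than a pointer to the array.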
Why types matter even in a dynamic language
Even when you're compiling a dynamic language, the LLVM-level
types must be statically known. That's why cp-10 restricts to
numeric only: we can't emit add i64 %a, %b if %a might be a
string at runtime.
The two ways out are:
1. Uniform representation: pick one LLVM type (typically i64 or a tagged 64-bit) and stuff every dynamic value into it.
2. Specialisation: generate different IR for different type profiles, possibly at JIT time. This is what V8 and LuaJIT do.
cp-14 takes path (1). cp-17's capstone explores path (2).
Step 3 — Module / function / block
The module
A module is the unit of compilation. One .ll file = one
llvm::Module = one translation unit. Modules contain:
- A target triple (arm64-apple-macosx) and data layout.
- Global declarations (@printf, @.fmt, @x).
- Function definitions.
- Metadata (debug info, optimisation hints).
; ModuleID = 'minilang'
target triple = "arm64-apple-macosx"
@.fmt = private constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(i8*, ...)
define i32 @main() { ... }
The function
A function has a return type, name, parameter list, and a list of basic blocks:
define i64 @add(i64 %arg0, i64 %arg1) {
L0:
%v0 = add i64 %arg0, %arg1
ret i64 %v0
}
The first block listed is the entry block — implicit, no special marker. Parameters are values in scope from the entry block. Return type is declared up front; every terminator that returns must agree.
Linkage and visibility
Function definitions default to external linkage (visible to the
linker, as if extern in C). Other options:
- private: local to the module and left out of the object file's symbol table (we use this for @.fmt).
- internal: local to the module, like static in C, but still emitted as a local symbol.
- weak / linkonce: for inline functions and templates.
We don't decorate @main or @add — external is the right default.
The basic block
A basic block is a maximal sequence of straight-line instructions
ending in a terminator — ret, br, switch, unreachable,
indirectbr, invoke, resume, catchret, cleanupret. Exactly
one terminator per block; if you forget, the verifier rejects.
L1:
%v3 = add i64 %v0, 1
store i64 %v3, i64* %i.addr
br label %L0
Blocks are labelled (L1: …). Labels are values of type label,
referred to as %L1 in br targets. The label name on the
definition site has no % — but the reference site does. (Yes, this
inconsistency is annoying. Welcome to LLVM IR.)
Why basic blocks at all?
Because every flow-graph analysis is dramatically simpler if you can reason about straight-line sequences as opaque units, then handle control at the boundaries. Dominator computation, liveness, register allocation, scheduling — all of them operate on basic-block CFGs.
Compare to a representation where any instruction could be a branch target: now every analysis has to track "did anyone jump into the middle of this run?" The basic-block invariant — enter at the top, exit at the bottom — buys you enormous simplification.
Mapping our TAC
Our TAC already had basic blocks (BasicBlock in cp-08), so the
mapping is one-to-one. The only difference: LLVM blocks are labelled
by %LN syntactically; ours by integer id. The emitter prefixes
with L:
static std::string blockLabel(int id) { return "L" + std::to_string(id); }
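The emitter also ends the alloca prelude with br label %L<id> to enter
the first TAC block. The prelude is the function's unlabelled entry
block, and LLVM has no fall-through between blocks, so it needs its own
terminator. Keeping it separate also matters for correctness: the entry
block may not have predecessors, so a loop header such as L0 could not
serve as the entry itself.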
Step 4 — Memory model: alloca / load / store
LLVM IR is in SSA, which means every register is assigned exactly
once. But MiniLang variables — let x = 0; x = x + 1; — are not
single-assignment. How does a frontend bridge that gap?
The alloca trick
You don't emit SSA directly. You emit memory:
%x.addr = alloca i64 ; reserve a stack slot
store i64 0, i64* %x.addr ; x = 0
%v1 = load i64, i64* %x.addr ; read x
%v2 = add i64 %v1, 1 ; v1 + 1
store i64 %v2, i64* %x.addr ; x = ...
Every named local becomes an alloca in the entry block, every read
becomes a load, every write a store. The SSA registers (%v1,
%v2) are short-lived temporaries that hold loaded values.
The IR you produce this way is trivially SSA-valid (every register defined once), but it's pessimistic — it leaves variables in memory when they could live in registers.
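A tiny emitter makes the read/write discipline concrete. `SlotEmitter` and `emitIncrementExample` are hypothetical names for this sketch; only the `%name.addr` and `%vN` naming follows the examples in this document.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical mini-emitter: every local lives in a stack slot,
// every read is a load, every write is a store.
class SlotEmitter {
public:
    // Reserve a stack slot for a named local (belongs in the entry block).
    void declare(const std::string& name) {
        assert(!name.empty());
        out_ << "  %" << name << ".addr = alloca i64\n";
    }
    // Read a local: returns the fresh SSA register holding the loaded value.
    std::string read(const std::string& name) {
        const std::string reg = fresh();
        out_ << "  " << reg << " = load i64, i64* %" << name << ".addr\n";
        return reg;
    }
    // Write a local from a register or literal operand.
    void write(const std::string& name, const std::string& value) {
        out_ << "  store i64 " << value << ", i64* %" << name << ".addr\n";
    }
    // Plain SSA arithmetic on already-loaded values.
    std::string add(const std::string& a, const std::string& b) {
        const std::string reg = fresh();
        out_ << "  " << reg << " = add i64 " << a << ", " << b << "\n";
        return reg;
    }
    std::string text() const { return out_.str(); }

private:
    std::string fresh() { return "%v" + std::to_string(next_++); }
    std::ostringstream out_;
    int next_ = 0;
};

// Emits the IR for `let x = 0; x = x + 1;`.
std::string emitIncrementExample() {
    SlotEmitter e;
    e.declare("x");
    e.write("x", "0");
    const std::string v = e.read("x");
    e.write("x", e.add(v, "1"));
    return e.text();
}
```

The emitter never reuses a register name, which is all it takes to satisfy SSA; the mutable state is carried entirely by the slot.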
mem2reg does the rest
LLVM's mem2reg pass (technically PromoteMemoryToRegister) scans
for allocas that are only loaded from and stored to (no address
taken, no escape) and promotes them into proper SSA values with phi
nodes at join points.
opt -passes=mem2reg ← promotes allocas only; the -O1 and higher pipelines do the same promotion early, via SROA
After mem2reg:
L0:
%x.1 = ...
br label %L1
L1:
%x.2 = phi i64 [%x.1, %L0], [%x.3, %L1]
%x.3 = add i64 %x.2, 1
br label %L1
The frontend never had to compute a dominator tree, never had to insert phi nodes, never had to think about variable renaming. The optimiser did it.
This is the design decision that makes Clang's frontend maintainable. C-style mutable locals would otherwise force the frontend to implement Cytron-style SSA construction.
When does alloca go where?
Always in the entry block. This is critical for performance. An
alloca in a non-entry block is a dynamic stack allocation (a real
stack-pointer adjustment at run time, executed every time control
reaches it); in the entry block the backend collapses all fixed-size
allocas into a single stack-frame reservation.
The entry block is also the only place mem2reg will consider
promoting. An alloca in a loop body stays in memory forever.
Our emitter respects this:
// 1. Emit all alloca instructions first (in the implicit entry block).
for (const auto& name : localsWritten) {
out << " %" << esc(name) << ".addr = alloca i64\n";
}
// 2. Then branch to the first IR block of the function body.
out << " br label %L" << fn.blocks[0].id << "\n";
Function parameters
Parameters arrive as SSA registers (%arg0, %arg1). To let them be
reassigned, we immediately spill them into an alloca:
define i64 @add(i64 %arg0, i64 %arg1) {
%a.addr = alloca i64
%b.addr = alloca i64
store i64 %arg0, i64* %a.addr
store i64 %arg1, i64* %b.addr
...
}
Again, mem2reg cleans this up after the optimiser runs.
Globals
Globals are top-level @-prefixed pointers:
@x = global i64 0
The variable @x is itself an i64*; you load i64, i64* @x to read
and store i64 N, i64* @x to write. This is unlike alloca (which
returns a pointer to a stack slot); @x's pointer is the address of
the global's storage.
Globals are not promoted by mem2reg — that pass operates only on
stack allocas. Promoting a global to register is a different
optimisation (sometimes called "globalopt" or "internalize +
mem2reg") and requires whole-module analysis.
Pointer typing — old vs new
In LLVM ≤ 14, pointers carry the pointee type (i64*). In LLVM ≥ 15,
the opaque-pointer revolution made all pointers just ptr, and the
instruction carries the type (load i64, ptr %x.addr). Our text-form
emitter uses the older typed-pointer form because it's
self-documenting; modern lli accepts either.
Step 5 — Lowering arithmetic and comparisons
Binary integer operations
The mapping is essentially one-to-one with TAC:
| TAC Op | LLVM instruction |
|---|---|
| Add | add i64 a, b |
| Sub | sub i64 a, b |
| Mul | mul i64 a, b |
| Div | sdiv i64 a, b |
| Mod | srem i64 a, b |
| And | and i64 a, b |
| Or | or i64 a, b |
Signed vs unsigned: we use sdiv and srem (signed) because
MiniLang's Number is signed-ish in spirit. The s/u/f prefix
on LLVM arithmetic is a frequent source of bugs:
- add: no prefix; signedness doesn't matter (two's complement).
- mul: no prefix; same reason.
- sdiv / udiv: different result for negative operands.
- srem / urem: likewise.
- fadd / fmul / fdiv: floating point.
- shl: no prefix; lshr (logical) / ashr (arithmetic) for right shift.
The nsw / nuw flags (no signed wrap / no unsigned wrap) on
arithmetic let the optimiser assume overflow is impossible. We don't
emit them — being conservative — but a real frontend should track this
from the source language's overflow semantics.
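The signed/unsigned split in the list above is easy to see by feeding one bit pattern to both flavours. The helpers below model the LLVM operations in C++ and are purely illustrative; they assume a two's-complement target with arithmetic right shift on signed values (guaranteed from C++20, and what mainstream compilers did before that).

```cpp
#include <cassert>
#include <cstdint>

// View a signed value as its raw 64-bit pattern.
inline std::uint64_t bits(std::int64_t v) { return static_cast<std::uint64_t>(v); }

// add: one instruction serves both interpretations (wraps modulo 2^64).
inline std::uint64_t addBits(std::uint64_t a, std::uint64_t b) { return a + b; }

// sdiv: interpret both operands as signed.
inline std::uint64_t sdivBits(std::uint64_t a, std::uint64_t b) {
    assert(b != 0 && "division by zero is undefined");
    return bits(static_cast<std::int64_t>(a) / static_cast<std::int64_t>(b));
}

// udiv: interpret both operands as unsigned.
inline std::uint64_t udivBits(std::uint64_t a, std::uint64_t b) {
    assert(b != 0 && "division by zero is undefined");
    return a / b;
}

// ashr: shift right, copying the sign bit.
inline std::uint64_t ashrBits(std::uint64_t a, unsigned n) {
    return bits(static_cast<std::int64_t>(a) >> n);
}

// lshr: shift right, filling with zeros.
inline std::uint64_t lshrBits(std::uint64_t a, unsigned n) { return a >> n; }
```

For the pattern of -8, `sdivBits(x, 2)` gives the pattern of -4 while `udivBits(x, 2)` gives 0x7FFFFFFFFFFFFFFC; `addBits` gives the same answer under either reading, which is why add needs no prefix.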
Unary
- Neg becomes sub i64 0, %a. There is no dedicated integer neg instruction (floating point has fneg).
- Not (boolean negation) becomes icmp eq i64 %a, 0 followed by zext i1 ... to i64.
Comparisons
%v0 = icmp slt i64 %a, %b ; signed less-than
%v1 = zext i1 %v0 to i64
icmp returns i1. To use the result as our uniform i64 value, we
zext (zero-extend) to i64. If we stored booleans as i1
throughout we wouldn't need the zext — but every other operation
would then need to widen back to i64 for arithmetic.
The condition mnemonics:
| TAC | LLVM icmp cond |
|---|---|
| Eq | eq |
| Ne | ne |
| Lt | slt |
| Le | sle |
| Gt | sgt |
| Ge | sge |
The s prefix is for signed comparison. ult, ule, etc. are
unsigned. eq and ne don't have a sign because they don't need
one — bitwise equality is the same either way.
The zext / trunc dance
icmp always produces i1. Stores and arithmetic always want
i64. Branching on a value always wants i1 again.
%v0 = icmp slt i64 %a, %b ; i1
%v1 = zext i1 %v0 to i64 ; i64
; ... later, used as a branch condition: ...
%v2 = icmp ne i64 %v1, 0 ; back to i1
br i1 %v2, label %T, label %F
This back-and-forth is what you pay for using i64 as the uniform
value type. LLVM's instcombine cleans most of it up:
icmp ne (zext T to i64), 0 → T
So after opt -O1 the i64 round-trip vanishes entirely.
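In emitter terms the dance is two small helpers: one that widens after a comparison and one that narrows before a branch. This is a sketch with invented names (`Emitter`, `compare`, `branchIf`, `emitLessThanBranch`); the lab's emitter may organise this differently.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helpers; predicate strings are LLVM's own (slt, sle, eq, ...).
struct Emitter {
    std::ostringstream out;
    int next = 0;
    std::string fresh() { return "%v" + std::to_string(next++); }

    // icmp yields i1; widen it to the uniform i64 value type.
    std::string compare(const std::string& pred, const std::string& a, const std::string& b) {
        assert(!pred.empty());
        const std::string bit = fresh();
        out << "  " << bit << " = icmp " << pred << " i64 " << a << ", " << b << "\n";
        const std::string wide = fresh();
        out << "  " << wide << " = zext i1 " << bit << " to i64\n";
        return wide;
    }

    // Branching needs i1 again: test the i64 against zero first.
    void branchIf(const std::string& value, const std::string& thenLabel,
                  const std::string& elseLabel) {
        const std::string bit = fresh();
        out << "  " << bit << " = icmp ne i64 " << value << ", 0\n";
        out << "  br i1 " << bit << ", label %" << thenLabel
            << ", label %" << elseLabel << "\n";
    }
};

// Emits the round trip for `if (a < b) goto T else goto F`.
std::string emitLessThanBranch() {
    Emitter e;
    const std::string c = e.compare("slt", "%a", "%b");
    e.branchIf(c, "T", "F");
    return e.out.str();
}
```

Keeping the widening inside `compare` means every other part of the emitter can assume i64 operands, at the cost of the extra icmp that instcombine later removes.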
Why we don't use fadd / fmul
MiniLang numbers are doubles in the interpreter, but we lower to
i64 for simplicity. To handle floats properly:
- Pick double as the uniform type instead of i64.
- Replace add → fadd, sdiv → fdiv, icmp → fcmp.
- fcmp predicates have an ordered/unordered distinction (oeq, ueq, olt, ult, ...) because NaN can fail every comparison.
- Print with %g or %lf format.
cp-14 will introduce a tagged value type that handles both i64 and double, with a runtime dispatch on the tag bits.
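The ordered/unordered distinction mentioned in the list above is easiest to see with NaN as an operand. The function below models four fcmp predicates in C++; `FPred` and `fcmp` are names made up for this sketch, and it assumes IEEE-754 doubles compiled without fast-math style flags.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Four of LLVM's fcmp predicates, modelled on doubles.
enum class FPred { OEQ, UEQ, OLT, ULT };

// Ordered predicates are false when either operand is NaN;
// unordered predicates are true in that case.
bool fcmp(FPred p, double a, double b) {
    const bool unordered = std::isnan(a) || std::isnan(b);
    switch (p) {
    case FPred::OEQ: return !unordered && a == b;
    case FPred::UEQ: return unordered || a == b;
    case FPred::OLT: return !unordered && a < b;
    case FPred::ULT: return unordered || a < b;
    }
    assert(false && "unknown predicate");
    return false;
}
```

A frontend that lowers a source-level `<` on doubles usually wants the ordered form (olt), since that matches the behaviour of `<` in C and C++ when NaN is involved.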
Step 6 — Lowering control flow
Unconditional branch
br label %L3
br is the workhorse terminator. The single-operand form is an
unconditional jump.
Our TAC Jump op lowers directly:
case Op::Jump:
    out << " br label %" << blockLabel(ins.bbT) << "\n";
    break;
Conditional branch
br i1 %cond, label %T, label %F
The condition must be i1. Since we keep our values in i64, we
must emit an explicit icmp ne ..., 0 before the branch:
case Op::CondJump: {
    std::string a; operandValue(ins.srcs[0], a);
    std::string cmp = fresh();
    out << " " << cmp << " = icmp ne i64 " << a << ", 0\n";
    out << " br i1 " << cmp << ", label %" << blockLabel(ins.bbT)
        << ", label %" << blockLabel(ins.bbF) << "\n";
    break;
}
If we were storing booleans as i1 natively, this would just be:
br i1 %cond, label %T, label %F
That's a real argument for using i1 as the boolean type, at least for
unoptimised builds (instcombine removes the extra compare at -O1), even
if it complicates the rest of the IR.
Multi-way branches: switch
LLVM supports an n-way switch:
switch i64 %tag, label %default [
i64 0, label %A
i64 1, label %B
i64 2, label %C
]
We don't emit switch because our TAC doesn't have one — any
switch-like construct would have been lowered to a chain of cjmps
in the IR builder. A more sophisticated frontend would detect
switch-like AST patterns and lower to switch directly, giving the
backend the opportunity to emit a jump table.
phi nodes (or the lack thereof)
Because we use the alloca trick, our emitted IR has no phi
nodes. Every variable read is a load; every write is a store.
The IR for if (c) { x = 1; } else { x = 2; } print x;:
%v0 = load i64, i64* %c.addr
%v1 = icmp ne i64 %v0, 0
br i1 %v1, label %L1, label %L2
L1:
store i64 1, i64* %x.addr
br label %L3
L2:
store i64 2, i64* %x.addr
br label %L3
L3:
%v2 = load i64, i64* %x.addr
call i32 (i8*, ...) @printf(...)
After mem2reg:
br i1 %v1, label %L1, label %L2
L1:
br label %L3
L2:
br label %L3
L3:
%x.3 = phi i64 [1, %L1], [2, %L2]
call i32 (i8*, ...) @printf(...)
That phi was always implicit in the semantics; mem2reg just made it
explicit.
Loops
A while (cond) { body } in TAC has three blocks: header, body, exit.
The header tests the condition and cjmps into body or exit. The
body unconditionally jumps back to header. The header dominates both
body and exit; the body's backward edge is what makes this a natural
loop.
br label %Lheader
Lheader:
...test...
br i1 %cond, label %Lbody, label %Lexit
Lbody:
...body...
br label %Lheader
Lexit:
...
LLVM detects this pattern (loop analysis runs on the SSA form after mem2reg) and enables loop-specific optimisations: invariant code motion, induction-variable simplification, vectorisation, unrolling.
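The "header dominates body, body jumps back" condition can be checked mechanically. The sketch below is a toy, not LLVM's LoopInfo: it computes dominator sets with the simple iterative algorithm and reports every edge whose target dominates its source. The Cfg alias and both function names are made up for this example, and it assumes every block is reachable from block 0.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Cfg = std::vector<std::vector<int>>; // successors per block; block 0 is the entry

// Iterative dominator sets: dom[b][d] is true when d dominates b.
std::vector<std::vector<bool>> dominators(const Cfg& succ) {
    const std::size_t n = succ.size();
    assert(n > 0 && "CFG needs an entry block");
    std::vector<std::vector<int>> pred(n);
    for (std::size_t b = 0; b < n; ++b)
        for (int s : succ[b]) pred[s].push_back(static_cast<int>(b));

    std::vector<std::vector<bool>> dom(n, std::vector<bool>(n, true));
    dom[0].assign(n, false);
    dom[0][0] = true;                              // the entry is dominated only by itself
    for (bool changed = true; changed;) {
        changed = false;
        for (std::size_t b = 1; b < n; ++b) {
            std::vector<bool> next(n, true);
            for (int p : pred[b])                  // intersect over all predecessors
                for (std::size_t d = 0; d < n; ++d)
                    next[d] = next[d] && dom[p][d];
            next[b] = true;                        // every block dominates itself
            if (next != dom[b]) { dom[b] = next; changed = true; }
        }
    }
    return dom;
}

// A back edge is an edge tail -> head where head dominates tail.
std::vector<std::pair<int, int>> backEdges(const Cfg& succ) {
    auto dom = dominators(succ);
    std::vector<std::pair<int, int>> out;
    for (std::size_t b = 0; b < succ.size(); ++b)
        for (int s : succ[b])
            if (dom[b][s]) out.push_back({static_cast<int>(b), s});
    return out;
}
```

With blocks numbered entry = 0, header = 1, body = 2, exit = 3, the only back edge found is body to header, which is what identifies the natural loop.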
unreachable
If a block has no logical successor (e.g. control falls off the end
of noreturn code), the terminator is unreachable. Our compiler
doesn't generate unreachable because cp-09's simplifyCFG already
removed those blocks. But they appear all the time in C frontends
after calls to abort(), exit(), etc.
Indirect branches
indirectbr i8* %target, [label %a, label %b, ...] is how
computed-goto / threaded-code interpreters express their dispatch.
The first operand is a pointer (an address taken with blockaddress),
not a label.
Out of scope here; relevant for cp-12's JIT capstone where we may
explore inline caches.
invoke and exception handling
LLVM's exception model uses invoke (a call with a normal and an
exceptional destination) plus landingpad blocks. We have no
exceptions in MiniLang. C++ frontends spend a lot of time on this.
Step 7 — Runtime ABI: printf and main
Why printf?
MiniLang has a built-in print. To execute the emitted module, some
piece of code outside it has to do the formatting and the actual
system call. The cheapest option is to delegate to libc's printf,
which lli, llc + ld, and any host linker make trivially
available.
@.fmt = private constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(i8*, ...)
- @.fmt is a private (module-local) constant holding the bytes %lld\n\0. The [6 x i8] array type makes the length explicit; the null terminator at the end matches C string convention.
- declare i32 @printf(i8*, ...) introduces an external function declaration. The variadic ... is part of the type signature.
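The length and escaping of such a constant are easy to get wrong by hand. A small helper can derive both from the source string; this is a sketch under the assumption that the emitter builds the global as text, and llvmStringGlobal is a hypothetical name.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Render a C++ string as an LLVM global string constant. The NUL
// terminator is appended here and the [N x i8] length is computed
// from the resulting byte count.
std::string llvmStringGlobal(const std::string& name, const std::string& text) {
    assert(!name.empty());
    std::string bytes = text;
    bytes.push_back('\0');
    std::string body;
    for (unsigned char ch : bytes) {
        if (ch >= 0x20 && ch < 0x7f && ch != '"' && ch != '\\') {
            body.push_back(static_cast<char>(ch)); // printable: emit as-is
        } else {
            char hex[4];                           // backslash + two hex digits + NUL
            std::snprintf(hex, sizeof hex, "\\%02X", static_cast<unsigned>(ch));
            body += hex;
        }
    }
    return "@" + name + " = private constant [" + std::to_string(bytes.size()) +
           " x i8] c\"" + body + "\"";
}
```

For the name .fmt and the text "%lld\n" this reproduces the declaration shown above, including the length 6.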
The call site
%v_ = call i32 (i8*, ...) @printf
(i8* getelementptr ([6 x i8], [6 x i8]* @.fmt, i64 0, i64 0),
i64 %v)
Three things here are non-obvious:
- call i32 (i8*, ...) @printf: the function type appears parenthesised after the return type. This is required only for variadic calls. Non-variadic calls can omit it: call i32 @add(...).
- The GEP: @.fmt is a [6 x i8]*. printf wants i8*. getelementptr with two zero indices computes "address of the first byte". This is the canonical "decay an array to a pointer" pattern in LLVM.
- The discarded return value: we assign it to %v_ even though we never use it. LLVM does not require this; a call whose result is unused may be written without a name, as the printf calls in Step 6 are. (For void-returning functions, naming the result is forbidden.)
main
LLVM doesn't define what main is — that's a libc/runtime
convention. We adopt the C convention:
define i32 @main() {
...
ret i32 0
}
@main returns i32 (the process exit status). When we lower our
__script__ function, we use define i32 @main() and the final
Op::Return becomes ret i32 0.
lli looks for @main to start execution. llc + system linker
build an executable whose platform startup code calls main: dyld on
macOS, _start → libc init → main on Linux.
What's missing
We have no:
- String literals at runtime: no allocation, no managed string. cp-14 introduces a runtime with ml_string_new, ml_print_value, ml_value_t.
- Closures: function values that capture environment. cp-12 introduces them as part of the JIT capstone.
- GC: every allocation in cp-10/11 is leaked or stack-only. cp-14 sketches a mark-sweep collector with a shadow stack.
- Exception model: no invoke, no landingpad.
- TLS, threading primitives, atomics: out of scope.
ABI considerations for cp-11
When cp-11 actually links against LLVM's C++ API, the ABI surface expands:
- Calling conventions: ccc (default), fastcc, coldcc, swiftcc, tailcc, custom numbered ccs. We use ccc (C calling convention) because we link with libc.
- Attributes: noinline, readonly, nounwind, cold, optsize, ... These affect optimisation decisions.
- Target attributes: target-cpu, target-features. The difference between scalar codegen and AVX-512 vectorised codegen.
All of these become accessible through the C++ API as we move to real LLVM integration in cp-11.
cp-11 · LLVM Codegen (the real C++ API)
Goal: emit a verified llvm::Module straight from our TAC IR using IRBuilder, then run the new-pass-manager O2 pipeline (mem2reg, instcombine, GVN, SimplifyCFG, …) over it. Execute with lli, or pipe through llc to produce native object code.
cp-10 wrote LLVM IR as text. cp-11 builds the same IR through the
official C++ API, links against libLLVM*, and lets LLVM run its
production-grade optimisation pipeline on the result.
What changed since cp-10
| Concern | cp-10 | cp-11 |
|---|---|---|
| IR producer | hand-written text emitter | llvm::IRBuilder<> |
| Verifier | none — trusted lli to complain | llvm::verifyModule after build |
| Optimiser | none | PassBuilder::buildPerModuleDefaultPipeline(O2) |
| Globals | string-templated @x = global i64 0 | new llvm::GlobalVariable(...) |
| Format string | hand-emitted [6 x i8] literal | IRBuilder::CreateGlobalString |
Build & run
cmake -S src/cpp -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
./build/tests/test_llvm_codegen # → 25/25 checks passed
# REPL-ish: pipe MiniLang in, get IR out
echo 'fn fib(n){ if(n<2){return n;} return fib(n-1)+fib(n-2);} print fib(20);' \
| ./build/mlcc -O \
| /opt/homebrew/opt/llvm/bin/lli
# 6765
-O runs both our cp-09 TAC passes and LLVM's O2 pipeline.
Layout
src/cpp/
├── CMakeLists.txt # find_package(LLVM); link core/passes/analysis/...
├── src/
│ ├── llvm_codegen.hpp # CodegenResult + build()/optimise()
│ ├── llvm_codegen.cpp # IRBuilder emitter + new-PM pipeline
│ └── main.cpp # `mlcc` CLI
└── tests/test_llvm_codegen.cpp
Reading order
- steps/01-llvm-cpp-api-tour.md
- steps/02-irbuilder.md
- steps/03-verifier.md
- steps/04-new-pass-manager.md
- steps/05-mem2reg-and-O2.md
- steps/06-globals-and-runtime.md
- steps/07-targets-and-llc.md
Step 01 · LLVM C++ API tour
LLVM's C++ API is huge. The codegen path we touch is a thin slice:
LLVMContext ── owns types, constants, metadata. One per thread of work.
│
▼
Module ── translation unit. Holds globals + functions + metadata.
│
▼
Function ── signature + list of BasicBlocks.
│
▼
BasicBlock ── straight-line list of Instructions; ends in a terminator.
│
▼
Instruction ── created via IRBuilder, never `new`.
Value ── base of everything (Constant, Argument, Instruction).
Type ── obtained from the Context (Int64Ty, etc.).
Linking
llvm_map_components_to_libnames(LLVM_LIBS core support irreader passes analysis transformutils scalaropts instcombine)
expands to the static archives we need. With Homebrew LLVM 20 you can
also link the umbrella LLVM library, but listing components keeps the
binary small.
find_package(LLVM REQUIRED CONFIG)
target_include_directories(mllib PUBLIC ${LLVM_INCLUDE_DIRS})
target_compile_definitions(mllib PUBLIC ${LLVM_DEFINITIONS})
target_link_libraries(mllib PUBLIC ${LLVM_LIBS})
Ownership
unique_ptr<LLVMContext> must outlive unique_ptr<Module>. Module
destruction touches its types, which live in the context. Keep both in
the same CodegenResult and let RAII handle order — declared in the
right sequence in llvm_codegen.hpp.
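The ordering rule relies on C++ destroying members in reverse declaration order. The sketch below makes that observable with logging stand-ins. Context, Module and CodegenResultSketch are simplified, hypothetical types, not the LLVM classes or the lab's actual struct.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Stand-ins for llvm::LLVMContext and llvm::Module that log their
// destruction so the order can be observed.
struct Context {
    std::vector<std::string>* log;
    explicit Context(std::vector<std::string>* l) : log(l) { assert(l != nullptr); }
    ~Context() { log->push_back("context"); }
};
struct Module {
    Context* ctx; // a module refers to its context but does not own it
    explicit Module(Context* c) : ctx(c) {}
    ~Module() { ctx->log->push_back("module"); } // touches the context while dying
};

// Members are destroyed in reverse declaration order, so declaring ctx
// first guarantees the module dies while its context is still alive.
struct CodegenResultSketch {
    std::unique_ptr<Context> ctx;
    std::unique_ptr<Module> mod;
};

std::vector<std::string> destructionOrder() {
    std::vector<std::string> log;
    {
        CodegenResultSketch r;
        r.ctx = std::make_unique<Context>(&log);
        r.mod = std::make_unique<Module>(r.ctx.get());
    } // r goes out of scope: mod first, then ctx
    return log;
}
```

Swapping the two member declarations would make the Module destructor read a destroyed Context, which is the bug the declaration order prevents.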
A common pitfall
Forward-declaring llvm::Module in a header and putting
unique_ptr<Module> in a struct breaks because the implicit
destructor needs the complete type. Either include <llvm/IR/Module.h>
in the header, or declare an out-of-line destructor. We chose the
include.
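For reference, the alternative we did not choose looks roughly like this. The sketch is collapsed into one file: Holder and the toy Module are hypothetical, and the comment marks where the header/source split would normally fall.

```cpp
#include <cassert>
#include <memory>

struct Module;                 // forward declaration only, as in a lean header

struct Holder {
    Holder();
    ~Holder();                 // declared here, defined where Module is complete
    std::unique_ptr<Module> mod;
};

// --- normally in the .cpp file, after including the full definition ---
struct Module { int functions = 0; };

Holder::Holder() : mod(new Module) {}
Holder::~Holder() = default;   // unique_ptr<Module>'s deleter is instantiated here

int functionCount(const Holder& h) {
    assert(h.mod != nullptr);
    return h.mod->functions;
}
```

The constructor is out of line for the same reason as the destructor: if it can throw, the compiler may need to destroy already-built members, which again requires the complete type.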
Step 02 · IRBuilder
IRBuilder<> is a cursor: you SetInsertPoint(BB) and then every
Create* call appends one instruction at the cursor.
llvm::IRBuilder<> b(ctx);
auto* bb = llvm::BasicBlock::Create(ctx, "entry", fn);
b.SetInsertPoint(bb);
auto* sum = b.CreateAdd(lhs, rhs, "sum");
b.CreateRet(sum);
What we build
Per function:
- entry block with one alloca i64 per named local/param.
- Spill each parameter into its alloca.
- Pre-create every TAC basic block as a BasicBlock*.
- Branch from entry to the first TAC block.
- Walk each TAC instruction, dispatch on opcode, emit one or more LLVM instructions.
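To show the shape of the entry block without pulling in LLVM, here is a text-emitting sketch of the first, second and fourth steps. cp-11 itself goes through IRBuilder; emitPrologue and its parameters are hypothetical, and every value is assumed to be i64.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Emit the header and entry block of a function in textual LLVM IR:
// one alloca per parameter and named local, parameter spills, then a
// branch to the first TAC block.
std::string emitPrologue(const std::string& name,
                         const std::vector<std::string>& params,
                         const std::vector<std::string>& locals,
                         const std::string& firstBlock) {
    assert(!name.empty() && !firstBlock.empty());
    std::ostringstream os;
    os << "define i64 @" << name << "(";
    for (std::size_t i = 0; i < params.size(); ++i)
        os << (i ? ", " : "") << "i64 %" << params[i];
    os << ") {\nentry:\n";
    for (const auto& p : params) os << "  %" << p << ".addr = alloca i64\n";
    for (const auto& l : locals) os << "  %" << l << ".addr = alloca i64\n";
    for (const auto& p : params)   // spill incoming SSA arguments to memory
        os << "  store i64 %" << p << ", ptr %" << p << ".addr\n";
    os << "  br label %" << firstBlock << "\n";
    return os.str();
}
```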
Mapping opcodes
| TAC | IRBuilder |
|---|---|
| Add/Sub/Mul/Div/Mod | CreateAdd/Sub/Mul/SDiv/SRem |
| And/Or | CreateAnd/Or |
| Eq/Ne/Lt/... | CreateICmpEQ/NE/SLT/... then CreateZExt to i64 |
| Neg | CreateNeg |
| Not | ICmpEQ x, 0 then ZExt |
| Move t,_ | alias in temps map |
| Move name,_ | CreateStore to alloca |
| LoadGlobal/StoreGlobal | CreateLoad/Store on GlobalVariable |
| Print | CreateCall(printf, {fmt, v}) |
| Call f(args) | CreateCall(callee, args) |
| Jump | CreateBr |
| CondJump | ICmpNE x,0 then CreateCondBr |
| Return | CreateRet |
Naming
Anonymous SSA values get %0, %1, … from the printer. Passing a
name string to Create* ("sum") is purely cosmetic — it survives to
the printed IR and is invaluable when reading optimised output.
Step 03 · The Verifier
llvm::verifyModule(module, &os) returns true on failure and
writes a description into os. We call it as the last step of build
and refuse to return a module that does not verify.
What the verifier catches
- Blocks without a terminator (we used to emit a defensive CreateUnreachable for safety).
- Branch targets in a different function.
- Type mismatches (add i64, i32).
- Uses that are not dominated by their definition.
- Improper SSA: multiple definitions of the same %x.
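The first rule in that list is simple enough to sketch on its own. The toy check below works on opcode names only and copies verifyModule's "true means broken" convention; Block, isTerminator and verifyBlocks are hypothetical and cover a small subset of LLVM's terminators.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct Block {
    std::string label;
    std::vector<std::string> opcodes; // e.g. {"load", "add", "br"}
};

bool isTerminator(const std::string& op) {
    return op == "br" || op == "ret" || op == "switch" || op == "unreachable";
}

// Returns true when the blocks are broken and describes each problem in `os`.
bool verifyBlocks(const std::vector<Block>& blocks, std::ostream& os) {
    bool broken = false;
    for (const Block& b : blocks) {
        assert(!b.label.empty());
        if (b.opcodes.empty() || !isTerminator(b.opcodes.back())) {
            os << "block '" << b.label << "' does not end in a terminator\n";
            broken = true;
        }
        // A terminator anywhere but the last slot is also an error.
        for (std::size_t i = 0; i + 1 < b.opcodes.size(); ++i)
            if (isTerminator(b.opcodes[i])) {
                os << "terminator in the middle of block '" << b.label << "'\n";
                broken = true;
            }
    }
    return broken;
}
```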
Why it matters
A module that fails verification will often segfault lli or the
backend instead of producing a clean diagnostic. Verifying up front
turns those into a single string we can surface to the user.
Test hook
auto cg = compile("print 1;");
CHECK(cg.ok); // verifier passed
CHECK(cg.mod != nullptr); // module survived
CHECK_CONTAINS(cg.toText(), "ModuleID");
On failure, CodegenResult::error carries the verifier's report
verbatim, which is invaluable when developing new lowering rules.
Step 04 · The new pass manager
LLVM has two pass managers: the legacy llvm::legacy::PassManager and the
"new" one rooted at llvm::PassBuilder. New code targets the new PM.
llvm::LoopAnalysisManager lam;
llvm::FunctionAnalysisManager fam;
llvm::CGSCCAnalysisManager cam;
llvm::ModuleAnalysisManager mam;
llvm::PassBuilder pb;
pb.registerModuleAnalyses(mam);
pb.registerCGSCCAnalyses(cam);
pb.registerFunctionAnalyses(fam);
pb.registerLoopAnalyses(lam);
pb.crossRegisterProxies(lam, fam, cam, mam);
auto mpm = pb.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O2);
mpm.run(mod, mam);
Anatomy
- Analysis managers cache analyses (dominator tree, loop info, alias analysis, …) per IR unit. They must be registered with each other so a function pass can ask for a module-level analysis.
- PassBuilder assembles a ModulePassManager corresponding to one of the standard -O0/-O1/-O2/-O3/-Os/-Oz pipelines.
- mpm.run(mod, mam) mutates the module in place.
Custom pipelines
You can build your own with addPass(MyPass()). We rely on the canned
O2 pipeline so we inherit decades of tuning.
Where this lives
llvm_codegen.cpp → optimise(Module&). mlcc calls it when given
-O. Tests pass runOpt = true to exercise specific transformations.
Step 05 · mem2reg + the O2 pipeline
Our lowering is deliberately naïve: every named local becomes an
alloca i64 in the entry block, every read is a load, every write
a store. We do not try to construct SSA in the front end.
That's fine, because O2 includes mem2reg (a.k.a. PromoteMemToReg),
which:
- Finds allocas whose only uses are direct loads/stores.
- Replaces them with proper SSA values, inserting Φ-nodes where control flow joins.
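Within a single basic block the promotion needs no Φ-nodes: each load simply becomes the value most recently stored to that slot. The sketch below implements only that straight-line case, so it is store-to-load forwarding rather than full mem2reg. MemOp and forwardStores are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One memory instruction on a promotable slot.
// Store: `value` is written to `slot`. Load: `value` names the load's result.
struct MemOp {
    enum Kind { Store, Load } kind;
    std::string slot;
    std::string value;
};

// Returns, for every load result, the SSA value that can replace it.
// Loads from a slot that was never stored to read "undef".
std::map<std::string, std::string> forwardStores(const std::vector<MemOp>& block) {
    std::map<std::string, std::string> current;  // slot -> value it holds now
    std::map<std::string, std::string> replaced; // load result -> forwarded value
    for (const MemOp& op : block) {
        assert(!op.slot.empty());
        if (op.kind == MemOp::Store) {
            // The stored value may itself be a load we already removed.
            auto it = replaced.find(op.value);
            current[op.slot] = (it != replaced.end()) ? it->second : op.value;
        } else {
            auto it = current.find(op.slot);
            replaced[op.value] = (it != current.end()) ? it->second : "undef";
        }
    }
    return replaced;
}
```

Storing 7 to x, loading it as %0, storing %0 to y and loading y as %1 maps both %0 and %1 to the constant 7. Handling joins across blocks is where the real pass inserts Φ-nodes.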
After mem2reg, downstream passes can do real work:
- instcombine: peephole rewrites.
- gvn: global value numbering removes redundant computations.
- simplifycfg: collapses trivial branches.
- licm: hoists loop invariants.
- loop-unroll, loop-vectorize, slp-vectorize: where profitable.
- globalopt: turns module-private mutable globals into constants when only initialised once.
Observable
let x = 7;
print x;
Pre--O:
@x = global i64 0
…
store i64 7, ptr @x
%0 = load i64, ptr @x
call i32 (ptr, ...) @printf(ptr @.fmt, i64 %0)
Post--O (test test_mem2reg_after_opt_eliminates_allocas asserts this):
call i32 (ptr, ...) @printf(ptr @.fmt, i64 7)
Lesson
Front-ends do not need a smart code generator. Emit straightforward load/store-heavy IR; let mem2reg + the rest of O2 turn it into great machine code. This is the LLVM superpower.
Step 06 · Globals and the print runtime
MiniLang's top-level let bindings become module globals. We
scan every function for LoadGlobal/StoreGlobal to discover the
names, then create one GlobalVariable each:
new llvm::GlobalVariable(
mod, i64, /*isConstant=*/false,
llvm::GlobalValue::ExternalLinkage,
llvm::ConstantInt::get(i64, 0), name);
External linkage keeps the symbol visible in the .o we'd emit with
llc — necessary for any future linker-level integration.
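The discovery scan is a plain walk over every instruction. A sketch with hypothetical stand-ins for the TAC types (Op, Ins, Function), printing the cp-10 style textual globals instead of constructing GlobalVariable objects:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the lab's TAC types.
enum class Op { LoadGlobal, StoreGlobal, Other };
struct Ins { Op op; std::string name; };
using Function = std::vector<Ins>;

// Collect every global name touched by any function, without duplicates.
std::set<std::string> collectGlobals(const std::vector<Function>& fns) {
    std::set<std::string> names;
    for (const Function& fn : fns)
        for (const Ins& ins : fn)
            if (ins.op == Op::LoadGlobal || ins.op == Op::StoreGlobal) {
                assert(!ins.name.empty() && "global access needs a name");
                names.insert(ins.name);
            }
    return names;
}

// One zero-initialised i64 global per discovered name.
std::string emitGlobals(const std::set<std::string>& names) {
    std::string out;
    for (const std::string& n : names) out += "@" + n + " = global i64 0\n";
    return out;
}
```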
printf shim
auto* ft = llvm::FunctionType::get(i32, {i8p}, /*isVarArg=*/true);
printfFn = llvm::Function::Create(ft, llvm::Function::ExternalLinkage, "printf", mod);
fmtStr = b.CreateGlobalString("%lld\n", ".fmt", 0, &mod);
CreateGlobalString returns a pointer (i8*/ptr under opaque
pointers) directly usable as the first argument to printf. Under
LLVM 20 the textual form is ptr @.fmt.
Why printf and not a hand-rolled write(2) loop? Three reasons:
- the C runtime is always available, to a JIT and to the system linker alike;
- %lld is portable and exactly matches our i64;
- it lets lli execute the program with no extra plumbing.
cp-14 will replace this with a proper ml_print(Value) runtime that
understands strings, booleans and closures.
Step 07 · Targets and llc
Our mlcc emits target-independent LLVM IR. To produce a binary:
./build/mlcc -O program.ml > program.ll
/opt/homebrew/opt/llvm/bin/llc -O3 -filetype=obj program.ll -o program.o
clang program.o -o program
./program
llc walks:
LLVM IR
→ SelectionDAG / GlobalISel (instruction selection)
→ MachineInstr (target-specific MI)
→ Register allocation
→ Instruction scheduling
→ Machine code emission (.o)
Going programmatic
Inside C++, you'd:
- InitializeNativeTarget, InitializeNativeTargetAsmPrinter.
- TargetRegistry::lookupTarget(sys::getDefaultTargetTriple(), err).
- target->createTargetMachine(...).
- Set the module's DataLayout and TargetTriple.
- Use a legacy PassManager + addPassesToEmitFile(...) to write a .o.
We stop short of that in cp-11 to keep the lab focused; cp-12 (ORC
JIT) will internalise target initialisation in order to execute IR
without lli.
When to add a custom target
Building a real backend is a full course. For research languages, ride the existing X86/AArch64/RISC-V backends. Only write a target when you ship custom silicon.
cp-12 · ORC JIT — compile and execute in-process
cp-11 produced an optimised llvm::Module. cp-12 hands that module
to LLVM's modern JIT (ORCv2, wrapped behind LLJIT) which
compiles it to native code on the spot, looks up main, and calls it.
Build & run
cmake -S src/cpp -B build
cmake --build build
./build/tests/test_jit # → 17/17 checks passed
echo 'print 2 + 3 * 4;' | ./build/mljit # 14
echo 'print 2 + 3 * 4;' | ./build/mljit --emit-llvm # textual IR instead
./build/mljit -O program.ml # run after O2
Headline test (recursive fib)
fn fib(n){ if (n < 2) { return n; } return fib(n-1) + fib(n-2); }
print fib(20);
JIT-compiled with -O → 6765.
Layout
src/cpp/
├── CMakeLists.txt # links orcjit + executionengine + native + nativecodegen
├── src/jit.hpp / jit.cpp # initNative() + runMain(ctx, module)
├── src/main.cpp # `mljit` CLI
└── tests/test_jit.cpp # 17/17 checks
Reading order
- steps/01-why-jit.md
- steps/02-orc-overview.md
- steps/03-lljit-builder.md
- steps/04-thread-safe-module.md
- steps/05-symbol-lookup-and-call.md
- steps/06-runtime-symbols.md
- steps/07-beyond-lljit.md
Step 01 · Why JIT?
Ahead-of-time (AOT) compilation produces a binary at build time. Just-in-time (JIT) compilation defers code generation to run time:
| | AOT | JIT |
|---|---|---|
| Compile cost paid | once, off-line | at run time, in every process (unless compiled code is cached) |
| Knows actual inputs | no | yes — can specialise on profile data |
| Patching live code | hard | first-class (tiering, deopt) |
| Cold-start latency | tiny | non-trivial |
| Deploy artefact | .exe | the JIT + bytecode/IR |
Real-world JITs: HotSpot (Java), V8 (JS), LuaJIT, Julia, Numba, PyPy, Pharo. They share three ingredients:
- An IR low enough to lower to machine code (LLVM IR, Sea-of-Nodes, etc.).
- A code generator that emits into executable memory.
- A symbol table that lets fresh code call previously-jitted code and runtime helpers.
LLVM gives us all three through ORC.
Step 02 · ORC overview
ORC = On-Request Compilation. ORCv2 is the current modular JIT framework inside LLVM.
ExecutionSession
├── JITDylibs (≈ shared libraries; symbol namespaces)
│ └── MaterializationUnits (lazy producers of symbols)
└── Layers (stack):
ObjectLinkingLayer (loads .o into memory)
↑
IRCompileLayer (Module → .o via TargetMachine)
↑
IRTransformLayer (optional: run passes per-Module)
↑
CompileOnDemandLayer / SpeculativeJIT (lazy & tiering)
Each layer wraps the code it is handed in a MaterializationUnit, which
produces symbols on demand. When lookup("main") runs, ORC chains
downward until the right machine code is in memory.
LLJIT
LLJIT is a turnkey façade that pre-wires those layers with sensible
defaults (one main JITDylib, IR-compile + linking layers, native
target). For a simple compile-and-run scenario you never need to
touch ORC's plumbing directly.
Step 03 · LLJITBuilder & initialisation
llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();
llvm::InitializeNativeTargetAsmParser();
These pull the host backend (X86 / AArch64 / …) into the binary.
Without them LLJITBuilder().create() fails with "No available targets".
We wrap the three calls behind std::call_once so it's safe to call
jit::runMain from anywhere.
Building the JIT
auto jitOrErr = llvm::orc::LLJITBuilder().create();
if (!jitOrErr) /* report toString(takeError()) */;
auto jit = std::move(*jitOrErr);
LLJITBuilder exposes knobs (setNumCompileThreads,
setDataLayout, setObjectLinkingLayerCreator, …) for advanced
setups. We accept defaults.
The DataLayout
LLJIT derives the data layout from the JIT's TargetMachine and sets
it on every module added later. That's why we can build the
Module in cp-11 without specifying a DataLayout up front — LLJIT
patches it before compilation.
Step 04 · ThreadSafeModule
LLVM Module and LLVMContext are not thread-safe. ORC wraps
each (module, context) pair in ThreadSafeModule to enforce a
single-thread access discipline.
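llvm::orc::ThreadSafeModule tsm(std::move(mod), std::move(ctx));
// addIRModule returns an llvm::Error, which must be checked.
if (auto err = jit->addIRModule(std::move(tsm))) /* report toString(std::move(err)) */;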
Notice both unique_ptrs are moved in. After this:
- Your local mod and ctx are empty.
- LLJIT owns the module until it has been materialised; afterwards it drops the IR and keeps only the compiled object.
Why one context per module?
Sharing a context between modules forces a global lock around code generation. Giving each module its own context lets ORC's optional compile-threads work in parallel without contention.
In CodegenResult (cp-11) we already kept ctx and mod as
unique_ptrs in the right destruction order for exactly this
hand-off.
Step 05 · Symbol lookup & calling main
auto sym = jit->lookup("main");
if (!sym) /* surface toString(sym.takeError()) */;
auto fnPtr = sym->toPtr<int(*)()>();
int exitCode = fnPtr();
lookup triggers materialisation: no machine code has been generated
when addIRModule returns; compilation happens the first time a symbol
is asked for. This is the entry point to ORC's laziness.
Type-safe call
toPtr<Fn>() is a templated cast that returns a function pointer of
the requested signature. Get the signature wrong and you'll see UB —
since mlcc always produces define i32 @main(), we ask for
int(*)().
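What toPtr does is easiest to see in miniature: an address is stored as an integer and reinterpreted as a function pointer of whatever signature the caller names. The Addr class below is a hypothetical toy, not ORC's type, and it has the same hazard: nothing checks that the requested signature matches the code at that address.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Toy version of an executor address: an integer that can be turned
// back into a typed function pointer on request.
class Addr {
public:
    template <typename Fn>
    static Addr fromPtr(Fn* fn) {
        return Addr(reinterpret_cast<std::uintptr_t>(fn));
    }

    // The caller chooses the signature; a wrong choice is undefined behaviour.
    template <typename FnPtr>
    FnPtr toPtr() const {
        static_assert(std::is_pointer<FnPtr>::value, "toPtr needs a pointer type");
        assert(value_ != 0 && "null address");
        return reinterpret_cast<FnPtr>(value_);
    }

private:
    explicit Addr(std::uintptr_t v) : value_(v) {}
    std::uintptr_t value_;
};

int fakeMain() { return 7; } // plays the role of JIT-compiled @main
```

Usage mirrors the lookup code above: take the address, ask for int(*)() and call the result.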
Capturing stdout
Tests need to assert on what printf printed. We redirect fd 1 to a
temp file (dup/dup2) around the call, then slurp the file. POSIX-only,
but plenty for cp-12. Production code would either:
- register a custom puts/printf symbol that writes to a buffer, or
- run the JIT in a subprocess and read its stdout.
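A minimal sketch of the temp-file redirection, assuming a POSIX system. captureStdout and the /tmp path template are choices made for this example and may differ from the lab's test helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <unistd.h>

// Run `fn` with fd 1 redirected to a temporary file and return what it wrote.
template <typename Fn>
std::string captureStdout(Fn fn) {
    char path[] = "/tmp/capture-XXXXXX";
    int tmp = mkstemp(path);            // create and open a unique temp file
    assert(tmp >= 0);
    std::fflush(stdout);                // keep earlier buffered output out of the capture
    int saved = dup(1);                 // remember the real stdout
    dup2(tmp, 1);                       // fd 1 now points at the temp file

    fn();
    std::fflush(stdout);                // push buffered printf output to fd 1

    dup2(saved, 1);                     // restore stdout
    close(saved);

    std::string out;
    lseek(tmp, 0, SEEK_SET);            // rewind: fd 1 shared this file offset
    char buf[4096];
    for (ssize_t n; (n = read(tmp, buf, sizeof buf)) > 0;)
        out.append(buf, static_cast<std::size_t>(n));
    close(tmp);
    unlink(path);
    return out;
}
```

The two fflush calls matter: printf buffers in user space, so without them the bytes can reach the wrong file descriptor.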
Step 06 · Runtime symbols (printf and friends)
When our JITed module says call i32 @printf(...), ORC needs an
address for the external symbol. LLJIT adds a
DynamicLibrarySearchGenerator::GetForCurrentProcess(...) to its
main JITDylib by default, which dlsyms into the host process.
Since the test binary is linked against libc, printf resolves
immediately and the JITed call writes to our captured stdout.
Injecting your own runtime functions
// inside jit.cpp, after LLJITBuilder().create()
auto& dylib = jit->getMainJITDylib();
// define() returns an llvm::Error; cantFail aborts if it is a failure.
llvm::cantFail(dylib.define(orc::absoluteSymbols({
{ jit->mangleAndIntern("ml_print"),
{ orc::ExecutorAddr::fromPtr(&ml_print),
JITSymbolFlags::Exported } }
})));
cp-14 will introduce a real ml_print(Value) helper and a tiny
runtime library. Then the front end can stop hard-coding printf and
emit call void @ml_print(%Value) instead — opening the door to
boxed types, GC headers, etc.
Mangling
On macOS, symbols carry a leading underscore (_printf). LLJIT does
the right mangling internally; mangleAndIntern exposes the same
logic when you register a host pointer.
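The interplay of mangling and absolute symbols can be modelled with a map keyed by mangled name. This is a toy: SymbolTable, defineAbsolute and hasRaw are hypothetical names that only mirror the ideas above, and ml_print here is a placeholder for the future runtime helper.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy symbol table keyed by mangled name. `globalPrefix` is '_' for
// Mach-O (macOS) and '\0' (no prefix) for ELF targets such as Linux.
class SymbolTable {
public:
    explicit SymbolTable(char globalPrefix) : prefix_(globalPrefix) {}

    std::string mangle(const std::string& name) const {
        assert(!name.empty());
        return prefix_ ? std::string(1, prefix_) + name : name;
    }
    // Register a host function under its mangled name.
    template <typename Fn>
    void defineAbsolute(const std::string& name, Fn* fn) {
        table_[mangle(name)] = reinterpret_cast<std::uintptr_t>(fn);
    }
    // Look a symbol up by its unmangled name; returns 0 when unknown.
    std::uintptr_t lookup(const std::string& name) const {
        auto it = table_.find(mangle(name));
        return it == table_.end() ? 0 : it->second;
    }
    // True when exactly this (already mangled) key is present.
    bool hasRaw(const std::string& mangled) const {
        return table_.count(mangled) != 0;
    }

private:
    char prefix_;
    std::map<std::string, std::uintptr_t> table_;
};

long long ml_print_calls = 0;
void ml_print(long long) { ++ml_print_calls; } // placeholder runtime helper
```

Because both defineAbsolute and lookup go through mangle, callers never spell the underscore themselves, which is the property mangleAndIntern provides.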
Step 07 · Beyond LLJIT — lazy, tiered, remote
LLJIT covers the common compile-and-run use cases. ORC gives you more when you need it.
Lazy compilation
LLLazyJIT (a sibling of LLJIT) wraps each function in a stub. The
function isn't compiled until the stub is hit, which dramatically
reduces start-up time for big modules where only a fraction of code
runs.
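The stub idea can be sketched without any code generation: a callable that runs a compile step on first use and caches the result. LazyFn is a hypothetical illustration of the control flow only. It is not how LLLazyJIT implements stubs, and it is not thread-safe.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// A lazily "compiled" function: the first call runs the compile step and
// caches the result; later calls go straight to the compiled body.
class LazyFn {
public:
    using Body = std::function<int(int)>;

    explicit LazyFn(std::function<Body()> compile) : compile_(std::move(compile)) {
        assert(compile_ && "a compile step is required");
    }

    int operator()(int arg) {
        if (!body_) {            // stub path: nothing compiled yet
            body_ = compile_();
            ++compiles_;
        }
        return body_(arg);
    }
    int compiles() const { return compiles_; }

private:
    std::function<Body()> compile_;
    Body body_;
    int compiles_ = 0;
};
```

A function that is never called never pays for its compile step, which is the start-up saving described above.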
Tiering
A real engine like V8 starts cold code in an interpreter, then re-compiles hot functions with full optimisations, then patches the call sites. Build this on ORC by:
- Tier-1: compile with OptimizationLevel::O0 immediately.
- Use sampling or counters in the runtime to find hot functions.
- Submit a fresh module for those functions with O3 to a compile-thread.
- Redefine the symbol through ORC/JITLink to atomically swap the entry point.
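The counter-and-swap part of that recipe fits in a few lines once compilation is faked. In the sketch below two ordinary C++ functions stand in for the O0 and O3 builds of the same code, and an atomic function pointer plays the patched entry point. TieredFn and both sum functions are hypothetical.

```cpp
#include <atomic>
#include <cassert>

using Fn = int (*)(int);

// Two implementations of the same function, standing in for tier 1 and tier 2.
int sumTier1(int n) { int s = 0; for (int i = 1; i <= n; ++i) s += i; return s; }
int sumTier2(int n) { return n * (n + 1) / 2; }

struct TieredFn {
    std::atomic<Fn> entry{&sumTier1}; // callers always go through this pointer
    std::atomic<int> calls{0};
    int hotThreshold;

    explicit TieredFn(int threshold) : hotThreshold(threshold) {
        assert(threshold > 0);
    }

    int operator()(int n) {
        // Count the call; on reaching the threshold, publish the faster body.
        if (calls.fetch_add(1) + 1 == hotThreshold)
            entry.store(&sumTier2);   // stand-in for "recompile at O3 and swap"
        return entry.load()(n);
    }
    bool optimised() const { return entry.load() == &sumTier2; }
};
```

In a real engine the store would happen on a compile thread once the optimised code is ready, not inline on the hot call.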
Remote / out-of-process JIT
SimpleRemoteEPC (ORC's remote ExecutorProcessControl, successor to the
removed OrcRemoteTargetClient) lets you compile in one process and
execute in another (or on a different machine). Great for embedded
targets or sandboxing untrusted code.
Caching
ObjectCache plugs into the IR-compile layer to memoise compiled
objects across runs — perfect for shell-style use of the JIT where
the same script is launched repeatedly.
What we won't tackle in this curriculum
The integration with debuggers (GDB/LLDB JIT interface), profilers
(VTune), and deopt/inline-cache machinery (V8/HotSpot-style ICs)
deserve a course of their own. cp-12 leaves you with a working
foundation; adding any one of the above is a focused, incremental
exercise on top of jit::runMain.
cp-13 · MLIR foundations — emit, lower, translate
cp-11/12 used LLVM IR directly. cp-13 takes a step up the abstraction ladder to MLIR (Multi-Level Intermediate Representation): a generic framework for building IRs with first-class regions, blocks, and extensible dialects.
We emit our TAC IR as MLIR text in the llvm dialect (a near-1:1
mapping of LLVM IR into MLIR syntax), then drive Homebrew's
mlir-translate + lli to execute the program.
Why the llvm dialect?
Real MLIR projects build a custom dialect (minilang.*) and lower it
through arith, cf, memref to llvm. That's pedagogically
fantastic but operationally fragile across MLIR versions. cp-13
keeps the toolchain minimal so every test passes out-of-the-box on
LLVM/MLIR 20. The step docs walk through what a full dialect would
look like.
Build & run
cmake -S src/cpp -B build
cmake --build build
./build/tests/test_mlir_emit # → 25/25 checks passed
echo 'print 2+3*4;' | ./build/mlmlir # emit MLIR
echo 'print 2+3*4;' | ./build/mlmlir --run # → 14
Inspect the pipeline by hand:
echo 'print 42;' | ./build/mlmlir > /tmp/m.mlir
/opt/homebrew/opt/llvm/bin/mlir-opt /tmp/m.mlir --canonicalize
/opt/homebrew/opt/llvm/bin/mlir-translate /tmp/m.mlir --mlir-to-llvmir
/opt/homebrew/opt/llvm/bin/mlir-translate /tmp/m.mlir --mlir-to-llvmir \
| /opt/homebrew/opt/llvm/bin/lli
Reading order
- steps/01-why-mlir.md
- steps/02-ir-shape.md
- steps/03-dialects.md
- steps/04-llvm-dialect-mapping.md
- steps/05-pipeline-mlir-translate.md
- steps/06-progressive-lowering.md
- steps/07-when-to-reach-for-mlir.md
Step 01 · Why MLIR?
LLVM IR is a great low-level representation, but it has one level. By the time a high-level construct (a tensor reshape, a SQL plan, a distributed task) becomes LLVM IR, all its structure is gone and domain-specific optimisation is much harder.
MLIR (Multi-Level IR) solves this by letting you define dialects that live alongside one another in the same module. You can write your compiler as a series of dialect-to-dialect lowerings, each step losing only the structure you no longer need.
Origins
Born out of TensorFlow's compiler stack at Google, contributed to the LLVM project. Today MLIR is the foundation of:
- TensorFlow's XLA / IREE
- Mojo (Modular)
- Triton (OpenAI's GPU DSL)
- CIRCT (hardware design)
- Polygeist (C → MLIR)
- Flang (Fortran)
What we gain
- Composability: pick the right abstraction for each pass.
- Reuse: standard dialects (arith, cf, memref, linalg, vector, gpu, …) give you an enormous toolbox.
- A lower bar for custom dialects: ODS (TableGen) generates op classes.
- Common verifier / printer / parser infrastructure.
What we don't tackle in cp-13
Defining a custom dialect in C++ via ODS. That's a multi-day deep dive
and tightly coupled to LLVM/MLIR version specifics. cp-18 leaves a
spec exercise; here we focus on understanding the IR shape and the
lowering toolchain by emitting the llvm dialect.
Step 02 · The shape of MLIR
module {
llvm.func @main() -> i32 {
%0 = llvm.mlir.constant(42 : i64) : i64
%1 = llvm.mlir.addressof @fmt : !llvm.ptr
%2 = llvm.call @printf(%1, %0) vararg(!llvm.func<i32 (ptr, ...)>)
: (!llvm.ptr, i64) -> i32
%3 = llvm.mlir.constant(0 : i32) : i32
llvm.return %3 : i32
}
}
Key concepts:
- Operation: every line is an Operation. The name carries the dialect (llvm., arith., func., scf., ...).
- Region: a list of blocks, enclosed in { ... }. Some ops (scf.for, func.func) have nested regions; that's how MLIR expresses structured control flow.
- Block: a list of operations ending in a terminator. Labels are ^bb0, ^bb1, .... Blocks may take SSA arguments (MLIR's unification of Φ-nodes and parameters).
- Value (%name): SSA result of an op, or a block argument.
- Type (i64, !llvm.ptr, tensor<4xf32>): typed by the dialect; the ! prefix marks a dialect-defined (non-builtin) type.
Implications
- No global symbol table for SSA: value names are scoped to the enclosing region, so separate functions can reuse them.
- In generic form every op states all its operand and result types, so the IR is self-describing and mlir-opt can parse it without the producing dialect's C++ classes (pass --allow-unregistered-dialect when the dialect is not loaded).
- module itself is an op whose region holds the program.
Our emission strategy
Emitter::emitFunction produces a llvm.func with one entry block,
allocas for every named local, then a llvm.br ^bb1 into the first
TAC block. After that each TAC block becomes a ^bbN label and
its instructions translate one-for-one to llvm.* ops.
Step 03 · Dialects worth knowing
A dialect is a namespace of operations + types + attributes. Some upstream ones you'll meet constantly:
| Dialect | Purpose |
|---|---|
| builtin | module (builtin.module); in older MLIR, func.func lived here as builtin.func |
| func | func.func, func.call, func.return |
| arith | Pure integer/float math: arith.addi, arith.cmpi, ... |
| cf | Unstructured control flow: cf.br, cf.cond_br |
| scf | Structured control flow: scf.for, scf.if, scf.while |
| memref | Memory references with shape/layout |
| tensor | Immutable value tensors |
| linalg | High-level array/linear-algebra ops |
| vector | Explicit SIMD vectors |
| affine | Polyhedral loops, ideal for analyses |
| gpu, nvvm, rocdl, spirv | Device backends |
| llvm | Mirror of LLVM IR; the terminal target |
The point: write your compiler as a sequence
mydialect → linalg → memref → scf → cf → llvm, each step removing
abstraction you no longer need.
Defining a dialect (in C++)
You declare ops in TableGen (.td), which mlir-tblgen expands into
C++ classes. A typical workflow:
- MinilangOps.td: declare ops, types, attributes.
- Register the dialect with MLIRContext::loadDialect.
- Implement verify, canonicalize, fold per op.
- Write a MinilangToLLVM conversion pass (mlir::ConversionTarget + RewritePatterns).
cp-18's capstone leaves the dialect implementation as a guided exercise; the heavy lifting is mostly mechanical TableGen + pattern boilerplate.
Step 04 · LLVM-dialect mapping
Our emitter speaks one dialect: llvm. The mapping is mechanical:
| TAC | MLIR llvm |
|---|---|
| Numeric constant | llvm.mlir.constant(N : i64) : i64 |
| Named local x | llvm.alloca in entry block, llvm.load/llvm.store thereafter |
| Add a,b | llvm.add %a, %b : i64 |
| Sub/Mul/Div/Mod | llvm.sub/mul/sdiv/srem |
| And/Or | llvm.and/or |
| Eq/Ne/Lt/... | llvm.icmp "eq"/"ne"/"slt"/... then llvm.zext _ : i1 to i64 |
| Neg | llvm.sub %zero, %a |
| Not | llvm.icmp "eq" %a, %zero then zext |
| LoadGlobal x | llvm.mlir.addressof @x then llvm.load |
| StoreGlobal x, v | llvm.mlir.addressof @x then llvm.store |
| Print v | llvm.call @printf(@fmt, v) vararg(...) : (!llvm.ptr, i64) -> i32 |
| Call f(args) | llvm.call @f(args) : (...) -> i64 |
| Jump bb | llvm.br ^bbN |
| CondJump v, T, F | llvm.icmp "ne" v, 0 then llvm.cond_br ... ^bbT, ^bbF |
| Return v | llvm.return %v : i64 (i32 0 for main) |
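As one worked row of the table, the comparison lowering can be sketched as a text emitter. lowerCompare is hypothetical; the opcode names Le, Gt and Ge are assumed spellings for the comparisons the table abbreviates as "...", and SSA results are numbered from a caller-supplied counter.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Lower a TAC comparison to the llvm dialect: icmp yields an i1, which is
// then widened back to the i64 that every TAC value uses.
std::string lowerCompare(const std::string& tacOp, const std::string& lhs,
                         const std::string& rhs, int& nextId) {
    static const std::map<std::string, std::string> predicate = {
        {"Eq", "eq"},  {"Ne", "ne"},  {"Lt", "slt"},
        {"Le", "sle"}, {"Gt", "sgt"}, {"Ge", "sge"}};
    auto it = predicate.find(tacOp);
    assert(it != predicate.end() && "not a comparison opcode");

    std::string bit = "%" + std::to_string(nextId++);  // i1 result of icmp
    std::string wide = "%" + std::to_string(nextId++); // i64 result of zext
    std::ostringstream os;
    os << "    " << bit << " = llvm.icmp \"" << it->second << "\" " << lhs
       << ", " << rhs << " : i64\n";
    os << "    " << wide << " = llvm.zext " << bit << " : i1 to i64\n";
    return os.str();
}
```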
Globals
llvm.mlir.global internal @x(0 : i64) : i64
internal linkage; initial value 0. Stores at main-time install the
user's initialiser. llvm.mlir.addressof @x reifies the global as a
!llvm.ptr value.
printf
llvm.mlir.global internal constant @fmt("%lld\0A\00") {addr_space = 0 : i32}
llvm.func @printf(!llvm.ptr, ...) -> i32
Variadic call sites must spell their vararg signature:
%r = llvm.call @printf(%f, %v) vararg(!llvm.func<i32 (ptr, ...)>)
: (!llvm.ptr, i64) -> i32
That's MLIR's way of preserving variadic information that LLVM's
function type would otherwise carry as (...).
Step 05 · The mlir-translate / lli pipeline
mlir-translate is MLIR's bridge to external IRs. Its
--mlir-to-llvmir mode walks an MLIR module that's already in the
llvm dialect and produces textual LLVM IR. From there lli or
llc take over.
MiniLang source
│ Lexer → Parser → Resolver → TypeChecker → IR builder
▼
MiniLang TAC IR
│ mlir_emit::emit
▼
MLIR (`llvm` dialect)
│ mlir-translate --mlir-to-llvmir
▼
LLVM IR (text)
│ lli (or llc -filetype=obj + ld)
▼
Process exit code + stdout
Implementing the pipeline in C++
runShell (in mlir_emit.cpp) forks /bin/sh -c "mlir-translate ... | lli"
with pipes attached. Robust enough for tests; production tooling would
likely link MLIR's own translation library instead.
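A simplified sketch of such a helper, assuming POSIX. It uses popen, which runs the command through /bin/sh, instead of a hand-written fork with pipes, and it captures only stdout, so it is not a drop-in for the lab's runShell. ShellResult and runShellSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <sys/wait.h>

struct ShellResult {
    int exitCode;    // -1 when the command could not be run or was killed
    std::string out; // everything the pipeline wrote to stdout
};

// Run a command line through the shell and capture its stdout.
ShellResult runShellSketch(const std::string& cmd) {
    assert(!cmd.empty());
    ShellResult r{-1, ""};
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) return r;
    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, pipe)) > 0)
        r.out.append(buf, n);
    int status = pclose(pipe);  // waits for the shell and returns its wait status
    if (status != -1 && WIFEXITED(status)) r.exitCode = WEXITSTATUS(status);
    return r;
}
```

For a pipeline the shell reports the exit status of the last command, so a failure in an earlier stage shows up only indirectly.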
Catching errors
If mlir-translate rejects our IR (wrong type, missing op), the pipe
breaks and lli exits non-zero. We capture stderr from the child
into PipelineResult::error so test failures point at the offending
stage.
Inspecting intermediate stages
echo 'print 42;' | ./build/mlmlir | tee /tmp/m.mlir
/opt/homebrew/opt/llvm/bin/mlir-opt /tmp/m.mlir --canonicalize # pretty + sanity-check
/opt/homebrew/opt/llvm/bin/mlir-translate /tmp/m.mlir --mlir-to-llvmir
/opt/homebrew/opt/llvm/bin/mlir-translate /tmp/m.mlir --mlir-to-llvmir \
| /opt/homebrew/opt/llvm/bin/llc -O2 -filetype=obj -o /tmp/m.o
Step 06 · Progressive lowering — what a "real" pipeline looks like
Emitting llvm dialect directly skips MLIR's superpower:
progressive lowering. Here's the shape of a fuller pipeline you
would build once you have a custom dialect.
A hypothetical minilang dialect
module {
minilang.func @add(%a: !minilang.value, %b: !minilang.value) -> !minilang.value {
%r = minilang.add %a, %b : !minilang.value
minilang.return %r : !minilang.value
}
minilang.func @main() {
%a = minilang.const #minilang.num<40> : !minilang.value
%b = minilang.const #minilang.num<2> : !minilang.value
%c = minilang.call @add(%a, %b) : (!minilang.value, !minilang.value) -> !minilang.value
minilang.print %c : !minilang.value
minilang.return
}
}
The !minilang.value type encodes our boxed runtime value (Nil / Bool /
Number / Str / Fn).
Lowering passes
# The minilang passes are hypothetical:
#   --minilang-specialise-numeric   unbox numeric ops where types prove safe
#   --convert-minilang-to-func      minilang.func/call → func dialect
#   --convert-minilang-to-arith     numeric ops → arith
#   --convert-minilang-to-cf        control flow lowered
#   --convert-minilang-to-memref    boxed values → struct in memref
mlir-opt minilang.mlir \
  --minilang-specialise-numeric \
  --convert-minilang-to-func \
  --convert-minilang-to-arith \
  --convert-minilang-to-cf \
  --convert-minilang-to-memref \
  --convert-arith-to-llvm \
  --convert-cf-to-llvm \
  --finalize-memref-to-llvm \
  --convert-func-to-llvm \
  --reconcile-unrealized-casts \
  | mlir-translate --mlir-to-llvmir \
  | lli
Each --convert-* is a RewritePattern set authored once and reused
across every MiniLang program. That's the value MLIR offers: a
ready-made pattern infrastructure plus a dozen battle-tested target
dialects.
Domain-specific optimisation
Before lowering to arith, the minilang dialect can run high-level
passes: type-specialise polymorphic numeric ops, sink GC barriers,
inline closures whose upvalues are constant. Those are nearly
impossible to do once everything is i64s and pointers.
What we lose by going straight to llvm
- No structured loops (scf.for), so loop transformations like --affine-loop-unroll and --scf-parallel-loop-fusion are out.
- No higher-level type info: every value is i64.
- No room for domain-specific peepholes.
For a numeric scripting language those losses are small; for ML or DSP workloads they're enormous.
Step 07 · When to reach for MLIR
MLIR is heavyweight. Reach for it when at least one of these is true:
- Multiple abstraction layers are useful. A SQL planner, a tensor compiler, an HDL toolchain, a DSL with high-level semantic passes — anywhere you want to optimise before losing structure.
- You'll write many compilers, sharing infrastructure. A team building five DSLs benefits from a shared dialect ecosystem.
- You need polyhedral / loop-nest transforms: the affine and linalg dialects have no LLVM-only equivalent.
- You target heterogeneous hardware (GPU, TPU, FPGA). The gpu, nvvm, rocdl, and spirv dialects, plus the CIRCT project's hardware dialects, let you keep one front-end.
When not to use MLIR:
- Simple scripting language → LLVM directly (cp-11/12 path).
- Single-target compiler with no high-level structure to preserve.
- Time/iteration is short: MLIR's per-dialect ceremony is a real cost.
What we'd build next in this curriculum
- cp-18 (capstone) sketches a minilang.* dialect with one custom high-level pass (escape analysis to elide allocations) plus a manual conversion to the llvm dialect. The infrastructure cost is large; the educational payoff is understanding the one-source/many-targets pattern that MLIR enables.
Resources
- The MLIR Toy tutorial (chapters 1–7) is the canonical hands-on introduction — it builds a tiny tensor language end-to-end.
- "The Architecture of Open Source Applications" entry on MLIR.
- The MLIR Open Meeting talks (mlir.llvm.org), especially the dialect spotlights.
cp-14 · Runtime Systems
cp-01..cp-13 produce code. cp-14 builds the runtime that the code
calls into: a tagged value representation, a mark-sweep garbage
collector, and high-level operations (add, print, array indexing).
Build & run
cmake -S src/cpp -B build
cmake --build build
./build/tests/test_runtime # → 36/36 checks passed
./build/mlrt-demo # → hello, world [0, 1, 4] ...
What lives here
src/cpp/src/
value.hpp — tagged 64-bit Value
heap.hpp/.cpp — Object header + mark-sweep GC
runtime.hpp/.cpp — add / sub / mul / print / array ops
main.cpp — demo binary
tests/
test_runtime.cpp — 14 tests, 36 checks
The runtime is standalone — no LLVM, no IR. cp-17's capstone wires it into the JIT from cp-12.
Reading order
- steps/01-the-runtime-layer.md
- steps/02-value-representation.md
- steps/03-object-layout.md
- steps/04-mark-sweep-gc.md
- steps/05-roots-and-safepoints.md
- steps/06-allocation-strategies.md
- steps/07-beyond-mark-sweep.md
Step 01 · The runtime layer
A compiler produces code, but that code runs on top of services provided by the runtime:
- Allocate heap objects
- Move/free objects (garbage collection)
- Construct boxed values (strings, arrays, closures)
- Format values for print
- Raise structured errors (exceptions)
- Provide builtin functions and FFI bridges
These services live in a small library that the codegen calls into. Three places they show up:
1. Compiler emits calls. ml_alloc_string becomes an external symbol; the linker (or JIT) resolves it to a runtime function.
2. Compiler emits inline code that uses the runtime's invariants: reading the tag bits of a Value, indexing into an Object header, etc.
3. Compiler emits metadata for the runtime to consume: stack maps that tell the GC where pointers live in each frame, unwind tables for exceptions, debug info for backtraces.
This lab implements (1) and (2). (3) — emitting stack maps — is the hard, language-specific work that production runtimes invest heavily in; we approximate it with an explicit root API that the host program pushes Values onto.
Why a tagged Value?
MiniLang is dynamically typed. Every variable can hold a number,
string, bool, nil, function, or array. The simplest representation is
a struct — { tag, payload } — but that's at least 16 bytes per
slot. A tagged 64-bit value gets us:
- Pointer-sized (fits in registers)
- Fast bit-test type checks
- 63-bit fixnums with no boxing
- Compatible with calling conventions for free
The trade-off is restricted integer range and a few bits of mental overhead — a worthwhile bargain for a scripting language.
Step 02 · Value representation
Our encoding (defined in value.hpp):
bit 63 ........................... 3 2 1 0
[ 63-bit signed int ] 1 → fixnum
[ 0 ][ 0 1 0 ] → nil
[ 0 ][ 1 1 0 ] → true
[ 0 ][ 1 0 1 0] → false
[ pointer to Object, aligned to 8 ] 000 → heap object
How tests look
bool isFixnum() const { return raw & 1; }
bool isObject() const { return (raw & 0b111) == 0 && raw != 0; }
Each is a single and + compare, branch-prediction friendly.
Encoding fixnums
static Value Fixnum(int64_t v) {
    // Shift as unsigned: left-shifting a negative signed value is UB before C++20.
    return {(static_cast<uint64_t>(v) << 1) | 1};
}
int64_t asFixnum() const { return ((int64_t)raw) >> 1; }
We lose one bit of range. For MiniLang's scripting niche, ±2⁶² is
plenty. Real languages that need full 64-bit integers either box big
numbers (CPython, OCaml Int64.t) or use NaN-boxing (described
below).
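The encoding table and the two snippets above can be assembled into one small type. This is a sketch under the tag values shown in the diagram; the constant names (kNil, kTrue, kFalse) and the helpers beyond isFixnum, isObject, Fixnum and asFixnum are assumptions, and value.hpp may differ.

```cpp
#include <cassert>
#include <cstdint>

struct Object;  // opaque here; only the address matters

struct Value {
    std::uint64_t raw;

    // Tag patterns taken from the encoding diagram (names are illustrative).
    static constexpr std::uint64_t kNil = 0b010, kTrue = 0b110, kFalse = 0b1010;

    static Value Nil() { return {kNil}; }
    static Value Bool(bool b) { return {b ? kTrue : kFalse}; }
    // Shift in the unsigned domain so negative inputs stay well-defined.
    static Value Fixnum(std::int64_t v) {
        return {(static_cast<std::uint64_t>(v) << 1) | 1};
    }
    static Value Obj(Object* p) {
        return {static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p))};
    }

    bool isFixnum() const { return (raw & 1) != 0; }
    bool isNil() const { return raw == kNil; }
    bool isBool() const { return raw == kTrue || raw == kFalse; }
    bool isObject() const { return (raw & 0b111) == 0 && raw != 0; }

    // Arithmetic right shift restores the sign (guaranteed from C++20,
    // and what mainstream compilers do before that).
    std::int64_t asFixnum() const { return static_cast<std::int64_t>(raw) >> 1; }
    bool asBool() const { return raw == kTrue; }
    Object* asObject() const {
        return reinterpret_cast<Object*>(static_cast<std::uintptr_t>(raw));
    }
};
static_assert(sizeof(Value) == 8, "a Value must stay register-sized");
```

Because every predicate inspects only raw, a JIT can inline the same checks as a mask and compare without calling into the runtime.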
NaN-boxing — the alternative
IEEE-754 quiet NaNs leave 51 mantissa bits free as payload, enough to encode a pointer + a small tag. SpiderMonkey, JSC, and LuaJIT use variants of this trick:
double: [sign 1][exponent 11][mantissa 52]
A double is a quiet NaN iff exponent = all 1s AND mantissa MSB = 1.
We hide a 48-bit pointer + 3 tag bits (51 bits of payload) in there.
Pros: full IEEE doubles fly without boxing — huge for numeric code. Cons: bit-twiddling is finicky, hostile to debuggers, doesn't play nicely with sanitisers.
Our scheme (low-bit tagging) is simpler and integer-friendly; we'd swap to NaN-boxing only if floats became a major workload.
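For comparison, here is a rough sketch of what a NaN-boxed value could look like. The layout is an illustrative choice, not one used by this lab or copied from any engine: a pattern with the sign, exponent and quiet bits all set is read as a pointer, bits 48 to 50 are left free for a type tag, and the low 48 bits hold the address. It assumes user-space addresses fit in 48 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(double) == 8, "sketch assumes 64-bit IEEE-754 doubles");

constexpr std::uint64_t kBoxTag   = 0xFFF8000000000000ull;  // sign + exponent + quiet bit
constexpr std::uint64_t kAddrMask = 0x0000FFFFFFFFFFFFull;  // low 48 bits
constexpr std::uint64_t kCanonNaN = 0x7FF8000000000000ull;  // the only NaN stored as a double

struct NanBox {
    std::uint64_t bits;

    static NanBox fromDouble(double d) {
        NanBox b{};
        // Canonicalise NaNs so an arithmetic NaN can never look like a boxed pointer.
        if (d != d) { b.bits = kCanonNaN; return b; }
        std::memcpy(&b.bits, &d, sizeof d);  // bit copy, no aliasing violation
        return b;
    }
    static NanBox fromPointer(void* p) {
        const auto addr = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
        assert((addr & ~kAddrMask) == 0);  // assumption: 48-bit addresses
        return NanBox{kBoxTag | addr};
    }

    bool isPointer() const { return (bits & kBoxTag) == kBoxTag; }
    bool isDouble() const { return !isPointer(); }
    double asDouble() const { double d; std::memcpy(&d, &bits, sizeof d); return d; }
    void* asPointer() const {
        return reinterpret_cast<void*>(static_cast<std::uintptr_t>(bits & kAddrMask));
    }
};
```

The canonicalisation step is the finicky part the cons list alludes to: skip it and a NaN produced by ordinary arithmetic could be mistaken for a pointer.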
Why aligned-to-8 pointers
std::malloc already returns 8-aligned blocks on every mainstream
platform. We additionally round our allocation sizes up to 8 so
the next object also lands aligned. The low 3 bits of any Object*
we hand out are guaranteed zero → safe to overlay tags.
Step 03 · Object layout
Every heap object starts with the same header:
struct Object {
ObjKind kind; // 1 byte: String, Array, ...
uint8_t marked; // 1 byte: GC bookkeeping
uint16_t _pad; // 2 bytes
uint32_t size; // 4 bytes: total size including header
Object* next; // 8 bytes: intrusive linked list
};
16 bytes of overhead per object. We trade a few bytes for:
- A uniform header the GC can inspect blind.
- An intrusive object table: no separate metadata structure to keep in sync. The sweep phase just walks head_ → next → next → ….
- Inline size so sweep knows how much memory each object holds.
StringObj
struct StringObj : Object {
uint32_t len;
char data[1]; // flexible array
};
Allocation: sizeof(StringObj) + len. We declare data[1] so the
struct is well-formed even at length 0; the real size is computed in
newString. We always NUL-terminate for cheap interop with C
printers.
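The allocation rule above can be sketched as follows. newStringSketch and alignUp8 are hypothetical stand-ins for the lab's newString: the real function would also link the object into the heap's list and update the GC byte counter. The sketch relies on the same trailing-array idiom as the struct above, which mainstream compilers accept but the C++ standard does not strictly bless.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

enum class ObjKind : std::uint8_t { String, Array };

struct Object {
    ObjKind kind;
    std::uint8_t marked;
    std::uint16_t _pad;
    std::uint32_t size;  // total bytes, header included
    Object* next;
};
static_assert(sizeof(Object) == 16, "sketch assumes a 64-bit target");

struct StringObj : Object {
    std::uint32_t len;
    char data[1];  // really len + 1 bytes, allocated past the struct
};

// Round up to a multiple of 8 so the next object's low 3 bits stay free for tags.
inline std::size_t alignUp8(std::size_t n) { return (n + 7) & ~std::size_t{7}; }

inline StringObj* newStringSketch(const char* s, std::uint32_t len) {
    // data[1] already pays for the NUL terminator, so "+ len" is enough.
    const std::size_t bytes = alignUp8(sizeof(StringObj) + len);
    auto* o = static_cast<StringObj*>(std::malloc(bytes));
    if (!o) return nullptr;
    o->kind = ObjKind::String;
    o->marked = 0;
    o->_pad = 0;
    o->size = static_cast<std::uint32_t>(bytes);
    o->next = nullptr;
    o->len = len;
    std::memcpy(o->data, s, len);
    o->data[len] = '\0';  // cheap interop with C printers
    return o;
}
```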
ArrayObj
struct ArrayObj : Object {
uint32_t len;
Value elems[1];
};
Inline storage of Values (tagged 64-bit words). The GC iterates elems[0..len)
during mark; no separate "type descriptor" is needed because the kind
field tells it the layout.
Forwarding the design
For closures we'd add:
struct ClosureObj : Object {
uint32_t numUpvalues;
uint64_t funcId; // index into the JIT'd function table
Value upvalues[1]; // captured environment
};
The mark routine grows a switch:
switch (o->kind) {
case ObjKind::Array: for each elem mark(elem); break;
case ObjKind::Closure: for each upvalue mark(upval); break;
case ObjKind::String: /* no pointers */ break;
}
This is the canonical "GC tracing per kind" pattern. Variations:
embed pointer offsets directly in the header, or rely on a
type-descriptor pointer to call a virtual trace method.
Step 04 · Mark-sweep GC
Algorithm (in heap.cpp):
collect():
for each root slot:
mark(*slot)
sweep:
for each object in object table:
if marked: clear mark
else: unlink + free
mark is the classic tri-colour (here: just two-colour) traversal:
void markObject(Object* o) {
if (!o || o->marked) return; // already grey/black
o->marked = 1; // turn black
if (o->kind == ObjKind::Array)
for (each elem) mark(elem); // grey-ify children
}
Recursion-depth concern: a long array chain can blow the C stack. In
production, switch to an explicit work queue (std::vector<Object*>)
and pop until empty. Not needed for our tests but worth knowing.
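The work-queue variant mentioned above might look like the following. Node is a simplified stand-in for the lab's Object (a mark bit plus outgoing edges), so this shows the shape of the traversal, not heap.cpp itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for an array object: a mark bit plus outgoing references.
struct Node {
    bool marked = false;
    std::vector<Node*> children;
};

// Iterative mark. The vector plays the role of the grey set, so traversal depth
// is limited by heap memory instead of the C stack.
inline void markFrom(Node* root) {
    std::vector<Node*> work;
    if (root && !root->marked) {
        root->marked = true;
        work.push_back(root);
    }
    while (!work.empty()) {
        Node* n = work.back();
        work.pop_back();
        for (Node* c : n->children) {
            if (c && !c->marked) {  // mark on push so nothing is queued twice
                c->marked = true;
                work.push_back(c);
            }
        }
    }
}
```

Marking at push time keeps the queue bounded by the number of objects, even when the graph has cycles or many shared children.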
Why mark-sweep?
Pros:
- Simplest correct GC. Trivial to debug — print the object table before and after, diff the survivors.
- Doesn't move objects. Pointers from the C/C++ host remain valid across collections. This matters when a JIT'd function takes a raw char* from a string.
- Tolerates ambiguous roots. Even if a root only might be a pointer (conservative scanning), a false positive just keeps a dead object alive longer; it never frees a live one.
Cons:
- Fragmentation. Repeated allocate/free leaves holes that a bump allocator can't reuse. We'd need a free list (next step upward) or compaction.
- Pause time scales with live set + dead set. Sweep is O(total objects), even if very few survived.
- Cache-unfriendly object table walks. Each pointer chase costs a miss.
Triggering policy
maybeCollect() runs on every allocation:
if (allocatedBytes_ >= gcThreshold_) collect();
// after collect:
if (allocatedBytes_ * 2 > gcThreshold_) gcThreshold_ = allocatedBytes_ * 2;
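Doubling the threshold based on the post-collect live set is a classic way to keep amortised GC cost bounded. If 1 KiB survives, we let the heap grow to 2 KiB before collecting again, so each collection (whose cost is proportional to the heap size) is paid for by at least as many bytes of fresh allocation: O(1) amortised GC work per allocated byte.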
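The policy is small enough to isolate and test on its own. In this sketch GcPacer, onAlloc and afterCollect are hypothetical names, and the 1024-byte starting threshold is an assumption; only the two comparisons come from the snippet above.

```cpp
#include <cassert>
#include <cstddef>

struct GcPacer {
    std::size_t allocatedBytes = 0;
    std::size_t gcThreshold = 1024;  // assumed starting value

    // Called on every allocation; returns true when a collection is due.
    bool onAlloc(std::size_t bytes) {
        allocatedBytes += bytes;
        return allocatedBytes >= gcThreshold;
    }

    // Called after sweep with the number of bytes that survived.
    void afterCollect(std::size_t liveBytes) {
        allocatedBytes = liveBytes;
        if (allocatedBytes * 2 > gcThreshold) gcThreshold = allocatedBytes * 2;
    }
};
```

Note that the threshold only ever grows: a collection that frees almost everything leaves it where it was, which trades some memory for fewer collections.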
Step 05 · Roots and safepoints
GC needs to know which Values are live. The mark phase starts from the root set; anything not reachable from a root is garbage.
Our root set is explicit:
class Heap {
public:
    void pushRoot(Value* slot);
    void popRoot();
private:
    std::vector<Value*> roots_;
};
class RootScope {
public:
    RootScope(Heap& h, Value* slot) : h_(h) { h_.pushRoot(slot); }
    ~RootScope() { h_.popRoot(); }
private:
    Heap& h_;   // kept so the destructor can pop the root
};
Callers root every local that could hold an object before any allocation:
Value greet = makeString(h, "hello, ");
RootScope rg(h, &greet); // greet is now safe across GC
Value who = makeString(h, "world"); // GC may run here
RootScope rw(h, &who);
Value full = add(greet, who, h);
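To see rooting and collection interact, here is a self-contained miniature. It is not the lab's Heap: Obj carries a single child pointer instead of tagged Values, and MiniHeap is a hypothetical name, but the root stack, intrusive object list, and mark-then-sweep flow follow the same pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Obj {
    bool marked = false;
    Obj* child = nullptr;  // single outgoing reference, enough for the demo
    Obj* next = nullptr;   // intrusive "all objects" list
};

class MiniHeap {
public:
    ~MiniHeap() { roots_.clear(); collect(); }  // no roots left, so everything is freed

    Obj* alloc() {
        Obj* o = new Obj;
        o->next = head_;
        head_ = o;
        ++count_;
        return o;
    }
    void pushRoot(Obj** slot) { roots_.push_back(slot); }
    void popRoot() { roots_.pop_back(); }
    std::size_t count() const { return count_; }

    void collect() {
        // Mark: follow each root's child chain, stopping at already-marked objects.
        for (Obj** slot : roots_)
            for (Obj* o = *slot; o && !o->marked; o = o->child) o->marked = true;
        // Sweep: unlink and free unmarked objects, clear the mark on survivors.
        Obj** link = &head_;
        while (Obj* o = *link) {
            if (o->marked) { o->marked = false; link = &o->next; }
            else { *link = o->next; delete o; --count_; }
        }
    }

private:
    Obj* head_ = nullptr;
    std::size_t count_ = 0;
    std::vector<Obj**> roots_;
};

class RootScope {
public:
    RootScope(MiniHeap& h, Obj** slot) : h_(h) { h_.pushRoot(slot); }
    ~RootScope() { h_.popRoot(); }
    RootScope(const RootScope&) = delete;
    RootScope& operator=(const RootScope&) = delete;
private:
    MiniHeap& h_;
};
```

Rooting the slot (Obj**) instead of the object means the collector always sees the local's current value, even if it is reassigned after the RootScope is created.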
This is awkward and easy to forget. Production runtimes do better:
Conservative stack scanning
Scan every byte of the C stack as if each pointer-aligned word might be a pointer; if it points into a known heap object, treat it as a root. The Boehm GC works this way.
- Pros: no rooting boilerplate; works with arbitrary C/C++.
- Cons: false positives keep dead objects alive; can't relocate objects (so no copying / compacting GC).
Precise stack maps
The compiler emits, per call site, a map of which stack slots and registers contain pointers. The GC walks the stack frame by frame, asks the map "what's live here?", and traces those slots.
- Pros: precise, supports relocating GC.
- Cons: codegen complexity, every safepoint inhibits some optimisations.
Safepoints
A safepoint is a point in code where the runtime guarantees the stack is in a known state — typically function entry and loop backedges. The compiler inserts a poll:
if (gc_request) suspend();
When the GC wants to run, it sets gc_request, then waits for all
mutator threads to reach a safepoint. This bounds GC latency to
roughly the loop-iteration time.
cp-14 has only one thread and no JIT integration, so we just call
collect() from alloc synchronously — the entire VM is a "GC
safepoint" by construction. cp-17's capstone wires precise stack maps
into the cp-12 JIT.
Step 06 · Allocation strategies
Heap::alloc is a thin wrapper over std::malloc. That's the
simplest correct allocator, but real runtimes layer specialised ones
on top because the alloc fast-path runs every few instructions.
Bump allocation
arena: [................................]
^ free
free += bytes
- Cost: one add, one compare. ~5 cycles.
- Used by: every copying GC (because the from-space is wiped each collection, the bump pointer resets).
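A bump arena fits in a few lines. This is a standalone sketch with a hypothetical BumpArena class; cp-14's Heap::alloc does not work this way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class BumpArena {
public:
    // Backing storage is 64-bit words so the base address is 8-aligned.
    explicit BumpArena(std::size_t bytes) : words_((bytes + 7) / 8), free_(0) {}

    // Returns nullptr when the arena is exhausted.
    void* alloc(std::size_t bytes) {
        bytes = (bytes + 7) & ~std::size_t{7};           // keep every block 8-aligned
        if (bytes > capacity() - free_) return nullptr;  // the compare
        void* p = reinterpret_cast<unsigned char*>(words_.data()) + free_;
        free_ += bytes;                                  // the add
        return p;
    }

    void reset() { free_ = 0; }  // what a copying GC does after evacuating survivors
    std::size_t used() const { return free_; }
    std::size_t capacity() const { return words_.size() * 8; }

private:
    std::vector<std::uint64_t> words_;
    std::size_t free_;
};
```

There is no per-object free: memory comes back only through reset, which is why bump allocation pairs naturally with collectors that empty a whole space at once.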
Free lists
Maintain per-size buckets of freed objects:
size 16: → block → block → block → ...
size 32: → block → ...
Allocation: pop head of the right bucket. Sweep: push freed objects onto the right bucket. Pros: reuses fragmented space. Cons: slower than bump; harder to reason about live size.
Thread-Local Allocation Buffers (TLABs)
In a multi-threaded VM, every thread reserves a chunk of the global heap and bump-allocates inside it. No atomics on the fast path.
Thread 1 TLAB: [................xxxxxxxx] xxxx = live
Thread 2 TLAB: [............xxxxxxxxxxx]
Global heap: [TLAB 1 ][TLAB 2 ][...]
When a thread's TLAB is exhausted it locks the global heap, claims a new one, and continues. The lock-free fast path is critical to multi-threaded performance.
Generational allocation
nursery (small, fast): bump alloc → frequent minor GC
tenured (large, slow): mark-sweep or mark-compact → rare major GC
Most objects die young → minor GC is fast and cheap. Survivors get promoted to the tenured space.
This requires a write barrier: when an old object's field is set to a young object, record the cross-generation pointer so the minor GC can find young roots without scanning the entire old generation.
inline void writeBarrier(Object* parent, Value child) {
if (child.isObject() && parent->isOld() && child.asObject()->isYoung())
rememberedSet.add(parent);
}
cp-14's runtime doesn't implement any of these — but the linked-list
object table and explicit roots make it easy to swap in any of them
later. The right abstraction is "all alloc paths go through one
function" — which we have.
Step 07 · Beyond mark-sweep
The collector zoo, roughly in order of complexity:
Reference counting
Every object keeps a counter; increment on assign, decrement on overwrite/scope exit. Free when count hits 0.
- Pros: deterministic — finalisers run promptly. Memory footprint predictable.
- Cons: cycles leak (need a backup tracing GC). Atomic refcounts in multi-threaded code are expensive.
- Users: CPython (with cycle-detector backup), Swift, Rust's Rc<T>.
Mark-sweep (cp-14)
What we built. Simple, non-relocating, fragmenting.
Mark-compact
After mark, slide survivors to one end of the heap (or use a forwarding-pointer Cheney pass). Eliminates fragmentation. Pointers need updating — easier with precise stack maps; impossible with conservative scanning.
Copying (Cheney)
Two equally-sized spaces: from-space and to-space. Allocation bump-allocates in from-space. On GC, walk roots and copy live objects into to-space, leaving forwarding pointers in from-space. Flip spaces. From-space is now garbage; bump pointer resets.
- Pros: O(live) work, no fragmentation, blazing-fast alloc.
- Cons: half the heap is always wasted; relocates objects.
- Users: most young generations (because young set is small).
Generational
Combine: copying for the nursery, mark-compact for the tenured. The HotSpot, V8, and SpiderMonkey GCs are all generational variants of this idea.
Concurrent / incremental
Run mark and/or sweep on a separate thread concurrently with the mutator. Needs:
- Write barriers (Yuasa/Dijkstra) so the marker doesn't lose objects that change mid-flight.
- Read barriers for relocating concurrent collectors (Shenandoah, ZGC).
Collectors such as ZGC and Shenandoah now target pauses in the low-millisecond range or below on multi-GB heaps.
Region-based / arena
Allocate from a context-bound arena; free the entire arena at once. No collector needed: the runtime tracks one region per request/task. Used by Apache (APR memory pools), Zig's arena allocator, Rust arena crates, and request-scoped server runtimes.
What to choose
- Educational implementation: mark-sweep (cp-14).
- Embeddable, single-threaded: mark-sweep or refcount (CPython).
- Server, GB-scale heap, low-pause: concurrent generational (HotSpot G1, ZGC).
- Soft-realtime / latency-critical: region-based or Shenandoah/ZGC.
- Embedded / no malloc: arena + free list.
Stack maps and codegen integration
cp-17's capstone wires this runtime into the cp-12 JIT. The minimal
addition: a stack-map descriptor for every call site, plus a "safepoint
poll" inserted into loop headers. We do that on top of LLVM's existing
GC support (the @llvm.gcroot intrinsic or the statepoint intrinsics)
rather than reinventing the metadata format.
cp-15 · Tooling and Diagnostics
A compiler is only as approachable as its error messages. cp-15
builds the tooling layer that wraps the language: a unified
minilang CLI (run, fmt, ast, repl), rustc-style structured
diagnostics with carets, source-span tracking, and a REPL with
multi-line input.
Build & run
cmake -S src/cpp -B build
cmake --build build
./build/tests/test_tooling # → 46/46 checks passed
echo 'let x = 1+2*3; print x;' | ./build/minilang run - # → 7
echo 'let x= 1 +2; print x ;' | ./build/minilang fmt -
./build/minilang repl
> let x = 10;
> print x * x;
100
What an error looks like:
$ echo 'print 1 +;' | ./build/minilang run -
error[E0202]: expected expression, got `;`
--> <stdin>:1:10
|
1 | print 1 +;
| ^
| help: try a number, a variable, or `(`
Layout
src/cpp/src/
source.hpp/.cpp — SourceFile with line index
diag.hpp/.cpp — Diagnostic + renderer
lex.hpp/.cpp — span-tracking tokenizer
parse.hpp/.cpp — recursive-descent parser with error recovery
format.hpp/.cpp — AST pretty printer
eval.hpp/.cpp — tree-walking evaluator
repl.hpp/.cpp — REPL loop with multi-line continuation
main.cpp — `minilang` CLI
tests/
test_tooling.cpp — 14 tests / 46 checks
Reading order
- steps/01-developer-experience-matters.md
- steps/02-source-spans-and-locations.md
- steps/03-rustc-style-diagnostics.md
- steps/04-parser-error-recovery.md
- steps/05-pretty-printing-and-formatting.md
- steps/06-building-a-repl.md
- steps/07-cli-design-and-lsp.md
Step 01 · Developer experience matters
A compiler that's correct but cryptic loses to one that's slightly less powerful but obviously helpful. Compare:
$ old-compiler
error: syntax error
$ rustc-style
error[E0202]: expected expression, got `;`
--> example.rs:5:11
|
5 | print 1 + ;
| ^
| help: try a number, a variable, or `(`
Same syntax error; vastly different debugging experience. The investment that pays off:
- Source spans on every AST node. Tokens carry (start, length); AST nodes inherit and merge them.
- Structured diagnostics: (severity, code, message, span, hint) rather than a string. This lets:
  - The compiler suggest fix-its (hint).
  - IDEs render squigglies precisely (span).
  - clippy-style tools filter by code.
- Error recovery: parsers continue past errors so users see all problems in one pass, not "fix → recompile → fix → recompile".
- Tooling ecosystem: fmt, ast, repl, lsp are first-class citizens that share the same parser & diagnostics.
This lab implements the foundation. cp-16's capstone wires it into the full compiler frontend.
Why the CLI is one binary with subcommands
minilang run|fmt|ast|repl instead of minilang-run, minilang-fmt, etc:
- Single install footprint.
- Shared option parsing & shared error format.
- Easier to add minilang check, minilang test, minilang doc later.
This is the design cargo, go, git, dotnet, and dart all
converged on for the same reasons.
Step 02 · Source spans and locations
A span = (start_offset, length) in the source buffer. A
location = (line, column). We compute the latter from the
former on demand.
class SourceFile {
std::string text_;
std::vector<size_t> lineStarts_; // offset of each line
};
Loc SourceFile::loc(size_t offset) const {
auto it = std::upper_bound(lineStarts_.begin(), lineStarts_.end(), offset);
int line = (int)(it - lineStarts_.begin());
return {line, (int)(offset - lineStarts_[line - 1]) + 1};
}
- Why store offsets, not (line, col)? Offsets are constant-time comparable, deduplicable, hashable. Lines change with edits; offsets don't (per-file).
- Why binary-search lookup? O(log lines) is fast enough for diagnostics. We only convert offsets → (line, col) at print time.
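The class above omits how lineStarts_ gets filled. A complete sketch could look like this; the single-argument constructor and the name SourceFileSketch are assumptions made to keep the example short (the lab's SourceFile also takes a file name).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Loc { int line; int col; };  // both 1-based

class SourceFileSketch {
public:
    explicit SourceFileSketch(std::string text) : text_(std::move(text)) {
        lineStarts_.push_back(0);  // line 1 starts at offset 0
        for (std::size_t i = 0; i < text_.size(); ++i)
            if (text_[i] == '\n') lineStarts_.push_back(i + 1);
    }

    // First line start strictly greater than offset; its index is the 1-based line.
    Loc loc(std::size_t offset) const {
        auto it = std::upper_bound(lineStarts_.begin(), lineStarts_.end(), offset);
        const int line = static_cast<int>(it - lineStarts_.begin());
        return {line, static_cast<int>(offset - lineStarts_[line - 1]) + 1};
    }

private:
    std::string text_;
    std::vector<std::size_t> lineStarts_;  // offset of each line's first byte
};
```

Because lineStarts_ always contains 0, upper_bound never returns begin(), so the line - 1 index is always valid.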
Span propagation
Tokens carry spans straight from the lexer. AST nodes either copy their primary token's span (literals, identifiers) or merge:
e->span = {lhs->span.start,
rhs->span.start + rhs->span.length - lhs->span.start};
A binary expression's span covers lhs op rhs end-to-end. This is
crucial for IDE highlighting: hover over 1 + 2, the whole
expression lights up.
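The merge can be written as an order-independent helper, which is handy when a node's children are not guaranteed to arrive left to right. cover and endOf are hypothetical names for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct Span { std::size_t start, length; };

inline std::size_t endOf(Span s) { return s.start + s.length; }

// Smallest span covering both inputs, regardless of argument order.
inline Span cover(Span a, Span b) {
    const std::size_t start = std::min(a.start, b.start);
    const std::size_t end = std::max(endOf(a), endOf(b));
    return {start, end - start};
}
```

For the source text 1 + 2, the operands sit at offsets 0 and 4 with length 1 each, so the binary node's span comes out as start 0, length 5.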
Multi-file
Real compilers carry a (fileId, span) pair. cp-15 uses one file at
a time because that's all the REPL and CLI need; extending to a
SourceMap of files is mechanical (vector<SourceFile> keyed by id).
Step 03 · Rustc-style diagnostics
The Diagnostic struct is small and intentional:
struct Diagnostic {
Severity severity; // Error / Warning / Note
std::string code; // "E0202"
std::string message;
Span span;
std::string hint; // "help: try ..."
};
Renderer output:
error[E0202]: expected expression, got `;`
--> ex.ml:1:10
|
1 | print 1 +;
| ^
| help: try a number, a variable, or `(`
The four lines after the header and the --> location line are:
- Gutter blank line matching the line-number width.
- Source line with the offending text.
- Caret line: spaces to the column, then ^ for the span.
- Hint (optional): what to try next.
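A renderer for the single-line case is mostly string padding. The sketch below uses a hypothetical DiagSketch that already carries a resolved line and column, and its exact spacing is this example's own choice; diag.cpp may lay the gutter out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct DiagSketch {
    std::string code, message, hint;
    int line, col;       // 1-based location of the span start
    std::size_t length;  // span length in characters
};

inline std::string renderSketch(const DiagSketch& d, const std::string& file,
                                const std::string& sourceLine) {
    const std::string num = std::to_string(d.line);
    const std::string gutter(num.size(), ' ');  // blank gutter as wide as the line number
    const std::string pad(static_cast<std::size_t>(d.col - 1), ' ');
    const std::string carets(d.length ? d.length : 1, '^');

    std::string out;
    out += "error[" + d.code + "]: " + d.message + "\n";
    out += gutter + "--> " + file + ":" + num + ":" + std::to_string(d.col) + "\n";
    out += gutter + " |\n";
    out += num + " | " + sourceLine + "\n";
    out += gutter + " | " + pad + carets + "\n";
    if (!d.hint.empty()) out += gutter + " | help: " + d.hint + "\n";
    return out;
}
```

Returning a plain string instead of writing to a stream keeps tests to a single equality check, in line with the no-colour argument that follows.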
Why no ANSI colours
Easier to test (string comparison), easier to pipe to less, easier
to integrate with editors that re-style errors. A real CLI adds a
--color=auto flag that wraps error and the caret in red — trivial
to layer on top.
Codes
Rust assigns each error an E#### code, Swift uses descriptive IDs,
TypeScript uses TS####. Benefits:
- Documentation hooks (rustc --explain E0382).
- Stable references in tutorials/bug reports.
- Tooling can ignore-list specific codes.
We pre-assign blocks: E01xx lex errors, E02xx parse, E03xx semantic.
Cheap up-front; pays for itself the first time someone googles
minilang E0301.
Spans across multiple lines
The renderer currently assumes the span fits on one line. Multi-line
spans (if {\n bad\n}) need the line bar repeated with ~~~ for
trailing lines and ^^^ on the start. We left this as a 30-minute
extension exercise — pattern-match what rustc does.
Suggestions / fix-its
hint is plain text. A richer system attaches a replacement span +
replacement text that an IDE can apply automatically:
struct FixIt { Span span; std::string replacement; };
Useful but easy to get wrong: apply two fix-its that overlap and you corrupt the file. Production tools (rustc via cargo fix, clang-tidy via --fix) only apply them on explicit user request.
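One way to guard against the overlap problem is to validate the whole batch before touching the text. applyFixIts below is a hypothetical helper, not part of cp-15; it applies all fix-its or none.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct Span { std::size_t start, length; };
struct FixIt { Span span; std::string replacement; };

// Applies every fix-it, or returns nullopt if any is out of range or two overlap.
inline std::optional<std::string> applyFixIts(std::string text, std::vector<FixIt> fixes) {
    std::sort(fixes.begin(), fixes.end(),
              [](const FixIt& a, const FixIt& b) { return a.span.start < b.span.start; });
    for (std::size_t i = 0; i < fixes.size(); ++i) {
        const Span& s = fixes[i].span;
        if (s.start + s.length > text.size()) return std::nullopt;  // out of range
        if (i > 0) {
            const Span& prev = fixes[i - 1].span;
            if (prev.start + prev.length > s.start) return std::nullopt;  // overlap
        }
    }
    // Apply back to front so earlier offsets stay valid while we edit.
    for (auto it = fixes.rbegin(); it != fixes.rend(); ++it)
        text.replace(it->span.start, it->span.length, it->replacement);
    return text;
}
```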
Step 04 · Parser error recovery
A parser that aborts on the first error is useless for IDEs and frustrating in the CLI. Error recovery is the difference between "fix one thing at a time" and "see the whole picture, fix everything in one pass".
cp-15's parser uses panic mode recovery — the simplest strategy that works:
Program parseProgram() {
    Program p;
    while (peek().kind != Tok::Eof) {
        auto s = parseStmt();                      // empty on error
        if (s) p.stmts.push_back(std::move(*s));
        else skipToSyncPoint(); // resync
    }
    return p;
}
void skipToSyncPoint() {
while (peek().kind != Tok::Eof && peek().kind != Tok::Semi) ++i;
accept(Tok::Semi);
}
; is our synchronisation token. After an error inside a
statement, we discard everything up to the next ; and start fresh.
This guarantees:
- The parser terminates (no infinite loops on bad input).
- Subsequent valid statements still get parsed.
- The number of errors reported scales linearly with the number of real mistakes (not exponentially — cascade failures are a common parser-design pitfall).
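These guarantees are easy to check on a toy grammar. The sketch below is not the cp-15 parser: it works on a pre-lexed token vector, accepts only print statements, and counts statements and errors instead of building an AST.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Tok { Print, Num, Plus, Semi, Eof };

struct ToyParser {
    std::vector<Tok> toks;  // must end with Tok::Eof
    std::size_t i = 0;
    int stmts = 0;
    int errors = 0;

    Tok peek() const { return toks[i]; }
    bool accept(Tok k) {
        if (peek() != k) return false;
        ++i;
        return true;
    }

    // expr := Num ("+" Num)*
    bool parseExpr() {
        if (!accept(Tok::Num)) return false;
        while (accept(Tok::Plus))
            if (!accept(Tok::Num)) return false;
        return true;
    }
    // stmt := "print" expr ";"
    bool parseStmt() { return accept(Tok::Print) && parseExpr() && accept(Tok::Semi); }

    // Panic mode: drop tokens up to and including the next ';'.
    void skipToSyncPoint() {
        while (peek() != Tok::Eof && peek() != Tok::Semi) ++i;
        accept(Tok::Semi);
    }

    void parseProgram() {
        while (peek() != Tok::Eof) {
            if (parseStmt()) ++stmts;
            else { ++errors; skipToSyncPoint(); }
        }
    }
};
```

Every failed statement either skips at least one token or consumes the ; it stopped on, so the loop always makes progress toward Eof.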
Better recovery strategies
- Token deletion / insertion: try plausible edits (insert ), delete +) and continue. Powerful but combinatorial.
- Phrase-level recovery: define multi-token sync sets per non-terminal. Statements sync on ; { fn while if, expressions sync on ) ; ,.
- Tree-sitter / GLR: parse as much as possible, leaving "ERROR" nodes in the tree. Fast enough to re-run on every keystroke.
For a small language, panic mode with one sync token is 95% as useful as any of these and 10× less code.
Don't forget the lexer
The lexer must also recover. lex emits a diagnostic for the bad
character and advances by one byte:
out.diagnostics.push_back(Diagnostic{...});
++i;
Skipping the whole rest of the file on a single bad character would be a denial-of-service vector for IDE users mid-typing.
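A stripped-down lexer shows the recover-by-one-byte rule in context. lexSketch and LexOut are hypothetical; tokens are plain strings and diagnostics are reduced to the offending offsets.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct LexOut {
    std::vector<std::string> tokens;
    std::vector<std::size_t> badOffsets;  // one entry per diagnostic
};

inline LexOut lexSketch(const std::string& src) {
    LexOut out;
    const std::string punct = "+-*/=;()";
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c)) {  // identifiers, keywords and numbers share one rule here
            const std::size_t start = i;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            out.tokens.push_back(src.substr(start, i - start));
            continue;
        }
        if (punct.find(src[i]) != std::string::npos) {
            out.tokens.push_back(std::string(1, src[i]));
            ++i;
            continue;
        }
        out.badOffsets.push_back(i);  // report, then skip exactly one byte
        ++i;
    }
    return out;
}
```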
Step 05 · Pretty printing and formatting
A formatter is a function AST → canonical source. Round-tripping
through parse ∘ format should be the identity on the AST (and
idempotent on the formatted text). Our formatter is in
format.cpp; it's tiny because the grammar
is tiny.
Key design choices:
Precedence-aware parenthesisation
void writeExpr(os, e, parentPrec) {
case Bin: {
int p = precOf(e.op);
bool wrap = parentPrec > p;
if (wrap) os << "(";
writeExpr(os, *e.lhs, p);
os << " " << e.op << " ";
writeExpr(os, *e.rhs, p + 1); // right-bias for left-assoc
if (wrap) os << ")";
}
}
parentPrec > p wraps when the parent expects tighter precedence.
The p + 1 on the right side preserves left-associativity:
1 - (2 - 3) keeps its parens (because subtraction isn't
associative) but (1 - 2) - 3 prints as 1 - 2 - 3 with none.
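A runnable version of the same rule, on a minimal expression type. Expr, num, bin and formatExpr are illustrative names; the lab's AST in parse.hpp will differ.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

struct Expr {
    char op = 0;  // 0 for a number literal, otherwise '+', '-', '*' or '/'
    long value = 0;
    std::unique_ptr<Expr> lhs, rhs;
};

inline std::unique_ptr<Expr> num(long v) {
    auto e = std::make_unique<Expr>();
    e->value = v;
    return e;
}
inline std::unique_ptr<Expr> bin(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->op = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

inline int precOf(char op) { return (op == '*' || op == '/') ? 2 : 1; }

inline void writeExpr(std::ostream& os, const Expr& e, int parentPrec) {
    if (e.op == 0) { os << e.value; return; }
    const int p = precOf(e.op);
    const bool wrap = parentPrec > p;  // parent binds tighter than we do
    if (wrap) os << "(";
    writeExpr(os, *e.lhs, p);
    os << " " << e.op << " ";
    writeExpr(os, *e.rhs, p + 1);      // same-precedence right operand gets parens
    if (wrap) os << ")";
}

inline std::string formatExpr(const Expr& e) {
    std::ostringstream os;
    writeExpr(os, e, 0);
    return os.str();
}
```

Parentheses are derived purely from tree shape, so the formatter never needs to remember whether the user originally typed them.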
Insertion of canonical whitespace
let x=1+2*3; print x ; → let x = 1 + 2 * 3;\nprint x;\n.
A formatter has one correct output per AST. Don't make whitespace configurable; that's gofmt's wisdom. Once teams agree, debates disappear.
Comments are hard
Our toy formatter strips comments because the parser drops them. A real formatter needs to:
- Carry comments through the AST (attach to neighbouring nodes).
- Distinguish leading vs. trailing comments.
- Reflow long lines without orphaning the comment.
This is where rustfmt, prettier, and gofmt all sink most of their implementation budget. The simple case (no comments) is fine for our educational scope.
Beyond plain text
- AST → diff for refactorings (rename a variable, output is a formatted version with the new name everywhere).
- AST → HTML for documentation (syntax-highlighted source).
- AST → AST transformations (CST-preserving rewrites for codemod-style tools).
These all start with a clean formatter.
Step 06 · Building a REPL
A REPL ("read-eval-print loop") is the most useful tool a language ships. Our implementation in repl.cpp is ~50 lines because all the heavy lifting is reused from the CLI.
void runRepl(in, out, err, opts) {
EvalState st; // persists across lines
std::string buffer;
while (getline(in, line)) {
buffer += line + "\n";
SourceFile src("<repl>", buffer);
auto l = lex(src);
auto p = parse(l.tokens);
if (needsContinuation(...)) continue; // accumulate
for (auto& d : l.diagnostics) renderTo(err, d, src);
for (auto& d : p.diagnostics) renderTo(err, d, src);
if (no errors) eval(st, p.program, out);
buffer.clear();
}
}
Multi-line continuation
The REPL recognises unfinished input (unbalanced parens, dangling
operators) and doesn't evaluate yet. It keeps reading lines into
a buffer, showing a | continuation prompt, until the input
parses as a complete program.
Heuristic in needsContinuation:
- ( count > ) count → expect more.
- Last diagnostic mentions "got end of input" → expect more.
A more rigorous approach: have the parser return a distinguished "unexpected EOF" error type rather than scanning messages. We chose strings here for simplicity; it's the kind of decision that is easy to revisit once you feel the friction.
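The paren-balance half of the heuristic is a few lines. hasUnclosedParen is a hypothetical helper; it counts raw characters, so a language with string literals or comments would want to count tokens from the lexer instead.

```cpp
#include <cassert>
#include <string>

// True when the buffer has more '(' than ')', i.e. the user is mid-expression.
inline bool hasUnclosedParen(const std::string& buffer) {
    int depth = 0;
    for (char c : buffer) {
        if (c == '(') ++depth;
        else if (c == ')') --depth;
    }
    return depth > 0;
}
```

A surplus of closing parens returns false on purpose: that input is an error to report now, not something more typing can fix.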
State preservation
EvalState st lives outside the loop, so:
> let x = 10;
> print x;
10
works as expected. The semantics is "each REPL line is appended to a notional program that you've been building all along".
Error recovery, REPL-style
When evaluation fails, we discard the buffer and re-prompt. The alternative — keeping the buffer so the user can edit — is what IPython / Jupyter offer, but requires terminal-line-editor integration (readline / replxx) outside the scope of this lab.
Things real REPLs add
- Line editing & history (libedit, readline, replxx).
- Tab completion (introspect st.env for variable names).
- Special commands (:type, :reset, :doc).
- Pretty-printing of last value (Python's _).
- Persistent history file.
Each is a half-day; they're orthogonal to the core REPL loop.
Step 07 · CLI design & looking ahead to LSP
Unified CLI
minilang <command> <args>
run <file|-> parse + evaluate
fmt <file|-> pretty-print
ast <file|-> dump AST
repl interactive read-eval-print loop
Patterns to adopt early:
- - for stdin everywhere a file is accepted. Pipe-friendly.
- Exit codes matter: 0 success, 64 usage error, 65 source error, 70 internal error (we follow rough sysexits.h conventions).
- Subcommand-first so minilang fmt --check and minilang run --debug have separate option spaces: no flag conflicts.
- One binary, many commands simplifies install + distribution.
What's missing for production
- minilang check <file>: parse + typecheck, no eval, machine-readable JSON output (--format=json).
- minilang test <file>: discover and run inline tests.
- minilang doc: extract docstrings → HTML.
- minilang lsp: Language Server Protocol stdio mode.
LSP — the natural next step
Every tool we built — span-tracking lexer, error-recovering parser, structured diagnostics, pretty printer — is what an LSP server needs. Sketch:
client → JSON-RPC → textDocument/didOpen { uri, text }
→ store in document map, run lex + parse
→ publishDiagnostics { uri, diagnostics }
client → JSON-RPC → textDocument/didChange { uri, edits }
→ incremental re-parse (or full)
→ publishDiagnostics
client → JSON-RPC → textDocument/formatting → call format(program)
client → JSON-RPC → textDocument/hover { position }
→ look up enclosing AST node, return doc/type
The hardest parts:
- Incremental parsing to keep latency low on large files (tree-sitter solves this; rust-analyzer rolls its own).
- Indexing across files for cross-file go-to-definition / find-references.
- Cancellation — long-running analyses must be interruptable.
All of those build on the foundations we laid here. Diagnostic codes, spans, and AST formatter aren't optional in an LSP world — they're the contract you have with the editor.
Connection to the rest of the curriculum
- cp-16 (capstone compiler suite) reuses this CLI shell.
- cp-17 (capstone JIT) layers compile and jit subcommands on top.
- cp-18 (MLIR framework) adds dump-mlir and lower stages.
The unifying lesson: a great compiler is also a great library, and the CLI / REPL / LSP are just thin shells over that library.
cp-16 · Capstone Compiler Suite (minilangc)
A complete ahead-of-time compiler that puts everything from cp-01…cp-15
together. minilangc lexes → parses → typechecks → emits LLVM IR →
shells out to llc + clang to produce a native executable.
Build & run
cmake -S src/cpp -B build
cmake --build build
./build/tests/test_suite # → 28/28 checks passed
cat > hello.ml <<'PROG'
fn fib(n) {
if (n < 2) { return n; }
return fib(n - 1) + fib(n - 2);
}
fn main() {
print fib(10);
}
PROG
./build/minilangc emit-ir hello.ml > hello.ll
./build/minilangc build hello.ml -o hello
./build/minilangc run hello.ml # → 55
./build/minilangc check hello.ml # silent if clean
CLI:
minilangc <command> [options] <file|->
emit-ir lex+parse+typecheck and print LLVM IR
build compile to executable (-o path, -O0..3, -v verbose)
run build then execute
check lex+parse+typecheck only
Language
program := func+
func := "fn" Ident "(" params? ")" block
block := "{" stmt* "}"
stmt := "let" Ident "=" expr ";"
| "print" expr ";"
| "return" expr ";"
| "if" "(" expr ")" block ("else" block)?
| "while" "(" expr ")" block
| Ident "=" expr ";"
| expr ";"
expr := cmp
cmp := add (("=="|"!="|"<"|"<="|">"|">=") add)*
add := mul (("+"|"-") mul)*
mul := unary (("*"|"/"|"%") unary)*
unary := "-" unary | call
call := primary ("(" args? ")")*
primary := Number | Ident | "(" expr ")"
Every value is i64. print lowers to printf("%lld\n", v). Every program must
define fn main() { ... }.
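The precedence layers in the grammar above map one-to-one onto recursive-descent functions. The following is a minimal sketch under stated assumptions, not the lab's parser: the class name ExprEval is hypothetical, it evaluates directly instead of building an AST, and primary is limited to numbers and parentheses (no Ident, no calls).

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical evaluator: one method per grammar layer, lowest precedence first.
class ExprEval {
public:
    explicit ExprEval(std::string text) : src_(std::move(text)) {}

    int64_t run() {
        int64_t v = cmp();
        skipSpace();
        if (pos_ != src_.size()) throw std::runtime_error("unexpected trailing input");
        return v;
    }

private:
    std::string src_;
    std::size_t pos_ = 0;

    void skipSpace() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }
    // Consumes `op` if it comes next. Callers try longer operators first.
    bool eat(const std::string& op) {
        skipSpace();
        if (src_.compare(pos_, op.size(), op) != 0) return false;
        pos_ += op.size();
        return true;
    }
    // cmp := add (("=="|"!="|"<="|">="|"<"|">") add)*
    int64_t cmp() {
        int64_t lhs = add();
        for (;;) {
            if (eat("==")) lhs = (lhs == add());
            else if (eat("!=")) lhs = (lhs != add());
            else if (eat("<=")) lhs = (lhs <= add());
            else if (eat(">=")) lhs = (lhs >= add());
            else if (eat("<")) lhs = (lhs < add());
            else if (eat(">")) lhs = (lhs > add());
            else return lhs;
        }
    }
    // add := mul (("+"|"-") mul)*
    int64_t add() {
        int64_t lhs = mul();
        for (;;) {
            if (eat("+")) lhs += mul();
            else if (eat("-")) lhs -= mul();
            else return lhs;
        }
    }
    // mul := unary (("*"|"/"|"%") unary)*
    int64_t mul() {
        int64_t lhs = unary();
        for (;;) {
            if (eat("*")) { lhs *= unary(); continue; }
            const bool isDiv = eat("/");
            if (!isDiv && !eat("%")) return lhs;
            const int64_t rhs = unary();
            if (rhs == 0) throw std::runtime_error("division by zero");
            lhs = isDiv ? lhs / rhs : lhs % rhs;
        }
    }
    // unary := "-" unary | primary   (calls omitted in this sketch)
    int64_t unary() { return eat("-") ? -unary() : primary(); }

    // primary := Number | "(" expr ")"
    int64_t primary() {
        if (eat("(")) {
            const int64_t v = cmp();
            if (!eat(")")) throw std::runtime_error("expected ')'");
            return v;
        }
        skipSpace();
        if (pos_ >= src_.size() || !std::isdigit(static_cast<unsigned char>(src_[pos_])))
            throw std::runtime_error("expected number");
        int64_t v = 0;
        while (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_])))
            v = v * 10 + (src_[pos_++] - '0');
        return v;
    }
};
```

Calling ExprEval("1 + 2 * 3").run() walks cmp → add → mul, so the multiplication binds tighter than the addition. Adding a new precedence level means adding one more function between two existing layers.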
Layout
src/cpp/src/
source.hpp/.cpp SourceFile + line index
diag.hpp/.cpp Diagnostic + renderer
lex.hpp/.cpp tokenizer
parse.hpp/.cpp recursive-descent parser
typecheck.hpp/.cpp scope + arity + main checks
llvm_emit.hpp/.cpp AST → textual LLVM IR
driver.hpp/.cpp llc + clang shell-out
main.cpp `minilangc` CLI
tests/
test_suite.cpp frontend + IR + end-to-end (9 tests / 28 checks)
Reading order
- steps/01-pipeline-overview.md
- steps/02-frontend-reuse.md
- steps/03-multi-function-language.md
- steps/04-typecheck-and-scope.md
- steps/05-emitting-llvm-ir.md
- steps/06-driving-the-toolchain.md
- steps/07-from-here-to-production.md
Step 01 · Pipeline overview
minilangc is a thin orchestrator over six clean stages:
source bytes
│
▼ ml::lex
Token stream ── diagnostics? ──► render & exit 65
│
▼ ml::parse
AST (Program) ── diagnostics? ──► render & exit 65
│
▼ ml::typecheck
AST + scope info ── diagnostics? ──► render & exit 65
│
▼ ml::emitLLVMIR
"module.ll" (textual LLVM IR string)
│
▼ ml::buildExecutable → shell out to llc -filetype=obj
"module.o"
│
▼ same path → shell out to clang
executable
│
▼ ml::runExecutable
stdout text
Each arrow corresponds to one function in driver.hpp. The two phases that can fail (parse / build) return rich result structs so the CLI can format the failure however it wants.
Why shell out?
Linking against LLVM-as-a-library is the "right" answer for production
compilers (incremental compilation, JIT, fewer process forks). For this
capstone we shell out to llc + clang because:
- Zero LLVM CMake friction: it works as long as /opt/homebrew/opt/llvm/bin exists.
- Easier to debug: you can re-run the exact llc command yourself.
- The pipeline is the same idea: only the boundary is text on disk vs. an in-memory Module*.
Subsequent labs (cp-17 JIT, cp-18 MLIR) demonstrate the linked alternative.
Stages, separately
Want only the IR? minilangc emit-ir foo.ml > foo.ll.
Want only typecheck? minilangc check foo.ml.
Want everything? minilangc run foo.ml.
This separability is the architecture. Each stage's output is serialisable (tokens → JSON, AST → JSON, IR → text), so you can mix & match: write a third-party formatter, a linter, a documentation generator, all on the same frontend.
Step 02 · Reusing the cp-15 frontend
The source/diag/lex/parse modules are nearly verbatim copies of cp-15's, extended for the multi-function language. This is intentional: the capstone proves the cp-15 design generalises.
What grew:
| concern | cp-15 | cp-16 |
|---|---|---|
| top-level | flat statement list | function definitions |
| keywords | let, print | + fn, return, if, else, while |
| operators | +, -, *, / | + %, comparisons |
| expressions | numeric arithmetic | + function calls |
| typecheck | none | scope, arity, main |
The lexer's structure didn't change — just more keywords and
two-character operators (==, <=, …).
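Two-character operators only need one rule: try the longer spelling first. A minimal sketch of that maximal-munch step follows; the helper names lexOperator and lexOperators are hypothetical and the operator set is read off the grammar rather than taken from lex.cpp.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: returns the operator starting at `pos`, preferring the
// two-character form so "<=" becomes one token rather than "<" then "=".
inline std::optional<std::string_view> lexOperator(std::string_view src, std::size_t pos) {
    static constexpr std::array<std::string_view, 4> two = {"==", "!=", "<=", ">="};
    static constexpr std::string_view one = "+-*/%<>=(){};,";
    if (pos >= src.size()) return std::nullopt;
    for (std::string_view op : two)
        if (src.substr(pos, 2) == op) return src.substr(pos, 2);
    if (one.find(src[pos]) != std::string_view::npos) return src.substr(pos, 1);
    return std::nullopt;
}

// Splits a string made only of operators and spaces into operator tokens.
inline std::vector<std::string> lexOperators(std::string_view src) {
    std::vector<std::string> toks;
    std::size_t pos = 0;
    while (pos < src.size()) {
        if (src[pos] == ' ') { ++pos; continue; }
        auto op = lexOperator(src, pos);
        if (!op) break;  // a full lexer would try numbers and identifiers here
        toks.emplace_back(*op);
        pos += op->size();
    }
    return toks;
}
```

With this ordering, the input "<==" lexes as "<=" followed by "=", which is the behaviour most languages choose.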
The parser gained parseFunc, parseBlock, control-flow statements,
parseCall, and a parseCmp precedence layer above parseAdd. Every
new feature followed the same recipe:
- Define the AST node.
- Add the parser rule.
- Extend typecheck to validate it.
- Extend emitLLVMIR to lower it.
Lessons from doing it twice
- Spans, not positions. Spans carry through every transformation; positions become stale the moment you concatenate AST nodes.
- Synchronisation tokens scale. cp-15 used ;. cp-16 uses ; and brace-balanced sync inside blocks (see Parser::sync in parse.cpp). The principle is identical.
- Diagnostic codes are forever. Once you ship E0202, you don't renumber it. Our scheme (E01xx lex, E02xx parse, E03xx eval, E04xx semantic) is just enough structure.
- std::optional<Stmt> requires <optional>. Compilers may include it transitively today; relying on that is a portability bug waiting to happen.
Step 03 · A multi-function language
cp-15's language was a calculator. cp-16's is a real (if tiny) language with functions, control flow, and recursion. The grammar changes that mattered:
Top-level is functions only
program := func+
A program is a list of fn declarations. There's no top-level "main
scope" — that's fn main(). This rule is enforced by the parser
(error E0210) and by typecheck (error E0411 if no main).
Blocks introduce scope
case Stmt::K::If: {
auto sc1 = scope; checkBlock(s.body, sc1);
auto sc2 = scope; checkBlock(s.elseBody, sc2);
return;
}
We snapshot the scope before each branch so a let inside a branch
doesn't leak out. This is the simplest form of lexical scoping; real
languages use a linked stack of scopes for efficiency and shadowing
rules.
Calls
parseCall runs after parsePrimary and wraps the result in zero or
more ( args ) suffixes:
while (peek().kind == Tok::LParen) { ... }
This lets f(1)(2) parse (even though we don't have first-class
functions). It also makes adding methods (obj.method(arg)) a small
extension.
Control flow lowers to branches
if/while are compiled to plain LLVM basic blocks; we don't use
select or phi. The IR for while (cond) body is:
br label %cond
cond:
%v = ... evaluate cond ...
%t = icmp ne i64 %v, 0
br i1 %t, label %body, label %end
body:
... body ...
br label %cond
end:
That's the canonical "structured control flow → CFG" lowering.
Optimisation passes then tidy the result: mem2reg removes the alloca
traffic introduced by let/assign, and jump threading simplifies the
branch structure.
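The while lowering above can be sketched as a small text emitter. This is an illustration under assumptions, not the code in llvm_emit.cpp: IrWriter and emitWhile are hypothetical names, the condition and body arrive as already-lowered IR strings, and the body is assumed not to end in a terminator.

```cpp
#include <sstream>
#include <string>

// Hypothetical emitter state: an output buffer plus a counter for unique names.
struct IrWriter {
    std::ostringstream out;
    int next = 0;
    std::string fresh(const std::string& stem) { return stem + std::to_string(next++); }
};

// Lowers `while (cond) body`. `condIr` must leave an i64 in `condValue`;
// `bodyIr` is straight-line IR for the loop body.
inline void emitWhile(IrWriter& w, const std::string& condIr,
                      const std::string& condValue, const std::string& bodyIr) {
    const std::string cond = w.fresh("cond");
    const std::string body = w.fresh("body");
    const std::string end = w.fresh("end");
    const std::string test = w.fresh("t");
    w.out << "  br label %" << cond << "\n"      // close the predecessor block
          << cond << ":\n"
          << condIr
          << "  %" << test << " = icmp ne i64 " << condValue << ", 0\n"
          << "  br i1 %" << test << ", label %" << body << ", label %" << end << "\n"
          << body << ":\n"
          << bodyIr
          << "  br label %" << cond << "\n"      // back edge
          << end << ":\n";
}
```

The fresh-name counter is what keeps nested or repeated loops from reusing the same label, which LLVM would reject.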
Step 04 · Typecheck and scope
The typechecker in typecheck.cpp is a single AST walk. Three responsibilities:
- Function table: collect every fn name and remember its arity.
- main requirements: must exist (E0411), must take zero parameters (E0412), no duplicate definitions (E0410).
- Per-function scope walk:
  - Variables must be let-introduced before use (E0300).
  - Assignment requires prior declaration (E0302).
  - Calls must resolve to a known function with matching arity (E0420, E0421).
That's it. There's no actual type checking because everything is
i64. That's the right starting point: it disentangles "is it
syntactically and semantically well-formed?" from "is it
type-correct?". You can graft a real type system on top later
(introduce i64/bool/str, unify across operators, infer
generics) without touching the parser.
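The function-table and call checks can be sketched in a few lines. The types FuncDecl and CallSite and both function names are hypothetical, and the split of E0420 (unknown function) versus E0421 (arity mismatch) is an assumption; the text only says the two codes cover those checks together.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct FuncDecl { std::string name; std::vector<std::string> params; };
struct CallSite { std::string callee; std::size_t argc; };
using FuncTable = std::unordered_map<std::string, std::size_t>;  // name -> arity

// Pass 1: collect every function before checking any body, so recursion and
// forward calls resolve. Also enforces the rules on main.
inline FuncTable collectFunctions(const std::vector<FuncDecl>& funcs,
                                  std::vector<std::string>& errors) {
    FuncTable table;
    for (const FuncDecl& f : funcs) {
        if (!table.emplace(f.name, f.params.size()).second)
            errors.push_back("E0410: duplicate definition of '" + f.name + "'");
    }
    auto it = table.find("main");
    if (it == table.end()) errors.push_back("E0411: missing fn main()");
    else if (it->second != 0) errors.push_back("E0412: main must take zero parameters");
    return table;
}

// Pass 2 helper: validate one call against the table.
inline void checkCall(const FuncTable& table, const CallSite& call,
                      std::vector<std::string>& errors) {
    auto it = table.find(call.callee);
    if (it == table.end())
        errors.push_back("E0420: unknown function '" + call.callee + "'");
    else if (it->second != call.argc)
        errors.push_back("E0421: '" + call.callee + "' expects " +
                         std::to_string(it->second) + " argument(s)");
}
```

Collecting the table in its own pass is the design choice that lets fib call itself and lets main call functions defined later in the file.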
Scopes are sets, not stacks
std::unordered_set<std::string> scope(f.params.begin(), f.params.end());
Snapshotting the scope before each branch (auto sc1 = scope; ...)
gives the right semantics without a linked structure. Each snapshot
copies the whole set, so the cost is O(n × m) for n nested blocks with
m names in scope, which only matters in pathologically deep nesting.
A production typechecker uses a stack of scopes for shadowing
(let x = 1; { let x = 2; print x; } print x; → 2 then 1). Our
language doesn't allow shadowing — let x = 1; let x = 2; would
quietly clobber, which is a UX bug we'd fix by adding an E0303
diagnostic.
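A minimal sketch of the snapshot approach, including the redeclaration check proposed above. The mini-AST and checkBlock signature are hypothetical simplifications of typecheck.cpp, and E0303 is the not-yet-implemented code the text suggests.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical mini-AST: just enough statement kinds to show scoping.
struct Stmt {
    enum class K { Let, Use, If } kind;
    std::string name;            // Let / Use
    std::vector<Stmt> body;      // If: then-branch
    std::vector<Stmt> elseBody;  // If: else-branch
};

using Scope = std::unordered_set<std::string>;

// `scope` is taken by reference: a `let` extends the caller's set, so branch
// callers pass a copy to keep branch-local names from leaking out.
inline void checkBlock(const std::vector<Stmt>& block, Scope& scope,
                       std::vector<std::string>& errors) {
    for (const Stmt& s : block) {
        switch (s.kind) {
        case Stmt::K::Let:
            if (!scope.insert(s.name).second)
                errors.push_back("E0303: '" + s.name + "' is already declared");
            break;
        case Stmt::K::Use:
            if (scope.count(s.name) == 0)
                errors.push_back("E0300: unknown variable '" + s.name + "'");
            break;
        case Stmt::K::If: {
            Scope thenScope = scope;  // snapshot for the then-branch
            checkBlock(s.body, thenScope, errors);
            Scope elseScope = scope;  // fresh snapshot for the else-branch
            checkBlock(s.elseBody, elseScope, errors);
            break;
        }
        }
    }
}
```

Because each branch works on its own copy, a name declared inside the then-branch is invisible both to the else-branch and to the code after the if.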
Why typecheck before IR emission
- Better errors. "Unknown function fbn" with a caret at the call site is far nicer than an undefined-symbol failure surfacing later from llc or the linker.
- Performance. We don't waste time generating IR for code that's going to be rejected.
- Layering. IR emission can assume the AST is well-formed: fewer if checks, simpler code.
The typechecker is also reused unmodified by the check subcommand,
which is the building block for IDEs.
Step 05 · Emitting LLVM IR
llvm_emit.cpp is the longest file in this lab, ~150 lines, because it covers nine constructs (literal, variable, neg, binop, cmp, call, if, while, return) plus the module preamble. Highlights:
Memory model
We use the classic mem2reg-friendly pattern: every variable is an
alloca and every read/write goes through load/store:
%x.addr = alloca i64
store i64 0, ptr %x.addr
%t = load i64, ptr %x.addr
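The frontend never has to compute SSA itself. LLVM's mem2reg pass, part
of the -O1-and-above pipelines in clang and opt, promotes the allocas to
virtual registers and inserts phi nodes where needed. (llc runs only
code-generation passes, so a shell-out pipeline needs an opt step in
front of it to get this promotion.) This separation of concerns
(frontend allocates, optimiser promotes) is one of the most important
architectural ideas in modern compilers.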
Comparisons are i64-valued booleans
%cb = icmp slt i64 %a, %b
%v = zext i1 %cb to i64
Everything is i64, including booleans. if (...) then re-truncates
via icmp ne i64 %v, 0. Wasteful? Yes. Compatible with the rest of
the language? Also yes. A real bool type would be cheaper but
requires propagating types through every operator.
Functions
define i64 @add(i64 %arg0, i64 %arg1) {
entry:
%a.addr = alloca i64
store i64 %arg0, ptr %a.addr
%b.addr = alloca i64
store i64 %arg1, ptr %b.addr
...
ret i64 0 ; fallback if no explicit return
}
Each parameter gets spilled to an alloca of the same name. After
mem2reg these vanish. The trailing ret i64 0 guarantees every
function ends with a terminator even if the user omits return —
defensive but not wrong.
Module preamble
target triple = "arm64-apple-macosx"
@.fmt = private unnamed_addr constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(ptr, ...)
The hard-coded triple is for the macOS-on-ARM workstation this lab
was developed on. A portable driver would either: (a) drop the triple
and let llc pick the host default, or (b) call llvm::sys::getDefaultTargetTriple()
via the LLVM-as-a-library route. We chose explicit because it
documents the assumed target.
What we don't do
- SSA construction (mem2reg handles it).
- Register allocation (LLVM backend).
- Instruction selection (LLVM backend).
- Linking object files (clang invokes ld).
We're orchestrating, not reinventing. That's what "use LLVM" buys you.
Step 06 · Driving the toolchain
driver.cpp owns the boring-but-critical job of turning an IR string into an executable on disk.
The pipeline as commands
$ /opt/homebrew/opt/llvm/bin/llc -O0 -filetype=obj -o /tmp/minilangc-obj-XXX.o /tmp/minilangc-ir-YYY.ll
$ /opt/homebrew/opt/llvm/bin/clang -O0 -o a.out /tmp/minilangc-obj-XXX.o
That's the whole compiler. llc lowers IR → object file (handles
instruction selection, register allocation, scheduling, emission). clang
acts as the linker driver — it knows how to invoke the system linker
(ld64 on macOS) with the right libc paths so printf resolves.
-O<N> is forwarded to both; -v echoes the commands so users can
re-run them. Both tools are looked up via ${LLVM_BIN_DIR} (settable
via CMake -DLLVM_BIN_DIR=...) so the lab works on other people's
machines.
Process management
We use popen (read end) for capturing combined stdout+stderr. This
is good enough for our purposes:
- Tools rarely produce large output on success.
- On failure, we want all the diagnostic chatter.
- No need to handle stdin (we pass IR via a file).
For production:
- Use posix_spawn + pipes to capture stderr separately.
- Stream to the user's terminal in -v mode rather than buffering.
- Handle signals properly (Ctrl-C should kill the child).
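The popen-based capture described above can be sketched as follows. RunResult and runCapture are hypothetical names; popen's read stream carries only the child's stdout, so this sketch folds stderr in with a shell redirect, and whether driver.cpp does it the same way is an assumption.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <sys/wait.h>

struct RunResult {
    int exitCode;        // -1 if the command could not be run or died on a signal
    std::string output;  // combined stdout + stderr
};

// Runs `command` through /bin/sh and captures everything it prints.
inline RunResult runCapture(const std::string& command) {
    RunResult r{-1, {}};
    // The parentheses make the redirect apply to the whole command line.
    const std::string wrapped = "(" + command + ") 2>&1";
    FILE* pipe = popen(wrapped.c_str(), "r");
    if (!pipe) return r;
    char buf[4096];
    std::size_t n = 0;
    while ((n = fread(buf, 1, sizeof buf, pipe)) > 0) r.output.append(buf, n);
    const int status = pclose(pipe);  // waits for the child
    if (status != -1 && WIFEXITED(status)) r.exitCode = WEXITSTATUS(status);
    return r;
}
```

A driver built on this would call runCapture once for llc and once for clang, and print r.output only when exitCode is non-zero or -v is set.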
Temp file hygiene
mkstemp("/tmp/minilangc-{tag}-XXXXXX") then rename to add the right
extension. We don't clean up because:
- On success the user doesn't care.
- On failure the user wants to inspect the IR.
A real driver would offer --save-temps (gcc) or --keep-tmp-files.
The current behaviour matches --save-temps, which is fine for an
educational tool.
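The mkstemp-then-rename step can be sketched like this. makeTempWithExt is a hypothetical name and the error handling is deliberately minimal.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <unistd.h>

// Creates a unique temp file for `tag`, then renames it to carry `ext`
// (for example ".ll"). Returns the final path, or "" on failure.
inline std::string makeTempWithExt(const std::string& tag, const std::string& ext) {
    std::string path = "/tmp/minilangc-" + tag + "-XXXXXX";
    const int fd = mkstemp(path.data());  // replaces XXXXXX in place
    if (fd == -1) return "";
    close(fd);
    const std::string finalPath = path + ext;
    if (std::rename(path.c_str(), finalPath.c_str()) != 0) {
        unlink(path.c_str());  // don't leave the extension-less file behind
        return "";
    }
    return finalPath;
}
```

The rename opens a small window in which the final name is not guaranteed unique. Where available (macOS and glibc both provide it), mkstemps takes a suffix length and creates the file with its extension in one step.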
Why clang instead of ld directly
Calling ld directly requires knowing the platform-specific runtime
glue: crt1.o, libSystem.B.dylib, the right SDK path. clang -o figures
all that out by querying the macOS SDK. It's slower (one extra exec)
but vastly more portable.
Testing the e2e pipeline
Tests in test_suite.cpp call
toolchainAvailable() and skip e2e tests gracefully if llc isn't
present. This keeps CI green even on machines without LLVM installed.
Step 07 · From here to production
minilangc is a complete compiler — small, but every stage that a
real production compiler has is present. What separates it from
something you'd ship?
Language features
- Types. Booleans, strings, structs, arrays. Each one threads through lex → parse → typecheck → IR. Strings need a runtime (cp-14).
- Closures. Capture-by-reference vs. by-value, free variable analysis, environment lowering.
- Modules / namespaces. Multi-file compilation, separate compilation units, an import statement, a build manifest.
- Generics. Monomorphisation (Rust) or boxing (Java). Either way, the typechecker grows substantially.
Performance
- Link with LLVM-as-a-library to avoid process overhead. We'd replace buildExecutable with code that builds an llvm::Module directly (cp-10/11/12 already do this).
- Incremental compilation. Hash AST nodes, cache IR per function, re-emit only changed functions. Prior art: Rust's query system, Swift's modular header maps.
- Parallel compilation. One thread per function, sharing an immutable AST. An LLVMContext is not safe to use from several threads at once, so each worker needs its own LLVMContext and Module.
- Optimisation passes. Run a custom pipeline (mem2reg, instcombine, GVN, licm, loop-vectorize) before llc.
Tooling ecosystem
- minilangc fmt (re-emit canonical source): port cp-15's formatter.
- minilangc test (discover fn test_* and run them).
- minilangc doc (extract doc comments).
- minilangc lsp: full LSP server using the spans + diagnostics we built.
- Debugger support: emit DWARF line tables (!dbg in IR, DICompileUnit metadata, -g).
Distribution
- Pre-compiled standard library distributed as object files (or LLVM bitcode for cross-target).
- Package manager. Cargo, npm, go modules — all evolved alongside their compilers.
- Cross-compilation. Parameterise the triple, ship multiple llc backends.
Where the curriculum goes next
- cp-17 (capstone JIT) demonstrates the dynamic-language path: parse → IR → ORC JIT → call into a runtime (cp-14) at runtime. No object files, no clang. Same frontend.
- cp-18 (capstone MLIR) demonstrates the high-level-IR path: parse → custom MLIR dialect → progressive lowering → LLVM dialect → object. More machinery for more optimisation headroom.
All three capstones share the cp-15 frontend skeleton. That's the deepest lesson of the curriculum: the compiler is a frontend + a backend choice, and the backend choice depends on the deployment story you want.
cp-17 — Capstone: JIT for a Tiny Dynamic Language
A small dynamic language frontend (fn / let / control flow / strings) compiled
with the LLVM C++ API and executed via ORC LLJIT, with a host-side
runtime registered as JIT-resolvable symbols.
This is the JIT counterpart to cp-16 (which AOT-compiled by shelling out to
llc and clang). Here we link LLVM as a library and run code in-process.
Build & test
cd src/cpp
cmake -S . -B build && cmake --build build
./build/tests/test_jit # 7/7 checks passed
./build/mldyn examples/hello.ml # CLI entry point
Pipeline
source.ml ──lex──▶ tokens ──parse──▶ AST ──emit──▶ llvm::Module ──ORC──▶ run main()
└── runtime symbols (host)
Layout
- src/cpp/src/{source,diag,lex,parse}.{hpp,cpp}: frontend (cp-15/16 style)
- src/cpp/src/runtime.{hpp,cpp}: host functions exposed to JIT'd code
- src/cpp/src/ir_emit.{hpp,cpp}: Program → llvm::Module via IRBuilder
- src/cpp/src/jit.{hpp,cpp}: LLJIT setup, symbol registration, lookup("main")
- src/cpp/src/main.cpp: mldyn <file> CLI
- src/cpp/tests/test_jit.cpp: end-to-end pipeline tests
- steps/01..07.md: narrative walkthrough
Tests (7 checks)
- print 42 → 42\n
- print 1 + 2 * 3 → 7\n
- print_str "hello, jit" → hello, jit\n
- fib(10) → 55\n
- while loop printing 0\n1\n2\n
- ml_record_int_arg runtime callback fires from JIT'd code
- Interleaved print_str + print integer output
01 — JIT vs AOT
In cp-16 we built minilangc, an ahead-of-time compiler: it produced a
.ll text file, then shelled out to llc and clang to assemble and link an
executable. The compiler and the program are different processes, separated by
time and by the filesystem.
In cp-17 we build mldyn, a just-in-time runner. The compiler and the
program live in the same process. The flow is:
source ──► tokens ──► AST ──► llvm::Module ──► machine code ──► call()
(in memory) (in memory)
There is no .ll, no .o, no clang invocation. We link against LLVM's
libraries (libLLVMCore, libLLVMOrcJIT, …) and the IRBuilder hands us a
Module object that ORC compiles in-process to executable memory pages.
Why JIT for a dynamic language?
Dynamic languages — Python, Ruby, JavaScript, Lua — discover types and shapes at runtime. An AOT compiler must therefore choose: either generate slow generic code, or refuse to compile until annotations are added. A JIT can defer codegen, observe what actually happens (type feedback, hot paths), then emit specialised code. That's how V8, HotSpot, LuaJIT, PyPy and Truffle all earn their speed.
cp-17 doesn't implement specialisation — that would take a tracing/IC infrastructure — but it lays the groundwork:
- IR built programmatically, so we could vary it per call site.
- A runtime symbol table that JIT'd code can call back into.
- A ml_record_int_arg hook that demonstrates type feedback: every function entry tells the host "this argument was an int". A real system would consult this table on the next compile to decide whether to generate a fast int-only version.
What we keep from cp-16
The frontend. Lexer, parser, diagnostics, spans — all of it carries over. The
language grew a string literal and a print_str statement, but the parser
infrastructure (recursive descent, sync, span-bearing diagnostics) is the same
code we built in cp-15 and reused in cp-16. Frontends are stable; backends
are where the interesting variation lives.
02 — Building IR with IRBuilder
cp-16 produced LLVM IR as text: strings concatenated into a .ll file,
which llc then parsed back into an in-memory Module. That round-trip is
fine for AOT (the textual form is great for debugging), but for a JIT it
adds a needless print-and-reparse step to every compile.
In cp-17, ir_emit.cpp constructs the Module directly:
LLVMContext ctx;
Module mod("test", ctx);
IRBuilder<> b(ctx);
Every IR node is a C++ object. The IRBuilder tracks a current insertion point
(a BasicBlock) and appends instructions to it. Compare:
| operation | textual IR | IRBuilder call |
|---|---|---|
| add two i64 | %t = add i64 %lhs, %rhs | b.CreateAdd(lhs, rhs) |
| signed less-than | %t = icmp slt i64 %lhs, %rhs | b.CreateICmpSLT(lhs, rhs) |
| call printf | call i32 (ptr, ...) @printf(ptr @.fmt, i64 %v) | b.CreateCall(fn, args) |
| return value | ret i64 %v | b.CreateRet(v) |
| branch | br label %L | b.CreateBr(L) |
| cond br | br i1 %c, label %T, label %F | b.CreateCondBr(c, T, F) |
The Value* that each Create* returns is the IR-level result you splice
into subsequent operations. You're building a directed graph of SSA values,
just with C++ syntax instead of .ll text.
Locals as alloca slots
We keep the simple model from earlier labs: every local is an alloca slot
named <name>.addr, loaded on read, stored on write. mem2reg promotes them
to SSA registers whenever an optimisation pipeline runs over the module;
note that a stock LLJIT applies no IR-level passes unless you install a
transform on its IR transform layer. Either way, the emitter never has to
track SSA names or phi nodes.
auto* slot = b.CreateAlloca(i64(), nullptr, "x.addr");
b.CreateStore(value, slot);
// later:
auto* v = b.CreateLoad(i64(), slot);
Control flow via named basic blocks
For if/while we explicitly create blocks and stitch branches:
auto* T = BasicBlock::Create(ctx, "then", fn);
auto* E = BasicBlock::Create(ctx, "else", fn);
auto* M = BasicBlock::Create(ctx, "end", fn);
b.CreateCondBr(cond_i1, T, E);
b.SetInsertPoint(T);
// emit `then` body...
if (!b.GetInsertBlock()->getTerminator()) b.CreateBr(M);
The terminator check matters: if the then body ended with return, the
block is already terminated and we must NOT append a second terminator (LLVM's
verifier will reject the module). That single rule is responsible for most of
the conditional if (!terminator) br calls in ir_emit.cpp.
verifyModule
After emitting we call llvm::verifyModule. If it returns true, the IR is
malformed: dangling references, missing terminators, type mismatches, etc.
We capture the report and surface it as EmitResult::error. This is the
guardrail against bugs in your emitter. Catching a verifier error is a
millisecond; catching a "JIT executed bad machine code" error is a debugger
session at best.
03 — Registering Runtime Symbols with ORC
The JIT'd module declares but does not define the runtime functions:
declare void @ml_print_int(i64)
declare void @ml_print_str(ptr)
declare void @ml_record_int_arg(i64)
When ORC compiles main, those calls become real bl/call instructions to
some address, but to which address? Nothing in the module says. ORC
searches the JITDylib for a symbol named ml_print_int, and if none is found,
the lookup of main itself fails with a symbols-not-found error before any
JIT'd code runs.
Our job is to put the host process's function addresses into that table
before we run anything. In jit.cpp:
llvm::orc::SymbolMap syms;
auto def = [&](void* p) {
return llvm::orc::ExecutorSymbolDef(
llvm::orc::ExecutorAddr::fromPtr(p),
llvm::JITSymbolFlags::Exported | llvm::JITSymbolFlags::Callable);
};
syms[es.intern("ml_print_int")] = def((void*)&ml_print_int);
syms[es.intern("ml_print_str")] = def((void*)&ml_print_str);
syms[es.intern("ml_record_int_arg")] = def((void*)&ml_record_int_arg);
// define() returns an llvm::Error, which must not be dropped unchecked.
llvm::cantFail(jd.define(llvm::orc::absoluteSymbols(std::move(syms))));
Three pieces:
- intern(name) turns a StringRef into a SymbolStringPtr. The string pool is owned by the ExecutionSession, so all lookups can compare pointers instead of strings.
- ExecutorSymbolDef(addr, flags) wraps a raw pointer with metadata. Exported makes the symbol visible to lookups; Callable distinguishes function pointers from data pointers (relevant for some platforms' ABI).
- absoluteSymbols wraps the map in a MaterializationUnit whose "materialise" step is trivial: the addresses are already known.
Then jd.define(...) installs the unit. From this point on,
jit->lookup("ml_print_int") would return the host address, and so will
ORC's internal linker when it resolves the declare in the user module.
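Interning is what lets a symbol table compare pointers instead of strings. The following toy model illustrates the idea with the standard library only; InternPool and SymbolTable are hypothetical names and are not ORC's SymbolStringPool or JITDylib.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Toy interning pool: equal strings share one stored copy, so equality
// checks reduce to comparing addresses.
class InternPool {
public:
    const std::string* intern(const std::string& name) {
        // unordered_set keeps element addresses stable across rehashing.
        return &*pool_.insert(name).first;
    }
private:
    std::unordered_set<std::string> pool_;
};

// Toy symbol table keyed by interned name, holding absolute addresses.
class SymbolTable {
public:
    void define(const std::string& name, void* addr) {
        addrs_[pool_.intern(name)] = addr;
    }
    // Returns nullptr for unknown names.
    void* lookup(const std::string& name) {
        auto it = addrs_.find(pool_.intern(name));
        return it == addrs_.end() ? nullptr : it->second;
    }
    InternPool& pool() { return pool_; }
private:
    InternPool pool_;
    std::unordered_map<const std::string*, void*> addrs_;  // hashes the pointer
};
```

The map never hashes or compares string contents after interning, which is the property the ExecutionSession-owned pool gives ORC.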
Why not just rely on DynamicLibrarySearchGenerator?
LLJIT has a default generator that searches the host process for symbols by
name. If our runtime functions had public C linkage in the main
executable, that mechanism would find them automatically. We register
explicitly for three reasons:
- Determinism. We control which names are reachable; nothing else leaks from the host into JIT'd code.
- Plumbing for sandboxing. In a production VM you eventually want the JIT to live in a different address space (or a different process). The ExecutorAddr indirection is what makes that swap possible: same API, just point at a remote address.
- It's the same path the type-feedback hook will take. Future VM services (bailout, GC write barrier, deopt) register exactly the same way.
Lookup and call
auto sym = jit->lookup("main"); // ORC compiles `main` on demand
auto fn = sym->toPtr<int64_t(*)()>(); // raw function pointer
int64_t result = fn();
That lookup call is the moment ORC walks the module, runs the
code-generation passes, lowers to machine code, copies the bytes into
executable pages, and applies relocations. From the C++ side it looks like
a hash-table lookup; in reality it's the whole back-end pipeline that cp-16
delegated to llc and clang.
04 — Type Feedback: Foundations of Inline Caching
ml_record_int_arg is the most interesting function in runtime.cpp. It's
called from JIT'd code at every user function entry, once per parameter:
// In ir_emit.cpp:
for (auto& a : fn->args()) {
auto* slot = b.CreateAlloca(i64(), nullptr, ...);
b.CreateStore(&a, slot);
b.CreateCall(fn_record, {ConstantInt::get(i64(), nextSite++)}); // ← here
...
}
Each parameter gets a unique site id chosen at compile time. The runtime
maintains unordered_map<site_id, count>:
extern "C" void ml_record_int_arg(int64_t site) {
g_intCounts[site] += 1;
}
That's a one-line implementation, but it's the same architecture that powers V8's inline caches, HotSpot's TypeProfile, and LuaJIT's traces. The pattern is:
- JIT'd code reports observations to host.
- Host accumulates statistics keyed by call site.
- When some heuristic fires (counter > threshold, distribution narrow enough), the host invalidates the current compile and recompiles with the observed types baked in as assumptions.
- The new code adds a guard: if the assumption is violated, it bails back to the generic path.
cp-17 stops at step 2: we record but never recompile. The ml_record_int_arg
test verifies that the mechanism works end-to-end: a JIT'd print(f(5)) call
increments the counter for site #1 (the first parameter of the first
user-defined function), which the C++ test can observe after the JIT'd code
returns.
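Steps 1 to 3 of the pattern (report, accumulate, fire a heuristic) fit in a small host-side class. This is a sketch under assumptions: SiteProfile and hotSites are hypothetical names, and the threshold is illustrative rather than anything cp-17 defines.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical host-side profile table keyed by call-site id.
class SiteProfile {
public:
    // What a hook like ml_record_int_arg would forward to.
    void recordInt(int64_t site) { ++counts_[site]; }

    uint64_t count(int64_t site) const {
        auto it = counts_.find(site);
        return it == counts_.end() ? 0 : it->second;
    }

    // Sites whose counter reached `threshold`: candidates for a specialised
    // recompile. Sorted so callers get a deterministic order.
    std::vector<int64_t> hotSites(uint64_t threshold) const {
        std::vector<int64_t> hot;
        for (const auto& [site, n] : counts_)
            if (n >= threshold) hot.push_back(site);
        std::sort(hot.begin(), hot.end());
        return hot;
    }

private:
    std::unordered_map<int64_t, uint64_t> counts_;
};
```

A recompiling tier would poll hotSites between executions and hand the result to the IR emitter, which is the step cp-17 leaves out.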
Why this needs the runtime-symbol plumbing from step 03
The recording call is just call void @ml_record_int_arg(i64 1). There's no
LLVM magic — it's an external function call. The reason it works is that we
already taught ORC that the name ml_record_int_arg resolves to a real C
function in our process. The whole "type feedback" feature is purely:
- IR emitter inserts a function call at the right place.
- Host registers the target.
- Host reads the counter later.
Every dynamic-language VM's profiling subsystem is layered on the same idea.
What you'd add next
- Per-type counters: distinguish int / string / object / null. We have only one bucket; a real system stores a small set with frequencies.
- Site-keyed cache slots: replace counters with a small struct {type_tag, cached_method, miss_count} per site. That's an inline cache.
- Tiered compilation: once a counter crosses a threshold, queue the function for recompilation at a higher tier (e.g. with arguments specialised to i64). Keep the old code as the bail target.
- Deoptimisation: when an assumption fails at runtime, jump from the optimised frame back into the unoptimised one with reconstructed state. This is the hardest part and a topic in its own right.
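The site-keyed cache slot from the list above can be sketched as a monomorphic inline cache. Everything except the three field names is hypothetical: the TypeTag values, the two stand-in methods, and genericLookup are placeholders for a real VM's dispatch machinery.

```cpp
#include <cstdint>

enum class TypeTag : uint8_t { None, Int, Str };
using Method = int64_t (*)(int64_t);

// One slot per call site, with the fields named in the text.
struct InlineCache {
    TypeTag type_tag = TypeTag::None;
    Method cached_method = nullptr;
    uint32_t miss_count = 0;
};

// Stand-ins for type-specific implementations.
inline int64_t intDouble(int64_t v) { return v * 2; }
inline int64_t strEcho(int64_t v) { return v; }

// Stand-in for the slow, generic method resolution.
inline Method genericLookup(TypeTag t) {
    return t == TypeTag::Int ? &intDouble : &strEcho;
}

// Guard + fast path: one compare decides whether the cached target is valid.
inline int64_t dispatch(InlineCache& ic, TypeTag observed, int64_t arg) {
    if (ic.type_tag != observed || ic.cached_method == nullptr) {
        ++ic.miss_count;                              // guard failed
        ic.cached_method = genericLookup(observed);   // slow path
        ic.type_tag = observed;                       // re-prime the cache
    }
    return ic.cached_method(arg);
}
```

A site whose miss_count keeps climbing is polymorphic, which is the signal a real VM uses to switch to a multi-entry cache or give up on specialising that site.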
05 — Strings as Private Globals + GEP
The Str expression node is the only non-numeric value type in cp-17. Its
lowering reveals two LLVM concepts every IR emitter must internalise:
GlobalVariable for constant data, and getelementptr (GEP) for
address arithmetic.
case Expr::K::Str: {
Constant* s = ConstantDataArray::getString(ctx, e.str, /*addNull=*/true);
auto* gv = new GlobalVariable(
mod, s->getType(), /*isConstant=*/true,
GlobalValue::PrivateLinkage, s, ".str");
return b.CreateInBoundsGEP(
s->getType(), gv,
{ConstantInt::get(i32(), 0), ConstantInt::get(i32(), 0)});
}
Step by step:
- ConstantDataArray::getString builds an [N x i8] constant from the string bytes, optionally NUL-terminated. This is the value.
- new GlobalVariable(...) registers that constant as a module-level symbol with PrivateLinkage (linker-internal, won't conflict across modules) and isConstant=true (the optimiser may place it in .rodata). The variable's type is [N x i8], and gv is a Constant* pointing to it (with opaque pointers, as in LLVM 20, the pointer is just ptr).
- CreateInBoundsGEP computes the address of element [0][0]:
  - the first index 0 steps through the pointer (the typical idiom for "the array itself, not array #N");
  - the second index 0 steps to the first byte inside the array.
  The result is a ptr to byte 0 of the string, exactly what ml_print_str(const char*) expects.
The two-index GEP for arrays
You'll see {0, 0} patterns everywhere in LLVM IR for "decay an array to a
char*". The mental model:
gv : ptr to [N x i8]
gep 0 : same as gv (no offset, but lets us index into the pointee)
gep 0,0 : pointer to the first i8 in the array
Conceptually like C's &str[0]. If the global were [10 x [4 x i8]],
{0, i, j} would give &str[i][j]. The first index is special; subsequent
indices walk the aggregate type.
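The index arithmetic can be checked with a worked example. The sketch below models the byte offset a GEP computes for a [Outer x [Inner x i8]] global and compares it with the equivalent C++ array addressing; gepOffset and matchesCxxLayout are hypothetical helper names.

```cpp
#include <cstdint>

// Byte offset of `getelementptr [Outer x [Inner x i8]], ptr p, i0, i, j`.
// The first index scales by the size of the whole array; later indices
// step into the aggregate. The element type is i8, so its size is 1.
constexpr int64_t gepOffset(int64_t outer, int64_t inner,
                            int64_t i0, int64_t i, int64_t j) {
    return i0 * (outer * inner) + i * inner + j;
}

static_assert(gepOffset(10, 4, 0, 0, 0) == 0);   // {0, 0, 0}: first byte
static_assert(gepOffset(10, 4, 0, 2, 3) == 11);  // {0, 2, 3}: &str[2][3]
static_assert(gepOffset(10, 4, 1, 0, 0) == 40);  // {1, 0, 0}: one whole array further

// The same addresses expressed with a C++ pointer-to-array.
inline bool matchesCxxLayout() {
    static char str[10][4] = {};
    char (*gv)[10][4] = &str;        // plays the role of "ptr to [10 x [4 x i8]]"
    char* first = &(*gv)[0][0];      // like GEP {0, 0, 0}
    char* elem = &(*gv)[2][3];       // like GEP {0, 2, 3}
    const auto delta = reinterpret_cast<std::uintptr_t>(elem) -
                       reinterpret_cast<std::uintptr_t>(first);
    return delta == static_cast<std::uintptr_t>(gepOffset(10, 4, 0, 2, 3));
}
```

The third static_assert shows why the leading index is almost always 0 for globals: any other value steps past the object entirely.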
Why opaque pointers don't change this
LLVM no longer has typed pointers (opaque pointers became the default in
LLVM 15 and typed pointers were removed in LLVM 17), so every pointer in IR
is just ptr. But the GEP instruction still needs to know the pointee type to
compute offsets. That's what the explicit s->getType() (the array type)
argument to CreateInBoundsGEP is for. The IRBuilder no longer infers it from
the pointer's type, because the pointer type no longer carries a pointee type.
Lifetime
The GlobalVariable is owned by the Module. When the Module is moved
into ThreadSafeModule and handed to ORC, ownership transfers. After
JITting, the global's bytes live somewhere in the JIT's data section, and
the pointer we returned is valid for as long as the LLJIT instance lives.
For test code this is fine; for a long-running VM you'd care about reclaiming
unused string globals (a job for ORC's resource-tracker API).
06 — verifyModule and Debugging JIT Crashes
A JIT failure mode looks like this: lookup("main") returns a function
pointer. You call it. The process crashes with SIGBUS, SIGSEGV, or — if
you're unlucky — silently produces wrong output. There is no stack trace
pointing back at the IR that did it. Debugging requires reproducing the bug
in tools designed for native code (lldb, instruments) against memory pages
that didn't exist a moment ago.
The single best defense is to verify before you JIT. cp-17 does this
at the end of emitModule:
std::string verr;
raw_string_ostream os(verr);
if (verifyModule(*r.module, &os)) {
r.error = "verifyModule failed:\n" + os.str();
r.module.reset();
}
verifyModule catches:
- Basic blocks without terminators (the #1 emitter bug).
- Multiple terminators in a basic block.
- Branches to blocks in the wrong function.
- Type mismatches in operands (e.g., passing i32 where i64 is expected).
- PHI nodes with the wrong number of incoming values.
- Use-before-def on Values outside their dominator scope.
- Function signature mismatches between caller and callee.
- ret void from an i64-returning function (and vice versa).
Every one of those will at best cause a JIT crash and at worst quietly miscompile. Catch them at IR emission and you save days.
Reading verifier output
The verifier names the broken instruction and prints surrounding context:
Terminator found in the middle of a basic block!
label %then
When you see this, the cause is almost always:
- You emitted a ret or br, then forgot to switch insertion point and emitted more instructions into the same now-closed block, or
- You forgot to check GetInsertBlock()->getTerminator() before appending a fall-through branch after an if/while body that already returned.
The cure in ir_emit.cpp is the if (!b.GetInsertBlock()->getTerminator())
guard before any CreateBr/CreateRet.
When the verifier passes but the JIT still crashes
Common remaining causes:
- Mismatched signature or calling convention between a declared extern and the host function (e.g., declaring void f(i64) when the C function takes i32). The verifier can't catch this because both sides look internally consistent. Treatment: keep signatures in one place (a header) and reference them from both the emitter and the runtime.
- Mutating a Module after adding it to the JIT. ORC takes ownership; subsequent edits via the stale pointer are undefined.
- Forgetting InitializeNativeTarget() and its companion InitializeNativeTargetAsmPrinter(): LLJITBuilder().create() will return an error like "no available targets are compatible with this triple". jit.cpp calls these once via std::call_once.
Dumping IR while debugging
mod.print(errs(), nullptr) will dump the module as text to stderr. Add it
right before the verifyModule call and you can copy-paste into llc or
opt to reproduce a bug outside the JIT. The text and the in-memory form
round-trip exactly, so behaviour is identical.
07 — Where to Grow
cp-17 is a complete dynamic-language JIT in maybe 800 lines. It demonstrates the architecture; it doesn't demonstrate the optimisations that make JITs worth their complexity. Here's the roadmap if you wanted to grow this into a real VM.
1. Interpreter tier
Real JITs don't JIT first. They interpret bytecode until a function gets
warm, then JIT. Why? Because compilation is expensive and most code runs
once. cp-17 currently JITs everything immediately, paying the LLVM
compile cost even for print 42.
Add: a bytecode design (stack-based, small), an interpreter loop in C++, per-function call counters, a threshold (say 1000 calls), and a queue of "functions to compile". The JIT becomes the second tier, not the first.
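The counter-threshold-queue part of that design is small enough to sketch. TierManager is a hypothetical name; a real VM would key on function objects rather than names and would drain the queue on a background compile thread.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>

// Hypothetical tier-up bookkeeping for an interpreter-first VM.
class TierManager {
public:
    explicit TierManager(uint64_t threshold) : threshold_(threshold) {}

    // Called by the interpreter on every function entry.
    void onCall(const std::string& fn) {
        // `==` rather than `>=` so each function is queued exactly once.
        if (++calls_[fn] == threshold_) queue_.push_back(fn);
    }

    // Pops the next function awaiting JIT compilation, if any.
    bool nextToCompile(std::string& out) {
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    uint64_t threshold_;
    std::unordered_map<std::string, uint64_t> calls_;
    std::deque<std::string> queue_;
};
```

With a threshold of 1000, a script that calls print once never pays the compile cost, while a recursive fib crosses the line almost immediately.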
2. Inline caches
Right now every function call goes through full generic IR with no
specialisation. The hook for change is there (ml_record_int_arg proves you
can collect type observations) but the IR doesn't use them.
Add: a CallSite struct keyed by (function, bytecode offset). On each
call, write the receiver type into a small slot. On recompile, generate
code that checks the cached type with a single compare-and-branch
("guard") and then proceeds along the fast path. On guard failure, fall
back to the generic dispatch.
That single mechanism — type guard + cached fast path — is most of what makes V8 and LuaJIT fast.
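To make the guard-plus-fast-path idea concrete, here is a sketch written as ordinary C++ rather than emitted IR. Everything in it is an assumption for illustration: `Boxed`, `CallSite`, `genericToDouble`, and `cachedToDouble` are hypothetical names, and a real JIT would bake the cached tag into the generated code so that only the single tag compare remains.

```cpp
#include <cstdint>

enum class Tag : std::uint8_t { Int, Float };

struct Boxed {                 // a toy tagged value
    Tag tag;
    std::int64_t i;
    double f;
};

struct CallSite {              // one per (function, bytecode offset)
    bool seeded = false;       // has any type been observed yet?
    Tag cachedTag = Tag::Int;  // the last observed receiver type
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;
};

// Generic path: dispatch on the dynamic tag and (re)seed the cache slot.
double genericToDouble(CallSite& site, const Boxed& v) {
    site.seeded = true;
    site.cachedTag = v.tag;
    return v.tag == Tag::Int ? static_cast<double>(v.i) : v.f;
}

// Models code specialised for an Int-seeded site: a guard on the tag,
// then the fast path. Any guard failure falls back to the generic path.
double cachedToDouble(CallSite& site, const Boxed& v) {
    if (site.seeded && site.cachedTag == Tag::Int && v.tag == Tag::Int) {
        ++site.hits;
        return static_cast<double>(v.i);   // fast path, no dispatch
    }
    ++site.misses;
    return genericToDouble(site, v);
}
```

The hit and miss counters are there only to make the behaviour observable; a production inline cache would use them, if at all, to decide when a site has become polymorphic.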
3. Deoptimisation / OSR
Once you have guards, you need to handle guard failures. The optimised frame is laid out differently from the interpreter's stack; on a bailout you must reconstruct the interpreter state from the optimised frame's registers and spills, then resume in the interpreter.
This is on-stack replacement (OSR) in the deopt direction. The OSR-in direction (interpreter → JIT, mid-loop) is also useful: detect a hot loop, JIT it, patch the interpreter to jump into the JIT'd loop with current state.
Both are hard. Both require precise side-tables emitted by the JIT describing how every interpreter value maps to a JIT location at every deopt point.
4. Hidden classes for objects
cp-17 has no objects. When you add them: every object header should point to a shape (V8 calls them "maps", JavaScriptCore "structures") that describes its layout. Two objects with the same key sequence share a shape; adding a key transitions to a new shape, recording the transition.
Why? Because inline caches key on shape, not on dynamic type. A property lookup becomes "load shape pointer; compare to cached shape; if match, load at cached offset; else miss". This is the single most important optimisation for dynamic-OO languages and it falls out naturally from the inline-cache infrastructure above.
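A shape tree with recorded transitions fits in about thirty lines. The sketch below is hypothetical (cp-17 has no objects, and `Shape`, `withKey`, and `Object` are names invented here); it shows why two objects built with the same key sequence end up sharing one shape pointer.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical shape ("hidden class") tree; cp-17 has nothing like this yet.
struct Shape {
    std::unordered_map<std::string, std::size_t> offsets;                 // key -> slot index
    std::unordered_map<std::string, std::unique_ptr<Shape>> transitions;  // key -> child shape

    // Follow (or create and record) the transition taken when `key` is added.
    Shape* withKey(const std::string& key) {
        auto it = transitions.find(key);
        if (it != transitions.end()) return it->second.get();
        auto child = std::make_unique<Shape>();
        child->offsets = offsets;
        child->offsets.emplace(key, offsets.size());  // new key takes the next slot
        Shape* raw = child.get();
        transitions.emplace(key, std::move(child));
        return raw;
    }
};

struct Object {
    Shape* shape;                     // the header pointer to the layout description
    std::vector<std::int64_t> slots;  // property values, indexed by shape offsets

    void set(const std::string& key, std::int64_t v) {
        auto it = shape->offsets.find(key);
        if (it != shape->offsets.end()) { slots[it->second] = v; return; }
        shape = shape->withKey(key);  // layout changed: move to the child shape
        slots.push_back(v);
    }
};
```

An inline cache for `obj.y` would then store the pair (shape pointer, offset) and guard on `obj.shape == cachedShape` before loading `slots[cachedOffset]`.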
5. Garbage collection
The runtime currently leaks every string global into the JIT's data section. A real VM needs:
- A heap with allocation, marking, and reclamation.
- GC roots identified in optimised frames (more side-tables from the JIT).
- Write barriers for generational GC (yet another runtime symbol the JIT must inject around every store-to-heap).
cp-14 (Runtime Systems) showed a tagged-value layout; this is where you'd plug a real collector under it.
6. Concurrency
ORC supports multi-threaded compilation out of the box (LLJIT is
thread-safe; that's what ThreadSafeModule is about). A real VM compiles
on background threads while the main thread keeps interpreting, then
atomically swaps the function entry pointer when the JIT result is ready.
Where this lab leaves you
Concretely, after cp-17 you should be able to:
- Build an llvm::Module from an AST with IRBuilder (no text).
- Wire a runtime function into JIT'd code via ORC's symbol API.
- Emit a runtime callback at any IR point and read its results from C++.
- Diagnose JIT bugs with verifyModule before they crash.
These are the muscles. The rest — IC, deopt, hidden classes, GC — are combinations of them.
cp-18 — Capstone: An MLIR-Style Compiler Framework
A self-contained, ~700-line reimplementation of MLIR's core ideas — Operations,
Regions, Blocks, Values, Types, Attributes, Builders, Passes, and conversion
between dialects — with zero LLVM/MLIR dependency. Two demonstration dialects
(tiny.* and ll.*), a constant-folder, a DCE pass, and a lowering pass that
rewrites tiny.* into ll.*.
The point isn't to use MLIR; it's to understand its architecture by rebuilding the skeleton. After cp-18 you can read MLIR source code and recognise every concept.
Build & test
cd src/cpp
cmake -S . -B build && cmake --build build
./build/tests/test_mlf # 35/35 checks passed
./build/mlfdriver --tiny - # parse stdin, print tiny.* IR
./build/mlfdriver --opt - # ... after fold+DCE
./build/mlfdriver - # ... lowered to ll.*
Example session:
$ echo "let x = 2 * 3 + 1; print x;" | ./build/mlfdriver
"module"() {
  "ll.func"() {sym_name = "main"} {
    %0 = "ll.const"() {value = 7} : () -> (i64)
    "ll.call"(%0) {callee = "ml_print_int"} : (i64) -> ()
    %1 = "ll.const"() {value = 0} : () -> (i64)
    "ll.ret"(%1)
  }
}
Layout
- src/cpp/src/mlf.{hpp,cpp}: the framework (Op, Region, Block, Value, Builder, walks).
- src/cpp/src/dialects.{hpp,cpp}: tiny.* and ll.* op constructors.
- src/cpp/src/passes.{hpp,cpp}: Pipeline + constantFold + DCE.
- src/cpp/src/lowering.{hpp,cpp}: tiny → ll dialect conversion.
- src/cpp/src/printer.{hpp,cpp}: MLIR-flavoured IR printer.
- src/cpp/src/parser.{hpp,cpp}: tiny surface language → tiny.* IR.
- src/cpp/src/main.cpp: mlfdriver CLI.
- src/cpp/tests/test_mlf.cpp: 7 tests, 35 checks.
- steps/01..07.md: narrative.
Mapping to real MLIR
| cp-18 | MLIR equivalent |
|---|---|
| mlf::Op | mlir::Operation |
| mlf::Region | mlir::Region |
| mlf::Block | mlir::Block |
| mlf::Value | mlir::Value |
| mlf::Type | mlir::Type |
| mlf::Attribute | mlir::Attribute |
| mlf::Builder | mlir::OpBuilder |
| mlf::pass::Pipeline | mlir::PassManager |
| mlf::convert::lowerTinyToLL | dialect-conversion pass with rewrite patterns |
Tests
- Hand-build a module via Builder.
- constantFold shrinks 1 + 2 + 3 → 6.
- DCE after folding deletes the now-dead literals.
- DCE preserves a const used by tiny.print.
- Lowering: zero tiny.* ops remain after lowerTinyToLL.
- End-to-end: parse → fold → DCE → lower → check the lowered IR.
- Parser surfaces a clear error for malformed input.
01 — Why an MLIR-Style Framework?
LLVM IR is one intermediate representation. It works beautifully for languages whose operations map onto C-like primitives — integer arithmetic, memory load/store, function call. It works poorly for anything else:
- Tensor compilers want matmul, convolution, reduce as first-class ops. Expressing these in LLVM IR loses too much structure to recover later.
- Hardware DSLs want device-specific ops (gpu.launch, spirv.kernel, nvvm.barrier0) that LLVM IR has no good way to represent.
- Polyhedral compilers want loop nests, affine maps, and dependency information that LLVM scalar evolution can only partially reconstruct.
The historical answer was: each project invented its own IR (XLA HLO, TensorFlow Graph, Halide, Tiramisu, …) and re-implemented passes, printers, parsers, and verifiers. Every project paid the same tax.
MLIR's answer: make the IR itself extensible. Provide a single skeleton (Operations, Regions, Blocks, Values, Types, Attributes) and let each project plug in custom dialects that define their own ops, types, and conversions. Re-use the printer, the parser, the pass manager, the canonicaliser — the whole infrastructure — across every dialect.
What a dialect is
A dialect is a namespace of ops with custom semantics. Examples in real MLIR:
- arith.*: integer and float arithmetic.
- linalg.*: structured linear-algebra ops (matmul, conv).
- tosa.*: neural-network ops at a higher level.
- scf.*: structured control flow (if, for, while).
- memref.*: buffers with strides and offsets.
- gpu.*, nvvm.*, spirv.*: device dialects.
- llvm.*: a one-to-one mirror of LLVM IR, used as a lowering sink.
A typical compile looks like a sequence of dialect rewrites:
tosa → linalg → scf + memref → llvm → LLVM IR → machine code
Each step is a pass. Each pass is built from rewrite patterns. Each pattern matches a small subgraph of ops and replaces it with another small subgraph. Eventually no ops from the source dialect remain, and you've lowered the program one tier closer to hardware.
What cp-18 reproduces
We model exactly this skeleton at teaching scale: two dialects
(tiny.* and ll.*), a constant-folding pass, a dead-code-elimination pass,
and a lowering that rewrites tiny.* → ll.*. After it runs, no tiny.*
ops remain. That's the MLIR programming model in miniature.
What cp-18 does not reproduce
- TableGen. Real MLIR generates op classes, builders, verifiers, parsers, and printers from .td files. We hand-write them.
- Pattern matching. Real MLIR has a pattern-rewriting framework (RewritePattern, OpRewritePattern<T>) with declarative front ends layered on top (DRR in TableGen, PDL). Our "patterns" are if-chains in lowering.cpp: same logic, no sugar.
- Verification. Real MLIR runs structural verifiers on every op. We trust the builder.
- Type system depth. Real MLIR types form a class hierarchy and can carry shapes, layouts, and dialects. Ours are strings.
The framework you build here is not a replacement; it's a reading companion. After it, every concept in MLIR has a hook in your head to hang on.
02 — The IR Skeleton: Ops, Regions, Blocks, Values
MLIR's core insight is a uniform IR shape. Every operation — whether it represents a constant, a loop, a function, or a module — has the same structure:
operation:
name : string ("dialect.opname")
operands : list of Values used as inputs
results : list of Values produced as outputs
attributes : list of (name, constant) — compile-time metadata
regions : list of Regions — nested IR
parent block : back-pointer
A Region is a list of Blocks. A Block is a list of Operations plus a list of block arguments (SSA values that flow in at the head of the block). The graph is recursive: regions contain blocks, blocks contain ops, ops contain regions, ad infinitum.
cp-18 implements this directly:
struct Op {
    std::string name;
    std::vector<Value*> operands;
    std::vector<std::unique_ptr<Value>> results;
    std::vector<std::unique_ptr<Region>> regions;
    std::vector<NamedAttr> attrs;
    Block* parent = nullptr;
    Op* prev = nullptr;
    Op* next = nullptr;
};
struct Block {
    std::vector<std::unique_ptr<Value>> args;
    Op* first = nullptr;
    Op* last = nullptr;
    Region* parent = nullptr;
};
struct Region {
    std::vector<std::unique_ptr<Block>> blocks;
    std::vector<std::unique_ptr<Op>> opStore; // ownership
    Op* parentOp = nullptr;
};
Why a single shape for everything?
Because every algorithm that walks IR can use the same traversal. The constant folder, the DCE pass, the printer, the verifier — none of them need special cases for "is this a function or a basic op". A function is just an op with one region. A loop is an op with one region. A module is an op with one region. Same data structure, same walks.
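The "same data structure, same walks" claim can be checked with a single recursive function. The sketch below uses simplified stand-ins for cp-18's structs (plain vectors replace the intrusive op list, and `walk` and `makeRegionOp` are names assumed here, not taken from mlf.hpp).

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for cp-18's structs: plain vectors replace the
// intrusive op list so the sketch stays short.
struct Op;
struct Block  { std::vector<std::unique_ptr<Op>> ops; };
struct Region { std::vector<std::unique_ptr<Block>> blocks; };
struct Op {
    std::string name;
    std::vector<std::unique_ptr<Region>> regions;
};

// Pre-order walk: visit the op, then every op nested inside its regions.
// Modules, functions, and leaf ops all go through this one code path.
void walk(Op& op, const std::function<void(Op&)>& visit) {
    visit(op);
    for (auto& region : op.regions)
        for (auto& block : region->blocks)
            for (auto& child : block->ops)
                walk(*child, visit);
}

// Helper for examples: an op with one region holding one empty block.
std::unique_ptr<Op> makeRegionOp(const std::string& name) {
    auto op = std::make_unique<Op>();
    op->name = name;
    op->regions.push_back(std::make_unique<Region>());
    op->regions[0]->blocks.push_back(std::make_unique<Block>());
    return op;
}
```

A printer, a DCE pass, and a verifier would each be a different `visit` callback handed to the same `walk`.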
Compare with LLVM IR, where Module, Function, BasicBlock,
Instruction are four distinct classes with four distinct traversal
APIs. Every pass that operates above the instruction level has to
hand-roll its own walk over the right level of the hierarchy.
MLIR's nested-op design is a strict generalisation: anything LLVM can
express, MLIR can express with llvm.func and llvm.* op-per-instruction.
But the reverse isn't true.
Linked-list blocks
cp-18 uses an intrusive linked list of ops inside each block (Op::prev,
Op::next). This matters because passes constantly insert and delete
ops in the middle of blocks. A vector<unique_ptr<Op>> would invalidate
iterators on every mutation and require O(n) shifts.
cp-18's trick: ownership lives in a separate opStore on the
parent region (a vector of unique_ptr<Op>). The list pointers in Op
form the logical sequence. Erasing an op detaches it from the list
but leaves the storage alive until the region dies. That's why
Block::eraseOp doesn't actually delete. Real MLIR also uses an intrusive
list, but there the list owns the ops and Operation::erase destroys the
op immediately.
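Here is a minimal sketch of that split between list membership and ownership. It is a simplification under stated assumptions: the `opStore` sits on the block instead of the region to keep the example small, and `make`, `append`, and `insertBefore` are illustrative names rather than cp-18's actual API.

```cpp
#include <memory>
#include <string>
#include <vector>

struct Op {
    std::string name;
    Op* prev = nullptr;
    Op* next = nullptr;
};

struct Block {
    Op* first = nullptr;
    Op* last = nullptr;
    std::vector<std::unique_ptr<Op>> opStore;  // ownership (on the region in cp-18)

    // Allocate an op that is owned by the store but not yet in the list.
    Op* make(const std::string& name) {
        opStore.push_back(std::make_unique<Op>());
        opStore.back()->name = name;
        return opStore.back().get();
    }
    void append(Op* op) {
        op->prev = last;
        op->next = nullptr;
        if (last) last->next = op; else first = op;
        last = op;
    }
    // O(1): only the neighbours' pointers are touched, nothing shifts.
    void insertBefore(Op* pos, Op* op) {
        op->next = pos;
        op->prev = pos->prev;
        if (pos->prev) pos->prev->next = op; else first = op;
        pos->prev = op;
    }
    // Detach only; the storage stays alive in opStore.
    void eraseOp(Op* op) {
        if (op->prev) op->prev->next = op->next; else first = op->next;
        if (op->next) op->next->prev = op->prev; else last = op->prev;
        op->prev = op->next = nullptr;
    }
};
```

Because erased ops stay allocated, any stale `Op*` held by a pass still points at valid memory, which makes rewrites forgiving at the cost of holding dead ops until the owner is destroyed.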
Block arguments instead of phi nodes
LLVM IR uses phi instructions at the start of merge blocks:
%x = phi i64 [%a, %B1], [%b, %B2]
MLIR (and cp-18) uses block arguments:
^bb3(%x: i64):
...
with predecessors supplying values when they branch in. Functionally equivalent; structurally cleaner. Block-argument indexing matches operand indexing of the branch op, so you never have to scan the block header to match incoming values to predecessors.
cp-18 doesn't yet emit block arguments (the tiny.* dialect has no
control flow), but Block::addArg is there for when you extend it.
03 — Builders, Insertion Points, and Op Creation
In LLVM you have IRBuilder<>. In MLIR you have OpBuilder. In cp-18 you
have mlf::Builder. All three serve the same purpose: encapsulate the
"where am I currently inserting?" cursor so op-construction calls can stay
short.
mlf::Builder b;
b.setInsertionPointToEnd(funcBody);
Value* lhs = b.create("tiny.const", {}, {i64Ty()}, {{"value", Attribute::integer(6)}})
                 ->result(0);
Value* rhs = b.create("tiny.const", {}, {i64Ty()}, {{"value", Attribute::integer(7)}})
                 ->result(0);
b.create("tiny.mul", {lhs, rhs}, {i64Ty()});
Each create allocates an Op, populates its operands/results/attributes,
splices it into the current block at the insertion point, and returns a
raw pointer (ownership rests with the region's opStore).
Three insertion modes
- setInsertionPointToStart(block): prepend new ops.
- setInsertionPointToEnd(block): append new ops (the common case).
- setInsertionPointBefore(op): insert immediately before a known op (the constant folder uses this to splice the folded const).
Why insertion-point APIs and not "just append"?
Because rewriters need to insert in the middle. The constant folder finds
a tiny.add op, computes the folded result, and emits a new tiny.const
right before the old add. That fresh const needs to land between the
last constant and the add — not at the end of the block.
Builder bld;
bld.setInsertionPointBefore(op);
Op* foldedConst = bld.create("tiny.const", {}, ..., {{"value", ...}});
replaceAllUses(moduleOp, op->result(0), foldedConst->result(0));
op->parent->eraseOp(op);
The four-step pattern (point, create, replace, erase) is the entire shape of rewrite-based optimisation.
SSA name management
Every op result gets a name like %0, %1, %2 from a counter in the
builder. The counter is per-builder, which means a fresh builder gives
fresh names — useful for nested function bodies. The names are only for
printing; the IR's identity is the Value* pointer.
Real MLIR follows the same principle and goes a step further: values store
no names at all. SSA names in textual IR are reconstructed at print time
from an AsmState that walks the op tree assigning fresh names.
The in-memory IR uses pointer identity.
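Print-time numbering is easy to demonstrate in isolation. In this sketch (simplified, single-result ops; the `print` function and the struct layout are assumptions, not cp-18's printer) the IR stores no names, and `%N` labels come from a map built while printing.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct Value {};   // identity is the address; no name is stored
struct Op {
    std::string name;
    std::vector<Value*> operands;
    Value result;  // single result, to keep the sketch small
};

// Assign %0, %1, ... in definition order while printing.
// Operands must be defined by an earlier op in `ops`, otherwise at() throws.
std::string print(const std::vector<Op*>& ops) {
    std::unordered_map<const Value*, int> ids;
    std::ostringstream out;
    for (Op* op : ops) {
        int id = static_cast<int>(ids.size());
        ids.emplace(&op->result, id);
        out << "%" << id << " = \"" << op->name << "\"(";
        for (std::size_t i = 0; i < op->operands.size(); ++i) {
            if (i) out << ", ";
            out << "%" << ids.at(op->operands[i]);
        }
        out << ")\n";
    }
    return out.str();
}
```

Printing a different slice of the same ops renumbers from `%0`, which is exactly why passes must never key anything on the printed name.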
Op result vs value
A subtle but important distinction:
- Op* is the operation: the thing with a name, attributes, regions.
- Value* is one of its results: what an operand points at.
op->result(0) returns the first result Value. You almost always pass
Value* (not Op*) into other ops' operand lists. cp-18's API forces
this: create takes vector<Value*> for operands.
What we left out
Real MLIR's OpBuilder also tracks:
- A Listener for rewrites (so pattern drivers can be notified of changes).
- A Location attached to every created op (for diagnostics).
- Type inference via op interfaces and traits (InferTypeOpInterface, SameOperandsAndResultType, etc.).
All are nice to have, none change the picture. The core abstraction is the insertion point.
04 — Defining Dialects
A dialect in cp-18 is just a namespace of helpers that construct ops with the right name + signature. There's no class hierarchy, no registration, no inheritance.
namespace mlf::tiny {
Op* constant(Builder& b, int64_t v) {
    return b.create("tiny.const", {}, {i64Ty()},
                    {{"value", Attribute::integer(v)}});
}
Op* add(Builder& b, Value* l, Value* r) {
    return b.create("tiny.add", {l, r}, {i64Ty()});
}
// ...
} // namespace mlf::tiny
Three things define an op:
- Name: a dotted string "dialect.op". Used by passes to match.
- Signature: operand types, result types, region count.
- Attributes: name → constant. value for tiny.const, sym_name for tiny.func, etc.
That's it. A dialect is a contract about what those three things mean.
Why two dialects?
cp-18 ships tiny.* (high-level, source-aware) and ll.* (low-level, LLVM-ish). The reason for the split is the same reason real MLIR has ~30 built-in dialects: different passes want different abstractions.
- On tiny.* we can run constant folding trivially: operands of tiny.add either come from tiny.const or they don't. No interleaved loads/stores, no aliasing, no ABI quirks.
- On ll.* we'd run register allocation, calling-convention rewrites, memory-layout passes: all things that need to know about the lowered representation.
If you tried to do both at the same level, you'd have one giant dialect
where every pass needs if (op.name == "tiny.add" || op.name == "ll.add") ...
checks. Splitting cleanly separates concerns.
Dialects as a contract
When passes.cpp writes:
if (op->name == "tiny.add" || op->name == "tiny.mul") { ... }
it's relying on the contract that those op names always mean what
dialects.cpp says they mean. If someone adds a tiny.add with two
results, or with a side effect, that contract breaks and the fold pass
becomes a miscompiler.
Real MLIR codifies this with op interfaces and traits: a pass
declares "I match anything implementing BinaryOp" (an illustrative
interface name), and the trait system guarantees the matched op has the
expected structure. cp-18 trusts the dialect-helper API as the contract.
How would you add a new dialect?
Pick a name (e.g. tensor.*), decide on op signatures, write helpers in
the namespace. That's the user-facing work. The framework requires no
changes: the printer, the walks, the rewrites all operate on opaque
Op objects.
In real MLIR you'd also subclass Dialect, register your ops, write
verifiers, generate them from TableGen, etc. The architectural shape is
the same as cp-18; the production scaffolding is heavier.
Function ops vs basic ops
tiny.func is a region-carrying op: it has one region containing the
function body. Same for ll.func. Notice how this is just an op —
no special "Function" class in the IR. That's MLIR's design choice: at
the IR level a function isn't fundamentally different from an scf.for
or an scf.if. They all carry regions; they all participate in the same
walks; they all live in the same pass manager.
The implication: you can put a function inside another op. Closures,
nested function definitions, module-of-modules — none require special
casing. Real MLIR exploits this constantly (e.g. gpu.module contains
gpu.func).
05 — Pass Infrastructure and the Canonicalisation Idea
A pass in cp-18 is just a function:
using Pass = std::function<bool(Op& moduleOp)>;
It mutates the module and returns true if anything changed. A Pipeline
runs them in sequence:
mlf::pass::Pipeline pipe;
pipe.add("constant-fold", mlf::pass::constantFold);
pipe.add("dce", mlf::pass::deadCodeElimination);
pipe.run(moduleOp);
That's the entire model. Real MLIR's PassManager adds threading,
caching of analyses, scheduling at different nesting levels (module pass
vs function pass vs op pass), pass options, statistics, and pipeline
specification via textual config. None of those change the core idea:
passes are functions, pipelines compose them.
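A pipeline with this shape needs very little code. The following is a sketch, not cp-18's passes.hpp: the `Op` here is a one-field stand-in, and the return value of `run` and the `lastRun` accessor are assumptions added so the behaviour can be observed.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Op { int opCount = 0; };   // stand-in for mlf::Op

using Pass = std::function<bool(Op& moduleOp)>;

class Pipeline {
public:
    void add(std::string name, Pass pass) {
        passes_.emplace_back(std::move(name), std::move(pass));
    }
    // Runs every pass once, in registration order.
    // Returns true if any pass reported a change.
    bool run(Op& moduleOp) {
        bool changed = false;
        ran_.clear();
        for (auto& [name, pass] : passes_) {
            ran_.push_back(name);
            changed |= pass(moduleOp);
        }
        return changed;
    }
    // Names of the passes executed by the most recent run().
    const std::vector<std::string>& lastRun() const { return ran_; }
private:
    std::vector<std::pair<std::string, Pass>> passes_;
    std::vector<std::string> ran_;
};
```

Keeping the name next to each pass is what later makes timing, statistics, and "print IR after pass X" debugging cheap to add.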
Constant folding as a model rewrite
constantFold in cp-18 demonstrates the universal rewrite pattern:
for (Op* op in candidates):
    if op matches a known shape:                     // ← pattern matcher
        compute folded value                         // ← semantic step
        Builder b; b.setInsertionPointBefore(op);
        Op* replacement = b.create("...", ...);      // ← rewriter
        replaceAllUses(root, op->result(0), replacement->result(0));
        op->parent->eraseOp(op);                     // ← cleanup
        restart   // because the IR shape changed under us
Real MLIR formalises this as RewritePattern. You write a class with a
matchAndRewrite method, register it in a pattern set, and a driver
(applyPatternsGreedily, called applyPatternsAndFoldGreedily in older
releases) handles the fixpoint loop. The driver solves three problems
cp-18 punts on:
- Termination. Fixpoint iteration can loop if patterns keep undoing each other. MLIR's greedy driver caps the number of iterations and reports failure to converge instead of spinning forever.
- Pattern selection. When multiple patterns match, MLIR tries them in order of declared benefit (PatternBenefit), highest first. cp-18 only has one folding rule, so no ambiguity.
- Worklist management. Newly inserted ops should be revisited. MLIR keeps a worklist; cp-18 restarts the whole walk after each change (correct but quadratic).
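The restart-after-every-change strategy looks like this on a deliberately tiny IR. All names are assumptions for the sketch (`Body`, `foldOne`, `addConst`, `addBinary`); a single vector stands in for a block, and integer overflow in the folded arithmetic is not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Op;
struct Value { Op* def = nullptr; };            // back-pointer to the defining op
struct Op {
    std::string name;                           // "tiny.const", "tiny.add", "tiny.mul"
    std::vector<Value*> operands;
    Value result;
    std::int64_t value = 0;                     // payload of tiny.const
};
using Body = std::vector<std::unique_ptr<Op>>;  // one block, ops in order

Op* addConst(Body& body, std::int64_t v) {
    body.push_back(std::make_unique<Op>());
    Op* op = body.back().get();
    op->name = "tiny.const";
    op->value = v;
    op->result.def = op;
    return op;
}
Op* addBinary(Body& body, const std::string& name, Op* l, Op* r) {
    body.push_back(std::make_unique<Op>());
    Op* op = body.back().get();
    op->name = name;
    op->operands = {&l->result, &r->result};
    op->result.def = op;
    return op;
}

// Fold the first add/mul whose operands are both constants; true on change.
bool foldOne(Body& body) {
    for (std::size_t i = 0; i < body.size(); ++i) {
        Op& op = *body[i];
        bool isAdd = op.name == "tiny.add", isMul = op.name == "tiny.mul";
        if (!isAdd && !isMul) continue;                        // pattern matcher
        Op* l = op.operands[0]->def;
        Op* r = op.operands[1]->def;
        if (!l || !r || l->name != "tiny.const" || r->name != "tiny.const") continue;
        auto folded = std::make_unique<Op>();                  // rewriter
        folded->name = "tiny.const";
        folded->value = isAdd ? l->value + r->value : l->value * r->value;
        folded->result.def = folded.get();
        for (auto& user : body)                                // replace all uses
            for (Value*& v : user->operands)
                if (v == &op.result) v = &folded->result;
        body[i] = std::move(folded);                           // insert + erase in one slot
        return true;
    }
    return false;
}

// Restart the walk after every change until nothing folds (a fixpoint).
void constantFold(Body& body) { while (foldOne(body)) {} }
```

Each successful fold restarts the scan from the top, so n foldable ops cost on the order of n² work; a worklist would revisit only the users of the replaced value.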
DCE as a model "delete dead things" pass
bool deadCodeElimination(Op& moduleOp) {
// 1. Find all values that ARE used somewhere.
// 2. Find a pure op whose results aren't in that set.
// 3. Delete it. Repeat.
}
The crucial concept: purity. We hard-code the set of pure ops:
static bool isPure(const Op& op) {
    // The remaining side-effect-free ops are elided here.
    return op.name == "tiny.const" || op.name == "tiny.add"
        || op.name == "tiny.mul";
}
Real MLIR uses the MemoryEffectOpInterface: an op declares the
read/write/allocate/free effects it has on memory. DCE removes an op
only if it has no effects (or only MemoryEffects::Read from immutable
storage). This generalises to any dialect without changing the DCE pass.
cp-18 hard-codes the set because we don't have an interface system. Same algorithm, less elegant.
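The three numbered steps translate almost line for line into code. This is a sketch on a flattened IR (one vector of single-result ops, names assumed here), not the implementation in passes.cpp, and its purity list is shortened to three ops.

```cpp
#include <memory>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

struct Value {};
struct Op {
    std::string name;
    std::vector<Value*> operands;
    Value result;
};
using Body = std::vector<std::unique_ptr<Op>>;   // one block, ops in order

static bool isPure(const Op& op) {
    return op.name == "tiny.const" || op.name == "tiny.add" || op.name == "tiny.mul";
}

// Returns true if at least one op was removed.
bool deadCodeElimination(Body& body) {
    bool changed = false;
    for (bool removed = true; removed;) {
        removed = false;
        // 1. Collect every value that is used as an operand somewhere.
        std::unordered_set<const Value*> used;
        for (auto& op : body)
            for (Value* v : op->operands) used.insert(v);
        // 2. Find a pure op whose result is not in that set.
        for (auto it = body.begin(); it != body.end(); ++it) {
            if (isPure(**it) && !used.count(&(*it)->result)) {
                body.erase(it);              // 3. Delete it, then repeat.
                removed = changed = true;
                break;
            }
        }
    }
    return changed;
}
```

Deleting one op can make its operands' producers dead in turn, which is why the use set is rebuilt after every removal rather than computed once.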
Why a single pipeline, not many "phases"?
Because IRs in this design are monomorphic: every op is just mlf::Op.
Passes can be composed freely — fold then DCE then fold again then convert
— without serialising to text and re-parsing. Real MLIR does the same:
you build a single PassManager and run dozens of passes in sequence,
all operating on the same in-memory module.
Compare with classic LLVM, where each pass is a different
FunctionPass/ModulePass subclass, ordering is managed by a Pass
Manager whose API is large and historical, and pipeline specification
(e.g. -O2) is hard-coded in C++ rather than declared in text. MLIR's
PassManager is a cleaner take on the same idea.
06 — Lowering Between Dialects
Lowering is the act of rewriting ops from one dialect into ops from
another. Conceptually it's another pass; structurally it's special because
the input and output dialects differ. cp-18 implements one lowering:
tiny.* → ll.*.
// Pseudocode: the "for each" loops and the string switch are shorthand, not literal C++.
std::unique_ptr<Op> lowerTinyToLL(Op& tinyModule) {
    auto newMod = makeModule();
    Builder b;
    std::unordered_map<Value*, Value*> valueMap;   // source value -> lowered value
    Block* topB = newMod->region(0)->entry();
    b.setInsertionPointToEnd(topB);
    for each tiny.func in tinyModule:
        Op* newFunc = ll::func(b, fname);
        Builder bf;
        bf.setInsertionPointToEnd(newFunc->region(0)->entry());
        for each op in tinyFunc.body:
            switch (op.name) {
            case "tiny.const": valueMap[old] = ll::constant(bf, value); break;
            case "tiny.add": valueMap[old] = ll::add(bf, map(a), map(b)); break;
            case "tiny.mul": valueMap[old] = ll::mul(bf, map(a), map(b)); break;
            case "tiny.print": ll::call(bf, "ml_print_int", map(a)); break;
            case "tiny.return": ll::ret(bf, map(a)); break;
            }
    return newMod;
}
The pattern: for each source op, emit one or more target ops and record the value mapping. When a later source op uses an SSA result, look up its replacement in the map.
Why the value map?
Because we can't reuse the source ops' Value* in the target IR — they
belong to the old module which is about to die (or persist independently).
Every source value needs a corresponding fresh target value.
This map is the heart of dialect conversion. In real MLIR
ConversionPatternRewriter maintains it implicitly: when a pattern
matches and emits replacement ops, the rewriter records the value mapping
and rewires uses automatically. cp-18 maintains it explicitly because it's
clearer pedagogically.
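An explicit value map can be shown compiling and running on a flattened IR. The sketch below is an illustration under assumptions: one vector stands in for a function body, attributes such as `value` and `callee` are omitted, and `emit` plus the rename table are names invented here rather than cp-18's lowering.cpp.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Value {};
struct Op {
    std::string name;
    std::vector<Value*> operands;
    Value result;
};
using Body = std::vector<std::unique_ptr<Op>>;   // one function body, ops in order

Op* emit(Body& out, const std::string& name, std::vector<Value*> operands) {
    out.push_back(std::make_unique<Op>());
    out.back()->name = name;
    out.back()->operands = std::move(operands);
    return out.back().get();
}

// Full conversion: every source op must have a lowering, or we throw.
Body lowerTinyToLL(const Body& tiny) {
    static const std::unordered_map<std::string, std::string> rename = {
        {"tiny.const", "ll.const"}, {"tiny.add", "ll.add"}, {"tiny.mul", "ll.mul"},
        {"tiny.print", "ll.call"},  {"tiny.return", "ll.ret"},
    };
    Body ll;
    std::unordered_map<const Value*, Value*> valueMap;   // source value -> target value
    for (const auto& op : tiny) {
        auto it = rename.find(op->name);
        if (it == rename.end())
            throw std::runtime_error("no lowering for " + op->name);
        std::vector<Value*> operands;
        for (Value* v : op->operands)
            operands.push_back(valueMap.at(v));          // look up the replacement
        Op* lowered = emit(ll, it->second, std::move(operands));
        valueMap[&op->result] = &lowered->result;        // record the mapping
    }
    return ll;
}
```

No pointer from the new body ever refers back into the old one, so the source module can be destroyed (or kept for comparison) without affecting the lowered IR.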
One-to-one vs one-to-many
tiny.const → ll.const is one-to-one. tiny.print(x) →
ll.call(x) {callee = "ml_print_int"} is also one-to-one but with name
rewriting. A more interesting case: a single linalg.matmul lowers to a
nested scf.for loop body that calls arith.mulf and arith.addf — one
op blowing up into a dozen. cp-18 doesn't show one-to-many because
tiny.* is too simple; the framework supports it (just create more ops
in the case branch).
Partial vs full conversion
- Full conversion insists no source-dialect ops remain. cp-18's test #5 verifies this: CHECK_NOT_CONTAINS(ir, "tiny.").
- Partial conversion lowers some ops, leaves others. Useful when a dialect contains a mix of low-level and high-level concerns.
Our lowerTinyToLL is full: every tiny.* op has a case in the switch.
Real MLIR's applyFullConversion will fail loudly if any unconverted
source ops survive; applyPartialConversion will leave them in place.
Where the framework matters
Notice what didn't change for the lowering: the printer, the
walks, the passes. The ll-dialect module prints with the same
printer, walks with the same walker. The DCE pass works on ll.* ops
because isPure lists ll.const/add/mul. The constant folder, however,
doesn't recognise ll.const + ll.const → ll.const — by design. If you
want post-lowering folding, you add another pass that matches ll.* ops.
That separation — generic infrastructure, dialect-specific patterns — is exactly what makes MLIR scale to dozens of dialects.
Lowering chains
cp-18 has one lowering step. Real MLIR pipelines often chain several:
tosa → linalg → scf+memref → llvm → LLVM IR
Each step is implemented as a set of patterns + a populate*Patterns
function + a target spec declaring which ops are "legal" in the output.
The shape of each step is identical to lowerTinyToLL: walk the input,
emit the output, thread the value map.
07 — Where to Grow
cp-18 reproduces MLIR's shape. If you wanted to grow it into something genuinely useful, here's the roadmap.
1. Verifiers
Right now any code that builds an ill-formed op succeeds silently. The
first quality-of-life upgrade is a verifyOp(Op&) function that checks:
- operand and result counts match the dialect spec,
- operand and result types match,
- required attributes are present and the right kind,
- region count is right,
- terminator ops are present at end of every block.
Real MLIR generates these from TableGen; you can hand-write them per dialect. Run after every pass; refuse to print invalid IR.
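A table-driven verifier covering the count and attribute checks might start like this. It is a sketch with assumed names (`OpSpec`, `specs`, and a flattened `Op` that stores counts instead of real operand lists); type checks and the terminator rule are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct Op {                               // flattened stand-in for mlf::Op
    std::string name;
    std::size_t numOperands = 0;
    std::size_t numResults = 0;
    std::size_t numRegions = 0;
    std::vector<std::string> attrNames;
};

struct OpSpec {                           // what the dialect promises about an op
    std::size_t operands, results, regions;
    std::vector<std::string> requiredAttrs;
};

const std::unordered_map<std::string, OpSpec>& specs() {
    static const std::unordered_map<std::string, OpSpec> table = {
        {"tiny.const", {0, 1, 0, {"value"}}},
        {"tiny.add",   {2, 1, 0, {}}},
        {"tiny.func",  {0, 0, 1, {"sym_name"}}},
    };
    return table;
}

// Returns std::nullopt when the op is well-formed, else a diagnostic string.
std::optional<std::string> verifyOp(const Op& op) {
    auto it = specs().find(op.name);
    if (it == specs().end()) return "unknown op '" + op.name + "'";
    const OpSpec& s = it->second;
    if (op.numOperands != s.operands) return op.name + ": wrong operand count";
    if (op.numResults != s.results)   return op.name + ": wrong result count";
    if (op.numRegions != s.regions)   return op.name + ": wrong region count";
    for (const auto& req : s.requiredAttrs)
        if (std::find(op.attrNames.begin(), op.attrNames.end(), req) == op.attrNames.end())
            return op.name + ": missing attribute '" + req + "'";
    return std::nullopt;
}
```

Returning a message instead of a bool keeps the "refuse to print invalid IR" path useful: the pipeline can report which pass produced the first ill-formed op.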
2. Op interfaces / traits
Hard-coded if (op.name == "tiny.add" || op.name == "tiny.mul") doesn't
scale past a handful of ops. Replace with a trait system:
struct BinaryOp { Value* lhs(Op& o); Value* rhs(Op& o); };
bool implementsBinary(const std::string& name);
Then folders and DCE check implementsBinary(op.name) rather than naming
specific ops. New ops opt into the trait by registering with the system.
This is MLIR's OpInterface mechanism in skeleton form.
3. Pattern DSL
Switch from hand-written if/switch to declarative patterns:
addPattern<BinaryOpPattern<"tiny.add">>(folder);
The base class encapsulates the match + create-replacement + replace-uses
+ erase boilerplate. Patterns become 5-line specs of "what to match" and
"what to emit". This is RewritePattern in MLIR.
4. Real type system
Replace Type { string name } with a tagged union or class hierarchy:
struct Type { enum class Kind { Int, Float, Tensor, Function, ... }; ... };
struct TensorType : Type { Type elemType; vector<int64_t> shape; };
Then types can be compared structurally and dialects can demand specific
type shapes. Shape inference becomes possible: an op like
tensor.matmul : tensor<MxKxf32> × tensor<KxNxf32> → tensor<MxNxf32>
can verify and propagate shapes.
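Once types carry a shape, the matmul rule above is a few comparisons. This is a minimal sketch assuming static 2-D shapes and a string element type; `TensorType` here is a flat struct rather than the class hierarchy suggested above, and `inferMatmul` is a hypothetical helper.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct TensorType {
    std::string elem;                 // element type, e.g. "f32"
    std::vector<std::int64_t> shape;  // static dimensions, e.g. {M, K}
    bool operator==(const TensorType& o) const {
        return elem == o.elem && shape == o.shape;   // structural equality
    }
};

// tensor<MxKxT> x tensor<KxNxT> -> tensor<MxNxT>.
// Returns std::nullopt when the operand types do not fit the rule.
std::optional<TensorType> inferMatmul(const TensorType& a, const TensorType& b) {
    if (a.shape.size() != 2 || b.shape.size() != 2) return std::nullopt;
    if (a.elem != b.elem) return std::nullopt;
    if (a.shape[1] != b.shape[0]) return std::nullopt;   // inner dimensions must agree
    return TensorType{a.elem, {a.shape[0], b.shape[1]}};
}
```

The same function serves two callers: a verifier compares the declared result type against the inferred one, and a builder uses the inferred type so the user never spells it out.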
5. Conversion framework
Generalise lowerTinyToLL into:
struct TypeConverter { Type convert(Type src); };
struct ConversionPattern { virtual bool match(Op&) = 0; virtual void rewrite(Op&, ...) = 0; };
void applyFullConversion(Op& root, vector<ConversionPattern*>, TargetSpec);
Where TargetSpec declares which ops are "legal" in the output. Patterns
plug in modularly. Same idea as MLIR's dialect-conversion framework
(mlir::ConversionTarget, mlir::TypeConverter, mlir::ConversionPattern).
6. A useful dialect: tensors
The natural next thing to model is tensor.*:
- tensor.const : tensor<NxNxf32> (a constant tensor with shape).
- tensor.add : (tensor, tensor) -> tensor (elementwise).
- tensor.matmul.
- Conversion to a loop dialect (scf.for + memref.store/load).
- Conversion of the loop dialect to ll.*.
That's the toy tutorial of MLIR done in your own framework. Three dialects, two lowering steps, demonstrates the whole stack: high-level algebraic IR → loop nest → low-level CPU code.
7. Plug into real LLVM
If you wire the final ll.* dialect to actually emit llvm::IRBuilder
calls (the way cp-17's ir_emit.cpp does), you have a complete frontend:
surface language → tiny → ll → LLVM IR → JIT or native code.
At that point you're a small implementation distance from your own domain-specific compiler. The IRBuilder bridge is the same code as cp-17 with op-name dispatch driving it.
Where this lab leaves you
You can read MLIR source code, recognise its idioms, and understand why an op-centric, region-carrying, dialect-extensible IR was the right answer for modern compiler stacks. You can also build your own project-internal IR with this shape when LLVM IR is too low-level for your problem domain — which, for any compiler targeting ML, hardware design, or DSLs, is essentially always.
Phases & Labs
This curriculum has 9 teaching phases and 18 labs, the last 3 of which are capstone projects. Labs build on each other. After Phase 4, Phase 5 (LLVM) and Phase 7 (MLIR) can be tackled in either order; Phase 6 (JIT) builds on Phase 5.
Legend: ✅ complete · 🟡 scaffolded · ⬜ planned
Phase 1 — Frontend Foundations
Before you can compile, you must convert source text into a structured tree. This phase teaches lexing, parsing, AST design, and tree-walk interpretation.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-01 | Environment Setup & Toolchain | ✅ | Clang vs Apple Clang, target triples, LLVM toolchain, Mach-O vs ELF, CMake, llvm-config |
| cp-02 | Arithmetic Evaluator | ✅ | Tokens, recursive descent, EBNF, precedence vs grammar nesting, associativity, AST + Visitor, post-order eval |
| cp-03 | MiniLang v0 Frontend | 🟡 | Pratt parsing, statements vs expressions, blocks, functions, closures, REPL state, tree-walk interpreter |
Phase 2 — Static Semantics
A type-checked language with scoped variables is the foundation of every real frontend. This phase makes MiniLang reject invalid programs before they ever run.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-04 | Symbol Tables & Scoping | 🟡 | Lexical scoping, scope stacks, closure capture, name resolution, shadowing, two-pass resolution |
| cp-05 | Static Type System (MiniLang v1) | 🟡 | Hindley-Milner basics, type environments, monomorphic types, structural vs nominal typing, diagnostics |
Phase 3 — Bytecode Virtual Machines
Tree-walkers are slow because of pointer-chasing and virtual dispatch. Bytecode VMs are how CPython, the JVM, V8 (Ignition), and Lua get far more throughput out of an interpreter (the 10–50× range is a rough, workload-dependent figure).
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-06 | Bytecode Design & Compiler | 🟡 | Stack-based vs register-based VMs, opcode encoding, constant pools, AST → bytecode lowering, disassembler |
| cp-07 | Stack VM Execution (MiniLang v2) | 🟡 | Computed-goto dispatch, frame layout, call/return, switch-vs-direct-threading, ICache effects |
Phase 4 — Compiler Middle-End (IR & Optimization)
Every production compiler has a middle-end: AST → IR → optimized IR → backend. This phase introduces SSA, the CFG, and classical optimization passes.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-08 | Three-Address IR & CFG | 🟡 | TAC representation, basic blocks, control-flow graph, dominators, immediate dominator computation |
| cp-09 | SSA & Optimization Passes (MiniLang v3) | 🟡 | φ-nodes, SSA construction (Cytron's algorithm), constant folding, DCE, mem2reg, pass manager |
Phase 5 — LLVM Backend (Industry Core)
LLVM is the compiler infrastructure for Clang, Swift, Rust, Julia, Mojo, and dozens of others. This phase teaches you to generate LLVM IR, run its optimizer, and produce native binaries.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-10 | LLVM IR Fundamentals | 🟡 | Module / Function / BasicBlock / Instruction hierarchy, IRBuilder, types, attributes, intrinsics |
| cp-11 | LLVM Codegen (MiniLang++) | 🟡 | AST → LLVM IR, control-flow IR patterns, calling conventions, opt pipelines, llc, native linking on macOS |
Phase 6 — JIT Compilation (LLVM ORC)
JITs make dynamic languages fast (V8, LuaJIT, HotSpot). LLVM's ORC v2 API is the industrial-strength way to embed a JIT into your runtime.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-12 | ORC JIT Runtime | 🟡 | ORC v2 layers, lazy compilation, symbol resolution, function caching, hot-path materialization |
Phase 7 — MLIR (Multi-Level IR)
MLIR is the next-generation compiler infrastructure powering TensorFlow XLA, IREE, Mojo, and Triton. This phase teaches dialect design and progressive lowering.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-13 | MiniLang MLIR Dialect & Lowering | 🟡 | Operations / Types / Dialects, TableGen, rewrite patterns, ConversionTarget, lowering to LLVM dialect |
Phase 8 — Runtime Systems
A language is more than a compiler — it needs a runtime: stack frames, a heap, a GC, and an FFI.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-14 | Stack, Heap, GC, FFI | 🟡 | Calling conventions (System V vs Apple ARM64), object headers, mark-sweep GC, root-set scanning, C FFI |
Phase 9 — Tooling
Production compilers live or die by their error messages and tooling.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-15 | Diagnostics, Modules, CLI | 🟡 | Source spans, fix-it hints (Clang-style), module loader, dependency graph, CLI driver design |
Capstones
| Lab | Title | Status | Demonstrates |
|---|---|---|---|
| cp-16 | MiniLang Compiler Suite | 🟡 | End-to-end: interpreter + VM + LLVM backend in one toolchain |
| cp-17 | JIT-Accelerated Dynamic Language | 🟡 | Python-like subset, ORC JIT, runtime specialization |
| cp-18 | MLIR-Style Compiler Framework | 🟡 | Plugin dialect registry, multi-level lowering, custom passes |
Suggested Pace
- Full-time learner: ~2 labs per week ⇒ ~9 weeks end-to-end.
- Side-project learner: ~1 lab per 1–2 weeks ⇒ ~5 months.
- Concept-only path: skim CONCEPTS.md + docs/analysis.md per lab ⇒ ~1 week to absorb the field.
Recommended Progression
Phase 1 (cp-01, cp-02, cp-03) ── MANDATORY, in order
│
└─→ Phase 2 (cp-04, cp-05) ── MANDATORY (frontends pile up)
│
├─→ Phase 3 (cp-06, cp-07) ── VM track
│
└─→ Phase 4 (cp-08, cp-09) ── IR track ── MANDATORY before Phase 5/6/7
│
├─→ Phase 5 (cp-10, cp-11) ── LLVM backend
│ │
│ └─→ Phase 6 (cp-12) ── JIT (needs LLVM)
│
└─→ Phase 7 (cp-13) ── MLIR (parallel to LLVM)
│
└─→ Phase 8 (cp-14) ── Runtime
│
└─→ Phase 9 (cp-15) ── Tooling
│
└─→ Capstones (cp-16 / 17 / 18)
Phase 3 (Bytecode VM) and Phase 4 (IR) are independent, so pick whichever excites you first. Phases 5, 6, and 7 are each a serious commitment; pick the one most relevant to your career goals first (LLVM = static compilers, JIT = dynamic languages, MLIR = ML compilers / DSLs), keeping in mind that Phase 6 builds on Phase 5.
Tools & Toolchain
This curriculum is C++-only (Track B — LLVM Core). All labs target macOS (Apple Silicon verified) and are portable to Linux with trivial flag changes (noted per-lab).
The full setup, version verification, and the rationale for each tool are covered in cp-01-environment-setup/. Do that lab first, even if you already have the tools installed, because it teaches concepts (target triples, sysroots, ELF vs Mach-O, Apple Clang vs upstream LLVM) that you'll need for every subsequent phase.
Required Tools
| Tool | Minimum Version | Where It's Used | Install |
|---|---|---|---|
| Xcode Command Line Tools | (any current) | C++ compiler, linker, system headers (phases 1–4) | xcode-select --install |
| CMake | 3.20+ | Build system for every lab | brew install cmake |
| Ninja | 1.10+ | Fast parallel builder (used in phases 5+) | brew install ninja |
| Homebrew LLVM | 18+ | Full LLVM with headers, libraries, mlir-opt, llc, lli (phases 5+) | brew install llvm |
| lldb | (bundled with Xcode CLT and Homebrew LLVM) | Debugger | already installed |
| git | 2.30+ | Version control | bundled with Xcode Command Line Tools |
Optional Tools
| Tool | Purpose | Install |
|---|---|---|
| Docker Desktop | Run a Linux container to validate ELF/glibc behavior (Phase 8 FFI optional cross-check) | https://www.docker.com/products/docker-desktop |
| graphviz | Render CFG / dominator-tree dumps as PNGs (Phase 4) | brew install graphviz |
| rr (Linux only) | Time-travel debugger — useful inside Docker for Phase 8 GC debugging | apt in Linux container |
| clang-format | Auto-format C++ (configured per lab) | bundled with Homebrew LLVM (or brew install clang-format) |
| gdb | Some people prefer GDB; LLDB ships natively on macOS so we use LLDB | brew install gdb (with caveats on macOS) |
Tool Sets By Phase
| Phase | Need |
|---|---|
| 1–4 | Apple Clang + CMake |
| 5 | + Homebrew LLVM (brew install llvm) + Ninja |
| 6 | same as Phase 5 |
| 7 | same as Phase 5; also requires mlir-opt, mlir-translate (ship with Homebrew LLVM) |
| 8 | same as Phase 5; optional Docker for Linux validation |
| 9 | same as Phase 5 |
Apple Clang vs Homebrew LLVM — Why We Have Both
| | Apple Clang | Homebrew LLVM |
|---|---|---|
| Path | /usr/bin/clang++ | /opt/homebrew/opt/llvm/bin/clang++ |
| Provides | Compiler, linker integration | Compiler + libLLVM.dylib + headers + tools |
| Tools included | clang, clang++, lldb | clang++, llc, opt, llvm-config, mlir-opt, lli, llvm-link |
| Used for | Phases 1–4 (plain C++) | Phases 5–9 and the capstones (LLVM/MLIR work) |
Why Apple ships its own Clang: Apple uses LLVM internally for the macOS toolchain. Their Clang is patched, tracks Xcode releases, links against the system frameworks, and produces signed binaries. But it does not ship the LLVM C++ libraries and headers, nor tools such as llc, opt, and mlir-opt. We install upstream LLVM via Homebrew to get the missing pieces.
Target Triple (macOS Apple Silicon)
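Your default triple is arm64-apple-macosx<version>. The platform and minimum OS version from the triple are recorded in every Mach-O binary in the LC_BUILD_VERSION load command, while the architecture is recorded in the Mach-O header. Inspect the load command with: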
otool -l <binary> | grep -A5 LC_BUILD_VERSION
This matters in:
- Phase 5: when you ask LLVM to emit object files, you pass this triple.
- Phase 8 and beyond: when you write FFI or assembly intrinsics.
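
To make the structure of a triple concrete, here is a small sketch that splits one into its components. The `TripleParts` struct and `splitTriple` function are hypothetical names for illustration; LLVM's own `llvm::Triple` class handles normalization and many forms this sketch does not, so treat it as a reading aid rather than a substitute.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified view of a target triple.
struct TripleParts {
    std::string arch;
    std::string vendor;
    std::string os;
    std::string environment;  // optional fourth component, empty if absent
};

// Splits "<arch>-<vendor>-<os>" or "<arch>-<vendor>-<os>-<environment>".
TripleParts splitTriple(const std::string& triple) {
    std::vector<std::string> fields;
    std::istringstream in(triple);
    std::string field;
    while (std::getline(in, field, '-'))
        fields.push_back(field);

    // Assumption: this sketch only accepts three- and four-component forms.
    assert(fields.size() == 3 || fields.size() == 4);

    TripleParts parts;
    parts.arch = fields[0];
    parts.vendor = fields[1];
    parts.os = fields[2];
    if (fields.size() == 4)
        parts.environment = fields[3];
    return parts;
}
```

For example, `splitTriple("arm64-apple-macosx15.0")` yields arch `arm64`, vendor `apple`, OS `macosx15.0`, and an empty environment, whereas a Linux triple such as `x86_64-unknown-linux-gnu` carries `gnu` as its fourth component.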
Verification
The full step-by-step verification script lives in cp-01-environment-setup/docs/verification.md. Run it once, before any other lab.
Glossary
Curriculum-wide terminology, alphabetised. When a term appears for the first time in a lab's CONCEPTS.md, it's also defined inline; this file is the consolidated index.
| Term | Definition |
|---|---|
| AOT | Ahead-of-time compilation. Source → native binary before execution. Clang, rustc, GCC, MSVC. Opposite of JIT. |
| AST | Abstract Syntax Tree. Hierarchical representation of source code after parsing, with syntax noise (parens, whitespace, semicolons) removed. |
| Basic Block | Maximal straight-line sequence of IR instructions with one entry and one exit. Building block of the CFG. |
| Bytecode | A linear sequence of opcodes designed for a virtual machine, not real hardware. CPython, JVM, V8 Ignition. |
| CFG | Control-Flow Graph. Directed graph of basic blocks where edges are possible jumps. |
| Closure | A function value that captures variables from its lexical environment. Requires escape analysis or heap-allocated environments. |
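| Computed Goto | A GNU C extension (&&label, supported by GCC and Clang) enabling threaded bytecode dispatch. Reported speedups over switch-based dispatch are typically modest (CPython's documentation cites up to about 20%) and depend on the CPU and compiler. |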
| Constant Folding | Optimization that pre-computes constant expressions at compile time (2+3 → 5). |
| DCE | Dead Code Elimination. Removes unreachable code and instructions whose results are never used and that have no side effects. |
| Dialect (MLIR) | A self-contained set of operations and types in MLIR. tensor, arith, affine, llvm are dialects. |
| Dispatch (VM) | The act of selecting and jumping to the implementation of the current opcode. Hottest loop in any interpreter. |
| Dominator | Block A dominates B if every path from entry to B passes through A. Foundation of SSA construction. |
| EBNF | Extended Backus-Naur Form. Standard notation for context-free grammars. |
| ELF | Executable and Linkable Format. Linux/BSD object/binary format. macOS uses Mach-O instead. |
| FFI | Foreign Function Interface. Calling C (or other-ABI) functions from your language. |
| GC | Garbage Collector. Subsystem that reclaims unreachable heap memory. We build mark-sweep in Phase 8. |
| HM | Hindley-Milner. Type inference algorithm (Algorithm W) for functional languages with let-polymorphism. |
| IR | Intermediate Representation. Any data structure between AST and machine code. Compilers typically have 2–10 IRs (e.g., Clang/LLVM goes AST → LLVM IR → SelectionDAG → MachineInstr (MIR) → MCInst). |
| IRBuilder | LLVM helper class that constructs LLVM IR instructions and tracks the insertion point. Most-used API in LLVM frontends. |
| JIT | Just-In-Time compilation. Source/bytecode is compiled to native at runtime. V8, HotSpot, LuaJIT, PyPy, ORC. |
| Lexer | Also called scanner or tokenizer. Converts a character stream into a token stream. |
| LLVM IR | LLVM's typed, SSA-form intermediate representation. Human-readable assembly-like syntax. |
| lli | LLVM IR interpreter / JIT driver. Runs .ll files directly. |
| llc | LLVM static compiler. Lowers LLVM IR to target assembly. |
| Mach-O | Mach Object file format. macOS/iOS executable/library format. ELF's macOS counterpart. |
| mem2reg | LLVM pass that promotes stack allocas to SSA registers when their address is never taken. Foundational; most frontends rely on it. |
| MLIR | Multi-Level Intermediate Representation. LLVM project for multi-IR compiler infrastructure. Powers TensorFlow XLA, IREE, Mojo. |
| mlir-opt | The MLIR optimizer driver. The MLIR counterpart of LLVM's opt: it runs passes over .mlir files. |
| ORC | On-Request Compilation. LLVM's JIT framework, current version is ORC v2. |
| Parser | Converts a token stream into an AST. Two flavors: hand-written (recursive descent / Pratt) or generated (yacc, ANTLR). |
| Phi Node (φ) | SSA-form instruction that selects a value based on which predecessor block was the source. The defining characteristic of SSA. |
| Pratt Parser | Top-down operator-precedence parser. Used by V8, Crafting Interpreters, many JavaScript parsers. |
| Recursive Descent | A top-down parser written as one mutually-recursive function per grammar rule. Used by Clang, GCC, rustc. |
| SSA | Static Single Assignment. IR form in which every variable has exactly one definition in the program text (that definition may still execute many times inside a loop). Simplifies most modern optimizations. |
| Symbol Table | Data structure mapping names to declarations, often a stack of hash maps (one per scope). |
| Target Triple | <arch>-<vendor>-<os>[-<environment>] string identifying the compilation target (e.g., arm64-apple-macosx15.0). The fourth component (environment/ABI, such as gnu) is optional. |
| Three-Address Code (TAC) | IR form where instructions have at most three addresses (one result and two operands): x = y op z. Common pre-SSA representation. |
| Token | Atomic syntactic unit emitted by the lexer (keywords, identifiers, literals, operators). |
| Tree-Walk Interpreter | Executes a program by recursively visiting AST nodes. Simplest backend; slowest runtime. |
| Type Environment (Γ) | Mapping from variable names to types, used during type checking. |
| Visitor Pattern | Design pattern that adds an operation to a class hierarchy without modifying it. Standard tool for AST traversal. |
| VM | Virtual Machine. Interpreter for a custom bytecode instruction set. CPython, JVM, V8 Ignition, Lua. |
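
To tie the Bytecode, Dispatch (VM), and VM entries together, here is a minimal sketch of a stack VM with switch-based dispatch. The opcode set (`Op`) and the `run` function are hypothetical and far smaller than what the Phase 3 labs build; `Push` is assumed to be followed by a one-byte immediate operand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcode set for illustration only.
enum class Op : std::uint8_t { Push, Add, Mul, Halt };

// Executes the bytecode on an operand stack and returns the value on top
// of the stack when Halt is reached.
int run(const std::vector<std::uint8_t>& code) {
    std::vector<int> stack;
    std::size_t pc = 0;  // index of the next byte to decode
    for (;;) {
        assert(pc < code.size() && "ran off the end of the bytecode");
        // Dispatch: decode the opcode and jump to its implementation.
        switch (static_cast<Op>(code[pc++])) {
        case Op::Push:
            assert(pc < code.size() && "Push is missing its operand");
            stack.push_back(code[pc++]);
            break;
        case Op::Add: {
            assert(stack.size() >= 2);
            int rhs = stack.back(); stack.pop_back();
            int lhs = stack.back(); stack.pop_back();
            stack.push_back(lhs + rhs);
            break;
        }
        case Op::Mul: {
            assert(stack.size() >= 2);
            int rhs = stack.back(); stack.pop_back();
            int lhs = stack.back(); stack.pop_back();
            stack.push_back(lhs * rhs);
            break;
        }
        case Op::Halt:
            assert(!stack.empty());
            return stack.back();
        default:
            assert(false && "unknown opcode");
            return 0;
        }
    }
}
```

The `switch` inside the loop is the dispatch step the glossary describes. A computed-goto variant replaces it with a table of label addresses so that each handler jumps directly to the next one.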