Compilers & Parser Engineer — Build Programming Languages, Interpreters, VMs, JITs, and MLIR Pipelines From Scratch

"A compiler is a function from strings to behavior. Everything else is engineering."

A lab-based curriculum for becoming a senior compiler engineer by building the systems you'll one day extend, optimize, and ship: a tree-walking interpreter, a stack-based bytecode VM, an SSA-based optimizer, an LLVM native backend, an ORC JIT runtime, an MLIR dialect with progressive lowering, and a production runtime with a GC and FFI — all implemented from scratch in C++17/20 on macOS (Apple Silicon supported and verified).

Why This Repo Exists

Most engineers treat compilers as black boxes. This curriculum makes them transparent. You will:

  • Write a programming language from a flat character stream to a Mach-O executable.
  • Implement the classical IRs: AST, three-address code, control-flow graph, SSA, LLVM IR, MLIR dialects.
  • Understand the classical optimizations: constant folding, dead code elimination, mem2reg, loop-invariant code motion.
  • Build runtime systems: stack frames, calling conventions, mark-sweep GC, C FFI.
  • Reason about hardware tradeoffs: ICache behavior of bytecode dispatchers, JIT compile-time vs run-time, register pressure, ABI boundaries.
  • Compare compilation strategies: tree-walker vs bytecode vs JIT vs AOT, the same fundamental performance hierarchy that distinguishes Python, V8, JVM, and Clang.
  • Build the same language repeatedly with progressively more sophisticated backends to internalize design over syntax.

Curriculum at a Glance

| Phase | Theme | Labs |
| --- | --- | --- |
| 1 | Frontend Foundations | cp-01 to cp-03 |
| 2 | Static Semantics & Type Systems | cp-04 to cp-05 |
| 3 | Bytecode Virtual Machines | cp-06 to cp-07 |
| 4 | Compiler Middle-End (IR & Optimization) | cp-08 to cp-09 |
| 5 | LLVM Backend (Industry Core) | cp-10 to cp-11 |
| 6 | JIT Compilation (LLVM ORC) | cp-12 |
| 7 | MLIR (Multi-Level IR) | cp-13 |
| 8 | Runtime Systems (GC, ABI, FFI) | cp-14 |
| 9 | Tooling & Diagnostics | cp-15 |
| Capstones | Production-grade demonstrations | cp-16 to cp-18 |

See PHASES.md for the full breakdown with learning objectives per lab.

How To Use This Repo

  1. Read TOOLS.md and complete cp-01-environment-setup/. This is the mandatory first step — even if you already have the tools, the verification process teaches you what each tool does and why.
  2. Move through the labs in order. Each lab is self-contained and has the same shape:
    cp-NN-<name>/
    ├── CONCEPTS.md       # The "why" — read this FIRST (8-part framework)
    ├── references.md     # Papers, source-code links, suggested reading
    ├── docs/
    │   ├── analysis.md       # Design tradeoffs (performance, engineering)
    │   ├── broader-ideas.md  # Extensions, alternatives, where this goes in production
    │   ├── execution.md      # Toolchain versions + quick-start commands
    │   ├── observation.md    # How to debug, profile, and inspect output
    │   └── verification.md   # Pass/fail checks for your implementation
    ├── steps/            # Numbered, sequential implementation guides
    │   ├── 01-*.md
    │   ├── 02-*.md
    │   └── ...
    └── src/cpp/          # CMake project — reference implementation
    
  3. Read CONCEPTS.md first, then work steps/ in order. The reference code in src/cpp/ is a target — try to write your own first, then compare.
  4. Run the checks in docs/verification.md before moving on. If anything fails, see docs/observation.md for debugging guidance.

What You Will Build

By the end of the curriculum you will have implemented:

  • A complete arithmetic evaluator with a hand-written lexer, recursive-descent parser, AST, and tree-walking interpreter.
  • A full MiniLang v0 frontend with Pratt parsing, variables, control flow, functions, closures, and an interactive REPL.
  • A symbol-table-driven semantic analyzer with lexical scoping and a Hindley-Milner-lite static type system.
  • A stack-based bytecode VM with a hand-tuned dispatch loop, instruction encoding, and a disassembler.
  • A three-address-code SSA IR with a control-flow-graph builder, dominator computation, and classical optimization passes (constant folding, DCE, mem2reg).
  • An LLVM native code generator that produces optimized Mach-O executables for Apple Silicon.
  • An ORC JIT engine for lazy on-demand compilation with function caching.
  • A custom MLIR dialect (minilang) with progressive lowering to the LLVM dialect.
  • A production runtime with stack frames, a mark-sweep garbage collector, and a C FFI.
  • A complete compiler CLI toolchain with diagnostic spans, source-pointing errors, and module support.
  • Three capstone projects stitching everything together.

Prerequisites

  • Comfortable reading and writing modern C++17/20 (the curriculum assumes you can write a class and use STL containers).
  • Familiarity with trees, recursion, and basic data structures (graphs, hash tables).
  • Basic command-line and git.
  • Not required: prior compiler, LLVM, parsing, or type-system knowledge. We build it all from the ground up.

Pedagogical Style

Modeled after distributed-systems-engineer/ — every CONCEPTS.md follows the same 8-part framework:

  1. What Is It — one-paragraph executive summary
  2. Why It Matters — concrete benefits
  3. How It Works — ASCII architecture diagram
  4. Core Terminology — table of precise definitions
  5. Mental Models — analogies for intuition
  6. Common Misconceptions — myths corrected
  7. Interview Talking Points — what to say in a senior compiler interview
  8. Connections to Other Labs — how this fits the bigger picture

Every lab produces observable, runnable, testable output. No pseudo-code, no hand-waving, no abstract-only sections.

Status

| Phase | Status |
| --- | --- |
| Phase 1: Frontend Foundations | cp-01 complete · cp-02 complete · cp-03 scaffolded |
| Phase 2: Static Semantics | Scaffolded |
| Phase 3: Bytecode VMs | Scaffolded |
| Phase 4: IR & Optimization | Scaffolded |
| Phase 5: LLVM Backend | Scaffolded |
| Phase 6: JIT | Scaffolded |
| Phase 7: MLIR | Scaffolded |
| Phase 8: Runtime | Scaffolded |
| Phase 9: Tooling | Scaffolded |
| Capstones | Scaffolded |

See PHASES.md for per-lab status.

License

MIT — see source headers in each implementation.

cp-01 — Environment Setup & Toolchain

Install and verify the C++/LLVM toolchain. Mandatory first lab — every other lab depends on these tools and the concepts taught here.

Read First

  • CONCEPTS.md — the 8-part framework: toolchains, target triples, Apple Clang vs upstream LLVM, Mach-O vs ELF, sysroots, CMake.
  • docs/analysis.md — design tradeoffs and what to choose when.
  • references.md — official docs and further reading.

Then Walk The Steps

  1. steps/01-verify-xcode-clt.md — Apple's command-line toolchain.
  2. steps/02-install-cmake-and-ninja.md — build-system generator and executor.
  3. steps/03-install-homebrew-llvm.md — upstream LLVM with full SDK.
  4. steps/04-verify-end-to-end.md — single script that confirms everything.

Quick Verify (TL;DR)

If you're already comfortable with everything in CONCEPTS.md and just want to confirm your setup:

./scripts/verify.sh

If green, move to ../cp-02-arithmetic-evaluator/. If anything's red, open the relevant step.

Lab-Specific Docs

Outcomes

You leave this lab with:

  • A working C++17/20 toolchain via Apple Clang.
  • A working LLVM 18+ installation via Homebrew, plus llvm-config, opt, llc, lli, mlir-opt.
  • CMake 3.20+ and Ninja installed.
  • Environment variables (LLVM_HOME, LLVM_DIR, MLIR_DIR) configured.
  • The mental model for what a toolchain is and the difference between a compiler driver, frontend, optimizer, backend, assembler, and linker.

Step 1 — Verify Xcode Command Line Tools

Goal

Confirm Apple's C++ toolchain is installed and discoverable. This gives us clang++, make, system headers, git, and lldb — everything needed for Phases 1–4.

Why This Step Exists

Apple ships a C/C++ toolchain as part of "Xcode Command Line Tools" (CLT) — a minimal subset of full Xcode. Almost every macOS development workflow starts here. Without it, even git is missing.

The CLT also installs the macOS SDK (the sysroot containing system headers like <stdio.h> and frameworks like Foundation). When clang compiles, it implicitly looks here.

Check What's Installed

# Where is the CLT installed?
xcode-select -p

# Expected (one of):
#   /Library/Developer/CommandLineTools                  ← CLT-only install
#   /Applications/Xcode.app/Contents/Developer            ← full Xcode install

If you see "No developer tools were found" or similar, install:

xcode-select --install

This opens a GUI dialog. Click "Install". Wait ~5 minutes.

Verify The Compiler

clang++ --version

Expected:

Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.0.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Three things to notice in the output, each a teaching moment:

  1. "Apple clang": this is Apple's fork, not upstream LLVM. The version number ("16.0.0") tracks Xcode releases, not LLVM major releases. (Apple Clang 16 ≈ LLVM 17 under the hood.)
  2. Target: arm64-apple-darwin24.0.0 is your default target triple. arm64 is the architecture (Apple Silicon), apple is the vendor, darwin24 is the OS and kernel version (macOS 15 = Darwin 24). The triple is what every LLVM backend uses to know what code to emit.
  3. InstalledDir: the real binary lives inside the CLT prefix (or inside Xcode.app for a full Xcode install). /usr/bin/clang++ is a small shim that forwards to it, which is why clang++ is on your PATH by default. Compare with Homebrew LLVM in Step 3, which installs elsewhere.
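
The same facts that appear in the triple are also visible from inside a program through compiler-predefined macros. The sketch below is only an illustration: describeHost is a hypothetical helper name, not part of this lab, and the strings it returns are made up for readability. The macros themselves (__apple_build_version__, __clang__, __aarch64__, __x86_64__, __APPLE__) are predefined by Clang on the relevant platforms.

```cpp
#include <string>

// Hypothetical helper (not part of the lab): summarizes compiler and target
// using predefined macros. The returned labels are illustrative only.
std::string describeHost() {
    std::string out;
#if defined(__apple_build_version__)
    out += "Apple clang";          // Apple's fork defines this macro
#elif defined(__clang__)
    out += "upstream clang";
#else
    out += "non-clang compiler";
#endif
    out += " / ";
#if defined(__aarch64__) || defined(__arm64__)
    out += "arm64";
#elif defined(__x86_64__)
    out += "x86_64";
#else
    out += "other-arch";
#endif
    out += " / ";
#if defined(__APPLE__)
    out += "darwin";
#else
    out += "non-darwin";
#endif
    return out;
}
```

Printing describeHost() once with /usr/bin/clang++ and once with the Homebrew clang++ from Step 3 is a quick way to confirm which compiler actually built a binary.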

See The SDK Location

xcrun --show-sdk-path

Expected (something like):

/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk

This is your sysroot — Clang's -isysroot default. Inside it:

ls "$(xcrun --show-sdk-path)/usr/include" | head
# stdio.h, stdlib.h, string.h, ...

ls "$(xcrun --show-sdk-path)/System/Library/Frameworks" | head
# Foundation.framework, AppKit.framework, ...

This is where #include <stdio.h> actually resolves. Without the SDK, the compiler doesn't know where the system headers live.

Quick Sanity Test

cat > /tmp/hello.cpp <<'EOF'
#include <cstdio>
int main() { std::puts("hello from " __VERSION__); return 0; }
EOF

clang++ -std=c++17 /tmp/hello.cpp -o /tmp/hello && /tmp/hello

Expected (something like):

hello from Apple LLVM 16.0.0 (clang-1600.0.26.6)

If this works, your Apple Clang installation is healthy.

What Just Happened

  1. xcode-select -p confirmed the CLT prefix.
  2. clang++ --version printed the version, the default target triple (which equals your machine's host triple), and the install path.
  3. xcrun --show-sdk-path revealed your sysroot — the directory tree that the compiler treats as the target's / when looking for headers and frameworks.
  4. Compiling hello.cpp exercised the full pipeline: the preprocessor expanded <cstdio>, the frontend parsed the C++ and lowered it to LLVM IR, the backend generated arm64 code, Clang's integrated assembler wrote it out as a Mach-O object file, and the linker (ld) produced the executable.

Next

02-install-cmake-and-ninja.md

Step 2 — Install CMake and Ninja

Goal

Install CMake (build-system generator) and Ninja (fast parallel build executor). Every lab in this curriculum builds via CMake. Ninja becomes important starting at Phase 5 (LLVM) where build times grow.

Why CMake?

Because the core projects of the LLVM ecosystem (LLVM itself, Clang, MLIR, Swift) build with CMake, as do many large C++ projects outside it, KDE among them. Learning the CMake idioms here pays off in every later lab and every real LLVM contribution.

CMake is a generator, not a builder. It reads CMakeLists.txt, inspects your system (which compiler, which libs), and writes platform-appropriate build files: a Makefile (default on Unix) or a build.ninja. Then cmake --build . invokes whichever was generated.

Why Ninja?

Make is single-threaded by default and re-stats every file on every build. Ninja was designed at Google specifically for Chromium and adopted by LLVM. Differences:

| | GNU Make | Ninja |
| --- | --- | --- |
| Designed for | hand-edited | machine-generated |
| Default parallelism | -j1 | all cores |
| Incremental rebuild speed | O(targets) | O(changed files) |
| LLVM build time (clean, 8 cores) | ~25 min | ~12 min |

For Phases 1–4 you can use either. From Phase 5 onward Ninja makes a noticeable difference.

Install

# If you don't have Homebrew, install it first:
#   /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

brew install cmake ninja

Verify

cmake --version | head -1
# cmake version 3.28.x or newer

ninja --version
# 1.11.x or newer

Minimum required:

  • CMake ≥ 3.20 (we use modern target syntax)
  • Ninja ≥ 1.10 (any recent version works)

Try It — A Tiny CMake Project

This isn't part of any lab; it's a 30-second sanity check.

mkdir -p /tmp/cmake-smoke && cd /tmp/cmake-smoke

cat > CMakeLists.txt <<'EOF'
cmake_minimum_required(VERSION 3.20)
project(smoke LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
add_executable(smoke main.cpp)
EOF

cat > main.cpp <<'EOF'
#include <cstdio>
int main() { std::puts("cmake + ninja: ok"); return 0; }
EOF

cmake -B build -G Ninja
cmake --build build
./build/smoke

Expected:

cmake + ninja: ok

What Just Happened

  1. cmake -B build -G Ninja ran the configure step. CMake inspected your system (found Clang, the SDK, ninja), wrote build/build.ninja, and cached the decisions in build/CMakeCache.txt.
  2. cmake --build build ran the build step. CMake dispatched to Ninja, which compiled main.cpp and linked the executable.
  3. The whole thing took less than 2 seconds. Most of that was CMake's first-time configure; subsequent rebuilds are milliseconds.

Debugging Tips

  • cmake -B build -G Ninja --debug-find prints the resolution of every find_package call. Invaluable when LLVM can't be found in Step 3.
  • cmake -B build --fresh wipes the cache and reconfigures (useful if you changed PATH or installed a new tool). The --fresh flag needs CMake 3.24 or newer; on older versions, delete the build directory instead.
  • ninja -C build -v prints every command Ninja runs — see the exact clang++ invocation if a build is mysteriously failing.

Next

03-install-homebrew-llvm.md

Step 3 — Install Homebrew LLVM (The Full Toolchain)

Goal

Install upstream LLVM via Homebrew. This gives us the headers, libraries, and command-line tools (opt, llc, lli, mlir-opt, llvm-config) that Apple Clang does not include. Required from Phase 5 onward.

Why Apple Clang Isn't Enough

You already have Apple Clang from Step 1. It compiles C++ fine. But it lacks:

  • The LLVM C++ headers and libraries (including libLLVM.dylib): needed to write a compiler that emits LLVM IR programmatically (Phase 5).
  • llvm-config: the helper that prints LLVM's include paths, library paths, and link flags for scripts and non-CMake builds.
  • opt: the LLVM optimizer driver. Example: opt -O3 -S input.ll -o output.ll (without -S, opt writes binary bitcode, not textual IR).
  • llc: the LLVM static compiler. Lowers .ll → target assembly.
  • lli: the LLVM IR interpreter/JIT. Run .ll files directly.
  • mlir-opt, mlir-translate: MLIR's opt and bridge to LLVM (Phase 7).
  • lld: LLVM's linker. Faster than system ld, used for cross-linking.

Apple intentionally omits these because Apple Clang is a product, not a development SDK. Homebrew LLVM fills the gap.

Install

brew install llvm

This installs to /opt/homebrew/opt/llvm/ on Apple Silicon (or /usr/local/opt/llvm/ on Intel Macs). The formula is keg-only: Homebrew deliberately does NOT symlink it into its bin directory (/opt/homebrew/bin or /usr/local/bin), to avoid shadowing Apple Clang.

Disk: ~3 GB. Time: 5–10 minutes (cached binary download).

Make It Discoverable

Homebrew prints instructions; the relevant parts:

# Add to your ~/.zshrc (or ~/.bashrc):
export LLVM_HOME="/opt/homebrew/opt/llvm"
export PATH="$LLVM_HOME/bin:$PATH"

# For CMake's find_package(LLVM) and find_package(MLIR):
export LLVM_DIR="$LLVM_HOME/lib/cmake/llvm"
export MLIR_DIR="$LLVM_HOME/lib/cmake/mlir"

# For dyld to find LLVM libraries at runtime (rarely needed; CMake usually handles rpath):
# export DYLD_LIBRARY_PATH="$LLVM_HOME/lib:$DYLD_LIBRARY_PATH"

After editing, reload:

source ~/.zshrc

Why prepend $LLVM_HOME/bin to PATH? So clang++ and clang refer to Homebrew LLVM during compiler-engineering work. Apple Clang is still at /usr/bin/clang++ for any tool that hard-codes that path.

Verify

which clang++
# Expected: /opt/homebrew/opt/llvm/bin/clang++

clang++ --version
# Expected: clang version 18.x.x or 19.x.x or 20.x.x
#           (NOT "Apple clang")

llvm-config --version
# Expected: 18.x.x / 19.x.x / 20.x.x (matches clang++)

llvm-config --prefix
# Expected: /opt/homebrew/opt/llvm

llc --version | head -3
opt --version | head -3
lli --version | head -3
mlir-opt --version | head -3

All four should print a version line. If mlir-opt: command not found, your Homebrew LLVM is too old or wasn't built with MLIR. brew upgrade llvm should fix it (current Homebrew LLVM bottles include MLIR by default).

Inspect What's Available

ls "$LLVM_HOME/bin" | head -30
# clang, clang++, lld, ld64.lld, llc, lli, llvm-ar, llvm-as, llvm-cov, llvm-dis,
# llvm-dwarfdump, llvm-link, llvm-mc, llvm-nm, llvm-objcopy, llvm-objdump,
# llvm-readelf, llvm-readobj, llvm-rtdyld, llvm-symbolizer, llvm-undname,
# mlir-cpu-runner, mlir-opt, mlir-tblgen, mlir-translate, opt, ...

Every one of these is a tool you'll meet at some point.

The llvm-config Story

llvm-config is a small helper that prints LLVM build information. Try:

llvm-config --includedir
# /opt/homebrew/opt/llvm/include

llvm-config --libdir
# /opt/homebrew/opt/llvm/lib

llvm-config --libs core support irreader
# -lLLVMIRReader -lLLVMBitReader -lLLVMAsmParser -lLLVMCore -lLLVMRemarks ...

llvm-config --cxxflags
# -I/opt/homebrew/opt/llvm/include -std=c++17 -fno-exceptions -fno-rtti ...

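CMake's find_package(LLVM) does not call llvm-config. It loads LLVMConfig.cmake from the directory named by LLVM_DIR (or one found through CMAKE_PREFIX_PATH), and that file carries the same information. llvm-config is the equivalent for shell scripts, Makefiles, and quick inspection. As long as LLVM_DIR is set, CMake "just works".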

Try It — Compile and Run LLVM IR

Save this as /tmp/hello.ll:

@.str = private unnamed_addr constant [14 x i8] c"hello, llvm!\0A\00"

declare i32 @printf(ptr, ...)

define i32 @main() {
  %1 = call i32 (ptr, ...) @printf(ptr @.str)
  ret i32 0
}

Run it three different ways — each demonstrates a different tool:

# 1. Interpret/JIT directly (lli)
lli /tmp/hello.ll
# → hello, llvm!

# 2. Compile to assembly (llc), then assemble + link (clang++)
llc /tmp/hello.ll -o /tmp/hello.s
clang++ /tmp/hello.s -o /tmp/hello-static && /tmp/hello-static
# → hello, llvm!

# 3. Optimize, then compile (opt + llc)
opt -O3 /tmp/hello.ll -S -o /tmp/hello.opt.ll
llc /tmp/hello.opt.ll -o /tmp/hello.opt.s
clang++ /tmp/hello.opt.s -o /tmp/hello-opt && /tmp/hello-opt
# → hello, llvm!

You just performed all three roles of LLVM: as an interpreter (lli), as a static compiler (llc), and as an optimizer (opt). Every later lab will revisit these tools.

What Just Happened

  1. You installed upstream LLVM separately from Apple Clang.
  2. You prepended its bin to PATH so clang++ now refers to Homebrew's, NOT Apple's.
  3. You exported LLVM_DIR so CMake's find_package(LLVM) works without extra flags.
  4. You wrote raw LLVM IR by hand, then exercised lli, llc, and opt — proof that the toolchain is wired correctly.

Common Pitfalls

  • clang++ --version still says "Apple clang": you forgot to source ~/.zshrc or you opened a new terminal that doesn't load it. Check with echo $PATH | tr : '\n' | head.
  • mlir-opt: command not found: very old Homebrew LLVM (pre-15). Run brew upgrade llvm.
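  • dyld: Library not loaded: @rpath/libLLVM.dylib: a binary you built can't find its libs at runtime. Either add $LLVM_HOME/lib to DYLD_LIBRARY_PATH, or set the rpath from CMake: set(CMAKE_BUILD_RPATH "$ENV{LLVM_HOME}/lib") for binaries run from the build tree, and set(CMAKE_INSTALL_RPATH "$ENV{LLVM_HOME}/lib") for installed ones. Inside a CMakeLists.txt, ${LLVM_HOME} names a CMake variable, not the shell variable; $ENV{LLVM_HOME} is what reads the environment.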
  • CMake can't find_package(LLVM): confirm LLVM_DIR is exported. Try cmake -DLLVM_DIR="$LLVM_HOME/lib/cmake/llvm" ... as a one-off.

Next

04-verify-end-to-end.md

Step 4 — End-to-End Verification

Goal

Run a single verification script that exercises every tool we'll use across the curriculum and confirms versions meet the minimums. If this passes, you're ready for Lab cp-02.

What Gets Verified

| Tool | Required Version | Used In |
| --- | --- | --- |
| Xcode CLT | any current | all phases |
| Apple Clang (/usr/bin/clang++) | 14+ | phases 1–4 (optional) |
| Homebrew Clang++ | 18+ | phases 5+ |
| llvm-config | matches Clang | phases 5+ |
| llc, opt, lli | matches Clang | phases 5+ |
| mlir-opt, mlir-translate | matches Clang | phase 7 |
| CMake | 3.20+ | all phases |
| Ninja | 1.10+ | phases 5+ (optional) |
| LLDB | any current | all phases |
| Git | any current | all phases |

Run

From this lab directory:

cd cp-01-environment-setup
./scripts/verify.sh

Expected output (with your specific versions):

== compilers-parser-engineer / cp-01 — environment verification ==

[1/9]  Xcode CLT prefix    : /Library/Developer/CommandLineTools     OK
[2/9]  Apple Clang         : Apple clang version 16.0.0              OK
[3/9]  Homebrew Clang++    : clang version 20.1.8                    OK
[4/9]  llvm-config         : 20.1.8                                  OK
[5/9]  LLVM tools          : llc opt lli                             OK
[6/9]  MLIR tools          : mlir-opt mlir-translate                 OK
[7/9]  CMake               : 4.1.0                                   OK (need >=3.20)
[8/9]  Ninja               : 1.11.1                                  OK (optional)
[9/9]  LLDB                : lldb-1600.0.39.109                      OK

Architecture : arm64 (Apple Silicon)
macOS        : 15.0
SDK path     : /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
LLVM prefix  : /opt/homebrew/opt/llvm
LLVM_DIR     : /opt/homebrew/opt/llvm/lib/cmake/llvm
MLIR_DIR     : /opt/homebrew/opt/llvm/lib/cmake/mlir

== all critical tools present; you are ready for cp-02 ==

If anything is marked MISSING or OLD, see the matching step:

| Failure | Fix |
| --- | --- |
| Xcode CLT missing | Step 1 |
| Apple Clang missing | Step 1 |
| Homebrew clang++ missing | Step 3 |
| llvm-config missing | Step 3 |
| MLIR tools missing | brew upgrade llvm (Step 3) |
| CMake missing/old | Step 2 |
| Ninja missing | Step 2 (optional for now) |
| LLVM_DIR/MLIR_DIR unset | Step 3 |

A Mini Build To Confirm find_package(LLVM)

After the script passes, do this final check to confirm CMake can pull in LLVM (the linchpin for Phase 5):

cd scripts/llvm-smoke
cmake -B build -G Ninja
cmake --build build
./build/llvm-smoke

Expected:

LLVM version: 20.1.8
target triple: arm64-apple-macosx<version>
created module: smoke; declared one function: int main()

This compiles a tiny program that links against libLLVM and uses the IRBuilder API to construct a function. If this works, cp-11 (LLVM Codegen) will work.

What Just Happened

verify.sh ran ~30 commands across your toolchain, each printing a tool version. It then printed environment variables CMake needs (LLVM_DIR, MLIR_DIR).

The llvm-smoke mini-project then did the only thing that actually matters for compiler engineering: it pulled in the LLVM C++ headers and emitted IR programmatically. This is the API you'll spend Phases 5–9 writing against.

Next

→ Mark this lab complete and move to ../cp-02-arithmetic-evaluator/.

cp-02 — Arithmetic Evaluator

The first real compiler. Read source text. Produce a number. Use every classical frontend technique compressed into ~400 lines of C++17.

Why This Lab

This is the foundational lab of the entire curriculum. Almost every concept in the rest of the course is a generalization of something built here:

| Built here | Generalizes to (later) |
| --- | --- |
| Hand-written DFA lexer | Clang's lexer; Phase 3 bytecode-compiler lexer |
| Recursive-descent parser | Pratt parser (cp-03); Clang's parser |
| AST + Visitor pattern | Every IR in every later lab |
| Post-order tree-walk eval | Bytecode VM (cp-06); LLVM IRBuilder traversal |
| EBNF grammar | All MiniLang grammar throughout the curriculum |
| Operator precedence via grammar nesting | Pratt's "binding powers" (cp-03); MLIR op verifiers |

Read First

  • CONCEPTS.md — the 8-part deep dive: lexers, parsers, AST, EBNF, precedence, associativity, the Visitor pattern.
  • references.md — Crafting Interpreters chapters, Clang lexer source, V8 parser source.
  • docs/analysis.md — design tradeoffs (DFA vs regex, recursive descent vs Pratt vs LR, Visitor vs std::variant).

Walk The Steps

  1. steps/01-the-lexer.md — tokens, DFA, single-char lookahead.
  2. steps/02-the-ast.md — node hierarchy, ownership, the Visitor pattern.
  3. steps/03-recursive-descent-parser.md — grammar, precedence-via-nesting, associativity-via-recursion-direction.
  4. steps/04-the-evaluator.md — post-order tree walk implemented as a Visitor returning double.
  5. steps/05-repl-tests-and-cli.md — wire it into a usable tool with a test suite.
  6. steps/06-extensions.md — extension exercises (right-associative ^, AST printer, source-location errors, constant folding).

Lab Docs

Code

  • src/cpp/ — full reference implementation (CMake project, ~450 lines of C++).

Build & Run (TL;DR)

cd src/cpp
cmake -B build && cmake --build build
./build/eval "1 + 2 * 3"      # 7
./build/eval                  # REPL
ctest --test-dir build        # 19 tests

Outcomes

You leave this lab able to:

  • Hand-write a lexer for an arbitrary regular language (no lex / flex).
  • Write a recursive-descent parser for any LL(1) grammar.
  • Design an AST with the Visitor pattern.
  • Explain why the grammar shape determines operator precedence and associativity.
  • Recognize what changes (and what doesn't) when you swap a tree-walker for a bytecode VM or LLVM backend in later labs.
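
To make the precedence and associativity point concrete before the steps begin, here is a minimal sketch that evaluates while it parses, with no AST at all. It is not the lab's implementation: MiniCalc, evalText, and the three-rule grammar in the comment are assumptions made for this illustration, and it handles only non-negative integers, spaces, the four operators, and parentheses.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical single-pass evaluator. Grammar assumed for this sketch:
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
struct MiniCalc {
    std::string src;
    std::size_t pos = 0;

    void skipSpaces() { while (pos < src.size() && src[pos] == ' ') ++pos; }
    char peek() { skipSpaces(); return pos < src.size() ? src[pos] : '\0'; }

    double factor() {
        if (peek() == '(') {
            ++pos;                          // consume '('
            double v = expr();
            if (peek() == ')') ++pos;       // consume ')' if present
            return v;
        }
        double v = 0;
        while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos])))
            v = v * 10 + (src[pos++] - '0');
        return v;
    }
    double term() {                         // nested inside expr, so * and / bind tighter
        double v = factor();
        for (char op = peek(); op == '*' || op == '/'; op = peek()) {
            ++pos;
            double rhs = factor();
            v = (op == '*') ? v * rhs : v / rhs;
        }
        return v;
    }
    double expr() {                         // the loop folds to the left: left-associative
        double v = term();
        for (char op = peek(); op == '+' || op == '-'; op = peek()) {
            ++pos;
            double rhs = term();
            v = (op == '+') ? v + rhs : v - rhs;
        }
        return v;
    }
};

double evalText(const std::string& text) { return MiniCalc{text}.expr(); }
```

Because expr only ever sees whole terms, evalText("1 + 2 * 3") multiplies first, and because each loop accumulates into v from the left, evalText("8 - 3 - 2") computes (8 - 3) - 2. Swapping which rule calls which would swap the precedence.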

Step 1 — The Lexer

Goal: turn a string of characters into a sequence of tokens.

Why A Lexer?

Imagine writing a parser that operated directly on characters. Every parsing rule would have to also skip whitespace, parse numeric literals, distinguish == from =, etc. The result would be unreadable.

Splitting into lexer → parser applies the same principle as any text-processing pipeline: first group raw bytes into meaningful units, then reason about those units. Tokens are the unit the grammar cares about.

For arithmetic, our token vocabulary is tiny:

| Token | Matches |
| --- | --- |
| Number | one or more digits, optionally followed by . and more digits |
| Plus, Minus, Star, Slash | +, -, *, / |
| LParen, RParen | (, ) |
| Eof | synthetic end-of-stream marker |
| Error | anything else (e.g., letters), carrying a message |

The DFA Mental Model

Even our tiny lexer is a deterministic finite automaton (DFA):

            digit                       digit
        ┌────────┐                  ┌─────────┐
        ▼        │                  ▼         │
  [START] ──────► [IN_NUMBER] ──.──► [AFTER_DOT] ──┐
        │             │                              │
        │           (other)        (other)           │
        │             ▼                ▼             │
        │         emit Number       emit Number      │
        │                                            │
        │ +    -      *    /   (    )                │
        ├────────────────────────────                │
        │             │                              │
        ▼             ▼                              │
   emit Plus     emit RParen (etc.)                  │
        │                                            │
        │ ws                                         │
        ▼                                            │
   skip → loop                                       │
        │                                            │
        │ EOF                                        │
        ▼                                            │
   emit Eof ◄───────────────────────────────────────┘

Position into the source string serves as our state. We never need a separate "state" variable for the arithmetic lexer; for more complex lexers (Python's indentation, JavaScript's >> vs >>=) you'd track explicit modes.
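
For contrast, this is roughly what the number part of the diagram looks like when the state is made explicit. It is a sketch under assumptions: the State enum, the step function, and isNumberLiteral are hypothetical names, and the Reject state is added here so the function can answer yes or no for a whole string (the lab's lexer instead stops and emits a token).

```cpp
#include <cctype>
#include <string>

// States taken from the diagram; Reject is an addition for this sketch.
enum class State { Start, InNumber, AfterDot, Reject };

// One transition of the number-recognizing part of the automaton.
State step(State s, char c) {
    const bool digit = std::isdigit(static_cast<unsigned char>(c)) != 0;
    switch (s) {
        case State::Start:    return digit ? State::InNumber : State::Reject;
        case State::InNumber: return digit ? State::InNumber
                                   : (c == '.') ? State::AfterDot : State::Reject;
        case State::AfterDot: return digit ? State::AfterDot : State::Reject;
        case State::Reject:   return State::Reject;
    }
    return State::Reject;
}

// True if the whole string is a number literal accepted by the automaton.
bool isNumberLiteral(const std::string& text) {
    State s = State::Start;
    for (char c : text) s = step(s, c);
    return s == State::InNumber || s == State::AfterDot;   // accepting states
}
```

The hand-written lexer encodes exactly these transitions in its control flow: which while loop is running plays the role of the State value.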

Single-Character Lookahead

char Lexer::peek()    const { return atEnd() ? '\0' : src_[pos_]; }
char Lexer::advance()       { return src_[pos_++]; }

peek() looks at the current character without consuming it. advance() returns it and moves on. Almost every hand-written lexer follows this pattern.

For most operators we need zero lookahead (+ is Plus, full stop). For numbers we peek at one character to decide whether more digits follow. In later labs, telling == from = takes the same one-character peek, applied after the first = has been consumed.
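
A common way to package that peek for two-character operators is a conditional-consume helper. The sketch below is an assumption about how a later lab might do it, not code from this one: Scanner, match, and scanEquals are hypothetical names, and the class is deliberately separate from the lab's Lexer.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical mini-scanner, separate from the lab's Lexer class.
class Scanner {
public:
    explicit Scanner(std::string src) : src_(std::move(src)) {}

    bool atEnd() const { return pos_ >= src_.size(); }
    char peek() const { return atEnd() ? '\0' : src_[pos_]; }

    // Consume the current character only if it equals `expected`.
    bool match(char expected) {
        if (atEnd() || src_[pos_] != expected) return false;
        ++pos_;
        return true;
    }

    // Returns "==" or "=" for the operator at the current position, "" otherwise.
    std::string scanEquals() {
        if (!match('=')) return "";
        return match('=') ? "==" : "=";
    }

private:
    std::string src_;
    std::size_t pos_ = 0;
};
```

Nothing is consumed on a failed match, so the caller can fall through to the next candidate token without any backtracking.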

The Number Sub-DFA

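Token Lexer::lexNumber() {
    // std::isdigit takes an int; passing a negative char is undefined behavior,
    // so convert through unsigned char first (requires <cctype>).
    auto isDigit = [](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; };

    std::size_t start = pos_;
    while (!atEnd() && isDigit(peek())) advance();      // integer part
    if (!atEnd() && peek() == '.') {
        advance();                                       // consume the dot
        while (!atEnd() && isDigit(peek())) advance();  // fraction (may be empty)
    }
    std::string text = src_.substr(start, pos_ - start);
    Token t;
    t.kind = TokenKind::Number;
    t.text = text;
    t.value = std::stod(text);   // throws if text is empty, so call only when peek() is a digit
    return t;
}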

Reads:

  1. As many digits as possible.
  2. If a . follows, consume it and as many more digits as possible.
  3. Slice the substring; convert to double via std::stod.

This accepts 42, 3.14, 0, 0.5, and also 1. (a dot with an empty fraction; harmless). It does not lex .5 as a number (no leading digit), and for 1e5 it stops after the 1 (scientific notation is left as a Step 6 extension).
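
The three reading steps can be checked in isolation with a small stand-alone function. This is a sketch, not the lab's code: numberPrefixLength is a hypothetical helper that reports how many characters a lexNumber-style scan would consume from the start of a string, returning 0 when the string does not begin with a digit.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: length of the number prefix a lexNumber-style scan
// would consume from the start of `s`, or 0 if `s` does not start with a digit.
std::size_t numberPrefixLength(const std::string& s) {
    auto isDigitAt = [&](std::size_t i) {
        return i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])) != 0;
    };
    std::size_t pos = 0;
    if (!isDigitAt(pos)) return 0;          // assumption: numbers must start with a digit
    while (isDigitAt(pos)) ++pos;           // integer part
    if (pos < s.size() && s[pos] == '.') {
        ++pos;                              // the dot
        while (isDigitAt(pos)) ++pos;       // fraction, possibly empty
    }
    return pos;
}
```

For "1e5" the result is 1: the scan takes the longest prefix it understands and leaves e5 for the next token, which in this lab becomes an Error token.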

Why std::stod instead of manual parsing? It's correct (handles negative zero, denormals, exponents we plan to add later), and the lexer is not the performance bottleneck. Compare to Clang, which has a hand-tuned APFloat::convertFromString because Clang's lexer is sometimes the hot path.

Eager vs Lazy

std::vector<Token> Lexer::tokenize() { ... }      // eager

We lex the entire input into a vector<Token> upfront. The parser then consumes from this vector.

Why eager? Simpler control flow. The parser doesn't need to hold a reference to the lexer.

Why production compilers go lazy: memory. Lexing a 100k-line C++ file into a vector of tokens is megabytes; lazy lexing keeps the working set small. Clang is lazy.

For arithmetic — and for everything through cp-09 — eager is fine.
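
For comparison, a lazy lexer exposes a pull interface: the parser asks for one token at a time and nothing is buffered. The sketch below is a simplified illustration, not the lab's Lexer. LazyLexer, Tok, and Kind are hypothetical names, it lexes integers only, and it treats every other non-space character as a one-character operator.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

enum class Kind { Number, Op, Eof };   // simplified token kinds for this sketch
struct Tok { Kind kind; std::string text; };

// Hypothetical pull-based lexer: produces one token per call instead of a vector.
class LazyLexer {
public:
    explicit LazyLexer(std::string src) : src_(std::move(src)) {}

    Tok next() {
        while (pos_ < src_.size() && isSpace(src_[pos_])) ++pos_;
        if (pos_ >= src_.size()) return {Kind::Eof, ""};   // repeats Eof forever
        std::size_t start = pos_;
        if (isDigit(src_[pos_])) {
            while (pos_ < src_.size() && isDigit(src_[pos_])) ++pos_;
            return {Kind::Number, src_.substr(start, pos_ - start)};
        }
        ++pos_;   // any other character is treated as a one-character operator
        return {Kind::Op, src_.substr(start, 1)};
    }

private:
    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
    std::string src_;
    std::size_t pos_ = 0;
};
```

The cost of this shape is coupling: the parser now has to own or reference the lexer for as long as it runs, which is the control-flow complexity the eager version avoids.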

Error Tokens

default: {
    Token e;
    e.kind = TokenKind::Error;
    e.text = std::string("unexpected character '") + c + "'";
    return e;
}

We don't throw from the lexer. We emit an Error token. The parser then decides what to do (we throw there). This separation lets a future error-recovery layer skip the bad token and keep parsing — important for IDE diagnostics where you want to see all errors at once, not just the first.
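
One thing Error tokens make possible is reporting every lexical problem in a single pass. The sketch below shows the idea with trimmed-down types; collectLexErrors is a hypothetical helper, and the Token and TokenKind definitions here are simplified stand-ins for the lab's versions.

```cpp
#include <string>
#include <vector>

enum class TokenKind { Number, Plus, Error, Eof };  // trimmed-down kinds for this sketch
struct Token { TokenKind kind; std::string text; };

// Hypothetical helper: gather every lexical error message instead of
// stopping at the first one.
std::vector<std::string> collectLexErrors(const std::vector<Token>& tokens) {
    std::vector<std::string> messages;
    for (const Token& t : tokens) {
        if (t.kind == TokenKind::Error) messages.push_back(t.text);
    }
    return messages;
}
```

A driver could print all collected messages and only then refuse to parse, which is not possible if the lexer throws on the first bad character.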

Try It

After building (next step covers CMake), the lexer is hidden behind eval. To inspect tokens directly, you can add a quick dump_tokens helper:

for (auto& t : Lexer("1 + 2 * 3").tokenize())
    std::cout << kindName(t.kind) << " " << t.text << "\n";

Expected:

NUMBER 1
PLUS +
NUMBER 2
STAR *
NUMBER 3
EOF

Next

02-the-ast.md — design the tree the parser will build.

Step 2 — The AST

Goal: design the tree data structure the parser will build and the evaluator will walk.

Three Node Types Cover All Of Arithmetic

| Node | Represents | Children |
| --- | --- | --- |
| NumberExpr | 42, 3.14 | none |
| BinaryExpr | a + b, a * b, … | lhs, rhs |
| UnaryExpr | -a | operand |

That's it. Three classes, no surprises.

Class Hierarchy + Visitor

struct NumberExpr; struct BinaryExpr; struct UnaryExpr;

template <typename R>
struct ExprVisitor {
    virtual ~ExprVisitor() = default;
    virtual R visit(NumberExpr&) = 0;
    virtual R visit(BinaryExpr&) = 0;
    virtual R visit(UnaryExpr&)  = 0;
};

struct Expr {
    virtual ~Expr() = default;
    virtual double accept(ExprVisitor<double>& v) = 0;
};

struct NumberExpr : Expr {
    double value;
    explicit NumberExpr(double v) : value(v) {}
    double accept(ExprVisitor<double>& v) override { return v.visit(*this); }
};
// BinaryExpr, UnaryExpr follow the same accept pattern

The dance:

  1. Each node has accept(Visitor&): one virtual call site.
  2. accept immediately calls visitor.visit(*this) — and because *this has its concrete type at the call site, the right overload is selected at compile time.
  3. The visitor's visit(NumberExpr&) etc. are user-defined per pass.

This is the double dispatch trick: virtual dispatch picks the node's concrete type; static dispatch (overloading) picks the operation on it.
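
A compact, compilable version of the pattern shows why it pays off: two different passes run over the same node classes without touching them. This is a sketch, not the reference implementation. AddExpr is a hypothetical stand-in for BinaryExpr, the visitor is fixed to double instead of being a template, and NodeCounter and sampleTree exist only for this illustration.

```cpp
#include <memory>
#include <utility>

struct NumberExpr; struct AddExpr;

struct Visitor {                       // fixed to double, like accept() in the lab
    virtual ~Visitor() = default;
    virtual double visit(NumberExpr&) = 0;
    virtual double visit(AddExpr&) = 0;
};

struct Expr {
    virtual ~Expr() = default;
    virtual double accept(Visitor& v) = 0;
};

struct NumberExpr : Expr {
    double value;
    explicit NumberExpr(double v) : value(v) {}
    double accept(Visitor& v) override { return v.visit(*this); }  // selects visit(NumberExpr&)
};

struct AddExpr : Expr {                // hypothetical stand-in for BinaryExpr
    std::unique_ptr<Expr> lhs, rhs;
    AddExpr(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    double accept(Visitor& v) override { return v.visit(*this); }  // selects visit(AddExpr&)
};

// Pass 1: evaluate the expression.
struct Evaluator : Visitor {
    double visit(NumberExpr& n) override { return n.value; }
    double visit(AddExpr& a) override { return a.lhs->accept(*this) + a.rhs->accept(*this); }
};

// Pass 2: count nodes, reusing the same node classes unchanged.
struct NodeCounter : Visitor {
    double visit(NumberExpr&) override { return 1; }
    double visit(AddExpr& a) override { return 1 + a.lhs->accept(*this) + a.rhs->accept(*this); }
};

std::unique_ptr<Expr> sampleTree() {   // builds (1 + 2) + 4
    auto inner = std::make_unique<AddExpr>(std::make_unique<NumberExpr>(1.0),
                                           std::make_unique<NumberExpr>(2.0));
    return std::make_unique<AddExpr>(std::move(inner), std::make_unique<NumberExpr>(4.0));
}
```

Adding a third pass means writing one more Visitor subclass; adding a new node type means touching every visitor, which is the trade-off the next section weighs.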

Why Not Just A Switch / std::variant?

Two viable alternatives:

Alternative A — std::variant<Number, Binary, Unary>

using Expr = std::variant<NumberExpr, BinaryExpr, UnaryExpr>;
double eval(const Expr& e) {
    return std::visit(overloaded{
        [](const NumberExpr& n) { return n.value; },
        [](const BinaryExpr& b) { /* … */ },
        [](const UnaryExpr& u)  { /* … */ }
    }, e);
}

Pros: no virtuals, slightly faster, more "modern C++". Cons: every new node type requires updating every visit site (or using if constexpr); recursive types need extra wrapping (std::variant<NumberExpr, std::unique_ptr<BinaryExpr>, …>); error messages are worse.
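
The snippet above leaves out the overloaded helper and the wrapping that recursion requires. A minimal compilable sketch of Alternative A might look like the following; Num, Add, Node, and sampleNode are hypothetical names chosen for this illustration, and only numbers and addition are modeled.

```cpp
#include <memory>
#include <utility>
#include <variant>

struct Num; struct Add;
// The recursive alternative is held through unique_ptr so the variant has a known size.
using Node = std::variant<Num, std::unique_ptr<Add>>;

struct Num { double value; };
struct Add { Node lhs; Node rhs; };

// The usual C++17 helper for building one callable out of several lambdas.
template <class... Fs> struct overloaded : Fs... { using Fs::operator()...; };
template <class... Fs> overloaded(Fs...) -> overloaded<Fs...>;

double eval(const Node& n) {
    return std::visit(overloaded{
        [](const Num& x) { return x.value; },
        [](const std::unique_ptr<Add>& a) { return eval(a->lhs) + eval(a->rhs); }
    }, n);
}

Node sampleNode() {   // builds (1 + 2) + 4
    auto inner = std::make_unique<Add>(Add{Num{1.0}, Num{2.0}});
    return std::make_unique<Add>(Add{std::move(inner), Num{4.0}});
}
```

If a new alternative is added to Node and eval is not updated, std::visit fails to compile, so the "update every visit site" cost at least surfaces as a compile error.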

Alternative B — Switch On Tag

struct Expr { ExprKind kind; /* … */ };
double eval(const Expr& e) {
    switch (e.kind) {
        case ExprKind::Number: /* … */;
        case ExprKind::Binary: /* … */;
    }
}

Pros: zero virtual overhead; single allocation possible (one Expr struct sized for the largest variant). Cons: every pass touches every node type's data layout; no encapsulation.
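A runnable version of the tagged layout might look like this. It is a sketch under stated assumptions: one struct holds the fields of every node kind (so each pass sees the full layout, which is the encapsulation cost noted above), and the builder names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <utility>

enum class ExprKind { Number, Binary, Unary };

// One struct sized for every kind; unused fields keep their defaults.
struct Expr {
    ExprKind kind = ExprKind::Number;
    char op = 0;                   // Binary and Unary only
    double value = 0.0;            // Number only
    std::unique_ptr<Expr> lhs;     // Binary: left operand; Unary: the operand
    std::unique_ptr<Expr> rhs;     // Binary only
};
using ExprPtr = std::unique_ptr<Expr>;

ExprPtr number(double v) {
    auto e = std::make_unique<Expr>();
    e->value = v;
    return e;
}
ExprPtr unary(char op, ExprPtr operand) {
    auto e = std::make_unique<Expr>();
    e->kind = ExprKind::Unary;
    e->op = op;
    e->lhs = std::move(operand);
    return e;
}
ExprPtr binary(char op, ExprPtr l, ExprPtr r) {
    auto e = std::make_unique<Expr>();
    e->kind = ExprKind::Binary;
    e->op = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

double eval(const Expr& e) {
    switch (e.kind) {
        case ExprKind::Number:
            return e.value;
        case ExprKind::Unary:
            assert(e.op == '-' && e.lhs);
            return -eval(*e.lhs);
        case ExprKind::Binary: {
            assert(e.lhs && e.rhs);
            double l = eval(*e.lhs), r = eval(*e.rhs);
            switch (e.op) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                case '/': return l / r;
            }
            break;
        }
    }
    assert(false && "unknown kind or operator");
    return 0.0;
}
```

For example, eval(*binary('*', number(3), unary('-', number(4)))) computes -12 with no virtual calls at all.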

Which does production use?

  • Clang's AST: a class hierarchy carrying a kind tag (Stmt::getStmtClass()), walked by CRTP visitors (StmtVisitor, RecursiveASTVisitor) that switch on the tag instead of calling a virtual accept.
  • rustc's AST: enums (Rust's std::variant-equivalent).
  • LLVM IR's Instruction: tagged class hierarchy + visitor.

We use the virtual + Visitor pattern here because it's the most teachable, and its one-visit-method-per-node-type shape carries over directly to Clang's visitors. The trade-offs are real; we discuss them in docs/analysis.md.

Ownership: std::unique_ptr

using ExprPtr = std::unique_ptr<Expr>;

struct BinaryExpr : Expr {
    TokenKind op;
    ExprPtr   lhs;
    ExprPtr   rhs;
    BinaryExpr(TokenKind o, ExprPtr l, ExprPtr r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
};

Children are owned via unique_ptr. A tree is not shared — destroying the root recursively destroys the whole tree.

Why not shared_ptr? Nothing in this lab needs shared ownership of AST nodes. Sharing would imply a node with two parents, and an AST is a tree: every node has exactly one parent.

Why not raw pointers + manual delete? Memory leak waiting to happen, especially with parser exceptions. unique_ptr is RAII for ownership.

Why not arena allocation? It's the technique production compilers use (LLVM has BumpPtrAllocator; Clang's ASTContext allocates from a single arena). We'll add arena allocation in cp-09 when we have many AST nodes per program. For arithmetic, individual unique_ptr allocations are fine.

The accept Method, Concretely

double NumberExpr::accept(ExprVisitor<double>& v) { return v.visit(*this); }

When the evaluator runs someExpr->accept(eval):

  1. Virtual dispatch goes to the right accept (e.g., NumberExpr::accept).
  2. That calls eval.visit(*this). Because *this is a NumberExpr&, overload resolution picks visit(NumberExpr&) at compile time, and because visit is virtual, the call lands in Evaluator::visit(NumberExpr&).

This is the classic "double dispatch": the virtual accept call routes on the node's type, and the virtual visit call routes on the visitor's type. Overloading only selects which visit signature is named at the call site.

The Template Parameter R

template <typename R>
struct ExprVisitor {
    virtual R visit(NumberExpr&) = 0;
    virtual R visit(BinaryExpr&) = 0;
    virtual R visit(UnaryExpr&)  = 0;
};

R is the return type. We define Evaluator : ExprVisitor<double>. A printer would be Printer : ExprVisitor<std::string>. A type-checker would be TypeChecker : ExprVisitor<Type>. Code generation in cp-11 will be LLVMGen : ExprVisitor<llvm::Value*>.

In this lab Expr::accept is hard-wired to ExprVisitor<double>, which is sufficient for our one visitor. With more passes you'd add one accept overload per result type, use std::any/erased return types, or make visit return void and store the result in a visitor member. Clang's StmtVisitor takes the return type as a template parameter and dispatches on the node's kind tag; rustc uses Rust enums and match.
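One way around the hard-wired return type is a visitor whose visit methods return void and leave their answer in a member. The sketch below assumes that design; it is not what the lab ships. Only NumberExpr and BinaryExpr are modeled to keep it short, the operator is a plain char, and DepthMeter is a hypothetical second pass whose result type differs from the evaluator's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

struct NumberExpr; struct BinaryExpr;

struct Visitor {                       // not templated: results live in members
    virtual ~Visitor() = default;
    virtual void visit(NumberExpr&) = 0;
    virtual void visit(BinaryExpr&) = 0;
};

struct Expr {
    virtual ~Expr() = default;
    virtual void accept(Visitor& v) = 0;   // one accept serves every pass
};
using ExprPtr = std::unique_ptr<Expr>;

struct NumberExpr : Expr {
    double value;
    explicit NumberExpr(double v) : value(v) {}
    void accept(Visitor& v) override { v.visit(*this); }
};
struct BinaryExpr : Expr {
    char op;
    ExprPtr lhs, rhs;
    BinaryExpr(char o, ExprPtr l, ExprPtr r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    void accept(Visitor& v) override { v.visit(*this); }
};

struct Evaluator : Visitor {
    double result = 0.0;
    void visit(NumberExpr& n) override { result = n.value; }
    void visit(BinaryExpr& b) override {
        b.lhs->accept(*this); const double l = result;  // copy before it is overwritten
        b.rhs->accept(*this); const double r = result;
        switch (b.op) {
            case '+': result = l + r; break;
            case '-': result = l - r; break;
            case '*': result = l * r; break;
            default:  assert(b.op == '/'); result = l / r; break;
        }
    }
};

struct DepthMeter : Visitor {          // hypothetical pass with a different result type
    std::size_t result = 0;
    void visit(NumberExpr&) override { result = 1; }
    void visit(BinaryExpr& b) override {
        b.lhs->accept(*this); const std::size_t l = result;
        b.rhs->accept(*this); const std::size_t r = result;
        result = 1 + std::max(l, r);
    }
};

// Builds (1 + 2) * 3.
ExprPtr sampleTree() {
    auto sum = std::make_unique<BinaryExpr>('+', std::make_unique<NumberExpr>(1),
                                            std::make_unique<NumberExpr>(2));
    return std::make_unique<BinaryExpr>('*', std::move(sum),
                                        std::make_unique<NumberExpr>(3));
}
```

The price of this design is that every recursive step must copy result out before the next child overwrites it, which is easy to get wrong; the templated-return visitor avoids that at the cost of a more awkward accept.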

Try It

After we build, this minimal main exercises the AST manually:

auto five  = std::make_unique<NumberExpr>(5.0);
auto three = std::make_unique<NumberExpr>(3.0);
auto plus  = std::make_unique<BinaryExpr>(TokenKind::Plus,
                                          std::move(five),
                                          std::move(three));
Evaluator e;
std::cout << e.eval(*plus) << "\n";   // → 8

(For the actual interactive evaluator we go through the parser; this is just to confirm the AST machinery works in isolation.)

Next

03-recursive-descent-parser.md — build the AST from tokens.

Step 3 — The Recursive-Descent Parser

Goal: turn a token stream into the AST you designed in Step 2, respecting precedence and associativity.

The Grammar (Recap From CONCEPTS.md)

expr   = term   { ('+' | '-') term }
term   = factor { ('*' | '/') factor }
factor = NUMBER
       | '(' expr ')'
       | '-' factor

Three rules, three functions. Recursive descent is a literal transcription.

The Cursor Helpers

const Token& Parser::peek() const   { return tokens_[pos_]; }
const Token& Parser::advance()      { return tokens_[pos_++]; }
bool Parser::check(TokenKind k) const { return peek().kind == k; }
bool Parser::match(TokenKind k)     { if (check(k)) { advance(); return true; } return false; }

const Token& Parser::expect(TokenKind k, const char* msg) {
    if (!check(k)) throw ParseError(/* … */);
    return advance();
}

Five helpers cover every recursive-descent parser ever written:

  • peek — look without consuming
  • advance — consume one
  • check — peek's kind matches?
  • match — if so, consume and return true
  • expect — must match, or throw

This vocabulary scales: Clang's parser uses essentially the same primitives, just with richer diagnostics in expect.

parseExpr — Lowest Precedence

ExprPtr Parser::parseExpr() {
    ExprPtr left = parseTerm();
    while (check(TokenKind::Plus) || check(TokenKind::Minus)) {
        TokenKind op = peek().kind; advance();
        ExprPtr right = parseTerm();
        left = std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));
    }
    return left;
}

Reads aloud: "parse a term, then while the next token is + or -, consume the operator, parse another term, and wrap the previous result as the new left."

Why The While Loop = Left Associativity

Trace 5 - 3 - 1:

Initial:  left = parseTerm() = (5)

Iter 1:   token = -, consume
          right = parseTerm() = (3)
          left = BinaryExpr(-, (5), (3))            ← (5 - 3)

Iter 2:   token = -, consume
          right = parseTerm() = (1)
          left = BinaryExpr(-, (5-3), (1))          ← ((5 - 3) - 1)

Done.     evaluate → 1

Each iteration nests the previous left on the inside. The tree leans left. That's left-associativity.

To make - right-associative (5 - 3 - 1 = 5 - (3 - 1) = 3), you'd write:

ExprPtr right = parseExpr();    // recurse instead of loop
return std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));

Recursion-on-the-right ⇒ right-associativity. This is exactly how we'd handle ^ (exponentiation) in Step 6's extension.
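The loop-versus-recursion difference can be checked in isolation. The following is a deliberately tiny sketch, not the lab's parser: MiniParser is a hypothetical type that accepts only single-digit operands and '-', has no whitespace handling, and computes values directly instead of building an AST.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical mini-parser: single-digit operands, '-' only, no whitespace.
struct MiniParser {
    std::string src;
    std::size_t pos = 0;

    double number() {
        assert(pos < src.size() &&
               std::isdigit(static_cast<unsigned char>(src[pos])));
        return src[pos++] - '0';
    }

    // Loop: each new operand is combined with everything parsed so far,
    // so the grouping leans left.
    double leftAssoc() {
        double left = number();
        while (pos < src.size() && src[pos] == '-') {
            ++pos;
            left = left - number();
        }
        return left;
    }

    // Recursion on the right: the rest of the input is parsed first,
    // so the grouping leans right.
    double rightAssoc() {
        double left = number();
        if (pos < src.size() && src[pos] == '-') {
            ++pos;
            return left - rightAssoc();
        }
        return left;
    }
};
```

On the input "5-3-1", leftAssoc() computes (5 - 3) - 1 = 1 and rightAssoc() computes 5 - (3 - 1) = 3, matching the two traces above.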

parseTerm — Same Pattern, Higher Precedence

ExprPtr Parser::parseTerm() {
    ExprPtr left = parseFactor();
    while (check(TokenKind::Star) || check(TokenKind::Slash)) {
        TokenKind op = peek().kind; advance();
        ExprPtr right = parseFactor();
        left = std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));
    }
    return left;
}

Identical shape; the only changes are the operators consumed and the next-level call (parseFactor). Adding a new precedence level (modulo, comparison, etc.) means inserting another function in the call chain.

Why This = Higher Precedence Than +/-

When parseExpr calls parseTerm, control descends into parseTerm's while loop. That loop consumes all the * and / operators before returning. By the time parseTerm returns to parseExpr, the multiplication has already been packaged into a sub-tree. parseExpr then attaches that sub-tree as either left or right of its +/- node.

* binds tighter because parseTerm "grabs" its operators first — they're inside its loop, not parseExpr's.

parseFactor — Atoms and Recursion Back to parseExpr

ExprPtr Parser::parseFactor() {
    if (check(TokenKind::Number)) {
        double v = peek().value; advance();
        return std::make_unique<NumberExpr>(v);
    }
    if (match(TokenKind::LParen)) {
        ExprPtr inner = parseExpr();              // back to the top
        expect(TokenKind::RParen, "expected ')'");
        return inner;
    }
    if (check(TokenKind::Minus)) {
        advance();
        ExprPtr operand = parseFactor();          // right-recursive
        return std::make_unique<UnaryExpr>(TokenKind::Minus, std::move(operand));
    }
    if (check(TokenKind::Error)) { /* propagate lex error */ }
    throw ParseError(/* … */);
}

Three productions:

  1. NUMBER — consume and wrap.
  2. '(' expr ')' — consume (, recurse all the way back to parseExpr (precedence resets!), expect ). This is how parens override precedence.
  3. '-' factor — unary minus. --5 works because the recursion is on parseFactor, not directly creating a Number.

Why Parens Override Precedence

(1 + 2) * 3:

  1. parseExpr → parseTerm → parseFactor.
  2. parseFactor sees (, calls parseExpr again.
  3. That parseExpr (the inner one) parses 1 + 2 → BinaryExpr(+, 1, 2).
  4. Outer parseFactor consumes ), returns BinaryExpr(+, 1, 2).
  5. Control returns to outer parseTerm. Its left is (1+2). Loop sees *, parses 3. Builds BinaryExpr(*, (1+2), 3).

The parens temporarily gave us "expr-level reset" inside what would have been factor-level parsing. The grammar isn't doing anything magical; the recursion-back-to-parseExpr is.

Trip Through The Full Pipeline

For 2 * (3 + 4):

Tokens:    NUMBER(2) STAR LPAREN NUMBER(3) PLUS NUMBER(4) RPAREN EOF

parseExpr
 └─ parseTerm
     ├─ parseFactor → NumberExpr(2)
     ├─ sees STAR, consume; parseFactor:
     │   ├─ sees LPAREN, consume
     │   ├─ parseExpr (inner)
     │   │   ├─ parseTerm
     │   │   │   └─ parseFactor → NumberExpr(3)   (no * or / follows)
     │   │   ├─ sees PLUS, consume; parseTerm → parseFactor → NumberExpr(4)
     │   │   └─ returns BinaryExpr(+, 3, 4)
     │   └─ consume RPAREN; returns BinaryExpr(+, 3, 4)
     └─ wrap into BinaryExpr(*, 2, BinaryExpr(+, 3, 4))

Result:    BinaryExpr(*, NumberExpr(2), BinaryExpr(+, NumberExpr(3), NumberExpr(4)))

Error Cases

| Input | What Happens |
|---|---|
| "" | parse() sees Eof immediately → throws "empty input" |
| 1 + | parseExpr parses 1, consumes +, calls parseTerm → parseFactor, which sees Eof and throws |
| (1 + 2 | parseFactor matches (, recurses, then expect(RParen) fails → throws "expected ')'" |
| 1 2 | parseExpr parses 1, no +/-, returns. parse() checks for Eof, finds NUMBER(2) → throws "unexpected token" |
| 1 / 0 | parses fine; the evaluator throws (Step 4) |
| 1 + abc | lexer emits Error token; parseFactor propagates it |

All six error cases live in this lab. Phase 9 (Diagnostics) replaces these one-line throws with Clang-style source spans + fix-it hints, but the detection logic is identical.
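To see the grammar and the error cases working together without the lexer and AST layers, here is a compact sketch that parses and evaluates in one pass over the raw string. It is a simplification, not the lab's implementation: DirectEvaluator, evalString, and fails are hypothetical names, errors are plain std::runtime_error rather than ParseError/EvalError, and the messages are illustrative.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

class DirectEvaluator {
public:
    explicit DirectEvaluator(std::string src) : src_(std::move(src)) {}

    double parse() {
        skipSpace();
        if (atEnd()) throw std::runtime_error("empty input");
        double v = expr();
        skipSpace();
        if (!atEnd()) throw std::runtime_error("unexpected token");
        return v;
    }

private:
    std::string src_;
    std::size_t pos_ = 0;

    bool atEnd() const { return pos_ >= src_.size(); }
    bool digitHere() const {
        return !atEnd() && std::isdigit(static_cast<unsigned char>(src_[pos_]));
    }
    void skipSpace() {
        while (!atEnd() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }
    bool match(char c) {
        skipSpace();
        if (!atEnd() && src_[pos_] == c) { ++pos_; return true; }
        return false;
    }

    double expr() {                       // expr = term { ('+' | '-') term }
        double left = term();
        for (;;) {
            if (match('+'))      left += term();
            else if (match('-')) left -= term();
            else return left;
        }
    }
    double term() {                       // term = factor { ('*' | '/') factor }
        double left = factor();
        for (;;) {
            if (match('*')) left *= factor();
            else if (match('/')) {
                double r = factor();
                if (r == 0.0) throw std::runtime_error("division by zero");
                left /= r;
            } else return left;
        }
    }
    double factor() {                     // NUMBER | '(' expr ')' | '-' factor
        if (match('(')) {
            double v = expr();            // back to the top: precedence resets
            if (!match(')')) throw std::runtime_error("expected ')'");
            return v;
        }
        if (match('-')) return -factor();
        if (!digitHere()) throw std::runtime_error("expected number, '(' or '-'");
        const std::size_t start = pos_;
        while (digitHere()) ++pos_;
        if (!atEnd() && src_[pos_] == '.') { ++pos_; while (digitHere()) ++pos_; }
        assert(pos_ > start);
        return std::stod(src_.substr(start, pos_ - start));
    }
};

double evalString(const std::string& s) { return DirectEvaluator(s).parse(); }

bool fails(const std::string& s) {
    try { evalString(s); } catch (const std::runtime_error&) { return true; }
    return false;
}
```

Every row of the table above maps to one throw site here: fails("") hits the empty-input check, fails("1 +") and fails("1 + abc") hit the factor check, fails("(1 + 2") hits the missing ')', fails("1 2") hits the trailing-token check in parse(), and fails("1 / 0") hits the division check.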

Why Is This Called "LL(1)"?

  • Left-to-right scan of input.
  • Leftmost derivation produced.
  • 1 token of lookahead (we only ever call peek() once per decision; never peek(2)).

LL(1) is the most common parsing class for hand-written compilers. Some real languages need LL(2) or even unbounded lookahead (C++ template parsing) — Clang uses tentative parsing in those spots.

Next

04-the-evaluator.md — walk the tree and produce a number.

Step 4 — The Evaluator (Tree-Walking Interpreter)

Goal: given the AST, compute the number it represents.

Three Lines Per Node

double Evaluator::visit(NumberExpr& n) { return n.value; }

double Evaluator::visit(BinaryExpr& b) {
    double l = b.lhs->accept(*this);
    double r = b.rhs->accept(*this);
    switch (b.op) {
        case TokenKind::Plus:  return l + r;
        case TokenKind::Minus: return l - r;
        case TokenKind::Star:  return l * r;
        case TokenKind::Slash:
            if (r == 0.0) throw EvalError("division by zero");
            return l / r;
        default: throw EvalError("unknown binary operator");
    }
}

double Evaluator::visit(UnaryExpr& u) {
    double v = u.operand->accept(*this);
    switch (u.op) {
        case TokenKind::Minus: return -v;
        default: throw EvalError("unknown unary operator");
    }
}

That's the entire interpreter for arithmetic. ~25 lines.

Post-Order Traversal

visit(BinaryExpr&) evaluates both children first, then combines. This is post-order traversal, the only order that works for strict expression evaluation. (Pre-order would try to "+" before knowing what to "+".)

        BinaryExpr(*)
       /            \
  BinaryExpr(+)  NumberExpr(3)
   /        \
 (1)        (2)

Evaluation order (post-order):
  1.  NumberExpr(1)  → 1
  2.  NumberExpr(2)  → 2
  3.  BinaryExpr(+)  → 1 + 2 = 3
  4.  NumberExpr(3)  → 3
  5.  BinaryExpr(*)  → 3 * 3 = 9

In visit(BinaryExpr&), the two accept(*this) lines recursively evaluate the subtrees. The C++ call stack mirrors the tree shape: each level of nesting adds an accept frame and a visit frame, so a 10-deep expression makes about 20 stack frames.
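Post-order is also the order in which a stack machine wants its instructions, which is why it reappears in the bytecode labs. The sketch below is an illustration under assumptions, not code from this lab: Node is a simplified tagged node ('#' marks a number leaf), and Instr, emitPostOrder, run, and opsOf are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
    char op = '#';                 // '#' = number leaf; otherwise '+', '-', '*', '/'
    double value = 0.0;
    std::unique_ptr<Node> lhs, rhs;
};
using NodePtr = std::unique_ptr<Node>;

NodePtr leaf(double v) {
    auto n = std::make_unique<Node>();
    n->value = v;
    return n;
}
NodePtr node(char op, NodePtr l, NodePtr r) {
    auto n = std::make_unique<Node>();
    n->op = op;
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

struct Instr { char op; double value; };

// Children first, then the node itself: post-order.
void emitPostOrder(const Node& n, std::vector<Instr>& out) {
    if (n.op != '#') {
        emitPostOrder(*n.lhs, out);
        emitPostOrder(*n.rhs, out);
    }
    out.push_back({n.op, n.value});
}

// Evaluates the flat sequence with an explicit operand stack.
double run(const std::vector<Instr>& code) {
    std::vector<double> stack;
    for (const Instr& in : code) {
        if (in.op == '#') { stack.push_back(in.value); continue; }
        assert(stack.size() >= 2);
        const double r = stack.back(); stack.pop_back();
        const double l = stack.back(); stack.pop_back();
        switch (in.op) {
            case '+': stack.push_back(l + r); break;
            case '-': stack.push_back(l - r); break;
            case '*': stack.push_back(l * r); break;
            default:  assert(in.op == '/'); stack.push_back(l / r); break;
        }
    }
    assert(stack.size() == 1);
    return stack.back();
}

// Helper for inspecting the emitted order, one character per instruction.
std::string opsOf(const std::vector<Instr>& code) {
    std::string s;
    for (const Instr& in : code) s += in.op;
    return s;
}
```

For the tree (1 + 2) * 3, emitPostOrder produces the sequence "##+#*" (push 1, push 2, add, push 3, multiply), the same five steps as the evaluation order listed above, and run computes 9 from it.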

What accept(*this) Does, Step By Step

double l = b.lhs->accept(*this);
  1. b.lhs is std::unique_ptr<Expr> — dereference gets Expr&.
  2. Expr::accept is virtual; dispatched to (say) NumberExpr::accept.
  3. That calls visitor.visit(*this) with *this = NumberExpr&.
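  4. Overload resolution picks visit(NumberExpr&) at compile time; because visit is virtual, the call lands in Evaluator::visit(NumberExpr&).
  5. Returns n.value.

For each node visit there are 2 virtual calls (accept on the node, visit on the visitor), 1 overload resolution (done at compile time, so free at run time), and the node's own work. The indirect calls are the main cost of tree-walking, and they are what a bytecode VM replaces with one dispatch loop.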

Errors Surface Bottom-Up

Division by zero at any depth throws EvalError, which unwinds through every accept/visit frame back to the top-level eval. The CLI catches it and prints a message.

This works because C++ exceptions propagate transparently through virtual calls. Manual error-handling — say, returning a sentinel — would require checking after every recursive call. Exceptions handle the "anywhere in this subtree" case for free.

Modern compilers like Clang largely don't use exceptions in their AST passes (the LLVM project disables them — -fno-exceptions). They use Expected<T> / ErrorOr<T> value-types that force explicit handling. We use exceptions here for simplicity; future labs will switch when the costs/benefits flip.
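For contrast, here is what value-based error handling looks like on the same evaluation. This is a sketch that uses std::variant as a stand-in; it is not LLVM's Expected<T> API, and Node, EvalFailure, and EvalResult are hypothetical names. Notice the explicit check after each recursive call, which is exactly the bookkeeping exceptions spare us.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <variant>

struct EvalFailure { std::string message; };
using EvalResult = std::variant<double, EvalFailure>;   // value or error, never both

struct Node {
    char op = '#';                 // '#' = number leaf; otherwise a binary operator
    double value = 0.0;
    std::unique_ptr<Node> lhs, rhs;
};
using NodePtr = std::unique_ptr<Node>;

NodePtr leaf(double v) {
    auto n = std::make_unique<Node>();
    n->value = v;
    return n;
}
NodePtr node(char op, NodePtr l, NodePtr r) {
    auto n = std::make_unique<Node>();
    n->op = op;
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

EvalResult eval(const Node& n) {
    if (n.op == '#') return n.value;
    assert(n.lhs && n.rhs);

    EvalResult l = eval(*n.lhs);
    if (std::holds_alternative<EvalFailure>(l)) return l;   // manual propagation
    EvalResult r = eval(*n.rhs);
    if (std::holds_alternative<EvalFailure>(r)) return r;   // and again

    const double a = std::get<double>(l);
    const double b = std::get<double>(r);
    switch (n.op) {
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        case '/':
            if (b == 0.0) return EvalFailure{"division by zero"};
            return a / b;
        default:
            return EvalFailure{"unknown binary operator"};
    }
}
```

Evaluating 1 + 1 / (2 - 2) returns an EvalFailure carrying "division by zero" through two levels of callers, each of which had to check and forward it by hand.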

What Happens For 3 * -4?

Tokens: NUMBER(3) STAR MINUS NUMBER(4) EOF

Parser:

  • parseExpr → parseTerm
  • parseTerm → parseFactor → NumberExpr(3)
  • sees *, consume
  • parseFactor → sees -, consume → parseFactor recurse → NumberExpr(4) → UnaryExpr(-, 4)
  • builds BinaryExpr(*, NumberExpr(3), UnaryExpr(-, NumberExpr(4)))

Evaluation:

  • visit(BinaryExpr(*)):
    • left = visit(NumberExpr(3)) = 3
    • right = visit(UnaryExpr(-)):
      • operand = visit(NumberExpr(4)) = 4
      • return -4
    • return 3 * -4 = -12

Same shape as any other binary op — the unary minus is just one extra level of recursion.

Why Tree-Walking Is Slow

Per node visit:

  • 2 virtual dispatches, accept then visit (each around a nanosecond when the indirect branch is predicted, much more when it mispredicts or the vtable is cold)
  • 1 pointer-indirection to the next node (likely cache miss for big trees)
  • C++ function-call overhead (stack frame, saved registers)

As a rough, unmeasured illustration: tree-walking a 1000-node expression might take ~30-50µs, a bytecode VM ~5µs for the same work, and native code ~1µs.

Numbers vary wildly by language and CPU; the ratio is consistent. We'll measure precisely in cp-06 when we have a bytecode VM to compare against.

Why It's Still Worth Building

For learning, tree-walkers win:

  • Smallest code surface — every concept is local.
  • Easy to debug: print the tree, watch it walk.
  • Easy to add features: new node type → new visit overload.

Tree-walkers are also adequate for many real workloads — config languages, shell scripts, build files, query languages. Performance-critical paths get bytecode (cp-06+); the rest stay tree-walked.

Try It

cd src/cpp
cmake -B build && cmake --build build
./build/eval "1 + 2 * 3"
# 7
./build/eval "((1 + 2) * (3 + 4) - 5) / 2"
# 8
./build/eval "1 / 0"
# division by zero

Next

05-repl-tests-and-cli.md — wire the front-end into a usable CLI + tests.

Step 5 — REPL, Tests, and CLI

Goal: wire lexer + parser + evaluator into a real tool, with a test suite and an interactive REPL.

The CLI Driver — main.cpp

int main(int argc, char** argv) {
    if (argc == 1) return arith::repl();
    // gather argv[1..] into one expression for unquoted use
    std::ostringstream oss;
    for (int i = 1; i < argc; ++i) {
        if (i > 1) oss << ' ';
        oss << argv[i];
    }
    return arith::evaluateOnce(oss.str());
}

Two modes:

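  • One-shot: ./eval "1 + 2 * 3" or ./eval 1 + 2 + 3. Quote any expression containing * or parentheses, which the shell would otherwise expand or interpret.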
  • REPL: ./eval with no args.

The REPL Loop

static int repl() {
    std::cout << "arith repl (cp-02). type 'quit' or Ctrl-D to exit.\n";
    std::string line;
    for (;;) {
        std::cout << "> " << std::flush;
        if (!std::getline(std::cin, line)) { std::cout << "\n"; return 0; }
        if (line == "quit" || line == "exit") return 0;
        if (line.empty()) continue;
        evaluateOnce(line);
    }
}

Every iteration: read a line, send it through the whole pipeline, print. No persistent state between lines — variables and bindings come in cp-03.
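The loop is easier to test when the streams and the pipeline are passed in rather than hard-coded. The following is a sketch of that shape, not the lab's source: the extra parameters are an assumption, the real evaluateOnce takes only the line, and toyEvaluate is a hypothetical stand-in that accepts a bare number so the example stays self-contained.

```cpp
#include <cassert>
#include <exception>
#include <iostream>
#include <sstream>
#include <string>

using EvalFn = double (*)(const std::string&);

// Runs one line through the pipeline; returns a process-style exit code.
int evaluateOnce(const std::string& line, EvalFn evaluate,
                 std::ostream& out, std::ostream& err) {
    assert(evaluate != nullptr);
    try {
        out << evaluate(line) << "\n";
        return 0;
    } catch (const std::exception& e) {
        err << "error: " << e.what() << "\n";
        return 1;
    }
}

int repl(std::istream& in, std::ostream& out, std::ostream& err, EvalFn evaluate) {
    std::string line;
    for (;;) {
        out << "> " << std::flush;
        if (!std::getline(in, line)) { out << "\n"; return 0; }   // Ctrl-D / EOF
        if (line == "quit" || line == "exit") return 0;
        if (line.empty()) continue;
        evaluateOnce(line, evaluate, out, err);   // a bad line does not end the session
    }
}

// Hypothetical stand-in for lexer + parser + evaluator: accepts a bare number.
// std::stod throws std::invalid_argument on non-numeric input.
double toyEvaluate(const std::string& line) { return std::stod(line); }
```

Feeding the lines 42, an empty line, abc, and quit through repl with std::istringstream and std::ostringstream prints the prompt four times and the result once, sends one message to the error stream, and stops before reading anything after quit.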

Building It

cd src/cpp
cmake -B build           # configure once
cmake --build build      # build (incremental)

CMake breakdown:

  • arithlib — static library with lexer.cpp, parser.cpp, evaluator.cpp.
  • eval — executable that links arithlib and main.cpp.
  • test_eval — executable that links arithlib and the test file.

Separating library from main lets us reuse the same compiler internals from tests — a pattern we keep in every later lab.

The Test Suite — test_eval.cpp

Assert-based, no test-framework dependency. 19 tests grouped by category; an excerpt:

// arithmetic
assert(APPROX(eval("1 + 2"),    3.0));
assert(APPROX(eval("10 - 4"),   6.0));
// precedence
assert(APPROX(eval("1 + 2 * 3"), 7.0));
// left-associativity
assert(APPROX(eval("10 - 3 - 2"), 5.0));
// parens
assert(APPROX(eval("(1 + 2) * 3"), 9.0));
// unary
assert(APPROX(eval("-(3 + 4)"), -7.0));
// floats / whitespace
assert(APPROX(eval("3.14 + 0.86"), 4.0));
// error cases
assert(throws(""));
assert(throws("1 +"));
assert(throws("1 / 0"));
assert(throws("(1 + 2"));

The macro APPROX(a, b) uses std::fabs((a) - (b)) < 1e-9 because comparing floats with == is fragile. throws(...) runs a snippet inside a try/catch and returns whether any std::exception was thrown.
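The two helpers could be written as follows. This is a minimal sketch based on the description above; the eval shown here is a hypothetical stand-in (it accepts only a bare number) so the block compiles on its own, where the real test file would call into arithlib.

```cpp
#include <cassert>
#include <cmath>
#include <exception>
#include <stdexcept>
#include <string>

// Tolerance comparison: floating-point results rarely match bit for bit.
#define APPROX(a, b) (std::fabs((a) - (b)) < 1e-9)

// Hypothetical stand-in for the lab's eval(source).
double eval(const std::string& src) {
    if (src.empty()) throw std::runtime_error("empty input");
    return std::stod(src);   // throws std::invalid_argument on non-numeric input
}

// True if evaluating src throws anything derived from std::exception.
bool throws(const std::string& src) {
    try { (void)eval(src); }
    catch (const std::exception&) { return true; }
    return false;
}

// The same shape as the excerpt above.
void runExcerpt() {
    assert(APPROX(eval("3.5"), 3.5));
    assert(throws(""));
}
```

The classic motivating case is 0.1 + 0.2, which is not == 0.3 in IEEE double arithmetic but does satisfy APPROX(0.1 + 0.2, 0.3).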

Running The Tests

ctest --test-dir build

Or directly:

./build/tests/test_eval
# cp-02 tests: 19/19 PASS

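Expected: 19/19 tests pass. If anything fails, the failing assert aborts immediately with a file name and line number. Build the tests without NDEBUG (CMake's Release configuration defines it by default for GCC and Clang), because NDEBUG compiles every assert out.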

Why No gtest / catch2?

Pulling in a framework would mean a find_package (often a git submodule), a CMake config file, and 10 MB of headers. For 19 trivial tests, plain assert is shorter, faster to build, and removes a teaching distraction.

We'll graduate to a real framework around cp-08, when individual test cases benefit from structured fixtures and parameterization.

Manual Sanity Checks

The classic checklist:

./build/eval "1 + 2 * 3"           # 7        (precedence)
./build/eval "(1 + 2) * 3"         # 9        (parens override)
./build/eval "10 - 3 - 2"          # 5        (left associativity)
./build/eval "3 * -4"              # -12      (unary in factor)
./build/eval "((((42))))"          # 42       (parens nest)
./build/eval "1 / 0" 2>&1          # division by zero
./build/eval "1 +" 2>&1            # parse error

If any of these don't match, suspect: parser precedence (Step 3), evaluator dispatch (Step 4), error propagation (parser → eval).

A Tour Of A Failure

Suppose you accidentally swapped left-associativity for right in parseExpr:

// WRONG:
ExprPtr right = parseExpr();   // recursion instead of loop

Then:

  • 10 - 3 - 2 → tree: (10 - (3 - 2)) = 10 - 1 = 9 (not 5!).
  • Test assert(APPROX(eval("10 - 3 - 2"), 5.0)); fails immediately.

This is exactly why we write tests for associativity — the bug is subtle and silent without them.

Outcomes

You now have:

  • A working evaluator binary, both REPL and one-shot.
  • A 19-test test suite proving correctness across precedence, associativity, parens, unary, floats, whitespace, and 4 error categories.
  • A CMake project structured for reuse — the same arithlib will be linked from cp-03's expanded language.

Next

06-extensions.md — optional extension exercises that deepen each concept.

Step 6 — Extension Exercises

Goal: consolidate the concepts by extending the evaluator in small, focused ways. Each extension is self-contained — pick one or more. Solutions live nowhere in this repo on purpose; struggle is the point.

Exercise 1 — Add % (Modulo)

Difficulty: ★☆☆☆☆ (1 minute)

Add % with the same precedence as * and /.

Hints:

  1. Add Percent to TokenKind.
  2. Add the lexer case ('%').
  3. Add the parser check inside parseTerm's while-loop condition.
  4. Add the evaluator case: use std::fmod (not % — operands are double).

Verify:

10 % 3 = 1
10 % 3 % 2 = 1     (left-assoc)

Exercise 2 — Add ^ (Exponentiation, Right-Associative)

Difficulty: ★★☆☆☆

Add ^ with higher precedence than * and right-associativity.

Hints:

  1. Add a new precedence level between term and factor: call it power.
  2. Grammar:
    term  = power { ('*' | '/') power }
    power = factor [ '^' power ]      ; recursion on the right ⇒ right-assoc
    
  3. Note the [ ... ] — an exponent may be absent. The power body recurses into power, not loops.

Verify:

2 ^ 3 = 8
2 ^ 3 ^ 2 = 512       (i.e. 2 ^ (3 ^ 2) = 2 ^ 9 — NOT (2^3)^2 = 64)
2 * 3 ^ 2 = 18        (^ binds tighter than *)

Exercise 3 — AST Pretty-Printer

Difficulty: ★★☆☆☆

Add a Printer : ExprVisitor<std::string> that returns a string representation. Two flavors:

(a) S-expression:

2 * (3 + 4)   →   (* 2 (+ 3 4))

(b) Indented tree:

BinaryExpr(*)
├── NumberExpr(2)
└── BinaryExpr(+)
    ├── NumberExpr(3)
    └── NumberExpr(4)

This exercise proves the Visitor pattern: no AST classes change, just a new visitor. Production debug tools (-ast-dump in Clang) do exactly this.

Hint: accept is currently ExprVisitor<double>-only. Either:

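  • Templatize accept (template<typename R> R accept(ExprVisitor<R>&)); a member template cannot be virtual, so this version has to dispatch on a kind tag or dynamic_cast instead of the vtable, or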
  • Add a separate accept_string(ExprVisitor<std::string>&), or
  • Use a non-visitor print(Expr&) function with a switch on a tag (simpler).

Exercise 4 — Source Locations in Errors

Difficulty: ★★★☆☆

Make errors report the column they occurred at:

> 1 + ?
      ^
parse error: expected number, '(' or '-' (got ERROR)

Hints:

  1. Add std::size_t pos to Token.
  2. Lexer records pos_ at the start of each token.
  3. ParseError carries the position. The catch site prints the source line + a ^ at that column.

This is a tiny taste of cp-15 (Diagnostics), where we'd add file:line:col, fix-it hints, and color.

Exercise 5 — Constant Folding Pass

Difficulty: ★★★☆☆

Write an Optimizer : ExprVisitor<ExprPtr> that returns a new AST with constants folded:

BinaryExpr(+, NumberExpr(2), NumberExpr(3))
   →   NumberExpr(5)

If both children are NumberExpr, compute the result; otherwise return a new BinaryExpr with optimized children.

Verify by:

  1. Printing the AST before and after.
  2. Confirming the evaluator still returns the same number.

This is your first compiler optimization pass. The same idea appears in LLVM's instcombine pass, the JVM's C1 compiler, and Clang's constant-expression evaluator (Expr::EvaluateAsInt and related entry points).

Exercise 6 — Disallow Trailing Garbage

Difficulty: ★☆☆☆☆

The parser already does this (parse() checks for Eof). Confirm with:

./eval "1 + 2 3"
# parse error: unexpected token '3' after expression

Now: make the error message highlight where the garbage starts — combine with Exercise 4.

Exercise 7 — REPL History and Last Result

Difficulty: ★★☆☆☆

Make _ in the REPL refer to the previous result:

> 1 + 2
3
> _ * 4
12

Hint: the REPL keeps a lastResult variable. The lexer recognizes _ as a special token. The parser allows it as a factor.

Exercise 8 — Bench Visitor vs Tagged Switch vs std::variant

Difficulty: ★★★★☆ (more about C++ than compilers)

Build the tree once, evaluate it 1M times. Measure with <chrono>:

  • via the Visitor (accept + virtual);
  • via a tagged switch (replace virtuals with kind switch);
  • via std::variant + std::visit.

Predict which is fastest. Verify. Discuss why.

(Spoiler: the tagged switch often wins on hot trees, sometimes by around 2×, because the compiler can inline the per-node work and skip call overhead, while the visitor pays two indirect calls per node, each a possible indirect-branch mispredict. Dispatch cost is a large part of why bytecode VMs beat tree-walkers.)


Done?

When you've internalized the concepts (even without doing every extension):

  • Mark this lab complete.
  • Move to ../cp-03-minilang-frontend/ — where the language grows to include variables, control flow, functions, and we switch to Pratt parsing.

cp-03 — MiniLang Frontend (Statements, Variables, Control Flow, Functions)

Status: ✅ Built — 34/34 tests pass.

Replaces the arithmetic grammar from cp-02 with a real language: let/var bindings, if/while, functions, blocks, closures. Switches the parser from hand-rolled recursive descent to a Pratt parser for expressions.

What You'll Build

  • A full Pratt expression parser driven by a binding-power table.
  • A recursive-descent statement parser: let/var, blocks, if/else, while, return, print, fn.
  • Two-tier AST (Stmt + templated-return Expr<R> visitors).
  • Scope-chain Environment (parent-linked maps) and lexical closures.
  • A tree-walking interpreter with first-class functions, recursion, higher-order calls.
  • A mli REPL and file runner.

Reading Order

  1. CONCEPTS.md — Pratt parsing, binding powers, statement/expression split, closures.
  2. src/cpp/steps/ — seven guided steps (tokens → lexer → AST → Pratt → environment/interpreter → functions → REPL+tests).
  3. src/cpp/src/ — the full source.

Prereqs

  • cp-02 complete (recursive descent + AST + Visitor internalized).

Outcomes

  • Hand-write a Pratt parser for any precedence-rich expression language.
  • Understand the statement/expression distinction and where each fits in the AST.
  • Implement lexical scope via a parent-linked environment chain.
  • Build first-class functions and closures using shared environments.

Build & Run

cd src/cpp
cmake -S . -B build && cmake --build build -j
ctest --test-dir build --output-on-failure
./build/mli                  # REPL
./build/mli script.ml        # run a file

Sample

fn fact(n) { if (n <= 1) return 1; return n * fact(n - 1); }
fn fib(n)  { if (n < 2) return n; return fib(n-1) + fib(n-2); }
print fact(10);   // 3628800
print fib(15);    // 610

fn makeAdder(a) { fn add(b) { return a + b; } return add; }
let plus5 = makeAdder(5);
print plus5(10);  // 15
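The makeAdder sample works because the inner function keeps its defining environment alive after makeAdder returns. A hedged C++ sketch of that mechanism follows. It assumes values are plain doubles and models the closure body by hand (the real interpreter walks the AST); Environment's method names and the Closure struct are illustrative, not the lab's API.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// One scope: a map of bindings plus a link to the enclosing scope.
class Environment {
public:
    explicit Environment(std::shared_ptr<Environment> parent = nullptr)
        : parent_(std::move(parent)) {}

    void define(const std::string& name, double value) { values_[name] = value; }

    // Lookup walks outward until the name is found: lexical scoping.
    double get(const std::string& name) const {
        for (const Environment* env = this; env; env = env->parent_.get()) {
            auto it = env->values_.find(name);
            if (it != env->values_.end()) return it->second;
        }
        throw std::runtime_error("undefined variable '" + name + "'");
    }

private:
    std::unordered_map<std::string, double> values_;
    std::shared_ptr<Environment> parent_;
};

// Models `fn add(b) { return a + b; }` together with its captured scope.
struct Closure {
    std::shared_ptr<Environment> captured;

    double call(double b) const {
        assert(captured);
        auto frame = std::make_shared<Environment>(captured);  // fresh scope per call
        frame->define("b", b);
        return frame->get("a") + frame->get("b");              // a resolves via the chain
    }
};

// Models `fn makeAdder(a) { fn add(b) {...} return add; }`.
Closure makeAdder(const std::shared_ptr<Environment>& globals, double a) {
    auto frame = std::make_shared<Environment>(globals);
    frame->define("a", a);
    return Closure{frame};   // frame outlives this call because the closure shares it
}
```

shared_ptr is the right tool here, unlike for AST nodes: a call frame can be referenced by any number of closures created inside it, so no single owner exists.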

01 — Tokens and the Lexer

The lexer is the boundary between raw text and structured data. Its job is single-responsibility: consume a character stream and emit a flat stream of typed tokens. Every subsequent phase sees tokens, never characters.

The token inventory for MiniLang

cp-03 extends the arithmetic token set from cp-02 with keywords and punctuation that support a full statement language:

// literals
NUMBER  "123"   STRING  "hello"   IDENT "myVar"

// keywords
LET  VAR  FN  IF  ELSE  WHILE  RETURN  PRINT  TRUE  FALSE  NIL

// arithmetic & comparison
PLUS MINUS STAR SLASH PERCENT
EQ EQ_EQ BANG BANG_EQ LT LT_EQ GT GT_EQ

// logical
AND OR

// delimiters
LPAREN RPAREN LBRACE RBRACE COMMA SEMICOLON

// end-of-file sentinel
EOF

Key lexer decisions

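One character of lookahead is enough for every operator token in MiniLang. = vs ==, ! vs !=, < vs <=, > vs >= all resolve with one peek() call after consuming the first character. (The peekNext() helper in the Lexer class looks one character further, which is typically needed only to check that a digit follows the '.' in a number literal.)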

Maximal munch: always consume the longest valid token. The lexer loop calls advance() and then decides, not the other way round.
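Both decisions fit in one small function. The sketch below is illustrative only: scanOperator and OpToken are hypothetical names, the enumerator spellings (EqEq, BangEq, ...) are assumed CamelCase forms of the token names listed above, and the function works on an index into the source rather than on the Lexer's internal cursor.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

enum class TokKind { Eq, EqEq, Bang, BangEq, Lt, LtEq, Gt, GtEq };

struct OpToken {
    TokKind kind;
    std::size_t length;   // how many characters the token consumed
};

// Scans the comparison/assignment operator starting at src[pos].
// Maximal munch: if a two-character token is possible, take it.
OpToken scanOperator(const std::string& src, std::size_t pos) {
    assert(pos < src.size());
    const char first = src[pos];
    // The single character of lookahead.
    const bool eqNext = pos + 1 < src.size() && src[pos + 1] == '=';
    switch (first) {
        case '=': return eqNext ? OpToken{TokKind::EqEq, 2}   : OpToken{TokKind::Eq, 1};
        case '!': return eqNext ? OpToken{TokKind::BangEq, 2} : OpToken{TokKind::Bang, 1};
        case '<': return eqNext ? OpToken{TokKind::LtEq, 2}   : OpToken{TokKind::Lt, 1};
        case '>': return eqNext ? OpToken{TokKind::GtEq, 2}   : OpToken{TokKind::Gt, 1};
        default:  throw std::invalid_argument("not an operator start");
    }
}
```

On "a<=b" at index 1 this returns LtEq with length 2, and on "a<b" it returns Lt with length 1; without maximal munch the first input would wrongly lex as Lt followed by Eq.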

Keyword recognition via a hash-map at the identifier stage:

const std::unordered_map<std::string, TokKind> keywords = {
    {"let",    TokKind::Let},
    {"var",    TokKind::Var},
    {"fn",     TokKind::Fn},
    // ...
};

When an identifier is scanned, look it up in the table. If it's there, emit the keyword token; otherwise emit IDENT. This keeps the lexer loop clean: no per-keyword branches in the main switch.
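The scan-then-look-up step could look like this. It is a sketch with hypothetical function names (scanWord, classifyWord) and a TokKind enum reduced to the keywords plus Ident; the important property is that the whole word is consumed before the table is consulted.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

enum class TokKind { Let, Var, Fn, If, Else, While, Return, Print,
                     True, False, Nil, Ident };

// Consumes the longest run of identifier characters starting at pos.
std::string scanWord(const std::string& src, std::size_t pos) {
    assert(pos < src.size());
    std::size_t end = pos;
    while (end < src.size() &&
           (std::isalnum(static_cast<unsigned char>(src[end])) || src[end] == '_')) {
        ++end;
    }
    return src.substr(pos, end - pos);
}

// Keyword if the whole word is in the table, otherwise a plain identifier.
TokKind classifyWord(const std::string& word) {
    static const std::unordered_map<std::string, TokKind> keywords = {
        {"let", TokKind::Let},       {"var", TokKind::Var},
        {"fn", TokKind::Fn},         {"if", TokKind::If},
        {"else", TokKind::Else},     {"while", TokKind::While},
        {"return", TokKind::Return}, {"print", TokKind::Print},
        {"true", TokKind::True},     {"false", TokKind::False},
        {"nil", TokKind::Nil},
    };
    auto it = keywords.find(word);
    return it == keywords.end() ? TokKind::Ident : it->second;
}
```

Because the word is scanned in full first, "letter" classifies as Ident rather than as the keyword let followed by "ter", which is maximal munch applied to identifiers.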

Character classification helpers

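// The <cctype> functions require a value representable as unsigned char
// (or EOF); passing a negative char is undefined behavior, hence the casts.
static bool isAlpha(char c)   { return std::isalpha(static_cast<unsigned char>(c)) || c == '_'; }
static bool isAlNum(char c)   { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }
static bool isDigit(char c)   { return std::isdigit(static_cast<unsigned char>(c)); }

_ is part of identifiers in MiniLang (as in most languages), so it's included in both isAlpha and isAlNum.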

Lexer structure

class Lexer {
    std::string source_;  // owned copy: a reference member would dangle when
                          // the lexer is constructed from a temporary string
    size_t start_ = 0;  // start of current token
    size_t cur_   = 0;  // current scan position
    int    line_  = 1;  // for error reporting
    char advance();
    char peek() const;
    char peekNext() const;
    bool match(char expected);
    Token makeToken(TokKind);
    Token scanToken();
public:
    explicit Lexer(std::string source);
    std::vector<Token> scanAll();
};

The scanAll() loop calls scanToken() until it reaches the end of the source, then appends an EOF token and returns. Callers get a vector<Token>: a flat, random-access stream. This matters because the Pratt parser (cp-03 step 03) peeks at the next token before deciding whether to consume it, which is trivial with an index into a vector.

Source location on every token

struct Token {
    TokKind     kind;
    std::string lexeme;
    int         line;
};

line tracks the source line number. The lexer increments line_ on every \n. Later phases use line for error messages. A real production lexer stores column too; for now line is enough.

Try it

After writing the lexer, scan a source string and print the token stream:

Lexer lex("let x = 2 + 3;\nif (x > 4) { print x; }");
for (auto& tok : lex.scanAll())
    std::cout << tok.line << "\t" << tok.lexeme << "\n";

Expected output:

1   let
1   x
1   =
1   2
1   +
1   3
1   ;
2   if
...

This linear token dump is the best first debugging tool for any lexer.

02 — The AST

The AST is the contract between the parser and every downstream phase. Get it right and the interpreter, type checker, resolver, and codegen all become straightforward tree walks. Get it wrong and every phase carries workarounds.

Two-tier design: Stmt + Expr

MiniLang separates the grammar cleanly into statements (things with side effects but no value) and expressions (things that produce a value). Some languages blur this line (expressions as statements, last-expression-is-the-return-value, etc.); MiniLang keeps them distinct so the AST shape guides the interpreter.

Statement nodes

struct LetStmt  { std::string name; bool immutable; ExprPtr init; int line; };
struct AssignStmt{ std::string name; ExprPtr value; int line; };
struct IfStmt   { ExprPtr cond; StmtPtr then; StmtPtr else_; int line; };
struct WhileStmt{ ExprPtr cond; StmtPtr body; int line; };
struct ReturnStmt{ ExprPtr value; int line; };
struct PrintStmt { ExprPtr value; int line; };
struct ExprStmt  { ExprPtr expr; int line; };
struct BlockStmt { std::vector<StmtPtr> body; int line; };

Expression nodes

struct NumberExpr { double value; int line; };
struct StringExpr { std::string value; int line; };
struct BoolExpr   { bool value; int line; };
struct NilExpr    { int line; };
struct VarExpr    { std::string name; int line; };
struct UnaryExpr  { TokKind op; ExprPtr operand; int line; };
struct BinaryExpr { TokKind op; ExprPtr left; ExprPtr right; int line; };
struct CallExpr   { ExprPtr callee; std::vector<ExprPtr> args; int line; };
struct FnExpr     { std::vector<std::string> params; StmtPtr body; int line; };

FnExpr is an anonymous function literal (fn(x, y) { ... }). Named functions (fn foo(x) { ... }) desugar into let foo = fn(x) { ... } at parse time.

Ownership with unique_ptr

Every node owns its children:

using ExprPtr = std::unique_ptr<Expr>;
using StmtPtr = std::unique_ptr<Stmt>;

No parent pointers, no shared ownership. The AST is a strict tree (every node has exactly one parent, so no sub-tree is shared), and a single-owner hierarchy matches that shape exactly. Destruction is recursive and automatic when the root goes out of scope.

The Visitor pattern

Every downstream phase (interpreter, resolver, type checker) needs to walk the AST without modifying the node types. The Visitor pattern provides that extension point:

// Returns T per expression node.
template<typename T>
struct ExprVisitor {
    virtual T visitNumber(NumberExpr&) = 0;
    virtual T visitVar(VarExpr&) = 0;
    virtual T visitBinary(BinaryExpr&) = 0;
    // ... one method per expression node kind
};

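Expr exposes a templated accept. A member template cannot be virtual in C++, so accept has to dispatch on the node's kind (for example a switch on a tag, or std::visit) rather than through the vtable: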

template<typename T>
T Expr::accept(ExprVisitor<T>& v);

The interpreter inherits ExprVisitor<Value>, the type-checker inherits ExprVisitor<TypePtr>, the printer inherits ExprVisitor<std::string>. Adding a new pass doesn't change any AST file.
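One way to make a templated accept work is to dispatch on a kind tag inside it, since C++ does not allow a member template to be virtual. The sketch below assumes that approach; it is not necessarily how the lab's source does it. Only two node kinds are modeled, the operator is a plain char, and NodeCounter is a hypothetical second pass chosen because its result type (int) differs from the evaluator's.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

struct NumberExpr; struct BinaryExpr;

template <typename T>
struct ExprVisitor {
    virtual ~ExprVisitor() = default;
    virtual T visitNumber(NumberExpr&) = 0;
    virtual T visitBinary(BinaryExpr&) = 0;
};

struct Expr {
    enum class Kind { Number, Binary };
    const Kind kind;
    explicit Expr(Kind k) : kind(k) {}
    virtual ~Expr() = default;

    // Not virtual (member templates cannot be): dispatches on the tag instead.
    template <typename T>
    T accept(ExprVisitor<T>& v);
};
using ExprPtr = std::unique_ptr<Expr>;

struct NumberExpr : Expr {
    double value;
    explicit NumberExpr(double v) : Expr(Kind::Number), value(v) {}
};
struct BinaryExpr : Expr {
    char op;
    ExprPtr left, right;
    BinaryExpr(char o, ExprPtr l, ExprPtr r)
        : Expr(Kind::Binary), op(o), left(std::move(l)), right(std::move(r)) {
        assert(left && right);
    }
};

template <typename T>
T Expr::accept(ExprVisitor<T>& v) {
    switch (kind) {   // the tag guarantees the static_cast is valid
        case Kind::Number: return v.visitNumber(static_cast<NumberExpr&>(*this));
        case Kind::Binary: return v.visitBinary(static_cast<BinaryExpr&>(*this));
    }
    throw std::logic_error("unknown Expr kind");
}

struct Evaluator : ExprVisitor<double> {
    double visitNumber(NumberExpr& n) override { return n.value; }
    double visitBinary(BinaryExpr& b) override {
        double l = b.left->accept(*this), r = b.right->accept(*this);
        switch (b.op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            case '/': return l / r;
        }
        throw std::logic_error("unknown binary operator");
    }
};

struct NodeCounter : ExprVisitor<int> {   // hypothetical pass with a different T
    int visitNumber(NumberExpr&) override { return 1; }
    int visitBinary(BinaryExpr& b) override {
        return 1 + b.left->accept(*this) + b.right->accept(*this);
    }
};
```

T is deduced from the visitor argument, so tree->accept(evaluator) returns a double and tree->accept(counter) returns an int from the same node classes; adding a node kind means extending the Kind enum, the switch in accept, and the visitor interface.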

Line numbers on every node

Every node stores an int line. This is the source line where the node began. Passes that emit errors use it:

throw RuntimeError("[line " + std::to_string(node.line) + "] ...");

A production compiler would store a full source span (start/end offset); a line number is sufficient for teaching and for cp-03–cp-05 diagnostics.

Design check: where do FnStmt and FnExpr differ?

fn foo(x) { ... } is syntactic sugar for let foo = fn(x) { ... }. At parse time the parser sees the fn keyword followed by an identifier, converts it to a LetStmt wrapping a FnExpr, and the rest of the compiler never needs a FnDecl node. This simplification:

  • Keeps the namespace of a function as a variable binding (consistent with let semantics).
  • Makes closures and first-class functions automatic — a fn literal is just a value.
  • Means the resolver treats function declarations identically to variable declarations (both create a binding; both allow shadowing at new scope).

03 — Pratt Parsing: Expression Precedence Without Grammar Rules

Recursive-descent parsers grow one function per precedence tier: parseExpr → parseAddSub → parseMulDiv → parseUnary → parsePrimary. With 5 levels that's fine. With 15 it becomes a maintenance burden: adding ** (exponentiation) means writing a new tier function and rewiring the calls of its neighbours so that the precedence slot and the associativity both come out right.

Pratt parsing collapses all expression precedence levels into one function controlled by a numeric binding-power table.

Binding powers

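Each operator has a left binding power (how tightly it binds to the operand on its left) and a right binding power (how tightly it binds to the operand on its right). With the strict comparison lbp > minBp used in the loop below, a left-associative operator needs rbp >= lbp (the table uses rbp = lbp + 1), and a right-associative operator needs rbp < lbp (rbp = lbp - 1).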

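| Operator | Left BP | Right BP | Assoc |
|---|---|---|---|
| = | 1 | 0 | right |
| or | 3 | 4 | left |
| and | 5 | 6 | left |
| == != | 7 | 8 | left |
| < <= > >= | 9 | 10 | left |
| + - | 11 | 12 | left |
| * / % | 13 | 14 | left |
| unary - ! | n/a | 15 | prefix |
| call ( | 17 | 18 | left |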

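Higher numbers bind tighter. The loop condition while (lbp(peek()) > minBp) keeps consuming operators as long as the next one binds more tightly than the caller's minimum. Tokens that are not infix operators must report a left binding power of 0, which is what ends the loop.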

The Pratt loop

ExprPtr parseExpr(int minBp = 0) {
    auto left = parsePrimary();          // nud: prefix position, no left operand yet
    while (lbp(peek()) > minBp) {        // next operator binds tighter than the caller's minimum
        Token op = advance();
        auto right = parseExpr(rbp(op));  // led: parse the right operand with op's right bp
        left = makeBinary(op, std::move(left), std::move(right));
    }
    return left;
}

parsePrimary handles prefix forms (numbers, identifiers, (expr), unary -, !, fn). The loop handles infix forms by taking the current left side and calling parseExpr(rbp) for the right.

Parsing function calls in Pratt

Function call f(x, y) is an infix operator with a ( left-denotation:

// Inside the while loop, right after Token op = advance(), when op.kind == LParen:
std::vector<ExprPtr> args;
while (peek().kind != RParen) {
    args.push_back(parseExpr(0));
    if (!match(Comma)) break;
}
expect(RParen);
left = makeCall(std::move(left), std::move(args));
continue;  // skip the binary-operator path for this iteration

No special parseCallExpr function — it's handled by the left-binding power of ( being high (17/18), making call tighter than any arithmetic.

Why associativity matters

Given a = b = c:

  • Right-associativity (rbp < lbp, so the recursive call may consume the next =): parses as a = (b = c). Assignments chain, and the innermost one is evaluated first.
  • Left-associativity (rbp >= lbp): would parse as (a = b) = c, which tries to assign to a temporary and is wrong.

Given a - b - c:

  • Left-assoc: (a - b) - c — correct for subtraction.
  • Right-assoc: a - (b - c) — wrong.

The binding power table encodes this precisely: no per-operator branches in the Pratt loop.
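
To see the table and the loop interact, here is a self-contained toy version. It is a sketch under simplifying assumptions: tokens are single characters, the result is an S-expression string rather than an AST, and the class name MiniPratt and its helpers are made up for this example. The binding powers are chosen for the strict > comparison in the loop.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Toy Pratt parser over single-character tokens; returns an S-expression.
class MiniPratt {
    std::string src_;
    std::size_t pos_ = 0;

    char peek() {
        while (pos_ < src_.size() && src_[pos_] == ' ') ++pos_;
        return pos_ < src_.size() ? src_[pos_] : '\0';
    }
    char advance() { char c = peek(); ++pos_; return c; }

    // {left bp, right bp}; {0, 0} means "not an infix operator".
    static std::pair<int, int> bp(char op) {
        switch (op) {
            case '=':           return {1, 0};    // right-assoc: rbp < lbp
            case '+': case '-': return {11, 12};  // left-assoc:  rbp > lbp
            case '*': case '/': return {13, 14};
            default:            return {0, 0};
        }
    }

    std::string parsePrimary() {
        char c = advance();
        if (c == '-') return "(- " + parseExpr(15) + ")";   // prefix minus
        if (c == '(') {
            std::string inner = parseExpr(0);
            char close = advance();
            assert(close == ')');
            (void)close;
            return inner;
        }
        assert(std::isalnum(static_cast<unsigned char>(c)));
        return std::string(1, c);
    }

public:
    explicit MiniPratt(std::string src) : src_(std::move(src)) {}

    std::string parseExpr(int minBp = 0) {
        std::string left = parsePrimary();
        while (bp(peek()).first > minBp) {
            char op = advance();
            std::string right = parseExpr(bp(op).second);
            left = "(" + std::string(1, op) + " " + left + " " + right + ")";
        }
        return left;
    }
};
```

Parsing "a - b - c" with this class yields (- (- a b) c), while "a = b = c" yields (= a (= b c)); the only difference between the two cases is the pair of numbers returned by bp.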

Testing the expression parser in isolation

Before connecting it to the statement parser, write a small test driver:

std::string src = "1 + 2 * 3 - -4";
Lexer lex(src);
Parser p(lex.scanAll());
auto e = p.parseExpr(0);
// Pretty-print: "(- (+ 1 (* 2 3)) (- 4))"  — left-associative and
// unary applied before binary

If the parenthesisation matches your expectations, the binding-power table is correct.

04 — Recursive-Descent Statement Parsing

Expressions handle values and operators. Statements handle control flow, declarations, and side-effects. The two halves live in different parser methods and produce different AST node types.

Statement dispatch

The top-level parse method peeks at the current token and dispatches:

StmtPtr parseStmt() {
    switch (peek().kind) {
        case TokKind::Let:    return parseLet();
        case TokKind::Var:    return parseLet();
        case TokKind::If:     return parseIf();
        case TokKind::While:  return parseWhile();
        case TokKind::Return: return parseReturn();
        case TokKind::Print:  return parsePrint();
        case TokKind::Fn:     return parseFnDecl();
        case TokKind::LBrace: return parseBlock();
        default:              return parseExprStmt();
    }
}

Each branch consumes exactly the tokens it owns. All branches advance past any trailing ;.

Blocks create scope

parseBlock reads {, a list of statements, then }:

StmtPtr parseBlock() {
    int line = advance().line;   // consume {
    std::vector<StmtPtr> body;
    while (peek().kind != RBrace && peek().kind != Eof)
        body.push_back(parseStmt());
    expect(RBrace);
    return std::make_unique<BlockStmt>(move(body), line);
}

The interpreter will create a new Environment child for every BlockStmt, so blocks naturally scope variable declarations.

if with optional else

StmtPtr parseIf() {
    int line = advance().line;   // consume 'if'
    expect(LParen);
    auto cond = parseExpr(0);
    expect(RParen);
    auto then = parseBlock();    // always a block
    StmtPtr else_;
    if (match(Else)) else_ = peek().kind == If ? parseIf() : parseBlock();
    return std::make_unique<IfStmt>(move(cond), move(then), move(else_), line);
}

else if chains are implemented by letting else consume another if statement, producing a right-recursive tree. No special elif keyword.

while is simpler

StmtPtr parseWhile() {
    int line = advance().line;
    expect(LParen); auto cond = parseExpr(0); expect(RParen);
    auto body = parseBlock();
    return std::make_unique<WhileStmt>(move(cond), move(body), line);
}

Named function declaration → desugar

StmtPtr parseFnDecl() {
    int line = advance().line;   // consume 'fn'
    auto name = expect(Ident).lexeme;
    auto fn   = parseFnBody(line);  // parses (params) { body }
    // Desugar into: let name = fn(params) { body }
    return std::make_unique<LetStmt>(name, /*immutable=*/true, move(fn), line);
}

This is the key simplification: the interpreter's visitLet handles both variable declarations and function declarations uniformly. A function is just a value bound to a name.

Panic-mode error recovery

When the parser hits something unexpected, it throws or calls a sync function that skips tokens until it finds a synchronisation point:

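void sync() {
    while (peek().kind != Eof) {
        switch (peek().kind) {          // a statement keyword starts a fresh statement
            case Fn: case Let: case Var: case If:
            case While: case Return: return;
            default: break;
        }
        advance();                      // skip one token, so recovery always makes progress
        if (previous().kind == Semicolon) return;   // just passed a statement boundary
    }
}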

After sync, parsing resumes at the next statement boundary. In a REPL this means one bad expression doesn't lock up the session. In a file run it means a single error doesn't suppress everything downstream.
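
The driver loop around sync is what turns one thrown error into "report and continue". The following is a minimal sketch on a toy grammar (program := { "let" IDENT ";" }) with tokens as plain strings; ToyParser, ParseError, and the members names and errors are hypothetical stand-ins, not the lab's Parser.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct ParseError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class ToyParser {
    std::vector<std::string> toks_;
    std::size_t pos_ = 0;

    bool atEnd() const { return pos_ >= toks_.size(); }
    const std::string& peek() const {
        static const std::string eof = "<eof>";
        return atEnd() ? eof : toks_[pos_];
    }
    void expect(const std::string& t) {
        if (peek() != t)
            throw ParseError("expected '" + t + "' but found '" + peek() + "'");
        ++pos_;
    }
    void parseLet() {
        expect("let");
        if (atEnd() || peek() == ";" || peek() == "let")
            throw ParseError("expected identifier after 'let'");
        assert(!atEnd());                       // guaranteed by the check above
        names.push_back(toks_[pos_++]);
        expect(";");
    }
    // Panic mode: stop before a statement keyword, or right after a ';'.
    void sync() {
        while (!atEnd()) {
            if (peek() == "let") return;        // statement keyword: resume here
            if (toks_[pos_++] == ";") return;   // skipped token was a boundary
        }
    }

public:
    std::vector<std::string> names;    // identifiers seen in let statements
    std::vector<std::string> errors;   // one message per recovered error

    explicit ToyParser(std::vector<std::string> toks) : toks_(std::move(toks)) {}

    void parseProgram() {
        while (!atEnd()) {
            try { parseLet(); }
            catch (const ParseError& e) {
                errors.push_back(e.what());
                sync();
            }
        }
    }
};
```

Every error path either consumed a token before throwing or has sync consume one, so the loop in parseProgram cannot spin on the same token.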

The expression-statement bridge

StmtPtr parseExprStmt() {
    int line = peek().line;
    auto e = parseExpr(0);
    expect(Semicolon);
    return std::make_unique<ExprStmt>(move(e), line);
}

Function calls at statement position (foo(42);) hit this path. The expression is evaluated for side effects; its value is discarded.

05 — Environments and Lexical Scope

The environment model answers one question: when a variable is used, which binding does it refer to? MiniLang uses lexical scope — the binding is determined by where the code is written, not where it is called.

The Environment structure

class Environment {
    std::unordered_map<std::string, Value> vars_;
    std::shared_ptr<Environment> parent_;
public:
    explicit Environment(std::shared_ptr<Environment> parent = nullptr);
    void define(const std::string& name, Value v);
    Value& get(const std::string& name);
    void   set(const std::string& name, Value v);
};

get walks up the parent_ chain until it finds the name or reaches the top-level (null parent) and throws "undefined variable". set does the same but writes back instead of reading.

Chain creation for blocks and calls

When entering a block:

void Interpreter::visitBlock(BlockStmt& b) {
    auto child = std::make_shared<Environment>(env_);
    std::swap(env_, child);
    for (auto& s : b.body) execute(*s);
    std::swap(env_, child);  // restore on exit (RAII alternative below)
}

When calling a function, a fresh environment is created with the function's closure (the captured enclosing environment) as parent — not the caller's current environment. This is what makes lexical scope different from dynamic scope.
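
The swap-based visitBlock does not restore env_ if a statement throws (a ReturnSignal or a runtime error unwinds straight past the second swap). A small RAII guard is one way to close that gap. This is a sketch: EnvGuard and runBlock are hypothetical names, and Environment is reduced to its parent link.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

struct Environment {
    std::shared_ptr<Environment> parent;
    explicit Environment(std::shared_ptr<Environment> p = nullptr)
        : parent(std::move(p)) {}
};

// Installs `next` as the current environment and puts the previous one back
// in the destructor, which also runs during exception unwinding.
class EnvGuard {
    std::shared_ptr<Environment>& slot_;
    std::shared_ptr<Environment>  saved_;
public:
    EnvGuard(std::shared_ptr<Environment>& slot, std::shared_ptr<Environment> next)
        : slot_(slot), saved_(std::move(slot)) {
        slot_ = std::move(next);
        assert(slot_ && "next environment must not be null");
    }
    ~EnvGuard() { slot_ = std::move(saved_); }
    EnvGuard(const EnvGuard&) = delete;
    EnvGuard& operator=(const EnvGuard&) = delete;
};

// Stand-in for visitBlock: `body` plays the role of executing the statements.
template <typename Body>
void runBlock(std::shared_ptr<Environment>& env, Body body) {
    EnvGuard guard(env, std::make_shared<Environment>(env));
    body();   // may throw; the guard restores env either way
}
```

The same guard fits the call path: construct it with the call frame whose parent is the closure, and the try/catch bookkeeping around env_ disappears.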

Why shared_ptr?

A closure can outlive the scope that created it:

fn makeCounter() {
    var n = 0;
    fn inc() { n = n + 1; return n; }
    return inc;
}
let c = makeCounter();
print c();  // 1
print c();  // 2

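Here inc captures the environment created when makeCounter ran, and that environment holds n. After makeCounter returns, the n binding is still alive because the closure inc holds a shared_ptr to it. The environment is freed only when the last shared_ptr to it is released. cp-03 has no garbage collector, so this is plain reference counting, with one caveat: in this example the captured environment also stores inc itself, and inc points back at the environment. Reference counting alone never reclaims such a cycle; a tracing collector does.

If environments were stored on the stack by value, the closure would hold a dangling reference. shared_ptr is the minimal-complexity solution; production VMs typically use garbage-collected frames or per-variable capture cells instead.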

The define vs set distinction

  • define always writes in the current environment (creates a new slot).
  • set walks the parent chain to find an existing binding and updates it.

This matters for:

var x = 1;
{
    var x = 2;  // define: creates a NEW x in the inner scope
    x = 3;      // set: updates the INNER x
}
print x;  // still 1 — the outer x was never touched

Without the distinction, var x = 2 would find the outer x and overwrite it instead of creating a new binding, and x = 3 would then modify the outer x as well, breaking scope.
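
A compact implementation of the three operations makes the difference concrete. This is a sketch, not the lab's environment.hpp: Value is reduced to double, and the private helper owner is a name invented here.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

using Value = double;   // stand-in for the interpreter's variant type

class Environment {
    std::unordered_map<std::string, Value> vars_;
    std::shared_ptr<Environment> parent_;

    // Innermost environment that already binds `name`, or nullptr.
    Environment* owner(const std::string& name) {
        for (Environment* e = this; e; e = e->parent_.get())
            if (e->vars_.count(name)) return e;
        return nullptr;
    }

public:
    explicit Environment(std::shared_ptr<Environment> parent = nullptr)
        : parent_(std::move(parent)) {}

    // Always writes to *this* scope, shadowing any outer binding.
    void define(const std::string& name, Value v) { vars_[name] = std::move(v); }

    // Walks the parent chain; throws if no scope binds the name.
    Value& get(const std::string& name) {
        if (Environment* e = owner(name)) return e->vars_.at(name);
        throw std::runtime_error("undefined variable '" + name + "'");
    }

    // Updates the nearest existing binding; never creates one.
    void set(const std::string& name, Value v) { get(name) = std::move(v); }
};

// Usage: mirrors the MiniLang shadowing snippet above.
inline void shadowingDemo() {
    auto global = std::make_shared<Environment>();
    global->define("x", 1);                              // var x = 1;
    auto block = std::make_shared<Environment>(global);
    block->define("x", 2);                               // var x = 2;  (new inner slot)
    block->set("x", 3);                                  // x = 3;      (updates the inner slot)
    assert(block->get("x") == 3);
    assert(global->get("x") == 1);                       // outer x untouched
}
```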

Scope chain depth

Every define at block entry and every scope exit is O(1). Every get and set is O(depth) — proportional to the nesting depth of scopes. In practice depth is small (rarely > 10 for real programs), so this is acceptable.

cp-04 introduces depth annotations on variable uses that let the interpreter do one hash-map lookup at the right depth instead of walking every parent:

Value& getAt(int depth, const std::string& name);  // cp-04 addition

For cp-03, the naive walk is fine and pedagogically clearer.

06 — First-Class Functions and Closures

A language has first-class functions when functions can be:

  • Stored in variables
  • Passed as arguments
  • Returned from functions
  • Constructed at runtime

MiniLang supports all four, and the implementation is a small addition to what step 05 already built.

The Value type

struct FnValue {
    std::vector<std::string>       params;
    Stmt*                          body;    // non-owning pointer into the AST; the AST must outlive the FnValue
    std::shared_ptr<Environment>   closure; // environment captured at definition time
};

using Value = std::variant<
    std::monostate,  // nil
    double,          // number
    bool,
    std::string,
    std::shared_ptr<FnValue>
>;

FnValue holds the parameter list, a pointer to the function body in the AST, and the captured environment at definition time. The closure field is the one shared_ptr chain that makes closures work.

Calling a function

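Value Interpreter::visitCall(CallExpr& call) {
    Value callee = evaluate(*call.callee);
    auto* fnPtr = std::get_if<std::shared_ptr<FnValue>>(&callee);
    if (!fnPtr)
        throw RuntimeError("[line " + std::to_string(call.line) + "] can only call functions");
    auto fn = *fnPtr;
    if (call.args.size() != fn->params.size())
        throw RuntimeError("[line " + std::to_string(call.line) + "] expected " +
                           std::to_string(fn->params.size()) + " arguments but got " +
                           std::to_string(call.args.size()));

    // Build the call frame with the closure as parent.
    auto frame = std::make_shared<Environment>(fn->closure);
    for (size_t i = 0; i < fn->params.size(); ++i)
        frame->define(fn->params[i], evaluate(*call.args[i]));  // args run in the caller's env

    auto saved = env_;
    env_ = frame;
    try { execute(*fn->body); }
    catch (ReturnSignal& r) { env_ = saved; return r.value; }
    catch (...)             { env_ = saved; throw; }  // restore on runtime errors too
    env_ = saved;
    return std::monostate{};  // nil if no return statement
}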

ReturnSignal is a C++ exception used as a non-local transfer out of visitBlock chains when a return statement fires. It's not an error — it's a structured jump. This avoids threading a "should I keep executing?" flag through every interpreter method.
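
The mechanism can be shown in isolation. In this sketch the functions innerBlock, outerBlock, and callFunction are hypothetical stand-ins for nested visit methods and visitCall; only the shape of the throw and the single catch site is the point.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

using Value = std::variant<std::monostate, double, bool, std::string>;

// Carries the returned value from the `return` statement to the call site.
struct ReturnSignal { Value value; };

// Stand-ins for nested visitBlock/visitIf frames: a `return` deep inside
// unwinds straight through them without any "keep going?" flag.
inline void innerBlock(double x) {
    if (x > 0) throw ReturnSignal{Value{x * 2}};   // return x * 2;
}
inline void outerBlock(double x) {
    innerBlock(x);
    // statements after the return would be skipped
}

// Stand-in for visitCall: the only place that catches the signal.
inline Value callFunction(double arg) {
    try { outerBlock(arg); }
    catch (ReturnSignal& r) { return std::move(r.value); }
    return std::monostate{};   // fell off the end: implicit nil
}

// Usage: one call that returns a value, one that falls off the end.
inline void returnDemo() {
    assert(std::get<double>(callFunction(21)) == 42.0);
    assert(std::holds_alternative<std::monostate>(callFunction(-1)));
}
```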

Closures capture the environment, not variables

fn adder(x) {
    fn add(y) { return x + y; }
    return add;
}
let add5 = adder(5);
let add10 = adder(10);
print add5(3);   // 8
print add10(3);  // 13

When adder(5) is called:

  1. A frame for adder is created with x = 5.
  2. The FnExpr for add captures that frame as its closure.
  3. adder returns the FnValue.

When add5(3) is called:

  1. A new frame is created with add5's closure (the adder frame) as parent.
  2. y = 3 is defined in that frame.
  3. x + y resolves: y is in the current frame, x is in the captured adder frame.

add10 has its own separate adder frame with x = 10. The two closures don't share state.

The mutation case

fn makeCounter() {
    var count = 0;
    fn inc() { count = count + 1; return count; }
    return inc;
}
let c = makeCounter();
print c();  // 1
print c();  // 2

Here count = count + 1 inside inc calls env_->set("count", ...). set walks the parent chain, finds count in the captured makeCounter frame, and updates it in place. The next call to c() builds a new call frame with the same closure, sees the updated count, and returns 2. Mutable closures work for free with the parent-chain set semantics.

Tail calls and stack overflow

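cp-03 does not implement tail-call optimisation. Every MiniLang call nests several C++ frames (visitCall, execute, visitBlock, ...), so deep recursion, for example a loop written as a function that calls itself 100,000 times, can overflow the C++ stack and segfault before finishing. (fib(40) is a different problem: its recursion is only about 40 calls deep, but it makes hundreds of millions of calls, so it is slow rather than stack-hungry.) This is a known limitation; the solution is continuation-passing style or an explicit stack in the interpreter, deferred to later labs.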

07 — REPL, Tests, and Extensions

The final step wires the components into a usable tool and verifies the interpreter with automated tests.

The REPL loop

void repl(std::istream& in, std::ostream& out) {
    auto global = std::make_shared<Environment>();
    Interpreter interp(global);
    std::vector<StmtPtr> program;   // keeps every AST alive: FnValue::body points into it
    std::string line;
    while (true) {
        out << "> ";
        if (!std::getline(in, line)) break;
        try {
            Lexer lex(line);
            Parser p(lex.scanAll());
            for (auto& s : p.parse()) {
                program.push_back(std::move(s));
                interp.execute(*program.back());
            }
        } catch (const std::exception& e) {
            out << "error: " << e.what() << "\n";
        }
    }
}

The key points:

  1. The same global environment persists across REPL lines — you can define a function on one line and call it on the next.
  2. Errors are caught and printed, not propagated — the REPL doesn't die on a bad expression.
  3. Each line is re-lexed and re-parsed; no incremental state.

File execution

int main(int argc, char** argv) {
    if (argc == 1)   { repl(std::cin, std::cout); return 0; }
    std::ifstream f(argv[1]);
    if (!f)          { std::cerr << "cannot open " << argv[1] << "\n"; return 74; }
    std::string src((std::istreambuf_iterator<char>(f)),
                     std::istreambuf_iterator<char>());
    Lexer lex(src);
    Parser p(lex.scanAll());
    auto stmts = p.parse();
    Interpreter interp(std::make_shared<Environment>());
    for (auto& s : stmts) interp.execute(*s);
    return 0;
}

The test harness

cp-03 uses a hand-rolled test harness consistent with later labs. Each test runs a source string through the full pipeline and checks the output or thrown message:

static int g_checks = 0, g_passed = 0;
#define CHECK_EQ(a, b) do { ++g_checks; \
    if ((a) == (b)) ++g_passed; \
    else std::cerr << "FAIL " << __LINE__ << ": " << (a) << " != " << (b) << "\n"; \
} while(0)

// Example test
void test_closure() {
    auto out = run("fn f(a) { fn g(b) { return a + b; } return g; } print f(3)(4);");
    CHECK_EQ(out, "7\n");
}

run(src) lexes, parses, and interprets src, then returns the captured stdout. This lets every test be written as a one-liner source program.
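
One way to build the capturing half of run is to swap std::cout's stream buffer for the duration of the call. The helper below is a sketch that assumes the interpreter's print writes to std::cout; captureStdout is a name invented for this example.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>

// Runs `body` with std::cout redirected into a string and returns what was
// written. The original stream buffer is restored even if `body` throws.
inline std::string captureStdout(const std::function<void()>& body) {
    assert(static_cast<bool>(body));   // an empty std::function cannot be called
    std::ostringstream captured;
    struct Restore {
        std::streambuf* old;
        ~Restore() { std::cout.rdbuf(old); }
    } restore{std::cout.rdbuf(captured.rdbuf())};   // rdbuf(new) returns the old buffer
    body();
    return captured.str();
}
```

run(src) would then be captureStdout wrapped around the lex, parse, and interpret steps for src.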

Extending MiniLang — next steps

These extensions each add one focused concept without rewriting the interpreter:

1. Native functions

Add a NativeValue variant to Value holding a std::function<Value(vector<Value>)>. Register built-ins like clock(), sqrt(), len() in the global environment before executing user code.
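
A possible shape for this extension is sketched below. The names NativeFn, Globals, registerBuiltins, and callNative are illustrative assumptions; Globals stands in for the global Environment, and the arity field is an addition that the paragraph above does not mention.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

struct NativeFn;   // forward declaration so Value can refer to it
using Value = std::variant<std::monostate, double, bool, std::string,
                           std::shared_ptr<NativeFn>>;

// A built-in implemented in C++; arity is checked before the call.
struct NativeFn {
    std::size_t arity;
    std::function<Value(const std::vector<Value>&)> impl;
};

using Globals = std::unordered_map<std::string, Value>;   // stand-in for Environment

inline void registerBuiltins(Globals& g) {
    g["sqrt"] = std::make_shared<NativeFn>(NativeFn{
        1, [](const std::vector<Value>& a) -> Value {
            return std::sqrt(std::get<double>(a[0]));   // throws on non-numbers
        }});
    g["len"] = std::make_shared<NativeFn>(NativeFn{
        1, [](const std::vector<Value>& a) -> Value {
            return static_cast<double>(std::get<std::string>(a[0]).size());
        }});
}

inline Value callNative(const Value& callee, const std::vector<Value>& args) {
    const auto& fn = std::get<std::shared_ptr<NativeFn>>(callee);
    assert(fn && fn->impl);
    if (args.size() != fn->arity)
        throw std::runtime_error("wrong number of arguments");
    return fn->impl(args);
}
```

In the interpreter, visitCall would try the native alternative first (or second) and fall back to the FnValue path; user code cannot tell the two kinds of callee apart.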

2. Arrays

Add VecValue = shared_ptr<vector<Value>>. Add arr[i] subscript as a special CallExpr-like AST node, arr.push(x) as a method call.

3. Classes

Add class Foo { ... } syntax → a ClassDef node. Instances are environments whose parent is the class's method map. this is a binding in the method's call frame pointing to the instance environment.

4. for loops

Desugar into while at parse time:

for (var i = 0; i < n; i = i + 1) { body }
→
{ var i = 0; while (i < n) { body; i = i + 1; } }

No new interpreter support needed.

5. Continuation-based non-local flow

Replace ReturnSignal exception with a Continuation value that wraps the rest of the execution as a callable — the basis for coroutines and generators.

Each extension exercises a different compiler engineering concept. With the visitor-based architecture from step 02, a new node type means one new struct plus one new visit method per pass; existing node types and their visit methods stay untouched, which is the point of choosing Visitor over ad-hoc dispatch.

cp-04 — Symbol Tables & the Resolver Pass

Status: ✅ Built · all tests passing

A second compiler pass that runs between parsing and execution. The resolver:

  • maintains a scope stack (lexical scopes seen so far),
  • annotates every variable use with the lexical depth at which its binding lives,
  • statically detects a class of bugs the cp-03 interpreter would only catch at runtime — or worse, miss entirely.

The interpreter is then changed to use the annotation: it hops exactly depth parents and does one hash lookup in the target scope, instead of probing a hash map at every level of the parent chain.

What's new vs cp-03

| Aspect | cp-03 | cp-04 |
|---|---|---|
| Phases | lex → parse → interpret | lex → parse → resolve → interpret |
| Variable lookup | dynamic walk of parent envs | static depth + one hash lookup at the right scope |
| let vs var | parsed but treated identically | let immutable (resolver rejects assignment), var mutable |
| Errors caught | mostly at runtime | undefined, redecl, self-init, assign-to-let, top-level return |

Source layout (src/cpp/)

src/
  token.hpp / lexer.{hpp,cpp}     # unchanged from cp-03
  value.hpp                       # unchanged
  ast.hpp                         # adds DeclKind + depth fields
  parser.{hpp,cpp}                # records DeclKind & node line numbers
  environment.hpp                 # adds getAt(depth)/assignAt(depth)
  resolver.{hpp,cpp}              # NEW — the static-analysis pass
  interpreter.{hpp,cpp}           # uses depth when available
  main.cpp                        # runs resolver before interpreter
tests/test_resolver.cpp           # 15 assertions (regression + new diagnostics)

Build & test

cd src/cpp
cmake -S . -B build -G "Unix Makefiles"
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: cp-04 tests: 15/15 PASS.

Sample diagnostics

$ cat > bad.ml <<'EOF'
{ let x = 1; let x = 2; x = 3; return 99; }
EOF
$ build/mli bad.ml
[line 1] resolver: redeclaration of 'x' in the same scope (previous at line 1)
[line 1] resolver: cannot assign to immutable binding 'x' (declared with 'let')
[line 1] resolver: 'return' outside of a function

The resolver reports all problems at once, then main exits 1 without running a single instruction. This is the same shape as a C compiler: fail in the front end, never reach codegen.

See src/cpp/steps/ for the build-up

  1. 01-why-a-separate-pass.md — what runtime-only resolution costs us
  2. 02-ast-changes.md — adding DeclKind and the depth slot
  3. 03-the-resolver-walk.md — visitor over Stmt/Expr, scope stack
  4. 04-declare-then-define.md — the trick that catches let x = x;
  5. 05-let-vs-var.md — immutability check on assignment
  6. 06-getAt-and-fast-lookup.md — wiring depth into the interpreter
  7. 07-error-recovery.md — collecting diagnostics instead of throwing

01 — The Resolver Pass: Why and What

The interpreter in cp-03 resolves variable names at runtime, by walking the environment chain on every access. That works but has two problems:

  1. Performance: O(depth) lookup on every read/write.
  2. Correctness: The interpreter can silently read from the wrong scope if names shadow each other across closure boundaries.

The resolver pass is a static analysis pass that runs after parsing and before interpreting. It walks the AST once, builds a map of (VarExpr* → depth), and annotates every variable use with the exact environment depth at which its binding lives. The interpreter can then go straight to that environment: depth parent hops followed by a single hash lookup.

The scope of the problem

var a = "global";
fn outer() {
    fn inner() { print a; }
    inner();
}
outer();

With a naive runtime-walking resolver:

  • When print a runs, the interpreter looks in inner's frame, then outer's frame, then global, finds a = "global" and prints it.
  • This works correctly.

But:

var a = "global";
fn showA() {
    print a;
}
fn test() {
    var a = "local";
    showA();
}
test();

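Lexical scope says showA should print "global": the a in its body refers to the a visible where showA was defined, not where it is called. A chain walk that started from the call-site environment would incorrectly find "local" in the caller test. cp-03 already avoids that by parenting each call frame on the closure rather than on the caller, but it still re-decides the binding on every access by searching names at run time.

The resolver makes the decision once, after parsing and before execution: it records for every reference which enclosing scope holds its binding. In this example a is not declared in any local scope around print a, so the reference is left unannotated and is always read from the global environment, never from the caller's frame.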

What the resolver produces

A std::unordered_map<Expr*, int> locals_ in the interpreter:

class Resolver {
    std::vector<std::unordered_map<std::string, bool>> scopes_;
    Interpreter& interp_;  // writes into interp_.locals_
public:
    void resolve(std::vector<StmtPtr>& stmts);
private:
    void resolveLocal(Expr* e, const std::string& name);
    void beginScope();
    void endScope();
};

resolveLocal(expr, name) searches scopes_ from the innermost outward. When it finds the name in scopes_[i], it records scopes_.size() - 1 - i (the "distance" from the current scope) in interp_.locals_[expr].

The interpreter's lookUpVariable

Value Interpreter::lookUpVariable(const std::string& name, Expr* e) {
    auto it = locals_.find(e);
    if (it != locals_.end())
        return env_->getAt(it->second, name);  // depth hops, then one hash lookup
    return globals_->get(name);
}

Variables not found in locals_ are globals. They need no annotation because they live in the global environment, to which the interpreter always holds a direct pointer.

When to run the resolver

After parsing the full source, before execution:

auto stmts = parser.parse();
Resolver resolver(interp);
resolver.resolve(stmts);
for (auto& s : stmts) interp.execute(*s);

The resolver pass is one linear traversal of the AST. It touches every node exactly once. After it runs, the interpreter's locals_ map is fully populated, and every local variable access costs a fixed number of parent hops plus one hash lookup, with no name search along the chain.

02 — The Scope Stack

The resolver needs to track which names are in scope at each point in the AST. It does this with a scope stack: a vector of maps, each map representing one lexical scope.

The structure

std::vector<std::unordered_map<std::string, bool>> scopes_;

Each map entry has type bool:

  • false — the variable has been declared (the slot exists) but its initialiser hasn't finished evaluating yet.
  • true — the variable has been fully defined (initialiser is done).

This two-stage state catches the self-initialisation error:

let x = x + 1;  // error: x used in its own initialiser

When the resolver sees let x = ..., it first declares x (pushes "x" → false), resolves the initialiser (during which x is false), then defines x (sets "x" → true). If the initialiser references x, visitVarExpr sees false and reports an error before interpreting.

Scope boundaries

void Resolver::beginScope() {
    scopes_.emplace_back();  // push empty map
}
void Resolver::endScope() {
    scopes_.pop_back();
}

Called around:

  • Every BlockStmt
  • Every function body (one scope for the function's parameters and one for the body block, or a single combined scope)

void Resolver::visitBlock(BlockStmt& b) {
    beginScope();
    for (auto& s : b.body) resolve(*s);
    endScope();
}

Looking up across the stack

void Resolver::resolveLocal(Expr* e, const std::string& name) {
    for (int i = (int)scopes_.size() - 1; i >= 0; --i) {
        if (scopes_[i].count(name)) {
            interp_.locals_[e] = (int)scopes_.size() - 1 - i;
            return;
        }
    }
    // Not found → global; no annotation needed
}

The depth (size - 1 - i) is 0 for the innermost scope, 1 for one level up, and so on. The interpreter's getAt(depth, name) walks the environment chain exactly depth steps:

Value& Environment::getAt(int depth, const std::string& name) {
    Environment* env = this;
    for (int i = 0; i < depth; ++i) env = env->parent_.get();
    return env->vars_.at(name);
}
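
The distance computed by resolveLocal and the hop count consumed by getAt have to agree, which is easiest to check side by side. The sketch below strips both down to free-standing pieces; resolveDistance and this reduced Environment are illustrative, with values narrowed to double.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using ScopeStack = std::vector<std::unordered_map<std::string, bool>>;

// Distance from the innermost scope to the scope that declares `name`;
// std::nullopt means "not local, treat as global".
inline std::optional<int> resolveDistance(const ScopeStack& scopes,
                                          const std::string& name) {
    for (int i = static_cast<int>(scopes.size()) - 1; i >= 0; --i)
        if (scopes[i].count(name))
            return static_cast<int>(scopes.size()) - 1 - i;
    return std::nullopt;
}

// Runtime counterpart: one Environment per scope that was open at resolve time.
struct Environment {
    std::unordered_map<std::string, double> vars;
    std::shared_ptr<Environment> parent;

    double& getAt(int depth, const std::string& name) {
        Environment* env = this;
        for (int i = 0; i < depth; ++i) {
            env = env->parent.get();
            assert(env && "depth exceeds the length of the chain");
        }
        return env->vars.at(name);   // single hash lookup, at the target scope
    }
};
```

The invariant is that every beginScope in the resolver corresponds to exactly one new Environment in the interpreter; if the two disagree by even one level, getAt reads from the wrong scope or fails in at().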

Why a vector of maps, not a single map?

A single global map can't track shadowing: if x is declared at depth 2 and redeclared at depth 0, both need entries that point to different depths. A vector of maps makes the stack structure explicit and indexable. The outermost scope is scopes_[0], the innermost is scopes_.back().

The global scope

The top-level scope is not pushed onto scopes_. Global variables are resolved by falling through the entire stack search without finding the name. The interpreter then looks them up directly in globals_. This separation means globals can be referenced before they're declared (e.g. mutual recursion at the top level), which is a valid and useful feature.

03 — Declaring and Resolving Names

The resolver's two core operations are declaring a name (introducing a new binding) and resolving a reference (computing its depth).

Declaring a name

void Resolver::declare(const std::string& name) {
    if (scopes_.empty()) return;  // global — skip
    auto& scope = scopes_.back();
    if (scope.count(name))
        reportError("Variable '" + name + "' already declared in this scope.");
    scope[name] = false;  // declared, not yet defined
}

void Resolver::define(const std::string& name) {
    if (scopes_.empty()) return;  // global — skip
    scopes_.back()[name] = true;  // fully defined
}

The declare/define split means:

  1. let x = x; → declare("x") sets x → false, resolving the initialiser x finds false → error "can't read local in its own initialiser".
  2. let x = 1; → declare("x"), resolve the initialiser 1 (no names), define("x") → no error.

The visitLet method

void Resolver::visitLet(LetStmt& s) {
    declare(s.name);
    if (s.init) resolve(*s.init);  // initialiser can NOT see s.name yet
    define(s.name);
}

The visitVar / visitFunction

var works the same as let for the resolver — mutability is an interpreter-level concern (cp-04 step 05), not a resolution concern.

Functions:

void Resolver::resolveFunction(FnExpr& fn) {
    beginScope();
    for (auto& param : fn.params) {
        declare(param);
        define(param);  // params are immediately defined
    }
    resolve(*fn.body);
    endScope();
}

Parameters are both declared and defined before the body is resolved. There's no initialiser for parameters — they're always provided by the caller.

The visitVarExpr method

void Resolver::visitVarExpr(VarExpr& e) {
    if (!scopes_.empty()) {
        auto it = scopes_.back().find(e.name);
        if (it != scopes_.back().end() && it->second == false)
            reportError("Can't read '" + e.name + "' in its own initialiser.");
    }
    resolveLocal(&e, e.name);
}

The self-initialiser check only looks at scopes_.back() — the current scope. If the name is in an outer scope and is false, it's a different (outer) variable in mid-initialisation, not a problem for the current reference.
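
The declare, resolve, define sequence can be exercised without an AST. In the sketch below, MiniResolver, use, and resolveLet are hypothetical names, and an initialiser is represented only by the list of names it reads.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

class MiniResolver {
    std::vector<std::unordered_map<std::string, bool>> scopes_;
public:
    std::vector<std::string> errors;   // collected instead of thrown

    void beginScope() { scopes_.emplace_back(); }
    void endScope()   { assert(!scopes_.empty()); scopes_.pop_back(); }

    void declare(const std::string& name) {
        if (scopes_.empty()) return;                 // global: not tracked
        auto& scope = scopes_.back();
        if (scope.count(name))
            errors.push_back("redeclaration of '" + name + "'");
        scope[name] = false;                         // declared, not yet defined
    }
    void define(const std::string& name) {
        if (!scopes_.empty()) scopes_.back()[name] = true;
    }
    // A read of `name`: only the current scope is checked for the false state.
    void use(const std::string& name) {
        if (scopes_.empty()) return;
        auto it = scopes_.back().find(name);
        if (it != scopes_.back().end() && !it->second)
            errors.push_back("can't read '" + name + "' in its own initialiser");
    }
    // let name = <initialiser that reads the names in initUses>;
    void resolveLet(const std::string& name,
                    const std::vector<std::string>& initUses) {
        declare(name);
        for (const auto& u : initUses) use(u);
        define(name);
    }
};
```

Inside a scope, resolveLet("x", {"x"}) records the self-initialiser error, while resolveLet("b", {"a"}) after a has been defined records nothing.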

Block scopes don't "hoist"

In JavaScript, var declarations are hoisted to the top of the function scope. In MiniLang, let and var are not hoisted — a reference before the declaration is a static error:

print x;   // error: "x" not found (resolver reports it)
let x = 1;

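The resolver only adds a name to scopes_.back() when it encounters the let/var statement. Any VarExpr seen before that point falls through the entire scope stack and is treated as a global. If x is not a global either, the reference is an error. Reporting it at resolve time requires the resolver to also know the set of global names (the declare/define code in this step skips globals); without that, the same mistake only surfaces as a runtime "undefined variable" error from globals_->get. Catching it statically is strictly better.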

04 — Depth Annotations: O(1) Environment Lookups

After the resolver runs, every local variable reference has an integer depth stored in interp_.locals_. This step shows how the interpreter uses that information to avoid chain walking.

The getAt and setAt methods

Value& Environment::getAt(int depth, const std::string& name) {
    Environment* e = this;
    for (int i = 0; i < depth; ++i) e = e->parent_.get();
    return e->vars_.at(name);
}

void Environment::setAt(int depth, const std::string& name, Value v) {
    Environment* e = this;
    for (int i = 0; i < depth; ++i) e = e->parent_.get();
    e->vars_[name] = std::move(v);
}

The loop here is O(depth), but depth is a small constant fixed at resolve time for each use site. More importantly, it avoids calling vars_.count() at every intermediate scope: the lookup goes straight to the target.

lookUpVariable and assignVariable

Value Interpreter::lookUpVariable(VarExpr& e) {
    auto it = locals_.find(&e);
    if (it != locals_.end())
        return env_->getAt(it->second, e.name);
    return globals_->get(e.name);  // fallback to globals
}

void Interpreter::assignVariable(AssignExpr& e, Value v) {
    auto it = locals_.find(&e);
    if (it != locals_.end())
        env_->setAt(it->second, e.name, std::move(v));
    else
        globals_->set(e.name, std::move(v));
}

The locals_ map key is the pointer to the AST node, not the variable name. This is intentional: two different uses of x in the source (two VarExpr nodes) can refer to bindings at different depths, and the pointer uniquely identifies which AST node is being evaluated.

Why pointer keying is correct

let x = 1;
fn f() {
    let x = 2;
    print x;  // VarExpr node A — depth 0 in f's scope
}
print x;      // VarExpr node B — depth 0 in globals

Both uses are named x, but they are different AST nodes. The resolver annotated A as depth=0 and B as not-in-locals (global). The interpreter uses the pointer to distinguish them.

If the key were the name string, both would map to the same entry and one would be wrong.

Measuring the improvement

Without annotations, get is O(depth) and calls find at every scope it walks through. With annotations, getAt is still O(depth) for the parent hops but does only one find, at the target scope. The larger gain is correctness (every reference is pinned to one binding before the program runs) rather than raw speed.

For a program with average nesting depth 3:

  • Without annotations: 3 × (hash_lookup + parent_deref) per variable use.
  • With annotations: 3 × parent_deref + 1 × hash_lookup.

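At scale this matters. Bytecode VMs take the idea further: JVM bytecode addresses local variables by slot index, and V8's bytecode addresses registers and context slots by index, so no name lookup remains at run time. The motivation there is less O(1) versus O(3) than the fact that indexed slots are also what register allocation and inlining work with.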

Environment structure with slots

Production VMs go one step further: instead of named maps, each scope is a slot array and each variable use has a slot index. getAt(depth, slot) is just pointer arithmetic:

// production VM sketch
env[depth].slots[slot]

cp-03/04 uses named maps for clarity. cp-06+ introduces bytecode with explicit stack slots, which is the array-indexed equivalent.
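
For comparison with the named maps, a slot-array frame might look like the sketch below. Frame and at are invented names, values are narrowed to double, and the slot numbers would come from the resolver or the bytecode compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Value = double;   // stand-in for the interpreter's variant type

// Scope frame addressed by index: no variable names survive to run time.
struct Frame {
    std::vector<Value> slots;
    std::shared_ptr<Frame> parent;

    Frame(std::size_t n, std::shared_ptr<Frame> p)
        : slots(n), parent(std::move(p)) {}

    // depth parent hops, then a plain array index instead of a hash lookup.
    Value& at(int depth, std::size_t slot) {
        Frame* f = this;
        for (int i = 0; i < depth; ++i) {
            f = f->parent.get();
            assert(f && "depth exceeds the length of the chain");
        }
        assert(slot < f->slots.size());
        return f->slots[slot];
    }
};
```

With this layout a reference is the pair (depth, slot), both known before execution, and the frame size is fixed when the scope is entered.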

05 — Immutability: let vs var

MiniLang distinguishes let (immutable binding) from var (mutable binding). This distinction is enforced by the interpreter after the resolver has already annotated depths.

Tracking mutability in the environment

The environment stores a mutability flag alongside each value:

struct Slot {
    Value value;
    bool  mutable_;
};
std::unordered_map<std::string, Slot> vars_;

define stores the slot:

void Environment::define(const std::string& name, Value v, bool mutable_) {
    vars_[name] = {std::move(v), mutable_};
}

set / setAt checks the flag:

void Environment::setAt(int depth, const std::string& name, Value v) {
    Environment* e = this;
    for (int i = 0; i < depth; ++i) e = e->parent_.get();
    auto& slot = e->vars_.at(name);
    if (!slot.mutable_)
        throw RuntimeError("Cannot reassign 'let' binding '" + name + "'.");
    slot.value = std::move(v);
}

The assignment check

When the interpreter visits an assignment expression:

Value Interpreter::visitAssign(AssignExpr& e) {
    Value v = evaluate(*e.value);
    assignVariable(e, v);
    return v;
}

assignVariable → setAt → immutability check. If the target was bound with let, a RuntimeError is thrown with a clear message. This is a runtime check, not a static one.

Why not static?

Making immutability a static error (checked by the resolver) would require tracking whether each name was declared as let or var in the scope stack. That's doable — the scope map could store {bool defined, bool mutable}.

The choice here is pragmatic: static checking is strictly better for user experience (error before running), but it requires threading the mutability flag through two more data structures. For the curriculum, a runtime check demonstrates the concept clearly. cp-05 introduces the type-checker pass which is a static pass and shows how static checks are structured.
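
For comparison, here is a rough sketch of what the static variant could look like. It assumes the scope map stores a small record per name; BindingInfo, MutabilityResolver and checkAssign are hypothetical names, and the lab's resolver does not do this:

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct BindingInfo {
    bool defined;
    bool isMutable;
};

class MutabilityResolver {
    std::vector<std::unordered_map<std::string, BindingInfo>> scopes_;
public:
    void beginScope() { scopes_.emplace_back(); }
    void endScope()   { scopes_.pop_back(); }

    // Requires an open scope; records whether the name came from let or var.
    void declare(const std::string& name, bool isMutable) {
        scopes_.back()[name] = BindingInfo{true, isMutable};
    }

    // Called when resolving `name = ...`; searches the innermost scope first.
    void checkAssign(const std::string& name, int line) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found == it->end()) continue;
            if (!found->second.isMutable)
                throw std::runtime_error("[line " + std::to_string(line) +
                    "] Cannot reassign 'let' binding '" + name + "'.");
            return;
        }
        // Not found in any tracked scope: treated as a global in this sketch.
    }
};
```

Because the search stops at the first scope that declares the name, an inner var x shadows an outer let x here just as it does at runtime.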

Shadowing across scopes

let x = 1;
{
    var x = 2;   // OK — new binding in inner scope, different slot
    x = 3;       // OK — this x is mutable
}
print x;  // 1 — outer let x unchanged

Each let/var creates a new slot in its scope. Shadowing is allowed: a var x in an inner scope doesn't make the outer let x mutable. The resolver assigns separate depths to each, so the inner assignment never reaches the outer slot.

The let design in practice

In real languages:

  • Rust: let is immutable, let mut is mutable.
  • JavaScript: const is immutable, let is mutable (confusingly opposite to MiniLang).
  • Swift: let is immutable, var is mutable (same as MiniLang).
  • Haskell: Everything is let-bound and immutable by default.

MiniLang follows Swift/Rust semantics. The pedagogical point is that immutability is a property of the binding, not the value. A let binding to a mutable array still allows mutating the array's contents; it prevents rebinding the name to a different array.

06 — Static Errors: Redeclaration, Self-Init, Top-Level Return

The resolver catches three categories of semantic errors before the program runs, producing precise messages that point to the exact source location.

Redeclaration in the same scope

let x = 1;
let x = 2;   // error: already declared in this scope

In the resolver's declare:

void Resolver::declare(const std::string& name, int line) {
    if (scopes_.empty()) return;
    auto& scope = scopes_.back();
    if (scope.count(name))
        throw ResolveError("[line " + std::to_string(line) +
            "] Variable '" + name + "' already declared in this scope.");
    scope[name] = false;
}

Redeclaration in a nested scope is allowed (shadowing). Only the same scope triggers the error:

let x = 1;
{
    let x = 2;  // OK — different scope
}

Self-referential initialiser

let x = x + 1;  // error: can't read 'x' in its own initialiser

Detected in visitVarExpr when the found entry is false (declared but not yet defined):

if (!scopes_.empty()) {
    auto it = scopes_.back().find(name);
    if (it != scopes_.back().end() && it->second == false)
        throw ResolveError("[line " + std::to_string(line) +
            "] Can't read '" + name + "' in its own initialiser.");
}

This distinguishes the bad case from the legitimate recursive case:

fn fib(n) {
    if (n <= 1) return n;
    return fib(n-1) + fib(n-2);  // OK — fib is fully defined before we get here
}

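The self-init check inspects only the innermost scope (scopes_.back()). When the resolver reaches fib in fib(n-1), the innermost scope is the function body's, which holds n but has no pending entry for fib, so no error is raised. By the time the body actually runs, the binding of fib has completed, so the call finds a fully defined function.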

return outside a function

return 42;  // error at top level

The resolver tracks whether it is currently inside a function:

enum class FunctionType { None, Function };
FunctionType currentFunction_ = FunctionType::None;

void Resolver::visitReturn(ReturnStmt& s) {
    if (currentFunction_ == FunctionType::None)
        throw ResolveError("[line " + std::to_string(s.line) +
            "] Can't return from top-level code.");
    if (s.value) resolve(*s.value);
}

void Resolver::resolveFunction(FnExpr& fn) {
    auto enclosing = currentFunction_;
    currentFunction_ = FunctionType::Function;
    // ... resolve body ...
    currentFunction_ = enclosing;
}

currentFunction_ is a scoped state flag, saved and restored when entering and leaving each function. Nested functions work correctly because the save/restore is a stack discipline.
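
The manual save/restore in resolveFunction is skipped if resolving the body throws. That is harmless while the first error aborts resolution, but it starts to matter if the resolver keeps going after an error. One way to make the restore unconditional is a small RAII guard; FunctionScopeGuard is a hypothetical helper, not part of the lab code:

```cpp
enum class FunctionType { None, Function };

// Restores the previous value when the guard leaves scope, including during
// stack unwinding if resolving the body throws.
class FunctionScopeGuard {
    FunctionType& slot_;
    FunctionType  saved_;
public:
    FunctionScopeGuard(FunctionType& slot, FunctionType entering)
        : slot_(slot), saved_(slot) { slot_ = entering; }
    ~FunctionScopeGuard() { slot_ = saved_; }

    FunctionScopeGuard(const FunctionScopeGuard&) = delete;
    FunctionScopeGuard& operator=(const FunctionScopeGuard&) = delete;
};
```

Inside resolveFunction this would replace the two assignments with a single local, e.g. FunctionScopeGuard guard(currentFunction_, FunctionType::Function);.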

Error recovery

Each error throws a ResolveError exception. In a production compiler you'd collect all errors and report them together. For the curriculum the first error terminates resolution with a clear message. Improving this to collect-and-continue is a good exercise: change the resolver to push errors into a vector<ResolveError> and only throw at the end of resolve(stmts).
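
One possible shape for that exercise is sketched below. It assumes ResolveError derives from std::runtime_error (the lab's definition may differ), and Diagnostics, report and throwIfAny are hypothetical names:

```cpp
#include <stdexcept>
#include <string>
#include <vector>

struct ResolveError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class Diagnostics {
    std::vector<ResolveError> errors_;
public:
    // Record the problem and let the resolver keep walking the AST.
    void report(int line, const std::string& msg) {
        errors_.emplace_back("[line " + std::to_string(line) + "] " + msg);
    }
    bool ok() const { return errors_.empty(); }
    const std::vector<ResolveError>& all() const { return errors_; }

    // Call once at the end of resolve(stmts): one exception carrying every message.
    void throwIfAny() const {
        if (errors_.empty()) return;
        std::string joined;
        for (const auto& e : errors_) {
            if (!joined.empty()) joined += '\n';
            joined += e.what();
        }
        throw ResolveError(joined);
    }
};
```

Each throw ResolveError(...) in the resolver would become a report(...) call followed by whatever local recovery keeps the walk consistent, for example still declaring the name after a redeclaration error.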

The three errors together — a test

void test_static_errors() {
    // Redeclaration
    CHECK_THROWS(run("let x = 1; let x = 2;"), "already declared");
    // Self-init
    CHECK_THROWS(run("let x = x + 1;"), "own initialiser");
    // Top-level return
    CHECK_THROWS(run("return 42;"), "top-level");
}

These three static analyses, taken together, eliminate a whole class of runtime crashes that would otherwise only manifest as obscure interpreter bugs deep into execution.

07 — O(depth) Interpreter Lookups and Testing

This final step ties everything together, reviews the performance model, and verifies the full resolver + interpreter integration.

The complete variable lookup chain

From source text to value:

Source → Lexer → [Token stream] → Parser → [AST]
→ Resolver → [locals_ map: Expr* → depth] → Interpreter
→ lookUpVariable(name, expr*) → env_->getAt(depth, name) → Value

The resolver runs once. After it populates locals_, every variable reference in every execution of any function body is a direct-depth lookup with no chain scanning.

When depth can be wrong

There is one subtle case: if the interpreter creates environment layers that the resolver didn't see, getAt(depth) stops short of the intended scope, and the lookup either fails or finds the wrong binding. This can happen if you add an implicit scope (e.g. around a single-expression if body without braces). The resolver must call beginScope/endScope wherever the interpreter creates a new Environment. Keep them in sync.

The test suite for cp-04 includes a "depth sync" test:

void test_depth_sync() {
    // Deep nesting should still resolve correctly
    auto out = run(R"(
        let a = 10;
        fn f() {
            let b = 20;
            fn g() {
                let c = 30;
                return a + b + c;
            }
            return g();
        }
        print f();
    )");
    CHECK_EQ(out, "60\n");
}

This exercises three levels of nesting. Inside g, c resolves at depth 0 and b at depth 1 (g's frame → f's frame). a is declared at top level, so it has no entry in locals_ and is read from globals_. An off-by-one in b's depth sends getAt to an environment that has no b, which fails this test.

Testing the static errors

void test_redeclare() {
    bool threw = false;
    try { run("let x = 1; let x = 2;"); }
    catch (const ResolveError& e) { threw = true; }
    CHECK_EQ(threw, true);
}

void test_self_init() {
    bool threw = false;
    try { run("let x = x + 1;"); }
    catch (const ResolveError& e) { threw = true; }
    CHECK_EQ(threw, true);
}

void test_top_level_return() {
    bool threw = false;
    try { run("return 42;"); }
    catch (const ResolveError& e) { threw = true; }
    CHECK_EQ(threw, true);
}

Testing closure correctness

void test_closure_captures_definition_site() {
    auto out = run(R"(
        var a = "global";
        fn showA() { print a; }
        fn test() {
            var a = "local";
            showA();
        }
        test();
    )");
    CHECK_EQ(out, "global\n");  // lexical scope, not dynamic
}

Without the resolver, a dynamic-scope interpreter prints "local". With the resolver, the a reference inside showA is annotated as a global (depth not in locals_), so the interpreter looks in globals_, finds "global", and prints it correctly.

Summary: what the resolver gives you

| Feature | Without resolver | With resolver |
|---|---|---|
| Variable lookup | O(depth) chain walk with a hash lookup at every scope | O(depth) parent hops plus a single hash lookup |
| Closure semantics | Accidental dynamic scope possible | Lexical scope enforced |
| Self-init bug | Crashes at runtime | Static error |
| Redeclaration | Silent shadowing | Static error |
| Top-level return | Runtime crash (ReturnSignal uncaught) | Static error |

The resolver is a small investment — ~150 lines — that pays dividends in every subsequent phase. The bytecode compiler in cp-06 uses the same scope-stack technique to allocate stack slots; the type checker in cp-05 uses it to track type annotations.

cp-05 — Static Type Checker (gradual)

Status: ✅ Built · all tests passing

A third compiler pass added between resolver and interpreter:

lex → parse → resolve → typecheck → interpret

The checker walks the AST a second time, this time with a TypePtr visitor. It computes a static type for every expression and validates operands, conditions, arities, return values, and assignments.

Gradual typing

Annotations are optional. A bare var x = 1; is fine; so is let y: int = 1;. Wherever the source omits a type, the slot is any, a wildcard that:

  • satisfies any constraint (any + int = int, if (anyVal) accepted),
  • propagates through unknown operations (any * any = any),
  • lets cp-04 programs (and the recursive fact test) keep working unchanged, while fully-annotated programs get strict checks.

What's new vs cp-04

| Aspect | cp-04 | cp-05 |
|---|---|---|
| Phases | lex → parse → resolve → interpret | lex → parse → resolve → typecheck → interpret |
| Tokens | | adds : and -> |
| Annotations | none | let x: int, fn f(a: int, b: int) -> int { ... } |
| Type ADT | | int / bool / string / nil / fn(...) -> T / any |
| Errors caught | resolution-only (undef, redecl, …) | + operand types, arity, arg types, return type, condition type, assign-T |

Source layout (src/cpp/)

src/
  token.hpp                       # adds Colon, Arrow
  lexer.{hpp,cpp}                 # handles ':' and '->'
  value.hpp                       # unchanged
  type.hpp                        # NEW — Type ADT and tyInt/tyBool/...
  ast.hpp                         # adds declaredType, paramTypes, returnType, checkedType
                                  # and a second `accept` overload returning TypePtr
  parser.{hpp,cpp}                # parseType() + optional annotations
  environment.hpp                 # unchanged
  resolver.{hpp,cpp}              # unchanged
  typecheck.{hpp,cpp}             # NEW — gradual type checker
  interpreter.{hpp,cpp}           # unchanged
  main.cpp                        # invokes TypeChecker between resolver and interpreter
tests/test_typecheck.cpp          # regression + new annotated programs + 11 negative cases

Build & test

cd src/cpp
cmake -S . -B build -G "Unix Makefiles"
cmake --build build -j
ctest --test-dir build --output-on-failure

Sample diagnostics

$ cat > bad.ml <<'EOF'
let x: int = true;
if (5) print 1;
fn h(a: int) -> int { return true; }
h(true, 2);
EOF
$ build/mli bad.ml
[line 1] type error: initializer of 'x' has type bool, expected int
[line 2] type error: if-condition must be bool (got int)
[line 3] type error: return type mismatch: function returns int, got bool
[line 4] type error: call to function expected 1 arg(s), got 2

All diagnostics are collected before the program is rejected — same recovery model as the resolver.

See src/cpp/steps/ for the walk-through

  1. 01-why-static-types.md — what dynamic-only typing costs
  2. 02-type-syntax.md — annotations, -> returns, function types
  3. 03-type-representation.md — the Type ADT and Any wildcard
  4. 04-the-checker-walk.md — visitor, scope stack, currentReturn
  5. 05-inference-vs-annotation.md — when the checker fills the gaps
  6. 06-function-types-and-calls.md — arity, arg-by-arg, return matching
  7. 07-error-recovery.md — collecting diagnostics like the resolver

01 — Why Static Types

A type system is a lightweight formal method that proves a class of runtime errors cannot occur — without running the program. The guarantees it provides depend on how expressive the type language is.

The cost of untyped code

In cp-03/cp-04, MiniLang is fully dynamic: every value carries its type at runtime (std::variant<nil, double, bool, string, FnValue>). A type error like "hello" + 42 raises a RuntimeError when the + operator tries to add a string and a number. This is correct, but:

  • The error appears at runtime, not compile time.
  • It may appear on a rarely-executed code path — you might not see it until production.
  • Error messages say "expected number" without knowing the programmer's intent.

What the type checker provides

The type checker runs after parsing and resolution and before execution. It annotates every expression with a type (Num, Bool, Str, Nil, Any, or Fn(...) -> T) and reports mismatches statically:

let x: Num = "hello";  // error at line 1: expected Num, got Str

This moves the failure from runtime to compile time.

MiniLang's type language

Type ::= Num | Bool | Str | Nil | Any
       | Fn(Type, ...) -> Type
  • Num, Bool, Str, Nil — the four base types from the value set.
  • Any — the gradual escape hatch (step 06). A value of type Any can be used anywhere without a static error; correctness is checked at runtime.
  • Fn(T1, T2, ...) -> R — a function type.

Soundness vs completeness

A type system is sound if every accepted program is safe at runtime. A type system is complete if it accepts every safe program.

For a Turing-complete language, no decidable type system is both sound and complete (a consequence of undecidability). The trade-off is:

  • Accept some unsafe programs (be unsound) → miss bugs.
  • Reject some safe programs (be incomplete) → turn away correct code.

MiniLang's checker is sound for the base types: a program that passes type-checking with no Any types will not produce a type error at runtime. The Any type trades soundness for usability on the parts of the program that can't be typed statically.

Checked vs unchecked features

In cp-05, the following are statically checked:

  • Arithmetic (+, -, *, /) requires Num operands.
  • Comparison (<, <=, etc.) requires Num operands in the visitBinary rule shown in step 04; accepting Str operands as well is a natural extension.
  • Logical (and, or) requires Bool operands.
  • Negation ! requires Bool; unary - requires Num.
  • Function call arity is checked against the declared parameter count.
  • Function return type must match the declared return type (if annotated).
  • let/var type annotations must match the initialiser.

The following are not statically checked in cp-05:

  • Array element types (no array type in the type language).
  • Object field access (no class types yet).

These are left for extensions.

02 — The Type ADT

The type representation is the data model for the entire type checker. It must be:

  • Recursive: function types contain other types.
  • Comparable: equality and compatibility checks must work.
  • Printable: error messages need "expected Num, got Bool".

The Type representation

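struct NumType  {};
struct BoolType {};
struct StrType  {};
struct NilType  {};
struct AnyType  {};

struct Type;  // forward declaration: FnType refers to Type before it is complete
using TypePtr = std::shared_ptr<Type>;

struct FnType {
    std::vector<TypePtr> params;
    TypePtr              ret;
};

// A type alias cannot be forward-declared, so Type is a thin struct that
// inherits from the variant and reuses its constructors.
struct Type : std::variant<NumType, BoolType, StrType, NilType, AnyType, FnType> {
    using variant::variant;
};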

Alternatively, use an inheritance hierarchy with a TypeKind enum. The variant approach avoids virtual dispatch and is exhaustively checkable with std::visit:

// Usual helper for std::visit; the deduction guide is required in C++17.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

std::string typeToStr(const Type& t) {
    return std::visit(overloaded{
        [](const NumType&)  { return std::string("Num"); },
        [](const BoolType&) { return std::string("Bool"); },
        [](const StrType&)  { return std::string("Str"); },
        [](const NilType&)  { return std::string("Nil"); },
        [](const AnyType&)  { return std::string("Any"); },
        [](const FnType& f) {
            std::string s = "Fn(";
            for (size_t i=0; i<f.params.size(); ++i) {
                if (i) s += ", ";
                s += typeToStr(*f.params[i]);  // recurse into parameter types
            }
            s += ") -> " + typeToStr(*f.ret);
            return s;
        },
    }, t);
}

Type equality

Two types are equal if they are structurally identical:

bool typeEq(const Type& a, const Type& b) {
    if (a.index() != b.index()) return false;
    if (auto* f = std::get_if<FnType>(&a)) {
        auto& g = std::get<FnType>(b);
        if (f->params.size() != g.params.size()) return false;
        for (size_t i = 0; i < f->params.size(); ++i)
            if (!typeEq(*f->params[i], *g.params[i])) return false;
        return typeEq(*f->ret, *g.ret);
    }
    return true;  // same variant index, non-FnType
}

The "gradual" compatibility check

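For gradual typing (step 06), we need compatible(a, b), which is weaker than typeEq: Any is compatible with everything. This first version falls back to typeEq, so Any is recognised only at the top level; step 06 refines it to recurse into function types.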

bool compatible(const Type& a, const Type& b) {
    if (std::holds_alternative<AnyType>(a)) return true;
    if (std::holds_alternative<AnyType>(b)) return true;
    return typeEq(a, b);
}

Type factory helpers

TypePtr mkNum()  { return std::make_shared<Type>(NumType{}); }
TypePtr mkBool() { return std::make_shared<Type>(BoolType{}); }
TypePtr mkStr()  { return std::make_shared<Type>(StrType{}); }
TypePtr mkNil()  { return std::make_shared<Type>(NilType{}); }
TypePtr mkAny()  { return std::make_shared<Type>(AnyType{}); }
TypePtr mkFn(std::vector<TypePtr> params, TypePtr ret) {
    return std::make_shared<Type>(FnType{std::move(params), std::move(ret)});
}

These reduce noise in the checker: mkNum() is clearer than std::make_shared<Type>(NumType{}) at every call site.

Why shared_ptr?

Function types contain nested types. A type like Fn(Fn(Num)->Bool, Str) -> Nil has a FnType whose first param is another FnType. Shared ownership means the inner types can be cheaply aliased across multiple places in the checker's environment without copying.

For the curriculum, shared_ptr is clear and correct. Production type checkers use arenas (bump allocators) so all types live in one allocation region and are freed together at compile-time end.
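
To give a feel for the arena style, here is a deliberately small sketch. It assumes a tagged TypeNode instead of the variant above, and it uses std::deque as a stand-in for a real bump allocator; TypeArena and its methods are hypothetical names:

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

enum class Kind { Num, Bool, Str, Nil, Any, Fn };

struct TypeNode {
    Kind kind;
    std::vector<const TypeNode*> params;  // used only when kind == Fn
    const TypeNode* ret = nullptr;        // used only when kind == Fn
};

// push_back on a std::deque never moves existing elements, so the pointers
// handed out stay valid until the arena itself is destroyed.
class TypeArena {
    std::deque<TypeNode> nodes_;
public:
    const TypeNode* make(Kind k) {
        nodes_.push_back(TypeNode{k, {}, nullptr});
        return &nodes_.back();
    }
    const TypeNode* makeFn(std::vector<const TypeNode*> params, const TypeNode* ret) {
        nodes_.push_back(TypeNode{Kind::Fn, std::move(params), ret});
        return &nodes_.back();
    }
    std::size_t size() const { return nodes_.size(); }
};
```

Types are then plain pointers with no reference counting, and everything is released in one step when the arena goes away at the end of compilation.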

03 — Type Annotations in the AST and Parser

Type annotations let programmers express intent:

let x: Num = 42;
fn add(a: Num, b: Num): Num { return a + b; }

The parser must recognise the : Type syntax and store the annotation in the AST.

AST changes

The LetStmt and function parameter nodes gain an optional type annotation:

struct LetStmt {
    std::string name;
    bool        immutable;
    ExprPtr     init;
    TypePtr     annotation;  // nullptr if absent
    int         line;
};

struct Param {
    std::string name;
    TypePtr     annotation;  // nullptr if absent
};

struct FnExpr {
    std::vector<Param> params;
    TypePtr            retAnnotation;  // nullptr if absent
    StmtPtr            body;
    int                line;
};

Parsing type annotations

A parseType helper handles the type grammar:

// Type ::= "Num" | "Bool" | "Str" | "Nil" | "Any"
//         | "Fn" "(" TypeList ")" "->" Type
TypePtr Parser::parseType() {
    Token t = advance();
    if (t.lexeme == "Num")  return mkNum();
    if (t.lexeme == "Bool") return mkBool();
    if (t.lexeme == "Str")  return mkStr();
    if (t.lexeme == "Nil")  return mkNil();
    if (t.lexeme == "Any")  return mkAny();
    if (t.lexeme == "Fn") {
        expect(LParen);
        std::vector<TypePtr> params;
        while (peek().kind != RParen) {
            params.push_back(parseType());
            if (!match(Comma)) break;
        }
        expect(RParen);
        expect(Arrow);    // "->"
        auto ret = parseType();
        return mkFn(std::move(params), std::move(ret));
    }
    throw ParseError("[line " + std::to_string(t.line) +
        "] Expected type annotation, got '" + t.lexeme + "'.");
}

Parsing let with annotation

StmtPtr Parser::parseLet() {
    int line = advance().line;  // consume 'let' / 'var'
    bool immutable = (previous().kind == Let);
    auto name = expect(Ident).lexeme;
    TypePtr ann;
    if (match(Colon)) ann = parseType();   // optional ": Type"
    expect(Eq);
    auto init = parseExpr(0);
    expect(Semicolon);
    return std::make_unique<LetStmt>(name, immutable, std::move(init), std::move(ann), line);
}

Parsing function parameters with annotations

std::vector<Param> Parser::parseParams() {
    expect(LParen);
    std::vector<Param> params;
    while (peek().kind != RParen) {
        auto name = expect(Ident).lexeme;
        TypePtr ann;
        if (match(Colon)) ann = parseType();
        params.push_back({name, std::move(ann)});
        if (!match(Comma)) break;
    }
    expect(RParen);
    return params;
}

Token additions

The lexer needs two new token kinds:

  • Colon for : separating name from type.
  • Arrow for -> separating parameter types from return type.

-> is a two-character token; the lexer handles it in the - branch:

case '-':
    return makeToken(match('>') ? Arrow : Minus);
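
The same one-character lookahead, written as a free function so it can be run in isolation. This is a sketch under the assumption that the scanner works on a std::string with an index; scanMinusOrArrow and TokenKind are hypothetical names:

```cpp
#include <cstddef>
#include <string>

enum class TokenKind { Minus, Arrow };

// Assumes src[pos] == '-'. Consumes one or two characters and advances pos.
TokenKind scanMinusOrArrow(const std::string& src, std::size_t& pos) {
    ++pos;                                        // consume '-'
    if (pos < src.size() && src[pos] == '>') {    // one character of lookahead
        ++pos;                                    // consume '>'
        return TokenKind::Arrow;
    }
    return TokenKind::Minus;
}
```

The bounds check matters: a '-' at the very end of the input must still produce Minus rather than read past the buffer.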

Annotations are optional

All annotations are TypePtr defaulting to nullptr. Code without any annotations is valid — it's treated as fully Any-typed (step 06). This means cp-03/cp-04 programs are valid cp-05 programs without modification.

04 — The Type Checker Pass

The type checker walks the AST after the resolver and before execution. It visits every expression and statement, computing or verifying types.

The TypeChecker class

class TypeChecker : public ExprVisitor<TypePtr>, public StmtVisitor<void> {
    // Type environment: name → TypePtr
    std::vector<std::unordered_map<std::string, TypePtr>> scopes_;
    TypePtr currentReturnType_;   // expected return type of current function

    void beginScope();
    void endScope();
    void declare(const std::string& name, TypePtr t);
    TypePtr lookup(const std::string& name);

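    void checkCompatible(TypePtr expected, TypePtr actual, int line);
    TypePtr check(Expr& e);   // dispatches to the matching ExprVisitor method
    void    check(Stmt& s);   // dispatches to the matching StmtVisitor method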
public:
    void check(std::vector<StmtPtr>& stmts);
    // ExprVisitor
    TypePtr visitNumber(NumberExpr&) override;
    TypePtr visitBool(BoolExpr&)     override;
    TypePtr visitString(StringExpr&) override;
    TypePtr visitNil(NilExpr&)       override;
    TypePtr visitVar(VarExpr&)       override;
    TypePtr visitBinary(BinaryExpr&) override;
    TypePtr visitUnary(UnaryExpr&)   override;
    TypePtr visitCall(CallExpr&)     override;
    TypePtr visitFn(FnExpr&)        override;
    // StmtVisitor
    void visitLet(LetStmt&)     override;
    void visitBlock(BlockStmt&) override;
    void visitIf(IfStmt&)       override;
    void visitWhile(WhileStmt&) override;
    void visitReturn(ReturnStmt&) override;
    void visitPrint(PrintStmt&)  override;
};

Expression type rules

TypePtr TypeChecker::visitBinary(BinaryExpr& e) {
    auto L = check(*e.left);
    auto R = check(*e.right);
    switch (e.op) {
    case Plus: case Minus: case Star: case Slash:
        checkCompatible(mkNum(), L, e.line);
        checkCompatible(mkNum(), R, e.line);
        return mkNum();
    case EqEq: case BangEq:
        // any two compatible types may be compared for equality
        return mkBool();
    case Lt: case LtEq: case Gt: case GtEq:
        checkCompatible(mkNum(), L, e.line);
        checkCompatible(mkNum(), R, e.line);
        return mkBool();
    case And: case Or:
        checkCompatible(mkBool(), L, e.line);
        checkCompatible(mkBool(), R, e.line);
        return mkBool();
    // ...
    }
}

The checkCompatible(expected, actual, line) function throws a TypeCheckError if !compatible(expected, actual) (see step 06 for the compatible definition):

void TypeChecker::checkCompatible(TypePtr expected, TypePtr actual, int line) {
    if (!compatible(*expected, *actual))
        throw TypeCheckError("[line " + std::to_string(line) +
            "] Expected " + typeToStr(*expected) +
            ", got " + typeToStr(*actual) + ".");
}

Statement rules

void TypeChecker::visitLet(LetStmt& s) {
    TypePtr initType = s.init ? check(*s.init) : mkNil();
    if (s.annotation)
        checkCompatible(s.annotation, initType, s.line);
    TypePtr declaredType = s.annotation ? s.annotation : initType;
    declare(s.name, declaredType);
}

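If no annotation is given, the type is inferred from the initialiser. This is local, one-directional inference (the same idea as auto in C++ or an unannotated let in Swift): the type flows from the initialiser to the binding. It is not Hindley-Milner inference, which solves type constraints by unification.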

let x = 42;       // inferred: Num
let y = "hello";  // inferred: Str
let z = x + y;    // error: expected Num, got Str (for y)

Function type checking

TypePtr TypeChecker::visitFn(FnExpr& fn) {
    beginScope();
    std::vector<TypePtr> paramTypes;
    for (auto& p : fn.params) {
        TypePtr t = p.annotation ? p.annotation : mkAny();
        declare(p.name, t);
        paramTypes.push_back(t);
    }
    TypePtr retType = fn.retAnnotation ? fn.retAnnotation : mkAny();
    auto saved = currentReturnType_;
    currentReturnType_ = retType;
    check(*fn.body);
    currentReturnType_ = saved;
    endScope();
    return mkFn(std::move(paramTypes), retType);
}

Return type checking

void TypeChecker::visitReturn(ReturnStmt& s) {
    TypePtr t = s.value ? check(*s.value) : mkNil();
    checkCompatible(currentReturnType_, t, s.line);
}

If the function has Any return type (no annotation), any return value is accepted.

05 — Function Types and Call Checking

Function types form the most complex part of the type system: they are recursive (parameters and return types can themselves be function types), and call-site checking must verify arity and argument types.

Checking a call expression

TypePtr TypeChecker::visitCall(CallExpr& call) {
    TypePtr calleeType = check(*call.callee);

    // If the callee is Any, we can't check — return Any
    if (std::holds_alternative<AnyType>(*calleeType)) {
        for (auto& arg : call.args) check(*arg);  // still check arg types
        return mkAny();
    }

    // Must be a function type
    auto* fn = std::get_if<FnType>(calleeType.get());
    if (!fn)
        throw TypeCheckError("[line " + std::to_string(call.line) +
            "] Cannot call non-function value of type " +
            typeToStr(*calleeType) + ".");

    // Check arity
    if (call.args.size() != fn->params.size())
        throw TypeCheckError("[line " + std::to_string(call.line) +
            "] Expected " + std::to_string(fn->params.size()) +
            " arguments, got " + std::to_string(call.args.size()) + ".");

    // Check argument types
    for (size_t i = 0; i < call.args.size(); ++i) {
        TypePtr argType = check(*call.args[i]);
        checkCompatible(fn->params[i], argType, call.line);
    }

    return fn->ret;
}

Higher-order functions

Function types compose naturally:

fn apply(f: Fn(Num)->Num, x: Num): Num {
    return f(x);
}
fn double(n: Num): Num { return n * 2; }
print apply(double, 21);  // 42

Type checking apply(double, 21):

  1. apply has type Fn(Fn(Num)->Num, Num) -> Num.
  2. Arg 0: double has type Fn(Num)->Num. Compatible with param 0 (Fn(Num)->Num). ✓
  3. Arg 1: 21 has type Num. Compatible with param 1 (Num). ✓
  4. Return type: Num.
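
The four steps above can be replayed in a few lines. The sketch uses a simplified tagged Type rather than the variant from step 02, and checkCall is a hypothetical helper that returns nullptr on a mismatch instead of throwing:

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

enum class Kind { Num, Bool, Str, Nil, Any, Fn };

struct Type;
using TypePtr = std::shared_ptr<Type>;

struct Type {
    Kind kind;
    std::vector<TypePtr> params;  // Fn only
    TypePtr ret;                  // Fn only
};

TypePtr mk(Kind k) { return std::make_shared<Type>(Type{k, {}, nullptr}); }
TypePtr mkFn(std::vector<TypePtr> ps, TypePtr r) {
    return std::make_shared<Type>(Type{Kind::Fn, std::move(ps), std::move(r)});
}

bool compatible(const Type& a, const Type& b) {
    if (a.kind == Kind::Any || b.kind == Kind::Any) return true;
    if (a.kind != b.kind) return false;
    if (a.kind != Kind::Fn) return true;
    if (a.params.size() != b.params.size()) return false;
    for (std::size_t i = 0; i < a.params.size(); ++i)
        if (!compatible(*a.params[i], *b.params[i])) return false;
    return compatible(*a.ret, *b.ret);
}

// Result type of the call, or nullptr if arity or an argument type mismatches.
TypePtr checkCall(const Type& callee, const std::vector<TypePtr>& args) {
    if (callee.kind == Kind::Any) return mk(Kind::Any);
    if (callee.kind != Kind::Fn) return nullptr;
    if (args.size() != callee.params.size()) return nullptr;
    for (std::size_t i = 0; i < args.size(); ++i)
        if (!compatible(*callee.params[i], *args[i])) return nullptr;
    return callee.ret;
}
```

Building Fn(Fn(Num)->Num, Num) -> Num for apply and calling checkCall with the types of double and 21 yields Num; passing a plain Num as the first argument, or dropping an argument, yields nullptr.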

Recursive functions

fn fib(n: Num): Num {
    if (n <= 1) return n;
    return fib(n-1) + fib(n-2);
}

For the recursive call to type-check, fib must already be declared with type Fn(Num)->Num when visitFn checks the body. The visitLet from step 04 does not guarantee that: under the let fib = fn(n: Num): Num {...} desugaring it checks the initialiser first and calls declare afterwards, so the body is checked before fib is in scope.

The fix is to reverse the order for function literals: declare first, then check the body.

void TypeChecker::visitLet(LetStmt& s) {
    // For function literals, pre-declare with the annotated type
    // before checking the body (enables recursion).
    if (auto* fn = dynamic_cast<FnExpr*>(s.init.get())) {
        if (fn->retAnnotation) {
            auto preType = buildFnType(*fn);  // params + ret from annotations
            declare(s.name, preType);         // pre-declare
            check(*s.init);                   // body can now see s.name
            return;
        }
    }
    // Non-function or unannotated: infer normally
    TypePtr initType = s.init ? check(*s.init) : mkNil();
    if (s.annotation) checkCompatible(s.annotation, initType, s.line);
    declare(s.name, s.annotation ? s.annotation : initType);
}

The interaction with Any

let f: Any = fn(x: Num): Num { return x + 1; };
f(42);   // accepted — callee is Any, no type checking on call
f("hi"); // accepted — but will crash at runtime

The Any escape hatch means the checker accepts the call (returns Any) while the interpreter's runtime check catches the actual type mismatch. This is the defining property of gradual typing: you can opt specific sites out of static checking at the cost of runtime safety guarantees.

Testing function types

void test_function_types() {
    // Correct types — should type-check clean
    CHECK_NOTHROW(typeCheck(R"(
        fn add(a: Num, b: Num): Num { return a + b; }
        print add(1, 2);
    )"));

    // Wrong arg type
    CHECK_THROWS(typeCheck("fn f(x: Num): Num { return x; } f(\"hi\");"),
        "Expected Num");

    // Wrong arity
    CHECK_THROWS(typeCheck("fn f(x: Num): Num { return x; } f(1, 2);"),
        "Expected 1 arguments");
}

06 — Gradual Typing with Any

Gradual typing lets programmers mix typed and untyped code in the same program. The Any type is the mechanism: a value of type Any bypasses static checks but retains runtime checks.

The key rule: compatibility not equality

The type checker uses compatible(A, B) instead of typeEq(A, B) for all mismatch checks:

bool compatible(const Type& a, const Type& b) {
    if (std::holds_alternative<AnyType>(a)) return true;
    if (std::holds_alternative<AnyType>(b)) return true;
    if (a.index() != b.index()) return false;
    if (auto* fa = std::get_if<FnType>(&a)) {
        auto& fb = std::get<FnType>(b);
        if (fa->params.size() != fb.params.size()) return false;
        for (size_t i = 0; i < fa->params.size(); ++i)
            if (!compatible(*fa->params[i], *fb.params[i])) return false;
        return compatible(*fa->ret, *fb.ret);
    }
    return true;  // same non-Fn variant
}

Any is compatible with every type in both directions. This is what allows unannotated code to coexist with annotated code.
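
One consequence worth seeing in code is that compatibility is not transitive, so it cannot be treated like equality. The sketch below covers only the base-type fragment and uses a plain enum instead of the variant; Ty is a hypothetical name:

```cpp
enum class Ty { Num, Bool, Str, Nil, Any };

// Base-type fragment of compatible(): Any matches everything, otherwise equality.
constexpr bool compatible(Ty a, Ty b) {
    return a == Ty::Any || b == Ty::Any || a == b;
}

static_assert(compatible(Ty::Num, Ty::Any), "Num is compatible with Any");
static_assert(compatible(Ty::Any, Ty::Str), "Any is compatible with Str");
static_assert(!compatible(Ty::Num, Ty::Str), "yet Num and Str are not compatible");
```

This is why the checker always compares the two concrete types at hand and never chains results through an intermediate Any.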

Where Any flows

  1. Unannotated parameters: fn f(x) { ... }x has type Any.
  2. Unannotated return: fn f() { return expr; } → return type is Any.
  3. Unannotated variable with complex initialiser: if the checker can't infer a concrete type, it falls back to Any.
  4. Explicit annotation: let x: Any = ... always.

The flow problem

Gradual typing introduces a subtlety: a value of type Any passes every operand check, so unchecked values can flow silently into typed code:

let x = f();          // f is unannotated → x has type Any
let y: Num = x + 1;  // x is Any, so the operand check passes and x + 1 is typed Num → accepted

At runtime, if f() returns a string, x + 1 raises a RuntimeError. This is the fundamental trade-off: accepting Any means losing static guarantees for all downstream computations that use that value.

Gradual guarantee

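The guarantee cp-05 gives is one-directional: if you annotate everything (no Any), a program that passes static type checking will not raise a type error at runtime. The converse does not hold; as step 01 notes, a sound checker still rejects some programs that would run safely. As soon as you introduce Any, you give up some of that guarantee for the paths touched by Any values. (In the gradual-typing literature, "gradual guarantee" names a different property: making annotations less precise never makes a well-typed program ill-typed.)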

In practice this means: start with Any everywhere (fully dynamic), add annotations incrementally, and the type checker's coverage grows with each annotation.

Runtime blame

In research gradual type systems (Typed Racket, Reticulated Python), the runtime inserts casts at Any ↔ typed boundaries and reports blame precisely: "this function was typed but received an untyped argument from line 42". cp-05 does not implement blame tracking — the runtime check is just the existing std::get<double> in the interpreter throwing a RuntimeError. The blame extension is left for exploration.
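
As a starting point for that exploration, a first-order cast with a blame label might look like the sketch below. It assumes a simplified Value variant, and castToNum and BlameError are hypothetical names; real blame tracking, especially for function values, is considerably more involved:

```cpp
#include <stdexcept>
#include <string>
#include <variant>

using Value = std::variant<std::monostate, double, bool, std::string>;

struct BlameError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A cast the checker could insert where an Any-typed value flows into a Num
// slot. boundaryLine is the source line of that boundary, known statically.
double castToNum(const Value& v, int boundaryLine) {
    if (const double* d = std::get_if<double>(&v)) return *d;
    throw BlameError("[line " + std::to_string(boundaryLine) +
        "] value crossing an Any -> Num boundary is not a Num");
}
```

The error then points at the boundary where the untyped value entered typed code, rather than at whichever arithmetic operation happened to touch it first.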

Testing gradual typing

void test_gradual() {
    // Unannotated fn — no type error
    CHECK_NOTHROW(typeCheck("fn f(x) { return x + 1; }"));

    // Annotated caller, unannotated callee — OK (Any is compatible with Num)
    CHECK_NOTHROW(typeCheck(R"(
        fn id(x) { return x; }
        let n: Num = id(42);
    )"));

    // Annotated callee, mismatched argument: an error. Gradual typing only
    // relaxes checks where Any is involved; fn(x: Num) still rejects a Str.
    CHECK_THROWS(typeCheck(R"(
        fn f(x: Num) { return x + 1; }
        f("hello");
    )"), "Expected Num");
}

07 — Error Messages and Recovery

A type system is only useful if its error messages are actionable. This step covers how to produce clear diagnostics and how to continue checking after the first error.

Anatomy of a good type error

error[E001]: type mismatch
 --> main.ml:3:18
  |
3 |     let x: Num = "hello";
  |                  ^^^^^^^ expected Num, found Str
  |
  = note: variable 'x' is declared as Num at line 3

The key components:

  1. Error code — makes errors searchable in documentation.
  2. Location — file, line, column (or at least line).
  3. Expected vs actual — always say what was expected, not just what was found.
  4. Context — which variable, which function, which call.

cp-05's error format is simpler but includes the essentials:

TypeCheckError at line 3: expected Num, got Str (in binding 'x')

The TypeCheckError structure

struct TypeCheckError {
    std::string message;
    int         line;
    TypeCheckError(std::string msg, int line) 
        : message(std::move(msg)), line(line) {}
    const char* what() const { return message.c_str(); }
};

Error messages are assembled at the site of detection:

void TypeChecker::checkCompatible(TypePtr expected, TypePtr actual,
                                   int line, const std::string& context) {
    if (!compatible(*expected, *actual))
        throw TypeCheckError(
            "expected " + typeToStr(*expected) +
            ", got " + typeToStr(*actual) +
            (context.empty() ? "" : " (" + context + ")"),
            line);
}

Error recovery: collect-and-continue

Throwing on the first error is the simplest strategy. The downside is that one typo hides all downstream errors. The collect-and-continue strategy accumulates errors:

class TypeChecker {
    std::vector<TypeCheckError> errors_;
    
    void reportError(const std::string& msg, int line) {
        errors_.emplace_back(msg, line);
    }
    
    TypePtr checkCompatibleSoft(TypePtr expected, TypePtr actual, int line,
                                const std::string& ctx) {
        if (!compatible(*expected, *actual))
            reportError("expected " + typeToStr(*expected) +
                        ", got " + typeToStr(*actual) +
                        (ctx.empty() ? "" : " (" + ctx + ")"), line);
        return expected;  // return expected type to continue checking
    }
};

After reportError, the checker returns the expected type and continues. Downstream code sees a "correct" type and may or may not produce spurious secondary errors. Primary errors (the first one) are always reliable; secondary errors may be false positives caused by recovery.

The "error type" sentinel

Some type checkers introduce an ErrorType variant. Any operation on an ErrorType operand silently returns ErrorType without reporting another error. This prevents cascading:

let x: Num = "str";  // error: expected Num, got Str → x gets ErrorType
let y = x + 1;      // x is ErrorType → suppress the type error for +

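Without ErrorType, a checker that records the initialiser's actual type (Str) for x would report a second error on x + 1, such as "expected Num, got Str", even though the only real mistake is on the first line. That confuses the user.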
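A minimal sketch of the sentinel, assuming a flat type enum and a checker that collects messages in a vector. MiniChecker, checkLet, and checkAdd are hypothetical names used only for illustration; cp-05 does not ship this variant.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tags; Error is the sentinel described above.
enum class Ty { Num, Str, Error };

struct MiniChecker {
    std::vector<std::string> errors;

    // `let name: declared = <init of type actual>`; returns the type bound to the name.
    Ty checkLet(Ty declared, Ty actual, int line) {
        if (actual == Ty::Error) return Ty::Error;      // already reported upstream
        if (declared != actual) {
            errors.push_back("line " + std::to_string(line) + ": type mismatch");
            return Ty::Error;                           // poison the binding
        }
        return declared;
    }

    // Type of `l + r`; an Error operand is absorbed without a new report.
    Ty checkAdd(Ty l, Ty r, int line) {
        if (l == Ty::Error || r == Ty::Error) return Ty::Error;
        if (l != r) {
            errors.push_back("line " + std::to_string(line) + ": operands of + differ");
            return Ty::Error;
        }
        return l;
    }
};
```

Checking the two-line example with this sketch records one error for the let and none for the addition, because the poisoned type of x short-circuits checkAdd.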

The complete pipeline check

void runProgram(const std::string& src) {
    Lexer lex(src);
    auto tokens = lex.scanAll();
    Parser parser(tokens);
    auto stmts = parser.parse();
    
    Resolver resolver(interp);   // interp: the Interpreter instance, assumed to be in scope here
    resolver.resolve(stmts);
    
    TypeChecker checker;
    checker.check(stmts);
    if (!checker.errors().empty()) {
        for (auto& e : checker.errors())
            std::cerr << "[line " << e.line << "] " << e.message << "\n";
        return;
    }
    
    for (auto& s : stmts) interp.execute(*s);
}

Testing error messages

The test suite verifies not just that errors are reported but that the message contains the right content:

void test_error_messages() {
    try {
        typeCheck("let x: Num = true;");
        std::cerr << "FAIL: expected error\n";
    } catch (const TypeCheckError& e) {
        CHECK_CONTAINS(e.message, "Num");
        CHECK_CONTAINS(e.message, "Bool");
    }

    try {
        typeCheck("fn f(x: Num) {} f(\"hi\");");
        std::cerr << "FAIL: expected error\n";
    } catch (const TypeCheckError& e) {
        CHECK_CONTAINS(e.message, "Num");
        CHECK_CONTAINS(e.message, "Str");
    }
}

What you've built by the end of cp-05

  • A type ADT (Num, Bool, Str, Nil, Any, Fn(...)→...).
  • Optional type annotations in the parser without breaking old code.
  • A type-checker pass that infers and verifies types for all expressions.
  • Gradual typing via Any for incremental annotation of large codebases.
  • Clear error messages with expected-vs-actual and source locations.
  • A complete pipeline: parse → resolve → type-check → interpret.

This is the foundation that every production language's front-end is built on. The gap between cp-05 and, say, TypeScript's checker is not concept but scale: more types, more inference, more generics — but the same propagate-and-unify core.

cp-06 — Bytecode Compiler (AST → Stack-VM Chunks)

Status: ✅ Implemented.

Replaces the tree-walking model with compile-then-run: AST → flat array of bytecodes ("chunk"). The chunks are executed in cp-07.

What's Built

  • Op enum — a 32-instruction bytecode ISA: stack manipulation, globals/locals access, arithmetic/logic/comparison, control flow (JUMP, JUMP_IF_FALSE, LOOP), I/O, plus reserved opcodes (CALL, RETURN, CLOSURE, upvalues) that cp-07 will activate.
  • Chunk — bytecode array + deduplicated constants pool + parallel line table.
  • Compiler — AST visitor (both ExprVisitor<void> and StmtVisitor) that emits bytecode while tracking lexical locals as stack slots.
  • disassembler — human-readable dump for debugging and unit testing.
  • mlc CLI: mlc file.ml compiles a file and prints the chunk; mlc alone reads stdin.

Architecture

source → Lexer → Parser → Resolver → TypeChecker → Compiler → Chunk
                                                       │
                                                       └─→ Disassembler → text

The frontend (lex/parse/resolve/typecheck) is unchanged from cp-05; we re-use it. The interpreter was deleted. The new backend stages are Compiler and Disassembler. The tree-walker's Environment chain is gone — locals are stack slots, globals live in a (future) runtime hash table keyed by name strings interned in the constants pool.

Reading Order

  1. CONCEPTS.md — stack machines, bytecode design, operand encoding, why this is faster than tree-walking.
  2. steps/01-instruction-set-design.md
  3. steps/02-the-chunk.md
  4. steps/03-emit-helpers-and-jumps.md
  5. steps/04-locals-vs-globals.md
  6. steps/05-control-flow.md
  7. steps/06-short-circuit-logic.md
  8. steps/07-disassembler-and-testing.md
  9. src/cpp/ — actual code.

Build & Run

cd src/cpp
cmake -S . -B build -G "Unix Makefiles"
cmake --build build -j
ctest --test-dir build --output-on-failure

Then disassemble a program:

echo 'let n = 10; print n * (n + 1) / 2;' | ./build/mlc

Outcomes

After reading the code and steps you can:

  • Design a bytecode instruction set from first principles, justifying every operand width.
  • Compile a typed AST to a flat, executable form using a single forward pass.
  • Encode if/else, while, and short-circuit &&/|| using only conditional jumps.
  • Resolve identifier references to stack slots (locals) vs hash lookups (globals).
  • Disassemble chunks for debugging and assert on the byte stream in unit tests.
  • Articulate the trade-offs between stack VMs (this) and register VMs (Lua, Dalvik).
  • Identify what's deferred to cp-07 (call frames, closures, CALL/RETURN, GC) and why each requires a runtime.

Limitations (revisited in cp-07)

  • No execution. We compile, we disassemble, we stop. The VM is cp-07's job.
  • No function bodies, calls, or return. Closures need call frames and upvalues — both runtime concepts.
  • Constants are capped at 256 per chunk (1-byte index). cp-07 will add CONSTANT_LONG with a 3-byte index for chunks that need more.
  • No source spans for error reporting beyond line numbers. cp-15 expands this.

Step 1 — Instruction Set Design

Goal

Design a bytecode instruction set for MiniLang that:

  1. Is easy to emit from the AST (one or two opcodes per node).
  2. Is easy to execute with a tight dispatch loop.
  3. Encodes operands compactly (typical instruction = 1–3 bytes).
  4. Leaves headroom for cp-07's call frames, closures, and upvalues.

The output of this step is opcode.hpp: a single enum and its opName() lookup.

Stack Machine vs Register Machine

Two dominant designs:

| | Stack machine | Register machine |
|---|---|---|
| Examples | JVM, .NET CLR, CPython, WASM | Lua 5+, Dalvik, V8 Ignition |
| Operand encoding | Implicit (top of stack) | Explicit (register indices) |
| Instruction count for a = b + c | 4 (GetB, GetC, Add, SetA) | 1 (Add a, b, c) |
| Compiler complexity | Low | High (register allocation) |
| Dispatches per expression | More (more, smaller ops) | Fewer (fewer, larger ops) |
| Total dispatch overhead | Comparable in practice | Comparable in practice |

We pick stack because (a) compilation is dead simple — every expression node "leaves its result on the stack", and (b) the visitor pattern compiles cleanly to stack opcodes (you don't need to thread destination registers through the visitor).
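The "leaves its result on the stack" rule can be seen in a few lines. This is a minimal sketch with a hypothetical mini-AST (Expr, var, bin) and string opcodes in place of the real Op enum; it is not the cp-06 Compiler.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mini-AST: a leaf names a variable, an inner node is a binary op.
struct Expr {
    std::string name;                 // leaf: variable name
    char op = 0;                      // inner node: '+' or '*'
    std::unique_ptr<Expr> lhs, rhs;
};

std::unique_ptr<Expr> var(std::string n) {
    auto e = std::make_unique<Expr>();
    e->name = std::move(n);
    return e;
}

std::unique_ptr<Expr> bin(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->op = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Post-order walk: operands first, operator last. Each call leaves exactly
// one value on the (imaginary) stack, so no destination has to be passed in.
void compile(const Expr& e, std::vector<std::string>& out) {
    if (e.op == 0) { out.push_back("Get " + e.name); return; }
    compile(*e.lhs, out);
    compile(*e.rhs, out);
    out.push_back(e.op == '+' ? "Add" : "Mul");
}
```

For a + b * c the walk produces Get a, Get b, Get c, Mul, Add. A register-machine compiler would need an extra parameter or return value on compile to say where each result lives.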

Byte-Aligned, Variable-Length Encoding

Each instruction is one opcode byte followed by zero or more inline operand bytes:

+--------+--------+--------+
| OPCODE | (operands…)     |
+--------+--------+--------+

Operand widths used in cp-06:

  • 1 byte for constant-pool indices (max 256 constants per chunk).
  • 1 byte for local-slot indices (max 256 locals per function).
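  • 2 bytes (big-endian) for jump offsets: an unsigned 16-bit distance, so a jump spans at most 65,535 bytes. Direction comes from the opcode (Jump and JumpIfFalse go forward, Loop goes backward).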

The trade-off: variable length means the disassembler has to know each opcode's operand size, but the byte stream is smaller than fixed-width 32-bit encoding (Lua's 4-byte instructions waste bytes on RETURN, POP, etc.).
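To put a number on the size difference, the sketch below totals the bytes for a short instruction sequence under both encodings. The Op subset and the helper names (operandBytes, packedSize, fixedSize) are illustrative assumptions, not the contents of opcode.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A subset of the cp-06 opcodes, enough to illustrate operand widths.
enum class Op { Constant, GetLocal, Add, Pop, Jump, JumpIfFalse, Loop, Return };

// Operand bytes that follow each opcode byte.
std::size_t operandBytes(Op op) {
    switch (op) {
        case Op::Constant:
        case Op::GetLocal:    return 1;   // pool index or slot index
        case Op::Jump:
        case Op::JumpIfFalse:
        case Op::Loop:        return 2;   // big-endian u16 offset
        default:              return 0;   // Add, Pop, Return, ...
    }
}

// Bytes needed by the variable-length encoding.
std::size_t packedSize(const std::vector<Op>& ops) {
    std::size_t n = 0;
    for (Op op : ops) n += 1 + operandBytes(op);
    return n;
}

// Bytes needed if every instruction occupied a fixed 32-bit word.
std::size_t fixedSize(const std::vector<Op>& ops) { return 4 * ops.size(); }
```

For the six-instruction sequence Constant, GetLocal, Add, JumpIfFalse, Pop, Return, the packed form takes 10 bytes and the fixed-width form takes 24.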

The cp-06 ISA at a glance

The opcodes break into seven groups, six functional and one reserved for cp-07:

Stack/literal:   Constant Nil True False Pop
Variables:       DefGlobal GetGlobal SetGlobal GetLocal SetLocal
Arithmetic:      Add Sub Mul Div Mod Neg
Logic/compare:   Not Eq Ne Lt Le Gt Ge
Control flow:    Jump JumpIfFalse Loop
I/O:             Print
Reserved (cp-07): Call Return Closure GetUpvalue SetUpvalue CloseUpvalue

Notice:

  • No dedicated And/Or. Short-circuit logic compiles to control flow (JumpIfFalse + Pop). See step 6.
  • Loop is an unconditional backward jump with an unsigned u16 operand. Direction is encoded in the opcode, not in the operand's sign: Jump and JumpIfFalse add their offset to ip, Loop subtracts it. Because bytecode is emitted front to back, a backward target is already known when Loop is emitted, so it needs no back-patching.
  • No typed arithmetic ops. Add works on both numbers (sum) and strings (concat). The type checker rejects mismatches it can see statically, and the runtime selects the behaviour from the operand tags, which it must do anyway for Any-typed operands. Other VMs (JVM with iadd/fadd/…) split by type for performance; we trade that for simplicity.
  • Reserved opcodes are recognised by the disassembler but, apart from the Return that terminates every chunk, never produced by the cp-06 compiler. cp-07 wires them up.

Why 32 Opcodes and Not 200?

Crafting Interpreters' Lox has ~30 opcodes. Python has ~120, Java has ~200. Each extra opcode is one more branch in your dispatch switch — and as opcodes proliferate, micro-ops dilute the hot ones. Modern VMs (V8) actually go the other direction: a few big "polymorphic" opcodes that internally specialise.

For learning, fewer opcodes means less to memorise. cp-08 (TAC IR) and cp-09 (SSA) re-explore the design space; cp-13 (MLIR) lets you define your own dialects.

Self-Check

After this step you should be able to:

  • Pick an instruction set design (stack vs register, variable vs fixed) and defend it.
  • Predict, for a source like a + b * c, the exact opcode sequence.
  • Explain why we don't have OpAnd.
  • List which opcodes are reserved and what runtime concept each needs.

Step 2 — The Chunk

A chunk is a self-contained compiled unit: every byte the VM will execute, every constant those bytes refer to, and the source-line metadata for diagnostics. In cp-06 there is exactly one chunk — the top-level script. cp-07 will add a chunk-per-function model.

See chunk.hpp.

Data Layout

struct Chunk {
    std::vector<uint8_t> code;       // flat byte stream
    std::vector<Value>   constants;  // referenced by 1-byte index
    std::vector<int>     lines;      // lines[i] = source line for code[i]
    std::string          name = "<script>";
};

Three vectors. code and lines are parallel (index-aligned); constants is indexed separately, by the operand of OpConstant:

code

The byte stream. Opcode and operand bytes are mixed: e.g., [CONSTANT, 0, ADD] is three bytes, two instructions. The VM advances ip by 1 for opcode then by N more for operands.

constants

A pool of Values. Literals in the source (42, "hello") are interned here. The compiler emits OpConstant ix where ix is the pool index. Up to 256 entries (1-byte index). Deduplication is by value-equality so print 1; print 1; uses one slot.

lines

Parallel array: lines[i] is the source line of byte code[i]. When the VM throws a runtime error, it consults lines[ip-1] to print "RuntimeError at line 17". The disassembler suppresses repeated line numbers visually so consecutive bytes on the same line read as a group (|).
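The parallel-array invariant is easiest to keep if there is exactly one way to append a byte. A minimal sketch, assuming a reduced MiniChunk with only the two index-aligned vectors; lineForError is a hypothetical helper name, while writeByte mirrors the method the compiler calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reduced Chunk: only the two index-aligned vectors.
struct MiniChunk {
    std::vector<std::uint8_t> code;
    std::vector<int>          lines;

    // The only way to append: both vectors grow together, so
    // code.size() == lines.size() holds after every call.
    void writeByte(std::uint8_t byte, int line) {
        code.push_back(byte);
        lines.push_back(line);
        assert(code.size() == lines.size());
    }

    // Source line of the instruction that was just fetched. `ip` already
    // points one past the opcode byte, hence the -1 on the error path.
    int lineForError(std::size_t ip) const {
        assert(ip >= 1 && ip <= lines.size());
        return lines[ip - 1];
    }
};
```

Operand bytes get a line entry too, which wastes a little space but keeps the lookup a single index operation.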

Why Parallel Vectors?

The alternative is a vector of Instruction { opcode; operand; line; } structs. That is simpler to index (instruction i is element i), but each struct takes 8+ bytes where the packed stream spends 1 to 3 bytes per instruction. For a typical chunk (hundreds to thousands of bytes), the byte stream fits several times more instructions into each cache line.

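Real VMs make the same choice: HotSpot interprets the JVM's packed, variable-length bytecode, and V8 Ignition also uses a byte stream (a one-byte opcode followed by operands, with Wide and ExtraWide prefix bytecodes to widen operands) dispatched through a table of generated handlers. Neither uses one-struct-per-instruction in production.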

addConstant and Deduplication

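uint8_t addConstant(const Value& v) {
    // Reuse an existing slot when an equal value is already in the pool.
    for (size_t i = 0; i < constants.size(); ++i)
        if (valuesEqual(constants[i], v)) return static_cast<uint8_t>(i);
    // Pool full: a 1-byte index cannot address slot 256. Return 255 and let
    // the compiler report "too many constants in chunk" (see Overflow).
    if (constants.size() >= 256) return 255;
    constants.push_back(v);
    return static_cast<uint8_t>(constants.size() - 1);
}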

O(n²) over a chunk's compilation but n is small (constants typically <50 per script). For real workloads you'd hash; we keep the linear scan for clarity and zero dependencies.
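For comparison, a hashed pool might look like the sketch below. It assumes a std::variant stand-in for Value (the real Value type is not shown here) and relies on the standard std::hash specialisation for std::variant, so it needs C++17; HashedPool is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Stand-in for cp-06's Value: nil, bool, number, or string.
using Value = std::variant<std::monostate, bool, double, std::string>;

struct HashedPool {
    std::vector<Value> constants;
    std::unordered_map<Value, std::uint8_t> index;   // value -> slot

    // Amortised O(1) per call instead of a linear scan.
    // The 256-entry overflow case is ignored here for brevity.
    std::uint8_t addConstant(const Value& v) {
        auto it = index.find(v);
        if (it != index.end()) return it->second;
        auto slot = static_cast<std::uint8_t>(constants.size());
        constants.push_back(v);
        index.emplace(v, slot);
        return slot;
    }
};
```

The cost is a second container that stores every constant twice, which is why the linear scan remains a reasonable default for pools of a few dozen entries.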

valuesEqual is structural: same kind, same payload. For strings this is == on the contained std::string. For functions (cp-07) we'll compare by pointer identity since two fn () {} declarations are different closures even with identical source.

Overflow

If constants.size() >= 256, addConstant returns 255 and the compiler emits a diagnostic ("too many constants in chunk"). cp-07 introduces OpConstantLong with a 3-byte (24-bit) operand to lift this to 16M.
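A 24-bit operand is three bytes written most significant first. The sketch below shows one way the encoding could work; writeU24 and readU24 are hypothetical helpers, and how cp-07 actually lays out OpConstantLong is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 24-bit index as three big-endian bytes (most significant first),
// the same byte order cp-06 uses for its u16 jump offsets.
void writeU24(std::vector<std::uint8_t>& code, std::uint32_t ix) {
    assert(ix <= 0xFFFFFF);                       // 2^24 - 1 = 16,777,215
    code.push_back(static_cast<std::uint8_t>((ix >> 16) & 0xff));
    code.push_back(static_cast<std::uint8_t>((ix >> 8) & 0xff));
    code.push_back(static_cast<std::uint8_t>(ix & 0xff));
}

// Read the three bytes back, starting at offset `at`.
std::uint32_t readU24(const std::vector<std::uint8_t>& code, std::size_t at) {
    assert(at + 3 <= code.size());
    return (std::uint32_t{code[at]} << 16) |
           (std::uint32_t{code[at + 1]} << 8) |
            std::uint32_t{code[at + 2]};
}
```

Index 300 (0x00012C) is stored as the bytes 0x00, 0x01, 0x2C.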

Lifetimes

The Chunk owns its constants by value. Strings are std::strings on the heap inside Value::s. cp-07 will introduce a GC for runtime-allocated strings (string concat results) but constant strings live for the chunk's lifetime — they're effectively read-only and could be interned across chunks in a future optimisation pass.

Self-Check

  • Why parallel code and lines arrays and not one array of structs?
  • How would you change Chunk to support more than 256 constants?
  • What invariant must hold between code.size() and lines.size()?

Step 3 — Emit Helpers and Back-Patching Jumps

The compiler is one long sequence of emit(byte, line) calls. Most are trivial; the interesting case is forward jumps, where you have to emit an instruction whose target you don't know yet.

See emit*, emitJump, patchJump, emitLoop in compiler.cpp.

Emit Primitives

void emit(Op op, int line);
void emit(uint8_t byte, int line);
void emitConstant(const Value& v, int line);

These all reduce to chunk_->writeByte (append + track line). emitConstant is one helper because every constant emission is [OpConstant, ix].

The Back-Patching Problem

Consider if (cond) thenBranch else elseBranch. We want bytecode roughly:

<cond>
JumpIfFalse  ELSE_START
Pop                       ; drop condition on then-path
<thenBranch>
Jump         END
ELSE_START:
Pop                       ; drop condition on else-path
<elseBranch>
END:

When we emit JumpIfFalse ELSE_START, we don't yet know what ELSE_START is — it depends on how big the <thenBranch> bytecode turns out to be. Same for Jump END.

Solution: emit a placeholder operand (0xff 0xff), remember its offset, and write the real value once the target is known.

size_t emitJump(Op op, int line) {
    emit(op, line);
    emit(0xff, line);
    emit(0xff, line);
    return chunk_->code.size() - 2;  // offset of the placeholder bytes
}

void patchJump(size_t at, int line) {
    size_t target = chunk_->code.size();
    size_t jumpFrom = at + 2;             // ip after the operand
    size_t off = target - jumpFrom;
    if (off > 0xffff) error(line, "branch too far");
    chunk_->code[at]     = (off >> 8) & 0xff;
    chunk_->code[at + 1] = off & 0xff;
}

Usage:

size_t thenJump = emitJump(Op::JumpIfFalse, line);
emit(Op::Pop, line);
visit(thenBranch);
size_t endJump = emitJump(Op::Jump, line);
patchJump(thenJump, line);
emit(Op::Pop, line);
visit(elseBranch);
patchJump(endJump, line);
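A self-contained run of the same two helpers makes the arithmetic concrete. This sketch uses a free-standing code vector in place of chunk_->code and made-up opcode values (kJumpIfFalse, kPop); jumpTarget is a hypothetical stand-in for what the VM does with the operand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> code;                   // stand-in for chunk_->code

constexpr std::uint8_t kJumpIfFalse = 0x10;       // hypothetical opcode values
constexpr std::uint8_t kPop         = 0x11;

std::size_t emitJump(std::uint8_t op) {
    code.push_back(op);
    code.push_back(0xff);                         // placeholder, high byte
    code.push_back(0xff);                         // placeholder, low byte
    return code.size() - 2;
}

void patchJump(std::size_t at) {
    std::size_t off = code.size() - (at + 2);     // measured from the byte after the operand
    assert(off <= 0xffff);
    code[at]     = static_cast<std::uint8_t>((off >> 8) & 0xff);
    code[at + 1] = static_cast<std::uint8_t>(off & 0xff);
}

// What the VM does after fetching the opcode at `opAt`: read the operand,
// step past it, then add the offset.
std::size_t jumpTarget(std::size_t opAt) {
    std::size_t off = (std::size_t{code[opAt + 1]} << 8) | code[opAt + 2];
    return opAt + 3 + off;
}
```

Emitting a jump at offset 0, then two Pop bytes, then patching yields an operand of 0x0002: the ip after the operand is 3, and 3 + 2 lands on offset 5, the first byte after the skipped code.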

Backward Jumps — Loop

Loops are easier because the target (loop start) was already emitted:

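void emitLoop(size_t loopStart, int line) {
    emit(Op::Loop, line);
    // Distance from the ip after the two operand bytes back to loopStart.
    size_t off = chunk_->code.size() - loopStart + 2;
    if (off > 0xffff) error(line, "loop body too large");
    emit(static_cast<uint8_t>((off >> 8) & 0xff), line);
    emit(static_cast<uint8_t>(off & 0xff), line);
}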

No patching needed — the offset is computed inline. The VM reads the unsigned operand and subtracts it from ip (which now points past the operand bytes).
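The + 2 in the offset is easy to get wrong, so here is the round trip worked through in a standalone sketch. The code vector, the opcode values, and loopTarget are illustrative assumptions; only the offset formula comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> code;
constexpr std::uint8_t kLoop = 0x20;              // hypothetical opcode value
constexpr std::uint8_t kNop  = 0x00;              // filler standing in for a loop body

void emitLoop(std::size_t loopStart) {
    code.push_back(kLoop);
    // +2 accounts for the two operand bytes that are about to be written.
    std::size_t off = code.size() - loopStart + 2;
    assert(off <= 0xffff);
    code.push_back(static_cast<std::uint8_t>((off >> 8) & 0xff));
    code.push_back(static_cast<std::uint8_t>(off & 0xff));
}

// VM side: `ip` points past the operand when the subtraction happens.
std::size_t loopTarget(std::size_t opAt) {
    std::size_t off = (std::size_t{code[opAt + 1]} << 8) | code[opAt + 2];
    std::size_t ip = opAt + 3;
    return ip - off;
}
```

With loopStart at offset 1 and a four-byte body, the Loop opcode sits at offset 5 and its operand is 7: the ip after the operand is 8, and 8 - 7 returns to offset 1.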

Why Two-Byte Offsets?

A u16 gives up to 65,535 bytes (just under 64 KB) of branch range from any jump site, forward for Jump and JumpIfFalse and backward for Loop. Real programs in MiniLang rarely have functions larger than a few KB. If we hit the limit (the compiler emits "branch too far"), cp-07 will either:

  • introduce a JumpLong with a u24/u32 operand, or
  • bounce through a trampoline (emit a short jump to an intermediate JumpLong that does the real work).

JVM uses goto_w for the same reason: long jumps are an opcode flavour, not a switch.

Sentinel Bytes — 0xff 0xff

Why fill placeholders with 0xff 0xff rather than 0x00 0x00? It's a defensive habit: if we forget to patchJump, the VM reads a 65535-byte jump that almost certainly lands outside the chunk and fails loudly, whereas a zero offset would silently fall through to the next instruction. A bounds check on jump targets in the VM or the disassembler makes that failure immediate.

Self-Check

  • For an if with no else, the lowering still emits both jumps. What would go wrong if you dropped the Jump over the else-path?
  • Why do we Pop twice in the if/else lowering (once on each branch)?
  • Could emitJump return a pointer into code (a uint8_t*) instead of a size_t offset? What problem would that cause?

Step 4 — Locals vs Globals

The compiler resolves every variable reference to one of two operations:

  • GET_GLOBAL / SET_GLOBAL / DEF_GLOBAL — a hash lookup at runtime, keyed by the name string in the constants pool. Used for any binding declared at the top level.
  • GET_LOCAL / SET_LOCAL — a direct stack-slot fetch. Used for any binding declared inside a block.

The split mirrors Lox, CPython, and Lua. Globals are slow-but-flexible (you can monkey-patch them, late-bind them, redefine them); locals are fast-but-strict (their slot is baked into the bytecode at compile time).

Tracking Locals at Compile Time

The compiler keeps a flat std::vector<Local> parallel to the runtime stack layout:

struct Local {
    std::string name;
    int         depth;       // scope depth at declaration
    bool        isConst;     // `let` is true, `var` is false
};
std::vector<Local> locals_;
int scopeDepth_ = 0;

The index in locals_ is the stack slot. When the compiler emits OpGetLocal n, the runtime will compute frame.slots[n] and push that.

Entering and leaving scope

void beginScope() { ++scopeDepth_; }

void endScope(int line) {
    while (!locals_.empty() && locals_.back().depth >= scopeDepth_) {
        emit(Op::Pop, line);
        locals_.pop_back();
    }
    --scopeDepth_;
}

Every local that leaves the scope must be popped off the runtime stack, so we emit one Op::Pop per local being removed. The compiler's local table is the source of truth for runtime stack layout.

Declaring

void addLocal(const std::string& name, bool isConst, int line) {
    for (int i = locals_.size() - 1; i >= 0 && locals_[i].depth == scopeDepth_; --i) {
        if (locals_[i].name == name)
            error(line, "variable '" + name + "' already declared in this scope");
    }
    locals_.push_back({name, scopeDepth_, isConst});
}

Same-scope redeclaration is forbidden. Cross-scope shadowing is allowed — the resolver in cp-04 already enforced this, but the compiler double-checks because it's the one assigning slots.

Resolving

int resolveLocal(const std::string& name) const {
    for (int i = locals_.size() - 1; i >= 0; --i)
        if (locals_[i].name == name) return i;
    return -1;
}

We walk backwards so inner shadowing wins (the most recently declared x is the one in scope).
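The three operations fit together as follows. This is a reduced sketch: Scopes is a hypothetical struct, the isConst flag and the redeclaration check are omitted, and a counter stands in for emit(Op::Pop, line).

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Local { std::string name; int depth; };

struct Scopes {
    std::vector<Local> locals;   // index == runtime stack slot
    int depth = 0;
    int popsEmitted = 0;         // stands in for emit(Op::Pop, line)

    void beginScope() { ++depth; }
    void endScope() {
        while (!locals.empty() && locals.back().depth >= depth) {
            ++popsEmitted;
            locals.pop_back();
        }
        --depth;
    }
    void addLocal(const std::string& name) { locals.push_back({name, depth}); }
    int resolveLocal(const std::string& name) const {
        for (int i = static_cast<int>(locals.size()) - 1; i >= 0; --i)
            if (locals[i].name == name) return i;
        return -1;                // not a local: treat as a global
    }
};
```

For { let x; { let x; let y; } }, the inner x resolves to slot 1 while the inner block is open and to slot 0 once it closes, and closing the inner block emits two pops.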

How Globals Are Encoded

void Compiler::visit(IdentExpr& e) {
    int slot = resolveLocal(e.name);
    if (slot >= 0) {
        emit(Op::GetLocal, e.line);
        emit(static_cast<uint8_t>(slot), e.line);
    } else {
        uint8_t ix = makeConstant(Value::makeString(e.name), e.line);
        emit(Op::GetGlobal, e.line);
        emit(ix, e.line);
    }
}

The global name ("x", "foo") is interned into the constants pool. At runtime, cp-07's VM will do globals[constants[ix].s] — a hash lookup. Note the same name string is reused for DEF_GLOBAL / GET_GLOBAL / SET_GLOBAL thanks to addConstant deduplication.
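On the runtime side, the two opcodes reduce to a map insert and a map lookup. The sketch below is a guess at the shape cp-07's VM might take, not its actual code: MiniVM, defGlobal, and getGlobal are hypothetical, values are plain doubles, and the constants pool holds only names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct MiniVM {
    std::vector<std::string> names;                      // constants pool, names only
    std::unordered_map<std::string, double> globals;     // runtime table
    std::vector<double> stack;

    // DEF_GLOBAL ix: pop the initialiser and bind it to names[ix].
    void defGlobal(std::uint8_t ix) {
        globals[names[ix]] = stack.back();
        stack.pop_back();
    }

    // GET_GLOBAL ix: hash lookup by name, push the value.
    void getGlobal(std::uint8_t ix) {
        auto it = globals.find(names[ix]);
        if (it == globals.end())
            throw std::runtime_error("undefined variable '" + names[ix] + "'");
        stack.push_back(it->second);
    }
};
```

The lookup can fail at runtime, which is the price of late binding: a local slot always exists once the compiler has assigned it, while a global name may never have been defined.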

Init Expression and the Stack

void Compiler::visit(LetStmt& s) {
    // Evaluate the initialiser (or push nil) so its value is on top of the stack.
    if (s.init) s.init->accept(*this); else emit(Op::Nil, s.line);
    if (scopeDepth_ == 0) {
        uint8_t ix = makeConstant(Value::makeString(s.name), s.line);
        emit(Op::DefGlobal, s.line); emit(ix, s.line);
    } else {
        addLocal(s.name, s.kind == DeclKind::Let, s.line);
        // Init value is already on top of the stack: that is our local slot.
    }
}

A subtle invariant: for a local, the init expression leaves the value on the stack and we just say "from now on, that stack slot is named s.name". No OpDefLocal opcode is needed — the local exists implicitly the moment we record it in locals_.

Why Globals Survive endScope

endScope pops every local with depth >= scopeDepth_. Globals are declared at depth 0, but they are never entered in locals_ at all. Their initialiser is consumed by OpDefGlobal, which pops the value off the stack and stores it in the runtime hash table. They survive scope exit because they live somewhere the scope-exit pop loop doesn't touch.

Self-Check

  • Why do we resolve locals back-to-front?
  • What runtime data structure does DEF_GLOBAL write to (refer ahead to cp-07)?
  • If you wanted to allow let x = x; legally (reading the outer x to init the inner one), where would you change the compiler?
  • How would you add OpGetLocalLong to support more than 256 locals?

Step 5 — Control Flow (if / while)

All conditional execution in MiniLang's bytecode is achieved with three opcodes:

  • JumpIfFalse off — if !truthy(peek()), advance ip by off. Value remains on the stack.
  • Jump off — unconditional forward branch.
  • Loop off — unconditional backward branch (subtracts off from ip).

Pop is the other star of the show — without it, the runtime stack would slowly fill with leftover conditions.

Lowering if (cond) then [else other]

  <cond>
  JumpIfFalse  L_ELSE
  Pop                  ; drop condition on the then-path
  <then>
  Jump         L_END
L_ELSE:
  Pop                  ; drop condition on the else-path
  <other>              ; (omitted if no else, but the Pop still emits)
L_END:

Two pops, one per branch. JumpIfFalse leaves the condition on the stack. For if alone a popping variant would work just as well, but short-circuit && and || (step 6) need the value to survive the jump, so a single non-popping opcode serves both uses. Each branch is then responsible for popping the condition once it's chosen.

If the else branch is missing, the structure simplifies but the second Pop still has to run — otherwise the runtime stack drifts upward by one each time we hit a "no-else" if.

In code (compiler.cpp):

void Compiler::visit(IfStmt& s) {
    s.cond->accept(*this);
    size_t thenJump = emitJump(Op::JumpIfFalse, s.line);
    emit(Op::Pop, s.line);
    s.thenBranch->accept(*this);
    size_t elseJump = emitJump(Op::Jump, s.line);
    patchJump(thenJump, s.line);
    emit(Op::Pop, s.line);
    if (s.elseBranch) s.elseBranch->accept(*this);
    patchJump(elseJump, s.line);
}

Lowering while (cond) body

L_START:
  <cond>
  JumpIfFalse  L_EXIT
  Pop                  ; drop condition on the body-path
  <body>
  Loop         L_START
L_EXIT:
  Pop                  ; drop condition on the exit-path

void Compiler::visit(WhileStmt& s) {
    size_t loopStart = chunk_->code.size();
    s.cond->accept(*this);
    size_t exitJump = emitJump(Op::JumpIfFalse, s.line);
    emit(Op::Pop, s.line);
    s.body->accept(*this);
    emitLoop(loopStart, s.line);
    patchJump(exitJump, s.line);
    emit(Op::Pop, s.line);
}

Note the same pop-on-both-branches dance. The Loop instruction takes an unsigned 16-bit operand and the VM does ip -= off. Because we know loopStart upfront, no back-patching is needed.

Truthiness

JumpIfFalse consults Value::isTruthy():

bool Value::isTruthy() const {
    switch (kind) {
        case K::Nil:  return false;
        case K::Bool: return b;
        default:      return true;       // numbers (including 0!), strings, fns
    }
}

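This matches Lua and Ruby: only nil and false are falsy. 0 is truthy, "" is truthy. Other languages draw the line elsewhere: Python treats zero and empty containers as falsy, while JavaScript treats 0, NaN, and "" as falsy but keeps empty arrays and objects truthy. These are different design choices, and each is coherent.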

Nested Control Flow

Each if/while is independent — its own jumps, its own pops. Nesting Just Works because every visit method maintains the "expression leaves exactly one value on the stack" / "statement is stack-neutral" invariants.

For example, if (a) { while (b) { ... } } compiles to a while lowering nested inside the then-branch of the if. The if's Pop (then-path) drops a; the while's pops drop b. No interference.

break and continue?

Not yet — MiniLang has no break/continue keywords. Adding them is a fun exercise:

  • Compile-time: keep a stack of vector<size_t> breakSites per active loop. break does emitJump(Op::Jump) and pushes the placeholder offset. On loop exit, patch all of them to point to the post-loop L_EXIT.
  • continue jumps to L_START (a backward Loop, computed inline).

cp-15 adds these.
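The break half of the exercise could be sketched like this. Everything here is hypothetical (LoopCompiler, beginLoop, compileBreak, endLoop, and the opcode value); it reuses the emitJump and patchJump logic from step 3 on a bare byte vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kJump = 0x01;              // hypothetical opcode value

struct LoopCompiler {
    std::vector<std::uint8_t> code;
    // One entry per active loop; each holds the operand offsets of its pending breaks.
    std::vector<std::vector<std::size_t>> breakSites;

    std::size_t emitJump() {
        code.push_back(kJump);
        code.push_back(0xff);
        code.push_back(0xff);
        return code.size() - 2;
    }
    void patchJump(std::size_t at) {
        std::size_t off = code.size() - (at + 2);
        assert(off <= 0xffff);
        code[at]     = static_cast<std::uint8_t>(off >> 8);
        code[at + 1] = static_cast<std::uint8_t>(off & 0xff);
    }

    void beginLoop() { breakSites.emplace_back(); }
    void compileBreak() {
        assert(!breakSites.empty() && "break outside of a loop");
        breakSites.back().push_back(emitJump());
    }
    // Call once the loop's exit code has been emitted: every break lands here.
    void endLoop() {
        for (std::size_t at : breakSites.back()) patchJump(at);
        breakSites.pop_back();
    }
};
```

Keeping one vector per loop lets nested loops patch only their own breaks. A full implementation would also have to pop any locals declared inside the loop body before jumping out, which this sketch leaves out.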

Self-Check

  • Why does if produce two Pop opcodes but a single <cond>?
  • What happens if you remove the post-loop Pop from while's lowering?
  • How does JumpIfFalse differ from a hypothetical BranchIfFalse that pops the value? Which would you prefer for short-circuit logic in step 6?

Step 6 — Short-Circuit Logic

&& and || are not arithmetic operators — they have to avoid evaluating the right-hand side when the left settles the answer. That's "short-circuiting", and it's expressed entirely with jumps.

a && b

  <a>
  JumpIfFalse  L_END
  Pop                  ; drop the (truthy) <a>
  <b>
L_END:                 ; either <a> (if it was falsy) or <b> sits on the stack

If a is falsy, we jump over both the Pop and <b>, and the falsy value of a remains as the result, which is exactly what a && b should return when a is falsy. If a is truthy, we pop it and evaluate b, so b is the result.

void Compiler::visit(LogicalExpr& e) {
    e.lhs->accept(*this);
    if (e.op == TokenKind::AmpAmp) {
        size_t endJump = emitJump(Op::JumpIfFalse, e.line);
        emit(Op::Pop, e.line);
        e.rhs->accept(*this);
        patchJump(endJump, e.line);
    } else if (e.op == TokenKind::PipePipe) {
        // see below
    }
}

a || b

  <a>
  JumpIfFalse  L_RHS
  Jump         L_END   ; <a> was truthy; keep it as result
L_RHS:
  Pop                  ; drop the (falsy) <a>
  <b>
L_END:

If a is truthy, we jump straight to L_END leaving a on the stack. If a is falsy, we pop it and evaluate b so b becomes the result.

size_t elseJump = emitJump(Op::JumpIfFalse, e.line);
size_t endJump  = emitJump(Op::Jump, e.line);
patchJump(elseJump, e.line);
emit(Op::Pop, e.line);
e.rhs->accept(*this);
patchJump(endJump, e.line);
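Since cp-06 has no VM, the two lowerings can be checked with a throwaway interpreter. The sketch below is not cp-07's VM: the opcode numbering, the std::variant stand-in for Value, and the run function are assumptions made for illustration, and the bytecode is assembled by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

using Value = std::variant<std::monostate, bool, double>;   // nil, bool, number

enum Op : std::uint8_t { Constant, Nil, True, False, Pop, Jump, JumpIfFalse, Return };

// Only nil and false are falsy (0 is truthy).
bool isTruthy(const Value& v) {
    if (std::holds_alternative<std::monostate>(v)) return false;
    if (auto b = std::get_if<bool>(&v)) return *b;
    return true;
}

// Runs until Return and hands back the whole stack so callers can check its depth.
std::vector<Value> run(const std::vector<std::uint8_t>& code,
                       const std::vector<Value>& constants) {
    std::vector<Value> stack;
    std::size_t ip = 0;
    auto readU16 = [&] {
        std::size_t off = (std::size_t{code[ip]} << 8) | code[ip + 1];
        ip += 2;
        return off;
    };
    for (;;) {
        switch (code[ip++]) {
            case Constant: stack.push_back(constants[code[ip++]]); break;
            case Nil:      stack.push_back(std::monostate{});       break;
            case True:     stack.push_back(true);                   break;
            case False:    stack.push_back(false);                  break;
            case Pop:      stack.pop_back();                        break;
            case Jump:     { std::size_t off = readU16(); ip += off; break; }
            case JumpIfFalse: {
                std::size_t off = readU16();
                if (!isTruthy(stack.back())) ip += off;   // condition stays on the stack
                break;
            }
            case Return:   return stack;
            default:       assert(false && "unknown opcode"); return stack;
        }
    }
}
```

The a || b lowering assembles to Constant 0, JumpIfFalse +3, Jump +3, Pop, Constant 1, Return, and a && b to Constant 0, JumpIfFalse +3, Pop, Constant 1, Return. On every path exactly one value is left on the stack, which is the expression invariant the compiler depends on.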

Why JumpIfFalse Doesn't Pop

If JumpIfFalse popped its condition, encoding &&/|| would need a Dup opcode (push another copy first) to preserve the value as the result. By keeping JumpIfFalse non-popping and emitting an explicit Pop on the "consume-the-condition" branch only, we save the Dup and one byte per logical operator.

Lox's compiler in Crafting Interpreters makes the same call.

Truthiness Reminder

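Because JumpIfFalse uses Value::isTruthy and leaves its operand on the stack, && and || return one of their operands rather than a Bool (as in Lua and JavaScript), with truthiness decided by MiniLang's rules: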

print 0 || 5;       // 0  (since 0 is truthy in MiniLang — Lua semantics)
print nil || "hi";  // "hi"
print false || 42;  // 42
print 3 && 4;       // 4

To get &&/|| that always produce a Bool you'd append Op::Not Op::Not to coerce the result, at the cost of testing truthiness a second time.

Verifying the Compilation

The unit tests assert the exact opcode sequence:

void test_short_circuit_and() {
    auto out = compileSource("print false && true;");
    CHECK(opsMatch(out.chunk, {
        Op::False, Op::JumpIfFalse, Op::Pop, Op::True,
        Op::Print, Op::Return
    }));
}

This is exactly the lowering we sketched.

Self-Check

  • Why don't we need a Dup opcode given how JumpIfFalse is defined?
  • Walk through the compiled bytecode for a && b || c. (Hint: || has the lowest precedence; && binds tighter.)
  • How would you implement ?? (null-coalescing, "use rhs if lhs is nil else lhs")?

Step 7 — Disassembler and Testing

Without a VM yet (cp-07's job), the only way to know the compiler is right is to read the bytecode it produces. The disassembler is therefore both a debugging tool and the primary test surface.

See disassembler.cpp and tests/test_compiler.cpp.

Output Format

== <script> ==
0000     1  CONSTANT          0   ; 10
0002     |  DEF_GLOBAL        1   ; n
0004     |  GET_GLOBAL        1   ; n
0006     |  CONSTANT          2   ; 1
0008     |  ADD
0009     |  PRINT
000a     |  RETURN

Per line:

  • 4-hex-digit byte offset.
  • Source line number, or | if same as previous (visual grouping).
  • Opcode name.
  • Operand (right-aligned), with a ; comment showing the resolved constant or scope info.

The format owes everything to Crafting Interpreters' Lox disassembler.

The Dispatcher

Each opcode falls into one of four "shapes":

case Op::Constant:     consumed = constantInstr("CONSTANT", chunk, offset, os); break;
case Op::Pop:          consumed = simple("POP", os); break;
case Op::GetLocal:     consumed = byteInstr("GET_LOCAL", chunk, offset, os); break;
case Op::Jump:         consumed = jumpInstr("JUMP",          +1, chunk, offset, os); break;
case Op::Loop:         consumed = jumpInstr("LOOP",          -1, chunk, offset, os); break;

  • simple — opcode only, advances 1 byte.
  • byteInstr — opcode + 1 operand byte, advances 2.
  • constantInstr — opcode + 1 operand byte (constants index), advances 2, resolves and prints the value.
  • jumpInstr — opcode + 2 operand bytes (big-endian), advances 3, computes and prints the target offset.

disassembleInstruction returns consumed so the outer loop knows how many bytes to skip. This is the same shape your VM dispatch loop will have (cp-07) — the disassembler and VM are isomorphic structurally; the VM swaps "print this" for "execute this".
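The outer loop and three of the shapes fit in a short standalone sketch. It works on a bare byte vector instead of a Chunk, prints a simplified format without offsets or line numbers, and uses a reduced Op enum, so treat the names and output layout as assumptions and not as the contents of disassembler.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

enum Op : std::uint8_t { Constant, Pop, GetLocal, Jump, Loop, Return };

// Prints one instruction and returns how many bytes it occupied.
std::size_t disassembleInstruction(const std::vector<std::uint8_t>& code,
                                   std::size_t offset, std::ostream& os) {
    auto byteInstr = [&](const char* name) -> std::size_t {
        os << name << ' ' << int(code[offset + 1]) << '\n';
        return 2;
    };
    auto jumpInstr = [&](const char* name, int sign) -> std::size_t {
        std::size_t off = (std::size_t{code[offset + 1]} << 8) | code[offset + 2];
        std::size_t next = offset + 3;              // ip after the operand
        os << name << " -> " << (sign > 0 ? next + off : next - off) << '\n';
        return 3;
    };
    switch (code[offset]) {
        case Constant: return byteInstr("CONSTANT");   // the real one also prints the value
        case GetLocal: return byteInstr("GET_LOCAL");
        case Jump:     return jumpInstr("JUMP", +1);
        case Loop:     return jumpInstr("LOOP", -1);
        case Pop:      os << "POP\n";     return 1;
        case Return:   os << "RETURN\n";  return 1;
        default:       os << "UNKNOWN\n"; return 1;
    }
}

std::string disassemble(const std::vector<std::uint8_t>& code) {
    std::ostringstream os;
    for (std::size_t offset = 0; offset < code.size();)
        offset += disassembleInstruction(code, offset, os);
    return os.str();
}
```

The loop has no increment of its own: each instruction reports its own length, so adding an opcode with a new operand width only touches the switch.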

Test Strategy

Two complementary approaches:

(a) Exact opcode sequence

For short programs where the lowering is fully predictable:

auto out = compileSource("print 1 + 2 * 3;");
CHECK(opsMatch(out.chunk,
    {Op::Constant, Op::Constant, Op::Constant, Op::Mul, Op::Add,
     Op::Print, Op::Return}));

opsMatch walks the byte stream, extracts only the opcodes (skipping operand bytes by knowing the size of each opcode), and compares the resulting vector<Op> to your expected list. It's robust to operand-value churn — if Op::Constant's constant-pool slot changes, the test still passes; only the opcode shape matters.
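
A sketch of how such a helper might be written, assuming a reduced opcode set and a raw byte vector in place of Chunk. instrSize and extractOps are hypothetical names, and the real opsMatch in tests/test_compiler.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reduced opcode set for the example.
enum class Op : uint8_t { Constant, Add, Mul, Print, Jump, Return };

// Bytes occupied by one instruction: the opcode plus its operands.
size_t instrSize(Op op) {
    switch (op) {
        case Op::Constant: return 2;   // 1-byte constant-pool index
        case Op::Jump:     return 3;   // 2-byte offset
        default:           return 1;   // no operands
    }
}

// Collects opcodes only, stepping over operand bytes.
std::vector<Op> extractOps(const std::vector<uint8_t>& code) {
    std::vector<Op> ops;
    for (size_t i = 0; i < code.size();) {
        Op op = static_cast<Op>(code[i]);
        ops.push_back(op);
        i += instrSize(op);
    }
    return ops;
}

bool opsMatch(const std::vector<uint8_t>& code, const std::vector<Op>& expected) {
    return extractOps(code) == expected;
}
```

Because operand bytes are skipped rather than compared, renumbering the constant pool leaves the result unchanged, while adding, removing, or reordering an opcode changes it.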

(b) Substring match on the disassembly

For control flow where exact jump offsets are noisy but landmark opcodes matter:

CHECK_CONTAINS(out.disasm, "LOOP");
CHECK_CONTAINS(out.disasm, "JUMP_IF_FALSE");

Use this when the presence of an opcode is the assertion, not the exact byte sequence.

Negative tests

auto out = compileSource("{ let a = 1; a = 2; }");
CHECK(!out.compiledOk);
CHECK(/* "immutable" appears in some diagnostic */);

The compiler collects diagnostics without throwing, so tests can verify both the failure and the message text.

What the Tests Cover

  • Arithmetic / unary / logical operators emit the right opcodes in the right order.
  • Globals get DEF_GLOBAL/GET_GLOBAL/SET_GLOBAL; locals get GET_LOCAL/SET_LOCAL.
  • Block scope correctly emits per-local Pop on exit.
  • let immutability is enforced.
  • if/else and while emit Jump/JumpIfFalse/Loop correctly with paired Pops.
  • String constants are deduplicated in the pool.
  • The line table has the same length as the code stream and contains real source line numbers.
  • fn declarations and calls emit clear "deferred to cp-07" diagnostics.
  • Same-scope local redeclaration errors at resolve or compile time.
  • Short-circuit && lowers exactly as expected.

That's 15 tests covering all currently-supported features and the principled subset of unsupported ones.

Using the Disassembler at the REPL

echo 'var x = 0; while (x < 3) { x = x + 1; print x; }' | ./build/mlc

Read the output as a sanity check on any compiler change you make. cp-07's VM will replay these same bytes.

Self-Check

  • What does a typical disassembled IfStmt look like? Predict the line count.
  • Why is opsMatch (a) preferable to a string compare on the disassembly for short programs?
  • How would you extend the disassembler to print stack-effect estimates next to each opcode?

cp-07 — Stack-VM Execution Engine

Status: ✅ Built — mlr runs MiniLang programs end-to-end on the stack VM.

This lab takes the bytecode Chunk produced by cp-06 and gives it a runtime: an operand stack, call frames, globals, and a dispatch loop. After cp-07 you have a real compiler + virtual machine pair — source goes in, output comes out, no AST walking at runtime.

Pipeline

source ──► Lexer ──► Parser ──► Resolver ──► TypeChecker ──► Compiler ──► Function(script) ──► VM
                                                                                 │
                                                                          (with --dump:
                                                                           Disassembler)

The compiler emits a single top-level Function named <script> whose chunk is just the program's bytecode. The VM bootstraps execution by calling that function on an empty stack. Every subsequent function call follows the same mechanism — the script is just function #0.

What You'll Build

  • A Value tagged union with a new Fn case that carries a shared_ptr<Function>.
  • A Function record: name, arity, owned Chunk.
  • The VM itself:
    • operand stack (std::vector<Value>),
    • call-frame stack (std::vector<CallFrame>),
    • globals table (std::unordered_map<string,Value>),
    • a big switch-based dispatch loop.
  • A Compiler that nests one FunctionState per function being compiled, so fn foo() { … } compiles to a fresh chunk with its own locals.
  • A driver mlr that compiles then runs.

Reading Order

  1. CONCEPTS.md — the why: stack vs registers, dispatch cost, call frames, why closures need a separate mechanism.
  2. steps/ — implementation walkthrough:
    1. Operand stack, values, dispatch loop
    2. Call frames and slot-base addressing
    3. Compiling functions as nested compilers
    4. Globals hash, locals slots, name resolution at runtime
    5. Control flow on a stack machine (jumps & loops)
    6. Closures and upvalues — sketch, deferred
    7. Runtime errors and stack traces
  3. src/cpp/ — the implementation.

Build & Run

cd src/cpp
cmake -S . -B build -G "Unix Makefiles"
cmake --build build -j

# Run a program from a file
./build/mlr program.ml

# …or pipe it
echo 'print 1 + 2 * 3;' | ./build/mlr

# Disassemble the script chunk before running
echo 'fn add(a, b) { return a + b; } print add(3, 4);' | ./build/mlr --dump

# Run tests
ctest --test-dir build --output-on-failure

Status

  • ✅ 19 VM tests, 41 sub-checks, 100% green.
  • ✅ Arithmetic, locals, globals, control flow, functions, recursion.
  • ✅ Compile-time errors (typecheck) flagged; runtime errors print line + message.
  • Closures over outer locals are deliberately rejected at compile time. Capturing globals works fine; the upvalue machinery lands in cp-12 when we build the JIT.

Prereqs

  • cp-06 complete (chunks, opcodes, compiler).

Outcomes

  • Run a compiled MiniLang program with no AST present at runtime.
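  • Explain why CPython and the JVM share the same architectural sketch (operand stack + frames + dispatch loop + constant pool), and why register-based interpreters such as Lua 5.x and V8 Ignition keep the frames, dispatch loop, and constant pool but replace the operand stack with virtual registers.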
  • Diagnose and reason about stack-balance bugs — the single biggest source of hard-to-debug VM crashes.

Step 1 — Operand Stack, Values, and the Dispatch Loop

Goal

Build the minimum VM that can execute a flat sequence of opcodes acting on a stack of Values. At the end of this step:

push(Constant 3); push(Constant 4); Add; Print;

prints 7.

The Operand Stack

A stack virtual machine evaluates expressions by pushing intermediates to a LIFO buffer and popping them when consumed. This mirrors how a tree-walker recurses, but flattens the call sequence into a linear instruction stream.

class VM {
    std::vector<Value> stack_;
    void push(Value v)     { stack_.push_back(std::move(v)); }
    Value pop()            { Value v = stack_.back(); stack_.pop_back(); return v; }
    Value& peek(size_t off=0) { return stack_[stack_.size()-1-off]; }
};

Important properties:

  • The stack is vector<Value> (not array<Value, 256>). Real VMs pre-allocate, but vector is simpler and correct.
  • Every instruction has a known stack effect — e.g. Add is -2 / +1. The compiler tracks this implicitly; bugs here manifest as crashes deep in unrelated code because the wrong value is on top.

The Value Union

A VM Value is a tagged sum, NOT a pointer. Cache-friendly, no allocation for primitives:

enum class ValueKind : uint8_t { Nil, Bool, Number, Str, Fn };

struct Value {
    ValueKind kind;
    union {
        bool   b;
        double n;
    };
    std::string s;     // outside union — has a non-trivial destructor
    FunctionPtr fn;    // shared_ptr — Fn case
};

Aside — why std::string outside the union? C++ unions can hold non-trivial members but you must placement-new and manually destruct them. That's a real win for production VMs (every cache miss matters) but adds 40 lines of error-prone code. cp-07 chooses clarity; cp-12 swaps in a tagged 64-bit NaN-boxed payload for the JIT.

The Dispatch Loop

The dispatch loop is the single hottest piece of code in any interpreter. It reads an opcode, jumps to its handler, executes, repeats:

for (;;) {
    auto op = Op(readByte());
    switch (op) {
        case Op::Constant: push(readConstant()); break;
        case Op::Add: {
            Value b = pop(), a = pop();
            push(Value::makeNum(a.n + b.n));
            break;
        }
        case Op::Print: {
            (*out) << pop().toString() << "\n";
            break;
        }
        case Op::Return: return Status::Ok;
        // …
    }
}

Why a switch?

GCC and Clang typically compile a dense switch over a uint8_t to a jump table: a bounds check plus one shared indirect branch per dispatch. In portable C++ it is hard to beat.

What about computed goto?

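Computed goto (&&Op_Add labels in a static const void* table[256]) reduces branch-predictor pressure by ending each handler with its own indirect jump, so each opcode's "what comes next?" gets a separate prediction instead of all opcodes sharing one. Reported speedups are commonly in the 15–30% range, though the gain depends heavily on the CPU's indirect-branch predictor. The code relies on a GNU extension (supported by GCC and Clang, not MSVC) and is harder to read. We use switch here and revisit computed goto when we profile in cp-12.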
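
For orientation, here is a minimal sketch of the computed-goto shape. It assumes a three-opcode toy instruction set over plain ints (OP_PUSH, OP_ADD, OP_HALT and runComputedGoto are hypothetical names) and compiles only with GCC or Clang, since it uses the GNU labels-as-values extension. It is not the cp-12 dispatcher.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy instruction set for the example only.
enum Op : uint8_t { OP_PUSH, OP_ADD, OP_HALT };

// Runs `code` and returns the value on top of the stack at OP_HALT.
// The input is assumed to be well-formed: an unknown opcode would index
// past the table, which is undefined behavior.
int runComputedGoto(const std::vector<uint8_t>& code) {
    // One label address per opcode, indexed by opcode value.
    static const void* table[] = { &&do_push, &&do_add, &&do_halt };
    std::vector<int> stack;
    const uint8_t* ip = code.data();

    // Every handler ends with its own indirect jump.
    #define DISPATCH() goto *table[*ip++]
    DISPATCH();

do_push:
    stack.push_back(*ip++);          // 1-byte immediate operand
    DISPATCH();
do_add: {
    assert(stack.size() >= 2);
    int b = stack.back(); stack.pop_back();
    int a = stack.back(); stack.pop_back();
    stack.push_back(a + b);
    DISPATCH();
}
do_halt:
    #undef DISPATCH
    return stack.empty() ? 0 : stack.back();
}
```

The difference from the switch version is that there is no single shared dispatch point: each DISPATCH() expands to a separate indirect jump at the end of its handler.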

Reading Operands

Each opcode may consume immediate bytes (operands) from the instruction stream:

uint8_t  readByte()      { return *ip_++; }
uint16_t readShort()     { uint16_t hi = *ip_++; uint16_t lo = *ip_++; return (hi<<8)|lo; }
Value    readConstant()  { return chunk()->constants[readByte()]; }

The instruction pointer (ip_) is a const uint8_t* into the current chunk's code buffer. Advancing it past the buffer end is undefined behavior — the compiler is responsible for terminating every code path with Return.

Stack-Balance Discipline

Every opcode handler in cp-07 obeys a stack discipline:

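| Opcode      | Pops | Pushes                                 |
|-------------|------|----------------------------------------|
| Constant    | 0    | 1                                      |
| Add/Sub/... | 2    | 1                                      |
| Neg/Not     | 1    | 1                                      |
| Print       | 1    | 0                                      |
| Pop         | 1    | 0                                      |
| Jump        | 0    | 0                                      |
| JumpIfFalse | 0    | 0                                      |
| Call N      | N+1  | 1                                      |
| Return      | 1    | 0 (then push result into caller frame) |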

When you introduce a new opcode, document its stack effect first; the compiler tracks scope depth and locals based on this contract.
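
One way to make the contract checkable is to encode the table as data and simulate only the stack depth. The sketch below does that for straight-line code; StackEffect, effectOf, and finalDepth are hypothetical names, and the opcode set is a reduced stand-in for the lab's.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reduced opcode set for the example.
enum class Op : uint8_t { Constant, Add, Neg, Print, Pop };

struct StackEffect { int pops; int pushes; };

// The stack-effect table above, as code.
StackEffect effectOf(Op op) {
    switch (op) {
        case Op::Constant: return {0, 1};
        case Op::Add:      return {2, 1};
        case Op::Neg:      return {1, 1};
        case Op::Print:    return {1, 0};
        case Op::Pop:      return {1, 0};
    }
    return {0, 0};
}

// Simulates only the depth of the operand stack for a straight-line
// opcode sequence. Returns the final depth, or -1 if some opcode would
// pop more values than are available.
int finalDepth(const std::vector<Op>& ops) {
    int depth = 0;
    for (Op op : ops) {
        StackEffect e = effectOf(op);
        if (depth < e.pops) return -1;   // underflow
        depth += e.pushes - e.pops;
    }
    return depth;
}
```

A statement such as print 1 + 2; should leave the depth at 0. A nonzero result for a whole statement is the kind of stack-balance bug this section warns about.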

Try It

cd src/cpp && cmake --build build -j
echo 'print 2 + 3 * 4;' | ./build/mlr --dump

You should see the disassembly of the script chunk followed by 14.

Pitfalls

  • Popping in the wrong order. For binary ops the right-hand side is on top: b = pop(); a = pop();. Reverse it and 1 - 2 becomes 1.
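  • push(pop() + …) evaluation order. C++ leaves the evaluation order of the operands of + (and of function arguments) unspecified, so two pop() calls in one expression may run in either order. Sequence the pops into named locals before pushing.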
  • Re-reading from stack_ after push. vector::push_back can reallocate, invalidating references. Always copy Values out before pushing.

Step 2 — Call Frames and Slot-Base Addressing

Goal

Make fn add(a, b) { return a + b; } print add(3, 4); work. Two questions:

  1. Where do a and b live during execution?
  2. How does add "return" without losing the rest of the program?

Answer to both: call frames.

A Call Frame

struct CallFrame {
    FunctionPtr     fn;        // function being executed
    const uint8_t*  ip;        // instruction pointer into fn->chunk.code
    size_t          slotBase;  // index into VM::stack_ where this frame's slot 0 lives
};

std::vector<CallFrame> frames_;

The crucial field is slotBase. It says: "all local variables in this function are accessed relative to stack_[slotBase]."

The Invariant at Op::Call N

When the compiler emits Op::Call N, the runtime stack looks like:

                          ┌── top
[ … caller's stuff, <fn>, arg1, arg2, …, argN ]
                  ▲
                  └── this is `slotBase` for the new frame

So:

  • slotBase = stack_.size() - N - 1 (the -1 accounts for the function itself).
  • frame.slots[0] = <fn> — the function value lives at slot 0 of its own frame. This is a convenient invariant that lets us implement closures cheaply later (the closure object is always reachable from inside its body).
  • frame.slots[1..N] = arg1..argN — parameters are already in place because the call ABI deliberately leaves the args on the stack.

This is why the compiler emits addLocal(p, …) for each parameter when it opens a function: parameters get slot indices 1, 2, … N, and the runtime fulfils that mapping for free.

Op::Return

                              ┌── top
[ … caller's stuff, <fn>, args/locals…, RESULT ]
                  ▲
                  └── slotBase of returning frame

The handler:

Value result = pop();
stack_.resize(frame.slotBase);   // drops <fn> + all locals + args
frames_.pop_back();
if (frames_.empty()) return Status::Ok;   // returning from <script>
push(result);                    // caller sees the value on top

Because slotBase includes the function value's slot, the resize cleans everything in one operation. No per-local pops.

Op::GetLocal slot / Op::SetLocal slot

These are now trivial:

case Op::GetLocal: {
    uint8_t slot = readByte();
    push(stack_[frame.slotBase + slot]);
    break;
}
case Op::SetLocal: {
    uint8_t slot = readByte();
    stack_[frame.slotBase + slot] = peek();   // assignment is an expression: leaves value on stack
    break;
}

Why doesn't SetLocal pop? Because a = 1 + (b = 2) is a valid expression in many languages — the assigned value is the expression's result. MiniLang is more conservative, but keeping the value on the stack lets expr-stmt use a uniform Pop afterwards.

callValue

The unified call entry point:

void callValue(Value callee, int argc, int line) {
    if (callee.kind != ValueKind::Fn)
        throw RuntimeError(line, "can only call functions");
    auto fn = callee.fn;
    if (fn->arity != argc)
        throw RuntimeError(line, "function '" + fn->name + "' expects "
                           + std::to_string(fn->arity) + " argument(s)");
    if (frames_.size() == kMaxFrames)
        throw RuntimeError(line, "stack overflow (max call depth)");
    frames_.push_back({fn, fn->chunk.code.data(),
                       stack_.size() - argc - 1});
}

That's it. After callValue, control falls back into the dispatch loop with a fresh frame = frames_.back(), and execution begins at fn->chunk byte 0.
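
To see Call, Return, and slotBase cooperate end to end, here is a self-contained miniature. It is a sketch under simplifying assumptions, not the lab's VM: values are a number-or-function pair, numeric operands are 1-byte immediates, functions come from a table instead of a constant pool, and the opcode names (OP_NUM, OP_FN, ...) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum Op : uint8_t { OP_NUM, OP_FN, OP_GET_LOCAL, OP_ADD, OP_CALL, OP_RETURN };

struct Function { int arity; std::vector<uint8_t> code; };

// Simplified value: a number, or a function reference when fn != nullptr.
struct Value { double num; const Function* fn; };

struct CallFrame { const Function* fn; size_t ip; size_t slotBase; };

// fns[0] is the script; OP_FN i pushes a reference to fns[i].
double run(const std::vector<Function>& fns) {
    std::vector<Value> stack;
    std::vector<CallFrame> frames;
    stack.push_back({0.0, &fns[0]});           // slot 0 of the script frame
    frames.push_back({&fns[0], 0, 0});

    for (;;) {
        CallFrame& f = frames.back();           // re-acquired every iteration
        const std::vector<uint8_t>& code = f.fn->code;
        switch (code[f.ip++]) {
            case OP_NUM: stack.push_back({double(code[f.ip++]), nullptr}); break;
            case OP_FN:  stack.push_back({0.0, &fns[code[f.ip++]]}); break;
            case OP_GET_LOCAL: {
                Value v = stack[f.slotBase + code[f.ip++]];   // copy, then push
                stack.push_back(v);
                break;
            }
            case OP_ADD: {
                Value b = stack.back(); stack.pop_back();
                Value a = stack.back(); stack.pop_back();
                stack.push_back({a.num + b.num, nullptr});
                break;
            }
            case OP_CALL: {
                size_t argc = code[f.ip++];
                size_t base = stack.size() - argc - 1;        // points at <fn>
                const Function* callee = stack[base].fn;
                assert(callee && callee->arity == int(argc));
                frames.push_back({callee, 0, base});
                break;
            }
            case OP_RETURN: {
                Value result = stack.back();
                stack.resize(f.slotBase);       // drops <fn>, args, and locals
                frames.pop_back();
                if (frames.empty()) return result.num;
                stack.push_back(result);
                break;
            }
        }
    }
}
```

Storing ip inside each CallFrame is one way to resume the caller after Return. The lab's VM may keep its instruction pointer differently.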

Recursion Just Works

fn fact(n) {
    if (n <= 1) { return 1; }
    return n * fact(n - 1);
}

Each recursive call pushes a new frame. No special case. The slot-base trick is the entire reason recursion works without renaming variables — each frame has its own slice of stack_.

Bootstrapping the Script

The top-level program is itself a function called <script> with arity 0. The VM bootstraps it by pushing the function value and calling it:

push(Value::makeFn(script));
callValue(stack_.back(), 0, 0);
run_loop();

The compiler always ends <script> with Nil; Return, so the loop terminates gracefully when the program finishes.

Try It

echo 'fn fib(n) { if (n < 2) { return n; } return fib(n-1) + fib(n-2); }
print fib(10);' | ./build/mlr
# 55

Add --dump to inspect the bytecode. In cp-07 the driver dumps only the script chunk, so fib's body, which lives in its own chunk, does not appear unless you add a disassembly call for it yourself; cp-12 dumps every function.

Pitfalls

  • Off-by-one in slotBase. Forgetting the -1 for the <fn> slot shifts every parameter by one: slot 1 reads arg2 and slot N reads one past the top of the stack. The result is silent data corruption.
  • frames_.push_back invalidates frame references. We always re-acquire frame = frames_.back() at the top of each iteration.
  • Returning from <script>. Without the empty-frames check the VM would push the script's "result" for a non-existent caller and then read frames_.back() on an empty vector.

Step 3 — Compiling Functions as Nested Compilers

Goal

Extend the cp-06 single-chunk compiler so that fn foo(...) { ... } emits a separate Function with its own chunk and its own local-variable bookkeeping — while staying able to resume compiling the outer code afterwards.

The Mental Model

A function body is just another little program. When the parser hands the compiler a FnDeclStmt, the compiler temporarily switches its target from the current chunk to a fresh chunk owned by a new Function. When the body finishes, the compiler:

  1. Emits Nil; Return (so a body without an explicit return does the right thing — see step 5 for control-flow specifics).
  2. Pops back to the outer compiler state.
  3. Records the new Function as a constant in the outer chunk's constant pool.
  4. Emits Closure <const-ix> at the outer cursor, which loads the function value onto the operand stack.
  5. Stores that value as a global or as a new local in the outer scope.

Crucially the outer compiler doesn't need to know anything about the inner body — it just sees a single opaque value.

State

class Compiler {
    struct Local { std::string name; int depth; bool isConst; };
    struct FunctionState {
        FunctionPtr           fn;
        std::vector<Local>    locals;
        int                   scopeDepth = 0;
        bool                  isScript;
    };
    std::vector<FunctionState> states_;

    Chunk&                chunk()      { return states_.back().fn->chunk; }
    std::vector<Local>&   locals()     { return states_.back().locals; }
    int&                  scopeDepth() { return states_.back().scopeDepth; }
};

The whole "current compilation context" is the top of states_. Push to enter a function, pop to leave.

void pushFunction(std::string name, int arity, bool isScript) {
    auto fn = std::make_shared<Function>();
    fn->name = std::move(name);
    fn->arity = arity;
    FunctionState fs;
    fs.fn = fn;
    fs.isScript = isScript;
    // Reserve slot 0 for the function value itself (matches the VM's call ABI).
    fs.locals.push_back({"", 0, true});
    states_.push_back(std::move(fs));
}

FunctionPtr popFunction() {
    auto fn = states_.back().fn;
    states_.pop_back();
    return fn;
}

That reserved slot 0 is the link to step 2 — the runtime puts the callable there, and the compiler must not accidentally allocate it to a user variable.
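
The push/pop discipline can be exercised in isolation. The sketch below is a stripped-down stand-in for the Compiler above (MiniCompiler and its State struct are hypothetical, and locals are reduced to names only). It shows that emit always targets the innermost function and that the reserved slot 0 makes the first parameter land in slot 1.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Function { std::string name; int arity; std::vector<uint8_t> code; };
using FunctionPtr = std::shared_ptr<Function>;

class MiniCompiler {
    struct State { FunctionPtr fn; std::vector<std::string> locals; };
    std::vector<State> states_;
public:
    void pushFunction(std::string name, int arity) {
        auto fn = std::make_shared<Function>();
        fn->name = std::move(name);
        fn->arity = arity;
        State st;
        st.fn = fn;
        st.locals.push_back("");           // slot 0: the function value itself
        states_.push_back(std::move(st));
    }

    FunctionPtr popFunction() {
        assert(!states_.empty());
        FunctionPtr fn = states_.back().fn;
        states_.pop_back();
        return fn;
    }

    // Resolves the target through states_.back() on every call,
    // never through a cached reference.
    void emit(uint8_t byte) { states_.back().fn->code.push_back(byte); }

    // Returns the slot index the VM will use for this local.
    int addLocal(std::string name) {
        auto& ls = states_.back().locals;
        ls.push_back(std::move(name));
        return int(ls.size()) - 1;
    }
};
```

Bytes emitted between pushFunction and popFunction go to the inner chunk, and bytes emitted afterwards go back to the outer one, without the outer state ever being touched in between.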

Compiling a FnDeclStmt

void visit(FnDeclStmt& s) override {
    pushFunction(s.name, s.params.size(), /*isScript=*/false);

    for (auto& p : s.params) addLocal(p, /*isConst=*/false, s.line);

    beginScope();
    for (auto& stmt : s.body) stmt->accept(*this);
    endScope();

    emit(Op::Nil);
    emit(Op::Return);

    auto fn = popFunction();

    // Outer scope: load the function value as a constant, then bind it.
    uint8_t ix = makeConstant(Value::makeFn(fn));
    emit(Op::Closure); emit(ix);

    if (scopeDepth() == 0) {
        uint8_t nameIx = makeConstant(Value::makeStr(s.name));
        emit(Op::DefGlobal); emit(nameIx);
    } else {
        addLocal(s.name, /*isConst=*/true, s.line);
    }
}

A few things worth noting:

  • We pass isConst=true for the binding itself but isConst=false for the parameters — assigning to a parameter inside its function body is legal.
  • The body opens its own block scope so endScope() cleans up any lets declared inside; the parameters are above this scope and persist for the entire function (correctly).
  • Op::Closure is currently a synonym for Op::Constant. We give it a distinct opcode so cp-12 can graft upvalue handling on without touching every call site.

Why addLocal(p, ...) Just Works

The cp-06 local table is indexed by insertion order, which matches the runtime slot numbering. Because we reserved slot 0 in pushFunction, the first parameter ends up at slot 1, the second at slot 2, … exactly what the call ABI delivers.

Forbidding Closure Capture (for now)

Without an upvalue system, inner can't see outer's local a:

fn outer(a) {
    fn inner() { return a; }   // ← capture
    return inner();
}

The compiler must detect this at compile time and refuse, rather than emit broken bytecode. Helper:

bool isOuterLocal(const std::string& name) {
    for (int i = (int)states_.size() - 2; i >= 0; --i) {
        const auto& ls = states_[i].locals;
        for (int j = (int)ls.size() - 1; j >= 1; --j)
            if (ls[j].name == name) return true;
    }
    return false;
}

IdentExpr and AssignExpr consult isOuterLocal after their normal local lookup misses but before they fall back to globals. If true, they emit a diagnostic pointing the user to cp-12.

<script> Is a Function Too

Result compile(Program& p) {
    pushFunction("<script>", 0, /*isScript=*/true);
    for (auto& s : p.statements) s->accept(*this);
    emit(Op::Nil); emit(Op::Return);
    auto script = popFunction();
    return Result{script, diagnostics_};
}

Everything composes. No special case for top-level — the VM just calls <script> like any other function.

Compiling CallExpr

void visit(CallExpr& e) override {
    e.callee->accept(*this);                 // pushes <fn>
    for (auto& a : e.args) a->accept(*this); // pushes args
    if (e.args.size() > 255)
        error(e.line, "too many arguments to a single call (>255)");
    emit(Op::Call);
    emit(uint8_t(e.args.size()));
}

The shape on the stack at Op::Call N is exactly what callValue expects — this is how the static side and runtime side cooperate.

Compiling ReturnStmt

void visit(ReturnStmt& s) override {
    if (states_.back().isScript)
        error(s.line, "'return' outside a function");
    if (s.value) s.value->accept(*this);
    else         emit(Op::Nil);
    emit(Op::Return);
}

Pitfalls

  • Forgetting the reserved slot 0. Parameters get the wrong slot numbers.
  • pushFunction after starting to emit prelude. The fresh FunctionState's chunk is empty by design; emit nothing into it before the body.
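  • Caching a Chunk& or locals reference across pushFunction/popFunction. states_.push_back can reallocate the vector, which leaves a cached std::vector<Local>& dangling, and a cached Chunk& keeps pointing at the previous function's chunk. Always go through the chunk()/locals() accessors.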

Step 4 — Globals (Hash) vs. Locals (Slots) at Runtime

Two Worlds, One Stack

The compiler already decides per identifier whether it is local (resolved during compilation to a slot index) or global (resolved at runtime by name). Step 4 implements the runtime half.

| Kind   | Storage                     | Access cost    |
|--------|-----------------------------|----------------|
| Local  | stack_[slotBase + slot]     | O(1), 1 load   |
| Global | unordered_map<string,Value> | O(1) avg, hash |

The compiler emits GetLocal slot / SetLocal slot for locals (resolved at compile time), and GetGlobal nameIx / SetGlobal nameIx / DefGlobal nameIx for globals — where nameIx is an index into the chunk's constant pool whose value is a Value::makeStr(name).

VM Side

case Op::DefGlobal: {
    Value name = readConstant();
    globals_[name.s] = pop();
    break;
}
case Op::GetGlobal: {
    Value name = readConstant();
    auto it = globals_.find(name.s);
    if (it == globals_.end())
        throw RuntimeError(currentLine(),
            "undefined variable '" + name.s + "'");
    push(it->second);
    break;
}
case Op::SetGlobal: {
    Value name = readConstant();
    auto it = globals_.find(name.s);
    if (it == globals_.end())
        throw RuntimeError(currentLine(),
            "undefined variable '" + name.s + "'");
    it->second = peek();   // assignment is an expression; leaves value on stack
    break;
}

Why SetGlobal errors if the variable doesn't exist

This distinguishes declaration from assignment. let x = 1; and var x = 1; declare; x = 2; assigns. Without this check, typos silently create new globals — exactly the JavaScript footgun we don't want.

DefGlobal, in contrast, unconditionally inserts. If the user shadows an existing global with another let, the resolver already complained.
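
The declare/assign split can be shown without the rest of the VM. In this sketch Globals is a hypothetical wrapper, values are plain doubles, and std::runtime_error stands in for the lab's RuntimeError.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the VM's globals table, with doubles as values.
class Globals {
    std::unordered_map<std::string, double> table_;
public:
    // DefGlobal: unconditional insert or overwrite.
    void define(const std::string& name, double v) { table_[name] = v; }

    // GetGlobal: uses find() so that a miss never creates an entry.
    double get(const std::string& name) const {
        auto it = table_.find(name);
        if (it == table_.end())
            throw std::runtime_error("undefined variable '" + name + "'");
        return it->second;
    }

    // SetGlobal: assignment only. An unknown name is an error, not a declaration.
    void set(const std::string& name, double v) {
        auto it = table_.find(name);
        if (it == table_.end())
            throw std::runtime_error("undefined variable '" + name + "'");
        it->second = v;
    }

    bool contains(const std::string& name) const { return table_.count(name) != 0; }
};
```

A misspelled assignment target throws and leaves the table unchanged, which is the behavior the SetGlobal handler above enforces.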

Why store names, not numeric ids?

Three reasons:

  1. REPL friendliness. In an interactive session, each entered statement is a separate compilation. Numeric ids would not survive across compilations.
  2. Dynamic globals. Future built-ins (print, clock, FFI bindings) inject themselves into globals_ by name without coordinating with the compiler.
  3. Cheap. String hashing on short identifiers is a few ns; the access pattern is dominated by cache misses in the hash table, not the hash itself.

Real production VMs (V8, LuaJIT) cache name-id pairs in inline caches at the call site so subsequent accesses skip the hash. cp-15 covers ICs.

Locals — the entire implementation

case Op::GetLocal: {
    uint8_t slot = readByte();
    push(stack_[frame.slotBase + slot]);
    break;
}
case Op::SetLocal: {
    uint8_t slot = readByte();
    stack_[frame.slotBase + slot] = peek();
    break;
}

Two array indirections, zero hashing. This is why locals exist as a separate notion: the dominant performance gap between a "scripting" VM and a "systems" VM is whether identifier resolution is a slot read or a hash probe.

Stack Discipline on Block Exit

When a block scope closes:

void endScope() {
    while (!locals().empty() && locals().back().depth > scopeDepth() - 1) {
        emit(Op::Pop);
        locals().pop_back();
    }
    --scopeDepth();
}

This issues a runtime Pop for every local going out of scope. At runtime the stack shrinks back to the size it had at beginScope, restoring the invariant that stack depth = number of live locals + temporaries currently on top.
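
The bookkeeping is small enough to test on its own. In this sketch ScopeTracker is a hypothetical reduction of the compiler state: it records emitted Pops in a vector instead of writing bytecode, and it writes the loop condition as depth >= scopeDepth, which is equivalent to the depth > scopeDepth() - 1 form above.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Op { Pop };

struct ScopeTracker {
    struct Local { std::string name; int depth; };
    std::vector<Local> locals;
    std::vector<Op> emitted;    // stands in for the chunk's code stream
    int scopeDepth = 0;

    void beginScope() { ++scopeDepth; }

    void addLocal(const std::string& name) {
        locals.push_back({name, scopeDepth});
    }

    // Emits one Pop per local declared in the scope being closed.
    void endScope() {
        assert(scopeDepth > 0);
        while (!locals.empty() && locals.back().depth >= scopeDepth) {
            emitted.push_back(Op::Pop);
            locals.pop_back();
        }
        --scopeDepth;
    }
};
```

For { let a; { let b; let c; } } the inner endScope emits two Pops and the outer one emits a third, so the number of Pops always equals the number of locals that left scope.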

Functions on Globals

Top-level functions live in globals_ like any other value. Function calls do:

GetGlobal "fact"   ; pushes the Fn value
Constant   5       ; pushes the arg
Call       1

Recursion works because GetGlobal happens each time — by the time fact calls itself, the global table already contains it.

Mutable vs Immutable

The compiler tracks isConst on each Local/FunctionState::locals[i] and emits a compile-time diagnostic for let-bound writes. The VM is uniform: it has no notion of const at runtime. This is the standard tradeoff: catch errors as early as possible (here, at compile time) so the runtime stays simple.

Pitfalls

  • Forgetting to pop locals in endScope. The stack grows monotonically through the program; nested blocks would corrupt parent locals' indices.
  • SetGlobal accepting unknown names. Silent globals are a tooling nightmare. Always require DefGlobal first.
  • Using [] on globals_ in GetGlobal. operator[] creates default-constructed entries on miss. Use find and report the error.

Step 5 — Control Flow on a Stack Machine

The compiler already emits Jump, JumpIfFalse, and Loop (cp-06, step 5). This step explains the runtime half: how ip_ moves through the chunk.

The Three Jump Opcodes

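| Opcode      | Layout       | Stack effect  | Action                                    |
|-------------|--------------|---------------|-------------------------------------------|
| Jump        | [op][hi][lo] | 0             | ip_ += offset16                           |
| JumpIfFalse | [op][hi][lo] | 0 (peek only) | if !isTruthy(peek()) then ip_ += offset16 |
| Loop        | [op][hi][lo] | 0             | ip_ -= offset16 (backwards jump)          |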

All offsets are unsigned 16-bit numbers measured from the byte after the operand. Two-byte operand → max range ±65 535 bytes — plenty for human code, trivially extended to 24-bit if anything pathological appears.

Why doesn't JumpIfFalse pop? Because if/else and short-circuit operators want the test value in different states after the branch. The compiler emits an explicit Pop after JumpIfFalse in cases (like if-stmt) where the condition value is no longer needed.

Runtime Implementation

case Op::Jump: {
    uint16_t off = readShort();
    ip_ += off;
    break;
}
case Op::JumpIfFalse: {
    uint16_t off = readShort();
    if (!isTruthy(peek())) ip_ += off;
    break;
}
case Op::Loop: {
    uint16_t off = readShort();
    ip_ -= off;
    break;
}

isTruthy is the language's falsiness rule:

bool isTruthy(const Value& v) {
    switch (v.kind) {
        case ValueKind::Nil:    return false;
        case ValueKind::Bool:   return v.b;
        default:                return true;   // 0 is truthy
    }
}

This decision is a language design choice. Lua agrees (only nil and false are falsy). Python disagrees (empty containers, 0, 0.0, "" are all falsy). We follow Lua/Lox for simplicity.

if-else at Runtime

Recall the compiled pattern:

            <cond>
            JumpIfFalse  ───┐
            Pop             │
            <then-body>     │
            Jump        ──────┐
   else: ──────────────────┘ │
            Pop               │
            <else-body>       │
   end:  ────────────────────┘

At runtime:

  1. The condition pushes true/false.
  2. JumpIfFalse peeks, leaves it alone, optionally skips the then-branch.
  3. The branch starts with Pop (the condition is consumed exactly once).
  4. The unconditional Jump over the else-arm leaves the stack unchanged.
  5. The else-arm also begins with Pop (matching the other side).

The two Pops guarantee that exactly one of them runs per execution and the stack ends each branch in the same state. This is the stack-balance proof for the construct.

while at Runtime

   start:                       ◄── loopStart
            <cond>
            JumpIfFalse ───┐
            Pop            │
            <body>         │
            Loop  start    │
   end: ─────────────────┘
            Pop

Loop re-evaluates the condition. The compiler computes the backward offset at chunk-emit time as (chunk.code.size() + 3) - loopStart, where the +3 accounts for the Loop opcode + 2-byte operand we are about to emit.
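
The layout and the offset formula can be checked by hand-assembling a loop and running it. The sketch below assumes a toy VM (absolute stack slots, 1-byte numeric immediates, hypothetical opcode names) and hard-codes the bytes for x = 0; while (x < 3) { x = x + 1; }.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum Op : uint8_t { OP_NUM, OP_GET, OP_SET, OP_LESS, OP_ADD,
                    OP_JUMP_IF_FALSE, OP_POP, OP_LOOP, OP_HALT };

struct Value { bool isBool; bool b; double n; };

// Same rule as isTruthy above, minus nil: only false is falsy.
bool isTruthy(const Value& v) { return v.isBool ? v.b : true; }

// Hand-assembled: x = 0; while (x < 3) { x = x + 1; }  with x in slot 0.
std::vector<uint8_t> countToThree() {
    return {
        OP_NUM, 0,                 //  0: x = 0 (slot 0)
        OP_GET, 0,                 //  2: loopStart
        OP_NUM, 3,                 //  4
        OP_LESS,                   //  6
        OP_JUMP_IF_FALSE, 0, 12,   //  7: to the exit Pop at 22 (10 + 12)
        OP_POP,                    // 10: discard condition (true path)
        OP_GET, 0,                 // 11
        OP_NUM, 1,                 // 13
        OP_ADD,                    // 15
        OP_SET, 0,                 // 16: leaves the value on the stack
        OP_POP,                    // 18: expression-statement Pop
        OP_LOOP, 0, 20,            // 19: (19 + 3) - 2 = 20
        OP_POP,                    // 22: discard condition (false path)
        OP_HALT                    // 23
    };
}

// Executes until OP_HALT and returns the final operand stack.
std::vector<Value> run(const std::vector<uint8_t>& code) {
    std::vector<Value> stack;
    size_t ip = 0;
    auto readShort = [&]() {
        size_t hi = code[ip++];
        size_t lo = code[ip++];
        return (hi << 8) | lo;
    };
    auto pop = [&]() { Value v = stack.back(); stack.pop_back(); return v; };
    for (;;) {
        switch (code[ip++]) {
            case OP_NUM: { double n = code[ip++]; stack.push_back({false, false, n}); break; }
            case OP_GET: { Value v = stack[code[ip++]]; stack.push_back(v); break; }
            case OP_SET: { size_t slot = code[ip++]; stack[slot] = stack.back(); break; }
            case OP_LESS: { Value b = pop(); Value a = pop(); stack.push_back({true, a.n < b.n, 0.0}); break; }
            case OP_ADD:  { Value b = pop(); Value a = pop(); stack.push_back({false, false, a.n + b.n}); break; }
            case OP_JUMP_IF_FALSE: { size_t off = readShort(); if (!isTruthy(stack.back())) ip += off; break; }
            case OP_POP:  stack.pop_back(); break;
            case OP_LOOP: { size_t off = readShort(); ip -= off; break; }
            case OP_HALT: return stack;
        }
    }
}
```

After the run the stack holds exactly one value, the local x. Both the per-iteration Pop and the exit Pop did their job, so no condition value leaked.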

Short-Circuit && / ||

print false && something_expensive();   // never calls
print true  || something_expensive();   // never calls

&& compiles to:

            <lhs>
            JumpIfFalse  end
            Pop
            <rhs>
   end:

If lhs is falsy, JumpIfFalse leaves it on the stack and jumps to end — the expression's result. Otherwise we Pop it and evaluate rhs, leaving its value as the result.

|| is the mirror: JumpIfTrue end, except we don't have a JumpIfTrue opcode. Two implementation choices:

  • Add Op::JumpIfTrue.
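  • Reuse JumpIfFalse with the branch sense inverted by a tiny code template: <lhs>; JumpIfFalse rhs; Jump end; rhs: Pop; <rhs>; end:. A falsy lhs hops over the Jump to reach Pop and <rhs>; a truthy lhs falls into the Jump and stays on the stack as the result.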

cp-06 takes the second route to keep the opcode set minimal. Same runtime semantics.
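
The second template is easy to get subtly wrong, so here is a sketch that lays it out byte by byte and runs all four truth combinations. It assumes single-byte OP_TRUE/OP_FALSE operands and a toy interpreter. lowerOr, runOr, and OrResult are hypothetical names, not cp-06 code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum Op : uint8_t { OP_TRUE, OP_FALSE, OP_JUMP_IF_FALSE, OP_JUMP, OP_POP, OP_HALT };

// Layout of `lhs || rhs` when each side is a single OP_TRUE/OP_FALSE byte:
//   0: <lhs>
//   1: JumpIfFalse +3  -> 7   (lhs falsy: skip the Jump, go evaluate rhs)
//   4: Jump        +2  -> 9   (lhs truthy: it is already the result)
//   7: Pop
//   8: <rhs>
//   9: Halt
std::vector<uint8_t> lowerOr(bool lhs, bool rhs) {
    return { uint8_t(lhs ? OP_TRUE : OP_FALSE),
             OP_JUMP_IF_FALSE, 0, 3,
             OP_JUMP, 0, 2,
             OP_POP,
             uint8_t(rhs ? OP_TRUE : OP_FALSE),
             OP_HALT };
}

struct OrResult { bool value; bool rhsEvaluated; size_t finalDepth; };

OrResult runOr(const std::vector<uint8_t>& code) {
    std::vector<bool> stack;
    bool rhsEvaluated = false;
    size_t ip = 0;
    for (;;) {
        size_t at = ip;
        uint8_t op = code[ip++];
        switch (op) {
            case OP_TRUE:
            case OP_FALSE:
                stack.push_back(op == OP_TRUE);
                if (at == 8) rhsEvaluated = true;   // byte 8 is <rhs> in the layout
                break;
            case OP_JUMP_IF_FALSE: {
                size_t off = (size_t(code[ip]) << 8) | code[ip + 1];
                ip += 2;
                assert(!stack.empty());
                if (!stack.back()) ip += off;       // peek only, no pop
                break;
            }
            case OP_JUMP: {
                size_t off = (size_t(code[ip]) << 8) | code[ip + 1];
                ip += 2 + off;
                break;
            }
            case OP_POP:
                stack.pop_back();
                break;
            case OP_HALT: {
                bool top = stack.back();
                return {top, rhsEvaluated, stack.size()};
            }
        }
    }
}
```

In every case exactly one value remains on the stack, and the right-hand side runs only when the left-hand side is falsy.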

Patch-Back

The compiler emits jumps with a placeholder offset (0xFFFF) and patches the real distance once the target is known:

size_t emitJump(Op op) {
    emit(op);
    emit(uint8_t(0xff));
    emit(uint8_t(0xff));
    return chunk().code.size() - 2;  // index of high byte
}

void patchJump(size_t offsetSlot) {
    size_t jumpDist = chunk().code.size() - offsetSlot - 2;
    if (jumpDist > 0xFFFF) error(..., "jump too large");
    chunk().code[offsetSlot]     = (jumpDist >> 8) & 0xff;
    chunk().code[offsetSlot + 1] = jumpDist & 0xff;
}

Loop is symmetric but the offset is known at emit time (we always loop backwards to a previously-seen loopStart).
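
A round-trip check makes the arithmetic concrete. This sketch uses free functions over a byte vector rather than the Compiler members above. The opcode values kOpJump and kOpNop and the helper jumpTarget are assumptions for illustration, and std::length_error stands in for the compiler's error(...) call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

constexpr uint8_t kOpNop  = 0x00;   // hypothetical opcode values
constexpr uint8_t kOpJump = 0x10;

// Emits the opcode plus a 0xFFFF placeholder; returns the index of the high byte.
size_t emitJump(std::vector<uint8_t>& code, uint8_t op) {
    code.push_back(op);
    code.push_back(0xff);
    code.push_back(0xff);
    return code.size() - 2;
}

// Overwrites the placeholder with the distance from the byte after the
// operand to the current end of the code.
void patchJump(std::vector<uint8_t>& code, size_t offsetSlot) {
    size_t jumpDist = code.size() - offsetSlot - 2;
    if (jumpDist > 0xFFFF) throw std::length_error("jump too large");
    code[offsetSlot]     = uint8_t((jumpDist >> 8) & 0xff);
    code[offsetSlot + 1] = uint8_t(jumpDist & 0xff);
}

// Where the VM's ip lands after taking the jump whose operand starts at offsetSlot.
size_t jumpTarget(const std::vector<uint8_t>& code, size_t offsetSlot) {
    size_t dist = (size_t(code[offsetSlot]) << 8) | code[offsetSlot + 1];
    return offsetSlot + 2 + dist;
}
```

If the pair is correct, jumpTarget evaluated right after patchJump equals code.size() at the moment of patching, whatever was emitted in between.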

Pitfalls

  • Forgetting the Pop after JumpIfFalse. The condition value stays on the stack forever, then collides with the next statement's expectations — the bug surfaces much later as a wrong value in a totally different opcode.
  • Wrong direction for Loop. ip_ -= off not +=. The disassembler prints the target address — sanity check by inspection.
  • Patch-back arithmetic. Off-by-two errors are common; write the emitJump/patchJump pair once and reuse it religiously.

Step 6 — Closures and Upvalues (Sketch, Deferred to cp-12)

Why This Is a Step at All

cp-07 deliberately rejects programs that capture a local from an enclosing function:

fn outer(a) {
    fn inner() { return a; }   // ❌ compile error in cp-07
    return inner();
}

with a clear message pointing to cp-12. This step explains why the restriction exists and what the implementation will look like when we lift it.

The Problem

A function value can outlive the stack frame in which it was defined:

fn make_counter() {
    var n = 0;
    fn step() { n = n + 1; return n; }
    return step;            // ← step references `n` AFTER make_counter returns
}

let c = make_counter();
print c(); print c(); print c();   // 1 2 3

When make_counter returns, its stack frame is destroyed — yet step still needs n. The variable has escaped from the stack.

Possible Solutions

| Strategy                         | Cost                                                                          | Used by                    |
|----------------------------------|-------------------------------------------------------------------------------|----------------------------|
| Disallow it (cp-07)              | Free. Limits expressiveness.                                                  | Early C, embedded DSLs     |
| Boxed locals everywhere          | Every local is heap-allocated and ref-counted.                                | Pre-V8 JS, Scheme R6RS     |
| Upvalues (Crafting Interpreters) | Stack-allocated locals; promoted to heap lazily when a closure captures them. | Lua, Lox, our cp-12        |
| Lambda lifting (compile-time)    | Inner function rewritten to take captures as extra args. No runtime support.  | OCaml, Haskell middle-ends |
| Full first-class environments    | Each scope is a heap object linked to its parent.                             | Scheme, Smalltalk          |

cp-12 will implement the upvalue approach because:

  • It keeps non-capturing locals as cheap stack slots (no boxing tax).
  • It scales to mutable captures without aliasing footguns.
  • It is what Lua 5.x, Lox, and many embedded VMs do — well-documented.

Sketch of the Upvalue Mechanism

Add three new value-level concepts (two records and one new Value case):

struct Upvalue {
    Value* location;     // points into the stack while OPEN
    Value  closed;       // takes ownership when the slot is closed
    bool   isOpen;
    Upvalue* next;       // intrusive list, head per VM, kept sorted by stack address
};

struct Closure {
    FunctionPtr           fn;
    std::vector<Upvalue*> upvalues;   // resolved by index in bytecode
};

A Value::Fn evolves into Value::Closure carrying a shared_ptr<Closure>. Closure holds the function plus a vector of upvalues, one per captured variable.

New Opcodes (cp-12)

| Opcode | Operands | Meaning |
| --- | --- | --- |
| Closure | [const-ix] then per upvalue: [isLocal:1][index:1] | Allocate closure; capture each upvalue. |
| GetUpvalue | [slot] | Push the value the upvalue points to. |
| SetUpvalue | [slot] | Store top into the upvalue's location. |
| CloseUpvalue | none | Promote top-of-stack local to heap; unlink its upvalue from the open-upvalue list. |

Compile Side

When the compiler sees a reference inside inner to a name declared in outer's locals:

  1. Walk outwards through states_ until it finds the name.
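  2. In every function between the use and the declaration, add an upvalue entry: isLocal=true in the function nested directly inside the declaring function (it captures a local slot of its parent), and isLocal=false in every function nested deeper (it captures one of its parent's upvalues).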
  3. Replace the bytecode with GetUpvalue idx / SetUpvalue idx.

When the compiler emits a Closure for a function with k upvalues, it emits k (isLocal, index) pairs following the opcode. At runtime, the VM reads these and either:

  • (isLocal=true) captures the surrounding frame's slot directly (calls captureUpvalue(&stack_[frame.slotBase + index])), or
  • (isLocal=false) copies one of the enclosing closure's upvalues (captureUpvalue(enclosing->upvalues[index])).
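The resolution walk above can be sketched as a short recursive function. This is a minimal illustration in the style of Crafting Interpreters, not the cp-12 implementation: FnState, UpvalueRef, findLocal, addUpvalue, and resolveUpvalue are hypothetical names standing in for the compiler's states_ entries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the compiler's per-function state (states_).
struct UpvalueRef {
    bool isLocal;  // true: captures a local slot of the enclosing function
    int  index;    // slot index (isLocal) or index into the enclosing upvalues
};

struct FnState {
    FnState*                 enclosing = nullptr;
    std::vector<std::string> locals;    // slot i holds locals[i]
    std::vector<UpvalueRef>  upvalues;  // the pairs emitted after Closure
};

int findLocal(const FnState& s, const std::string& name) {
    for (int i = static_cast<int>(s.locals.size()) - 1; i >= 0; --i)
        if (s.locals[static_cast<std::size_t>(i)] == name) return i;
    return -1;
}

// Reuse an existing entry so one captured variable maps to one upvalue index.
int addUpvalue(FnState& s, bool isLocal, int index) {
    for (std::size_t i = 0; i < s.upvalues.size(); ++i)
        if (s.upvalues[i].isLocal == isLocal && s.upvalues[i].index == index)
            return static_cast<int>(i);
    s.upvalues.push_back({isLocal, index});
    return static_cast<int>(s.upvalues.size()) - 1;
}

// Returns the upvalue index of `name` in `s`, or -1 if no enclosing
// function declares it.
int resolveUpvalue(FnState& s, const std::string& name) {
    if (!s.enclosing) return -1;
    int local = findLocal(*s.enclosing, name);
    if (local >= 0) return addUpvalue(s, true, local);     // parent's local slot
    int outer = resolveUpvalue(*s.enclosing, name);        // recurse outwards
    if (outer >= 0) return addUpvalue(s, false, outer);    // parent's upvalue
    return -1;
}
```

With three nested functions where only the outermost declares n, resolving n from the innermost adds an isLocal=true entry to the middle function and an isLocal=false entry to the innermost one.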

Run Side

captureUpvalue(loc) walks the open-upvalue list, returns an existing one if some closure already captured the same address, otherwise allocates a new Upvalue{loc, …, isOpen=true} and links it in sorted order.

When a local goes out of scope, the compiler emits CloseUpvalue; when a frame returns, the VM scans for any open upvalues at addresses ≥ the popping threshold. Either way each affected upvalue is closed: the value is copied into closed and location is re-pointed at &closed. From that moment on, the closure transparently sees the value through the heap copy.

The result: locals pay the heap cost only when they are actually captured, and only once per captured variable, no matter how many closures share it.
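As a rough sketch of the run-side bookkeeping, assuming the open list is kept with the highest stack address first: OpenUpvalues, capture, closeFrom, and the pool member are illustrative names, Value is reduced to a double, and the nodes are owned by a vector because cp-07 has no GC yet.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Value = double;  // stand-in for the VM's real Value type

struct Upvalue {
    Value*   location = nullptr;  // points into the stack while open
    Value    closed   = 0;        // owns the value once closed
    bool     isOpen   = true;
    Upvalue* next     = nullptr;
};

struct OpenUpvalues {
    Upvalue* head = nullptr;                      // highest stack address first
    std::vector<std::unique_ptr<Upvalue>> pool;   // owns nodes (a real VM would use its GC)

    // Return the upvalue for `slot`, sharing it if some closure already captured it.
    Upvalue* capture(Value* slot) {
        Upvalue* prev = nullptr;
        Upvalue* cur  = head;
        while (cur && cur->location > slot) { prev = cur; cur = cur->next; }
        if (cur && cur->location == slot) return cur;

        pool.push_back(std::make_unique<Upvalue>());
        Upvalue* created = pool.back().get();
        created->location = slot;
        created->next     = cur;
        if (prev) prev->next = created; else head = created;
        return created;
    }

    // Close every open upvalue at or above `threshold`. Because the list is
    // sorted, the loop stops at the first upvalue below the threshold.
    void closeFrom(Value* threshold) {
        while (head && head->location >= threshold) {
            Upvalue* uv = head;
            head = uv->next;
            uv->closed   = *uv->location;  // copy the value to the heap
            uv->location = &uv->closed;    // closures now read the heap copy
            uv->isOpen   = false;
            uv->next     = nullptr;
        }
    }
};
```

Capturing the same slot twice returns the same Upvalue, which is what lets two closures observe each other's writes to a shared variable.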

Why Defer All This?

  • The cp-07 lab is large enough already.
  • Most of the interesting engineering — slot allocation, dispatch, frame management — is independent of closures and easier to learn in isolation.
  • cp-12's JIT motivates closures: a closure becomes a useful unit for inlining and specialisation.

What cp-07 Actually Does

Op::GetUpvalue / Op::SetUpvalue / Op::CloseUpvalue exist in the opcode table (so cp-12 can drop in changes without renumbering) but the VM throws a runtime error if it ever executes one:

case Op::GetUpvalue:
case Op::SetUpvalue:
case Op::CloseUpvalue:
    throw RuntimeError(currentLine(),
        "upvalues not supported in cp-07 (see cp-12)");

And the compiler refuses to emit them — the isOuterLocal helper in steps/03 detects the attempted capture and emits a friendly diagnostic at compile time, well before the user sees any opaque runtime error.

Pitfalls (for cp-12)

  • Forgetting to sort open-upvalues by stack address. The close operation relies on stopping at the first upvalue below the threshold.
  • Double-closing. An upvalue captured by N closures must close exactly once; the open list dedupes by address.
  • Variable-length operands. The Closure opcode's variable-length operand encoding is awkward to disassemble; budget extra time on the disassembler.

Step 7 — Runtime Errors and (Mini) Stack Traces

What Counts as a Runtime Error

cp-07 catches at runtime:

| Cause | Where | Message template |
| --- | --- | --- |
| Undefined global read | Op::GetGlobal | undefined variable '<name>' |
| Undefined global write | Op::SetGlobal | undefined variable '<name>' |
| Type mismatch in binary + | Op::Add | operands to '+' must be two numbers or two strings |
| Wrong type in numeric op | Op::Sub/Mul/… | operands must be numbers |
| Division by zero | Op::Div, Op::Mod | division by zero |
| Wrong type in unary - | Op::Neg | operand to unary '-' must be a number |
| Calling a non-function | callValue | can only call functions |
| Arity mismatch | callValue | function '<n>' expects K argument(s) |
| Stack overflow | callValue | stack overflow (max call depth) |
| Use of unsupported Op::*Upvalue | dispatch loop | upvalues not supported in cp-07 (see cp-12) |

The RuntimeError Type

struct RuntimeError : std::runtime_error {
    int line;
    RuntimeError(int l, std::string m)
        : std::runtime_error(std::move(m)), line(l) {}
};

A single exception type keeps the dispatch loop's error path uniform. The public VM::run catches it at the top, prints a one-line message + the call chain, and returns Status::RuntimeError.

Source-Line Tracking

The compiler stuffs a lines parallel array into each Chunk:

struct Chunk {
    std::vector<uint8_t> code;
    std::vector<int>     lines;       // 1:1 with code
    std::vector<Value>   constants;
};

This is wasteful — one int per byte — and a real VM compresses lines via run-length encoding. We trade memory for clarity; cp-15 covers debug-info compression.
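For a feel of what the run-length alternative looks like, here is a small sketch. It is not the cp-15 design; LineRun and LineTable are hypothetical names, and lookup is a linear scan for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One run: `count` consecutive code bytes that share the same source line.
struct LineRun { int line; std::size_t count; };

class LineTable {
public:
    // Record the line for the next code byte, extending the last run if possible.
    void add(int line) {
        if (!runs_.empty() && runs_.back().line == line) ++runs_.back().count;
        else runs_.push_back({line, 1});
    }

    // Line for the byte at `offset`, or -1 if the offset is out of range.
    int lineAt(std::size_t offset) const {
        for (const LineRun& r : runs_) {
            if (offset < r.count) return r.line;
            offset -= r.count;
        }
        return -1;
    }

    std::size_t runCount() const { return runs_.size(); }

private:
    std::vector<LineRun> runs_;
};
```

Six bytes spread over lines 1, 1, 1, 2, 2, 7 are stored as three runs instead of six ints; the price is that lookup is no longer a single array index.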

At runtime the VM computes the current line as

int currentLine() {
    auto& fr = frames_.back();
    size_t off = fr.ip - fr.fn->chunk.code.data() - 1;   // -1: we already advanced past the op
    return fr.fn->chunk.lines[off];
}

The -1 is the subtle bit: readByte() post-increments ip, so by the time we're handling an opcode, ip points to the next byte.

A Mini Stack Trace

VM::run's catch block walks frames_ from the top down:

catch (const RuntimeError& e) {
    (*err) << "runtime error [line " << e.line << "]: " << e.what() << "\n";
    for (auto it = frames_.rbegin(); it != frames_.rend(); ++it) {
        const auto& fn = it->fn;
        (*err) << "  in " << (fn->name.empty() ? "<script>" : fn->name) << "\n";
    }
    return Status::RuntimeError;
}

So:

fn boom() { return 1 / 0; }
fn caller() { return boom(); }
print caller();

prints (on stderr):

runtime error [line 1]: division by zero
  in boom
  in caller
  in <script>

This is the minimum-viable stack trace. Real implementations also include file/line per frame, source snippets, and column ranges. cp-15 (Tooling & Diagnostics) is the big upgrade.

Why Not assert?

A core principle: runtime errors are not crashes. They are recoverable from the embedder's standpoint — a REPL must keep running after a typo, an embedded interpreter must report to its host, etc. Using assert or std::terminate would couple the VM's lifecycle to the user's mistake.

This is also why we don't std::exit() from inside the VM; the driver (main.cpp) decides the exit code from Status.
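The embedder's view can be sketched as follows. runGuarded and captureErrors are hypothetical helpers, and the body callback stands in for the dispatch loop; the point is only that a RuntimeError turns into a Status plus a message on a caller-supplied stream, and the host keeps running.

```cpp
#include <cassert>
#include <functional>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

enum class Status { Ok, RuntimeError };

struct RuntimeError : std::runtime_error {
    int line;
    RuntimeError(int l, std::string m)
        : std::runtime_error(std::move(m)), line(l) {}
};

// Hypothetical wrapper: `body` stands in for the VM's dispatch loop.
// Only RuntimeError is caught; anything else propagates to the driver.
Status runGuarded(const std::function<void()>& body, std::ostream& err) {
    try {
        body();
        return Status::Ok;
    } catch (const RuntimeError& e) {
        err << "runtime error [line " << e.line << "]: " << e.what() << "\n";
        return Status::RuntimeError;
    }
}

// Test-harness style helper: collect the diagnostic text in a string.
std::string captureErrors(const std::function<void()>& body) {
    std::ostringstream err;
    runGuarded(body, err);
    return err.str();
}
```

A REPL built on this shape simply checks the returned Status and prompts for the next line.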

Compile-Time vs Runtime Errors

After cp-04 (resolver) and cp-05 (typechecker), most static errors fire before we ever build a chunk:

| Error | Caught by | Phase |
| --- | --- | --- |
| Undefined name (local) | Resolver | static |
| Use before init | Resolver | static |
| Assignment to let | Resolver / Compiler | static |
| Calling non-callable type | TypeChecker | static |
| Arity mismatch (known fn) | TypeChecker | static |
| Type mismatch on operator | TypeChecker | static |
| Bad cast at runtime | VM | dynamic |
| Undefined global | VM | dynamic |
| Division by zero on user input | VM | dynamic |

This split is intentional: static errors are reported before any code runs; dynamic errors depend on run-time values and cannot be ruled out ahead of time.

Pitfalls

  • Catching std::exception too broadly. Other exceptions (std::bad_alloc, malformed UTF-8 in string display, …) deserve different handling. We catch RuntimeError specifically and let everything else propagate to the driver.
  • Forgetting to flush err. std::cerr is unit-buffered, so writes to it appear immediately, and a test harness's ostringstream has nothing to flush. Any other buffered sink (a log file, say) should be flushed before VM::run returns.
  • Reusing line indices after chunk mutation. Chunk::lines must stay 1:1 with code; any opcode emit must push exactly one line entry per byte. Helper functions handle this — never push to code directly.

cp-08 — Three-Address Code IR

A new compiler middle-end that lowers the resolved/type-checked AST into Three-Address Code (TAC), the canonical compiler IR taught in the Dragon Book and most compiler textbooks. The bytecode VM of cp-07 was a great way to run code, but a poor representation for reasoning about code: an operand stack hides def/use relationships, and locals are addressed by slot rather than name.

TAC reverses these tradeoffs. Each instruction has at most one operation and writes its result into one named destination — t3 = add t1, t2. Control flow lives in a control-flow graph (CFG) of basic blocks connected by explicit jumps. This is exactly the shape an SSA construction algorithm wants in cp-09, and it's the shape LLVM IR will demand in cp-11.

What's in the box

| File | Purpose |
| --- | --- |
| src/ir.hpp/cpp | Operand, Op enum, Instr, BasicBlock, Function, Module |
| src/ir_printer.* | Textual IR pretty-printer (the "assembly" we read in tests) |
| src/ir_builder.* | AST → IR lowering pass |
| src/main.cpp | mltac CLI driver: source → IR text on stdout |
| tests/test_ir.cpp | String-level golden tests over the printed IR |

The pipeline is now:

source ─► lexer ─► parser ─► resolver ─► typecheck ─► ir::Builder ─► Module

There is no execution stage in cp-08. cp-09 wires up an interpreter that walks this IR directly (and adds SSA + a couple of optimisation passes).

Build & run

cmake -S src/cpp -B src/cpp/build
cmake --build src/cpp/build -j
ctest --test-dir src/cpp/build --output-on-failure
echo 'fn add(a,b){return a+b;} print add(3,4);' | ./src/cpp/build/mltac

Expected output:

fn @__script__() {
bb0 (entry):
    t0 = call @add(3, 4)
    print t0
    ret
}

fn @add(%a, %b) {
bb0 (entry):
    t0 = add %a, %b
    ret t0
}

What's new conceptually

  • Three operand kinds. t<n> temps (SSA-friendly), %name named storage (local variables / params), and immediate constants.
  • One op per instruction. Compound expressions are flattened by introducing fresh temps for each subexpression result.
  • Globals through memory ops. ldg @x / stg @x, v make global reads and writes explicit — paralleling LLVM's load/store.
  • Explicit control flow. Every block ends in a terminator (jmp, cjmp, ret). No fall-through. No implicit "next instruction".
  • Short-circuit lowered to branches. a && b becomes a cjmp plus a join block, just as cp-07 did with patchable jumps — but now the join lives in the CFG, ready for phi insertion in cp-09.

Reading order

The seven step docs in steps/ follow the same progression as the code:

  1. 01-tac-and-three-address-form.md
  2. 02-operands-and-instructions.md
  3. 03-basic-blocks-and-cfg.md
  4. 04-lowering-expressions.md
  5. 05-lowering-statements-and-control-flow.md
  6. 06-short-circuit-and-phi-preview.md
  7. 07-printer-and-debugging.md

Step 1 — TAC and three-address form

Why a new IR?

The cp-07 bytecode VM was a perfectly good interpreter. But the moment you try to do interesting things to the program — eliminate dead code, fold constants, allocate registers, prove that two pointers don't alias — the stack representation fights you at every turn.

Three things make a stack machine awkward for analysis:

  1. Implicit operands. ADD doesn't say what it adds; you have to simulate the stack to find out. Every analysis becomes an abstract interpretation.
  2. Position-dependent. Reordering two adjacent instructions changes what's on the stack. You can't pattern-match peepholes locally.
  3. No SSA story. SSA wants every value to have a name; stack slots are anonymous and transient.

TAC fixes all three by writing each operation in the form

dst = op src1, src2

— at most one operation per instruction, every operand explicit, every result named.

A first example

Source:

print (1 + 2) * 3;

Stack bytecode (cp-07-style):

PUSH 1
PUSH 2
ADD
PUSH 3
MUL
PRINT

TAC:

t0 = add 1, 2
t1 = mul t0, 3
print t1

In this example the TAC form is actually shorter by raw count (three instructions against six), though each instruction is wider. More importantly, each line is a fully self-contained def with explicit uses. You can ask "what defines t1?" without simulating anything: just grep.
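To make the "you have to simulate the stack" point concrete, the sketch below recovers the TAC above from the stack bytecode by running a symbolic stack of operand names. StackOp, StackInstr, and stackToTac are made up for this illustration and cover only the four ops in the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class StackOp { Push, Add, Mul, Print };
struct StackInstr { StackOp op; int imm; };  // imm is used by Push only

// Abstractly interpret the bytecode: the stack holds operand *names*
// (constants or temps) instead of run-time values.
std::vector<std::string> stackToTac(const std::vector<StackInstr>& code) {
    std::vector<std::string> stack;
    std::vector<std::string> tac;
    int nextTemp = 0;
    for (const StackInstr& ins : code) {
        switch (ins.op) {
        case StackOp::Push:
            stack.push_back(std::to_string(ins.imm));
            break;
        case StackOp::Add:
        case StackOp::Mul: {
            assert(stack.size() >= 2);
            std::string rhs = stack.back(); stack.pop_back();
            std::string lhs = stack.back(); stack.pop_back();
            std::string dst = "t" + std::to_string(nextTemp++);
            std::string mnemonic = (ins.op == StackOp::Add) ? "add" : "mul";
            tac.push_back(dst + " = " + mnemonic + " " + lhs + ", " + rhs);
            stack.push_back(dst);  // the result becomes an operand for later ops
            break;
        }
        case StackOp::Print:
            assert(!stack.empty());
            tac.push_back("print " + stack.back());
            stack.pop_back();
            break;
        }
    }
    return tac;
}
```

The operands of ADD and MUL only become visible after this walk, which is exactly the work TAC does once, at lowering time, so that later passes do not have to.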

Three operand kinds

Our Operand is a tagged struct with four cases: the three kinds above plus a placeholder.

  • None: placeholder for instructions that produce no value.
  • Temp(n), printed t0, t1, ...: compiler-generated intermediates. Most are assigned exactly once, but cp-08 temps are not yet SSA values (the short-circuit slot in step 6 is written on two paths). In cp-09 these become SSA values.
  • Constant(Value): an immediate, printed inline.
  • Named(name), printed %x, %a: a local variable or parameter. Named operands behave like alloca'd memory slots (read and write via Move), but at TAC level we expose them as first-class operands. cp-09's mem2reg pass promotes these into SSA temps.

Globals do not get an Operand form; they are accessed through explicit ldg @x / stg @x instructions. That mirrors LLVM's model where "variable" really means "memory cell" and SSA only describes local register flow.

The deal we're making

TAC introduces verbosity and a small constant-factor compile-time hit compared to direct bytecode generation. In exchange, we get:

  • a uniform substrate for all later analyses (cp-09 SSA, cp-10 passes, cp-11 LLVM, cp-13 MLIR),
  • a printable, diff-able IR that makes compiler bugs visible,
  • a CFG abstraction we can use to talk precisely about reachability, dominance, and loop structure.

That's the trade every production compiler makes. cp-08 just makes us pay the price explicitly so cp-09 onwards can spend the proceeds.

Step 2 — Operands and instructions

Operand

struct Operand {
    enum class Kind { None, Temp, Constant, Named };
    Kind        kind = Kind::None;
    int         tempId = -1;     // Temp
    Value       constVal;        // Constant
    std::string name;            // Named  (stored without the leading % sigil)
    // factories: none(), temp(id), constant(v), named(name)
};

We use a single struct with a Kind tag rather than std::variant to keep every field directly visible and (more importantly) easy to print in a debugger. When you're chasing an IR bug at 1 a.m. you want p ins.srcs[0] to show something, not a variant index.

tempId, constVal, and name are independent fields; only one is meaningful for any given Kind. The factory functions mirror that:

Operand::temp(3);                 // t3
Operand::constant(Value::makeInt(42));   // immediate
Operand::named("x");              // %x
Operand::none();                  // placeholder

Op — the opcode enum

| Group | Opcodes |
| --- | --- |
| arithmetic | Add Sub Mul Div Mod Neg |
| comparison | Eq Ne Lt Le Gt Ge |
| logical | Not (and/or are lowered, not opcodes) |
| move/load | Move LoadGlobal StoreGlobal |
| control | Jump CondJump Return |
| effects | Print Call |

Notable design choices:

  • No And / Or opcode. Short-circuit semantics demand control flow; we lower them to CondJump (see step 6).
  • Move rather than Copy. Same idea as RISC-V or MIPS pseudo-ops: one instruction that says "write the source into the destination, unchanged." The mem2reg pass in cp-09 will eliminate most of these.
  • Call is a regular instruction. It has a destination temp (for the return value), an opcode-level callee name in ins.name for direct calls, and operands [callee, arg0, arg1, ...]. Indirect calls (cp-12 closures) will store <indirect> in the name and use srcs[0] for the callee operand.

Instr

struct Instr {
    Op            op;
    Operand       dst;         // None if the op produces no value
    std::vector<Operand> srcs; // 0..N source operands
    std::string   name;        // global name / function name
    int           bbT = -1;    // jmp target / cjmp true target
    int           bbF = -1;    // cjmp false target
    int           line = 0;    // source line for diagnostics
};

One struct fits all instruction kinds. The alternative — a discriminated hierarchy with AddInstr, JumpInstr, CallInstr, ... — is dogmatically purer, but cripplingly painful to walk in passes. Every pass would need a giant visitor or a type-switch. A flat struct lets passes loop over instrs and switch on ins.op.

The cost: each instruction carries unused fields. For TAC at this scale that's a sub-megabyte overhead even for large programs, and it's the shape MLIR uses (an Operation* with attributes, results, operands, successors). Compiler IRs converge on this design for a reason.

Why constants are inline operands

In some IRs (notably LLVM) constants are first-class Values, distinct from instructions. We took the simpler route: a constant is just an Operand::Constant, printed inline. Pros: trivial printer, no constant pool to manage. Cons: you can't dyn_cast a constant the way you can in LLVM. For a teaching IR that's the right trade.

Step 3 — Basic blocks and the CFG

What is a basic block?

A basic block is a maximal sequence of instructions such that

  1. control enters only at the top (no branches into the middle), and
  2. control leaves only at the bottom (no branches out of the middle).

In our IR: a straight-line code fragment that ends in exactly one terminator (Jump, CondJump, or Return). Nothing fancier than that.

The set of basic blocks plus their successor edges form the control-flow graph (CFG) of a function. Almost every interesting analysis — dominance, reachability, liveness, loop detection — operates on this graph.

Our BasicBlock

struct BasicBlock {
    int                 id;          // small integer name
    std::string         label;       // human-readable hint
    std::vector<Instr>  instrs;
};

The label is purely cosmetic ("entry", "if.then", "while.cond") — the real identifier is id. We use small dense integer ids because most analyses will index into per-block bit-vectors.

Function::blocks is the vector of all blocks in creation order. blocks[0] is always the entry by convention; we don't store it separately.

Block creation

BasicBlock& Function::newBlock(std::string label) {
    blocks.push_back({nextBlock++, std::move(label), {}});
    return blocks.back();
}

newBlock returns a reference, and the builder copies the id out of it immediately. The reference can be invalidated by the next newBlock call (the vector may reallocate), so we never hold it across another allocation.
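A minimal sketch of the safe pattern, using cut-down BasicBlock and Function types; the block(id) accessor and makeThenElse are hypothetical helpers added for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct BasicBlock { int id; std::string label; };

struct Function {
    std::vector<BasicBlock> blocks;
    int nextBlock = 0;

    BasicBlock& newBlock(std::string label) {
        blocks.push_back({nextBlock++, std::move(label)});
        return blocks.back();  // valid only until the next push_back
    }

    // Ids are dense and assigned in creation order, so id == index.
    BasicBlock& block(int id) { return blocks[static_cast<std::size_t>(id)]; }
};

// Safe: copy the id out of each reference before allocating again.
// Holding `auto& a = fn.newBlock(...)` across a second newBlock() call
// would risk reading through a dangling reference.
std::pair<int, int> makeThenElse(Function& fn) {
    int thenId = fn.newBlock("if.then").id;
    int elseId = fn.newBlock("if.else").id;
    return {thenId, elseId};
}
```

Whenever the block itself is needed later, it is looked up again through its id rather than through a saved reference.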

Terminator discipline

The builder enforces a simple invariant: every emitted instruction goes into a non-terminated block. If you try to emit into a terminated block, the builder silently opens a fresh "unreachable" block and emits there instead:

void Builder::emit(Instr ins) {
    if (currentBlockTerminated()) {
        auto& nb = fn().newBlock("unreachable");
        setBlock(nb.id);
    }
    block().instrs.push_back(std::move(ins));
}

This keeps the lowering code free of if (terminated) return; clutter and preserves dead code as visible IR. cp-09's DCE pass will prune unreachable blocks.
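The invariant is easy to check mechanically. The verifier below is a sketch, not part of the lab's sources: it uses a reduced Op enum, assumes every block has been terminated by the time it runs, and verifyBlock is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Op { Add, Move, Print, Jump, CondJump, Return };
struct Instr { Op op; };
struct BasicBlock { int id = 0; std::vector<Instr> instrs; };

bool isTerminator(Op op) {
    switch (op) {
    case Op::Jump:
    case Op::CondJump:
    case Op::Return:
        return true;
    default:
        return false;
    }
}

// Returns an empty string for a well-formed block, otherwise a description
// of the first violation found.
std::string verifyBlock(const BasicBlock& bb) {
    const std::string where = "bb" + std::to_string(bb.id) + ": ";
    if (bb.instrs.empty()) return where + "empty block";
    for (std::size_t i = 0; i + 1 < bb.instrs.size(); ++i)
        if (isTerminator(bb.instrs[i].op))
            return where + "terminator before end of block";
    if (!isTerminator(bb.instrs.back().op))
        return where + "missing terminator";
    return "";
}
```

Running a check like this after lowering turns a silent CFG corruption into an immediate, named failure.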

CFG edges

We don't store CFG edges explicitly. Successors are recoverable from the terminator:

| Terminator | Successors |
| --- | --- |
| Jump bbT | {bbT} |
| CondJump bbT bbF | {bbT, bbF} |
| Return | {} |

A trivial helper (introduced in cp-09) inspects a block's final instruction (bb.instrs.back()) and returns the successor set. Predecessors are computed by inverting that map once per pass: cheap, and it avoids the bookkeeping pain of keeping bidirectional edges in sync during construction.
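One plausible shape for those helpers, sketched against cut-down Instr and BasicBlock types; the names successors and predecessors are assumptions, since the cp-09 helper is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Other, Jump, CondJump, Return };
struct Instr { Op op; int bbT; int bbF; };
struct BasicBlock { int id = 0; std::vector<Instr> instrs; };

// Successors are read straight off the block's terminator.
std::vector<int> successors(const BasicBlock& bb) {
    if (bb.instrs.empty()) return {};
    const Instr& term = bb.instrs.back();
    switch (term.op) {
    case Op::Jump:     return {term.bbT};
    case Op::CondJump: return {term.bbT, term.bbF};
    default:           return {};  // Return, or an unterminated block
    }
}

// Predecessors: invert the successor map once. Assumes ids are dense,
// so a block's id doubles as its index in `blocks`.
std::vector<std::vector<int>> predecessors(const std::vector<BasicBlock>& blocks) {
    std::vector<std::vector<int>> preds(blocks.size());
    for (const BasicBlock& bb : blocks)
        for (int s : successors(bb))
            preds[static_cast<std::size_t>(s)].push_back(bb.id);
    return preds;
}
```

For a while loop (pre, cond, body, cont) the cond block ends up with two predecessors: the pre block and the loop body.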

Why a vector of blocks (and not a linked structure)?

Industrial IRs (LLVM, MLIR) use intrusive linked lists for instructions within blocks, because passes need cheap O(1) removal. At the block level they store a list of blocks per function for the same reason: inserting a block in the middle of a function is common (loop rotation, critical-edge splitting).

We use std::vector because (a) we're not implementing those passes yet, and (b) a vector<BasicBlock> is dramatically nicer to print and debug. The cost is amortised — append is O(1) — and the only real limitation is that we can't store stable pointers to blocks. We work around that with integer ids, which is the correct compiler-engineering discipline anyway.

Step 4 — Lowering expressions

Expression lowering in ir_builder.cpp follows one rule: every visit method writes the expression's result operand into result_, and returns void. A helper eval(Expr&) runs accept and reads back result_:

Operand eval(Expr& e) { e.accept(*this); return result_; }

This dance is a workaround for the AST's typed-visitor design (which specialises only ExprVisitor<TypePtr> and ExprVisitor<void>), and gives us a useful invariant: the IR Builder reuses the exact same visitor base class as the bytecode compiler.

Constants and identifiers

void Builder::visit(LiteralExpr& e) { result_ = Operand::constant(e.value); }

void Builder::visit(IdentExpr& e) {
    if (isLocal(e.name)) { result_ = Operand::named(e.name); return; }
    Operand dst = freshTemp();
    emit({Op::LoadGlobal, dst, {}, e.name, -1, -1, e.line});
    result_ = dst;
}

  • Literals are pure data: no instruction, just an operand.
  • Locals become named operands; no instruction is emitted on a read. Reads happen for free at the use site, which is correct for an alloca-style memory model.
  • Globals require an explicit LoadGlobal to a fresh temp.

This asymmetry between locals and globals matters: in the SSA promotion pass of cp-09 we will recognise alloca-like patterns over %name operands and promote them. Globals will stay as memory because they cross function boundaries.

Unary and binary operators

void Builder::visit(BinaryExpr& e) {
    Operand a = eval(*e.lhs);
    Operand b = eval(*e.rhs);
    Operand dst = freshTemp();
    emit({binOpFor(e.op), dst, {a, b}, "", -1, -1, e.line});
    result_ = dst;
}

Subexpressions are recursively lowered first, producing fresh temps, and the parent's instruction binds those temps. This is the literal "flatten compound expressions into single ops" recipe — the whole point of TAC.

Constant folding could happen here (add 1, 2 → Operand::constant(3)) but we don't do it. Folding is a pass, not a lowering concern. cp-09 introduces a constant-folder pass that consumes the unfolded IR cp-08 emits, and being able to see the pre-fold IR makes the pass's effect obvious in diff tests.
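To show what "folding as a pass" means over a flat instruction list, here is a deliberately small sketch. It is not the cp-09 constFold: Operand is cut down to integer constants and temps, overflow is ignored, and foldConstants is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

enum class Op { Add, Sub, Mul, Move };

struct Operand {
    bool isConst = false;
    long value   = 0;   // meaningful when isConst
    int  tempId  = -1;  // meaningful otherwise
};

struct Instr { Op op; Operand dst; std::vector<Operand> srcs; };

Operand constant(long v) { Operand o; o.isConst = true; o.value = v; return o; }
Operand temp(int id)     { Operand o; o.tempId = id; return o; }

// Rewrites `tN = op c1, c2` into `tN = move c` when both sources are
// constants. Returns the number of instructions folded.
int foldConstants(std::vector<Instr>& instrs) {
    int folded = 0;
    for (Instr& ins : instrs) {
        if (ins.srcs.size() != 2) continue;
        if (!ins.srcs[0].isConst || !ins.srcs[1].isConst) continue;
        const long a = ins.srcs[0].value;
        const long b = ins.srcs[1].value;
        long r = 0;
        switch (ins.op) {
        case Op::Add: r = a + b; break;
        case Op::Sub: r = a - b; break;
        case Op::Mul: r = a * b; break;
        default: continue;  // not a foldable op: skip this instruction
        }
        ins.op   = Op::Move;
        ins.srcs = {constant(r)};
        ++folded;
    }
    return folded;
}
```

On t0 = add 1, 2 followed by t1 = mul t0, 3 this folds only the first instruction; folding the second also requires propagating t0's constant into its use, which is why folding and propagation are usually run together to a fix-point.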

Calls

void Builder::visit(CallExpr& e) {
    std::string calleeName;
    Operand     calleeOp;
    if (auto* id = dynamic_cast<IdentExpr*>(e.callee.get())) {
        calleeName = id->name;             // direct call
        calleeOp   = Operand::named(id->name);
    } else {
        calleeOp   = eval(*e.callee);      // indirect (no closures yet though)
        calleeName = "<indirect>";
    }

    std::vector<Operand> args { calleeOp };
    for (auto& a : e.args) args.push_back(eval(*a));

    Operand dst = freshTemp();
    Instr ins{ Op::Call, dst, std::move(args), calleeName, -1, -1, e.line };
    emit(std::move(ins));
    result_ = dst;
}

The callee always lives at srcs[0], even for direct calls. Carrying the name separately is redundant but useful: it makes the printed IR read as t0 = call @add(3, 4) instead of t0 = call %add, 3, 4, and it lets passes filter direct vs indirect calls without parsing operands.

Arguments are evaluated left to right, matching the language's defined evaluation order.

Assignments

void Builder::visit(AssignExpr& e) {
    Operand val = eval(*e.value);
    if (isLocal(e.name)) {
        emit({Op::Move, Operand::named(e.name), {val}, "", -1, -1, e.line});
    } else {
        emit({Op::StoreGlobal, Operand::none(), {val}, e.name, -1, -1, e.line});
    }
    result_ = val;
}

Note that assignment returns the assigned value (as an expression), which is needed for let y = (x = 3);. Locals use Move; globals use StoreGlobal. For a local, the Move's dst is the named operand %x while the expression's result is the value val, so cp-09's mem2reg can treat every move into %x as a definition point of x during SSA construction.

Step 5 — Lowering statements and control flow

Where expression lowering produced an Operand, statement lowering produces blocks. The recipe is always the same:

  1. allocate the blocks you'll need,
  2. emit a terminator into the current block to enter the structure,
  3. lower the body into the relevant blocks,
  4. emit terminators stitching them together,
  5. set currentBlock to the join block so the next statement continues there.

if / else

        ┌─── cjmp cond ──→ if.then ──jmp── if.cont
   pre ─┤                                    ↑
        └─── cjmp cond ──→ if.else ──jmp────┘

Code:

void Builder::visit(IfStmt& s) {
    Operand cond = eval(*s.cond);

    // Read each id right away: the next newBlock() may reallocate
    // Function::blocks and invalidate the returned reference.
    int thenId = fn().newBlock("if.then").id;
    int elseId = fn().newBlock(s.elseBranch ? "if.else" : "if.cont").id;
    emitCondJump(cond, thenId, elseId, s.line);

    setBlock(thenId);
    s.thenBranch->accept(*this);
    bool thenTerm = currentBlockTerminated();

    if (s.elseBranch) {
        auto& cont = fn().newBlock("if.cont");
        int joinId = cont.id;
        if (!thenTerm)            emitJump(joinId, s.line);
        setBlock(elseId);
        s.elseBranch->accept(*this);
        if (!currentBlockTerminated()) emitJump(joinId, s.line);
        setBlock(joinId);
    } else {
        if (!thenTerm) emitJump(elseId, s.line);
        setBlock(elseId);   // the else block *is* the join when no else exists
    }
}

Two subtleties:

  1. When there is no else, we reuse the else block as the join (if.cont). Wasting a block is harmless but ugly in diff tests, and merging them gives the more natural printed IR.
  2. We only emit the jump to the join if the branch didn't already terminate (e.g. with return). Without this guard a jmp would follow the ret. A block may have only one terminator, so that is either malformed IR or (when the jump goes through emit) a stray unreachable block.

while

pre ──jmp── while.cond ──cjmp── while.body ──jmp── while.cond (back-edge)
                  │
                  └──cjmp── while.cont

void Builder::visit(WhileStmt& s) {
    // Read each id right away; a later newBlock() may invalidate the reference.
    int condId = fn().newBlock("while.cond").id;
    int bodyId = fn().newBlock("while.body").id;
    int contId = fn().newBlock("while.cont").id;

    emitJump(condId, s.line);            // pre  → cond

    setBlock(condId);
    Operand c = eval(*s.cond);           // (re-evaluated each iteration)
    emitCondJump(c, bodyId, contId, s.line);

    setBlock(bodyId);
    s.body->accept(*this);
    if (!currentBlockTerminated()) emitJump(condId, s.line);  // back-edge

    setBlock(contId);
}

The back-edge is what makes this a loop in CFG terms: an edge whose target dominates its source. cp-09's loop-detection pass will find it.
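The definition can be checked directly with a textbook iterative dominator computation. This is a sketch of the idea rather than cp-09's pass: the CFG is reduced to successor lists, block 0 is assumed to be the entry, unreachable blocks are left with the trivial "dominated by everything" set, and the names Cfg, dominators, and backEdges are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Cfg = std::vector<std::vector<int>>;  // succ[b] = successor ids of block b

// dom[b][d] == true means block d dominates block b.
std::vector<std::vector<bool>> dominators(const Cfg& succ) {
    const std::size_t n = succ.size();
    std::vector<std::vector<int>> preds(n);
    for (std::size_t b = 0; b < n; ++b)
        for (int s : succ[b])
            preds[static_cast<std::size_t>(s)].push_back(static_cast<int>(b));

    std::vector<std::vector<bool>> dom(n, std::vector<bool>(n, true));
    if (n == 0) return dom;
    dom[0].assign(n, false);
    dom[0][0] = true;  // the entry is dominated only by itself

    bool changed = true;
    while (changed) {  // iterate to a fix-point
        changed = false;
        for (std::size_t b = 1; b < n; ++b) {
            // dom(b) = {b} ∪ intersection of dom(p) over all predecessors p
            std::vector<bool> next(n, true);
            for (int p : preds[b])
                for (std::size_t d = 0; d < n; ++d)
                    next[d] = next[d] && dom[static_cast<std::size_t>(p)][d];
            next[b] = true;
            if (next != dom[b]) { dom[b] = next; changed = true; }
        }
    }
    return dom;
}

// An edge u -> v is a back-edge when v dominates u.
std::vector<std::pair<int, int>> backEdges(const Cfg& succ) {
    const std::vector<std::vector<bool>> dom = dominators(succ);
    std::vector<std::pair<int, int>> out;
    for (std::size_t u = 0; u < succ.size(); ++u)
        for (int v : succ[u])
            if (dom[u][static_cast<std::size_t>(v)])
                out.push_back({static_cast<int>(u), v});
    return out;
}
```

For the while CFG above, numbered pre=0, while.cond=1, while.body=2, while.cont=3, the only back-edge is body → cond, because while.cond dominates while.body.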

block and scoping

void Builder::visit(BlockStmt& s) {
    beginScope();
    for (auto& st : s.body) st->accept(*this);
    endScope();
}

A block does not introduce its own basic block. Scope and block are orthogonal — a single { } may contain several BBs (because of an embedded if), and a single BB may span several { } (because the inner block had no control flow). Conflating the two is a beginner's mistake worth flagging.

Variables declared inside { } are recorded in a local scope stack used only for the isLocal predicate. No backing storage is emitted; the local is a named operand.

return

void Builder::visit(ReturnStmt& s) {
    if (ctx().isScript) { error(s.line, "'return' outside a function"); return; }
    Operand v = s.value ? eval(*s.value) : Operand::none();
    emitReturn(v, s.line);
}

We deliberately reject top-level returns even though the resolver probably already did — defence in depth.

print

void Builder::visit(PrintStmt& s) {
    Operand v = eval(*s.expr);
    emit({Op::Print, Operand::none(), {v}, "", -1, -1, s.line});
}

print is the language's only built-in I/O operation, so it gets a dedicated opcode rather than going through Call. Treating it as Call @print would be more uniform but would force the interpreter and the LLVM backend to special-case the name later. A dedicated opcode is more honest.

fn declarations

A fn declaration opens a new Function, lowers its body, and queues the function into nestedFns_ to be appended to the module after the script. Nested fns (declared inside another fn) are rejected — they'd require closure capture, which lands in cp-12.

Step 6 — Short-circuit and a preview of phi

a && b and a || b are not arithmetic operators — they have short-circuit semantics. b is only evaluated when the result is not yet determined by a. That's a control-flow property; expressing it as a regular binary op would either evaluate both eagerly (wrong) or require a magic "lazy" flag (gross).

The right move is to lower logical operators into the same shape as a hand-written if:

result = a;
if ( result_truthy)  result = b;          // for &&
if (!result_truthy)  result = b;          // for ||
use result

…which, expressed in TAC blocks, looks like this:

        ┌── cjmp slot ──→ and.eval ──jmp──┐
pre ────┤   (true case)                   ▼
        └─── (false case) ──────────► and.join
                                       (slot = a && b)

The implementation

void Builder::visit(LogicalExpr& e) {
    Operand lhs  = eval(*e.lhs);
    Operand slot = freshTemp();
    emit({Op::Move, slot, {lhs}, "", -1, -1, e.line});

    // Read each id right away; a later newBlock() may invalidate the reference.
    int evalId = fn().newBlock(e.op == TokenKind::AmpAmp ? "and.eval" : "or.eval").id;
    int joinId = fn().newBlock(e.op == TokenKind::AmpAmp ? "and.join" : "or.join").id;

    if (e.op == TokenKind::AmpAmp) emitCondJump(slot, evalId, joinId, e.line);
    else                            emitCondJump(slot, joinId, evalId, e.line);

    setBlock(evalId);
    Operand rhs = eval(*e.rhs);
    emit({Op::Move, slot, {rhs}, "", -1, -1, e.line});
    emitJump(joinId, e.line);

    setBlock(joinId);
    result_ = slot;
}

The crucial detail: both predecessors of the join block write to the same temp slot. That's why we treat the temp as a write-many named slot in cp-08 — temps are not yet SSA values.

Preview: this is where phi nodes come in

In SSA form, every value must have exactly one definition. Our slot violates that — it's defined twice, once on each path into the join. The classical fix is the phi node, a pseudo-instruction at the top of a block that selects a value based on which predecessor arrived:

join:
    slot = phi [lhs_value, pre], [rhs_value, and.eval]

cp-09's SSA construction pass will scan for our slot pattern and emit exactly such a phi. In fact, every multi-write temp our cp-08 builder produces — including locals — becomes either a phi or gets folded into a single dominating definition.
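The "select by predecessor" behaviour of a phi is small enough to model in a few lines. This is a toy model for intuition only: values are plain ints, and Phi, PhiIncoming, and evalPhi are names invented for the sketch.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// One incoming pair: "if control arrived from predBlock, the phi yields value".
struct PhiIncoming { int predBlock; int value; };
struct Phi { std::vector<PhiIncoming> incoming; };

// Evaluate a phi given the block that control just came from.
int evalPhi(const Phi& phi, int cameFrom) {
    for (const PhiIncoming& in : phi.incoming)
        if (in.predBlock == cameFrom) return in.value;
    throw std::logic_error("phi has no entry for this predecessor");
}
```

For a && b with pre as block 0 and and.eval as block 1, the join's phi yields the lhs value when control arrives from block 0 and the rhs value when it arrives from block 1, which is what the two Move instructions into slot encode today.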

For now we model the join with explicit moves because it lets cp-08 stay strictly imperative. No worklist algorithms, no dominator trees, no iterated dominance frontier. Those are cp-09's burden, and seeing the pre-SSA form here is what makes their effect legible later.

Why not just do SSA construction now?

We considered it. But SSA construction is genuinely intricate (Cytron's algorithm depends on dominator trees, which depend on the CFG, which depends on lowering being done…), and lumping the two concepts into one lab would obscure both. Splitting them across labs gives a cleaner narrative: cp-08 builds the grammar; cp-09 imposes the invariant.

Step 7 — Printer and debugging

A pretty-printer is the single highest-leverage tool in any compiler codebase. It is the difference between guessing what your IR looks like and seeing it. Every test in this lab compares the printed IR against expected substrings; every pass in cp-09+ will use the printer to log "before" / "after" snapshots.

The format

fn @<name>(<params>) {
bb0 (<label>):
    <instr>
    <instr>
    ...
bb1 (<label>):
    ...
}

  • Functions are introduced with fn @name(...) (the @ sigil mirrors LLVM globals).
  • Parameters are named operands: %a, %b.
  • Blocks render their label in parens for readability; the integer id is the primary identifier.
  • Instructions are indented four spaces.
  • No blank lines inside a function, one blank line between functions.

Operand syntax

| Form | Notation |
| --- | --- |
| Temp | t<n> |
| Named | %<name> |
| Global ref | @<name> |
| Constant int | 42 |
| Constant str | "hello" |
| Constant nil | nil |
| None | _ |

Constants delegate to Value::toString(), the same formatter the cp-07 VM uses for print. That gives us a single source of truth for literal representation.
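A minimal operand printer following the table above might look like this. It is a sketch: constVal is narrowed to a long instead of the lab's Value (so only integer constants are covered), and the free function toString is an assumed name.

```cpp
#include <cassert>
#include <string>
#include <utility>

struct Operand {
    enum class Kind { None, Temp, Constant, Named };
    Kind        kind     = Kind::None;
    int         tempId   = -1;  // Temp
    long        constVal = 0;   // Constant (stand-in for Value)
    std::string name;           // Named, stored without the % sigil

    static Operand none() { return Operand{}; }
    static Operand temp(int id) { Operand o; o.kind = Kind::Temp; o.tempId = id; return o; }
    static Operand constant(long v) { Operand o; o.kind = Kind::Constant; o.constVal = v; return o; }
    static Operand named(std::string n) { Operand o; o.kind = Kind::Named; o.name = std::move(n); return o; }
};

// Render one operand using the notation from the table.
std::string toString(const Operand& o) {
    switch (o.kind) {
    case Operand::Kind::None:     return "_";
    case Operand::Kind::Temp:     return "t" + std::to_string(o.tempId);
    case Operand::Kind::Constant: return std::to_string(o.constVal);
    case Operand::Kind::Named:    return "%" + o.name;
    }
    return "?";  // unreachable for valid Kind values
}
```

Keeping the sigil out of the stored name and adding it only in the printer means lookups by name never have to strip it again.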

Instruction syntax

t0 = add %x, 1                  binop with explicit dst, two srcs
%x  = 1                         move into named local
stg @x, t1                      store to global
t2  = ldg @x                    load from global
print t0                        side-effect, no dst
t0 = call @add(3, 4)            direct call
t0 = call <indirect>(t1)        indirect call (cp-12)
cjmp t0, bb3, bb4               conditional branch
jmp  bb5                        unconditional branch
ret  t0                         return with value
ret                             return without value

We chose = over := because it matches LLVM textual IR and reads more naturally. Comparisons render with mnemonic ops (lt, ge) rather than C-style symbols (<, >=) so that print a < b doesn't get confusing.

Why this matters

When cp-09's mem2reg pass turns

%x = 1
%x = add %x, 1
print %x

into

t10 = 1
t11 = add t10, 1
print t11

we want that diff to be a one-line change in a golden test. String-level printer assertions are a coarse tool but they catch regressions in lowering exactly when humans care about them — when the printed IR changes shape.

Debugging tactics

  • Pipe through mltac. echo '...' | ./build/mltac is the fastest feedback loop for "what does this lower to?".
  • Look at the unreachable blocks. Stray unreachable: blocks in the output often indicate the lowering forgot to advance to a join block — a sign of a missing setBlock(joinId).
  • Check the label hints. if.cont, while.body, and.join are deliberately chosen to make IR readable in the absence of source lines. If you see unreachable where you expected if.cont, the order of operations is off.

The printer has no semantic content — it's pure formatting. But it is the most-read file in the IR layer. Spend time on it. Future you will be grateful.

cp-09 — SSA Construction & Optimisation Passes

Status: ✅ Built (41/41 checks)

This lab takes the Three-Address Code from cp-08 and adds the middle-end: an IR interpreter that can execute TAC directly, a small pass pipeline (constant folding + propagation, dead-code elimination, CFG simplification), and a fix-point driver. The conceptual material covers full SSA — phi nodes, dominance, mem2reg — even though the implementation stops at a simpler whole-function constant-propagation scheme. The progression to real SSA happens when we hand off to LLVM in cp-10/11.

Layout

File                        Purpose
src/ir.{hpp,cpp}            TAC IR types (unchanged from cp-08).
src/ir_builder.{hpp,cpp}    AST → TAC lowering. Fixed dangling-reference bug from cp-08 (newBlock returns a vector reference invalidated by subsequent inserts; capture .id immediately).
src/ir_printer.{hpp,cpp}    Module pretty-printer.
src/passes.{hpp,cpp}        constFold, dce, simplifyCFG, runAll.
src/ir_interp.{hpp,cpp}     Direct interpreter for the IR; the semantics oracle.
src/main.cpp                mlopt driver: prints before/after IR, then runs.
tests/test_passes.cpp       11 tests / 41 checks.
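
The dangling-reference fix noted for ir_builder can be illustrated with a small sketch. Block, Function, newBlock, and makeIfBlocks below are simplified stand-ins, not the lab's actual definitions; the only point is that a reference into a std::vector may be invalidated by the next push_back, while a copied id stays valid.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for the lab's IR types (assumed shapes).
struct Block { int id; std::string label; };

struct Function {
    std::vector<Block> blocks;

    // Returns a reference into `blocks`. The reference is only safe to use
    // until the next call that grows the vector.
    Block& newBlock(const std::string& label) {
        blocks.push_back(Block{static_cast<int>(blocks.size()), label});
        return blocks.back();
    }
};

struct IfBlocks { int thenId; int elseId; int joinId; };

// Safe pattern: copy the id out immediately instead of holding a Block&.
IfBlocks makeIfBlocks(Function& fn) {
    int thenId = fn.newBlock("if.then").id;   // copy the id, drop the reference
    int elseId = fn.newBlock("if.else").id;   // this call may reallocate `blocks`
    int joinId = fn.newBlock("if.cont").id;
    assert(fn.blocks[thenId].label == "if.then");  // lookup by index stays valid
    return IfBlocks{thenId, elseId, joinId};
}
```

Had makeIfBlocks kept `Block& thenBlock = fn.newBlock(...)` and read it after the later calls, the behaviour would be undefined whenever the vector reallocated.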

Pipeline

source → tokens → AST → resolver → typechecker → IR (TAC) →
   [print "before"] → constFold → dce → simplifyCFG → [fixpoint] →
   [print "after"]  → IR interpreter → stdout

Build & run

cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build --output-on-failure

CLI:

./build/mlopt examples/loop.ml          # print before/after IR + run
./build/mlopt --quiet  examples/loop.ml # just run
./build/mlopt --no-run examples/loop.ml # IR only

What each pass does

  • constFold — substitute single-def tN = <const> chains, fold binary/unary on constant operands into Move, collapse cjmp on a constant condition into jmp.
  • dce — drop pure value-producing instructions whose temp dst is never used. print, stg, call, and terminators are always preserved (no effect-purity analysis in this lab).
  • simplifyCFG — flood-fill reachability from bb0, drop unreachable blocks. After a cjmp is rewritten to a jmp, this is what actually deletes the abandoned branch.

The driver alternates them until a full sweep changes nothing.

Conceptual highlights (see steps/)

  • SSA as the universal IR shape used by LLVM, V8 TurboFan, HotSpot, GraalVM, JavaScriptCore, and Cranelift, and why every modern compiler bottoms out in dominance/dominance-frontier algorithms.
  • mem2reg as the canonical SSA-construction algorithm: every named local becomes a series of versioned values, joined by phi nodes at dominance frontiers.
  • Why we stopped short of mem2reg here: classical SSA construction requires dominator computation, dominance-frontier sets, iterated insertion, and renaming. The book-keeping is well-documented (Cytron et al. 1991) and worth implementing, but it costs much more code than our optimisation set needs. The constant-propagation scheme we use here gets most of the practical wins for that set (our rough estimate is ~80%; we have not measured it).

Test coverage

  • Constant folding of arithmetic and comparisons.
  • DCE preserves side effects (print, call, stg).
  • simplifyCFG drops unreachable: blocks emitted after return.
  • Branches on constants collapse to a single arm.
  • Interpreter executes loops, function calls, mutual recursion, short-circuit operators.
  • Round-trip property: running the interpreter before and after the pass pipeline produces identical output.

Step 1 — Why an optimising middle-end

A compiler that emits unoptimised IR straight to a backend gets you correct code; that is all. Everything good about a modern compiler — inlining, escape analysis, vectorisation, devirtualisation — happens in the middle-end, after the frontend has built an IR and before the backend lowers it to machine code.

The single most important pattern

IR_in  →  pass_1  →  IR_intermediate  →  pass_2  →  IR_intermediate' → ...

Each pass is a function IR → IR that preserves the program's observable behaviour. The middle-end's job is to thread enough passes that the IR that comes out the other side is small, dense, and easy for the backend.

A pass manager schedules and re-runs passes. LLVM's new pass manager (the default for the optimisation pipeline since LLVM 13) tracks which analyses each pass preserves or invalidates; we use a much simpler "run everything to a fixed point" loop. Both designs share the same invariants:

  • Passes do not change observable behaviour (no I/O reordering, no new observable allocations, no new exceptions).
  • Passes are robust: unusual but valid input (e.g. a cjmp on a constant) should be cleaned up, not crashed on.
  • Passes are composable: every pass accepts any valid IR and emits valid IR, so passes can run in any order and any number of times without breaking the program.

Why ordering matters

Our pipeline runs constFold → dce → simplifyCFG, and each pass creates work for the others:

  • constFold rewrites cjmp t0, bb1, bb2 (with t0 constant) to jmp bb1. The bb2 block is now unreachable, but that fact isn't visible inside constFold.
  • simplifyCFG drops the unreachable block, along with every instruction in it. Temps defined in surviving blocks but read only inside the dropped block now have no uses.
  • dce drops those orphaned defs. Because dce runs before simplifyCFG within a sweep, this happens on the next sweep.
  • Some of those defs were the only uses of upstream temps. So we run the whole pipeline again. Repeat to fixed point.

The compiler literature calls this the "phase-ordering problem". LLVM ships hundreds of passes, and finding a good static order for them is genuinely hard. Our loop is the brute-force version: keep going until nobody changes anything.

What we'll cover

  • Step 2 — How an IR interpreter works and why it's the semantic oracle.
  • Step 3 — SSA: what it is, why every modern compiler uses it, why we don't fully implement it here.
  • Step 4 — Constant folding + propagation, including the single-def temp substitution trick that approximates SSA.
  • Step 5 — Dead-code elimination and why purity matters.
  • Step 6 — CFG simplification.
  • Step 7 — The pass-manager loop and how to think about phase ordering.

Step 2 — IR interpreter as oracle

Before writing any optimisation pass, we built an IR interpreter. That is the most important file in src/: ir_interp.cpp.

Why interpret IR at all?

In production, IR is consumed by a backend that lowers it to machine code. We are not yet ready to do that — cp-10 introduces LLVM IR emission, cp-11 native codegen. But to test our middle-end now, we need a way to ask "does this IR module mean the same thing it did before I ran the passes?"

That is the role of runProgram(module). Given an IR module, it executes it and returns the stream of print outputs. Two invocations on semantically-equivalent modules must produce identical strings.

This is the strongest test we can write for a pass. It avoids the "golden file" trap (matching exact instruction sequences is brittle: any harmless permutation breaks the test) and instead checks the only thing that matters — meaning.

Implementation shape

struct Interp {
    const Module& mod;
    unordered_map<string, Value> globals;
    ostringstream out;
    Value callFunction(const Function& fn, const vector<Value>& args);
};

A Frame is conjured per call: temps map from tempId → Value, named locals from string → Value. Globals live on the interpreter.

Execution is a while (true) over blocks. Within a block we walk instrs sequentially:

  • Value-producing ops (add, lt, move, neg …) compute a Value and write to the dst operand.
  • print formats the operand and appends to out.
  • ldg / stg read or write a global.
  • call looks up a function by name in the module, recursively invokes callFunction, and stores the result in the dst.
  • jmp / cjmp set currentId and goto next_block.
  • ret returns from callFunction.

A safety budget (safety = 1e6 instructions) prevents tests from hanging on infinite loops — see test test_const_fold_comparison which would have spun forever if we hadn't fixed the IR-builder bug.
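
The execution loop described above can be sketched on a toy IR. This is not the lab's ir_interp.cpp: the Op, Instr, Block, and run names are invented for the example, temps are plain integers, and calls, globals, and named locals are left out. It only shows the block-at-a-time loop and the instruction budget.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy IR, much smaller than the lab's: integer temps only.
enum class Op { Const, Add, Lt, Print, Jmp, CJmp, Ret };

struct Instr {
    Op op;
    int dst = -1;            // temp written by Const/Add/Lt
    int a = 0, b = 0;        // Const: a is the literal. Add/Lt: a, b are temps.
                             // Print/CJmp: a is a temp. Jmp: a is a block index.
    int bbT = -1, bbF = -1;  // CJmp targets (block indices)
};

struct Block { std::vector<Instr> instrs; };

// Runs the blocks starting at index 0 and returns everything printed, one
// value per line. Appends "<budget exhausted>" if the budget runs out.
std::string run(const std::vector<Block>& blocks, long budget = 1000000) {
    std::unordered_map<int, long long> temps;
    std::string out;
    std::size_t current = 0;
    while (true) {
        assert(current < blocks.size());
        bool jumped = false;
        for (const Instr& ins : blocks[current].instrs) {
            if (--budget < 0) return out + "<budget exhausted>\n";
            switch (ins.op) {
                case Op::Const: temps[ins.dst] = ins.a; break;
                case Op::Add:   temps[ins.dst] = temps[ins.a] + temps[ins.b]; break;
                case Op::Lt:    temps[ins.dst] = temps[ins.a] < temps[ins.b]; break;
                case Op::Print: out += std::to_string(temps[ins.a]) + "\n"; break;
                case Op::Jmp:   current = static_cast<std::size_t>(ins.a); jumped = true; break;
                case Op::CJmp:  current = static_cast<std::size_t>(temps[ins.a] ? ins.bbT : ins.bbF);
                                jumped = true; break;
                case Op::Ret:   return out;
            }
            if (jumped) break;   // a terminator ends the block
        }
        if (!jumped) return out; // block had no terminator: stop quietly
    }
}
```

Returning the printed text as a string, rather than writing to stdout, is what lets a test compare the output before and after the passes run.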

Why named operands work

A typical SSA interpreter only has temps. Ours has both temps and named locals (%i, %x) because cp-08's lowering keeps source-level variables as memory cells. That's deliberate: cp-09's passes never need to reason about them.

When we move to LLVM in cp-10, those named locals become allocas and LLVM's own mem2reg pass converts them into SSA temps. We simulate the same final result by using the named-local convention as a "loadable storage slot" model.

How tests use the interpreter

auto preOut  = ir::runProgram(module);   // before any pass
ir::runAll(module);                       // mutate
auto postOut = ir::runProgram(module);   // after all passes
CHECK(preOut.output == postOut.output);

If a pass ever breaks semantics, that assertion fails before the golden-string checks do. It is the single most valuable assertion in the test file.

Step 3 — SSA: the universal IR shape

Static Single Assignment is the IR shape that essentially every modern optimising compiler uses internally. LLVM IR is SSA. HotSpot's C2 is SSA. V8's TurboFan is SSA (Sparkplug, V8's baseline tier, has no IR at all and compiles bytecode straight to machine code). Cranelift is SSA. WebKit's B3 is SSA. MLIR is SSA. Even GCC, which predates LLVM, adopted SSA: GCC 4.0 introduced Tree SSA on top of GIMPLE.

The core idea

Every variable is assigned exactly once. If source code re-assigns, we create a new SSA name:

// source                IR (SSA)
x = 1;                   %x.1 = 1
x = x + 2;               %x.2 = add %x.1, 2
print x;                 print %x.2

Every %name has exactly one definition. That is the magic property: to find the value of %x.2, you don't need to scan blocks looking for the most recent assignment, because there is no "most recent"; there is only the one. Questions that a non-SSA IR answers with a backwards walk, such as "where was this value defined?", become a one-step lookup in SSA.
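
For straight-line code, the renaming can be sketched in a few lines. The Stmt struct and toSSA function are hypothetical names for this example; the sketch handles a single basic block only and inserts no phi nodes.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// One statement in a single basic block: `dst` is the variable written
// (empty for statements like print), `uses` are the variables read.
struct Stmt {
    std::string dst;
    std::vector<std::string> uses;
};

// Rewrites each use to the latest version of its variable and gives each
// definition a fresh version, so every resulting name is defined once.
std::vector<Stmt> toSSA(const std::vector<Stmt>& in) {
    std::unordered_map<std::string, int> version;   // variable -> latest version
    std::vector<Stmt> out;
    for (const Stmt& s : in) {
        Stmt r;
        // Rename uses first: in `x = x + 2` the right-hand x is the old version.
        for (const std::string& u : s.uses) {
            auto it = version.find(u);
            assert(it != version.end() && "use before definition");
            r.uses.push_back(u + "." + std::to_string(it->second));
        }
        if (!s.dst.empty()) {
            int v = ++version[s.dst];               // first definition is .1
            r.dst = s.dst + "." + std::to_string(v);
        }
        out.push_back(r);
    }
    return out;
}
```

Feeding it the three statements from the example (x = 1; x = x + 2; print x) yields definitions x.1 and x.2, with the print reading x.2.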

The hard part: control flow

if (cond) { x = 1; } else { x = 2; }
print x;

In the join block, what is x? Neither %x.1 nor %x.2 alone. SSA solves this with phi nodes:

bb_then:  %x.1 = 1; jmp bb_join
bb_else:  %x.2 = 2; jmp bb_join
bb_join:  %x.3 = phi [%x.1, bb_then], [%x.2, bb_else]
          print %x.3

A phi instruction at the top of a block selects one of its incoming values according to which predecessor control arrived from. It is the SSA analogue of "the variable's most recent definition".

Where do phi nodes go?

The classical answer is the dominance frontier:

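A block B is in the dominance frontier of A if A dominates a predecessor of B but does not strictly dominate B. (The word "strictly" matters: it lets a loop header sit in its own dominance frontier.)

Every block where two control paths from different definitions can join is in some definition's dominance frontier, and that's exactly where we need a phi. Each inserted phi is itself a new definition, so placement repeats until no more phis are added; the result is the iterated dominance frontier.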

The Cytron et al. (1991) algorithm computes dominance frontiers, then inserts phi nodes, then renames every variable use to refer to its unique SSA definition. This is mem2reg in LLVM terminology — "promote memory locations (allocas) into SSA registers".
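
To make the dominance-frontier definition concrete, here is a small sketch that computes it straight from the definition, using the standard formulation in which A must not strictly dominate B. It uses plain iterative dominator sets rather than the Cytron or Cooper-Harvey-Kennedy algorithms, so it is cubic and only suitable for whiteboard-sized CFGs. The Preds alias and both function names are made up for the example.

```cpp
#include <cassert>
#include <set>
#include <vector>

// CFG given as predecessor lists; block 0 is the entry.
// Assumes every block is reachable from the entry.
using Preds = std::vector<std::vector<int>>;

// dom[n] = set of blocks that dominate n (iterative data-flow formulation).
std::vector<std::set<int>> dominators(const Preds& preds) {
    const int n = static_cast<int>(preds.size());
    assert(n > 0 && "CFG needs an entry block");
    std::set<int> all;
    for (int i = 0; i < n; ++i) all.insert(i);
    std::vector<std::set<int>> dom(n, all);
    dom[0] = {0};
    for (bool changed = true; changed;) {
        changed = false;
        for (int b = 1; b < n; ++b) {
            std::set<int> next = all;
            for (int p : preds[b]) {              // intersect over predecessors
                std::set<int> tmp;
                for (int d : next) if (dom[p].count(d)) tmp.insert(d);
                next = tmp;
            }
            next.insert(b);                       // a block dominates itself
            if (next != dom[b]) { dom[b] = next; changed = true; }
        }
    }
    return dom;
}

// DF(a) = { b : a dominates a predecessor of b, and a does not strictly dominate b }.
std::vector<std::set<int>> dominanceFrontiers(const Preds& preds) {
    const int n = static_cast<int>(preds.size());
    std::vector<std::set<int>> dom = dominators(preds);
    std::vector<std::set<int>> df(n);
    for (int a = 0; a < n; ++a)
        for (int b = 0; b < n; ++b) {
            bool strictlyDominates = (a != b) && dom[b].count(a) > 0;
            if (strictlyDominates) continue;
            for (int p : preds[b])
                if (dom[p].count(a)) { df[a].insert(b); break; }
        }
    return df;
}
```

On the if/else diamond (entry 0, arms 1 and 2, join 3) both arms get the join block as their frontier. On a simple loop (entry 0, header 1, body 2, exit 3) the header ends up in its own frontier, which is why a loop-carried variable needs a phi at the header.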

Why we're not implementing it here

Full mem2reg is ~600 lines including dominator-tree construction (Lengauer–Tarjan or Cooper-Harvey-Kennedy iterative), DF computation, phi insertion, and the dominator-tree renaming walk. It is a worthy exercise, and the standard reference implementation (llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp in the LLVM source tree) is the right thing to read.

Our constFold cheats: it requires that a temp has exactly one definition (which TAC builder-produced IR already satisfies for temps, since each tN is fresh), and propagates only those. That gets us constant folding through binary-op chains without ever computing a dominator tree.

When we move to LLVM in cp-10/11, mem2reg comes for free as a built-in optimisation pass. The point of this step is conceptual: you should be able to draw the dominance frontier of an arbitrary CFG on a whiteboard, and you should know what a phi node is and where it goes.

Practical SSA in real compilers

  • LLVM: SSA from the frontend onwards; mem2reg promotes allocas; instcombine, GVN, EarlyCSE, SROA all assume SSA.
  • HotSpot C2: the "Ideal Graph" is a sea-of-nodes SSA variant.
  • V8 TurboFan: sea-of-nodes graph IR; SSA invariant. (Sparkplug, the baseline tier, has no IR.)
  • Cranelift: strict SSA, block parameters instead of phi nodes (semantically equivalent but easier to manipulate).
  • GCC: GIMPLE → tree-SSA → RTL; SSA arrived in mainline with GCC 4.0.

References

  • Cytron, Ferrante, Rosen, Wegman, Zadeck (1991): Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.
  • Cooper, Harvey, Kennedy (2001): A Simple, Fast Dominance Algorithm. The iterative version everyone now uses.
  • Braun et al. (2013): Simple and Efficient Construction of Static Single Assignment Form. Constructs SSA on the fly during IR generation; Cranelift's frontend takes this approach. (Rust's MIR, by contrast, is not SSA.)

Step 4 — Constant folding and propagation

What it is

Constant folding evaluates expressions at compile time when all the inputs are known:

t0 = add 3, 4         →     t0 = 7
t1 = lt 10, 20        →     t1 = true
cjmp t1, bb1, bb2     →     jmp bb1

Constant propagation takes folded constants and substitutes them forward, so the next pass can fold further:

t0 = 7
t1 = add t0, 1         →   (after propagation) t1 = add 7, 1
                       →   (after folding)     t1 = 8

These two are conceptually one optimisation but practically two steps: folding uses what's already constant, propagation makes more things constant. Our constFold runs both inside a single pass, propagation first.

Why it's the first pass everyone implements

  • Cheap. Linear scan; no analyses required.
  • Cascades. Folding one operation usually enables folding several more — the compiler equivalent of compound interest.
  • High value. Source code is rich in constants: (width * 4 + padding * 2) / 8 — every operation here can fold once the user picks values.
  • Foundation for everything. Inlining + constant folding + propagation is how you get the C++ template-meta-programming style of zero-cost abstraction.

Implementation walkthrough

bool constFold(Function& fn) {
    bool changed = false;
    // Step 1: propagate.
    unordered_map<int, Value> constMap;
    buildConstMap(fn, constMap);          // tN -> Value if sole def is Move-of-const
    for (auto& bb : fn.blocks)
        for (auto& ins : bb.instrs)
            for (auto& s : ins.srcs)
                if (substituteOperand(s, constMap)) changed = true;

    // Step 2: fold what's now foldable.
    for (auto& bb : fn.blocks) {
        for (auto& ins : bb.instrs) {
            // binary, unary, cjmp-on-const ...
        }
    }
    return changed;
}

The crucial check is buildConstMap: a temp is eligible for substitution only if it has exactly one definition. In real SSA that's automatic. In our TAC IR it's true for builder-generated temps but we double-check because future passes might break the invariant.
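
The single-definition check can be sketched as follows. The Def struct is a deliberately reduced stand-in for an instruction (the lab's buildConstMap walks real Instr objects and has a different signature); what carries over is the two-step shape: count definitions first, then admit only temps whose sole definition is a move of a constant.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Toy view of an instruction that writes a temp: either `tDst = <constant>`
// (isConstMove == true) or any other definition with an unknown value.
struct Def {
    int dst;
    bool isConstMove;
    long long value;   // meaningful only when isConstMove
};

// Maps temp id -> constant for temps that are defined exactly once, and
// only when that one definition is a move of a constant.
std::unordered_map<int, long long> buildConstMap(const std::vector<Def>& defs) {
    std::unordered_map<int, int> defCount;
    for (const Def& d : defs) {
        assert(d.dst >= 0 && "definitions must write a temp");
        ++defCount[d.dst];
    }
    std::unordered_map<int, long long> constMap;
    for (const Def& d : defs) {
        if (!d.isConstMove) continue;          // value unknown at compile time
        if (defCount[d.dst] != 1) continue;    // redefined elsewhere: unsafe
        constMap[d.dst] = d.value;
    }
    return constMap;
}
```

A temp that is assigned two different constants is rejected even though each assignment is a constant move, because without control-flow information there is no way to tell which one reaches a given use.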

What we don't do (yet)

  • Algebraic identities: x + 0 → x, x * 1 → x, x - x → 0. Textbooks file these under algebraic simplification (strength reduction is the related trick of swapping an expensive op for a cheaper one, such as x * 2 → x << 1), and they're trivial to add to the fold loop.
  • Comparison simplification: x < x → false, x == x → true (for integers).
  • GVN (Global Value Numbering): recognising that t1 = add a, b and t2 = add a, b always compute the same value, so t2 can be replaced with t1.
  • Sparse conditional constant propagation (SCCP): Wegman & Zadeck (1991). Folds across control flow: if every reaching def of x along executable paths is the same constant, treat x as that constant.

LLVM's InstCombine is the production-grade version of this pass: tens of thousands of lines spread across several source files, and still among the most frequently modified code in the LLVM tree.
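
As an illustration of how little code the algebraic identities need, here is a sketch for integer operands. Term, BinOp, and simplifyIdentity are hypothetical names, not part of the lab's passes.cpp, and the rules assume integer semantics (x - x → 0 does not hold for floating point, where x may be NaN or infinite).

```cpp
#include <cassert>
#include <optional>
#include <string>

// Toy operand: an integer constant or a named value such as a temp.
struct Term {
    bool isConst;
    long long value;    // valid when isConst
    std::string name;   // valid when !isConst
};

enum class BinOp { Add, Sub, Mul };

bool sameValue(const Term& a, const Term& b) {
    if (a.isConst != b.isConst) return false;
    return a.isConst ? a.value == b.value : a.name == b.name;
}

// Returns the simplified operand if an identity applies, otherwise nullopt.
std::optional<Term> simplifyIdentity(BinOp op, const Term& lhs, const Term& rhs) {
    assert(lhs.isConst || !lhs.name.empty());   // named values must carry a name
    assert(rhs.isConst || !rhs.name.empty());
    auto isLit = [](const Term& t, long long v) { return t.isConst && t.value == v; };
    switch (op) {
        case BinOp::Add:
            if (isLit(rhs, 0)) return lhs;                       // x + 0 -> x
            if (isLit(lhs, 0)) return rhs;                       // 0 + x -> x
            break;
        case BinOp::Sub:
            if (isLit(rhs, 0)) return lhs;                       // x - 0 -> x
            if (sameValue(lhs, rhs)) return Term{true, 0, ""};   // x - x -> 0
            break;
        case BinOp::Mul:
            if (isLit(rhs, 1)) return lhs;                       // x * 1 -> x
            if (isLit(lhs, 1)) return rhs;                       // 1 * x -> x
            break;
    }
    return std::nullopt;
}
```

In a fold loop, a non-empty result would replace the binary instruction with a Move of the returned operand.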

The cjmp rewrite is special

cjmp <const>, bbT, bbF
   → jmp bbT     // if const is truthy
   → jmp bbF     // otherwise

This is the single most important fold for any compiler because it exposes dead code to simplifyCFG. if (false) { huge_block } disappears entirely once cjmp folds and simplifyCFG drops the unreachable side.

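This pattern (fold the predicate, then simplify the CFG) is how an optimiser deletes branches guarded by compile-time constants, for example in generic code after monomorphisation in Rust. C++17's if constexpr reaches the same result earlier: the frontend discards the untaken branch before any IR exists.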

Step 5 — Dead-code elimination

DCE removes instructions whose results are not used. The principle is simple but the bookkeeping is subtle.

The naïve version

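bool dce(Function& fn) {
    // Pass 1: collect every temp that is read anywhere in the function.
    unordered_set<int> used;
    for (auto& bb : fn.blocks)
        for (auto& ins : bb.instrs)
            for (auto& s : ins.srcs)
                if (s.isTemp()) used.insert(s.tempId);

    // Pass 2: erase pure instructions whose temp dst is never read.
    bool changed = false;
    for (auto& bb : fn.blocks) {
        auto& v = bb.instrs;
        auto endIt = remove_if(v.begin(), v.end(), [&](const Instr& ins) {
            if (!isPureValueOp(ins.op)) return false;
            if (!ins.dst.isTemp())     return false;
            return used.count(ins.dst.tempId) == 0;
        });
        if (endIt != v.end()) changed = true;
        v.erase(endIt, v.end());
    }
    return changed;   // the fixed-point driver relies on this
}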

Two checks:

  1. The op must be pure — no side effects. print, stg, call, jmp, cjmp, ret always survive. (We're being conservative on call: in reality some calls are pure and could be elided. Doing that safely requires effect analysis — punt to cp-14.)
  2. The dst must be an unused temp. Named operands are storage; we don't DCE writes to them because they may be loaded later (or externally visible). Globals are out of scope entirely.

What "side effect" really means

A side effect is any change observable outside the local computation:

  • I/O — print, read, file writes, system calls.
  • Mutation of memory aliased by anyone else — globals, fields, references.
  • Synchronisation — atomics, fences, locks.
  • Exceptions / traps — div by zero, signed overflow in some languages, dereferencing null.

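The trickiest is the last: in most languages, integer division is conditionally effectful because it traps on zero. LLVM models this differently: udiv and sdiv by zero are undefined behaviour rather than a defined trap, so an unused division may be deleted, but the optimiser must not hoist or speculate one past a branch unless it can prove the divisor is nonzero. (The exact flag on udiv is unrelated: it asserts that the division leaves no remainder.) The point: "purity" is not a property of the opcode alone but of the opcode and the values that flow through it.

We dodge all of this by treating Div and Mod as pure even though they can trap, which means dce may delete an unused division that would otherwise have faulted at run time. A real compiler would track this in a side-table.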

Why DCE matters

It's not just hygienic cleanup. DCE is what makes other passes affordable. After inlining or partial evaluation, the IR is bloated with intermediate temps that no longer matter. Without DCE every subsequent pass would walk all of that dead material on every iteration. With DCE the IR stays roughly the size of the live program.

LLVM's ADCE (Aggressive DCE) goes further: it starts from the program's observable sinks (ret, stores, calls) and works backwards, considering any instruction not reachable from a sink to be dead. This catches things our naïve forward-pass misses (mutually dead temps that nominally use each other).
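
The backwards, sink-driven idea can be sketched on straight-line code. This is only an illustration of the mark-and-sweep shape, not LLVM's ADCE: the Ins struct and sweepDead are invented names, control flow is ignored, and each temp is assumed to have a single definition.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Toy instruction. `dst` is -1 when no temp is written.
struct Ins {
    int dst;
    std::vector<int> srcs;   // temp ids read
    bool hasSideEffect;      // print, store, call, terminator, ...
};

// Marks side-effecting instructions as roots, marks everything they read
// (transitively) as live, then deletes the rest. Returns how many were removed.
std::size_t sweepDead(std::vector<Ins>& code) {
    std::unordered_map<int, std::size_t> defOf;   // temp id -> defining index
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (code[i].dst < 0) continue;
        assert(defOf.count(code[i].dst) == 0 && "single definition expected");
        defOf[code[i].dst] = i;
    }

    std::vector<bool> live(code.size(), false);
    std::vector<std::size_t> worklist;
    for (std::size_t i = 0; i < code.size(); ++i)
        if (code[i].hasSideEffect) { live[i] = true; worklist.push_back(i); }

    while (!worklist.empty()) {
        std::size_t i = worklist.back();
        worklist.pop_back();
        for (int t : code[i].srcs) {
            auto it = defOf.find(t);
            if (it == defOf.end() || live[it->second]) continue;
            live[it->second] = true;
            worklist.push_back(it->second);
        }
    }

    std::vector<Ins> kept;
    for (std::size_t i = 0; i < code.size(); ++i)
        if (live[i]) kept.push_back(code[i]);
    std::size_t removed = code.size() - kept.size();
    code = kept;
    return removed;
}
```

Given t0 = 1; t1 = add t0, t0; t2 = 5; print t2, a use-count DCE needs two sweeps (t1 first, then t0 once its only reader is gone), whereas the marking version removes both in one call.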

The DCE-CFG interaction

After simplifyCFG drops a block, temps that were read only inside that block have no remaining uses. DCE catches their definitions on the next iteration. This is why our runAll re-runs the whole pipeline to a fixed point.

The inverse interaction also matters: after DCE removes a Move t1 = 7 that nobody read, the constant 7 is no longer propagated anywhere. So the next constFold has nothing new to do — fix point reached.

Step 6 — CFG simplification

After constFold rewrites a cjmp to a jmp, one of the targets usually becomes unreachable. The instructions in that block (prints, function calls, everything) are still there in fn.blocks. They cost nothing at run time, since nothing branches to them, but they bloat the IR, confuse later passes, and confuse humans reading the dump.

simplifyCFG is the pass that throws them away.

The algorithm

unordered_set<int> reachable;
vector<int> stack{ fn.blocks[0].id };
while (!stack.empty()) {
    int id = stack.back(); stack.pop_back();
    if (!reachable.insert(id).second) continue;
    // Push each successor of bb `id`, read from its terminator
    // (jmp: one target, cjmp: two targets, ret: none).
}
// Drop blocks not in `reachable`.

Pure flood-fill from the entry block (always bb0). Successors come from inspecting the terminator of each block — the one place in the IR where control flow is encoded explicitly. That's the entire point of the basic-block model: control flow is structural, not embedded in straight-line instructions.
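
A complete version of the flood-fill, on a reduced block type, might look like this. BB and pruneUnreachable are illustrative names; the terminator is collapsed to a plain successor list, which is an assumption about shape and not the lab's Instr encoding.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

// Toy block: the terminator is reduced to its successor ids
// (none for ret, one for jmp, two for cjmp).
struct BB {
    int id;
    std::vector<int> succs;
};

// Removes blocks unreachable from the entry (blocks[0]) and reports whether
// anything was dropped. Ids are matched by value because they may be sparse.
bool pruneUnreachable(std::vector<BB>& blocks) {
    if (blocks.empty()) return false;
    auto lookup = [&](int id) -> const BB* {
        for (const BB& b : blocks) if (b.id == id) return &b;
        return nullptr;
    };

    std::unordered_set<int> reachable;
    std::vector<int> stack{blocks[0].id};
    while (!stack.empty()) {
        int id = stack.back();
        stack.pop_back();
        if (!reachable.insert(id).second) continue;   // already visited
        const BB* bb = lookup(id);
        assert(bb && "terminator targets a block that does not exist");
        for (int s : bb->succs) stack.push_back(s);
    }

    auto dead = std::remove_if(blocks.begin(), blocks.end(),
        [&](const BB& b) { return reachable.count(b.id) == 0; });
    bool changed = dead != blocks.end();
    blocks.erase(dead, blocks.end());
    return changed;
}
```

The linear lookup keeps the sketch short; a real pass would index blocks by id once up front.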

Other things simplifyCFG could do

We only do reachability pruning. Production-grade implementations also:

  • Block merging. If bb1 ends in jmp bb2 and bb2 has only bb1 as predecessor, merge them.
  • Empty-block bypass. If bb1's only instruction is jmp bb2, rewrite every predecessor of bb1 to target bb2 directly.
  • Jump threading. If bb1 ends in cjmp t, bb2, bb3 and bb2 does nothing but branch on a condition whose value is already known on the edge from bb1 (e.g. cjmp false, ...), retarget bb1 straight to bb2's known successor.
  • Hoisting common code. If both arms of a cjmp begin with the same instruction, hoist it before the branch.

LLVM's SimplifyCFG pass implements all of these and quite a few more.

A subtle correctness rule

Whenever you delete a block, you must also check whether any remaining block referenced it as a successor — and if so, that reference is now invalid. Our reachability flood guarantees this: nothing reaches a deleted block, by definition, so no remaining terminator points to one.

In a more aggressive pass (block merging), you must rewrite phi nodes' incoming-block lists when you merge predecessors. We don't have phi nodes, so we sidestep that landmine. Welcome to mem2reg in cp-10/11.

Renumbering, or not?

We don't renumber block ids after pruning. bb0, bb2, bb4 is fine in the printer — readers know it's a sparse sequence. If we did renumber, every bbT / bbF field in every terminator would need updating. That's fragile and offers no semantic benefit.

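Cranelift takes a similar approach: block ids are stable entity references, so removing a block leaves a gap. LLVM sidesteps the question by identifying blocks by pointer; the numbers shown for unnamed blocks are assigned only when the IR is printed.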

Composition with other passes

loop {
    a = constFold(fn);       // may rewrite cjmp → jmp
    b = dce(fn);             // may drop now-unused temps
    c = simplifyCFG(fn);     // may drop now-unreachable blocks
    if (!a && !b && !c) break;
}

The order matters: constFold first (creates work for the others), then dce, then simplifyCFG. Swap any pair and you'll get the same fixed point eventually, but more iterations to reach it.

This kind of phase ordering is one of the long-standing research problems in compiler engineering — see Whitfield & Soffa (1997) for a formal treatment.

Step 7 — Pass manager and phase ordering

A pass manager is the small piece of glue that decides which passes to run when. Ours is twenty lines:

PassStats runAll(Module& m) {
    PassStats st;
    for (auto& f : m.functions) {
        for (int i = 0; i < 16; ++i) {
            ++st.iterations;
            bool a = constFold(*f);
            bool b = dce(*f);
            bool c = simplifyCFG(*f);
            if (!a && !b && !c) break;
            ...
        }
    }
    return st;
}

That is enough. But it raises three real questions.

1. Why fixed-point at all?

Each pass can enable the next:

  • constFold rewrites cjmp t, bbT, bbF to jmp bbT when t is a known constant.
  • simplifyCFG drops the now-unreachable bbF.
  • dce drops the temps that were only used inside bbF.
  • Some of those temps' definitions were the only uses of even earlier temps. Drop them too.
  • Now constFold may see a new chain of single-def Moves. Re-run.

The fuel cap (i < 16) is paranoia: if some pass interaction caused oscillation, we'd notice in tests rather than hanging. In practice real programs reach fix point in 1–3 iterations.
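
The same driver shape can be written generically. In this sketch runToFixedPoint is a hypothetical helper, not the lab's runAll; it takes any IR type and a list of passes that each return true when they changed something.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Runs every pass in order, repeating whole sweeps until one sweep reports
// no change or the fuel runs out. Returns the number of sweeps executed.
template <typename IR>
int runToFixedPoint(IR& ir,
                    const std::vector<std::function<bool(IR&)>>& passes,
                    int fuel = 16) {
    assert(fuel > 0);
    int sweeps = 0;
    while (sweeps < fuel) {
        ++sweeps;
        bool changed = false;
        for (const auto& pass : passes)
            changed = pass(ir) || changed;   // call first so no pass is skipped
        if (!changed) break;
    }
    return sweeps;
}
```

Writing `changed = changed || pass(ir)` instead would short-circuit and silently skip every pass after the first one that reports a change, which is an easy bug to introduce when refactoring the three-variable form.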

2. Why this order?

constFold → dce → simplifyCFG puts the creator of opportunities first and the consumer last. The reverse order would still reach the fixed point but slower:

Order                       Iterations on if(10<20) { print "a"; }
constFold, dce, simplify    2
simplify, dce, constFold    3

Multiply by every program in the test suite and the difference compounds. LLVM's pass pipelines (O0, O1, O2, O3, Os) are extensively tuned for exactly this kind of ordering.

3. Why not analysis caching?

LLVM's "new pass manager" tracks analyses — dominator trees, alias sets, loop info — separately from transforms. Analyses are computed lazily and invalidated when a transform mutates the IR. This avoids re-computing a dominator tree just because some unrelated pass tweaked a branch.

Our compiler has no analyses to cache (no dominator tree, no alias info), so we get away without the machinery. When cp-11/12 introduce LLVM properly, the new pass manager is what schedules everything.

How real compilers structure this

  • LLVM opt -O2 runs a sequence of ~70 passes including InstCombine (constant folding + tons of peepholes), GVN, SCCP, LoopUnroll, Inline, SimplifyCFG, DCE, EarlyCSE, ...
  • HotSpot C2 runs roughly four phases: GVN, LoopOpts, Macro Expand, Optimisation. Each is itself a fix-point of smaller passes.
  • V8 Turbofan runs a graph-based scheduler instead — passes mutate a sea-of-nodes representation, and a final "schedule" pass linearises back to blocks.

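The common thread: every modern compiler runs its passes in a carefully tuned order, repeats the cheap clean-up passes many times along the way (sometimes to a local fixed point), and then hands off to the backend.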

Where this lab takes us

After cp-09 we have:

  • A correctness oracle (runProgram).
  • A small but real optimisation pipeline.
  • Tests that both verify pass behaviour (golden IR) and verify semantic equivalence (interpret before vs after).

cp-10 takes the same IR shape but emits LLVM IR text instead of running our own interpreter. cp-11 hands that off to llc. After that, the pass manager that runs over our IR is LLVM's, not ours.

cp-10 — LLVM IR Fundamentals

Status: ✅ Built (30/30 checks). Emitted IR runs under lli.

This lab takes the TAC IR from cp-08/09 and emits textual LLVM IR. The emitter is hand-rolled — no LLVM library dependency — so the lab builds anywhere a C++17 compiler is available. cp-11 replaces this with the real C++ IRBuilder API and integrates find_package(LLVM).

Restriction

Programs in cp-10 must be numeric-only. All MiniLang values lower to i64; booleans are 0 / 1 in i64. Strings (and function values) raise an error from the emitter — the runtime for them lands in cp-14.

Layout

File                        Purpose
src/llvm_emit.{hpp,cpp}     TAC IR → textual LLVM IR.
src/main.cpp                mlllvm driver. -O enables our cp-09 passes first.
tests/test_llvm_emit.cpp    30 assertions covering module shape, arithmetic, control flow, function definition, globals, and the strings-rejected error path.

Build, test, smoke-run

cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build --output-on-failure

# Use the system LLVM toolchain to actually execute the emitted module:
echo 'let i=0; while(i<3){print i; i=i+1;}' | ./build/mlllvm -O > /tmp/m.ll
/opt/homebrew/opt/llvm/bin/lli /tmp/m.ll      # prints 0 1 2

What the IR looks like

; ModuleID = 'minilang'
target triple = "arm64-apple-macosx"
@.fmt = private constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(i8*, ...)

define i32 @main() {
  %i.addr = alloca i64
  store i64 0, i64* %i.addr
  br label %L0
L0:
  %v0 = load i64, i64* %i.addr
  %v1 = icmp slt i64 %v0, 3
  ...
}

Why hand-roll the emitter?

  • Forces engagement with the textual format — every LLVM IR document starts there, and you should be able to read it.
  • Keeps cp-10 dependency-free and portable.
  • Demonstrates the lowering decisions without LLVM's IRBuilder doing them silently — alloca for locals, load/store everywhere, explicit icmp + zext for booleans, explicit printf call.
  • Sets the conceptual baseline so cp-11 can focus on what the C++ API gives you for free (block management, name uniquing, constant folding and uniquing, type verification, optimisation passes).
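
To give a feel for what hand-rolling involves, here is a minimal sketch of a string-building emitter that produces a module printing a list of integers. emitPrintModule is a made-up function, far smaller than the lab's llvm_emit.cpp; it reuses the @.fmt global and typed-pointer syntax from the listing above and has not been checked against any particular LLVM release.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Emits a module whose main() prints each value on its own line via printf.
// Typed-pointer syntax (i8*) is used to match the listing in this lab.
std::string emitPrintModule(const std::vector<long long>& values) {
    std::ostringstream ir;
    ir << "@.fmt = private constant [6 x i8] c\"%lld\\0A\\00\"\n"
       << "declare i32 @printf(i8*, ...)\n\n"
       << "define i32 @main() {\n";
    std::size_t next = 0;   // counter for fresh %vN names
    for (long long v : values) {
        ir << "  %v" << next++ << " = call i32 (i8*, ...) @printf("
           << "i8* getelementptr ([6 x i8], [6 x i8]* @.fmt, i64 0, i64 0), "
           << "i64 " << v << ")\n";
    }
    assert(next == values.size());   // one call emitted per value
    ir << "  ret i32 0\n}\n";
    return ir.str();
}
```

Even this tiny emitter has to get three things right by hand that IRBuilder would manage for you: escaping the format string, spelling out the variadic callee type at the call site, and generating unique value names.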

Step docs

  1. LLVM IR overview
  2. Types, values, and constants
  3. The module/function/block structure
  4. Memory model: alloca / load / store
  5. Lowering arithmetic and comparisons
  6. Lowering control flow
  7. The runtime ABI: printf and main

Step 1 — LLVM IR overview

LLVM IR is a strongly-typed, SSA-form, three-address virtual ISA. It sits between the frontend (which generates it) and the backend (which lowers it to a real machine). It is the lingua franca of modern production compilers: Clang, Rust, Swift, Julia, Numba, GHC's LLVM backend, ldc (D), Crystal, Pony, Zig (in part), and a hundred research languages all emit LLVM IR and let LLVM handle codegen.

Three forms, same content

LLVM IR exists in three equivalent encodings:

  • Textual (.ll): human-readable; what we emit in this lab.
  • Bitcode (.bc): compact binary form; what gets serialised.
  • In-memory: llvm::Module* etc.; what the C++ API manipulates.

They are losslessly interconvertible: llvm-as (text → bitcode), llvm-dis (bitcode → text), the IRBuilder constructs in-memory IR, Module::print() dumps text. cp-11 will switch to the in-memory form.

Three shapes within the text

module
  └─ globals  (constants, mutable globals, function declarations)
  └─ functions
       └─ basic blocks (labelled, single-entry single-exit)
            └─ instructions (one per line)

Almost every line in a function body has the form:

%name = <opcode> <type> <operands>

That <type> is mandatory: LLVM IR is strongly typed, and the type system is part of the verifier's job. The verifier (opt -passes=verify with the new pass manager, formerly opt -verify; also run implicitly by lli/llc) rejects modules whose types don't align. This is what makes generating LLVM IR feel slightly tedious the first time and very pleasant thereafter: errors are caught early and precisely.

SSA

Every value (%name) is defined exactly once. We avoid grappling with SSA construction directly by using alloca + load/store for every mutable variable; LLVM's mem2reg pass promotes those to SSA later. This is the canonical strategy for frontends — the "alloca trick" is how Clang emits IR for C local variables.

What we will NOT do in cp-10

  • phi nodes: needed only if you construct SSA in the frontend, as Cranelift's frontend does. With alloca/mem2reg you never write a phi yourself.
  • Aggregate types (struct, array as value).
  • Garbage-collection statepoints.
  • Debug info.
  • Metadata.
  • TBAA / aliasing annotations.

Each is essential for a full production frontend; each is a separable concern that we can layer on in cp-14+.

Why the textual form first?

Because LLVM's error messages, dumps, and documentation all speak in textual IR. Even if you write a frontend that only ever calls the C++ API, you will read textual IR every day for the rest of your compiler career. Get fluent in it now.

Step 2 — Types, values, and constants

The type system

LLVM IR types come in two flavours: primitive and derived.

Primitive

  • Integer: i1, i8, i16, i32, i64, i128, ... at any bit width you care to name. There is no separate bool; i1 plays that role.
  • Floating point: half (16-bit), float (32-bit), double (64-bit), x86_fp80 / fp128 (extended).
  • Void: void (only valid as a function return type).
  • Label: label (block reference; rarely written explicitly).
  • Metadata: metadata (debug info etc.).

Derived

  • Pointer: T* (legacy) or ptr (modern LLVM ≥ 15, "opaque pointers"). We use T* here because it's clearer for teaching; modern LLVM auto-converts.
  • Array: [N x T]. Fixed-size, allocated as a value.
  • Vector: <N x T>. SIMD lane group.
  • Struct: { T1, T2, ... } (literal) or %S (named).
  • Function: R (A1, A2, ...).

We use only i64, i1, i32 (the return type of main and printf), i8*, and one array ([6 x i8] for the format string).

Value categories

Every operand in LLVM IR is one of:

  • A constant: 42, true, 0.5, or a getelementptr of a global.
  • A register: %name or %number, the result of a previous instruction in the same function. Function parameters are also registers (%arg0 …).
  • A global: @name, a top-level symbol. Includes function references, mutable globals, and constant data like our @.fmt.

The leading sigil is meaningful: % is local to a function, @ is module-global. Values have no other namespace (! introduces metadata, which is not an ordinary operand).

How we lower MiniLang values

We picked the simplest possible mapping: everything is i64.

| MiniLang | LLVM | Notes |
|---|---|---|
| Number | i64 | We discard the fractional part on purpose. |
| Bool | i64 | 0 or 1. We zext i1 ... to i64 after icmp. |
| Nil | i64 0 | Same as false. |
| String | (none) | Not supported; emitter errors out. |

This is a pedagogical choice: it makes the IR easy to read and type-uniform, at the cost of disallowing mixed-type expressions. cp-14 introduces a real boxed Value representation that supports the full type set.
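
The mapping in the table can be sketched as a single function. This is an illustration under stated assumptions, not the lab's emitter: the Literal variant and the Nil tag are hypothetical stand-ins for the real AST types, and a number is assumed to fit in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical MiniLang literal; the lab's real AST types are not shown here.
struct Nil {};
using Literal = std::variant<double, bool, Nil, std::string>;

// Lower a literal to a typed LLVM constant operand such as "i64 42".
std::string lowerLiteral(const Literal& lit) {
    if (const double* d = std::get_if<double>(&lit)) {
        // Precondition of this sketch: the value fits in i64 (and is not NaN).
        assert(*d > -9.2e18 && *d < 9.2e18);
        // Truncates toward zero: the fractional part is discarded on purpose.
        return "i64 " + std::to_string(static_cast<std::int64_t>(*d));
    }
    if (const bool* b = std::get_if<bool>(&lit)) {
        return *b ? "i64 1" : "i64 0";
    }
    if (std::holds_alternative<Nil>(lit)) {
        return "i64 0";  // same encoding as false
    }
    // Only the string alternative is left.
    throw std::runtime_error("cp-10 emitter: strings are not supported");
}
```

Throwing on strings mirrors the "emitter errors out" row; a production frontend would more likely report a diagnostic with a source location.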

Constants

A constant in LLVM IR has both a type and a value:

i64 42                                  ; integer
[6 x i8] c"%lld\0A\00"                  ; byte array
@.fmt                                   ; symbol  (type is [6 x i8]*)
getelementptr ([6 x i8], [6 x i8]* @.fmt, i64 0, i64 0)
                                        ; "address of first byte of @.fmt"

The getelementptr (GEP) form is how you compute addresses without emitting an actual instruction — it's a constant expression. We use it to convert our array-of-bytes format string into a pointer to its first byte, which is what printf expects.

GEP is one of the most-misunderstood parts of LLVM IR. The two-index form [6 x i8]* @.fmt, i64 0, i64 0 reads as: "starting from @.fmt (which points to a [6 x i8]), advance by zero whole arrays, then select byte index zero within that array". The result is an i8*. Why two indices? Because @.fmt is a pointer: the first index offsets that pointer in units of the pointee type, like p + 0 in C pointer arithmetic, and the second indexes into the array it points at. No memory is read at any point; GEP only computes an address. This catches everyone the first time.
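
To make the two-index arithmetic concrete, here is a small sketch of the offset a GEP over a pointer-to-array computes. The function name is ours, and the sketch assumes array elements are laid out back to back with no padding.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset computed by
//   getelementptr [N x T], [N x T]* %p, i64 idx0, i64 idx1
// where elemSize is the size of T in bytes. Nothing is loaded from memory:
// the result is pure address arithmetic.
std::int64_t gepByteOffset(std::int64_t arrayLen, std::int64_t elemSize,
                           std::int64_t idx0, std::int64_t idx1) {
    assert(arrayLen > 0 && elemSize > 0);
    const std::int64_t arraySize = arrayLen * elemSize;  // size of [N x T]
    return idx0 * arraySize   // step over whole arrays, like p + idx0 in C
         + idx1 * elemSize;   // then select an element inside that array
}
```

For our format string, gepByteOffset(6, 1, 0, 0) is 0: the address of the first byte, which is exactly what printf needs.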

Why types matter even in a dynamic language

Even when you're compiling a dynamic language, the LLVM-level types must be statically known. That's why cp-10 restricts to numeric only: we can't emit add i64 %a, %b if %a might be a string at runtime.

The two ways out are:

  1. Uniform representation — pick one LLVM type (typically i64 or a tagged 64-bit) and stuff every dynamic value into it.
  2. Specialisation — generate different IR for different type profiles, possibly at JIT time. This is what V8 and LuaJIT do.

cp-14 takes path (1). cp-17's capstone explores path (2).

Step 3 — Module / function / block

The module

A module is the unit of compilation. One .ll file = one llvm::Module = one translation unit. Modules contain:

  • A target triple (arm64-apple-macosx) and data layout.
  • Global declarations (@printf, @.fmt, @x).
  • Function definitions.
  • Metadata (debug info, optimisation hints).

; ModuleID = 'minilang'
target triple = "arm64-apple-macosx"

@.fmt = private constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(i8*, ...)

define i32 @main() { ... }

The function

A function has a return type, name, parameter list, and a list of basic blocks:

define i64 @add(i64 %arg0, i64 %arg1) {
L0:
  %v0 = add i64 %arg0, %arg1
  ret i64 %v0
}

The first block listed is the entry block — implicit, no special marker. Parameters are values in scope from the entry block. Return type is declared up front; every terminator that returns must agree.

Linkage and visibility

Function definitions default to external linkage (visible to the linker, as if extern in C). Other options:

  • private — invisible to the linker (we use this for @.fmt).
  • internal — visible only within the module.
  • weak / linkonce — for inline functions and templates.

We don't decorate @main or @add — external is the right default.

The basic block

A basic block is a maximal sequence of straight-line instructions ending in a terminator: ret, br, switch, indirectbr, invoke, callbr, resume, catchswitch, catchret, cleanupret, or unreachable. Exactly one terminator per block, and it must come last; if you forget it, the verifier rejects the module.

L1:
  %v3 = add i64 %v0, 1
  store i64 %v3, i64* %i.addr
  br label %L0

Blocks are labelled (L1: …). Labels are values of type label, referred to as %L1 in br targets. The label name on the definition site has no % — but the reference site does. (Yes, this inconsistency is annoying. Welcome to LLVM IR.)
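
The one-terminator rule is easy to check mechanically. The sketch below is a toy check over opcode names, not LLVM's verifier; it only illustrates the invariant an emitter has to maintain.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// LLVM's terminator opcodes, by name.
bool isTerminator(const std::string& opcode) {
    static const std::set<std::string> terminators = {
        "ret", "br", "switch", "indirectbr", "invoke", "callbr",
        "resume", "catchswitch", "catchret", "cleanupret", "unreachable"};
    return terminators.count(opcode) != 0;
}

// A block is well formed when it is non-empty, its last instruction is a
// terminator, and no earlier instruction is one.
bool blockIsWellFormed(const std::vector<std::string>& opcodes) {
    if (opcodes.empty()) return false;
    for (std::size_t i = 0; i + 1 < opcodes.size(); ++i) {
        if (isTerminator(opcodes[i])) return false;  // terminator mid-block
    }
    return isTerminator(opcodes.back());
}
```

Running a check like this over the emitter's output before handing it to lli turns a verifier failure into an early, local assertion.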

Why basic blocks at all?

Because every flow-graph analysis is dramatically simpler if you can reason about straight-line sequences as opaque units, then handle control at the boundaries. Dominator computation, liveness, register allocation, scheduling — all of them operate on basic-block CFGs.

Compare to a representation where any instruction could be a branch target: now every analysis has to track "did anyone jump into the middle of this run?" The basic-block invariant — enter at the top, exit at the bottom — buys you enormous simplification.

Mapping our TAC

Our TAC already had basic blocks (BasicBlock in cp-08), so the mapping is one-to-one. The only difference: LLVM blocks are labelled by %LN syntactically; ours by integer id. The emitter prefixes with L:

static std::string blockLabel(int id) { return "L" + std::to_string(id); }

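The emitter then ends the alloca region with br label %L<id> to enter the first TAC block. The allocas sit in their own unlabelled entry block for two reasons: every LLVM block must end in a terminator (there is no fall-through into L0), and the entry block may not have predecessors, so a TAC block 0 that is also a loop header could not double as the entry.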

Step 4 — Memory model: alloca / load / store

LLVM IR is in SSA, which means every register is assigned exactly once. But MiniLang variables — let x = 0; x = x + 1; — are not single-assignment. How does a frontend bridge that gap?

The alloca trick

You don't emit SSA directly. You emit memory:

%x.addr = alloca i64           ; reserve a stack slot
store i64 0, i64* %x.addr       ; x = 0

%v1 = load  i64, i64* %x.addr   ; read x
%v2 = add   i64 %v1, 1          ; v1 + 1
store i64 %v2, i64* %x.addr     ; x = ...

Every named local becomes an alloca in the entry block, every read becomes a load, every write a store. The SSA registers (%v1, %v2) are short-lived temporaries that hold loaded values.

The IR you produce this way is trivially SSA-valid (every register defined once), but it's pessimistic — it leaves variables in memory when they could live in registers.

mem2reg does the rest

LLVM's mem2reg pass (technically PromoteMemoryToRegister) scans for allocas that are only loaded from and stored to (no address taken, no escape) and promotes them into proper SSA values with phi nodes at join points.

opt -passes=mem2reg  ← runs only the promotion
opt -O2              ← promotes allocas early, then everything else

After mem2reg:

L0:
  %x.1 = ...
  br label %L1
L1:
  %x.2 = phi i64 [%x.1, %L0], [%x.3, %L1]
  %x.3 = add i64 %x.2, 1
  br label %L1

The frontend never had to compute a dominator tree, never had to insert phi nodes, never had to think about variable renaming. The optimiser did it.

This is the design decision that makes Clang's frontend maintainable. C-style mutable locals would otherwise force the frontend to implement Cytron-style SSA construction.

When does alloca go where?

Always in the entry block. This is critical for performance. alloca in a non-entry block creates a dynamic stack allocation (a real stack-pointer adjustment at runtime, such as sub rsp, size on x86-64 or sub sp, sp, #size on AArch64); in the entry block it's collapsed by the backend into a single stack-frame reservation.

The entry block is also the only place mem2reg will consider promoting. An alloca in a loop body stays in memory forever.

Our emitter respects this:

// 1. Emit all alloca instructions first (in the implicit entry block).
for (const auto& name : localsWritten) {
    out << "  %" << esc(name) << ".addr = alloca i64\n";
}
// 2. Then branch to the first IR block of the function body.
out << "  br label %L" << fn.blocks[0].id << "\n";
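
Where does localsWritten come from? One plausible way to compute it is a single scan over the function body. The TacInstr struct below is a hypothetical simplification of the cp-08 instruction type, keeping only the fields this scan needs.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified TAC instruction: only the destination matters here.
struct TacInstr {
    std::string dest;   // empty when the instruction defines nothing
    bool destIsTemp;    // true for compiler temporaries such as t3
};

// Collect every named local that is written, in first-write order, so the
// emitter can create exactly one entry-block alloca per variable.
std::vector<std::string> collectLocalsWritten(const std::vector<TacInstr>& body) {
    std::vector<std::string> ordered;
    std::unordered_set<std::string> seen;
    for (const TacInstr& ins : body) {
        if (ins.dest.empty() || ins.destIsTemp) continue;
        if (seen.insert(ins.dest).second) ordered.push_back(ins.dest);
    }
    return ordered;
}
```

Keeping first-write order (rather than iterating a set) makes the emitted IR deterministic, which matters for golden-file tests.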

Function parameters

Parameters arrive as SSA registers (%arg0, %arg1). To let them be reassigned, we immediately spill them into an alloca:

define i64 @add(i64 %arg0, i64 %arg1) {
  %a.addr = alloca i64
  %b.addr = alloca i64
  store i64 %arg0, i64* %a.addr
  store i64 %arg1, i64* %b.addr
  ...
}

Again, mem2reg cleans this up when the optimiser runs.
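
The spill prelude is mechanical enough to generate from the parameter list alone. A minimal sketch, assuming parameter names need no escaping (the real emitter routes names through esc()) and using the typed-pointer spelling of this lab:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Emit the entry-block prelude for a function whose parameters are named
// `params`: one alloca per parameter, then one store of %argN into its slot.
std::string emitParamSpills(const std::vector<std::string>& params) {
    std::ostringstream out;
    for (const std::string& p : params) {
        out << "  %" << p << ".addr = alloca i64\n";
    }
    for (std::size_t i = 0; i < params.size(); ++i) {
        out << "  store i64 %arg" << i << ", i64* %" << params[i] << ".addr\n";
    }
    return out.str();
}
```

Calling emitParamSpills({"a", "b"}) reproduces the four prelude lines of the @add example above.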

Globals

Globals are top-level @-prefixed pointers:

@x = global i64 0

The variable @x is itself an i64*; you load i64, i64* @x to read and store i64 N, i64* @x to write. This mirrors alloca, except that the pointer is the address of the global's static storage rather than of a stack slot.

Globals are not promoted by mem2reg — that pass operates only on stack allocas. Promoting a global to register is a different optimisation (sometimes called "globalopt" or "internalize + mem2reg") and requires whole-module analysis.

Pointer typing — old vs new

In LLVM ≤ 14, pointers carry the pointee type (i64*). LLVM 15 made opaque pointers the default: every pointer is just ptr, and the instruction carries the type (load i64, ptr %x.addr). Our text-form emitter uses the older typed-pointer form because it's self-documenting. LLVM 15 and 16 tools still accept it; typed pointers were removed in LLVM 17, so for a newer lli (including the Homebrew LLVM 20 used in cp-11) emit ptr instead.

Step 5 — Lowering arithmetic and comparisons

Binary integer operations

The mapping is essentially one-to-one with TAC:

| TAC Op | LLVM instruction |
|---|---|
| Add | add i64 a, b |
| Sub | sub i64 a, b |
| Mul | mul i64 a, b |
| Div | sdiv i64 a, b |
| Mod | srem i64 a, b |
| And | and i64 a, b |
| Or | or i64 a, b |

Signed vs unsigned: we use sdiv and srem (signed) because MiniLang's Number is signed-ish in spirit. The s/u/f prefix on LLVM arithmetic is a frequent source of bugs:

  • add — no prefix; signedness doesn't matter (two's complement).
  • mul — no prefix; same reason.
  • sdiv / udiv — different result for negative operands.
  • srem / urem — likewise.
  • fadd / fmul / fdiv — floating point.
  • shl — no prefix; lshr (logical) / ashr (arithmetic) for right shift.

The nsw / nuw flags (no signed wrap / no unsigned wrap) on arithmetic let the optimiser assume overflow is impossible. We don't emit them — being conservative — but a real frontend should track this from the source language's overflow semantics.
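
The prefix rules are easiest to believe after seeing the bit patterns. The helpers below are plain C++ stand-ins for the LLVM instructions of the same name (C++ integer division truncates toward zero, as sdiv and srem do); they are a demonstration, not part of the emitter.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// add: the same bit pattern results whether the operands are viewed as signed
// or unsigned, which is why LLVM needs only one `add`.
std::uint64_t addBits(std::uint64_t a, std::uint64_t b) { return a + b; }

// sdiv / srem: signed division and remainder, truncating toward zero.
std::int64_t sdiv(std::int64_t a, std::int64_t b) {
    assert(b != 0 && !(a == std::numeric_limits<std::int64_t>::min() && b == -1));
    return a / b;
}
std::int64_t srem(std::int64_t a, std::int64_t b) {
    assert(b != 0 && !(a == std::numeric_limits<std::int64_t>::min() && b == -1));
    return a % b;
}

// udiv: the same bits, interpreted as an unsigned quantity.
std::uint64_t udiv(std::uint64_t a, std::uint64_t b) {
    assert(b != 0);
    return a / b;
}
```

With a = -7 and b = 2, sdiv gives -3 while udiv on the same 64 bits gives 0x7FFFFFFFFFFFFFFC; addBits gives the bit pattern of -5 either way.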

Unary

  • Neg becomes sub i64 0, %a. There is no dedicated neg instruction.
  • Not (boolean negation) becomes icmp eq i64 %a, 0 followed by zext i1 ... to i64.

Comparisons

%v0 = icmp slt i64 %a, %b       ; signed less-than
%v1 = zext i1   %v0 to i64

icmp returns i1. To use the result as our uniform i64 value, we zext (zero-extend) to i64. If we stored booleans as i1 throughout we wouldn't need the zext — but every other operation would then need to widen back to i64 for arithmetic.
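
A comparison therefore always lowers to a two-instruction pair. A minimal sketch of that emitter step; TempNames is a hypothetical stand-in for the emitter's fresh() counter:

```cpp
#include <sstream>
#include <string>

// Hypothetical temp allocator: hands out %v0, %v1, ...
struct TempNames {
    int next = 0;
    std::string fresh() { return "%v" + std::to_string(next++); }
};

// Emit `icmp <pred>` followed by the zext that widens the i1 to our uniform
// i64. Returns the name of the i64 result; the IR text is appended to `out`.
std::string emitCompare(std::ostringstream& out, TempNames& temps,
                        const std::string& pred,
                        const std::string& lhs, const std::string& rhs) {
    const std::string bit  = temps.fresh();  // the i1 result of icmp
    const std::string wide = temps.fresh();  // the widened i64 value
    out << "  " << bit  << " = icmp " << pred << " i64 " << lhs << ", " << rhs << "\n";
    out << "  " << wide << " = zext i1 " << bit << " to i64\n";
    return wide;
}
```

Returning the widened name lets the caller treat a comparison like any other i64-producing operation.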

The condition mnemonics:

| TAC | LLVM icmp cond |
|---|---|
| Eq | eq |
| Ne | ne |
| Lt | slt |
| Le | sle |
| Gt | sgt |
| Ge | sge |

The s prefix is for signed comparison. ult, ule, etc. are unsigned. eq and ne don't have a sign because they don't need one — bitwise equality is the same either way.

The zext / trunc dance

icmp always produces i1. Storing or arithmetic always wants i64. Branching on a value always wants i1 again.

%v0 = icmp slt i64 %a, %b      ; i1
%v1 = zext i1   %v0 to i64     ; i64

; ... later, used as a branch condition: ...
%v2 = icmp ne i64 %v1, 0       ; back to i1
br i1 %v2, label %T, label %F

This back-and-forth is what you pay for using i64 as the uniform value type. LLVM's instcombine cleans most of it up:

icmp ne (zext T to i64), 0   →   T

So after opt -O1 the i64 round-trip vanishes entirely.

Why we don't use fadd / fmul

MiniLang numbers are doubles in the interpreter, but we lower to i64 for simplicity. To handle floats properly:

  • Pick double as the uniform type instead of i64.
  • Replace add → fadd, sdiv → fdiv, icmp → fcmp.
  • fcmp predicates have an ordered/unordered distinction (oeq, ueq, olt, ult, ...) because a NaN operand makes every ordered comparison false and every unordered one true.
  • Print with %g or %lf format.
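
The ordered/unordered split in the list above can be modelled directly in C++. The function names mirror the fcmp predicates; the sketch assumes IEEE 754 doubles and a build without fast-math flags.

```cpp
#include <cmath>

// fcmp olt: ordered less-than. False when either operand is NaN.
bool fcmpOlt(double a, double b) { return a < b; }

// fcmp ult: unordered-or-less-than. True when either operand is NaN.
bool fcmpUlt(double a, double b) {
    return std::isnan(a) || std::isnan(b) || a < b;
}

// fcmp oeq / ueq follow the same pattern for equality.
bool fcmpOeq(double a, double b) { return a == b; }
bool fcmpUeq(double a, double b) {
    return std::isnan(a) || std::isnan(b) || a == b;
}
```

A frontend for a C-like language would normally pick the ordered forms for <, ==, and friends, since that matches how C's operators behave on NaN.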

cp-14 will introduce a tagged value type that handles both i64 and double, with a runtime dispatch on the tag bits.

Step 6 — Lowering control flow

Unconditional branch

br label %L3

br is the workhorse terminator. The single-operand form is an unconditional jump.

Our TAC Jump op lowers directly:

case Op::Jump:
    out << "  br label %" << blockLabel(ins.bbT) << "\n";
    break;

Conditional branch

br i1 %cond, label %T, label %F

The condition must be i1. Since we keep our values in i64, we must emit an explicit icmp ne ..., 0 before the branch:

case Op::CondJump: {
    std::string a; operandValue(ins.srcs[0], a);
    std::string cmp = fresh();
    // Narrow the uniform i64 back to the i1 that br requires.
    out << "  " << cmp << " = icmp ne i64 " << a << ", 0\n";
    out << "  br i1 " << cmp << ", label %" << blockLabel(ins.bbT)
        << ", label %" << blockLabel(ins.bbF) << "\n";
    break;
}

If we were storing booleans as i1 natively, this would just be:

br i1 %cond, label %T, label %F

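That saves one icmp per branch in unoptimised code. With optimisation on, instcombine folds the round-trip away (as shown in Step 5), so the argument for i1 booleans is mainly about -O0 code quality and IR readability, weighed against complicating the rest of the emitter.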

Multi-way branches: switch

LLVM supports an n-way switch:

switch i64 %tag, label %default [
    i64 0, label %A
    i64 1, label %B
    i64 2, label %C
]

We don't emit switch because our TAC doesn't have one — any switch-like construct would have been lowered to a chain of cjmps in the IR builder. A more sophisticated frontend would detect switch-like AST patterns and lower to switch directly, giving the backend the opportunity to emit a jump table.
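
If the TAC ever grows a multi-way branch, emitting the textual switch is straightforward. A sketch in the style of the cp-10 emitter; the function name and the (value, label) pair representation are our assumptions, and case values are assumed to be unique, as LLVM requires.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Emit an LLVM `switch` on an i64 value. `cases` pairs each constant with the
// label (without the leading %) of its target block.
std::string emitSwitch(const std::string& value, const std::string& defaultLabel,
                       const std::vector<std::pair<std::int64_t, std::string>>& cases) {
    std::ostringstream out;
    out << "  switch i64 " << value << ", label %" << defaultLabel << " [\n";
    for (const auto& c : cases) {
        out << "    i64 " << c.first << ", label %" << c.second << "\n";
    }
    out << "  ]\n";
    return out.str();
}
```

Because switch is a terminator, the emitter must not append anything else to the block after calling this.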

phi nodes (or the lack thereof)

Because we use the alloca trick, our emitted IR has no phi nodes. Every variable read is a load; every write is a store.

The IR for if (c) { x = 1; } else { x = 2; } print x;:

%v0 = load i64, i64* %c.addr
%v1 = icmp ne i64 %v0, 0
br i1 %v1, label %L1, label %L2
L1:
  store i64 1, i64* %x.addr
  br label %L3
L2:
  store i64 2, i64* %x.addr
  br label %L3
L3:
  %v2 = load i64, i64* %x.addr
  call i32 (i8*, ...) @printf(...)

After mem2reg:

br i1 %v1, label %L1, label %L2
L1:
  br label %L3
L2:
  br label %L3
L3:
  %x.3 = phi i64 [1, %L1], [2, %L2]
  call i32 (i8*, ...) @printf(...)

That phi was always implicit in the semantics; mem2reg just made it explicit.

Loops

A while (cond) { body } in TAC has three blocks: header, body, exit. The header tests the condition and cjmps into body or exit. The body unconditionally jumps back to header. The header dominates both body and exit; the body's backward edge is what makes this a natural loop.

br label %Lheader
Lheader:
  ...test...
  br i1 %cond, label %Lbody, label %Lexit
Lbody:
  ...body...
  br label %Lheader
Lexit:
  ...

LLVM detects this pattern (loop analysis runs on the SSA form after mem2reg) and enables loop-specific optimisations: invariant code motion, induction-variable simplification, vectorisation, unrolling.

unreachable

If a block has no logical successor (e.g. control falls off the end of noreturn code), the terminator is unreachable. Our compiler doesn't generate unreachable because cp-09's simplifyCFG already removed those blocks. But they appear all the time in C frontends after calls to abort(), exit(), etc.

Indirect branches

indirectbr i8* %target, [label %a, label %b, ...] is how computed-goto / threaded-code interpreters express their dispatch; the operand is a block address (obtained with blockaddress), not a label. Out of scope here; relevant for cp-12's JIT capstone where we may explore inline caches.

invoke and exception handling

LLVM's exception model uses invoke (a call with a normal and an exceptional destination) plus landingpad blocks. We have no exceptions in MiniLang. C++ frontends spend a lot of time on this.

Step 7 — Runtime ABI: printf and main

Why printf?

MiniLang has a built-in print. To execute the emitted module, some piece of code outside it has to do the formatting and the actual system call. The cheapest option is to delegate to libc's printf, which lli, llc + ld, and any host linker make trivially available.

@.fmt = private constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(i8*, ...)

  • @.fmt is a private (module-local) constant holding the bytes %lld\n\0. The [6 x i8] array type makes the length explicit; the null terminator at the end matches C string convention.
  • declare i32 @printf(i8*, ...) introduces an external function declaration. The variadic ... is part of the type signature.
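
Hand-counting the array length and hex escapes gets error-prone as soon as there is more than one format string. A sketch of a helper that derives both from a C++ string; the function name is ours, and it escapes every non-printable byte, the double quote, and the backslash as \XX.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Render `text` plus a trailing NUL as a private LLVM constant, e.g.
//   @.fmt = private constant [6 x i8] c"%lld\0A\00"
std::string emitCStringGlobal(const std::string& name, const std::string& text) {
    const std::string bytes = text + '\0';  // C strings are NUL-terminated
    std::string body;
    for (unsigned char ch : bytes) {
        if (std::isprint(ch) && ch != '"' && ch != '\\') {
            body += static_cast<char>(ch);
        } else {
            char hex[4];  // backslash + two hex digits + terminator
            std::snprintf(hex, sizeof hex, "\\%02X", static_cast<unsigned>(ch));
            body += hex;
        }
    }
    return "@" + name + " = private constant [" + std::to_string(bytes.size()) +
           " x i8] c\"" + body + "\"";
}
```

The array length comes from bytes.size(), so it always includes the terminator and can never drift out of sync with the literal.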

The call site

%v_ = call i32 (i8*, ...) @printf
       (i8* getelementptr ([6 x i8], [6 x i8]* @.fmt, i64 0, i64 0),
        i64 %v)

Three things here are non-obvious:

  1. call i32 (i8*, ...) @printf — the function type appears parenthesised after the return type. This is required only for variadic calls. Non-variadic calls can omit it: call i32 @add(...).

  2. The GEP: @.fmt is a [6 x i8]*. printf wants i8*. getelementptr with two zero indices computes "address of the first byte". This is the canonical "decay an array to a pointer" pattern in LLVM.

  3. The discarded return value: we assign it to %v_ even though we never use it. LLVM does not require this; a bare call i32 (i8*, ...) @printf(...) is valid and simply leaves the result unused. Our emitter names every non-void call result only to keep one uniform code path. (Void-returning calls are the opposite case: naming their result is forbidden.)

main

LLVM doesn't define what main is — that's a libc/runtime convention. We adopt the C convention:

define i32 @main() {
  ...
  ret i32 0
}

@main returns i32 (the process exit status). When we lower our __script__ function, we use define i32 @main() and the final Op::Return becomes ret i32 0.

lli looks for @main to start execution. llc + system linker build an executable in which the platform's startup code calls main (_start → libc init → main on Linux; dyld → main on macOS).

What's missing

We have no:

  • String literals at runtime — no allocation, no managed string. cp-14 introduces a runtime with ml_string_new, ml_print_value, ml_value_t.
  • Closures — function values that capture environment. cp-12 introduces them as part of the JIT capstone.
  • GC — every allocation in cp-10/11 is leaked or stack-only. cp-14 sketches a mark-sweep collector with a shadow stack.
  • Exception model — no invoke, no landingpad.
  • TLS, threading primitives, atomics — out of scope.

ABI considerations for cp-11

When cp-11 actually links against LLVM's C++ API, the ABI surface expands:

  • Calling conventions: ccc (default), fastcc, coldcc, swiftcc, tailcc, custom numbered ccs. We use ccc (C calling convention) because we link with libc.
  • Attributes: noinline, readonly, nounwind, cold, optsize, ... These affect optimisation decisions.
  • Target attributes: target-cpu, target-features. The difference between scalar codegen and AVX-512 vectorised codegen.

All of these become accessible through the C++ API as we move to real LLVM integration in cp-11.

cp-11 · LLVM Codegen (the real C++ API)

Goal: emit a verified llvm::Module straight from our TAC IR using IRBuilder, then run the new-pass-manager O2 pipeline (mem2reg, instcombine, GVN, SimplifyCFG, …) over it. Execute with lli, or pipe through llc to produce native object code.

cp-10 wrote LLVM IR as text. cp-11 builds the same IR through the official C++ API, links against libLLVM*, and lets LLVM run its production-grade optimisation pipeline on the result.

What changed since cp-10

| Concern | cp-10 | cp-11 |
|---|---|---|
| IR producer | hand-written text emitter | llvm::IRBuilder<> |
| Verifier | none (trusted lli to complain) | llvm::verifyModule after build |
| Optimiser | none | PassBuilder::buildPerModuleDefaultPipeline(O2) |
| Globals | string-templated @x = global i64 0 | new llvm::GlobalVariable(...) |
| Format string | hand-emitted [6 x i8] literal | IRBuilder::CreateGlobalString |

Build & run

cmake -S src/cpp -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
./build/tests/test_llvm_codegen      # → 25/25 checks passed

# REPL-ish: pipe MiniLang in, get IR out
echo 'fn fib(n){ if(n<2){return n;} return fib(n-1)+fib(n-2);} print fib(20);' \
  | ./build/mlcc -O \
  | /opt/homebrew/opt/llvm/bin/lli
# 6765

-O runs both our cp-09 TAC passes and LLVM's O2 pipeline.

Layout

src/cpp/
├── CMakeLists.txt           # find_package(LLVM); link core/passes/analysis/...
├── src/
│   ├── llvm_codegen.hpp     # CodegenResult + build()/optimise()
│   ├── llvm_codegen.cpp     # IRBuilder emitter + new-PM pipeline
│   └── main.cpp             # `mlcc` CLI
└── tests/test_llvm_codegen.cpp

Reading order

  1. steps/01-llvm-cpp-api-tour.md
  2. steps/02-irbuilder.md
  3. steps/03-verifier.md
  4. steps/04-new-pass-manager.md
  5. steps/05-mem2reg-and-O2.md
  6. steps/06-globals-and-runtime.md
  7. steps/07-targets-and-llc.md

Step 01 · LLVM C++ API tour

LLVM's C++ API is huge. The codegen path we touch is a thin slice:

LLVMContext   ── owns types, constants, metadata. One per thread of work.
   │
   ▼
Module        ── translation unit. Holds globals + functions + metadata.
   │
   ▼
Function      ── signature + list of BasicBlocks.
   │
   ▼
BasicBlock    ── straight-line list of Instructions; ends in a terminator.
   │
   ▼
Instruction   ── created via IRBuilder, never `new`.
Value         ── base of everything (Constant, Argument, Instruction).
Type          ── obtained from the Context (Int64Ty, etc.).

Linking

llvm_map_components_to_libnames(LLVM_LIBS core support irreader passes analysis transformutils scalaropts instcombine) expands to the static archives we need. With Homebrew LLVM 20 you can also link the umbrella LLVM library, but listing components keeps the binary small.

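find_package(LLVM REQUIRED CONFIG)
# Resolve component names to the libraries this LLVM install provides.
llvm_map_components_to_libnames(LLVM_LIBS
    core support irreader passes analysis transformutils scalaropts instcombine)
target_include_directories(mllib PUBLIC ${LLVM_INCLUDE_DIRS})
target_compile_definitions(mllib PUBLIC ${LLVM_DEFINITIONS})
target_link_libraries(mllib PUBLIC ${LLVM_LIBS})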

Ownership

unique_ptr<LLVMContext> must outlive unique_ptr<Module>: module destruction touches its types, which live in the context. Keep both in the same CodegenResult and let RAII handle the order. Members are destroyed in reverse declaration order, so llvm_codegen.hpp declares the context before the module.
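
The ordering rule can be observed without LLVM at all. In the sketch below, FakeContext, FakeModule, and FakeCodegenResult are stand-ins we made up to record destruction order; only the member-ordering idea carries over to the real CodegenResult.

```cpp
#include <memory>
#include <string>
#include <vector>

// Stand-ins for llvm::LLVMContext and llvm::Module that log their destruction.
struct FakeContext {
    explicit FakeContext(std::vector<std::string>* l) : log(l) {}
    ~FakeContext() { log->push_back("context"); }
    std::vector<std::string>* log;
};
struct FakeModule {
    explicit FakeModule(std::vector<std::string>* l) : log(l) {}
    ~FakeModule() { log->push_back("module"); }
    std::vector<std::string>* log;
};

// Members are destroyed in reverse declaration order, so declaring ctx first
// guarantees the module is destroyed while its context is still alive.
struct FakeCodegenResult {
    std::unique_ptr<FakeContext> ctx;  // declared first, destroyed last
    std::unique_ptr<FakeModule>  mod;  // declared last, destroyed first
};
```

Swapping the two member declarations would reverse the log, which in the real API means a module destructor touching an already-destroyed context.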

A common pitfall

Forward-declaring llvm::Module in a header and putting unique_ptr<Module> in a struct breaks because the implicit destructor needs the complete type. Either include <llvm/IR/Module.h> in the header, or declare an out-of-line destructor. We chose the include.
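
For completeness, here is what the alternative we did not choose (the out-of-line destructor) looks like. Module is a stand-in for llvm::Module, Holder is a hypothetical owner, and both "files" are shown in one translation unit so the sketch compiles on its own.

```cpp
#include <memory>

// --- what would live in the header ------------------------------------------
struct Module;                    // forward declaration only (incomplete type)

struct Holder {
    Holder();
    ~Holder();                    // declared here, defined where Module is complete
    std::unique_ptr<Module> mod;
};

// --- what would live in the .cpp ---------------------------------------------
struct Module { int id = 42; };   // stand-in for the real, complete type

Holder::Holder() : mod(std::make_unique<Module>()) {}
Holder::~Holder() = default;      // unique_ptr's deleter is instantiated here
```

The trade-off is compile time versus convenience: the out-of-line destructor keeps the LLVM header out of every file that includes ours, at the cost of one extra declaration.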

Step 02 · IRBuilder

IRBuilder<> is a cursor: you SetInsertPoint(BB) and then every Create* call appends one instruction at the cursor.

llvm::IRBuilder<> b(ctx);
auto* bb = llvm::BasicBlock::Create(ctx, "entry", fn);
b.SetInsertPoint(bb);
auto* sum = b.CreateAdd(lhs, rhs, "sum");
b.CreateRet(sum);

What we build

Per function:

  1. entry block with one alloca i64 per named local/param.
  2. Spill each parameter into its alloca.
  3. Pre-create every TAC basic block as a BasicBlock*.
  4. Branch from entry to the first TAC block.
  5. Walk each TAC instruction, dispatch on opcode, emit one or more LLVM instructions.

Mapping opcodes

| TAC | IRBuilder |
|---|---|
| Add/Sub/Mul/Div/Mod | CreateAdd/Sub/Mul/SDiv/SRem |
| And/Or | CreateAnd/Or |
| Eq/Ne/Lt/... | CreateICmpEQ/NE/SLT/... then CreateZExt to i64 |
| Neg | CreateNeg |
| Not | ICmpEQ x, 0 then ZExt |
| Move t,_ | alias in temps map |
| Move name,_ | CreateStore to alloca |
| LoadGlobal/StoreGlobal | CreateLoad/Store on GlobalVariable |
| Print | CreateCall(printf, {fmt, v}) |
| Call f(args) | CreateCall(callee, args) |
| Jump | CreateBr |
| CondJump | ICmpNE x,0 then CreateCondBr |
| Return | CreateRet |

Naming

Anonymous SSA values get %0, %1, … from the printer. Passing a name string to Create* ("sum") is purely cosmetic — it survives to the printed IR and is invaluable when reading optimised output.

Step 03 · The Verifier

llvm::verifyModule(module, &os) returns true on failure and writes a description into os. We call it as the last step of build and refuse to return a module that does not verify.

What the verifier catches

  • Blocks without a terminator (we used to emit a defensive CreateUnreachable for safety).
  • Branch targets in a different function.
  • Type mismatches (add i64, i32).
  • Uses of a value that are not dominated by its definition.
  • Improper SSA — multiple definitions of the same %x.

Why it matters

A module that fails verification will often segfault lli or the backend instead of producing a clean diagnostic. Verifying up front turns those into a single string we can surface to the user.

Test hook

auto cg = compile("print 1;");
CHECK(cg.ok);                       // verifier passed
CHECK(cg.mod != nullptr);           // module survived
CHECK_CONTAINS(cg.toText(), "ModuleID");

On failure, CodegenResult::error carries the verifier's report verbatim, which is invaluable when developing new lowering rules.

Step 04 · The new pass manager

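LLVM has two pass managers: the legacy one (llvm::legacy::PassManager) and the "new" one, whose pipelines are assembled by llvm::PassBuilder. New code targets the new PM for optimisation; the legacy PM survives mainly in the codegen path used in Step 07.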

llvm::LoopAnalysisManager     lam;
llvm::FunctionAnalysisManager fam;
llvm::CGSCCAnalysisManager    cam;
llvm::ModuleAnalysisManager   mam;

llvm::PassBuilder pb;
pb.registerModuleAnalyses(mam);
pb.registerCGSCCAnalyses(cam);
pb.registerFunctionAnalyses(fam);
pb.registerLoopAnalyses(lam);
pb.crossRegisterProxies(lam, fam, cam, mam);

auto mpm = pb.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O2);
mpm.run(mod, mam);

Anatomy

  • Analysis managers cache analyses (dominator tree, loop info, alias analysis, …) per IR unit. They must be registered with each other so a function pass can ask for a module-level analysis.
  • PassBuilder assembles a ModulePassManager corresponding to one of the standard -O0/-O1/-O2/-O3/-Os/-Oz pipelines.
  • mpm.run(mod, mam) mutates the module in place.
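
The caching behaviour of an analysis manager is the part that is easiest to misread from the API alone. The class below is a toy model, not LLVM's interface: it keys results by function name and uses an int as the "analysis result", purely to show compute-once, reuse, and invalidate.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy model of an analysis manager (not LLVM's API): results are computed on
// first request, cached per IR unit, and dropped when a pass invalidates them.
class ToyAnalysisManager {
public:
    explicit ToyAnalysisManager(std::function<int(const std::string&)> analysis)
        : analysis_(std::move(analysis)) {}

    int getResult(const std::string& fn) {
        auto it = cache_.find(fn);
        if (it == cache_.end()) {
            ++runs_;  // the expensive computation happens only on a miss
            it = cache_.emplace(fn, analysis_(fn)).first;
        }
        return it->second;
    }
    void invalidate(const std::string& fn) { cache_.erase(fn); }
    int runs() const { return runs_; }

private:
    std::function<int(const std::string&)> analysis_;
    std::map<std::string, int> cache_;
    int runs_ = 0;
};
```

In the real new PM, a pass reports which analyses it preserved and the manager invalidates the rest; the explicit invalidate call here stands in for that protocol.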

Custom pipelines

You can build your own with addPass(MyPass()). We rely on the canned O2 pipeline so we inherit decades of tuning.

Where this lives

llvm_codegen.cpp → optimise(Module&). mlcc calls it when given -O. Tests pass runOpt = true to exercise specific transformations.

Step 05 · mem2reg + the O2 pipeline

Our lowering is deliberately naïve: every named local becomes an alloca i64 in the entry block, every read is a load, every write a store. We do not try to construct SSA in the front end.

That's fine, because O2 includes mem2reg (a.k.a. PromoteMemToReg), which:

  1. Finds allocas whose only uses are direct loads/stores.
  2. Replaces them with proper SSA values, inserting Φ-nodes where control flow joins.

After mem2reg, downstream passes can do real work:

  • instcombine — peephole rewrites.
  • gvn — global value numbering deduplicates.
  • simplifycfg — collapses trivial branches.
  • licm — hoists loop invariants.
  • loop-unroll, loop-vectorize, slp-vectorize — where profitable.
  • globalopt — turns module-private mutable globals into constants when only initialised once.

Observable

let x = 7;
print x;

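Before -O:

@x = global i64 0
…
store i64 7, ptr @x
%0 = load i64, ptr @x
call i32 (ptr, ...) @printf(ptr @.fmt, i64 %0)

After -O (test test_mem2reg_after_opt_eliminates_allocas asserts this):

call i32 (ptr, ...) @printf(ptr @.fmt, i64 7)

Because x is a top-level binding it is a global, so this particular fold comes from store-to-load forwarding rather than alloca promotion; a let inside a function body exercises mem2reg itself.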

Lesson

Front-ends do not need a smart code generator. Emit straightforward load/store-heavy IR; let mem2reg + the rest of O2 turn it into great machine code. This is the LLVM superpower.

Step 06 · Globals and the print runtime

MiniLang's top-level let bindings become module globals. We scan every function for LoadGlobal/StoreGlobal to discover the names, then create one GlobalVariable each:

new llvm::GlobalVariable(
    mod, i64, /*isConstant=*/false,
    llvm::GlobalValue::ExternalLinkage,
    llvm::ConstantInt::get(i64, 0), name);

External linkage keeps the symbol visible in the .o we'd emit with llc — necessary for any future linker-level integration.

printf shim

auto* ft = llvm::FunctionType::get(i32, {i8p}, /*isVarArg=*/true);
printfFn = llvm::Function::Create(ft, ExternalLinkage, "printf", mod);
fmtStr = b.CreateGlobalString("%lld\n", ".fmt", 0, &mod);

CreateGlobalString returns a pointer (i8*/ptr under opaque pointers) directly usable as the first argument to printf. Under LLVM 20 the textual form is ptr @.fmt.

Why printf and not a hand-rolled write(2) loop? Three reasons:

  • the C runtime is always available, whether the module runs under a JIT or goes through the system linker;
  • %lld is portable and exactly matches our i64;
  • it lets lli execute the program with no extra plumbing.

cp-14 will replace this with a proper ml_print(Value) runtime that understands strings, booleans and closures.

Step 07 · Targets and llc

Our mlcc emits target-independent LLVM IR. To produce a binary:

./build/mlcc -O program.ml > program.ll
/opt/homebrew/opt/llvm/bin/llc -O3 -filetype=obj program.ll -o program.o
clang program.o -o program
./program

llc walks:

LLVM IR
  → SelectionDAG / GlobalISel    (instruction selection)
  → MachineInstr                 (target-specific MI)
  → Register allocation
  → Instruction scheduling
  → Machine code emission         (.o)

Going programmatic

Inside C++, you'd:

  1. InitializeNativeTarget, InitializeNativeTargetAsmPrinter.
  2. TargetRegistry::lookupTarget(sys::getDefaultTargetTriple(), err).
  3. target->createTargetMachine(...).
  4. Set the module's DataLayout and TargetTriple.
  5. Use a legacy PassManager + addPassesToEmitFile(...) to write a .o.

We stop short of that in cp-11 to keep the lab focused; cp-12 (ORC JIT) will internalise target initialisation in order to execute IR without lli.

When to add a custom target

Building a real backend is a full course. For research languages, ride the existing X86/AArch64/RISC-V backends. Only write a target when you ship custom silicon.

cp-12 · ORC JIT — compile and execute in-process

cp-11 produced an optimised llvm::Module. cp-12 hands that module to LLVM's modern JIT (ORCv2, wrapped behind LLJIT) which compiles it to native code on the spot, looks up main, and calls it.

Build & run

cmake -S src/cpp -B build
cmake --build build
./build/tests/test_jit         # → 17/17 checks passed

echo 'print 2 + 3 * 4;' | ./build/mljit              # 14
echo 'print 2 + 3 * 4;' | ./build/mljit --emit-llvm  # textual IR instead
./build/mljit -O program.ml                          # run after O2

Headline test (recursive fib)

fn fib(n){ if (n < 2) { return n; } return fib(n-1) + fib(n-2); }
print fib(20);

JIT-compiled with -O → 6765.

Layout

src/cpp/
├── CMakeLists.txt        # links orcjit + executionengine + native + nativecodegen
├── src/jit.hpp / jit.cpp # initNative() + runMain(ctx, module)
├── src/main.cpp          # `mljit` CLI
└── tests/test_jit.cpp    # 17/17 checks

Reading order

  1. steps/01-why-jit.md
  2. steps/02-orc-overview.md
  3. steps/03-lljit-builder.md
  4. steps/04-thread-safe-module.md
  5. steps/05-symbol-lookup-and-call.md
  6. steps/06-runtime-symbols.md
  7. steps/07-beyond-lljit.md

Step 01 · Why JIT?

Ahead-of-time (AOT) compilation produces a binary at build time. Just-in-time (JIT) compilation defers code generation to run time:

| | AOT | JIT |
|---|---|---|
| Compile cost paid | once, off-line | every invocation (cached after first run) |
| Knows actual inputs | no | yes, can specialise on profile data |
| Patching live code | hard | first-class (tiering, deopt) |
| Cold-start latency | tiny | non-trivial |
| Deploy artefact | .exe | the JIT + bytecode/IR |

Real-world JITs: HotSpot (Java), V8 (JS), LuaJIT, Julia, Numba, PyPy, Pharo. They share three ingredients:

  1. An IR low enough to lower to machine code (LLVM IR, Sea-of-Nodes, etc.).
  2. A code generator that emits into executable memory.
  3. A symbol table that lets fresh code call previously-jitted code and runtime helpers.

LLVM gives us all three through ORC.

Step 02 · ORC overview

ORC = On-Request Compilation. ORCv2 is the current modular JIT framework inside LLVM.

ExecutionSession
  ├── JITDylibs      (≈ shared libraries; symbol namespaces)
  │   └── MaterializationUnits  (lazy producers of symbols)
  └── Layers (stack):
        ObjectLinkingLayer (loads .o into memory)
            ↑
        IRCompileLayer    (Module → .o via TargetMachine)
            ↑
        IRTransformLayer  (optional: run passes per-Module)
            ↑
        CompileOnDemandLayer / SpeculativeJIT (lazy & tiering)

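Each layer wraps the program representation it is handed (IR, object file) in a MaterializationUnit that produces symbols on demand. When lookup("main") runs, ORC materialises the unit that defines main, passing it down through the layers until the machine code is in memory.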
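
The on-demand idea can be modelled in a few lines. ToyDylib below is a toy, not the ORC API: a symbol is registered together with a producer, and the producer (standing in for "compile this module") runs only when the symbol is first looked up.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

using EntryPoint = int (*)();

// Toy model of lazy materialisation (not the ORC API).
class ToyDylib {
public:
    void define(const std::string& name, std::function<EntryPoint()> producer) {
        pending_[name] = std::move(producer);
    }

    // Returns nullptr for unknown symbols.
    EntryPoint lookup(const std::string& name) {
        auto ready = ready_.find(name);
        if (ready != ready_.end()) return ready->second;  // already compiled
        auto todo = pending_.find(name);
        if (todo == pending_.end()) return nullptr;
        EntryPoint fn = todo->second();  // "materialise": compile on demand
        ++materialised_;
        pending_.erase(todo);
        ready_[name] = fn;
        return fn;
    }
    int materialised() const { return materialised_; }

private:
    std::map<std::string, std::function<EntryPoint()>> pending_;
    std::map<std::string, EntryPoint> ready_;
    int materialised_ = 0;
};
```

Real ORC adds dependency tracking between symbols and thread safety on top of this shape, which is why we let LLJIT wire it up for us.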

LLJIT

LLJIT is a turnkey façade that pre-wires those layers with sensible defaults (one main JITDylib, IR-compile + linking layers, native target). For a simple compile-and-run scenario you never need to touch ORC's plumbing directly.

Step 03 · LLJITBuilder & initialisation

llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();
llvm::InitializeNativeTargetAsmParser();

These pull the host backend (X86 / AArch64 / …) into the binary. Without them LLJITBuilder::create() fails because no target is registered for the host triple.

We wrap the three calls behind std::call_once so it's safe to call jit::runMain from anywhere.

Building the JIT

auto jitOrErr = llvm::orc::LLJITBuilder().create();
if (!jitOrErr) /* report toString(takeError()) */;
auto jit = std::move(*jitOrErr);

LLJITBuilder exposes knobs (setNumCompileThreads, setDataLayout, setObjectLinkingLayerCreator, …) for advanced setups. We accept defaults.

The DataLayout

LLJIT derives the data layout from the JIT's TargetMachine and applies it to every module added later that does not already carry one (a module with a conflicting layout is rejected). That's why we can build the Module in cp-11 without specifying a DataLayout up front: LLJIT patches it before compilation.

Step 04 · ThreadSafeModule

LLVM Module and LLVMContext are not thread-safe. ORC wraps each (module, context) pair in ThreadSafeModule to enforce a single-thread access discipline.

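llvm::orc::ThreadSafeModule tsm(std::move(mod), std::move(ctx));
// addIRModule returns llvm::Error; an unchecked Error aborts in assert builds
if (auto err = jit->addIRModule(std::move(tsm)))
    /* report toString(std::move(err)) */;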

Notice both unique_ptrs are moved in. After this:

  • Your local mod and ctx are empty.
  • LLJIT owns the module until it has been materialised; afterwards it drops the IR and keeps only the compiled object.

Why one context per module?

Sharing a context between modules forces a global lock around code generation. Giving each module its own context lets ORC's optional compile-threads work in parallel without contention.

In CodegenResult (cp-11) we already kept ctx and mod as unique_ptrs in the right destruction order for exactly this hand-off.

Step 05 · Symbol lookup & calling main

auto sym = jit->lookup("main");
if (!sym) /* surface toString(sym.takeError()) */;
auto fnPtr = sym->toPtr<int(*)()>();
int  exitCode = fnPtr();

lookup triggers materialisation: the module has not been compiled or linked when addIRModule returns; that work happens when one of its symbols is first asked for. This is the entry point to ORC's laziness.

Type-safe call

toPtr<Fn>() is a templated cast that returns a function pointer of the requested signature. It checks nothing: get the signature wrong and the call is undefined behaviour. Since mlcc always produces define i32 @main(), we ask for int(*)().

Capturing stdout

Tests need to assert on what printf printed. We dup fd 1 to a temp file around the call, then slurp the file. POSIX-only, but plenty for cp-12. Production code would either:

  • register a custom puts/printf symbol that writes to a buffer, or
  • run the JIT in a subprocess and read its stdout.
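The dup-to-a-temp-file trick described above can be sketched as follows. This is a minimal POSIX-only illustration, not the lab's actual helper: the name captureStdout and the temp-file path are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <string>
#include <unistd.h>

// Hypothetical helper: runs fn() with file descriptor 1 redirected to a
// temp file and returns everything written to stdout in the meantime.
std::string captureStdout(const std::function<void()>& fn) {
    char path[] = "/tmp/jit-capture-XXXXXX";
    int tmpFd = mkstemp(path);
    assert(tmpFd >= 0);

    std::fflush(stdout);          // drain anything buffered before the swap
    int savedFd = dup(1);         // remember the real stdout
    dup2(tmpFd, 1);               // fd 1 now points at the temp file

    fn();                         // e.g. the JIT'd main

    std::fflush(stdout);          // push the callee's buffered output out
    dup2(savedFd, 1);             // restore the real stdout
    close(savedFd);

    std::string out;
    lseek(tmpFd, 0, SEEK_SET);    // rewind and slurp the file
    char buf[4096];
    ssize_t n;
    while ((n = read(tmpFd, buf, sizeof buf)) > 0)
        out.append(buf, static_cast<size_t>(n));
    close(tmpFd);
    unlink(path);
    return out;
}
```

A test would wrap the call to the looked-up function pointer in the lambda and compare the returned string. The two fflush calls matter: without them, output still sitting in the C stdio buffer lands on the wrong side of the redirection.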

Step 06 · Runtime symbols (printf and friends)

When our JITed module says call i32 @printf(...), ORC needs an address for the external symbol. Recent LLVM releases have LLJIT link the host process's symbols by default through a DynamicLibrarySearchGenerator::GetForCurrentProcess(...), which dlsyms into the host process; on older releases you attach that generator to the main JITDylib yourself.

Since the test binary is linked against libc, printf resolves immediately and the JITed call writes to our captured stdout.

Injecting your own runtime functions

// inside jit.cpp, after LLJITBuilder().create()  (LLVM 17+ symbol-map API)
auto& dylib = jit->getMainJITDylib();
// define() returns llvm::Error, which must be consumed
llvm::cantFail(dylib.define(orc::absoluteSymbols({
    { jit->mangleAndIntern("ml_print"),
      { orc::ExecutorAddr::fromPtr(&ml_print),
        llvm::JITSymbolFlags::Exported } }
})));

cp-14 will introduce a real ml_print(Value) helper and a tiny runtime library. Then the front end can stop hard-coding printf and emit call void @ml_print(%Value) instead — opening the door to boxed types, GC headers, etc.

Mangling

On macOS, symbols carry a leading underscore (_printf). LLJIT does the right mangling internally; mangleAndIntern exposes the same logic when you register a host pointer.

Step 07 · Beyond LLJIT — lazy, tiered, remote

LLJIT covers ~80 % of JIT use cases. ORC gives you more when you need it.

Lazy compilation

LLLazyJIT (an extension of LLJIT) wraps each function in a stub. The function isn't compiled until the stub is hit, which dramatically reduces start-up time for big modules where only a fraction of code runs.

Tiering

A real engine like V8 starts cold code in an interpreter, then re-compiles hot functions with full optimisations, then patches the call sites. Build this on ORC by:

  1. Tier-1: compile with OptimizationLevel::O0 immediately.
  2. Use sampling or counters in the runtime to find hot functions.
  3. Submit a fresh module for those functions with O3 to a compile-thread.
  4. Route calls through an indirection stub (ORC's indirect-stubs machinery) and repoint the stub at the new body to swap the entry point atomically.
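The counter-then-swap idea in the steps above can be shown without any LLVM at all. The sketch below is an assumption-laden miniature: both "tiers" are ordinary C++ functions standing in for O0 and O3 code, the names (squareEntry, kHotThreshold, and so on) are invented for the example, and the swap happens inline rather than on a compile thread.

```cpp
#include <atomic>
#include <cassert>

using Fn = int (*)(int);

int squareTier1(int x);                          // stands in for the O0 body
int squareTier2(int x) { return x * x; }         // stands in for the O3 body

std::atomic<Fn>  squareEntry{squareTier1};       // the patchable entry point
std::atomic<int> squareCalls{0};                 // hotness counter
constexpr int kHotThreshold = 3;

int squareTier1(int x) {
    // Count calls; once hot, publish the optimised body. A real engine
    // would queue a recompile here and swap when the new code is ready.
    if (squareCalls.fetch_add(1) + 1 >= kHotThreshold)
        squareEntry.store(squareTier2);
    int n = x < 0 ? -x : x, r = 0;
    for (int i = 0; i < n; ++i) r += n;          // deliberately naive
    return r;
}

// Every call site goes through the atomic slot, so a single store
// retargets all future calls.
int callSquare(int x) {
    Fn f = squareEntry.load();
    assert(f != nullptr);
    return f(x);
}
```

The design point is the single level of indirection: callers never hold the address of a particular tier, only the slot, which is what makes the swap safe while calls are in flight.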

Remote / out-of-process JIT

ORC's ExecutorProcessControl abstraction (SimpleRemoteEPC in current releases; OrcRemoteTargetClient in older ones) lets you compile in one process and execute in another (or on a different machine). Great for embedded targets or sandboxing untrusted code.

Caching

ObjectCache plugs into the IR-compile layer to memoise compiled objects across runs — perfect for shell-style use of the JIT where the same script is launched repeatedly.

What we won't tackle in this curriculum

Integration with debuggers (GDB/LLDB JIT interface) and profilers (VTune), and deopt/inline-cache machinery (V8/HotSpot-style ICs), each deserve a course of their own. cp-12 leaves you with a working foundation; adding any one of the above is a focused, incremental exercise on top of jit::runMain.

cp-13 · MLIR foundations — emit, lower, translate

cp-11/12 used LLVM IR directly. cp-13 takes a step up the abstraction ladder to MLIR (Multi-Level Intermediate Representation): a generic framework for building IRs with first-class regions, blocks, and extensible dialects.

We emit our TAC IR as MLIR text in the llvm dialect (a near-1:1 mapping of LLVM IR into MLIR syntax), then drive Homebrew's mlir-translate + lli to execute the program.

Why the llvm dialect?

Real MLIR projects build a custom dialect (minilang.*) and lower it through arith, cf, memref to llvm. That's pedagogically fantastic but operationally fragile across MLIR versions. cp-13 keeps the toolchain minimal so every test passes out-of-the-box on LLVM/MLIR 20. The step docs walk through what a full dialect would look like.

Build & run

cmake -S src/cpp -B build
cmake --build build
./build/tests/test_mlir_emit               # → 25/25 checks passed

echo 'print 2+3*4;' | ./build/mlmlir       # emit MLIR
echo 'print 2+3*4;' | ./build/mlmlir --run # → 14

Inspect the pipeline by hand:

echo 'print 42;' | ./build/mlmlir > /tmp/m.mlir
/opt/homebrew/opt/llvm/bin/mlir-opt /tmp/m.mlir --canonicalize
/opt/homebrew/opt/llvm/bin/mlir-translate /tmp/m.mlir --mlir-to-llvmir
/opt/homebrew/opt/llvm/bin/mlir-translate /tmp/m.mlir --mlir-to-llvmir \
  | /opt/homebrew/opt/llvm/bin/lli

Reading order

  1. steps/01-why-mlir.md
  2. steps/02-ir-shape.md
  3. steps/03-dialects.md
  4. steps/04-llvm-dialect-mapping.md
  5. steps/05-pipeline-mlir-translate.md
  6. steps/06-progressive-lowering.md
  7. steps/07-when-to-reach-for-mlir.md

Step 01 · Why MLIR?

LLVM IR is a great low-level representation, but it has one level. By the time a high-level construct (a tensor reshape, a SQL plan, a distributed task) becomes LLVM IR, all its structure is gone and domain-specific optimisation is much harder.

MLIR (Multi-Level IR) solves this by letting you define dialects that live alongside one another in the same module. You can write your compiler as a series of dialect-to-dialect lowerings, each step losing only the structure you no longer need.

Origins

MLIR was born out of TensorFlow's compiler stack at Google and was later contributed to the LLVM project. Today MLIR is the foundation of:

  • TensorFlow / XLA and IREE
  • Mojo (Modular)
  • Triton (OpenAI's GPU DSL)
  • CIRCT (hardware design)
  • Polygeist (C → MLIR)
  • Flang (Fortran)

What we gain

  • Composability: pick the right abstraction for each pass.
  • Reuse: standard dialects (arith, cf, memref, linalg, vector, gpu, …) give you an enormous toolbox.
  • Lower-bar custom dialects: ODS (TableGen) generates op classes.
  • Common verifier / printer / parser infrastructure.

What we don't tackle in cp-13

Defining a custom dialect in C++ via ODS. That's a multi-day deep dive and tightly coupled to LLVM/MLIR version specifics. cp-18 leaves a spec exercise; here we focus on understanding the IR shape and the lowering toolchain by emitting the llvm dialect.

Step 02 · The shape of MLIR

module {
  llvm.func @main() -> i32 {
    %0 = llvm.mlir.constant(42 : i64) : i64
    %1 = llvm.mlir.addressof @fmt : !llvm.ptr
    %2 = llvm.call @printf(%1, %0) vararg(!llvm.func<i32 (ptr, ...)>)
         : (!llvm.ptr, i64) -> i32
    %3 = llvm.mlir.constant(0 : i32) : i32
    llvm.return %3 : i32
  }
}

Key concepts:

  • Operation: every line is an Operation. The name carries the dialect (llvm., arith., func., scf., ...).
  • Region: a list of Blocks, enclosed in { ... }. Some ops (scf.for, func.func) carry nested regions; that's how MLIR expresses structured control flow.
  • Block: a list of operations ending in a terminator. Labels are ^bb0, ^bb1, .... Blocks may take SSA arguments (MLIR's unification of Φ-nodes and parameters).
  • Value (%name): SSA result of an op or a block argument.
  • Type (i64, !llvm.ptr, tensor<4xf32>): typed by the dialect; the ! prefix marks a dialect-defined (non-builtin) type.

Implications

  • No global symbol table for SSA values: names are scoped to the enclosing function-like region, so every function can start again at %0.
  • In the generic form every op states all its operand and result types, so the IR is self-describing and mlir-opt can parse it without the producing dialect's C++ classes (pass --allow-unregistered-dialect when the dialect is not registered).
  • module itself is an op whose region holds the program.

Our emission strategy

Emitter::emitFunction produces an llvm.func with one entry block, allocas for every named local, then an llvm.br ^bb1 into the first TAC block. After that each TAC block becomes a ^bbN label and its instructions translate one-for-one to llvm.* ops.

Step 03 · Dialects worth knowing

A dialect is a namespace of operations + types + attributes. Some upstream ones you'll meet constantly:

| Dialect | Purpose |
|---|---|
| builtin | module (builtin.module) and the builtin types; older MLIR also kept func here (builtin.func) |
| func | func.func, func.call, func.return |
| arith | Pure integer/float math: arith.addi, arith.cmpi, ... |
| cf | Unstructured control flow: cf.br, cf.cond_br |
| scf | Structured control flow: scf.for, scf.if, scf.while |
| memref | Memory references with shape/layout |
| tensor | Immutable value tensors |
| linalg | High-level array/linear-algebra ops |
| vector | Explicit SIMD vectors |
| affine | Polyhedral loops, ideal for analyses |
| gpu, nvvm, rocdl, spirv | Device backends |
| llvm | Mirror of LLVM IR; the terminal target |

The point: write your compiler as a sequence mydialect → linalg → memref → scf → cf → llvm, each step removing abstraction you no longer need.

Defining a dialect (in C++)

You declare ops in TableGen (.td), which mlir-tblgen expands into C++ classes. A typical workflow:

  1. MinilangOps.td — declare ops, types, attributes.
  2. Register the dialect with MLIRContext::loadDialect.
  3. Implement verify, canonicalize, fold per op.
  4. Write a MinilangToLLVM conversion pass (mlir::ConversionTarget + RewritePatterns).

cp-18's capstone leaves the dialect implementation as a guided exercise; the heavy lifting is mostly mechanical TableGen + pattern boilerplate.

Step 04 · LLVM-dialect mapping

Our emitter speaks one dialect: llvm. The mapping is mechanical:

| TAC | MLIR llvm |
|---|---|
| Numeric constant | llvm.mlir.constant(N : i64) : i64 |
| Named local x | llvm.alloca in entry block, llvm.load/llvm.store thereafter |
| Add a,b | llvm.add %a, %b : i64 |
| Sub/Mul/Div/Mod | llvm.sub/mul/sdiv/srem |
| And/Or | llvm.and/or |
| Eq/Ne/Lt/... | llvm.icmp "eq"/"ne"/"slt"/... then llvm.zext _ : i1 to i64 |
| Neg | llvm.sub %zero, %a |
| Not | llvm.icmp "eq" %a, %zero then zext |
| LoadGlobal x | llvm.mlir.addressof @x then llvm.load |
| StoreGlobal x, v | llvm.mlir.addressof @x then llvm.store |
| Print v | llvm.call @printf(@fmt, v) vararg(...) : (!llvm.ptr, i64) -> i32 |
| Call f(args) | llvm.call @f(args) : (...) -> i64 |
| Jump bb | llvm.br ^bbN |
| CondJump v, T, F | llvm.icmp "ne" v, 0 then llvm.cond_br ... ^bbT, ^bbF |
| Return v | llvm.return %v : i64 (i32 0 for main) |
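To make "the mapping is mechanical" concrete, here is a small sketch of how two rows of the table (binary arithmetic and comparisons) could be turned into text. The TacOp enum and the function names are assumptions for illustration; the lab's Emitter is organised differently and handles many more cases.

```cpp
#include <cassert>
#include <string>

// Hypothetical, cut-down TAC opcode set; the lab's IR has more cases.
enum class TacOp { Add, Sub, Mul, Div, Mod, And, Or };

// Binary TAC opcode -> llvm-dialect op name, following the mapping table.
inline const char* llvmOpName(TacOp op) {
    switch (op) {
        case TacOp::Add: return "llvm.add";
        case TacOp::Sub: return "llvm.sub";
        case TacOp::Mul: return "llvm.mul";
        case TacOp::Div: return "llvm.sdiv";
        case TacOp::Mod: return "llvm.srem";
        case TacOp::And: return "llvm.and";
        case TacOp::Or:  return "llvm.or";
    }
    return "llvm.unknown";  // unreachable for valid opcodes
}

// SSA value name for a numeric id, e.g. 2 -> "%2".
inline std::string ssa(int id) {
    assert(id >= 0);
    return "%" + std::to_string(id);
}

// Emits e.g. "%2 = llvm.add %0, %1 : i64".
inline std::string emitBinary(TacOp op, int dst, int lhs, int rhs) {
    return ssa(dst) + " = " + llvmOpName(op) + " " + ssa(lhs) + ", " +
           ssa(rhs) + " : i64";
}

// Comparisons take two ops: icmp yields i1, zext widens it back to i64.
inline std::string emitCompare(const std::string& pred, int tmp, int dst,
                               int lhs, int rhs) {
    return ssa(tmp) + " = llvm.icmp \"" + pred + "\" " + ssa(lhs) + ", " +
           ssa(rhs) + " : i64\n" +
           ssa(dst) + " = llvm.zext " + ssa(tmp) + " : i1 to i64";
}
```

For instance, emitBinary(TacOp::Add, 2, 0, 1) yields the line `%2 = llvm.add %0, %1 : i64`. Emitting text keeps the sketch free of MLIR's C++ API; a production emitter would more likely build ops through an OpBuilder.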

Globals

llvm.mlir.global internal @x(0 : i64) : i64

internal linkage; initial value 0. Stores at main-time install the user's initialiser. llvm.mlir.addressof @x reifies the global as a !llvm.ptr value.

printf

llvm.mlir.global internal constant @fmt("%lld\0A\00") {addr_space = 0 : i32}
llvm.func @printf(!llvm.ptr, ...) -> i32

Variadic call sites must spell their vararg signature:

%r = llvm.call @printf(%f, %v) vararg(!llvm.func<i32 (ptr, ...)>)
     : (!llvm.ptr, i64) -> i32

That's MLIR's way of preserving variadic information that LLVM's function type would otherwise carry as (...).

Step 05 · The mlir-translate / lli pipeline

mlir-translate is MLIR's bridge to external IRs. Its --mlir-to-llvmir mode walks an MLIR module that's already in the llvm dialect and produces textual LLVM IR. From there lli or llc take over.

MiniLang source
   │  Lexer → Parser → Resolver → TypeChecker → IR builder
   ▼
MiniLang TAC IR
   │  mlir_emit::emit
   ▼
MLIR (`llvm` dialect)
   │  mlir-translate --mlir-to-llvmir
   ▼
LLVM IR (text)
   │  lli   (or  llc -filetype=obj  + ld)
   ▼
Process exit code + stdout

Implementing the pipeline in C++

runShell (in mlir_emit.cpp) forks /bin/sh -c "mlir-translate ... | lli" with pipes attached. Robust enough for tests; production tooling would likely link MLIR's own translation library instead.

Catching errors

If mlir-translate rejects our IR (wrong type, missing op), the pipe breaks and lli exits non-zero. We capture stderr from the child into PipelineResult::error so test failures point at the offending stage.
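A simplified stand-in for the shell-out step is sketched below. It swaps the fork-with-pipes approach described above for popen, and it folds stderr into stdout instead of keeping a separate error field, so treat ShellResult and runShellSketch as hypothetical names rather than the lab's runShell and PipelineResult.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <sys/wait.h>

// Hypothetical result type: exit code of the shell plus combined output.
struct ShellResult {
    int exitCode = -1;     // -1 means the command could not be run at all
    std::string out;       // stdout and stderr, interleaved
};

// Runs `cmd` through /bin/sh (via popen) and collects what it printed.
ShellResult runShellSketch(const std::string& cmd) {
    assert(!cmd.empty());
    ShellResult r;
    // Wrap the command in a group so "2>&1" applies to every pipeline stage.
    std::string full = "{ " + cmd + " ; } 2>&1";
    FILE* pipe = popen(full.c_str(), "r");
    if (!pipe) return r;

    char buf[4096];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, pipe)) > 0)
        r.out.append(buf, n);

    int status = pclose(pipe);                 // wait status of the shell
    if (status != -1 && WIFEXITED(status))
        r.exitCode = WEXITSTATUS(status);
    return r;
}
```

Note that a shell pipeline reports the exit status of its last stage only, which is why the text above relies on lli failing when mlir-translate produces no usable IR.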

Inspecting intermediate stages

echo 'print 42;' | ./build/mlmlir | tee /tmp/m.mlir
/opt/homebrew/opt/llvm/bin/mlir-opt /tmp/m.mlir --canonicalize       # pretty + sanity-check
/opt/homebrew/opt/llvm/bin/mlir-translate /tmp/m.mlir --mlir-to-llvmir
/opt/homebrew/opt/llvm/bin/mlir-translate /tmp/m.mlir --mlir-to-llvmir \
  | /opt/homebrew/opt/llvm/bin/llc -O2 -filetype=obj -o /tmp/m.o

Step 06 · Progressive lowering — what a "real" pipeline looks like

Emitting llvm dialect directly skips MLIR's superpower: progressive lowering. Here's the shape of a fuller pipeline you would build once you have a custom dialect.

A hypothetical minilang dialect

module {
  minilang.func @add(%a: !minilang.value, %b: !minilang.value) -> !minilang.value {
    %r = minilang.add %a, %b : !minilang.value
    minilang.return %r : !minilang.value
  }
  minilang.func @main() {
    %a = minilang.const #minilang.num<40> : !minilang.value
    %b = minilang.const #minilang.num<2>  : !minilang.value
    %c = minilang.call @add(%a, %b) : (!minilang.value, !minilang.value) -> !minilang.value
    minilang.print %c : !minilang.value
    minilang.return
  }
}

The !minilang.value type encodes our boxed runtime value (Nil / Bool / Number / Str / Fn).

Lowering passes

# The minilang passes are hypothetical, like the dialect itself:
#   --minilang-specialise-numeric   unbox numeric ops where types prove safe
#   --convert-minilang-to-func      minilang.func/call → func dialect
#   --convert-minilang-to-arith     numeric ops → arith
#   --convert-minilang-to-cf        control flow lowered
#   --convert-minilang-to-memref    boxed values → struct in memref
mlir-opt minilang.mlir \
    --minilang-specialise-numeric \
    --convert-minilang-to-func \
    --convert-minilang-to-arith \
    --convert-minilang-to-cf \
    --convert-minilang-to-memref \
    --convert-arith-to-llvm \
    --convert-cf-to-llvm \
    --finalize-memref-to-llvm \
    --convert-func-to-llvm \
    --reconcile-unrealized-casts \
  | mlir-translate --mlir-to-llvmir \
  | lli

Each --convert-* is a RewritePattern set authored once and reused across every MiniLang program. That's the value MLIR offers: a ready-made pattern infrastructure plus a dozen battle-tested target dialects.

Domain-specific optimisation

Before lowering to arith, the minilang dialect can run high-level passes: type-specialise polymorphic numeric ops, sink GC barriers, inline closures whose upvalues are constant. Those are nearly impossible to do once everything is i64s and pointers.

What we lose by going straight to llvm

  • No structured loops (scf.for) so loop transformations like --affine-loop-unroll and --scf-parallel-loop-fusion are out.
  • No higher-level type info — every value is i64.
  • No room for domain-specific peepholes.

For a numeric scripting language those losses are small; for ML or DSP workloads they're enormous.

Step 07 · When to reach for MLIR

MLIR is heavyweight. Reach for it when at least one of these is true:

  1. Multiple abstraction layers are useful. A SQL planner, a tensor compiler, an HDL toolchain, a DSL with high-level semantic passes — anywhere you want to optimise before losing structure.
  2. You'll write many compilers, sharing infrastructure. A team building five DSLs benefits from a shared dialect ecosystem.
  3. You need polyhedral / loop-nest transforms — the affine and linalg dialects have no LLVM-only equivalent.
  4. You target heterogeneous hardware (GPU, TPU, FPGA). The gpu, nvvm, rocdl, and spirv dialects, plus the hardware dialects of the CIRCT project, let you keep one front-end.

When not to use MLIR:

  • Simple scripting language → LLVM directly (cp-11/12 path).
  • Single-target compiler with no high-level structure to preserve.
  • Time is short or fast iteration matters: MLIR's per-dialect ceremony is a real cost.

What we'd build next in this curriculum

  • cp-18 (capstone) sketches a minilang.* dialect with one custom high-level pass (escape analysis to elide allocations) plus a manual conversion to the llvm dialect. The infrastructure cost is large; the educational payoff is understanding the one-source/many-targets pattern that MLIR enables.

Resources

  • The MLIR Toy tutorial (chapters 1–7) is the canonical hands-on introduction — it builds a tiny tensor language end-to-end.
  • "The Architecture of Open Source Applications" chapter on LLVM, for the design that MLIR generalises.
  • The MLIR Open Meeting talks (mlir.llvm.org), especially the dialect spotlights.

cp-14 · Runtime Systems

cp-01..cp-13 produce code. cp-14 builds the runtime that the code calls into: a tagged value representation, a mark-sweep garbage collector, and high-level operations (add, print, array indexing).

Build & run

cmake -S src/cpp -B build
cmake --build build
./build/tests/test_runtime    # → 36/36 checks passed
./build/mlrt-demo             # → hello, world  [0, 1, 4]  ...

What lives here

src/cpp/src/
    value.hpp         — tagged 64-bit Value
    heap.hpp/.cpp     — Object header + mark-sweep GC
    runtime.hpp/.cpp  — add / sub / mul / print / array ops
    main.cpp          — demo binary
tests/
    test_runtime.cpp  — 14 tests, 36 checks

The runtime is standalone — no LLVM, no IR. cp-17's capstone wires it into the JIT from cp-12.

Reading order

  1. steps/01-the-runtime-layer.md
  2. steps/02-value-representation.md
  3. steps/03-object-layout.md
  4. steps/04-mark-sweep-gc.md
  5. steps/05-roots-and-safepoints.md
  6. steps/06-allocation-strategies.md
  7. steps/07-beyond-mark-sweep.md

Step 01 · The runtime layer

A compiler produces code, but that code runs on top of services provided by the runtime:

  • Allocate heap objects
  • Move/free objects (garbage collection)
  • Construct boxed values (strings, arrays, closures)
  • Format values for print
  • Raise structured errors (exceptions)
  • Provide builtin functions and FFI bridges

These services live in a small library that the codegen calls into. Three places they show up:

  1. Compiler emits calls. ml_alloc_string becomes an external symbol; the linker (or JIT) resolves it to a runtime function.
  2. Compiler emits inline code that uses the runtime's invariants — reading the tag bits of a Value, indexing into an Object header, etc.
  3. Compiler emits metadata for the runtime to consume — stack maps that tell the GC where pointers live in each frame, unwind tables for exceptions, debug info for backtraces.

This lab implements (1) and (2). (3) — emitting stack maps — is the hard, language-specific work that production runtimes invest heavily in; we approximate it with an explicit root API that the host program pushes Values onto.

Why a tagged Value?

MiniLang is dynamically typed. Every variable can hold a number, string, bool, nil, function, or array. The simplest representation is a struct — { tag, payload } — but that's at least 16 bytes per slot. A tagged 64-bit value gets us:

  • Pointer-sized (fits in registers)
  • Fast bit-test type checks
  • 63-bit fixnums with no boxing
  • Compatible with calling conventions for free

The trade-off is restricted integer range and a few bits of mental overhead — a worthwhile bargain for a scripting language.

Step 02 · Value representation

Our encoding (defined in value.hpp):

bit 63 ........................... 3 2 1 0
[              63-bit signed int      ] 1   → fixnum
[                          0  ][ 0 1 0 ]    → nil
[                          0  ][ 1 1 0 ]    → true
[                          0  ][ 1 0 1 0]   → false
[ pointer to Object, aligned to 8     ] 000 → heap object

How tests look

bool isFixnum() const { return raw & 1; }
bool isObject() const { return (raw & 0b111) == 0 && raw != 0; }

Each is a single and + compare, branch-prediction friendly.

Encoding fixnums

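static Value Fixnum(int64_t v) {
    // shift as unsigned: left-shifting a negative signed value is UB before C++20
    return {(static_cast<uint64_t>(v) << 1) | 1};
}
int64_t asFixnum() const { return ((int64_t)raw) >> 1; }   // arithmetic shift restores the sign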

We lose one bit of range. For MiniLang's scripting niche, ±2⁶² is plenty. Real languages that need full 64-bit integers either box big numbers (CPython, OCaml Int64.t) or use NaN-boxing (described below).
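Putting the encoding diagram and the accessors together, a self-contained version of the tagged word might look like this. The bit patterns come from the diagram in this step; the helper names Nil(), True(), False(), isNil() and isBool() are assumptions, since value.hpp itself is not shown here.

```cpp
#include <cassert>
#include <cstdint>

struct Value {
    uint64_t raw;

    // Low bit 1 marks a fixnum; the payload lives in the upper 63 bits.
    static Value Fixnum(int64_t v) {
        return {(static_cast<uint64_t>(v) << 1) | 1};
    }
    // Immediates, using the patterns from the encoding diagram.
    static Value Nil()   { return {0b010}; }
    static Value True()  { return {0b110}; }
    static Value False() { return {0b1010}; }

    bool isFixnum() const { return raw & 1; }
    bool isNil()    const { return raw == 0b010; }
    bool isBool()   const { return raw == 0b110 || raw == 0b1010; }
    // Heap pointers are 8-aligned, so their low three bits are zero.
    bool isObject() const { return (raw & 0b111) == 0 && raw != 0; }

    int64_t asFixnum() const {
        assert(isFixnum());
        // Arithmetic right shift (guaranteed from C++20, universal in
        // practice before that) restores the sign.
        return static_cast<int64_t>(raw) >> 1;
    }
};
```

Value::Fixnum(42) is stored as the raw word 85 (42 shifted left once, with the tag bit set), and a negative number such as -5 survives the round trip because the decode uses an arithmetic shift.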

NaN-boxing — the alternative

IEEE-754 doubles leave 51 free mantissa bits in a quiet NaN (plus the sign bit), enough to encode a pointer + a small tag. SpiderMonkey, JSC, and LuaJIT use variants of this trick:

double: [sign 1][exponent 11][mantissa 52]
A double is a quiet NaN iff exponent = all 1s AND mantissa MSB = 1.
That leaves 51 mantissa bits: e.g. a 48-bit pointer + 3 tag bits.

Pros: full IEEE doubles fly without boxing — huge for numeric code. Cons: bit-twiddling is finicky, hostile to debuggers, doesn't play nicely with sanitisers.

Our scheme (low-bit tagging) is simpler and integer-friendly; we'd swap to NaN-boxing only if floats became a major workload.
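For comparison, a NaN-boxed word can be sketched in a few lines. This is one possible layout (48-bit pointer, 3 tag bits above it) chosen for the example, not the layout of any particular engine, and it assumes user-space pointers fit in 48 bits, which holds on typical x86-64 and AArch64 configurations.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr uint64_t kQNaN    = 0x7ff8000000000000ULL;  // exponent all 1s + quiet bit
constexpr uint64_t kTagMask = 0x0007000000000000ULL;  // bits 48..50: 3 tag bits
constexpr uint64_t kPtrMask = 0x0000ffffffffffffULL;  // bits 0..47: pointer payload

struct NanBox {
    uint64_t bits;

    static NanBox fromDouble(double d) {
        NanBox b{};
        std::memcpy(&b.bits, &d, sizeof d);   // bit-cast without aliasing UB
        return b;
    }
    // tag must be 1..7: tag 0 is reserved so a genuine NaN still reads
    // as a double. (Real engines also canonicalise NaNs they produce.)
    static NanBox fromPointer(const void* p, unsigned tag) {
        auto addr = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(p));
        assert((addr & ~kPtrMask) == 0 && tag >= 1 && tag <= 7);
        return NanBox{kQNaN | (static_cast<uint64_t>(tag) << 48) | addr};
    }

    bool isDouble() const {
        return (bits & kQNaN) != kQNaN || (bits & kTagMask) == 0;
    }
    double asDouble() const {
        double d;
        std::memcpy(&d, &bits, sizeof d);
        return d;
    }
    unsigned tag() const {
        return static_cast<unsigned>((bits & kTagMask) >> 48);
    }
    void* asPointer() const {
        return reinterpret_cast<void*>(static_cast<uintptr_t>(bits & kPtrMask));
    }
};
```

Every ordinary double passes through untouched, which is the whole appeal; the price is that each pointer access needs a mask, and each type test a compare against the NaN prefix.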

Why aligned-to-8 pointers

std::malloc already returns 8-aligned blocks on every mainstream platform. We additionally round our allocation sizes up to 8 so the next object also lands aligned. The low 3 bits of any Object* we hand out are guaranteed zero → safe to overlay tags.

Step 03 · Object layout

Every heap object starts with the same header:

struct Object {
    ObjKind  kind;     // 1 byte: String, Array, ...
    uint8_t  marked;   // 1 byte: GC bookkeeping
    uint16_t _pad;     // 2 bytes
    uint32_t size;     // 4 bytes: total size including header
    Object*  next;     // 8 bytes: intrusive linked list
};

16 bytes of overhead per object. We trade a few bytes for:

  • A uniform header the GC can inspect blind.
  • An intrusive object table — no separate metadata structure to keep in sync. The sweep phase just walks head_ → next → next → ….
  • Inline size so sweep knows how much memory each object holds.

StringObj

struct StringObj : Object {
    uint32_t len;
    char     data[1];   // flexible array
};

Allocation: sizeof(StringObj) + len. We declare data[1] so the struct is well-formed even at length 0; the real size is computed in newString. We always NUL-terminate for cheap interop with C printers.
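The size computation can be sketched as below. newStringSketch is a hypothetical stand-in for the lab's newString: it uses plain malloc and skips the object-table linking and GC accounting, so only the layout arithmetic should be taken from it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

enum class ObjKind : uint8_t { String, Array };

struct Object {
    ObjKind  kind;
    uint8_t  marked;
    uint16_t _pad;
    uint32_t size;     // total size including header
    Object*  next;
};

struct StringObj : Object {
    uint32_t len;
    char     data[1];  // first byte of the inline character storage
};

// Round a byte count up to the next multiple of 8.
inline uint32_t roundUp8(size_t n) {
    return static_cast<uint32_t>((n + 7) & ~size_t{7});
}

// Hypothetical allocator: header + len bytes; data[1] already pays for the NUL.
inline StringObj* newStringSketch(const char* s, uint32_t len) {
    uint32_t bytes = roundUp8(sizeof(StringObj) + len);
    auto* o = static_cast<StringObj*>(std::malloc(bytes));
    assert(o && (reinterpret_cast<uintptr_t>(o) & 7) == 0);  // safe to tag
    o->kind = ObjKind::String;
    o->marked = 0;
    o->_pad = 0;
    o->size = bytes;
    o->next = nullptr;
    o->len = len;
    std::memcpy(o->data, s, len);
    o->data[len] = '\0';                 // cheap interop with C printers
    return o;
}
```

Writing past data[0] relies on the long-standing trailing-array idiom; it is what the one-element declaration is there to enable, but it is worth knowing that the C++ standard does not bless it formally.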

ArrayObj

struct ArrayObj : Object {
    uint32_t len;
    Value    elems[1];
};

Inline storage of Values (tagged words: fixnums, immediates, or object pointers). The GC iterates elems[0..len) during mark; no separate "type descriptor" is needed because the kind field tells it the layout.

Forwarding the design

For closures we'd add:

struct ClosureObj : Object {
    uint32_t numUpvalues;
    uint64_t funcId;      // index into the JIT'd function table
    Value    upvalues[1]; // captured environment
};

The mark routine grows a switch:

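switch (o->kind) {
    case ObjKind::Array: {               // trace every element
        auto* a = static_cast<ArrayObj*>(o);
        for (uint32_t i = 0; i < a->len; ++i) mark(a->elems[i]);
    } break;
    case ObjKind::Closure: {             // trace every captured upvalue
        auto* c = static_cast<ClosureObj*>(o);
        for (uint32_t i = 0; i < c->numUpvalues; ++i) mark(c->upvalues[i]);
    } break;
    case ObjKind::String: break;         // no outgoing pointers
}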

This is the canonical "GC tracing per kind" pattern. Variations: embed pointer offsets directly in the header, or rely on a type-descriptor pointer to call a virtual trace method.

Step 04 · Mark-sweep GC

Algorithm (in heap.cpp):

collect():
    for each root slot:
        mark(*slot)
    sweep:
        for each object in object table:
            if marked: clear mark
            else: unlink + free

mark is the classic tri-colour (here: just two-colour) traversal:

void markObject(Object* o) {
    if (!o || o->marked) return;     // already grey/black
    o->marked = 1;                   // turn black
    if (o->kind == ObjKind::Array) {
        auto* a = static_cast<ArrayObj*>(o);
        for (uint32_t i = 0; i < a->len; ++i)
            mark(a->elems[i]);       // grey-ify children
    }
}

Recursion-depth concern: a long array chain can blow the C stack. In production, switch to an explicit work queue (std::vector<Object*>) and pop until empty. Not needed for our tests but worth knowing.
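The explicit work-queue variant mentioned above might look like the following. The types are cut-down stand-ins (children are stored as raw Object* in a std::vector rather than as inline Values), so only the traversal shape carries over to heap.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Cut-down stand-ins for the lab's types, just enough to show the traversal.
enum class ObjKind : uint8_t { String, Array };
struct Object {
    ObjKind kind   = ObjKind::String;
    uint8_t marked = 0;
};
struct ArrayObj : Object {
    std::vector<Object*> elems;   // the real ArrayObj stores Values inline
};

// Iterative mark: the explicit stack replaces C++ recursion, so depth is
// bounded by available heap memory rather than by the thread's stack size.
inline void markFrom(Object* root) {
    std::vector<Object*> work;
    if (root) work.push_back(root);
    while (!work.empty()) {
        Object* o = work.back();
        work.pop_back();
        assert(o != nullptr);
        if (o->marked) continue;          // reached again via another path
        o->marked = 1;
        if (o->kind == ObjKind::Array)
            for (Object* child : static_cast<ArrayObj*>(o)->elems)
                if (child && !child->marked) work.push_back(child);
    }
}
```

A chain of 100,000 nested arrays, which would risk overflowing the C stack under the recursive version, is handled with a worklist that never holds more than one pending object.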

Why mark-sweep?

Pros:

  • Simplest correct GC. Trivial to debug — print the object table before and after, diff the survivors.
  • Doesn't move objects. Pointers from the C/C++ host remain valid across collections. This matters when a JIT'd function takes a raw char* from a string.
  • Tolerates ambiguous roots. Even if a root only might be a pointer (conservative scanning), a false positive just keeps an object alive until the stale word is overwritten.

Cons:

  • Fragmentation. Repeated allocate/free leaves holes in the heap. Today we lean on malloc's own free lists to reuse them; a custom bump allocator could not, and would need a free list (next step upward) or compaction.
  • Pause time scales with live set + dead set. Sweep is O(total objects), even if very few survived.
  • Cache-unfriendly object table walks. Each pointer chase costs a miss.

Triggering policy

maybeCollect() runs on every allocation:

if (allocatedBytes_ >= gcThreshold_) collect();
// after collect:
if (allocatedBytes_ * 2 > gcThreshold_) gcThreshold_ = allocatedBytes_ * 2;

Doubling the threshold based on the post-collect live set is a classic way to keep amortised GC cost bounded. If 1 KiB survives, we let the heap grow to 2 KiB before collecting again, so each collection's O(heap) work is paid for by at least as many freshly allocated bytes: amortised O(1) GC work per allocated byte.
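The pacing rule is small enough to isolate and test on its own. GcPacer below is a hypothetical wrapper around the two counters from the snippet above; it only does the bookkeeping and leaves the actual collection to the caller.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping-only model of the trigger policy.
struct GcPacer {
    size_t allocatedBytes = 0;
    size_t gcThreshold;

    explicit GcPacer(size_t initialThreshold) : gcThreshold(initialThreshold) {}

    void onAlloc(size_t bytes) { allocatedBytes += bytes; }
    bool shouldCollect() const { return allocatedBytes >= gcThreshold; }

    // Called after sweep with the number of bytes that survived.
    void onCollected(size_t liveBytes) {
        assert(liveBytes <= allocatedBytes);
        allocatedBytes = liveBytes;
        // Let the heap grow to twice the live set before the next cycle.
        if (allocatedBytes * 2 > gcThreshold) gcThreshold = allocatedBytes * 2;
    }
};
```

With an initial threshold of 1 KiB and 1 KiB surviving the first collection, the threshold moves to 2 KiB, so another full kibibyte must be allocated before the collector runs again. As written, the threshold never shrinks; whether to lower it after a big drop in live data is a separate policy decision.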

Step 05 · Roots and safepoints

GC needs to know which Values are live. The mark phase starts from the root set; anything not reachable from a root is garbage.

Our root set is explicit:

class Heap {
public:
    void pushRoot(Value* slot);
    void popRoot();
private:
    std::vector<Value*> roots_;
};

class RootScope {
public:
    RootScope(Heap& h, Value* slot) : h_(h) { h_.pushRoot(slot); }
    ~RootScope()                            { h_.popRoot(); }
private:
    Heap& h_;   // kept so the destructor can pop
};

Callers root every local that could hold an object before any allocation:

Value greet = makeString(h, "hello, ");
RootScope rg(h, &greet);               // greet is now safe across GC
Value who   = makeString(h, "world");  // GC may run here
RootScope rw(h, &who);
Value full  = add(greet, who, h);
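To see the rooting discipline work end to end, here is a deliberately tiny heap with the same push/pop root API. It is a sketch under simplifying assumptions: objects carry no payload and have no children, a root slot holds an Obj* rather than a tagged Value, and the names MiniHeap and RootGuard are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Obj { bool marked = false; Obj* next = nullptr; };

class MiniHeap {
public:
    ~MiniHeap() { roots_.clear(); collect(); }     // frees everything left

    Obj* alloc() {
        auto* o = new Obj;
        o->next = head_;                            // intrusive object table
        head_ = o;
        ++count_;
        return o;
    }
    void pushRoot(Obj** slot) { roots_.push_back(slot); }
    void popRoot() { assert(!roots_.empty()); roots_.pop_back(); }
    size_t liveCount() const { return count_; }

    void collect() {
        for (Obj** slot : roots_)                   // mark
            if (*slot) (*slot)->marked = true;
        Obj** link = &head_;                        // sweep
        while (*link) {
            Obj* o = *link;
            if (o->marked) { o->marked = false; link = &o->next; }
            else           { *link = o->next; delete o; --count_; }
        }
    }
private:
    Obj* head_ = nullptr;
    size_t count_ = 0;
    std::vector<Obj**> roots_;
};

// RAII guard: roots the slot for exactly the guard's lifetime.
class RootGuard {
public:
    RootGuard(MiniHeap& h, Obj** slot) : h_(h) { h_.pushRoot(slot); }
    ~RootGuard() { h_.popRoot(); }
    RootGuard(const RootGuard&) = delete;
    RootGuard& operator=(const RootGuard&) = delete;
private:
    MiniHeap& h_;
};
```

An object whose slot is guarded survives collect(); one allocated without a guard is swept, and the formerly guarded object goes too once its guard leaves scope. Roots are slots (Obj**) rather than objects so that a later relocating collector could update them in place.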

This is awkward and easy to forget. Production runtimes do better:

Conservative stack scanning

Scan the C stack, treating every pointer-aligned word as if it might be a pointer; if it points into a known heap object, treat it as a root. The Boehm GC works this way.

  • Pros: no rooting boilerplate; works with arbitrary C/C++.
  • Cons: false positives keep dead objects alive; can't relocate objects (so no copying / compacting GC).

Precise stack maps

The compiler emits, per call site, a map of which stack slots and registers contain pointers. The GC walks the stack frame by frame, asks the map "what's live here?", and traces those slots.

  • Pros: precise, supports relocating GC.
  • Cons: codegen complexity, every safepoint inhibits some optimisations.

Safepoints

A safepoint is a point in code where the runtime guarantees the stack is in a known state — typically function entry and loop backedges. The compiler inserts a poll:

if (gc_request) suspend();

When the GC wants to run, it sets gc_request, then waits for all mutator threads to reach a safepoint. This bounds GC latency to roughly the loop-iteration time.

cp-14 has only one thread and no JIT integration, so we just call collect() from alloc synchronously — the entire VM is a "GC safepoint" by construction. cp-17's capstone wires precise stack maps into the cp-12 JIT.

Step 06 · Allocation strategies

Heap::alloc is a thin wrapper over std::malloc. That's the simplest correct allocator, but real runtimes layer specialised ones on top because the alloc fast-path runs every few instructions.

Bump allocation

arena: [................................]
                ^ free
free += bytes

  • Cost: one add, one compare. ~5 cycles.
  • Used by: every copying GC (because the from-space is wiped each collection, the bump pointer resets).
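A bump allocator over a caller-supplied buffer fits in a handful of lines. BumpArena is a hypothetical name; the sketch assumes the buffer is 8-aligned and leaves the "arena full" case (returning nullptr) for the caller to turn into a collection.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

class BumpArena {
public:
    BumpArena(void* base, size_t bytes)
        : free_(static_cast<uint8_t*>(base)), end_(free_ + bytes) {
        assert((reinterpret_cast<uintptr_t>(base) & 7) == 0);
    }

    // Fast path: one compare, one add. nullptr means the arena is full.
    void* alloc(size_t bytes) {
        bytes = (bytes + 7) & ~size_t{7};            // keep 8-byte alignment
        if (bytes > static_cast<size_t>(end_ - free_)) return nullptr;
        void* p = free_;
        free_ += bytes;
        return p;
    }

    size_t remaining() const { return static_cast<size_t>(end_ - free_); }

private:
    uint8_t* free_;   // next free byte
    uint8_t* end_;    // one past the arena
};
```

There is no per-object free: space comes back only when the whole arena is reset, which is exactly the behaviour a copying collector wants.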

Free lists

Maintain per-size buckets of freed objects:

size 16: → block → block → block → ...
size 32: → block → ...

Allocation: pop head of the right bucket. Sweep: push freed objects onto the right bucket. Pros: reuses fragmented space. Cons: slower than bump; harder to reason about live size.
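The bucket scheme can be sketched as follows. Size classes in steps of 16 bytes up to 128 are an assumption of the example, as is falling back to malloc when a bucket is empty; FreeLists is a hypothetical name and the sketch never returns memory to the system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

class FreeLists {
public:
    static constexpr size_t kClasses = 8;        // 16, 32, ..., 128 bytes

    void* alloc(size_t bytes) {
        size_t c = classOf(bytes);
        if (Node* n = buckets_[c]) {             // reuse a freed block
            buckets_[c] = n->next;
            return n;
        }
        return std::malloc((c + 1) * 16);        // bucket empty: fresh block
    }

    // Sweep would call this for every dead object.
    void release(void* p, size_t bytes) {
        auto* n = static_cast<Node*>(p);         // freed block stores the link
        size_t c = classOf(bytes);
        n->next = buckets_[c];
        buckets_[c] = n;
    }

private:
    struct Node { Node* next; };

    static size_t classOf(size_t bytes) {
        assert(bytes > 0 && bytes <= kClasses * 16);
        return (bytes + 15) / 16 - 1;            // 1..16 -> 0, 17..32 -> 1, ...
    }

    Node* buckets_[kClasses] = {};
};
```

Because requests are rounded up to their class size, a freed 24-byte block can serve a later 30-byte request: both land in the 32-byte bucket. The rounding is also where the scheme wastes space (internal fragmentation).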

Thread-Local Allocation Buffers (TLABs)

In a multi-threaded VM, every thread reserves a chunk of the global heap and bump-allocates inside it. No atomics on the fast path.

Thread 1 TLAB: [................xxxxxxxx]   xxxx = live
Thread 2 TLAB: [............xxxxxxxxxxx]
Global heap:   [TLAB 1 ][TLAB 2 ][...]

When a thread's TLAB is exhausted it locks the global heap, claims a new one, and continues. The lock-free fast path is critical to multi-threaded performance.

Generational allocation

nursery (small, fast):  bump alloc → frequent minor GC
tenured (large, slow):  mark-sweep or mark-compact → rare major GC

Most objects die young → minor GC is fast and cheap. Survivors get promoted to the tenured space.

This requires a write barrier: when an old object's field is set to a young object, record the cross-generation pointer so the minor GC can find young roots without scanning the entire old generation.

inline void writeBarrier(Object* parent, Value child) {
    if (child.isObject() && parent->isOld() && child.asObject()->isYoung())
        rememberedSet.add(parent);
}

cp-14's runtime doesn't implement any of these — but the linked-list object table and explicit roots make it easy to swap in any of them later. The right abstraction is "all alloc paths go through one function" — which we have.

Step 07 · Beyond mark-sweep

The collector zoo, roughly in order of complexity:

Reference counting

Every object keeps a counter; increment on assign, decrement on overwrite/scope exit. Free when count hits 0.

  • Pros: deterministic — finalisers run promptly. Memory footprint predictable.
  • Cons: cycles leak (need a backup tracing GC). Atomic refcounts in multi-threaded code are expensive.
  • Users: CPython (with cycle-detector backup), Swift, Rust's Rc<T>.
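The cycle problem is easy to reproduce with std::shared_ptr, which is a reference-counting scheme in its own right. In this sketch the Node type and the liveNodes counter are invented for the demonstration; the leak in the strong-cycle case is intentional.

```cpp
#include <cassert>
#include <memory>

// Counts Node objects currently alive, so leaks are observable.
inline int& liveNodes() { static int n = 0; return n; }

struct Node {
    std::shared_ptr<Node> strong;   // owning edge: bumps the refcount
    std::weak_ptr<Node>   weak;     // non-owning edge: does not
    Node()  { ++liveNodes(); }
    ~Node() { --liveNodes(); }
};

// Builds a two-node loop a -> b -> a, then drops the external references
// by returning. With a strong back edge each node keeps the other's count
// at 1, so neither destructor ever runs.
inline void makeCycle(bool strongBackEdge) {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->strong = b;
    assert(b.use_count() == 2);     // held by the local `b` and by `a`
    if (strongBackEdge) b->strong = a;
    else                b->weak   = a;
}
```

Calling makeCycle(false) leaves no nodes alive, while makeCycle(true) strands both. Weak back edges are the manual fix; a backup tracing collector, as in CPython, is the automatic one.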

Mark-sweep (cp-14)

What we built. Simple, non-relocating, fragmenting.

Mark-compact

After mark, slide survivors to one end of the heap (for example with the Lisp-2 algorithm, which computes forwarding addresses in a separate pass). Eliminates fragmentation. Pointers need updating: easier with precise stack maps; with conservative scanning, any object referenced from an ambiguous root has to stay pinned.

Copying (Cheney)

Two equally-sized spaces: from-space and to-space. Allocation bump-allocates in from-space. On GC, walk roots and copy live objects into to-space, leaving forwarding pointers in from-space. Flip spaces. From-space is now garbage; bump pointer resets.

  • Pros: O(live) work, no fragmentation, blazing-fast alloc.
  • Cons: half the heap is always wasted; relocates objects.
  • Users: most young generations (because young set is small).
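A compact model of the copy-and-flip cycle is sketched below. It assumes fixed-shape cells with two child slots and an explicit forwarding field, and it uses two std::vector buffers as the semispaces; real collectors copy variable-sized objects and usually overlay the forwarding pointer on the header. SemiSpace and Cell are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Cell {
    Cell* forward = nullptr;   // set once the cell has been evacuated
    Cell* left    = nullptr;
    Cell* right   = nullptr;
    int   payload = 0;
};

class SemiSpace {
public:
    explicit SemiSpace(size_t cells) : from_(cells), to_(cells) {}

    // Bump allocation in the active space; nullptr means "collect first".
    Cell* alloc(int payload) {
        if (top_ == from_.size()) return nullptr;
        Cell* c = &from_[top_++];
        *c = Cell{};
        c->payload = payload;
        return c;
    }

    // Cheney's algorithm: to-space doubles as the breadth-first work queue.
    void collect(const std::vector<Cell**>& roots) {
        size_t scan = 0, next = 0;
        auto copy = [&](Cell* c) -> Cell* {
            if (!c) return nullptr;
            if (!c->forward) {                 // first visit: evacuate
                assert(next < to_.size());
                to_[next] = *c;
                c->forward = &to_[next++];
            }
            return c->forward;                 // later visits: follow it
        };
        for (Cell** r : roots) *r = copy(*r);
        for (; scan < next; ++scan) {          // fix up copied cells' fields
            to_[scan].left  = copy(to_[scan].left);
            to_[scan].right = copy(to_[scan].right);
        }
        from_.swap(to_);                       // flip; element addresses survive
        top_ = next;                           // bump pointer resets
    }

    size_t used() const { return top_; }

private:
    std::vector<Cell> from_, to_;
    size_t top_ = 0;
};
```

The work done is proportional to the number of live cells: garbage is never touched, only abandoned when the spaces flip. Cycles are handled by the forwarding pointer, which ensures each cell is copied once.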

Generational

Combine: copying for the nursery, mark-compact for the tenured. The HotSpot, V8, and SpiderMonkey GCs are all generational variants of this idea.

Concurrent / incremental

Run mark and/or sweep on a separate thread concurrently with the mutator. Needs:

  • Write barriers (Yuasa/Dijkstra) so the marker doesn't lose objects that change mid-flight.
  • Read barriers for relocating concurrent collectors (Shenandoah, ZGC).

Sub-millisecond GC pauses on multi-GB heaps are achievable with these designs; ZGC, for one, targets them explicitly.

Region-based / arena

Allocate from a context-bound arena; free the entire arena at once. No collector needed: the runtime tracks one region per request/task. Used by Apache (APR pools), Zig's ArenaAllocator, Rust arena crates, and many server runtimes.

What to choose

  • Educational implementation: mark-sweep (cp-14).
  • Embeddable, single-threaded: mark-sweep or refcount (CPython).
  • Server, GB-scale heap, low-pause: concurrent generational (HotSpot G1, ZGC).
  • Soft-realtime / latency-critical: region-based or Shenandoah/ZGC.
  • Embedded / no malloc: arena + free list.

Stack maps and codegen integration

cp-17's capstone wires this runtime into the cp-12 JIT. The minimal addition is a stack-map descriptor for every call site, plus a "safepoint poll" inserted into loop headers. We do that on top of LLVM's @llvm.gcroot / Statepoints intrinsics rather than reinventing the metadata format.

cp-15 · Tooling and Diagnostics

A compiler is only as approachable as its error messages. cp-15 builds the tooling layer that wraps the language: a unified minilang CLI (run, fmt, ast, repl), rustc-style structured diagnostics with carets, source-span tracking, and a REPL with multi-line input.

Build & run

cmake -S src/cpp -B build
cmake --build build
./build/tests/test_tooling                       # → 46/46 checks passed

echo 'let x = 1+2*3; print x;' | ./build/minilang run -    # → 7
echo 'let  x= 1 +2;  print  x  ;' | ./build/minilang fmt -
./build/minilang repl
> let x = 10;
> print x * x;
100

What an error looks like:

$ echo 'print 1 +;' | ./build/minilang run -
error[E0202]: expected expression, got `;`
  --> <stdin>:1:10
    |
  1 | print 1 +;
    |          ^
    | help: try a number, a variable, or `(`

Layout

src/cpp/src/
    source.hpp/.cpp   — SourceFile with line index
    diag.hpp/.cpp     — Diagnostic + renderer
    lex.hpp/.cpp      — span-tracking tokenizer
    parse.hpp/.cpp    — recursive-descent parser with error recovery
    format.hpp/.cpp   — AST pretty printer
    eval.hpp/.cpp     — tree-walking evaluator
    repl.hpp/.cpp     — REPL loop with multi-line continuation
    main.cpp          — `minilang` CLI
tests/
    test_tooling.cpp  — 14 tests / 46 checks

Reading order

  1. steps/01-developer-experience-matters.md
  2. steps/02-source-spans-and-locations.md
  3. steps/03-rustc-style-diagnostics.md
  4. steps/04-parser-error-recovery.md
  5. steps/05-pretty-printing-and-formatting.md
  6. steps/06-building-a-repl.md
  7. steps/07-cli-design-and-lsp.md

Step 01 · Developer experience matters

A compiler that's correct but cryptic loses to one that's slightly less powerful but obviously helpful. Compare:

$ old-compiler
error: syntax error
$ rustc-style
error[E0202]: expected expression, got `;`
  --> example.rs:5:11
    |
  5 | print 1 + ;
    |           ^
    | help: try a number, a variable, or `(`

Same syntax error; vastly different debugging experience. The investment that pays off:

  • Source spans on every AST node. Tokens carry (start, length); AST nodes inherit and merge them.
  • Structured diagnostics: (severity, code, message, span, hint) rather than a string. This lets:
    • The compiler suggest fix-its (hint).
    • IDEs render squigglies precisely (span).
    • clippy-style tools filter by code.
  • Error recovery: parsers continue past errors so users see all problems in one pass, not "fix → recompile → fix → recompile".
  • Tooling ecosystem: fmt, ast, repl, lsp are first-class citizens that share the same parser & diagnostics.

This lab implements the foundation. cp-16's capstone wires it into the full compiler frontend.

Why the CLI is one binary with subcommands

minilang run|fmt|ast|repl instead of minilang-run, minilang-fmt, etc:

  • Single install footprint.
  • Shared option parsing & shared error format.
  • Easier to add minilang check, minilang test, minilang doc later.

This is the design cargo, go, git, dotnet, and dart all converged on for the same reasons.

Step 02 · Source spans and locations

A span = (start_offset, length) in the source buffer. A location = (line, column). We compute the latter from the former on demand.

class SourceFile {
public:
    Loc loc(size_t offset) const;             // offset → 1-based (line, column)
private:
    std::string                 text_;
    std::vector<size_t>         lineStarts_;  // offset of each line start, ascending
};

Loc SourceFile::loc(size_t offset) const {
    auto it = std::upper_bound(lineStarts_.begin(), lineStarts_.end(), offset);
    int  line = (int)(it - lineStarts_.begin());
    return {line, (int)(offset - lineStarts_[line - 1]) + 1};
}
  • Why store offsets, not (line, col)? An offset is a single integer: constant-time to compare, easy to deduplicate and hash, and it indexes the buffer directly. (line, col) is a presentation detail derived from it.
  • Why binary-search lookup? O(log lines) is fast enough for diagnostics. We only convert offsets → (line, col) at print time.
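For completeness, here is a self-contained sketch that also builds the line index the lookup depends on. The constructor shape and the lineText helper are assumptions added for illustration; only loc mirrors the snippet above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Loc { int line; int col; };           // both 1-based

class SourceFile {
public:
    SourceFile(std::string name, std::string text)
        : name_(std::move(name)), text_(std::move(text)) {
        lineStarts_.push_back(0);             // line 1 starts at offset 0
        for (std::size_t i = 0; i < text_.size(); ++i)
            if (text_[i] == '\n') lineStarts_.push_back(i + 1);
    }

    // The first line start greater than `offset` belongs to the following
    // line, so its index in lineStarts_ is the 1-based line we want.
    Loc loc(std::size_t offset) const {
        auto it = std::upper_bound(lineStarts_.begin(), lineStarts_.end(), offset);
        int line = static_cast<int>(it - lineStarts_.begin());
        return {line, static_cast<int>(offset - lineStarts_[line - 1]) + 1};
    }

    // Text of a 1-based line, without its trailing newline (hypothetical helper).
    std::string lineText(int line) const {
        std::size_t begin = lineStarts_[line - 1];
        std::size_t end = static_cast<std::size_t>(line) < lineStarts_.size()
                              ? lineStarts_[line] - 1
                              : text_.size();
        return text_.substr(begin, end - begin);
    }

    const std::string& name() const { return name_; }

private:
    std::string              name_;
    std::string              text_;
    std::vector<std::size_t> lineStarts_;     // ascending offsets of line starts
};
```

The index is built once in O(n); every later lookup is a binary search, which is why converting only at print time stays cheap.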

Span propagation

Tokens carry spans straight from the lexer. AST nodes either copy their primary token's span (literals, identifiers) or merge:

e->span = {lhs->span.start,
           rhs->span.start + rhs->span.length - lhs->span.start};

A binary expression's span covers lhs op rhs end-to-end. This is crucial for IDE highlighting: hover over 1 + 2, the whole expression lights up.
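The merge can be factored into a small helper that works regardless of operand order. A sketch, assuming a Span with the same two fields; the function name merge is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct Span { std::size_t start; std::size_t length; };

// Smallest span covering both inputs, in either order.
Span merge(Span a, Span b) {
    std::size_t start = std::min(a.start, b.start);
    std::size_t end   = std::max(a.start + a.length, b.start + b.length);
    return {start, end - start};
}
```

For `1 + 2` starting at offset 6, the operands are {6, 1} and {10, 1}, and the merged span is {6, 5}: the whole expression.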

Multi-file

Real compilers carry a (fileId, span) pair. cp-15 uses one file at a time because that's all the REPL and CLI need; extending to a SourceMap of files is mechanical (vector<SourceFile> keyed by id).

Step 03 · Rustc-style diagnostics

The Diagnostic struct is small and intentional:

struct Diagnostic {
    Severity    severity;     // Error / Warning / Note
    std::string code;         // "E0202"
    std::string message;
    Span        span;
    std::string hint;         // "help: try ..."
};

Renderer output:

error[E0202]: expected expression, got `;`
  --> ex.ml:1:10
    |
  1 | print 1 +;
    |          ^
    | help: try a number, a variable, or `(`

After the two header lines (the message and the --> location), the remaining four lines are:

  1. Gutter blank line matching the line-number width.
  2. Source line with the offending text.
  3. Caret line — spaces to the column, then ^ for the span.
  4. Hint (optional) — what to try next.
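A renderer that produces this layout can be sketched as a single function. The version below is an approximation, not the lab's diag.cpp: it assumes the span has already been resolved to a 1-based line and column, takes the source line as a parameter, and uses a hypothetical render signature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

enum class Severity { Error, Warning, Note };

struct Diagnostic {
    Severity    severity;
    std::string code;
    std::string message;
    int         line, col;     // 1-based; already resolved from the span
    std::size_t length;        // span length in bytes
    std::string hint;          // full hint text, may be empty
};

std::string render(const Diagnostic& d, const std::string& file,
                   const std::string& lineText) {
    const char* sev = d.severity == Severity::Error   ? "error"
                    : d.severity == Severity::Warning ? "warning" : "note";
    const std::string num   = std::to_string(d.line);
    const std::size_t width = std::max<std::size_t>(3, num.size() + 1);
    const std::string blank(width + 1, ' ');     // gutter with no line number

    std::string out;
    out += std::string(sev) + "[" + d.code + "]: " + d.message + "\n";
    out += "  --> " + file + ":" + num + ":" + std::to_string(d.col) + "\n";
    out += blank + "|\n";
    out += std::string(width - num.size(), ' ') + num + " | " + lineText + "\n";
    // Caret line: pad to the column, then one ^ per byte of the span.
    out += blank + "| " + std::string(static_cast<std::size_t>(d.col - 1), ' ')
         + std::string(std::max<std::size_t>(1, d.length), '^') + "\n";
    if (!d.hint.empty()) out += blank + "| " + d.hint + "\n";
    return out;
}
```

Because the output is a plain string, a test can compare it directly against an expected rendering, which is the main argument for keeping colour out of this layer.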

Why no ANSI colours

Easier to test (string comparison), easier to pipe to less, easier to integrate with editors that re-style errors. A real CLI adds a --color=auto flag that wraps error and the caret in red — trivial to layer on top.

Codes

Rust assigns each error an E#### code, Swift uses descriptive IDs, TypeScript uses TS####. Benefits:

  • Documentation hooks (rustc --explain E0382).
  • Stable references in tutorials/bug reports.
  • Tooling can ignore-list specific codes.

We pre-assign blocks: E01xx lex errors, E02xx parse, E03xx semantic. Cheap up-front; pays for itself the first time someone googles minilang E0301.

Spans across multiple lines

The renderer currently assumes the span fits on one line. Multi-line spans (if {\n bad\n}) need the line bar repeated with ~~~ for trailing lines and ^^^ on the start. We left this as a 30-minute extension exercise — pattern-match what rustc does.

Suggestions / fix-its

hint is plain text. A richer system attaches a replacement span + replacement text that an IDE can apply automatically:

struct FixIt { Span span; std::string replacement; };

Useful but easy to get wrong: apply two fix-its that overlap and you corrupt the file. Production tools keep them behind explicit user invocation (cargo fix for rustc's suggestions, clang-tidy --fix).
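One way to avoid the overlap hazard is to validate the whole batch before touching the text, then apply edits back to front. The function below is a sketch under that assumption; applyFixIts is a hypothetical name, not part of the lab's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct Span  { std::size_t start; std::size_t length; };
struct FixIt { Span span; std::string replacement; };

// Applies all fix-its or none: returns std::nullopt if any two overlap or
// any span falls outside the text.
std::optional<std::string> applyFixIts(std::string text, std::vector<FixIt> fixes) {
    std::sort(fixes.begin(), fixes.end(), [](const FixIt& a, const FixIt& b) {
        return a.span.start < b.span.start;
    });
    for (std::size_t i = 0; i < fixes.size(); ++i) {
        const Span& s = fixes[i].span;
        if (s.start + s.length > text.size()) return std::nullopt;
        if (i > 0) {
            const Span& prev = fixes[i - 1].span;
            if (s.start < prev.start + prev.length) return std::nullopt;
        }
    }
    // Apply back to front so earlier offsets stay valid.
    for (auto it = fixes.rbegin(); it != fixes.rend(); ++it)
        text.replace(it->span.start, it->span.length, it->replacement);
    return text;
}
```

A zero-length span is an insertion, so the "expected expression" error could carry a fix-it such as { {9, 0}, " 2" } that turns `print 1 +;` into `print 1 + 2;`.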

Step 04 · Parser error recovery

A parser that aborts on the first error is useless for IDEs and frustrating in the CLI. Error recovery is the difference between "fix one thing at a time" and "see the whole picture, fix everything in one pass".

cp-15's parser uses panic mode recovery — the simplest strategy that works:

Program parseProgram() {
    Program p;
    while (peek().kind != Tok::Eof) {
        auto s = parseStmt();          // std::optional<Stmt>; empty on error
        if (s) p.stmts.push_back(std::move(*s));
        else   skipToSyncPoint();      // resync
    }
    return p;
}

void skipToSyncPoint() {
    while (peek().kind != Tok::Eof && peek().kind != Tok::Semi) ++i;
    accept(Tok::Semi);
}

; is our synchronisation token. After an error inside a statement, we discard everything up to the next ; and start fresh. This guarantees:

  • The parser terminates (no infinite loops on bad input).
  • Subsequent valid statements still get parsed.
  • The number of errors reported scales linearly with the number of real mistakes (not exponentially — cascade failures are a common parser-design pitfall).
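These guarantees are easy to check on a toy grammar. The sketch below is not the lab's parser: it uses a hypothetical five-token language (program := stmt*, stmt := "print" expr ";", expr := Num ("+" Num)*) and only counts parsed statements and errors.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class Tok { Print, Num, Plus, Semi, Eof };

struct Parser {
    std::vector<Tok> toks;
    std::size_t i = 0;
    int parsed = 0, errors = 0;

    explicit Parser(std::vector<Tok> t) : toks(std::move(t)) {}

    Tok  peek() const { return i < toks.size() ? toks[i] : Tok::Eof; }
    bool accept(Tok k) { if (peek() == k) { ++i; return true; } return false; }

    bool parseExpr() {
        if (!accept(Tok::Num)) return false;
        while (accept(Tok::Plus))
            if (!accept(Tok::Num)) return false;
        return true;
    }

    bool parseStmt() { return accept(Tok::Print) && parseExpr() && accept(Tok::Semi); }

    // Discard tokens up to and including the next ';'. Reaching here always
    // ends at a ';' or at Eof, so the outer loop cannot spin forever.
    void skipToSyncPoint() {
        while (peek() != Tok::Eof && peek() != Tok::Semi) ++i;
        accept(Tok::Semi);
    }

    void parseProgram() {
        while (peek() != Tok::Eof) {
            if (parseStmt()) ++parsed;
            else { ++errors; skipToSyncPoint(); }
        }
    }
};
```

For the token stream of `print 1 +; print 2; + ; print 3;` this reports two errors and still parses the two valid statements, one error per real mistake.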

Better recovery strategies

  • Token deletion / insertion: try plausible edits (insert ), delete +) and continue. Powerful but combinatorial.
  • Phrase-level recovery: define multi-token sync sets per non-terminal. Statements sync on ; { fn while if, expressions sync on ) ; ,.
  • Tree-sitter / GLR: parse as much as possible, leaving "ERROR" nodes in the tree. Fast enough to re-run on every keystroke.

For a small language, panic mode with one sync token delivers most of the benefit of these strategies for a fraction of the code.

Don't forget the lexer

The lexer must also recover. lex emits a diagnostic for the bad character and advances by one byte:

out.diagnostics.push_back(Diagnostic{...});
++i;

Skipping the whole rest of the file on a single bad character would be a denial-of-service vector for IDE users mid-typing.
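The skip-one-byte rule looks like this in a complete toy lexer. It is a simplified sketch: tokens are plain strings, numbers and identifiers share one "word" rule, and a vector of offsets stands in for real Diagnostic objects. The names LexResult and badOffsets are assumptions for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct LexResult {
    std::vector<std::string> tokens;       // token text, in order
    std::vector<std::size_t> badOffsets;   // one entry per rejected byte
};

LexResult lex(const std::string& src) {
    LexResult out;
    auto isWord = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    std::size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (isWord(c)) {
            std::size_t start = i;
            while (i < src.size() && isWord(src[i])) ++i;
            out.tokens.push_back(src.substr(start, i - start));
        } else if (std::string("+-*/=;()").find(c) != std::string::npos) {
            out.tokens.push_back(std::string(1, c));
            ++i;
        } else {
            out.badOffsets.push_back(i);   // a real lexer emits a Diagnostic here
            ++i;                           // skip exactly one byte, then carry on
        }
    }
    return out;
}
```

Every token after a bad byte is still produced, so the parser downstream sees as much of the file as possible.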

Step 05 · Pretty printing and formatting

A formatter is a function AST → canonical source. Round-tripping through parse ∘ format should be the identity on the AST (and idempotent on the formatted text). Our formatter is in format.cpp; it's tiny because the grammar is tiny.

Key design choices:

Precedence-aware parenthesisation

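void writeExpr(std::ostream& os, const Expr& e, int parentPrec) {
    switch (e.kind) {
    case Expr::K::Bin: {
        int  p    = precOf(e.op);
        bool wrap = parentPrec > p;    // parent binds tighter than this node
        if (wrap) os << "(";
        writeExpr(os, *e.lhs, p);
        os << " " << e.op << " ";
        writeExpr(os, *e.rhs, p + 1);  // right-bias for left-assoc
        if (wrap) os << ")";
        break;
    }
    // ... literals, identifiers, unary minus ...
    }
}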

parentPrec > p wraps when the parent expects tighter precedence. The p + 1 on the right side preserves left-associativity: 1 - (2 - 3) keeps its parens (because subtraction isn't associative), while (1 - 2) - 3 prints as 1 - 2 - 3 because the parser would regroup it that way anyway.
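The rule can be exercised end to end with a stripped-down expression type. This sketch assumes a hypothetical Expr holding either a number or a binary operator, and two precedence levels; it is not the lab's AST.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

struct Expr {
    char op = 0;                       // 0 marks a number literal
    long value = 0;
    std::unique_ptr<Expr> lhs, rhs;
};
using ExprPtr = std::unique_ptr<Expr>;

ExprPtr num(long v) { auto e = std::make_unique<Expr>(); e->value = v; return e; }
ExprPtr bin(char op, ExprPtr l, ExprPtr r) {
    auto e = std::make_unique<Expr>();
    e->op = op; e->lhs = std::move(l); e->rhs = std::move(r);
    return e;
}

int precOf(char op) { return (op == '*' || op == '/' || op == '%') ? 2 : 1; }

void writeExpr(std::ostream& os, const Expr& e, int parentPrec) {
    if (e.op == 0) { os << e.value; return; }
    int  p    = precOf(e.op);
    bool wrap = parentPrec > p;        // parent binds tighter than this node
    if (wrap) os << "(";
    writeExpr(os, *e.lhs, p);          // same level on the left: no parens
    os << " " << e.op << " ";
    writeExpr(os, *e.rhs, p + 1);      // same level on the right: parens
    if (wrap) os << ")";
}

std::string format(const Expr& e) {
    std::ostringstream os;
    writeExpr(os, e, 0);
    return os.str();
}
```

Only the parentheses that change the parse survive: a left-nested subtraction prints flat, a right-nested one keeps its grouping, and a sum under a product is wrapped.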

Insertion of canonical whitespace

let x=1+2*3; print x ; → let x = 1 + 2 * 3;\nprint x;\n.

A formatter has one correct output per AST. Don't make whitespace configurable; that's gofmt's wisdom. Once teams agree, debates disappear.

Comments are hard

Our toy formatter strips comments because the parser drops them. A real formatter needs to:

  • Carry comments through the AST (attach to neighbouring nodes).
  • Distinguish leading vs. trailing comments.
  • Reflow long lines without orphaning the comment.

This is where rustfmt, prettier, and gofmt all sink most of their implementation budget. The simple case (no comments) is fine for our educational scope.

Beyond plain text

  • AST → diff for refactorings (rename a variable, output is a formatted version with the new name everywhere).
  • AST → HTML for documentation (syntax-highlighted source).
  • AST → AST transformations (CST-preserving rewrites for codemod-style tools).

These all start with a clean formatter.

Step 06 · Building a REPL

A REPL ("read-eval-print loop") is the most useful tool a language ships. Our implementation in repl.cpp is ~50 lines because all the heavy lifting is reused from the CLI.

void runRepl(in, out, err, opts) {
    EvalState st;                           // persists across lines
    std::string buffer, line;
    while (std::getline(in, line)) {
        buffer += line + "\n";
        SourceFile src("<repl>", buffer);
        auto l = lex(src);
        auto p = parse(l.tokens);
        if (needsContinuation(...)) continue;   // accumulate
        for (auto& d : l.diagnostics) renderTo(err, d, src);
        for (auto& d : p.diagnostics) renderTo(err, d, src);
        if (l.diagnostics.empty() && p.diagnostics.empty())   // no errors
            eval(st, p.program, out);
        buffer.clear();
    }
}

Multi-line continuation

The REPL recognises unfinished input (unbalanced parens, dangling operators) and doesn't evaluate yet. It keeps reading lines into a buffer, showing a | continuation prompt, until the input parses as a complete program.

Heuristic in needsContinuation:

  • More ( than ) so far → expect more.
  • Last diagnostic mentions "got end of input" → expect more.

A more rigorous approach: have the parser return a distinguished "unexpected EOF" error type rather than scanning messages. We chose strings here for simplicity; it's the kind of decision easy to revisit once you feel the friction.
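The more rigorous variant might look like the sketch below. It assumes the parser can report a distinguished error kind; the ParseErrorKind enum and this needsContinuation signature are hypothetical, not the lab's actual interface.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification the parser would attach to its last error.
enum class ParseErrorKind { None, UnexpectedToken, UnexpectedEof };

bool needsContinuation(const std::string& buffer, ParseErrorKind lastError) {
    int depth = 0;                        // '(' seen minus ')' seen
    for (char c : buffer) {
        if (c == '(') ++depth;
        else if (c == ')') --depth;
    }
    // Keep reading if a paren is still open or the parser ran off the end.
    return depth > 0 || lastError == ParseErrorKind::UnexpectedEof;
}
```

An error in the middle of the input (UnexpectedToken) is reported immediately, while running out of input is treated as "the user is not finished yet". Counting raw parentheses would need refining once the language has string literals or comments.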

State preservation

EvalState st lives outside the loop, so:

> let x = 10;
> print x;
10

works as expected. The semantics is "each REPL line is appended to a notional program that you've been building all along".

Error recovery, REPL-style

When evaluation fails, we discard the buffer and re-prompt. The alternative — keeping the buffer so the user can edit — is what IPython / Jupyter offer, but requires terminal-line-editor integration (readline / replxx) outside the scope of this lab.

Things real REPLs add

  • Line editing & history (libedit, readline, replxx).
  • Tab completion (introspect st.env for variable names).
  • Special commands (:type, :reset, :doc).
  • Pretty-printing of last value (Python's _).
  • Persistent history file.

Each is a half-day; they're orthogonal to the core REPL loop.

Step 07 · CLI design & looking ahead to LSP

Unified CLI

minilang <command> <args>
  run <file|->     parse + evaluate
  fmt <file|->     pretty-print
  ast <file|->     dump AST
  repl             interactive read-eval-print loop

Patterns to adopt early:

  • - for stdin everywhere a file is accepted. Pipe-friendly.
  • Exit codes matter: 0 success, 64 usage error, 65 source error, 70 internal error (we follow rough sysexits.h conventions).
  • Subcommand-first so minilang fmt --check and minilang run --debug have separate option spaces — no flag conflicts.
  • One binary, many commands simplifies install + distribution.

What's missing for production

  • minilang check <file> — parse + typecheck, no eval, machine-readable JSON output (--format=json).
  • minilang test <file> — discover and run inline tests.
  • minilang doc — extract docstrings → HTML.
  • minilang lsp — Language Server Protocol stdio mode.

LSP — the natural next step

Every tool we built — span-tracking lexer, error-recovering parser, structured diagnostics, pretty printer — is what an LSP server needs. Sketch:

client → JSON-RPC →  textDocument/didOpen { uri, text }
                  → store in document map, run lex + parse
                  → publishDiagnostics { uri, diagnostics }
client → JSON-RPC →  textDocument/didChange { uri, edits }
                  → incremental re-parse (or full)
                  → publishDiagnostics
client → JSON-RPC →  textDocument/formatting → call format(program)
client → JSON-RPC →  textDocument/hover { position }
                  → look up enclosing AST node, return doc/type

The hardest parts:

  • Incremental parsing to keep latency low on large files (tree-sitter solves this; rust-analyzer rolls its own).
  • Indexing across files for cross-file go-to-definition / find-references.
  • Cancellation — long-running analyses must be interruptable.

All of those build on the foundations we laid here. Diagnostic codes, spans, and AST formatter aren't optional in an LSP world — they're the contract you have with the editor.

Connection to the rest of the curriculum

  • cp-16 (capstone compiler suite) reuses this CLI shell.
  • cp-17 (capstone JIT) layers compile and jit subcommands on top.
  • cp-18 (MLIR framework) adds dump-mlir and lower stages.

The unifying lesson: a great compiler is also a great library, and the CLI / REPL / LSP are just thin shells over that library.

cp-16 · Capstone Compiler Suite (minilangc)

A complete ahead-of-time compiler that puts everything from cp-01…cp-15 together. minilangc lexes → parses → typechecks → emits LLVM IR → shells out to llc + clang to produce a native executable.

Build & run

cmake -S src/cpp -B build
cmake --build build
./build/tests/test_suite              # → 28/28 checks passed

cat > hello.ml <<'PROG'
fn fib(n) {
    if (n < 2) { return n; }
    return fib(n - 1) + fib(n - 2);
}
fn main() {
    print fib(10);
}
PROG

./build/minilangc emit-ir hello.ml > hello.ll
./build/minilangc build   hello.ml -o hello
./build/minilangc run     hello.ml      # → 55
./build/minilangc check   hello.ml      # silent if clean

CLI:

minilangc <command> [options] <file|->
  emit-ir         lex+parse+typecheck and print LLVM IR
  build           compile to executable (-o path, -O0..3, -v verbose)
  run             build then execute
  check           lex+parse+typecheck only

Language

program := func+
func    := "fn" Ident "(" params? ")" block
params  := Ident ("," Ident)*
block   := "{" stmt* "}"
stmt    := "let" Ident "=" expr ";"
         | "print" expr ";"
         | "return" expr ";"
         | "if" "(" expr ")" block ("else" block)?
         | "while" "(" expr ")" block
         | Ident "=" expr ";"
         | expr ";"
expr    := cmp
cmp     := add (("=="|"!="|"<"|"<="|">"|">=") add)*
add     := mul (("+"|"-") mul)*
mul     := unary (("*"|"/"|"%") unary)*
unary   := "-" unary | call
call    := primary ("(" args? ")")*
args    := expr ("," expr)*
primary := Number | Ident | "(" expr ")"

Every value is i64. print lowers to printf("%lld\n", v). Every program must define fn main() { ... }.

Layout

src/cpp/src/
    source.hpp/.cpp     SourceFile + line index
    diag.hpp/.cpp       Diagnostic + renderer
    lex.hpp/.cpp        tokenizer
    parse.hpp/.cpp      recursive-descent parser
    typecheck.hpp/.cpp  scope + arity + main checks
    llvm_emit.hpp/.cpp  AST → textual LLVM IR
    driver.hpp/.cpp     llc + clang shell-out
    main.cpp            `minilangc` CLI
tests/
    test_suite.cpp      frontend + IR + end-to-end (9 tests / 28 checks)

Reading order

  1. steps/01-pipeline-overview.md
  2. steps/02-frontend-reuse.md
  3. steps/03-multi-function-language.md
  4. steps/04-typecheck-and-scope.md
  5. steps/05-emitting-llvm-ir.md
  6. steps/06-driving-the-toolchain.md
  7. steps/07-from-here-to-production.md

Step 01 · Pipeline overview

minilangc is a thin orchestrator over six clean stages:

   source bytes
       │
       ▼  ml::lex
   Token stream  ── diagnostics? ──► render & exit 65
       │
       ▼  ml::parse
   AST (Program)  ── diagnostics? ──► render & exit 65
       │
       ▼  ml::typecheck
   AST + scope info  ── diagnostics? ──► render & exit 65
       │
       ▼  ml::emitLLVMIR
   "module.ll"  (textual LLVM IR string)
       │
       ▼  ml::buildExecutable  → shell out to llc -filetype=obj
   "module.o"
       │
       ▼  same path             → shell out to clang
   executable
       │
       ▼  ml::runExecutable
   stdout text

Each arrow corresponds to one function call made by the driver; the llc and clang arrows both live in ml::buildExecutable. The phases that can fail (the frontend stages and the build) return rich result structs so the CLI can format the failure however it wants.

Why shell out?

Linking against LLVM-as-a-library is the "right" answer for production compilers (incremental compilation, JIT, fewer process forks). For this capstone we shell out to llc + clang because:

  • Zero LLVM CMake friction — works as long as /opt/homebrew/opt/llvm/bin exists.
  • Easier to debug — you can re-run the exact llc command yourself.
  • The pipeline is the same idea — only the boundary is text on disk vs. in-memory Module*.

Subsequent labs (cp-17 JIT, cp-18 MLIR) demonstrate the linked alternative.

Stages, separately

Want only the IR? minilangc emit-ir foo.ml > foo.ll. Want only typecheck? minilangc check foo.ml. Want everything? minilangc run foo.ml.

This separability is the architecture. Each stage's output is serialisable (tokens → JSON, AST → JSON, IR → text), so you can mix & match: write a third-party formatter, a linter, a documentation generator, all on the same frontend.

Step 02 · Reusing the cp-15 frontend

The source/diag/lex/parse modules are nearly verbatim copies of cp-15's, extended for the multi-function language. This is intentional: the capstone proves the cp-15 design generalises.

What grew:

concern       cp-15                  cp-16
top-level     flat statement list    function definitions
keywords      let, print             + fn, return, if, else, while
operators     +, -, *, /             + %, comparisons
expressions   numeric arithmetic     + function calls
typecheck     none                   scope, arity, main

The lexer's structure didn't change — just more keywords and two-character operators (==, <=, …).

The parser gained parseFunc, parseBlock, control-flow statements, parseCall, and a parseCmp precedence layer above parseAdd. Every new feature followed the same recipe:

  1. Define the AST node.
  2. Add the parser rule.
  3. Extend typecheck to validate it.
  4. Extend emitLLVMIR to lower it.

Lessons from doing it twice

  • Spans, not positions. Spans carry through every transformation; positions become stale the moment you concatenate AST nodes.
  • Synchronisation tokens scale. cp-15 used ;. cp-16 uses ; and brace-balanced sync inside blocks — see Parser::sync in parse.cpp. The principle is identical.
  • Diagnostic codes are forever. Once you ship E0202, you don't renumber it. Our scheme (E01xx lex, E02xx parse, E03xx variable/eval errors, E04xx function-level semantic checks) is just enough structure.
  • std::optional<Stmt> requires <optional>. Compilers may include it transitively today; relying on that is a portability bug waiting to happen.

Step 03 · A multi-function language

cp-15's language was a calculator. cp-16's is a real (if tiny) language with functions, control flow, and recursion. The grammar changes that mattered:

Top-level is functions only

program := func+

A program is a list of fn declarations. There's no top-level "main scope" — that's fn main(). This rule is enforced by the parser (error E0210) and by typecheck (error E0411 if no main).

Blocks introduce scope

case Stmt::K::If: {
    auto sc1 = scope; checkBlock(s.body, sc1);
    auto sc2 = scope; checkBlock(s.elseBody, sc2);
    return;
}

We snapshot the scope before each branch so a let inside a branch doesn't leak out. This is the simplest form of lexical scoping; real languages use a linked stack of scopes for efficiency and shadowing rules.

Calls

parseCall runs after parsePrimary and wraps the result in zero or more ( args ) suffixes:

while (peek().kind == Tok::LParen) { ... }

This lets f(1)(2) parse (even though we don't have first-class functions). It also makes adding methods (obj.method(arg)) a small extension.

Control flow lowers to branches

if/while are compiled to plain LLVM basic blocks; we don't use select or phi. The IR for while (cond) body is:

  br label %cond
cond:
  %v = ... evaluate cond ...
  %t = icmp ne i64 %v, 0
  br i1 %t, label %body, label %end
body:
  ... body ...
  br label %cond
end:

That's the canonical "structured control flow → CFG" lowering. Optimisation passes clean up afterwards: mem2reg removes the alloca traffic introduced by let/assign, and jump threading tidies redundant branches.
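An emitter for this shape mostly has to get label uniqueness right so that loops can nest. The sketch below assumes a textual emitter with callbacks for the condition and the body; the Emitter struct, the callback signatures and the numbering scheme are hypothetical rather than taken from llvm_emit.cpp.

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>

struct Emitter {
    std::ostringstream out;
    int nextId = 0;

    // emitCond writes the instructions computing the condition and returns
    // the name of the i64 result; emitBody writes the loop body.
    void emitWhile(const std::function<std::string()>& emitCond,
                   const std::function<void()>& emitBody) {
        // Reserve the id before emitting children, so nested loops get
        // their own labels.
        const std::string id   = std::to_string(nextId++);
        const std::string cond = "cond" + id;
        const std::string body = "body" + id;
        const std::string end  = "end" + id;

        out << "  br label %" << cond << "\n";
        out << cond << ":\n";
        const std::string v = emitCond();
        out << "  %t" << id << " = icmp ne i64 " << v << ", 0\n";
        out << "  br i1 %t" << id << ", label %" << body
            << ", label %" << end << "\n";
        out << body << ":\n";
        emitBody();
        out << "  br label %" << cond << "\n";   // back edge
        out << end << ":\n";
    }
};
```

The explicit `br label %cond` at the top matters: LLVM basic blocks cannot fall through, so the block preceding the loop must end with a terminator before the condition label begins.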

Step 04 · Typecheck and scope

The typechecker in typecheck.cpp is a single AST walk. Three responsibilities:

  1. Function table — collect every fn name and remember its arity.
  2. main requirements — must exist (E0411), must take zero parameters (E0412), no duplicate definitions (E0410).
  3. Per-function scope walk:
    • Variables must be let-introduced before use (E0300).
    • Assignment requires prior declaration (E0302).
    • Calls must resolve to a known function with matching arity (E0420, E0421).

That's it. There's no actual type checking because everything is i64. That's the right starting point: it disentangles "is it syntactically and semantically well-formed?" from "is it type-correct?". You can graft a real type system on top later (introduce i64/bool/str, unify across operators, infer generics) without touching the parser.
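The function-table checks can be sketched without any AST at all. The example below assumes simplified inputs (a list of declared functions and a list of call sites); FuncDecl, CallSite and check are hypothetical names, and the assignment of E0420 to "unknown function" and E0421 to "wrong arity" is a guess, since the text lists the two codes together.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct FuncDecl { std::string name; int arity; };
struct CallSite { std::string callee; int argc; };

// Returns diagnostic codes in the order the problems are found.
std::vector<std::string> check(const std::vector<FuncDecl>& funcs,
                               const std::vector<CallSite>& calls) {
    std::vector<std::string> codes;
    std::unordered_map<std::string, int> table;      // name → arity

    for (const auto& f : funcs)
        if (!table.emplace(f.name, f.arity).second)
            codes.push_back("E0410");                // duplicate definition

    auto m = table.find("main");
    if (m == table.end())    codes.push_back("E0411");   // no main
    else if (m->second != 0) codes.push_back("E0412");   // main takes parameters

    for (const auto& c : calls) {
        auto it = table.find(c.callee);
        if (it == table.end())         codes.push_back("E0420");   // assumed: unknown function
        else if (it->second != c.argc) codes.push_back("E0421");   // assumed: arity mismatch
    }
    return codes;
}
```

Collecting the table before walking any body is what lets a function call another one defined later in the file.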

Scopes are sets, not stacks

std::unordered_set<std::string> scope(f.params.begin(), f.params.end());

Snapshotting the scope before each branch (auto sc1 = scope; ...) gives the right semantics without a linked structure. Each snapshot copies the whole set, so the cost grows as O(nesting depth × variables in scope) on pathologically deep nesting, but it is negligible in practice.

A production typechecker uses a stack of scopes for shadowing (let x = 1; { let x = 2; print x; } print x; → 2 then 1). Our language doesn't support shadowing: let x = 1; let x = 2; quietly clobbers the first x, which is a UX bug we'd fix by adding an E0303 diagnostic.
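A stack of scopes is only a few lines longer than the set-snapshot approach. This sketch stores a value per name so the "2 then 1" behaviour is observable; the ScopeStack class is hypothetical and is not what cp-16's typecheck.cpp uses.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

class ScopeStack {
public:
    ScopeStack() { frames_.emplace_back(); }         // function-level scope
    void push() { frames_.emplace_back(); }          // entering a block
    void pop()  { assert(frames_.size() > 1); frames_.pop_back(); }

    // Returns false on redeclaration within the same block, the case an
    // E0303-style diagnostic would report.
    bool declare(const std::string& name, long value) {
        return frames_.back().emplace(name, value).second;
    }

    // The innermost declaration wins, which is what makes shadowing work.
    std::optional<long> lookup(const std::string& name) const {
        for (auto f = frames_.rbegin(); f != frames_.rend(); ++f) {
            auto it = f->find(name);
            if (it != f->end()) return it->second;
        }
        return std::nullopt;
    }

private:
    std::vector<std::unordered_map<std::string, long>> frames_;
};
```

Entering a block is a push and leaving it is a pop, so nothing is copied and an inner let can never leak into the enclosing scope.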

Why typecheck before IR emission

  • Better errors. "Unknown function fbn" with a caret at the call site is far nicer than a failure about an undefined fbn from llc or the linker.
  • Performance. We don't waste time generating IR for code that's going to be rejected.
  • Layering. IR emission can assume the AST is well-formed — fewer if checks, simpler code.

The typechecker is also reused unmodified by the check subcommand, which is the building block for IDEs.

Step 05 · Emitting LLVM IR

llvm_emit.cpp is the longest file in this lab, ~150 lines, because it covers nine constructs (literal, variable, neg, binop, cmp, call, if, while, return) plus the module preamble. Highlights:

Memory model

We use the classic mem2reg-friendly pattern: every variable is an alloca and every read/write goes through load/store:

%x.addr = alloca i64
store i64 0, ptr %x.addr
%t = load i64, ptr %x.addr

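The frontend never has to compute SSA itself. LLVM's mem2reg pass (or SROA, which covers the same ground in the standard -O1 and above pipelines of opt and clang) promotes the allocas to virtual registers and inserts phi nodes where needed. llc on its own runs only code generation passes, so the promotion happens only when the IR goes through such a pipeline first. This separation of concerns (frontend allocates, optimiser promotes) is one of the most important architectural ideas in modern compilers.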

Comparisons are i64-valued booleans

%cb = icmp slt i64 %a, %b
%v  = zext i1 %cb to i64

Everything is i64, including booleans. if (...) then re-truncates via icmp ne i64 %v, 0. Wasteful? Yes. Compatible with the rest of the language? Also yes. A real bool type would be cheaper but requires propagating types through every operator.

Functions

define i64 @add(i64 %arg0, i64 %arg1) {
entry:
  %a.addr = alloca i64
  store i64 %arg0, ptr %a.addr
  %b.addr = alloca i64
  store i64 %arg1, ptr %b.addr
  ...
  ret i64 0      ; fallback if no explicit return
}

Each parameter gets spilled to an alloca of the same name. After mem2reg these vanish. The trailing ret i64 0 guarantees every function ends with a terminator even if the user omits return — defensive but not wrong.

Module preamble

target triple = "arm64-apple-macosx"
@.fmt = private unnamed_addr constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(ptr, ...)

The hard-coded triple is for the macOS-on-ARM workstation this lab was developed on. A portable driver would either: (a) drop the triple and let llc pick the host default, or (b) call llvm::sys::getDefaultTargetTriple() via the LLVM-as-a-library route. We chose explicit because it documents the assumed target.

What we don't do

  • SSA construction (mem2reg handles it).
  • Register allocation (LLVM backend).
  • Instruction selection (LLVM backend).
  • Linking object files (clang invokes ld).

We're orchestrating, not reinventing. That's what "use LLVM" buys you.

Step 06 · Driving the toolchain

driver.cpp owns the boring-but-critical job of turning an IR string into an executable on disk.

The pipeline as commands

$ /opt/homebrew/opt/llvm/bin/llc -O0 -filetype=obj -o /tmp/minilangc-obj-XXX.o /tmp/minilangc-ir-YYY.ll
$ /opt/homebrew/opt/llvm/bin/clang -O0 -o a.out /tmp/minilangc-obj-XXX.o

That's the whole compiler. llc lowers IR → object file (handles instruction selection, register allocation, scheduling, emission). clang acts as the linker driver — it knows how to invoke the system linker (ld64 on macOS) with the right libc paths so printf resolves.

-O<N> is forwarded to both; -v echoes the commands so users can re-run them. Both tools are looked up via ${LLVM_BIN_DIR} (settable via CMake -DLLVM_BIN_DIR=...) so the lab works on other people's machines.

Process management

We use popen in read mode to capture combined stdout+stderr (popen itself only pipes stdout, so stderr has to be redirected with 2>&1 in the command). This is good enough for our purposes:

  • Tools rarely produce large output on success.
  • On failure, we want all the diagnostic chatter.
  • No need to handle stdin (we pass IR via a file).

For production:

  • Use posix_spawn + pipes to capture stderr separately.
  • Stream to user terminal in -v mode rather than buffer.
  • Handle signals properly (Ctrl-C should kill the child).
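The popen approach described above can be sketched as one helper. It is POSIX-only and illustrative: RunResult and runCommand are hypothetical names, and wrapping the command in a brace group before redirecting is one way (assumed here, not confirmed by the lab) to make sure stderr from every part of the command is captured.

```cpp
#include <cassert>
#include <stdio.h>      // popen, pclose (POSIX)
#include <string>
#include <sys/wait.h>   // WIFEXITED, WEXITSTATUS

struct RunResult {
    int         exitCode;   // -1 if the command could not be run or was killed
    std::string output;     // combined stdout + stderr
};

RunResult runCommand(const std::string& cmd) {
    // Group the command so the redirect applies to all of it.
    const std::string full = "{ " + cmd + " ; } 2>&1";
    FILE* pipe = popen(full.c_str(), "r");
    if (!pipe) return {-1, "popen failed"};

    std::string out;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, pipe)) > 0) out.append(buf, n);

    int status = pclose(pipe);              // waits for the child
    int code = (status != -1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
    return {code, out};
}
```

Because the command string goes through the shell, file paths passed to it need quoting; a posix_spawn-based driver avoids that problem along with the combined-output limitation.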

Temp file hygiene

mkstemp("/tmp/minilangc-{tag}-XXXXXX") then rename to add the right extension. We don't clean up because:

  • On success the user doesn't care.
  • On failure the user wants to inspect the IR.

A real driver would offer --save-temps (gcc) or --keep-tmp-files. The current behaviour matches --save-temps, which is fine for an educational tool.

Why clang instead of ld directly

Calling ld directly requires knowing the platform-specific runtime glue: crt1.o, libSystem.B.dylib, the right SDK path. clang -o figures all that out by querying the macOS SDK. It's slower (one extra exec) but vastly more portable.

Testing the e2e pipeline

Tests in test_suite.cpp call toolchainAvailable() and skip e2e tests gracefully if llc isn't present. This keeps CI green even on machines without LLVM installed.

Step 07 · From here to production

minilangc is a complete compiler — small, but every stage that a real production compiler has is present. What separates it from something you'd ship?

Language features

  • Types. Booleans, strings, structs, arrays. Each one threads through lex → parse → typecheck → IR. Strings need a runtime (cp-14).
  • Closures. Capture-by-reference vs. by-value, free variable analysis, environment lowering.
  • Modules / namespaces. Multi-file compilation, separate compilation units, an import statement, a build manifest.
  • Generics. Monomorphisation (Rust) or erasure with boxing (Java). Either way, the typechecker grows substantially.

Performance

  • Link with LLVM-as-a-library to avoid process overhead. We'd replace buildExecutable with code that builds an llvm::Module directly (cp-10/11/12 already do this).
  • Incremental compilation. Hash AST nodes, cache IR per function, re-emit only changed functions. Rust's query system and Swift's incremental dependency tracking are the models to study.
  • Parallel compilation. One thread per function, share an immutable AST. An LLVMContext is not thread-safe, so each worker needs its own context and its own Module.
  • Optimisation passes. Run a custom pipeline (mem2reg, instcombine, GVN, LICM, loop vectorisation), for example through opt, before llc.

Tooling ecosystem

  • minilangc fmt (re-emit canonical source) — port cp-15's formatter.
  • minilangc test (discover fn test_* and run them).
  • minilangc doc (extract doc comments).
  • minilangc lsp — full LSP server using the spans + diagnostics we built.
  • Debugger support — emit DWARF line tables (!dbg in IR, DICompileUnit metadata, -g).

Distribution

  • Pre-compiled standard library distributed as object files (or LLVM bitcode for cross-target).
  • Package manager. Cargo, npm, go modules — all evolved alongside their compilers.
  • Cross-compilation. Parameterise the triple, ship multiple llc backends.

Where the curriculum goes next

  • cp-17 (capstone JIT) demonstrates the dynamic-language path: parse → IR → ORC JIT → call into a runtime (cp-14) at runtime. No object files, no clang. Same frontend.
  • cp-18 (capstone MLIR) demonstrates the high-level-IR path: parse → custom MLIR dialect → progressive lowering → LLVM dialect → object. More machinery for more optimisation headroom.

All three capstones share the cp-15 frontend skeleton. That's the deepest lesson of the curriculum: the compiler is a frontend + a backend choice, and the backend choice depends on the deployment story you want.

cp-17 — Capstone: JIT for a Tiny Dynamic Language

A small dynamic language frontend (fn / let / control flow / strings) compiled with the LLVM C++ API and executed via ORC LLJIT, with a host-side runtime registered as JIT-resolvable symbols.

This is the JIT counterpart to cp-16 (which AOT-compiled by shelling out to llc and clang). Here we link LLVM as a library and run code in-process.

Build & test

cd src/cpp
cmake -S . -B build && cmake --build build
./build/tests/test_jit          # 7/7 checks passed
./build/mldyn examples/hello.ml # CLI entry point

Pipeline

source.ml ──lex──▶ tokens ──parse──▶ AST ──emit──▶ llvm::Module ──ORC──▶ run main()
                                                                  └── runtime symbols (host)

Layout

  • src/cpp/src/{source,diag,lex,parse}.{hpp,cpp} — frontend (cp-15/16 style)
  • src/cpp/src/runtime.{hpp,cpp} — host functions exposed to JIT'd code
  • src/cpp/src/ir_emit.{hpp,cpp} — Program → llvm::Module via IRBuilder
  • src/cpp/src/jit.{hpp,cpp} — LLJIT setup, symbol registration, lookup("main")
  • src/cpp/src/main.cpp — mldyn <file> CLI
  • src/cpp/tests/test_jit.cpp — end-to-end pipeline tests
  • steps/01..07.md — narrative walkthrough

Tests (7 checks)

  1. print 42 → 42\n
  2. print 1 + 2 * 3 → 7\n
  3. print_str "hello, jit" → hello, jit\n
  4. fib(10) → 55\n
  5. while loop printing 0\n1\n2\n
  6. ml_record_int_arg runtime callback fires from JIT'd code
  7. Interleaved print_str + print integer output

01 — JIT vs AOT

In cp-16 we built minilangc, an ahead-of-time compiler: it produced a .ll text file, then shelled out to llc and clang to assemble and link an executable. The compiler and the program are different processes, separated by time and by the filesystem.

In cp-17 we build mldyn, a just-in-time runner. The compiler and the program live in the same process. The flow is:

source ──► tokens ──► AST ──► llvm::Module ──► machine code ──► call()
                              (in memory)      (in memory)

There is no .ll, no .o, no clang invocation. We link against LLVM's libraries (libLLVMCore, libLLVMOrcJIT, …) and the IRBuilder hands us a Module object that ORC compiles in-process to executable memory pages.

Why JIT for a dynamic language?

Dynamic languages — Python, Ruby, JavaScript, Lua — discover types and shapes at runtime. An AOT compiler must therefore choose: either generate slow generic code, or refuse to compile until annotations are added. A JIT can defer codegen, observe what actually happens (type feedback, hot paths), then emit specialised code. That's how V8, HotSpot, LuaJIT, PyPy and Truffle all earn their speed.

cp-17 doesn't implement specialisation — that would take a tracing/IC infrastructure — but it lays the groundwork:

  • IR built programmatically, so we could vary it per call site.
  • A runtime symbol table that JIT'd code can call back into.
  • An ml_record_int_arg hook that demonstrates type feedback: every function entry tells the host "this argument was an int". A real system would consult this table on the next compile to decide whether to generate a fast int-only version.

What we keep from cp-16

The frontend. Lexer, parser, diagnostics, spans — all of it carries over. The language grew a string literal and a print_str statement, but the parser infrastructure (recursive descent, sync, span-bearing diagnostics) is the same code we built in cp-15 and reused in cp-16. Frontends are stable; backends are where the interesting variation lives.

02 — Building IR with IRBuilder

cp-16 produced LLVM IR as text: strings concatenated into a .ll file, which llc then parsed back into an in-memory Module. That round-trip is fine for AOT (the textual form is great for debugging), but for a JIT it is wasted work: we would print the IR only to parse it straight back inside the same process.

In cp-17, ir_emit.cpp constructs the Module directly:

LLVMContext ctx;
Module      mod("test", ctx);
IRBuilder<> b(ctx);

Every IR node is a C++ object. The IRBuilder tracks a current insertion point (a BasicBlock) and appends instructions to it. Compare:

| operation | textual IR | IRBuilder call |
| --- | --- | --- |
| add two i64 | %t = add i64 %lhs, %rhs | b.CreateAdd(lhs, rhs) |
| signed less-than | %t = icmp slt i64 %lhs, %rhs | b.CreateICmpSLT(lhs, rhs) |
| call printf | call i32 (ptr, ...) @printf(ptr %fmt, ...) | b.CreateCall(fn, args) |
| return value | ret i64 %v | b.CreateRet(v) |
| branch | br label %L | b.CreateBr(L) |
| cond br | br i1 %c, label %T, label %F | b.CreateCondBr(c, T, F) |

The Value* that each Create* returns is the IR-level result you splice into subsequent operations. You're building a directed graph of SSA values, just with C++ syntax instead of .ll text.

Locals as alloca slots

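We keep the simple model from earlier labs: every local is an alloca slot named <name>.addr, loaded on read, stored on write. The mem2reg pass promotes these slots to SSA registers, which means the emitter never has to track SSA names or phi nodes. One caveat: a stock LLJIT only runs code generation, not IR-level passes, so mem2reg has to be scheduled explicitly (for example with a PassBuilder pipeline run over the module before it is added to the JIT). Without it the program still runs correctly; the locals just stay in stack slots.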

auto* slot = b.CreateAlloca(i64(), nullptr, "x.addr");
b.CreateStore(value, slot);
// later:
auto* v = b.CreateLoad(i64(), slot);

Control flow via named basic blocks

For if/while we explicitly create blocks and stitch branches:

auto* T = BasicBlock::Create(ctx, "then", fn);
auto* E = BasicBlock::Create(ctx, "else", fn);
auto* M = BasicBlock::Create(ctx, "end",  fn);
b.CreateCondBr(cond_i1, T, E);
b.SetInsertPoint(T);
// emit `then` body...
if (!b.GetInsertBlock()->getTerminator()) b.CreateBr(M);

The terminator check matters: if the then body ended with return, the block is already terminated and we must NOT append a second terminator (LLVM's verifier will reject the module). That single rule is responsible for most of the conditional if (!terminator) br calls in ir_emit.cpp.
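The guard can be modelled without LLVM. The sketch below is a toy stand-in, not the code in ir_emit.cpp: ToyBlock, emit, emitTerminator, and branchIfOpen are hypothetical names, and a block is reduced to a list of strings plus a "closed" flag.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for llvm::BasicBlock: an instruction list plus a flag that
// records whether the block already ends in a terminator (ret/br).
struct ToyBlock {
    std::string              name;
    std::vector<std::string> insts;
    bool                     terminated = false;
};

// Append an ordinary instruction. Emitting after a terminator is exactly the
// bug the verifier reports, so the model refuses it outright.
inline void emit(ToyBlock& bb, const std::string& inst) {
    assert(!bb.terminated && "instruction emitted after a terminator");
    bb.insts.push_back(inst);
}

// Append a terminator and close the block.
inline void emitTerminator(ToyBlock& bb, const std::string& inst) {
    emit(bb, inst);
    bb.terminated = true;
}

// The guard: add the fall-through branch only if the body left the block open.
// Returns true when a branch was actually emitted.
inline bool branchIfOpen(ToyBlock& bb, const std::string& target) {
    if (bb.terminated) return false;
    emitTerminator(bb, "br label %" + target);
    return true;
}
```

A then-body that ended in a return leaves its block closed, so branchIfOpen does nothing; an else-body that fell through gets its branch to the merge block.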

verifyModule

After emitting we call llvm::verifyModule. If it returns true, the IR is malformed: dangling references, missing terminators, type mismatches, etc. We capture the report and surface it as EmitResult::error. This is the guardrail against bugs in your emitter. Catching a verifier error is a millisecond; catching a "JIT executed bad machine code" error is a debugger session at best.

03 — Registering Runtime Symbols with ORC

The JIT'd module declares but does not define the runtime functions:

declare void @ml_print_int(i64)
declare void @ml_print_str(ptr)
declare void @ml_record_int_arg(i64)

When ORC compiles main, those calls become real bl/call instructions to some address — but to which address? Nothing in the module says. ORC will search its JITDylib for a symbol named ml_print_int, and if none is found, the lookup fails while ORC links main (before any JIT'd code runs) with a missing-symbol error.

Our job is to put the host process's function addresses into that table before we run anything. In jit.cpp:

llvm::orc::SymbolMap syms;
auto def = [&](void* p) {
    return llvm::orc::ExecutorSymbolDef(
        llvm::orc::ExecutorAddr::fromPtr(p),
        llvm::JITSymbolFlags::Exported | llvm::JITSymbolFlags::Callable);
};
syms[es.intern("ml_print_int")] = def((void*)&ml_print_int);
syms[es.intern("ml_print_str")] = def((void*)&ml_print_str);
syms[es.intern("ml_record_int_arg")] = def((void*)&ml_record_int_arg);
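// define() returns an llvm::Error that must be checked; cantFail aborts on failure.
llvm::cantFail(jd.define(llvm::orc::absoluteSymbols(std::move(syms))));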

Three pieces:

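  • intern(name) turns a StringRef into a SymbolStringPtr. The string pool is owned by the ExecutionSession, so all lookups can compare pointers instead of strings. The interned name has to be the linker-level name: on Mach-O (macOS) C symbols carry a leading underscore, so there jit->mangleAndIntern("ml_print_int") is the safer spelling.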
  • ExecutorSymbolDef(addr, flags) wraps a raw pointer with metadata. Exported makes the symbol visible to lookups; Callable distinguishes function pointers from data pointers (relevant for some platforms' ABI).
  • absoluteSymbols wraps the map in a MaterializationUnit whose "materialise" step is trivial: the addresses are already known.

Then jd.define(...) installs the unit. From this point on, jit->lookup("ml_print_int") would return the host address, and so will ORC's internal linker when it resolves the declare in the user module.
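The interning and absolute-symbol ideas can be sketched with the standard library alone. This is a model of the concept, not ORC's implementation: SymbolPool, SymbolTable, and toy_print_int are hypothetical names, and a missing symbol is reported as address 0 where ORC would return an error.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Toy string pool: each distinct name is stored once, so two interned
// handles are equal exactly when their names are equal.
class SymbolPool {
public:
    using Handle = const std::string*;
    Handle intern(const std::string& name) {
        // unordered_set is node-based, so element addresses stay stable.
        return &*pool_.insert(name).first;
    }
private:
    std::unordered_set<std::string> pool_;
};

// Toy "JITDylib": maps interned names to addresses that are already known.
class SymbolTable {
public:
    void defineAbsolute(SymbolPool::Handle name, std::uintptr_t addr) {
        assert(name != nullptr);
        table_[name] = addr;
    }
    // Returns 0 when the symbol is missing (a real linker reports an error).
    std::uintptr_t lookup(SymbolPool::Handle name) const {
        auto it = table_.find(name);  // hashes the pointer, not the text
        return it == table_.end() ? 0 : it->second;
    }
private:
    std::unordered_map<SymbolPool::Handle, std::uintptr_t> table_;
};

// Hypothetical host function standing in for ml_print_int.
extern "C" inline void toy_print_int(std::int64_t) {}
```

Because the table is keyed on the interned handle, a lookup never compares characters; it only hashes and compares one pointer.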

Why not just rely on DynamicLibrarySearchGenerator?

LLJIT has a default generator that searches the host process for symbols by name. If our runtime functions had public C linkage in the main executable, that mechanism would find them automatically. We register explicitly for three reasons:

  1. Determinism. We control which names are reachable; nothing else leaks from the host into JIT'd code.
  2. Plumbing for sandboxing. In a production VM you eventually want the JIT to live in a different address space (or a different process). The ExecutorAddr indirection is what makes that swap possible — same API, just point at a remote address.
  3. It's the same path the type-feedback hook will take. Future VM services — bailout, GC write barrier, deopt — register exactly the same way.

Lookup and call

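auto sym = jit->lookup("main");        // ORC compiles `main` on demand
if (!sym)                              // Expected<> must be checked before use
    return fail(sym.takeError());      // fail() is a stand-in for your error path
auto fn  = sym->toPtr<int64_t(*)()>(); // raw function pointer
int64_t result = fn();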

That lookup call is the moment ORC walks the module, runs the code-generation pipeline, lowers to machine code, copies the bytes into executable pages, and applies relocations. From the C++ side it looks like a hash-table lookup; in reality it is the whole back end that cp-16 delegated to llc and clang.

04 — Type Feedback: Foundations of Inline Caching

ml_record_int_arg is the most interesting function in runtime.cpp. It's called from JIT'd code at every user function entry, once per parameter:

// In ir_emit.cpp:
for (auto& a : fn->args()) {
    auto* slot = b.CreateAlloca(i64(), nullptr, ...);
    b.CreateStore(&a, slot);
    b.CreateCall(fn_record, {ConstantInt::get(i64(), nextSite++)}); // ← here
    ...
}

Each parameter gets a unique site id chosen at compile time. The runtime maintains unordered_map<site_id, count>:

extern "C" void ml_record_int_arg(int64_t site) {
    g_intCounts[site] += 1;
}

That's a one-line implementation, but it's the same architecture that powers V8's inline caches, HotSpot's TypeProfile, and LuaJIT's traces. The pattern is:

  1. JIT'd code reports observations to host.
  2. Host accumulates statistics keyed by call site.
  3. When some heuristic fires (counter > threshold, distribution narrow enough), the host invalidates the current compile and recompiles with the observed types baked in as assumptions.
  4. The new code adds a guard: if the assumption is violated, it bails back to the generic path.
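Steps 1 to 3 of the pattern can be sketched as a small host-side table. This is an illustrative model, not cp-17's runtime.cpp: FeedbackTable and its threshold are assumptions, and the "heuristic" is deliberately the crudest one possible (a counter reaching a fixed limit).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Host-side table of observations keyed by site id (step 2 of the pattern).
class FeedbackTable {
public:
    explicit FeedbackTable(std::uint64_t threshold) : threshold_(threshold) {
        assert(threshold > 0);
    }

    // Step 1: what JIT'd code would call at a function entry.
    void recordIntArg(std::int64_t site) { ++counts_[site]; }

    std::uint64_t count(std::int64_t site) const {
        auto it = counts_.find(site);
        return it == counts_.end() ? 0 : it->second;
    }

    // Step 3: the site is "hot" once its counter reaches the threshold.
    bool shouldRecompile(std::int64_t site) const {
        return count(site) >= threshold_;
    }

private:
    std::uint64_t threshold_;
    std::unordered_map<std::int64_t, std::uint64_t> counts_;
};
```

A real VM would combine the counter with the observed type distribution before deciding to recompile; the sketch only shows where that decision would live.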

cp-17 stops at step 2: we record but never recompile. The sixth test verifies the mechanism works end-to-end: a JIT'd print(f(5)) call increments the counter for site #1 (the first parameter of the first user-defined function), which the C++ side can check after the JIT returns.

Why this needs the runtime-symbol plumbing from step 03

The recording call is just call void @ml_record_int_arg(i64 1). There's no LLVM magic — it's an external function call. The reason it works is that we already taught ORC that the name ml_record_int_arg resolves to a real C function in our process. The whole "type feedback" feature is purely:

  • IR emitter inserts a function call at the right place.
  • Host registers the target.
  • Host reads the counter later.

Every dynamic-language VM's profiling subsystem is layered on the same idea.

What you'd add next

  • Per-type counters: distinguish int / string / object / null. We have only one bucket; a real system stores a small set with frequencies.
  • Site-keyed cache slots: replace counters with a small struct {type_tag, cached_method, miss_count} per site. That's an inline cache.
  • Tiered compilation: once a counter crosses a threshold, queue the function for recompilation at a higher tier (e.g. with arguments specialised to i64). Keep the old code as the bail target.
  • Deoptimisation: when an assumption fails at runtime, jump from the optimised frame back into the unoptimised one with reconstructed state. This is the hardest part and a topic in its own right.
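The first item on that list, per-type counters, might look like the sketch below. The Tag enum and TypeProfile struct are hypothetical; cp-17 itself has only the single int bucket described above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical dynamic type tags; Count is a sentinel, not a real type.
enum class Tag : std::size_t { Int, String, Object, Null, Count };

// One profile per call site: how often each type was observed there.
struct TypeProfile {
    std::array<std::uint64_t, static_cast<std::size_t>(Tag::Count)> seen{};

    void record(Tag t) {
        assert(t != Tag::Count);
        ++seen[static_cast<std::size_t>(t)];
    }

    // Returns the single observed tag, or nullopt when the site has seen
    // either nothing yet or more than one type.
    std::optional<Tag> monomorphic() const {
        std::optional<Tag> only;
        for (std::size_t i = 0; i < seen.size(); ++i) {
            if (seen[i] == 0) continue;
            if (only) return std::nullopt;  // second distinct type: polymorphic
            only = static_cast<Tag>(i);
        }
        return only;
    }
};
```

A monomorphic site is the easy case for specialisation: the recompiled code needs one guard on one tag.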

05 — Strings as Private Globals + GEP

The Str expression node is the only non-numeric value type in cp-17. Its lowering reveals two LLVM concepts every IR emitter must internalise: GlobalVariable for constant data, and getelementptr (GEP) for address arithmetic.

case Expr::K::Str: {
    Constant* s = ConstantDataArray::getString(ctx, e.str, /*addNull=*/true);
    auto* gv = new GlobalVariable(
        mod, s->getType(), /*isConstant=*/true,
        GlobalValue::PrivateLinkage, s, ".str");
    return b.CreateInBoundsGEP(
        s->getType(), gv,
        {ConstantInt::get(i32(), 0), ConstantInt::get(i32(), 0)});
}

Step by step:

  1. ConstantDataArray::getString builds an [N x i8] constant from the string bytes, optionally NUL-terminated. This is the value.
  2. new GlobalVariable(...) registers that constant as a module-level symbol with PrivateLinkage (linker-internal, won't conflict across modules) and isConstant=true (the optimiser may place it in .rodata). The variable's type is [N x i8], and gv is a Constant* pointing to it (in LLVM 20 with opaque pointers, the pointer is just ptr).
  3. CreateInBoundsGEP computes the address of element [0][0]:
    • first index 0 steps through the pointer (typical idiom for "the array itself, not array #N");
    • second index 0 steps to the first byte inside the array. The result is a ptr to byte 0 of the string — exactly what ml_print_str(const char*) expects.

The two-index GEP for arrays

You'll see {0, 0} patterns everywhere in LLVM IR for "decay an array to a char*". The mental model:

gv      : ptr to [N x i8]
gep 0   : same as gv (no offset, but lets us index into the pointee)
gep 0,0 : pointer to the first i8 in the array

Conceptually like C's &str[0]. If the global were [10 x [4 x i8]], {0, i, j} would give &str[i][j]. The first index is special; subsequent indices walk the aggregate type.
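The index arithmetic can be checked with a few lines of plain C++. This is a worked model of the offset computation for an array of i8 arrays only; gepOffset and cOffset are hypothetical helper names, and the real instruction handles arbitrary aggregate types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offset computed by a GEP over [Rows x [Cols x i8]] with indices
// {idx0, i, j}. idx0 scales by the size of the whole pointee; i and j walk
// into the aggregate.
constexpr std::ptrdiff_t gepOffset(std::ptrdiff_t rows, std::ptrdiff_t cols,
                                   std::ptrdiff_t idx0, std::ptrdiff_t i,
                                   std::ptrdiff_t j) {
    const std::ptrdiff_t elemSize  = 1;                // i8
    const std::ptrdiff_t rowSize   = cols * elemSize;  // [Cols x i8]
    const std::ptrdiff_t arraySize = rows * rowSize;   // [Rows x [Cols x i8]]
    return idx0 * arraySize + i * rowSize + j * elemSize;
}

// Cross-check against the address the C++ compiler computes for &str[i][j]
// in a char[10][4], measured in bytes from the start of the array.
inline std::ptrdiff_t cOffset(std::ptrdiff_t i, std::ptrdiff_t j) {
    assert(i >= 0 && i < 10 && j >= 0 && j < 4);
    static char str[10][4];
    const auto base = reinterpret_cast<std::uintptr_t>(&str);
    const auto elem = reinterpret_cast<std::uintptr_t>(&str[i][j]);
    return static_cast<std::ptrdiff_t>(elem - base);
}
```

With a first index of 0 the result stays inside the same array; a first index of 1 would step over the entire 40-byte array, which is why the leading 0 is almost always what you want.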

Why opaque pointers don't change this

Typed pointers are gone from modern LLVM (opaque pointers became the default in LLVM 15 and typed pointers were removed in LLVM 17), so in LLVM 20 every pointer in IR is just ptr. But the GEP instruction still needs to know the pointee type to compute offsets. That's what the explicit s->getType() (the array type) argument to CreateInBoundsGEP is for. The IRBuilder no longer infers it from the pointer's type, because the pointer type no longer names a pointee.

Lifetime

The GlobalVariable is owned by the Module. When the Module is moved into ThreadSafeModule and handed to ORC, ownership transfers. After JITting, the global's bytes live somewhere in the JIT's data section, and the pointer we returned is valid for as long as the LLJIT instance lives. For test code this is fine; for a long-running VM you'd care about reclaiming unused string globals (a job for ORC's resource-tracker API).

06 — verifyModule and Debugging JIT Crashes

A JIT failure mode looks like this: lookup("main") returns a function pointer. You call it. The process crashes with SIGBUS or SIGSEGV or, if you're unlucky, silently produces wrong output. There is no stack trace pointing back at the IR that did it. Debugging requires reproducing the bug in tools designed for native code (lldb, Instruments) against memory pages that didn't exist a moment ago.

The single best defense is to verify before you JIT. cp-17 does this at the end of emitModule:

std::string verr;
raw_string_ostream os(verr);
if (verifyModule(*r.module, &os)) {
    r.error = "verifyModule failed:\n" + os.str();
    r.module.reset();
}

verifyModule catches:

  • Basic blocks without terminators (the #1 emitter bug).
  • Multiple terminators in a basic block.
  • Branches to blocks in the wrong function.
  • Type mismatches in operands (e.g., passing i32 where i64 is expected).
  • PHI nodes with the wrong number of incoming values.
  • Use-before-def on Values outside their dominator scope.
  • Function signature mismatches between caller and callee.
  • ret void from an i64-returning function (and vice versa).

Every one of those will at best cause a JIT crash and at worst quietly miscompile. Catch them at IR emission and you save days.
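To make the first two checks concrete, here is a toy verifier over a toy IR. It is a sketch of the idea only: MiniInst, MiniBlock, and verifyBlocks are hypothetical, and unlike llvm::verifyModule (which returns true on failure and writes to a stream) this version returns the message directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct MiniInst  { std::string text; bool isTerminator; };
struct MiniBlock { std::string name; std::vector<MiniInst> insts; };

// Returns an empty string if every block is well formed, otherwise a
// description of the first problem found.
inline std::string verifyBlocks(const std::vector<MiniBlock>& blocks) {
    for (const MiniBlock& bb : blocks) {
        // Rule 1: a block must end with a terminator.
        if (bb.insts.empty() || !bb.insts.back().isTerminator)
            return "block '" + bb.name + "' does not end with a terminator";
        // Rule 2: no terminator may appear before the last instruction.
        for (std::size_t i = 0; i + 1 < bb.insts.size(); ++i)
            if (bb.insts[i].isTerminator)
                return "terminator in the middle of block '" + bb.name + "'";
    }
    return "";
}
```

Both rules are linear scans, which is why running the verifier on every emitted module is cheap compared with debugging the crash it prevents.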

Reading verifier output

The verifier names the broken instruction and prints surrounding context:

Terminator found in the middle of a basic block!
label %then

When you see this, the cause is almost always:

  • You emitted a ret or br, then forgot to switch insertion point and emitted more instructions into the same now-closed block, or
  • You forgot to check GetInsertBlock()->getTerminator() before appending a fall-through branch after an if/while body that already returned.

The cure in ir_emit.cpp is the if (!b.GetInsertBlock()->getTerminator()) guard before any CreateBr/CreateRet.

When the verifier passes but the JIT still crashes

Common remaining causes:

  • Signature or calling-convention mismatch between a declared extern and the host function (e.g., declaring void f(i64) when the C function takes a 32-bit int). The verifier can't catch this: the module is internally consistent, and the host function lives outside it. Treatment: keep signatures in one place (a header) and reference them from both the emitter and the runtime.
  • Mutated Module after JIT addModule. ORC takes ownership; subsequent edits via the stale pointer are undefined.
  • Forgetting InitializeNativeTarget() and InitializeNativeTargetAsmPrinter(): LLJITBuilder().create() will return an error like "no available targets are compatible with this triple". jit.cpp calls these once via std::call_once.

Dumping IR while debugging

mod.print(errs(), nullptr) will dump the module as text to stderr. Add it right before the verifyModule call and you can copy-paste into llc or opt to reproduce a bug outside the JIT. The textual form parses back to an equivalent module, so an emitter bug should reproduce there unchanged.

07 — Where to Grow

cp-17 is a complete dynamic-language JIT in maybe 800 lines. It demonstrates the architecture; it doesn't demonstrate the optimisations that make JITs worth their complexity. Here's the roadmap if you wanted to grow this into a real VM.

1. Interpreter tier

Real JITs don't JIT first. They interpret bytecode until a function gets warm, then JIT. Why? Because compilation is expensive and most code runs once. cp-17 currently JITs everything immediately, paying the LLVM compile cost even for print 42.

Add: a bytecode design (stack-based, small), an interpreter loop in C++, per-function call counters, a threshold (say 1000 calls), and a queue of "functions to compile". The JIT becomes the second tier, not the first.
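The counter, threshold, and queue from that list fit in a few lines. This is a sketch under stated assumptions: TierManager and FunctionState are hypothetical names, functions are identified by name for simplicity, and nothing here actually compiles anything.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>

// Per-function bookkeeping for a two-tier VM.
struct FunctionState {
    std::uint64_t calls  = 0;
    bool          queued = false;  // already waiting for the JIT
};

class TierManager {
public:
    explicit TierManager(std::uint64_t threshold) : threshold_(threshold) {
        assert(threshold > 0);
    }

    // Called by the interpreter on every function entry.
    void onCall(const std::string& fn) {
        FunctionState& st = state_[fn];
        if (++st.calls >= threshold_ && !st.queued) {
            st.queued = true;  // enqueue exactly once
            compileQueue_.push_back(fn);
        }
    }

    // The JIT side pops work from here; returns false when nothing is queued.
    bool nextToCompile(std::string& out) {
        if (compileQueue_.empty()) return false;
        out = compileQueue_.front();
        compileQueue_.pop_front();
        return true;
    }

private:
    std::uint64_t threshold_;
    std::unordered_map<std::string, FunctionState> state_;
    std::deque<std::string> compileQueue_;
};
```

With a threshold of 1000, a function called once (like the body of print 42) never reaches the queue, which is the whole point of interpreting first.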

2. Inline caches

Right now every call goes through full IR with no specialisation. The hook for change is there (ml_record_int_arg proves you can collect type observations), but the IR doesn't use them.

Add: a CallSite struct keyed by (function, bytecode offset). On each call, write the receiver type into a small slot. On recompile, generate code that checks the cached type with a single compare-and-branch ("guard") and then proceeds along the fast path. On guard failure, fall back to the generic dispatch.

That single mechanism — type guard + cached fast path — is most of what makes V8 and LuaJIT fast.
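A one-entry (monomorphic) inline cache can be sketched as follows. Everything here is hypothetical: DynValue, TypeTag, the two handlers, and genericLookup stand in for a VM's value representation and method dispatch, and a real JIT would emit the guard as machine code instead of calling a C++ member function.

```cpp
#include <cassert>
#include <cstdint>

enum class TypeTag : std::uint8_t { None, Int, String };

struct DynValue { TypeTag tag; std::int64_t payload; };

using Handler = std::int64_t (*)(const DynValue&);

// Two type-specific implementations of the same hypothetical operation.
inline std::int64_t negateInt(const DynValue& v)   { return -v.payload; }
inline std::int64_t lengthOfStr(const DynValue& v) { return v.payload; }

// Slow path: full dispatch on the dynamic type.
inline Handler genericLookup(TypeTag t) {
    switch (t) {
        case TypeTag::Int:    return &negateInt;
        case TypeTag::String: return &lengthOfStr;
        default:              return nullptr;
    }
}

// One cache per call site.
struct InlineCache {
    TypeTag       cachedTag     = TypeTag::None;
    Handler       cachedHandler = nullptr;
    std::uint32_t misses        = 0;

    std::int64_t call(const DynValue& v) {
        assert(v.tag != TypeTag::None);
        if (v.tag == cachedTag)        // the guard: one compare-and-branch
            return cachedHandler(v);   // fast path
        ++misses;                      // guard failed: generic dispatch
        cachedTag     = v.tag;
        cachedHandler = genericLookup(v.tag);
        return cachedHandler(v);
    }
};
```

The miss counter is what a tiering policy would watch: a site that keeps missing is polymorphic and not worth specialising for a single type.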

3. Deoptimisation / OSR

Once you have guards, you need to handle guard failures. The optimised frame is laid out differently from the interpreter's stack; on a bailout you must reconstruct the interpreter state from the optimised frame's registers and spills, then resume in the interpreter.

This is on-stack replacement (OSR) in the deopt direction. The OSR-in direction (interpreter → JIT, mid-loop) is also useful: detect a hot loop, JIT it, patch the interpreter to jump into the JIT'd loop with current state.

Both are hard. Both require precise side-tables emitted by the JIT describing how every interpreter value maps to a JIT location at every deopt point.

4. Hidden classes for objects

cp-17 has no objects. When you add them: every object header should point to a shape (V8 calls them "maps", JavaScriptCore "structures") that describes its layout. Two objects with the same key sequence share a shape; adding a key transitions to a new shape, recording the transition.

Why? Because inline caches key on shape, not on dynamic type. A property lookup becomes "load shape pointer; compare to cached shape; if match, load at cached offset; else miss". This is the single most important optimisation for dynamic-OO languages and it falls out naturally from the inline-cache infrastructure above.
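A minimal sketch of shapes with transitions, plus a shape-keyed property cache, is shown below. Shape, Object, and PropertyCache are hypothetical types; the sketch assumes each key is added to an object at most once and that a cache instance serves a single property name, as a per-site cache would.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A shape describes an object layout: which slot each key lives in.
struct Shape {
    std::map<std::string, std::size_t>            offsets;      // key -> slot
    std::map<std::string, std::unique_ptr<Shape>> transitions;  // key -> next shape

    // Adding a key follows the recorded transition, or creates it.
    Shape* withKey(const std::string& key) {
        auto it = transitions.find(key);
        if (it != transitions.end()) return it->second.get();
        auto next = std::make_unique<Shape>();
        next->offsets = offsets;
        next->offsets[key] = offsets.size();  // new key takes the next slot
        Shape* raw = next.get();
        transitions[key] = std::move(next);
        return raw;
    }
};

struct Object {
    Shape*            shape;
    std::vector<long> slots;

    void addProperty(const std::string& key, long value) {
        shape = shape->withKey(key);
        slots.push_back(value);
    }
};

// Property load with a one-entry cache keyed on shape.
struct PropertyCache {
    const Shape* cachedShape  = nullptr;
    std::size_t  cachedOffset = 0;

    long load(const Object& obj, const std::string& key) {
        if (obj.shape != cachedShape) {  // miss: slow lookup, then refill
            cachedShape  = obj.shape;
            cachedOffset = obj.shape->offsets.at(key);
        }
        return obj.slots[cachedOffset];  // hit: load at the cached offset
    }
};
```

Two objects built with the same key sequence end up pointing at the same Shape, so the second load below hits the cache filled by the first.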

5. Garbage collection

The runtime currently leaks every string global into the JIT's data section. A real VM needs:

  • A heap with allocation, marking, and reclamation.
  • GC roots identified in optimised frames (more side-tables from the JIT).
  • Write barriers for generational GC (yet another runtime symbol the JIT must inject around every store-to-heap).

cp-14 (Runtime Systems) showed a tagged-value layout; this is where you'd plug a real collector under it.

6. Concurrency

ORC supports multi-threaded compilation out of the box (LLJIT is thread-safe; that's what ThreadSafeModule is about). A real VM compiles on background threads while the main thread keeps interpreting, then atomically swaps the function entry pointer when the JIT result is ready.

Where this lab leaves you

Concretely, after cp-17 you should be able to:

  • Build an llvm::Module from an AST with IRBuilder (no text).
  • Wire a runtime function into JIT'd code via ORC's symbol API.
  • Emit a runtime callback at any IR point and read its results from C++.
  • Diagnose JIT bugs with verifyModule before they crash.

These are the muscles. The rest — IC, deopt, hidden classes, GC — are combinations of them.

cp-18 — Capstone: An MLIR-Style Compiler Framework

A self-contained, ~700-line reimplementation of MLIR's core ideas — Operations, Regions, Blocks, Values, Types, Attributes, Builders, Passes, and conversion between dialects — with zero LLVM/MLIR dependency. Two demonstration dialects (tiny.* and ll.*), a constant-folder, a DCE pass, and a lowering pass that rewrites tiny.* into ll.*.

The point isn't to use MLIR; it's to understand its architecture by rebuilding the skeleton. After cp-18 you can read MLIR source code and recognise every concept.

Build & test

cd src/cpp
cmake -S . -B build && cmake --build build
./build/tests/test_mlf      # 35/35 checks passed
./build/mlfdriver --tiny -   # parse stdin, print tiny.* IR
./build/mlfdriver --opt  -   # ... after fold+DCE
./build/mlfdriver        -   # ... lowered to ll.*

Example session:

$ echo "let x = 2 * 3 + 1; print x;" | ./build/mlfdriver
"module"() {
  "ll.func"() {sym_name = "main"} {
    %0 = "ll.const"() {value = 7} : () -> (i64)
    "ll.call"(%0) {callee = "ml_print_int"} : (i64) -> ()
    %1 = "ll.const"() {value = 0} : () -> (i64)
    "ll.ret"(%1)
  }
}

Layout

  • src/cpp/src/mlf.{hpp,cpp} — the framework: Op, Region, Block, Value, Builder, walks.
  • src/cpp/src/dialects.{hpp,cpp} — tiny.* and ll.* op constructors.
  • src/cpp/src/passes.{hpp,cpp} — Pipeline + constantFold + DCE.
  • src/cpp/src/lowering.{hpp,cpp} — tiny → ll dialect conversion.
  • src/cpp/src/printer.{hpp,cpp} — MLIR-flavoured IR printer.
  • src/cpp/src/parser.{hpp,cpp} — tiny surface language → tiny.* IR.
  • src/cpp/src/main.cpp — mlfdriver CLI.
  • src/cpp/tests/test_mlf.cpp — 7 tests, 35 checks.
  • steps/01..07.md — narrative.

Mapping to real MLIR

| cp-18 | MLIR equivalent |
| --- | --- |
| mlf::Op | mlir::Operation |
| mlf::Region | mlir::Region |
| mlf::Block | mlir::Block |
| mlf::Value | mlir::Value |
| mlf::Type | mlir::Type |
| mlf::Attribute | mlir::Attribute |
| mlf::Builder | mlir::OpBuilder |
| mlf::pass::Pipeline | mlir::PassManager |
| mlf::convert::lowerTinyToLL | dialect-conversion pass with rewrite patterns |

Tests

  1. Hand-build a module via Builder.
  2. constantFold shrinks 1 + 2 + 3 → 6.
  3. DCE after folding deletes the now-dead literals.
  4. DCE preserves a const used by tiny.print.
  5. Lowering: zero tiny.* ops remain after lowerTinyToLL.
  6. End-to-end: parse → fold → DCE → lower → check the lowered IR.
  7. Parser surfaces a clear error for malformed input.

01 — Why an MLIR-Style Framework?

LLVM IR is one intermediate representation. It works beautifully for languages whose operations map onto C-like primitives — integer arithmetic, memory load/store, function call. It works poorly for anything else:

  • Tensor compilers want matmul, convolution, reduce as first-class ops. Expressing these in LLVM IR loses too much structure to recover later.
  • Hardware DSLs want device-specific ops (gpu.launch, spirv.kernel, nvvm.barrier0) that LLVM IR has no good way to represent.
  • Polyhedral compilers want loop nests, affine maps, and dependency information that LLVM scalar evolution can only partially reconstruct.

The historical answer was: each project invented its own IR (XLA HLO, TensorFlow Graph, Halide, Tiramisu, …) and re-implemented passes, printers, parsers, and verifiers. Every project paid the same tax.

MLIR's answer: make the IR itself extensible. Provide a single skeleton (Operations, Regions, Blocks, Values, Types, Attributes) and let each project plug in custom dialects that define their own ops, types, and conversions. Re-use the printer, the parser, the pass manager, the canonicaliser — the whole infrastructure — across every dialect.

What a dialect is

A dialect is a namespace of ops with custom semantics. Examples in real MLIR:

  • arith.* — integer and float arithmetic.
  • linalg.* — structured linear-algebra ops (matmul, conv).
  • tosa.* — neural-network ops at a higher level.
  • scf.* — structured control flow (if, for, while).
  • memref.* — buffers with strides and offsets.
  • gpu.*, nvvm.*, spirv.* — device dialects.
  • llvm.* — a one-to-one mirror of LLVM IR, used as a lowering sink.

A typical compile looks like a sequence of dialect rewrites:

tosa  →  linalg  →  scf + memref  →  llvm  →  LLVM IR  →  machine code

Each step is a pass. Each pass is built from rewrite patterns. Each pattern matches a small subgraph of ops and replaces it with another small subgraph. Eventually no ops from the source dialect remain, and you've lowered the program one tier closer to hardware.
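The "pass = set of patterns applied until the source dialect is gone" idea can be modelled in miniature. This sketch is not MLIR's pattern driver: MiniOp, Pattern, renamePattern, applyPatterns, and anyFromDialect are hypothetical, ops are reduced to their names, and each pattern rewrites a single op in place where real patterns can replace whole subgraphs.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct MiniOp { std::string name; };

// A pattern inspects one op and rewrites it in place; returns true on a match.
using Pattern = std::function<bool(MiniOp&)>;

// Build a pattern that rewrites ops named `from` into ops named `to`.
inline Pattern renamePattern(std::string from, std::string to) {
    return [from = std::move(from), to = std::move(to)](MiniOp& op) {
        if (op.name != from) return false;
        op.name = to;
        return true;
    };
}

// Apply every pattern to every op until a full sweep changes nothing.
// Returns the number of rewrites performed.
inline int applyPatterns(std::vector<MiniOp>& ops,
                         const std::vector<Pattern>& patterns) {
    int rewrites = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (MiniOp& op : ops)
            for (const Pattern& p : patterns) {
                assert(p && "empty pattern");
                if (p(op)) { ++rewrites; changed = true; }
            }
    }
    return rewrites;
}

// True if any op still belongs to the dialect with the given prefix.
inline bool anyFromDialect(const std::vector<MiniOp>& ops,
                           const std::string& prefix) {
    for (const MiniOp& op : ops)
        if (op.name.compare(0, prefix.size(), prefix) == 0) return true;
    return false;
}
```

The postcondition of a lowering is exactly the anyFromDialect check: once it returns false for the source prefix, the program has moved one tier down.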

What cp-18 reproduces

We model exactly this skeleton at teaching scale: two dialects (tiny.* and ll.*), a constant-folding pass, a dead-code-elimination pass, and a lowering that rewrites tiny.* → ll.*. After it runs, no tiny.* ops remain. That's the MLIR programming model in miniature.

What cp-18 does not reproduce

  • TableGen. Real MLIR generates op classes, builders, verifiers, parsers, and printers from .td files. We hand-write them.
  • Pattern matching. Real MLIR has a pattern-rewriting framework: C++ classes (RewritePattern, OpRewritePattern<T>) plus declarative front ends (DRR in TableGen, PDL). Our "patterns" are if-chains in lowering.cpp: same logic, no sugar.
  • Verification. Real MLIR runs structural verifiers on every op. We trust the builder.
  • Type system depth. Real MLIR types form a class hierarchy and can carry shapes, layouts, and dialects. Ours are strings.

The framework you build here is not a replacement; it's a reading companion. After it, every concept in MLIR has a hook in your head to hang on.

02 — The IR Skeleton: Ops, Regions, Blocks, Values

MLIR's core insight is a uniform IR shape. Every operation — whether it represents a constant, a loop, a function, or a module — has the same structure:

operation:
  name           : string ("dialect.opname")
  operands       : list of Values used as inputs
  results        : list of Values produced as outputs
  attributes     : list of (name, constant) — compile-time metadata
  regions        : list of Regions — nested IR
  parent block   : back-pointer

A Region is a list of Blocks. A Block is a list of Operations plus a list of block arguments (SSA values that flow in at the head of the block). The graph is recursive: regions contain blocks, blocks contain ops, ops contain regions, ad infinitum.

cp-18 implements this directly:

struct Op {
    std::string                          name;
    std::vector<Value*>                  operands;
    std::vector<std::unique_ptr<Value>>  results;
    std::vector<std::unique_ptr<Region>> regions;
    std::vector<NamedAttr>               attrs;
    Block*                               parent = nullptr;
    Op*                                  prev = nullptr;
    Op*                                  next = nullptr;
};

struct Block {
    std::vector<std::unique_ptr<Value>> args;
    Op*                                 first = nullptr;
    Op*                                 last  = nullptr;
    Region*                             parent = nullptr;
};

struct Region {
    std::vector<std::unique_ptr<Block>> blocks;
    std::vector<std::unique_ptr<Op>>    opStore;  // ownership
    Op*                                 parentOp = nullptr;
};

Why a single shape for everything?

Because every algorithm that walks IR can use the same traversal. The constant folder, the DCE pass, the printer, the verifier — none of them need special cases for "is this a function or a basic op". A function is just an op with one region. A loop is an op with one region. A module is an op with one region. Same data structure, same walks.
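The "one traversal for everything" claim is easy to demonstrate on a cut-down version of the structs above. The W-prefixed types and the makeOp/appendChild helpers are hypothetical simplifications (no operands, results, or attributes); only the op/region/block nesting is kept.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct WOp;
struct WBlock  { std::vector<std::unique_ptr<WOp>> ops; };
struct WRegion { std::vector<WBlock> blocks; };
struct WOp {
    std::string          name;
    std::vector<WRegion> regions;
};

// Pre-order walk: the same code visits modules, functions, and leaf ops.
inline void walk(const WOp& op, const std::function<void(const WOp&)>& visit) {
    visit(op);
    for (const WRegion& r : op.regions)
        for (const WBlock& b : r.blocks)
            for (const auto& child : b.ops) {
                assert(child);
                walk(*child, visit);
            }
}

inline std::unique_ptr<WOp> makeOp(std::string name) {
    auto op = std::make_unique<WOp>();
    op->name = std::move(name);
    return op;
}

// Convenience: ensure `parent` has one region with one block, then append.
inline WOp* appendChild(WOp& parent, std::unique_ptr<WOp> child) {
    if (parent.regions.empty()) {
        parent.regions.emplace_back();
        parent.regions.back().blocks.emplace_back();
    }
    auto& ops = parent.regions.back().blocks.back().ops;
    ops.push_back(std::move(child));
    return ops.back().get();
}
```

A printer, a DCE pass, or an op counter differ only in the callback they hand to walk; none of them needs to know whether it is looking at a module or an add.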

Compare with LLVM IR, where Module, Function, BasicBlock, Instruction are four distinct classes with four distinct traversal APIs. Every pass that operates above the instruction level has to hand-roll its own walk over the right level of the hierarchy.

MLIR's nested-op design is a strict generalisation: anything LLVM can express, MLIR can express with llvm.func and llvm.* op-per-instruction. But the reverse isn't true.

Linked-list blocks

cp-18 uses an intrusive linked list of ops inside each block (Op::prev, Op::next). This matters because passes constantly insert and delete ops in the middle of blocks. A vector<unique_ptr<Op>> would invalidate iterators on every mutation and require O(n) shifts.

cp-18's trick: ownership lives in a separate opStore on the parent region (a vector of unique_ptr<Op>). The list pointers in Op form the logical sequence. Erasing an op detaches it from the list but leaves the storage alive until the region dies. That's why Block::eraseOp doesn't actually delete. (Real MLIR differs here: its intrusive list owns the operations, and Operation::erase() destroys the op immediately.)
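The split between logical order (prev/next pointers) and ownership (a separate store) can be sketched like this. LOp and LBlock are hypothetical and fold the store into the block for brevity, whereas cp-18 keeps it on the region.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct LOp {
    std::string name;
    LOp* prev = nullptr;
    LOp* next = nullptr;
};

struct LBlock {
    LOp* first = nullptr;
    LOp* last  = nullptr;
    std::vector<std::unique_ptr<LOp>> store;  // ownership only, not order

    // Allocate an op and link it before `pos` (nullptr means "append").
    LOp* insertBefore(LOp* pos, std::string name) {
        store.push_back(std::make_unique<LOp>());
        LOp* op = store.back().get();
        op->name = std::move(name);
        op->next = pos;
        op->prev = pos ? pos->prev : last;
        (op->prev ? op->prev->next : first) = op;
        (pos ? pos->prev : last) = op;
        return op;
    }

    // Unlink in O(1); the storage stays alive in `store`.
    void eraseOp(LOp* op) {
        assert(op != nullptr);
        (op->prev ? op->prev->next : first) = op->next;
        (op->next ? op->next->prev : last)  = op->prev;
        op->prev = op->next = nullptr;
    }

    // Logical sequence, in list order.
    std::vector<std::string> names() const {
        std::vector<std::string> out;
        for (LOp* op = first; op; op = op->next) out.push_back(op->name);
        return out;
    }
};
```

Because erased ops stay allocated, a pass that still holds a pointer to one can read its name or operands safely; the cost is that memory is only reclaimed when the owner dies.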

Block arguments instead of phi nodes

LLVM IR uses phi instructions at the start of merge blocks:

%x = phi i64 [%a, %B1], [%b, %B2]

MLIR (and cp-18) uses block arguments:

^bb3(%x: i64):
  ...

with predecessors supplying values when they branch in. Functionally equivalent; structurally cleaner. Block-argument indexing matches operand indexing of the branch op, so you never have to scan the block header to match incoming values to predecessors.

cp-18 doesn't yet emit block arguments (the tiny.* dialect has no control flow), but Block::addArg is there for when you extend it.

03 — Builders, Insertion Points, and Op Creation

In LLVM you have IRBuilder<>. In MLIR you have OpBuilder. In cp-18 you have mlf::Builder. All three serve the same purpose: encapsulate the "where am I currently inserting?" cursor so op-construction calls can stay short.

mlf::Builder b;
b.setInsertionPointToEnd(funcBody);
Value* lhs = b.create("tiny.const", {}, {i64Ty()}, {{"value", Attribute::integer(6)}})
             ->result(0);
Value* rhs = b.create("tiny.const", {}, {i64Ty()}, {{"value", Attribute::integer(7)}})
             ->result(0);
b.create("tiny.mul", {lhs, rhs}, {i64Ty()});

Each create allocates an Op, populates its operands/results/attributes, splices it into the current block at the insertion point, and returns a raw pointer (ownership rests with the region's opStore).

Three insertion modes

  • setInsertionPointToStart(block) — prepend new ops.
  • setInsertionPointToEnd(block) — append new ops (the common case).
  • setInsertionPointBefore(op) — insert immediately before a known op (the constant folder uses this to splice the folded const).

Why insertion-point APIs and not "just append"?

Because rewriters need to insert in the middle. The constant folder finds a tiny.add op, computes the folded result, and emits a new tiny.const right before the old add. That fresh const needs to land between the last constant and the add — not at the end of the block.

Builder bld;
bld.setInsertionPointBefore(op);
Op* foldedConst = bld.create("tiny.const", {}, ..., {{"value", ...}});
replaceAllUses(moduleOp, op->result(0), foldedConst->result(0));
op->parent->eraseOp(op);

The four-step pattern (point, create, replace, erase) is the entire shape of rewrite-based optimisation.
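Here is the same rewrite shape as a self-contained constant folder over a toy IR. It is a sketch, not passes.cpp: FOp and FBlock are hypothetical, every op has exactly one result (so an op pointer doubles as its value), and a std::list stands in for the intrusive op list.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>
#include <vector>

struct FOp {
    std::string       name;
    std::vector<FOp*> operands;   // pointers to the defining ops
    std::int64_t      value = 0;  // meaningful for tiny.const only
};

using FBlock = std::list<FOp>;    // std::list keeps element addresses stable

inline void replaceAllUses(FBlock& block, FOp* from, FOp* to) {
    for (FOp& op : block)
        for (FOp*& operand : op.operands)
            if (operand == from) operand = to;
}

// Fold tiny.add / tiny.mul whose operands are both tiny.const.
// Returns the number of ops folded. Dead constants are left for DCE.
inline int constantFold(FBlock& block) {
    int folded = 0;
    for (auto it = block.begin(); it != block.end();) {
        FOp& op = *it;
        const bool arith = op.name == "tiny.add" || op.name == "tiny.mul";
        if (!arith || op.operands.size() != 2 ||
            op.operands[0]->name != "tiny.const" ||
            op.operands[1]->name != "tiny.const") {
            ++it;
            continue;
        }
        const std::int64_t a = op.operands[0]->value;
        const std::int64_t b = op.operands[1]->value;
        const std::int64_t r = op.name == "tiny.add" ? a + b : a * b;
        auto pos = block.insert(it, FOp{"tiny.const", {}, r});  // point + create
        replaceAllUses(block, &op, &*pos);                      // replace
        it = block.erase(it);                                   // erase
        ++folded;
    }
    return folded;
}
```

Because definitions precede uses within the block, a single forward sweep is enough here: folding the multiply turns the later add into a fold candidate before the sweep reaches it.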

SSA name management

Every op result gets a name like %0, %1, %2 from a counter in the builder. The counter is per-builder, so each fresh builder starts its own numbering, which is useful for nested function bodies. The names are only for printing; the IR's identity is the Value* pointer.

Real MLIR goes a step further and stores no names at all: SSA names in textual IR are assigned at print time by an AsmState that walks the op tree handing out fresh names. The in-memory IR uses pointer identity.

Op result vs value

A subtle but important distinction:

  • Op* is the operation — the thing with a name, attributes, regions.
  • Value* is one of its results — what an operand points at.

op->result(0) returns the first result Value. You almost always pass Value* (not Op*) into other ops' operand lists. cp-18's API forces this: create takes vector<Value*> for operands.

What we left out

Real MLIR's OpBuilder also tracks:

  • A Listener for rewrites (so pattern drivers can be notified of changes).
  • A Location attribute attached to every created op (for diagnostics).
  • Result-type inference via op interfaces and traits (InferTypeOpInterface, SameOperandsAndResultType, etc.).

All are nice to have, none change the picture. The core abstraction is the insertion point.

04 — Defining Dialects

A dialect in cp-18 is just a namespace of helpers that construct ops with the right name + signature. There's no class hierarchy, no registration, no inheritance.

namespace mlf::tiny {
Op* constant(Builder& b, int64_t v) {
    return b.create("tiny.const", {}, {i64Ty()},
                    {{"value", Attribute::integer(v)}});
}
Op* add(Builder& b, Value* l, Value* r) {
    return b.create("tiny.add", {l, r}, {i64Ty()});
}
// ...
} // namespace

Three things define an op:

  1. Name — a dotted string "dialect.op". Used by passes to match.
  2. Signature — operand types, result types, region count.
  3. Attributes — name → constant. value for tiny.const, sym_name for tiny.func, etc.

That's it. A dialect is a contract about what those three things mean.

Why two dialects?

cp-18 ships tiny.* (high-level, source-aware) and ll.* (low-level, LLVM-ish). The reason for the split is the same reason real MLIR has dozens of built-in dialects: different passes want different abstractions.

  • On tiny.* we can run constant folding trivially — operands of tiny.add either come from tiny.const or they don't. No interleaved loads/stores, no aliasing, no ABI quirks.
  • On ll.* we'd run register allocation, calling-convention rewrites, memory-layout passes — all things that need to know about the lowered representation.

If you tried to do both at the same level, you'd have one giant dialect where every pass needs if (op.name == "tiny.add" || op.name == "ll.add") ... checks. Splitting cleanly separates concerns.

Dialects as a contract

When passes.cpp writes:

if (op->name == "tiny.add" || op->name == "tiny.mul") { ... }

it's relying on the contract that those op names always mean what dialects.cpp says they mean. If someone adds a tiny.add with two results, or with a side effect, that contract breaks and the fold pass becomes a miscompiler.

Real MLIR codifies this with op interfaces and traits: a pass declares "I match anything implementing BinaryOp", and the trait system guarantees the matched op has the expected structure. cp-18 trusts the dialect-helper API as the contract.

How would you add a new dialect?

Pick a name (e.g. tensor.*), decide on op signatures, write helpers in the namespace. That's the user-facing work. The framework requires no changes: the printer, the walks, the rewrites all operate on opaque Op objects.

In real MLIR you'd also subclass Dialect, register your ops, write verifiers, generate them from TableGen, etc. The architectural shape is the same as cp-18; the production scaffolding is heavier.

Function ops vs basic ops

tiny.func is a region-carrying op: it has one region containing the function body. Same for ll.func. Notice how this is just an op — no special "Function" class in the IR. That's MLIR's design choice: at the IR level a function isn't fundamentally different from an scf.for or an scf.if. They all carry regions; they all participate in the same walks; they all live in the same pass manager.

The implication: you can put a function inside another op. Closures, nested function definitions, module-of-modules — none require special casing. Real MLIR exploits this constantly (e.g. gpu.module contains gpu.func).
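The "functions are just ops with regions" idea can be sketched with a recursive walker. The types below are hypothetical stand-ins (each region is flattened to a single list of ops, and `addChild` is an invented helper); the point is only that one generic walk reaches a function nested inside another function without any special casing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical mini-IR for illustration

struct Op {
    std::string name;
    // Each region is simplified to one list of ops (no separate Block type).
    std::vector<std::vector<std::unique_ptr<Op>>> regions;
};

// Append a new op to the given region of `parent`, growing the region list if needed.
inline Op* addChild(Op& parent, std::size_t region, std::string name) {
    if (parent.regions.size() <= region) parent.regions.resize(region + 1);
    parent.regions[region].push_back(std::make_unique<Op>());
    Op* child = parent.regions[region].back().get();
    child->name = std::move(name);
    return child;
}

// Pre-order walk: visit the op, then everything nested in its regions.
inline void walk(const Op& op,
                 const std::function<void(const Op&, int)>& visit,
                 int depth = 0) {
    visit(op, depth);
    for (const auto& region : op.regions)
        for (const auto& child : region)
            walk(*child, visit, depth + 1);
}

} // namespace sketch
```

A module containing a tiny.func that itself contains another tiny.func is traversed by the same `walk` call; the nesting depth is the only thing that changes.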

05 — Pass Infrastructure and the Canonicalisation Idea

A pass in cp-18 is just a function:

using Pass = std::function<bool(Op& moduleOp)>;

It mutates the module and returns true if anything changed. A Pipeline runs them in sequence:

mlf::pass::Pipeline pipe;
pipe.add("constant-fold", mlf::pass::constantFold);
pipe.add("dce",           mlf::pass::deadCodeElimination);
pipe.run(moduleOp);

That's the entire model. Real MLIR's PassManager adds threading, caching of analyses, scheduling at different nesting levels (module pass vs function pass vs op pass), pass options, statistics, and pipeline specification via textual config. None of those change the core idea: passes are functions, pipelines compose them.
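As a rough illustration of "passes are functions, pipelines compose them", here is a self-contained sketch. The `Module` stand-in (a vector of integers) and the two toy passes are invented for the example; only the `Pass` signature and the `add`/`run` shape follow the text above, and returning the names of the passes that changed something is an extra added here for observability.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical stand-ins, not cp-18's real classes

struct Module { std::vector<int> data; };        // stand-in for the module op

using Pass = std::function<bool(Module&)>;       // returns true if it changed anything

class Pipeline {
public:
    void add(std::string name, Pass pass) {
        passes_.emplace_back(std::move(name), std::move(pass));
    }
    // Run each pass once, in order; report which passes changed the module.
    std::vector<std::string> run(Module& m) const {
        std::vector<std::string> changed;
        for (const auto& [name, pass] : passes_)
            if (pass(m)) changed.push_back(name);
        return changed;
    }
private:
    std::vector<std::pair<std::string, Pass>> passes_;
};

// Toy pass 1: remove every zero.
inline bool dropZeros(Module& m) {
    auto before = m.data.size();
    m.data.erase(std::remove(m.data.begin(), m.data.end(), 0), m.data.end());
    return m.data.size() != before;
}

// Toy pass 2: collapse runs of equal adjacent elements.
inline bool dedupAdjacent(Module& m) {
    auto before = m.data.size();
    m.data.erase(std::unique(m.data.begin(), m.data.end()), m.data.end());
    return m.data.size() != before;
}

} // namespace sketch
```

The second pass only finds work because the first one ran, which is the same ordering effect you get between constant folding and DCE.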

Constant folding as a model rewrite

constantFold in cp-18 demonstrates the universal rewrite pattern:

for (Op* op in candidates):
    if op matches a known shape:                 // ← pattern matcher
        compute folded value                     // ← semantic step
        Builder b; b.setInsertionPointBefore(op);
        Op* replacement = b.create("...", ...);  // ← rewriter
        replaceAllUses(root, op->result(0), replacement->result(0));
        op->parent->eraseOp(op);                 // ← cleanup
        restart  // because the IR shape changed under us

Real MLIR formalises this as RewritePattern. You write a class with match and rewrite methods, register it, and a driver (applyPatternsAndFoldGreedily) handles the fixpoint loop. The driver solves three problems cp-18 punts on:

  1. Termination. Fixpoint iteration can loop if patterns keep undoing each other. MLIR's greedy driver caps the number of iterations (and, optionally, the number of rewrites) and reports a failure to converge when the cap is hit.
  2. Benefit-based selection. When multiple patterns match, MLIR tries the one with the highest declared PatternBenefit first. cp-18 only has one folding rule, so no ambiguity.
  3. Worklist management. Newly inserted ops should be revisited. MLIR keeps a worklist; cp-18 restarts the whole walk after each change (correct but quadratic).
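The restart-after-every-change strategy can be sketched as follows. This is an illustrative reimplementation over a hypothetical flat block of single-result ops, not cp-18's constantFold; the `emit` helper and the `value` field are assumptions, and integer overflow is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical mini-IR for illustration

struct Op;
struct Value { Op* def = nullptr; };

struct Op {
    std::string name;
    std::vector<Value*> operands;
    std::int64_t value = 0;        // meaningful only for "tiny.const"
    Value result;
};

using Block = std::vector<std::unique_ptr<Op>>;

// Create an op and insert it at index `at`.
inline Op* emit(Block& b, std::size_t at, std::string name,
                std::vector<Value*> operands, std::int64_t value = 0) {
    auto op = std::make_unique<Op>();
    op->name = std::move(name);
    op->operands = std::move(operands);
    op->value = value;
    op->result.def = op.get();
    Op* raw = op.get();
    b.insert(b.begin() + static_cast<std::ptrdiff_t>(at), std::move(op));
    return raw;
}

// One rewrite step: fold the first add/mul whose operands are both constants.
inline bool foldOne(Block& b) {
    for (std::size_t i = 0; i < b.size(); ++i) {
        Op* op = b[i].get();
        if (op->name != "tiny.add" && op->name != "tiny.mul") continue;   // match
        Op* l = op->operands[0]->def;
        Op* r = op->operands[1]->def;
        if (l->name != "tiny.const" || r->name != "tiny.const") continue;
        std::int64_t v = op->name == "tiny.add" ? l->value + r->value
                                                : l->value * r->value;    // semantic step
        Op* folded = emit(b, i, "tiny.const", {}, v);                     // create before op
        for (auto& user : b)                                              // replace uses
            for (Value*& operand : user->operands)
                if (operand == &op->result) operand = &folded->result;
        b.erase(b.begin() + static_cast<std::ptrdiff_t>(i) + 1);          // erase old op
        return true;
    }
    return false;
}

// Restart the whole scan after every change: correct, but quadratic.
inline int constantFold(Block& b) {
    int rewrites = 0;
    while (foldOne(b)) ++rewrites;
    return rewrites;
}

} // namespace sketch
```

For (2 + 3) * 4 the first scan folds the add, and only the restarted scan can then fold the mul, because its left operand became a constant in the previous step. A worklist would revisit just the users of the new const instead of rescanning everything.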

DCE as a model "delete dead things" pass

bool deadCodeElimination(Op& moduleOp) {
    // 1. Find all values that ARE used somewhere.
    // 2. Find a pure op whose results aren't in that set.
    // 3. Delete it. Repeat.
}

The crucial concept: purity. We hard-code the set of pure ops:

static bool isPure(const Op& op) {
    return op.name == "tiny.const" || op.name == "tiny.add"
        || op.name == "tiny.mul";   // ...and so on for the other pure ops
}

Real MLIR uses the MemoryEffectOpInterface: an op declares the read/write/allocate/free effects it has on memory. DCE removes an op only if it has no effects (or only MemoryEffects::Read from immutable storage). This generalises to any dialect without changing the DCE pass.

cp-18 hard-codes the set because we don't have an interface system. Same algorithm, less elegant.
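The three commented steps in the DCE skeleton can be filled in roughly like this. The flat `Block`, the single-result `Op`, and the `emit` helper are assumptions for the sketch; the hard-coded purity list follows the one shown above.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical mini-IR for illustration

struct Value {};

struct Op {
    std::string name;
    std::vector<Value*> operands;
    Value result;
};

using Block = std::vector<std::unique_ptr<Op>>;

inline Op* emit(Block& b, std::string name, std::vector<Value*> operands = {}) {
    b.push_back(std::make_unique<Op>(Op{std::move(name), std::move(operands), {}}));
    return b.back().get();
}

inline bool isPure(const Op& op) {
    return op.name == "tiny.const" || op.name == "tiny.add" || op.name == "tiny.mul";
}

// Delete pure ops whose result is never used; repeat until nothing changes.
inline bool deadCodeElimination(Block& b) {
    bool changedAny = false;
    for (bool changed = true; changed;) {
        changed = false;
        std::unordered_set<const Value*> used;               // 1. collect used values
        for (const auto& op : b)
            for (const Value* v : op->operands) used.insert(v);
        for (auto it = b.begin(); it != b.end(); ++it) {     // 2. find a dead pure op
            if (isPure(**it) && used.count(&(*it)->result) == 0) {
                b.erase(it);                                 // 3. delete it, then repeat
                changed = changedAny = true;
                break;
            }
        }
    }
    return changedAny;
}

} // namespace sketch
```

The repeat matters: deleting an unused add makes its constant operands dead too, but only on the next round, once the used-set has been recomputed. The impure tiny.print is never a candidate, so everything it depends on survives.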

Why a single pipeline, not many "phases"?

Because IRs in this design are monomorphic: every op is just mlf::Op. Passes can be composed freely — fold then DCE then fold again then convert — without serialising to text and re-parsing. Real MLIR does the same: you build a single PassManager and run dozens of passes in sequence, all operating on the same in-memory module.

Compare with classic LLVM, where each pass is a different FunctionPass/ModulePass subclass, ordering is managed by a Pass Manager whose API is large and historical, and pipeline specification (e.g. -O2) is hard-coded in C++ rather than declared in text. MLIR's PassManager is a cleaner take on the same idea.

06 — Lowering Between Dialects

Lowering is the act of rewriting ops from one dialect into ops from another. Conceptually it's another pass; structurally it's special because the input and output dialects differ. cp-18 implements one lowering: tiny.* → ll.*.

// Pseudocode: the loops and the string dispatch are schematic
// (C++ cannot switch on strings; in real code this is an if/else chain on op.name).
std::unique_ptr<Op> lowerTinyToLL(Op& tinyModule) {
    auto newMod = makeModule();
    Builder b;
    std::unordered_map<Value*, Value*> valueMap;   // source value → target value

    Block* topB = newMod->region(0)->entry();
    b.setInsertionPointToEnd(topB);

    for each tiny.func in tinyModule:
        Op* newFunc = ll::func(b, fname);
        Builder bf;                                // separate builder for the function body
        bf.setInsertionPointToEnd(newFunc->region(0)->entry());

        for each op in tinyFunc.body:              // old = op's result, lhs/rhs = its operands
            switch (op.name) {
                case "tiny.const":   valueMap[old] = ll::constant(bf, value)->result(0);        break;
                case "tiny.add":     valueMap[old] = ll::add(bf, map(lhs), map(rhs))->result(0); break;
                case "tiny.mul":     valueMap[old] = ll::mul(bf, map(lhs), map(rhs))->result(0); break;
                case "tiny.print":   ll::call(bf, "ml_print_int", map(lhs));                     break;
                case "tiny.return":  ll::ret(bf, map(lhs));                                      break;
            }

    return newMod;
}

The pattern: for each source op, emit one or more target ops and record the value mapping. When a later source op uses an SSA result, look up its replacement in the map.

Why the value map?

Because we can't reuse the source ops' Value* in the target IR — they belong to the old module which is about to die (or persist independently). Every source value needs a corresponding fresh target value.

This map is the heart of dialect conversion. In real MLIR ConversionPatternRewriter maintains it implicitly: when a pattern matches and emits replacement ops, the rewriter records the value mapping and rewires uses automatically. cp-18 maintains it explicitly because it's clearer pedagogically.
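Here is a compilable sketch of the explicit value map at work, assuming a flat block of single-result ops. The `sketch` types, the `emit` helper, and the target op name "ll.ret" are assumptions made for the example; the other op names and the "ml_print_int" callee follow the pseudocode above.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical mini-IR for illustration

struct Value {};

struct Op {
    std::string name;
    std::vector<Value*> operands;
    std::int64_t value = 0;     // "value" attribute of a const
    std::string callee;         // "callee" attribute of a call
    Value result;
};

using Block = std::vector<std::unique_ptr<Op>>;

inline Op* emit(Block& b, std::string name, std::vector<Value*> operands = {},
                std::int64_t value = 0, std::string callee = {}) {
    auto op = std::make_unique<Op>();
    op->name = std::move(name);
    op->operands = std::move(operands);
    op->value = value;
    op->callee = std::move(callee);
    b.push_back(std::move(op));
    return b.back().get();
}

// Rebuild `src` in the ll.* dialect; valueMap translates old results to new ones.
inline Block lowerTinyToLL(const Block& src) {
    Block dst;
    std::unordered_map<const Value*, Value*> valueMap;
    auto mapped = [&](const Op& op) {
        std::vector<Value*> out;
        for (const Value* v : op.operands) out.push_back(valueMap.at(v));
        return out;
    };
    for (const auto& op : src) {
        Op* lowered = nullptr;
        if (op->name == "tiny.const")       lowered = emit(dst, "ll.const", {}, op->value);
        else if (op->name == "tiny.add")    lowered = emit(dst, "ll.add", mapped(*op));
        else if (op->name == "tiny.mul")    lowered = emit(dst, "ll.mul", mapped(*op));
        else if (op->name == "tiny.print")  lowered = emit(dst, "ll.call", mapped(*op), 0, "ml_print_int");
        else if (op->name == "tiny.return") lowered = emit(dst, "ll.ret", mapped(*op));
        else throw std::runtime_error("no lowering for " + op->name);
        valueMap[&op->result] = &lowered->result;   // later ops look their operands up here
    }
    return dst;
}

} // namespace sketch
```

No operand in the lowered block ever points back into the source block: every use goes through `valueMap.at`, which also fails loudly if a value is used before its definition has been lowered.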

One-to-one vs one-to-many

tiny.const → ll.const is one-to-one. tiny.print(x) → ll.call(x) {callee = "ml_print_int"} is also one-to-one but with name rewriting. A more interesting case: a single linalg.matmul lowers to a nest of scf.for loops whose body uses arith.mulf and arith.addf, so one op blows up into a dozen. cp-18 doesn't show one-to-many because tiny.* is too simple, but the framework supports it (just create more ops in the case branch).

Partial vs full conversion

  • Full conversion insists no source-dialect ops remain. cp-18's test #5 verifies this: CHECK_NOT_CONTAINS(ir, "tiny.").
  • Partial conversion lowers some ops, leaves others. Useful when a dialect contains a mix of low-level and high-level concerns.

Our lowerTinyToLL is full: every tiny.* op has a case in the switch. Real MLIR's applyFullConversion will fail loudly if any unconverted source ops survive; applyPartialConversion will leave them in place.

Where the framework matters

Notice what didn't change for the lowering: the printer, the walks, the passes. The ll-dialect module prints with the same printer, walks with the same walker. The DCE pass works on ll.* ops because isPure lists ll.const/add/mul. The constant folder, however, doesn't recognise ll.const + ll.const → ll.const — by design. If you want post-lowering folding, you add another pass that matches ll.* ops.

That separation — generic infrastructure, dialect-specific patterns — is exactly what makes MLIR scale to dozens of dialects.

Lowering chains

cp-18 has one lowering step. Real MLIR pipelines often chain several:

tosa → linalg → scf+memref → llvm → LLVM IR

Each step is implemented as a set of patterns + a populate*Patterns function + a target spec declaring which ops are "legal" in the output. The shape of each step is identical to lowerTinyToLL: walk the input, emit the output, thread the value map.

07 — Where to Grow

cp-18 reproduces MLIR's shape. If you wanted to grow it into something genuinely useful, here's the roadmap.

1. Verifiers

Right now any code that builds an ill-formed op succeeds silently. The first quality-of-life upgrade is a verifyOp(Op&) function that checks:

  • operand and result counts match the dialect spec,
  • operand and result types match,
  • required attributes are present and the right kind,
  • region count is right,
  • terminator ops are present at end of every block.

Real MLIR generates these from TableGen; you can hand-write them per dialect. Run after every pass; refuse to print invalid IR.
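A hand-written verifier along these lines might look like the sketch below. The spec table, the count-only `Op` summary (operand and result counts instead of real values), and the diagnostic strings are all assumptions for illustration; type checks and region counts are left out to keep it short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

namespace sketch {  // hypothetical types for illustration

struct Op {
    std::string name;
    std::size_t numOperands = 0;
    std::size_t numResults = 0;
    std::vector<std::string> attrs;     // names of the attributes present
};

struct OpSpec {
    std::size_t operands;
    std::size_t results;
    std::vector<std::string> requiredAttrs;
    bool terminator;
};

// Assumed dialect spec; a real one would be written alongside the dialect helpers.
inline const std::map<std::string, OpSpec>& specs() {
    static const std::map<std::string, OpSpec> table = {
        {"tiny.const",  {0, 1, {"value"}, false}},
        {"tiny.add",    {2, 1, {}, false}},
        {"tiny.return", {1, 0, {}, true}},
    };
    return table;
}

// Returns an empty string when the op is well-formed, else a diagnostic.
inline std::string verifyOp(const Op& op) {
    auto it = specs().find(op.name);
    if (it == specs().end()) return "unknown op " + op.name;
    const OpSpec& s = it->second;
    if (op.numOperands != s.operands) return op.name + ": wrong operand count";
    if (op.numResults != s.results)   return op.name + ": wrong result count";
    for (const std::string& a : s.requiredAttrs)
        if (std::find(op.attrs.begin(), op.attrs.end(), a) == op.attrs.end())
            return op.name + ": missing attribute " + a;
    return "";
}

// Every op must verify, and the last op must be a terminator.
inline std::string verifyBlock(const std::vector<Op>& block) {
    for (const Op& op : block) {
        std::string err = verifyOp(op);
        if (!err.empty()) return err;
    }
    if (block.empty() || !specs().at(block.back().name).terminator)
        return "block does not end in a terminator";
    return "";
}

} // namespace sketch
```

Running `verifyBlock` after each pass in the pipeline turns a silent miscompile into an immediate, attributable failure.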

2. Op interfaces / traits

Hard-coded if (op.name == "tiny.add" || op.name == "tiny.mul") doesn't scale past a handful of ops. Replace with a trait system:

struct BinaryOp { Value* lhs(Op& o); Value* rhs(Op& o); };
bool implementsBinary(const std::string& name);

Then folders and DCE check implementsBinary(op.name) rather than naming specific ops. New ops opt into the trait by registering with the system. This is MLIR's OpInterface mechanism in skeleton form.
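One possible shape for such a registry is sketched below. It goes slightly beyond the two declarations above by letting each op register its fold function together with its trait membership; `registerBinary`, `tryFold`, and the `FoldFn` alias are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

namespace sketch {  // hypothetical trait registry for illustration

using FoldFn = std::function<std::int64_t(std::int64_t, std::int64_t)>;

// Ops that opted into the "binary, foldable" trait, keyed by op name.
inline std::unordered_map<std::string, FoldFn>& binaryOps() {
    static std::unordered_map<std::string, FoldFn> registry;
    return registry;
}

// A dialect calls this once per op to opt in.
inline void registerBinary(std::string name, FoldFn fold) {
    binaryOps()[std::move(name)] = std::move(fold);
}

inline bool implementsBinary(const std::string& name) {
    return binaryOps().count(name) != 0;
}

// The folder asks "is this a binary op?" instead of comparing names.
inline std::optional<std::int64_t> tryFold(const std::string& name,
                                           std::int64_t lhs, std::int64_t rhs) {
    if (!implementsBinary(name)) return std::nullopt;
    return binaryOps().at(name)(lhs, rhs);
}

} // namespace sketch
```

With this in place, adding a new foldable op means one `registerBinary` call in the dialect; the folder itself does not change.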

3. Pattern DSL

Switch from hand-written if/switch to declarative patterns:

addPattern<BinaryOpPattern<"tiny.add">>(folder);

The base class encapsulates the match + create-replacement + replace-uses + erase boilerplate. Patterns become 5-line specs of "what to match" and "what to emit". This is RewritePattern in MLIR.

4. Real type system

Replace Type { string name } with a tagged union or class hierarchy:

struct Type { enum class Kind { Int, Float, Tensor, Function, ... }; ... };
struct TensorType : Type { Type elemType; vector<int64_t> shape; };

Then types can be compared structurally and dialects can demand specific type shapes. Shape inference becomes possible: an op like tensor.matmul : tensor<MxKxf32> × tensor<KxNxf32> → tensor<MxNxf32> can verify and propagate shapes.
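As a small illustration of what structural types buy you, the sketch below compares tensor types field by field and infers the result shape of a matmul. `Elem`, `TensorType`, and `inferMatmul` are hypothetical names; only static two-dimensional shapes are handled.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

namespace sketch {  // hypothetical structural types for illustration

enum class Elem { I64, F32 };

struct TensorType {
    Elem elem;
    std::vector<std::int64_t> shape;
    // Structural equality: same element type and same shape.
    bool operator==(const TensorType& o) const {
        return elem == o.elem && shape == o.shape;
    }
};

// (M x K) times (K x N) gives (M x N); nullopt when the operands are incompatible.
inline std::optional<TensorType> inferMatmul(const TensorType& a, const TensorType& b) {
    if (a.elem != b.elem) return std::nullopt;
    if (a.shape.size() != 2 || b.shape.size() != 2) return std::nullopt;
    if (a.shape[1] != b.shape[0]) return std::nullopt;      // inner dimensions must agree
    return TensorType{a.elem, {a.shape[0], b.shape[1]}};
}

} // namespace sketch
```

The same function serves both as a verifier (reject when it returns nullopt) and as shape propagation (use the returned type for the op's result).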

5. Conversion framework

Generalise lowerTinyToLL into:

struct TypeConverter { Type convert(Type src); };
struct ConversionPattern { virtual bool match(Op&) = 0; virtual void rewrite(Op&, ...) = 0; };
void applyFullConversion(Op& root, vector<ConversionPattern*>, TargetSpec);

Where TargetSpec declares which ops are "legal" in the output. Patterns plug in modularly. This mirrors MLIR's dialect conversion framework (mlir::ConversionTarget, mlir::ConversionPattern, mlir::TypeConverter).
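A stripped-down sketch of the legality-driven driver, under several simplifying assumptions: ops are reduced to their names, value remapping and type conversion are omitted, signatures differ from the declarations above (const references, a bool result), and the `Rename` pattern is invented for the example.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical conversion framework for illustration

struct Op { std::string name; };

// Which dialects may appear in the output.
struct TargetSpec {
    std::unordered_set<std::string> legalDialects;
    bool isLegal(const Op& op) const {
        return legalDialects.count(op.name.substr(0, op.name.find('.'))) != 0;
    }
};

struct ConversionPattern {
    virtual ~ConversionPattern() = default;
    virtual bool match(const Op& op) const = 0;
    virtual std::vector<Op> rewrite(const Op& op) const = 0;   // may return several ops
};

// Simplest possible pattern: a one-to-one rename.
struct Rename : ConversionPattern {
    std::string from, to;
    Rename(std::string f, std::string t) : from(std::move(f)), to(std::move(t)) {}
    bool match(const Op& op) const override { return op.name == from; }
    std::vector<Op> rewrite(const Op&) const override { return {Op{to}}; }
};

// Full conversion: fail (leaving the input untouched) if any illegal op survives.
inline bool applyFullConversion(std::vector<Op>& ops,
                                const std::vector<const ConversionPattern*>& patterns,
                                const TargetSpec& target) {
    std::vector<Op> out;
    for (const Op& op : ops) {
        if (target.isLegal(op)) { out.push_back(op); continue; }
        const ConversionPattern* hit = nullptr;
        for (const ConversionPattern* p : patterns)
            if (p->match(op)) { hit = p; break; }
        if (!hit) return false;                       // illegal op with no pattern
        for (Op& repl : hit->rewrite(op)) out.push_back(std::move(repl));
    }
    for (const Op& op : out)
        if (!target.isLegal(op)) return false;        // a pattern emitted an illegal op
    ops = std::move(out);
    return true;
}

} // namespace sketch
```

A partial-conversion variant would differ only in the "no pattern" branch: copy the op through instead of failing.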

6. A useful dialect: tensors

The natural next thing to model is tensor.*:

  • tensor.const : tensor<NxNxf32> — a constant tensor with shape.
  • tensor.add : (tensor, tensor) -> tensor — elementwise.
  • tensor.matmul.
  • Conversion to a loop dialect (scf.for + memref.store/load).
  • Conversion of the loop dialect to ll.*.

That's essentially MLIR's Toy tutorial, redone in your own framework: three dialects and two lowering steps that demonstrate the whole stack, from high-level algebraic IR through a loop nest to low-level CPU code.

7. Plug into real LLVM

If you wire the final ll.* dialect to actually emit llvm::IRBuilder calls (the way cp-17's ir_emit.cpp does), you have a complete frontend: surface language → tiny → ll → LLVM IR → JIT or native code.

At that point you're a short step away from your own domain-specific compiler. The IRBuilder bridge is the same code as cp-17, with op-name dispatch driving it.

Where this lab leaves you

You can read MLIR source code, recognise its idioms, and understand why an op-centric, region-carrying, dialect-extensible IR was the right answer for modern compiler stacks. You can also build your own project-internal IR with this shape when LLVM IR is too low-level for your problem domain — which, for any compiler targeting ML, hardware design, or DSLs, is essentially always.

Phases & Labs

This curriculum has 9 teaching phases and 18 labs, the last 3 of which are capstone projects. Labs build on each other, but Phase 5 (LLVM) and Phase 7 (MLIR) can be tackled in either order after Phase 4; Phase 6 (JIT) builds on Phase 5.

Legend: ✅ complete · 🟡 scaffolded · ⬜ planned


Phase 1 — Frontend Foundations

Before you can compile, you must convert source text into a structured tree. This phase teaches lexing, parsing, AST design, and tree-walk interpretation.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-01 | Environment Setup & Toolchain | | Clang vs Apple Clang, target triples, LLVM toolchain, Mach-O vs ELF, CMake, llvm-config |
| cp-02 | Arithmetic Evaluator | | Tokens, recursive descent, EBNF, precedence vs grammar nesting, associativity, AST + Visitor, post-order eval |
| cp-03 | MiniLang v0 Frontend | 🟡 | Pratt parsing, statements vs expressions, blocks, functions, closures, REPL state, tree-walk interpreter |

Phase 2 — Static Semantics

A type-checked language with scoped variables is the foundation of every real frontend. This phase makes MiniLang reject invalid programs before they ever run.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-04 | Symbol Tables & Scoping | 🟡 | Lexical scoping, scope stacks, closure capture, name resolution, shadowing, two-pass resolution |
| cp-05 | Static Type System (MiniLang v1) | 🟡 | Hindley-Milner basics, type environments, monomorphic types, structural vs nominal typing, diagnostics |

Phase 3 — Bytecode Virtual Machines

Tree-walkers are slow because of pointer-chasing and virtual dispatch. Bytecode VMs are how CPython, the JVM, V8 (Ignition), and Lua get well beyond tree-walker throughput, often by several times.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-06 | Bytecode Design & Compiler | 🟡 | Stack-based vs register-based VMs, opcode encoding, constant pools, AST → bytecode lowering, disassembler |
| cp-07 | Stack VM Execution (MiniLang v2) | 🟡 | Computed-goto dispatch, frame layout, call/return, switch-vs-direct-threading, ICache effects |

Phase 4 — Compiler Middle-End (IR & Optimization)

Every production compiler has a middle-end: AST → IR → optimized IR → backend. This phase introduces SSA, the CFG, and classical optimization passes.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-08 | Three-Address IR & CFG | 🟡 | TAC representation, basic blocks, control-flow graph, dominators, immediate dominator computation |
| cp-09 | SSA & Optimization Passes (MiniLang v3) | 🟡 | φ-nodes, SSA construction (Cytron's algorithm), constant folding, DCE, mem2reg, pass manager |

Phase 5 — LLVM Backend (Industry Core)

LLVM is the compiler infrastructure for Clang, Swift, Rust, Julia, Mojo, and dozens of others. This phase teaches you to generate LLVM IR, run its optimizer, and produce native binaries.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-10 | LLVM IR Fundamentals | 🟡 | Module / Function / BasicBlock / Instruction hierarchy, IRBuilder, types, attributes, intrinsics |
| cp-11 | LLVM Codegen (MiniLang++) | 🟡 | AST → LLVM IR, control-flow IR patterns, calling conventions, opt pipelines, llc, native linking on macOS |

Phase 6 — JIT Compilation (LLVM ORC)

JITs make dynamic languages fast (V8, LuaJIT, HotSpot). LLVM's ORC v2 API is the industrial-strength way to embed a JIT into your runtime.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-12 | ORC JIT Runtime | 🟡 | ORC v2 layers, lazy compilation, symbol resolution, function caching, hot-path materialization |

Phase 7 — MLIR (Multi-Level IR)

MLIR is the next-generation compiler infrastructure powering TensorFlow XLA, IREE, Mojo, and Triton. This phase teaches dialect design and progressive lowering.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-13 | MiniLang MLIR Dialect & Lowering | 🟡 | Operations / Types / Dialects, TableGen, rewrite patterns, ConversionTarget, lowering to LLVM dialect |

Phase 8 — Runtime Systems

A language is more than a compiler — it needs a runtime: stack frames, a heap, a GC, and an FFI.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-14 | Stack, Heap, GC, FFI | 🟡 | Calling conventions (System V vs Apple ARM64), object headers, mark-sweep GC, root-set scanning, C FFI |

Phase 9 — Tooling

Production compilers live or die by their error messages and tooling.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| cp-15 | Diagnostics, Modules, CLI | 🟡 | Source spans, fix-it hints (Clang-style), module loader, dependency graph, CLI driver design |

Capstones

| Lab | Title | Status | Demonstrates |
|---|---|---|---|
| cp-16 | MiniLang Compiler Suite | 🟡 | End-to-end: interpreter + VM + LLVM backend in one toolchain |
| cp-17 | JIT-Accelerated Dynamic Language | 🟡 | Python-like subset, ORC JIT, runtime specialization |
| cp-18 | MLIR-Style Compiler Framework | 🟡 | Plugin dialect registry, multi-level lowering, custom passes |

Suggested Pace

  • Full-time learner: ~2 labs per week ⇒ ~9 weeks end-to-end.
  • Side-project learner: ~1 lab per 1–2 weeks ⇒ ~4–8 months.
  • Concept-only path: skim CONCEPTS.md + docs/analysis.md per lab ⇒ ~1 week to absorb the field.

Dependency Graph

Phase 1 (cp-01, cp-02, cp-03)  ── MANDATORY, in order
   │
   └─→ Phase 2 (cp-04, cp-05)  ── MANDATORY (frontends pile up)
          │
          ├─→ Phase 3 (cp-06, cp-07)  ── VM track
          │
          └─→ Phase 4 (cp-08, cp-09)  ── IR track ── MANDATORY before Phase 5/6/7
                 │
                 ├─→ Phase 5 (cp-10, cp-11)  ── LLVM backend
                 │       │
                 │       └─→ Phase 6 (cp-12)  ── JIT (needs LLVM)
                 │
                 └─→ Phase 7 (cp-13)  ── MLIR (parallel to LLVM)
                        │
                        └─→ Phase 8 (cp-14)  ── Runtime
                               │
                               └─→ Phase 9 (cp-15)  ── Tooling
                                      │
                                      └─→ Capstones (cp-16 / 17 / 18)

Phase 3 (Bytecode VM) and Phase 4 (IR) are independent, so pick whichever excites you first. Phases 5, 6, and 7 are each a serious commitment; start with the one most relevant to your career goals (LLVM = static compilers, JIT = dynamic languages, MLIR = ML compilers / DSLs), keeping in mind that the JIT phase builds on the LLVM phase.

Tools & Toolchain

This curriculum is C++-only (Track B — LLVM Core). All labs target macOS (Apple Silicon verified) and are portable to Linux with trivial flag changes (noted per-lab).

The full setup, version verification, and why each tool exists is taught in cp-01-environment-setup/. Do that lab first — even if you already have the tools installed — because it teaches concepts (target triples, sysroots, ELF vs Mach-O, Apple Clang vs upstream LLVM) that you'll need for every subsequent phase.

Required Tools

| Tool | Minimum Version | Where It's Used | Install |
|---|---|---|---|
| Xcode Command Line Tools | (any current) | C++ compiler, linker, system headers (phases 1–4) | xcode-select --install |
| CMake | 3.20+ | Build system for every lab | brew install cmake |
| Ninja | 1.10+ | Fast parallel builder (used in phases 5+) | brew install ninja |
| Homebrew LLVM | 18+ | Full LLVM with headers, libraries, mlir-opt, llc, lli (phases 5+) | brew install llvm |
| lldb | (bundled with Xcode CLT and Homebrew LLVM) | Debugger | already installed |
| git | 2.30+ | Version control | already installed on macOS |

Optional Tools

| Tool | Purpose | Install |
|---|---|---|
| Docker Desktop | Run a Linux container to validate ELF/glibc behavior (Phase 8 FFI optional cross-check) | https://www.docker.com/products/docker-desktop |
| graphviz | Render CFG / dominator-tree dumps as PNGs (Phase 4) | brew install graphviz |
| rr (Linux only) | Time-travel debugger, useful inside Docker for Phase 8 GC debugging | apt in Linux container |
| clang-format | Auto-format C++ (configured per lab) | bundled with Homebrew LLVM (Apple's Command Line Tools do not include it) |
| gdb | Some people prefer GDB; LLDB ships natively on macOS so we use LLDB | brew install gdb (with caveats on macOS) |

Tool Sets By Phase

| Phase | Need |
|---|---|
| 1–4 | Apple Clang + CMake |
| 5 | + Homebrew LLVM (brew install llvm) + Ninja |
| 6 | same as Phase 5 |
| 7 | same as Phase 5; also requires mlir-opt, mlir-translate (ship with Homebrew LLVM) |
| 8 | same as Phase 5; optional Docker for Linux validation |
| 9 | same as Phase 5 |

Apple Clang vs Homebrew LLVM — Why We Have Both

| | Apple Clang | Homebrew LLVM |
|---|---|---|
| Path | /usr/bin/clang++ | /opt/homebrew/opt/llvm/bin/clang++ |
| Provides | Compiler, linker integration | Compiler + libLLVM.dylib + headers + tools |
| Tools included | clang, clang++, lldb | clang++, llc, opt, llvm-config, mlir-opt, lli, llvm-link |
| Used for | Phases 1–4 (plain C++) | Phases 5–9 and the capstones (LLVM/MLIR work) |

Why Apple ships its own Clang: Apple uses LLVM internally for the macOS toolchain. Their Clang is patched, tracks Xcode releases, links against the system frameworks, and produces signed binaries. But it does not ship the LLVM C++ libraries and headers, nor tools such as llc, opt, and mlir-opt. We install upstream LLVM via Homebrew to get the missing pieces.

Target Triple (macOS Apple Silicon)

Your default triple is arm64-apple-macosx<version>. The platform and minimum OS version are recorded in every Mach-O binary in the LC_BUILD_VERSION load command (the architecture itself lives in the Mach-O header). Inspect with:

otool -l <binary> | grep -A5 LC_BUILD_VERSION

This matters in:

  • Phase 5: when you ask LLVM to emit object files, you pass this triple.
  • Phase 8 and beyond: when you write FFI or assembly intrinsics.

Verification

The full step-by-step verification script lives in cp-01-environment-setup/docs/verification.md. Run it once, before any other lab.

Glossary

Curriculum-wide terminology, alphabetised. When a term appears for the first time in a lab's CONCEPTS.md, it's also defined inline; this file is the consolidated index.

| Term | Definition |
|---|---|
| AOT | Ahead-of-time compilation. Source → native binary before execution. Clang, rustc, GCC, MSVC. Opposite of JIT. |
| AST | Abstract Syntax Tree. Hierarchical representation of source code after parsing, with syntax noise (parens, whitespace, semicolons) removed. |
| Basic Block | Maximal straight-line sequence of IR instructions with one entry and one exit. Building block of the CFG. |
| Bytecode | A linear sequence of opcodes designed for a virtual machine, not real hardware. CPython, JVM, V8 Ignition. |
| CFG | Control-Flow Graph. Directed graph of basic blocks where edges are possible jumps. |
| Closure | A function value that captures variables from its lexical environment. Requires escape analysis or heap-allocated environments. |
| Computed Goto | A GNU C extension (&&label, goto *ptr) enabling threaded bytecode dispatch. Usually faster than switch-based dispatch; reported gains are typically tens of percent (CPython's source cites up to 15–20%, depending on compiler and CPU) rather than integer multiples. |
| Constant Folding | Optimization that pre-computes constant expressions at compile time (2+3 → 5). |
| DCE | Dead Code Elimination. Removes instructions whose results are never used. |
| Dialect (MLIR) | A self-contained set of operations and types in MLIR. tensor, arith, affine, llvm are dialects. |
| Dispatch (VM) | The act of selecting and jumping to the implementation of the current opcode. Hottest loop in any interpreter. |
| Dominator | Block A dominates B if every path from entry to B passes through A. Foundation of SSA construction. |
| EBNF | Extended Backus-Naur Form. Standard notation for context-free grammars. |
| ELF | Executable and Linkable Format. Linux/BSD object/binary format. macOS uses Mach-O instead. |
| FFI | Foreign Function Interface. Calling C (or other-ABI) functions from your language. |
| GC | Garbage Collector. Subsystem that reclaims unreachable heap memory. We build mark-sweep in Phase 8. |
| HM | Hindley-Milner. Type inference algorithm (Algorithm W) for functional languages with let-polymorphism. |
| IR | Intermediate Representation. Any data structure between AST and machine code. Compilers typically have 2–10 IRs (e.g., Clang has AST → LLVM IR → SelectionDAG → MachineInstr (MIR) → MCInst). |
| IRBuilder | LLVM helper class that constructs LLVM IR instructions and tracks the insertion point. Most-used API in LLVM frontends. |
| JIT | Just-In-Time compilation. Source/bytecode is compiled to native at runtime. V8, HotSpot, LuaJIT, PyPy, ORC. |
| Lexer | Also called scanner or tokenizer. Converts a character stream into a token stream. |
| LLVM IR | LLVM's typed, SSA-form intermediate representation. Human-readable assembly-like syntax. |
| lli | LLVM IR interpreter / JIT driver. Runs .ll files directly. |
| llc | LLVM static compiler. Lowers LLVM IR to target assembly. |
| Mach-O | Mach Object file format. macOS/iOS executable/library format. ELF's macOS counterpart. |
| mem2reg | LLVM pass that promotes stack allocas to SSA registers when their address is never taken. Foundational; most frontends rely on it. |
| MLIR | Multi-Level Intermediate Representation. LLVM project for multi-IR compiler infrastructure. Powers TensorFlow XLA, IREE, Mojo. |
| mlir-opt | The MLIR optimizer driver: MLIR's counterpart to LLVM's opt. |
| ORC | On-Request Compilation. LLVM's JIT framework, current version is ORC v2. |
| Parser | Converts a token stream into an AST. Two flavors: hand-written (recursive descent / Pratt) or generated (yacc, ANTLR). |
| Phi Node (φ) | SSA-form instruction that selects a value based on which predecessor block was the source. The defining characteristic of SSA. |
| Pratt Parser | Top-down operator-precedence parser. Used by V8, Crafting Interpreters, many JavaScript parsers. |
| Recursive Descent | A top-down parser written as one mutually-recursive function per grammar rule. Used by Clang, GCC, rustc. |
| SSA | Static Single Assignment. IR form in which every variable is assigned exactly once. Enables almost all modern optimizations. |
| Symbol Table | Data structure mapping names to declarations, often a stack of hash maps (one per scope). |
| Target Triple | <arch>-<vendor>-<os>-<abi> string identifying the compilation target (e.g., arm64-apple-macosx15.0). |
| Three-Address Code (TAC) | IR form where instructions have at most 3 operands: x = y op z. Common pre-SSA representation. |
| Token | Atomic syntactic unit emitted by the lexer (keywords, identifiers, literals, operators). |
| Tree-Walk Interpreter | Executes a program by recursively visiting AST nodes. Simplest backend; slowest runtime. |
| Type Environment (Γ) | Mapping from variable names to types, used during type checking. |
| Visitor Pattern | Design pattern that adds an operation to a class hierarchy without modifying it. Standard tool for AST traversal. |
| VM | Virtual Machine. Interpreter for a custom bytecode instruction set. CPython, JVM, V8 Ignition, Lua. |