Step 1 — LLVM IR overview
LLVM IR is a strongly typed, SSA-form, three-address virtual instruction set. It sits between the frontend (which generates it) and the backend (which lowers it to a real machine). It is the lingua franca of modern production compilers: Clang, Rust, Swift, Julia, Numba, GHC's LLVM backend, LDC (D), Crystal, Pony, Zig (in part), and many research languages all emit LLVM IR and let LLVM handle code generation.
Three forms, same content
LLVM IR exists in three equivalent encodings:
- Textual (.ll): human-readable; what we emit in this lab.
- Bitcode (.bc): compact binary form; what gets serialised.
- In-memory (llvm::Module* etc.): what the C++ API manipulates.
They are losslessly interconvertible: llvm-as converts text to bitcode,
llvm-dis converts bitcode to text, IRBuilder constructs in-memory IR,
and Module::print() dumps in-memory IR as text. cp-11 will switch to
the in-memory form.
Three shapes within the text
module
├─ globals (constants, mutable globals, function declarations)
└─ functions
   └─ basic blocks (labelled, single-entry single-exit)
      └─ instructions (one per line)
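To make the nesting concrete, here is a minimal sketch of an emitter that writes one tiny module as text. It assumes the lab's emitter is written in Python (the source does not say which language is used) and that the target LLVM uses opaque pointers (`ptr`, LLVM 15 or later). The names `emit_module`, `@.fmt` and `%sum` are illustrative only.

```python
def emit_module() -> str:
    """Return the text of a small LLVM module (hypothetical example)."""
    lines = [
        "; module level: one global constant and one function declaration",
        '@.fmt = private unnamed_addr constant [4 x i8] c"%d\\0A\\00"',
        "declare i32 @printf(ptr, ...)",
        "",
        "; one function, holding one basic block labelled 'entry'",
        "define i32 @main() {",
        "entry:",
        "  ; instructions, one per line; the block ends with a terminator",
        "  %sum = add i32 40, 2",
        "  %ignored = call i32 (ptr, ...) @printf(ptr @.fmt, i32 %sum)",
        "  ret i32 0",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the text to stdout; redirect it to a .ll file to feed LLVM tools.
    print(emit_module(), end="")
```

Saved as a .ll file, the output should be accepted by llvm-as or lli on a toolchain with opaque pointers; older releases would need `i8*` in place of `ptr`.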
Almost every line in a function body has the form:
%name = <opcode> <type> <operands>
That <type> is mandatory: LLVM IR is strongly typed, and checking
types is part of the verifier's job. The verifier (opt -passes=verify
on current LLVM releases, opt -verify on older ones; also run
implicitly by lli/llc) rejects modules whose types don't align. This
is what makes generating LLVM IR feel slightly tedious the first time
and very pleasant thereafter: errors are caught early and precisely.
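The instruction shape above can be sketched as a small formatting helper. This is a toy, not LLVM's verifier: it assumes a Python emitter and uses hypothetical names (`Value`, `emit_binop`, `INT_TYPES`). It only illustrates the kind of mismatch that would otherwise be rejected once the text reaches the LLVM tools.

```python
from dataclasses import dataclass

# Integer types this toy helper accepts; LLVM itself allows other widths too.
INT_TYPES = {"i1", "i8", "i16", "i32", "i64"}


@dataclass(frozen=True)
class Value:
    ty: str    # LLVM type as written in the IR, e.g. "i32"
    text: str  # operand as written, e.g. "%a" or "7"


def emit_binop(name: str, opcode: str, lhs: Value, rhs: Value):
    """Return (result Value, '%name = <opcode> <type> <lhs>, <rhs>')."""
    if lhs.ty not in INT_TYPES:
        raise TypeError(f"{opcode}: unsupported operand type {lhs.ty}")
    if lhs.ty != rhs.ty:
        # Catch the mismatch in the emitter instead of waiting for LLVM.
        raise TypeError(f"{opcode}: operand types differ: {lhs.ty} vs {rhs.ty}")
    line = f"%{name} = {opcode} {lhs.ty} {lhs.text}, {rhs.text}"
    return Value(lhs.ty, f"%{name}"), line


if __name__ == "__main__":
    a = Value("i32", "%a")
    total, line = emit_binop("sum", "add", a, Value("i32", "7"))
    print(line)  # the formatted instruction
    try:
        emit_binop("bad", "add", a, Value("i64", "%b"))
    except TypeError as err:
        print("rejected:", err)
```

Carrying the type alongside every operand is a design choice that keeps the emitter honest: the type written into each instruction comes from the values themselves, so it cannot drift from what was declared.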
SSA
Every value (%name) is defined exactly once. We avoid grappling with
SSA construction directly by using alloca + load/store for every
mutable variable; LLVM's mem2reg pass promotes those to SSA later.
This is the canonical strategy for frontends — the "alloca trick" is
how Clang emits IR for C local variables.
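As a rough illustration of the alloca strategy, the sketch below emits IR for the source fragment `x = 1; x = x + 2; return x`. It assumes a Python emitter, 32-bit integer variables only, and opaque pointers (`ptr`, LLVM 15 or later). `FunctionEmitter` and its methods are hypothetical names, not part of any LLVM API.

```python
import itertools


class FunctionEmitter:
    """Toy emitter: every mutable variable lives in an alloca'd stack slot."""

    def __init__(self):
        self.lines = []
        self.slots = {}                    # variable name -> slot name
        self._counter = itertools.count()  # source of fresh temporaries

    def fresh(self) -> str:
        # Each call yields a new name, so every %name is defined once.
        return f"%t{next(self._counter)}"

    def declare(self, var: str) -> None:
        slot = f"%{var}.addr"
        self.slots[var] = slot
        self.lines.append(f"  {slot} = alloca i32")

    def assign(self, var: str, operand: str) -> None:
        # A store defines no value, so reassignment never breaks SSA.
        self.lines.append(f"  store i32 {operand}, ptr {self.slots[var]}")

    def read(self, var: str) -> str:
        tmp = self.fresh()
        self.lines.append(f"  {tmp} = load i32, ptr {self.slots[var]}")
        return tmp

    def add(self, lhs: str, rhs: str) -> str:
        tmp = self.fresh()
        self.lines.append(f"  {tmp} = add i32 {lhs}, {rhs}")
        return tmp


def emit_example() -> str:
    """Emit IR for: x = 1; x = x + 2; return x"""
    f = FunctionEmitter()
    f.declare("x")
    f.assign("x", "1")
    f.assign("x", f.add(f.read("x"), "2"))
    f.lines.append(f"  ret i32 {f.read('x')}")
    body = "\n".join(f.lines)
    return f"define i32 @example() {{\nentry:\n{body}\n}}\n"


if __name__ == "__main__":
    print(emit_example(), end="")
```

The variable `x` is assigned twice in the source, yet the emitted text defines only `%x.addr`, `%t0`, `%t1` and `%t2`, each exactly once, and contains no phi. Running mem2reg over such a function is what later removes the stack slot.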
What we will NOT do in cp-10
- phi nodes: needed only if you SSA-construct on the frontend (Cranelift, Rust MIR). With alloca/mem2reg you never write a phi yourself.
- Aggregate types (struct, array as value).
- Garbage-collection statepoints.
- Debug info.
- Metadata.
- TBAA / aliasing annotations.
Each is essential for a full production frontend; each is a separable concern that we can layer on in cp-14+.
Why the textual form first?
Because LLVM's error messages, dumps, and documentation all speak in textual IR. Even if you write a frontend that only ever calls the C++ API, you will read textual IR every day for the rest of your compiler career. Get fluent in it now.