Step 1 — LLVM IR overview

LLVM IR is a strongly-typed, SSA-form, three-address virtual ISA. It sits between the frontend (which generates it) and the backend (which lowers it to a real machine). It is the lingua franca of modern production compilers: Clang, rustc, the Swift compiler, Julia, Numba, GHC's LLVM backend, LDC (D), Crystal, Pony, Zig (in part), and many research languages all emit LLVM IR and let LLVM handle codegen.

Three forms, same content

LLVM IR exists in three equivalent encodings:

  • Textual (.ll) — human-readable; what we emit in this lab.
  • Bitcode (.bc) — compact binary form; what gets serialised.
  • In-memory: llvm::Module* etc.; what the C++ API manipulates.

They are losslessly interconvertible: llvm-as converts text to bitcode, llvm-dis converts bitcode to text, IRBuilder constructs in-memory IR, and Module::print() dumps it as text. cp-11 will switch to the in-memory form.
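The text-to-bitcode-to-text round trip can be tried with a short script. This is a minimal sketch, not part of the lab's code: the names TINY_MODULE and round_trip are hypothetical, and it assumes llvm-as and llvm-dis are on PATH under exactly those names. If either tool is missing it returns None instead of failing.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# A tiny, hand-written module used only for this illustration.
TINY_MODULE = """\
define i32 @answer() {
entry:
  ret i32 42
}
"""


def round_trip(ir_text):
    """Convert text -> bitcode -> text; return the re-disassembled text.

    Returns None when llvm-as or llvm-dis cannot be found on PATH.
    """
    llvm_as = shutil.which("llvm-as")
    llvm_dis = shutil.which("llvm-dis")
    if llvm_as is None or llvm_dis is None:
        return None

    with tempfile.TemporaryDirectory() as tmp:
        ll_path = Path(tmp) / "tiny.ll"
        bc_path = Path(tmp) / "tiny.bc"
        out_path = Path(tmp) / "tiny.roundtrip.ll"
        ll_path.write_text(ir_text)
        # text -> bitcode
        subprocess.run([llvm_as, str(ll_path), "-o", str(bc_path)], check=True)
        # bitcode -> text
        subprocess.run([llvm_dis, str(bc_path), "-o", str(out_path)], check=True)
        return out_path.read_text()


if __name__ == "__main__":
    result = round_trip(TINY_MODULE)
    print(result if result is not None else "LLVM tools not found; skipped")
```

The disassembled text will usually carry a few extra lines (a module ID comment, for instance), but the function body should come back unchanged.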

Three shapes within the text

module
  ├─ globals  (constants, mutable globals, function declarations)
  └─ functions
       └─ basic blocks (labelled; single entry, exactly one terminator at the end)
            └─ instructions (one per line)
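One way to mirror this nesting in an emitter is a few small containers plus a renderer. The sketch below is an assumption about how such an emitter could be organised, not the lab's actual design; the class names Module, Function, and BasicBlock and the render function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    label: str
    # Instruction strings; the last one is expected to be a terminator.
    instructions: list = field(default_factory=list)


@dataclass
class Function:
    name: str
    ret_type: str
    params: list  # list of (type, name) pairs
    blocks: list = field(default_factory=list)


@dataclass
class Module:
    globals_: list = field(default_factory=list)  # pre-rendered global lines
    functions: list = field(default_factory=list)


def render(module):
    """Render the module as textual IR: globals first, then functions."""
    lines = list(module.globals_)
    for fn in module.functions:
        params = ", ".join(f"{ty} %{name}" for ty, name in fn.params)
        lines.append("")
        lines.append(f"define {fn.ret_type} @{fn.name}({params}) {{")
        for block in fn.blocks:
            lines.append(f"{block.label}:")
            lines.extend(f"  {inst}" for inst in block.instructions)
        lines.append("}")
    return "\n".join(lines) + "\n"


demo = Module(
    globals_=["@counter = global i32 0", "declare i32 @putchar(i32)"],
    functions=[Function("main", "i32", [], [BasicBlock("entry", ["ret i32 0"])])],
)

if __name__ == "__main__":
    print(render(demo))
```

Keeping instructions as plain strings is a deliberate simplification for a text-emitting lab; a richer emitter would model operands and types as objects.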

Almost every line in a function body has the form:

%name = <opcode> <type> <operands>

That <type> is mandatory: LLVM IR is strongly typed, and checking types is part of the verifier's job. The verifier (opt -passes=verify in current releases, opt -verify under the legacy pass manager; also run implicitly by lli/llc) rejects modules whose types are inconsistent. This is what makes generating LLVM IR feel slightly tedious the first time and very pleasant thereafter: errors are caught early and precisely.
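The instruction shape above maps directly onto a one-line formatting helper. The following is a minimal sketch; the helper inst and the function @add_twice are hypothetical names chosen for illustration.

```python
def inst(result, opcode, ty, *operands):
    """Format one instruction as '%result = opcode type op1, op2, ...'."""
    return f"%{result} = {opcode} {ty} {', '.join(operands)}"


def emit_add_twice():
    """Emit a function computing (a + b) * 2 on 32-bit integers."""
    body = [
        inst("sum", "add", "i32", "%a", "%b"),
        inst("twice", "mul", "i32", "%sum", "2"),
        # Terminators such as ret define no value, so they have no '%name ='.
        "ret i32 %twice",
    ]
    lines = ["define i32 @add_twice(i32 %a, i32 %b) {", "entry:"]
    lines += [f"  {line}" for line in body]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(emit_add_twice())
```

Every operand's type has to agree with the type written on the instruction. Changing the first line to add i64 %a, %b while %a is still declared as i32 would get the module rejected.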

SSA

Every value (%name) is defined exactly once. We avoid grappling with SSA construction directly by using alloca + load/store for every mutable variable; LLVM's mem2reg pass promotes those to SSA later. This is the canonical strategy for frontends — the "alloca trick" is how Clang emits IR for C local variables.
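The alloca strategy can be sketched as a small emitter that gives each mutable variable a stack slot and mints a fresh name for every temporary. The class FunctionEmitter and its methods are hypothetical, and the ptr type spelling assumes an LLVM release with opaque pointers (15 or later); older releases would write i32* instead.

```python
import itertools


class FunctionEmitter:
    """Collects instruction lines for a single basic block."""

    def __init__(self):
        self.lines = []
        self._ids = itertools.count()

    def fresh(self, hint):
        # Each call returns a new name, so no %name is ever defined twice.
        return f"%{hint}.{next(self._ids)}"

    def declare_var(self, name, ty):
        slot = f"%{name}.addr"
        self.lines.append(f"{slot} = alloca {ty}")
        return slot

    def store(self, ty, value, slot):
        self.lines.append(f"store {ty} {value}, ptr {slot}")

    def load(self, ty, slot):
        tmp = self.fresh("t")
        self.lines.append(f"{tmp} = load {ty}, ptr {slot}")
        return tmp

    def binop(self, opcode, ty, lhs, rhs):
        tmp = self.fresh("t")
        self.lines.append(f"{tmp} = {opcode} {ty} {lhs}, {rhs}")
        return tmp


def emit_increment():
    """Emit IR for: int increment(int n) { int x = n; x = x + 1; return x; }"""
    e = FunctionEmitter()
    x = e.declare_var("x", "i32")
    e.store("i32", "%n", x)                 # x = n
    old = e.load("i32", x)                  # read x
    new = e.binop("add", "i32", old, "1")   # x + 1
    e.store("i32", new, x)                  # x = x + 1 (a second store, no redefinition)
    result = e.load("i32", x)
    e.lines.append(f"ret i32 {result}")
    body = "\n".join(f"  {line}" for line in e.lines)
    return f"define i32 @increment(i32 %n) {{\nentry:\n{body}\n}}\n"


if __name__ == "__main__":
    print(emit_increment())
```

The variable x is assigned twice in the source, yet every %name in the output has a single definition: the mutation lives in memory, not in SSA values. Running the result through opt -passes=mem2reg -S should remove the alloca, loads, and stores.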

What we will NOT do in cp-10

  • phi nodes: needed only if the frontend constructs SSA form itself (as frontends targeting Cranelift do, where block parameters play the role of phis). With alloca/mem2reg you never write a phi yourself.
  • Aggregate types (struct, array as value).
  • Garbage-collection statepoints.
  • Debug info.
  • Metadata.
  • TBAA / aliasing annotations.

Each matters for a full production frontend, and each is a separable concern that we can layer on in cp-14+.

Why the textual form first?

Because LLVM's error messages, dumps, and documentation all speak in textual IR. Even if you write a frontend that only ever calls the C++ API, you will read textual IR every day for the rest of your compiler career. Get fluent in it now.