Step 3 — Module / function / block

The module

A module is the unit of compilation. One .ll file = one llvm::Module = one translation unit. Modules contain:

  • A target triple (arm64-apple-macosx) and data layout.
  • Global declarations (@printf, @.fmt, @x).
  • Function definitions.
  • Metadata (debug info, optimisation hints).

; ModuleID = 'minilang'
target triple = "arm64-apple-macosx"

@.fmt = private constant [6 x i8] c"%lld\0A\00"
declare i32 @printf(i8*, ...)

define i32 @main() { ... }
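
As a rough illustration of how a textual emitter might assemble a module in that order (header, then globals, then functions), here is a minimal sketch. ModuleSketch and emitModule are hypothetical names invented for this example, not part of LLVM or of the emitter described here, and the globals and functions are assumed to be already formatted as IR text.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a module; all names here are illustrative.
struct ModuleSketch {
    std::string id;                      // printed in the "; ModuleID" comment
    std::string triple;                  // e.g. "arm64-apple-macosx"
    std::vector<std::string> globals;    // pre-formatted global lines
    std::vector<std::string> functions;  // pre-formatted function definitions
};

// Render the module as text: header first, then globals, then functions.
std::string emitModule(const ModuleSketch& m) {
    assert(!m.triple.empty() && "this sketch expects a target triple");
    std::string out;
    out += "; ModuleID = '" + m.id + "'\n";
    out += "target triple = \"" + m.triple + "\"\n\n";
    for (const auto& g : m.globals) out += g + "\n";
    if (!m.globals.empty()) out += "\n";  // blank line between sections
    for (const auto& f : m.functions) out += f + "\n";
    return out;
}
```

Passing the @.fmt and @printf lines from the listing above as globals would reproduce the same layout; the data layout line is left out of the sketch for brevity.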

The function

A function has a return type, name, parameter list, and a list of basic blocks:

define i64 @add(i64 %arg0, i64 %arg1) {
L0:
  %v0 = add i64 %arg0, %arg1
  ret i64 %v0
}

The first block listed is the entry block; it is implicit and carries no special marker. Parameters are values in scope from the entry block onwards. The return type is declared up front, and every ret in the body must return a value of that type.
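
To make the shape of the define line concrete, here is a small sketch that builds it from a return type, a name, and a list of parameter types. emitSignature is a hypothetical helper, and the %arg0, %arg1 naming simply follows the @add example above; a real emitter may name parameters differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: build the opening "define" line of a function whose
// parameters are named %arg0, %arg1, ... as in the @add example.
std::string emitSignature(const std::string& retTy,
                          const std::string& name,
                          const std::vector<std::string>& paramTys) {
    assert(!name.empty() && "a function needs a name");
    std::string out = "define " + retTy + " @" + name + "(";
    for (std::size_t i = 0; i < paramTys.size(); ++i) {
        if (i != 0) out += ", ";
        out += paramTys[i] + " %arg" + std::to_string(i);
    }
    out += ") {";
    return out;
}
```

Called as emitSignature("i64", "add", {"i64", "i64"}), this yields the first line of the @add listing; the basic blocks and the closing brace would follow.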

Linkage and visibility

Function definitions default to external linkage (visible to the linker, as if extern in C). Other options:

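  • private — accessible only within the module and absent from the object file's symbol table (we use this for @.fmt).
  • internal — like private, but emitted as a local symbol in the object file; the equivalent of static in C.
  • weak / linkonce — definitions the linker may merge with, or replace by, another definition of the same name; their _odr variants are what C++ inline functions and templates use.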

We don't decorate @main or @add — external is the right default.
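
If the emitter ever needed non-default linkage, one way to model it is an enum that maps to the keyword printed after define. The sketch below is an assumption about how that could look; Linkage, linkageKeyword, and definePrefix are hypothetical names. External linkage prints no keyword, which is why @main and @add need no decoration.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the linkage kinds discussed above.
enum class Linkage { External, Private, Internal, Weak, LinkOnce };

// Keyword printed in the IR; external is the default and prints nothing.
std::string linkageKeyword(Linkage l) {
    switch (l) {
        case Linkage::External: return "";
        case Linkage::Private:  return "private";
        case Linkage::Internal: return "internal";
        case Linkage::Weak:     return "weak";
        case Linkage::LinkOnce: return "linkonce";
    }
    assert(false && "unhandled linkage kind");
    return "";
}

// Build "define [linkage] <ret> @<name>" for a function definition.
std::string definePrefix(Linkage l, const std::string& retTy,
                         const std::string& name) {
    std::string out = "define ";
    const std::string kw = linkageKeyword(l);
    if (!kw.empty()) out += kw + " ";
    out += retTy + " @" + name;
    return out;
}
```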

The basic block

A basic block is a maximal sequence of straight-line instructions ending in a terminator: ret, br, switch, indirectbr, invoke, callbr, resume, catchswitch, catchret, cleanupret, or unreachable. Each block has exactly one terminator, and it must be the last instruction; if you forget it, the verifier rejects the function.

L1:
  %v3 = add i64 %v0, 1
  store i64 %v3, i64* %i.addr
  br label %L0

Blocks are labelled (L1: …). Labels are values of type label, referred to as %L1 in br targets. The label name on the definition site has no % — but the reference site does. (Yes, this inconsistency is annoying. Welcome to LLVM IR.)
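
The one-terminator rule is easy to check before handing IR to LLVM. The sketch below is a simplified stand-in for that part of the verifier, assuming each block is represented only as a list of opcode names; isTerminator, blockIsWellFormed, and terminatorOf are hypothetical helpers, not LLVM APIs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Opcodes that end a basic block in LLVM IR.
bool isTerminator(const std::string& opcode) {
    static const std::vector<std::string> terms = {
        "ret", "br", "switch", "indirectbr", "invoke", "callbr",
        "resume", "catchswitch", "catchret", "cleanupret", "unreachable"};
    for (const auto& t : terms)
        if (t == opcode) return true;
    return false;
}

// Well formed: non-empty, last opcode is a terminator, no earlier one is.
bool blockIsWellFormed(const std::vector<std::string>& opcodes) {
    if (opcodes.empty()) return false;
    for (std::size_t i = 0; i + 1 < opcodes.size(); ++i)
        if (isTerminator(opcodes[i])) return false;
    return isTerminator(opcodes.back());
}

// Return the terminator of a block that has already been checked.
const std::string& terminatorOf(const std::vector<std::string>& opcodes) {
    assert(blockIsWellFormed(opcodes) && "verify the block first");
    return opcodes.back();
}
```

For the L1 block above, the opcode list {"add", "store", "br"} passes the check; dropping the br makes it fail.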

Why basic blocks at all?

Because every flow-graph analysis is dramatically simpler if you can reason about straight-line sequences as opaque units, then handle control at the boundaries. Dominator computation, liveness, register allocation, scheduling — all of them operate on basic-block CFGs.

Compare to a representation where any instruction could be a branch target: now every analysis has to track "did anyone jump into the middle of this run?" The basic-block invariant — enter at the top, exit at the bottom — buys you enormous simplification.
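
One consequence of that invariant is that every control-flow edge comes from a terminator, so the whole CFG can be recovered by looking at one instruction per block. The sketch below computes predecessor lists that way. BlockSketch and predecessorsOf are hypothetical names, and the successor ids are assumed to have been read off each block's terminator already.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical block summary: an id plus the targets of its terminator.
struct BlockSketch {
    int id;
    std::vector<int> successors;
};

// Build predecessor lists by inverting the successor edges.
std::map<int, std::vector<int>> predecessorsOf(
        const std::vector<BlockSketch>& blocks) {
    std::map<int, std::vector<int>> preds;
    for (const auto& b : blocks) preds[b.id];  // every block gets an entry
    for (const auto& b : blocks) {
        for (int s : b.successors) {
            assert(preds.count(s) == 1 && "branch target must be a known block");
            preds[s].push_back(b.id);
        }
    }
    return preds;
}
```

No instruction inside a block needs to be inspected, which is the simplification the paragraph above describes.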

Mapping our TAC

Our TAC already had basic blocks (BasicBlock in cp-08), so the mapping is one-to-one. The only difference: LLVM identifies a block by a textual label (L0, L1, …, referenced as %L0, %L1, …), whereas ours are identified by integer id. The emitter turns an id into a label by prefixing it with L:

static std::string blockLabel(int id) { return "L" + std::to_string(id); }

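The emitter also emits a br label %L<id> to enter the first body block from the alloca region. LLVM blocks never fall through into the next block, so the alloca prelude needs an explicit terminator to reach the body. The allocas could sit at the top of the first body block instead, but only if nothing branches back to that block, because a function's entry block may not have predecessors; we put the allocas in their own block before the br, which avoids the problem and reads more clearly.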
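
A minimal sketch of that prelude, under stated assumptions: the block is named entry, every slot is an i64, and emitPrelude is a hypothetical helper rather than the emitter's actual function. Only blockLabel comes from the text above.

```cpp
#include <cassert>
#include <string>
#include <vector>

static std::string blockLabel(int id) { return "L" + std::to_string(id); }

// Hypothetical: emit an entry block that holds the allocas, then branch to
// the first body block. The "entry" label and i64 slots are assumptions.
std::string emitPrelude(const std::vector<std::string>& slots,
                        int firstBlockId) {
    assert(firstBlockId >= 0 && "block ids are assumed non-negative");
    std::string out = "entry:\n";
    for (const auto& s : slots)
        out += "  %" + s + " = alloca i64\n";
    // Blocks do not fall through, so the prelude ends with an explicit br.
    out += "  br label %" + blockLabel(firstBlockId) + "\n";
    return out;
}
```

With a single slot named i.addr and a first block id of 0, the output is an entry block containing one alloca followed by br label %L0, after which the L0 block from the TAC can follow unchanged.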