Step 3 — Basic blocks and the CFG
What is a basic block?
A basic block is a maximal sequence of instructions such that
- control enters only at the top (no branches into the middle), and
- control leaves only at the bottom (no branches out of the middle).
Formally: a straight-line code fragment that ends in exactly one
terminator — Jump, CondJump, or Return. Nothing fancier than
that.
The set of basic blocks plus their successor edges form the control-flow graph (CFG) of a function. Almost every interesting analysis — dominance, reachability, liveness, loop detection — operates on this graph.
Our BasicBlock
struct BasicBlock {
int id; // small integer name
std::string label; // human-readable hint
std::vector<Instr> instrs;
};
The label is purely cosmetic ("entry", "if.then", "while.cond") — the
real identifier is id. We use small dense integer ids because most
analyses will index into per-block bit-vectors.
Function::blocks is the vector of all blocks in creation order.
blocks[0] is always the entry by convention; we don't store it
separately.
Block creation
BasicBlock& Function::newBlock(std::string label) {
blocks.push_back({nextBlock++, std::move(label), {}});
return blocks.back();
}
Returns a reference and an id; the builder remembers the id (the reference is invalidated by the next allocation, so we never store it across emits).
Terminator discipline
The builder enforces a simple invariant: every emitted instruction
goes into a non-terminated block. If you try to emit into a
terminated block, the builder silently opens a fresh "unreachable"
block and emits there instead:
void Builder::emit(Instr ins) {
if (currentBlockTerminated()) {
auto& nb = fn().newBlock("unreachable");
setBlock(nb.id);
}
block().instrs.push_back(std::move(ins));
}
This keeps the lowering code free of if (terminated) return; clutter
and preserves dead code as visible IR. cp-09's DCE pass will prune
unreachable blocks.
CFG edges
We don't store CFG edges explicitly. Successors are recoverable from the terminator:
| Terminator | Successors |
|---|---|
Jump bbT | {bbT} |
CondJump bbT bbF | {bbT, bbF} |
Return | {} |
A trivial helper (introduced in cp-09) walks blocks.back().instrs.back()
and returns the successor set. Predecessors are computed by inverting
that map once per pass — cheap, and avoids the bookkeeping pain of
keeping bidirectional edges in sync during construction.
Why a vector of blocks (and not a linked structure)?
Industrial IRs (LLVM, MLIR) use intrusive linked lists for instructions within blocks, because passes need cheap O(1) removal. At the block level they store a list of blocks per function for the same reason: inserting a block in the middle of a function is common (loop rotation, critical-edge splitting).
We use std::vector because (a) we're not implementing those passes
yet, and (b) a vector<BasicBlock> is dramatically nicer to print and
debug. The cost is amortised — append is O(1) — and the only real
limitation is that we can't store stable pointers to blocks. We work
around that with integer ids, which is the correct compiler-engineering
discipline anyway.