Step 3 — Basic blocks and the CFG

What is a basic block?

A basic block is a maximal sequence of instructions such that

  1. control enters only at the top (no branches into the middle), and
  2. control leaves only at the bottom (no branches out of the middle).

Formally: a straight-line code fragment that ends in exactly one terminatorJump, CondJump, or Return. Nothing fancier than that.

The set of basic blocks plus their successor edges form the control-flow graph (CFG) of a function. Almost every interesting analysis — dominance, reachability, liveness, loop detection — operates on this graph.

Our BasicBlock

struct BasicBlock {
    int                 id;          // small integer name
    std::string         label;       // human-readable hint
    std::vector<Instr>  instrs;
};

The label is purely cosmetic ("entry", "if.then", "while.cond") — the real identifier is id. We use small dense integer ids because most analyses will index into per-block bit-vectors.

Function::blocks is the vector of all blocks in creation order. blocks[0] is always the entry by convention; we don't store it separately.

Block creation

BasicBlock& Function::newBlock(std::string label) {
    blocks.push_back({nextBlock++, std::move(label), {}});
    return blocks.back();
}

Returns a reference and an id; the builder remembers the id (the reference is invalidated by the next allocation, so we never store it across emits).

Terminator discipline

The builder enforces a simple invariant: every emitted instruction goes into a non-terminated block. If you try to emit into a terminated block, the builder silently opens a fresh "unreachable" block and emits there instead:

void Builder::emit(Instr ins) {
    if (currentBlockTerminated()) {
        auto& nb = fn().newBlock("unreachable");
        setBlock(nb.id);
    }
    block().instrs.push_back(std::move(ins));
}

This keeps the lowering code free of if (terminated) return; clutter and preserves dead code as visible IR. cp-09's DCE pass will prune unreachable blocks.

CFG edges

We don't store CFG edges explicitly. Successors are recoverable from the terminator:

TerminatorSuccessors
Jump bbT{bbT}
CondJump bbT bbF{bbT, bbF}
Return{}

A trivial helper (introduced in cp-09) walks blocks.back().instrs.back() and returns the successor set. Predecessors are computed by inverting that map once per pass — cheap, and avoids the bookkeeping pain of keeping bidirectional edges in sync during construction.

Why a vector of blocks (and not a linked structure)?

Industrial IRs (LLVM, MLIR) use intrusive linked lists for instructions within blocks, because passes need cheap O(1) removal. At the block level they store a list of blocks per function for the same reason: inserting a block in the middle of a function is common (loop rotation, critical-edge splitting).

We use std::vector because (a) we're not implementing those passes yet, and (b) a vector<BasicBlock> is dramatically nicer to print and debug. The cost is amortised — append is O(1) — and the only real limitation is that we can't store stable pointers to blocks. We work around that with integer ids, which is the correct compiler-engineering discipline anyway.