05 — Strings as Private Globals + GEP

The Str expression node is the only non-numeric value type in cp-17. Its lowering reveals two LLVM concepts every IR emitter must internalise: GlobalVariable for constant data, and getelementptr (GEP) for address arithmetic.

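case Expr::K::Str: {
    // Value: an [N x i8] constant holding the bytes of e.str plus a trailing NUL.
    Constant* s = ConstantDataArray::getString(ctx, e.str, /*addNull=*/true);
    // Storage: a private, read-only module-level global initialised with s.
    auto* gv = new GlobalVariable(
        mod, s->getType(), /*isConstant=*/true,
        GlobalValue::PrivateLinkage, s, ".str");
    // Address: GEP {0, 0} over the array type yields a ptr to its first byte.
    return b.CreateInBoundsGEP(
        s->getType(), gv,
        {ConstantInt::get(i32(), 0), ConstantInt::get(i32(), 0)});
}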

Step by step:

  1. ConstantDataArray::getString builds an [N x i8] constant from the string bytes. With addNull=true a trailing NUL is appended, so N is the string length plus one. This is the value.
  2. new GlobalVariable(...) registers that constant as a module-level symbol with PrivateLinkage (local to the module and absent from the object file's symbol table, so it cannot conflict across modules) and isConstant=true (the optimiser may assume it is never written, and the backend may place it in a read-only section such as .rodata). The variable's value type is [N x i8]; gv itself, used as a value, is a constant address of that storage (in LLVM 20 with opaque pointers, its type is just ptr).
  3. CreateInBoundsGEP computes the address of element [0][0]:
    • first index 0 steps through the pointer (typical idiom for "the array itself, not array #N");
    • second index 0 steps to the first byte inside the array.
     The result is a ptr to byte 0 of the string, which is exactly what ml_print_str(const char*) expects.
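
The three steps above have a close analogue in plain C++, which may help if the IR feels abstract. The sketch below uses no LLVM at all: kStr, first_byte and array_len are hypothetical names invented for this illustration, and the file-scope static array only stands in for the private constant global.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-in for the private constant global: a NUL-terminated byte array
// with internal linkage. Its type is const char[6], i.e. "[6 x i8]".
static const char kStr[] = "hello";

// Analogue of the {0, 0} GEP. gv points at the whole array:
//   gv[0]    steps zero arrays forward (still the array itself),
//   gv[0][0] selects byte 0 inside that array.
static const char* first_byte(const char (*gv)[sizeof(kStr)]) {
    return &gv[0][0];
}

// N in [N x i8]: the string length plus the trailing NUL.
static std::size_t array_len() {
    return sizeof(kStr);
}
```

Calling first_byte(&kStr) gives the same address as the usual array-to-pointer decay of kStr, and array_len() counts the NUL that addNull=true appends in the LLVM version.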

The two-index GEP for arrays

You'll see {0, 0} patterns everywhere in LLVM IR for "decay an array to a char*". The mental model:

gv      : ptr to [N x i8]
gep 0   : same as gv (no offset, but lets us index into the pointee)
gep 0,0 : pointer to the first i8 in the array

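Conceptually like C's &str[0]. If the global were [10 x [4 x i8]], {0, i, j} would give &str[i][j]. The first index is special: it indexes the pointer itself, scaled by the size of the whole source type, as if gv pointed at an array of such objects. Subsequent indices walk into the aggregate type, one nesting level per index.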
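
To make the index arithmetic concrete, here is a small stdlib-only model of the byte offset a {first, i, j} GEP would compute over [10 x [4 x i8]], checked against what C++ computes for a char[10][4]. The names gep_offset, cxx_offset, kInner and kOuter are hypothetical, and this is a sketch of the offset rule only, not of LLVM's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::ptrdiff_t kInner = 4;            // size of [4 x i8] in bytes
constexpr std::ptrdiff_t kOuter = 10 * kInner;  // size of [10 x [4 x i8]] in bytes

// Each index is scaled by the size of the type it steps over:
//   first -> whole [10 x [4 x i8]] objects,
//   i     -> [4 x i8] rows,
//   j     -> single i8 bytes.
constexpr std::ptrdiff_t gep_offset(std::ptrdiff_t first,
                                    std::ptrdiff_t i,
                                    std::ptrdiff_t j) {
    return first * kOuter + i * kInner + j * 1;
}

// The offset of &table[i][j] from the start of a real C++ array,
// computed from integer addresses to avoid cross-row pointer subtraction.
static std::ptrdiff_t cxx_offset(std::size_t i, std::size_t j) {
    static const char table[10][4] = {};
    const auto base = reinterpret_cast<std::uintptr_t>(&table[0][0]);
    const auto elem = reinterpret_cast<std::uintptr_t>(&table[i][j]);
    return static_cast<std::ptrdiff_t>(elem - base);
}
```

With a first index of 0 the model agrees with C++ indexing, for example gep_offset(0, 2, 3) and cxx_offset(2, 3) are both 11. A first index of 1 skips a whole 40-byte object, which is why the string case always passes 0 there.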

Why opaque pointers don't change this

LLVM 20 has no typed pointers: every pointer in IR is just ptr. (Opaque pointers became the default in LLVM 15, and typed pointers were removed in LLVM 17.) But the GEP instruction still needs a source element type to compute offsets. That's what the explicit s->getType() (the array type) argument to CreateInBoundsGEP is for. The IRBuilder cannot infer it from the pointer, because the pointer type no longer carries a pointee type.
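
One way to picture this, as a loose analogy rather than anything LLVM does internally: treat ptr like a const void*, and let the caller supply the element type on every address computation. In the sketch below, OpaquePtr, gep, byte_distance and kBuffer are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// An "opaque" address: like IR's ptr, it says nothing about what it points at.
using OpaquePtr = const void*;

// Backing storage so the address arithmetic below stays inside one object.
static const char kBuffer[64] = {};

// Analogue of a single-index GEP. The element type arrives explicitly as the
// template argument, playing the role of the type passed to CreateInBoundsGEP.
template <typename SourceElem>
OpaquePtr gep(OpaquePtr base, std::ptrdiff_t index) {
    const auto step = static_cast<std::ptrdiff_t>(sizeof(SourceElem));
    return static_cast<const char*>(base) + index * step;
}

// Distance in bytes between two opaque addresses.
static std::ptrdiff_t byte_distance(OpaquePtr from, OpaquePtr to) {
    return static_cast<const char*>(to) - static_cast<const char*>(from);
}
```

The same base address and the same index give different results depending on the supplied type: gep<char>(kBuffer, 1) is 1 byte in, while gep<char[4]>(kBuffer, 1) is 4 bytes in. Nothing in the pointer distinguishes the two cases, so the type has to be stated.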

Lifetime

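The GlobalVariable is owned by the Module. When the Module is moved into a ThreadSafeModule and handed to ORC, ownership transfers with it. After JITting, the global's bytes live somewhere in the JIT's data section, and the run-time address that our GEP evaluates to stays valid for as long as the LLJIT instance lives (and the module's resources have not been removed). For test code this is fine; for a long-running VM you'd care about reclaiming unused string globals. That is a job for ORC's ResourceTracker API, which releases memory at the granularity of whatever was added under a tracker (typically whole modules), not individual globals.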