Step 2 — Types, values, and constants
The type system
LLVM IR types come in two flavours: primitive and derived.
Primitive
- Integer: i1, i8, i16, i32, i64, i128, ... up to any width. There is no separate bool; i1 plays that role.
- Floating point: half (16-bit), float (32-bit), double (64-bit), x86_fp80 / fp128 (extended).
- Void: void (only valid as a function return type).
- Label: label (block reference; rarely written explicitly).
- Metadata: metadata (debug info etc.).
Derived
- Pointer: T* (legacy, typed) or ptr ("opaque pointers", the default since LLVM 15). We use T* here because it's clearer for teaching. LLVM 15 and 16 still accept the typed spelling, but typed pointers were removed in LLVM 17, so with newer toolchains write ptr instead.
- Array: [N x T]. Fixed-size, allocated as a value.
- Vector: <N x T>. SIMD lane group.
- Struct: { T1, T2, ... } (literal) or %S (named).
- Function: R (A1, A2, ...).
We use only i64, i1, i8*, and one array ([6 x i8] for the
format string).
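To make the composition of derived types concrete, here is a minimal Python sketch that renders type objects in the IR's textual syntax. The class names (`IntType`, `ArrayType`, `PointerType`) are hypothetical and do not belong to any LLVM binding; pointers are printed in the legacy T* form used in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntType:
    bits: int  # any positive width: 1, 8, 64, ...

    def __str__(self) -> str:
        return f"i{self.bits}"


@dataclass(frozen=True)
class ArrayType:
    count: int       # fixed number of elements
    element: object  # element type

    def __str__(self) -> str:
        return f"[{self.count} x {self.element}]"


@dataclass(frozen=True)
class PointerType:
    pointee: object  # type being pointed at (legacy typed-pointer style)

    def __str__(self) -> str:
        return f"{self.pointee}*"


i8 = IntType(8)
fmt_array = ArrayType(6, i8)

print(IntType(64))             # i64
print(PointerType(i8))         # i8*
print(fmt_array)               # [6 x i8]
print(PointerType(fmt_array))  # [6 x i8]*
```

The last line is the type of a global that holds the format string: a pointer to the whole array, not a pointer to its first byte.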
Value categories
Every operand in LLVM IR is one of:
- A constant: 42, true, 0.5, or a getelementptr of a global.
- A register: %name or %number, the result of a previous instruction in the same function. Function parameters are also registers (%arg0…).
- A global: @name, a top-level symbol. Includes function references, mutable globals, and constant data like our @.fmt.
The leading sigil is meaningful: % is local to a function, @ is
module-global. Values have no other namespace. (The % on a named struct
type such as %S names a type, not a value.)
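As a small illustration of the sigil rule, the following Python sketch classifies the textual form of a value operand. The function name `classify_operand` is hypothetical, and the sketch assumes it is only ever handed value operands, never type names.

```python
def classify_operand(text: str) -> str:
    """Classify a value operand by its leading sigil."""
    if text.startswith("%"):
        return "register"  # local to the enclosing function
    if text.startswith("@"):
        return "global"    # module-level symbol
    return "constant"      # literals and constant expressions


for operand in ["%x", "%0", "@.fmt", "42", "true"]:
    print(operand, "->", classify_operand(operand))
```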
How we lower MiniLang values
We picked the simplest possible mapping: everything is i64.
| MiniLang | LLVM | Notes |
|---|---|---|
| Number | i64 | We discard the fractional part on purpose. |
| Bool | i64 | 0 or 1. We zext i1 ... to i64 after icmp. |
| Nil | i64 0 | Same as false. |
| String | ✗ | Not supported; emitter errors out. |
This is a pedagogical choice: it makes the IR easy to read and
type-uniform, at the cost of disallowing mixed-type expressions.
cp-14 introduces a real boxed Value representation that supports
the full type set.
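The mapping in the table can be sketched in a few lines of Python. This is not the chapter's actual emitter: the helper names `lower_literal` and `lower_less_than` are hypothetical, MiniLang values are stood in for by Python values, and truncation toward zero is an assumption about how the fractional part is discarded.

```python
def lower_literal(value) -> str:
    """Return the i64 constant operand for a MiniLang literal."""
    if value is None:                    # Nil
        return "i64 0"
    if isinstance(value, bool):          # Bool (checked before int: bool is an int subclass)
        return f"i64 {int(value)}"
    if isinstance(value, (int, float)):  # Number
        return f"i64 {int(value)}"       # int() truncates toward zero (assumed policy)
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


def lower_less_than(dest: str, lhs: str, rhs: str) -> list[str]:
    """Emit icmp, then zext, so the comparison result is an i64 0 or 1."""
    flag = f"{dest}.i1"  # temporary register holding the raw i1 result
    return [
        f"{flag} = icmp slt i64 {lhs}, {rhs}",
        f"{dest} = zext i1 {flag} to i64",
    ]


print(lower_literal(3.9))   # i64 3
print(lower_literal(True))  # i64 1
print(lower_literal(None))  # i64 0
print("\n".join(lower_less_than("%t0", "%a", "%b")))
```

Passing a string raises TypeError, mirroring the "emitter errors out" row of the table.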
Constants
A constant in LLVM IR has both a type and a value:
i64 42 ; integer
[6 x i8] c"%lld\0A\00" ; byte array
[6 x i8]* @.fmt ; symbol (a pointer to the array, not an i8*)
i8* getelementptr ([6 x i8], [6 x i8]* @.fmt, i64 0, i64 0)
; "address of first byte of @.fmt"
The getelementptr (GEP) form is how you compute addresses without
emitting an actual instruction — it's a constant expression. We use
it to convert our array-of-bytes format string into a pointer to its
first byte, which is what printf expects.
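To show where the [6 x i8] comes from, here is a hedged Python sketch that turns a string into a global byte-array definition. The helper name `llvm_string_constant` is hypothetical, and the `private unnamed_addr constant` attributes are an assumption (they are what Clang typically emits for string literals); the chapter does not show how @.fmt is actually defined.

```python
def llvm_string_constant(name: str, text: str) -> str:
    """Render `text` as a NUL-terminated constant byte array in LLVM IR syntax."""
    data = text.encode("utf-8") + b"\x00"  # printf needs the trailing NUL
    body = ""
    for byte in data:
        ch = chr(byte)
        # Keep printable ASCII readable; escape everything else, plus the
        # quote and backslash characters, as a backslash and two hex digits.
        if 0x20 <= byte < 0x7F and ch not in '"\\':
            body += ch
        else:
            body += f"\\{byte:02X}"
    return f'{name} = private unnamed_addr constant [{len(data)} x i8] c"{body}"'


print(llvm_string_constant("@.fmt", "%lld\n"))
# @.fmt = private unnamed_addr constant [6 x i8] c"%lld\0A\00"
```

The array length is six because the string has four visible characters, one newline (\0A), and one terminating NUL (\00).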
GEP is one of the most misunderstood parts of LLVM IR. The
two-index form [6 x i8]* @.fmt, i64 0, i64 0 reads as: "starting
from @.fmt (which points to a [6 x i8]), advance by zero arrays,
then advance to byte index zero". The result is an i8*. Why two
indices? Because @.fmt is a pointer: the first index steps over whole
[6 x i8] objects starting at that pointer, just as p[0] would in C,
and the second selects an element inside the array. Neither index
loads from memory; GEP only computes an address. This catches
everyone the first time.
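The address arithmetic behind the two indices can be sketched in Python. The function name `gep_byte_offset` is hypothetical, and the sketch assumes a plain array of fixed-size elements with no padding, which holds for [6 x i8].

```python
def gep_byte_offset(array_len: int, elem_size: int, outer: int, inner: int) -> int:
    """Byte offset produced by a two-index GEP on a pointer to [array_len x T].

    `outer` steps over whole arrays; `inner` steps over elements inside the
    selected array. GEP only computes an address; it never reads memory.
    """
    array_size = array_len * elem_size
    return outer * array_size + inner * elem_size


print(gep_byte_offset(6, 1, 0, 0))  # 0: first byte of @.fmt
print(gep_byte_offset(6, 1, 0, 4))  # 4: the newline byte
print(gep_byte_offset(6, 1, 1, 0))  # 6: one whole array past @.fmt
```

With both indices zero the offset is zero, so the result has the same address as @.fmt; only its type changes, from [6 x i8]* to i8*.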
Why types matter even in a dynamic language
Even when you're compiling a dynamic language, the LLVM-level
types must be statically known. That's why cp-10 restricts to
numeric only: we can't emit add i64 %a, %b if %a might be a
string at runtime.
The two ways out are:
1. Uniform representation: pick one LLVM type (typically i64 or a tagged 64-bit word) and stuff every dynamic value into it.
2. Specialisation: generate different IR for different type profiles, possibly at JIT time. This is what V8 and LuaJIT do.
cp-14 takes path (1). cp-17's capstone explores path (2).
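One way a tagged 64-bit word can work is sketched below in Python. The actual Value layout used in cp-14 is not shown in this chapter, so the 3-bit low tag, the tag numbers, and the helper names (`box`, `tag_of`, `payload_of`) are all hypothetical.

```python
MASK64 = (1 << 64) - 1            # emulate a 64-bit machine word
TAG_BITS = 3                      # hypothetical: low 3 bits hold the type tag
TAG_MASK = (1 << TAG_BITS) - 1
TAG_NUMBER, TAG_BOOL, TAG_NIL = 0, 1, 2  # hypothetical tag assignment


def box(tag: int, payload: int) -> int:
    """Pack a tag and a signed payload (at most 61 bits) into one 64-bit word."""
    return ((payload << TAG_BITS) | tag) & MASK64


def tag_of(word: int) -> int:
    """Extract the type tag from the low bits."""
    return word & TAG_MASK


def payload_of(word: int) -> int:
    """Recover the signed payload with an arithmetic shift right."""
    signed = word - (1 << 64) if word >> 63 else word
    return signed >> TAG_BITS


w = box(TAG_NUMBER, -5)
print(tag_of(w), payload_of(w))   # 0 -5
b = box(TAG_BOOL, 1)
print(tag_of(b), payload_of(b))   # 1 1
```

Every value still has the single LLVM type i64, but generated code can now test the tag at runtime before choosing an operation, at the price of three fewer payload bits.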