Step 2 — Types, values, and constants

The type system

LLVM IR types come in two flavours: primitive and derived.

Primitive

  • Integer: i1, i8, i16, i32, i64, i128, and in general iN for an arbitrary bit width N. There is no separate bool type; i1 plays that role.
  • Floating point: half (16-bit), float (32-bit), double (64-bit), x86_fp80 (80-bit extended) and fp128 (128-bit).
  • Void: void (only valid as a function return type).
  • Label: label (block reference; rarely written explicitly).
  • Metadata: metadata (debug info etc.).

Derived

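  • Pointer: T* (legacy typed pointers) or ptr (opaque pointers, the default since LLVM 15). We use T* here because it's clearer for teaching. Be aware that LLVM 15 and 16 retain typed-pointer support, but LLVM 17 and later accept only ptr, so the T* listings here need an older toolchain or a mechanical rewrite to ptr.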
  • Array: [N x T]. Fixed-size, allocated as a value.
  • Vector: <N x T>. SIMD lane group.
  • Struct: { T1, T2, ... } (literal) or %S (named).
  • Function: R (A1, A2, ...).

We use only i64, i1, i8*, and one array ([6 x i8] for the format string).
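To see where the [6 x i8] type comes from, here is a minimal Python sketch that renders a string as an LLVM byte-array constant. The helper name ir_bytes_constant is hypothetical and not part of the tutorial's emitter; it only assumes that the string is stored with a trailing NUL byte, as C's printf requires.

```python
def ir_bytes_constant(text):
    """Render text plus a trailing NUL as an LLVM `[N x i8] c"..."` constant.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-8") + b"\x00"
    pieces = []
    for byte in data:
        ch = chr(byte)
        # Printable ASCII is written literally, except the quote and the
        # backslash; every other byte becomes a backslash plus two hex digits.
        if 0x20 <= byte < 0x7F and ch not in ('"', "\\"):
            pieces.append(ch)
        else:
            pieces.append("\\%02X" % byte)
    body = "".join(pieces)
    return '[%d x i8] c"%s"' % (len(data), body)


print(ir_bytes_constant("%lld\n"))
```

The six bytes are %, l, l, d, the newline (0A) and the terminating NUL (00), which is why the array length is 6 rather than 5.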

Value categories

Every operand in LLVM IR is one of:

  • A constant: 42, true, 0.5, or a getelementptr of a global.
  • A register: %name or %number, the result of a previous instruction in the same function. Function parameters are also registers (%arg0 …).
  • A global: @name, a top-level symbol. Includes function references, mutable globals, and constant data like our @.fmt.

The leading sigil is meaningful: % is local to a function, @ is module-global. There is no other namespace.
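The sigil rule is simple enough to express as a few lines of Python. This is only an illustrative sketch: operand_kind is a hypothetical helper, and it follows this section's three-way split rather than LLVM's internal class hierarchy.

```python
def operand_kind(operand):
    """Classify an operand string by its leading sigil (hypothetical helper)."""
    if operand.startswith("%"):
        return "register"   # local to the enclosing function
    if operand.startswith("@"):
        return "global"     # module-level symbol
    return "constant"       # literals and constant expressions


for text in ["%t0", "@.fmt", "42", "true"]:
    print(text, "->", operand_kind(text))
```

A constant expression such as a getelementptr of a global starts with a keyword rather than a sigil, so it falls into the "constant" bucket even though it mentions a global inside.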

How we lower MiniLang values

We picked the simplest possible mapping: everything is i64.

MiniLang   LLVM     Notes
Number     i64      We discard the fractional part on purpose.
Bool       i64      0 or 1. We zext i1 ... to i64 after icmp.
Nil        i64 0    Same as false.
String     (none)   Not supported; emitter errors out.

This is a pedagogical choice: it makes the IR easy to read and type-uniform, at the cost of disallowing mixed-type expressions. cp-14 introduces a real boxed Value representation that supports the full type set.
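The mapping can be sketched in Python as follows. This is not the tutorial's emitter: lower_literal and emit_compare are hypothetical names, Python values stand in for MiniLang literals, and "discard the fractional part" is assumed to mean truncation toward zero.

```python
def lower_literal(value):
    """Return the i64 operand text for a MiniLang literal (illustrative)."""
    if value is None:                 # Nil lowers to 0, same as false
        return "0"
    if isinstance(value, bool):       # test bool first: bool is an int subclass
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return str(int(value))        # int() truncates toward zero
    raise TypeError("unsupported literal: %r" % (value,))


def emit_compare(pred, lhs, rhs, tmp, dest):
    """Emit icmp (which yields i1) followed by zext back to the uniform i64."""
    return [
        "%s = icmp %s i64 %s, %s" % (tmp, pred, lhs, rhs),
        "%s = zext i1 %s to i64" % (dest, tmp),
    ]


print(lower_literal(3.7))
print("\n".join(emit_compare("slt", "%a", "%b", "%t0", "%t1")))
```

The zext step is what keeps every MiniLang value at i64: icmp always produces an i1, so without the widening a comparison result could not feed a later add i64.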

Constants

A constant in LLVM IR has both a type and a value:

i64 42                                  ; integer
[6 x i8] c"%lld\0A\00"                  ; byte array
@.fmt                                   ; symbol  (type is [6 x i8]*)
getelementptr ([6 x i8], [6 x i8]* @.fmt, i64 0, i64 0)
                                        ; "address of first byte of @.fmt"

The getelementptr (GEP) form is how you compute addresses without emitting an actual instruction — it's a constant expression. We use it to convert our array-of-bytes format string into a pointer to its first byte, which is what printf expects.
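As a rough sketch of how an emitter might use this constant expression, the Python below builds a printf call whose first argument is the GEP. The helper names are hypothetical, and the call syntax assumes typed pointers and a module-level declaration of the form declare i32 @printf(i8*, ...), neither of which this section shows.

```python
def fmt_pointer(global_name, length):
    """Typed operand pointing at the first byte of a [length x i8] global."""
    array_ty = "[%d x i8]" % length
    return "i8* getelementptr (%s, %s* %s, i64 0, i64 0)" % (
        array_ty, array_ty, global_name)


def emit_print_i64(value):
    """Emit a variadic printf call that prints one i64 register or constant."""
    # Assumes @.fmt is the 6-byte format string from the listing above.
    return "call i32 (i8*, ...) @printf(%s, i64 %s)" % (
        fmt_pointer("@.fmt", 6), value)


print(emit_print_i64("%x"))
```

Because the GEP sits inside the operand list, no separate instruction is needed to produce the i8* before the call.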

GEP is one of the most misunderstood parts of LLVM IR. The two-index form [6 x i8]* @.fmt, i64 0, i64 0 reads as: "starting from @.fmt (which points to a [6 x i8]), advance by zero whole arrays, then select byte index zero within that array". The result is an i8*. Why two indices? Because @.fmt is a pointer: the first index steps over the pointer itself, as if it were the base of an array of [6 x i8] objects, and the second indexes into the array it points at. Neither index dereferences anything; GEP never touches memory and only does address arithmetic. This catches everyone the first time.
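The arithmetic behind the two indices can be written out directly. The Python sketch below is an illustration only (gep_byte_offset is a hypothetical name) and covers just the pointer-to-array-of-i8 case discussed here.

```python
def gep_byte_offset(first, second, array_len=6, elem_size=1):
    """Byte offset a two-index GEP adds to a pointer to [array_len x i8]."""
    array_size = array_len * elem_size
    # The first index steps over whole arrays; the second steps over elements
    # inside the selected array. No memory is read at any point.
    return first * array_size + second * elem_size


print(gep_byte_offset(0, 0))   # first byte of @.fmt
print(gep_byte_offset(0, 4))   # the newline byte
print(gep_byte_offset(1, 0))   # one whole array past @.fmt
```

With indices 0 and 0 the offset is zero, so the resulting address equals @.fmt itself; only the type changes, from [6 x i8]* to i8*.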

Why types matter even in a dynamic language

Even when you're compiling a dynamic language, the LLVM-level types must be statically known. That's why cp-10 restricts to numeric only: we can't emit add i64 %a, %b if %a might be a string at runtime.

The two ways out are:

  1. Uniform representation — pick one LLVM type (typically i64 or a tagged 64-bit) and stuff every dynamic value into it.
  2. Specialisation — generate different IR for different type profiles, possibly at JIT time. This is what V8 and LuaJIT do.

cp-14 takes path (1). cp-17's capstone explores path (2).
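To make "tagged 64-bit" concrete, here is one possible scheme sketched in Python. It is an assumption for illustration, not the representation cp-14 uses: the 3-bit tag width, the tag numbering, and the function names are all invented here.

```python
TAG_BITS = 3                      # assumed tag width
TAG_INT, TAG_BOOL, TAG_NIL = 0, 1, 2
MASK64 = (1 << 64) - 1


def box(tag, payload):
    """Pack a small signed payload and a tag into one 64-bit word."""
    return ((payload << TAG_BITS) | tag) & MASK64


def tag_of(word):
    """Extract the tag from the low bits."""
    return word & ((1 << TAG_BITS) - 1)


def payload_of(word):
    """Recover the signed payload with an arithmetic right shift."""
    if word >= 1 << 63:           # reinterpret as a signed 64-bit value
        word -= 1 << 64
    return word >> TAG_BITS


w = box(TAG_INT, -5)
print(tag_of(w), payload_of(w))
```

Every value still has the single LLVM type i64, so the emitted IR stays statically typed; the price is that each operation must check or strip the tag at runtime, and the payload loses the bits spent on the tag.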