04 — Type Feedback: Foundations of Inline Caching
ml_record_int_arg is the most interesting function in runtime.cpp. It's
called from JIT'd code at every user function entry, once per parameter:
// In ir_emit.cpp:
for (auto& a : fn->args()) {
    auto* slot = b.CreateAlloca(i64(), nullptr, ...);
    b.CreateStore(&a, slot);
    b.CreateCall(fn_record, {ConstantInt::get(i64(), nextSite++)}); // ← here
    ...
}
Each parameter gets a unique site id, assigned at compile time. The runtime
keeps one count per site in an unordered_map<site_id, count>:
extern "C" void ml_record_int_arg(int64_t site) {
    g_intCounts[site] += 1;
}
That's a one-line implementation, but it has the same basic architecture as the profiling behind V8's inline caches, HotSpot's TypeProfile, and LuaJIT's traces. The pattern is:
1. JIT'd code reports observations to the host.
2. The host accumulates statistics keyed by call site.
3. When some heuristic fires (a counter exceeds a threshold, or the observed type distribution is narrow enough), the host invalidates the current compiled code and recompiles with the observed types baked in as assumptions.
4. The new code adds a guard: if the assumption is violated, it bails out to the generic path.
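The whole loop, including the recompile and guard steps that cp-17 does not implement, can be sketched in plain C++ without any LLVM. Everything below is hypothetical and exists only for illustration: Value, SiteProfile, ProfiledFunction and kHotThreshold are not names from runtime.cpp, and "recompiling" is modelled as flipping a flag rather than emitting new code.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical tagged value; the real runtime's representation may differ.
enum class Tag { Int, Str };
struct Value { Tag tag; int64_t i; };

// What the host has observed at one site.
struct SiteProfile { int64_t intHits = 0; int64_t otherHits = 0; };
static std::unordered_map<int64_t, SiteProfile> g_profiles;
constexpr int64_t kHotThreshold = 3;  // arbitrary value chosen for the sketch

// Generic path: inspects the tag on every call.
int64_t generic_double(const Value& v) {
    return v.tag == Tag::Int ? v.i * 2 : 0;
}

struct ProfiledFunction {
    int64_t site;
    bool specialised = false;  // stands in for "recompiled with int assumed"
    int64_t bailouts = 0;

    explicit ProfiledFunction(int64_t s) : site(s) {}

    int64_t call(const Value& v) {
        if (specialised) {
            // Guard: the specialised code assumed an int argument.
            if (v.tag != Tag::Int) {
                bailouts += 1;
                return generic_double(v);  // bail to the generic path
            }
            assert(v.tag == Tag::Int);
            return v.i * 2;  // fast path, no further tag checks
        }
        // Report the observation and accumulate it per site.
        SiteProfile& p = g_profiles[site];
        if (v.tag == Tag::Int) p.intHits += 1; else p.otherHits += 1;
        // Heuristic: hot enough, and only ints seen so far.
        if (p.intHits > kHotThreshold && p.otherHits == 0) specialised = true;
        return generic_double(v);
    }
};
```

With a threshold of 3, the fourth int call flips the function to the specialised path; a later string argument fails the guard, is counted as a bailout, and is still answered correctly by the generic path. A real system would also decide whether repeated bailouts should throw the specialised code away.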
cp-17 stops at step 2: we record but never recompile. The seventh test
verifies that the mechanism works end to end: a JIT'd print(f(5)) call
increments the counter for site #1 (the first parameter of the first
user-defined function), and the test checks that counter from C++ after the
JIT'd code returns.
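The shape of that check can be sketched without the JIT. This is not the project's actual test: jitted_f is a hand-written stand-in for the compiled f, ml_int_count is a hypothetical accessor, and the map's key and value types are assumed to be int64_t.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Assumed shape of the runtime's counter table.
static std::unordered_map<int64_t, int64_t> g_intCounts;

extern "C" void ml_record_int_arg(int64_t site) {
    g_intCounts[site] += 1;
}

// Hypothetical accessor: reads a counter without inserting a zero entry.
int64_t ml_int_count(int64_t site) {
    auto it = g_intCounts.find(site);
    return it == g_intCounts.end() ? 0 : it->second;
}

// Stand-in for the JIT'd f(x): the entry sequence reports site 1,
// then the (arbitrary) body runs.
int64_t jitted_f(int64_t x) {
    ml_record_int_arg(1);
    return x + 1;
}

// Mirrors the test: call once, then inspect the counter from the host side.
void test_site_one_is_recorded() {
    int64_t before = ml_int_count(1);
    jitted_f(5);
    assert(ml_int_count(1) == before + 1);
}
```

Reading through find rather than operator[] keeps the check from creating entries for sites that were never hit, so an untouched site still reports zero.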
Why this needs the runtime-symbol plumbing from step 03
The recording call is just call void @ml_record_int_arg(i64 1). There's no
LLVM magic — it's an external function call. The reason it works is that we
already taught ORC that the name ml_record_int_arg resolves to a real C
function in our process. The whole "type feedback" feature is purely:
- IR emitter inserts a function call at the right place.
- Host registers the target.
- Host reads the counter later.
The profiling subsystems of most dynamic-language VMs are layered on the same idea.
What you'd add next
- Per-type counters: distinguish int / string / object / null. We have only one bucket; a real system stores a small set with frequencies.
- Site-keyed cache slots: replace counters with a small struct {type_tag, cached_method, miss_count} per site. That's an inline cache.
- Tiered compilation: once a counter crosses a threshold, queue the function for recompilation at a higher tier (e.g. with arguments specialised to i64). Keep the old code as the bail target.
- Deoptimisation: when an assumption fails at runtime, jump from the optimised frame back into the unoptimised one with reconstructed state. This is the hardest part and a topic in its own right.
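The first two items can be sketched in a few lines. This is one possible layout, not something cp-17 contains: the TypeTag values follow the int / string / object / null split above, the CacheSlot fields follow the struct named in the list, and slow_lookup with its two target functions is a made-up stand-in for a real method lookup.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class TypeTag : uint8_t { Int, String, Object, Null, Count };

// Per-type counters: one bucket per tag instead of a single count.
struct TypeHistogram {
    std::array<uint32_t, static_cast<std::size_t>(TypeTag::Count)> hits{};

    void record(TypeTag t) { hits[static_cast<std::size_t>(t)] += 1; }

    uint32_t total() const {
        uint32_t sum = 0;
        for (uint32_t h : hits) sum += h;
        return sum;
    }

    // True if at least one observation was made and all of them were `t`.
    bool is_monomorphic(TypeTag t) const {
        return total() > 0 && hits[static_cast<std::size_t>(t)] == total();
    }
};

using Method = int64_t (*)(int64_t);

// Site-keyed cache slot with the three fields named in the list above.
struct CacheSlot {
    TypeTag type_tag = TypeTag::Null;
    Method cached_method = nullptr;
    uint32_t miss_count = 0;
};

// Hypothetical targets and slow-path lookup, for illustration only.
int64_t int_negate(int64_t x) { return -x; }
int64_t generic_identity(int64_t x) { return x; }
Method slow_lookup(TypeTag t) {
    return t == TypeTag::Int ? int_negate : generic_identity;
}

// Monomorphic inline cache: a hit skips the lookup, a miss refills the slot.
int64_t dispatch(CacheSlot& slot, TypeTag t, int64_t payload) {
    if (slot.cached_method != nullptr && slot.type_tag == t) {
        return slot.cached_method(payload);  // cache hit
    }
    slot.miss_count += 1;
    slot.type_tag = t;
    slot.cached_method = slow_lookup(t);
    assert(slot.cached_method != nullptr);
    return slot.cached_method(payload);
}
```

A slot that keeps missing is a sign the site is polymorphic; miss_count is the signal a real system might use to widen the cache to several entries or to give up on specialising that site.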