04 — Type Feedback: Foundations of Inline Caching

ml_record_int_arg is the most interesting function in runtime.cpp. It's called from JIT'd code at every user function entry, once per parameter:

// In ir_emit.cpp:
for (auto& a : fn->args()) {
    auto* slot = b.CreateAlloca(i64(), nullptr, ...);
    b.CreateStore(&a, slot);
    b.CreateCall(fn_record, {ConstantInt::get(i64(), nextSite++)}); // ← here
    ...
}

Each parameter gets a unique site id, assigned at compile time by incrementing nextSite. The runtime keeps one counter per site in an unordered_map keyed by site id:

extern "C" void ml_record_int_arg(int64_t site) {
    g_intCounts[site] += 1;
}
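The compile-time side of site numbering can be sketched as a small table that hands out ids and remembers what each one refers to, so the host can later interpret a counter. This is only an illustration: SiteInfo, SiteTable, allocate, and lookup are hypothetical names, not part of ir_emit.cpp, and starting ids at 1 is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record describing what a site id refers to.
struct SiteInfo {
    std::string function;  // name of the user-defined function
    unsigned param;        // zero-based parameter index
};

// Hypothetical allocator: hands out consecutive site ids at compile time.
// Assumption: ids start at 1, so the first parameter of the first function is site 1.
class SiteTable {
public:
    std::int64_t allocate(const std::string& function, unsigned param) {
        sites_.push_back(SiteInfo{function, param});
        return static_cast<std::int64_t>(sites_.size());  // first id is 1
    }

    // Returns nullptr for ids that were never allocated.
    const SiteInfo* lookup(std::int64_t site) const {
        if (site < 1 || site > static_cast<std::int64_t>(sites_.size())) return nullptr;
        return &sites_[static_cast<std::size_t>(site - 1)];
    }

private:
    std::vector<SiteInfo> sites_;
};
```

With a table like this, a nonzero counter for a given site can be reported as "parameter N of function F" rather than as a bare number.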

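The implementation is a single line, but it follows the same basic pattern as the profiling machinery in production JITs such as V8 (inline caches and feedback vectors), HotSpot (per-method type profiles), and LuaJIT (hot counters that trigger trace recording). The pattern is: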

  1. JIT'd code reports observations to the host.
  2. The host accumulates statistics keyed by call site.
  3. When some heuristic fires (counter > threshold, distribution narrow enough), the host invalidates the current compile and recompiles with the observed types baked in as assumptions.
  4. The new code adds a guard: if the assumption is violated, it bails back to the generic path.
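The four steps can be modelled in a few dozen lines without any JIT at all. The sketch below is a toy: "recompiling" is just flipping a flag, and every name in it (Tag, Value, SiteProfile, Function, kHotThreshold, generic_double, call_double) is hypothetical. The threshold of 3 is arbitrary.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy model of the four steps. Every name here is hypothetical.
enum class Tag { Int, Str };

struct Value {
    Tag tag;
    std::int64_t i;  // meaningful only when tag == Tag::Int
};

struct SiteProfile {
    std::int64_t total = 0;  // observations at this site
    std::int64_t ints = 0;   // how many of them were ints
};

constexpr std::int64_t kHotThreshold = 3;  // assumed, arbitrary

struct Function {
    std::unordered_map<std::int64_t, SiteProfile> profile;  // step 2: stats keyed by site
    bool specialised = false;                               // step 3 flips this
    int bailouts = 0;
};

// Generic path: handles any tag and reports what it saw (step 1).
std::int64_t generic_double(Function& f, std::int64_t site, const Value& v) {
    SiteProfile& p = f.profile[site];
    p.total += 1;
    if (v.tag == Tag::Int) p.ints += 1;
    // Step 3: hot and only ints seen so far, so "recompile" with the int assumption.
    if (p.total >= kHotThreshold && p.ints == p.total) f.specialised = true;
    return v.tag == Tag::Int ? v.i * 2 : 0;
}

// Entry point: runs the specialised code when it exists (step 4).
std::int64_t call_double(Function& f, std::int64_t site, const Value& v) {
    if (f.specialised) {
        if (v.tag == Tag::Int) return v.i * 2;  // guard holds: fast path, no profiling
        f.specialised = false;                  // guard failed: drop the assumption
        f.bailouts += 1;
    }
    return generic_double(f, site, v);          // bail to the generic path
}
```

After three int calls the function is marked specialised; the first non-int argument fails the guard, counts one bailout, and falls back to the generic path. A real system would jump between two separately compiled bodies rather than test a flag.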

cp-17 stops at step 2 of this pattern: we record but never recompile. The seventh test verifies that the mechanism works end-to-end: a JIT'd print(f(5)) call increments the counter for site #1 (the first parameter of the first user-defined function), and the host can read that count from C++ after the JIT'd code returns.

Why this needs the runtime-symbol plumbing from step 03

The recording call is just call void @ml_record_int_arg(i64 1). There's no LLVM magic; it's an ordinary call to an external function. It works because we already taught ORC that the name ml_record_int_arg resolves to a real C function in our process. The whole "type feedback" feature comes down to:

  • IR emitter inserts a function call at the right place.
  • Host registers the target.
  • Host reads the counter later.

Most dynamic-language VMs build their profiling subsystems on the same idea: compiled code reports what it sees, and the host keeps the statistics.
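The three bullets above can be imitated without LLVM. The sketch below replaces ORC with a plain name-to-address map, so it shows the shape of the plumbing, not the ORC API. g_symbols, register_runtime_symbols, simulated_jit_call, and read_count are hypothetical names; only ml_record_int_arg and g_intCounts come from the text, and the map's exact types are assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Counter storage as described in the text; the exact types are an assumption.
static std::unordered_map<std::int64_t, std::int64_t> g_intCounts;

extern "C" void ml_record_int_arg(std::int64_t site) { g_intCounts[site] += 1; }

// Pointer type of the hook, taken from the function itself.
using HookFn = decltype(&ml_record_int_arg);

// Stand-in for the JIT's symbol table: name -> address in the host process.
// A real build would register the symbol with ORC instead.
static std::unordered_map<std::string, HookFn> g_symbols;

// "Host registers the target."
void register_runtime_symbols() {
    g_symbols["ml_record_int_arg"] = &ml_record_int_arg;
}

// Stand-in for the emitted `call void @ml_record_int_arg(i64 site)`.
// Returns false when the name does not resolve.
bool simulated_jit_call(const std::string& name, std::int64_t site) {
    auto it = g_symbols.find(name);
    if (it == g_symbols.end()) return false;  // unresolved symbol
    it->second(site);
    return true;
}

// "Host reads the counter later." Does not create an entry for unseen sites.
std::int64_t read_count(std::int64_t site) {
    auto it = g_intCounts.find(site);
    return it == g_intCounts.end() ? 0 : it->second;
}
```

The point of the indirection is that the emitted code only knows a name. Until the host registers that name, the call cannot resolve; once it does, the call is an ordinary function call into the process.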

What you'd add next

  • Per-type counters: distinguish int / string / object / null. We have only one bucket; a real system stores a small set with frequencies.
  • Site-keyed cache slots: replace counters with a small struct {type_tag, cached_method, miss_count} per site. That's an inline cache.
  • Tiered compilation: once a counter crosses a threshold, queue the function for recompilation at a higher tier (e.g. with arguments specialised to i64). Keep the old code as the bail target.
  • Deoptimisation: when an assumption fails at runtime, jump from the optimised frame back into the unoptimised one with reconstructed state. This is the hardest part and a topic in its own right.
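The first two items are small enough to sketch directly. The code below is one possible shape under stated assumptions, not a design taken from this project: TypeTag, TypeCounts, CacheSlot, Method, slow_lookup, and dispatch are all hypothetical, and the "methods" are trivial functions on int64_t so the example stays self-contained.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical type tags, matching the int / string / object / null split.
enum class TypeTag : std::uint8_t { Int, String, Object, Null };
constexpr std::size_t kNumTags = 4;

// Per-type counters: one bucket per tag instead of a single count.
struct TypeCounts {
    std::array<std::int64_t, kNumTags> seen{};  // zero-initialised

    void record(TypeTag t) { seen[static_cast<std::size_t>(t)] += 1; }

    std::int64_t total() const {
        std::int64_t sum = 0;
        for (std::int64_t n : seen) sum += n;
        return sum;
    }

    // True when at least one value was seen and all of them had the same tag.
    bool monomorphic() const {
        const std::int64_t t = total();
        if (t == 0) return false;
        for (std::int64_t n : seen) {
            if (n == t) return true;
        }
        return false;
    }
};

using Method = std::int64_t (*)(std::int64_t);

// Inline-cache slot with the three fields named in the list above.
struct CacheSlot {
    TypeTag type_tag = TypeTag::Null;
    Method cached_method = nullptr;  // nullptr means the slot is empty
    std::int64_t miss_count = 0;
};

// Hypothetical methods and slow lookup, standing in for real method resolution.
std::int64_t int_negate(std::int64_t x) { return -x; }
std::int64_t fallback_identity(std::int64_t x) { return x; }

Method slow_lookup(TypeTag t) {
    return t == TypeTag::Int ? &int_negate : &fallback_identity;
}

// Dispatch through the slot: a hit when the tag matches, otherwise refill.
std::int64_t dispatch(CacheSlot& slot, TypeTag t, std::int64_t arg) {
    if (slot.cached_method == nullptr || slot.type_tag != t) {
        slot.miss_count += 1;
        slot.type_tag = t;
        slot.cached_method = slow_lookup(t);
    }
    return slot.cached_method(arg);
}
```

A monomorphic TypeCounts is the kind of "distribution narrow enough" signal a tiering heuristic could use, and a rising miss_count is a hint that a site is seeing more than one type and should not be specialised.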