Step 3 — The Recursive-Descent Parser

Goal: turn a token stream into the AST you designed in Step 2, respecting precedence and associativity.

The Grammar (Recap From CONCEPTS.md)

expr   = term   { ('+' | '-') term }
term   = factor { ('*' | '/') factor }
factor = NUMBER
       | '(' expr ')'
       | '-' factor

Three rules, three functions. Recursive descent is a literal transcription.

The Cursor Helpers

const Token& Parser::peek() const   { return tokens_[pos_]; }
const Token& Parser::advance()      { return tokens_[pos_++]; }
bool Parser::check(TokenKind k) const { return peek().kind == k; }
bool Parser::match(TokenKind k)     { if (check(k)) { advance(); return true; } return false; }

const Token& Parser::expect(TokenKind k, const char* msg) {
    if (!check(k)) throw ParseError(/* … */);
    return advance();
}

Five helpers cover every recursive-descent parser ever written:

  • peek — look without consuming
  • advance — consume one
  • check — peek's kind matches?
  • match — if so, consume and return true
  • expect — must match, or throw

This vocabulary scales: Clang's parser uses essentially the same primitives, just with richer diagnostics in expect.

parseExpr — Lowest Precedence

ExprPtr Parser::parseExpr() {
    ExprPtr left = parseTerm();
    while (check(TokenKind::Plus) || check(TokenKind::Minus)) {
        TokenKind op = peek().kind; advance();
        ExprPtr right = parseTerm();
        left = std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));
    }
    return left;
}

Reads aloud: "parse a term, then while the next token is + or -, consume the operator, parse another term, and wrap the previous result as the new left."

Why The While Loop = Left Associativity

Trace 5 - 3 - 1:

Initial:  left = parseTerm() = (5)

Iter 1:   token = -, consume
          right = parseTerm() = (3)
          left = BinaryExpr(-, (5), (3))            ← (5 - 3)

Iter 2:   token = -, consume
          right = parseTerm() = (1)
          left = BinaryExpr(-, (5-3), (1))          ← ((5 - 3) - 1)

Done.     evaluate → 1

Each iteration nests the previous left on the inside. The tree leans left. That's left-associativity.

To make - right-associative (5 - 3 - 1 = 5 - (3 - 1) = 3), you'd write:

ExprPtr right = parseExpr();    // recurse instead of loop
return std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));

Recursion-on-the-right ⇒ right-associativity. This is exactly how we'd handle ^ (exponentiation) in Step 6's extension.

parseTerm — Same Pattern, Higher Precedence

ExprPtr Parser::parseTerm() {
    ExprPtr left = parseFactor();
    while (check(TokenKind::Star) || check(TokenKind::Slash)) {
        TokenKind op = peek().kind; advance();
        ExprPtr right = parseFactor();
        left = std::make_unique<BinaryExpr>(op, std::move(left), std::move(right));
    }
    return left;
}

Identical shape; the only changes are the operators consumed and the next-level call (parseFactor). Adding a new precedence level (modulo, comparison, etc.) means inserting another function in the call chain.

Why This = Higher Precedence Than +/-

When parseExpr calls parseTerm, control descends into parseTerm's while loop. That loop consumes all the * and / operators before returning. By the time parseTerm returns to parseExpr, the multiplication has already been packaged into a sub-tree. parseExpr then attaches that sub-tree as either left or right of its +/- node.

* binds tighter because parseTerm "grabs" its operators first — they're inside its loop, not parseExpr's.

parseFactor — Atoms and Recursion Back to parseExpr

ExprPtr Parser::parseFactor() {
    if (check(TokenKind::Number)) {
        double v = peek().value; advance();
        return std::make_unique<NumberExpr>(v);
    }
    if (match(TokenKind::LParen)) {
        ExprPtr inner = parseExpr();              // back to the top
        expect(TokenKind::RParen, "expected ')'");
        return inner;
    }
    if (check(TokenKind::Minus)) {
        advance();
        ExprPtr operand = parseFactor();          // right-recursive
        return std::make_unique<UnaryExpr>(TokenKind::Minus, std::move(operand));
    }
    if (check(TokenKind::Error)) { /* propagate lex error */ }
    throw ParseError(/* … */);
}

Three productions:

  1. NUMBER — consume and wrap.
  2. '(' expr ')' — consume (, recurse all the way back to parseExpr (precedence resets!), expect ). This is how parens override precedence.
  3. '-' factor — unary minus. --5 works because the recursion is on parseFactor, not directly creating a Number.

Why Parens Override Precedence

(1 + 2) * 3:

  1. parseExprparseTermparseFactor
  2. parseFactor sees (, calls parseExpr again.
  3. That parseExpr (the inner one) parses 1 + 2BinaryExpr(+, 1, 2).
  4. Outer parseFactor consumes ), returns BinaryExpr(+, 1, 2).
  5. Control returns to outer parseTerm. Its left is (1+2). Loop sees *, parses 3. Builds BinaryExpr(*, (1+2), 3).

The parens temporarily gave us "expr-level reset" inside what would have been factor-level parsing. The grammar isn't doing anything magical; the recursion-back-to-parseExpr is.

Trip Through The Full Pipeline

For 2 * (3 + 4):

Tokens:    NUMBER(2) STAR LPAREN NUMBER(3) PLUS NUMBER(4) RPAREN EOF

parseExpr
 └─ parseTerm
     ├─ parseFactor → NumberExpr(2)
     ├─ sees STAR, consume; parseFactor:
     │   ├─ sees LPAREN, consume
     │   ├─ parseExpr (inner)
     │   │   └─ parseTerm
     │   │       └─ parseFactor → NumberExpr(3)
     │   │       (no */)
     │   │   ├─ sees PLUS, consume; parseTerm → parseFactor → NumberExpr(4)
     │   │   └─ returns BinaryExpr(+, 3, 4)
     │   └─ consume RPAREN; returns BinaryExpr(+, 3, 4)
     └─ wrap into BinaryExpr(*, 2, BinaryExpr(+, 3, 4))

Result:    BinaryExpr(*, NumberExpr(2), BinaryExpr(+, NumberExpr(3), NumberExpr(4)))

Error Cases

InputWhat Happens
""parse() sees Eof immediately → throws "empty input"
1 +parseExpr parses 1, consumes +, calls parseTermparseFactor which sees Eof and throws
(1 + 2parseFactor matches (, recurses, then expect(RParen) fails → throws "expected ')'"
1 2parseExpr parses 1, no +/-, returns. parse() checks for Eof, finds NUMBER(2) → throws "unexpected token"
1 / 0parses fine; the evaluator throws (Step 4)
1 + abclexer emits Error token; parseFactor propagates it

All five error categories live in this lab. Phase 9 (Diagnostics) replaces these one-line throws with Clang-style source spans + fix-it hints, but the detection logic is identical.

Why Is This Called "LL(1)"?

  • Left-to-right scan of input.
  • Leftmost derivation produced.
  • 1 token of lookahead (we only ever call peek() once per decision; never peek(2)).

LL(1) is the most common parsing class for hand-written compilers. Some real languages need LL(2) or even unbounded lookahead (C++ template parsing) — Clang uses tentative parsing in those spots.

Next

04-the-evaluator.md — walk the tree and produce a number.