Step 1 — The Lexer

Goal: turn a string of characters into a sequence of tokens.

Why A Lexer?

Imagine writing a parser that operated directly on characters. Every parsing rule would have to also skip whitespace, parse numeric literals, distinguish == from =, etc. The result would be unreadable.

Splitting into lexer → parser is the same principle as separating tokenization from semantics in any text processing — it's why wc works on lines, not bytes. Tokens are the unit the grammar cares about.

For arithmetic, our token vocabulary is tiny:

Token	Matches
`Number`	one or more digits, optionally followed by `.` and more digits
`Plus`, `Minus`, `Star`, `Slash`	`+`, `-`, `*`, `/`
`LParen`, `RParen`	`(`, `)`
`Eof`	synthetic end-of-stream marker
`Error`	anything else (e.g., letters), carrying a message

The DFA Mental Model

Even our tiny lexer is a deterministic finite automaton (DFA):

            digit                       digit
        ┌────────┐                  ┌─────────┐
        ▼        │                  ▼         │
  [START] ──────► [IN_NUMBER] ──.──► [AFTER_DOT] ──┐
        │             │                              │
        │           (other)        (other)           │
        │             ▼                ▼             │
        │         emit Number       emit Number      │
        │                                            │
        │ +    -      *    /   (    )                │
        ├────────────────────────────                │
        │             │                              │
        ▼             ▼                              │
   emit Plus     emit RParen (etc.)                  │
        │                                            │
        │ ws                                         │
        ▼                                            │
   skip → loop                                       │
        │                                            │
        │ EOF                                        │
        ▼                                            │
   emit Eof ◄───────────────────────────────────────┘

Position into the source string serves as our state. We never need a separate "state" variable for the arithmetic lexer; for more complex lexers (Python's indentation, JavaScript's >> vs >>=) you'd track explicit modes.

Single-Character Lookahead

char Lexer::peek()    const { return atEnd() ? '\0' : src_[pos_]; }
char Lexer::advance()       { return src_[pos_++]; }

peek() looks at the current character without consuming it. advance() returns it and moves on. Almost every hand-written lexer follows this pattern.

For most operators we need zero lookahead (+ is Plus, full stop). For numbers we need a single character of lookahead to decide whether more digits follow. In later labs we'll add 2-character lookahead for == vs =.

The Number Sub-DFA

Token Lexer::lexNumber() {
    std::size_t start = pos_;
    while (!atEnd() && std::isdigit(peek())) advance();
    if (!atEnd() && peek() == '.') {
        advance();
        while (!atEnd() && std::isdigit(peek())) advance();
    }
    std::string text = src_.substr(start, pos_ - start);
    Token t;
    t.kind = TokenKind::Number;
    t.text = text;
    t.value = std::stod(text);
    return t;
}

Reads:

As many digits as possible.
If a . follows, consume it and as many more digits as possible.
Slice the substring; convert to double via std::stod.

This accepts 42, 3.14, 0, 0.5. It rejects .5 (no leading digit), 1. (we accept this with empty fraction; harmless), 1e5 (scientific notation — left as Step 6 extension).

Why std::stod instead of manual parsing? It's correct (handles negative zero, denormals, exponents we plan to add later), and the lexer is not the performance bottleneck. Compare to Clang, which has a hand-tuned APFloat::convertFromString because Clang's lexer is sometimes the hot path.

Eager vs Lazy

std::vector<Token> Lexer::tokenize() { ... }      // eager

We lex the entire input into a vector<Token> upfront. The parser then consumes from this vector.

Why eager? Simpler control flow. The parser doesn't need to hold a reference to the lexer.

Why production compilers go lazy: memory. Lexing a 100k-line C++ file into a vector of tokens is megabytes; lazy lexing keeps the working set small. Clang is lazy.

For arithmetic — and for everything through cp-09 — eager is fine.

Error Tokens

default: {
    Token e;
    e.kind = TokenKind::Error;
    e.text = std::string("unexpected character '") + c + "'";
    return e;
}

We don't throw from the lexer. We emit an Error token. The parser then decides what to do (we throw there). This separation lets a future error-recovery layer skip the bad token and keep parsing — important for IDE diagnostics where you want to see all errors at once, not just the first.

Try It

After building (next step covers CMake), the lexer is hidden behind eval. To inspect tokens directly, you can add a quick dump_tokens helper:

for (auto& t : Lexer("1 + 2 * 3").tokenize())
    std::cout << kindName(t.kind) << " " << t.text << "\n";

Expected:

NUMBER 1
PLUS +
NUMBER 2
STAR *
NUMBER 3
EOF

→ 02-the-ast.md — design the tree the parser will build.

Compilers & Parser Engineer