Step 1 — The Lexer
Goal: turn a string of characters into a sequence of tokens.
Why A Lexer?
Imagine writing a parser that operated directly on characters. Every parsing rule would have to also skip whitespace, parse numeric literals, distinguish == from =, etc. The result would be unreadable.
Splitting into lexer → parser is the same principle as separating tokenization from semantics in any text processing — it's why wc works on lines, not bytes. Tokens are the unit the grammar cares about.
For arithmetic, our token vocabulary is tiny:
| Token | Matches |
|---|---|
Number | one or more digits, optionally followed by . and more digits |
Plus, Minus, Star, Slash | +, -, *, / |
LParen, RParen | (, ) |
Eof | synthetic end-of-stream marker |
Error | anything else (e.g., letters), carrying a message |
The DFA Mental Model
Even our tiny lexer is a deterministic finite automaton (DFA):
digit digit
┌────────┐ ┌─────────┐
▼ │ ▼ │
[START] ──────► [IN_NUMBER] ──.──► [AFTER_DOT] ──┐
│ │ │
│ (other) (other) │
│ ▼ ▼ │
│ emit Number emit Number │
│ │
│ + - * / ( ) │
├──────────────────────────── │
│ │ │
▼ ▼ │
emit Plus emit RParen (etc.) │
│ │
│ ws │
▼ │
skip → loop │
│ │
│ EOF │
▼ │
emit Eof ◄───────────────────────────────────────┘
Position into the source string serves as our state. We never need a separate "state" variable for the arithmetic lexer; for more complex lexers (Python's indentation, JavaScript's >> vs >>=) you'd track explicit modes.
Single-Character Lookahead
char Lexer::peek() const { return atEnd() ? '\0' : src_[pos_]; }
char Lexer::advance() { return src_[pos_++]; }
peek() looks at the current character without consuming it. advance() returns it and moves on. Almost every hand-written lexer follows this pattern.
For most operators we need zero lookahead (+ is Plus, full stop). For numbers we need a single character of lookahead to decide whether more digits follow. In later labs we'll add 2-character lookahead for == vs =.
The Number Sub-DFA
Token Lexer::lexNumber() {
std::size_t start = pos_;
while (!atEnd() && std::isdigit(peek())) advance();
if (!atEnd() && peek() == '.') {
advance();
while (!atEnd() && std::isdigit(peek())) advance();
}
std::string text = src_.substr(start, pos_ - start);
Token t;
t.kind = TokenKind::Number;
t.text = text;
t.value = std::stod(text);
return t;
}
Reads:
- As many digits as possible.
- If a
.follows, consume it and as many more digits as possible. - Slice the substring; convert to
doubleviastd::stod.
This accepts 42, 3.14, 0, 0.5. It rejects .5 (no leading digit), 1. (we accept this with empty fraction; harmless), 1e5 (scientific notation — left as Step 6 extension).
Why
std::stodinstead of manual parsing? It's correct (handles negative zero, denormals, exponents we plan to add later), and the lexer is not the performance bottleneck. Compare to Clang, which has a hand-tunedAPFloat::convertFromStringbecause Clang's lexer is sometimes the hot path.
Eager vs Lazy
std::vector<Token> Lexer::tokenize() { ... } // eager
We lex the entire input into a vector<Token> upfront. The parser then consumes from this vector.
Why eager? Simpler control flow. The parser doesn't need to hold a reference to the lexer.
Why production compilers go lazy: memory. Lexing a 100k-line C++ file into a vector of tokens is megabytes; lazy lexing keeps the working set small. Clang is lazy.
For arithmetic — and for everything through cp-09 — eager is fine.
Error Tokens
default: {
Token e;
e.kind = TokenKind::Error;
e.text = std::string("unexpected character '") + c + "'";
return e;
}
We don't throw from the lexer. We emit an Error token. The parser then decides what to do (we throw there). This separation lets a future error-recovery layer skip the bad token and keep parsing — important for IDE diagnostics where you want to see all errors at once, not just the first.
Try It
After building (next step covers CMake), the lexer is hidden behind eval. To inspect tokens directly, you can add a quick dump_tokens helper:
for (auto& t : Lexer("1 + 2 * 3").tokenize())
std::cout << kindName(t.kind) << " " << t.text << "\n";
Expected:
NUMBER 1
PLUS +
NUMBER 2
STAR *
NUMBER 3
EOF
Next
→ 02-the-ast.md — design the tree the parser will build.