01 — Tokens and the Lexer
The lexer is the boundary between raw text and structured data. It has a single responsibility: consume a character stream and emit a flat stream of typed tokens. Every subsequent phase sees tokens, never characters.
The token inventory for MiniLang
cp-03 extends the arithmetic token set from cp-02 with keywords and punctuation that support a full statement language:
// literals
NUMBER "123" STRING "hello" IDENT "myVar"
// keywords
LET VAR FN IF ELSE WHILE RETURN PRINT TRUE FALSE NIL
// arithmetic & comparison
PLUS MINUS STAR SLASH PERCENT
EQ EQ_EQ BANG BANG_EQ LT LT_EQ GT GT_EQ
// logical
AND OR
// delimiters
LPAREN RPAREN LBRACE RBRACE COMMA SEMICOLON
// end-of-file sentinel
EOF
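One way to carry this inventory into C++ is a scoped enum. The sketch below is an assumption about layout, not the course's actual definition: only TokKind::Let, TokKind::Var and TokKind::Fn appear in the text, so the remaining enumerator names and the isKeyword helper are hypothetical. The sentinel is spelled Eof because EOF is already a macro in <cstdio>.

```cpp
#include <cstdint>

// Hypothetical enumerator names, following the TokKind::Let style used in the text.
enum class TokKind : std::uint8_t {
    // literals
    Number, String, Ident,
    // keywords (kept contiguous so isKeyword can be a simple range check)
    Let, Var, Fn, If, Else, While, Return, Print, True, False, Nil,
    // arithmetic & comparison
    Plus, Minus, Star, Slash, Percent,
    Eq, EqEq, Bang, BangEq, Lt, LtEq, Gt, GtEq,
    // logical
    And, Or,
    // delimiters
    LParen, RParen, LBrace, RBrace, Comma, Semicolon,
    // end-of-file sentinel
    Eof
};

// True only for the tokens listed under "keywords" above.
// Relies on those enumerators being declared contiguously.
constexpr bool isKeyword(TokKind k) {
    return k >= TokKind::Let && k <= TokKind::Nil;
}
```

Grouping the enumerators in the same order as the inventory keeps the two easy to compare by eye when a token is added later.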
Key lexer decisions
One character of lookahead is enough for all MiniLang tokens.
= vs ==, ! vs !=, < vs <=, > vs >= all resolve with one
peek() call after consuming the first character.
Maximal munch: always consume the longest valid token. The lexer loop
calls advance() and then decides, not the other way round.
Keyword recognition via a hash-map at the identifier stage:
const std::unordered_map<std::string, TokKind> keywords = {
{"let", TokKind::Let},
{"var", TokKind::Var},
{"fn", TokKind::Fn},
// ...
};
When an identifier is scanned, look it up in the table. If it's there, emit
the keyword token; otherwise emit IDENT. This keeps the lexer loop clean:
no per-keyword branches in the main switch.
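The three decisions above can be sketched in a few lines. This is an illustration under stated assumptions, not the course's lexer: the reduced TokKind set and the free functions identifierKind and scanOperator are hypothetical names, and the real lexer would do the same work through advance(), peek() and match().

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Reduced, hypothetical token set: just enough to show both decisions.
enum class TokKind { Let, Var, Fn, Ident, Eq, EqEq, Bang, BangEq, Lt, LtEq, Gt, GtEq };

// Keyword table: consulted once, after a whole identifier has been scanned.
inline TokKind identifierKind(const std::string& lexeme) {
    static const std::unordered_map<std::string, TokKind> keywords = {
        {"let", TokKind::Let},
        {"var", TokKind::Var},
        {"fn",  TokKind::Fn},
    };
    auto it = keywords.find(lexeme);
    return it != keywords.end() ? it->second : TokKind::Ident;
}

// Scans one operator starting at src[pos], which must be one of = ! < >.
// Consumes the first character, then peeks once: if '=' follows, it is
// consumed too, so the longer token wins (maximal munch).
inline TokKind scanOperator(const std::string& src, std::size_t& pos) {
    const char first = src[pos++];                        // advance()
    const bool eq = pos < src.size() && src[pos] == '=';  // peek()
    if (eq) ++pos;                                        // match('=')
    switch (first) {
        case '=': return eq ? TokKind::EqEq   : TokKind::Eq;
        case '!': return eq ? TokKind::BangEq : TokKind::Bang;
        case '<': return eq ? TokKind::LtEq   : TokKind::Lt;
        default:  return eq ? TokKind::GtEq   : TokKind::Gt;  // '>'
    }
}
```

Because the table is consulted with the complete lexeme, an identifier such as letter is not mistaken for the keyword let. Scanning "<==" yields LT_EQ followed by EQ, since the first token takes as many characters as it validly can.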
Character classification helpers
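// <cctype> functions take an int that must be representable as unsigned char;
// passing a negative char is undefined behavior, hence the casts.
static bool isAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) || c == '_'; }
static bool isAlNum(char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }
static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }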
_ is part of identifiers in MiniLang (and in most mainstream languages), so it's
included in both isAlpha and isAlNum.
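As a rough illustration of how the helpers get used, the sketch below scans one identifier: isAlpha decides whether an identifier can start here, and isAlNum extends it. The function scanIdentifier is a hypothetical free-standing helper, not part of the Lexer class in the text. The casts to unsigned char keep the <cctype> calls well-defined for negative char values.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

static bool isAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) || c == '_'; }
static bool isAlNum(char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }

// Hypothetical helper: returns the identifier starting at src[pos], or an
// empty string if src[pos] cannot start one. On success, pos is advanced
// past the lexeme; on failure, pos is left unchanged.
static std::string scanIdentifier(const std::string& src, std::size_t& pos) {
    if (pos >= src.size() || !isAlpha(src[pos])) return {};
    const std::size_t start = pos;
    while (pos < src.size() && isAlNum(src[pos])) ++pos;
    return src.substr(start, pos - start);
}
```

Using isAlpha for the first character and isAlNum for the rest is what lets _tmp1 be an identifier while 9lives is not.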
Lexer structure
class Lexer {
std::string source_; // owned copy: a reference member would dangle if the caller passes a temporary
size_t start_ = 0; // start of current token
size_t cur_ = 0; // current scan position
int line_ = 1; // for error reporting
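char advance(); // consume and return the current character
char peek() const; // current character, not consumed
char peekNext() const; // character after peek(), not consumed
bool match(char expected); // consume the current character only if it equals expected
Token makeToken(TokKind); // build a token of the given kind from the span start_..cur_
Token scanToken(); // scan and return the next token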
public:
Lexer(const std::string& source);
std::vector<Token> scanAll();
};
The scanAll() loop calls scanToken() until it reaches the end of the source, then
appends an EOF token and returns. Callers get a vector<Token>: a flat,
random-access stream. This matters because the Pratt parser (cp-03 step 03)
peeks at tokens before deciding whether to consume them, rather than reading
strictly front to back.
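The private helpers declared in the class above each fit in a few lines. The sketch below shows one plausible set of bodies on a standalone struct; the name Cursor, the atEnd() helper, and the choice to return '\0' at end of input are assumptions, not details taken from the course code.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the Lexer's private scanning state.
struct Cursor {
    std::string source;   // owned copy of the input
    std::size_t cur = 0;  // current scan position
    int line = 1;         // bumped on every '\n' consumed

    explicit Cursor(std::string s) : source(std::move(s)) {}

    bool atEnd() const { return cur >= source.size(); }

    // Look at the current character without consuming it.
    char peek() const { return atEnd() ? '\0' : source[cur]; }

    // Look one character past peek(), still without consuming.
    char peekNext() const {
        return cur + 1 < source.size() ? source[cur + 1] : '\0';
    }

    // Consume and return the current character; '\0' at end of input.
    char advance() {
        if (atEnd()) return '\0';
        const char c = source[cur++];
        if (c == '\n') ++line;
        return c;
    }

    // Consume the current character only if it equals `expected`.
    bool match(char expected) {
        if (atEnd() || peek() != expected) return false;
        advance();
        return true;
    }
};
```

Counting lines inside advance() means every consumed newline is accounted for in one place, whichever token it turns out to belong to.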
Source location on every token
struct Token {
TokKind kind;
std::string lexeme;
int line;
};
line tracks the source line number. The lexer increments line_ on every
\n. Later phases use line for error messages. A real production lexer
stores column too; for now line is enough.
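For a sense of what adding a column would involve, here is a minimal sketch. SourcePos and locate are hypothetical names, and the 1-based column convention is an assumption. The function rescans from the start for clarity; a lexer would more likely update both counters incrementally as it advances.

```cpp
#include <cstddef>
#include <string>

// Hypothetical location record; the Token in the text stores only `line`.
struct SourcePos {
    int line = 1;    // 1-based, incremented on every '\n'
    int column = 1;  // 1-based, reset to 1 after every '\n'
};

// Walks `source` up to (not including) `offset` and reports where that
// offset falls.
inline SourcePos locate(const std::string& source, std::size_t offset) {
    SourcePos pos;
    for (std::size_t i = 0; i < offset && i < source.size(); ++i) {
        if (source[i] == '\n') {
            ++pos.line;
            pos.column = 1;
        } else {
            ++pos.column;
        }
    }
    return pos;
}
```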
Try it
After writing the lexer, scan a source string and print the token stream:
const std::string src = "let x = 2 + 3;\nif (x > 4) { print x; }"; // named, so it outlives the lexer
Lexer lex(src);
for (const Token& tok : lex.scanAll())
    std::cout << tok.line << "\t" << tok.lexeme << "\n";
Expected output:
1 let
1 x
1 =
1 2
1 +
1 3
1 ;
2 if
...
This linear token dump is a good first debugging tool for any lexer.