01 — Tokens and the Lexer

The lexer is the boundary between raw text and structured data. It has a single responsibility: consume a character stream and emit a flat stream of typed tokens. Every subsequent phase sees tokens, never characters.

The token inventory for MiniLang

cp-03 extends the arithmetic token set from cp-02 with keywords and punctuation that support a full statement language:

// literals
NUMBER  "123"   STRING  "hello"   IDENT "myVar"

// keywords
LET  VAR  FN  IF  ELSE  WHILE  RETURN  PRINT  TRUE  FALSE  NIL

// arithmetic & comparison
PLUS MINUS STAR SLASH PERCENT
EQ EQ_EQ BANG BANG_EQ LT LT_EQ GT GT_EQ

// logical
AND OR

// delimiters
LPAREN RPAREN LBRACE RBRACE COMMA SEMICOLON

// end-of-file sentinel
EOF

Key lexer decisions

One character of lookahead is enough for all MiniLang tokens. = vs ==, ! vs !=, < vs <=, > vs >= all resolve with one peek() call after consuming the first character.

Maximal munch: always consume the longest valid token. The lexer loop calls advance() and then decides, not the other way round.
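The two rules above can be sketched together for the comparison operators. MiniScanner and scanOperator are hypothetical names used only for illustration, and token kinds are returned as plain strings to keep the sketch short; the real Lexer may organise this differently.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-alone scanner, used only to illustrate one-character
// lookahead and maximal munch. It is not the cp-03 Lexer class.
struct MiniScanner {
    std::string src;
    std::size_t cur = 0;

    explicit MiniScanner(std::string s) : src(std::move(s)) {}

    bool atEnd() const { return cur >= src.size(); }
    char advance() { return src[cur++]; }
    char peek() const { return atEnd() ? '\0' : src[cur]; }

    // Consume the next character only if it equals `expected`.
    bool match(char expected) {
        if (peek() != expected) return false;
        ++cur;
        return true;
    }

    // Returns the token name for the operator starting at `cur`.
    // Precondition (assumed): the scanner is not at the end of input.
    std::string scanOperator() {
        char c = advance();  // consume first, then decide
        switch (c) {
            case '=': return match('=') ? "EQ_EQ"   : "EQ";
            case '!': return match('=') ? "BANG_EQ" : "BANG";
            case '<': return match('=') ? "LT_EQ"   : "LT";
            case '>': return match('=') ? "GT_EQ"   : "GT";
            default:  return "UNKNOWN";
        }
    }
};
```

Because match() only consumes on success, the longer token wins whenever it is available, and the shorter one is the fallback. Scanning "===" this way yields EQ_EQ followed by EQ.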

Keyword recognition via a hash-map at the identifier stage:

const std::unordered_map<std::string, TokKind> keywords = {
    {"let",    TokKind::Let},
    {"var",    TokKind::Var},
    {"fn",     TokKind::Fn},
    // ...
};

When an identifier is scanned, look it up in the table. If it's there, emit the keyword token; otherwise emit IDENT. This keeps the lexer loop clean: no per-keyword branches in the main switch.
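As a rough sketch, the lookup step might look like the following. identifierKind is a hypothetical helper name, and TokKind is trimmed to the few kinds the sketch needs (Ident is an assumed enumerator name for the IDENT token).

```cpp
#include <string>
#include <unordered_map>

// Trimmed-down token kinds; the full enum in cp-03 has many more entries.
enum class TokKind { Let, Var, Fn, Ident };

// Hypothetical helper: classify an identifier lexeme that has already been
// scanned in full. Keywords map to their own kind, everything else is Ident.
TokKind identifierKind(const std::string& lexeme) {
    static const std::unordered_map<std::string, TokKind> keywords = {
        {"let", TokKind::Let},
        {"var", TokKind::Var},
        {"fn",  TokKind::Fn},
    };
    auto it = keywords.find(lexeme);
    return it != keywords.end() ? it->second : TokKind::Ident;
}
```

Because the lookup happens only after the whole identifier has been consumed, a name such as "letter" is classified as an identifier rather than as the keyword let followed by "ter".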

Character classification helpers

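// Requires <cctype>. The unsigned char cast avoids undefined behaviour for negative char values.
static bool isAlpha(char c)   { return std::isalpha(static_cast<unsigned char>(c)) || c == '_'; }
static bool isAlNum(char c)   { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }
static bool isDigit(char c)   { return std::isdigit(static_cast<unsigned char>(c)) != 0; }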

_ is part of identifiers in MiniLang (as in most mainstream languages), so it's included in both isAlpha and isAlNum. An identifier starts with an isAlpha character and continues for as long as isAlNum holds.
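One way the helpers could drive identifier scanning is sketched below. scanIdentifier is a hypothetical free function written for this example; in the Lexer class the same loop would presumably use advance() and peek() instead of raw indices.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

static bool isAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) || c == '_'; }
static bool isAlNum(char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }

// Hypothetical helper: returns the identifier starting at `pos`,
// or an empty string if no identifier starts there.
std::string scanIdentifier(const std::string& src, std::size_t pos) {
    if (pos >= src.size() || !isAlpha(src[pos])) return "";
    std::size_t end = pos + 1;
    // Keep consuming while the next character can continue an identifier.
    while (end < src.size() && isAlNum(src[end])) ++end;
    return src.substr(pos, end - pos);
}
```

Digits are allowed after the first character but not as the first one, so "_tmp1" scans as a single identifier while "9lives" does not start one.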

Lexer structure

class Lexer {
    std::string source_;  // owned copy; a stored reference would dangle if built from a temporary
    size_t start_ = 0;  // start of current token
    size_t cur_   = 0;  // current scan position
    int    line_  = 1;  // for error reporting
    char advance();
    char peek() const;
    char peekNext() const;
    bool match(char expected);
    Token makeToken(TokKind);
    Token scanToken();
public:
    Lexer(const std::string& source);
    std::vector<Token> scanAll();
};

The scanAll() loop calls scanToken() until it reaches the end of the source, then appends an EOF token and returns. Callers get a vector<Token>: a flat, random-access stream. This matters because the Pratt parser (cp-03 step 03) needs to peek at tokens ahead of its current position, not just consume them in strict order, and an indexable vector makes that trivial.
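The shape of that loop can be sketched with a deliberately simplified free function. This is not the cp-03 Lexer: it only distinguishes words from single-character symbols (the Word and Symbol kinds are invented for the sketch), but it shows the loop, the line counter, and the trailing EOF token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokKind { Word, Symbol, Eof };  // simplified kinds for this sketch only
struct Token { TokKind kind; std::string lexeme; int line; };

// Simplified stand-in for Lexer::scanAll(): alphanumeric runs become Word,
// any other non-space character becomes a one-character Symbol.
std::vector<Token> scanAll(const std::string& source) {
    std::vector<Token> tokens;
    std::size_t cur = 0;
    int line = 1;
    while (cur < source.size()) {
        unsigned char c = static_cast<unsigned char>(source[cur]);
        if (c == '\n') { ++line; ++cur; continue; }   // count lines as we go
        if (std::isspace(c)) { ++cur; continue; }     // skip other whitespace
        std::size_t start = cur++;
        if (std::isalnum(c) || c == '_') {
            while (cur < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[cur])) ||
                    source[cur] == '_'))
                ++cur;
            tokens.push_back({TokKind::Word, source.substr(start, cur - start), line});
        } else {
            tokens.push_back({TokKind::Symbol, source.substr(start, 1), line});
        }
    }
    tokens.push_back({TokKind::Eof, "", line});       // sentinel so the parser never runs off the end
    return tokens;
}
```

The EOF sentinel means a consumer can always look at the next token without a separate bounds check, which is the main reason for appending it.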

Source location on every token

struct Token {
    TokKind     kind;
    std::string lexeme;
    int         line;
};

line tracks the source line number. The lexer increments line_ on every \n. Later phases use line for error messages. A real production lexer stores column too; for now line is enough.
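To illustrate how a later phase might use the line field, here is a small sketch of an error formatter. formatError and its message layout are assumptions made for this example, and the Token struct is reduced to the two fields the sketch needs.

```cpp
#include <string>

// Reduced Token for this sketch; the kind field is omitted for brevity.
struct Token {
    std::string lexeme;
    int         line;
};

// Hypothetical formatter: the exact message layout is an assumption,
// not something the later checkpoints are known to use.
std::string formatError(const Token& tok, const std::string& message) {
    return "[line " + std::to_string(tok.line) + "] Error at '" +
           tok.lexeme + "': " + message;
}
```

For a token with lexeme } on line 2, formatError(tok, "expected ';'") produces: [line 2] Error at '}': expected ';'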

Try it

After writing the lexer, scan a source string and print the token stream:

std::string src = "let x = 2 + 3;\nif (x > 4) { print x; }";
Lexer lex(src);  // named string, so the lexer never refers to a temporary
for (const auto& tok : lex.scanAll())
    std::cout << tok.line << "\t" << tok.lexeme << "\n";

Expected output:

1   let
1   x
1   =
1   2
1   +
1   3
1   ;
2   if
...

This linear token dump is the best first debugging tool for any lexer.
