Compiler - Lexer

Lexical Analysis (Scanning)

  • Convert Character stream to Token stream

    • simple example:

      • input character stream: “position = initial + rate * 60”.
        • lexemes include: “position”, “=”, “initial”, “+”, “rate”, “*”, “60”
      • output Token Stream
        • output tokens: (id, 1), (=), (id, 2), (+), (id, 3), (*), (60)
        • plus the corresponding symbol table holding the identifier details
    • Lexeme is the actual character sequence matched.

    • Pattern is the rule describing valid lexemes (usually described using regular expressions)

      • example:
        • identifier: [A-Za-z][A-Za-z0-9]*
        • number: [0-9]+
    • Token <token_name, attribute_value>

      • Token-name is the category used by parsing (identifier, number, symbols such as ‘+’ and ‘=’)
      • Attribute-value: tokens may carry extra information
        • typically points into the symbol table, which holds the details
        • e.g., identifier name / type
      • example:
        • {id, 1} means identifier token with pointer into symbol table entry #1
        • {+} no attribute
          struct Token {
              TokenType type; // category: {ID, NUM, IF, PLUS, ...}
              Attribute attr; // holds a pointer/index into the symbol table
          };
    • Input Buffer

      • lexers read the file in fixed-size chunks (e.g., 4 KB or 8 KB); this chunk is the input buffer
        • stores characters
          std::array<char, 4096> buffer;
      • Pointers
        • LexemeBegin pointer: where the current token started
        • Forward: current scanning position
          buffer[lexemeBegin ... forward]
      • Retract()
        • move the buffer Forward pointer back by one
        • Lookahead problem
          • the lexer cannot know whether a token has ended until it sees the character after it
          • so the lexer reads one character too far while checking whether the current token is complete
          • it then needs to retract (un-read) that extra character
        • example: lexer scan characters “123+45”
          • reads 1
          • reads 2
          • reads 3
          • reads +
          • when the lexer sees ‘+’, it knows “123” is finished; the ‘+’ is not part of the number
          • Retract: the lexer has read one character too far, so it must retract so that the next call to the lexer starts exactly at the ‘+’
      • two-buffer sentinel version (TBD)
      #include <array>
      #include <cctype>
      #include <fstream>
      #include <optional>
      #include <string>

      class InputBuffer {
      public:
          static constexpr std::size_t N = 4096;

          explicit InputBuffer(const std::string& filename)
              : stream_(filename, std::ios::binary) {
              refill();
          }

          std::optional<char> nextChar() {
              if (pos_ >= length_) {
                  refill();
                  if (length_ == 0) return std::nullopt;
              }
              return buffer_[pos_++];
          }

          void retract() {
              if (pos_ > 0) {
                  --pos_;
              }
          }

          std::optional<char> peekChar() {
              auto c = nextChar();
              if (c) retract();
              return c;
          }

      private:
          void refill() {
              if (!stream_) {
                  length_ = 0;
                  return;
              }
              stream_.read(buffer_.data(), N);
              length_ = static_cast<std::size_t>(stream_.gcount());
              pos_ = 0;
          }

          std::ifstream stream_;
          std::array<char, N> buffer_{};
          std::size_t pos_ = 0;
          std::size_t length_ = 0;
      };

      Token readIdentifier(InputBuffer& input, char firstChar);
      Token readNumber(InputBuffer& input, char firstChar); // analogous to readIdentifier
      Token operatorToken(char c);

      Token nextToken(InputBuffer& input) {
          auto optC = input.nextChar();
          if (!optC) {
              return Token{TokenType::EndOfFile, ""};
          }
          char c = *optC;
          if (std::isalpha(static_cast<unsigned char>(c))) {
              return readIdentifier(input, c);
          } else if (std::isdigit(static_cast<unsigned char>(c))) {
              return readNumber(input, c);
          } else {
              return operatorToken(c);
          }
      }

      Token readIdentifier(InputBuffer& input, char firstChar) {
          std::string lexeme;
          lexeme += firstChar;
          // Consume letters/digits; retract past the first character that
          // cannot extend the identifier, so the next call sees it.
          while (auto optC = input.nextChar()) {
              if (std::isalnum(static_cast<unsigned char>(*optC))) {
                  lexeme += *optC;
              } else {
                  input.retract();
                  break;
              }
          }
          return Token{TokenType::ID, lexeme};
      }

      c = nextChar()

      if isLetter(c):
          lexeme = readIdentifier()
      elif isDigit(c):
          lexeme = readNumber()
      else:
          return operatorToken(c)

  • Algorithms of Lexical Analysis

    • Terminologies

      • NFA (Nondeterministic Finite Automaton)

        • From one state, may have multiple possible next states
        • may have \epsilon-transitions: moves that consume no input character (\epsilon denotes the empty string)
        • example: NFA for a|b
          • NFA has branching
                    --a--> (accept)
            (start)
                    --b--> (accept)
          • the NFA is nondeterministic: multiple possible transitions
          • from start, if input is ‘a’, take top arrow, if input is ‘b’, take bottom arrow
      • DFA (Deterministic Finite Automaton, a table-driven state machine)

        • For every(state, input char), there is exactly one next state
        • A DFA state represents a set of NFA states
        • simulate “all possible NFA paths” by tracking them as a set
        • example: DFA for a|b
          • DFA must have exactly one transition per character
            (state0) --a--> (accept)
            (state0) --b--> (accept)
          • Table form:
            State | ‘a’  | ‘b’
            0     | 1    | 1
            1     | dead | dead
          • the DFA is deterministic: from state0, the input uniquely determines the next state
    • Pipeline: Regex –> NFA –> DFA

      • step1: Regex –> NFA (Thompson Construction)

        • Turn a regex into a machine that recognizes the same language.
        • Key property:
          • \epsilon-transitions (moves that consume no character)
          • branching (multiple possible next states)
        • Thompson’s construction builds an NFA by composing small building blocks:
          1. For a literal a: start --a--> accept
          2. For concatenation RS: connect R’s accept state to S’s start state with an \epsilon-transition
          3. For union R|S: a new start state \epsilon-branches into R and S; both accepts \epsilon-connect to a new accept state
          4. For closure R*: \epsilon-transitions allow skipping R entirely or looping back through it
      • step2: NFA –> DFA (Subset Construction)

        • Remove nondeterminism so the scanner can run with a single current state.
        • Example
          • ab|a
          • matches ‘ab’ and ‘a’
          • build the NFA
            • a nondeterministic finite state machine that accepts either:
              • a
              • ab
            • it looks like
                      --a--> (accept1)
              (start)
                      --a--> (s1) --b--> (accept2)
            • possible transitions:
              • start–a–>accept1
              • start–a–>state1
              • state1–b–>accept2
            • this is nondeterministic because from start on input ‘a’, there are 2 possibilities
          • build DFA
            • each DFA state is a set of NFA states
            • the NFA start state is start, so the DFA start state is A = {start}
            • determine move({start}, ‘a’) according to the NFA
            • from the NFA, starting at start:
              • start–a–>accept1
              • start–a–>state1
            • so: move({start},a) ={accept1,state1}
            • get the DFA transition: {start}–a–>{accept1,state1}
            • similarly,
              • move({start},’a’) = {accept1, state1}
              • move({start},’b’) = dead
              • move({accept1,state1},’a’) = dead
              • move({accept1,state1},’b’) = {accept2}
            • {accept2} has no outgoing transitions, so every input leads to the dead state
            • final DFA build from NFA
              DFA State         | on ‘a’            | on ‘b’    | Accepting?
              {start}           | {accept1, state1} | dead      | No
              {accept1, state1} | dead              | {accept2} | Yes (accept1)
              {accept2}         | dead              | dead      | Yes
      • step3: DFA Minimization (Hopcroft)

        • merge equivalent states (states that no input string can distinguish)
    • Rules

      • Longest Match: always take the longest lexeme that matches any pattern
      • Priority Rule, if there’s still a tie (same longest length), choose the rule that appears earlier in the Lex program
  • Practical Scanner Implementation

    • Strategy A - Hand-written Lexer

      • Typical loop:
        skip_whitespace();

        if (is_letter(c)) read_identifier();
        else if (is_digit(c)) read_number();
        else return operator_token();
      • Pros & Cons
        • Pros:
          • fast
          • customizable
        • Cons:
          • harder to maintain
    • Strategy B - Generated Lexer (Lex/Flex)

      • Input:
        • Lexer spec file (Lex/Flex)
          • commonly named
            • lexer.l
            • scanner.l
          • Contains
            • Regex patterns (rules)
            • actions (code to run when a pattern matches)
          • example
            %%
            "if" { return IF; }
            [a-zA-Z][a-zA-Z0-9]* { return ID; }
            [0-9]+ { return NUMBER; }
            [ \t\n]+ ; /* skip whitespace */
            %%
          • Run flex
            flex scanner.l
          • Produces C code: lex.yy.c
          • Compile it
            gcc lex.yy.c -o lexer -lfl
          • run it on an input file, it emits tokens
      • Tool produces:
        • DFA transition table
        • efficient scanner code
      • Pros & Cons
        • Pros:
          • Very fast in practice (table-driven DFA)
          • Easy to modify token rules (edit regex rules, regenerate)
          • Handles lots of tricky cases (lookahead, longest-match, priority rules) systematically
        • Cons
          • Debugging can be less direct (we debug rules, not the generated control flow)
          • Can produce large transition tables (usually fine; occasionally memory-sensitive)
          • Rule interactions can be surprising; you have to understand “longest match + rule priority”
  • Errors Handled at Lexical Phase
    (the lexer detects and reports these errors early, before parsing:)

    • illegal characters
    • malformed literals
    • unterminated strings/comments
  • more:

    • Unicode-aware tokenization

    • incremental lexing (IDEs)

    • Scannerless parsing (rare)

    • Lexer modes (context-sensitive lexing)

    • Symbol Table

      • SymbolEntry is a struct storing one symbol record

        name              | kind                   | type        | scope level  | memory offset
        token string name | variable/function/type | int/float/… | global/local | stack location
      • SymbolTable wraps SymbolEntry records in hashmaps and exposes operations:

        • maintain a scoped stack of hashmaps from name to SymbolEntry
        • look up the type, scope, or storage location of a symbol
        • Hash-based, average O(1) lookup time by symbol name
          std::unordered_map<std::string, SymbolEntry>
        • Scoped
          • use a stack to track each symbol’s scope dynamically during parsing, and save it to the symbol table
          • make SymbolTable a stack of hashmaps
            • global scope map
            • function scope map
            • block scope map
          • One hashmap per scope
          • Same name allowed in different scopes
          • stack
            • when enter a new scope –> push
            • when exit –> pop
              Stack top --> Innermost scope (current)
              Stack bottom --> Outermost scope (global)
            • example
              int x;        // global scope
              void f() {    // function scope begins
                  int x;    // local x shadows global x
                  {         // inner block scope begins
                      int y;
                  }         // block scope ends
              }
              Global Scope
              └── Function Scope (f)
                  └── Block Scope {...}

              Stack top    : {x -> local x entry}
              Stack bottom : {x -> global x entry}
              • How lookup works
                • lookup always searches from top to bottom:
                  lookup("x"):
                      for scope from top to bottom:
                          if scope contains "x":
                              return that entry
      #include <optional>
      #include <string>
      #include <unordered_map>
      #include <vector>

      struct SymbolEntry {
          std::string name;
          enum class Kind { Var, Func, Type, Keyword } kind;
          // other properties (type, scope level, memory offset, ...)
      };

      class SymbolTable {
      public:
          SymbolTable() {
              enterScope(); // global scope
          }

          void enterScope() {
              scopes_.emplace_back();
          }

          void exitScope() {
              if (!scopes_.empty()) scopes_.pop_back();
          }

          // Search from the innermost scope outward.
          std::optional<SymbolEntry> lookup(const std::string& name) const {
              for (int i = static_cast<int>(scopes_.size()) - 1; i >= 0; --i) {
                  auto it = scopes_[i].find(name);
                  if (it != scopes_[i].end()) return it->second;
              }
              return std::nullopt;
          }

          // Insert into the current (innermost) scope.
          SymbolEntry& insert(const std::string& name, SymbolEntry::Kind kind) {
              auto& cur = scopes_.back();
              auto [it, inserted] = cur.emplace(name, SymbolEntry{name, kind});
              return it->second;
          }

      private:
          std::vector<std::unordered_map<std::string, SymbolEntry>> scopes_;
      };