Syntax Analysis (Parsing): token stream -> parse tree / syntax tree
Converts a stream of tokens into syntactic structure
(id)(=)(id)(+)(id)(*)(num)
Parser builds a tree structure that represents:
- operator precedence
- associativity
- nested constructs
Grammar(CFG: Context-Free Grammar)
Grammars describe how abstract categories expand into sequences.
Grammar is
- a set of rules that describe what sequences of tokens are valid in the language.
- a start symbol: a nonterminal that represents a complete valid program (or a complete construct)
CFG (context-free grammar) defines what token sequences are legal
- A context-free grammar is a tuple G = (V, \Sigma, P, S)
- V: set of nonterminals
- \Sigma: set of terminals
- P: set of productions
- S: start symbol
- Naming convention
- Uppercase letters (E, T, A, B) : Nonterminals
- Lowercase letters (a, b, +): Terminals
- id, num: Terminals (tokens)
- Greek letters (\alpha, \beta): Strings of grammar symbols
- \epsilon: empty string
- $: End-of-input marker
- Consists of
- Terminals(tokens)
- terminals are tokens returned by the lexer, Parser cannot break them further
- id, num, +, *, (, ), if, …
- terminals appear in the input stream
- Nonterminals
- nonterminals are names for syntactic categories
- Expr(E): represents a full expression
- Term(T): represents multiplication-level expressions
- Factor(F): represents atomic things like identifiers or parentheses,
- smallest expression unit
- Statement(S): a higher-level syntactic unit, like
- assignment
- if statement
- while statement
- nonterminals do not appear in the input; they appear in the grammar rules
- typical grammar:
  E -> E + T | T
  T -> T * F | F
  F -> (E) | id
- E is “addition-level” expression
- T is “multiplication-level” expression
- F is “atomic unit”
- Example
- Grammar for addition:
  E -> E + id | id
- an expression E is either E + id or just id
grammar representation
- production rule: LHS -> RHS
- LHS: a nonterminal
- RHS: symbols, which can be tokens (terminals) or categories (nonterminals)
- example: F -> (E)
Production
- A production is a rewriting rule: A -> \alpha
- means: the nonterminal A can be replaced by the string \alpha
- example: E -> T E’
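Productions like these are easy to encode directly. A small Python sketch (an illustrative encoding of my own, not tied to any particular tool): each LHS nonterminal maps to a list of RHS alternatives.

```python
# Each LHS nonterminal maps to a list of alternatives;
# each alternative is a list of grammar symbols.
grammar = {
    "E": [["E", "+", "T"], ["T"]],    # E -> E + T | T
    "T": [["T", "*", "F"], ["F"]],    # T -> T * F | F
    "F": [["(", "E", ")"], ["id"]],   # F -> (E) | id
}

nonterminals = set(grammar)           # {"E", "T", "F"}
# Every symbol on a RHS that is not a nonterminal is a terminal.
terminals = {sym for prods in grammar.values()
             for rhs in prods for sym in rhs} - nonterminals
```

This encoding is what the FIRST/FOLLOW computations later operate on.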
Derivation
- A derivation is a sequence of rewrite steps:
- Starts with start symbol (like E)
- Repeatedly replace a nonterminal using a rule
- End with only terminals(tokens)
- example
- rule: E -> E + id | id
- steps of a derivation
- start E
- use rule E -> E + id:
- becomes: E + id
- the left E is still a nonterminal
- replace the left E with rule E -> id
- becomes: id + id
- now everything is terminals
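The rewrite steps above can be replayed mechanically. A minimal Python sketch (illustrative; the rule choices are hard-coded to mirror the steps):

```python
def leftmost_derivation(choices):
    """Replay a derivation for E -> E + id | id, always rewriting
    the leftmost E. choices: the RHS to apply at each step."""
    sentential = ["E"]                  # start symbol
    steps = [" ".join(sentential)]
    for rhs in choices:
        # find the leftmost nonterminal E and rewrite it
        i = next(j for j, s in enumerate(sentential) if s == "E")
        sentential[i:i + 1] = rhs.split()
        steps.append(" ".join(sentential))
    return steps

steps = leftmost_derivation(["E + id", "id"])
print(" => ".join(steps))   # E => E + id => id + id
```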
Parsing pipeline (parsing is table-driven decision making)
- pipeline
grammar
- the grammar alone is not executable
FIRST / FOLLOW sets : extract “lookahead knowledge”
- FIRST/FOLLOW are not arbitrary categories
- they correspond to two concrete parser actions
- FIRST sets
- FIRST(X) is the set of terminals that can appear first in any string derived from X
- FIRST helps choose which production to use when expanding a nonterminal
- Parser sees:
- Stack top: A
- Lookahead: a
- Question: which production A->\alpha should I use?
- Use FIRST(\alpha)
- This applies to
- a single symbol (terminal or nonterminal)
- a sequence of grammar symbols
- FIRST computation
- Rule 1 - terminal
- if a is a terminal
- FIRST(a) = { a }
- Rule 2 - nonterminal
- If A is a nonterminal,
- FIRST(A) = union of FIRST(\alpha) over all productions A -> \alpha
- Rule 3 - string of symbols
- Let \alpha = X_1 X_2 … X_n
- Then
- Add FIRST(X_1) - {\epsilon}
- If \epsilon \in FIRST(X_1), add FIRST(X_2) - {\epsilon}
- Continue through X_3, …, X_n the same way
- If all X_i can derive \epsilon, add \epsilon
- example
- grammar
- E -> T E’
- E’ -> + T E’ | \epsilon
- T -> id
- compute FIRST:
- FIRST(id) = { id }
- FIRST(T) = { id }
- FIRST(E) = { id }
- FIRST(E’) = { +, \epsilon }
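The three FIRST rules are usually computed as a fixed point. A Python sketch for this grammar (names are my own; `eps` stands in for \epsilon and `Ep` for E’):

```python
EPS = "eps"
grammar = {
    "E":  [["T", "Ep"]],               # E  -> T E'
    "Ep": [["+", "T", "Ep"], [EPS]],   # E' -> + T E' | eps
    "T":  [["id"]],                    # T  -> id
}
nonterminals = set(grammar)

def first_of_seq(seq, first):
    """FIRST of a symbol sequence X_1 ... X_n (Rule 3)."""
    out = set()
    for X in seq:
        fx = first[X] if X in nonterminals else {X}   # Rule 1 for terminals
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)          # every X_i can derive epsilon
    return out

def compute_first(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:        # iterate to a fixed point (Rule 2)
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                new = first_of_seq(rhs, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

first = compute_first(grammar)
# matches the sets above: FIRST(E)={id}, FIRST(E')={+, eps}, FIRST(T)={id}
```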
- FOLLOW sets
- FOLLOW(A) is the set of terminals that can appear immediately to the right of A
- FOLLOW is only defined for nonterminals
- FOLLOW never contains \epsilon
- FOLLOW helps decide whether an \epsilon-production may be chosen
- “\epsilon-production”
- A production like E’ -> \epsilon
- Meaning this nonterminal produces nothing
- “choosing \epsilon-production”
- parser decides not to expand the nonterminal further
- “Safe to disappear”
- The next input symbol is allowed to follow this nonterminal
- that condition is checked using FOLLOW
- FOLLOW computation
- Rule 1 - start symbol
- $ \in FOLLOW(S)
- because input must end
- Rule 2 - internal position
- For every production A -> \alpha B \beta
- Add FIRST(\beta) - {\epsilon} to FOLLOW(B)
- Meaning: whatever can start \beta can appear after B
- Rule 3 - end or nullable tail
- if: A -> \alpha B
- or: A -> \alpha B \beta and \epsilon \in FIRST(\beta)
- then: add FOLLOW(A) to FOLLOW(B)
- Meaning: if \beta can disappear, B may be followed by whatever follows A
- example
- grammar
- E -> T E’
- E’ -> + T E’ | \epsilon
- T -> id
- step 1 - start symbol E
- FOLLOW(E) = {$}
- step 2 - from E -> T E’
- E’ is at end -> add FOLLOW(E) to FOLLOW(E’)
- FOLLOW(E’) = {$}
- T is followed by E’
- FIRST(E’) - {\epsilon} = {+}
- So FOLLOW(T) = {+}
- Since \epsilon \belongto FIRST(E’), also add FOLLOW(E)
- FOLLOW(T) = { + , $}
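The FOLLOW rules can be computed the same way. A Python sketch for the same grammar, assuming a FIRST table like the one computed earlier (`eps` and `$` stand in for \epsilon and the end marker; names are my own):

```python
EPS = "eps"
grammar = {
    "E":  [["T", "Ep"]],               # E  -> T E'
    "Ep": [["+", "T", "Ep"], [EPS]],   # E' -> + T E' | eps
    "T":  [["id"]],                    # T  -> id
}
first = {"E": {"id"}, "Ep": {"+", EPS}, "T": {"id"}}
nonterminals = set(grammar)

def first_of_seq(seq):
    out = set()
    for X in seq:
        fx = first[X] if X in nonterminals else {X}
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)
    return out

def compute_follow(grammar, start):
    follow = {A: set() for A in grammar}
    follow[start].add("$")                    # Rule 1: input must end
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B not in nonterminals:
                        continue
                    beta = rhs[i + 1:]
                    fb = first_of_seq(beta) if beta else {EPS}
                    new = fb - {EPS}          # Rule 2
                    if EPS in fb:
                        new |= follow[A]      # Rule 3: nullable tail
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

follow = compute_follow(grammar, "E")
# matches the example: FOLLOW(E)={$}, FOLLOW(E')={$}, FOLLOW(T)={+, $}
```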
- FIRST and FOLLOW drive two distinct parser decisions: which production to expand, and when an \epsilon-production is safe
parsing table : encode that knowledge in a mechanical form
parser control decisions
Example
- rate * 60 + initial
- Lexer gives flat tokens
- <id, rate> <*> <num,60> <+> <id,initial>
- Parser builds structure
        +
       / \
      *   <id,initial>
     / \
<id,rate> <num,60>

- parser enforces precedence:
- * binds tighter than +
Data Structure
- Parse Tree
- shows exactly how the grammar generated the input
- Syntax Tree (AST)
- AST is a simplified parse tree
- removes unnecessary grammar nodes
- keeps essential structure
- Parse stack
- stack used by the parser to track partially recognized syntax
- stores
- LL: stack stores expected grammar symbols (terminals/nonterminals)
- LR: stack stores states and/or grammar symbols
- Parsing Table (LL or LR)
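One way to see the parse-tree/AST distinction in code: a hypothetical AST encoding in Python for `rate * 60 + initial` (class names are my own, not any real compiler's API).

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Id:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# The AST keeps only operators and operands; the grammar helper
# nonterminals (E, T, F) that appear in the parse tree are gone.
ast = BinOp("+", BinOp("*", Id("rate"), Num(60)), Id("initial"))
```

Note how the tree shape still encodes precedence: `*` is nested under `+`.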
Two Main Parsing Strategies
Top-Down Parsing(LL) into Predictive Table
- LL means
- Left-to-right scan
- leftmost derivation
- always expand the leftmost nonterminal first
- Top-down parsing
- start from the start symbol and expand downward
- Predictive parsing
- Parser looks at
- top of stack(nonterminal)
- next input token
- then decides which rule to apply
- Predictive table
a table M that stores which rule of the grammar to apply
- rows = nonterminals (E, E’)
- columns = lookahead tokens (id, +, $)
- cell = which production to use
example table entries
- M[E, id] = E -> id E’
- M[E’, +] = E’ -> + id E’
- M[E’, $] = E’ -> \epsilon
the full table:

  nonterminal | id         | +             | $
  E           | E -> id E’ | /             | /
  E’          | /          | E’ -> + id E’ | E’ -> \epsilon

so if the current nonterminal is E and the next token is id:
- use the production M[E, id] = E -> id E’
- example
- grammar
- rule: E -> id E’
- rule: E’ -> + id E’ | \epsilon
- E is expression
- E’ is the rest of an expression (a helper nonterminal)
- this is a standard transformation for LL parsing
- input id + id $
- LL parsing rule
- LL parsing expand rules by looking ahead one token.
- At each step:
- look at stack top (what we expect)
- look at next input token (lookahead)
- if top is nonterminal: expand using a rule
- if top is terminal: it must match input token
- Step-by-step
- step1
- initial stack [E, $]
- input id + id $
- top is nonterminal E, lookahead is id
- use rule E -> id E’
- stack [id E’ $]
- top is terminal id, lookahead is id, match, consume
- step2
- stack [E’ $]
- input + id $
- top is nonterminal E’, lookahead is +
- use rule E’ -> + id E’
- stack [+ id E’ $]
- top is terminal +, lookahead is +, match, consume
- step3
- stack [id E’ $]
- input id $
- top is terminal id, lookahead is id, matches, consume
- step4
- stack [E’ $]
- input $
- top is nonterminal E’, lookahead is $
- use rule E’ -> \epsilon
- stack [$]
- top is $, lookahead $, match, accept
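The step-by-step trace above is exactly what a table-driven LL(1) parser does. A Python sketch for the same grammar (table and names are illustrative; `Ep` stands for E’):

```python
EPS = "eps"
# Predictive table M[nonterminal, lookahead] -> production RHS
table = {
    ("E", "id"): ["id", "Ep"],        # E  -> id E'
    ("Ep", "+"): ["+", "id", "Ep"],   # E' -> + id E'
    ("Ep", "$"): [EPS],               # E' -> eps
}
nonterminals = {"E", "Ep"}

def ll_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]                # top of stack = end of the list
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top == look == "$":
            return True               # accept
        if top not in nonterminals:   # terminal: must match the input
            if top != look:
                return False
            i += 1
        else:                         # nonterminal: expand via the table
            rhs = table.get((top, look))
            if rhs is None:
                return False          # no entry: syntax error
            for sym in reversed(rhs): # push RHS, leftmost symbol on top
                if sym != EPS:
                    stack.append(sym)
    return False

print(ll_parse(["id", "+", "id"]))    # True
print(ll_parse(["+", "id"]))          # False
```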
Bottom-Up Parsing(LR) into Action/GOTO Table
LR means
- Left-to-right scan
- Rightmost derivation in reverse
- LR parsing constructs a rightmost derivation backwards by reducing
- derivation expands nonterminals -> terminals
- LR parsing reduces terminals -> nonterminals(reverse direction)
Bottom-up parsing:
- start from tokens and reduce upward
ACTION/GOTO table
- In real LR parsing, we can’t just guess when to reduce
- the parser uses a state machine:
- ACTION[state, terminal], decides what to do on terminals, SHIFT/REDUCE/ACCEPT
- GOTO[state, nonterminal], decides where to go after reduce to a nonterminal
- LR parsing stack usually stores:
- states (integers)
- sometimes symbols interleaved
- example shape: [state0, E, state3, +, state7, id, state9]
example
- grammar
- rule: E -> E + id | id (left-recursive, so not LL-friendly, but fine for LR)
- input: id + id $
- LR parsing rule
- SHIFT: push input token onto stack
- REDUCE: replace RHS symbols on stack with LHS nonterminal
- step by step
- step1
- stack []
- SHIFT id
- stack [id]
- REDUCE id -> E
- step2
- stack [E]
- input + id $
- SHIFT +
- step3
- stack [E +]
- input id $
- SHIFT id
- stack [E + id]
- REDUCE E + id -> E
- step4
- stack [E]
- input $
- accept
- reverse derivation
- instead of expanding E into E + id, we reduce E + id back into E
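The shift/reduce trace can be sketched without the full ACTION/GOTO machinery. A simplified Python version for E -> E + id | id that reduces greedily whenever a handle sits on top of the stack (a real LR parser would consult state tables instead of pattern-matching the stack):

```python
def shift_reduce(tokens):
    tokens = tokens + ["$"]
    stack = []
    i = 0
    while True:
        # REDUCE: check the longer handle first
        if stack[-3:] == ["E", "+", "id"]:
            stack[-3:] = ["E"]            # E -> E + id
        elif stack[-1:] == ["id"]:
            stack[-1:] = ["E"]            # E -> id
        elif tokens[i] == "$":
            return stack == ["E"]         # accept iff fully reduced
        else:
            stack.append(tokens[i])       # SHIFT
            i += 1

print(shift_reduce(["id", "+", "id"]))    # True
print(shift_reduce(["id", "+"]))          # False
```

Checking the three-symbol handle before the bare `id` is what keeps `E + id` from being mis-reduced; a real LR automaton makes this decision from its state, not from stack patterns.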
Parser Generators
Most compilers do not hand-write LR tables; constructing them by hand is painful and error-prone
Parser generators automate parsing table construction.
parser generators solve two hard problems automatically:
- they compute parsing tables
- LL: compute FIRST/FOLLOW -> predictive table M
- LR: build LR automaton -> ACTION/GOTO tables
- they generate a working parser
- user provides
- grammar rules
- semantic actions (code to run on reduce/expand)
- output
- parser code (C/C++/Java/etc.)
example: Yacc/Bison workflow
- write a .y file
- tokens
- grammar rules
- action code like “build AST node”
- generator produces:
- a parser engine
- parsing tables embedded in code
- calls action code at reductions
- example grammar fragment (.y file)

  Expr : Expr '+' Term
       | Term
       ;
  Term : id ;

- Tool generates
- parsing tables
- parsing engine code
- Parser builds AST and calls semantic actions
Parser generators
- Yacc/Bison: LR parsers
- ANTLR: LL-style parsers