Compiler - Parser

Syntax Analysis (Parsing): token stream --> parse tree / syntax tree

  • Converts a stream of tokens into a syntactic structure

    • (id)(=)(id)(+)(id)(*)(num)

    • Parser builds a tree structure that represents:

      • operator precedence
      • associativity
      • nested constructs
    • Grammar(CFG: Context-Free Grammar)

      • Grammars describe how abstract categories expand into sequences.

      • Grammar is

        • a set of rules that describe what sequences of tokens are valid in the language.
        • a start symbol: a nonterminal that represents a complete valid program (or complete construct)
      • CFG(context-free grammar) defines what token sequences are legal

        • A context-free grammar is a tuple:
          • G = (V, \Sigma, P, S)
            • V: Set of nonterminals
            • \Sigma: Set of terminals
            • P: Set of productions
            • S: Start symbol
        • Naming convention
          • Uppercase letters (E, T, A, B) : Nonterminals
          • Lowercase letters (a, b, +): Terminals
          • id, num: Terminals (tokens)
          • Greek letters (\alpha, \beta): Strings of grammar symbols
          • \epsilon: empty string
          • $: End-of-input marker
        • Consists of
          • Terminals(tokens)
            • terminals are tokens returned by the lexer; the parser cannot break them down further
              • id, num, +, *, (, ), if, …
            • terminals appear in the input stream
          • Nonterminals
            • nonterminals are names for syntactic categories
              • Expr(E): represents a full expression
              • Term(T): represents multiplication-level expressions
              • Factor(F): represents atomic things like identifiers or parentheses
                • smallest expression unit
              • Statement(S): a higher-level syntactic unit, like
                • assignment
                • if statement
                • while statement
            • nonterminals do not appear in the input; they appear in the grammar rules
            • typical grammar:
              E -> E + T | T
              T -> T * F | F
              F -> (E) | id
              • E is “addition-level” expression
              • T is “multiplication-level” expression
              • F is “atomic unit”
        • Example
          • Grammar for addition:
            E -> E + id | id
            • an expression E is either E + id or just id
      • grammar representation

        • production rule:
          • LHS -> RHS
            • LHS nonterminals
            • RHS symbols, which could be tokens(terminals) or categories(nonterminals)
        • example
          • F -> (E)
    • Production

      • A production is a rewriting rule
        • A -> \alpha
          • means: the nonterminal A can be replaced by the string \alpha
      • example
        • E -> T E’
    • Derivation

      • A derivation is a sequence of rewrite steps:
        • Starts with start symbol (like E)
        • Repeatedly replace a nonterminal using a rule
        • End with only terminals(tokens)
      • example
        • rule: E -> E + id | id
        • steps of a derivation
          • start E
          • use rule E -> E + id:
          • become: E + id,
          • the left E is still a nonterminal
          • Replace left E with rule E -> id
          • become: id + id
          • now everything is terminals
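The derivation steps above can be traced as literal string rewriting. This is only a sketch to make the rewrite idea concrete; real parsers do not manipulate strings this way:

```python
# Derivation for E -> E + id | id, done as plain string rewriting.
# Each step replaces the leftmost "E" using one production.
sentential = "E"
steps = [sentential]

sentential = sentential.replace("E", "E + id", 1)  # apply E -> E + id
steps.append(sentential)

sentential = sentential.replace("E", "id", 1)      # apply E -> id to the left E
steps.append(sentential)

print(steps)  # ['E', 'E + id', 'id + id'] -- only terminals remain
```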
    • Parsing pipeline (parsing is table-driven decision making)

      • pipeline
        • grammar

          • the grammar alone is not executable
        • FIRST / FOLLOW sets : extract “lookahead knowledge”

          • FIRST/FOLLOW extract “lookahead knowledge”
          • FIRST/FOLLOW are not arbitrary categories
            • they correspond to two concrete parser actions
          • FIRST sets
            • FIRST(X) is the set of terminals that can appear first in any string derived from X
            • FIRST helps choose which production to use when expanding a nonterminal
              • Parser sees:
                • Stack top: A
                • Lookahead: a
              • Question: which production A->\alpha should I use?
              • Use FIRST(\alpha)
            • This applies to
              • a single symbol (terminal or nonterminal)
              • a sequence of grammar symbols
            • FIRST computation
              • Rule 1 - terminal
                • if a is a terminal
                • FIRST(a) = { a }
              • Rule 2 - nonterminal
                • If A is a nonterminal
                • FIRST(A) = union of FIRST(\alpha) for all productions A -> \alpha
              • Rule 3 - string of symbols
                • Let a = X_1 X_2 … X_n
                  • Then
                      1. Add FIRST(X_1) - {\epsilon}
                      2. If \epsilon \in FIRST(X_1), add FIRST(X_2) - {\epsilon}
                      3. Continue with X_3, …, X_n in the same way
                      4. If all X_i can derive \epsilon, add \epsilon
            • example
              • grammar
                • E -> T E’
                • E’ -> + T E’ | \epsilon
                • T -> id
              • compute FIRST:
                • FIRST( id ) = { id }
                • FIRST( T ) = { id }
                • FIRST( E ) = { id }
                • FIRST( E’) = { + , \epsilon }
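The FIRST rules above can be sketched as a fixpoint iteration over this example grammar. A minimal sketch: the dictionary encoding, the marker string "eps", and the name Ep for E’ are choices of this example, not standard API:

```python
# Iterative FIRST computation for: E -> T E', E' -> + T E' | eps, T -> id
EPS = "eps"
grammar = {
    "E":  [["T", "Ep"]],
    "Ep": [["+", "T", "Ep"], [EPS]],
    "T":  [["id"]],
}
nonterminals = set(grammar)
first = {nt: set() for nt in grammar}

def first_of(symbol):
    # Rule 1: a terminal's FIRST set is the terminal itself
    return first[symbol] if symbol in nonterminals else {symbol}

changed = True
while changed:                          # iterate until no set grows (fixpoint)
    changed = False
    for nt, productions in grammar.items():
        for prod in productions:        # Rule 2: union over all productions
            for sym in prod:            # Rule 3: walk the RHS left to right
                f = first_of(sym)
                before = len(first[nt])
                first[nt] |= f - {EPS}
                if len(first[nt]) != before:
                    changed = True
                if EPS not in f:        # this symbol is not nullable: stop
                    break
            else:                       # every RHS symbol was nullable
                if EPS not in first[nt]:
                    first[nt].add(EPS)
                    changed = True

print(first)  # FIRST(E)={id}, FIRST(E')={+, eps}, FIRST(T)={id}
```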
          • FOLLOW sets
            • FOLLOW(A) is the set of terminals that can appear immediately to the right of A
              • FOLLOW is only defined for nonterminals
              • FOLLOW never contains \epsilon
              • FOLLOW helps decide whether an \epsilon-production may be applied
                • “\epsilon-production”
                  • A production like E’ -> \epsilon
                  • Meaning this nonterminal produces nothing
                • “choosing \epsilon-production”
                  • parser decides not to expand the nonterminal further
                • “Safe to disappear”
                  • The next input symbol is allowed to follow this nonterminal
                  • that condition is checked using FOLLOW
            • FOLLOW computation
              • Rule 1 - start symbol
                • $ \in FOLLOW(S)
                  • because input must end
              • Rule 2 - internal position
                • For every production A -> \alpha B \beta
                • Add FIRST(\beta) - {\epsilon} to FOLLOW(B)
                • Meaning: whatever can start \beta can appear after B
              • Rule 3 - end or nullable tail
                • if: A -> \alpha B
                • or: A -> \alpha B \beta and \epsilon \in FIRST(\beta)
                • then: add FOLLOW(A) to FOLLOW(B)
                • Meaning: if \beta can disappear, B may be followed by whatever follows A
            • example
              • grammar
                • E -> T E’
                • E’ -> + T E’ | \epsilon
                • T -> id
              • step 1 - start symbol E
                • FOLLOW(E) = { $ }
              • step 2 - from E -> T E’
                • E’ is at end -> add FOLLOW(E) to FOLLOW(E’)
                  • FOLLOW(E’) = {$}
                • T is followed by E’
                  • FIRST(E’) - {\epsilon} = {+}
                  • So FOLLOW(T) = {+}
                • Since \epsilon \in FIRST(E’), also add FOLLOW(E) to FOLLOW(T)
                  • FOLLOW(T) = { + , $}
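The FOLLOW rules can be sketched the same way, as another fixpoint iteration over the same grammar. The FIRST sets are written in by hand here (matching the values computed earlier) so the sketch stands alone:

```python
# Iterative FOLLOW computation for: E -> T E', E' -> + T E' | eps, T -> id
EPS = "eps"
grammar = {
    "E":  [["T", "Ep"]],
    "Ep": [["+", "T", "Ep"], [EPS]],
    "T":  [["id"]],
}
first = {"E": {"id"}, "Ep": {"+", EPS}, "T": {"id"},
         "+": {"+"}, "id": {"id"}}

def first_of_string(symbols):
    # FIRST of a sequence X_1 ... X_n (Rule 3 of FIRST computation)
    out = set()
    for sym in symbols:
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)                          # the whole tail is nullable
    return out

follow = {nt: set() for nt in grammar}
follow["E"].add("$")                      # Rule 1: $ follows the start symbol

changed = True
while changed:                            # iterate until fixpoint
    changed = False
    for a, productions in grammar.items():
        for prod in productions:
            for i, b in enumerate(prod):
                if b not in grammar:      # FOLLOW is only defined for nonterminals
                    continue
                tail = first_of_string(prod[i + 1:])
                # Rule 2: FIRST(tail) - {eps}; Rule 3: FOLLOW(a) if tail nullable
                new = (tail - {EPS}) | (follow[a] if EPS in tail else set())
                if not new <= follow[b]:
                    follow[b] |= new
                    changed = True

print(follow)  # FOLLOW(E)={$}, FOLLOW(E')={$}, FOLLOW(T)={+, $}
```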
          • FIRST and FOLLOW drive two distinct parser decisions: which production to expand (FIRST) and when an \epsilon-production is safe (FOLLOW)
        • parsing table : encode that knowledge in a mechanical form

        • parser control decisions

          • Parsing is table-driven decision making
  • Example

    • rate * 60 + initial
    • Lexer gives flat tokens
      • <id, rate> <*> <num,60> <+> <id,initial>
    • Parser builds structure
                +
               / \
              *   <id,initial>
             / \
      <id,rate> <num,60>
      • parser enforces precedence:
        • * binds tighter than +
  • Data Structure

    • Parse Tree
      • shows exactly how the grammar derives the input
    • Syntax Tree (AST)
      • AST is a simplified parse tree
        • removes unnecessary grammar nodes
        • keeps essential structure
    • Parse stack
      • stack used by the parser to track partially recognized syntax
      • stores
        • LL: stack stores expected grammar symbols (terminals/nonterminals)
        • LR: stack stores states and/or grammar symbols
    • Parsing Table (LL or LR)
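The parse tree vs AST distinction can be made concrete with a tiny sketch. The node classes below are invented for illustration; the point is that grammar helper nodes (E, T, F) are gone and only operators and operands remain, with precedence encoded in the nesting:

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Id:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# AST for "rate * 60 + initial": the * node nests under +,
# reflecting that * binds tighter than +.
ast = BinOp("+", BinOp("*", Id("rate"), Num(60)), Id("initial"))
print(ast)
```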
  • Two Main Parsing Strategies

    • Top-Down Parsing(LL) into Predictive Table

      • LL means
        • Left-to-right scan
        • leftmost derivation
          • always expand the leftmost nonterminal first
      • Top-down parsing
        • start from the start symbol and expand downward
      • Predictive parsing
        • Parser looks at
          • top of stack(nonterminal)
          • next input token
        • then decides which rule to apply
        • Predictive table
          • a table that maps (nonterminal, lookahead token) pairs to productions

            • rows = nonterminals (E, E’)
            • columns = lookahead tokens (id, +, $)
            • cell = which production to use
          • example table entry

            • M[E, id] = E -> id E’
            • M[E’, +] = E’ -> + id E’
            • M[E’, $] = E’ -> \epsilon

            | nonterminal | id         | +             | $              |
            | ----------- | ---------- | ------------- | -------------- |
            | E           | E -> id E’ | /             | /              |
            | E’          | /          | E’ -> + id E’ | E’ -> \epsilon |
          • so if current nonterminal is E and next token is id:

            • use this production rule
      • example
        • grammar
          • rule: E -> id E’
          • rule: E’ -> + id E’ | \epsilon
            • E is expression
            • E’ is the rest of an expression (a helper nonterminal)
            • this is a standard transformation for LL parsing
        • input id + id $
        • LL parsing rule
          • LL parsing expand rules by looking ahead one token.
          • At each step:
            • look at stack top (what we expect)
            • look at next input token (lookahead)
            • if top is nonterminal: expand using a rule
            • if top is terminal: it must match input token
        • Step-by-step
          • step1
            • initial stack [E, $]
            • input id + id $
            • top is nonterminal E, lookahead is id
            • use rule E -> id E’
            • stack [id E’ $]
            • top is terminal id, lookahead is id, match, consume
          • step2
            • stack [E’ $]
            • input + id $
            • top is nonterminal E’, lookahead is +
            • use rule E’ -> + id E’
            • stack [+ id E’ $]
            • top is terminal +, lookahead is +, match, consume
          • step3
            • stack [id E’ $]
            • input id $
            • top is terminal id, lookahead is id, matches, consume
          • step4
            • stack [E’ $]
            • input $
            • top is nonterminal E’, lookahead is $
            • use rule E’ -> \epsilon
            • stack [$]
            • top is $, lookahead $, match, accept
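The walkthrough above can be sketched as a table-driven LL(1) parser. A minimal sketch: Ep stands for E’, and the predictive table is written out by hand rather than computed from FIRST/FOLLOW:

```python
# LL(1) parser for: E -> id E', E' -> + id E' | eps, on input ending in "$"
EPS = "eps"
table = {                                   # predictive table M[nonterminal, lookahead]
    ("E", "id"): ["id", "Ep"],
    ("Ep", "+"): ["+", "id", "Ep"],
    ("Ep", "$"): [EPS],
}
nonterminals = {"E", "Ep"}

def ll_parse(tokens):
    stack = ["$", "E"]                      # $ at the bottom, start symbol on top
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top == EPS:
            continue                        # epsilon produces nothing
        if top in nonterminals:
            prod = table.get((top, look))
            if prod is None:
                return False                # no table entry: syntax error
            stack.extend(reversed(prod))    # push RHS so its leftmost symbol is on top
        elif top == look:
            pos += 1                        # terminal matches lookahead: consume
        else:
            return False                    # terminal mismatch
    return pos == len(tokens)

print(ll_parse(["id", "+", "id", "$"]))     # accepted
```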
    • Bottom-Up Parsing(LR) into Action/GOTO Table

      • LR means

        • Left-to-right scan
        • Rightmost derivation in reverse
          • LR parsing constructs a rightmost derivation backwards by reducing
            • derivation expands nonterminals -> terminals
            • LR parsing reduces terminals -> nonterminals(reverse direction)
      • Bottom-up parsing:

        • start from tokens and reduce upward
      • ACTION/GOTO table

        • In real LR parsing, we can’t just guess when to reduce
        • the parser uses a state machine:
          • ACTION[state, terminal], decides what to do on terminals, SHIFT/REDUCE/ACCEPT
          • GOTO[state, nonterminal], decides where to go after reduce to a nonterminal
        • LR parsing stack usually stores:
          • states (integers)
          • sometimes symbols interleaved
        • example shape: [state0, E, state3, +, state7, id, state9]
      • example

        • grammar
          • rule: E -> E + id | id (left-recursive, so not LL-friendly, but fine for LR)
        • input: id + id $
        • LR parsing rule
          • SHIFT: push input token onto stack
          • REDUCE: replace RHS symbols on stack with LHS nonterminal
        • step by step
          • step1
            • stack []
            • SHIFT id
            • stack [id]
            • REDUCE id -> E
          • step2
            • stack [E]
            • input + id $
            • SHIFT +
          • step3
            • stack [E +]
            • input id $
            • SHIFT id
            • stack [E + id]
            • REDUCE E + id -> E
          • step4
            • stack [E]
            • input $
            • accept
        • reverse derivation
          • instead of expanding E into E + id, we reduce E + id back into E
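The shift/reduce trace above can be sketched in a few lines. Note the hedge: a real LR parser decides when to reduce by consulting its ACTION table of states; this sketch hard-codes the reduce decisions for this one grammar to keep the trace readable:

```python
# Shift-reduce parsing for: E -> E + id | id, on input ending in "$"
def lr_parse(tokens):
    stack, pos = [], 0
    while True:
        # REDUCE whenever the stack top matches a production's RHS
        if stack[-3:] == ["E", "+", "id"]:
            stack[-3:] = ["E"]              # reduce E + id -> E
        elif stack[-1:] == ["id"]:
            stack[-1:] = ["E"]              # reduce id -> E
        elif tokens[pos] == "$":
            return stack == ["E"]           # accept iff exactly one E remains
        else:
            stack.append(tokens[pos])       # SHIFT the next token
            pos += 1

print(lr_parse(["id", "+", "id", "$"]))     # accepted
```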
  • Parser Generators

    • Most compilers do not hand-write LR tables; constructing them by hand is painful and error-prone

    • Parser generators automate parsing table construction.

    • parser generators solve two hard problems automatically:

      • they compute parsing tables
        • LL: compute FIRST/FOLLOW -> predictive table M
        • LR: build the LR automaton -> ACTION/GOTO
      • they generate a working parser
        • user provide
          • grammar rules
          • semantic actions (code to run on reduce/expand)
        • output
          • parser code (C/C++/Java/etc.)
    • example: Yacc/Bison workflow

      1. write a .y file
        • tokens
        • grammar rules
        • action code like “build AST node”
      2. generator produces:
        • a parser engine
        • parsing tables embedded in code
        • calls action code at reductions
      • write grammar file (the generator derives the ACTION/GOTO table from it)
        Expr : Expr '+' Term
        | Term
        ;
        Term : id ;
      • Tool generates
        • parsing tables
        • parsing engine code
      • Parser builds AST and calls semantic actions
    • Parser generators

      • Yacc/Bison: LR parsers
      • ANTLR: LL-style parsers

### 1.1 Lexical Structure (scanning)
Convert character stream to tokens by
- Regular expressions
- Finite automata (NFA --> DFA)
- Lexical analyzers (Lex/Flex)

Key algorithms:
- Thompson construction
- Subset construction
- DFA minimization

### 1.2 Syntactic Structure (Parsing)
Convert tokens to Parse Tree/AST
- Context-free grammars (CFG)
- LL Parsing(top-down)
- LR parsing(bottom-up)
- Parser generators(Yacc/Bison)

Key algorithms:
- FIRST/FOLLOW computation
- Predictive parsing tables
- LR(0), SLR, LALR, Canonical LR

### 1.3 Semantic Structure (Meaning & Types)
Convert AST to Typed, well-formed program
- Symbol tables
- Scope rules
- Type checking
- Type inference(unification)

Key Topics:
- Static vs dynamic typing
- Type equivalence
- Polymorphism basics

### 1.4 Syntax-Directed Translation
Attach computation to syntax
- Attribute grammars
- Syntax-directed definitions(SDD)
- Syntax-directed translation(SDT)

Implementation:
- Annotated parse trees
- Translation schemes
- Dependency graphs

## Layer 2 -- Intermediate Representation (IR Core)

TAC --> CFG --> SSA

### 2.1 IR Forms
- AST
- DAGs
- Three-address code (TAC)
- Control-flow graph (CFG)
- Static single assignment (SSA)

### 2.2 Control-Flow Translation
- Boolean expressions
- Short-circuit evaluation
- Backpatching
- Basic blocks

### 2.3 Data Structures of IR
Key implementation artifacts:
- IR instruction objects
- Def-use chains
- Dominator trees
- Phi functions (SSA)

## Layer 3 -- Optimization (Program Improvement)

Data-flow --> Global opts --> Loop opts

### 3.1 Local Optimization
- Peephole optimization
- Algebraic simplification
- Constant folding

### 3.2 Data-Flow Analysis Framework

Program Analysis = CFG + Equations on Sets

Core analyses:
- Reaching definitions
- Live variables
- Available expressions
- Constant propagation

Concepts:
- Forward vs backward analysis
- Meet-over-paths solution
- Iterative fixpoint computation
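As one concrete instance of the framework (CFG + equations on sets), live-variable analysis, a backward analysis, can be solved by iterating transfer functions to a fixpoint. The three-block CFG, its def/use sets, and the block names below are invented purely for illustration:

```python
# Live-variable analysis by iterative fixpoint computation.
# IN[b] = use[b] | (OUT[b] - def[b]);  OUT[b] = union of IN over successors.
blocks = ["B1", "B2", "B3"]
succ = {"B1": ["B2"], "B2": ["B3", "B2"], "B3": []}
use_ = {"B1": set(), "B2": {"x"}, "B3": {"y"}}   # read before any write in the block
def_ = {"B1": {"x"}, "B2": {"y"}, "B3": set()}   # written in the block

live_in = {b: set() for b in blocks}
live_out = {b: set() for b in blocks}

changed = True
while changed:                                    # iterate until fixpoint
    changed = False
    for b in reversed(blocks):                    # backward analysis: visit in reverse
        out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
        inn = use_[b] | (out - def_[b])           # transfer function
        if out != live_out[b] or inn != live_in[b]:
            live_out[b], live_in[b] = out, inn
            changed = True

print(live_in, live_out)  # x live into B2; y live into B3 and out of B2
```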

### 3.3 Global Optimizations
- Common subexpression elimination
- Partial redundancy elimination (PRE)
- Loop invariant code motion
- Strength reduction
- Induction variable elimination

### 3.4 Loop Optimizations + Locality
- Dependence analysis
- Loop transformations
- Tiling, interchange, fusion

### 3.5 Interprocedural Optimization
- Call graph construction
- Context sensitivity
- Pointer analysis foundations

## Layer 4 -- Back-End (Machine Realization)

Instruction selection --> Reg alloc --> scheduling --> Runtime


### 4.1 Instruction Selection
- Tree covering /tiling
- Pattern matching
- DAG-based code generation

### 4.2 Register Allocation
Algorithm:
- Graph coloring allocation
- Spill handling
- Live range splitting

### 4.3 Instruction Scheduling
- Dependence graphs
- List scheduling
- Pipeline hazards

### 4.4 Runtime Organization
- Activation records
- Stack layout
- Heap allocation
- Garbage collection


## Layer 5 -- Modern Compiler Architecture

LLVM + MLIR + JIT + GPU

- LLVM/MLIR multi-level IR
- SSA as default IR
- Profile-guided optimization (PGO)
- JIT + tiered compilation
- Auto-vectorization for SIMD/GPU
- Verified compilation (CompCert)
- Security hardening passes



# Concepts that support the compiler system

## 1. Formal Language Theory
- Regex, CFG, PDA
- Automata models

## 2. Graph Theory Everywhere
- CFGs
- Dominators
- Interference graphs
- Dependence graphs

## 3. Algorithmic Frameworks
- Fixpoint iteration
- Lattice theory
- Union-find
- Worklist algorithms

## 4. Engineering Practice
- Pass pipelines
- IR design tradeoffs
- Debuggable compilation