Syntax Analysis (Parsing): token stream -> parse tree / syntax tree
Converts a stream of tokens into syntactic structure
(id)(=)(id)(+)(id)(*)(num)
Parser builds a tree structure that represents:
- operator precedence
- associativity
- nested constructs
Grammar(CFG: Context-Free Grammar)
Grammars describe how abstract categories expand into sequences.
Grammar is
- a set of rules that describe what sequences of tokens are valid in the language.
- a start symbol: a nonterminal that represents a complete valid program (or a complete construct)
CFG (context-free grammar) defines what token sequences are legal
- A context-free grammar is a tuple G = (V, \Sigma, P, S)
- V: set of nonterminals
- \Sigma: set of terminals
- P: set of productions
- S: start symbol
- Naming convention
- Uppercase letters (E, T, A, B) : Nonterminals
- Lowercase letters (a, b, +): Terminals
- id, num: Terminals (tokens)
- Greek letters (\alpha, \beta): Strings of grammar symbols
- \epsilon: empty string
- $: End-of-input marker
- Consists of
- Terminals(tokens)
- terminals are tokens returned by the lexer, Parser cannot break them further
- id, num, +, *, (, ), if, …
- terminals appear in the input stream
- Nonterminals
- nonterminals are names for syntactic categories
- Expr(E): represents a full expression
- Term(T): represents multiplication-level expressions
- Factor(F): represents atomic things like identifiers or parentheses,
- smallest expression unit
- Statement(S): a higher-level syntactic unit, like
- assignment
- if statement
- while statement
- nonterminals do not appear in the input; they appear in the grammar rules
- typical grammar:
  E -> E + T | T
  T -> T * F | F
  F -> (E) | id
- E is “addition-level” expression
- T is “multiplication-level” expression
- F is “atomic unit”
- Example
- Grammar for addition:
  E -> E + id | id
- an expression E is either E + id or just id
grammar representation
- production rule: LHS -> RHS
- LHS: a nonterminal
- RHS: symbols, which can be tokens (terminals) or categories (nonterminals)
- example: F -> (E)
Production
- A production is a rewriting rule: A -> \alpha
- means: the nonterminal A can be replaced by the string \alpha
- example: E -> T E’
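Productions like these are easy to encode directly. A small Python sketch (an illustrative encoding of my own, not tied to any particular tool): each LHS nonterminal maps to a list of RHS alternatives.

```python
# Each LHS nonterminal maps to a list of alternatives;
# each alternative is a list of grammar symbols.
grammar = {
    "E": [["E", "+", "T"], ["T"]],    # E -> E + T | T
    "T": [["T", "*", "F"], ["F"]],    # T -> T * F | F
    "F": [["(", "E", ")"], ["id"]],   # F -> (E) | id
}

nonterminals = set(grammar)           # {"E", "T", "F"}
# Every symbol on a RHS that is not a nonterminal is a terminal.
terminals = {sym for prods in grammar.values()
             for rhs in prods for sym in rhs} - nonterminals
```

This encoding is what the FIRST/FOLLOW computations later operate on.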
Derivation
- A derivation is a sequence of rewrite steps:
- Starts with start symbol (like E)
- Repeatedly replace a nonterminal using a rule
- End with only terminals(tokens)
- example
- rule: E -> E + id | id
- steps of a derivation
- start E
- use rule E -> E + id:
- becomes: E + id
- the left E is still a nonterminal
- replace the left E with rule E -> id
- becomes: id + id
- now everything is terminals
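The rewrite steps above can be replayed mechanically. A minimal Python sketch (illustrative; the rule choices are hard-coded to mirror the steps):

```python
def leftmost_derivation(choices):
    """Replay a derivation for E -> E + id | id, always rewriting
    the leftmost E. choices: the RHS to apply at each step."""
    sentential = ["E"]                  # start symbol
    steps = [" ".join(sentential)]
    for rhs in choices:
        # find the leftmost nonterminal E and rewrite it
        i = next(j for j, s in enumerate(sentential) if s == "E")
        sentential[i:i + 1] = rhs.split()
        steps.append(" ".join(sentential))
    return steps

steps = leftmost_derivation(["E + id", "id"])
print(" => ".join(steps))   # E => E + id => id + id
```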
Parsing pipeline (parsing is table-driven decision making)
- pipeline
grammar
- the grammar alone is not executable
FIRST / FOLLOW sets : extract “lookahead knowledge”
- FIRST/FOLLOW are not arbitrary categories
- they correspond to two concrete parser actions
- FIRST sets
- FIRST(X) is the set of terminals that can appear first in any string derived from X
- FIRST helps choose which production to use when expanding a nonterminal
- Parser sees:
- Stack top: A
- Lookahead: a
- Question: which production A->\alpha should I use?
- Use FIRST(\alpha)
- This applies to
- a single symbol (terminal or nonterminal)
- a sequence of grammar symbols
- FIRST computation
- Rule 1 - terminal
- if a is a terminal
- FIRST(a) = { a }
- Rule 2 - nonterminal
- If A is a nonterminal,
- FIRST(A) = union of FIRST(\alpha) over all productions A -> \alpha
- Rule 3 - string of symbols
- Let \alpha = X_1 X_2 … X_n
- Then
- Add FIRST(X_1) - {\epsilon}
- If \epsilon \in FIRST(X_1), add FIRST(X_2) - {\epsilon}
- Continue through X_3, …, X_n the same way
- If all X_i can derive \epsilon, add \epsilon
- example
- grammar
- E -> T E’
- E’ -> + T E’ | \epsilon
- T -> id
- compute FIRST:
- FIRST(id) = { id }
- FIRST(T) = { id }
- FIRST(E) = { id }
- FIRST(E’) = { +, \epsilon }
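The three FIRST rules are usually computed as a fixed point. A Python sketch for this grammar (names are my own; `eps` stands in for \epsilon and `Ep` for E’):

```python
EPS = "eps"
grammar = {
    "E":  [["T", "Ep"]],               # E  -> T E'
    "Ep": [["+", "T", "Ep"], [EPS]],   # E' -> + T E' | eps
    "T":  [["id"]],                    # T  -> id
}
nonterminals = set(grammar)

def first_of_seq(seq, first):
    """FIRST of a symbol sequence X_1 ... X_n (Rule 3)."""
    out = set()
    for X in seq:
        fx = first[X] if X in nonterminals else {X}   # Rule 1 for terminals
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)          # every X_i can derive epsilon
    return out

def compute_first(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:        # iterate to a fixed point (Rule 2)
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                new = first_of_seq(rhs, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

first = compute_first(grammar)
# matches the sets above: FIRST(E)={id}, FIRST(E')={+, eps}, FIRST(T)={id}
```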
- FOLLOW sets
- FOLLOW(A) is the set of terminals that can appear immediately to the right of A
- FOLLOW is only defined for nonterminals
- FOLLOW never contains \epsilon
- FOLLOW helps decide whether an \epsilon-production may be chosen
- “\epsilon-production”
- A production like E’ -> \epsilon
- Meaning this nonterminal produces nothing
- “choosing \epsilon-production”
- parser decides not to expand the nonterminal further
- “Safe to disappear”
- The next input symbol is allowed to follow this nonterminal
- that condition is checked using FOLLOW
- FOLLOW computation
- Rule 1 - start symbol
- $ \in FOLLOW(S)
- because input must end
- Rule 2 - internal position
- For every production A -> \alpha B \beta
- Add FIRST(\beta) - {\epsilon} to FOLLOW(B)
- Meaning: whatever can start \beta can appear after B
- Rule 3 - end or nullable tail
- if: A -> \alpha B
- or: A -> \alpha B \beta and \epsilon \in FIRST(\beta)
- then: add FOLLOW(A) to FOLLOW(B)
- Meaning: if \beta can disappear, B may be followed by whatever follows A
- example
- grammar
- E -> T E’
- E’ -> + T E’ | \epsilon
- T -> id
- step 1 - start symbol E
- FOLLOW(E) = {$}
- step 2 - from E -> T E’
- E’ is at end -> add FOLLOW(E) to FOLLOW(E’)
- FOLLOW(E’) = {$}
- T is followed by E’
- FIRST(E’) - {\epsilon} = {+}
- So FOLLOW(T) = {+}
- Since \epsilon \belongto FIRST(E’), also add FOLLOW(E)
- FOLLOW(T) = { + , $}
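The FOLLOW rules can be computed the same way. A Python sketch for the same grammar, assuming a FIRST table like the one computed earlier (`eps` and `$` stand in for \epsilon and the end marker; names are my own):

```python
EPS = "eps"
grammar = {
    "E":  [["T", "Ep"]],               # E  -> T E'
    "Ep": [["+", "T", "Ep"], [EPS]],   # E' -> + T E' | eps
    "T":  [["id"]],                    # T  -> id
}
first = {"E": {"id"}, "Ep": {"+", EPS}, "T": {"id"}}
nonterminals = set(grammar)

def first_of_seq(seq):
    out = set()
    for X in seq:
        fx = first[X] if X in nonterminals else {X}
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)
    return out

def compute_follow(grammar, start):
    follow = {A: set() for A in grammar}
    follow[start].add("$")                    # Rule 1: input must end
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B not in nonterminals:
                        continue
                    beta = rhs[i + 1:]
                    fb = first_of_seq(beta) if beta else {EPS}
                    new = fb - {EPS}          # Rule 2
                    if EPS in fb:
                        new |= follow[A]      # Rule 3: nullable tail
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

follow = compute_follow(grammar, "E")
# matches the example: FOLLOW(E)={$}, FOLLOW(E')={$}, FOLLOW(T)={+, $}
```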
- FIRST and FOLLOW drive two distinct parser decisions: which production to expand, and when an \epsilon-production is safe
parsing table : encode that knowledge in a mechanical form
parser control decisions
Example
- rate * 60 + initial
- Lexer gives flat tokens
- <id, rate> <*> <num,60> <+> <id,initial>
- Parser builds structure
        +
       / \
      *   <id,initial>
     / \
<id,rate> <num,60>

- parser enforces precedence:
- * binds tighter than +
Data Structure
- Parse Tree
- shows exactly how the grammar generated the input
- Syntax Tree (AST)
- AST is a simplified parse tree
- removes unnecessary grammar nodes
- keeps essential structure
- Parse stack
- stack used by the parser to track partially recognized syntax
- stores
- LL: stack stores expected grammar symbols (terminals/nonterminals)
- LR: stack stores states and/or grammar symbols
- Parsing Table (LL or LR)
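One way to see the parse-tree/AST distinction in code: a hypothetical AST encoding in Python for `rate * 60 + initial` (class names are my own, not any real compiler's API).

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Id:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# The AST keeps only operators and operands; the grammar helper
# nonterminals (E, T, F) that appear in the parse tree are gone.
ast = BinOp("+", BinOp("*", Id("rate"), Num(60)), Id("initial"))
```

Note how the tree shape still encodes precedence: `*` is nested under `+`.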
Two Main Parsing Strategies
Top-Down Parsing(LL) into Predictive Table
- LL means
- Left-to-right scan
- leftmost derivation
- always expand the leftmost nonterminal first
- Top-down parsing
- start from the start symbol and expand downward
- Predictive parsing
- Parser looks at
- top of stack(nonterminal)
- next input token
- then decides which rule to apply
- Predictive table
a table M that stores which rule of the grammar to apply
- rows = nonterminals (E, E’)
- columns = lookahead tokens (id, +, $)
- cell = which production to use
example table entries
- M[E, id] = E -> id E’
- M[E’, +] = E’ -> + id E’
- M[E’, $] = E’ -> \epsilon
the full table:

  nonterminal | id         | +             | $
  E           | E -> id E’ | /             | /
  E’          | /          | E’ -> + id E’ | E’ -> \epsilon

so if the current nonterminal is E and the next token is id:
- use the production M[E, id] = E -> id E’
- example
- grammar
- rule: E -> id E’
- rule: E’ -> + id E’ | \epsilon
- E is expression
- E’ is the rest of an expression (a helper nonterminal)
- this is a standard transformation for LL parsing
- input id + id $
- LL parsing rule
- LL parsing expand rules by looking ahead one token.
- At each step:
- look at stack top (what we expect)
- look at next input token (lookahead)
- if top is nonterminal: expand using a rule
- if top is terminal: it must match input token
- Step-by-step
- step1
- initial stack [E, $]
- input id + id $
- top is nonterminal E, lookahead is id
- use rule E -> id E’
- stack [id E’ $]
- top is terminal id, lookahead is id, match, consume
- step2
- stack [E’ $]
- input + id $
- top is nonterminal E’, lookahead is +
- use rule E’ -> + id E’
- stack [+ id E’ $]
- top is terminal +, lookahead is +, match, consume
- step3
- stack [id E’ $]
- input id $
- top is terminal id, lookahead is id, matches, consume
- step4
- stack [E’ $]
- input $
- top is nonterminal E’, lookahead is $
- use rule E’ -> \epsilon
- stack [$]
- top is $, lookahead $, match, accept
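The step-by-step trace above is exactly what a table-driven LL(1) parser does. A Python sketch for the same grammar (table and names are illustrative; `Ep` stands for E’):

```python
EPS = "eps"
# Predictive table M[nonterminal, lookahead] -> production RHS
table = {
    ("E", "id"): ["id", "Ep"],        # E  -> id E'
    ("Ep", "+"): ["+", "id", "Ep"],   # E' -> + id E'
    ("Ep", "$"): [EPS],               # E' -> eps
}
nonterminals = {"E", "Ep"}

def ll_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]                # top of stack = end of the list
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top == look == "$":
            return True               # accept
        if top not in nonterminals:   # terminal: must match the input
            if top != look:
                return False
            i += 1
        else:                         # nonterminal: expand via the table
            rhs = table.get((top, look))
            if rhs is None:
                return False          # no entry: syntax error
            for sym in reversed(rhs): # push RHS, leftmost symbol on top
                if sym != EPS:
                    stack.append(sym)
    return False

print(ll_parse(["id", "+", "id"]))    # True
print(ll_parse(["+", "id"]))          # False
```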
Bottom-Up Parsing(LR) into Action/GOTO Table
LR means
- Left-to-right scan
- Rightmost derivation in reverse
- LR parsing constructs a rightmost derivation backwards by reducing
- derivation expands nonterminals -> terminals
- LR parsing reduces terminals -> nonterminals(reverse direction)
Bottom-up parsing:
- start from tokens and reduce upward
ACTION/GOTO table
- In real LR parsing, we can’t just guess when to reduce
- the parser uses a state machine:
- ACTION[state, terminal], decides what to do on terminals, SHIFT/REDUCE/ACCEPT
- GOTO[state, nonterminal], decides where to go after reduce to a nonterminal
- LR parsing stack usually stores:
- states (integers)
- sometimes symbols interleaved
- example shape: [state0, E, state3, +, state7, id, state9]
example
- grammar
- rule: E -> E + id | id (left-recursive, so not LL-friendly, but fine for LR)
- input: id + id $
- LR parsing rule
- SHIFT: push input token onto stack
- REDUCE: replace RHS symbols on stack with LHS nonterminal
- step by step
- step1
- stack []
- SHIFT id
- stack [id]
- REDUCE id -> E
- step2
- stack [E]
- input + id $
- SHIFT +
- step3
- stack [E +]
- input id $
- SHIFT id
- stack [E + id]
- REDUCE E + id -> E
- step4
- stack [E]
- input $
- accept
- reverse derivation
- instead of expanding E into E + id, we reduce E + id back into E
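The shift/reduce trace can be sketched without the full ACTION/GOTO machinery. A simplified Python version for E -> E + id | id that reduces greedily whenever a handle sits on top of the stack (a real LR parser would consult state tables instead of pattern-matching the stack):

```python
def shift_reduce(tokens):
    tokens = tokens + ["$"]
    stack = []
    i = 0
    while True:
        # REDUCE: check the longer handle first
        if stack[-3:] == ["E", "+", "id"]:
            stack[-3:] = ["E"]            # E -> E + id
        elif stack[-1:] == ["id"]:
            stack[-1:] = ["E"]            # E -> id
        elif tokens[i] == "$":
            return stack == ["E"]         # accept iff fully reduced
        else:
            stack.append(tokens[i])       # SHIFT
            i += 1

print(shift_reduce(["id", "+", "id"]))    # True
print(shift_reduce(["id", "+"]))          # False
```

Checking the three-symbol handle before the bare `id` is what keeps `E + id` from being mis-reduced; a real LR automaton makes this decision from its state, not from stack patterns.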
Parser Generators
Most compilers do not hand-write LR tables; constructing them by hand is painful and error-prone
Parser generators automate parsing table construction.
parser generators solve two hard problems automatically:
- they compute parsing tables
- LL: compute FIRST/FOLLOW -> predictive table M
- LR: build LR automaton -> ACTION/GOTO tables
- they generate a working parser
- user provides
- grammar rules
- semantic actions (code to run on reduce/expand)
- output
- parser code (C/C++/Java/etc.)
example: Yacc/Bison workflow
- write a .y file
- tokens
- grammar rules
- action code like “build AST node”
- generator produces:
- a parser engine
- parsing tables embedded in code
- calls action code at reductions
- example grammar fragment (.y file)

  Expr : Expr '+' Term
       | Term
       ;
  Term : id ;

- Tool generates
- parsing tables
- parsing engine code
- Parser builds AST and calls semantic actions
Parser generators
- Yacc/Bison: LR parsers
- ANTLR: LL-style parsers