Lexical Analysis (Scanning)
Converts a character stream into a token stream
simple example:
- input character stream: “position = initial + rate * 60”
- lexemes: “position”, “=”, “initial”, “+”, “rate”, “*”, “60”
- output token stream: (id, 1), (=), (id, 2), (+), (id, 3), (*), (number, 60)
- and the corresponding symbol table, whose entries 1–3 hold “position”, “initial”, “rate”
A lexeme is the actual character sequence matched.
A pattern is the rule describing valid lexemes (usually written as a regular expression).
- example:
  - identifier: [A-Za-z][A-Za-z0-9]*
  - number: [0-9]+
Token <token_name, attribute_value>
- Token-name: the category used by the parser (identifier, number, operator symbols such as ‘+’ and ‘=’)
- Attribute-value: tokens may carry extra information
  - typically a pointer into the symbol table, which holds the details
    - e.g., identifier name / type
- example:
  - (id, 1) means an identifier token with a pointer to symbol table entry #1
  - (+) carries no attribute
struct Token {
    TokenType type;   // type ∈ {ID, NUM, IF, PLUS, ...}
    Attribute attr;   // holds a pointer/index into the symbol table
};
Input Buffer
- lexers read the file in chunks (4KB, 8KB, etc.); this chunk is the input buffer
- stores characters

    std::array<char, 4096> buffer;

- Pointers
  - lexemeBegin: where the current lexeme started
  - forward: the current scanning position

    buffer[lexemeBegin ... forward]

- Retract()
  - moves the forward pointer back by one character
- Lookahead problem
  - the lexer cannot know whether a token has ended until it sees the character after it
  - so it reads one character too far while checking whether the current token is complete, and then needs to retract
  - example: the lexer scans “123+45”
    - reads 1
    - reads 2
    - reads 3
    - reads +
    - on seeing +, the lexer knows “123” is finished, but the + is not part of the number
  - Retract: the lexer has read one character too far, so it must retract; the next time the lexer is called, it starts exactly at the ‘+’
- two-buffer sentinel version (TBD)
#include <array>
#include <cstddef>
#include <fstream>
#include <optional>
#include <string>

class InputBuffer {
public:
    static constexpr std::size_t N = 4096;

    explicit InputBuffer(const std::string& filename)
        : stream_(filename, std::ios::binary) {
        refill();
    }

    std::optional<char> nextChar() {
        if (pos_ >= length_) {
            refill();
            if (length_ == 0) return std::nullopt;
        }
        return buffer_[pos_++];
    }

    // Note: retract() cannot cross a refill boundary; the two-buffer
    // sentinel scheme exists to handle exactly that case.
    void retract() {
        if (pos_ > 0) {
            --pos_;
        }
    }

    std::optional<char> peekChar() {
        auto c = nextChar();
        if (c) retract();
        return c;
    }

private:
    void refill() {
        if (!stream_) {
            length_ = 0;
            return;
        }
        stream_.read(buffer_.data(), N);
        length_ = static_cast<std::size_t>(stream_.gcount());
        pos_ = 0;
    }

    std::ifstream stream_;
    std::array<char, N> buffer_{};
    std::size_t pos_ = 0;
    std::size_t length_ = 0;
};
Token nextToken(InputBuffer& input) {
    auto optC = input.nextChar();
    if (!optC) {
        return Token{TokenType::EndOfFile, {}};
    }
    char c = *optC;
    if (std::isalpha(static_cast<unsigned char>(c))) {
        return readIdentifier(input, c);
    } else if (std::isdigit(static_cast<unsigned char>(c))) {
        return readNumber(input, c);
    } else {
        return operatorToken(c);
    }
}
Token readIdentifier(InputBuffer& input, char firstChar) {
    std::string lexeme;
    lexeme += firstChar;
    // keep consuming letters/digits; the first character that cannot
    // extend the identifier is left unconsumed (peek, then stop)
    while (auto c = input.peekChar()) {
        if (!std::isalnum(static_cast<unsigned char>(*c))) break;
        lexeme += *c;
        input.nextChar();
    }
    return Token{TokenType::ID, lexeme};  // attr would normally index the symbol table
}

The same dispatch, as pseudocode:

c = nextChar()
if isLetter(c):
    lexeme = readIdentifier()
elif isDigit(c):
    lexeme = readNumber()
else:
    return operatorToken(c)
Algorithms of Lexical Analysis
Terminologies
NFA (Nondeterministic Finite Automaton)
- From one state, there may be multiple possible next states
- may have \epsilon-transitions (moves that consume NO input character; \epsilon means the empty string)
- example: NFA for a|b
  - NFA has branching

            --a--> (accept)
    (start)
            --b--> (accept)

  - NFA is nondeterministic: there can be multiple possible transitions
  - from start, if input is ‘a’, take the top arrow; if input is ‘b’, take the bottom arrow
DFA (Deterministic Finite Automaton, table-driven state machine)
- For every (state, input char), there is exactly one next state
- A DFA state represents a set of NFA states
  - simulate “all possible NFA paths” by tracking them as a set
- example: DFA for a|b
  - DFA must have exactly one transition per character

    (state0) --a--> (accept)
    (state0) --b--> (accept)

  - Table form:

    State | ‘a’  | ‘b’
    0     | 1    | 1
    1     | dead | dead

  - DFA is deterministic: from state0, the input uniquely decides the next state
Pipeline: Regex –> NFA –> DFA
step1: Regex –> NFA (Thompson Construction)
- Turn a regex into a machine that recognizes the same language.
- Key properties:
  - \epsilon-transitions (moves that consume no character)
  - branching (multiple possible next states)
- Thompson’s construction builds an NFA by composing small building blocks:
  - For a literal a: (start) --a--> (accept)
  - For concatenation RS: connect R’s accepting state to S’s start state with an \epsilon-transition
step2: NFA –> DFA (Subset Construction)
- Remove nondeterminism so the scanner can run with a single current state.
- Example
- ab|a
- matches ‘ab’ and ‘a’
- build the NFA
- a nondeterministic finite state machine that accepts either:
- a
- ab
- it looks like
        --a--> (accept1)
(start)
        --a--> (state1) --b--> (accept2)

- possible transitions:
  - start --a--> accept1
  - start --a--> state1
  - state1 --b--> accept2
- this is nondeterministic because from start on input ‘a’ there are 2 possibilities
- build the DFA
  - each DFA state is a set of NFA states
  - NFA start state = {start}, so the DFA start state is D0 = {start}
  - determine move({start}, ‘a’) from the NFA:
    - start --a--> accept1
    - start --a--> state1
    - so move({start}, ‘a’) = {accept1, state1}
    - this gives the DFA transition {start} --a--> {accept1, state1}
  - similarly,
    - move({start}, ‘a’) = {accept1, state1}
    - move({start}, ‘b’) = dead
    - move({accept1, state1}, ‘a’) = dead
    - move({accept1, state1}, ‘b’) = {accept2}
  - once accept2 is reached there are no more outgoing transitions: dead for all input
  - final DFA built from the NFA:

    DFA State         | on ‘a’            | on ‘b’    | Accepting?
    {start}           | {accept1, state1} | dead      | No
    {accept1, state1} | dead              | {accept2} | Yes (contains accept1)
    {accept2}         | dead              | dead      | Yes
step3: DFA Minimization (Hopcroft)
- merge states that behave identically on all inputs, shrinking the transition table
Rules
- Longest Match: first, prefer the rule that matches the longest lexeme
- Priority Rule: if there’s still a tie (same longest length), choose the rule that appears earlier in the Lex program
Practical Scanner Implementation
Strategy A - Hand-written Lexer
- Typical loop:

    skip_whitespace();
    if (is_letter(c)) read_identifier();
    else if (is_digit(c)) read_number();
    else return operator_token();

- Pros & Cons
  - Pros:
    - fast
    - customizable
  - Cons:
    - harder to maintain
Strategy B - Generated Lexer (Lex/Flex)
- Input:
- Lexer spec file (Lex/Flex)
- commonly named
- lexer.l
- scanner.l
- Contains
- Regex patterns (rules)
- actions (code to run when a pattern matches)
- example

    %%
    "if"                    { return IF; }
    [a-zA-Z][a-zA-Z0-9]*    { return ID; }
    [0-9]+                  { return NUMBER; }
    [ \t\n]+                ; /* skip whitespace */
    %%

- Run flex

    flex scanner.l

- Produces C code: lex.yy.c
- Compile it

    gcc lex.yy.c -o lexer -lfl

- run it on an input file; it emits tokens
- Tool produces:
- DFA transition table
- efficient scanner code
- Pros & Cons
- Pros:
- Very fast in practice (table-driven DFA)
- Easy to modify token rules (edit regex rules, regenerate)
- Handles lots of tricky cases (lookahead, longest-match, priority rules) systematically
- Cons
- Debugging can be less direct (we debug rules, not the generated control flow)
- Can produce large transition tables (usually fine; occasionally memory-sensitive)
- Rule interactions can surprise; you have to understand “longest match + rule priority”
Errors Handled at Lexical Phase
The lexer detects and reports the errors below early, before parsing:
- illegal characters
- malformed literals
- unterminated strings/comments
more:
Unicode-aware tokenization
incremental lexing (IDEs)
Scannerless parsing (rare)
Lexer modes (context-sensitive lexing)
Symbol Table
SymbolEntry is a struct that stores one symbol record:

name         | kind                   | type          | scope level  | memory offset
token string | variable/function/type | int/float/... | global/local | stack location

SymbolTable wraps SymbolEntry records in a hashmap and exposes operations:
- maintain a scoped hashmap of SymbolEntry list
- lookup type, scope or storage location of a symbol
- Hash-based, average O(1) lookup time by symbol name
    std::unordered_map<std::string, SymbolEntry>
- Scoped
  - use a stack to track each symbol’s scope dynamically during parsing, and record it in the symbol table
  - make SymbolTable a stack of hashmaps
    - global scope map
    - function scope map
    - block scope map
  - One hashmap per scope
  - Same name allowed in different scopes
  - stack
    - when entering a new scope –> push
    - when exiting –> pop

    Stack top    --> Innermost scope (current)
    Stack bottom --> Outermost scope (global)

- example
    int x;        // global scope
    void f(){     // function scope begins
        int x;    // local scope shadows global x
        {         // inner block scope begins
            int y;
        }         // block scope ends
    }

    Global Scope
    └── Function Scope (f)
        └── Block Scope {...}

    Stack top    : {x -> local x entry}
    Stack bottom : {x -> global x entry}

- How lookup works
  - lookup always searches from top to bottom:

    lookup("x"):
        for scope from top to bottom:
            if scope contains "x":
                return that entry
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct SymbolEntry {
    std::string name;
    enum class Kind { Var, Func, Type, Keyword } kind;
    // other properties: type, scope level, memory offset, ...
};

class SymbolTable {
public:
    SymbolTable() {
        enterScope();  // start with the global scope
    }
    void enterScope() {
        scopes_.emplace_back();
    }
    void exitScope() {
        if (!scopes_.empty()) scopes_.pop_back();
    }
    // search from the innermost scope outward (top of stack to bottom)
    std::optional<SymbolEntry> lookup(const std::string& name) const {
        for (int i = static_cast<int>(scopes_.size()) - 1; i >= 0; --i) {
            auto it = scopes_[i].find(name);
            if (it != scopes_[i].end()) return it->second;
        }
        return std::nullopt;
    }
    // insert into the current (innermost) scope; if the name already
    // exists there, the existing entry is returned unchanged
    SymbolEntry& insert(const std::string& name, SymbolEntry::Kind kind) {
        auto& cur = scopes_.back();
        auto [it, inserted] = cur.emplace(name, SymbolEntry{name, kind});
        return it->second;
    }
private:
    std::vector<std::unordered_map<std::string, SymbolEntry>> scopes_;
};