Compiler - Lexer

Lexical Analysis (Scanning)

  • Convert Character stream to Token stream

    • simple example:

      • input character stream: “position = initial + rate * 60”.
        • lexemes include: “position”, “=”, “initial”, “+”, “rate”, “*”, “60”
      • output Token Stream
        • output tokens: (id, 1), (=), (id, 2), (+), (id, 3), (*), (60)
        • plus the corresponding symbol table holding the identifier details
    • Lexeme is the actual character sequence matched.

    • Pattern is the rule describing valid lexemes (usually described using regular expressions)

      • example:
        • identifier: [A-Za-z][A-Za-z0-9]*
        • number: [0-9]+
    • Token <token_name, attribute_value>

      • Token-name is the category used by parsing (identifier, number, symbols such as ‘+’ and ‘=’)
      • Attribute-value: tokens may carry extra information
        • typically points into the symbol table, which holds the details
        • e.g., identifier name / type
      • example:
        • {id, 1} means identifier token with pointer into symbol table entry #1
        • {+} no attribute
          struct Token {
              TokenType type; // category: {ID, NUM, IF, PLUS, ...}
              Attribute attr; // holds a pointer/index into the symbol table
          };
    • Input Buffer

      • lexers read the file in fixed-size chunks (e.g., 4 KB or 8 KB); this chunk is the input buffer
        • stores characters
          std::array<char, 4096> buffer;
      • Pointers
        • LexemeBegin pointer: where the current token started
        • Forward: current scanning position
          buffer[lexemeBegin ... forward]
      • Retract()
        • move the buffer Forward pointer back by one
        • Lookahead problem
          • the lexer cannot know whether a token has ended until it sees the character after it
          • so the lexer reads one character too far while checking whether the current token is complete
          • it then needs to retract (un-read) that extra character
        • example: lexer scan characters “123+45”
          • reads 1
          • reads 2
          • reads 3
          • reads +
          • when the lexer sees ‘+’, it knows “123” is finished; the ‘+’ is not part of the number
          • Retract: the lexer has read one character too far, so it must retract so that the next call to the lexer starts exactly at the ‘+’
      • two-buffer sentinel version (TBD)
      #include <array>
      #include <cctype>
      #include <fstream>
      #include <optional>
      #include <string>

      class InputBuffer {
      public:
          static constexpr std::size_t N = 4096;

          explicit InputBuffer(const std::string& filename)
              : stream_(filename, std::ios::binary) {
              refill();
          }

          std::optional<char> nextChar() {
              if (pos_ >= length_) {
                  refill();
                  if (length_ == 0) return std::nullopt;
              }
              return buffer_[pos_++];
          }

          void retract() {
              if (pos_ > 0) {
                  --pos_;
              }
          }

          std::optional<char> peekChar() {
              auto c = nextChar();
              if (c) retract();
              return c;
          }

      private:
          void refill() {
              if (!stream_) {
                  length_ = 0;
                  return;
              }
              stream_.read(buffer_.data(), N);
              length_ = static_cast<std::size_t>(stream_.gcount());
              pos_ = 0;
          }

          std::ifstream stream_;
          std::array<char, N> buffer_{};
          std::size_t pos_ = 0;
          std::size_t length_ = 0;
      };

      Token readIdentifier(InputBuffer& input, char firstChar);
      Token readNumber(InputBuffer& input, char firstChar); // analogous to readIdentifier
      Token operatorToken(char c);

      Token nextToken(InputBuffer& input) {
          auto optC = input.nextChar();
          if (!optC) {
              return Token{TokenType::EndOfFile, ""};
          }
          char c = *optC;
          if (std::isalpha(static_cast<unsigned char>(c))) {
              return readIdentifier(input, c);
          } else if (std::isdigit(static_cast<unsigned char>(c))) {
              return readNumber(input, c);
          } else {
              return operatorToken(c);
          }
      }

      Token readIdentifier(InputBuffer& input, char firstChar) {
          std::string lexeme;
          lexeme += firstChar;
          // Consume letters/digits; retract past the first character that
          // cannot extend the identifier, so the next call sees it.
          while (auto optC = input.nextChar()) {
              if (std::isalnum(static_cast<unsigned char>(*optC))) {
                  lexeme += *optC;
              } else {
                  input.retract();
                  break;
              }
          }
          return Token{TokenType::ID, lexeme};
      }

      c = nextChar()

      if isLetter(c):
          lexeme = readIdentifier()
      elif isDigit(c):
          lexeme = readNumber()
      else:
          return operatorToken(c)

  • Algorithms of Lexical Analysis

    • Terminologies

      • NFA (Nondeterministic Finite Automaton)

        • From one state, may have multiple possible next states
        • may have \epsilon-transitions: moves that consume no input character (\epsilon denotes the empty string)
        • example: NFA for a|b
          • NFA has branching
                    --a--> (accept)
            (start)
                    --b--> (accept)
          • the NFA is nondeterministic: multiple possible transitions
          • from start, if input is ‘a’, take top arrow, if input is ‘b’, take bottom arrow
      • DFA (Deterministic Finite Automaton, a table-driven state machine)

        • For every(state, input char), there is exactly one next state
        • A DFA state represents a set of NFA states
        • simulate “all possible NFA paths” by tracking them as a set
        • example: DFA for a|b
          • DFA must have exactly one transition per character
            (state0) --a--> (accept)
            (state0) --b--> (accept)
          • Table form:
            State | ‘a’  | ‘b’
            0     | 1    | 1
            1     | dead | dead
          • the DFA is deterministic: from state0, the input uniquely determines the next state
    • Pipeline: Regex –> NFA –> DFA

      • step1: Regex –> NFA (Thompson Construction)

        • Turn a regex into a machine that recognizes the same language.
        • Key property:
          • \epsilon-transitions (moves that consume no character)
          • branching (multiple possible next states)
        • Thompson’s construction builds an NFA by composing small building blocks:
          1. For a literal a: start --a--> accept
          2. For concatenation RS: connect R’s accept state to S’s start state with an \epsilon-transition
          3. For union R|S: a new start state \epsilon-branches into R and S; both accepts \epsilon-connect to a new accept state
          4. For closure R*: \epsilon-transitions allow skipping R entirely or looping back through it
      • step2: NFA –> DFA (Subset Construction)

        • Remove nondeterminism so the scanner can run with a single current state.
        • Example
          • ab|a
          • matches ‘ab’ and ‘a’
          • build the NFA
            • a nondeterministic finite state machine that accepts either:
              • a
              • ab
            • it looks like
                      --a--> (accept1)
              (start)
                      --a--> (s1) --b--> (accept2)
            • possible transitions:
              • start–a–>accept1
              • start–a–>state1
              • state1–b–>accept2
            • this is nondeterministic because from start on input ‘a’, there are 2 possibilities
          • build DFA
            • each DFA state is a set of NFA states
            • the NFA start state is start, so the DFA start state is A = {start}
            • determine move({start}, ‘a’) according to the NFA
            • from the NFA, starting at start:
              • start–a–>accept1
              • start–a–>state1
            • so: move({start},a) ={accept1,state1}
            • get the DFA transition: {start}–a–>{accept1,state1}
            • similarly,
              • move({start},’a’) = {accept1, state1}
              • move({start},’b’) = dead
              • move({accept1,state1},’a’) = dead
              • move({accept1,state1},’b’) = {accept2}
            • {accept2} has no outgoing transitions, so every input leads to the dead state
            • final DFA build from NFA
              DFA State         | on ‘a’            | on ‘b’    | Accepting?
              {start}           | {accept1, state1} | dead      | No
              {accept1, state1} | dead              | {accept2} | Yes (accept1)
              {accept2}         | dead              | dead      | Yes
      • step3: DFA Minimization (Hopcroft)

        • merge equivalent states (states that no input string can distinguish)
    • Rules

      • Longest Match: always take the longest lexeme that matches any pattern
      • Priority Rule, if there’s still a tie (same longest length), choose the rule that appears earlier in the Lex program
  • Practical Scanner Implementation

    • Strategy A - Hand-written Lexer

      • Typical loop:
        skip_whitespace();

        if (is_letter(c)) read_identifier();
        else if (is_digit(c)) read_number();
        else return operator_token();
      • Pros & Cons
        • Pros:
          • fast
          • customizable
        • Cons:
          • harder to maintain
    • Strategy B - Generated Lexer (Lex/Flex)

      • Input:
        • Lexer spec file (Lex/Flex)
          • commonly named
            • lexer.l
            • scanner.l
          • Contains
            • Regex patterns (rules)
            • actions (code to run when a pattern matches)
          • example
            %%
            "if" { return IF; }
            [a-zA-Z][a-zA-Z0-9]* { return ID; }
            [0-9]+ { return NUMBER; }
            [ \t\n]+ ; /* skip whitespace */
            %%
          • Run flex
            flex scanner.l
          • Produces C code: lex.yy.c
          • Compile it
            gcc lex.yy.c -o lexer -lfl
          • run it on an input file, it emits tokens
      • Tool produces:
        • DFA transition table
        • efficient scanner code
      • Pros & Cons
        • Pros:
          • Very fast in practice (table-driven DFA)
          • Easy to modify token rules (edit regex rules, regenerate)
          • Handles lots of tricky cases (lookahead, longest-match, priority rules) systematically
        • Cons
          • Debugging can be less direct (we debug rules, not the generated control flow)
          • Can produce large transition tables (usually fine; occasionally memory-sensitive)
          • Rule interactions can be surprising; you have to understand “longest match + rule priority”
  • Errors Handled at Lexical Phase
    (the lexer detects and reports these errors early, before parsing:)

    • illegal characters
    • malformed literals
    • unterminated strings/comments
  • more:

    • Unicode-aware tokenization

    • incremental lexing (IDEs)

    • Scannerless parsing (rare)

    • Lexer modes (context-sensitive lexing)

    • Symbol Table

      • SymbolEntry is a struct storing one symbol record

        name              | kind                   | type        | scope level  | memory offset
        token string name | variable/function/type | int/float/… | global/local | stack location
      • SymbolTable wraps SymbolEntry records in hashmaps and exposes operations:

        • maintain a scoped stack of hashmaps from name to SymbolEntry
        • look up the type, scope, or storage location of a symbol
        • Hash-based, average O(1) lookup time by symbol name
          std::unordered_map<std::string, SymbolEntry>
        • Scoped
          • use a stack to track each symbol’s scope dynamically during parsing, and save it to the symbol table
          • make SymbolTable a stack of hashmaps
            • global scope map
            • function scope map
            • block scope map
          • One hashmap per scope
          • Same name allowed in different scopes
          • stack
            • when enter a new scope –> push
            • when exit –> pop
              Stack top --> Innermost scope (current)
              Stack bottom --> Outermost scope (global)
            • example
              int x;        // global scope
              void f() {    // function scope begins
                  int x;    // local x shadows global x
                  {         // inner block scope begins
                      int y;
                  }         // block scope ends
              }
              Global Scope
              └── Function Scope (f)
                  └── Block Scope {...}

              Stack top    : {x -> local x entry}
              Stack bottom : {x -> global x entry}
              • How lookup works
                • lookup always searches from top to bottom:
                  lookup("x"):
                      for scope from top to bottom:
                          if scope contains "x":
                              return that entry
      #include <optional>
      #include <string>
      #include <unordered_map>
      #include <vector>

      struct SymbolEntry {
          std::string name;
          enum class Kind { Var, Func, Type, Keyword } kind;
          // other properties (type, scope level, memory offset, ...)
      };

      class SymbolTable {
      public:
          SymbolTable() {
              enterScope(); // global scope
          }

          void enterScope() {
              scopes_.emplace_back();
          }

          void exitScope() {
              if (!scopes_.empty()) scopes_.pop_back();
          }

          // Search from the innermost scope outward.
          std::optional<SymbolEntry> lookup(const std::string& name) const {
              for (int i = static_cast<int>(scopes_.size()) - 1; i >= 0; --i) {
                  auto it = scopes_[i].find(name);
                  if (it != scopes_[i].end()) return it->second;
              }
              return std::nullopt;
          }

          // Insert into the current (innermost) scope.
          SymbolEntry& insert(const std::string& name, SymbolEntry::Kind kind) {
              auto& cur = scopes_.back();
              auto [it, inserted] = cur.emplace(name, SymbolEntry{name, kind});
              return it->second;
          }

      private:
          std::vector<std::unordered_map<std::string, SymbolEntry>> scopes_;
      };