Lexical Analysis

Dr. Nguyen Hua Phung
HCMC University of Technology, Viet Nam
08, 2016

Outline
  1 Introduction
  2 Roles
  3 Implementation
  4 Use ANTLR to generate Lexer

Compilation Phases
  source program
  front end: lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator
  back end: code optimizer, code generator
  target program

Lexical Analysis
  Like a word extractor: the character stream i, n ⇒ the word "in"
  Like a spell checker: in "I og to sochol", the words "og" and "sochol" are flagged
  Like a classifier: I (pronoun) am (verb) a (article) student (noun)

Lexical Analysis Roles
  Identify lexemes: substrings of the source program that belong to a grammar unit
  Return tokens: the lexical category of each lexeme
  Ignore white space such as blanks, newlines and tabs
  Record the position of each token for use in the next phases

Example on Lexeme and Token
  Input: result = oldsum - value / 100;

  Lexeme    Kind of token
  result    IDENT
  =         ASSIGN_OP
  oldsum    IDENT
  -         SUBTRACT_OP
  value     IDENT
  /         DIV_OP
  100       INT_LIT
  ;         SEMICOLON

How to build a lexical analyzer?
  How to build a lexical analyzer for English?
    65000 words
    Simply build a dictionary: {(I, pronoun); (We, pronoun); (am, verb); ...}
    Extract, search, compare
  But for a programming language?
    How many words?
      Identifiers: abc, cab, Abc, aBc, cAb, ...
      Integers: 1, 10, 120, 20, 210, ...
      ...
    Too many words to build a dictionary, so how?
    Apply rules for each kind of word (token)

Rule Representations
  Finite Automata
    Deterministic Finite Automata
    Nondeterministic Finite Automata
  Regular Expressions

Finite Automata
  [Figure: an input tape (... b a b a a ...) scanned by a read-only head, driven by a finite control with states q0, q1, q2, q3, ..., qn]
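The finite-control picture can be made concrete with a small simulation. The sketch below is only an illustration and is not part of the original slides. It hard-codes the two-state automaton over {a, b} that this deck uses in its state diagram and DFA example (start state q0, final state q1). The names DELTA, START, FINAL, run_dfa and accepts are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Transition function delta: (current state, symbol read) -> new state.
DELTA: Dict[Tuple[str, str], str] = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "a"): "q1",
    ("q1", "b"): "q0",
}
START = "q0"
FINAL: Set[str] = {"q1"}


def run_dfa(tape: str) -> List[str]:
    """Return every state visited, beginning with the initial state."""
    states = [START]
    for symbol in tape:  # the head reads the tape left to right, one symbol at a time
        key = (states[-1], symbol)
        if key not in DELTA:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        states.append(DELTA[key])
    return states


def accepts(tape: str) -> bool:
    """A string is accepted when the run ends in a final state."""
    return run_dfa(tape)[-1] in FINAL


print(run_dfa("abaabb"))  # ['q0', 'q0', 'q1', 'q1', 'q1', 'q0', 'q1']
print(accepts("abaabb"))  # True
```

Storing the transition function as a dictionary keeps the code close to the 5-tuple definition: each dictionary entry is one row of the δ table.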
State Diagram
  [Figure: state diagram with start state q0 and a second state q1; each state loops to itself on a, and b moves between q0 and q1]

  Input: abaabb

  Current state    Read    New state
  q0               a       q0
  q0               b       q1
  q1               a       q1
  q1               a       q1
  q1               b       q0
  q0               b       q1

Deterministic Finite Automata
  Definition: A Deterministic Finite Automaton (DFA) is a 5-tuple M = (K, Σ, δ, s, F) where
    K = a finite set of states
    Σ = an alphabet
    s ∈ K = the initial state
    F ⊆ K = the set of final states
    δ = a transition function from K × Σ to K

Example
  M = (K, Σ, δ, s, F) where K = {q0, q1}, Σ = {a, b}, s = q0, F = {q1}, and δ is

  K     Σ    δ(K, Σ)
  q0    a    q0
  q0    b    q1
  q1    a    q1
  q1    b    q0

  [Figure: state diagram of M, with start state q0]

Nondeterministic Finite Automata
  Permit several possible "next states" for a given combination of current state and input symbol
  Allow the empty string ε as a transition label in the state diagram
  Help simplify the description of automata
  Every NFA is equivalent to some DFA

Example
  Language L = ({ab} ∪ {aba})*
  [Figure: NFA with states q0 (start), q1, q2 and transitions labeled a, b, a, b]

Example
  Language L = ({ab} ∪ {aba})*
  [Figure: NFA with states q0 (start), q1, q2 and transitions labeled a, ε, a, b]

Regular Expression (regex)
  Describe regular sets of strings
  Symbols other than ( ) | * stand for themselves
  Use ε for the empty string
  Concatenation α β = first part matches α, second part matches β
  Union α | β = match α or β
  Kleene star α* = 0 or more matches of α
  Use ( ) for grouping

Example
  RE            Language
  0             { 0 }
  01            { 01 }
  0|1           { 0, 1 }
  0(0|1)        { 00, 01 }
  (0|1)(0|1)    { 00, 01, 10, 11 }
  0*            { ε, 0, 00, 000, 0000, ... }
  (0|1)*        { ε, 0, 1, 00, 01, 10, 11, 000, 001, ... }

Example
  (i|I)(f|F)
    Keyword if of the language Pascal
    Matches: if, IF, If, iF
  E(0|1|2|3|4|5|6|7|8|9)*
    An E followed by a (possibly empty) sequence of digits
    Matches: E123, E9, E
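The regular expression examples above can be checked mechanically. The sketch below is an illustration and is not part of the original slides. It assumes Python's re module, whose syntax coincides with the slide notation for these particular patterns, and it uses a hypothetical helper language_members that keeps only the candidate strings a pattern matches in full.

```python
import re
from typing import Iterable, List


def language_members(pattern: str, candidates: Iterable[str]) -> List[str]:
    """Keep the candidates that the whole pattern matches (not just a prefix)."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


binary = ["", "0", "1", "00", "01", "10", "11", "000"]

print(language_members(r"0(0|1)", binary))      # ['00', '01']
print(language_members(r"(0|1)(0|1)", binary))  # ['00', '01', '10', '11']
print(language_members(r"0*", binary))          # ['', '0', '00', '000']

# The Pascal keyword example: any mix of upper and lower case.
print(language_members(r"(i|I)(f|F)", ["if", "IF", "If", "iF", "fi", "iff"]))
# ['if', 'IF', 'If', 'iF']

# An E followed by a possibly empty sequence of digits.
print(language_members(r"E(0|1|2|3|4|5|6|7|8|9)*", ["E123", "E9", "E", "e1", "123"]))
# ['E123', 'E9', 'E']
```

re.fullmatch is used instead of re.match because a regular expression denotes a set of complete strings. With re.match, "iff" would be reported as matching (i|I)(f|F) through its prefix "if".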
Regular Expression and Finite Automata
  [Figure: finite automata corresponding to the regular expressions a, b, ab, a|b and a*, combined with ε-transitions]

Convenience Notation
  α+ = one or more (i.e. αα*)
  α? = 0 or 1 (i.e. (α|ε))
  [xyz] = x|y|z
  [x-y] = all characters from x to y, e.g. [0-9] = all ASCII digits
  [^x-y] = all characters other than [x-y]
  . matches any character

Example
  (0|1|2|3|4)                                        =>  [0-4]
  (a|g|h|m)                                          =>  [aghm]
  (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*        =>  [0-9]+
  (E|e)(+|-|ε)(0|1|2|3|4|5|6|7|8|9)+                 =>  [Ee][+-]?[0-9]+

ANTLR [1]
  ANother Tool for Language Recognition
  Terence Parr, Professor of CS at the University of San Francisco
  A powerful parser/lexer generator

Lexer
    /**
     * Filename: Hello.g4
     */
    lexer grammar Hello;

    // match any digits
    INT: [0-9]+;

    // Hexadecimal number (the literal 0 must be quoted in ANTLR 4)
    HEX: '0' [Xx] [0-9A-Fa-f]+;

    // match lower-case identifiers
    ID: [a-z]+;

    // skip spaces, tabs, newlines
    WS: [ \t\r\n]+ -> skip;

Lexical Analyzer
  [Figure: an input tape (1 . 0 e 0 2 ...) read with lookahead by a lexical analyzer holding the rules r0, r1, r2, r3, ..., rn; it returns the token with the longest prefix match]

Summary
  A lexical analyzer is a pattern matcher that isolates the small-scale parts of a program
  Lexical rules are represented by regular expressions or finite automata
  How to write a lexical analyzer (lexer) in ANTLR

References
  [1] ANTLR, http://antlr.org, 19 08 2016.
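As a closing illustration of the longest-prefix-match idea from the Lexical Analyzer slide, the sketch below re-creates the four rules of Hello.g4 by hand in Python. It is not code generated by ANTLR and does not use the ANTLR runtime. The names RULES and tokenize are hypothetical, and the tie-breaking choice (the earlier rule wins when two rules match the same length) is an assumption about how such ties are usually resolved.

```python
import re
from typing import List, Tuple

# (token name, pattern, skip?) listed in priority order, mirroring Hello.g4.
RULES = [
    ("INT", re.compile(r"[0-9]+"), False),
    ("HEX", re.compile(r"0[Xx][0-9A-Fa-f]+"), False),
    ("ID", re.compile(r"[a-z]+"), False),
    ("WS", re.compile(r"[ \t\r\n]+"), True),
]


def tokenize(text: str) -> List[Tuple[str, str, int]]:
    """Return (token, lexeme, start position) triples using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end, best_skip = None, pos, False
        for name, pattern, skip in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end, best_skip = name, m.end(), skip
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        if not best_skip:  # white space is recognized but not returned
            tokens.append((best_name, text[pos:best_end], pos))
        pos = best_end
    return tokens


print(tokenize("0x1F abc 42"))
# [('HEX', '0x1F', 0), ('ID', 'abc', 5), ('INT', '42', 9)]
```

On the input 0x1F both INT (matching 0) and HEX (matching 0x1F) apply at position 0. HEX wins because its match is longer, even though INT is listed first.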