Lexical Analysis

Winter 2012-2013
Compiler Principles: Lexical Analysis (Scanning)
Mayer Goldberg and Roman Manevich
Ben-Gurion University

General stuff

- Topics taught by me
    - Lexical analysis (scanning)
    - Syntax analysis (parsing)
    - …
    - Dataflow analysis
    - Register allocation
- Slides will be available from web-site after lecture
- Request: please mute mobiles, tablets, super-cool squeaking devices

Today

- Understand role of lexical analysis
- Lexical analysis theory
- Implementing modern scanner

Role of lexical analysis

- First part of compiler front-end
- Convert stream of characters into stream of tokens
    - Split text into most basic meaningful strings
    - Simplify input for syntax analysis

[Figure: compiler pipeline: High-level Language -> Lexical Analysis -> Syntax Analysis (Parsing) -> AST, Symbol Table, etc. -> Inter. Rep. (IR) -> Code Generation -> Executable Code. The figure also carries the label "(scheme)".]

From scanning to parsing

    program text:   5 + (7 * x)
        | Lexical Analyzer
    token stream:   num + ( num * id )
        | Parser
    result:         valid (Abstract Syntax Tree) or syntax error

    Grammar:
        E -> id
        E -> num
        E -> E + E
        E -> E * E
        E -> ( E )

    Abstract Syntax Tree:
          +
         / \
      num   *
           / \
        num   x

Javascript example

- Identify basic units in this code

    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        } else {
          elt.style.display = "none";
        }
      }
    }

- Kinds of basic units labeled in the code: keyword, identifier, operator, numeric literal, string literal, punctuation, whitespace

Scanner output

- For the Javascript example, the scanner produces a stream of tokens, listed here in the form LINE: ID(value)

    1: VAR
    1: ID(currOption)
    1: EQ
    1: INT_LITERAL(0)
    1: SEMI
    3: FUNCTION
    3: ID(choose)
    3: LP
    3: ID(id)
    3: RP
    3: LCB
    ...

What is a token?

- Lexeme: substring of original text constituting an identifiable unit
    - Identifiers, values, reserved words, …
- Record type storing:
    - Kind
    - Value (when applicable)
    - Start-position/end-position
    - Any information that is useful for the parser
- Different for different languages

C++ example 1

- Splitting text into tokens can be tricky
- How should the code below be split?

    vector<vector<int>> myVector

- Is >> a single operator token, or two tokens > and >?
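One way to see why this split is tricky is to run the text through a scanner that always prefers the longest operator. The following is only a minimal sketch using Python's standard `re` module; the `split_tokens` helper and its token pattern are assumptions made for illustration, not how any real C++ compiler is implemented.

```python
import re

# Hypothetical token pattern: identifiers first, then '>>' listed before
# the single-character '<' and '>' so that the longer operator wins.
TOKEN = re.compile(r"[A-Za-z_]\w*|>>|[<>]")

def split_tokens(text):
    """Return the lexemes found in text; unmatched characters (spaces) are skipped."""
    return TOKEN.findall(text)

print(split_tokens("vector<vector<int>> myVector"))
print(split_tokens("vector<vector<int> > myVector"))
```

Under this sketch the first spelling produces a single `>>` lexeme, while the spelling with a space produces two separate `>` lexemes.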
C++ example 2

- Splitting text into tokens can be tricky
- How should the code below be split?

    vector<vector<int> > myVector

- Here > and > are two tokens

Example tokens

    Type         Examples
    Identifier   x, y, z, foo, bar
    NUM          42
    FLOATNUM     -3.141592654
    STRING       "so long, and thanks for all the fish"
    LPAREN       (
    RPAREN       )
    IF           if
    …

Separating tokens

    Type           Examples
    Comments       /* ignore code */
                   // ignore until end of line
    White spaces   \t \n

- Lexemes are recognized but get consumed rather than transmitted to parser

    if    if    i/*comment*/f

Preprocessor directives in C

    Type                 Examples
    Include directives   #include<foo.h>
    Macros               #define THE_ANSWER 42

Designing a scanner

- Define each type of lexeme
    - Reserved words: var, if, for, while
    - Operators: < = ++
    - Identifiers: myFunction
    - Literals: 123 "hello"
    - Annotations: @SuppressWarnings
- But how do we define lexemes of unbounded length?
    - Regular expressions

Regular languages refresher

- Formal languages
    - Alphabet = finite set of letters
    - Word = sequence of letters
    - Language = set of words
- Regular languages defined equivalently by
    - Regular expressions
    - Finite-state automata

Regular expressions

- Empty string: ε
- Letter: a
- Concatenation: R1 R2
- Union: R1 | R2
- Kleene-star: R*
- Shorthand: R+ stands for R R*
- Scope: (R)
- Example: (0* 1*) | (1* 0*)
    - What is this language?
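The operators listed for regular expressions map directly onto most regex libraries. As a small sketch using Python's standard `re` module (the helper name `in_language` is made up for this example), the following checks a few words against the example expression (0* 1*) | (1* 0*):

```python
import re

# (0* 1*) | (1* 0*): concatenation is juxtaposition, | is union, * is Kleene star.
EXAMPLE = re.compile(r"(0*1*)|(1*0*)")

def in_language(word):
    """True if the whole word (not just a prefix of it) matches the expression."""
    return EXAMPLE.fullmatch(word) is not None

for word in ["", "0011", "1100", "010"]:
    print(repr(word), in_language(word))
```

Trying further words of your own is a quick way to test a guess about which language the expression describes.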
Exercise 1 - Question

- Language of Java identifiers (simplified; real Java also permits '$' and Unicode letters)
    - Identifiers start with either an underscore '_' or a letter
    - Continue with either underscore, letter, or digit

Exercise 1 - Answer

    (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

- Using shorthand macros

    First = _|a|b|…|z|A|…|Z
    Next  = First|0|…|9
    R     = First Next*

Exercise 2 - Question

- Language of rational numbers in decimal representation (no leading, ending zeros)
    - 0
    - 123.757
    - .933333
    - Not 007
    - Not 0.30

Exercise 2 - Answer

    Digit    = 1|2|…|9
    Digit0   = 0|Digit
    Num      = Digit Digit0*
    Frac     = Digit0* Digit
    Pos      = Num | .Frac | 0.Frac | Num.Frac
    PosOrNeg = (ε|-)Pos
    R        = 0 | PosOrNeg

Exercise 3 - Question

- Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …

Exercise 3 - Answer

- Not regular
- Context-free
- Grammar: S ::= [] | [S]

Finite automata

- An automaton is defined by states and transitions

[Figure: automaton with a start state, transitions labeled a, b, c, and an accepting state]

Automaton running example

- Words are read left-to-right
- Input word: a b c
- After the last letter is read the automaton is in an accepting state: word accepted

Word outside of language

- Input word: b b c
- Missing transition means non-acceptance

Exercise - Question

- What is the language defined by the automaton below?

[Figure: the same automaton, with transitions labeled a, b, c]

Exercise - Answer

- What is the language defined by the automaton below?
    - a b* c
    - Generally: all paths leading to accepting states

Non-deterministic automata

- Allow multiple transitions from given state labeled by same letter

[Figure: automaton in which one state has two outgoing transitions labeled a]

NFA run example

- Input word: a b c
- Maintain set of states
- Accept word if any of the states in the set is accepting

NFA+ε automata

- ε transitions can "fire" without reading the input

[Figure: automaton with transitions labeled a, b, c and one ε transition]

NFA+ε run example

- Input word: a b c
- Now ε transition can non-deterministically take place
- Word accepted

Reg-exp vs. automata

- Regular expressions are declarative
    - Offer compact way to define a regular language by humans
    - Don't offer direct way to check whether a given word is in the language
- Automata are operative
    - Define an algorithm for deciding whether a given word is in a regular language
    - Not a natural notation for humans

From reg. exp. to automata

- Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression
- Proof: by induction on the structure of the regular expression
    - For each sub-expression R we build an automaton with exactly one start state and one accepting state
    - Start state has no incoming transitions
    - Accepting state has no outgoing transitions
Base cases

[Figure: automata for R = ε (a single ε transition from the start state to the accepting state) and R = a (a single transition labeled a)]

Construction for R1 | R2

[Figure: a new start state with ε transitions into the automata for R1 and R2, whose accepting states have ε transitions into a new accepting state]

Construction for R1 R2

[Figure: the automata for R1 and R2 chained in sequence using ε transitions]

Construction for R*

[Figure: the automaton for R wrapped in ε transitions that allow skipping R or repeating it]

From NFA+ε to DFA

- Construction requires O(n) states for a regexp of length n
- Running an NFA+ε with n states on string of length m takes O(m·n²) time
- Solution: determinization via subset construction
    - Number of states worst-case exponential in n
    - Running time O(m)

Subset construction

- For an NFA+ε with states M = {s1,…,sk}
- Construct a DFA with one state per set of states of the corresponding NFA
    - M' = { [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], … }
- Simulate transitions between individual states for every letter

    NFA+ε:  s1 --a--> s2,  s4 --a--> s7
    DFA:    [s1,s4] --a--> [s2,s7]

- Extend macro states by states reachable via ε transitions

    NFA+ε:  s1 --ε--> s4
    DFA:    [s1,s2] becomes [s1,s2,s4]

Scanning challenges

- Regular expressions allow us to define the language of all sequences of tokens
- Automata theory provides an algorithm for checking membership of words
    - But we are interested in splitting the text, not just deciding on membership
- How do we determine lexemes?
- How do we handle ambiguities: lexemes matching more than one token?

Separating lexemes

    ID  = (a|b|…|z) (a|b|…|z)*
    ONE = 1

- Input: abb1
- How do we identify ID(abb), ONE?

[Figure: automaton in which a-z leads from the start state to accepting state ID, which loops on a-z, and 1 leads from the start state to accepting state ONE]

Maximal munch

    ID  = (a|b|…|z) (a|b|…|z)*
    ONE = 1

- Input: abb1
- How do we identify ID(abb), ONE?
- Solution: find longest matching lexeme
    - Keep reading text until the automaton gets stuck, remembering the last accepting state it passed through
    - Return token corresponding to that accepting state
    - Reset: go back to start state and continue reading input from the end of the returned lexeme

Handling ambiguities

    ID = (a|b|…|z) (a|b|…|z)*
    IF = if

- Input: if
- Matches both tokens
- What should the scanner output?

[Figure (NFA): from the start state, one branch reads a-z into accepting state ID, which loops on a-z; another branch reads i and then f into accepting state IF]

[Figure (DFA): from the start state, i leads to a state accepting ID and any other letter (a-z\i) leads to the ID state; from the i state, f leads to a state accepting both IF and ID and any other letter (a-z\f) leads to the ID state; further letters (a-z) lead to the ID state]

- Solution: break tie using order of definitions
    - With ID defined before IF: Output: ID(if)
    - With IF defined before ID: Output: IF
- Conclusion: list keyword token definitions before identifier definition

Implementing scanners in practice

Implementing scanners

- Manual construction of automata + determinization is
    - Very tedious
    - Error-prone
    - Non-incremental
- Fortunately there are tools that automatically generate code from a specification for most languages
    - C: Lex, Flex
    - Java: JLex, JFlex

Using JFlex

- Define tokens (and states)
- Run JFlex to generate Java implementation
- Usually MyScanner.nextToken() will be called in a loop by parser

[Figure: MyScanner.lex (regular expressions) -> JFlex -> MyScanner.java; stream of characters -> MyScanner.java -> tokens]

Common format for reg-exps

    Basic patterns          Matching
    x                       The character x
    .                       Any character, usually except a new line
    [xyz]                   Any of the characters x, y, z

    Repetition operators
    R?                      An R or nothing (= optionally an R)
    R*                      Zero or more occurrences of R
    R+                      One or more occurrences of R

    Composition operators
    R1R2                    An R1 followed by R2
    R1|R2                   Either an R1 or R2

    Grouping
    (R)                     R itself

Escape characters

- What is the expression for one or more + symbols?
    - (+)+ won't work
    - (\+)+ will
- Backslash \ before an operator turns it into a standard character
    - \*, \?, \+, …
- Newline: \n or \r\n depending on OS
- Tab: \t

Shorthands

- Use names for expressions
    - letter = a | b | … | z | A | B | … | Z
    - letter_ = letter | _
    - digit = 0 | 1 | 2 | … | 9
    - id = letter_ (letter_ | digit)*
- Use hyphen to denote a range
    - letter = a-z | A-Z
    - digit = 0-9

Catching errors

- What if input doesn't match any token definition?
- Trick: add a "catch-all" rule that matches any character and reports an error
    - Add after all other rules

Next lecture: parsing
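To tie the scanning rules from this lecture together, here is a minimal sketch of a scanner loop that applies maximal munch, breaks ties by the order of the definitions, and falls back to a catch-all error rule. It uses Python's standard `re` module rather than a generated DFA, and the rule table and the `scan` function are assumptions for illustration, not JFlex's API or its generated code.

```python
import re

# Rules in priority order: keyword before identifier, catch-all last.
# These simple patterns are greedy, so each match is the longest for its rule.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z]+")),
    ("ONE", re.compile(r"1")),
    ("WS", re.compile(r"[ \t\n]+")),
    ("ERROR", re.compile(r".", re.DOTALL)),  # catch-all: any single character
]

def scan(text):
    """Yield (kind, lexeme) pairs using longest match, then rule order."""
    pos = 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two lexemes tie in length.
            if m and m.end() - pos > best_len:
                best_kind, best_len = kind, m.end() - pos
        lexeme = text[pos:pos + best_len]
        pos += best_len  # reset: continue right after the returned lexeme
        if best_kind == "ERROR":
            raise SyntaxError(f"unexpected character {lexeme!r}")
        if best_kind != "WS":  # whitespace is consumed, not passed to the parser
            yield best_kind, lexeme

print(list(scan("abb1")))  # [('ID', 'abb'), ('ONE', '1')]
print(list(scan("if")))    # [('IF', 'if')]
print(list(scan("iff")))   # [('ID', 'iff')]
```

Swapping the first two entries of `RULES` makes the input `if` come out as `ID`, which is the reason keyword definitions are listed before the identifier definition.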
