ACSC373 – Compiler Writing Chapter 5 – Lexical Analysis Introduction Lexical analyzer (or scanner) has to recognize the basic tokens making up the source program input to the compiler. Tokens – syntactically simple in structure (recognised by comparatively simple algorithm). The lexical analyser relieves the burden of the syntax analyser. If the syntax of the tokens can be expressed in terms of a regular grammar straightforward construction of a lexical analyser. The lexical analyser breaks up the program input into a sequence of tokens – the basic syntactic component of the language. The lexical analyser is also responsible for decoding the lexical representation of symbols in the source program. (e.g. possible to modify a Pascal compiler to accept the character @ instead of ^ by simply making a minor modification to the lexical analyser). Spelling of reserved words should be concern only to the lexical analyser (e.g. replace all Pascal reserved words by their Welsh translation, by changing the lexical analyser’s table). Tokens The lexical analyser recognises the basic syntactic components of a programming language. For example, a lexical analyser for Pascal would recognise identifiers, reserved words, numerical constants, strings, punctuation symbols and special symbols. These tokens would be passed to the syntax analyser as single values (nature, value or identification). So, the lexical analyser has to perform the dual task of recognising the token and sometimes evaluating it. Significant Problem: the need for “look-ahead” It may be necessary to read several characters of a token before the type of that token can be determined. The degree of lookahead depends on the particular token on the language. PASCAL – good language on this respect (most tokens can be distinguished using just one-or-two character lookahead). ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona e.g. a+ character, a Pascal lexical analyser immediately returns the symbol to the syntax analyser or, to determine that the token 12345.67 is going to be a real rather than an integer. FORTRAN – Do 1 I = 1 The lexical analyser has to reach the end of the statement before it can determine that it should return the identifier Do 1 I rather than the Do to start a Do loop. Furthermore, in some languages the role of the lexical analyser becomes merged with that of the syntax analyser. Furthermore, in some languages, the role of the lexical analyser becomes merged with that of the syntax analyser. e.g. IF (GT.GT.DO) IF = DO + GT Special keywords, not identifiers partially resolved by the syntax analyser offering context information to the lexical analyser when required. Regular Grammars Chomsky type 3 (Finite – state or regular) grammars special significance to lexical analysis. Recall, type grammars had productions of the form: Aa or A aB syntactically simple structures Alternation and repetition can be specified by these grammar but more complex structures such as balanced parentheses cannot be handled. Strengths of these grammars: possible to construct simple and efficient parsers for them. If the lexical tokens of a programming language can be defined in terms of a type 3 grammar, then an efficient lexical analyser for a compiler for that language can be constructed simply, derived directly from the set of production rules defining the tokens. Notation: Regular Expressions Consists of symbols (in the alphabet of the language that is being defined) and a set of operators that allow In terms of precedence, 2 ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona (2) – Concatenation (adjacent units) (3) – Alternation (units separated by |) (1) – Repetition (*) – (zero or more) Example a b denotes the set of strings {a b} – this set contains just one member a | b denotes {a, b} a * denotes (ε, a, aa, aaa, …..} a b * denotes (a, ab, abb, abbb, ….} (a | b)* denotes the set of strings made up of zero or more instances of an ‘a’ or a ‘b’ (a b | c)* d denotes {d, abd, cd, abcd, ababcd, …..} Equivalent regular expressions: when they denote the same set of strings, that is, the same language. e.g. a ( b | c), a b | a c are equivalent Regular grammars and regular expressions are equivalent notations. i.e. given a set of productions defining a regular grammar, it is possible to convert them to an equivalent regular expression by an algorithmic process. Or, similarly, a regular expression can be converted into an equivalent set of production rules. Regular expression can be used to specify the syntax of the lexical tokens of programming language in a clear and concise way. e.g. the syntax of an integer constant; Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Sign + | - | ε Integerconstant sign digit digit* i.e. an optional sign followed by one or more digits. An identifier Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Letter a | b | c | … | z |A | B | C | … | Z Identifier letter (letter| digit)* i.e. an initial letter followed by a string of letters or digits 3 ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona Regular expressions can be represented diagrammatically, e.g. (a b | c) * d a b c d Finite-State Automata It is possible to represent a regular expression as a transition diagram, that is, a directed graph having labelled branches e.g. (a b | c ) * d The nodes – states are enclosed in circles (state number). Double circle states: accepting states (i.e. a state reached if the expression to be parsed has been successfully recognised). Arrow lines: edges or transitions. c d 1 start a 3 b 2 4 ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona Parser Action ( of abcd) 1. 2. 3. 4. 5. Start in state 1 Input a, transition to state 2 Input b, transition to state 1 Input c, transition to state 1 Input d, transition to state 3, an accepting state the parse succeeds Another input abc 1. Starting in state 1 2. Input a, transition to state 2 3. Input c: there are no edges labelled c, from state 2 and hence this parse fails. (Since the parser knows that it is in state 2 when the parse fails, it can output the informative information that it was expecting the input of a b – the only edge emerging from state 2). Transition table (of (a b | c ) * d ) State a b c d 1 2 3 2 - 1 1 finished 3 - or, transition diagram, describes a finite-state automata. Deterministic finite-state automation (DFA): for any state, there can only be one possible next state for any given input symbol (i.e. no two edges emerging from a state can be labelled by the same symbol). No transitions can be labelled with ε (the empty string) – all the edges have to labelled with non-null symbols, e.g. diagram above. Non-deterministic finite-state automation (NFA): if an input symbol can cause more than one next state. A state may have two or more edges emerging from it, labelled with the same input symbol. Example 5 ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona a 1 2 3 4 Two edges labelled by b coming from state 2 NFA a*bb*ba Its transition table, State 1 2 3 4 a {1} {4} b {2} {2 , 3} finished Sets of states have to be used (no longer single – state numbers). NFA somewhat more difficult to implement on a conventional computer than a DFA. However, NFA’s have the advantage that they generally require fewer states than an equivalent DFA (if they recognised strings described by the same regular expression). Later, (somewhere else) to see how to convert NFA into an equivalent DFA – an exercise for you! Implementing a Lexical Analyser Writing a lexical analyser for Pascal (techniques relevant to wide range of other languages). Task of lexical analyser: to recognise basic tokens in its input and to return some encoded representation of these tokens to the next part of the compiler. 6 ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona First Steps: - To decide on the precise set of tokens the lexical analyser should recognise. - The notation to be used to specify the syntax of these tokens, preferably in some formal notation. Tokens – identified by an integer value or equivalent e.g. enumerated type in Pascal. Convenient way – implement the lexical analyser as a procedure or function called by the syntax analyser. e.g. (Pascal’s lexical analyser). Type lextoken = (beginsym, endsym, ifsym, dosym, whilesym, periodsym, commasym, semicolon …); Var token: lextoken (* the L.A. updates this variable each time it is called *) Procedure NextToken; . . . Has to to read read characters characters from from the the compiler’s compiler’s input input and Has and return the identity of single lexical the return the identity of single lexical tokentoken in thein global global variable token each time it is called. variable token each time it is called. First step, decide Set of tokens: Pascal: - 21 special symbols - 35 word symbols or reserved words - one directive and - a few alternative representations - + identifiers and numerical and character constants - to skip comments and non-significant spaces, tabs and newlines - to handle erroneous input and issue informative error message Ex. Think of the corresponding tokens of either the C or C++ language. The body of NextToken starts as follows: While (ch = spacechar) or (ch = tabchar) do Nextch; (*skip white space*) Case ch of ‘+’: begin token := plussyml Nextch end; ‘-‘ : begin token := minusym Nextch end; (*and so on for all the other single – character tokens*) 7 ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona (Possibly, use array or table lookup to return the corresponding Nexttoken value in Token as there is a fairly large number of single-character tokens in Pascal). Other special symbols have to be handled more carefully, <, <=, or <>. ‘<’ Begin NextCh; If Ch = ‘=’ then Begin Token := lesseqsym; NextCh; End Else If ch = ‘>’ then Begin Token := noteqsym; NextCh End Else Sym := lesssym End; Comments {or (* ….} or *) When NextToken encounters a { or a (* symbol, it must call NextCh repeatedly until } or *) is found. One way: remove comments within NextCh; Count {this is a comment} er := 0; Counter : = 0; If comments were removed completely by NextCh; Or, return a single-space character to the caller on encountering a comment construct in the input text. 8 ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona Numerical constants Integer, floating-point constants ‘0’ : ‘1’ : ‘2’ : ‘3’ : ‘4’ : ‘5’ : ‘6’ : ‘7’ : ‘8’ : ‘9’ : Begin End; Intval := 0; While Ch in [‘0’ .. ‘9’] do Begin (*accumulate decimal value*) Intval := intval * 10 + ord (ch) – ord (‘0’); NextCh End; Token := integersym (*value of constant returned in intval*) Character and string constants Enclosed by apostrophes, ‘a’ “ For denoting single’. Identifiers and reserved words While Ch in Valididentifierschars do The set of Characters initialized to contain the characters that may be found within an identifier. Example Consider the character string abc<def input to NextToken. The letter a guides NextToken into the routine to read an identifier; thus, the characters are read one by one until the < character is encountered. This character is retained as the first character to be analysed when NextToken is next called – there is no need to backspace on the input stream. This situation where a single character causes the termination of one token and starts another is very common and is handled naturally by the single-character lookahead. Errors Errors detected by the lexical analyser: “ a letter not allowed to terminate a number, numerical overflow, end of line before end of string and null string detected”. 9 ACSC373 – Compiler Writing – Chapter 5 – Dr. Stephania Loizidou Himona Also, NextToken should report error if it encounters a character that is not a member of the character set of Pascal. e.g. (explanation mark outside a character string or comment) Automating the Production of Lexical Analysis Writing a lexical analyser from scratch – fairly demanding task easier to modify an existing lexical analyser for a similar language. Given the syntax specification of the lexical tokens, the coding of the lexical analyser should be mechanical. Software tools can automate the production of lexical analyser. e.g. Lex (one of the many utilities of UNIX) (most famous) - Requires the syntax of each lexical token be defined in terms of a regular expression. - Lex generates a recognizer program – the lexical analyser. - Lex transforms regular expressions to an equivalent deterministic finite-state automation. - Produces rapidly reliable and efficient lexical analysis while the control file for lex provides clear and accurate documentation. 10