Chapter Two : Lexical Analysis
o Token
o Language
o Regular Expressions and Finite Automata
o Conversion RE-NFA-DFA
o Lexical Analyzer Generator
Prepared by Bontu G.: 2014 E.C

o Lexical analysis is the first phase of a compiler.
o The role of the lexical analyzer is to read a sequence of characters from the source program and produce tokens to be used by the parser.
o The lexical analyzer breaks the source code into a series of tokens, removing any whitespace and comments along the way.
o Its main task is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.
o The stream of tokens is sent to the parser for syntax analysis.
o The lexical analyzer also interacts with the symbol table, e.g., when it discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

o The lexical analyzer performs the following tasks in addition to identifying lexemes:
   o Stripping out comments and whitespace (blank, newline, and tab)
   o Correlating error messages generated by the compiler with the source program by keeping track of line numbers (using newline characters)
   o Expanding macros (in some lexical analyzers)

Tokens, Patterns, and Lexemes
o A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a keyword, an identifier, etc. The token names are the input symbols that the parser processes.
o In any programming language, keywords, operators, identifiers, constants, literals, and punctuation symbols are tokens.
o A lexeme is a sequence of characters in the source program that matches the pattern for a token. Every lexeme must satisfy predefined rules to be identified as a valid token. These rules are given by the language definition, by means of a pattern.
• A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is the sequence of characters that forms the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
• In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens.

Token    Some lexemes                     Informal pattern
begin    begin, Begin, BEGIN, beGin, …    "begin" in small or capital letters
if       if, IF, iF, If                   "if" in small or capital letters
ident    Distance, F1, x, Dist1, …        a letter followed by zero or more letters and/or digits

Attributes of tokens
o When more than one lexeme matches a pattern, the scanner must provide additional information about the particular lexeme to the subsequent phases of the compiler.
o For example, both 0 and 1 match the pattern for the token num, but the code generator needs to know which number was recognized.
• The lexical analyzer collects information about tokens into their associated attributes.
• In practice, a token has one attribute: a pointer to the symbol table entry in which the information about the token is kept.
• The symbol table entry contains information about the token, such as its lexeme, the line number on which it was first seen, …
• For example, consider x = y + 2. The tokens and their attributes are written as:
   <id, pointer to symbol-table entry for x>
   <assign_op, >
   <id, pointer to symbol-table entry for y>
   <plus_op, >
   <num, integer value 2>

Errors
o Very few errors are detected by the lexical analyzer.
o For example, if the programmer mistypes while as wihle, the lexical analyzer cannot detect the error (why?)
o Nonetheless, if a certain sequence of characters follows none of the specified patterns, the lexical analyzer can detect the error.
o When an error occurs, the lexical analyzer recovers by:
   o skipping (deleting) successive characters from the remaining input until it can find a well-formed token (panic mode recovery)
   o deleting extraneous (unimportant) characters
   o inserting missing characters
   o replacing an incorrect character by a correct character
   o transposing two adjacent characters

Specifying and recognizing tokens
o Regular expressions are used to specify the patterns of tokens.
o Each pattern matches a set of strings.
Example
   letter     → A | B | C | … | Z | a | b | c | … | z
   digit      → 0 | 1 | … | 9
   identifier → letter (letter | digit)*

Languages
o Alphabet: a finite set of symbols.
o String: a string over an alphabet is a finite sequence of symbols from that alphabet, usually written next to one another and not separated by commas.
o The terms sentence and word are also used as synonyms for string.
o ε is the empty string.
o |s| is the length of string s.
o Substring: z is a substring of w if z appears consecutively within w.
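The regular definitions for letter, digit, and identifier and the x = y + 2 example from this chapter can be tried out with a small scanner. The following is only a minimal sketch in Python, not the chapter's own implementation: the token names id, num, assign_op, and plus_op come from the slides, while the ws token, the function name tokenize, and the use of a list index as the "pointer" into the symbol table are assumptions made for illustration.

```python
import re

# Patterns written as regular expressions. "ws" is a hypothetical helper
# token used only to strip whitespace; it is never passed to the parser.
TOKEN_SPEC = [
    ("num", r"[0-9]+"),                # digit digit*
    ("id", r"[A-Za-z][A-Za-z0-9]*"),   # letter (letter | digit)*
    ("assign_op", r"="),
    ("plus_op", r"\+"),
    ("ws", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table) for the given source string."""
    symbol_table = []  # list of lexemes; an index stands in for a pointer
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # No pattern matches here: this is a lexical error.
            raise SyntaxError(f"no pattern matches at position {pos}: {source[pos]!r}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "ws":
            continue  # whitespace is stripped, not turned into a token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)  # enter new identifiers
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))  # attribute: integer value
        else:
            tokens.append((kind, None))  # operators carry no attribute
    return tokens, symbol_table


tokens, table = tokenize("x = y + 2")
print(tokens)  # [('id', 0), ('assign_op', None), ('id', 1), ('plus_op', None), ('num', 2)]
print(table)   # ['x', 'y']
```

In this sketch a misspelled keyword such as wihle is simply returned as an id token, while a character such as # that fits none of the patterns raises an error.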
Operations on Languages
o Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
o Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
o Exponentiation: L^0 = {ε}, L^1 = L, L^2 = LL, and in general L^i = L^(i-1)L
o Kleene closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ … (always includes the empty string)
o Positive closure: L+ = L^1 ∪ L^2 ∪ … (includes the empty string only if ε ∈ L)

Examples [slide content not preserved]

Grammar [slide content not preserved]

Example [slide content not preserved]

Automata: What is it?
• An automaton is an abstract model of a digital computer.
• An automaton has a mechanism to read input, which is a string over a given alphabet. This input is written on an "input file", which the automaton can read but cannot change.
• The automaton has a temporary "storage" device with an unlimited number of cells, the contents of which can be altered by the automaton.
• The automaton has a control unit, which is said to be in one of a finite number of "internal states".

Types of Automaton
o Deterministic automata
o Non-deterministic automata

Deterministic Automata
o A deterministic automaton is one in which each move (transition from one state to another) is uniquely determined by the current configuration.
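To make "uniquely determined" concrete, a deterministic automaton can be simulated with a transition table in which each (state, input) pair has at most one next state. The sketch below is an illustration only: the state names start and in_id, the helper char_class, and the choice of the identifier pattern letter (letter | digit)* as the language are assumptions, not taken from the slides' lost figures.

```python
def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


# (current state, input class) -> next state. Each pair has exactly one
# entry, which is what makes the automaton deterministic.
TRANSITIONS = {
    ("start", "letter"): "in_id",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
}
ACCEPTING = {"in_id"}


def accepts(string):
    """Return True if the automaton accepts the whole string."""
    state = "start"
    for ch in string:
        key = (state, char_class(ch))
        if key not in TRANSITIONS:
            return False  # a missing entry acts as an implicit dead state
        state = TRANSITIONS[key]
    return state in ACCEPTING


print(accepts("Dist1"))  # True
print(accepts("1F"))     # False
```

A table-driven loop like this is one common way to realize a recognizer; the table could equally be generated from a regular expression rather than written by hand.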
Deterministic Automata (Ex) [slide content not preserved]

Regular Expressions
o Regular expressions were designed to represent regular languages with a mathematical tool, a tool built from a set of primitives and operations.
o This representation involves a combination of strings of symbols from some alphabet Σ, parentheses, and the operators + (union, written | in the identifier example), · (concatenation), and * (star closure).

Building Regular Expressions [slide content not preserved]

Languages defined by Regular Expressions
• There is a very simple correspondence between regular expressions and the languages they denote: each primitive expression denotes a basic language, and the operators +, ·, and * denote the union, concatenation, and star closure of the corresponding languages.

Ex (revision)
• Construct a deterministic finite state automaton (DFA) equivalent to the given nondeterministic finite state automaton (NFA).
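The NFA-to-DFA conversion asked for in the exercise is usually done with the subset construction, in which each DFA state is a set of NFA states. The sketch below is one possible Python rendering under stated assumptions: the NFA given in the exercise is not preserved, so EXAMPLE_DELTA is a hypothetical NFA over {a, b} accepting strings that end in ab, and the names nfa_to_dfa, epsilon_closure, and dfa_accepts are illustrative.

```python
from collections import deque

EPSILON = ""  # label used for ε-moves in the NFA transition table


def epsilon_closure(states, delta):
    """All NFA states reachable from `states` using only ε-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPSILON), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def nfa_to_dfa(alphabet, delta, start, accepting):
    """Subset construction; each DFA state is a frozenset of NFA states."""
    dfa_start = epsilon_closure({start}, delta)
    dfa_delta = {}
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            moved = set()
            for s in current:
                moved |= set(delta.get((s, symbol), ()))
            target = epsilon_closure(moved, delta)  # empty set = dead state
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_accepting = {S for S in seen if S & set(accepting)}
    return dfa_start, dfa_delta, dfa_accepting


def dfa_accepts(dfa, string):
    """Run the constructed DFA on a string over the same alphabet."""
    state, delta, accepting = dfa[0], dfa[1], dfa[2]
    for ch in string:
        state = delta[(state, ch)]
    return state in accepting


# Hypothetical NFA: state 0 loops on a and b, and guesses where "ab" begins.
EXAMPLE_DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
example_dfa = nfa_to_dfa("ab", EXAMPLE_DELTA, 0, {2})
print(dfa_accepts(example_dfa, "aab"))  # True
print(dfa_accepts(example_dfa, "aba"))  # False
```

For this example the construction reaches three DFA states, {0}, {0, 1}, and {0, 2}, of which only {0, 2} is accepting. Only subsets reachable from the start state are built, which keeps the result small in typical cases even though an NFA with n states can in principle yield up to 2^n DFA states.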