Lexical Analysis; Regular Expressions - SWE – Sr. Sallam 23/11/2010

Introduction

A string (word, sentence) is a finite sequence of symbols from an alphabet. A language is a set of strings over some alphabet. For example,
o { (ab)^n | n = 1, 2, 3, 4, … } is the language consisting of the pair ab repeated one or more times.
o { (^n )^n | n = 1, 2, 3, 4, … } is the language consisting of n opening parentheses followed by n closing parentheses, that is, pairs of parentheses nested to any depth.

A fundamental problem in studying languages is the recognition problem: given a language and a string, determine whether the string is a member of the language. This problem is especially important for compilers, which must determine whether a given input is in fact a valid program in the source language.

Languages can be classified according to their grammars and the complexity of the computing machine required to recognize them, as follows (this is the Chomsky hierarchy):
o Type 3 (regular): regular grammar; finite state automaton
o Type 2 (context-free): context-free grammar; pushdown automaton
o Type 1 (context-sensitive): context-sensitive grammar; linear bounded automaton
o Type 0 (recursively enumerable): unrestricted grammar; Turing machine

Of these, regular and context-free languages are of interest to compiler writers. Regular expressions are used to describe tokens, and every regular expression can be recognized by a finite state automaton; as a result, lexical analyzers are typically based on finite state automata. Context-free grammars are used to define programming language syntax. Parsers use algorithms that analyze strings according to the grammar rules.

Grammars

A grammar is a method for defining a language by giving rules that can be used to derive (generate) the strings (sentences) of the language. These rules are called productions. A grammar consists of
o a set of symbols called terminal symbols. These are the symbols which can appear in strings of the language.
o a set of symbols called nonterminal symbols. These symbols represent syntactic classes; that is, sets of strings in the language.
  One of the nonterminals is designated as the start (or sentence) symbol. The syntactic class represented by the start symbol is the entire language.
o a set of rules called productions. Each production consists of a sequence of terminals and nonterminals on the left, followed by an arrow, followed by a sequence of terminals and nonterminals on the right. For example:
    while-stmt --> while ( expression ) statement
  which might appear in a grammar for C or Java. The left side consists of a single nonterminal. The right side is a sequence of 5 symbols: while, "(", and ")" are terminals, and expression and statement are nonterminals. The nonterminal while-stmt represents the set of all valid while statements in C. The production shows how valid while statements can be generated.

In a context-free grammar, the left side of every production consists of a single nonterminal. A common notation for writing context-free grammars is BNF (Backus-Naur Form).

Lexical Analysis

Lexical analysis is the first step in compilation. The lexical analyzer is also called a scanner or a lexer. Its objective is to divide the input string into a series of substrings called tokens. Its input is a sequence of characters; its output is a sequence of tokens.

What are tokens? They are the atomic program units used by all further stages of compilation. They include identifiers, constants, keywords, operators, and punctuation symbols. Whitespace characters (spaces, tabs, newlines) are used to delimit tokens and are removed at this stage. Comments are also detected and removed at this stage. Some related information, such as line numbers, may be retained for printing error messages.

The exact specification of what constitutes a token (what the keywords and operators are, what the rules for forming identifiers are, etc.) is part of the language definition. Token types (keyword, constant, etc.) are analogous to parts of speech in English.
o A lexeme is the string which represents a token.
o A pattern is a rule used to describe the set of strings corresponding to a token.

Regular Expressions

Regular expressions are patterns used to describe languages (i.e., sets of strings) over a given alphabet. They can only describe relatively simple languages (regular languages), but they permit very efficient membership tests and matching.

The following rules can be used to construct regular expressions for a given alphabet Σ:
1. ε is a regular expression representing the language {ε}, which contains only the empty string.
2. If a is a symbol in Σ, then a is a regular expression representing the language consisting of the single string "a".
3. If r and s are regular expressions representing the languages L(r) and L(s) respectively, then
   a. r | s is a regular expression representing the language L(r) ∪ L(s) (the alternation of r and s).
   b. (r) is a regular expression representing the language L(r).
   c. rs is a regular expression representing the language L(r)L(s), which contains the concatenations of strings from L(r) with strings from L(s).
   d. r* is a regular expression representing the Kleene closure of r; that is, the language consisting of concatenations of zero or more strings from L(r).

o Regular definitions. A name can be given to a regular expression so that it can be used in the definition of other regular expressions. For example,
    digit --> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    number --> digit(digit)*
o r+ is used to represent the set of concatenations of one or more strings from L(r).
o r? is used to represent zero or one instance of r.
o [abc] is a character class equivalent to a | b | c. Character classes may also contain ranges of characters, as in [0-9] or [a-z].

Examples:
1. Strings of a's and b's: (a|b)*
2. Strings of 0's and 1's with even parity (an even number of 1's): 0*(10*10*)*
3. Numbers in C:
   i. digit --> 0|1|2|3|4|5|6|7|8|9
   ii. digits --> digit digit*
   iii. optional_fraction --> . digits | ε
   iv. optional_exponent --> ( (e|E)(+|-|ε)digits) | ε
   v. num --> digits optional_fraction optional_exponent
4. Numbers in C using lex notation (the decimal point is escaped because an unescaped . matches any character in lex): [0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?

Describing tokens with regular expressions

o During lexical analysis, the tokens of the language are described by writing a regular expression for each token type. Some examples:
    An identifier is given by [a-zA-Z][a-zA-Z0-9]*
    The keyword if is given by if.
    Integers are given by [+-]?[0-9]+.
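
To make the link between token patterns and a scanner concrete, here is a minimal sketch in Python that splits an input string into (token type, lexeme) pairs using the identifier and integer patterns given above. It is an illustration, not a full lexer: the token names, the `KEYWORDS` table, the whitespace rule, and the function name `tokenize` are all assumptions made for this example. The keyword if is recognized by first matching the identifier pattern and then looking the lexeme up in the keyword table, which is one common way to keep a word such as "iffy" from being split into a keyword and an identifier.

```python
import re

# Patterns copied from the examples above. The token names, the keyword
# table, and the whitespace rule are assumptions made for this sketch.
TOKEN_SPEC = [
    ("INTEGER",    r"[+-]?[0-9]+"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("WHITESPACE", r"[ \t\n]+"),
]
KEYWORDS = {"if"}

# One master pattern: each token type becomes a named alternative.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind = match.lastgroup      # name of the alternative that matched
        lexeme = match.group()
        pos = match.end()
        if kind == "WHITESPACE":
            continue                # whitespace only delimits tokens; drop it
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"        # keywords also fit the identifier pattern
        tokens.append((kind, lexeme))
    return tokens


print(tokenize("if count1 -42 x"))
# [('KEYWORD', 'if'), ('IDENTIFIER', 'count1'), ('INTEGER', '-42'), ('IDENTIFIER', 'x')]
```

Note that because the integer pattern includes an optional sign, an input such as "a-1" would be read as an identifier followed by the integer -1; a scanner for a real language would more likely treat + and - as separate operator tokens and leave the sign to the parser.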