Lexical Analysis

Lexical Analysis; Regular Expressions- SWE – Sr. Sallam 23/11/2010
Introduction
 A string (word, sentence) is a finite sequence of symbols from an alphabet. A language is a set of
strings over some alphabet. For example,
o { (ab)^n | n = 1, 2, 3, 4, … } is the language consisting of the pair ab repeated one or
more times.
o { (^n )^n | n = 1, 2, 3, 4, … } is the language consisting of n left parentheses followed by
n right parentheses, that is, fully nested pairs of parentheses.
 A fundamental problem in studying languages is the recognition problem. That is, given a
language and a string, determine if the string is a member of the language. This problem is
especially important for compilers, which must determine whether a given input is in fact a valid
program in the source language.
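As an illustration of the recognition problem, the two example languages above can be tested for membership with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes the languages are { (ab)^n | n >= 1 } and { (^n )^n | n >= 1 }.

```python
def is_ab_repeated(s: str) -> bool:
    """Membership test for { (ab)^n | n >= 1 }."""
    n, remainder = divmod(len(s), 2)
    return n >= 1 and remainder == 0 and s == "ab" * n


def is_nested_parens(s: str) -> bool:
    """Membership test for { (^n )^n | n >= 1 }: n opens, then n closes."""
    n, remainder = divmod(len(s), 2)
    return n >= 1 and remainder == 0 and s == "(" * n + ")" * n


print(is_ab_repeated("ababab"))    # True
print(is_ab_repeated("aba"))       # False
print(is_nested_parens("((()))"))  # True
print(is_nested_parens("(()"))     # False
```

The second test has to compare the number of opening and closing parentheses, which is why that language cannot be recognized by a finite state automaton and needs at least a pushdown automaton.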
 Languages can be classified according to their grammars and the complexity of the computing
machine required to recognize them, as follows (this is the Chomsky hierarchy):
o Type 3 (regular): regular grammar; finite state automaton
o Type 2 (context-free): context-free grammar; pushdown automaton
o Type 1 (context-sensitive): context-sensitive grammar; linear bounded automaton
o Type 0 (recursively enumerable): unrestricted grammar; Turing machine
 Of these, regular and context-free languages are of interest to compiler writers. Regular
expressions are used to model tokens. As a result, lexical analyzers are typically based on finite
state automata. Context-free grammars are used to define programming language syntax.
Parsers use algorithms that apply the productions of the grammar to decide whether a sequence of
tokens is a valid program.
Grammars
A grammar is a method for defining a language, by giving rules which can be used to derive (generate)
strings (sentences) of the language. These rules are called productions. A grammar consists of
 a set of symbols called terminal symbols. These are the symbols which can appear in strings of
the language.
 a set of symbols called nonterminal symbols. These symbols represent syntactic classes; that is,
sets of strings in the language. One of the nonterminals is designated as the start (or sentence)
symbol. The syntactic class represented by the start symbol is the entire language.
 a set of rules called productions. Each production consists of a sequence of terminals and
nonterminals on the left, followed by an arrow, followed by a sequence of terminals and
nonterminals on the right. For example,
 while-stmt --> while ( expression ) statement
which might appear in a grammar for C or Java. The left side consists of a single nonterminal. The right side is a
sequence of 5 symbols. while, "(", and ")" are terminals, and expression and
statement are nonterminals. The nonterminal while-stmt represents the set of all
valid while statements in C. The production shows how valid while statements
can be generated.
In a context-free grammar, the left side of every production consists of a single nonterminal. The notation
for representing a context-free grammar is called BNF (Backus Naur Form).
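A context-free grammar can be stored as a mapping from each nonterminal to its alternative right-hand sides, and a string can be generated by repeatedly replacing the leftmost nonterminal. The sketch below assumes a toy grammar: only the while-stmt production comes from the text, while the expression and statement productions and the function name are made up for illustration.

```python
# Toy grammar. Keys are nonterminals; every other symbol is a terminal.
# Only the while-stmt production is taken from the text; the rest is assumed.
GRAMMAR = {
    "while-stmt": [["while", "(", "expression", ")", "statement"]],
    "expression": [["id"], ["id", "<", "id"]],
    "statement": [[";"], ["id", "=", "id", ";"]],
}
START = "while-stmt"


def leftmost_derivation(start, grammar, choices):
    """Return the sentential forms of a leftmost derivation.

    choices gives, for each step, the index of the alternative to apply
    to the leftmost nonterminal.
    """
    form = [start]
    steps = [form[:]]
    choices = iter(choices)
    while True:
        # Find the leftmost symbol that is still a nonterminal.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            return steps  # only terminals remain: a sentence of the language
        alternative = grammar[form[idx]][next(choices)]
        form = form[:idx] + alternative + form[idx + 1:]
        steps.append(form[:])


for step in leftmost_derivation(START, GRAMMAR, [0, 1, 1]):
    print(" ".join(step))
# while-stmt
# while ( expression ) statement
# while ( id < id ) statement
# while ( id < id ) id = id ;
```

Because the dictionary keys are single nonterminals, every grammar written in this form is context-free by construction.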
Lexical Analysis
 Lexical analysis is the first step in compilation. The lexical analyzer is also called a scanner or a
lexer. Its objective is to divide the input string into a series of substrings called tokens. Its input is
a sequence of characters; its output is a sequence of tokens.
 What are tokens? They are the atomic program units used by all further stages of compilation.
They include identifiers, constants, keywords, operators, and punctuation symbols. Whitespace
characters (spaces, tabs, newlines) are used to delimit tokens and are removed at this stage.
Comments are detected (and removed) at this stage. Some related information, such as line
numbers, may be retained for printing of error messages. The exact specification for what
constitutes a token (what are the keywords and operators, what are the rules for forming
identifiers, etc.) is part of the language definition. Token types (keyword, constant, etc.) are
analogous to parts of speech in English.
 A lexeme is the string which represents a token.
 A pattern is a rule used to describe the set of strings corresponding to a token. For example, in
the statement count = 10, the lexeme count is an instance of the identifier token, whose pattern
is a letter followed by any number of letters and digits.
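The behaviour described in this section (splitting characters into tokens, discarding whitespace and comments, and keeping line numbers for error messages) can be sketched with Python's re module. The token kinds, the patterns, and the names Token, TOKEN_SPEC and scan are assumptions for a tiny C-like language, not part of any particular compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str    # token type, e.g. KEYWORD
    lexeme: str  # the matched substring
    line: int    # line on which the lexeme starts


# Hypothetical patterns; order matters because earlier alternatives win.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),
    ("WHITESPACE", r"\s+"),
    ("KEYWORD", r"(?:if|else|while)(?![a-zA-Z0-9])"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("CONSTANT", r"[0-9]+"),
    ("OPERATOR", r"==|<=|>=|[-+*/=<>]"),
    ("PUNCTUATION", r"[(){};,]"),
]
MASTER = re.compile(
    "|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC),
    re.DOTALL,  # let a comment span several lines
)


def scan(source: str) -> Iterator[Token]:
    """Yield the tokens of source, skipping whitespace and comments."""
    line, pos = 1, 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"line {line}: unexpected character {source[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        if kind not in ("COMMENT", "WHITESPACE"):
            yield Token(kind, lexeme, line)
        line += lexeme.count("\n")  # keep the line counter current
        pos = m.end()


source = "while (x < 10) /* loop\n body */\n  x = x + 1;"
for token in scan(source):
    print(token.line, token.kind, token.lexeme)
```

Comments and whitespace are still matched, so that newlines inside them are counted, but they are never passed on to later stages.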
Regular Expressions
 Regular expressions are patterns used to describe languages (i.e., sets of strings) from a given
alphabet. They can only describe relatively simple languages (regular languages), but they permit
very efficient membership tests and matching. The following rules can be used to construct
regular expressions for a given alphabet Σ:
1. ε is a regular expression, representing the language {ε} whose only member is the
empty string.
2. If a is a symbol in Σ, then a is a regular expression representing the language
consisting of the single string "a".
3. If r and s are regular expressions representing the languages L(r) and L(s)
respectively, then
a. r | s is a regular expression representing the language L(r) ∪ L(s) (the
alternation of r and s).
b. (r) is a regular expression representing the language L(r).
c. rs is a regular expression representing the language L(r)L(s) containing concatenations
of strings from L(r) with strings from L(s).
d. r* is a regular expression representing the Kleene closure of r; that is, the language
consisting of concatenations of zero or more strings from L(r).
o Regular definitions. A name can be used to represent one regular expression, in order to
use it in the definition of other regular expressions. For example,
digit --> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
number --> digit(digit)*
o r+ is used to represent the set of concatenations of one or more strings from L(r).
o r? is used to represent zero or one instance of r.
o [abc] is a character class equivalent to a | b | c. Character classes may also contain
ranges of characters, as in [0-9] or [a-z].
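The construction rules can be read as operations on sets of strings. The sketch below models a language as a Python set and is only an approximation: a Kleene closure is infinite, so the hypothetical closure function stops after max_repeats concatenations. All function names are made up for illustration.

```python
from itertools import product

EPSILON = ""  # the empty string


def union(l1, l2):
    """L(r | s) = L(r) ∪ L(s)."""
    return l1 | l2


def concat(l1, l2):
    """L(rs): every string of l1 followed by every string of l2."""
    return {x + y for x, y in product(l1, l2)}


def closure(lang, max_repeats):
    """Finite approximation of L(r*): zero up to max_repeats concatenations."""
    result = {EPSILON}
    layer = {EPSILON}
    for _ in range(max_repeats):
        layer = concat(layer, lang)
        result |= layer
    return result


def plus(lang, max_repeats):
    """r+ is shorthand for r r* (one up to max_repeats concatenations)."""
    return concat(lang, closure(lang, max_repeats - 1))


def optional(lang):
    """r? is shorthand for r | ε."""
    return union(lang, {EPSILON})


ab = union({"a"}, {"b"})  # [ab], that is, a | b
print(sorted(closure(ab, 2), key=lambda s: (len(s), s)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(sorted(plus({"a"}, 3)))  # ['a', 'aa', 'aaa']
```

Writing r+ and r? in terms of the basic operations shows that the shorthand forms add convenience but no expressive power.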
Examples:
1. Strings of a's and b's: (a|b)*
2. Strings of 0's and 1's with even parity: 0*(10*10*)*
3. Numbers in C:
i. digit --> 0|1|2|3|4|5|6|7|8|9
ii. digits --> digit digit*
iii. optional_fraction --> . digits | ε
iv. optional_exponent --> ( (e|E)(+|-|ε)digits) | ε
v. num --> digits optional_fraction optional_exponent
4. Numbers in C using lex notation:
[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?
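Examples 2 and 4 can be tried out with Python's re module, whose syntax agrees with lex for these constructs. This is a sketch rather than a lex specification: the names EVEN_PARITY, C_NUMBER and matches are hypothetical, and the dot is written as \. so that it matches only a literal decimal point.

```python
import re

# Even number of 1's: each group (10*10*) contributes exactly two 1's.
EVEN_PARITY = re.compile(r"0*(10*10*)*")

# Digits, optional fraction, optional exponent.
C_NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def matches(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return pattern.fullmatch(s) is not None


print(matches(EVEN_PARITY, "0110"))   # True  (two 1's)
print(matches(EVEN_PARITY, "10101"))  # False (three 1's)
print(matches(C_NUMBER, "6.02e23"))   # True
print(matches(C_NUMBER, "1E-9"))      # True
print(matches(C_NUMBER, "3x14"))      # False
print(matches(C_NUMBER, ".5"))        # False (at least one leading digit required)
```

fullmatch is used instead of match because membership in the language concerns the entire string, not just a prefix of it.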
Describing tokens with regular expressions
o During lexical analysis, each token of the language is described by a regular expression.
Some examples:
 An identifier is given by [a-zA-Z][a-zA-Z0-9]*
 The keyword if is given by if.
 Integers are given by [+-]?[0-9]+.
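The three patterns above can be checked directly in Python. Note that the lexeme if matches both the keyword pattern and the identifier pattern. The sketch below resolves this by trying the keyword first, which is an assumed convention (similar to rule order in lex), and the names PATTERNS and classify are hypothetical.

```python
import re

# Patterns taken from the text; the priority order is an assumption.
PATTERNS = [
    ("KEYWORD_IF", re.compile(r"if")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("INTEGER", re.compile(r"[+-]?[0-9]+")),
]


def classify(lexeme):
    """Return the first token type whose pattern matches the whole lexeme."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(lexeme):
            return name
    return None  # no pattern describes this string


for lexeme in ["if", "iffy", "x1", "-42", "1x"]:
    print(lexeme, classify(lexeme))
# if KEYWORD_IF
# iffy IDENTIFIER
# x1 IDENTIFIER
# -42 INTEGER
# 1x None
```

The string iffy is classified as an identifier because the keyword pattern must match the whole lexeme, not just its first two characters.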