Regular Expressions

CIS 324: Language Design and Implementation
Lexical Analysis
1. Regular Expressions
1.1 Lexical Analysis using Tokens
The purpose of the lexical analyzer is to read the input characters
from the source program and to translate them into a sequence of
tokens suitable for syntax analysis.
The simplest way to build a lexical analyzer is to draw a diagram
that illustrates the structure of the tokens of the source language, and
then to translate the diagram by hand into a program that
recognizes the tokens.
There are three main reasons for separating the lexical analysis phase
of the compiler from the syntax analysis phase:
- the separation simplifies the design of both the lexical analyzer
and the syntax analyzer, since many design issues can be handled
more precisely when each phase is considered in isolation;
- the separation improves the efficiency of both the lexical analyzer
and the syntax analyzer, since specialized techniques can be applied
to each phase in isolation;
- the separation enhances the portability of both the lexical analyzer
and the syntax analyzer.
1.2 Notation for Regular Expressions
Regular expressions denote sets of strings.
Regular expressions over a given alphabet Σ are defined recursively
using the following basic values and operations:
- the empty string ε is a regular expression that denotes the set: { ε }
- any character a from Σ is a regular expression that denotes the set: { a }
- if p and q are regular expressions that denote the sets P and Q, then:
( p | q ), ( pq ) and ( p )* are regular expressions which denote the sets:
P ∪ Q, PQ, and P* respectively.
This definition uses the following classical operations on languages:
Operation            Definition
union P ∪ Q          P ∪ Q = { s | s is in P or s is in Q }
concatenation PQ     PQ = { st | s is in P and t is in Q }
closure P*           P* = the set of all concatenations of zero or more
                     strings from P
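The three operations can be tried out on small finite languages. The following is a minimal sketch in Python, not part of the original notes: the function names `union`, `concat` and `closure` are chosen here for illustration, and because P* is infinite whenever P contains a nonempty string, `closure` only builds a finite approximation bounded by the assumed parameter `max_reps`.

```python
from itertools import product

def union(p, q):
    """P ∪ Q: the strings that are in P or in Q."""
    return set(p) | set(q)

def concat(p, q):
    """PQ: every string of P followed by every string of Q."""
    return {s + t for s, t in product(p, q)}

def closure(p, max_reps):
    """Finite approximation of P*: zero up to max_reps concatenations of P."""
    result = {""}   # zero concatenations give the empty string
    layer = {""}    # strings built from exactly k members of P
    for _ in range(max_reps):
        layer = concat(layer, p)
        result |= layer
    return result

P = {"a"}
Q = {"b"}
print(sorted(union(P, Q)))                        # ['a', 'b']
print(sorted(concat(union(P, Q), union(P, Q))))   # ['aa', 'ab', 'ba', 'bb']
print(sorted(closure(P, 3)))                      # ['', 'a', 'aa', 'aaa']
```

The empty Python string `""` plays the role of the empty string in these sets.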
The precedence and associativity of these operations are as follows:
- the closure operator has the highest precedence and is left associative;
- the concatenation has the second highest precedence and is left associative;
- the union operator has the lowest precedence and is left associative.
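These precedence rules can be checked experimentally. The sketch below uses Python's standard `re` module, whose `|`, juxtaposition and `*` follow the same precedence order; the helper name `matches` and the sample strings are illustrative choices, not part of the original notes.

```python
import re

def matches(pattern, text):
    """True when the whole of text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None

# a|bc* is read as a|(b(c*)): closure binds tighter than concatenation,
# and concatenation binds tighter than union.
implicit = "a|bc*"
explicit = "a|(b(c*))"
# A different grouping of the same symbols denotes a different language.
regrouped = "((a|b)c)*"

for text in ["a", "b", "bcc", "ac", "bcbc"]:
    print(text,
          matches(implicit, text),
          matches(explicit, text),
          matches(regrouped, text))
```

For every sample string the first two columns agree, while the third differs, which shows that parentheses are needed only when the default grouping is not the intended one.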
Examples:
The regular expression: a | b denotes the set: { a, b }
The regular expression: ( a | b ) ( a | b ) denotes the set: { aa, ab, ba, bb }
The regular expression: a* denotes the set: { ε, a, aa, aaa, ... }
The regular expression: ( a | b )* denotes the set:
{ ε, a, b, aa, bb, ab, ba, aaa, bbb, aab, abb, ... }
Example: Consider the following grammar fragment
statement → if expr then statement
          | if expr then statement else statement
          | ε
expr      → term relop term
          | term
term      → id | num
The corresponding regular definitions for the terminals are:
if    → if
then  → then
else  → else
relop → < | <= | = | <> | > | >=
id    → letter ( letter | digit )*
num   → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
delim → blank | tab | newline
ws    → delim+
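The regular definitions above translate almost directly into a pattern for Python's `re` module. The following is a sketch under stated assumptions rather than a definitive lexer: letter and digit are not defined in these notes, so they are assumed here to be the ASCII letters and the digits 0 to 9; keywords are recognized by first matching id and then looking the lexeme up in a keyword set; and the names `TOKEN_RE`, `KEYWORDS` and `tokenize` are hypothetical. Python tries alternatives from left to right, so the longer relational operators are listed before their one-character prefixes.

```python
import re

KEYWORDS = {"if", "then", "else"}

TOKEN_RE = re.compile(r"""
    (?P<ws>    [ \t\n]+ )                               # ws: delim+
  | (?P<num>   [0-9]+ (?:\.[0-9]+)? (?:E[+-]?[0-9]+)? ) # num
  | (?P<id>    [A-Za-z] [A-Za-z0-9]* )                  # id (assumed ASCII)
  | (?P<relop> <= | <> | >= | < | = | > )               # relop, longest first
""", re.VERBOSE)

def tokenize(source):
    """Translate source text into a list of (token, lexeme) pairs."""
    pos = 0
    tokens = []
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                  # white space separates tokens only
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme             # if, then, else are their own tokens
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("if x1 <= 2.5E+3 then y else z"))
```

White space is matched like any other definition but is discarded instead of being passed on to the parser.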
2. Strings and Languages
An alphabet denotes a finite set of symbols. The symbols are
usually letters and other characters.
A string is a finite sequence of symbols drawn from the alphabet.
In programming language theory the notions of sentence and word
are synonyms for the term string.
A language denotes a set of strings over a fixed alphabet.
In programming language theory the following terminology is used:
Term                     Definition
prefix of string         A string obtained by deleting zero or more
                         trailing symbols from the given string
suffix of string         A string obtained by deleting zero or more
                         leading symbols from the given string
substring of string      A string obtained by removing a prefix and
                         a suffix from the given string
proper prefix, suffix,   Any nonempty string that is respectively a prefix,
and substring            a suffix, or a substring of the given string and
                         is different from it
subsequence of string    Any string produced by deleting zero or more not
                         necessarily contiguous symbols from the given string
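The terms in this table can be illustrated with a few short Python functions. This is a sketch only; the function names are chosen here for illustration and do not come from the original notes.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All distinct substrings of s: remove a prefix and a suffix."""
    found = {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}
    return sorted(found, key=lambda t: (len(t), t))

def proper(parts, s):
    """Keep only the nonempty strings that differ from s."""
    return [t for t in parts if t != "" and t != s]

def is_subsequence(t, s):
    """True if t can be produced by deleting symbols from s."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the matched symbol,
    # so the symbols of t must occur in s in the same order.
    return all(ch in remaining for ch in t)

print(prefixes("ban"))                  # ['', 'b', 'ba', 'ban']
print(suffixes("ban"))                  # ['ban', 'an', 'n', '']
print(substrings("ban"))                # ['', 'a', 'b', 'n', 'an', 'ba', 'ban']
print(proper(prefixes("ban"), "ban"))   # ['b', 'ba']
print(is_subsequence("bn", "ban"))      # True
print(is_subsequence("nb", "ban"))      # False
```

Every prefix, suffix and substring is also a subsequence, but "bn" shows that a subsequence need not be a substring.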
3. Transition Diagrams
Transition diagrams show the actions performed by the lexical analyzer
when invoked by the parser. Transition diagrams consist of states and
edges that connect them.
Recognition of a token begins in the start state of its transition
diagram. At each step the analyzer follows, from the current state, the
edge whose label matches the next input character of the source program.
Usually several transition diagrams for several tokens are developed
together to begin from a common start state.
Example: Implementing transition diagrams for the regular expressions
relop → < | <= | = | <> | > | >=
[Figure: transition diagram for relop, start state 0, states 0 to 8]
From   Input   To
0      <       1
1      =       2
1      >       3
1      other   4
0      =       5
0      >       6
6      =       7
6      other   8
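A transition diagram of this kind can be turned into a program by storing its edges in a table and following them one input character at a time. The sketch below does this for relop in Python. It assumes the usual layout of the diagram: state 0 is the start state, states 1 and 6 are reached after < and >, the edges labelled "other" accept without consuming the lookahead character, and the remaining states are accepting. The names `RELOP_EDGES`, `RELOP_OTHER`, `RELOP_ACCEPT` and `scan_relop` are hypothetical.

```python
# Edges of the relop diagram: (state, input character) -> next state.
RELOP_EDGES = {
    (0, "<"): 1, (0, "="): 5, (0, ">"): 6,
    (1, "="): 2, (1, ">"): 3,
    (6, "="): 7,
}
# "other" edges: taken on any other character, or at the end of the input.
RELOP_OTHER = {1: 4, 6: 8}
# Accepting states and the lexeme each one recognizes (assumed labelling).
RELOP_ACCEPT = {2: "<=", 3: "<>", 4: "<", 5: "=", 7: ">=", 8: ">"}

def scan_relop(text, pos=0):
    """Return (lexeme, next position), or None if no relop starts at pos."""
    state = 0
    while state not in RELOP_ACCEPT:
        ch = text[pos] if pos < len(text) else ""
        if (state, ch) in RELOP_EDGES:
            state = RELOP_EDGES[(state, ch)]
            pos += 1                      # consume the matched character
        elif state in RELOP_OTHER:
            state = RELOP_OTHER[state]    # lookahead is left in the input
        else:
            return None                   # no edge matches: not a relop
    return RELOP_ACCEPT[state], pos

print(scan_relop("<=5"))      # ('<=', 2)
print(scan_relop("<5"))       # ('<', 1)
print(scan_relop("x>=1", 1))  # ('>=', 3)
print(scan_relop("a"))        # None
```

Keeping the edges in a dictionary makes the program mirror the diagram edge for edge; a hand-written lexer would more commonly use nested conditionals, with the same behaviour.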
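num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
[Figure: transition diagram for num, start state 0, states 0 to 7]
From   Input    To
0      digit    1
1      digit    1
1      .        2
2      digit    3
3      digit    3
1      E        4
3      E        4
4      + or -   5
4      digit    6
5      digit    6
6      digit    6
6      other    7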
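The num diagram can be simulated in the same table-driven way. The following Python sketch rests on several assumptions that go beyond the notes: the edges follow the usual textbook layout for this regular expression; besides the state reached after the exponent digits (state 6), the states reached after the integer part (state 1) and after the fraction (state 3) are also treated as accepting, so that shorter forms such as 12 and 3.14 are recognized by one diagram; and the scanner backs up to the last accepting position when it gets stuck. The names `NUM_EDGES`, `NUM_FINAL`, `classify` and `scan_num` are hypothetical.

```python
DIGITS = set("0123456789")

def classify(ch):
    """Map an input character to the edge label used in the diagram."""
    if ch in DIGITS:
        return "digit"
    if ch in (".", "E"):
        return ch
    if ch in ("+", "-"):
        return "sign"
    return "other"

# Edges of the num diagram: (state, edge label) -> next state.
NUM_EDGES = {
    (0, "digit"): 1,
    (1, "digit"): 1, (1, "."): 2, (1, "E"): 4,
    (2, "digit"): 3,
    (3, "digit"): 3, (3, "E"): 4,
    (4, "sign"): 5, (4, "digit"): 6,
    (5, "digit"): 6,
    (6, "digit"): 6,
}
# States in which a complete num has been read (assumption, see lead-in).
NUM_FINAL = {1, 3, 6}

def scan_num(text, pos=0):
    """Return (lexeme, next position) for the longest num starting at pos."""
    state, start = 0, pos
    last_end = None
    while True:
        ch = text[pos] if pos < len(text) else ""
        nxt = NUM_EDGES.get((state, classify(ch)))
        if nxt is None:
            break                     # the "other" case: stop scanning
        state, pos = nxt, pos + 1
        if state in NUM_FINAL:
            last_end = pos            # remember the longest match so far
    if last_end is None:
        return None
    return text[start:last_end], last_end

print(scan_num("6.02E+23;"))   # ('6.02E+23', 8)
print(scan_num("3.14+x"))      # ('3.14', 4)
print(scan_num("12.x"))        # ('12', 2)
```

In the last call the scanner reads the dot, finds no digit after it, and falls back to the integer 12, leaving the dot in the input.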