Chapter 3
Lexical Analysis
Yu-Chen Kuo
3.1 The Role of the Lexical Analyzer
• Its main task is to read the input characters
and produce as output a sequence of tokens
that the parser uses for syntax analysis
• It also performs certain secondary tasks
such as stripping out comments and white
space and correlating error messages with
the source program
Tokens, Patterns, Lexemes
• In general, there is a set of strings in the input for
which the same token is produced as output.
• This set of strings is described by a rule
called a pattern associated with the token.
• A lexeme is a sequence of characters in the
source program that is matched by the
pattern for a token.
– In const pi = 3.14156; the string pi is a lexeme for the token id
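The relationship between tokens, patterns, and lexemes can be sketched with Python's re module. The token names (const, id, num, equals, semicolon, ws) and the patterns below are assumptions chosen for this one statement, not a complete token set.

```python
import re

# Hypothetical (token, pattern) pairs. Order matters: the keyword is
# listed before the general identifier pattern.
TOKEN_PATTERNS = [
    ("const",     r"const\b"),
    ("id",        r"[A-Za-z][A-Za-z0-9]*"),
    ("num",       r"[0-9]+(?:\.[0-9]+)?"),
    ("equals",    r"="),
    ("semicolon", r";"),
    ("ws",        r"[ \t\n]+"),
]

# One named group per token; the name of the group that matched is the token.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(text):
    """Return (token, lexeme) pairs for text, dropping white space."""
    pairs = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"no pattern matches at position {pos}")
        if m.lastgroup != "ws":
            pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs
```

Calling tokenize("const pi=3.14156;") pairs each lexeme with the token whose pattern matched it, so pi is reported with the token id.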
Examples of Tokens
• In most programming languages, the following
constructs are treated as tokens: keywords,
identifiers, constants, literal strings, operators,
and punctuation symbols.
Attributes for Tokens
• When more than one lexeme matches a
pattern, the lexical analyzer must provide
additional information about the particular
lexeme that matched to the subsequent
phases of the compiler.
• The lexical analyzer collects information
about tokens into their associated attributes.
Attributes for Tokens (Cont.)
• Tokens influence parsing decisions; the
attributes influence the translation of tokens.
• A token usually has only a single attribute: a pointer (index) to the symbol-table entry
in which the information about the token is
kept.
Lexical Errors
• Few errors are detected at the lexical level alone,
because a lexical analyzer has a very localized view
of a source program.
• For example, if the string fi is encountered in a C
program for the first time in the context
– fi ( a == f(x)) …..
– the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
– Since fi is a valid identifier, the lexical analyzer must
return the token for an identifier and let a later phase
handle any error.
Lexical Errors (Cont.)
• A lexical analyzer finds an error when it is
unable to proceed because none of the
patterns matches a prefix of the remaining
input.
• The simplest recovery strategy is “panic
mode”, to delete successive characters from
the remaining input until the lexical
analyzer can find a well-formed token.
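Panic-mode recovery can be sketched as follows. The token pattern and the function name scan_with_panic_mode are assumptions made for illustration; a real lexical analyzer would also report each skipped character.

```python
import re

# Hypothetical token patterns: identifiers, integers, a few operators
# and punctuation symbols, and white space.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[-+*/=();]|[ \t\n]+")


def scan_with_panic_mode(text):
    """Return (lexemes, skipped): the lexemes found and the characters deleted."""
    lexemes, skipped = [], []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m:
            if not m.group().isspace():
                lexemes.append(m.group())
            pos = m.end()
        else:
            # No pattern matches a prefix of the remaining input:
            # delete one character and try again.
            skipped.append(text[pos])
            pos += 1
    return lexemes, skipped
```

On the input "x = 3 @# y;" the two characters @ and # are deleted and scanning resumes at y.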
Lexical Errors (Cont.)
• Other possible error-recovery actions are:
– Deleting an extraneous character
– Inserting a missing character
– Replacing an incorrect character by a correct
character
– Transposing two adjacent characters
Lexical Errors (Cont.)
• Error transformation attempts to repair the
input.
• The simplest strategy is to see if a prefix of
the remaining input can be transformed into a
valid lexeme by a single error transformation.
• This strategy assumes most lexical errors are
the result of a single transformation.
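The single-transformation strategy can be sketched by generating every string that is one deletion, insertion, replacement, or transposition away from the offending prefix and keeping those that are valid lexemes. The function name and the use of a small keyword set as the "valid lexemes" are assumptions for illustration.

```python
def single_error_repairs(prefix, valid_lexemes, alphabet):
    """Return the valid lexemes reachable from prefix by one transformation."""
    candidates = set()
    n = len(prefix)
    for i in range(n):
        # Delete the character at position i.
        candidates.add(prefix[:i] + prefix[i + 1:])
        # Replace the character at position i.
        for c in alphabet:
            candidates.add(prefix[:i] + c + prefix[i + 1:])
    for i in range(n + 1):
        # Insert a character before position i.
        for c in alphabet:
            candidates.add(prefix[:i] + c + prefix[i:])
    for i in range(n - 1):
        # Transpose the adjacent characters at positions i and i + 1.
        candidates.add(prefix[:i] + prefix[i + 1] + prefix[i] + prefix[i + 2:])
    candidates.discard(prefix)
    return sorted(candidates & set(valid_lexemes))
```

For example, with the keywords if, then, and else as the valid lexemes, thne is repaired to then by a transposition and els to else by an insertion.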
Input Buffering
• There are times when a lexical analyzer needs to
look ahead several characters beyond the lexeme
for a token before a match can be announced.
• Buffering techniques can be used to reduce the
overhead required to process input characters.
• The buffer is divided into two N-character halves.
Input Buffering(Cont.)
• N input characters are read into each half of the
buffer with one read command.
• If fewer than N characters remain in the input then
a special character eof is read into the buffer.
• Two pointers are maintained. Initially, both
pointers point to the first character of the next
lexeme. The forward pointer scans ahead until a
match for a pattern is found. After the lexeme is
processed, both pointers are set to the character
immediately past the lexeme.
Input Buffering(Cont.)
• If the forward pointer is about to move past
the halfway mark, the right half is filled
with N new characters.
• If the forward pointer is about to move past
the right end of the buffer, the left half is
filled with N new characters.
• Lookahead is limited by the length of the
buffer.
Sentinels to Improve Input Buffering
• Except at the ends of the buffer halves, we need
two tests for each advance of the forward
pointer. We can reduce this to one test if we
extend each buffer half to hold a sentinel
character, eof, at its end.
Sentinels to Improve Input Buffering (Cont.)
• Most of the time only one test is needed per
character: whether the forward pointer points to an
eof. Further tests are made only when it does.
• The average number of tests per input
character is very close to 1.
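A minimal sketch of the two-half buffer with sentinels follows. The class name SentinelBuffer, the method next_char, the use of "\0" as the eof character, and the tiny default N are all assumptions for illustration; the lexeme-beginning pointer is left out to keep the sketch short.

```python
import io

EOF = "\0"  # assumed sentinel; it must not occur in the source text


class SentinelBuffer:
    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [half 0: n chars][eof][half 1: n chars][eof]
        self.buf = [EOF] * (2 * n + 2)
        self.forward = 0
        self._load(0)

    def _load(self, half):
        """Fill one half with a single read of up to n characters."""
        start = half * (self.n + 1)
        chunk = self.stream.read(self.n)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        # A full read puts eof in the sentinel slot; a short read puts
        # it right after the last input character.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next input character, or None at the end of input."""
        while True:
            ch = self.buf[self.forward]
            if ch != EOF:                          # common case: one test
                self.forward += 1
                return ch
            if self.forward == self.n:             # sentinel ending half 0
                self._load(1)
                self.forward = self.n + 1
            elif self.forward == 2 * self.n + 1:   # sentinel ending half 1
                self._load(0)
                self.forward = 0
            else:                                  # eof inside a half
                return None
```

The extra position tests run only when the character read is eof, which is why the average cost per input character stays close to one test when N is large.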
Specification of Tokens
• Regular expressions are an important
notation for specifying patterns.
Strings and Languages
• An alphabet denotes any finite set of symbols,
– {0,1}: binary alphabet
– ASCII code: computer alphabet
• A string over some alphabet is a finite sequence of
symbols drawn from the alphabet.
• A language denotes a set of strings over some
fixed alphabet.
• The string exponentiation operation is defined as
s^0 = ε (empty string);
s^i = s^(i-1) s, for i > 0 (string concatenation)
Operation on Languages
• The language exponentiation operation is defined
as L^0 = {ε} and L^i = L^(i-1) L
Operation on Languages (Cont.)
• Let L = {A, …, Z, a, …, z} and D = {0, …, 9}
1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a letter
followed by a digit.
3. L^4 is the set of four-letter strings.
4. L* is the set of all strings of letters, including ε.
5. L(L ∪ D)* is the set of all strings of letters and
digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
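These operations can be tried out on small finite languages using Python sets. The function names are assumptions, and closure_up_to is only a finite approximation, since L* itself is infinite whenever L contains a nonempty string.

```python
def concat(L, M):
    """Concatenation LM = { st | s in L and t in M }."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^0 = {ε}; L^i = L^(i-1) L. The empty string "" plays the role of ε."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def closure_up_to(L, k):
    """Finite approximation of L*: the union of L^0, L^1, ..., L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result
```

With L = {"a", "b"} and D = {"0", "1"}, the union is L | D, concat(L, D) gives the four letter-digit strings, and power(L, 4) gives the sixteen four-letter strings.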
Regular Expressions
• A regular expression r is a formalism for
defining a language L(r).
• A language that can be defined by a regular
expression is called a regular set.
• A language that can be defined by a context-free grammar is called a context-free
language.
• The set of regular sets ⊂ the set of context-free languages
Rule for Regular Expressions
• The rules that define the regular expressions
over alphabet Σ are as follows.
1. ε is a regular expression denoting {ε}
2. If a is a symbol in Σ, then a is a regular
expression denoting {a}
Rule for Regular Expressions
(Cont.)
3. Suppose r and s are regular expressions for the languages
L(r) and L(s). Then
a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
b) (r)(s) is a regular expression denoting L(r)L(s)
c) (r)* is a regular expression denoting (L(r))*
• Unnecessary parentheses can be avoided in regular
expressions if we adopt the following conventions:
1. The unary operator * has the highest precedence and is left
associative.
2. Concatenation has the second highest precedence and is left
associative.
3. | has the lowest precedence and is left associative.
Rule for Regular Expressions
(Example)
• Let Σ = {a, b}
1. a | b denotes {a, b}
2. (a | b)(a | b) denotes {aa, ab, ba, bb}, the set of all
strings of a’s and b’s of length two.
3. a* denotes {ε, a, aa, aaa, …}, the set of all strings
of zero or more a’s.
4. (a | b)* denotes the set of all strings containing zero
or more instances of a or b.
5. a | a*b denotes the set containing the string a and all
strings consisting of zero or more a’s followed by b.
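The examples above can be checked with Python's re module, whose |, concatenation, and * follow the same precedence conventions. The helper name denotes is an assumption; re.fullmatch is used so that the whole string, not just a prefix, must belong to the language.

```python
import re


def denotes(regex, s):
    """True if the string s is in the language denoted by regex."""
    return re.fullmatch(regex, s) is not None
```

For instance, denotes("a|a*b", "aaab") holds because a*b matches, while denotes("a|a*b", "aa") does not, since * binds tighter than concatenation and | binds loosest.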
Algebraic Properties
of Regular Expressions
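For reference, the algebraic laws usually stated for regular expressions r, s, and t are listed below. This is the common textbook set, given as a sketch; it is not claimed to match the original slide entry for entry.

```latex
\begin{align*}
r \mid s &= s \mid r && \text{($\mid$ is commutative)} \\
r \mid (s \mid t) &= (r \mid s) \mid t && \text{($\mid$ is associative)} \\
(rs)t &= r(st) && \text{(concatenation is associative)} \\
r(s \mid t) &= rs \mid rt && \text{(concatenation distributes over $\mid$)} \\
(s \mid t)r &= sr \mid tr \\
\epsilon r &= r = r\epsilon && \text{($\epsilon$ is the identity for concatenation)} \\
r^{*} &= (r \mid \epsilon)^{*} \\
r^{**} &= r^{*} && \text{($*$ is idempotent)}
\end{align*}
```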
Regular Definition
• Let Σ be an alphabet. A regular definition is a
sequence of definitions of the form
d_1 → r_1
d_2 → r_2
…
d_n → r_n
where each d_i is a distinct name, and each r_i is a
regular expression over the symbols in Σ ∪ {d_1,
d_2, …, d_(i-1)}
Regular Definition (Example)
• The set of Pascal identifiers is the set of
strings of letters and digits beginning with a
letter.
• A regular definition for this set is as follows.
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter (letter | digit)*
Regular Definition (Example)
• Unsigned numbers in Pascal are strings such as
5280, 39.37, 6.33E4, or 1.894E-4.
• A regular definition for this set is as follows.
digit → 0 | 1 | … | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → (E (+ | - | ε) digits) | ε
num → digits optional_fraction optional_exponent
Notational Shorthands
1. One or more instances: +
– a+ : the set of all strings of one or more a’s
– r+ = r r*, r* = r+ | ε
2. Zero or one instance: ?
– r? = r | ε
– With these shorthands, the definition of num becomes
digit → 0 | 1 | … | 9
digits → digit+
optional_fraction → (. digits)?
optional_exponent → (E (+ | -)? digits)?
num → digits optional_fraction optional_exponent
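The definition of num translates almost symbol for symbol into Python's re syntax. The variable and function names below are assumptions; the only notational changes are that the dot must be escaped and [+-] stands for (+ | -).

```python
import re

DIGITS = r"[0-9]+"
OPTIONAL_FRACTION = rf"(?:\.{DIGITS})?"
OPTIONAL_EXPONENT = rf"(?:E[+-]?{DIGITS})?"
NUM = re.compile(DIGITS + OPTIONAL_FRACTION + OPTIONAL_EXPONENT)


def is_unsigned_number(s):
    """True if the whole string s matches the pattern for num."""
    return NUM.fullmatch(s) is not None
```

The four sample strings 5280, 39.37, 6.33E4, and 1.894E-4 are all accepted, while strings such as 1. or .5 are rejected because a fraction needs digits on both sides of the point.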
Notational Shorthands (Cont.)
3. Character class:
– [abc] = a | b | c
– [a-z] = a | b | … | z
– id → [A-Za-z][A-Za-z0-9]*
Nonregular Sets
• Some languages cannot be described by any regular
expression.
• Regular expressions cannot describe balanced or
nested constructs.
• The set of all strings of balanced parentheses cannot
be described by a regular expression, but it can be
specified by a context-free grammar.
• Repeating strings cannot be described by regular
expressions or by context-free grammars.
– {wcw | w is a string of a’s and b’s}
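One way to see why balanced parentheses are out of reach for regular expressions is that recognizing them requires counting to an arbitrary depth, which a fixed, finite set of states cannot do. The sketch below uses a single unbounded counter; the function name balanced is an assumption.

```python
def balanced(s):
    """True if s is a string of balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ")" with no matching "("
                return False
        else:
            return False        # only parentheses are allowed
    return depth == 0
```

The counter depth can grow without bound as the nesting deepens, which is exactly the memory a finite automaton lacks.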
Nonregular Sets (Cont.)
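• Regular expressions can denote only a fixed
number of repetitions or an unspecified number of
repetitions of a construct. Two arbitrary numbers cannot be
compared to see whether they are the same.
– Example: Hollerith strings nHa1a2…an in early Fortran, where
the count n must equal the number of characters that follow H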
3.4 Recognition of Tokens
• Consider the following grammar fragment:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
Recognition of Tokens (Cont.)
• The regular definitions for tokens are as follows:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
num → digit+ (. digit+)? (E (+ | -)? digit+)?
delim → blank | tab | newline
ws → delim+
Regular-expression Patterns
for Tokens
Transition Diagrams
• A lexical analyzer uses a transition diagram to
keep track of information about characters
that are seen as the forward pointer scans the
input.
• Positions in a transition diagram are drawn as
circles and are called states. The states are
connected by arrows, called edges. A double
circle indicates an accepting state, a state in
which a token is found. A * on a state indicates that
input retraction must take place.
Transition Diagrams for >=
• Start state: state 0 in the diagram.
• If the input character is >, go to state 6.
• other refers to any character that is not indicated
by any of the other edges leaving a state s.
Transition Diagrams for
Relational Operators
• Each accepting state returns a pair (token, attribute-value).
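A transition diagram for the relational operators <, <=, =, <>, >, >= can be coded as a loop over states. In this sketch only states 0 and 6 are taken from the slides; the other state numbers, the attribute names (LT, LE, EQ, NE, GT, GE), and the function name relop are assumptions following the usual textbook presentation.

```python
def relop(text, pos=0):
    """Return ((token, attribute), new_pos), or None if no relop starts at pos."""
    state = 0
    forward = pos
    while True:
        # "" stands for the end of input and takes the "other" edge.
        c = text[forward] if forward < len(text) else ""
        forward += 1
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), forward        # accepting state 5
            elif c == ">":
                state = 6
            else:
                return None                            # the diagram fails
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), forward        # accepting state 2
            if c == ">":
                return ("relop", "NE"), forward        # accepting state 3
            return ("relop", "LT"), forward - 1        # state 4*: retract
        elif state == 6:
            if c == "=":
                return ("relop", "GE"), forward        # accepting state 7
            return ("relop", "GT"), forward - 1        # state 8*: retract
```

In the starred states the character read on the other edge is not part of the lexeme, so the forward pointer is moved back by one before returning.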
Transition Diagrams for
Identifiers and Keywords
• gettoken( ): looks up the lexeme in the symbol table and
returns its token (id, if, then, …)
• install_id( ): returns 0 if the lexeme is a keyword, or a pointer
to the symbol-table entry if it is an id
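The two routines can be sketched around a symbol table that is preloaded with the keywords. The class name SymbolTable, the list-plus-dictionary layout, and the use of a list index as the "pointer" are assumptions for illustration; slot 0 is left unused so that 0 can mean "keyword".

```python
KEYWORDS = ("if", "then", "else")


class SymbolTable:
    def __init__(self):
        self.entries = [None]      # slot 0 is reserved and never used
        self.index = {}            # lexeme -> position in entries
        for kw in KEYWORDS:
            self._add(kw, kw)      # for a keyword, the token is the lexeme

    def _add(self, lexeme, token):
        self.entries.append({"lexeme": lexeme, "token": token})
        self.index[lexeme] = len(self.entries) - 1
        return self.index[lexeme]

    def install_id(self, lexeme):
        """Return 0 for a keyword, else the index of the entry for this id."""
        pos = self.index.get(lexeme)
        if pos is None:
            return self._add(lexeme, "id")     # first occurrence: new entry
        if self.entries[pos]["token"] != "id":
            return 0                           # keyword: no attribute
        return pos

    def gettoken(self, lexeme):
        """Return the token for the lexeme: a keyword token or id."""
        pos = self.index.get(lexeme)
        return self.entries[pos]["token"] if pos is not None else "id"
```

Preloading the keywords means the identifier diagram can recognize them too, and the table lookup decides afterwards whether the lexeme was a keyword or an ordinary identifier.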
Transition Diagrams for
Unsigned Numbers
• install_num( ) is called in the accepting state of each of the three diagrams.
• The order in which the diagrams are tried matters. Ex. 12.3E4 ?
Transition Diagrams for
White Space
Following Transition Diagrams
• Transition diagrams are followed one by
one, trying to determine the next token to
be returned.
• If failure occurs while we are following one
transition diagram, we retract the forward
pointer to where it was in the start state of
this diagram, and activate the next transition
diagram.
Following Transition Diagrams
(Cont.)
• If failure occurs in all transition diagrams, then a
lexical error has been detected and we invoke an
error-recovery routine.
• It is better to look for frequently occurring tokens
before less frequently occurring ones, because a
transition diagram is reached only after we fail on
all earlier transition diagrams.
• Since white space is expected to occur frequently,
we should put the transition diagram for white
space near the beginning.
Implementing Transition Diagrams
• A sequence of transition diagrams can be
converted into a program to look for tokens.
• Each state gets a segment of code.
Implementing Transition Diagrams
(Cont.)
• state and start record the current state and the
start state of current transition diagram.
• lexical_value is assigned the pointer returned by
install_id( ) and install_num( ) when an identifier
or number is found.
• When a diagram fails, the function fail( ) is used
to retract the forward pointer to the position of the
lexeme beginning pointer and to return the start
state of the next diagram. If all diagrams fail the
function fail( ) calls an error-recovery routine.
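The control flow around fail( ) can be sketched as follows. To keep the example short, each transition diagram is stood in for by a compiled regular expression rather than hand-coded states; that substitution, the names make_diagram, next_token, scan, and LexicalError, and the particular order of the diagrams are assumptions of this sketch.

```python
import re


class LexicalError(Exception):
    """Raised when every diagram fails at the current position."""


def make_diagram(token, pattern):
    """Stand-in for one hand-coded transition diagram."""
    rx = re.compile(pattern)

    def run(text, start):
        m = rx.match(text, start)
        return (token, m.end()) if m else None

    return run


# Tried in this order. White space comes first because it is frequent;
# the number diagrams go from the longest pattern to the shortest.
DIAGRAMS = [
    make_diagram("ws", r"[ \t\n]+"),
    make_diagram("id", r"[A-Za-z][A-Za-z0-9]*"),
    make_diagram("relop", r"<=|<>|>=|<|>|="),
    make_diagram("num", r"[0-9]+(?:\.[0-9]+)?E[+-]?[0-9]+"),
    make_diagram("num", r"[0-9]+\.[0-9]+"),
    make_diagram("num", r"[0-9]+"),
]


def next_token(text, lexeme_beginning):
    """Return (token, lexeme, new_position) for the token at lexeme_beginning."""
    for diagram in DIAGRAMS:
        # On failure of the previous diagram, the forward pointer is
        # retracted to the lexeme beginning before the next one starts.
        forward = lexeme_beginning
        result = diagram(text, forward)
        if result is not None:
            token, forward = result
            return token, text[lexeme_beginning:forward], forward
    # All diagrams failed: a real analyzer would call error recovery here.
    raise LexicalError(f"no token at position {lexeme_beginning}")


def scan(text):
    """Return the (token, lexeme) pairs in text, skipping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        token, lexeme, pos = next_token(text, pos)
        if token != "ws":
            tokens.append((token, lexeme))
    return tokens
```

Because the exponent diagram is tried before the shorter number diagrams, 12.3E4 is returned as a single num rather than being cut off after 12 or 12.3.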
Implementing Transition Diagrams
(Cont.)
• The routine that reads the next character returns the
character pointed to by the forward pointer and then
advances the forward pointer.