CS308-slides02

CS308 Compiler Principles
Lexical Analyzer
Fan Wu
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Fall 2012
Lexical Analyzer
• The lexical analyzer reads the source program
character by character to produce tokens.
– strips out comments and whitespace
– returns a token when the parser asks for one
– correlates error messages with the source
program
Compiler Principles
Token
• A token is a pair of a token name and an optional
attribute value.
– Token name specifies the pattern of the token
– Attribute stores the lexeme of the token
• Tokens
– Keyword: “begin”, “if”, “else”, …
– Identifier: string of letters or digits, starting with a letter
– Integer: a non-empty string of digits
– Punctuation symbol: “,”, “;”, “(”, “)”, …
• Regular expressions are widely used to specify
patterns of the tokens.
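As a rough illustration of how regular expressions can specify the token classes above, here is a minimal Python sketch. The token names, the keyword list, and the `tokenize` helper are assumptions made for this example, not part of the course material.

```python
import re

# Hypothetical token specification: (token name, pattern).
# Keywords come before identifiers so that they win when both match.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:begin|if|else)\b"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("INTEGER", r"[0-9]+"),
    ("PUNCTUATION", r"[,;()]"),
    ("SKIP", r"\s+"),  # whitespace is stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token name, lexeme) pairs for `text`."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Each returned pair mirrors the slide's definition of a token: a token name plus the lexeme as its attribute.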
Token Example
Terminology of Languages
• Alphabet: a finite set of symbols
– ASCII
– Unicode
• String: a finite sequence of symbols on an alphabet
– ε is the empty string
– |s| is the length of string s
– Concatenation: xy represents x followed by y
– Exponentiation: s^n = s s s … s (n times), s^0 = ε
• Language: a set of strings over some fixed alphabet
– ∅, the empty set, is a language
– The set of well-formed C programs is a language
Operations on Languages
• Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• (Kleene) Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ … (the union of L^i over all i ≥ 0)
• Positive Closure: L+ = L^1 ∪ L^2 ∪ L^3 ∪ … (the union of L^i over all i ≥ 1)
Example
• L1 = {a,b,c,d}
L2 = {1,2}
• L1 ∪ L2 = {a,b,c,d,1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1* = all strings using letters a,b,c,d including
the empty string
• L1+ = all strings using letters a,b,c,d without
the empty string
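The operations in this example can be tried out on finite sets of strings. The sketch below is only an illustration: the helper names are made up here, and since L* is infinite, `closure_up_to` builds just the union of the powers up to a chosen bound.

```python
def union(l1, l2):
    return l1 | l2

def concat(l1, l2):
    return {x + y for x in l1 for y in l2}

def power(lang, n):
    """L^n, with L^0 = {ε} (the empty string is written "")."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result

def closure_up_to(lang, max_power, positive=False):
    """Finite approximation of L* (or L+ when positive=True)."""
    first = 1 if positive else 0
    out = set()
    for i in range(first, max_power + 1):
        out |= power(lang, i)
    return out

L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
```

With these sets, `concat(L1, L2)` yields the eight two-character strings listed above, and the empty string appears in `closure_up_to(L1, 2)` but not in `closure_up_to(L1, 2, positive=True)`.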
Regular Expressions
• A regular expression is a representation of a
language that can be built from the operators
applied to the symbols of some alphabet.
• A regular expression is built up of smaller
regular expressions (using defining rules).
• Each regular expression r denotes a
language L(r).
• A language denoted by a regular expression
is called a regular set.
Regular Expressions (Rules)
Regular expressions over alphabet Σ

Reg. Expr          Language it denotes
ε                  L(ε) = {ε}
a ∈ Σ              L(a) = {a}
(r1) | (r2)        L(r1) ∪ L(r2)
(r1) (r2)          L(r1) L(r2)
(r)*               (L(r))*
(r)                L(r)

Extension
(r)+ = (r)(r)*     (L(r))+            one or more instances
(r)? = (r) | ε     L(r) ∪ {ε}         zero or one instance
[a1-an]            L(a1|a2|…|an)      character class
Regular Expressions (cont.)
• We may remove parentheses by using
precedence rules:
– * highest
– concatenation second highest
– | lowest
• (a(b)*)|(c) → ab*|c
• Example:
– Σ = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => {ε, 0, 00, 000, 0000, …}
– (0|1)* => all strings with 0 and 1, including the empty
string
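The precedence rules can be checked experimentally. Python's `re` module happens to use the same precedence for `*`, concatenation, and `|`, so the sketch below enumerates short strings and compares the languages denoted by a fully parenthesized expression and its abbreviated form. The helper `language_up_to` is an assumption of this example.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len that the pattern matches in full."""
    rx = re.compile(pattern)
    words = ("".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return {w for w in words if rx.fullmatch(w)}

with_parens = language_up_to("(a(b)*)|(c)", "abc", 3)
without_parens = language_up_to("ab*|c", "abc", 3)
```

Both sets contain the same strings, which is what the precedence rules predict. This only compares strings up to a length bound, so it is a sanity check rather than a proof of equivalence.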
Regular Definitions
• We can give names to regular expressions, and
use these names as symbols to define other
regular expressions.
• A regular definition is a sequence of
definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where each di is a new, distinct symbol and each
ri is a regular expression over the symbols
in Σ ∪ {d1,d2,...,di-1}, that is, the alphabet plus
the previously defined symbols
Regular Definitions Example
• Example: Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit)*
– If we try to write the regular expression
representing identifiers without using regular
definitions, that regular expression will be
complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*
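The same idea carries over to code: named sub-patterns can be composed into larger ones. The following Python sketch mirrors the letter/digit/id definitions; the variable names and the `is_identifier` helper are illustrative assumptions.

```python
import re

# Each name is defined only in terms of the alphabet and earlier names.
letter = "[A-Za-z]"
digit = "[0-9]"
ident = f"{letter}(?:{letter}|{digit})*"

ID = re.compile(ident)

def is_identifier(s):
    """True if the whole string s matches the id definition."""
    return ID.fullmatch(s) is not None
```

For instance, `is_identifier("x1")` holds while `is_identifier("1x")` does not, because an identifier must start with a letter.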
Grammar
Regular Definitions
Transition Diagram
• State: represents a condition that could
occur during scanning
– start/initial state:
– accepting/final state: lexeme found
– intermediate state:
• Edge: directs from one state to another,
labeled with one or a set of symbols
Transition Diagram for relop
Transition diagram for relop → < | > | <= | >= | = | <>
Transition-Diagram-Based Lexical Analyzer
Implementation of relop transition diagram
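The slide's implementation is not reproduced in this text, so the following is only a sketch of what a transition-diagram-based recognizer for relop might look like in Python. The state numbers, the attribute names (LT, LE, NE, GT, GE, EQ), and the function name are assumptions made for this example.

```python
def get_relop(text, pos=0):
    """Recognize a relop starting at text[pos].

    Returns (attribute, next_pos), or None if no relop starts there.
    """
    state = 0
    while True:
        c = text[pos] if pos < len(text) else ""  # "" stands for end of input
        if state == 0:  # start state
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", pos + 1)
            elif c == ">":
                state = 6
            else:
                return None
            pos += 1
        elif state == 1:  # seen "<"
            if c == "=":
                return ("LE", pos + 1)
            if c == ">":
                return ("NE", pos + 1)
            return ("LT", pos)  # retract: c is not part of the lexeme
        else:  # state 6: seen ">"
            if c == "=":
                return ("GE", pos + 1)
            return ("GT", pos)  # retract
```

The "retract" cases correspond to accepting states that are reached only after reading one character too many, so the returned position does not consume that character.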
Transition Diagram for Others
A transition diagram for id's and keywords
A transition diagram for unsigned numbers
Practice
• Draw the transition diagram for recognizing
the following regular expression
a(a|b)*a
[Figure: two candidate transition diagrams, each with states 1, 2, 3 and edges labeled a and b]
Finite Automata
• A finite automaton is a recognizer that takes a
string, and answers “yes” if the string matches a
pattern of a specified language, and “no”
otherwise.
• Two kinds:
– Nondeterministic finite automaton (NFA)
• no restriction on the labels of their edges
– Deterministic finite automaton (DFA)
• for each state and each input symbol, exactly one edge
labeled with that symbol leaves the state
• NFAs and DFAs have the same capability: both recognize
exactly the regular languages
• Either an NFA or a DFA may be used as a lexical analyzer
Nondeterministic Finite Automaton (NFA)
• An NFA consists of:
– S: a set of states
– Σ: a set of input symbols (alphabet)
– A transition function: maps state-symbol pairs to sets of
states
– s0: a start (initial) state
– F: a set of accepting states (final states)
• An NFA can be represented by a transition graph
• It accepts a string x if and only if there is a path from
the start state to one of the accepting states such that
the edge labels along this path spell out x.
• Remarks
– The same symbol can label edges from one state to
several different states
– An edge may be labeled by ε, the empty string
NFA Example (1)
The language recognized by this NFA is (a|b)*ab
NFA Example (2)
NFA accepting aa*|bb*
Implementing an NFA
S ← ε-closure({s0})          { set of all states accessible from s0 by ε-transitions }
c ← nextchar()
while (c != eof) do
begin
  S ← ε-closure(move(S,c))   { set of all states accessible from a state in S by a transition on c }
  c ← nextchar()
end
if (S ∩ F != ∅) then         { if S contains an accepting state }
  return “yes”
else
  return “no”
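The NFA simulation above can be sketched in Python as follows. The transition table encodes an NFA for aa*|bb*; its state numbering and the use of the empty string as the ε label are assumptions of this sketch, since the slide's diagram is not available in this text.

```python
EPS = ""  # label used here for ε-transitions

# Hypothetical state numbering for an NFA accepting aa*|bb*.
DELTA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}
START, ACCEPTING = 0, {2, 4}

def eps_closure(states, delta):
    """All states reachable from `states` using only ε-transitions."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, c, delta):
    """All states reachable from a state in `states` by one transition on c."""
    return {t for s in states for t in delta.get((s, c), ())}

def nfa_accepts(text, delta=DELTA, start=START, accepting=ACCEPTING):
    S = eps_closure({start}, delta)
    for c in text:
        S = eps_closure(move(S, c, delta), delta)
    return bool(S & accepting)  # S ∩ F != ∅
```

The loop keeps a whole set of current states, which is why simulating an NFA is slower than simulating a DFA.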
Deterministic Finite Automaton (DFA)
• A Deterministic Finite Automaton (DFA) is
a special form of an NFA.
– No state has an ε-transition
– For each symbol a and state s, there is at
most one edge labeled a leaving s.
The language recognized by this DFA is also (a|b)*ab
Implementing a DFA
s ← s0                 { start from the initial state }
c ← nextchar()         { get the next character from the input string }
while (c != eof) do    { do until the end of the string }
begin
  s ← move(s,c)        { transition function }
  c ← nextchar()
end
if (s in F) then       { if s is an accepting state }
  return “yes”
else
  return “no”
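A DFA simulation is a single table lookup per input character. The Python sketch below uses a hand-written DFA for (a|b)*ab; the state numbers and the treatment of a missing table entry as rejection are assumptions of this example.

```python
# Hypothetical DFA for (a|b)*ab: state 1 means "just saw a",
# state 2 means "just saw ab".
DTRAN = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
START, ACCEPTING = 0, {2}

def dfa_accepts(text, dtran=DTRAN, start=START, accepting=ACCEPTING):
    s = start
    for c in text:
        if (s, c) not in dtran:
            return False  # no transition on c: reject
        s = dtran[(s, c)]
    return s in accepting
```

Compared with the NFA simulation, only one current state is tracked, so the running time is proportional to the input length.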
NFA vs. DFA
       Compactness   Readability   Speed
NFA    Good          Good          Slow
DFA    Bad           Bad           Fast
• DFAs are widely used to build lexical analyzers.
[Figures: an NFA and a DFA recognizing 0*1*, and an NFA and a DFA recognizing (a|b)*ab]
Test Yourself
1) What languages are represented by the two FAs?
[Figures (a) and (b) are not recoverable from this text: (a) has states 1 to 9 with edges labeled 0 and 1; (b) has states 1 to 5 with edges labeled a]
2) For a language only accepting characters from {0,1},
please design a DFA which represents all strings containing
three ‘0’s.
Regular Expression → NFA
• McNaughton-Yamada-Thompson (MYT)
construction
– Simple and systematic
– Guarantees the resulting NFA will have
exactly one final state, and one start state.
– Construction starts from the simplest parts
(alphabet symbols).
– For a complex regular expression, subexpressions are combined to create its NFA.
MYT Construction
• Basic rules: for subexpressions with no
operators
– For expression ε: a start state i with an ε-transition to the accepting state f
– For a symbol a in the alphabet Σ: a start state i with a transition on a to the accepting state f
MYT Construction Cont’d
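• Inductive rules: for constructing larger
NFAs from the NFAs of subexpressions
(Let N(r1) and N(r2) denote NFAs for regular
expressions r1 and r2, respectively)
– For regular expression r1 | r2: add a new start state i and a new accepting state f; add ε-transitions from i to the start states of N(r1) and N(r2), and from the accepting states of N(r1) and N(r2) to f
MYT Construction Cont’d
– For regular expression r1r2: the start state of N(r1) becomes the start state i and the accepting state of N(r2) becomes the accepting state f; the accepting state of N(r1) is merged with the start state of N(r2)
– For regular expression r*: add a new start state i and a new accepting state f; add ε-transitions from i to the start state of N(r), from the accepting state of N(r) to f, from the accepting state of N(r) back to its start state, and from i directly to f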
Example: (a|b)*a
[Figure: step-by-step construction of the NFAs for a, b, (a|b), (a|b)*, and (a|b)*a]
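The construction for (a|b)*a can be sketched in Python as below. This is an illustration under several assumptions: the regular expression is given as a hand-built syntax tree of nested tuples rather than parsed from text, the empty string stands for ε, and concatenation links the two fragments with an ε-transition instead of merging states, which deviates slightly from the rule on the slides.

```python
from itertools import count

EPS = ""  # label used here for ε-transitions

def thompson(node, delta, fresh):
    """Add an NFA fragment for `node` to `delta`; return (start, accept)."""
    def edge(src, label, dst):
        delta.setdefault((src, label), set()).add(dst)

    kind = node[0]
    if kind in ("sym", "eps"):           # basic rules
        i, f = next(fresh), next(fresh)
        edge(i, node[1] if kind == "sym" else EPS, f)
        return i, f
    if kind == "or":                     # r1 | r2
        i1, f1 = thompson(node[1], delta, fresh)
        i2, f2 = thompson(node[2], delta, fresh)
        i, f = next(fresh), next(fresh)
        edge(i, EPS, i1); edge(i, EPS, i2)
        edge(f1, EPS, f); edge(f2, EPS, f)
        return i, f
    if kind == "cat":                    # r1 r2 (ε link instead of merging)
        i1, f1 = thompson(node[1], delta, fresh)
        i2, f2 = thompson(node[2], delta, fresh)
        edge(f1, EPS, i2)
        return i1, f2
    if kind == "star":                   # r*
        i1, f1 = thompson(node[1], delta, fresh)
        i, f = next(fresh), next(fresh)
        edge(i, EPS, i1); edge(i, EPS, f)
        edge(f1, EPS, i1); edge(f1, EPS, f)
        return i, f
    raise ValueError(f"unknown node kind: {kind}")

def accepts(text, delta, start, accept):
    """Simulate the NFA by tracking the set of current states."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in delta.get((s, EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    current = closure({start})
    for c in text:
        current = closure({t for s in current for t in delta.get((s, c), ())})
    return accept in current

# Syntax tree for (a|b)*a
TREE = ("cat", ("star", ("or", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
DELTA = {}
START, ACCEPT = thompson(TREE, DELTA, count())
```

The resulting NFA has a single start state and a single accepting state, and each construction step adds at most two new states.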
Properties of the Constructed NFA
1. N(r) has at most twice as many states as there
are operators and operands in r.
– This bound follows from the fact that each step of
the algorithm creates at most two new states.
2. N(r) has one start state and one accepting
state. The accepting state has no outgoing
transitions, and the start state has no incoming
transitions.
3. Each state of N(r) other than the accepting
state has either one outgoing transition on a
symbol in Σ or two outgoing transitions, both on
ε.
Conversion of an NFA to a DFA
• Approach: Subset Construction
– each state of the constructed DFA corresponds to
a set of NFA states
• Details
① Create transition table Dtran for the DFA
② Insert ε-closure(s0) into Dstates as the initial state
③ Pick an unvisited state T in Dstates
④ For each input symbol a, create state
ε-closure(move(T, a)), and add it to Dstates (if it is not already there) and to
Dtran
⑤ Repeat steps ③ and ④ until all states in
Dstates are visited
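The steps above can be sketched in Python as follows. The example NFA is a small ε-free NFA for (a|b)*abb with made-up state numbers, and the sketch omits the empty set (dead state) from the resulting table; both choices are assumptions of this example.

```python
EPS = ""  # label used here for ε-transitions

# Hypothetical NFA for (a|b)*abb; state 3 is accepting.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

def eps_closure(states, delta):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, a, delta):
    return {t for s in states for t in delta.get((s, a), ())}

def subset_construction(delta, start, alphabet):
    d0 = frozenset(eps_closure({start}, delta))   # step 2
    dstates = [d0]
    dtran = {}                                    # step 1
    unvisited = [d0]
    while unvisited:                              # step 5
        T = unvisited.pop()                       # step 3
        for a in alphabet:                        # step 4
            U = frozenset(eps_closure(move(T, a, delta), delta))
            if not U:
                continue  # no transition on a; dead state left out
            if U not in dstates:
                dstates.append(U)
                unvisited.append(U)
            dtran[(T, a)] = U
    return d0, dstates, dtran

D0, DSTATES, DTRAN = subset_construction(NFA, 0, "ab")
```

A DFA state is accepting when its set contains an accepting NFA state, here any set containing 3.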
The Subset Construction
NFA to DFA Example
NFA for (a|b)*abb
Transition table for DFA
Equivalent DFA
Regular Expression → DFA
• First, augment the given regular expression
by concatenating a special symbol #
r → r#
augmented regular expression
• Second, create a syntax tree for the
augmented regular expression.
– All leaves are alphabet symbols (plus # and the
empty string)
– All inner nodes are operators
• Third, number each alphabet symbol (plus #)
(position numbers)
Regular Expression → DFA Cont’d
(a|b)*a → (a|b)*a#
augmented regular expression
Syntax tree of (a|b)*a# (positions in parentheses):
concat
├─ concat
│  ├─ *
│  │  └─ |
│  │     ├─ a (1)
│  │     └─ b (2)
│  └─ a (3)
└─ # (4)
• each symbol is at a leaf
• each symbol is numbered (positions)
• inner nodes are operators
followpos
Then we define the function followpos for the positions (positions
assigned to leaves).
followpos(i) -- the set of positions which can follow
the position i in the strings generated by
the augmented regular expression.
Example:
( a | b ) * a #
  1   2     3 4
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
followpos() is defined only for leaves;
it is not defined for inner nodes.
firstpos, lastpos, nullable
• To compute followpos, we need three more
functions defined for the nodes (not just for
leaves) of the syntax tree.
– firstpos(n) -- the set of the positions of the first
symbols of strings generated by the subexpression rooted at n.
– lastpos(n) -- the set of the positions of the last
symbols of strings generated by the subexpression rooted at n.
– nullable(n) -- true if the empty string is a
member of strings generated by the subexpression rooted at n; false otherwise
Usage of the Functions
(a|b)*a → (a|b)*a#
augmented regular expression
In the syntax tree of (a|b)*a#, m is the star node for (a|b)* and n is the concatenation node for (a|b)*a:
nullable(n) = false
nullable(m) = true
firstpos(n) = {1, 2, 3}
lastpos(n) = {3}
Computing nullable, firstpos, lastpos
n: leaf labeled ε
  nullable(n) = true
  firstpos(n) = ∅
  lastpos(n) = ∅
n: leaf labeled with position i
  nullable(n) = false
  firstpos(n) = {i}
  lastpos(n) = {i}
n: or-node c1 | c2
  nullable(n) = nullable(c1) or nullable(c2)
  firstpos(n) = firstpos(c1) ∪ firstpos(c2)
  lastpos(n) = lastpos(c1) ∪ lastpos(c2)
n: concatenation-node c1 c2
  nullable(n) = nullable(c1) and nullable(c2)
  firstpos(n) = if (nullable(c1)) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
  lastpos(n) = if (nullable(c2)) lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
n: star-node (c1)*
  nullable(n) = true
  firstpos(n) = firstpos(c1)
  lastpos(n) = lastpos(c1)
How to evaluate followpos
• Two rules define the function followpos:
1. If n is a concatenation-node with left child c1 and
right child c2, and i is a position in lastpos(c1),
then all positions in firstpos(c2) are in
followpos(i).
2. If n is a star-node, and i is a position in
lastpos(n), then all positions in firstpos(n) are
in followpos(i).
• If firstpos and lastpos have been computed
for each node, followpos of each position
can be computed by making one depth-first
traversal of the syntax tree.
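A single depth-first traversal can compute nullable, firstpos, and lastpos and apply the two followpos rules at the same time. The Python sketch below assumes the syntax tree is given as nested tuples (`("leaf", symbol, position)`, `("eps",)`, `("or", c1, c2)`, `("cat", c1, c2)`, `("star", c1)`); this representation and the function name are choices made for this example.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) for node and fill in followpos."""
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "leaf":
        _, _symbol, pos = node
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                      # rule 1
            followpos[i] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f1, l1 = analyze(node[1], followpos)
        for i in l1:                      # rule 2
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind: {kind}")

# Syntax tree of (a|b)*a# with positions 1..4
TREE = ("cat",
        ("cat",
         ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
         ("leaf", "a", 3)),
        ("leaf", "#", 4))
FOLLOWPOS = {}
NULLABLE, FIRSTPOS, LASTPOS = analyze(TREE, FOLLOWPOS)
```

For this tree the computed followpos sets agree with the values given earlier for (a|b)*a#.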
Example -- (a|b)*a#
Annotated syntax tree (firstpos to the left of each node, lastpos to the right; red and blue in the original figure):
{1,2,3} concat {4}                     (root)
  {1,2,3} concat {3}
    {1,2} * {1,2}
      {1,2} | {1,2}
        {1} a {1}                      (position 1)
        {2} b {2}                      (position 2)
    {3} a {3}                          (position 3)
  {4} # {4}                            (position 4)
Then we can calculate followpos
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
• After we calculate follow positions, we are ready to create
DFA for the regular expression.
Algorithm (RE → DFA)
1. Create the syntax tree of (r) #
2. Calculate nullable, firstpos, lastpos, followpos
3. Put firstpos(root) into the states of DFA as an unmarked state.
4. while (there is an unmarked state S in the states of DFA) do
– mark S
– for each input symbol a do
• let s1,...,sn be the positions in S whose symbol is a
• S’ ← followpos(s1) ∪ ... ∪ followpos(sn)
• Dtran[S,a] ← S’
• if (S’ is not in the states of DFA)
– put S’ into the states of DFA as an unmarked state.
• the start state of DFA is firstpos(root)
• the accepting states of DFA are all states containing the position of #
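Given followpos, the marking loop above is short to write down. The Python sketch below takes the followpos table and firstpos(root) for (a|b)*a# as precomputed inputs. The function name, the `symbol_at` mapping from positions to symbols, and the decision to skip empty target sets are assumptions of this example.

```python
def build_dfa(firstpos_root, followpos, symbol_at, end_pos):
    """Build (start, states, dtran, accepting) from followpos."""
    start = frozenset(firstpos_root)
    states = {start}
    unmarked = [start]
    dtran = {}
    alphabet = sorted(set(symbol_at.values()) - {"#"})
    while unmarked:
        S = unmarked.pop()            # mark S
        for a in alphabet:
            target = set()
            for p in S:
                if symbol_at[p] == a:
                    target |= followpos[p]
            if not target:
                continue              # no position in S carries a
            T = frozenset(target)
            dtran[(S, a)] = T
            if T not in states:
                states.add(T)
                unmarked.append(T)
    accepting = {S for S in states if end_pos in S}
    return start, states, dtran, accepting

# Inputs for (a|b)*a#
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: set()}
START, STATES, DTRAN, ACCEPTING = build_dfa({1, 2, 3}, FOLLOWPOS, SYMBOL_AT, 4)
```

Here `START` corresponds to S1 = {1,2,3} and the only other state is S2 = {1,2,3,4}, which is accepting because it contains the position of #.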
Example -- (a|b)*a#
Positions: a=1, b=2, a=3, #=4
followpos(1)={1,2,3}
followpos(2)={1,2,3}
followpos(3)={4}
followpos(4)={}
S1=firstpos(root)={1,2,3}
⇓ mark S1
a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2      Dtran[S1,a]=S2
b: followpos(2)={1,2,3}=S1                        Dtran[S1,b]=S1
⇓ mark S2
a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2      Dtran[S2,a]=S2
b: followpos(2)={1,2,3}=S1                        Dtran[S2,b]=S1
start state: S1
accepting states: {S2}
[Figure: DFA with states S1 and S2; S1 loops on b and goes to S2 on a; S2 loops on a and goes to S1 on b]
Example -- (a|ε)bc*#
Positions: a=1, b=2, c=3, #=4
followpos(1)={2}
followpos(2)={3,4}
followpos(3)={3,4}
followpos(4)={}
S1=firstpos(root)={1,2}
⇓ mark S1
a: followpos(1)={2}=S2          Dtran[S1,a]=S2
b: followpos(2)={3,4}=S3        Dtran[S1,b]=S3
⇓ mark S2
b: followpos(2)={3,4}=S3        Dtran[S2,b]=S3
⇓ mark S3
c: followpos(3)={3,4}=S3        Dtran[S3,c]=S3
start state: S1
accepting states: {S3}
[Figure: DFA with states S1, S2, S3; S1 goes to S2 on a and to S3 on b; S2 goes to S3 on b; S3 loops on c]
Minimizing Number of DFA States
• For any regular language, there is always a unique
minimum state DFA, which can be constructed from
any DFA of the language.
• Algorithm:
– Partition the set of states into two groups:
• G1 : set of accepting states
• G2 : set of non-accepting states
– For each new group G
• partition G into subgroups such that states s1 and s2 are in the
same subgroup iff
for all input symbols a, states s1 and s2 have transitions to states
in the same group.
– Repeat until no group can be partitioned further.
– Start state of the minimized DFA is the group containing
the start state of the original DFA.
– Accepting states of the minimized DFA are the groups
containing the accepting states of the original DFA.
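The partition refinement can be sketched in Python as follows. The example DFA is a five-state DFA for (a|b)*abb with states named A to E; it is an illustration chosen for this sketch and is not one of the examples on the slides.

```python
# Hypothetical DFA for (a|b)*abb; E is the only accepting state.
DTRAN = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}

def minimize(states, alphabet, dtran, accepting):
    """Return the final partition of `states` as a list of sets."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        def group_of(s):
            for idx, g in enumerate(partition):
                if s in g:
                    return idx
            return None  # missing transition

        new_partition = []
        for g in partition:
            # States stay together iff they go to the same groups on every symbol.
            buckets = {}
            for s in g:
                key = tuple(group_of(dtran.get((s, a))) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):
            return new_partition  # no group was split
        partition = new_partition

GROUPS = minimize("ABCDE", "ab", DTRAN, {"E"})
```

In this example A and C end up in the same group, so the minimized DFA has four states instead of five.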
Minimizing DFA – Example (1)
[Figure: original DFA with states 1, 2, 3 and edges labeled a and b]
G1 = {2}
G2 = {1,3}
G2 cannot be partitioned because
Dtran[1,a]=2    Dtran[1,b]=3
Dtran[3,a]=2    Dtran[3,b]=3
So, the minimized DFA (with minimum states) is
[Figure: minimized DFA with states 1 and 2 and edges labeled a and b]
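Minimizing DFA – Example (2)
[Figure: original DFA with states 1, 2, 3, 4 and edges labeled a and b]
Transitions of states 1, 2, 3:
a       b
1->2    1->3
2->2    2->3
3->4    3->3
Groups:
{1,2,3} {4}
{1,2} {3} {4}
no more partitioning
Minimized DFA
[Figure: minimized DFA with states 1, 2, 3 and edges labeled a and b]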
Architecture of A Lexical Analyzer
An NFA for Lex program
• Create an NFA for each
regular expression
• Combine all the NFAs into
one
• Introduce a new start
state
• Connect it with ε-transitions to the start
states of the NFAs
Pattern Matching with NFA
① The lexical analyzer reads
the input and calculates the set
of states it is in at each
symbol.
② Eventually, it reaches a point
with no next state.
③ It looks backwards in the
sequence of sets of
states, until it finds a set
including one or more
accepting states.
④ It picks the one associated
with the earliest pattern in
the list from the Lex
program.
⑤ It performs the associated
action of the pattern.
Pattern Matching with NFA -- Example
Input: aaba
Report pattern: a*b+
Pattern Matching with DFA
① Convert the NFA for all the
patterns into an equivalent
DFA. For each DFA state
that contains more than one
accepting NFA state,
choose the pattern that is
defined earliest as the output
of the DFA state.
② Simulate the DFA until
there is no next state.
③ Trace back to the nearest
accepting DFA state, and
perform the associated
action.
Input: abba
DFA states visited: 0137, 247, 58, 68
Report pattern abb
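The two tie-breaking rules (longest match first, then earliest pattern) can be illustrated without building the automaton by trying each pattern with Python's `re` module. This is a stand-in for the NFA/DFA simulation, not the method on the slides. The pattern list is an assumption: `abb` and `a*b+` appear in the examples, while `a` is added here for illustration. For these simple patterns a greedy `re` match is also the longest match, which is not guaranteed for arbitrary patterns.

```python
import re

# Patterns in the order they would be listed in the Lex program (assumed).
PATTERNS = ["a", "abb", "a*b+"]

def next_lexeme(text, pos=0):
    """Return (pattern, lexeme) for the longest match at pos, or None.

    Ties are resolved in favor of the pattern listed first.
    """
    best = None
    for pattern in PATTERNS:
        m = re.compile(pattern).match(text, pos)
        if m and len(m.group()) > 0:
            # Strictly longer matches replace the current best,
            # so an earlier pattern wins a tie.
            if best is None or len(m.group()) > len(best[1]):
                best = (pattern, m.group())
    return best
```

On input aaba the longest match is aab for a*b+, and on input abba both abb and a*b+ match abb, so the earlier pattern abb is reported.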
Summary
• How lexical analysers work
– Convert REs to NFA
– Convert NFA to DFA
– Minimise DFA
– Use the minimised DFA to recognise tokens
in the input
– Use priorities, longest matching rule
Homework
• Exercise 3.7.1 (c)
• Exercise 3.7.3 (c)
• Exercise 3.9.4 (a)
• Due date: Sept. 29, 2012 (Saturday)