
• Two issues in lexical analysis
– Specifying tokens (regular expression)
– Identifying tokens specified by regular
expression.
• How to recognize tokens specified by
regular expressions?
– A recognizer for a language is a program that takes a
string x as input and answers “yes” if x is a sentence of
the language and “no” otherwise.
• In the context of lexical analysis, given a string and a regular
expression, a recognizer of the language specified by the
regular expression answers “yes” if the string is in the language.
• A regular expression can be compiled into a recognizer
(automatically) by constructing a finite automaton, which can be
deterministic or non-deterministic.
• Non-deterministic finite automata
(NFA)
– A non-deterministic finite automaton (NFA) is a mathematical
model that consists of a 5-tuple (Q, Σ, δ, q0, F):
• a set of states Q
• a set of input symbols Σ
• a transition function δ that maps state-symbol pairs to sets of
states
• a state q0 that is distinguished as the start (initial) state
• a set of states F distinguished as accepting (final) states.
– An NFA accepts an input string x if and only if there is some path
in the transition graph from the start state to some accepting state
whose edge labels spell out x.
– Show an NFA example (page 116, Figure 3.21).
• An NFA is non-deterministic in that (1) the same
character can label two or more transitions out of
one state, and (2) the empty string can label transitions.
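• For example, here is an NFA that recognizes the
language ???.
[Figure: transition graph with states 0, 1, 2, 3. State 0 loops to itself on a and on b; 0 goes to 1 on a; 1 goes to 2 on b; 2 goes to 3 on b.]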
• An NFA can easily be implemented using a transition
table.
State   a        b
0       {0, 1}   {0}
1       -        {2}
2       -        {3}
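The transition table above can be written down directly as a nested dictionary. The following is a minimal sketch, not a definitive implementation; the names `NFA_TABLE` and `move` are chosen here for illustration, and a missing dictionary entry stands in for the "-" cells of the table.

```python
# Transition table for the example NFA: state -> {symbol -> set of next states}.
# A missing entry plays the role of "-" in the table (no transition).
NFA_TABLE = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
}


def move(states, symbol):
    """Return every state reachable from `states` by one transition on `symbol`."""
    result = set()
    for s in states:
        result |= NFA_TABLE.get(s, {}).get(symbol, set())
    return result


if __name__ == "__main__":
    print(sorted(move({0}, "a")))     # [0, 1]
    print(sorted(move({0, 1}, "b")))  # [0, 2]
```

Because each table cell is a set, one lookup can return several next states, which is exactly what makes the automaton non-deterministic.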
• The algorithm that recognizes the language
accepted by an NFA.
– Input: an NFA (transition table) and a string x (terminated by eof).
– Output: “yes” if accepted, “no” otherwise.
S := e-closure({s0});
a := nextchar;
while a != eof do begin
    S := e-closure(move(S, a));
    a := nextchar;
end
if (intersect(S, F) != empty) then return “yes”
else return “no”
Note: e-closure(S) is the set of states that can be reached from states in S
through transitions labeled by the empty string (including the states in S themselves).
move(S, a) is the set of states that can be reached from states in S by one transition labeled a.
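The recognition loop above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the empty string `""` is used as the label for empty-string transitions, the NFA is a nested dictionary built from the earlier transition table, and state 3 is assumed to be the only accepting state, which the slides do not state explicitly.

```python
EPS = ""  # label for empty-string transitions (an assumption of this sketch)


def e_closure(states, table):
    """States reachable from `states` using only EPS transitions (including `states`)."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in table.get(s, {}).get(EPS, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(states, symbol, table):
    """States reachable from `states` by one transition labeled `symbol`."""
    result = set()
    for s in states:
        result |= table.get(s, {}).get(symbol, set())
    return result


def accepts(table, start, finals, x):
    """Track the set of possible current states while reading x."""
    current = e_closure({start}, table)
    for ch in x:
        current = e_closure(move(current, ch, table), table)
    return bool(current & finals)


# Example NFA from the transition table; accepting set {3} is assumed.
TABLE = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}, 2: {"b": {3}}}

if __name__ == "__main__":
    print(accepts(TABLE, 0, {3}, "ababb"))  # True
    print(accepts(TABLE, 0, {3}, "abab"))   # False
```

On ababb the tracked state set goes {0}, {0, 1}, {0, 2}, {0, 1}, {0, 2}, {0, 3}, and the final set contains state 3.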
– Example: recognizing ababb from previous NFA
– Example2: Use the example in Fig. 3.27 for recognizing ababb
Space complexity O(|S|), time complexity O(|S|^2|x|)??
• Construct an NFA from a regular expression:
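– Input: A regular expression r over an alphabet Σ
– Output: An NFA N accepting L(r)
– Algorithm (3.3, page 122):
• For ε, construct the NFA with a start state i, an accepting state f, and a single edge from i to f labeled ε.
• For a in Σ, construct the NFA with a start state i, an accepting state f, and a single edge from i to f labeled a.
• Let N(s) and N(t) be NFAs for regular expressions s and t:
– For s|t, construct the NFA N(s|t): a new start state has ε-edges to the start states of N(s) and N(t), and the accepting states of N(s) and N(t) have ε-edges to a new accepting state.
– For st, construct the NFA N(st): the accepting state of N(s) is merged with the start state of N(t); the start state of N(s) becomes the start state and the accepting state of N(t) becomes the accepting state.
– For s*, construct the NFA N(s*): a new start state i and a new accepting state f are added, with ε-edges from i to the start state of N(s), from i to f, from the accepting state of N(s) back to its start state, and from the accepting state of N(s) to f.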
• Example: r = (a|b)*abb.
• Example: using algorithm 3.3 to construct
N( r ) for r = (ab | a)*b* | b.
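The construction can be sketched with one small Python function per rule. This is a hedged illustration, not the textbook's code: every name (`symbol`, `alt`, `cat`, `star`, `edges`) is invented here, a fragment is just a (start, accept) pair, and `cat` links the two fragments with an ε-edge instead of merging states, which accepts the same language. Each fragment should be used only once, since its states are shared.

```python
import itertools
from collections import defaultdict

EPS = None                 # label for empty-string edges (assumption of this sketch)
_ids = itertools.count()   # fresh state numbers
edges = defaultdict(set)   # (state, label) -> set of next states


def symbol(a):
    """NFA for a single symbol a: i --a--> f."""
    i, f = next(_ids), next(_ids)
    edges[(i, a)].add(f)
    return i, f


def alt(s, t):
    """NFA for s|t: new start and accepting states joined by EPS edges."""
    i, f = next(_ids), next(_ids)
    edges[(i, EPS)] |= {s[0], t[0]}
    edges[(s[1], EPS)].add(f)
    edges[(t[1], EPS)].add(f)
    return i, f


def cat(s, t):
    """NFA for st: EPS edge from the accept of s to the start of t."""
    edges[(s[1], EPS)].add(t[0])
    return s[0], t[1]


def star(s):
    """NFA for s*: skip edge, loop-back edge, and new start/accept states."""
    i, f = next(_ids), next(_ids)
    edges[(i, EPS)] |= {s[0], f}
    edges[(s[1], EPS)] |= {s[0], f}
    return i, f


def closure(states):
    """States reachable through EPS edges only."""
    seen, stack = set(states), list(states)
    while stack:
        for t in edges.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(frag, x):
    """Simulate the constructed NFA on x by tracking the set of current states."""
    current = closure({frag[0]})
    for ch in x:
        nxt = set()
        for s in current:
            nxt |= edges.get((s, ch), set())
        current = closure(nxt)
    return frag[1] in current


# r = (a|b)*abb, built bottom-up from the rules above.
R = cat(cat(cat(star(alt(symbol("a"), symbol("b"))), symbol("a")), symbol("b")), symbol("b"))

if __name__ == "__main__":
    print(accepts(R, "ababb"))  # True
    print(accepts(R, "abab"))   # False
```

Building the automaton with combinators avoids writing a regular-expression parser, which keeps the sketch focused on the construction rules themselves.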
• Using an NFA, we can recognize a token in
O(|S|^2|x|) time. We can improve the time
complexity by using a deterministic finite
automaton (DFA) instead of an NFA.
– An NFA is deterministic (a DFA) if
• it has no transitions on the empty string, and
• for each state S and input symbol a, there is at
most one edge labeled a leaving S.
– What is the time complexity to recognize a
token when a DFA is used?
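As a starting point for the question above, here is a minimal sketch of DFA simulation. The table is a hand-written DFA for strings over {a, b} that end in abb; it is an assumption of this sketch, not a figure from the slides. The point to notice is that the loop keeps a single current state and does one table lookup per input character.

```python
# Hand-written DFA for strings over {a, b} ending in "abb" (assumed for illustration).
# (state, symbol) -> next state; at most one entry per pair, and no empty-string edges.
DFA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START = 0
FINALS = {3}


def dfa_accepts(x):
    """Follow exactly one transition per character; reject on a missing entry."""
    state = START
    for ch in x:
        if (state, ch) not in DFA:
            return False
        state = DFA[(state, ch)]
    return state in FINALS


if __name__ == "__main__":
    print(dfa_accepts("ababb"))  # True
    print(dfa_accepts("abab"))   # False
```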
• Algorithm to convert an NFA to a DFA that accepts the
same language (algorithm 3.2, page 118)
initially e-closure(s0) is the only state in Dstates and it is unmarked
while there is an unmarked state T in Dstates do begin
mark T;
for each input symbol a do begin
U := e-closure(move(T, a));
if (U is not in Dstates) then
add U as an unmarked state to Dstates;
Dtran[T, a] := U;
end
end;
Initial state = e-closure(s0), Final state = ?
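The conversion loop above can be sketched in Python with frozensets standing in for DFA states. This is an illustrative sketch under assumptions: the NFA is a nested dictionary, `""` labels empty-string transitions, and the names `subset_construction`, `dstates`, and `dtran` are chosen to mirror the pseudocode. It deliberately computes only Dstates and Dtran.

```python
EPS = ""  # label for empty-string transitions (an assumption of this sketch)


def e_closure(states, nfa):
    """States reachable from `states` using only EPS transitions."""
    closure, stack = set(states), list(states)
    while stack:
        for t in nfa.get(stack.pop(), {}).get(EPS, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(states, symbol, nfa):
    """States reachable from `states` by one transition labeled `symbol`."""
    result = set()
    for s in states:
        result |= nfa.get(s, {}).get(symbol, set())
    return result


def subset_construction(nfa, start, alphabet):
    """Return (initial DFA state, Dstates, Dtran); each DFA state is a frozenset of NFA states."""
    initial = frozenset(e_closure({start}, nfa))
    dstates = {initial}
    unmarked = [initial]          # states still waiting to be processed
    dtran = {}
    while unmarked:
        T = unmarked.pop()        # popping T marks it
        for a in alphabet:
            U = frozenset(e_closure(move(T, a, nfa), nfa))
            if U not in dstates:  # U may be the empty set, which acts as a dead state
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return initial, dstates, dtran


# Example NFA from the earlier transition table.
NFA = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}, 2: {"b": {3}}}

if __name__ == "__main__":
    initial, dstates, dtran = subset_construction(NFA, 0, "ab")
    for (T, a), U in sorted(dtran.items(), key=lambda kv: (sorted(kv[0][0]), kv[0][1])):
        print(sorted(T), a, "->", sorted(U))
```

Frozensets are used because DFA states must be hashable to serve as members of `dstates` and as keys of `dtran`.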
• Example: page 120, fig 3.27.
• Question:
– For an NFA with |S| states, at most how many states can its
corresponding DFA have?