• Two issues in lexical analysis
  – Specifying tokens (regular expressions)
  – Identifying tokens specified by regular expressions
• How to recognize tokens specified by regular expressions?
  – A recognizer for a language is a program that takes a string x as input and answers “yes” if x is a sentence of the language and “no” otherwise.
• In the context of lexical analysis, given a string and a regular expression, a recognizer of the language specified by the regular expression answers “yes” if the string is in the language.
• A regular expression can be compiled into a recognizer (automatically) by constructing a finite automaton, which can be deterministic or non-deterministic.
• Non-deterministic finite automata (NFA)
  – A non-deterministic finite automaton (NFA) is a mathematical model that consists of a 5-tuple (Q, Σ, δ, q0, F):
    • a set of states Q
    • a set of input symbols Σ
    • a transition function δ that maps state-symbol pairs to sets of states
    • a state q0 that is distinguished as the start (initial) state
    • a set of states F distinguished as accepting (final) states
  – An NFA accepts an input string x if and only if there is some path in the transition graph from the start state to some accepting state whose edge labels spell out x.
  – Show an NFA example (page 116, Figure 3.21).
• An NFA is non-deterministic in that (1) the same character can label two or more transitions out of one state, and (2) the empty string can label transitions.
• For example, here is an NFA that recognizes the language ???.
  [Figure: transition graph with states 0, 1, 2, 3; its edges are the ones listed in the transition table below]
• An NFA can easily be implemented using a transition table.

    State |   0    |  1  |  2
    ------+--------+-----+-----
    a     | {0, 1} |  -  |  -
    b     | {0}    | {2} | {3}

• The algorithm that recognizes the language accepted by an NFA.
  – Input: an NFA (transition table) and a string x (terminated by eof).
  – Output: “yes” if accepted, “no” otherwise.
    S := e-closure({s0});
    a := nextchar;
    while a != eof do begin
        S := e-closure(move(S, a));
        a := nextchar;
    end
    if (intersect(S, F) != empty) then return “yes”
    else return “no”

  Note: e-closure(S) is the set of states that can be reached from the states in S through transitions labeled by the empty string; move(S, a) is the set of states reachable from the states in S by one transition labeled a.
  – Example: recognizing ababb with the previous NFA
  – Example 2: use the example in Fig. 3.27 for recognizing ababb
  Space complexity O(|S|), time complexity O(|S|^2|x|)??
• Construct an NFA from a regular expression:
  – Input: a regular expression r over an alphabet Σ
  – Output: an NFA N accepting L(r)
  – Algorithm (3.3, page 122):
    • For ε, construct the NFA [figure: NFA for ε]
    • For a in Σ, construct the NFA [figure: NFA with a single transition labeled a]
    • Let N(s) and N(t) be NFAs for regular expressions s and t:
      – For s|t, construct the NFA N(s|t): [figure: built from N(s) and N(t)]
      – For st, construct the NFA N(st): [figure: built from N(s) followed by N(t)]
      – For s*, construct the NFA N(s*): [figure: built from N(s)]
• Example: r = (a|b)*abb.
• Example: using algorithm 3.3 to construct N(r) for r = (ab | a)*b* | b.
• Using an NFA, we can recognize a token in O(|S|^2|x|) time. We can improve the time complexity by using a deterministic finite automaton (DFA) instead of an NFA.
  – An NFA is deterministic (a DFA) if
    • there are no transitions on the empty string
    • for each state s and input symbol a, there is at most one edge labeled a leaving s.
  – What is the time complexity to recognize a token when a DFA is used?
• Algorithm to convert an NFA to a DFA that accepts the same language (algorithm 3.2, page 118)

    initially e-closure(s0) is the only state in Dstates and it is unmarked
    while there is an unmarked state T in Dstates do begin
        mark T;
        for each input symbol a do begin
            U := e-closure(move(T, a));
            if (U is not in Dstates) then
                add U as an unmarked state to Dstates;
            Dtran[T, a] := U;
        end
    end;

  Initial state = e-closure(s0), Final state = ?
• Example: page 120, fig 3.27.
• Question:
  – For an NFA with |S| states, at most how many states can its corresponding DFA have?
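The two algorithms in these notes (simulating an NFA on a string, and converting an NFA to a DFA) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the NFA is the one from the transition table in the notes, state 3 is assumed to be its accepting state (the notes do not say so explicitly), the empty-string label is represented by `""`, and the names `NFA_TRAN`, `e_closure`, `move`, `nfa_accepts`, `nfa_to_dfa` and `dfa_accepts` are illustrative choices, not names from the textbook. The sketch also takes a DFA state to be accepting when it contains at least one accepting NFA state, which is the usual convention for the subset construction.

```python
from collections import deque

EPS = ""  # label used in this sketch for empty-string transitions (assumption)

# Transition table from the notes: state -> {symbol -> set of next states}.
NFA_TRAN = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START = 0
ACCEPTING = {3}  # assumed: the notes do not name the accepting state


def e_closure(states, tran):
    """States reachable from `states` using only empty-string transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in tran.get(s, {}).get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, symbol, tran):
    """States reachable from `states` by one transition labeled `symbol`."""
    result = set()
    for s in states:
        result |= tran.get(s, {}).get(symbol, set())
    return result


def nfa_accepts(x, tran=NFA_TRAN, start=START, accepting=ACCEPTING):
    """Simulate the NFA on x by tracking the set of current states."""
    current = e_closure({start}, tran)
    for a in x:  # the end of the Python string plays the role of eof
        current = e_closure(move(current, a, tran), tran)
    return bool(current & accepting)


def nfa_to_dfa(tran, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d_start = e_closure({start}, tran)
    dstates = {d_start}
    unmarked = deque([d_start])
    dtran = {}
    while unmarked:
        T = unmarked.popleft()  # taking T off the work list "marks" it
        for a in alphabet:
            U = e_closure(move(T, a, tran), tran)
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    # Assumed convention: accepting if some NFA accepting state is included.
    d_accepting = {T for T in dstates if T & accepting}
    return dtran, d_start, d_accepting


def dfa_accepts(x, dtran, d_start, d_accepting):
    """Run the DFA: one table lookup per input character."""
    state = d_start
    for a in x:
        state = dtran[(state, a)]
    return state in d_accepting
```

For instance, `nfa_accepts("ababb")` returns `True`, and `nfa_to_dfa(NFA_TRAN, START, ACCEPTING, "ab")` produces a DFA with four states for this NFA. Note that `dfa_accepts` does a single dictionary lookup per character, while `nfa_accepts` recomputes a set of states at every step.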