The Three Faces of Regular Languages September 25, 2001 1 Announcements OH’s (this week only) Today 4-5:15 (instead of Thursday) None on Tue/Wed (he22ld Wed’s OH yesterday) HW 2 delayed by one week Now due Tuesday, 10/2 Anticipate covering all necessary material for HW2 today Can download egrep for PC’s (included with Cygwin). See the course links page. HW 3 is still due Thursday, 10/4! Will try my best not to delay this due date. Greg Estren will give Thursday’s lecture 2 Agenda NFA’s Formal Definition Closure under regular operations Union () Concatenation () Kleene-* Kleene-+ Determinization Regular Expressions (REX’s) Formal Definition Simulation by NFA’s Simulation of FA’s Conclusion: FA’s NFA’s REX’s 3 NFA’s -Informally The static picture of an NFA is as a graph whose edges are labeled by S and by e (together called Se) and with start vertex q0 and accept states F. EG: 1 e e 1 0 0,1 0,1 0 1 0 Q: Why is this any different for deterministic FA’s? Isn’t an FA just a labeled graph in the same way? 4 NFA’s -Informally A: FA’s are labeled graphs. However, determinism gives an extra constraint on the form that the graphs can take. Specifically, d must be a function. Graph theoretically this means that every vertex has exactly one edge of a given label sticking out of it. (Of course, e’s cannot appear either.) Summarizing: any labeled graph you can come up with is an NFA, as long as it only has one start state. Later, even this restriction will be dispensed with. 5 NFA Examples M1: M2: e M3: M4: e 0 0,1 e 1 1 6 NFA’s. Formal Definition. DEF: A nondeterministic finite automaton (NFA) is encapsulated by M = (Q, S, d, q0, F ) in the same way as an FA, except that d has different domain and codomain. δ : Q Σ ε P (Q ) Here, P (Q ) is the power set of Q so that d(q,a) is the set of all endpoints of edges from q which are labeled by a. 7 NFA’s. Formal Definition. M1: M2: q0 M4: q0 1 M3: q0 q1 0 q2 e q0 0,1 e q1 q2 q3 e 1 Q: What are d(q0,0), d(q0,1), d(q0,e) in each of M1, M2, M3 and in M4.? 8 NFA’s. Formal Definition. M1: M2: q0 M4: M1: M2: M3: M4: q0 1 M3: q0 q1 0 q2 e q0 0,1 e q1 q2 q3 e 1 d(q0,0) = d(q0,1) = d(q0,e) = Same d(q0,0) = d(q0,1) = , d(q0,e) = {q1,q2} d(q0,0)={q1,q3}, d(q0,1)={q2,q3}, d(q0,e)= 9 Formal Definition of an NFA: Dynamic Just as with FA’s, there is an implicit auxiliary tape containing the input string which is operated on by the NFA. As opposed to FA’s, NFA’s are parallel machines –able to be in several states at any given instant. The NFA reads the tape from left to right with each new character causing the NFA to go into another set of states. When the string is completely read, the string is accepted depending on whether the NFA’s final configuration contains an accept state. 10 Formal Definition of an NFA: Dynamic DEF: A string u is accepted by an NFA M iff there exists a path starting at q0 which is labeled by u and ends in an accept state. The language accepted by M is the set of all strings which are accepted by M and is denoted by L (M ). Notes: 1. In computing the label of a path, you should delete all e’s. 2. The only difference in acceptance for NFA’s vs. FA’s are the words “there exists”. In FA’s the path always exists and is unique. 11 NFA’s. Accepted String Examples. M4: q1 0 q0 1 q2 0,1 q3 e 1 Q: Which of the following strings is accepted? 1. e 2. 0 3. 1 4. 0111 12 NFA’s. Accepted String Examples. q1 0 A: q0 q2 1 0,1 q3 e 1 1. e is rejected. No path labeled by empty string from start state to an accept state. q1 2. 0 is accepted. EG the path q0 0 q3 3. 1 is accepted. EG the path q0 1 4. 0111 is accepted. There is only one accepted path: q0 q3 q2 q3 q2 q3 q2 q3 0 1 ε 1 ε 1 ε 13 NFA’s. Regular Operations. Schematic. The red circle stands for the start state q0, the green portion represents the accept states F, and the other states are gray 14 NFA’s. Schematic Union. The union AB is formed by putting the automata in parallel. Create a new start state and connect it to the former start states using e-edges: 15 Union Example L = {x has even length} {x ends with 11} a e b 0,1 0,1 c 1 d 0 0 e 1 e f 1 0 16 The Game of e-DETERMINIZE Rules (only large font differs from DETERMINIZE): 1. 2. 3. 4. 5. 6. The NFA is made up of a team The referee calls out the input string Each member of the team is responsible for knowing who they ‘pass the buck’ to. If necessary, turn the buck into referee (in case transition undefined), or spread the buck around to several teammates (if multiple choices). If you are an e-source and you’re holding a buck, make sure that all your e-targets also hold a buck before referee calls out next letter. Each member of the team is responsible for knowing themselves (know if you are in F ) At the end of input string, any accept states holding the buck should call out: “ACCEPT” If no one says ACCEPT, input was rejected. 17 NFA’s. Schematic Concatenation. The concatenation A B is formed by putting the automata in serial. The start state comes from A while the accept states come from B. The A’s accept states are turned off and connected via e-edges to B ’s start state: 18 Concatenation Example L = {x has even length} {x ends with 11} 0 0,1 0,1 c d 0 b e 1 1 e f 1 0 19 NFA’s. Schematic Kleene-+. The Kleene-+ A+ is formed by creating a feedback loop. The accept states connect to the start state via e-edges: 20 Kleene-+ Example L={ = { x is a streak of 1 or more 1’s followed by a streak of 2 or more 0’s x starts with 1, ends with 0, and alternates between 1 or more consecutive 1’s and 2 or more consecutive 0’s 1 c }+ 1 d } 0 0 e 0 f e 21 NFA’s. Schematic Kleene-*. The construction follows from Kleene-+ construction using the fact that A* is the union of A+ with the empty string. Just create Kleene-+ and add a new start accept state connecting to old start state with an e-edge: 22 Kleene-* Example L={ = { x is a streak of 1 or more 1’s followed by a streak of 2 or more 0’s }* x starts with 1, ends with 0, and alternates between 1 or more consecutive 1’s and 2 or more consecutive 0’s, OR x is e } a e c 1 1 d 0 0 e 0 f e 23 Closure of NFA’s under Regular Operations The constructions above all show that NFA’s are constructively closed under the regular operations. THM 1: If L1 and L2 are accepted by NFA’s, then so are L1 L2 , L1 L2, L1+ and L1*. In fact, the accepting NFA’s can be constructed algorithmically in linear time. This is almost what we want. If we can show that all NFA’s can be converted into DFA’s this will show that DFA’s –and hence regular languages– are closed under the regular operations. 24 Regular Expressions We are already familiar with the regular operations. Regular expressions give a way of symbolizing a sequence of regular operations, and therefore a way of generating new languages from old. For example, to generate the finite language {banana,nab}* from the atomic languages {a},{b} and {n} we could do the following: ( ({b}{a}{n}{a}{n}{a})({n}{a}{b}) )* Regular expressions specify the same in a more compact form: (banananab)* 25 Regular Expressions DEF: The set of regular expressions over an alphabet S and the languages in S* which they generate are defined recursively: Base Cases: Each symbol a S as well as the symbols e and are regular expressions: a generates the atomic language L(a ) = { a } e generates the language L( e ) = { e } generates the empty language L() = { } = Inductive Cases: if r1 and r2 are regular expressions so are r1r2, (r1)(r2), (r1)* and (r1)+: L(r1r2) = L(r1)L(r2), so r1r2 generates the union L((r1)(r2)) = L(r1)L(r2), so (r1)(r2) is the concatenation L((r1)*) = L(r1) *, so (r1)* represents the Kleene-* L((r1)+) = L (r1)+:, so (r1)+ represents the Kleene-+ 26 Regular Expressions. Table of Operations including UNIX Operation Notation Language UNIX Union r1r2 L(r1)L(r2) r1|r2 Concatenation (r1)(r2) L(r1)L(r2) (r1)(r2) Kleene-* (r )* L(r )* (r )* Kleene-+ (r )+ L(r )+ (r )+ Exponentiation (r )n L(r )n (r ){n} 27 Regular Expressions. Simplifications. Just as algebraic formulas can be simplified by using less parentheses when the order of operations is clear, regular expressions can be simplified. Using the pure definition of regular expressions to express the language {banana,nab}* we would be forced to write something nasty like: ((((b)(a))(n))(((a)(n))(a))(((n)(a))(b)))* Using the operator precedence ordering *, , and the associativity of allows us to obtain the simpler: (banananab)* This is done in the same way as one would simplify the algebraic expression with re-ordering disallowed: ((((b)(a))(n))(((a)(n))(a))+(((n)(a))(b)))4=(banana+nab)4 28 Regular Expressions. Example. Q: Find a regular expression that generates the language consisting of all bit-strings which contain a streak of seven 0’s or contain two disjoint streaks of three 1’s. Note: A streak of nine 0’s is valid because it contains a sub-streak of seven 0’s. EG: Legal: 010000000011010, 01110111001, 111111 Illegal: 11011010101, 10011111001010 29 Regular Expressions. Example. A: (01)*(0713(01)*13)(01)* An even briefer valid answer is: S*(0713S*13)S* The official answer using only the standard regular operations is: (01)*(0000000111(01)*111)(01)* A brief UNIX answer is: (0|1)*(0{7}|1{3}(0|1)*1{3})(0|1)* 30 Regular Expressions. A Language in their Own Right. Regular expressions are just strings. Consequently, the set of regular expressions is a set of strings, so by definition is a language. Q: Supposing that only union, concatenation and Kleene-* are considered. What is the alphabet for the language of regular expressions over the base alphabet S ? 31 Regular Expressions. A Language in their Own Right. A: S { (, ), , *} 32 REX NFA Since NFA’s are closed under the regular operations we immediately get: THM 2: Given any regular expression r there is an NFA N which simulates r. I.e. the language accepted by N is precisely the language generated by r so that L(N ) = L(r ). Furthermore, the NFA is constructible in linear time. 33 REX NFA Proof. The proof works by induction, using the recursive definition of regular expressions. First we need to show how to accept the base case regular expressions aS, e and . These are respectively accepted by the NFA’s: q0 a q1 q0 q0 34 REX NFA Finally, we need to show how to inductively accept regular expressions formed by using the regular operations. These are just the constructions that we saw before, encapsulated by: 35 REX NFA. Example. Q: Find an NFA for the regular expression (01)*(0000000111(01)*111)(01)* the previous exercise. 36 REX NFA. Example. (01)*(0000000111(01)*111)(01)* A: See how the NFA was constructed. 37 REX NFA FA The fact that regular expressions can be converted into NFA’s means that it makes sense to call the languages accepted by NFA’s “regular”. However, the regular languages were defined to be the languages accepted by FA’s, which are by default, deterministic. It would be nice if NFA’s could be determinized and converted to FA’s, for then the definition of “regular” languages, as being FA-accepted would be justified. Let’s try this next. 38 Determinizing NFA’s Q: Determinize the left-most state of: f 0,1 e 0,1 d 0,1 0,1 a 0 c 0,1 1 b 39 Determinizing NFA’s A: No problem. Under-deterministic crashes can be fixed by adding a fail sink-state: f 0,1 e 0,1 d 0,1 0,1 0,1 0,1 a 0 c 0,1 1 b hell Q: Now do the same for the start state. 40 Determinizing NFA’s A: No such luck!!! In general, overdeterminism is much harder to fix. Different sequences of letters create different possible combinations of active states. To determinize, will need to take these combinations into account. INSPIRATION: The game of DETERMINIZE (and e-DETERMINIZE) gives the intuition for how this works! 41 Determinizing NFA’s Recall that in the game, we kept track of active states as the input was being called out. If at the end of the input, one of the active states happened to be an accept state, the input was accepted. Graph theoretically, what the game does is keep track of all endpoints of all possible paths of the prefixes of the input string. 42 Determinizing NFA’s For example, consider the following NFA which reads the input 11000. 0 0,1 0,1 1 1 1 1 0 0 0 0 43 Determinizing NFA’s 0 0,1 0,1 1 1 1 1 0 0 0 0 44 Determinizing NFA’s 0 0,1 0,1 1 1 1 1 0 0 0 0 45 Determinizing NFA’s 0 0,1 0,1 1 1 1 1 0 0 0 0 46 Determinizing NFA’s 0 0,1 0,1 1 1 1 1 0 0 0 0 47 Determinizing NFA’s 0 0,1 0,1 1 1 1 1 0 0 0 0 Accepted! 48 Determinizing NFA’s Recall that in the Cartesian product construction, we kept track of two parallel automata by using spliced states. Let’s try to modify this approach for determinizing. 0 0,1 0,1 1 1 1 1 0 0 0 0 Q: How can we remember all the active states in a computation by a single state? 49 Determinizing NFA’s A: Keep track of all active states by thinking of each set of states as a state in its own right. 0 0,1 0,1 1 1 1 1 0 0 0 0 EG: If the right-most states are q2 and q3 here the active “state” is {q2, q3 }. 50 Determinizing NFA’s We’re now ready for the formal process of determinizing NFA’s. Sipser gives a direct method for creating a DFA from an arbitrary NFA. I prefer breaking the process up into two stages: 1. DE-EPSILONIZE: Get rid of the e-edges by using a transitive-closure like procedure. 2. SUBSET CONSTRUCTION: Convert an NFA without e’s into a DFA by considering sets of states. You can use my method or Sipser’s. 51 Determinizing NFA’s Let’s explain these constructions in reverse: 1. SUBSET CONSTRUCTION 2. DE-EPSILONIZE 52 Subset Construction Given an NFA N = (Q, S, d, q0, F ) without epsilons, the determinizor of N is M(N ) = (Q’, S, d’, q0’, F’ ) where The state set of M is the power set of Q. I.e., Q’ = P (Q ) = {all subsets of Q } Each accept state of M consists of a subset which contains an accept state. I.e. F’ = {S Q | SF } The start state of M is the singleton q0’ = {q0 } d’ requires further thought: 53 Subset Construction. Delta. Consider the previous example. About to read 1: 0,1 q0 Done reading 1: q1 q2 0 0,1 1 1 0 0 0,1 0,1 1 1 0 Q: Why was d’({q1,q2 },1) = {q2,q3 } ? 54 q3 Subset Construction. Delta. q1 Why d’({q1,q2 },1) = {q2,q3 } ? q0 A: If we think of the 0,1 0,1 DETERMINIZE game, 1 Each state must remember 1 who they should pass the buck to. The set of 0,1 active states at the end is 0,1 1 the conglomerate of all the 1 active states created by each member. In other words, d’({q1,q2 },1) = d(q1,1) d(q2,1) = {q1,q2 } q2 0 0 0 0 55 q3 Subset Construction. Delta. In general we have: δ’(S , a) = δ(q, a) = {q' Q q S , q' δ(q, a)} qS This completes the formal definition of the Subset Construction. 56 Subset Construction. In Practice. Let’s actually determinize a particular NFA to see how this works. Start with: Q1: What’s the accepted language? Q2: How many states does the subset construction create in this case? 57 Subset Construction. In Practice. A1: L = {x{a,b}* | 3rd bit of x from right is a} A2: 16 = 24 states. That’s a lot of states. Would be nice if only had to construct useful states, I.e. those that can be reached from start state. 58 Subset Construction. In Practice. We can in fact, construct only the states that we need. Starting from the start state, perform a breadth first search (or depth first search) while determinizing! First state will be {1}: 59 Subset Construction. In Practice. Start with {1}: 60 Subset Construction. In Practice. Branch out. Notice that d(1,a) = {1,2}. 61 Subset Construction. In Practice. Branch out. Notice that d’({1,2},a) = {1,2,3}. 62 Subset Construction. In Practice. Branch out. Note that d’({1,2,3},a) = {1,2,3,4} 63 Subset Construction. In Practice. Continue 64 Subset Construction. In Practice. 65 Subset Construction. In Practice. 66 Subset Construction. In Practice. 67 Subset Construction. In Practice. 68 Subset Construction. In Practice. Summarizing: Therefore, we saved 50% effort by not constructing all possible states unthinkingly. 69 Determinizing NFA’s We finished with the first half of the determinization procedure: SUBSET CONSTRUCTION DE-EPSILONIZE Now let’s show how to remove epsilons. 70 De-Epsilonization De-epsilonization works in three phases: 1. Keep adding any shortcuts made possible by the e-edges until all missing shortcuts have been filled in. 2. Possibly make the start state accept. 3. Remove the epsilon edges. 71 De-Epsilonization. Phase (1) Fill in Missing Shortcuts There are three kinds of shortcuts: 1. 2. 3. u a w v e u w e v a u w e v e unless u=w 72 De-Epsilonization. Phase (1) Fill in Missing Shortcuts There are three kinds of shortcuts: 1. 2. 3. u a a v w e u w e v a u w e v e unless u=w 73 De-Epsilonization. Phase (1) Fill in Missing Shortcuts There are three kinds of shortcuts: 1. u a 2. w e 3. a w v a e v a w u w e v e unless u=w 74 De-Epsilonization. Phase (1) Fill in Missing Shortcuts There are three kinds of shortcuts: 1. u a 2. w e 3. u e a w v a e v e a v e w w unless u=w in which case don’t want e-loop 75 De-Epsilonization. Phase (1) Example. 0 b 0,e 1,e a e c 1 76 De-Epsilonization. Phase (1) Example. 0 b 0,e 1,e 0,e a 1,e,0 0,1,e e,1 c 1 77 De-Epsilonization. Phase (1) Example. 0,1 b 0,e 1,e,0 0,e,1 a 0,1 1,e,0 0,1,e e,1 c 0,1 78 De-Epsilonization. Phase (1) Example. 0,1 b 0,e,1 1,e,0 0,e,1 a 0,1 1,e,0 0,1,e e,1,0 c 0,1 79 De-Epsilonization. Phase (1) At this point, since all the shortcuts have been filled, will be able to simulate all non-empty strings without needing epsilons. For example. Supposing beforehand that a path accepting the string 011 needed e-insertions in such a way that its graph-label was ee0ee1e1 Filling in the shortcuts of type 3 implies that the new graph will have a short-cut path labeled by e0e1e1, and in turn, filling in the shortcuts of type 2 implies that this path has the shortcut 011, which is the de-epsilonized path that we’re looking for. 80 De-Epsilonization. Phase (2) Possibly Make Start Accept On the other hand, no amount of shortcutting will help us accept the null string e, if we throw out all e-edges, unless the start state is made into an accept state. Therefore, before deleting the e-edges, make sure that: If an e-edge exists from q0 into F, reset F = F {q0 } 81 De-Epsilonization. Phase (2) Possibly Make Start Accept Made start accept because Of the e label of (a,c): 0,1 0,1 b 0,e,1 1,e,0 0,e,1 a 1,e,0 0,1,e e,1,0 c 0,1 82 De-Epsilonization. Phase (3) Remove epsilons 0,1 b 0,1 1,0 0,1 Finished! 1,0 0,1 a 0,1 1,0 c 0,1 83 REX NFA FA In summary. Starting from any NFA, we can use DE-EPSILONIZE to find an equivalent NFA without epsilons and then use the Subset Construction to produce a deterministic FA accepting the same language. This implies the following two results: 84 REX NFA FA THM 3: If L is any language accepted by an NFA, then there exists a constructible FA which also accepts L. COR: The class of regular languages is closed under the regular operations. Proof : Since NFA’s are closed under regular operations (THM 1) and FA’s are by default also NFA’s, we can apply the regular operations to any FA’s and determinize at the end (THM 3) to obtain an FA accepting the language defined by the regular operations.• 85 REX NFA FA REX … We are one step away from showing that FA’s NFA’s REX’s; i.e., all three representation are equivalent. We will be done when we can complete the circle of transformations: NFA FA REX 86 REX NFA FA REX … In fact, we can add more information to this picture as FA’s are automatically NFA’s: NFA FA REX 87 REX NFA FA REX … We’ll show next how to convert NFA’s into regular expressions, which in particular, shows how to convert FA’s into regular expressions and therefore completes the circle of equivalences: NFA FA REX 88 REX NFA FA REX … In converting NFA’s to REX’s we’ll introduce the most generalized notion of an automaton, the so called “Generalized NFA” or “GNFA”. In converting into REX’s, we’ll first go through a GNFA: NFA FA GNFA REX 89 GNFA’s DEF: A generalized nondeterministic finite automaton (GNFA) is a graph whose edges are labeled by regular expressions, with a unique start state with in-degree 0, and a unique accept state with out-degree 0. A string u is said to label a path in a GNFA, if it is an element of the language generated by the regular expression which is gotten by concatenating all labels of edges traversed in the path. The language accepted by a GNFA consists of all the accepted strings of the GNFA. 90 GNFA’s. Example. 000 b 0e a (01101001)* c This is a GNFA because edges are labeled by REX’s, start state has no in-edges, and the unique accept state has no out-edges. Q: Is 000000100101100110 accepted? 91 GNFA’s. Example. 000 b 0e (01101001)* a c A: 000000100101100110 is accepted. The path given by a b b b c ε 000 000 100101100110 proves this fact. (The last label results from 100101100110 (01101001)* ) 92 NFA REX The NFA to REX conversion process is summarized by: 1. Construct a GNFA from the NFA. A. B. C. Unify multilabel edges using “” Create a start state with in-degree 0 Create a unique accepter of out-degree 0 2. Rip out interior states and modify edge labels until arrive at automaton that looks like: r accept start 3. The answer is the unique label r. 93 NFA REX. Example Start with: First we need to convert this into a GNFA. Currently not a GNFA because accept state, though unique, has nonzero out-degree. Start state has zero in-degree, so is okay. 94 NFA REX. Example Just added an accept state by connecting via e from the old accept state. Since the labels are single letters, they are automatically regular expressions so this is a GNFA And we are ready to start ripping out interior states. 95 NFA REX. Ripping Out. Ripping out is done as follows. If you want to rip the middle state v out of: r2 u r1 v r3 w r4 then you’ll need to recreate all the lost possibilities from u to w. I.e., to the current REX label r4 of the edge (u,w) you should add the concatenation of the (u,v ) label r1 followed by the (v,v )-loop label r2 repeated arbitrarily, followed by the (v,w ) label r3.. The new (u,w ) substitute would therefore be: r4 r1 (r2)*r3 u w 96 NFA REX. Example Getting back to our example, let’s rip out d. Q1: What will be the label of (b,c )? Q2: What will be the label of (c,c )? 97 NFA REX. Example A1: (b,c ): CD*A A2: (c,c ): CD*A Next, rip out c. Q: What will be the label of (b,a )? 98 NFA REX. Example A: B(CD*A)(CD*A)*B which simplifies to B(CD*A)+B Next, rip out b. 99 NFA REX. Example Finally, rip out a : 100 NFA REX. Example The resulting REX Is the unique label (AD(B(CD*A)+B))+ 101 Summary This completes the demonstration that the three methods of describing regular languages: 1. Deterministic FA’s 2. NFA’s 3. Regular Expressions are all equivalent. 102