Lesson 6
Regular Expressions

Regular expressions are useful for specifying the language that a finite automaton recognizes. They are also used to specify the syntax of tokens in a programming language when building a compiler, and to check the syntax of responses entered on web-application forms.

Definition: A regular expression over an alphabet Σ:
1. Ø and ε are regular expressions.
2. a is a regular expression for every a ∈ Σ.
3. Given that α, β are regular expressions, then so are:
   a) (α ◦ β)
   b) (α + β)
   c) (α*), (β*)
4. An expression is regular iff it can be formed by a finite number of applications of rules 1-3.

Regular Expressions and Regular Sets

Regular expression: an expression built up from the previous rules.
Regular set: the set of strings that the expression denotes.

Let Σ = {0, 1}

Regular expression      Regular set
00                      {00}
(0 + 1)*                {ε, 0, 1, 00, 01, 10, 11, 000, …, 111, …}
(0 + 1)*00(0 + 1)*      The set of binary strings containing ‘00’: {00, 000, 100, 001, 1001, 1100, 10100, …}
(0 + 1)*011             The set of binary strings ending with ‘011’: {011, 0011, 1011, 01011, 10011, …}

Now, let Σ = {0, 1, 2}

0*1*2*                  The set of strings consisting of an arbitrary number of 0’s followed by an arbitrary number of 1’s and finally by an arbitrary number of 2’s: {ε, 0, 1, 2, 011, 1112, 011222, …}

L(E) denotes the regular set denoted by the expression E.

Consider the language described by E = {0^n 1^n 2^n | n ≥ 0}. Note that L(E) ≠ L(0*1*2*), i.e. L(E) = {ε, 012, 001122, …}; observe that here the numbers of 0’s, 1’s, and 2’s must be equal. In fact L(E) ⊂ L(0*1*2*).

A finite automaton M can be built such that L(M) = L(0*1*2*). We provide an ε-nfa.

[Figure: ε-nfa with states q0, q1, q2 and edge labels 0, 1, 2]

We note that no such fa can be constructed for L(E). We may thereby conclude that {0^n 1^n 2^n | n ≥ 0} is not a regular language (a fact we will prove in due time).

Problem: Find a regular expression for the set of all strings over {b, c} containing an even number of b’s.

Why is the answer not (bb)+? Well, of course, zero is an even number, yet (bb)+ denotes {bb, bbbb, b^6, …}, which omits ε.

We try (bb)* … But no c’s are permitted by this expression …

O.K. … how about c*(bb)*c*?
Why must the b’s be contiguous (next to each other)?

And finally …
E = c*(bc*bc*)*
Each pass through the starred group contributes exactly two b’s, and the c* factors allow c’s before, between, and after the b’s (for example, bbcbb is covered).

The following ε-nfa M has L(M) = L(E).

[Figure: ε-nfa M with states q0, q1, q2, q3, q4 and edge labels c, c, b, c, b]

Regular Expressions and Finite Automata

We wish to prove:
Regular Expressions ≡ Finite Automata
i.e. the class of languages that regular expressions can denote is the same as the class of languages that finite automata can recognize.

We have seen several proofs of this ilk; recall their form.

I. Given an arbitrary regular expression E, we must be able to construct an fa M such that L(M) = L(E).
II. Given an arbitrary fa M, we must be able to construct a regular expression E such that L(E) = L(M).

__________________________________________

I. We begin with our basis machines:

Regular Expression E      Corresponding fa M
1. Ø, ε                   [Figure: basis machines for Ø and ε, drawn with states q0 and qf]
2. a                      [Figure: basis machine for a, with a transition labeled a from q0 to qf]

Next, the inductive step is considered: Suppose that α and β are arbitrary regular expressions. Then we may assume that Mα and Mβ, machines to recognize α and β respectively, exist.

3. (a) Then a machine for (α ◦ β) may be constructed as follows:

[Figure: Mα◦β, built from Mα followed by Mβ]

The start state of Mα is the new start state. The accept state of Mβ is this composite machine’s accept state. And we have an ε-transition from Mα’s accept state to the start state of Mβ.

(b) Constructing a machine for (α + β):

[Figure: Mα+β, built from Mα and Mβ]

(c) And finally, a machine for (α)*:

[Figure: Mα*, built from Mα]

We have shown that given an arbitrary regular expression, an equivalent finite automaton can be constructed. Hence, we now have:
Regular expression ⇒ finite automaton

Before completing our proof, let’s take an example to illustrate the aforementioned constructions.

Example: Build a finite automaton for the regular expression (01 + 10)*

We begin with a machine for each of the regular expressions 0 and 1.

M0: q0 --0--> q2    and    M1: q0 --1--> q2

Next, we employ construction 3(a) to build machines for the regular expressions 01 and 10.
[Figure: M01, built from M0 followed by M1, and M10, built from M1 followed by M0]

Using construction 3(b) we obtain a finite automaton for (01 + 10):

[Figure: M(01+10)]

And finally construction 3(c) yields a machine for (01 + 10)*:

[Figure: M(01+10)*, built from M(01+10)]

Part II of our proof that regular expressions ≡ fa.

Without loss of generality we may assume that our finite automaton is deterministic. … //Why?

The algorithmic paradigm employed is dynamic programming. To solve a problem P we will first solve all smaller problems. (Contrast this with the divide-and-conquer paradigm as employed in binary search, quick sort, and merge sort, wherein only some smaller problems are first solved.)

An example

We desire a regular expression E such that L(E) = L(M).

[Figure: dfa M with states q1 (start) and q2 (accepting). As the basis values below reflect, q1 loops on 0 and moves to q2 on 1; q2 loops on 1 and moves to q1 on 0.]

First, some notation: We let $R^{k}_{ij}$ stand for the set of all strings that take our machine from state i to state j never passing through a state numbered higher than k. Note that to pass through a state means to enter and then leave that state (and not to leave and then enter!).

$R^{k}_{ij}$ is defined recursively:

$$R^{k}_{ij} = R^{k-1}_{ik}\,(R^{k-1}_{kk})^{*}\,R^{k-1}_{kj} + R^{k-1}_{ij}$$

The basis steps are:

$R^{0}_{ij} = a$ if $\delta(q_i, a) = q_j$ with $i \neq j$, and
$R^{0}_{ij} = a + \varepsilon$ if $\delta(q_i, a) = q_j$ with $i = j$

In our example $L(M) = R^{2}_{12}$ //why?

We start at the beginning (recall the machine M above):

$R^{0}_{11} = 0 + \varepsilon$
$R^{0}_{22} = 1 + \varepsilon$
$R^{0}_{12} = 1$
$R^{0}_{21} = 0$

We use this to fill in the following table:

              k = 0                k = 1
$R^{k}_{11}$  $0 + \varepsilon$    ?
$R^{k}_{22}$  $1 + \varepsilon$    ?
$R^{k}_{12}$  $1$                  ?
$R^{k}_{21}$  $0$                  ?

$R^{1}_{11} = R^{0}_{11}(R^{0}_{11})^{*}R^{0}_{11} + R^{0}_{11} = (0 + \varepsilon)(0 + \varepsilon)^{*}(0 + \varepsilon) + (0 + \varepsilon) = 0^{*}$

$R^{1}_{22} = R^{0}_{21}(R^{0}_{11})^{*}R^{0}_{12} + R^{0}_{22} = 0(0 + \varepsilon)^{*}1 + (1 + \varepsilon) = 00^{*}1 + (1 + \varepsilon) = 0^{*}1 + \varepsilon$

$R^{1}_{12} = R^{0}_{11}(R^{0}_{11})^{*}R^{0}_{12} + R^{0}_{12} = (0 + \varepsilon)(0 + \varepsilon)^{*}1 + 1 = 00^{*}1 + 1 = 0^{*}1$

$R^{1}_{21} = R^{0}_{21}(R^{0}_{11})^{*}R^{0}_{11} + R^{0}_{21} = 0(0 + \varepsilon)^{*}(0 + \varepsilon) + 0 = 00^{*}0 + 0 = 00^{*}$

So we have:

              k = 0                k = 1
$R^{k}_{11}$  $0 + \varepsilon$    $0^{*}$
$R^{k}_{22}$  $1 + \varepsilon$    $0^{*}1 + \varepsilon$
$R^{k}_{12}$  $1$                  $0^{*}1$
$R^{k}_{21}$  $0$                  $00^{*}$

And

$L(M) = R^{2}_{12} = R^{1}_{12}(R^{1}_{22})^{*}R^{1}_{22} + R^{1}_{12}$
$= 0^{*}1(0^{*}1 + \varepsilon)^{*}(0^{*}1 + \varepsilon) + 0^{*}1$
$= 0^{*}1(0^{*}1 + \varepsilon)^{*}$
$= 0^{*}1(0^{*}1)^{*}$ or $(0^{*}1)^{+}$

Hence, any language that can be recognized by a finite automaton can be denoted by a regular expression, and vice versa.
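The recurrence above can be checked mechanically on the two-state example. The following is only a minimal sketch, not a general implementation: instead of building symbolic regular expressions it represents each set R^k_ij as the finite set of its strings of length at most MAX_LEN, and then compares R^2_12 with the strings the dfa accepts and with the strings matched by (0*1)+. The names DELTA, MAX_LEN, concat, star, kleene, and accepts are illustrative choices, and the transition table is an assumed encoding of the example machine M (q1 = 1, q2 = 2).

```python
import re
from functools import lru_cache
from itertools import product

MAX_LEN = 6  # languages are compared only on strings of length <= MAX_LEN

# Assumed encoding of the example dfa M: start state 1, accept state 2.
DELTA = {(1, "0"): 1, (1, "1"): 2, (2, "0"): 1, (2, "1"): 2}


def concat(xs, ys):
    """Concatenation of two languages, truncated to MAX_LEN."""
    return {x + y for x in xs for y in ys if len(x + y) <= MAX_LEN}


def star(xs):
    """Kleene star of a language, truncated to MAX_LEN."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor; keep only new ones.
        frontier = concat(frontier, xs) - result
        result |= frontier
    return result


@lru_cache(maxsize=None)  # each (k, i, j) subproblem is solved once
def kleene(k, i, j):
    """Strings taking M from state i to state j without passing
    through a state numbered higher than k (truncated to MAX_LEN)."""
    if k == 0:
        base = {a for (p, a), q in DELTA.items() if p == i and q == j}
        return (base | {""}) if i == j else base
    through_k = concat(
        concat(kleene(k - 1, i, k), star(kleene(k - 1, k, k))),
        kleene(k - 1, k, j),
    )
    return through_k | kleene(k - 1, i, j)


def accepts(s):
    """Run the dfa on s, starting in state 1 and accepting in state 2."""
    state = 1
    for ch in s:
        state = DELTA[(state, ch)]
    return state == 2


all_strings = ["".join(t) for n in range(MAX_LEN + 1)
               for t in product("01", repeat=n)]
by_dfa = {s for s in all_strings if accepts(s)}
by_recurrence = kleene(2, 1, 2)
by_regex = {s for s in all_strings if re.fullmatch(r"(0*1)+", s)}

print(by_dfa == by_recurrence == by_regex)  # True
```

Working with truncated sets sidesteps the algebraic simplification steps done by hand in the derivation; agreement up to length 6 is a sanity check on the derivation, not a proof of equality.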
Practice Problem: Give a regular expression E that expresses the set of strings recognized by the following dfa:

[Figure: dfa with start state q1 and states q2, q3; edge labels 1, 0, 0, 1, and 0,1]