Regular Expressions

• Highlights:
  – A regular expression specifies a language, and it does so precisely.
  – Regular expressions are very intuitive.
  – Regular expressions are useful in a variety of contexts.
  – Given a regular expression, an NFA-ε can be constructed from it automatically.
  – Thus an NFA, a DFA, and a corresponding program can all be constructed automatically as well!

Two Operations

• Concatenation:
  – x = 010
  – y = 1101
  – xy = 0101101
• Language Concatenation: $L_1L_2 = \{xy \mid x \in L_1 \text{ and } y \in L_2\}$
  – $L_1 = \{01, 00\}$
  – $L_2 = \{11, 010\}$
  – $L_1L_2 = \{0111, 01010, 0011, 00010\}$
• Language Union:
  – $L_1 = \{01, 00\}$
  – $L_2 = \{01, 11, 010\}$
  – $L_1 \cup L_2 = \{01, 00, 11, 010\}$

Operations on Languages

• Let $L$, $L_1$, $L_2$ be subsets of $\Sigma^*$
• Concatenation: $L_1L_2 = \{xy \mid x \in L_1 \text{ and } y \in L_2\}$
• Concatenating a language with itself:
  $L^0 = \{\varepsilon\}$
  $L^i = LL^{i-1}$, for all $i \ge 1$

Kleene closure

Say $L = L^1 = \{a, abc, ba\}$ over $\Sigma = \{a, b, c\}$.
Then $L^2 = \{aa, aabc, aba, abca, abcabc, abcba, baa, baabc, baba\}$,
$L^3 = \{a, abc, ba\} \cdot L^2$, and so on.
But $L^0 = \{\varepsilon\}$.
Kleene closure of $L$: $L^* = L^0 \cup L^1 \cup L^2 \cup L^3 \cup \ldots$

Operations on Languages

• Let $L$, $L_1$, $L_2$ be subsets of $\Sigma^*$
• Concatenation: $L_1L_2 = \{xy \mid x \in L_1 \text{ and } y \in L_2\}$
• Union is the set union of $L_1$ and $L_2$
• Kleene Closure: $L^* = \bigcup_{i \ge 0} L^i = L^0 \cup L^1 \cup L^2 \cup \ldots$
• Positive Closure: $L^+ = \bigcup_{i \ge 1} L^i = L^1 \cup L^2 \cup \ldots$
• Question: Does $L^+$ contain ε?

Definition of a Regular Expression

• Let Σ be an alphabet. The regular expressions over Σ are:
  – Ø      Represents the empty set { }
  – ε      Represents the set {ε}
  – a      Represents the set {a}, for any symbol a in Σ
• Let r and s be regular expressions that represent the sets R and S, respectively. Then the following are also regular expressions:
  – r+s    Represents the set R U S    (precedence 3, the lowest)
  – rs     Represents the set RS       (precedence 2)
  – r*     Represents the set R*       (highest precedence)
  – (r)    Represents the set R        (not an operator; parentheses only override precedence)
• If r is a regular expression, then L(r) is used to denote the corresponding language.
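The language operations that the definition above builds on (union, concatenation, closure) can be tried out on small finite languages. The following is a minimal sketch, assuming languages are finite Python sets of strings and ε is the empty string `""`; the helper names `concat`, `power`, and `closure_upto` are illustrative and do not come from the slides. Because L* is infinite whenever L contains a nonempty string, `closure_upto` only collects the powers up to a chosen bound.

```python
from itertools import product


def concat(l1, l2):
    """Language concatenation: {xy | x in l1 and y in l2}."""
    return {x + y for x, y in product(l1, l2)}


def power(lang, i):
    """L^i, using L^0 = {ε} and L^i = L L^(i-1); ε is the empty string."""
    result = {""}
    for _ in range(i):
        result = concat(lang, result)
    return result


def closure_upto(lang, n, positive=False):
    """Finite approximation of L* (or of L+ when positive=True):
    the union of L^i for i from 0 (or 1) up to n."""
    first = 1 if positive else 0
    out = set()
    for i in range(first, n + 1):
        out |= power(lang, i)
    return out


# The concatenation example from the slides.
L1 = {"01", "00"}
L2 = {"11", "010"}
print(sorted(concat(L1, L2)))

# The Kleene closure example from the slides.
L = {"a", "abc", "ba"}
print(sorted(power(L, 2)))
print("" in closure_upto(L, 3), "" in closure_upto(L, 3, positive=True))
```

Comparing the last two values is one way to explore the question about ε and the positive closure for a language that does not itself contain ε.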
• Examples: Let Σ = {0, 1}
  (0 + 1)*                   All strings of 0's and 1's
  01*                        0 followed by any number of 1's
  0(0 + 1)*                  All strings of 0's and 1's beginning with a 0
  (0 + 1)*1                  All strings of 0's and 1's ending with a 1
  (0 + 1)*0(0 + 1)*          All strings of 0's and 1's containing at least one 0
  (0 + 1)*0(0 + 1)*0(0 + 1)* All strings of 0's and 1's containing at least two 0's
  (0 + 1)*01*01*             All strings of 0's and 1's containing at least two 0's
  (1 + 01*0)*                All strings of 0's and 1's containing an even number of 0's
  1*(01*01*)*                All strings of 0's and 1's containing an even number of 0's
  (1*01*0)*1*                All strings of 0's and 1's containing an even number of 0's
  (0+1)* = (0*1*)*           Any string, i.e., Σ*
  In all cases here Σ = {0, 1}.

• Question: Is there a unique minimum regular expression for a given language?

• Identities:
  1.  Øu = uØ = Ø      Like multiplying by 0
  2.  εu = uε = u      Like multiplying by 1
  3.  Ø* = ε           Because $L^* = \bigcup_{i \ge 0} L^i = L^0 \cup L^1 \cup L^2 \cup \ldots$ and $L^0 = \{\varepsilon\}$
  4.  ε* = ε           As a language, {ε}
  5.  u+v = v+u
  6.  u+Ø = u
  7.  u+u = u
  8.  u* = (u*)*
  9.  u(v+w) = uv+uw   [which operation is hidden before the parenthesis?]
  10. (u+v)w = uw+vw
  11. (uv)*u = u(vu)*  [note: there must be a single u, at the start or at the end]
  12. (u+v)* = (u*+v)* = u*(u+v)* = (u+vu*)* = (u*v*)* = u*(vu*)* = (u*v)*u*

Equivalence of Regular Expressions and NFA-εs

• Note: Throughout the following, keep in mind that a string is accepted by an NFA-ε if there exists ANY path from the start state to any final state.

• Lemma 1: Let r be a regular expression. Then there exists an NFA-ε M such that L(M) = L(r). Furthermore, M has exactly one final state with no transitions out of it.

• Proof: (by induction on the number of operators, denoted by OP(r), in r).

Basis: OP(r) = 0

Then r is either Ø, ε, or a, for some symbol a in Σ.
  For Ø: a start state q0 and a separate final state qf, with no transitions.
  For ε: a single state qf that is both the start state and the final state.
  For a: q0 --a--> qf

Inductive Hypothesis: Suppose there exists a k ≥ 0 such that for any regular expression r where 0 ≤ OP(r) ≤ k, there exists an NFA-ε M such that L(M) = L(r). Furthermore, suppose that M has exactly one final state.
Inductive Step: Let r be a regular expression with k + 1 operators (OP(r) = k + 1), where k + 1 ≥ 1.

Case 1) r = r1 + r2
Since OP(r) = k + 1, it follows that 0 ≤ OP(r1), OP(r2) ≤ k. By the inductive hypothesis there exist NFA-ε machines M1 and M2 such that L(M1) = L(r1) and L(M2) = L(r2). Furthermore, both M1 and M2 have exactly one final state.
Construct M as: add a new start state q0 and a new final state qf. With q1, f1 the start and final states of M1, and q2, f2 those of M2, add the ε-transitions
  q0 --ε--> q1,  q0 --ε--> q2,  f1 --ε--> qf,  f2 --ε--> qf

Case 2) r = r1r2
Since OP(r) = k + 1, it follows that 0 ≤ OP(r1), OP(r2) ≤ k. By the inductive hypothesis there exist NFA-ε machines M1 and M2 such that L(M1) = L(r1) and L(M2) = L(r2). Furthermore, both M1 and M2 have exactly one final state.
Construct M as: connect the final state of M1 to the start state of M2; the start state is q1 and the final state is f2.
  q1 [M1] f1 --ε--> q2 [M2] f2

Case 3) r = r1*
Since OP(r) = k + 1, it follows that 0 ≤ OP(r1) ≤ k. By the inductive hypothesis there exists an NFA-ε machine M1 such that L(M1) = L(r1). Furthermore, M1 has exactly one final state.
Construct M as: add a new start state q0 and a new final state qf. With q1, f1 the start and final states of M1, add the ε-transitions
  q0 --ε--> q1,  f1 --ε--> qf,  q0 --ε--> qf,  f1 --ε--> q1

• Example: r = 0(0+1)*
  Decomposition:
    r  = r1r2       r1 = 0     r2 = (0+1)*
    r2 = r3*        r3 = 0+1
    r3 = r4 + r5    r4 = 0     r5 = 1
  The NFA-ε is built bottom-up, one subexpression per step:
    1. r5 = 1:        q0 --1--> q1
    2. r4 = 0:        q2 --0--> q3
    3. r3 = r4 + r5:  new states q4, q5;  q4 --ε--> q0,  q4 --ε--> q2,  q1 --ε--> q5,  q3 --ε--> q5
    4. r2 = r3*:      new states q6, qf;  q6 --ε--> q4,  q5 --ε--> qf,  q6 --ε--> qf,  q5 --ε--> q4
    5. r1 = 0:        q8 --0--> q9
    6. r = r1r2:      q9 --ε--> q6;  the start state is q8 and the final state is qf

Definitions Required to Convert a DFA to a Regular Expression

• Let $M = (Q, \Sigma, \delta, q_1, F)$ be a DFA with state set $Q = \{q_1, q_2, \ldots, q_n\}$, and define:
  $R_{i,j} = \{x \mid x \in \Sigma^* \text{ and } \delta(q_i, x) = q_j\}$
  $R_{i,j}$ is the set of all strings that define a path in M from $q_i$ to $q_j$.
• Note that states have been numbered starting at 1, not 0!

• Example:
  [Figure: DFA over {0, 1} with states q1, q2, q3, q4, q5]
  $R_{2,3} = \{0, 001, 00101, 011, \ldots\}$
  $R_{1,4} = \{01, 00101, \ldots\}$
  $R_{3,3} = \{11, 100, \ldots\}$

• In words: $R^k_{i,j}$ is the set of all strings that define a path in M from $q_i$ to $q_j$ that passes through no state numbered greater than k.
• Definition:
  $R^k_{i,j} = \{x \mid x \in \Sigma^*,\ \delta(q_i, x) = q_j, \text{ and there is no } u \text{ with } 1 \le |u| < |x| \text{ and } x = uv \text{ such that } \delta(q_i, u) = q_p \text{ for some } p > k\}$
• Note that i > k or j > k is allowed; only the intermediate states on the path from $q_i$ to $q_j$ may not be numbered greater than k.

• Example:
  [Figure: the same five-state DFA]
  $R^4_{2,3} = \{0, 1000, 011, \ldots\}$
  $R^1_{2,3} = \{0\}$
  111 is not in $R^4_{2,3}$ because it goes via q5
  111 is not in $R^1_{2,3}$
  101 is not in $R^1_{2,3}$
  $R^5_{2,3} = R_{2,3}$, since any state may be on the path now

• Observations:
  1) $R^n_{i,j} = R_{i,j}$
  2) $R^{k-1}_{i,j}$ is a subset of $R^k_{i,j}$
  3) $L(M) = \bigcup_{q_j \in F} R^n_{1,j} = \bigcup_{q_j \in F} R_{1,j}$
  4) $R^0_{i,j} = \begin{cases} \{a \mid \delta(q_i, a) = q_j\} \text{ (possibly } \emptyset\text{)} & i \ne j \\ \{a \mid \delta(q_i, a) = q_j\} \cup \{\varepsilon\} & i = j \end{cases}$
     Easily computed from the DFA!
  5) $R^k_{i,j} = R^{k-1}_{i,k} (R^{k-1}_{k,k})^* R^{k-1}_{k,j} \cup R^{k-1}_{i,j}$
     Now you see the purpose of introducing k: it lets us write each set as a regular expression.

• Notes on 5:
  5) $R^k_{i,j} = R^{k-1}_{i,k} (R^{k-1}_{k,k})^* R^{k-1}_{k,j} \cup R^{k-1}_{i,j}$
• Consider the paths from $q_i$ to $q_j$ represented by the strings in $R^k_{i,j}$.
• If x is a string in $R^k_{i,j}$, then no state numbered greater than k may be passed through when processing x, and either:
  – $q_k$ is not passed through, i.e., x is in $R^{k-1}_{i,j}$, or
  – $q_k$ is passed through one or more times, i.e., x is in $R^{k-1}_{i,k} (R^{k-1}_{k,k})^* R^{k-1}_{k,j}$

• Lemma 2: Let $M = (Q, \Sigma, \delta, q_1, F)$ be a DFA. Then there exists a regular expression r such that L(M) = L(r).

• Proof:
First we will show (by induction on k) that for all i, j, and k, where $1 \le i, j \le n$ and $0 \le k \le n$, there exists a regular expression r such that $L(r) = R^k_{i,j}$.

Basis: k = 0
$R^0_{i,j}$ contains single symbols, one for each transition from $q_i$ to $q_j$, and possibly ε if i = j.
  case 1) No transitions from $q_i$ to $q_j$ and $i \ne j$:
          $r^0_{i,j} = \emptyset$
  case 2) At least one ($m \ge 1$) transition from $q_i$ to $q_j$ and $i \ne j$:
          $r^0_{i,j} = a_1 + a_2 + a_3 + \ldots + a_m$, where $\delta(q_i, a_p) = q_j$ for all $1 \le p \le m$
  case 3) No transitions from $q_i$ to $q_j$ and $i = j$:
          $r^0_{i,j} = \varepsilon$
  case 4) At least one ($m \ge 1$) transition from $q_i$ to $q_j$ and $i = j$:
          $r^0_{i,j} = a_1 + a_2 + a_3 + \ldots + a_m + \varepsilon$, where $\delta(q_i, a_p) = q_j$ for all $1 \le p \le m$

Inductive Hypothesis:
Suppose that $R^{k-1}_{i,j}$ can be represented by the regular expression $r^{k-1}_{i,j}$ for all $1 \le i, j \le n$, and some $k \ge 1$.

Inductive Step:
Consider $R^k_{i,j} = R^{k-1}_{i,k} (R^{k-1}_{k,k})^* R^{k-1}_{k,j} \cup R^{k-1}_{i,j}$. By the inductive hypothesis there exist regular expressions $r^{k-1}_{i,k}$, $r^{k-1}_{k,k}$, $r^{k-1}_{k,j}$, and $r^{k-1}_{i,j}$ generating $R^{k-1}_{i,k}$, $R^{k-1}_{k,k}$, $R^{k-1}_{k,j}$, and $R^{k-1}_{i,j}$, respectively. Thus, if we let
  $r^k_{i,j} = r^{k-1}_{i,k} (r^{k-1}_{k,k})^* r^{k-1}_{k,j} + r^{k-1}_{i,j}$
then $r^k_{i,j}$ is a regular expression generating $R^k_{i,j}$, i.e., $L(r^k_{i,j}) = R^k_{i,j}$.

• Finally, if $F = \{q_{j_1}, q_{j_2}, \ldots, q_{j_r}\}$, then
  $r^n_{1,j_1} + r^n_{1,j_2} + \ldots + r^n_{1,j_r}$
  is a regular expression generating L(M).

• Note: not only does this prove that the regular expressions generate the regular languages, but it also provides an algorithm for computing such an expression!

• Example:
  [Figure: DFA over {0, 1} with states q1, q2, q3 and start state q1; its transitions are exactly those recorded in the k=0 column below]

                 k=0      k=1      k=2
  $r^k_{1,1}$    ε
  $r^k_{1,2}$    0
  $r^k_{1,3}$    1
  $r^k_{2,1}$    0
  $r^k_{2,2}$    ε
  $r^k_{2,3}$    1
  $r^k_{3,1}$    Ø
  $r^k_{3,2}$    0+1
  $r^k_{3,3}$    ε

  The first table column (k=0) is computed from the DFA.

• All remaining columns are computed from the previous column using the formula.
  For example, in column k=1:
  $r^1_{2,3} = r^0_{2,1} (r^0_{1,1})^* r^0_{1,3} + r^0_{2,3} = 0(\varepsilon)^*1 + 1 = 01 + 1$

  And in column k=2:
  $r^2_{1,3} = r^1_{1,2} (r^1_{2,2})^* r^1_{2,3} + r^1_{1,3} = 0(\varepsilon + 00)^*(1 + 01) + 1 = 0^*1$

                 k=0      k=1        k=2
  $r^k_{1,1}$    ε        ε          (00)*
  $r^k_{1,2}$    0        0          0(00)*
  $r^k_{1,3}$    1        1          0*1
  $r^k_{2,1}$    0        0          0(00)*
  $r^k_{2,2}$    ε        ε + 00     (00)*
  $r^k_{2,3}$    1        1 + 01     0*1
  $r^k_{3,1}$    Ø        Ø          (0 + 1)(00)*0
  $r^k_{3,2}$    0+1      0+1        (0 + 1)(00)*
  $r^k_{3,3}$    ε        ε          ε + (0 + 1)0*1

• To complete the regular expression, we compute: $r^3_{1,2} + r^3_{1,3}$

• Theorem: Let L be a language. Then there exists a regular expression r such that L = L(r) if and only if there exists a DFA M such that L = L(M).

• Proof:
(if) Suppose there exists a DFA M such that L = L(M). Then by Lemma 2 there exists a regular expression r such that L = L(r).
(only if) Suppose there exists a regular expression r such that L = L(r). Then by Lemma 1 there exists an NFA-ε accepting L, and since every NFA-ε can be converted to an equivalent DFA, there exists a DFA M such that L = L(M).

• Corollary: The regular expressions define the regular languages.

• Note: The conversion from a regular expression to a DFA and a program accepting L(r) is now complete, and fully automated!
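The table-filling procedure from the proof of Lemma 2 can be sketched in a few lines. This is only an illustration under stated assumptions: regular expressions are built as Python `re` pattern strings (so `|` plays the role of `+`), `None` stands for Ø, the empty string stands for ε, and alphabet symbols are single characters. The function names (`union`, `concat`, `star`, `dfa_to_regex`, `accepts`) are hypothetical helpers, not part of the slides. The DFA is the three-state example, with transitions read off the k=0 column and final states q2 and q3 as implied by the final expression. Only light simplification is applied, so the resulting pattern is longer than the hand-simplified table entries.

```python
import re
from itertools import product


def union(r, s):
    """r + s, where None stands for Ø and "" stands for ε."""
    if r is None or r == s:
        return s
    if s is None:
        return r
    if r == "":
        return f"(?:{s})?"
    if s == "":
        return f"(?:{r})?"
    return f"(?:{r}|{s})"


def concat(r, s):
    """rs; identities Øu = uØ = Ø and εu = uε = u hold by construction."""
    return None if r is None or s is None else r + s


def star(r):
    """r*; identities Ø* = ε and ε* = ε."""
    return "" if not r else f"(?:{r})*"


def dfa_to_regex(n, alphabet, delta, finals, start=1):
    """States are numbered 1..n; delta maps (state, symbol) to a state."""
    # Column k = 0: one symbol per transition, plus ε on the diagonal.
    r = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for a in alphabet:
            j = delta[(i, a)]
            r[i][j] = union(r[i][j], re.escape(a))
        r[i][i] = union(r[i][i], "")
    # Column k from column k-1, using observation 5.
    for k in range(1, n + 1):
        prev = r
        r = [[None] * (n + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                via_k = concat(concat(prev[i][k], star(prev[k][k])), prev[k][j])
                r[i][j] = union(via_k, prev[i][j])
    # Union of the entries from the start state to each final state.
    result = None
    for f in sorted(finals):
        result = union(result, r[start][f])
    return result


def accepts(delta, finals, word, start=1):
    """Run the DFA directly, for comparison with the generated pattern."""
    state = start
    for a in word:
        state = delta[(state, a)]
    return state in finals


delta = {(1, "0"): 2, (1, "1"): 3,
         (2, "0"): 1, (2, "1"): 3,
         (3, "0"): 2, (3, "1"): 2}
finals = {2, 3}
pattern = dfa_to_regex(3, "01", delta, finals)

# Compare pattern and DFA on every string of length at most 6.
words = ["".join(w) for n in range(7) for w in product("01", repeat=n)]
mismatches = [w for w in words
              if (re.fullmatch(pattern, w) is not None) != accepts(delta, finals, w)]
print(pattern)
print("mismatches:", mismatches)
```

Comparing against a direct DFA simulation on all short strings is a cheap sanity check on the generated expression; it does not replace the inductive proof.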