Regular Show… Err, Regular Expressions
• An algebraic equivalent to finite automata
– Useful as a language for describing simple but useful patterns in text
– e.g. How can one tell if an email address is syntactically valid?
– e.g. How can one update the copyright statement (or a copyleft statement) to add the current year in thousands of programs?

Regular Languages
Recall: A language is called a regular language if some finite automaton accepts it.
Regular expressions describe regular languages.

Operators and Operands
If E is a regular expression, then L(E) denotes the language that E stands for. Expressions are built as follows:
An operand can be:
1. A variable, standing for a language.
2. A symbol, standing for itself as a set of strings, i.e., a stands for the language {a} (formally, L(a) = {a}).
3. ε, standing for {ε} (a language).
4. ∅, standing for ∅ (the empty language).
The operators are:
1. + or ∪, standing for union. L(E+F) = L(E) ∪ L(F).
2. ∘ or juxtaposition (i.e., no operator symbol, as in xy to mean x "times" y) to stand for concatenation. L(EF) = L(E)L(F), where the concatenation of languages L and M is {xy | x is in L and y is in M}.
3. * to represent closure. L(E*) = (L(E))*, where L* = {ε} ∪ L ∪ LL ∪ LLL ∪ … .
Parentheses may be used to alter grouping, which by default is * (highest precedence), then concatenation, then union (lowest precedence).

Formal Definition of REs
R is a regular expression if R is
1. a for some a ∈ Σ
2. ε
3. ∅
4. (R1 ∪ R2), where R1 and R2 are regular expressions
5. (R1 ∘ R2), where R1 and R2 are regular expressions
6. (R1*), where R1 is a regular expression
Every regular expression arises by a finite number of applications of these 6 rules.

Said Another Way…
A regular expression describes a language. Which one? i.e. L(R) = ?
Apply these recursively:
1. L(a) = {a}
2. L(ε) = {ε}
3. L(∅) = { }
4. L(R1 | R2) = L(R1) ∪ L(R2)
5. L(R1 ∘ R2) = L(R1) ∘ L(R2)
6. L(R1*) = L(R1)*

Example
To prove ((a(b*))+a) is a regular expression over {a, b}, show it can be constructed according to the six rules of the formal definition:
1. b is regular by Rule 1
2. (b*) is regular by Rule 6
3. a is regular by Rule 1
4. (a(b*)) is regular by Rule 5
5. ((a(b*))+a) is regular by Rule 4 applied to (4) and (3)

Examples
• L(001) = {001}
• L(0+10*) = {0, 1, 10, 100, 1000, …}
• L((0(0+1))*) = the set of strings of 0's and 1's, of even length, such that every odd position has a 0

A few more examples…
• ab*a
• a*b*
• (ab)* (same as a*b*?)
• a*b*a* (is baa in this?)
• L = {x^odd} (odd-length strings of x's) = x(xx)* or (xx)*x, but not x*xx*
• All strings of a's and b's of exactly length 3
– L = {aaa, aab, aba, abb, baa, bab, bba, bbb}, or (a+b)(a+b)(a+b), or (a+b)^3

What are REs for these languages?
Assume Σ = {a, b} unless otherwise indicated
• Strings with an a in them somewhere: (a+b)*a(a+b)*
• Strings with at least 2 a's: b*ab*a(a+b)*
• Strings with exactly 2 a's: b*ab*ab*
• Strings with at least one a and one b: (a+b)*a(a+b)*b(a+b)* + (a+b)*b(a+b)*a(a+b)*
• Strings that end in b but do not contain aa: (b+ab)*(b+ab) = (b+ab)^+
• All strings over {a, b, c} having no substring ac: c*(a+bc*)*

Equality of REs
• Two regular expressions s and t are equal if and only if L(s) = L(t)
– Two regular expressions can look quite different yet describe the same language
• Example: s = (a+b)* and t = (b+aa*b)*a*

Equivalence of FA Languages and RE Languages
Kleene's Theorem
• We've already shown that an NFA with or without ε-transitions can be converted to a DFA
• We'll show that NFA-ε's accept the languages described by REs
• Then, we'll show that an RE can describe the language of a DFA (the same construction works for an NFA)
• Therefore, NFA-ε, NFA, DFA, and RE are equivalent (describe the same languages)
[Figure: cycle of conversions among NFA, NFA-ε, DFA, and Regular Expression]
• The languages accepted by DFA, NFA, NFA-ε, and described by RE are called the regular languages

Proof
• We will prove this set of equivalences by
– Showing how to construct an NFA-ε from a regular expression
– Showing how to construct a regular expression from a finite automaton
• We already know how to construct a DFA from an NFA-ε, so this completes the circle

RE to NFA-ε
Cover the six cases in the formal (recursive) definition of REs
1. R = a for some a ∈ Σ. Then L(R) = {a} and the following NFA recognizes L(R)
[Figure: NFA with a single arc labeled a from the start state to an accepting state]
2. R = ε
• Formally, N = ({q1}, Σ, δ, q1, {q1}), where δ(r,b) = ∅ for any r and b
3. R = ∅
• Formally, N = ({q}, Σ, δ, q, ∅), where δ(r,b) = ∅ for any r and b
4. R = (R1 ∪ R2)
The class of regular languages is closed under the union operation.
For two languages R1 and R2, take two NFAs N1 and N2 and combine them into one new NFA N. N must accept its input if either N1 or N2 accepts it.
[Figure: N built from N1 and N2]
The new machine guesses nondeterministically which of the two machines accepts the input.
5. R = (R1 ∘ R2)
The class of regular languages is closed under the concatenation operation.
For two languages R1 and R2, take two NFAs N1 and N2 and combine them sequentially into one new NFA N.
[Figure: N built from N1 followed by N2]
The new machine guesses nondeterministically where to split the input in order to have a first part accepted by N1 and a second part accepted by N2.
6. R = (R1)*
The class of regular languages is closed under the star operation.
For a language R1, modify N1 to accept (R1)*.
[Figure: N built from N1]
The new machine has the option of jumping back to the start state to read another piece that N1 accepts.
Q: Why not just make the start state of N1 a final state?

Rite of Passage
FA-to-RE Construction
Two algorithms:
1. State elimination: generally gives a smaller expression and is easier to apply
2. Inductive construction: covered in the appendix

DFA-to-RE by State Elimination
• Basic idea: Eliminate a state s (remove all arcs into and out of s); label arcs from q to p that went through s with an RE representing the sequence of symbols on that path.
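The basic idea of state elimination can be sketched in a few lines of Python. This is only an illustrative sketch, not the slides' own procedure: the arc-map representation, the helper names (`top_level_union`, `union`, `concat`, `star`, `eliminate`), and the small input automaton with added start state S and end state A are all assumptions made for the example. Labels are plain strings in the slides' notation (+ for union, ε for the empty string).

```python
import re
from itertools import product

def top_level_union(r):
    """True if label r contains a '+' outside all parentheses."""
    depth = 0
    for ch in r:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "+" and depth == 0:
            return True
    return False

def union(r, s):
    """Combine parallel arcs; None means there was no arc yet."""
    if r is None or r == s:
        return s
    return f"{r}+{s}"

def concat(*parts):
    """Concatenate labels, dropping ε and parenthesizing top-level unions."""
    kept = [f"({p})" if top_level_union(p) else p for p in parts if p != "ε"]
    return "".join(kept) or "ε"

def star(r):
    """Closure of a loop label; a missing loop or ε contributes ε."""
    if r is None or r == "ε":
        return "ε"
    return f"{r}*" if len(r) == 1 else f"({r})*"

def eliminate(arcs, s):
    """Remove state s, relabeling every path p -> s -> q with an RE."""
    loop = star(arcs.get((s, s)))
    ins = [(p, lab) for (p, q), lab in arcs.items() if q == s and p != s]
    outs = [(q, lab) for (p, q), lab in arcs.items() if p == s and q != s]
    rest = {k: lab for k, lab in arcs.items() if s not in k}
    for (p, lin), (q, lout) in product(ins, outs):
        rest[(p, q)] = union(rest.get((p, q)), concat(lin, loop, lout))
    return rest

# Hypothetical input: state 1 loops on a and moves to state 2 on b;
# state 2 loops on a and b. S and A are the added start and end states.
arcs = {("S", 1): "ε", (1, 1): "a", (1, 2): "b", (2, 2): "a+b", (2, "A"): "ε"}
for state in (2, 1):
    arcs = eliminate(arcs, state)

regex = arcs[("S", "A")]
python_pattern = regex.replace("+", "|")  # Python's re writes union as |
print(regex)  # prints a*b(a+b)*
```

The `python_pattern` string is there only so the result can be cross-checked with `re.fullmatch`; the elimination itself never interprets the labels.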
General Process
[Figure: state q with a self-loop labeled e; arcs labeled a and c enter q, and arcs labeled b and d leave q, connecting it to qi and qj]
• Remove state q
• Label paths from
– qi to qi
– qi to qj
– qj to qi
– qj to qj
• Each new label has the form (arc into q) e* (arc out of q), giving ae*d, ae*b, ce*d, and ce*b

Alternative Method
• We can simplify things considerably if we ensure the following before applying the procedure for state elimination:
– There is a single final state
– There are no transitions into the initial state, and none out of the final state
– Since the procedure works on NFA-ε's also, this is easy to do:
[Figure: the original FA, with start state q0 and final states qf1, qf2, qf3, extended with a new start state q0new and a new final state qfnew]

Procedure
[Figure: before and after removing state qrip]
• Before: qi to qrip labeled R1, a loop on qrip labeled R2, qrip to qj labeled R3, and qi to qj labeled R4
• After: qi to qj labeled (R1)(R2)*(R3)+(R4)

Example
[Figure: an FA with states 1 and 2, reduced step by step]
• Add new start state S and new end state A, connected by ε-transitions
• Remove state 2: the arc from 1 to A is labeled b(a + b)*
• Remove state 1: the arc from S to A is labeled a*b(a + b)*

Example
[Figure sequence: ORIGINAL FA; MODIFY TO SATISFY CRITERIA; ELIMINATE q2; ELIMINATE q1; ELIMINATE q3. Intermediate arc labels include a*b, ba*b, and a*ba*b.]
• Final label between q4 and q5: (a*b + a*ba*b)(a + ba*b)*

Try this
[Figure: a three-state FA over {a, b}]
STEP 1: Modify to create a unique start state s and end state f.
STEP 2: Eliminate state 1: path from s to 2 is a*b; path from 3 to 2 is b + aa*b.
STEP 3: Eliminate state 2: path from s to 3 is a*bb*a; path from s to f is a*bb*; path from 3 to f is (b + aa*b)b*; path from 3 to 3 is (b + aa*b)b*a.
STEP 4: Eliminate state 3: the label on the path from s to f is the final RE.

Another Example
[Figure sequence: Simplify; Remove state 1; Remove state 2; Remove state 3; Remove state 4; Done!]

APPENDIX

Inductive Construction
• Let A be an FA with states 1, 2, …, n.
• Let $R_{ij}^{(k)}$ be an RE whose language is the set of labels of paths that go from state i to state j without passing through any state numbered above k.
• The construction, and the proof that the expressions for these REs are correct, are inductions on k.
• Basis: k = 0. The path can't go through any states.
– Thus, the path is either an arc or the null path (a single node).
– If i ≠ j, then $R_{ij}^{(0)}$ is the sum of all symbols a such that A has a transition from i to j on symbol a (∅ if none).
– If i = j, then add ε to the above.
• Induction: Assume we have correctly developed expressions for the $R^{(k-1)}$'s. Then for the $R^{(k)}$'s:
$R_{ij}^{(k)} = R_{ij}^{(k-1)} + R_{ik}^{(k-1)} (R_{kk}^{(k-1)})^* R_{kj}^{(k-1)}$
• Proof that it works: A path from i to j that goes through no state higher than k either:
– Never goes through k, in which case the path's label is (by the IH) in the language of $R_{ij}^{(k-1)}$; or
– Goes through k one or more times. In this case:
• $R_{ik}^{(k-1)}$ contains the portion of the path that goes from i to k for the first time.
• $(R_{kk}^{(k-1)})^*$ contains the portion of the path (possibly empty) from the first k visit to the last.
• $R_{kj}^{(k-1)}$ contains the portion of the path from the last k visit to j.
• Final step: The RE for the entire FA is the sum (union) of the REs $R_{ij}^{(n)}$, where i is the start state and j is one of the accepting states.
– Note that superscript (n) represents no restriction on the path at all, since n is the highest-numbered state.

Example
The "clamping" automaton, with states named by integers:
[Figure: the clamping automaton, with states 1, 2, 3 and arcs labeled 0 and 1]
• Some basis expressions:
$R_{11}^{(0)} = \varepsilon$
$R_{12}^{(0)} = 1$
$R_{22}^{(0)} = \varepsilon + 0 + 1$
$R_{31}^{(0)} = 1$
$R_{32}^{(0)} = R_{21}^{(0)} = \emptyset$
Two inductive examples:
• $R_{32}^{(1)} = R_{32}^{(0)} + R_{31}^{(0)} (R_{11}^{(0)})^* R_{12}^{(0)} = \emptyset + 1\varepsilon^* 1 = 11$
– Uses algebraic laws: ε* = ε; Rε = εR = R (ε is the identity for concatenation); ∅ + R = R + ∅ = R (∅ is the identity for union).
• $R_{22}^{(1)} = R_{22}^{(0)} + R_{21}^{(0)} (R_{11}^{(0)})^* R_{12}^{(0)} = \varepsilon + 0 + 1 + \emptyset\varepsilon^* 1 = \varepsilon + 0 + 1$
– Additional algebraic law used: ∅R = R∅ = ∅ (∅ is the annihilator for concatenation).

To simplify the more complex regular expressions during state elimination (using algebraic rules):
• ε* = ε
• Rε = εR = R
• ∅R = R∅ = ∅
• ∅ | R = { } ∪ R = R, which can also be stated as ∅ + R = R + ∅ = R
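The induction step and the algebraic laws above can be checked mechanically. The following Python sketch is an illustration under stated assumptions: the helper names (`union`, `concat`, `star`, `step`) and the dictionary layout are hypothetical, and only the basis expressions listed in the example are entered, so only entries that depend on those can be computed. Each helper applies one of the simplification laws as it builds the string.

```python
EPS, EMPTY = "ε", "∅"

def union(r, s):
    """r + s, using ∅ + R = R + ∅ = R."""
    if r == EMPTY:
        return s
    if s == EMPTY:
        return r
    return r if r == s else f"{r}+{s}"

def concat(r, s):
    """rs, using ∅R = R∅ = ∅ and Rε = εR = R."""
    if EMPTY in (r, s):
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    # Parenthesize anything containing a union (possibly redundantly).
    wrap = lambda x: f"({x})" if "+" in x else x
    return wrap(r) + wrap(s)

def star(r):
    """r*, using ε* = ε (and ∅* = ε)."""
    if r in (EPS, EMPTY):
        return EPS
    return f"{r}*" if len(r) == 1 else f"({r})*"

def step(R, i, j, k):
    """Compute R_ij^(k) from a table R of level k-1 expressions."""
    through_k = concat(concat(R[i, k], star(R[k, k])), R[k, j])
    return union(R[i, j], through_k)

# Basis expressions copied from the example (level 0 only).
R0 = {
    (1, 1): EPS,
    (1, 2): "1",
    (2, 2): "ε+0+1",
    (3, 1): "1",
    (3, 2): EMPTY,
    (2, 1): EMPTY,
}

r32 = step(R0, 3, 2, 1)
r22 = step(R0, 2, 2, 1)
print(r32)  # prints 11
print(r22)  # prints ε+0+1
```

Both results agree with the two inductive examples worked by hand. Extending the sketch to a full table would require the remaining basis entries, which the example does not list.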