CSE 3204: Formal Language, Automata and Computability
Regular Expressions
Reading: Chapter 3

RE’s: Introduction
• Regular expressions are an algebraic way to describe languages.
• They describe exactly the regular languages.
• If E is a regular expression, then L(E) is the language it defines.

Regular Expressions: Language
• The set of strings accepted by a finite automaton is referred to as the language accepted by the finite automaton.
• For each finite automaton there is a regular expression that defines the same language.

Basis 1: If a is any symbol, then a is a RE, and L(a) = {a}. Note: {a} is the language containing one string, and that string is of length 1.
Basis 2: ε is a RE, and L(ε) = {ε}.
Basis 3: ∅ is a RE, and L(∅) = ∅.

Induction 1: If E1 and E2 are regular expressions, then E1+E2 is a regular expression, and L(E1+E2) = L(E1) ∪ L(E2).
Induction 2: If E1 and E2 are regular expressions, then E1E2 is a regular expression, and L(E1E2) = L(E1)L(E2). Concatenation: the set of strings wx such that w is in L(E1) and x is in L(E2).
Induction 3: If E is a RE, then E* is a RE, and L(E*) = (L(E))*. Closure, or “Kleene closure”: the set of strings w1w2…wn, for some n ≥ 0, where each wi is in L(E). Note: when n = 0, the string is ε.

Identities and Annihilators
• ∅ is the identity for +: R + ∅ = R.
• ε is the identity for concatenation: εR = Rε = R.
• ∅ is the annihilator for concatenation: ∅R = R∅ = ∅.

Examples: RE’s
• L(01) = {01}.
• L(01+0) = {01, 0}.
• L(0(1+0)) = {01, 00}. Note the order of precedence of operators: * binds tightest, then concatenation, then +.
• L(0*) = {ε, 0, 00, 000, …}.

Examples
• The language defined by the expression ab*a is the set of all strings of a’s and b’s that have at least two letters, that begin and end with a, and that have nothing but b’s inside (if anything at all).
• Language(ab*a) = {aa, aba, abba, abbba, …}

Examples
• a*b*
• Language(a*b*) = {ε, a, b, aa, ab, bb, aaa, aab, …}. Note: a*b* ≠ (ab)*.
• L2 = {x^odd}, the strings of x’s of odd length, can be defined as x(xx)* or (xx)*x.
• 01* + 10* denotes the language consisting of all strings that are either a single 0 followed by any number of 1’s or a single 1 followed by any number of 0’s.

Languages and Regular Expressions
S.No.   Language                  Regular Expression
1       {ε}                       ε
2       {0}                       0
3       {001}, i.e. {0}{0}{1}     001

Examples
• Write a regular expression for the following language:
1. The set of strings over the alphabet {a, b, c} containing at least one a and at least one b.
Ans: The simplest approach is to consider those strings in which the first a precedes the first b separately from those where the opposite occurs. The expression:
c*a(a+c)*b(a+b+c)* + c*b(b+c)*a(a+b+c)*

Examples
• The language of all words that have at least two a’s can be described by the expression
(a + b)*a(a + b)*a(a + b)*
= (some beginning)(the 1st important a)(some middle)(the 2nd important a)(some end)

Equivalence of RE’s and Automata
• We need to show that for every RE, there is an automaton that accepts the same language. Pick the most powerful automaton type: the ε-NFA.
• And we need to show that for every automaton, there is a RE defining its language. Pick the most restrictive type: the DFA.

Converting a RE to an ε-NFA
• The proof is an induction on the number of operators (+, concatenation, *) in the RE.
• We always construct an automaton of a special form (next slide).

Equivalence of FA’s and regex’s
• We have already shown that DFA’s, NFA’s, and ε-NFA’s are all equivalent.
• To show FA’s equivalent to regex’s we need to establish that
1. For every DFA A we can find (construct, in this case) a regex R such that L(R) = L(A).
2. For every regex R there is an ε-NFA A such that L(A) = L(R).
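The example expressions above can be sanity-checked by brute force. The following is a minimal sketch, assuming Python's `re` module as the matcher; the textbook union `+` is written `|` in Python syntax, and the helper names `all_strings` and `agrees` are illustrative, not part of the course material. It compares each expression against a plain-language predicate on every string up to a small length bound, which is evidence of correctness rather than a proof.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)

# c*a(a+c)*b(a+b+c)* + c*b(b+c)*a(a+b+c)*  in Python syntax
AT_LEAST_ONE_A_AND_ONE_B = re.compile(
    r"c*a(a|c)*b(a|b|c)*|c*b(b|c)*a(a|b|c)*"
)

# (a+b)*a(a+b)*a(a+b)*  in Python syntax
AT_LEAST_TWO_AS = re.compile(r"(a|b)*a(a|b)*a(a|b)*")

def agrees(pattern, predicate, alphabet, max_len=6):
    """True if `pattern` matches exactly the strings satisfying `predicate`,
    for all strings over `alphabet` up to length `max_len`."""
    return all(
        (pattern.fullmatch(w) is not None) == predicate(w)
        for w in all_strings(alphabet, max_len)
    )

if __name__ == "__main__":
    print(agrees(AT_LEAST_ONE_A_AND_ONE_B,
                 lambda w: "a" in w and "b" in w, "abc"))
    print(agrees(AT_LEAST_TWO_AS, lambda w: w.count("a") >= 2, "ab"))
    # a*b* and (ab)* differ: "aab" is in the first language only.
    print(re.fullmatch(r"a*b*", "aab") is not None,
          re.fullmatch(r"(ab)*", "aab") is not None)
```

`fullmatch` is used rather than `search` because L(E) is defined over whole strings, not substrings.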
Simplification Rules
• We will be needing the following simplification rules:
• (ε + R)* = R*
• R + RS* = RS*
• ∅R = R∅ = ∅ (Annihilation)
• ∅ + R = R + ∅ = R (Identity)

Convert DFA to regex
• L(A) = {x0y | x ∈ {1}* and y ∈ {0,1}*}

Observations
• There are n^3 expressions for an n-state automaton.
• We need a more efficient approach: the State Elimination Technique.

The State Elimination Technique
• When state s is eliminated, all the paths that went through s no longer exist in the automaton.
• To avoid changing the language of the automaton, add an arc from each predecessor q of s to each successor p of s.
• How to label that arc? Use a regular expression.
• The language of the automaton is the union, over all paths from the start state to an accepting state, of the language formed by concatenating the REs along the path.
• What happens when we eliminate state s?
• For each accepting state q, eliminate from the original automaton all states except q0 and q.
• To compensate, we introduce, for each predecessor qi of s and each successor pj of s, a RE that represents all the paths that start at qi, go to s, possibly loop on s, and finally go to pj.
• The expression for these paths is QiS*Pj, where Qi labels the arc from qi to s, S labels the loop on s, and Pj labels the arc from s to pj.
• Add this expression (by union) to the arc from qi to pj.

Constructing a RE from a FA
1. For each accepting state q, apply the previous reduction process to produce an equivalent automaton with RE labels on the arcs. Eliminate all states except q and the start state q0.
2. If q ≠ q0, then we shall be left with a two-state automaton in which R labels the loop on q0, S the arc from q0 to q, U the loop on q, and T the arc from q back to q0. The RE for the accepted strings can be described as (R + SU*T)*SU*.
3. If the start state is also an accepting state, then we are left with a one-state automaton whose only arc is a loop labelled R. The RE denoting the strings that it accepts is R*.
4. The desired RE is the sum (union) of all the expressions derived from the reduced automata for each accepting state, by rules (2) and (3).
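The elimination step and rules (1) to (4) above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: arc labels are Python `re` strings, `None` encodes ∅, the empty string encodes ε, and the function names (`union`, `concat`, `star`, `eliminate`, `regex_for`) are hypothetical. The sample automaton `ARCS` is also an assumption chosen for illustration: a four-state NFA accepting binary strings whose second or third symbol from the end is 1.

```python
import re
from itertools import product

def union(r, s):
    """R + S; None (∅) is the identity for +."""
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"

def concat(r, s):
    """RS; None (∅) is the annihilator, "" (ε) the identity."""
    if r is None or s is None:
        return None
    if r == "":
        return s
    if s == "":
        return r
    return f"(?:{r})(?:{s})"

def star(r):
    """R*; both ∅* and ε* simplify to ε."""
    if r is None or r == "":
        return ""
    return f"(?:{r})*"

def eliminate(arcs, s):
    """Remove state s, adding Qi S* Pj to every predecessor/successor arc."""
    loop = star(arcs.get((s, s)))
    preds = [(q, lab) for (q, t), lab in arcs.items() if t == s and q != s]
    succs = [(p, lab) for (t, p), lab in arcs.items() if t == s and p != s]
    new = {k: lab for k, lab in arcs.items() if s not in k}
    for (q, q_lab), (p, p_lab) in product(preds, succs):
        path = concat(concat(q_lab, loop), p_lab)
        new[(q, p)] = union(new.get((q, p)), path)
    return new

def regex_for(arcs, states, start, accepting):
    """Union, over accepting states q, of the RE of the reduced automaton."""
    result = None
    for q in accepting:
        reduced = dict(arcs)
        for s in states:
            if s not in (start, q):
                reduced = eliminate(reduced, s)
        r = reduced.get((start, start))
        if q == start:
            expr = star(r)                       # rule 3: R*
        else:
            s_lab = reduced.get((start, q))
            u = star(reduced.get((q, q)))
            t = reduced.get((q, start))
            back = concat(concat(s_lab, u), t)   # S U* T
            expr = concat(concat(star(union(r, back)), s_lab), u)  # rule 2
        result = union(result, expr)             # rule 4
    return result

# Assumed sample automaton: "[01]" is Python syntax for 0 + 1.
ARCS = {("A", "A"): "[01]", ("A", "B"): "1",
        ("B", "C"): "[01]", ("C", "D"): "[01]"}
EXPR = regex_for(ARCS, ["A", "B", "C", "D"], "A", ["C", "D"])

def second_or_third_from_end_is_1(w):
    return w[-2:-1] == "1" or w[-3:-2] == "1"

if __name__ == "__main__":
    print(EXPR)
```

The simplification rules are folded into the three helpers, so ∅ and ε never appear in the output expression. The result is heavily parenthesised; it is equivalent to, not textually identical with, a hand-simplified answer.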
Example 3.6
• The first step is to convert it to an automaton with regular expression labels.
• Let’s eliminate state B.
• State B has one predecessor, A, and one successor, C. Thus:
• Q1 = 1, P1 = 0 + 1, R11 = ∅ (since the arc from A to C does not exist) and S = ∅ (because there is no loop at state B).
• The resultant expression is ∅ + 1∅*(0 + 1).
• To simplify:
• The initial ∅ may be ignored in a union.
• L(∅*) = {ε} ∪ L(∅) ∪ L(∅)L(∅) ∪ … = {ε}, so ∅* is equivalent to ε.
• Thus ∅ + 1∅*(0 + 1) is equivalent to 1(0 + 1).
• Let’s eliminate state C and obtain the two-state automaton on A and D.
• The mechanics are similar to those performed to eliminate state B, and the resulting automaton is shown as follows:
• The REs are R = 0 + 1, S = 1(0 + 1)(0 + 1), T = ∅, U = ∅.
• The generic expression (R + SU*T)*SU* thus simplifies in this case to R*S, or (0 + 1)*1(0 + 1)(0 + 1).
• We can instead eliminate D to obtain the two-state automaton on A and C, with regex (0 + 1)*1(0 + 1).
• The final expression is the sum of the previous two regex’s:
• (0 + 1)*1(0 + 1)(0 + 1) + (0 + 1)*1(0 + 1)

From regex’s to ε-NFA’s
• Theorem 3.7: For every regex R we can construct an ε-NFA A, s.t. L(A) = L(R).
• Proof: By structural induction:
• Basis: Automata for ε, ∅, and a.
• Induction: Automata for R + S, RS, and R*.

Example 3.8
• Let us convert the regular expression (0 + 1)*1(0 + 1) to an ε-NFA.

Application of Regular Expressions
• Regular Expressions in UNIX
• Most real applications deal with the ASCII character set.
• UNIX regular expressions allow us to write character classes to represent large sets of characters. The rules are:
• The symbol . (dot) stands for “any character”.
• The sequence [a1a2…ak] stands for the regular expression a1 + a2 + … + ak.
• A range of the form x-y means all the characters from x to y in the ASCII sequence, e.g. digits can be expressed as [0-9].
• There are special notations for several of the most common classes of characters, e.g.
• [:digit:] is the set of ten digits, the same as [0-9].
• [:alpha:] stands for any alphabetic character, as does [A-Za-z].
• [:alnum:] stands for the digits and letters, as does [A-Za-z0-9].
• grep stands for “Global (search for) Regular Expression and Print”.
• Lexical Analysis
• Finding Patterns in Text
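The character-class rules and the idea behind grep can be tried out with a short Python sketch. It rests on a few assumptions: Python's `re` module stands in for the UNIX tools, the function `grep` here is a hypothetical stand-in that only mimics the "print matching lines" behaviour, and the sample lines are invented for illustration. Python's `re` does not implement the POSIX named classes, so they are spelled out as the equivalent ranges listed above.

```python
import re

# Explicit ranges equivalent to the POSIX named classes.
DIGIT = "[0-9]"        # same set as [:digit:]
ALPHA = "[A-Za-z]"     # same set as [:alpha:]
ALNUM = "[A-Za-z0-9]"  # same set as [:alnum:]

def grep(pattern, lines):
    """Return the lines that contain at least one match of `pattern`."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]

# Invented sample input.
SAMPLE = ["x1 = 42", "hello world", "CSE 3204", "..."]

# A letter, then letters or digits, then " = ", then one or more digits.
ASSIGNMENT = ALPHA + ALNUM + "* = " + DIGIT + DIGIT + "*"

if __name__ == "__main__":
    print(grep(ASSIGNMENT, SAMPLE))
    print(grep("C.E", SAMPLE))      # dot stands for any single character
    print(grep(DIGIT, SAMPLE))
```

Unlike the `fullmatch`-style definition of L(E), grep reports a line as soon as any substring matches, which is why `search` is used here. The same pattern-per-token idea underlies lexical analysis, where each token class (identifier, number, operator) is described by its own regular expression.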