Chapter 3 Regular languages and grammars

Section 3.1 Regular expressions

A regular expression is an algebraic expression used to represent a language that can be accepted by a DFA. Basically, a regular expression describes the patterns used to construct the strings in the language. The notation combines strings of symbols from some alphabet Σ, parentheses, and the operators +, •, and *. We saw this before when we described the language consisting of the strings over Σ = {a, b} with exactly two a's as b*ab*ab*. Here are some more examples:

1) A variable name: (letter)(letter + digit)*, where * means to repeat 0 or more times.
2) All even length strings: [(0 + 1)(0 + 1)]*, where Σ = {0, 1}.
3) All strings ending in 00 or 11: (0 + 1)*(00 + 11), where Σ = {0, 1}.

Let's look at a recursive way of describing even parity strings. Basically, we know that both the empty string and a string consisting of a single 0 have even parity. Then, to construct a longer string of even parity, we must have an even number of 1's in the string. (Note that they need not be consecutive.) So 1's must be inserted in pairs, with possibly additional 0's between the 1's. Remember that we must be able to start the string with either 0 or 1. Using * to indicate repeating a pattern 0 or more times, we have the following regular expressions for even parity strings: (0*10*10*)*0* or 0*(10*10*)*. (The trailing 0* in the first expression is needed so that strings with no 1's, such as 0, are included.)

We now turn to a formal definition of regular expressions.

Definition: Let Σ be an alphabet. Then
1. ∅, λ, and a ∈ Σ are primitive regular expressions.
2. If r1 and r2 are regular expressions, so are r1 + r2, r1•r2, r1*, and (r1).
3. A string is a regular expression if and only if it can be derived from the primitive regular expressions by a finite number of applications of the rules in 2.

Associated with each regular expression r is a language we denote by L(r). The languages corresponding to the primitive regular expressions are:
1. The regular expression ∅ denotes the empty set, which contains no strings.
2. The regular expression λ denotes the set {λ}, which contains only the empty string.
3. For every a ∈ Σ, the regular expression a denotes the set {a}.

We now turn to regular expressions obtained using the operators +, •, and * and look at the corresponding sets. For regular expressions r1 and r2,
4. L(r1 + r2) = L(r1) ∪ L(r2)
5. L(r1•r2) = L(r1)•L(r2), where • means concatenation
6. L((r1)) = L(r1)
7. L(r1*) = (L(r1))*, where * is called Kleene closure

The primary use of the last four rules is to reduce regular expressions and determine strings in a language. We won't spend much time on reducing regular expressions. The default precedence of the operators, from highest to lowest, is closure (*), concatenation, union. Two regular expressions are equivalent if they represent the same language. Given below are some examples. Be sure to read those in the textbook also.

Examples: Assume the alphabet is Σ = {a, b}. Find regular expressions for each of the following:

1) Strings that contain consecutive a's: (a + b)*aa(a + b)*
Thought process: we must have two a's in a row, and any string of a's and b's may be placed before or after the substring aa.

2) The complement of the language in 1), i.e. all strings over {a, b} that do not contain aa. We have to look at this a bit differently. Clearly, λ is in the language, as are strings containing only b's. Thus we must make sure that the regular expression allows for this. Also, to ensure that we do not have consecutive a's, every time an a appears it must be followed by a b unless that a is the last symbol in the string. Finally, we must make sure it is possible to get strings that begin or end in a. The following regular expression will work: (b + ab)*(a + λ)

3) All strings over Σ = {a, b} in which b's occur in runs of even length. Recall that a run is a substring consisting of one symbol that is as long as possible for that string. In other words, if we have the string abbbb then the run of b's has length 4.
Note that these other strings also have runs of b's of even length: abbabb, aabbbbbb, and aaa. This is a class exercise and the answer will be given there. Try it yourself first.

4) Strings in which the number of a's is odd. Suppose we must have an odd number of a's. This implies there must be at least one a, so we have b*ab*(ab*ab*)* or b*a(b*ab*a)*b*. (In the first expression the b* must follow the first a directly; otherwise a string such as ab could not be generated.) Class exercise.

5) Just to clarify the notation, we need to look a bit more carefully at regular expressions using ∅. Here is exercise 7 on page 76: What languages do (∅*)* and a∅ represent? Answer: (∅*)* = {λ} and a∅ = ∅.

6) Strings that contain both aa and bb. Class exercise.

7) Strings in which aa occurs at least twice. Let's begin with the answer to 1) above: if there is at least one pair of consecutive a's, this regular expression works: (a + b)*aa(a + b)*. Suppose we concatenate two copies of this expression: (a + b)*aa(a + b)*(a + b)*aa(a + b)*. We can remove one of the two adjacent copies of (a + b)* in the middle, since (a + b)*(a + b)* is equivalent to (a + b)*. However, consider the string aaa. This contains aa twice and our regular expression must be able to handle this as well. Thus, our final regular expression will consist of two parts connected by a +: (a + b)*aa(a + b)*aa(a + b)* + (a + b)*aaa(a + b)*. Although we may get three consecutive a's from the first part, we still need the second piece to get strings like bbaaab, baaabbbb, etc.

8) Strings containing both ab and ba. Again we have two cases, depending on whether the substring ab occurs before ba or vice versa. We also introduce the notation of putting a + sign as an exponent: the notation a⁺ indicates one or more a's. Here's a regular expression that will work for this language: b*a⁺b⁺a(a + b)* + a*b⁺a⁺b(a + b)*. Notice that the subexpression a⁺b⁺a, for example, means one or more a's followed by one or more b's followed by at least one a.
This guarantees we get both ab and ba.

9) Strings over {a, b, c} in which the total number of b's and c's is three. This one's pretty straightforward: we must have a total of 3 b's and/or c's, with as many a's as we'd like in the rest of the string. Convince yourself that the following works: a*(b + c)a*(b + c)a*(b + c)a*

Section 3.2 Connection between regular expressions and regular languages

We have already shown that deterministic and nondeterministic automata are equivalent (for every NFA there is an equivalent DFA). We now prove that regular expressions are another representation for regular languages by showing that if r is a regular expression, then L(r) is a regular language. The proof of this is constructive and based on the definition of L(r) above. The NFA constructed in the proof has the following properties:
1. There is exactly one accepting state.
2. There are no edges into the initial state.
3. There are no edges out of the accepting state.

The way in which regular expressions are combined leads to machines that have more states and transitions than you would expect. For example, when we combine two expressions we need to introduce new start and final states for the resulting NFA, even though it appears we might have been able to "reuse" one of the original start states. In converting a regular expression to an NFA you must follow the algorithm given in the proof below exactly. This is the only way you can guarantee the correctness of your construction.

Theorem 3.1: Let r be a regular expression. Then there exists some NFA that accepts L(r). Consequently, L(r) is a regular language.

Proof:
Basis: We first construct automata for the three basis cases: ∅, λ, and a, where a ∈ Σ. It should be clear that the top machine accepts no strings, the middle accepts only the string λ, and the bottom accepts only a.

Hypothesis: If a regular expression r has at most n operators, then there is an equivalent NFA.
Induction step: Again there are three cases to consider, each corresponding to one of the operations +, •, *. Because regular expressions are built by combining two regular expressions or using the * operator, there is a "last" operation that was performed to get the final expression. In the discussion that follows, M(ri) is an NFA that accepts L(ri).

Case 1: Let r1 and r2 be regular expressions. Consider the regular expression r1 + r2, which has a total of n + 1 operators. Thus, r1 and r2 each have at most n operators, and so by the induction hypothesis there are NFAs M(r1) and M(r2) that accept L(r1) and L(r2), respectively. We construct an NFA that accepts L(r1 + r2) by introducing new start and final states, and using λ-transitions from the new start state to the start states of M(r1) and M(r2) and λ-transitions out of the final states of M(r1) and M(r2) to the new final state.

Case 2: Let r1 and r2 be regular expressions and consider the expression r1•r2, which has n + 1 operators. Then the following construction will produce an NFA corresponding to L(r1•r2). Again each of r1, r2 has at most n operators, so the hypothesis applies and we combine the machines M(r1) and M(r2). We introduce new start and final states and three λ-transitions: from the new start state to the start state of M(r1), from the final state of M(r1) to the start state of M(r2), and from the final state of M(r2) to the new final state.

Case 3: Consider the regular expression r1* with n + 1 operators. Since r1 has n operators, the hypothesis holds, so we have an NFA corresponding to L(r1), which we modify as indicated in the diagram below. Note that my figure differs from that in the text since I am building machines with no transitions out of the final state.

Explanation of the last diagram: To get the loop we need the λ-transition from the final state to the start state of M(r1).
Since we can't have a transition out of a final state, we need the new final state and a λ-transition into it from a new start state. We need the new start state because if we put in a λ-transition directly from the start state in M(r1) to the new final state, we could accidentally accept strings not in L(r1*). This would occur if, in the course of going through the machine, we went back through the original start state.

Example: Construct an automaton corresponding to (a•b)*(a + b).
We need a machine for each of the following stages: a, b, a•b, (a•b)*, a + b, (a•b)*(a + b).
The machine on the left below accepts only a and the one on the right accepts only b.
Next, we combine the machines to get a machine for the expression a•b.
Now, for (a•b)* we have the following machine:
Given below is a machine for a + b, M(a + b).
All that remains to finish the machine is to introduce new start and final states and connect the machine above with the one below by putting a λ-move from the final state of M((a•b)*) to the start state of M(a + b).

Obtaining regular expressions from automata

There are actually two methods of obtaining a regular expression from a transition graph for a regular language. One is an inductive method and will be omitted here. For the other, the basic idea is to find a regular expression that is capable of generating the labels of all walks from q0 to a final state. The method used to show this is to create a generalized transition graph, a graph in which the edges are labeled by regular expressions rather than by alphabet symbols. Thus, a walk from the initial state to a final state can be represented as the concatenation of several regular expressions. If there are two or more paths between a pair of vertices, then we connect the labels of those paths with a +. Eventually, the regular expression obtained will correspond to the labels of all paths from the start state to a final state. As a simple example, let's look at eliminating vertex q2 in the figure below.
Notice that q1 q2 q3 is the only path passing through q2. To eliminate q2 in this case, we insert an edge from q1 to q3 and label it cc to get the result below. Suppose the original machine had a loop on q2 as in the diagram on the left below. Then, removing that state gives the machine on the right.

Theorem 3.2: Let L be a regular language. Then there exists a regular expression r such that L = L(r).

Proof: Let M be an nfa that accepts L. Without loss of generality assume M has only one final state and that q0 ∉ F. We use the vertex removal construction. To remove a vertex q, we need to find all paths of length 2 with q as the intermediate vertex. Suppose qi q pi is such a path. Add an edge from qi to pi with a label obtained as follows: If there is no loop from q to itself, then the new edge is labeled by the concatenation of the expressions on the edges being removed. If there is a loop, then we label the new edge by e1(e2)*e3, where e1 is the expression that labels the edge from qi to q, e3 is the label of the edge from q to pi, and e2 is the label on the loop at q. Note that qi and pi do not need to be distinct states, since we can have a cycle from a vertex back to itself. This process continues until only the start state and the final state remain. We'll stop here rather than go through the formal rigorous method of obtaining the path labels.

Example 3.10 on page 82. Let's consider this example from the text. Recall that the machine accepts all strings with an even number of a's and an odd number of b's. Let's begin by removing the state OE. We need to consider all pairs of the remaining states for which a path of length two passes through OE.
Here are the paths:
EE OE EE, so we put a loop on EE labeled aa
OO OE OO, so we put a loop on OO labeled bb
EE OE OO, so we put an edge from EE to OO labeled ab
OO OE EE, so we put an edge from OO to EE labeled ba

After doing this we get the following machine:

Since EE is the start state and EO is the final state, we need to remove OO, so let's look at the paths going through OO:
EE OO EE: since there is a loop on OO, add ab(bb)*ba to the label of the loop on EE. That label is now aa + ab(bb)*ba.
EO OO EO: since there is a loop on OO, we get a loop on EO labeled a(bb)*a.
EE OO EO: the regular expression ab(bb)*a is added to the original label of the edge from EE to EO, giving us this regular expression on the edge: b + ab(bb)*a.
EO OO EE: adding a(bb)*ba to the original label of the edge from EO to EE, we get b + a(bb)*ba.

The final diagram is:

To get the entire regular expression, we must account for all walks from EE to EO in this two-state graph: labels along a walk are concatenated, alternative walks are joined with +, and loops contribute a *.

Read the section in the text on p. 86 about using regular expressions to describe simple patterns.
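As a sanity check on Example 3.10, here is a minimal Python sketch that assembles a regular expression from the four edge labels found above and compares it by brute force with the direct description "even number of a's and odd number of b's". It assumes the standard formula for a two-state generalized transition graph, r = r1*r2(r4 + r3r1*r2)*, where r1 is the loop on the start state, r2 the edge from start to final, r3 the edge from final back to start, and r4 the loop on the final state; that formula is not stated in these notes. The helper names strings_up_to and agrees are made up for this illustration, and the + of these notes is written | in Python's re syntax. A brute-force check over short strings can expose a wrong expression but does not prove a correct one.

```python
import re
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def agrees(pattern, predicate, alphabet="ab", max_len=8):
    """True if `pattern` fully matches exactly the strings satisfying `predicate`
    (checked only for strings of length at most max_len)."""
    compiled = re.compile(pattern)
    return all(
        (compiled.fullmatch(w) is not None) == predicate(w)
        for w in strings_up_to(alphabet, max_len)
    )


# Edge labels of the two-state graph left after removing OE and OO.
r1 = "(?:aa|ab(?:bb)*ba)"  # loop on EE:      aa + ab(bb)*ba
r2 = "(?:b|ab(?:bb)*a)"    # edge EE -> EO:   b + ab(bb)*a
r3 = "(?:b|a(?:bb)*ba)"    # edge EO -> EE:   b + a(bb)*ba
r4 = "(?:a(?:bb)*a)"       # loop on EO:      a(bb)*a

# Assumed two-state formula: r = r1* r2 (r4 + r3 r1* r2)*
r = f"{r1}*{r2}(?:{r4}|{r3}{r1}*{r2})*"


def even_a_odd_b(w):
    """Direct description of the language of Example 3.10."""
    return w.count("a") % 2 == 0 and w.count("b") % 2 == 1


checks = {
    "Example 3.10 (even a's, odd b's)": (r, even_a_odd_b),
    # (a + b)*aa(a + b)* from Section 3.1, example 1
    "contains aa": ("(?:a|b)*aa(?:a|b)*", lambda w: "aa" in w),
    # (b + ab)*(a + lambda) from example 2; (a + lambda) is written a?
    "does not contain aa": ("(?:b|ab)*a?", lambda w: "aa" not in w),
}

for name, (pattern, predicate) in checks.items():
    print(name, agrees(pattern, predicate))
```

The same agrees helper can be pointed at your own answers to the class exercises: write the expression in Python syntax, supply a predicate that describes the language directly, and look for a disagreement.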