Chapter 3 Regular languages and grammars

Section 3.1 Regular expressions
A regular expression is an algebraic expression used to represent a language that can be
accepted by a DFA. Basically, a regular expression describes the patterns used to construct
strings in the language. The notation involves a combination of strings of symbols from some
alphabet Σ, parentheses, and the operators +, •, and *. We saw this before when we
described the language consisting of the strings over Σ = {a, b} with exactly two a’s as follows:
b*ab*ab*. Here are some more examples:
1) A variable name: (letter)(letter + digit)* where * means to repeat it 0 or more times.
2) All even length strings: [(0 + 1)(0 + 1)]* where Σ = {0, 1}
3) All strings ending in 00 or 11: (0 + 1)*(00 + 11) where Σ = {0, 1}
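The three examples above translate almost directly into Python's re syntax. The following is only a sketch: the union operator + becomes |, concatenation stays implicit, and re.fullmatch is used so that the whole string has to match. The character classes chosen for "letter" and "digit" are an assumption, since the notes do not fix them.

```python
import re

# Assumed translation of the notes' notation into Python regex syntax:
# union "+" becomes "|", concatenation is juxtaposition, "*" is unchanged.
VARIABLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # (letter)(letter + digit)*
EVEN_LENGTH = re.compile(r"((0|1)(0|1))*")            # [(0 + 1)(0 + 1)]*
ENDS_00_OR_11 = re.compile(r"(0|1)*(00|11)")          # (0 + 1)*(00 + 11)

def matches(pattern, s):
    """Return True when the whole string s is in the language of pattern."""
    return pattern.fullmatch(s) is not None
```

For instance, matches(EVEN_LENGTH, "0110") holds while matches(EVEN_LENGTH, "011") does not.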
Let’s look at a recursive way of describing even parity strings. Basically, we know that both the
empty string λ and a string consisting of a single 0 have even parity. Then, to construct a
longer string of even parity, we must have an even number of 1’s in the string. (Note that they
need not be consecutive.) So 1’s must be inserted in pairs, with possibly additional 0’s
between the 1’s. Remember that we must be able to start the string with either 0 or 1. Using *
to indicate repeating a pattern 0 or more times, we have the following regular expressions for
even parity strings: 0*(10*10*)* or (0*10*10*)*0*. The trailing 0* in the second form is needed
because (0*10*10*)* alone cannot generate a nonempty string consisting only of 0’s.
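One way to gain confidence in an expression like this is a brute-force comparison against the property it is meant to capture. The sketch below checks a Python translation of 0*(10*10*)* on all binary strings up to an assumed length bound; the helper names are illustrative, and a bounded check is evidence, not a proof.

```python
import re
from itertools import product

def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            yield "".join(chars)

def mismatches(pattern, max_len=8):
    """Strings on which the regex disagrees with 'even number of 1s'."""
    regex = re.compile(pattern)
    return [s for s in binary_strings(max_len)
            if (regex.fullmatch(s) is not None) != (s.count("1") % 2 == 0)]

# 0*(10*10*)* agrees with the even-parity property on every string tested.
print(mismatches(r"0*(10*10*)*"))
```

The same helper reports the string 0 as a mismatch for (0*10*10*)*, which has no separate factor for strings made only of 0's.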
We now turn to a formal definition of regular expressions:
Definition: Let Σ be an alphabet. Then
1. ∅, λ, and a ∈ Σ are primitive regular expressions
2. If r1 and r2 are regular expressions so are r1 + r2, r1•r2, r1*, and (r1)
3. A string is a regular expression if and only if it can be derived from the primitive regular
expressions by a finite number of applications of the rules in 2.
Associated with each regular expression r is a language we denote by L(r). The languages
corresponding to the regular expressions above are:
1. The regular expression ∅ denotes the empty set, which contains no strings.
2. The regular expression λ denotes the set {λ}, which contains only the empty string.
3. For every a ∈ Σ, the regular expression a denotes the set {a}.
We now turn to regular expressions obtained using the operators +, •, and * and look at the
corresponding sets. For regular expressions r1 and r2,
4. L(r1 + r2) = L(r1) ∪ L(r2)
5. L(r1• r2) = L(r1)•L(r2) where • means concatenation
6. L((r1)) = L(r1)
7. L(r1*) = (L(r1))* where * is called Kleene closure
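Rules 1 through 7 can be read as a recursive procedure for computing L(r). The sketch below follows them on a small tuple-based syntax tree (a representation assumed here, not taken from the notes) and, because L(r) may be infinite, only returns the strings up to a chosen length. Rule 6 needs no case of its own, since parentheses only affect how the tree is built.

```python
def language(r, max_len):
    """Return the strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "empty":                      # rule 1: the empty set
        return set()
    if kind == "lambda":                     # rule 2: only the empty string
        return {""}
    if kind == "sym":                        # rule 3: the single string a
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":                      # rule 4: set union
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "concat":                     # rule 5: concatenation
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if kind == "star":                       # rule 7: Kleene closure
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:                      # stop when no new strings appear
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node {kind!r}")

a, b = ("sym", "a"), ("sym", "b")
b_star = ("star", b)
# b*ab*ab*, the "exactly two a's" expression from the start of the section
two_as = ("concat", b_star, ("concat", a, ("concat", b_star, ("concat", a, b_star))))
print(sorted(language(two_as, 3)))   # ['aa', 'aab', 'aba', 'baa']
```

The length bound is a design choice that keeps every set finite; it does not change which strings belong to the language.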
The primary use of the last four rules is to reduce regular expressions and determine strings in
a language. We won’t spend much time on reducing regular expressions.
The default precedence of the operators is closure (*), concatenation, union. Two regular
expressions are equivalent if they represent the same language. Given below are some
examples. Be sure to read those in the textbook also.
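Equivalence of two regular expressions can be explored experimentally by searching for a string that one accepts and the other rejects. The sketch below does this with Python's re module, whose operator precedence (star, then concatenation, then alternation) matches the default stated above. The function name and the length bound are assumptions of this example, and finding no difference up to the bound does not prove equivalence.

```python
import re
from itertools import product

def first_difference(pattern1, pattern2, alphabet, max_len=8):
    """Return a shortest string (up to max_len) on which the two Python
    regexes disagree, or None when no such string exists."""
    r1, r2 = re.compile(pattern1), re.compile(pattern2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (r1.fullmatch(s) is None) != (r2.fullmatch(s) is None):
                return s
    return None

# (a + b)* and (a*b*)* describe the same language, so no difference is found.
print(first_difference(r"(a|b)*", r"(a*b*)*", "ab"))
# (a + b)* and a*b* do not: ba is in the first language only.
print(first_difference(r"(a|b)*", r"a*b*", "ab"))
```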
Examples: Assume the alphabet is Σ = {a, b}. Find regular expressions for each of the
following:
1) Strings that contain consecutive a’s: (a + b)*aa(a + b)*
Thought process: we must have two a’s in a row, and any string of a’s and b’s may be
placed before or after the substring aa.
2) The complement of the language in 1) i.e. all strings over {a, b} that do not contain aa:
We have to look at this a bit differently. Clearly, λ is in the language, as are strings
containing only b’s. Thus we must make sure that the regular expression allows for this.
Also, to ensure that we do not have consecutive a’s, every time an a appears it will be
followed by a b unless that a is the last symbol in the string. Finally we must make sure it
is possible to get strings that begin or end in a. The following regular expression will work:
(b + ab)*(a + λ)
3) All strings over Σ = {a, b} in which b's occur in runs of even length. (Recall that a run is a
substring consisting of one symbol that is as long as possible for that string. In other
words, if we have the string abbbb then the run of b’s has length 4. Note that these other
strings also have runs of b’s of even length: abbabb, aabbbbbb, and aaa.) This is a class
exercise and the answer will be given there. Try it yourself first.
4) Strings in which the number of a’s is odd. Suppose we must have an odd number of a’s.
This implies there must be at least one a, so we have b*ab*(ab*ab*)* or b*a(b*ab*a)*b*.
Note that b’s must be allowed after the last a; for example, the string ab has to be generated.
Class exercise
5) Just to clarify the notation, we need to look a bit more carefully at regular expressions
using ∅. Here is exercise 7 on page 76: What languages do (∅*)* and a∅ represent?
Answer: L((∅*)*) = {λ} and L(a∅) = ∅
6) Strings that contain both aa and bb
Class exercise
7) Strings in which aa occurs at least twice. Let’s begin with the answer to 1) above: If there
is at least one pair of consecutive a’s, this regular expression works: (a + b)*aa(a + b)*.
Suppose we concatenate two copies of this expression:
(a + b)*aa(a + b)*(a + b)*aa(a + b)*. The two adjacent copies of (a + b)* in the middle are
redundant, so we can remove one of them.
However, consider the string aaa. This contains aa twice and our regular expression must
be able to handle this as well. Thus, our final regular expression will consist of two parts
connected by a +: (a + b)*aa(a + b)*aa(a + b)* + (a + b)*aaa(a + b)*. Although we may
get three consecutive a’s from the first part, we still need the second piece to get strings
like bbaaab, baaabbbb, etc.
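The point about aaa is that occurrences of aa are allowed to overlap. The sketch below counts overlapping occurrences directly and compares that count with a Python translation of the two-part expression; the helper names and the length bound are assumptions made for this illustration.

```python
import re
from itertools import product

# Assumed Python translation of
# (a + b)*aa(a + b)*aa(a + b)* + (a + b)*aaa(a + b)*
TWO_AA = re.compile(r"(a|b)*aa(a|b)*aa(a|b)*|(a|b)*aaa(a|b)*")
# The first part on its own, to see which strings it misses.
FIRST_PART = re.compile(r"(a|b)*aa(a|b)*aa(a|b)*")

def count_aa(s):
    """Count occurrences of 'aa', allowing overlaps (so 'aaa' counts as 2)."""
    return sum(1 for i in range(len(s) - 1) if s[i:i + 2] == "aa")

def strings_over(alphabet, max_len):
    """Yield every string over the alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def disagreements(pattern, max_len=8):
    """Strings where the pattern and 'aa occurs at least twice' disagree."""
    return [s for s in strings_over("ab", max_len)
            if (pattern.fullmatch(s) is not None) != (count_aa(s) >= 2)]
```

With these definitions, disagreements(TWO_AA) comes back empty, while disagreements(FIRST_PART) contains strings such as bbaaab that have only three a's.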
8) Strings containing both ab and ba. Again we have two cases depending on whether the
substring ab occurs before ba or vice versa. We also introduce the notation of putting a +
sign as an exponent: the notation a⁺ indicates we’re using one or more a’s. Here’s a
regular expression that will work for this language:
b*a⁺b⁺a(a + b)* + a*b⁺a⁺b(a + b)*
Notice that the subexpression a⁺b⁺a, for example, means one or more a’s followed by one or
more b’s followed by at least one a. This guarantees we get both ab and ba.
9) Strings over {a, b, c} in which the total number of b’s and c’s is three
This one’s pretty straightforward: we must have a total of 3 b’s and/or c’s with as many a’s
as we’d like in the rest of the string. Convince yourself that the following works:
a*(b + c)a*(b + c)a*(b + c)a*
Section 3.2 Connection between Regular expressions and regular languages
We have already shown that deterministic and nondeterministic automata are equivalent (for
every NFA there is an equivalent DFA). We now prove that regular expressions are another
representation for regular languages by showing that if r is a regular expression, then L(r) is a
regular language. The proof of this is constructive and based on the definition of L(r) above.
The NFA constructed in the proof has the following properties:
1. There is exactly one accepting state
2. There are no edges into the initial state.
3. There are no edges out of the accepting state.
The way in which regular expressions are combined leads to machines that have more states
and transitions than you would expect. For example, when we combine two expressions we
need to introduce new start and final states for the resulting NFA even though it appears we
might have been able to “reuse” one of the original start states. In doing the conversion of a
regular expression to an NFA you must follow the algorithm given in the proof below exactly.
This is the only way you can guarantee the correctness of your construction.
Theorem 3.1 Let r be a regular expression. Then there exists some NFA that accepts L(r).
Consequently, L(r) is a regular language.
Proof:
Basis: We first construct automata for the three basis cases: ∅, λ, and a where a ∈ Σ. It
should be clear the top machine accepts no strings, the middle accepts only the string λ, and
the bottom accepts only a.
Hypothesis: If a regular expression r has at most n operators then there is an equivalent NFA.
Induction step: Again there are three cases to consider, each corresponding to one of the
operations +, •, *. Because regular expressions are built by combining two regular expressions
or using the * operator, there is a “last” operation that was performed to get the final
expression. In the discussion that follows, M(ri) is an NFA that accepts L(ri).
Case 1: Let r1 and r2 be regular expressions. Consider the regular expression r1 + r2 which
has a total of n + 1 operators. Thus, r1 and r2 each has at most n operators and so by the
induction hypothesis, there are NFAs M(r1) and M(r2) that accept L(r1) and L(r2),
respectively. We construct an NFA that accepts L(r1 + r2) by introducing new start and final
states, and using λ-transitions from the new start state to the start states of M(r1) and M(r2) and
λ-transitions out of the final states of M(r1) and M(r2) to the new final state.
Case 2: Let r1 and r2 be regular expressions and consider the expression r1•r2 which has n + 1
operators. Then the following construction will produce an NFA corresponding to L(r1•r2).
Again each of r1, r2 has at most n operators so the hypothesis applies and we combine the
machines M(r1) and M(r2). We introduce new start and final states and three λ-transitions:
from the new start state to the start state of M(r1), from the final state of M(r1) to the start state
of M(r2) and from the final state of M(r2) to the new final state.
Case 3: Consider the regular expression r1* with n + 1 operators. Since r1 has n operators,
the hypothesis holds so we have an NFA corresponding to L(r1) which we modify as indicated
in the diagram below. Note that my figure differs from that in the text since I am building
machines with no transitions out of the final state.
Explanation of the last diagram. To get the loop we need the λ-transition from the final
state to the start state of M(r1). Since we can't have a transition out of a final state we need the
new final state and a λ-transition into it from a new start state. We need the new start state
because if we put in a λ-transition directly from the start state in M(r1) to the new final state, we
could accidentally accept strings not in L(r1*). This would occur if, in the course of going through
the machine, we went back through the original start state.
Example: Construct an automaton corresponding to (a • b)*(a + b)
We need a machine for each of the following stages: a, b, a•b, (a•b)*, a + b, (a•b)*(a + b)
The machine on the left below accepts only a and the one on the right accepts only b.
Next, we combine the machines to get a machine for the expression a•b
Now, for (a•b)* we have the following machine:
Given below is a machine for a + b, M(a+b). All that remains to finish the machine is to
introduce new start and final states and connect the machine above with the one below by
putting a λ-move from the final state of M((a•b)*) to the start state of M(a+b).
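The construction in the proof can be sketched in code. The Builder class below is a hypothetical illustration: it follows one reading of the three cases (the original diagrams are not reproduced here), always adds a new start and a new final state, and uses None as the label of a λ-transition. It then assembles the machine for (a•b)*(a + b) in the same stages as the example.

```python
from collections import defaultdict
from itertools import count

LAMBDA = None  # label for lambda-transitions (a naming choice of this sketch)

class Builder:
    """Builds NFAs with one start state, one final state, no edges into
    the start state and no edges out of the final state."""

    def __init__(self):
        self.delta = defaultdict(set)   # (state, label) -> set of next states
        self._ids = count()

    def _pair(self):
        return next(self._ids), next(self._ids)

    def _edge(self, p, label, q):
        self.delta[(p, label)].add(q)

    def symbol(self, a):                # basis case for a single symbol
        s, f = self._pair()
        self._edge(s, a, f)
        return s, f

    def union(self, m1, m2):            # Case 1: r1 + r2
        s, f = self._pair()
        for start, final in (m1, m2):
            self._edge(s, LAMBDA, start)
            self._edge(final, LAMBDA, f)
        return s, f

    def concat(self, m1, m2):           # Case 2: r1 . r2
        s, f = self._pair()
        self._edge(s, LAMBDA, m1[0])
        self._edge(m1[1], LAMBDA, m2[0])
        self._edge(m2[1], LAMBDA, f)
        return s, f

    def star(self, m):                  # Case 3: r1*
        s, f = self._pair()
        self._edge(s, LAMBDA, m[0])     # enter M(r1)
        self._edge(m[1], LAMBDA, m[0])  # loop back for another pass
        self._edge(m[1], LAMBDA, f)     # leave after one or more passes
        self._edge(s, LAMBDA, f)        # zero passes
        return s, f

    def closure(self, states):
        """All states reachable from the given ones by lambda-moves."""
        stack, seen = list(states), set(states)
        while stack:
            p = stack.pop()
            for q in self.delta.get((p, LAMBDA), ()):
                if q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    def accepts(self, machine, text):
        start, final = machine
        current = self.closure({start})
        for ch in text:
            moved = set()
            for p in current:
                moved |= self.delta.get((p, ch), set())
            current = self.closure(moved)
        return final in current

nfa = Builder()
ab_star = nfa.star(nfa.concat(nfa.symbol("a"), nfa.symbol("b")))   # (a.b)*
a_or_b = nfa.union(nfa.symbol("a"), nfa.symbol("b"))               # a + b
machine = nfa.concat(ab_star, a_or_b)                              # (a.b)*(a + b)
```

As in the notes, the resulting machine has more states than a hand-built one would; that is the price of following the construction mechanically.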
Obtaining regular expressions from automata
There are actually two methods of obtaining a regular expression from a transition graph for a
regular language. One is an inductive method and will be omitted here. The basic idea is to
find a regular expression that is capable of generating the labels of all walks from q0 to a final
state. The method used to show this is to create a generalized transition graph, a graph in
which the edges are labeled by regular expressions rather than by alphabet symbols. Thus, a
walk from the initial state to a final state can be represented as the concatenation of several
regular expressions. If there are two or more paths between a pair of vertices, then we
connect the labels of those paths with a +. Eventually, the regular expression obtained will
correspond to the labels of all paths from the start state to a final state.
As a simple example, let’s look at eliminating vertex q2 in the figure below. Notice that q1q2q3
is the only path passing through q2.
To eliminate q2 in this case, we insert an edge from q1 to q3 and label it cc to get the result
below.
Suppose the original machine had a loop on q2 as in the diagram on the left below. Then,
removing that state gives the machine on the right.
Theorem 3.2: Let L be a regular language. Then there exists a regular expression r such that
L = L(r).
Proof: Let M be an NFA that accepts L. Without loss of generality assume M has only one final
state and that q0 ∉ F. We use the vertex removal construction. To remove a vertex q, we
need to find all paths of length 2 with q as the intermediate vertex. Suppose qi q pi is such a
path. Add an edge from qi to pi with a label obtained as follows: If there is no loop from q to
itself, then the new edge is labeled by the concatenation of the expressions on the edges
being removed. If there is a loop, then we label the new edge by e1(e2)*e3 where e1 is the
expression that labels the edge from qi to q, e3 is the label of the edge from q to pi, and e2 is
the label on the loop at q. Note that qi and pi do not need to be distinct states since we can
have a cycle from a vertex back to itself. Basically this process continues until only the start
state and the final state remain. We’ll stop here rather than go through the formal, rigorous
method of obtaining the path labels.
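The vertex removal step described in the proof can be sketched as a small function. The representation is an assumption of this example: a generalized transition graph is stored as a dictionary from (source, target) pairs to label strings in the notation of these notes, and a label is parenthesized only when it contains a +.

```python
def wrap(label):
    """Parenthesize a label that contains a union so concatenation is safe."""
    return f"({label})" if "+" in label else label

def remove_state(edges, q):
    """Eliminate state q from a generalized transition graph.

    edges maps (source, target) to a regular-expression string.
    A new dictionary is returned; the argument is left unchanged.
    """
    loop = edges.get((q, q))
    middle = f"({loop})*" if loop is not None else ""
    # Keep every edge that does not touch q.
    result = {pair: label for pair, label in edges.items() if q not in pair}
    incoming = [(p, e) for (p, t), e in edges.items() if t == q and p != q]
    outgoing = [(r, e) for (s, r), e in edges.items() if s == q and r != q]
    for p, e1 in incoming:
        for r, e3 in outgoing:
            bypass = f"{wrap(e1)}{middle}{wrap(e3)}"      # e1(e2)*e3
            if (p, r) in result:                          # parallel edge: use +
                result[(p, r)] = f"{result[(p, r)]} + {bypass}"
            else:
                result[(p, r)] = bypass
    return result

graph = {("qi", "q"): "e1", ("q", "q"): "e2", ("q", "pi"): "e3"}
print(remove_state(graph, "q"))   # {('qi', 'pi'): 'e1(e2)*e3'}
```

Returning a new dictionary keeps each intermediate graph available for inspection, which is convenient when checking a hand computation step by step.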
Example 3.10 on page 82. Let’s consider this example from the text. Recall that the machine
accepts all strings with an even number of a’s and odd number of b’s
Let’s begin by removing the state OE. We need to consider all pairs of the other states for which a
path of length two passes through OE. Here are the paths:
EE → OE → EE, so we put a loop on EE labeled aa
OO → OE → OO, so we put a loop on OO labeled bb
EE → OE → OO, so we put an edge from EE to OO labeled ab
OO → OE → EE, so we put an edge from OO to EE labeled ba
After doing this we get the following machine:
Since EE is the start state and EO is the final state, we need to remove OO, so let’s look at the
paths going through OO:
EE → OO → EE: since there is a loop on OO, add ab(bb)*ba to the label on EE.
That label is now aa + ab(bb)*ba
EO → OO → EO: since there is a loop on OO, we get a loop on EO labeled a(bb)*a
EE → OO → EO: the regular expression ab(bb)*a is added to the original label of the edge from EE
to EO, giving us this regular expression on the edge: b + ab(bb)*a
EO → OO → EE: adding a(bb)*ba to the original label of the edge from EO to EE, we get b +
a(bb)*ba. The final diagram is:
Obtaining the entire regular expression requires describing all walks from EE to EO in this
final two-state graph and combining the labels along them.
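As a sketch of that last step, the four labels from the final graph can be combined with the usual closed form for a two-state graph, r1*r2(r4 + r3r1*r2)*, where r1 is the loop on the start state, r2 the edge to the final state, r3 the edge back, and r4 the loop on the final state. This formula is not stated in these notes and is used here as an assumption; the code then checks the result by brute force against "even number of a's and odd number of b's" up to an assumed length bound.

```python
import re
from itertools import product

# Labels read off the final two-state graph, rewritten in Python
# syntax ("+" becomes "|").
R1 = "aa|ab(bb)*ba"   # loop on EE
R2 = "b|ab(bb)*a"     # edge EE -> EO
R3 = "b|a(bb)*ba"     # edge EO -> EE
R4 = "a(bb)*a"        # loop on EO

# Assumed closed form for a two-state graph: r1* r2 (r4 + r3 r1* r2)*
FINAL = re.compile(f"({R1})*({R2})(({R4})|({R3})({R1})*({R2}))*")

def disagreements(max_len=8):
    """Strings where FINAL and the parity condition disagree."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            expected = s.count("a") % 2 == 0 and s.count("b") % 2 == 1
            if (FINAL.fullmatch(s) is not None) != expected:
                bad.append(s)
    return bad

print(disagreements())
```

An empty list from disagreements() is consistent with the hand computation above, though a bounded search is not a proof.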
Read the section in the text on p. 86 about using regular expressions to describe simple
patterns.