Regular Expressions

advertisement
2.5 Regular Expressions
A language L accepted by an FA M is a regular language. An FA
M is a recognizer that can be used to determine whether a given
string  is in L or not.
Although an FA M can be used to represent a regular language L.
But it is not easy to describe what the set L is according to the
machine M. For example, can you describe the set L 1 accepted by
the following FA M 1?
0
q0
q1
1
1
1
q3
1
0
0
q2
0
q4
0
0, 1
1
q5
It is not easy to understand what a program does. In a sense an FA
M is a program. It is still not easy to understand what an FA does.
Therefore, an FA is not a good descriptor to represent the set
accepted by an FA is.
The language L 1 accepted by the machine M 1.can be described as
the set of strings of 0’s and 1’s either ending with 00 or containing
a substring 010.
Regular expressions are as good and powerful as natural language
to describe regular languages. Therefore, the regular expression is
a pretty good descriptor for regular languages.
Theorem 2 and theorem 3 will prove that both regular expressions
and finite state automata define the same class of regular
languages.
Definition 9 : The regular expressions over an alphabet  are
defined recursively by the following:
(1)  is a regular expression and represents the empty language.
(2)  is a regular expression and represents the language {}.
(3) a   is a regular expression and represents the language {a}.
(4) If r and s are regular expressions representing the sets R and
S, then (r+s), (r  s) and (r*) are regular expressions represents
the languages RS, RS and R*, respectively.
Definition 10 : If r is a regular expression, then the set
represented by r is denoted by L( r ).
Note : Regular expressions are strings over the alphabet
{ (, ), +, , *, , }  .
Left parenthesis ( and right parenthesis ) are parts of regular
expression.
Convention : As long as there is no ambiguity in a regular
expression, we may omit parentheses according to the precedence
rule that * is higher than concatenation  and +, and concatenation 
is higher than +. And we may also omit concatenation  .
Convention : In order to distinguish regular expressions from
input strings, regular expressions are written in red.
Example 1 : Find a regular expression over an alphabet ={0, 1}
representing the set L of strings of 0’s and 1’s either ending with 00
or containing a substring 010.
Solution :
First, consider the set L 1 of strings ending with 00. A possible
regular expression representing the set L 1 is (0+1)*(00).
Second, consider the set L 2 of strings containing a substring
010. A possible regular expression representing the set L 2 is
(0+1)* (010) (0+1)*.
Third, L = L 1  L 2. A regular expression representing the set L is
( (0+1)*(00) + (0+1)* (010) (0+1)* ).
Theorem 2 : Let r be a regular expression. Then there is an NFA M
that accepts L( r ).
Proof :
The proof is based on the definition of regular expressions.
And we prove the theorem by induction on the number of
operators in the regular expressions.
First, there is no operator in a regular expression.
(1) consider the regular expression  that denotes the set { }.
Define an NFA M = (, Q, , q 0, F) as follows.
q0
Then we have that L( M ) = L(  )= .
(2) consider the regular expression  that denotes the set { }.
Define an NFA M = (, Q, , q 0, F) as follows.
q0
Then we have that L( M ) = L( )= { }.
(3) consider the regular expression a   that denotes the set {a }.
Define an NFA M = (, Q, , q 0, F) as follows.
q0
a
qf
Then we have that L( M ) = L( a )= {a }.
Second, assume that the theorem is true for regular expressions
containing at most n operators.
Third, consider the induction step. If regular expressions both r and
s contain at most n operators, then there is an NFA M 1 = (, Q1 , 1,
q 01, F 1={q f1}) such that L( M 1 )=L( r ) and an NFA M 2 = (, Q2 ,
2, q 02, F 2 = {q f2}) such that L( M 2 ) = L( s ), where Q1 Q2 = .
(1) consider the regular expression (r+s).
Let NFA M = (, Q, , q 0, F) as follows, Q= {q 0, q f}  Q1 Q2.

q 01
q0
M1
q f1

qf

q 02
M2
q f2

Then we have that L( M ) = L(M 1)  L(M 2) = L((r+s)).
(2) consider the regular expression (rs).
Let NFA M = (, Q, , q 0, F) as follows, Q= {q 0, q f}  Q1 Q2.

q0
q 01
M1
q f1

q 02
M2
q f2

qf
Then we have that L( M ) = L(M 1) L(M 2) = L((rs)).
(3) consider the regular expression (r*).
Let NFA M = (, Q, , q 0, F) as follows, Q= {q 0, q f}  Q1.
q0


q 01
M1
q f1


qf
Then we have that L( M ) = L(M 1)* = L((r*)).
Theorem 3 : If L=L(M) for some DFA M= (, Q, , q 0, F), then
there is a regular expression r such that L= L( r ).
Proof :
The proof is based on dynamic programming method to
find an equivalent regular expression for a DFA.
Given a DFA M = (, Q, , q 1, F), where Q={q 1, q 2, ..., q n},
i.e., |Q| = n.
Define regular expressions R i, j k for the DFA M = (, Q, ,
q 0, F), where 1  i, j  n and 0  k  n, by
R i, j k = {*|  (q i, ) = q j, and for any prefix  of , 
  , and   , if  (q i, ) = q s, then s  k. }
R i, j k can also be defined recursively by:
(1) for k = 0,
(a) R i, j 0 = {a  | ( q i, a) = q j}, if i  j, or
(b) R i, j 0 = {a  | ( q i, a) = q j} {}, if i = j .
The set R i, j 0 is finite, and there is an equivalent regular
expression r i, j 0 for R i, j 0. The regular expression r i, j 0 =
a 1 + … +a t, or r i, j 0 = a 1 + … +a t +  , where ( q i, a s)
= q j , 1  s  t.
(2) for k > 0,
R i, j k = R i, k k-1 ( R k, k k-1 )* R k, j k-1  R i, j k-1, there is an
equivalent regular expression r i, j k for R i, j k , where r i, j k
= r i, k k-1 (r k, k k-1 )* r k, j k-1 + r i, j k-1.
By the definition of R i, j k , we know that R i, j k is regular for
1  i, j  n and 0  k  n, and there is a corresponding
equivalent regular expression r i, j k .
By the definition of R i, j k , we have that the set accepted by
the machine M is L = L(M) = {*|  (q 1, ) = q jF} =
 qjF R 1, j n.
For each final state q jF = {qs1, qs2, …, qst}, there is an
equivalent regular expression r 1, j n for R 1, j n. Therefore,
there is an equivalent regular expression r 1, s1 n + r 1, s2 n
+ … + r 1, st n for L.
Algorithm : The algorithm to compute a regular expression r for
some DFA M= (, Q, , q 0, F) such that L( r ) = L(M) is as follows.
Step 1 : k = 0,
for i = 1 to n do
for j = 1 to n do
compute r i, j 0
Step 2 : for k = 1 to n do
for i = 1 to n do
for j = 1 to n do
compute r i, j k = r i, k k-1 (r k, k k-1 )* r k, j k-1 + r i, j k-1
The time bound of the algorithm is O(n 3).
Example 2 : Find a regular expression representing the set L over
an alphabet ={0, 1} accepted by the following DFA M.
1
q1
0
0
q2
1
0
1
Solution :
r(1,1,k)
r(1,2,k)
r(1,3,k)
r(2,1,k)
r(2,2,k)
r(2,3,k)
r(3,1,k)
r(3,2,k)
r(3,3,k)
q3
k=0 k=1 k=2
 
0(00)*0+
0 0
0(00)*
0(00)*(1+01)+1
1 1
0 0
(00)*0
 00+ (00)*
1 1+01 (00)*(1+01)
  1(00)*0
1 1
1(00)*
0+ 0+ 1(00)*(1+01)+0+ 
r 1, 3 3 = r 1, 3 2 +
r 1, 3 2 (r 3, 3 2 )* r 3, 3 2
= (0(00)*(1+01)+1)+(0(00)*
(1+01)+1)(1(00)*(1+01)+0+
)*(1(00)*(1+01)+0+)
= (0(00)*(1+01)+1) (1(00)*
(1+01)+0)*
= (0*1) (1(00)* (1+01)+0)*
= (0*1) (10*1+0)*
Download