courses:cs240-201601:regular-expressions.pptx (1.7 MB)

Regular Show
Err,
Regular Expressions
• An algebraic equivalent to finite
automata
– Useful as a language for describing simple
but useful patterns in text…
– e.g. How can one tell if an email address is a
syntactically valid email address?
– e.g. How can one update the copyright statement
(or a copyleft statement) to add the current year in
thousands of programs?
Regular Languages
Recall: A language is called a
regular language if some finite
automaton accepts it.
Regular expressions describe regular languages
Operators and Operands
If E is a regular expression, then L(E) denotes
the language that E stands for.
Expressions are built as follows:
An operand can be:
1. A variable, standing for a language.
2. A symbol, standing for itself as a set of strings, i.e., a
stands for the language {a} (formally, L(a) = {a}).
3. ε, standing for {ε} (a language).
4. ∅, standing for ∅ (the empty language).
The operators are:
1. + or ∪, standing for union. L(E+F) = L(E) ∪ L(F).
2. · or juxtaposition (i.e., no operator symbol, as in xy to mean x “times” y) to stand for concatenation. L(EF) = L(E)L(F), where the concatenation of languages L and M is {xy | x is in L and y is in M}.
3. * to represent closure. L(E*) = (L(E))*, where L* = {ε} ∪ L ∪ LL ∪ LLL ∪ … .
Parentheses may be used to alter grouping, which by
default is * (highest precedence), then concatenation, then
union (lowest precedence).
Formal Definition of REs
R is a regular expression if R is
1. a for some a ∈ Σ
2. ε
3. ∅
4. (R1 ∪ R2) where R1 and R2 are regular
expressions
5. (R1 R2) where R1 and R2 are regular
expressions
6. (R1*) where R1 is a regular expression

Every regular expression arises by a finite number of
applications of these 6 rules
Said Another way…
A Regular Expression describes a language.
Which one? i.e. L(R) = ?
Apply these recursively:
1. L(a) = {a}
2. L(ε) = {ε}
3. L(∅) = { }
4. L(R1 | R2) = L(R1) ∪ L(R2)
5. L(R1 R2) = L(R1) L(R2)
6. L(R1*) = L(R1)*
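The six rules can be run directly as a recursive function. The sketch below is only an illustration: the nested-tuple encoding of expressions ("sym", "eps", "empty", "union", "concat", "star") and the name `language` are assumptions made here, and since L(R) may be infinite the function returns only the strings of length at most `max_len`.

```python
def language(r, max_len):
    """Return the strings of L(r) whose length is <= max_len.

    r is a nested tuple (a hypothetical encoding, not from the slides):
    ("sym", "a"), ("eps",), ("empty",), ("union", r1, r2),
    ("concat", r1, r2), ("star", r1).
    """
    kind = r[0]
    if kind == "sym":                      # rule 1: L(a) = {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "eps":                      # rule 2: L(ε) = {ε}
        return {""}
    if kind == "empty":                    # rule 3: L(∅) = { }
        return set()
    if kind == "union":                    # rule 4: union of the two languages
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "concat":                   # rule 5: all xy with x in L1, y in L2
        left = language(r[1], max_len)
        right = language(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                     # rule 6: {ε} ∪ L ∪ LL ∪ ...
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:                    # stops once no new short string appears
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind: {kind}")


# ((a(b*))+a) in the tuple encoding
a_bstar_or_a = ("union",
                ("concat", ("sym", "a"), ("star", ("sym", "b"))),
                ("sym", "a"))
short_strings = language(a_bstar_or_a, 3)
```

Bounding the length is a design choice that keeps every set finite, so the star case can simply iterate until nothing new is produced.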


Example
R is a regular expression if R is
1. a for some a ∈ Σ
2. ε
3. ∅
4. (R1 ∪ R2) where R1 and R2 are regular expressions
5. (R1 R2) where R1 and R2 are regular expressions
6. (R1*) where R1 is a regular expression

To prove ((a(b*))+a) is a regular expression over {a,b}, show it can be constructed according to the rules:
1. b is regular by Rule 1
2. (b*) is regular by Rule 6
3. a is regular by Rule 1
4. (a(b*)) is regular by Rule 5
5. ((a(b*))+a) is regular by Rule 4 applied to (4) and (3)
Examples
• L(001) = {001}
• L(0+10*)={0,1,10,100,1000,…}
• L((0(0+1))*) = the set of strings of 0's
and 1's, of even length, such that every
odd position has a 0
A few more examples…
• ab*a
• a*b*
• (ab)* (same as a*b*?)
• a*b*a* (is baa in this?)
• L = {x^odd} = x(xx)* or (xx)*x but not x*xx*
• All strings of a's and b's of exactly length 3
– L = {aaa, aab, aba, abb, baa, bab, bba, bbb} or (a+b)(a+b)(a+b) or (a+b)³
What are RE’s for these languages?
Assume Σ = {a,b} unless otherwise indicated
• Strings with an a in them somewhere
(a+b)*a(a+b)*
• Strings with at least 2 a’s
b*ab*a(a+b)*
• Strings with exactly 2 a’s
b*ab*ab*
• Strings with at least one a and one b
(a+b)*a(a+b)*b(a+b)*+ (a+b)*b(a+b)*a(a+b)*
• Strings that end in b but do not contain aa
(b+ab)*(b+ab) = (b+ab)⁺
• All strings over {a,b,c} having no substring ac
c*(a+bc*)*
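One way to gain confidence in answers like these is to compare each RE against a plain-English predicate on every short string. The sketch below assumes Python's `re` module as the matcher, with the slides' union symbol + written as `|`; the helper names `all_strings` and `mismatches` are made up for this illustration. Agreement up to a length bound is evidence, not a proof.

```python
import re
from itertools import product


def all_strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def mismatches(pattern, alphabet, predicate, max_len=6):
    """Strings on which the regex and the predicate disagree."""
    return [w for w in all_strings(alphabet, max_len)
            if bool(re.fullmatch(pattern, w)) != predicate(w)]


# (pattern with + rewritten as |, alphabet, membership predicate)
CASES = [
    ("(a|b)*a(a|b)*", "ab", lambda w: "a" in w),
    ("b*ab*a(a|b)*", "ab", lambda w: w.count("a") >= 2),
    ("b*ab*ab*", "ab", lambda w: w.count("a") == 2),
    ("(a|b)*a(a|b)*b(a|b)*|(a|b)*b(a|b)*a(a|b)*", "ab",
     lambda w: "a" in w and "b" in w),
    ("(b|ab)*(b|ab)", "ab", lambda w: w.endswith("b") and "aa" not in w),
    ("c*(a|bc*)*", "abc", lambda w: "ac" not in w),
]

disagreements = {pattern: mismatches(pattern, alphabet, predicate)
                 for pattern, alphabet, predicate in CASES}
```

`re.fullmatch` is used rather than `re.search` because an RE in the formal sense must describe the whole string, not a substring of it.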
Equality of REs
• Two regular expressions s and t are
equal if and only if L(s) = L(t)
– Two regular expressions can look quite different
yet describe the same language
• Example:
s = (a+b)*
and
t = (b+aa*b)*a*
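Equality of two REs can be spot-checked by brute force. The sketch below is an illustration only: it assumes Python's `re` module as the matcher (with + written as `|`), and the helper name `differences` is invented here. It lists every string up to a length bound that one expression matches and the other does not. An empty list is evidence of equality, not a proof; a non-empty list is a genuine counterexample.

```python
import re
from itertools import product


def differences(pattern_s, pattern_t, alphabet="ab", max_len=7):
    """Strings of length <= max_len matched by exactly one of the two patterns."""
    found = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            in_s = re.fullmatch(pattern_s, w) is not None
            in_t = re.fullmatch(pattern_t, w) is not None
            if in_s != in_t:
                found.append(w)
    return found


# s = (a+b)* and t = (b+aa*b)*a* from the slide
s_vs_t = differences("(a|b)*", "(b|aa*b)*a*")

# (ab)* versus a*b*, the question raised in the earlier examples
ab_star_vs_astar_bstar = differences("(ab)*", "a*b*")
```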
Equivalence of FA Languages
and RE Languages
Kleene’s Theorem
• We've already shown that an NFA with or without ε-transitions can be converted to a DFA
• We'll show that NFA-ε's accept the languages described by REs
• Then, we'll show that a RE can describe the language of a DFA (same construction works for an NFA)
• Therefore, NFA-ε, NFA, DFA, and RE are equivalent (describe the same languages)

[Figure: NFA-ε, NFA, DFA, and Regular Expression (e.g. ab*(c+b)*) linked by conversions]
• The languages accepted by DFA, NFA, NFA-ε, and described by RE are called the regular languages
Proof
• We will prove this set of equivalences by
– Showing how to construct an NFA-ε from a regular
expression
– Showing how to construct a regular expression from
a finite automaton
• We already know how to construct a DFA from an NFA-ε so
this completes the circle

RE to NFA-ε
Cover the six cases in the formal (recursive) definition of REs
1. R = a for some a ∈ Σ. Then L(R) = {a} and the following NFA recognizes L(R)
[Figure: NFA with a single transition labeled a]
2. R = ε
• Formally, N = ({q1}, Σ, δ, q1, {q1}), where δ(r,b) = ∅ for any r and b
3. R = ∅
• Formally, N = ({q}, Σ, δ, q, ∅), where δ(r,b) = ∅ for any r and b
4. R = (R1 ∪ R2)
The class of regular languages is closed under the union
operation
For two languages R1 and R2, take two NFAs N1 and N2 and
combine them into one new NFA N.
N must accept input if either N1 or N2 accepts input.
[Figure: N has a new start state with ε-transitions to the start states of N1 and N2]
The new machine
guesses nondeterministically
which of the two
machines accepts
the input
5. R = (R1 ∘ R2)
The class of regular languages is closed under the
concatenation operation
For two languages R1 and R2, take two NFAs N1 and N2 and
combine them sequentially into one new NFA N.
[Figure: N runs N1 and then N2, with ε-transitions from the accepting states of N1 to the start state of N2]
The new machine
guesses nondeterministically
where to split the
input in order to
have a first part
accepted by N1 and
a second part
accepted by N2.
6. R =(R1)*
The class of regular languages is closed under the star
operation
For a language R1, modify N1 to accept (R1)*.
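[Figure: N adds a new accepting start state with an ε-transition to the start state of N1, plus ε-transitions from the accepting states of N1 back to the start state of N1]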
The new machine
has the option of
jumping back to the
start state to read
another piece that
N1 accepts.
Q: Why not just make the start state of N1 a final state?
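The six cases can be sketched as small functions that build an NFA-ε and then simulate it. Everything below is an assumption made for illustration: the dictionary layout (`start`, `accept`, `delta`), the use of `None` as the ε label, and the function names. It follows the constructions described above, not any particular library.

```python
import itertools

_fresh = itertools.count()  # source of unused state numbers


def _merge(*deltas):
    """Combine transition tables, unioning target sets on shared keys."""
    out = {}
    for delta in deltas:
        for key, targets in delta.items():
            out.setdefault(key, set()).update(targets)
    return out


def symbol(a):                       # case 1: R = a
    s, f = next(_fresh), next(_fresh)
    return {"start": s, "accept": {f}, "delta": {(s, a): {f}}}


def epsilon():                       # case 2: R = ε
    s = next(_fresh)
    return {"start": s, "accept": {s}, "delta": {}}


def empty():                         # case 3: R = ∅
    return {"start": next(_fresh), "accept": set(), "delta": {}}


def union(n1, n2):                   # case 4: new start, ε to both start states
    s = next(_fresh)
    links = {(s, None): {n1["start"], n2["start"]}}
    return {"start": s, "accept": n1["accept"] | n2["accept"],
            "delta": _merge(n1["delta"], n2["delta"], links)}


def concat(n1, n2):                  # case 5: ε from N1's accepting states to N2's start
    links = {(f, None): {n2["start"]} for f in n1["accept"]}
    return {"start": n1["start"], "accept": set(n2["accept"]),
            "delta": _merge(n1["delta"], n2["delta"], links)}


def star(n1):                        # case 6: new accepting start, ε loops back
    s = next(_fresh)
    links = {(f, None): {n1["start"]} for f in n1["accept"]}
    links[(s, None)] = {n1["start"]}
    return {"start": s, "accept": n1["accept"] | {s},
            "delta": _merge(n1["delta"], links)}


def eps_closure(nfa, states):
    """All states reachable from `states` using ε-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in nfa["delta"].get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(nfa, w):
    """Simulate the NFA-ε on w by tracking the set of possible states."""
    current = eps_closure(nfa, {nfa["start"]})
    for ch in w:
        moved = set()
        for q in current:
            moved |= nfa["delta"].get((q, ch), set())
        current = eps_closure(nfa, moved)
    return bool(current & nfa["accept"])


# ((a(b*))+a); each operand is built fresh so state numbers never collide
example = union(concat(symbol("a"), star(symbol("b"))), symbol("a"))
```

Tracking a set of states in `accepts` plays the role of the nondeterministic "guess": every possible choice is followed at once.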
Rite of Passage
FA-to-RE Construction
Two algorithms:
1. State elimination: gives smaller
expression, in general, and easier to
apply
2. Inductive construction: covered in
the appendix
DFA-to-RE by State Elimination
• Basic idea : Eliminate a state s
(remove all arcs into and out of s); label
arcs from q to p that went through s with
an RE representing the sequence of
symbols on that path.
General Process
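[Figure: state q with a self-loop labeled e, arcs labeled a and c entering q, and arcs labeled b and d leaving q, between states qi and qj]
• Remove state q
• Label paths from
– qi to qi
– qi to qj
– qj to qi
– qj to qj
[Figure: after removing q, the arcs among qi and qj carry the labels ae*b, ae*d, ce*b, and ce*d]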
Alternative Method
• We can simplify things considerably if we ensure the
following before applying the procedure for state
elimination:
– There is a single final state
– There are no transitions into the initial state, and none out of
the final state
– Since the procedure works on NFA-ε's also, this is easy to
do:
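[Figure: the original FA with start state q0 and final states qf1, qf2, qf3; a new start state q0new has an ε-transition to q0, and qf1, qf2, qf3 each have an ε-transition to a new single final state qfnew]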
Procedure
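[Figure: Before: an arc qi → qrip labeled R1, a self-loop on qrip labeled R2, an arc qrip → qj labeled R3, and an arc qi → qj labeled R4. After: a single arc qi → qj labeled (R1)(R2)*(R3)+(R4)]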
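The rip-out step can be sketched in a few lines. This is a minimal illustration under stated assumptions: arcs are stored in a dictionary from state pairs to RE strings, "" stands for ε, a missing arc stands for ∅, and the function name `eliminate_states` and the small automaton used to exercise it (strings over {a,b} containing at least one b) are invented for this example.

```python
import re
from itertools import product


def eliminate_states(states, arcs, start, final):
    """Remove every state except start and final; return the start -> final label.

    arcs maps (p, q) to an RE string that uses '+' for union. "" means ε and a
    missing key means ∅. Returns None when no path from start to final exists.
    """
    arcs = dict(arcs)
    for rip in states:
        if rip in (start, final):
            continue
        loop = arcs.get((rip, rip))
        star = f"({loop})*" if loop is not None else ""            # (R2)*
        incoming = [(p, r) for (p, q), r in arcs.items() if q == rip and p != rip]
        outgoing = [(q, r) for (p, q), r in arcs.items() if p == rip and q != rip]
        arcs = {pq: r for pq, r in arcs.items() if rip not in pq}  # drop rip's arcs
        for p, r1 in incoming:
            for q, r3 in outgoing:
                label = f"({r1}){star}({r3})"                      # (R1)(R2)*(R3)
                if (p, q) in arcs:
                    label = f"{label}+{arcs[(p, q)]}"              # ... + (R4)
                arcs[(p, q)] = label
    return arcs.get((start, final))


# Assumed example: q1 loops on a, goes to q2 on b; q2 loops on a and b.
# s and f are the added start and final states, joined by ε-arcs.
states = ["s", "q1", "q2", "f"]
arcs = {("s", "q1"): "", ("q1", "q1"): "a", ("q1", "q2"): "b",
        ("q2", "q2"): "a+b", ("q2", "f"): ""}
result = eliminate_states(states, arcs, "s", "f")

# Translate the slides' '+' into Python's '|' so the result can be tested.
python_pattern = result.replace("+", "|")
```

The output carries redundant parentheses and empty groups; it is correct but not simplified, which is why the algebraic laws for ε and ∅ are useful afterwards.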
Example
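[Figure sequence: a two-state FA over {a, b} with states 1 and 2, shown at each step]
• Add new start and end state: S and A, joined to the original FA by ε-transitions
• Remove state 2: the arc from 1 to A is labeled b(a + b)*
• Remove state 1: the arc from S to A is labeled a*b(a + b)*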
Example
[Figure sequence: ORIGINAL FA over {a, b}; MODIFY TO SATISFY CRITERIA (ε-transitions added for a single start state and a single final state); ELIMINATE q2; ELIMINATE q1; ELIMINATE q3. Arc labels appearing along the way include a*b, ba*b, and a*ba*b.]
The arc remaining between q4 and q5 is labeled (a*b + a*ba*b)(a + ba*b)*
Try this
[Figure: FA with states 1, 2, 3 and arcs labeled a and b]
STEP 1: Modify to create a unique start and end state (s and f).
STEP 2: Eliminate state 1: path from s to 2 is a*b; path from 3 to 2 is aa*b.
STEP 3: Eliminate state 2: path from s to 3 is a*bb*a; path from s to f is a*bb*; path from 3 to f is (b + aa*b)b*; path from 3 to 3 is (b + aa*b)b*a.
STEP 4: Eliminate state 3: label on the path from s to f yields the final RE: a*bb* + a*bb*a((b + aa*b)b*a)*(b + aa*b)b*
Another Example
Simplify
Remove State 1
Remove state 2
Remove state 3
Remove state 4
Done!
APPENDIX
Inductive Construction
• Let A be a FA with states 1, 2, …, n.
• Let $R_{ij}^{(k)}$ be a RE whose language is the set of labels of paths that go from state i to state j without passing through any state numbered above k.
• Construction, and the proof that the expressions for these RE's are correct, are inductions on k.
• Basis: k = 0. Path can't go through any states.
– Thus, path is either an arc or the null path (a single node).
– If i ≠ j, then $R_{ij}^{(0)}$ is the sum of all symbols a such that A has a transition from i to j on symbol a (∅ if none).
– If i = j, then add ε to above.
• Induction: Assume we have correctly developed expressions for the $R^{(k-1)}$'s. Then for the $R^{(k)}$'s:
$$R_{ij}^{(k)} = R_{ij}^{(k-1)} + R_{ik}^{(k-1)} \left(R_{kk}^{(k-1)}\right)^{*} R_{kj}^{(k-1)}$$
• Proof it works: A path from i to j that goes through no state higher than k either:
– Never goes through k, in which case the path's label is (by the IH) in the language of $R_{ij}^{(k-1)}$; or
– Goes through k one or more times. In this case:
• $R_{ik}^{(k-1)}$ contains the portion of the path that goes from i to k for the first time.
• $(R_{kk}^{(k-1)})^{*}$ contains the portion of the path (possibly empty) from the first k visit to the last.
• $R_{kj}^{(k-1)}$ contains the portion of the path from the last k visit to j.
• Final step: The RE for the entire FA is the sum (union) of the RE's $R_{ij}^{(n)}$, where i is the start state and j is one of the accepting states.
– Note that superscript (n) represents no restriction on the path at all, since n is the highest-numbered state.
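The basis, the induction, and the final step translate almost line for line into code. The sketch below is an illustration under assumptions: REs are built as Python `re` pattern strings (union written `|`, ε written as the empty string, ∅ as `None`), the function name `inductive_regex` is made up, and the two-state DFA used to try it (strings over {a,b} containing at least one b) is an assumed example rather than one from the slides.

```python
import re
from itertools import product


def inductive_regex(n, delta, start, accepting):
    """Build an RE for a DFA with states 1..n by induction on k.

    delta maps (state, symbol) -> state. The result is a Python regex string,
    or None when the language is empty.
    """
    # Basis k = 0: only direct arcs, plus ε (the empty alternative) when i == j.
    R = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            symbols = sorted(a for (p, a), q in delta.items() if p == i and q == j)
            if i == j:
                symbols.append("")
            R[i, j] = "|".join(symbols) if symbols else None

    # Induction: R(k)_ij = R(k-1)_ij + R(k-1)_ik (R(k-1)_kk)* R(k-1)_kj
    for k in range(1, n + 1):
        nxt = {}
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                through_k = None
                if R[i, k] is not None and R[k, j] is not None:
                    # R[k, k] always contains ε, so it is never None.
                    through_k = f"({R[i, k]})({R[k, k]})*({R[k, j]})"
                parts = [r for r in (R[i, j], through_k) if r is not None]
                nxt[i, j] = "|".join(parts) if parts else None
        R = nxt

    # Final step: union over all accepting states j of R(n)_{start, j}.
    finals = [R[start, j] for j in sorted(accepting) if R[start, j] is not None]
    return "|".join(finals) if finals else None


# Assumed example DFA: state 1 loops on a and moves to 2 on b; state 2 loops on a, b.
delta = {(1, "a"): 1, (1, "b"): 2, (2, "a"): 2, (2, "b"): 2}
pattern = inductive_regex(2, delta, start=1, accepting={2})
```

The expressions grow quickly because nothing is simplified between rounds, which is one reason state elimination tends to give smaller results.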
Example
The "clamping" automaton, with states named
by integers:
[Figure: automaton with states 1, 2, 3 and transitions on 0 and 1]
• Some basis expressions:
$R_{11}^{(0)} = \varepsilon$, $R_{12}^{(0)} = 1$, $R_{22}^{(0)} = \varepsilon + 0 + 1$, $R_{31}^{(0)} = 1$, $R_{32}^{(0)} = R_{21}^{(0)} = \emptyset$
Two inductive examples:
• $R_{32}^{(1)} = R_{32}^{(0)} + R_{31}^{(0)} (R_{11}^{(0)})^{*} R_{12}^{(0)} = \emptyset + 1\varepsilon^{*}1 = 11$
– Uses algebraic laws: ε* = ε; εR = Rε = R (ε is the identity for concatenation); ∅ + R = R + ∅ = R (∅ is the identity for union).
• $R_{22}^{(1)} = R_{22}^{(0)} + R_{21}^{(0)} (R_{11}^{(0)})^{*} R_{12}^{(0)} = \varepsilon + 0 + 1 + \emptyset\varepsilon^{*}1 = \varepsilon + 0 + 1$
– Additional algebraic law used: ∅R = R∅ = ∅ (∅ is the annihilator for concatenation).
To simplify the more complex regular expressions during state elimination (using algebraic rules):
• ε* = ε
• εR = Rε = R
• ∅R = R∅ = ∅
• ∅ | R = { } ∪ R = R, which can also be stated as: ∅ + R = R + ∅ = R
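These laws are mechanical enough to apply bottom-up on an expression tree. The sketch below is illustrative only: the nested-tuple encoding and the names `EPS`, `EMPTY`, and `simplify` are assumptions made here, and the function applies just the four laws listed above, nothing more.

```python
EPS = ("eps",)      # ε
EMPTY = ("empty",)  # ∅


def simplify(r):
    """Apply the four laws above to a nested-tuple RE, innermost parts first.

    Nodes (a hypothetical encoding): ("sym", a), EPS, EMPTY,
    ("union", r1, r2), ("concat", r1, r2), ("star", r1).
    """
    kind = r[0]
    if kind == "star":
        inner = simplify(r[1])
        return EPS if inner == EPS else ("star", inner)   # ε* = ε
    if kind == "concat":
        left, right = simplify(r[1]), simplify(r[2])
        if EMPTY in (left, right):                         # ∅R = R∅ = ∅
            return EMPTY
        if left == EPS:                                    # εR = R
            return right
        if right == EPS:                                   # Rε = R
            return left
        return ("concat", left, right)
    if kind == "union":
        left, right = simplify(r[1]), simplify(r[2])
        if left == EMPTY:                                  # ∅ + R = R
            return right
        if right == EMPTY:                                 # R + ∅ = R
            return left
        return ("union", left, right)
    return r                                               # symbols, ε, ∅


one = ("sym", "1")
# ∅ + 1ε*1, the shape that arises in the inductive construction
raw = ("union", EMPTY, ("concat", ("concat", one, ("star", EPS)), one))
reduced = simplify(raw)
```

Simplifying children before their parent lets one pass suffice: by the time a node is inspected, any ε or ∅ below it has already been exposed.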