Regular Languages and Regular Grammars Chapter 3

advertisement
Regular Languages
and
Regular Grammars
Chapter 3
Regular Languages
Describes
Regular Expression
Regular
Language
Accepts
Finite State
Machine
Operators on Regular Expressions
In order of precedence:
() Parentheses
Example:
* Star Closure
Over  = {a, b, c}, (a + (b . c))*
produces:
.
Concatenation
{λ, a, bc, aa, abc, bcbc, … }
+ Union
Note: The concatenation symbol is often omitted.
Regular Expressions
Let  be a given alphabet. Then
1. , λ, and a   are all primitive regular expressions.
2. If r1 and r2 are regular expressions,
so are r1 + r2, r1 . r2, r1*, and (r1)
3. A string is a regular expression, iff it can be derived
from the primitive regular expressions by a finite
number of application of the rules in (2).
Languages Associated with
Regular Expressions
If r is a regular expression L(r) is a language associated
with r.
Rules to simplify languages associated with r:
L() = 
L(λ) = λ
L(r1 + r2) = L(r1) U L(r2)
L(r1 . r2) = L(r1) . L(r2)
L((r1)) = L(r1)
L(r1*) = (L(r1))*
L(a) = {a}
Analyzing a Regular Expression
L((a + b)*b) = L((a + b)*) L(b)
= (L(a + b))* L(b)
= (L(a) U L(b))* L(b)
= ({a} U {b})* {b}
= {a, b}* {b}.
A string of a’s and b’s that end with b
Analyzing a Regular Expression
L(a*b*) = L(a*)L(b*)
= {a}*{b}*
A string of zero or more a’s followed by a string of zero or
more b’s.
Given a Language, find a rex
L = {w  {a, b}* : w = |w| is even}
((a + b)(a + b))*
or
(aa + ab + ba + bb)*
Examples
L = {w  {a, b}* : w contains an odd number of a’s}
b*(ab*ab*)*ab*
or
b*ab*(ab*ab*)*
Both expressions require that there be a single a
somewhere. There can also be other a’s, but they must
occur in pairs.
More Regular Expression Examples
Try these:
L = {w  {a, b}*: there is no more than one b in w}
L(r) = {a2nb2m+1 : n  0, m  0}
More Regular Expression Examples
Try these:
L = {w  {a, b}*: there is no more than one b in w}
a*(λ+b)a*
or
a* + a*ba*
L(r) = {a2n b2m+1 : n  0, m  0}
(aa)*(bb)*b
The Details Matter
a* + b*  (a + b)*
(ab)*  a*b*
Rex to NFA
Finite state machines and regular expressions define
the same class of languages.
Theorem: Any language that can be defined with a
regular expression can be accepted by some NFA and
so is regular.
Proof by Construction: Must show that an NFA can
be constructed using rules for: , λ, any symbol in ,
union, and concatenation.
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:
:
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:
:
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:
:
A single element of :
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:
:
A single element of :
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:
:
A single element of :
λ:
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:
:
A single element of :
λ:
Union
M1 (recognizes string s)
;;;
…
λ
λ
λ
…
M2 (recognizes string t)
FSA that recognizes s + t
λ
Concatenation
M1 (recognizes string s)
λ
;;;
…
M2 (recognizes string t)
λ
…
FSA that recognizes st
λ
Star Closure
λ
M1 (recognizes string s)
λ
;;;
λ
…
λ
λ
FSA that recognizes s*
An Example
(b + ab)*
An FSM for a
An FSM for b
An FSM for ab:
λ
An Example
(b + ab)*
An FSM for (b + ab):
λ
λ
λ
An Example
An FSM for (b + ab)*:
λ
λ
λ
λ
λ
λ
λ
λ
An Example
A Simplified FSM for (b + ab)*:
λ
b
a
b
λ
For Every FSM There is a
Corresponding Regular Expression
Theorem: Every regular language (i.e., every language
that can be accepted by some DFSM) can be defined with
a regular expression.
Proof by Construction: Use generalized transition
graphs (GTGs) to convert FSM to REX. A GTG is a
transition graph whose edges are labeled with regular
expressions.
A Simple Example
Let M be:
Suppose we rip out state 2:
The Algorithm fsmtoregexheuristic
fsmtoregexheuristic(M: FSM) =
1. Remove unreachable states from M.
2. If M has no accepting states then return .
3. If the start state of M is part of a loop, create a new start state s
and connect s to M’s start state via an λ-transition.
4. If there is more than one accepting state of M or there are any
transitions out of any of them, create a new accepting state and
connect each of M’s accepting states to it via an λ-transition. The
old accepting states no longer accept.
5. If M has only one state then return λ.
6. Until only the start state and the accepting state remain do:
6.1 Select rip (not s or an accepting state).
6.2 Remove rip from M.
6.3 *Modify the transitions among the remaining states so M
accepts the same strings.
7. Return the regular expression that labels the one remaining
transition from the start state to the accepting state.
Example 1
1. Create a new initial state and a new, unique accepting
state, neither of which is part of a loop.
Note:   λ
Example 1, Continued
2. Remove states and arcs and replace with arcs labeled
with larger and larger regular expressions.
Example 1, Continued
Remove state 3:
Example 1, Continued
+
Remove state 2:
Example 1, Continued
+
Remove state 1:
+
+
Example 2
a*(a + b)c*
Example 3
a* + a*(a + b)c*
Simplifying Regular Expressions
Regex’s describe sets:
● Union is commutative:  +  =  + .
● Union is associative: ( + ) +  =  + ( + ).
●  is the identity for union:  + =  +  = .
● Union is idempotent:  +  = .
Concatenation:
● Concatenation is associative: () = ().
● λ is the identity for concatenation:  λ = λ  = .
●  is a zero for concatenation:   =   = .
Concatenation distributes over union:
● ( + )  = ( ) + ( ).
●  ( + ) = ( ) + ( ).
Kleene star:
● * = λ.
● λ* = λ.
●(*)* = *.
● ** = *.
●( + )* = (**)*.
Applications of regular expressions:
Pattern Matching
Many applications allow pattern matches
unix
perl
Excel
Access
…
Pattern matching programs use automata
pattern  rex  nfa  dfa  transition table  driver
A Biology Example – BLAST
Given a protein or DNA sequence, find others that are likely
to be evolutionarily close to it.
ESGHDTTTYYNKNRYPAGWNNHHDQMFFWV
Build a DFSM that can examine thousands of other
sequences and find those that match any of the selected
patterns.
Regular Expressions in Perl
Syntax
Name
Description
abc
Concatenation
Matches a, then b, then c, where a, b, and c are any regexs
a|b|c
Union (Or)
Matches a or b or c, where a, b, and c are any regexs
a*
Kleene star
Matches 0 or more a’s, where a is any regex
a+
At least one
Matches 1 or more a’s, where a is any regex
a?
Matches 0 or 1 a’s, where a is any regex
a{n, m}
Replication
Matches at least n but no more than m a’s, where a is any regex
a*?
Parsimonious
Turns off greedy matching so the shortest match is selected
a+?


.
Wild card
Matches any character except newline
^
Left anchor
Anchors the match to the beginning of a line or string
$
Right anchor
Anchors the match to the end of a line or string
[a-z]
Assuming a collating sequence, matches any single character in range
[^a-z]
Assuming a collating sequence, matches any single character not in range
\d
Digit
Matches any single digit, i.e., string in [0-9]
\D
Nondigit
Matches any single nondigit character, i.e., [^0-9]
\w
Alphanumeric
Matches any single “word” character, i.e., [a-zA-Z0-9]
\W
Nonalphanumeric
Matches any character in [^a-zA-Z0-9]
\s
White space
Matches any character in [space, tab, newline, etc.]
Regular Expressions in Perl
Syntax
Name
Description
\S
Nonwhite space
Matches any character not matched by \s
\n
Newline
Matches newline
\r
Return
Matches return
\t
Tab
Matches tab
\f
Formfeed
Matches formfeed
\b
Backspace
Matches backspace inside []
\b
Word boundary
Matches a word boundary outside []
\B
Nonword boundary
Matches a non-word boundary
\0
Null
Matches a null character
\nnn
Octal
Matches an ASCII character with octal value nnn
\xnn
Hexadecimal
Matches an ASCII character with hexadecimal value nn
\cX
Control
Matches an ASCII control character
\char
Quote
Matches char; used to quote symbols such as . and \
(a)
Store
Matches a, where a is any regex, and stores the matched string in the next variable
\1
Variable
Matches whatever the first parenthesized expression matched
\2
Matches whatever the second parenthesized expression matched
…
For all remaining variables
Using Regular Expressions
in the Real World
Matching numbers:
-? ([0-9]+(\.[0-9]*)? | \.[0-9]+)
Matching ip addresses:
S !<emphasis> ([0-9]{1,3} (\ . [0-9] {1,3}){3}) </emphasis>
!<inet> $1 </inet>!
Finding doubled words:
\< ([A-Za-z]+) \s+ \1 \>
From Friedl, J., Mastering Regular Expressions, O’Reilly,1997.
More Regular Expressions
Identifying spam:
\badv\(?ert\)?\b
Trawl for email addresses:
\b[A-Za-z0-9_%-]+@[A-Za-z0-9_%-]+ (\.[A-Zaz]+){1,4}\b
Using Substitution
Building a chatbot:
On input:
<phrase1> is <phrase2>
the chatbot will reply:
Why is <phrase1> <phrase2>?
Chatbot Example
<user> The food there is awful
<chatbot> Why is the food there awful?
Assume that the input text is stored in the variable $text:
$text =~
s/^([A-Za-z]+)\sis\s([A-Za-z]+)\.?$/
Why is \1 \2?/
;
Regular Grammars
A regular grammar G is a quadruple (V, T, S, P)
that is either consistently right-linear or consistently
left-linear.
● V - Variables
● T – Terminals
● S - Start variable, S  V
● P - Productions
Right-Linear Grammar
All production rules are of the form:
A  xB
or
Ax
A,B  V
x  T*
A and B are variables
x is a string in the alphabet
Example:
G = ({S}, {a, b}, S, P)
P:
S  abS | a
Corresponding Regular
Expression:
(ab)*a
Left-Linear Grammar
All production rules are of the form:
A  Bx
or
Ax
A,B  V
x  T*
A and B are variables
x is a string in the alphabet
Example:
G = ({S, S1, S2}, {a, b}, S, P)
P:
S  S1ab
S1  S1ab | S2
S2  a
Corresponding Regular
Expression:
aab(ab)*
Focus on Right-Linear Grammars
A language generated by a right-linear grammar is
always regular. Proof by construction of FA on
page 91 of text.
Example: Construct an FA that accepts the
language generated by the grammar:
V0  aV1
V1  abV0 | b
Focus on Right-Linear Grammars
V0  aV1
V1  b
V1  abV0
Complete FA:
Right-Linear Grammars
Every regular language can be generated by some
right-linear grammar. Proof by reverse construction of
an FA, page 93 of text.
Example: Find a right-linear grammar that generates
the language accepted by the FA shown below.
G = {{Q0, Q1, Q2}, {0, 1}, Q0, P}
P:
Q0  1Q1 | Q2 | λ
Q1  0Q0 | 0Q2
Q2  1Q2
Each state in the FA is represented by a variable in the grammar.
Each transition symbol in the FA is a terminal in the grammar.
Each transition in the FA is represented by a rule in the grammar.
If a state, qk is a final state, include the production qk  λ
Download