regular expression

Chapter 2: Finite-State Machines
Heshaam Faili
hfaili@ece.ut.ac.ir
University of Tehran
Overview



Regular Expressions
FSAs
Properties of
Regular Languages
Regular Expressions

A regular expression (RE) is a formula in a
specialized language, used to characterize strings.




A finite-state machine is a device for
recognizing/generating the strings that a regular expression describes
We’ll use a “Perlish” notation for writing regular
expressions, based on regular expressions in the Perl
programming language.



A string is a sequence of characters
REs allow us to search for patterns
The concepts are the important thing
NB: Perlish isn’t exactly the same as Perl
We will write REs between slashes: /…/
Regular expression inventory
(1)

Character Literals and Classes




Characters: /abcd/
Set: /p[aeiou]p/
Range: /ab[a-z]d/
Operators (disjunction, negation)

Disjunction:



Set elements: /[Aa]ardvark/
Sequences of characters: /ant(eater|farm)/
Negation:


Single item: /[^a]/ (any character but a)
Range: /[^a-z]/ (any character but a lowercase letter)
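The character classes and operators above can be tried out in Python's re module, whose notation is close to, but not identical with, the Perlish one used here. This is only a minimal sketch: the helper name found is an illustrative choice, and patterns are written as raw strings rather than between slashes.

```python
import re

def found(pattern: str, text: str) -> bool:
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# Sets and ranges
print(found(r"p[aeiou]p", "pop"))            # True
print(found(r"ab[a-z]d", "abXd"))            # False: X is not in a-z

# Disjunction
print(found(r"[Aa]ardvark", "an aardvark"))  # True
print(found(r"ant(eater|farm)", "antfarm"))  # True
print(found(r"ant(eater|farm)", "anthill"))  # False

# Negation
print(found(r"[^a]", "aaa"))                 # False: no character other than a
print(found(r"[^a-z]", "abc7"))              # True: 7 is not a lowercase letter
```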
Regular expression inventory
(2)

Counters





?: Optionality (0 or 1 occurrence): /colou?r/
* (Kleene star): Any number of occurrences (including zero): /[0-9]*/
+: At least one occurrence: /[0-9]+/
{n}: Exactly n occurrences: /[0-9]{4}/
Wildcard: matches any single character (.)

/beg.n/
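A short sketch of the counters and the wildcard in Python's re module. The helper whole is a name made up for this example; it uses re.fullmatch so that the pattern has to cover the entire string.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"colou?r", "color"), whole(r"colou?r", "colour"))  # True True
print(whole(r"colou?r", "colouur"))                             # False
print(whole(r"[0-9]*", ""))        # True: zero occurrences are allowed
print(whole(r"[0-9]+", ""))        # False: at least one digit is required
print(whole(r"[0-9]{4}", "1999"))  # True
print(whole(r"[0-9]{4}", "199"))   # False
print(whole(r"beg.n", "begin"), whole(r"beg.n", "begun"))       # True True
print(whole(r"beg.n", "begn"))     # False: the wildcard needs one character
```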
Regular expression inventory
(3)

Parentheses: used to group items
together


/ant(farm)?/: all of farm is optional
Escaped characters: needed to specify
characters that have a special meaning:
*, +, ?, (, ), |, [, ], .:


Use a backslash: /why\?/
Period expressed as: /\./
Regular expression inventory
(4)

Anchors: anchor expressions to various
parts of the string

^ start of line



do not confuse with [^..] used to express
negation; anywhere else it’s a start of line
$ end of line
\b word boundary (the position between a word character and a non-word character)

word characters are digits, underscores, or
letters, i.e., [0-9A-Za-z_]
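The anchors can be checked with a few searches in Python's re module, where ^ and $ behave as start and end of the string by default and \b matches at a word boundary. A minimal sketch; the helper name found is illustrative.

```python
import re

def found(pattern: str, text: str) -> bool:
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"^the", "the cat"))         # True
print(found(r"^the", "see the cat"))     # False: the is not at the start
print(found(r"cat$", "the cat"))         # True
print(found(r"cat$", "cats"))            # False: cat is not at the end
print(found(r"\bthe\b", "see the cat"))  # True
print(found(r"\bthe\b", "another"))      # False: the is inside a word
```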
Examples of Regular
Expressions












/fire/ a sequence of f followed immediately by i, then immediately
by r, then immediately by e
/fires?/ matches fire or fires
/fires\?/ matches fires?
/[abcd]/ matches a, b, c, or d
/[0-9]/ matches any character in the range 0 to 9 (inclusive)
/[^0-9]/ matches any non-digit character, i.e., any character except
those in the set 0 thru 9
/[0-9]+/ matches 0, 1, 11, 12, 367, …
/[0-9]*/ matches 0, 1, 11, 12, 367, … and also the empty string
/fir./ matches fire, fir9, firm, firp, …
/fir.*/ matches fir, fire, fir987, firppery, …
/[fFHhs]ire/ matches fire, Fire, Hire, hire, sire
/f|Fire/ matches f or Fire
Precedence





/fire|ings?/ the sequence fire or the
sequence ing (the latter optionally followed
by s)
Why?
Because sequences have precedence over
disjunction
To override precedence, use parentheses
/fir(e|ings)/ the sequence fir followed by
either the sequence e or the sequence ings
Precedence Rules
1) Parentheses have the highest precedence.
2) Then come counters, *, +, ?, {}
3) Then come sequences and anchors
• so, in /good.*/ the * applies only to the wildcard: the pattern matches
good followed by any characters (goodies, etc.), and not (just) goodgood
• /echo{3}/ the sequence ech followed by ooo
• /(echo){3}/ the sequence echoechoecho
4) Then comes disjunction
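The precedence rules can be observed directly by matching whole strings in Python's re module. A small sketch; the helper whole is an illustrative name.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Sequences bind more tightly than disjunction.
for word in ["fire", "ing", "ings", "firings"]:
    print(word, whole(r"fire|ings?", word), whole(r"fir(e|ings)", word))
# fire True True
# ing True False
# ings True False
# firings False True

# Counters bind more tightly than sequences.
print(whole(r"echo{3}", "echooo"))          # True
print(whole(r"echo{3}", "echoechoecho"))    # False
print(whole(r"(echo){3}", "echoechoecho"))  # True
```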
Aliases







Use aliases to designate particular recurrent sets of
characters
\d [0-9]: digit
\D [^\d]: non-digit
\w [a-zA-Z0-9_]: alphanumeric or underscore
\W [^\w]: non-alphanumeric
\s [ \r\t\n\f]: whitespace character
\r: carriage return, \t: tab
\n: newline, \f: formfeed
\S [^\s]: non-whitespace
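Python's re module offers the same aliases, with one caveat: on ordinary strings \d, \w and \s are Unicode-aware unless the re.ASCII flag is given, so they are slightly broader than the sets listed above. A brief sketch with a made-up sample string:

```python
import re

text = "Room 101,\tfloor 3\nwing_B"

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text, re.ASCII)
chunks = re.split(r"\s+", text)
non_word = re.findall(r"\W", "a-b c")

print(digits)    # ['101', '3']
print(words)     # ['Room', '101', 'floor', '3', 'wing_B']
print(chunks)    # ['Room', '101,', 'floor', '3', 'wing_B']
print(non_word)  # ['-', ' ']
```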
Example 1
/\$[0-9]+(\.[0-9][0-9])?/
Example 2
Times on a digital watch (hours and
minutes)
/([1-9]|1[012]):[0-5][0-9]/
Overgeneration
/\d\d:\d\d/
recognizes watch times, but also other
sequences (e.g., 25:99). In other words, the pattern
overgenerates, covering expressions
which aren't in the target set
Undergeneration
/1[012]:[0-5][0-9]/
undergenerates, i.e., does not cover all
watch times.
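Overgeneration and undergeneration can be made concrete by running candidate patterns over a few test strings. The sketch below compares three patterns in Python's re module; the dictionary keys and the helper accepted are illustrative names, and the "target" pattern puts parentheses around the hour alternatives so that the disjunction covers only the hours.

```python
import re

PATTERNS = {
    "target": r"([1-9]|1[012]):[0-5][0-9]",  # hours 1-12, minutes 00-59
    "over": r"\d\d:\d\d",                    # any two digits on each side
    "under": r"1[012]:[0-5][0-9]",           # only hours 10, 11, 12
}

def accepted(name: str, text: str) -> bool:
    """True if the named pattern matches the entire text."""
    return re.fullmatch(PATTERNS[name], text) is not None

for t in ["9:30", "12:59", "25:99"]:
    print(t, {name: accepted(name, t) for name in PATTERNS})
# 9:30 {'target': True, 'over': False, 'under': False}
# 12:59 {'target': True, 'over': True, 'under': True}
# 25:99 {'target': False, 'over': True, 'under': False}
```

Note that /\d\d:\d\d/ also misses 9:30, since it insists on a two-digit hour.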
Representing sentences
‘handling’ agreement:
/the (student solves|students solve) the problem/
an optional adjective:
/the clever? (student solves|students solve) the
problem/
generating an infinite number of sentences
/the clever? (student solves|students solve) the
problem (and the clever? (student solves|students
solve) the problem)*/
NOTE: here the symbols are words, not characters! Be
sure to define the symbol type
Overview



Regular Expressions
FSAs
Properties of
Regular Languages
A Simple Finite-State Automaton
(or FSA)

Example: FSA to recognize strings of the
form: /[ab]+/
i.e., L ={a, b, ab, ba, aab, bab, aba, bba, …}

Transition Table

initial = 0; final = {1}
0->a->1
0->b->1
1->a->1
1->b->1
How an FSA accepts or rejects
a string




The behavior of an FSA is completely determined by its
transition table. The assumption is that there is a tape, and the
input symbols are read off consecutive cells of the tape.
The machine starts in the start (initial) state, about to read the
contents of the first cell on the input ‘tape’.
The FSA uses the transition table to decide where to go at each
step
A string is rejected in exactly two cases:



1. a transition on an input symbol takes you nowhere
2. the state you’re in after processing the entire input is not an
accept (final) state
Otherwise, the string is accepted.
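The procedure just described can be sketched as a table-driven recognizer. This is one possible encoding, not the only one: the transition table is a Python dict keyed by (state, symbol), and the names accepts and AB_PLUS are made up for the example, which uses the FSA for /[ab]+/.

```python
def accepts(transitions, start, finals, symbols) -> bool:
    """Run a deterministic FSA over the input and report acceptance."""
    state = start
    for symbol in symbols:
        if (state, symbol) not in transitions:
            return False              # case 1: the transition leads nowhere
        state = transitions[(state, symbol)]
    return state in finals            # case 2: must end in a final state

# FSA for /[ab]+/: initial = 0; final = {1}
AB_PLUS = {
    (0, "a"): 1,
    (0, "b"): 1,
    (1, "a"): 1,
    (1, "b"): 1,
}

print(accepts(AB_PLUS, 0, {1}, "abba"))  # True
print(accepts(AB_PLUS, 0, {1}, ""))      # False: still in non-final state 0
print(accepts(AB_PLUS, 0, {1}, "abc"))   # False: no transition on c
```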
FSA formally

Finite state automaton defined by the
following parameters:





Q: finite set of N states: q0, q1, …, qN-1
Σ: finite input alphabet
q0: designated start state
F: set of final states (subset of Q)
δ(q, i): transition function, giving the next state for state q and input symbol i
More Examples of FSA’s

Let’s design FSA’s to recognize





the set of zero or more a’s
the set of all lowercase alphabetic strings
ending in a b.
the set of all strings in [ab]* with exactly
two a’s.
simple NPs, PPs, Ss
etc.
The set of zero or more a’s

L = {ε, a, aa, aaa, aaaa, …}

Transition Table
initial = 0; final = {0}
0->a->0
FSA for set of all lowercase
alphabetic strings ending in b






/[a-z]*b/
initial =0; final ={1}
0->[a, c-z]->0
0->b->1
1->b->1
1->[a, c-z]->0
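As a sanity check, the table above can be encoded and compared against the regular expression on a few inputs. This is a sketch only: TRANSITIONS and fsa_accepts are illustrative names, and the comparison uses Python's re.fullmatch.

```python
import re
import string

# Transition table from the slide: b leads to state 1, any other
# lowercase letter leads to state 0, from either state.
TRANSITIONS = {}
for state in (0, 1):
    for ch in string.ascii_lowercase:
        TRANSITIONS[(state, ch)] = 1 if ch == "b" else 0

def fsa_accepts(text: str) -> bool:
    """Run the FSA from state 0; accept if it ends in state 1."""
    state = 0
    for ch in text:
        if (state, ch) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, ch)]
    return state == 1

for word in ["b", "cab", "abba", "", "crab7b", "bb"]:
    # The FSA and the regular expression should agree on every input.
    assert fsa_accepts(word) == (re.fullmatch(r"[a-z]*b", word) is not None)
    print(repr(word), fsa_accepts(word))
```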
The set of all strings in [ab]*
with exactly 2 a’s


Do this yourself
It might help to first rewrite a more
precise regular expression for this
FSA for simple NPs, PPs, S, …
initial=0; final ={2}
0->D->1
0->ε->1
1->N->2
Another FSA for NPs:
initial=0; final ={2}
0->N->2
0->D->1
1->N->2
2->N->2
• D is an alias for [the, a, an, all,…], N for [dog, cat, robin,…]
• What if we wanted to add adjectives? Or recognize PPs?
• What about one for simple sentences?
• /(Prep D? A* N+)* (D? N) (Prep D? A* N+)* (V_tns|Aux
V_ing) (Prep D? A* N+)*/
• Note: FSA1 concat FSA2 recognizes L(FSA1) concat L(FSA2)
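When the symbols are words, the recognizer first maps each word to its category and then consults the transition table. The sketch below does this for the second NP automaton (0->N->2, 0->D->1, 1->N->2, 2->N->2); the small LEXICON and the function name accepts_np are assumptions made for illustration.

```python
# Tiny illustrative lexicon: D = determiner, N = noun.
LEXICON = {
    "the": "D", "a": "D", "an": "D", "all": "D",
    "dog": "N", "cat": "N", "robin": "N",
}

# Transition table of the second NP automaton; initial = 0, final = {2}.
NP = {(0, "N"): 2, (0, "D"): 1, (1, "N"): 2, (2, "N"): 2}

def accepts_np(phrase: str) -> bool:
    """Accept a phrase if its sequence of word categories is an NP."""
    state = 0
    for word in phrase.split():
        category = LEXICON.get(word)      # None for unknown words
        if (state, category) not in NP:
            return False
        state = NP[(state, category)]
    return state == 2

print(accepts_np("the dog"))        # True:  D N
print(accepts_np("robin"))          # True:  N
print(accepts_np("the robin cat"))  # True:  D N N
print(accepts_np("the"))            # False: ends in non-final state 1
print(accepts_np("dog the"))        # False: no transition on D from state 2
```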
Deterministic and NonDeterministic FSA’s


An FSA is non-deterministic (NFSA) when, for some state and
input, there is more than one state it can go to.
This occurs when the transition table allows a transition to two or
more states from one state on a given input symbol,
e.g., 1->a->2, 1->a->4.
It also occurs with epsilon-transitions, which can be taken without
consuming input:
whenever epsilon-transitions occur, the machine could either
take the epsilon-transition or consume an input symbol,
introducing non-determinism.
Any NFSA can be converted to an equivalent DFSA (deterministic), at the
expense of possibly more states.
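One standard way to carry out this conversion is the subset construction, in which each DFSA state stands for a set of NFSA states. The sketch below is a minimal version under assumed conventions: the NFSA is a dict from (state, symbol) to a set of states, the empty string marks an epsilon-transition, and all names (EPS, eps_closure, determinize, run) are illustrative. The example NFSA is an NP automaton in which the determiner can be skipped by an epsilon-transition.

```python
from collections import deque

EPS = ""  # label for epsilon-transitions (a convention of this sketch)

def eps_closure(states, nfa):
    """All states reachable from `states` using only epsilon-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in nfa.get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def determinize(nfa, start, finals, alphabet):
    """Subset construction: each DFSA state is a frozenset of NFSA states."""
    d_start = eps_closure({start}, nfa)
    d_trans, d_finals = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        current = queue.popleft()
        if current & finals:
            d_finals.add(current)
        for symbol in alphabet:
            moved = set()
            for q in current:
                moved |= set(nfa.get((q, symbol), ()))
            if not moved:
                continue              # no transition: input is rejected here
            target = eps_closure(moved, nfa)
            d_trans[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_start, d_trans, d_finals

def run(d_start, d_trans, d_finals, symbols) -> bool:
    """Deterministic run over the constructed DFSA."""
    state = d_start
    for symbol in symbols:
        if (state, symbol) not in d_trans:
            return False
        state = d_trans[(state, symbol)]
    return state in d_finals

# NFSA: 0->D->1, 0->eps->1, 1->N->2; initial = 0, final = {2}
NFA = {(0, "D"): {1}, (0, EPS): {1}, (1, "N"): {2}}
d_start, d_trans, d_finals = determinize(NFA, 0, {2}, ["D", "N"])

print(sorted(d_start))                               # [0, 1]
print(run(d_start, d_trans, d_finals, ["D", "N"]))   # True
print(run(d_start, d_trans, d_finals, ["N"]))        # True
print(run(d_start, d_trans, d_finals, ["D"]))        # False
```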
FAQ: Why Are These Machines
Finite-State?


Finite number of states
Number of states bounded in advance -- determined by its
transition table



Therefore, the machine has a limit to the amount of memory it
uses.
Its behavior at each stage is based on the transition table, and
depends just on the state it’s in, and the input. So, the current
state reflects the history of the processing so far.
Certain classes of formal languages (and linguistic phenomena)
which are not regular require additional memory to keep track
of previous information (beyond current state and input)

e.g., center-embedding constructions (discussed later)
Overview



Regular Expressions
FSAs
Properties of
Regular Languages
Formal Languages Revisited



We will view any formal language as a
set of expressions
The language will use a finite
vocabulary Σ (called an alphabet), and
a set of expression-combining
operations
Regular languages are the simplest
class of formal languages
Formal Languages Revisited

Note: Kleene closure of a set
Let L = {a, b}.
Then L* = the set of strings formed by concatenating
a's and b's zero or more times
= {ε, a, b, ab, aab, aaab, aaaab, ba, baa, …}.
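Since L* is infinite, it can only be enumerated up to a bound. The sketch below lists every string built from at most a given number of members of L; the function name kleene_up_to and the bound are choices made for this example, and the empty string plays the role of ε.

```python
from itertools import product

def kleene_up_to(language, max_pieces):
    """Strings formed by concatenating 0..max_pieces members of `language`."""
    result = set()
    for k in range(max_pieces + 1):
        for combo in product(sorted(language), repeat=k):
            result.add("".join(combo))   # k = 0 yields the empty string
    return result

strings = sorted(kleene_up_to({"a", "b"}, 2), key=lambda s: (len(s), s))
print(strings)  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```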
Properties of Regular
Languages

The class of regular languages over Σ is defined as
follows:
1. ∅ (the empty set) is a regular language.
2. ∀a ∈ Σ ∪ {ε}, {a} is a regular language.
(Σ = alphabet of symbols)
3. If L1 and L2 are regular languages, so are:
a. L1 ∪ L2, the union (or disjunction) of L1 and L2
b. L1.L2 = {xy | x ∈ L1, y ∈ L2}, concatenation of L1 and L2
c. L1*, the Kleene closure of L1 (set formed by concatenating
members of L1 zero or more times)

So, if the language L is a regular language, it
must be expressible using only the three
operations of concatenation, disjunction, and Kleene
closure.
General Closure Properties of
Regular Languages




Concatenation, Union, Kleene Closure
Intersection: If L1 and L2 are regular
languages, so is L1 ∩ L2.
Set Difference: If L1 and L2 are regular
languages, so is L1 - L2.
Reversal: If L1 is a regular language, so is
L1^R, the language formed by reversing all
the strings in L1
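Closure under intersection can be illustrated with the product construction, one common way to build an FSA for L1 ∩ L2: run both deterministic machines in parallel and accept only when both accept. The sketch below assumes each DFSA is a (transitions, start, finals) triple; the two example machines and all names are made up for illustration.

```python
def accepts(dfsa, text) -> bool:
    """Run a deterministic FSA given as (transitions, start, finals)."""
    transitions, state, finals = dfsa
    for ch in text:
        if (state, ch) not in transitions:
            return False
        state = transitions[(state, ch)]
    return state in finals

def intersect(d1, d2, alphabet):
    """Product construction: states are pairs (state of d1, state of d2)."""
    (t1, s1, f1), (t2, s2, f2) = d1, d2
    start = (s1, s2)
    transitions, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        pair = todo.pop()
        p, q = pair
        if p in f1 and q in f2:
            finals.add(pair)          # both machines must accept
        for a in alphabet:
            if (p, a) in t1 and (q, a) in t2:
                nxt = (t1[(p, a)], t2[(q, a)])
                transitions[(pair, a)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return transitions, start, finals

# Strings over {a, b} with an even number of a's.
EVEN_A = ({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
# Strings over {a, b} ending in b.
ENDS_B = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})

BOTH = intersect(EVEN_A, ENDS_B, "ab")
print(accepts(BOTH, "aab"))  # True: two a's and ends in b
print(accepts(BOTH, "ab"))   # False: odd number of a's
print(accepts(BOTH, "aa"))   # False: does not end in b
```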
What sorts of expressions
aren’t regular

In natural language, examples include
center-embedding constructions.
The cat loves Mozart.
The cat the dog chased loves Mozart.
The cat the dog the rat bit chased loves Mozart.
The cat the dog the rat the elephant admired bit
chased loves Mozart.
(the noun)^n (transitive-verb)^(n-1) loves Mozart

These aren’t regular

though /A*B*loves Mozart/ is regular
Regular Expressions and FSAs



Regular expressions are equivalent to
FSA’s
So, the language of any FSA can be described using just
concatenation, union, and Kleene *
Question: how would you (graphically)
combine FSA’s using:



Concatenation
Union
Kleene *
Practice #1

e-Book: 2.1,2.4,2.8,2.10