Regular Expressions into Finite Automata

advertisement
Regular Expressions into Finite Automata
Anne Bruggemann-Klein, Freiburg University, Germany.
Summary written by: Rutie Mesing
The article concentrates on three main points. In the first part comes a definition of the Star Normal Form and a
building of the Glushkov automaton from a regular expression E in O((size of E) 2). The second issue is building
the Glushkov automaton in O(size of E) for deterministic regular expressions. The third result is characterizing
the relationship between strong and weak unambiguity, using the star normal form and presenting quadratic time
decision algorithm for weak unambiguity.
Definitions
E, F or G will be used to mark regular expression. L(E) denotes the language specified by the regular expression
E. ME usually denotes the Glushkov automaton for E, unless mentioned otherwise.
The size of a regular expression E is defined as the number of symbols it contain, including the syntactic symbols
such as brackets, +, ., and * . The size of an NFA is the number of its transitions.
pos(E) is defined to be the set of subscripted symbols in an expression E. for example, the expression
(a+b)*a(ab)* is written as (a 1+b1)*a2(a3b2)* or as (a1+b2)*a3(a4b5)*. The subscripted symbols ai, bi are called
positions and are marked with the letters x, y, z. For a position x, (x) is the corresponding symbol of .
Subscripting implies for regular expressions F+G or FG that pos(F) and pos(G) are disjoint.
The following functions definitions are used to define the Glushkov automaton recognizing L(E). These functions
capture the notion of a position in a regular expression matching a symbol in a word. These functions are:
first(E), the set of positions that match the first symbol of some word in L(E). last(E), the dual set of last
positions and symbols. follow(E, x), maps x to subset of positions of E that match the symbols that might come
right after the symbol (x) in some word in L(E). Full inductive definitions are found in appendix A.
Defining the Glushkov automaton: ME = (QE {qI}, , E, qI, FE) : QE = pos(E), For a , let
E(qI,a)={x|xfirst(E),(x)=a}, For xpos(E), a, let E(x,a)={y| yfollow(E,x), (y)=a}. FE will be last(E)
{qI} if L(ME), or last(E) otherwise.
The canonical method for computation of first, last and follow is cubic in the size of E. In order to compute these
methods in quadratic time and to compute the Glushkov automaton, M E, in time O(size(E)+size(ME)) we will
show the canonical method and a refinement of it, that uses the definition of the star normal form.
For using the canonical method we first need to convert E into a syntax tree: leafs are labeled with: ,  or
positions of E, internal nodes are labeled with the operators: + (or), . (concatenation) or * (iteration). Since the
regular expressions are generated by an LL(1) grammar, this can be done in time O(n).
The canonical method goes over the syntax tree in postorder and computes for each node v the sets of positions
first(v) and last(v) and updates the global variables follow(x) that exists for each x in E. As a result we receive the
positions sets first(E), last(E) and follow(E, x) for each x. The code that is executed for each node appears in
appendix B. Notice that the operations marked with (**) or (***) take O(n2) so computing this code for all
vertices takes cubic time. Notice also that all unions labeled (*) or (**) are disjoint, but unions labeled (***) are
not necessarily disjoint, as the expression (a*b*)* illustrates. From now on we will only consider expressions for
which all unions, including the ones of type (***), are disjoint. Such expressions are in star normal form.
The Star Normal Form (SNF):
A regular expression is in star normal form if for each starred sub expression H* of E the SNF-conditions:
follow(H, last(H))  first(H)= and ∉L(H) hold. Meaning, there are no positions in H that belong to first(H) and
can also follow a position that belongs to last(H) in any word in L(H).
Lemma: Let E be a regular expression in star normal form. ME (Glushkov automaton) can be computed from E
in time O(size(E) + size(ME)).
Proof: Steps marked with (*) take constant time (list concatenation) and calculating them can be done in
O(size(E)). For (**) or (***) – remember that the SNF promises disjoint unions – the union is implemented by
copying the elements of the right side of the union one by one to the end of the left side of the union. Thus, the
run time in proportional to the size of the right side of the union. Since those are disjoint unions, for each position
x in E, the computation of follow(E, x) (done in (***) and (**)) takes time linear in follow(E, x). We received that
total run time spent in instructions (**) or (***) is : xpos(E) |follow(E, x)|, which is less or equal to the number of
transitions in ME. So calculating steps marked with (**) and (***) is done in O(size(ME)).
The nest theorem comes to explain why the restriction to star normal form is justified,
Theorem: For each regular expression E, there is a regular expression E such that
1.
ME = ME
2.
E is in star normal form
3.
E can be computed from E in linear time.
(Glushkov Automaton)
The proof theorem, including the formal inductive building of E from E, is presented in the article and in the
presentation, and will no be given here.
Deterministic regular expressions
A regular expression E is deterministic if the corresponding NFA M E is deterministic.
Theorem:
1. It can be decided in linear time whether a regular expression E is deterministic.
2. If E is deterministic, then the deterministic finite automaton ME can be computed from E in linear time.
Proof: E is deterministic if and only if E is, because they have isomorphic Glushkov automata, so we can
assume that E is in star normal form. We start to compute first(E), last(E), and follow (E,x) for xpos(E)
incrementally, keeping track of the follow(E,x) in a |pos(E)|||matrix. If we receive a clash at any point it means
that the expression is not deterministic, and we decided it in linear time. Otherwise we will continue until
computing the follow(E,x) for all position of E is completed. For a deterministic expression E, the size of ME is
linear in the size of E. Then what we have is exactly what we need for the construction of M E.
Ambiguity in automata and expressions
Two types of unambiguity of regular expressions have been defined in the literature:
Weakly unambiguous expression:
An –NFA is Unambiguous if for each word w there is at most one path from the initial state to a final0 state that
spells out w.
Intuition: E is weakly unambiguous if each word of E has a unique path through E.
Definition: A regular expression E is weakly unambiguous if and only if the NFA ME is unambiguous.
Strongly unambiguous expression:
Intuition: Each word of E can be uniquely decomposed into subwords of E.
Definition: Let M’E be the NFA recognizing L(E) according to any of the standard constructions. Then
expression E is strongly unambiguous if and only if M’E is unambiguous.
Investigating the relationship between these two types of unambiguity leads to the following:
Lemma: If E is strongly unambiguous, then E is weakly unambiguous. Proven using elimination of  transitions.
Lemma: If E* is strongly unambiguous, then follow(E, last(E))first(E) = . Meaning that E fulfills the SNFcondition. The proof is presented in the article and the presentation.
Using the following definition of the epsilon normal form we receive a full knowledge of the relationship
between weakly and strongly unambiguous expressions:
Epsilon Normal Form (ENF) condition: No subexpression of E denotes the empty word ambiguously. Full
inductive definition is presented in the article.
The next theorems are proven in the article using the ENF definition and the previous lemmas.
Theorem: E is strongly unambiguous if and only if
1. E is weakly unambiguous
2. E is in star normal form
3. E is in epsilon normal form
Theorem: Regular expressions in epsilon normal form can be tested for weak unambiguity in quadratic time
Open problems
It is easy to see that a regular expression can be tested for epsilon normal form in linear time, but can a given
regular expression be transformed into epsilon normal form in linear time?
The presented transformation into star normal form can deal with starred subexpressions. Hence, the
crucial point is how expressions E=F+G with L(F)L(G) can be handled. A straightforward approach
would eliminate the empty string either from L(F) or from L(G). This opens up another question: is there
a lineartime algorithm transforming a regular expression E into an expression E’ with L(E’)=L(E)\{}?
Appendix A
Inductive definition of first(E), last(E):
[E =  or ]
first(E) = last(E) = 
[E = x]
first(E) = last(E) = {x}
first(E) = first(F)  first(G)
[E = F + G]
last(E) = last(F)  last(G)
first(E) =
[E = FG]
first(F)  first(G) if ∈L(F)
first(F)
last(E) = last(F)  last(G)
last(G)
[E = F *]
otherwise
if ∈L(G)
otherwise
first(E) = first(F)
last(E) = last(F)
Inductive definition of follow(E,x):
[E =  or ]
E has no positions
[E = x]
follow(E,x) = 
[E = F + G]
follow(E,x) = follow(F,x)
follow(G,x)
[E = FG]
follow(E,x) = follow(F,x)
if x∈pos(F)
if x∈pos(G)
if x∈pos(F)\ last(F)
follow(F,x)first(G) if x∈last(F)
follow(G,x)
[E = F*]
follow(E,x) = follow(F,x)
if x∈pos(G)
if x∈pos(F)\ last(F)
follow(F,x)first(F) if x∈last(F)
Appendix B
case
v is a node labeled  :
nullable (v) := false;
first(v) := ;
last(v) := ;
v is a node labeled :
nullable (v) := true;
first(v) := ;
last(v) := ;
v is a node labeled x:
nullable (v) := false;
follow (x) := ;
first(v) := {x};
last(v) := {x};
v is a node labeled +:
nullable (v) := nullable (leftchild ) or nullable (rightchild );
first(v) := first(leftchild ) first(rightchild ); (
)
last(v) := last(leftchild ) last(rightchild ); ( )
v is a node labeled . :
nullable (v) := nullable (leftchild ) and nullable (rightchild );
for each x in last(leftchild) do
follow (x) := follow (x) first(rightchild ); (
)
if nullable(leftchild) then
first(v) := first(leftchild ) first(rightchild ) (
)
else
first(v) := first(leftchild );
if nullable(rightchild) then
last(v) := last(leftchild ) last(rightchild ) (
else
last(v) := last(rightchild );
)
v is a node labeled *:
nullable (v) := true;
for each x in last(child) do
follow (x) := follow (x) first(child ); (
first(v) := first(child );
last(v) := last(child );
end case;
)
Download