Regular Expressions into Finite Automata Anne Bruggemann-Klein, Freiburg University, Germany. Summary written by: Rutie Mesing The article concentrates on three main points. In the first part comes a definition of the Star Normal Form and a building of the Glushkov automaton from a regular expression E in O((size of E) 2). The second issue is building the Glushkov automaton in O(size of E) for deterministic regular expressions. The third result is characterizing the relationship between strong and weak unambiguity, using the star normal form and presenting quadratic time decision algorithm for weak unambiguity. Definitions E, F or G will be used to mark regular expression. L(E) denotes the language specified by the regular expression E. ME usually denotes the Glushkov automaton for E, unless mentioned otherwise. The size of a regular expression E is defined as the number of symbols it contain, including the syntactic symbols such as brackets, +, ., and * . The size of an NFA is the number of its transitions. pos(E) is defined to be the set of subscripted symbols in an expression E. for example, the expression (a+b)*a(ab)* is written as (a 1+b1)*a2(a3b2)* or as (a1+b2)*a3(a4b5)*. The subscripted symbols ai, bi are called positions and are marked with the letters x, y, z. For a position x, (x) is the corresponding symbol of . Subscripting implies for regular expressions F+G or FG that pos(F) and pos(G) are disjoint. The following functions definitions are used to define the Glushkov automaton recognizing L(E). These functions capture the notion of a position in a regular expression matching a symbol in a word. These functions are: first(E), the set of positions that match the first symbol of some word in L(E). last(E), the dual set of last positions and symbols. follow(E, x), maps x to subset of positions of E that match the symbols that might come right after the symbol (x) in some word in L(E). Full inductive definitions are found in appendix A. Defining the Glushkov automaton: ME = (QE {qI}, , E, qI, FE) : QE = pos(E), For a , let E(qI,a)={x|xfirst(E),(x)=a}, For xpos(E), a, let E(x,a)={y| yfollow(E,x), (y)=a}. FE will be last(E) {qI} if L(ME), or last(E) otherwise. The canonical method for computation of first, last and follow is cubic in the size of E. In order to compute these methods in quadratic time and to compute the Glushkov automaton, M E, in time O(size(E)+size(ME)) we will show the canonical method and a refinement of it, that uses the definition of the star normal form. For using the canonical method we first need to convert E into a syntax tree: leafs are labeled with: , or positions of E, internal nodes are labeled with the operators: + (or), . (concatenation) or * (iteration). Since the regular expressions are generated by an LL(1) grammar, this can be done in time O(n). The canonical method goes over the syntax tree in postorder and computes for each node v the sets of positions first(v) and last(v) and updates the global variables follow(x) that exists for each x in E. As a result we receive the positions sets first(E), last(E) and follow(E, x) for each x. The code that is executed for each node appears in appendix B. Notice that the operations marked with (**) or (***) take O(n2) so computing this code for all vertices takes cubic time. Notice also that all unions labeled (*) or (**) are disjoint, but unions labeled (***) are not necessarily disjoint, as the expression (a*b*)* illustrates. From now on we will only consider expressions for which all unions, including the ones of type (***), are disjoint. Such expressions are in star normal form. The Star Normal Form (SNF): A regular expression is in star normal form if for each starred sub expression H* of E the SNF-conditions: follow(H, last(H)) first(H)= and ∉L(H) hold. Meaning, there are no positions in H that belong to first(H) and can also follow a position that belongs to last(H) in any word in L(H). Lemma: Let E be a regular expression in star normal form. ME (Glushkov automaton) can be computed from E in time O(size(E) + size(ME)). Proof: Steps marked with (*) take constant time (list concatenation) and calculating them can be done in O(size(E)). For (**) or (***) – remember that the SNF promises disjoint unions – the union is implemented by copying the elements of the right side of the union one by one to the end of the left side of the union. Thus, the run time in proportional to the size of the right side of the union. Since those are disjoint unions, for each position x in E, the computation of follow(E, x) (done in (***) and (**)) takes time linear in follow(E, x). We received that total run time spent in instructions (**) or (***) is : xpos(E) |follow(E, x)|, which is less or equal to the number of transitions in ME. So calculating steps marked with (**) and (***) is done in O(size(ME)). The nest theorem comes to explain why the restriction to star normal form is justified, Theorem: For each regular expression E, there is a regular expression E such that 1. ME = ME 2. E is in star normal form 3. E can be computed from E in linear time. (Glushkov Automaton) The proof theorem, including the formal inductive building of E from E, is presented in the article and in the presentation, and will no be given here. Deterministic regular expressions A regular expression E is deterministic if the corresponding NFA M E is deterministic. Theorem: 1. It can be decided in linear time whether a regular expression E is deterministic. 2. If E is deterministic, then the deterministic finite automaton ME can be computed from E in linear time. Proof: E is deterministic if and only if E is, because they have isomorphic Glushkov automata, so we can assume that E is in star normal form. We start to compute first(E), last(E), and follow (E,x) for xpos(E) incrementally, keeping track of the follow(E,x) in a |pos(E)|||matrix. If we receive a clash at any point it means that the expression is not deterministic, and we decided it in linear time. Otherwise we will continue until computing the follow(E,x) for all position of E is completed. For a deterministic expression E, the size of ME is linear in the size of E. Then what we have is exactly what we need for the construction of M E. Ambiguity in automata and expressions Two types of unambiguity of regular expressions have been defined in the literature: Weakly unambiguous expression: An –NFA is Unambiguous if for each word w there is at most one path from the initial state to a final0 state that spells out w. Intuition: E is weakly unambiguous if each word of E has a unique path through E. Definition: A regular expression E is weakly unambiguous if and only if the NFA ME is unambiguous. Strongly unambiguous expression: Intuition: Each word of E can be uniquely decomposed into subwords of E. Definition: Let M’E be the NFA recognizing L(E) according to any of the standard constructions. Then expression E is strongly unambiguous if and only if M’E is unambiguous. Investigating the relationship between these two types of unambiguity leads to the following: Lemma: If E is strongly unambiguous, then E is weakly unambiguous. Proven using elimination of transitions. Lemma: If E* is strongly unambiguous, then follow(E, last(E))first(E) = . Meaning that E fulfills the SNFcondition. The proof is presented in the article and the presentation. Using the following definition of the epsilon normal form we receive a full knowledge of the relationship between weakly and strongly unambiguous expressions: Epsilon Normal Form (ENF) condition: No subexpression of E denotes the empty word ambiguously. Full inductive definition is presented in the article. The next theorems are proven in the article using the ENF definition and the previous lemmas. Theorem: E is strongly unambiguous if and only if 1. E is weakly unambiguous 2. E is in star normal form 3. E is in epsilon normal form Theorem: Regular expressions in epsilon normal form can be tested for weak unambiguity in quadratic time Open problems It is easy to see that a regular expression can be tested for epsilon normal form in linear time, but can a given regular expression be transformed into epsilon normal form in linear time? The presented transformation into star normal form can deal with starred subexpressions. Hence, the crucial point is how expressions E=F+G with L(F)L(G) can be handled. A straightforward approach would eliminate the empty string either from L(F) or from L(G). This opens up another question: is there a lineartime algorithm transforming a regular expression E into an expression E’ with L(E’)=L(E)\{}? Appendix A Inductive definition of first(E), last(E): [E = or ] first(E) = last(E) = [E = x] first(E) = last(E) = {x} first(E) = first(F) first(G) [E = F + G] last(E) = last(F) last(G) first(E) = [E = FG] first(F) first(G) if ∈L(F) first(F) last(E) = last(F) last(G) last(G) [E = F *] otherwise if ∈L(G) otherwise first(E) = first(F) last(E) = last(F) Inductive definition of follow(E,x): [E = or ] E has no positions [E = x] follow(E,x) = [E = F + G] follow(E,x) = follow(F,x) follow(G,x) [E = FG] follow(E,x) = follow(F,x) if x∈pos(F) if x∈pos(G) if x∈pos(F)\ last(F) follow(F,x)first(G) if x∈last(F) follow(G,x) [E = F*] follow(E,x) = follow(F,x) if x∈pos(G) if x∈pos(F)\ last(F) follow(F,x)first(F) if x∈last(F) Appendix B case v is a node labeled : nullable (v) := false; first(v) := ; last(v) := ; v is a node labeled : nullable (v) := true; first(v) := ; last(v) := ; v is a node labeled x: nullable (v) := false; follow (x) := ; first(v) := {x}; last(v) := {x}; v is a node labeled +: nullable (v) := nullable (leftchild ) or nullable (rightchild ); first(v) := first(leftchild ) first(rightchild ); ( ) last(v) := last(leftchild ) last(rightchild ); ( ) v is a node labeled . : nullable (v) := nullable (leftchild ) and nullable (rightchild ); for each x in last(leftchild) do follow (x) := follow (x) first(rightchild ); ( ) if nullable(leftchild) then first(v) := first(leftchild ) first(rightchild ) ( ) else first(v) := first(leftchild ); if nullable(rightchild) then last(v) := last(leftchild ) last(rightchild ) ( else last(v) := last(rightchild ); ) v is a node labeled *: nullable (v) := true; for each x in last(child) do follow (x) := follow (x) first(child ); ( first(v) := first(child ); last(v) := last(child ); end case; )