Three New Algorithms for Regular Language Enumeration Margareta Ackerman Erkki Makinen University of Waterloo Waterloo, ON University of Tempere Tempere, Finland 0 B 0 A D 1 1 1 C 1 0 E What kind of words does this NFA accepts? 0 B 0 A D 1 1 1 C 1 0 E ε 0 00 11 000 110 0000 1100 1110 00000 .... Cross-section problem: enumerate all words of length n accepted by the NFA in lexicographic order. 0 B 0 A D 1 1 1 C 1 0 E ε 0 00 11 000 110 0000 1100 1110 00000 .... Enumeration problem: enumerate the first m words accepted by the NFA in length-lexicographic order. 0 B 0 A D 1 1 1 C 1 0 E ε 0 00 11 000 110 0000 1100 1110 00000 .... Min-word problem: find the first word of length n accepted by the NFA. Applications • Correctness testing, provides evidence that an NFA generates the expected language. • An enumeration algorithm can be used to verify whether two NFAs accept the same language (Conway, 1971). • A cross-section algorithm can be used to determine whether every word accepted by a given NFA is a power - a string of the from wn for n>1, |w|>0. (Anderson, Rampersad, Santean, and Shallit, 2007) • A cross-section algorithm can be used to solve the “ksubset of an n-set” problem: Enumerate all k-subset of a set in alphabetical order. (Ackerman & Shallit, 2007) Objectives Find algorithms for the three problems that are • Asymptotically efficient in – Size of the NFA (s states and d transitions) – Output size (t) – The length of the words in the cross-section (n) • Efficient in practice Previous Work • A cross-section algorithm, where finding each consecutive word is super-exponential in the size of the cross-section (Domosi, 1998). • A cross-section algorithm that is exponential in n (length of words in the cross-section) is found in the Grail computation package. – “Breast-First-Search” approach – Trace all paths of length n in the NFA, storing the paths that end at a final state. – O(dσn+1), where d is the number of transitions in the NFA and σ is the alphabet size. Previous Polynomial Algorithms: Makinen, 1997 • Dynamic programming solution – Min-word O(dn2) – Cross-section O(dn2+dt) – Enumeration O(d(e+t)) Quadratic in n e: the number of empty cross-section encountered d: the number of transitions in the NFA n: the length of words in the cross-section t: the number of characters in the output Previous Polynomial Algorithms: Ackerman and Shallit, 2007 • Linear in the length of words in the cross-section – Min-word: O(s2.376n) – Cross-section: O(s2.376n+dt) – Enumeration: O(s2.376c+dt) Linear in n c: the number of cross-section encountered d: the number of transitions in the NFA n: the length of words in the cross-section t: the number of characters in the output Previous Polynomial Algorithms: Ackerman and Shallit, 2007 • The algorithm uses “smart breadth first search,” following only those paths that lead to a final state. • Main idea: compute a look-ahead matrix, used to determine whether there is a path of length i starting at state s and ending at a final state. • In practice, Makinen’s algorithm (slightly modified) is usually more efficient, except on some boundary cases. Contributions Present 3 algorithms for each of the enumeration problems, including: • O(dn) algorithm for min-word • O(dn+dt) algorithm for cross-section • Algorithms with improved practical performance for each of the enumeration problems Contributions: Detailed • We present three sets of algorithms 1. AMSorted: - An efficient min-word algorithm, based on Makinen’s original algorithm. - A cross-section and enumeration algorithms based on this min-word algorithm. 2. AMBoolean: - A more efficient min-word algorithm, based on minWordAMSorted. - A cross-section and enumeration algorithms based on this min-word algorithm. 3. Intersection-based: - An elegant min-word algorithm. - A cross-section algorithm based on this min-word algorithm. Key ideas behind our first two algorithms - Makinen’s algorithm uses simple dynamic programming, which is efficient in practice on most NFAs. - The algorithm by Ackerman & Shallit uses “smart breadth first search,” following only those paths that lead to a final state. - We build on these ideas to yield algorithms that are more efficient both asymptotically and in practice. Makinen’s original min-word algorithm A 1 2 3 - (3,C) (3,C) 2 B 0 A B C 0 1 (2,B) (1,B) (0,A) (1,B) 1 3 C 1 S[i] stores a representation of the minimal word w of length i that appears on a path from S to a final state. Makinen’s original min-word algorithm A 1 2 3 - (3,C) (3,C) 2 B 0 A B C 0 1 (2,B) (1,B) (0,A) (1,B) 1 3 C 1 The minimal word of length n can be found by tracing back from the last column of the start state. Makinen’s original min-word algorithm • Initialize the first column • For columns i = 2...n – For each state S Find S[i] by comparing all words of length i appearing on 2 paths from S to a final state. 1 2 3 A - (3,C) (3,C) B 0 (2,B) (0,A) C 1 (1,B) (1,B) B 0 A 1 3 1 C Makinen’s original min-word algorithm • Initialize the first column • For columns i = 2...n i operations – For each state S Find S[i] by comparing all words of length i appearing on 2 paths from S to a final state. 1 2 3 A - (3,C) (3,C) B 0 (2,B) (0,A) C 1 (1,B) (1,B) B 0 A 1 3 1 C Makinen’s original min-word algorithm • Initialize the first column • For columns i = 2...n i operations – For each state S Find S[i] by comparing all words of length i appearing on paths from S to a final state. Theorem: Makinen’s original min-word algorithm is O(dn2). New min-word algorithm: MinWordAMSorted Idea: Sort every columns by the words that the entries represent. A 1 2 - (3,C) 3 (3,C) 321 B 0 (2,B) (0,A) 031 C 1 (1,B) (1,B) 120 B 0 A 1 3 1 C 2 New min-word algorithm: MinWordAMSorted • We define an order on {S[i] : S a state in N}. • If A[1]=a and B[1]=b, where a<b, then A[1]<B[1]. • For i > 1, A[i] = (a, A’) and B[i] = (b, B’) – If a<b, then A[i] < B[i]. – If a = b, and A’[i-1] < B’[i-1], then A[i] < B[i]. • If A[i] is defined, and B[i] is undefined, then A[i] > B[i]. New min-word algorithm: MinWordAMSorted • Initialize the first column • For columns i = 2...n – For each state S • Find S[i] using only column i-1 and the edges leaving S. – Sort column i 2 1 2 3 A - (3,C) (3,C) B 0 (2,B) (0,A) C 1 (1,B) (1,B) B 0 A 1 3 1 C New min-word algorithm: MinWordAMSorted • Initialize the first column • For columns i = 2...n d operations – For each state S • Find S[i] using only column i-1 and the edges leaving S. – Sort column i s log s operations Theorem: The algorithm minWordAMSorted is O((s log s +d) n). New cross-section algorithm: crossSectionAMSorted • A state S is i-complete if there exists a path of length i from state S to a final state. • To enumerate all words of length n: 1. Call minWordAMSorted (create a table) O((s log s +d) n). 2. Perform a “smart BFS”: O(dt) - Begin at the start state. - Follow only those paths of length n that end at a final state, by using the table to identify i-complete states. Theorem: The algorithm crossSectionAMSorted is O(n (s log s + d) + dn). New enumeration algorithm: enumAMSorted Run the cross-section algorithm until the required number of words are listed, while reusing the table. Theorem: The algorithm enumAMSorted is O(c (s log s + d)+ dt). c: the number of cross-section encountered d: the number of transitions in the NFA t: the number of characters in the output What have we got so far? New Algorithms Previous Algorithms Makinen Ackerman & Shallit O(dn2) O(s2.376n) cross-section O(n (s log s + d)+dt) O(dn2+dt) O(s2.376n+dt) enumeration O(c (s log s +d) + dt) O(de + dt) O(s2.376c+dt) Sorted min-word O((s log s + d)n) c: the number of cross-section encountered e: the number of empty cross-section encountered d: the number of transitions in the NFA n: the length of words in the cross-section t: the number of characters in the output New min-word algorithm: minWordAMBoolean Idea: instead of using a table to find the minimal word, construct a table whose only purpose is to determine i-complete states. Can be done using a similar algorithm to minWordAMSorted, but more efficiently, since there is no need to sort. New min-word algorithm: minWordAMBoolean A 1 2 3 F T T B 0 A B C T T T T T F 1 3 C New min-word algorithm: minWordAMBoolean • Fill in the first column • For i=2 ... n – For every state S • Determine whether S is i-complete using only the transitions leaving S and column i-1 • Starting at the start state, follow minimal transitions to paths that can complete a word of length n (using the table). 1 2 3 A F T T B T T T C T T F B 0 A 1 3 C New min-word algorithm: minWordAMBoolean d operations • Fill in the first column • For i=2 ... n – For every state S • Determine whether S is i-complete using only the transitions leaving S and column i-1 • Starting at the start state, follow minimal transitions to paths that can complete a word of length n (using the table). 1 2 3 A F T T B T T T C T T F B 0 A 1 3 C New min-word algorithm: minWordAMBoolean • Fill in the first column • For i=2 ... n – For every state S d operations • Determine whether S is i-complete using only the transitions leaving S and column i-1 • Starting at the start state, follow minimal transitions to paths that can complete a word of length n (using the table). Theorem: The algorithm minWordAMBoolean is O(dn). New cross-section algorithm: crossSectionAMBoolean • Extend to a cross-section algorithm using the same approach as the Sorted algorithm. • To enumerate all words of length n: – Call minWordAMBoolean (create a table) O(dn). – Perform a “smart BFS”: O(dt) - Begin at the start state. - Follow only those paths of length n that end at a final state, by using the table to identify i-complete states. Theorem: The algorithm crossSectionAMBoolean is O(dn+dt). New enumeration algorithm: enumAMBoolean Run the cross-section algorithm until the required number of words are listed, while reusing the table. Theorem: The algorithm enumAMBoolean is O(de+ dn). e: the number of empty cross-section encountered d: the number of transitions in the NFA n: the length of words in the cross-section t: the number of characters in the output What have we got so far? New Algorithms Previous Algorithms Makinen Ackerman & Shallit Sorted Boolean min-word O((s logs+d)n) O(dn) O(dn2) O(s2.376n) cross-section O(n (s log s+d)+dt) O(dn+dt) O(dn2+dt) O(s2.376n+dt) enumeration O(c (s log s +d) + dt) O(de+dt) O(de+dt) O(s2.376c+dt) c: the number of cross-section encountered e: the number of empty cross-section encountered d: the number of transitions in the NFA n: the length of words in the cross-section t: the number of characters in the output Intersection-Based Algorithms • We present surprisingly elegant min-word and cross-section algorithms that have the asymptotic efficiency of the Boolean-based algorithms. • However, these algorithms are not as efficient in practice as the Boolean-based and Sortedbased algorithms. New min-word algorithm: minWordIntersection Let N be the input NFA, and A be the NFA that accepts the language of all words of length n. 1. Let C = N x A 2. Remove all states of C that cannot be reached from the final states of C using reversed transitions. 3. Starting at the start state, follow the minimal n consecutive transitions to a final state. New min-word algorithm: minWordIntersection Let N be the input NFA, and A be the NFA that accepts the language of all words of length n. 1. Let C = N x A Let n = 2 Automaton A 2. Remove all states of C that cannot be Automaton N reached from the final states of C using 1 reversed transitions. 0 1 B 1 3. Starting at the start state, follow the A minimal n consecutive transitions to a final 0 1 0 1 state. 0 C 0 New min-word algorithm: minWordIntersection Let N be the input NFA, and A be the NFA that accepts the language of all words of length n. 1. Let C = N x A Automaton C 2. Remove all states of C that cannot be Automaton N reached from the final states of C using 1 reversed transitions. 0 1 B 1 3. Starting at the start state, follow the A minimal n consecutive transitions to a final 0 1 state. 0 C 0 New min-word algorithm: minWordIntersection Let N be the input NFA, and A be the NFA that accepts the language of all words of length n. 1. Let C = N x A 2. Remove all states of C that cannot be reached from the final states of C using reversed transitions. 3. Starting at the start state, follow the 1 minimal n consecutive transitions to a final 1 state. New min-word algorithm: minWordIntersection Let N be the input NFA, and A be the NFA that accepts the language of all words of length n. 1. Let C = N x A 2. Remove all states of C that cannot be1 reached from the final states of C using 1 reverse transitions. 3. Starting at the start state, follow the minimal n consecutive transitions to a final state. Thus the minimal word of length 2 accepted by N is “11” Asymptotic running time of minWordIntersection 1. Let C = N x A Concatenate n copies of N. 2. Remove all states of C that cannot be reached from the final states of C using reverse transitions. 3. Starting at the start state, Follow the minimal n consecutive transitions to final. Each step is proportional to size of C, which is O(nd). Theorem: The algorithm minWordIntersection is O(dn). New cross-section algorithm: crossSectionIntersection • To enumerate all words of length n, perform BFS on C = N x A, and remove all states not reachable from final state removed (using reverse transitions). • Since all paths of length n starting at the start state lead to a final state, there is no need to check for i-completness. Theorem: The algorithm crossSectionIntersection is O(dn+dt). Practical Performance • We compared Makinen’s, Ackerman-Shallit, AMSorted, and AMBoolean, and Intersection-based algorithms. • Tested the algorithms on a variety of NFAs: dense, sparse, few and many final states, different alphabet size, worst case for Makinen’s algorithm, ect… • Here are the best performing algorithms: – Min-word: AMSorted – Cross-section: AMBoolean – Enumeration: AMBoolean Summary New Algorithms Sorted Boolean Previous Algorithms Intersection Makinen Ackerman & Shallit min-word O((s logs+d)n) O(dn) O(dn) O(dn2) O(s2.376n) cross-section O(n (s log s +d)+dt) O(dn+dt) O(dn+dt) O(dn2+dt) O(s2.376n+dt) enumeration O(c (s log s +d) + dt) O(de+dt) - O(de+dt) O(s2.376c+dt) c: the number of cross-section encountered e: the number of empty cross-section encountered d: the number of transitions in the NFA n: the length of words in the cross-section t: the number of characters in the output : most efficient in practice Open problems • Extending the intersection-based cross-section algorithm to an enumeration algorithm. • Lower bounds. • Can better results be obtained using a different order? • Restricting attention to a smaller family of NFAs.