From: AAAI Technical Report SS-93-04. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Exploiting Algebraic Structure in Parallel State-Space Search
(Extended Abstract)
Jon Bright, Simon Kasif and Lewis Stiller
Department of Computer Science
The Johns Hopkins University
Baltimore, MD 21218
16 October 1992
Abstract

In this paper we present an approach for performing very large state-space search on parallel machines. While the majority of searching methods in Artificial Intelligence rely on heuristics, the parallel algorithm we propose exploits the algebraic structure of problems to reduce both the time and space complexity required to solve these problems on massively parallel machines. Our algorithms have applications to many important problems in Artificial Intelligence and Combinatorial Optimization, such as planning and scheduling problems.
1 Introduction
Many key computational problems in Artificial Intelligence (AI), such as planning, game playing and scheduling, can be solved by performing very large state-space searches. Our main interest in this paper is in the problem of planning, a well-studied problem in the AI literature. The best known algorithm for finding optimal solutions to general planning problems on sequential computers is IDA*, developed by Richard Korf [Kor85a, Kor85b]. IDA* has an exponential worst-case time complexity, and its efficiency depends heavily on the ability to synthesize good heuristics. There have been many attempts to parallelize state-space search [EHMN90, PK91, PFK91, PK89, RK87]. In this paper we examine problems for which the computational cost of exploring the entire space of possible states may be prohibitive and the derivation of good heuristics is difficult. For these problems we must reduce the worst-case complexity by a significant amount. Surprisingly, due to the special algebraic structure of the problems we are considering, we can substantially reduce the time and space complexity of the algorithm. For example, we will consider problems that have O(k^N) states (where N is the size of the input to the algorithm) and suggest parallel solutions whose time complexity is O(k^{N/2}/P) (where P is the number of processors) and whose space complexity is O(k^{N/4}). Whereas the time-space complexity remains exponential, these algorithms are capable of solving problems that were not tractable on conventional architectures. For example, if we consider a problem that requires 10^25 time using a brute-force algorithm, we can potentially solve it in roughly 10^9 time and 10^6 space with 1000 processors. This class of
problems seems to be particularly well matched to massively parallel machines such as the CM-2 or CM-5 [Thi92, Thi91, Hil85]. Our technique offers several exciting new possibilities for solving many problems, ranging from toy problems such as Rubik's Cube to scheduling problems for the Hubble Space Telescope.
Our approach is strongly influenced by the elegant sequential algorithm proposed by Shamir and his group, which we review in the next section [SS81, FMS+89]. In fact, we view our work as a parallel implementation of this algorithm.
In this paper we describe parallel algorithms using the CREW PRAM model of parallel computation. This model allows multiple processors to read from the same location and therefore hides the cost of communication. While this assumption is unrealistic in practice, it allows us to simplify the description of a relatively complex algorithm. Detailed analysis of our algorithms suggests that they can be expressed by efficient composition of computationally efficient primitives such as sorting, merging, parallel prefix and convolution. Ultimately, we plan to achieve an efficient implementation of our algorithms by exploiting the techniques for mapping algorithms onto parallel architectures being developed by Stiller [Sti92a, Sti92b, Sti91a]. This approach has already resulted in several important applications to chess endgame analysis and several scientific problems [Sti91b, Sti91c].
2 Review of the Shamir et al. algorithm
In this section, we consider the example of SUBSET SUM to illustrate this approach. A similar approach can be used to solve 0-1 knapsack problems. The input to the algorithm is a set of integers S and a number z. The output is a subset S' of S such that the sum of the elements in S' is equal to z. This problem is fundamental in scheduling, bin packing and other combinatorial optimization problems. For example, we can use solutions to this problem to minimize the total completion time of tasks of fixed duration executing on two processors. The problem is sometimes referred to as the partition problem, and is known to be NP-complete. It is also known that in practice it can be solved in pseudo-polynomial time by dynamic programming. Let N be the size of S. We will introduce some helpful notation. Let A and B be sets of integers. Define A + B to be the set of integers c such that c = a + b where a ∈ A and b ∈ B. Define A - B to be the set of integers c such that c = a - b where a ∈ A and b ∈ B.
Observation 1: The SUBSET SUM problem can be solved in O(N 2^{N/2}) time and O(2^{N/2}) space.
Proof. We partition S into disjoint sets S1 and S2 such that the size of S1 is N/2. Let G1 and G2 be the sets of all sums of subsets of elements in S1 and S2, respectively. Clearly, our problem has a solution if and only if the set G1 has a non-empty intersection with {z} - G2. However, note that we can sort both sets and find the intersection by merging. Thus, the time complexity of this algorithm can be seen to be O(N 2^{N/2}) using any optimal sorting algorithm and noting the trivial identity 2 log(2^{N/2}) = N. Unfortunately, the space complexity is also 2^{N/2}, which makes it prohibitive for solution on currently available machines for many interesting problems. This approach was first suggested by Horowitz and Sahni for 0-1 knapsack problems [HS86].
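To make Observation 1 concrete, here is a minimal sequential sketch in Python; the function name and the choice to return only a yes/no answer (rather than the witness subset S') are ours. The two-pointer scan plays the role of the merge that intersects G1 with {z} - G2.

```python
def subset_sum_two_lists(S, z):
    """Meet-in-the-middle SUBSET SUM (Observation 1):
    O(N 2^{N/2}) time, O(2^{N/2}) space."""
    S = list(S)
    half = len(S) // 2
    G1, G2 = [0], [0]
    for v in S[:half]:
        G1 += [s + v for s in G1]      # all subset sums of S1
    for v in S[half:]:
        G2 += [s + v for s in G2]      # all subset sums of S2
    G1.sort()
    G2.sort()
    # Scan G1 upward and G2 downward, looking for g1 + g2 = z,
    # i.e. a non-empty intersection of G1 with {z} - G2.
    i, j = 0, len(G2) - 1
    while i < len(G1) and j >= 0:
        t = G1[i] + G2[j]
        if t == z:
            return True
        if t < z:
            i += 1
        else:
            j -= 1
    return False
```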
In the AI literature this algorithm has a strong similarity to bidirectional search studied
by Pohl [Poh71]. The next observation allows us to reduce the space complexity to make
the algorithm practical [SS81].
Observation 2: The SUBSET SUM problem can be solved in O(N 2^{N/2}) time and O(2^{N/4}) space.
Proof. We partition S into four sets S_i, 1 ≤ i ≤ 4, each of size N/4. Let G_i, 1 ≤ i ≤ 4, be the sets of all possible subset sums of S_i, respectively. Clearly, the partition problem has a solution if and only if G1 + G2 has a non-empty intersection with the set ({z} - G4) - G3.
This observation essentially reduces our problem to computing intersections of A + B with C + D (where A, B, C and D are sets of integers, each of size 2^{N/4}). To accomplish this we utilize a data structure that allows us to compute such intersections, without an increase in space, in O(N 2^{N/2}) time. The main idea is to create an algorithm that generates the elements of A + B and C + D in increasing order, which allows us to compute the intersection by merging. However, their algorithm is strictly sequential, as it generates the elements of A + B one at a time. We will review their data structure in the next section.
3 A Parallel Solution to Intersecting A + B with C + D
Since we will use a data structure very similar to the one proposed in [FMS+89], it is worthwhile to review their implementation. We will first show how to generate the elements of A + B in ascending order. First, assume without loss of generality that A and B are given in ascending sorted order. During each phase of the algorithm, for each element a_k in A we keep a pointer to an element b_j such that all the sums of the form a_k + b_i (i < j) have already been generated. We denote such pointers by a_k → b_j. For example, part of our data structure may look similar to the figure below:
a1 → b10
a2 → b7
a3 → b6
a4 → b4
Additionally, we will maintain a priority queue of all such sums. To generate A + B in ascending order we repeatedly output the smallest a_k + b_j in the priority queue, insert the pointer a_k → b_{j+1} into the data structure, and also insert a_k + b_{j+1} into the priority queue. It is easy to see that, if A and B are of size 2^{N/4}, we can generate all elements of A + B in ascending order in time cN 2^{N/2}, where c is a small constant.
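The pointer-and-priority-queue scheme just described can be written compactly. The sketch below is sequential Python with names of our choosing; it assumes A and B are non-empty lists sorted in ascending order. Each heap entry encodes one pointer a_k → b_j together with its sum.

```python
import heapq

def sums_in_ascending_order(A, B):
    """Yield the elements of A + B in ascending order.
    The heap entry (A[k] + B[j], k, j) encodes the pointer a_k -> b_j."""
    heap = [(a + B[0], k, 0) for k, a in enumerate(A)]
    heapq.heapify(heap)
    while heap:
        s, k, j = heapq.heappop(heap)
        yield s
        if j + 1 < len(B):
            # Advance the pointer a_k -> b_{j+1} and enqueue the next sum.
            heapq.heappush(heap, (A[k] + B[j + 1], k, j + 1))
```

With |A| = |B| = 2^{N/4}, the heap never holds more than 2^{N/4} entries, and each of the 2^{N/2} outputs costs O(log 2^{N/4}) = O(N) heap work, matching the cN 2^{N/2} bound above.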
Our algorithm is based on the idea that instead of generating a single element of A + B at a time, we generate the next K elements in parallel (K ≥ N). We describe the algorithm assuming that we have P = K processors. By the standard use of Brent's theorem [Bre74] we obtain speed-up results for any number of processors less than or equal to K. This is important since K in practice might be far larger than the number of processors in the system.
We now describe the algorithm to find the next K elements. The main idea is as follows. Each element a_k in A points to the smallest element of the form a_k + b_j that has not been generated yet. We first insert the smallest K of these elements into an array TEMP of size 2K, which keeps track of the elements that are candidates to be generated. The reader should note that we cannot simply output these K elements. We call those a_k such that a_k + b_j is in TEMP alive (this notion will be clarified in step 5 below). We execute the following procedure:
1. offset := 1.

2. Repeat steps 3-6 until offset equals 2K.

3. Each a_k that is alive and points to b_j (a_k → b_j) inserts all elements of the form a_k + b_{j+m}, where m ≤ offset, into the array TEMP.

4. Find the Kth smallest element in TEMP and delete all elements larger than it.

5. If the element of the form a_k + b_{j+offset} remains in TEMP, we call a_k alive. Otherwise, a_k is called dead and will not participate in further computations.

6. Double the offset (i.e., offset := 2 * offset).
Note that the number of elements in TEMP never exceeds 2K, for the following reason. Assume that at phase t the number of live elements a_k is L. Each of these L elements has contributed exactly offset pairs a_k + b_j to TEMP (all of a live element's inserted pairs survive the pruning in step 4, since they are no larger than its pair at distance offset, which survived). In the next phase, each such a_k will contribute 2 × offset pairs, doubling its previous contribution. Thus, we add offset × L new pairs to TEMP. Since offset × L ≤ K, the number of pairs in TEMP never exceeds 2 × K.
It is easy to see that the procedure above terminates in O(log M) steps, where M is the size of the set B. Therefore, the entire process can be accomplished in O(log^2 M) steps with M processors, using at most M = 2^{N/4} storage. We use the procedure above repeatedly to generate the entire set A + B in ascending order.
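As a sanity check on steps 1-6, the following sequential simulation computes one block of the next K sums; all names and the pointer-update bookkeeping at the end are ours, and for simplicity it assumes K is a power of two and that at least 2K elements of B remain beyond every pointer (the boundary cases are left to the full paper). In the parallel version, the insertions and the Kth-smallest selection inside the loop are the per-processor work.

```python
import heapq
from collections import Counter

def next_k_block(A, B, ptr, K):
    """Find the next K smallest ungenerated sums A[k] + B[j], j >= ptr[k],
    and advance the pointers.  TEMP holds triples (sum, k, m) standing for
    A[k] + B[ptr[k] + m]."""
    # TEMP starts with the K smallest frontier sums; their a_k are alive.
    temp = set(heapq.nsmallest(
        K, ((A[k] + B[ptr[k]], k, 0) for k in range(len(A)))))
    alive = {k for (_, k, _) in temp}
    offset = 1
    while offset != 2 * K:
        cand = set(temp)                   # survivors of dead a_k remain
        for k in alive:                    # step 3: insert pairs up to offset
            for m in range(offset + 1):
                cand.add((A[k] + B[ptr[k] + m], k, m))
        temp = set(heapq.nsmallest(K, cand))   # step 4: keep the K smallest
        alive = {k for k in alive              # step 5: alive iff the pair
                 if (A[k] + B[ptr[k] + offset], k, offset) in temp}
        offset *= 2                            # step 6
    # The generated pairs form a prefix of each a_k's row, so the pointers
    # advance by the per-k counts; return the block in ascending order.
    taken = Counter(k for (_, k, _) in temp)
    for k, c in taken.items():
        ptr[k] += c
    return sorted(s for (s, _, _) in temp)
```

Starting from ptr = [0] * len(A), repeated calls to next_k_block reproduce, K elements at a time, the same ascending stream produced by sums_in_ascending_order above.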
3.1 Parallel Solution to SUBSET SUM
Using the idea above we can obtain significant speed-ups in the implementation of each phase of the SUBSET SUM algorithm. We describe a crude version of our algorithm with P processors, assuming P is constant. We provide an informal analysis of each phase, suggesting that significant speed-ups are possible in each phase.
1. We partition S into four sets S_i, 1 ≤ i ≤ 4, each of size N/4.

2. We generate the sets G_i, 1 ≤ i ≤ 4, i.e., the sets of all possible subset sums of S_i. Each set is of size 2^{N/4} = M. It is trivial to obtain M/P time complexity for this phase for any number of processors P ≤ M. We sort G1, G2, G3 and G4. Parallel sorting is a well-studied problem and is amenable to maximal speed-ups.

3. We invoke our algorithm for computing the next K (K = M) elements in the set A + B, as described in the previous section, to generate G1 o G2 and G3 o G4 in ascending order. This phase takes M/P time.

4. We intersect (by merging) the generated sets in M phases, generating and merging M elements at a time. Merging can be accomplished in M/P + O(log log P) time by a known parallel merge algorithm. Since at least M elements are eliminated at each phase, and the number of possible sums is M^2, the total time to compute the desired intersection is O(M^2/P).
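Run sequentially, phases 1-4 amount to the following sketch (the speed-ups come from executing the sorting, generation and merging on P processors). It reuses sums_in_ascending_order from the sketch in the previous section; the remaining names are ours. Note that ({z} - G4) - G3 can be generated in ascending order by running the same generator on the negated, re-sorted lists and shifting each term by z.

```python
def all_subset_sums(part):
    """All subset sums of `part` (a list of integers), sorted ascending."""
    sums = [0]
    for v in part:
        sums += [s + v for s in sums]
    return sorted(sums)

def intersect_ascending(xs, ys):
    """Merge-style intersection of two ascending streams: return a common
    value, or None if the streams are disjoint."""
    x, y = next(xs, None), next(ys, None)
    while x is not None and y is not None:
        if x == y:
            return x
        if x < y:
            x = next(xs, None)
        else:
            y = next(ys, None)
    return None

def subset_sum_four_lists(S, z):
    """Observation 2 run sequentially: O(2^{N/4}) space overall."""
    S = list(S)
    q = len(S) // 4
    parts = [S[:q], S[q:2 * q], S[2 * q:3 * q], S[3 * q:]]
    G1, G2, G3, G4 = (all_subset_sums(p) for p in parts)
    left = sums_in_ascending_order(G1, G2)        # G1 + G2, ascending
    # ({z} - G4) - G3 ascending: generate -(G3 + G4) ascending from the
    # negated, re-sorted lists, then shift every term by z.
    neg3, neg4 = sorted(-g for g in G3), sorted(-g for g in G4)
    right = (z + s for s in sums_in_ascending_order(neg3, neg4))
    return intersect_ascending(left, right) is not None
```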
The algorithm we sketched above provides a framework to achieve substantial speed-up without sacrificing the good space complexity of the best known sequential algorithm. The only non-trivial aspect of the algorithm is step 3, where we suggest an original procedure. There are many details missing from the description of the algorithm above. Since we are in the process of designing an efficient implementation on existing architectures such as the CM-5, the final details will vary from the algorithm sketched above. The main purpose of this note is to illustrate the potential for parallelism. The key to the success of the final implementation is our ability to map the algorithm above to available computational primitives such as parallel prefix and convolution.
4 Parallel Planning
It turns out that the algorithm sketched above has applications to a variety of problems that are on the surface different from the SUBSET SUM problem, one of which is the planning problem in AI. The approach was first outlined in a paper by Shamir and his group for planning in permutation groups [FMS+89]. We give a very informal description of their idea. A planning problem may be stated as follows. Let S be a space of states. Typically, this space is exponential in size. Let G be a set of operators mapping elements of S into S. We denote the composition of g_i and g_j by g_i g_j. As before, by G_i o G_j we denote the set of operators formed by composing operators in G_i and G_j, respectively. We assume that each operator g in G has an inverse, denoted by g^{-1}.
A planning problem is to determine the shortest sequence of operators of the form g_1, g_2, ... that maps some initial state X_0 to a final state Y. Without loss of generality, assume the initial and final states are always some states 0 and F, respectively. In this paper we make the planning problem slightly simpler by asking whether there exists a sequence of operators of length l that maps the initial state into the final state. Assume k is the size of G. Clearly, the problem can be solved by a brute-force algorithm of time complexity O(k^l). We can also solve it with O(l) space by iterative-deepening depth-first search. We are interested in problems for which k^l is prohibitive but k^{l/2} as a total number of computations is feasible.
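For contrast with the approach developed below, the O(k^l)-time, O(l)-space brute-force baseline mentioned above can be written as plain iterative-deepening depth-first search. This is our own illustration, not the paper's algorithm; apply, the state representation, and all names are assumptions of the sketch.

```python
def iddfs_plan(apply, G, start, goal, max_len):
    """Brute-force planning by iterative-deepening depth-first search:
    O(k^l) time but only O(l) space for the recursion stack.
    `apply(g, s)` is assumed to return the state reached from s by g."""
    def dfs(state, depth, plan):
        if state == goal:
            return plan
        if depth == 0:
            return None
        for g in G:
            found = dfs(apply(g, state), depth - 1, plan + [g])
            if found is not None:
                return found
        return None
    for limit in range(max_len + 1):
        plan = dfs(start, limit, [])
        if plan is not None:
            return plan
    return None
```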
Let us spell out some assumptions that make the approach work. We assume that all the operators have inverses. We refer to G^{-1} as the set of all inverses of operators in G. We also assume that it is possible to induce a total order < on the states of S. For example, in the example above the states are sums of subsets, ordered by the < relation on the set of integers. In the case of the permutation groups considered by Shamir et al., permutations can be ordered lexicographically. There are several additional mathematical assumptions that must be satisfied, some of which are outlined in Shamir et al. and in the full version of our paper.
Let G1 = G2 = G3 = G4 = G o G o ... o G (G composed with itself l/4 times). To determine whether there exists a sequence g_1, ..., g_l that maps 0 into F (i.e., g_1 ... g_l(0) = F), we instead ask whether F is contained in G1 o G2 o G3 o G4(0). But since the operators have inverses, we can equivalently ask whether G2^{-1} o G1^{-1}(F) has a non-empty intersection with G3 o G4(0).
This naturally suggests a scenario very similar to the one we considered in the discussion of the SUBSET SUM problem. If the sizes of G1, G2, G3 and G4 are M, we can solve this intersection problem in time O(M^2/P) and space O(M). This assumes that we can generate the states in A o B (where o is now composition of operators) in increasing order. The implementation of the algorithm sketched above becomes somewhat more involved because composition of operators is not necessarily monotonic in the order. We can, nevertheless, modify the algorithms sketched for SUBSET SUM to obtain efficient parallel implementations without losing efficiency. This implementation is discussed in the full version of our paper. As an application, we can obtain parallel algorithms for problems such as Rubik's Cube.
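To make the intersection formulation concrete, here is a toy sketch for permutation states. It differs from the paper's algorithm in two ways we flag explicitly: it uses a two-way split rather than the four-way split, and hash sets rather than ascending-order generation, so it illustrates only the meet-in-the-middle idea, at O(k^{l/2}) space instead of O(k^{l/4}). All names are ours.

```python
def compose(p, q):
    """Compose permutations given as tuples: (p o q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)

def invert(p):
    """Inverse permutation."""
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def plan_exists(G, start, goal, l):
    """Is there a word of length l over the operators G mapping `start`
    to `goal`?  Build the forward set from `start`, the backward set
    from `goal` via the inverse operators, and intersect."""
    assert l % 2 == 0
    forward = {start}
    for _ in range(l // 2):
        forward = {compose(g, s) for g in G for s in forward}
    backward, G_inv = {goal}, [invert(g) for g in G]
    for _ in range(l // 2):
        backward = {compose(g, s) for g in G_inv for s in backward}
    return bool(forward & backward)

# Example: two adjacent transpositions acting on permutations of 3
# elements; is the 3-cycle reachable from the identity in 2 moves?
swaps = [(1, 0, 2), (0, 2, 1)]
print(plan_exists(swaps, (0, 1, 2), (1, 2, 0), 2))   # True
```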
5 Discussion
We presented an approach for massively parallel state-space search. This approach is a parallel implementation of the sequential algorithm proposed in [FMS+89] for finding shortest word representations in permutation groups. Our implementation relies on a new idea for computing the K smallest elements in the set A + B. This idea is used to parallelize a key part of the algorithm. As mentioned before, we are in the process of designing a practical implementation of our approach on existing parallel machines. There are several non-trivial aspects of this implementation. Our main goal is solving extremely large instances of hard problems that are not tractable at all on existing sequential machines. Since the size of the state space generated is extremely large, we must be careful about the constants that dramatically influence the actual size of the problems this approach can attack. We believe that our practical implementation may have significant practical implications, since it should indicate which parallel primitives are fundamental in the solution of large state-space search problems. As a long-range goal, we plan to develop automated mapping techniques that take advantage of the special algebraic structure of problems in order to construct an efficient mapping onto parallel machines.
Acknowledgements

This research has been supported in part by U.S. Army Research Office Grant DAAL03-92-G-0345 and NSF/DARPA Grant CCR-8908092.
References

[Bre74] R. P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201-206, 1974.

[EHMN90] M. Evett, J. Hendler, A. Mahanti, and D. Nau. PRA*: A memory-limited heuristic search procedure for the Connection Machine. In Third Symposium on the Frontiers of Massively Parallel Computation, pages 145-149, 1990.

[FMS+89] A. Fiat, S. Moses, A. Shamir, I. Shimshoni, and G. Tardos. Planning and learning in permutation groups. In 30th Annual Symposium on Foundations of Computer Science, pages 274-279, Los Alamitos, CA, 1989. IEEE Computer Society Press.

[Hil85] D. Hillis. The Connection Machine. MIT Press, 1985.

[HS86] E. Horowitz and S. Sahni. Computing partitions with applications to the knapsack problem. Journal of the ACM, 23:317-327, 1986.

[Kor85a] R. E. Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence, 27:97-109, 1985.

[Kor85b] R. E. Korf. Iterative-deepening-A*: An optimal admissible tree search. In International Joint Conference on Artificial Intelligence, 1985.

[PFK91] C. Powley, C. Ferguson, and R. E. Korf. Parallel tree search on a SIMD machine. In Third IEEE Symposium on Parallel and Distributed Processing, Dallas, Texas, December 1991.

[PK89] C. Powley and R. E. Korf. SIMD and MIMD parallel search. In Proceedings of the AAAI Symposium on Planning and Search, Stanford, California, March 1989.

[PK91] C. Powley and R. E. Korf. Single-agent parallel window search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(5):466-477, May 1991.

[Poh71] I. Pohl. Bi-directional search. Machine Intelligence, 6:127-140, 1971.

[RK87] V. Nageshwara Rao and V. Kumar. Parallel depth-first search, part I: Implementation. International Journal of Parallel Programming, 16(6):479-499, 1987.

[SS81] R. Schroeppel and A. Shamir. A T = O(2^{n/2}), S = O(2^{n/4}) algorithm for certain NP-complete problems. SIAM Journal on Computing, 10(3):456-464, August 1981.

[Sti91a] L. Stiller. Group graphs and computational symmetry on massively parallel architecture. Journal of Supercomputing, 5(2/3):99-117, November 1991.

[Sti91b] L. Stiller. Karpov and Kasparov: The end is perfection. International Computer Chess Association Journal, 14(4):198-201, December 1991.

[Sti91c] L. Stiller. Some results from a massively parallel retrograde analysis. International Computer Chess Association Journal, 14(3):129-134, September 1991.

[Sti92a] L. Stiller. How to write fast and clear parallel programs using algebra. Technical Report LA-UR-92-2924, Los Alamos National Laboratory, 1992.

[Sti92b] L. Stiller. White to play and win in 223 moves: A massively parallel exhaustive state-space search. Journal of the Advanced Computing Laboratory, 1(4):1, 14-19, July 1992.

[Thi91] Thinking Machines Corporation. Connection Machine CM-200 Series Technical Summary. 245 First St., Cambridge, MA 02142-1264, June 1991.

[Thi92] Thinking Machines Corporation. Connection Machine CM-5 Technical Summary. 245 First St., Cambridge, MA 02142-1264, January 1992.