From: AAAI Technical Report WS-97-06. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Experiments with Distributed Anytime Inferencing: Working with Cooperative Algorithms

Edward Williams and Eugene Santos Jr.
Dept. of Electrical and Computer Engineering
Air Force Institute of Technology
Wright-Patterson AFB, OH 45433-7765
{esantos, ewilliam}@afit.af.mil

Solomon Eyal Shimony
Dept. of Mathematics and Computer Science
Ben Gurion University of the Negev
84105 Beer-Sheva, Israel
shimony@cs.bgu.ac.il
Anytime algorithms have demonstrated their usefulness in solving many classes of intractable and NP-hard problems. This approach allows the potential for improvement in the quality of the solution to be balanced against the cost of generating that improvement, both in time and system resources. While significant work has been accomplished on characterizing individual algorithms and sequences of algorithms, the same has not been done for collections of algorithms.

Our research extends the anytime concept by providing feedback to algorithms operating concurrently in a distributed cooperative environment. Traditional anytime algorithms execute individually, taking all their input from the problem being solved; the monitoring task plays little part in the problem solving process. We step beyond this paradigm by not only using multiple concurrent anytime algorithms, but also by providing the capability for these algorithms to interact. The anytime solutions obtained from the individual algorithms are available to the collection; the capability of each algorithm to utilize this feedback to improve its performance is called the anywhere property.
The introduction of the anywhere property to inference algorithms gives us the capability of run-time control over the tasks. A controller can determine which
task(s) are suitable for a given problem based on the
characteristics of that specific problem. During runtime it can monitor and control the processes to guide
the convergence towards the optimal solution, determining when tasks are non-productive so they may be
terminated and possibly replaced with more appropriate tasks. To do this, however, it needs models of the
algorithms’ performance. We develop these models,
taking into consideration not only the characteristics
of the problem, but also the effects of receiving the
anytime solutions produced by other algorithms. Using combinations of algorithms can lead to an improved
performance profile for the composite system.
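The monitoring and termination role of the controller described above can be sketched in code. This is a hypothetical illustration, not the authors' implementation: the names `AnytimeTask`, `prob_improvement`, and `control_loop`, the round-robin scheduling, and the threshold test are all assumptions introduced here.

```python
class AnytimeTask:
    """Hypothetical wrapper for one anytime inference algorithm.
    `step` runs one increment of work and returns the best solution value
    found so far; `prob_improvement` is the task's performance model,
    mapping the current best value to a probability of improvement."""

    def __init__(self, name, step, prob_improvement):
        self.name = name
        self.step = step
        self.prob_improvement = prob_improvement
        self.best_value = 0.0

    def run_increment(self):
        self.best_value = max(self.best_value, self.step())


def control_loop(tasks, threshold=0.01, budget=100):
    """Run tasks round-robin; terminate any task whose modeled
    probability of improvement drops below `threshold`."""
    best = None
    active = list(tasks)
    for _ in range(budget):
        # Drop tasks the performance model judges non-productive.
        active = [t for t in active
                  if t.prob_improvement(t.best_value) >= threshold]
        if not active:
            break
        for t in active:
            t.run_increment()
            best = t.best_value if best is None else max(best, t.best_value)
    return best
```

In a fuller version, a terminated task would be replaced with a more appropriate one, as the text suggests; the sketch only shows the monitor-and-terminate decision.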
One class of problems that can benefit from this strategy is probabilistic reasoning; in particular, reasoning with Bayesian Networks (Pearl 1988). Unfortunately, exact inference using Bayesian Networks is NP-hard in general (Shimony 1994), though certain classes of networks (e.g. polytrees) are solvable in polynomial time. Approximation techniques also fall into this category (Dagum & Luby 1993). We have found, as have others, that characteristics of Bayesian Networks have a significant effect on the performance of the inferencing process. Network density and the distribution of the conditional probability tables are examples of these characteristics; but different algorithms can respond differently to the same characteristics.
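On a toy network, the belief-revision problem (finding the instantiation with the highest joint probability) can be solved by brute-force enumeration, which also makes the exponential blow-up behind the NP-hardness result concrete. The three-node chain and its conditional probability tables below are made-up illustrations, not taken from the paper.

```python
from itertools import product

# A tiny hypothetical network A -> B -> C with binary RVs.
# The CPT numbers are illustrative, not from the paper.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}  # key: (a, b)
p_c_given_b = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}  # key: (b, c)

def joint(a, b, c):
    # Chain rule: P(a, b, c) = P(a) * P(b|a) * P(c|b)
    return p_a[a] * p_b_given_a[(a, b)] * p_c_given_b[(b, c)]

# Exhaustive belief revision: 2^n candidate instantiations, so this
# enumeration is infeasible for large networks (exact inference is NP-hard).
best = max(product([0, 1], repeat=3), key=lambda s: joint(*s))
```

Here `best` is the most probable complete instantiation; the anytime algorithms discussed in the paper approximate this maximization without enumerating the whole space.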
This paper explores models of algorithm performance and the interactions between them, producing a composite system model. Early experiments on Bayesian Networks with different combinations of algorithms in a static system configuration show promising results.
Algorithm Modeling
We have two different uses for our models: algorithm selection and run-time control. For algorithm selection, we need to capture the behavior of algorithms singly and in combination with others. There are a number of approaches available for inferencing with Bayesian Networks, but only a few of them address the issue of anytime computations (D'Ambrosio 1993; Wellman & Liu 1994) and none have been combined to work in a parallel and cooperative fashion. As we mentioned earlier, algorithm performance is very much problem-instance dependent, as addressed in (Jitnah & Nicholson 1996) and elsewhere.

Run-time control requires knowledge of the likelihood that an algorithm will produce a better solution if allowed to continue. We refer to this likelihood as the probability of improvement:
    P_imp(S_i) = Σ_{S ∈ 𝒮} P_select(S|S_i) P[P(S) > P(S_i)]    (1)
where 𝒮 is the set of all possible solutions. This probability encompasses both the joint probability distribution of the Bayesian Network (P(S)) and the actions of the algorithm itself (P_select(S|S_i)). The anytime nature of the algorithms allows us to treat each one as a state machine, and its operation can be modeled as the relation that produces the next state (S_{i+1}) given the current state (S_i) and any inputs. Thus, the algorithm's execution produces a sequence of states. In general, for each state in the sequence there is a set of possible states that can be transitioned into. In the first term of the summation the probability of selecting these states will be non-zero. P_imp is then the likelihood that the algorithm will select a solution with a joint probability higher than the joint probability of the initial state S_i. For a simple random search (equal probability of selection for all solutions) the probability of selection is 1/N (N = |𝒮|), and P_imp is simply the likelihood that a solution exists with a higher joint probability. This model is not necessarily exact, but reflects the nature of the algorithm's response to interesting characteristics in the data as well as its response to the other algorithms in the cooperative system.

A* Search

Traditional best-first algorithms, such as A*, begin with an initial state and expand one RV based on a topologically ordered list of RVs. This ordering is from the leaves of the network to the root.

To provide an anytime solution, we adapted the A* by simply picking the current best partial instantiation and completing it. While an incoming solution cannot be used in the search itself, the value of the solution can be used to prune the search space. The incoming solution can also be used to help with the generation of the anytime solution; instead of using an arbitrary value to fill the uninstantiated RVs, we can use the values from the received solution. This provides an anywhere property and benefits from optimizations occurring in other algorithms.

For any deterministic search, there is no uncertainty about the next solution that will be generated; the problem is that we have no way of knowing what that solution is without actually running the algorithm, which could be computationally expensive. All we have is a solution that was generated in a previous iteration; we don't know (or even want to know) the internal state of the search. With this limited amount of information, we need to determine the probability of improvement for the next iteration.

Based on the above, we cannot narrow the range of possible solutions; the probability of selection is 1/N for all solutions. The result becomes the percentage of solutions with probabilities higher than our given solution:

    P_imp(S_i) = P(X > c)    (2)

where X is a discrete random variable with an underlying probability density function P(S), the joint probability distribution of our network. The constant c is the joint probability of our given solution, P(S_i). If the probability distribution of our network is known, we can calculate the probability of improvement directly; unfortunately, this is rarely the case.

If we know that our given solution came from an earlier iteration of this search, we can estimate the probability of improvement by using information about the A* heuristic. The A* search produces a series of partial solutions such that the value of the solution (cost plus heuristic) is monotonically non-increasing. Due to the bottom-up expansion of the next-state generation method, the RVs in this partial solution fall into two groups: I represents those RVs which are instantiated in both solutions and also have all of their parent RVs instantiated; F covers those RVs on the 'fringe' which are instantiated but are not in I. A third group of RVs in the network consists of those that are uninstantiated in both solutions; these are represented by U.

Using these categories, the probability of improvement from one iteration of the search to the next can be represented by substituting the joint probabilities back into the previous equation and using the chain rule to expand them; for each solution we get P(S_i) = P(I_i|π(I_i)) P(F_i|π(F_i)) P(U_i|π(U_i)), where π(·) is the set of instantiations to the parents of the RVs. If we define the cost of the partial solution as the marginal probability of those RVs that have complete parental instantiations (the set I) and let h be the heuristic covering the RVs not in I, the monotonically non-increasing attribute of the cost and heuristic can be stated as: P(I|π(I)) h(F ∪ U) ≤ P(I_i|π(I_i)) h(F_i ∪ U_i). By moving the probabilities to the left side of the inequality we get the following relation between the cost and heuristic of the two solutions:

    P(I|π(I)) / P(I_i|π(I_i)) ≤ h(F_i ∪ U_i) / h(F ∪ U)    (3)

Equation 3 allows us to define the probability distribution the samples are drawn from in terms of the semantics used in the computation of the heuristic. Making the proper substitutions, we get

    P_imp(S_i) ≤ P[ (P(F|π(F)) P(U|π(U)) h(F_i ∪ U_i)) / (P(F_i|π(F_i)) P(U_i|π(U_i)) h(F ∪ U)) > 1 ]    (4)

This indicates that the probability of improvement is bounded by the relative proximity of the completed portion of the anytime solution to the value of the heuristic representing those same RVs. Since the solutions are completed with the same values, the significant part of the numerator is the marginal probability of the 'fringe' RVs. The accuracy of the heuristic and the variance in the marginal probabilities are the driving factors in this calculation.

Genetic Algorithms

The class of algorithms based on simple Genetic Algorithms (GA) is nondeterministic. The process of belief revision is accomplished when genetic algorithms take a small sample from the space of possible solutions (called the population) and use it to generate other (possibly better) solutions. The method of generating new solutions is modeled after natural genetic evolution; this allows the GA to explore more of the search space than a deterministic algorithm. The disadvantage to this approach is that there is no known method for determining if the optimal solution has been found. Incorporating anytime and anywhere characteristics into the GA is straightforward; we can take intermediate solutions from, and place incoming solutions directly into, the population.

For the GA, the probability of improvement is dominated by the probability of selection; the probability that a given solution is better than the initial solution is just the probability distribution itself. If the GA is viewed as a set of separate selection operations with identical characteristics, the probability of improvement for the whole operation is P_imp = 1 - (1 - p_imp)^M, where p_imp is the probability of improvement for each member of the population of size M. This obviously is a gross oversimplification, but it does show the significant potential performance for the GA in the early stages where P_imp is quite large.

The exact probability of selection for GAs in general cannot currently be calculated; but for our problem domain it is clearly dependent on several factors, including the probability distribution of the Bayesian Network, the function used to evaluate the members of the population, and the number of RVs that are different from the current member of the population. In a simple case where the evaluation function is the joint probability of the solution and the landscape has only a single optimum, the individual probability of selection can be approximated by P(S), giving us P_imp(S_i) ≈ Σ_{S ∈ 𝒮} P(S) P(P(S) > P(S_i)), which is the sum over solutions whose values are greater than the value of the initial solution; a value that can be estimated from the probability distribution of the Bayesian Network. As the members of the population evolve towards their local optima, the level rises and the probability of finding a higher probability solution decreases. Empirically, it has been shown that GAs can find good solutions quickly (Rojas-Guzman & Kramer 1993); but again, we don't know when we have the optimal answer.

Other Algorithms

This methodology can be applied to other inference algorithms:

Clustering  Clustering and junction tree methods, much in use for belief updating, can also be used for belief revision. We were not able to utilize clustering algorithms because even the smaller networks in our experiments were, as a rule, too large or too connected; these methods require space exponential in the cluster size (or graph effective width).

Hybrid Stochastic Search  This algorithm is very similar to the GA in its processing of a population of solutions; the two primary operators are random replacement and local optimization. These operations make this algorithm more likely to get stuck on local optima.

Integer Linear Programming  This approach transforms the belief revision problem into an integer linear programming (ILP) problem. The method used to solve the ILP problem is to set upper and lower bounds on the quality of the solution and then find any integral solution that falls within these bounds; the quality of the solution found is then used as the new lower bound and the process iterates until the optimal solution is found.
Algorithm Interaction

We now take the models of algorithm performance developed above and explore the effects of interaction between algorithms.
Individual Algorithms  Deterministic algorithms use the incoming solutions to prune their search space. The greatest impact of the incoming solution on these algorithms, however, is in the generation of anytime solutions by using the received solution as a template to fill in the incomplete parts of the partial solution; this can significantly improve the quality of the completed solutions. Most exact algorithms expand their search progressively from either the roots or the leaves of the network; the difference between completing the solution arbitrarily and with the received solution is limited to the RVs in the F and U sets described earlier. Thus, the solution completed with the incoming solution is better than the solution completed with the arbitrary solution when P(U|π(U)) P(F|π(F)) > P(U_i|π(U_i)) P(F_i|π(F_i)).
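The template-completion step described above can be sketched in a few lines. The helper name `complete`, the use of `None` for uninstantiated RVs, and the 4-RV example are illustrative assumptions, not details from the paper; in the actual system, whichever completion has the higher joint probability would be kept.

```python
def complete(partial, received, default=0):
    """Fill the uninstantiated RVs (marked None) of a partial solution
    two ways: from a received solution, and with an arbitrary default."""
    filled_received = tuple(r if p is None else p
                            for p, r in zip(partial, received))
    filled_default = tuple(default if p is None else p for p in partial)
    return filled_received, filled_default

# Hypothetical 4-RV partial solution: RVs 2 and 3 not yet instantiated.
partial = (1, 0, None, None)
received = (0, 0, 1, 1)  # incoming solution from another algorithm
from_received, from_default = complete(partial, received)
# from_received is (1, 0, 1, 1); from_default is (1, 0, 0, 0)
```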
Approximate algorithms typically work with complete solutions; the GA and Stochastic Simulation are no exception. Both approximate algorithms used here incorporate the incoming solution directly into the pool of solutions being manipulated; the presence of the higher probability solutions tends to improve the algorithm performance if the new solutions are significantly better than the rest of the solutions in the pool.
Combinations  With the different algorithms executing independently, the probability of improvement for any given time increment is

    P_imp_total = 1 - Π_i (1 - P_imp_i)    (5)

which is the probability that at least one algorithm will improve. Unless the probability of improvement for all algorithms is zero, we will always gain something by utilizing multiple algorithms. With the performance improvements to each algorithm resulting from interaction, the overall system's probability of improvement can be even higher.
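Equation 5, together with the earlier population-level GA approximation, reduces to a few lines of code. This is a direct transcription of the formulas; the function names are introduced here for illustration.

```python
def p_improve_system(p_improves):
    """Equation 5: probability that at least one of several independently
    executing algorithms improves in a given time increment."""
    prod = 1.0
    for p in p_improves:
        prod *= (1.0 - p)
    return 1.0 - prod

def p_improve_population(p_member, m):
    """GA approximation from the text: M independent selection operations,
    each improving with probability p_member."""
    return 1.0 - (1.0 - p_member) ** m

# The composite system never has a lower probability of improvement than
# its best component: p_improve_system([0.3, 0.5]) is about 0.65.
```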
The algorithm analyses earlier indicated a similarity in performance between some algorithms. The GA and Stochastic Simulation have similar probability of improvement equations; this is not surprising due to the similar nature of the manipulations performed. The A* and ILP also have similar performance characteristics. While combining algorithms from the same class can improve performance by increasing the coverage of the problem space, the improvement due to the information sharing can only take place if the information being shared is not already known to the receiving algorithm. For example, if two identical GAs were being used, as soon as the first high-probability solution was shared, both populations would tend to converge to the same local optimum.

Improvement due to sharing information mostly occurs when the algorithm's performance characteristic is significantly different from the rest of the collection. One algorithm with a steep performance curve can boost the performance of a slower performer; or if the performance curves for two algorithms intersect, they will alternate the roles of leader and follower. Obviously, algorithms from different classes will have significantly different performance profiles; but in some cases, it is possible to significantly affect the performance of an algorithm through its parameters. In such cases, it may be beneficial to use two similar algorithms, but with different parameters.
Results

The networks used for the testing were randomly generated. The probability distributions ranged from flat (uniform, with very small variance) to extreme (exponential, with high variance). Five networks of each type were generated.
Figure 1: Performance on extreme network
Figure 2: Performance on flat network
Our experiments reported here involved running the GA and A* individually and in combination on Bayesian Networks with different characteristics. The individual runs provided a baseline for performance comparisons. For homogeneous networks (all RVs in the network have the same characteristic), the combination of the two algorithms always performed at or above the level of the individual algorithms. This result was predicted by our analysis of the algorithm combinations.

Where we achieved a significant performance improvement is with heterogeneous networks. The networks were constructed by combining a flat 30-RV network with an extreme 30-RV network. The individual algorithm performance on each of the 30-RV networks by themselves is shown in Figures 1 and 2. The vertical axis is the average normalized joint probability generated and the horizontal axis is normalized run-time. As you can see here and in Table 1, the GA could never achieve even close to the optimal solution for the extreme network, and the A* bogged down early in the
flat network. Figure 3 shows the results of running the individual algorithms on the composite network; neither algorithm could come close to the optimal answer by itself. When the algorithms were combined, however, performance significantly improved. The raw data for all these runs is summarized in Table 1.

Figure 3: Performance on combined network

            extreme     flat        combination
GA          3.367e-05   6.085e-12   2.737e-16
A*          6.369e-01   3.948e-12   1.681e-12
Both        --          --          3.790e-12
Optimal     6.369e-01   6.099e-12   3.884e-12

Table 1: Performance comparison for one composite network.

The difference between the A* and the combined run is more significant than it seems: the range in probabilities for the flat network is less than one order of magnitude. The A* by itself solved the extreme part of the network, leaving the flat part alone. The GA alone solved the flat part of the network and got caught on a spike in the extreme part. Together, both parts of the network were solved, resulting in a near-optimal solution.

Conclusions

Others have characterized the performance of individual anytime algorithms, and even sequences of anytime algorithms to perform a task; our analysis expands the scope to encompass heterogeneous collections of anytime algorithms running concurrently, interacting with each other to improve the overall system performance. Our model for task behavior addresses the effects of problem characteristics as well as the interaction between tasks at run-time. It is clear from the interaction models that overall system performance cannot decrease when multiple algorithms are used, and it has the potential for significantly increasing. One case in particular that showed improvement was when an exact algorithm (A*) was combined with an approximate algorithm (Genetic Algorithm). Our tests with this combination consistently demonstrated the ability of the algorithms to assist each other, producing a better performance profile.

Using anytime algorithms for uncertain reasoning gives us the opportunity to decide when our answer is "good enough", and using multiple algorithms allows us to take the strengths of the individual algorithms and create a stronger, more versatile system. Using a cooperative distributed environment where the algorithms share intermediate results gives us a system with the potential to maximize the utilization of the available resources, even in a resource-limited situation.

References

Dagum, P., and Luby, M. 1993. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60(1):141-153.
D'Ambrosio, B. 1993. Incremental probabilistic inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. San Francisco, CA: Morgan Kaufmann Publishers.

Jitnah, N., and Nicholson, A. E. 1996. Belief network inference algorithms: a study of performance based on domain characterisation. Technical Report TR96-249, Monash University, Clayton, VIC, 3168 Australia.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Rojas-Guzman, C., and Kramer, M. A. 1993. GALGO: A Genetic ALGOrithm decision support tool for complex uncertain systems modeled with Bayesian belief networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 368-375. San Francisco, CA: Morgan Kaufmann Publishers.

Shimony, S. E. 1994. Finding MAPs for belief networks is NP-hard. Artificial Intelligence 68:399-410.

Wellman, M. P., and Liu, C.-L. 1994. State-space abstraction for anytime evaluation of probabilistic networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 567-574. San Francisco, CA: Morgan Kaufmann Publishers.