From: AAAI Technical Report WS-97-06. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Experiments with Distributed Anytime Inferencing: Working with Cooperative Algorithms

Edward Williams and Eugene Santos Jr.
Dept. of Electrical and Computer Engineering, Air Force Institute of Technology, Wright-Patterson AFB, OH 45433-7765
{esantos | ewilliam}@afit.af.mil

Solomon Eyal Shimony
Dept. of Math. and Computer Science, Ben Gurion University of the Negev, 84105 Beer-Sheva, Israel
shimony@cs.bgu.ac.il

Anytime algorithms have demonstrated their usefulness in solving many classes of intractable and NP-hard problems. This approach allows the potential for improvement in the quality of the solution to be balanced against the cost of generating that improvement, both in time and in system resources. While significant work has been accomplished on characterizing individual algorithms and sequences of algorithms, the same has not been done for collections of algorithms. Our research extends the anytime concept by providing feedback to algorithms operating concurrently in a distributed cooperative environment. Traditional anytime algorithms execute individually, taking all their input from the problem being solved; the monitoring task plays little part in the problem-solving process. We step beyond this paradigm not only by using multiple concurrent anytime algorithms, but also by providing the capability for these algorithms to interact. The anytime solutions obtained from the individual algorithms are available to the collection; the capability of each algorithm to utilize this feedback to improve its performance is called the anywhere property. The introduction of the anywhere property to inference algorithms gives us the capability of run-time control over the tasks. A controller can determine which task(s) are suitable for a given problem based on the characteristics of that specific problem.
During runtime it can monitor and control the processes to guide convergence towards the optimal solution, determining when tasks are non-productive so they may be terminated and possibly replaced with more appropriate tasks. To do this, however, it needs models of the algorithms' performance. We develop these models, taking into consideration not only the characteristics of the problem, but also the effects of receiving the anytime solutions produced by other algorithms. Using combinations of algorithms can lead to an improved performance profile for the composite system.

One class of problems that can benefit from this strategy is probabilistic reasoning; in particular, Bayesian Networks (Pearl 1988). Unfortunately, exact inference using Bayesian Networks is NP-hard in general (Shimony 1994), though certain classes of networks (e.g. polytrees) are solvable in polynomial time. Approximation techniques also fall into this category (Dagum & Luby 1993). We have found, as have others, that characteristics of Bayesian Networks have a significant effect on the performance of the inferencing process. Network density and the distribution of the conditional probability tables are examples of these characteristics, but different algorithms can respond differently to the same characteristics. This paper explores models of algorithm performance and the interactions between them, producing a composite system model. Early experiments on Bayesian Networks with different combinations of algorithms in a static system configuration show promising results.

Algorithm Modeling

We have two different uses for our models: algorithm selection and run-time control. For algorithm selection, we need to capture the behavior of algorithms singly and in combination with others.
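The monitor-and-terminate loop described above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation; the task interface (`step()`), the `min_gain` threshold, and the `DummyTask` stand-in are all hypothetical.

```python
class DummyTask:
    """Stand-in anytime task: each step() returns the current (non-decreasing)
    solution value; a real task would run one increment of an inference
    algorithm and report its best anytime solution so far."""
    def __init__(self, gains):
        self.gains, self.value, self.i = gains, 0.0, 0

    def step(self):
        if self.i < len(self.gains):
            self.value += self.gains[self.i]
            self.i += 1
        return self.value


def run_controller(tasks, max_steps, min_gain=1e-6):
    """Sketch of the run-time controller: monitor each task's anytime value
    and terminate tasks whose improvement falls below min_gain."""
    best = {id(t): float("-inf") for t in tasks}
    for _ in range(max_steps):
        for t in list(tasks):
            v = t.step()
            if v - best[id(t)] < min_gain:
                tasks.remove(t)          # non-productive: terminate it
            else:
                best[id(t)] = v
        if not tasks:
            break
    return max(best.values())


# One task keeps improving for two steps, the other stalls immediately.
result = run_controller([DummyTask([0.5, 0.3, 0.0]), DummyTask([0.2, 0.0])], 10)
```

A real controller would also replace terminated tasks with more appropriate ones, using the performance models developed below.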
There are a number of approaches available for inferencing with Bayesian Networks, but only a few of them address the issue of anytime computations (D'Ambrosio 1993; Wellman & Liu 1994), and none have been combined to work in a parallel and cooperative fashion. As we mentioned earlier, algorithm performance is very much problem-instance dependent, as addressed in (Jitnah & Nicholson 1996) and elsewhere.

Run-time control requires knowledge of the likelihood that an algorithm will produce a better solution if allowed to continue. We refer to this likelihood as the probability of improvement:

$$P_{imp}(S_i) = \sum_{S \in \mathcal{S}} P_{select}(S \mid S_i)\, P[P(S) > P(S_i)] \qquad (1)$$

where $\mathcal{S}$ is the set of all possible solutions. This probability encompasses both the joint probability distribution of the Bayesian Network ($P(S)$) and the actions of the algorithm itself ($P_{select}(S \mid S_i)$). The anytime nature of the algorithms allows us to treat each one as a state machine, and its operation can be modeled as the relation that produces the next state ($S_{i+1}$) given the current state ($S_i$) and any inputs. Thus, the algorithm's execution produces a sequence of states. In general, for each state in the sequence there is a set of possible states that can be transitioned into; for these states the first term of the summation will be non-zero. $P_{imp}$ is then the likelihood that the algorithm will select a solution with a joint probability higher than the joint probability of the initial state $S_i$. For a simple random search (equal probability of selection for all solutions) the probability of selection is $1/N$ (with $N = |\mathcal{S}|$), and $P_{imp}$ is simply the fraction of solutions with a higher joint probability. This model is not necessarily exact, but reflects the nature of the algorithm's response to interesting characteristics in the data as well as its response to the other algorithms in the cooperative system.
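For the simple random-search case, Equation 1 reduces to counting the solutions that beat the current one. A minimal sketch, using a toy hand-built joint distribution over three binary RVs (the values are illustrative, not from the paper):

```python
from itertools import product

# Toy joint distribution over three binary RVs (a stand-in for a small
# Bayesian network's joint distribution); the probabilities sum to 1.
joint = {s: p for s, p in zip(product([0, 1], repeat=3),
                              [0.30, 0.05, 0.10, 0.05, 0.20, 0.10, 0.15, 0.05])}

def p_improve_random(s_i):
    """Equation 1 with P_select = 1/N (uniform random search): the
    probability of improvement is just the fraction of solutions whose
    joint probability exceeds P(S_i)."""
    n = len(joint)
    return sum(1 for s in joint if joint[s] > joint[s_i]) / n

best_case = p_improve_random((0, 0, 0))   # the MAP solution: no improvement possible
weak_case = p_improve_random((0, 1, 1))   # a low-probability solution
```

With the full distribution in hand the computation is trivial; the point of the models in this section is precisely that the distribution is usually *not* available and $P_{imp}$ must be estimated.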
A* Search

Traditional best-first algorithms, such as A*, begin with an initial state and expand one RV at a time, following a topologically ordered list of RVs. This ordering is from the leaves of the network to the root. To provide an anytime solution, we adapted the A* by simply picking the current best partial instantiation and completing it. While an incoming solution cannot be used in the search itself, the value of the solution can be used to prune the search space. The incoming solution can also be used to help with the generation of the anytime solution; instead of using arbitrary values to fill the uninstantiated RVs, we can use the values from the received solution. This provides an anywhere property and benefits from optimizations occurring in other algorithms.

For any deterministic search, there is no uncertainty about the next solution that will be generated; the problem is that we have no way of knowing what that solution is without actually running the algorithm, which could be computationally expensive. All we have is a solution that was generated in a previous iteration; we don't know (or even want to know) the internal state of the search. With this limited amount of information, we need to determine the probability of improvement for the next iteration. Based on the above, we cannot narrow the range of possible solutions; the probability of selection is 1/N for all solutions. The result becomes the percentage of solutions with probabilities higher than our given solution:

$$P_{imp}(S_i) = P(X > c) \qquad (2)$$

where $X$ is a discrete random variable whose distribution is $P(S)$, the joint probability distribution of our network, and the constant $c$ is the joint probability of our given solution, $P(S_i)$. If the probability distribution of our network is known, we can calculate the probability of improvement directly; unfortunately, this is rarely the case.

If we know that our given solution came from an earlier iteration of this search, we can estimate the probability of improvement by using information about the A* heuristic. The A* search produces a series of partial solutions such that the value of the solution (cost plus heuristic) is monotonically non-increasing. Due to the bottom-up expansion of the next-state generation method, the RVs in this partial solution fall into two groups: I represents those RVs which are instantiated in both solutions and also have all of their parent RVs instantiated; F covers those RVs on the 'fringe' which are instantiated but are not in I. A third group of RVs in the network consists of those that are uninstantiated in both solutions; these are represented by U. Using these categories, we can substitute the joint probabilities back into the previous equation and use the chain rule to expand them; for each solution we get

$$P(S_i) = P(I_i \mid \pi(I_i))\, P(F_i \mid \pi(F_i))\, P(U_i \mid \pi(U_i))$$

where $\pi(\cdot)$ denotes the set of instantiations to the parents of the RVs.

If we define the cost of the partial solution as the marginal probability of those RVs that have complete parental instantiations (the set I) and let $h$ be the heuristic covering the RVs not in I, the monotonically non-increasing attribute of the cost and heuristic can be stated as:

$$P(I \mid \pi(I))\, h(F \cup U) \le P(I_i \mid \pi(I_i))\, h(F_i \cup U_i)$$

By moving the probabilities to the left side of the inequality we get the following relation between the cost and heuristic of the two solutions:

$$\frac{P(I \mid \pi(I))}{P(I_i \mid \pi(I_i))} \le \frac{h(F_i \cup U_i)}{h(F \cup U)} \qquad (3)$$

Equation 3 allows us to define the probability distribution the samples are drawn from in terms of the semantics used in the computation of the heuristic. Making the proper substitutions, we get

$$P_{imp}(S_i) \le P\!\left[ \frac{P(F \mid \pi(F))\, P(U \mid \pi(U))}{P(F_i \mid \pi(F_i))\, P(U_i \mid \pi(U_i))} \cdot \frac{h(F_i \cup U_i)}{h(F \cup U)} > 1 \right] \qquad (4)$$

This indicates that the probability of improvement is bounded by the relative proximity of the completed portion of the anytime solution to the value of the heuristic representing those same RVs. Since the solutions are completed with the same values, the significant part of the numerator is the marginal probability of the 'fringe' RVs. The accuracy of the heuristic and the variance in the marginal probabilities are the driving factors in this calculation.

Genetic Algorithms

The class of algorithms based on simple Genetic Algorithms (GAs) is nondeterministic. The process of belief revision is accomplished when genetic algorithms take a small sample from the space of possible solutions (called the population) and use it to generate other (possibly better) solutions. The method of generating new solutions is modeled after natural genetic evolution; this allows the GA to explore more of the search space than a deterministic algorithm. The disadvantage of this approach is that there is no known method for determining whether the optimal solution has been found. Incorporating anytime and anywhere characteristics into the GA is straightforward; we can take intermediate solutions from the population and place incoming solutions directly into it.

For the GA, the probability of improvement is dominated by the probability of selection; the probability that a given solution is better than the initial solution is just the probability distribution itself. If the GA is viewed as a set of separate selection operations with identical characteristics, the probability of improvement for the whole operation is

$$P_{imp_{total}} = 1 - (1 - P_{imp})^M$$

where $P_{imp}$ is the probability of improvement for each member of the population of size $M$. This is obviously a gross oversimplification, but it does show the significant performance potential of the GA in the early stages, where $P_{imp}$ is quite large.

The exact probability of selection for GAs in general cannot currently be calculated; but for our problem domain it is clearly dependent on several factors, including the probability distribution of the Bayesian Network, the function used to evaluate the members of the population, and the number of RVs that differ from the current member of the population. In a simple case where the evaluation function is the joint probability of the solution and the landscape has only a single optimum, the individual probability of selection can be approximated by $P(S)$, giving us

$$P_{imp}(S_i) \approx \sum_{S \in \mathcal{S}} P(S)\, P[P(S) > P(S_i)]$$

which is the sum over solutions whose values are greater than the value of the initial solution; a value that can be estimated from the probability distribution of the Bayesian Network. As the members of the population evolve towards their local optima, the level rises and the probability of finding a higher probability solution decreases. Empirically, it has been shown that GAs can find good solutions quickly (Rojas-Guzman & Kramer 1993); but again, we don't know when we have the optimal answer.

Other Algorithms

This methodology can be applied to other inference algorithms:

Clustering: Clustering and junction tree methods, much in use for belief updating, can also be used for belief revision. We were not able to utilize clustering algorithms because even the smaller networks in our experiments were, as a rule, too large or too connected; these methods require space exponential in the cluster size (or graph effective width).

Hybrid Stochastic Search: This algorithm is very similar to the GA in its processing of a population of solutions; the two primary operators are random replacement and local optimization. These operations make this algorithm more likely to get stuck on local optima.

Integer Linear Programming: This approach transforms the belief revision problem into an integer linear programming (ILP) problem. The method used to solve the ILP problem is to set upper and lower bounds on the quality of the solution and then find any integral solution that falls within these bounds; the quality of the solution found is then used as the new lower bound, and the process iterates until the optimal solution is found.

Algorithm Interaction

We now take the models of algorithm performance developed above and explore the effects of interaction between algorithms.

Individual Algorithms

Deterministic algorithms use the incoming solutions to prune their search space. The greatest impact of the incoming solution on these algorithms, however, is in the generation of anytime solutions by using the received solution as a template to fill in the incomplete parts of the partial solution; this can significantly improve the quality of the completed solutions. Most exact algorithms expand their search progressively from either the roots or the leaves of the network; the difference between completing the solution arbitrarily and with the received solution is limited to the RVs in the F and U sets described earlier. Thus, the solution completed with the incoming solution is better than the solution completed with the arbitrary values when

$$P(U \mid \pi(U))\, P(F \mid \pi(F)) > P(U_i \mid \pi(U_i))\, P(F_i \mid \pi(F_i))$$

Approximate algorithms typically work with complete solutions; the GA and Stochastic Simulation are no exception.
Both approximate algorithms used here incorporate the incoming solution directly into the pool of solutions being manipulated; the presence of higher probability solutions tends to improve the algorithm's performance if the new solutions are significantly better than the rest of the solutions in the pool.

Combinations

With the different algorithms executing independently, the probability of improvement for any given time increment is

$$P_{imp_{total}} = 1 - \prod_i (1 - P_{imp_i}) \qquad (5)$$

which is the probability that at least one algorithm will improve. Unless the probability of improvement for all algorithms is zero, we will always gain something by utilizing multiple algorithms. With the performance improvements to each algorithm resulting from interaction, the overall system's probability of improvement can be even higher.

The algorithm analyses earlier indicated a similarity in performance between some algorithms. The GA and Stochastic Simulation have similar probability of improvement equations; this is not surprising due to the similar nature of the manipulations performed. The A* and ILP also have similar performance characteristics. While combining algorithms from the same class can improve performance by increasing the coverage of the problem space, the improvement due to information sharing can only take place if the information being shared is not already known to the receiving algorithm. For example, if two identical GAs were being used, as soon as the first high-probability solution was shared, both populations would tend to converge to the same local optimum. Improvement due to sharing information mostly occurs when the algorithm's performance characteristic is significantly different from the rest of the collection.
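Equation 5 is straightforward to evaluate; a small sketch (the probabilities are illustrative only, not measured values from the paper):

```python
def combined_p_improve(p_imps):
    """Equation 5: the probability that at least one of a set of
    independently executing algorithms improves in a given time increment."""
    prod = 1.0
    for p in p_imps:
        prod *= (1.0 - p)
    return 1.0 - prod

# Three algorithms with modest individual probabilities of improvement:
# the combination's probability, 1 - (0.7)(0.8)(0.9), is roughly 0.496,
# higher than any individual algorithm's.
combined = combined_p_improve([0.3, 0.2, 0.1])
```

Note that this is the same functional form as the GA population bound $1 - (1 - P_{imp})^M$; both express "at least one independent trial succeeds."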
One algorithm with a steep performance curve can boost the performance of a slower performer; or, if the performance curves for two algorithms intersect, they will alternate the roles of leader and follower. Obviously, algorithms from different classes will have significantly different performance profiles; but in some cases, it is possible to significantly affect the performance of an algorithm through its parameters. In such cases, it may be beneficial to use two similar algorithms, but with different parameters.

Results

The networks used for the testing were randomly generated. The probability distributions ranged from flat (uniform, with very small variance) to extreme (exponential, with high variance). Five networks of each type were generated.

[Figure 1: Performance on extreme network]

[Figure 2: Performance on flat network]

Our experiments reported here involved running the GA and A* individually and in combination on Bayesian Networks with different characteristics. The individual runs provided a baseline for performance comparisons. For homogeneous networks (all RVs in the network have the same characteristic), the combination of the two algorithms always performed at or above the level of the individual algorithms. This result was predicted by our analysis of the algorithm combinations. Where we achieved a significant performance improvement is with heterogeneous networks. These networks were constructed by combining a flat 30-RV network with an extreme 30-RV network. The individual algorithm performance on each of the 30-RV networks by themselves is shown in Figures 1 and 2. The vertical axis is the average normalized joint probability generated and the horizontal axis is normalized run-time (sec/RV).
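A sketch of how flat versus extreme conditional probability columns of the kind described above might be generated. The generator details (exponentially decaying random weights for 'extreme', small perturbations of uniform for 'flat') are our assumptions for illustration, not the paper's actual procedure:

```python
import random

def random_cpt_column(k, extreme=False):
    """Return one normalized probability column over k states.

    'flat'    -> near-uniform, very small variance.
    'extreme' -> exponentially decaying random mass, high variance,
                 so one or two states dominate the column.
    """
    if extreme:
        w = [random.expovariate(1.0) * (2.0 ** -j) for j in range(k)]
    else:
        w = [1.0 + random.uniform(-0.05, 0.05) for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]  # normalize to a probability column

flat_col = random_cpt_column(4)
extreme_col = random_cpt_column(4, extreme=True)
```

A full test-network generator would build such a column for every parent configuration of every RV; the flat/extreme distinction is what drives the differing GA and A* behavior reported below.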
As can be seen in Figures 1 and 2 and in Table 1, the GA could never achieve even close to the optimal solution for the extreme network, and the A* bogged down early in the flat network. Figure 3 shows the results of running the individual algorithms on the composite network; neither algorithm could come close to the optimal answer by itself. When the algorithms were combined, however, performance significantly improved. The raw data for all these runs is summarized in Table 1.

[Figure 3: Performance on combined network]

The difference between the A* and the combined run is more significant than it seems: the range in probabilities for the flat network is less than one order of magnitude. The A* by itself solved the extreme part of the network, leaving the flat part alone. The GA alone solved the flat part of the network and got caught on a spike in the extreme part. Together, both parts of the network were solved, resulting in a near-optimal solution.

            extreme      flat         combination
GA          3.367e-05    6.085e-12    2.737e-16
A*          6.369e-01    3.948e-12    1.681e-12
Both        --           --           3.790e-12
Optimal     6.369e-01    6.099e-12    3.884e-12

Table 1: Performance comparison for one composite network.

Conclusions

Others have characterized the performance of individual anytime algorithms, and even sequences of anytime algorithms performing a task; our analysis expands the scope to encompass heterogeneous collections of anytime algorithms running concurrently, interacting with each other to improve the overall system performance. Our model for task behavior addresses the effects of problem characteristics as well as the interaction between tasks at run-time. It is clear from the interaction models that overall system performance cannot decrease when multiple algorithms are used, and it has the potential to increase significantly. One case in particular that showed improvement was when an exact algorithm (A*) was combined with an approximate algorithm (the GA). Our tests with this combination consistently demonstrated the ability of the algorithms to assist each other, producing a better performance profile. Using anytime algorithms for uncertain reasoning gives us the opportunity to decide when our answer is "good enough", and using multiple algorithms allows us to take the strengths of the individual algorithms and create a stronger, more versatile system. Using a cooperative distributed environment where the algorithms share intermediate results gives us a system with the potential to maximize the utilization of the available resources, even in a resource-limited situation.

References

Dagum, P., and Luby, M. 1993. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60(1):141-153.

D'Ambrosio, B. 1993. Incremental probabilistic inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. San Francisco, CA: Morgan Kaufmann Publishers.

Jitnah, N., and Nicholson, A. E. 1996. Belief network inference algorithms: a study of performance based on domain characterisation. Technical Report TR96-249, Monash University, Clayton, VIC, 3168 Australia.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Rojas-Guzman, C., and Kramer, M. A. 1993. GALGO: A Genetic ALGOrithm decision support tool for complex uncertain systems modeled with Bayesian belief networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 368-375. San Francisco, CA: Morgan Kaufmann Publishers.

Shimony, S. E. 1994. Finding MAPs for belief networks is NP-hard. Artificial Intelligence 68:399-410.

Wellman, M. P., and Liu, C.-L. 1994. State-space abstraction for anytime evaluation of probabilistic networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 567-574. San Francisco, CA: Morgan Kaufmann Publishers.