FINDING OPTIMAL BAYESIAN NETWORK STRUCTURES WITH CONSTRAINTS LEARNED FROM DATA
Xiannian Fan, Brandon Malone and Changhe Yuan
Graduate Center/City University of New York; University of Helsinki

Several recent algorithms for learning Bayesian network structures first calculate potentially optimal parent sets (POPS) for all variables and then use various optimization techniques to find a set of POPS, one for each variable, that constitutes an optimal network structure. This paper makes the observation that there is useful information implicit in the POPS. Specifically, the POPS of a variable constrain its parent candidates. Moreover, the parent candidates of all variables together give a directed cyclic graph, which often decomposes into a set of strongly connected components (SCCs). Each SCC corresponds to a smaller subproblem which can be solved independently of the others. Our results show that solving the constrained subproblems significantly improves the efficiency and scalability of heuristic search-based structure learning algorithms. Further, we show that by considering only the top p POPS of each variable, we quickly find provably very high quality networks for large datasets.

Bayesian Network Structure Learning with Graph Search

Bayesian Network Structure Learning
Representation. A joint probability distribution over a set of variables.
Structure. A DAG encoding conditional dependencies.
• Vertices correspond to variables.
• Edges indicate relationships among variables.
Parameters. Conditional probability distributions.
Learning. Find the network with the minimal score for a complete dataset D (we often omit D for brevity):
    Score(N) = Σ_i Score(Xi | PAi),
where Score(Xi | PAi) is called a local score.

Graph Search Formulation

Dynamic Programming
Intuition. Every DAG must have a leaf. Optimal networks for a single variable are trivial. Recursively add new leaves and select their optimal parents until all variables have been added. All orderings have to be considered.
Illustration: begin with a single variable; pick one variable as a leaf and find its optimal parents; pick another leaf and find its optimal parents from the current variables; continue picking leaves and finding optimal parents.
Recurrences.
    Score({}) = 0
    Score(U) = min_{X in U} [ Score(U \ {X}) + BestScore(X, U \ {X}) ],
where BestScore(X, C) = min_{PA ⊆ C} Score(X | PA).

Potentially Optimal Parent Sets (POPS)
The dynamic programming can be visualized as a search through an order graph. While the local scores are defined for all 2^(n-1) possible parent sets of each variable, this number is greatly reduced by pruning parent sets that are provably never optimal (Tian 2000; de Campos and Ji 2011). We refer to this pruning as lossless score pruning because it is guaranteed not to remove the optimal network from consideration. We refer to the parent sets remaining after pruning as potentially optimal parent sets (POPS), and denote the set of POPS for variable Xi as Pi.

The Order Graph
Calculation. Score(U), the score of the best subnetwork over the subset U.
Node. One node for each subset U, storing Score(U).
Successor. Add X as a leaf to U.
Path. Induces an ordering on the variables.
Size. 2^n nodes, one for each subset.
Figure 1: Order graph for 4 variables.

Admissible Heuristic Search Formulation
Start Node. Top node, {}.
Goal Node. Bottom node, V.
Shortest Path. Corresponds to an optimal structure.
g(U). Score(U).
h(U). Relax acyclicity: each remaining variable selects its optimal parents ignoring cycles.
Table 1: The POPS for a six-variable problem; the ith row shows Pi.

POPS Constraints Pruning
Motivation: Not all variables can possibly be ancestors of the others.
Previous technique: Consider all variable orderings anyway.
Shortcomings: Exponentially many unnecessary paths in the search space.
Contribution: Construct the parent relation graph and find its SCCs; divide the problem into independent subproblems based on the SCCs.
We collect all potential parent–child relations from the POPS to obtain the parent relation graph.
Figure 2: The parent relation graph.
We then extract the strongly connected components (SCCs) of the parent relation graph. The SCCs form the component graph (Cormen et al. 2001), which yields ancestor constraints that we call the POPS constraints. Each SCC corresponds to a smaller subproblem which can be solved independently of the others (a code sketch of this decomposition is given below).

Recursive POPS Constraints Pruning
Selecting the parents for one of the variables has the effect of removing that variable from the parent relation graph. After removing it, the remaining variables may split into smaller SCCs, and the resulting smaller subproblems can be solved recursively; Figure 3(b) shows an example.
Figure 3: Order graphs after applying the POPS constraints. (a) The order graph after applying the POPS constraints once. (b) The order graph after recursively applying the POPS constraints on the second subproblem.

Top-p POPS Constraints
Motivation: Despite POPS constraints pruning, some problems remain difficult.
Previous technique: AWA* has been used to find solutions with optimality bounds.
Shortcomings: AWA* does not give any explicit tradeoff between complexity and optimality.
Contribution: Lossy score pruning gives a more principled way to control the tradeoff; we create the parent relation graph using only the best p POPS of each variable and discard POPS that are not compatible with this graph.
Rather than constructing the parent relation graph by aggregating all of the POPS, we can instead create the graph by considering only the best p POPS for each variable; see the sketches below.
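For concreteness, the following sketch shows the POPS-constraint decomposition described above: build the parent relation graph from the POPS, extract its SCCs, and return them in a topological order of the component graph so that each SCC can be searched as an independent subproblem. This is a minimal illustration in plain Python, not the authors' URLearning implementation; the function names and data layout (a dict mapping each variable to a list of parent sets) are our own assumptions, and any standard SCC algorithm could replace the Kosaraju pass used here.

```python
from collections import defaultdict

def parent_relation_graph(pops):
    """Edge Y -> X whenever Y appears in at least one POPS of X.

    `pops` maps each variable to its list of potentially optimal
    parent sets (each parent set is an iterable of variables).
    """
    edges = defaultdict(set)
    for x, parent_sets in pops.items():
        edges[x]                      # ensure every variable is a node
        for pa in parent_sets:
            for y in pa:
                edges[y].add(x)
    return edges

def sccs_in_topological_order(edges):
    """Kosaraju's algorithm: returns the SCCs as a list of sets,
    ordered so that all potential ancestors of an SCC appear in
    earlier SCCs (a topological order of the component graph)."""
    seen = set()

    def dfs(v, graph, out):
        # Iterative DFS that appends vertices in post-order.
        stack = [(v, iter(graph[v]))]
        seen.add(v)
        while stack:
            node, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(graph[w])))
                    break
            else:                     # all neighbours explored: retreat
                stack.pop()
                out.append(node)

    finish_order = []
    for v in list(edges):
        if v not in seen:
            dfs(v, edges, finish_order)

    reverse = defaultdict(set)        # transpose graph
    for y in list(edges):
        for x in edges[y]:
            reverse[x].add(y)

    seen.clear()
    components = []
    for v in reversed(finish_order):
        if v not in seen:
            comp = []
            dfs(v, reverse, comp)
            components.append(set(comp))
    return components

# Toy 4-variable example: each returned SCC is a subproblem that
# can be searched independently of the others.
pops = {
    "A": [set(), {"B"}],
    "B": [set(), {"A"}],
    "C": [{"A"}, {"A", "B"}],
    "D": [{"C"}],
}
print(sccs_in_topological_order(parent_relation_graph(pops)))
# e.g. [{'A', 'B'}, {'C'}, {'D'}]
```

The recursive variant of the pruning simply deletes a variable from `edges` once its parents have been selected and recomputes the SCCs of the remaining graph.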
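The top-p construction can be sketched in the same style. Below, only the best p POPS of each variable are used to build the parent relation graph (assuming each list in `pops` is sorted from best to worst local score), and a parent set is then kept only if every one of its parents can still reach the child in that graph. Treating "compatible with the graph" as this reachability (ancestor-constraint) test is our reading of the poster text; the published algorithm's exact test may differ, and the function name is illustrative.

```python
from collections import defaultdict

def top_p_pops_constraint(pops, p):
    """Lossy, top-p version of the POPS constraint (illustrative sketch).

    Assumes pops[x] lists the POPS of x from best to worst score.
    Returns a filtered copy of `pops` containing only parent sets that
    respect the ancestor constraints induced by the top-p graph.
    """
    # 1. Parent relation graph built from the best p POPS only.
    edges = defaultdict(set)
    for x, parent_sets in pops.items():
        edges[x]                      # ensure every variable is a node
        for pa in parent_sets[:p]:
            for y in pa:
                edges[y].add(x)

    # 2. Reachability closure: Y may be an ancestor of X only if
    #    there is a directed path from Y to X in this graph.
    reach = {}
    for v in list(edges):
        frontier, visited = [v], set()
        while frontier:
            u = frontier.pop()
            for w in edges[u]:
                if w not in visited:
                    visited.add(w)
                    frontier.append(w)
        reach[v] = visited

    # 3. Discard POPS that use a parent which can no longer be an
    #    ancestor of the child under the top-p constraint.
    return {
        x: [pa for pa in parent_sets if all(x in reach[y] for y in pa)]
        for x, parent_sets in pops.items()
    }
```

Smaller values of p give a sparser graph and therefore stronger constraints and faster search, at the price of possibly discarding parent sets needed for the globally optimal network; this is the tradeoff examined in the experiments below.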
Experiment Results
Observation: The constraints seem to help benchmark network datasets more than UCI datasets.
Observation: Even for very small values of p, the top-p POPS constraint results in networks provably very close to the globally optimal solution.
Figure 4: Running time (in seconds) of A* with three different constraint settings (No Constraint, POPS Constraint, and Recursive POPS Constraint) on the Autos, Soybean, Alarm, and Barley datasets.
Figure 5: The behavior of the Hailfinder dataset under the top-p POPS constraint as p varies.

Software: http://url.cs.qc.cuny.edu/software/URLearning.html

Selected References
1. de Campos, C. P.; and Ji, Q. Efficient structure learning of Bayesian networks using constraints. JMLR, 2011.
2. Tian, J. A branch-and-bound algorithm for MDL learning Bayesian networks. In UAI '00, 2000.
3. Yuan, C.; Malone, B.; and Wu, X. Learning optimal Bayesian networks using A* search. In IJCAI '11, 2011.
4. Yuan, C.; and Malone, B. An improved admissible heuristic for learning optimal Bayesian networks. In UAI '12, 2012.

Acknowledgements
This research was supported by NSF grants IIS-0953723, IIS-1219114 and the Academy of Finland (COIN, 251170).