Learning Structure in Bayes Nets (Typically also learn CPTs here) • Given the set of random variables (features), the space of all possible networks is well-defined and finite. • Unfortunately, it is super-exponential in the number of variables. • We can define a transition function between states (network structures), such as adding an arc, deleting an arc, or changing the Learning Structure (Continued) • (Continued)… direction of an arc. • For each state (structure), we take our best guess of the CPTs given the data as before. • We define the score of the network to be either the probability of the data given the network (maximum likelihood framework) or the posterior probability of the network (the product of the prior probability of the Learning Structure (Continued) • (Continued)… network and the probability of the data given the network, normalized over all possible networks). • Given a state space with transition function and scoring function, we now have a traditional AI search space to which we can apply greedy hill-climbing, randomized walks with multiple restarts, or a variety of Learning Structure (Continued) • (Continued)… other heuristic search techniques. • The balance of opinion currently appears to favor greedy hill-climbing search for this applications, but search techniques for learning Bayes Net structure are wide open for further research -- nice thesis task. Structural Equivalence • Independence Equivalent: 2 structures are independence equivalent if they encode the same conditional independence relations. • Distribution Equivalent with respect to a family of CPT formats: 2 structures are equivalent if they represent the same sets of possible distributions. • Likelihood Equivalent:the data does not help discriminate between the 2 structures. One Other Key Point • The previous discussion assumes we are going to make a prediction based on the best (e.g., MAP or maximum likelihood) single hypothesis. • Alternatively, we could make avoid committing to a single Bayes Net. Instead we could compute all Bayes Nets, and have a probability for each. For any new query One Other Key Point (Continued) • (Continued)… we could calculate the prediction of every network. We could then weigh each network’s prediction by the probability that it is the correct network (given our previous training data), and go with the highest scoring prediction. Such a predictor is the Bayes-optimal predictor. Problem with Bayes Optimal • Because there are a super-exponential number of structures, we don’t want to average over all of them. Two options are used in practice: • Selective model averaging:just choose a subset of “best” but “distinct” models (networks) and pretend it’s exhaustive. • Go back to MAP/ML (model selection). Example of Structure Learning: Modeling Gene Expression Data • Expression of a gene: making from the gene the protein for which it codes (involves transcription and translation). • Can estimate expression by transcription (amount of mRNA made from the gene). • DNA hybridization arrays: “chips” that simultaneously measure the levels at which all genes in a sample are expressed. Importance of Expression Data • Often the best clue to a disease or measurement of successful treatment is the degree to which genes are expressed. Such data also gives insight into regulatory networks among genes (one gene may code for a protein that affects another’s expression rate). • Can get snapshots of global expression levels. Modeling Expression Data by Learning a Bayes Net • We can model the effects of genes on others by learning a Bayes Net (both structure and CPTs. Friedman et al. do so. • See associated figure. Expression of gene E might promote expression of B but expression of A might inhibit B. The facts that E and A directly influence B are captured by the network structure, and the Modeling Gene Expression Data (Continued) • (Continued)… fact that E promotes while A inhibits is captured in the CPT for B given its parents. • B directly influences C according to the network, but E and A influence C only indirectly via B. Decisions Made in this Application • Use selective model averaging. Learn multiple models via bootstrapping. (Randomly sample with replacement from the original data set -- each run is with a different random sample of the original data.) Average simply by extracting common features: Is Y in the Markov Blanket of X? Is Y an ancestor of X? Decisions (Continued) • Use Independence Equivalence (two models are equivalent if and only if they encode the same conditional independence information. • Must choose a prior distribution over structures. Choose one such that (1) equivalent structures have equivalent scores, and (2) scores are decomposable (score is the sum of the scores of each node, which depend only on node and its parents). Decisions (Continued) • Use arc insertion, arc deletion, or arc direction switch as the transition function. • Use greedy hill-climbing as the search strategy, with the following (also greedy) additional heuristic. • Sparse candidate algorithm: consider only networks in which the parents of X are nodes that have a “high” correlation with X Decisions (Continued) • (Continued)… when taken individually. Note the similarity of this heuristic with the greedy heuristic in decision tree learning. This means that probabilistic versions of exclusive-OR (for example) will cause problems.