Distributed Observation-Interpretation Networks for the Human Genome Project and Beyond

Christopher J. Lee*

April 26, 2002

*Department of Chemistry and Biochemistry, University of California, Los Angeles, California 90095, leec@mbi.ucla.edu. Work supported by NSF grant IIS 0082964. Project home page: www.bioinformatics.ucla.edu/leelab/db

Abstract

Multi-agent networks are an attractive solution for many problems. A key challenge is finding ways to break large, coupled problems into subtasks that can be completed in parallel but which still ensure a valid global solution. This creates a combined problem of algorithmic decoupling and inferential decoupling. I argue that a general key for solving this problem is the concept of inferential integrity, meaning that a parallel algorithm guarantees adherence to rigorous statistical inference rules globally, invariant to the grouping or ordering of how observations are processed by independent agents. I describe a series of properties of Hidden Markov Model computations that enable large, coupled computations to be split up while guaranteeing inferential integrity. We have used this approach extensively for analyses of the human genome and its internal complexity.

1 Introduction

The construction of autonomous intelligent networks is a large problem requiring solutions to at least three very different kinds of problems:

• A Coordination problem: how should the activities of many independent agents be coordinated [1]? How should work be distributed among them? I will refer to this category as the question of what agent model is preferred for a given problem.

• A Computational Complexity problem: a network is typically designed to solve some problem, and its design is in effect an algorithm for solving that problem. This algorithm must reduce the computational complexity of the problem to an order that the network can handle in an acceptable amount of time for the range of problem sizes it will encounter [2]. Moreover, the algorithm must be a parallel algorithm [3]; in other words, it must break the problem down into quasi-independent computations by the individual agents. This allows the network to scale to larger problems simply by adding more agents. I will refer to this as the algorithmic decoupling problem.

• An Inference problem: the network must be able to make accurate inferences about entities and events in its environment [4]. Moreover, to achieve significant autonomy, individual agents in the network also need this ability to some extent. How can we attain this, and also combine the inferences of independent agents in a way that ensures an optimal overall inference by the network (even when individual agents may disagree)? I will refer to this as the inference decoupling problem.

In this paper, I will present a simple model of "Distributed Observation-Interpretation Networks" that we have used extensively for large-scale scientific computations such as analysis of the human genome [5, 6, 7]. This model attempts to address all three of these aspects in a single consistent way, as I will illustrate with an example application (DECON) that uses this network model to analyze the vast amount of raw experimental data for candidate human genes and their complex structures in the human genome. The problem of finding and understanding human genes has turned out to be surprisingly difficult [8], and has driven the development of our agent model, because such a distributed model is required to solve it in a scalable way.
2 Autonomy Challenges

Building networks of agents with increasing autonomy can be viewed as a gradual transition from supervised learning algorithms to unsupervised learning algorithms. In the first case, a training set of fully and correctly classified data is analyzed by a supervised learning algorithm (such as a Hidden Markov Model [9]) to produce a set of optimal parameters for a model that will best predict the correct classifications of new data [10]. In the second case, no classification of the training set is provided initially, and the algorithm must itself discover the classification in the data (for example, by clustering). In both cases the goal of training is to produce a set of optimized model parameters, which are then applied to classify real-world data. However, in the unsupervised case there is no convenient separation between the problems of training the model (from a given classification) and of prediction (from a given model). Instead both are joined as a coupled computation, producing a potentially catastrophic explosion in the computational complexity of the problem.

2.1 Algorithmic Decoupling

Let's consider a simple problem of interpreting event sequences. For example, the Human Genome Project had to infer from many short sequencing observations the complete sequence and gene structure of the human genome [5]. Assume there exists a hidden state graph G consisting of nodes representing states σ connected by directed, weighted edges representing state transitions τ. Assume that each state σ emits observable events e with some probability p(e|σ), drawn from an "alphabet" α of possible events e ∈ α. The states σ, the emission probabilities p(e|σ), and the transition probabilities τ₁→₂ ≡ p(σ₂|σ₁) are all hidden (unknown, and not directly observable). These transition probabilities are normalized; in other words, for all the edges that go from a given node i to any other node j,

    \sum_j \tau_{i \to j} = 1

[Figure 1: Modeling genes as a graph of directed edges, whose weights τ reflect the transition probabilities within the graph. At a branch point, one outgoing edge is followed with probability t, the other with probability 1 − t.]

Rather than assuming that there is one state sequence σ₁σ₂... which we must infer from the set of observed events, the τ transition probabilities represent a "hidden mixture" of many possible hidden state sequences. Different event sequences may have been emitted by entirely different state sequences within the graph. The complexity of this hidden mixture is controlled by the τ parameters. If they are all binary (τ = 0 or 1), the graph consists of a single unbranched path (state sequence), whereas fractional τ values represent nodes with multiple allowed outgoing edges, i.e. branch points where more than one path is possible (figure 1). This property has been essential for our detection of the diversity that exists in the human genome (i.e. that it is different in different people, and expressed differently in different cells) [6, 7].

Under this model, any traversal through G forms a Markov chain, and we can make inferences about G based on a large number of observed event sequences o = {e₁e₂...} using the Hidden Markov Model approach [9]. This is a classic machine learning problem, applicable to many scenarios such as speech recognition (inferring the phonemes or words a person has pronounced, on the basis of the observed audio signal) [11].
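To make this model concrete, the following is a minimal sketch (in Python) of such a hidden state graph, with normalized transition probabilities τ and per-state emission probabilities p(e|σ). The class name and data layout are illustrative assumptions, not the DECON implementation; the example graph encodes a single branch point like the one in figure 1.

```python
import random

class HiddenStateGraph:
    """A sketch of the hidden state graph G: states, transitions, emissions."""

    def __init__(self, tau, emit):
        self.tau = tau    # tau[s1][s2] = p(s2|s1), transition probabilities
        self.emit = emit  # emit[s][e]  = p(e|s),   emission probabilities
        for s, out in tau.items():  # outgoing edges of each node must sum to 1
            assert abs(sum(out.values()) - 1.0) < 1e-9, f"state {s} not normalized"

    def sample_observation(self, state, n):
        """Emit n events along one random traversal (one hidden state sequence)."""
        events = []
        for _ in range(n):
            events.append(random.choices(list(self.emit[state]),
                                         weights=list(self.emit[state].values()))[0])
            state = random.choices(list(self.tau[state]),
                                   weights=list(self.tau[state].values()))[0]
        return events

# One branch point, as in figure 1: from state 1, take the splice edge with p = t.
t = 0.3
G = HiddenStateGraph(
    tau={1: {2: 1 - t, 3: t}, 2: {3: 1.0}, 3: {3: 1.0}},
    emit={1: {"A": 0.9, "C": 0.1}, 2: {"C": 1.0}, 3: {"G": 1.0}},
)
print(G.sample_observation(1, 5))  # e.g. ['A', 'C', 'G', 'G', 'G']
```

With binary τ values this graph would collapse to a single unbranched path; the fractional value t is what makes the emitted sequences a hidden mixture of different state sequences.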
Generally speaking, inferring G involves calculating the posterior probability of the graph model G given a set of observations O via Bayes' Law,

    p(G|O) = \frac{p(O|G)\, p(G)}{\sum_g p(O|g)\, p(g)}        (1)

and finding the model G* that maximizes this. If we wished to solve this problem with a network of agents performing unsupervised learning, what algorithmic challenges would we have to overcome? Let's consider the ways in which different aspects of this problem are coupled:

• The model G and observations O are distributable: because the model obeys the Markov chain property, and the observations consist of many independent event sequences o, in principle the calculation of the total probability of the observations over the set of possible models p(O|G) can be broken down into many separate calculations. First, the independent observations O = {o₁o₂...} simply factor,

    p(O|G) = \prod_{o \in O} p(o|G)        (2)

allowing the calculations for different observations o to be performed separately. I will refer to this as observation factoring. Second, because the model obeys the Markov chain property [12], the probability of a single observation can be broken down into separate contributions for disjoint paths π through G,

    p(o|G) = \sum_{\pi \in G} p(o|\pi)\, p(\pi|G)        (3)

where the π are disjoint paths through G. For the speech recognition example, for one audio observation o, p(o|"Wachovia Bank") and p(o|"watch over your bank") can be calculated separately. This implies that the calculation could be distributed efficiently over a network of many agents. I will refer to this as Markov independence.

• Global observation coupling: unfortunately, while these likelihood components p(o|π) can be calculated separately, the total posterior p(G|O) tightly couples the set of all observations. For example, the probability of a single observation o being emitted from a particular path π within the model, p(o|π), depends on the transition probabilities τ that lie on the path π. However, there is no way to estimate these τ values on the basis of a single observation. The τ can be thought of as the "hidden mixture parameters" within the model, which show what fraction of events will take one path vs. another at a particular branch point in the model. As such they are inherently a property of the entire observation set, and can only be estimated from the total set of observations, via the calculation of p(G|O). This coupling is a real barrier to distributing the computation in parallel: in principle we cannot compute p(G|O) without having all the p(o|π), and we cannot compute any p(o|π) without already having p(G|O).

• Global computational complexity is O(N²) or worse: can we solve the coupled global optimization problem that this presents? There are standard algorithms (such as the HMM forward-backward algorithm [9, 11]) for calculating the likelihood of crossing each edge in the graph model, and iterative optimization methods such as the Expectation Maximization (EM) algorithm for converging an initial "guess" graph model to an optimal solution [13]. How long will this take? For a single observation sequence consisting of n events, versus a graph model consisting of N states each with an average of ε incoming edges, the computational complexity of the forward-backward algorithm will be O(εnN) (a sketch of the forward pass follows this list). If we assume that the total number of events in all observation sequences is some multiple βN of the total number of model states N which we are trying to infer (and β > 1 is required to have any hope of fully determining these states with strong posterior probabilities), then the total computational complexity is O(βεN²). Finally, if ε is not a constant, but increases with the size of the graph such that ε = γN, then the complexity increases to O(βγN³).
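To make the O(εnN) cost concrete, here is a minimal sketch of the forward pass of the forward-backward algorithm [9, 11], continuing the Python encoding above; the function signature and dictionary layout are illustrative assumptions. For each of the n events it touches each of the N states through its ε average incoming edges.

```python
def forward_likelihood(obs, states, incoming, tau, emit, start):
    """p(o|G), summed over all hidden paths, by dynamic programming.

    obs      : one (non-empty) observed event sequence o = e1 e2 ... en
    states   : the N model states
    incoming : dict state -> predecessor states (epsilon of them on average)
    tau      : dict (s1, s2) -> transition probability p(s2|s1)
    emit     : dict (s, e)   -> emission probability p(e|s)
    start    : dict state    -> probability of starting in that state
    """
    # f[s] = p(e1..et, state at step t is s); initialize with the first event
    f = {s: start.get(s, 0.0) * emit.get((s, obs[0]), 0.0) for s in states}
    for e in obs[1:]:                      # n - 1 steps ...
        f = {s: emit.get((s, e), 0.0) *    # ... each costing O(epsilon * N)
                sum(f[r] * tau.get((r, s), 0.0) for r in incoming.get(s, []))
             for s in states}
    return sum(f.values())                 # O(epsilon * n * N) per observation
```

Running this over all observations (βN events in total) gives the O(βεN²) global cost quoted above; the backward pass and the EM parameter updates [13] share the same per-observation scaling.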
For large problems such as human genome assembly (inferring the complete genome sequence from observations of short fragment sequences, as was actually done in the Human Genome Project [5]), N can be very large (3 × 10⁹ for the human genome). O(N²) and O(N³) algorithms are useless for such large problems. This is representative of a general challenge: to deliver the power of networks of computational agents to such problems, we must find an algorithmic decoupling for breaking down their computational complexity.

2.2 Interpretation Networks

A parallel can be drawn between the distinction of local algorithms (computing on a single observation o, or a single path π through the model) vs. global algorithms, and distributed vs. centralized interpretation paradigms. In the former case, the computations required to interpret the observations are distributed over many agents performing largely independent calculations. In the latter case, agents simply report their observations to a central "interpreter" that performs a global, coupled computation on the set of all data. The choice of algorithm type determines the agent model: local, decoupled algorithms are easy to distribute over autonomous agents, whereas global, coupled algorithms are hard.

In the Human Genome Project, observations were generated by a network of multiple observers (different sequencing centers) [5, 14]. This distributed observation model is typically joined to one of two models for interpreting the data. In a centralized interpretation model, all of the raw data must be sent to a central location. This has the advantage of allowing a global analysis of the total observations, but is totally dependent on communication, and reduces the role of agents to being passive sensors. At the opposite extreme, interpretation is performed locally by the agents making the observations, and their interpretations may be communicated to others. This has the advantage of making the agents "smart" (able to interpret and act on their own observations immediately), but blocks a fully global analysis, since information tends to be lost during local interpretation.

Neither of these models is acceptable for difficult unsupervised learning problems like analysis of the human genome. These problems are too hard for simplistic local algorithms; on the other hand, the computational complexity of the global, coupled calculation is far too large for a centralized CPU. Instead, I will describe a distributed observation-interpretation network model, in which local interpretation is performed, but in a way that is consistent with rigorous global interpretation algorithms.

2.3 Inferential Decoupling

The key question is whether the global inference problem can be broken down into separate, local inference problems that can later be re-combined to produce the global solution. Let's define this property as inferential integrity. Such an algorithm must be insensitive to different ways of grouping the observations, and to different orders of recombining them; the total result should remain the same (the sketch below illustrates this invariance). This would solve both of the problems raised in the previous two sections. The true barrier to algorithmic decoupling is that overly simplistic algorithms may simply get the wrong answer. By contrast, to the extent that we can find decouplings that guarantee (or strongly approximate) inferential integrity, we can take full advantage of them to accelerate the computation. Moreover, inferential integrity provides a bridge between purely local vs. purely global interpretation network models: a computation with this property provides a flexible mix of both strategies.
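The defining invariance is easy to state in code. In the following sketch (with made-up likelihood values), the per-observation factors of equation 2 are combined as sums of log-likelihoods: any partition of the observations among agents, reduced in any order, yields the same global log p(O|G).

```python
import math
import random

# Hypothetical per-observation likelihoods p(o|G) computed by separate agents.
log_likelihoods = [math.log(p) for p in [0.12, 0.05, 0.33, 0.08, 0.20]]

def combine(grouping):
    """Each agent reduces its own group; a coordinator sums the partial results."""
    return sum(sum(group) for group in grouping)

# Two agents splitting the work vs. five agents receiving one observation each,
# in a shuffled order:
split = combine([log_likelihoods[:2], log_likelihoods[2:]])
shuffled = random.sample(log_likelihoods, k=len(log_likelihoods))
one_each = combine([[x] for x in shuffled])
assert abs(split - one_each) < 1e-12  # same global result, any grouping or order
```

Sums and products are associative and commutative, which is why observation factoring satisfies inferential integrity exactly; the harder question, taken up next, is how to preserve this property for the coupled parts of the calculation.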
3 Inferential Integrity

What are the essential principles for obtaining this property? First, my model for the rigorous global solution is simply Bayesian statistical inference, maximizing the posterior probability of the model as given by equation 1. This approach is very general, and has as its key principle the distinction between things which are directly observable (e.g. O in equation 1) and those which are not (hidden parameters).

For example, to infer gene structure for the human genome, we must determine which edges τ in the graph representing the genomic sequence are actually followed by the cellular machinery that produces an active gene product ("expressed sequences" or "EST"s [14]). Human genes are "spliced"; that is, they consist of large regions in which each node is linked to the next node, with occasional "splices" where a large segment of nodes is skipped over (figure 2) [15]. On the basis of the observed expressed sequences (ESTs) we want to infer which edges are real splices (i.e. have a nonzero τ value).

[Figure 2: DECON uses all possible matches of individual ESTs vs. the whole genome sequence to measure evidence for genes, exons, introns, polyA sites, etc., by calculating the probability drop due to excluding each feature. Axes: EST sequence vs. genomic sequence; high- and lower-probability matches are annotated with calculated evidence confidences and labels such as "strong gene evidence", "pseudogenes?", and "paralogous gene?".]

Inferential integrity can be achieved by any factoring of equation 1 into separate calculation steps that does not change its value. Fortunately, the nature of equation 1 permits many powerful ways of subdividing its calculation:

• Observation factoring: as shown in equation 2, we can separate the calculation into factors for independent observations, which can be recombined to obtain the global solution by simply multiplying them.

• Markov independence: as shown in equation 3, we can go further and break the calculation for a single observation into separate contributions from different (disjoint) paths π through G.

• Diagonalization: knowledge of the general structure of the graph can provide additional ways to divide the calculation. For example, for the gene structure problem we know that splices are rare, so the comparison between an observed EST and the genomic sequence should show long regions of match (diagonals in figure 2) separated by occasional splices. This allows us to focus the calculation on just those parts π of the model which could emit a given EST with non-negligible probability, by running a fast diagonal search algorithm such as BLAST [16].

• Graph reduction: a further consequence of diagonalization is that we can separately calculate the p(o|π) over the possible paths for a given observation o, and store it as a much reduced form of the original graph. Specifically, we eliminate all nodes and edges with negligible probability, and collapse all linear segments (i.e. without internal branching) to a single node, saving the probability on each segment (see the sketch after this list).

• τ factoring: this graph reduction allows us to do the single-observation p(o|π) calculation before we have τ, by using reasonable estimates for τ and simply factoring them back out from the probability factors stored on the reduced graph.
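Here is a minimal sketch of the graph reduction step, under assumed encodings (an acyclic graph given as predecessor/successor lists, with one stored probability factor per node): prune nodes whose path probability for this observation is negligible, then collapse each unbranched run into a single record carrying its accumulated factor.

```python
import math

def reduce_graph(nodes, prob, succ, pred, cutoff=1e-6):
    """Reduce an (acyclic) path-probability graph for one observation o.

    prob[v] is the probability factor stored on node v; succ[v] and pred[v]
    list its neighbors. Returns (first, last, factor) triples, one per
    collapsed linear segment.
    """
    keep = {v for v in nodes if prob[v] >= cutoff}  # drop negligible nodes

    def collapsible(a, b):
        # edge a -> b can be absorbed if neither endpoint is a branch point
        return (a in keep and b in keep
                and len(succ.get(a, [])) == 1 and len(pred.get(b, [])) == 1)

    segments, seen = [], set()
    for v in keep:
        if v in seen:
            continue
        while pred.get(v) and collapsible(pred[v][0], v):  # rewind to run start
            v = pred[v][0]
        seg = [v]
        while succ.get(v) and collapsible(v, succ[v][0]):  # extend to run end
            v = succ[v][0]
            seg.append(v)
        seen.update(seg)
        segments.append(seg)
    # each unbranched run becomes one node saving the product of its factors
    return [(s[0], s[-1], math.prod(prob[v] for v in s)) for s in segments]
```

These compact (first, last, factor) records, rather than the raw observations, are what need to be exchanged in the per-τ combination step described next.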
Using these separations, calculations for all the individual observations can be performed on a distributed network with little or no communication: each agent simply performs the calculations for its own observations. Finally, the reduced probability graphs for the observations are combined. This too can be separated into many independent calculations, one for each individual τ. For each τ, only a tiny subset of observations will be found to cross that edge with high probability, and all of these will have been identified efficiently by the diagonalization and graph reduction calculations. Only these observations need be considered to determine this τ. Moreover, only a small amount of information (the reduced probability graphs for those observations) needs to be communicated between agents to allow this calculation to be performed independently by a single agent. Thus all stages of the computation can be distributed efficiently, and a globally optimal solution produced with minimal communication demands.

4 Conclusions

What general utility does this approach have for the problem of constructing autonomous intelligent networks? I suggest that the property of inferential integrity is a key for designing an agent network that can move flexibly across the spectrum of "independence" versus "cooperation". Inferential integrity is valuable for decoupling a rigorous inference process, both to reduce its computational complexity (the algorithmic decoupling problem emphasized in the Introduction) and so that it can be distributed efficiently over a large number of compute agents (the agent model question mentioned in the Introduction).

For our work in the Human Genome Project we have used this to build distributed observation-interpretation networks, in which each agent is both an observer and an interpreter. Such networks have many advantages over pure local interpretation or pure global interpretation: they combine the advantages of each of these extremes, while avoiding many of their respective disadvantages. They combine the scalability and robustness of a distributed approach with the rigorous statistical inference of a fully centralized, global calculation. They maximize local resource utilization by breaking a large calculation down to a very fine granularity that can be distributed well, while keeping communication bandwidth requirements as low as possible (through mechanisms such as graph reduction). Since the computational models we are using are very general (statistical inference, Hidden Markov Models, graph reduction, etc.), it seems likely that these basic ideas could be applicable to seeking inferential integrity for many other problems.

5 Acknowledgements

This research was supported by NSF Grant IIS 0082964.
6 References

[1] M. Barbuceanu, M.S. Fox, "COOL: a language for describing coordination in multi-agent systems," Proceedings of the First International Conference on Multi-Agent Systems, 17–24, 1995.

[2] J.T. Schwartz, M. Sharir, J. Hopcroft, Planning, Geometry and Complexity of Robot Motion. Ablex Publishing Corp., 1987.

[3] T. Oates, M.V. Nagendra Prasad, V.R. Lesser, Cooperative Information Gathering: A Distributed Problem Solving Approach. UMass Computer Science Technical Report 94-66, 1994.

[4] J. Binder, D. Koller, S. Russell, K. Kanazawa, "Adaptive probabilistic networks with hidden variables," Machine Learning 29, 213–244, 1997.

[5] International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome," Nature 409, 860–921, 2001.

[6] K. Irizarry, et al., C. Lee, "Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences," Nature Genet. 26, 233–236, 2000.

[7] B. Modrek, et al., C. Lee, "Genome-wide detection of alternative splicing in expressed sequences of human genes," Nucleic Acids Res. 29, 2850–2859, 2001.

[8] S. Karlin, A. Bergman, A.J. Gentles, "Genomics: annotation of the Drosophila genome," Nature 411, 259–260, 2001.

[9] L.R. Rabiner, B.H. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine 3, 4–16, 1986.

[10] T. Mitchell, Machine Learning. McGraw Hill, 1997.

[11] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE 77, 257–286, 1989.

[12] D.R. Cox, H.D. Miller, The Theory of Stochastic Processes. Chapman & Hall, 1965.

[13] A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Stat. Soc. B 39, 1–38, 1977.

[14] M.S. Boguski, G. Schuler, "ESTablishing a human transcript map," Nature Genet. 10, 369–371, 1995.

[15] W. Gilbert, "Why genes in pieces?" Nature 271, 501, 1978.

[16] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, "Basic local alignment search tool," J. Mol. Biol. 215, 403–410, 1990.