Aspects of Inference for the Influence Model and Related Graphical Models

by

Arvind K. Jammalamadaka

B.S., University of California at Santa Barbara (2001)

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 2004

© Massachusetts Institute of Technology 2004. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 21, 2004

Certified by: George C. Verghese, Professor of Electrical Engineering, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

The Influence Model (IM), developed with the primary motivation of describing network dynamics in power systems, has proved to be very useful in a variety of contexts. It consists of a directed graph of interacting sites whose Markov state transition probabilities depend on their present state and that of their neighbors. The major goals of this thesis are (1) to place the Influence Model in the broader framework of graphical models, such as Bayesian networks, (2) to provide and discuss a hybrid model between the IM and dynamic Bayesian networks, (3) to discuss the use of inference tools available for such graphical models in the context of the IM, and (4) to provide some methods of estimating the unknown parameters that describe the IM. We hope each of these developments will enhance the use of the IM as a tool for studying networked interactions.

Thesis Supervisor: George C. Verghese
Title: Professor of Electrical Engineering

Dedicated to: Sreenivasa, Vijaya, and Aruna

Acknowledgments

I would like to thank Professor George Verghese, who has been a great teacher, mentor and supervisor for this thesis work. His encouragement, support and advice have made this work a pleasure for me. My early interactions and discussions with Dr. Sandip Roy and Carlos Gomez-Uribe steered me in this direction, and for that I am thankful to both of them. Finally, I would like to thank my family for their love and affection, and for encouraging me to excel in whatever I do.

This work was supported by an AFOSR DoD URI for "Architectures for Secure and Robust Distributed Infrastructures," F49620-01-1-0365 (led by Stanford University).

Contents

1 Introduction 11
  1.1 Overview 12
2 The Influence Model and Related Graphical Models 13
  2.1 The Influence Model 13
    2.1.1 Definition 13
    2.1.2 An Alternate Formulation 15
    2.1.3 Expectation Recursion 16
    2.1.4 Homogeneous Influence Model 16
    2.1.5 Matrix-Matrix Form 17
  2.2 Partially Observed Markov Networks 17
    2.2.1 Markov Networks 17
    2.2.2 Partial Observation 18
  2.3 Relation to Hidden Markov Models 18
    2.3.1 Hidden Markov Models 18
    2.3.2 Relation to the POMNet 19
  2.4 Relation to Bayesian Networks 20
    2.4.1 Bayesian Networks and Graph Theory 20
    2.4.2 Relation to the POMNet 22
  2.5 A Generalized Influence Model 24
3 Efficient Inference: Relevance and Partitioning 31
  3.1 The Inference Task 31
  3.2 Probabilistic Independence and Separation 32
    3.2.1 d-Separation 32
  3.3 Relevance Reasoning 34
    3.3.1 Relevance in Bayesian networks 34
    3.3.2 Relevance in the POMNet 36
  3.4 Algorithms for Exact Inference 38
    3.4.1 Message Passing and the Generalized Forward-Backward Algorithm 38
    3.4.2 The Junction Tree Algorithm 39
    3.4.3 Exact Inference in the POMNet 40
  3.5 Algorithms for Approximate Inference 42
    3.5.1 Variational Methods 42
    3.5.2 "Loopy" Belief Propagation 42
4 Parameter Estimation 45
  4.1 Maximum Likelihood Estimation 45
  4.2 Estimation of State Transition Probabilities 45
  4.3 Reparameterization of Network Influence 50
5 Conclusion and Future Work 57
  5.1 Conclusion 57
  5.2 Future Work 57
    5.2.1 Further Analysis of the Generalized Influence Model 58
    5.2.2 Conditioning on a "Double Layer" 58

List of Figures

2-1 A hidden Markov model expressed as a Bayesian network. Shaded circles represent evidence nodes. 22
2-2 A POMNet graph and corresponding trellis diagram. Shaded circles represent evidence nodes. 24
2-3 A "generalized" IM graph and trellis (DBN) structure. Solid lines represent influence in the usual sense, whereas dashed lines represent same-time influence. The order for updating nodes at each time step would be {1}, {2, 3}, {4}. 25
3-1 An example of a BN graph, and the nodes relevant to target node 1. Shaded circles represent evidence nodes. 36
4-1 Plot of L(p, q) (p = 0.5, q = 0.5, T = 10). 47
4-2 Estimates of p (p = 0.5, q = 0.5, T = 100). 48
4-3 Estimates of q (p = 0.5, q = 0.5, T = 100). 49
4-4 Estimates of p (p = 0.25, q = 0.63, T = 20). 49
4-5 Estimates of q (p = 0.25, q = 0.63, T = 20). 50
4-6 Plot of L(λ) (p = 0.1, q = 0.4, λ = 1, T = 50). 52
4-7 Estimates of λ (p = 0.1, q = 0.4, λ = 1, T = 50). 52
4-8 Plot of L(p, q, λ), pq cross-section (p = 0.1, q = 0.4, λ = 1, T = 20). 53
4-9 Plot of L(p, q, λ), λp cross-section (p = 0.1, q = 0.4, λ = 1, T = 20). 53
4-10 Plot of L(p, q, λ), λq cross-section (p = 0.1, q = 0.4, λ = 1, T = 20). 54
4-11 Estimates of p (p = 0.1, q = 0.4, λ = 1, T = 100). 54
4-12 Estimates of q (p = 0.1, q = 0.4, λ = 1, T = 100). 55
4-13 Estimates of λ (p = 0.1, q = 0.4, λ = 1, T = 100). 55

Chapter 1

Introduction

Graphical models are an effective way of representing relationships between variables in a complex stochastic system. They are used in a variety of contexts, from digital communication [11] to modelling social interactions [4], and in fact have roots in several different research communities, including statistics [17], artificial intelligence [23], and computational neuroscience [15].

The influence model (IM), introduced in [1] and described concisely in [2], was developed with the primary motivation of describing network dynamics in power systems. It is, however, a rich stochastic model for networked components, capable of modelling systems that are complex but structured in a tractable way. It consists of a network of interacting Markov chains in a graph structure, with the next state of each chain dependent not only on its own current state, but also on that of its neighbors. The overall model is Markovian, in the sense that the probability distribution for the next state of the system is determined from model parameters in conjunction with the current state.

The work in [1] focuses on motivating and defining the influence model, as well as analyzing its tractability and dynamics. In [26], the influence model was used as an example of a more general class of models called moment-linear stochastic systems, and the dynamics of those systems were analyzed, along with approaches to estimation and control. More recently, [29] treats state estimation in the IM for cases in which we may not be able to obtain complete information about the system, by analogy with hidden Markov models. Further details regarding this approach to inference in the IM will be discussed later in this thesis.

The influence model has also been recently applied in the context of social interactions. In one application, it was used to model interaction dynamics among participants in a conversation [4]. In another, it was used to model "webs of trust" in relationships between users of knowledge-sharing sites on the Internet [28].

The major goals of this thesis are: (1) to place the influence model in the broader framework of graphical models, as used in the other fields mentioned above, (2) to provide and discuss a hybrid between the IM and dynamic Bayesian networks, (3) to understand tools for performing efficient inference on the class of graphical models in which the influence model lies, and (4) to provide some results regarding estimation of parameters and structure for the IM.

1.1 Overview

In Chapter 2, we discuss how the IM can be embedded in the framework of graphical models. We relate it to hidden Markov models and to Bayesian networks, and provide a generalization based on this relationship. In Chapter 3, we exploit this connection to discuss algorithms for exact and approximate inference for the IM. Although we often assume the model parameters in an IM to be given, in practice they are not; in Chapter 4, we discuss maximum likelihood estimation of these unknown model parameters. Finally, Chapter 5 provides some possibilities for potential future work.
Chapter 2

The Influence Model and Related Graphical Models

2.1 The Influence Model

2.1.1 Definition

We first define the influence model. Consider a network which consists of $n$ interacting nodes (or sites), evolving over discrete time $k = 1, \ldots, T$, where node $i$ can take on any one of $m_i$ possible states. We denote the state of node $i$ at time $k$ by an $m_i \times 1$ column vector $s_i[k]$ which is an indicator vector, i.e., the $m_i$-vector $s_i[k]$ contains a single 1 in the position corresponding to the state of node $i$ at time $k$, and a 0 everywhere else; for instance $s_i'[k] = [0 \cdots 0\; 1\; 0 \cdots 0]$, where $'$ denotes transpose. Letting $M = \sum_{i=1}^{n} m_i$, we denote the overall state of this networked system at time $k$ by the length-$M$ vector $s[k]$, which simply consists of the collection of the $s_i[k]$ for $1 \le i \le n$, stacked one on top of the other:

$$s[k] = \begin{bmatrix} s_1[k] \\ \vdots \\ s_n[k] \end{bmatrix}. \qquad (2.1.1)$$

The state evolution and the network structure of the IM are governed by two things: (1) a set of state transition matrices $\{A_{ij}\}$ (where $A_{ij}$ is of order $m_i \times m_j$), which describe the effect of node $i$ on node $j$ in a way analogous to the transition matrix of a Markov chain, and (2) an $n \times n$ stochastic matrix $D$ called the "network influence matrix," whose elements $d_{ij}$ provide the probability with which the state of node $j$ at time $k$ influences the state of node $i$ in the next time step $k+1$ (so $d_{ij} \ge 0$ and $\sum_j d_{ij} = 1$). We assume that the $A_{ij}$ and the $d_{ij}$ are homogeneous over time to avoid further complication.

To express the evolution of an IM, we also define $p_i[k]$, the probability vector for the state of site $i$. This $m_i \times 1$ vector is defined so that the $l$th element of $p_i[k]$ is the probability that node $i$ is in state $l$ at time $k$ (and its entries therefore sum to 1). We also have the corresponding concatenated $M$-vector $p[k]$:

$$p[k] = \begin{bmatrix} p_1[k] \\ \vdots \\ p_n[k] \end{bmatrix}. \qquad (2.1.2)$$

Then, the state evolution of the IM has the following form:

$$p_i'[k+1] = \sum_j d_{ij}\, s_j'[k]\, A_{ji}. \qquad (2.1.3)$$

Or, more compactly,

$$p'[k+1] = s'[k]\, H, \qquad (2.1.4)$$

where

$$H = D' \otimes \{A_{ij}\} = \begin{bmatrix} d_{11}A_{11} & \cdots & d_{n1}A_{1n} \\ \vdots & & \vdots \\ d_{1n}A_{n1} & \cdots & d_{nn}A_{nn} \end{bmatrix} \qquad (2.1.5)$$

and $\otimes$ denotes the (generalized) Kronecker product. The properties of the specially structured matrix $H$ are explored in detail in [1], but are not essential to this thesis.

Some things to note: we see that $s_i[k+1]$ is conditionally independent of all other state variables at time $k+1$ given $s[k]$, i.e.,

$$P(s[k+1] \mid s[k]) = \prod_{i=1}^{n} P(s_i[k+1] \mid s[k]), \qquad (2.1.6)$$

and the Markov property holds for the process $s[k]$ as it evolves over time, that is,

$$P(s[k+1] \mid s[k], s[k-1], \ldots) = P(s[k+1] \mid s[k]). \qquad (2.1.7)$$

We also see that the influence matrix $D$ specifies a directed graph for the structure of the model by indicating the presence or absence of dependencies between nodes.

2.1.2 An Alternate Formulation

The above way of viewing the evolution/update process makes its quasi-linear structure clear, but an equivalent way of viewing the process, which might be more intuitive in certain cases, involves each site picking a "determining site" from among its neighbors for the update. That process can be summarized in the following three steps (see [2]), and in the code sketch that follows them:

1. A site $i$ randomly selects one of its neighboring sites in the network, or itself, to be its determining site for Step 2. Site $j$ is selected as the determining site for site $i$ with probability $d_{ij}$.

2. Site $j$ fixes the probability vector $p_i'[k+1]$ for site $i$, depending on its own status $s_j[k]$. The probability vector $p_i[k+1]$ will be used in the next step to choose the next status of site $i$, so $p_i'[k+1]$ is a length-$m_i$ row vector with nonnegative entries that sum to 1. Specifically, if site $j$ is selected as the determining site, then $p_i'[k+1] = s_j'[k]\, A_{ji}$, where $A_{ji}$ is a fixed row-stochastic $m_j \times m_i$ matrix. That is, $A_{ji}$ is a matrix whose rows are probability vectors, with nonnegative entries that sum to 1.

3. The next status $s_i[k+1]$ is chosen according to the probabilities in $p_i'[k+1]$, which were computed in the previous step.
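The following is a minimal Python sketch of this three-step update. The integer status encoding and the nested container `A[j][i]` holding $A_{ji}$ are our own bookkeeping conventions for illustration, not notation from [1] or [2]; the final example values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def im_step(state, D, A):
    """One influence-model update via determining sites.

    state : length-n array of current statuses (integer in 0..m_i - 1)
    D     : n x n row-stochastic network influence matrix
    A     : container with A[j][i] the m_j x m_i row-stochastic matrix A_ji
    """
    n = len(state)
    nxt = np.empty(n, dtype=int)
    for i in range(n):
        # Step 1: pick the determining site j with probability d_ij
        j = rng.choice(n, p=D[i])
        # Step 2: p_i'[k+1] = s_j'[k] A_ji, i.e. row s_j[k] of A_ji
        p = A[j][i][state[j]]
        # Step 3: realize the next status from that probability vector
        nxt[i] = rng.choice(len(p), p=p)
    return nxt

# Hypothetical homogeneous example: two nodes, two statuses each.
D = np.array([[0.7, 0.3],
              [0.4, 0.6]])
A2 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
A = {j: {i: A2 for i in range(2)} for j in range(2)}
print(im_step(np.array([0, 1]), D, A))
```

Iterating `im_step` for $k = 1, \ldots, T$ produces a sample path of the model.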
Specifically, if site j is selected as the determining site, then p'[k + 1] = s'[k]Aji, where Aji is a fixed row-stochastic mj x mi matrix. That is, Aji is a matrix whose rows are probability vectors, with nonnegative entries that sum to 1. 3. The next status s [k + 1] is chosen according to the probabilities in p'[k + 1], which were computed in the previous step. 2.1.3 Expectation Recursion One interesting result of the IM structure is that the conditional expectation of the system state given the state at the previous time step can be expressed in a simple recursive way (see [2]). This can be shown as follows: since the state at time k + 1 is realized from the probabilities in p[k + 1], we have E(s'[k + 1] 1s[k]) = p'[k + 1] = s'[k]H, (2.1.8) where the second equality is from Equation 2.1.4. Applying this expectation equation repeatedly, we get the recursion: E(s'[k + 1] 1 s[O]) = s'[]Hk1l = E(s'[k] Is[O])H. 2.1.4 (2.1.9) Homogeneous Influence Model It is worth mentioning a particular specialization of the influence model. The homogeneous influence model occurs when all the nodes have the same number of possible states, mi all i and j. = m for all i, and there is only a single transition matrix, A 3 = A for This specialization can be useful in certain symmetric cases, and will potentially simplify aspects of the update and inference tasks greatly. 16 2.1.5 Matrix-Matrix Form Finally, for the homogeneous influence model, instead of stacking the state and stateprobability vectors on top of each other as in (2.1.2) and (2.1.1), we can stack their transposes, and create n x m matrices S[k] and P[k]: s [k] ' [k] S[k1= P[k]= . (2.1.10) p' [k] s'[kp] Now, the update equation takes the following form: P[k + 1] = DS[k]A. (2.1.11) We call this the matrix-matrix form of the update. 2.2 Partially Observed Markov Networks In this section, we will discuss two generalizations that can be made regarding the structure of the influence model. The more general model, which we call a Partially Observed Markov Network (POMNet), is closer in spirit to the graphical models seen in contexts such as machine learning, while still maintaining the flavor of the IM. 2.2.1 Markov Networks In the full specification of the IM, the probability vector for the next state of the system, s[k + 1], is given in terms of a linear equation involving s[k], {A 3 }, and D. In the first generalization, we will eschew the linearity of the IM update, and allow for the probabilistic dependence between nodes to be specified in general terms. We maintain the idea that each node is influenced by its neighbors in the network, and updates at the next time step are based on that influence, but we allow the precise nature of the influence to be arbitrary. The particular dependence between nodes in 17 the IM is therefore a special case of this model, which we term a Markov network. Note that equations (2.1.6) and (2.1.7) still hold for Markov networks. 2.2.2 Partial Observation The second generalization has to do with the idea that we may only be capable of a partial observation of the system; that is, the state evolution may be observed at only a fixed subset of nodes [291. We will call this subset the observed nodes, and the remainder unobserved, or hidden nodes. It is often the case in situations of interest that we cannot obtain full knowledge of the system, and the idea of observed and hidden nodes may well fit the type of data we are able to obtain. 
2.2 Partially Observed Markov Networks

In this section, we will discuss two generalizations that can be made regarding the structure of the influence model. The more general model, which we call a Partially Observed Markov Network (POMNet), is closer in spirit to the graphical models seen in contexts such as machine learning, while still maintaining the flavor of the IM.

2.2.1 Markov Networks

In the full specification of the IM, the probability vector for the next state of the system, $s[k+1]$, is given in terms of a linear equation involving $s[k]$, $\{A_{ij}\}$, and $D$. In the first generalization, we will eschew the linearity of the IM update, and allow for the probabilistic dependence between nodes to be specified in general terms. We maintain the idea that each node is influenced by its neighbors in the network, and updates at the next time step are based on that influence, but we allow the precise nature of the influence to be arbitrary. The particular dependence between nodes in the IM is therefore a special case of this model, which we term a Markov network. Note that equations (2.1.6) and (2.1.7) still hold for Markov networks.

2.2.2 Partial Observation

The second generalization has to do with the idea that we may only be capable of a partial observation of the system; that is, the state evolution may be observed at only a fixed subset of nodes [29]. We will call this subset the observed nodes, and the remainder unobserved, or hidden, nodes. It is often the case in situations of interest that we cannot obtain full knowledge of the system, and the idea of observed and hidden nodes may well fit the type of data we are able to obtain. This type of partial knowledge also meshes well with the general theory for other graphical models, as we will see.

2.3 Relation to Hidden Markov Models

2.3.1 Hidden Markov Models

We now look at the hidden Markov model (HMM). The HMM consists of a Markov chain whose state evolution is unknown, and observations at each time step that probabilistically depend upon the current state of the Markov chain. It is often used in contexts such as speech processing [8], in which we have a process inherently evolving in time, and want to separate our observations from some simple (but hidden) underlying structure. We will give an overview of the HMM here, but see a source such as [24] or [10] for a more thorough introduction. An HMM is defined by three things:

1. Transition Probabilities. The state transition matrix of the underlying Markov chain.

2. Output Probabilities. The probability that a particular output is observed, given a particular state of the underlying chain.

3. Initial State. The distribution of the initial state of the underlying chain.

There are well-known algorithms for inference on the HMM [10]. The forward-backward algorithm is a recursive method for determining the likelihood of a particular observation sequence. The Viterbi algorithm can be used to identify the most likely hidden state sequence corresponding to an observation sequence. The Baum-Welch algorithm is a generalized expectation-maximization (EM) algorithm used to estimate the model parameters from training data.

2.3.2 Relation to the POMNet

If we consider a Markov chain with states corresponding to the possible states of $s[k]$ in a POMNet, the evolution of the POMNet is captured by this chain. If we are dealing with partial observations, then we can imagine an HMM observation corresponding to each time step, consisting of the appropriate subset of state variables (as this observation is certainly probabilistically dependent on the current state of the entire system). However, this formulation of the POMNet doesn't seem to be especially useful, as the size and complexity of the underlying chain grows to be prohibitively large even for a moderately sized POMNet, since the number of possible states of $s[k]$ is exponential in the number of nodes. Also, the original network graph structure is obscured by this representation, which is an indication that we are not efficiently using that structure for computation.

Rather than looking at the POMNet as an HMM, we can come up with an analogy of an HMM as a POMNet. Take an IM with two nodes, one hidden and one observed, where the hidden node influences both the observed node and itself, and the observed node influences nothing. The hidden node evolves as a simple Markov chain, since there are no outside influences, and the state of the observed node is probabilistically dependent upon the state of the hidden node; this is almost an HMM. The difference is that the observation (i.e., the state of the observed node) at time $k$ is dependent upon the state of the hidden Markov chain at time $k-1$.

Something else that can be gleaned from the HMM analogy is the development of forward and backward variables for the POMNet, analogous to those in the forward-backward algorithm for HMMs. While the computation will be different, and it should take into account the dependencies inherent in the graph structure, the goal of recursively computing a forward variable, which describes the probability of seeing a particular observation sequence from time 1 to time $k$ and ending up in state $i$ at time $k$, and the corresponding backward variable, is still relevant. Calculating these analogous forward and backward variables should give us an easy way to compute the probability of a particular observation sequence, or identify the most likely hidden state sequence (over all the hidden nodes) by Viterbi decoding, just as for the HMM (see [29]). Further details about this calculation can be found in Chapter 3.

2.4 Relation to Bayesian Networks

2.4.1 Bayesian Networks and Graph Theory

A more general class of graphical models than the HMM is that of Bayesian Networks (BNs). A Bayesian network is a graphical model which consists of nodes and directed arcs. The nodes represent random variables, and the arcs, and lack thereof, represent assumptions regarding conditional independence. We will briefly review some graph theory definitions in order to present Bayesian networks; for a more thorough treatment of graph theory in the context of BNs, see [7] or [21], among other places. Many of the following definitions will be used later in Chapter 3.

A graph $G = (V, E)$ consists of vertices (or nodes) $V$, and edges (or arcs) $E$. $E$ is a subset of the set $V \times V$ of pairs of vertices. In an undirected graph, the pairs are unordered and denote a two-way link between the vertices, whereas in a directed graph, we order the pairs, and consider the edges to specify a directional relation. If $(\alpha, \beta) \in E$, we write $\alpha \to \beta$, and say that $\alpha$ and $\beta$ are neighbors. In the directed graph, we also say that $\alpha$ is a parent of $\beta$, and $\beta$ is a child of $\alpha$. $G_A = (A, E_A)$ is a subgraph of $G$ if $A \subseteq V$ and $E_A \subseteq E \cap (A \times A)$ (that is, it may only contain edges pertaining to its subset of vertices). A graph is called complete if every pair of vertices is connected. A complete subgraph that is maximal is called a clique.

A path of length $n$ from $\alpha$ to $\beta$ is a sequence $\alpha = \alpha_0, \ldots, \alpha_n = \beta$ of distinct vertices such that $(\alpha_{i-1}, \alpha_i) \in E$ for all $i = 1, \ldots, n$. If there is a path from $\alpha$ to $\beta$, we say that $\alpha$ leads to $\beta$, and write $\alpha \mapsto \beta$. A chain or trail is a path that may also follow undirected edges (i.e., $(\alpha_{i-1}, \alpha_i)$ or $(\alpha_i, \alpha_{i-1}) \in E$ for all $i = 1, \ldots, n$). A minimal trail is a trail in which no node appears more than once. The set of vertices $\alpha$ such that $\alpha \mapsto \beta$ but not $\beta \mapsto \alpha$ is the set of ancestors of $\beta$, and the set of $\beta$ such that $\alpha \mapsto \beta$ but not $\beta \mapsto \alpha$ is the set of descendants of $\alpha$. The nondescendants of $\alpha$ comprise the set of all vertices excluding descendants of $\alpha$ and $\alpha$ itself. A cycle is a path that begins and ends at the same vertex. A graph is acyclic if it has no cycles. An undirected graph is triangulated if every cycle of length greater than 3 possesses an edge between nonconsecutive vertices of the cycle (also called a chord). A directed acyclic graph, also called a DAG, is singly connected if for any two distinct vertices there is at most one chain between them; otherwise it is multiply connected. A (singly connected) DAG is called a tree if every vertex has at most one parent, and exactly one vertex has no parent (this vertex is called the root). A Bayesian network is always on a DAG.
Each node in the BN has a conditional probability distribution (CPD) associated with it; this is the probability distribution of the random variable corresponding to the node, given the node's parents. The power of Bayesian networks lies in the ability to represent conditional independence, and thus a factorization of the joint distribution of the random variables, in an easily recognizable way using the graph structure. The key rule is that each node is conditionally independent of its nondescendants, given its parents [22]. This rule, and corresponding algorithms, can be used to determine useful independence relations in the network, as we will see in Chapter 3. The values of some of the random variables in a BN may be specified, in which case we call them instantiated, or evidence, nodes. This is analogous to the POMNet notion of observed sites (rather than hidden sites).

2.4.2 Relation to the POMNet

An important facet of the general BN framework is that there is no implicit temporality; the network simply represents relationships between different random variables. However, we know that our POMNet model uses the same kind of temporal evolution found in HMMs, so we want to find a way to represent it. We note that an HMM can be represented as a particular type of BN, which has a simple two-node structure that repeats once for each time step of the HMM (Figure 2-1), in which, in each time slice, a hidden node represents the state of the underlying Markov chain, and an evidence node represents the corresponding observation. This type of BN, with a repeating structure over time steps, is in fact often called a Dynamic Bayesian Network (DBN) (see for instance [5], [12], or [16]).

Figure 2-1: A hidden Markov model expressed as a Bayesian network. Shaded circles represent evidence nodes.

Since the POMNet can be represented as an HMM, we know that it can also be represented as a DBN. However, we can choose a much more informative and useful representation than the two-node repetition we used for HMMs. Considering the POMNet over a finite interval of time, we can create a DBN with one node per state variable per time step, and represent dependencies between nodes across time. The resulting directed acyclic graph, which we will call a trellis diagram, fully represents the structure of the POMNet (Figure 2-2). Note that we don't have any edges between nodes within a given time step, since, as we said, $s_i[k+1]$ is independent of all other state variables at time $k+1$ given $s[k]$. Also, the directed edges always feed forward by exactly one time step, since we are dealing with a first-order Markov process. Our partial observations constitute instantiated (evidence) nodes in the network, and consist of the same nodes at every time step of the repeating structure, since we defined partial observation over a fixed subset of nodes in the POMNet. Therefore all the structural information in this infinite trellis is contained in a snapshot of the nodes at any time, and the feed-forward edges from any time $k$ to $k+1$.

This representation of the POMNet is especially useful, as it allows us to apply some results from BNs to understand and improve the efficiency of inference tasks. Furthermore, it appears that these results can be translated to a form that is easily identified in the original POMNet or IM network graph. These tools for inference are the topic of the next chapter.
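The trellis construction is mechanical: given the influence matrix $D$, the edges of the unrolled DBN can be enumerated directly. The sketch below is our own illustration; the (site, time) labeling of trellis nodes is an assumption of convenience.

```python
import numpy as np

def trellis_edges(D, T):
    """Edges of the time-unfolded trellis (a DBN) for a POMNet/IM.

    A trellis node is a (site, time) pair; a directed edge
    ((j, k), (i, k + 1)) is present whenever d_ij > 0, i.e. whenever
    site j can influence site i at the next time step.
    """
    n = D.shape[0]
    return [((j, k), (i, k + 1))
            for k in range(T - 1)
            for i in range(n)
            for j in range(n)
            if D[i, j] > 0]
```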
Figure 2-2: A POMNet graph and corresponding trellis diagram. Shaded circles represent evidence nodes.

2.5 A Generalized Influence Model

We saw how a POMNet can be formulated as a Bayesian network. We now present a novel and potentially useful model which results from hybridizing the influence model and dynamic Bayesian networks in a different way than a POMNet. This can be thought of as a generalization of the IM, or equivalently a specially structured DBN, since the influence model is a particular instance of a DBN. The basic idea is to allow for nodes within each time-slice of the model evolution to influence one another, resulting in same-time directed edges in the trellis diagram for the model (see Figure 2-3 for an example). In this sense, it is close to a DBN, which allows this type of same-time dependence. However, we choose to maintain the quasi-linear update structure of the IM, and describe the same-time influence by a new influence matrix, called $F$, in addition to our standard network influence matrix $D$. The strictly lower triangular matrix $F$ describes edge weights on a DAG (for a Bayesian network) over the nodes at a given time step, and a set of $m_i \times m_j$ matrices $\{B_{ij}\}$, analogous to the $\{A_{ij}\}$, describes the effects of these same-time influences (now $\sum_j (d_{ij} + f_{ij}) = 1$, and each $B_{ij}$ is also row-stochastic).

Figure 2-3: A "generalized" IM graph and trellis (DBN) structure. Solid lines represent influence in the usual sense, whereas dashed lines represent same-time influence. The order for updating nodes at each time step would be $\{1\}, \{2, 3\}, \{4\}$.

The model update can proceed in a similar fashion to the IM, with the following difference: the standard process will only update those nodes which are root nodes of the DAG, that is, those nodes whose values at time $k+1$ are only influenced by the state of the model at time $k$. We then add a step in which the remaining nodes at time $k+1$ are updated from the root nodes at that time (i.e., the root nodes update their children, and the updated nodes update their children, and so on). In other words, the update proceeds as follows (a code sketch follows the steps):

1. Update the distributions for nodes dependent only on the previous time step (without same-time influence) as in the standard IM: for such a node $i$,

$$p_i'[k+1] = \sum_j d_{ij}\, s_j'[k]\, A_{ji}, \qquad (2.5.1)$$

just as in Equation 2.1.3 of the IM. We then find the realizations of those distributions, obtaining $s_i[k+1]$ for this set of $i$.

2. Update the distributions for nodes dependent on the previous time step as well as the already-updated nodes: for such a node $i$,

$$p_i'[k+1] = \sum_j d_{ij}\, s_j'[k]\, A_{ji} + \sum_j f_{ij}\, s_j'[k+1]\, B_{ji}. \qquad (2.5.2)$$

Note that in the second sum, $f_{ij}$ will be 0 for nodes $j$ at which $s_j[k+1]$ has not yet been obtained. We find the realizations of the calculated distributions, obtaining $s_i[k+1]$ for this set of $i$.

3. Repeat Step 2 until $s_i[k+1]$ is obtained for all $i$, i.e., until all nodes at time $k+1$ are updated.
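The following Python sketch illustrates this tiered update for the homogeneous case ($A_{ij} = A$, $B_{ij} = B$). The loop discovers the tiers implicitly, by updating a node only once all of its same-time parents have been realized; the encoding conventions are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

def gim_step(state, D, F, A, B):
    """One update of a homogeneous generalized IM with same-time influence.

    state : length-n vector of statuses at time k
    D, F  : n x n cross-time and (strictly lower triangular) same-time
            influence matrices, with each row of D + F summing to 1
    A, B  : m x m row-stochastic matrices for the two kinds of influence
    """
    n, m = len(state), A.shape[0]
    nxt = np.full(n, -1)
    done = np.zeros(n, dtype=bool)
    while not done.all():
        for i in range(n):
            parents = np.nonzero(F[i])[0]      # same-time parents of node i
            if done[i] or not done[parents].all():
                continue                        # node i's tier not ready yet
            # cross-time part, as in Equations (2.5.1)/(2.5.2)
            p = sum(D[i, j] * A[state[j]] for j in range(n))
            # same-time part from already-realized parents
            p = p + sum(F[i, j] * B[nxt[j]] for j in parents)
            nxt[i] = rng.choice(m, p=p)
            done[i] = True
    return nxt
```

Because $F$ is strictly lower triangular in the tiered ordering, each pass realizes at least one new tier, so the loop terminates after $q$ passes.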
In order to further explore the properties of this model, we look at the conditional expectation of the system state, given the state at the previous time step. We will restrict ourselves to the homogeneous case for simplicity (analogous to the homogeneous influence model), where the number of possible states at each node is fixed at $m$, $A_{ij} = A$, and $B_{ij} = B$ for all $i, j$. Since a state vector is a realization of the corresponding probability vector, we still have that $E(s'[k+1] \mid s[k]) = p'[k+1]$.

To further follow the IM development, we would like to represent our update equation 2.5.2 in matrix form, as in Equation 2.1.4 of the IM. Toward this goal, we introduce some terminology to deal with the additional complexity of the generalized IM. Let the set of nodes $n$ in the model be partitioned so that $n = \{n_1, n_2, \ldots, n_q\}$, where $n_1$ is the first tier of nodes (root nodes of the DAG), $n_2$ is the second tier, which depends only upon nodes in $n_1$ and in the previous time step, etc., and $q$ is the number of tiers. We assume from now on that the vectors and matrices describing the model as a whole are specified in this tiered ordering. Let $N_j = \bigcup_{i=1}^{j} n_i$ for $j = 1, \ldots, q$; that is, $N_j$ is the cumulative set of nodes up to and including tier $j$ (so that $N_q = n$). Let $s_{n_j}[k]$ denote the state of nodes in tier $j$ at time $k$, and $s_{N_j}[k]$ denote the state of nodes in tiers 1 to $j$ at time $k$.

Just as $H = D' \otimes A$ (Equation 2.1.5) in the IM, we now define

$$G = F' \otimes B, \qquad (2.5.3)$$

where $F$ and $B$ are as given before. We then define the following notation for picking subsets of the matrix $H$:

$$H_{n_j} = D_{n_j}' \otimes A,$$

where $D_{n_j}$ consists of the size $|n_j| \times |n|$ matrix obtained by selecting only the rows of $D$ corresponding to the nodes in $n_j$. Similarly, let

$$G_{N_{j-1}} = F_{N_{j-1}}' \otimes B,$$

where $F_{N_{j-1}}$ consists of the size $|n_j| \times |N_{j-1}|$ matrix obtained by picking the rows of $F$ corresponding to the nodes in $n_j$, and the columns of $F$ corresponding to the nodes in $N_{j-1}$. Finally, let

$$G_{n_j} = F_{n_j}' \otimes B,$$

where $F_{n_j}$ consists of the size $|n_j| \times |n_j|$ matrix obtained by picking the rows and columns of $F$ corresponding to the nodes in $n_j$. Note that $G_{n_j}$ and $G_{N_{j-1}}$ are simply different subblocks of $G$. This notation is necessary because each time we perform Step 2 of the update (Equation 2.5.2), we need to map the state vector for the previous time step $s[k]$ and the state vector for previously computed tiers $s_{N_{j-1}}[k+1]$ onto the probability vector for tier $j$ in a different way.

Using this terminology, and assembling Equation 2.5.2 for all sites $i \in n_j$, gives

$$p_{n_j}'[k+1] = E\big(s_{n_j}'[k+1] \mid s'[k],\, s_{N_{j-1}}'[k+1]\big) = s'[k]\, H_{n_j} + s_{N_{j-1}}'[k+1]\, G_{N_{j-1}}, \qquad (2.5.4)$$

for $j = 1, \ldots, q$. Iterating the expectations over $s[k]$ and $s_{N_{j-1}}[k+1]$, we get

$$E(s_{n_j}'[k+1]) = E(s'[k])\, H_{n_j} + E(s_{N_{j-1}}'[k+1])\, G_{N_{j-1}}. \qquad (2.5.5)$$

Moving the second term on the right-hand side over to the left, we have

$$E(s_{n_j}'[k+1]) - E(s_{N_{j-1}}'[k+1])\, G_{N_{j-1}} = E(s'[k])\, H_{n_j}, \qquad (2.5.6)$$

for $j = 1, \ldots, q$. Now, combining Equation 2.5.6 for all $j$, using the definitions for the subsets of $G$ and $H$ discussed earlier, and using a block vector/matrix notation for grouping tiers together, the left-hand side becomes

$$E\big([\,s_{n_1}'[k+1]\;\; s_{n_2}'[k+1]\;\; \cdots\;\; s_{n_q}'[k+1]\,]\big)\begin{bmatrix} I & -G^{(12)} & \cdots & -G^{(1q)} \\ 0 & I & \cdots & -G^{(2q)} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & I \end{bmatrix} = E(s'[k+1])\,[I - G], \qquad (2.5.7)$$

where $-G^{(ij)}$ denotes the tier-$(i,j)$ block of $-G$; the matrix is block upper triangular, with identity blocks on the diagonal, because $F$ is strictly lower triangular in the tiered ordering. Finally, we obtain the recursive update for the expectation:

$$E(s'[k+1])\,[I - G] = E(s'[k])\, H. \qquad (2.5.8)$$

Compared to the standard IM, which results in $E(s'[k+1]) = E(s'[k]) H$ (see Equation 2.1.9), we now have $E(s'[k+1])\, J = E(s'[k])\, H$, where $J = [I - G]$, for the evolution of the expected state vector. Using this, one is able to discuss the limiting steady-state probabilities for large $k$. Just as the steady-state distribution for the original IM depends on the eigenvalues and eigenvectors of $H$ (see the discussion in [1] and [2]), our steady-state distribution should depend on the generalized eigenvalues and eigenvectors of the matrix pair $(J, H)$. Note that the strictly upper triangular nature of $G$ allows for a simple expression for the inverse of $J$:

$$J^{-1} = I + G + G^2 + \cdots + G^{q-1}.$$
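Before verifying this analytically, here is a quick numeric spot-check of the series, for a small hypothetical example with $n = 3$ nodes and $m = 2$ statuses; the particular $F$ and $B$ values are made up for illustration.

```python
import numpy as np

# F strictly lower triangular, so G = F' (x) B is strictly block upper
# triangular and therefore nilpotent with q = 3 tiers.
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
F = np.array([[0.0, 0.0, 0.0],
              [0.3, 0.0, 0.0],
              [0.2, 0.4, 0.0]])
G = np.kron(F.T, B)
J = np.eye(6) - G
q = 3

J_inv = sum(np.linalg.matrix_power(G, l) for l in range(q))
assert np.allclose(J_inv, np.linalg.inv(J))        # J^{-1} = I + G + G^2

# The Kronecker power identity G^l = (F')^l (x) B^l checks out the same way:
assert np.allclose(np.linalg.matrix_power(G, 2),
                   np.kron(np.linalg.matrix_power(F.T, 2),
                           np.linalg.matrix_power(B, 2)))
```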
This is easily verified by multiplying this expression by $I - G$, and noting that $G^q$ and higher powers are all 0. Further, $G^l$ can be easily written in terms of $F$ and $B$, due to the properties of Kronecker products (see [19] for instance): $G^l = (F')^l \otimes B^l$. Thus, by bringing $J$ to the right-hand side of Equation 2.5.8, the expected value recursion can be propagated forward using a matrix whose form is explicit.

This generalized influence model, motivated by DBN structure, may be able to better capture the system dynamics for particular applications, while maintaining much of the tractability and appeal of the IM.

Chapter 3

Efficient Inference: Relevance and Partitioning

3.1 The Inference Task

Statistical inference can be described as making an estimation, making a decision, or computing relevant distributions, based on a set of observed random variables. In the context of the influence model and other graphical models, inference tasks might include: (1) calculation of the marginal distribution over a few variables of interest, (2) estimation of the current or future state of the system, or (3) learning the structure or parameters of the model (although learning is considered a separate task from inference in some contexts). All of these tasks require some data, and may have varying assumptions as to how much of the model we take for granted and how much we must discover. When we talk about inference in the context of graphical models, we are almost always referring to the first task, namely calculating posterior marginal distributions conditional on some data, as this will allow us to do state estimation and other things as well.

At first glance, we may think to approach such a task by considering the joint distribution of all the system variables, and marginalizing out all the random variables that are not of interest. Doing this by brute force, however, is intractable for practically any problem worth modelling with a graphical model. The key, then, is to take advantage of the special structure of the joint distribution that the graphical model captures in order to make inference tractable. The rest of this chapter will be about methods that attempt to do exactly that. We must realize, however, that even after optimizing our algorithms with respect to a given model structure, inference may still be very complex. Computational time is often exponential in the number of nodes in our graph. In fact, it has been shown that in the most general case of a graph with arbitrary structure, both exact and approximate inference will always be NP-hard [6]. Keeping this in mind, we still try to find methods to improve efficiency for specific problems.

3.2 Probabilistic Independence and Separation

3.2.1 d-Separation

In Chapter 2, we noted that the strength of Bayesian networks lies in their ability to capture conditional independence between random variables. Directional separation, or d-separation (named so because it functions on a directed graph), is the mechanism for identifying these independencies; namely, a set of nodes $A$ is d-separated from a set $B$ by $S$ if and only if the random variables associated with $A$ are independent of those associated with $B$, conditional on those associated with $S$. Informally, this occurs in BNs because an evidence node blocks propagation of information from its ancestors to its descendants, but also makes all its ancestors interdependent.

There are several equivalent ways to define d-separation. We will first define it using the graph-theoretic terminology outlined in Chapter 2; this is the formal definition preferred in sources from the statistical community, such as [7] and [17]. To do this, we will first mention a few more terms. A DAG is moralized by joining (or "marrying") all pairs of parents of each node, and then dropping the directionality of arcs. That is, if $G = (V, E)$ is a DAG, and if $u$ and $w$ are parents of $v$, create an undirected arc between $u$ and $w$; do this for every pair of parents of $v$ and for all $v \in V$, and then convert all directed arcs into undirected arcs connecting the same nodes. The resulting graph is called the moral graph relative to $G$, and is denoted by $G^m$. If a set $A$ contains all parents and neighbors of node $\alpha$, for all $\alpha \in A$, then $A$ is called an ancestral set. We will denote the smallest ancestral set containing set $A$ by $An(A)$. A set $C$ is said to separate $A$ from $B$ if all trails from $\alpha \in A$ to $\beta \in B$ intersect $C$. Then, the definition is as follows: given that $A$, $B$ and $S$ are disjoint subsets of a directed acyclic graph $D$, $S$ d-separates $A$ from $B$ if and only if $S$ separates $A$ from $B$ in $(D_{An(A \cup B \cup S)})^m$, the moral graph of the smallest ancestral set containing $A \cup B \cup S$.

The second definition is more algorithmic in nature, and tends to be favored by the artificial intelligence community; see for instance [22]. Recall that a trail is a sequence that forms a path in the undirected version of a graph. A node $\alpha$ is called a head-to-head node with respect to a trail if the connections from both the previous and subsequent nodes in the trail are directed towards $\alpha$. A trail $\pi$ from $\alpha$ to $\beta$ in a DAG $D$ is said to be blocked by a set of nodes $S$ if it contains a vertex $\gamma \in \pi$ such that either:

1. $\gamma$ is not a head-to-head node with respect to $\pi$, and $\gamma \in S$, or

2. $\gamma$ is a head-to-head node with respect to $\pi$, and $\gamma$ and all its descendants are not in $S$.

A trail that is not blocked is said to be active. With this definition, $A$ and $B$ are d-separated by $S$ if all trails from $A$ to $B$ are blocked by $S$ (i.e., there are no active trails).

This definition lends itself to a convenient and fairly elegant method for determining d-separation called the "Bayes-ball" algorithm [27]. This algorithm determines d-separation with an imaginary bouncing ball that originates at one node and tries to visit other nodes in the graph, and at each node either "passes through," "bounces back," or is "blocked." The behavior of the ball is determined as follows:

1. an unobserved node passes the ball through, but also bounces back balls from children, and

2. an observed node bounces back balls from parents, but blocks balls from children.

As the ball travels, it marks visited nodes, and at the end of the algorithm (see [27] for details), the nodes that remain unmarked are those that are d-separated from the starting nodes given the observations. The advantage of this method is that it is simple to implement, and intuitive to follow. As the description of the Bayes-ball algorithm makes clear, our interest in d-separation is that it will tell us which nodes are independent from others given a particular set of observations, or evidence nodes. Easily identifying independence in the graph is central to efficient inference, as we will see next.
All outgoing links from the non-evidence sites of A terminate in non-influencing sites, where a non-influencing site is defined as one from which there exists no path to an evidence site. The goal of identifying (-independence is equivalent to that of eliminating irrelevant nodes - to reduce inference to the simplest possible calculation for a particular set of sites. We claim that applying relevance reasoning in the POMNet trellis is in fact equivalent to proceeding by (-independence. In particular, by selecting a set of sites in a POMNet and running the relevance algorithms on the corresponding BN with the time-extended version of the selected sites as target nodes, the resulting set of nodes will be the smallest (-independent set containing the selected sites, and thus the concepts of relevance and (-independence are equivalent in this sense. This can be seen as follows: if all incoming links to A are from evidence sites, then all incoming arcs to the 36 set of nodes corresponding to A are from evidence nodes. These nodes are not headto-head nodes with respect to any paths connecting A to the nodes outside of A, and thus they block those paths (by our second definition of d-separation above). Additionally, all outgoing links from A terminate in non-influencing sites. Non-influencing sites correspond to barren nodes in the trellis diagram, and thus these outgoing links can be removed. Finally, evidence sites in A with only outgoing links leaving A can never be head-to-head nodes along any path leading out of A, and therefore block the remaining paths. So, we see that the conditions for (-independence induce a set of nodes that can be obtained from relevance reasoning. Thus a possible use for the (-independence conditions is to directly and intuitively identify sets of sites in a POMNet that would be produced by relevance reasoning in the corresponding trellis, without resorting to an analysis of d-separation and barrenness in the underlying BN. This idea of obtaining minimally-relevant subsets of the graph (in both BNs and POMNets) lends itself to simplifying inference in a complex graph by identifying subgraphs that can be dealt with separately. The process of breaking the graph up into such subgraphs is referred to as relevance-based decomposition ([18]), or partitioning ([29]). Note that in the general case of decomposition, these subgraphs will be partially overlapping, especially sharing evidence nodes. Algorithms for decomposing the graph in this way can be as simple as arbitrarily choosing small subsets of target nodes, and eliminating irrelevant nodes for each such subset; however, the particular choices of subsets will significantly affect the efficiency of the algorithm. Although even crude choices of target subsets will improve overall performance ([18]), ways of optimizing this process remain to be studied in greater detail. 37 3.4 3.4.1 Algorithms for Exact Inference Message Passing and the Generalized Forward-Backward Algorithm The generalized forward-backward algorithm is an exact inference message passing algorithm for singly connected connected Bayesian networks [11]. passing algorithm (due to Pearl As the message [23]) works in polynomial time, it is the method of choice for inference in singly connected networks. 
Message passing (also called probability propagation, or belief propagation, although the latter often refers to loopy belief propagation, discussed later) is a mechanism by which information local to each node is propagated to all other nodes in the graph so that marginalization can be performed. The forward-backward algorithm arranges this message passing in a highly regular way by breaking the process into two parts: a forward pass, in which information is propagated towards an arbitrarily chosen "root" node (also known as "collecting evidence"), and a backward pass in which information is propagated away from the root ("distributing evidence"). There are two types of computation performed in this process: multiplication of local marginal distributions at each node, and summation over local joint distributions (this is why it is also sometimes called the "sum-product algorithm") (see [11]), so that the outgoing message from each node consists of the product of all incoming messages, summed over the local distribution. We will call a message from a child to a parent a A message, and the message from a parent to a child a 7r message. A node can calculate its i message to its children once it receives 7r messages from all its parents, and can calculate its A message once it receives A messages from its children. The local posterior distribution at a node can be found once it has calculated both its A and 7r messages, by multiplying the two. 38 let us look at message For a simple example. k (that is, a, node j passing on a directed chami. i <- j with unique child i and unique parent k). We pick node k as the root, and the generalized forward-backward message passing order will then be as follows: Ai-,, Aj-k, 7k-j, 7j. In this example (from [20]), the A message from j to k is and the 7 message from j to i is k As mentioned in [20], and as we might expect, in this case the algorithm is equivalent to the forward-backward algorithm for HMMs, where r is the equivalent of the forward variable a and A is the equivalent of the backward variable 3, and where A and 7r can be computed independently of each other (this is not generally the case in a tree). 3.4.2 The Junction Tree Algorithm The standard algorithm for exact inference on a multiply connected Bayesian network is the so-called junction tree or join tree algorithm (see for example [21]). This concept is sometimes also referred to as clustering when a tree is created from clusters, or groups, of variables in a complex problem. The junction tree algorithm proceeds by first creating a "junction tree" out of the BN graph, which is a representation that allows a potentially complex graph to be expressed in an easily manipulated (and singly connected) tree form. The process of building a junction tree can be summarized in the following steps: 1. Moralize the graph. As mentioned previously in the context of d-separation, moralization involves "marrying"common parents, and converting the graph to its undirected form. 2. Triangulate the graph. As mentioned in Chapter 2, a graph is triangulated 39 if it (loesil't li'iy& r111v cvcles of lel-ti -ive(i thwn 3. Triiaiiguilat e G, by adding chords as necessary. 3. Find the maximal cliques. Group the nodes in each clique together - each clique will make up a node of the junction tree. 4. Connect the cliques in a tree. Once the tree is created, message passing algorithms that work on singly connected graphs are will efficiently solve the inference task. However, this method is not without its flaws. 
Following the above steps to cre- ate a junction tree doesn't result in a unique solution, and some junction trees will result in more efficient solutions than others - this arises from the fact that there are often many different triangulations of a given graph. In fact, finding the optimal triangulation (i.e. leading to the most efficient inference possible) is itself an NP-complete problem [13]. For this reason, the junction tree algorithm is sometimes used with approximately optimal triangulation methods. Still, it has been shown that any exact inference method based on local reasoning is either less efficient than the junction tree, or has an optimality problem equivalent to that of triangulation [13]. 3.4.3 Exact Inference in the POMNet We now turn to inference in the POMNet model. Since we know that a POMNet trellis is in fact a DBN, and that it will be multiply connected in all but the most degenerate cases, we might think to immediately apply the junction tree algorithm (preferably after eliminating extraneous nodes through relevance reasoning). For detailed treatment of junction tree inference in DBNs, see [16] or [20]. An alternate way to approach this problem, discussed in [29], is to develop a method by analogy with the HMM forward-backward algorithm. This entails defining 40 le1n(rsivelv computed forward and backward variables. which in fact can be computed locally to any (-independent subset of the POMNet ((-independence was defined in section 3.3.2). We assume that we have data in the form of observations (at the observable sites) from time 1 to T. The local forward variable a' [k] of a (-independent set A is defined as the probability that the hidden sites of A (the state of which will be denoted by A[k]) are in a particular state i at time k, given the portion of the observations up to some time k (denoted Q1'k): c4[k] e P(A[k = zQ1k) Just as in the HMM, the backward variable 3f'[k is defined such that the product of forward and backward variables will result in the posterior marginal distribution that we desire, that is, aA[k]O[k] = P(A[k] =7 101,T). Thus, we perform the inference task by computing the forward and backward variables for any intermediate time k (see [29]). The standard forward-backward algorithm for HMMs can itself be applied to a POMNet as well, since we can simply formulate our POMNet as an HMM, as described earlier. We do not compute variables local to subsets of the graph in this case, but we can eliminate computationally irrelevant sites before converting to an HMM, and so can still perform relevance-decomposed inference. Further variations on the forward-backward algorithm are also used in the context of DBNs. Just as the forward-backward algorithm exploits the Markov structure of HMMs by conditioning on the present state to make the past and future hidden states independent, the frontier algorithm exploits the Markov structure of DBN models by conditioning on all the nodes at a particular time step to separate the past from the future, updating the joint distribution with a forward and backward pass that maintain such a separating set at every step [20]. Similarly, the interface algorithm attempts to improve upon this strategy by identifying the minimal set of nodes needed to d-separate the past and future based on the graph structure, thereby proceeding in a more efficient manner [201. 
41 3.5 Algorithms for Approximate Inference In many problems of practical interest that map to multiply connected networks, even the most efficient exact inference algorithms are intractable [11]. Although the approximate inference task has also been shown to be NP-hard in the general case [18], approximations can potentially make inference tractable, with each approximation algorithm best suited to a particular class of problems. Approximate inference algorithms can be divided into two categories: (1) stochastic methods, which include such Monte Carlo methods as Gibbs sampling, importance sampling, and a sequential method called particle filtering, and (2) deterministic approximation methods, two of which are described below. 3.5.1 Variational Methods Variational methods are a commonly used approximation technique in a wide variety of settings, from quantum mechanics to statistics [14]. [14] also provides a compre- hensive tutorial on their use in the context of graphical models. The basic idea can be summarized as follows: in this context, we use variational methods to approach the problem of approximating the posterior probability distribution P over the hidden nodes by defining a parameterized distribution minimize the distance between P and approximation given by we can choose Q, and varying its parameters to Q. Since the complexity of inference in the Q is determined by its conditional independence relations, Q to have a tractable structure, such as a Bayesian network with fewer dependencies than P [12]. We then minimize the distance between P and Q with respect to a metric such as the Kullback-Leibler divergence. 3.5.2 "Loopy" Belief Propagation So-called "loopy" belief propagation is simply the message passing/probability propagation algorithm described previously, applied to multiply connected graphs (i.e. 42 containing loops in the underlying undirected graph). While this method will not in general converge to the correct marginal distributions in a short time, as it does in the singly connected case (e.g. two passes for the generalized forward-backward algorithm), it is in many cases guaranteed to converge to a set of locally consistent messages. The accuracy and optimality of this algorithm have been studied to some degree, including potential ways to correct the posterior marginals after convergence (see [30]). Loopy belief propagation has been found to produce good results in a variety of fields with large scale applications for graphical models, including decoding turbo codes, image processing, and medical diagnosis. 43 44 Chapter 4 Parameter Estimation 4.1 Maximum Likelihood Estimation Maximum likelihood estimation of the transition probability matrix in a Markov chain as well as properties of such estimators have been well studied in the literature (see for example [3]). Although the problem of parameter estimation in an IM can be formulated in that context, (1) the number of parameters involved is large, and will need huge data sets to accomplish the estimation with any reliability, and (2) there are issues of identifiability. Our intent is to show the feasibility of maximum likelihood estimation in this context; we will start with a simple example. 4.2 Estimation of State Transition Probabilities To illustrate the ideas, we will start by assuming we have a 5 node homogeneous influence model, with each node having only two possible states, 0 or 1. We will assume that all the nodes are observable at all times. 
3.5.2 "Loopy" Belief Propagation

So-called "loopy" belief propagation is simply the message passing/probability propagation algorithm described previously, applied to multiply connected graphs (i.e. graphs containing loops in the underlying undirected graph). While this method does not in general converge to the correct marginal distributions in a short time, as it does in the singly connected case (e.g. two passes for the generalized forward-backward algorithm), in many cases it converges to a set of locally consistent messages. The accuracy and optimality of this algorithm have been studied to some degree, including potential ways to correct the posterior marginals after convergence (see [30]). Loopy belief propagation has been found to produce good results in a variety of fields with large-scale applications for graphical models, including decoding turbo codes, image processing, and medical diagnosis.

Chapter 4
Parameter Estimation

4.1 Maximum Likelihood Estimation

Maximum likelihood estimation of the transition probability matrix in a Markov chain, as well as the properties of such estimators, has been well studied in the literature (see for example [3]). Although the problem of parameter estimation in an IM can be formulated in that context, (1) the number of parameters involved is large, so huge data sets would be needed to estimate them with any reliability, and (2) there are issues of identifiability. Our intent is to show the feasibility of maximum likelihood estimation in this context; we will start with a simple example.

4.2 Estimation of State Transition Probabilities

To illustrate the ideas, we will start by assuming we have a 5-node homogeneous influence model, with each node having only two possible states, 0 or 1. We will assume that all the nodes are observable at all times.

Using the standard notation, the parameters for such a model are given by the two matrices D and A. In this section, we will take D as being fully given, while the transition probabilities in the matrix A are to be estimated. In particular, we use a model given as an example in [29]:

D = \begin{bmatrix} .5 & .5 & 0 & 0 & 0 \\ .25 & .5 & .25 & 0 & 0 \\ 0 & .25 & .5 & .25 & 0 \\ 0 & 0 & .5 & .5 & 0 \\ 0 & .25 & .25 & .25 & .25 \end{bmatrix}, \quad A = \begin{bmatrix} 1-p & p \\ q & 1-q \end{bmatrix}, \quad \pi_i = \begin{bmatrix} 1 & 0 \end{bmatrix} \text{ for all nodes } 1 \le i \le 5.

The observations consist of the complete state of the system at time k for T time steps, i.e. for k = 1, ..., T. Given the large-sample optimality properties of maximum likelihood estimates (see for example [3]), our goal is to estimate the p and q which maximize the likelihood of the observed state sequence. We express this likelihood as a function of p and q, and call it L(p, q). For given p and q, we can calculate the probability of a node changing from state i at time k to state j at time k + 1, so we can find the probability that the system changes from state S[k] to state S[k + 1] by multiplying probabilities for each node. Given the Markov nature of the model, the probability of the whole observation sequence is then obtained by multiplying probabilities for each time step. Recall the matrix-matrix form of the evolution equations, i.e. P[k + 1] = D S[k] A, where

S[k] = \begin{bmatrix} s_1'[k] \\ \vdots \\ s_n'[k] \end{bmatrix}, \quad P[k] = \begin{bmatrix} p_1'[k] \\ \vdots \\ p_n'[k] \end{bmatrix}.

We can write the likelihood for the general case of n nodes and m statuses per node as follows:

L(p, q) = \prod_{k=2}^{T} \prod_{i=1}^{n} \prod_{j=1}^{m} \{P[k]\}_{ij}^{\{S[k]\}_{ij}},

where \{A\}_{ij} denotes the (i, j)th element of a matrix A.

To study the behavior of the likelihood as a function of p and q, we calculate the value of L(p, q) for several (p, q) pairs and plot it. For this particular computation, we take p = 0.5, q = 0.5, and T = 10, and then simulate data (states) using the algorithm given in [29]. This data is then used to calculate and plot the likelihood L(p, q). The following plot shows the likelihood calculated at 10,000 value pairs (p, q), with each argument incremented in steps of 0.01 over the range (0, 1).

Figure 4-1: Plot of L(p, q) (p = 0.5, q = 0.5, T = 10).

We see that the likelihood has a single sharp peak and should be fairly easy to maximize, giving us the maximum likelihood estimators (MLEs) of p and q. We observe through simulations that such a unimodal likelihood function results regardless of the choice of p and q used to generate the data. To find the MLEs, the most direct approach is to find the maximum of the 10,000 calculated likelihood values (a grid search), and read off the corresponding p and q. However, the shape of the likelihood suggests a much faster and more efficient method, namely to maximize the likelihood (still to the nearest 0.01) with respect to p while holding q fixed, then maximize with respect to q while holding p fixed, iterating until there is no further improvement. This method converges in just a few iterations, and produced the same results as maximizing over all 10,000 values for all the data tested.

In order to say something about the performance and the small-sample properties of these estimators, we perform Monte Carlo simulations. The following histograms show the results of 180 runs of the estimation of p and q.

Figure 4-2: Estimates of p (p = 0.5, q = 0.5, T = 100); mean 0.4940, standard deviation 0.0458.

Figure 4-3: Estimates of q (p = 0.5, q = 0.5, T = 100); mean 0.4975, standard deviation 0.0400.
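The computation just described can be sketched as follows. The helper names, and the use of the log-likelihood to avoid numerical underflow, are our choices; the model matrices are those given above. The simulation step draws each site's next status from its row of P[k + 1] = D S[k] A, which is exactly the probability that enters the likelihood.

```python
import numpy as np

def A_mat(p, q):
    # Local Markov chain for each site, statuses {0, 1}.
    return np.array([[1 - p, p],
                     [q, 1 - q]])

def simulate(D, p, q, T, rng):
    """Generate states for the homogeneous binary influence model."""
    n = D.shape[0]
    states = np.zeros((T, n), dtype=int)    # all sites start in status 0
    A = A_mat(p, q)
    for k in range(1, T):
        S = np.eye(2)[states[k - 1]]        # n x 2 indicator matrix S[k-1]
        P = D @ S @ A                       # next-status probabilities, one row per site
        states[k] = [rng.choice(2, p=row) for row in P]
    return states

def log_lik(states, D, p, q):
    """log L(p, q) = sum over k and i of log P[k]_{i, s_i[k]}."""
    A = A_mat(p, q)
    ll = 0.0
    for k in range(1, len(states)):
        P = D @ np.eye(2)[states[k - 1]] @ A
        ll += np.sum(np.log(P[np.arange(D.shape[0]), states[k]]))
    return ll

D = np.array([[.50, .50, 0, 0, 0],
              [.25, .50, .25, 0, 0],
              [0, .25, .50, .25, 0],
              [0, 0, .50, .50, 0],
              [0, .25, .25, .25, .25]])
rng = np.random.default_rng(0)
data = simulate(D, 0.5, 0.5, T=10, rng=rng)
grid = np.arange(0.01, 1.0, 0.01)           # 0.01-resolution grid, endpoints excluded
L = np.array([[log_lik(data, D, p, q) for q in grid] for p in grid])
i, j = np.unravel_index(L.argmax(), L.shape)
print("grid-search MLE:", grid[i], grid[j])
```

The coordinate-wise scheme described above simply replaces the full argmax with alternating one-dimensional maximizations over rows and columns of L until the maximizing cell stops moving.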
We also ran some simulations with a smaller T, and with p and q away from 0.5. The following histograms show the estimates of p and q for these cases.

Figure 4-4: Estimates of p (p = 0.25, q = 0.63, T = 20); mean 0.2666, standard deviation 0.1022.

Figure 4-5: Estimates of q (p = 0.25, q = 0.63, T = 20); mean 0.6399, standard deviation 0.0756.

These ML estimates appear to be unbiased and approximately normally distributed, with decreasing variance as T increases, as one might expect. In addition to these encouraging simulations, one can use the asymptotic theory for MLEs in this (dependent data) context to set confidence intervals around the estimated values [3].

4.3 Reparameterization of Network Influence

In this section, we go a step further and ask if we can also estimate the network influence matrix D. In an attempt to allow for estimation of this matrix without having an unmanageable number of free parameters, we look for ways of specifying D in terms of a reduced number of parameters. One such way, motivated by various models for particle interaction, is to assume that we know something about the spatial location of each of the nodes, and that influence decays exponentially with the distance between nodes. We assume a decay parameter 0 < λ < ∞ and let

d_{ij} = \frac{e^{-\lambda r_{ij}}}{\sum_{k=1}^{n} e^{-\lambda r_{ik}}},

where r_ij is the distance between nodes i and j. The denominator here is just a normalizing factor to ensure that D remains row-stochastic. Thus, we have added only one new unknown parameter λ, under the plausible assumption that we know the matrix of r values. A large λ would mean that the nodes evolve primarily due to self-influence (since r_ii = 0), whereas λ going to zero results in d_ij = 1/n, that is, each node being influenced equally by all the nodes in the graph, including itself.

We first examine how estimation of λ performs for a fixed p and q. We need a matrix R of distances, so we arbitrarily choose the following matrix for a 5-node graph:

R = \begin{bmatrix} 0 & 1 & 2 & 3 & 4 \\ 1 & 0 & 1 & 2 & 3 \\ 2 & 1 & 0 & 1 & 2 \\ 3 & 2 & 1 & 0 & 1 \\ 4 & 3 & 2 & 1 & 0 \end{bmatrix}

(i.e. the nodes equally spaced on a line). To find the MLEs now, although we could search the likelihood over the entire range 0 < λ < ∞, we note that choosing anything larger than small single digits for λ results in essentially entirely self-influencing behavior for this choice of R. Thus, we choose a small λ to generate the data, and search over a truncated range. For the following example, we choose p = 0.1, q = 0.4, T = 50, λ = 1, and we search in the range 0 < λ ≤ 2 in steps of 0.01.

Figure 4-6: Plot of L(λ) (p = 0.1, q = 0.4, λ = 1, T = 50); the maximum occurs at λ = 0.9.

Again, we see an encouraging single sharp peak in the likelihood. The results for 100 runs of this example follow.

Figure 4-7: Estimates of λ (p = 0.1, q = 0.4, λ = 1, T = 50).

The results appear to be consistent with what is to be expected.
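A quick sketch of this parameterization follows (the function name is ours); it builds D directly from R and λ, and lets the limiting behaviors just described be checked numerically.

```python
import numpy as np

def influence_matrix(R, lam):
    """Row-stochastic D with d_ij proportional to exp(-lam * r_ij)."""
    W = np.exp(-lam * R)
    return W / W.sum(axis=1, keepdims=True)

# Five nodes equally spaced on a line, as in the R matrix above.
R = np.abs(np.subtract.outer(np.arange(5), np.arange(5)))
print(influence_matrix(R, lam=1.0))   # mostly self-influence plus near neighbors
print(influence_matrix(R, lam=0.0))   # every entry 1/5: uniform influence
```

Substituting influence_matrix(R, lam) for the fixed D in the likelihood sketch of Section 4.2 gives L as a function of λ (or of p, q, and λ jointly), which can then be searched over the truncated grid described above.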
While estimation of λ alone is a useful check, our goal is to estimate p, q, and λ jointly. As it is no longer feasible to graph the likelihood as a function of all three parameters, we might find it useful to examine two parameters at a time, holding the third fixed at its true value. The following graphs are cross-sections of the likelihood for the previous example (i.e. generated with p = 0.1, q = 0.4, λ = 1, and the above R matrix, but with T = 20 in the interest of speed).

Figure 4-8: Plot of L(p, q, λ), pq cross-section (p = 0.1, q = 0.4, λ = 1, T = 20).

Figure 4-9: Plot of L(p, q, λ), λp cross-section (p = 0.1, q = 0.4, λ = 1, T = 20).

Figure 4-10: Plot of L(p, q, λ), λq cross-section (p = 0.1, q = 0.4, λ = 1, T = 20).

What we see in each case is that the likelihood still has a single sharp peak in terms of these parameters. The following histograms show the estimates obtained when p, q, and λ are estimated jointly.

Figure 4-11: Estimates of p (p = 0.1, q = 0.4, λ = 1, T = 100).

Figure 4-12: Estimates of q (p = 0.1, q = 0.4, λ = 1, T = 100).

Figure 4-13: Estimates of λ (p = 0.1, q = 0.4, λ = 1, T = 100).

Again, the method seems to be producing reasonable results. While estimating the three parameters seems to work well with the example used, changes in the R matrix and in the values of p, q, and λ used to generate the data may all affect the performance of the estimators, and more extensive studies might shed further light on this.

Chapter 5
Conclusion and Future Work

5.1 Conclusion

In this thesis we have discussed the influence model and generalizations resulting from it: we described the partially observed Markov network, and introduced the generalized influence model. We placed these models in the broader context and terminology of graphical models such as hidden Markov models and (dynamic) Bayesian networks. We have described some strengths of these models, such as convenient representation of network interaction through conditional independence, as well as weaknesses, such as the difficulty of inference in the general case. We have also described several algorithms for exact and approximate inference, and how they attempt to exploit model structure for efficiency. Finally, we have demonstrated that maximum likelihood estimation of parameters is possible for the influence model in certain cases.

5.2 Future Work

One obvious direction for future work would be to improve upon parameter estimation for the influence model (and the POMNet) in the general case. As discussed in Chapter 4, the number of parameters to be estimated in the general model is very large, and the likelihood can be unwieldy to manipulate. Nevertheless, this is an important avenue to explore, as the ability to estimate parameters is crucial for applying this model to problems of interest. Some optimization techniques such as gradient ascent have been tried in this context ([28]), and perhaps others such as simulated or deterministic annealing ([25]) will also prove to be useful.

The fact that efficient inference algorithms are tailored to the specifics of a graphical model also motivates the search for particular applications in which the IM/POMNet is especially powerful. Ideally, these would be applications which inherently have a time-evolving, network-structured form, and for which the modelling of system dynamics is important, as then this type of specialized graphical model would be well justified. In addition to the above, the following topics might also prove to be promising.

5.2.1 Further Analysis of the Generalized Influence Model

In Section 2.5, we introduced the generalized influence model, and discussed its update process and the propagation of its expected state through a recursion. However, an analysis of its system dynamics, through the generalized eigenvalues and vectors of the pair (J, H), the invertibility of J, and other means, remains to be pursued in greater detail (a toy version of such an eigenvalue computation is sketched below).
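As a toy version of that computation (the matrices here are made up, and the implicit form J x[k+1] = H x[k] is an assumption about how the pair (J, H) enters the expectation recursion of Section 2.5):

```python
import numpy as np
from scipy.linalg import eig

# Hypothetical expectation recursion J x[k+1] = H x[k]; its modes are the
# generalized eigenvalues lam satisfying H v = lam J v (equivalently the
# eigenvalues of inv(J) @ H whenever J is invertible).
J = np.array([[2.0, 0.0],
              [0.0, 1.0]])
H = np.array([[1.0, 0.5],
              [0.2, 0.8]])
lam, V = eig(H, J)              # solves the generalized problem H v = lam J v
print(np.sort(np.abs(lam)))     # all |lam| < 1 here, so the expectations decay
```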
Further, identifying to what degree DBNs such as those found in [16] and [20] might benefit from a linearization of this form, and when such an approximation is useful, remains an interesting topic.

5.2.2 Conditioning on a "Double Layer"

This idea deals with partitioning a POMNet graph for inference. It stems from the observation that if all paths between two sites in the POMNet network influence graph pass through two consecutive observed sites, all paths between the time-extended versions of those sites are blocked in the trellis diagram. Thus this "double layer" of evidence always induces d-separation in the trellis, and therefore partitions the POMNet (note that a double layer of evidence is always sufficient for partitioning, but not necessary unless every site has arcs to and from all its neighbors, in a "bidirectionally connected" graph).

Another way to see this separation is to convert the trellis graph to its undirected form by moralization, as discussed in Chapter 3. It immediately becomes clear that a double layer in the directed graph corresponds to a separating single layer in the moralized graph, which is precisely what is required for d-separation.

While this is interesting on its own, it may actually yield a powerful tool for partitioning a general DBN network graph: by conditioning on the proper set of sites (i.e. considering them observed), we can artificially create double layers of evidence and break up the graph in order to perform partitioned inference. We would then marginalize over the sites we conditioned on, in order to obtain the proper posterior probabilities. A divide-and-conquer algorithm of this sort may potentially yield a significant reduction in computation time for certain graph structures. However, we should note that we must condition on the appropriate sites at all time steps of interest in order to achieve this partitioning, so the marginalization must be done over an entire sequence of states. If the marginalization can be done in an efficient recursive manner, then this technique holds promise for improving the efficiency of inference.

One idea for creating observed nodes in this context is through the use of a sequential Monte Carlo sampling technique called Rao-Blackwellised particle filtering (RBPF) (see [20]). Particle filtering is a method for tracking and approximating the values of nodes that we wish to condition on, by representing the evolving probability density at those nodes in a reduced form. More specifically, we choose to represent the posterior density by a sum of weights at a finite number of support points so as to best approximate the true posterior. In addition, we can reduce the size of the state space by marginalizing out some of the variables analytically (called Rao-Blackwellisation in this context), which improves the efficiency of particle filtering. We are currently exploring the effectiveness of RBPF in the context of a divide-and-conquer partitioned inference algorithm for POMNet-like graphs.
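As background for this idea, the following is a minimal bootstrap particle filter for a generic scalar state-space model; the transition sampler and observation likelihood are placeholders to be supplied by the model, and Rao-Blackwellisation would additionally marginalize part of the state analytically, which this sketch does not attempt.

```python
import numpy as np

def particle_filter(obs, sample_transition, obs_lik, init, n_particles, rng):
    """Bootstrap particle filter: approximate P(x[k] | y[1..k]) by weighted particles.

    sample_transition(x, rng) -> one draw of the next hidden state
    obs_lik(y, x)             -> observation likelihood P(y | x)
    init(rng)                 -> one draw from the initial state distribution
    """
    particles = np.array([init(rng) for _ in range(n_particles)])
    means = []
    for y in obs:
        # Propagate each particle through the transition model (the "bootstrap" proposal).
        particles = np.array([sample_transition(x, rng) for x in particles])
        # Weight particles by how well they explain the observation, then normalize.
        w = np.array([obs_lik(y, x) for x in particles])
        w /= w.sum()
        means.append(float(np.sum(w * particles)))
        # Resample according to the weights to avoid weight degeneracy.
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    return means

# Toy usage: a Gaussian random walk observed in unit-variance Gaussian noise.
rng = np.random.default_rng(0)
step = lambda x, rng: x + rng.normal(0.0, 0.5)
lik = lambda y, x: np.exp(-0.5 * (y - x) ** 2)
y = np.cumsum(rng.normal(0.0, 0.5, size=30))    # a stand-in observation sequence
print(particle_filter(y, step, lik, lambda rng: rng.normal(0.0, 1.0), 500, rng))
```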
Bibliography

[1] Chalee Asavathiratham. The Influence Model: A Tractable Representation for the Dynamics of Networked Markov Chains. PhD dissertation, MIT, Department of Electrical Engineering and Computer Science, October 2000.

[2] Chalee Asavathiratham, Sandip Roy, Bernard Lesieutre, and George Verghese. The influence model. IEEE Control Systems, 21(6):52-64, December 2001.

[3] Ishwar V. Basawa and B.L.S. Prakasa Rao. Statistical Inference for Stochastic Processes. Probability and Mathematical Statistics. Academic Press, London, 1980.

[4] Sumit Basu, Tanzeem Choudhury, Brian Clarkson, and Alex Pentland. Learning human interactions with the influence model. Technical Report 539, MIT Media Laboratory Vision and Modeling, June 2001.

[5] Brendan Burns and Clayton T. Morrison. Temporal abstraction in Bayesian networks. In AAAI Spring Symposium 2003, Palo Alto, March 2003.

[6] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393-405, 1990.

[7] Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegelhalter. Probabilistic Networks and Expert Systems. Statistics for Engineering and Information Science. Springer-Verlag, New York, 1999.

[8] Murat Deviren. Structural learning of dynamic Bayesian networks in speech recognition. Technical report, Speech Group, LORIA, September 2001.

[9] Marek J. Druzdzel and Henri J. Suermondt. Relevance in probabilistic models: "Backyards" in a "small world". In Working Notes of the AAAI 1994 Fall Symposium Series: Relevance, pages 60-63, New Orleans, November 1994.

[10] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley and Sons, New York, 2nd edition, 2001.

[11] Brendan J. Frey. Graphical Models for Machine Learning and Digital Communication. Adaptive Computation and Machine Learning. MIT Press, Cambridge, Massachusetts, 1998.

[12] Zoubin Ghahramani. Learning dynamic Bayesian networks. Lecture Notes in Computer Science, 1387:168-197, 1998.

[13] Finn V. Jensen and Frank Jensen. Optimal junction trees. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-94), pages 360-366, 1994.

[14] Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

[15] Michael I. Jordan and Terrence J. Sejnowski, editors. Graphical Models: Foundations of Neural Computation. Computational Neuroscience. MIT Press, Cambridge, Massachusetts, 2001.

[16] Uffe Kjaerulff. dHugin: A computational system for dynamic time-sliced Bayesian networks. International Journal of Forecasting, Special Issue on Probability Forecasting, 11:89-111, 1995.

[17] Steffen L. Lauritzen. Graphical Models. Oxford University Press, Oxford, 1996.

[18] Yan Lin and Marek J. Druzdzel. Computational advantages of relevance reasoning in Bayesian belief networks. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-97), pages 342-350, August 1997.

[19] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus. John Wiley and Sons, New York, 1988.

[20] Kevin P. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD dissertation, UC Berkeley, Computer Science Division, July 2002.

[21] Richard E. Neapolitan. Probabilistic Reasoning in Expert Systems: Theory and Algorithms. John Wiley and Sons, New York, 1990.

[22] Nils J. Nilsson. Artificial Intelligence: A New Synthesis. Morgan Kaufmann, San Francisco, 1998.

[23] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, 1988.

[24] Lawrence Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, February 1989.

[25] Kenneth Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, November 1998.
[26] Sandip Roy. Moment-Linear Stochastic Systems and their Applications. PhD dissertation, MIT, Department of Electrical Engineering and Computer Science, June 2003.

[27] Ross Shachter. Bayes-Ball: The rational pastime (for determining irrelevance and requisite information in belief networks and influence diagrams). In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), pages 480-487, 1998.

[28] YongHong Tian, Zheng Mei, TieJun Huang, and Wen Gao. Incremental learning for interaction dynamics with the influence model. 2003.

[29] Carlos Alberto Gómez Uribe. Estimation on a partially observed influence model. Master's project, MIT, Department of Electrical Engineering and Computer Science, June 2003.

[30] Yair Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1-41, 2000.