Summary of “Modeling T-cell activation using gene expression profiling and state-space models” by Rangel, et al.

Keala Chan
7/9/04
SoCalBSI
California State University at Los Angeles

INTRODUCTION

This paper concerns microarray data analysis, in particular the use of microarray data to model the gene pathways involved in T-cell activation. By fitting state-space models to microarray data, the authors show that this class of dynamic Bayesian networks effectively represents a well-established model of T-cell activation.

BACKGROUND

T-cell Activation

The activation of T-lymphocytes is the central event in the generation of an immune response. The T-cell is activated by the interaction between the T-cell receptor (TCR) complex and an antigenic peptide on the surface of an antigen-presenting cell (1362). This event triggers a network of signaling molecules that initiate a number of gene transcription events in the nucleus. The paper reverse-engineers this well-known model of T-cell activation.

State-Space Models

State-space models (SSM) are a class of Bayesian networks in which observed measurements depend on a Markov process of hidden state variables. State-space models are therefore believed to be well suited to modeling complicated gene-gene interactions, since the hidden state variables can represent quantities not explicitly measured in an experiment, such as genes not included in the microarray, levels of regulatory proteins, and effects of protein degradation. The main goal when using the state-space model for gene expression modeling is to determine the matrices most likely to have generated the sequence of observation vectors. These transition matrices determine the gene-gene interaction network, which can be depicted graphically for clarity. Note that finding the transition matrices most likely to have generated a sequence of observations is also a common application of Hidden Markov Models.
A linear state-space model represents a sequence of p-dimensional observation vectors $\{y_1, \dots, y_T\}$, assuming that each $y_t$ was generated from a K-dimensional hidden state variable $x_t$, where $\{x_1, \dots, x_T\}$ is a first-order Markov process (a process in which the next state of the system depends only on the previous state). Thus, the model is described by

$$x_{t+1} = A x_t + w_t \tag{1}$$
$$y_t = C x_t + v_t \tag{2}$$

where $A$ represents transitions between the hidden states, $C$ is the state-to-observation matrix, and $w_t$ and $v_t$ are sequences of uncorrelated white noise (1362). Further, the observations can be divided into input variables and response variables. With distinct inputs to the state equation (1) and the observation equation (2), the SSM becomes

$$x_{t+1} = A x_t + B h_t + w_t \tag{3}$$
$$y_t = C x_t + D u_t + v_t \tag{4}$$

where $h_t$ and $u_t$ are inputs to the state and observation vectors, $B$ is the input-to-state matrix, and $D$ is the input-to-observation matrix.

For the gene expression model in particular, the (suitably normalized) fluorescent intensities measured for each of p genes at time step t are kept in the vector $g_t$, while the hidden variables remain in the vector $x_t$. Gene expression is now more specifically modeled by

$$x_{t+1} = A x_t + B g_t + w_t \tag{5}$$
$$g_t = C x_t + D g_{t-1} + v_t \tag{6}$$

Thus, the hidden state $x_{t+1}$ depends on the previous state $x_t$ (according to the Markov process) as well as on the previous observed gene expression $g_t$, and the observation $g_t$ at time t depends on the hidden variables $x_t$ as well as on the previous observed gene expression $g_{t-1}$. It follows that the matrix $D$ captures the direct influence of gene expression at one time point on gene expression at the next, the matrix $B$ captures the influence of gene expression on the next hidden state, and the matrix $C$ captures the influence of the hidden variables on gene expression at each time point.
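The gene expression model in equations (5) and (6) can be made concrete with a small simulation. The following is a minimal sketch in plain Python, not the authors' implementation: the function names (`simulate`, `one_step_interaction`) and the toy matrices for one hidden state and two genes are assumptions chosen for illustration, and the noise terms are drawn as independent Gaussians. It also forms the matrix CB + D, the combined one-step gene-to-gene term discussed next.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*N))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in M]

def vec_add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def simulate(A, B, C, D, x0, g0, steps, noise_sd=0.0, seed=0):
    """Iterate equations (5) and (6); return [g_0, g_1, ..., g_steps]."""
    rng = random.Random(seed)

    def noise(n):
        # Independent Gaussian noise (an assumption of this sketch).
        return [rng.gauss(0.0, noise_sd) for _ in range(n)]

    x, g = list(x0), list(g0)
    observations = [g]
    for _ in range(steps):
        # (5): x_{t+1} = A x_t + B g_t + w_t
        x_next = vec_add(matvec(A, x), matvec(B, g), noise(len(x)))
        # (6) shifted by one step: g_{t+1} = C x_{t+1} + D g_t + v_{t+1}
        g_next = vec_add(matvec(C, x_next), matvec(D, g), noise(len(g)))
        x, g = x_next, g_next
        observations.append(g)
    return observations

def one_step_interaction(B, C, D):
    """Return CB + D as a list of rows."""
    CB = matmul(C, B)
    return [[cb + d for cb, d in zip(cb_row, d_row)]
            for cb_row, d_row in zip(CB, D)]

# Toy parameters (made up for illustration): K = 1 hidden state, p = 2 genes.
A = [[0.5]]
B = [[0.2, 0.0]]
C = [[1.0], [0.0]]
D = [[0.0, 0.3], [0.1, 0.0]]

if __name__ == "__main__":
    for t, g in enumerate(simulate(A, B, C, D, x0=[0.0], g0=[1.0, 0.0], steps=3)):
        print(t, g)
    print("CB + D =", one_step_interaction(B, C, D))
```

With `noise_sd=0.0` the trajectory is deterministic, which makes it easy to check by hand that, when the initial hidden state is zero, the first step satisfies g_1 = (CB + D) g_0.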
Finally, it is the matrix CB+D that captures both the direct gene-gene interactions and the gene-gene interactions that pass through the hidden states, in other words, all the important information related to gene-gene interaction over one time step (1363). To see this, substitute (5) into (6) written for time t+1: $g_{t+1} = CA x_t + (CB + D) g_t + C w_t + v_{t+1}$, so CB+D is the coefficient of $g_t$ in the expression for $g_{t+1}$.

METHOD

The microarrays were made by spotting PCR products on glass slides. The genes chosen for the microarrays had all been determined to be modulated in response to T-cell activation. Two replicated experiments were hybridized on two sets of arrays. Genes whose expression values at all time points were below a specified value were discarded, as were genes that displayed very poor reproducibility between the two experiments.

The paper specifies the algorithms used to determine the optimal number of hidden states and to estimate the matrices A, B, C, and D. The latter algorithm is described in a previous paper by the authors. Because both algorithms belong to the field of statistical modeling rather than bioinformatics, they are not described here.

RESULTS

After data pre-processing, 39 of the 58 genes remained with significant interactions. The structural parameters A, B, C, and D and their corresponding confidence intervals are estimated using the “EM algorithm” described in the authors’ previous paper. Then, a connectivity matrix for CB+D is constructed by assigning zero to each element whose confidence interval contains zero, and one to every other element. Finally, a directed graph (Figure 1) is drawn based on this connectivity matrix. In the graph, an arrow is drawn from a gene expression variable at a given time t to each gene variable whose expression it influences at the next time point t+1. Note that the non-zero entries in CB+D can be positive or negative, indicating up- or down-regulation; in Figure 1, up- and down-regulation are represented by solid and dotted arrows respectively.
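The thresholding step described above can be sketched as follows. This is an illustrative sketch only: the function names `connectivity` and `edges`, the gene labels, and all numbers are hypothetical, and the confidence bounds are taken as given rather than estimated. An entry in row i, column j of CB+D is read as the influence of gene j at time t on gene i at time t+1, so a non-zero entry becomes an arrow from j to i.

```python
def connectivity(estimates, lower, upper):
    """Threshold a matrix of estimates by its confidence intervals.

    Returns (connect, sign): connect[i][j] is 0 if the interval
    [lower[i][j], upper[i][j]] contains zero and 1 otherwise;
    sign[i][j] is +1 or -1 for retained entries and 0 for dropped ones.
    """
    connect, sign = [], []
    for est_row, lo_row, hi_row in zip(estimates, lower, upper):
        c_row, s_row = [], []
        for est, lo, hi in zip(est_row, lo_row, hi_row):
            significant = not (lo <= 0.0 <= hi)
            c_row.append(1 if significant else 0)
            if not significant:
                s_row.append(0)
            else:
                s_row.append(1 if est > 0 else -1)
        connect.append(c_row)
        sign.append(s_row)
    return connect, sign

def edges(connect, sign, names):
    """List directed edges as (source at t, target at t+1, 'up' or 'down')."""
    result = []
    for i, row in enumerate(connect):
        for j, flag in enumerate(row):
            if flag:
                direction = "up" if sign[i][j] > 0 else "down"
                result.append((names[j], names[i], direction))
    return result

# Hypothetical two-gene example; the numbers are made up for illustration.
estimates = [[0.0, -0.4], [0.5, 0.1]]
lower = [[-0.1, -0.6], [0.2, -0.2]]
upper = [[0.1, -0.2], [0.8, 0.4]]

if __name__ == "__main__":
    connect, sign = connectivity(estimates, lower, upper)
    for source, target, direction in edges(connect, sign, ["gene_a", "gene_b"]):
        print(source, "->", target, direction)
```

In a drawing of the resulting graph, the "up" edges would correspond to solid arrows and the "down" edges to dotted arrows.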
The authors discuss many functional groupings evident in the graphical representation that correspond to known functions in T-cell activation. For instance, FYB (gene 1) is an important adaptor molecule, and cells defective in this molecule have severely impaired proliferation and migratory responses (Burack et al., 2002); the model links FYB to three interleukin genes: IL2 (gene 7), IL4 (gene 5), and IL3 (gene 2). The cytokines encoded by these three interleukin genes are well known to be proliferation signals in T-cells. In another example, the model shows the gene SMN1 (gene 19) negatively influencing the expression of JunB (gene 13), a pro-apoptotic gene (Weitzman, 2001). This fits with the experimental finding that SMN1 inhibits the onset of apoptosis. The graph suggests many more specific connections that are well supported by the literature. In addition, labeling the genes by functional categories yields some interesting groupings in the diagram. For example, FYB (gene 1) and five of its connected genes are directly related to the inflammation response.

CONCLUSIONS

Reverse engineering of T-cell activation pathways using state-space models confirms many known interactions, making the SSM a powerful tool for predicting gene pathways. Moreover, interactions suggested by the model that are not supported in the literature represent novel hypotheses (1370) and thus present opportunities for new discoveries through experimental confirmation. The authors suggest improvements to the modeling procedure, such as more replicates in the data set and additional time points. They also note that the hidden variables are in general not identifiable, in that a one-to-one correspondence between hidden variables and specific genes does not exist. Instead, hidden variables likely represent a combination of complex events. However, it is precisely the inclusion of hidden variables in the model that makes the SSM a realistic biological model.
The paper adequately shows the accuracy of the SSM in modeling the T-cell pathway network; the litany of matches with the literature is very convincing as a feat of reverse engineering. It is not specified how such a model would best be used, however, other than to suggest hypotheses for immediate clinical research, and perhaps this is exciting enough. But because the paper does not explain or prove the accuracy of the SSM in modeling a biological system in general, the SSM cannot be used independently of experimental evidence.

Figure 1. Directed graph of gene-gene interactions drawn from the CB+D connectivity matrix; solid arrows indicate up-regulation and dotted arrows down-regulation.