Variable selection in regression mixture modeling for the discovery of gene regulatory networks

Mayetri Gupta and Joseph G. Ibrahim*

November 6, 2006

Abstract

The profusion of genomic data through genome sequencing and gene expression microarray technology has facilitated statistical research in determining gene interactions regulating a biological process. Current methods generally consist of a two-stage procedure: clustering gene expression measurements, and searching for regulatory "switches", typically short, conserved sequence patterns (motifs), in the DNA sequence adjacent to the genes. This process often leads to misleading conclusions, as incorrect cluster selection may cause important regulatory motifs to be missed or many false discoveries to be made. Treating cluster memberships as known, rather than estimated, introduces bias into the analysis and prevents uncertainty about cluster parameters from being properly accounted for. Further, the available data are under-utilized, as the sequence information is ignored for purposes of expression clustering and vice versa. We propose a way to address these issues by combining gene clustering and motif discovery in a unified framework: a mixture of hierarchical regression models, with unknown components representing the latent gene clusters, and genomic sequence features linked to the resultant gene expression through a multivariate hierarchical regression. We demonstrate a Monte Carlo method for simultaneous variable selection (for motifs) and clustering (for genes). The selection of the number of components in the mixture is addressed by computing the analytically intractable Bayes factor through a novel multi-stage mixture importance sampling approach. This methodology is used to analyze a yeast cell cycle dataset to determine an optimal set of motifs that discriminates between groups of genes and simultaneously finds the most significant gene clusters.

KEY WORDS: Transcription regulation; motif discovery; hierarchical model; evolutionary Monte Carlo; importance sampling; Bayesian model selection.

* Mayetri Gupta (email: mgupta@unc.edu) is Assistant Professor, and Joseph G. Ibrahim (email: ibrahim@bios.unc.edu) is Alumni Distinguished Professor, at the Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599, U.S.A. The authors thank the editor, associate editor, and two referees for their helpful comments and suggestions that substantially improved this article. This research was partially supported by National Institutes of Health grants GM 70335 and CA 74015 and Environmental Protection Agency grant RD-83272001.

1 Introduction

The availability of diverse types of genomic data, such as DNA sequence, gene expression microarray, and proteomic data, has led to a rapid growth of statistical research in the effort to decipher the workings of biological processes, which are primarily regulated by interactions of genes. However, until recently, the analyses of sequence and gene expression data have been considered two separate problems, in spite of their intrinsic biological relationship. The behavior of large numbers of genes (typically thousands in a single experiment) in an organism is frequently inferred through analyzing mRNA expression from microarrays. Groups of genes involved in a particular biological process are typically regulated by one or more transcription factors (TFs) that bind to transcription factor binding sites in the upstream sequence adjacent to genes (promoters).
TF binding sites corresponding to a TF often show a similar conserved pattern called a motif. A motif pattern of length w is often represented through a 4 × w matrix of probabilities, called a position-specific weight matrix (PSWM). Each column of the PSWM gives the relative probabilities of observing any of the 4 letters A, C, G, or T in that position of a binding site. A common strategy for understanding gene regulation is to look for conserved motifs upstream of genes that are co-expressed under a given condition. After genes with similar expression patterns across experimental conditions are grouped together via clustering algorithms, a motif discovery algorithm is often used to search sequences upstream of genes within each cluster. Several computational algorithms have been developed for motif discovery from upstream sequences of co-regulated gene clusters (Lawrence and Reilly, 1990; Bailey and Elkan, 1994; Liu et al., 1995, 2002; Thompson et al., 2004). The co-regulated gene cluster-based motif discovery approach has its pitfalls, since there may be genes in a cluster lacking a common motif, and many gene promoters containing a similar motif may not show any experimental response, leading to inaccurate predictions. If the clustering is inaccurate, it becomes difficult to discover the correct TF binding site patterns. Additionally, by assuming the cluster identities to be known and fixed, instead of estimated, bias is introduced into the statistical analysis. One way to overcome such drawbacks is to combine sequence and expression information in a coherent model to infer regulatory networks; however, developing such an approach presents significant statistical challenges. In this article, we present an approach that combines information from expression measurements and sequence through a unified mixture model, and develop a methodology to simultaneously cluster genes and select the most likely TF binding sites. In one of the first approaches, Holmes and Bruno (2000) used an iterative procedure to cluster genes using a multivariate normal mixture model and find motifs in co-regulated clusters. The sequence and expression parts in this model were considered to be independent of each other. Linear model-based approaches to link expression values with motif occurrence were introduced by Bussemaker et al. (2001), the underlying assumption being that the presence of a motif site contributes additively to the expression level of the gene. To avoid the arbitrariness in deciding whether a given segment is a binding site, Conlon et al. (2003) proposed a multiple linear regression relationship between the logarithm of the differential expression values and a sequence motif "score". Tadesse et al. (2004) present a Bayesian version of this procedure with significant regulatory motifs being chosen through Bayesian variable selection. However, none of these latter methods address the simultaneous determination of co-regulated gene clusters (assumed to be fixed in advance) and the motifs involved in their regulation. Barash and Friedman (2002) use sequence and expression data for a somewhat different goal: to determine functional gene clusters rather than perform de novo motif discovery. Starting with known motif lists, they estimate a Bayesian network-based clustering of genes using conditional models to link sequence with gene expression, through an approximate EM-like procedure.
However, parametric complexity, and a lack of standard model validation approaches (due to incomplete specification of a joint model), hamper the effectiveness of this approach for moderately large datasets.

From the methodological standpoint, the problems of simultaneous clustering and variable selection in the Bayesian context have been considered in the multivariate normal set-up (Liu et al., 2003; Tadesse et al., 2005). However, to the best of our knowledge, there has been limited study of such a scenario in the mixed effects model or the linear regression framework, which poses some unique problems (Hennig, 1999). The variable selection problem with a large number of predictors, on the other hand, is one under considerable current study (George and McCulloch, 1993; Liang and Wong, 2000). In this paper, we propose a mixture of hierarchical regression models to simultaneously address the clustering of co-regulated genes and the determination of a set of transcription factors regulating a gene cluster. A hierarchical prior framework is formulated for modeling intra-cluster gene correlations. We develop an efficient Monte Carlo algorithm for iterative clustering of genes and selection of significant motifs. This approach succeeds in linking regulatory motifs to the resulting gene expression without a huge parameter estimation cost. Additionally, it can uncover relationships that cannot be discovered using a single linear model, for example the same motif acting as a positive and negative regulator on separate groups of genes under the same condition. The outline of the paper is as follows. In Section 2, we describe the motivating yeast data set that led to this application. Next, we introduce the unified hierarchical mixture model linking gene expression measurements with promoter sequence, and indicate the biological interpretation of such a model. In Section 4, we describe a Markov chain Monte Carlo (MCMC) procedure for simultaneous clustering of genes and selection of motifs (covariates). In Section 5, we present computational tools for calculating Bayesian model selection criteria in the regression mixture framework, and in Section 6 we apply our methodology to the yeast data and evaluate the performance of the method through a number of simulation studies.

2 Yeast cell-cycle data set

The motivating data set for this application is from a yeast cell-cycle microarray experiment (Spellman et al., 1998). This data set has repeatedly been used to examine aspects of cell cycle gene regulation, e.g. Barash and Friedman (2002); Conlon et al. (2003). cDNA microarrays were used to study samples from yeast cultures synchronized by three independent methods: (a) alpha-factor arrest, (b) elutriation, and (c) arrest of a cdc15 temperature-sensitive mutant. For purposes of this analysis, we concentrate on the first experiment, which has observations over two full cell-cycles and also the largest number of complete measurements. Yeast strain DBY8724 was grown in a glucose solution, an alpha-factor was added, and 25-ml RNA samples were taken every 7 minutes for 120 minutes, after which the alpha-factor was removed by a centrifugation method. The samples were allowed to hybridize to cDNA microarrays for 4-6 hours, after which the expression levels of nearly 6000 genes over 18 time points (spanning two hours) were recorded. The value for each gene was recorded as the logarithm (base 2) of the ratio of expression of that gene in the sample to a baseline control measurement.
After scanning the microarray with a laser microscope, the data set was preprocessed by computing the fluorescence ratios through a local background correction, estimated by the intensities of the weakest 12% of the pixels in each box. Data for each gene were normalized so that the average log-ratio over the course of the experiments was equal to zero. The first goal of Spellman et al. (1998) was to identify the genes regulated in a cell cycle-dependent manner. A Fourier transform was applied to the data series for each gene, and the resultant expression profile of each gene was then correlated to five different profiles representing genes known to be expressed in the five phases of the cell-cycle: G1, S, G2, M, and M/G1, using the standard Pearson correlation function. The profiles for known gene classes were identified by averaging the log-ratio data for each of the genes known to peak in each of the five phases. Genes were ranked by the correlation scores, and all genes whose scores exceeded the threshold score (91st percentile of known cell cycle genes) were classified as being cell-cycle regulated. This led to a total of 800 genes, which included 496 genes not previously identified to be involved in cell cycle regulation. For our analysis, we downloaded the complete microarray data set from the Stanford Microarray Database (Ball et al., 2005). Among the measurements for the 800 cell-cycle genes, a number of genes were discarded due to completely missing or suspicious-looking data, resulting in a final set of 612 genes measured over 18 time points (Figure 1). In addition, we extracted 600 bp of the upstream promoter corresponding to each gene from the Saccharomyces cerevisiae Promoter Database (SCPD; Zhu and Zhang (1999)) for the purpose of motif discovery. Among our questions of interest are: can we improve the prediction of TF binding sites by using a joint model approach instead of a two-step procedure, or one assuming all genes belong to a single cluster? In addition, we are interested in determining whether the sequence information can provide an improved estimate of which genes are involved in particular functions along the pathway of cell cycle regulation, in comparison to considering the gene expression information alone.

[Figure 1 about here]

3 Regression mixture model for sequence and expression

Now we introduce a regression mixture-based joint model linking the sequence and expression data. First, we summarize the notation used. We denote the gene expression measurements for $G$ genes at $T$ time points by $Y = ((Y_{ig}))$, $(i = 1, \ldots, T;\ g = 1, \ldots, G)$. Corresponding to each gene $g$, let the upstream sequence (of length $L_g$) be $x_g = \{x_1, \ldots, x_{L_g}\}$, and let the set of $D$ potential motif candidates be characterized by $D$ position-specific weight matrices (PSWMs) $\{\Theta_1, \ldots, \Theta_D\}$. Assume that we have a score function that "scores" every upstream region with respect to every weight matrix, resulting in a $G \times D$ matrix $S = ((S_{gd}))$. The entry $S_{gd}$ reflects the propensity of sequence $g$ to be bound by the transcription factor corresponding to motif type $d$. More details on the scoring function are provided in Section 3.3. Now, we introduce a mixture model framework for the gene expression clusters. Let $\pi_k$ $(k = 1, \ldots, K)$ denote the prior probability of belonging to cluster $k$.
Assume that $Y$ is generated from a $K$-component mixture distribution with
$$p(Y_g \mid \text{all parameters}) = \sum_{k=1}^{K} \pi_k\, p_k(Y_g \mid \text{all parameters}), \qquad (1)$$
where $p_k(\cdot)$ denotes the probability density for the $k$th component. The gene expression measurements are linked to the corresponding motif scores through a regression model. Let $u = (u_1, \ldots, u_D)'$ be a binary vector with $u_i = 1(0)$ if motif $i$ is involved (not involved) in regulating expression. Conditional on the unobserved cluster membership $z_g$, the gene expression measurement $Y_g$ is modeled as:
$$Y_g \mid z_g = k, \xi, S, \beta, \sigma_k^2 \sim N_T\!\left(\xi_g + S_{g(u)}'\,\beta_{k(u)}\, 1_T,\ \sigma_k^2 I\right), \qquad (2)$$
where $\xi_g$ is a $T$-dimensional parameter vector representing the variation at the $T$ time points; $\beta_{k(u)}$ is the subset of regression coefficients $\beta_k$ corresponding to TFs indexed by $u$; and $\sigma_k^2$ is the variance of expression measurements corresponding to the $k$th cluster ($1_T$ denotes a $T$-dimensional vector of ones). Throughout the article, $N_p(\cdot, \cdot)$ denotes a $p$-dimensional multivariate normal distribution. The model specification (1) and (2) can be viewed as a mixture of random effects models. Although the following discussion relates to the development of this model and inference under the normality assumptions stated in (2), the framework and methodology can be extended to a variety of distributional assumptions. The mathematical formalism here is mainly to derive insights into the biological relationships between genes and their regulatory motifs, rather than to try to represent the true (unknown) biological picture. However, the regression mixture framework is an intuitively appealing construct in this application, since (i) it reflects the notion that if different groups of genes are differently regulated, then the regression relationships should likewise differ between such "clusters", and (ii) the regression model assumes that upstream sequences of genes having distinctive expression patterns are likely to be regulated by certain TFs, and hence are more likely to contain more (and "stronger") binding sites for these TFs. Estimation of parameters under the model set-up (1) and (2), with simultaneous selection of regressor variables and clusters, is a challenging task in likelihood-based inference. By adopting a Bayesian framework, we ensure that such a model can be fitted, and demonstrate that meaningful biological constraints can be incorporated into the model by a careful choice of the prior distributions. For example, we show in the following section that intra-cluster gene correlations can be incorporated into the hierarchical model without resorting to a huge expansion in parameters, while avoiding estimability problems with small datasets. An implicit assumption made is that dependence between expression measurements over the $T$ time points is due to either cluster or sequence effects.

3.1 Hierarchical prior framework

The prior distributions of the parameters in the gene expression model are specified conditionally on the gene expression cluster, with:
$$\xi_g \mid z_g = k, \mu_k, \sigma_k^2 \sim N_T(\mu_k, \tau_0 \sigma_k^2 I), \quad g = 1, \ldots, G; \qquad \mu_k \sim N_T(m_0, V_0); \qquad 1/\sigma_k^2 \sim \mathrm{Gamma}(w_0/2, S_0/2), \qquad (3)$$
where $V_0 = ((v_{ij}^{(k0)}))$ is the $T \times T$ prior covariance matrix. Possible choices of $V_0$ and the other hyperparameters $\tau_0$, $m_0$, $w_0$ and $S_0$ are discussed in Section 4.1. We also assume that $\pi = (\pi_1, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha)$, where $\alpha = (\alpha_1, \ldots, \alpha_K)$.
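To make the generative structure of (1)-(3) concrete, the following minimal Python sketch simulates data from the model under the stated normality assumptions. The dimensions, hyperparameter values, and the synthetic score matrix are illustrative placeholders only, not quantities used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper's data)
G, T, D, K = 100, 18, 10, 3

# Hypothetical hyperparameters and motif-score matrix S (G x D)
tau0, m0, V0 = 0.1, np.zeros(T), np.eye(T)
S = rng.gamma(shape=2.0, scale=1.0, size=(G, D))

# Cluster-level parameters: weights, means, variances, regression coefficients
pi = rng.dirichlet(np.ones(K))
mu = rng.multivariate_normal(m0, V0, size=K)              # mu_k ~ N_T(m0, V0)
sigma2 = 1.0 / rng.gamma(shape=2.5, scale=1.0, size=K)    # 1/sigma_k^2 ~ Gamma
beta = rng.normal(0.0, 1.0, size=(K, D))                  # beta_k (toy values)
u = np.ones(D, dtype=bool)                                # variable-inclusion indicators

# Generate data gene by gene: z_g, xi_g, then Y_g as in (1)-(3)
z = rng.choice(K, size=G, p=pi)
Y = np.empty((G, T))
for g in range(G):
    k = z[g]
    xi_g = rng.multivariate_normal(mu[k], tau0 * sigma2[k] * np.eye(T))
    mean_g = xi_g + float(S[g, u] @ beta[k, u]) * np.ones(T)   # xi_g + S'_{g(u)} beta_{k(u)} 1_T
    Y[g] = rng.multivariate_normal(mean_g, sigma2[k] * np.eye(T))
```

Each gene draws a cluster label from $\pi$, a gene-level effect $\xi_g$ around the cluster mean, and an expression vector whose mean is shifted by the motif-score regression term.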
The model specification in (3) ensures two desirable properties: (i) genes do not borrow strength across clusters; however, genes within a cluster may borrow strength from each other (through the a priori specified prior covariance matrix $V_0$); (ii) the a priori correlation between measurements on a gene at two time points is the same as the correlation between different genes in the same cluster at those time points. More precisely, it can be shown that
$$\mathrm{Corr}(\xi_{gi}, \xi_{gj} \mid z_g = k, \sigma_k^2) = \frac{v_{ij}^{(k0)}}{\sqrt{v_{ii}^{(k0)} + \tau_0\sigma_k^2}\,\sqrt{v_{jj}^{(k0)} + \tau_0\sigma_k^2}} = \mathrm{Corr}(\xi_{gi}, \xi_{g'j} \mid z_g = k, z_{g'} = k, \sigma_k^2),$$
while
$$\mathrm{Corr}(\xi_{gi}, \xi_{g'i} \mid z_g = k, z_{g'} = k, \sigma_k^2) = \frac{v_{ii}^{(k0)}}{v_{ii}^{(k0)} + \tau_0\sigma_k^2}.$$

3.2 Generalized g-prior for high-dimensional covariates

For the regression coefficients of the sequence model, we assume the standard conjugate prior form $\beta_k \sim N_D(\beta_0, \Sigma_\beta)$. Let $|u| = \sum_{d=1}^{D} 1_{[u_d = 1]}$ denote the number of weight matrices, out of a possible $D$, that are included in the model, i.e. the cardinality of $u$. We assume $u_i \sim \mathrm{Bernoulli}(\eta)$ and specify a conjugate prior, $\mathrm{Beta}(a_1, a_2)$, for $\eta$. Our choice of the prior for $\beta_k$ is motivated by the feasibility of computation of the marginal posterior in the variable selection step. In the regression-like framework, it may be appropriate to use a multivariate generalization of the g-prior (Zellner, 1986). For the linear regression model $Y = X\beta + \epsilon$, $\epsilon \sim N_n(0, \sigma^2 I)$, the g-prior for $\beta$ is $N_p(\beta_0, c\sigma^2 (X'X)^{-1})$, where $c$ is a specified scalar. In the current set-up, since the cluster identities are unknown, it is not possible to get a closed form expression for the g-prior. Let us write $X_* = \mathrm{stack}(X) = [X_1', \ldots, X_G']'$, where $X_i = 1_T S_i'$ and $S_i' = (S_{i1}, \ldots, S_{iD})$. Ignoring the class labels, we can get an approximate expression for the variance of the g-prior based on the whole design matrix, as $c\sigma_k^2 (X_*'X_*)^{-1} = c\sigma_k^2 \left[T \sum_{g=1}^{G} S_g S_g'\right]^{-1}$. This covariance matrix, however, becomes nearly singular for high-dimensional covariates, or covariates which are highly collinear, both of which are common occurrences when we are dealing with large numbers $D$ of motif covariates. Taking such a prior covariance matrix then leads to MCMC convergence problems. Hence, we propose a modified form of the g-prior, motivated by ideas from ridge regression (Hoerl and Kennard, 1970). We take the prior distribution for $\beta_k$ $(k = 1, \ldots, K)$ as
$$\beta_k \sim N_D\!\left(\beta_0,\ c\sigma_k^2 \Sigma_\beta\right), \qquad \text{where } \Sigma_\beta = (X_*'X_* + \lambda I)^{-1} = \left[T\sum_{g=1}^{G} S_g S_g' + \lambda I\right]^{-1}, \qquad (4)$$
and $\lambda$ is a specified scalar similar to the ridge parameter in ridge regression. This form of the prior stabilizes the prior (and posterior estimation) of the regression coefficients, while possessing operating characteristics and properties essentially identical to the usual g-prior when high-dimensionality and collinearity issues are not present. For instance, $(X_*'X_* + \lambda I)$ is necessarily non-singular for $G < D$ (and $\lambda > 0$), whereas the original matrix $X_*'X_*$ is then necessarily singular. The "ridge" parameter $\lambda$ is generally chosen within a range of values between 0 and 1 (Hoerl and Kennard, 1970) that leads to maximum stability of the estimated coefficients. In our simulations and application, we tried using a full Bayesian model with a uniform prior on $\lambda$, as well as setting different fixed values of $\lambda$ between 0 and 1. Empirical evidence suggests that a fixed value of $\lambda$ leads to more stable and less variable estimates, with $\lambda$ values in the range of 0.5 to 1 performing essentially equally well.
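The stabilizing effect of the ridge term in (4) can be checked numerically; the short sketch below compares the rank and conditioning of $T\sum_g S_gS_g'$ with and without $\lambda I$ using an arbitrary synthetic score matrix (all values here are assumptions for illustration, not the paper's data).

```python
import numpy as np

rng = np.random.default_rng(1)
G, T, D, lam = 50, 18, 100, 1.0   # more motifs than genes (D > G), as in the application

S = rng.gamma(2.0, 1.0, size=(G, D))                 # synthetic G x D motif-score matrix
XtX = T * S.T @ S                                    # T * sum_g S_g S_g' (rank <= G < D)

print(np.linalg.matrix_rank(XtX))                    # at most G, so XtX itself is singular
Sigma_beta = np.linalg.inv(XtX + lam * np.eye(D))    # ridge term makes the inverse well defined
print(np.linalg.cond(XtX + lam * np.eye(D)))         # finite condition number after regularization
```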
The bias in estimates introduced by $\lambda$ turns out to be negligible, especially for large values of $c$. By choosing a sufficiently large $c$, we can make the prior suitably non-informative (diffuse) with respect to the likelihood. Also, the choice of $c$ determines the a priori covariance of gene expression measurements in a cluster, since
$$\mathrm{Cov}(Y_g, Y_h \mid z_g = z_h = k, \sigma_k^2) = V_0 + \frac{c\sigma_k^2}{T}\, 1_T S_g' \left(\sum_{i=1}^{G} S_i S_i'\right)^{-1} S_h\, 1_T'.$$

3.3 Score function for motif conservation

The set of $D$ position-specific weight matrices $\Theta = (\Theta_1, \ldots, \Theta_D)$ is used to represent potential TF binding site pattern candidates. Each of the upstream sequences is "scored" with respect to each weight matrix $\Theta_i$ $(i = 1, \ldots, D)$, so that we have a $G \times D$ matrix of gene-sequence scores $S = (S_1, \ldots, S_G)$, where $S_g = (S_{g1}, \ldots, S_{gD})$ is the sequence score vector for the $g$th gene. The score for a sequence $x_g$ and a weight matrix $\Theta_i$ is taken to be the likelihood ratio between the sequence $x_g$ being or not being regulated by the transcription factor corresponding to the weight matrix $\Theta_i$. In Conlon et al. (2003), the score of weight matrix $i$ for sequence $g$ is:
$$S_{gi} = \sum_{j=1}^{L_g - w_i + 1} \frac{P(\{x_j, \ldots, x_{j+w_i-1}\} \mid \Theta_i)}{P(\{x_j, \ldots, x_{j+w_i-1}\} \mid \theta_0)}, \qquad (5)$$
where $\theta_0$ denotes the parameter set characterizing the background distribution (sequence not containing motifs). For an i.i.d. background, $\theta_0$ represents the probabilities of the four nucleotides; under a Markovian assumption, it denotes the transition probabilities. Expression (5) represents the likelihood ratio between the sequence containing a single motif site and a null model containing no motif (up to a constant of proportionality).

4 Clustering with variable selection for fixed K

Under the model specifications discussed in Section 3, we construct a Monte Carlo-based estimation strategy. For now, we assume the total number of clusters is specified as $K$. The full conditional posterior distribution of the parameters is given by:
$$p(\xi, \mu, \sigma^2, \beta, \beta_0 \mid Y, S, z, u, K) \propto \prod_{k=1}^{K}\left\{\left[\prod_{g: z_g = k} P(Y_g;\ \xi_g + S_{g(u)}'\beta_{k(u)}1,\ \sigma_k^2 I)\ P(\xi_g;\ \mu_k, \tau_0\sigma_k^2 I)\right] P(\mu_k;\ m_0, V_0)\ P(\sigma_k^2;\ w_0, S_0)\ P(\beta_{k(u)};\ \beta_{0(u)}, \Sigma_{\beta(u)})\right\},$$
where the subscript $(u)$ corresponds to the subset of variables indexed by the binary vector $u$; thus $\Sigma_{\beta(u)}$ denotes the $|u| \times |u|$ submatrix corresponding to the selected $u$. In this framework, we are interested in making inference about $\pi_k$, $\mu_k$, $\beta_k$ and $\sigma_k^2$ for each cluster, given all the gene expression measurements $Y$ and sequence motif scores $S$. The additional complication is that the cluster membership ($z$) and active TF set ($u$) are both unknown. We develop an MCMC framework to estimate the parameters that alternates between three main steps (technical details are in Appendices A-C): (i) Selection of variables (motifs) given the clustering of genes. We update $[u \mid Y, S, z, K, \text{all parameters}]$. Since we often need to select a subset of motifs from a very large set (e.g. $D \approx 100$), the evolutionary Monte Carlo (EMC) method (Liang and Wong, 2000) is used for this step. To make the EMC step more efficient, we marginalize out most of the parameters from the posterior distribution. (ii) Updating clusters given the selected variables. Here we draw the cluster indicators $[z \mid Y, S, u, K, \text{parameters}]$ from their full conditional posterior distribution. (iii) Updating parameters from their posterior distributions, i.e. $[\mu_k, \sigma_k^2, \beta_k, \pi \mid Y, S, z, u, K]$ (for $1 \leq k \leq K$).
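As an illustration of how the sampler's steps fit together, the sketch below shows one way that step (ii), drawing the cluster indicators from their full conditional, could be coded, with the gene-level effect $\xi_g$ integrated out so that $Y_g \mid z_g = k$ has covariance $(1+\tau_0)\sigma_k^2 I$, matching the form used in Appendix A. The function signature and inputs are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_cluster_indicators(Y, S, u, pi, mu, beta, sigma2, tau0, rng):
    """Draw each z_g from its full conditional, with xi_g integrated out:
    Y_g | z_g = k  ~  N_T(mu_k + S_{g(u)}' beta_{k(u)} 1_T, (1 + tau0) sigma_k^2 I)."""
    G, T = Y.shape
    K = len(pi)
    z = np.empty(G, dtype=int)
    for g in range(G):
        logp = np.empty(K)
        for k in range(K):
            mean_k = mu[k] + float(S[g, u] @ beta[k, u]) * np.ones(T)
            cov_k = (1.0 + tau0) * sigma2[k] * np.eye(T)
            logp[k] = np.log(pi[k]) + multivariate_normal.logpdf(Y[g], mean_k, cov_k)
        p = np.exp(logp - logp.max())          # normalize on the log scale for stability
        z[g] = rng.choice(K, p=p / p.sum())
    return z
```

Steps (i) and (iii) would be slotted around this update in each sweep, with the EMC move of Appendix C handling the variable-selection step.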
4.1 Choice of prior hyperparameters and starting values

The prior hyperparameters consist of the scalars $\tau_0$, $c$, $w_0$, the vectors $m_0$ $(k = 1, \ldots, K)$ and $\beta_0$, and the $T \times T$ matrices $S_0$ and $V_0$ $(k = 1, \ldots, K)$. Proper priors are needed in the model in order to ensure proper posterior distributions; however, by taking sufficiently large tuning parameters, we try to keep the priors non-informative. It is essential to base the prior choice independently of the initial cluster specification. Sensitivity analyses conducted over a range of values of the hyperparameters demonstrated robustness to prior specifications (see Section 6.1). We chose prior settings for the hyperparameters as follows: $m_0 = (0, \ldots, 0)'$; $V_0 \propto c_0\, \mathrm{diag}(v_{11}, \ldots, v_{TT})$, where $v_{ii} = \frac{1}{G}\sum_{g=1}^{G}(Y_{gi} - \bar{Y}_i)^2$ and $\bar{Y}_i = \frac{1}{G}\sum_{g=1}^{G} Y_{gi}$. Also, we set $S_0 = \left[\frac{1}{G}\sum_{g=1}^{G}(Y_g - \bar{Y})(Y_g - \bar{Y})'\right]^{-1}$, and $\beta_0 = 0$. To ensure proper prior distributions, we choose the degrees of freedom $w_0 > T + 1$. The scale parameters $(\tau_0, c)$ are chosen to be sufficiently large (in our applications, above 100) to ensure non-informativeness of the prior distributions of $\mu$ and $\beta$, and $c_0$ was chosen in the range 0.1 to 10. If historical information is available, it can also be used to elicit priors.

5 Determining the number of mixture components K

Earlier, it was assumed that the total number of gene clusters $K$ was fixed and known. However, in most cases, we need to assess the statistical significance of a choice of $K$. Here we formulate the choice of $K$ as a model selection problem between competing models $M_1, M_2, \ldots, M_{K_{\max}}$, where $K_{\max}$ is a suitably chosen upper bound for the possible number of clusters. The Bayes factor between two models $M_k$ and $M_l$ is the ratio of marginal likelihoods given by:
$$BF_{kl} = \frac{P(Y \mid M_k)}{P(Y \mid M_l)} = \frac{\sum_{z,u} \int_\theta P(Y \mid z, u, \theta, M_k)\, P(z, u, \theta \mid M_k)\, d\theta}{\sum_{z,u} \int_\theta P(Y \mid z, u, \theta, M_l)\, P(z, u, \theta \mid M_l)\, d\theta}, \qquad (6)$$
where $(z, u, \theta)$ is generic notation for the missing data (cluster identity and variable indices) and the unknown parameters in the model. The main difficulty in computing (6) is the summation over all $i^G \times 2^D$ possible values of $(z, u)$ for each model $M_i$ $(i = k, l)$. In the problem of simultaneously selecting the number of clusters and variables in a multivariate normal mixture framework, Tadesse et al. (2005) used the reversible jump method (Green, 1995) for selecting $K$ along with other model parameters, embedding the problem in a larger problem of variable dimensions. With the additional complexity in our model, although reversible jump is a good idea in principle, in practice it appears highly unstable and possibly inefficient, as high-dimensionality and multimodality in the likelihood typically lead to high rejection rates when jointly sampling the large parameter space. We thus adopt an alternative approach. When the integral over $\theta$ can be evaluated analytically, Steele et al. (2003) propose a computational method to calculate (6) using importance sampling. More specifically, the marginal density $P(Y \mid M_k)$ in (6) can then be written as $I = \sum_z P(Y \mid z, M_k) P(z \mid M_k)$, and estimated by $\hat{I}_{IS} = \frac{1}{T}\sum_{t=1}^{T} P(Y \mid z^t, M_k)\, \frac{P(z^t \mid M_k)}{g(z^t)}$, where $g(\cdot)$ is an importance sampling distribution for $z$. In our model, we have two additional complications: (i) the variables selected ($u$) in the regression part of the model may vary under different $K$, and (ii) marginalizing the mean parameter $\mu$ in $P(Y, \theta \mid z, M_k)$ is intractable.
We numerically integrate over the joint distribution of the discrete variables $u$ and $z$ under every model $M_k$. We next propose a multi-stage importance sampling procedure. (The following expressions are assumed to be conditional on the current choice of model $M_k$, so we drop $M_k$ from the expressions for notational simplicity.) Note that under $M_k$, $P(Y \mid u) = \sum_z \int_\mu P(Y \mid u, z, \mu)\, P(\mu \mid z)\, p(z)\, d\mu$. Then, for any $u$, we calculate $H(u \mid z, \mu) = P(Y \mid u, z, \mu)$ (details in Appendix A), so that $P(Y)$ can be estimated by
$$\frac{1}{T_u \times T_\mu \times T_z} \sum_{i=1}^{T_z}\sum_{j=1}^{T_\mu}\sum_{k=1}^{T_u} \left[ H(u^k \mid z^i, \mu^j)\, \frac{p_1(\mu^j \mid z^i)}{g_1(\mu^j \mid z^i)}\, \frac{p_2(z^i)}{g_2(z^i)}\, \frac{p_3(u^k)}{g_3(u^k)} \right],$$
where $p_1(\cdot)$, $p_2(\cdot)$, $p_3(\cdot)$ are the densities of $\mu$, $z$, and $u$ under model $M_k$; $g_1(\cdot)$, $g_2(\cdot)$, $g_3(\cdot)$ are the corresponding importance sampling densities; and $T_u$, $T_\mu$, $T_z$ are the numbers of samples drawn respectively from the importance sampling densities ($\mu$ and $z$ are a priori independent of $u$). Since $\mu_j \mid z$ is a priori distributed as $N_T(m_0^{(j)}, V_0)$, a product of scaled multivariate t-distributions, $\prod_{j=1}^{k} t(\cdot\,;\, m_0^{(j)}, V_0, \nu)$, may be appropriate as the importance sampling distribution for $\mu \mid z$.

For the cluster indicator $z$, we use an importance sampling density of the form $g_2(z) = \delta\, p(z \mid K) + (1-\delta)\, g^*(z)$, where $p(z \mid K)$ denotes the prior, $0 < \delta < 1$, and $g^*(z)$ is a sampling function that covers important parts of the space. Sampling solely from the prior $p(z \mid K)$ is highly likely to result in many observations that should be in the same group not being clustered together, and vice versa. By using the information contained in $P(z \mid Y, S, \hat\theta)$, the posterior probability matrix of group allocation, to determine preliminary groupings of observations, we can apply a Dirichlet-multinomial sampling function to each of these groups individually (see Appendix D for details). A weak dependency is thus generated among observations which have high posterior probabilities of belonging to the same group. The Dirichlet-multinomial distribution samples parts of the space other than the mode, guarding against large importance sampling weights. The sampling distribution for $u$ is chosen similarly, as a mixture of the prior density and a density likely to sample from high posterior density regions. Let $\phi = (\phi_1, \ldots, \phi_D)$ be the set of marginal posterior probabilities of selection of variables, i.e. $\phi_j = P(u_j = 1 \mid Y, S)$, obtained from the MCMC sample. We take $g_3(u) = \delta\, p_B(u \mid \eta) + (1-\delta)\, g^*(u \mid \phi)$, where $p_B(u \mid \eta)$ denotes the prior Bernoulli density with hyperparameter $\eta$, and $g^*(u)$ is taken to be a product of Bernoulli densities where the $j$th density has parameter $\phi_j$ $(1 \leq j \leq D)$.

6 Applications

6.1 Case study 1: A simulated data set

Simulation studies were conducted to test the performance of our method. Position-specific weight matrices (PSWMs) corresponding to ten known transcription factors (TFs) were collected from the yeast TF database (Zhu and Zhang, 1999). Next, three "groups" of sequences (emulating gene promoters) were constructed by extracting random sequences from yeast intergenic regions, and inserting one or more motif sites into sequences of each group according to the following rules: (i) motif types 1 to 5 in group 1; (ii) motif types 6 to 10 in group 2; and (iii) motif types 1, 2, 3, 8, 9, 10 in group 3.
Vectors of gene expression scores for genes in each group were then generated according to the model $N_2(S_g'\beta_k, \sigma_k^2 I)$, where $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 0.1$, and the group-specific regression coefficients were as follows: $\beta_1 = 2 \times (1,1,1,1,1,0,0,0,0,0)$, $\beta_2 = 2 \times (0,0,0,0,0,1,1,1,1,1)$, and $\beta_3 = -2 \times (1,1,0,0,0,0,0,0,1,1)$. The choice of regression coefficients was made so that motifs 1 to 5 had positive effects in group 1, motifs 6 to 10 had similar effects in group 2, while in group 3, motifs 1, 2, 9, 10 had significant negative effects, whereas sites corresponding to motifs 3 and 8, although present, do not have a significant effect on gene expression. Next, to evaluate the performance of the method in the presence of noise, we simulated "scores" for 50 random $N_T(\cdot, \cdot)$ variables, uncorrelated with the gene expression. Iterative fitting of the joint sequence-expression model was done for each data set in turn, with a total of D = 30, 40, 50, 60 variables (the first 10 in each case being the "true" covariates). To judge convergence of the sampler, the $\hat{R}$ statistic (Gelman and Rubin, 1992) was computed for all scalar parameters of interest using 5 runs of the sampler with different starting points. For about 50,000 iterations of the sampler (excluding the initial 500 as burn-in), the maximum value of $\hat{R}$ over all parameters averaged 1.003, indicating that the chains can be approximately assumed to have converged. The k-means algorithm was used on the expression data to get starting values of cluster identity (for K = 2, . . . , 6). The Bayes factor was found to select the correct cluster count, K = 3, in each case (second panel of Table 3). For comparison, we also computed the Bayes factor using the marginal density approach of Chib (1995). Although this can often be calculated directly from the MCMC output, for the missing data framework of the regression mixture model it requires at least four sets of extra samples from the conditional distributions of the parameters. Additionally, the two levels of missing data (mixture components and covariate indicator variables) appear to make the procedure computationally expensive as well as inaccurate, as the Bayes factor did not succeed in selecting the correct cluster count in simulation studies. Hence we do not pursue this approach further. As is evident from Figure 2, the set of selected variables converges quickly to the correct set. Figure 2 also shows the false discovery rate on choosing a cutoff for the posterior marginal probability of variable selection ranging between 0 and 1 (all correct variables are chosen with probability 1 after burn-in, so the false negative rate is 0). For essentially any cut-off between 0.2 and 1, the FDR is virtually zero, showing that the method can discriminate strongly between the true covariates and noisy variables that do not explain gene expression for any of the clusters. From Table 1, it is seen that varying the total number of variables D does not affect the MSEs of estimated parameters or the misclassification rates. Also, varying the values of the hyperparameters $(a_1, a_2)$ for the prior of $u$ towards favoring fewer variables does not have a discernible effect on the final selection of variables or parameter estimates, indicating that the method is robust to specification of these hyperparameters. Similar sensitivity studies were carried out over a range of values for the other hyperparameters in the model, none of which had a significant effect on the final results.
The hyperparameter settings for which the final results are reported (for simulation studies and the yeast data) are: $(a_1, a_2) = (1, 100)$, $w_0 = 5$, $c_0 = 1$, $c = 1000$, $\tau_0 = 0.1$, and $\lambda = 1$.

[Figure 2, Table 1 about here]

6.2 Case study 2: Comparative study of four competing procedures

We designed a simulation study to compare the performance of the new method with three simpler approaches. These were: (i) motif detection from sequence, with no gene clustering (MDscan); (ii) a single multiple regression model with no gene clustering (Motif Regressor; Conlon et al., 2003); and (iii) a two-step procedure that clusters genes (Fraley and Raftery, 2003) and carries out motif detection separately within each cluster (Motif Regressor). Sequence data were generated for three groups of 100 sequences of length 500 each, based on a third-order Markov model with nucleotide frequencies matching yeast intergenic regions. Next, five weight matrices (see Table 2) were chosen from the SCPD database (Zhu and Zhang, 1999), and 1-2 sites corresponding to each motif were inserted into each of the sequences. Expression data were generated under the assumption that motifs 1 and 2 had an inductive effect on group I, a repressive effect on group II, and no effect on group III; and motifs 4 and 5 had an inductive effect on group III and no effect on groups I and II. Motif scores were calculated according to the five weight matrices, standardized, and the expression data generated under the model $N_2(S_g'\beta_k, I)$, where $\beta_1 = 2 \times (1, 1, 0, 0, 0)$, $\beta_2 = -2 \times (1, 1, 0, 0, 0)$, and $\beta_3 = 2 \times (0, 0, 0, 1, 1)$. Note that although sites corresponding to motif 3 were inserted into the sequences, they are simulated to be "non-functional", i.e. they do not have any effect on gene expression. Next, each approach was applied to the data set. MDscan with a total motif count between 5 and 10 could find only two of the correct motifs (Table 2). Additionally, one discovered motif was the third, which does not correlate with gene expression. Motif Regressor uses MDscan (Liu et al., 2002) to find a set of candidate motifs for regression. When applied to this data set, it predicted two motifs, none of which match the true ones. Finally, we used MCLUST (Fraley and Raftery, 2003), which uses a multivariate normal mixture, to first cluster the genes (we set K = 3), and then applied Motif Regressor separately on each of the three clusters. In the first cluster, this led to one significant predicted motif (which matches one of the Motif Regressor predictions), while no motifs were predicted to be significant in the other two clusters. None of the predicted motifs were correct. The two-step results may be due to the fact that the differences between clusters are primarily driven by the effect of the significant motifs (and not just the gene expression values), while the method takes gene expression and sequence into account one at a time. Hence clustering the genes fails to separate out the correct clusters (the misclassification rate for MCLUST is 72%), and that in turn leads to failure in detecting the true motifs. Finally, we applied the new joint regression mixture model to the same data set. An initial set of 60 motifs was generated by running MDscan separately on each cluster, with K = 1, 2, 3. Leaving out redundant motifs (exact matches) led to a set of 36 candidates for variable selection. For initialization, a random cluster index was assigned to each gene and five chains were run based on five sets of initial indices.
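Convergence across the five chains can be monitored with the Gelman-Rubin statistic cited earlier. The snippet below is a minimal illustration of computing $\hat{R}$ for a single scalar parameter from several chains; the synthetic draws are placeholders rather than output from the actual sampler.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for one scalar parameter.
    `chains` is an (m, n) array: m chains, n post-burn-in draws each."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_plus / W)

# Illustrative check with five synthetic, well-mixed chains
rng = np.random.default_rng(2)
fake_chains = rng.normal(0.0, 1.0, size=(5, 10_000))
print(gelman_rubin(fake_chains))              # close to 1 when chains agree
```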
MCMC convergence diagnostics demonstrated adequate convergence of the algorithm after about 5000 iterations ($\hat{R} \approx 1.001$). Six motifs were predicted to be significant, of which five match the true motifs, while the sixth (AATCCCAGAT) is very close to a shifted version of motif 5 (matching in 50% of the positions). The misclassification rate, based on the highest posterior probability of class identity for each gene, is reduced to 25%. Unlike MDscan, this method does not pick out motif 3, which is present in the sequence data but does not have a significant effect on expression. This indicates that the joint approach is a more promising method for differentiating groups of genes that differ as a functional category, while using gene expression alone to differentiate clusters could result in missing correct TFs.

[Table 2 about here]

6.3 Case study 3: Yeast cell-cycle data

We applied our method to the yeast cell cycle data set studied in Spellman et al. (1998) for discovering interactions within the transcription regulatory network. The first question was how to derive a starting set of PSWMs. A motif search using the entire list of 612 genes may not find any motifs other than the "strongest" binding sites, since different sub-groups of genes may be acted upon by different TFs. In order to collect a more exhaustive list, we used the following strategy. First, using the k-means algorithm, we repeatedly grouped the genes into K groups, where K = 1, 2, . . . , 10. Then, for each of the sub-groups, for each K, we ran the MDscan algorithm (Liu et al., 2002) to find sets of PSWMs, ranging in width from 8 to 10 bp. By using all the clusterings corresponding to different K, the initial motif set is made independent of any particular choice of K. Next, we took the top five non-redundant motifs from each group, combined them into one large set, and scored each of the 612 sequences with respect to each of the motifs. Since some motifs potentially overlap, the correlations between their score vectors can be nearly one. To reduce redundancy in the model, we excluded all motifs whose score vectors had a correlation of 1, or whose consensus sequences matched exactly in more than 90% of the positions, resulting in a set of 191 motif candidates, which is still fairly large for variable selection purposes. However, the structure of the prior discussed in Section 3.2 allows us to include all variables in the regression without leading to singularity problems. It may not be reasonable to consider the entire set of measurements over 18 time points simultaneously, as different TFs may be active over different phases of the cycle. Taking two to three consecutive time points may succeed in uncovering groups of genes that are acted upon together by a set of TFs. Hence we divided the expression measurements into nine consecutive intervals corresponding to time points 1-2, 3-4, . . ., 17-18.

6.3.1 Model fitting and data analysis

To assess convergence, we ran five chains of the sampler for each time interval. In all cases, 50,000 iterations, excluding the first 2% as burn-in, gave $\hat{R}$ values less than 1.1, indicating adequate convergence to the posterior distribution. Computing time for this example in the R statistical software (R Development Core Team, 2004) was about 40 hours on a cluster of dual AMD Athlon 1600+ 1.4GHz MP processors with 2GB DDR RAM. Priors were chosen to be proper but non-informative.
To assess the sensitivity of our method to prior specification, we ran the algorithm varying the following: (i) the initial clustering, (ii) the values of $c$ and $\tau_0$, the scaling parameters for the prior covariances of $\beta$ and $\mu$, (iii) the magnitude of the uniform prior pseudocount vector $\alpha$ for $\pi$, and (iv) the hyperparameters $(a_1, a_2)$ of the distribution of $\eta$. We found that (i) varying the initial clustering, and the values of $m_0$ and $c_0$, had little effect on the final results; (ii) varying values of $c$ larger than 500 had no effect on parameter estimates; (iii) taking small values of $\tau_0$ in the range 0.01 to 1 led to stable estimates; and (iv) the magnitude of $\alpha$ had no visible effect on the results. The hyperparameters $(a_1, a_2)$ behave as stringency parameters that control the degree of parsimony of the model: if the value of $a_2$ is large compared to $a_1$, the inclusion of fewer weight matrices in the model is expected to be favored, and vice versa. However, this difference in posterior probabilities of selection was only observable when $a_1$ was set to be much smaller than $a_2$, in which case none of the motifs were selected. For a large range of $a_1/a_2$ ratios between 0 and 1, the same set of motifs was sampled with the highest posterior probability. A problem often associated with estimation in Bayesian mixture models is "label-switching", due to the posterior indistinguishability of the $K!$ modes. In our simulation studies and application, however, we found no evidence of this.

The first panel of Table 3 shows the Bayes factor computed by the multi-stage importance sampling method (Section 5) for different choices of K. The optimal number of clusters K is seen to vary between one and three. Figure 3 gives the posterior probability of motif selection, corresponding to the optimal number of clusters K for that time interval. Using the frequencies of inclusion of single motifs in the model is equivalent to using the marginal posterior probabilities of selection. Table 5 shows the selected motifs at each stage, their consensus patterns, and those of the known TF binding motifs that each may correspond to. A number of binding sites are seen to match previously found TFs, and many of them are discovered precisely at the times of peak expression in the cell-cycle of the genes they are believed to regulate (Spellman et al., 1998). Moreover, we discovered several TFs that act in opposing directions in different clusters and that are missed when considering a single cluster of genes: ROX1 (interval 1-2), GAL4 (3-4), and ABF1 (17-18).

[Figure 3, Tables 3, 5 about here]

6.3.2 Comparative study with competing procedures

To further explore the benefits of using this method, we conducted a detailed comparison with two simpler methods that differ from our method in two significant ways. Motif Regressor (Conlon et al., 2003) assumes a single cluster of genes at each time point. Next, we used a sequential two-step approach, first clustering the gene expression values using a normal mixture model (Fraley and Raftery, 2003), and then doing a separate motif analysis on each cluster. MCLUST uses the BIC to determine the optimal number of clusters. Table 4 shows that of the three approaches, the joint regression mixture model has the highest specificity over all time points, finding the largest number of motifs known to be functional in the yeast cell-cycle.
Although Motif Regressor finds a number of known yeast motifs, the false positive rate is high, a likely result of noise introduced by using all genes instead of those relevant to a particular TF. In addition, Motif Regressor requires a number of tuning parameters to be set (for example, the number or proportion of genes to consider "top" genes). Although having more precise information on these in a particular experiment may influence the results favorably, this information is generally unavailable in advance. The two-step approach, interestingly, picks up fewer motifs, resulting in a higher specificity in some cases. However, at some intervals it fails to pick up any functional TFs, which may be due to the clustering being based solely on gene expression. Biologically, it is known that not all variation in gene expression is driven by transcription factor activity; hence grouping genes solely by expression, ignoring the sequence, could lead to missing important TFs. This result further supports our hypothesis that a more powerful way of using gene expression in motif discovery is through a joint model that incorporates sequence information into clustering and expression information into the motif search procedure.

[Table 4 about here]

6.3.3 Cluster analysis

Since our joint model includes a clustering step, apart from motif selection, it is also of interest to see whether the resultant gene clusters appear to be functionally relevant. We compared the gene lists obtained from the regression mixture model to the Gene Ontology (GO) database (http://www.geneontology.org/). A web-based tool, FuncAssociate (Berriz et al., 2003), was used to obtain lists of GO attributes that are statistically over-represented among the genes in each gene list, correcting for multiple hypothesis testing. The results over each time interval are shown in Table 6, where the highlighted attributes are ones known to be functional at that phase of the cell cycle (Spellman et al., 1998). It is noticeable in this table that (i) a number of attributes are highly significant; (ii) the clusters appear to be well-separated by attribute, suggesting that the different TFs found in different clusters may be involved in separate functions; and (iii) a number of attributes that appear in the same cluster are correlated functionally. For example, budding and polarized growth are grouped together; replication and DNA helicase occur in the same group (DNA helicase is needed for unwinding of the replication fork); hence the gene clustering seems to be biologically consistent. The same table also presents results of a comparison with Table 7 in Spellman et al. (1998), which gives the names of genes at each time point known to be involved in a particular function. Using the gene clusters from the joint model, we find the proportion of genes in a functional group that appear in the same cluster after classification. As can be seen in Table 6, a number of functional gene groups are over-represented in certain clusters (as opposed to being randomly divided among the clusters), again indicating that the clusters may capture some functional aspects. It is interesting to see that a number of functional groups from Spellman et al. (1998) match the independently generated GO characterizations.
For example, at time interval 1, the enriched functional groups are "chromatin", "cell wall" and "mating" in cluster 1, while the top three GO annotations also characterize functional enrichment for cell wall, conjugation (which corresponds to "mating") and nucleosome (which corresponds to "chromatin"). In cluster 2 over the same interval, the "budding: fatty acids" functional gene group is over-represented, which matches the GO annotation for cluster 2 (bud). A number of such correspondences across the table indicate that the clusters found may indeed be meaningful in terms of biological function. It should, however, be stressed that the gene clustering produced here may not provide the complete picture of gene activity. What is captured by the joint model is only the part of gene expression that is related to transcription; there may be other functions that are missed, because TFs may not have a significant effect on such activity. The purpose of clustering in this context is to separate out TFs that have differential effects on subgroups of genes, which may be difficult to detect when all genes are grouped together. The gene clustering afforded by this procedure may be used to generate hypotheses about subgroups of genes that may be functionally related, which can then be validated by further experiments.

[Table 6 about here]

7 Discussion

We have introduced a method to infer co-regulation based on clustering of similar expression profiles and similarity of sequence features. We formulate a joint sequence-expression model within a hierarchical regression mixture framework and design an efficient procedure to estimate the model parameters. The model aims at facilitating the analysis of large and complex datasets. The introduction of a hierarchical prior structure motivated by ideas from ridge regression allows consistent variable selection with high-dimensional covariates, a major advantage over currently used methods. Detailed comparative studies demonstrate that the joint model is more successful than simpler approaches in capturing important data features, leading to more accurate inference. To the best of our knowledge, there has been no previous successful demonstration of a unified model that simultaneously clusters genes and detects sets of active regulatory motifs in biological data. The motivation for using a random-effects hierarchical model arises from the observation that upstream sequence binding is characterized by the sequence composition, which is constant over time; however, due to interactions of various TFs at different time points, the resulting gene expression measurements may differ. Although a Gaussian model is used here for simplicity and computational efficiency, it is not necessarily a rigid assumption and can be extended to more complex frameworks. The joint framework introduced here has the potential to be improved in various modeling aspects, such as incorporating a possible dependence in gene measurements over time more fully, and using a more biologically sophisticated function for motif scoring, for instance, modeling the spatial dependence that arises due to multiple sites forming a "regulatory module" (Thompson et al., 2004). These issues, as well as extensions of the model in related applications that involve determining significant predictor variables in the presence of latent clusters in the population, are currently being investigated.

References
Bailey, T. and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., pages 28–36.
Ball, C., Awad, I., Demeter, J., Gollub, J., Hebert, J. M., Hernandez-Boussard, T., Jin, H., Matese, J. C., Nitzberg, M., Wymore, F., Zachariah, Z. K., Brown, P. O., and Sherlock, G. (2005). The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res., 33(Database issue):D580–2.
Barash, Y. and Friedman, N. (2002). Context-specific Bayesian clustering for gene expression data. J. Comput. Biol., 9(2):169–191.
Berriz, G. F., King, O. D., Bryant, B., Sander, C., and Roth, F. P. (2003). Characterizing gene sets with FuncAssociate. Bioinformatics, 19(18):2502–2504.
Bussemaker, H. J., Li, H., and Siggia, E. D. (2001). Regulatory element detection using correlation with expression. Nature Genetics, 27:167–174.
Chib, S. (1995). Marginal likelihood from the Gibbs output. J. Am. Statist. Assoc., 90(432):1313–1321.
Conlon, E. M., Liu, X. S., Lieb, J. D., and Liu, J. S. (2003). Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA, 100(6):3339–3344.
Fraley, C. and Raftery, A. E. (2003). Enhanced model-based clustering, density estimation, and discriminant analysis software: MCLUST. J. Classification, 20(2):263–286.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences (with discussion). Stat. Sci., 7:457–511.
George, E. and McCulloch, R. (1993). Variable selection via Gibbs sampling. J. Am. Stat. Assoc., 88:881–889.
Green, P. J. (1995). Reversible jump MCMC and Bayesian model determination. Biometrika, 82:711–732.
Hennig, C. (1999). Models and methods for clusterwise linear regression. In Classification in the Information Age, pages 179–187. Springer-Verlag, Berlin.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67.
Holmes, I. and Bruno, W. (2000). Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8:202–10.
Lawrence, C. E. and Reilly, A. A. (1990). An expectation-maximization (EM) algorithm for the identification and characterization of common sites in biopolymer sequences. Proteins, 7:41–51.
Liang, F. and Wong, W. H. (2000). Evolutionary Monte Carlo: applications to Cp model sampling and change point problem. Statistica Sinica, 10:317–342.
Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc., 90:1156–1170.
Liu, J. S., Zhang, J. L., Palumbo, M. J., and Lawrence, C. E. (2003). Bayesian clustering with variable and transformation selections. In Bayesian Statistics, volume 7, pages 249–275. Oxford University Press.
Liu, X., Brutlag, D. L., and Liu, J. S. (2002). An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotech., 20(8):835–9.
R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9(12):3273–97.
Steele, R., Raftery, A. E., and Emond, M. (2003). Computing normalizing constants for finite mixture models via incremental mixture importance sampling. Technical report, Dept. of Statistics, University of Washington.
Tadesse, M., Sha, N., and Vannucci, M. (2005). Bayesian variable selection in clustering high-dimensional data. J. Am. Stat. Assoc., 100:602–617.
Tadesse, M. G., Vannucci, M., and Lio, P. (2004). Identification of DNA regulatory motifs using Bayesian variable selection. Bioinformatics, 20(16):2553–61.
Thompson, W., Palumbo, M. J., Wasserman, W. W., Liu, J. S., and Lawrence, C. E. (2004). Decoding human regulatory circuits. Genome Research, 10:1967–74.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, page 233. North-Holland, Amsterdam.
Zhu, J. and Zhang, M. (1999). SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, 15(7):607–611.

APPENDIX A: Conditional distributions in MCMC steps

(i) Updating $[u \mid Y, S, z, K, \text{all parameters}]$. For the EMC step, the marginal distribution of the data is computed by integrating out all parameters whose dimension varies with $u$. The marginal likelihood is proportional to $H(u \mid z, \mu)$, where
$$H(u \mid z, \mu) = \frac{\mathrm{Beta}(|u| + a_1, D - |u| + a_2)}{\mathrm{Beta}(a_1, a_2)} \prod_{k=1}^{K} \frac{K^*(w_0/2,\, S_0/2)\; c_k^*}{K^*\!\left(\frac{w_0 + n_kT}{2},\, \frac{S_0 + T_{1k} + T_{2k}}{2}\right)}, \qquad (7)$$
with
$$c_k^* = \left|c^{-1}\Sigma_{\beta(u)}^{-1}\right|^{1/2} \left|c^{-1}\Sigma_{\beta(u)}^{-1} + (1+\tau_0)^{-1}\sum_{g:z_g=k} S_{g(u)}S_{g(u)}'\right|^{-1/2} [2\pi(1+\tau_0)]^{-n_kT/2},$$
$$T_{1k} = (1+\tau_0)^{-1}\sum_{g:z_g=k}(Y_g - \mu_k - S_{g(u)}'\tilde\beta_{k(u)}1)'(Y_g - \mu_k - S_{g(u)}'\tilde\beta_{k(u)}1), \quad \text{and} \quad T_{2k} = T c^{-1}(\tilde\beta_{k(u)} - \beta_{0(u)})'\Sigma_{\beta(u)}^{-1}(\tilde\beta_{k(u)} - \beta_{0(u)}).$$
Here $K^*(p, \alpha)$ denotes the normalizing constant of the inverse gamma$(p, \alpha)$ density, $\alpha^p/\Gamma(p)$, and $\tilde\beta_{k(u)}$ is defined in (8). (Equation (7) is derived in Appendix B.)

(ii) Update $[z \mid Y, S, u, K, \text{parameters}]$, using
$$P(z_g = k \mid Y, S, u, K, \text{all parameters}) = \frac{\pi_k\, p(Y_g \mid S, u, K, \mu_k, \beta_k, \sigma_k^2)}{\sum_j \pi_j\, p(Y_g \mid S, u, K, \mu_j, \beta_j, \sigma_j^2)}.$$

(iii) Update $[\text{parameters} \mid Y, S, z, u, K]$.

(a) $\mu_k \mid \beta_k, \sigma_k^2, Y, S, z, u \sim N_T\!\left(\bar{y}_k^*,\ (V_0^{-1} + n_k\sigma_k^{-2}I)^{-1}\right)$, where $\bar{y}_k = \frac{1}{n_k}\sum_{g:z_g=k}\left(y_g - S_{g(u)}'\beta_{k(u)}1\right)$ and $\bar{y}_k^* = (V_0^{-1} + n_k\sigma_k^{-2}I)^{-1}(V_0^{-1}m_0 + n_k\bar{y}_k/\sigma_k^2)$.

(b) By assumption, $\sigma_k^2 \sim \text{Inv-Gamma}(w_0/2, S_0/2)$ (where $X \sim \text{Inv-Gamma}(p, \alpha)$ implies $f(x) = \frac{\alpha^p}{\Gamma(p)}\, x^{-(p+1)} e^{-\alpha/x}$). Then we draw from the posterior distribution $(\sigma_k^2 \mid Y, S, z, \mu, u) \sim \text{Inv-Gamma}(A, B)$, where $A = (w_0 + n_kT)/2$ and
$$B = \frac{1}{2}\left[S_0 + (1+\tau_0)^{-1}\sum_{g:z_g=k}(Y_g - \mu_k - S_{g(u)}'\tilde\beta_{k(u)}1)'(Y_g - \mu_k - S_{g(u)}'\tilde\beta_{k(u)}1) + T c^{-1}(\tilde\beta_{k(u)} - \beta_{0(u)})'\Sigma_{\beta(u)}^{-1}(\tilde\beta_{k(u)} - \beta_{0(u)})\right].$$

(c) $\beta_{k(u)} \mid Y, S, \mu_k, \sigma_k^2, z, u$ is drawn from its posterior distribution $N_{|u|}(\tilde\beta_{k(u)}, \sigma_k^2\tilde\Sigma_{\beta k(u)})$, where $\tilde\Sigma_{\beta k(u)} = \frac{1}{T}\left[c^{-1}\Sigma_{\beta(u)}^{-1} + (1+\tau_0)^{-1}\sum_{g:z_g=k} S_{g(u)}S_{g(u)}'\right]^{-1}$ and
$$\tilde\beta_{k(u)} = \tilde\Sigma_{\beta k(u)}\left[c^{-1}\Sigma_{\beta(u)}^{-1}\beta_{0(u)} + (1+\tau_0)^{-1}\sum_{g:z_g=k} S_{g(u)}1_T'(Y_g - \mu_g)\right]. \qquad (8)$$

(d) The cluster occupancy probabilities $\pi$ are drawn from their posterior distribution $\mathrm{Dirichlet}(n + \alpha)$, where the cluster sizes are $n = (n_1, \ldots, n_K)$.

Posterior mean and variance of $\beta_k$

Let $n_k$ denote the number of genes assigned to cluster $k$ $(k = 1, \ldots, K)$, and for simplicity of notation, let $S_{k(u)} = (S_{1(u)}', \ldots, S_{n_k(u)}')'$ denote the $n_k \times |u|$ sub-matrix of scores corresponding to the $n_k$ genes. Define the $(|u| + n_kT) \times 1$ vector $Y_{k(u)*} = [\beta_{0(u)}';\ (Y_1 - \mu_k)';\ \cdots;\ (Y_{n_k} - \mu_k)']'$, where $[a_1'; a_2']$ represents the vector $a_1$ stacked on $a_2$.
Posterior mean and variance of β_k

Let n_k denote the number of genes assigned to cluster k (k = 1, . . . , K), and for simplicity of notation, let S_{k(u)} = (S_{1(u)}', . . . , S_{n_k(u)}')' denote the n_k × |u| sub-matrix of scores corresponding to the n_k genes. Define the (|u| + n_kT) × 1 vector Y_{k(u)*} = [β_{0(u)}'; (Y_1 − µ_k)'; · · · ; (Y_{n_k} − µ_k)']', where [a_1'; a_2'] represents the vector a_1 stacked on a_2. Define the (|u| + n_kT) × |u| matrix X_{k(u)*} = [I_{|u|}, (1_T S_{1(u)}')', . . . , (1_T S_{n_k(u)}')']' and the (|u| + n_kT) × (|u| + n_kT) matrix Q_{k(u)*} = diag(cσ_k²Σ_{β(u)}, σ_k²(1+τ_0)I_{n_kT}), where n_k = Σ_{g=1}^{G} 1[z_g = k], |u| = Σ_{d=1}^{D} 1[u_d = 1], and I_m denotes an identity matrix of dimension m. From weighted linear regression techniques, it then follows that the posterior distribution of β_{k(u)} is β_{k(u)} | µ_k, σ_k², Y, S, z, u ∼ N_{|u|}(β̃_{k(u)}, Σ̃_{βk(u)}), where
$$
\tilde\beta_{k(u)} = (X_{k(u)*}'Q_{k(u)*}^{-1}X_{k(u)*})^{-1}X_{k(u)*}'Q_{k(u)*}^{-1}Y_{k(u)*}
\quad\text{and}\quad
\tilde\Sigma_{\beta k(u)} = (X_{k(u)*}'Q_{k(u)*}^{-1}X_{k(u)*})^{-1}. \qquad (9)
$$
Now,
$$
X_{k(u)*}'Q_{k(u)*}^{-1}X_{k(u)*}
= (c\sigma_k^2)^{-1}\Sigma_{\beta(u)}^{-1} + \sigma_k^{-2}(1+\tau_0)^{-1}\!\!\sum_{g:z_g=k}\!(S_{g(u)}\mathbf{1}_T')(\mathbf{1}_T S_{g(u)}')
= T\sigma_k^{-2}\Bigl[c^{-1}\Sigma_{\beta(u)}^{-1} + (1+\tau_0)^{-1}\!\!\sum_{g:z_g=k}\! S_{g(u)}S_{g(u)}'\Bigr]. \qquad (10)
$$
Also,
$$
X_{k(u)*}'Q_{k(u)*}^{-1}Y_{k(u)*}
= c^{-1}\sigma_k^{-2}\Sigma_{\beta(u)}^{-1}\beta_{0(u)} + (1+\tau_0)^{-1}\sigma_k^{-2}\!\!\sum_{g:z_g=k}\! S_{g(u)}\mathbf{1}_T'(Y_g-\mu_k).
$$
The posterior mean and covariance of β_{k(u)} are then
$$
\tilde\beta_{k(u)} = T^{-1}\Bigl[\frac{\Sigma_{\beta(u)}^{-1}}{c} + (1+\tau_0)^{-1}\!\!\sum_{g:z_g=k}\! S_{g(u)}S_{g(u)}'\Bigr]^{-1}
\Bigl[\frac{\Sigma_{\beta(u)}^{-1}\beta_{0(u)}}{c} + (1+\tau_0)^{-1}\!\!\sum_{g:z_g=k}\! S_{g(u)}\mathbf{1}_T'(Y_g-\mu_k)\Bigr]
$$
and
$$
\tilde\Sigma_{\beta k(u)} = T^{-1}\sigma_k^2\Bigl[c^{-1}\Sigma_{\beta(u)}^{-1} + (1+\tau_0)^{-1}\!\!\sum_{g:z_g=k}\! S_{g(u)}S_{g(u)}'\Bigr]^{-1}. \qquad (11)
$$

APPENDIX B: Derivation of H(u | z, µ)

Let u be as defined in Section 3. Since each u_j (1 ≤ j ≤ D) has prior probability η of being 1, the marginalized posterior probability of a configuration u is
$$
H(u\mid z,\mu) = \int \prod_{k=1}^{K}\prod_{g:z_g=k}
P(Y_g\mid S_g,z,u,K,\mu,\sigma_k^2,\beta_{k(u)})\,p(\beta_{k(u)})\,p(u\mid\eta)\,p(\sigma_k^2)\;d\beta_{k(u)}\,d\eta\,d\sigma_k^2. \qquad (12)
$$
Let n_k = |{g : z_g = k}|, |u| = Σ_{j=1}^{D} u_j, and B(a, b) = Γ(a)Γ(b)/Γ(a + b). Then
$$
H(u\mid z,\mu)
= \int \prod_{k=1}^{K}\prod_{g:z_g=k} P(Y_g\mid S_g,z,u,K,\mu,\sigma_k^2,\beta_{k(u)})\,p(\eta)\,p(\beta_{k(u)})\,p(\sigma_k^2)\;d\beta_{k(u)}\,d\eta\,d\sigma_k^2
$$
$$
= \frac{B(|u|+a_1,\,D-|u|+a_2)}{B(a_1,a_2)}
\prod_{k=1}^{K}\int\prod_{g:z_g=k} P(Y_g\mid S_g,z,u,K,\mu,\sigma_k^2,\beta_{k(u)})\,p(\beta_{k(u)})\,p(\sigma_k^2)\;d\beta_{k(u)}\,d\sigma_k^2.
$$
Let X_{k(u)*}, Y_{k(u)*}, and Q_{k(u)*} be defined as in Appendix A. Also, the posterior distribution of β_{k(u)} | µ_k, σ_k², Y, S, z, u is N_{|u|}(β̃_{k(u)}, Σ̃_{βk(u)}), where β̃_{k(u)} and Σ̃_{βk(u)} are as defined in (9). Now, for any sets of vectors X, Y, θ,
$$
P(X,Y) = \int_{\theta} P(X,Y\mid\theta)P(\theta)\,d\theta = \frac{P(X,Y\mid\theta)P(\theta)}{P(\theta\mid X,Y)},
$$
where the identity holds for any choice of θ such that P(θ | X, Y) ≠ 0. So,
$$
P(Y_g\mid S_g,z,u,K,\mu,\sigma_k^2)
= \frac{P(Y_g\mid S_g,\beta_{k(u)},z,u,K,\mu,\sigma_k^2)\,P(\beta_{k(u)}\mid z,u,K,\mu,\sigma_k^2)}{P(\beta_{k(u)}\mid Y,S,z,u,K,\mu,\sigma_k^2)}, \qquad (13)
$$
where (13) can be evaluated at any value of β_{k(u)}. Choosing β_{k(u)} = β̃_{k(u)}, the denominator of (13) reduces to
$$
\left. N_{|u|}\bigl(\beta_{k(u)};\,\tilde\beta_{k(u)},\,\tilde\Sigma_{\beta k(u)}\bigr)\right|_{\beta_{k(u)}=\tilde\beta_{k(u)}}
= (2\pi)^{-|u|/2}\,\bigl|\tilde\Sigma_{\beta k(u)}\bigr|^{-1/2}. \qquad (14)
$$
The numerator of (13) evaluated at β̃_{k(u)} is
$$
\prod_{g:z_g=k} N_T\bigl(Y_g;\,\mu_k + S_{g(u)}'\tilde\beta_{k(u)}\mathbf{1},\,\sigma_k^2(1+\tau_0)I\bigr)
\times N_{|u|}\bigl(\tilde\beta_{k(u)};\,\beta_{0(u)},\,c\sigma_k^2\Sigma_{\beta(u)}\bigr), \qquad (15)
$$
where Σ_{β(u)} is defined in (4) and Σ̃_{βk(u)} is as given in (11). The ratio of (15) to (14) then gives the expression ∏_{k=1}^{K}∏_{g:z_g=k} P(Y_g | S_g, z, u, K, µ, σ_k²). Finally, the prior distribution of σ_k² is Inverse-Gamma(w_0/2, S_0/2), so
$$
H(u\mid z,\mu) = \frac{B(|u|+a_1,\,D-|u|+a_2)}{B(a_1,a_2)}
\prod_{k=1}^{K}\int\prod_{g:z_g=k} P(Y_g\mid S_g,z,u,K,\mu,\sigma_k^2)\,p(\sigma_k^2)\;d\sigma_k^2,
$$
which, upon using (15), (14), and (9), simplifies to expression (7) in Appendix A.
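In practice (7) is evaluated on the log scale for each candidate configuration u. Below is a minimal R sketch of log H(u | z, µ) under the same assumed object names as in the sketch above (Y, S, z, u logical, mu a K × T matrix of cluster means, and the hyperparameters of the text); it is an illustration of (7), not the authors' implementation.

# Minimal sketch: evaluate log H(u | z, mu) as in (7); all names are assumed.
log_H <- function(u, z, Y, S, mu, Sigma_beta, beta0, cpar, tau0, w0, S0, a1, a2) {
  D <- ncol(S); Tt <- ncol(Y); K <- max(z)
  sel <- which(u)
  log_Kstar <- function(p, a) p * log(a) - lgamma(p)      # Inverse-Gamma normalizing constant
  out <- lbeta(length(sel) + a1, D - length(sel) + a2) - lbeta(a1, a2)
  Sbi <- solve(Sigma_beta[sel, sel, drop = FALSE])
  for (k in seq_len(K)) {
    idx <- which(z == k); nk <- length(idx)
    Su  <- S[idx, sel, drop = FALSE]
    R   <- sweep(Y[idx, , drop = FALSE], 2, mu[k, ])      # Y_g - mu_k, one row per gene
    prec_prior <- Sbi / cpar
    prec_data  <- crossprod(Su) / (1 + tau0)
    bt <- solve(prec_prior + prec_data,
                prec_prior %*% beta0[sel] + t(Su) %*% rowSums(R) / (1 + tau0)) / Tt
    resid <- R - as.vector(Su %*% bt)                     # recycles the per-gene scalar across T
    T1 <- sum(resid^2) / (1 + tau0)
    T2 <- Tt * as.numeric(t(bt - beta0[sel]) %*% Sbi %*% (bt - beta0[sel])) / cpar
    log_ck <- 0.5 * determinant(prec_prior)$modulus -
              0.5 * determinant(prec_prior + prec_data)$modulus -
              (nk * Tt / 2) * log(2 * pi * (1 + tau0))
    out <- out + log_ck + log_Kstar(w0 / 2, S0 / 2) -
           log_Kstar((w0 + nk * Tt) / 2, (S0 + T1 + T2) / 2)
  }
  as.numeric(out)
}

This quantity is the fitness of a configuration u; the EMC moves described next temper it across the population.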
APPENDIX C: Evolutionary Monte Carlo (EMC)

To conduct EMC, we need to prescribe a set of temperatures, t_1 > t_2 > · · · > t_M = 1, for the M "population units". Using the marginalized probability of a variable configuration u in (12), we take the energy of a configuration to be h(u) = −log H(u | z, µ), set π_i(u_i) ∝ exp{−h(u_i)/t_i}, and let π(U) ∝ ∏_{i=1}^{M} π_i(u_i). The population U = (u_1, . . . , u_M) is updated iteratively using two types of moves: mutation and crossover.

In the mutation operation, a unit u_k is randomly selected from the current population and mutated to a new vector v_k by changing the values of some of its components chosen at random. The new member v_k is accepted into the population with probability min(1, r_m), where r_m = π_k(v_k)/π_k(u_k). In crossover, two individuals, u_j and u_k, are chosen at random from the population. A crossover point l (1 ≤ l ≤ D) is chosen randomly, and two new units v_j and v_k are formed by exchanging between the two individuals the segments to the right of the crossover point. The two "children" are accepted into the population, replacing their parents u_j and u_k, with probability min(1, r_c), where
$$
r_c = \frac{\pi_j(v_j)\,\pi_k(v_k)}{\pi_j(u_j)\,\pi_k(u_k)}.
$$
It can be shown that the samples of u_M (t_M = 1) converge to the target distribution (12).

APPENDIX D: Importance sampling procedure for z

The importance sampling procedure for z consists of the following steps:

(i) Let ζ̂ be the matrix of P(z | Y, S, θ̂) for a specific permutation of the group labels z. Create K groups by assigning observation i to group l_i, where l_i = arg max_j ζ̂_ij.

(ii) Next, for each non-empty group r (r = 1, . . . , K), sample ψ_r from a Dirichlet distribution with parameter vector α_r = (α_r1, . . . , α_rK) (α_r is chosen to be 1_K).

(iii) For each group, re-assign observations to groups according to their group-specific ψ_r. The probability g*(z) then is
$$
g^{*}(z) = \prod_{r=1}^{K}
\frac{\Gamma\!\bigl(\sum_j \alpha_{rj}\bigr)\,\prod_j \Gamma(\alpha_{rj} + n_{rj})}
{\Gamma\!\bigl(\sum_j [\alpha_{rj} + n_{rj}]\bigr)\,\prod_j \Gamma(\alpha_{rj})},
$$
where n_rj is the number of observations from the r-th group assigned to group j.
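The weight g*(z) is a product of Dirichlet-multinomial terms and is conveniently computed on the log scale. A minimal R sketch follows (function and argument names are assumptions: `labels` holds the hard assignments l_i and `alpha_mat` is the K × K matrix whose rows are the α_r, here all ones).

# Minimal sketch: log g*(z) for the importance sampling step; names are assumed.
log_gstar <- function(z, labels, alpha_mat) {
  K <- nrow(alpha_mat)
  out <- 0
  for (r in seq_len(K)) {
    idx <- which(labels == r)
    if (length(idx) == 0) next                    # skip empty groups
    n_r <- tabulate(z[idx], nbins = K)            # n_rj: group-r observations re-assigned to j
    a_r <- alpha_mat[r, ]
    out <- out + lgamma(sum(a_r)) + sum(lgamma(a_r + n_r)) -
                 lgamma(sum(a_r + n_r)) - sum(lgamma(a_r))
  }
  out
}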
Figure 1: Yeast cell-cycle data. Gene expression measurements over the five phases of the cell cycle (M/G1, G1, S, S/G2, G2/M) show the increased or decreased activity of varying sets of genes at specific time points.

Figure 2: (a) Number of variables selected over iterations for the 4 simulated data sets (D = 20, 30, 40, 50). (b) FP/FN rates against the cutoff probability for variable selection; any cutoff above 0.2 gives a 0% FP rate.

Figure 3: Number of selected variables over iterations (left panels) and posterior probability of motif type selection (right panels) for the 9 consecutive time intervals, for the optimal choice of the total number of clusters (as in the second column of Table 5). The horizontal axes correspond to iteration number (left) and motif index (right).

Table 1: Mean square errors (MSE), misclassification percentage (miscl. rate), and false positive (FP)/false negative (FN) rates for (i) an increasing number of variables D, and for different hyperparameter specifications of (ii) (a1, a2), (iii) the scalar multiple c0 of V0, and (iv) the "ridge" factor λ in the hyperprior of β. (FP/FN rates hold at any marginal posterior probability cutoff for variable selection between 0.2 and 1.0.)

                        MSE(µ)   MSE(β)   MSE(σ²)   miscl. rate   FP   FN
D         20            0.0003   0.0203   0.0004    0.0400        0    0
          30            0.0004   0.0308   0.0005    0.0283        0    0
          40            0.0003   0.0210   0.0004    0.0367        0    0
          50            0.0407   0.0278   0.0804    0.0200        0    0
(a1, a2)  (1, 1)        0.0004   0.0344   0.0007    0.0317        0    0
          (1, 100)      0.0005   0.0455   0.0018    0.0333        0    0
          (1, 10000)    0.0004   0.0260   0.0007    0.0367        0    0
c0        0.1           0.0048   0.0089   0.0992    0.0867        0    0
          1             0.0049   0.0096   0.1025    0.0867        0    0
          10            0.0049   0.0093   0.0982    0.0533        0    0
          50            0.0049   0.0087   0.0984    0.0600        0    0
λ         0.0           0.0049   0.0083   0.1002    0.0800        0    0
          0.2           0.0051   0.0094   0.1039    0.0667        0    0
          0.4           0.0049   0.0088   0.0994    0.0733        0    0
          0.6           0.0048   0.0091   0.1023    0.0867        0    0
          0.8           0.0052   0.0092   0.1000    0.0533        0    0
          1.0           0.0048   0.0089   0.0992    0.0867        0    0

Table 2: Simulation study comparing the performance of the joint regression mixture model with three methods. Position-specific probabilities less than 60% are denoted by 'N'. "Sens": sensitivity (percentage of true motifs found); "Spec": specificity (percentage of motifs found which are correct). "Clust %": percent correctly classified. A motif is considered a "match" if ≥ 70% of its positions match the true motif (underlined in the original).

                  Motif    Motif   Motif   Clust.
Method            count    Sens    Spec    %        Motifs found
MDscan            5        0.5     0.4     −        AGCCGCCGAA, CAGCAGATTG
Motif Regressor   2        0       0       −        CCCAGGCGAG, CAATCTGCTG
Two-step          2,0,0    0       0       0.28     CAATCTGCTG
Joint model       6        1       0.83    0.75     GCCGCCGAAT, TCCCAGATTT, GGGACGCGTA, GCAGCAGATT
(True motif patterns: AGCCGCCNAA, TNCCNNNTTT, ACGCGTACGT, CNGCAGATTG.)
Table 3: Model selection for the number of clusters in (i) the yeast data and (ii) simulated data, by Bayes factor computation using IS, compared to K = 1.

                       Yeast
Time Interval    K=2       K=3       K=4
1, 2             17.72     12.11     -11.02
3, 4             4.71      -33.33    -152.26
5, 6             -12.10    -22.74    -93.38
7, 8             8.38      -24.09    -50.31
9, 10            22.50     28.09     -128.99
11, 12           0.98      -2.62     -85.57
13, 14           15.33     -26.50    -46.64
15, 16           4.33      -28.30    -71.49
17, 18           41.33     -44.79    -147.16

Simulated data:  K=2: 366.40   K=3: 527.64   K=4: 220.17   K=5: 80.00

Table 4: Comparison of the joint model, MotifRegressor (MR), and the two-step approach (MCLUST+MR) in the yeast data set. In the column "MCLUST+MR", the numbers denote the number of TFs found separately in each cluster. "Spec" denotes specificity, here defined as the percentage of found TFs that are known in the literature. (It is not possible to accurately estimate the sensitivity in this case, as the total number of TFs active at each precise time point may not be known.)

                 MotifRegressor      MCLUST+MR          Joint
Time Interval    Total    Spec       Total     Spec     Total   Spec
1, 2             23       0.087      5,4,3     0        2       0.5
3, 4             19       0.158      9,7       0.188    6       1
5, 6             21       0.095      7,1,5     0.154    2       1
7, 8             22       0.227      8,1       0        4       1
9, 10            26       0          5,7       0.083    1       0.8
11, 12           21       0.095      7,3       0.1      4       0.75
13, 14           23       0.087      7,2       0        1       1
15, 16           17       0          8,0       0        2       1
17, 18           16       0.125      1,7       0        4       0.75

Table 5: Posterior confidence intervals for cluster-specific regression coefficients for yeast transcription factor binding motifs. The second column gives the chosen number of clusters (K) and the third column (Sel %) the posterior marginal probability of selection of the motif.

Interval   K   Motif (Consensus)        Sel %
1, 2       2   ROX1 (ACAACAAC)          0.63
               unknown (ACCAAGCG)       0.82
3, 4       2   ABF1 (CGTATATA)          1.00
               REB1 (GGTTAACC)          1.00
               SCB (CTCGAAAA)           1.00
               UASH (CTGTGCAG)          0.52
5, 6       1   MCB (TACGCGAT)           1.00
               GAL4 (CGCGGCTC)          1.00
               MCM1 (GGCCAGAA)          1.00
7, 8       2   MCB (TACGCGAT)           0.82
               UASH (CTGTGCAG)          0.52
               GAL4 (GGACAGAC)          0.59
               unknown (CAATTGCA)       1.00
9, 10      3   MCB (TACGCGAT)           1.00
11, 12     2   RAP1 (ACGCGTTA)          1.00
               unknown (CAAACCGA)       0.79
               GCR1 (CACCACTA)          0.87
               SWI5 (GCTGGATT)          0.84
13, 14     2   MCB (TACGCGAT)           0.94
15, 16     2   RAP1 (ACGCGTTA)          1.00
               ROX1 (CAAACAAA)          0.71
               MIG1/SFF (TGTTTGTT)      0.94
17, 18     2   ABF1 (CGTATATA)          1.00
               unknown (GTGTGTAT)       1.00
               GAL4 (CAGCCACT)          1.00
               PHO4 (ACGCGTTA)          0.58

95% Posterior Interval (clusters 1, 2, 3): (-0.132, 0.014), (0.034, 0.096), (-0.198, -0.020), (-0.083, -0.031), (-0.038, 0.033), (-0.076, -0.035), (-0.155, -0.007), (0.085, 0.174), (-0.106, 0.014), (0.053, 0.125), (-0.129, 0.012), (0.043, 0.108), (0.035, 0.177), (0.089, 0.158), (0.037, 0.170), (-0.143, -0.060), (-0.068, -0.018), (0.023, 0.068), (0.042, 0.102), (-0.008, 0.038), (-0.054, 0.067), (-0.166, 0.010), (0.071, 0.207), (-0.097, 0.018), (-0.052, 0.039), (-0.144, -0.061), (0.001, 0.144), (-0.162, -0.014), (-0.089, -0.007), (0.091, 0.280), (-0.050, 0.016), (-0.214, -0.078), (-0.009, 0.054), (0.099, 0.229), (-0.046, 0.031), (-0.030, 0.048), (0.035, 0.080), (0.017, 0.076), (0.025, 0.053), (-0.053, 0.022), (0.045, 0.131), (-0.041, 0.034), (0.041, 0.124), (-0.083, -0.013), (0.036, 0.075), (0.043, 0.130), (-0.089, -0.040), (-0.165, -0.047), (0.022, 0.114), (-0.098, -0.044), (-0.058, -0.021)
Table 6: Gene ontology-based functional annotation of clusters. The cell-cycle phase is approximate, based on Figure 1 in Spellman et al. (1998). "GO": top three gene-ontology-based annotations (p-values in parentheses); "PG": proportion of genes previously known to be involved in a function (based on Figure 7 in Spellman et al. (1998)) that occur in the same group in our model (highlighted in bold). Italicized attributes indicate that the function is known to occur at this phase of the cell cycle (GO-enrichment p-values in parentheses). For example, "DNA repair (7/10)" indicates that 7 out of 10 genes previously known to be involved in DNA repair at this phase of the cell cycle are found in this cluster. Annotations are listed across the gene clusters (k = 1, 2, 3) from the regression mixture model.

Time 1 (M/G1). GO: cell wall (4.8e-13), replication (3.3e-09), nucleosome (8.3e-11), chromosome (1.1e-08), conjugation (8.2e-10), bud (2.2e-08). PG: chromatin (9/9), budding:fatty acids (4/5), cell wall (5/5), mating (8/8).

Time 2 (G1). GO: cell wall (1.4e-12), replication (2.1e-08), bud (4e-11), DNA helicase (4.6e-07), nucleosome (1.3e-10), nuclear nucleosome (2.5e-08). PG: chromatin (9/9), budding:site selection (6/6), cell wall (7/8), nutrition (9/15), budding:fatty acids (4/5).

Time 3 (S). GO: bud (4e-16), polarized growth (5.2e-16), replication (1.4e-14). PG: −, −.

Time 4 (G2/M). GO: replication (5.1e-15), bud (1.2e-13), DNA synthesis (8.7e-13), polarized growth (4.5e-13), DNA replication (9.5e-12). PG: DNA synthesis (14/19), chromatin (9/9), cell cycle control (5/5), nutrition (12/15).

Time 5 (M). GO: cell wall (5.9e-15), chromosome (6.7e-11), bud (1.1e-06), replication (1.7e-10), nucleosome (2.5e-08), polarized growth (9.1e-06), polarized growth (4.8e-09), nuclear nucleosome (2.5e-08), bud neck (1.7e-05). PG: chromatin (7/9), mitosis:microtubules (4/6).

Time 6 (M/G1). GO: DNA synthesis (2.6e-15), microtubule process (1.4e-06), replication (7.5e-15), microtubule cytoskeleton (2.6e-06), chromosome (1.7e-14). PG: DNA repair (7/10), mitosis:microtubules (4/6), DNA synthesis (13/19), chromatin (8/9).

Time 7 (G1). GO: cell wall (6.7e-13), DNA synthesis (1.3e-10), polarized growth (6.7e-12), replication (6.5e-10), bud (1.5e-11), DNA helicase (1e-07). PG: chromatin (9/9), DNA repair (7/10), cell wall (3/3), DNA synthesis (12/19), mitosis:microtubules (5/6), nutrition (10/15).

Time 8 (S). GO: DNA elongation (2e-10), cell wall (4.9e-14), DNA synthesis (3.6e-10), polarized growth (1.8e-13), replication (1.4e-09), bud (2.2e-12). PG: DNA synthesis (14/19), chromatin (9/9), cell wall (6/6), budding:fatty acids (4/5), mating (6/8).

Time 9 (G2/M). GO: bud (9.4e-12), replication (1.8e-08), cell wall (1.4e-11), chromosome (1.5e-07), polarized growth (8.1e-10). PG: chromatin (5/5), budding:glycolysis (5/7).