
Variable selection in regression mixture modeling
for the discovery of gene regulatory networks
Mayetri Gupta and Joseph G. Ibrahim∗
November 6, 2006
Abstract
The profusion of genomic data through genome sequencing and gene expression microarray technology has facilitated statistical research in determining gene interactions
regulating a biological process. Current methods generally consist of a two-stage
procedure: clustering gene expression measurements, and searching for regulatory
“switches”, typically short, conserved sequence patterns (motifs) in the DNA sequence
adjacent to the genes. This process often leads to misleading conclusions as incorrect
cluster selection may lead to missing important regulatory motifs or making many
false discoveries. Treating cluster memberships as known, rather than estimated, introduces bias into the analysis and prevents uncertainty about the cluster assignments from being accounted for. Further,
there is under-utilization of the available data, as the sequence information is ignored
for purposes of expression clustering and vice-versa. We propose a way to address
these issues by combining gene clustering and motif discovery in a unified framework,
a mixture of hierarchical regression models, with unknown components representing
the latent gene clusters, and genomic sequence features linked to the resultant gene expression through a multivariate hierarchical regression. We demonstrate a Monte Carlo
method for simultaneous variable selection (for motifs) and clustering (for genes). The
selection of the number of components in the mixture is addressed by computing the
analytically intractable Bayes factor through a novel multi-stage mixture importance
sampling approach. This methodology is used to analyze a yeast cell cycle dataset
to determine an optimal set of motifs that discriminates between groups of genes and
simultaneously finds the most significant gene clusters.
∗Mayetri Gupta (email: mgupta@unc.edu) is Assistant Professor, and Joseph G. Ibrahim (email:
ibrahim@bios.unc.edu) is Alumni Distinguished Professor, at the Department of Biostatistics, University
of North Carolina at Chapel Hill, NC 27599, U.S.A. The authors thank the editor, associate editor, and two
referees for their helpful comments and suggestions that substantially improved this article. This research
was partially supported by National Institutes of Health grants GM 70335, CA 74015 and Environmental
Protection Agency grant RD-83272001.
KEY WORDS: Transcription regulation; motif discovery; hierarchical model;
evolutionary Monte Carlo; importance sampling; Bayesian model selection.
1 Introduction
The availability of diverse types of genomic data, such as DNA sequence, gene expression
microarray and proteomic data, has led to a rapid growth of statistical research in the effort
to decipher workings of biological processes, which are primarily regulated by interactions
of genes. However, until recently, the analyses of sequence and gene expression data have
been treated as two separate problems, in spite of their intrinsic biological relationship.
The behavior of large numbers of genes (typically thousands in a single experiment) in
an organism is frequently inferred through analyzing mRNA expression from microarrays.
Groups of genes involved in a particular biological process are typically regulated by one
or more transcription factors (TFs) that bind to transcription factor binding sites in the
upstream sequence adjacent to genes (promoters). TF binding sites corresponding to a TF
often show a similar conserved pattern called a motif. A motif pattern of length w is often represented through a 4 × w matrix of probabilities, called a position-specific weight
matrix (PSWM). Each column of the PSWM gives the relative probabilities of observing
any of the 4 letters A, C, G, or T in that position of a binding site. A common strategy
for understanding gene regulation is to look for conserved motifs upstream of genes that
are co-expressed under a given condition. After genes with similar expression patterns
across experimental conditions are grouped together via clustering algorithms, a motif discovery algorithm is often used to search sequences upstream of genes within each cluster.
Several computational algorithms have been developed for motif discovery from upstream
sequences of co-regulated gene clusters (Lawrence and Reilly, 1990; Bailey and Elkan,
1994; Liu et al., 1995, 2002; Thompson et al., 2004).
The co-regulated gene cluster-based motif discovery approach has its pitfalls, since there
may be genes in a cluster lacking a common motif and many gene promoters containing
a similar motif may not show any experimental response, leading to inaccurate predictions. If the clustering is inaccurate, it becomes difficult to discover the correct TF binding
site patterns. Additionally, by assuming the cluster identities as known and fixed, instead
of estimated, bias is introduced into the statistical analysis. One way to overcome such
drawbacks is to combine sequence and expression information in a coherent model to infer
regulatory networks; however, developing such an approach presents significant statistical challenges. In this article, we present an approach that combines information from
expression measurements and sequence through a unified mixture model and develop a
methodology to simultaneously cluster genes and select the most likely TF binding sites.
In one of the first approaches, Holmes and Bruno (2000) used an iterative procedure to
cluster genes using a multivariate normal mixture model, and find motifs in co-regulated
clusters. The sequence and expression parts in this model were considered to be independent of each other. Linear model-based approaches to link expression values with motif
occurrence were introduced by Bussemaker et al. (2001), the underlying assumption being
that the presence of a motif site contributes additively to the expression level of the gene.
To avoid the arbitrariness in deciding whether a given segment is a binding site, Conlon
et al. (2003) propose a multiple linear regression relationship between the logarithm of the
differential expression values and a sequence motif “score”. Tadesse et al. (2004) present a
Bayesian version of this procedure with significant regulatory motifs being chosen through
Bayesian variable selection. However, none of these latter methods address the simultaneous determination of co-regulated gene clusters (assumed to be fixed in advance) and
the motifs involved in their regulation. Barash and Friedman (2002) use sequence and expression data for a somewhat different goal, to determine functional gene clusters rather
than de-novo motif discovery. Starting with known motif lists, they estimate a Bayesian
network-based clustering of genes using conditional models to link sequence with gene
expression, through an approximate EM-like procedure. However, parametric complexity
and a lack of standard model validation approaches (due to incomplete specification of a
joint model) hamper the effectiveness of this approach for moderately large datasets.
From the methodological standpoint, the problems of simultaneous clustering and variable selection in the Bayesian context have been considered in the multivariate normal
set-up (Liu et al., 2003; Tadesse et al., 2005). However, to the best of our knowledge, there
has been limited study of such a scenario in the mixed effects model or the linear regression framework, which has some unique problems (Hennig, 1999). The variable selection
problem with a large number of predictors, on the other hand, is one under considerable
current study (George and McCulloch, 1993; Liang and Wong, 2000).
In this paper, we propose a mixture of hierarchical regression models to simultaneously
address the clustering of co-regulated genes and the determination of the set of transcription factors regulating each gene cluster. A hierarchical prior framework is formulated for modeling
intra-cluster gene correlations. We develop an efficient Monte Carlo algorithm for iterative
clustering of genes and selection of significant motifs. This approach succeeds in linking regulatory motifs to the resulting gene expression without a huge parameter estimation
cost. Additionally, it can uncover relationships that cannot be discovered using a single
linear model, for example the same motif acting as a positive and negative regulator on
separate groups of genes under the same condition. The outline of the paper is as follows.
In Section 2, we describe the motivating yeast data set that led to this application. In Section 3,
we introduce the unified hierarchical mixture model linking gene expression measurements
with promoter sequence, and indicate the biological interpretation of such a model. In Section 4, we describe a Markov chain Monte Carlo (MCMC) procedure for simultaneous
clustering of genes and selection of motifs (covariates). In Section 5, we present computational tools for calculating Bayesian model selection criteria in the regression mixture
framework and in Section 6, apply our methodology on the yeast data and evaluate the
performance of the method through a number of simulation studies.
2 Yeast cell-cycle data set
The motivating data set for this application is from a yeast cell-cycle microarray experiment (Spellman et al., 1998). This data set has repeatedly been used to examine aspects of
cell cycle gene regulation, e.g. Barash and Friedman (2002); Conlon et al. (2003). cDNA
microarrays were used to study samples from yeast cultures synchronized by three independent methods: (a) alpha-factor arrest, (b) elutriation, and (c) arrest of a cdc15 temperature-sensitive mutant. For purposes of this analysis, we concentrate on the first experiment
which has observations over two full cell-cycles and also the largest number of complete
measurements. Yeast strain DBY8724 was grown in a glucose solution, an alpha-factor was
added and 25-ml RNA samples were taken every 7 minutes for 120 minutes, after which
the alpha-factor was removed by a centrifugation method. The samples were allowed to
hybridize to cDNA microarrays for 4-6 hours, after which the expression levels of nearly
6000 genes over 18 time points (spanning two hours) were recorded. The value for each
gene was recorded as the logarithm (base 2) of the ratio of expression of that gene in the
sample to a baseline control measurement. After scanning the microarray with a laser microscope, the data set was preprocessed by computing the fluorescence ratios through a
local background correction, estimated by the intensities of the weakest 12% of the pixels
in each box. Data for each gene were normalized so that the average log-ratio over the
course of the experiments was equal to zero.
The first goal of Spellman et al. (1998) was to identify the genes regulated in a cell cycle-dependent manner. A Fourier transform was applied to the data series for each gene, and
the resultant expression profile of each gene was then correlated to five different profiles
representing genes known to be expressed in the five phases of the cell-cycle: G1, S, G2,
M, and M/G1, using the standard Pearson correlation function. The profiles for known
gene classes were identified by averaging the log-ratio data for each of the genes known to
peak in each of the five phases. Genes were ranked by the correlation scores, and all genes
whose scores exceeded the threshold score (91st percentile of known cell cycle genes) were
classified as being cell-cycle regulated. This led to a total of 800 genes, which included
496 genes not previously identified to be involved in cell cycle regulation.
For our analysis, we downloaded the complete microarray data set from the Stanford
Microarray Database (Ball et al., 2005). Among the measurements for the 800 cell-cycle
genes, a number of genes were discarded due to completely missing or suspicious-looking data, resulting in a final set of 612 genes measured over 18 time points (Figure 1). In
addition, we extracted 600 bp of the upstream promoter corresponding to each gene from
the Saccharomyces cerevisiae Promoter Database (SCPD; Zhu and Zhang (1999)) for the
purpose of motif discovery. Among our questions of interest are: can we improve the
prediction of TF binding sites, by using a joint model approach instead of a two-step procedure, or one assuming all genes in a single cluster? In addition, we are interested in
determining whether the sequence information can provide an improved estimate of which
genes are involved in particular functions along the pathway of cell cycle regulation, in
comparison to considering the gene expression information alone.
[Figure 1 about here]
3 Regression mixture model for sequence and expression
Now we introduce a regression mixture-based joint model linking the sequence and expression data. First, we summarize the notation used. We denote the gene expression measurements for $G$ genes at $T$ time points by $\mathbf{Y} = ((Y_{ig}))$, $i = 1, \ldots, T$, $g = 1, \ldots, G$. Corresponding to each gene $g$, let the upstream sequence (of length $L_g$) be $\mathbf{x}_g = \{x_1, \ldots, x_{L_g}\}$,
and let the set of $D$ potential motif candidates be characterized by $D$ position-specific
weight matrices (PSWMs) $\{\Theta_1, \ldots, \Theta_D\}$. Assume that we have a score function that
"scores" every upstream region with respect to every weight matrix, resulting in a $G \times D$
matrix $S = ((S_{gd}))$. The entry $S_{gd}$ reflects the propensity of sequence $g$ to be bound by the
transcription factor corresponding to motif type $d$. More details on the scoring function are
provided in Section 3.3.
Now, we introduce a mixture model framework for the gene expression clusters. Let
πk (k = 1, . . . , K) denote the prior probability of belonging to cluster k. Assume that Y is
generated from a K-component mixture distribution with
\[
p(\mathbf{Y}_g \mid \text{all parameters}) = \sum_{k=1}^{K} \pi_k \, p_k(\mathbf{Y}_g \mid \text{all parameters}), \qquad (1)
\]
where pk (·) denotes the probability density for the k th component. The gene expression
measurements are linked to the corresponding motif scores through a regression model. Let
u = (u1 , . . . , uD )T be a binary vector with ui = 1(0) if motif i is involved (not involved)
in regulating expression. Conditional on the unobserved cluster membership z g , the gene
expression measurement Y g is modeled as:
\[
\mathbf{Y}_g \mid z_g = k, \xi, S, \beta, \sigma_k^2 \;\sim\; N_T\!\left(\xi_g + S_{g(u)}' \beta_{k(u)} \mathbf{1}_T, \; \sigma_k^2 I\right), \qquad (2)
\]
where $\xi_g$ is a $T$-dimensional parameter vector representing the variation at the $T$ time points;
$\beta_{k(u)}$ is the subset of regression coefficients $\beta_k$ corresponding to the TFs indexed by $u$; and $\sigma_k^2$
is the variance of expression measurements corresponding to the $k$th cluster ($\mathbf{1}_T$ denotes a
$T$-dimensional vector of ones). Throughout the article, $N_p(\cdot, \cdot)$ denotes a $p$-dimensional
multivariate normal distribution. The model specification (1) and (2) can be viewed as a
mixture of random effects models. Although the following discussion relates to the development of this model and inference under the normality assumptions stated in (2), the framework and methodology can be extended to a variety of distributional assumptions.
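To make the generative structure of (1) and (2) concrete, the following is a minimal simulation sketch in Python; the dimensions, hyperparameter values, and the random stand-in for the motif score matrix S are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and values (assumptions, not the paper's settings)
G, T, D, K = 200, 18, 10, 3                      # genes, time points, motifs, clusters
tau0 = 0.1
sigma2 = np.array([0.1, 0.2, 0.15])              # per-cluster variances sigma_k^2

pi = np.array([0.4, 0.35, 0.25])                 # mixing weights pi_k
mu = rng.normal(size=(K, T))                     # cluster-level mean profiles mu_k
beta = rng.normal(scale=2.0, size=(K, D))        # cluster-specific coefficients beta_k
S = np.abs(rng.normal(size=(G, D)))              # stand-in motif score matrix S (G x D)

z = rng.choice(K, size=G, p=pi)                  # latent cluster labels z_g
Y = np.empty((G, T))
for g in range(G):
    k = z[g]
    # gene-level random effect xi_g ~ N_T(mu_k, tau0 * sigma_k^2 I)
    xi = mu[k] + rng.normal(scale=np.sqrt(tau0 * sigma2[k]), size=T)
    # model (2): Y_g ~ N_T(xi_g + (S_g' beta_k) 1_T, sigma_k^2 I)
    Y[g] = xi + S[g] @ beta[k] + rng.normal(scale=np.sqrt(sigma2[k]), size=T)
```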
The mathematical formalism here is intended mainly to derive insights into the biological relationships between genes and their regulatory motifs, rather than to represent the true
(unknown) biological picture. However, the regression mixture framework is an intuitively
appealing construct in this application, since (i) it reflects the notion that if different groups
of genes are differently regulated, then the regression relationships should likewise differ
between such “clusters” and (ii) the regression model assumes that upstream sequences of
genes having distinctive expression patterns are likely to be regulated by certain TFs, and
hence are more likely to contain more (and “stronger”) binding sites for these TFs.
Estimation of parameters under the model set-up (1) and (2), with simultaneous selection of regressor variables and clusters, is a challenging task in likelihood-based inference.
By adopting a Bayesian framework, we ensure that such a model can be fitted, and demonstrate that meaningful biological constraints can be incorporated into the model by a careful
choice of the prior distributions. For example, we show in the following section that intra-
cluster gene correlations can be incorporated into the hierarchical model without resorting
to a huge expansion in parameters and avoiding estimability problems with small datasets.
An implicit assumption made is that dependence between expression measurements over t
different points (t = 1, . . . , T ) is due to either cluster or sequence effects.
3.1 Hierarchical prior framework
The prior distributions of the parameters in the gene expression model are specified conditionally on the gene expression cluster, with
\[
\begin{aligned}
\xi_g \mid z_g = k, \mu_k, \sigma_k^2 &\sim N_T(\mu_k, \, \tau_0 \sigma_k^2 I), \qquad g = 1, \ldots, G, \\
\mu_k &\sim N_T(m_0, V_0), \\
1/\sigma_k^2 &\sim \mathrm{Gamma}(w_0/2, \, S_0/2),
\end{aligned}
\qquad (3)
\]
where $V_0 = ((v_{ij}^{(k0)}))$ is the $T \times T$ prior covariance matrix. Possible choices of $V_0$ and
the other hyperparameters $\tau_0$, $m_0$, $w_0$ and $S_0$ are discussed in Section 4.1. We also assume
that $\pi = (\pi_1, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha)$, where $\alpha = (\alpha_1, \ldots, \alpha_K)$. The model specification
in (3) ensures two desirable properties: (i) genes do not borrow strength across clusters;
however, genes within a cluster may borrow strength from each other (through the a priori
specified prior covariance matrix $V_0$); (ii) the a priori correlation between measurements
on a gene at two time points is the same as the correlation between different genes in the
same cluster at those time points. More precisely, it can be shown that
\[
\mathrm{Corr}(\xi_{gi}, \xi_{gj} \mid z_g = k, \sigma_k^2)
= \frac{v_{ij}^{(k0)}}{\sqrt{v_{ii}^{(k0)} + \tau_0 \sigma_k^2}\,\sqrt{v_{jj}^{(k0)} + \tau_0 \sigma_k^2}}
= \mathrm{Corr}(\xi_{gi}, \xi_{g'j} \mid z_g = k, z_{g'} = k, \sigma_k^2),
\]
while
\[
\mathrm{Corr}(\xi_{gi}, \xi_{g'i} \mid z_g = k, z_{g'} = k, \sigma_k^2)
= \frac{v_{ii}^{(k0)}}{v_{ii}^{(k0)} + \tau_0 \sigma_k^2}.
\]
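As a numerical illustration of the correlation structure implied by (3), the short sketch below evaluates the two formulas above for an arbitrary choice of V_0, tau_0 and sigma_k^2; all numerical values are assumptions used only for illustration.

```python
import numpy as np

# Illustrative hyperparameter values (assumptions)
T = 4
tau0, sigma2_k = 0.1, 0.2
V0 = 0.5 * np.eye(T) + 0.2            # an arbitrary positive-definite T x T prior covariance

i, j = 0, 2
# Corr(xi_gi, xi_gj | z_g = k) = Corr(xi_gi, xi_g'j | z_g = z_g' = k)
corr_within_and_between = V0[i, j] / np.sqrt((V0[i, i] + tau0 * sigma2_k) *
                                             (V0[j, j] + tau0 * sigma2_k))
# Corr(xi_gi, xi_g'i | z_g = z_g' = k): same time point, different genes in the cluster
corr_same_timepoint = V0[i, i] / (V0[i, i] + tau0 * sigma2_k)
print(corr_within_and_between, corr_same_timepoint)
```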
3.2 Generalized g-prior for high-dimensional covariates
For the regression coefficient of the sequence model, we assume the standard conjugate
prior form $\beta_k \sim N_D(\beta_0, \Sigma_\beta)$. Let $|u|$ denote the number of weight matrices, out of a
possible $D$, that are included in the model, i.e. $|u| = \sum_{d=1}^{D} 1_{[u_d = 1]}$ denotes the cardinality
of $u$. We assume $u_i \sim \mathrm{Bernoulli}(\eta)$ and specify a conjugate prior, $\mathrm{Beta}(a_1, a_2)$, for $\eta$.
Our choice of the prior for $\beta_k$ is motivated by the feasibility of computation of the
marginal posterior in the variable selection step. In the regression-like framework, it
may be appropriate to use a multivariate generalization of the g-prior (Zellner, 1986).
For the linear regression model $\mathbf{Y} = X\beta + \epsilon$, $\epsilon \sim N_n(0, \sigma^2 I)$, the g-prior for $\beta$ is
$N_p(\beta_0, c\sigma^2 (X'X)^{-1})$, where $c$ is a specified scalar. In the current set-up, since the cluster
identities are unknown, it is not possible to get a closed form expression for the g-prior.
Let us write $X_* = \mathrm{stack}(X) = [X_1', \ldots, X_G']'$, where $X_i = \mathbf{1}_T S_i'$, $S_i' = (S_{i1}, \ldots, S_{iD})$.
Ignoring the class labels, we can get an approximate expression for the variance of the g-prior based on the whole design matrix, as $c\sigma_k^2 (X_*'X_*)^{-1} = c\sigma_k^2 \big[T \sum_{g=1}^{G} S_g S_g'\big]^{-1}$. This
covariance matrix, however, becomes nearly singular for high-dimensional covariates, or
covariates which are highly collinear, both of which are common occurrences when we are
dealing with large numbers $D$ of motif covariates. Taking such a prior covariance matrix
then leads to MCMC convergence problems. Hence, we propose a modified form of the
g-prior, motivated from ideas of ridge regression (Hoerl and Kennard, 1970). We take the
prior distribution for $\beta_k$ $(k = 1, \ldots, K)$ as
\[
\beta_k \sim N_D\!\left(\beta_0, \; c\sigma_k^2 \Sigma_\beta\right), \quad \text{where } \Sigma_\beta = (X_*'X_* + \lambda I)^{-1} = \Big(T \sum_{g=1}^{G} S_g S_g' + \lambda I\Big)^{-1}, \qquad (4)
\]
where $\lambda$ is a specified scalar similar to the ridge parameter in ridge regression. This form
of the prior simultaneously stabilizes the prior (and posterior estimation) of the regression
coefficients, while possessing operating characteristics essentially identical to those of the usual g-prior when high-dimensionality and collinearity issues are not present.
For instance, $(X_*'X_* + \lambda I)$ is necessarily non-singular for $G < D$ (and $\lambda > 0$), whereas
the original matrix $X_*'X_*$ is necessarily singular. The "ridge" parameter $\lambda$ is generally
chosen within a range of values between 0 and 1 (Hoerl and Kennard, 1970) that leads
to maximum stability of the estimated coefficients. In our simulations and application, we
tried using a full Bayesian model with a uniform prior on $\lambda$, as well as setting different
fixed values of $\lambda$ between 0 and 1. Empirical evidence suggests that a fixed value of $\lambda$
leads to more stable and less variable estimates, with $\lambda$ values in the range of 0.5 to 1 performing essentially equally well. The bias in estimates introduced by $\lambda$ turns out to be
negligible, especially for large values of $c$. By choosing a sufficiently large $c$, we can make
the prior suitably non-informative (diffuse) with respect to the likelihood. Also, the choice
of $c$ determines the a priori covariance of gene expression measurements in a cluster, since
\[
\mathrm{Cov}(\mathbf{Y}_g, \mathbf{Y}_h \mid z_g, z_h = k, \sigma_k^2)
= V_0 + \frac{c\sigma_k^2}{T} \, \mathbf{1}_T \, S_g' \Big(\sum_{i=1}^{G} S_i S_i'\Big)^{-1} S_h \, \mathbf{1}_T'.
\]
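A minimal sketch of the ridge-modified prior covariance in (4), assuming a random score matrix with more motif covariates than genes so that T * sum_g S_g S_g' is singular; the dimensions and the value of lambda are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
G, D, T = 30, 60, 18                   # fewer genes than motif covariates (illustrative)
lam = 1.0                              # ridge parameter lambda, chosen in (0, 1]
S = rng.normal(size=(G, D))            # stand-in G x D motif score matrix

XtX = T * S.T @ S                      # X_*' X_* = T * sum_g S_g S_g'
print(np.linalg.matrix_rank(XtX))      # at most G < D, hence singular without the ridge
Sigma_beta = np.linalg.inv(XtX + lam * np.eye(D))   # Sigma_beta in (4); invertible for lambda > 0
```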
3.3 Score function for motif conservation
The set of $D$ position-specific weight matrices $\Theta = (\Theta_1, \ldots, \Theta_D)$ is used to represent
potential TF binding site pattern candidates. Each of the upstream sequences is "scored"
with respect to each weight matrix $\Theta_i$ $(i = 1, \ldots, D)$, so that we have a $G \times D$ matrix
of gene-sequence scores $S = (S_1, \ldots, S_G)$, where $S_g = (S_{g1}, \ldots, S_{gD})$ is the sequence
score vector for the $g$th gene. The score for a sequence $\mathbf{x}_g$ and a weight matrix $\Theta_i$ is taken
to be the likelihood ratio between the sequence $\mathbf{x}_g$ being or not being regulated by the
transcription factor corresponding to the weight matrix $\Theta_i$. In Conlon et al. (2003), the
score of weight matrix $i$ for sequence $g$ is
\[
S_{gi} = \sum_{j=1}^{L_g - w_i + 1} \frac{P(\{x_j, \ldots, x_{j + w_i - 1}\} \mid \Theta_i)}{P(\{x_j, \ldots, x_{j + w_i - 1}\} \mid \theta_0)}, \qquad (5)
\]
where $\theta_0$ denotes the parameter set characterizing the background distribution (sequence
not containing motifs). For an i.i.d. background, $\theta_0$ represents the probabilities of the
four nucleotides; under a Markovian assumption, it denotes the transition probabilities. (5)
represents the likelihood ratio between the sequence containing a single motif site and a
null model containing no motif (up to a constant of proportionality).
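The sketch below computes the score (5) for a single sequence and a single PSWM under an i.i.d. background; the example weight matrix, background frequencies, and the function name motif_score are made up for illustration.

```python
import numpy as np

def motif_score(seq, pswm, bg):
    """Likelihood-ratio score (5): sum over all windows of P(window | PSWM) / P(window | background)."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    w = pswm.shape[1]
    total = 0.0
    for j in range(len(seq) - w + 1):
        window = seq[j:j + w]
        num = np.prod([pswm[idx[b], pos] for pos, b in enumerate(window)])
        den = np.prod([bg[idx[b]] for b in window])
        total += num / den
    return total

# Illustrative 4 x w weight matrix (columns sum to one) and i.i.d. background frequencies
pswm = np.array([[0.7, 0.1, 0.1, 0.8],
                 [0.1, 0.1, 0.7, 0.1],
                 [0.1, 0.7, 0.1, 0.05],
                 [0.1, 0.1, 0.1, 0.05]])
bg = np.array([0.3, 0.2, 0.2, 0.3])
print(motif_score("ACGATTACGA", pswm, bg))
```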
4 Clustering with variable selection for fixed K
Under the model specifications discussed in Section 3, we construct a Monte Carlo-based
estimation strategy. For now, we assume the total number of clusters specified as K. The
full conditional posterior distribution of the parameters is given by
\[
p(\xi, \mu, \sigma_k^2, \beta, \beta_0 \mid \mathbf{Y}, S, z, u, K) \;\propto\;
\prod_{k=1}^{K} \Bigg[ \prod_{g: z_g = k} P(\mathbf{Y}_g;\, \xi_g + S_{g(u)}'\beta_{k(u)}\mathbf{1},\, \sigma_k^2 I)\,
P(\xi_g;\, \mu_k, \tau_0\sigma_k^2 I)
\times P(\mu_k;\, m_0, V_0)\, P(\sigma_k^2;\, w_0, S_0)\, P(\beta_{k(u)};\, \beta_{0(u)}, \Sigma_{\beta(u)}) \Bigg],
\]
where the subscript (u) corresponds to the subset of variables indexed by the binary vector
u; thus Σβ(u) denotes the |u| × |u| submatrix corresponding to the selected u. In this
framework, we are interested in making inference about πk , µk , β k and σk2 for each cluster,
given all the gene expression measurements Y and sequence motif scores S. The additional
complication is that the cluster membership (z) and active TF set (u) are both unknown.
We develop an MCMC framework to estimate the parameters, which alternates between three
main steps (technical details are in Appendices A-C):
(i) Selection of variables (motifs) given clustering of genes.
We update [u|Y , S, z, K, all parameters]. Since we often need to select a subset of
motifs from a very large set (e.g. D ≈ 100), the evolutionary Monte Carlo (EMC)
method (Liang and Wong, 2000) is used for this step. To make the EMC step more
efficient, we marginalize out most of the parameters from the posterior distribution.
(ii) Updating clusters given the selected variables.
Here we draw the cluster indicators [z|Y , S, u, K, parameters] from their full conditional posterior distribution.
(iii) Updating parameters from their posterior distributions, i.e. [µk , σk2 , β k , π|Y , S, z, u, K]
(for 1 ≤ k ≤ K).
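As an illustration of step (ii), the following sketch draws the cluster indicators from their full conditional, assuming ξ_g has been integrated out so that Y_g | z_g = k is normal with mean µ_k + (S'_{g(u)}β_{k(u)})1_T and covariance (1 + τ_0)σ_k² I (the (1 + τ_0) factor follows the appendix expressions); the function interface and the boolean encoding of u are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def draw_cluster_indicators(Y, S, u, mu, beta, sigma2, pi, tau0, rng):
    """Step (ii): draw each z_g from its full conditional.

    Y: G x T expression matrix, S: G x D scores, u: length-D boolean inclusion vector,
    mu: K x T cluster means, beta: K x D coefficients, sigma2: length-K variances,
    pi: length-K mixing weights.  All are current values from the other MCMC steps.
    """
    G, T = Y.shape
    K = len(pi)
    z = np.empty(G, dtype=int)
    for g in range(G):
        logp = np.empty(K)
        for k in range(K):
            mean = mu[k] + (S[g, u] @ beta[k, u])          # mu_k + (S_g' beta_k) 1_T
            cov = (1.0 + tau0) * sigma2[k] * np.eye(T)
            logp[k] = np.log(pi[k]) + multivariate_normal.logpdf(Y[g], mean, cov)
        prob = np.exp(logp - logp.max())                   # normalize on the log scale
        z[g] = rng.choice(K, p=prob / prob.sum())
    return z
```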
4.1 Choice of prior hyperparameters and starting values
The prior hyperparameters consist of the scalars $\tau_0$, $c$, $w_0$, the vectors $m_0$ $(k = 1, \ldots, K)$ and $\beta_0$,
and the $T \times T$ matrices $S_0$ and $V_0$ $(k = 1, \ldots, K)$. Proper priors are needed in the model in
order to ensure proper posterior distributions; however, by taking sufficiently large tuning
parameters, we try to keep the priors non-informative. It is essential to base the prior choice
independently of the initial cluster specification. Sensitivity analyses were conducted over
a range of values of the hyperparameters that demonstrated robustness to prior specifications (see Section 6.1). We chose the following prior settings for the hyperparameters: $m_0 = (0, \ldots, 0)'$; $V_0 \propto c_0\, \mathrm{diag}(v_{11}, \ldots, v_{TT})$, where $v_{ii} = \frac{1}{G}\sum_{g=1}^{G} (Y_{gi} - \bar{Y}_i)^2$ and
$\bar{Y}_i = \frac{1}{G}\sum_{g=1}^{G} Y_{gi}$. Also, we set $S_0 = \big[\frac{1}{G}\sum_{g=1}^{G} (\mathbf{Y}_g - \bar{\mathbf{Y}})(\mathbf{Y}_g - \bar{\mathbf{Y}})'\big]^{-1}$, and $\beta_0 = \mathbf{0}$. To
ensure proper prior distributions, we choose the degrees of freedom $w_0 > T + 1$. The scale
parameters $(\tau_0, c)$ are chosen to be sufficiently large (in our applications, above 100) to
ensure non-informativeness of the prior distributions of $\mu$ and $\beta$, and $c_0$ was chosen in the
range 0.1 to 10. If historical information is available, it can also be used to elicit priors.
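A small sketch assembling the data-driven hyperparameter choices just described (zero prior mean, V_0 proportional to the diagonal of sample variances, and the sample covariance from which S_0 is set); the helper name and the default c_0 are illustrative, and the exact transformation from the sample covariance to S_0 should be taken from the text.

```python
import numpy as np

def empirical_hyperparameters(Y, c0=1.0):
    """Data-driven prior settings sketched from Section 4.1.

    Y is the G x T matrix of expression measurements.
    Returns m0, V0, and the sample covariance used to set S_0 (see text).
    """
    G, T = Y.shape
    m0 = np.zeros(T)                              # m_0 = (0, ..., 0)'
    Ybar = Y.mean(axis=0)                         # Ybar_i = (1/G) sum_g Y_gi
    v = ((Y - Ybar) ** 2).mean(axis=0)            # v_ii = (1/G) sum_g (Y_gi - Ybar_i)^2
    V0 = c0 * np.diag(v)                          # V_0 proportional to diag(v_11, ..., v_TT)
    Sigma_hat = (Y - Ybar).T @ (Y - Ybar) / G     # (1/G) sum_g (Y_g - Ybar)(Y_g - Ybar)'
    return m0, V0, Sigma_hat
```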
5 Determining the number of mixture components K
Earlier, it was assumed that the total number of gene clusters K was fixed and known.
However, in most cases, we need to assess the statistical significance of a choice of K.
Here we formulate the choice of K as a model selection problem between competing models M1 , M2 , . . . , MKmax where Kmax is a suitably chosen upper bound for the possible
number of clusters. The Bayes factor between two models $M_k$ and $M_l$ is the ratio of
marginal likelihoods given by
\[
BF_{kl} = \frac{P(\mathbf{Y} \mid M_k)}{P(\mathbf{Y} \mid M_l)}
= \frac{\sum_{z, u} \int_{\theta} P(\mathbf{Y} \mid z, u, \theta, M_k)\, P(z, u, \theta \mid M_k)\, d\theta}
       {\sum_{z, u} \int_{\theta} P(\mathbf{Y} \mid z, u, \theta, M_l)\, P(z, u, \theta \mid M_l)\, d\theta}, \qquad (6)
\]
where $(z, u, \theta)$ is generic notation for the missing data (cluster identity and variable indices) and unknown parameters in the model. The main difficulty in computing (6) is the
summation over all $i^G \times 2^D$ possible values of $(z, u)$ for each model $M_i$ $(i = k, l)$. In
the problem of simultaneously selecting the number of clusters and variables in a multivariate normal mixture framework, Tadesse et al. (2005) used the reversible jump (Green,
1995) method for selecting K along with other model parameters, embedding the problem
in a larger problem of variable dimensions. With the additional complexity in our model,
although reversible jump is a good idea in principle, in practice it appears highly unstable
and possibly inefficient, as high-dimensionality and multimodality in the likelihood typically lead to high rejection rates when jointly sampling the large parameter space.
We thus adopt an alternative approach. When the integral over $\theta$ can be evaluated analytically, Steele et al. (2003) propose a computational method to calculate (6) using importance sampling. More specifically, the marginal density $P(\mathbf{Y} \mid M_k)$ in (6) can then be written as $I = \sum_{z} P(\mathbf{Y} \mid z, M_k) P(z \mid M_k)$, and estimated by
\[
\hat{I}_{IS} = \frac{1}{T} \sum_{t=1}^{T} P(\mathbf{Y} \mid z^t, M_k)\, \frac{P(z^t \mid M_k)}{g(z^t)},
\]
where $g(\cdot)$ is an importance sampling distribution for $z$. In our model, we have two additional complications: (i) the variables selected ($u$) in the regression part of the model under
different $K$ may vary, and (ii) marginalizing the mean parameter $\mu$ in $P(\mathbf{Y}, \theta \mid z, M_k)$ is
intractable. We numerically integrate over the joint distribution of the discrete variables $u$
and $z$ under every model $M_k$. We next propose a multi-stage importance sampling procedure. (The following expressions are assumed to be conditional on the current choice of
model $M_k$, so we drop $M_k$ from the expressions for notational simplicity.) Note that under $M_k$, $P(\mathbf{Y} \mid u) = \sum_{z} \int_{\mu} P(\mathbf{Y} \mid u, z, \mu)\, P(\mu \mid z)\, p(z)\, d\mu$. Then, for any $u$, we calculate
$H(u \mid z, \mu) = P(\mathbf{Y} \mid u, z, \mu)$ (details in Appendix A), so that $P(\mathbf{Y})$ can be estimated by
\[
\frac{1}{T_u \times T_\mu \times T_z} \sum_{i=1}^{T_z} \sum_{j=1}^{T_\mu} \sum_{k=1}^{T_u}
H(u^k \mid z^i, \mu^j)\,
\frac{p_1(\mu^j \mid z^i)\, p_2(z^i)\, p_3(u^k)}{g_1(\mu^j \mid z^i)\, g_2(z^i)\, g_3(u^k)},
\]
where $p_1(\cdot)$, $p_2(\cdot)$, $p_3(\cdot)$ are the densities of $\mu$, $z$, and $u$ under model $M_k$; $g_1(\cdot)$, $g_2(\cdot)$,
$g_3(\cdot)$ are the corresponding importance sampling densities; and $T_u$, $T_\mu$, $T_z$ are the numbers
of samples drawn respectively from the importance sampling densities ($\mu$ and $z$ are a priori
independent of $u$). Since $\mu^j \mid z$ is a priori distributed as $N_T(m_0^j, V_0)$, a product of scaled
multivariate $t$-distributions, $\prod_{j=1}^{k} t(\,\cdot\,; m_0^j, V_0, \nu)$, may be appropriate as the importance
sampling distribution for $\mu \mid z$.
For the cluster indicator $z$, we use an importance sampling density of the form $g_2(z) =
\delta p(z \mid K) + (1 - \delta) g^*(z)$, where $p(z \mid K)$ denotes the prior, $0 < \delta < 1$, and $g^*(z)$ is a sampling
function that covers important parts of the space. Sampling solely from the prior $p(z \mid K)$
is highly likely to result in many observations that should be in the same group not being
clustered together, and vice versa. By using the information contained in $P(z \mid \mathbf{Y}, S, \hat{\theta})$,
the posterior probability matrix of group allocation, to determine preliminary groupings
of observations, we can apply a Dirichlet-Multinomial sampling function to each of these
groups individually (see Appendix D for details). A weak dependency is thus generated
among observations which have high posterior probabilities of belonging to the same group.
The Dirichlet-multinomial distribution samples parts of the space other than the mode,
guarding against large importance sampling weights. The sampling distribution for u is
chosen similarly, as a mixture of the prior density and a density likely to sample from
high posterior density regions. Let φ = (φ1 , . . . , φD ) be the set of marginal posterior
probabilities of selection of variables, i.e. φj = P (uj = 1|Y , S), obtained from the
MCMC sample. We take g3 (u) = δpB (u|η) + (1 − δ)g ∗ (u|φ), where pB (u|η) denotes
the prior Bernoulli density with hyperparameter η, and g ∗ (u) is taken to be a product of
Bernoulli densities where the j-th density has parameter φj (1 ≤ j ≤ D).
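To illustrate the defensive-mixture proposal for u, the sketch below draws an inclusion vector from g_3(u) = δ p_B(u | η) + (1 − δ) ∏_j Bernoulli(φ_j) and returns the importance ratio p_3(u)/g_3(u); the values of δ, η and φ, and the function name, are placeholders.

```python
import numpy as np

def sample_u_and_weight(phi, eta, delta, rng):
    """Draw u from g_3(u) = delta * p_B(u | eta) + (1 - delta) * prod_j Bern(phi_j)
    and return u with the importance ratio p_3(u) / g_3(u)."""
    D = len(phi)
    if rng.random() < delta:                       # defensive component: the Bernoulli(eta) prior
        u = rng.random(D) < eta
    else:                                          # data-driven component using phi_j = P(u_j = 1 | Y, S)
        u = rng.random(D) < phi
    p3 = np.prod(np.where(u, eta, 1.0 - eta))      # prior density p_B(u | eta)
    g_star = np.prod(np.where(u, phi, 1.0 - phi))  # product-Bernoulli density at u
    g3 = delta * p3 + (1.0 - delta) * g_star
    return u, p3 / g3

rng = np.random.default_rng(2)
phi = np.array([0.9, 0.1, 0.8, 0.05, 0.5])         # placeholder marginal inclusion probabilities
u, weight = sample_u_and_weight(phi, eta=0.1, delta=0.3, rng=rng)
```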
6 Applications
6.1 Case study 1: A simulated data set
Simulation studies were conducted to test the performance of our method. Position specific
weight matrices (PSWMs) corresponding to ten known transcription factors (TFs) were
collected from the yeast TF database (Zhu and Zhang, 1999). Next, three “groups” of
sequences (emulating gene promoters) were constructed by extracting random sequences
from yeast intergenic regions, and inserting one or more motif sites into sequences of each
group according to the following rules: (i) motif types 1 to 5 in group 1, (ii) motif types
6 to 10 in group 2, and (iii) motif types 1, 2, 3, 8, 9, 10 in group 3. Vectors of gene
expression scores for genes in each group were then generated according to the model
$N_2(S_g'\beta_k, \sigma_k^2 I)$, where $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 0.1$, and the group-specific regression coefficients
were as follows: β 1 = 2 × (1, 1, 1, 1, 1, 0, 0, 0, 0, 0), β 2 = 2 × (0, 0, 0, 0, 0, 1, 1, 1, 1, 1), and
$\beta_3 = -2 \times (1, 1, 0, 0, 0, 0, 0, 0, 1, 1)$. The choice of regression coefficients was made so
that motifs 1 to 5 had positive effects in group 1, motifs 6 to 10 had similar effects in group 2, while
in group 3, motifs 1, 2, 9, 10 had significant negative effects, whereas sites corresponding
to motifs 3 and 8, although present, did not have a significant effect on gene expression.
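A sketch of the simulation design just described; random positive scores stand in for the actual PSWM scoring step, and the scalar mean S_g'β_k is applied to both time points, which is how we read the N_2 notation above.

```python
import numpy as np

rng = np.random.default_rng(3)
n_per_group, D, T = 100, 10, 2                     # illustrative group size; 10 motifs, T = 2
sigma2 = 0.1
beta = np.array([2.0 * np.r_[np.ones(5), np.zeros(5)],        # beta_1
                 2.0 * np.r_[np.zeros(5), np.ones(5)],        # beta_2
                 -2.0 * np.r_[1, 1, 0, 0, 0, 0, 0, 0, 1, 1]]) # beta_3

Y, groups = [], []
for k in range(3):
    S = np.abs(rng.normal(size=(n_per_group, D)))  # stand-in for standardized motif scores
    mean = S @ beta[k]                             # S_g' beta_k, used for both time points
    Y.append(mean[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(n_per_group, T)))
    groups.append(np.full(n_per_group, k))
Y = np.vstack(Y); groups = np.concatenate(groups)
```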
Next, to evaluate the performance of the method in the presence of noise, we simulated
“scores” for 50 random NT (·, ·) variables, uncorrelated with the gene expression. Iterative
fitting of the joint sequence-expression model was done for each data set in turn, with a
total of D = 30, 40, 50, 60 variables (the first 10 in each case being the “true” covariates).
To judge convergence of the sampler, the R̂ statistic (Gelman and Rubin, 1992) was computed for all scalar parameters of interest using 5 runs of the sampler with different starting
points. For about 50,000 iterations of the sampler (excluding the initial 500 as burn-in), the
maximum value of R̂ over all parameters averaged 1.003, indicating that the chains can be
approximately assumed to have converged.
The k-means algorithm was used on the expression data to get starting values of cluster
identity (for K = 2, . . . , 6). The Bayes factor was found to select the correct cluster
count, K = 3, in each case (second panel of Table 3). For comparison, we also computed
the Bayes factor using the marginal density approach of Chib (1995). Although this can
often be calculated directly from the MCMC output, for the missing data framework of
the regression mixture model, it requires at four sets of extra samples from the conditional
distributions of the parameters. Additionally, the two levels of missing data (mixture components and covariate indicator variable) appear to make the procedure computationally
expensive as well as inaccurate, as the Bayes factor did not succeed in selecting the correct
cluster count in simulation studies. Hence we do not further pursue this approach.
As is evident from Figure 2, the set of selected variables converges quickly to the correct set. Figure 2 also shows the false discovery rate on choosing a cutoff for the posterior
marginal probability of variable selection ranging between 0 and 1 (all correct variables
are chosen with probability 1 after burn-in, so the false negative rate is 0). For essentially
any cut-off between 0.2 and 1, the FDR is virtually zero, showing that the method can discriminate strongly between the true covariates and noisy variables that do not explain gene
expression for any of the clusters. From Table 1, it is seen that varying the total number of
variables D does not affect the MSEs of the estimated parameters or the misclassification rates.
Also, varying the values of the hyperparameters (a1, a2) for the prior of u towards favoring fewer variables does not have a discernible effect on the final selection of variables or
parameter estimates, indicating that the method is robust to the specification of these hyperparameters. Similar sensitivity studies were carried out over a range of values for the other
hyperparameters in the model, none of which had a significant effect on the final results.
The hyperparameter settings for which the final results are reported (for simulation studies
and the yeast data) are: (a1 , a2 ) = (1, 100), w0 = 5, c0 = 1, c = 1000, τ0 = 0.1, and λ = 1.
[Figure 2, Table 1 about here]
6.2 Case study 2: Comparative study of four competing procedures
We designed a simulation study to compare the performance of the new method with three
simpler approaches. These were: (i) motif detection from sequence, with no gene clustering
(MDscan); (ii) a single multiple regression model with no gene clustering (Motif Regressor; Conlon et al. (2003)); and (iii) a two-step procedure that clusters genes (Fraley and Raftery,
2003) and carries out motif detection separately within each cluster (Motif Regressor). Sequence data were generated for three groups of 100 sequences of length 500 each, based
on a third-order Markov model with nucleotide frequencies matching yeast intergenic regions. Next, five weight matrices (see Table 2) were chosen from the SCPD database (Zhu
and Zhang, 1999), and 1-2 sites corresponding to each motif were inserted into each of the
sequences. Expression data was generated under the assumption that motifs 1 and 2 had
an inductive effect on group I, a repressive effect on group II, and no effect on group III;
and motifs 4 and 5 had an inductive effect on group III and no effect on groups I and II.
Motif scores were calculated according to the five weight matrices, standardized, and the
expression data generated under the model $N_2(S_g'\beta_k, I)$, where $\beta_1 = 2 \times (1, 1, 0, 0, 0)$,
$\beta_2 = -2 \times (1, 1, 0, 0, 0)$, and $\beta_3 = 2 \times (0, 0, 0, 1, 1)$. Note that although sites corresponding to motif 3 were inserted into the sequences, they are simulated to be "non-functional",
i.e. they do not have any effect on gene expression.
Next, each approach was applied to the data set. MDscan with a total motif count between 5 and 10 could find only two of the correct motifs (Table 2). Additionally, one of the
discovered motifs was motif 3, which does not correlate with gene expression. Motif Regressor uses MDscan (Liu et al., 2002) to find a set of candidate motifs for regression.
When applied on this data set, it predicted two motifs, none of which match the true ones.
Finally, we used MCLUST (Fraley and Raftery, 2003), which uses a multivariate normal
mixture, to first cluster the genes (we set K = 3), and then applied Motif Regressor separately on each of the three clusters. In the first cluster, this led to one significant predicted
motif (which matches one of the Motif Regressor predictions), while no motifs were predicted to be significant in the other two clusters. None of the predicted motifs were correct.
The two-step results may be due to the fact that the differences between clusters are primarily driven by the effect of the significant motifs (and not just the gene expression values),
while the method takes gene expression and sequence into account one at a time. Hence
clustering the genes fails to separate out the correct clusters (the misclassification rate for
MCLUST is 72%), and that in turn leads to failure in detecting the true motifs.
Finally, we applied the new joint regression mixture model on the same data set. An
initial set of 60 motifs was generated by running MDscan separately on each cluster, with
K = 1, 2, 3. Leaving out redundant motifs (exact matches) led to a set of 36 candidates
for variable selection. For initialization, a random cluster index was assigned to each gene
and five chains were run based on five sets of initial indices. MCMC convergence diagnostics demonstrated adequate convergence of the algorithm after about 5000 iterations
(R̂ ≈ 1.001). Six motifs were predicted to be significant, of which five match the true motifs, while the sixth (AATCCCAGAT) is very close to a shifted version of motif 5 (matches
in 50% of the positions). The misclassification rate, based on the highest posterior probability of class identity for each gene, is reduced to 25%. Unlike MDscan, this method does
not pick out motif 3, which is present in the sequence data but does not have a significant
effect on expression. This indicates that the joint approach is a more promising method
for differentiating groups of genes that differ as a functional category, while using gene
expression alone to differentiate clusters could result in missing correct TFs.
[Table 2 about here]
6.3 Case study 3: Yeast cell-cycle data
We applied our method on the yeast cell cycle data set studied in Spellman et al. (1998) for
discovering interactions within the transcription regulatory network. The first question was
how to derive a starting set of PSWMs. A motif search using the entire list of 612 genes may
not find anything other than the "strongest" binding sites, since different sub-groups of genes
may be acted upon by different TFs. In order to collect a more exhaustive list, we used the
following strategy. First, using the k-means algorithm, we repeatedly grouped the genes
into K groups, where K = 1, 2, . . . , 10. Then, for each of the sub-groups, for each K, we
ran the MDscan algorithm (Liu et al., 2002) to find sets of PSWMs, ranging in width from
8 to 10 bp. By using all the clusterings corresponding to different K, the initial motif set is
made independent of any particular choice of K.
Next, we took the top five non-redundant motifs from each group, combined them into
one large set, and scored each of the 612 sequences with respect to each of the motifs. Since
some motifs potentially overlap, the correlations between their score vectors are nearly one.
To reduce redundancy in the model, we excluded all motifs whose score vectors had a correlation of 1 with those of another motif, or
whose consensus sequences matched another's exactly in more than 90% of the positions, resulting
in a set of 191 motif candidates, which is still fairly large for variable selection purposes.
However, the structure of the prior discussed in Section 3.2 allows us to include all variables
in the regression without leading to singularity problems. It may not be reasonable to
consider the entire set of measurements over 18 time points simultaneously, as different
TFs may be active over different phases in the cycle. Taking two to three consecutive time
points may succeed in uncovering groups of genes that are acted upon together by a set
of TFs. Hence we divided the expression measurements into nine consecutive intervals
corresponding to time points 1-2, 3-4, . . ., 17-18.
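A sketch of the redundancy filter described earlier in this section, dropping motifs whose score vectors are perfectly correlated with, or whose consensus strings closely match, a motif already kept; the greedy order and both helper functions are our own simplifications.

```python
import numpy as np

def consensus_match(c1, c2):
    """Fraction of aligned positions at which two consensus strings agree."""
    n = min(len(c1), len(c2))
    return sum(a == b for a, b in zip(c1[:n], c2[:n])) / n

def filter_redundant(scores, consensus, corr_cut=1.0, match_cut=0.9):
    """Greedily keep motifs whose score vector is not perfectly correlated with,
    and whose consensus does not closely match, any motif kept so far."""
    keep = []
    for j in range(scores.shape[1]):
        redundant = any(
            np.corrcoef(scores[:, j], scores[:, i])[0, 1] >= corr_cut
            or consensus_match(consensus[j], consensus[i]) > match_cut
            for i in keep
        )
        if not redundant:
            keep.append(j)
    return keep
```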
6.3.1 Model fitting and data analysis
To assess convergence, we ran five chains of the sampler for each time interval. In all cases,
50,000 iterations, excluding the first 2% as burn-in, gave R̂ values less than 1.1, indicating
adequate convergence to the posterior distribution. Computing time for this example in the
R statistical software (R Development Core Team, 2004) was about 40 hours on a cluster
of dual AMD Athlon 1600+ 1.4GHz MP processors with 2GB DDR RAM.
Priors were chosen to be proper but non-informative. To assess the sensitivity of our
method to prior specification, we ran the algorithm varying the following: (i) the initial
clustering, (ii) the values of c and τ0 , the scaling parameters for the prior covariances of
β, and µ, (iii) the magnitude of the uniform prior pseudocount vector α for π, and (iv)
the hyperparameters (a1 , a2 ) of the distribution of η. We found that (i) varying the initial
clustering, and values of m0 and c0 had little effect on the final results; (ii) varying values
of c larger than 500 had no effect on parameter estimates; (iii) taking small values of τ0
in the range 0.01 to 1 led to stable estimates, and (iv) the magnitude of α had no visible
effect on the results. The hyperparameters (a1 , a2 ) behave as stringency parameters that
control the degree of parsimony of the model: if the value of a2 is large compared to a1 ,
the inclusion of fewer weight matrices in the model is expected to be favored, and vice-versa.
However, this difference in posterior probabilities of selection was only observable when
a1 was set to be much smaller than a2, in which case none of the motifs were selected. For a large
range of a1 /a2 ratios between 0 and 1 the same set of motifs were sampled with the highest
posterior probability. A problem often associated with estimation in Bayesian mixture
models is “label-switching”, due to the posterior indistinguishability of the K! modes. In
our simulation studies and application, however, we found no evidence of this.
The first panel of Table 3 shows the Bayes factor computed by the multi-stage importance sampling method (Section 5) for different choices of K. The optimal number of
clusters K is seen to vary between one and three. Figure 3 gives the posterior probability
of motif selection, corresponding to the optimal number of clusters K for that time interval. The frequencies of inclusion of single motifs in the model were equivalent to
the marginal posterior probabilities of selection. Table 5 shows the selected motifs at each
stage, their consensus patterns and those of the known TF binding motifs that each may
correspond to. A number of binding sites are seen to match with previously found TFs,
and many of them are discovered precisely at the times of peak expression in the cell-cycle
of the genes they are believed to regulate (Spellman et al., 1998). Moreover, we discovered several TFs that act in opposing directions in different clusters, which are missed when
considering a single cluster of genes: ROX1 (Interval 1-2), GAL4 (3-4) and ABF1 (17-18).
[Figure 3, Tables 3, 5 about here]
6.3.2 Comparative study with competing procedures
To further explore the benefits of using this method, we conducted a detailed comparison
with two simpler methods that differ from our method in two significant ways. Motif
Regressor (Conlon et al., 2003) assumes a single cluster of genes at each time point. Next,
we used a sequential two-step approach, first clustering the gene expression values using a
normal mixture model (Fraley and Raftery, 2003), and then doing a separate motif analysis
on each cluster. MCLUST uses the BIC to determine the optimal number of clusters. Table
4 shows that of the three approaches, the joint regression mixture model has the highest
specificity over all time points, finding the largest number of motifs known to be functional
in the yeast cell-cycle. Although Motif Regressor finds a number of known yeast motifs,
the false positive rate is high, a likely result of noise introduced by using all genes instead
of those relevant to a particular TF. In addition, Motif Regressor requires a number of
tuning parameters to be set (for example, the number/proportion of genes to consider “top”
genes). Although having more precise information on these in a particular experiment may
influence the results favorably, this information is generally unavailable in advance.
The two-step approach, interestingly, picks up fewer motifs, resulting in a higher specificity in some cases. However, at some intervals it fails to pick up any functional TFs,
which may be due to the clustering based solely on gene expression. Biologically, it is
known that not all gene expression may be caused by transcription, hence grouping genes
solely by expression ignoring the sequence could lead to missing important TFs. This result
further supports our hypothesis that a more powerful way of using gene expression in motif
discovery is through a joint model that incorporates sequence information into clustering
and expression information into the motif search procedure.
[Table 4 about here]
6.3.3 Cluster analysis
Since our joint model includes a clustering step, apart from motif selection, it is also of
interest to see whether the resultant gene clusters appear to be functionally relevant. We
compared the gene lists obtained from the regression mixture model to the Gene Ontology
(GO) database (http://www.geneontology.org/). A web-based tool, FuncAssociate (Berriz
et al., 2003) was used to obtain lists of GO attributes that are statistically over-represented
among the genes in each gene list, correcting for multiple hypothesis testing. The results
over each time interval are shown in Table 6, where the highlighted attributes are ones
known to be functional at that phase of the cell cycle (Spellman et al., 1998). It is noticeable in this table that (i) a number of attributes are highly significant; (ii) the clusters appear
to be well-separated by attribute, suggesting that the different TFs found in different clusters may be involved in separate functions; and (iii) a number of attributes that appear in
the same cluster are correlated functionally. For example, budding and polarized growth
are grouped together; replication and DNA helicase occur in the same group (DNA helicase is needed for unwinding of the replication fork); hence the gene clustering seems to
be biologically consistent. The same table also presents results of a comparison with Table
7 in Spellman et al. (1998), which gives the names of genes at each time point known to be
involved in a particular function. Using the gene clusters from the joint model, we find the
proportion of genes in a functional group that appear in the same cluster after classification.
As can be seen in Table 6, a number of functional gene groups are over-represented in certain clusters (as opposed to being randomly divided among the clusters), again indicating
that the clusters may capture some functional aspects. It is interesting to see that a number
of functional groups from Spellman et al. (1998) match the independently generated GO
characterizations, e.g. at time interval 1, the enriched functional groups are “chromatin”,
“cell wall” and “mating” in cluster 1, while the top three GO annotations also characterize functional enrichment for cell wall, conjugation (which corresponds to “mating”) and
nucleosome (which corresponds to “chromatin”). In cluster 2 over the same interval, the
“budding: fatty acids” functional gene group is over-represented, which matches the GO
annotation for cluster 2 (bud). A number of such correspondences across the table indicate
that the clusters found may indeed be meaningful in terms of biological function.
It should however be stressed that the gene clustering produced here may not provide
the complete picture of gene activity. What is captured by the joint model is only the part of
gene expression that is related to transcription; there may be other functions that are missed,
because TFs may not have a significant effect on such activity. The purpose of clustering
in this context is to separate out TFs that have differential effects on subgroups of genes,
which may be difficult to detect when all genes are grouped together. The gene clustering
afforded by this procedure may be used to generate hypotheses about subgroups of genes
that may be functionally related, which can then be validated by further experiments.
[Table 6 about here]
7 Discussion
We have introduced a method to infer co-regulation based on clustering of similar expression profiles, and similarity of sequence features. We formulate a joint sequence-expression
model within a hierarchical regression mixture framework and design an efficient procedure to estimate the model parameters. The model aims at facilitating the analysis of large
and complex datasets. The introduction of a hierarchical prior structure motivated by ideas from
ridge regression allows consistent variable selection with high-dimensional covariates, a
major advantage over currently used methods. Detailed comparative studies demonstrate
that the joint model is more successful than simpler approaches in capturing important data
features, leading to more accurate inference. To the best of our knowledge, there has been
no previous successful demonstration of a unified model that simultaneously clusters genes
and detects sets of active regulatory motifs in biological data.
The motivation for using a random-effects hierarchical model arises from the observation that the upstream sequence binding is characterized by the sequence composition
that is constant over time; however, due to interactions of various TFs at different time
points, the resulting gene expression measurements may be different. Although a Gaussian model is used here for simplicity and computational efficiency, it is not necessarily
a rigid assumption and can be extended to more complex frameworks. The joint framework introduced here has the potential to be improved in various modeling aspects, such as
incorporating a possible dependence in gene measurements over time more fully, and using a more biologically sophisticated function for motif scoring, for instance, modeling the
spatial dependence that arises due to multiple sites forming a “regulatory module” (Thompson et al., 2004). These issues, as well as extensions of the model in related applications
that involve determining significant predictor variables in presence of latent clusters in the
population, are currently being investigated.
References
Bailey, T. and Elkan, C. (1994). Fitting a mixture model by expectation maximization to
discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., pages 28–36.
Ball, C., Awad, I., Demeter, J., Gollub, J., Hebert, J. M., Hernandez-Boussard, T., Jin, H.,
Matese, J. C., Nitzberg, M., Wymore, F., Zachariah, Z. K., Brown, P. O., and Sherlock,
G. (2005). The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res., 33(Database issue):D580–2.
Barash, Y. and Friedman, N. (2002). Context-specific Bayesian clustering for gene expression data. J Comput Biol, 9(2):169–191.
Berriz, G. F., King, O. D., Bryant, B., Sander, C., and Roth, F. P. (2003). Characterizing
gene sets with FuncAssociate. Bioinformatics, 19(18):2502–2504.
Bussemaker, H. J., Li, H., and Siggia, E. D. (2001). Regulatory detection using correlation
with expression. Nature Genetics, 27:167–174.
Chib, S. (1995). Marginal likelihood from the Gibbs output. J. Am. Statist. Assoc.,
90(432):1313–1321.
Conlon, E. M., Liu, X. S., Lieb, J. D., and Liu, J. S. (2003). Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA, 100(6):3339–
3344.
Fraley, C. and Raftery, A. E. (2003). Enhanced model-based clustering, density estimation,
and discriminant analysis software: MCLUST. J. Classification, 20(2):263–286.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences (with discussion). Stat. Sci., 7:457–511.
George, E. and McCulloch, R. (1993). Variable selection via Gibbs sampling. J. Am. Stat.
Assoc., 88:881–889.
Green, P. J. (1995).
Reversible jump MCMC and Bayesian model determination.
Biometrika, 82:711–732.
Hennig, C. (1999). Models and methods for clusterwise linear regression. In Classification
in the Information Age, pages 179–187. Springer-Verlag, Berlin.
Hoerl, A. E. and Kennard, R. W. (1970).
Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12:55–67.
Holmes, I. and Bruno, W. (2000). Finding regulatory elements using joint likelihoods for
sequence and expression profile data. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8:202–10.
Lawrence, C. E. and Reilly, A. A. (1990). An expectation-maximization (EM) algorithm
for the identification and characterization of common sites in biopolymer sequences.
Proteins, 7:41–51.
Liang, F. and Wong, W. H. (2000). Evolutionary Monte Carlo: applications to Cp model
sampling and change point problem. Statistica Sinica, 10:317–342.
Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995). Bayesian models for multiple local
sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc., 90:1156–1170.
Liu, J. S., Zhang, J. L., Palumbo, M. J., and Lawrence, C. E. (2003). Bayesian clustering
with variable and transformation selections. In Bayesian Statistics, volume 7, pages
249–275. Oxford University Press.
Liu, X., Brutlag, D. L., and Liu, J. S. (2002). An algorithm for finding protein-DNA binding
sites with applications to chromatin-immunoprecipitation microarray experiments. Nat.
Biotech., 20(8):835–9.
R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated
genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell,
9(12):3273–97.
Steele, R., Raftery, A. E., and Emond, M. (2003). Computing normalizing constants for
finite mixture models via incremental mixture importance sampling. Technical report,
Dept. of Statistics, University of Washington.
Tadesse, M., Sha, N., and Vannucci, M. (2005). Bayesian variable selection in clustering
high-dimensional data. J. Am. Stat. Assoc., 100:602–617.
Tadesse, M. G., Vannucci, M., and Lio, P. (2004). Identification of DNA regulatory motifs
using Bayesian variable selection. Bioinformatics, 20(16):2553–61.
Thompson, W., Palumbo, M. J., Wasserman, W. W., Liu, J. S., and Lawrence, C. E. (2004).
Decoding human regulatory circuits. Genome Research, 10:1967–74.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with
g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor
of Bruno de Finetti, page 233. North-Holland, Amsterdam.
Zhu, J. and Zhang, M. (1999). SCPD: a promoter database of the yeast Saccharomyces
cerevisiae. Bioinformatics, 15(7):607–611.
APPENDIX A: Conditional distributions in MCMC steps
(i) Updating $[u \mid \mathbf{Y}, S, z, K, \text{all parameters}]$. For the EMC step, the marginal distribution
of the data is computed by integrating out all parameters whose dimension varies with
$u$. The marginal likelihood is proportional to $H(u \mid z, \mu)$, where
\[
H(u \mid z, \mu) = \frac{\mathrm{Beta}(|u| + a_1, D - |u| + a_2)}{\mathrm{Beta}(a_1, a_2)}
\prod_{k=1}^{K} c_k^* \, \frac{K^*(w_0/2, \, S_0/2)}{K^*\!\left(\frac{w_0 + n_k T}{2}, \, \frac{S_0 + T_{1k} + T_{2k}}{2}\right)}, \qquad (7)
\]
with
\[
c_k^* = \big|c\,\Sigma_{\beta(u)}\big|^{-1/2}\,
\Big|c^{-1}\Sigma_{\beta(u)}^{-1} + (1 + \tau_0)^{-1} \sum_{g: z_g = k} S_{g(u)} S_{g(u)}'\Big|^{-1/2}
\,[2\pi(1 + \tau_0)]^{-n_k T/2},
\]
\[
T_{1k} = \frac{1}{1 + \tau_0} \sum_{g: z_g = k} (\mathbf{Y}_g - \mu_k - S_{g(u)}'\tilde{\beta}_{k(u)}\mathbf{1})'(\mathbf{Y}_g - \mu_k - S_{g(u)}'\tilde{\beta}_{k(u)}\mathbf{1}),
\quad \text{and} \quad
T_{2k} = \frac{T}{c}\, (\tilde{\beta}_{k(u)} - \beta_{0(u)})' \Sigma_{\beta(u)}^{-1} (\tilde{\beta}_{k(u)} - \beta_{0(u)}).
\]
$K^*(p, \alpha)$ denotes the normalizing constant of the inverse gamma$(p, \alpha)$ density, $\alpha^p/\Gamma(p)$, and $\tilde{\beta}_{k(u)}$ is defined in (8). ((7) is derived in
Appendix B.)
(ii) Update [z | Y, S, u, K, parameters], using

P(z_g = k \mid Y, S, u, K, \text{all parameters}) = \frac{\pi_k\, p(Y_g \mid Y, S, u, K, \mu_k, \beta_k, \sigma_k^2)}{\sum_j \pi_j\, p(Y_g \mid Y, S, u, K, \mu_j, \beta_j, \sigma_j^2)}.
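In practice, the multinomial draw in (ii) takes only a few lines of R. The sketch below is illustrative rather than part of the authors' implementation; log_dens is a hypothetical G x K matrix whose (g, k) entry holds log pi_k + log p(Y_g | S, u, K, mu_k, beta_k, sigma_k^2).

    ## Draw each z_g from its discrete full conditional, working on the log scale
    draw_z <- function(log_dens) {
      w <- exp(log_dens - apply(log_dens, 1, max))   # stabilise before normalising
      apply(w, 1, function(p) sample.int(length(p), 1, prob = p))
    }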
(iii) Update [parameters | Y, S, z, u, K].

(a) \mu_k \mid \beta_k, \sigma_k^2, Y, S, z, u \sim N_T\!\left( \bar{y}_k^{*},\ (V_0^{-1} + n_k\sigma_k^{-2} I)^{-1} \right), where \bar{y}_k = \frac{1}{n_k}\sum_{g:z_g=k}\left( y_g - S_{g(u)}'\beta_{k(u)}\mathbf{1} \right) and \bar{y}_k^{*} = (V_0^{-1} + n_k\sigma_k^{-2} I)^{-1}\left( V_0^{-1} m_0 + n_k \bar{y}_k/\sigma_k^2 \right).

(b) By assumption, \sigma_k^2 \sim \text{Inv-Gamma}(w_0/2, S_0/2) (where X \sim \text{Inv-Gamma}(p, \alpha) \Rightarrow f(x) = \frac{\alpha^p}{\Gamma(p)}\, x^{-(p+1)} e^{-\alpha/x}). Then, we draw from the posterior distribution (\sigma_k^2 \mid Y, S, z, \mu, u) \sim \text{Inv-Gamma}(A, B), where A = (w_0 + n_k T)/2 and

B = \frac{1}{2}\Big[ S_0 + (1+\tau_0)^{-1}\sum_{g:z_g=k} (Y_g - \mu_k - S_{g(u)}'\tilde{\beta}_{k(u)}\mathbf{1})'(Y_g - \mu_k - S_{g(u)}'\tilde{\beta}_{k(u)}\mathbf{1}) + T c^{-1} (\tilde{\beta}_{k(u)} - \beta_{0(u)})'\Sigma_{\beta(u)}^{-1}(\tilde{\beta}_{k(u)} - \beta_{0(u)}) \Big].

(c) \beta_{k(u)} \mid Y, S, \mu_k, \sigma_k^2, z, u is drawn from its posterior distribution N_{|u|}(\tilde{\beta}_{k(u)}, \sigma_k^2\tilde{\Sigma}_{\beta k(u)}), where \tilde{\Sigma}_{\beta k(u)} = \frac{1}{T}\Big[ c^{-1}\Sigma_{\beta(u)}^{-1} + (1+\tau_0)^{-1}\sum_{g:z_g=k} S_{g(u)} S_{g(u)}' \Big]^{-1} and

\tilde{\beta}_{k(u)} = \tilde{\Sigma}_{\beta k(u)}\Big[ c^{-1}\Sigma_{\beta(u)}^{-1}\beta_{0(u)} + (1+\tau_0)^{-1}\sum_{g:z_g=k} S_{g(u)}\mathbf{1}_T'(Y_g - \mu_k) \Big].   (8)

(d) The cluster occupancy probabilities \pi are drawn from their posterior distribution Dirichlet(n + \alpha), where n = (n_1, \ldots, n_K) are the cluster sizes.
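The conditional draws above translate directly into code. The following R sketch (illustrative only, not the authors' implementation) performs one sweep of updates (a)–(c) for a single cluster k and the Dirichlet update (d); Y is the G x T expression matrix, S the G x D score matrix, u a logical selection vector of length D, and beta0, Sigma_beta, c_beta, tau0, V0, m0, w0, S0, alpha are hypothetical stand-ins for the hyperparameters defined in the text (c_beta plays the role of the scalar c).

    ## One sweep of updates (iii)(a)-(c) for cluster k -- a minimal sketch
    update_cluster_k <- function(k, Y, S, z, u, sigma2_k, beta_k,
                                 beta0, Sigma_beta, c_beta, tau0, V0, m0, w0, S0) {
      Tn <- ncol(Y)
      Su <- S[z == k, u, drop = FALSE]                  # n_k x |u| selected scores
      Yk <- Y[z == k, , drop = FALSE]                   # n_k x Tn expression profiles
      nk <- nrow(Yk)
      Sb_inv <- solve(Sigma_beta[u, u, drop = FALSE])   # Sigma_beta(u)^{-1}

      ## (a) mu_k | ... ~ N_T( ybar*_k, (V0^{-1} + n_k sigma_k^{-2} I)^{-1} )
      ybar_k  <- colMeans(Yk - drop(Su %*% beta_k))     # subtracts S'_g(u) beta 1 per gene
      cov_mu  <- solve(solve(V0) + (nk / sigma2_k) * diag(Tn))
      mean_mu <- drop(cov_mu %*% (solve(V0) %*% m0 + nk * ybar_k / sigma2_k))
      mu_k    <- drop(mean_mu + t(chol(cov_mu)) %*% rnorm(Tn))

      ## beta_tilde and Sigma_tilde of (8), needed for both (b) and (c)
      prec       <- Sb_inv / c_beta + crossprod(Su) / (1 + tau0)
      Sigma_tild <- solve(prec) / Tn
      rhs        <- Sb_inv %*% beta0[u] / c_beta +
                    colSums(Su * rowSums(Yk - matrix(mu_k, nk, Tn, byrow = TRUE))) / (1 + tau0)
      beta_tilde <- drop(Sigma_tild %*% rhs)

      ## (b) sigma_k^2 | ... ~ Inv-Gamma(A, B), drawn as 1 / Gamma(shape = A, rate = B)
      resid <- Yk - matrix(mu_k, nk, Tn, byrow = TRUE) - drop(Su %*% beta_tilde)
      B <- 0.5 * (S0 + sum(resid^2) / (1 + tau0) +
                  (Tn / c_beta) * drop(crossprod(beta_tilde - beta0[u],
                                                 Sb_inv %*% (beta_tilde - beta0[u]))))
      sigma2_k <- 1 / rgamma(1, shape = (w0 + nk * Tn) / 2, rate = B)

      ## (c) beta_k(u) | ... ~ N_|u|( beta_tilde, sigma_k^2 * Sigma_tilde )
      beta_k <- drop(beta_tilde + t(chol(sigma2_k * Sigma_tild)) %*% rnorm(sum(u)))

      list(mu_k = mu_k, sigma2_k = sigma2_k, beta_k = beta_k)
    }

    ## (d) pi | z ~ Dirichlet(n + alpha), drawn via normalised Gamma variates
    update_pi <- function(z, K, alpha) {
      w <- rgamma(K, shape = tabulate(z, nbins = K) + alpha, rate = 1)
      w / sum(w)
    }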
Posterior mean and variance of \beta_k

Let n_k denote the number of genes assigned to cluster k (k = 1, \ldots, K), and for simplicity of notation, let S_{k(u)} = (S_{1(u)}', \ldots, S_{n_k(u)}')' denote the n_k \times |u| sub-matrix of scores corresponding to the n_k genes. Define the (|u| + n_k T) \times 1 vector Y_{k(u)*} = [\beta_{0(u)}'; (Y_1 - \mu_k)'; \cdots; (Y_{n_k} - \mu_k)']', where [a_1'; a_2'] represents the vector a_1 stacked on a_2. Define the (|u| + n_k T) \times |u| matrix X_{k(u)*} = [I_{|u|}, (\mathbf{1}_T S_{1(u)}')', \ldots, (\mathbf{1}_T S_{n_k(u)}')']' and the (|u| + n_k T) \times (|u| + n_k T) matrix Q_{k(u)*} = \mathrm{diag}\left( c\sigma_k^2\Sigma_{\beta(u)},\ \sigma_k^2(1+\tau_0) I_{n_k T} \right), where n_k = \sum_{g=1}^{G} 1_{[z_g=k]}, |u| = \sum_{d=1}^{D} 1_{[u_d=1]}, and I_m denotes an identity matrix of dimension m. From weighted linear regression techniques, it then follows that the posterior distribution of \beta_{k(u)} is \beta_{k(u)} \mid \mu_k, \sigma_k^2, Y, S, z, u \sim N_{|u|}(\tilde{\beta}_{k(u)}, \tilde{\Sigma}_{\beta k(u)}), where

\tilde{\beta}_{k(u)} = (X_{k(u)*}' Q_{k(u)*}^{-1} X_{k(u)*})^{-1} X_{k(u)*}' Q_{k(u)*}^{-1} Y_{k(u)*} \quad\text{and}\quad \tilde{\Sigma}_{\beta k(u)} = (X_{k(u)*}' Q_{k(u)*}^{-1} X_{k(u)*})^{-1}.   (9)

Now,

X_{k(u)*}' Q_{k(u)*}^{-1} X_{k(u)*} = (c\sigma_k^2\Sigma_{\beta(u)})^{-1} + \sigma_k^{-2}(1+\tau_0)^{-1} \sum_{g:z_g=k} (S_{g(u)}\mathbf{1}_T')(\mathbf{1}_T S_{g(u)}') = T\sigma_k^{-2}\Big[ c^{-1}\Sigma_{\beta(u)}^{-1} + (1+\tau_0)^{-1} \sum_{g:z_g=k} S_{g(u)} S_{g(u)}' \Big].   (10)

Also, X_{k(u)*}' Q_{k(u)*}^{-1} Y_{k(u)*} = c^{-1}\sigma_k^{-2}\Sigma_{\beta(u)}^{-1}\beta_{0(u)} + (1+\tau_0)^{-1}\sigma_k^{-2} \sum_{g:z_g=k} S_{g(u)}\mathbf{1}_T'(Y_g - \mu_k). The posterior mean and covariance of \beta_{k(u)} are then

\tilde{\beta}_{k(u)} = \frac{1}{T}\left[ \frac{\Sigma_{\beta(u)}^{-1}}{c} + \frac{\sum_{g:z_g=k} S_{g(u)} S_{g(u)}'}{1+\tau_0} \right]^{-1} \left[ \frac{\Sigma_{\beta(u)}^{-1}\beta_{0(u)}}{c} + \frac{\sum_{g:z_g=k} S_{g(u)}\mathbf{1}_T'(Y_g - \mu_k)}{1+\tau_0} \right],
\quad
\tilde{\Sigma}_{\beta k(u)} = \frac{\sigma_k^2}{T}\left[ \frac{\Sigma_{\beta(u)}^{-1}}{c} + \frac{\sum_{g:z_g=k} S_{g(u)} S_{g(u)}'}{1+\tau_0} \right]^{-1}.   (11)
APPENDIX B: Derivation of H(u|z, µ)
Let u be as defined in Section 3. Since each u_j (1 \le j \le D) has prior probability \eta of being 1, the marginalized posterior probability for a configuration u is

H(u|z,\mu) = \int \prod_{k=1}^{K} \prod_{g:z_g=k} P(Y_g \mid S_g, z, u, K, \mu, \sigma_k^2, \beta_{k(u)})\, p(\beta_{k(u)})\, p(u|\eta)\, p(\eta)\, p(\sigma_k^2)\, d\beta_{k(u)}\, d\eta\, d\sigma_k^2.   (12)

Let n_k = |\{g : z_g = k\}|, |u| = \sum_{j=1}^{D} u_j, and B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b). Then

H(u|z,\mu) = \frac{B(|u|+a_1,\, D-|u|+a_2)}{B(a_1, a_2)} \prod_{k=1}^{K} \int \prod_{g:z_g=k} P(Y_g \mid S_g, z, u, K, \mu, \sigma_k^2, \beta_{k(u)})\, p(\beta_{k(u)})\, p(\sigma_k^2)\, d\beta_{k(u)}\, d\sigma_k^2.

Let X_{k(u)*}, Y_{k(u)*}, Q_{k(u)*} be defined as in Appendix A. Also, the posterior distribution of \beta_{k(u)} \mid \mu_k, \sigma_k^2, Y, S, z, u is N_{|u|}(\tilde{\beta}_{k(u)}, \tilde{\Sigma}_{\beta k(u)}), where \tilde{\beta}_{k(u)} and \tilde{\Sigma}_{\beta k(u)} are as defined in (9). Now, for any set of vectors X, Y, \theta,

P(X, Y) = \int_\theta P(X, Y|\theta)P(\theta)\, d\theta = \frac{P(X, Y|\theta)P(\theta)}{P(\theta|X, Y)},

where the identity holds for any choice of \theta such that P(\theta|X, Y) \ne 0. So

P(Y_g \mid S_g, z, u, K, \mu, \sigma_k^2) = \frac{P(Y_g \mid S_g, \beta_{k(u)}, z, u, K, \mu, \sigma_k^2)\, P(\beta_{k(u)} \mid z, u, K, \mu, \sigma_k^2)}{P(\beta_{k(u)} \mid Y, S, z, u, K, \mu, \sigma_k^2)},   (13)

where (13) can be evaluated at any value of \beta_{k(u)}. Choosing \beta_{k(u)} = \tilde{\beta}_{k(u)}, the denominator of (13) reduces to

N_{|u|}\left( \beta_{k(u)};\, \tilde{\beta}_{k(u)}, \tilde{\Sigma}_{\beta k(u)} \right)\Big|_{\beta_{k(u)} = \tilde{\beta}_{k(u)}} = (2\pi)^{-|u|/2} \left| \tilde{\Sigma}_{\beta k(u)} \right|^{-1/2}.   (14)

The numerator of (13) evaluated at \tilde{\beta}_{k(u)} is

\prod_{g:z_g=k} N_T\left( Y_g;\, \mu_k + S_{g(u)}'\tilde{\beta}_{k(u)}\mathbf{1},\ \sigma_k^2(1+\tau_0)I \right) \times N_{|u|}\left( \tilde{\beta}_{k(u)};\, \beta_{0(u)},\ c\sigma_k^2\Sigma_{\beta(u)} \right),   (15)

where \Sigma_{\beta(u)} is defined in (4) and \tilde{\Sigma}_{\beta k(u)} is as given in (11). The ratio of (15) to (14) then gives the expression \prod_{k=1}^{K}\prod_{g:z_g=k} P(Y_g \mid S_g, z, u, K, \mu, \sigma_k^2). Finally, since the prior distribution of \sigma_k^2 is Inverse-Gamma(w_0/2, S_0/2),

H(u|z,\mu) = \frac{B(|u|+a_1,\, D-|u|+a_2)}{B(a_1, a_2)} \prod_{k=1}^{K} \int \prod_{g:z_g=k} P(Y_g \mid S_g, z, u, K, \mu, \sigma_k^2)\, p(\sigma_k^2)\, d\sigma_k^2,

which, upon using (15), (14), and (9), simplifies to the expression (7) in Appendix A.
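For concreteness, expression (7) can be evaluated on the log scale as sketched below in R; the function and argument names are hypothetical (u may be logical or 0/1, mu is a K x T matrix of cluster means), and the sketch follows the reconstruction of (7) given in Appendix A rather than any published code.

    ## Evaluate log H(u | z, mu) of (7) on the log scale
    log_H <- function(u, z, mu, Y, S, K, beta0, Sigma_beta, c_beta, tau0,
                      a1, a2, w0, S0) {
      u  <- as.logical(u)
      D  <- ncol(S); Tn <- ncol(Y)
      lKstar <- function(p, a) p * log(a) - lgamma(p)   # log of K*(p, a) = a^p / Gamma(p)
      Sb     <- Sigma_beta[u, u, drop = FALSE]
      Sb_inv <- solve(Sb)
      out <- lbeta(sum(u) + a1, D - sum(u) + a2) - lbeta(a1, a2)
      for (k in 1:K) {
        Su <- S[z == k, u, drop = FALSE]
        Yk <- Y[z == k, , drop = FALSE]
        nk <- nrow(Yk)
        prec <- Sb_inv / c_beta + crossprod(Su) / (1 + tau0)
        rhs  <- Sb_inv %*% beta0[u] / c_beta +
                colSums(Su * rowSums(Yk - matrix(mu[k, ], nk, Tn, byrow = TRUE))) / (1 + tau0)
        beta_tilde <- drop(solve(prec, rhs)) / Tn             # beta_tilde of (8)
        resid <- Yk - matrix(mu[k, ], nk, Tn, byrow = TRUE) - drop(Su %*% beta_tilde)
        T1k <- sum(resid^2) / (1 + tau0)
        T2k <- (Tn / c_beta) * drop(crossprod(beta_tilde - beta0[u],
                                              Sb_inv %*% (beta_tilde - beta0[u])))
        log_ck <- -0.5 * as.numeric(determinant(c_beta * Sb)$modulus) -
                   0.5 * as.numeric(determinant(prec)$modulus) -
                   (nk * Tn / 2) * log(2 * pi * (1 + tau0))
        out <- out + log_ck + lKstar(w0 / 2, S0 / 2) -
               lKstar((w0 + nk * Tn) / 2, (S0 + T1k + T2k) / 2)
      }
      out
    }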
APPENDIX C: Evolutionary Monte Carlo (EMC)
To conduct EMC, we need to prescribe a set of temperatures, t_1 > t_2 > \cdots > t_M = 1, for the M "population units". Using the marginalized probability for a variable configuration u in (12), we set \pi_i(u_i) \propto \exp[-\log H(u_i|z,\mu)/t_i], and let \pi(U) \propto \prod_{i=1}^{M} \pi_i(u_i). The population U = (u_1, \ldots, u_M) is updated iteratively using two types of moves: mutation and crossover. In the mutation operation, a unit u_k is randomly selected from the current population and mutated to a new vector v_k by changing the values of some of its components chosen at random. The new member v_k is accepted into the population with probability \min(1, r_m), where r_m = \pi_k(v_k)/\pi_k(u_k). In crossover, two individuals, u_j and u_k, are chosen at random from the population. A crossover point l (1 \le l \le D) is chosen randomly, and two new units v_j and v_k are formed by swapping between the two individuals the segments to the right of the crossover point. The two "children" are accepted into the population, replacing their parents u_j and u_k, with probability \min(1, r_c), where r_c = \frac{\pi_j(v_j)\pi_k(v_k)}{\pi_j(u_j)\pi_k(u_k)}. It can be shown that the samples of u_M (at temperature t_M = 1) converge to the target distribution (12).
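A bare-bones version of these two moves is sketched below in R; U is an M x D matrix of 0/1 configurations (one row per population unit), temps holds the temperatures t_1 > ... > t_M = 1, and logH is a user-supplied function returning log H(u|z, mu), for instance the log_H sketch given after Appendix B. The code is illustrative and not the authors' implementation.

    ## One mutation move and one crossover move on the population U
    emc_step <- function(U, temps, logH, ...) {
      M <- nrow(U); D <- ncol(U)

      ## mutation: flip a few randomly chosen entries of one population member
      k   <- sample(M, 1)
      v   <- U[k, ]
      idx <- sample(D, size = max(1, rbinom(1, D, 1 / D)))
      v[idx] <- 1 - v[idx]
      log_rm <- (logH(U[k, ], ...) - logH(v, ...)) / temps[k]   # log pi_k(v_k) - log pi_k(u_k)
      if (log(runif(1)) < log_rm) U[k, ] <- v

      ## crossover: swap the segments to the right of a random cut point l (1 <= l < D)
      jk <- sample(M, 2); j <- jk[1]; k <- jk[2]
      l  <- sample(D - 1, 1)
      vj <- c(U[j, 1:l], U[k, (l + 1):D])
      vk <- c(U[k, 1:l], U[j, (l + 1):D])
      log_rc <- (logH(U[j, ], ...) - logH(vj, ...)) / temps[j] +
                (logH(U[k, ], ...) - logH(vk, ...)) / temps[k]
      if (log(runif(1)) < log_rc) { U[j, ] <- vj; U[k, ] <- vk }
      U
    }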
APPENDIX D: Importance sampling procedure for z
The importance sampling procedure for z consists of the following steps:
(i) Let \hat{\zeta} be the matrix of P(z \mid Y, S, \hat{\theta}) for a specific permutation of the group labels z. Create K groups by assigning each observation i to group l_i, where l_i = \arg\max_j \hat{\zeta}_{ij}.
(ii) Next, for each non-empty group r (r = 1, \ldots, K), sample \psi_r from a Dirichlet distribution with parameter vector \alpha_r = (\alpha_{r1}, \ldots, \alpha_{rK}) (\alpha_r is chosen to be \mathbf{1}_K).
(iii) For each group, re-assign observations to groups according to their group-specific \psi_r. The probability g^{*}(z) is then

g^{*}(z) = \prod_{r=1}^{K} \frac{\Gamma\!\left(\sum_j \alpha_{rj}\right)}{\Gamma\!\left(\sum_j [\alpha_{rj} + n_{rj}]\right)}\, \frac{\prod_j \Gamma(\alpha_{rj} + n_{rj})}{\prod_j \Gamma(\alpha_{rj})},

where n_{rj} is the number of observations from the r-th group assigned to group j.
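The three steps can be coded compactly; the R sketch below draws one relabelled configuration z* and returns log g*(z*). The argument zeta_hat is the G x K matrix of P(z_g = k | Y, S, theta-hat) from step (i), and alpha_r = alpha * 1_K as in step (ii); names are hypothetical and the sketch is not the authors' implementation.

    ## Draw one proposal z* and its log proposal probability log g*(z*)
    propose_z <- function(zeta_hat, alpha = 1) {
      G <- nrow(zeta_hat); K <- ncol(zeta_hat)
      l <- max.col(zeta_hat)                        # step (i): l_i = argmax_j zeta_hat[i, j]
      z_new <- integer(G); log_g <- 0
      for (r in unique(l)) {                        # steps (ii)-(iii), one non-empty group at a time
        idx <- which(l == r)
        w   <- rgamma(K, shape = alpha, rate = 1)   # psi_r ~ Dirichlet(alpha * 1_K)
        psi <- w / sum(w)
        z_new[idx] <- sample.int(K, length(idx), replace = TRUE, prob = psi)
        n_rj  <- tabulate(z_new[idx], nbins = K)
        log_g <- log_g + lgamma(K * alpha) - lgamma(K * alpha + length(idx)) +
                 sum(lgamma(alpha + n_rj) - lgamma(alpha))
      }
      list(z = z_new, log_g = log_g)
    }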
[Figure 1 appears here: gene expression (y-axis) plotted against time in minutes (x-axis), with the cell-cycle phases M/G1, G1, S, S/G2, and G2/M indicated.]
Figure 1: Yeast cell-cycle data. Gene expression measurements over the five phases of the
cell-cycle (M/G1, G1, S, S/G2, G2/M) show the increased or decreased activity of varying
sets of genes at specific time points.
[Figure 2 appears here: (a) number of selected variables (y-axis) against iteration (x-axis); (b) FP/FN rate (y-axis) against cutoff probability (x-axis), for D = 20, 30, 40, 50.]
Figure 2: (a) Number of variables selected over iterations for 4 simulated data sets. (b) FP/FN rates against the cutoff probability for variable selection: any cutoff above 0.2 gives a 0% FP rate.
[Figure 3 appears here: one row of panels per time interval (1,2) through (17,18); left panels plot the number of selected variables against iteration (1 to 50,000), right panels plot the posterior selection probability (Psel) against motif index.]
Figure 3: Number of selected variables over iteration (Left panel) and posterior probability
of motif type selection (Right panel) for 9 consecutive time intervals, for optimal choice
of total number of clusters (as in second column of Table 5). The horizontal axis labels
correspond to iteration number (Left panel) and motif index (Right panel).
Table 1: Mean square errors (MSE), misclassification percentage (miscl. rate), and false positive (FP)/false negative (FN) rates for (i) increasing number of variables D, and different hyperparameter specifications for (ii) (a1, a2), (iii) the scalar multiple c0 of V0, and (iv) the "ridge" factor λ for the hyperprior of β. (FP/FN rates hold for any marginal posterior probability cutoff for variable selection between 0.2 and 1.0.)
                        MSE µ     MSE β     MSE σ²    miscl. rate   FP   FN
D          20           0.0003    0.0203    0.0004    0.0400        0    0
           30           0.0004    0.0308    0.0005    0.0283        0    0
           40           0.0003    0.0210    0.0004    0.0367        0    0
           50           0.0407    0.0278    0.0804    0.0200        0    0
(a1, a2)   (1,1)        0.0004    0.0344    0.0007    0.0317        0    0
           (1,100)      0.0005    0.0455    0.0018    0.0333        0    0
           (1,10000)    0.0004    0.0260    0.0007    0.0367        0    0
c0         0.1          0.0048    0.0089    0.0992    0.0867        0    0
           1            0.0049    0.0096    0.1025    0.0867        0    0
           10           0.0049    0.0093    0.0982    0.0533        0    0
           50           0.0049    0.0087    0.0984    0.0600        0    0
λ          0.0          0.0049    0.0083    0.1002    0.0800        0    0
           0.2          0.0051    0.0094    0.1039    0.0667        0    0
           0.4          0.0049    0.0088    0.0994    0.0733        0    0
           0.6          0.0048    0.0091    0.1023    0.0867        0    0
           0.8          0.0052    0.0092    0.1000    0.0533        0    0
           1.0          0.0048    0.0089    0.0992    0.0867        0    0
Table 2: Simulation study comparing the performance of the joint regression mixture model with three other methods. Position-specific probabilities less than 60% are denoted by 'N'. "Sens": sensitivity (percentage of true motifs found); "Spec": specificity (percentage of found motifs that are correct); "Clust. %": percentage of genes correctly classified. A motif is considered a "match" if ≥ 70% of its positions match (matching positions underlined).
Method            Motif    Motif   Motif   Clust.   Motif Pattern
                  Count    Sens    Spec    %        AGCCGCCNAA   TNCCNNNTTT   ACGCGTACGT   CNGCAGATTG
MDscan            5        0.5     0.4     −        AGCCGCCGAA   −            −            CAGCAGATTG
MotifRegressor    2        0       0       −        −            CCCAGGCGAG   CAATCTGCTG   −
Two-step          2,0,0    0       0       0.28     CAATCTGCTG   −            −            −
Joint model       6        1       0.83    0.75     GCCGCCGAAT   TCCCAGATTT   GGGACGCGTA   GCAGCAGATT
Table 3: Model selection for number of clusters in (i) yeast data and (ii) simulated data by
Bayes Factor computation using IS, compared to K = 1.
(i) Yeast data
Time Interval    K = 2      K = 3      K = 4
1, 2             17.72      12.11      -11.02
3, 4             4.71       -33.33     -152.26
5, 6             -12.10     -22.74     -93.38
7, 8             8.38       -24.09     -50.31
9, 10            22.50      28.09      -128.99
11, 12           0.98       -2.62      -85.57
13, 14           15.33      -26.50     -46.64
15, 16           4.33       -28.30     -71.49
17, 18           41.33      -44.79     -147.16

(ii) Simulated data
K = 2      K = 3      K = 4      K = 5
366.40     527.64     220.17     80.00
Table 4: Comparison of the joint model, MotifRegressor (MR), and the two-step approach (MCLUST+MR) on the yeast data set. In the column "MCLUST+MR", the numbers denote the number of TFs found separately in each cluster. "Spec" denotes specificity, here defined as the percentage of found TFs that are known in the literature. (It is not possible to accurately estimate the sensitivity in this case, as the total number of TFs active at each precise time point may not be known.)
                MotifRegressor        MCLUST+MR             Joint
Time Interval   Total     Spec        Total     Spec        Total   Spec
1, 2            23        0.087       5,4,3     0           2       0.5
3, 4            19        0.158       9,7       0.188       6       1
5, 6            21        0.095       7,1,5     0.154       2       1
7, 8            22        0.227       8,1       0           4       1
9, 10           26        0           5,7       0.083       1       0.8
11, 12          21        0.095       7,3       0.1         4       0.75
13, 14          23        0.087       7,2       0           1       1
15, 16          17        0           8,0       0           2       1
17, 18          16        0.125       1,7       0           4       0.75
Table 5: Posterior confidence intervals for cluster-specific regression coefficients for yeast
transcription factor binding motifs. The second column denotes the chosen number of
clusters (K) and the third column (Sel %) the posterior marginal probability of selection
of the motif.
Interval  K   Motif (Consensus)       Sel %   95% Posterior Interval
                                              Cluster 1          Cluster 2          Cluster 3
1, 2      2   ROX1 (ACAACAAC)         0.63    -0.132, 0.014      0.034, 0.096
              unknown (ACCAAGCG)      0.82    -0.198, -0.020     -0.083, -0.031
3, 4      2   ABF1 (CGTATATA)         1.00    -0.038, 0.033      -0.076, -0.035
              REB1 (GGTTAACC)         1.00    -0.155, -0.007     0.085, 0.174
              SCB (CTCGAAAA)          1.00    -0.106, 0.014      0.053, 0.125
              UASH (CTGTGCAG)         0.52    -0.129, 0.012      0.043, 0.108
              MCB (TACGCGAT)          1.00    0.035, 0.177       0.089, 0.158
              GAL4 (CGCGGCTC)         1.00    0.037, 0.170       -0.143, -0.060
5, 6      1   MCM1 (GGCCAGAA)         1.00    -0.068, -0.018
              MCB (TACGCGAT)          0.82    0.023, 0.068
7, 8      2   UASH (CTGTGCAG)         0.52    0.042, 0.102       -0.008, 0.038
              GAL4 (GGACAGAC)         0.59    -0.054, 0.067      -0.166, 0.010
              unknown (CAATTGCA)      1.00    0.071, 0.207       -0.097, 0.018
              MCB (TACGCGAT)          1.00    -0.052, 0.039      -0.144, -0.061
9, 10     3   RAP1 (ACGCGTTA)         1.00    0.001, 0.144       -0.162, -0.014     -0.089, -0.007
11, 12    2   unknown (CAAACCGA)      0.79    0.091, 0.280       -0.050, 0.016
              GCR1 (CACCACTA)         0.87    -0.214, -0.078     -0.009, 0.054
              SWI5 (GCTGGATT)         0.84    0.099, 0.229       -0.046, 0.031
              MCB (TACGCGAT)          0.94    -0.030, 0.048      0.035, 0.080
13, 14    2   RAP1 (ACGCGTTA)         1.00    0.017, 0.076       0.025, 0.053
15, 16    2   ROX1 (CAAACAAA)         0.71    -0.053, 0.022      0.045, 0.131
              MIG1/SFF (TGTTTGTT)     0.94    -0.041, 0.034      0.041, 0.124
17, 18    2   ABF1 (CGTATATA)         1.00    -0.083, -0.013     0.036, 0.075
              unknown (GTGTGTAT)      1.00    0.043, 0.130       -0.089, -0.040
              GAL4 (CAGCCACT)         1.00    -0.165, -0.047     0.022, 0.114
              PHO4 (ACGCGTTA)         0.58    -0.098, -0.044     -0.058, -0.021
Table 6: Gene ontology-based functional annotation of clusters. The cell-cycle phase is approximate, based on Figure 1 in Spellman et al. (1998). "GO": top three Gene Ontology-based annotations (p-values); "PG": Proportion of Genes previously known to be involved in a function (based on Figure 7 in Spellman et al. (1998)) that occur in the same group in our model (highlighted in bold). Italicized attributes indicate that the function is known to occur at this phase of the cell-cycle (GO-enrichment p-values in parentheses). For example, "DNA repair (7/10)" indicates that 7 of the 10 genes previously known to be involved in DNA repair at this phase of the cell-cycle are found in this cluster.
Time (phase)    Gene clusters from regression mixture model (k = 1, 2, 3)

1 (M/G1)   GO: cell wall (4.8e-13), replication (3.3e-09), nucleosome (8.3e-11), chromosome (1.1e-08), conjugation (8.2e-10), bud (2.2e-08)
           PG: chromatin (9/9), budding:fatty acids (4/5), cell wall (5/5), mating (8/8)
2 (G1)     GO: cell wall (1.4e-12), replication (2.1e-08), bud (4e-11), DNA helicase (4.6e-07), nucleosome (1.3e-10), nuclear nucleosome (2.5e-08)
           PG: chromatin (9/9), budding:site selection (6/6), cell wall (7/8), nutrition (9/15), budding:fatty acids (4/5)
3 (S)      GO: bud (4e-16), polarized growth (5.2e-16), replication (1.4e-14)
           PG: −
4 (G2/M)   GO: replication (5.1e-15), bud (1.2e-13), DNA synthesis (8.7e-13), polarized growth (4.5e-13), DNA replication (9.5e-12)
           PG: DNA synthesis (14/19), chromatin (9/9), cell cycle control (5/5), nutrition (12/15)
5 (M)      GO: cell wall (5.9e-15), chromosome (6.7e-11), bud (1.1e-06), replication (1.7e-10), nucleosome (2.5e-08), polarized growth (9.1e-06), polarized growth (4.8e-09), nuclear nucleosome (2.5e-08), bud neck (1.7e-05)
           PG: chromatin (7/9), mitosis:microtubules (4/6)
6 (M/G1)   GO: DNA synthesis (2.6e-15), microtubule process (1.4e-06), replication (7.5e-15), microtubule cytoskeleton (2.6e-06), chromosome (1.7e-14)
           PG: DNA repair (7/10), mitosis:microtubules (4/6), DNA synthesis (13/19), chromatin (8/9)
7 (G1)     GO: cell wall (6.7e-13), DNA synthesis (1.3e-10), polarized growth (6.7e-12), replication (6.5e-10), bud (1.5e-11), DNA helicase (1e-07)
           PG: chromatin (9/9), DNA repair (7/10), cell wall (3/3), DNA synthesis (12/19), mitosis:microtubules (5/6), nutrition (10/15)
8 (S)      GO: DNA elongation (2e-10), cell wall (4.9e-14), DNA synthesis (3.6e-10), polarized growth (1.8e-13), replication (1.4e-09), bud (2.2e-12)
           PG: DNA synthesis (14/19), chromatin (9/9), cell wall (6/6), budding:fatty acids (4/5), mating (6/8)
9 (G2/M)   GO: bud (9.4e-12), replication (1.8e-08), cell wall (1.4e-11), chromosome (1.5e-07), polarized growth (8.1e-10)
           PG: chromatin (5/5), budding:glycolysis (5/7)