Functional supervised and unsupervised classification of gene expression data

Yuko Araki¹ and Sadanori Konishi²
¹ Faculty of Mathematics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan. yuko@math.kyushu-u.ac.jp
² Faculty of Mathematics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan. konishi@math.kyushu-u.ac.jp
Summary. We apply functional discriminant analysis with the extended Bayesian
information criterion to time series yeast gene expression data. The functional
discrimination procedure uses a radial basis expansion with the help of regularization.
To cluster the gene expression data, we first fit the curves by radial basis expansions
and then partition the model coefficients using hierarchical and k-means clustering
procedures. We observed that the procedures provide useful tools for classifying
functions or curves. The combination of functional supervised and unsupervised
classification suggested that subclass partitioning may be promising for further
analysis of gene expression profiles.
Key words: functional data analysis, Bayesian information criterion, radial basis
functions, regularization
1 Introduction
In yeast cell cycle studies, cDNA microarrays have been used to measure the
expression levels of thousands of genes simultaneously. The complicated structure
of gene expression data requires effective supervised and unsupervised classification
methods. In the analysis of time series yeast cell cycle gene expression data,
Spellman et al. [SSZ98] identified cell cycle regulated genes based on hierarchical
clustering analysis. Such data allow inference about how gene expression levels
evolve in time and how genes depend on one another during a given biological
process. Classification of genes enables us to predict the functions of unknown genes
and to identify sets of co-regulated genes.
Figure 1 shows the expression patterns of some yeast cell-cycle genes under
cdc15-based synchronization. We regard these expression profiles as functional data,
since the observed expression level is essentially a function of time. Although the
measurements of n curves can be treated as n vectors [HJ75], applying classical
classification methods directly to such data has several disadvantages: measurement
errors are ignored, and the measurements per subject must be exactly the same
across the n curves. Moreover, if the number of measurements per subject is much
larger than n, the high-dimension, low-sample-size problem may occur.

Fig. 1. Yeast gene expression patterns of some yeast cell cyclic genes under cdc15-based synchronization, measured at 24 time points (horizontal axis: time; vertical axis: expression level)
Araki et al. [AKI04] introduced functional discriminant analysis for the two-class
case, using radial basis function expansions with the help of regularization. To
select the smoothing parameters, they derived an information criterion [KK04] within
the framework of functional data analysis. Araki and Konishi [AK05] extended this
model to the multi-class case and derived the extended Bayesian information
criterion [KK04] in the context of functional discriminant analysis. In this paper we
apply this technique to the time series yeast gene expression data of [SSZ98].
For clustering time series gene expression data, Luan and Li [LL03] developed
a mixed-effects model using B-splines. Abraham et al. [AC03] clustered a real-world
example from the food industry by applying the k-means method to fitted B-spline
coefficients. We cluster gene expression curves while preserving their functional
structure, by first fitting the curves by radial basis expansions and then partitioning
the model coefficients using hierarchical and k-means clustering procedures.
The rest of the paper is organized as follows. In Section 2, we present the
multi-class functional discriminant procedure as defined in [AK05]. Then, in Section 3,
we demonstrate the method by analyzing the yeast cell cycle gene expression data
set [SSZ98].
2 Multi-class Functional Discriminant Analysis
For classifying functional data, Araki and Konishi [AK05] proposed the following
functional discriminant procedure for the multi-class case, together with a model
selection criterion.
Suppose we have n independent observations $\{(x_\alpha(t), g_\alpha);\ \alpha = 1, 2, \dots, n;\ t \in T\}$,
where $x_\alpha(t)$ are functional predictors drawn from a mixture $G$ of the $L$ distinct
groups $G_1, G_2, \dots, G_L$ in proportions $\pi_1, \pi_2, \dots, \pi_L$, respectively, where
$\sum_{k=1}^{L} \pi_k = 1$ and $\Pr(G = k) = \pi_k \ge 0$ $(k = 1, \dots, L)$. The $g_\alpha$ are group indicator
variables, where $g_\alpha = k$ implies that $x_\alpha(t)$ belongs to group $G_k$. A set of functions
smoothed by the Gaussian radial basis smoothing method for functional data [AKI04]
is given by
\[
x_\alpha(t) = \boldsymbol{w}_\alpha^T \boldsymbol{\phi}(t), \qquad \alpha = 1, \dots, n, \tag{1}
\]
where $\boldsymbol{w}_\alpha$ are estimated parameter vectors and $\boldsymbol{\phi}(t)$ is a vector of Gaussian
radial basis functions.
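As a minimal sketch of equation (1), the following Python fragment evaluates a curve from its coefficient vector. It assumes that φ(t) consists of a constant term followed by m Gaussian bumps sharing a common width; the centers, the width and the weights used here are hypothetical illustration values, not quantities reported in this paper.

```python
import math

def gaussian_basis(t, centers, width):
    """Return phi(t) = (1, phi_1(t), ..., phi_m(t)) with Gaussian bumps (assumed form)."""
    return [1.0] + [math.exp(-((t - c) ** 2) / (2.0 * width ** 2)) for c in centers]

def evaluate_curve(weights, t, centers, width):
    """Evaluate x(t) = w^T phi(t) for one curve."""
    phi = gaussian_basis(t, centers, width)
    if len(weights) != len(phi):
        raise ValueError("weights must have one entry per basis function")
    return sum(w * p for w, p in zip(weights, phi))

# Hypothetical setup: 4 equally spaced centers and illustrative (not fitted) weights.
centers = [3.0, 9.0, 15.0, 21.0]
width = 3.0
weights = [0.1, 1.0, -1.0, 1.0, -1.0]
curve = [evaluate_curve(weights, t, centers, width) for t in range(1, 25)]
```

Once the weights are fixed, the curve can be evaluated at any time point, which is what allows later steps to work with the coefficients rather than the raw measurements.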
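The Bayes rule of allocation assigns $x_\alpha(t)$ to the group $G_k$ $(k = 1, 2, \dots, L)$ with
the maximum posterior probability $\Pr(g = k \mid x_\alpha(t))$. The conditional probabilities
of $g$ given $x_\alpha(t)$ are assumed to be $\Pr(g_\alpha = k \mid x_\alpha(t)) = \pi_k^\alpha$ $(k = 1, \dots, L-1)$ and
$\Pr(g_\alpha = L \mid x_\alpha(t)) = \pi_L^\alpha = 1 - \sum_{k=1}^{L-1} \pi_k^\alpha$. The functional discriminant analysis
model for multi-class was defined as
\[
\log \frac{\pi_k^\alpha}{\pi_L^\alpha} = \beta_{kf} + \int_T \beta_k(t)\, x_\alpha(t)\, dt, \qquad k = 1, \dots, L-1, \tag{2}
\]
where $\beta_{kf}$ and $\beta_k(t)$ are the unknown parameter values and functions. By using
the same Gaussian radial basis function $\boldsymbol{\phi}(t)$ as in (1), the functional parameter is
expanded as $\beta_k(t) = \beta_{k0} + \sum_{j=1}^{m} \beta_{kj} \phi_j(t) = \boldsymbol{\gamma}_k^T \boldsymbol{\phi}(t)$ $(k = 1, \dots, L-1)$, where
$\boldsymbol{\gamma}_k = (\beta_{k0}, \beta_{k1}, \dots, \beta_{km})^T$. Substituting $\beta_k(t)$ and $x_\alpha(t)$ into (2), we have
$\log(\pi_k^\alpha / \pi_L^\alpha) = \boldsymbol{\beta}_k^T \boldsymbol{z}_\alpha$, where $\boldsymbol{\beta}_k = (\beta_{kf}, \boldsymbol{\gamma}_k^T)^T$ and $\boldsymbol{z}_\alpha = (1, \boldsymbol{w}_\alpha^T J)^T$.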
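The substitution step can be spelled out as follows. The text does not define J explicitly; the derivation below assumes that J denotes the symmetric matrix of basis cross-products, which is what the algebra requires for the stated form of z_α.

```latex
\begin{align*}
\int_T \beta_k(t)\, x_\alpha(t)\, dt
  &= \int_T \boldsymbol{\gamma}_k^T \boldsymbol{\phi}(t)\, \boldsymbol{\phi}(t)^T \boldsymbol{w}_\alpha \, dt
   = \boldsymbol{\gamma}_k^T J \boldsymbol{w}_\alpha,
  \qquad J = \int_T \boldsymbol{\phi}(t)\, \boldsymbol{\phi}(t)^T \, dt, \\
\log \frac{\pi_k^\alpha}{\pi_L^\alpha}
  &= \beta_{kf} + \boldsymbol{\gamma}_k^T J \boldsymbol{w}_\alpha
   = (\beta_{kf}, \boldsymbol{\gamma}_k^T)\, (1, \boldsymbol{w}_\alpha^T J)^T
   = \boldsymbol{\beta}_k^T \boldsymbol{z}_\alpha .
\end{align*}
```

Under this reading the functional model reduces to an ordinary multinomial logistic model in the finite-dimensional features z_α.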
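Let the $(L-1)$-dimensional response variable $\boldsymbol{y}_\alpha$, having components either 0 or
1 to indicate group membership of a sample, be $\boldsymbol{y}_\alpha = (y_1^\alpha, y_2^\alpha, \dots, y_{L-1}^\alpha)$, where
$y_k^\alpha = 1$ if $g_\alpha = k$ $(k = 1, 2, \dots, L-1)$, otherwise $y_k^\alpha = 0$. This $(L-1)$-dimensional
vector $\boldsymbol{y}_\alpha$ is assumed to be distributed according to the multinomial distribution
given by the conditional mean $\pi_k^\alpha$:
\[
f(\boldsymbol{y}_\alpha \mid x_\alpha(t); \boldsymbol{\beta}) = \prod_{k=1}^{L-1} (\pi_k^\alpha)^{y_k^\alpha} \, (\pi_L^\alpha)^{1 - \sum_{l=1}^{L-1} y_l^\alpha}, \tag{3}
\]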
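where $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^T, \boldsymbol{\beta}_2^T, \dots, \boldsymbol{\beta}_{L-1}^T)^T$, $\pi_k^\alpha = \exp(\boldsymbol{\beta}_k^T \boldsymbol{z}_\alpha) / \{1 + \sum_{l=1}^{L-1} \exp(\boldsymbol{\beta}_l^T \boldsymbol{z}_\alpha)\}$ and
$\pi_L^\alpha = 1 / \{1 + \sum_{l=1}^{L-1} \exp(\boldsymbol{\beta}_l^T \boldsymbol{z}_\alpha)\}$. To construct a model with high generalization
performance, the parameter vector $\boldsymbol{\beta}$ is estimated by the regularization method
[AK05], which maximizes the penalized log-likelihood function
\[
l_\lambda(\boldsymbol{\beta}) = \sum_{\alpha=1}^{n} \log f(\boldsymbol{y}_\alpha \mid x_\alpha(t); \boldsymbol{\beta}) - \frac{\lambda}{2} \sum_{k=1}^{L-1} \boldsymbol{\beta}_k^T K \boldsymbol{\beta}_k, \tag{4}
\]
where $K$ is an $(m+2) \times (m+2)$ matrix with rank $m - d$ and $\lambda$ is a smoothing
parameter. We use the Newton-Raphson algorithm to maximize the penalized
log-likelihood function (4). Given the estimate $\hat{\boldsymbol{\beta}}$, a future observation $x(t)$ may be
assigned to the group $G_k$ which maximizes the posterior probability $\hat{\pi}_k$.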
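The objective (4) can be sketched in a few lines of Python. This is only an illustration of how the posterior probabilities and the penalty combine: the feature vectors, the penalty matrix K and the value of λ are supplied by the caller, and the function names are hypothetical. The Newton-Raphson maximization itself is not shown.

```python
import math

def posteriors(betas, z):
    """Posterior probabilities (pi_1, ..., pi_L) for one curve.

    betas: list of L-1 coefficient vectors beta_k; z: feature vector z_alpha.
    Group L is the reference group, with linear predictor 0.
    """
    scores = [sum(b * v for b, v in zip(beta, z)) for beta in betas] + [0.0]
    top = max(scores)                      # subtract the max for numerical stability
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]

def penalized_loglik(betas, zs, groups, K, lam):
    """Sum of log f(y_alpha | x_alpha; beta) minus (lam / 2) * sum_k beta_k^T K beta_k.

    groups holds g_alpha in {1, ..., L}; K is a square penalty matrix (list of rows).
    """
    # For a one-hot response, log f reduces to the log posterior of the observed group.
    loglik = sum(math.log(posteriors(betas, z)[g - 1]) for z, g in zip(zs, groups))
    penalty = 0.0
    for beta in betas:
        Kb = [sum(kij * bj for kij, bj in zip(row, beta)) for row in K]
        penalty += sum(bi * kbi for bi, kbi in zip(beta, Kb))
    return loglik - 0.5 * lam * penalty
```

Appending a zero score for group L reproduces the expressions for π_k and π_L given after equation (3), since exp(0) = 1 supplies the leading 1 in the denominator.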
The crucial issue in the regularization method is the choice of the smoothing
parameter λ, which can be viewed as a model selection problem. To choose the
optimal value of λ, we use the generalized Bayesian information criterion (GBIC)
obtained in the context of our functional logistic discrimination procedure. The GBIC
was originally derived by [KK04], and it enables us to evaluate models estimated by
the maximum penalized likelihood method. By contrast, the Bayesian information
criterion (BIC; [SC78]) covers only models estimated by the maximum likelihood
method.
3 Application to Yeast Cell Cycle Data
3.1 Functional discriminant analysis
In this section we describe the application of functional discriminant analysis to
yeast genes based on their expression profiles measured during the cell cycle.
Spellman et al. [SSZ98] measured the expression of 6,178 genes across the yeast
genome using cDNA microarrays over about two cell cycles. These data comprise 77
microarrays and consist of two short time-courses (two time points each) and four
medium time-courses (18, 24, 17 and 14 time points). Among these genes, about 800
were characterized as cell cycle regulated based on clustering analysis. These 800
genes were classified into five cell-cycle phases: M/G1, G1, S, S/G2 and G2/M.
In our analysis, we concentrated on the “cdc15-based experiment data” sampled
at 24 time points after synchronization. For simplicity, any gene with a missing
value at any of the 24 time points was discarded. The remaining expression data were
regarded as a discretized realization of 632 expression curves evaluated at 24
time points. Because microarray data usually contain observational noise, it is
important to smooth the expression data first to remove that noise. In addition,
because the expression pattern of each cell cycle related gene can be considered a
function of time, the proposed functional method is appropriate, since it takes the
functional nature of the data into account.
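The smoothing step can be illustrated with a small penalized least squares fit. The sketch below is not the authors' procedure: it swaps in a plain ridge penalty (identity penalty matrix) for whatever penalty the radial basis smoothing method actually uses, and the function names and the layout of the design matrix (one row of basis values per time point) are assumptions made for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                      # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_weights(design, values, lam):
    """Minimise sum_i (x_i - w^T phi(t_i))^2 + lam * ||w||^2 for one gene.

    design[i] is the basis vector phi(t_i); values[i] is the observed expression.
    """
    p = len(design[0])
    gram = [[sum(row[a] * row[b] for row in design) + (lam if a == b else 0.0)
             for b in range(p)] for a in range(p)]
    rhs = [sum(row[a] * v for row, v in zip(design, values)) for a in range(p)]
    return solve_linear(gram, rhs)
```

A larger `lam` shrinks the weights and gives a smoother curve, which is why the choice of the smoothing parameter is treated as a model selection problem.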
We first functionalized the data by the radial basis smoothing method and then
carried out 5-class functional discrimination on the 632 genes. In practice, 450 genes
were used as a training data set, and the remaining 182 were used as a test data set.
Table 1 presents the distribution of the test genes over the five cell-cycle phases
defined by Spellman et al. [SSZ98], where rows represent labeled classes and columns
represent classified results. It is natural that most of the misclassified genes are
assigned to a neighbouring phase group, since genes were labeled according to their
phase.
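The counts in Table 1 can be summarized with a short Python check. The sketch below treats the five phases as cyclically ordered (so that G2/M and M/G1 count as neighbours), which is an assumption about how "nearby" should be read; the function name is hypothetical.

```python
PHASES = ["M/G1", "G1", "S", "S/G2", "G2/M"]   # cyclic order of the cell cycle

# Rows: Spellman's label; columns: class assigned by the discriminant rule (Table 1).
TABLE1 = [
    [10, 4, 0, 1, 5],
    [9, 50, 0, 1, 2],
    [0, 5, 9, 4, 0],
    [0, 0, 3, 24, 4],
    [3, 0, 0, 6, 42],
]

def summarize(confusion):
    """Return (total, correct, misclassified into a neighbouring phase)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    adjacent = sum(confusion[i][j]
                   for i in range(k) for j in range(k)
                   if (i - j) % k in (1, k - 1))
    return total, correct, adjacent

total, correct, adjacent = summarize(TABLE1)
print(f"test genes: {total}, correct: {correct} ({correct / total:.1%})")
print(f"errors in a neighbouring phase: {adjacent} of {total - correct}")
```

On the Table 1 counts this gives 135 of the 182 test genes correctly classified (about 74%), with 43 of the 47 errors falling in a neighbouring phase.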
Table 4 lists the code names of genes that were misclassified with high posterior
probability (more than 90%). From past experiments, we know that such genes
have profiles suited to the predicted classes. Therefore the genes in Table 4
may have been mislabeled for some reason.
Table 1. Distribution of the 182 test genes, drawn from the 632 cell-cycle regulated genes of five different phases defined by Spellman et al., over the classified phases using the proposed method. Rows represent labeled classes, and columns represent classified results

Class   M/G1    G1     S  S/G2  G2/M
M/G1      10     4     0     1     5
G1         9    50     0     1     2
S          0     5     9     4     0
S/G2       0     0     3    24     4
G2/M       3     0     0     6    42
Total     22    59    12    36    53

3.2 Re-clustering by functional discrimination
Once the functional discriminant procedure has been fitted to the 632 cell-cycle gene
expression curves, the estimated decision function can serve as a decision rule for
the remaining 3749 genes, which were not used in the discriminant procedure because
they were labeled “non cyclic”. We assigned these genes to classes with the following
decision rule:
\[
\hat{g}_\alpha = \operatorname*{argmax}_{k} \hat{\pi}_k^\alpha, \qquad k = 1, 2, \dots, L, \quad \alpha = 1, 2, \dots, n, \tag{5}
\]
where
\[
\hat{\pi}_k^\alpha = \frac{\exp(\hat{\boldsymbol{\beta}}_k^T \boldsymbol{z}_\alpha)}{1 + \sum_{l=1}^{L-1} \exp(\hat{\boldsymbol{\beta}}_l^T \boldsymbol{z}_\alpha)}, \quad k = 1, \dots, L-1, \qquad
\hat{\pi}_L^\alpha = \frac{1}{1 + \sum_{l=1}^{L-1} \exp(\hat{\boldsymbol{\beta}}_l^T \boldsymbol{z}_\alpha)}. \tag{6}
\]

Table 2. Code names of genes with posterior probability more than 0.95

YBL051C:G1     YBR110W:G1     YCR005C:S/G2   YDL077C:S/G2
YDL246C:G1     YDR026C:G1     YDR039C:G1     YDR201W:S/G2
YDR334W:G1     YDR510W:S/G2   YEL071W:S/G2   YER030W:S/G2
YFL062W:G1     YGL070C:S/G2   YGL176C:G1     YHR053C:G1
YJL131C:S/G2   YJL219W:S/G2   YLL009C:M/G1   YLR126C:S/G2
YLR258W:G1     YMR193W:G1     YMR276W:S/G2   YNL217W:G1
YOL147C:G1     YOR287C:G1     YPL132W:S/G2
Since the five classes are all “cell-cyclic” classes, one needs to re-investigate genes
with a large value of $\hat{\pi}$. Table 2 shows the code names of such genes, which have
$\hat{\pi} > 0.99$, and their predicted classes.
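The maximum-posterior rule together with the screening threshold can be sketched as below. The posterior vectors in the demonstration are made up for illustration, and both the function name and the ordering of the class labels are assumptions.

```python
def assign_groups(posterior_rows, labels, threshold=0.99):
    """Apply the maximum-posterior rule and flag confident assignments.

    posterior_rows[i] holds (pi_1, ..., pi_L) for gene i; labels names the L groups.
    Returns a list of (predicted label, max posterior, is_confident) tuples.
    """
    results = []
    for probs in posterior_rows:
        k_best = max(range(len(probs)), key=lambda k: probs[k])
        results.append((labels[k_best], probs[k_best], probs[k_best] > threshold))
    return results

labels = ["M/G1", "G1", "S", "S/G2", "G2/M"]
# Hypothetical posterior vectors for two genes (not taken from the paper).
demo = [[0.001, 0.995, 0.002, 0.001, 0.001],
        [0.30, 0.25, 0.20, 0.15, 0.10]]
for label, prob, confident in assign_groups(demo, labels):
    print(label, prob, confident)
```

Only the first hypothetical gene would be flagged for re-investigation, since its largest posterior probability exceeds the threshold.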
Figure 2 plots the genes of the labeled classes together with the predicted genes
with high posterior probability (more than 99%). The figure shows that the predicted
genes indeed have shapes similar to those of the class to which they were assigned.
Even though they were not identified as cell-cyclic genes by Spellman et al. [SSZ98],
they may in practice have cyclic features.
Fig. 2. Plots of the gene expression profiles which were clustered to the classes “G1” and “S/G2” with posterior probability more than 0.99. The solid lines are clustered genes, and the dotted lines are genes belonging to each class, as labeled by Spellman et al. [SSZ98] (panels: G1 and S/G2; horizontal axis: time; vertical axis: gene expression)

3.3 Functional clustering analysis
For the purpose of clustering genes, the cluster indicator vector $g = \{g_1, g_2, \dots, g_n\}$
for the n genes is unknown. In the previous section, we functionalized each gene
expression profile as $\{x_\alpha(t);\ \alpha = 1, 2, \dots, n\}$ by radial basis expansions. Therefore,
to partition the n functional data into L clusters, where L is given (L = 5 here), we
only need to partition their coefficients $\boldsymbol{w}_\alpha \in R^{m+1}$ of equation (1). Hierarchical
clustering and k-means clustering are applied to the estimated coefficients of the
radial basis functions, and the results are compared.
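A minimal sketch of k-means applied to coefficient vectors is given below. It uses Lloyd's algorithm with the first k vectors as initial centroids, which is an assumption made to keep the example deterministic; the toy coefficient vectors are hypothetical.

```python
def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on coefficient vectors; the first k points seed the centroids."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(n_iter):
        # Assign every vector to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break                                   # converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                             # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical two-dimensional coefficient vectors forming two obvious groups.
coeffs = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
assignment, centroids = kmeans(coeffs, 2)
```

Because every curve shares the same basis, distances between coefficient vectors stand in for distances between the smoothed curves.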
To compare Spellman’s labeling with our hierarchical clustering result, we cut
the clustering tree so as to obtain five classes. Deciding where to cut the tree
resolves the tradeoff between the desire for detail and the desire for generality
and simplicity.
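Cutting the tree at a fixed number of classes can be sketched with a naive agglomerative procedure. The linkage used in the analysis is not stated in the text, so complete linkage is assumed here purely for illustration, and the function name is hypothetical.

```python
def agglomerative(points, n_clusters):
    """Naive complete-linkage agglomerative clustering, stopped at n_clusters.

    Stopping the merges at n_clusters corresponds to cutting the full tree there.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[i] for i in range(len(points))]    # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: distance between the two farthest members
                d = max(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]     # merge the closest pair
        del clusters[b]
    labels = [None] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

This version is far too slow for several hundred genes and is meant only to show the mechanics; a library routine would be the practical choice.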
Table 3 shows the distribution of the genes in each cluster in terms of the five
cell-cycle phases defined by Spellman et al. [SSZ98]. Although the hierarchical
method produces a whole tree of nested clusters (n − 1 merges in all), we cut the tree
where it yields five clusters, so that the result can be compared with Spellman’s
original labeling and with the result of our functional discriminant analysis. The
rows represent Spellman’s labeling, while the columns denote the hierarchical (left)
and k-means (right) clusters. The two small clusters consist mostly of genes
expressed in G1 and G2/M. According to the table, the clustering result is strongly
related to Spellman’s original labeling. Further, since each labeled class contains
one or two large clusters, we suggest creating sub-groups within each labeled class.
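A cross-tabulation of this kind, together with the dominant cluster of each labeled class, can be computed as in the following sketch. The labels and cluster numbers in the demonstration are toy values, and the function name is hypothetical.

```python
from collections import Counter

def cross_tabulate(labels, clusters):
    """Count genes for every (label, cluster) pair and find the largest cluster per label."""
    counts = Counter(zip(labels, clusters))
    dominant = {}
    for (label, cluster), n in counts.items():
        if label not in dominant or n > counts[(label, dominant[label])]:
            dominant[label] = cluster
    return counts, dominant

# Toy example: five genes with phase labels and cluster numbers.
toy_labels = ["G1", "G1", "G1", "S", "S"]
toy_clusters = [1, 1, 2, 3, 3]
counts, dominant = cross_tabulate(toy_labels, toy_clusters)
```

Labeled classes whose genes split across two sizeable clusters are the natural candidates for the sub-groups suggested above.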
Table 3. Distribution of the 632 cell-cycle regulated genes of five different phases defined by Spellman et al. over the five estimated gene clusters using the radial basis coefficients: (left) hierarchical and (right) k-means clustering methods

Hierarchical    1    2    3    4    5  Sum    k-means     1    2    3    4    5  Sum
G1             55   48   13    0  116  232    G1         84   31   22    0   95  232
S/G2           11   63    0    0   29  103    S/G2       12   48    0    2   41  103
M/G1            2   50    1   10   18   81    M/G1        3   36    1   17   24   81
S              24    8    0    0   24   56    S          21    4    4    0   27   56
G2/M            0  120    0   23   17  160    G2/M        2   88    0   44   26  160
Sum            92  289   14   33  204         Sum       122  207   27   63  213
Table 4. Spellman’s labeling and clustering results for genes which have high posterior probability (> 90%) in discriminant analysis

Gene      Spellman  Prediction  k-means  Hierarchical
YNL082W   M/G1      G1          5        5
YNL134C   G2/M      M/G1        4        2
YNL160W   G2/M      M/G1        2        2
YNR009W   S/G2      S           1        1
YNR066C   G1        S           5        2
YNR067C   G1        M/G1        3        3
YOR153W   S/G2      G2/M        2        2
YOR247W   G2/M      G1          2        2
YOR250C   G1        S           5        5
YPL014W   G2/M      M/G1        2        2
YPL021W   G2/M      S/G2        2        2
YPL155C   G2/M      S/G2        2        2
YPL265W   S/G2      G2/M        2        2
YPR019W   M/G1      G2/M        2        2
Table 4 presents the correspondence between labels, predictions and cluster
assignments for the genes that have high posterior probability (more than 90%) in
supervised classification. For 78% of these high posterior probability genes, k-means
and hierarchical clustering give the same result.
4 Concluding Remarks
In the analysis of time series yeast cell cycle gene expression data, we introduced
the functional logistic discriminant procedure and functional clustering based on
radial basis expansions. We observed that the procedures provide useful tools
for classifying functions or curves.
An advantage of this method is that the functional counterpart of logistic
discrimination yields a posterior probability for each gene, which enables us to
identify genes that might have a cyclic nature. Further, the combination of functional
supervised and unsupervised classification suggested that subclass partitioning may
be promising for further analysis of gene expression profiles.
Acknowledgements
The authors would like to thank the referees for their helpful comments and suggestions.
References
[AC03] Abraham, C., Cornillon, P. A., Matzner-Lober, E. and Molinari, N.: Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics, 30(3), 581–595 (2003)
[AKI04] Araki, Y., Konishi, S. and Imoto, S.: Functional discriminant analysis for microarray gene expression data via radial basis function networks. Proceedings of COMPSTAT’2004 Symposium. Physica-Verlag/Springer, 613–620 (2004)
[AK05] Araki, Y. and Konishi, S.: Functional discriminant analysis via regularized basis expansions. MHF Preprint Series, 2005-4, Kyushu University, Fukuoka (2005)
[HJ75] Hartigan, J. A.: Clustering algorithms. Wiley, New York (1975)
[KK04] Konishi, S., Ando, T. and Imoto, S.: Bayesian information criteria and smoothing parameter selection in radial basis function networks. Biometrika, 91(1), 27–43 (2004)
[LL03] Luan, Y. and Li, H.: Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics, 19(4), 474–482 (2003)
[SC78] Schwarz, G.: Estimating the dimension of a model. Ann. Statist., 6, 461–464 (1978)
[SSZ98] Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297 (1998)