Functional supervised and unsupervised classification of gene expression data

Yuko Araki1 and Sadanori Konishi2

1 Faculty of Mathematics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan. yuko@math.kyushu-u.ac.jp
2 Faculty of Mathematics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan. konishi@math.kyushu-u.ac.jp

Summary. We apply functional discriminant analysis, together with the extended Bayesian information criterion, to time series yeast gene expression data. The functional discrimination procedure uses radial basis expansions with regularization. To cluster the gene expression data, we first fit the curves by radial basis expansions and then partition the model coefficients using hierarchical and k-means clustering procedures. We observed that the procedures provide useful tools for classifying functions or curves. The combination of functional supervised and unsupervised classification revealed that subclass partitioning may be promising for further analysis of gene expression profiles.

Key words: functional data analysis, Bayesian information criterion, radial basis functions, regularization

1 Introduction

In yeast cell cycle studies, cDNA microarrays have been used to measure the expression levels of thousands of genes simultaneously. The complicated structure of gene expression data calls for effective supervised and unsupervised classification methods. In their analysis of time series yeast cell cycle gene expression data, Spellman et al. [SSZ98] identified cell cycle regulated genes based on hierarchical clustering analysis. Such data allow inference about how gene expression levels evolve over time and how genes depend on one another during a given biological process. Classifying genes enables us to predict the functions of unknown genes and to identify sets of co-regulated genes.
Figure 1 shows the expression patterns of some yeast cell cycle genes under cdc15-based synchronization. We regard these expression dynamics as functional data, since the observed expression level is essentially a function of time. Although the measurements of n curves can be treated as n vectors [HJ75], applying classical classification methods directly to such data has several disadvantages: measurement errors are ignored, and the measurement points must be exactly the same for all n curves. Moreover, if the number of measurements per subject is much larger than n, the high-dimension, low-sample-size problem may occur.

Fig. 1. Yeast gene expression patterns of some yeast cell cycle genes under cdc15-based synchronization, measured at 24 time points (horizontal axis: time; vertical axis: expression level)

Araki et al. [AKI04] introduced functional discriminant analysis for the two-class case, using radial basis function expansions with regularization. To select the smoothing parameters, they derived an information criterion [KK04] within the framework of functional data analysis. Araki and Konishi [AK05] extended this model to the multi-class case and derived the extended Bayesian information criterion [KK04] in the context of functional discriminant analysis. In this paper we apply this technique to the time series yeast gene expression data of [SSZ98].

For clustering time series gene expression data, Luan and Li [LL03] developed a mixed-effects model using B-splines. Abraham et al. [AC03] clustered a real-world example from the food industry by applying the k-means method to fitted B-spline coefficients. We cluster gene expression curves while keeping their functional structure, by first fitting the curves by radial basis expansions and then partitioning the model coefficients using hierarchical and k-means clustering procedures.

The rest of the paper is organized as follows.
In Section 2, we present the multi-class functional discriminant procedure defined in [AK05]. In Section 3, we demonstrate the method by analyzing the yeast cell cycle gene expression data set [SSZ98].

2 Multi-class Functional Discriminant Analysis

For classifying functional data, Araki and Konishi [AK05] proposed the following multi-class functional discriminant procedure with a model selection criterion. Suppose we have $n$ independent observations $\{(x_\alpha(t), g_\alpha);\ \alpha = 1, 2, \ldots, n;\ t \in T\}$, where the $x_\alpha(t)$ are functional predictors drawn from a mixture $G$ of $L$ distinct groups $G_1, G_2, \ldots, G_L$ in proportions $\pi_1, \pi_2, \ldots, \pi_L$, respectively, with $\sum_{k=1}^{L} \pi_k = 1$ and $\Pr(G = k) = \pi_k \ge 0$ $(k = 1, \ldots, L)$. The $g_\alpha$ are group indicator variables, where $g_\alpha = k$ means that $x_\alpha(t)$ belongs to group $G_k$. The functions smoothed by the Gaussian radial basis smoothing method for functional data [AKI04] are given by

$$x_\alpha(t) = w_\alpha^T \phi(t), \quad \alpha = 1, \ldots, n, \tag{1}$$

where the $w_\alpha$ are estimated parameter vectors and $\phi(t)$ is a vector of Gaussian radial basis functions.

The Bayes rule of allocation assigns $x_\alpha(t)$ to the group $G_k$ $(k = 1, 2, \ldots, L)$ with the maximum posterior probability $\Pr(g_\alpha = k \mid x_\alpha(t))$. The conditional probabilities of $g_\alpha$ given $x_\alpha(t)$ are assumed to be $\Pr(g_\alpha = k \mid x_\alpha(t)) = \pi_k^\alpha$ $(k = 1, \ldots, L-1)$ and $\Pr(g_\alpha = L \mid x_\alpha(t)) = \pi_L^\alpha = 1 - \sum_{k=1}^{L-1} \pi_k^\alpha$. The functional discriminant analysis model for the multi-class case is defined as

$$\log\frac{\pi_k^\alpha}{\pi_L^\alpha} = \beta_{kf} + \int_T \beta_k(t)\, x_\alpha(t)\, dt, \quad k = 1, \ldots, L-1, \tag{2}$$

where $\beta_{kf}$ and $\beta_k(t)$ are unknown parameters and parameter functions, respectively. Using the same Gaussian radial basis functions $\phi(t)$ as in (1), the functional parameter is expanded as $\beta_k(t) = \beta_{k0} + \sum_{j=1}^{m} \beta_{kj}\phi_j(t) = \gamma_k^T \phi(t)$ $(k = 1, \ldots, L-1)$, where $\gamma_k = (\beta_{k0}, \beta_{k1}, \ldots, \beta_{km})^T$. Substituting $\beta_k(t)$ and $x_\alpha(t)$ into (2), we have $\log(\pi_k^\alpha/\pi_L^\alpha) = \beta_k^T z_\alpha$, where $\beta_k = (\beta_{kf}, \gamma_k^T)^T$, $z_\alpha = (1, w_\alpha^T J)^T$ and $J = \int_T \phi(t)\phi(t)^T dt$.
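To make the reduction from (2) to a finite-dimensional linear predictor concrete, the following Python sketch builds $z_\alpha = (1, w_\alpha^T J)^T$ numerically. It is only an illustration under stated assumptions: the centers, the common width, the time interval and the coefficient values are hypothetical, the basis vector is taken to be $\phi(t) = (1, \phi_1(t), \ldots, \phi_m(t))^T$ with Gaussian $\phi_j$, and $J$ is taken to be the matrix of integrated basis cross-products (which is what the substitution into (2) requires), approximated here by the trapezoidal rule.

```python
import math

def gaussian_basis(t, centers, width):
    """phi(t) = (1, phi_1(t), ..., phi_m(t)) with Gaussian radial basis functions."""
    return [1.0] + [math.exp(-((t - c) ** 2) / (2.0 * width ** 2)) for c in centers]

def cross_product_matrix(centers, width, t_min, t_max, n_grid=2001):
    """Approximate J = integral of phi(t) phi(t)^T over [t_min, t_max] (trapezoidal rule)."""
    dim = len(centers) + 1
    J = [[0.0] * dim for _ in range(dim)]
    h = (t_max - t_min) / (n_grid - 1)
    for i in range(n_grid):
        phi = gaussian_basis(t_min + i * h, centers, width)
        weight = 0.5 * h if i in (0, n_grid - 1) else h
        for r in range(dim):
            for c in range(dim):
                J[r][c] += weight * phi[r] * phi[c]
    return J

def design_vector(w, J):
    """z = (1, w^T J)^T, so that beta_k^T z = beta_kf + integral of beta_k(t) x(t) dt."""
    dim = len(w)
    return [1.0] + [sum(w[r] * J[r][c] for r in range(dim)) for c in range(dim)]

# Hypothetical setup: m = 3 centers on the time interval [0, 24].
centers, width = [6.0, 12.0, 18.0], 3.0
J = cross_product_matrix(centers, width, 0.0, 24.0)
w_alpha = [0.1, 1.0, -0.5, 0.8]  # illustrative coefficients of one curve (length m + 1)
z_alpha = design_vector(w_alpha, J)
```

Once $z_\alpha$ has been computed for every curve, the functional model (2) can be handled like an ordinary multinomial logistic model with $z_\alpha$ as the predictor vector.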
Let the $(L-1)$-dimensional response variable $y_\alpha$, whose components are 0 or 1 to indicate the group membership of a sample, be $y_\alpha = (y_1^\alpha, y_2^\alpha, \ldots, y_{L-1}^\alpha)$, where $y_k^\alpha = 1$ if $g_\alpha = k$ $(k = 1, 2, \ldots, L-1)$ and $y_k^\alpha = 0$ otherwise. This $(L-1)$-dimensional vector $y_\alpha$ is assumed to follow the multinomial distribution determined by the conditional means $\pi_k^\alpha$:

$$f(y_\alpha \mid x_\alpha(t); \beta) = \prod_{k=1}^{L-1} (\pi_k^\alpha)^{y_k^\alpha}\, (\pi_L^\alpha)^{1 - \sum_{l=1}^{L-1} y_l^\alpha}, \tag{3}$$

where $\beta = (\beta_1^T, \beta_2^T, \ldots, \beta_{L-1}^T)^T$, $\pi_k^\alpha = \exp(\beta_k^T z_\alpha) / \{1 + \sum_{l=1}^{L-1} \exp(\beta_l^T z_\alpha)\}$ and $\pi_L^\alpha = 1 / \{1 + \sum_{l=1}^{L-1} \exp(\beta_l^T z_\alpha)\}$.

To obtain a model with good generalization performance, the parameter vector $\beta$ is estimated by the regularization method [AK05], which maximizes the penalized log-likelihood function

$$l_\lambda(\beta) = \sum_{\alpha=1}^{n} \log f(y_\alpha \mid x_\alpha(t); \beta) - \frac{\lambda}{2} \sum_{k=1}^{L-1} \beta_k^T K \beta_k, \tag{4}$$

where $K$ is an $(m+2) \times (m+2)$ matrix with rank $m - d$ and $\lambda$ is a smoothing parameter. We use the Newton-Raphson algorithm to maximize the penalized log-likelihood function (4). Given the estimate $\hat\beta$, a future observation $x(t)$ is assigned to the group $G_k$ that maximizes the posterior probability $\hat\pi_k$.

A crucial issue in the regularization method is the choice of the smoothing parameter $\lambda$, which can be viewed as a model selection problem. To choose the optimal value of $\lambda$, we use the generalized Bayesian information criterion (GBIC) obtained in the context of our functional logistic discrimination procedure. The GBIC was originally derived by Konishi et al. [KK04], and it enables us to evaluate models estimated by the maximum penalized likelihood method, whereas the Bayesian information criterion (BIC) [SC78] covers only models estimated by the maximum likelihood method.
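As a small numerical illustration of the objective in (4), the sketch below evaluates the class probabilities of (3) and the penalized log-likelihood for a toy problem. It only evaluates the criterion; the Newton-Raphson maximization and the GBIC-based choice of $\lambda$ are not shown. The toy vectors, the labels and in particular the identity matrix used for $K$ are hypothetical placeholders, not the penalty matrix or the data of the paper.

```python
import math

def class_probabilities(betas, z):
    """Return (pi_1, ..., pi_L) for one sample; group L is the reference group."""
    scores = [sum(b * v for b, v in zip(beta, z)) for beta in betas]
    shift = max(0.0, max(scores))          # guards against overflow in exp
    expo = [math.exp(s - shift) for s in scores]
    ref = math.exp(-shift)                 # contribution of the reference group
    denom = ref + sum(expo)
    return [e / denom for e in expo] + [ref / denom]

def penalized_loglik(betas, Z, labels, K, lam):
    """Log-likelihood of the labels minus (lam / 2) * sum_k beta_k^T K beta_k."""
    loglik = 0.0
    for z, g in zip(Z, labels):            # g takes values in {1, ..., L}
        loglik += math.log(class_probabilities(betas, z)[g - 1])
    penalty = 0.0
    for beta in betas:
        dim = len(beta)
        Kb = [sum(K[r][c] * beta[c] for c in range(dim)) for r in range(dim)]
        penalty += sum(b * kb for b, kb in zip(beta, Kb))
    return loglik - 0.5 * lam * penalty

# Hypothetical toy problem: L = 3 groups, predictor vectors z of length 3.
Z = [[1.0, 0.5, -1.0], [1.0, -0.2, 0.3], [1.0, 1.5, 0.0]]
labels = [1, 3, 2]
K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity, purely illustrative
betas = [[0.5, -1.0, 0.2], [0.0, 0.3, 0.4]]               # beta_1 and beta_2
value = penalized_loglik(betas, Z, labels, K, lam=2.0)
```

Because each $y_\alpha$ has at most one nonzero component, the log of (3) reduces to the log-probability of the observed group, which is what the loop accumulates.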
3 Application to Yeast Cell Cycle Data

3.1 Functional discriminant analysis

In this section we apply functional discriminant analysis to yeast genes based on their expression profiles measured during the cell cycle. Spellman et al. [SSZ98] measured the expression of 6,178 genes across the yeast genome using cDNA microarrays over about two cell cycles. These data comprise 77 microarrays and consist of two short time-courses (two time points each) and four medium time-courses (18, 24, 17 and 14 time points). Among these genes, about 800 were characterized as cell cycle regulated based on clustering analysis and were classified into five cell-cycle phases: M/G1, G1, S, S/G2 and G2/M.

In our analysis, we concentrated on the "cdc15-based experiment data" sampled at 24 time points after synchronization. For simplicity, any gene with a missing value at any of the 24 time points was discarded. The remaining expression data were regarded as a discretized realization of 632 expression curves evaluated at 24 time points. Since microarray data usually contain observational noise, it is important to smooth the expression data first. In addition, because the expression pattern of each cell cycle related gene can be considered a function of time, the proposed functional method is appropriate, as it takes the functional nature of the data into account.

We first converted the data to functional form by the radial basis smoothing method, and then carried out five-class functional discrimination on the 632 genes. In practice, 450 genes were used as a training data set and the remaining 182 as a test data set. Table 1 presents the distribution of the test-set genes over the five cell-cycle phases defined by Spellman et al. [SSZ98], where rows represent the labeled classes and columns represent the classification results.
It is natural that most of the misclassified genes are assigned to a neighboring phase, since the genes were labeled according to their phase. Table 4 lists the genes that were misclassified with high posterior probability (more than 90%). From past experiments, we know that the profiles of such genes fit their predicted classes. The genes in Table 4 may therefore have been mislabeled.

Table 1. Distribution of the 182 test-set genes, drawn from the 632 cell-cycle regulated genes of the five phases defined by Spellman et al., over the phases assigned by the proposed method. Rows represent labeled classes, and columns represent classified results

Class   M/G1    G1     S  S/G2  G2/M
M/G1      10     4     0     1     5
G1         9    50     0     1     2
S          0     5     9     4     0
S/G2       0     0     3    24     4
G2/M       3     0     0     6    42
Total     22    59    12    36    53

3.2 Re-clustering by functional discrimination

Once the functional discriminant procedure has been fitted to the 632 cell-cycle gene expression curves, the estimated decision function can serve as a decision rule for the remaining 3749 genes, which were not used in the discriminant procedure because they were labeled "non-cyclic". We assigned them to classes with the following decision rule:

$$\hat g_\alpha = \operatorname*{argmax}_{k}\ \hat\pi_k^\alpha, \quad k = 1, 2, \ldots, L, \quad \alpha = 1, 2, \ldots, n, \tag{5}$$

where

$$\hat\pi_k^\alpha = \frac{\exp(\hat\beta_k^T z_\alpha)}{1 + \sum_{l=1}^{L-1} \exp(\hat\beta_l^T z_\alpha)}, \quad k = 1, \ldots, L-1, \qquad \hat\pi_L^\alpha = \frac{1}{1 + \sum_{l=1}^{L-1} \exp(\hat\beta_l^T z_\alpha)}. \tag{6}$$

Since all five classes are "cell-cyclic" classes, genes with a large value of $\hat\pi$ need to be re-investigated.

Table 2. Code names of genes with posterior probability more than 0.95

YBL051C:G1     YBR110W:G1     YCR005C:S/G2   YDL077C:S/G2
YDL246C:G1     YDR026C:G1     YDR039C:G1     YDR201W:S/G2
YDR334W:G1     YDR510W:S/G2   YEL071W:S/G2   YER030W:S/G2
YFL062W:G1     YGL070C:S/G2   YGL176C:G1     YHR053C:G1
YJL131C:S/G2   YJL219W:S/G2   YLL009C:M/G1   YLR126C:S/G2
YLR258W:G1     YMR193W:G1     YMR276W:S/G2   YNL217W:G1
YOL147C:G1     YOR287C:G1     YPL132W:S/G2
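As a rough illustration of the decision rule (5) with the posterior probabilities (6), the following Python sketch classifies a single predictor vector and flags it when the largest posterior probability exceeds a threshold. The coefficient values, the predictor vectors and the helper names are hypothetical; they stand in for the estimates $\hat\beta_k$ and vectors $z_\alpha$ and are not taken from the yeast analysis.

```python
import math

def posterior_probabilities(beta_hat, z):
    """Equation (6): pi_k for k = 1..L-1, followed by pi_L for the reference group."""
    scores = [sum(b * v for b, v in zip(beta, z)) for beta in beta_hat]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    return [math.exp(s) / denom for s in scores] + [1.0 / denom]

def classify(beta_hat, z, threshold=0.99):
    """Equation (5): return (group in 1..L, its posterior, whether posterior > threshold)."""
    probs = posterior_probabilities(beta_hat, z)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best + 1, probs[best], probs[best] > threshold

# Hypothetical estimates for L = 3 groups and predictor vectors of length 2.
beta_hat = [[0.0, 6.0], [0.0, -6.0]]
confident = classify(beta_hat, [1.0, 1.0])    # strongly favors group 1
uncertain = classify(beta_hat, [1.0, -0.1])   # mildly favors group 2
```

Genes for which the flag is true correspond to the high-posterior cases that the text singles out for re-investigation.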
Table 2 shows the code names of such genes, which have $\hat\pi > 0.99$, together with their predicted classes. Figure 2 plots the genes of the labeled classes and the predicted genes with high posterior probability (more than 99%). The figure shows that the predicted genes have shapes similar to those of the class to which they were assigned. Even though they were not defined as cell-cyclic genes by Spellman et al. [SSZ98], they may in practice have a cyclic character.

Fig. 2. Plots of the gene expression profiles that were assigned to the classes "G1" and "S/G2" with posterior probability more than 0.99 (panels: G1 and S/G2; horizontal axis: time; vertical axis: gene expression). The solid lines are the assigned genes, and the dotted lines are genes belonging to each class as labeled by Spellman et al. [SSZ98]

3.3 Functional clustering analysis

For the purpose of clustering genes, the cluster indicator vector $g = \{g_1, g_2, \ldots, g_n\}$ for the $n$ genes is unknown. In the previous section, we represented each gene expression profile as a function $\{x_\alpha(t);\ \alpha = 1, 2, \ldots, n\}$ by radial basis expansions. Therefore, to partition the $n$ functional data into $L$ clusters, where $L$ is given ($L = 5$ here), we only need to partition their coefficient vectors $w_\alpha \in \mathbb{R}^{m+1}$ in equation (1). Hierarchical clustering and k-means clustering are applied to the estimated coefficients of the radial basis functions, and the results are compared.

To compare Spellman's labeling with our hierarchical clustering result, we cut the clustering tree so as to obtain five classes. The choice of where to cut the tree trades off the desire for detail against that for generality and simplicity. Table 3 shows the distribution of the genes in each cluster in terms of the five cell-cycle phases defined by Spellman et al. [SSZ98].
Although hierarchical clustering yields a full tree built from n − 1 successive merges, we cut the tree at the level that gives five clusters, so that the result can be compared with Spellman's original labeling and with the result of our functional discriminant analysis. The rows represent Spellman's labeling, while the columns denote the hierarchical (upper table) and k-means (lower table) clusters. The two small clusters consist mostly of genes expressed in G1 and G2/M. According to the table, the clustering result is strongly related to Spellman's original labeling. Further, since each labeled class contains one or two large clusters, we suggest that sub-groups could be created within each labeled class.

Table 3. Distribution of the 632 cell-cycle regulated genes of the five phases defined by Spellman et al. over the five estimated gene clusters obtained from the radial basis coefficients by (upper) hierarchical and (lower) k-means clustering

Hierarchical
Class      1     2     3     4     5   Sum
G1        55    48    13     0   116   232
S/G2      11    63     0     0    29   103
M/G1       2    50     1    10    18    81
S         24     8     0     0    24    56
G2/M       0   120     0    23    17   160
Sum       92   289    14    33   204

k-means
Class      1     2     3     4     5   Sum
G1        84    31    22     0    95   232
S/G2      12    48     0     2    41   103
M/G1       3    36     1    17    24    81
S         21     4     4     0    27    56
G2/M       2    88     0    44    26   160
Sum      122   207    27    63   213

Table 4. Spellman's labeling and clustering results for genes that have high posterior probability (> 90%) in discriminant analysis

Gene      Spellman  Prediction  k-means  Hierarchical
YNL082W   M/G1      G1          5        5
YNL134C   G2/M      M/G1        4        2
YNL160W   G2/M      M/G1        2        2
YNR009W   S/G2      S           1        1
YNR066C   G1        S           5        2
YNR067C   G1        M/G1        3        3
YOR153W   S/G2      G2/M        2        2
YOR247W   G2/M      G1          2        2
YOR250C   G1        S           5        5
YPL014W   G2/M      M/G1        2        2
YPL021W   G2/M      S/G2        2        2
YPL155C   G2/M      S/G2        2        2
YPL265W   S/G2      G2/M        2        2
YPR019W   M/G1      G2/M        2        2

Table 4 presents, for the genes that have high posterior probability (more than 90%) in supervised classification, the correspondence between Spellman's labeling, the predicted class and the clustering results. For 78% of these high posterior probability genes, k-means and hierarchical clustering give the same result.
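The coefficient-based clustering described in this section can be sketched in Python as follows. This is a minimal illustration of the k-means part only (hierarchical clustering is omitted), using plain Lloyd iterations with a deterministic farthest-point initialization chosen here for reproducibility; the initialization, the toy coefficient vectors and the labels are assumptions of this sketch and do not come from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two coefficient vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; initial centers are picked by a farthest-point rule."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(far))
    assign = None
    for _ in range(n_iter):
        new_assign = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                      for p in points]
        if new_assign == assign:           # assignments stable: converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                    # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

def cross_tabulate(labels, assign, k):
    """Count how the known labels distribute over the k clusters, as in Table 3."""
    table = {}
    for lab, a in zip(labels, assign):
        table.setdefault(lab, [0] * k)[a] += 1
    return table

# Hypothetical coefficient vectors w_alpha for six curves with known labels.
W = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = ["G1", "G1", "G1", "S", "S", "S"]
assign, centers = kmeans(W, 2)
table = cross_tabulate(labels, assign, 2)
```

The cross-tabulation plays the role of Table 3: each row is a labeled class and each column a cluster, so a class that splits over several large columns is a candidate for sub-grouping.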
4 Concluding Remarks

In the analysis of time series yeast cell cycle gene expression data, we introduced the functional logistic discriminant procedure and functional clustering based on radial basis expansions. We observed that the procedures provide useful tools for classifying functions or curves. An advantage of this method is that the functional counterpart of logistic discrimination yields a posterior probability for each gene, which enables us to identify genes that might be cyclic in nature. Further, the combination of functional supervised and unsupervised classification suggested that subclass partitioning may be promising for further analysis of gene expression profiles.

Acknowledgements

The authors would like to thank the referees for their helpful comments and suggestions.

References

[AC03] Abraham, C., Cornillon, P. A., Matzner-Lober, E. and Molinari, N.: Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics, 30(3), 581–595 (2003)
[AKI04] Araki, Y., Konishi, S. and Imoto, S.: Functional discriminant analysis for microarray gene expression data via radial basis function networks. Proceedings of COMPSTAT'2004 Symposium. Physica-Verlag/Springer, 613–620 (2004)
[AK05] Araki, Y. and Konishi, S.: Functional discriminant analysis via regularized basis expansions. MHF Preprint Series, 2005-4, Kyushu University, Fukuoka (2005)
[HJ75] Hartigan, J. A.: Clustering algorithms. Wiley, New York (1975)
[KK04] Konishi, S., Ando, T. and Imoto, S.: Bayesian information criteria and smoothing parameter selection in radial basis function networks. Biometrika, 91(1), 27–43 (2004)
[LL03] Luan, Y. and Li, H.: Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics, 19(4), 474–482 (2003)
[SC78] Schwarz, G.: Estimating the dimension of a model. Ann. Statist., 6, 461–464 (1978)
[SSZ98] Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297 (1998)