Chromosomal Clustering of Periodically Expressed Genes in Plasmodium falciparum

Pingzhao Hu1, Celia M.T. Greenwood1,2, Cyr Emile M’lan3 and Joseph Beyene1,2
1Hospital for Sick Children Research Institute, 555 University Avenue, Toronto ON, M5G 1X8, (416) 813-7654 x2302
2Department of Public Health Sciences, University of Toronto
3Department of Statistics, University of Connecticut, Storrs, CT
joseph@utstat.toronto.edu

ABSTRACT
The identification of periodically expressed genes has been widely studied, but how such genes are distributed along chromosomes is largely unexplored. In this study we focused on the detection of chromosomal clusters of periodically expressed genes in the stages of the intraerythrocytic developmental cycle (IDC) of Plasmodium falciparum. The DNA microarray data were provided by the organizers of the Critical Assessment of Microarray Data Analysis (CAMDA) 2004 competition. We first applied a multiple linear regression model containing sinusoidal curves to identify periodically expressed oligonucleotides. Setting the proportion of variance explained (PVE) at ≥ 0.7, a list of 2949 periodically expressed oligonucleotides (2204 genes) with a false discovery rate (FDR) of 3 × 10^-5 was selected. Subsequently, a supervised support vector machine (SVM) method was used to assign these oligonucleotides to four IDC stages with at least an 80% level of confidence. Genes in each stage were then mapped onto the 14 chromosomes of the P. falciparum genome. A total of 312 chromosomal clusters were identified. Finally, we performed a brief analysis of gene functions in these clusters. Our findings reveal that the expression of periodically regulated genes is coordinated locally on chromosomes, where there are clusters of genes within the same stage, suggesting cis-regulation.
Keywords
Asexual intraerythrocytic developmental cycle, multiple linear regression model, support vector machine, class probability, chromosomal clusters

1. INTRODUCTION
Plasmodium falciparum is the organism that causes human malaria. The 22.8 Mb genome of P. falciparum comprises 14 linear chromosomes. Understanding the genome of P. falciparum will hopefully provide a foundation for prevention and treatment of the disease. The complete P. falciparum life cycle includes three major developmental stages: the mosquito, liver and blood stages. The periodic nature of the genes expressed in one of these stages, the blood stage, which has been called the asexual intraerythrocytic developmental cycle (IDC), has been investigated in detail by Bozdech et al. [1]. Genes sharing this periodicity are likely to be co-regulated. Previous studies on Saccharomyces cerevisiae [2], Homo sapiens [3] and Caenorhabditis elegans [4] have demonstrated that co-regulated genes are clustered together on chromosomes. Proteomic analysis of the three developmental stages of P. falciparum also revealed the presence of chromosomal clusters encoding co-expressed proteins [5]. The focus of this study is the association between chromosomal location and the periodic nature of the genes expressed in the IDC, using the dataset of Bozdech et al. [1].

2. METHODS
2.1 Data Source and Preprocessing
The organizers of CAMDA 2004 provided three datasets: the complete raw data set, a quality-controlled data set and an overview data set. In this study we used the quality-controlled data set to simplify the preprocessing and to facilitate comparisons with the original work on this dataset [1]. The data set includes 5080 oligonucleotides measured at 46 time points spanning 48 hours. The data were originally normalized using the NOMAD (Normalization of MicroArray Data) database system. Of the oligonucleotides, 243 had a missing value at one or more time points.
We imputed missing values in the dataset using the 10-nearest-neighbor averaging method [6]. This imputation method can be summarized as follows: if oligonucleotide x has a missing value at time point j, the approach first finds the 10 other oligonucleotides that have a value measured at time point j and whose expression is most similar to that of x at the other 45 time points. The weighted average of the expression values at time point j from these 10 similar oligonucleotides is then used as an estimate of the missing intensity value in oligonucleotide x. The inverse of the Euclidean distance was used to weight the average.

2.2 Identification of Periodically Expressed Oligonucleotides
In order to analyze periodic gene expression measurements objectively, several studies [7][8] calculated a numerical score quantifying the periodicity of the expression profile of each gene based on Fourier analysis. Here we applied standard statistical methods [9], consisting of multiple linear regression, $R^2$ scores and F-statistics, to identify periodically expressed genes. Since many genes were measured by more than one oligonucleotide, we fitted a linear model for each oligonucleotide. For oligonucleotide i at time point j, the variation in log expression ratios over the course of the study was modeled as a linear combination of sine-cosine waves as follows:

$$y_{ij} = b_{0i} + b_{1i}\cos(2\pi t_j/T) + b_{2i}\sin(2\pi t_j/T) + e_{ij}, \qquad (1)$$

where T is the period for the cyclically expressed oligonucleotides. We estimated the period by minimizing the sum of squared errors (SSE) of least squares fits of known periodically expressed oligonucleotide profiles to model (1), for different values of T. Equation (1) is a standard multiple linear regression model, so the regression parameters $b_{0i}$, $b_{1i}$, $b_{2i}$ can be estimated using the least squares method, for fixed T.

In order to evaluate whether an oligonucleotide is periodically expressed in the intraerythrocytic developmental cycle, the goodness-of-fit of the linear model for each oligonucleotide’s expression profile was measured by $R^2$. The $R^2$ value quantifies the “proportion of variance explained (PVE)” by the periodicity. The PVE falls between zero and one, and values close to one indicate greater periodicity for a given T. The statistical significance of each $R^2$ can be determined by the F-statistic [9],

$$F = \frac{(J - p)\,R^2}{(p - 1)(1 - R^2)}.$$

Here J is the number of time points and p = 3 is the number of parameters in the linear model.

Selecting periodically expressed oligonucleotides based on F-statistics involves multiple testing, as described by Dudoit et al. [10]. The false discovery rate (FDR) [11] has become a popular error measure for controlling the false positive and false negative errors in this situation. We applied the algorithm of Taylor et al. [12], a column-wise permutation-based method (that is, we permuted the times in the data), to calculate the FDR. Their method uses t-statistics; here we used our F-statistics. The details of this method are as follows:

1. Create B column-wise permutations, producing F-statistics $F_{1,b}, \ldots, F_{I,b}$ for oligonucleotides $i = 1, 2, \ldots, I$ and permutations $b = 1, 2, \ldots, B$.
2. Let $F_{i,0}$ be the F-statistic for oligonucleotide i in the original data. For a cutpoint $F_c$, let $\hat{R} = \sum_{i=1}^{I} I(|F_{i,0}| > F_c)$ and $\hat{V} = (1/B)\sum_{b=1}^{B}\sum_{i=1}^{I} I(|F_{i,b}| > F_c)$.
3. Estimate the FDR by $\hat{\pi}_0\hat{V}/\hat{R}$, where $\pi_0$ is the true proportion of oligonucleotides without periodicity among all I oligonucleotides, as suggested by Efron et al. [13] and Storey [14].

We followed the methods of Storey [14] and Taylor et al. [12] to calculate $\hat{\pi}_0$. Statistically significant oligonucleotides were chosen by comparing the F-statistic $F_{i,0}$ with a given cutpoint $F_c$ at the estimated FDR.

2.3 Classification of Periodically Expressed Oligonucleotides
Many studies have used clustering methods to classify genes into cell cycle phases [7][8].
However, unsupervised classification methods require an arbitrary specification of the number of clusters in a dataset, and furthermore cannot use prior information. In this study, the IDC contains 4 stages, namely ring/early trophozoite, trophozoite/early schizont, schizont and early ring, and a total of 472 oligonucleotides (351 genes) were known to be expressed in one of these stages [1]. Based on Table S2 and Figure 2 of the study by Bozdech et al. [1], there are 183, 75, 69 and 24 periodically expressed genes in these four stages, respectively. Therefore, to classify the oligonucleotides identified in Section 2.2 into these stages with a high level of confidence, we used a pairwise coupling method to solve this multiclass classification problem [15]. This involves estimating class probabilities for each pair of classes, and then coupling the estimates together for each oligonucleotide.

We employed an SVM with a radial basis function (RBF) kernel as our base classifier for each pair of classes. The SVM is a core machine learning technique with a strong theoretical basis and excellent empirical success [16]. It has been widely applied in handwritten digit recognition [16] and text classification. Generally speaking, given a periodically expressed oligonucleotide x, the SVM outputs a decision value $f_{kl}$ for each pair of classes k and l. While the sign and magnitude of $f_{kl}$ can be used to determine the class prediction and the confidence level of that prediction, the SVM decision value $f_{kl}$ is an uncalibrated value that does not always translate directly into a probability value useful for estimating confidence. Platt [17] proposed a parametric model for calibration in which the class probability $r_{kl}$ for each pair of classes k and l was estimated based on

$$\hat{r}_{kl} = \frac{1}{1 + e^{A f_{kl} + B}},$$

where A and B are estimated by minimizing the negative log-likelihood function. A common way to combine the pairwise comparison scores $r_{kl}$ is through a majority voting method described by Friedman [18].
The voting method selects the class label with the most winning two-class decisions. In our study, however, a confidence level is required in order to assign a periodically expressed oligonucleotide to a stage. Hastie and Tibshirani [15] proposed an algorithm to calculate coupled class probabilities for this task. For the periodically expressed oligonucleotide x, the pairwise calibrated SVM computes estimates $\hat{r}_{kl}$ for classes $k, l = 1, \ldots, 4$, $k \neq l$. Assume that $n_{kl}$ is the number of oligonucleotides in the training set for the classifier trained on classes k and l. We wish to estimate $\{p_k\}_{k=1}^{4}$, where $p_k = p(\text{class } k \mid x)$. The algorithm of Hastie and Tibshirani works as follows:

(1) Start with some initial $\hat{p}_k > 0$ and corresponding $\hat{u}_{kl} = \hat{p}_k/(\hat{p}_k + \hat{p}_l)$.
(2) Repeat ($k = 1, \ldots, 4, 1, \ldots$) until convergence:
$$\hat{p}_k \leftarrow \hat{p}_k \cdot \frac{\sum_{l \neq k} n_{kl}\hat{r}_{kl}}{\sum_{l \neq k} n_{kl}\hat{u}_{kl}},$$
renormalize $\hat{p} \leftarrow \hat{p} / \sum_{k=1}^{4}\hat{p}_k$, where $\hat{p} = (\hat{p}_1, \hat{p}_2, \hat{p}_3, \hat{p}_4)$, and recompute the $\hat{u}_{kl}$.
(3) The final class prediction y is based on the maximum, $y = \arg\max_k(\hat{p}_k)$, and so we assign $\hat{p}_y$ as the probability that the oligonucleotide x falls into the predicted stage $y \in \{1, 2, 3, 4\}$.

A total of 472 oligonucleotides with known stages were used as the training data for this algorithm. After training, class predictions were estimated for all periodically expressed oligonucleotides identified using the methods described in Section 2.2 that were not included in the training data. We assigned the periodically expressed oligonucleotide x to stage y if the maximum probability $\hat{p}_y$ was greater than or equal to 0.8. When different oligonucleotides from the same gene were assigned to more than one stage, we assigned the gene to the stage with the highest confidence estimate $\hat{p}_y$.

2.4 Clustering of Periodically Expressed Genes on Chromosomes
We used www.PlasmoDB.org to obtain the physical locations and ordering of all genes, and marked the stage assigned to each gene (if any). Then we examined the patterns of periodically expressed, stage-assigned genes along the 14 chromosomes. Using the chromosomal positions obtained above, we defined a cluster as two or more consecutive loci whose expression patterns were matched to the same stage. Based on this definition, we can identify chromosomal clusters for each stage for a given cluster size.

3. RESULTS
3.1 Estimation of the Cycle of Periodically Expressed Oligonucleotides
We used the 472 oligonucleotides (351 genes) whose staging is known to estimate the period T by fitting equation (1). Bozdech et al. [1] found that the majority of gene profiles exhibited an overall expression period of 0.75-1.5 cycles per 48 h. For this reason we fitted equation (1) over a range of 100 T values evenly spaced from 1 hour to 100 hours. As can be seen in Figure 1, the sum of squared errors over the 351 genes was minimized at 50 hours. Therefore, we selected $\hat{T} = 50$ for subsequent analysis.

Figure 1. The relationship between the SSE and period

3.2 Identification of Periodically Expressed Oligonucleotides
For all the remaining oligonucleotides whose staging is unknown, we fitted equation (1) using the least squares method, and calculated the PVE and the corresponding F-statistics. We defined an oligonucleotide as periodically expressed if its PVE was at least 0.7, which corresponds to an F-statistic of 50.2. There were 2949 oligonucleotides (2204 genes) that passed this filtering criterion and demonstrated periodicity. Figure 2 shows examples of expression profiles for 4 genes, PFL2355w, PFA0285c, PFC0185w and PF11_0231. These genes were selected because they represent four distinct sine-cosine wave profiles in the dataset. The first peaks of the sine-cosine wave forms of these four genes were at about 15 hours, 36 hours, 43 hours and 5 hours, respectively.
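As an illustration of the fitting and filtering steps, the following is a minimal Python sketch, not the code used in the study. It fits equation (1) by least squares, computes the PVE and F-statistic for each profile, converts a PVE threshold into the corresponding F cutpoint, and estimates the FDR by permuting the time points. The function names, the simulated profiles, the evenly spaced sampling times and the conservative choice $\hat{\pi}_0 = 1$ are all assumptions made for the example.

```python
import numpy as np

def design_matrix(t, T):
    """Columns: intercept, cos(2*pi*t/T), sin(2*pi*t/T), as in equation (1)."""
    w = 2.0 * np.pi * np.asarray(t, dtype=float) / T
    return np.column_stack([np.ones_like(w), np.cos(w), np.sin(w)])

def pve_and_f(Y, t, T):
    """Least-squares fit of equation (1) to every row of Y (profiles x time points).

    Returns R^2 (the PVE) and F = (J - p) R^2 / ((p - 1)(1 - R^2)) per profile.
    """
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    X = design_matrix(t, T)
    J, p = X.shape
    coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)   # coef has shape (p, n_profiles)
    sse = ((Y.T - X @ coef) ** 2).sum(axis=0)
    sst = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    pve = 1.0 - sse / sst
    return pve, (J - p) * pve / ((p - 1) * (1.0 - pve))

def f_cutpoint(pve, J=46, p=3):
    """F-statistic that corresponds to a given PVE threshold."""
    return (J - p) * pve / ((p - 1) * (1.0 - pve))

def permutation_fdr(Y, t, T, Fc, B=50, pi0=1.0, seed=0):
    """Estimate pi0 * V / R by permuting the time points (columns) B times.

    pi0 = 1.0 is a conservative placeholder, not the estimate used in the paper.
    """
    rng = np.random.default_rng(seed)
    R = np.sum(pve_and_f(Y, t, T)[1] > Fc)
    V = np.mean([np.sum(pve_and_f(Y[:, rng.permutation(Y.shape[1])], t, T)[1] > Fc)
                 for _ in range(B)])
    return pi0 * V / max(R, 1)

# Hypothetical data: 20 periodic profiles (period 50 h) and 80 pure-noise profiles.
rng = np.random.default_rng(1)
t = np.linspace(1.0, 48.0, 46)            # assumed sampling times, in hours
phase = rng.uniform(0.0, 2.0 * np.pi, size=(20, 1))
periodic = np.cos(2.0 * np.pi * t / 50.0 - phase) + rng.normal(0.0, 0.1, (20, 46))
noise = rng.normal(0.0, 1.0, (80, 46))
Y = np.vstack([periodic, noise])

Fc = f_cutpoint(0.7)                      # about 50.2 for J = 46 and p = 3
pve, F = pve_and_f(Y, t, 50.0)
n_selected = int(np.sum(F > Fc))
fdr = permutation_fdr(Y, t, 50.0, Fc)
print(round(Fc, 1), n_selected, fdr)
```

With J = 46 and p = 3, a PVE of 0.7 gives F = 43 × 0.7 / (2 × 0.3) ≈ 50.2, which is the cutpoint quoted above.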
We observed that most of the oligonucleotides that passed the PVE filtering criterion had one of these four profiles. This suggested that there were four dominant expression patterns in the selected periodically expressed oligonucleotides.

Figure 2. Example expression profiles for 4 genes, shown with a least-squares fit of the data (curved line)

In order to verify whether random variation can produce marked systematic patterns of expression, we performed 10,000 permutations of the data over the time points, and refitted equation (1) to the permuted datasets. The estimated FDR was only 0.00003, strongly suggesting that the randomized datasets do not demonstrate periodicity.

3.3 Classification of Stage Group for Periodically Expressed Oligonucleotides
As stated before, there are 472 oligonucleotides (351 genes) whose staging was known. These were used as the training samples in the SVM. Excluding these oligonucleotides, we had 2545 oligonucleotides (1918 genes) for testing. (It should be noted that some of the oligonucleotides in the training sample had PVE values less than 0.7, which explains why the number of oligonucleotides in the combined training and testing samples does not equal the number of periodically expressed oligonucleotides selected.) We built pairwise binary SVM classifiers with the RBF kernel for the four stages, which generated 6 predictors. A 10-fold cross-validation scheme was used to evaluate each binary predictor, and the overall cross-validation error was 3.4%. Of the 1918 periodically expressed genes of unknown stage, we assigned 718 genes (923 oligonucleotides) to the ring/early trophozoite stage, 624 genes (835 oligonucleotides) to the trophozoite/early schizont stage, 141 genes (186 oligonucleotides) to the schizont stage and 167 genes (199 oligonucleotides) to the early ring stage, each with an estimated class probability $\hat{p}_y$ of at least 0.8.
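The class probabilities used for these assignments come from the calibration and coupling steps of Section 2.3. A minimal Python sketch of those two steps is given below, under stated assumptions: the function names are hypothetical, equal weights $n_{kl}$ are used when none are supplied, and the pairwise estimates in the example are constructed to be exactly consistent with a known probability vector rather than taken from a trained SVM.

```python
import numpy as np

def platt_probability(f, A, B):
    """Platt's sigmoid: map an SVM decision value f to a pairwise probability."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def pairwise_coupling(r, n=None, tol=1e-10, max_iter=1000):
    """Couple pairwise estimates r[k, l] ~ P(class k | class k or l, x) into p[k].

    n[k, l] is the number of training examples behind the (k, l) classifier;
    equal weights are assumed when n is not given. The diagonal of r is ignored.
    """
    r = np.asarray(r, dtype=float)
    K = r.shape[0]
    n = np.ones((K, K)) if n is None else np.asarray(n, dtype=float)
    p = np.full(K, 1.0 / K)                      # step (1): any positive start
    for _ in range(max_iter):
        p_old = p.copy()
        for k in range(K):                       # step (2): cycle over the classes
            others = np.arange(K) != k
            u = p[k] / (p[k] + p[others])        # current u_kl for l != k
            p[k] *= np.sum(n[k, others] * r[k, others]) / np.sum(n[k, others] * u)
            p /= p.sum()                         # renormalize
        if np.max(np.abs(p - p_old)) < tol:
            break
    return p

# Hypothetical pairwise estimates that are exactly consistent with p_true.
p_true = np.array([0.85, 0.08, 0.05, 0.02])
r = p_true[:, None] / (p_true[:, None] + p_true[None, :])
p_hat = pairwise_coupling(r)
stage = int(np.argmax(p_hat)) + 1                # step (3): stages numbered 1 to 4
assigned = p_hat.max() >= 0.8                    # the 0.8 confidence rule
```

In this toy case the coupled estimate recovers `p_true`, so the oligonucleotide would be assigned to stage 1; had the largest coupled probability fallen below 0.8, it would have been left unassigned.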
Another 268 genes that had class probabilities less than 0.8 were not assigned to any of these four stages.

Figure 3. Heat map of periodically expressed genes predicted in four stages of IDC

Figure 3 shows the stageogram of the IDC transcriptome based on the 1650 classified genes that had a class probability of at least 0.8 and the 351 training set genes for which the stage was known (class probability 1). First, the genes were ordered by predicted stage: from top to bottom the ordering is ring/early trophozoite, trophozoite/early schizont, schizont and early ring. Second, within each stage genes were sorted by probability in descending order. Our IDC stageogram demonstrates clear boundaries among these four stages, unlike the study of Bozdech et al. [1], where the stageogram showed a cascade of continuous expression. Because genes with low PVE or low class probabilities were not classified into the four stages, the genes in our stageogram were highly selected for clear and consistent periodic signatures.

We calculated a meta-expression profile for each stage over the 46 time points by averaging the expression values of all genes predicted to be in the stage. Sine-cosine curves were then fitted to the meta-expression profiles using equation (1). As can be seen in Figure 4, the meta-expression profiles of each stage are very similar to the profiles of the 4 representative genes shown in Figure 2. Our proposed method clearly identifies stage-specific patterns.

3.4 Chromosomal Clustering
In the remaining analysis, we focused on the 351 genes with known staging, together with the 1650 genes whose estimated class probabilities were at least 0.8, for a total of 2001 genes.
Figure 4. Meta-gene expression profiles of 4 stages. Panels: average gene expression profile of the ring/early trophozoite stage, the trophozoite/early schizont stage, the schizont stage and the early ring stage; x-axis: Hours; y-axis: log2(Cy5/Cy3).

Table 1. Number of clusters on each chromosome for different cluster sizes (cluster size = number of adjacent loci predicted to belong to the same stage in a cluster)

Chromosome     2     3     4     5
Chr-1          4     1
Chr-2         15     2     1     1
Chr-3         14     2     2     1
Chr-4          9     3     2     1
Chr-5         19     1     1
Chr-6         13
Chr-7         15     5     1
Chr-8         14     1
Chr-9         16     2
Chr-10        12     5     1
Chr-11        13     7     1
Chr-12        18     3     1
Chr-13        33    15     2
Chr-14        43     8     3
Total        238    55    15     4

The size-5 column total of 4 includes one cluster whose chromosome cannot be determined from the table as given.

As stated before, a chromosomal cluster is defined as two or more adjacent loci that are classified to the same stage. In order to determine whether gene clustering exists in the P. falciparum genome, we mapped the periodically expressed genes onto the 14 chromosomes in a stage-dependent manner. Table 1 shows the number of clusters of different sizes on each chromosome. A total of 238 clusters containing 2 loci, 55 clusters containing 3 loci, 15 clusters containing 4 loci and 4 clusters containing 5 loci were identified. It should be noted that since the chromosomal clusters were defined in a stage-dependent way, the number of clusters for each chromosome and cluster size in Table 1 is the total number of clusters over all four IDC stages. For example, on chromosome 1 for cluster size 2, we identified 2 clusters at the trophozoite/early schizont stage, 1 cluster at the schizont stage and 1 cluster at the early ring stage, so the total number of clusters on this chromosome is 4.
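The counting rule behind Table 1 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function name and the toy chromosome are hypothetical, and it assumes that any locus without a stage assignment (marked `None`) interrupts a cluster.

```python
from collections import Counter
from itertools import groupby

def count_clusters(stages, min_size=2):
    """Count maximal runs of adjacent loci assigned to the same stage.

    `stages` lists the stage label of every locus in chromosomal order;
    None marks a locus that is not periodically expressed or not assigned.
    Returns a Counter keyed by (stage, cluster size).
    """
    counts = Counter()
    for stage, run in groupby(stages):
        size = sum(1 for _ in run)
        if stage is not None and size >= min_size:
            counts[(stage, size)] += 1
    return counts

# Hypothetical chromosome of 13 loci (stages numbered 1 to 4).
toy = [1, 1, 1, None, 2, 2, None, 3, 1, None, None, 4, 4]
clusters = count_clusters(toy)

# Summing over stages gives one row of a table laid out like Table 1.
by_size = Counter()
for (stage, size), k in clusters.items():
    by_size[size] += k
```

Here the toy chromosome contains one stage-1 cluster of size 3 and one cluster of size 2 for each of stages 2 and 4, so its row would read 2 clusters of size 2 and 1 cluster of size 3.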
Total number of clusters in Table 1: 312.

Figure 5 shows a whole-genome view of the 74 large clusters (where 3 or more adjacent genes were mapped to the same stage). Blue, yellow, green and red colors represent clusters identified at the ring/early trophozoite, trophozoite/early schizont, schizont and early ring stages, respectively; circle, diamond and triangle symbols denote cluster sizes of 3, 4 and 5, respectively. It can be seen that most large clusters were identified at the ring/early trophozoite and trophozoite/early schizont stages with cluster size 3.

Figure 5. Whole chromosome view of 74 large clusters distributed on 14 chromosomes

In order to evaluate whether patterns of clustering similar to those observed in Table 1 could occur by chance, we performed a permutation analysis. For each chromosome, we randomly permuted the order of all the genes. Holding the number of periodically expressed genes fixed, together with the number of genes assigned to each of the four stages, we randomly assigned these outcomes to the re-ordered genes and counted the number of clusters observed. Figure 6 illustrates how the permutations were performed.

For chromosome 14 (the longest chromosome), there are 787 genes, of which 160, 107, 24 and 35 were assigned to stages 1-4, respectively. We found a total of 26, 17, 0 and 0 clusters of size 2 for stages 1-4 in the 10,000 permuted data sets, compared to 21, 19, 1 and 2 clusters of size 2 for stages 1-4 in the original data. No single permutation had more than 5 clusters of size 2 for all four stages. For larger clusters, the difference between the original and permuted data is even more dramatic. For other chromosomes, results were qualitatively similar.

Figure 6. Assessment of significance of chromosomal clustering (panels: Original Data; Permuted Data Sets). On this fictitious chromosome, there are 30 genes of which 20 are periodically expressed.
Solid blue, yellow, green and red colors represent periodically expressed genes assigned to stages 1-4 (ring/early trophozoite, trophozoite/early schizont, schizont and early ring stages), solid black symbols represent genes that are periodically expressed but were not assigned to a particular stage, and open circles are genes that were not periodically expressed. It can be seen that in the original data there is one cluster of size 3 in stage “blue” and one yellow cluster of size 2. Three sample permutations are shown above, and one of the permutations gives a blue cluster of size 2. Hence the empirical significance would be 0/3 for yellow clusters of size 2 and 1/3 for blue clusters of size 2.

Our study identified many more clusters than the study of Bozdech et al. [1]. They defined a chromosomal cluster as one in which the correlation of 70% of the possible pairs of adjacent genes on the same chromosome was greater than or equal to 0.75. Based on this criterion, they found only 37 clusters consisting of 3 genes and 14 clusters consisting of more than 3 genes. In our study, there were 55 clusters with 3 genes and 19 clusters consisting of more than 3 genes. Many clusters detected in their study were also found in ours. For example, 34 of the 51 large clusters (cluster size 3 or larger) identified in their study were also found among the 74 large clusters we detected. The seven genes of the SERA family found on chromosome 2 [19] were observed in two clusters. The first SERA gene cluster contained two genes at the trophozoite/early schizont stage, and the other SERA gene cluster contained 5 genes at the schizont stage. Based on our study and that of Bozdech et al. [1], it seems that only clusters with 3 or fewer periodically expressed genes within the same stage are prevalent in the P. falciparum genome. This criterion includes about 94% of the chromosomal clusters detected in our study.
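The permutation scheme illustrated in Figure 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure as implemented in the study: the function names and the toy chromosome are hypothetical, and the empirical significance is taken to be the fraction of permutations with at least as many clusters of the given stage and size as the original data, which is one common convention and may differ from the exact rule used above.

```python
import random
from itertools import groupby

def n_clusters(stages, stage, size):
    """Number of maximal runs of exactly `size` adjacent loci in `stage`."""
    return sum(1 for s, run in groupby(stages)
               if s == stage and len(list(run)) == size)

def empirical_significance(stages, stage, size, n_perm=1000, seed=0):
    """Shuffle the labels along the chromosome and compare cluster counts.

    Shuffling the whole label vector keeps the number of genes in each stage,
    and the number of unassigned genes, fixed. Returns the observed count and
    the fraction of permutations with at least that many clusters.
    """
    rng = random.Random(seed)
    observed = n_clusters(stages, stage, size)
    labels = list(stages)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if n_clusters(labels, stage, size) >= observed:
            hits += 1
    return observed, hits / n_perm

# Hypothetical chromosome of 30 loci: five adjacent stage-1 genes, ten
# scattered stage-2 genes, and fifteen loci without a stage (None).
toy = [1] * 5 + [2, None, None, 2, None] * 5
observed, p_value = empirical_significance(toy, stage=1, size=5)
```

Because a run of five adjacent stage-1 genes is very unlikely to arise when only five such genes are scattered at random over 30 loci, the estimated significance for this toy example is small.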
It is also interesting to note that there was no obvious difference in cluster distribution across the chromosomes; for example, approximately 33% of the clusters were on the two longest chromosomes, 13 and 14, and these chromosomes form approximately 35% of the total genome length.

3.5 Gene Functional Analysis of Chromosomal Clusters
We downloaded genes with GO terms and EC numbers for P. falciparum strain 3D7 from www.PlasmoDB.org. A total of 3119 loci have been annotated to 2074 functions. Of the 312 clusters, which contain 721 loci, 126 (40.4%) contained at least two adjacent loci that have been functionally annotated. More than 90% of the loci in these 126 clusters have been assigned to at least 2 functions. For the large clusters, where there are 3 or more adjacent genes in a cluster, only genes in two of the 51 large clusters (the SERA gene cluster and the ribosomal protein gene cluster) were shown to have a functional relationship within the cluster in the study of Bozdech et al. [1]. However, we found that 11 of the 74 large clusters (including the above two) contain at least two loci whose annotation clearly indicates that the genes are functionally related. For example, we identified an energy gene cluster (PF10_0121, PF10_0122 and PF10_0123) at stage 1 on chromosome 10. An RNA processing gene cluster (MAL13P1.322, MAL13P1.323 and PF13_0340) and an ATP binding gene cluster (PF13_0177, PF13_0178, PF13_0179 and PF13_0180) were also found at stage 1 on chromosome 13.

4. DISCUSSION
In this study we proposed a comprehensive procedure with a solid statistical basis to identify periodically expressed oligonucleotides, classify these oligonucleotides into different stages of the intraerythrocytic developmental cycle of P. falciparum and map them to chromosomes to detect chromosomal clusters. This method provides a chromosomal viewpoint of the higher-order organization of the genome. We found that around 60% of the oligonucleotides were periodically expressed by our definition.
Most of them were highly expressed in the ring/early trophozoite and trophozoite/early schizont stages of the IDC. Our study demonstrated that many of the periodically expressed genes were arranged in clusters with 3 or fewer periodically expressed genes within the same stage. In addition, our preliminary analysis showed that some periodically expressed genes with similar functions are clustered together. This information may be useful when annotating the function of the many unknown gene products in the P. falciparum genome.

It should be noted that there are some concerns in this analysis. The first concern is that our estimate of the FDR for identifying periodically expressed oligonucleotides was very small, which raises the possibility of underestimation. One possible reason for a downward bias in the FDR is that there were significant serial correlations in the expression levels of a given oligonucleotide over time, due to the slowly varying nature of the cell culture. Anderson and Legendre [20] pointed out that permutation of raw data under the full model will not maintain the type I error close to the nominal level when there is collinearity among the independent variables. They suggested that permutation of residuals under a reduced model is a better choice in this case. The second concern is that the permutation analysis used to evaluate the significance of the number of stage-specific chromosomal clusters in our study is still relatively rough. Some studies explored the use of the cumulative binomial distribution [2] or the $\chi^2$ distribution [5] to evaluate the statistical significance of the number of chromosome-specific clusters for given cluster sizes. A detailed consideration of methods for assessing the significance of the number of stage-specific chromosomal clusters would be an interesting topic for further investigation.

5. REFERENCES
[1] Bozdech, Z. et al. The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biology, 1, 1-16, 2003.
[2] Cohen, B.A., Mitra, R.D., Hughes, J.D. and Church, G.M. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nature Genetics, 26, 183-186, 2000.
[3] Caron, H. et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science, 291, 1289-1292, 2001.
[4] Roy, P.J. et al. Chromosomal clustering of muscle-expressed genes in Caenorhabditis elegans. Nature, 418, 975-979, 2002.
[5] Florens, L. et al. A proteomic view of the Plasmodium falciparum life cycle. Nature, 419, 520-526, 2002.
[6] Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520-525, 2001.
[7] Spellman, P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9, 3273-3297, 1998.
[8] Whitfield, M.L. et al. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell, 13, 1977-2000, 2002.
[9] Booth, J.G. Clustering periodically expressed genes using microarray data: a statistical analysis of the yeast cell cycle data. University of Florida, Statistics Department Technical Report, 2003.
[10] Dudoit, S., Shaffer, J.P. and Boldrick, J.C. Multiple hypothesis testing in microarray experiments. Statistical Science, 18, 71-103, 2003.
[11] Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57, 289-300, 1995.
[12] Taylor, J., Tibshirani, R. and Efron, B. The “Miss rate” for the analysis of gene expression data. Technical Report, Department of Statistics, Stanford University, http://www-stat.stanford.edu/~tibs/ftp/miss.pdf, 2004.
[13] Efron, B., Tibshirani, R. and Tusher, V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151-1160, 2001.
[14] Storey, J. A direct approach to false discovery rates. Journal of the Royal Statistical Society B, 64, 479-498, 2002.
[15] Hastie, T. and Tibshirani, R. Classification by pairwise coupling. The Annals of Statistics, 26, 451-471, 1998.
[16] Vapnik, V. Statistical learning theory. Wiley, 1998.
[17] Platt, J. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schoelkopf and D. Schuurmans, Eds. Cambridge, MA: MIT Press, 2000.
[18] Friedman, J. Another approach to polychotomous classification. Stanford University, Statistics Department Technical Report, 1996.
[19] Miller, S.K. et al. A subset of Plasmodium falciparum SERA genes are expressed and appear to play an important role in the erythrocytic cycle. Journal of Biological Chemistry, 277, 47524-47532, 2002.
[20] Anderson, M.J. and Legendre, P. An empirical comparison of permutation methods for tests of partial regression coefficients in a linear model. Journal of Statistical Computation and Simulation, 62, 271-303, 1999.