Integration of clinical, SNP, and microarray gene expression measurements in prediction of chronic fatigue syndrome

Sooyeol Lim1, Wen Le1,2, Pingzhao Hu1, Baifang Xing1, Celia M.T. Greenwood1,2,*, Joseph Beyene1,2,*

1The Hospital for Sick Children Research Institute, Toronto, ON, M5G1X8, Canada
2Department of Public Health Sciences, University of Toronto, Toronto, ON, M5S1A1, Canada

E-mail: Sooyeol Lim (slim@sickkids.ca), Wen Le (lisa.le@utoronto.ca), Pingzhao Hu (phu@sickkids.ca), Baifang Xing (baifang@sickkids.ca), Celia M.T. Greenwood (celia.greenwood@utoronto.ca), Joseph Beyene (joseph@utstat.toronto.edu)
* Corresponding authors

ABSTRACT
Chronic fatigue syndrome (CFS) is a chronic, debilitating, and poorly understood disease. In an attempt to better characterize the disorder and to diagnose new cases accurately, researchers have started to look into integrative data analytic approaches that draw information from disparate sources. In this paper, we describe statistical and computational methods that were used to integrate clinical data with genomic data from single nucleotide polymorphism (SNP) and microarray gene expression experiments for the prediction of CFS. We generated prediction scores separately from the gene expression microarray data and the SNP data for 164 subjects, of whom 129 had CFS or CFS-like symptoms. Each predictive model was trained via 10-fold cross-validation and predicted scores were calculated for each test set. Scores for the microarray data were generated by a kernel-based k-nearest neighbor (KNN) classifier, whereas those for the SNP data were generated using logistic regression models. The summary scores of the microarray and SNP data were then combined with two clinical variables of interest in a final logistic model and once again evaluated via 10-fold cross-validation.
Our results show that the integration of relevant clinical and genomic data from different sources may improve the diagnostic classification accuracy for CFS.

Keywords
Genomics, data integration, cross-validation, logistic regression, chronic fatigue syndrome.

1. INTRODUCTION
The study of chronic fatigue syndrome (CFS) has been a challenging medical research problem due to its unknown etiology and considerable uncertainty in diagnosis and disease characterization. Ambiguities in the case definition of CFS have been pointed out [1] and some recent efforts have focused on formulating empirical definitions of CFS [2]. In recent years, there has been considerable interest among the statistical and computational community in methods for the integration of various sources spanning clinical and genomic data [3]. In this work, we aim to investigate a strategy for combining disparate information from clinical, microarray, and SNP data to construct a model that better predicts CFS disease status than any one data source alone.

2. MATERIALS AND METHODS
2.1 Subjects
A total of 164 subjects (129 cases and 35 controls) who had data in all of the clinical, SNP, and microarray datasets (made publicly available by the US Centers for Disease Control (CDC)) were analyzed. The previously determined clinical diagnoses were used rather than empirically derived classifications [2]. The 129 cases comprised 64 subjects with CFS and 65 subjects with CFS-like symptoms.

2.2 Clinical Data
Among the numerous variables in the clinical and laboratory datasets, we have chosen to include in our statistical model only two clinical assessments: the presence of tender lymph nodes, and the presence of sleep problems with moderate or greater severity, both of which were coded as binary variables. We wanted to select only those clinical assessments that were likely to be either useful prognostic factors or potential risk factors for the etiology of the disease.
It has been shown in a previous study of CFS patients [4] that the presence of tender lymph nodes has a statistically significant association with CFS. Solomon and Reeves [4] suggest that it is a common symptom of many infectious illnesses and may contribute to detection or reporter bias on the part of a patient or a physician. Yet the examination of lymph nodes can still serve as a useful prognostic criterion for the diagnosis of CFS. It is also believed that sleep physiology may be a crucial factor in explaining the etiology of CFS. Many CFS patients report sleep disorders, and a previous investigation of sleep characteristics among CFS subjects [5] showed that 81.4% of their CFS subjects reported at least one form of sleep abnormality. Although the relationship between sleep disorders and CFS has not been clearly elucidated, it is a plausible causative factor. While several demographic characteristics such as gender and age seem to be associated with CFS [4], we have chosen to exclude them from our statistical modeling due to the initial design of the CDC case-control study that matched cases and controls on gender and age.

2.3 SNP Data
Forty-two SNP markers spanning 10 different genes on different chromosomes were genotyped. Six of the genes (COMT, MAOA, MAOB, SLC6A4, TH and TPH2) are involved in the neurotransmission system, and mutations at those genes can lead to various psychiatric illnesses. The remaining four genes (CRHR1, CRHR2, NR3C1, and POMC) are involved in the neuroendocrine system [6]. Two genes (MAOA and MAOB) are located on the X chromosome.

Haplotyping: Many alleles at closely linked SNP markers occur as common blocks called haplotypes. Identification and use of haplotypes in genes can aid in disease mapping of genes and help reduce the dimension of statistical models on the SNP data. We have performed haplotyping analyses on 6 out of 8 autosomal genes within which we could identify the relative order and distance among markers.
Haplotypes were estimated with the PHASE software, which is based on a Bayesian computational approach proposed by Stephens et al. [7, 8]. For the remaining four genes (two autosomal genes, POMC and SLC6A4, and two X-linked genes, MAOA and MAOB), SNP markers were used in the statistical models without any estimation of haplotypes. For each of the 6 haplotyped genes, the three major haplotypes were coded as two binary indicator variables. For both haplotyped genes and SNPs, an additive inheritance mode was assumed for the coding (0, 1, 2 according to the number of haplotypes or alleles), with the exception of the X chromosome, for which a dominant inheritance model was used. We have used the most likely haplotypes in our current model as an approximation for haplotypes with phase uncertainty in order to ease the implementation of our computational algorithms. We hope to modify our implementation to account for the probabilities of haplotype construction in future research work.

2.4 Microarray Gene Expression Data
A total of 177 microarray gene expression samples for 19,892 genes were initially available. Eight technical replicates and 5 gene expression samples for excluded subjects were removed, and the remaining 164 samples were analyzed. Expression values were normalized using quantile normalization after log transformation [9].

2.5 Statistical modeling for data integration
Our modeling work draws upon the pre-validation method proposed by Tibshirani and Efron [10], which uses full cross-validation to obtain a prediction score for each individual and then combines these scores with clinical data to improve the predictive power of the model. It should be noted that the partitions for cross-validation must remain the same throughout the calculation of prediction scores in different data types, and the evaluation of the performance of the final model.
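The requirement that one set of cross-validation partitions be reused for every data type can be illustrated with a minimal Python sketch. This is not the R implementation used in this work: the function names (make_folds, prevalidate) are hypothetical, the "model" is a toy learner that simply predicts the case proportion of its training set, and the ordering of case and control labels is an assumption made only to obtain 129 cases and 35 controls.

```python
import random


def make_folds(n_samples, n_folds=10, seed=0):
    """Randomly assign every sample index to exactly one fold, once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[f::n_folds]) for f in range(n_folds)]


def prevalidate(folds, n_samples, fit, predict):
    """Return one out-of-fold prediction score per sample.

    fit(train_idx) returns a model; predict(model, test_idx) returns
    one score per held-out index. Both are supplied by the caller.
    """
    scores = [None] * n_samples
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        model = fit(train_idx)  # estimated on 9/10ths of the data
        for i, s in zip(test_idx, predict(model, test_idx)):
            scores[i] = s       # score for the remaining tenth
    return scores


# Toy labels (assumed ordering): 129 cases coded 1, 35 controls coded 0.
labels = [1] * 129 + [0] * 35


def fit_prevalence(train_idx):
    """Hypothetical stand-in learner: case proportion in the training set."""
    return sum(labels[i] for i in train_idx) / len(train_idx)


def predict_prevalence(model, test_idx):
    return [model] * len(test_idx)


# The folds are created once and passed unchanged to every scoring step.
folds = make_folds(len(labels))
snp_scores = prevalidate(folds, len(labels), fit_prevalence, predict_prevalence)
expr_scores = prevalidate(folds, len(labels), fit_prevalence, predict_prevalence)
```

Because the folds object is built once and passed to each call, a subject held out when the SNP score is computed is also held out when the expression score is computed, and the same object would be passed again when the final combined model is evaluated.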
Exploratory analyses: For the clinical and SNP data, we fitted logistic models, in SAS [11], predicting CFS using all 164 subjects without any cross-validation or data integration. A model constructed in such a way cannot be evaluated for performance as the entire dataset is used for the model estimation, but it nevertheless gives a useful indication of which features may be important. For the microarray expression data, principal component analysis (PCA) [12] and Sammon’s non-linear mapping technique [13] were used to look for any clustering of gene expression that could aid in classification or dimension reduction.

Prediction scores from microarray and SNP data using cross-validation (CV): The 164 subjects were randomly divided into ten groups. To generate prediction scores for the case-control labels of all samples, the model parameters estimated on 9/10ths of the data were used to make predictions on the remaining tenth. These procedures were repeated ten times to obtain prediction scores for all 164 subjects. Two sets of prediction scores (one from the microarray data and the other from the SNP data) were generated. We have implemented our algorithm in the R statistical programming language.

For the microarray gene expression data, we used a signal-to-noise filter (S2N) [14] to select the 50 most variable genes out of the 19,892 genes on the array. Then the kernel-based K-nearest neighbour (KNN) algorithm was used to generate prediction scores. These two steps were performed on each cross-validation partition of the data, and prediction was performed on each tenth of the data that had been set aside during the model “training” phase [14].

For the SNP data, using each of the ten partitions of the dataset, haplotype variables or SNP marker variables for each gene were fitted in a single logistic regression model for feature selection using 9/10ths of the data.
(We opted against fitting more than one gene in a logistic regression due to data attrition from missing values in SNPs, and potential convergence problems with too many covariates.) Ten such models, one for each gene, were compared with each other, and k genes were selected such that the k models for these genes were ranked higher than other models on a given selection criterion such as the Kendall correlation between the observed and estimated case-control labels in the training set. Once the k genes were chosen during the feature selection step, all the haplotypes or SNP marker variables belonging to those k genes were combined in a single logistic model, and this model was re-fitted to the training dataset (the 9/10ths of the data) to obtain estimates of the logistic regression parameters. Then prediction scores for the remaining tenth of the data were obtained from these estimated model parameters.

Various values for the number of genes k (from 1 to 5) and selection criteria (area-under-the-curve (AUC) of receiver operating characteristic curve [15], accuracy, and various correlation measures) were compared, and the combination that yielded the highest AUC between the predicted labels and the observed labels over the entire sample was chosen as the best predictive model.

Classification criteria: A prediction score calculated for a given sample may be used to classify the sample into the case or the control group. For the KNN, the majority voting rule is used such that the sample is classified to the group with the highest weight [14, 16]. For the logistic regression, the following prediction rule was used: for each sample $i$, if $y_i$ is coded as 1 for the case and 0 for the control, then for a given estimated vector of logistic regression parameters $\hat{\beta}$ and the covariate vector $X_i$ for the sample $i$,

$$\hat{y}_i = P(y_i = 1 \mid \hat{\beta}, X_i) = \frac{\exp(X_i \hat{\beta})}{1 + \exp(X_i \hat{\beta})}$$

If $\hat{y}_i \ge 129/164 = 0.7866$, sample $i$ is classified as a case.
If $\hat{y}_i < 129/164$, sample $i$ is classified as a control.

where 129/164 reflects the proportion of cases in the 164 samples.

Integration of clinical data and prediction scores: A logistic model with four variables (two clinical variables, SNP and microarray prediction scores) was fitted and prediction scores were generated via the 10-fold CV as before. One difference from the previous CV steps used for the SNP and microarray prediction scores is that no feature selection was necessary for the final procedure as there were only four covariates of interest. Once the predicted values were obtained for all samples, various performance criteria such as AUC or accuracy were evaluated by comparing the predicted and observed case-control labels.

3. RESULTS
3.1 Exploratory Analysis
Cluster analysis of microarray gene expression data: Genes were ranked by the value of the inter-quartile range (IQR) across the 164 samples. A total of 400 genes with the greatest IQRs were selected for principal components analysis (PCA). The results show that no clear pattern could be found along three major principal components (Figure 1). For the same subset of genes, there were no clear clusters using Sammon’s non-linear mapping technique either (Figure 2) [13].

Figure 1. Principal components analysis using a subset of 400 genes with highest inter-quartile range (red dots: controls, blue dots: cases). Axes: first, second, and third components.

Figure 2. Sammon mapping using a subset of 400 genes with highest inter-quartile range (1: cases, 0: controls).
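The IQR-based gene ranking used for the exploratory analysis can be sketched as follows. This is an illustrative Python sketch only, not the analysis code used in this work: the function names and the three-gene toy data are hypothetical, and the quartiles follow the default definition of Python's statistics.quantiles, which may differ slightly from the quantile definition used in the original analysis.

```python
import statistics


def iqr(values):
    """Inter-quartile range (Q3 - Q1) of one gene's expression values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def top_genes_by_iqr(expr, n_top):
    """Return the n_top gene names with the largest IQR across samples.

    expr maps a gene name to its list of expression values, one per sample.
    """
    ranked = sorted(expr, key=lambda gene: iqr(expr[gene]), reverse=True)
    return ranked[:n_top]


# Hypothetical toy data: three genes measured on six samples.
toy = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.05, 0.95],
    "geneB": [0.0, 4.0, 1.0, 3.0, 2.0, 5.0],
    "geneC": [2.0, 2.5, 1.5, 2.2, 1.8, 2.1],
}
selected = top_genes_by_iqr(toy, n_top=2)
```

With the full dataset, expr would hold the 19,892 genes and n_top would be 400; the selected genes would then be passed to PCA or Sammon mapping.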
Logistic regression of the entire 164 samples with clinical and SNP covariates, without CV: In the logistic regression model with two clinical variables (“node” = presence of tender lymph nodes, and “sleep” = presence of sleep problems with moderate or severe symptoms), sleep problems showed statistically significant association with CFS (OR = 11.2; 95% CI: 4.0 - 31.0; p < 0.001) but tender lymph node did not (OR = 3.2; 95% CI: 0.7 - 15.5; p = 0.144). Using gene-by-gene logistic regression models, only two genes were identified as having statistically significant associations with CFS, namely, NR3C1 (p = 0.002), and POMC (p = 0.0432). These p-values were generated using a likelihood ratio test which has a better statistical property than the usual Wald test [17].

Table 2. AUC values of classifications with clinical, SNP, and expression scores.

Covariates in model              AUC, raw predicted scores   AUC, classified predicted scores
Clinical var. only               0.737                       0.737
SNP scores only                  0.603                       0.645
Microarray scores only           0.488                       0.559
Clin. Var. + SNP                 0.753                       0.753
Clin. Var. + Microarray          0.737                       0.741
SNP + Microarray                 0.585                       0.623
Clin. Var. + SNP + Microarray    0.753                       0.753

3.2 Classification and Performance Evaluation with Microarray and SNP Data
Microarray gene expression data: The AUC value calculated with the microarray data based on the predicted and observed case-control labels was 0.529, which is only marginally greater than the AUC value of 0.5 expected for random predictions.

SNP data: We explored various combinations of the number of markers selected, and the method of gene selection during the feature selection step (ten logistic regression models, one for each gene in the 10-fold CV). Our goal is to optimize the performance of the final model prediction of case-control status as measured by AUC (Table 1).

Table 1. AUC values for classification of CFS using SNP data with different gene selection criteria.
                 # of genes
Criteria         1       2       3       4       5
Accuracy         0.488   0.536   0.578   0.593   0.573
AUC              0.552   0.610   0.643   0.542   0.529
Kendall Corr.    0.613   0.592   0.645   0.587   0.529

(Rows are criteria for feature selection, and columns are for the number of genes chosen for the final haplotype/SNP estimation model.)

The result shows that the highest AUC was attained (0.645) when we selected top three single-gene models using the Kendall correlation, and we generated predictions for the test set using a logistic regression model containing those three genes. This is the method adopted for the generation of prediction scores with the SNP data.

3.3 Evaluation of Data Integration
We evaluated the performance of our final logistic regression model with four covariates (two binary clinical covariates, SNP scores, and expression scores) through 10-fold CV (Table 2). Two different strategies were employed: in one set of models, raw prediction scores (continuous values in [0, 1]) for SNP and microarray data were used as covariates (2nd column of Table 2). In the other set of models, SNP and microarray scores were first converted to classification labels (1 for case and 0 for control) according to their classification criteria, and these two binary covariates were fitted in the final model as covariates (3rd column of Table 2). For the purpose of comparison, logistic regression models with subsets of the four covariates were also fitted.

The results show that overall, the models with binary-classified SNPs and gene expression microarray covariates performed better, using an AUC criterion, than the models with SNP and microarray covariates based on raw prediction scores. Comparison of the full model with the models containing subsets of covariates reveals that while the inclusion of SNP scores on top of the clinical variables increased the predictive power of the model by a small amount, inclusion of microarray scores did not seem to make a measurable contribution to the AUC.
(Table 2) For the final model with all four covariates (with classified SNP and microarray scores), the accuracy was calculated to be 72.56%, with a sensitivity of 70.54% and specificity of 80%. This implies that knowledge of 2 clinical covariates, 42 SNPs, and gene expression of 50 genes can lead to a prediction of CFS that is 73% accurate.

4. DISCUSSION
In this work, we have explored and presented statistical and computational strategies for integrating clinical and genomic data. We adopted various data integration approaches in order to obtain a model with improved predictive power for classification of CFS. The results, based on our current method, show that the integration of SNP scores with clinical data delivers a small improvement in predictive power of the model, yet the integration of microarray gene expression scores delivers little measurable improvement. Compared to other methods, such as fitting all predictors together at the same time, the strategy we followed has the advantage that one can use technology-specific methods to identify predictors for each dataset and then combine them afterwards.

Various reasons may explain the suboptimal performance of microarray expression scores in our work. Uncertainty in case definitions for CFS is one such reason. Our CFS case samples contained those with clinically identified CFS as well as those with insufficient numbers of CFS symptoms. Thus, restricting the cases to those with clinically confirmed CFS may lead to better delineation between cases and controls in statistical analysis. In the same vein, an alternative empirical diagnostic criterion of CFS that was explored in recent research work by Reeves et al. [2] may also provide a definition of the disease that better correlates with genetic signatures of CFS patients.
One possible way to improve the microarray scores is to integrate only the expression values of the genes containing SNP markers in order to minimize the noise introduced by other genes less relevant to the disease under investigation. This would restrict the set of genes examined in the microarray analysis to only those strong candidates and those with cis effects. Another way to integrate data would be to use a method that assigns weights to classification rules, so that genes that provide complementary sources of information are optimally combined for the classification purpose. Boosting appears to be one such method [18] that can identify non-overlapping subsets, and we hope to direct further efforts in this area so that the information content in various sources can be optimally utilized for classification tasks.

5. ACKNOWLEDGEMENTS
Our thanks to Mr. Earl Glynn for his suggestions through e-mail correspondence. This research was supported by funding from Ontario Genomics Institute, Genome Canada, and CIHR grant NPG-64872.

6. REFERENCES
[1] Reeves W.C., Lloyd A., Vernon S.D., Klimas N., Jason L.A., Bleijenberg G., Evengard B., White P.D., Nisenbaum R., Unger E.R., and the International Chronic Fatigue Syndrome Study Group. Identification of ambiguities in the 1994 chronic fatigue syndrome research case definition and recommendations for resolution. BMC Health Services Research, 3:25, 2003.
[2] Reeves, W.C., Wagner, D., Nisenbaum R., Jones J.F., Gurbaxani B., Solomon L., Papanicolaou D.A., Unger E.R., Vernon S.D., and Heim C. Chronic fatigue syndrome – a clinically empirical approach to its definition and study. BMC Medicine. 3:19, 2005.
[3] Detours, V., Dumont J.E., Bersini H., and Maenhaut C. Integration and cross-validation of high-throughput gene expression data: comparing heterogeneous data sets. FEBS Letters 546, 98-102, 2003.
[4] Solomon L. and Reeves, W.C. Factors influencing the diagnosis of chronic fatigue syndrome.
Arch Intern Med 164:2241-2245, 2004.
[5] Unger, E.R., Nisenbaum R., Moldofsky H., Cesta A., Sammut C., Reyes M., and Reeves, W.C. Sleep assessment in a population-based study of chronic fatigue syndrome. BMC Neurology. 4:6, 2004.
[6] Hattori, E., Liu, C., Zhu, H., and Gershon, E.S. Genetic tests of biologic systems in affective disorders. Molecular Psychiatry. 10:719-740, 2005.
[7] Stephens, M., Smith, N.J., and Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68:978-989, 2001.
[8] Stephens, M., and Donnelly, P. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73:1162-1169, 2003.
[9] Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and Speed, T.P. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:249-264, 2003.
[10] Tibshirani, R.J., and Efron, B. Pre-validation and inference in microarrays. Statistical Applications in Genetics and Molecular Biology Vol. 1, Iss. 1, Article 1, 2002.
[11] The SAS System for Windows. Cary, NC, USA. The SAS Institute, 2002.
[12] Yeung K.Y., and Ruzzo, W.L. Principal component analysis for clustering gene expression data. Bioinformatics. 17(9):763-774, 2001.
[13] Ewing, R.M., and Cherry, J.M. Visualization of expression clusters using Sammon’s non-linear mapping. Bioinformatics. 17(7):658-659, 2001.
[14] Hu, P. et al. Serum diagnosis of chronic fatigue syndrome using array-based proteomics. Abstract for CAMDA 2006.
[15] Swets, J.A. Measuring the accuracy of diagnostic systems. Science. 240:1285-1293, 1988.
[16] Hechenbichler, K. and Schliep, K.P. Weighted k-Nearest-Neighbor Techniques and Ordinal Classification, Discussion Paper 399, SFB 386, Ludwig-Maximilians University Munich (http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper399.ps), 2004.
[17] Wackerly, D.D., Mendenhall, W., and Scheaffer, R.L.
Mathematical statistics with applications. 5th ed. Duxbury Press, 1996.
[18] Long P.M. and Vega, V.B. Boosting and microarray data. Machine Learning, 52(1):31-44, 2003.