Integration of clinical, SNP, and microarray gene
expression measurements in prediction of chronic fatigue
syndrome
Sooyeol Lim1, Wen Le1,2, Pingzhao Hu1, Baifang Xing1,
Celia M.T. Greenwood1,2,*, Joseph Beyene1,2,*
1 The Hospital for Sick Children Research Institute, Toronto, ON, M5G1X8, Canada
2 Department of Public Health Sciences, University of Toronto,
Toronto, ON, M5S1A1, Canada
E-mail: Sooyeol Lim (slim@sickkids.ca), Wen Le (lisa.le@utoronto.ca),
Pingzhao Hu (phu@sickkids.ca), Baifang Xing (baifang@sickkids.ca),
Celia M.T. Greenwood (celia.greenwood@utoronto.ca), Joseph Beyene (joseph@utstat.toronto.edu)
* Corresponding authors
ABSTRACT
Chronic fatigue syndrome (CFS) is a chronic debilitating and
poorly understood disease. In an attempt to better characterize the
disorder and be able to accurately diagnose new cases, researchers
have started to look into integrative data analytic approaches
drawing information from various disparate sources. In this paper,
we describe statistical and computational methods that were used
to integrate clinical data with genomic data from single nucleotide
polymorphism (SNP) and microarray gene expression data for the
prediction of chronic fatigue syndrome (CFS). We generated
prediction scores separately from the gene expression microarray
data and the SNP data for 164 subjects, of which 129 had CFS or
CFS-like symptoms. Each predictive model was trained via 10-fold
cross-validation, and predicted scores were calculated for each
test set. Scores for the microarray data were generated by a
kernel-based k-nearest neighbor (KNN) classifier, whereas those for the
SNP data were generated using logistic regression models.
The summary scores of the microarray and SNP data were then
combined with two clinical variables of interest in a final logistic
model and once again evaluated via 10-fold cross-validation. Our
results show that the integration of relevant clinical and genomic
data from different sources may improve the diagnostic
classification accuracy for CFS.
Keywords
Genomics, data integration, cross-validation, logistic regression,
chronic fatigue syndrome.
1. INTRODUCTION
The study of chronic fatigue syndrome (CFS) has been a
challenging medical research problem due to its unknown etiology
and considerable uncertainty in diagnosis and disease
characterization. Ambiguities in the case definition of CFS have
been pointed out [1] and some recent efforts have focused on
formulating empirical definitions of CFS [2].
In recent years, there has been considerable interest among the
statistical and computational community in methods for the
integration of various sources spanning clinical and genomic data
[3]. In this work, we aim to investigate a strategy for combining
disparate information from clinical, microarray, and SNP data to
construct a model that better predicts CFS disease status than any
one data source alone.
2. MATERIALS AND METHODS
2.1 Subjects
A total of 164 subjects (129 cases and 35 controls) that had data
in all of the clinical, SNP, and microarray datasets (made publicly
available by the US Centers for Disease Control (CDC)) were
analyzed. The previously determined clinical diagnoses were used
rather than empirically derived classifications [2]. Included in the
129 cases were 64 subjects with CFS as well as 65 subjects with
CFS-like symptoms.
2.2 Clinical Data
Among the numerous variables in the clinical and laboratory
datasets, we have chosen to include in our statistical model only
two clinical assessments: the presence of tender lymph nodes, and
the presence of sleep problems with moderate or greater severity,
both of which were coded as binary variables. We wanted to
select only those clinical assessments that were likely to be either
useful prognostic factors or potential risk factors for the etiology
of the disease.
It has been shown in a previous study of CFS patients [4] that the
presence of tender lymph nodes has a statistically significant
association with CFS. Solomon and Reeves [4] suggest that it is a
common symptom of many infectious illnesses and may
contribute to detection or reporter bias on the part of a patient or a
physician. Yet the examination of lymph nodes can still serve as a
useful prognostic criterion for the diagnosis of CFS.
It is also believed that sleep physiology may be a crucial factor in
explaining the etiology of CFS. Many CFS patients report sleep
disorders, and a previous investigation of sleep characteristics
among CFS subjects [5] showed that 81.4% of their CFS subjects
reported at least one form of sleep abnormality. Although the
relationship between sleep disorders and CFS has not been clearly
elucidated, it is a plausible causative factor.
While several demographic characteristics such as gender and age
seem to be associated with CFS [4], we have chosen to exclude
them from our statistical modeling due to the initial design of the
CDC case-control study that matched cases and controls on
gender and age.
2.3 SNP Data
Forty-two SNP markers spanning 10 different genes on different
chromosomes were genotyped. Six of the genes (COMT, MAOA,
MAOB, SLC6A4, TH and TPH2) are involved in the
neurotransmission system, and mutations at those genes can lead
to various psychiatric illnesses. The remaining four genes
(CRHR1, CRHR2, NR3C1, and POMC) are involved in the
neuroendocrine system [6]. Two genes (MAOA and MAOB) are
located on the X chromosome.
Haplotyping: Many alleles at closely linked SNP markers occur
as common blocks called haplotypes. Identification and use of
haplotypes in genes can aid in disease mapping of genes and help
reduce the dimension of statistical models on the SNP data. We
have performed haplotyping analyses on 6 out of 8 autosomal
genes within which we could identify the relative order and
distance among markers. Haplotypes were estimated with the
PHASE software, which is based on a Bayesian computational
approach proposed by Stephens et al. [7, 8]. For the remaining four
genes (two autosomal genes, POMC and SLC6A4, and two X-linked
genes, MAOA and MAOB), SNP markers were used in the
statistical models without any estimation of haplotypes. For each
of the 6 haplotyped genes, the three major haplotypes were coded
as two binary indicator variables. For both haplotyped genes and
SNPs, an additive inheritance mode was assumed for the coding
(0, 1, 2 according to the number of haplotypes or alleles), with the
exception of the X chromosome, for which a dominant inheritance
model was used. We have used the most likely haplotypes in our
current model as an approximation for haplotypes with phase
uncertainty in order to ease the implementation of our
computational algorithms. We hope to modify our implementation
to account for the probabilities of haplotype construction in future
research work.
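
The additive and dominant coding schemes described above can be illustrated with a minimal Python sketch. The function names, the tuple representation of a genotype, and the notion of a single "counted" allele or haplotype are assumptions made for illustration; they are not taken from the R implementation used in this work.

```python
def additive_code(genotype, counted):
    """Additive coding: number of copies (0, 1 or 2) of the counted
    allele or haplotype carried by one subject."""
    return sum(1 for item in genotype if item == counted)


def dominant_code(genotype, counted):
    """Dominant coding: 1 if at least one copy is carried, otherwise 0."""
    return 1 if counted in genotype else 0


def code_marker(genotype, counted, x_linked=False):
    """Use the dominant model for X-linked markers and the additive
    model for autosomal markers, as described in the text."""
    if x_linked:
        return dominant_code(genotype, counted)
    return additive_code(genotype, counted)
```

For example, `code_marker(("A", "G"), "A")` gives 1 under the additive model, while a subject with a single X-linked allele, `code_marker(("G",), "A", x_linked=True)`, is coded 0.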
2.4 Microarray Gene Expression Data
A total of 177 microarray gene expression samples for 19,892
genes were initially available. Eight technical replicates and 5
gene expression samples for excluded subjects were removed, and
the remaining 164 samples were analyzed. Expression values were
normalized using quantile normalization after log transformation
[9].
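
As an illustration of this preprocessing step, the following Python sketch applies a log transformation followed by quantile normalization to a small list of arrays. The base-2 logarithm, the list-of-lists data layout, and the simple handling of ties are assumptions; the procedure of [9] as implemented in standard software may differ in these details.

```python
import math


def quantile_normalize(samples):
    """Log2-transform and quantile-normalize expression arrays.

    samples: list of equal-length lists of positive expression values,
             one inner list per array (subject).
    Returns a list of the same shape in which every array shares the
    same empirical distribution (the mean of the sorted arrays).
    """
    logged = [[math.log2(v) for v in sample] for sample in samples]
    n_genes = len(logged[0])

    # Mean of the r-th smallest value across all arrays.
    sorted_arrays = [sorted(sample) for sample in logged]
    rank_means = [
        sum(arr[r] for arr in sorted_arrays) / len(sorted_arrays)
        for r in range(n_genes)
    ]

    normalized = []
    for sample in logged:
        # Gene indices ordered from lowest to highest expression.
        order = sorted(range(n_genes), key=lambda g: sample[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After normalization, each array contains the same set of values and only the assignment of values to genes differs between subjects.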
2.5 Statistical modeling for data integration
Our modeling work draws upon the pre-validation method
proposed by Tibshirani and Efron [10] to use full cross-validation
to obtain prediction scores for each individual and then to
combine these with clinical data to improve the predictive power
of the model. It should be noted that the partitions for cross-validation must remain the same throughout the calculation of
prediction scores in different data types, and the evaluation of the
performance of the final model.
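
The requirement of a single, fixed partition can be sketched in Python as follows. This is a simplified illustration rather than our implementation: `make_folds`, `prevalidated_scores`, and the `fit` and `predict` callables are hypothetical names, and the random seed is arbitrary.

```python
import random


def make_folds(n_subjects, n_folds=10, seed=0):
    """Assign each subject to one of n_folds groups, once and reproducibly."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    fold_of = [0] * n_subjects
    for position, subject in enumerate(indices):
        fold_of[subject] = position % n_folds
    return fold_of


def prevalidated_scores(fold_of, fit, predict, features, labels):
    """Score every subject with a model that never saw that subject.

    fit(train_features, train_labels) returns a model object;
    predict(model, feature_row) returns a prediction score.
    """
    scores = [None] * len(labels)
    for fold in sorted(set(fold_of)):
        train = [i for i, f in enumerate(fold_of) if f != fold]
        test = [i for i, f in enumerate(fold_of) if f == fold]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in test:
            scores[i] = predict(model, features[i])
    return scores
```

The same `fold_of` list would be passed to `prevalidated_scores` for the microarray data, for the SNP data, and again for the final integrated model, so that a subject's score in any data type is always produced by a model trained without that subject.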
Exploratory analyses: For the clinical and SNP data, we fitted
logistic models, in SAS [11], predicting CFS using all 164
subjects without any cross-validation or data integration. A model
constructed in such a way cannot be evaluated for performance as
the entire dataset is used for the model estimation, but it
nevertheless gives a useful indication of which features may be
important. For the microarray expression data, principal
component analysis (PCA) [12] and Sammon’s non-linear
mapping technique [13] were used to look for any clustering of
gene expression that could aid in classification or dimension
reduction.
Prediction scores from microarray and SNP data using cross-validation (CV): The 164 subjects were randomly divided into
ten groups. For generating prediction scores for the case-control
labels for all samples, the model parameters estimated on 9/10ths
of the data were used to make predictions on the remaining tenth.
These procedures were repeated ten times to obtain prediction
scores for all 164 subjects. Two sets of prediction scores (one
from the microarray data and the other from the SNP data) were
generated. We have implemented our algorithm in the R statistical
programming language.
For the microarray gene expression data, we used a signal-to-noise
filter (S2N) [14] to select the 50 most variable genes out of
the 19,892 genes on the array. Then the kernel-based K-nearest
neighbour (KNN) algorithm was used to generate prediction
scores. These two steps were performed on each cross-validation
partition of the data, and prediction was performed on each 1/10th
of the data that have been set aside during the model “training”
phase. [14]
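
The two steps can be sketched in Python as shown below. Several details are assumptions, since they are not spelled out here: the S2N statistic is taken to be the difference in class means divided by the sum of the class standard deviations, genes are ranked by its absolute value, distances are Euclidean, and the kernel is triangular with distances scaled by the (k+1)-th neighbour. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def s2n(values, labels):
    """Signal-to-noise statistic of one gene (assumed definition)."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    denom = stdev(cases) + stdev(controls)
    return (mean(cases) - mean(controls)) / denom if denom > 0 else 0.0


def top_genes(expr, labels, n_keep):
    """expr[g][i] is the expression of gene g in training sample i.
    Returns the indices of the n_keep genes with the largest |S2N|."""
    ranked = sorted(range(len(expr)),
                    key=lambda g: abs(s2n(expr[g], labels)),
                    reverse=True)
    return ranked[:n_keep]


def weighted_knn_score(train_x, train_y, x, k):
    """Kernel-weighted share of case votes among the k nearest neighbours."""
    neighbours = sorted((math.dist(row, x), y)
                        for row, y in zip(train_x, train_y))
    nearest = neighbours[:k]
    # Scale by the (k+1)-th neighbour when available, so that the
    # k-th neighbour keeps a positive weight.
    scale = neighbours[k][0] if len(neighbours) > k else nearest[-1][0]
    if scale == 0:
        weights = [1.0] * len(nearest)
    else:
        weights = [max(1.0 - d / scale, 0.0) for d, _ in nearest]
    total = sum(weights)
    if total == 0:  # all weights vanished: fall back to an unweighted vote
        return sum(y for _, y in nearest) / len(nearest)
    return sum(w for w, (_, y) in zip(weights, nearest) if y == 1) / total
```

Within each cross-validation partition, `top_genes` would be applied to the training samples only, and `weighted_knn_score` would then be evaluated for each held-out sample using the selected genes.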
For the SNP data, using each of the ten partitions of the dataset,
haplotype variables or SNP marker variables for each gene were
fitted in a single logistic regression model for feature selection
using 9/10ths of the data. (We opted against fitting more than one
gene in a logistic regression due to data attrition from missing
values in SNPs, and potential convergence problems with too
many covariates.) The ten such models, one for each gene, were
compared with each other, and k genes were selected such that the
k models for these genes were ranked higher than other models on a
given selection criterion such as the Kendall correlation between the
observed and estimated case-control labels in the training set.
Once the k genes were chosen during the feature selection step, all
the haplotypes or SNP marker variables belonging to those k
genes were combined in a single logistic model, and this model
was re-fitted to the training dataset (the 9/10ths of the data) to
obtain estimates of the logistic regression parameters. Then
prediction scores for the remaining tenth of the data were obtained
from these estimated model parameters.
Various values for the number of genes k (from 1 to 5) and
selection criteria (area-under-the-curve (AUC) of receiver
operating characteristic curve [15], accuracy, and various
correlation measures) were compared, and the combination that
yielded the highest AUC between the predicted labels and the
observed labels over the entire sample was chosen as the best
predictive model.
Figure 1. Principal components analysis using a subset of 400
genes with highest inter-quartile range (Red dots: controls, blue
dots: cases; axes: first, second, and third components).
Classification criteria: A prediction score calculated for a given
sample may be used to classify the sample into the case or the
control group. For the KNN, the majority voting rule is used such
that the sample is classified to the group with the highest weight
[14, 16]. For the logistic regression, the following prediction rule
was used: for each sample i, if $y_i$ is coded as 1 for the case and 0
for the control, then for a given estimated vector of logistic
regression parameters $\hat{\beta}$ and the covariate vector $X_i$
for the sample i,

$$\hat{y}_i = P(y_i = 1 \mid \hat{\beta}, X_i) = \frac{\exp(\hat{\beta} X_i)}{1 + \exp(\hat{\beta} X_i)}$$

If $\hat{y}_i \geq 129/164 \approx 0.7866$, sample i is classified as a case.
If $\hat{y}_i < 129/164$, sample i is classified as a control,
where 129/164 reflects the proportion of cases in the 164 samples.
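
The logistic classification rule can be written as a short Python sketch. The coefficient and covariate vectors are placeholders to be supplied by the caller, the intercept is assumed to be carried as a leading 1 in the covariate vector, and treating a score exactly equal to the cut-off as a case is an assumption.

```python
import math

# Cut-off equal to the proportion of cases among the 164 subjects.
CASE_PROPORTION = 129 / 164


def predicted_probability(beta, x):
    """P(y = 1 | beta, x) = exp(beta.x) / (1 + exp(beta.x))."""
    linear = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear))


def classify(probability, threshold=CASE_PROPORTION):
    """Return 1 (case) if the score reaches the threshold, else 0 (control)."""
    return 1 if probability >= threshold else 0
```

Using the sample proportion of cases rather than 0.5 as the cut-off compensates for the imbalance between cases and controls: a score of 0.6, for instance, is above one half but below the share of cases in the data, and is therefore classified as a control.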
Integration of clinical data and prediction scores: A logistic
model with four variables (two clinical variables, SNP and
microarray prediction scores) was fitted and prediction scores
were generated via the 10-fold CV as before. One difference from
the previous CV steps used for the SNP and microarray prediction
scores is that no feature selection was necessary for the final
procedure, as there were only four covariates of interest. Once the
predicted values were obtained for all samples, various
performance criteria such as AUC or accuracy were evaluated by
comparing the predicted and observed case-control labels.
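
For reference, the AUC used as a performance criterion can be computed directly from the predicted scores and observed labels through its rank interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below is a generic illustration and the function name is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 labels.

    Compares every case with every control; a case scoring higher
    counts 1, a tie counts 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

A value of 0.5 corresponds to scores that carry no information about case-control status, and a value of 1 corresponds to perfect separation.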
3. RESULTS
3.1 Exploratory Analysis
Cluster analysis of microarray gene expression data: Genes
were ranked by the value of the inter-quartile range (IQR) across
the 164 samples. A total of 400 genes with the greatest IQRs were
selected for principal components analysis (PCA). The results
show that no clear pattern could be found along three major
principal components (Figure 1). For the same subset of genes,
there were no clear clusters using Sammon’s non-linear mapping
technique either (Figure 2) [13].
Figure 2. Sammon mapping using a subset of 400 genes with
highest inter-quartile range (1: cases, 0: controls).
Logistic regression of the entire 164 samples with clinical and
SNP covariates, without CV: In the logistic regression model
with two clinical variables (“node” = presence of tender lymph
nodes, and “sleep” = presence of sleep problems with moderate or
severe symptoms), sleep problems showed statistically significant
association with CFS (OR = 11.2; 95% CI: 4.0 - 31.0; p < 0.001)
but tender lymph nodes did not (OR = 3.2; 95% CI: 0.7 - 15.5; p =
0.144). Using gene-by-gene logistic regression models, only two
genes were identified as having statistically significant
associations with CFS, namely NR3C1 (p = 0.002) and POMC
(p = 0.0432). These p-values were generated using a likelihood
ratio test, which has better statistical properties than the usual
Wald test [17].
Table 2. AUC values of classifications with clinical, SNP, and
expression scores.

Model                            AUC for model      AUC for model
                                 with raw           with classified
                                 predicted scores   predicted scores
Clinical var. only               0.737              0.737
SNP scores only                  0.603              0.645
Microarray scores only           0.488              0.559
Clin. Var. + SNP                 0.753              0.753
Clin. Var. + Microarray          0.737              0.741
SNP + Microarray                 0.585              0.623
Clin. Var. + SNP + Microarray    0.753              0.753
3.2 Classification and Performance Evaluation with Microarray and SNP Data
Microarray gene expression data: The AUC value calculated
with the microarray data based on the predicted and observed
case-control labels was 0.529, which is only marginally greater
than the AUC value of 0.5 expected for random predictions.
SNP data: We explored various combinations of the number of
genes selected and the criterion used for gene selection during the
feature selection step (ten logistic regression models, one for each
gene in the 10-fold CV). Our goal was to optimize the performance
of the final model prediction of case-control status as measured by
AUC (Table 1).
Table 1. AUC values for classification of CFS using SNP data
with different gene selection criteria.

                  # of genes
Criteria          1        2        3        4        5
Accuracy          0.488    0.536    0.578    0.593    0.573
AUC               0.552    0.610    0.643    0.542    0.529
Kendall Corr.     0.613    0.592    0.645    0.587    0.529

(Rows are criteria for feature selection, and columns are for the
number of genes chosen for the final haplotype/SNP estimation
model.)
The results show that the highest AUC (0.645) was attained when
we selected the top three single-gene models using the Kendall
correlation and generated predictions for the test set using a
logistic regression model containing those three genes. This is the
method adopted for the generation of prediction scores with the
SNP data.
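
The gene selection step can be illustrated with a minimal Python sketch. The tau-a form of the Kendall correlation is an assumption, as the variant used is not stated here, and the function names and gene labels in the usage note are purely hypothetical.

```python
def kendall_tau_a(x, y):
    """Kendall correlation (tau-a): concordant minus discordant pairs,
    divided by the total number of pairs. Tied pairs contribute zero."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                total += 1
            elif product < 0:
                total -= 1
    return total / (n * (n - 1) / 2)


def select_top_genes(criterion_by_gene, k):
    """Return the k gene names whose single-gene models rank highest
    on the chosen criterion (larger is better)."""
    ranked = sorted(criterion_by_gene, key=criterion_by_gene.get, reverse=True)
    return ranked[:k]
```

With made-up criterion values such as `{"g1": 0.1, "g2": 0.4, "g3": 0.3, "g4": 0.2}`, calling `select_top_genes(..., 3)` returns `["g2", "g3", "g4"]`; the variables of those genes would then be combined in one logistic model and re-fitted on the training data.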
3.3 Evaluation of Data Integration
We evaluated the performance of our final logistic regression
model with four covariates (two binary clinical covariates, SNP
scores, and expression scores) through 10-fold CV (Table 2). Two
different strategies were employed: in one set of models, raw
prediction scores (continuous values in [0, 1]) for SNP and
microarray data were used as covariates (2nd column of Table 2).
In the other set of models, SNP and microarray scores were first
converted to classification labels (1 for case and 0 for control)
according to their classification criteria, and these two binary
covariates were fitted in the final model (3rd column
of Table 2). For the purpose of comparison, logistic regression
models with subsets of the four covariates were also fitted.
The results show that overall, the models with binary-classified
SNPs and gene expression microarray covariates performed
better, using an AUC criterion, than the models with SNP and
microarray covariates based on raw prediction scores. Comparison
of the full model with the models containing subsets of covariates
reveals that while the inclusion of SNP scores on top of the
clinical variables increased the predictive power of the model by a
small amount, inclusion of microarray scores did not seem to
make a measurable contribution to the AUC (Table 2).
For the final model with all four covariates (with classified SNP
and microarray scores), the accuracy was calculated to be 72.56%,
with a sensitivity of 70.54% and specificity of 80%. This implies
that knowledge of 2 clinical covariates, 42 SNPs, and gene
expression of 50 genes can lead to a prediction of CFS that is 73%
accurate.
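
The three figures are mutually consistent, as the following Python sketch shows. The confusion-matrix counts are not reported in the text; the values used here (91 of 129 cases and 28 of 35 controls correctly classified) are inferred from the stated percentages and should be read as an assumption.

```python
def performance(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Inferred counts (assumption): 129 cases and 35 controls in total.
result = performance(tp=91, fn=38, tn=28, fp=7)
percentages = {name: round(100 * value, 2) for name, value in result.items()}
```

Here `percentages` holds a sensitivity of 70.54, a specificity of 80.0, and an accuracy of 72.56, since (91 + 28) / 164 = 119 / 164.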
4. DISCUSSION
In this work, we have explored and presented statistical and
computational strategies for integrating clinical and genomic data.
We adopted various data integration approaches in order to obtain
a model with improved predictive power for classification of CFS.
The results, based on our current method, show that the
integration of SNP scores with clinical data delivers a small
improvement in predictive power of the model, yet the integration
of microarray gene expression scores delivers little measurable
improvement.
Compared to other methods, such as fitting all predictors together at
the same time, the strategy we followed has the advantage that
one can use technology-specific methods to identify predictors for
each dataset and then combine them afterwards.
Various reasons may explain the suboptimal performance of
microarray expression scores in our work. Uncertainty in case
definitions for CFS is one such reason. Our CFS case samples
contained those with clinically identified CFS as well as those
with insufficient numbers of CFS symptoms. Thus, use of only
those with clinically confirmed CFS for cases may lead to better
delineation between cases and controls in statistical analysis. In
the same vein, an alternative empirical diagnostic criterion of CFS
that was explored in recent research work by Reeves et al. [2] may
also provide a definition of the disease that better correlates with
genetic signatures of CFS patients. One possible way to improve
the microarray scores is to integrate only the expression values of
the genes containing SNP markers in order to minimize the noise
introduced by other genes less relevant to the disease under
investigation. This would restrict the set of genes examined in the
microarray analysis to only those strong candidates and those with
cis effects. Another way to integrate data would be to use a
method that assigns weights to classification rules, so that genes
that provide complementary sources of information are optimally
combined for the classification purpose. Boosting appears to be
one such method [18] that can identify non-overlapping subsets,
and we hope to direct further efforts in this area so that the
information content of the various sources can be optimally utilized
for classification tasks.
5. ACKNOWLEDGEMENTS
Our thanks to Mr. Earl Glynn for his suggestions through e-mail
correspondence. This research was supported by funding from
Ontario Genomics Institute, Genome Canada, and CIHR grant
NPG-64872.
6. REFERENCES
[1] Reeves W.C, Lloyd A, Vernon S.D, Klimas N, Jason L.A.,
Bleijenberg G, Evengard B, White P.D., Nisenbaum R,
Unger E.R., and the International Chronic Fatigue Syndrome
Study Group. Identification of ambiguities in the 1994
chronic fatigue syndrome research case definition and
recommendations for resolution. BMC Health Services
Research. 3:25, 2003.
[2] Reeves, W.C., Wagner, D., Nisenbaum R., Jones J.F.,
Gurbaxani B., Solomon L., Papanicolaou D.A., Unger E.R.,
Vernon S.D., and Heim C. Chronic fatigue syndrome – a
clinically empirical approach to its definition and study.
BMC Medicine. 3:19, 2005.
[3] Detours, V., Dumont J.E., Bersini H., and Maenhaut C.
Integration and cross-validation of high-throughput gene
expression data: comparing heterogeneous data sets. FEBS
Letters 546, 98-102, 2003.
[4] Solomon L. and Reeves, W.C. Factors influencing the
diagnosis of chronic fatigue syndrome. Arch Intern Med
164:2241-2245, 2004.
[5] Unger, E. R., Nisenbaum R., Moldofsky H., Cesta A.,
Sammut C., Reyes M., and Reeves, W.C. Sleep assessment in
a population-based study of chronic fatigue syndrome. BMC
Neurology. 4:6, 2004.
[6] Hattori, E., Liu, C., Zhu, H., and Gershon, E.S. Genetic tests
of biologic systems in affective disorders. Molecular
Psychiatry. 10:719-740. 2005.
[7] Stephens, M., Smith, N.J., and Donnelly, P. A new statistical
method for haplotype reconstruction from population data.
Am. J. Hum. Genet. 68:978-989, 2001.
[8] Stephens, M., and Donnelly, P. A comparison of Bayesian
methods for haplotype reconstruction from population
genotype data. Am. J. Hum. Genet. 73:1162-1169, 2003.
[9] Irizarry, R.A, Hobbs, B., Collin, F., Beazer-Barclay, Y.D.,
Antonellis, K.J., Scherf, U., and Speed, T.P.
Exploration, normalization, and summaries of high density
oligonucleotide array probe level data. Biostatistics 4:249-264, 2003.
[10] Tibshirani, R.J., and Efron, B. Pre-validation and inference
in microarrays. Statistical Applications in Genetics and
Molecular Biology Vol. 1, Iss. 1. Article 1, 2002.
[11] The SAS System for Windows. Cary, NC, USA. The SAS
Institute, 2002.
[12] Yeung K.Y., and Ruzzo, W.L. Principal component analysis
for clustering gene expression data. Bioinformatics.
17(9):763-774. 2001.
[13] Ewing, R.M, and Cherry, J.M. Visualization of expression
clusters using Sammon’s non-linear mapping.
Bioinformatics. 17(7):658-659. 2001.
[14] Hu, P. et al. Serum diagnosis of chronic fatigue syndrome
using array-based proteomics. Abstract for CAMDA 2006.
[15] Swets, J.A. Measuring the accuracy of diagnostic systems.
Science. 240:1285-1293. 1988.
[16] Hechenbichler, K. and Schliep, K.P. Weighted k-Nearest-Neighbor
Techniques and Ordinal Classification, Discussion
Paper 399, SFB 386, Ludwig-Maximilians University
Munich (http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper399.ps), 2004.
[17] Wackerly, D.D., Mendenhall, W., and Scheaffer, R.L.
Mathematical statistics with applications. 5th ed. Duxbury
press. 1996.
[18] Long P.M. and Vega, V.B. Boosting and microarray data.
Machine Learning, 52(1):31-44, 2003.