Part 5: Linking Microarray Data with Survival Analysis
Use of microarray data via model-based classification in the study and prediction of survival from lung cancer
(Ben-Tovim Jones et al., 2005)
Problems
•Censored Observations – the time of occurrence of the event (death) has not yet been observed.
•Small Sample Sizes – study limited by patient numbers
•Specific Patient Group – is the study applicable to other populations?
•Difficulty in integrating different studies (different microarray platforms)
A Case Study: The Lung Cancer data sets from CAMDA’03
Four independently acquired lung cancer data sets (Harvard, Michigan, Stanford and Ontario).
The challenge: To integrate information from different data sets (2 Affy chips of different versions, 2 cDNA arrays).
The final goal: To make an impact on cancer biology and eventually patient care.
“Especially, we welcome the methodology of survival analysis using microarrays for cancer prognosis (Park et al. Bioinformatics: S120, 2002).”
Methodology of Survival Analysis using Microarrays
Cluster the tissue samples (e.g. using hierarchical clustering), then compare the survival curves for each cluster using a non-parametric Kaplan-Meier analysis (Alizadeh et al. 2000).
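Once tissues have been assigned to clusters, the Kaplan-Meier (product-limit) estimate can be computed separately for each cluster. The following is a minimal sketch in plain Python; the function name `kaplan_meier` and the toy follow-up times are hypothetical and are not taken from any of the CAMDA’03 data sets.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. recurrence) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            surv *= 1.0 - n_events / at_risk
            curve.append((t, surv))
        # Censored patients leave the risk set without changing S(t).
        at_risk -= n_removed
    return curve


# Hypothetical toy data for two tissue clusters (illustration only).
cluster_a = kaplan_meier([5, 8, 12, 20, 25], [1, 1, 0, 1, 0])
cluster_b = kaplan_meier([15, 30, 42, 50, 60], [0, 1, 0, 0, 0])
```

Censored patients contribute to the risk set up to their last follow-up time but never trigger a drop in the curve, which is how the estimator uses the partial information they carry.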
Park et al. (2002), Nguyen and Rocke (2002) used partial least squares with the proportional hazards model of Cox.
Unsupervised vs. Supervised Methods
Semi-supervised approach of Bair and Tibshirani (2004), to combine gene expression data with the clinical data.
AIM : To link gene-expression data with survival from lung cancer in the CAMDA’03 challenge
A CLUSTER ANALYSIS
We apply a model-based clustering approach to classify tumour tissues on the basis of microarray gene expression.
B SURVIVAL ANALYSIS
The association between the clusters so formed and patient survival (recurrence) times is established.
C DISCRIMINANT ANALYSIS
We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.
Lung Cancer
Approx. 80% of lung cancer patients have NSCLC (of which adenocarcinoma is the most common form).
All patients diagnosed with NSCLC are treated on the basis of stage at presentation (tumour size, lymph node involvement and presence of metastases).
Yet 30% of patients with resected stage I lung cancer will die of metastatic cancer within 5 years of surgery.
Goal: a prognostic test for early-stage lung adenocarcinoma that identifies patients who are more likely to recur and who would therefore benefit from adjuvant therapy.
Lung Cancer Data Sets
(see http://www.camda.duke.edu/camda03)
Wigle et al. (2002), Garber et al. (2001), Bhattacharjee et al. (2001),
Beer et al. (2002).
[Figure: Heat map for 2880 Ontario genes (39 tissues); horizontal axis: tissues.]
[Figure: Heat maps for the 20 Ontario gene-groups (39 tissues); horizontal axis: tissues. Tissues are ordered as: Recurrence (1-24) and Censored (25-39).]
[Figure: Expression profiles for useful metagenes (Ontario, 39 tissues). Panels: Gene Group 1, Gene Group 2, Gene Group 19, Gene Group 20; horizontal axis: tissues, ordered as Recurrence (1-24) and Censored (25-39); legend: our tissue cluster 1, our tissue cluster 2.]
Tissue Clusters
CLUSTER ANALYSIS via EMMIX-GENE of 20 METAGENES yields TWO CLUSTERS:
CLUSTER 1 (31 tissues, poor prognosis): 23 (recurrence) plus 8 (censored)
CLUSTER 2 (8 tissues, good prognosis): 1 (recurrence) plus 7 (censored)
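The idea behind model-based clustering can be illustrated with the EM algorithm for a two-component normal mixture fitted to a single metagene. This is a deliberately simplified sketch: EMMIX-GENE fits richer multivariate mixture models, and the function `em_two_normals` and the toy expression values below are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_normals(x, n_iter=100):
    """Fit a two-component normal mixture by EM.

    Returns (pi1, means, variances, tau), where tau[j] is the posterior
    probability that observation j belongs to component 1.
    """
    n = len(x)
    xs = sorted(x)
    half = n // 2
    # Crude start: means of the lower and upper halves, common variance.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    centre = sum(x) / n
    var0 = sum((v - centre) ** 2 for v in x) / n
    var = [var0, var0]
    pi1 = 0.5
    tau = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probabilities of component membership.
        tau = []
        for v in x:
            a = pi1 * normal_pdf(v, mu[0], var[0])
            b = (1.0 - pi1) * normal_pdf(v, mu[1], var[1])
            tau.append(a / (a + b))
        # M-step: weighted updates of proportions, means and variances.
        n1 = sum(tau)
        n2 = n - n1
        pi1 = n1 / n
        mu = [sum(t * v for t, v in zip(tau, x)) / n1,
              sum((1.0 - t) * v for t, v in zip(tau, x)) / n2]
        var = [max(sum(t * (v - mu[0]) ** 2 for t, v in zip(tau, x)) / n1, 1e-6),
               max(sum((1.0 - t) * (v - mu[1]) ** 2 for t, v in zip(tau, x)) / n2, 1e-6)]
    return pi1, mu, var, tau


# Hypothetical metagene values for eight tissues (illustration only).
values = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0]
pi1, means, variances, tau = em_two_normals(values)
# Outright assignment: allocate each tissue to its most probable component.
labels = [1 if t > 0.5 else 2 for t in tau]
```

Each tissue is allocated to the component with the highest posterior probability, so the clustering comes with a probabilistic measure of how firmly each tissue belongs to its cluster.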
SURVIVAL ANALYSIS: LONG-TERM SURVIVOR (LTS) MODEL
$$S(t) = \mathrm{prob}\{T > t\} = p_1 S_1(t) + p_2,$$
where $T$ is time to recurrence and $p_1 = 1 - p_2$ is the prior prob. of recurrence.
Adopt Weibull model for the survival function for recurrence $S_1(t)$.
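A minimal sketch of the LTS survival function and its censored-data log-likelihood follows. The Weibull parameterisation S1(t) = exp(-(t/scale)^shape) is an assumption (the slides do not state one), and the function names and parameter values are hypothetical.

```python
import math


def weibull_survival(t, scale, shape):
    """S1(t) = exp(-(t / scale) ** shape), for t >= 0 (assumed parameterisation)."""
    return math.exp(-((t / scale) ** shape))


def weibull_density(t, scale, shape):
    """f1(t) = -dS1/dt, for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, scale, shape)


def lts_survival(t, p1, scale, shape):
    """S(t) = p1 * S1(t) + p2, with p2 = 1 - p1 the long-term survivor fraction."""
    return p1 * weibull_survival(t, scale, shape) + (1.0 - p1)


def lts_log_likelihood(times, events, p1, scale, shape):
    """Observed recurrences contribute p1 * f1(t); censored times contribute S(t)."""
    total = 0.0
    for t, observed in zip(times, events):
        if observed:
            total += math.log(p1 * weibull_density(t, scale, shape))
        else:
            total += math.log(lts_survival(t, p1, scale, shape))
    return total


# Hypothetical parameter values and follow-up data (illustration only).
s_start = lts_survival(0.0, p1=0.7, scale=500.0, shape=1.5)
s_late = lts_survival(1.0e6, p1=0.7, scale=500.0, shape=1.5)
loglik = lts_log_likelihood([200.0, 450.0, 900.0, 1500.0], [1, 1, 0, 0],
                            p1=0.7, scale=500.0, shape=1.5)
```

The curve starts at 1 and levels off at p2 rather than falling to 0, which is what distinguishes the LTS model from an ordinary Weibull fit. In practice the three parameters would be estimated by maximising `lts_log_likelihood`.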
[Figure: Fitted LTS model vs. Kaplan-Meier.]
[Figures: PCA of tissues based on metagenes (two plots; axis: first PC).]
[Figures: PCA of tissues based on all genes, via SVD (two plots; axis: first PC).]
[Figure: Cluster-specific Kaplan-Meier plots.]
Survival Analysis for Ontario Dataset
• Nonparametric analysis:
Cluster   No. of Tissues   No. of Censored   Mean time to Failure (± SE)
1         29               8                 665 ± 85.9
2         8                7                 1388 ± 155.7
A significant difference between Kaplan-Meier estimates for the two clusters (P = 0.027).
• Cox’s proportional hazards analysis:
Variable                      Hazard ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2       6.78 (0.9 – 51.5)       0.06
Tumor stage (I vs. II&III)    1.07 (0.57 – 2.0)       0.83
Discriminant Analysis (Supervised Classification)
A prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias.
A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using all the genes, and the metagenes.
A cross-validation (CV) procedure was used to calculate the prediction error, after correction for the selection bias.
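Correcting for selection bias means that gene selection must be repeated inside every cross-validation fold, so that the held-out tissues play no part in choosing the genes. The sketch below illustrates that structure only. It swaps in a simple mean-difference gene ranking and a nearest-centroid rule in place of the SVM with RFE, and all function names and the simulated data are hypothetical.

```python
import random


def select_genes(X, y, k):
    """Return indices of the k genes with the largest gap between class means."""
    def class_mean(g, label):
        vals = [row[g] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)

    n_genes = len(X[0])
    gaps = [abs(class_mean(g, 0) - class_mean(g, 1)) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: gaps[g], reverse=True)[:k]


def nearest_centroid(X_train, y_train, x, genes):
    """Predict the label whose class centroid (on the chosen genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = sum((x[g] - sum(r[g] for r in rows) / len(rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def external_cv_error(X, y, k_genes, n_folds=10, seed=0):
    """Cross-validated error rate with gene selection redone in each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Key step: genes are chosen from the training tissues only.
        genes = select_genes(X_tr, y_tr, k_genes)
        for i in fold:
            if nearest_centroid(X_tr, y_tr, X[i], genes) != y[i]:
                errors += 1
    return errors / len(X)


# Simulated noise data: 40 tissues, 200 genes, no real class signal.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(40)]
y = [0] * 20 + [1] * 20
cv_error = external_cv_error(X, y, k_genes=10)
```

If the genes were instead selected once from all tissues before cross-validating, the error estimate would tend to be optimistically low, even on pure-noise data such as this.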
ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure: Ten-fold cross-validation error rate (CV10E) of the support vector machine (SVM), plotted against log2 (number of genes) (horizontal axis 0 to 12, vertical axis 0 to 0.12). Applied to g=2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).]
STANFORD DATA
918 genes based on 73 tissue samples from 67 patients.
Row and column normalized, retained 451 genes after select-genes step. Used 20 metagenes to cluster tissues.
Retrieved histological groups.
[Figure: Heat maps for the 20 Stanford gene-groups (73 tissues); horizontal axis: tissues. Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73).]
STANFORD CLASSIFICATION:
Cluster 1: 1-19 (good prognosis)
Cluster 2: 20-26 (long-term survivors)
Cluster 3: 27-35 (poor prognosis)
[Figure: Heat maps for the 15 Stanford gene-groups (35 tissues); horizontal axis: tissues. Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35).]
[Figure: Expression profiles for top metagenes (Stanford, 35 AC tissues). Panels: Gene Group 1, Gene Group 2, Gene Group 3, Gene Group 4; horizontal axis: tissues; legend: Stanford AC group 1, Stanford AC group 2, Stanford AC group 3, misallocated.]
[Figures: Cluster-specific Kaplan-Meier plots (two plots).]
Survival Analysis for Stanford Dataset
• Kaplan-Meier estimation:
Cluster   No. of Tissues   No. of Censored   Mean time to Failure (± SE)
1         17               10                37.5 ± 5.0
2         5                0                 5.2 ± 2.3
A significant difference in survival between clusters (P < 0.001).
• Cox’s proportional hazards analysis:
Variable                        Hazard ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1&2      13.2 (2.1 – 81.1)       0.005
Grade 3 vs. grades 1 or 2       1.94 (0.5 – 8.5)        0.38
Tumor size                      0.96 (0.3 – 2.8)        0.93
No. of tumors in lymph nodes    1.65 (0.7 – 3.9)        0.25
Presence of metastases          4.41 (1.0 – 19.8)       0.05
Survival Analysis for Stanford Dataset
• Univariate Cox’s proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
6          1.37 (0.44)        0.002
7          -0.24 (0.31)       0.44
8          0.14 (0.34)        0.68
9          -1.01 (0.56)       0.07
10         0.66 (0.65)        0.31
3          -0.63 (0.50)       0.20
4          -0.68 (0.57)       0.24
1          0.75 (0.46)        0.10
2          -1.13 (0.50)       0.02
5          0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06
Survival Analysis for Stanford Dataset
• Multivariate Cox’s proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
1          3.44 (0.95)        0.0003
2          -1.60 (0.62)       0.010
8          -1.55 (0.73)       0.033
11         1.16 (0.54)        0.031
The final model consists of four metagenes.
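A Cox coefficient and its standard error can be turned into a hazard ratio, an approximate 95% confidence interval and a two-sided Wald P-value. The sketch below applies this to the four coefficients of the final multivariate model. The slides do not say which test produced the reported P-values, so the Wald test is an assumption here, and the function name `wald_summary` is hypothetical.

```python
import math


def wald_summary(coef, se):
    """Return (hazard ratio, CI lower, CI upper, z, two-sided P-value)."""
    z = coef / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    hazard_ratio = math.exp(coef)
    lower = math.exp(coef - 1.96 * se)
    upper = math.exp(coef + 1.96 * se)
    return hazard_ratio, lower, upper, z, p_value


# (metagene, coefficient, SE) as reported for the final multivariate model.
final_model = [(1, 3.44, 0.95), (2, -1.60, 0.62), (8, -1.55, 0.73), (11, 1.16, 0.54)]
results = {m: wald_summary(coef, se) for m, coef, se in final_model}
```

A positive coefficient gives a hazard ratio above 1 (higher expression of the metagene goes with a higher hazard), and a negative coefficient gives a ratio below 1.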
STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure: Ten-fold cross-validation error rate (CV10E) of the support vector machine (SVM), plotted against log2 (number of genes) (horizontal axis 0 to 10, vertical axis 0 to 0.07). Applied to g=2 clusters.]
CONCLUSIONS
We applied a model-based clustering approach to classify tumors using their gene signatures into:
(a) clusters corresponding to tumor type
(b) clusters corresponding to clinical outcomes for tumors of a given subtype
In (a), there was almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (but not in the Ontario dataset).
CONCLUSIONS (cont.)
The clusters in (b) were identified with clinical outcomes (e.g. recurrence/recurrence-free and death/long-term survival).
We were able to show that gene-expression data provide prognostic information, beyond that of clinical indicators such as stage.
CONCLUSIONS (cont.)
Based on the tissue clusters, a discriminant analysis using support vector machines (SVM) further demonstrated the potential of gene expression as a tool for guiding treatment and patient care for lung cancer patients.
This supervised classification procedure was used to provide marker genes for prediction of clinical outcomes.
(In addition to those provided by the cluster-genes step in the initial unsupervised classification.)
LIMITATIONS
Small number of tumors available (e.g. Ontario and Stanford datasets).
Clinical data available for only subsets of the tumors; often for only one tumor type (AC).
High proportion of censored observations limits comparison of survival rates.