uniwaikato_workshop_part5

Part 5: Linking Microarray Data with Survival Analysis

Use of microarray data via model-based classification in the study and prediction of survival from lung cancer

(Ben-Tovim Jones et al., 2005)

Problems

• Censored Observations – the time of occurrence of the event (death) has not yet been observed.

• Small Sample Sizes – study limited by patient numbers

• Specific Patient Group – is the study applicable to other populations?

• Difficulty in integrating different studies (different microarray platforms)

A Case Study: The Lung Cancer data sets from CAMDA’03

Four independently acquired lung cancer data sets (Harvard, Michigan, Stanford and Ontario).

The challenge: To integrate information from different data sets (2 Affy chips of different versions, 2 cDNA arrays).

The final goal: To make an impact on cancer biology and eventually patient care.

“Especially, we welcome the methodology of survival analysis using microarrays for cancer prognosis (Park et al., Bioinformatics: S120, 2002).”

Methodology of Survival Analysis using Microarrays

Cluster the tissue samples (e.g. using hierarchical clustering), then compare the survival curves for each cluster using a non-parametric Kaplan-Meier analysis (Alizadeh et al. 2000).
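A minimal sketch of the Kaplan-Meier product-limit estimator that underlies such survival curves. The function name and the toy follow-up times are illustrative assumptions, not data from the lung cancer sets.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. recurrence) was observed, 0 if censored
    """
    n_events = Counter(t for t, e in zip(times, events) if e)
    curve, surv = [], 1.0
    for t in sorted(n_events):
        # Patients still under observation just before time t
        at_risk = sum(1 for u in times if u >= t)
        surv *= 1.0 - n_events[t] / at_risk
        curve.append((t, surv))
    return curve

# Toy data, made up for illustration only
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Applying the function separately to the tissues of each cluster gives one curve per cluster, which can then be compared.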

Park et al. (2002) and Nguyen and Rocke (2002) used partial least squares with the proportional hazards model of Cox.

Unsupervised vs. Supervised Methods

The semi-supervised approach of Bair and Tibshirani (2004) combines gene expression data with the clinical data.

AIM: To link gene-expression data with survival from lung cancer in the CAMDA’03 challenge

A CLUSTER ANALYSIS

We apply a model-based clustering approach to classify tumour tissues on the basis of microarray gene expression.

B SURVIVAL ANALYSIS

The association between the clusters so formed and patient survival (recurrence) times is established.

C DISCRIMINANT ANALYSIS

We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.

Lung Cancer

Approx. 80% of lung cancer patients have NSCLC (of which adenocarcinoma is the most common form).

All patients diagnosed with NSCLC are treated on the basis of stage at presentation (tumour size, lymph node involvement and presence of metastases).

Yet 30% of patients with resected stage I lung cancer will die of metastatic cancer within 5 years of surgery.

We want a prognostic test for early-stage lung adenocarcinoma to identify patients who are more likely to recur, and who would therefore benefit from adjuvant therapy.

Lung Cancer Data Sets

(see http://www.camda.duke.edu/camda03)

Wigle et al. (2002), Garber et al. (2001), Bhattacharjee et al. (2001), Beer et al. (2002).

[Figure: Heat map for 2880 Ontario genes (39 tissues); axis: tissues]

[Figure: Heat maps for the 20 Ontario gene-groups (39 tissues); axis: tissues. Tissues are ordered as: Recurrence (1-24) and Censored (25-39)]

[Figure: Expression profiles for useful metagenes (Ontario, 39 tissues). Panels: Gene Group 1, Gene Group 2, Gene Group 19, Gene Group 20; axis: tissues, ordered as Recurrence (1-24) and Censored (25-39); our tissue clusters 1 and 2 are marked]

Tissue Clusters

CLUSTER ANALYSIS via EMMIX-GENE of 20 METAGENES yields TWO CLUSTERS:

CLUSTER 1 (31): 23 (recurrence) plus 8 (censored). Poor-prognosis

CLUSTER 2 (8): 1 (recurrence) plus 7 (censored). Good-prognosis
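To illustrate what model-based clustering does, here is a toy sketch: a two-component normal mixture fitted by EM to a single hypothetical metagene, with each tissue assigned to the component of highest posterior probability. It is not EMMIX-GENE, whose implementation is not shown in these slides; the function name, starting values and data are assumptions.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_component_mixture(xs, n_iter=200):
    """EM for a 1-D mixture of two normals.

    Returns (weights, means, variances, posteriors), where posteriors[i][k]
    is the estimated probability that observation i belongs to component k.
    """
    ordered = sorted(xs)
    half = len(xs) // 2
    # Start the two means from the lower and upper halves of the data
    mu = [sum(ordered[:half]) / half, sum(ordered[half:]) / (len(xs) - half)]
    overall = sum(xs) / len(xs)
    var0 = sum((x - overall) ** 2 for x in xs) / len(xs)
    var, w = [var0, var0], [0.5, 0.5]
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation
        post = []
        for x in xs:
            dens = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(dens)
            post.append([d / total for d in dens])
        # M-step: update weights, means and variances
        for k in range(2):
            nk = sum(p[k] for p in post)
            w[k] = nk / len(xs)
            mu[k] = sum(p[k] * x for p, x in zip(post, xs)) / nk
            spread = sum(p[k] * (x - mu[k]) ** 2 for p, x in zip(post, xs)) / nk
            var[k] = max(spread, 1e-6)  # floor to avoid a degenerate component
    return w, mu, var, post

# Hypothetical metagene values for nine tissues (made up)
values = [0.1, -0.2, 0.0, 0.3, -0.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances, post = fit_two_component_mixture(values)
labels = [0 if p[0] >= p[1] else 1 for p in post]
print(labels)
```

The study clusters on 20 metagenes jointly, so a real analysis would use a multivariate mixture rather than this one-dimensional toy.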

SURVIVAL ANALYSIS: LONG-TERM SURVIVOR (LTS) MODEL

$$S(t) = \Pr\{T > t\} = p_1 S_1(t) + p_2$$

where $T$ is time to recurrence and $p_1 = 1 - p_2$ is the prior prob. of recurrence.

Adopt Weibull model for the survival function for recurrence $S_1(t)$.
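The long-term survivor model, S(t) = p1 S1(t) + p2 with p2 = 1 - p1, can be evaluated directly once a form for S1 is chosen. The sketch below assumes the common Weibull parameterization S1(t) = exp(-(t/scale)^shape); the slides give neither the parameterization nor the fitted values, so the function names and numbers are placeholders.

```python
import math

def weibull_survival(t, scale, shape):
    """Weibull survival function S1(t) = exp(-(t / scale) ** shape) (assumed form)."""
    return math.exp(-((t / scale) ** shape))

def lts_survival(t, p1, scale, shape):
    """Long-term survivor model: S(t) = p1 * S1(t) + p2, with p2 = 1 - p1."""
    return p1 * weibull_survival(t, scale, shape) + (1.0 - p1)

# Placeholder parameters, not fitted values from the study
for t in (0, 250, 500, 1000, 5000):
    print(t, round(lts_survival(t, p1=0.7, scale=500.0, shape=1.5), 3))
```

The curve starts at 1 and levels off at p2, the proportion of long-term survivors, instead of falling to zero.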

[Figure: Fitted LTS model vs. Kaplan-Meier]

[Figure: PCA of tissues based on metagenes; axis: first PC]

[Figure: PCA of tissues based on all genes (via SVD); axis: first PC]

[Figure: Cluster-specific Kaplan-Meier plots]

Survival Analysis for Ontario Dataset

• Nonparametric analysis:

Cluster   No. of Tissues   No. of Censored   Mean time to Failure (± SE)
1         29               8                 665 ± 85.9
2         8                7                 1388 ± 155.7

A significant difference between Kaplan-Meier estimates for the two clusters (P = 0.027).
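The slides do not say which test produced this P-value. A log-rank test is one common way to compare Kaplan-Meier curves between two groups; the sketch below implements it on made-up data, with hypothetical function and variable names.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P-value on 1 df)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp, variance = 0.0, 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e)          # events at t, both groups
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("no informative event times")
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p_value

# Made-up follow-up times; 1 = event observed, 0 = censored
chi2, p = logrank_test([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                       [10, 11, 12, 13, 14], [1, 1, 0, 1, 0])
print(round(chi2, 2), round(p, 4))
```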

• Cox’s proportional hazards analysis:

Variable                     Hazard ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2      6.78 (0.9 – 51.5)       0.06
Tumor stage (I vs. II&III)   1.07 (0.57 – 2.0)       0.83

Discriminant Analysis (Supervised Classification)

A prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias.

A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using both all the genes and the metagenes.

A cross-validation (CV) procedure was used to calculate the prediction error, after correction for the selection bias.
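Correcting for selection bias means that gene selection is repeated inside every cross-validation fold, using only the training tissues of that fold. The sketch below shows the idea with a deliberately simple stand-in (ranking genes by difference in class means, then a nearest-centroid rule) rather than the SVM used in the study; all names and the simulated data are assumptions.

```python
import random

def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute difference in class means."""
    def class_mean(g, label):
        vals = [row[g] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)
    scores = [abs(class_mean(g, 1) - class_mean(g, 0)) for g in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]

def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
        dist = sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cv_error(X, y, k_genes, n_folds=10, seed=0):
    """Cross-validated error with gene selection redone in each fold.

    Assumes every training fold contains tissues from both classes.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        genes = top_genes(X_tr, y_tr, k_genes)  # held-out tissues play no part here
        for i in test_idx:
            errors += nearest_centroid_predict(X_tr, y_tr, X[i], genes) != y[i]
    return errors / len(X)

# Simulated data: 40 tissues, 50 genes, only gene 0 differs between classes
rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(0, 1) + (4.0 if (g == 0 and lab == 1) else 0.0) for g in range(50)]
     for lab in y]
print(cv_error(X, y, k_genes=5))
```

Selecting genes once on all tissues and then cross-validating only the classifier would let the held-out tissues influence the gene list, which tends to make the error estimate too optimistic.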

[Figure: ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE). Horizontal axis: log2 (number of genes), 0 to 12; vertical axis ticks from 0 to 0.12]

Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM), applied to g=2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39)

STANFORD DATA

918 genes based on 73 tissue samples from 67 patients.

Row and column normalized, retained 451 genes after select-genes step. Used 20 metagenes to cluster tissues.

Retrieved histological groups.

[Figure: Heat maps for the 20 Stanford gene-groups (73 tissues); axis: tissues]

Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73)

STANFORD CLASSIFICATION:

Cluster 1: 1-19 (good prognosis)

Cluster 2: 20-26 (long-term survivors)

Cluster 3: 27-35 (poor prognosis)

[Figure: Heat maps for the 15 Stanford gene-groups (35 tissues); axis: tissues]

Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35)

[Figure: Expression profiles for top metagenes (Stanford, 35 AC tissues). Panels: Gene Group 1, Gene Group 2, Gene Group 3, Gene Group 4; legend: Stanford AC group 1, Stanford AC group 2, Stanford AC group 3, Misallocated; axis: tissues]

[Figure: Cluster-specific Kaplan-Meier plots]

Survival Analysis for Stanford Dataset

• Kaplan-Meier estimation:

Cluster   No. of Tissues   No. of Censored   Mean time to Failure (± SE)
1         17               10                37.5 ± 5.0
2         5                0                 5.2 ± 2.3

A significant difference in survival between clusters (P < 0.001)

• Cox’s proportional hazards analysis:

Variable                       Hazard ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1&2     13.2 (2.1 – 81.1)       0.005
Grade 3 vs. grades 1 or 2      1.94 (0.5 – 8.5)        0.38
Tumor size                     0.96 (0.3 – 2.8)        0.93
No. of tumors in lymph nodes   1.65 (0.7 – 3.9)        0.25
Presence of metastases         4.41 (1.0 – 19.8)       0.05

Survival Analysis for Stanford Dataset

• Univariate Cox’s proportional hazards analysis (metagenes):

Metagene   Coefficient (SE)   P-value
6          1.37 (0.44)        0.002
7          -0.24 (0.31)       0.44
8          0.14 (0.34)        0.68
9          -1.01 (0.56)       0.07
10         0.66 (0.65)        0.31
3          -0.63 (0.50)       0.20
4          -0.68 (0.57)       0.24
1          0.75 (0.46)        0.10
2          -1.13 (0.50)       0.02
5          0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06

Survival Analysis for Stanford Dataset

• Multivariate Cox’s proportional hazards analysis (metagenes):

Metagene   Coefficient (SE)   P-value
1          3.44 (0.95)        0.0003
2          -1.60 (0.62)       0.010
8          -1.55 (0.73)       0.033
11         1.16 (0.54)        0.031

The final model consists of four metagenes.
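As a worked check, assuming the P-values are two-sided Wald tests (the slides do not state this), each coefficient and standard error gives z = coefficient / SE, a P-value, and a hazard ratio exp(coefficient) per unit increase in the metagene. The 1.96 multiplier for an approximate 95% interval is likewise an assumption, and the function name is hypothetical.

```python
import math

def wald_summary(coef, se):
    """Return (z, two-sided P-value, hazard ratio, approximate 95% CI for the hazard ratio)."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    hazard_ratio = math.exp(coef)
    ci = (math.exp(coef - 1.96 * se), math.exp(coef + 1.96 * se))
    return z, p, hazard_ratio, ci

# Coefficients (SE) of the four metagenes in the final multivariate model
final_model = {1: (3.44, 0.95), 2: (-1.60, 0.62), 8: (-1.55, 0.73), 11: (1.16, 0.54)}
for metagene, (coef, se) in final_model.items():
    z, p, hr, (low, high) = wald_summary(coef, se)
    print(f"metagene {metagene}: z={z:.2f}, P={p:.4f}, HR={hr:.2f} ({low:.2f} to {high:.2f})")
```

The resulting P-values should land close to the tabulated ones; small differences are expected because the coefficients and standard errors are rounded.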

[Figure: STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE). Horizontal axis: log2 (number of genes), 0 to 10; vertical axis ticks from 0 to 0.07]

Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM), applied to g=2 clusters.

CONCLUSIONS

We applied a model-based clustering approach to classify tumors using their gene signatures into:

(a) clusters corresponding to tumor type

(b) clusters corresponding to clinical outcomes for tumors of a given subtype

In (a), there was almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (but not in the Ontario dataset).

CONCLUSIONS (cont.)

The clusters in (b) were identified with clinical outcomes (e.g. recurrence/recurrence-free and death/long-term survival).

We were able to show that gene-expression data provide prognostic information, beyond that of clinical indicators such as stage.

CONCLUSIONS (cont.)

Based on the tissue clusters, a discriminant analysis using support vector machines (SVM) further demonstrated the potential of gene expression as a tool for guiding treatment and patient care for lung cancer patients.

This supervised classification procedure was used to provide marker genes for prediction of clinical outcomes (in addition to those provided by the cluster-genes step in the initial unsupervised classification).

LIMITATIONS

Small number of tumors available (e.g. Ontario and Stanford datasets).

Clinical data available for only subsets of the tumors; often for only one tumor type (AC).

High proportion of censored observations limits comparison of survival rates.
