Selection of Patient Samples and Genes for Disease Prognosis

Selection of Patient Samples and Genes for Disease Prognosis Limsoon Wong Institute for Infocomm Research Joint work with Jinyan Li & Huiqing Liu Plan • Introduction • Techniques – Extreme sample selection – ERCOF gene identification – Scoring function construction • Results – Diffuse large-B-cell lymphoma – Lung adenocarcinoma • Discussions and conclusions Copyright © 2004 by Limsoon Wong Introduction Gene Expression Profile + Clinical Data  Outcome Prediction • Univariate & multivariate Cox survival analysis (Beer et al 2002, Rosenwald et al 2002) • Fuzzy neural network (Ando et al 2002) • Partial least squares regression (Park et al 2002) • Weighted voting algorithm (Shipp et al 2002) • Gene index and “reference gene” (LeBlanc et al 2003) • …… Copyright © 2004 by Limsoon Wong Our Approach “extreme” sample selection ERCOF Copyright © 2004 by Limsoon Wong Techniques Extreme Sample Selection ERCOF Gene Identification Scoring Function Construction Extreme Sample Selection Short-term Survivors v.s. Long-term Survivors Short-term survivors who died within a short period Long-term survivors who were alive after a long follow-up time   F(T) < c1 and E(T) = 1 F(T) > c2 T: sample F(T): follow-up time E(T): status (1:unfavorable; 0: favorable) c1 and c2: thresholds of survival time Copyright © 2004 by Limsoon Wong ERCOF EntropyBased Rank Sum Test & Correlation Filtering Copyright © 2004 by Limsoon Wong ERCOF Phase I: Entropy Measure • Apply a discretization algorithm (Fayyad & Irani 1993) • Remove genes with expression values w/o cut point found (cannot be discretized) Copyright © 2004 by Limsoon Wong ERCOF Phase II: Wilcoxon Rank Sum Test • Calculate Wilcoxon rank sum w(x) for gene x • Remove gene x if w(x) [clower, cupper] Copyright © 2004 by Limsoon Wong ERCOF Phase III: Pearson Correlation • Group features by Pearson Correlation • For each group, retain the top 50% Copyright © 2004 by Limsoon Wong Risk Score Construction Linear Kernel SVM regression function G (T )   ai yi K (T , x (i ))  b i T: test sample, x(i): support vector, yi: class label (1: short-term survivors; -1: long-term survivors) Transformation function (posterior probability) S (T )  1 1  e G ( T ) ( S (T )  (0,1)) S(T): risk score of sample T Copyright © 2004 by Limsoon Wong Copyright © 2004 by Limsoon Wong Predicting Survival of Patients with DLBC Lymphoma Image credit: Rosenwald et al, 2002 Diffuse Large B-Cell Lymphoma • DLBC lymphoma is the most common type of lymphoma in adults • Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients  DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy Copyright © 2004 by Limsoon Wong • Intl Prognostic Index (IPI) – age, “Eastern Cooperative Oncology Group” Performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ... • Not good for stratifying DLBC lymphoma patients for therapeutic trials  Use gene-expression profiles to predict outcome of chemotherapy? Rosenwald et al., NEJM 2002 • 240 data samples – 160 in preliminary group – 80 in validation group – each sample described by 7399 microarray features • Rosenwald et al.’s approach – identify gene: Cox proportional-hazards model – cluster identified genes into four gene signatures – calculate for each sample an outcome-predictor score – divide patients into quartiles according to score Copyright © 2004 by Limsoon Wong Knowledge Discovery from Gene Expression of “Extreme” Samples 240 samples “extreme” sample selection: < 1 yr vs > 8 yrs 47 shortterm survivors 26 longterm survivors 84 genes knowledge discovery from gene expression T is long-term if S(T) < 0.3 T is short-term if S(T) > 0.7 Copyright © 2004 by Limsoon Wong 7399 genes 80 samples Kaplan-Meier Plot for 80 Test Cases p-value of log-rank test: < 0.0001 Risk score thresholds: 0.7, 0.3 Copyright © 2004 by Limsoon Wong Improvement Over IPI (A) IPI low, p-value = 0.0063 Copyright © 2004 by Limsoon Wong (B) IPI intermediate, p-value = 0.0003 Merit of “Extreme” Samples (A) W/o sample selection (p =0.38) (B) With sample selection (p=0.009) No clear difference on the overall survival of the 80 samples in the validation group of DLBCL study, if no training sample selection conducted Copyright © 2004 by Limsoon Wong Copyright © 2004 by Limsoon Wong Predicting Survival of Patients with Lung Adenocarcinoma Image credit: Beer et al, 2002 Beer et al., Nat Med 2002 • 86 data samples – 67 are stage I – 19 are stage III – each described by 7129 microarray features • Beer et al.’s approach – identify genes: univariate Cox analysis – derive risk index using selected genes – validation methods • equal sized training and testing sample • leave-one-out cross validation Copyright © 2004 by Limsoon Wong Knowledge Discovery from Gene Expression of “Extreme” Samples 67 stage I 19 stage III “extreme” sample selection: < 1 yr vs > 5 yrs 10 shortterm survivors 21 longterm survivors 591 genes knowledge discovery from gene expression T is long-term if S(T) < 0.25 T is short-term if S(T) > 0.73 Copyright © 2004 by Limsoon Wong 7129 genes 55 samples Kaplan-Meier Plots for Lung Adenocarcinoma Study (A) Test cases (55) (B) All cases (86) p-value of log-rank test: (A) 0.0036, (B) <0.0001 Copyright © 2004 by Limsoon Wong Copyright © 2004 by Limsoon Wong Discussions Discussions: Sample Selection Application DLBCL Lung Data set Status Total Dead Alive Original 88 72 160 Informative 47+1(*) 25 73 Original 24 62 86 Informative 10+2(*) 19 31 Number of samples in original data and selected informative training set. (*): Number of samples whose corresponding patient was dead at the end of follow-up time, but selected as a long-term survivor. Copyright © 2004 by Limsoon Wong Discussions: Gene Identification Gene selection DLBCL Lung Original 4937(*) 7129 Phase I 132(2.7%) 884(12.4%) Phase II 84(1.7%) 591(8.3%) Number of genes left after feature filtering for each phase. (*): number of genes after removing those genes who were absent in more than 10% of the experiments. Copyright © 2004 by Limsoon Wong Discussions: Kaplan-Meier plots on Test Cases Using Diff No. of Genes Copyright © 2004 by Limsoon Wong Conclusion I • Selecting extreme cases as training samples is an effective way to improve patient outcome prediction based on gene expression profiles and clinical information Copyright © 2004 by Limsoon Wong Copyright © 2004 by Limsoon Wong Is ERCOF Useful? 1000+ Expts • Feature selection methods considered – All use all features – All-entropy select features whose value range can be partitioned by Fayyad & Irani’s entropy method – Mean-entropy select features whose entropy is better than the mean entropy – Top-number-entropy select the top 20, 50, 100, 200 genes by their entropy – ERCOF at 5% significant level for Wilcoxon rank sum test and 0.99 Pearson correlation coeff threshold Copyright © 2004 by Limsoon Wong • Data sets considered – – – – – – – Colon tumor Prostate cancer Lung cancer Ovarian cancer DLBC lymphoma ALL-AML Childhood ALL • Learning methods considered – C4.5 – Bagging, Boosting, CS4 – SVM, 3-NN ERCOF vs All-Entropy All-entropy wins 4 times ERCOF wins 60 times Copyright © 2004 by Limsoon Wong ERCOF vs Mean-Entropy Mean-entropy wins 18 times ERCOF wins 42 times Copyright © 2004 by Limsoon Wong Effectiveness of ERCOF Total wins 22 38 Copyright © 2004 by Limsoon Wong 46 47 48 51 47 67 Conclusion 2 • ERCOF is very suitable for SVM, 3-NN, CS4, Random Forest, as it gives these learning algos highest no. of wins • ERCOF is suitable for Bagging also, as it gives this classifier the lowest no. of errors  ERCOF is a systematic feature selection method that is very useful Copyright © 2004 by Limsoon Wong Copyright © 2004 by Limsoon Wong Any Question? References • H. Liu et al, “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382-392 • Beer et al, “Gene-expression profiles predict survival of patients with lung adenocarcinoma.” Nat. Med, 8(8):816--823, 2002 • Rosenwald et al, “The use of molecular profiling to predict survival after chemotherapy for diffuse largeB-cell lymphoma”, NEJM, 346(25):1937--1947, 2002 Copyright © 2004 by Limsoon Wong

Selection of Patient Samples and Genes for Disease Prognosis

Related documents

Products

Support

Selection of Patient Samples and Genes for Disease Prognosis

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib