Introduction to Machine Learning
Sackler
Yin Aphinyanaphongs, MD/PhD
12/11/2014

Who Am I
• Yin Aphinyanaphongs (yinformatics.com)
• MD, PhD from Vanderbilt University in Nashville, TN.
• Assistant Professor in the Center for Health Informatics and Bioinformatics.
• Primary expertise: machine learning, predictive modeling, text classification, data mining, social media, large medical datasets.
• Secondary expertise: search engine design/information retrieval, natural language processing.

What I Teach
• Introduction to Biomedical Informatics
• Introduction to Medicine for Computer Scientists
• Data Analytics in R for Physicians

Machine Learning Examples
• Given an email, classify it as spam or not spam.
• Given a handwritten digit, assign it the right number.
• Given descriptions of passengers on the Titanic, predict who will or will not survive.
• Given a gene expression microarray of a cancer, predict whether the cancer will or will not metastasize.

Email Spam Text Classification
[Figure: example spam email, http://blog.cyren.com/uploads/blog/google-docs-spamsample.jpg]

Digit Classification
[Figure: handwritten digits, http://nonbiritereka.hatenablog.com/entry/2014/09/18/100439]

Predicting Titanic Survival
• Features: passenger class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, passenger fare, cabin, port of embarkation.
• https://www.kaggle.com/c/titanicgettingStarted

Molecular Signatures
• A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.
[Figure: gene expression heatmap from Golub et al. (1999)]

Machine Learning Goal
• Construct algorithms that learn from data, so that a model built from training data will generalize to unseen data.

General Framework
Obtain Sample → Label (optional) → Clean → Encode → Build a Model → Performance Evaluation (internal) → Model Application and Validation (external)

Basic Framework
• Labeled examples (ALL / AML) are fed to a classification algorithm (e.g., random forests, regularized logistic regression, support vector machines) to build a model.
• The built model then assigns labels (ALL / AML) to unseen examples.

+ Key Concept – Supervised Learning
From the book "A Gentle Introduction to Support Vector Machines in Biomedicine" by Statnikov, Aliferis, Hardin, and Guyon.

Principles and geometric representation for supervised learning (1/7)
• We want to classify objects as boats and houses.

Principles and geometric representation for supervised learning (2/7)
• All objects before the coast line are boats and all objects after the coast line are houses.
• The coast line serves as a decision surface that separates the two classes.

Principles and geometric representation for supervised learning (3/7)
• These boats will be misclassified as houses; this house will be misclassified as a boat.

Principles and geometric representation for supervised learning (4/7)
• The methods that build classification models (i.e., "classification algorithms") operate very similarly to the previous example.
• First, all objects are represented geometrically (plotted by longitude and latitude).

Principles and geometric representation for supervised learning (5/7)
• Then the algorithm seeks a decision surface that separates the classes of objects (boats vs. houses).

Principles and geometric representation for supervised learning (6/7)
• Unseen (new) objects are classified as "boats" if they fall below the decision surface and as "houses" if they fall above it.

Principles and geometric representation for supervised learning (7/7)
[Figure: three unseen objects (Object #1, #2, #3) plotted by longitude and latitude, classified by the decision surface]
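To make the geometric picture concrete, here is a minimal sketch in Python (scikit-learn assumed): objects are represented as (longitude, latitude) points, a linear decision surface is learned from labeled examples using one of the algorithms listed above (regularized logistic regression), and unseen objects are classified by which side of the surface they fall on. The coordinates and labels are invented purely for illustration.

```python
# Sketch of the boats-vs-houses example: each object is a point
# (longitude, latitude), a labeled training set defines two classes,
# and a linear decision surface learned from that set classifies
# unseen objects by which side they fall on.
# All coordinates and labels are made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression  # regularized logistic regression

# Labeled training examples: [longitude, latitude]
X_train = np.array([
    [1.0, 0.5], [2.0, 1.0], [3.0, 0.8],   # boats (below the "coast line")
    [1.5, 3.0], [2.5, 3.5], [3.5, 2.8],   # houses (above the "coast line")
])
y_train = np.array(["boat", "boat", "boat", "house", "house", "house"])

# The algorithm finds a decision surface (here a straight line) separating the classes.
clf = LogisticRegression().fit(X_train, y_train)

# Unseen (new) objects are labeled by which side of the surface they land on.
X_new = np.array([[2.2, 0.7], [1.8, 3.2], [3.0, 2.0]])
print(clf.predict(X_new))
```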
+ Key Concept – Overfitting, Underfitting
From the book "A Gentle Introduction to Support Vector Machines in Biomedicine" by Statnikov, Aliferis, Hardin, and Guyon.

Two problems: over-fitting & under-fitting
• Over-fitting (a model to your data) = building a model that is good on the original data but fails to generalize well to new/unseen data.
• Under-fitting (a model to your data) = building a model that is poor on both the original data and new/unseen data.

Over/under-fitting are related to the complexity of the decision surface and how well the training data are fit.
[Figure: outcome of interest Y plotted against predictor X, showing training data and future data]

Scenario 1
[Figure series: training data and future data with two candidate fitted lines; "This line is good!" vs. "This line overfits!"]

Scenario 2
[Figure series: training data and future data with two candidate fitted lines; "This line is good!" vs. "This line underfits!"]

Very important concept
• Successful data analysis methods balance training data fit with complexity.
• Too complex a signature (to fit the training data well) → overfitting (i.e., the signature does not generalize).
• Too simplistic a signature (to avoid overfitting) → underfitting (it will generalize, but the fit to both the training and future data will be low and predictive performance small).

+ Key Concept – Performance Estimation
From the book "A Gentle Introduction to Support Vector Machines in Biomedicine" by Statnikov, Aliferis, Hardin, and Guyon.

On estimation of classifier accuracy
• Large sample case: use hold-out validation (split the data into a train set and a held-out test set).
• Small sample case: use N-fold cross-validation (rotate which fold is held out for testing while the remaining folds are used for training).
• Other versions of this general notion: leave-one-out cross-validation, leave-pair-out cross-validation, bootstrap, single holdout.

+ Key Concept – The Support Vector Machine
From the book "A Gentle Introduction to Support Vector Machines in Biomedicine" by Statnikov, Aliferis, Hardin, and Guyon.

The support vector machine (SVM) approach for building molecular signatures
• The support vector machine (SVM) is a binary classification algorithm.
• SVMs are important because of (a) theoretical reasons:
  - robust to a very large number of variables and small samples;
  - can learn both simple and highly complex classification models;
  - employ sophisticated mathematical principles to avoid overfitting;
  and (b) superior empirical results.
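As a concrete illustration of the small-sample recipe above, the sketch below estimates a linear SVM's accuracy with N-fold cross-validation (scikit-learn assumed; the dataset is a synthetic stand-in, so the printed numbers are illustrative only). The geometric intuition behind the SVM itself follows next.

```python
# N-fold cross-validation: split the data into N folds, train on N-1 folds,
# test on the held-out fold, and rotate so every fold serves once as the
# test set; the mean held-out accuracy is the performance estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a small labeled dataset (two classes, 72 samples).
X, y = make_classification(n_samples=72, n_features=50, n_informative=5,
                           random_state=0)

clf = SVC(kernel="linear")                    # a binary SVM classifier
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(clf, X, y, cv=cv)    # one held-out accuracy per fold
print("fold accuracies:", np.round(scores, 2))
print("estimated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```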
Main ideas of SVMs (1/3)
• Consider an example dataset described by two genes, gene X and gene Y.
• Represent patients geometrically (by "vectors").
[Figure: normal patients and cancer patients plotted by gene X and gene Y]

Main ideas of SVMs (2/3)
• Find a linear decision surface ("hyperplane") that can separate the patient classes and has the largest distance (i.e., largest "gap" or "margin") between the border-line patients (the "support vectors").

Main ideas of SVMs (3/3)
• If such a linear decision surface does not exist, the data is mapped into a much higher-dimensional space ("feature space") where a separating decision surface is found.
• The feature space is constructed via a very clever mathematical projection (the "kernel trick").
[Figure: normal and cancer patients separated by a decision surface after a kernel mapping]

+ Key Concept – Curse of Dimensionality
Thanks to Dr. Gutierrez-Osuna, http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf

Curse of Dimensionality (1/3) and (2/3)
[Figure slides; see http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf]

Curse of Dimensionality (3/3)
• The range of feature counts in higher-dimensional data includes:
  - 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
  - >500,000 (exon arrays/tiled microarrays/SNP arrays)
  - 10,000-300,000 (MS proteomics)
  - >10,000,000 (LC-MS proteomics)
  - >100,000,000 (next-generation sequencing)

High dimensionality in small samples causes
• Some methods do not run at all (classical regression).
• Some methods give bad results (KNN, decision trees).
• Very slow analysis.
• Very expensive/cumbersome clinical application.
• A tendency to "overfit".

+ Cancer Classification Case Study
From Golub et al. (1999)

Case Study
• Classify the values of a gene microarray according to leukemia type: AML vs. ALL.

Task meta-data
• 72 samples: 47 ALL, 25 AML.
• 5,327 genes.

Encode Microarray
• Within each training fold, normalize the values of each column between 0 and 1.
• Notice that we don't normalize the entire dataset and then run our classification algorithms (this would result in overfitting).

Build a Model – Support Vector Machine
[Figure: two classes of points separated by a line in a 2-dimensional space]
• This example illustrates a 2-dimensional space; the x and y axes represent one word each. A full text categorization example could contain upwards of 50,000 words and thus 50,000 dimensions.

Build a Model – K Nearest Neighbors
[Figure: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/knn.html]

Build a Model – Neural Network
[Figure: http://en.wikipedia.org/wiki/Artificial_neural_network]

Estimate Performance
• Small sample case: use N-fold cross-validation.
[Figure: data split into folds that rotate between training and testing]

Results
Proportion of correct classifications:
• Baseline (all in one class): 65.0%
• Support Vector Machine: 91.7%
• K Nearest Neighbors: 87.9%
• Neural Network: 84.7%
http://bib.oxfordjournals.org/content/7/1/86.full.pdf+html

Conclusions
• Machine learning examples.
• Key concepts: supervised learning, overfitting/underfitting, support vector machines, cross-validation, curse of dimensionality.
• Case study – cancer classification.
Thanks: Dr. Gutierrez-Osuna, Dr. Alexander Statnikov.

+ Molecular Signatures
Slides from Dr. Alexander Statnikov, PhD.

Definition of a molecular signature
• A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.
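The leukemia case study above is one example of building such a signature. The sketch below mirrors that evaluation pipeline under stated assumptions: a random matrix stands in for the real 72 x 5,327 expression data (so the printed accuracies will not match the slide's results), and scikit-learn's MinMaxScaler inside a Pipeline ensures the 0-1 normalization is fit within each training fold rather than on the whole dataset, as the Encode step requires.

```python
# Cross-validated evaluation of classifiers on a gene-expression matrix,
# with per-fold 0-1 column normalization to avoid fitting the scaler on test data.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the real data: 72 microarrays x 5,327 genes, labels ALL/AML.
rng = np.random.default_rng(0)
X = rng.normal(size=(72, 5327))
y = np.array(["ALL"] * 47 + ["AML"] * 25)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "Support Vector Machine": SVC(kernel="linear"),
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    # MinMaxScaler is fit on the training fold only, then applied to the test fold,
    # so no information from held-out samples leaks into the encoding step.
    pipeline = make_pipeline(MinMaxScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv)
    print("%s: %.1f%% correct" % (name, 100 * scores.mean()))
```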
Example of a molecular signature
[Figure: patient with lung cancer → primary lung cancer biopsy → gene expression profile → molecular signature → prediction (e.g., metastatic lung cancer)]

Main uses of molecular signatures
1. Direct benefits: models of disease phenotype/clinical outcome
   • Diagnosis
   • Prognosis, long-term disease management
   • Personalized treatment (drug selection, titration)
2. Ancillary benefits 1: biomarkers for diagnosis or outcome prediction
   • Make the above tasks resource efficient and easy to use in clinical practice
   • Help next-generation molecular imaging
   • Leads for potential new drug candidates
3. Ancillary benefits 2: discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)
   • Leads for potential new drug candidates

Recent molecular signatures available for patient care
Agendia, Clarient, Prediction Sciences, LabCorp, OvaSure, University Genomics, BioTheranostics, Applied Genomics, Genomic Health, Veridex, Power3, Correlogic Systems.

Prostate cancer signatures in the market
[Figure slide]

MammaPrint
• Developed by Agendia (www.agendia.com).
• A 70-gene signature to stratify women with breast cancer that hasn't spread into "low risk" and "high risk" for recurrence of the disease.
• Independently validated in >1,000 patients.
• So far performed >10,000 tests.
• Cost of the test is ~$3,000.
• In February 2007 the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm.
• TIME Magazine's 2007 "medical invention of the year".

Oncotype DX Breast Cancer Assay (launched in 2004)
• Developed by Genomic Health (www.genomichealth.com).
• A 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse.
• Independently validated in thousands of patients.
• So far performed >200,000 tests.
• Price of the test is $4,175.
• Not FDA approved but covered by most insurances, including Medicare.
• Its sales in 2012 reached $199M.

Economic validity
• In a 2005 economic analysis of the Recurrence Score result in LN-, ER+ patients receiving tamoxifen, Hornberger et al. performed a cost-utility analysis using a decision analytic model. Using the model, the Recurrence Score result was predicted on average to increase quality-adjusted survival by 16.3 years and reduce overall costs by $155,128.
• Instead of using the model, economic benefits can now be assessed from the published clinical utility of the test and actual health plan costs for adjuvant chemotherapy. For example, in a 2 million member plan, approximately 773 women are eligible for the test. If half receive the test, given the high and increasing cost of adjuvant chemotherapy, supportive care, and management of adverse events, the use of the Oncotype DX assay is estimated to save approximately $1,930 per woman tested (given an aggregate 34% reduction in chemotherapy use).
• References about health benefits and cost-effectiveness:
  - "Economic Analysis of Targeting Chemotherapy Using a 21-Gene RT-PCR Assay in Lymph Node-Negative, Estrogen Receptor-Positive, Early-Stage Breast Cancer." Am J Manag Care. 2005;11(5):313-324.
  - "Impact of a 21-Gene RT-PCR Assay on Treatment Decisions in Early-Stage Breast Cancer: An Economic Analysis Based on Prognostic and Predictive Validation Studies." Cancer. 2007;109(6):1011-1018.
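As a back-of-the-envelope restatement of the health-plan example above: the eligible-women count, the 50% testing rate, and the per-woman savings are the figures quoted on the slide, while multiplying them into a plan-level total is only an illustrative reading, not part of the cited analyses.

```python
# Illustrative reading of the slide's health-plan example (not an analysis).
plan_members       = 2_000_000   # size of the hypothetical plan (context only, from the slide)
eligible_women     = 773         # women eligible for the test (from the slide)
fraction_tested    = 0.5         # "if half receive the test" (from the slide)
savings_per_tested = 1_930       # estimated savings per woman tested, USD (from the slide)

women_tested  = eligible_women * fraction_tested
total_savings = women_tested * savings_per_tested          # illustrative aggregation only
print(f"women tested: ~{women_tested:.0f}")                # ~386
print(f"illustrative plan-level savings: ~${total_savings:,.0f}")  # ~$745,945
```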
Oncotype DX Colon Cancer Assay (launched in 2010)
• Developed by Genomic Health (www.genomichealth.com).
• A multi-gene signature to predict the risk of recurrence in patients with stage II colon cancer.
• Independently validated in thousands of patients.
• Price of the test is $3,280.
• Not FDA approved but covered by most insurances, including Medicare.

Oncotype DX Prostate Cancer Assay (launched in 2013)
• Developed by Genomic Health (www.genomichealth.com).
• A multi-gene signature to distinguish aggressive prostate cancer from less threatening disease.
• Independently validated.
• Price of the test is $3,820.
• Not FDA approved but covered by most insurances, including Medicare.

Oncotype DX Business Metrics
[Figure: data from http://investor.genomichealth.com/]

Conclusions
• Machine learning examples.
• Key concepts: supervised learning, overfitting/underfitting, support vector machines, cross-validation, curse of dimensionality.
• Case study – cancer classification.
• Case study – molecular signatures.
Thanks: Dr. Gutierrez-Osuna, Dr. Alexander Statnikov.