Introduction to high-throughput data analysis
Guanghua (Andy) Xiao
July 24, 2012
University of Texas Southwestern Medical Center

Overview
• Introduction to high-throughput data
• Gene expression microarray data
• Data preprocessing
• Data analysis
  1. Gene clustering and gene function prediction
  2. Identifying differentially expressed genes
  3. Pathway/gene set enrichment analysis
  4. Constructing gene networks
  5. Classification/prediction
• Some real data analysis examples

Traditional Biology

Systems Biology

Northern blot vs microarray

Data Matrix
Probe set   CL20041  CL20041  CL20041  CL200411  CL20041  CL20041  CL20041  CL20041  ……
            10909AA  11002AA  11003AA  10100AA   11010AA  11013AA  11017AA  11018AA
1007_s_at   10.4     10.2     10.22    10.7      9.63     12.05    9.42     10.75
1053_at     6.37     7.53     6.11     6.61      6.45     6.65     6.87     7
117_at      6.44     7.04     6.77     7.61      7.07     7.04     6.95     6.56
121_at      8.99     8.92     9.03     9.13      8.87     8.85     8.74     8.56
1255_g_at   4.36     4.73     4.54     4.73      4.48     4.55     4.78     4.47
1294_at     7.79     8.1      8.19     8.17      8.24     8.14     7.92     7.89
1316_at     6.16     6.41     6.43     6.47      6.15     6.92     6.07     6.13
1320_at     5.09     5.05     4.86     5.13      4.97     4.96     5.02     5.01
1405_i_at   8.38     8.82     8.47     7.6       8.24     7.36     8.37     7.11
1431_at     4.37     4.34     4.39     4.44      4.26     4.37     4.38     4.25
1438_at     7.87     8.3      7.73     7.74      6.73     7.47     7.4      7.34
1487_at     7.62     8.1      8.05     8.26      7.25     7.7      7.9      7.64
1494_f_at   7        7.35     7.25     7.29      6.79     7.11     7.04     6.92
……          ……

Application
• Genetics
  – Genome-wide association study (GWAS), copy number variation (CNV)
  – Technique: genome-wide single nucleotide polymorphism (SNP) array
• Epigenetics
  – Definition: mechanisms that cause gene expression changes without changes in the underlying DNA sequence
  – DNA methylation, histone methylation/acetylation, and transcription factor binding
  – Technique: chromatin immunoprecipitation on chip (ChIP-chip); chromatin immunoprecipitation –
sequencing (ChIP-Seq)
• Gene/exon expression
  – Technique: gene expression array, exon expression array, RNA-Seq
• Protein expression
• Compound screening

SOME EXAMPLES

Discover disease subtypes
Golub et al, 1999, Science; Yeoh et al, 2002, Cancer Cell

Development of Diagnostic Tests for Cancer
From Ramaswamy, N Engl J Med, 2004

Identify tumor driver genes
Weir et al, Nature, 2007; Akavia et al, Cell, 2010

GENE EXPRESSION MICROARRAY

Microarray Platforms
• MAQC Consortium. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 24, 1151-1161 (2006). Published online: 8 September 2006 | doi:10.1038/nbt1239

Platforms
Manufacturer        Code  Protocol              Platform                                      # of Probesets
Applied Biosystems  ABI   One-color microarray  Human Genome Survey Microarray v2.0           32,878
Affymetrix          AFX   One-color microarray  HG-U133 Plus 2.0 GeneChip                     54,675
Agilent             AGL   Two-color microarray  Whole Human Genome Oligo Microarray, G4112A   43,931
Agilent             AG1   One-color microarray  Whole Human Genome Oligo Microarray, G4112A   43,931
Eppendorf           EPP   One-color microarray  DualChip Microarray                           294
GE Healthcare       GEH   One-color microarray  CodeLink Human Whole Genome, 300026           54,359
Illumina            ILM   One-color microarray  Human-6 BeadChip, 48K v1.0                    47,293
NCI_Operon          NCI   Two-color microarray  Operon Human Oligo Set v3                     37,632
Applied Biosystems  TAQ   TaqMan assays         >200,000 assays available                     1,004
Panomics            QGN   QuantiGene assays     2,600 assays available                        245
Gene Express        GEX   StaRT-PCR assays      1,000 assays available                        207
Spotted Array

Affymetrix Array
• Probes = 25 nt sequences
• Probe sets = 11 to 20 probes corresponding to a particular gene or EST
• Sequence data obtained from dbEST, GenBank, and RefSeq
• Draft assembly of the Human Genome (NCBI Build 31) used to assess sequence orientation and quality
• Probes selected from the 600 bases most proximal to the 3' end of each transcript

Illumina BeadArray
• Advantages:
  – High quality
  – Less expensive
  – Needs much less RNA sample
• Features:
  – Multiple replicates
  – Negative control: nonspecific beads to control the background noise level

Microarray Data Preprocessing

Data Preprocessing
• The purpose of preprocessing microarray data is to minimize systematic (technical) variation while retaining the full biological variation. This is a critical step for obtaining valid results.
• Steps:
  – Image acquisition and feature extraction: defining the array features, which correspond to the probe spots on the microarray, so that the hybridization intensity of each spot can be determined.
  – Background correction: removing the signal that comes from non-specific hybridization and from fluorescence of the solid support.
  – Normalization: correcting for systematic differences between samples on the same chip, or between chips, that do not represent true biological variation between samples.
  – Summarization: combining probe-level or bead-level intensities into a single gene-level intensity.

The top 10 genes based on an analysis of the Beer et al. data using different processing methods.
RMA                 MAS5                Beer et al.
Symbol    P         Symbol    P         Symbol    P
CD8B      0.0697    RAFTLIN   0.0245    RAFTLIN   0.0187
SLC2A1    0.127     TMSB4X    0.0465    NP        0.0993
CCR2      0.2111    SLC2A1    0.0559    KLHDC3    0.2968
PLD3      0.2224    IHPK1     0.3312    TMSB4X    0.3808
RAFTLIN   0.2433    MLL       0.3414    CXCL3     0.4084
HNRPL     0.2787    NP        0.3492    SELP      0.4441
BCL2      0.3106    PRKACB    0.4494    STX1A     0.5026
PFKP      0.3223    <NA>      0.4787    SEC31L1   0.5068
STX1A     0.361     E2F4      0.5528    PRKACB    0.5355
INPP5D    0.369     P2RX5     0.5846    PBXIP1    0.5571
Owzar K, et al, CCR, 2008

Another example
The overlaps among the top 50 genes
                          Non-normalization     Normalization
                          Mean      T           Mean      T
Non-normalization  Mean   50        2           18        3
Non-normalization  T                50          2         5
Normalization      Mean                         50        2
Normalization      T                                      50
T: Student t-test
Xie et al, 2003

Microarray Data Analysis 1. Predicting gene functions

Predicting gene functions (Guilt by association)
• Unsupervised learning (cluster)
  – Hierarchical clustering
  – K-means clustering
  – Self Organizing Map (SOM)
• Supervised learning (classifier/predictor)
  – K-nearest Neighbor (KNN)
  – Linear Discriminant Analysis (LDA)
  – Support Vector Machine (SVM)
[Figure: microarray expression data (Eisen et al, PNAS 1998) turned into a co-expression network. Cell cycle cluster: CDC3, CLB4, CDC16, UNK1. Protein degradation cluster: RPT1, RPN3, RPT6, UNK2.]
Fraser AG, Marcotte EM. A probabilistic view of gene function. Nat Genet. 2004 Jun;36(6):559-64

Microarray Data Analysis 2.
Identify differentially expressed genes

Identify DE genes
• Goal: find which genes are expressed differently under different conditions (e.g., normal vs. tumor tissues)
• Quantification of cDNA microarray data
  – Suppose $X_{ij}$ is the normalized log-ratio of the two channel intensities for gene $i$ on array $j$; $j = 1, \ldots, n$ and $i = 1, \ldots, G$
  – Null hypothesis: $\mu_i = E(X_{ij}) = 0$ for each $i$
• Ranking genes by test statistics
• Deciding the cut-off value

Identify DE genes
• Ranking criteria
  – M-statistic: $\bar{X}_i$, the average of $X_{ij}$ for gene $i$ over the $n$ replicates (fold change)
  – Student t-statistic: $t_i = \bar{X}_i / v_i$, where $v_i$ is the estimated standard error of $\bar{X}_i$
  – SAM t-statistic (Tusher et al, 2001): $S_i = \bar{X}_i / (v_i + a_0)$, where the constant $a_0$ stabilizes the statistic for genes with very small $v_i$
  – B-statistic (Lonnstedt et al, 2001): empirical Bayes statistic
  – James-Stein estimator for the standard deviation (Cui et al, 2005)
  – Some non-parametric approaches

Multiple Testing and FDR
• Controlling the family-wise error rate (FWER)
  – Bonferroni correction
  – Hochberg FWER procedure
  – Other corrections
  – Controlling the FWER is too conservative for microarray analysis, where thousands of genes are tested at once
• False discovery rate (Benjamini and Hochberg 1995, Xie 2005): $\mathrm{FDR}(d) = FP(d) / TP(d)$, where, at cut-off $d$, $FP(d)$ is the number of false positives and $TP(d)$ is the total number of genes declared positive (not the number of true positives)

SAM plot

Microarray Data Analysis 3. Pathway/gene set enrichment analysis

Pathway Analysis

Gene Set Enrichment Analysis

Microarray Data Analysis 4. Constructing gene network

Construct gene network

Microarray Data Analysis 5. Prediction and classification

Leukemia Diagnosis
[Figure: samples labeled -1/+1, with class labels {y_i}, i = 1:m]
Golub et al, Science Vol 286:15 Oct.
1999

MDACC tumor sample clustering
[Figure: samples grouped into Cluster 1, Cluster 2, and Cluster 3]

Kaplan-Meier plots for two clusters

Predicting Drug Response

Background
• Lung cancer is the leading cause of death from cancer among both men and women in the United States
• Median survival time for non-small cell lung cancer: 8 months
• Cancer patients respond differently to chemotherapy because of the complexity and uniqueness of each tumor's genetic profile
• Personalized medicine: match the right therapeutic regimen with the right individual

MTS Drug Sensitivity Assay
• Day 0: plate cells (1,000 – 4,000 cells/well)
• Day 1: add drug
• ...
• Day 5: add MTS and read plate
• Plate layout: wells with and without cells, with and without drug
• Octuplicate measurements, one per row
• 96-well plate assays are repeated at least 3 times
• Dehydrogenase enzymes found in metabolically active cells catalyze the formation of the formazan product, which is measured by absorbance at 490 nm

Lung Cancer Cell Lines Show Different In Vitro Drug Sensitivity
• HCC1171: IC50 = 127 μM
• HCC827: IC50 = 0.04 μM
• Difference: > 1000-fold
• IC50: drug concentration causing 50% growth inhibition

In vitro drug sensitivity
[Figure: IC50 (log scale, 0.01 to 1000) for Vinorelbine, Pemetrexed, Peloruside A, Paclitaxel, Irinotecan, Gemcitabine, Gefitinib, Etoposide (VP-16), Erlotinib, Docetaxel, Cisplatin, and Carboplatin]
In vitro sensitivity to 12 therapeutic drugs was determined for 45 lung cancer cell lines.
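As an illustration of the IC50 definition above (the drug concentration causing 50% growth inhibition), the sketch below interpolates an IC50 from a dose-response curve on the log-concentration scale. The function name `estimate_ic50` and the example measurements are hypothetical, not taken from the slides, and real analyses typically fit a sigmoidal curve rather than interpolating between two points.

```python
import math

def estimate_ic50(concentrations, viability):
    """Interpolate the concentration at which viability crosses 0.5.

    concentrations: increasing, positive drug concentrations (one unit throughout)
    viability: growth relative to the no-drug control (1.0 = no inhibition)
    Returns None if the curve never crosses 50% in the tested range.
    """
    points = list(zip(concentrations, viability))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 0.5 >= v_hi and v_lo != v_hi:
            # Linear interpolation in log10(concentration) between the two doses.
            frac = (v_lo - 0.5) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None

# Hypothetical dose-response measurements (made up for illustration).
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
growth = [0.98, 0.90, 0.70, 0.30, 0.05]
ic50 = estimate_ic50(doses, growth)

# Fold difference between the two cell lines quoted on the slide.
fold = 127 / 0.04
print(ic50, fold)
```

Interpolating on the log scale matches the way IC50 values are plotted in the figure, where concentrations span several orders of magnitude.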
Drug coverage

Drug Selection
[Figure: random vs. optimal drug selection; H2887 labeled]

Prediction Methods
• Filtering genes
• Clustering genes
• Principal components of the cluster
• Classification/regression tree method to predict the drug sensitivity
• Leave-one-out cross-validation

Prediction Results
              Accuracy  Sensitivity  Specificity  NPR   PPR
Cisplatin     0.84      0.86         0.84         0.50  0.97
Gefitinib     0.80      0.50         0.85         0.33  0.92
Paclitaxel    0.84      0.89         0.70         0.91  0.64
Vinorelbine   0.86      0.92         0.50         0.92  0.50
EGFR          0.93      0.63         0.94         0.92  0.71
Leave-one-out cross-validation results for supervised classification of drug sensitivity or of EGFR status using mRNA expression profiles. NPR: negative predictive rate, PPR: positive predictive rate. Models were trained on the extreme cases (<0.2 and >0.8) and tested on all of the cell line data. For EGFR, tumor cell lines were divided into those with EGFR TK domain mutations and those with wild-type EGFR.

Drug Selection
[Figure: random vs. predicted vs. optimal drug selection; p = 0.0008]

Ovarian Cancer Example

Example of Over-fitting and Good Fitting
[Figure: an over-fitted curve next to a well-fitted curve]
An over-fitted function does not generalize well to unseen data.

Over-fitting
• The training data contains information about the regularities in the mapping from input to output. But it also contains noise:
  – The target values may be unreliable.
  – There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error. So it fits both kinds of regularity.
If the model is very flexible it can model the sampling error really well. This is a disaster.

Feature Validation
[Figure: data matrix of M samples by N genes/features, with the samples split into subsets of size m1, m2, and m3]
Split the data into 3 sets: training, validation, and test set.
1) For each feature subset, train the predictor on the training data.
2) Select the feature subset that performs best on the validation data. Repeat and average if you want to reduce variance (cross-validation).
3) Test on the test data.

Feature Validation
• Divide the total dataset into three subsets:
  – Training data is used for learning the parameters of the model.
  – Validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best.
  – Test data is used to get a final, unbiased estimate of how well the model works. We expect this estimate to be worse than the one on the validation data.
• We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
• Leave-one-out validation: each sample is held out once while all the remaining data are used as the training set.
• Independent testing data is the best way to test the prediction model!
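The three-step procedure above (train on the training set, pick the feature subset on the validation set, report once on the test set) and leave-one-out validation can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the slides: the data are synthetic, the nearest-centroid classifier is a stand-in for whatever predictor is actually used, and all function names and candidate feature subsets are hypothetical.

```python
import random

def fit_centroids(X, y, features):
    """Nearest-centroid 'training': per-class mean on the chosen features."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    return centroids

def predict(centroids, x, features):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    def sq_dist(label):
        return sum((x[f] - c) ** 2 for f, c in zip(features, centroids[label]))
    return min(centroids, key=sq_dist)

def accuracy(X_fit, y_fit, X_eval, y_eval, features):
    """Train on (X_fit, y_fit); return the fraction correct on (X_eval, y_eval)."""
    centroids = fit_centroids(X_fit, y_fit, features)
    hits = sum(predict(centroids, x, features) == lab for x, lab in zip(X_eval, y_eval))
    return hits / len(y_eval)

def loocv_accuracy(X, y, features):
    """Leave-one-out: hold out each sample once, train on all the others."""
    hits = 0
    for i in range(len(y)):
        centroids = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:], features)
        hits += predict(centroids, X[i], features) == y[i]
    return hits / len(y)

def pick(X, y, idx):
    """Select the samples whose positions are listed in idx."""
    return [X[i] for i in idx], [y[i] for i in idx]

# Synthetic data (an assumption for illustration): 90 samples, 5 features,
# and only feature 0 separates the two classes.
rng = random.Random(0)
y = [i % 2 for i in range(90)]
X = [[rng.gauss(4.0 * lab if j == 0 else 0.0, 1.0) for j in range(5)] for lab in y]

# Split the samples into training, validation, and test sets.
order = list(range(90))
rng.shuffle(order)
train, valid, test = order[:30], order[30:60], order[60:]
X_tr, y_tr = pick(X, y, train)
X_va, y_va = pick(X, y, valid)
X_te, y_te = pick(X, y, test)

# 1) Train on the training set for each candidate feature subset;
# 2) keep the subset that scores best on the validation set;
candidates = [[0], [1], [2, 3], [0, 4]]
best_subset = max(candidates, key=lambda f: accuracy(X_tr, y_tr, X_va, y_va, f))
# 3) report performance once on the untouched test set.
test_acc = accuracy(X_tr, y_tr, X_te, y_te, best_subset)
# Leave-one-out accuracy for the chosen subset (optimistic here, because the
# subset was selected using some of the same samples).
loo_acc = loocv_accuracy(X, y, best_subset)
print(best_subset, round(test_acc, 2), round(loo_acc, 2))
```

The test set is touched exactly once, after the feature subset has been fixed, which is what keeps its estimate unbiased. Selecting features on all samples and then running leave-one-out on the same samples would leak information into the estimate.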
A real analysis example 1: developing a prognostic signature for non-small cell lung cancer (NSCLC)

Hierarchical Clustering of the Robust Gene Set (RGS)
[Figure: clustering of samples into Group 1 and Group 2]

Unsupervised clustering groups are associated with survival
[Figure: survival curves for Group 1 and Group 2]

Gene set enrichment analysis
• ER-negative signature (Nature 2002): enriched in Group 1
• ER-positive signature (Nature 2002): enriched in Group 2

Gene Set Enrichment Analysis
Enriched in Group 1 (worse prognosis group)

Gene Set Enrichment Analysis
Enriched in Group 2 (better survival group)

FFPE training to frozen sample testing prediction results
(442 NSCLCs from Shedden et al, Nat Med, 2008)

FFPE training to frozen sample testing prediction results

FFPE to frozen samples prediction results within each stage

A real analysis example 2: Construct gene network in NSCLC

Construct gene network in NSCLC
[Figure: (A) gene network with nodes including SARG, NKX2-1, HOP, SFTPB, MBIP, MLF1IP, TTC37, and PRC1; (B) prediction on MDACC data, p = 0.00025, n = 209; (C) prediction on Tomida et al data, p = 0.023, n = 117]

References
• Bioinformatics course at MD Anderson: http://bioinformatics.mdanderson.org/MicroarrayCourse/index.html
• Terry Speed's class homepages: http://www.stat.berkeley.edu/users/terry/Classes/index.html
• Iowa State bioinformatics course: http://www.public.iastate.edu/%7Ednett/microarray/microarray.shtml
• Dov Stekel, Microarray Bioinformatics
• Richard Simon, et al. Design and analysis of DNA microarray investigations
• Robert
Gentleman et al. Bioinformatics and computational biology solutions using R and Bioconductor

Microarray vs. mRNA-Seq
Mortazavi et al, Nat Methods 2008

Microarray vs. mRNA-Seq
Slide from Wing Wong, Stanford

Reproducibility of RNA-Seq
Mortazavi et al, Nat Methods 2008

Microarray vs. mRNA-Seq
• Sequencing assays provide digital measures of sequence abundance, i.e., read counts. In contrast, microarrays provide analog measures of sequence abundance, i.e., fluorescence intensities.
• Microarrays depend on the design of the chips
  – Annotation problems
  – Aligning probes across platforms
  – Hard to deal with alternative splicing
  – Cannot identify new transcripts
• mRNA-Seq
  – Measures transcriptome composition
  – Relatively easy to deal with alternative splicing
  – Can discover new exons or genes