Comparison of Statistical Methods for Differential Expression Analysis
Nathan Abshire, Eduardo Gomez, Hugh McCullough, Bowen Yang

Overview
• Background
• Methods
• Results
• Conclusions

Background
• A Comparison of Statistical Methods for Detecting Differentially Expressed Genes from RNA-Seq Data (Kvam et al. 2012)
• Methods compared in the paper: edgeR, DESeq, baySeq, and TSPM
• We decided to move forward with DESeq2
• Four simulations to test how well the different methods detect differentially expressed genes

Simulations
• Simulation 1 is fully simulated data, with read counts drawn from a Poisson distribution
• Simulations 2 and 3 are derived from the same biological data: laser-capture microdissected (LCM) samples of maize leaves
• Simulation 4 is obtained from 69 samples collected from lymphoblastoid cell lines (LCL) of unrelated Nigerian individuals
• Methods were compared using ROC curves, false positive rates (FPRs), and true positive rates (TPRs)
• baySeq performed the best, and TSPM performed the worst

Motivation and purpose
• Generate simulations as in Kvam et al. (2012)
• Perform a similar analysis
• Observe differences and similarities in results
• The comparison between DESeq and DESeq2 should be interesting

Methods: Data Collection
• NCBI’s Sequence Read Archive (SRA)
• Data for simulations 2 and 3 are from BioProject “PRJNA79627”
• Four SRA runs were downloaded from this project: SRR039509.3, SRR039510.3, SRR039512.3, SRR039514.3
• Transcriptome or Gene expression data collected from Zea mays subsp. mays [Taxonomy ID: 381124]

Methods: Data Collection
• Data for simulation 4 are from BioProject “PRJNA122271”
• 161 SRA runs from 69 different lymphoblastoid cell lines
• Following the methods of Kvam et al., and using the metadata from the original study (Pickrell et al. 2010), we selected the 81 sequencing runs performed at Yale University
• 12 samples received a duplicate sequencing run, so we randomly selected one of the two duplicates for our analysis (Kvam et al. were not particularly clear on their methods for selection)
• A total of 69 SRA runs for simulation 4
• All samples are Transcriptome or Gene expression data obtained from Homo sapiens lymphoblastoid cell lines [Taxonomy ID: 9606]

Methods: Mapping and Read Counting
• edgeR’s developers suggest Rsubread or Subread for obtaining read counts for each genomic feature
• We used Subread
• For read mapping, two genomes were downloaded and used: GCF_000005005 (Zea mays subsp. mays) and GCF_000001405 (Homo sapiens)
• Tools: subread-align, featureCounts
• 35 computational cores were used to shorten run time

Methods: Data Processing, Simulation, and Analysis
• For simulation 4 (lymphoblastoid data), feature count tables were read into R and filtered to the appropriate samples
• All genes with zero counts were removed
• Samples were randomly selected for three subsets with varying numbers of replicates: n=2, n=4, and n=5
• Genes with zero counts across the selected replicates were then removed
• To simulate a treatment effect:
• Twenty percent of the genes were selected
• Their counts were multiplied by e^x, where the exponent x is drawn for each gene from a mixture of normal distributions
• This yielded over-dispersed counts for certain genes
• The new values replaced the previous values in the count table
• This new treatment group was then added to the original count table as separate samples
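
The treatment-effect steps above can be illustrated with a minimal sketch. The original analysis was done in R; this Python version only shows the logic. The function name `simulate_treatment`, the two mixture components (centred at -1 and +1), and the standard deviation of 0.5 are assumptions made for illustration, not the values used in the study or in Kvam et al.

```python
import math
import random


def simulate_treatment(counts, frac_de=0.2, seed=0):
    """Build a simulated treatment group from a genes x samples count table.

    counts  : list of rows, one row of non-negative integer counts per gene
    frac_de : fraction of genes that receive a treatment effect
    Returns (combined, is_de): the control and treated samples side by side,
    and a per-gene flag marking which genes were altered.
    """
    rng = random.Random(seed)

    # Remove genes whose counts are zero in every sample.
    control = [list(row) for row in counts if sum(row) > 0]
    n_genes = len(control)

    # Pick the genes that will be differentially expressed.
    n_de = round(frac_de * n_genes)
    de_genes = set(rng.sample(range(n_genes), n_de))

    treated = []
    for g, row in enumerate(control):
        if g in de_genes:
            # Assumed two-component normal mixture for the exponent x:
            # each gene is shifted down (centre -1) or up (centre +1).
            centre = rng.choice([-1.0, 1.0])
            x = rng.gauss(centre, 0.5)
            # Multiply the counts by e^x and round back to integers.
            row = [round(c * math.exp(x)) for c in row]
        treated.append(list(row))

    # Append the treated samples to the control samples as new columns.
    combined = [c + t for c, t in zip(control, treated)]
    is_de = [g in de_genes for g in range(n_genes)]
    return combined, is_de
```

The returned `is_de` flags give the ground truth that the ROC, FPR, and TPR comparisons need: a method's calls can be scored against the genes that were actually altered.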