Differential Expression Analysis Methods Comparison

Comparison of Statistical Methods for Differential Expression Analysis
Nathan Abshire, Eduardo Gomez, Hugh McCullough, Bowen Yang

Overview
• Background
• Methods
• Results
• Conclusions

Background
• A Comparison of Statistical Methods for Detecting Differentially Expressed Genes from RNA-Seq Data (Kvam et al., 2012)
• Methods compared: edgeR, DESeq, baySeq, and TSPM
• We decided to go with DESeq2
• Four different simulations to test how well each method detects differentially expressed genes

Simulations
• Simulation 1 is 100% simulated data, with read counts drawn from a Poisson distribution
• Simulations 2 and 3 are derived from the same biological data: LCM (laser-capture microdissected) samples of maize leaves
• Simulation 4 is derived from 69 samples collected from lymphoblastoid cell lines (LCLs) of unrelated Nigerian individuals
• Methods were evaluated with ROC curves, false positive rates (FPRs), and true positive rates (TPRs)
• baySeq performed the best, and TSPM performed the worst

Motivation and purpose
• Generate simulations as in Kvam et al. (2012)
• Perform a similar analysis
• Observe differences and similarities in the results
• The comparison between DESeq and DESeq2 should be interesting

Methods: Data Collection
• NCBI's Short Read Archive (SRA)
• Data for simulations 2 and 3 are from BioProject "PRJNA79627"
• 4 SRA files were downloaded from this project: SRR039509.3, SRR039510.3, SRR039512.3, SRR039514.3
• Transcriptome or gene expression data collected from Zea mays subsp. mays [Taxonomy ID: 381124]

Methods: Data Collection (continued)
• Data for simulation 4 are from BioProject "PRJNA122271"
• 161 SRA runs covering 69 different lymphoblastoid cell lines
• Following the methods of Kvam et al., and using the metadata from the original study (Pickrell et al. 2010), we selected the 81 sequencing runs performed at Yale University
• 12 samples received a duplicate sequencing run, so we randomly selected one of the two duplicates for our analysis (Kvam et al. were not particularly clear on their method of selection)
• A total of 69 SRA runs for simulation 4
• All samples are transcriptome or gene expression data obtained from Homo sapiens lymphoblastoid cell lines [Taxonomy ID: 9606]

Methods: Mapping and Read Counting
• edgeR's developers suggest Rsubread or Subread for obtaining read counts for each genomic feature
• We used Subread
• For read mapping, two genomes were downloaded and used: GCF_000005005 (Zea mays subsp. mays) and GCF_000001405 (Homo sapiens)
• Tools: subread-align and featureCounts
• Used 35 computational cores to shorten run time

Methods: Data Processing, Simulation, and Analysis
• For simulation 4 (lymphoblastoid data), the feature count tables were read into R and filtered to the appropriate samples
• Genes with zero counts were removed, and samples were randomly selected to form three subsets with varying numbers of replicates: n = 2, n = 4, and n = 5
• Genes with zero counts across the selected replicates were then removed
• To simulate a treatment effect:
  • Took twenty percent of the genes
  • Multiplied their counts by e raised to a per-gene exponent, with the exponent drawn from a mixture of normal distributions
  • This yielded over-dispersed counts for certain genes
  • The new values replaced the previous values in the count table
• This new treatment group was then added to the original count table as separate samples
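
The treatment-effect simulation described above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the code used for the analysis (which was done in R). The function name simulate_treatment, the two-component mixture with means -1 and 1 and standard deviation 0.5, and the rounding back to integer counts are all assumptions made for the example; the slides do not state the mixture parameters.

```python
import numpy as np


def simulate_treatment(counts, frac_de=0.2, mix_means=(-1.0, 1.0),
                       mix_sd=0.5, seed=0):
    """Append a simulated treatment group to a genes x samples count table.

    Returns (combined, de_mask): the filtered control columns followed by
    the simulated treatment columns, and a boolean mask marking the genes
    that received an effect. All default parameters are hypothetical.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)

    # Remove genes that have zero counts across all samples.
    counts = counts[counts.sum(axis=1) > 0]
    n_genes = counts.shape[0]

    # Pick twenty percent of the genes (by default) to be differentially expressed.
    n_de = int(round(frac_de * n_genes))
    de_idx = rng.choice(n_genes, size=n_de, replace=False)

    # Draw one exponent per selected gene from a mixture of normals:
    # first choose a mixture component, then sample from that component.
    component = rng.integers(0, len(mix_means), size=n_de)
    exponent = rng.normal(np.asarray(mix_means)[component], mix_sd)

    # Multiply the selected genes' counts by e**exponent, replacing the
    # previous values, and round back to integers (an assumption, since
    # count-based tools expect integer input).
    treated = counts.astype(float)
    treated[de_idx] *= np.exp(exponent)[:, None]
    treated = np.rint(treated).astype(np.int64)

    # Add the treatment group to the original table as separate samples.
    combined = np.hstack([counts, treated])
    de_mask = np.zeros(n_genes, dtype=bool)
    de_mask[de_idx] = True
    return combined, de_mask
```

Given a table with n control samples, the result has 2n columns: the first n are the unmodified controls and the last n are the simulated treatment group. The returned mask records which genes are truly differentially expressed, which is what makes it possible to compute TPRs, FPRs, and ROC curves for each method afterwards.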
