Comparison of Statistical
Methods for Differential
Expression Analysis
Nathan Abshire, Eduardo Gomez, Hugh McCullough, Bowen Yang
Overview
• Background
• Methods
• Results
• Conclusions
Background
• "A Comparison of Statistical Methods for Detecting Differentially
Expressed Genes from RNA-Seq Data" (Kvam et al., 2012)
• edgeR, DESeq, baySeq, and TSPM
• We decided to move forward with DESeq2
• Four different simulations to test how well the different methods
detect differentially expressed genes
Simulations
• 1st simulation is 100% simulated data, with read counts drawn from a
Poisson distribution
• 2nd and 3rd simulations are derived from the same biological data: LCM
(laser-capture microdissected) samples of maize leaves
• 4th simulation is derived from 69 samples collected from
lymphoblastoid cell lines (LCL) of unrelated Nigerian individuals
• ROC curves, FPRs, and TPRs
• baySeq performed the best, and TSPM performed the worst
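As a minimal sketch of how ROC curves, FPRs, and TPRs can be computed once the truly differentially expressed genes are known from a simulation: the function names, the toy scores, and the convention that a larger score means stronger evidence are all assumptions for illustration (for a method that reports p-values, something like 1 - p could serve as the score). This is not the evaluation code used in the presentation or by Kvam et al.

```python
def roc_points(scores, is_de):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    scores: larger means stronger evidence of differential expression (assumed)
    is_de:  True for genes simulated as differentially expressed
    Ties between scores are not treated specially in this sketch.
    """
    positives = sum(is_de)
    negatives = len(is_de) - positives
    # Rank genes from strongest to weakest evidence.
    ranked = sorted(zip(scores, is_de), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, truth in ranked:
        if truth:
            tp += 1  # a true DE gene is called: true positive
        else:
            fp += 1  # a non-DE gene is called: false positive
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example (made-up values): six genes, three truly differentially expressed.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_truth = [True, True, False, True, False, False]
toy_curve = roc_points(toy_scores, toy_truth)
toy_auc = auc(toy_curve)
```

Each point on the curve is the (FPR, TPR) pair obtained by calling the top-ranked genes differentially expressed, so methods can be compared by how quickly their curves rise.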
Motivation and purpose
• Generate simulations as in Kvam et al. (2012)
• Perform similar analysis
• Observe differences and similarities in results
• The comparison between DESeq and DESeq2 should be interesting
Methods
Data Collection
• NCBI’s Sequence Read Archive (SRA)
• Data for simulations 2 and 3 are from BioProject “PRJNA79627”.
• 4 SRA files were downloaded from this project: SRR039509.3,
SRR039510.3, SRR039512.3, and SRR039514.3
• Transcriptome or gene expression data collected from Zea mays
subsp. mays [Taxonomy ID: 381124]
Methods
Data Collection
• Data for simulation 4 are from BioProject “PRJNA122271”.
• 161 SRA runs from 69 different lymphoblastoid cell lines
• Following the methods of Kvam et al., and using the metadata from the original study (Pickrell et al.,
2010), we selected 81 sequencing runs performed at Yale University.
• 12 samples received a duplicate sequencing run, so we randomly selected one of the two
duplicates for our analysis (Kvam et al. were not particularly clear on their methods for selection)
• A total of 69 SRA runs for simulation 4
• All samples are transcriptome or gene expression data obtained from Homo sapiens
lymphoblastoid cell lines [Taxonomy ID: 9606]
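The duplicate-handling step above (81 runs, 12 samples sequenced twice, 81 - 12 = 69 runs kept) can be sketched as follows. This is only an illustration: the function name, the seed, and the sample and run identifiers are hypothetical placeholders, not the real metadata from the project.

```python
import random


def pick_one_run_per_sample(runs_by_sample, seed=0):
    """Keep exactly one sequencing run per sample.

    runs_by_sample: dict mapping a sample ID to the list of its run IDs.
    Samples with a single run keep it; samples with duplicates get one
    run chosen uniformly at random. The seed makes the choice repeatable.
    """
    rng = random.Random(seed)
    return {
        sample: rng.choice(sorted(runs))
        for sample, runs in runs_by_sample.items()
    }


# Hypothetical metadata: one sample with a duplicate run, one without.
example_runs = {
    "sample_A": ["run_1", "run_2"],
    "sample_B": ["run_3"],
}
chosen_runs = pick_one_run_per_sample(example_runs)
```

Fixing the seed is a design choice worth making here, since the selection otherwise cannot be reproduced, which is the same ambiguity noted for the original study.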
Methods
Mapping and Read counting
• edgeR’s developers suggested Rsubread or Subread for getting read counts
for each genomic feature
• We used Subread
• For read mapping, two reference genome assemblies were downloaded and used:
GCF_000005005 (Zea mays subsp. mays); GCF_000001405 (Homo sapiens)
• Tools: subread-align (mapping) and featureCounts (read counting)
• 35 computational cores were used to shorten run time
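A rough sketch of how the two Subread steps might be assembled is shown below. All file and index names are hypothetical placeholders, the index is assumed to have been built beforehand (for example with subread-buildindex), and the exact options used in the presentation are not stated, so the flags here are only a plausible minimal set. The thread count of 35 is taken from the slide; whether each tool version accepts that value is not checked here.

```python
def build_commands(index, reads, bam, annotation, counts, threads=35):
    """Return argument lists for the mapping and counting steps.

    Nothing is executed here; the lists could be passed to
    subprocess.run(cmd, check=True) on a machine with Subread installed.
    """
    align_cmd = [
        "subread-align",
        "-t", "0",            # 0 = RNA-seq data
        "-T", str(threads),   # number of threads
        "-i", index,          # base name of a pre-built index (assumed)
        "-r", reads,          # input reads
        "-o", bam,            # output alignments
    ]
    count_cmd = [
        "featureCounts",
        "-T", str(threads),   # number of threads
        "-a", annotation,     # gene annotation file
        "-o", counts,         # output count table
        bam,                  # alignments produced by the mapping step
    ]
    return align_cmd, count_cmd


# Hypothetical file names, for illustration only.
align_cmd, count_cmd = build_commands(
    index="maize_index",
    reads="SRR039509.fastq",
    bam="SRR039509.bam",
    annotation="maize.gtf",
    counts="maize_counts.txt",
)
```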
Methods
Data Processing, Simulation, and Analysis
• Simulation 4 (lymphoblastoid data): featureCounts tables were read into R and filtered to the appropriate samples
• All genes with zero counts were removed, and samples were randomly selected for three subsets with varying
numbers of replicates (n=2, n=4, and n=5). Genes with zero counts across the selected replicates were then removed
• To simulate a treatment effect:
  • Take twenty percent of the genes
  • Multiply their counts by e^x, where x is a value drawn for each gene from a mixed normal distribution
  • This yields over-dispersed counts for certain genes
  • Replace the previous values in the count table with the new ones
• This new treatment group was then added to the original count table as separate samples
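The treatment-effect steps above can be sketched as follows. The analysis in the presentation was done in R; this Python version only illustrates the logic. The mixture components (equal-weight normals with means +1 and -1 and standard deviation 0.5), the seed, the function name, and the toy counts are all assumptions, since the slide does not give the parameters of the mixed normal distribution.

```python
import math
import random


def simulate_treatment(counts, de_fraction=0.2, seed=1):
    """Append a simulated treatment group to a table of control counts.

    counts: dict mapping gene name -> list of control counts.
    Returns (table, de_genes), where each row of table holds the
    control counts followed by the simulated treatment counts.
    """
    rng = random.Random(seed)
    # Drop genes with zero counts in every replicate.
    kept = {gene: row for gene, row in counts.items() if sum(row) > 0}
    genes = sorted(kept)
    # Choose the fraction of genes that receive a treatment effect.
    n_de = round(de_fraction * len(genes))
    de_genes = set(rng.sample(genes, n_de))
    table = {}
    for gene in genes:
        control = kept[gene]
        if gene in de_genes:
            # ASSUMED mixture: N(+1, 0.5) or N(-1, 0.5) with equal weight.
            mean = 1.0 if rng.random() < 0.5 else -1.0
            x = rng.gauss(mean, 0.5)
            # Multiply the counts by e^x and round back to integers.
            treated = [round(c * math.exp(x)) for c in control]
        else:
            # Genes without an effect keep their original counts.
            treated = list(control)
        table[gene] = control + treated
    return table, de_genes


# Toy table (made-up counts): ten expressed genes plus one all-zero gene.
toy_counts = {f"gene_{i}": [10 * i + 5, 10 * i + 7] for i in range(10)}
toy_counts["gene_zero"] = [0, 0]
toy_table, toy_de = simulate_treatment(toy_counts)
```

Because the set of modified genes is returned alongside the table, the true status of every gene is known when the detection methods are later scored.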