RNAseq data analysis in gene expression study.
Tuesday 8th February 2011
Alexander Kanapin
The aim of this practical session is to use R [1] and BioConductor packages [2] to
process and analyse RNAseq data from a gene expression study of different
immune cell types, namely B-cells and monocytes. We have a total of 8 samples, 4
from B-cells and 4 from monocytes.
Follow the workflow by copying and pasting the commands indicated in Monaco
font text into an R session, with the working directory set to the location of the data
files. Note that anything after a # sign is a comment and not part of a command.
Try to understand what each command is doing by reading the comments and
using the help function in R – e.g. to find information on a particular function type
help(function_name).
Part 1: Data loading and preparation
If you do not have the DESeq package [3] installed, install it now:
source("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")
Load the required packages for this session:
library(DESeq)
Set the working directory:
Misc -> Change working directory
Select the directory where you downloaded the example files.
The raw number of reads falling into each gene was calculated with the HTSeq
package using Ensembl37 data as the reference. The data file “raw_counts.txt”
contains 9 tab-delimited columns:
1: Ensembl gene ID
2 – 5: B-cell samples
6 – 9: monocyte samples
First, we load the data into R using the read.delim function and assign the gene
IDs (column “gene”) as row names:
countsTable <- read.delim("raw_counts.txt", header=TRUE, stringsAsFactors=TRUE)
rownames( countsTable ) <- countsTable$gene
countsTable <- countsTable[ , -1 ]
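Before going further it can help to confirm that the table has the expected shape. The following is a minimal sketch on a made-up stand-in for raw_counts.txt (the gene IDs, sample names and counts are invented for illustration and are not the course data); it repeats the loading steps above and then runs a few sanity checks.

```r
# Toy stand-in for raw_counts.txt: a "gene" column followed by 8 count columns.
# All names and values below are hypothetical.
toy <- data.frame(
  gene = c("ENSG_A", "ENSG_B", "ENSG_C"),
  matrix(c( 5,  7,  6,  4, 50, 61, 48, 55,
            0,  1,  0,  2, 10, 12,  9, 11,
           30, 28, 35, 31, 29, 33, 27, 30),
         nrow = 3, byrow = TRUE,
         dimnames = list(NULL, c(paste0("B", 1:4), paste0("M", 1:4))))
)
toy_file <- tempfile(fileext = ".txt")
write.table(toy, toy_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Same loading steps as in the practical
counts <- read.delim(toy_file, header = TRUE, stringsAsFactors = FALSE)
rownames(counts) <- counts$gene   # gene IDs become row names
counts <- counts[ , -1 ]          # drop the ID column, leaving counts only

# Sanity checks
stopifnot(ncol(counts) == 8)                    # 4 B-cell + 4 monocyte samples
stopifnot(all(sapply(counts, is.numeric)))      # only numeric count columns remain
stopifnot(!any(duplicated(rownames(counts))))   # gene IDs are unique
```

The same three checks can be run on countsTable once the real file has been loaded.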
The next step is to create a conditions vector that assigns each column to a cell
type, “B” for B-cells and “M” for monocytes:
conds <- c(rep("B",4), rep("M",4))
Then we create the main count data set object (a CountDataSet) using the function
newCountDataSet:
cds <- newCountDataSet( countsTable, conds )
Next, we estimate the size factors that are used to normalize the read counts:
cds <- estimateSizeFactors(cds)
We may now check the normalization (size) factor of each sample by typing:
sizeFactors(cds)
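To get a feeling for what a size factor is, here is a minimal base-R sketch of the median-of-ratios idea described in [3]: each sample is compared with a per-gene geometric mean across samples, and the median ratio is taken as that sample's factor. The toy matrix is invented, and this is an illustration of the principle rather than the package's own code.

```r
# Toy counts: sample s2 is sequenced twice as deeply as s1, s3 three times, etc.
counts <- matrix(c( 10,  20,  30,  40,
                   100, 200, 300, 400,
                     5,  10,  15,  20),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3", "s4")))

# Per-gene geometric mean across samples, kept on the log scale
log_geo_means <- rowMeans(log(counts))

# Size factor = median over genes of (count / geometric mean);
# genes whose geometric mean is zero (log = -Inf) are skipped
size_factors <- apply(counts, 2, function(col) {
  keep <- is.finite(log_geo_means)
  exp(median(log(col[keep]) - log_geo_means[keep]))
})

# Dividing each column by its size factor puts samples on a common scale
normalized <- sweep(counts, 2, size_factors, "/")
print(size_factors)
print(normalized)
```

In this toy example every row of normalized is constant, because the samples differ only in sequencing depth.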
Finally, we estimate variance functions for the dataset:
cds <- estimateVarianceFunctions(cds)
Part 2: Dataset statistics and differential expression analysis
A general overview of the variance in the data may be obtained using the function scvPlot:
scvPlot(cds)
The produced plot shows the estimated variance functions of the given
CountDataSet, in the form of the squared coefficient of variation (SCV), i.e., the
variance divided by the squared mean. The solid lines are the raw SCV estimates,
one per condition. The dashed lines are the full variance estimates for each sample,
i.e., the vertical distance between a dashed line and its corresponding solid line (of
the same colour) is the shot noise. As the x axis is scaled as base mean (size-adjusted
mean), the amount of shot noise depends on the size factor. The solid
black line is a density estimate of the base means. Only where the density of
counts is sufficient can a good estimate be expected.
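The relation between the full SCV, the shot noise and the raw SCV can be worked through by hand for a single gene. The sketch below uses four invented counts and assumes that all size factors equal 1, so that counts and base-scale values coincide; the shot noise is taken to be Poisson, for which the variance equals the mean and the SCV is therefore 1/mean.

```r
# Four hypothetical replicate counts for one gene (size factors assumed to be 1)
counts <- c(50, 150, 100, 100)

m <- mean(counts)            # base mean
v <- var(counts)             # sample variance

scv_total <- v / m^2         # full SCV: variance divided by squared mean
scv_shot  <- 1 / m           # Poisson shot noise: var = mean, so SCV = 1 / mean
scv_raw   <- scv_total - scv_shot   # what remains after removing shot noise

print(c(total = scv_total, shot = scv_shot, raw = scv_raw))
```

Note how the shot-noise term shrinks as the mean grows, which is why the dashed and solid lines in the plot converge for highly expressed genes.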
We may also estimate the similarity between samples using a VST (variance
stabilizing transformation). The function getVarianceStabilizedData produces
homoscedastic (homogeneous variance) normalized count data. This is useful as
input to statistical analyses that require homoscedasticity. By plotting the VST data
in a heatmap we may judge how homogeneous our samples are.
First, we create a matrix with VST data:
vsd <- getVarianceStabilizedData(cds)
Then, distances between the samples are calculated and plotted as a heatmap:
dists <- dist( t( vsd ) )
heatmap( as.matrix( dists ), symm=TRUE, cexCol=0.7, cexRow=0.7 )
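The heatmap is driven by a hierarchical clustering of the sample-to-sample distances. The following sketch shows that step in isolation on an invented matrix; log2(x + 1) is used here only as a simple stand-in for the variance stabilizing transformation, and the sample names are hypothetical.

```r
# Two made-up expression profiles, one per cell type
b_profile <- c(100, 10, 50, 5)
m_profile <- c(10, 100, 5, 50)

toy <- cbind(B1 = b_profile, B2 = b_profile * 1.1,
             M1 = m_profile, M2 = m_profile * 0.9)
rownames(toy) <- paste0("gene", 1:4)

# Stand-in for the VST (assumption: NOT what getVarianceStabilizedData does)
log_toy <- log2(toy + 1)

# dist() works on rows, so transpose to get one row per sample
dists <- dist(t(log_toy))

# The same clustering that heatmap() uses to order rows and columns
clusters <- cutree(hclust(dists), k = 2)
print(round(as.matrix(dists), 2))
print(clusters)
```

If the experiment worked, samples of the same cell type should fall into the same cluster, as B1/B2 and M1/M2 do here.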
Now we find the genes that are differentially expressed between the two
cell types using the negative binomial test:
res <- nbinomTest(cds, "B", "M")
The new data frame, res, is created with the following fields:
id
The ID of the observable, taken from the row names of the counts slots.
baseMean
The base mean (i.e., mean of the counts divided by the size factors)
for the counts for both conditions
baseMeanA
The base mean (i.e., mean of the counts divided by the size factors) for the counts for
condition A
baseMeanB
The base mean for condition B
foldChange
The ratio meanB/meanA
log2FoldChange
The log2 of the fold change
pval
The p value for rejecting the null hypothesis 'meanA==meanB'
padj
The adjusted p values (adjusted with 'p.adjust( pval, method="BH")')
resVarA
The ratio of the row-wise estimate of the base variance of the counts
for condition A, divided by the value predicted with the base variance function
from the base mean. If this number is very high, the hit seems to be a variance
outlier and might be false.
resVarB
The same for condition B.
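How some of these fields relate to one another can be checked on a small invented table. The sketch below derives foldChange, log2FoldChange and padj from made-up base means and p values; the overall baseMean is computed as the average of the two condition means, which assumes equal numbers of samples per condition (as in this practical).

```r
# Hypothetical per-gene base means and raw p values
toy_res <- data.frame(
  id        = c("g1", "g2", "g3", "g4"),
  baseMeanA = c(100, 20, 8, 50),
  baseMeanB = c( 25, 80, 8, 55),
  pval      = c(0.01, 0.02, 0.5, 0.03)
)

# Assumption: equal group sizes, so the overall mean is the average of A and B
toy_res$baseMean       <- (toy_res$baseMeanA + toy_res$baseMeanB) / 2
toy_res$foldChange     <- toy_res$baseMeanB / toy_res$baseMeanA   # meanB / meanA
toy_res$log2FoldChange <- log2(toy_res$foldChange)

# Benjamini-Hochberg adjustment, as quoted for the padj field
toy_res$padj <- p.adjust(toy_res$pval, method = "BH")

print(toy_res)
```

A negative log2FoldChange therefore means higher expression in condition A (here “B”, B-cells), a positive one higher expression in condition B (here “M”, monocytes).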
Now we draw an MA plot to inspect expression values and fold changes. We also
set a threshold of 0.0001 for the adjusted p-value (padj field in res) to get a
visual impression of the scale of the differential expression:
plot( res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.1, col =
ifelse( res$padj < .0001, "red", "black" ) )
The red dots represent genes with padj < 0.0001.
Finally, we filter out the genes with padj > 0.0001 and create the subset of the
results for differentially expressed ones:
sig <- res[ res$padj < .0001, ]
sig <- sig[ !is.na(sig$pval), ]   # drop the empty rows produced by NA values of padj
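The second filtering line is needed because a logical index containing NA returns a row of NAs instead of dropping the row. The sketch below demonstrates this on an invented four-gene table and shows which() as one alternative that avoids the problem in a single step.

```r
# Hypothetical results with one missing adjusted p value
toy_res <- data.frame(
  id   = c("g1", "g2", "g3", "g4"),
  padj = c(1e-6, NA, 0.5, 1e-5)
)

# Plain logical indexing: the NA comparison yields an all-NA row
naive <- toy_res[ toy_res$padj < 1e-4, ]

# which() keeps only the positions where the condition is TRUE
safe <- toy_res[ which(toy_res$padj < 1e-4), ]

print(naive)   # three rows, one of them entirely NA
print(safe)    # two rows: g1 and g4
```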
The data are now exported to a tab-delimited file, which may be opened in
Excel:
write.table(sig, file = "diff_expression.txt", append = FALSE, quote =
TRUE, sep = "\t", eol = "\n", na = "NA", dec = ".", row.names = FALSE,
col.names = TRUE, qmethod = c("escape", "double"))
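A quick way to confirm that the export worked is to read the file back into R. The sketch below does this round trip on an invented two-gene table written to a temporary file; with the real data, the file name "diff_expression.txt" would be used instead.

```r
# Hypothetical stand-in for the sig data frame
toy_sig <- data.frame(id = c("g1", "g4"), padj = c(1e-6, 1e-5),
                      stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".txt")
write.table(toy_sig, file = out_file, quote = TRUE, sep = "\t",
            row.names = FALSE, col.names = TRUE)

# read.delim expects tab-separated input and handles the quoted strings
back <- read.delim(out_file, header = TRUE, stringsAsFactors = FALSE)

print(back)
```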
Part 3: Differential expression visualization
We use the Integrative Genomics Viewer (IGV) developed by the Broad Institute
to visualize expression data. The viewer displays short reads along genomic
sequences and also produces a raw expression profile.
If the viewer is not installed, download it from the IGV site (registration
required):
http://www.broadinstitute.org/software/igv/
Start the viewer and select human genome version 19 from the list in the top-left
corner.
Load two .bam files: File -> Load from file…
Select the file b_75_chr5.bam (B-cell data)
Then load the second one, m_75_chr5.bam (monocyte data)
Load the differential expression data from Part 2 into MS Excel:
File-> New workbook
Data-> Get external data -> Import text file…
Select the file diff_expression.txt
Follow the steps and select “Delimited” and “Tab” options during the import;
everything else is default setting.
Sort the file by column padj (adjusted p-value) in ascending order.
The first gene in the list is ENSG00000038427, human versican proteoglycan.
Enter its coordinates in the box next to the chromosome ID in IGV and push the
“GO” button next to it:
chr5:82,785,065-82,787,144
You may now see the aligned read pileups for the two experiments and the
differential expression of the gene between the cell types.
Zoom in and out using the “+” and “-” buttons in the top right corner of the viewer, and move
along the chromosome by clicking and dragging the pileups to see various regions.
Part 4: Functional annotation analysis with DAVID
The Database for Annotation, Visualization and Integrated Discovery (DAVID)
is a powerful resource for functional annotation analysis [4].
We are going to use it to check if there are any important functional categories
describing the differentially expressed genes we detected.
Open the DAVID webpage
http://david.abcc.ncifcrf.gov
Select the “Start Analysis” button.
On the left side of the page select the “Upload” button.
In the Excel workbook (Part 3) select the first 100 gene IDs from the first column and
copy them to the clipboard.
Paste the list into the “Step 1” box on the DAVID page.
Step 2: Select Identifier – ENSEMBL_GENE_ID – from the list
Step 3: List type: select Gene list
Step 4: Push “Submit list”
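The list of IDs pasted in Step 1 could also be prepared directly in R instead of Excel. The following is a minimal sketch on an invented table: it sorts by padj in ascending order, takes up to the first 100 IDs and writes them one per line to a temporary file. The table and file are hypothetical; with the real data, sig from Part 2 would take the place of toy_sig.

```r
# Hypothetical stand-in for the sig data frame from Part 2
toy_sig <- data.frame(
  id   = paste0("gene", 1:5),
  padj = c(1e-5, 1e-9, 1e-7, 1e-6, 1e-8),
  stringsAsFactors = FALSE
)

# Sort by adjusted p value, smallest (most significant) first
ranked <- toy_sig[ order(toy_sig$padj), ]

# Take the first 100 IDs (or fewer if the table is shorter)
top_ids <- head(ranked$id, 100)

# One ID per line, ready to paste into a gene list box
ids_file <- tempfile(fileext = ".txt")
writeLines(as.character(top_ids), ids_file)

print(top_ids)
```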
The “Analysis wizard” page now opens.
Select the “Functional annotation clustering” link on the page.
You may now explore the different categories on the page.
For a general overview push the “Functional Annotation Clustering” button.
A new page opens, listing clusters of common terms from different databases
that are overrepresented in the gene annotations.
Cluster 1 includes such terms as “defense response”, “response to wounding”,
“inflammatory response” from Gene Ontology.
Cluster 2 contains immunoglobulin domains.
References
[1] R Development Core Team, R: A language and environment for statistical
computing, R Foundation for Statistical Computing, Vienna Austria, 2007
(http://www.R-project.org).
[2] Gentleman, R.C. et al., Bioconductor: open software development for
computational biology and bioinformatics, Genome Biol. 5 p. R80. (2004).
[3] Anders S. and Huber W., Differential expression analysis for sequence count
data, Genome Biol. 11 p. R106 (2010).
[4] Huang D.W., Sherman B.T. and Lempicki R.A., Systematic and integrative analysis
of large gene lists using DAVID Bioinformatics Resources, Nat Protoc. 4(1):44-57 (2009).