RNAseq data analysis in gene expression study. Tuesday 8th February 2011 Alexander Kanapin The aim of this practical session is to use R [1] and BioConductor packages [2] to process and analyse RNAseq data from a gene expression study of different immune cell types, namely B-cells and monocytes. We have a total of 8 samples, 4 from B-cells and 4 from monocytes. Follow the workflow by copying and pasting the commands indicated in Monaco font text into an R session, with the working directory set to the location of the data files. Note that anything after a # sign is a comment and not part of a command. Try to understand what each command is doing by reading the comments and using the help function in R – e.g. to find information on a particular function type help(function_name). Part 1: Data loading and preparation If you do not have DESeq package [3] installed, you may do that now: source("http://www.bioconductor.org/biocLite.R") biocLite("DESeq") Load the required packages for this session: library(DESeq) Set the working directory: Misc - > Change working directory Select the directory where you downloaded the example files. The raw number of reads falling into a given gene was calculated with HTseq package using Ensembl37 data as the reference. The data file “raw_counts.txt” contains 12 tab-delimited columns: 1: Ensembl gene ID 2 – 5: B-cells samples 6 – 9: monocyte samples First, we load the data into R using read.delim function and assign gene names to corresponding rows: countsTable <-read.delim("raw_counts.txt",header=TRUE,stringsAsFactors=TRUE) rownames( countsTable ) <- countsTable$gene countsTable <- countsTable[ , -1 ] The next step is to create conditions vector to attribute each column to a given cell type, “B” for B-cells and “M” for monocytes: conds <- c(rep("B",4), rep("M",4)) Then we create main dataframe for the count data set using function newCountDataSet: cds <- newCountDataSet( countsTable, conds ) And normalize the number of read counts: cds <- estimateSizeFactors(cds) We now may check normalization factors for each sample by typing: sizeFactors(cds) Finally, we estimate variance functions for the dataset: cds <-estimateVarianceFunctions(cds) Part 2: Dataset statistics and differential expression analysis A general overview of data variance may be estimated using function scvPlot: scvPlot(cds) The produced plot shows the estimated variance functions of the given CountDataSet, in the form of the squared coefficient of variation (SCV), i.e., the variance divided by the squared mean. The solid lines are the raw SCV estimates, one per condition. The dashed lines are the full variance estimates for each sample, i.e., the vertical distance between a dashed line and its corresponding solid line (of the same colour) is the shot noise. As the x axis is scaled as base mean (sizeadjusted mean), the amount of shot noise depends on the size factor. The solid black line is a density estimate of the base means. Only were a sufficient density of a count is present can a good estimate can be expected. We also may estimate a similarity between different samples using VST (variance stabilizing transformation). The function getVarianceStabilizedData produces homoscedastic (homogenous variance) normalized count data. This is useful as input to statistical analyses requiring homoskedasticity. By plotting the VST data in a heatmap we may estimate homogeneity of our samples: First, we create a matrix with VST data: vsd <- getVarianceStabilizedData(cds) Then, distances between different samples are calculate and plotted as a heatmap: dists <- dist( t( vsd ) ) heatmap( as.matrix( dists ), symm=TRUE, cexCol=0.7, cexRow=0.7 ) Now we find genes, which are differentially expressed between the two different cell types using negative binomial distribution test: res <- nbinomTest(cds, "B", "M") The new dataframe, res, is created with the following fields: The ID of the observable, taken from the row names of the counts slots. The base mean (i.e., mean of the counts divided by the size factors) for the counts for both conditions baseMeanA The base mean (i.e., mean of the counts divided by the size factors) for the counts for condition A baseMeanB The base mean for condition B foldChange The ratio meanB/meanA log2FoldChange The log2 of the fold change pval The p value for rejecting the null hypothesis 'meanA==meanB' padj The adjusted p values (adjusted with 'p.adjust( pval, method="BH")') resVarA The ratio of the row-wise estimate of the base variance of the counts for condition A, divided by the value predicted with the base variance function n from the base mean. If this number is very high, the hit seems to be a variance outlier and might be false. resVarB The same for condition B. id baseMean Now we plot MA diagram to estimate expression values and fold changes. Also we put a threshold for the adjusted p-value (padj field in res) as 0.0001 to estimate visually a scale of the differential expression: plot( res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.1, col = ifelse( res$padj < .0001, "red", "black" ) ) The red dots represent genes with padj < 0.0001. Finally, we filter out the genes with padj > 0.0001 and create the subset of the results for differentially expressed ones: sig <- res[ res$padj < .0001, ] sig <- sig[ is.na(sig$pval) != "TRUE", ] The data is now exported into tab-delimited format file and may be taken into Excel: write.table(sig, file = "diff_expression.txt", append = FALSE, quote = TRUE, sep = "\t", eol = "\n", na = "NA", dec = ".", row.names = FALSE, col.names = TRUE, qmethod = c("escape", "double")) Part 3: Differential expression visualization We use Integrated Genome Viewer (IGV) developed by the Broad Institute to visualize expression data. The viewer displays short reads along genomic sequences and also produces raw expression profile. If the viewer is not installed, download it from the IGV site (registration required): http://www.broadinstitute.org/software/igv/ Start the viewer and select human genome version 19 from the list in the top-left corner. Load two .bam files: File-> Load from file… Select file b_75_chr5.bam (B-cells data) Then load second one, m_75_chr5.bam (monocyte data) Load the differential expression data from the part 2 into MS Excel: File-> New workbook Data-> Get external data -> Import text file… Select the file diff_expression.dat Follow the steps and select “Delimited” and “Tab” options during the import; everything else is default setting. Sort the file by column padj (adjusted p-value) in ascending order. The first gene in the list is ENSG00000038427, human versican proteoglycan. Put its coordinates into the window next to chromosome ID on IGV and push “GO” button next to it: chr5:82,785,065-82,787,144 You now may see aligned reads pileups for the different experiments and differential expression of the gene in different cell types. Zoom in and out using “+” and “-“ buttons in top right corner of the viewer, move along the chromosome by clicking and dragging the pileups to see various regions. Part 4: Functional annotation analysis with NCBI DAVID The Database for Annotation, Visualization and Integrated Discovery (DAVID ) is a powerful resource for functional annotation analysis [4]. We are going to use it to check if there are any important functional categories describing the differentially expressed genes we detected. Open the DAVID webpage http://david.abcc.ncifcrf.gov Select “Start Analysis” button. On the left side of the page select “Upload” button In the excel workbook (Part 3) select first 100 gene IDs from the first column and copy them into clipboard. Paste the list into the window “Step 1” on the DAVID page Step 2: Select Identifier – ENSEMBL_GENE_ID – from the list Step 3: List type: select Gene list Step 4: Push “Submit list” The new page “Analysis wizard” opens now. Now select “Functional annotation clustering” link on the page. You may now explore different categories on the page. For the general overview push “Functional Annotation Clustering” button. The new page opens and it lists clusters of common terms from different databases, which are overrepresented in the genes annotations. Cluster 1 includes such terms as “defense response”, “response to wounding”, “inflammatory response” from Gene Ontology. Cluster 2 contains immunoglobulin domains. References [1] R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna Austria, 2007 (http://www.R-project.org). [2] Gentleman, R.C. et al., Bioconductor: open software development for computational biology and bioinformatics, Genome Biol. 5 p. R80. (2004). [3]Anders S. and Huber W. Differential expression analysis for sequence count data Genome Biol. 11 p. R106 (2010) [4]Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat Protoc. 4(1):44 -57. (2009)