Exploring the Human Transcriptome

“RNA-Seq alignment with intron-split short reads. Graphical representation of the alignment of mRNA sequence obtained via high throughput sequencing, and the expected behaviour of the alignment to the reference genome when the read falls in an exon-exon junction.” Source: Wikipedia (Author: Rgocs) (URL: http://en.wikipedia.org/wiki/File:RNA-Seqalignment.png)

Learning Objectives

After completion of this module, the student will be able to
• download RNA-seq expression count data from the web into Excel
• apply different normalizations to RNA-seq expression count data
• calculate fold changes
• begin to ask questions about gene expression using RNA-seq data

Concepts
• gene expression
• differential expression
• fold change

Knowledge and Skills
• downloading data from the web
• Excel functions

Prerequisites
• basic familiarity with Excel
  o copy and paste
  o basic arithmetic operations in Excel
  o sorting

Citation: Neuhauser, C. Exploring the Human Transcriptome
Created: June 01, 2014
Revisions:
Copyright: © 2014 Neuhauser. This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial Share Alike License, which permits unrestricted use, distribution, and reproduction in any medium, and allows others to translate, make remixes, and produce new stories based on this work, provided the original author and source are credited and the new work will carry the same license.

“A fundamental question in molecular biology is how cells and tissues differ in gene expression and how those differences specify biological function.” (Ramsköld et al. 2009)

What Makes a Heart a Heart?

The transcriptome is the set of all RNA produced in a cell or population of cells. It varies over time and with environmental conditions. There are many types of RNA: about 80-85% of RNA is ribosomal RNA (rRNA), which serves as the catalytic component of ribosomes; transfer RNA (tRNA) makes up about 15%; and messenger RNA (mRNA) another 5%.
The mRNA transcripts reflect which genes are actively expressed. A number of technologies can measure gene expression. The first was the northern blot, developed in 1977. In the early 1980s, microarray technology was developed; it dominated gene expression analysis throughout the 1990s and into the first decade of the 21st century. In 2008, a new technology began to dominate genomics: next generation sequencing, which allows massively parallel sequencing of genomic material. Applied to RNA, next generation sequencing is called RNA-seq. It was introduced in May 2008 in a series of five papers. In 2010, RNA-seq was applied to single cells to study transcriptional profiles during early development. Software has been developed to, for instance, align reads to a reference genome, assemble transcripts de novo, or study differential expression. The ENCyclopedia Of DNA Elements (ENCODE) Consortium uses RNA-seq to characterize the human transcriptome. A search on Web of Science shows the rapid increase in publications using this new technology (Fig. 1).

Figure 1: Number of published items between 2008 and May 2014 indexed within Web of Science Core Collection using the key word “rna-seq.”

RNA-Seq Module

This module serves as an introduction to RNA-seq count data. These data are obtained by aligning reads to a reference genome and counting the number of reads that align to each gene.
Illumina is one of the companies that developed the technology and workflows for RNA-seq data, and the document available through the following link provides a brief overview: http://res.illumina.com/documents/products/datasheets/datasheet_rnaseq_analysis.pdf

A visual history of RNA-seq can be found at Seven Bridges Genomics: http://blog.sbgenomics.com/history-of-rna-seq/

We will use a subset of a data set from Wang et al. (2008) to gain a better understanding of how gene expression defines tissue types in the heart and the liver.

RNA-seq data is typically analyzed using sophisticated software, such as Bioconductor (written in the programming language R) or Galaxy (a web-based platform). The learning curve for these tools is steep, and they are typically difficult to use in undergraduate courses, whose focus is often on the biological interpretation of genomic information and not on acquiring the technical skills for analysis of genomic data. Yet it is desirable to expose students early on to the new types of data that are becoming ubiquitous across the health and life sciences.

This module uses Excel to explore gene expression data. While Excel has clear limitations, it has the advantage that students will use it in other courses as well, and so will either already be familiar with it or will find it useful to become familiar with it. Once students gain familiarity with this type of data, it will be easier for them to learn other, more complex analysis tools.

References

Bioconductor: Open source software for bioinformatics. http://www.bioconductor.org/ (Accessed June 1, 2014)
Galaxy: Open source, web-based platform for data intensive biomedical research. https://usegalaxy.org/ (Accessed June 1, 2014)
Ramsköld, Daniel, et al. "An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data." PLoS Computational Biology 5.12 (2009): e1000598.
Wang, Eric T., et al. "Alternative isoform regulation in human tissue transcriptomes."
Nature 456.7221 (2008): 470-476.

Downloading Data from the Web

ReCount is an online resource for RNA-seq count data sets. The data have been processed so that raw count data are available. The website is maintained by the Johns Hopkins Bloomberg School of Public Health: http://bowtie-bio.sourceforge.net/recount/

Clicking on the Count Table link in the dataset table opens the data set in the browser. Open a new Excel spreadsheet to receive the data, then return to the browser page where the data is displayed and complete the following steps:
• Click on the browser page and use Ctrl-a to highlight the entire text.
• Use Ctrl-c to copy the highlighted text.
• Go to the spreadsheet, click on Cell A1, and use Ctrl-v to paste the copied text into the spreadsheet.
• Click on the Data tab in your spreadsheet and click on Text to Columns in the ribbon under Data Tools. The Convert Text to Columns Wizard will guide you through the next steps.
• Your original data are separated by spaces. Click on Delimited to choose the original data type, and click Next.
• Click Space in the Delimiters box. You should see how the data will be displayed in the data preview. If it looks correct, click Finish.
• Save your file.
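For readers who later move beyond Excel, the Text to Columns step above amounts to splitting each line of the pasted text on spaces. The following is a minimal Python sketch of that idea, not part of the module's workflow. It assumes a header row followed by one row per gene (an ID, then whole-number counts); the column names, gene IDs, and counts are invented for illustration and are not taken from ReCount.

```python
# Hypothetical pasted text: a header row, then one row per gene.
# The gene IDs and counts are invented for illustration.
pasted_text = """gene heart liver
GENE_A 12 0
GENE_B 0 7
GENE_C 150 98
"""


def parse_count_table(text):
    """Split space-delimited text into a list of row dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()  # split on any run of whitespace
        row = {header[0]: fields[0]}  # first field is the gene ID
        for name, value in zip(header[1:], fields[1:]):
            row[name] = int(value)  # read counts are whole numbers
        rows.append(row)
    return rows


rows = parse_count_table(pasted_text)
for row in rows:
    print(row)
```

Each row becomes a dictionary keyed by column name, which plays the same role as one spreadsheet row after the wizard has split the text into columns.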
The data for this module are available in two files:
• RNAseqWang2008BioQUEST2014.xlsx
  o Includes a tab with the raw data (8,426 KB)
• RNAseqWang2008BioQUEST2014SMALL.xlsx
  o Does not include a tab with the raw data (4,382 KB)

Exercise 1

The Heart tab has entries in three columns: Column A assigns each gene an arbitrary identifier from 1 to 52,580; Column B is the gene ID that Ensembl uses; and Column C has the read counts from the RNA-seq experiment for the heart tissue of the individual SRX003929.

To familiarize yourself with the data, search in your browser for ENSG00000000003 (the entry in Cell B3) and go to the Ensembl website for this gene ID. You will see that the gene is called TSPAN6 and is located on chromosome X between positions 99,883,667 and 99,894,988. To determine the length, take the difference and add 1:

length = 99,894,988 - 99,883,667 + 1 = 11,322

You will also see that this gene has three transcripts (splice variants), one of which is protein coding. If you click on the name TSPAN6 in the Summary, you will be sent to the HUGO Gene Nomenclature Committee page for TSPAN6. You can explore this gene in more depth or search for tetraspanin 6 to find out more about the gene and its function.
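The length calculation in Exercise 1 (difference of the two coordinates, plus 1 because both end positions are included) can be sketched in Python as follows. The function name gene_length is hypothetical; the coordinates are the TSPAN6 values quoted in the exercise.

```python
def gene_length(start, end):
    """Length in base pairs of a region whose start and end are both included."""
    return end - start + 1


# TSPAN6 coordinates on chromosome X, as quoted in Exercise 1.
tspan6_length = gene_length(99_883_667, 99_894_988)
print(tspan6_length)  # 11322
```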
Exercise 2

We will use the data set in the Heart tab to get a better sense of the distribution of read counts.

(a) Sort the data by read count, from largest to smallest, and determine the percentage of genes with a positive read count.

(b) Use a scatter plot to graph the sorted, positive read counts. Transform both axes logarithmically. Label the axes of your graph and paste it below. Describe what you see.

(c) Use the COUNTIF function in Excel to count the number of genes whose read counts are between 1 and 9; 10 and 99; 100 and 999; 1,000 and 9,999; 10,000 and 99,999; and 100,000 or more. The syntax for this function is =COUNTIF(range, "operator"&cell). That is, enclose the operator in straight double quotation marks, and put an ampersand before the cell reference. For instance, to count the number of genes whose reads are 10 or higher, enter 10 in Cell G10 and then type in Cell H10: =COUNTIF($C$3:$C$8620,">="&G10). Note that the range of read counts between Cell C3 and Cell C8620 is a fixed reference, and thus $ signs are required before the column letters and the row numbers. The table below lists the minimum number of reads in each bin; for example, the row labeled 10 is the bin of genes with between 10 and 99 reads. Use subtraction to find the number of genes in each bin.

# Reads     # Genes
100000
10000
1000
100
10
1
0
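The counting-and-subtracting approach of Exercise 2(c) can also be sketched outside Excel. The Python example below is only an illustration: the read counts are invented and far fewer than in the Heart tab, and the variable names are hypothetical. It counts the genes at or above each threshold (the COUNTIF step) and then subtracts neighbouring counts to get the number of genes in each bin.

```python
# Invented read counts for illustration; not the module's data.
counts = [0, 0, 3, 9, 10, 57, 99, 100, 4200, 15000, 250000]

# Minimum number of reads in each bin, as in the table of Exercise 2(c).
thresholds = [100000, 10000, 1000, 100, 10, 1, 0]

# Same idea as =COUNTIF(range, ">="&cell): genes at or above each threshold.
at_least = {t: sum(1 for c in counts if c >= t) for t in thresholds}

# Subtract neighbouring cumulative counts to get the number in each bin.
genes_per_bin = {}
previous = 0
for t in thresholds:  # from the largest threshold down
    genes_per_bin[t] = at_least[t] - previous
    previous = at_least[t]

# Percentage of genes with a positive read count, as in Exercise 2(a).
percent_positive = 100 * at_least[1] / len(counts)

for t in thresholds:
    print(t, genes_per_bin[t])
print(round(percent_positive, 1))
```

The bin counts add up to the total number of genes, which is a quick check that no gene was counted twice or missed.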
Exercise 3

If two genes are equally expressed but of different length, we would expect the longer gene to have a larger number of reads. Since the lengths of genes vary quite a bit, it is important to normalize the read counts by the length of the gene. The Heart Length tab has read counts for the heart tissue together with the length of each gene (Ensembl gene ID). We only included those genes from the raw data for which we know the length.

A commonly used quantity is the number of reads (or fragments) per kilobase of exon per million reads (or fragments) mapped (RPKM). Here is how to calculate this quantity: If N is the total number of mapped fragments, q is the number of fragments that were mapped to a specific gene, and L is the number of base pairs in the exon (length), then

$$\mathrm{RPKM} = \frac{q}{\dfrac{L}{10^{3}} \cdot \dfrac{N}{10^{6}}}$$

(a) Calculate RPKM for each gene in the Heart Length tab. Then sort the table by RPKM from largest to smallest.

(b) Plot RPKM as a function of length and describe the resulting graph.

(c) Go to the Expression Atlas (http://www.ebi.ac.uk/gxa/home) to confirm that the top genes are indeed strongly expressed in the heart. (Enter the gene ID in the Search box on the top right of the website.) Explore other genes to get a better feel for RPKM.

Gene ID     RPKM     Comments
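The RPKM definition in Exercise 3 (reads q on a gene, divided by the gene length L in kilobases and by the total number of mapped reads N in millions) can be sketched in Python. This is an illustration under stated assumptions: the gene names, counts, and lengths are invented, and N is taken to be the sum of the toy counts, whereas in the module N is the total for the whole sample.

```python
def rpkm(q, length_bp, total_mapped):
    """RPKM for q reads on a gene of length_bp base pairs in a sample
    with total_mapped mapped reads."""
    return q / ((length_bp / 1e3) * (total_mapped / 1e6))


# Invented (read count, length in base pairs) pairs for illustration.
genes = {"GENE_A": (500, 2000), "GENE_B": (500, 4000), "GENE_C": (1000, 1000)}

# Assumption for this toy example: N is the sum of the listed counts.
total_mapped = sum(q for q, _ in genes.values())

values = {name: rpkm(q, length, total_mapped) for name, (q, length) in genes.items()}

# Sort from largest to smallest RPKM, as in Exercise 3(a).
ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(name, round(value))
```

GENE_A and GENE_B have the same read count, but GENE_B is twice as long, so its RPKM is half that of GENE_A. This is the length effect the normalization is meant to remove.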
Exercise 4

The Heart Liver tab has RNA-seq read counts for two tissue types, the heart and the liver. We will use this data set to learn about differential expression.

(a) How many genes are expressed in both the heart and the liver, and how many are expressed in one but not the other?

Exercise 5

The Heart Liver Length tab has an additional column (Column C) with the length of each gene. We will compare the relative importance of each gene. First, we will need to normalize the data.

(a) Determine the total number of reads N for each tissue.

(b) Calculate for each gene how much it contributes to the transcript pool. We call this fraction the relative abundance. To calculate the relative abundance, we let q_i be the number of reads for gene i, L_i the length of gene i, and N the total number of reads for a given sample. The equation for the relative abundance of reads for gene i is then as follows:

$$\text{relative abundance}_i = \frac{q_i}{N L_i} \cdot \frac{1}{\sum_{k} \dfrac{q_k}{N L_k}}$$

(c) Explain the equation for the relative abundance in words. How is this quantity similar to or different from RPKM?

(d) Visually explore the relative abundance of transcripts in each tissue and compare across tissues.
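The relative abundance of Exercise 5(b) can be sketched in Python as a two-step calculation: compute q/(N L) for each gene, then divide by the sum of these values over all genes. The gene names, read counts, and lengths below are invented for illustration, and the function name is hypothetical.

```python
def relative_abundance(reads, lengths):
    """Fraction of the transcript pool contributed by each gene.

    reads and lengths are dictionaries keyed by gene name.
    """
    n = sum(reads.values())  # total number of reads N in this sample
    per_gene = {g: reads[g] / (n * lengths[g]) for g in reads}
    total = sum(per_gene.values())
    return {g: per_gene[g] / total for g in reads}


# Invented read counts and lengths (base pairs) for one tissue.
reads = {"GENE_A": 100, "GENE_B": 100, "GENE_C": 200}
lengths = {"GENE_A": 1000, "GENE_B": 2000, "GENE_C": 1000}

abundance = relative_abundance(reads, lengths)
for gene, fraction in abundance.items():
    print(gene, round(fraction, 4))
print(round(sum(abundance.values()), 4))
```

Because each value is divided by the sum over all genes, the fractions for one sample add up to 1.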
Exercise 6

To compare the expression of gene i across tissue types, we compute the ratio of its RPKM values in the two tissues. We call this ratio the fold change:

$$\text{fold change} = \frac{\mathrm{RPKM}_{i,1}}{\mathrm{RPKM}_{i,2}}$$

where RPKM_{i,1} and RPKM_{i,2} are the RPKM values of gene i in tissue 1 and tissue 2.

(a) Calculate the fold change for each gene that is expressed in both tissues.

(b) Since the fold change varies across many orders of magnitude, it is often expressed as a log base 2 fold change. The log base 2 fold change is calculated as

$$\text{log base 2 fold change} = \log_2(\text{fold change})$$

(c) Explore the meaning of the log base 2 fold change: What does a value of 1 mean? What does a value of -1 mean? What does a value of 3 mean? What does a value of -3 mean?

(d) Calculate the log base 2 fold change for each gene that is expressed in both tissues.

(e) Graph the log fold change as a function of RPKM for each tissue type.
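The fold change and log base 2 fold change of Exercise 6 can be sketched in Python with the standard math module. The gene names and RPKM values below are invented for illustration. As in the exercise, the calculation applies only to genes expressed in both tissues, since a zero RPKM would make the ratio or its logarithm undefined.

```python
import math

# Invented (RPKM in tissue 1, RPKM in tissue 2) pairs; both values are positive.
pairs = {"GENE_A": (8.0, 2.0), "GENE_B": (2.0, 8.0), "GENE_C": (5.0, 5.0)}

results = {}
for gene, (rpkm_1, rpkm_2) in pairs.items():
    fold = rpkm_1 / rpkm_2  # fold change of tissue 1 relative to tissue 2
    results[gene] = (fold, math.log2(fold))  # log base 2 fold change

for gene, (fold, log_fold) in results.items():
    print(gene, fold, log_fold)
```

In this toy example, a fourfold higher value in tissue 1 gives a log base 2 fold change of 2, a fourfold lower value gives -2, and equal expression gives 0, so the log scale treats increases and decreases symmetrically.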