RNA_Seq Worksheet - BioQUEST Curriculum Consortium

Exploring the Human Transcriptome
“RNA-Seq alignment with intron-split short reads. Graphical
representation of the alignment of mRNA sequence obtained
via high throughput sequencing, and the expected behaviour
of the alignment to the reference genome when the read falls
in an exon-exon junction.” Source: Wikipedia (Author: Rgocs)
(URL: http://en.wikipedia.org/wiki/File:RNA-Seqalignment.png)
Learning Objectives
After completion of this module, the student will be able to
• download RNA-seq expression count data from the web into Excel
• apply different normalizations to RNA-seq expression count data
• calculate fold changes
• begin to ask questions about gene expression using RNA-seq data
Concepts
• gene expression
• differential expression
• fold change
Knowledge and Skills
• downloading data from the web
• Excel functions
Prerequisites
• basic familiarity with Excel
  o copy and paste
  o basic arithmetic operations in Excel
  o sorting
Citation: Neuhauser, C. Exploring the Human Transcriptome. Created: June 01, 2014. Revisions:
Copyright: © 2014 Neuhauser. This is an open-access article distributed under the terms of the Creative
Commons Attribution Non-Commercial Share Alike License, which permits unrestricted use, distribution, and
reproduction in any medium, and allows others to translate, make remixes, and produce new stories based on
this work, provided the original author and source are credited and the new work will carry the same license.
“A fundamental question in molecular
biology is how cells and tissues differ in
gene expression and how those
differences specify biological function.”
(Ramsköld et al. 2009)
What Makes a Heart a Heart?
The transcriptome is the set of all RNA produced in a cell or population of cells. It varies over time and
with environmental conditions. There are many types of RNA: about 80-85% of RNA is ribosomal RNA
(rRNA), which serves as the catalytic component of ribosomes; transfer RNA (tRNA) makes up about
15%; and messenger RNA (mRNA) another 5%. The mRNA transcripts reflect which genes are actively
expressed. There are a number of technologies that can measure gene expression. The first such
technique was the northern blot technique, developed in 1977. In the early 1980s, microarray
technology was developed, which dominated gene expression analysis throughout the 1990s into the
first decade of the 21st century. In 2008, a new technology began to dominate genomics:
next-generation sequencing technology allowed for massively parallel sequencing of genomic material.
The next-generation sequencing technology for RNA is called RNA-seq. It was introduced in May 2008 in
a series of five papers. In 2010, RNA-seq was applied to single cells to study the transcriptional profiles
during early development. Software has been developed, for instance, to align the reads to a reference
genome, assemble the transcripts de novo, or study differential expression. The ENCyclopedia Of DNA
Elements (ENCODE) Consortium utilizes RNA-seq to characterize the human transcriptome. A search on
Web of Science shows the rapid increase of publications using this new technology (Fig. 1).
Figure 1: Number of published items between 2008 and May 2014 indexed within Web of Science Core Collection using the key
word “rna-seq.”
RNA-Seq Module
This module serves as an introduction to RNA-seq count data. These data are obtained by aligning reads
to a reference genome and counting the number of reads that align to a gene. Illumina is one of the
companies that developed the technology and workflows for RNA-seq data, and the document available
through the following link provides a brief overview:
http://res.illumina.com/documents/products/datasheets/datasheet_rnaseq_analysis.pdf
A visual history of RNA-seq can be found at Seven Bridges Genomics:
http://blog.sbgenomics.com/history-of-rna-seq/
We will use a subset of a data set from Wang et al. (2008) to gain a better understanding of how gene
expression defines tissue types in the heart and the liver. RNA-seq data is typically analyzed using
sophisticated software packages, such as Bioconductor (written in the programming language R) or
Galaxy (a web-based platform). The learning curve for these packages is steep, and they are typically
difficult to use in undergraduate courses whose focus is often on the biological interpretation of genomic
information and not on acquiring the technical skills for analysis of genomic data. Yet, it is desirable to
expose students early on to the new types of data that are becoming ubiquitous across the health and
life sciences. This module uses Excel to explore gene expression data. While Excel has clear limitations, it
has the advantage that students will use it in other courses as well and so will either already be familiar
with it or will find it useful to become familiar with it. Once students gain familiarity with the type of
data, it will be easier for them to learn other more complex analysis tools.
References
Bioconductor: Open source software for bioinformatics. http://www.bioconductor.org/ (Accessed June
1, 2014)
Galaxy: Open source, web-based platform for data intensive biomedical research. https://usegalaxy.org/
(Accessed June 1, 2014)
Ramsköld, Daniel, et al. "An abundance of ubiquitously expressed genes revealed by tissue
transcriptome sequence data." PLoS Computational Biology 5.12 (2009): e1000598.
Wang, Eric T., et al. "Alternative isoform regulation in human tissue transcriptomes." Nature 456.7221
(2008): 470-476.
Downloading Data from the Web
ReCount is an online resource for RNA-seq count data sets. The data is processed so that raw count data
are available. The website is maintained by the Johns Hopkins Bloomberg School of Public Health:
http://bowtie-bio.sourceforge.net/recount/
Clicking on the Count Table link in the dataset table opens the data set in the browser. Open a new Excel
spreadsheet to hold the data, then return to the browser page where the data is displayed. Complete the
following steps:
• Click on the browser page and use Ctrl-a to highlight the entire text.
• Use Ctrl-c to copy the highlighted text.
• Go to the spreadsheet, click on Cell A1, and use Ctrl-v to paste the copied text into the
  spreadsheet.
• Click on the Data tab in your spreadsheet and click on Text to Columns in the ribbon under Data
  Tools. The Convert Text to Columns Wizard will guide you through the next steps.
• Your original data are separated by spaces. Click on Delimited to choose the original data type,
  and click Next.
• Click Space in the Delimiters box. You should see how the data will be displayed in the data
  preview. If it looks correct, click Finish.
• Save your file.
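For readers who later move beyond Excel, the Text to Columns step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table excerpt, the sample names (heart, liver), and the counts are made up, and the helper name read_count_table is hypothetical. It is not part of the worksheet or of ReCount.

```python
# Hypothetical excerpt of a space-delimited count table. The gene IDs follow
# the Ensembl pattern, but the sample names and counts are invented.
raw_text = """gene heart liver
ENSG00000000003 12 340
ENSG00000000005 0 0
ENSG00000000419 85 97
"""


def read_count_table(text):
    """Split whitespace-delimited text into a header row and gene -> counts."""
    # str.split() with no argument splits on any run of spaces or tabs,
    # which mirrors choosing "Space" as the delimiter in Text to Columns.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    counts = {row[0]: [int(value) for value in row[1:]] for row in body}
    return header, counts


header, counts = read_count_table(raw_text)
print(header)
print(counts["ENSG00000000003"])
```

The first column is kept as the dictionary key so that each gene ID stays attached to its counts, much as each spreadsheet row keeps the ID next to the read counts.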
The data for this module are available in two files:
• RNAseqWang2008BioQUEST2014.xlsx
  o Includes a tab with the raw data (8,426 KB)
• RNAseqWang2008BioQUEST2014SMALL.xlsx
  o Does not include a tab with the raw data (4,382 KB)
Exercise 1
The Heart tab has entries in three columns: Column A labels each gene with an arbitrary
identifier from 1 to 52,580; Column B is the gene ID that Ensembl uses; and Column C has
the read counts from the RNA-seq experiment for the heart tissue of the individual SRX003929.
To familiarize yourself with the data, search in your browser for ENSG00000000003 (the entry in Cell B3)
and go to the Ensembl website for this gene ID. You will see the following:
The gene is called TSPAN6, and is located on chromosome X between 99,883,667 and 99,894,988. To
determine the length, take the difference and add 1:
length = 99,894,988 - 99,883,667 + 1 = 11,322
You will also see that this gene has three transcripts (splice variants), one of which is protein coding. If
you click on the name TSPAN6 in the Summary, you will be sent to the HUGO Gene Nomenclature
Committee (HGNC) page for TSPAN6.
You can explore this gene in more depth or search for tetraspanin 6 to find out more about the gene and
its function.
Exercise 2
We will use the data set in the Heart tab to get a better sense for the distribution of read counts.
(a) Sort the data by read count, from largest to smallest, and determine the percentage of genes with a
positive read count.
(b) Use a scatter plot to graph the sorted, positive read counts. Transform both axes logarithmically.
Label the axes of your graph and paste it below. Describe what you see.
(c) Use the COUNTIF function in Excel to count the number of genes whose reads are between 1 and 9;
10 and 99; 100 and 999; 1,000 and 9,999; 10,000 and 99,999; and 100,000 or greater. The syntax
for this function is '=COUNTIF(range, "operator" & cell)'. That is, enclose the operator in double
quotation marks, and use an ampersand before the cell reference. For instance, to count the
number of genes whose reads are 10 or higher, you would enter 10 in Cell G10 and then type in Cell
H10: '=COUNTIF($C$3:$C$8620,">="&G10)'. Note that the range of the read counts between Cell
C3 and Cell C8620 is a fixed reference, and thus $-signs are required before the column and the row
numbers. The table below lists the minimum number of reads in each bin, that is, the number 10
refers to the bin of genes with between 10 and 99 reads. Because COUNTIF with ">=" counts every
gene at or above the minimum, use subtraction to find the number of genes in each bin.
# Reads # Genes
100000
10000
1000
100
10
1
0
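The counting-and-subtracting logic of Exercise 2 can be sketched in Python as follows. This is a minimal illustration with made-up read counts, not the worksheet's data. The list of bin minimums matches the table above, and the variable names are hypothetical.

```python
# Hypothetical read counts, one per gene (not taken from the Heart tab).
read_counts = [0, 0, 3, 7, 12, 48, 150, 999, 1000, 25000, 120000, 0]

# Part (a): percentage of genes with a positive read count.
positive = [c for c in read_counts if c > 0]
percent_positive = 100 * len(positive) / len(read_counts)

# Part (c): for each bin minimum, count the genes at or above it.
# This is what COUNTIF(range, ">=" & cell) returns for that minimum.
bin_minimums = [100000, 10000, 1000, 100, 10, 1, 0]
at_least = [sum(1 for c in read_counts if c >= m) for m in bin_minimums]

# Subtract neighbouring cumulative counts to get the genes inside each bin.
genes_per_bin = {}
previous = 0
for minimum, cumulative in zip(bin_minimums, at_least):
    genes_per_bin[minimum] = cumulative - previous
    previous = cumulative

print(percent_positive)
print(genes_per_bin)
```

Working from the largest minimum downward means each subtraction removes exactly the genes already assigned to a higher bin, so the bin counts add up to the total number of genes.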
Exercise 3
If two genes are equally expressed but of different length, we would expect the longer gene to have a
larger number of reads. Since the lengths of genes vary quite a bit, it is therefore important to normalize
the read counts by the length of the gene. The Heart Length tab has read counts for the heart tissue
together with the length of each gene (Ensembl gene ID). We only included those genes from the raw
data where we know the length.
A commonly used quantity is the number of reads (or fragments) per kilobase of exon per million reads
(or fragments) mapped (RPKM). Here is how to calculate this quantity: If N is the total number of mapped
fragments, q is the number of fragments that were mapped to a specific gene, and L is the number of
base pairs in the exon (length), then
\[ \mathrm{RPKM} = \frac{q}{\left(\frac{L}{10^{3}}\right)\left(\frac{N}{10^{6}}\right)} \]
(a) Calculate RPKM for each gene in the Heart Length tab. Then sort the table by RPKM from largest to
smallest.
(b) Plot RPKM as a function of length and describe the resulting graph.
(c) Go to the Expression Atlas (http://www.ebi.ac.uk/gxa/home) to confirm that the top genes are
indeed strongly expressed in the heart. (Enter the gene ID in the Search box on the top right of the
website.) Explore other genes to get a better feel for RPKM.
Gene ID    RPKM    Comments
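The RPKM calculation in Exercise 3 can be sketched in Python as shown below. The gene names, read counts, lengths, and the total N are all hypothetical values chosen so the arithmetic is easy to follow. They are not from the Heart Length tab.

```python
def rpkm(q, length_bp, total_mapped):
    """Reads per kilobase per million mapped reads: q / ((L / 10^3) * (N / 10^6))."""
    return q / ((length_bp / 1e3) * (total_mapped / 1e6))


# Hypothetical genes: name -> (read count q, length L in base pairs).
genes = {
    "geneA": (500, 2000),
    "geneB": (500, 1000),
    "geneC": (50, 500),
}

# Assumed total number of mapped reads N for the sample.
total_mapped = 2_000_000

rpkm_values = {name: rpkm(q, length, total_mapped)
               for name, (q, length) in genes.items()}

# Sort from largest to smallest RPKM, as in part (a).
ranked = sorted(rpkm_values.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(name, value)
```

In this made-up example, geneA and geneB have the same read count, but geneB is half as long and therefore has twice the RPKM. This is the length effect that the normalization is meant to remove.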
Exercise 4
The Heart Liver tab has RNA-seq read counts for two tissue types, the heart and the liver. We will use
this data set to learn about differential expression.
(a) How many genes are expressed in both the heart and the liver, and how many are expressed in one
tissue but not the other?
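One way to tally the categories in Exercise 4 is sketched below in Python. It assumes that "expressed" means a read count greater than zero, which is a simplifying assumption for illustration. The gene names and counts are hypothetical.

```python
# Hypothetical read counts: gene -> (heart count, liver count).
counts = {
    "g1": (10, 0),
    "g2": (5, 8),
    "g3": (0, 0),
    "g4": (0, 3),
    "g5": (7, 2),
}

# Assumption: a gene counts as expressed in a tissue if its read count is > 0.
both = sum(1 for heart, liver in counts.values() if heart > 0 and liver > 0)
heart_only = sum(1 for heart, liver in counts.values() if heart > 0 and liver == 0)
liver_only = sum(1 for heart, liver in counts.values() if heart == 0 and liver > 0)
neither = sum(1 for heart, liver in counts.values() if heart == 0 and liver == 0)

print(both, heart_only, liver_only, neither)
```

The four categories are mutually exclusive, so their counts should add up to the total number of genes. This is a useful check on the spreadsheet version as well.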
Exercise 5
The Heart Liver Length tab has an additional column (Column C) with the length of each gene. We will
compare the relative importance of each gene. First, we will need to normalize the data.
(a) Determine the total number of reads N for each tissue.
(b) Calculate for each gene how much it contributes to the transcript pool. We call this fraction the
relative abundance. To calculate the relative abundance, we let q_i be the number of reads for gene i,
L_i the length of gene i, and N the total number of reads for a given sample. The equation for the
relative abundance of reads for gene i is then as follows:
\[ \text{relative abundance}_i = \frac{q_i}{N L_i} \cdot \frac{1}{\sum_k \frac{q_k}{N L_k}} \]
(c) Explain the equation for the relative abundance in words. How is this quantity similar to or
different from RPKM?
(d) Visually explore the relative abundance of transcripts in each tissue and compare across tissues.
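The relative abundance in Exercise 5 can be sketched in Python as follows, reading the equation as each gene's length-normalized read rate divided by the sum of those rates over all genes. The gene names, read counts, and lengths are hypothetical, and the function name relative_abundance is chosen only for this illustration.

```python
def relative_abundance(reads, lengths):
    """Return gene -> (q_i / (N * L_i)) / sum_k (q_k / (N * L_k))."""
    n_total = sum(reads.values())  # N: total number of reads in the sample
    rates = {gene: reads[gene] / (n_total * lengths[gene]) for gene in reads}
    normalizer = sum(rates.values())
    return {gene: rate / normalizer for gene, rate in rates.items()}


# Hypothetical read counts and gene lengths (base pairs) for one tissue.
reads = {"geneA": 100, "geneB": 100, "geneC": 200}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1000}

abundance = relative_abundance(reads, lengths)
for gene, value in abundance.items():
    print(gene, round(value, 4))
```

Because every rate is divided by the same sum, the relative abundances add up to 1 within a sample, which makes them directly comparable across tissues with different total read counts.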
Exercise 6
To compare the expression of gene i across tissue types, we compute the ratio of their respective RPKM
values. We call this ratio the fold change:
\[ \text{fold change}_i = \frac{\mathrm{RPKM}_{i,1}}{\mathrm{RPKM}_{i,2}} \]
where the second subscript (1 or 2) denotes the tissue type.
(a) Calculate the fold change for each gene that is expressed in both tissues.
(b) Since the fold change varies across many orders of magnitude, the fold change is often expressed
as a log base 2 fold change. The log base 2 fold change is calculated as
\[ \text{log base 2 fold change} = \log_2(\text{fold change}) \]
(c) Explore the meaning of the log base 2 fold change: What does a value of 1 mean? What does a value
of -1 mean? What does a value of 3 mean? What does a value of -3 mean?
(d) Calculate the log base 2 fold change for each gene that is expressed in both tissues.
(e) Graph the log fold change as a function of RPKM for each tissue type.
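The fold change and log base 2 fold change from Exercise 6 can be sketched in Python as below. The RPKM values are hypothetical, and returning None for genes that are not expressed in both tissues is a design choice made for this illustration, since the ratio and its logarithm are not defined when either value is zero.

```python
import math


def log2_fold_change(rpkm_1, rpkm_2):
    """Return log2(rpkm_1 / rpkm_2), or None if either value is not positive."""
    if rpkm_1 <= 0 or rpkm_2 <= 0:
        return None  # gene is not expressed in both tissues
    return math.log2(rpkm_1 / rpkm_2)


# Hypothetical RPKM values: gene -> (tissue 1, tissue 2).
rpkm_pairs = {
    "geneA": (8.0, 1.0),
    "geneB": (1.0, 2.0),
    "geneC": (5.0, 5.0),
    "geneD": (0.0, 3.0),
}

log2_fc = {gene: log2_fold_change(t1, t2) for gene, (t1, t2) in rpkm_pairs.items()}
for gene, value in log2_fc.items():
    print(gene, value)
```

On the log scale, a gene with equal expression in both tissues sits at 0, and changes in either direction are symmetric around it.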