RNA-seq reads

advertisement
RNA-seq for Transcriptome profiling
and discovery of novel transcripts
and alternatively spliced variants
using HPC
Presented by:
Al Ritacco, Shailender Nagpal
Research Computing
UMASS Medical School
Information Services, 09/17/2012
Agenda
SESSION 1
– What is RNA-seq?
– Workflow for RNA-seq analysis
– Tools required
SESSION 2
– Download and perform QC on sample dataset
– Mapping and alignment
– Transcript expression and other “discoveries”
2
Information Services, 00/00/2010
Agenda
SESSION 1
– What is RNA-seq?
– Workflow for RNA-seq analysis
– Tools required
SESSION 2
– Download and perform QC on sample dataset
– Mapping and alignment
– Transcript expression and other “discoveries”
3
Information Services, 00/00/2010
What is “Next Generation Sequencing”?
• Set of new high throughput technologies
– allow millions of short DNA sequences from a biological
sample to be “read” or sequenced in a rapid manner
– Computational power is then used to assemble or align
the “reads” to a reference genome, allowing biologists to
make comparisons and interpret various biological
phenomena
• Due to high depth of coverage (30-100x), accurate
sequencing is obtained much faster and cheaper
compared to traditional Sanger/Shotgun sequencing
RNA-Seq experiment
• Reverse transcription of mRNAs yield double
stranded cDNAs, which are sliced to selected
fragment length
What is RNA-Seq?
• RNA-seq is a Next Generation Sequencing (NGS)
technology for sequencing total mRNA (“expressed”)
in biological samples of interest such as tissues,
tumors and cell lines
–
–
–
–
–
–
Provides deep coverage and base level resolution
Abundance of known transcripts
Novel transcripts
Alternative splicing events
Post-transcriptional mutations
Gene fusions
RNA-seq reads: Issue 1
ACTTAAGGCTGACTAGC
TCGTACCGATATGCTG
• Small, single-end reads are hard to align to a
reference genome - multiple possible mapping sites.
Longer reads can overcome this limitation
• Paired ends allow for longer fragments to be
sequenced, with a small read from each end of the
fragment. Distance between ends validates mapping
RNA-seq reads: Issue 2
• Reads from one exon are easily mapped
• Reads that span 2 exons are handled by
special software like TopHat
8
Information Services, 00/00/2010
DNA assembly versus alignment
• DNA assembly is the computational task of putting
together pieces of the genome in the original order
– Overlapping short sequences are extended to form
islands, that are subsequently extended and merged
– Used for denovo sequencing of a new genome
• DNA alignment is the computational task of mapping
short reads to a known, sequenced genome
– Reads and searched for in the full genome sequence, then
aligned with a local alignment algorithm
Agenda
SESSION 1
– What is RNA-seq?
– Workflow for RNA-seq analysis
– Tools required
SESSION 2
– Download and perform QC on sample dataset
– Mapping and alignment
– Transcript expression and other “discoveries”
10
Information Services, 00/00/2010
RNA-seq workflow summary
• Sample preparation and submission for experiment.
Obtain “read” file(s) as FASTQ
• Perform QC
• Perform read alignment and mapping to reference
genome
• Determine expression, novel transcripts and
alternative splicing
Step 1: Working with FASTQ Reads
• After a sample has been processed by an NGS
platform, DNA sequence “reads” are provided to the
user in FASTQ format
• FASTQ is a text-based format for storing both a DNA
sequence and its corresponding quality scores
– sequence letter and quality score are encoded with a
single ASCII character for brevity
– originally developed at the Wellcome Trust Sanger
Institute to bundle a FASTA sequence and its quality data
– recently become the de facto standard for storing the
output of high throughput sequencing instruments
Type of reads
• Single end reads
– refer to the sequence determined as DNA bases are
added to single stranded DNA and detected, usually
from one end only
• Paired end reads
– refer to the two ends of the same DNA molecule
– After sequencing one end, you can turn it around and
sequence the other end.
– Long segment of DNA in between the two ends (usually
200-500 bp), who’s sequence is unknown
– Once the two paired end reads are mapped, the
intermediate sequence can be inferred from reference
sequence
Quality Score
• A quality value Q is an integer mapping of p (i.e., the
probability that the corresponding base call is
incorrect)
• Two different equations have been in use. The first is
the standard Sanger variant to assess reliability of a
base call, otherwise known as Phred quality score:
– The Solexa pipeline earlier used a different mapping,
encoding the odds p/(1-p) instead of the probability p:
– Although both mappings are asymptotically identical at
higher quality values, they differ at lower quality levels
(i.e., approximately p > 0.05, or equivalently, Q < 13)
Quality Score Encoding
• Various platform produce versions of FASTQ format,
which mainly differ in the Quality score
representation
• Converters from various tools can convert Solexa to
Sanger FASTQ format
Step 2: Quality Control
• FASTQ, PRINSEQ or other custom tools can be used
to perform QC on FASTQ files
• Good tutorial for FASTQC on Youtube:
https://www.youtube.com/watch?v=bz93ReOv87Y
16
Information Services, 00/00/2010
Step 3: Alignment/Mapping to Reference
Genome
• Many tools exist that will map the reads to the
reference genome and align them, generating a
quality score per base of alignment
– BLAST-like algorithm to search the read in the genome
– Local alignment to determine the accuracy of the
alignment
• Many tools (SAMtools, MAQ, TopHat, CASSAVA, etc)
use the BWA, Bowtie and Eland algorithms.Result of
alignment is the SAM file
Alignment algorithms
• A category of aligners that hash the reads and
scan the genome for matches
– Eland, RMAP, MAQ, ZOOM, SeqMap, CloudBurst,
SHRiMP
• Disadvantages
– For few reads, whole genome must be scanned
– Memory footprint is variable
Alignment algorithms (…contd)
• Another category of aligners hash the reference
genome
– SOAPv1, PASS, MOM, ProbeMatch, NovoAlign, ReSeq,
Mosaik, Bfast
• Easily parallelized with multi-threading, but they
usually require large memory to build an index for
the human genome
• Disadvantage: iterative strategy frequently
introduced by these software may make their speed
sensitive to the sequencing error rate
Alignment algorithms (…contd)
• A third category which does alignment by
merge-sorting the reference subsequences
and read sequences
– Slider
• Burrows–Wheeler Transform (BWT) for string
matching has been incorporated into a new
generation of alignment tools
– SOAPv2, Bowtie and BWA
– These are memory efficient and suit well to single
and paired-end read alignments
Step 4: Transcript expression
• “Cufflinks”, “myRNA” and other software can
estimate the expression level/ abundance of
the transcripts
• Other discoveries possible – novel transcripts,
fusions, etc.
Agenda
SESSION 1
– What is RNA-seq?
– Workflow for RNA-seq analysis
– Tools required
SESSION 2
– Download and perform QC on sample dataset
– Mapping and alignment
– Transcript expression and other “discoveries”
22
Information Services, 00/00/2010
FASTQ manipulation tools
• Galaxy FASTQ tools
https://bitbucket.org/galaxy/galaxy-central/src/tip/tools/fastq
• FASTXtoolkit
http://hannonlab.cshl.edu/fastx_toolkit
QC software
• FASTqc
http://www.bioinformatics.bbsrc.ac.uk/projec
ts/fastqc/
• PRINSEQ
http://prinseq.sourceforge.net
24
Information Services, 00/00/2010
Alignment software
• Single and paired-end alignments can be
done using the BWA algorithm in the
following software
–
–
–
–
MAQ (http://maq.sourceforget.net)
BWA
Bowtie
TopHat (http://tophat.cbcb.umd.edu)
• Alignments are produced generally in the
widely accepted SAM format
Step 4: Transcript expression
• Cufflinks
• myRNA
Agenda
SESSION 1
– What is RNA-seq?
– Workflow for RNA-seq analysis
– Tools required
SESSION 2
– Download and perform QC on sample dataset
– Mapping and alignment
– Transcript expression and other “discoveries”
27
Information Services, 00/00/2010
Computing Hardware Requirements
• Two types of computing hardware are ideally suited
for NGS data analysis
– High-end workstation, for example: 64-bit linux, 3.6 GHz
quad-core processor, 32 GB RAM, 7200 rpm hard disk
– High Performance Computing cluster (HPC) where total
execution time can be sped up – for example, split
reads into small files and align them in parallel on
dozens of nodes
• For this workshop, we will use HPC
RNA-seq datasets
• NCBI’s Short Read Archive is a good source of datasets.
Data can be download from:
http://www.ncbi.nlm.nih.gov/sra
– Look for HapMap, 1000 genomes, cancer samples – all types of
experiment types and platforms
– Combine keywords in search: “Illumina” and “Paired” and
“RNA-seq” and “HapMap”
Obtain data from SRA
• Dataset to be used:
– Prostate cancer
http://www.ncbi.nlm.nih.gov/sra/SRX022065
– Tumor-matched Normal
http://www.ncbi.nlm.nih.gov/sra/SRX022083
• These are two paired-end Solexa datasets with reads
split by those belong to the forward or end of a
sequence fragment
Setting up the data for analysis
• Create directory for this dataset
cd /home/username
mkdir rna-seq
cd rna-seq
31
Information Services, 00/00/2010
Load RNA-seq tools
• Load TopHat, Bowtie, Cufflinks and Samtools
module load tophat-1.2.0
module load cufflinks-2.0.2
module load bowtie-0.12.7
module load samtools-0.1.18
module load sratoolkit.2.1.9
Download Genome Annotation (GTF) file
• The genome annotation file contains exon/intron
co-ordinates of reference genome
Download the dataset
• Download the Paired-end reads for the following
library
wget ftp://ftptrace.ncbi.nlm.nih.gov/sra/srainstant/reads/ByRun/sra/SRR/SRR057/SRR05765
2/SRR057652.sra
wget ftp://ftptrace.ncbi.nlm.nih.gov/sra/srainstant/reads/ByRun/sra/SRR/SRR057/SRR05763
4/SRR057634.sra
SRA tools to extract FASTQ files
• Use "sratools" to convert SRA format to FASTQ
fastq-dump -A SRR057634 --split-3
SRR057634.sra
fastq-dump -A SRR057652 --split-3
SRR057652.sra
35
Information Services, 00/00/2010
Perform QC using FASTQC
• Load FASTQC module
module load fastqc-1.0
• Perform QC
fastqc –t 8 SRR057634_1.fastq
fastqc –t 8 SRR057634_2.fastq
• This creates 2 directories with output in HTML
format for visual inspection in a browser, key
statistics and tests are in text file
• The zip files generated can be deleted
Agenda
SESSION 1
– What is RNA-seq?
– Workflow for RNA-seq analysis
– Tools required
SESSION 2
– Download and perform QC on sample dataset
– Mapping and alignment
– Transcript expression and other “discoveries”
37
Information Services, 00/00/2010
TopHat-Cufflinks
38
Information Services, 00/00/2010
Alignment and Mapping
•
For RNA-seq, the reads will either map fully to
exons or partially to exon-intron junctions, resulting
in rejected reads for the latter case
o Should not use Bowtie or BWA directly for mapping
•
against a reference genome
o one of the goals is to identify novel transcripts, so we
should not use transcriptome as reference
TopHat is ideally suited for the job, uses Bowtie for
read alignment
o needs the reference genome and it's annotations as an
input. Annotations are optional
Running TopHat for read alignment
•
To run TopHat, execute the following command:
tophat --num-threads 8 \
--solexa-quals --max-multihits 10 \
--coverage-search --microexon-search \
--mate-inner-dist 150 \
-o tophat-tumor-out \
--keep-tmp \
–G /usr/public_data/ucsc/genomes/hg19/hg19.gtf \
hg19 \
SRR057634_1.fastq \
SRR057634_2.fastq
Working with TopHat output
• This produces the “tophat-tumor-out” folder with the
following files:
accepted_hits.bam
junctions.bed
insertions.bed
deletions.bed
Agenda
SESSION 1
– What is RNA-seq?
– Workflow for RNA-seq analysis
– Tools required
SESSION 2
– Download and perform QC on sample dataset
– Mapping and alignment
– Transcript expression and other “discoveries”
42
Information Services, 00/00/2010
Reporting quantitative expression:
FPKM/RPKM
• In NGS RNA-seq experiments, quantitative gene
expression data is normalized for total
gene/transcript length and the number of
sequencing reads, and reported as
– RPKM: Reads Per Kilobase of exon per Million mapped
reads. Used for reporting data based on single-end reads
– FPKM: Fragments Per Kilobase of exon per Million
fragments. Used for reporting data based on paired-end
fragments
Cufflinks to estimate expression
1. Quantify reference genes and transcripts only
cufflinks -p 8 -G hg19.gtf -o cuff-tumor
accepted_hits.bam
2. Quantify novel genes & transcripts use hg19 as
"guide”
cufflinks -p 8 -g hg19.gtf -o cuff-tumor
accepted_hits.bam
3. Quantify novel genes & transcripts, "unguided"
cufflinks -p 8 -o cuff-tumor
accepted_hits.bam
Cufflinks (…contd)
• This creates the following files
transcripts.gtf
isoforms.fpkm_tracking
genes.fpkm_tracking
skipped.gtf
(Generated annotation)
(Transcript expression)
(Gene expression)
(Skipped annotations)
• Depending on the mode in previous step, these files can
have vague identifiers (CUFF*) for gene names if known
gene annotations are not used. We have to compare
with reference annotations to uncover which genes they
are
– "Cuffcompare" allows us to do that
Compare assembled transcripts to
reference
# Run “cuffcompare” for reference-guided assembly
cd cuff2
cuffcompare -r ../../hg19.gtf –R -V
transcripts.gtf
# This produces the following files in the “cuffcom-out”
directory with the following files
cuffcmp.transcripts.gtf.tmap,
cuffcmp.transcripts.gtf.refmap,
cuffcmp.tracking, cuffcmp.stats,
cuffcmp.loci, cuffcmp.combined.gtf
Cuffcompare output
• cuffcmp.stats
– Reports statistics related to the "accuracy" of the
transcripts when compared to the reference annotation
data.
– Gene finding measures of “sensitivity” and “specificity” are
calculated at various levels (nucleotide, exon, intron,
transcript, gene)
• cuffcmp.combined.gtf
– Reports a GTF file containing the "union" of all transfrags in
each sample
Cuffcompare output
• cuffcmp.loci
• cuffcmp.tracking
– Each row contains a transcript structure that is present in
one or more input GTF files
Column
number
Column name
Example
Description
1
Cufflinks
transfrag id
TCONS_00000045
A unique internal id for the transfrag
2
Cufflinks locus id
XLOC_000023
A unique internal id for the locus
3
Reference gene id Tcea
The gene_name attribute of the reference GTF record for this transcript, or '-'
if no reference transcript overlaps this Cufflinks transcript
4
Reference
transcript id
uc007afj.1
The transcript_id attribute of the reference GTF record for this transcript, or
'-' if no reference transcript overlaps this Cufflinks transcript
5
Class code
c
The type of match between the Cufflinks transcripts in column 6 and the
reference transcript. See class codes
Cuffcompare output
• cuffcmp.transcripts.gtf.refmap
– For each input GTF file, it lists the reference
transcripts, one row per reference transcript for
cufflinks transcripts that either fully or partially match it
Column
number
Column
name
Example
Description
1
Reference
gene name
uc007crl.1
The gene_name attribute of the reference GTF record for this
transcript, if present. Otherwise gene_id is used.
2
Reference uc007crl.1
transcript id
The transcript_id attribute of the reference GTF record for this
transcript
3
Class code
c
The type of match between the Cufflinks transcripts in column 4 and
the reference transcript. One of either 'c' for partial match, or '=' for
full match.
4
Cufflinks
matches
CUFF.23567.0,
CUFF.24689.0
A comma separated list of Cufflinks transcript ids matching the
reference transcript
Cufflinks output (…contd)
• cuffcmp.transcripts.gtf.tmap
– For each input GTF file, it lists the most closely
matching reference transcript, one row per cufflinks
transcript, for each cufflinks transcript
– (see column definitions on next slide)
Cufflinks output (…contd)
Col. Column name
1
Reference gene name
Example
Myog
2
Reference transcript id
3
Class code
4
5
6
Cufflinks gene id
Cufflinks transcript id
Fraction of major
isoform (FMI)
Description
The gene_name attribute of the reference GTF record for this
transcript, if present. Otherwise gene_id is used.
uc007crl.1
The transcript_id attribute of the reference GTF record for this
transcript
c
The type of relationship between the Cufflinks transcripts in
column 4 and the reference transcript (see Class Codes)
CUFF.23567
The Cufflinks internal gene id
CUFF.23567.0 The Cufflinks internal transcript id
100
The expression of this transcript expressed as a fraction of the
major isoform for the gene. Ranges from 1 to 100.
7
8
9
10
FPKM
FPKM_conf_lo
FPKM_conf_hi
Coverage
1.4567
0.7778
1.9776
3.2687
11
12
Length
Major isoform ID
The expression of this transcript expressed in FPKM
The lower limit of the 95% FPKM confidence interval
The upper limit of the 95% FPKM confidence interval
The estimated average depth of read coverage across the
transcript.
1426
The length of the transcript
CUFF.23567.0 The Cufflinks ID of the gene's major isoform
Cuffcompare output (…contd)
• Class codes
Priority
1
2
3
4
Code
=
c
j
e
5
6
7
i
o
p
8
r
9
10
11
u
x
s
12
.
Description
Complete match of intron chain
Contained
Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron,
indicating a possible pre-mRNA fragment.
A transfrag falling entirely within a reference intron
Generic exonic overlap with a reference transcript
Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
Repeat. Currently determined by looking at the soft-masked reference sequence and applied to
transcripts where at least 50% of the bases are lower case
Unknown, intergenic transcript
Exonic overlap with reference on the opposite strand
An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read
mapping errors)
(.tracking file only, indicates multiple classifications)
Further analysis
• The slides so far show how to interpret the RNA-seq
results of one sample/library
• In an actual experiment, tumor-normal pair might be
available as separate library. Refer to tophat-cufflinks
diagram to choose workflow
– Might involve cuffmerge prior to cuffcompare
– Cuffdiff and cuffquant can be used to report
expression or even perform differential expression analysis
53
Information Services, 00/00/2010
Q&A
• Don’t be shy – everyone has a different work
flow and we want to help you
54
Information Services, 00/00/2010
Download