RNA-seq data analysis

© FIMM - Institute for Molecular Medicine Finland
www.fimm.fi
RNA-seq analysis
Dr.Tech. Daniel Nicorici
FIMM – Institute for Molecular Medicine Finland
CSC - June 2, 2010
Outline
› RNA sequencing overview
› Finding fusion genes
› Alternative splicing
› Conclusions
RNA-seq
› high-throughput technology for sequencing RNA (in practice, the cDNA derived from the RNA is what gets sequenced)
› invaluable tool for the study of diseases such as cancer
› allows researchers to obtain information such as:
  - gene/transcript/exon expression
  - alternative splicing
  - gene fusions
  - post-transcriptional mutations
  - single nucleotide variations
  - …
RNA-seq - cont’d
› It greatly reduces the variability between experiments compared to other established measurement technologies such as microarrays and exon arrays.
› Because the reads are short (the cDNA is fragmented before sequencing), the bioinformatics analysis is challenging, e.g.
  - de novo assembly
  - alignment of the sequenced reads
  - computation of gene/transcript/exon expression
Reads in RNA-seq
Fig. 1 – Adaptor and reads in RNA-seq (figure labels: 5’ end, 3’ end, an adaptor at each end of the fragment; the short reads are the part that is sequenced)
Reads in RNA-seq – cont’d
Fig. 2 – Reads’ mappings at chromosome and transcript level (figure labels: Exon A, Exon B, Exon C and Exon D on the chromosome; the same four exons on the transcript)
Why RNA-seq?
RNA-seq (~1000€/sample):
  - exon/transcript expressions
  - gene expressions
  - alternative splicing events
  - SNPs
  - fusion genes
  - ...
Array technologies:
  - exon array (alternative splicing): ~700€/sample
  - cDNA array: ~600€/sample
  - SNP array: ~400€/sample
  - exon array (fusion genes): ~700€/sample
Fig. 3 – RNA-seq vs array technologies
General steps of RNA-seq analysis
1. Filtering of the short reads
2. Alignment of the reads against a reference
3. Computational analysis of the read alignments:
   a. compute the gene/transcript/exon expressions
   b. find new/known alternative splicing events
   c. find new/known fusion genes
   d. find new/known SNPs
4. Visualization
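As an illustration of step 1, here is a minimal sketch of a read filter. The slides do not specify the filtering criteria, so everything in it is an assumption: the function name `trim_and_filter`, the thresholds, the Phred+64 quality encoding, and the treatment of a trailing run of `B` quality characters as a low-quality tail to be trimmed.

```python
def trim_and_filter(seq, qual, min_len=25, min_mean_q=20, offset=64):
    """Trim a trailing run of 'B' qualities, then apply length and quality cut-offs.

    All thresholds and the Phred+64 offset are illustrative assumptions.
    Returns the trimmed (seq, qual) pair, or None if the read is rejected.
    """
    # Assumed convention: a trailing run of 'B' marks an unreliable read tail.
    keep = len(qual.rstrip("B"))
    seq, qual = seq[:keep], qual[:keep]

    # Reject reads that became too short (and avoid dividing by zero below).
    if len(seq) < max(min_len, 1):
        return None

    # Reject reads whose mean base quality is below the cut-off.
    mean_q = sum(ord(c) - offset for c in qual) / len(qual)
    if mean_q < min_mean_q:
        return None
    return seq, qual


# Hypothetical 40-base read whose last 10 bases carry the 'B' quality flag.
print(trim_and_filter("ACGT" * 10, "h" * 30 + "B" * 10))
```

In practice the cut-offs would be tuned to the sequencing platform and read length.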
Examples of RNA-seq visualization
Fig. 4 – Visualization using MapView
Examples of RNA-seq visualization – cont’d
Fig. 5 – Coverage plot
Examples of RNA-seq visualization – cont’d
Coverage plot for gene ERBB2 in breast cancer (normalized coverage axis from 0.00 to 130.71)
Coverage plot for gene ERBB2 in normal breast (normalized coverage axis from 0.00 to 4.41)
Fig. 6 – Coverage plots visualization
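A coverage plot of this kind is built from the per-base read depth over a region. The sketch below shows one way to compute it. The slides do not say how the coverage was normalized, so the reads-per-million scaling is an assumption, as are the function names and the half-open, 0-based `(start, end)` interval convention.

```python
def coverage(alignments, region_start, region_end):
    """Per-base read depth over [region_start, region_end).

    alignments: iterable of (start, end) half-open, 0-based intervals
    (an assumed, simplified stand-in for real alignment records).
    """
    length = region_end - region_start
    diff = [0] * (length + 1)  # difference array: +1 where a read starts, -1 where it ends
    for start, end in alignments:
        start = max(start, region_start)
        end = min(end, region_end)
        if start < end:
            diff[start - region_start] += 1
            diff[end - region_start] -= 1

    depth, cov = 0, []
    for d in diff[:length]:
        depth += d  # running sum turns the difference array into depth
        cov.append(depth)
    return cov


def normalize_per_million(cov, total_mapped_reads):
    """Scale depth to reads per million mapped reads (assumed normalization)."""
    scale = 1e6 / total_mapped_reads
    return [c * scale for c in cov]


# Two hypothetical overlapping reads in a 6-base region.
depth = coverage([(0, 3), (2, 5)], 0, 6)
print(depth, normalize_per_million(depth, 2_000_000))
```

Scaling by the total number of mapped reads makes plots from samples of different sequencing depth comparable, which is the point of comparing the tumour and normal panels.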
Examples of RNA-seq visualization – cont’d
Fig. 7 – Visualization of reads’ mappings using the UCSC browser
Examples of RNA-seq visualization – cont’d
Fig. 8 – Visualization of coverages using UCSC browser
Examples of RNA-seq visualization – cont’d
Fig. 9 – ”Gel-like” visualization of coverages using UCSC browser
Examples of RNA-seq visualization – cont’d
Fig. 10 – Histogram of distances between the paired-end reads
Examples of RNA-seq visualization – cont’d
Fig. 11 – Visualization of candidate fusion genes
Finding fusion genes
Steps:
1. Filter the reads (quality, B’s, etc.)
2. Align all reads to the genome
3. Align against the transcriptome all the reads which
   - map uniquely to the genome, or
   - do not map to the genome
4. Find the candidate fusion genes by looking for paired-end reads whose two mates map simultaneously to two different transcripts from two different genes
5. Find the fusion junction (e.g. by generating exon-exon combinations and finding the one to which the reads align)
6. Filter the candidate fusion genes
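Step 4 can be sketched as follows. This is a simplified illustration, not the pipeline used for the results in this talk: the function name `candidate_fusions`, the `min_pairs` threshold, the input layout (one pair of transcript IDs per paired-end read) and the transcript IDs in the example are all hypothetical.

```python
from collections import Counter


def candidate_fusions(pair_alignments, transcript_to_gene, min_pairs=2):
    """Count read pairs whose mates map to transcripts of two different genes.

    pair_alignments: iterable of (transcript_of_mate1, transcript_of_mate2).
    transcript_to_gene: dict mapping transcript ID -> gene ID.
    Returns {(gene_a, gene_b): supporting_pairs} for pairs with enough support.
    """
    support = Counter()
    for t1, t2 in pair_alignments:
        g1 = transcript_to_gene.get(t1)
        g2 = transcript_to_gene.get(t2)
        # Skip mates on unknown transcripts or on transcripts of the same gene.
        if g1 is None or g2 is None or g1 == g2:
            continue
        support[tuple(sorted((g1, g2)))] += 1  # orientation-independent key
    return {genes: n for genes, n in support.items() if n >= min_pairs}


# Hypothetical transcript IDs, for illustration only.
t2g = {"BCR-t1": "BCR", "BCR-t2": "BCR", "ABL1-t1": "ABL1",
       "ABL1-t2": "ABL1", "GENE3-t1": "GENE3"}
pairs = [("BCR-t1", "ABL1-t1"), ("ABL1-t2", "BCR-t1"),
         ("BCR-t1", "BCR-t2"), ("GENE3-t1", "ABL1-t1")]
print(candidate_fusions(pairs, t2g))
```

Requiring a minimum number of supporting pairs is only a first filter; step 6 still has to remove paralogs and other artefacts.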
Finding fusion genes – cont’d
› RNA-seq data for the leukemia K562 cell line [1]
  - Philadelphia chromosome with the known BCR-ABL fusion gene
  - ~15 000 candidate fusion genes found
  - ~85% of the candidate fusion genes are known paralogs or have no protein product
  - after additional filtering, 15 candidate fusion genes remain, with the known BCR-ABL ranked first
› Filtering is essential to reduce the number of candidate fusion genes from tens of thousands to tens
Alternative splicing
› process by which a gene’s exons are joined in different combinations during RNA splicing, producing multiple mRNAs from the same gene
› a large body of evidence links alternative splicing to diseases such as cancer
› Shannon’s entropy from information theory has been used previously for finding the imbalance in transcript expression [2,3]
› Jensen-Shannon divergence has been used for quantifying the relative changes in expression of transcripts [4]
› the minimum description length (MDL) principle [5] can also be used for measuring the relative changes in expression of transcripts
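To make the first two measures concrete, the sketch below computes Shannon's entropy of the transcript proportions within one sample and the Jensen-Shannon divergence between two samples. It uses the textbook definitions with base-2 logarithms; it is not the exact procedure of [2,3,4], and the function names are illustrative. The input values are the transcript expressions of gene ”G” from Table 1.

```python
import math


def proportions(expression):
    """Convert transcript expression values of one gene into proportions."""
    total = sum(expression)
    return [x / total for x in expression]


def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(m) - (H(p) + H(q)) / 2, with m = (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2


# Transcript expressions of gene "G" in samples "A" and "B" (Table 1).
p_a = proportions([3, 5, 4, 4, 2])
p_b = proportions([1, 7, 2, 6, 3])
print("entropy A:", shannon_entropy(p_a))
print("entropy B:", shannon_entropy(p_b))
print("JS divergence:", js_divergence(p_a, p_b))
```

With base-2 logarithms the divergence lies between 0 (identical transcript proportions) and 1 (completely disjoint transcript usage), but turning it into a balanced/imbalanced call still needs an externally chosen threshold.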
Alternative splicing – cont’d
Steps:
1. Filter the reads (quality, B’s, etc.)
2. Align all reads to the genome
3. Align against the transcriptome all the reads which
   - map uniquely to the genome, or
   - do not map to the genome
4. Compute (normalized) transcript expressions (e.g. RPKM)
5. Repeat steps 1-4 for all samples
6. Find relative changes/imbalances between the expressions of the transcripts of the same gene across the group of samples
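For step 4, RPKM (reads per kilobase of transcript per million mapped reads) divides the read count of a transcript by its length in kilobases and by the total number of mapped reads in millions. The sketch below applies that standard definition; the function name and the example numbers are made up for illustration.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical transcript: 500 reads, 2000 bp long, 10 million mapped reads in the sample.
print(rpkm(500, 2000, 10_000_000))
```

Normalizing by both length and sequencing depth is what makes the values comparable across transcripts and across samples in step 6.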
Alternative splicing – cont’d
Table 1 – Example of a gene with its five transcripts

Transcript of gene ”G”    Sample ”A”    Sample ”B”
Transcript 1              3             1
Transcript 2              5             7
Transcript 3              4             2
Transcript 4              4             6
Transcript 5              2             3
Alternative splicing – cont’d
› Computing the imbalance of transcript expression for the example in Table 1 using the MDL method [5]:
\[
\begin{aligned}
L(A) &= -3\log_2\frac{3}{18} - 5\log_2\frac{5}{18} - 4\log_2\frac{4}{18} - 4\log_2\frac{4}{18} - 2\log_2\frac{2}{18} + \log_2 C_{18,5} \quad \text{(bits)} \\
L(B) &= -1\log_2\frac{1}{19} - 7\log_2\frac{7}{19} - 2\log_2\frac{2}{19} - 6\log_2\frac{6}{19} - 3\log_2\frac{3}{19} + \log_2 C_{19,5} \quad \text{(bits)} \\
L(A+B) &= -4\log_2\frac{4}{37} - 12\log_2\frac{12}{37} - 6\log_2\frac{6}{37} - 10\log_2\frac{10}{37} - 5\log_2\frac{5}{37} + \log_2 C_{37,5} \quad \text{(bits)}
\end{aligned}
\]

where A+B denotes the two samples pooled (their counts added per transcript) and

\[
C_{n,5} = \sum_{\substack{i_1+i_2+i_3+i_4+i_5 = n \\ i_1, i_2, i_3, i_4, i_5 \ge 0}}
\binom{n}{i_1, i_2, i_3, i_4, i_5}
\left(\frac{i_1}{n}\right)^{i_1}
\left(\frac{i_2}{n}\right)^{i_2}
\left(\frac{i_3}{n}\right)^{i_3}
\left(\frac{i_4}{n}\right)^{i_4}
\left(\frac{i_5}{n}\right)^{i_5}
\]

If
\[
L(A+B) > L(A) + L(B)
\]
then there is imbalance
› MDL’s advantage: the criterion for deciding between balanced and imbalanced is built in
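The MDL computation for Table 1 can be checked numerically with the sketch below. It assumes that each code length is the negative maximized log-likelihood of the counts plus log2 of the normalizing sum C_{n,5}, and that imbalance is declared when coding the pooled sample costs more bits than coding the two samples separately. The function names are illustrative, and the convolution trick for C_{n,k} is a convenience of this sketch, not something taken from the slides.

```python
import math


def multinomial_complexity(n, k):
    """C_{n,k}: sum, over all k-part count vectors (i_1..i_k) with total n,
    of n!/(i_1!...i_k!) * prod_j (i_j/n)^(i_j)."""
    # Each term factors as (n!/n^n) * prod_j (i_j^i_j / i_j!), so the sum over
    # all count vectors is a k-fold convolution of f(i) = i^i / i!.
    f = [i ** i / math.factorial(i) for i in range(n + 1)]  # 0 ** 0 == 1 in Python
    conv = [1.0] + [0.0] * n  # neutral element of the convolution
    for _ in range(k):
        conv = [sum(conv[j] * f[m - j] for j in range(m + 1)) for m in range(n + 1)]
    return conv[n] * math.factorial(n) / n ** n


def description_length(counts):
    """Code length in bits: -sum c*log2(c/n) + log2 C_{n,k}."""
    n = sum(counts)
    data_bits = -sum(c * math.log2(c / n) for c in counts if c > 0)
    return data_bits + math.log2(multinomial_complexity(n, len(counts)))


sample_a = [3, 5, 4, 4, 2]  # Table 1, sample "A"
sample_b = [1, 7, 2, 6, 3]  # Table 1, sample "B"
pooled = [a + b for a, b in zip(sample_a, sample_b)]

l_a = description_length(sample_a)
l_b = description_length(sample_b)
l_ab = description_length(pooled)
print(f"L(A) = {l_a:.2f}, L(B) = {l_b:.2f}, L(A+B) = {l_ab:.2f} bits")
print("imbalance" if l_ab > l_a + l_b else "no imbalance")
```

Direct enumeration of all count vectors would also work for n = 37, but the convolution keeps the cost at roughly k·n² operations, which matters when the test is repeated for every gene.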
Alternative splicing – cont’d
› only validated transcripts (e.g. those with reads that map only to the given transcript [3]) are used for finding the imbalances
› for example, ~3500 alternatively spliced genes are found when comparing a prostate cancer control sample with a treated sample
Conclusions
› RNA-seq data analysis:
  - is computationally intensive (compared to, for example, microarray analysis)
  - needs very good filtering criteria, grounded in biology and mathematics, to improve the quality of the results (i.e. to keep the number of false positives low)
  - has no single established way of doing it
  - relies on many tools, e.g. aligners, samtools, etc., that are still works in progress
› Visualization:
  - has multiple facets, e.g. read coverage, fusion genes, etc.
  - depends on the user profile:
    1. biologist/medical doctor
    2. bioinformatician
References
1. Berger M. et al., Integrative analysis of the melanoma transcriptome, Genome Research, Feb. 2010.
2. Ritchie W. et al., Entropy measures quantify global splicing disorders in cancer, PLOS Computational Biology, vol. 4, March 2008.
3. Gan Q. et al., Dynamic regulation of alternative splicing and chromatin structure in Drosophila gonads revealed by RNA-seq, Cell Research, May 2010.
4. Trapnell C. et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology, vol. 28, May 2010.
5. Grünwald P., Minimum description length principle tutorial, in Advances in Minimum Description Length: Theory and Applications, Grünwald P., Myung I.J., and Pitt M. (Eds.), pp. 22-79, MIT Press, Cambridge, 2005.
Acknowledgements
› Olli Kallioniemi
› Janna Saarela
› Henrik Edgren
› Astrid Murumägi
› Sara Kangaspeska
› Pekka Ellonen
› Thank you!