Subho_prel_analysis

Analysis of the RNAseq Genome Annotation Assessment Project
by Subhajyoti De
The RNAseq Genome Annotation Assessment Project
Introduction and a summary of submissions
Analysis methodology
Nucleotide level analysis
Exon level analysis
Transcript level analysis
Missing and wrong genes
• The RGASP aims to assess the current progress of automatic gene building using RNAseq as its primary dataset.
• More specifically, we aim to evaluate the status of computational methods to
  • map human RNAseq data,
  • assemble them into transcripts, and
  • quantify the abundance of those transcripts in particular datasets.
• Promising transcript predictions not covered by the Gencode annotation will be validated by experimental methods.

3 species: human, worm and fly.
Multiple RNA-seq datasets for each organism.
15 submitters.
304 submissions.
Analysis methodology
1. We carried out independent evaluations for the coding portions of the mRNA transcripts (CDS-focused) and for the mRNA transcripts as a whole (mRNA-focused).
2. Analysis was carried out at multiple levels:
   1. Nucleotide level
   2. Exon level
   3. Transcript level
3. For each of the levels, we calculated the sensitivity and specificity of the predictions (as discussed later). As a summary measure, we also reported the average of the two statistics.
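As a minimal sketch of the summary measure, the snippet below computes sensitivity, specificity and their arithmetic average from counts of true positives, false positives and false negatives. The function name `summarise` and the count-based interface are assumptions made for illustration; the analysis code itself is not shown in these slides. Specificity is taken in the gene-prediction sense used throughout this deck (the fraction of predictions that are also annotated).

```python
from typing import Tuple


def summarise(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, average) from feature counts.

    tp: features present in both the annotation and the prediction
    fp: predicted features that are not annotated
    fn: annotated features that were not predicted
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Hypothetical counts, not taken from any submission.
    sn, sp, avg = summarise(tp=80, fp=20, fn=120)
    print(f"sensitivity={sn:.3f} specificity={sp:.3f} average={avg:.3f}")
```

The same three numbers are what each results table reports, expressed there on a 0 to 100 scale.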
Nucleotide level analysis
[Figure: the annotation set compared with the prediction set, marking true positives, false positives and false negatives]
Sensitivity = (number of annotated nucleotides correctly predicted) / (number of annotated nucleotides in the annotation set)
Specificity = (number of predicted nucleotides that are also annotated) / (number of predicted nucleotides in the prediction set)
Points to note:
1. Nucleotide predictions had to be on the same strand as the annotations to be considered correct.
2. Individual nucleotides present in multiple transcripts in either the annotation or the predictions are counted only once.
3. As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.
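The nucleotide level rules can be sketched as set operations. This is an illustrative sketch only: the exon tuple layout `(chromosome, strand, start, end)` with 1-based inclusive coordinates and the function names are assumptions, not the project's actual code. Keying each position by strand enforces the same-strand rule, and using sets means a nucleotide shared by several transcripts is counted once.

```python
from typing import Iterable, Set, Tuple

# Assumed layout: (chromosome, strand, start, end), 1-based inclusive.
Exon = Tuple[str, str, int, int]


def covered_positions(exons: Iterable[Exon]) -> Set[Tuple[str, str, int]]:
    """Collect every (chromosome, strand, position) covered by the exons."""
    positions = set()
    for chrom, strand, start, end in exons:
        for pos in range(start, end + 1):
            positions.add((chrom, strand, pos))
    return positions


def nucleotide_level(annotated: Iterable[Exon], predicted: Iterable[Exon]):
    """Return (sensitivity, specificity, average) at the nucleotide level."""
    ann = covered_positions(annotated)
    pred = covered_positions(predicted)
    tp = len(ann & pred)  # nucleotides both annotated and predicted
    sn = tp / len(ann) if ann else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp, (sn + sp) / 2


if __name__ == "__main__":
    # Hypothetical exons; two annotated exons overlap each other.
    annotated = [("chr1", "+", 101, 200), ("chr1", "+", 151, 250)]
    predicted = [("chr1", "+", 151, 300), ("chr1", "-", 101, 150)]
    print(nucleotide_level(annotated, predicted))
```

Enumerating single positions is simple but memory-hungry; for whole genomes an interval-based implementation would be the more practical choice.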
Nucleotide level analysis (H. sapiens)

Team                            Sensitivity  Specificity  Average  Focus
Tyl_hum_qbo_solexa_K562single        69.587       99.308   84.447  exon
Sea_hum_qex_solexa_                  48.419       94.289   71.354  exon
Mar_hum_qbo_solexa_K562strand        47.483       82.247   64.865  exon
Sim_hum_qtr_solexa_                  32.947       83.172   58.059  exon
Ger_hum_qtr_solexa_hummul            31.904       84.012   57.958  exon
Vic_hum_qbo_solexa_                  30.969       84.488   57.729  exon
Lio_hum_qtr_solexa_                  34.330       78.622   56.476  exon
Tyl_hum_qna_solexa_hummul            31.668       80.285   55.977  exon
Tho_hum_qbo_solexa_                  35.203       69.262   52.233  exon
Chr_hum_qbo_solexa_                  44.660       14.236   29.448  exon
Car_hum_qna_solexa_hummul            8.6357       2.4019   5.5188  exon
Jie_hum_qex_solexa_K562single        0.2245       75.762   37.993  exon

Team                            Sensitivity  Specificity  Average  Focus
Jel_hum_qna_solexa_hummul            84.936       82.198   83.567  cds
Mar_hum_qbo_solexa_                  80.161       85.017   82.589  cds
Vic_hum_qna_solexa_hummul_           84.367       68.269   76.318  cds
Tyl_hum_qna_solexa_hummul            75.173       73.467   74.320  cds
Sim_hum_qtr_solexa_hummul            54.280       92.832   73.556  cds
Chr_hum_qbo_solexa_                  44.076       57.971   51.024  cds
Nucleotide level analysis (D. melanogaster)

Team                            Sensitivity  Specificity  Average  Focus
Tyl_fly_qbo_solexa_S2DRSC            94.467       98.472   96.470  exon
Mar_fly_qna_solexa_flymul            85.508       86.379   85.944  exon
Vic_fly_qna_solexa_flymul            70.835       83.555   77.195  exon
Tho_fly_qbo_solexa_MLDmBG3c2         42.051       87.378   64.715  exon
Gun_fly_qtr_solexa_CMEW1CI           72.836       55.486   64.161  exon

Team                            Sensitivity  Specificity  Average  Focus
Mar_fly_qbo_solexa_MLDmBG3c2         93.129       95.367   94.248  cds
Jel_fly_qna_solexa_flymul            90.746       95.316   93.031  cds
Tyl_fly_qna_solexa_flymul            85.938       93.383   89.661  cds
Vic_fly_qna_solexa_flymul            95.640       83.252   89.446  cds
Gun_fly_qna_solexa_flymul            88.929       72.169   80.549  cds
Nucleotide level analysis (C. elegans)

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul           94.658       92.470   93.564  exon
Tyl_wor_qbo_solexa_SRX001872         92.464       90.199   91.331  exon
Mar_wor_qna_helicos_wormmul          90.863       76.515   83.689  exon
Gun_wor_qtr_solexa_SRX001872         90.343       76.350   83.346  exon
Tho_wor_qbo_solexa_wormmul           74.669       72.662   73.665  exon
Ger_wor_qbo_solexa_wormmul           68.993       77.187   73.090  exon
Lio_wor_qtr_solexa_SRX001874         57.334       81.411   69.372  exon

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul           96.455       89.931   93.193  cds
Wol_wor_qex_solexa_SRX004867         92.883       91.719   92.301  cds
Mar_wor_qbo_solexa_SRX004866         91.433       93.062   92.247  cds
Jel_wor_qna_solexa_wormmul           90.805       92.663   91.734  cds
Tyl_wor_qna_solexa_SRX004867         90.328       89.038   89.683  cds
Gun_wor_qtr_solexa_SRX004865         93.610       83.862   88.736  cds
Ger_wor_qbo_solexa_wormmul           75.200       97.186   86.193  cds
Exon level analysis
[Figure: the annotation set compared with the prediction set, marking true positives, false positives and false negatives]
Sensitivity = (number of annotated exons correctly predicted) / (number of annotated exons in the annotation set)
Specificity = (number of predicted exons that are also annotated) / (number of predicted exons in the prediction set)
Points to note:
1. An exon in the prediction must have identical start and end coordinates and also the same strand as an exon in the annotation to be counted correct.
2. If an exon is present in multiple transcripts in either the annotation or the predictions, it is counted only once.
3. As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.
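The exon level rules can be sketched in the same way as the nucleotide level ones, with whole exons as the set elements. The tuple layout `(chromosome, strand, start, end)` and the function name `exon_level` are assumptions for illustration. Tuple equality gives the exact-coordinate, same-strand requirement, and collecting exons into sets counts an exon shared by several transcripts only once.

```python
from typing import Iterable, Sequence, Tuple

# Assumed layout: (chromosome, strand, start, end).
Exon = Tuple[str, str, int, int]


def exon_level(annotated: Iterable[Sequence[Exon]],
               predicted: Iterable[Sequence[Exon]]):
    """Return (sensitivity, specificity, average) at the exon level.

    Each argument is a collection of transcripts; a transcript is a
    sequence of exons.
    """
    ann = {exon for transcript in annotated for exon in transcript}
    pred = {exon for transcript in predicted for exon in transcript}
    tp = len(ann & pred)  # exons with identical chromosome, strand, start, end
    sn = tp / len(ann) if ann else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp, (sn + sp) / 2


if __name__ == "__main__":
    # Hypothetical exons, not taken from any submission.
    e1 = ("chr1", "+", 100, 200)
    e2 = ("chr1", "+", 300, 400)
    e3 = ("chr1", "+", 500, 600)
    annotated = [[e1, e2], [e1, e3]]
    predicted = [[e1, ("chr1", "+", 300, 401)],
                 [e1, e3, ("chr1", "+", 700, 800)]]
    print(exon_level(annotated, predicted))
```

In this example the predicted exon ending at 401 misses the annotated boundary by one base and is therefore counted as a false positive, which is why exon level scores tend to be much lower than nucleotide level scores for the same submission.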
Exon level analysis (H. sapiens)

Team                            Sensitivity  Specificity  Average  Focus
Vic_hum_qbo_solexa_                  31.368       65.870   48.619  exon
Mar_hum_qbo_solexa_                  32.228       64.186   48.207  exon
Tyl_hum_qbo_solexa_SRX004865         32.932       61.228   47.080  exon
Ger_hum_qtr_solexa_hummul            19.741       58.694   39.217  exon
Sim_hum_qtr_solexa_                  16.509       54.381   35.445  exon
Lio_hum_qtr_solexa_                  18.151       52.382   35.266  exon
Tho_hum_qbo_solexa_                  14.035       33.955   23.995  exon
Chr_hum_qbo_solexa_                  2.3731       1.5113   1.9422  exon
Sea_hum_qex_solid_GM12878solid       0.2463       0.8973   0.5718  exon

Team                            Sensitivity  Specificity  Average  Focus
Mar_hum_qbo_solexa_                  59.947       78.377   69.162  cds
Vic_hum_qbo_solexa_                  57.848       73.337   65.593  cds
Jel_hum_qtr_solexa_                  50.251       77.910   64.081  cds
Chr_hum_qbo_solexa_                  7.6757       4.8519   6.2638  cds
Tyl_hum_qna_solexa_hummul            49.423       64.729   57.076  cds
Sim_hum_qtr_solexa_                  30.725       68.615   49.670  cds
Exon level analysis (D. melanogaster)

Team                            Sensitivity  Specificity  Average  Focus
Tyl_fly_qbo_solexa_S2DRSC            46.490       56.912   51.701  exon
Mar_fly_qna_solexa_flymul            38.190       49.951   44.071  exon
Vic_fly_qna_solexa_flymul            38.608       44.740   41.674  exon
Gun_fly_qtr_solexa_CMEW1CI           20.878       56.591   38.734  exon
Tho_fly_qbo_solexa_MLDmBG3c2         8.2705       17.651   12.961  exon

Team                            Sensitivity  Specificity  Average  Focus
Mar_fly_qbo_solexa_Kc167             56.063       64.869   60.466  cds
Vic_fly_qna_solexa_flymul            57.875       53.877   55.876  cds
Jel_fly_qtr_solexa_Kc167             48.299       60.408   54.354  cds
Tyl_fly_qna_solexa_flymul            42.425       57.206   49.815  cds
Gun_fly_qtr_solexa_CMEW1CI           54.588       40.784   47.686  cds
Exon level analysis (C. elegans)

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul           75.471       80.553   78.012  exon
Tyl_wor_qna_solexa_wormmul           60.400       72.415   66.408  exon
Gun_wor_qtr_solexa_SRX001872         42.802       63.928   53.365  exon
Lio_wor_qtr_solexa_SRX001874         23.309       43.959   33.634  exon
Tho_wor_qbo_solexa_SRX004867         9.4368       15.237   12.336  exon

Team                            Sensitivity  Specificity  Average  Focus
Wol_wor_qex_solexa_SRX004867         80.738       78.661   79.699  cds
Vic_wor_qna_solexa_wormmul           71.772       66.978   69.375  cds
Mar_wor_qna_solexa_wormmul           67.100       68.633   67.866  cds
Jel_wor_qna_solexa_wormmul           65.788       67.521   66.655  cds
Tyl_wor_qbo_solexa_SRX001872         60.484       59.180   59.832  cds
Gun_wor_qtr_solexa_SRX004865         61.633       63.122   62.377  cds
Ger_wor_qbo_solexa_SRX004863         20.744       24.157   22.450  cds
Transcript level analysis
[Figure: the annotation set compared with the prediction set, marking true positives, false positives and false negatives]
Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)
Points to note:
1. We consider a transcript accurately predicted if the number of exons in the transcript and their boundaries match exactly between the annotation and the prediction.
2. For the CDS-focused evaluation, we consider a transcript correctly predicted if the beginning and end of translation are correctly annotated and each of the 5' and 3' splice sites of the coding exons is correct.
3. For the mRNA evaluation, a transcript is counted as correct if all of the exons, from the start of transcription to the end of transcription, match perfectly between the annotation and prediction sets.
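A minimal sketch of the strict transcript level comparison follows. It assumes each transcript is given as a sequence of exons `(chromosome, strand, start, end)`; for the CDS-focused evaluation the same function would be applied to the coding segments only, and for the mRNA-focused evaluation to the full exons. The names `transcript_key` and `transcript_level` are hypothetical and not from the original analysis.

```python
from typing import Iterable, Sequence, Tuple

# Assumed layout: (chromosome, strand, start, end).
Exon = Tuple[str, str, int, int]


def transcript_key(exons: Sequence[Exon]) -> Tuple[Exon, ...]:
    """Canonical form of a transcript: its exons in coordinate order."""
    return tuple(sorted(exons))


def transcript_level(annotated: Iterable[Sequence[Exon]],
                     predicted: Iterable[Sequence[Exon]]):
    """Return (sensitivity, specificity, average) under exact matching."""
    ann = {transcript_key(t) for t in annotated}
    pred = {transcript_key(t) for t in predicted}
    tp = len(ann & pred)  # same exon count and identical boundaries
    sn = tp / len(ann) if ann else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp, (sn + sp) / 2


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from any submission.
    e1 = ("chr1", "+", 100, 200)
    e2 = ("chr1", "+", 300, 400)
    e3 = ("chr1", "+", 500, 600)
    annotated = [(e1, e2, e3), (e1, e3)]
    predicted = [(e1, e2, e3), (e1, ("chr1", "+", 500, 601)), (e2, e3)]
    print(transcript_level(annotated, predicted))
```

A single boundary that is off by one base, or one missing exon, makes the whole transcript count as wrong under this criterion.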
[Figure: transcript level analysis, human (CDS-focused)]
Relaxed Transcript level analysis
[Figure: the annotation set compared with the prediction set, marking true positives, false positives and false negatives]
Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)
Points to note:
1. We consider a transcript ‘accurately’ predicted if the number of exons in the transcript matches exactly between the annotation and the prediction, and their boundaries differ by no more than 5 bp.
2. All other criteria remain the same as in the transcript level analysis.
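The relaxed criterion can be sketched as a pairwise test rather than a set intersection, because boundaries no longer have to be identical. The sketch below assumes exons are `(chromosome, strand, start, end)` tuples and that a transcript counts as correct when at least one transcript in the other set matches it; the function names and the `tol` parameter are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

# Assumed layout: (chromosome, strand, start, end).
Exon = Tuple[str, str, int, int]


def exon_close(a: Exon, b: Exon, tol: int = 5) -> bool:
    """Same chromosome and strand, both boundaries within tol bp."""
    return (a[0] == b[0] and a[1] == b[1]
            and abs(a[2] - b[2]) <= tol and abs(a[3] - b[3]) <= tol)


def relaxed_match(ann: Sequence[Exon], pred: Sequence[Exon],
                  tol: int = 5) -> bool:
    """Equal exon count and every paired boundary within tol bp."""
    if len(ann) != len(pred):
        return False
    return all(exon_close(a, p, tol)
               for a, p in zip(sorted(ann), sorted(pred)))


def relaxed_transcript_level(annotated: List[Sequence[Exon]],
                             predicted: List[Sequence[Exon]],
                             tol: int = 5):
    """Return (sensitivity, specificity) under the relaxed criterion."""
    ann_hits = sum(any(relaxed_match(a, p, tol) for p in predicted)
                   for a in annotated)
    pred_hits = sum(any(relaxed_match(a, p, tol) for a in annotated)
                    for p in predicted)
    sn = ann_hits / len(annotated) if annotated else 0.0
    sp = pred_hits / len(predicted) if predicted else 0.0
    return sn, sp


if __name__ == "__main__":
    # Hypothetical transcripts: the first prediction is within 5 bp,
    # the second is 6 bp off at one boundary.
    annotated = [[("chr1", "+", 100, 200), ("chr1", "+", 300, 400)]]
    predicted = [[("chr1", "+", 103, 200), ("chr1", "+", 300, 405)],
                 [("chr1", "+", 100, 200), ("chr1", "+", 300, 406)]]
    print(relaxed_transcript_level(annotated, predicted))
```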
Very relaxed Transcript level analysis
[Figure: the annotation set compared with the prediction set, marking true positives, false positives and false negatives]
Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)
Points to note:
1. We consider a transcript ‘accurately’ predicted if
   1. the number of exons in the transcript differs by no more than two (terminal exons only) between the annotation and the prediction, and
   2. the boundaries of all equivalent exons differ by no more than 5 bp between the annotation and the prediction.
2. All other criteria remain the same as in the transcript level analysis.
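One possible reading of the very relaxed criterion is sketched below. It assumes that the extra exons (at most two) all belong to the longer of the two transcripts and sit at its ends, so the shorter transcript must align with a contiguous window of the longer one; the slides do not spell out this detail, so treat it as an interpretation. Exons are assumed to be `(chromosome, strand, start, end)` tuples and the function names are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed layout: (chromosome, strand, start, end).
Exon = Tuple[str, str, int, int]


def exon_close(a: Exon, b: Exon, tol: int = 5) -> bool:
    """Same chromosome and strand, both boundaries within tol bp."""
    return (a[0] == b[0] and a[1] == b[1]
            and abs(a[2] - b[2]) <= tol and abs(a[3] - b[3]) <= tol)


def very_relaxed_match(ann: Sequence[Exon], pred: Sequence[Exon],
                       tol: int = 5, max_extra: int = 2) -> bool:
    """True if the transcripts agree apart from up to max_extra terminal exons."""
    a, p = sorted(ann), sorted(pred)
    longer, shorter = (a, p) if len(a) >= len(p) else (p, a)
    extra = len(longer) - len(shorter)
    if not shorter or extra > max_extra:
        return False
    # Try every way of dropping `extra` exons from the two ends of the
    # longer transcript; internal exons can never be skipped.
    for skip_left in range(extra + 1):
        window = longer[skip_left:skip_left + len(shorter)]
        if all(exon_close(x, y, tol) for x, y in zip(window, shorter)):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from any submission.
    ann = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400),
           ("chr1", "+", 500, 600), ("chr1", "+", 700, 800)]
    missing_both_ends = [("chr1", "+", 302, 400), ("chr1", "+", 500, 598)]
    missing_internal = [("chr1", "+", 100, 200), ("chr1", "+", 500, 600),
                        ("chr1", "+", 700, 800)]
    print(very_relaxed_match(ann, missing_both_ends))
    print(very_relaxed_match(ann, missing_internal))
```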
[Figure: very relaxed transcript level analysis, worm (exon-focused)]
Missing and wrong genes
'Missing exons' (MEs): annotated exons that do not overlap any predicted exon by at least 1 bp.
'Wrong exons' (WEs): predicted exons that do not overlap any annotated exon by at least 1 bp.
[Figure: the annotation set compared with the prediction set, marking missed exons and wrong exons]
'Wrong exons' (WEs) that are predicted independently by more than two predictors are recorded, and some of them will be tested experimentally.
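The two definitions can be sketched with a simple interval overlap test. The exon layout `(chromosome, strand, start, end)` with inclusive coordinates is an assumption, and so is the requirement that overlapping exons lie on the same strand; the slides only state the strand rule explicitly for the nucleotide and exon level analyses.

```python
from typing import List, Sequence, Tuple

# Assumed layout: (chromosome, strand, start, end), inclusive coordinates.
Exon = Tuple[str, str, int, int]


def overlaps(a: Exon, b: Exon) -> bool:
    """True if the exons share at least 1 bp on the same chromosome and strand."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def missing_and_wrong(annotated: Sequence[Exon],
                      predicted: Sequence[Exon]
                      ) -> Tuple[List[Exon], List[Exon]]:
    """Return (missing exons, wrong exons)."""
    missing = [a for a in annotated
               if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted
             if not any(overlaps(p, a) for a in annotated)]
    return missing, wrong


if __name__ == "__main__":
    # Hypothetical exons, not taken from any submission.
    annotated = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)]
    predicted = [("chr1", "+", 200, 250),   # touches the first exon by 1 bp
                 ("chr1", "+", 500, 600),   # no annotated exon here
                 ("chr1", "-", 300, 400)]   # right place, opposite strand
    missing, wrong = missing_and_wrong(annotated, predicted)
    print("missing:", missing)
    print("wrong:", wrong)
```

The all-against-all comparison is quadratic; at genome scale a sorted sweep or an interval tree would be the usual replacement.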
'Dubious wrong exons' (WEs) that are predicted independently by more than two predictors are reported.
[Figure: the annotation set compared with the prediction set, marking dubious wrong exons]
[Screenshot of the list of dubious wrong exons]
15704 dubious wrong exons in the whole human genome.
17678 dubious wrong exons in the whole worm genome.
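Selecting the wrong exons that several predictors agree on can be sketched as a support count. The sketch assumes that two predictors report "the same" wrong exon only when its chromosome, strand, start and end are identical, which the slides do not specify, and that "more than two predictors" means at least three. The predictor labels and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed layout: (chromosome, strand, start, end).
Exon = Tuple[str, str, int, int]


def recurrent_wrong_exons(wrong_by_predictor: Dict[str, Iterable[Exon]],
                          min_predictors: int = 3) -> Dict[Exon, List[str]]:
    """Map each wrong exon reported by >= min_predictors to its predictors."""
    support = defaultdict(set)
    for predictor, exons in wrong_by_predictor.items():
        for exon in exons:
            support[exon].add(predictor)  # a predictor counts once per exon
    return {exon: sorted(names)
            for exon, names in support.items()
            if len(names) >= min_predictors}


if __name__ == "__main__":
    # Hypothetical wrong-exon lists for four made-up predictors.
    x = ("chr1", "+", 500, 600)
    y = ("chr1", "+", 900, 950)
    z = ("chr2", "-", 100, 180)
    wrong_by_predictor = {"A": [x, y], "B": [x], "C": [x, y], "D": [z]}
    print(recurrent_wrong_exons(wrong_by_predictor))
```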
Acknowledgement
Jen Harrow
Felix Kokocinski
Tim Hubbard
The RGASP community