Analysis of the RNAseq Genome Annotation Assessment Project
by Subhajyoti De

The RNAseq Genome Annotation Assessment Project
• Introduction and a summary of submissions
• Analysis methodology
• Nucleotide level analysis
• Exon level analysis
• Transcript level analysis
• Missing and wrong genes

Introduction and a summary of submissions

• The RGASP aims to assess the current progress of automatic gene building using RNAseq as its primary dataset.
• More specifically, we aim to evaluate the status of computational methods to
   • map human RNAseq data,
   • assemble them into transcripts and
   • quantify the abundance of those transcripts in particular datasets.
• Promising transcript predictions not covered by Gencode annotation will be validated by experimental methods.

3 species: human, worm and fly. Multiple RNA-seq datasets for each organism. 15 submitters. 304 submissions.

Analysis methodology

1. We carried out independent evaluations for the coding portions of the mRNA transcripts (CDS-focused) and for the mRNA transcripts as a whole (mRNA-focused).
2. Analysis was carried out at multiple levels:
   1. Nucleotide level
   2. Exon level
   3. Transcript level
3. For each of the levels, we calculated the sensitivity and specificity of the predictions (as discussed later). As a summary measure we also reported the average of the two statistics.
Nucleotide level analysis

[Figure: annotation set and prediction set, with true positives, false positives and false negatives marked]

Sensitivity = (number of annotated nucleotides correctly predicted) / (number of annotated nucleotides in the annotation set)
Specificity = (number of predicted nucleotides that are also annotated) / (number of predicted nucleotides in the prediction set)

In terms of the figure labels, sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP).

Points to note:
1. Nucleotide predictions had to be on the same strand as the annotations to be considered correct.
2. Individual nucleotides present in multiple transcripts in either the annotation or the predictions are considered only once.
3. As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.

Nucleotide level analysis (H. sapiens)

Team                            Sensitivity  Specificity  Average  Focus
Tyl_hum_qbo_solexa_K562single        69.587       99.308   84.447  exon
Sea_hum_qex_solexa_                  48.419       94.289   71.354  exon
Mar_hum_qbo_solexa_K562strand        47.483       82.247   64.865  exon
Sim_hum_qtr_solexa_                  32.947       83.172   58.059  exon
Ger_hum_qtr_solexa_hummul            31.904       84.012   57.958  exon
Vic_hum_qbo_solexa_                  30.969       84.488   57.729  exon
Lio_hum_qtr_solexa_                  34.330       78.622   56.476  exon
Tyl_hum_qna_solexa_hummul            31.668       80.285   55.977  exon
Tho_hum_qbo_solexa_                  35.203       69.262   52.233  exon
Chr_hum_qbo_solexa_                  44.660       14.236   29.448  exon
Car_hum_qna_solexa_hummul            8.6357       2.4019   5.5188  exon
Jie_hum_qex_solexa_K562single        0.2245       75.762   37.993  exon

Team                            Sensitivity  Specificity  Average  Focus
Jel_hum_qna_solexa_hummul            84.936       82.198   83.567  cds
Mar_hum_qbo_solexa_                  80.161       85.017   82.589  cds
Vic_hum_qna_solexa_hummul_           84.367       68.269   76.318  cds
Tyl_hum_qna_solexa_hummul            75.173       73.467   74.320  cds
Sim_hum_qtr_solexa_hummul            54.280       92.832   73.556  cds
Chr_hum_qbo_solexa_                  44.076       57.971   51.024  cds

Nucleotide level analysis (D. melanogaster)

Team                            Sensitivity  Specificity  Average  Focus
Tyl_fly_qbo_solexa_S2DRSC            94.467       98.472   96.470  exon
Mar_fly_qna_solexa_flymul            85.508       86.379   85.944  exon
Vic_fly_qna_solexa_flymul            70.835       83.555   77.195  exon
Tho_fly_qbo_solexa_MLDmBG3c2         42.051       87.378   64.715  exon
Gun_fly_qtr_solexa_CMEW1CI           72.836       55.486   64.161  exon

Team                            Sensitivity  Specificity  Average  Focus
Mar_fly_qbo_solexa_MLDmBG3c2         93.129       95.367   94.248  cds
Jel_fly_qna_solexa_flymul            90.746       95.316   93.031  cds
Tyl_fly_qna_solexa_flymul            85.938       93.383   89.661  cds
Vic_fly_qna_solexa_flymul            95.640       83.252   89.446  cds
Gun_fly_qna_solexa_flymul            88.929       72.169   80.549  cds

Nucleotide level analysis (C. elegans)

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul           94.658       92.470   93.564  exon
Tyl_wor_qbo_solexa_SRX001872         92.464       90.199   91.331  exon
Mar_wor_qna_helicos_wormmul          90.863       76.515   83.689  exon
Gun_wor_qtr_solexa_SRX001872         90.343       76.350   83.346  exon
Tho_wor_qbo_solexa_wormmul           74.669       72.662   73.665  exon
Ger_wor_qbo_solexa_wormmul           68.993       77.187   73.090  exon
Lio_wor_qtr_solexa_SRX001874         57.334       81.411   69.372  exon

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul           96.455       89.931   93.193  cds
Wol_wor_qex_solexa_SRX004867         92.883       91.719   92.301  cds
Mar_wor_qbo_solexa_SRX004866         91.433       93.062   92.247  cds
Jel_wor_qna_solexa_wormmul           90.805       92.663   91.734  cds
Tyl_wor_qna_solexa_SRX004867         90.328       89.038   89.683  cds
Gun_wor_qtr_solexa_SRX004865         93.610       83.862   88.736  cds
Ger_wor_qbo_solexa_wormmul           75.200       97.186   86.193  cds

Exon level analysis

[Figure: annotation set and prediction set, with true positives, false positives and false negatives marked]

Sensitivity = (number of annotated exons correctly predicted) / (number of annotated exons in the annotation set)
Specificity = (number of predicted exons that are also annotated) / (number of predicted exons in the prediction set)

Points to note:
1. An exon in the prediction must have identical start and end coordinates, and the same strand, as an exon in the annotation to be counted as correct.
2. If an exon is present in multiple transcripts in either the annotation or the predictions, it is counted only once.
3. As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.

Exon level analysis (H. sapiens)

Team                            Sensitivity  Specificity  Average  Focus
Vic_hum_qbo_solexa_                  31.368       65.870   48.619  exon
Mar_hum_qbo_solexa_                  32.228       64.186   48.207  exon
Tyl_hum_qbo_solexa_SRX004865         32.932       61.228   47.080  exon
Ger_hum_qtr_solexa_hummul            19.741       58.694   39.217  exon
Sim_hum_qtr_solexa_                  16.509       54.381   35.445  exon
Lio_hum_qtr_solexa_                  18.151       52.382   35.266  exon
Tho_hum_qbo_solexa_                  14.035       33.955   23.995  exon
Chr_hum_qbo_solexa_                  2.3731       1.5113   1.9422  exon
Sea_hum_qex_solid_GM12878solid       0.2463       0.8973   0.5718  exon

Team                            Sensitivity  Specificity  Average  Focus
Mar_hum_qbo_solexa_                  59.947       78.377   69.162  cds
Vic_hum_qbo_solexa_                  57.848       73.337   65.593  cds
Jel_hum_qtr_solexa_                  50.251       77.910   64.081  cds
Chr_hum_qbo_solexa_                  7.6757       4.8519   6.2638  cds
Tyl_hum_qna_solexa_hummul            49.423       64.729   57.076  cds
Sim_hum_qtr_solexa_                  30.725       68.615   49.670  cds

Exon level analysis (D. melanogaster)

Team                            Sensitivity  Specificity  Average  Focus
Tyl_fly_qbo_solexa_S2DRSC            46.490       56.912   51.701  exon
Mar_fly_qna_solexa_flymul            38.190       49.951   44.071  exon
Vic_fly_qna_solexa_flymul            38.608       44.740   41.674  exon
Gun_fly_qtr_solexa_CMEW1CI           20.878       56.591   38.734  exon
Tho_fly_qbo_solexa_MLDmBG3c2         8.2705       17.651   12.961  exon

Team                            Sensitivity  Specificity  Average  Focus
Mar_fly_qbo_solexa_Kc167             56.063       64.869   60.466  cds
Vic_fly_qna_solexa_flymul            57.875       53.877   55.876  cds
Jel_fly_qtr_solexa_Kc167             48.299       60.408   54.354  cds
Tyl_fly_qna_solexa_flymul            42.425       57.206   49.815  cds
Gun_fly_qtr_solexa_CMEW1CI           54.588       40.784   47.686  cds

Exon level analysis (C. elegans)

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul           75.471       80.553   78.012  exon
Tyl_wor_qna_solexa_wormmul           60.400       72.415   66.408  exon
Gun_wor_qtr_solexa_SRX001872         42.802       63.928   53.365  exon
Lio_wor_qtr_solexa_SRX001874         23.309       43.959   33.634  exon
Tho_wor_qbo_solexa_SRX004867         9.4368       15.237   12.336  exon

Team                            Sensitivity  Specificity  Average  Focus
Wol_wor_qex_solexa_SRX004867         80.738       78.661   79.699  cds
Vic_wor_qna_solexa_wormmul           71.772       66.978   69.375  cds
Mar_wor_qna_solexa_wormmul           67.100       68.633   67.866  cds
Jel_wor_qna_solexa_wormmul           65.788       67.521   66.655  cds
Tyl_wor_qbo_solexa_SRX001872         60.484       59.180   59.832  cds
Gun_wor_qtr_solexa_SRX004865         61.633       63.122   62.377  cds
Ger_wor_qbo_solexa_SRX004863         20.744       24.157   22.450  cds

Transcript level analysis

[Figure: annotation set and prediction set, with true positives, false positives and false negatives marked]

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Points to note:
1. We consider a transcript accurately predicted if the number of exons in the transcript and their boundaries match exactly between the annotation and the prediction.
2. For the CDS-focused evaluation, we consider a transcript correctly predicted if the beginning and end of translation are correctly annotated and each of the 5' and 3' splice sites of the coding exons is correct.
3. For the mRNA evaluation, a transcript is counted as correct if all of the exons from the start of transcription to the end of transcription match perfectly between the annotation and prediction sets.
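The sensitivity and specificity definitions used at the three levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation code used for RGASP: the `Exon` type, the function names, the inclusive coordinate convention and the toy data are all hypothetical, and only the strict mRNA-style transcript match is shown (the CDS-focused variant would additionally need translation start and end coordinates).

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Exon(NamedTuple):
    seqname: str
    strand: str
    start: int  # inclusive coordinate (assumed convention)
    end: int    # inclusive coordinate


Transcript = List[Exon]


def sn_sp(annotated: Set, predicted: Set) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, average), each in percent."""
    true_positives = len(annotated & predicted)
    sensitivity = 100.0 * true_positives / len(annotated) if annotated else 0.0
    specificity = 100.0 * true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2.0


def nucleotide_set(transcripts: Iterable[Transcript]) -> Set:
    """Strand-aware positions covered by any exon; shared positions count once."""
    return {(e.seqname, e.strand, pos)
            for tx in transcripts for e in tx
            for pos in range(e.start, e.end + 1)}


def exon_set(transcripts: Iterable[Transcript]) -> Set:
    """Unique exons; an exon shared by several transcripts counts once."""
    return {e for tx in transcripts for e in tx}


def transcript_set(transcripts: Iterable[Transcript]) -> Set:
    """Each transcript as its sorted exon chain (exact exon count and boundaries)."""
    return {tuple(sorted(tx)) for tx in transcripts}


# Toy data (made up for illustration only).
annotation = [
    [Exon("chr1", "+", 100, 199), Exon("chr1", "+", 300, 399)],
    [Exon("chr1", "+", 100, 199), Exon("chr1", "+", 500, 549)],
]
prediction = [
    [Exon("chr1", "+", 100, 199), Exon("chr1", "+", 300, 399)],
    [Exon("chr1", "+", 100, 199), Exon("chr1", "+", 300, 409)],
]

nucleotide_level = sn_sp(nucleotide_set(annotation), nucleotide_set(prediction))
exon_level = sn_sp(exon_set(annotation), exon_set(prediction))
transcript_level = sn_sp(transcript_set(annotation), transcript_set(prediction))

for name, result in [("nucleotide", nucleotide_level),
                     ("exon", exon_level),
                     ("transcript", transcript_level)]:
    print(name, result)
```

Using sets makes the "counted only once" rule automatic at every level, and including the strand in each key enforces the same-strand requirement.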
Transcript level analysis

[Figure: Human (CDS-focused)]

Relaxed transcript level analysis

[Figure: annotation set and prediction set, with true positives, false positives and false negatives marked]

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Points to note:
1. We consider a transcript 'accurately' predicted if the number of exons in the transcript matches exactly between the annotation and the prediction, and their boundaries differ by no more than 5 bp.
2. All other criteria remain the same as for the transcript level analysis.
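The relaxed criterion can be illustrated with a small Python sketch. It is a hedged example rather than the project's actual implementation: the function name `transcripts_match`, the `(start, end)` exon representation and the toy coordinates are assumptions, and both transcripts are assumed to lie on the same sequence and strand. With `tolerance=0` the same function reduces to the strict transcript match.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) of one exon; same sequence and strand assumed


def transcripts_match(annotated: List[Exon], predicted: List[Exon],
                      tolerance: int = 0) -> bool:
    """True if both transcripts have the same number of exons and every
    corresponding exon boundary differs by at most `tolerance` bp."""
    if len(annotated) != len(predicted):
        return False
    pairs = zip(sorted(annotated), sorted(predicted))
    return all(abs(a_start - p_start) <= tolerance and abs(a_end - p_end) <= tolerance
               for (a_start, a_end), (p_start, p_end) in pairs)


# Toy transcripts (made up for illustration only).
annotated = [(100, 199), (300, 399), (500, 560)]
close = [(103, 199), (300, 395), (500, 560)]  # boundaries off by at most 4 bp
far = [(106, 199), (300, 399), (500, 560)]    # one boundary off by 6 bp

print(transcripts_match(annotated, close, tolerance=5))
print(transcripts_match(annotated, close, tolerance=0))
print(transcripts_match(annotated, far, tolerance=5))
```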
Very relaxed transcript level analysis

[Figure: annotation set and prediction set, with true positives, false positives and false negatives marked]

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Points to note:
1. We consider a transcript 'accurately' predicted if
   1. the number of exons in the transcript differs by no more than two (terminal exons only) between the annotation and the prediction, and
   2. the boundaries of all equivalent exons differ by no more than 5 bp between the annotation and the prediction.
2. All other criteria remain the same as for the transcript level analysis.

[Figure: Worm (exon-focused)]

Missing and wrong genes

'Missing exons' (MEs): annotated exons that do not overlap any predicted exon by at least 1 bp.
'Wrong exons' (WEs): predicted exons that do not overlap any annotated exon by at least 1 bp.

[Figure: annotation set and prediction set, with missed exons and wrong exons marked]

'Wrong exons' (WEs) that are predicted independently by more than two predictors are recorded, and some of them will be tested experimentally.
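The missing-exon and wrong-exon definitions, and the selection of wrong exons reported by more than two predictors, can be sketched as follows. This is a minimal illustration under assumptions that the slides do not spell out: overlap is tested on the same sequence and strand, two predictors are taken to report "the same" wrong exon only when its coordinates are identical, and all names (`Exon`, `missing_and_wrong`, `recurrent_wrong_exons`) and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Set, Tuple


class Exon(NamedTuple):
    seqname: str
    strand: str
    start: int  # inclusive coordinate (assumed convention)
    end: int    # inclusive coordinate


def overlaps(a: Exon, b: Exon) -> bool:
    """True if the exons share at least 1 bp on the same sequence and strand."""
    return (a.seqname == b.seqname and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def missing_and_wrong(annotated: Set[Exon],
                      predicted: Set[Exon]) -> Tuple[Set[Exon], Set[Exon]]:
    """Return (missing exons, wrong exons) for one prediction set."""
    missing = {a for a in annotated if not any(overlaps(a, p) for p in predicted)}
    wrong = {p for p in predicted if not any(overlaps(p, a) for a in annotated)}
    return missing, wrong


def recurrent_wrong_exons(wrong_by_predictor: Dict[str, Set[Exon]],
                          more_than: int = 2) -> List[Exon]:
    """Wrong exons reported by more than `more_than` predictors (assumed: exact match)."""
    counts = Counter(exon for exons in wrong_by_predictor.values() for exon in set(exons))
    return sorted(exon for exon, n in counts.items() if n > more_than)


# Toy data (made up for illustration only).
annotated = {Exon("I", "+", 100, 200), Exon("I", "+", 400, 500)}
predictions = {
    "A": {Exon("I", "+", 150, 250), Exon("I", "+", 900, 950)},
    "B": {Exon("I", "+", 100, 200), Exon("I", "+", 900, 950)},
    "C": {Exon("I", "+", 900, 950), Exon("I", "-", 400, 500)},
}

wrong_by_predictor = {}
for name, predicted in predictions.items():
    missing, wrong = missing_and_wrong(annotated, predicted)
    wrong_by_predictor[name] = wrong
    print(name, "missing:", sorted(missing), "wrong:", sorted(wrong))

print(recurrent_wrong_exons(wrong_by_predictor))
```

The pairwise overlap test is quadratic in the number of exons; for genome-scale sets an interval index or a sorted sweep would be the usual replacement, which is omitted here to keep the definitions visible.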
'Dubious wrong exons': wrong exons (WEs) that are predicted independently by more than two predictors are reported.

[Figure: annotation set and prediction set, with dubious wrong exons marked]

[Figure: screenshot of the list of dubious wrong exons]

15704 dubious wrong exons in the whole human genome.
17678 dubious wrong exons in the whole worm genome.

Acknowledgement

Jen Harrow
Felix Kokocinski
Tim Hubbard
The RGASP community