Progress report
Yiming Zhang
02/10/2012

All AS events in ASIP
• Intron retention (IntronR)
• Exon skipping (ExonS)
• Alternative acceptor site (AltA), including NAGNAG
• Alternative donor site (AltD), including GYNGYN
• Alternative both sites (AltP)

NAGNAG alternative splicing
Figure 1. NAGNAG alternative splicing with E and I sites and isoforms.
NAGNAG alternative splicing can result in one of three outcomes (Figure 1): constitutive use of the first acceptor (the so-called exonic, or “E”, variant), constitutive use of the second acceptor (the so-called intronic, or “I”, variant), or use of both acceptors, that is, alternative splicing (the “EI” variant).
Sinha et al. 2010

GYNGYN alternative splicing
Figure 2. GYNGYN alternative splicing with e and i sites and isoforms.
Hiller et al. 2006

All introns
  Constitutive: NAGNAG-E, NAGNAG-I, GYNGYN-e, GYNGYN-i, ……
  Alternative: IntronR, ExonS, AltA (NAGNAG-ei, ……), AltD (GYNGYN-ei, ……), Multiple AS, ……
  Unclear

Intron statistics from ASIP

Constitutive introns (Cons.)
          EST>=4                               EST>=10
Species   all     NAG-E  NAG-I  GYN-e  GYN-i   all     NAG-E  NAG-I  GYN-e  GYN-i
AT        36040   1494   522    1283   489     13366   574    199    467    152
BD        760     30     5      40     14      220     10     1      5      7
GM        16203   583    212    621    296     5697    189    74     232    94
LJ        1056    39     13     34     13      406     7      3      17     7
MT        6351    195    74     246    89      2509    85     32     100    39
OS        32393   1299   400    1406   705     14409   568    160    637    303
PP        15306   533    331    748    241     6606    220    107    330    88
PT        3347    102    36     135    71      1129    40     11     36     25
SB        9014    317    96     403    168     3366    117    31     163    46
SL        1681    69     26     58     27      689     24     12     28     14
VV        9000    349    139    382    208     4091    148    55     177    88
Total     131151  5010   1854   5356   2321    52488   1982   685    2192   863

Alternative introns (Alt.), EST>=2
Species   IntronR  AltA_all  AltA_NAG  AltD_all  AltD_GYN  AltP  ExonS
AT        4197     648       128       305       9         51    476
BD        50       10        4         2         0         0     5
GM        1669     189       31        99        0         32    371
LJ        76       11        3         5         1         1     11
MT        327      49        6         47        1         0     72
OS        8233     926       178       575       10        188   2100
PP        2004     404       44        453       3         27    429
PT        256      38        3         15        0         2     79
SB        1671     92        27        64        1         39    304
SL        91       24        4         16        0         0     59
VV        806      97        30        71        2         16    238
Total     19380    2488      458       1652      27        356   4144

Table 1. Intron statistics from ASIP. Four species with only a small amount of data are not listed here.
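As a quick check on Table 1, the Total row can be used to rank the alternative intron types and to compare how much of AltA is NAGNAG with how much of AltD is GYNGYN. The following is a minimal Python sketch: the dictionary keys mirror the table's column names, while the variable and function names (`alt_totals`, `rank_types`, `share`) are illustrative and not part of ASIP.

```python
# Totals copied from the "Total" row of Table 1 (alternative introns, EST>=2).
alt_totals = {
    "IntronR": 19380,
    "AltA_all": 2488,
    "AltA_NAG": 458,
    "AltD_all": 1652,
    "AltD_GYN": 27,
    "AltP": 356,
    "ExonS": 4144,
}

# Top-level alternative intron types; AltA_NAG and AltD_GYN are subsets
# of AltA_all and AltD_all, so they are left out of the ranking.
TOP_LEVEL = ("IntronR", "AltA_all", "AltD_all", "AltP", "ExonS")


def rank_types(totals):
    """Return the top-level AS types sorted from most to least frequent."""
    return sorted(TOP_LEVEL, key=lambda name: totals[name], reverse=True)


def share(totals, subset, whole):
    """Fraction of the `whole` introns that fall into `subset`."""
    return totals[subset] / totals[whole]


ranking = rank_types(alt_totals)
nag_share = share(alt_totals, "AltA_NAG", "AltA_all")
gyn_share = share(alt_totals, "AltD_GYN", "AltD_all")

print("Ranking:", ", ".join(ranking))
print(f"NAGNAG share of AltA: {nag_share:.1%}")
print(f"GYNGYN share of AltD: {gyn_share:.1%}")
```

With these totals, NAGNAG introns make up about 18.4% of AltA introns, whereas GYNGYN introns make up about 1.6% of AltD introns.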
All statistics are intron-based rather than event-based, which means redundancy has been removed. The most common type of alternative intron is IntronR; the second most common is ExonS. NAGNAG AS accounts for a much larger share of AltA introns (458 of 2488) than GYNGYN AS does of AltD introns (27 of 1652).

Background
NAGNAG alternative splicing, which can insert or delete a single amino acid in the protein, is very common and well studied in animals.
• The NAGNAG motif is present in 30% of human genes and is functional in at least 5% of the genes.
  Hiller et al. 2004
• Because NAGNAG AS is frame-preserving, the vast majority of cases should lead to different proteins. Studies so far have found both cases in which such proteins differ in function and cases in which there is no noticeable difference.
  Akerman et al. 2006; Iida et al. 2008
• GO analyses in some studies find the GO term DNA binding to be statistically significant, and more than half of all AS-NAGNAG events affect polar amino acid residues.
  Iida et al. 2008; Sinha et al. 2010

Background
Few studies of NAGNAG AS in plants exist so far (only 3 species: Arabidopsis, rice and Physcomitrella).
• One study found 321 and 372 AS-NAGNAG events in Arabidopsis and rice, respectively. Another study found that 6% of all introns and 21% of all annotated genes in Arabidopsis harbor a genomic NAGNAG acceptor motif.
  Iida et al. 2008; Schindler et al. 2008
• In addition, the GO analysis agrees with the previous study in human: the GO term DNA binding is statistically significant. One study indicates that NAGNAG acceptors occur frequently in the Arabidopsis genome and are particularly prevalent in SR and SR-related protein-coding genes.
  Sinha et al. 2010; Schindler et al. 2008

Background
The state-of-the-art in silico studies on predicting NAGNAG splice sites come from Sinha's group, for both human and plant species. They achieved high, balanced specificity and sensitivity in both.
• The most informative features they found are the nucleotides in the NAGNAG motif and in its immediate vicinity, along with the splice site scores.
• The model they trained on human data also achieves a high AUC on plant data, which suggests that NAGNAG splicing in plants is similar to that in animals.
  Sinha et al. 2009, 2010

NAGNAG dataset
I tried to predict NAGNAG events (that is, to predict the EI, I or E form) with Random Forest, using a dataset I generated from ASIP.
• Strict criteria were used to identify NAGNAG events in the ASIP database: E and I events must be supported by at least 10 ESTs or cDNAs, and for EI events each isoform must be supported by at least 2 ESTs or cDNAs.
• After removing redundancy, I obtained 458 alternative NAGNAG introns (EI form), 1988 constitutive introns of the E form and 685 constitutive introns of the I form in 15 plant species.

Features
Figure 3. A total of 28 features each represent a nucleotide and thus have four possible values (A, C, G, T). U1, U2, U3 are the first three nucleotides in the upstream exon. D1, D2, D3 are the first three nucleotides in the downstream exon. A weak polypyrimidine tract (PPT) can contribute to AS, so P1-P20 cover the PPT upstream of the NAGNAG. Finally, I also use intron length as an additional feature.

Classifier evaluation
A Random Forest with 200 trees was used, with 5-fold cross-validation.

Class  TP rate  FP rate  Precision  Recall  F-measure  ROC area
E      0.992    0.089    0.951      0.992   0.971      0.995
I      0.953    0.023    0.92       0.953   0.936      0.995
EI     0.657    0.017    0.87       0.657   0.749      0.967

The evaluation results agree closely with Sinha’s paper (for Physcomitrella), in which AUC = 0.96, 0.99 and 0.98 for the EI, E and I forms, respectively.
Figure 4. The EI class (AS) is harder to predict (AUC = 0.967) than the two constitutive variants, E and I (AUC = 0.995 for both).

Most informative features
Figure 5. Most informative features according to information gain (y-axis from 0 to 0.9; features shown: N2, N1, P_-1, P_-2, D1, P_-7, D2).

Sequence logos
Figure 6a.
Figure 6b.
Figure 6c.
Figure 6d.
Figure 6a-6d. Sequence logos of NAGNAG splice sites. 6a: E sites; 6b: I sites; 6c: EI sites; 6d: all splice sites. Positions 1-3 are U1-U3, positions 4-23 are P20-P1, positions 24-29 are the NAGNAG motif, and positions 30-32 are D1-D3.

Conclusion
• NAGNAG AS can be predicted with high accuracy. Using carefully constructed training and test datasets, an in silico performance of AUC = 0.967, 0.995 and 0.995 was achieved for the EI, E and I forms, respectively.
• The most informative features are the nucleotides in the NAGNAG motif and in its immediate vicinity.
• NAGNAG AS in plants is similar to that in animals and is largely dependent on the splice site and its immediate neighborhood.
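
To make the dataset construction concrete, the sketch below shows how one training example might be labelled and encoded. It is a minimal Python illustration under stated assumptions, not the code used for this report: the function names `label_intron` and `extract_features` are hypothetical; the 32-nt window layout (U1-U3, then P20-P1, then the six NAGNAG bases, then D1-D3) is assumed from the sequence-logo description; and the rule that a constitutive E or I call requires zero support for the other acceptor is an assumption, since the criteria only give the minimum support counts.

```python
from typing import Dict, Optional

# Assumed window layout (1-based positions):
#   1-3 U1-U3 | 4-23 P20..P1 | 24-29 NAGNAG | 30-32 D1-D3
WINDOW_LEN = 32


def label_intron(e_support: int, i_support: int) -> Optional[str]:
    """Label an intron as EI, E or I from EST/cDNA support counts.

    Returns None when the evidence does not meet the criteria.
    """
    # EI: each isoform is supported by at least 2 ESTs/cDNAs.
    if e_support >= 2 and i_support >= 2:
        return "EI"
    # Assumption: a constitutive call needs >= 10 supporting ESTs/cDNAs
    # and no evidence at all for the other acceptor.
    if e_support >= 10 and i_support == 0:
        return "E"
    if i_support >= 10 and e_support == 0:
        return "I"
    return None


def extract_features(window: str, intron_length: int) -> Dict[str, object]:
    """Encode a 32-nt window as 28 nucleotide features plus intron length."""
    window = window.upper()
    if len(window) != WINDOW_LEN:
        raise ValueError(f"expected {WINDOW_LEN} nt, got {len(window)}")
    motif = window[23:29]
    if motif[1:3] != "AG" or motif[4:6] != "AG":
        raise ValueError("no NAGNAG motif at positions 24-29")

    features: Dict[str, object] = {}
    for k in range(3):
        features[f"U{k + 1}"] = window[k]
    for k in range(20):
        # window[3] is P20 (farthest from the motif); window[22] is P1.
        features[f"P{20 - k}"] = window[3 + k]
    features["N1"] = motif[0]  # N of the first NAG
    features["N2"] = motif[3]  # N of the second NAG
    for k in range(3):
        features[f"D{k + 1}"] = window[29 + k]
    features["intron_length"] = intron_length
    return features


if __name__ == "__main__":
    # Made-up example sequence, for illustration only.
    example = "GTA" + "TTTTC" * 4 + "CAGGAG" + "ATG"
    print(label_intron(e_support=5, i_support=3))
    print(extract_features(example, intron_length=120))
```

The resulting dictionaries could then be passed to any classifier that accepts categorical features; the four-valued nucleotide features would typically need to be one-hot encoded first for libraries that expect numeric input.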