Progress report
Yiming Zhang
02/10/2012
All AS events in ASIP
• Intron retention
• Exon skipping
• Alternative Acceptor site
NAGNAG AltA
• Alternative Donor site
GYNGYN AltD
• Alternative both sites (AltP)
NAGNAG alternative splicing
Figure 1. NAGNAG alternative splicing with E and I sites and isoforms.
NAGNAG alternative splicing can result in one of three possibilities (Figure 1): constitutive
use of the first acceptor (the so-called exonic, or “E” variant), constitutive
use of the second acceptor (the so-called intronic, or “I” variant), or use of both
acceptors, that is, alternative splicing (the “EI” variant).
Sinha et al. 2010
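As a minimal sketch of what a NAGNAG motif looks like at the sequence level, the snippet below scans a DNA string for two consecutive NAG triplets. The function name `find_nagnag` and the example sequence are hypothetical and only illustrate the pattern; they are not part of ASIP or of the cited work.

```python
import re

# Two consecutive NAG triplets, where N is any nucleotide.
# The lookahead lets overlapping motifs be reported as well.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")

def find_nagnag(sequence):
    """Return (start, motif) for every NAGNAG motif in the sequence (0-based start)."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(sequence)]

# Hypothetical sequence containing one tandem acceptor.
hits = find_nagnag("TTTCAGTAGGT")
print(hits)
```

In this toy sequence the first AG of the motif would correspond to the first acceptor and the second AG to the second acceptor described above.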
GYNGYN alternative splicing
Figure 2. GYNGYN alternative splicing with e and i sites and isoforms.
Hiller et al. 2006
All introns
Diagram: classification of all introns.
• Constitutive: NAGNAG-E, NAGNAG-I, GYNGYN-e, GYNGYN-i, ……
• Alternative: IntronR, ExonS, AltA (NAGNAG-ei, ……), AltD (GYNGYN-ei, ……), ……
• Unclear, Multiple AS, ……
Intron statistics from ASIP
Cons. (constitutive introns), EST>=4
Species      all   NAG-E   NAG-I   GYN-e   GYN-i
AT         36040    1494     522    1283     489
BD           760      30       5      40      14
GM         16203     583     212     621     296
LJ          1056      39      13      34      13
MT          6351     195      74     246      89
OS         32393    1299     400    1406     705
PP         15306     533     331     748     241
PT          3347     102      36     135      71
SB          9014     317      96     403     168
SL          1681      69      26      58      27
VV          9000     349     139     382     208
Total     131151    5010    1854    5356    2321

Cons. (constitutive introns), EST>=10
Species      all   NAG-E   NAG-I   GYN-e   GYN-i
AT         13366     574     199     467     152
BD           220      10       1       5       7
GM          5697     189      74     232      94
LJ           406       7       3      17       7
MT          2509      85      32     100      39
OS         14409     568     160     637     303
PP          6606     220     107     330      88
PT          1129      40      11      36      25
SB          3366     117      31     163      46
SL           689      24      12      28      14
VV          4091     148      55     177      88
Total      52488    1982     685    2192     863

Alt. (alternative introns), EST>=2
Species    IntronR  AltA_all  AltA_NAG  AltD_all  AltD_GYN      AltP     ExonS
AT            4197       648       128       305         9        51       476
BD              50        10         4         2         0         0         5
GM            1669       189        31        99         0        32       371
LJ              76        11         3         5         1         1        11
MT             327        49         6        47         1         0        72
OS            8233       926       178       575        10       188      2100
PP            2004       404        44       453         3        27       429
PT             256        38         3        15         0         2        79
SB            1671        92        27        64         1        39       304
SL              91        24         4        16         0         0        59
VV             806        97        30        71         2        16       238
Total        19380      2488       458      1652        27       356      4144
Table 1. Intron statistics from ASIP. Four species with only a small amount of data are not listed here. All statistics are
intron-based rather than event-based, which means that redundancy has been removed. The most common type of
alternative intron is IntronR and the second most common is ExonS. NAGNAG AS accounts for a much larger share of AltA
than GYNGYN AS does of AltD.
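The last statement of the caption can be checked directly from the "Total" row of Table 1. The short sketch below computes the two shares; the dictionary keys simply reuse the column names of the table.

```python
# Totals from the "Total" row of Table 1 (alternative introns, EST>=2).
totals = {"AltA_all": 2488, "AltA_NAG": 458, "AltD_all": 1652, "AltD_GYN": 27}

def share(part, whole):
    """Fraction of 'whole' accounted for by 'part'."""
    return part / whole

nag_share = share(totals["AltA_NAG"], totals["AltA_all"])  # NAGNAG among AltA introns
gyn_share = share(totals["AltD_GYN"], totals["AltD_all"])  # GYNGYN among AltD introns

print(f"NAGNAG share of AltA: {nag_share:.1%}")
print(f"GYNGYN share of AltD: {gyn_share:.1%}")
```

With these totals, NAGNAG introns make up about 18.4% of AltA introns, while GYNGYN introns make up about 1.6% of AltD introns.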
Background
NAGNAG alternative splicing, which can insert or delete a single amino acid in
the protein, is very common and well studied in animals.
• The NAGNAG motif is present in 30% of human genes and is functional in at
least 5% of the genes.
Hiller et al. 2004
• Because NAGNAG AS is frame-preserving, the vast majority of cases should lead to
different proteins. Studies so far have found both cases in which
such proteins vary in function and cases in which there is no
noticeable difference.
Akerman et al. 2006; Iida et al. 2008
• GO analyses in some studies show that the GO term DNA
binding is statistically significant among genes with AS-NAGNAG events, and that more than
half of all AS-NAGNAG events affect polar amino acid residues.
Iida et al. 2008; Sinha et al. 2010
Background
Few studies of NAGNAG AS in plants exist so far (only 3 species:
Arabidopsis, rice and Physcomitrella).
• One study found 321 and 372 AS-NAGNAG events in Arabidopsis and rice,
respectively. Another study found that 6% of all introns and 21% of all
annotated genes in Arabidopsis harbor a genomic NAGNAG acceptor
motif.
Iida et al. 2008
Schindler et al. 2008
• In addition, the GO analysis agrees with the previous study in human:
the GO term DNA binding is statistically significant. One study
indicates that NAGNAG acceptors occur frequently in the Arabidopsis
genome and are particularly prevalent in SR and SR-related protein-coding
genes.
Sinha et al. 2010
Schindler et al. 2008
Background
The state-of-the-art in silico studies on predicting NAGNAG splice sites
were done by Sinha's group for both human and plant species. They
achieved high, balanced specificity and sensitivity in both.
• The most informative features they found are the nucleotides in the
NAGNAG and in its immediate vicinity, along with the splice site scores.
• The model they trained on human data also achieves a high AUC on plant
data, which suggests that NAGNAG splicing in plants is similar to that in animals.
Sinha et al. 2009, 2010
NAGNAG dataset
I tried to predict NAGNAG events (that is, to predict the EI, I or E form) using
Random Forest on a dataset I generated from ASIP.
• Strict criteria were used to identify NAGNAG events in the ASIP
database: E and I events need support from at least 10 ESTs or cDNAs, and
EI events need at least 2 ESTs or cDNAs supporting each isoform.
• After removing redundancy, I got 458 EI-form alternative NAGNAG introns,
1988 E-form constitutive introns and 685 I-form constitutive introns in 15
plant species.
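The selection criteria above can be sketched as a small labeling function. This is only an illustration: the function name `label_nagnag` is hypothetical, and the requirement that a constitutive (E or I) intron has no transcript supporting the other acceptor is an assumption, since the slide states only the minimum support thresholds.

```python
MIN_CONSTITUTIVE_SUPPORT = 10  # ESTs/cDNAs required for an E or I event
MIN_ISOFORM_SUPPORT = 2        # ESTs/cDNAs required per isoform for an EI event

def label_nagnag(e_support, i_support):
    """Return 'EI', 'E', 'I', or None if the intron meets none of the criteria.

    e_support / i_support: number of ESTs or cDNAs using the E / I acceptor.
    """
    if e_support >= MIN_ISOFORM_SUPPORT and i_support >= MIN_ISOFORM_SUPPORT:
        return "EI"
    # Assumption: constitutive means the other acceptor is never observed.
    if e_support >= MIN_CONSTITUTIVE_SUPPORT and i_support == 0:
        return "E"
    if i_support >= MIN_CONSTITUTIVE_SUPPORT and e_support == 0:
        return "I"
    return None  # too little or ambiguous evidence; left out of the dataset

print(label_nagnag(12, 0), label_nagnag(0, 15), label_nagnag(3, 2), label_nagnag(12, 1))
```

Under these assumptions, an intron with 12 transcripts for one acceptor and a single transcript for the other is discarded rather than forced into a class.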
Features
Figure 3. A total of 28 features each represent a nucleotide and thus
have four possible values (A, C, G, T). U1, U2, U3 are the first three nucleotides in
the upstream exon. D1, D2, D3 are the first three nucleotides in the downstream
exon. Because a weak polypyrimidine tract (PPT) can contribute to AS, P1-P20 cover the PPT
upstream of the NAGNAG. Finally, I also use intron length as an additional feature.
Classifier evaluation
A Random Forest with 200 trees was used and evaluated with 5-fold cross-validation.
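The 5-fold split can be sketched as follows. The classifier itself is not shown; the snippet only illustrates how the dataset could be partitioned. Stratifying the folds by class is an assumption (the slides do not say whether the folds were stratified), and the function name `stratified_folds` is hypothetical. The class sizes are those of the dataset described earlier.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds with near-equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Class sizes from the NAGNAG dataset: 458 EI, 1988 E and 685 I introns.
labels = ["EI"] * 458 + ["E"] * 1988 + ["I"] * 685
folds = stratified_folds(labels)

for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on `train` and evaluated on `test_fold` here.
    print(len(train), len(test_fold))
```

Each example is held out exactly once, so the reported metrics are computed over all introns in the dataset.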
Class   TP rate   FP rate   Precision   Recall   F-measure   ROC area
E       0.992     0.089     0.951       0.992    0.971       0.995
I       0.953     0.023     0.92        0.953    0.936       0.995
EI      0.657     0.017     0.87        0.657    0.749       0.967
The evaluation results agree closely with Sinha et al. (for Physcomitrella), who
report AUC = 0.96, 0.99 and 0.98 for the EI, E and I forms, respectively.
Figure 4. The EI class (alternative splicing) is harder to predict (AUC = 0.967) than the two constitutive
variants, E and I (AUC = 0.995 for both).
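As a consistency check on the evaluation table, the F-measure of each class can be recomputed from its precision and recall as their harmonic mean. The sketch below uses the values reported on this slide; the function name `f_measure` is only illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, as reported in the evaluation table.
reported = {"E": (0.951, 0.992), "I": (0.92, 0.953), "EI": (0.87, 0.657)}

recomputed = {cls: round(f_measure(p, r), 3) for cls, (p, r) in reported.items()}
print(recomputed)
```

The recomputed values match the F-measure column (0.971, 0.936 and 0.749), and they show that the lower F-measure of the EI class comes from its recall of 0.657 rather than from its precision.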
Most informative features
Figure 5. Most informative features according to information gain. (Bar chart; vertical axis from 0 to 0.9; features shown: N2, N1, P_-1, P_-2, D1, P_-7, D2.)
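Information gain for a nominal feature such as a nucleotide position can be sketched as the entropy of the class labels minus the entropy that remains after splitting on the feature. The code below is a generic illustration on toy data, not the tool used to produce Figure 5; the function names and the toy values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remaining = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remaining

# Toy data: one feature separates the classes perfectly, the other not at all.
toy_labels = ["E", "E", "I", "I"]
informative = ["A", "A", "C", "C"]
uninformative = ["A", "C", "A", "C"]

print(information_gain(informative, toy_labels))
print(information_gain(uninformative, toy_labels))
```

A feature that separates the classes completely gains the full label entropy, while a feature that is independent of the class gains nothing.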
Sequence Logos
Figure 6a-6d. Sequence logos of NAGNAG splice sites. 6a: E sites; 6b: I sites; 6c: EI sites; 6d: all splice
sites. Positions 1-3 are U1-U3. Positions 4-23 are P20-P1. Positions 30-32 are D1-D3.
Conclusion
• NAGNAG-AS can be predicted with high accuracy. Using carefully
constructed training and test datasets, an in silico performance of AUC =
0.967, 0.995 and 0.995 was achieved for the EI, E and I forms, respectively.
• The most informative features are the nucleotides in the NAGNAG and in
its immediate vicinity.
• NAGNAG AS in plants is similar to that in animals and is largely dependent
on the splice site and its immediate neighborhood.