Gene Feature Recognition

For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician
Limsoon Wong
NUS-KI Course on Bioinformatics, Nov 2005
Recognition of Splice Sites
A simple example to start the day 
Splice Sites
[Figure: donor and acceptor splice sites]
Copyright 2005 © Limsoon Wong
Acceptor Site (Human Genome)
• If we align all known acceptor sites (with their splice junction site aligned), we have the following nucleotide distribution
Image credit: Xu
• Acceptor site: (CAG or TAG) | coding region
Donor Site (Human Genome)
• If we align all known donor sites (with their splice junction site aligned), we have the following nucleotide distribution
Image credit: Xu
• Donor site: coding region | GT
What Positions Have “High” Information Content?
• For a weight matrix, the information content of each column is calculated as
– IC = Σ_{X ∈ {A,C,G,T}} Prob(X) * log(Prob(X)/0.25)
• When a column has evenly distributed nucleotides, its information content is lowest (zero)
• Only need to look at positions having high information content
Information Content Around Donor Sites in Human Genome
Image credit: Xu
• Information content (log taken to base 10)
– column –3 = .34*log(.34/.25) + .363*log(.363/.25) + .183*log(.183/.25) + .114*log(.114/.25) = 0.04
– column –1 = .092*log(.092/.25) + .033*log(.033/.25) + .803*log(.803/.25) + .073*log(.073/.25) = 0.30
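The two column values above can be checked with a short script. This is a minimal sketch, assuming base-10 logarithms and a plain sum of Prob(X)*log(Prob(X)/0.25) with no leading minus sign, since that combination reproduces 0.04 and 0.30. The function name and the A, C, G, T ordering of the probabilities are illustrative assumptions.

```python
from math import log10

def information_content(probs, background=0.25):
    """Sum of p * log10(p / background) over the nucleotides of one column.

    Zero-probability entries contribute nothing (limit of p*log p as p -> 0).
    """
    return sum(p * log10(p / background) for p in probs if p > 0)

# Column probabilities quoted on the slide (order assumed to be A, C, G, T).
column_minus_3 = [0.34, 0.363, 0.183, 0.114]
column_minus_1 = [0.092, 0.033, 0.803, 0.073]

ic_minus_3 = information_content(column_minus_3)
ic_minus_1 = information_content(column_minus_1)
ic_uniform = information_content([0.25, 0.25, 0.25, 0.25])

print(f"column -3: {ic_minus_3:.2f}")
print(f"column -1: {ic_minus_1:.2f}")
print(f"uniform column: {ic_uniform:.2f}")
```

A uniform column scores zero, which is why only the conserved positions near the junction stand out.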
Weight Matrix Model for Splice Sites
• Weight matrix model
– build a separate weight matrix for each of donor site, acceptor site, and translation start site
– use positions of high information content
Image credit: Xu
Splice Site Prediction: A Procedure
Image credit: Xu
• Add up the frequency of the corresponding letter at each corresponding position:
– AAGGTAAGT: .34 + .60 + .80 + 1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24
– TGTGTCTCA: .11 + .12 + .03 + 1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56
• Predict a splice site when the score passes some threshold
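The scoring procedure can be sketched as follows. The full weight matrix is in a figure that is not reproduced here, so the table below holds only the frequencies quoted in the two worked examples; every other entry is unknown and raises an error. The threshold of 4.0 and all names are hypothetical, chosen only to separate the two examples.

```python
# Partial donor-site weight matrix: one dict per position of the 9-mer.
# Only entries quoted in the two worked examples are present (assumption:
# the figure's remaining entries are simply not available here).
PARTIAL_DONOR_MATRIX = [
    {"A": 0.34, "T": 0.11},
    {"A": 0.60, "G": 0.12},
    {"G": 0.80, "T": 0.03},
    {"G": 1.0},
    {"T": 1.0},
    {"A": 0.52, "C": 0.02},
    {"A": 0.71, "T": 0.07},
    {"G": 0.81, "C": 0.05},
    {"T": 0.46, "A": 0.16},
]

HYPOTHETICAL_THRESHOLD = 4.0  # illustrative cut-off, not from the lecture


def donor_score(candidate, matrix=PARTIAL_DONOR_MATRIX):
    """Add up the frequency of each letter at its position."""
    if len(candidate) != len(matrix):
        raise ValueError("candidate must have one letter per matrix column")
    total = 0.0
    for position, letter in enumerate(candidate):
        if letter not in matrix[position]:
            raise KeyError(f"no frequency known for {letter} at position {position}")
        total += matrix[position][letter]
    return total


def is_donor_site(candidate, threshold=HYPOTHETICAL_THRESHOLD):
    """Predict a donor site when the summed score reaches the threshold."""
    return donor_score(candidate) >= threshold


for nine_mer in ("AAGGTAAGT", "TGTGTCTCA"):
    print(nine_mer, round(donor_score(nine_mer), 2), is_donor_site(nine_mer))
```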
Recognition of Translation Initiation Sites
An introduction to the world’s simplest TIS recognition system
A simple approach to accuracy and understandability
Translation Initiation Site
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
(Row-end coordinates: 80, 160, 240)
• What makes the second ATG the TIS?
Approach
• Training data gathering
• Signal generation
– k-grams, distance, domain know-how, ...
• Signal selection
– Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments
Signal Generation
• k-grams (i.e., k consecutive letters)
– k = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Upstream, downstream vs. anywhere in window
– In-frame vs. any frame
[Figure: chart over A, C, G, T for seq1, seq2, seq3; axis from 0 to 3]
Signal Generation: An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
(Row-end coordinates: 80, 160, 240)
• Window = 100 bases
• In-frame, downstream
– GCT = 1, TTT = 1, ATG = 1…
• Any-frame, downstream
– GCT = 3, TTT = 2, ATG = 2…
• In-frame, upstream
– GCT = 2, TTT = 0, ATG = 0, ...
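The counting scheme above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the downstream window starts right after the candidate ATG, the upstream window ends right before it, and "in-frame" means the k-gram starts a multiple of 3 bases away from the ATG. The function name and the toy sequence are hypothetical; the slide's own sequence is not used because its rows may not be complete in this text.

```python
from collections import Counter


def count_kgrams(seq, atg_pos, k=3, window=100, direction="downstream", in_frame=True):
    """Count k-grams in a window upstream or downstream of the ATG at atg_pos."""
    if direction == "downstream":
        lo = atg_pos + 3                 # first base after the ATG
        hi = min(len(seq), lo + window)
    elif direction == "upstream":
        hi = atg_pos                     # window ends just before the ATG
        lo = max(0, hi - window)
    else:
        raise ValueError("direction must be 'upstream' or 'downstream'")

    counts = Counter()
    for start in range(lo, hi - k + 1):
        # In-frame k-grams start a multiple of 3 bases from the ATG.
        if in_frame and (start - atg_pos) % 3 != 0:
            continue
        counts[seq[start:start + k]] += 1
    return counts


# Toy sequence (hypothetical): 6 upstream bases, the ATG, 11 downstream bases.
toy = "GCTGCT" + "ATG" + "GCTTTTAGCTG"
atg = 6

down_in = count_kgrams(toy, atg, direction="downstream", in_frame=True)
down_any = count_kgrams(toy, atg, direction="downstream", in_frame=False)
up_in = count_kgrams(toy, atg, direction="upstream", in_frame=True)

print("in-frame downstream:", dict(down_in))
print("any-frame downstream:", dict(down_any))
print("in-frame upstream:", dict(up_in))
```

Each (k-gram, direction, framing) combination becomes one numeric feature of the candidate ATG.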
An Example File Resulting From Feature Generation
Too Many Signals
• For each value of k, there are 4^k * 3 * 2 k-gram features
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
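The arithmetic is easy to reproduce. In this sketch the factor 3 is read as the three window options (upstream, downstream, anywhere) and the factor 2 as the two framings (in-frame, any frame); that reading is an assumption based on the signal-generation options listed earlier, and the parameter names are illustrative.

```python
def num_kgram_features(k, window_options=3, framings=2):
    """4^k possible k-grams, times the assumed window and framing options."""
    return 4 ** k * window_options * framings


per_k = [num_kgram_features(k) for k in range(1, 6)]
total_features = sum(per_k)

print(per_k)
print(total_features)
```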
Signal Selection (Basic Idea)
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
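One common way to turn this idea into a score is a two-sample t-statistic: the gap between class means (inter-class distance) divided by a term built from the within-class variances (intra-class distance). The sketch below uses the unequal-variance form as an assumption; the lecture's own formula is not reproduced in this text, and the feature values are made-up toy numbers.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(values_a, values_b):
    """|mean difference| scaled by the pooled standard error of the two classes.

    Uses sample variances, so each class needs at least two values.
    """
    gap = abs(mean(values_a) - mean(values_b))
    spread = sqrt(variance(values_a) / len(values_a) + variance(values_b) / len(values_b))
    return gap / spread


# Toy feature values (hypothetical) for TIS and non-TIS samples.
good_tis, good_non = [5, 6, 5, 6], [1, 2, 1, 2]   # classes well separated
poor_tis, poor_non = [1, 6, 2, 5], [2, 5, 1, 6]   # classes overlap completely

t_good = t_statistic(good_tis, good_non)
t_poor = t_statistic(poor_tis, poor_non)

print(f"well-separated signal: t = {t_good:.3f}")
print(f"overlapping signal:    t = {t_poor:.3f}")
```

Signals are then ranked by this score and the top-ranked ones are kept.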
Signal Selection (e.g., t-statistics)
Signal Selection (e.g., χ²)
Signal Selection (eg., CFS)
• Instead of scoring individual signals, how about
scoring a group of signals as a whole?
• CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly
correlated with the class, and yet uncorrelated
with each other
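The group-scoring idea can be illustrated with the merit heuristic commonly associated with CFS, which rewards high average signal-class correlation and penalises high average signal-signal correlation. The formula and the correlation values below are supplied for illustration and are not taken from the slides.

```python
from math import sqrt


def cfs_merit(num_signals, mean_signal_class_corr, mean_signal_signal_corr):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff) for a group of k signals."""
    k = num_signals
    return (k * mean_signal_class_corr) / sqrt(k + k * (k - 1) * mean_signal_signal_corr)


# Two hypothetical groups of 3 signals, equally correlated with the class.
merit_independent = cfs_merit(3, 0.5, 0.0)   # signals uncorrelated with each other
merit_redundant = cfs_merit(3, 0.5, 0.9)     # signals nearly duplicate each other

print(f"independent group: {merit_independent:.3f}")
print(f"redundant group:   {merit_redundant:.3f}")
```

The redundant group scores lower even though each of its signals is individually just as predictive.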
Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
– TAA, TAG, TGA (stop codon)
– CTG, GAC, GAG, and GCC (codon bias?)
Signal Integration
• kNN
– Given a test sample, find the k training samples
that are most similar to it. Let the majority class
win
• SVM
– Given a group of training samples from two classes, determine a separating hyperplane that maximises the margin between the classes
• Naïve Bayes, ANN, C4.5, ...
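The kNN rule described above fits in a few lines. This is a minimal sketch, assuming Euclidean distance as the similarity measure and using made-up two-feature training samples; the names are illustrative.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k=3):
    """Return the majority class among the k training samples nearest to query.

    training is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda sample: dist(sample[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical feature values).
training = [
    ((0.0, 0.0), "non-TIS"),
    ((0.1, 0.2), "non-TIS"),
    ((0.2, 0.1), "non-TIS"),
    ((1.0, 1.0), "TIS"),
    ((0.9, 1.1), "TIS"),
    ((1.1, 0.9), "TIS"),
]

print(knn_predict(training, (0.95, 1.0)))
print(knn_predict(training, (0.05, 0.1)))
```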
Illustration of kNN (k=8)
• Neighborhood of the test sample: 5 of one class, 3 of the other, so the class with 5 votes wins
• Typical “distance” measure: [formula not preserved]
Image credit: Zaki
Using WEKA for TIS Prediction
Results (3-fold x-validation)
Classifier       TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%
3-NN*            73.2%          92.9%          77.2%          88.0%
* Using top 20 χ²-selected features from amino-acid features
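The four measures are linked through the class sizes, so the last two columns can be approximately recovered from the first two. The sketch below does this for two rows, using the 3312 TIS and 10191 non-TIS counts from the dataset slide. Treating the cross-validated percentages as if they came from one pooled confusion matrix is an assumption, so agreement is only expected to within rounding.

```python
NUM_TIS, NUM_NON_TIS = 3312, 10191  # class sizes from the dataset slide


def derived_measures(sensitivity, specificity):
    """Recover precision and accuracy from sensitivity, specificity and class sizes."""
    tp = sensitivity * NUM_TIS            # true positives
    tn = specificity * NUM_NON_TIS        # true negatives
    fp = NUM_NON_TIS - tn                 # false positives
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (NUM_TIS + NUM_NON_TIS)
    return precision, accuracy


nb_precision, nb_accuracy = derived_measures(0.843, 0.861)
svm_precision, svm_accuracy = derived_measures(0.739, 0.932)

print(f"Naive Bayes: precision {nb_precision:.1%}, accuracy {nb_accuracy:.1%}")
print(f"SVM:         precision {svm_precision:.1%}, accuracy {svm_accuracy:.1%}")
```

The low precision of Naïve Bayes despite its high sensitivity follows from the class imbalance: non-TIS ATGs outnumber TIS ATGs about three to one.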
Validation Results (on Chr X and Chr 21)
[Figure: validation results comparing our method with ATGpr]
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset
Technique Comparisons
• Pedersen & Nielsen [ISMB’97]
– 85% accuracy
– Neural network
– No explicit features
• Zien [Bioinformatics’00]
– 88% accuracy
– SVM + kernel engineering
– No explicit features
• Hatzigeorgiou [Bioinformatics’02]
– 94% accuracy (with scanning rule)
– Multiple neural networks
– No explicit features
• Our approach
– 89% accuracy (94% with scanning rule)
– Explicit feature generation
– Explicit feature selection
– Use any machine learning method w/o any form of complicated tuning
Recognition of Transcription Start Sites
An introduction to the world’s best TSS recognition system
A heavy tuning approach
Transcription Start Site
Structure of Dragon Promoter Finder
• Window size: -200 to +50
• Model selected based on desired sensitivity
Each model has two submodels based on GC content
• (C+G) = (#C + #G) / Window Size
• GC-rich submodel
• GC-poor submodel
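The GC-content computation and the choice between the two submodels can be sketched as below. The slides do not state the cut-off between GC-rich and GC-poor, so the 0.5 threshold is purely hypothetical, as are the function names.

```python
HYPOTHETICAL_GC_THRESHOLD = 0.5  # not given in the lecture; illustration only


def gc_content(window):
    """(C+G) = (#C + #G) / window size."""
    if not window:
        raise ValueError("window must not be empty")
    window = window.upper()
    return (window.count("C") + window.count("G")) / len(window)


def choose_submodel(window, threshold=HYPOTHETICAL_GC_THRESHOLD):
    """Route the window to the GC-rich or GC-poor submodel."""
    return "GC-rich" if gc_content(window) >= threshold else "GC-poor"


for window in ("GCGCGCAT", "ATATATGC"):
    print(window, round(gc_content(window), 2), choose_submodel(window))
```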
Data Analysis Within Submodel
• Sensors: sp (promoter), se (exon), si (intron)
• Each is a k-gram (k = 5) positional weight matrix
Promoter, Exon, Intron Sensors
• These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers)
• They are calculated as s, using promoter, exon, and intron data respectively
[Formula for s: image not preserved; its annotated terms are listed below]
– Window size
– Pentamer at ith position in input
– jth pentamer at ith position in training window
– Frequency of jth pentamer at ith position in training window
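To make the terms concrete, here is one plausible way to build and apply a positional weight matrix of k-grams. It is a sketch under a loud assumption: the score is the summed training frequency of the k-gram seen at each input position, divided by the best achievable sum. The exact formula for s is not preserved in this text, so this normalisation, the function names, and the toy windows are all illustrative rather than the Dragon Promoter Finder definition.

```python
from collections import Counter


def train_positional_matrix(training_windows, k=5):
    """matrix[i][kgram] = fraction of training windows with kgram at position i."""
    num_positions = len(training_windows[0]) - k + 1
    matrix = []
    for i in range(num_positions):
        counts = Counter(window[i:i + k] for window in training_windows)
        matrix.append({kgram: n / len(training_windows) for kgram, n in counts.items()})
    return matrix


def sensor_score(window, matrix, k=5):
    """Matched frequency mass divided by the maximum possible mass (assumed form)."""
    matched = sum(matrix[i].get(window[i:i + k], 0.0) for i in range(len(matrix)))
    best = sum(max(column.values()) for column in matrix)
    return matched / best


# Toy training windows of length 7, giving 3 pentamer positions (hypothetical data).
toy_training = ["ACGTACG", "ACGTACG", "ACGTTTT"]
toy_matrix = train_positional_matrix(toy_training)

score_common = sensor_score("ACGTACG", toy_matrix)
score_rare = sensor_score("ACGTTTT", toy_matrix)
score_unseen = sensor_score("TTTTTTT", toy_matrix)
print(score_common, score_rare, score_unseen)
```

Training the same routine on promoter, exon, and intron windows would give three separate sensors.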
Data Preprocessing & ANN
• Simple feedforward ANN trained by the Bayesian regularisation method
• Inputs sE, sI, sIE with weights wi
• net = Σ_i si * wi
• Output tanh(net), compared with a tuned threshold
• tanh(x) = (e^x - e^-x) / (e^x + e^-x)
[Figure label: Tuning parameters]
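The forward pass of such a unit is short enough to write out. This is a minimal sketch of a single tanh unit: the sensor values, weights, and threshold below are made-up numbers, since the trained values are not given in the slides, and the function names are illustrative.

```python
from math import exp, tanh


def tanh_by_definition(x):
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x), written out for comparison."""
    return (exp(x) - exp(-x)) / (exp(x) + exp(-x))


def unit_output(sensor_values, weights):
    """net = sum of s_i * w_i, passed through tanh."""
    net = sum(s * w for s, w in zip(sensor_values, weights))
    return tanh(net)


def predict_tss(sensor_values, weights, threshold):
    """Report a TSS when the unit output exceeds the tuned threshold."""
    return unit_output(sensor_values, weights) > threshold


# Hypothetical sensor signals and weights, for illustration only.
signals = [0.5, -0.2, 0.1]
weights = [1.0, 2.0, 3.0]   # net = 0.5 - 0.4 + 0.3 = 0.4

output = unit_output(signals, weights)
print(round(output, 4), predict_tss(signals, weights, threshold=0.3))
```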
Accuracy Comparisons
[Figures: accuracy with C+G submodels vs. without C+G submodels]
References (TIS Recognition)
• A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226--233, 1997
• H. Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139--168, 2003
• A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799--807, 2000
• A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343--350, 2002
References (TSS Recognition)
• V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323--332, 2003
• J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7:861--878, 1997
• A. G. Pedersen et al., “The biology of eukaryotic promoter prediction: a review”, Computers & Chemistry 23:191--207, 1999
• M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599--606, 2000