For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician.

Gene Feature Recognition
Limsoon Wong
NUS-KI Course on Bioinformatics, Nov 2005
Copyright 2005 © Limsoon Wong

Recognition of Splice Sites
A simple example to start the day

Splice Sites
[Figure: donor and acceptor splice sites]

Acceptor Site (Human Genome)
• If we align all known acceptor sites (with their splice junction site aligned), we have the following nucleotide distribution
  [Figure: nucleotide distribution around acceptor sites. Image credit: Xu]
• Acceptor site: (CAG or TAG) | coding region, where | marks the splice junction

Donor Site (Human Genome)
• If we align all known donor sites (with their splice junction site aligned), we have the following nucleotide distribution
  [Figure: nucleotide distribution around donor sites. Image credit: Xu]
• Donor site: coding region | GT

What Positions Have “High” Information Content?
• For a weight matrix, the information content of each column is calculated as
  – $\sum_{X \in \{A,C,G,T\}} \mathrm{Prob}(X) \cdot \log_{10}(\mathrm{Prob}(X)/0.25)$
• When a column has evenly distributed nucleotides, its information content is lowest (zero)
• Only need to look at positions having high information content

Information Content Around Donor Sites in Human Genome
[Figure: information content per position around donor sites. Image credit: Xu]
• Information content
  – column –3 = $.34 \log_{10}(.34/.25) + .363 \log_{10}(.363/.25) + .183 \log_{10}(.183/.25) + .114 \log_{10}(.114/.25) = 0.04$
  – column –1 = $.092 \log_{10}(.092/.25) + .033 \log_{10}(.033/.25) + .803 \log_{10}(.803/.25) + .073 \log_{10}(.073/.25) = 0.30$

Weight Matrix Model for Splice Sites
• Weight matrix model
  – build a weight matrix for donor, acceptor, translation start site, respectively
  – use positions of high information content
[Figure: weight matrix. Image credit: Xu]

Splice Site Prediction: A Procedure
[Figure: weight matrix applied to candidate sites. Image credit: Xu]
• Add up the frequencies of the corresponding letters in the corresponding positions:
  – AAGGTAAGT: .34 + .60 + .80 + 1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24
  – TGTGTCTCA: .11 + .12 + .03 + 1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56
• Make the prediction on a splice site based on some threshold

Recognition of Translation Initiation Sites
An introduction to the World’s simplest TIS recognition system
A simple approach to accuracy and understandability

Translation Initiation Site
[Figure: translation initiation site]

A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
[Line-end position markers on the slide: 80, 160, 240]
• What makes the second ATG the TIS?

Approach
• Training data gathering
• Signal generation
  – k-grams, distance, domain know-how, ...
• Signal selection
  – Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
  – SVM, ANN, PCL, CART, C4.5, kNN, ...

Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments

Signal Generation
• K-grams (i.e., k consecutive letters)
  – K = 1, 2, 3, 4, 5, …
  – Window size vs. fixed position
  – Upstream, downstream vs. anywhere in window
  – In-frame vs. any frame
[Figure: bar chart over A, C, G, T for seq1, seq2, seq3; y-axis from 0 to 3]

Signal Generation: An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
[Line-end position markers on the slide: 80, 160, 240]
• Window = 100 bases
• In-frame, downstream
  – GCT = 1, TTT = 1, ATG = 1, …
• Any-frame, downstream
  – GCT = 3, TTT = 2, ATG = 2, …
• In-frame, upstream
  – GCT = 2, TTT = 0, ATG = 0, ...

An Example File Resulting From Feature Generation
[Figure: example feature file; not recovered]

Too Many Signals
• For each value of k, there are $4^k \times 3 \times 2$ k-gram features
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms

Signal Selection (Basic Idea)
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance

Signal Selection (e.g., t-statistics)
[Formula shown as an image on the slide; not recovered]

Signal Selection (e.g., χ²)
[Formula shown as an image on the slide; not recovered]

Signal Selection (e.g., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS
  – Correlation-based Feature Selection
  – A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other

Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
  – TAA, TAG, TGA (stop codon)
  – CTG, GAC, GAG, and GCC (codon bias?)

Signal Integration
• kNN
  – Given a test sample, find the k training samples that are most similar to it. Let the majority class win
• SVM
  – Given a group of training samples from two classes, determine a separating plane that maximises the margin of separation between the classes
• Naïve Bayes, ANN, C4.5, ...

Illustration of kNN (k=8)
[Figure: neighborhood of the test sample containing 5 training samples of one class and 3 of the other; the class symbols and the typical “distance” measure formula were images and are not recovered. Image credit: Zaki]

Using WEKA for TIS Prediction
[Figure: WEKA screenshot; not recovered]

Results (3-fold x-validation)
                  Sensitivity     Specificity     Precision       Accuracy
                  TP/(TP + FN)    TN/(TN + FP)    TP/(TP + FP)
Naïve Bayes       84.3%           86.1%           66.3%           85.7%
SVM               73.9%           93.2%           77.9%           88.5%
Neural Network    77.6%           93.2%           78.8%           89.4%
Decision Tree     74.0%           94.4%           81.1%           89.4%
3-NN*             73.2%           92.9%           77.2%           88.0%
* Using top 20 χ²-selected features from amino-acid features

Validation Results (on Chr X and Chr 21)
[Figure: comparison of our method and ATGpr]
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

Technique Comparisons
• Pedersen & Nielsen [ISMB’97]
  – 85% accuracy
  – Neural network
  – No explicit features
• Zien [Bioinformatics’00]
  – 88% accuracy
  – SVM + kernel engineering
  – No explicit features
• Hatzigeorgiou [Bioinformatics’02]
  – 94% accuracy (with scanning rule)
  – Multiple neural networks
  – No explicit features
• Our approach
  – 89% accuracy (94% with scanning rule)
  – Explicit feature generation
  – Explicit feature selection
  – Use any machine learning method w/o any form of complicated tuning

Recognition of Transcription Start Sites
An introduction to the World’s best TSS recognition system
A heavy tuning approach

Transcription Start Site
[Figure: transcription start site]

Structure of Dragon Promoter Finder
[Figure: system structure]
• Window size: –200 to +50
• Model selected based on desired sensitivity

Each model has two submodels based on GC content
• GC-rich submodel
• GC-poor submodel
• $(C+G) = \frac{\#C + \#G}{\text{window size}}$

Data Analysis Within Submodel
• Sensor signals $s_p$, $s_e$, $s_i$
• K-gram (k = 5) positional weight matrix

Promoter, Exon, Intron Sensors
• These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers)
• They are calculated as s using promoter, exon, intron data respectively
  [Formula for s shown as an image on the slide; not recovered]
• Quantities annotated on the formula:
  – pentamer at the i-th position in the input
  – window size
  – frequency of the j-th pentamer at the i-th position in the training window
  – j-th pentamer at the i-th position in the training window

Data Preprocessing & ANN
• Data preprocessing uses tuning parameters
• Simple feedforward ANN trained by the Bayesian regularisation method
• Inputs $s_E$, $s_I$, $s_{IE}$ with weights $w_i$; the output $\tanh(net)$ is compared against a tuned threshold
• $net = \sum_i s_i \cdot w_i$
• $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

Accuracy Comparisons
[Figure: accuracy with C+G submodels vs. without C+G submodels]

References (TIS Recognition)
• A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226-233, 1997
• H. Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139-168, 2003
• A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799-807, 2000
• A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343-350, 2002

References (TSS Recognition)
• V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323-332, 2003
• J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7:861-878, 1997
• A. G. Pedersen et al., “The biology of eukaryotic promoter prediction - a review”, Computers & Chemistry 23:191-207, 1999
• M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599-606, 2000
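
Worked sketch: information content of a weight-matrix column

The column calculation from the slide “Information Content Around Donor Sites in Human Genome” can be checked with a few lines of Python. This is only a minimal sketch: it assumes base-10 logarithms and a uniform background of 0.25 (the combination that reproduces the slide's 0.04 and 0.30), and it assumes the four probabilities are listed in the order A, C, G, T. The name `information_content` is illustrative and does not come from the lecture.

```python
import math

def information_content(probs, background=0.25):
    """Sum of p * log10(p / background) over the nucleotide probabilities of one column."""
    # Letters with probability 0 contribute nothing (p * log p tends to 0 as p -> 0).
    return sum(p * math.log10(p / background) for p in probs if p > 0)

# Column probabilities as quoted on the slide (assumed order: A, C, G, T).
col_minus3 = [0.34, 0.363, 0.183, 0.114]
col_minus1 = [0.092, 0.033, 0.803, 0.073]

print(round(information_content(col_minus3), 2))  # 0.04
print(round(information_content(col_minus1), 2))  # 0.3
print(information_content([0.25] * 4))            # 0.0 for an evenly distributed column
```

The last line illustrates why an evenly distributed column is the least informative: every term is 0.25 * log10(1) = 0.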
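
Worked sketch: scoring a candidate donor site with a weight matrix

The procedure on the slide “Splice Site Prediction: A Procedure” (add up the frequency of each letter at its position, then apply a threshold) might look as follows. This is a hedged sketch, not the lecture's implementation: the matrix below is partial and contains only the frequencies quoted on the slide for the two example sequences, and the threshold of 4.0 is a hypothetical value chosen purely for illustration. The names `PARTIAL_DONOR_MATRIX`, `wmm_score` and `predict_donor` are likewise assumptions.

```python
# Partial donor-site weight matrix: ONLY the entries quoted on the slide.
# List index 0..8 corresponds to the nine aligned positions of a candidate site.
PARTIAL_DONOR_MATRIX = [
    {"A": 0.34, "T": 0.11},
    {"A": 0.60, "G": 0.12},
    {"G": 0.80, "T": 0.03},
    {"G": 1.0},
    {"T": 1.0},
    {"A": 0.52, "C": 0.02},
    {"A": 0.71, "T": 0.07},
    {"G": 0.81, "C": 0.05},
    {"T": 0.46, "A": 0.16},
]

def wmm_score(site, matrix=PARTIAL_DONOR_MATRIX):
    """Add up the frequency of each letter of `site` at its position in the matrix."""
    if len(site) != len(matrix):
        raise ValueError("site length must match the number of matrix columns")
    # A letter missing from this partial matrix raises KeyError rather than
    # silently scoring as zero.
    return sum(column[base] for column, base in zip(matrix, site))

def predict_donor(site, threshold=4.0):
    """Call a donor site when the score reaches the (hypothetical) threshold."""
    return wmm_score(site) >= threshold

for candidate in ("AAGGTAAGT", "TGTGTCTCA"):
    print(candidate, round(wmm_score(candidate), 2), predict_donor(candidate))
```

With a complete matrix estimated from aligned known sites, the same function would score any 9-mer; the threshold would then be tuned on held-out data.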
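
Worked sketch: generating 3-gram signals around a candidate TIS

The counts on the slide “Signal Generation: An Example” can be reproduced with a short script. This sketch rests on assumptions that are not stated in the lecture: the downstream window starts immediately after the candidate ATG, the upstream window is truncated at the start of the sequence, and in-frame upstream codons are aligned so that they end exactly at the ATG. Under these assumptions the counts for GCT, TTT and ATG match the slide. The function names are illustrative.

```python
from collections import Counter

# Sample cDNA HSU27655.1 as shown on the slide.
CDNA = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
    "GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA"
    "CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT"
)

def count_3grams(region, in_frame=False, offset=0):
    """Count 3-grams in `region`; in-frame counting steps codon by codon from `offset`."""
    step = 3 if in_frame else 1
    start = offset if in_frame else 0
    return Counter(region[i:i + 3] for i in range(start, len(region) - 2, step))

def tis_signals(seq, atg, window=100):
    """Return 3-gram counts upstream/downstream of the candidate ATG at index `atg`."""
    upstream = seq[max(0, atg - window):atg]
    downstream = seq[atg + 3:atg + 3 + window]  # assumption: excludes the ATG itself
    return {
        "in_frame_down": count_3grams(downstream, in_frame=True),
        "any_frame_down": count_3grams(downstream),
        # Align upstream codons so that the last one ends right before the ATG.
        "in_frame_up": count_3grams(upstream, in_frame=True, offset=len(upstream) % 3),
        "any_frame_up": count_3grams(upstream),
    }

first_atg = CDNA.find("ATG")
second_atg = CDNA.find("ATG", first_atg + 1)  # the annotated TIS is the second ATG
signals = tis_signals(CDNA, second_atg)
for name in ("in_frame_down", "any_frame_down", "in_frame_up"):
    print(name, {g: signals[name][g] for g in ("GCT", "TTT", "ATG")})
```

Each (k-gram, frame, direction) combination becomes one feature column, which is why the feature count grows as quickly as the slide “Too Many Signals” describes.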