SVM: Non-coding Neutral Sequences vs Regulatory Modules

Ying Zhang, BMB, Penn State
Ritendra Datta, CSE, Penn State
Bioinformatics – I, Fall 2005

Outline
- Background: Machine Learning & Bioinformatics
- Data Collection and Encoding
- Distinguishing sequences using SVM
- Results
- Discussion

Regulation: A Recurring Challenge
- Gene expression is under regulation
  - Right protein, right time, right amount, right location…
- Regulation: cis-element vs trans-element
  - Cis-element: non-coding functional sequence
  - Trans-element: proteins that interact with cis-elements
- Predicting cis-regulatory elements remains a challenge
  - Significant effort has been invested in the past
  - Current trends: TFBS clusters, pattern analysis

Alignments and Sequences: The Data
- Information: Sequence
  - Genetic information is encoded in the DNA sequence
  - Typical information: codon, binding site, …
  - Codon: ATG (Met), CGT (Arg), …
  - Binding sites: A/TGATAA/G (Gata1), …
- Evolutionary Information: Aligned Sequence
  - Similarity between species
  - Conservation ~ Function
  - Human: TCCTTATCAGCCATTACC
    Mouse: TCCTTATCAGCCACCACC

Problem
Given the genome sequence information, is it possible to automatically distinguish regulatory regions from other genomic non-coding neutral sequences using machine learning?

Machine Learning: The Tool
- Sub-field of A.I.
- Computer programs “learn” from experience, i.e. by analyzing data and corresponding behavior
- Confluence of Statistics, Mathematical Logic, Numerical Optimization
- Applied in Information Retrieval, Financial Analysis, Computer Vision, Speech Recognition, Robotics, Bioinformatics, etc.
[Figure: M.L. at the confluence of Statistics, Optimization, and Logic; applications: predicting genes, analyzing stocks, personalized WWW search]

Machine Learning: Types of Learning
- Supervised learning: learning statistical models from past sample-label data pairs, e.g. classification
- Unsupervised learning: building models to capture the inherent organization in data, e.g. clustering
- Reinforcement learning: building models from interactive feedback on how well the current model is doing, e.g. robotic learning

Machine Learning and Bioinformatics: The Confluence
Learning problems in Bioinformatics [ICML ’03]
- Protein folding and protein structure prediction
- Inference of genetic and molecular networks
- Gene-protein interactions
- Data mining from microarrays
- Functional and comparative genomics, etc.

Machine Learning and Bioinformatics: Sample Publications
- Identification of DNaseI hypersensitive sites in the human genome (may disclose the location of cis-regulatory sequences)
  W. S. Noble et al., “Predicting the in vivo signature of human gene regulatory sequences,” Bioinformatics, 2005.
- Functionally classifying genes based on gene expression data from DNA microarray hybridization experiments using SVMs
  M. P. S. Brown et al., “Knowledge-based analysis of microarray gene expression data by using support vector machines,” PNAS, 2000.
- Using log-odds ratios from Markov models for identifying regulatory regions in DNA sequences
  L. Elnitski et al., “Distinguishing Regulatory DNA From Neutral Sites,” Genome Research, 2003.
- Selection of informative genes using an SVM-based feature selection algorithm
  I. Guyon et al., “Gene selection for cancer classification using support vector machines,” Machine Learning, 2002.

Machine Learning and Bioinformatics: Books

Support Vector Machines: A Powerful Statistical Learning Technique
- Which of the linear separators is optimal?
- Choose the one that maximizes the margin between the classes
  [Figure: maximum-margin separator, with slack variables ξi marked]
- The classes in some datasets separate linearly with ease
  [Figure: 1-D data on the x axis, separable by a threshold]
- What about datasets that do not?
  [Figure: 1-D data on the x axis, not separable by a threshold]
- Solution: the kernel trick!
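A minimal sketch of the idea behind the kernel trick, using assumed toy data rather than anything from the experiments: 1-D points that no threshold on x can separate become linearly separable once each x is mapped to (x, x²). The names feature_map, poly_kernel, separable_by_threshold and the boundary value 2.5 are illustrative assumptions, not part of the original slides.

```python
# Illustrative sketch only: the toy data, the feature map and the boundary
# are assumptions chosen to mirror the 1-D picture, not the authors' setup.

def feature_map(x):
    """Map a 1-D point x to the 2-D point (x, x^2)."""
    return (x, x * x)


def poly_kernel(a, b):
    """Kernel equal to the dot product of feature_map(a) and feature_map(b).

    It is computed directly from a and b, without building the mapped points.
    """
    return a * b + (a * a) * (b * b)


def separable_by_threshold(xs, ys):
    """True if a single threshold on x puts each class on its own side."""
    ordered = [y for _, y in sorted(zip(xs, ys))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


# Toy data: class +1 lies on both sides of class -1 along the x axis.
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [+1, +1, -1, -1, -1, +1, +1]


def predict(x, boundary=2.5):
    """Classify with the horizontal line x^2 = boundary in the mapped space."""
    _, x_squared = feature_map(x)
    return +1 if x_squared > boundary else -1


predictions = [predict(x) for x in points]
print(separable_by_threshold(points, labels))  # linear separability in 1-D
print(predictions == labels)                   # separability after mapping
```

An SVM never needs the mapped points themselves, only dot products between them, which is what poly_kernel returns; the RBF kernel plays the same role for a much richer implicit feature space.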
[Figure: the same 1-D data mapped from the x axis to the (x, x²) plane, where it becomes linearly separable]

Experiments: Overview
- Classification in question: Regulatory regions (REG) vs Ancestral Repeats (AR)
- Two types of experiments:
  - Nucleotide sequences: ATCG
  - Alignments (reduced 5-symbol): SWVIG
    (S: match involving G & C, W: match involving A & T, G: gap, V: transversion, I: transition)
- Two datasets:
  - Elnitski et al. dataset
  - Dataset from Penn State CCGB
- Mapping sequences/alignments → real numbers
  - Frequencies of short k-mers (K = 1, 2, 3)
  - Normalizing factor: sequence length (ambiguous for K > 1)
  - Stability of variance: equal-length sequences (whenever possible)

Experiments: Feature Selection
- Total number of features:
  - Sequences: 4 + 4² + 4³ = 84
  - Alignments: 5 + 5² + 5³ = 155
- Relatively high dimensionality:
  - Curse of dimensionality: convergence of estimators is very slow
  - Over-fitting: poor generalization performance
- Solutions:
  - Dimension reduction, e.g. PCA
  - Feature selection, e.g. forward selection, backward elimination

Experiments: Training and Validation
- Training set: Elnitski et al. dataset
  - Sequences: 300 samples of 100 bp from each class (REG and AR)
  - Alignments: 300 samples of length 100 from each class
- SVM setup:
  - RBF kernel: k(x1, x2) = exp(−δ ‖x1 − x2‖²)
  - Implementation: LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
- Validation: N-fold cross-validation
  - Used in feature selection, parameter tuning, and testing

Results: The Elnitski et al. dataset
- Parameter selection
  - SVM parameters: δ and C
- Feature selection
  - Assessing feature importance
  - G-C normalization
  - Sequences: 10 out of 84
  - Symbols: 10 out of 155
- Accuracy scores
  - Overall
  - Ancestral Repeats (AR)
  - Regulatory Regions (Reg)

Results: SVM Parameter Selection
- Iterative selection procedure
  - Coarse selection: initial neighborhood
  - Fine-grained selection: brute force
- Validation set from data
- Within-loop CV
- Chosen parameters: δ = 1.6, C = 1.5

Results: Feature Selection - Sequence
[Figure: distribution of nucleotide frequencies of the top 9 most significant k-mers, chosen by one-dimensional SVMs]

Results: Feature Selection - Symbol
[Figure: distribution of 5-symbol frequencies of the top 9 most significant k-mers, chosen by one-dimensional SVMs]

Results: Feature Selection
- Procedure: greedy forward selection + backward elimination
- Chosen features:
  - Sequence: [5 68 3 20 63 4 16 10 1 22]
    (0 = A, 1 = T, 2 = G, 3 = C, 4 = AA, 5 = AT, etc.)
    [AT, CAA, C, AAA, GGC, AA, CA, TG, T, AAG]
  - Symbol: [3 5 4 18 24 124 17 143 19 95 103]
    (0 = G, 1 = V, 2 = W, 3 = S, 4 = I, 5 = GG, 6 = GV, etc.)
    [S, GG, I, WS, SI, SIG, WW, IWI, WI, WSV, WII]

Results: Accuracy Scores

Experiment type      Overall accuracy   Reg precision   AR precision
Elnitski et al.
  5-symbol           ≈ 74.7%            78.49%          73%
  Hexamers           ≈ 75%              81.4%           72.5%
Sequences only
  1-mers             78.33%             76.54%          80.54%
  2-mers             77.67%             72.84%          82.97%
  3-mers             80.17%             83.67%          77.21%
  Selection          80.33%             80.87%          79.63%
Symbols only
  1-mers             84.33%             79.39%          90.03%
  2-mers             84.33%             77.53%          90.96%
  3-mers             85.17%             78.83%          92.42%
  Selection          86.00%             80.58%          91.54%

Results: Laboratory Data
- Training:
  - SVM models built using the Elnitski et al. data
  - Same parameters; same features selected
- Data:
  - 9 candidate cis-regulatory regions predicted by RP score
  - 1 negative control, based on the definition
  - 5 of the 9 candidates passed current biological testing (positive)
- Accuracy:
  - Classification result for sequence (1-, 2-, 3-mer):
    - 1 negative control
    - 4 out of 5 positive elements + 3 out of 4 “negative” elements
  - Classification result for alignment (1-, 2-, 3-mer):
    - 1 negative control
    - 9 original candidates

Discussion
- High validation rate for ancestral repeats
  - The structure of the selected training set is not that diverse
  - Ancestral repeats tend to be AT-rich
  - AR: LINE, SINE, etc.
- SVM performs a little better than RP scores on the training set
  - Statistically more powerful
  - RP: Markov model for pattern recognition
  - SVM: hyperplane in a high-dimensional feature space
  - Feature selection using a wrapper method is possible

Discussion (cont’d)
- Performance degradation in lab data classification
  - No improvement in SVM classification compared to RP score
  - Features identified from the Elnitski et al. data may have some bias; other features may be more informative on the lab data
- Sequence classification vs alignment (see the accuracy table)
  - SVM yields higher overall cross-validation accuracy for aligned symbol sequences than for nucleotide sequences
    - Gained accuracy rate: ancestral repeat driven
    - No improvement for aligned symbol sequence
  - In lab data classification, sequence classification is better than aligned symbol sequence
    - No information gained from evolutionary history!
    - Alphabet reduction not optimal
    - Assumption wrong!

Summary
- Generally, SVM is a powerful tool for classification
  - SVM: answers a “yes or no” question
  - Performance better than RP in distinguishing the AR training set from the Reg training set
  - RP: probabilistic method, can generate a quantitative measurement genome-wide
  - SVM: results can be extended using probabilistic forms of SVM
- SVM can reveal potentially interesting biological features, e.g. the transcription regulation scheme

Future Directions: Possible extensions
- Explore more complex features
- Refine models for neutral non-coding genomic segments
- Utilize multi-species alignment for the classification
- Combine sequence and alignment information to build more robust multi-classifiers (“Committee of Experts”)
- Pattern recognition for more accurate prediction

Questions and recommendations?
- Using original alignment features, 20 columns
- Other lab data (avoiding the possible bias of RP preselection) for SVM performance testing

References
- L. Elnitski et al., “Distinguishing Regulatory DNA From Neutral Sites,” Genome Research, 2003.
- Machine Learning Group, University of Texas at Austin, “Support Vector Machines,” http://www.cs.utexas.edu/~ml/.
- N. Cristianini, “Support Vector and Kernel Methods for Pattern Recognition,” http://www.support-vector.net/tutorial.html.

Acknowledgement
- Dr. Webb Miller
- Dr. Francesca Chiaromonte
- David King
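
Backup: k-mer feature indexing
The nucleotide feature numbering quoted on the Feature Selection slide (0 = A, 1 = T, 2 = G, 3 = C, 4 = AA, 5 = AT, etc.) can be reproduced with a short sketch. It assumes the 84 features are ordered as all 1-mers, then all 2-mers, then all 3-mers, each block enumerated in A, T, G, C order, and that counts are divided by sequence length as the Overview slide states. The function names kmer_vocabulary and kmer_frequencies are hypothetical and this is not the authors' code.

```python
from itertools import product

# Alphabet order taken from the slide: 0 = A, 1 = T, 2 = G, 3 = C.
ALPHABET = "ATGC"


def kmer_vocabulary(alphabet=ALPHABET, max_k=3):
    """List all k-mers for k = 1..max_k: 1-mers first, then 2-mers, then 3-mers."""
    vocab = []
    for k in range(1, max_k + 1):
        vocab.extend("".join(letters) for letters in product(alphabet, repeat=k))
    return vocab


VOCAB = kmer_vocabulary()  # 4 + 16 + 64 = 84 entries


def kmer_frequencies(sequence, vocab=VOCAB):
    """Count overlapping k-mers and normalize each count by sequence length.

    Dividing by sequence length follows the slide; as noted there, this
    normalization is ambiguous for k > 1 (there are fewer windows than bases).
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    counts = dict.fromkeys(vocab, 0)
    for k in sorted({len(kmer) for kmer in vocab}):
        for start in range(len(sequence) - k + 1):
            window = sequence[start:start + k]
            if window in counts:  # skips windows with symbols outside ALPHABET
                counts[window] += 1
    return [counts[kmer] / len(sequence) for kmer in vocab]


# Feature indices reported on the slide for nucleotide sequences.
selected = [5, 68, 3, 20, 63, 4, 16, 10, 1, 22]
decoded = [VOCAB[i] for i in selected]
print(decoded)

# Feature vector for the human example sequence from the data slide.
features = kmer_frequencies("TCCTTATCAGCCATTACC")
print(len(features))
```

Under these assumptions the decoded list matches the k-mers shown on the slide for the sequence features. Only the sequence encoding is covered here; the 5-symbol alignment indices are not decoded.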