Optimization of SVM Parameters for Promoter Recognition in DNA Sequences

Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
Email: damarobe@soften.ktu.lt

Continuous Optimization and Knowledge-Based Technologies – EurOPT'2008

DNA sequences

- Data: genetic (DNA) sequences
- Meaning: genetic information stored in the DNA molecule in symbolic form
- Syntax: 4-letter alphabet {A, C, G, T}
- Complexity: numerous layers of information: protein-coding genes; regulatory sequences; mRNA sequences responsible for protein structure; directions for DNA packaging and unwinding, etc.
- Motivation: over 95% is "junk DNA" whose biological function is not fully understood
- Aim: identify structural parts of DNA: introns, exons, promoters, splice sites, etc.

What are promoters?

- Promoter: a regulatory region of DNA located upstream of a gene, providing a control point for gene transcription
- Function: by binding to the promoter, specific proteins (transcription factors) can either promote or repress the transcription of a gene
- Structure: promoters contain binding sites or "boxes": short DNA subsequences that are (usually) conserved
- [Figure: gene structure from Start to Stop: Promoter, exon1, intron, exon2, intron, exon3]

Promoter recognition problem

- Multitude of promoter "boxes" (nucleotide patterns): TATA, Pribnow, Gilbert, DPE, E-box, Y-box, ...
- "Boxes" within a species are conserved, but there are many exceptions to this rule
- (a) Exact pattern TACACC: CAATGCAGGATACACCGATCGGTA
- (b) Pattern with mismatches, TACACC + 1 mismatch: CAATGCAGGATTCACCGATCGGTA
- (c) Degenerate pattern TASDCC (S = {C, G}, D = {A, G, T}): CAATGCAGGATAGTCCGATCGGTA

Support Vector Machine (SVM)

Decision function:

  g(x_j) = sgn( sum_{x_i in SV} alpha_i * y_i * K(x_i, x_j) + b )

where:
- x_i in X are training data vectors, x_j not in X are unknown data vectors
- y_i in Y, Y = {-1, +1} is the target space
- K(x_i, x_j) is the kernel function
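The decision function can be sketched in a few lines of Python. This is a minimal illustration only, with a hypothetical toy set of support vectors, multipliers, and a linear kernel; it is not the SVMlight implementation used in the experiments:

```python
import numpy as np

def svm_decision(x_j, support_vectors, alphas, labels, b, kernel):
    """Evaluate g(x_j) = sgn( sum_i alpha_i * y_i * K(x_i, x_j) + b )."""
    s = sum(a * y * kernel(x_i, x_j)
            for a, y, x_i in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1

# Hypothetical toy model with a linear kernel K(u, v) = u . v
linear = lambda u, v: float(np.dot(u, v))
sv = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
alphas = [0.5, 0.5]
labels = [+1, -1]
b = 0.0

print(svm_decision(np.array([2.0, 0.5]), sv, alphas, labels, b, linear))  # prints 1
```

Swapping in a different `kernel` callable (polynomial, RBF, or the power series kernel discussed later) changes the feature-space mapping without touching the decision rule.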
Quality of classification

- Training data: size of dataset, generation of negative examples, imbalanced datasets
- Mapping of data into feature space: orthogonal, single nucleotide, nucleotide grouping, ...
- Selection of an optimal kernel function: linear, polynomial, RBF, sigmoid
- Kernel function parameters
- SVM learning parameters: regularization parameter, cost factor

Selection of SVM parameter values is an optimization problem.

SVM optimization strategies

- Kernel optimization: adding extra parameters; designing new kernels
- Parameter optimization: learning parameters only; kernel parameters only; learning and kernel parameters
- Optimization decisions: optimization method; objective function

SVM (hyper)parameters

- Kernel parameters
- Learning parameters

SVM parameter optimization methods

Random search
  Advantages: simplicity.
  Disadvantages: depends on the selection of random points and their distribution; very slow as the size of the parameter space increases.

Grid search
  Advantages: simplicity; a starting point is not required.
  Disadvantages: box constraints for the grid are necessary; no optimality criteria for the solution; computationally expensive for a large number of parameters; the solution depends on the coarseness of the grid.

Nelder-Mead
  Advantages: few function evaluations; good convergence and stability.
  Disadvantages: can fail if the initial simplex is too small; no proof of convergence.
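The grid search row above can be made concrete with a short sketch. The objective here is a hypothetical stand-in for classification accuracy as a function of two SVM hyperparameters (the names `C` and `gamma` are illustrative); a real run would train and test the classifier at each grid point, which is exactly the expensive step that motivates cheaper methods such as Nelder-Mead:

```python
import itertools

def grid_search(objective, grid):
    """Evaluate `objective` on the Cartesian product of per-parameter
    value lists; return the best parameter dict and its score."""
    best_point, best_score = None, float("-inf")
    for point in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), point))
        score = objective(params)
        if score > best_score:
            best_point, best_score = params, score
    return best_point, best_score

# Hypothetical smooth objective peaking at C = 10, gamma = 0.1
def toy_objective(p):
    return -((p["C"] - 10) ** 2 + (p["gamma"] - 0.1) ** 2)

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best, score = grid_search(toy_objective, grid)
print(best)  # {'C': 10, 'gamma': 0.1}
```

Note how the result can only be as good as the grid is fine: if the true optimum falls between grid points, a coarse grid misses it, which is the "coarseness" disadvantage listed above.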
Dataset

Drosophila sequence datasets:
- Promoter dataset: 1842 sequences, each 300 bp long, from -250 bp to +50 bp with respect to the gene transcription start site
- Intron dataset: 1799 sequences, each 300 bp long
- Coding sequence (CDS) dataset: 2859 sequences, each 300 bp long

Datasets for the SVM classifier:
- Training file: 1260 examples (372 promoters, 361 introns, 527 CDS)
- Test file: 6500 examples (1842 promoters, 1799 introns, 2859 CDS)

The datasets are unbalanced:
- 29.5% promoters vs. 70.5% non-promoters in the training dataset
- 28.3% promoters vs. 71.7% non-promoters in the test dataset

Classification requisites

- Feature mapping: orthogonal (A -> 0001, C -> 0010, G -> 0100, T -> 1000)
- Kernel function: power series kernel

    K_npow(x_i, x_j) = sum_{k=1}^{n} a_k * (x_i^T x_j + c)^k

- Metrics:

    Specificity: SPC = n_TN / (n_FP + n_TN) * 100%
    Sensitivity: TPR = n_TP / (n_TP + n_FN) * 100%

- SVM classifier: SVMlight
- SVM parameter optimization method: modified Nelder-Mead (downhill simplex)

Modification of Nelder-Mead

Optimization time problem:
- a call to the SVM training and testing function is very time-costly for large datasets
- many evaluations of the objective function are required

Modifications:
- function value caching
- normalization after the reflection step
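The classification requisites above can be sketched together: the orthogonal (one-hot) mapping, the power series kernel, and the two evaluation metrics. This is a minimal illustration under the stated definitions, not the SVMlight-based pipeline itself; the coefficient values `a` and offset `c` below are arbitrary placeholders for the kernel parameters that the Nelder-Mead procedure would tune:

```python
import numpy as np

# Orthogonal mapping from the slide: A -> 0001, C -> 0010, G -> 0100, T -> 1000
ONE_HOT = {"A": (0, 0, 0, 1), "C": (0, 0, 1, 0), "G": (0, 1, 0, 0), "T": (1, 0, 0, 0)}

def encode(seq):
    """Map a DNA string to a flat numeric feature vector (4 bits per base)."""
    return np.array([bit for nt in seq for bit in ONE_HOT[nt]], dtype=float)

def power_series_kernel(xi, xj, a, c=0.0):
    """K_npow(xi, xj) = sum_{k=1..n} a_k * (xi . xj + c)^k, with a = [a_1..a_n]."""
    dot = float(np.dot(xi, xj)) + c
    return sum(ak * dot ** k for k, ak in enumerate(a, start=1))

def specificity(n_tn, n_fp):
    """SPC = n_TN / (n_FP + n_TN) * 100%."""
    return 100.0 * n_tn / (n_fp + n_tn)

def sensitivity(n_tp, n_fn):
    """TPR = n_TP / (n_TP + n_FN) * 100%."""
    return 100.0 * n_tp / (n_tp + n_fn)

xi, xj = encode("ACGT"), encode("ACGA")
# The sequences agree at 3 of 4 positions, so xi . xj = 3
print(power_series_kernel(xi, xj, a=[1.0, 0.5]))  # 1*3 + 0.5*3^2 = 7.5
```

With one-hot encoding, the inner product of two encoded sequences simply counts matching positions, so the power series kernel is a tunable polynomial in the number of matches; its coefficients a_1..a_n are what gives this kernel more optimizable degrees of freedom than a standard polynomial kernel.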
Classification results

Kernel            | No. of optimized | Type of optimized | Specificity | Sensitivity
                  | parameters       | parameters        | (SPC)       | (TPR)
Linear            | -                | none              | 84.83%      | 58.25%
Linear            | 3                | learning          | 91.23%      | 81.38%
Polynomial        | -                | none              | 81.81%      | 44.90%
Polynomial        | 6                | learning + kernel | 87.64%      | 67.48%
Power series (2)  | 3                | kernel            | 94.85%      | 89.69%
Power series (3)  | 4                | kernel            | 94.92%      | 89.95%

ROC plot

[Figure: ROC curves of the compared classifiers]

Conclusions

- The SVM classifier alone cannot achieve satisfactory classification results on a complex unbalanced dataset
- SVM parameter optimization can improve classification results significantly
- The best results are achieved when SVM parameter optimization is combined with kernel function modification
- The power series kernel is particularly suitable for optimization because of its larger number of kernel parameters

Ongoing work and future research

- Application of SVM parameter optimization to the splice site recognition problem [presented at CISIS'2008]
- Selection of rules for optimal DNA sequence mapping to the feature space [accepted to WCSB'2008]
- Analysis of the relationships between data characteristics and classifier behavior [accepted to IS'2008]
- Automatic derivation of formal grammar rules [accepted to KES'2008]
- Structural analysis of sequences using SVM with grammar inference [accepted to ITA'2008]

Thank You. Any questions?