A statistical framework for genomic data fusion

William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington

Outline
• Recognizing correctly identified peptides
• The support vector machine algorithm
• Experimental results
• Yeast protein classification
• SVM learning from heterogeneous data
• Results

Recognizing correctly identified peptides
[Figure: protein sample → tandem mass spectrometer → observed spectra; sequence database → search algorithm (database search) → predicted peptides]

The learning task
[Figure: an observed spectrum paired with a theoretical spectrum]
• We are given paired observed and theoretical spectra.
• Question: Is the pairing correct?

Properties of the observed spectrum
1. Total peptide mass. Too small yields little information; too large (>25 amino acids) yields uneven fragmentation.
2. Charge (+1, +2 or +3). Provides some evidence about amino acid composition.
3. Total ion current. Proportional to the amount of peptide present.
4. Peak count. Small indicates poor fragmentation; large indicates noise.

Observed vs. theoretical spectra
5. Mass difference.
6. Percent of ions matched. Number of matched ions / total number of ions.
7. Percent of peaks matched. Number of matched peaks / total number of peaks.
8. Percent of peptide fragment ion current matched. Total intensity of matched peaks / total intensity of all peaks.
9. Preliminary SEQUEST score (Sp).
10. Preliminary score rank.
11. SEQUEST cross-correlation (XCorr).

Top-ranked vs. second-ranked peptides
12. Change in cross-correlation. Compute the difference in XCorr between the top-ranked and second-ranked peptides.
13. Percent sequence identity. Usually anticorrelated with the change in cross-correlation.

Negative examples
[Figure]

Positive examples
[Figure]

The support vector machine algorithm

SVMs in computational biology
• Splice site recognition
• Protein sequence similarity detection
• Protein functional classification
• Regulatory module search
• Protein-protein interaction prediction
• Gene functional classification from microarray data
• Cancer classification from microarray data

Support vector machine
[Figure: positive (+) and negative (−) examples in feature space]
• Locate a plane that separates positive from negative examples.
• Focus on the examples closest to the boundary.

Kernel matrix
• Polynomial kernel: K(X, Y) = (X · Y + 1)^3
• Gaussian kernel: K(X, Y) = exp(−||X − Y||^2 / (2σ^2))

Kernel functions
• Let X be a finite input space.
• A kernel is a function K such that for all x, z ∈ X, K(x, z) = Φ(x) · Φ(z), where Φ is a mapping from X to an (inner product) feature space F.
• Let K(x, z) be a symmetric function on X. Then K(x, z) is a kernel function if and only if the matrix K = (K(x_i, x_j)), i, j = 1, ..., n, is positive semi-definite.

Peptide ID kernel function
• Let p(x, y) be the function that computes a 13-element vector of parameters for a pair of spectra x and y.
• The kernel function K operates on pairs of observed and theoretical spectra:
  K(S_o^A : S_t^A, S_o^B : S_t^B) = K(p(S_o^A, S_t^A), p(S_o^B, S_t^B)) = (p(S_o^A, S_t^A) · p(S_o^B, S_t^B) + 1)^2

Experimental results

Experimental design
• Data consists of one 13-element vector per predicted peptide.
• Each feature is normalized to sum to 1.0 across all examples.
• The SVM is tested using leave-one-out cross-validation.
• The SVM uses a second-degree polynomial, normalized kernel with a 2-norm asymmetric soft margin.

Three data sets
• Set 1: Ion trap mass spectrometer. SEQUEST search on the full non-redundant database (NRDB).
• Set 2: Ion trap mass spectrometer. SEQUEST search on human NRDB.
• Set 3: Quadrupole time-of-flight mass spectrometer. SEQUEST search on human NRDB.
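The "Kernel matrix" and "Kernel functions" slides above name two standard kernels and the positive semi-definiteness condition. Below is a minimal sketch, assuming NumPy and synthetic 13-feature data (both are additions for illustration, not part of the slides), that computes the two kernel matrices and checks that they are positive semi-definite.

```python
import numpy as np

def polynomial_kernel(X, Y, degree=3):
    """K(x, y) = (x . y + 1)^degree, computed for all pairs of rows."""
    return (X @ Y.T + 1.0) ** degree

def gaussian_kernel(X, Y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), computed for all pairs of rows."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 13))          # 50 examples, 13 features
    for name, K in [("polynomial", polynomial_kernel(X, X)),
                    ("Gaussian", gaussian_kernel(X, X))]:
        eigvals = np.linalg.eigvalsh(K)    # K is symmetric, so eigvalsh applies
        print(name, "min eigenvalue:", eigvals.min())  # >= 0 (up to round-off) => PSD
```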
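To make the "Peptide ID kernel function" and "Experimental design" slides concrete, here is a hedged sketch using NumPy and scikit-learn: each feature is normalized to sum to 1.0 across examples, a normalized second-degree polynomial kernel is precomputed, and leave-one-out cross-validation is run with a standard soft-margin SVC. The standard soft margin stands in for the 2-norm asymmetric soft margin described above, and the synthetic data and function names are assumptions rather than the original implementation.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def normalize_features(X):
    """Scale each feature (column) so that it sums to 1.0 across all examples."""
    return X / X.sum(axis=0, keepdims=True)

def peptide_id_kernel(A, B):
    """Normalized second-degree polynomial kernel on 13-element feature vectors:
    K(a, b) = (a . b + 1)^2 / sqrt(K(a, a) * K(b, b))."""
    K = (A @ B.T + 1.0) ** 2
    ka = (np.sum(A * A, axis=1) + 1.0) ** 2
    kb = (np.sum(B * B, axis=1) + 1.0) ** 2
    return K / np.sqrt(np.outer(ka, kb))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.abs(rng.normal(size=(60, 13)))        # stand-in for 13-feature vectors
    y = rng.integers(0, 2, size=60)              # stand-in labels
    X = normalize_features(X)
    K = peptide_id_kernel(X, X)
    correct = 0
    for train, test in LeaveOneOut().split(X):
        clf = SVC(kernel="precomputed", C=1.0)
        clf.fit(K[np.ix_(train, train)], y[train])
        correct += int(clf.predict(K[np.ix_(test, train)])[0] == y[test][0])
    print("leave-one-out accuracy:", correct / len(y))
```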
Data set sizes

  Data set          Positive   Negative   Total
  Ion-trap NRDB        497        479       976
  Ion-trap HNRDB       696        465      1161
  QTOF HNRDB          1017        523      1540

Results
[Figure: classification results on the three data sets; reported values 0.94, 0.95, 0.99]

Conversion to probabilities
• Hold out a subset of the training examples.
• Use the hold-out set to fit a sigmoid, where y is the label and f is the SVM discriminant:
  Pr(y = 1 | f) = 1 / (1 + e^(Af + B))
• This is equivalent to assuming that the SVM output is proportional to the log-odds of a positive example.
• Sigmoid fitted to these data (coefficients 1.15 and 0.114):
  Pr(y = 1 | x) = 1 / (1 + e^(−1.15x + 0.114))

Yeast protein classification

Membrane proteins
• Membrane proteins anchor in a cellular membrane (plasma, ER, Golgi, mitochondrial).
• They communicate across the membrane.
• They may pass through the membrane several times.

Heterogeneous data
• Sequence data
• mRNA expression data
• Protein-protein interaction data

Vector representation
• Each matrix entry is an mRNA expression measurement.
• Each column is an experiment.
• Each row corresponds to a gene.
• Gaussian kernel: K(X, Y) = exp(−||X − Y||^2 / (2σ^2))

Sequence kernels
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI
• We cannot compute a scalar product on a pair of variable-length, discrete strings.

Pairwise comparison kernel
[Figure: each protein represented by its vector of pairwise similarity scores against other proteins]

Pairwise kernel variants
• Smith-Waterman all-vs-all
• BLAST all-vs-all
• Smith-Waterman w.r.t. SCOP database
• E-values from Pfam database

Protein-protein interactions
• Pairwise interactions can be represented as a graph or as a binary protein-by-protein matrix.
[Figure: interaction graph and the corresponding binary matrix]

Linear interaction kernel
• The simplest kernel counts the interaction partners shared by each pair of proteins, i.e., the inner product of their rows in the interaction matrix.
• Example from the figure: rows 10010101 and 10101101 have inner product 3.

Diffusion kernel
• A general method for establishing similarities between the nodes of a graph.
• Based upon a random walk.
• Efficiently accounts for all paths connecting two nodes, weighted by path length.

Hydrophobicity profile
[Figure: hydrophobicity profiles of a membrane protein and a non-membrane protein]
• Transmembrane regions are typically hydrophobic, and vice versa.
• The hydrophobicity profile of a membrane protein is evolutionarily conserved.

Hydrophobicity kernel
• Generate a hydropathy profile from the amino acid sequence using the Kyte-Doolittle index.
• Prefilter the profiles.
• Compare two profiles by
  – computing the fast Fourier transform (FFT), and
  – applying a Gaussian kernel function.
• This kernel detects periodicities in the hydrophobicity profile.

SVM learning from heterogeneous data

Combining kernels
• Compute a kernel K(A) on data set A and a kernel K(B) on data set B.
• Computing a kernel on the concatenated data A:B is identical to summing the two kernels: K(A:B) = K(A) + K(B).

Semidefinite programming
• Define a convex cost function to assess the quality of a kernel matrix.
• Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.
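A sketch of the "Conversion to probabilities" step: fit the sigmoid Pr(y = 1 | f) = 1 / (1 + e^(Af + B)) on a held-out set by maximum likelihood. SciPy's generic optimizer and the synthetic hold-out data are assumptions here; the slides do not specify the original fitting procedure.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f, y):
    """Fit Pr(y = 1 | f) = 1 / (1 + exp(A f + B)) by maximum likelihood
    on held-out discriminant values f with labels y in {0, 1}."""
    def neg_log_likelihood(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)          # avoid log(0)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    result = minimize(neg_log_likelihood, x0=np.array([-1.0, 0.0]), method="Nelder-Mead")
    return result.x

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Synthetic hold-out set: positives tend to have larger discriminant values.
    f = np.concatenate([rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 200)])
    y = np.concatenate([np.ones(200), np.zeros(200)])
    A, B = fit_sigmoid(f, y)
    print("fitted A, B:", A, B)
    print("Pr(y=1 | f=2.0) =", 1.0 / (1.0 + np.exp(A * 2.0 + B)))
```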
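A sketch of the two interaction kernels described above, assuming NumPy/SciPy and a toy interaction matrix: the linear kernel is the inner product of interaction vectors, and the diffusion kernel is taken here as the matrix exponential of beta times the negative graph Laplacian (the decay parameter beta and the toy graph are assumptions).

```python
import numpy as np
from scipy.linalg import expm

def linear_interaction_kernel(A):
    """Inner product of interaction vectors: entry (i, j) counts the
    interaction partners shared by proteins i and j."""
    return A @ A.T

def diffusion_kernel(A, beta=1.0):
    """Diffusion kernel on the interaction graph: K = exp(beta * (A - D)),
    the matrix exponential of the negative graph Laplacian, which sums over
    all paths between two nodes with longer paths down-weighted."""
    D = np.diag(A.sum(axis=1))
    return expm(beta * (A - D))

if __name__ == "__main__":
    # Tiny symmetric interaction matrix (5 proteins); a stand-in for real data.
    A = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 0],
                  [1, 1, 0, 0, 1],
                  [0, 1, 0, 0, 1],
                  [0, 0, 1, 1, 0]], dtype=float)
    print("shared partners of proteins 0 and 1:", linear_interaction_kernel(A)[0, 1])
    K = diffusion_kernel(A, beta=0.5)
    print("min eigenvalue of diffusion kernel:", np.linalg.eigvalsh(K).min())  # PSD
```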
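A sketch of the hydrophobicity kernel, assuming NumPy: Kyte-Doolittle profile, a simple moving-average prefilter, FFT magnitude spectrum, then a Gaussian kernel. The window size, zero-padding length, and sigma are placeholders; the slides do not specify the original prefilter or parameters. The example sequences are the first lines of the two FASTA entries shown above.

```python
import numpy as np

# Standard Kyte-Doolittle hydropathy index.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def hydropathy_profile(seq, window=7):
    """Kyte-Doolittle profile, prefiltered with a simple moving average."""
    raw = np.array([KD[a] for a in seq if a in KD])
    smoothing = np.ones(window) / window
    return np.convolve(raw, smoothing, mode="valid")

def fft_features(profile, n=512):
    """Magnitude spectrum of the (zero-padded) profile; captures periodicities."""
    return np.abs(np.fft.rfft(profile, n=n))

def hydrophobicity_kernel(seq_a, seq_b, sigma=50.0):
    """Gaussian kernel on the FFT magnitude spectra of two hydropathy profiles."""
    fa = fft_features(hydropathy_profile(seq_a))
    fb = fft_features(hydropathy_profile(seq_b))
    return np.exp(-np.sum((fa - fb) ** 2) / (2.0 * sigma ** 2))

if __name__ == "__main__":
    icya = "GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY"
    lacb = "MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA"
    print("K(ICYA, LACB) =", hydrophobicity_kernel(icya, lacb))
    print("K(ICYA, ICYA) =", hydrophobicity_kernel(icya, icya))
```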
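A small numerical check of the "Combining kernels" slide: for linear kernels, the kernel computed on the concatenated data A:B equals the sum K(A) + K(B). The random matrices below are stand-ins for two data types.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(10, 5))   # one data type (e.g., expression features)
B = rng.normal(size=(10, 8))   # another data type (e.g., sequence-derived features)

K_A = A @ A.T                  # linear kernel on data set A
K_B = B @ B.T                  # linear kernel on data set B
K_concat = np.hstack([A, B]) @ np.hstack([A, B]).T   # kernel on concatenated data A:B

print(np.allclose(K_concat, K_A + K_B))   # True: K(A:B) = K(A) + K(B)
```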
Semidefinite programming
• Learn K from the convex cone of positive semidefinite matrices, or a convex subset of it, according to a convex quality measure.
• Integrate constructed kernels by learning a linear mix: K = Σ_i μ_i K_i
• Large-margin classifier (SVM): maximize the margin.
• SDP learns the kernel weights and maximizes the margin at the same time.

Experimental results

Seven yeast kernels

  Kernel   Data                   Similarity measure
  KSW      protein sequence       Smith-Waterman
  KB       protein sequence       BLAST
  KHMM     protein sequence       Pfam HMM
  KFFT     hydropathy profile     FFT
  KLI      protein interactions   linear kernel
  KD       protein interactions   diffusion kernel
  KE       gene expression        radial basis kernel

Membrane proteins
[Figure: results]

Comparison of performance
[Figure: comparison with simple rules from the hydrophobicity profile and with TMHMM]

Cytoplasmic ribosomal proteins
[Figure: results]

False negative predictions
[Figure]

False negative expression profiles
[Figure]

Markov random field
• A general Bayesian method, applied by Deng et al. to yeast functional classification.
• Used five different types of data.
• For their model, the input data must be binary.
• Reported improved accuracy compared to using any single data type.

Yeast functional classes

  Category                       Size
  Metabolism                     1048
  Energy                          242
  Cell cycle & DNA processing     600
  Transcription                   753
  Protein synthesis               335
  Protein fate                    578
  Cellular transport              479
  Cell rescue, defense            264
  Interaction w/ environment      193
  Cell fate                       411
  Cellular organization           192
  Transport facilitation          306
  Other classes                    81

Six types of data
• Presence of Pfam domains.
• Genetic interactions from CYGD.
• Physical interactions from CYGD.
• Protein-protein interaction by TAP.
• mRNA expression profiles.
• (Smith-Waterman scores).

Results
[Figure: comparison of MRF, SDP/SVM (binary), and SDP/SVM (enriched)]

Many yeast kernels
• protein sequence
• phylogenetic profiles
• separate gene expression kernels
• time series expression kernel
• promoter regions using seven aligned species
• protein localization
• ChIP
• protein-protein interactions
• yeast knockout growth data
• more ...

Future work
• New kernel functions that incorporate domain knowledge.
• Better understanding of the semantics of kernel weights.
• Further investigation of yeast biology.
• Improved scalability of the algorithm.
• Prediction of protein-protein interactions.

Acknowledgments
• Dave Anderson, University of Oregon
• Wei Wu, Genome Sciences, UW & FHCRC
• Michael Jordan, Statistics & EECS, UC Berkeley
• Laurent El Ghaoui, EECS, UC Berkeley
• Gert Lanckriet, EECS, UC Berkeley
• Nello Cristianini, Statistics, UC Davis

Fisher criterion score
• F = (μ_1 − μ_2)^2 / (σ_1^2 + σ_2^2)
[Figure: feature distributions giving a low score and a high score]

Feature ranking

  Feature                      Score
  delta Cn                     2.861
  % match total ion current    2.804
  Cn                           2.444
  % match peaks                2.314
  Sp                           1.158
  mass                         0.704
  charge                       0.488
  rank Sp                      0.313
  peak count                   0.209
  sequence similarity          0.115
  % ion match                  0.079
  total ion current            0.026
  delta mass                   0.024

Pairwise feature ranking

  Feature pair                  Score     Feature pair                  Score
  % match TIC-delta Cn          4.741     % match TIC-mass              2.097
  % match peaks-delta Cn        4.233     % ion match-mass              2.091
  % match TIC-Cn                3.819     Cn-charge                     1.943
  delta Cn-Cn                   3.597     Sp-mass                       1.922
  delta Cn-charge               3.563     % match TIC-charge            1.898
  % match peaks-Cn              3.377     Cn-mass                       1.884
  delta Cn-mass                 3.119     Sp-Cn                         1.881
  % match TIC-% match peaks     2.823     % ion match-Cn                1.827
  % ion match-delta Cn          2.812     Sp-charge                     1.770
  Sp-delta Cn                   2.799     % match peaks-mass            1.668
  % match TIC-Sp                2.579     % match peaks-charge          1.528
  % match peaks-Sp              2.383     % match TIC-% ion match       1.473
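The SDP formulation above learns the weights μ_i jointly with the classifier. As a simplified stand-in (not the SDP itself), this sketch forms a fixed-weight combination K = Σ_i μ_i K_i from several precomputed kernels and trains an SVM on it with scikit-learn; the weights, the synthetic data, and the train/test split are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernels, mu):
    """Weighted combination K = sum_i mu_i K_i of precomputed kernel matrices."""
    return sum(m * K for m, K in zip(mu, kernels))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    n = 80
    y = rng.integers(0, 2, size=n)
    # Stand-ins for kernels computed from different data types
    # (sequence, interactions, expression, ...).
    feats = [rng.normal(size=(n, d)) + y[:, None] for d in (5, 10, 3)]
    kernels = [F @ F.T for F in feats]

    mu = [0.5, 0.3, 0.2]                       # fixed example weights
    K = combine_kernels(kernels, mu)

    train, test = np.arange(0, 60), np.arange(60, n)
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K[np.ix_(train, train)], y[train])
    acc = np.mean(clf.predict(K[np.ix_(test, train)]) == y[test])
    print("test accuracy with fixed kernel weights:", acc)
```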
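A sketch of the Fisher criterion score used in the feature ranking tables above, F = (μ_1 − μ_2)^2 / (σ_1^2 + σ_2^2), computed per feature. The feature names and synthetic data here are placeholders, not the values in the tables.

```python
import numpy as np

def fisher_score(x, y):
    """Fisher criterion for one feature: (mu1 - mu2)^2 / (sigma1^2 + sigma2^2)."""
    pos, neg = x[y == 1], x[y == 0]
    return (pos.mean() - neg.mean()) ** 2 / (pos.var() + neg.var())

def rank_features(X, y, names):
    """Rank features by Fisher criterion score, highest first."""
    scores = [fisher_score(X[:, j], y) for j in range(X.shape[1])]
    order = np.argsort(scores)[::-1]
    return [(names[j], scores[j]) for j in order]

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    y = np.repeat([1, 0], 100)
    # Two informative features and one uninformative one (synthetic stand-ins).
    X = np.column_stack([
        rng.normal(loc=2.0 * y, scale=1.0),
        rng.normal(loc=0.5 * y, scale=1.0),
        rng.normal(size=200),
    ])
    for name, score in rank_features(X, y, ["delta Cn", "Sp", "noise"]):
        print(f"{name:10s} {score:.3f}")
```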