Semi-supervised learning for protein classification

Brian R. King, Department of Computer Science, University at Albany, SUNY
Chittibabu Guda, Ph.D., Gen*NY*sis Center for Excellence in Cancer Genomics, University at Albany, SUNY

The problem

Develop computational models of characteristics of protein structure and function from sequence alone, using machine-learned classifiers.
- Input: data. Output: a model (function) h : X → Y.
- Traditional approach: supervised learning.
- Challenges:
  - Experimentally determined data are expensive, limited, and subject to noise and error.
  - Large repositories of unannotated data: Swiss-Prot release 54.5 contains 289,473 annotated sequences, while TrEMBL release 37.5 contains 5,035,267 unannotated ones.
  - Data representation, bias from unbalanced or underrepresented classes, etc.

AIM: develop a method that uses both labeled and unlabeled data, while improving performance given the challenges presented by small, unbalanced data.

Solution

Semi-supervised learning: use both the labeled set Dl and the unlabeled set Du for model induction.
- Method: a generative, Bayesian probabilistic model, based on ngLOC, a supervised naive Bayes classification method.
- Input / feature representation: a sequence n-gram model, assumed to follow a multinomial distribution; sequences and n-grams are treated as IID.
- Parameters are estimated with Expectation-Maximization (EM).

Test setup
- Task: prediction of subcellular localization, for eukaryotic, non-plant sequences only.
- Dl: data annotated with subcellular localization for eukaryotic, non-plant sequences.
  - DL-2: EXT/PLA (~5,500 sequences, balanced).
  - DL-3: GOL [65%] / LYS [14%] / POX [21%] (~600 sequences, unbalanced).
- Du: a set drawn from ~75K eukaryotic, non-plant protein sequences.
- Comparative method: transductive SVM (TSVM).

Algorithms based on EM

[Figure: macro-averaged F1 (0.3 to 0.9) vs. number of labeled protein sequences (0 to 6,000), comparing no unlabeled sequences against 25,000 unlabeled sequences.]

Basic EM on DL-2: labeled data varied, with 25,000 unlabeled sequences. Unlabeled data gives the most improvement when labeled data is limited.

EM-λ on DL-3: λ controls the effect of unlabeled data on parameter adjustments. All labeled data (~600 sequences) was used, and the amount of unlabeled data was varied:

  # Unlabeled   TSVM macF1   TSVM Accuracy   EM-λ macF1   EM-λ Accuracy
  0             0.499        0.626           0.561        0.728
  2,000         0.641        0.653           0.602        0.735
  4,000         0.645        0.661           0.633        0.743
  6,000         0.659        0.667           0.645        0.743
  8,000         NA           NA              0.672        0.747
  50,000        NA           NA              0.677        0.752

EM-λ outperforms the TSVM on this problem; the TSVM failed to converge on large amounts of unlabeled data, despite parameter selection. Note, however, that the TSVM performed very well on binary, balanced classification problems. (A code sketch of the EM-λ loop appears after the EM-CS section below.)

Algorithm: EM-CS

The core ngLOC method outputs a confidence score (CS), which can be used to improve running time through intelligent selection of unlabeled instances: an unlabeled instance xi is used only if CS(xi) > CSthresh.

First, determine the range of CS scores through cross-validation without unlabeled data; here it was 33.5 to 47.8. (The range depends on the level of similarity in the data and the size of the dataset.)

Precision of the selected instances on DL-3:

  CSthresh   Precision   Percentile
  40         100%        94
  38.25      97%         87
  37         88%         71
  36         85%         50
  35         78%         21

Performance on DL-3 as CSthresh is varied:

  CSthresh   # Unlabeled used   macF1   Accuracy
  (no UL)    0                  0.561   0.728
  0          60,733             0.650   0.690
  35         6,814              0.662   0.721
  36         3,810              0.674   0.735
  37         2,388              0.676   0.741
  38.25      1,474              0.672   0.752
  40         726                0.621   0.740

Using only sequences that meet or exceed CSthresh significantly reduces the unlabeled data required (97.5% eliminated at CSthresh = 38.25). Note that it is possible to reduce the unlabeled data too much: macF1 drops sharply at CSthresh = 40.
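To make the EM-λ procedure above concrete, here is a minimal Python sketch of semi-supervised EM for a multinomial naive Bayes model over protein n-gram counts, with λ down-weighting the unlabeled data in the M-step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Laplace smoothing constant, and the fixed iteration count (rather than a convergence test) are choices made here, and ngLOC-specific details are omitted.

```python
import numpy as np

def ngram_counts(seq, vocab, n=2):
    """Count the n-grams of one protein sequence over a fixed vocabulary
    (vocab maps n-gram string -> column index)."""
    v = np.zeros(len(vocab))
    for i in range(len(seq) - n + 1):
        g = seq[i:i + n]
        if g in vocab:
            v[vocab[g]] += 1
    return v

def em_lambda(Xl, yl, Xu, n_classes, lam=0.2, n_iter=10, alpha=1.0):
    """Semi-supervised EM for multinomial naive Bayes.

    Xl: (Nl, V) n-gram counts of labeled sequences
    yl: (Nl,) integer class labels
    Xu: (Nu, V) n-gram counts of unlabeled sequences
    lam: weight of unlabeled data in the M-step (lam=0 reduces to supervised NB)
    """
    V = Xl.shape[1]
    Yl = np.eye(n_classes)[yl]                 # fixed one-hot labels
    Qu = np.zeros((Xu.shape[0], n_classes))    # unlabeled responsibilities

    for it in range(n_iter + 1):
        # M-step: class priors and per-class n-gram distributions from labeled
        # counts plus lam-weighted expected counts of the unlabeled data.
        W = Yl if it == 0 else np.vstack([Yl, lam * Qu])
        X = Xl if it == 0 else np.vstack([Xl, Xu])
        prior = (W.sum(axis=0) + alpha) / (W.sum() + alpha * n_classes)
        counts = W.T @ X                       # (C, V) expected counts per class
        theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * V)

        # E-step: posterior class probabilities of each unlabeled sequence.
        log_post = np.log(prior) + Xu @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        Qu = np.exp(log_post)
        Qu /= Qu.sum(axis=1, keepdims=True)

    return prior, theta
```

Prediction then assigns the class maximizing log prior_c + x · log theta_c; macro-averaged F1 (macF1), the unweighted mean of per-class F1 scores, is the metric reported in the tables above.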
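In the same spirit, a sketch of the EM-CS selection step. The confidence score below (the margin between the best and second-best class log-posteriors) is only a stand-in; the actual ngLOC CS formula, whose scores fall in the 33.5 to 47.8 range above, is not reproduced here, and cs_thresh would be picked from that cross-validated range.

```python
def confidence_scores(Xu, prior, theta):
    """Stand-in confidence score: margin between the best and second-best
    class log-posteriors (the real ngLOC CS is defined differently)."""
    log_post = np.log(prior) + Xu @ np.log(theta).T
    top2 = np.sort(log_post, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def select_unlabeled(Xu, prior, theta, cs_thresh):
    """Keep only the unlabeled sequences whose CS exceeds the threshold,
    shrinking the pool passed to the next EM run."""
    return Xu[confidence_scores(Xu, prior, theta) > cs_thresh]

# Typical EM-CS flow: fit on labeled data alone, filter, then rerun EM-lambda.
# prior, theta = em_lambda(Xl, yl, Xu[:0], n_classes=3, n_iter=0)
# Xu_kept = select_unlabeled(Xu, prior, theta, cs_thresh=38.25)
# prior, theta = em_lambda(Xl, yl, Xu_kept, n_classes=3)
```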
Conclusion

Benefits:
- Probabilistic:
  - Unlabeled sequences of "high confidence" can be extracted, which is difficult with an SVM or TSVM.
  - Knowledge can be extracted from the model: discriminative n-grams and anomalies, via information-theoretic measures such as KL divergence (see the sketch after this list).
- Computational resources:
  - Time: significantly lower than SVM and TSVM.
  - Space: dependent on the n-gram model; large amounts of unlabeled data can be used.
- Applicable toward prediction of any structural or functional characteristic.
- Outputs a global model. Again, this is difficult with an SVM or TSVM: transduction is not global!
- The most substantial gain comes with limited labeled data.

Current work in progress:
- TSVMs: improve performance on smaller, unbalanced data.
- Select an improved, smaller-dimensional feature-space representation.
- Ensemble classifiers, Bayesian model averaging, mixture of experts.
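As one concrete form of the knowledge extraction mentioned above (an assumed illustration, not taken from the original work), discriminative n-grams for a class can be ranked by their pointwise contribution to the KL divergence between the class-conditional n-gram distribution and the background distribution:

```python
def discriminative_ngrams(prior, theta, vocab, c, top_k=10):
    """Rank n-grams by their contribution to KL(theta_c || background),
    where the background is the prior-weighted mixture of all classes."""
    background = prior @ theta                   # (V,) marginal n-gram dist.
    contrib = theta[c] * np.log(theta[c] / background)
    top = np.argsort(contrib)[::-1][:top_k]
    inv = {j: g for g, j in vocab.items()}
    return [(inv[j], float(contrib[j])) for j in top]
```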