Semi-supervised learning for protein classification

Brian R. King, Department of Computer Science, University at Albany, SUNY
Chittibabu Guda, Ph.D., Gen*NY*sis Center for Excellence in Cancer Genomics, University at Albany, SUNY

The problem

Develop computational models of characteristics of protein structure and function from sequence alone, using machine-learned classifiers.
- Input: data. Output: a model (function) h : X → Y.
- Traditional approach: supervised learning.
- Challenges:
  - Experimentally determined data are expensive, limited, and subject to noise and error.
  - Large repositories of unannotated data: Swiss-Prot release 54.5 contains 289,473 annotated sequences, while TrEMBL release 37.5 contains 5,035,267 unannotated ones.
  - Data representation, bias from unbalanced or underrepresented classes, etc.

AIM: develop a method that uses both labeled and unlabeled data, while improving performance given the challenges presented by small, unbalanced data.

Solution

Semi-supervised learning: use both the labeled set Dl and the unlabeled set Du for model induction.
- Method: a generative, Bayesian probabilistic model, based on ngLOC, a supervised naive Bayes classification method.
- Input / feature representation: a sequence n-gram model, assumed to follow a multinomial distribution; sequences and n-grams are treated as IID.
- Parameters are estimated with Expectation-Maximization (EM).

Test setup
- Task: prediction of subcellular localization, for eukaryotic, non-plant sequences only.
- Dl: data annotated with subcellular localization for eukaryotic, non-plant sequences.
  - DL-2: EXT/PLA (~5,500 sequences, balanced).
  - DL-3: GOL [65%] / LYS [14%] / POX [21%] (~600 sequences, unbalanced).
- Du: a set drawn from ~75K eukaryotic, non-plant protein sequences.
- Comparative method: transductive SVM (TSVM).

Algorithms based on EM

[Figure: macro-averaged F1 (0.3 to 0.9) vs. number of labeled protein sequences (0 to 6,000), comparing no unlabeled sequences against 25,000 unlabeled sequences.]

Basic EM on DL-2: labeled data varied, with 25,000 unlabeled sequences. Unlabeled data gives the most improvement when labeled data is limited.

EM-λ on DL-3: λ controls the effect of unlabeled data on parameter adjustments. All labeled data (~600 sequences) was used, and the amount of unlabeled data was varied:

  # Unlabeled   TSVM macF1   TSVM Accuracy   EM-λ macF1   EM-λ Accuracy
  0             0.499        0.626           0.561        0.728
  2,000         0.641        0.653           0.602        0.735
  4,000         0.645        0.661           0.633        0.743
  6,000         0.659        0.667           0.645        0.743
  8,000         NA           NA              0.672        0.747
  50,000        NA           NA              0.677        0.752

EM-λ outperforms the TSVM on this problem; the TSVM failed to converge on large amounts of unlabeled data, despite parameter selection. Note, however, that the TSVM performed very well on binary, balanced classification problems. (A code sketch of the EM-λ loop appears after the EM-CS section below.)

Algorithm: EM-CS

The core ngLOC method outputs a confidence score (CS), which can be used to improve running time through intelligent selection of unlabeled instances: an unlabeled instance xi is used only if CS(xi) > CSthresh.

First, determine the range of CS scores through cross-validation without unlabeled data; here it was 33.5 to 47.8. (The range depends on the level of similarity in the data and the size of the dataset.)

Precision of the selected instances on DL-3:

  CSthresh   Precision   Percentile
  40         100%        94
  38.25      97%         87
  37         88%         71
  36         85%         50
  35         78%         21

Performance on DL-3 as CSthresh is varied:

  CSthresh   # Unlabeled used   macF1   Accuracy
  (no UL)    0                  0.561   0.728
  0          60,733             0.650   0.690
  35         6,814              0.662   0.721
  36         3,810              0.674   0.735
  37         2,388              0.676   0.741
  38.25      1,474              0.672   0.752
  40         726                0.621   0.740

Using only sequences that meet or exceed CSthresh significantly reduces the unlabeled data required (97.5% eliminated at CSthresh = 38.25). Note that it is possible to reduce the unlabeled data too much: macF1 drops sharply at CSthresh = 40.
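To make the EM-λ procedure above concrete, here is a minimal Python sketch of semi-supervised EM for a multinomial naive Bayes model over protein n-gram counts, with λ down-weighting the unlabeled data in the M-step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Laplace smoothing constant, and the fixed iteration count (rather than a convergence test) are choices made here, and ngLOC-specific details are omitted.

```python
import numpy as np

def ngram_counts(seq, vocab, n=2):
    """Count the n-grams of one protein sequence over a fixed vocabulary
    (vocab maps n-gram string -> column index)."""
    v = np.zeros(len(vocab))
    for i in range(len(seq) - n + 1):
        g = seq[i:i + n]
        if g in vocab:
            v[vocab[g]] += 1
    return v

def em_lambda(Xl, yl, Xu, n_classes, lam=0.2, n_iter=10, alpha=1.0):
    """Semi-supervised EM for multinomial naive Bayes.

    Xl: (Nl, V) n-gram counts of labeled sequences
    yl: (Nl,) integer class labels
    Xu: (Nu, V) n-gram counts of unlabeled sequences
    lam: weight of unlabeled data in the M-step (lam=0 reduces to supervised NB)
    """
    V = Xl.shape[1]
    Yl = np.eye(n_classes)[yl]                 # fixed one-hot labels
    Qu = np.zeros((Xu.shape[0], n_classes))    # unlabeled responsibilities

    for it in range(n_iter + 1):
        # M-step: class priors and per-class n-gram distributions from labeled
        # counts plus lam-weighted expected counts of the unlabeled data.
        W = Yl if it == 0 else np.vstack([Yl, lam * Qu])
        X = Xl if it == 0 else np.vstack([Xl, Xu])
        prior = (W.sum(axis=0) + alpha) / (W.sum() + alpha * n_classes)
        counts = W.T @ X                       # (C, V) expected counts per class
        theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * V)

        # E-step: posterior class probabilities of each unlabeled sequence.
        log_post = np.log(prior) + Xu @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        Qu = np.exp(log_post)
        Qu /= Qu.sum(axis=1, keepdims=True)

    return prior, theta
```

Prediction then assigns the class maximizing log prior_c + x · log theta_c; macro-averaged F1 (macF1), the unweighted mean of per-class F1 scores, is the metric reported in the tables above.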
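In the same spirit, a sketch of the EM-CS selection step. The confidence score below (the margin between the best and second-best class log-posteriors) is only a stand-in; the actual ngLOC CS formula, whose scores fall in the 33.5 to 47.8 range above, is not reproduced here, and cs_thresh would be picked from that cross-validated range.

```python
def confidence_scores(Xu, prior, theta):
    """Stand-in confidence score: margin between the best and second-best
    class log-posteriors (the real ngLOC CS is defined differently)."""
    log_post = np.log(prior) + Xu @ np.log(theta).T
    top2 = np.sort(log_post, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def select_unlabeled(Xu, prior, theta, cs_thresh):
    """Keep only the unlabeled sequences whose CS exceeds the threshold,
    shrinking the pool passed to the next EM run."""
    return Xu[confidence_scores(Xu, prior, theta) > cs_thresh]

# Typical EM-CS flow: fit on labeled data alone, filter, then rerun EM-lambda.
# prior, theta = em_lambda(Xl, yl, Xu[:0], n_classes=3, n_iter=0)
# Xu_kept = select_unlabeled(Xu, prior, theta, cs_thresh=38.25)
# prior, theta = em_lambda(Xl, yl, Xu_kept, n_classes=3)
```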
Conclusion

Benefits:
- Probabilistic:
  - Unlabeled sequences of "high confidence" can be extracted, which is difficult with an SVM or TSVM.
  - Knowledge can be extracted from the model: discriminative n-grams and anomalies, via information-theoretic measures such as KL divergence (see the sketch after this list).
- Computational resources:
  - Time: significantly lower than SVM and TSVM.
  - Space: dependent on the n-gram model; large amounts of unlabeled data can be used.
- Applicable toward prediction of any structural or functional characteristic.
- Outputs a global model. Again, this is difficult with an SVM or TSVM: transduction is not global!
- The most substantial gain comes with limited labeled data.

Current work in progress:
- TSVMs: improve performance on smaller, unbalanced data.
- Select an improved, smaller-dimensional feature-space representation.
- Ensemble classifiers, Bayesian model averaging, mixture of experts.
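As one concrete form of the knowledge extraction mentioned above (an assumed illustration, not taken from the original work), discriminative n-grams for a class can be ranked by their pointwise contribution to the KL divergence between the class-conditional n-gram distribution and the background distribution:

```python
def discriminative_ngrams(prior, theta, vocab, c, top_k=10):
    """Rank n-grams by their contribution to KL(theta_c || background),
    where the background is the prior-weighted mixture of all classes."""
    background = prior @ theta                   # (V,) marginal n-gram dist.
    contrib = theta[c] * np.log(theta[c] / background)
    top = np.argsort(contrib)[::-1][:top_k]
    inv = {j: g for g, j in vocab.items()}
    return [(inv[j], float(contrib[j])) for j in top]
```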