Functional characterization of membrane transporters from protein sequences
Haiquan Li
The Samuel Roberts Noble Foundation

Membrane transport proteins (transporters)
• Functions
  - Uptake of nutrients (e.g. nitrogen)
  - Pump out toxic metabolites
  - Mediate signal transduction
  - Maintain ionic and osmotic homeostasis
• Classes based on driving energy
  - Channels (passive diffusion)
  - Carrier-type facilitators (electrochemical potential-driven, e.g. sodium potential)
  - Primary active transporters

Characterization of transporters
• Small-scale experimental methods
  - Patch-clamp techniques for channels
  - Isotope-labeled substrates
  - Heterologous expression
  - Mutant complementation
• The demand for genome-scale computational methods (transportomics)
  - Comparative studies: comparison of transporter families from multiple organisms, such as lignin-making and non-lignin-making organisms
  - Integrative study with transporter gene expression: exchange of metabolites (e.g. nitrogen) between legumes and rhizobia

An example of transportomics
[Figure] Udvardi & Day, 1997; Day et al., 2001

Transporter resources and classification systems
• Manually curated resources
  - TCDB by Saier et al.
  - TransportDB by Ren et al.

Computational characterization of transporters
• Computational characterization methods: machine learning, homology search (domain), empirical rules
• Homology search: false positives caused by gene duplication (paralogs), domain shuffling, or non-transporter domains
  - Example: the Plant Plasmodesmata (PPD) family (1.A.26) transports hormones or growth factors.
  - Single member: Connexin 32, a gap junction protein (BLAST)

Motivation of our work
• Objectives
  - List all candidate transporters, since low confidence may imply novelty and significance
  - Reduce curation effort significantly
• Methodologies
  - Use distinct machine learning and empirical rules to enhance annotation confidence
  - Efficiently and automatically integrate multiple lines of evidence from TCDB, Pfam, GO, SWISS-PROT and transmembrane segments (TMS)

Saport: a semi-automatic transporter annotation system
[Flowchart]
• Input sequences
• Machine Learning Module (TransportTP)
  - Initial classifier from TCDB: BLAST search, HMM search, score integration and initial prediction
  - Refining classifier: TMS KNN in TCDB, Pfam domains, GO terms, SwissProt homologs; classification by ensemble of SVMs
• Empirical Rule Module
  - Collect transporter-related evidence
  - Summarize family-based empirical rules
  - Interpret rules and generate putative transporters
• Score integration and ranking

TransportTP: Two-phase classification
[Figure: a query protein p is compared against TC families F1 … Fi … Fm by the initial classifier from TCDB; labels include "NN transporter" and the score blast(pij) * HMM(Fi). The refining classifier then processes the initial predictions.]
• True positives: correctly categorized transporters
• False positives: non-transporters incorrectly predicted as transporters
• False negatives: missed transporters
• True negatives: non-transporters
Haiquan Li, Xinbin Dai & Xuechun Zhao. Bioinformatics, 24, 1129-1136, 2008.
Haiquan Li, Vagner A. Benedito, Michael K. Udvardi and Xuechun Zhao. BMC Bioinformatics, under revision.
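
The two-phase scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TransportTP implementation: the `Hit` record, the `toy_refine` function and the example e-values are hypothetical, and the homology search is assumed to have been run already. Only the 0.1 e-value threshold for the initial classifier is taken from the evaluation slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Hit:
    """Best homology hit of a query against one TC family (hypothetical record)."""
    family: str    # TC family identifier, e.g. "1.A.1"
    evalue: float  # e-value of the best hit within that family


def initial_classifier(hits: List[Hit], evalue_threshold: float = 0.1) -> Optional[Hit]:
    """Phase 1: keep the best-scoring hit that passes a permissive threshold."""
    passing = [h for h in hits if h.evalue <= evalue_threshold]
    return min(passing, key=lambda h: h.evalue) if passing else None


def two_phase_predict(
    queries: Dict[str, List[Hit]],
    refine: Callable[[str, Hit], bool],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Phase 2: a refining classifier filters false positives from phase 1."""
    result: Dict[str, Tuple[str, Optional[str]]] = {}
    for seq_id, hits in queries.items():
        candidate = initial_classifier(hits)
        if candidate is None:
            result[seq_id] = ("non-transporter", None)
        elif refine(seq_id, candidate):
            result[seq_id] = ("transporter", candidate.family)
        else:
            result[seq_id] = ("rejected", candidate.family)
    return result


def toy_refine(seq_id: str, hit: Hit) -> bool:
    """Hypothetical stand-in for the ensemble of SVMs: accept only strong hits."""
    return hit.evalue < 1e-5


# Made-up queries for illustration only.
queries = {
    "seqA": [Hit("1.A.1", 1e-30), Hit("2.A.1", 1e-3)],
    "seqB": [Hit("1.A.26", 0.05)],
    "seqC": [Hit("2.A.17", 5.0)],
}
predictions = two_phase_predict(queries, toy_refine)
for seq_id, (label, family) in predictions.items():
    print(seq_id, label, family)
```

The permissive first phase favors recall (few missed transporters), while the second phase is responsible for precision. In the real system the second phase uses the TMS, Pfam, GO and SwissProt features described on the following slides.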
Refining features: TMS & KNN
[Figure: TMS distribution for the 1.A.1 family (72 channels); x-axis: TMS number (0 to 22), y-axis: number of transporters]
KNN feature: $z_{tms}(p, F) = \frac{P_{tms}(p) - \mu_{tms}(F)}{\sigma_{tms}(F)}$, where $P_{tms}(p)$ is the predicted TMS number of protein p, and $\mu_{tms}(F)$ and $\sigma_{tms}(F)$ are presumably the mean and standard deviation of the TMS number in family F.

Refining features: Pfam, GO & Swissprot
[Figure: a query protein p is linked to Pfam families and TC families through TCDB, Swissprot and cross-links]

Refining classifier: ensemble of SVMs
• Classification label of training samples
  - Positives are benchmarked against the manual annotation in TransportDB
  - Others are negatives
[Figure: the major class is split into sample subsets, each paired with the minor class to train SVM1, SVM2, … SVMk; unknown proteins are labeled by testing pos_weight > neg_weight]

Generation of empirical rules
• Manual curation of transporters
  - Collect transporter-related evidence
  - Categorize the evidence manually
• Summarize the rules on transporter families during the curation of plant organisms: medicago, lotus, sorghum, poplar, grape, moss, green algae
• Universal table of raw evidence, with columns: seqid, protein size, hmmtop tms, Tmpred tms, Tcdb hits, Pfam domains, Go terms, Swissprot homologs, NR hits, Localizations

Representation of rules
• Categories of curation
  - Level 1: every expected feature is present
  - Level 2: a minor feature is missing
  - Level 3: a major feature is missing or multiple features conflict
• Representation and customization of complicated empirical rules
  - Rule table columns: family, Min len, Max len, len std dev, hmmtop tms, Tmpred tms, Tcdb hits, Pfam domains, Go terms, Swissprot homologs, definition
  - Rule table rows: 1.A.1 … 3.A.1.1
  - Example rule: isnull($tcdb_top_evalue); lt($len,$-2/2):=3; lt($len,$-2-$0):+1 up to 3; gt($len,$-1+$0):+1 up to 3

Interpretation of Rules
• A simple script language
  - Flow control: serial '&', otherwise ';'
  - Variable definition: database field variables and rule column variables
  - Assignment and arithmetic operations: ':' '+' '-' '*' '/'
  - Comparison operations: lt, gt, eq, le, ge
  - String operations: isnull, matched, items, match_items, compatible
  - Boundary functions: up to, down to
  - Advanced functions: key, index, gradient, etc.
  - Nested functions
• The interpretation program can be fixed, while the rules can be tuned and customized for other kingdoms of organisms
• Interpret the script language using programming techniques

Final issues on Saport
• Final integration
  - Final scores are integrated from machine learning scores and empirical categorization
  - Sequences annotated by either method are accepted; all others are filtered out
  - Confidence is gained from the mutual support of both methods; further review is needed for conflicting or singly annotated sequences
• Tools: filtering, visualization and online curation

Saport (http://bioinfo3.noble.org/saport)

Evaluation of TransportTP module: cross-validation results
Organism | Num of proteins | Predictions by TransportTP | Annotations in TransportDB | Matches | Text mining validated | Recall (%) | Precision (%) | Balanced accuracy (%)
E. coli | 5411 | 589 | 577 | 456 | 61 | 79.03 | 77.42 | 78.22
A. thaliana | 26960 | 1073 | 1278 | 996 | 38 | 77.93 | 92.82 | 84.73
O. sativa | 56278 | 1230 | 1283 | 1061 | 88 | 82.70 | 86.26 | 84.44
C. elegans | 20051 | 906 | 667 | 601 | 87 | 90.10 | 66.34 | 76.42
D. melanogaster | 13890 | 663 | 646 | 535 | 26 | 82.82 | 80.69 | 81.74
H. sapiens | 37742 | 1272 | 1466 | 1140 | 79 | 77.76 | 89.62 | 83.27
Average on model proteomes | | | | | | 81.72 | 82.19 | 81.96
P. torridus | 1535 | 165 | 171 | 137 | 15 | 80.12 | 83.03 | 81.55
P. profundum | 5489 | 550 | 580 | 445 | 35 | 76.72 | 80.91 | 78.76
D. psychrophila | 3234 | 316 | 305 | 242 | 38 | 79.34 | 76.58 | 77.94
A. fumigatus | 9923 | 671 | 619 | 563 | 50 | 90.95 | 83.90 | 87.28
Average on non-model proteomes | | | | | | 81.78 | 81.11 | 81.44
Average on all testing proteomes | | | | | | 81.75 | 81.76 | 81.75
(Unlabeled value on the slide: 7.57%)
$\text{Balanced accuracy} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$
Yeast was used for training and the e-value threshold of the initial classifier was set to 0.1.

Full results of TransportTP in leave-one-in cross-validation
[Figure: recall/sensitivity, average = 80.2%; precision, average = 81.9%]
The e-value threshold of the initial classifier was set to 0.1.

General model versus genome-specific model on the balanced accuracy of TransportTP
[Figure: balanced accuracy, as defined above, versus e-value threshold of the initial classifier]

Benefit of integrating machine learning with homology search
[Figure: two panels comparing TransportTP, BLAST plus HMM, and BLAST; axes: precision (%) versus recall (%), and balanced accuracy (%) versus e-value threshold]
Yeast was used for training, and e-value thresholds from 10 to 1e-50 were tested.

The predictive performance of TransportTP on plant organisms
Organism | Manually curated | Predictions | Matches | Recall (%) | Precision (%) | Potential transporter rate (%)
M. truncatula | 1621 | 1991 | 1251 | 77.17 | 62.83 | 29.83
G. max | 3509 | 4178 | 3054 | 87.03 | 73.10 | 18.26
L. japonicus | 1740 | 2381 | 1299 | 74.66 | 54.56 | 25.66
S. bicolor | 1918 | 1960 | 1485 | 77.42 | 75.77 | 7.70
P. trichocarpa | 2512 | 2889 | 1936 | 77.07 | 67.01 | 14.36
V. vinifera | 2188 | 2002 | 1540 | 70.38 | 76.92 | 5.49
P. patens | 1388 | 1380 | 1019 | 73.41 | 73.84 | 6.81
Average | | | | 76.74 | 69.28 | 15.45
C. reinhardtii | 979 | 770 | 554 | 56.59 | 71.95 | 7.66
Manually curated: curation with confidence levels 1 and 2.
Potential transporter rate: proportion of predictions that match curation level 3.
Arabidopsis was used for training and 10 was used as the e-value threshold.

Preliminary results of automatic annotation by empirical rules
Organism | Manually curated | Automatically annotated | Matches | Recall (%) | Precision (%)
M. truncatula | 1621 | 1665 | 1386 | 85.50 | 83.24
G. max | 3509 | 3876 | 2867 | 81.70 | 73.97
L. japonicus | 1740 | 1580 | 1136 | 65.29 | 71.90
S. bicolor | 1918 | 1836 | 1534 | 79.98 | 83.55
P. trichocarpa | 2512 | 2575 | 2011 | 80.06 | 78.10
V. vinifera | 2188 | 1674 | 1429 | 65.31 | 85.36
P. patens | 1388 | 1384 | 1101 | 79.32 | 79.55
Average | | | | 76.74 | 79.38
C. reinhardtii | 979 | 653 | 583 | 59.55 | 89.28

Consistency between the two modules
Organism | Curation | TransportTP | Rules | Overlaps | Matches | Recall (%) | Precision (%)
M. truncatula | 1621 | 1991 | 1665 | 1235 | 1110 | 68.48 | 89.88
G. max | 3509 | 4178 | 3876 | 2838 | 2638 | 75.18 | 93.28
L. japonicus | 1740 | 2381 | 1580 | 1193 | 971 | 55.80 | 81.39
S. bicolor | 1918 | 1960 | 1836 | 1374 | 1308 | 68.20 | 95.20
P. trichocarpa | 2512 | 2889 | 2575 | 1915 | 1693 | 67.40 | 88.41
V. vinifera | 2188 | 2002 | 1674 | 1294 | 1222 | 55.85 | 94.44
P. patens | 1388 | 1380 | 1384 | 952 | 891 | 64.19 | 93.59
Average | | | | | | 65.01 | 90.88
C. reinhardtii | 979 | 770 | 653 | 452 | 439 | 44.84 | 97.12

Consistency between the two methods (cont'd)
[Figure: average recall / precision against human curation in Saport. Machine learning (TransportTP): 76.74 / 69.28. Empirical rules: 76.74 / 79.38. Overlap of TransportTP and empirical rules: 65.01 / 90.88.]

Comparative study of monolignol transporters
[Figure: plant cell; organism groups: higher plants, moss, algae (?), fungi]
• Comparative study: strengthening predictions versus all potential predictions
• Candidate monolignol transporters: 2.A.85 Aromatic Acid Transporters (ArAE)

Results on nodule transporters
TC | Num of genes (specific) | Family | Transporter substrates | Expr folds | Characterized orthologs | Reference
1.A.8.12 | 2 | LIMP | ammonia NH3+ | over | LIMP1/2 in lotus | Guenther & Roberts, 2000
2.A.17 | 1 | POT/PTR | dicarboxylate (malate) | >200 | AgDCAT1 in A. glutinosa | Jeong et al. 2004
2.A.53 | 13 (2) | | sulfate | >2 | LjSST1 in lotus | Krusell et al. 2005
2.A.1.8, 2.A.1 | 2, 32 | NPP | nitrate/nitrite | over | LjN70 in lotus, GmN70 in soybean | Vincill et al. 2005
2.A.7 | 7 (1) | DMT | iron | >50 | GmDMT1 in soybean | Kaiser et al. 2003
2.A.5 | 6 | ZIP | zinc | >2 | GmZIP1 | Moreau et al. 2002
2.A.72 | 1 | | potassium (K+) | over | LjKUP1 in lotus | Desbrosses et al. 2004
3.A.3.2 | 1 | Ca2+-ATPase | | >150 | unknown | Andreev et al. 1998, 1999
In total, 195 transporter genes are expressed at least five-fold, and 50 transporter genes are nodule-specific.
Benedito, Li et al. Plant Physiology, under review.

Discussion
• Comparison of the two methods
  - The machine learning method is general, but as a black box it is difficult for biologists to check
  - Empirical rules are family-based and easy for biologists to check, but may be biased toward the organisms from which they were summarized
• Pitfalls of the system
  - Difficult to distinguish transporters from sensors
  - Sensitive to partial sequences such as ESTs
  - Weak at handling transporter complexes
• Further work
  - Integrate gene expression and sub-cellular localization analysis
  - Integrate phylogenetic analysis: 1) characterize subfamilies or substrates based on SIFTER or TransportDB, and 2) compare annotated transporter families across multiple organisms

Summary
• Presented a transporter annotation system that effectively integrates homology-based methods, machine learning and empirical rules
• The system is promising for characterizing eukaryotic transporters with significantly reduced curation effort
• Provides a general framework for integrative decision making, including integration of multiple resources and prior biological knowledge

Acknowledgements
• Patrick Xuechun Zhao
• Vagner Benedito
• Ranamalie Amarasinghe
• Jian Zhao
• Xinbin Dai
• Michael Udvardi
• Carolyn Young
• Rick Dixon
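
The recall, precision and balanced accuracy figures in the evaluation tables can be checked from the raw counts. The sketch below assumes that recall is matches divided by curated annotations, precision is matches divided by predictions, and balanced accuracy is the harmonic mean of the two; the function names are illustrative, and the counts are those of the E. coli row in the cross-validation table.

```python
from typing import Tuple


def recall_precision(annotated: int, predicted: int, matches: int) -> Tuple[float, float]:
    """Return (recall, precision) in percent from raw counts."""
    recall = 100.0 * matches / annotated
    precision = 100.0 * matches / predicted
    return recall, precision


def balanced_accuracy(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision, in the same units as the inputs."""
    return 2.0 * recall * precision / (recall + precision)


# E. coli: 577 annotations in TransportDB, 589 predictions, 456 matches.
recall, precision = recall_precision(annotated=577, predicted=589, matches=456)
balanced = balanced_accuracy(recall, precision)
print(f"recall={recall:.2f} precision={precision:.2f} balanced={balanced:.2f}")
# prints: recall=79.03 precision=77.42 balanced=78.22
```

Because the harmonic mean is pulled toward the smaller value, a method cannot reach a high balanced accuracy by trading all of its precision for recall or the reverse.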