Title: Comparing Sequence and Structure-based Classifiers for Predicting RNA Binding Sites in Specific Families of RNA Binding Proteins Authors: Michael Terribilini1, Cornelia Caragea2, Deepak Reyon3, Jeffry Sander4, Jae-Hyung Lee5, Robert L. Jernigan6, Vasant Honavar7, and Drena Dobbs8 Short Abstract We evaluate machine learning classifiers for predicting RNA-binding residues in proteins, using either sequence-based information only, or a combination of sequence and structure-derived information and quantitate relative contributions of these different input types to overall prediction performance. We also present novel classifiers optimized for specific families of RNA binding proteins. Abstract Protein-RNA interactions play critical roles in a wide range of biological processes. Previously, we developed a machine learning approach for predicting which amino acids of an RNA-binding protein mediate protein-RNA interactions, using only the amino acid sequence of the protein as input (Terribilini et al, 2006, RNA 12:1450; http://bindr.gdcb.iastate.edu/RNABindR/). Here we report an evaluation of the relative contributions of sequence, structural features and evolutionary information to performance of algorithms for predicting RNA binding residues in proteins. In this study we train and test multiple classifiers using several benchmark datasets, including a non-redundant dataset of 181 RNA-binding polypeptide chains with <30% sequence identity (RB181), and “custom” datasets comprising sets of related RNA-binding proteins. We systematically compare results obtained using simple classifiers that use only one type of information as input (e.g., Naïve Bayes classifier, using only amino acid sequence as input) with results obtained using ensemble classifiers that exploit specific combinations of input information (e.g., an ensemble of Naïve Bayes classifiers that use the amino acid sequence, information from sequence homologs and/or the identities of spatial neighbors in known structures as input). Our results, partially summarized in Table 1 below, indicate that the best “overall” performance, evaluated on the basis of AUC for ROC curves, is obtained using ensemble classifiers using amino acid sequence information in combination with either: i) PSSMs (derived from sequence homologs identified using BLAST) or ii) spatial neighbor information (extracted from PDB structures of proteins). Also, we will report results obtained using “custom” classifiers for predicting RNA-binding residues in specific families of RNA binding proteins (i.e., those sharing similar sequences or structures). Table 1: Comparison of Classifiers for RNA-binding Site Prediction Sequencebased Area Under ROC Curve 1 2 3 4 5 6 7 8 0.74 Structurebased 0.77 PSSM-based 0.80 Bioinformatics & Computational Biology, Iowa State University, E-mail: terrible@iastate.edu Computer Science, Iowa State University, E-mail: caragea@cs.iastate.edu Bioinformatics & Computational Biology, Iowa State University, E-mail: dreyon@iastate.edu Bioinformatics & Computational Biology, Iowa State University, E-mail: jdsander@iastate.edu Bioinformatics & Computational Biology, Iowa State University, E-mail: jhlee777@iastate.edu Biochemistry, Biophysics & Molecular Biology, Iowa State University, E-mail: jernigan@iastate.edu Computer Science, Iowa State University, E-mail: honavar@cs.iastate.edu Genetics, Development, & Cell Biology, Iowa State University, E-mail: ddobbs@iastate.edu Ensemble 0.81 Scientific Justification 1. The ability to reliably predict which residues of a protein directly contribute to RNA binding would significantly enhance our understanding of how proteins recognize RNA and potentially generate new strategies for clinical intervention in both genetic and infectious diseases. We will present our results on predictions made on several clinically important RNA-binding proteins, such as the HIV-1 Rev and human telomerase reverse transcriptase (hTERT) protein. We will also present a comparison of our results with those of other labs 2. Classifiers can be "tuned" to enhance either specificity or sensitivity of interface residue prediction for specific subfamilies of related RNA binding proteins, thus facilitating the design of experimental investigations.