A Guide to Finding Protein Information on the Internet

Alexander Simon
Hill Lab, University of Utah
Biological Chemistry Program Summer Rotation
August 2006

Introduction

Numerous databases and applications exist for collecting, analyzing, and synthesizing biological information on the Internet. However, searching for relevant, nonredundant, and meaningful information is not always a straightforward task. The search process is complicated by characteristics of the Internet such as its rapid rate of growth and the volatility of URLs, as well as a lack of standard data formats and nomenclature among biological databases. Furthermore, although some Internet resources for finding biological information are well-known, it is not always obvious which one is best suited for a particular problem or whether a better solution exists. As a result, biologists may still be underutilizing the Internet or using it inefficiently. Thus, there is a need for a clear and concise schema for finding biological information on the Internet.

The aim of this guide is to provide a selection of reliable and informative Web sites that can be used to explore the known and predicted information about a protein of interest (POI). These sites are hyperlinked both in the text and in a flowchart (Figure 1) to facilitate the search process. If desired, these sites can then be used as launch points for further analyses.

Primary Knowledgebases

The process of collecting protein information on the Internet starts at two “top-level” knowledgebases: NCBI and Expasy. Both provide five main types of information: sequence, domains, 3D structure, cross-reference, and descriptive (text). Each of these categories is described in more detail in the corresponding section of this guide.

The two knowledgebases differ in their distribution and redundancy of information as well as their ease of use. NCBI provides more comprehensive text (PubMed) and sequence (Entrez/RefSeq) information than Expasy.
In contrast, because Expasy’s information is curated, it is less redundant than NCBI’s. Expasy also provides more predictive analyses and more links to external databases than NCBI. However, the search interface for Expasy is somewhat more cumbersome than that of NCBI (in the author’s opinion).

Although there is some overlap between the two sites, it is useful to explore your POI in both. Doing so, however, may not be straightforward, since the identifier or accession number for a protein in GenBank (NCBI) is incompatible with the one used in SWISSPROT/UniProt (Expasy). At the time of this writing, there does not appear to be a service to translate IDs between the two databases. However, if your POI is a human protein or has a human homolog, you can query it in the HUGO Gene Nomenclature Committee (HGNC) database, which will provide the GenBank and UniProt IDs (as well as RefSeq and PubMed IDs) for a specified human protein.

Descriptive Information

Once you have located NCBI and/or Expasy entries for your POI, the next step is to research the literature about it (although you can do this at any stage of your research). NCBI’s PubMed service is the most comprehensive database of biomedical literature and is probably the best resource for this task. If you are interested in a shorter, curated list of references, check the “References” section of the Expasy entry for your POI. Expasy entries provide other useful descriptive information about your POI in the “Cross-References” section (e.g. gene ontology listings and genome annotations) and the “Comments” section (e.g. function and subcellular localization).

Sequence Information

Now that you’ve learned what other people have published about your protein, it’s time to start finding out things for yourself! First, obtain the amino acid sequence. For Expasy entries, this is located in the “Sequence Information” section at the very bottom of the page.
At NCBI, you have a choice of either the amino acid or nucleotide sequence, but finding the right one can be a “needle-in-the-haystack” search since there are often many sequences from different organisms and experimental methods. To help restrict the search to reliable and completed sequences, try to find a reference sequence for your POI by clicking the “RefSeq” tab. If your POI is from an organism whose genome has been sequenced, you can also find its genomic sequence by going to the GOLD Web site and following the link to the organism’s genome site.

Bioinformatic Information

Once you know the amino acid sequence of your POI, you can start deriving information from it. Expasy provides a number of Web-based tools to calculate chemical and physical characteristics of your POI given its sequence. For example, ProtParam computes the molecular weight, theoretical pI, extinction coefficient, etc. A related program, ProtScale, creates a graphical representation for any of 55 different amino acid scales, such as hydrophobicity, for a protein.

Expasy is also linked to external Web-based programs that can predict topological and structural features of a protein. For instance, PSORT predicts the subcellular localization of a protein based on its sequence, while TMHMM and Coils predict the location of transmembrane helices and coiled coil regions, respectively. Another program, PredictProtein, predicts both transmembrane and low complexity regions in a protein. (It also provides other useful information; see the “Structural Information” and “Comparative Information” sections of this guide.) The location of transmembrane or coiled coil regions can help with the prediction of domain boundaries in your POI, particularly when there are no known domains.

Structural Information

Domains & Motifs

NCBI’s Conserved Domain Database (CDD) provides limited descriptive information about the domains in a protein.
Several Expasy programs (found in the “Cross-References” section of entries) do a better job at delineating known domains (ProDom and Pfam) and motifs (PROSITE). PredictProtein, a program accessible from Expasy Tools, provides search results from both ProDom and PROSITE in its output.

Secondary Structure

PredictProtein also provides graphical representations of predicted locations of alpha helices and beta sheets in the sequence of your POI.

Tertiary Structure

The primary database of three-dimensional protein structures is the RCSB Protein Data Bank (PDB). If a structure for your POI is available, you can link to the PDB entry from both Expasy (in the “Cross-References” section) and NCBI entries.

If the structure of your POI is unknown, do not despair; you can try to model its structure. The most accurate method, “homology modeling,” uses the structure of a similar sequence to predict the structure of your POI. This method is implemented on the SWISSMODEL server. (To find a sequence similar to your POI, refer to the “Comparative Information” section below.) In the absence of similar sequences, you can try phyre, a fold recognition program, and/or HMMSTR, an ab initio structure prediction server, but these methods are not considered to be very reliable.

Comparative Information

Some information that can be gleaned about a protein sequence can only be deduced by comparison with similar sequences. The most common method used to find sequences similar to your POI is NCBI’s BLAST program. A version of BLAST is also available on the Expasy site. The HomoloGene database at NCBI, which includes sequences of known homologs of your POI, may also be useful to you.

Sequence Comparison

To compare sequences effectively, you need to construct a multiple sequence alignment of all (or a subset) of the sequences that are similar to your POI. Many alignment programs need to be downloaded and run locally; however, there is an Internet server for the CLUSTALW program.
The versatile PredictProtein program will also generate a multiple sequence alignment for your POI using its own “MAXHOM” alignment program.

Multiple sequence alignments are very useful. They can be used to:

1. Identify positions of conserved residues in your POI, which are likely to be structurally or functionally critical.
2. Delineate novel domains in your POI, particularly when combined with secondary structure prediction information.
3. Construct evolutionary trees that provide either quantitative or qualitative measures of the divergence of your POI from related sequences and their relative rates of evolution. These trees can be made with the PHYLIP package or similar programs.
4. Generate a 3D homology model of your POI with SWISSMODEL, provided that a structure exists for at least one sequence in the multiple sequence alignment.

Structure Comparison

If the 3D structure of your POI exists, then you can use NCBI’s structure similarity search program VAST to identify its structural “neighbors” (i.e. other protein structures that superimpose with that of your POI). A related program is HOMSTRAD, which searches a curated database of structure-based alignments for homologous protein families. You can also browse structural similarity databases such as SCOP or CATH to find proteins that share fold types with your POI, but this may be a time-consuming process.

‘Omic’ Information

Beyond the realm of multiple sequence comparison lies the frontier of ‘-omics’, which aims to organize and analyze biological information on the cellular or organismic scale. Although ‘omics’ efforts have generated an abundance of data, and of hype over their potential, the field is currently fraught with challenges such as false positives from high throughput experiments.

Genomic

Currently, nearly 400 complete genome sequences are available, spanning the entire spectrum of life.
The genomic sequence for your POI can be obtained for any of these genomes through the GOLD ‘umbrella’ database, which is linked to specific genome databases and analysis programs.

Proteomic/Interactomic

Expasy entries are linked to IntAct, a protein interaction database (in the “Cross-References” section). Some other notable protein interaction databases include:

• BioGRID – One of the largest collections of experimentally determined protein interactions (there is a bias toward high throughput experiments, however). Provides references to the original publications.
• STRING – Calculates probabilities of interactions and provides a ranked list of potential interactions.
• POINT – Predicts interactions with and among homologs of your POI.

Pathways provide another, more global perspective of known protein interactions. Expasy entries are linked to Reactome, a detailed database of protein pathways (in the “Cross-References” section); however, it is not as user friendly as BioCarta (in the author’s opinion).

Significance & Future Directions

The heart of this guide is the flowchart shown in Figure 1. It illustrates the Web sites and programs that can be used to find specific types of information about a protein of interest (POI), their relationships, and a logical order in which they can be used to collect protein information. Furthermore, the embedded hyperlinks facilitate the search process by allowing a user to explore the flowchart in real-time on the Internet.

This interactive schema moves the search for protein information on the Internet toward a more fully automated method. In the future, it may be possible to program a “meta-search” engine (semi-automatic) or an intelligent agent (fully automatic) to traverse this schema and then report the results in a succinct manner to a user.
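
As a rough illustration of what traversing this schema might look like, the sketch below encodes the guide's information types and the resources named for each as an ordered list, then filters the steps by whether a 3D structure of the POI is known. It is only a sketch under stated assumptions, not the flowchart in Figure 1: the step labels, the grouping of resources, and the names `SEARCH_STEPS` and `plan_search` are hypothetical, and no Web queries are performed.

```python
# Hypothetical encoding of the guide's search schema (labels and grouping
# are assumptions made for this sketch; resource names come from the guide).
# Each entry: (information type, resources, structure requirement), where the
# requirement is None (always applies), True (only if a 3D structure of the
# POI is known), or False (only if no structure is known).
SEARCH_STEPS = [
    ("primary knowledgebases", ["NCBI", "Expasy"], None),
    ("descriptive", ["PubMed"], None),
    ("sequence", ["RefSeq"], None),
    ("bioinformatic", ["ProtParam", "ProtScale", "PSORT", "TMHMM",
                       "Coils", "PredictProtein"], None),
    ("domains and motifs", ["CDD", "ProDom", "Pfam", "PROSITE"], None),
    ("tertiary structure", ["PDB"], True),
    ("structure modeling", ["SWISSMODEL", "phyre", "HMMSTR"], False),
    ("sequence comparison", ["BLAST", "HomoloGene", "CLUSTALW", "PHYLIP"], None),
    ("structure comparison", ["VAST", "HOMSTRAD", "SCOP", "CATH"], True),
    ("genomic", ["GOLD"], None),
    ("interactions", ["IntAct", "BioGRID", "STRING", "POINT"], None),
    ("pathways", ["Reactome", "BioCarta"], None),
]


def plan_search(structure_known):
    """Return the ordered (information type, resources) steps that apply."""
    return [
        (info_type, list(resources))
        for info_type, resources, needs_structure in SEARCH_STEPS
        if needs_structure is None or needs_structure == structure_known
    ]


if __name__ == "__main__":
    # Print a search plan for a POI whose structure has not been solved.
    for info_type, resources in plan_search(structure_known=False):
        print(f"{info_type}: {', '.join(resources)}")
```

A meta-search engine of the kind described above would presumably replace each resource name with an actual query against that site; the ordered list only captures the sequence of steps and the one branch point (known versus unknown structure) that the guide describes.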