A Guide to Finding
Protein Information on the Internet
Alexander Simon
Hill Lab
University of Utah
Biological Chemistry Program Summer Rotation
August 2006
Introduction
Numerous databases and applications exist for collecting, analyzing, and synthesizing biological
information on the Internet. However, searching for relevant, nonredundant, and meaningful
information is not always a straightforward task. The search process is complicated by
characteristics of the Internet such as its rapid rate of growth and the volatility of URLs, as well
as a lack of standard data formats and nomenclature among biological databases. Furthermore,
although some Internet resources for finding biological information are well-known, it is not
always obvious which one is best suited for a particular problem or whether a better solution
exists. As a result, biologists may still be underusing the Internet or using it inefficiently. Thus,
there is a need for a clear and concise schema for finding biological information on the Internet.
The aim of this guide is to provide a selection of reliable and informative Web sites that can be
used to explore the known and predicted information about a protein of interest (POI). These
sites are hyperlinked both in the text and in a flowchart (Figure 1) to facilitate the search process.
If desired, these sites can then be used as launch points for further analyses.
Primary Knowledgebases
The process of collecting protein information on the Internet starts at two “top-level”
knowledgebases: NCBI and Expasy. Both provide five main types of information: sequence,
domains, 3D structure, cross-reference, and descriptive (text). Each of these categories is
described in more detail in the corresponding section of this guide.
The two knowledgebases differ in their distribution and redundancy of information as well as
their ease of use. NCBI provides more comprehensive text (PubMed) and sequence
(Entrez/RefSeq) information than Expasy. In contrast, because Expasy’s information is curated,
it has less redundant information than NCBI. Expasy also provides more predictive analyses and
more links to external databases than NCBI. However, the search interface for Expasy is
somewhat more cumbersome than for NCBI (in the author’s opinion).
Although there is some overlap between the two sites, it is useful to explore your POI in both.
Doing so, however, may not be straightforward since the identifier or accession number for a
protein in GenBank (NCBI) is incompatible with SWISSPROT/UniProt (Expasy). At the time of
this writing, there does not appear to be a service to translate IDs between the two databases.
However, if your POI is a human protein or has a human homolog, you can query it in the
HUGO Gene Nomenclature Committee (HGNC) database, which will provide the GenBank and UniProt
IDs (as well as RefSeq and PubMed IDs) for a specified human protein.
Descriptive Information
Once you have located NCBI and/or Expasy entries for your POI, your next step is to research
the literature about it (although you can do this at any stage of your research). NCBI’s PubMed service
is the most comprehensive database of biomedical literature and is probably the best resource for
this task. If you are interested in a shorter, curated list of references, then check the “References”
section of the Expasy entry for your POI.
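
A literature search of this kind can also be scripted. The sketch below is not part of the original workflow: it assumes NCBI's public E-utilities `esearch` endpoint, and it only builds the query URL without contacting the server. The function name `pubmed_search_url` and the example search term are illustrative.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "esearch" endpoint (not mentioned in this guide).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, max_results=20):
    """Build (but do not send) a PubMed search URL for a free-text term."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # free-text query, e.g. a protein name
        "retmax": max_results,   # maximum number of record IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)

url = pubmed_search_url("ubiquitin ligase", max_results=5)
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the matching PubMed record IDs, which can then be looked up on the PubMed site.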
Expasy entries provide other useful descriptive information about your POI in the
“Cross-References” section (e.g. gene ontology listings and genome annotations) and “Comments”
section (e.g. function and subcellular localization).
Sequence Information
Now that you’ve learned what other people have published about your protein, it’s time to start
finding out things for yourself! First, obtain the amino acid sequence. For Expasy entries, this is
located in the “Sequence Information” section at the very bottom of the page. At NCBI, you
have a choice of either the amino acid or nucleotide sequence, but finding the right one can be a
“needle-in-the-haystack” search since there are often many sequences from different organisms
and experimental methods. To help restrict the search to reliable and completed sequences, try
to find a reference sequence for your POI by clicking the “RefSeq” tab.
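
Sequences downloaded from either site are commonly saved in FASTA format (a header line starting with ">" followed by one or more lines of sequence). If you want to process such a file yourself, a minimal parser might look like the sketch below. The record shown is a made-up example, not a real database entry, and the function name `parse_fasta` is illustrative.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

# Hypothetical record used only for illustration.
example = """>example_poi hypothetical record for illustration
MKTAYIAKQR
QISFVKSHFS
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```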
If you are interested, you can also find the genomic sequence for your POI if it is from an
organism whose genome has been sequenced by going to the GOLD Web site and following the
link to the organism’s genome site.
Bioinformatic Information
Once you know the amino acid sequence of your POI, you can start deriving information from it.
Expasy provides a number of Web-based tools to calculate chemical and physical characteristics
of your POI given its sequence. For example, ProtParam computes the molecular weight,
theoretical pI, extinction coefficient, etc. A related program, ProtScale, creates a graphical
representation for any of 55 different amino acid scales, such as hydrophobicity, for a protein.
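
To give a feel for what such a tool computes, here is a minimal sketch of two of these quantities. It is not ProtParam's own code. It assumes standard average residue masses and the commonly used 280 nm coefficients (Tyr 1490, Trp 5500, cystine 125 M^-1 cm^-1); check both against the ProtParam documentation before relying on the numbers. The function names are illustrative.

```python
# Average residue masses in daltons (free amino acid minus one water).
# Assumption: standard tabulated values, not taken from this guide.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free N- and C-termini

def molecular_weight(seq):
    """Average molecular weight (Da) of an unmodified polypeptide chain."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER_MASS

def extinction_coefficient(seq, cystines=0):
    """Estimated molar extinction coefficient at 280 nm (M^-1 cm^-1).

    `cystines` is the number of disulfide bonds, which the sequence alone
    cannot tell you; it defaults to 0 (all cysteines reduced).
    """
    seq = seq.upper()
    return 1490 * seq.count("Y") + 5500 * seq.count("W") + 125 * cystines

peptide = "MKTAYIAKQRW"  # made-up peptide for illustration
print(round(molecular_weight(peptide), 2))
print(extinction_coefficient(peptide))
```

A sequence containing a letter outside the 20 standard amino acids raises a `KeyError`, which is a deliberate choice here so that ambiguous residues are not silently ignored.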
Expasy is also linked to external Web-based programs that can predict topological and structural
features of a protein. For instance, PSORT predicts the subcellular localization of a protein
based on its sequence while TMHMM and Coils predict the location of transmembrane helices
and coiled coil regions, respectively. Another program, PredictProtein, predicts both
transmembrane and low complexity regions in a protein. (It also provides other useful
information; see the “Structural Information” and “Comparative Information” sections of this
guide.)
The location of transmembrane or coiled coil regions can help with the prediction of domain
boundaries in your POI, particularly when there are no known domains.
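
The idea behind a hydrophobicity scale, and why long hydrophobic stretches hint at membrane-spanning segments, can be sketched with a sliding-window average. The example below uses the Kyte-Doolittle hydropathy values. The window of 19 residues and the cutoff of 1.6 are commonly quoted rules of thumb and are treated here as assumptions. This is far cruder than the hidden Markov model used by TMHMM and is meant only as an illustration; the function names and the test sequence are made up.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of every full-length window along the sequence."""
    seq = seq.upper()
    if window < 1 or window > len(seq):
        return []
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def hydrophobic_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows whose mean hydropathy exceeds the threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, score in enumerate(profile) if score > threshold]

# Made-up sequence: a 19-residue leucine stretch flanked by charged residues.
toy = "DDDDD" + "L" * 19 + "KKKKK"
print(hydrophobic_windows(toy))
```

Candidate windows that overlap usually belong to the same hydrophobic segment, so in practice you would merge adjacent hits before comparing them with predicted domain boundaries.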
Structural Information
Domains & Motifs
NCBI’s Conserved Domain Database (CDD) provides limited descriptive information about the
domains in a protein. Several Expasy programs (found in the “Cross-References” section of
entries) do a better job at delineating known domains (ProDom and Pfam) and motifs
(PROSITE).
PredictProtein, a program accessible from Expasy Tools, provides search results from both
ProDom and PROSITE in its output.
Secondary Structure
PredictProtein also provides graphical representations of predicted locations of alpha helices and
beta sheets in the sequence of your POI.
Tertiary Structure
The primary database of three-dimensional protein structures is the RCSB Protein Data Bank
(PDB). If a structure for your POI is available, you can link to the PDB entry from both Expasy
(in the “Cross-References” section) and NCBI entries.
If the structure of your POI is unknown, do not despair; you can try to model its structure. The
most accurate method, “homology modeling,” uses the structure of a similar sequence to predict
the structure of your POI. This method is implemented on the SWISSMODEL server. (To find
a sequence similar to your POI, refer to the “Comparative Information” section below.)
In the absence of similar sequences, you can try Phyre, a fold recognition program, and/or
HMMSTR, an ab initio structure prediction server, but these methods are not considered to be
very reliable.
Comparative Information
Some information that can be gleaned about a protein sequence can only be deduced by
comparison with similar sequences. The most common method used to find sequences similar to
your POI is NCBI’s BLAST program. A version of BLAST is also available on the Expasy site.
The HomoloGene database at NCBI, which includes sequences of known homologs of your POI,
may also be useful to you.
Sequence Comparison
To compare sequences effectively, you need to construct a multiple sequence alignment of all (or
a subset) of the sequences that are similar to your POI. Many alignment programs need to be
downloaded and run locally; however, there is an Internet server for the CLUSTALW program.
The versatile PredictProtein program will also generate a multiple sequence alignment for your
POI using its own “MAXHOM” alignment program.
Multiple sequence alignments are very useful. They can be used to:
1. Identify positions of conserved residues in your POI which are likely to be structurally or
functionally critical.
2. Delineate novel domains in your POI, particularly when combined with secondary
structure prediction information.
3. Construct evolutionary trees that provide either quantitative or qualitative measures of the
divergence of your POI from related sequences and their relative rates of evolution. Such
trees can be built with the PHYLIP package or similar programs.
4. Generate a 3D homology model of your POI with SWISSMODEL, provided that a
structure exists for at least one sequence in the multiple sequence alignment.
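
The first use in the list above, finding conserved positions, can be sketched in a few lines once you have an alignment in hand. The example below assumes the aligned sequences are equal-length strings with "-" marking gaps. The toy alignment is made up, and the function names are illustrative rather than part of CLUSTALW or any other program named in this guide.

```python
def conserved_columns(alignment):
    """0-based indices of columns where every sequence has the same residue (no gaps)."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = {row[i] for row in alignment}   # distinct characters in this column
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved

def percent_identity(a, b):
    """Percent of positions, ungapped in both sequences, at which a and b match."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

# Made-up three-sequence alignment for illustration.
alignment = [
    "MKT-AYIAK",
    "MKTSAYLAK",
    "MRT-AYIAK",
]
print(conserved_columns(alignment))
print(percent_identity(alignment[0], alignment[1]))
```

Pairwise identities computed this way are also a crude starting point for the distance measures that tree-building packages work from, although those packages apply more careful evolutionary models.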
Structure Comparison
If the 3D structure of your POI exists, then you can use NCBI’s structure similarity search
program VAST to identify its structural “neighbors” (i.e. other protein structures that
superimpose with that of your POI). A related program is HOMSTRAD, which searches a
curated database of structure-based alignments for homologous protein families.
You can also browse structural similarity databases such as SCOP or CATH to find proteins that
have folds similar to that of your POI, but this may be a time-consuming process.
‘Omic’ Information
Beyond the realm of multiple sequence comparison lies the frontier of ‘-omics’, which aims to
organize and analyze biological information on the cellular or organismal scale. Although
‘omics’ efforts have generated an abundance of data and hype over their potential, they are
currently fraught with challenges such as false positives from high-throughput experiments.
Genomic
Currently, nearly 400 complete genome sequences are available, spanning the entire spectrum of
life. If your POI comes from one of these organisms, its genomic sequence can be obtained through
the GOLD ‘umbrella’ database, which is linked to specific genome databases and analysis programs.
Proteomic/Interactomic
Expasy entries are linked to IntAct, a protein interaction database (in the “Cross-References”
section). Some other notable protein interaction databases include:
• BioGRID – One of the largest collections of experimentally determined protein
  interactions (there is a bias toward high throughput experiments, however).
  Provides references to the original publications.
• STRING – Calculates probabilities of interactions and provides a ranked list of
  potential interactions.
• POINT – Predicts interactions with and among homologs of your POI.
Pathways provide another, more global perspective of known protein interactions. Expasy
entries are linked to Reactome, a detailed database of protein pathways (in the “Cross-References”
section); however, it is not as user friendly as BioCarta (in the author’s opinion).
Significance & Future Directions
The heart of this guide is the flowchart shown in Figure 1. It illustrates the Web sites and
programs that can be used to find specific types of information about a protein of interest (POI),
their relationships, and a logical order in which they can be used to collect protein information.
Furthermore, the embedded hyperlinks facilitate the search process by allowing a user to explore
the flowchart in real-time on the Internet.
This interactive schema moves the search for protein information on the Internet toward a more
fully automated method. In the future, it may be possible to program a “meta-search” engine
(semi-automatic) or an intelligent agent (fully automatic) to traverse this schema and then report
the results in a succinct manner to a user.
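
As a rough illustration of what traversing the schema could mean in practice, the sketch below encodes the search steps as a directed graph and visits each one breadth-first. The node names follow the section headings of this guide, but the edges are an assumption made for illustration and may not match Figure 1. A real agent would replace the `print` call with queries to the corresponding Web sites.

```python
from collections import deque

# Hypothetical encoding of the search schema: each step maps to the steps
# that can follow it. The edges are assumed, not copied from Figure 1.
SCHEMA = {
    "Primary Knowledgebases": ["Descriptive", "Sequence"],
    "Descriptive": [],
    "Sequence": ["Bioinformatic", "Structural", "Comparative"],
    "Bioinformatic": [],
    "Structural": [],
    "Comparative": ["Omic"],
    "Omic": [],
}

def traversal_order(schema, start):
    """Breadth-first order in which a search agent would visit each step once."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        step = queue.popleft()
        order.append(step)
        for following in schema.get(step, []):
            if following not in seen:      # never visit a step twice
                seen.add(following)
                queue.append(following)
    return order

for step in traversal_order(SCHEMA, "Primary Knowledgebases"):
    print(step)
```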