6_Petrin_prot_DBs_2011

Protein databases
Petri Törönen
Shamelessly copied from material done by Eija Korpelainen
and from CSC bio-opas
http://www.csc.fi/oppaat/bio/
http://www.csc.fi/oppaat/bio/bio-opas.pdf
Why protein sequences?
• most (laboratory) analysis is done with
nucleotide sequences
• therefore the analysis at the nucleotide
level is natural
But there are drawbacks:
- Codon degeneracy => same protein,
different nucleotide sequence!
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/C/Codons.html
- Similarity between different amino acids
Therefore not all of the similarity is visible at the
nucleotide level!
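The codon point can be checked with a small sketch. The two DNA strings are made-up examples (not from the course material), and the codon table is only the subset of the standard genetic code needed for them.

```python
# Subset of the standard genetic code, just enough for the two toy sequences.
CODON_TABLE = {
    "ATG": "M",
    "CTT": "L", "TTA": "L",
    "TCT": "S", "AGC": "S",
    "CGT": "R", "AGA": "R",
    "TAA": "*", "TGA": "*",   # stop codons
}

def translate(dna):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

seq_a = "ATGCTTTCTCGTTAA"   # hypothetical gene variant A
seq_b = "ATGTTAAGCAGATGA"   # hypothetical gene variant B

print(translate(seq_a), translate(seq_b))
print("nucleotide identity:", round(identity(seq_a, seq_b), 2))
print("protein identity:", identity(translate(seq_a), translate(seq_b)))
```

Both toy sequences translate to the same peptide, although fewer than half of their nucleotide positions agree.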
…more…
Protein databases also often include more
detailed information.
The protein (not the RNA) is often the actual
functional unit that carries out a biological function.
- Note the exceptions, such as structural RNAs.
Various protein (related)
databases
• Databases including protein sequences
– UniProt
• Databases including protein domains
– PFAM
– PROSITE
• Databases including protein sequence
patterns, motifs
– PROSITE
Differences between databases
”Size” of included data components:
• ”Large” components:
– Whole sequences
• ”Medium” components
– Protein domains
– http://en.wikipedia.org/wiki/Protein_domain
• ”Small” components
– Protein sequence motifs
– http://en.wikipedia.org/wiki/Sequence_motif
• A protein sequence can include many domains,
and a domain can include many motifs
Differences between databases
• Some include all the available information (more
or less reliable information)
– high coverage, everything is stored in the database
– low reliability, information has not been confirmed
– computer annotation => fast updating
• Some cover only the reliable information
– low coverage
– information is reliable
– expert curation => slow updating
• SwissProt (curated) ↔ TREMBL (uncurated)
Differences between databases
Why the previous division?
• Some protein features/functions are linked to
domains
• Some features/functions are linked to specific
sequence motifs
• Some features can be best described at the
whole sequence level
Protein sequence databases
• UniProt
• SwissProt + TREMBL
• PIR-PSD
• Let's focus on SwissProt
Why is SwissProt nice?
• Sequences are manually annotated and
checked
• No multiple entries for the same sequence
• Annotations include protein function,
post-translational modifications, active sites
etc.
• Linked to many other databases
• Similar to RefSeq
So how do we search for protein sequences
in the available databases?
• Search with a protein name
• Search with a protein's function or
descriptive words
• Search with a protein/RNA sequence
WWW link for the first two options…
http://www.uniprot.org/uniprot/
Searching Uniprot
• Demonstrate the search by
looking for protein kinase proteins
from human
[Screenshot callouts: choose database; type query here; here you limit search to SwissP.]
Let's first go to Advanced Search
[Screenshot callouts: select field here; type query here]
1. Select the field: protein name
2. Type the query: protein kinase
We get all sequences that have both words
(protein AND kinase) in their description
After these results, open a new search row in Advanced Search
Next, select organism as the field and type Homo sapiens. Click Add&Search
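The same two-row search can also be written as a query string. This is only a sketch: the field names (`protein_name`, `organism_name`, `reviewed`) and the `rest.uniprot.org` address are assumptions taken from the present-day UniProt REST interface, not from these slides, and they may change. The code builds the URL but does not contact the server.

```python
from urllib.parse import urlencode

# Assumed endpoint of the current UniProt REST service (not shown in the slides).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"

def build_query(name_words, organism, reviewed_only=True):
    """Combine one clause per word of the protein name with an organism clause."""
    clauses = [f"protein_name:{word}" for word in name_words.split()]
    clauses.append(f'organism_name:"{organism}"')
    if reviewed_only:
        # Assumed filter that restricts the hits to the curated (SwissProt) part.
        clauses.append("reviewed:true")
    return " AND ".join(clauses)

def build_search_url(query, fmt="tsv", size=25):
    """Return the full search URL with the query safely URL-encoded."""
    return BASE_URL + "?" + urlencode({"query": query, "format": fmt, "size": size})

query = build_query("protein kinase", "Homo sapiens")
url = build_search_url(query)
print(query)
print(url)
```

Splitting the name into one clause per word mirrors the "protein AND kinase" behaviour described on the slide; a quoted phrase would be a stricter search.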
RESULTS:
Here you can look at features common
to the obtained sequences
Here you can limit the results to SwissProt
Click the gene name for more info on a hit
Let's open one for a better view…
Different fields of information can be found by scrolling down the page
NOTICE:
Detailed description of function → General annotation
Alternative splice variants and mutations reported
→ Alternative products
→ Natural variations
• The obtained result demonstrates the detailed
information available from SwissProt
• Note that the stored information includes
– information on the organism
– gene name, gene description
– links to the articles discussing the sequence
– Comment part has a detailed description of
• function
• tissue localization
– Features part has a detailed description of
• domains
• various functional components
Extra Slide
Go back to search results
Test these
Select keyword, and open Disease list for better viewing…
Extra Slide
You can view which
genes have been
reported to be
involved in some
diseases
Note that 18 are
linked to tumor
suppressors and 36
to Proto-oncogenes
Summary
• Protein databases provide detailed information on
protein sequences
• UniProt/SwissProt is the recommended protein
database
- manually curated
- non-redundant
• SwissProt can show very detailed information on
sequences
Sequence Motifs
• Motifs are conserved regions in functionally
similar proteins
• They are crucial for protein function
– a protein cannot change them without changing its
function
• Analysis of sequences with motifs can be more
efficient when no close sequence relatives are
found
– recommended when a normal sequence search gives
no results
http://en.wikipedia.org/wiki/Sequence_motif
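Motif databases such as PROSITE store motifs as short patterns. As a rough sketch, such a pattern can be turned into a regular expression and searched in a sequence. The converter covers only the common elements of the PROSITE syntax. The pattern resembles the C2H2 zinc-finger signature but should be treated as illustrative, and the test sequence is made up.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as 'C-x(2,4)-C' into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the pattern to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the pattern to the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                  # x means any amino acid
            core = "."
        elif core.startswith("{"):       # {..} lists residues that are not allowed
            core = "[^" + core[1:-1] + "]"
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"   # illustrative pattern
sequence = "MKCPECGKSFSQKSNLQKHQRTHTG"                    # made-up test sequence

regex = prosite_to_regex(PATTERN)
hit = re.search(regex, sequence)
print(regex)
print(hit.start(), hit.group())
```

A pattern search like this needs no close relative in the database, which is why it can still give a hint when an ordinary similarity search returns nothing.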
What is a motif?
Areas with strong conservation between
aligned sequences
[Figure: multiple sequence alignment of sequences with similar function]
modified from Terri Attwood, 2002
modified from Eija Korpelainen...
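The idea of "strong conservation between aligned sequences" can be sketched by scoring each alignment column. The alignment below is a made-up toy example, and the scoring rule (share of sequences carrying the most common residue) is one simple choice among many.

```python
from collections import Counter

def column_conservation(alignment):
    """For each column, the fraction of sequences sharing the most common residue."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]   # gaps do not count as residues
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def conserved_blocks(scores, threshold=1.0, min_len=3):
    """Return (start, end) index pairs of runs of columns scoring >= threshold."""
    blocks, start = [], None
    # The -inf sentinel closes a block that reaches the end of the alignment.
    for i, score in enumerate(scores + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks

# Hypothetical aligned sequences of equal length ('-' marks a gap).
alignment = [
    "MKGSGKTLA",
    "ARGSGKTQE",
    "-LGSGKTVD",
    "TEGSGKTIS",
]

scores = column_conservation(alignment)
for start, end in conserved_blocks(scores):
    print(start, end, alignment[0][start:end])
```

In this toy alignment the only fully conserved block is the five-residue stretch in the middle, which is what a motif database would store.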
Domain databases
• A domain is a sub-component of a protein
• It can exist and function independently of the
rest of the protein sequence
• Domains often act as evolutionary building blocks
that are combined to form proteins
• The same domain can occur in various proteins
• http://en.wikipedia.org/wiki/Protein_domain
Domain and motif databases
• PFAM
• PROSITE
• PRINTS
• TIGRFAM
• PRODOM
… and many more
Domain and motif databases
http://www.ebi.ac.uk/interpro/
http://www.ebi.ac.uk/interpro/about.html
All are combined into one service → InterPro
What is InterPro
• A collection of many protein-related
databases
• All aim to report various features
that can be used to analyze
sequences
• Features:
• Domains, Sequence motifs, Global sequence homology
• Different databases can be queried
simultaneously via InterPro
What is InterPro
• This generates a large amount of
information for a single query
• Good chance of getting useful
information for an unknown
sequence
• Some databases are well annotated
• The drawback is the repetition in the
results from different databases
• Queries are also SLOW
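One way to cope with the repetition is to group hits that cover the same region of the query. The sketch below uses entirely hypothetical hits (database names, signature names and coordinates are invented for illustration) and does not reflect the actual InterProScan output format.

```python
# Hypothetical hits: (source database, signature name, start, end).
hits = [
    ("DatabaseA", "kinase domain", 615, 874),
    ("DatabaseB", "kinase domain", 620, 870),
    ("DatabaseC", "kinase active site", 736, 748),
    ("DatabaseA", "N-terminal repeat", 30, 100),
]

def group_overlapping(hits):
    """Group hits whose sequence ranges overlap, scanning from left to right."""
    groups = []
    current_end = None
    for hit in sorted(hits, key=lambda h: h[2]):
        start, end = hit[2], hit[3]
        if current_end is not None and start <= current_end:
            groups[-1].append(hit)              # overlaps the current region
            current_end = max(current_end, end)
        else:
            groups.append([hit])                # starts a new region
            current_end = end
    return groups

groups = group_overlapping(hits)
for group in groups:
    region = (group[0][2], max(h[3] for h in group))
    sources = sorted({h[0] for h in group})
    print(region, sources)
```

Reading the output region by region makes it easier to see when several databases are reporting the same feature.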
How to use InterPro
• Sequence queries go to InterProScan
[Screenshot callout: sequence here]
Let's use the Serine/threonine protein kinase N1 sequence as the query
This sequence was in the UniProt results
Results
[Screenshot callouts: click titles for more info; query name; sequence here; visualization of results; domain associated with one region of the sequence]
Let's check more information on the reported domains….
Results
[Screenshot callout: contributing signatures from many databases]
Sequence signatures found by InterProScan usually have a detailed description
Results
• InterProScan gives us matches in the
sequence to various sequence features
– Domains, motifs
• These features are often well annotated
• Features associate functions to specific
regions of sequence
Other Databases
• Databases describing gene functions
– Gene Ontology databases
– Reaction pathway databases
• Databases describing associations to
phenotypes
– Disease gene databases
– Phenotype databases
Databases describing functions
Why do we need these databases?
• The earlier databases are helpful when the
analysis starts from a single unknown gene
• These databases help us to find all genes
known to be linked to a certain task
– Say, all apoptosis-related genes in human
• They are also helpful when we analyze
large sets of genes
– Is there something in common among the 100
genes that are most active in a cancer cell?
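The "something in common among 100 genes" question is often answered with an over-representation test. A minimal sketch, assuming the hypergeometric test is an acceptable choice and using invented counts:

```python
from math import comb

def enrichment_p_value(total_genes, class_genes, selected, selected_in_class):
    """P(X >= selected_in_class) when `selected` genes are drawn at random
    (without replacement) from `total_genes`, of which `class_genes` are in the class."""
    upper = min(class_genes, selected)
    favourable = sum(
        comb(class_genes, k) * comb(total_genes - class_genes, selected - k)
        for k in range(selected_in_class, upper + 1)
    )
    return favourable / comb(total_genes, selected)

# Hypothetical numbers: 20000 genes in total, 400 annotated to the class,
# 100 genes in our list, 10 of which carry the annotation.
p = enrichment_p_value(20000, 400, 100, 10)
print(f"p = {p:.2e}")
```

By chance only about 2 of the 100 genes would be expected in this class, so 10 gives a very small p-value. When many classes are tested at once, the p-values need a multiple-testing correction.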
Databases describing functions
• Gene Ontology databases
– Classify genes into categories that describe
gene function
– Standardized classification applicable to all
species
– Classes represent involvement in biological
tasks (like protein synthesis), chemical
activities (like carbohydrate binding) or
localization in cell (like nucleus)
• http://en.wikipedia.org/wiki/Gene_ontology
Databases describing functions
• Pathway databases
– Classify genes into biochemical pathways
– Classify genes into signalling pathways
• Example databases:
– KEGG: www.genome.ad.jp/kegg/
– REACTOME: http://www.reactome.org/
• http://en.wikipedia.org/wiki/Biological_pathway
www.geneontology.org
• The Gene Ontology (GO) is a hierarchical
structure for categorizing gene products in
terms of their association with:
• 1. biological processes
• 2. cellular components
• 3. molecular functions
• in a species-independent manner
Structure of Gene Ontology
• Hierarchical structure of linked
nodes
• Smaller classes: child classes
• Precise, detailed information
• Larger classes: parent classes
• Broad, unspecific information
• Smaller classes belong to
larger classes
• Viral protein biosynthesis =>
• Protein biosynthesis =>
• Biosynthesis
[Figure labels: root of hierarchical structure; starting node]
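The "smaller classes belong to larger classes" rule can be sketched as a walk from a child class up to the root. The three classes come from the example on this slide; the dictionary layout is an assumption made for illustration, with a list of parents per class so that a class could have more than one parent.

```python
# Child -> parents links for the example chain on the slide.
PARENTS = {
    "viral protein biosynthesis": ["protein biosynthesis"],
    "protein biosynthesis": ["biosynthesis"],
    "biosynthesis": [],          # root of this small example
}

def ancestors(term, parents):
    """Return every larger class that `term` belongs to, nearest first."""
    found = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(parents.get(current, []))
    return found

print(ancestors("viral protein biosynthesis", PARENTS))
```

A gene annotated to the child class is therefore also a member of every class returned by `ancestors`.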
Gene Ontology databases
• AmiGO http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
• QuickGO http://www.ebi.ac.uk/QuickGO/
AmiGO
• Server maintained by the GO Consortium for
analysing gene annotations across
species
• http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
AmiGO
[Screenshot callouts: select GO terms or gene names; query here; this limits to exact match]
AmiGO
[Screenshot callouts: associated genes; we get the precise definition of the class]
AmiGO
Let's look at the genes associated with apoptosis in yeast
(Saccharomyces cerevisiae)
[Screenshot callout: here you can limit the species]
The selected genes could be taken to a more detailed laboratory
analysis…
Databases describing functions
• These group genes into classes or
pathways
• Databases can be queried to see which
genes are in certain class / pathway
• You can also check which classes a
certain gene belongs to
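Both query directions amount to looking up the same annotation table from either side. A small sketch with hypothetical gene names and annotations:

```python
from collections import defaultdict

# Hypothetical annotations: gene -> classes / pathways it belongs to.
GENE_TO_CLASSES = {
    "geneA": {"apoptosis", "protein binding"},
    "geneB": {"apoptosis"},
    "geneC": {"protein biosynthesis"},
}

def invert(gene_to_classes):
    """Build the reverse table: class -> set of genes annotated to it."""
    class_to_genes = defaultdict(set)
    for gene, classes in gene_to_classes.items():
        for cls in classes:
            class_to_genes[cls].add(gene)
    return dict(class_to_genes)

CLASS_TO_GENES = invert(GENE_TO_CLASSES)

print(sorted(CLASS_TO_GENES["apoptosis"]))   # which genes are in a class?
print(sorted(GENE_TO_CLASSES["geneA"]))      # which classes does a gene belong to?
```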
Databases summary
• Nucleotide databases
• Genome databases
• Protein databases
• Protein motif / domain databases
• Function related databases
WAKE UP!