An introduction to informatics

Protein sequence databases:
dissemination of protein knowledge
http://education.expasy.org/cours/UniProt/
Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
Menu
Introduction
Nucleic acid sequence databases
ENA, GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases (NCBInr, RefSeq…)
Protein sequences are
the fundamental determinants
of biological structure and function.
http://www.ncbi.nlm.nih.gov/protein
Challenge
Flood of data -> need to be stored, curated and
made available for analysis and knowledge discovery
Challenge (1)
Many different protein sequence databases
PRF
RefSeq
TrEMBL
Genpept
UniProtKB
(IPI)
Swiss-Prot
Ensembl
UniMES
NCBInr
TPA
UniParc
(PIR)
PDB
CCDS
Challenge (1bis)
Different protein sequence databases :
many identifiers for the same protein sequence
These identifiers are all pointing to the same TP53 protein sequence (p53) !
P04637, NP_000537, ENSG00000141510, CCDS11118,
UPI000002ED67, IPI00025087, HIT000320921, XP_001172091,
DD954676 , JT0436 , etc.
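The identifier problem above can be made concrete with a small sketch that guesses which database an identifier comes from by its shape. The regular expressions below are illustrative assumptions (only the common forms are covered), and `classify_identifier` is a hypothetical helper, not part of any database's tooling.

```python
import re

# Illustrative patterns only: each database defines its own identifier syntax,
# and these expressions cover just the common forms seen on the slide.
ID_PATTERNS = [
    ("UniProtKB accession",
     r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"),
    ("RefSeq protein", r"[NXY]P_[0-9]+(?:\.[0-9]+)?"),
    ("Ensembl gene", r"ENSG[0-9]{11}"),
    ("CCDS", r"CCDS[0-9]+(?:\.[0-9]+)?"),
    ("UniParc", r"UPI[0-9A-F]{10}"),
    ("IPI", r"IPI[0-9]{8}(?:\.[0-9]+)?"),
]


def classify_identifier(identifier: str) -> str:
    """Return the name of the first pattern matching the whole identifier."""
    ident = identifier.strip()
    for name, pattern in ID_PATTERNS:
        if re.fullmatch(pattern, ident):
            return name
    return "unrecognised"


for ident in ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
              "UPI000002ED67", "IPI00025087", "JT0436"]:
    print(ident, "->", classify_identifier(ident))
```

Recognising the shape of an identifier says nothing about whether two identifiers point to the same sequence; that still requires a mapping service.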
A HUPO test sample study reveals common problems in mass
spectrometry–based proteomics
PubMed 19448641 (2009)
• A single mass spectrometry experiment can identify up to
about 4000 proteins (15’000 peptides)
• Protein databases vary greatly in terms of their curation,
completeness and comprehensiveness (search with different
protein databases = could get different results).
• Only 7 labs (out of 27) were able to identify the 20 human
proteins present in a sample, mainly due to the fact that the
search engines used cannot distinguish among different
identifiers for the same protein…
Challenge (3)
(protein) sequence annotation
Nucleic Acids Res. 2010 ; 38(Database issue): D633–D639.
‘Examining links from the perspective of PubMed, we
found that only a small fraction of published articles are
linked to human genes (Entrez Gene).’
Journals do not (SHOULD NOT) accept a paper dealing with a
nucleic acid sequence if the ENA/GenBank/DDBJ AC number is not
available…
‘journal publishers generally require deposition prior to publication so
that an accession number can be included in the paper.’
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq
…not the case for protein sequences
!!! no longer the case for many genomes !!!
Protein sequence origin…
More than 99 % of the protein sequences are derived
from the translation of nucleotide sequences
(genomes and/or cDNAs)
sequencing quality
coding sequence (CDS) annotation accuracy
gene prediction quality
… ~ 2500 genomes sequenced
(single organism, varying sizes, including virus)
… ~ 5’000 ongoing genome sequencing projects
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
~ 50-100 genomes/month
+ ~2’500 viral genomes
=> Total ~ 5’000 genomes
… cDNAs sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects
= environmental samples: multiple ‘unknown’ organisms,
Metagenomics
study of genetic material recovered directly
from environmental samples
• Global Ocean Sampling (C. Venter)
1 ml of sea water: 1 million bacteria and 10 million viruses
• Whale fall
(AAFZ00000000.1)
• Soil, sand beach, New-York air, …
Venter’s Sorcerer II
• Human fluids, mouse gut (millions of bacteria within
human body)
• Water treatment industry…
• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
… personal human genomes
New-generation sequencers: Illumina: 25 billion bp/day
3’000’000’000 $ (public consortium, 2000)
2’000’000 $ (2007)
70’000’000 $ (diploid, 2007)
300’000’000 $ (Celera, 2000)
2010
http://www.youtube.com/watch?v=mVZI7NBgcWM
…2700 genomes in 2010, 30’000 genomes in 2011 ?
But… we now know that his apoE allele is the one
associated with an increased risk of Alzheimer's disease and
that he has the ‘blue eye’ allele…
apoE gene (Ensembl genome browser)
New projects (Homo sapiens)
• 1000 genomes (first publication, October 2010)
• Multiple personal genomes (sexual cells, lymphoid
cells, cancer cells…)
• International cancer genome consortium
(www.icgc.org).
They look at the most common cancers and for
each they sequence the genome of 500 patients
with cancer and 500 healthy individuals….
How to define the human proteome ???
Which sequences ???
How many protein-coding
genes in the end?
190'500'025'042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
20 million bacteria/archaea x 4'000 genes
1 million protists x 6'000 genes
5 million insects x 14'000 genes
2 million fungi x 6'000 genes
0.5 million plants x 20'000 genes
0.5 million molluscs, worms, arachnids, etc. x 20'000 genes
0.1 million vertebrates x 25'000 genes
The calculation:
$2\times10^{7}\times4000 + 1\times10^{6}\times6000 + 5\times10^{6}\times14000 + 2\times10^{6}\times6000 + 5\times10^{5}\times20000 + 5\times10^{5}\times20000 + 1\times10^{5}\times25000$
+ 20000 (Craig Venter) + 42 (Douglas Adams) + …
About 190 billion proteins (?)
About 14.0 million ‘known’ protein sequences in 2011
(from ~300’000 species)
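The back-of-the-envelope estimate of the total number of proteins can be reproduced in a few lines. This is only a sketch of the arithmetic: the group sizes and gene counts are the ones quoted in the slide, while the names `ESTIMATE`, `EXTRA_TERMS` and `total_protein_estimate` are made up for the illustration.

```python
# (number of species, average number of genes per species), as quoted above.
ESTIMATE = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (1_000_000, 6_000),
    "insects": (5_000_000, 14_000),
    "fungi": (2_000_000, 6_000),
    "plants": (500_000, 20_000),
    "molluscs, worms, arachnids, etc.": (500_000, 20_000),
    "vertebrates": (100_000, 25_000),
}

# The two tongue-in-cheek terms at the end of the calculation.
EXTRA_TERMS = 20_000 + 42


def total_protein_estimate(groups=ESTIMATE, extras=EXTRA_TERMS) -> int:
    """Sum species x genes over all groups, then add the extra terms."""
    return sum(species * genes for species, genes in groups.values()) + extras


print(f"{total_protein_estimate():,}")
```

The listed terms add up to 190'500'020'042. The headline figure is 5'000 higher, which the open-ended "+ …" at the end of the calculation leaves room for.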
More than 99 % of the protein sequences are derived
from the translation of nucleotide sequences
Less than 1 % direct protein sequencing (Edman,
MS/MS…)
-> It is important that protein database users know
where the protein sequence comes from…
The ideal life of a sequence …
cDNAs, ESTs, genes, genomes, …
Nucleic acid sequence databases
Protein sequence databases
Menu
Introduction
Nucleic acid sequence databases
ENA/GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases
ENA (EMBL-Bank): European Nucleotide Archive
GenBank
DDBJ: DNA Data Bank of Japan
archive of primary sequence data and corresponding annotation
submitted by the laboratories that did the sequencing.
ENA/GenBank/DDBJ
http://www.insdc.org/
ENA/GenBank/DDBJ
• Serve as archives : ‘nothing goes out’
• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists ( 15 % of entries)
– Patent offices (i.e. European Patent Office, EPO)
• Currently: ~$210\times10^{6}$ sequences, ~$320\times10^{9}$ bp;
• Sequences from > 300’000 different species;
Archival databases:
- Can be very redundant for some loci
- Sequence records are owned by the original
submitter and cannot be altered by a third
party (except TPA)
accession number
taxonomy
references
Cross-references
CDS
CoDing Sequence
(proposed by submitters)
CDS annotation
(Prediction or
experimentally determined)
sequence
The hectic life of a sequence …
Data not submitted to public databases, delayed or cancelled…
with or without cDNAs,
annotated CDS
provided by authors
ESTs, genes, genomes, …
ENA, GenBank, DDBJ
CDS
CoDing Sequence
portion of DNA/RNA translated into protein
(from Met to STOP)
Experimentally proved
or derived from gene prediction
!!! not so well documented !!!
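The "from Met to STOP" definition of a CDS can be illustrated with a naive translation routine. It is a sketch under simplifying assumptions: it uses the standard genetic code, takes the first ATG it finds as the start (which is not always the true initiation site) and ignores introns. The function name `translate_cds` is hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(nucleotides):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns the protein sequence, or None when no start codon or no
    in-frame stop codon is found.
    """
    seq = nucleotides.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return None
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons (e.g. N)
        if aa == "*":
            return "".join(protein)
        protein.append(aa)
    return None  # reached the end without meeting a stop codon


print(translate_cds("GGATGAAATTTTAGCC"))
```

Real CDS annotation has to deal with splicing, alternative start codons and sequencing errors, which is why the origin of a CDS (experimental or predicted) matters.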
CoDing Sequence
Alignment between a mRNA and a genomic sequence
[Figure: alignment of an mRNA-derived contig (CONTIG) against the corresponding genomic sequence (Genomic). Blocks where both rows align (marked with '*') are exons; stretches where the contig row is gapped ('-') are introns.]
CDS provided by the submitters
The first Met !
CDS translation provided by ENA
mRNAs and their
corresponding CDS
annotation (from
EMBL/GenBank/DDBJ)
UCSC: human EPO
contig
5’
3’
Very rarely done…
Complete genome (submitted)
but only ~ 2,000 CDS/proteins available !
…annotated CDS in UniProtKB
http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
From nucleic acid to amino acid sequences
databases….
The hectic life of a protein sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genomes, …
Nucleic acid
databases
no CDS
ENA, GenBank, DDBJ
…if the submitters provide an
annotated Coding Sequence (CDS)
(1/10 ENA entries)
RefSeq, Ensembl and other
Gene prediction
RefSeq, Ensembl
Protein sequence
databases
Why do things the simple way when you
can do them in a very complex one?
The hectic life of a sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genomes, …
Scientific publications
derived sequences
ENA, GenBank, DDBJ
CoDing Sequences
provided by submitters
TrEMBL Genpept
CoDing Sequences
provided by submitters
and gene prediction
RefSeq
UniProtKB
(IPI)
Swiss-Prot
Ensembl
UniMES
(PIR)
PRF
TPA
UniParc
CCDS
PDB
+ all ‘species’ specific databases (EcoGene, TAIR, …)
Major ‘general’ protein sequence database ‘sources’
TPA
PIR
PDB
PRF
Integrated
resources
‘cross-references’
UniProtKB: Swiss-Prot + TrEMBL
Resources kept
separated
NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
not complete !!! (only entries created before 2007 ?)
UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot
(300’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences
PRF: Protein Research Foundation journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation
(11’000 species)
TPA: Third Party Annotation
Look for EPO
(homo sapiens)
Swiss-Prot
TrEMBL
www.uniprot.org
Look for EPO
(homo sapiens)
Swiss-Prot
GenPept
Swiss-Prot
RefSeq
RefSeq
GenPept
Menu
Introduction
Nucleic acid sequence databases
ENA-Bank/GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases
UniProt
UniProt consortium
EBI : European Bioinformatics Institute (UK)
SIB : Swiss Institute of Bioinformatics (CH)
PIR : Protein information resource (US)
UniProt databases
UniProtKB: protein sequence knowledgebase, 2 sections
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast,
download) (~14 million entries)
UniParc: protein sequence archive (ENA equivalent at the protein
level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found
(active or not). Not annotated (query, Blast, download) (~25 million entries)
UniRef: 3 sets of clusters of protein sequences with 100, 90 and
50 % identity; useful to speed up sequence similarity
searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million
entries; UniRef50: 3.3 million entries)
UniMES: protein sequences derived from metagenomic
projects (mostly Global Ocean Sampling (GOS)) (download)
(8 million entries, included in UniParc)
UniProtKB
an encyclopedia on proteins
composed of 2 sections
UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
unreviewed and reviewed
automatically annotated and manually annotated
released every 4 weeks
UniProtKB
from EMBL to TrEMBL
UniProtKB protein sequence data are mainly derived from
EMBL (CDS) but also from Ensembl, RefSeq, model
organism databases (MODs; e.g. TAIR) and PDB.
Data from the PIR database have been integrated in
UniProtKB since 2003.
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL give access to all the protein
sequences which are available to the public.
However, UniProtKB excludes the following protein sequences:
- Most non-germline immunoglobulins and T-cell receptors
- Synthetic sequences
- Most patent application sequences
- Small fragments encoded from nucleotide sequence (<8 amino acids)
- Pseudogenes*
- Fusion/truncated proteins
- Not real proteins
* many putative pseudogene sequences (which are tagged as potential
pseudogenes) may be expected to remain in UniProtKB for some time as it
can be difficult to prove the non-existence of a protein
Data increase in UniProtKB
[Chart: number of sequences in UniProtKB and in UniProtKB/Swiss-Prot, by date from 5-Jan-04 to 5-Jan-10; y-axis from 0 to 14,000,000]
EMBL
TrEMBL
Automated extraction of
protein sequence
(translated CDS), gene
name and references.+
Automated annotation
Protein and gene names
Taxonomic information
References
Cross-references
to over 125 databases
Automated annotation
Function, Subcellular location,
Catalytic activity,
Sequence similarities…
One protein sequence
One species
UniProtKB/TrEMBL
www.uniprot.org
Automated annotation
transmembrane domains,
signal peptide…
Automated annotation
Keywords
and
Gene Ontology
UniProtKB:
from EMBL to TrEMBL
Automated annotation
1. Protein sequence
2. Biological information
UniProtKB/TrEMBL
Protein sequence
- The quality of UniProtKB/TrEMBL protein sequences
is dependent on the information provided by the
submitter of the original nucleotide entry (CDS).
- 100% identical sequences (same length, same
organism) are merged automatically.
Biological information
Sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (SAAS: automated
generated annotation rules)
- From automated annotation (UniRule; manually
generated annotation rules)
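The automatic merging of 100% identical sequences from the same organism can be pictured as grouping records by an (organism, sequence) key. The sketch below is an assumption about how such a merge could be written, not the UniProtKB/TrEMBL pipeline; `merge_identical` and the record layout are hypothetical.

```python
from collections import defaultdict


def merge_identical(records):
    """Group records that share both organism and exact sequence.

    `records` is an iterable of (identifier, organism, sequence) tuples.
    Returns a dict mapping (organism, sequence) to the list of identifiers.
    Identical sequences necessarily have the same length, so length needs
    no separate check.
    """
    merged = defaultdict(list)
    for identifier, organism, sequence in records:
        merged[(organism, sequence.upper())].append(identifier)
    return dict(merged)


# Made-up identifiers, for illustration only.
records = [
    ("cds1", "Homo sapiens", "MKF"),
    ("cds2", "Homo sapiens", "mkf"),
    ("cds3", "Mus musculus", "MKF"),
]
print(merge_identical(records))
```

The same sequence in two different organisms stays in two separate groups, in line with the "one protein sequence, one species" principle.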
Automatic annotation in
UniProtKB/TrEMBL
System   | Rule creation | Trigger   | Annotations                                     | Scope
SAAS     | automatic     | InterPro  | comments, KW                                    | all taxa
UniRules | manual        | InterPro* | protein names, comments, features, KW, GO terms | all taxa
* Flexibility to create custom signatures for InterPro as required
UniProtKB/TrEMBL
SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Fully automated rule generation based on C4.5 decision tree algorithm.
• One annotation, one rule.
• Precision calculated for each rule vs UniProtKB/Swiss-Prot.
• High stringency – require 99% or greater estimated precision on
UniProtKB/TrEMBL to generate annotation.
• Rules are produced, updated and validated at each release.
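The precision criterion for SAAS rules can be sketched as follows. This shows only the thresholding idea, under the assumption that precision means the fraction of a rule's predictions confirmed by the reference annotation; it does not reproduce the C4.5 rule generation, and `rule_precision` and `keep_rule` are hypothetical names.

```python
def rule_precision(predicted, reference):
    """Fraction of the entries annotated by a rule that the reference confirms."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    true_positives = len(predicted & set(reference))
    return true_positives / len(predicted)


def keep_rule(predicted, reference, threshold=0.99):
    """Keep a rule only if its precision reaches the threshold."""
    return rule_precision(predicted, reference) >= threshold


# Toy data: 200 entries annotated by the rule, 199 confirmed by the reference.
predicted = [f"entry{i}" for i in range(200)]
reference = predicted[:199]
print(rule_precision(predicted, reference), keep_rule(predicted, reference))
```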
UniRules (RuleBase, HAMAP, PIRSF)
• Rules of varying complexity: annotation varies from simple KW attribution
to complete annotation as for UniProtKB/Swiss-Prot
• Rules are manually curated:
- From SAAS rules as input
- From UniProtKB/Swiss-Prot annotation and InterPro match data,
taxonomy information – continuously reported to curators
- From literature-based curation of characterized families - with the
possibility to create new signatures for specific functional groups
• Rules are continuously monitored – validation on UniProtKB/Swiss-Prot –
97% confidence
• HAMAP is also used for the annotation of some UniProtKB/Swiss-Prot
entries
UniRule – HAMAP
Automatic annotation in
UniProtKB/TrEMBL - Summary
• SAAS – automatically generated annotation rules for comments, KWs
- Tested on UniProtKB/Swiss-Prot
• UniRule – manually curated annotation rules (e.g. HAMAP)
– annotation varies from simple KWs to full annotation
– start point can be SAAS rules, InterPro reports, literature-based
curation of protein families
– possibility to create custom signatures -> InterPro
• Automatic annotation of UniProtKB/TrEMBL is refreshed, and validated,
each UniProtKB release – validation using UniProtKB/Swiss-Prot as
reference. ~10% of the rules are ‘refreshed’ at each release.
• The source of each annotation is indicated - users can access rule logic
Current status – coverage of
UniProtKB/TrEMBL
System                          | Rules | Coverage*
SAAS                            | 1684  | 17.2%
UniRules: RuleBase              | 1108  | 23.0%
UniRules: PIR name / site rules | 142   | 0.26%
UniRules: HAMAP                 | 1087  | 4.7%
* Proportion of entries with at least one annotation from the specified system
UniProtKB/TrEMBL 2010_12: 12,769,092 entries, all systems combined, 33%
Current status - coverage of
UniProtKB/TrEMBL
[Bar chart: % coverage of UniProtKB/TrEMBL (0 to 35%) by annotation type (All, CC, DE, FT, GN, KW), for UniRule, UniRule + SAAS and SAAS]
UniProt release 2010_10 included annotations from 7767 SAAS rules, 1814 UniRules
GO annotation
- KW2GO
- InterPro2GO
- HAMAP2GO
UniProtKB
from TrEMBL to Swiss-Prot
Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL
-> minimal redundancy
EMBL
Manual annotation of
the sequence and
associated biological
information
TrEMBL
Automated extraction
of protein sequence
(translated CDS), gene
name and references.+
Automated annotation
Swiss-Prot
Protein and gene names
Taxonomic information
References
Cross-references
to over 125 databases
One protein sequence
One gene
One species
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
Alternative products:
protein sequences produced by
alternative splicing,
alternative promoter usage,
alternative initiation…
UniProtKB/Swiss-Prot
www.uniprot.org
Manual annotation
Function, Subcellular location,
Catalytic activity, Disease,
Tissue specificity, Pathway…
Manual annotation
Post-translational modifications,
variants, transmembrane domains,
signal peptide…
Manual annotation
Keywords
and
Gene Ontology
UniProtKB:
from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate
sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract
literature information, ortholog data propagation, …)
UniProtKB
1- Sequence curation
The displayed protein sequence
…canonical, representative,
consensus…
UniProtKB/Swiss-Prot protein sequence annotation
‘Merging/Redundancy policy’:
a gene-centric view of protein space
1 entry <-> 1 gene (1 species)
1 displayed sequence
(annotation of alternative sequences, when available)
The displayed sequence is the most prevalent protein
sequence and/or the protein sequence which is also found
in orthologous species.
The displayed sequence is generally derived from the
translation of the genomic sequence (when available).
Sequence differences are documented.
What is the current status?
• At least 20% of Swiss-Prot entries
required a minimal amount of curation
effort so as to obtain the “correct”
sequence.
• Typical problems
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’
… once a gene on chromosome 11…
Quality of protein information from genome projects
• Let's look at proteins originating from genome projects:
– Drosophila: the paradigm of what a curated genome should look like
(thanks to FlyBase) : only 1.8% of the gene models conflict with
Swiss-Prot sequences;
– Arabidopsis: a typical example of a genome where a lot of
annotation was done when it was sequenced, but no update since
then (at least in the public view): 20% of the gene models are
erroneous;
– Tetraodon nigroviridis: the typical example of a quick and dirty
automatic run through a genome with no manual intervention:
>90% of the gene models produce incorrect proteins.
– Bacteria and Archaea have almost no splicing, so predictions are
“easier”, however errors are still made… Start codons, missed
small proteins (<100aa)…
UniProtKB/Swiss-Prot
Protein sequence annotation
Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating
missing genes according to orthologous sequences..
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation
of uric acid were inactivated and converted to pseudogenes.
• Producing a clean set of sequences is not a
trivial task;
• It is not getting easier as more and more types
of sequence data are submitted;
‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the
evidence for the existence of a given
protein;
• Different qualifiers:
1. Evidence at protein level (~18%)
(MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)
http://www.uniprot.org/docs/pe_criteria
In order to avoid ‘pseudogenes’ and most of the improbable
protein sequences, you can filter your query and exclude
sequences with ‘protein existence tag’ = ‘Uncertain’
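The same filter can be applied to downloaded flat-file entries by reading the PE line. The sketch below assumes entries are available as text blocks with `AC` and `PE` lines laid out as in the URAD_HUMAN example; the helper names are hypothetical and the second sample entry is invented for the illustration.

```python
import re

PE_LINE = re.compile(r"^PE\s+([1-5]):\s*(.+?);\s*$")
AC_LINE = re.compile(r"^AC\s+([A-Z0-9]+);")


def protein_existence(entry_text):
    """Return (primary accession, PE level, PE label) for one flat-file entry."""
    accession = level = label = None
    for line in entry_text.splitlines():
        if accession is None:
            ac_match = AC_LINE.match(line)
            if ac_match:
                accession = ac_match.group(1)
        pe_match = PE_LINE.match(line)
        if pe_match:
            level, label = int(pe_match.group(1)), pe_match.group(2)
    return accession, level, label


def drop_uncertain(entries):
    """Keep every entry whose PE level is not 5 (Uncertain)."""
    return [entry for entry in entries if protein_existence(entry)[1] != 5]


PREDICTED = "AC   A6NGE7;\nPE   4: Predicted;\n"
UNCERTAIN = "AC   Q00000;\nPE   5: Uncertain;\n"  # invented entry

print(protein_existence(PREDICTED))
print(len(drop_uncertain([PREDICTED, UNCERTAIN])))
```

Entries without a PE line are kept by this sketch; a stricter filter could drop them as well.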
The ‘alternative’ sequence(s)
UniProtKB/Swiss-Prot
1 entry <-> 1 gene (1 species)
Annotation of the sequence differences
(including conflicts, polymorphisms, splice variants etc..)
-> annotation of protein diversity
1 entry <-> 1 gene (1 species)
Multiple alignment of the end of the available GCR sequences
Annotation of the sequence differences (protein diversity)
…and natural variants
P04150
www.uniprot.org
UniProtKB (and RefSeq) under-represent alternatively spliced products
According to PMID:21307931: alternative splicing seems to occur at more
than 90% of protein-coding genes (might not always modify the protein
sequence).
Transcript variants are only made when there is information available on
the full-length nature of the product; if multiple, alternate exons are found
along the length of the gene, no assumption is made about the
combination of the alternate exons that may exist in vivo.
Uncertain alternative sequences (confirmed by only one cDNA) are tagged
with ‘No experimental confirmation available’
http://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_does_a_reviewed_status_me
Important remark
Available in separated files!
> 30’000 additional
sequences (total)
The ‘alternative’ sequence(s)
not ‘directly available’ for a lot of tools,
including protein identification tools,
Blast, depending on the server !….
Not included yet in the UniProtKB
complete proteome sets !
Depending upon the organism, the inclusion of alternative
sequences to the basic set of protein sequences can make a
tremendous difference. For instance, in Homo sapiens,
alternative sequences currently represent close to 40% of the
total number of annotated human sequences described in
UniProtKB/Swiss-Prot.
http://www.uniprot.org/faq/38
Blast P04150 against Swiss-Prot / homo sapiens @ UniProt
Isoform
sequences
Blast P04150 against Swiss-Prot / homo sapiens @ NCBI
The isoform sequences are not present in the NCBI protein
databases !
The .x number (P06401.4) corresponds to the version number
of the sequence…not to an alternatively spliced sequence !
How to track sequence changes ?
• The sequence version number applies to the
canonical sequence only
• There is no easy way yet to track sequence
updates of isoforms
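The distinction between a sequence version (the `.x` suffix) and an isoform can be made explicit when parsing identifiers. The sketch assumes the common conventions that a dot introduces the sequence version and a hyphen introduces a UniProtKB isoform number (e.g. P04150-2); `parse_identifier` is a hypothetical helper.

```python
def parse_identifier(identifier):
    """Split an identifier into accession, isoform number and sequence version."""
    core, _, version = identifier.partition(".")
    accession, _, isoform = core.partition("-")
    return {
        "accession": accession,
        "isoform": int(isoform) if isoform else None,
        "sequence_version": int(version) if version else None,
    }


print(parse_identifier("P06401.4"))
print(parse_identifier("P04150-2"))
```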
UniProtKB
2- Biological data curation
Extract literature information
and protein sequence analysis
maximum usage of controlled vocabulary
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/Pubmed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the
source of each annotation
Protein and gene names
General annotation
(Comments)
…enable researchers to
obtain a summary of
what is known about a
protein…
www.uniprot.org
Human protein manual annotation:
some statistics (Aug 2010)
Sequence annotation
(Features)
…enable researchers to
obtain a summary of
what is known about a
protein…
www.uniprot.org
Human protein manual annotation:
some statistics
(PTM)
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and
predicted data and makes a clear distinction between both.
Level | Type of evidence                                                            | Qualifier
1st   | Strong experimental evidence                                                | Ref.X
2nd   | Light experimental evidence                                                 | Probable
3rd   | Inferred by similarity with homologous protein (data of 1st or 2nd level)   | By similarity
4th   | Inferred by sequence prediction                                             | Potential
Find all the proteins localized in the
cytoplasm (experimentally proven)
which are phosphorylated on a
serine (experimentally proven)
UniProtKB
Additional information can be found in
the cross-references
(to more than 140 databases)
Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
Sequence: EMBL, IPI, PIR, RefSeq, UniGene
Proteomic: PeptideAtlas, PRIDE, ProMEX
Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Polymorphism: dbSNP
Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Ontologies: GO
Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
PTM: GlycoSuiteDB, PhosphoSite, PhosSite
PPI: DIP, IntAct, MINT, STRING
Other: BindingDB, DrugBank, NextBio, PMAP-CutDB
2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
Protein sequence origin
http://www.uniprot.org/faq/35
Access to UniProtKB
www.uniprot.org
The UniProt web site - www.uniprot.org
• Powerful search engine, Google-like and easy-to-use, but also
supports very directed field searches (similar to SRS)
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries
are bookmarkable, supporting programmatic access
• Tools: Blast, Align, IDmapping, Batch retrieval (Retrieve)
Search
A very powerful text search tool with
autocompletion and refinement options
that lets you look for UniProt entries and
documentation by biological information
The search interface
guides users with helpful
suggestions and hints
Advanced Search
A very powerful search tool
To be used when you know in which
entry section the information is stored
First have a look at the annotation examples.
Find all human proteins with
experimental evidence for their
location in the nucleus
The information is stored in the ‘General annotation’
section, Subcellular location
Result pages: Highly customizable
Custom downloads….
Accession Genes Domains Protein Existence
P02768 ALB (GIG20) (GIG42) (PRO0903) (PRO1708) (PRO2044) (PRO2619) (PRO2675
P02769 ALB Albumin domains (3) Evidence at protein level
P02770 Alb Albumin domains (3) Evidence at protein level
P07724 Alb (Alb-1) (Alb1) Albumin domains (3) Evidence at protein level
P08759 alb-A Albumin domains (3) Evidence at transcript level
P14872 alb-B Albumin domains (3) Evidence at transcript level
P43652 AFM (ALB2) (ALBA) Albumin domains (3) Evidence at protein level
P08835 ALB Albumin domains (3) Evidence at protein level
P49822 ALB Albumin domains (3) Evidence at protein level
P19121 ALB Albumin domains (3) Evidence at protein level
Open with Excel etc.
The URL (results) can be
bookmarked and manually
modified.
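Because the query is carried in the URL, a result page can also be requested from a script. The sketch below only builds such a URL and performs no network call. The endpoint, the parameter names (`query`, `fields`, `format`) and the field names are assumptions based on the present-day UniProt REST service, which differs from the site layout at the time of these slides; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint of the current UniProt REST service (not the 2011 site).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fields=("accession", "gene_names"), fmt="tsv"):
    """Build a bookmarkable search URL; query, columns and format are encoded."""
    params = {"query": query, "fields": ",".join(fields), "format": fmt}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("gene:ALB AND reviewed:true")
print(url)
```

Editing the query string by hand, as suggested above, amounts to changing the arguments passed to this function.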
Blast
A tool associated with the standard
options to search sequences
in UniProt databases
Blast results: customize display
Blast: use of UniProt annotation
amino-acids highlighting options
and feature annotation highlighting option in the local alignment
Align
A ClustalW multiple alignment tool with
amino-acids highlighting options
and feature annotation highlighting
option
ClustalW
multiple alignment of insulin sequences
amino-acids highlighting options
and feature annotation highlighting option in the local alignment
Retrieve
A UniProt-specific tool that retrieves a list of
entries in several standard identifier formats.
You can then query your ‘personal database’ with the
UniProt search tool.
Your dataset: results of a
Scan Prosite
ID Mapping
Gives the possibility to get a mapping between
different databases for a given protein
These identifiers are all pointing to TP53 (p53) !
P04637, NP_000537, ENSG00000141510, CCDS11118,
UPI000002ED67, IPI00025087, etc.
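Before mapping, it often helps to recognise which database an identifier comes from. The sketch below guesses the source of the TP53 identifiers listed above from their format alone. The UniProtKB accession pattern follows the accession format documented by UniProt; the other patterns are simplified assumptions (for example, `ENSG` plus 11 digits only covers human Ensembl gene identifiers).

```python
import re

# (database, pattern) pairs, tried in order. Only the UniProtKB pattern is the
# documented accession format; the others are simplified assumptions.
PATTERNS = [
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("RefSeq protein", re.compile(r"[NXY]P_\d+")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+")),
    ("IPI", re.compile(r"IPI\d{8}")),
    ("UniProtKB", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
]


def classify(identifier):
    """Return the name of the first database whose pattern matches fully."""
    for database, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return database
    return "unknown"


tp53_ids = ["P04637", "NP_000537", "ENSG00000141510",
            "CCDS11118", "UPI000002ED67", "IPI00025087"]
for identifier in tp53_ids:
    print(identifier, "->", classify(identifier))
```

Recognising the format says nothing about which protein the identifier points to; that link still has to come from a mapping service.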
Download
Downloading UniProt
http://www.uniprot.org/downloads
Canonical and isoform sequences
Complete proteome
‘gene’ centred
or
all known proteins ?
http://www.uniprot.org/faq/38
Remark: Some peptides are not associated with the keyword
‘Complete proteome’ because they do not match the human
genome
UniProt proteome sets, if downloaded in
UniProt flat file or XML format, contain
one sequence per UniProt record !
‘gene’ centred
all protein sequences in UniProtKB/Swiss-Prot…
Missing: other alternatively spliced
protein sequences in UniProtKB/TrEMBL
IPI closure
• The complete MOUSE proteome will be composed of all
MOUSE sequences in UniProtKB/Swiss-Prot plus those
MOUSE sequences in UniProtKB/TrEMBL that have a cross-reference to an Ensembl protein.
• The complete HUMAN proteome will therefore be composed
of all HUMAN sequences in UniProtKB/Swiss-Prot plus those
HUMAN sequences in UniProtKB/TrEMBL that have a cross-reference to an Ensembl protein.
• News: 30th March 2011: next UniProtKB release.
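The rule for the complete proteome sets described above can be written as a simple filter. This is a toy sketch, not UniProt's pipeline: the `Entry` class, its field names and the accessions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str          # hypothetical accession, for illustration only
    organism: str
    reviewed: bool          # True = UniProtKB/Swiss-Prot, False = UniProtKB/TrEMBL
    xrefs: set = field(default_factory=set)


def complete_proteome(entries, organism):
    """All Swiss-Prot entries of the organism, plus TrEMBL entries
    that carry a cross-reference to Ensembl."""
    return [
        e.accession
        for e in entries
        if e.organism == organism and (e.reviewed or "Ensembl" in e.xrefs)
    ]


entries = [
    Entry("SP_1", "HUMAN", True, {"Ensembl", "RefSeq"}),
    Entry("SP_2", "HUMAN", True),
    Entry("TR_1", "HUMAN", False, {"Ensembl"}),
    Entry("TR_2", "HUMAN", False, {"EMBL"}),
    Entry("SP_3", "MOUSE", True),
]
print(complete_proteome(entries, "HUMAN"))
```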
UniProtKB
Statistics
Swiss-Prot & TrEMBL
introduce a new arithmetical concept !
Swiss-Prot (520’000 entries, 12’000 species) + TrEMBL (14’000’000 entries, 130’000 species)
≈ 14’000’000
Redundancy in TrEMBL
&
Redundancy between TrEMBL and Swiss-Prot
12’000 species
mainly model organisms
Not yet available
~ 200 new entries / day
new release every 4 weeks
- Annotation is useful, good annotation is better, update is essential !
- Some entries have gone through more than 120 versions since their integration
in UniProtKB/Swiss-Prot
UniProtKB entry history
Always cite the primary accession number (AC) !
UniParc
- non-redundant protein sequence archive, containing both active and
inactive sequences (including sequences which are not in UniProtKB, e.g.
immunoglobulins…)
- the equivalent of ENA/GenBank/DDBJ at the protein level
- species-merged: sequences from different species are merged when 100% identical
over the whole length.
- no annotation (only taxonomy)
- can be searched only with database names, taxonomy, checksum (CRC64)
and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs.
- Beware: contains wrong predictions, pseudogenes etc…
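The "species-merged" behaviour can be illustrated with a few lines of code: records are grouped by exact sequence, whatever their species or source database. This is a toy sketch with invented record names and sequences; it does not reproduce the UniParc implementation or its CRC64 checksum.

```python
def merge_identical(records):
    """Group (source_id, species, sequence) records by exact sequence.

    Two records end up in the same archive entry only if their sequences are
    100% identical over the whole length, regardless of species.
    """
    archive = {}
    for source_id, species, sequence in records:
        entry = archive.setdefault(sequence, {"sources": [], "species": set()})
        entry["sources"].append(source_id)
        entry["species"].add(species)
    return archive


# Invented records, for illustration only
records = [
    ("db1:rec1", "Homo sapiens", "MKTAYIAKQR"),
    ("db2:rec7", "Pan troglodytes", "MKTAYIAKQR"),   # identical: merged
    ("db1:rec2", "Mus musculus", "MKTAYIAKQK"),      # one difference: separate
]
archive = merge_identical(records)
for sequence, entry in archive.items():
    print(sequence, entry["sources"], sorted(entry["species"]))
```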
Query UniParc
UniRef
‘UniRef is useful for comprehensive
BLAST similarity searches by providing
sets of representative sequences’
«Collapsing BLAST results»
Three collections of sequence clusters from UniProtKB and selected
UniParc entries:
One UniRef100 entry -> all identical sequences (identical sequences and
sub-fragments are grouped in a single record) -> reduction of 12 %
One UniRef90 entry -> sequences that have at least 90 %
identity -> reduction of 40 %
One UniRef50 entry -> sequences that are at least 50 % identical
-> reduction of 65 %
Based on sequence identity -> Independent of the species !
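To make the idea of identity-based clusters concrete, here is a toy greedy clustering. It is a sketch under strong assumptions: identity is computed naively, position by position without alignment, and the sequences and names are invented. It is not the procedure used to build UniRef.

```python
def identity(a, b):
    """Naive identity: matching positions divided by the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_clusters(sequences, threshold):
    """Assign each sequence to the first representative it matches at or above
    the threshold; otherwise it becomes a new representative.
    Longest sequences are considered first."""
    clusters = []  # list of (representative_sequence, member_names)
    for name, seq in sorted(sequences.items(), key=lambda item: -len(item[1])):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((seq, [name]))
    return [members for _, members in clusters]


# Invented sequences; note that the species plays no role in the clustering
sequences = {
    "human": "MKTAYIAKQR",
    "chimp": "MKTAYIAKQR",
    "mouse": "MKTAYIAKQK",
    "yeast": "MSDNEEVLLP",
}
for threshold in (1.0, 0.9, 0.5):
    print(threshold, greedy_clusters(sequences, threshold))
```

Lowering the threshold can only merge clusters, which is why each collection is smaller than the previous one.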
UniRef 90
Independent of
species and
sequence length
UniMES
The UniProt Metagenomic and Environmental Sequences
(UniMES) database is a repository specifically
developed for metagenomic and environmental protein
data (only GOS data for the moment).
Download only (but included in UniParc -> Blast).
- UniMES Fasta sequences
- UniMES matches to InterPro methods
ftp.uniprot.org/pub/databases/uniprot
UniMES: sequences in fasta format
Menu
Introduction
Nucleic acid sequence databases
ENA, GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases
NCBI protein
databases
(Entrez protein, NCBI nr)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Major ‘general’ protein sequence database ‘sources’
TPA
PIR
PDB
PRF
Integrated
resources
‘cross-references’
UniProtKB: Swiss-Prot + TrEMBL
Resources kept
separated
NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
not complete !!! (only entries created before 2007 ?)
UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences
PRF: Protein Research Foundation journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
TPA: Third Party Annotation
Query at Entrez protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Swiss-Prot
Typical result of
a query at
« Entrez protein »
RefSeq
Genpept
A Swiss-Prot entry with the NCBI look
A TrEMBL entry with the NCBI look
!!!!
GI number
‘GenInfo identifier’ number
- In addition to an AC number specific to the original database,
each protein sequence in the NCBInr database (including Swiss-Prot
entries) has a GI number.
AC
GI number: ‘GenInfo identifier’ number
- If the sequence changes in any way, a new GI number will be
assigned:
GI identifiers provide a mechanism for identifying the exact
sequence that was used or retrieved in a given search.
- A separate GI number is assigned to each protein translation
(alternative products)
- A Sequence Revision History tool is available to track the various
GI numbers, version numbers, and update dates for sequences that
appeared in a specific GenBank record:
http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
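The rule "new sequence, new GI" can be modelled in a few lines. This is a toy sketch: the class name, the accession and the GI values are hypothetical, and it does not describe how NCBI actually assigns identifiers.

```python
class GiRegistry:
    """Toy model: a new GI is issued whenever the sequence of a record changes."""

    def __init__(self, first_gi=1):
        self._next_gi = first_gi
        self._history = {}  # accession -> list of (gi, sequence)

    def submit(self, accession, sequence):
        versions = self._history.setdefault(accession, [])
        if versions and versions[-1][1] == sequence:
            return versions[-1][0]  # unchanged sequence keeps its GI
        gi = self._next_gi
        self._next_gi += 1
        versions.append((gi, sequence))
        return gi

    def history(self, accession):
        """All GI numbers ever assigned to this accession, oldest first."""
        return [gi for gi, _ in self._history.get(accession, [])]


registry = GiRegistry(first_gi=100)
first = registry.submit("ACC_1", "MKTAYIAKQR")
same = registry.submit("ACC_1", "MKTAYIAKQR")      # no change: same GI
revised = registry.submit("ACC_1", "MKTAYIAKQK")   # sequence changed: new GI
print(first, same, revised, registry.history("ACC_1"))
```

The accession stays stable across revisions while the GI pins down one exact sequence, which is the reason both are needed.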
ID/AC mapping
http://www.ebi.ac.uk/Tools/picr/
GenPept
Translation from annotated CDS in GenBank
Contains all translated CDS annotated in
GenBank/ENA/DDBJ sequences
- equivalent to UniProtKB/TrEMBL,
except that it is
redundant with other databases
(Swiss-Prot, RefSeq, PIR….)
GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’
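A minimal sketch of what "translation from an annotated CDS" involves, assuming the standard genetic code and a CDS that starts in frame. The example sequences are invented; real records may use other genetic codes or carry translation exceptions, which this sketch ignores.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[start:start + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))
```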
RefSeq
Produced by NCBI and NLM
http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/
The Reference Sequence (RefSeq) collection aims to provide a
comprehensive, integrated, non-redundant set of sequences,
including genomic DNA, transcript (RNA), and protein products,
for major research organisms.
Protein – mRNA – genomic sequence
Also chromosomes, organelle genomes, plasmids, intermediate
assembled genomic contigs, ncRNAs.
- tightly linked to Entrez Gene (« interdependent curated resources »)
Example: NP_000790
Beware: NeXtProt accession number: NX_P00918
AC
KW
Taxonomy
References
GenBank source
and status
Annotation and
ontologies
Curated records
UniProtKB vs RefSeq
UniProtKB/Swiss-Prot merges all CDS available for a given gene and
describes the sequence differences
UniProtKB/Swiss-Prot P04150 (GCR_HUMAN):
RefSeq chooses one or several protein reference sequences for a given
gene: they do not annotate the sequence differences.
- If there is an alternative splicing event, there will be several distinct
entries for a given gene
Example: GCR_HUMAN
1 UniProtKB entry
GCR_HUMAN
UniProtKB/Swiss-Prot
cross-linked with
7 RefSeq entries
Protein feature annotation found in RefSeq
- Conserved domains
- Signal and mature peptides
- Propagation of a subset of features from Swiss-Prot.
PTM annotation
Swiss-Prot vs
RefSeq
GCR_HUMAN
RefSeq statistics
The numbers are not comparable:
‘sequence’-centred entries (RefSeq) vs ‘gene’-centred entries (UniProtKB/Swiss-Prot)
UniProtKB vs NCBI protein
Summary
ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org)
Protein and nucleotide data | Genomic, RNA and protein data | Protein data only
Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry
Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL
Author submission | NCBI creates from existing data + gene prediction | UniProt creates from existing data
Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein from one gene of major organisms (in Swiss-Prot; TrEMBL is redundant)
Records can contradict each other | Identification and annotation of discrepancy
No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms
Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)
PIR
PIR: the Protein Information Resource
PIR-PSD is no
longer updated,
but exists as
an archive
PDB
• PDB (Protein Data Bank), 3D structure
• Contains the spatial coordinates of the atoms of
macromolecules whose 3D structure has been obtained by X-ray
or NMR studies
• Also contains the corresponding protein sequences
*The PIR-NRL3D database makes the sequence information in PDB
available for similarity searches and other tools
• Includes protein sequences which are mutated,
chimeric etc… (created specifically to study the
effect of a mutation on the 3D structure)
PDB: Protein Data Bank
www.rcsb.org/pdb/
• Managed by the Research Collaboratory for Structural
Bioinformatics (RCSB) (USA).
• Associated specialized programs allow the
visualization of the corresponding 3D structure
(e.g., SwissPDB-viewer, Chime, Rasmol).
• Currently there are ~68’000 structures for
about 15’000 different proteins, but far fewer
protein families (highly redundant) !
PDB: example
Sequence
Coordinates of each atom
Visualisation with Jmol
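The "coordinates of each atom" are stored in fixed-width ATOM records, so they can be read by column position. The sketch below extracts a few fields from one such line; the sample record is invented for illustration and only the columns used here are filled in.

```python
def parse_atom(line):
    """Extract a few fields from one fixed-width PDB ATOM record."""
    return {
        "name": line[12:16].strip(),        # columns 13-16: atom name
        "residue": line[17:20].strip(),     # columns 18-20: residue name
        "chain": line[21],                  # column 22: chain identifier
        "residue_number": int(line[22:26]), # columns 23-26: residue number
        "x": float(line[30:38]),            # columns 31-38: x coordinate (Å)
        "y": float(line[38:46]),            # columns 39-46: y coordinate (Å)
        "z": float(line[46:54]),            # columns 47-54: z coordinate (Å)
    }


# Invented sample record, assembled field by field to make the columns visible
SAMPLE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: unused
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location
    "MET"         # columns 18-20: residue name
    " "           # column 21: unused
    "A"           # column 22: chain
    "   1"        # columns 23-26: residue number
    "    "        # columns 27-30: insertion code + unused
    "  27.340"    # columns 31-38: x
    "  24.430"    # columns 39-46: y
    "   2.614"    # columns 47-54: z
)
atom = parse_atom(SAMPLE)
print(atom)
```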
PRF
Protein Research Foundation
Looks for peptide sequences described in publications (and
which are not submitted to databases !!!)
http://www.genome.jp/dbget-bin/www_bfind?prf
Other protein
databases
Ensembl
http://www.ensembl.org/
Review
http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610
Annotation pipeline
http://www.genome.org/cgi/content/full/14/5/942
- Ensembl aligns the genomic sequences with all the sequences
found in ENA, UniProtKB/Swiss-Prot, RefSeq and
UniProtKB/TrEMBL (-> known genes)
- It also does gene prediction (-> novel genes)
Ensembl = UniProtKB + RefSeq + gene prediction
- DNA, RNA and protein sequences available for several species.
- Ensembl concentrates on vertebrate genomes, but other groups
have adapted the system for use with plant, fungal and metazoan
genomes.
Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating
missing genes according to orthologous sequences.
ID   URAD_HUMAN              Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation
of uric acid were inactivated and converted to pseudogenes.
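Entries in the UniProt flat file format carry a two-letter line code (ID, AC, DT, DE, GN, DR, PE…) followed by the content, which makes such warning signs easy to check automatically. The sketch below groups the lines of the URAD_HUMAN example by line code; it is a minimal reader written for this example, not a full parser of the format.

```python
from collections import defaultdict


def parse_flat_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # A data line is: 2-letter code, 3 spaces, then the content
        if len(line) > 5 and line[2:5] == "   ":
            fields[line[:2]].append(line[5:].strip())
    return dict(fields)


ENTRY = """\
ID   URAD_HUMAN              Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
"""

fields = parse_flat_entry(ENTRY)
# Two hints that the entry deserves caution: unreviewed status, predicted existence
print("Unreviewed" in fields["ID"][0], fields["PE"])
```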
IPI
http://www.ebi.ac.uk/IPI/IPIhelp.html
IPI: Closure !
Automatic approach that builds clusters by
combining knowledge already present in the primary
data sources (UniProtKB, RefSeq, Ensembl) and
sequence similarity.
IPI = UniProtKB + RefSeq + Ensembl (+ H-InvDB,
TAIR + VEGA).
!!! Complete proteome sets include all alternative
splicing sequences….
Available for human, mouse, rat, zebrafish,
Arabidopsis, chicken, and cow
CCDS
http://www.ncbi.nlm.nih.gov/CCDS/
CCDS (human, mouse)
Combining different approaches – ab
initio, by similarity - and taking
advantage of the expertise acquired
by different institutes, including
manual annotation…
Consensus between 4 institutions…
Gene Ontology
(GO)
Standards :
Why is it so important ?
•‘The ever-increasing number of sequencing projects
necessitates a standardized system (…) to ensure that
the flood of information produced can be effectively
utilized.’ (PMID 19577473)
•Standardization of biological data/information (data
sharing and computational analysis).
•Aim: extract and compare annotation between different
resources or species (semantic similarity).
Secreted or not secreted ?
PubMed 19299134
Gene Ontology (GO)
• The Gene Ontology is a controlled vocabulary, a set
of standard terms—words and phrases—used for
indexing and retrieving information. In addition to
defining terms, GO also defines the relationships
between the terms, making it a structured
vocabulary. Contains ~30’000 terms.
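Because the relationships between terms are part of the vocabulary, a query for a general term can also return proteins annotated with more specific terms. The sketch below shows this with a hand-written, heavily simplified fragment of the hierarchy; the links and the protein names are illustrative assumptions, not an extract of the real ontology.

```python
# Simplified, hand-written "is_a" links (child -> parents); illustrative only
IS_A = {
    "nucleus": ["intracellular membrane-bounded organelle"],
    "intracellular membrane-bounded organelle": ["organelle"],
    "ribosome": ["organelle"],
    "organelle": ["cellular component"],
    "cellular component": [],
}


def ancestors(term, is_a=IS_A):
    """All terms reachable by following is_a links upwards from term."""
    found = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, []))
    return found


def annotated_to(term, annotations, is_a=IS_A):
    """Proteins annotated with term itself or with any more specific term."""
    return {
        protein
        for protein, annotated_term in annotations.items()
        if annotated_term == term or term in ancestors(annotated_term, is_a)
    }


# Hypothetical annotations
annotations = {"protein_A": "nucleus", "protein_B": "ribosome"}
print(sorted(annotated_to("organelle", annotations)))
```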
Gene Ontology (GO) terms
• biological process
- broad biological phenomena e.g. mitosis,
growth, digestion
• molecular function
- molecular role e.g. catalytic activity,
binding
• cellular component
- subcellular location e.g. nucleus, ribosome,
origin recognition complex
GO terms associated with human Erythropoietin
http://www.geneontology.org
Caveats
• Annotation is the process of assigning/mapping GO
terms to gene products…
• Electronic vs Manual annotation…
Example with EPO
Histone H4
!!! Large scale derived data (‘proteome’)
GO terms: Essential link between biological knowledge and
high-throughput genomic and proteomic datasets…
‘summary of the gene ontology classifications for all mapped ESTs…’
PMID: 15514041
~40 % of human proteins have no known function
(experimental data)…but many more are associated with GO
terms…(computer-assigned).
Human proteins functional distribution
Maybe
Potentially
Putative
Expected
Probably
Hopefully
All documents (including practicals) are online
http://education.expasy.org/cours/UniProt/