An introduction to informatics

The UniProt knowledgebase
www.uniprot.org
a hub of integrated protein data
http://education.expasy.org/cours/Turin/
Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
Protein sequences
• > 180 billion ‘different’ proteins on Earth (∑ N species x M genes)
• > 23.0 million ‘known and public’ protein sequences in 2012
• More than 99% of the protein sequences are derived from the translation of nucleotide sequences (mRNA or DNA)
• About 1% come from direct protein sequencing (Edman, MS/MS…)
Science cover, February 2011
data
protein sequence
knowledge
functional information
UniProt consortium
EBI : European Bioinformatics Institute (UK)
SIB : Swiss Institute of Bioinformatics (CH)
PIR : Protein information resource (US)
www.uniprot.org
UniProt databases
UniParc: protein sequence archive (EMBL-ENA equivalent at the protein level)
Each entry contains a protein sequence, taxonomic information,
cross-links to other databases where you find the sequence (active or not)
No annotation
All the public patent sequences are stored in UniParc (EPO, USPTO, JPO)
You can: query, Blast, download
~31 million entries
UniProt databases
UniRef
Three sets of clusters of protein sequences at 100%, 90% and 50% identity;
useful to speed up sequence similarity searches (BLAST)
You can: query, Blast, download
UniRef100: 17 million entries;
UniRef90: 11 million entries;
UniRef50: 5 million entries
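The idea of grouping sequences at an identity threshold can be illustrated with a toy sketch. This is not the UniRef pipeline: the helpers `identity` and `greedy_cluster` are hypothetical, sequences are compared position by position without any alignment, and the sequences are made up.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions, relative to the longer sequence (no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Put each sequence in the first cluster whose representative is similar enough."""
    clusters: list[list[str]] = []
    # Longest sequences first, so that each representative is the longest member.
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no cluster was close enough: start a new one
    return clusters


# Made-up sequences, for illustration only.
toy = ["MSKEKFERTK", "MSKEKFERTR", "MSKEKAAAAA", "MSKEKFERTK"]
for t in (1.0, 0.9, 0.5):
    print(t, len(greedy_cluster(toy, t)))
```

Lowering the threshold merges more sequences into fewer clusters, which is why a search against the 50% set has fewer representatives to scan.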
UniProt databases
UniMES:
protein sequences derived from metagenomic projects
(mostly Global Ocean Sampling (GOS))
You can: download
12 million entries, included in UniParc
UniProt databases
The centerpiece
UniProtKB
an encyclopedia on proteins
composed of 2 sections
UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
unreviewed and reviewed
automatically annotated and manually annotated
released every 4 weeks
UniProtKB
Origin of protein sequences
UniProtKB protein sequences are mainly derived from
- INSDC (translated submitted coding sequences - CDS)
- Ensembl (gene prediction) and RefSeq sequences
- Sequences of PDB structures
- Direct submission or sequences scanned from literature
85 %
15 %
(includes direct protein sequencing)
Notes:
- UniProt does not do any gene prediction.
- Most non-germline immunoglobulins and T-cell receptors, most patent sequences, highly over-represented data (e.g. viral antigens) and pseudogene sequences are excluded from UniProtKB, but stored in UniParc.
- Data from the PIR database have been integrated in UniProtKB since 2003.
EMBL -> TrEMBL -> Swiss-Prot
TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references; automated annotation
Swiss-Prot: manual annotation of the sequence and associated biological information
UniProtKB/TrEMBL
unreviewed
Automatic annotation
released every 4 weeks
Protein and gene names
Taxonomic information
References
Cross-references
to over 125 databases
Automated annotation
Function, Subcellular location,
Catalytic activity,
Sequence similarities…
One protein sequence
One species
UniProtKB/TrEMBL
Automated annotation
transmembrane domains,
signal peptide…
Automated annotation
Keywords
and
Gene Ontology
UniProtKB/TrEMBL
Automatic annotation
Protein sequence
- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged automatically.
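The merging rule for identical sequences can be sketched as a grouping on the pair (organism, sequence). This is a minimal illustration, not the UniProt production code; the function `merge_identical`, the source identifiers and the records are hypothetical.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (source_id, organism, sequence) records sharing organism and sequence."""
    merged = defaultdict(list)
    for source_id, organism, sequence in records:
        # Identical strings necessarily have the same length.
        merged[(organism, sequence)].append(source_id)
    return dict(merged)


# Made-up records, for illustration only.
records = [
    ("cds_1", "Escherichia coli", "MSKEKFERTK"),
    ("cds_2", "Escherichia coli", "MSKEKFERTK"),        # same organism and sequence: merged
    ("cds_3", "Salmonella typhimurium", "MSKEKFERTK"),  # other organism: kept apart
]
entries = merge_identical(records)
print(len(entries))
```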
Biological information
Sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (automatically generated annotation rules (i.e. SAAS) and/or manually generated annotation rules (i.e. UniRule))
UniProtKB/TrEMBL
Example of fully automatic annotation: SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Fully automated rule generation based on the C4.5 decision tree algorithm.
• One annotation, one rule.
• High stringency: requires an estimated precision of 99% or greater to generate an annotation (tested on UniProtKB/Swiss-Prot).
• Rules are produced, updated and validated at each release.
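The stringency criterion above can be sketched as a precision check of a candidate rule against reviewed entries. This covers only the acceptance test, not the C4.5 rule generation itself; the rule layout, the entry fields (`signatures`, `annotations`) and the helper names are assumptions made for illustration.

```python
def estimated_precision(rule, reviewed_entries):
    """Share of the reviewed entries matched by the rule that carry its annotation."""
    matched = [e for e in reviewed_entries if rule["condition"](e)]
    if not matched:
        return 0.0
    correct = sum(rule["annotation"] in e["annotations"] for e in matched)
    return correct / len(matched)


def keep_rule(rule, reviewed_entries, min_precision=0.99):
    """Accept a rule only if its estimated precision reaches the threshold."""
    return estimated_precision(rule, reviewed_entries) >= min_precision


# Made-up reviewed entries, for illustration only.
reviewed = [
    {"signatures": {"SIG_A"}, "annotations": {"Keyword X"}},
    {"signatures": {"SIG_A", "SIG_B"}, "annotations": {"Keyword X", "Keyword Y"}},
    {"signatures": {"SIG_B"}, "annotations": set()},
]
rule_a = {"condition": lambda e: "SIG_A" in e["signatures"], "annotation": "Keyword X"}
rule_b = {"condition": lambda e: "SIG_B" in e["signatures"], "annotation": "Keyword Y"}

print(keep_rule(rule_a, reviewed), keep_rule(rule_b, reviewed))
```

Here `rule_a` is right for both entries it matches and is kept, while `rule_b` is right for only one of two and is rejected.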
UniProtKB/Swiss-Prot
reviewed
manually annotated
released every 4 weeks
Manual biocuration is essential to knowledge maintenance
Protein and gene names
Taxonomic information
References
Cross-references
to over 125 databases
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
One protein sequence
One gene
One species
Alternative products:
protein sequences produced by
alternative splicing,
alternative promoter usage,
alternative initiation…
UniProtKB/Swiss-Prot
Manual annotation
Function, Subcellular location,
Catalytic activity, Disease,
Tissue specificity, Pathway…
Manual annotation
Post-translational modifications,
variants, transmembrane domains,
signal peptide…
Manual annotation
Keywords
and
Gene Ontology
UniProtKB/Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate
sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract
literature information, ortholog data propagation, …)
UniProtKB/Swiss-Prot
1- Protein sequence curation
UniProtKB/Swiss-Prot
a gene-centric view of the protein space
1 entry <-> 1 gene (1 species)
The displayed protein sequence:
…canonical, representative, consensus…
+
alternative sequences (described within the entry)
What is the current status?
• At least 20% of Swiss-Prot entries required a minimal
amount of curation effort so as to obtain the “correct”
sequence.
• Typical problems
– unsolved conflicts
– uncorrected initiation sites
– frameshifts
– wrong gene prediction
– other ‘problems’
UCSC genome browser
examples of CDS annotation submitted to INSDC…
UniProtKB/Swiss-Prot
2- Biological data curation
Extract literature information
and protein sequence analysis
maximum usage of controlled vocabulary
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the
source of each annotation
Protein and gene names
Synonyms useful for
literature searching
General annotation
(Comments)
…enable researchers to
obtain a summary of
what is known about a
protein…
An evidence attribution
system makes it easy to
trace the source of
each annotation
(Reference number, By similarity,
Probable, Potential)
Human protein manual annotation:
some statistics (June 2012)
Sequence annotation
(Features)
Find all the proteins localized in
the cytoplasm (experimentally
proven) which are phosphorylated
on a serine (experimentally proven)
Ontologies
‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the type of evidence for the existence of a given protein.
• Different qualifiers:
1. Evidence at protein level (~18%)
(MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58%)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)
http://www.uniprot.org/docs/pe_criteria
This is not a validation of the sequence!
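The five qualifiers can be read programmatically from an entry. The sketch below assumes the `PE` line layout of the UniProtKB flat-file format (`PE   1: Evidence at protein level;`); the helper `parse_pe` is hypothetical and the example line is written by hand.

```python
import re

# The five 'Protein existence' qualifiers listed above.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

# Assumed layout of the flat-file line: "PE   <level>: <label>;"
PE_LINE = re.compile(r"^PE\s+([1-5]):\s*(.+?);\s*$")


def parse_pe(line):
    """Return (level, label) for a PE line, or None if the line is not a PE line."""
    match = PE_LINE.match(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


level, label = parse_pe("PE   1: Evidence at protein level;")
print(level, label)
```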
UniProtKB
Additional information
can be found in the cross-references
(to more than 140 databases)
Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
Sequence: EMBL, IPI, PIR, RefSeq, UniGene
Proteomic: PeptideAtlas, PRIDE, ProMEX
Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Polymorphism: dbSNP
Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Ontologies: GO
Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
PTM: GlycoSuiteDB, PhosphoSite, PhosSite
PPI: DIP, IntAct, MINT, STRING
Other: BindingDB, DrugBank, NextBio, PMAP-CutDB
2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
The UniProt web site
www.uniprot.org
• Powerful search engine, Google-like and easy to use, which also supports very targeted field searches
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access
• Search, Blast, Align, Retrieve, ID mapping
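Because the URL reflects the query, a result page can be addressed from a script. The sketch below only builds such a URL and does not contact any server. The base URL and the `format` parameter are assumptions about the current UniProt REST service, which may differ from the site shown in these slides; the query string is the complete-proteome query used later in this course, and its field names may need adapting. The helper `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: search endpoint of the UniProt REST service; check the UniProt help pages.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query_url(query, fmt="tsv", base_url=BASE_URL):
    """Return a bookmarkable URL encoding the query and the download format."""
    return base_url + "?" + urlencode({"query": query, "format": fmt})


url = build_query_url('organism:93062 AND keyword:"complete proteome"')
print(url)
```

Editing the `query` value by hand in the resulting URL corresponds to modifying the bookmark manually.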
Search
A very powerful text search tool with
autocompletion and refinement options,
used to look for UniProt entries and
documentation by biological information
The search interface
guides users with helpful
suggestions and hints
Result pages: highly customizable
Result pages: downloadable
The URL can be bookmarked
and manually modified.
Query: sequence:(database:epo OR database:JPO OR database:USPTO)
Blast
A sequence similarity search tool with the standard
options, used to search sequences
in different UniProt databases and
data sets
Blast: customize the result display
Blast: local alignment
sequence annotation highlighting option
Align
A ClustalW multiple alignment tool
with
sequence annotation highlighting option
Retrieve
A UniProt-specific tool for retrieving a list of
entries from identifiers in several standard formats.
You can then query your ‘personal database’ with the
UniProt search tool.
Query your own dataset
ID Mapping
Maps the identifiers of a given protein
between different databases
These identifiers all point to a TP53 (p53) protein sequence!
P04637, NP_000537, NP_001119584.1, NP_001119585.1,
NP_001119584.1, NP_001119584.1, NP_001119584.1,
NP_001119584.1, ENSG00000141510, CCDS11118,
UPI000002ED67, IPI00025087, etc.
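Before mapping, it can help to recognize which database an identifier comes from. The sketch below classifies the identifier types listed above with regular expressions. The UniProtKB accession pattern follows the published accession format; the other patterns are simplified assumptions, and the function `classify` is hypothetical. No mapping service is contacted.

```python
import re

# (label, pattern) pairs; all patterns except the first are simplified assumptions.
ID_PATTERNS = [
    ("UniProtKB accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("RefSeq protein", re.compile(r"NP_\d+(?:\.\d+)?")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+(?:\.\d+)?")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}(?:\.\d+)?")),
]


def classify(identifier):
    """Return the label of the first pattern matching the whole identifier."""
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


for identifier in ["P04637", "NP_000537", "ENSG00000141510",
                   "CCDS11118", "UPI000002ED67", "IPI00025087"]:
    print(identifier, "->", classify(identifier))
```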
Download
Download UniProt
http://www.uniprot.org/downloads
Do not hesitate to contact us!
help@uniprot.org
The UniProt Consortium
SIB
Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel
Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael
Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann,
Sebastien Gehant, Elisabeth Gasteiger, Alain Gateau, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller,
Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat,
Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole
Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson,
Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue,
Anne-Lise Veuthey
EBI
Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo
Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer,
Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius
Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient,
Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra,
Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg
PIR
Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen,
Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale,
Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka
Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang
www.uniprot.org
UniProt is mainly supported by the National
Institutes of Health (NIH) grant 1 U41 HG006104-01. Additional support for the EBI's involvement in
UniProt comes from the NIH grant 2P41 HG02273-07.
Swiss-Prot activities at the SIB are supported by the
Swiss Federal Government through the Federal
Office of Education and Science and the European
Commission contracts SLING (226073), Gen2Phen
(200754) and MICROME (222886). PIR activities are
also supported by the NIH grants 5R01GM080646-04,
3R01GM080646-04S2, 1G08LM010720-01, and
3P20RR016472-09S2, and NSF grant DBI-0850319.
www.isb-sib.ch
Thank you for your attention
http://education.expasy.org/cours/Turin/
Some practical exercises
• Look for HBB, customize the display, run a multiple alignment.
• Look for plant protein sequences similar to human HBB (Blast).
• ID mapping: several proteins have been identified in a proteomic experiment. Which GO terms do they share? (GI numbers of the identified proteins: 16130093, 20664033, 1789812, 27574045, 229597766.)
Summary
A few words on the UniProt
‘complete proteome’
sequence sets…
UniProtKB - complete proteomes
2’747 complete proteomes
• Genome completely sequenced
• Proteins mapped to the genome
• Entries tagged with the KW ‘Complete proteome’
• UniProtKB/Swiss-Prot isoform sequences are available in FASTA format only
Fully manually reviewed (e.g. S. cerevisiae)
Partially manually reviewed (e.g. Homo sapiens)
Unreviewed (e.g. Acinetobacter baumannii (strain 1656-2))
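Since isoform sequences are distributed in FASTA format only, a small reader is often the first thing needed to work with a downloaded proteome. The sketch below is a minimal FASTA reader, assuming the usual layout of a `>` header line followed by sequence lines; the function `read_fasta`, the headers and the sequences are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, for illustration only.
example = """>example_1 hypothetical canonical sequence
MSKEKFERTK
PHVNVGTIGH
>example_2 hypothetical isoform
MSKEKFERTR
""".splitlines()

fasta_records = list(read_fasta(example))
print(len(fasta_records))
```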
UniProtKB - complete proteomes
Can be downloaded:
• From our complete proteome page
www.uniprot.org/taxonomy/complete-proteomes
• From the ‘FTP download’ page
• By querying UniProtKB + download
Query: organism:93062 AND keyword:"complete proteome"
Additional information: www.uniprot.org/faq/15
Query UniProtKB + download
Human proteome ~ 20’200 genes
Query for ‘homo sapiens’ (August 2011)
• UniProtKB: 110’056 entries + alt sequences (~ 15’435) = 125’491
• UniProtKB/Swiss-Prot: 20’244 entries + alt sequences (~ 15’435) = 35’679
• UniProtKB/TrEMBL: 89’834 entries
• RefSeq: 32’898 sequences
• Ensembl: 90’720 sequences
Query for ‘homo sapiens’ + Complete proteome (KW-181)
• UniProtKB: 56’392 + alt sequences (15’435) = 71’827
• UniProtKB/Swiss-Prot: 20’238 + alt sequences (15’435) = 35’673
• UniProtKB/TrEMBL: 36’154
92% of human entries are linked with at least one RefSeq entry…
Download