An introduction to informatics

Protein sequence databases
http://education.expasy.org/cours/Murcia2011/
Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
Protein Sequence Databases
Murcia, February, 2011
Menu
Introduction
Nucleic acid sequence databases
ENA, GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases
Other databases (Ensembl, IPI, CCDS, …)
Indispensable for bioinformatics studies
1. Databases (free access on the web)
2. Software tools
3. Servers
What is a database ?
• A collection of related data, which are
– structured
– searchable
– updated periodically
– cross-referenced
• Includes also associated tools necessary for access/query,
download, etc.
Why biological databases ?
• Exponential growth in biological data.
• Data (genomic sequences, protein sequences,
3D structures, 2D gel electrophoresis, MS
analysis, microarrays, publications….) are no
longer published in a conventional manner, but
directly submitted to databases.
• Essential tools for biological research.
The NAR Online Molecular Biology Database
collection in 2011
A total of 1’330 databases
http://nar.oxfordjournals.org/content/38/suppl_1
Categories of databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• 3D structure
• Mutation/polymorphism
• Protein domain/family
• Metabolism/Pathways
• Bibliography
• ‘Others’ (Protein-protein interaction, Microarrays…)
Categories of databases for Life Sciences
• Sequences (DNA, protein)
– DNA/RNA: EMBL/GenBank/DDBJ
– Protein: UniProtKB, NCBInr
• Genomics
– OMIM, FlyBase
• 3D structure
– PDB
• Mutation/polymorphism
– dbSNP
• Protein domain/family
– InterPro
• Metabolism/Pathways
– KEGG
• Bibliography
– PubMed
• ‘Others’ (Protein-protein interaction, Microarrays…)
[Figure labels: DNA sequences; microarray expression data; protein sequences; human genome gene annotation; macromolecular structure data]
Proliferation of databases
• Which one contains the highest-quality data?
• Which one is comprehensive?
• Which one is up to date?
• Which one is redundant?
• Which one is indexed (allows complex queries)?
• Which Web server responds most quickly?
• …
Awareness of the content and
usage of knowledge resources
is a pre-requisite to do any
type of « serious » research in
the field of molecular life
sciences
(AMB, 2007)
Where can we find…
•A video -> Youtube
•Info on S. Hawking-> Wikipedia
•A book -> Amazon
•A friend -> Facebook
– Usually only one server
•DNA sequence -> EMBL
•Protein sequence -> UniProtKB, RefSeq…
– Several different servers give access to the
‘same’ database
Servers
• ‘Any computer (…) serving out applications
or services can technically be called a
server. ‘ (Wikipedia)
EBI: http://www.ebi.ac.uk/
NCBI: http://www.ncbi.nlm.nih.gov/
ExPASy: http://expasy.org
UniProt: www.uniprot.org
How to find a database ?
Beware: not all servers give access to the latest
version of a database. It is important to know the
‘home server’ of a given database.
– ExPASy life sciences directory: -> ‘home’
server links (www.expasy.org/alinks.html)
– Google (http://www.google.com) (not
always linked to the ‘home’ server)
http://www.expasy.org/
http://www.expasy.org/links.html
The same data on different servers…
UniProt
NCBI
http://srs.dna.affrc.go.jp/srs8/srs?-id+1QexuT1Yn4Di0xF+[uniprot_swissprot-AccNumber:P16855]+-e
Proteins…proteins
Protein sequences are
the fundamental determinants
of biological structure and function.
http://www.ncbi.nlm.nih.gov/protein
Protein sequence databases
are essential for…
- Identification of proteins by proteomics
--> completeness, sequence quality
‘producing large protein lists is not the end point in Proteomics’ -> extract knowledge
- Similarity searches, BLAST (functional prediction)
--> sequence quality (no redundancy)
- Training datasets (prediction tools, PTM etc.)
--> sequence and annotation quality
- Creation of DNA chips for mRNA expression studies
--> completeness (complete proteome), sequence quality
?
TrEMBL, RefSeq, PRF, GenPept, UniProtKB, (IPI), Swiss-Prot, Ensembl, UniMES, TPA, UniParc, (PIR), NCBInr, PDB, CCDS
These identifiers all point to the same TP53 (p53) sequence!
P04637, NP_000537, ENSG00000141510, CCDS11118,
UPI000002ED67, IPI00025087, HIT000320921, XP_001172091,
DD954676, JT0436, etc.
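To illustrate why such a mix of identifiers is hard for tools to handle, here is a minimal Python sketch that guesses the source database from the shape of an identifier. The patterns and the function name `guess_source` are illustrative assumptions, not official definitions; real identifier formats have more variants, and several identifiers on the slide (HIT000320921, DD954676, JT0436) are deliberately left unrecognised.

```python
import re

# Illustrative patterns only (assumptions); each database documents its own format.
ID_PATTERNS = [
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("IPI", re.compile(r"^IPI\d{8}$")),
    ("CCDS", re.compile(r"^CCDS\d+$")),
    ("Ensembl gene", re.compile(r"^ENSG\d{11}$")),
    ("RefSeq protein", re.compile(r"^[NXY]P_\d+$")),
    ("UniProtKB", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
]


def guess_source(identifier):
    """Return the name of the first pattern that matches, else 'unknown'."""
    identifier = identifier.strip()
    for name, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


if __name__ == "__main__":
    for ident in ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
                  "UPI000002ED67", "IPI00025087", "XP_001172091", "JT0436"]:
        print(ident, "->", guess_source(ident))
```

Guessing from the format only tells you where an identifier comes from; deciding that two identifiers describe the same protein still requires the cross-references maintained by the databases themselves.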
A HUPO test sample study reveals common problems in mass
spectrometry–based proteomics
PubMed 19448641 (2009)
• A single mass spectrometry experiment can identify up to
about 4000 proteins (15’000 peptides)
• Protein databases vary greatly in terms of their curation,
completeness and comprehensiveness (searching different
protein databases can give different results).
• Only 7 of 27 labs were able to identify the 20 human
proteins present in a sample, mainly because the
search engines used cannot distinguish among different
identifiers for the same protein…
Protein sequence origin
More than 99 % of the protein sequences are derived
from the translation of nucleotide sequences
(genomes and/or cDNAs)
-> Important to know where the protein
sequence comes from…
(sequencing & gene prediction quality) !
Flood of data
example with the genome
sequences…
New challenge
-> Flood of data: needs to be stored, curated and
made available for analysis and knowledge discovery
… ~ 2500 genomes sequenced
(single organism, varying sizes, including virus)
… ~ 5’000 ongoing genome sequencing projects
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
~ 50-100 genomes/month
+ ~2’500 viral genomes
=> Total: ~ 5’000 genomes
… ~ 2500 genomes sequenced
(single organism, varying sizes, including virus)
… ~ 5’000 ongoing genome sequencing projects
… cDNAs sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects
= environmental samples: multiple ‘unknown’ organisms,
Metagenomics
Study of genetic material recovered directly
from environmental samples
• Global Ocean Sampling (C. Venter’s Sorcerer II)
1 ml of sea water: 1 million bacteria and 10 million viruses
• Whale fall (AAFZ00000000.1)
• Soil, sand beach, New York air, …
• Human fluids, mouse gut (millions of bacteria within the
human body)
• Water treatment industry…
• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
… ~ 2500 genomes sequenced
(single organism, varying sizes)
… ~ 5’000 ongoing genome sequencing projects
… cDNAs sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects
… personal human genomes
New-generation sequencers: Illumina: 25 billion bp/day
3’000’000’000 $ (public consortium, 2000)
300’000’000 $ (Celera, 2000)
70’000’000 $ (diploid, 2007)
2’000’000 $ (2007)
2010
http://www.youtube.com/watch?v=mVZI7NBgcWM
…2700 genomes in 2010, 30’000 genomes in 2011 ?
But… we now know that his apoE allele is the one
associated with an increased risk of Alzheimer’s disease and
that he has the ‘blue eye’ allele…
apoE gene (Ensembl genome browser)
New projects
• 1000 genomes (first publication, October 2010)
• Multiple personal genomes (sexual cells, lymphoid
cells, cancer cells…)
• International cancer genome consortium
(www.icgc.org).
They look at the most common cancers and for
each they sequence the genome of 500 patients
with cancer and 500 healthy individuals….
How many protein-coding genes in the end?
Peabody museum exhibition on the Tree of Life http://www.peabody.yale.edu/exhibits/treeoflife/
190'500'025'042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
20 million bacteria/archaea                   x  4'000 genes
1 million protists                            x  6'000 genes
5 million insects                             x 14'000 genes
2 million fungi                               x  6'000 genes
0.5 million plants                            x 20'000 genes
0.5 million molluscs, worms, arachnids, etc.  x 20'000 genes
0.1 million vertebrates                       x 25'000 genes
The calculation:
2x10^7 x 4'000 + 1x10^6 x 6'000 + 5x10^6 x 14'000 + 2x10^6 x 6'000
+ 5x10^5 x 20'000 + 5x10^5 x 20'000 + 1x10^5 x 25'000
+ 20'000 (Craig Venter) + 42 (Douglas Adams) + …
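The seven species-group terms of this back-of-the-envelope estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable and function names are illustrative, and the tongue-in-cheek extra terms of the slide (Craig Venter, Douglas Adams, "…") are left out.

```python
# (group, estimated number of species, assumed genes per species), as on the slide
ESTIMATES = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]


def total_genes(estimates):
    """Sum species count times genes per species over all groups."""
    return sum(species * genes for _, species, genes in estimates)


if __name__ == "__main__":
    for group, species, genes in ESTIMATES:
        print(f"{group:35s} {species * genes:>18,d}")
    print(f"{'total':35s} {total_genes(ESTIMATES):>18,d}")
```

The seven groups alone add up to 190.5 billion genes, which is where the "about 190 billion" figure comes from.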
About 190 billion proteins (?)
About 13.0 million ‘known’ protein sequences in 2011
(from ~300’000 species)
More than 99 % of the protein sequences are derived
from the translation of nucleotide sequences
Less than 1 % direct protein sequencing (Edman,
MS/MS…)
-> It is important that users know where the
protein sequence comes from…
(sequencing & gene prediction quality) !
The ideal life of a sequence …
cDNAs, ESTs, genes, genomes, …
Nucleic acid sequence databases
Protein sequence databases
ENA (EMBL-Bank): European Nucleotide Archive
GenBank
DDBJ: DNA Data Bank of Japan
Archive of primary sequence data and corresponding annotation
submitted by the laboratories that did the sequencing.
ENA/GenBank/DDBJ
http://www.insdc.org/
The hectic life of a sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genes, genomes, …
ENA, GenBank, DDBJ
Journals do not (SHOULD NOT) accept a paper dealing with a
nucleic acid sequence if the ENA/GenBank/DDBJ AC number is not
available…
‘journal publishers generally require deposition prior to publication so
that an accession number can be included in the paper.’
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq
…not the case for protein sequences
!!! no longer the case for a lot of genomes !!!
ENA/GenBank/DDBJ
• Serve as archives : ‘nothing goes out’
• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)
• Currently: ~200 x 10^6 sequences, ~300 x 10^9 bp;
• Sequences from > 300’000 different species;
Archival databases:
- Can be very redundant for some loci
- Sequence records are owned by the original
submitter and cannot be altered by a third
party (except TPA)
Organisms with the highest redundancy…
[Entry view labels: accession number; taxonomy; references; cross-references]
CDS: CoDing Sequence (proposed by submitters)
CDS annotation (prediction or experimentally determined)
Sequence
The hectic life of a sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genes, genomes, …
with or without annotated CDS provided by authors
ENA, GenBank, DDBJ
CDS
CoDing Sequence
portion of DNA/RNA translated into protein
(from Met to STOP)
Experimentally proved
or derived from gene prediction
!!! not so well documented !!!
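The definition of a CDS (from Met to STOP) can be made concrete with a short Python sketch that translates a nucleotide sequence starting at the first ATG and stopping at the first in-frame stop codon. It assumes the standard genetic code and a forward-strand DNA sequence without introns; the function name `translate_cds` is illustrative, and real CDS annotation also has to deal with splicing, the reverse strand and alternative start codons.

```python
# Standard genetic code, codons enumerated in TCAG order for each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(dna):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns an empty string when no ATG is found. Codons containing
    ambiguous bases (e.g. N) are translated as 'X'.
    """
    dna = dna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon: end of the CDS
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("GGATGAAATGGTAACC"))
```

Picking the first ATG is itself a prediction: choosing the wrong initiation codon is one of the typical sequence problems in database entries.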
CoDing Sequence
Alignment between an mRNA and a genomic sequence
[Figure: nucleotide alignment of the mRNA contig (CONTIG) against the genomic sequence (Genomic). Identical positions are marked with *; aligned blocks are labelled ‘exon’, and stretches present only in the genomic sequence (gaps in the contig) are labelled ‘intron’.]
CDS provided by the submitters
The first Met !
CDS translation provided by ENA
A eukaryotic gene (UCSC)
[Figure labels, 5’ to 3’: Met, initial exon, introns, internal exons, final exon, STOP, 3’ untranslated region]
This particular gene lies on the reverse strand!
mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)
UCSC: human EPO
[Figure labels: contig, 5’, 3’]
Complete genome (submitted)
but only ~ 2,000 CDS/proteins available !
…annotated CDS in UniProtKB
http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
ENA/GenBank/DDBJ
Variable level of sequence quality
- Sequencing quality
- Gene prediction quality
Authors can specify the nature of the CDS by using the
qualifier:
"/evidence=experimental" or "/evidence=not_experimental".
Very rarely done…
Variable level of sequence quality
DNA vs RNA
RNA
EST: Expressed Sequence Tags produced by one-shot sequencing of
a cloned cDNA
(no CDS, but proteomic tools give access to ‘translated ESTs’)
HTC: High-Throughput cDNAs
(CDS annotation)
DNA
GSS: Genome Survey Sequences: similar to the EST division, with the
exception that most of the sequences are genomic in origin
(no annotation, no CDS, with some exceptions (Drosophila))
HTG: High-Throughput Genomic Sequences: single-pass, unfinished
genomic sequences
(no annotation, no CDS with some exceptions (Leishmania))
WGS: Whole Genome Shotgun: contigs of a sequencing project. WGS
data can contain annotation and should be updated as sequencing
progresses.
(CDS annotation)
Complete proteomes
Complete genomes
?
Complete genomes ??
UCSC
27478 contigs
N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs equal to or
larger than this value
Genome reference consortium
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
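The N50 definition above can be sketched in Python as follows: sort the contigs from longest to shortest and walk down the list until half of the total assembly length is covered. The function name `n50` and the example lengths are illustrative.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least 50 % of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig is required")
    half_of_assembly = sum(lengths) / 2
    covered = 0
    for length in lengths:
        covered += length
        if covered >= half_of_assembly:
            return length


if __name__ == "__main__":
    # Total length is 54; the contigs of length 10, 9 and 8 already cover 27.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A high N50 means that most of the assembly sits in long contigs; it says nothing about whether those contigs are correct.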
Genome sequencing and assembly
some caveats to deal with…
• ~ 350 gaps in 2010 (human genome)
• In the near future, we will have to deal
with ‘incomplete genome’ sequences (never
finished, metagenome…)… Prediction of
‘partial’ genes/exons is complex!
• Updates of genome sequences: not always
‘stable’ data…
• We are all different: -> ‘pan genome’ ?
From nucleic acid to amino acid sequence
databases…
The hectic life of a protein sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genomes, …
Nucleic acid
databases
no CDS
ENA, GenBank, DDBJ
…if the submitters provide an
annotated Coding Sequence (CDS)
(1/10 ENA entries)
Gene prediction
RefSeq, Ensembl
Protein sequence
databases
The hectic life of a protein sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genomes, …
Nucleic acid
databases
no CDS
ENA, GenBank, DDBJ
…if the submitters provide an
annotated Coding Sequence (CDS)
(1/10 ENA entries)
RefSeq, Ensembl and other*
Gene prediction
RefSeq, Ensembl
Protein sequence
databases
* 1000 genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/2010_11/
Why do things the simple way when you
can do them in a very complex one?
The hectic life of a sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genomes, …
Scientific publications
derived sequences
ENA, GenBank, DDBJ
CoDing Sequences
provided by submitters
TrEMBL Genpept
CoDing Sequences
provided by submitters
and gene prediction
RefSeq
UniProtKB
(IPI)
Swiss-Prot
Ensembl
UniMES
(PIR)
PRF
TPA
UniParc
CCDS
PDB
+ all ‘species’ specific databases (EcoGene, TAIR, …)
Major ‘general’ protein sequence database ‘sources’
TPA PIR
PDB
PRF
Integrated
resources
‘cross-references’
UniProtKB: Swiss-Prot + TrEMBL
Resources kept
separated
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA
UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot
(300’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences
PRF: Protein Research Foundation: journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation
(11’000 species)
TPA: Third Party Annotation
Look for toll-like receptor 4 (Homo sapiens) at www.uniprot.org: entries from Swiss-Prot and TrEMBL
Look for toll-like receptor 4 (Homo sapiens) at http://www.ncbi.nlm.nih.gov/: entries from GenPept (several), RefSeq and Swiss-Prot
UniProt consortium: EBI + SIB + PIR
UniProt
What is UniProt ?
. UniProtKB sequence curation
. UniProtKB biological data curation
. Statistics
. Access to UniProtKB
www.uniprot.org
UniProt databases
UniProtKB: protein sequence knowledgebase, 2 sections:
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, BLAST,
download) (~13 million entries)
UniParc: protein sequence archive (ENA equivalent at the protein
level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found
(active or not). Not annotated (query, BLAST, download) (~25 million entries)
UniRef: 3 sets of clusters of protein sequences with 100, 90 and
50 % similarity; useful to speed up sequence similarity
searches (BLAST) (query, BLAST, download) (UniRef100: 10 million entries; UniRef90: 7 million
entries; UniRef50: 3.3 million entries)
UniMES: protein sequences derived from metagenomic
projects (mostly Global Ocean Sampling (GOS)) (download)
(8 million entries, included in UniParc)
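The idea behind a sequence archive such as UniParc (one entry per distinct sequence, with cross-links to every database record that carries it, active or not) can be sketched in Python as below. This is a toy model, not UniParc's implementation: the class name `SequenceArchive`, the use of SHA-256 as the sequence checksum and the accessions in the example are all assumptions made for illustration.

```python
import hashlib


class SequenceArchive:
    """Toy archive: one entry per distinct sequence, plus its cross-links."""

    def __init__(self):
        # checksum -> {"sequence": str, "xrefs": {(database, accession): active}}
        self._entries = {}

    @staticmethod
    def _checksum(sequence):
        # SHA-256 is a stand-in here; any stable checksum of the sequence would do.
        return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()

    def add(self, sequence, database, accession, active=True):
        """Register that `accession` in `database` carries this sequence."""
        key = self._checksum(sequence)
        entry = self._entries.setdefault(
            key, {"sequence": sequence.upper(), "xrefs": {}})
        entry["xrefs"][(database, accession)] = active
        return key

    def xrefs(self, sequence):
        """Return the cross-links known for a sequence (empty dict if none)."""
        entry = self._entries.get(self._checksum(sequence))
        return dict(entry["xrefs"]) if entry else {}

    def __len__(self):
        return len(self._entries)


if __name__ == "__main__":
    archive = SequenceArchive()
    archive.add("MSKEKFERTK", "DatabaseA", "ACC_1")                # hypothetical
    archive.add("MSKEKFERTK", "DatabaseB", "ACC_2", active=False)  # hypothetical
    print(len(archive), archive.xrefs("MSKEKFERTK"))
```

Because the key is derived from the sequence alone, the same sequence submitted to several databases yields a single archive entry with several cross-links.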
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL give access to all the protein
sequences which are available to the public.
However, UniProtKB excludes the following protein sequences:
- Most non-germline immunoglobulins and T-cell receptors
- Synthetic sequences
- Most patent application sequences
- Small fragments encoded from nucleotide sequence (<8 amino acids)
- Pseudogenes*
- Fusion/truncated proteins
- Not real proteins
* many putative pseudogene sequences may be expected to remain in
UniProtKB for some time as it can be difficult to prove the non-existence of
a protein
UniProtKB
an encyclopedia on proteins
composed of 2 sections
UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
released every 4 weeks
UniProtKB
from ENA to TrEMBL
UniProtKB protein sequence data are mainly derived
from ENA (CDS) but also from Ensembl and other
sequence resources such as RefSeq or model organism
databases (MODs).
Data from the PIR database have been integrated in
UniProt since 2003.
ENA -> TrEMBL
Automated extraction of protein sequence (translated CDS), gene
name and references + automated annotation
The quality of UniProtKB/TrEMBL data, including the
protein sequence, is directly dependent on the
information provided by the submitter of the original
nucleotide entry.
Automated annotation
• Redundancy check (100% merge (same length, not fragment))
• Family attribution (InterPro)
• Many other cross-references
• Rule-based automated annotation (~38% of TrEMBL entries)
Automated annotation systems:
- UniRule (RuleBase, HAMAP; manually reviewed)
- SAAS (automated generated rules, i.e. via InterPro)
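The redundancy check listed above (merge records whose sequences are 100 % identical, unless they are fragments) can be sketched in Python as follows. This is a simplified illustration, not the UniProt pipeline: the record fields, the function name `merge_identical`, the accessions in the example and the choice to merge only within one species are assumptions.

```python
from collections import defaultdict


def merge_identical(records):
    """Group accessions whose full-length sequences are identical.

    `records` is a list of dicts with the keys 'accession', 'species',
    'sequence' and 'fragment' (bool). Fragments are never merged.
    Returns a list of accession groups.
    """
    groups = defaultdict(list)
    unmerged_fragments = []
    for record in records:
        if record["fragment"]:
            unmerged_fragments.append([record["accession"]])
        else:
            # Identical sequences necessarily have the same length.
            key = (record["species"], record["sequence"].upper())
            groups[key].append(record["accession"])
    return list(groups.values()) + unmerged_fragments


if __name__ == "__main__":
    records = [  # hypothetical accessions
        {"accession": "A1", "species": "Homo sapiens", "sequence": "MKWL", "fragment": False},
        {"accession": "A2", "species": "Homo sapiens", "sequence": "MKWL", "fragment": False},
        {"accession": "A3", "species": "Mus musculus", "sequence": "MKWL", "fragment": False},
        {"accession": "A4", "species": "Homo sapiens", "sequence": "MKW", "fragment": True},
    ]
    print(merge_identical(records))
```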
UniProtKB/TrEMBL (www.uniprot.org)
One protein sequence, one species
Protein and gene names
Taxonomic information
References
Cross-references to over 125 databases
Automated annotation: Function, Subcellular location, Catalytic activity, Sequence similarities…
Automated annotation: transmembrane domains, signal peptide…
Automated annotation: Keywords and Gene Ontology
UniProtKB
from TrEMBL to Swiss-Prot
Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL
-> minimal redundancy
ENA -> TrEMBL: automated extraction of protein sequence (translated CDS), gene
name and references + automated annotation
TrEMBL -> Swiss-Prot: manual annotation of the sequence and
associated biological information
UniProtKB:
from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate
sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract
literature information, ortholog data propagation, …)
UniProtKB/Swiss-Prot (www.uniprot.org)
One protein sequence, one gene, one species
Protein and gene names
Taxonomic information
References
Cross-references to over 125 databases
Sequence:
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
Alternative products: protein sequences produced by alternative splicing,
alternative promoter usage, alternative initiation…
Manual annotation: Function, Subcellular location, Catalytic activity, Disease,
Tissue specificity, Pathway…
Manual annotation: Post-translational modifications, variants,
transmembrane domains, signal peptide…
Manual annotation: Keywords and Gene Ontology
In a UniProtKB/Swiss-Prot entry, you can
expect to find:
• An (often corrected) protein sequence and the
description of various isoforms/variants;
• All the names of a given protein (and of its gene);
• A summary of what is known about the protein:
function, PTM, tissue expression, disease, 3D data
etc.…;
• A description of important sequence features:
domains, PTMs, variations, etc.;
• A selection of references;
• Selected keywords and ontologies;
• Numerous cross-references (central hub);
UniProtKB
1- Sequence curation
The displayed protein sequence
…canonical, representative,
consensus…
UniProtKB/Swiss-Prot protein sequence annotation
‘Merging policy’: a gene-centric view of protein space
1 entry <-> 1 gene (1 species)
1 displayed sequence
(annotation of alternative sequences, when available)
The displayed sequence is the most prevalent protein
sequence and/or the protein sequence which is also found
in orthologous species.
The displayed sequence is generally derived from the
translation of the genomic sequence (when available).
Sequence differences are documented.
What is the current status?
• At least 20% of Swiss-Prot entries
required a minimal amount of curation
effort so as to obtain the “correct”
sequence.
• Typical problems
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’
… once a gene on chromosome 11…
Quality of protein information from genome projects
• Let’s look at proteins originating from genome projects:
– Drosophila: the paradigm of what a curated genome should look like
(thanks to FlyBase): only 1.8% of the gene models conflict with
Swiss-Prot sequences;
– Arabidopsis: a typical example of a genome where a lot of
annotation was done when it was sequenced, but no update since
then (at least in the public view): 20% of the gene models are
erroneous;
– Tetraodon nigroviridis: the typical example of a quick and dirty
automatic run through a genome with no manual intervention:
>90% of the gene models produce incorrect proteins.
– Bacteria and Archaea have almost no splicing, so predictions are
“easier”, however errors are still made… Start codons, missed
small proteins (<100aa)…
UniProtKB/Swiss-Prot
Protein sequence annotation
Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating
missing genes according to orthologous sequences..
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation
of uric acid were inactivated and converted to pseudogenes.
• Producing a clean set of sequences is not a
trivial task;
• It is not getting easier as more and more types
of sequence data are submitted;
• It is important to pursue our efforts to make
sure we provide our users with the most correct
set of sequences for a given organism.
‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the
evidence for the existence of a given
protein;
• Different qualifiers:
1. Evidence at protein level (~18%)
(MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)
To avoid ‘pseudogenes’ and most of the improbable protein sequences,
filter your query to exclude sequences whose ‘protein existence’ tag
is ‘Uncertain’.
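The same filter can be applied to a local list of entries. The sketch below is only an illustration: the five levels follow the ‘Protein existence’ qualifiers, while the class name, the function name and the accessions `X00001` and `X00002` are invented placeholders.

```python
from enum import IntEnum

class ProteinExistence(IntEnum):
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5

# (accession, PE level) pairs; X00001 and X00002 are made-up accessions.
entries = [
    ("A6NGE7", ProteinExistence.PREDICTED),
    ("X00001", ProteinExistence.PROTEIN_LEVEL),
    ("X00002", ProteinExistence.UNCERTAIN),
]

def drop_uncertain(rows, worst_allowed=ProteinExistence.PREDICTED):
    """Keep entries whose evidence is at least as good as `worst_allowed`."""
    return [(ac, pe) for ac, pe in rows if pe <= worst_allowed]

kept = drop_uncertain(entries)
print([ac for ac, _ in kept])
```

Lowering `worst_allowed` to `ProteinExistence.PROTEIN_LEVEL` would keep only the proteins with evidence at protein level.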
The ‘alternative’ sequence(s)
How many proteins
at the end?
Example with human
Proteome complexity
Example with human
~20’000
Not predictable at
the genome level !
-> important postgenomic data !
(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).
UniProtKB/Swiss-Prot
1 entry <-> 1 gene (1 species)
Annotation of the sequence differences
(including conflicts, polymorphisms, splice variants etc..)
-> annotation of protein diversity
1 entry <-> 1 gene (1 species)
Multiple alignment of the end of the available GCR sequences
Annotation of the sequence differences (protein diversity)
…and natural variant
P04150
www.uniprot.org
UniProtKB (and RefSeq) under-represent alternatively spliced products
Transcript variants are only made when there is information available on the
full-length nature of the product; if multiple, alternate exons are found
through the length of the gene, no assumption is made about the
combination of the alternate exons that exists in vivo.
http://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_does_a_reviewed_status_me
Important remark
Available in separate files!
> 30’000 additional
sequences (total)
The ‘alternative’ sequence(s)
not ‘directly available’ for a lot of tools,
including protein identification tools,
Blast, depending on the server !….
Blast P04150 against Swiss-Prot / homo sapiens @ UniProt
Isoform
sequences
Blast P04150 against Swiss-Prot / homo sapiens @ NCBI
The isoform sequences are not present in the NCBI protein
database !
The .x number (P06401.4) corresponds to the version number
of the sequence…not to an alternatively spliced sequence !
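A small sketch of how the ‘.x’ suffix can be separated from the accession. The helper name is invented; the only assumption is the ‘accession.version’ notation shown above.

```python
def split_versioned_accession(identifier):
    """Split 'P06401.4' into ('P06401', 4); a bare accession gets version None."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)

print(split_versioned_accession("P06401.4"))
print(split_versioned_accession("P04150"))
```

Two identifiers that differ only in their version therefore refer to the same entry at different sequence versions, not to two isoforms.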
UniProtKB
2- Biological data curation
UniProtKB:
from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate
sequence discrepancies, report sequencing mistakes…)
2. Biological information (extract literature
information, ortholog data propagation, protein
sequence analysis…)
UniProtKB/Swiss-Prot
General annotation
• Summary of the current knowledge on a given protein.
• Maximum usage of controlled vocabulary:
Keywords, Tissues, Post-translational modifications, Strains, Species,
Subcellular location, Extracellular domains, Journals…
• Provides a reliable set of annotated protein entries for:
– Reference data for systems designed to automatically
transfer annotation to similar, not yet (or never)
characterized sequences
– Training of data mining tools, prediction programs
Extract literature information
and protein sequence analysis
maximum usage of controlled vocabulary
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/Pubmed)
- prediction programs (Prosite, Anabelle)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the
source of each annotation
Protein nomenclature
General annotation
(Comments)
…enable researchers to
obtain a summary of
what is known about a
protein…
www.uniprot.org
Human protein manual annotation:
some statistics (Aug 2010)
Sequence annotation
(Features)
…enable researchers to
obtain a summary of
what is known about a
protein…
www.uniprot.org
Human protein manual annotation:
some statistics
(PTM)
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and
predicted data and makes a clear distinction between both.
Level | Type of evidence | Qualifier
1st | Strong experimental evidence | Ref.X
2nd | Light experimental evidence | Probable
3rd | Inferred by similarity with homologous protein (data of 1st or 2nd level) | By similarity
4th | Inferred by sequence prediction | Potential
Find all the proteins localized in the
cytoplasm (experimentally proven)
which are phosphorylated on a
serine (experimentally proven)
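The logic of such a query can be sketched on a few invented records. The sketch assumes that an annotation counts as ‘experimentally proven’ when it carries none of the qualifiers ‘Probable’, ‘By similarity’ or ‘Potential’; the protein names and annotation strings are made up, and the function name is hypothetical.

```python
NON_EXPERIMENTAL = ("By similarity", "Probable", "Potential")

def has_no_qualifier(annotation):
    """True when the annotation text carries no non-experimental qualifier."""
    return not any(f"({q})" in annotation for q in NON_EXPERIMENTAL)

# Invented annotations, written in the style of Swiss-Prot comments and features.
proteins = {
    "PROT_A": {"location": "Cytoplasm", "ptm": "Phosphoserine"},
    "PROT_B": {"location": "Cytoplasm (By similarity)", "ptm": "Phosphoserine"},
    "PROT_C": {"location": "Cytoplasm", "ptm": "Phosphoserine (Potential)"},
    "PROT_D": {"location": "Nucleus", "ptm": "Phosphoserine"},
}

hits = [
    name
    for name, ann in proteins.items()
    if ann["location"].startswith("Cytoplasm")
    and has_no_qualifier(ann["location"])
    and ann["ptm"].startswith("Phosphoserine")
    and has_no_qualifier(ann["ptm"])
]
print(hits)
```

Only `PROT_A` passes both tests: the other records either carry a qualifier or have a different location.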
UniProtKB
Additional information can be found in
the cross-references
(to more than 140 databases)
Protein-centric view of the database network
(diagram labels: DNA sequences; gene expression data; protein sequences; gene annotation; macromolecular structure data)
UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
Sequence: EMBL, IPI, PIR, RefSeq, UniGene
Proteomic: PeptideAtlas, PRIDE, ProMEX
Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Polymorphism: dbSNP
Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Ontologies: GO
Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
PTM: GlycoSuiteDB, PhosphoSite, PhosSite
PPI: DIP, IntAct, MINT, STRING
3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Other: BindingDB, DrugBank, NextBio, PMAP-CutDB
UniProtKB
Access to UniProtKB
The UniProt web site:
www.uniprot.org
The UniProt web site - www.uniprot.org
• Powerful search engine, Google-like and easy to use, but also
supports very directed field searches (similar to SRS)
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries
are bookmarkable, supporting programmatic access
• Tools: Blast, Alignment, ID mapping, Batch retrieval (Retrieve)
Search
A very powerful text search tool with
autocompletion and refinement options
for finding UniProt entries and
documentation by biological information
UniProt query tool
(www.uniprot.org)
A mixture of Google and SRS
Find all human proteins with
experimental evidence for their
location in the nucleus
The search interface
guides users with helpful
suggestions and hints
Result pages: Highly customizable
Custom downloads….
Accession | Genes | Domains | Protein Existence
P02768 | ALB (GIG20) (GIG42) (PRO0903) (PRO1708) (PRO2044) (PRO2619) (PRO2675
P02769 | ALB | Albumin domains (3) | Evidence at protein level
P02770 | Alb | Albumin domains (3) | Evidence at protein level
P07724 | Alb (Alb-1) (Alb1) | Albumin domains (3) | Evidence at protein level
P08759 | alb-A | Albumin domains (3) | Evidence at transcript level
P14872 | alb-B | Albumin domains (3) | Evidence at transcript level
P43652 | AFM (ALB2) (ALBA) | Albumin domains (3) | Evidence at protein level
P08835 | ALB | Albumin domains (3) | Evidence at protein level
P49822 | ALB | Albumin domains (3) | Evidence at protein level
P19121 | ALB | Albumin domains (3) | Evidence at protein level
Open with Excel etc.
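Instead of a spreadsheet, such a download can also be read with a few lines of Python. The sketch assumes a tab-separated file with the four column headers shown above; four of the albumin rows are embedded as a string so that the example runs without any file.

```python
import csv
import io
from collections import Counter

# Assumed tab-separated layout; rows copied from the albumin example.
TABLE = (
    "Accession\tGenes\tDomains\tProtein Existence\n"
    "P02769\tALB\tAlbumin domains (3)\tEvidence at protein level\n"
    "P02770\tAlb\tAlbumin domains (3)\tEvidence at protein level\n"
    "P08759\talb-A\tAlbumin domains (3)\tEvidence at transcript level\n"
    "P14872\talb-B\tAlbumin domains (3)\tEvidence at transcript level\n"
)

rows = list(csv.DictReader(io.StringIO(TABLE), delimiter="\t"))
counts = Counter(row["Protein Existence"] for row in rows)
protein_level = [
    row["Accession"]
    for row in rows
    if row["Protein Existence"] == "Evidence at protein level"
]
print(counts)
print(protein_level)
```

For a real download, `io.StringIO(TABLE)` would be replaced by an opened file.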
The URL (results) can be
bookmarked and manually
modified.
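Because the query is part of the URL, a result page address can also be assembled by a script. The sketch below only builds and re-reads the URL, without contacting the server. The path `/uniprot/`, the parameter names `query` and `format`, and the query text are assumptions made for illustration; check them against the address shown in your own browser.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://www.uniprot.org/uniprot/"   # assumed path of the result page

def build_query_url(query, fmt="tab"):
    """Return a bookmarkable URL carrying the query text and an output format."""
    return BASE + "?" + urlencode({"query": query, "format": fmt})

url = build_query_url("albumin AND reviewed:yes")
print(url)

# Reading the URL back shows that the query survives the encoding unchanged.
params = parse_qs(urlsplit(url).query)
print(params["query"][0])
```

Editing the query text in the URL by hand has the same effect as calling `build_query_url` with another string.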
Blast
A tool associated with the standard
options to search sequences
in UniProt databases
Blast results: customize display
Blast: use of UniProt annotation
amino-acids highlighting options
and feature annotation highlighting option in the local alignment
Align
A ClustalW multiple alignment tool with
amino-acids highlighting options
and feature annotation highlighting
option
ClustalW
multiple alignment of insulin sequences
amino-acids highlighting options
and feature annotation highlighting option in the local alignment
Retrieve
A UniProt-specific tool for retrieving a
list of entries in several standard formats.
You can then query your ‘personal database’
with the UniProt search tool.
Your dataset: results of a
Scan Prosite
ID Mapping
Maps the identifiers of a given protein between
different databases
These identifiers all point to TP53 (p53) !
P04637, NP_000537, ENSG00000141510, CCDS11118,
UPI000002ED67, IPI00025087, etc.
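The identifiers above differ enough in shape that a script can usually guess which database they come from. The patterns in this sketch are simplified approximations written for the six examples on this slide, not official definitions, and the function name is invented.

```python
import re

# Simplified, illustrative patterns (approximations, not official formats).
PATTERNS = [
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}")),
    ("CCDS", re.compile(r"CCDS\d+")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("RefSeq protein", re.compile(r"[NXY]P_\d+")),
    ("UniProtKB AC", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]")),
]

def guess_database(identifier):
    """Return the name of the first pattern that matches the whole identifier."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return "unknown"

for ident in ["P04637", "NP_000537", "ENSG00000141510",
              "CCDS11118", "UPI000002ED67", "IPI00025087"]:
    print(ident, "->", guess_database(ident))
```

Guessing the database is only the first step: the mapping itself still has to come from a service such as the ID Mapping tool.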
Download
Downloading UniProt
http://www.uniprot.org/downloads
Complete proteome
‘gene’ centred
or
all known proteins ?
http://www.uniprot.org/faq/38
Remark: Some peptides are not associated with the keyword
‘Complete proteome’ because they do not match the human
genome
UniProt proteome sets, if downloaded in
UniProt flat file or XML format, contain
one sequence per UniProt record !
‘gene’ centred
all protein sequences in UniProtKB/Swiss-Prot…
Missing: other alternatively spliced
protein sequences in UniProtKB/TrEMBL
Human protein manual annotation:
some statistics (Aug 2010)
UniProtKB
Statistics
Swiss-Prot & TrEMBL
introduce a new arithmetical concept !
Swiss-Prot (12’000 species) + TrEMBL (130’000 species)
520’000 + 13’000’000
 13’000’000
Redundancy in TrEMBL
&
Redundancy between TrEMBL and Swiss-Prot
12’000 species
mainly model organisms
Not yet available
~ 200 new entries / day
new release every 4 weeks
-Annotation is useful, good annotation is better, update is essential !
- Some entries have gone through more than 120 versions since their integration
in UniProtKB/Swiss-Prot
UniProtKB entry history
Always cite the primary accession number (AC) !
UniParc
UniParc
- non-redundant protein sequence archive, containing both active and
inactive sequences (including sequences which are not in UniProtKB, e.g.
immunoglobulins…)
- the equivalent of ENA/GenBank/DDBJ at the protein level
- species-merged: sequences from different species are merged when they are
100% identical over the whole length
- no annotation (only taxonomy)
- can be searched only with database names, taxonomy, checksum (CRC64)
and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs
- Beware: contains wrong predictions, pseudogenes, etc.
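The ‘species-merged’ idea can be illustrated with a toy archive in which the sequence itself is the key. All records below (accessions and sequences) are invented, and the function name is hypothetical; the real archive identifies sequences with its own identifiers and checksums.

```python
from collections import defaultdict

# Invented records: (source database, accession, species, sequence).
records = [
    ("UniProtKB", "ACC_1", "Homo sapiens", "MKTAYIAKQR"),
    ("RefSeq", "ACC_2", "Homo sapiens", "MKTAYIAKQR"),
    ("Ensembl", "ACC_3", "Pan troglodytes", "MKTAYIAKQR"),
    ("UniProtKB", "ACC_4", "Mus musculus", "MKTAYIAKQK"),
]

def build_archive(rows):
    """Map each distinct sequence to every (database, accession, species) citing it."""
    archive = defaultdict(list)
    for database, accession, species, sequence in rows:
        archive[sequence.upper()].append((database, accession, species))
    return dict(archive)

archive = build_archive(records)
for sequence, sources in archive.items():
    species = sorted({s for _, _, s in sources})
    print(sequence, len(sources), "source(s)", species)
```

Four records collapse into two archive entries, and the first entry groups sequences from two species, exactly because only the sequence is compared.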
Query UniParc
UniRef
‘UniRef is useful for comprehensive
BLAST similarity searches by providing
sets of representative sequences’
«Collapsing BLAST results»
Three collections of sequence clusters from UniProtKB and selected
UniParc entries:
One UniRef100 entry -> all identical sequences (identical sequences and
sub-fragments are grouped in a single record) -> reduction of 12 %
One UniRef90 entry -> sequences that have at least 90 % identity
-> reduction of 40 %
One UniRef50 entry -> sequences that are at least 50 % identical
-> reduction of 65 %
Based on sequence identity -> independent of the species !
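How lowering the identity threshold collapses more sequences into fewer clusters can be seen on a toy example. This is not the procedure used to build UniRef: identity is computed here without alignment on invented sequences of equal length, sub-fragments are ignored, and the greedy strategy and function names are assumptions chosen for brevity.

```python
def identity(a, b):
    """Fraction of identical positions (ungapped), relative to the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_clusters(sequences, threshold):
    """Assign each sequence to the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

# Invented sequences; species plays no role in the comparison.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]
for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_clusters(seqs, threshold)))
```

The five toy sequences give 4 clusters at 100 %, 3 at 90 % and 2 at 50 % identity.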
UniRef 90
Independent of
species and
sequence length
UniMES
The UniProt Metagenomic and Environmental Sequences
(UniMES) database is a repository specifically
developed for metagenomic and environmental protein
data (only GOS data for the moment).
Download only (but included in UniParc -> Blast).
- UniMES Fasta sequences
- UniMES matches to InterPro methods
ftp.uniprot.org/pub/databases/uniprot
UniMES: sequences in fasta format
Menu
Introduction
Nucleic acid sequence databases
ENA, GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases
NCBI protein
databases
(Entrez protein, NCBI nr)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Major ‘general’ protein sequence database ‘sources’
UniProtKB: Swiss-Prot + TrEMBL (integrated resources; ‘cross-references’ to TPA, PIR, PDB, PRF)
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA (resources kept separate)
UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot
(300’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences
PRF: Protein Research Foundation; journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation
(11’000 species)
TPA: Third Party Annotation
Query at Entrez protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Swiss-Prot
Typical result of
a query at
« Entrez protein »
RefSeq
Genpept
A Swiss-Prot entry with the NCBI look
GI number
‘GenInfo identifier’ number
- In addition to an AC number specific to the original database,
each protein sequence in the NCBI nr database (including Swiss-Prot
entries) has a GI number.
AC
GI number: ‘GenInfo identifier’ number
- If the sequence changes in any way, a new GI number will be
assigned:
GI identifiers provide a mechanism for identifying the exact
sequence that was used or retrieved in a given search.
- A separate GI number is assigned to each protein translation
(alternative products)
- A Sequence Revision History tool is available to track the various
GI numbers, version numbers, and update dates for sequences that
appeared in a specific GenBank record:
http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
ID/AC mapping
http://www.ebi.ac.uk/Tools/picr/
GenPept
Translation from annotated CDS in GenBank
Contains all translated CDS annotated in
GenBank/ENA/DDBJ sequences
- equivalent to UniProtKB/TrEMBL,
except that it is
redundant with other databases
(Swiss-Prot, RefSeq, PIR….)
GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’
RefSeq
Produced by NCBI and NLM
http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/
The Reference Sequence (RefSeq) collection aims to provide a
comprehensive, integrated, non-redundant set of sequences,
including genomic DNA, transcript (RNA), and protein products,
for major research organisms.
Protein – mRNA – genomic sequence
Also chromosomes, organelle genomes, plasmids, intermediate
assembled genomic contigs, ncRNAs.
- tightly linked to Entrez Gene (« interdependent curated resources »)
Example: NP_000790
AC
KW
Taxonomy
References
GenBank source
and status
Annotation and
ontologies
Curated records
UniProtKB vs RefSeq
UniProtKB/Swiss-Prot merges all CDS available for a given gene and
describes the sequence differences
UniProtKB/Swiss-Prot P04150 (GCR_HUMAN):
RefSeq chooses one or several protein reference sequences for a given
gene: they do not annotate the sequence differences.
- If there is an alternative splicing event, there will be several distinct
entries for a given gene
Example: GCR_HUMAN
1 UniProtKB entry
cross-linked with
7 RefSeq entries
GCR_HUMAN
UniProtKB/Swiss-Prot
Protein feature annotation found in RefSeq
- Conserved domains
- Signal and mature peptides
- Propagation of a subset of features from Swiss-Prot.
PTM annotation
Swiss-Prot vs
RefSeq
GCR_HUMAN
RefSeq statistics
The numbers are not comparable:
entries ‘sequence’ (RefSeq) vs entries ‘gene’ (UniProtKB/Swiss-Prot)
UniProtKB vs NCBI protein
Summary
ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org)
Protein and nucleotide data | Genomic, RNA and protein data | Protein data only
Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry
Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL
Author submission | NCBI creates from existing data + gene prediction | UniProt creates from existing data
Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein from one gene of major organisms (in Swiss-Prot, TrEMBL is redundant)
Records can contradict each other | | Identification and annotation of discrepancy
No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms
Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)
PIR
PIR: the Protein Information Resource
PIR-PSD is no longer updated,
but exists as an archive
PDB
PDB
• PDB (Protein Data Bank), 3D structure
• Contains the spatial coordinates of the atoms of macromolecules
whose 3D structure has been determined by X-ray
or NMR studies
• Also contains the corresponding protein sequences
*The PIR-NRL3D database makes the sequence information in PDB
available for similarity searches and other tools
• Includes protein sequences which are mutated,
chimeric, etc. (created specifically to study the
effect of a mutation on the 3D structure)
PDB: Protein Data Bank
www.rcsb.org/pdb/
• Managed by the Research Collaboratory for Structural
Bioinformatics (RCSB) (USA).
• Associated specialized programs allow the
visualization of the corresponding 3D structure
(e.g., SwissPDB-viewer, Chime, Rasmol).
• Currently there are ~68’000 structures for
about 15’000 different proteins, but far fewer
protein families (highly redundant) !
PDB: example
Sequence
Coordinates of each atom
Visualisation with Jmol
PRF
Protein Research Foundation
Looks for peptide sequences described in publications (and
which are not submitted to databases !!!)
http://www.genome.jp/dbget-bin/www_bfind?prf
Other protein
databases
Ensembl
http://www.ensembl.org/
Review
http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610
Annotation pipeline
http://www.genome.org/cgi/content/full/14/5/942
- Ensembl aligns the genomic sequences with all the sequences
found in ENA, UniProtKB/Swiss-Prot, RefSeq and
UniProtKB/TrEMBL (-> known genes)
- It also performs gene prediction (-> novel genes)
Ensembl = UniProtKB + RefSeq + gene prediction
- DNA, RNA and protein sequences are available for several species.
- Ensembl concentrates on vertebrate genomes, but other groups
have adapted the system for use with plant, fungal and metazoan
genomes.
Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating
missing genes according to orthologous sequences..
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation
of uric acid were inactivated and converted to pseudogenes.
IPI
http://www.ebi.ac.uk/IPI/IPIhelp.html
IPI: Closure !
Automatic approach that builds clusters through
combining knowledge already present in the primary
data source (UniProtKB, RefSeq, Ensembl) and
sequence similarity.
IPI=UniProtKB + RefSeq + Ensembl (+ H-InvDB,
TAIR +VEGA).
!!! Complete proteome sets include all alternative
splicing sequences….
Available for human, mouse, rat, Zebrafish,
Arabidopsis, Chicken, and Cow
CCDS
http://www.ncbi.nlm.nih.gov/CCDS/
CCDS (human, mouse)
Combining different approaches – ab
initio, by similarity - and taking
advantage of the expertise acquired
by different institutes, including
manual annotation…
Consensus between 4 institutions…
Gene Ontology
(GO)
Standards :
Why is it so important ?
• ‘The ever-increasing number of sequencing projects
necessitates a standardized system (…) to ensure that
the flood of information produced can be effectively
utilized.’ (PMID: 19577473)
• Standardization of biological data/information (data
sharing and computational analysis).
• Aim: extract and compare annotation between different
resources or species (semantic similarity).
Secreted or not secreted ?
PMID: 19299134
Gene Ontology (GO)
• The Gene Ontology is a controlled vocabulary, a set
of standard terms—words and phrases—used for
indexing and retrieving information. In addition to
defining terms, GO also defines the relationships
between the terms, making it a structured
vocabulary. Contains ~30’000 terms.
Gene Ontology (GO) terms
• biological process
– broad biological phenomena, e.g. mitosis,
growth, digestion
• molecular function
– molecular role, e.g. catalytic activity,
binding
• cellular component
– subcellular location, e.g. nucleus, ribosome,
origin recognition complex
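What ‘structured vocabulary’ means in practice: terms are linked by relationships, so an annotation to a specific term implies its more general parents. The graph below is a tiny, deliberately simplified excerpt written for illustration (it is not the real GO hierarchy), and the function name is invented.

```python
# Child term -> parent terms ("is_a"); simplified, illustrative excerpt only.
IS_A = {
    "nucleus": ["intracellular membrane-bounded organelle"],
    "intracellular membrane-bounded organelle": ["organelle"],
    "ribosome": ["organelle"],
    "organelle": ["cellular component"],
    "cellular component": [],
}

def ancestors(term, is_a=IS_A):
    """Return every term reachable by following is_a links upwards."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

print(sorted(ancestors("nucleus")))
print(sorted(ancestors("ribosome")))
```

A protein annotated with ‘nucleus’ is thus also found by a search for ‘organelle’, which is what makes comparisons between resources or species possible.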
GO terms associated with human Erythropoietin
http://www.geneontology.org
Caveats
• Annotation is the process of assigning/mapping GO
terms to gene products…
• Electronic vs Manual annotation…
Example with EPO
Histone H4
!!! Large scale derived data (‘proteome’)
GO terms: Essential link between biological knowledge and
high-throughput genomic and proteomic datasets…
‘summary of the gene ontology classifications for all mapped ESTs…’
PMID: 15514041
~40 % of human proteins have no known function
(experimental data)…but many more are associated with GO
terms…(computer-assigned).
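The difference between manual and computer-assigned GO terms can be made visible through the evidence code attached to each annotation. In this sketch IEA (‘Inferred from Electronic Annotation’) stands for computer-assigned terms, while IDA and IMP are experimental codes; the proteins and annotation rows are invented, and the function name is hypothetical.

```python
# (protein, GO term, evidence code); rows are invented for illustration.
annotations = [
    ("PROT_A", "nucleus", "IDA"),
    ("PROT_A", "DNA binding", "IEA"),
    ("PROT_B", "catalytic activity", "IEA"),
    ("PROT_C", "mitosis", "IMP"),
]
ELECTRONIC = {"IEA"}

def split_by_origin(rows):
    """Separate annotations into (manual, electronic) lists by evidence code."""
    manual = [r for r in rows if r[2] not in ELECTRONIC]
    electronic = [r for r in rows if r[2] in ELECTRONIC]
    return manual, electronic

manual, electronic = split_by_origin(annotations)
only_electronic = {p for p, _, _ in electronic} - {p for p, _, _ in manual}
print(sorted(only_electronic))
```

`PROT_B` has GO terms but no experimental support at all, which is the situation described above for many human proteins.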
Human proteins functional distribution
Maybe
Potentially
Putative
Expected
Probably
Hopefully
All documents (including practicals) are online
http://education.expasy.org/cours/Murcia2011/