Protein sequence databases: dissemination of protein knowledge http://education.expasy.org/cours/UniProt/ Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Menu Introduction Nucleic acid sequence databases ENA, GenBank, DDBJ Protein sequence databases UniProt databases (UniProtKB) NCBI protein databases (NCBInr, RefSeq…) Protein sequences are the fundamental determinants of biological structure and function. http://www.ncbi.nlm.nih.gov/protein Challenge Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery Challenge (1) Many different protein sequence databases PRF RefSeq TrEMBL Genpept UniProtKB (IPI) Swiss-Prot Ensembl UniMES NCBInr TPA UniParc (PIR) PDB CCDS Challenge (1bis) Different protein sequence databases : many identifiers for the same protein sequence These identifiers are all pointing to the same TP53 protein sequence (p53) ! P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676 , JT0436 , etc. A HUPO test sample study reveals common problems in mass spectrometry–based proteomics PubMed 19448641 (2009) • A single mass spectrometry experiment can identified up to about 4000 proteins (15’000 peptides) • Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (search with different protein databases = could get different results). • Only 7 labs (on 27) were able to identify the 20 human proteins present in a sample, mainly due to the fact that the search engines used cannot distinguish among different identifiers for the same protein… Challenge (3) (protein) sequence annotation Nucleic Acids Res. 2010 ; 38(Database issue): D633–D639. 
‘Examining links from the perspective of PubMed, we found that only a small fraction of published articles are linked to human genes (Entrez Gene).’ Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ AC number is not available… ‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’ http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq …not the case for protein sequences !!! no more the case for a lot of genomes !!! Protein sequence origin… More than 99 % of the protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs) sequencing quality coding sequence (CDS) annotation accuracy gene prediction quality … ~ 2500 genomes sequenced (single organism, varying sizes, including virus) … ~ 5’000 ongoing genome sequencing projects http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat ~ 50-100 genomes/month + ~2’500 viral genomes => Total ~ 5’000 genomes … ~ 2500 genomes sequenced (single organism, varying sizes, including virus) … ~ 5’000 ongoing genome sequencing projects … cDNAs sequencing projects (ESTs or cDNAs) … metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms, Metagenomics study of genetic material recovered directly from environmental samples • Global Ocean Sampling (C. 
Venter) 1 ml sea water: 1 million bacteria and 10 million viruses
• Whale fall (AAFZ00000000.1)
• Soil, sand beach, New York air, … Venter’s Sorcerer II
• Human fluids, mouse gut (millions of bacteria within the human body)
• Water treatment industry…
• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
… ~ 2500 genomes sequenced (single organism, varying sizes)
… ~ 5’000 ongoing genome sequencing projects
… cDNA sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects
… personal human genomes
New generation sequencers: Illumina: 25 billion bp/day
3’000’000’000 $ (public consortium, 2000); 300’000’000 $ (Celera, 2000); 70’000’000 $ (diploid, 2007); 2’000’000 $ (2007); 2010: http://www.youtube.com/watch?v=mVZI7NBgcWM
…2700 genomes in 2010, 30’000 genomes in 2011 ?
But…we now know that his apoE allele is the one associated with increased risk for Alzheimer and that he has the ‘blue eye’ allele… apoE gene (Ensembl genome browser)
New projects (Homo sapiens)
• 1000 genomes (first publication, October 2010)
• Multiple personal genomes (sexual cells, lymphoid cells, cancer cells…)
• International Cancer Genome Consortium (www.icgc.org). They look at the most common cancers and, for each, they sequence the genomes of 500 patients with cancer and 500 healthy individuals….
How to define the human proteome ??? Which sequences ??? How many protein-coding genes at the end?
190‘500'025'042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
20 million bacteria/archaea x 4'000 genes
1 million protists x 6'000 genes
5 million insects x 14'000 genes
2 million fungi x 6'000 genes
0.5 million plants x 20'000 genes
0.5 million molluscs, worms, arachnids, etc. x 20'000 genes
0.1 million vertebrates x 25'000 genes
The calculation: 2x10^7 x 4000 + 1x10^6 x 6000 + 5x10^6 x 14000 + 2x10^6 x 6000 + 5x10^5 x 20000 + 5x10^5 x 20000 + 1x10^5 x 25000 + 20000 (Craig Venter) + 42 (Douglas Adams) + …
About 190 billion proteins (?)
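The back-of-the-envelope calculation above can be re-run as a short script. This is only a sketch of the slide's arithmetic: the variable and function names are made up for illustration, and the species and gene counts are the rough guesses listed on the slide, not measured values. Summing the terms exactly as written gives 190'500'020'042, which differs by 5'000 from the figure displayed on the slide (190'500'025'042); the source does not say which of the two carries the typo.

```python
# Rough estimate of the number of proteins on Earth, using the slide's figures.
# (group, estimated number of species, estimated genes per species)
ESTIMATES = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]

# The two tongue-in-cheek extra terms from the slide.
EXTRA_TERMS = {"Craig Venter": 20_000, "Douglas Adams": 42}


def estimate_total_proteins() -> int:
    """Sum species x genes over all groups, then add the extra terms."""
    per_group = sum(species * genes for _, species, genes in ESTIMATES)
    return per_group + sum(EXTRA_TERMS.values())


if __name__ == "__main__":
    total = estimate_total_proteins()
    # Print with the apostrophe thousands separator used on the slide.
    print(f"{total:,}".replace(",", "'"))
```

Whichever figure is intended, the order of magnitude (about 190 billion) is unchanged.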
About 14.0 millions of ‘known’ protein sequences in 2011 (from ~300’000 species) More than 99 % of the protein sequences are derived from the translation of nucleotide sequences Less than 1 % direct protein sequencing (Edman, MS/MS…) -> It is important that protein database users know where the protein sequence comes from… The ideal life of a sequence … cDNAs, ESTs, genes, genomes, … Nucleic acid sequence databases Protein sequence databases Menu Introduction Nucleic acid sequence databases ENA/GenBank, DDBJ Protein sequence databases UniProt databases (UniProtKB) NCBI protein databases ENA (EMBL-Bank) GenBank DDBJ European Nucleotide Archive DNA Data Bank of Japan archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing. ENA/GenBank/DDBJ http://www.insdc.org/ ENA/GenBank/DDBJ • Serve as archives : ‘nothing goes out’ • Contain all public sequences derived from: – Genome projects (> 80 % of entries) – Sequencing centers (cDNAs, ESTs…) – Individual scientists ( 15 % of entries) – Patent offices (i.e. European Patent Office, EPO) • Currently: ~210x106 sequences, ~320 x109 bp; • Sequences from > 300’000 different species; Archival databases: - Can be very redundant for some loci - Sequence records are owned by the original submitter and can not be alterered by a third party (except TPA) accession number taxonomy references Cross-references CDS CoDing Sequence (proposed by submitters) CDS annotation (Prediction or experimentally determined) sequence The hectic life of a sequence … Data not submitted to public databases, delayed or cancelled… with or without cDNAs, annotated CDS provided by authors ESTs, genes, genomes, … ENA, GenBank, DDBJ CDS CoDing Sequence portion of DNA/RNA translated into protein (from Met to STOP) Experimentally proved or derived from gene prediction !!! not so well documented !!! 
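The CDS definition above (the portion of DNA/RNA translated into protein, from Met to STOP) can be illustrated with a minimal translation sketch. It assumes the standard genetic code, a spliced forward-strand sequence with no introns, and that translation starts at the first ATG found; real CDS annotation relies on the coordinates given by the submitter or by a gene prediction pipeline. The function name `translate_cds` is hypothetical.

```python
# Standard genetic code, built from the usual compact TCAG-ordered table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(sequence: str) -> str:
    """Translate from the first ATG (Met) up to, but excluding, the first STOP.

    Codons containing ambiguous bases (e.g. N) are translated as 'X'.
    If no in-frame STOP is found, translation runs to the last full codon.
    """
    seq = sequence.upper().replace("U", "T")  # accept RNA as well as DNA
    start = seq.find("ATG")
    if start == -1:
        raise ValueError("no ATG start codon found")
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[pos:pos + 3], "X")
        if amino_acid == "*":  # STOP codon ends the CDS
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("CCATGGCCTAAGG"))  # ATG GCC TAA
```

Picking the first ATG is exactly the kind of assumption that can go wrong in practice: a wrongly chosen initiation site changes the N-terminus of the predicted protein.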
CoDing Sequence
Alignment between a mRNA and a genomic sequence
[Figure: alignment of an mRNA-derived contig (CONTIG) against the genomic sequence (Genomic). Matching blocks are labelled as exons; the gaps in the contig correspond to introns; the contig contains ambiguous bases (N). Sequence rows not reproduced.]
CDS provided by the submitters
The first Met !
CDS translation provided by ENA
mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)
UCSC: human EPO contig (5’ to 3’)
Very rarely done…
Complete genome (submitted) but only ~ 2,000 CDS/proteins available !
…annotated CDS in UniProtKB http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
From nucleic acid to amino acid sequence databases….
The hectic life of a protein sequence … Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genomes, … Nucleic acid databases no CDS ENA, GenBank, DDBJ …if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries) RefSeq, Ensembl and other Gene prediction RefSeq, Ensembl Protein sequence databases Why doing things in a simple way, when you can do it in a very complex one ? The hectic life of a sequence … Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genomes, … Scientific publications derived sequences ENA, GenBank, DDBJ CoDing Sequences provided by submitters TrEMBL Genpept CoDing Sequences provided by submitters and gene prediction RefSeq UniProtKB (IPI) Swiss-Prot Ensembl UniMES (PIR) PRF TPA UniParc CCDS PDB + all ‘species’ specific databases (EcoGene, TAIR, …) Major ‘general’ protein sequence database ‘sources’ TPA PIR PDB PRF Integrated resources ‘cross-references’ UniProtKB: Swiss-Prot + TrEMBL Resources kept separated NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA not complete !!! (only entries created before 2007 ?) UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species) UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (300’000 species) GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?) 
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences
PRF: Protein Research Foundation: journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
TPA: Third Party Annotation
Look for EPO (Homo sapiens): Swiss-Prot, TrEMBL (www.uniprot.org)
Look for EPO (Homo sapiens): Swiss-Prot, GenPept, RefSeq
Menu
Introduction
Nucleic acid sequence databases: ENA, GenBank, DDBJ
Protein sequence databases: UniProt databases (UniProtKB), NCBI protein databases
UniProt
UniProt consortium
EBI: European Bioinformatics Institute (UK)
SIB: Swiss Institute of Bioinformatics (CH)
PIR: Protein Information Resource (US)
UniProt databases
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~14 million entries)
UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated (query, Blast, download) (~25 million entries)
UniRef: 3 sets of protein sequence clusters at 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)
UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)
UniProtKB
An encyclopedia on proteins composed of 2 sections:
- UniProtKB/TrEMBL: unreviewed, automatically annotated
- UniProtKB/Swiss-Prot: reviewed, manually annotated
Released every 4 weeks
UniProtKB: from EMBL to TrEMBL
UniProtKB protein sequence data are mainly derived from EMBL (CDS) but also from Ensembl, RefSeq, model organism databases (MODs; e.g. TAIR) and PDB.
Data from the PIR database have been integrated in UniProtKB since 2003.
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL give access to all the protein sequences which are available to the public. However, UniProtKB excludes the following protein sequences:
- Most non-germline immunoglobulins and T-cell receptors
- Synthetic sequences
- Most patent application sequences
- Small fragments encoded from nucleotide sequence (<8 amino acids)
- Pseudogenes*
- Fusion/truncated proteins
- Not real proteins
* Many putative pseudogene sequences (which are tagged as potential pseudogenes) may be expected to remain in UniProtKB for some time, as it can be difficult to prove the non-existence of a protein.
[Figure: Data increase in UniProtKB. Number of sequences (0 to 14,000,000) in UniProtKB and in UniProtKB/Swiss-Prot, by date from 5-Jan-04 to 5-Jan-10.]
EMBL -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references, plus automated annotation
UniProtKB/TrEMBL (www.uniprot.org): one protein sequence, one species
- Protein and gene names
- Taxonomic information
- References
- Cross-references to over 125 databases
- Automated annotation: Function, Subcellular location, Catalytic activity, Sequence similarities…
- Automated annotation: transmembrane domains, signal peptide…
- Automated annotation: Keywords and Gene Ontology
UniProtKB: from EMBL to TrEMBL
Automated annotation
1. Protein sequence
2. Biological information
UniProtKB/TrEMBL
Protein sequence
- The quality of UniProtKB/TrEMBL protein sequences is dependent on the information provided by the submitter of the original nucleotide entry (CDS).
- 100% identical sequences (same length, same organism) are merged automatically.
Biological information
Sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (SAAS: automatically generated annotation rules)
- From automated annotation (UniRule: manually generated annotation rules)
Automatic annotation in UniProtKB/TrEMBL
System: rule creation; trigger; annotations; scope
- SAAS: automatic; InterPro; comments, KW; all taxa
- UniRules: manual; InterPro*; protein names, comments, features, KW, GO terms; all taxa
* Flexibility to create custom signatures for InterPro as required
UniProtKB/TrEMBL
SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Fully automated rule generation based on the C4.5 decision tree algorithm.
• One annotation, one rule.
• Precision is calculated for each rule against UniProtKB/Swiss-Prot.
• High stringency: a rule must reach an estimated precision of 99% or greater on UniProtKB/TrEMBL to generate annotation.
• Rules are produced, updated and validated at each release.
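The stringency criterion in the SAAS list above (a rule only generates annotation if its estimated precision is 99% or greater) can be sketched as a simple filter. This is a minimal illustration and not the UniProt implementation: the class, the function names and the counts are hypothetical, and precision is assumed here to mean true positives / (true positives + false positives) when the rule is checked against UniProtKB/Swiss-Prot annotation. The C4.5 rule generation itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class RuleEvaluation:
    """Outcome of checking one candidate rule against reference annotation."""
    rule_id: str           # hypothetical identifier
    true_positives: int    # predictions that agree with the reference
    false_positives: int   # predictions that disagree with the reference


def precision(evaluation: RuleEvaluation) -> float:
    """Fraction of the rule's predictions that are correct (0.0 if none made)."""
    predicted = evaluation.true_positives + evaluation.false_positives
    if predicted == 0:
        return 0.0
    return evaluation.true_positives / predicted


def passes_stringency(evaluation: RuleEvaluation, threshold: float = 0.99) -> bool:
    """Keep a rule only if its precision reaches the threshold."""
    return precision(evaluation) >= threshold


if __name__ == "__main__":
    candidates = [
        RuleEvaluation("rule_A", true_positives=995, false_positives=5),
        RuleEvaluation("rule_B", true_positives=97, false_positives=3),
    ]
    kept = [c.rule_id for c in candidates if passes_stringency(c)]
    print(kept)
```

A high threshold trades coverage for reliability, which is consistent with the modest coverage figures reported for automatic annotation.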
UniRules (RuleBase, HAMAP, PIRSF) • Rules of varying complexity: annotation varies from simple KW attribution to complete annotation as for UniProtKB/Swiss-Prot • Rules are manually curated: From SAAS rules as input From UniProtKB/Swiss-Prot annotation and InterPro match data, taxonomy information – continuously reported to curators From literature based curation of characterized families - with the possibility to create new signatures for specific functional groups • Rules are continuously monitored – validation on UniProtKB/Swiss-Prot – 97% confidence • HAMAP is also used for the annotation of some UniProtKB/Swiss-Prot entries UniRule – HAMAP UniRule – HAMAP Automatic annotation in UniProtKB/TrEMBL - Summary • SAAS – automatically generated annotation rules for comments, KWs - Tested on UniProtKB/Swiss-Prot • UniRule – manually curated annotation rules (e.g. HAMAP) – annotation varies from simple KWs to full annotation – start point can be SAAS rules, InterPro reports, literature-based curation of protein families – possibility to create custom signatures -> InterPro • Automatic annotation of UniProtKB/TrEMBL is refreshed, and validated, each UniProtKB release – validation using UniProtKB/Swiss-Prot as reference. ~10% of the rules are ‘refreshed’ at each release. 
• The source of each annotation is indicated - users can access rule logic
Current status – coverage of UniProtKB/TrEMBL
System: number of rules; coverage*
- SAAS: 1684; 17.2%
- UniRules, RuleBase: 1108; 23.0%
- UniRules, PIR name / site rules: 142; 0.26%
- UniRules, HAMAP: 1087; 4.7%
* Proportion of entries with at least one annotation from the specified system
UniProtKB/TrEMBL 2010_12: 12,769,092 entries; all systems combined: 33%
[Figure: Current status - coverage of UniProtKB/TrEMBL. % coverage of UniProtKB/TrEMBL (0 to 35) by annotation type (All, CC, DE, FT, GN) for UniRule, UniRule + SAAS and SAAS.]
UniProt release 2010_10 included annotations from 7767 SAAS rules, 1814 UniRules
KW GO annotation - KW2GO - InterPro2GO - HAMAP2GO
UniProtKB: from TrEMBL to Swiss-Prot
Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL -> minimal redundancy
EMBL -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references, plus automated annotation
TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information
UniProtKB/Swiss-Prot (www.uniprot.org): one protein sequence, one gene, one species
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
- Protein and gene names
- Taxonomic information
- References
- Cross-references to over 125 databases
- Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…
- Manual annotation: Function, Subcellular location, Catalytic activity, Disease, Tissue specificity, Pathway…
- Manual annotation: Post-translational modifications, variants, transmembrane domains, signal peptide…
- Manual annotation: Keywords and Gene
Ontology UniProtKB: from TrEMBL to Swiss-Prot Manual annotation 1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…) 2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …) UniProtKB 1- Sequence curation The displayed protein sequence …canonical, representative, consensus… UniProtKB/Swiss-Prot protein sequence annotation ‘Merging/Redundancy policy’: a gene-centric view of protein space 1 entry <-> 1 gene (1 species) 1 displayed sequence (annotation of alternative sequences, when available) The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species. The displayed sequence is generally derived from the translation of the genomic sequence (when available). Sequence differences are documented. What is the current status? • At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence. • Typical problems – – – – unsolved conflicts; uncorrected initiation sites; frameshifts; other ‘problems’ … once a gene on chromosome 11… Quality of protein information from genome projects • Lets look at proteins originating from genome projects: – Drosophila: the paradigm of a curated genome should look like (thanks to FlyBase) : only 1.8% of the gene models conflict with Swiss-Prot sequences; – Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but no update since then (at least in the public view): 20% of the gene models are erroneous; – Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins. 
– Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made… Start codons, missed small proteins (<100 aa)…
UniProtKB/Swiss-Prot
Protein sequence annotation
Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.
  ID   URAD_HUMAN   Unreviewed;   171 AA.
  AC   A6NGE7;
  DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
  DT   24-JUL-2007, sequence version 1.
  DT   02-OCT-2007, entry version 3.
  DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
  DE   (OHCU decarboxylase homolog) (Parahox neighbour).
  GN   Name=PRHOXNB;
  …
  DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
  DR   Ensembl; ENSG00000183463; Homo sapiens.
  DR   HGNC; HGNC:17785; PRHOXNB.
  PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence data are submitted;
‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the type of evidence for the existence of a given protein;
• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)
http://www.uniprot.org/docs/pe_criteria
In order to avoid ‘pseudogenes’ and most of the improbable protein sequences, you can filter your query to exclude sequences with ‘protein existence tag’ = ‘Uncertain’.
The ‘alternative’ sequence(s)
UniProtKB/Swiss-Prot
1 entry <-> 1 gene (1 species)
Annotation of the sequence differences (including conflicts, polymorphisms, splice variants etc.)
-> annotation of protein diversity
1 entry <-> 1 gene (1 species)
Multiple alignment of the end of the available GCR sequences
Annotation of the sequence differences (protein diversity) …and natural variants
P04150 (www.uniprot.org)
UniProtKB (and RefSeq) under-represent alternatively spliced products
According to PMID:21307931, alternative splicing seems to occur at more than 90% of protein-coding genes (it might not always modify the protein sequence).
Transcript variants are only made when there is information available on the full-length nature of the product; if multiple, alternate exons are found through the length of the gene, no assumption is made about the combination of the alternate exons that may exist in vivo.
Uncertain alternative sequences (confirmed by only one cDNA) are tagged with ‘No experimental confirmation available’
http://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_does_a_reviewed_status_me
Important remark
The ‘alternative’ sequence(s):
- Available in separate files! > 30’000 additional sequences (total)
- Not ‘directly available’ for a lot of tools, including protein identification tools and Blast, depending on the server !
- Not included yet in the UniProtKB complete proteome sets !
Depending upon the organism, the inclusion of alternative sequences in the basic set of protein sequences can make a tremendous difference. For instance, in Homo sapiens, alternative sequences currently represent close to 40% of the total number of annotated human sequences described in UniProtKB/Swiss-Prot. http://www.uniprot.org/faq/38
Blast P04150 against Swiss-Prot / Homo sapiens @ UniProt: isoform sequences
Blast P04150 against Swiss-Prot / Homo sapiens @ NCBI: the isoform sequences are not present in the NCBI protein databases !
The .x number (P06401.4) corresponds to the version number of the sequence…not to an alternatively spliced sequence !
How to track sequence changes ?
• The sequence version number applies to the canonical sequence only
• There is no easy way yet to track sequence updates of isoforms
UniProtKB
2- Biological data curation
Extraction of literature information and protein sequence analysis; maximum usage of controlled vocabulary
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the source of each annotation
Protein and gene names
General annotation (Comments) …enables researchers to obtain a summary of what is known about a protein… www.uniprot.org
Human protein manual annotation: some statistics (Aug 2010)
Sequence annotation (Features) …enables researchers to obtain a summary of what is known about a protein… www.uniprot.org
Human protein manual annotation: some statistics (PTM)
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two.
Level. Type of evidence (Qualifier)
1st. Strong experimental evidence (Ref.X)
2nd. Light experimental evidence (Probable)
3rd. Inferred by similarity with homologous protein (data of 1st or 2nd level) (By similarity)
4th.
Inferred by sequence prediction Potential Find all the protein localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven) UniProtKB Additional information can be found in the cross-references (to more than 140 databases) Organism-specific AGD ArachnoServer CGD ConoServer CTD CYGD dictyBase EchoBASE EcoGene euHCVdb EuPathDB FlyBase GeneCards GeneDB_Spombe GeneFarm GenoList Gramene H-InvDB HGNC HPA LegioList Leproma MaizeGDB MGI MIM neXtProt Orphanet PharmGKB PseudoCAP RGD SGD TAIR TubercuList WormBase Xenbase ZFIN Sequence EMBL IPI PIR RefSeq UniGene Proteomic Genome annotation Polymorphism Family and domain PeptideAtlas PRIDE ProMEX Ensembl EnsemblBacteria EnsemblFungi EnsemblMetazoa EnsemblPlants EnsemblProtists GeneID GenomeReviews KEGG NMPDR TIGR UCSC VectorBase dbSNP Gene3D HAMAP InterPro PANTHER Pfam PIRSF PRINTS ProDom PROSITE SMART SUPFAM TIGRFAMs Gene expression ArrayExpress Bgee CleanEx Genevestigator GermOnline Protein family/group Allergome CAZy MEROPS PeroxiBase PptaseDB REBASE TCDB Ontologies GO UniProtKB/Swiss-Prot: 129 explicit links 2D gel and 14 implicit links! 
Phylogenomic dbs eggNOG GeneTree HOGENOM HOVERGEN InParanoid OMA OrthoDB PhylomeDB ProtClustDB 3D structure PTM GlycoSuiteDB PhosphoSite PhosSite Other PPI BindingDB DrugBank NextBio PMAP-CutDB DIP IntAct MINT STRING DisProt HSSP PDB PDBsum ProteinModelPortal SMR 2DBase-Ecoli ANU-2DPAGE Aarhus/Ghent-2DPAGE (no server) COMPLUYEAST-2DPAGE Cornea-2DPAGE DOSAC-COBS-2DPAGE ECO2DBASE (no server) OGP PHCI-2DPAGE PMMA-2DPAGE Rat-heart-2DPAGE REPRODUCTION-2DPAGE Siena-2DPAGE SWISS-2DPAGE UCD-2DPAGE World-2DPAGE Enzyme and pathway BioCyc BRENDA Pathway_Interaction_DB Reactome Protein sequence origin http://www.uniprot.org/faq/35 Access to UniProtKB www.uniprot.org The UniProt web site - www.uniprot.org • Powerful search engine, google-like and easy-to-use, but also supports very directed field searches (similar to SRS) • Scoring mechanism presenting relevant matches first • Entry views, search result views and downloads are customizable • The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access • Tools: Blast, Align, IDmapping, Batch retrieval (Retrieve) Search A very powerful text search tool with autocompletion and refinement options allowing to look for UniProt entries and documentation by biological information Search A very powerful text search tool with autocompletion and refinement options allowing to look for UniProt entries and documentation by biological information The search interface guides users with helpful suggestions and hints Advanced Search A very powerful search tool To be used when you know in which entry section the information is stored Have first a look to annotation examples. Find all human proteins with experimental evidence for their location in the nucleus The information is stored in the ‘General annotation’ section, Subcellular location Find all human proteins with experimental evidence for their location in the nucleus Result pages: Highly customizable Custom downloads…. 
Accession   Genes   Domains   Protein Existence
P02768   ALB (GIG20) (GIG42) (PRO0903) (PRO1708) (PRO2044) (PRO2619) (PRO2675
P02769   ALB   Albumin domains (3)   Evidence at protein level
P02770   Alb   Albumin domains (3)   Evidence at protein level
P07724   Alb (Alb-1) (Alb1)   Albumin domains (3)   Evidence at protein level
P08759   alb-A   Albumin domains (3)   Evidence at transcript level
P14872   alb-B   Albumin domains (3)   Evidence at transcript level
P43652   AFM (ALB2) (ALBA)   Albumin domains (3)   Evidence at protein level
P08835   ALB   Albumin domains (3)   Evidence at protein level
P49822   ALB   Albumin domains (3)   Evidence at protein level
P19121   ALB   Albumin domains (3)   Evidence at protein level
Open with Excel etc.
The URL (results) can be bookmarked and manually modified.
Blast
A tool with the standard options to search sequences in UniProt databases
Blast results: customize display
Blast: use of UniProt annotation: amino-acid highlighting options and feature annotation highlighting option in the local alignment
Align
A ClustalW multiple alignment tool with amino-acid highlighting options and feature annotation highlighting option
ClustalW multiple alignment of insulin sequences
Retrieve
A UniProt-specific tool that retrieves a list of entries from identifiers in several standard formats. You can then query your ‘personal database’ with the UniProt search tool.
Your dataset: results of a Scan Prosite
ID Mapping
Maps the identifiers of a given protein between different databases
These identifiers are all pointing to TP53 (p53) ! P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.
Download
Downloading UniProt http://www.uniprot.org/downloads
Canonical and isoform sequences
Complete proteome: ‘gene’ centred or all known proteins ?
http://www.uniprot.org/faq/38
Remark: Some peptides are not associated with the keyword ‘Complete proteome’ because they do not match the human genome
UniProt proteome sets, if downloaded in UniProt flat file or XML format, contain one sequence per UniProt record!
‘Gene’-centred: all protein sequences in UniProtKB/Swiss-Prot…
Missing: other alternatively spliced protein sequences in UniProtKB/TrEMBL

IPI closure
• The complete MOUSE proteome will be composed of all MOUSE sequences in UniProtKB/Swiss-Prot plus those MOUSE sequences in UniProtKB/TrEMBL that have a cross-reference to an Ensembl protein.
• The complete HUMAN proteome will therefore be composed of all HUMAN sequences in UniProtKB/Swiss-Prot plus those HUMAN sequences in UniProtKB/TrEMBL that have a cross-reference to an Ensembl protein.
• News: 30th March 2011: next UniProtKB release.

UniProtKB Statistics
Swiss-Prot & TrEMBL introduce a new arithmetical concept!
Swiss-Prot (520’000; 12’000 species, mainly model organisms) + TrEMBL (14’000’000; 130’000 species): 14’000’000
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot
Not yet available
~ 200 new entries / day; new release every 4 weeks
- Annotation is useful, good annotation is better, update is essential!
- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot

UniProtKB entry history
Always cite the primary accession number (AC)!

UniParc
- non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins…)
- the equivalent of ENA/GenBank/DDBJ at the protein level
- species-merged: sequences from different species are merged when they are 100% identical over their whole length
- no annotation (only taxonomy)
- can be searched only with database names, taxonomy, checksum (CRC64) and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs
- Beware: contains wrong predictions, pseudogenes, etc.
Query UniParc

UniRef
‘UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences’
«Collapsing BLAST results»
Three collections of sequence clusters from UniProtKB and selected UniParc entries:
- One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %
- One UniRef90 entry -> sequences that are at least 90 % identical -> reduction of 40 %
- One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %
Based on sequence identity -> independent of the species!
UniRef90: independent of species and sequence length

UniMES
The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only GOS data for the moment). Download only (but included in UniParc -> Blast).
- UniMES Fasta sequences
- UniMES matches to InterPro methods
ftp.uniprot.org/pub/databases/uniprot
UniMES: sequences in fasta format

Menu
Introduction
Nucleic acid sequence databases: ENA/GenBank, DDBJ
Protein sequence databases: UniProt databases (UniProtKB), NCBI protein databases

NCBI protein databases (Entrez protein, NCBI nr) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Major ‘general’ protein sequence database ‘sources’: TPA, PIR, PDB, PRF
Integrated resources (‘cross-references’): UniProtKB: Swiss-Prot + TrEMBL
Resources kept separated: NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
not complete !!! (only entries created before 2007 ?)

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences
PRF: Protein Research Foundation: journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
TPA: Third Party Annotation

Query at Entrez protein http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Typical result of a query at « Entrez protein »: Swiss-Prot, RefSeq, GenPept
A Swiss-Prot entry with the NCBI look
A TrEMBL entry with the NCBI look !!!!

GI number: ‘GenInfo identifier’ number
- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database (including Swiss-Prot entries) has a GI number.
- If the sequence changes in any way, a new GI number is assigned: GI identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search.
- A separate GI number is assigned to each protein translation (alternative products)
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
ID/AC mapping http://www.ebi.ac.uk/Tools/picr/

GenPept
Translation from annotated CDS in GenBank
Contains all translated CDS annotated in GenBank/ENA/DDBJ sequences
- equivalent to UniProtKB/TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR…)
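GenPept entries are described above as translations of annotated CDS. As a rough sketch of what such a translation involves, the snippet below assumes the standard genetic code and an in-frame CDS; the example sequence and the `translate_cds` name are made up for illustration and are not taken from any database record.

```python
# Standard genetic code, written in the usual TCAG order (an assumption of this
# sketch: organelle and some microbial genomes use other translation tables).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))


def translate_cds(cds):
    """Translate an in-frame CDS codon by codon, stopping at the first stop codon."""
    sequence = cds.upper().replace("U", "T")
    protein = []
    for start in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(sequence[start:start + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical three-codon CDS: Met, Ala, stop.
    print(translate_cds("ATGGCCTAA"))
```

The translated protein is only as good as the CDS annotation it comes from: a wrong start codon, frame or exon boundary in the nucleotide record propagates directly into the protein entry.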
GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’

RefSeq
Produced by NCBI and NLM
http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
Protein – mRNA – genomic sequence. Also chromosomes, organelle genomes, plasmids, intermediate assembled genomic contigs, ncRNAs.
- tightly linked to Entrez Gene (« interdependent curated resources »)
Example: NP_000790
Beware: NeXtProt accession number: NX_P00918
AC; KW; Taxonomy; References; GenBank source and status; Annotation and ontologies; Curated records

UniProtKB vs RefSeq
UniProtKB/Swiss-Prot merges all CDS available for a given gene and describes the sequence differences. Example: UniProtKB/Swiss-Prot P04150 (GCR_HUMAN)
RefSeq chooses one or several protein reference sequences for a given gene; the sequence differences are not annotated.
- If there is an alternative splicing event, there will be several distinct entries for a given gene
Example: GCR_HUMAN: 1 UniProtKB/Swiss-Prot entry cross-linked with 7 RefSeq entries

Protein feature annotation found in RefSeq
- Conserved domains
- Signal and mature peptides
- Propagation of a subset of features from Swiss-Prot.
PTM annotation: Swiss-Prot vs RefSeq (GCR_HUMAN)

RefSeq statistics
The numbers are not comparable: ‘sequence’ entries (RefSeq) vs ‘gene’ entries (UniProtKB/Swiss-Prot)

UniProtKB vs NCBI protein
Summary
ENA/GenBank/DDBJ | RefSeq www.ncbi.nlm.nih.gov/RefSeq/ | UniProt www.uniprot.org
Protein and nucleotide data | Genomic, RNA and protein data | Protein data only
Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry
Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL
Author submission | NCBI creates from existing data + gene prediction | UniProt creates from existing data
Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein from one gene of major organisms (in Swiss-Prot; TrEMBL is redundant)
Records can contradict each other | Identification and annotation of discrepancy
No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms
Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)

PIR
PIR: the Protein Information Resource
PIR-PSD is no longer updated, but exists as an archive

PDB
• PDB (Protein Data Bank), 3D structure
• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
• Also contains the corresponding protein sequences*
*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools
• Includes protein sequences which are mutated, chimeric, etc. (created specifically to study the effect of a mutation on the 3D structure)
PDB: Protein Data Bank www.rcsb.org/pdb/
• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
• Associated specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
• Currently there are ~68’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!
PDB: example: sequence; coordinates of each atom; visualisation with Jmol

PRF
Protein Research Foundation
Looks for peptide sequences described in publications (and not submitted to databases !!!)
http://www.genome.jp/dbget-bin/www_bfind?prf

Other protein databases

Ensembl http://www.ensembl.org/
Review http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610
Annotation pipeline http://www.genome.org/cgi/content/full/14/5/942
- Ensembl aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)
- It also does gene prediction (-> novel genes)
Ensembl = UniProtKB + RefSeq + gene prediction
- DNA, RNA and protein sequences available for several species.
- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.

Example of a problem (derived from the gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
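The URAD_HUMAN record above follows the UniProtKB flat-file layout, in which each line starts with a two-letter line code followed by its content. As a minimal sketch, assuming that layout (code in the first two columns, three spaces, then data), the snippet below groups the lines by code and reads the protein existence (PE) level, which is what flags this entry as merely predicted. `parse_flat_file` and `protein_existence_level` are hypothetical helper names, and the parser ignores everything a production parser would need (sequence block, continuation rules, evidence tags).

```python
from collections import defaultdict

# Lines taken from the slide; the elided part of the entry is simply left out.
ENTRY = """\
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
"""


def parse_flat_file(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # Skip anything that is not "XX   content".
        if len(line) < 6 or line[2:5] != "   ":
            continue
        fields[line[:2]].append(line[5:].strip())
    return dict(fields)


def protein_existence_level(fields):
    """Return (level, label) from a PE line such as '4: Predicted;'."""
    level, _, label = fields["PE"][0].partition(":")
    return int(level), label.strip(" ;")


if __name__ == "__main__":
    record = parse_flat_file(ENTRY)
    print(record["AC"], protein_existence_level(record))
    print([dr.split(";")[0] for dr in record["DR"]])
```

Filtering a downloaded dataset on the PE level is one simple way to keep predicted entries like this one out of an analysis that should rest on experimentally supported proteins.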
IPI http://www.ebi.ac.uk/IPI/IPIhelp.html
IPI: Closure !
Automatic approach that builds clusters by combining knowledge already present in the primary data sources (UniProtKB, RefSeq, Ensembl) with sequence similarity.
IPI = UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR + VEGA).
!!! Complete proteome sets include all alternative splicing sequences…
Available for human, mouse, rat, zebrafish, Arabidopsis, chicken and cow

CCDS http://www.ncbi.nlm.nih.gov/CCDS/
CCDS (human, mouse)
Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…
Consensus between 4 institutions…

Gene Ontology (GO)
Standards: why are they so important?
• ‘The ever-increasing number of sequencing projects necessitates a standardized system (…) to ensure that the flood of information produced can be effectively utilized.’ (PMID 19577473)
• Standardization of biological data/information (data sharing and computational analysis).
• Aim: extract and compare annotation between different resources or species (semantic similarity).
Secreted or not secreted? PubMed 19299134

Gene Ontology (GO)
• The Gene Ontology is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information. In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary. Contains ~30’000 terms.

Gene Ontology (GO) terms
- biological process: broad biological phenomena, e.g. mitosis, growth, digestion
- molecular function: molecular role, e.g. catalytic activity, binding
- cellular component: subcellular location, e.g. nucleus, ribosome, origin recognition complex
GO terms associated with human Erythropoietin
http://www.geneontology.org

Caveats
• Annotation is the process of assigning/mapping GO terms to gene products…
• Electronic vs manual annotation… Example with EPO
Histone H4 !!!
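Two points from the GO slides above can be made concrete with a toy example: terms are linked to parent terms (a structured vocabulary), and each annotation carries an evidence code that tells electronic from manual assignments. The sketch below is illustrative only: the parent links are simplified rather than copied from the real ontology, the protein names and annotations are invented, and it assumes that the evidence code IEA marks computer-assigned annotations while codes such as IDA denote experimental, curator-assigned ones.

```python
# Toy is_a links (simplified for illustration, not the real GO graph).
IS_A = {
    "nucleus": ["intracellular organelle"],
    "ribosome": ["intracellular organelle"],
    "intracellular organelle": ["cellular component"],
    "cellular component": [],
}

# (gene product, GO term, evidence code); hypothetical annotations.
ANNOTATIONS = [
    ("protein_A", "nucleus", "IDA"),   # IDA: inferred from direct assay
    ("protein_A", "ribosome", "IEA"),  # IEA: inferred from electronic annotation
    ("protein_B", "nucleus", "IEA"),
]


def ancestors(term, is_a=IS_A):
    """Collect every term reachable by following is_a links upwards."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def split_by_evidence(annotations):
    """Separate manual annotations from electronic (IEA) ones."""
    manual = [a for a in annotations if a[2] != "IEA"]
    electronic = [a for a in annotations if a[2] == "IEA"]
    return manual, electronic


if __name__ == "__main__":
    print(sorted(ancestors("nucleus")))
    manual, electronic = split_by_evidence(ANNOTATIONS)
    print(len(manual), "manual;", len(electronic), "electronic")
```

Because a protein annotated to a term is implicitly annotated to all of that term's ancestors, counting annotations per category only makes sense after deciding at which level of the hierarchy, and with which evidence codes, to count.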
Large-scale derived data (‘proteome’)
GO terms: essential link between biological knowledge and high-throughput genomic and proteomic datasets…
‘summary of the gene ontology classifications for all mapped ESTs…’ PMID: 15514041
~40 % of human proteins have no known function (experimental data)… but many more are associated with GO terms… (computer-assigned).
Human proteins functional distribution
Maybe, Potentially, Putative, Expected, Probably, Hopefully

All documents (including practicals) are online: http://education.expasy.org/cours/UniProt/