Protein sequence databases
http://education.expasy.org/cours/Murcia2011/
Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
Protein Sequence Databases Murcia, February, 2011

Menu
• Introduction
• Nucleic acid sequence databases
  - ENA, GenBank, DDBJ
• Protein sequence databases
  - UniProt databases (UniProtKB)
  - NCBI protein databases
  - Other databases (Ensembl, IPI, CCDS, …)

Indispensable for bioinformatic studies
1. Databases (free access on the web)
2. Software tools
3. Servers

What is a database?
• A collection of related data, which are
  - structured
  - searchable
  - updated periodically
  - cross-referenced
• Also includes the associated tools necessary for access/query, download, etc.

Why biological databases?
• Exponential growth in biological data.
• Data (genomic sequences, protein sequences, 3D structures, 2D gel electrophoresis, MS analysis, microarrays, publications…) are no longer published in a conventional manner, but directly submitted to databases.
• Essential tools for biological research.
The NAR Online Molecular Biology Database Collection in 2011
A total of 1’330 databases
http://nar.oxfordjournals.org/content/38/suppl_1

Categories of databases for Life Sciences
• Sequences (DNA, protein)
  - DNA/RNA: EMBL/GenBank/DDBJ
  - Protein: UniProtKB, NCBInr
• Genomics: OMIM, FlyBase
• 3D structure: PDB
• Mutation/polymorphism: dbSNP
• Protein domain/family: InterPro
• Metabolism/pathways: KEGG
• Bibliography: PubMed
• ‘Others’ (protein-protein interaction, microarrays…)

[Figure: types of data held in databases: DNA sequences, microarray expression data, protein sequences, human genome gene annotation, macromolecular structure data]

Proliferation of databases
• Which contains the highest-quality data?
• Which is comprehensive?
• Which is up to date?
• Which is redundant?
• Which is indexed (allows complex queries)?
• Which Web server responds most quickly?
• …?
Awareness of the content and usage of knowledge resources is a prerequisite for any type of « serious » research in the field of molecular life sciences (AMB, 2007)

Where can we find…
• A video -> YouTube
• Info on S. Hawking -> Wikipedia
• A book -> Amazon
• A friend -> Facebook
  - Usually only one server
• DNA sequence -> EMBL
• Protein sequence -> UniProtKB, RefSeq…
  - Several different servers give access to the ‘same’ database

Servers
• ‘Any computer (…) serving out applications or services can technically be called a server.
’ (Wikipedia)

EBI: http://www.ebi.ac.uk/
NCBI: http://www.ncbi.nlm.nih.gov/
ExPASy: http://expasy.org
www.uniprot.org

How to find a database?
Beware: not all servers give access to the latest version of a database. It is important to know the ‘home server’ for a given database.
- ExPASy life sciences directory: -> ‘home’ server links (www.expasy.org/alinks.html)
- Google (http://www.google.com) (not always linked to the ‘home’ server)

http://www.expasy.org/
http://www.expasy.org/links.html

The same data on different servers…
[Figure: the same entry shown at UniProt, at NCBI and at http://srs.dna.affrc.go.jp/srs8/srs?-id+1QexuT1Yn4Di0xF+[uniprot_swissprot-AccNumber:P16855]+-e]

Proteins…proteins
Protein sequences are the fundamental determinants of biological structure and function.
http://www.ncbi.nlm.nih.gov/protein

Protein sequence databases are essential for…
- Identification of proteins by proteomics
  --> completeness, sequence quality
  ‘producing large protein lists is not the end point in Proteomics’ -> extract knowledge
- Similarity searches, BLAST (functional prediction)
  --> sequence quality (no redundancy)
- Training datasets (prediction tools, PTM etc.)
  --> sequence and annotation quality
- Creation of DNA chips for mRNA expression studies
  --> completeness (complete proteome), sequence quality

[Figure: cloud of protein sequence resources: TrEMBL, RefSeq, PRF, GenPept, UniProtKB, (IPI), Swiss-Prot, Ensembl, UniMES, TPA, UniParc, (PIR), NCBInr, PDB, CCDS, ?]

These identifiers all point to the same sequence of TP53 (p53)!
P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676, JT0436, etc.

A HUPO test sample study reveals common problems in mass spectrometry-based proteomics
PubMed 19448641 (2009)
• A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides).
• Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results).
• Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, mainly because the search engines used cannot distinguish among different identifiers for the same protein…

Protein sequence origin
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs)
-> It is important to know where the protein sequence comes from… (sequencing & gene prediction quality)!
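The TP53 identifier list above can be sorted by source database with a handful of patterns. The sketch below is only an illustration: the regular expressions are simplified assumptions about each identifier format (the UniProtKB one follows the published accession pattern), and the function name `classify` is hypothetical. Identifiers that match nothing are reported as unrecognized rather than guessed.

```python
import re

# Assumed, simplified identifier patterns; real formats have more variants.
PATTERNS = [
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("IPI", re.compile(r"^IPI\d{8}$")),
    ("CCDS", re.compile(r"^CCDS\d+$")),
    ("Ensembl gene", re.compile(r"^ENSG\d{11}$")),
    ("RefSeq protein", re.compile(r"^[NXY]P_\d+$")),
    ("UniProtKB", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
]


def classify(identifier):
    """Return the name of the first database whose pattern matches."""
    for database, pattern in PATTERNS:
        if pattern.match(identifier):
            return database
    return "unrecognized"


if __name__ == "__main__":
    for ident in ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
                  "UPI000002ED67", "IPI00025087", "XP_001172091", "JT0436"]:
        print(ident, "->", classify(ident))
```

Pattern matching only says where an identifier probably comes from; deciding that two identifiers denote the same protein still requires the cross-references kept by the databases themselves.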
Flood of data: example with the genome sequences…
New challenge: the flood of data needs to be stored, curated and made available for analysis and knowledge discovery.

• ~2500 genomes sequenced (single organism, varying sizes, including viruses)
  http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
  http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
  ~50-100 genomes/month; + ~2’500 viral genomes => total ~5’000 genomes
• ~5’000 ongoing genome sequencing projects
• cDNA sequencing projects (ESTs or cDNAs)
• Metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms

Metagenomics
Study of genetic material recovered directly from environmental samples
• Global Ocean Sampling (C. Venter): 1 ml sea water contains 1 million bacteria and 10 million viruses
• Whale fall (AAFZ00000000.1)
• Soil, sand beach, New York air, … (Venter’s Sorcerer II)
• Human fluids, mouse gut (millions of bacteria within the human body)
• Water treatment industry…
• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi

• Personal human genomes
  New-generation sequencers: Illumina: 25 billion bp/day

[Figure: 3’000’000’000 $ (public consortium, 2000); 2’000’000 $ (2007); 70’000’000 $ (diploid, 2007); 300’000’000 $ (Celera, 2000); 2010: http://www.youtube.com/watch?v=mVZI7NBgcWM]
…2700 genomes in 2010, 30’000 genomes in 2011?
But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer's disease and that he has the ‘blue eye’ allele…

[Figure: apoE gene (Ensembl genome browser)]

New projects
• 1000 Genomes (first publication, October 2010)
• Multiple personal genomes (germ cells, lymphoid cells, cancer cells…)
• International Cancer Genome Consortium (www.icgc.org): for each of the most common cancers, they sequence the genomes of 500 patients with cancer and 500 healthy individuals…

How many protein-coding genes in the end?

Peabody Museum exhibition on the Tree of Life
http://www.peabody.yale.edu/exhibits/treeoflife/

190’500’025’042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
  20 million bacteria/archaea x 4’000 genes
  1 million protists x 6’000 genes
  5 million insects x 14’000 genes
  2 million fungi x 6’000 genes
  0.5 million plants x 20’000 genes
  0.5 million molluscs, worms, arachnids, etc. x 20’000 genes
  0.1 million vertebrates x 25’000 genes
The calculation:
$2\times10^{7}\times4000 + 1\times10^{6}\times6000 + 5\times10^{6}\times14000 + 2\times10^{6}\times6000 + 5\times10^{5}\times20000 + 5\times10^{5}\times20000 + 1\times10^{5}\times25000 + 20000\ \text{(Craig Venter)} + 42\ \text{(Douglas Adams)} + \dots$

About 190 billion proteins (?)
About 13.0 million ‘known’ protein sequences in 2011 (from ~300’000 species)
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
Less than 1 % come from direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from… (sequencing & gene prediction quality)!
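The back-of-the-envelope gene count above is easy to re-run. The sketch below simply multiplies the species and genes-per-species figures quoted on the slide; the variable and function names are hypothetical. Note that the terms explicitly listed add up to 190'500'020'042, which is 5'000 short of the headline figure; the remainder presumably hides in the trailing "+ …" of the slide's formula.

```python
# (group, estimated number of species, estimated genes per species),
# copied from the slide's second estimate.
ESTIMATES = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]

# The two tongue-in-cheek extra terms: Craig Venter and Douglas Adams.
EXTRAS = 20_000 + 42


def total_genes(estimates=ESTIMATES, extras=EXTRAS):
    """Sum species x genes over all groups, plus the extra terms."""
    return sum(species * genes for _, species, genes in estimates) + extras


if __name__ == "__main__":
    print(f"{total_genes():,}")
```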
The ideal life of a sequence…
[Figure: cDNAs, ESTs, genes, genomes, … -> nucleic acid sequence databases -> protein sequence databases]

Menu
• Introduction
• Nucleic acid sequence databases
  - ENA/GenBank, DDBJ
• Protein sequence databases
  - UniProt databases (UniProtKB)
  - NCBI protein databases

ENA (EMBL-Bank), GenBank, DDBJ
ENA: European Nucleotide Archive; DDBJ: DNA Data Bank of Japan
Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.
ENA/GenBank/DDBJ: http://www.insdc.org/

The hectic life of a sequence…
[Figure: cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ; some data are not submitted to public databases, delayed or cancelled…]

Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ AC number is not available…
‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq
…not the case for protein sequences!!! No longer the case for a lot of genomes!!!

ENA/GenBank/DDBJ
• Serve as archives: ‘nothing goes out’
• Contain all public sequences derived from:
  - Genome projects (> 80 % of entries)
  - Sequencing centers (cDNAs, ESTs…)
  - Individual scientists (15 % of entries)
  - Patent offices (e.g.
European Patent Office, EPO)
• Currently: ~200 x 10^6 sequences, ~300 x 10^9 bp
• Sequences from > 300’000 different species

Archival databases:
- Can be very redundant for some loci
- Sequence records are owned by the original submitter and cannot be altered by a third party (except TPA)

Organisms with the highest redundancy…

[Figure: annotated ENA entry showing the accession number, taxonomy, references, cross-references, the CDS (CoDing Sequence, proposed by submitters), the CDS annotation (predicted or experimentally determined) and the sequence]

The hectic life of a sequence…
[Figure: cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ, with or without annotated CDS provided by authors; some data are not submitted to public databases, delayed or cancelled…]

CDS: CoDing Sequence
Portion of DNA/RNA translated into protein (from Met to STOP)
Experimentally proved or derived from gene prediction
!!! not so well documented !!!
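The CDS definition above (from Met to STOP) can be illustrated with a minimal translation sketch. It assumes the standard genetic code, takes the first ATG in the given strand as the start, and reads in frame to the first stop codon; the function name `translate_cds` is hypothetical. Real CDS annotation is harder: the first ATG is not always the true initiation site, and introns must be removed first.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(sequence):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns an empty string when no ATG is present. Codons containing
    ambiguous bases (e.g. N) are rendered as 'X'.
    """
    dna = sequence.upper().replace("U", "T")  # accept RNA input as well
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ccATGGCCAAATGAtt"))
```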
CoDing Sequence
Alignment between an mRNA and a genomic sequence
[Figure: nucleotide alignment of the mRNA (CONTIG) against the genomic sequence, with exons and introns marked. Annotations: CDS provided by the submitters; the first Met!; CDS translation provided by ENA]

A eukaryotic gene (UCSC)
[Figure: gene structure, 5’ to 3’, labelled: Met, initial exon, introns, internal exons, final exon, STOP, 3’ untranslated region]
This particular gene lies on the reverse strand!

mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)
[Figure: UCSC view of the human EPO contig, 5’ to 3’]

Complete genome (submitted) but only ~2,000 CDS/proteins available!
…annotated CDS in UniProtKB
http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

ENA/GenBank/DDBJ
Variable level of sequence quality
- Sequencing quality
- Gene prediction quality
Authors can specify the nature of the CDS by using the qualifier "/evidence=experimental" or "/evidence=not_experimental". Very rarely done…

Variable level of sequence quality: DNA vs RNA
RNA
• EST: Expressed Sequence Tags, produced by one-shot sequencing of a cloned cDNA (no CDS, but proteomic tools give access to ‘translated ESTs’)
• HTC: High Throughput cDNAs (CDS annotation)
DNA
• GSS: Genome Sequence Survey: similar to the EST division, except that most of the sequences are genomic in origin (no annotation, no CDS, with some exceptions (Drosophila))
• HTG: High-Throughput Genomic Sequences: single-pass, unfinished genomic sequences (no annotation, no CDS, with some exceptions (Leishmania))
• WGS: Whole Genome Shotgun: contigs of a sequencing project. WGS data can contain annotation and should be updated as sequencing progresses (CDS annotation)

Complete proteomes
Complete genomes ?
Complete genomes ??
[Figure: UCSC genome assembly statistics: 27478 contigs]
N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs equal to or larger than this value.
Genome Reference Consortium
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml

Genome sequencing and assembly: some caveats to deal with…
• ~350 gaps in 2010 (human genome)
• In the near future, we will have to deal with ‘incomplete genome’ sequences (never finished, metagenome…)… Prediction of ‘partial’ genes/exons is complex!
• Updates of genome sequences: not always ‘stable’ data…
• We are all different: -> ‘pan genome’?

From nucleic acid to amino acid sequence databases…

The hectic life of a protein sequence…
[Figure: cDNAs, ESTs, genomes, … -> nucleic acid databases (ENA, GenBank, DDBJ) -> protein sequence databases. Either the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries), or there is no CDS and gene prediction is needed (RefSeq, Ensembl and others*). Some data are not submitted to public databases, delayed or cancelled…]
* 1000 genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/2010_11/
Why do things the simple way when you can do them in a very complex one?
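Returning to the N50 statistic defined for genome assemblies above, a minimal sketch of the computation follows. It uses the definition given on the slide (the contig length such that contigs of that length or longer hold at least half of the assembly); the function name `n50` and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


if __name__ == "__main__":
    # Hypothetical assembly of eight contigs, 32 bases in total.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because the statistic is weighted by length, a few long contigs dominate it: the many short contigs of a fragmented assembly barely lower the N50.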
The hectic life of a sequence…
[Figure: cDNAs, ESTs, genomes, … and sequences derived from scientific publications -> ENA, GenBank, DDBJ -> protein resources. Labels: CoDing Sequences provided by submitters: TrEMBL, GenPept; CoDing Sequences provided by submitters and gene prediction: RefSeq, UniProtKB, (IPI), Swiss-Prot, Ensembl, UniMES, (PIR), PRF, TPA, UniParc, CCDS, PDB + all ‘species’-specific databases (EcoGene, TAIR, …). Some data are not submitted to public databases, delayed or cancelled…]

Major ‘general’ protein sequence database ‘sources’
• Integrated resources (‘cross-references’ to TPA, PIR, PDB, PRF): UniProtKB: Swiss-Prot + TrEMBL
• Resources kept separate: NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

• UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
• UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)
• GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
• PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
• PDB: Protein Data Bank: 3D data and associated sequences
• PRF: Protein Research Foundation: journal scan of ‘published’ peptide sequences
• RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
• TPA: Third Party Annotation

Look for toll-like receptor 4 (Homo sapiens)
[Figure: search results at www.uniprot.org: Swiss-Prot and TrEMBL entries]

Look for toll-like receptor 4 (Homo sapiens)
[Figure: search results at http://www.ncbi.nlm.nih.gov/: mostly GenPept entries, plus RefSeq and Swiss-Prot]

Menu
• Introduction
• Nucleic acid sequence databases
  - ENA/GenBank, DDBJ
• Protein sequence databases
  - UniProt databases (UniProtKB)
  - NCBI protein databases

UniProt consortium: EBI + SIB + PIR

UniProt
• What is UniProt?
• UniProtKB sequence curation
• UniProtKB biological data curation
• Statistics
• Access to UniProtKB

www.uniprot.org

UniProt databases
• UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~13 million entries)
• UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated (query, Blast, download) (~25 million entries)
• UniRef: 3 sets of protein sequence clusters at 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)
• UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)

UniProtKB/Swiss-Prot and UniProtKB/TrEMBL give access to all the protein sequences which are available to the public.
However, UniProtKB excludes the following protein sequences:
- Most non-germline immunoglobulins and T-cell receptors
- Synthetic sequences
- Most patent application sequences
- Small fragments encoded from nucleotide sequence (<8 amino acids)
- Pseudogenes*
- Fusion/truncated proteins
- Not real proteins
* many putative pseudogene sequences may be expected to remain in UniProtKB for some time, as it can be difficult to prove the non-existence of a protein

UniProtKB: an encyclopedia on proteins
Composed of 2 sections: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
Released every 4 weeks

UniProtKB: from ENA to TrEMBL
UniProtKB protein sequence data are mainly derived from ENA (CDS) but also from Ensembl and other sequence resources such as RefSeq or model organism databases (MODs). Data from the PIR database have been integrated in UniProt since 2003.

[Figure: ENA -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references + automated annotation]
The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the information provided by the submitter of the original nucleotide entry.

Automated annotation
• Redundancy check (100% merge (same length, not fragment))
• Family attribution (InterPro)
• Many other cross-references
• Rule-based automated annotation (~38% of TrEMBL entries)
Automated annotation systems:
- UniRule (RuleBase, HAMAP; manually reviewed)
- SAAS (automatically generated rules, i.e.
via InterPro)

[Figure: anatomy of a UniProtKB/TrEMBL entry (www.uniprot.org): one protein sequence, one species; protein and gene names; taxonomic information; references; cross-references to over 125 databases; automated annotation (function, subcellular location, catalytic activity, sequence similarities…); automated annotation (transmembrane domains, signal peptide…); automated annotation (keywords and Gene Ontology)]

UniProtKB: from TrEMBL to Swiss-Prot
Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL -> minimal redundancy

[Figure: ENA -> TrEMBL (automated extraction of protein sequence (translated CDS), gene name and references + automated annotation) -> Swiss-Prot (manual annotation of the sequence and associated biological information)]

UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2.
Biological information (sequence analysis, extract literature information, ortholog data propagation, …)

[Figure: anatomy of a UniProtKB/Swiss-Prot entry (www.uniprot.org): one protein sequence, one gene, one species; protein and gene names; taxonomic information; references; cross-references to over 125 databases; alternative products (protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…); manual annotation (function, subcellular location, catalytic activity, disease, tissue specificity, pathway…); manual annotation (post-translational modifications, variants, transmembrane domains, signal peptide…); manual annotation (keywords and Gene Ontology). Example sequence:
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG]

In a UniProtKB/Swiss-Prot entry, you can expect to find:
• An (often corrected) protein sequence and the description of various isoforms/variants;
• All the names of a given protein (and of its gene);
• A summary of what is known about the protein: function, PTM, tissue expression, disease, 3D data, etc.;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A selection of references;
• Selected keywords and ontologies;
• Numerous cross-references (central hub).

UniProtKB
1- Sequence curation

UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2.
Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)

The displayed protein sequence
…canonical, representative, consensus…

UniProtKB/Swiss-Prot protein sequence annotation
‘Merging policy’: a gene-centric view of protein space
1 entry <-> 1 gene (1 species)
1 displayed sequence (annotation of alternative sequences, when available)
The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.
The displayed sequence is generally derived from the translation of the genomic sequence (when available).
Sequence differences are documented.

What is the current status?
• At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence.
• Typical problems:
  - unsolved conflicts;
  - uncorrected initiation sites;
  - frameshifts;
  - other ‘problems’

… once a gene on chromosome 11…

Quality of protein information from genome projects
• Let's look at proteins originating from genome projects:
  - Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;
  - Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but with no update since then (at least in the public view): 20% of the gene models are erroneous;
  - Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins.
  - Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made… Start codons, missed small proteins (<100 aa)…

UniProtKB/Swiss-Prot
Protein sequence annotation

Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.

ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.

• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence data are submitted;
• It is important to pursue our efforts to make sure we provide our users with the most correct set of sequences for a given organism.

‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the evidence for the existence of a given protein;
• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5.
Uncertain (mainly in TrEMBL)

In order to avoid ‘pseudogenes’ and most of the improbable protein sequences, you can filter your query to exclude sequences with ‘protein existence tag’ = ‘Uncertain’.

The ‘alternative’ sequence(s)

How many proteins in the end?
Example with human

Proteome complexity
Example with human
[Figure: ~20’000 genes give rise to a much larger number of protein forms]
Not predictable at the genome level! -> important postgenomic data!
(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

UniProtKB/Swiss-Prot
1 entry <-> 1 gene (1 species)
Annotation of the sequence differences (including conflicts, polymorphisms, splice variants, etc.) -> annotation of protein diversity

1 entry <-> 1 gene (1 species)
[Figure: multiple alignment of the end of the available GCR sequences; annotation of the sequence differences (protein diversity) …and natural variant; entry P04150 at www.uniprot.org]

UniProtKB (and RefSeq) under-represent alternatively spliced products
Transcript variants are only made when there is information available on the full-length nature of the product; if multiple, alternate exons are found through the length of the gene, no assumption is made about the combination of the alternate exons that exists in vivo.
http://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_does_a_reviewed_status_me

Important remark
Available in separate files!
> 30’000 additional sequences (total)

The ‘alternative’ sequence(s) are not ‘directly available’ to many tools, including protein identification tools and Blast, depending on the server!

Blast P04150 against Swiss-Prot / Homo sapiens @ UniProt
Isoform sequences

Blast P04150 against Swiss-Prot / Homo sapiens @ NCBI
The isoform sequences are not present in the NCBI protein database!
The .x number (P06401.4) corresponds to the version number of the sequence, not to an alternatively spliced sequence!

UniProtKB
2- Biological data curation

UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)

UniProtKB/Swiss-Prot
General annotation
• Summary of the current knowledge on a given protein.
• Maximum usage of controlled vocabulary
Controlled vocabularies: Keywords, Tissues, Post-translational modifications, Strains, Species, Subcellular location, Extracellular domains, Journals…
• Provides a reliable set of annotated protein entries for:
  • Reference data for systems designed to automatically transfer annotation to similar, not yet (or never) characterized sequences
  • Training of data mining tools, prediction programs

Extract literature information and protein sequence analysis
Maximum usage of controlled vocabulary
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (Prosite, Anabelle)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the source of each annotation.

Protein nomenclature

General annotation (Comments)
…enable researchers to obtain a summary of what is known about a protein…
www.uniprot.org

Human protein manual annotation: some statistics (Aug 2010)

Sequence annotation (Features)
…enable researchers to obtain a summary of what is known about a protein…
www.uniprot.org

Proteome complexity
Example with human: ~20’000
Not predictable at the genome level! -> important postgenomic data!
(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

Human protein manual annotation: some statistics (PTM)

Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two.
Level. Type of evidence -> Qualifier
1st. Strong experimental evidence -> Ref.X
2nd. Light experimental evidence -> Probable
3rd. Inferred by similarity with homologous protein (data of 1st or 2nd level) -> By similarity
4th. Inferred by sequence prediction -> Potential

Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)

UniProtKB
Additional information can be found in the cross-references (to more than 140 databases)

Protein centric view of database network
[Figure labels: DNA sequences, gene expression data, protein sequences, gene annotation, macromolecular structure data]

UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
Sequence: EMBL, IPI, PIR, RefSeq, UniGene
Proteomic: PeptideAtlas, PRIDE, ProMEX
Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Polymorphism: dbSNP
Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Ontologies: GO
Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
PTM: GlycoSuiteDB, PhosphoSite, PhosSite
PPI: DIP, IntAct, MINT, STRING
Other: BindingDB, DrugBank, NextBio, PMAP-CutDB
2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome

UniProtKB
Access to UniProtKB

The UniProt web site: www.uniprot.org
• Powerful search engine, Google-like and easy to use, but also supports very directed field searches (similar to SRS)
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access
• Tools: Blast, Alignment, ID mapping, Batch retrieval (Retrieve)

Search
A very powerful text search tool with autocompletion and refinement options for finding UniProt entries and documentation by biological information

UniProt query tool (www.uniprot.org)
A mixture of Google and SRS
Find all human proteins with experimental evidence for their location in the nucleus

The search interface guides users with helpful suggestions and hints
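Because the URL of a result page reflects the query, a query can also be assembled in a script. The sketch below only builds and decodes such a URL; it does not contact any server. The base URL, the parameter names (query, format, fields, size) and the field names (organism_id, reviewed, accession, gene_names, protein_existence) are assumptions taken from the current UniProt REST service, not from these slides, and should be checked against the UniProt help pages. The helper name build_query_url is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: search endpoint of the current UniProt REST service
# (the 2011 site described in the slides used different URLs).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query_url(clauses, fields=("accession", "gene_names", "protein_existence"),
                    fmt="tsv", size=25):
    """Join query clauses with AND and encode them as a bookmarkable URL."""
    query = " AND ".join(f"({clause})" for clause in clauses)
    params = {
        "query": query,
        "format": fmt,
        "fields": ",".join(fields),
        "size": size,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Reviewed (Swiss-Prot) human entries; field names are assumptions.
url = build_query_url(["organism_id:9606", "reviewed:true"])
print(url)

# Decoding the URL recovers the query, which is what makes it bookmarkable.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Further restrictions (for example on subcellular location or evidence) would be added as extra clauses, using whatever field names the search documentation lists.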
Result pages: highly customizable

Custom downloads…
Accession  Genes               Domains              Protein Existence
P02768     ALB (GIG20) (GIG42) (PRO0903) (PRO1708) (PRO2044) (PRO2619) (PRO2675
P02769     ALB                 Albumin domains (3)  Evidence at protein level
P02770     Alb                 Albumin domains (3)  Evidence at protein level
P07724     Alb (Alb-1) (Alb1)  Albumin domains (3)  Evidence at protein level
P08759     alb-A               Albumin domains (3)  Evidence at transcript level
P14872     alb-B               Albumin domains (3)  Evidence at transcript level
P43652     AFM (ALB2) (ALBA)   Albumin domains (3)  Evidence at protein level
P08835     ALB                 Albumin domains (3)  Evidence at protein level
P49822     ALB                 Albumin domains (3)  Evidence at protein level
P19121     ALB                 Albumin domains (3)  Evidence at protein level
Open with Excel etc.

The URL (results) can be bookmarked and manually modified.

Blast
A tool with the standard options for searching sequences in UniProt databases

Blast results: customize display

Blast: use of UniProt annotation
Amino-acid highlighting options and feature annotation highlighting option in the local alignment

Align
A ClustalW multiple alignment tool with amino-acid highlighting options and feature annotation highlighting option

ClustalW multiple alignment of insulin sequences
Amino-acid highlighting options and feature annotation highlighting option in the local alignment

Retrieve
A UniProt-specific tool for retrieving a list of entries in several standard formats. You can then query your ‘personal database’ with the UniProt search tool.
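A custom download like the albumin table above can be processed outside Excel as well. The following is a minimal sketch that keeps only entries with evidence at protein level. It assumes the download is tab-separated with a header row carrying the column names shown on the slide; the three sample rows are copied from that table, and the function name accessions_with_existence is hypothetical.

```python
import csv
import io

# Three rows copied from the slide's table; the tab-separated layout
# with a header row is an assumption about the download format.
SAMPLE = (
    "Accession\tGenes\tDomains\tProtein Existence\n"
    "P02769\tALB\tAlbumin domains (3)\tEvidence at protein level\n"
    "P08759\talb-A\tAlbumin domains (3)\tEvidence at transcript level\n"
    "P43652\tAFM (ALB2) (ALBA)\tAlbumin domains (3)\tEvidence at protein level\n"
)


def accessions_with_existence(handle, wanted):
    """Return the accessions whose 'Protein Existence' column equals `wanted`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Accession"] for row in reader if row["Protein Existence"] == wanted]


protein_level = accessions_with_existence(io.StringIO(SAMPLE), "Evidence at protein level")
print(protein_level)  # ['P02769', 'P43652']
```

With a real download, the io.StringIO object would be replaced by an open file handle.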
Your dataset: results of a ScanProsite run

ID Mapping
Maps the identifiers of a given protein between different databases
These identifiers all point to TP53 (p53)!
P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.

Download
Downloading UniProt
http://www.uniprot.org/downloads

Complete proteome: ‘gene’ centred or all known proteins?
http://www.uniprot.org/faq/38

Remark: Some peptides are not associated with the keyword ‘Complete proteome’ because they do not match the human genome

UniProt proteome sets, if downloaded in UniProt flat file or XML format, contain one sequence per UniProt record!
‘Gene’ centred: all protein sequences in UniProtKB/Swiss-Prot…
Missing: other alternatively spliced protein sequences in UniProtKB/TrEMBL

Human protein manual annotation: some statistics (Aug 2010)

UniProtKB
Statistics

Swiss-Prot & TrEMBL introduce a new arithmetical concept!
Swiss-Prot: 520’000 entries (12’000 species)
TrEMBL: 13’000’000 entries (130’000 species)
520’000 + 13’000’000 = 13’000’000
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot

12’000 species, mainly model organisms

Not yet available

~200 new entries / day; new release every 4 weeks
- Annotation is useful, good annotation is better, update is essential!
- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot

UniProtKB entry history
Always cite the primary accession number (AC)!

UniParc
- non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins…)
- the equivalent of ENA/GenBank/DDBJ at the protein level
- species-merged: sequences from different species are merged when 100% identical over the whole length.
- no annotation (only taxonomy)
- can be searched only with database names, taxonomy, checksum (CRC64) and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs.
- Beware: contains wrong predictions, pseudogenes etc…

Query UniParc

UniRef
‘UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences’

«Collapsing BLAST results»
Three collections of sequence clusters from UniProtKB and selected UniParc entries:
One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %
One UniRef90 entry -> sequences that have at least 90 % identity -> reduction of 40 %
One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %
Based on sequence identity -> independent of the species!

UniRef90
Independent of species and sequence length

UniMES
The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only GOS data for the moment).
Download only (but included in UniParc -> Blast).
- UniMES Fasta sequences
- UniMES matches to InterPro methods
ftp.uniprot.org/pub/databases/uniprot

UniMES: sequences in fasta format

Menu
Introduction
Nucleic acid sequence databases
ENA, GenBank, DDBJ
Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases

NCBI protein databases (Entrez protein, NCBI nr)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Major ‘general’ protein sequence database ‘sources’
[Figure labels: TPA, PIR, PDB, PRF]
Integrated resources (‘cross-references’): UniProtKB: Swiss-Prot + TrEMBL
Resources kept separated: NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank; 3D data and associated sequences
PRF: Protein Research Foundation; journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
TPA: Third Party Annotation

Query at Entrez protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Typical result of a query at « Entrez protein »: Swiss-Prot, RefSeq and GenPept entries

A Swiss-Prot entry with the NCBI look

GI number
‘GenInfo identifier’ number
- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database (including Swiss-Prot entries) has a GI number.

AC

GI number: ‘GenInfo identifier’ number
- If the sequence changes in any way, a new GI number will be assigned: GI identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search.
- A separate GI number is assigned to each protein translation (alternative products)
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record:
http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi

ID/AC mapping
http://www.ebi.ac.uk/Tools/picr/

GenPept
Translation from annotated CDS in GenBank
Contains all translated CDS annotated in GenBank/ENA/DDBJ sequences
- equivalent to UniProtKB/TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR…)

GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’

RefSeq
Produced by NCBI and NLM
http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
Protein – mRNA – genomic sequence
Also chromosomes, organelle genomes, plasmids, intermediate assembled genomic contigs, ncRNAs.
- tightly linked to Entrez Gene (« interdependent curated resources »)

Example: NP_000790
[Screenshot labels: AC, KW, Taxonomy, References; GenBank source and status; Annotation and ontologies; Curated records]

UniProtKB vs RefSeq

UniProtKB/Swiss-Prot merges all CDS available for a given gene and describes the sequence differences
UniProtKB/Swiss-Prot P04150 (GCR_HUMAN):

RefSeq chooses one or several protein reference sequences for a given gene; it does not annotate the sequence differences.
- If there is an alternative splicing event, there will be several distinct entries for a given gene
Example: GCR_HUMAN
1 UniProtKB entry cross-linked with 7 RefSeq entries
GCR_HUMAN UniProtKB/Swiss-Prot

Protein feature annotation found in RefSeq
- Conserved domains
- Signal and mature peptides
- Propagation of a subset of features from Swiss-Prot.
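The UniProtKB and RefSeq examples above use different identifier styles (P04150, GCR_HUMAN, NP_000790). When lists of identifiers from both resources are mixed, a small classifier helps to sort them. This is only a sketch: the UniProtKB accession pattern follows the format UniProt documents for accession numbers, while the RefSeq protein pattern (a few two-letter prefixes followed by an underscore) and the entry-name pattern are simplified assumptions that do not cover every case. The function name classify is hypothetical.

```python
import re

# UniProtKB accession format as documented by UniProt.
UNIPROT_AC = re.compile(
    r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Assumption: simplified RefSeq protein accession (NP_, XP_, YP_, ...),
# with an optional ".version" suffix.
REFSEQ_PROTEIN = re.compile(r"^[ANXYWZ]P_[0-9]+(\.[0-9]+)?$")
# Assumption: simplified Swiss-Prot entry name, e.g. GCR_HUMAN.
ENTRY_NAME = re.compile(r"^[A-Z0-9]{1,5}_[A-Z0-9]{1,5}$")


def classify(identifier):
    """Return a label describing which resource an identifier looks like."""
    if REFSEQ_PROTEIN.match(identifier):
        return "RefSeq protein accession"
    if UNIPROT_AC.match(identifier):
        return "UniProtKB accession"
    if ENTRY_NAME.match(identifier):
        return "UniProtKB/Swiss-Prot entry name"
    return "unknown"


for identifier in ("P04150", "GCR_HUMAN", "NP_000790"):
    print(identifier, "->", classify(identifier))
```

The RefSeq pattern is tested first so that an identifier with an underscore is not mistaken for an entry name.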
PTM annotation: Swiss-Prot vs RefSeq
GCR_HUMAN

RefSeq statistics
The numbers are not comparable: one entry per ‘sequence’ (RefSeq) vs one entry per ‘gene’ (UniProtKB/Swiss-Prot)

UniProtKB vs NCBI protein
Summary

ENA/GenBank/DDBJ
- Protein and nucleotide data
- Biological data added by the submitters (gene name, tissue…)
- Not curated
- Author submission
- Only author can revise (except TPA)
- Multiple records for same loci common
- Records can contradict each other
- No limit to species included
- Data exchanged among INSDC members

RefSeq (www.ncbi.nlm.nih.gov/RefSeq/)
- Genomic, RNA and protein data
- Biological data annotated by curators, also found in the corresponding Entrez Gene entry
- Partially manually curated (‘reviewed’ entries)
- NCBI creates from existing data + gene prediction
- NCBI revises as new data emerge
- Single records for each molecule of major organisms
- Limited to model organisms
- NCBI database; collaboration with UniProt

UniProt (www.uniprot.org)
- Protein data only
- Biological data annotated by curators (Swiss-Prot), within the entry
- Manually curated in Swiss-Prot, not in TrEMBL
- UniProt creates from existing data
- UniProt revises as new data emerge
- Single records for each protein from one gene of major organisms (in Swiss-Prot; TrEMBL is redundant)
- Identification and annotation of discrepancy
- Priority (but not limited) to model organisms
- UniProt database; collaboration with NCBI (RefSeq, CCDS)

PIR
PIR: the Protein Information Resource
PIR-PSD is no longer updated, but exists as an archive

PDB
• PDB (Protein Data Bank), 3D structure
• Contains the spatial coordinates of macromolecule atoms whose
3D structure has been obtained by X-ray or NMR studies
• Also contains the corresponding protein sequences
*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools
• Includes protein sequences which are mutated, chimeric etc… (created specifically to study the effect of a mutation on the 3D structure)

PDB: Protein Data Bank
www.rcsb.org/pdb/
• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
• Associated specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
• Currently there are ~68’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!

PDB: example
Sequence
Coordinates of each atom

Visualisation with Jmol

PRF
Protein Research Foundation
Looks for peptide sequences described in publications (and which are not submitted to databases!)
http://www.genome.jp/dbget-bin/www_bfind?prf

Other protein databases

Ensembl
http://www.ensembl.org/
Review: http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610
Annotation pipeline: http://www.genome.org/cgi/content/full/14/5/942

- Ensembl aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)
- Also does gene prediction (-> novel genes)
Ensembl = UniProtKB + RefSeq + gene prediction
- DNA, RNA and protein sequences available for several species.
- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.

Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.

ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.

IPI
http://www.ebi.ac.uk/IPI/IPIhelp.html
IPI: Closure!

Automatic approach that builds clusters by combining knowledge already present in the primary data sources (UniProtKB, RefSeq, Ensembl) with sequence similarity.
IPI = UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR + VEGA).
!!! Complete proteome sets include all alternative splicing sequences…
Available for human, mouse, rat, zebrafish, Arabidopsis, chicken and cow

CCDS
http://www.ncbi.nlm.nih.gov/CCDS/
CCDS (human, mouse)
Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…
Consensus between 4 institutions…

Gene Ontology (GO)

Standards: why are they so important?
• ‘The ever-increasing number of sequencing projects necessitates a standardized system (…) to ensure that the flood of information produced can be effectively utilized.’ (PMID 19577473)
• Standardization of biological data/information (data sharing and computational analysis).
• Aim: extract and compare annotation between different resources or species (semantic similarity).
Secreted or not secreted? (PubMed 19299134)

Gene Ontology (GO)
• The Gene Ontology is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information. In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary. Contains ~30’000 terms.

Gene Ontology (GO) terms
biological process
• broad biological phenomena, e.g. mitosis, growth, digestion
molecular function
• molecular role, e.g.
catalytic activity, binding
cellular component
• subcellular location, e.g. nucleus, ribosome, origin recognition complex

GO terms associated with human Erythropoietin
http://www.geneontology.org

Caveats
• Annotation is the process of assigning/mapping GO terms to gene products…
• Electronic vs manual annotation…

Example with EPO

Histone H4 !!! Large scale derived data (‘proteome’)

GO terms: essential link between biological knowledge and high-throughput genomic and proteomic datasets…
‘summary of the gene ontology classifications for all mapped ESTs…’ PMID: 15514041

~40 % of human proteins have no known function (experimental data)… but many more are associated with GO terms… (computer-assigned).
Human proteins functional distribution
Maybe, Potentially, Putative, Expected, Probably, Hopefully

All documents (including practicals) are online
http://education.expasy.org/cours/Murcia2011/