Corrections - The cacao genome is currently being sequenced http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi - Human Chromosome 1 sequence Search ‘Genome’ with ‘homo sapiens chromosome 1’ RefSeq has an entry for each human chromosome (genome reference): these entries are numbered NC_000001 to NC_000024 Mapviewer view: http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=1 Entrez nucleotide view: http://www.ncbi.nlm.nih.gov/nuccore/224589800 - Measles virus complete genome (RefSeq) http://www.ncbi.nlm.nih.gov/nuccore/NC_001498 - Nucleic acid sequences available for human erythropoietin (EPO) (GenBank and RefSeq) http://www.ncbi.nlm.nih.gov/nucleotide/?term=homo+sapiens+erythropoietin+EPO mRNA coding for human EPO (in RefSeq) http://www.ncbi.nlm.nih.gov/nuccore/NM_000799.2 Look for Escherichia coli strain K-12 substrain W3110 complete genome (Query @ NCBI 'Genome') Query ‘Genome’ with ‘Escherichia coli strain K-12 substrain W3110’ http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19221 CU466930.1 • http://www.ncbi.nlm.nih.gov/nuccore/167729007 Corresponding proteins in UniProtKB (query with CU466930): http://www.uniprot.org/uniprot/?query=CU466930&sort=score Corresponding proteins in NCBInr (query ‘nucleotide’ with CU466930 and follow the link to ‘protein’ http://www.ncbi.nlm.nih.gov/nuccore?Db=protein&DbFrom=nuccore&Cmd=Link&LinkNam e=nuccore_protein&IdsFromResult=167729007 AAFZ00000000.1 • http://www.ncbi.nlm.nih.gov/nuccore/60175893 No protein yet : no annotated CDS available (or not yet submitted by the authors)! EPO protein in different databases • UniProtKB/Swiss-Prot (reviewed) http://www.uniprot.org/uniprot/P01588 • UniProtKB/TrEMBL (unreviewed) http://www.uniprot.org/uniprot/B7ZKK5 • UniParc (follow the link from the UniProtKB sequence section) http://www.uniprot.org/uniparc/UPI0000033477 • NCBInr (see next slide) http://www.ncbi.nlm.nih.gov/protein/?term=homo+sapiens+erythropoietin+EPO • RefSeq (see next slide) http://www.ncbi.nlm.nih.gov/protein/NP_000790.2? • Ensembl (follow the link from the UniProt entry) http://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?g=ENSG00000130427;r=7:100318423-100321323;t=ENST00000252723 http://www.uniprot.org/uniprot/P04150 - Which server? UniProt - Which database? UniProtKB/Swiss-Prot - What is the function of the protein? Receptor for glucocorticoids (GC). - How many different post-translational modifications (PTMs) ? Phosphorylated, Sumoylated, Ubiquitinated - How many phosphorylation sites? 10 sites - What are the associated GO terms? http://www.uniprot.org/uniprot/P04150#section_terms - What is the evidence for the existence of the protein(s)? http://www.uniprot.org/uniprot/P04150#section_attribute: at protein level - What is the corresponding mRNA? X03225 for example - How many different protein sequences are available for the corresponding gene? http://www.uniprot.org/uniprot/P04150#section_alternative: 9 isoforms http://www.ncbi.nlm.nih.gov/protein/NP_000167.1? - Which server? NCBI - Which database? RefSeq - What is the function of the protein? receptor for glucocorticoids - How many different post-translational modifications (PTMs) ? Phosphorylated, Sumoylated, Ubiquitinated - How many phosphorylation sites? 14 sites - What are the associated GO terms? Not directly available - What is the evidence for the existence of the protein(s)? This information is not available - What is the corresponding mRNA? ‘The reference sequence was derived from AC091925.3, X03225.1 and AC004782.1. ‘ - How many different protein sequences are available for the corresponding gene? This information is not directly available; go to Entrez Gene. http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=2908 Find the UniProt entry corresponding to: - RefSeq NP_036231 - GI:584682 - Find the GI numbers corresponding to UniProtKB P04150 - Why are there so many GI numbers ? Find the UniProt entry corresponding to: - GI:584682 - Find the GI numbers corresponding to UniProtKB P04150 - Why are there so many GI numbers ? Find the GI numbers corresponding to UniProtKB P04150 Why are there so many GI numbers ? - Because there are several protein sequences corresponding to this gene due to alternative splicing events etc.; sequences have also been updated/modified; - The merging policy at NCBI (one sequence one entry) is not same as the one at UniProt (one entry one gene one species) Look for the URL corresponding to the following queries: - Mouse proteins localized in the nucleus Query: taxonomy:"Mus musculus [10090]" AND annotation:(type:location nucleus) URL:http://www.uniprot.org/uniprot/?query=taxonomy%3A%22Mus+musculus+[10090]%22+AND+annotation%3A%28type%3Alocation+nucleus%29&sort=score - Proteins for which a 3D structure is known (Hint: they have a cross-reference to PDB) Query: database:(type:pdb) URL: http://www.uniprot.org/uniprot/?query=database%3A%28type%3Apdb%29&sort=score - What is the query corresponding to this URL: http://www.uniprot.org/uniprot/?query=taxonomy%3A9606+AND+keyword%3A"complete+proteome“ Query: taxonomy:9606 AND keyword:"complete proteome“ -> human complete proteome - Modify the URL to get the same result, but for Escherichia coli (strain K12) Query: taxonomy:83333 AND keyword:"complete proteome“ URL: http://www.uniprot.org/uniprot/?query=taxonomy%3A83333+AND+keyword%3A"complete+proteome“ - Yeast (Saccharomyces cerevisiae) proteins found in the nucleus in UniProtKB/Swiss-Prot. Query: organism:4932 AND annotation:(type:location AND nucleus) AND reviewed:yes - How many of them have a nuclear localization which is 'experimentally proven' ? Query: organism:"Saccharomyces cerevisiae [4932]" AND annotation:(type:location nucleus confidence:experimental) AND reviewed:yes 1655 and 1392 entries respectively (query done in January, the 19th) - Download the list of corresponding accession numbers / protein names / gene names (use 'customize display'). - The set might not be complete: not all proteins have been tested to be localized in the nucleus ! And UniProtKB might not have annotated all the experiments showing that the yeast proteins are nuclear. P00001 @ UniProtKB P00001 @ NCBInr An 'old' publication cites a protein sequence with accession number (AC) o00597: Could you find it ? You did a proteomics analysis in December 2007 without any match. You repeat the analysis in April 2008 and get entry with AC P0C6S9 as the best match. Why ? P0C6S9 was first created in April 2008 in UniProtKB Compare the GO terms associated with mouse and human erythropoeitin (EPO) Have a look to the different GO evidence tags How many GO terms have been 'inferred by direct assay (IDA)' to the human EPO gene ? hierarchy of the GO term 'apoptosis' These GO terms are associated with insulin ! Several proteins have been identified in a proteomic experiment. Which GO terms do they share? (GI numbers of the identified proteins: 16130093, 20664033, 1789812, 89110178, 85677033, 27574045, 89111003, 229597766). GO terms in common… Searching databases with Blast The ‘alternative’ sequence(s) not ‘directly available’ for a lot of tools, including protein identification tools, Blast, depending on the server !…. Murcia, February, 2011 Protein Sequence Databases Blast P04150 against Swiss-Prot / homo sapiens @ UniProt Isoform sequences Murcia, February, 2011 Protein Sequence Databases Blast P04150 against Swiss-Prot / homo sapiens @ NCBI The isoform sequences (from Swiss-Prot) are not present in the NCBI protein database ! The .x number (P06401.4) correspond to the version number of the sequence…not to an alternatively spliced sequence ! Murcia, February, 2011 Protein Sequence Databases Blast A tool associated with the standard options to search sequences in UniProt databases The metal binding (Fe) site is conserved between HBB human and pea leghemoglobin! Detailed BLAST results Murcia, February, 2011 Protein Sequence Databases InterPro : other shema (Graphical view from UniProtKB) InterPro shema PFAM Graphical view Prosite Graphical view Not kepted in the InterPro overview ! Do a Blast with the sequence of the domain 'C-type lectin' of protein P28175 against UniProtKB/Swiss-Prot. Discuss the overview of your results. Look at the other domains present in the most similar sequences. Blast p28175 (complete seq) @ Swiss-Prot UniProt: Color code for identity scores (not alignment !) UniProt: Color code for identity scores (not alignment !) Blast p28175 (complete seq) @ NCBI against Swiss-Prot NCBI: Color key for alignment scores NCBI Swiss-Prot does not contain the alternative sequences (i.e. P28175-2) – !! NCBI gives the ‘version number’ of the Swiss-Prot sequence (i.e. Q8BU25.2)…. N-linked glycosylation (GlcNac): Look at the Swiss-Prot annotation (in a random ‘glycosylated’ entry) Query UniProtKB Taxonomic distribution TPNLINDTME Multiple alignment (ClustalW) -[LAPIQ]-N-[HAYRCS]-[ST]-[KLESGM] Logo N-glycosylation does not occur in Bacteria: …false positive ! 28 protein (within the set of 1000 proteins) are glycosylated according to the UniProtKB annotation…! Not easy to find that there is never a P after the N glycvosylation site (needs a lot of sequences….) ! C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H Pattern scan The pattern missed a second Zn finger in the same protein i.e. Q24174 Pattern Profile The pattern: C - X(2,4) - C - X(3) - [LIVMFYWC] - X(8) - H - X(3,5) – H Should includes: YRCVLCGTVAKSRNSLHSHMSrQHRGIST C-X(2,4)-C-X(3)-[LIVMFYWCA]-X(8)-H-X(3,5)-H Yes ! But: The pattern becomes less restrictive. But you get more sequences which should not be here. As the results are limited to 1000, the number of hits is not the same… Discriminators (Signatures, descriptors) for the Zinc finger C2H2 type domain can be found in Prosite (Pattern and Profile) and Pfam (HMM) Doublecortin Kinase The doublecortin domain is associated with many different domains (not only kinase) Seq 1 Seq 2 Patient with cardiovascular disease Lost of the protein kinase ATP binding site ! Step 1: scan UniProtKB/Swiss-Prot with the pattern Use the ‘scanprosite’ tool at http://www.expasy.org/tools/scanprosite/ At the bottom of the Scan prosite result page: Step 2: Retrieve the 103 human entries @ UniProt (go at the bottom of the Scan Prosite result page; Matched UniProtKB entries) Step 3: Retrieve the sequences annotated as being ‘phosphorylated on a Thr’ -> 19 candidates to be manually checked …. The end