FTP access to annotation data source exercises
Giorgio Ghisalberti and Marco Masseroli, PhD

FTP access to annotation data in different file formats in distinct data sources

Index
  File format
  File analysis and navigation between different databanks
  Exercises
  Solutions

File format

The genomic and proteomic data are stored in several file formats. The main ones are:
- Flat file [GenBank, GO, UniProt, BioCyc, PDB, RefSeq, IPI, Pfam, OMIM, …]
- Tabular [KEGG, GO, Entrez Gene, ChEBI, NetAffx, Ensembl, Jaspar, InterPro, IPI, UniSTS, BioCyc, Reactome, OMIM, HapMap, ...]
- SQL dump [GO, ChEBI, InterPro, Ensembl, Pfam, Reactome, IPI, …]
- XML [Homologene, InterPro, PDB, UniProt, BioCyc, KEGG, Reactome, GO, ChEBI, eVOC, …]
- RDF [GO, ChEBI, eVOC, BioCyc, KEGG, Reactome, …]

File repository

Access the FTP sites below, download the data they provide and look at the different available data and file formats.

Entrez Gene: (ftp://ftp.ncbi.nih.gov/gene/DATA/)
  gene2accession.gz [tabular file with fixed column number, single separator, non data-like header]
  gene2go.gz [tabular file with fixed column number, single separator, non data-like header]
  gene2pubmed.gz [tabular file with fixed column number, single separator, non data-like header]
  gene2sts [tabular file with fixed column number, single separator, non data-like header]
  gene2unigene [tabular file with fixed column number, single separator, non data-like header]
  gene_history.gz [tabular file with fixed column number, single separator, non data-like header]
  gene_info.gz [tabular file with fixed column number, single separator, non data-like header]
  gene_refseq_uniprotkb_collab [tabular file with fixed column number, single separator, non data-like header]
  mim2gene [tabular file with fixed column number, single separator, non data-like header]
  gene_group [tabular file with fixed column number, single separator, non data-like header]

Homologene: (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/)
  homologene.data

Gene expression ontologies (eVOC):
(http://www.evocontology.org/site/Main/EvocData2p9)
  Human:
    evoc_v2.9_oboc.tar.gz [OBO 1.0 flat file]
    evoc_v2.9_gene2cdna.tar.gz [tabular file with variable column number, multiple separators, without header]
  …

Gene Ontology (GO): (http://archive.geneontology.org/)
  full/2002-12-01/
    go_200212-seqdb.fasta.gz [FASTA file]
  latest-termdb/
    go_daily-termdb.obo-xml.gz
    go_daily-termdb-data.gz [SQL dump file]

Gene Ontology Annotation (GOA): (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/)
  external2go/
    ec2go [tabular file with fixed column number, multiple separators, with non data-like header]
    interpro2go [tabular file with fixed column number, multiple separators, with non data-like header]
    pfam2go [tabular file with fixed column number, multiple separators, with non data-like header]
    hamap2go [tabular file with fixed column number, multiple separators, with non data-like header]
    spsl2go [tabular file with fixed column number, multiple separators, with non data-like header]
    spkw2go [tabular file with fixed column number, multiple separators, with non data-like header]
  gp2protein/
    gp2protein.geneid.gz [tabular file with fixed column number, multiple separators, with non data-like header]
  HUMAN/
    gene_association.goa_human.gz [tabular file with fixed column number, single separator, without header]
  .../
    ...

Kyoto Encyclopedia of Genes and Genomes (KEGG): (ftp://ftp.genome.jp/pub/kegg/)
  brite/ko/
    ko00001.keg
  pathway/
    map_title.tab [tabular file with fixed column number, single separator, without header]
    map/
      cpd_map.tab [tabular file with variable column number, multiple separators, without header]
  genes/organisms/hsa/
    hsa_pathway.list
    hsa_ncbi-geneid.list
    hsa_uniprot.list
    hsa_ko.list [Human KEGG Orthology in tabular file with variable column number, single separator, without header]
  genes/organisms/.../
    ...
  ligand/compound/
    compound [generic flat file]

Reactome: (http://reactome.org/download/current/)
  uniprot_2_pathways.stid.txt [tabular file with fixed column number, single separator, without header]

Universal Protein Resource (UniProt): (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/)
  uniprot_sprot.xml.gz [generic XML file]
  uniprot_trembl.xml.gz [generic XML file]
  uniprot.xsd [XML-Schema file]

International Protein Index databank (IPI): (ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/)
  Human:
    gi2ipi.xrefs.gz [tabular file with fixed column number, single separator, with header]
    ipi.HUMAN.history.gz [current and substituted IPIs in tabular file with fixed column number, single separator, non data-like header]
    ipi.HUMAN.xrefs.gz
    ipi.HUMAN.IPC.gz [InterPro linking in tabular file with fixed column number, single separator, non data-like header]
    ...
  ...

InterPro: (ftp://ftp.ebi.ac.uk/pub/databases/interpro/)
  interpro.xml.gz [generic XML file]
  match_complete.xml.gz [generic XML file]
  interpro.dtd [DTD file]

Online Mendelian Inheritance in Man (OMIM): (ftp://ftp.ncbi.nih.gov/repository/OMIM/)
  genemap [tabular file with fixed column number, single separator, without header]
  morbidmap [tabular file with fixed column number, single separator, without header]
  omim.txt [flat file]

ChEBI: (ftp://ftp.ebi.ac.uk/pub/databases/chebi/)
  Flat_file_tab_delimited/
    compounds.tsv [tabular file with fixed column number, single separator, data-like header]
    reference.tsv.zip [tabular file with fixed column number, single separator, data-like header]

Other example data file formats:
[generic RDF file]
  http://www.berkeleybop.org/ontologies/obo-all/adult_mouse_anatomy/adult_mouse_anatomy.rdf
[generic RDF OWL file]
  http://www.berkeleybop.org/ontologies/obo-all/adult_mouse_anatomy/adult_mouse_anatomy.owl
[XML SBML Level 1 file]
  http://systems-biology.org/001/kegg/SBML.l1v2/rco/rco00010.xml
[XML SBML Level 2 file]
  http://systems-biology.org/001/kegg/SBML.l2v1/rco/rco00010.xml
[RDF OWL BioPAX Level 1 file]
  http://www.biopax.org/release/biopax-level1.owl
[RDF OWL BioPAX Level 2 file]
  http://www.biopax.org/release/biopax-level2.owl

File analysis and navigation between different databanks

1) Download the gene2unigene file of the Entrez Gene databank and answer these questions:
   a. Is the documentation consistent with the data contained in the file?
      i. Yes, all the fields are described in the README file.
   b. What is its file format type?
      i. Tabular file format, fixed number of columns, non data-like header.
   c. Can the taxonomy of the data be found in this file?
      i. Yes. The README file states that “tax_id is not provided in a separate column. The prefix of the UniGene cluster can be used to determine the species”.

2) Download the ko00001.keg file of the KEGG databank and answer these questions:
   a. What is its file format type?
      i. Flat file with hierarchy defined by tags.
   b. What is the hierarchy (root -> leaf) of “Pancreatic cancer”?
      i. 01160 Human Diseases -> 01161 Cancers -> 05212 Pancreatic cancer.

3) Download the homologene.data file of the Homologene databank and answer these questions:
   a. Which type of data does each field contain?
      i. Homologene group ID, taxonomy ID, gene ID, gene symbol, protein GI, protein accession.
   b. From which source are the IDs in the “GeneID” field retrieved?
      i. Entrez Gene databank.
   c. What is the name of the organism of the protein encoded by the gene with GeneID “469356”?
      i. Pan troglodytes.

4) Download the go_daily-termdb.obo-xml.gz file of the GO data source and answer these questions:
   a. Inside the term tags of the “GO:0000001” GO ID, what do the “GO:0048308” and “GO:0048311” IDs in the is_a tags represent?
      i.
         GO:0048308 and GO:0048311 are the GO terms that precede GO:0000001 in the ontology, i.e. its parent terms. is_a is the relation that links each term to its parent.
   b. What is the name of the pathway related to the “GO:0000016” ID?
      i. lactase activity (Reactome).

The main Web sites of the biomolecular databanks, which you can use to better understand the available data, are listed below:
  Entrez Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
  Universal Protein Resource (UniProt): http://www.pir.uniprot.org/
  Kyoto Encyclopedia of Genes and Genomes (KEGG): http://www.genome.ad.jp/kegg/
  InterPro: http://www.ebi.ac.uk/interpro/
  Online Mendelian Inheritance in Man (OMIM): http://www.ncbi.nlm.nih.gov/Omim/
  Gene Ontology (GO): http://www.geneontology.org/
  eVOC: http://www.evocontology.org/
  Homologene: http://www.ncbi.nlm.nih.gov/homologene
  Gene Ontology Annotation (GOA): http://www.ebi.ac.uk/GOA/
  Reactome: http://www.reactome.org/
  BioCyc: http://biocyc.org/
  International Protein Index (IPI): http://www.ebi.ac.uk/IPI/IPIhelp.html
  Protein Data Bank (PDB): http://www.rcsb.org/pdb/
  Reference Sequence (RefSeq): http://www.ncbi.nlm.nih.gov/RefSeq/
  Ensembl: http://www.ensembl.org
  Expert Protein Analysis System (ExPASy): http://www.expasy.ch/enzyme/

Exercises

Below are some exercises on the analysis of different file formats of genomic data.

1) Download the hsa_pathway.list, hsa_uniprot.list and hsa_ncbi-geneid.list files of the KEGG databank and answer these questions:
   a. Which type of data does each file contain?
   b. Which data sources provide the second IDs in the hsa_uniprot.list and hsa_ncbi-geneid.list files?
   c. Which pathway ID is related to the “Q92837” protein ID?

2) Download the interpro2go file of the GOA data source and answer these questions:
   a. What is its file format type?
   b.
      Which PubMed IDs are related to the “Peptidase C3, picornavirus core protein 2A” protein?
   c. Which ontology does the GO ID related to the “Peptidase C3, picornavirus core protein 2A” protein belong to?

3) Download the ipi.HUMAN.xrefs file of the IPI data source and answer these questions:
   a. What do the entries in the twelfth column represent? Does each entry contain one value or several?
   b. What does the “Q96T58” ID represent? Which is its data source?
   c. What is the taxonomy ID of the “Q96T58” ID? Is it correct?
   d. What is the taxonomy ID of the “IPI00735641” ID? Is it correct?
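
Many of the files used in these exercises are tabular files with a single separator and a non data-like header. The following Python code is a minimal sketch for inspecting such a file once it has been downloaded. It is not part of the original exercises: the helper names (open_text, read_tabular, column_counts) are hypothetical, the tab separator and the "#" header prefix are assumptions to be checked against each file's README, and the sample rows are made up for illustration only.

```python
import csv
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def read_tabular(lines, separator="\t", header_prefix="#"):
    """Split the lines of a tabular file into (header, rows).

    Lines starting with header_prefix are treated as a non data-like
    header; the last such line supplies the column names. The prefix
    and the separator are assumptions, not properties of every file.
    """
    header = None
    data_lines = []
    for line in lines:
        if header_prefix and line.startswith(header_prefix):
            text = line[len(header_prefix):].rstrip("\r\n")
            header = text.split(separator)
        elif line.strip():  # skip blank lines
            data_lines.append(line)
    reader = csv.reader(data_lines, delimiter=separator, quoting=csv.QUOTE_NONE)
    return header, list(reader)


def column_counts(rows):
    """Return the distinct row lengths; one value means a fixed column number."""
    return sorted({len(row) for row in rows})


# Made-up sample in the assumed layout (not real databank content).
sample = io.StringIO("#GeneID\tUniGene_cluster\n1\tXx.100\n2\tXx.200\n")
header, rows = read_tabular(sample)
print(header, column_counts(rows))
```

For a downloaded file, the same function can be applied as read_tabular(open_text("gene2unigene")). A file "without header" can be read by passing header_prefix="", and a file with a variable column number shows up as more than one value in the result of column_counts.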