FTP access to annotation data source exercises
FTP access to annotation data in different file format in distinct data sources
Index




File format
File analysis and navigation between different databanks
Exercises
Solutions
File format
Genomic and proteomic data are stored in several file formats. The main ones are:
- Flat file [GenBank, GO, UniProt, BioCyc, PDB, RefSeq, IPI, Pfam, OMIM, …]
- Tabular [KEGG, GO, Entrez Gene, ChEBI, NetAffx, Ensembl, Jaspar, InterPro, IPI,
UniSTS, BioCyc, Reactome, OMIM, HapMap, ...]
- SQL dump [GO, ChEBI, InterPro, Ensembl, Pfam, Reactome, IPI, …]
- XML [Homologene, InterPro, PDB, UniProt, BioCyc, KEGG, Reactome, GO, ChEBI,
eVOC, …]
- RDF [GO, ChEBI, eVOC, BioCyc, KEGG, Reactome, …]
File repository
Access the FTP sites below, download the data they provide, and examine the available
data and file formats.
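As a starting point for inspecting the files listed below, here is a minimal Python sketch that downloads a file and shows its first lines, decompressing gzip data when present. The function names `fetch` and `preview` are hypothetical helpers, not part of any databank tool, and the URLs given in this document may no longer resolve.

```python
import gzip
import io
import urllib.request


def fetch(url: str) -> bytes:
    """Download a file into memory; urllib handles ftp:// and http:// URLs."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def preview(raw: bytes, n_lines: int = 5) -> list:
    """Return the first n_lines of a file, decompressing gzip data if needed."""
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        stream = gzip.open(io.BytesIO(raw), mode="rt",
                           encoding="utf-8", errors="replace")
    else:
        stream = io.StringIO(raw.decode("utf-8", errors="replace"))
    with stream:
        # zip() stops after n_lines, so the rest of the stream is not read
        return [line.rstrip("\n") for _, line in zip(range(n_lines), stream)]
```

A call such as `preview(fetch("ftp://ftp.ncbi.nih.gov/gene/DATA/mim2gene"))` would show whether a file has a header and which separator it uses. Note that `fetch` loads the whole file into memory, so it is only suitable for the smaller files.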
Entrez Gene: (ftp://ftp.ncbi.nih.gov/gene/DATA/)
gene2accession.gz
[tabular file with fixed column number, single
separator, non data-like header]
gene2go.gz
[tabular file with fixed column number, single
separator, non data-like header]
gene2pubmed.gz
[tabular file with fixed column number, single
separator, non data-like header]
gene2sts
[tabular file with fixed column number, single
separator, non data-like header]
gene2unigene
[tabular file with fixed column number, single
separator, non data-like header]
gene_history.gz
[tabular file with fixed column number, single
separator, non data-like header]
gene_info.gz
[tabular file with fixed column number, single
separator, non data-like header]
gene_refseq_uniprotkb_collab
mim2gene
[tabular file with fixed column number, single
separator, non data-like header]
gene_group
[tabular file with fixed column number, single
separator, non data-like header]
Giorgio Ghisalberti and Marco Masseroli, PhD
Homologene: (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/)
homologene.data
Gene expression ontologies (eVOC): (http://www.evocontology.org/site/Main/EvocData2p9)
Human:
evoc_v2.9_oboc.tar.gz
[OBO 1.0 flat file]
evoc_v2.9_gene2cdna.tar.gz [tabular file with variable column number, multiple
separators, without header]
…
Gene Ontology (GO): (http://archive.geneontology.org/)
full/2002-12-01/
go_200212-seqdb.fasta.gz
[FASTA file]
latest-termdb/
go_daily-termdb.obo-xml.gz
go_daily-termdb-data.gz
[SQL dump file]
Gene Ontology Annotation (GOA): (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/)
external2go/
ec2go
[tabular file with fixed column number, multiple separators,
with non data-like header]
interpro2go
[tabular file with fixed column number, multiple separators,
with non data-like header]
pfam2go
[tabular file with fixed column number, multiple separators,
with non data-like header]
hamap2go
[tabular file with fixed column number, multiple separators,
with non data-like header]
spsl2go
[tabular file with fixed column number, multiple separators,
with non data-like header]
spkw2go
gp2protein/
gp2protein.geneid.gz [tabular file with fixed column number, multiple separators,
with non data-like header]
HUMAN/
gene_association.goa_human.gz
[tabular file with fixed column number,
single separator, without header]
.../
...
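The external2go files above are described as tabular files with multiple separators and a non data-like header. The following sketch shows one way to split such lines, assuming the layout `<DB>:<ID> <name> > GO:<term name> ; GO:<7 digits>` with comment lines starting with `!`; check this layout against the downloaded file before relying on it. The function name and the sample lines are made up for illustration and are not real mappings.

```python
import re

# Assumed line layout: "<DB>:<ID> <name> > GO:<term name> ; GO:<7 digits>"
MAPPING = re.compile(
    r"^(?P<db>[^:\s]+):(?P<id>\S+)\s*(?P<name>.*?)\s*>\s*"
    r"GO:(?P<term>.*?)\s*;\s*(?P<go_id>GO:\d{7})\s*$"
)


def parse_external2go(lines):
    """Yield one dict per mapping line, skipping blank and '!' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # the header is made of comments, not column names
        match = MAPPING.match(line)
        if match:
            yield match.groupdict()


# Placeholder lines (hypothetical IDs and names)
SAMPLE = [
    "!comment line\n",
    "InterPro:IPR000000 Example domain > GO:example term ; GO:0000000\n",
    "EC:0.0.0.0 > GO:example activity ; GO:0000000\n",
]
```

Three different separators appear in one line (a space, ` > `, and ` ; `), which is why a regular expression is used here instead of a plain `split`.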
Kyoto Encyclopaedia of Genes and Genomes (KEGG): (ftp://ftp.genome.jp/pub/kegg/)
brite/ko/
ko00001.keg
pathway/
map_title.tab
[tabular file with fixed column number, single separator,
without header]
map/
cpd_map.tab [tabular file with variable column number, multiple
separators, without header]
genes/organisms/hsa/
hsa_pathway.list
hsa_ncbi-geneid.list
hsa_uniprot.list
hsa_ko.list
[Human KEGG Orthology in tabular file with variable
column number, single separator, without header]
genes/organisms/.../
...
ligand/compound/
compound
[generic flat file]
Reactome: (http://reactome.org/download/current/)
uniprot_2_pathways.stid.txt [tabular file with fixed column number, single separator,
without header]
Universal Protein Resource (UniProt):
(ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/)
uniprot_sprot.xml.gz
[generic XML file]
uniprot_trembl.xml.gz
[generic XML file]
uniprot.xsd
[XML-Schema file]
International Protein Index databank (IPI): (ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/)
Human:
gi2ipi.xrefs.gz
[tabular file with fixed column number, single
separator, with header]
ipi.HUMAN.history.gz
[current and substituted IPIs in tabular file with
fixed column number, single separator, non
data-like header]
ipi.HUMAN.xrefs.gz
ipi.HUMAN.IPC.gz
[InterPro linking in tabular file with fixed column
number, single separator, non data-like header]
...
...
InterPro: (ftp://ftp.ebi.ac.uk/pub/databases/interpro/)
interpro.xml.gz
[generic XML file]
match_complete.xml.gz
[generic XML file]
interpro.dtd
[DTD file]
Online Mendelian Inheritance in Man (OMIM): (ftp://ftp.ncbi.nih.gov/repository/OMIM/)
genemap
[tabular file with fixed column number, single separator, without header]
morbidmap [tabular file with fixed column number, single separator, without header]
omim.txt
[flat file]
ChEBI: (ftp://ftp.ebi.ac.uk/pub/databases/chebi/)
Flat_file_tab_delimited/
compounds.tsv
[tabular file with fixed column number, single separator,
data-like header]
reference.tsv.zip
[tabular file with fixed column number, single separator,
data-like header]
Other example data file formats:
[generic RDF file]
http://www.berkeleybop.org/ontologies/obo-all/adult_mouse_anatomy/adult_mouse_anatomy.rdf
[RDF generic OWL file]
http://www.berkeleybop.org/ontologies/obo-all/adult_mouse_anatomy/adult_mouse_anatomy.owl
[XML SBML Level 1 file]
http://systems-biology.org/001/kegg/SBML.l1v2/rco/rco00010.xml
[XML SBML Level 2 file]
http://systems-biology.org/001/kegg/SBML.l2v1/rco/rco00010.xml
[RDF OWL BioPAX Level 1 file]
http://www.biopax.org/release/biopax-level1.owl
[RDF OWL BioPAX Level 2 file]
http://www.biopax.org/release/biopax-level2.owl
File analysis and navigation between different databanks
1) Download the gene2unigene file of the Entrez Gene databank and answer these
questions:
a. Is the documentation consistent with the data contained in the file?
i. Yes, all the fields are described in the README file.
b. What is its file format type?
i. Tabular file with a fixed number of columns and a non
data-like header.
c. Can the taxonomy of the data be found in this file?
i. Yes. The README file states that “tax_id is not
provided in a separate column. The prefix
of the UniGene cluster can be used to determine
the species”.
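The answer to 1c can be sketched in Python as follows. This is an illustration only: it assumes a tab-separated file whose header line starts with `#`, the prefix table is a small partial lookup, and the sample rows are placeholders, not real records.

```python
import csv
import io

# Partial, illustrative prefix table; extend it for other organisms.
SPECIES_BY_PREFIX = {
    "Hs": "Homo sapiens",
    "Mm": "Mus musculus",
    "Rn": "Rattus norvegicus",
}


def read_gene2unigene(handle):
    """Yield (gene_id, cluster, species) from a tab-separated stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the non data-like header
        gene_id, cluster = row[0], row[1]
        prefix = cluster.split(".", 1)[0]  # "Hs.1" -> "Hs"
        yield gene_id, cluster, SPECIES_BY_PREFIX.get(prefix, "unknown")


# Placeholder rows (hypothetical IDs); the header text is an assumption
SAMPLE = io.StringIO("#GeneID\tUniGene_cluster\n1\tHs.1\n2\tMm.2\n3\tZz.9\n")
```

Iterating over `read_gene2unigene(SAMPLE)` yields one tuple per data row, with `"unknown"` for any prefix missing from the lookup table.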
2) Download the ko00001.keg file of the KEGG databank and answer these questions:
a. What is its file format type?
i. Flat file with hierarchy defined by tags.
b. What is the hierarchy (root → leaf) of “Pancreatic cancer”?
i. 01160 Human Diseases → 01161 Cancers → 05212 Pancreatic
cancer.
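The root-to-leaf lookup in 2b can be sketched as below. The sketch assumes that each hierarchy line starts with a level letter (A for the root level, then B, C, and so on) and may contain markup such as `<b>...</b>`; the sample lines are a simplified stand-in for the real file, and `find_path` is a hypothetical helper name.

```python
import re

TAG = re.compile(r"<[^>]+>")  # strips markup such as <b>...</b>
LEVELS = "ABCDE"


def find_path(lines, leaf_name):
    """Return the root-to-leaf list of entries leading to leaf_name, or None."""
    path = {}  # level letter -> entry currently open at that level
    for line in lines:
        if not line or line[0] not in LEVELS:
            continue  # skip header, separator, and comment lines
        level = line[0]
        text = TAG.sub("", line[1:]).strip()
        if not text:
            continue
        path[level] = text
        # a new entry closes everything that was open below its level
        for deeper in LEVELS[LEVELS.index(level) + 1:]:
            path.pop(deeper, None)
        if leaf_name in text:
            return [path[key] for key in sorted(path)]
    return None


# Simplified sample in the assumed layout
SAMPLE = [
    "A<b>01160 Human Diseases</b>",
    "B  <b>01161 Cancers</b>",
    "C    05210 Colorectal cancer",
    "C    05212 Pancreatic cancer",
]
```

The match is a plain substring test, so a more specific name or the numeric ID gives a safer lookup on the full file.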
3) Download the homologene.data file of the Homologene databank and answer these
questions:
a. What type of data does each field contain?
i. Homologene group ID, taxonomy ID, gene ID, gene symbol,
protein GI, protein accession.
b. From which source are the IDs in the “GeneID” field retrieved?
i. Entrez Gene databank.
c. What is the name of the organism of the protein encoded by the gene with
GeneID: “469356”?
i. Pan troglodytes.
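The lookup in 3c can be sketched as follows, assuming tab-separated rows with the six fields listed in 3a. In the sample row only the gene ID and the taxonomy ID (9598, the NCBI Taxonomy ID of Pan troglodytes) are meaningful; the other values are placeholders, and the names `HomoloGeneRow` and `organism_of_gene` are hypothetical.

```python
from collections import namedtuple

HomoloGeneRow = namedtuple(
    "HomoloGeneRow",
    "group_id tax_id gene_id symbol protein_gi protein_acc",
)

# Small illustrative lookup of NCBI Taxonomy IDs
ORGANISM_BY_TAXID = {
    "9606": "Homo sapiens",
    "9598": "Pan troglodytes",
    "10090": "Mus musculus",
}


def organism_of_gene(lines, gene_id):
    """Return the organism name for gene_id, or None if the gene is absent."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # ignore lines that do not match the assumed layout
        row = HomoloGeneRow(*fields)
        if row.gene_id == gene_id:
            return ORGANISM_BY_TAXID.get(row.tax_id, "taxonomy ID " + row.tax_id)
    return None


# Placeholder row: group ID, symbol, GI and accession are made up
SAMPLE = ["1\t9598\t469356\tGENE_X\t0\tXP_000000.1\n"]
```

The file itself carries only the taxonomy ID, so the organism name has to come from a separate lookup such as the NCBI Taxonomy databank.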
4) Download the go_daily-termdb.obo-xml.gz file of the GO data source and answer these
questions:
a. Inside the term element with GO ID “GO:0000001”, what does the ID “GO:0048308” in
the is_a tag represent? And “GO:0048311”?
i. GO:0048308 and GO:0048311 are the GO terms that
precede GO:0000001 in the ontology (its parent terms). is_a is
the relation that links the terms.
b. What is the name of the pathway related to the ID “GO:0000016”?
i. lactase activity (Reactome).
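The is_a lookup in 4a can be sketched with the Python standard library. The sketch assumes that each term is a `term` element with `id`, `name`, and `is_a` child elements; the sample document below is a cut-down stand-in for the real file, and `parents_of` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET


def parents_of(xml_source, go_id):
    """Return {parent_id: parent_name} for the is_a parents of go_id."""
    names, parents = {}, []
    # iterparse handles one element at a time as the file is read
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "term":
            continue
        term_id = elem.findtext("id")
        names[term_id] = elem.findtext("name")
        if term_id == go_id:
            parents = [child.text for child in elem.findall("is_a")]
        elem.clear()  # drop the children of terms already processed
    # names are resolved at the end, so parents may appear after the child
    return {parent: names.get(parent) for parent in parents}


# Cut-down sample in the assumed layout
SAMPLE = io.BytesIO(b"""<obo>
  <term><id>GO:0048308</id><name>organelle inheritance</name></term>
  <term><id>GO:0048311</id><name>mitochondrion distribution</name></term>
  <term><id>GO:0000001</id><name>mitochondrion inheritance</name>
    <is_a>GO:0048308</is_a><is_a>GO:0048311</is_a></term>
</obo>""")
```

For the downloaded file, the same function could be called with `gzip.open("go_daily-termdb.obo-xml.gz")` as the source, since `iterparse` accepts any binary file object.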
The main Web sites of the biomolecular databanks, listed below, can help you better
understand the available data:
Entrez Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
Universal Protein Resource (UniProt): http://www.pir.uniprot.org/
Kyoto Encyclopaedia of Genes and Genomes (KEGG): http://www.genome.ad.jp/kegg/
InterPro: http://www.ebi.ac.uk/interpro/
Online Mendelian Inheritance in Man (OMIM): http://www.ncbi.nlm.nih.gov/Omim/
Gene Ontology (GO): http://www.geneontology.org/
eVOC: http://www.evocontology.org/
Homologene: http://www.ncbi.nlm.nih.gov/homologene
Gene Ontology Annotation (GOA): http://www.ebi.ac.uk/GOA/
Reactome: http://www.reactome.org/
BioCyc: http://biocyc.org/
International Protein Index (IPI): http://www.ebi.ac.uk/IPI/IPIhelp.html
Protein Data Bank (PDB): http://www.rcsb.org/pdb/
Reference Sequence (RefSeq): http://www.ncbi.nlm.nih.gov/RefSeq/
Ensembl: http://www.ensembl.org
Expert Protein Analysis System (ExPASy): http://www.expasy.ch/enzyme/
Exercises
The exercises below concern the analysis of different file formats of genomic data.
1) Download the hsa_pathway.list, hsa_uniprot.list and hsa_ncbi-geneid.list files of the
KEGG databank and answer these questions:
a. What type of data does each file contain?
b. Which data sources provide the second IDs in the hsa_uniprot.list and
hsa_ncbi-geneid.list files?
c. Which pathway ID is related to the protein ID “Q92837”?
2) Download the interpro2go file of the GOA data source and answer these questions:
a. What is its file format type?
b. Which PubMed IDs are related to the “Peptidase C3, picornavirus
core protein 2A” protein?
c. To which ontology does the GO ID related to the “Peptidase C3,
picornavirus core protein 2A” protein belong?
3) Download the ipi.HUMAN.xrefs file of the IPI data source and answer these questions:
a. What do the entries in the twelfth column represent? Does each entry contain
one value or several?
b. What does the “Q96T58” ID represent? What is its data source?
c. What is the taxonomy ID of the “Q96T58” ID? Is it correct?
d. What is the taxonomy ID of the “IPI00735641” ID? Is it correct?