Exploiting scientific data in the domain of ‘omics

'Genomic Standards Consortium
Ontology requirements and experiences'
Dawn Field
Oxford Centre for Ecology and
Hydrology
Overview
Goal of this Workshop: to explore what has been achieved to date with
RDF, metadata and ontologies in exploiting scientific data, particularly for data integration, discovery and sharing
•what we have achieved
•the challenges we face
•what we hope to achieve in the near future
•what are the major issues requiring further research
Challenges and Opportunities
• Rapidly growing collection of genomes
• Increasing need for researchers to access,
combine and analyze data sets containing
genomic, taxonomic, ecological and
environmental data
• Increasing number of initiatives capturing
metadata
• Additional information about complete
genome sequences would be beneficial
De novo DNA sequencing
Continues to grow exponentially
SymBio Corporation
Data scope of genome resources at NCBI
Organisms
Environmental samples?
Nematoda
C.elegans, C.briggsae
Viruses
Microbes
Insects
Fungi/small eukaryotes
A.thaliana
Barley
Corn
Oat
Rice
Soybean
Tomato
Rice
Wheat
Fishes
D. melanogaster, A. gambiae,
D. pseudoobscura, Honey bee
Plants
Chicken
Dog
Mouse/Rat
pig, cow
chimpanzee
Human
The Promise of
Metagenomics
Features of GBMF marine microbial genome sequencing project webpage
• Acts as portal to primary investigator webpage
• Provides basic information about the organism
1) Phylogeny of organism
2) Physiology, if known
3) Habitat
4) Geographic location
5) Isolation technique
6) Primary citation
7) Culture collection
www.moore.org/microgenome
Problems
• Data integration
• Insufficient data regarding the physiology of organisms
Morphology and Growth
• Haemophilus influenzae is a nonmotile, Gram-negative, rod-shaped
bacterium. Optimal growth
temperature is 37 °C and
doubling time in culture is 26
minutes.
Interactions and Ecology
• H. influenzae is an obligate commensal
with the ability to cause disease,
including meningitis and otitis media.
The primary habitat of this species
is the human nasopharynx. This
bacterium is facultatively anaerobic and
uses organic matter as its source of
both carbon and energy.
What we have achieved
Cataloguing our Complete
Genome Collection
• Proposal: Field D, & Hughes J (2005). Cataloguing our current
genome collection. Microbiology 151: 1016-1019
• Analysis: Hughes J & Field D (2005) “Ecological perspectives on
our complete genome collection”. Ecology Letters 8: 1334-1345
• Workshop: “Cataloguing our current genome collection” Sept 7-9,
2005 Cambridge, UK NIEeS ; D. Field, G. Garrity, N. Morrison, J.
Selengut, P. Sterk, N. Thomson, T. Tatusova. Meeting report. Comp.
Func. Genomics.
• Genomic Standards Consortium (GSC):
http://gensc.sourceforge.net
• Funding: “Cataloguing our current genome collection” (NERC
International Opportunities Fund Award: NE/3521773/1)
Cataloguing our Complete
Genome Collection
• Workshop: “Cataloguing our current genome collection II” Nov 10-11, 2005, EBI, Cambridge, UK; D. Field, N. Morrison, J. Selengut, P.
Sterk. Meeting report. OMICS (in press)
• Special issue of OMICS on data standards: guest editors Dawn
Field and Susanna Sansone; organized around first two GSC
workshops
• Funding: “Cataloguing our current genome collection”, funded by
NIEeS for two more workshops in June 2006 and 2007
• Workshop: 3rd GSC workshop Sept 11-13, 2006 NIEeS,
Cambridge UK. Co-organizers Dawn Field and Tatiana Tatusova
• Genome Catalogue: Launch of implementation of MIGS checklist
as a database ready to accept case study genomes
Overview of GSC activities
The aim of the Genomic Standards Consortium (GSC) is to support the
community-based development of a genomic standard that captures a
richer set of information about complete genomes and metagenomic
datasets.
• Checklist
• Implementation
• Ontology development
• Metadata exchange
Overview of GSC activities
• Checklist: The GSC is currently working
together towards the "Minimal Information about
a Genome Sequence" (MIGS) specification.
• Implementation: To promote discussion and
support the capture of preliminary data an XML
schema has been built from the checklist and
implemented as the Genome Catalogue
database.
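As a rough illustration of how a checklist might be turned into an XML record, the sketch below checks a record against a list of required fields and then serializes it. The field names and the `genome_report` element are assumptions made for this example (loosely based on the organism details listed earlier in the talk); they are not the actual MIGS schema or the Genome Catalogue format.

```python
import xml.etree.ElementTree as ET

# Hypothetical required fields; the real MIGS checklist defines its own.
REQUIRED_FIELDS = [
    "organism_name",
    "habitat",
    "geographic_location",
    "isolation_technique",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def to_xml(record):
    """Serialize a flat record to a simple XML string (illustrative layout)."""
    root = ET.Element("genome_report")
    for name, value in record.items():
        child = ET.SubElement(root, name)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


record = {"organism_name": "Haemophilus influenzae", "habitat": "human nasopharynx"}
print(missing_fields(record))
print(to_xml(record))
```

Keeping the required-field list separate from the serializer means the checklist can change without touching the code that writes the XML.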
Overview of GSC activities
• Ontology development: The GSC is also
working towards the development of
controlled vocabularies for describing
genomes and this work feeds into the
FuGO project (A Functional Genomics
Investigation Ontology).
• Metadata exchange: GFF3 and GnoME
The challenges we face
Challenges
• Defining the standard
• Collecting the data
• Fields can be calculated
in a variety of ways;
separate curated and
calculated fields
• We don’t know enough
about many of these
genomes with respect to
‘lifestyle’
• Relationships between
genomes
• Completeness of data
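One way to keep curated and calculated fields apart is sketched below: curated values are stored as entered, while calculated values are always derived from the sequence on demand. The class name, field names and the choice of GC fraction as the calculated example are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class GenomeEntry:
    """Illustrative record that separates curated from calculated fields."""

    sequence: str
    # Curated fields are supplied by a person, e.g. habitat or physiology.
    curated: dict = field(default_factory=dict)

    def calculated(self):
        """Derive fields from the sequence so they never go stale."""
        seq = self.sequence.upper()
        gc_count = seq.count("G") + seq.count("C")
        return {
            "length": len(seq),
            "gc_fraction": gc_count / len(seq) if seq else 0.0,
        }


entry = GenomeEntry("ATGC", curated={"habitat": "human nasopharynx"})
print(entry.curated)
print(entry.calculated())
```

Because the calculated values are recomputed rather than stored, two databases using different calculation methods can still share the same curated record.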
Defining the Checklist
Taxonomic Groups: Eukaryotes, Bacteria/Archaea, Plasmids, Viruses, Organelles, Metagenomes
Concepts: Organism, Phenotype, Environment, Sample Processing, Data Processing
Implementation Working Group
Metadata Exchange Working Group
Proliferation of MI Checklists
• Upcoming special issue of OMICS: a
journal of integrative biology on data
standards includes descriptions of 7
checklists
• Upcoming issue of Nature Biotechnology
expected to include more
Proteomics Standards Initiative (June 2006)
Special session: The proliferation of “MI” checklists:
opportunities and challenges
Chris Taylor (EBI) Minimum Information About a Proteomics Experiment (MIAPE)
and “MIxxx and the need for a central registry”
Dawn Field (CEH Oxford) Minimal Information about a Genome Sequence
(MIGS)
Don Robertson (Pfizer Global R&D, Ann Arbor MI) MSI -- Metabolomics
Standards Initiative.
Graeme Grimes (Scottish Centre for Genomic Technology and Information,
Edinburgh, UK) Minimum Information About an RNAi Experiment (MIARE)
Stefan Wiemann (DKFZ, Heidelberg, Germany) Minimum Information About a
Cellular Assay (MIACA)
Ryan Brinkman (UBC, Canada) presented by Chris Taylor (EBI) Minimum
Information for a Fluorescence Activated Cell Experiment (MIFACE)
MICheck: A Minimum Information Checklist
Portal
Chris Taylor, Dawn Field, Susanna-Assunta
Sansone, Rolf Apweiler, Michael Ashburner,
Cathy Ball, Pierre-Alain Binz, Alvis Brazma,
Ryan Brinkman, Eric Deutsch, Oliver Fiehn,
Jennifer Fostel, Peter Ghazal, Graeme Grimes,
Nigel Hardy, Henning Hermjakob, Randall
Julian, Martin Kuiper, Nicholas Le Novère, Jim
Leebens-Mack, Suzi Lewis, Ruth McNally,
Norman Morrison, Norman Paton, John
Quackenbush, Donald Robertson, Philippe
Rocca-Serra, Barry Smith, Jason Snape, Stefan
Wiemann
micheck.sourceforge.net
The MICheck website will provide
• a comprehensive list of MI checklists
• ‘convenience’ links to relevant resources; appropriate
tools, data formats, ontologies
• links to relevant policy statements from various external
bodies (such as funders’ data sharing policies, journals’
publication guidelines and so forth).
• contact(s) for submitting feedback
• where possible, most recent versions of checklists
(either as a local copy or a link)
• charter for the group
• guidelines for registering a checklist
• sign-up details for the mailing list.
micheck.sourceforge.net
The MICheck website will provide
• Minimal Information about a Minimal Information
Checklist (MIMI)
• Searchable database of terms from all checklists
We propose that MICheck play
two primary roles:
• The first is to provide a ‘one-stop shop’ for
researchers, journal editors and reviewers, and
funders; providing a quick and simple way to
discover (whether there are) guidelines for a
particular domain.
• The second is to facilitate investigation of the
boundaries, overlaps and gaps between
projects, minimally by raising awareness of the
scope and progress of extant efforts.
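The second role, finding overlaps and gaps between checklists, can be sketched as a simple term index. The checklist contents below are placeholders chosen for illustration, and the normalization (lower-casing and trimming) is an assumption; a real comparison would need curated mappings between terms.

```python
from collections import defaultdict


def term_index(checklists):
    """Map each normalized term to the set of checklists that use it."""
    index = defaultdict(set)
    for checklist_name, terms in checklists.items():
        for term in terms:
            index[term.strip().lower()].add(checklist_name)
    return index


def shared_terms(checklists):
    """Terms that appear in more than one checklist (candidate overlaps)."""
    return {
        term: sorted(names)
        for term, names in term_index(checklists).items()
        if len(names) > 1
    }


# Placeholder term lists, not the real checklist contents.
example = {
    "MIGS": ["Organism", "Environment", "Sample Processing"],
    "MIAPE": ["organism", "Instrument"],
}
print(shared_terms(example))
```

The same index also supports a searchable database of terms: looking up a term returns every checklist that mentions it.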
These two roles translate into two
distinct parts of MICheck
• Portal: exists simply to raise awareness of, and
afford simple access to a wide range of
checklists; registering for the portal implies no
commitment to integrate by the registrant.
• Foundry: communities can, if motivated, sign up
to the foundry to jointly examine ways to refactor
the checklists over which they have control and
begin to produce the first components of a suite
of self-consistent, clearly bounded, orthogonal,
integrable checklist modules.
Registering a project
Domain: Genomics and metagenomics
Checklist type: Primary guidelines
Community Name: The Genomic Standards Consortium
Main website: http://gensc.sourceforge.net/
MI Checklist Name: Minimal Information about a Genome Sequence
MI Checklist Acronym: MIGS
Current Version Number: 0.1
Release Date for current version: 2006-01-01
Primary Contact Person: Dr Jane Doe
Comments: Early draft based on first two exploratory workshops; public
distribution for comment
Key concepts: eukaryotes, bacteria/archaea, plasmids, organelles, viruses,
metagenomes, organism, phenotype,
environment, sample processing, data processing
Bibliography: Publications to be deposited where possible
Location of document(s):
http://sourceforge.net/project/showfiles.php?group_id=153365
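A registration like the one above could be captured as a small typed record. This is a minimal sketch: the class, the dictionary keys and the set of required keys are assumptions for illustration, not the MICheck registration format.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical set of fields a registration must supply.
REQUIRED_KEYS = ["domain", "community_name", "checklist_name", "acronym",
                 "version", "release_date", "key_concepts"]


@dataclass
class ChecklistRegistration:
    domain: str
    community_name: str
    checklist_name: str
    acronym: str
    version: str
    release_date: date
    key_concepts: list


def parse_registration(fields):
    """Validate a dict of form fields and build a registration record."""
    missing = [key for key in REQUIRED_KEYS if not fields.get(key)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return ChecklistRegistration(
        domain=fields["domain"],
        community_name=fields["community_name"],
        checklist_name=fields["checklist_name"],
        acronym=fields["acronym"],
        version=fields["version"],
        # Dates are expected in ISO form, e.g. 2006-01-01.
        release_date=date.fromisoformat(fields["release_date"]),
        key_concepts=[c.strip() for c in fields["key_concepts"].split(",")],
    )


form = {
    "domain": "Genomics and metagenomics",
    "community_name": "The Genomic Standards Consortium",
    "checklist_name": "Minimal Information about a Genome Sequence",
    "acronym": "MIGS",
    "version": "0.1",
    "release_date": "2006-01-01",
    "key_concepts": "organism, phenotype, environment",
}
print(parse_registration(form))
```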
Proteomics: three main efforts
• The Minimum Information About a Proteomics Experiment (MIAPE): HUPO Proteomics Standards Initiative
• The ‘Paris Guidelines’: sponsored by MCP
• Guidelines for the Next Ten Years of Proteomics: published by Proteomics
[Figure: Venn diagram of overlap between the MCP, PSI and Proteomics guidelines]
              MIAPE          MCP (‘Paris’)    Proteomics
Requirement   Facts          Facts + Q.A.     Facts + Q.A.
Breadth       Complete       MS / MSI (+)     Complete
Depth         Significant    Significant      Moderate
Drafting      Committees     Committee        Committee
Revision      PSI Meetings   Ad hoc           Not specified
Integrative Activities
Project [URL]: Products
• RSBI [www.mged.org/Workgroups/rsbi]: Cross-domain analysis of project structures; development of well-characterized generic concepts to facilitate integrative activities
• FuGE [fuge.sf.net]: Object model (and markup language) to support the description of diverse experiments and development of new formats
• FuGO [fugo.sf.net]: Ontology providing descriptors for a wide range of experimental workflows, equipment and data types
• OBO Foundry [obofoundry.org]: Collaborative management of orthogonal (i.e. non-overlapping) ontologies covering diverse domains
Defining the Checklist
Investigation
Taxonomic Groups: Eukaryotes, Bacteria/Archaea, Plasmids, Viruses, Organelles, Metagenomes
Concepts: Organism, Phenotype, Environment, Sample Processing, Data Processing
‘Study’
‘Assay’
Implementation Working Group
Metadata Exchange Working Group
What we hope to achieve in the
near future
FuGO
An Ontology for
Functional Genomics Investigation
Susanna-Assunta Sansone (EBI): Overview
Trish Whetzel (Un of Penn): Microarray
Daniel Schober (EBI): Metabolomics
Chris Taylor (EBI): Proteomics
On behalf of the FuGO working group
http://fugo.sourceforge.net
FuGO - Rationale
• Standardization activities in (single) domains
  • Reporting structures, CVs/ontology and exchange formats
• Pieces of a puzzle
  • Standards should stand alone BUT also function together
    - Build it in a modular way, maximizing interactions
• Capitalize on synergies, where commonality exists
• Develop a common terminology for those parts of an investigation
  that are common across technological and biological domains
[Figure: components of an investigation: Design; Source and Characteristics; Sample Preparation; Treatments; Instrumental Analysis (MS, NMR, array, etc.); Collection; Data Pre-Processing; Computational Analysis]
FuGO - Overview
• Purpose
  • NOT to model biology, NOR the laboratory workflow
  • BUT to provide a core of ‘universal’ descriptors for its components
    - To be ‘extended’ by biological and technological domain-specific WGs
  • No dependency on any Object Model
    - Can be mapped to any object model, e.g. FuGE OM
• Open source approach
  • Protégé tool and Web Ontology Language (OWL)
FuGO – Communities and Funds
• List of current communities
  • Omics technologies
    - HUPO - Proteomics Standards Initiative (PSI)
    - Microarray Gene Expression Data (MGED) Society
    - Metabolomics Society - Metabolomics Standards Initiative (MSI)
  • Other technologies
    - Flow cytometry
    - Polymorphism
  • Specific domains of application
    - Environmental groups (crop science and environmental genomics)
    - Nutrition group
    - Toxicology group
    - Immunology groups
• List of current funds
  • NIH-NHGRI grant (C. Stoeckert, Un of Penn) for workshops and ontologist
  • BBSRC grant (S.A. Sansone, EBI) for ontologist
FuGO – Processes
• Coordination Committee
  • Representatives of technological and biological communities
    - Monthly conference calls
• Developers WG
  • Representatives and members of these communities
    - Weekly conference calls
• Documentation
  • http://fugo.sourceforge.net
• Advisory Board
  • Advise on high level design and best practices
  • Provide links to other key efforts
    - Barry Smith, Buffalo Un and IFOMIS
    - Frank Hartel, NIH-NCI
    - Mark Musen, Stanford Un and Protégé Team
    - Robert Stevens, Manchester Un
    - Steve Oliver, Manchester Un
    - Suzi Lewis, Berkeley Un and GO
-> cBiO will also oversee the Open BioMedical Ontology (OBO) initiative
FuGO – Strategy
• Use cases -> within community activity
  • Collect real examples
• Bottom up approach -> within community activity
  • Gather terms and definitions
    - Each community in its own domain
• Top down approach -> collaborative activity
  • Develop a ‘naming convention’
  • Build a top level ontology structure, is_a relationships
  • Other foreseen relationships
    - part_of (currently expressed in the taxonomy as cardinal_part_of)
    - participate_in (input) and derive_from (output)
    - describe or qualify
    - located_in and contained_in
• Binning terms in the top level ontology structure
  • The higher-level semantics help with faster ‘binning’
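The is_a hierarchy and the ‘binning’ step can be sketched with a plain parent map. The terms below are invented for illustration and are not taken from FuGO; only the is_a relationship name comes from the strategy outlined here.

```python
# Hypothetical is_a links (child -> parent); not actual FuGO content.
IS_A = {
    "mass spectrometer": "instrument",
    "centrifuge": "instrument",
    "instrument": "equipment",
    "equipment": "entity",
}


def ancestors(term, is_a=IS_A):
    """Follow is_a links upward and return the chain of parent terms."""
    chain = []
    seen = {term}
    while term in is_a:
        term = is_a[term]
        if term in seen:  # guard against accidental cycles
            break
        seen.add(term)
        chain.append(term)
    return chain


def bin_term(term, top_level, is_a=IS_A):
    """Return the nearest top-level class for a term, or None if unbinned."""
    for candidate in [term] + ancestors(term, is_a):
        if candidate in top_level:
            return candidate
    return None


print(ancestors("mass spectrometer"))
print(bin_term("centrifuge", {"equipment"}))
```

Terms for which `bin_term` returns `None` are the ones that still need a place in the top-level structure.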
FuGO – Status and Plans
• Binning process - ongoing
  • Reconciliations into one canonical version
  • Iterative process
• Common working practices - established
  • Each class consists of: term ID, preferred term, synonyms, definition
    and comments
  • Sourceforge tracker to send comments on terms, definitions,
    relationships
• Timeline for completion of core omics technologies
  • Two years and several intermediate milestones
  • Interim solution
    - Community-specific CVs posted under the OBO
• Ultimately FuGO will be part of the OBO Foundry (Core) Ontology
• Overview paper: “Special Issue on Data Standards”, OMICS journal
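The class structure named under the common working practices (term ID, preferred term, synonyms, definition and comments) maps naturally onto a small record type. The sketch below is illustrative only: the ID format, the example term and the lookup helper are assumptions, not FuGO conventions.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """One ontology class with the five parts listed in the working practices."""

    term_id: str
    preferred_term: str
    definition: str
    synonyms: list = field(default_factory=list)
    comments: str = ""

    def names(self):
        """All names for this class, lower-cased for matching."""
        return {name.lower() for name in [self.preferred_term, *self.synonyms]}


def find_by_name(classes, name):
    """Return the IDs of classes whose preferred term or synonym matches."""
    key = name.lower()
    return [c.term_id for c in classes if key in c.names()]


# Hypothetical example entry.
classes = [
    OntologyClass(
        term_id="EX:0001",
        preferred_term="mass spectrometer",
        definition="An instrument that measures mass-to-charge ratios of ions.",
        synonyms=["MS instrument"],
    )
]
print(find_by_name(classes, "ms instrument"))
```

Matching on synonyms as well as the preferred term is what lets terms gathered bottom-up by different communities be reconciled into one canonical version.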
Areas requiring significant research
Summary: gensc.sf.net
The GSC is tackling the issue of describing our
complete genome collections in greater detail
through:
MIGS
Genome Catalogue
Ontology Development
Metadata Exchange
In co-ordination with:
MICheck micheck.sf.net
FuGO fugo.sf.net
Acknowledgements
GSC: Coordinators
• Dawn Field (CEH Oxford)
• George Garrity (Bergey’s Trust)
• Norman Morrison (NEBC)
• Jeremy Selengut (TIGR)
• Peter Sterk (EBI)
• Tatiana Tatusova (NCBI)
• Nick Thomson (Sanger)
Working Groups
General Members of the GSC
Participants of all meetings
gensc.sf.net
Acknowledgements
MICheck: A Minimum Information Checklist
Portal
Chris Taylor, Dawn Field, Susanna-Assunta
Sansone, Rolf Apweiler, Michael Ashburner,
Cathy Ball, Pierre-Alain Binz, Alvis Brazma,
Ryan Brinkman, Eric Deutsch, Oliver Fiehn,
Jennifer Fostel, Peter Ghazal, Graeme Grimes,
Nigel Hardy, Henning Hermjakob, Randall
Julian, Martin Kuiper, Nicholas Le Novère, Jim
Leebens-Mack, Suzi Lewis, Ruth McNally,
Norman Morrison, Norman Paton, John
Quackenbush, Donald Robertson, Philippe
Rocca-Serra, Barry Smith, Jason Snape, Stefan
Wiemann
FuGO
An Ontology for
Functional Genomics Investigation
Susanna-Assunta Sansone (EBI): Overview
Trish Whetzel (Un of Penn): Microarray
Daniel Schober (EBI): Metabolomics
Chris Taylor (EBI): Proteomics
On behalf of the FuGO working group
http://fugo.sourceforge.net