Semantic annotation of Web Services

What is EDAM?
EMBRACE Data and Methods
Ontology for bioinformatics tools and data
A set of defined terms, relationships between terms and rules that govern the terms and relations
Glorified glossary – with terms organised by is_a relations (class/subclass) into hierarchy
Controlled vocabulary for describing:
• Web services e.g. WSDL files
• Standalone tools
• Web servers
• Databases
• Data, e.g. XSD data schema associated with a WSDL file
• Data syntax and file formats
Aims to describe (coarse level) all major bioinformatics databases, data and tools in use
The "beta" release covers tools (and associated data) in the EMBRACE Registry:
http://www.embraceregistry.net/
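As a rough illustration of terms "organised by is_a relations (class/subclass) into hierarchy", the sketch below stores a handful of terms with their is_a parent and walks up to the root. The term names are invented placeholders, not actual EDAM terms, and the single-parent layout is a simplifying assumption.

```python
# Hypothetical mini-hierarchy: each term maps to its is_a parent (None = root).
# The names are illustrative placeholders, not real EDAM terms.
IS_A = {
    "operation": None,
    "alignment": "operation",
    "pairwise alignment": "alignment",
}


def ancestors(term, is_a=IS_A):
    """Return the chain of is_a parents from `term` up to the root."""
    chain = []
    parent = is_a[term]
    while parent is not None:
        chain.append(parent)
        parent = is_a[parent]
    return chain


print(ancestors("pairwise alignment"))
```

Annotating with the most specific term still lets a consumer generalise, because every ancestor can be recovered from the hierarchy.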
Scope
EDAM includes 7 sub-ontologies (branches of terms in their own namespace) in the domain of
"bioinformatics tool and data description":
• biological entity – “Any biological thing (or part of a thing) with a physical existence, a physical part,
region or feature that can be mapped to such a thing, a collection of such things or an observable
phenomenon or occurrence”
• topic – “A general field of bioinformatics study, data, processing and analysis or technology.”
• operation – “A specific, singular function or process performed by a tool, for example a WS
operation. What is done, but not (typically) how or in what context.”
• data resource – “A category of content of a data source including databases and ontologies.”
• data – “A semantic description of a data entity (datum) commonly used in bioinformatics.”
• format – “A reference (typically a URL) of a data format specification.”
Required terms not specific to this domain might (eventually) be removed – including the entity branch
(which provides biological context for other branches).
Conceptual model
Bold text within a box indicates a namespace (top-level term)
Non-bold text within a box indicates a minor branch
Text next to lines indicates a relation between two terms
Design Principles
It wasn’t just thrown together (honestly) …
• Clearly defined scope
• A purpose-independent design, not tied to a particular use case
• Relevant to annotation of current:
  • WSDL files
  • XSD schema
  • Standalone databases, servers and tools
• Comprehensive, with enough terms to be useful
• Comprehensible, with terms and relations that are simple and intuitive
• Uncluttered, including only commonly used terms and with as few relation types as possible
• Navigable, with a simple class (is_a) hierarchy
• General, including terms of general use and excluding fine-grained specialised concepts.
• Complementary to (not duplicating) other established ontologies.
• Compatible (e.g. cross-referenced) with existing resources
• Integrity, compatible (so far as possible) with "upper level" ontologies
• Extensible, with clear guidelines for developers
• Convenient, with clear guidelines for annotators
• Ideally, support automated logical inference (reasoning software)
• Validatable
There is a compromise between “ontological correctness” and usability – a pragmatic
approach is essential!
Limitations
EDAM is/does not:
• Describe syntax or file formats in detail (syntax namespace will provide references)
• Define data structures. Although has_part / is_part_of relations are defined they are not currently
used.
• Include terms for every conceptual part of things. Typically a datatype is only listed if it is known to be
in common use
• A catalogue of individual data structures, databases etc. Terms correspond to classes; specific
instances are not included.
• A full-strength ontology. Many relations and other domain features that could be expressed, e.g. in
OWL format, are not modelled.
• A way (in itself) to identify or unify all services and data (but it might help).
• Complete (and arguably never can be).
Sources (current version)
Software collections and registries:
• EMBRACE Web Services
• EBI Web Services
• EBI databases and retrievable fields known to the EB-eye web services
• EMBOSS including EMBASSY packages (>200 applications)
• WHAT-IF data and services (see also WHAT-IF help)
• Lists of tools from the Web
Domain ontologies:
• myGrid ontology
• NAR Databases
• NAR web servers
• Sequence (sequence-related terms)
• Sequence service (sequence service terms)
Database-related terms:
• dbxref.txt (databases cross-referenced in UniProtKB/Swiss-Prot)
• List of databases collated by the ELIXIR project
• Lists of databases from the web
Other (not used as source of terms):
• MI (molecular interactions)
• MIRIAM Resources
• bio2rdf
Sources (to consider)
1. BioMoby:
BioMoby Object Ontology (datatypes)
BioMoby Namespace Ontology (namespaces)
BioMoby service types (analysis types)
BioMoby web service registry (Moby-compliant services)
2. Tool collections and registries:
PSICQUIC services
Web services lists and registries
Services supported by the bio* projects
3. Domain ontologies:
PDBML Schema (Protein Data Bank Markup Language)
Sequence Ontology (sequence annotation and annotation exchange)
BioPAX ontology (biological pathway data)
Ondex ontology
DAS (sequence annotation)
Map (biological map-related terms from Gramene database)
4. XML formats:
BSML
MACSIM
HSAML
BEAST
MSAML
PHYLIP
JalView 2 Project
AlignmentML
EBI Application XML
UniProtKB RDF
5. Other:
MSD/PDBe API
OMG LSR documents
Download
“Beta” version in OBO (Open Biomedical Ontologies) format:
http://sourceforge.net/projects/edamontology/files/
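Since the download is an OBO flat file, it can be read with a few lines of code. The following is a minimal sketch of a [Term] stanza reader, assuming the usual "tag: value" layout of OBO files. The sample stanzas, including the "EDAM:" ID prefix and the term names, are hypothetical and not copied from the EDAM file.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas from OBO-format text into a list of dicts.

    Each tag maps to a list of values, because tags such as is_a may repeat.
    """
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop any trailing "! comment" text.
            value = value.split(" ! ")[0].strip()
            current.setdefault(tag, []).append(value)
    return terms


# Hypothetical sample; IDs and names are placeholders, not real EDAM content.
SAMPLE = """\
format-version: 1.2

[Term]
id: EDAM:0000001
name: example parent term
namespace: operation

[Term]
id: EDAM:0000002
name: example child term
namespace: operation
is_a: EDAM:0000001 ! example parent term
"""

for term in parse_obo_terms(SAMPLE):
    print(term["id"][0], term["name"][0], term.get("is_a", []))
```

A dedicated OBO library would handle escapes, qualifiers and other stanza types; this sketch only covers the simple case.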
Status
“Beta” version intended primarily for testing and feedback
Starting point for service nomenclature
Coverage is quite broad in general and quite deep for sequence analysis:
• ~2000 terms with definitions
• 8 basic types of relation (plus inverse relations)
• Relations are defined but not used in many term definitions. Relations will be added in the future
depending on requirements.
Maturing nicely through iterative cycles of development
• Term names, definitions and hierarchy (is_a relations) in all branches are reasonably stable
• Future versions will not be a fundamental departure
EDAM is being actively developed:
• OBO uses IDs to uniquely identify terms. EDAM IDs will persist between versions: a given ID is
guaranteed to identify the same concept. This does *not* imply term names, definitions and other fields
will remain constant, but they will remain true to the concept.
• Obsolete terms will also persist (they will not be removed and will maintain their ID).
Suggestions, requirements and collaborations welcome!
License
EDAM is made available to all without any constraint or license on its use or redistribution other than:
• EDAM is clearly acknowledged as the source of the product.
• EDAM files displayed publicly include the publication date and/or version number.
• EDAM files are not altered and subsequently redistributed under their original name or with the same
term identifiers.
Documentation
Documentation at:
http://edamontology.sourceforge.net/
Including clear statement of:
• Branches of terms (namespaces / sub-ontologies)
• Relations
• Rules (governing rules and relations)
• Guidelines for Developers
• Guidelines for Annotators (basic)
• And more …
Viewing
EDAM may be viewed in:
• Any text editor
• Ontology editor
OBO Ontology Editor (OBOEdit) Version 2
http://oboedit.org
• Web-based browsers:
NCBO Ontology Browser
http://bioportal.bioontology.org/visualize/42800
EBI Ontology Look-up Service (coming soon)
http://www.ebi.ac.uk/ontology-lookup/
• SRS
EBI SRS server
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+EDAM
Viewing in SRS
EDAM is in EBI SRS server:
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+EDAM
And from the EBI dbfetch:
http://wwwdev.ebi.ac.uk/Tools/dbfetch/
This allows individual terms to be addressed:
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000352 (plain text view) or
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000352?style=html (HTML view)
These views are the term “end-points”
Guidelines for Annotators
Which EDAM branch to use?
• “topic” for coarse-grained annotation of tools, databases, servers and so on
• “operation” for fine-grained annotation of tool functions
• “data resource” for annotating data resources such as databases and servers into broad categories based
on content-type
• “data” and “format” for annotating data in semantic and syntactic terms respectively
Picking terms
• Familiarise yourself with EDAM (use a text editor or OBOEdit)
• Identify the correct branch/namespace (“operation”, “data” etc.; see above)
• Search EDAM using keywords to find candidate terms. Use synonyms, alternative spellings etc.
• Pick the most specific term(s) available (some concepts are necessarily overlapping or general!)
• Only pick a correct term (if it doesn't exist it can be added)
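The keyword search step can be sketched as a simple filter over term names and synonyms within one namespace. The term records below are hypothetical placeholders (not real EDAM terms or IDs), and the dictionary layout is an assumption made for this example.

```python
def find_candidates(terms, keyword, namespace):
    """Return terms in `namespace` whose name or synonyms contain `keyword`."""
    key = keyword.lower()
    hits = []
    for term in terms:
        if term["namespace"] != namespace:
            continue
        labels = [term["name"], *term.get("synonyms", [])]
        if any(key in label.lower() for label in labels):
            hits.append(term)
    return hits


# Hypothetical records for illustration only.
TERMS = [
    {"id": "A1", "namespace": "operation", "name": "Sequence alignment",
     "synonyms": ["Sequence comparison"]},
    {"id": "A2", "namespace": "operation", "name": "Pairwise sequence alignment"},
    {"id": "B1", "namespace": "data", "name": "Sequence alignment"},
]

for hit in find_candidates(TERMS, "alignment", "operation"):
    print(hit["id"], hit["name"])
```

The search only proposes candidates; choosing the most specific correct term remains a manual decision for the annotator.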
Use other ontologies
Use EDAM alongside other ontologies where possible and desirable.
For example, an operation that predicts specific features of a molecular sequence could be annotated with GO
terms for the features.
Annotation of Web Services
Model of a Web Service
A WS is considered as an arbitrary (but usually related) set of one or more operations, reducing the problem of WS interoperation to one of
compatibility between operations.
Operation
• Discrete unit of functionality performing (typically) one or more definite functions
• Reads an input
• Writes an output
• Uses zero or more data resources
Input
• Payload of SOAP message passed in operation call
• Name and (ideally) description is given in WSDL file
• Input has one or more XML elements which must be set (input values)
Output
• Payload of SOAP message returned from operation call
• Name and (ideally) description is given in WSDL file
• Output has one or more XML elements which are written (output values)
XML elements
• Simple or complex XSD types given in XSD schema associated with a WSDL file
• Correspond to values that are input or output by a service
• Name and (ideally) description of element is given in schema
• Element values are instances of a particular datatype with a semantic type and a specific syntax.
• Most element values have a syntax fully specified by the schema
• Some element values correspond to text in a specific file format which is not specified by the schema. Such reports may be a composite of
different semantic types.
Data resources
• Databases or ontologies used in the background
• Not passed in a WS call
• Might be specified indirectly via a parameter. For example, an operation reads a database, the name of which is specified as a parameter
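The model above reduces interoperation to compatibility between operations. A minimal sketch of that idea follows, assuming each input or output value carries one "data" term and one "format" term, and that two values are compatible only when both terms match exactly. The class names, operation names and term labels are hypothetical; a fuller check could also accept more general or more specific terms via the is_a hierarchy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Value:
    name: str
    data: str    # EDAM "data" term (semantic type)
    format: str  # EDAM "format" term (syntax)


@dataclass
class Operation:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    data_resources: list = field(default_factory=list)  # used in the background


def can_feed(producer, consumer):
    """True if some output of `producer` matches some input of `consumer`
    in both data term and format term (exact match is an assumption)."""
    return any(
        out.data == inp.data and out.format == inp.format
        for out in producer.outputs
        for inp in consumer.inputs
    )


# Hypothetical operations and term labels, for illustration only.
fetch = Operation("fetchSequence",
                  outputs=[Value("seq", "sequence", "fasta")],
                  data_resources=["sequence database"])
align = Operation("alignSequences",
                  inputs=[Value("query", "sequence", "fasta")],
                  outputs=[Value("aln", "alignment", "clustal")])

print(can_feed(fetch, align), can_feed(align, fetch))
```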
Levels of annotation
Annotation of a WSDL file or associated XSD schema is possible at several levels.
Assuming SAWSDL annotation, the XML elements that may be annotated are:
1. Service (<wsdl:portType>)
• Ideally one "Topic" term for the service as a whole
2. Operation (<wsdl:operation>)
• Ideally one "Operation" term for each WSDL operation (more than one in exceptional circumstances)
3. Input (parameter) values (<xs:element>, <xs:complexType>, <xs:simpleType>, <xs:attribute>)
• One "Data" term
• One "Format" term
4. Output values (<xs:element>, <xs:complexType>, <xs:simpleType>, <xs:attribute>)
• One "Data" term
• One "Format" term
The expectation is for annotation of operation inputs and outputs to go into XSD schema although the WSDL file
(<input> and <output> elements) might also be used. The following annotations might be useful but are not supported
by SAWSDL:
1. Web service (<wsdl:service>)
• One or more "Topic" terms to describe the general area(s) the service operates in
• One or more "Data resource" terms to describe the data resources used by the service
2. Operation input (<input>)
• One or more "Data" terms for the input(s) of each operation (if needed)
3. Operation output (<output>)
• One or more "Data" terms for the output(s) of each operation (if needed)
Annotation of EMBOSS
EMBOSS (European Molecular Biology Open Software Suite)
>200 applications for (mostly) molecular sequence analysis
Application descriptions are kept in ACD (AJAX Command Definition) files
ACD file includes:
1 “Application definition”
1 or more “Data definitions”
ACD files are annotated with EDAM terms
Application definition:
>=1 “topic” term
>=1 “operation” term
Data definition:
>=1 “data” term
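The annotation rules above (at least one "topic" and one "operation" term per application definition, at least one "data" term per data definition) can be checked mechanically. The sketch below assumes the annotations have already been extracted from an ACD file into plain Python structures. It does not parse ACD syntax, and the names and term labels are hypothetical.

```python
def check_acd_annotation(application, data_definitions):
    """Return a list of problems; an empty list means the rules are satisfied.

    application:      dict with "topic" and "operation" lists of EDAM terms
    data_definitions: dict mapping a data definition name to its "data" terms
    """
    problems = []
    if not application.get("topic"):
        problems.append("application: missing topic term")
    if not application.get("operation"):
        problems.append("application: missing operation term")
    for name, terms in data_definitions.items():
        if not terms:
            problems.append(f"{name}: missing data term")
    return problems


# Hypothetical annotations for illustration only.
app = {"topic": ["example topic"], "operation": []}
data = {"sequence": ["example data term"], "outfile": []}
print(check_acd_annotation(app, data))
```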
EMBOSS Service Annotation
Annotated WSDL files (and associated XSD data schema) are available from:
http://wwwdev.ebi.ac.uk/soaplab/typed/services/list
You will see a list of service end-points with WSDL URLs. For example:
http://wwwdev.ebi.ac.uk/soaplab/typed/services/alignment_consensus.cons.sa?wsdl
To see the data schema associated with a WSDL, you must replace
"?wsdl" with "?xsd=1", "?xsd=2" or "?xsd=3"
For example:
http://wwwdev.ebi.ac.uk/soaplab/typed/services/alignment_consensus.cons.sa?xsd=1
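The "?wsdl" to "?xsd=N" substitution is easy to script. A small sketch follows; the function name is made up for this example, and the default of three schema documents simply mirrors the three suffixes listed above.

```python
def schema_urls(wsdl_url, count=3):
    """Derive the ?xsd=1..count schema URLs from a URL ending in ?wsdl."""
    suffix = "?wsdl"
    if not wsdl_url.endswith(suffix):
        raise ValueError("expected a URL ending in '?wsdl'")
    base = wsdl_url[: -len(suffix)]
    return [f"{base}?xsd={n}" for n in range(1, count + 1)]


WSDL = "http://wwwdev.ebi.ac.uk/soaplab/typed/services/alignment_consensus.cons.sa?wsdl"
for url in schema_urls(WSDL):
    print(url)
```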
SAWSDL annotation
The proposed format of SAWSDL annotation includes the term namespace and unique identifier in a URI
pointing to the term definition:
<element name="elementName"
sawsdl:modelReference="http://purl.org/edam/namespace/id">
Where ...
* element is the XML element being annotated
* elementName is the name of the XML element
* namespace is the namespace of the EDAM term, e.g. "operation"
* id is the unique identifier of the term, e.g. "0000295"
The term name, if required, could be given as an XML comment after the annotated element:
<element name="elementName"
sawsdl:modelReference="http://purl.org/edam/namespace/id">
<!-- term_name -->
This is not recommended however as term names are not guaranteed to remain constant.
The value of the sawsdl:modelReference attribute is a URI pointing to the term definition.
The proposal is to use PURLs (Persistent Uniform Resource Locators) which include the term namespace.
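To make the proposed annotation concrete, here is a sketch that adds a sawsdl:modelReference attribute to an element of an XSD schema using Python's standard xml.etree.ElementTree module. The schema fragment and the element name "exampleInput" are hypothetical. The PURL pattern is the one proposed above, and the SAWSDL namespace is the one defined by the W3C recommendation.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"
SAWSDL_NS = "http://www.w3.org/ns/sawsdl"
# Keep readable prefixes when the tree is serialised again.
ET.register_namespace("xs", XS_NS)
ET.register_namespace("sawsdl", SAWSDL_NS)


def model_reference(namespace, term_id):
    """Build the proposed PURL for an EDAM term."""
    return f"http://purl.org/edam/{namespace}/{term_id}"


def annotate(schema_xml, element_name, namespace, term_id):
    """Set sawsdl:modelReference on every xs:element with the given name."""
    root = ET.fromstring(schema_xml)
    for el in root.iter(f"{{{XS_NS}}}element"):
        if el.get("name") == element_name:
            el.set(f"{{{SAWSDL_NS}}}modelReference",
                   model_reference(namespace, term_id))
    return ET.tostring(root, encoding="unicode")


# Hypothetical schema fragment for illustration only.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="exampleInput" type="xs:string"/>
</xs:schema>"""

print(annotate(SCHEMA, "exampleInput", "data", "0000863"))
```

Only the namespace and ID appear in the reference, which fits the advice above not to rely on term names.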
EDAM term end-points
When pasted into a browser, the PURLs:
http://purl.org/edam/topic/0000182
http://purl.org/edam/operation/0000292
http://purl.org/edam/data/0000863
... will (eventually) resolve to:
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000182
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000292
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000863
These are complete OBO term statements in plain text (OBO format). PURLs support text extensions
allowing a format specifier to be added. For example these PURLs:
http://purl.org/edam/topic/0000182?style=html
http://purl.org/edam/operation/0000292?style=html
http://purl.org/edam/data/0000863?style=html
... will resolve to OBO term statements in HTML such that terms referred to in the statements (via relations)
will be clickable to allow navigation:
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000182?style=html
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000292?style=html
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000863?style=html
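The PURL-to-dbfetch mapping shown above can be expressed as a small function. This is only a sketch of the intended redirection, inferred from the example URLs on this slide; the real resolution is done by the PURL server, and the function name is made up for this example.

```python
from urllib.parse import urlsplit

DBFETCH_BASE = "http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam"


def resolve_purl(purl):
    """Map an EDAM PURL to the dbfetch end-point it is expected to resolve to."""
    parts = urlsplit(purl)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "purl.org" or len(segments) != 3 or segments[0] != "edam":
        raise ValueError(f"not an EDAM PURL: {purl}")
    # The namespace segment is dropped: the dbfetch URL uses only the term ID.
    _, _namespace, term_id = segments
    target = f"{DBFETCH_BASE}/{term_id}"
    # Any format specifier (e.g. style=html) is carried over unchanged.
    return f"{target}?{parts.query}" if parts.query else target


print(resolve_purl("http://purl.org/edam/topic/0000182"))
print(resolve_purl("http://purl.org/edam/data/0000863?style=html"))
```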
The eventual final list of end-points will provide other formats/views:
• Plain text in OBO format (default)
• HTML
• XML
• JSON
• The term in a web browser, e.g. NCBO Ontology Browser.
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000182?style=html
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000182?format=xml
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000182?format=txt
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000182?format=json
http://wwwdev.ebi.ac.uk/Tools/dbfetch/dbfetch/edam/0000182?format=browser (default)
For now, you can see this in action for this term:
http://purl.org/edam/entity/0000002
http://purl.org/edam/entity/0000002?style=html
Parallel Developments
(and other applications)
These include:
• BioXSD
• EMBRACE Registry / BioCatalogue
• Taverna
• BioNEMUS
• Ondex
• ELIXIR
Thanks
• Peter Rice (boss)
• Alan Bleasby (PURL handling)
• Mahmut Uludag (EMBOSS WS)
• Hamish McWilliam (SRS, discussions)
• Matus Kalas (BioXSD, discussions)
• James Malone (SWO + discussions)
• Steve Pettifer (publications + discussions)
• The Forgotten … (sorry)
All enquiries to Jon Ison (jison@ebi.ac.uk)