Driving the Terminology Hub

RDF Triplets as a means to express lexical and referential data.
Therese Vachon, NIBR, Unit Head UltraLink Technologies
W3C Workshop on RDF Access to Relational Databases
25-26 October, 2007 — Boston, MA, USA
Requirements
 Cross-linking of database information (e.g. on genes,
proteins, metabolic pathways, compounds, ligands) to the
original sources is a key issue.
 Productivity in accessing, sharing, searching,
navigating, cross-linking and analyzing internal and
external data relevant to the pharmaceutical industry
should be increased.
2 | Driving the Terminology Hub | Therese Vachon | 25.10.2007
Strategy
 In NIBR, we have been developing a semantic integration
layer on top of knowledge resources that has been
implemented within various services and applications.
 It uses
• A rich domain-specific terminology (biology, chemistry and
medicine) containing 1.6 Mio terms
• A Terminology Hub containing 8 GB of referential data (cross-references between data repositories)
 Using that knowledge, the scientist can access all data at
hand with just a single mouse-click.
Application Areas for Terminologies
 Categorization of documents (via associated taxonomies)
 Search for concepts
 Semantic expansion of queries using synonyms and related
terms
 Identification and extraction of relevant concepts (e.g.
targets, genes, diseases, products) from texts
 Annotation of textual data with controlled terms as
referential anchors
 Construction of a semantic layer on top of information
sources allowing context-sensitive navigation
(UltraLink)
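As a minimal sketch of the semantic expansion of queries listed above, the snippet below expands a query term with its synonyms and narrower terms. The in-memory thesaurus, its sample entries and the function name `expand_query` are illustrative assumptions, not the NIBR implementation.

```python
# Hypothetical in-memory thesaurus: normalized term -> synonyms / narrower terms.
# The sample entries are illustrative only.
THESAURUS = {
    "chronic obstructive pulmonary disease": {
        "synonyms": ["COPD"],
        "narrower": ["pulmonary emphysema"],
    },
}


def expand_query(term, thesaurus=THESAURUS):
    """Return the normalized term plus its synonyms and narrower terms."""
    key = term.lower()
    for normalized, entry in thesaurus.items():
        known_forms = [normalized] + [s.lower() for s in entry["synonyms"]]
        if key in known_forms:
            expanded = [normalized] + entry["synonyms"] + entry["narrower"]
            # dict.fromkeys keeps the order and drops duplicates
            return list(dict.fromkeys(expanded))
    # Unknown term: search it unchanged
    return [term]


print(expand_query("COPD"))
```

A search front end could then OR the returned terms together, so that a query for a synonym also retrieves documents that only mention the normalized or a narrower term.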
Application Areas for the Terminology Hub
 Coherent mapping between Terminologies and Coding
Systems (e.g. Uniprot Accession Number for a Protein)
 Coherent mapping between internal knowledge repositories
(e.g. Biological Assays and Chemical Compounds)
 Coherent mapping between external knowledge
repositories (e.g. HUGO and OMIM)
 Coherent mapping between internal and external
knowledge repositories (e.g. Internal Project Code and
Product Name)
UltraLink makes use of both the terminologies (entity recognition)
and the Terminology Hub (cross-referencing).
[Screenshot: UltraLink plug-in icon, activation, and the concept types frame]
Activation Ultralink
The Landscape of Knowledge - Rooting the
Ultralink in Data Sources/Terminologies
 The UltraLink makes use of a broad range of knowledge
sources, both internal to Novartis and external. The linkage
of these terminologies provides the routes along which you
can navigate when using the UltraLink.
 The linkage between the resources is created automatically
via a rule-based mapping procedure and manually by
annotation. The latter is extremely important for connecting
internal knowledge sources together and to external ones.
 The annotations built on the fly by the UltraLink could be
stored as RDF annotations associated with a document and
be accessed by other computer programs, in the spirit
of the Semantic Web.
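One possible way to store such annotations is sketched below: each recognized concept becomes one N-Triples statement linking the document to the concept. The `http://example.org/` URIs and the choice of the Dublin Core `subject` property are assumptions made for illustration; the slides do not specify the annotation vocabulary.

```python
# Dublin Core "subject" property; its use for annotations is an assumption here.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def annotations_to_ntriples(document_uri, concept_uris):
    """Serialize 'document has subject concept' statements as N-Triples lines."""
    return [
        f"<{document_uri}> <{DCTERMS_SUBJECT}> <{concept_uri}> ."
        for concept_uri in concept_uris
    ]


# Hypothetical document and term URIs under a placeholder namespace.
lines = annotations_to_ntriples(
    "http://example.org/doc/42",
    ["http://example.org/term/CCR5", "http://example.org/term/Glivec"],
)
print("\n".join(lines))
```

Because the output is plain N-Triples, any RDF-aware program could load it without knowing how the UltraLink produced the annotations.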
The Landscape of Knowledge - Rooting the
Ultralink in Data Sources/Terminologies
Concepts and Terminology
Concepts and Data
Underlying terminologies used at NIBR
• > 15’000 Companies with > 35’000 terms
• > 2’000 Diseases with > 19’000 terms
• > 150’000 Genes with about 400’000 terms
• > 5’000 Modes of Action with > 12’000 terms
• > 95’000 Products with > 380’000 terms
• > 170’000 Targets with > 250’000 terms
• > 310’000 Species with > 435’000 terms
• + complete MeSH and EMTREE
• More than 1’600’000 terms
• The terminology consists of terms, and relations between terms (main
entry: normalized terms, synonyms, broader terms, narrower terms)
Principles used for the construction of the
terminology and organization of terms
 In order to create the terminology of reference, terms are extracted from
available terminologies (e.g. UniProt, EntrezGene, HGNC, etc.) and the
references to the source systems are preserved.
 Terms specific to a database are referred to as local terms. These local
terms are stored in a dedicated data structure, the Metastore. Besides
the flat set of terms, thesaurus relations such as synonymy, broader
terms and narrower terms are extracted as well, which makes it possible
to build a thesaurus.
 For each entry in the terminology (e.g. a gene name or a
product), one term is chosen from the list of synonyms and declared
the “normalized term”.
 Normalized / global terms, synonyms / local terms as well as broader
and narrower terms together with their sources of reference constitute
the terminology content behind the UltraLink and are used by the
Terminology Hub.
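The structure described above can be sketched as a small data model: one normalized (global) term per entry, local terms that keep a reference to their source system, and broader/narrower relations. The class and field names are hypothetical, and the pairing of sample terms with source systems is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class LocalTerm:
    text: str
    source: str  # source system the term was extracted from


@dataclass
class TerminologyEntry:
    normalized_term: str  # the term chosen among the synonyms
    concept_type: str
    synonyms: list = field(default_factory=list)  # LocalTerm objects
    broader: list = field(default_factory=list)   # normalized terms
    narrower: list = field(default_factory=list)  # normalized terms

    def sources(self):
        """Return the source systems referenced by this entry, sorted."""
        return sorted({term.source for term in self.synonyms})


# Illustrative entry; the term/source pairing is an assumption.
entry = TerminologyEntry(
    normalized_term="vascular endothelial growth factor",
    concept_type="Genes",
    synonyms=[LocalTerm("VEGF", "EntrezGene"), LocalTerm("VEGFA", "HGNC")],
)
print(entry.sources())
```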
Creating Reference – the Terminology Hub
Different knowledge repositories have different ways to encode a concept:
• Registry Number
• Unique Internal ID
• Concept Identifier
• Enumerating terms
• Just using different terms without any constraints
More than 8 GB of cross-referencing information
Searching a term T both in source A and B may lead to different
results because of different naming/referencing conventions
(false negatives in IR)
Terminology Hub ensures coherent mapping
• Between coding systems
• Between different representation levels (e.g. ID vs. Concept)
• Between local terms and global terms
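A minimal sketch of such a mapping between coding systems is given below: every reference is attached to a shared normalized term, and a code is translated by going through that term. The class `TerminologyHub`, its methods, the `InternalID` coding system and the code `CPD-0001` are hypothetical; only the CAS number appears elsewhere in these slides.

```python
from collections import defaultdict


class TerminologyHub:
    """Toy cross-reference store keyed by normalized term."""

    def __init__(self):
        self._term_to_refs = defaultdict(set)
        self._ref_to_term = {}

    def add_reference(self, normalized_term, coding_system, code):
        ref = (coding_system, code)
        self._term_to_refs[normalized_term].add(ref)
        self._ref_to_term[ref] = normalized_term

    def translate(self, coding_system, code, target_system):
        """Map a code to the codes of another coding system via the shared term."""
        term = self._ref_to_term.get((coding_system, code))
        if term is None:
            return []
        return sorted(
            c for system, c in self._term_to_refs[term] if system == target_system
        )


hub = TerminologyHub()
hub.add_reference("diazepam", "CAS", "439-14-5")
hub.add_reference("diazepam", "InternalID", "CPD-0001")  # hypothetical internal code
print(hub.translate("CAS", "439-14-5", "InternalID"))
```

Routing every lookup through the normalized term is what avoids the false negatives mentioned above: a search that starts from a registry number in source A can still reach records that source B files under its own identifier.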
Classes of objects covered by the Terminology
Hub
 Coding systems
• A coding system provides a predefined set of (sometimes hierarchical)
codes to represent a classification, a nomenclature, a controlled
vocabulary, a thesaurus or chemical structures. For example, you can
use the MeSH® Tree number C06.405.205.697 to refer to Gastritis in
a specific sub-tree of MeSH®
 References
• Unique and unequivocal identifiers based on a coding system create
references in their corresponding data repository. By nature, they are
technical artifacts and not part of our scientific natural language (e.g.
FTY720). Nevertheless, most of them deserve to be identified because
they are used in the scientific literature.
 Pointers and cross-referencing information
• The Metastore contains pointers that make it possible to cross-reference
knowledge sources and applications.
Classes of objects covered by the Terminology
Hub
 Terms
• A term is the smallest meaningful linguistic unit on which our domains of
discourse (biology, chemistry, medicine) are based. A term is
different from a word because it can consist of multiple meaningful
words, such as “chronic obstructive pulmonary disease”.
 Concepts
• A concept is an abstraction based on properties of individuals that we
observe in the world. Individuals that belong to the same concept share
a set of common properties. For example, “targets” share the property
that they should be druggable.
 Data Repositories also named Knowledge Sources
• For all kinds of different data, we use the general notion of a data
repository. Using the term “data repository” we emphasize the fact that
there is a source where some data resides without making any
commitments about physical representation (e.g. database or text file) or
format of representation (e.g. structured or free text).
Classes of objects covered by the Terminology Hub
[Diagram: the classes of objects and the relations between them]
• Terms (e.g. spinal cord, vascular endothelial growth factor, CCR5, Glivec, ovarian cancer, Novartis, Cytomegalovirus, ...)
• Concepts (e.g. Species, Products, Companies, Diseases, Genes, Targets, Mammalian Genes, ...)
• Encoding (e.g. IUPAC, Structures, IDs, GIF, Symbols, Formulas, Registry Numbers, ...)
• Reference (e.g. Compound nos, Project codes, Competitor codes, PMID 9683255, EntrezGene 450128, CAS 439-14-5, Patent numbers)
• Data Repositories (e.g. Internal Chemistry DB, CI sources, Literature, Patents, ...)
• Relation labels shown: synonym-of, broader, narrower, encodes, has-type, points-to, is-a
Achievements and Improvements
 All information about terminologies and cross-references is
stored in a relational database (Oracle 10.2.0.2).
 The data in the database can be accessed through
web services that allow users to find normalized terms, pointers
for a specific concept type, etc.
Metastore Web Service
Get all synonyms for a normalized form
UltraLink Web Services
Get all accessible pointer types for a normalized form
…
Achievements and Improvements
 We intend to improve the semantic representation of the data
in order to facilitate reuse, interoperability and exchange.
 RDF notation and RDF coding standards provide an
adequate means for a richer semantic representation.
 We use SKOS, Dublin Core and other RDF-based coding
standards and supplement them with our own RDF
vocabulary.
Simple Knowledge Organisation System (example)
Terminology for Diseases (SKOS fragment)
Converting Terminologies to RDF
 Clear separation of terminologies from ontologies. We
assign a type (rdf:type) to the URI of a term as reference to
a concept in an ontology.
 Conversion to RDF increased the amount of data roughly by
a factor of 3.
 We obtained more than 5 Mio RDF triplets as a preliminary
representation of our terminologies.
 We are currently setting up the entire workflow for
generation, storing and querying RDF.
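The conversion step can be illustrated with a small sketch that turns one terminology entry into N-Triples. The rdf:type triple points from the term URI to a class in a separate ontology, as described above. Mapping the normalized term to `skos:prefLabel` and synonyms to `skos:altLabel` is an assumption, as are the `http://example.org/` namespace, the identifiers and the sample synonym.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/"  # placeholder namespace (assumption)


def literal(text, lang="en"):
    """Format a language-tagged N-Triples literal, escaping \\ and \"."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def term_to_triples(term_id, normalized, synonyms, concept_class):
    """Return N-Triples lines for one terminology entry."""
    subject = f"<{BASE}term/{term_id}>"
    triples = [
        # The type refers to a concept in the ontology, kept apart from the terms
        f"{subject} <{RDF_TYPE}> <{BASE}ontology/{concept_class}> .",
        f"{subject} <{SKOS}prefLabel> {literal(normalized)} .",
    ]
    triples += [f"{subject} <{SKOS}altLabel> {literal(s)} ." for s in synonyms]
    return triples


# Hypothetical identifier and synonym, for illustration only.
triples = term_to_triples("D0001", "ovarian cancer", ["cancer of the ovary"], "Disease")
print("\n".join(triples))
```

Each entry yields one triple per label plus one for its type, which gives a feel for why the RDF form is noticeably larger than the relational one.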
Conclusion
 The first phase of transforming the terminology to RDF/XML is completed.
 We are currently developing a model for representing the
Terminology Hub in RDF. We expect that an RDF notation
of the Terminology Hub will comprise approximately 50 Mio
RDF triples.
 We intend to test the framework thoroughly (performance,
effective semantic gain compared to the current
technology).
 Closer collaboration with the W3C Healthcare group
Acknowledgements
Thanks to the ULT team
Semantic & Text Analytics Layer
Martin Romacker
Pierre Parisot
Nicolas Grandjean
Data Integration & Services Layer
Alexander Fromm
Laurent Mentek
Application Layer
Daniel Cronenberger
Olivier Kreim
Thanks to Manuel Peitsch