BioQuery: A Bioinformatics Source Querying Environment

Robert D. Stevens1, Martin Peim1, Norman W. Paton1, Brian Donnelly1
1 Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL
tambis@cs.man.ac.uk
2 713 Santa Cruz Avenue, Menlo Park, CA 94025-4519
Phone: +1 831 307 8301
Fax: +1 650 618 1440
Abstract
Formulating and executing queries over distributed, autonomous and heterogeneous resources is an important topic within e-science in general and bioinformatics in particular. Where resources have differing query
capabilities and call interfaces, and heterogeneous representations of the semantics of a domain, formulating a
single query that will work over multiple resources is difficult. In the TAMBIS (Transparent Access to Multiple
Bioinformatics Information Sources) system, an ontology of molecular biology and bioinformatics is used to
give the illusion of a common query interface to diverse resources. The ontology is also central to the query
processing and reconciliation of heterogeneities. The BioQuery project takes TAMBIS into a new phase, in which
we will provide a robust service over commercial middleware provided by our industrial collaborator geneticXchange. This paper gives an overview of the TAMBIS project and its current status in BioQuery.
1 Introduction
BioQuery is a collaborative project under the auspices of ESNW (the North West Regional e-Science Centre). It brings together existing middleware for wrapping and querying distributed bioinformatics resources and knowledge-driven conceptual querying facilities. Managing the distributed, heterogeneous resources existing in bioinformatics is an exemplar of the problems to be tackled within e-Science. In bioinformatics, there are many information sources and tools. This proliferation of resources means that a biologist can potentially combine the resources to carry out many sophisticated analyses using any Web-accessible computer. The reality, however, is that these resources were rarely designed to be used together, and that constructing requests that refer to several resources is time-consuming and error-prone.
This scenario makes work in such an area a central topic in e-Science on the Grid. A large proportion of e-Science, whether within or outside bioinformatics, will involve the marshalling of diverse resources within applications or to answer complex queries. A working scientist needs considerable knowledge in order to answer complex questions from these diverse resources, whereas his or her expertise really lies in posing the appropriate question. Thus, within e-Science, there is a need to create suitable query-building facilities over diverse, heterogeneous and distributed resources, as well as suitable middleware technology for answering such questions. In essence, there is a need to capture the knowledge of how to perform a task within e-Science software and to enable an e-Scientist to use his or her domain knowledge to pose the question.
BioQuery is a collaborative research project between the University of Manchester and geneticXchange, Inc. It takes a research prototype, TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) [3], and updates the software to make it more suitable for work within the e-Science context. In particular, TAMBIS is being evolved to use the Discovery Hub middleware from geneticXchange to evaluate requests over wrapped sources. The following sections describe the architecture of the new TAMBIS system and its role within e-Science.
2 BioQuery Architecture
The two project partners have developed two information integration systems for use in biology. These systems, TAMBIS [3] (http://img.cs.man.ac.uk/tambis) and Discovery Hub [2], provide the following features:
• Discovery Hub (geneticXchange, Inc.) - low-cost source wrapping facilities, and a declarative language that allows programs to be written over the wrapped sources.
• TAMBIS (University of Manchester) - high-level ontology-driven query facilities that allow high-level requests to be expressed over multiple resources.
In essence, Discovery Hub can be seen as a query-oriented middleware, whereas TAMBIS can be seen as a high-level query-building environment. The two systems can be seen as complementary: TAMBIS provides high-level, graphical query construction facilities, but assumes the presence of an existing middleware; Discovery Hub provides wrapping and inter-source querying, but does not directly support high-level query formulation facilities for biologists.
TAMBIS attempts to avoid the pitfalls of querying multiple distributed, heterogeneous resources by using an ontology of molecular biology and bioinformatics to manage the presentation and usage of the sources. This ontology allows TAMBIS to have the following attributes: a homogenising layer over the numerous databases and analysis tools; an opportunity to manage the semantic heterogeneity between the data sources; and a common, consistent query-forming user interface that allows queries across sources to be precisely expressed and progressively refined.
TAMBIS represents knowledge of what concepts exist in these domains and the relationships that exist between those concepts. It is this knowledge that TAMBIS uses to give transparent access to a wide range of bioinformatics databanks and tools. The TAMBIS ontology is used for retrieving instances represented by concepts in the model. A concept is a description of a set of instances, so a concept or description can also be viewed as a query.
This approach allows a biologist to ask questions such as "find all antigenic human apoptosis receptor protein homologues and their phosphorylation site motifs". This query is answered by the three sources SWISS-PROT, BLAST and PROSITE. The user does not have to choose the sources, the keywords with which to filter the proteins, etc.
The Discovery Hub middleware layer of TAMBIS is the part that actually answers the queries posed by biologists. The rest of the system is concerned with formulating conceptual, declarative queries; mapping query elements to sources and values used within sources; optimising the order of calls to methods on the wrapped sources; and finally encoding the query in Discovery Hub's query language.
Central to the TAMBIS design is the ontology of bioinformatics and molecular biology [1]. The TAMBIS ontology (TaO) describes the entities within molecular biology about which questions may be asked. It also captures the bioinformatics questions that may be asked. The user asks a question by describing a new concept, based upon the classes and roles within the ontology [1].
The query formulation interface lets the user create new concepts, browse the TaO and launch queries (Figure 2). A Sources and Services Model (SSM) is used by the query processor to map classes and roles to sources (via methods on wrappers) and values within those bio-sources.
3 Discussion
The queries possible in TAMBIS are multi-source, over distributed resources with heterogeneous APIs, query mechanisms and semantics. These are typical e-Science tasks that require layers of knowledge over the simple computational facilities underlying the query itself.
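To make this layering concrete, the following minimal Python sketch (not TAMBIS code) treats a concept as a list of ontology terms, looks each term up in a toy stand-in for the Sources and Services Model, and orders the resulting wrapper calls so that each call's input has been produced by an earlier one. The names `Mapping`, `TOY_SSM` and `plan_calls`, the ontology terms, and the wrapper method names are all hypothetical; the assignment of terms to SWISS-PROT, BLAST and PROSITE is an assumption loosely based on the example query in Section 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One entry of a toy services model (hypothetical structure)."""
    source: str  # wrapped source that serves the ontology term
    method: str  # invented wrapper method name
    needs: str   # kind of value the call takes as input ("" for none)
    gives: str   # kind of value the call produces


# Toy stand-in for the SSM: ontology class or role -> wrapper method.
# Terms and method names are illustrative assumptions only.
TOY_SSM = {
    "Protein": Mapping("SWISS-PROT", "get_proteins", "", "protein"),
    "hasHomologue": Mapping("BLAST", "find_homologues", "protein", "homologue"),
    "hasMotif": Mapping("PROSITE", "scan_motifs", "homologue", "motif"),
}


def plan_calls(terms, ssm):
    """Return the mappings ordered so each call's input is produced first."""
    pending = [ssm[term] for term in terms]  # KeyError if a term is unmapped
    available = {""}  # kinds of value produced so far
    plan = []
    while pending:
        ready = [m for m in pending if m.needs in available]
        if not ready:
            raise ValueError("no source call can run with the values available")
        for m in ready:
            plan.append(m)
            available.add(m.gives)
            pending.remove(m)
    return plan


# The user states what is wanted, in any order; the plan fixes the call order.
concept = ["hasMotif", "Protein", "hasHomologue"]
for step in plan_calls(concept, TOY_SSM):
    print(f"{step.source}.{step.method}")
```

The user-facing part is only the `concept` list; the knowledge of which source answers which term, and in what order, sits in the two layers beneath it.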
BioQuery takes existing artefacts and combines the facilities of both the original TAMBIS and Discovery Hub to provide a conceptual, knowledge-driven query interface over a middleware that allows distributed resources to be included and transformed to a common syntactic interface. This new version of TAMBIS, linking to the Discovery Hub, can be found as an applet via http://img.cs.man.ac.uk/tambis.
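As a rough illustration of what a common syntactic interface over heterogeneous sources can mean, the Python sketch below puts two invented sources with different native formats (delimited text lines and tuples) behind the same `fetch` method, so that a caller sees uniform records. The class names, record fields and sample data are hypothetical, and this is not the Discovery Hub wrapper API.

```python
class FlatFileSource:
    """Invented source whose native answer is 'id|name' text lines."""

    _data = {"kinase": "P1|kinase A\nP2|kinase B"}

    def fetch(self, keyword):
        text = self._data.get(keyword, "")
        # Split each line into fields and label them uniformly.
        return [dict(zip(("id", "name"), line.split("|")))
                for line in text.splitlines()]


class TupleSource:
    """Invented source whose native answer is a list of (id, name) tuples."""

    _rows = [("P2", "kinase B"), ("P3", "receptor C")]

    def fetch(self, keyword):
        return [{"id": ident, "name": name}
                for ident, name in self._rows if keyword in name]


def query_all(wrappers, keyword):
    """Ask every wrapper the same question and merge records by id."""
    merged = {}
    for wrapper in wrappers:
        for record in wrapper.fetch(keyword):
            merged.setdefault(record["id"], record)  # keep first record per id
    return list(merged.values())


results = query_all([FlatFileSource(), TupleSource()], "kinase")
for record in results:
    print(record["id"], record["name"])
```

Because both wrappers return the same record shape, the code that combines results does not need to know which source a record came from.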
BioQuery has enabled us to stabilise and improve our implementation of TAMBIS and provide a more robust version to the public. BioQuery gives the illusion of a common query interface to distributed and heterogeneous bioinformatics resources. It frees a biological scientist from the need to hold knowledge of how to answer bioinformatics queries and allows them to capitalise on their knowledge of what questions to ask. The questions that can be asked through BioQuery are in silico experiments, and the mechanism provided for performing these experiments is a fine example of e-Science in action.
Figure 1 shows the architecture of TAMBIS. The geneticXchange middleware layer, K1 or Discovery Hub, can be seen at the base of the system; it provides a Java API by which queries formulated in Discovery Hub's query language (CPL [2]) are passed to and executed against the wrappers. The bio-services currently wrapped within TAMBIS are SWISS-PROT, ENZYME, BLAST, PROSITE and CATH.
Figure 1: The BioQuery architecture. A. TAMBIS Ontology: an ontology of biological and bioinformatics terms. B. A knowledge-driven query formulation interface. C. A services model linking the TaO with the source services. D. Transformation from high-level, source-independent queries into source-dependent, ordered queries. E. A wrapper service dealing with external sources provided by Discovery Hub.
Acknowledgements: The BioQuery grant is funded by the DTI & EPSRC under the e-Science programme via the ESNW.
References
[1] P.G. Baker, C.A. Goble, S. Bechhofer, N.W. Paton, R. Stevens, and A. Brass. An Ontology for Bioinformatics Applications. Bioinformatics, 15(6):510–520, 1999.
[2] Jing Chen, Sun Yun Chung, and Limsoon Wong. The Kleisli Query System as a backbone for bioinformatics data integration and analysis. In Zoe Lacroix and Terence Critchlow, editors, Bioinformatics: Managing Scientific Data. Morgan Kaufmann, May 2003.
[3] C.A. Goble, R. Stevens, G. Ng, S. Bechhofer, N.W. Paton, P.G. Baker, M. Peim, and A. Brass. Transparent Access to Multiple Bioinformatics Information Sources. IBM Systems Journal, Special Issue on Deep Computing for the Life Sciences, 40(2):532–552, 2001.
Figure 2: the TAMBIS user interface.