BioQuery: A Bioinformatics Source Querying Environment

Robert D. Stevens1, Martin Peim1, Norman W. Paton1, Brian Donnelly1

1 Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL
tambis@cs.man.ac.uk

2 713 Santa Cruz Avenue, Menlo Park, CA 94025-4519
Phone: +1 831 307 8301, Fax: +1 650 618 1440

Abstract

Formulating and executing queries over distributed, autonomous and heterogeneous resources is an important topic within e-science in general and bioinformatics in particular. Where resources have differing query capabilities and call interfaces, and heterogeneous representations of the semantics of a domain, formulating a single query that will work over multiple resources is difficult. In the TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) system, an ontology of molecular biology and bioinformatics is used to give the illusion of a common query interface to diverse resources. The ontology is also central to the query processing and reconciliation of heterogeneities. The BioQuery project takes TAMBIS onto a new phase, where we will provide a robust service over commercial middleware provided by our industrial collaborator geneticXchange. This paper gives an overview of the TAMBIS project and its current status in BioQuery.

1 Introduction

BioQuery is a collaborative project under the auspices of ESNW (the North West Regional e-Science Centre). It brings together existing middleware for wrapping and querying distributed bioinformatics resources and knowledge driven conceptual querying facilities. Managing the distributed, heterogeneous resources existing in bioinformatics is an exemplar for the problems to be tackled within e-Science.

In bioinformatics, there are many information sources and tools. This proliferation of resources means that a biologist can potentially combine the resources to carry out many sophisticated analyses using any Web-accessible computer.
The reality, however, is that these resources were rarely designed to be used together, and that constructing requests that refer to several resources is time consuming and error-prone. This scenario makes work in such an area a central topic in e-Science upon the Grid.

A large proportion of e-Science, whether inside or outside bioinformatics, will involve the marshalling of diverse resources within applications or to answer complex queries. A working scientist needs considerable knowledge in order to answer complex questions from these diverse resources, whereas his or her expertise really lies in posing the appropriate question. Thus, within e-Science, there is a need to create suitable query building facilities over diverse, heterogeneous and distributed resources, and to create suitable middleware technology for answering such questions. In essence, there is a need to capture the knowledge of how to perform a task within e-Science software and to enable an e-Scientist to use his or her domain knowledge to pose the question.

BioQuery is a collaborative research proposal between the University of Manchester and geneticXchange, Inc. The BioQuery project takes a research prototype, TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) [3], and updates the software to make it more suitable for work within the e-Science context. In particular, TAMBIS is being evolved to use the Discovery Hub middleware from geneticXchange to evaluate requests over wrapped sources. The following sections describe the architecture of the new TAMBIS system and its role within e-Science.

2 BioQuery Architecture

The two project partners have developed two information integration systems for use in biology. These systems, TAMBIS [3] (http://img.cs.man.ac.uk/tambis) and Discovery Hub [2], provide the following features:

• Discovery Hub (geneticXchange, Inc.) - low-cost source wrapping facilities, and a declarative language that allows programs to be written over the wrapped sources.

• TAMBIS (University of Manchester) - high-level ontology-driven query facilities that allow high-level requests to be expressed over multiple resources.

In essence, Discovery Hub can be seen as a query-oriented middleware, whereas TAMBIS can be seen as a high-level query-building environment. The two systems can be seen as complementary: TAMBIS provides high-level, graphical query construction facilities, but assumes the presence of an existing middleware; Discovery Hub provides wrapping and inter-source querying, but does not directly support high-level query formulation facilities for biologists.

TAMBIS attempts to avoid the pitfalls of querying multiple distributed, heterogeneous resources by using an ontology of molecular biology and bioinformatics to manage the presentation and usage of the sources. This ontology gives TAMBIS the following attributes: a homogenising layer over the numerous databases and analysis tools; an opportunity to manage the semantic heterogeneity between the data sources; and a common, consistent query-forming user interface that allows queries across sources to be precisely expressed and progressively refined.

TAMBIS represents knowledge of what concepts exist in these domains and the relationships that exist between those concepts. It is this knowledge that TAMBIS uses to give transparent access to a wide range of bioinformatics databanks and tools. The TAMBIS ontology is used for retrieving instances represented by concepts in the model. A concept is a description of a set of instances, so a concept or description can also be viewed as a query. This approach allows a biologist to ask questions such as "find all antigenic human apoptosis receptor protein homologues and their phosphorylation site motifs". This query is answered by the three sources SWISS-PROT, BLAST and PROSITE. The user does not have to choose the sources, the keywords with which to filter the proteins, and so on.

Figure 1 shows the architecture of TAMBIS. The geneticXchange middleware layer, K1 or Discovery Hub, can be seen at the base of the system; it provides a Java API by which queries formulated in Discovery Hub's query language (CPL [2]) are passed to and executed against the wrappers. The bio-services currently wrapped within TAMBIS are SWISS-PROT, ENZYME, BLAST, PROSITE and CATH. This layer of TAMBIS is the part that actually answers the queries posed by biologists. The rest of the system is concerned with formulating conceptual, declarative queries; mapping query elements to sources and values used within sources; optimising the order of calls to methods on the wrapped sources; and finally encoding the query in Discovery Hub's query language.

Figure 1: the BioQuery architecture:
A. TAMBIS Ontology: an ontology of biological and bioinformatics terms.
B. A knowledge-driven query formulation interface.
C. A services model linking the TaO with the source services.
D. Transformation from high-level, source-independent queries into source-dependent, ordered queries.
E. A wrapper service dealing with external sources provided by Discovery Hub.

Central to the TAMBIS design is the ontology of bioinformatics and molecular biology [1]. The TAMBIS ontology (TaO) describes the entities within molecular biology about which questions may be asked. It also captures the bioinformatics questions that may be asked. The user asks a question by describing a new concept, based upon the classes and roles within the ontology [1].

The query formulation interface lets the user create new concepts, browse the TaO and launch queries (Figure 2). A Sources and Services Model (SSM) is used by the query processor to map classes and roles to sources (via methods on wrappers) and values within those bio-sources.

3 Discussion

The queries possible in TAMBIS are multi-source, over distributed resources with heterogeneous APIs, query mechanisms and semantics. These are typical e-Science tasks that require layers of knowledge over the simple computational facilities underlying the query itself.

BioQuery takes existing artefacts and combines the facilities of both the original TAMBIS and Discovery Hub to provide a conceptual, knowledge driven query interface over a middleware that allows distributed resources to be included and transformed to a common syntactic interface. This new version of TAMBIS, linking to the Discovery Hub, can be found as an applet via http://img.cs.man.ac.uk/tambis. BioQuery has enabled us to stabilise and improve our implementation of TAMBIS and provide a more robust version to the public.

BioQuery gives the illusion of a common query interface to distributed and heterogeneous bioinformatics resources. BioQuery frees a biological scientist from the need to hold knowledge of how to answer bioinformatics queries and allows them to capitalise on their knowledge of what questions to ask. The questions that can be asked through BioQuery are in silico experiments, and the mechanism provided for performing these experiments is a fine example of e-Science in action.

Acknowledgements: The BioQuery grant is funded by the DTI & EPSRC under the e-Science programme via the ESNW.

References

[1] P.G. Baker, C.A. Goble, S. Bechhofer, N.W. Paton, R. Stevens, and A. Brass. An Ontology for Bioinformatics Applications. Bioinformatics, 15(6):510–520, 1999.

[2] Jing Chen, Sun Yun Chung, and Limsoon Wong. The Kleisli Query System as a backbone for bioinformatics data integration and analysis. In Zoe Lacroix and Terence Critchlow, editors, Bioinformatics: Managing Scientific Data. Morgan Kaufmann, May 2003.

[3] C.A. Goble, R. Stevens, G. Ng, S. Bechhofer, N.W. Paton, P.G. Baker, M. Peim, and A. Brass. Transparent Access to Multiple Bioinformatics Information Sources.
IBM Systems Journal, special issue on deep computing for the life sciences, 40(2):532–552, 2001.

Figure 2: the TAMBIS user interface.
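
Illustrative sketch: the idea from Section 2 that a concept can be read as a query, and that a Sources and Services Model maps classes and roles to wrapped sources, can be made concrete with a small example. The following Python fragment is only a sketch under stated assumptions and is not taken from TAMBIS or Discovery Hub. The role names (isHomologousTo, hasMotif and so on), the contents of the SSM dictionary and the function plan_sources are all hypothetical, and the ordering shown is a naive traversal rather than the optimisation step described above.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the Sources and Services Model (SSM): it maps
# ontology classes and roles to the name of a wrapped source. These entries
# are assumptions made for illustration, not the real TAMBIS mapping.
SSM = {
    "Protein": "SWISS-PROT",
    "isHomologousTo": "BLAST",
    "hasMotif": "PROSITE",
}


@dataclass
class Concept:
    """A concept: a base class plus (role, filler) restrictions."""
    base: str
    restrictions: list = field(default_factory=list)


def plan_sources(concept, ssm=SSM):
    """List the wrapped sources a concept touches, in traversal order."""
    ordered = []

    def add(source):
        # Skip classes or roles with no mapping, and avoid duplicates.
        if source is not None and source not in ordered:
            ordered.append(source)

    def visit(c):
        add(ssm.get(c.base))
        for role, filler in c.restrictions:
            if isinstance(filler, Concept):
                visit(filler)  # a nested concept is resolved before its role
            add(ssm.get(role))

    visit(concept)
    return ordered


# The example question from Section 2, written as a nested concept.
receptor = Concept("Protein", [
    ("hasFunction", "receptor"),
    ("isInvolvedIn", "apoptosis"),
    ("hasOrganism", "human"),
    ("hasProperty", "antigenic"),
])
query = Concept("Protein", [
    ("isHomologousTo", receptor),
    ("hasMotif", Concept("PhosphorylationSite")),
])

print(plan_sources(query))  # ['SWISS-PROT', 'BLAST', 'PROSITE']
```

Under these assumed mappings the sketch arrives at the same three sources named in Section 2 for this question, without the user naming any source explicitly.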