Discovery Net e-Science made easy

Discovery Net is funded by the EPSRC under the UK e-Science Core Programme
“e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it”
John Taylor, Director General of the Research Councils, OST
Knowledge Discovery & e-Science
Discovery Net is building the software infrastructure and
tools for providing Knowledge Discovery Services that
allow scientists to conduct and manage complex data
analysis and knowledge discovery activities on the new
generation Internet.
In the same way the web has revolutionised publishing
and searching for information, Knowledge Discovery
Services will revolutionise how the analysis of distributed
scientific data is conducted over the Internet.
High Throughput Discovery Informatics
Discovery Net addresses the complexities faced by
scientists from all information-intensive fields where:
Modern high throughput devices are routinely
generating and capturing large amounts of data.
Data sets are not analysed in isolation, but
dynamically integrated during the analysis.
New data analysis methods and software
components are continually being developed.
Knowledge discovery procedures are complex,
multi-step procedures conducted by interdisciplinary
teams of scientists.
To address the challenges of high throughput discovery
informatics, the Discovery Net project is building the
architecture and the associated application-level
middleware and tools for conducting and managing
complex, distributed knowledge discovery activities.
A Grid of Knowledge Discovery Services
Discovery Net provides a service-oriented computing
model for knowledge discovery, allowing users to connect
to and use data analysis software as well as data sources
that are made available online by third parties. In
particular, Discovery Net defines the standards,
architecture and tools that:
Allow scientists to plan, manage, share and
execute complex knowledge discovery and data
analysis procedures available as remote services.
Allow service providers to publish and make
available data mining and data analysis software
components as services to be used in knowledge
discovery procedures.
Allow data owners to provide interfaces and
access to scientific databases, data stores,
sensors and experimental results as services so that
they can be integrated in knowledge discovery
procedures.
Discovery Net Architecture and
Knowledge Discovery Services
The Discovery Net architecture defines
standards for specifying:
Knowledge Discovery Adapters: used to
declare the properties of analytical software
components and scientific data stores including
their input/output types, performance and
accuracy characteristics.
Knowledge Discovery Services look-up and
registration: allowing scientists to retrieve and
compose Knowledge Discovery Services in their
discovery procedures.
Integrated Scientific Database Access:
allowing the integration of structured and
semi-structured data from different data sources
within a discovery procedure using XML
Knowledge Discovery Process Management:
including DPML (Discovery Process Markup
Language) as a standard specification language
for constructing and managing knowledge
discovery procedures, as well as recording their
Knowledge and Discovery Process Storage:
allowing discovery procedures to be stored,
shared and re-executed.
Knowledge Discovery Process Deployment:
allowing users to deploy and publish their
existing knowledge discovery procedures as
new services.
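The adapter and look-up ideas above can be illustrated with a minimal
sketch. Everything in it is an assumption made for illustration: the
class names, the field names and the type labels are hypothetical and
are not taken from the Discovery Net standards. It only shows how
declared input/output types let a registry answer "which services
accept this data?" and "can these two services be chained?".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceAdapter:
    """Hypothetical descriptor for one analysis component or data store."""
    name: str
    input_type: str
    output_type: str
    provider: str = "unknown"


@dataclass
class ServiceRegistry:
    """Toy in-memory registry supporting registration and look-up."""
    _adapters: dict = field(default_factory=dict)

    def register(self, adapter: ServiceAdapter) -> None:
        self._adapters[adapter.name] = adapter

    def find(self, input_type: str) -> list:
        """Return adapters that accept the given input type, sorted by name."""
        return sorted(
            (a for a in self._adapters.values() if a.input_type == input_type),
            key=lambda a: a.name,
        )

    def can_chain(self, first: str, second: str) -> bool:
        """True when the output type of `first` is the input type of `second`."""
        return (
            self._adapters[first].output_type
            == self._adapters[second].input_type
        )


# Illustrative entries only; the type labels are made up for this sketch.
registry = ServiceRegistry()
registry.register(
    ServiceAdapter("normalise", "expression_matrix", "normalised_matrix")
)
registry.register(
    ServiceAdapter("cluster", "normalised_matrix", "gene_groups")
)
```

A client could call registry.find("normalised_matrix") to list
candidate services, and registry.can_chain("normalise", "cluster")
before composing the two steps into a procedure.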
Figure 1: Application of Discovery Net in Life Sciences
From Information Search to Knowledge
Discovery Services
Consider how easy it is to use Google to conduct a text
search over information already published on the Internet.
Discovery Net aims to extend Internet usage from this
simple model of publishing and retrieving information to a
more powerful model of publishing, retrieving and using
Knowledge Discovery Services for scientific data analysis.
Data Analysis Services
A life scientist, for example, may have access to an
automated laboratory experiment where a range of
sensors produces a large amount of data about the
activities of genes in cancerous cells and their response to
the introduction of a possible drug.
The scientist analysing the data searches through the
latest set of data mining software components available on
the Internet. These software components, as well as their
remote execution, are provided by service providers.
Knowledge Discovery Process Composition
From the available services, scientists choose a software
component for data normalisation to be applied to the data
and execute the computation. The results are then fed into
another software component performing a clustering
algorithm analysis to segment the data into sub-groups of
genes exhibiting similar behaviour.
The software components used may have come from different providers
but, for scientists, their integration and execution is a
matter of a few mouse clicks.
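The composition described above (normalise, then cluster) can be
sketched locally as a chain of two functions. This is an illustration
under stated assumptions, not Discovery Net code: the function names
are hypothetical, z-scoring stands in for "data normalisation", plain
k-means stands in for the clustering service, and the gene profiles
are invented toy values.

```python
from statistics import mean, pstdev


def normalise(profiles):
    """Z-score each gene's expression profile (mean 0, unit variance)."""
    result = {}
    for gene, values in profiles.items():
        mu, sigma = mean(values), pstdev(values)
        result[gene] = [(v - mu) / sigma if sigma else 0.0 for v in values]
    return result


def cluster(profiles, k, iterations=10):
    """Plain k-means; the first k genes seed the centroids (deterministic)."""
    genes = list(profiles)
    centroids = [profiles[g] for g in genes[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assign each gene to its nearest centroid (squared Euclidean).
        for g in genes:
            dists = [
                sum((a - b) ** 2 for a, b in zip(profiles[g], c))
                for c in centroids
            ]
            assignment[g] = dists.index(min(dists))
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [profiles[g] for g in genes if assignment[g] == i]
            if members:
                centroids[i] = [mean(col) for col in zip(*members)]
    return assignment


def compose(*steps):
    """Chain steps so that each one consumes the previous step's output."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run


# Invented toy data: two rising profiles and two falling profiles.
raw = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [9.0, 5.0, 1.0],
    "geneC": [10.0, 20.0, 30.0],
    "geneD": [3.0, 2.0, 1.0],
}
pipeline = compose(normalise, lambda p: cluster(p, k=2))
groups = pipeline(raw)
```

After normalisation the rising profiles (geneA, geneC) land in one
group and the falling profiles (geneB, geneD) in the other, even though
their raw magnitudes differ. In a service-oriented setting each step
would be a remote call rather than a local function, but the
composition pattern is the same.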
Data Access and Integration
Using the analysis results, scientists then proceed to verify
that the biological significance of the subgroups can be
explained by referring to existing available information
about related genes, proteins, metabolic activities and
regulation. Such information is available from various
online databases. Each of the databases and their
querying interfaces are accessible as remote services.
Scientists have to search and find the appropriate service,
then connect to it, pulling out the relevant data, and finally
integrating the data in their Knowledge Discovery Process.
Following this, they can then proceed to perform further
analysis on the integrated data.
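The integration step can be sketched as a join between cluster
assignments and per-gene look-ups. The two dictionaries below are
stand-ins for remote database services, and the source names, gene
names and annotation values are all invented for illustration; a real
client would replace each look-up with a call to the corresponding
service.

```python
from collections import defaultdict


def integrate(groups, annotation_sources):
    """Attach every available annotation to each gene, grouped by cluster.

    `groups` maps gene -> cluster id; `annotation_sources` maps a source
    name -> a callable that returns an annotation or None for a gene.
    """
    merged = defaultdict(dict)
    for gene, cluster_id in groups.items():
        record = {}
        for source_name, lookup in annotation_sources.items():
            value = lookup(gene)
            if value is not None:
                record[source_name] = value
        merged[cluster_id][gene] = record
    return dict(merged)


# Local stand-ins for remote annotation databases (invented values).
pathway_db = {"geneA": "apoptosis", "geneC": "apoptosis", "geneB": "cell cycle"}
protein_db = {"geneA": "P001", "geneB": "P002"}

sources = {"pathway": pathway_db.get, "protein": protein_db.get}
groups = {"geneA": 0, "geneB": 1, "geneC": 0, "geneD": 1}
integrated = integrate(groups, sources)
```

Genes with no entry in a source simply carry no field for it, so gaps
in the databases stay visible rather than being filled with
placeholder values.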
Knowledge Management and Deployment
Once the process is complete, scientists publish their
results in a database, thus making them available to the
community. More importantly, they also publish the details
of the discovery processes (discovery work flows) in a
Discovery Process Store. This makes them available to
other scientists, who can verify the results and methods
used, as well as re-execute the published processes as
services in their own Knowledge Discovery Processes.
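Storing a process and re-executing it later can be sketched with a
small XML description. The element and attribute names below
("process", "step", "service") are hypothetical and are not the DPML
schema, which this text does not specify; the two step functions are
likewise invented. The sketch only shows the idea that a stored
description, rather than the results alone, is enough to re-run the
same procedure.

```python
import xml.etree.ElementTree as ET

# Hypothetical library of named steps; parameters arrive as strings
# because they are read back from XML attributes.
STEP_LIBRARY = {
    "scale": lambda values, factor="1": [v * float(factor) for v in values],
    "threshold": lambda values, minimum="0": [
        v for v in values if v >= float(minimum)
    ],
}


def to_xml(name, steps):
    """Serialise (service, params) pairs into an XML process description."""
    root = ET.Element("process", name=name)
    for service, params in steps:
        ET.SubElement(root, "step", service=service, **params)
    return ET.tostring(root, encoding="unicode")


def execute(document, data):
    """Re-run a stored process description on new input data."""
    root = ET.fromstring(document)
    for step in root.findall("step"):
        params = dict(step.attrib)
        service = params.pop("service")
        data = STEP_LIBRARY[service](data, **params)
    return data


# Store a two-step process, then re-execute it from the stored text.
stored = to_xml(
    "demo", [("scale", {"factor": "2"}), ("threshold", {"minimum": "5"})]
)
result = execute(stored, [1.0, 2.0, 3.0, 4.0])
```

Here the stored text doubles each value and keeps those of at least 5,
so result is [6.0, 8.0]. Another scientist holding only the stored
description could run execute on their own data to verify or reuse the
procedure.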
Discovery Net Testbeds
Discovery Net is a multi-disciplinary project. In addition to
developing the software infrastructure for Knowledge
Discovery Services, the project is also developing a series
of testbeds and demonstrators for using the technology in
the areas of life sciences, environmental modelling
and geo-hazard prediction.
Life Science Research Testbed
The first testbed implemented in the Discovery Net project
is an integrated platform for life science researchers,
integrating the analysis of data from novel high
throughput sequencing technologies and novel
protein chips with a wide range of existing data sources:
DNA microarray experiments.
NMR data for Metabolite analysis.
Online genomic, proteomic and metabolic
To realise this testbed, the project is developing
implementations of:
Specialised Knowledge Discovery Clients and
Web-based Clients for interactive visual analysis and
discovery process creation.
Knowledge Discovery Servers and their associated
adapters providing a variety of data analysis and
classification, clustering and time series analysis, as
well as adapters to the semi-structured life-science
data sources available online.
Life-Science Discovery Process Data Store for the
publication of domain-specific complex compositions of
Knowledge Discovery Processes, and allowing the
retrieval and deployment of such processes.
Knowledge Discovery Deployment Engine for the
execution of discovery processes and Knowledge
Reporting Tools for dissemination of the results of the
related discoveries.
The Knowledge Discovery Services provided by
Discovery Net are a set of application-level services. Their
implementation will be coupled with emerging grid
standards and technologies such as the OGSA (Open
Grid Services Architecture) standard that provides
services for managing lower-level grid computational
infrastructure and resources.
“The essential message of the OGSA is an abstraction of
the grid architecture in terms of a set of basic grid services.
The hope is that these services may be easily composed
with higher level services to deliver complex applications
and workflows”.
Tony Hey, Director of the UK e-Science Core Programme
The Discovery Net project is conducted at Imperial College of Science, Technology and Medicine.
Principal Investigator: Dr. Yike Guo (Dept of Computing).
Discovery Net Team: Prof. Tony Cass (Dept. of Biological Sciences), Prof. John Darlington (Dept. of
Computing), Dr. John Hassard (Dept. of Physics), Dr. Jian Guo Liu (Dept. of Earth Sciences), Dr. Daniel Ruckert
(Dept. of Computing), Prof. Robert Spence (Dept. of Electrical Engineering).
Discovery Net Collaborations: Discovery Net is currently collaborating with the National Centre for Data Mining
(NCDM) at the University of Illinois at Chicago for the creation of the Global Discovery Net project.
Contact Information:
Dr. Moustafa M. Ghanem, Discovery Net Project Manager
Department of Computing, Imperial College, London SW7 2BZ, United Kingdom
Email: Phone: +44 (0) 20 7594 8357 Fax: +44 (0) 20 7594 8246