Discovery Processes in Discovery Net
Jameel Syed, Moustafa Ghanem and Yike Guo
Discovery Net, Imperial College, London
Abstract
The activity of e-Science involves making discoveries by analysing data to find new knowledge.
Discoveries of value cannot be made by simply performing a pre-defined set of steps to produce a
result. Rather, there is an original, creative aspect to the activity which by its nature cannot be
automated. Discovery therefore concerns not only finding new knowledge but also finding a
process by which to find it. How discovery processes are modelled is thus key to effectively
practising e-Science. We argue that since a discovery process instance serves a similar purpose to a
mathematical proof, it should have similar properties: its results can be deterministically
reproduced when it is re-executed, and its intermediate results can be viewed to aid examination and
comprehension.
1. Analysis for Discovery
The massive increase in data generated from
new high-throughput methods for data
collection in life sciences has driven interest in
the emerging area of e-Science. Practitioners in
this field are scientific ‘knowledge workers’
performing analysis in-silico. The traditional
methodology of formulating a hypothesis,
designing an experiment to test the hypothesis,
running the experiment then studying the results
and evaluating is now largely computerised.
Experimental data in e-Science is either
captured electronically or generated by a
software simulation. For example, in gene
expression analysis the activity of the whole
genome may be recorded using microarray
technology.
e-Science concerns the development of new
practices and methods to find knowledge. Nontrivial, actionable knowledge cannot be batch
generated by a set of predefined methods, but
rather the creativity and expertise of the
scientist is necessary to formulate new
approaches. Whilst massively distributed,
service-oriented architectures hold much
promise for providing scientists with powerful
tools, their dynamic nature raises many
issues of complexity. New resources such as
online data sources, algorithms and methods
defined as processes are becoming available
daily. A single process may need to integrate
techniques from a range of disciplines such as
data mining, text mining, image mining,
bioinformatics and cheminformatics, and may
be created by a multidisciplinary team of
experts. A major
challenge is to effectively coordinate these
resources in a discovery environment in order to
create knowledge. The purpose of a discovery
environment is to allow users to perform their
analyses, computing required results and
creating a record of how this is accomplished.
The most widespread and promising approach is
to use process-based systems with re-usable
components that allow the end user to compose
new methods of analysis through a visual
programming interface.
Workflow systems and analysis software are
already well established in the sphere of
business but the requirements for scientific
disciplines are markedly different owing to the
nature of discovery. Two major uses of process
based systems in business are for production
workflows and business intelligence.
Production workflows[1], which grew from
document workflow systems, are primarily
concerned with tying production systems such
as operational databases together so that an
organisation can streamline its regular, day to
day activities. Such workflows are relatively
fixed, and authored by software engineers based
on models of accepted practice to the extent that
systems in certain domains[2] come predefined
with processes and databases. It is not
uncommon for businesses to change their actual
practices to fit the predefined processes supplied
with such a system. In production workflow
systems the emphasis is on repeatedly executing
a fixed set of processes so that their constituent
activities are performed again and again.
Business intelligence concerns volumes of
data that are too large to be handled manually
thus requiring computer-based approaches.
Knowledge Discovery in Databases (KDD)[3]
is a field that concerns the use of techniques
from machine learning, statistics and
visualisation to enable users to find useful,
actionable patterns in large quantities of data. It
is a multi-disciplinary collaborative activity
with data, algorithm and domain experts
working side by side on a project. Knowledge
discovery has been important in business for
many years, and the techniques used for
analysis have relevance in e-Science, but the
nature of processes and how they are
constructed is markedly different.
In e-Science the dynamic nature of
processes used and the operations they are
composed of are of central importance. A
primary function of practitioners is to create
new processes, which may utilise new resources
as they become available on a daily basis. Each
process instance is an experiment in itself,
defined in terms of specific data, and its aim is
to determine the validity of the approach.
Processes must be rapidly prototyped with a
tight edit-execute-edit cycle. The need for such
dynamism means that for creative scientific
work a new kind of environment and process
representation are needed to best support the
user.
2. Provenance and Audit of Discovery
Provenance is a loaded word with many
possible definitions and interpretations. In the
field of antiques it simply means how
authenticity is proven. In e-Science provenance
clearly concerns a record of the origins of data
for such purposes as peer review and
reproduction of results. A static record of the
events that occurred is readily captured, but creating a
dynamically re-executable artefact is a difficult
problem that relies on properties of the
environment in which the result is calculated.
An e-Science process will typically make use of
shared, distributed resources for execution and
as sources of data. Within an organisation,
LIMS (Laboratory Information Management
Systems), for example, can bridge the gap
between the physical world and the
computerised environment and provide a robust
cornerstone for all data exploration, recording
how data was captured. In contrast, for public
resources there is scant support for versioning,
and the worst case, the basic day-to-day
availability of resources, must be carefully considered by
organisations that rely on such services to
produce their intellectual property. Audit is
another loaded word, with similar scope for
interpretation. In the simplest sense it is very
similar to provenance: a record of the method by
which a result was produced. However, a fully
detailed record would also include how the
methodology evolved into the final configuration
that produced the end result.
Here we examine the following issues of
provenance and audit that need addressing for
discovery:
• Data archiving
• Operation versioning
• Provenance of a computed result
• Provenance of a discovery process
Data archiving concerns the data sources used
within the scope of a discovery
process instance. Operation versioning concerns
managing changes to executable resources such
as analysis algorithms over time. Provenance of
a computed result concerns derived data or
results that have been locally computed from a
discovery process. Provenance of a discovery
process concerns how a discovery process was
authored.
2.1 Data archiving
Data archiving concerns the maintenance of any
data sources that are used within a discovery
process. Whilst a process may access externally
stored data by reference there is no guarantee
that the external data source will not be updated
unbeknownst to the process author.
Guaranteeing the immutability of data
externally referenced from a discovery process
is vital to preserving its value, since without the
originally used data a process cannot be re-executed
or examined fully at a later date. In the
domain of e-Science we identify two sub-categories
of data that require archiving:
experimental data and reference data.
We define experimental data as the primary
data set(s) used in a discovery process. Such
data is either captured using computerised
instruments from performing physical
experiments, or generated using simulation. In
many cases this data is well managed within an
organisation owing to how expensive such data
may be to generate. LIMS record metadata of
the conditions of how data was captured in
terms of events that occurred at a certain time.
In contrast to a regular production database
system, these warehouses are designed as
immutable stores, providing versioned access to
the stored data such that at any time any stored
experiment can be retrieved with certainty that
it has not been modified.
We define reference data as supplemental,
supporting evidence data that is used to
augment the core experimental data in a project.
In current practice this kind of data is usually
fetched from public web-based data resources.
Items of experimental data are `looked up' using
a unique identifier to find related information.
These public data resources have scant support
for versioning, and indeed if we consider the
worst case, their basic day to day availability is
not even guaranteed. In addition to their
contents and basic availability, changes may
also occur to their logical schema such that
automated mechanisms for their access may
even fail to function. The use of such resources
is carefully considered by organisations that
rely on such services to produce their
intellectual property, but a trade-off must
always be made between using the most
up-to-date information available and a safe, locally
cached copy. At the least this choice must be
controlled.
Given the lack of a quality of service
guarantee from the data producer, it is the data
consumer's responsibility to maintain a copy of
the data. In life science there are hundreds of
online data resources that are continually being
modified, with more resources being added.
Data even propagates between these data
resources themselves in complex ways, with
resources containing derivations or direct copies
of others.
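To make the consumer-side responsibility concrete, the following sketch (Python, with a hypothetical cache directory and URL template rather than any real resource) shows a reference lookup by unique identifier that archives a local copy, so later re-executions read the archived record instead of the live resource.

import json
import os
import urllib.request

CACHE_DIR = "reference_cache"   # hypothetical local archive location

def fetch_reference(identifier, url_template):
    # url_template is assumed to contain an '{id}' placeholder, e.g.
    # "https://example.org/records/{id}.json" (illustrative only).
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_path = os.path.join(CACHE_DIR, identifier + ".json")
    if os.path.exists(cache_path):
        # Archived copy: re-execution is independent of the public resource.
        with open(cache_path) as f:
            return json.load(f)
    with urllib.request.urlopen(url_template.format(id=identifier)) as response:
        record = json.load(response)
    with open(cache_path, "w") as f:
        json.dump(record, f)    # keep a local copy for later re-execution
    return record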
2.2 Operation versioning
Similarly to data archiving, the use of
executable analysis operations needs to be
carefully considered, particularly remote
services provided by other organisations. The
growth in the use of web services has enabled
easier provision of services to the public, which
can be consumed from a variety of programming
languages and platforms. Reliance on such
resources for the reproduction of important
scientific results requires considerable trust in
the owning organisation's ability to support
them sufficiently.
2.3 Provenance of a computed result
In contrast to archived experimental or
reference data, we characterise this type of data
by the fact that it has been computed as part of
data analysis. This result is the ‘nugget of
knowledge’ that has been found through the
process of discovery, and may vary in scale
from a concise model generated through the use
of machine learning algorithms to an archived
data set that has been transformed or augmented
in some way.
This issue is the corollary of our requirement
that discovery processes be re-executable such
that they can generate identical results: a
computed result must possess a corresponding
record of how it was calculated. The discovery
process description serves the role of both
record of how the data was computed, and the
justification for why it was computed in that
way.
The discovery process representation should
include enough detail so that processes can be
re-executed, and also additional details about
how and why they were constructed. If we
compare the artefacts of an e-Science process
and a traditional mathematical proof or
scientific publication the challenges are clear. A
mathematical proof is relatively self contained,
maybe referencing some widely accepted
theorems. An e-Science process is a knowledge
schema representing how to coordinate the
integration of distributed resources managed by
different organisations to produce knowledge.
The annotation of such a schema requires much
more than margin notes, and associated
provenance information is equally essential
given the different levels of control over aspects
of the coordination and nature of their
integration.
A discovery process assembled from
disparate resources must retain integrity such
that it can be relied on to guide costly real-world
experiments that could potentially
involve lives. A discovery process must be
re-executable at any time to reproduce its results,
independently of any other processes. Dataflow
graphs have some very desirable properties for
representing a proof of how a result was
generated. Discovery processes must make
`non-destructive' use of distributed resources from
within and external to an organisation. Further,
each instance must maintain its own state
independently. These properties help ensure that
a process can be re-executed at a later date
without affecting any other systems or data.
Since a discovery process used to generate a
result may itself make use of other datasets,
there may be a dependency between maintaining
the provenance of computed results and
maintaining archived data sources.
2.4 Provenance of a discovery process
The final class of data requiring provenance is
the discovery process itself. Since a discovery
process may be authored in a multi-user
environment, and be used as the basis for
defining the provenance of computed results, it
is vital that the history of how the process came
to be is recorded. This audit information should contain
both captured parameter change information
tracking the evolution of the configuration of
the process, and also manually entered user
annotations to help justify the course of analysis
chosen.
3. Discovery Environments
In order to address the issue of reproducibility
of results we characterise two kinds of
discovery environments, strict and inclusive,
and show how the nature of these environments
determines the properties of our e-Science
processes. The successful unification of these
approaches is necessary to allow adequate
scrutiny of e-Science results and full use of
available resources.
3.1 Strict discovery environments
A strict discovery environment is one where all
the semantics of data interactions between
resources used with the process are explicitly
expressed in the model and are a transparent
part of the process representation. We model
this kind of controlled environment using
dataflow processes. A dataflow is represented as
a directed graph whose nodes are functions,
where arcs represent the flow of data between
functions. Additional function parameterisation
is specified by means of values stored with each
function instance. In such an environment, any
access to external data resources, either as a
data source or as a lookup to augment existing
data, is consumer-cached by the environment
so that processes can be re-executed to produce
identical results.
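As an illustration of this representation, the following sketch (Python, with purely illustrative names rather than the actual Discovery Net interfaces) models a dataflow process as function nodes carrying their own parameters, connected by arcs along which data flows; execution memoises intermediate results so they can be inspected afterwards.

class Node:
    """One operation in a strict dataflow graph (names are illustrative)."""
    def __init__(self, name, func, params=None, inputs=None):
        self.name = name
        self.func = func                  # the function this node computes
        self.params = params or {}        # parameterisation stored with the node
        self.inputs = inputs or []        # arcs: upstream nodes supplying data

def execute(node, results=None):
    """Run a node by first running its inputs; memoise results so shared
    nodes run once and intermediate results remain inspectable afterwards."""
    results = {} if results is None else results
    if node.name not in results:
        upstream = [execute(n, results) for n in node.inputs]
        results[node.name] = node.func(*upstream, **node.params)
    return results[node.name]

# Example: a two-step pipeline whose parameters live on the nodes themselves.
load = Node("load", lambda path: [1.0, 2.0, 4.0], params={"path": "archived_table"})
scale = Node("scale", lambda xs, factor: [x * factor for x in xs],
             params={"factor": 0.5}, inputs=[load])
print(execute(scale))   # [0.5, 1.0, 2.0]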
By strictly modelling all interactions
between resources, and containing the scope of
state change in the system, non-destructive
interactive authoring and execution of processes
is possible. For example, users may explore
possibilities in a what-if manner without
disrupting shared resources or relying on them
not being changed by other users. A wide range
of semantics for how data flows through
dataflow graphs are possible depending on
application area (data analysis, signal
processing, image processing).
3.2 Inclusive discovery environments
An inclusive discovery environment can be
modelled as a classic activity based workflow.
An activity based workflow is unmanaged in
both data scope and state since the user of the
system is not aware of where the data used in
the execution has originated from and what state
changes an execution will cause. These
properties are understandable given that, from
their inception, workflow systems were
designed to be tied to production systems:
executions may be triggered by external events,
small fragments of data are passed as messages
between different systems, and the global state
of the system must be carefully designed
by software engineers. Activity executions are
logged but few guarantees can be made about
the effects of ad-hoc re-execution. A workflow
may even involve activities requiring direct
human interaction, perhaps to operate a piece of
equipment or manually verify data.
4. Discovery Net
In this section we examine discovery processes
in Discovery Net and detail how the concepts of
strict and inclusive discovery environments are
tackled. Figure 1 shows how the Discovery Net
environment fundamentally aims to be a strict
environment whilst also inclusively using
external resources. Many external resources are
managed by the environment, which provides
facilities for archiving both tabular and
semi-structured data used in processes.
Figure 1: A view of the Discovery Net environment
4.1 Data archiving
In e-Science a large proportion of public
services are `lookup' style database resources
providing domain-specific reference data that
needs to be combined with experimental data
created or captured by the scientist. The
volatility of these resources, in terms of service
availability and non-versioned data updates, is
mitigated by the discovery environment which
guarantees that data used in warehoused
processes is permanently available.
Where experimental data is accessed from
databases, tables are first imported into a data
store within the environment, where they remain
for the lifetime of the discovery processes that
use them. In this way operational databases may
change over time but saved processes still
remain re-executable. Similarly, reference data
accessed by the Discovery Net InfoGrid[4]
system is fetched and parsed by this subsystem
and persisted within the environment.
Once a reproducible discovery process has
been developed and validated it is then usual for
it to be put into production use and re-executed
with ‘live’ (time-variant) data sources. Users are
able to override the system’s data caching
mechanisms and fetch data dynamically from the
original sources when processes are in
production use, allowing access to the most
up-to-date data. The usual practice is for these
deployed processes to be copies of frozen
‘proof’ versions of the process, modified only
with respect to how they access source data.
Figure 2: A discovery process deployed to the web
Figure 2 illustrates a discovery process that
has been put into production use by deploying it
to the web. Users are able to provide their own
data with which to re-execute the process. In
this case any new data used with the process is
stored in the discovery environment for later
access.
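The pattern of freezing a validated ‘proof’ process and deploying a copy that differs only in how it reads source data might be sketched as follows (Python; the class and field names are our own illustration, not the Discovery Net implementation).

import copy

class DataSourceNode:
    def __init__(self, name, fetch_live, archived_snapshot, use_live=False):
        self.name = name
        self.fetch_live = fetch_live                # callable hitting the original, 'live' source
        self.archived_snapshot = archived_snapshot  # data frozen with the 'proof' version
        self.use_live = use_live                    # deployed copies may override this

    def read(self):
        if self.use_live:
            return self.fetch_live()                # most up-to-date data
        return copy.deepcopy(self.archived_snapshot)  # identical data on every re-execution

# The deployed process is a copy of the frozen 'proof' version, modified only
# in how it accesses source data.
proof_source = DataSourceNode("expression_table", fetch_live=lambda: [],
                              archived_snapshot=[[1, 2], [3, 4]])
deployed_source = copy.deepcopy(proof_source)
deployed_source.use_live = True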
4.2 Operation versioning
Discovery Net supports the use of both strict
and inclusive styles of executable resources,
which may even be combined in the same
process. The manner in which these two
approaches are combined depends entirely on the
way in which the system is configured for use.
The strict method for integrating operations
tightly ties them to the discovery environment,
with the program code accessible only via the
environment. The discovery process includes
versioning information for each operation. Each
operation is managed by a well-defined wrapper
describing its input/output types, parameters and
other characteristics such as name and icon. All
data exchange between operations is done
explicitly in terms of operation graphs (see Figure 3).
Execution and parameterisation of the operation
is managed via this wrapper. Versioning allows
the functionality of a conceptual operation to be
updated by adding a new wrapper. A new
wrapper may coexist with older wrappers (to
allow older processes to use the old wrapper
unmodified) or exist as a replacement for old
operations. Maintaining reproducibility of many
concurrent versions of the same operations is
especially important in e-Science given the rate
of development of methodologies in some areas.
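The following sketch (Python, with hypothetical names and a registry introduced purely for illustration) captures the wrapper and versioning idea: each wrapper declares its interface and version, several versions of the same conceptual operation may coexist, and a stored process resolves the exact version it was authored against.

from dataclasses import dataclass, field

@dataclass
class OperationWrapper:
    name: str                      # e.g. "KMeansClustering"
    version: str                   # recorded in the discovery process description
    input_types: list
    output_types: list
    parameters: dict = field(default_factory=dict)

REGISTRY = {}                      # (name, version) -> wrapper

def register(wrapper):
    REGISTRY[(wrapper.name, wrapper.version)] = wrapper

def resolve(name, version):
    """Return the exact wrapper version a stored process refers to."""
    return REGISTRY[(name, version)]

# Two versions coexist; older processes still resolve ("KMeansClustering", "1.0").
register(OperationWrapper("KMeansClustering", "1.0", ["table"], ["model"], {"k": 3}))
register(OperationWrapper("KMeansClustering", "2.0", ["table"], ["model"],
                          {"k": 3, "metric": "euclidean"}))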
The inclusive approach, by contrast, utilises
operations that exist outside the scope and direct
management of the discovery environment.
These remote resources are increasingly being
implemented using web services technology.
There are three aspects to how a service may
change.
The most basic is the availability of the
service: is the service still running? For this
reason alone, large organisations such as
pharmaceutical companies typically mirror
copies of frequently accessed public services
in-house, a practice also motivated by concerns
relating to the privacy of data.
The second versioning issue concerns a
change to the WSDL schema of the service
where elements have been added, removed or
modified. In many cases, particularly where
web services are ‘document’ rather than ‘RPC’
type, minor modifications may be handled
transparently since the software reading an
XML message will generally ignore any data
that is in addition to what it is looking for.
However, where data is removed or modified,
the re-execution of a process is much less likely
to run smoothly. For this reason, the service
WSDL must be cached and compared at
re-execution time so that the user may be informed
of the change.
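A minimal sketch of such a check, assuming the service exposes its WSDL at the conventional '?wsdl' address and using only standard-library calls, might look as follows; the cached description is recorded when the process is authored and diffed at re-execution time.

import difflib
import pathlib
import urllib.request

def check_wsdl(service_url, cache_file):
    current = urllib.request.urlopen(service_url + "?wsdl").read().decode("utf-8")
    cache = pathlib.Path(cache_file)
    if not cache.exists():
        cache.write_text(current)   # first use: record the WSDL alongside the process
        return []
    cached = cache.read_text()
    # Any differences are reported so the user can decide whether re-execution is safe.
    return list(difflib.unified_diff(cached.splitlines(), current.splitlines(),
                                     lineterm=""))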
The third type of issue concerns whether the
behaviour of the service has changed,
independently of its interfaces. Again, there is
little that can be done by the consumer of the
service if its owner modifies its behaviour: it
may not even be evident that any change has
occurred until a certain input case is
encountered.
4.3 Provenance of a computed result
Provenance of a computed result, such as a list
of interesting genes, is managed in Discovery
Net by embedding the process description graph
used to calculate the result into the entity
representing the materialised result. At any later
date the user may inspect the provenance of the
result by expanding the node to show the source
graph.
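Conceptually, the materialised result can be thought of as a value packaged together with the graph that produced it; the sketch below (Python, with illustrative field names rather than the actual Discovery Net data model) shows this embedding.

from dataclasses import dataclass

@dataclass
class MaterialisedResult:
    name: str           # e.g. "interesting_genes"
    data: object        # the computed value itself
    provenance: dict    # embedded process description graph (nodes, edges, parameters)

result = MaterialisedResult(
    name="interesting_genes",
    data=["geneA", "geneB"],
    provenance={"nodes": [{"op": "KMeansClustering", "version": "2.0", "params": {"k": 3}}],
                "edges": []},
)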
4.4 Provenance of a discovery process
In Discovery Net, provenance information for
how a discovery process was performed, and
how that process was constructed over time are
stored in DPML (Discovery Process Markup
Language), an XML language for describing
data flow graphs. A process is modelled as a
directed acyclic graph whose edges define data
flows between operations. For each operation
DPML records the following (a sketch of such a
record appears after the list):
• Parameterisation of the operation (e.g.
for K-means clustering: the value of k,
the columns of data used and the
distance metric)
• User annotations to help explain the
use of the operation
• A change history of how the
parameterisation of the operation
changed over time by different users
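A hedged sketch of what such a per-operation record might contain when serialised is given below; the element names are our own guesses for illustration, not the actual DPML schema. The point is that parameters, annotations and a change history are all stored alongside the graph structure.

import xml.etree.ElementTree as ET

op = ET.Element("operation", name="KMeansClustering", version="2.0")
params = ET.SubElement(op, "parameters")
ET.SubElement(params, "parameter", name="k").text = "3"
ET.SubElement(params, "parameter", name="metric").text = "euclidean"
ET.SubElement(op, "annotation", author="analyst1").text = "Clustering normalised expression values."
history = ET.SubElement(op, "history")
ET.SubElement(history, "change", user="analyst2", when="2003-05-01").text = "k changed from 5 to 3"

print(ET.tostring(op, encoding="unicode"))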
In Figure 3 we show how a stored discovery
process can be accessed via the web. Since this
description contains both end user annotation
and information for execution, and since the
discovery environment strictly manages the data
and operations used, users are able to inspect
and re-execute analyses at any time. Not only
can the entire process be re-executed, but also
this active reporting mechanism allows
intermediate results to be re-calculated on
demand (see ‘Execute from this node’ links).
Figure 3: Active reporting of a discovery process
5. Conclusion
In this paper we have presented two classes of
discovery environment. Discovery Net
endeavours to provide as complete support as is
possible for use as a strict environment for
e-Science data analysis. However, in the interest
of providing a fully open and extensible system,
users may choose a more inclusive approach,
adding resources ad hoc where necessary. In
choosing how to use and configure the system,
users may balance their desire for flexibility
with strict process integrity considerations.
References
[1] Frank Leymann and Dieter Roller.
Production Workflow. Prentice Hall, 2000.
[2] SAP R/3. http://www.sap.com/
[3] Fayyad et al. Knowledge Discovery and
Data Mining: Towards a Unifying Framework.
In Proceedings of the Second International
Conference on Knowledge Discovery and Data
Mining. AAAI Press, 1996.
[4] Giannadakis et al. InfoGrid: Providing
Information Integration for Knowledge
Discovery. Journal of Information Science,
199-226, 2003.