Discovery Processes in Discovery Net

Jameel Syed, Moustafa Ghanem and Yike Guo
Discovery Net, Imperial College, London

Abstract

The activity of e-Science involves making discoveries by analysing data to find new knowledge. Discoveries of value cannot be made by simply performing a pre-defined set of steps to produce a result. Rather, there is an original, creative aspect to the activity which by its nature cannot be automated. In addition to finding new knowledge, discovery therefore also concerns finding a process to find new knowledge. How discovery processes are modelled is therefore key to effectively practising e-Science. We argue that since a discovery process instance serves a similar purpose to a mathematical proof it should have similar properties, namely that results can be deterministically reproduced when the process is re-executed and that intermediate results can be viewed to aid examination and comprehension.

1. Analysis for Discovery

The massive increase in data generated by new high-throughput methods for data collection in the life sciences has driven interest in the emerging area of e-Science. Practitioners in this field are scientific 'knowledge workers' performing analysis in silico. The traditional methodology of formulating a hypothesis, designing an experiment to test the hypothesis, running the experiment, then studying and evaluating the results is now largely computerised. Experimental data in e-Science is either captured electronically or generated by a software simulation. For example, in gene expression analysis the activity of the whole genome may be recorded using microarray technology.

e-Science concerns the development of new practices and methods to find knowledge. Nontrivial, actionable knowledge cannot be batch generated by a set of predefined methods; rather, the creativity and expertise of the scientist is necessary to formulate new approaches. Whilst massively distributed service-oriented architectures hold much promise for providing scientists with powerful tools, their dynamic nature raises many issues of complexity. New resources such as online data sources, algorithms and methods defined as processes are becoming available daily. A single process may need to integrate techniques from a range of disciplines such as data mining, text mining, image mining, bioinformatics and cheminformatics, and may be created by a multidisciplinary team of experts. A major challenge is to effectively coordinate these resources in a discovery environment in order to create knowledge.

The purpose of a discovery environment is to allow users to perform their analyses, computing required results and creating a record of how this is accomplished. The most widespread and promising approach is to use process-based systems with re-usable components that allow the end user to compose new methods of analysis through a visual programming interface.

Workflow systems and analysis software are already well established in the sphere of business, but the requirements of scientific disciplines are markedly different owing to the nature of discovery. Two major uses of process-based systems in business are production workflows and business intelligence. Production workflows[1], which grew from document workflow systems, are primarily concerned with tying production systems such as operational databases together so that an organisation can streamline its regular, day-to-day activities.
Such workflows are relatively fixed, and are authored by software engineers based on models of accepted practice, to the extent that systems in certain domains[2] come predefined with processes and databases. It is not uncommon for businesses to change their actual practices to fit the predefined processes supplied with such a system. In production workflow systems the emphasis is on the repeated execution of a fixed set of processes and of their constituent activities.

Business intelligence concerns volumes of data that are too large to be handled manually, thus requiring computer-based approaches. Knowledge Discovery in Databases (KDD)[3] is a field that concerns the use of techniques from machine learning, statistics and visualisation to enable users to find useful, actionable patterns in large quantities of data. It is a multi-disciplinary, collaborative activity with data, algorithm and domain experts working side by side on a project. Knowledge discovery has been important in business for many years, and the techniques used for analysis have relevance in e-Science, but the nature of the processes and how they are constructed is markedly different.

In e-Science the dynamic nature of the processes used, and of the operations they are composed of, is of central importance. A primary function of practitioners is to create new processes, which may utilise new resources as they become available on a daily basis. Each process instance is an experiment in itself, described in terms of specific data, whose aim is to determine the validity of the approach. Processes must be rapidly prototyped with a tight edit-execute-edit cycle. The need for such dynamism means that for creative scientific work a new kind of environment and process representation are needed to best support the user.

2. Provenance and Audit of Discovery

Provenance is a loaded word with many possible definitions and interpretations. In the field of antiques it simply means how authenticity is proven. In e-Science, provenance clearly concerns a record of the origins of data for such purposes as peer review and reproduction of results. A static record of the events that occurred is readily captured, but creating a dynamically re-executable artefact is a difficult problem that relies on properties of the environment in which the result is calculated. An e-Science process will typically make use of shared, distributed resources for execution and as sources of data. Within an organisation, LIMS (Laboratory Information Management Systems), for example, can bridge the gap between the physical world and the computerised environment and provide a robust cornerstone for all data exploration, recording how data was captured. In contrast, for public resources there is scant support for versioning, and the worst case, basic day-to-day availability of resources, must be carefully considered by organisations that rely on such services to produce their intellectual property.

Audit is another loaded word, with similar scope for interpretation. In the simplest sense it is very similar to provenance: a record of the method by which a result was produced. However, a fully detailed record would also include how the methodology evolved to its final configuration in reaching an end result.
Here we examine the following issues of provenance and audit that need addressing for discovery:

• Data archiving
• Operation versioning
• Provenance of a computed result
• Provenance of a discovery process

Data archiving concerns the data sources used within the scope of a discovery process instance. Operation versioning concerns managing changes to executable resources, such as analysis algorithms, over time. Provenance of a computed result concerns derived data or results that have been locally computed from a discovery process. Provenance of a discovery process concerns how a discovery process was authored.

2.1 Data archiving

Data archiving concerns the maintenance of any data sources that are used within a discovery process. Whilst a process may access externally stored data by reference, there is no guarantee that the external data source will not be updated unbeknownst to the process author. Guaranteeing the immutability of data externally referenced from a discovery process is vital to preserving its value, since without the originally used data a process cannot be re-executed or examined fully at a later date. In the domain of e-Science we identify two subcategories of data that require archiving: experimental data and reference data.

We define experimental data as the primary data set(s) used in a discovery process. Such data is either captured using computerised instruments from performing physical experiments, or generated using simulation. In many cases this data is well managed within an organisation owing to how expensive it may be to generate. LIMS record metadata about the conditions under which data was captured, in terms of events that occurred at a certain time. In contrast to a regular production database system, these warehouses are designed as immutable stores, providing versioned access to the stored data such that at any time any stored experiment can be retrieved with certainty that it has not been modified.

We define reference data as supplemental, supporting evidence that is used to augment the core experimental data in a project. In current practice this kind of data is usually fetched from public web-based data resources. Items of experimental data are 'looked up' using a unique identifier to find related information. These public data resources have scant support for versioning, and indeed if we consider the worst case, their basic day-to-day availability is not even guaranteed. In addition to their contents and basic availability, changes may also occur to their logical schema such that automated mechanisms for their access may even fail to function. The use of such resources is carefully considered by organisations that rely on them to produce their intellectual property, but a trade-off must always be made between using the most up-to-date information available and a safe, locally cached copy. At the least this choice must be controlled. Given the lack of a quality of service guarantee from the data producer, it is the data consumer's responsibility to maintain a copy of the data. In life science there are hundreds of online data resources that are continually being modified, with more resources being added. Data even propagates between these data resources themselves in complex ways, with resources containing derivations or direct copies of others.
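Since it is the data consumer's responsibility to maintain its own copy of reference data, a minimal sketch of a consumer-side archive may help to make the idea concrete. The ReferenceArchive class, its on-disk layout and the live override are illustrative assumptions for this discussion, not a description of Discovery Net's actual mechanism.

    import hashlib
    import json
    import time
    from pathlib import Path
    from urllib.request import urlopen

    class ReferenceArchive:
        """Immutable, consumer-side cache of reference-data lookups."""

        def __init__(self, root: Path):
            self.root = Path(root)
            self.root.mkdir(parents=True, exist_ok=True)

        def _key(self, url: str) -> Path:
            # One file per distinct lookup, named by a hash of the URL.
            return self.root / hashlib.sha256(url.encode()).hexdigest()

        def lookup(self, url: str, live: bool = False) -> bytes:
            """Return the archived bytes for url, fetching only on first use.

            live=True bypasses the archive, mimicking the production-time
            override for time-variant sources described in section 4.1.
            """
            if live:
                return urlopen(url).read()
            path = self._key(url)
            if not path.exists():
                data = urlopen(url).read()       # first and only fetch
                path.write_bytes(data)           # archive the exact bytes used
                path.with_suffix(".meta").write_text(
                    json.dumps({"url": url, "fetched": time.time()}))
            return path.read_bytes()

A process that performs all of its lookups through such an archive can be re-executed later against exactly the reference data it originally saw, regardless of what has since happened to the public resource.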
2.2 Operation versioning

Similarly to data archiving, the use of executable analysis operations needs to be carefully considered, particularly remote services provided by other organisations. The growth in the use of web services has made it easier to provide services to the public that can be consumed from a variety of programming languages and platforms. Reliance on such resources for the reproduction of important scientific results requires considerable trust in the owning organisation's ability to support them sufficiently.

2.3 Provenance of a computed result

In contrast to archived experimental or reference data, we characterise this type of data by the fact that it has been computed as part of data analysis. This result is the 'nugget of knowledge' that has been found through the process of discovery, and may vary in scale from a concise model generated through the use of machine learning algorithms to an archived data set that has been transformed or augmented in some way. This issue is the corollary of our requirement that discovery processes be re-executable such that they can generate identical results: a computed result must possess a corresponding record of how it was calculated. The discovery process description serves as both the record of how the data was computed and the justification for why it was computed in that way. The discovery process representation should include enough detail so that processes can be re-executed, and also additional details about how and why they were constructed.

If we compare the artefacts of an e-Science process and a traditional mathematical proof or scientific publication the challenges are clear. A mathematical proof is relatively self-contained, perhaps referencing some widely accepted theorems. An e-Science process is a knowledge schema representing how to coordinate the integration of distributed resources, managed by different organisations, to produce knowledge. The annotation of such a schema requires much more than margin notes, and associated provenance information is equally essential given the differing levels of control over aspects of the coordination and the nature of their integration. A discovery process assembled from disparate resources must retain integrity such that it can be relied on to guide costly real-world experiments that could potentially involve lives. A discovery process must be re-executable at any time to reproduce its result, independently of any other processes.

Dataflow graphs have some very desirable properties for representing a proof of how a result was generated. Discovery processes must make 'non-destructive' use of distributed resources from within and external to an organisation. Further, each instance must maintain its own state independently. These properties help ensure that a process can be re-executed at a later date without affecting any other systems or data. Since a discovery process used to generate a result may itself make use of other datasets, there may be a dependency between maintaining the provenance of computed results and maintaining archived data sources.

2.4 Provenance of a discovery process

The final class of data requiring provenance is the discovery process itself. Since a discovery process may be authored in a multi-user environment, and be used as the basis for defining the provenance of computed results, it is vital that the history of how the process came to be is itself recorded. This audit information should contain both captured parameter change information tracking the evolution of the configuration of the process, and also manually entered user annotations to help justify the course of analysis chosen.
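To make this audit requirement concrete, the sketch below records parameter changes and user annotations against a single operation. The ChangeEvent and OperationAudit structures and their method names are illustrative assumptions, not DPML or the Discovery Net representation described later.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import Any, Dict, List

    @dataclass
    class ChangeEvent:
        """One captured edit to an operation's configuration."""
        user: str
        timestamp: datetime
        parameter: str
        old_value: Any
        new_value: Any

    @dataclass
    class OperationAudit:
        """Audit trail for a single operation in a discovery process."""
        operation: str
        parameters: Dict[str, Any] = field(default_factory=dict)
        annotations: List[str] = field(default_factory=list)   # manual justifications
        history: List[ChangeEvent] = field(default_factory=list)

        def set_parameter(self, user: str, name: str, value: Any) -> None:
            # Record the change before applying it, so the evolution of the
            # configuration can be replayed and examined later.
            self.history.append(ChangeEvent(
                user, datetime.now(timezone.utc), name,
                self.parameters.get(name), value))
            self.parameters[name] = value

        def annotate(self, note: str) -> None:
            self.annotations.append(note)

    # Example: two analysts converge on a clustering configuration.
    audit = OperationAudit("k-means clustering")
    audit.set_parameter("alice", "k", 5)
    audit.annotate("k=5 chosen after inspecting a pilot run")
    audit.set_parameter("bob", "k", 8)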
3. Discovery Environments

In order to address the issue of reproducibility of results we characterise two kinds of discovery environments, strict and inclusive, and show how the nature of these environments determines the properties of our e-Science processes. The successful unification of these approaches is necessary to allow adequate scrutiny of e-Science results and full use of available resources.

3.1 Strict discovery environments

A strict discovery environment is one where all the semantics of data interactions between the resources used with the process are explicitly expressed in the model and are a transparent part of the process representation. We model this kind of controlled environment using dataflow processes. A dataflow is represented as a directed graph whose nodes are functions and whose arcs represent the flow of data between functions. Additional function parameterisation is specified by means of values stored with each function instance. In such an environment, any access to external data resources, either as a data source or as a lookup to augment existing data, is consumer-cached by the environment so that processes can be re-executed to produce identical results.

By strictly modelling all interactions between resources, and containing the scope of state change in the system, non-destructive interactive authoring and execution of processes is possible. For example, users may explore possibilities in a what-if manner without disrupting shared resources or relying on them not being changed by other users. A wide range of semantics for how data flows through dataflow graphs is possible depending on the application area (data analysis, signal processing, image processing).

3.2 Inclusive discovery environments

An inclusive discovery environment can be modelled as a classic activity-based workflow. An activity-based workflow is unmanaged in both data scope and state, since the user of the system is not aware of where the data used in an execution has originated or what state changes the execution will cause. These properties are understandable: from their inception, workflow systems were designed to be tied to production systems, to the extent that executions may be triggered by external events, with small fragments of data passed as messages between different systems, and the global state of the system must be carefully designed by software engineers. Activity executions are logged but few guarantees can be made about the effects of ad hoc re-execution. A workflow may even involve activities requiring direct human interaction, perhaps to operate a piece of equipment or manually verify data.

4. Discovery Net

In this section we examine discovery processes in Discovery Net and detail how the concepts of strict and inclusive discovery environments are addressed. Figure 1 shows how the Discovery Net environment fundamentally aims to be a strict environment whilst also inclusively using external resources. Many external resources are managed by the environment, which provides facilities for archiving both tabular and semi-structured data used in processes.

Figure 1: A view of the Discovery Net environment
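Before turning to the individual provenance mechanisms, the strict dataflow representation of section 3.1 can be made concrete with a small sketch: nodes are functions, arcs carry data between them, and parameters are stored with each node instance, while intermediate results are retained so they can be inspected or recomputed. The Node and DataflowGraph classes and the pull-based execution strategy are illustrative assumptions, not Discovery Net's implementation.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class Node:
        """A function instance in the dataflow graph, with its own parameters."""
        name: str
        func: Callable[..., object]
        params: Dict[str, object] = field(default_factory=dict)

    class DataflowGraph:
        """Directed acyclic dataflow: arcs carry one node's output to another."""

        def __init__(self):
            self.nodes: Dict[str, Node] = {}
            self.arcs: List[Tuple[str, str]] = []   # (producer, consumer)

        def add(self, node: Node) -> None:
            self.nodes[node.name] = node

        def connect(self, producer: str, consumer: str) -> None:
            self.arcs.append((producer, consumer))

        def execute(self, target: str, cache: Dict[str, object] = None) -> object:
            """Compute target, memoising every intermediate result."""
            cache = {} if cache is None else cache
            if target not in cache:
                inputs = [self.execute(p, cache) for p, c in self.arcs if c == target]
                node = self.nodes[target]
                cache[target] = node.func(*inputs, **node.params)
            return cache[target]

    # Example: load -> filter -> summarise, deterministic on every re-execution.
    g = DataflowGraph()
    g.add(Node("load", lambda: [1, 4, 9, 16, 25]))
    g.add(Node("filter", lambda xs, threshold: [x for x in xs if x > threshold],
               {"threshold": 5}))
    g.add(Node("summarise", lambda xs: sum(xs) / len(xs)))
    g.connect("load", "filter")
    g.connect("filter", "summarise")
    print(g.execute("summarise"))   # always the same value for the same graph

Because every node's output is held in the cache, any intermediate result can be viewed or recomputed on demand, which is the property exploited by the active reporting described in section 4.4.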
4.1 Data archiving

In e-Science a large proportion of public services are 'lookup'-style database resources providing domain-specific reference data that needs to be combined with experimental data created or captured by the scientist. The volatility of these resources, in terms of service availability and non-versioned data updates, is mitigated by the discovery environment, which guarantees that data used in warehoused processes is permanently available. Where experimental data is accessed from databases, tables are first imported into a data store within the environment, where they remain for the lifetime of the discovery processes that use them. In this way operational databases may change over time but saved processes still remain re-executable. Similarly, reference data accessed by the Discovery Net InfoGrid[4] system is fetched and parsed by this subsystem and persisted within the environment.

Once a reproducible discovery process has been developed and validated it is then usual for it to be put into production use and re-executed with 'live' (time-variant) data sources. For processes in production use, users are able to override the system's data caching mechanisms and dynamically fetch data from the original sources so that the most up-to-date data is used. The usual practice is for these deployed processes to be copies of frozen 'proof' versions of the process, modified only with respect to how they access source data.

Figure 2: A discovery process deployed to the web

Figure 2 illustrates a discovery process that has been put into production use by deploying it to the web. Users are able to provide their own data with which to re-execute the process. In this case any new data used with the process is stored in the discovery environment for later access.

4.2 Operation versioning

Discovery Net supports the use of both strict and inclusive styles of executable resources, which may even be combined in the same process. The manner in which these two approaches are combined depends entirely on the way in which the system is configured for use.

The strict method for integrating operations tightly ties them to the discovery environment, with the program code accessible only via the environment. The discovery process includes versioning information for each operation. Each operation is managed by a well-defined wrapper describing its input/output types, parameters and other characteristics such as name and icon. All data exchange between operations is done explicitly in terms of operation graphs (see Figure 3). Execution and parameterisation of the operation is managed via this wrapper. Versioning allows the functionality of a conceptual operation to be updated by adding a new wrapper. A new wrapper may coexist with older wrappers (to allow older processes to use the old wrapper unmodified) or exist as a replacement for old operations. Maintaining reproducibility across many concurrent versions of the same operations is especially important in e-Science given the rate of development of methodologies in some areas.
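A minimal sketch of this wrapper-and-version scheme is given below, in which several versions of a conceptual operation coexist so that a frozen process keeps resolving to the wrapper it was authored against while new analyses pick up the latest one. The OperationWrapper and OperationRegistry names and the resolve behaviour are illustrative assumptions rather than the Discovery Net API.

    from dataclasses import dataclass
    from typing import Callable, Dict, Optional, Tuple

    @dataclass(frozen=True)
    class OperationWrapper:
        """Describes one version of an operation: types, parameters, presentation."""
        name: str
        version: int
        input_types: Tuple[str, ...]
        output_type: str
        parameters: Tuple[str, ...]
        icon: str
        implementation: Callable[..., object]

    class OperationRegistry:
        """Keeps every wrapper version so that old processes remain re-executable."""

        def __init__(self):
            self._wrappers: Dict[Tuple[str, int], OperationWrapper] = {}

        def register(self, wrapper: OperationWrapper) -> None:
            self._wrappers[(wrapper.name, wrapper.version)] = wrapper

        def resolve(self, name: str, version: Optional[int] = None) -> OperationWrapper:
            # A stored process asks for the exact version it was authored with;
            # interactive users omit the version and receive the newest wrapper.
            if version is not None:
                return self._wrappers[(name, version)]
            candidates = [w for (n, _), w in self._wrappers.items() if n == name]
            return max(candidates, key=lambda w: w.version)

    # Example: a normalisation operation is revised, but version 1 stays available.
    registry = OperationRegistry()
    registry.register(OperationWrapper("normalise", 1, ("table",), "table",
                                       ("method",), "norm.png", lambda t, method: t))
    registry.register(OperationWrapper("normalise", 2, ("table",), "table",
                                       ("method", "scale"), "norm.png",
                                       lambda t, method, scale: t))
    old = registry.resolve("normalise", version=1)    # used by a frozen 'proof' process
    new = registry.resolve("normalise")               # version 2, for new analyses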
The inclusive approach, by contrast, utilises operations that exist outside the scope and direct management of the discovery environment. These remote resources are increasingly being implemented using web services technology. There are three aspects to how a service may change. The most basic is the availability of the service: is the service still running? For this reason alone, large organisations such as pharmaceutical companies typically mirror copies of frequently accessed public services in house, in addition to concerns relating to privacy of data. The second versioning issue concerns a change to the WSDL schema of the service, where elements have been added, removed or modified. In many cases, particularly where web services are 'document' rather than 'RPC' type, minor modifications may be handled transparently, since the software reading an XML message will generally ignore any data that is in addition to what it is looking for. However, where data is removed or modified, the re-execution of a process is much less likely to run smoothly. For this reason, the service WSDL must be cached and compared at re-execution time so that the user may be informed of the change. The third type of issue concerns whether the behaviour of the service has changed, independently of its interfaces. Again, there is little that can be done by the consumer of the service if its owner modifies its behaviour: it may not even be evident that any change has occurred until a certain input case is encountered.

4.3 Provenance of a computed result

Provenance of a computed result, such as a list of interesting genes, is managed in Discovery Net by embedding the process description graph used to calculate the result into the entity representing the materialised result. At any later date the user may inspect the provenance of the result by expanding the node to show the source graph.

4.4 Provenance of a discovery process

In Discovery Net, provenance information for how a discovery process was performed, and how that process was constructed over time, is stored in DPML (Discovery Process Markup Language), an XML language for describing data flow graphs. A process is modelled as a directed acyclic graph whose edges define data flows between operations. For each operation DPML records:

• Parameterisation of the operation (e.g. for k-means clustering, the value of k, the columns of data used and the distance metric used)
• User annotations to help explain the use of the operation
• A change history recording how the parameterisation of the operation was modified over time by different users

In Figure 3 we show how a stored discovery process can be accessed via the web. Since this description contains both end-user annotations and information for execution, and since the discovery environment strictly manages the data and operations used, users are able to inspect and re-execute analyses at any time. Not only can the entire process be re-executed, but this active reporting mechanism also allows intermediate results to be re-calculated on demand (see 'Execute from this node' links).

Figure 3: Active reporting of a discovery process

5. Conclusion

In this paper we have presented two classes of discovery environment. Discovery Net endeavours to provide as complete support as possible for use as a strict environment for e-Science data analysis. However, in the interest of providing a fully open and extensible system, users may choose a more inclusive approach, adding resources ad hoc where necessary. In choosing how to use and configure the system, users may balance their desire for flexibility with strict process integrity considerations.

References

[1] Frank Leymann and Dieter Roller. Production Workflow. Prentice Hall, 2000.
[2] SAP R/3. http://www.sap.com/
[3] Fayyad et al. Knowledge Discovery and Data Mining: Towards a Unifying Framework. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996.
[4] Giannadakis et al. InfoGrid: Providing Information Integration for Knowledge Discovery. Journal of Information Science, 199-226, 2003.