Scientific Workflow Applied to Nano- and Material Sciences

Francois Gilardoni, InforSense Ltd, London, UK
Vasa Curcin, Yike Guo, Department of Computing, Imperial College London, UK

Abstract

The concept of Scientific or Engineering Workflows is an amalgamation of scientific problem-solving and traditional workflow techniques. Scientific workflows promise to become an important area of research within workflow and process automation, and will lead to the development of the next generation of problem-solving and decision-support environments. DiscoveryNet technology, commercialised by InforSense Ltd, is widely used in the life sciences industry to address these exact issues. This paper presents one such workflow, used in a real-life use case implemented by the cheminformatics group within InforSense in the nano- and material sciences domain.

1. Introduction

The concept of Scientific or Engineering Workflows is an amalgamation of scientific problem-solving and traditional workflow techniques. This class of workflows shares many features of business workflows, but also goes beyond them with respect to the complexity of operations and data types handled. Many known workflow patterns and techniques can be leveraged in scientific settings, and many additional features of scientific applications can be usefully deployed in business settings. Scientific workflows promise to become an important area of research within workflow and process automation, and will lead to the development of the next generation of problem-solving and decision-support environments, eventually turning into expert systems.

Scientific and engineering workflows often begin as research workflows and evolve into production workflows employed by a large number of users, provided that standard operating procedures, capturing best practices, are in place and well defined. Early in the lifecycle, the workflows require considerable human intervention, collaboration and knowledge exchange; later they are increasingly executed automatically, for instance through specialized portals or other service hosting environments. Multidisciplinary simulation and complex data mining operations require the specification of complex workflows that are both data- and process-centric. Grid workflows are still in the early stages of research and development, and several aspects of applying workflow technology in a Grid are being investigated in depth by other national and international projects. DiscoveryNet technology, which has been commercialised by InforSense Ltd, is widely used in the life sciences industry to address these exact issues. The analysis described below is a real-life use case implemented by the cheminformatics group within InforSense.

[Figure 1: Overview of Discovery Net workflow technology, integrating data sources (local files, Oracle databases, web services), data preprocessing, internal analytics and third-party tools such as Matlab, R, WEKA, S-Plus, SAS, MDL, Spotfire, BioTeam iNquiry, KXEN and Oracle Data Mining.]

[Figure 2: Drug discovery process, from compound libraries, prior knowledge and high-throughput experimental platforms through descriptor generation, predictive modelling, statistical design (DOE), multivariate statistics and visualization, to kinetics, thermodynamics and engineering, scale-up, economics and downstream experiments.]

2. Motivation

For decades, the traditional research process within the biopharmaceutical industry has been a sequential operation where, after many months of target validation, the process would lead to assay development, followed by high throughput library screens for hits, and then on to lead optimization. This modus operandi is becoming obsolete with the introduction of parallel experimentation. The use of robots has allowed the life science industry to screen an ever-increasing number of compounds. At first, compounds for testing may be selected from the large, readily available collections of products accumulated over years of synthetic effort in industrial or academic research laboratories. The next stage, however, consists of synthesizing massive new combinatorial libraries, and that process hits the wall of a combinatorial explosion.

Nano- and material sciences are dealing with analogous issues, in that the internal structure of materials or catalysts is often unknown, with hypothetical reaction paths and unpredictable properties or activities. The properties of the materials therefore depend on the recipe used to synthesize them. Beyond this quandary, experimentation should be designed to deliver relevant and applicable information that guides discovery effectively and quickly. This is often difficult to achieve without a computing infrastructure and without some prior knowledge of the phenomenon. Although screening power has recently increased by several orders of magnitude, the whole search space generally remains much too large to be fully explored. A way out of this combinatorial explosion relies on investigating the greatest diversity of the experimental space in the smallest number of experiments in order to create a performance-based model. This delivers the highest density of information per experiment at higher speed and guarantees the transformation of information into knowledge. The methodology combines the advantages of clustering techniques, molecular modeling, statistical design, multivariate statistics, data visualization and data mining. The aim is to build a model that captures knowledge and is capable of guiding discovery further.
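To make concrete the idea of covering the greatest diversity of the experimental space in the fewest experiments, the sketch below clusters a candidate library in descriptor space and keeps one representative per cluster. It is a minimal illustration under stated assumptions, not the statistical-design machinery used in the study: the file name, the experimental budget and the choice of k-means are all hypothetical.

```python
# Minimal sketch (not the authors' implementation): select a small, maximally
# diverse subset of candidate materials from a descriptor table, so that a
# limited number of experiments still spans a large part of the design space.
# File name, experimental budget and the use of k-means are assumptions.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

n_experiments = 24                                    # assumed experimental budget
df = pd.read_csv("candidate_descriptors.csv")         # hypothetical descriptor table
X = StandardScaler().fit_transform(df.select_dtypes(include="number"))

# Partition descriptor space into as many clusters as affordable experiments.
km = KMeans(n_clusters=n_experiments, n_init=10, random_state=0).fit(X)

# For each cluster, keep the candidate closest to the centroid as its representative.
selected = []
for c in range(n_experiments):
    members = np.where(km.labels_ == c)[0]
    dist = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    selected.append(members[dist.argmin()])

print(df.iloc[selected])                               # diverse shortlist to synthesize and test
```

Any diversity-oriented selection (space-filling designs, maximin distance, sphere exclusion) could be substituted for the clustering step; the point is simply that each experiment is chosen to probe a distinct region of descriptor space.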
3. Challenges

Typically, delivery of each step within the value chain involves data sources, discovery applications and processes that are optimized for specific workgroups. The use of proper tools that allow real-time capture and warehousing of intellectual property as it is created is fundamental to capitalizing on knowledge. An integrated approach to streamlining discovery research would provide great benefits in terms of cost, efficiency and speed. Unified data flows and workflows within a single environment would then support the creation of well-defined, reusable and manageable discovery processes across all levels of an organization. Individual scientists would benefit from process reuse and improved knowledge management, so that valuable information and knowledge can be fully exploited rather than lost during staffing or organizational changes.

Data handling is an essential part of the process and should not be underestimated: if not handled correctly, it can become an overwhelming bottleneck. Typically, data is stored in heterogeneous formats in different locations. For instance, a research group may have a collection of analytical instruments, each with its own proprietary software developed by the vendor. The aggregation of all these disparate data sources is therefore intricate and hinders the flow of information within an organization, with dramatic consequences for discovery.

4. Procedure

The work presented below is a methodology that enables the scientist to scrutinize and identify solids that are relevant candidates for testing in a high throughput program. Elemental descriptors are combined with recipes and experimental data in order to identify relevant inputs capable of anticipating the material activity. This virtual screening enables pre-selection of the candidates to be screened experimentally, thus achieving a very high rate of relevance at an early stage of the high throughput experimentation program and significantly reducing the number of trials. The methodology is encapsulated in the workflow (Figure 3), where each building block, or "node", performs a set of operations on the data. The workflow reads from left to right.

[Figure 3: Building rule sets for material properties]

The starting point is a dataset containing both tested and untested materials together with the corresponding set of descriptors. The tested set carries information on recipes and their intrinsic performance. The dataset is then split into two branches depending on whether the materials have been tested (upper branch) or belong to the virtual library (lower branch). Hierarchical and K-Means clustering are applied to the scores derived from the principal component analysis (labelled PCA) in order to identify similar materials. For demonstration purposes, a Support Vector Machine algorithm (labelled SVM) is also used to classify the catalysts into clusters. The upper branch ends by applying a Decision Tree to both the recipes and the elemental descriptors in order to extract rules linking them to the inherent properties of the materials. The untested materials handled in the lower branch are sent through the PCA and SVM models in order to be classified, i.e. to have their performance estimated. Relevant materials will eventually be tested experimentally.
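As a concrete reading of the workflow, the sketch below reproduces its main steps with generic, off-the-shelf components. It is not the Discovery Net implementation; the input file, column names, class labels and the numbers of clusters and principal components are assumptions made purely for illustration.

```python
# Minimal sketch of the Figure 3 workflow, with scikit-learn components standing
# in for the workflow nodes. Input file, column names and parameters are assumed.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("materials.csv")        # hypothetical: recipes + elemental descriptors
tested = df[df["tested"] == 1]           # upper branch: materials with measured performance
virtual = df[df["tested"] == 0]          # lower branch: untested virtual library

feature_cols = [c for c in df.columns if c not in ("tested", "performance_class")]
scaler = StandardScaler().fit(tested[feature_cols])

# PCA scores computed on the tested set, then reused for the virtual library.
pca = PCA(n_components=3).fit(scaler.transform(tested[feature_cols]))
scores_tested = pca.transform(scaler.transform(tested[feature_cols]))
scores_virtual = pca.transform(scaler.transform(virtual[feature_cols]))

# Group similar materials: K-Means and hierarchical clustering on the PCA scores
# (illustrative groupings; cluster count is an assumption).
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores_tested)
hier_labels = AgglomerativeClustering(n_clusters=5).fit_predict(scores_tested)

# SVM trained on the tested materials, then used to classify the virtual library.
svm = SVC(kernel="rbf").fit(scores_tested, tested["performance_class"])
virtual_pred = svm.predict(scores_virtual)

# Decision tree on recipes + elemental descriptors, to extract human-readable rules.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(tested[feature_cols], tested["performance_class"])
print(export_text(tree, feature_names=feature_cols))

# Shortlist: virtual materials predicted to fall in the assumed "high" performance class.
shortlist = virtual[virtual_pred == "high"]
print(shortlist)
```

The printed decision rules correspond to the rule sets of Figure 3, while the shortlist represents the pre-selected candidates to be sent for experimental testing.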
5. Acknowledgements

We would like to thank the "Institut de Recherche sur la Catalyse" in Lyon, France, for providing the dataset to InforSense.

6. Summary

The analysis presented shows how scientific workflows address the needs of the nano- and material sciences community. A complex procedure, which is typically performed as a series of manual and semi-manual steps, is here automated and stored in a standard format, suitable for searching, indexing and further evolution. It also shows that, in many environments, the key benefit of an integrated service approach lies not in the services themselves (CORBA, Web, Grid, or whatever next year's model may be) but in the infrastructure connecting them. It is this middleware layer that delivers value to the users and solves the informatics challenges they face, building a strong case for decoupling the middleware from the individual implementations.

7. References

1. AlSairafi S, Emmanouil FS, Ghanem M, et al. (2003) The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery. International Journal of High Performance Computing Applications, special issue on Grid Computing: Infrastructure and Applications, Vol. 17, pp. 297-315.
2. Curcin V, Ghanem M, Guo Y, et al. (2002) Discovery Net: Towards a Grid of Knowledge Discovery. Proceedings of ACM KDD-2002, July 2002, Edmonton, Canada.
3. Gilardoni F, Curcin V, Karunanayake K, Norgaard J, Guo Y (2005) Integrated Informatics in Life and Materials Sciences: An Oxymoron? QSAR & Combinatorial Science, 24(1), 120-130.