Scientific Workflow Applied to Nano- and Material Sciences
Francois Gilardoni
InforSense Ltd
London, UK
Vasa Curcin, Yike Guo
Department of Computing
Imperial College London, UK
Abstract
The concept of Scientific or Engineering Workflows is an amalgamation of scientific problem-solving and traditional
workflow techniques. Scientific workflows promise to become an important area of research within workflow and
process automation, and will lead to the development of the next generation of problem-solving and decision-support
environments. DiscoveryNet technology, commercialised by InforSense Ltd, is used widely in the life sciences industry
to address these exact issues. This paper presents one such workflow used in a real-life use case implemented by the
cheminformatics group within InforSense in the Nano- and Material Sciences domain.
1. Introduction
The concept of Scientific or Engineering Workflows is an amalgamation of scientific problem-solving and traditional workflow techniques. This class of workflows shares many features of business workflows, but also goes beyond them with respect to the complexity of operations and data types handled. Many known workflow patterns and techniques can be leveraged in scientific settings, and many additional features of scientific applications can be usefully deployed in business settings. Scientific workflows promise to become an important area of research within workflow and process automation, and will lead to the development of the next generation of problem-solving and decision-support environments, eventually turning into expert systems.
Scientific and engineering workflows often begin as research workflows and evolve into production workflows employed by a large number of users, provided that standard operating procedures capturing best practices are in place and well defined. Early in the lifecycle, the workflows require considerable human intervention, collaboration and knowledge exchange; later they are executed automatically more and more, for instance through specialized portals or other service-hosting environments.
Multidisciplinary simulation or complex data mining operations require the specification of complex workflows that are both data- and process-centric. Currently, Grid workflows are still in the early stages of research and development, and several features of the application of workflow technology in a Grid are being thoroughly investigated by other national and international projects.
DiscoveryNet technology, which has been commercialised by InforSense Ltd, is used widely in the life sciences industry to address these exact issues. The analysis described below is a real-life use case implemented by the cheminformatics group within InforSense.
Figure 1: Overview of Discovery Net workflow technology, connecting data sources (local files, Oracle databases, web services) with preprocessing and analytics components (internal analytics, Matlab, R, WEKA, S-Plus, SAS, KXEN, Oracle DM, MDL, Spotfire, BioTeam iNquiry).
Figure 2: Drug discovery process, from compound characterisation and predictive modelling (high-throughput experimental platforms, compound libraries, descriptors, prior knowledge, multivariate statistics, DOE statistical design, visualization) through synthesis experiments, kinetics, chemical analysis and thermodynamics and engineering, to scale-up, economics, route analysis and downstream experiments.
2. Motivation
For decades, the traditional research process within the
biopharmaceutical industry has been a sequential
operation where, after many months of target
validation, the process would lead to assay development
followed by high throughput library screens for hits,
and then on to lead optimization. This modus operandi
is becoming obsolete with the introduction of parallel
experimentation.
The use of robots has allowed the life science industry to screen an ever-increasing number of compounds. At first, compounds for testing can be selected from the large, readily available collections of products accumulated over years of synthetic effort in industrial or academic research laboratories. The next stage, however, consists of synthesizing massive new combinatorial libraries, and that process hits the wall of a combinatorial explosion. Nano- and material sciences face analogous issues, in that the internal structure of materials or catalysts is often unknown, with hypothetical reaction paths and unpredictable properties or activities. The properties of the materials therefore depend on the recipe used to synthesize them.
Beyond this quandary, experimentation should be designed to deliver relevant and applicable information aimed at guiding discovery quickly and effectively. This is often difficult to achieve without a computing infrastructure and without some prior knowledge of the phenomenon. Although screening power has recently increased by several orders of magnitude, the whole search space generally remains much too large to be explored fully.
A way out of this combinatorial explosion relies on investigating the greatest diversity of the experimental space in the least number of experiments in order to create a performance-based model. This delivers the highest density of information per experiment at higher speed and ensures that information is transformed into knowledge. The methodology combines the advantages of clustering techniques, molecular modeling, statistical design, multivariate statistics, data visualization and data mining. The aim is to build a model that captures this knowledge and is capable of guiding discovery further.
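As an illustration of this diversity-first strategy, the sketch below implements a simple greedy MaxMin selection over a descriptor matrix. It is a minimal stand-in for the statistical-design step, under the assumption of purely numeric descriptors and an arbitrary experimental budget; it is not the actual DiscoveryNet implementation.

```python
import numpy as np

def maxmin_select(descriptors: np.ndarray, n_experiments: int, seed: int = 0) -> list:
    """Greedy MaxMin selection: repeatedly pick the candidate furthest from the
    materials already chosen, so a small experimental budget covers the widest
    possible region of descriptor space."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(descriptors)))]   # arbitrary starting material
    # distance of every candidate to its nearest already-selected material
    dist = np.linalg.norm(descriptors - descriptors[selected[0]], axis=1)
    while len(selected) < n_experiments:
        nxt = int(np.argmax(dist))                     # furthest from the current selection
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(descriptors - descriptors[nxt], axis=1))
    return selected

# Hypothetical example: 500 candidate recipes described by 8 elemental descriptors,
# of which only 24 can be synthesized and tested in the first round.
candidates = np.random.default_rng(42).normal(size=(500, 8))
first_round = maxmin_select(candidates, n_experiments=24)
```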
3. Challenges
Typically, delivery of each step within the value chain
involves data sources, discovery applications and
processes that are optimized for specific workgroups.
The use of proper tools that allow real-time capture and warehousing of intellectual property as it is created is fundamental to capitalizing on knowledge.
approach to streamlining discovery research would
provide great benefits in terms of cost, efficiency and
speed. Unified data flows and workflows within a
single environment would then support the creation of
well-defined, reusable and manageable discovery
processes across all levels of an organization. Individual
scientists would benefit from process reuse and
improved knowledge management so that valuable
information and knowledge could be fully exploited,
avoiding loss during staffing or organizational changes.
Data handling is an essential part of the process and
should not be underestimated. If not handled correctly,
it can become an overwhelming bottleneck. Typically,
data is stored in heterogeneous formats in different
locations. For instance, a research group could have a collection of analytical instruments, each with its own proprietary software developed by the vendor. The aggregation of all these disparate data sources is therefore intricate and hinders the flow of information within an organization, with dramatic consequences for discovery.
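To make the aggregation problem concrete, the following minimal sketch merges exports from several hypothetical instruments into a single table keyed on a shared sample identifier. The file names, column layouts and the identifier column are assumptions for illustration, not the vendor formats discussed above.

```python
from pathlib import Path
import pandas as pd

# Hypothetical instrument exports; each vendor writes a differently shaped file.
EXPORTS = {
    "xrd": "data/xrd_results.csv",       # e.g. sample_id, phase, crystallinity
    "icp": "data/icp_elemental.csv",     # e.g. sample_id, elemental loadings
    "reactor": "data/reactor_runs.csv",  # e.g. sample_id, conversion, selectivity
}

def load_and_merge(exports: dict, key: str = "sample_id") -> pd.DataFrame:
    """Read each export, prefix its columns with the instrument name, and
    outer-join everything on the shared sample identifier."""
    merged = None
    for instrument, path in exports.items():
        df = pd.read_csv(Path(path))
        df = df.rename(columns={c: f"{instrument}_{c}" for c in df.columns if c != key})
        merged = df if merged is None else merged.merge(df, on=key, how="outer")
    return merged

# unified = load_and_merge(EXPORTS)
```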
4. Procedure
The work presented below is a methodology that enables the scientist to scrutinize candidate solids and identify those worth testing in a high-throughput program. Elemental descriptors are combined with recipes and experimental data in order to identify relevant inputs capable of anticipating the material activity.
This virtual screening enables pre-selection of the candidates to be screened experimentally, thus setting a very high rate of relevance at an early stage of the high-throughput experimentation program, and the number of trials is reduced significantly. The methodology is encapsulated in the workflow (Figure 3), where each building block, or “node”, performs a set of operations on the data. The workflow reads from left to right.
Figure 3: Building rule sets for material properties
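To make the left-to-right, node-by-node execution concrete, the sketch below models each building block as a plain function over a DataFrame and chains them in order. The node names and columns loosely mirror Figure 3 and are illustrative only; they are not the DiscoveryNet node API.

```python
from functools import reduce
from typing import Callable, List
import pandas as pd

Node = Callable[[pd.DataFrame], pd.DataFrame]

def run_workflow(data: pd.DataFrame, nodes: List[Node]) -> pd.DataFrame:
    """Execute the nodes left to right, each consuming the previous node's output."""
    return reduce(lambda df, node: node(df), nodes, data)

# Illustrative nodes mirroring the opening steps of the workflow in Figure 3.
def drop_incomplete(df: pd.DataFrame) -> pd.DataFrame:
    # Remove materials with no recipe information (hypothetical column name).
    return df.dropna(subset=["recipe_id"])

def tag_branch(df: pd.DataFrame) -> pd.DataFrame:
    # Tested materials go to the upper branch, the virtual library to the lower one.
    return df.assign(branch=df["activity"].notna().map({True: "tested", False: "virtual"}))

# result = run_workflow(materials, [drop_incomplete, tag_branch])
```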
The starting point is a dataset containing both tested and
untested materials together with the corresponding set
of descriptors. The tested set has information on recipes
and their intrinsic performance. The dataset is then split into two branches depending on whether the materials have been tested (upper branch) or belong to the virtual library (lower branch).
Hierarchical and K-Means clustering are applied to the scores derived from the principal component analysis (labelled PCA) in order to identify similar materials. For demonstration purposes only, a Support Vector Machine algorithm (labelled SVM) is also used to classify the catalysts into clusters.
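A minimal, hedged sketch of this step using scikit-learn as a stand-in for the DiscoveryNet nodes, assuming the tested materials carry purely numeric descriptor columns (all names and parameters below are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.svm import SVC

def characterise_tested(tested: pd.DataFrame, descriptor_cols, n_components=3, n_clusters=5):
    """Project the descriptors with PCA, cluster the scores two ways, and fit an
    SVM that can later classify the untested (virtual-library) materials."""
    scaler = StandardScaler().fit(tested[descriptor_cols])
    scores = PCA(n_components=n_components).fit(scaler.transform(tested[descriptor_cols]))
    pca = PCA(n_components=n_components).fit(scaler.transform(tested[descriptor_cols]))
    scores = pca.transform(scaler.transform(tested[descriptor_cols]))

    hier_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(scores)
    kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(scores)

    # As in the paper, the SVM classification is for demonstration purposes only.
    svm = SVC(kernel="rbf").fit(scores, kmeans_labels)
    return scaler, pca, svm, hier_labels, kmeans_labels
```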
The upper branch ends by applying a Decision Tree to both the recipes and the elemental descriptors in order to extract rules linking them to the inherent properties of the materials. The untested materials, handled in the lower branch, are sent through the PCA and SVM models to be classified, i.e. to have their performance estimated. Relevant materials will eventually be tested experimentally.
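The closing steps can be sketched in the same assumed setting: a shallow decision tree trained on recipes plus elemental descriptors to expose human-readable rules, and the fitted scaler/PCA/SVM reused to score the virtual library. Again, this is a scikit-learn illustration rather than the actual workflow nodes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

def extract_rules(tested: pd.DataFrame, feature_cols, target_col="performance_class", max_depth=4):
    """Fit a shallow decision tree and print the rules linking recipes and
    elemental descriptors to the measured properties of the materials."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(tested[feature_cols], tested[target_col])
    print(export_text(tree, feature_names=list(feature_cols)))
    return tree

def score_virtual_library(virtual: pd.DataFrame, descriptor_cols, scaler, pca, svm) -> pd.DataFrame:
    """Send untested materials through the fitted PCA and SVM models to estimate
    their performance class before any synthesis takes place."""
    scores = pca.transform(scaler.transform(virtual[descriptor_cols]))
    return virtual.assign(predicted_class=svm.predict(scores))
```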
5. Acknowledgements
We would like to thank the “Institut de Recherche sur la
Catalyse” in Lyon, France, for providing the dataset to
InforSense.
6. Summary
The analysis presented shows how scientific workflows
address the needs of the nano- and material sciences
community. A complex procedure, which is typically
performed with a series of manual and semi-manual
steps, is here automated and stored in a standard format,
suitable for searching, indexing and evolving further. It
also shows that, in many environments, the key benefit
of an integrated service approach is not in the services
themselves (CORBA, Web, Grid, or whatever the next
year’s model may be) but in the infrastructure
connecting them. It is this middleware layer that
delivers value to the users and solves the informatics
challenges they are facing, building a strong case for
decoupling the middleware from the individual
implementations.