The need for e-Science An industrial perspective Yike Guo – Imperial College

The need for e-Science
An industrial perspective
Stephen Calvert – VP Cheminformatics GSK
4th Annual EPSRC e-science meeting
Yike Guo – Imperial College
What is the “industrial” world like?
• Historically
– Low volume
• 30-50 cmpds/yr/chemist: 10,000s assay wells/yr
– Low information diversity
• scientists generally dealt with limited types of data
– reductionist approach
• limited information per experiment
– Interpretation critical for next step
• scientists required:
– simple systems to assist in information monitoring
– decision making resides with the scientist
What is the “industrial” world like?
• What happened in the last 5 years?
– “industrialisation” - Application of “principles of industrialisation” to drug discovery
• high volume
– 10,000 cmpds/yr/chemist; 100+ million wells/yr
– biology revolution
• Human genome
– “system biology” – holistic view and interpretation
– high content data --- images
– multiple result types from each experiment – bio-markers, pathways
– knowledge integration
• scientific discipline integration
– scientists required:
• complex systems, algorithms, statistics, ...
• decision making shared between systems and scientists
• “Informatics” essential – partnership not service
How have we (IT) tackled the transition?
• Business as usual
– problem centric view
• build applications
• integrate applications
• Educate scientists in the realms of IT
– “Now I need to be an IT expert alongside chemistry, biology,
genetics, robotics, engineering ……”
– interesting time scale - generations
• Technology is our saviour!
– client server, web services, Java, C#, CORBA, OO programming,
extreme programming, grid computing, ...
What are the results?
[Diagram: GSK Applications for sample management. Recoverable labels: chemistry, screening data, “library” design, samples; Discovery Sample History Warehouse, Discovery Stock Warehouse; sample history, availability, client order and submission components; Processing Queue, Job Queue, Sample Holding Area; Booking-in Manager, Stock Record Database, Works Order Processing Manager, Process Control Manager (PinPoint), ALS Manager (RTS), Dispatch Manager, Solid Store Manager, Manual Store Manager, Tube Store Manager; HayStack stores, manual store; Balance, Weighing, Dissolve, Sort; clients (scientist, remote cmpd bank). Legend: user interface component, database, physical queue, electronic queue, automation hardware, infrastructure, minicomputer.]
• “islands” of process & data
– complex integration problem
• “spaghetti” joins our worlds - unsustainable - cost
• control with “IT”
– mismatch in cycle time to change
– engineered out serendipity
– service role reversed
How could we do it differently?
• result in:
– handing control of science back to the scientist
– match cycle times to change
– Simplify
• how can we merge the 2 worlds?
– physical, information
Doodling in knowledge and experiment space
this is workflow – isn’t it?
physical & information worlds merge
[Diagram: Information Resources and assays linked in a workflow. Recoverable labels: Target List & Status, Target Leads, Exclusion Lists, IC50 Assay, Structure Validation, Other Assay...]
Q: - are these results real?
Q: - what do I know about these compounds?
Q: - what other data can I acquire?
• no predefined steps
• capture what was done; don’t restrict what can be done
• don’t restrict the non-obvious
Doodling in knowledge & experiment space
• Need access to world-class scientific algorithms and tools
• Need access to disparate data sources from multiple locations
• Intuitive & flexible GUI design/analysis
• Framework needs to be very generic
• Ability to construct a “just-in-time” application
• Need to serve the requirements of a varied user community
– both in terms of scientific and technical know-how
• Capture and dissemination of “Best practice” within a creative
environment to enhance efficiency company wide
Discovery Net Overview
• Goal: Constructing the World’s First Infrastructure for Global Wide Knowledge Discovery
on the Grid of Web Services
• Funding:
– One of the Eight UK National e-Science Projects (£2.4 M)
Workflow = Compositional Service
• Key Features:
– Allow Scientists to Construct, Share and Execute Complex Knowledge Discovery Processes & Services
– Allow Institutions to Manage and Utilise the Compositional Services as its Intellectual Properties
• Applications:
– Life Science
– Environmental Modelling
– Geo-hazard Prediction
• Achievement:
For the First time Discovery Net Realises the Dynamic Construction of Compositional Services
on GRID for Real Time Knowledge Discovery and Decision Making
[Diagram labels: Scientific Information (Literature, Databases, Operational Data, Images, Instrument Data); Real Time Data Integration; Discovery Services; Dynamic Application Integration; Process Knowledge Management; Using GRID Resources; Scientific Discovery In Real Time]
Enterprise Wide Integrative Scientific Decision Making Platform with Discovery Net Workflow
• Constructing a ubiquitous workflow: by scientists
– Integrate information resources/software applications cross-domain
– Support innovation and capture the best practice of your scientific research
• Warehousing workflows: for scientists
– Manage discovery processes within an organisation
– Construct an enterprise process knowledge bank
• Deployment workflow: to scientists
– Turn workflows into reusable applications/services
– Turn every scientist into a solution builder
An Integrative Analysis Example: Interactive Scientific Discovery with Workflow
[Figure: screenshots of workflow components. Recoverable panel labels: relational data mining; decision tree model of metabonomic profile; visualizing serial/spectrum data; text mining; visualizing cluster statistics; spectrum data mining; visualizing multidimensional data; visualizing relational data; visualizing data clusters; chemical structure visualization; chemical sequence data; pathway data; sequence visualization; model.]
Discovery Net Commercialisation
Life Science Industry
KDE Informatics Platform
Label Free HT bioSensors
Commercialisation (Imperial College Spin Out Companies):
DeltaDot
Workflow technology
HT sensor processing
Research :
Discovery Net Research
CS : Workflow for Informatics on SOA
Sensor : Sensor Data Processing and Mining
Application : Life, Environmental and Geo-physical Sciences
library design - GSK
• Process of selecting the molecules I want to make from the universe
of molecules
• Toolbox: scientific models, chemical handling, chemical properties, data
access, statistics, data visualisation, ….
• Scientists can doodle in chemical space
– Capture how scientists made decisions
• New algorithms, data sources added in < 1 hour
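The toolbox idea above can be sketched in a few lines of Python. This is an illustration only, not the GSK or Discovery Net implementation: the names `TOOLBOX`, `tool` and `Session`, the two filters, and the three candidate molecules with their property values are all invented for the example. It shows how a new algorithm might be added by registering one function, and how each choice the scientist makes could be recorded as it happens.

```python
from typing import Callable, Dict, List

# Hypothetical molecule record, e.g. {"id": "m1", "mol_weight": 320.0, "logp": 2.1}.
Molecule = dict
Step = Callable[[List[Molecule]], List[Molecule]]

# The toolbox: every registered algorithm is available by name.
TOOLBOX: Dict[str, Step] = {}


def tool(name: str) -> Callable[[Step], Step]:
    """Register a function in the toolbox under `name`."""
    def register(func: Step) -> Step:
        TOOLBOX[name] = func
        return func
    return register


# Adding a new algorithm is one decorated function (illustrative filters only).
@tool("max_weight_500")
def max_weight_500(mols: List[Molecule]) -> List[Molecule]:
    return [m for m in mols if m["mol_weight"] <= 500]


@tool("max_logp_5")
def max_logp_5(mols: List[Molecule]) -> List[Molecule]:
    return [m for m in mols if m["logp"] <= 5]


class Session:
    """Applies tools in whatever order the scientist chooses and records each choice."""

    def __init__(self, molecules: List[Molecule]):
        self.molecules = list(molecules)
        self.history = []  # entries are (tool name, count before, count after)

    def apply(self, name: str) -> "Session":
        before = len(self.molecules)
        self.molecules = TOOLBOX[name](self.molecules)
        self.history.append((name, before, len(self.molecules)))
        return self


# Made-up candidate molecules, for illustration only.
candidates = [
    {"id": "m1", "mol_weight": 320.0, "logp": 2.1},
    {"id": "m2", "mol_weight": 612.5, "logp": 3.4},
    {"id": "m3", "mol_weight": 450.2, "logp": 6.0},
]

session = Session(candidates).apply("max_weight_500").apply("max_logp_5")
selected_ids = [m["id"] for m in session.molecules]
```

Here no order of steps is fixed in advance; `session.history` is the captured record of what was actually done.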
KDE Example 2: SARS Genome Annotation
The 2003 SARS outbreak
Requirements:
• Rapid construction and sharing of mission-critical discovery services
• Integration of diverse bioinformatics applications
• Support collaborative research between geographically distributed researchers
• Deploying services as easy-to-use tools for real time decision making
Achievement: Dynamic Construction of Compositional Services:
• Rapid construction of applications via composition of existing web services using workflow
• Instant deployment of analytical workflows as new web services with resource mapping
• Integrated workflow, provenance and service management
• Collaborative construction of workflows by large numbers of researchers
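A minimal sketch of what "compositional service" could mean in code, under stated assumptions: the `Workflow` class, the `REGISTRY` dictionary and the `deploy` function are hypothetical names, and the two "services" are plain local functions standing in for remote web services. It is not Discovery Net's API. The point it illustrates is that a workflow composed from services is itself callable, so it can be deployed as a new service and reused inside a larger workflow, with provenance recorded along the way.

```python
from typing import Any, Callable, Dict, List, Tuple

Service = Callable[[Any], Any]


class Workflow:
    """A sequence of services that is itself callable, so it can act as a service."""

    def __init__(self, name: str, steps: List[Tuple[str, Service]]):
        self.name = name
        self.steps = steps
        # Provenance entries are (step name, input, output).
        self.provenance: List[Tuple[str, Any, Any]] = []

    def __call__(self, data: Any) -> Any:
        for step_name, service in self.steps:
            result = service(data)
            self.provenance.append((step_name, data, result))
            data = result
        return data


# Local stand-ins for existing web services (illustrative only).
def transcribe(dna: str) -> str:
    return dna.upper().replace("T", "U")


def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)


REGISTRY: Dict[str, Service] = {"transcribe": transcribe, "gc_content": gc_content}


def deploy(workflow: Workflow) -> None:
    """Publish a workflow under its name so others can call it like any service."""
    REGISTRY[workflow.name] = workflow


# Compose two existing services, then deploy the composition as a new service.
rna_gc = Workflow("rna_gc", [
    ("transcribe", REGISTRY["transcribe"]),
    ("gc_content", REGISTRY["gc_content"]),
])
deploy(rna_gc)
result = REGISTRY["rna_gc"]("atgc")

# The deployed workflow can be a step in a higher-level workflow.
report = Workflow("gc_report", [
    ("rna_gc", REGISTRY["rna_gc"]),
    ("as_percent", lambda fraction: round(fraction * 100)),
])
percent = report("atgc")
```

A real deployment would add service discovery, resource mapping and access control; those are left out here on purpose.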
China SARS Virtual Lab based on Discovery Net
[Diagram: D-Net (integration, interpretation, and discovery) with its data sources and analyses. Recoverable labels: Genbank; bibliographic databases; GeneSense ontology; key word search; homology search against viral genome DB, protein DB and motif DB; annotation using Artemis and GenSense; gene prediction; predicted genes; exon prediction; splice site prediction; multiple sequence alignment; phylogenetic analysis; relationship between SARS and other virus; mutual regions identification; immunogenetics; protein localization site prediction; protein interaction prediction; classification and secondary structure prediction; relationship between SARS virus and human receptors prediction; microarray analysis; epidemiological analysis; SARS patients diagnosis.]
Compositional Services for SARS Mutation Analysis
• 50 data resources
• > 200 software applications and services
• Designed on top of the web service environment
• Used by more than 200 scientists
• Result published in Science
Future Challenge:
GSK- InforSense & IC e-Science Collaboration
• Workflow Fusion : Applying advanced performance
programming technology for dynamic optimization of workflow
execution
• Workflow Abstraction : Investigating abstraction mechanisms
for building workflow hierarchy and higher order composition
forms
• Dynamic Service Composition: Investigating service ontology
for dynamically composing services with workflow
• Workflow Metadata Model : Building up a generic meta data
model for scientific workflow management and workflow
warehousing
• Man – machine interface – free scientists from IT speak
How can you help?
• encourage focused research on the key issues SCIENTISTS face
in industry
• catalyse joint work in these focused fields between
academics, industry and commercial software vendors
• facilitate solution-oriented communication between
computer scientists and domain scientists in both academia and
industry
e-Science
• A politician's view:
‘[The e-Science platform] intends to make access to computing
power, scientific data repositories and experimental facilities
as easy as the Web makes access to information.’
Tony Blair
• A Scientist’s View:
[The e-Science platform] should help me to do my scientific
research free from the complexity of IT