The need for e-Science
An industrial perspective
Stephen Calvert – VP Cheminformatics, GSK
Yike Guo – Imperial College
4th Annual EPSRC e-science meeting

What is the “industrial” world like?
• Historically
  – Low volume
    • 30-50 cmpds/yr/chemist: 10,000s assay wells/yr
  – Low information diversity
    • scientists generally dealt with limited types of data
  – reductionist approach
    • limited information per experiment
  – Interpretation critical for next step
• scientists required:
  – simple systems to assist in information monitoring
  – decision making resides with the scientist

What is the “industrial” world like?
• What happened in the last 5 years?
  – “industrialisation” - application of “principles of industrialisation” to drug discovery
    • high volume
      – 10,000 cmpds/yr/chemist: 100+ million wells/yr
  – biology revolution
    • Human genome
  – “system biology” – holistic view and interpretation
  – high content data - images
  – multiple result types from each experiment
  – bio-markers, pathways
  – knowledge integration
    • scientific discipline integration
  – scientists required:
    • complex systems, algorithms, statistics…….
    • decision making shared between systems and scientists
• “Informatics” essential – partnership not service

How have we (IT) tackled the transition?
• Business as usual
  – problem centric view
    • build applications
    • integrate applications
• Educate scientists in the realms of IT
  – “Now I need to be an IT expert alongside chemistry, biology, genetics, robotics, engineering ……”
  – interesting time scale - generations
• Technology is our saviour!
  – client server, web services, Java, C#, CORBA, OO programming, extreme programming, grid computing, …..

What are the results?
[Diagram: GSK applications and infrastructure linking chemistry, screening data, “library” design and samples. Labelled components: sample history component; availability component; client order component; submission component; Discovery Sample History Warehouse; Discovery Stock Warehouse; Stock Record Database; Processing Queue; Job Queue; Sample Holding Area; Booking-in Manager; Works Order Processing Manager; ALS system and ALS Manager (RTS); Dispatch Manager; Process Control Manager (PinPoint); Solid, Manual and Tube Store Managers; HayStack stores; balance, weighing, dissolve and sort hardware; physical and electronic queues; user interface components; databases; minicomputers; clients (scientist, remote cmpd bank).]
• “islands” of process & data
  – complex integration problem
• “spaghetti” joins our worlds - unsustainable - cost
• control with “IT”
  – mismatch in cycle time to change
  – engineered out serendipity
  – service role reversed

How could we do it differently?
• result in:
  – handing control of science back to the scientist
  – matching cycle times to change
  – simplifying
• how can we merge the 2 worlds?
  – physical, information

Doodling in knowledge and experiment space
this is workflow – isn’t it?
physical & information worlds merge
[Diagram labels: Information Resources; Target List & Status; Target Leads; Exclusion Lists; IC50 Assay; Structure Validation; Other Assay...]
Q: are these results real?
Q: what do I know about these compounds?
Q: what other data can I acquire?
• no predefined steps
• capture what was done; don’t restrict what can be done
• don’t restrict the non-obvious
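
The "capture what was done, don’t restrict what can be done" idea on the slide above can be illustrated with a small sketch. It is not taken from any GSK or Discovery Net system: the `DoodleSession` class, the two filter functions and the compound records are all hypothetical and made up for illustration. The scientist applies any tool in any order, and the session records each step so the path can later be replayed as a workflow.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Step:
    """One recorded action: which tool ran and what it produced."""
    tool: str
    result: Any


@dataclass
class DoodleSession:
    """Hypothetical recorder: any callable may be applied in any order."""
    data: Any
    history: List[Step] = field(default_factory=list)
    _tools: List[Callable[[Any], Any]] = field(default_factory=list)

    def apply(self, tool: Callable[[Any], Any]) -> Any:
        # No predefined steps: the scientist picks the next tool freely.
        self.data = tool(self.data)
        # Capture what was done so the path can be reviewed or reused.
        self.history.append(Step(tool.__name__, self.data))
        self._tools.append(tool)
        return self.data

    def as_workflow(self) -> Callable[[Any], Any]:
        """Turn the recorded path into a reusable function."""
        tools = list(self._tools)

        def workflow(value: Any) -> Any:
            for tool in tools:
                value = tool(value)
            return value

        return workflow


# Illustrative stand-ins for real assays and filters (made-up data).
def drop_excluded(compounds):
    return [c for c in compounds if not c["excluded"]]


def keep_potent(compounds):
    return [c for c in compounds if c["ic50_nm"] < 100]


leads = [
    {"id": "A", "ic50_nm": 40, "excluded": False},
    {"id": "B", "ic50_nm": 500, "excluded": False},
    {"id": "C", "ic50_nm": 10, "excluded": True},
]

session = DoodleSession(leads)
session.apply(drop_excluded)
session.apply(keep_potent)
print([step.tool for step in session.history])  # ['drop_excluded', 'keep_potent']
print([c["id"] for c in session.data])          # ['A']
```

The recorded history is only a by-product of working, so nothing limits which tool comes next; `as_workflow()` is one possible way to hand the captured path to a colleague.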
Doodling in knowledge & experiment space
• Need access to world-class scientific algorithms and tools
• Need access to disparate data sources from multiple locations
• Intuitive & flexible GUI design/analysis
• Framework needs to be very generic
• Ability to construct a “just-in-time” application
• Need to serve the requirements of a varied user community – both in terms of scientific and technical know-how
• Capture and dissemination of “Best practice” within a creative environment to enhance efficiency company wide

Discovery Net Overview
• Goal : Constructing the World’s First Infrastructure for Global Wide Knowledge Discovery on the Grid of Web Services
• Funding :
  – One of the Eight UK National e-Science Projects (£2.4 M)
• Key Features:
  – Workflow = Compositional Service
  – Allow Scientists to Construct, Share and Execute Complex Knowledge Discovery Processes & Services
  – Allow Institutions to Manage and Utilise the Compositional Services as its Intellectual Properties
• Applications:
  – Life Science
  – Environmental Modelling
  – Geo-hazard Prediction
• Achievement : For the First time Discovery Net Realises the Dynamic Construction of Compositional Services on GRID for Real Time Knowledge Discovery and Decision Making
[Diagram labels: Scientific Information (Literature, Databases, Operational Data, Instrument Data, Images); Real Time Data Integration; Dynamic Application Integration; Discovery Services; Process Knowledge Management; Using GRID Resources; Scientific Discovery In Real Time]

Enterprise Wide Integrative Scientific Decision Making Platform with Discovery Net Workflow
• Constructing a ubiquitous workflow : by scientists
  – Integrate information resources/software applications cross-domain
  – Support innovation and capture the best practice of your scientific research
• Warehousing workflows: for scientists
  – Manage discovery processes within an organisation
  – Construct an enterprise process knowledge bank
• Deployment workflow: to scientists
  – Turn workflows into reusable applications/services
  – Turn every scientist into a solution builder

An Integrative Analysis Example: Interactive Scientific Discovery with Workflow
[Figure: workflow screenshots. Panel labels include: relational data mining; text mining; spectrum data mining; decision tree model of metabonomic profile; visualizing serial/spectrum data; visualizing cluster statistics; visualizing multidimensional data; visualizing data clusters; visualizing pathway data; chemical and sequence data models and visualization]

Discovery Net Commercialisation
• Life Science Industry
  – KDE Informatics Platform
  – Label Free HT bioSensors
• Commercialisation (Imperial College Spin Out Companies): DeltaDot
  – Workflow technology
  – HT sensor processing
• Research : Discovery Net Research
  – CS : Workflow for Informatics on SOA
  – Sensor : Sensor Data Processing and Mining
  – Application : Life, Environmental and Geo-physical Sciences

library design - GSK
• Process of selecting the molecules I want to make from the universe of molecules
• Toolbox: scientific models, chemical handling, chemical properties, data access, statistics, data visualisation, ….
• Scientists can doodle in chemical space
  – Capture how scientists made decisions
• New algorithms, data sources added in < 1 hour

KDE Example 2 : SARS Genome Annotation
The 2003 SARS outbreak
Requirements:
• Rapid construction and sharing of mission-critical discovery services
• Integration of diverse bioinformatics applications
• Support for collaborative research between geographically distributed researchers
• Deployment of services as easy-to-use tools for real-time decision making
Achievement: Dynamic Construction of Compositional Services
• Rapid construction of applications via composition of existing web services using workflow
• Instant deployment of analytical workflows as new web services with resource mapping
• Integrated workflow, provenance and service management
• Collaborative construction of workflows by large numbers of researchers

China SARS Virtual Lab based on Discovery Net
[Diagram: “D-Net: Integration, interpretation, and discovery”. Labelled resources and analyses: Genbank; Bibliographic databases; Key word search; GeneSense Ontology; Homology search against viral genome DB; Homology search against protein DB; Homology search against motif DB; Annotation using Artemis and GenSense; Gene prediction; Predicted genes; Exon prediction; Splice site prediction; Multiple sequence alignment; Mutual regions identification; Phylogenetic analysis; Relationship between SARS and other virus; Immunogenetics; Protein localization site prediction; Protein interaction prediction; Classification and secondary structure prediction; Relationship between SARS virus and human receptors prediction; Microarray analysis; Epidemiological analysis; SARS patients diagnosis]

Compositional Services for SARS Mutation Analysis
• 50 data resources
• > 200 software applications and services
• Designed on top of the web service environment
• Used by more than 200 scientists
• Result published in “Science”
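
The "composition of existing web services using workflow" and "deployment of analytical workflows as new web services" points from the SARS example can be sketched in a few lines. This is an assumption-laden illustration, not Discovery Net's API: the `registry`, `deploy` and `compose` names are hypothetical, plain Python functions stand in for web services, and the toy sequence steps stand in for real analyses such as homology search or annotation.

```python
from typing import Callable, Dict, List

Service = Callable[[str], str]

# Hypothetical in-process registry standing in for a set of web services.
registry: Dict[str, Service] = {}


def deploy(name: str, service: Service) -> None:
    """Publish a service under a name so other workflows can call it."""
    registry[name] = service


def compose(step_names: List[str]) -> Service:
    """Build a workflow that calls the named services in order."""
    def workflow(payload: str) -> str:
        for step in step_names:
            payload = registry[step](payload)  # looked up at call time
        return payload
    return workflow


# Toy stand-ins; real steps would be homology searches, annotation, etc.
deploy("clean", lambda seq: seq.strip().upper())
deploy("transcribe", lambda seq: seq.replace("T", "U"))

# The composed workflow is itself deployed as a new service...
deploy("clean_and_transcribe", compose(["clean", "transcribe"]))
# ...and can be reused as a building block in a larger workflow.
deploy("reverse", lambda seq: seq[::-1])
pipeline = compose(["clean_and_transcribe", "reverse"])

print(pipeline("  atgc "))  # CGUA
```

Because a composed workflow has the same shape as the services it is built from, it can be published and reused like any other service, which is one reading of "Workflow = Compositional Service".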
Future Challenge: GSK - InforSense & IC e-Science Collaboration
• Workflow Fusion : Applying advanced performance programming technology for dynamic optimization of workflow execution
• Workflow Abstraction : Investigating abstraction mechanisms for building workflow hierarchy and higher order composition forms
• Dynamic Service Composition: Investigating service ontology for dynamically composing services with workflow
• Workflow Metadata Model : Building up a generic metadata model for scientific workflow management and workflow warehousing
• Man – machine interface – free scientists from IT speak

How can you help?
• encourage focused research on the key issues SCIENTISTS face in industry
• catalyse joint work in these focused fields between academics, industry and commercial software vendors
• facilitate solution-oriented communication between computer scientists and domain scientists in both academia and industry

e-Science
• A politician's view: ‘[The e-Science platform] intends to make access to computing power, scientific data repositories and experimental facilities as easy as the Web makes access to information.’ Tony Blair
• A Scientist’s View: [The e-Science platform] should help me to do my scientific research free from the complexity of IT