The Chemical Knowledge Cycle
Comb-e-Chem
... and its ramifications for e-Science
Or the other way round
Jeremy Frey
School of Chemistry, University of Southampton, UK
March 2004

Talk
• The Comb-e-Chem Project
• “Smart Lab”
• “National Crystallography Service”
• “Cluster Computing”
• “Dissemination & Publication”

Comb-e-Chem Partners
• IBM
• IT Innovation
• NCS
• CCDC
• ECS
• Chemistry
• Stats
• Combi Centre
• Pfizer
• Bristol Chemistry
• GSK
• AZ
• Southampton
• IUPAC
• RSC

People
• Chemistry (Southampton & Bristol): Mike Hursthouse, Chris Frampton, Jon Essex, Jeremy Frey, Guy Orpen, Stephan Christensen, Thomas Gelbrich, Sam Peppe, Hongchen Fu, Graham Tizard, Suzanna Ward, Lefteris Danos, Jamie Robinson, Kieron Taylor, Chris Woods, Rob Gledhill
• National Crystallography Service (NCS): Simon Coles, Mark Light, Ann Bingham, Peter Horton
• Electronics and Computer Science (Southampton): Dave De Roure, Luc Moreau, Mike Luck, Hugo Mills, Graham Smith, Simon Miles, Nicky Harding, Gareth Hughes, Nick Humphries, monica schraefel, Terry Payne
• IT Innovation (Southampton): Mike Surridge, Ken Meacham, Steve Taylor, Daren Marvin
• Statistics (Southampton): Alan Welsh, Sue Lewis, Ralph Manson, Dave Woods
• Rutherford Appleton Laboratory, Atlas Data Centre
• IBM: Colin Bird, Syd Chapman

CombeChem Data and Knowledge Cycle
[Diagram labels: Design (statistics); Plan; Experiments; Smart Labs; High Throughput measurement; Data; Analysis; Statistics; Dissemination; E-Bank; Literature; Access to data; End-to-End Management]

Plans
• Small set of fixed plans: NCS
• Variable plans, written by chemist (difficult!): Tea
• Ad-hoc, implied by process execution: SHG

A chemistry lab is a hostile environment without much room to maneuver
• What can be captured automatically with sensors?
• What must rely on manual annotation?
[Photos: The chemist; The fume cupboard; Competition for space; Very precise scales, but not connected to any recording device; Industrial support]

Big block to publication@source: if it’s not digital, it’s difficult to share
• Critical data entry

By Making Tea!
• Getting not just the what and how, but the why

Making Tea: design elicitation through analogy
• Developed and validated the analogy with chemists
• Gave us a way to ask questions that would not otherwise have been possible
• Let us maximize observation
• Gave us repeatability
• Derived rudiments of a process model, too
• Provided a lingua franca with chemists

Pervasive Grid: “Smart Flight”
• Tablet?

Results
• “I can go anywhere and it’s, like, this is me and my data. It’s all there! Bang!”
• In real use, chemists were able to record their experiments
• After about ten minutes of use, they forgot about it as a new thing, and just used it

Data model (increasing detail)
• Plan: intended actions; guide to chemist, or [later] workflow
• Process record: measurements, processes, annotations
• Provenance record: service invocations, secure time-stamps, etc.

Databases
• The database will become the key method of handling all data
• Metadata must be generated at inception and added as data traverses the workflow
• Version control, audit and backup are handled at the database level
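The point about metadata being generated at inception and then added to as data moves through the workflow can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration only: the `DataItem` class, the `stamp` method, the stage names, and the operator and instrument identifiers are hypothetical and are not taken from the Comb-e-Chem software. Only the 0.9031 g value echoes the weighing shown later in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DataItem:
    """A measurement plus the metadata trail it accumulates (hypothetical model)."""
    value: float
    unit: str
    metadata: list = field(default_factory=list)

    def stamp(self, stage: str, **details) -> "DataItem":
        """Append one metadata entry with a UTC time-stamp for this workflow stage."""
        entry = {
            "stage": stage,
            "time": datetime.now(timezone.utc).isoformat(),
            **details,
        }
        self.metadata.append(entry)
        return self


def record_measurement(value, unit, operator, instrument):
    """Create the item and attach its first metadata entry at inception."""
    return DataItem(value, unit).stamp(
        "inception", operator=operator, instrument=instrument
    )


# Metadata exists from the moment the value is captured ...
item = record_measurement(0.9031, "g", operator="chemist-1", instrument="balance-A")
# ... and each later workflow step adds to the trail rather than replacing it.
item.stamp("analysis", script="mean_of_runs")

for entry in item.metadata:
    print(entry["stage"], entry["time"])
```

The design choice sketched here is that the trail is append-only, so audit and version information is never reconstructed after the fact.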
[Diagram: database schema. Tables and columns as recovered from the slide:]
• Apparatus: ApparatusID (PK), Rig Name, Laser Wavelength, Laser PulseRate, GateWidth, Sensitivity, PMTVoltage, Incident Angle
• Operator: OperatorID (PK), Title, Surname, ForeName, PasswordHash, Position, Organisation Name, Phone, Email
• Organisation: OrganisationID (PK), Organisation Name, Address, ContactName, ContactPhone, ContactEmail, Website
• Solution: SolutionID (PK), SolventID (FK), SoluteID (FK), OperatorID (FK), OrganisationID (FK), Preparation Date, SoluteMass, SolventVolume, pH, pHControl, Notes
• Chemical: ChemicalID (PK), CasNumber, CompoundName, Quantity, Supplier, Catalogue Number, LotNumber, PackDate, Purchase Date, Order Number, RMM, Purity, Notes
• RunData: RunID (PK), Time, OperatorID (FK), ApparatusID (FK), SampleID (FK), InputPolarisationAngle, OutputPolarisationAngle, Azimuthal Angle, Surface Pressure, MonoChromatorWavelength, MonoInputSlit, MonoOutputSlit, Sample Temperature, NumberLaserShots, IsBackground, DataBlob, Notes
• Run/Bkg Link: RunID (PK, FK), BkgID (PK, FK)
• Sample: SampleID (PK), Notes, ChemicalID (FK), SolutionID (FK), Concentration, pH

Smart (SHG) Lab Data Flow Processes
[Diagram. Components: Client; Agent Server; Experiment Data Logging PC; Web Server; Lab Environment Logger; Viewer; Rserve; MQ Broker; Database Server; Broker Backup Agent; Broker Recall Agent; SRB/ATLAS/Network Backup server]
• Flows: Raw Experimental Data; Experiment Data; Control Data; Data Recall and Update; End/start experiment/Run; Changes in Lab environment; Backup/Recall Broker data
• Live updates (lab environment, end-experiment trigger)
• Initiate Rserve run, and notify when finished
• Agent to listen for end of experiments, and auto trigger analysis
• Periodic backups; the backup agent also does the recall function

Databases - Our experience
• What do you do when the actual users keep changing their mind?
• Is a traditional relational database suitable?
• Danger of reinforcing scientific bias against relational databases for laboratory data
• RDF & RDFS!
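Why RDF helps when users keep changing their mind can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's implementation: it uses bare (subject, predicate, object) tuples rather than a real RDF library, the `lab:` vocabulary and the `run:1` identifier are hypothetical, and all values are made up. The property names loosely follow the RunData columns in the schema above.

```python
# A tiny in-memory triple store: each fact is (subject, predicate, object).
triples = []


def add(subject, predicate, obj):
    """Record one statement about a subject."""
    triples.append((subject, predicate, obj))


def describe(subject):
    """Collect every property recorded for a subject, in insertion order."""
    props = {}
    for s, p, o in triples:
        if s == subject:
            props.setdefault(p, []).append(o)
    return props


# Properties that a relational RunData table would hold as fixed columns.
add("run:1", "rdf:type", "lab:Run")
add("run:1", "lab:inputPolarisationAngle", 45.0)   # illustrative value
add("run:1", "lab:numberLaserShots", 200)          # illustrative value

# A user later decides humidity matters. No table is altered and no
# migration is needed: the new property is just one more statement.
add("run:1", "lab:roomHumidity", 0.41)             # hypothetical property

print(describe("run:1"))
```

In a relational design the last step would need a schema change; here the set of properties is open-ended, which is the flexibility the slide points to.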
Ingredient List, Plan and Process Record (first steps)
Ingredient List
• Fluorinated biphenyl: 0.9 g
• Br11OCB: 1.59 g
• Potassium Carbonate: 2.07 g
• Butanone: 40 ml
Plan
• Dissolve 4-fluorinated biphenyl in butanone (Add)
• Add K2CO3 powder (Add)
• Heat at reflux for 1.5 hours (Reflux)
Process Record
• Weigh: sample of 4-fluorinated biphenyl, 0.9031 grammes
• Measure: butanone, 40 ml
• Weigh: sample of K2CO3 powder, 2.0719 g
• Text: Butanone dried via silica column and measured into 100ml RB flask. Used 1ml extra solvent to wash out container.
• Text: Started reflux at 13.30. (Had to change heater stirrer) Only reflux for 45min, next step 14:15.

Ingredient List, Plan and Process Record (full experiment)
Ingredient List
• Fluorinated biphenyl: 0.9 g
• Br11OCB: 1.59 g
• Potassium Carbonate: 2.07 g
• Butanone: 40 ml
Plan (To Do List)
• Dissolve 4-fluorinated biphenyl in butanone
• Add K2CO3 powder
• Heat at reflux for 1.5 hours
• Cool and add Br11OCB
• Heat at reflux until completion
• Cool and add water (30ml)
• Extract with DCM (3x40ml)
• Combine organics, dry over MgSO4 & filter
• Remove solvent in vacuo
• Fuse compound to silica & column in ether/petrol
Process types in the diagram, in plan order: Add, Add, Reflux, Cool, Add, Reflux, Cool, Add, Liquid-liquid extraction, Dry, Filter, Remove Solvent by Rotary Evaporation, Fuse, Column Chromatography
Other diagram labels: 3 of 40 ml; g excess; text; Ether/Petrol Ratio; image
Process Record
• Weigh: 0.9031 grammes
• Text: Inorganics dissolve 2 layers. Added brine ~20ml.
• Text: Butanone dried via silica column and measured into 100ml RB flask.
Used 1ml extra solvent to wash out container.
• Inputs: sample of 4-fluorinated biphenyl; butanone; sample of K2CO3 powder; sample of Br11OCB; water; DCM; MgSO4; silica
• Process steps, numbered 1 to 14 in the diagram: Add, Add, Reflux, Cool, Add, Reflux, Cool, Add, Liquid-liquid extraction, Dry, Filter (Buchner), Remove Solvent by Rotary Evaporation, Fuse, Column Chromatography
• Observations attached to the steps: Weigh, Measure, Annotate
• Recorded values: 40 ml; 2.0719 g; 1.5918 g; 30 ml
• Text: Started reflux at 13.30. (Had to change heater stirrer) Only reflux for 45min, next step 14:15.
• Text: Organics are yellow solution
• Text: Washed MgSO4 with DCM ~ 50ml
Key
• Node types: Process; Input; Literal; Observation
Observation Types
• weight: grammes
• measure: ml, drops
• annotate: text
• temperature: K, °C
Future Questions
• Whether to have many subclasses of processes or fewer with annotations
• How to depict destructive processes
• How to depict taking lots of samples
• What is the observation/process boundary? e.g. MRI scan
Combechem, 30 January 2004, gvh, hrm, gms

Lessons
• We need two related ontologies
  - Plan: what is going to be done
  - Record: what was done
• They are not necessarily the same thing
• Steps are added or repeated during the experiment
• Different annotations are required for each

Process Record Ontology
[Diagram]

NCS Grid Service Architecture
[Diagram]

The “Grid Zone”
• Security is fundamental
• Who is using our experiments?
• Insulate them from each other and from the rest of our institution
• Process & Role based security
• Use DMZ
• This combination creates a “Grid Zone”

Cluster Computation
• Needed for:
  - Design of Experiments: stats computationally intensive
  - Simulations: protein dynamics
• Clusters, Cycle Stealing
• Schools engagement: e-Malaria

• Combechem is compiling a large database of molecules. The database contains the properties of these molecules, e.g.
their crystal structure or solvent accessible surface area (SASA). Some of these properties are measured from experiment while others are calculated from simulations run on the GRID.

Molecule ID   pKa       SASA      Melting Point
1CD34         2.3       Unknown   58
1CD35         1.3       Unknown   112
1CD36         Unknown   36543     96
1CD37         5.3       58435     78
1CD38         Unknown   Unknown   110

• Comberobots continually scan the database for empty fields. They can automatically submit simulations to calculate any unknown properties. These simulations run on the GRID by stealing the spare cycles of a heterogeneous network of computers.

The database of molecules can also be screened against pharmaceutical protein targets. To do this accurately requires knowledge of how the protein changes shape upon ligand binding.
We can use the GRID to investigate protein conformational change via Replica Exchange simulations. Multiple simulations of the protein are run in parallel, each running under a different condition, e.g. temperature. Periodically, pairs of simulations running at neighbouring temperatures are tested and, if the test passes, swapped. This lets conformations found at high temperatures, where conformational change is rapid, pass down to biologically relevant temperatures, where conformational change occurs more slowly.
[Figures: HIV Protease; Nitrogen Regulatory Protein C]

Dissemination & Publication
• A different approach is required to provide data to the community
• The grid provides the necessary medium
• What do we want to make available, and how?

Publication@Source
[Diagram labels: Dissemination; Bibliography; Student; Journal; Professional Body; Archive; Institution; Laboratory]

The Data Trail
[Diagram labels: Raw data; Workflow; Process Model; Derive; Plot; Provenance]
The graphical model of the workflow used as the front end of a typical workflow enactor can also act as the navigation tool for the provenance & publication.
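The idea of navigating from a published plot back through the workflow to the raw data can be sketched as a walk over a small derivation graph. This is a minimal sketch under assumptions: the dictionary layout, the item names `plot`, `derived_data` and `raw_data`, and the `trail` function are hypothetical and only mirror the Raw data, Derive, Plot labels on the slide.

```python
# Each item records the process that produced it and the items it came from.
# Names and structure are illustrative, not taken from the project software.
derivations = {
    "plot": {"process": "Plot", "inputs": ["derived_data"]},
    "derived_data": {"process": "Derive", "inputs": ["raw_data"]},
    "raw_data": {"process": None, "inputs": []},
}


def trail(item, graph):
    """Walk from an item back to its sources, depth-first.

    Returns (item, producing process) pairs, starting at the given item.
    """
    seen = set()
    order = []
    stack = [item]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # an item shared by several derivations is listed once
        seen.add(current)
        node = graph[current]
        order.append((current, node["process"]))
        # Reverse so that inputs are visited in their listed order.
        stack.extend(reversed(node["inputs"]))
    return order


for name, process in trail("plot", derivations):
    print(name, "<-", process)
```

The same graph that drives the workflow can therefore double as the index a reader follows from a figure to the data behind it.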
The need for xtl-Prints
• 100’s of structures
• National Crystallography Service
• How do we disseminate?

The need for xtl-Prints
[Diagram labels: Combechem; DATA; PUBLICATION; DISSEMINATION; Combichem]

Crystallographic e-Prints
[Diagram labels: JOURNAL PUBLICATION; EBank (World); EBank REPORT; STRUCTURE REPORT; REPORT (EPrint); CIF; RESULTS; DATASET (contains DATAFILES); EPrint (Local); DERIVED; RAW DATA; INVESTIGATION; HOLDING]

Crystallographic e-Prints
[Screenshot]

Direct access to data
• DERIVED DATA

Direct access to data
• RAW DATA

Dolphin RDF Browser
• RDF source and resource
• Resource model
• Schema model
• Each statement:
  - If literal: display object
  - If resource: add http request to that resource, when required

SVG active graphics
[Figure]

e-worries
• WSRF
• GTi
• Must ensure this is not a problem for applications

The Semiotic Web
• Chemists use signs and symbols as much as, if not more than, words
• Icons have a great significance: the Periodic Table
• People & computers need to communicate with each other as well as themselves
• Need a more powerful (general) concept than the semantic web & grid

Changing the way we work
[Diagram labels: E-Lab: X-Ray Crystallography; E-Lab: Combinatorial Synthesis; E-Lab: Properties Measurement; Samples; Laboratory Processes; Structures DB; Properties DB; Quantum Mechanical Analysis; Properties Prediction; Data Mining, QSAR, etc; Design of Experiment; Data Provenance; Authorship/Submission; Data Streaming; Visualisation; Agent Assistant]