Molecular Science on the Grid

Peter Murray-Rust, Unilever Centre, University of Cambridge

JISC Consultation Workshop: Building Collaborative eResearch Environments, NESC, 2004-02-23

Power Corrupts; Powerpoint corrupts absolutely (Tufte). Material is therefore in XML.

This presentation will be largely interactive, using XML tools and remote access.

Main site: http://wwmm.ch.cam.ac.uk/

The problem

Remote collaboration (let alone eCollaboration) is not common in chemistry. The major problems are cultural. Chemists may occasionally share central equipment, but will compete on most issues. Chemistry is conservative. The eRevolution has yet to impact software and information providers. For example, OpenSource, OpenData and OpenAccess are largely unknown and unappreciated by senior chemists.

IPR issues stultify and frustrate development - most chemical data is owned and sold by unimaginative companies and quasi-companies that wish to protect untenable restrictive practices. There is no ontology. There are no standards. The companies preserve noninteroperability as a way of protecting diminishing market share rather than looking at what the rest of the world is doing...

Rays of hope

The biosciences need chemistry and are increasingly frustrated with this - they are bypassing traditional paths and finding their own solutions:

• PubChem (NIH)
• NCI database
• Molecular Structures (EBI)

The Internet revolution has shown the future. Our undergraduates use Google for their literature searches. They normally fail because of IPR. BUT Open Access should revolutionise this. W3C technology and protocols are unstoppable.

Chemical Markup Language

Therefore we (PMR and Henry Rzepa) developed CML (the very first XML language). CML feeds off all the W3C initiatives (XML, XSLT, DOM, Schema, XPath, RSS, RDF, OWL, etc.). This is an enormously powerful driver. A major driver for chemical eScience is therefore CML technology (technology push).
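To make "CML feeds off the W3C initiatives" concrete, here is a minimal sketch of reading a CML-style molecule with a generic XML parser and XPath-like queries. The slides do not show any CML markup, so the water document, the element and attribute names (molecule, atomArray, atom, bondArray, bond, elementType, atomRefs2) and the namespace URI are all assumptions made for illustration; check them against the CML schema in use.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for CML; verify against the schema you actually use.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical water molecule written with CML-style elements (illustration only).
CML_DOC = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def element_counts(cml_text):
    """Count atoms per element symbol using a namespace-aware path query."""
    root = ET.fromstring(cml_text)
    counts = {}
    for atom in root.findall("./cml:atomArray/cml:atom", CML_NS):
        symbol = atom.get("elementType")
        counts[symbol] = counts.get(symbol, 0) + 1
    return counts


def bonded_pairs(cml_text):
    """Return the pairs of atom ids referenced by each bond element."""
    root = ET.fromstring(cml_text)
    return [
        tuple(bond.get("atomRefs2").split())
        for bond in root.findall(".//cml:bond", CML_NS)
    ]


if __name__ == "__main__":
    print(element_counts(CML_DOC))
    print(bonded_pairs(CML_DOC))
```

Nothing here is chemistry-specific tooling: the same standard-library parser and path syntax would work on any XML vocabulary, which is the point of building on mainstream W3C technology.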
http://www.xml-cml.org for Chemical Markup Language

Selling eChemistry/CML

CML does not sell well within mainstream academic chemistry!! Our support comes from elsewhere.

Marketing Strategy

• Develop mainstream W3C implementations
• Talk passionately
• Make everything Open
• Use the power of the web for marketing
• Collaborate with early adopters
• Create web-based demonstrators
• Distribute toolkits
• Create evolutionary collaborative environments
• Create new "business models" - information barter. Trade services for Open data rather than trying to sell data for money.
• Have faith and patience...

CML Toolkit

CML has a complete toolkit created mainly by volunteers. It includes:

• customisation of schemas (e.g. CMLCryst for Comb-e-Chem (soton), CMLMinerals (eminerals.org))
• automatic generation of software - SAX and DOM for this schema (Java, CML++, Python, ?F90)
• automatic documentation
• per-element examples and validation
• tools library (manipulation, validation)
• stylesheets
• rendering (8D: chemical (2D), 3D (structure), crystalline)
• dictionaries
• editors (2D)
• parsing of program (e.g. FORTRAN) output
• ant control
• dictionaries
• wrappers for legacy codes
• CMLRSS

Everything is OpenSource, OpenData and OpenAccess.

Gridification supported by DTI/Cambridge eScience. Includes:

• database (Xindice)
• CMLCondor
• CVS
• CMLWiki
• CML++

Adopters

Main adopters of CML are:

o Government and similar agencies: FDA, Eur. Med. Eval Ag., Eur. Pat. Off., WHO
o Public orgs: Nat. Canc. Inst., Nat. Inst. Stds. Technology (NIST)
o Multidisciplinary Projects: BIOPAX (reaction pathways), eMinerals (UK NIEeS, UCL, DL, etc.), Comb-e-chem (UK - soton), MaciE (enzyme reactions EBI/Cambridge)
o Early adopter publishers (Roy. Soc. Chem., Nature), but slowly
o Technology pushers (e.g. XML tool developers - Adobe)

Collaborative tools

Our tools are heavily used by the collaborators. They are completely mainstream and almost trivial to install. Mainly sourceforge, Apache and W3C:

o CVS.
Several orgs use our CVS server for schemas, codes, etc.
o Wiki. The main project planning mechanism.
o IRC. Enormous use. Software development, strategy, etc.
o RSS. Very exciting. This can be a lightweight data portal, allowing us to develop metadata in an evolutionary manner.
o Servers. Molecular transformations, normalization, canonicalization, indexing, etc. are all provided as services.
o Dictionaries. Our ontologies are based on XMLSchema rather than OWL as we need string datatyping, aggregation and programmatic semantics. These are authored communally and provide much of the domain metadata.

Grid tools

o Condor - very successful. 750,000 molecular jobs run. Main problem is processing results!! Many thanks to eMinerals ppl, Jon Wakelin, Mark Calleja.
o AccessGrid. Very successful meeting with Am. Chem. Soc. and Roy. Soc. Chem. Keen to adopt mini-access grid (developed by eMinerals/earthSciences). Can see this being a very useful addition to the IRC.
o Globus. We are Globus compatible.
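The RSS entry under Collaborative tools describes a feed acting as a lightweight data portal. A minimal sketch of that idea follows: feed items optionally carry a CML-style molecule, and a reader picks out the items that do. The feed layout, the example.org links, the namespace URI and the function name are all hypothetical; the slides do not specify how CMLRSS embeds CML, so treat this as one possible arrangement rather than the CMLRSS format.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the embedded CML payload.
NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical feed: one item carries a molecule, one is plain news.
FEED = """<rss version="2.0">
  <channel>
    <title>Example molecule feed</title>
    <item>
      <title>water</title>
      <link>http://example.org/molecules/water</link>
      <molecule xmlns="http://www.xml-cml.org/schema" id="water">
        <atomArray>
          <atom id="a1" elementType="O"/>
          <atom id="a2" elementType="H"/>
          <atom id="a3" elementType="H"/>
        </atomArray>
      </molecule>
    </item>
    <item>
      <title>news only</title>
      <link>http://example.org/news/1</link>
    </item>
  </channel>
</rss>"""


def molecules_in_feed(feed_text):
    """List feed items that embed a molecule, with a small metadata summary."""
    root = ET.fromstring(feed_text)
    found = []
    for item in root.findall("./channel/item"):
        mol = item.find("cml:molecule", NS)
        if mol is None:
            continue  # ordinary news item, no chemistry payload
        atoms = mol.findall("./cml:atomArray/cml:atom", NS)
        found.append({
            "title": item.findtext("title"),
            "id": mol.get("id"),
            "atoms": len(atoms),
        })
    return found


if __name__ == "__main__":
    for entry in molecules_in_feed(FEED):
        print(entry)
```

Because ordinary aggregators ignore elements they do not recognise, a feed like this can start as plain news and gain chemical metadata item by item, which matches the evolutionary approach described above.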