Molecular Science on the Grid

Peter Murray-Rust, Unilever Centre, University of
Cambridge
JISC Consultation Workshop: Building Collaborative
eResearch Environments, NESC, 2004-02-23
Power corrupts; PowerPoint corrupts absolutely (Tufte). Material is therefore in XML.
This presentation will be largely interactive using XML tools and remote access. Main
site: http://wwmm.ch.cam.ac.uk/
The problem
Remote collaboration (let alone eCollaboration) is not common in chemistry. The major
problems are cultural. Chemists may occasionally share central equipment, but will
compete on most issues.
Chemistry is conservative. The eRevolution has yet to impact software and information
providers.
For example, OpenSource, OpenData and OpenAccess are largely unknown to, and
unappreciated by, senior chemists.
IPR issues stultify and frustrate development - most chemical data is owned and sold by
unimaginative companies and quasi-companies that wish to protect untenable restrictive
practices.
There is no ontology. There are no standards. The companies preserve
non-interoperability as a way of protecting diminishing market share rather than looking at
what the rest of the world is doing...
Rays of hope
The biosciences need chemistry and are increasingly frustrated with this - they are
bypassing traditional paths and finding their own solutions:
• PubChem (NIH)
• NCI database
• Molecular Structures (EBI)
The Internet revolution has shown the future. Our undergraduates use Google for their
literature searches. They normally fail because of IPR. BUT Open Access should
revolutionise this.
W3C technology and protocols are unstoppable.
Chemical Markup Language
Therefore we (PMR and Henry Rzepa) developed CML (the very first XML language).
CML feeds off all the W3C initiatives (XML, XSLT, DOM, Schema, XPath, RSS,
RDF, OWL, etc.). This is an enormously powerful driver.
A major driver for chemical eScience is therefore CML technology (technology push).
http://www.xml-cml.org for Chemical Markup Language
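As an illustration of how CML rides on generic XML tooling, the sketch below reads a small molecule with Python's standard XML library and XPath-style queries. The sample document, the namespace URI and the element and attribute names (molecule, atomArray, atom, bondArray, bond, atomRefs2) are assumptions chosen for illustration and are not taken from this presentation.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI; adjust to match the schema version in use.
CML_NS = "http://www.xml-cml.org/schema"

# Illustrative document only: water with three atoms and two bonds.
CML_DOC = f"""
<molecule xmlns="{CML_NS}" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""


def summarise(cml_text):
    """Return ({atom id: element symbol}, [(atom id, atom id), ...])."""
    root = ET.fromstring(cml_text.strip())
    ns = {"cml": CML_NS}
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.findall("cml:atomArray/cml:atom", ns)
    }
    bonds = [
        tuple(bond.get("atomRefs2").split())
        for bond in root.findall("cml:bondArray/cml:bond", ns)
    ]
    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = summarise(CML_DOC)
    print(atoms)
    print(bonds)
```

No chemistry-specific parser is needed here: any XML-aware tool can address atoms and bonds by path.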
Selling eChemistry/CML
CML does not sell well within mainstream academic chemistry!! Our support comes
from elsewhere.
Marketing Strategy
Develop mainstream W3C implementations
Talk passionately
Make everything Open
Use the power of the web for marketing
Collaborate with early adopters
Create web-based demonstrators
Distribute toolkits
Create evolutionary collaborative environments.
Create new "business models" - information barter. Trade services for Open data rather
than try to sell data for money.
Have faith and patience...
CML Toolkit
CML has a complete toolkit created mainly by volunteers. It includes:
• customisation of schemas (e.g. CMLCryst for Comb-e-Chem (soton),
  CMLMinerals (eminerals.org))
• automatic generation of software - SAX and DOM for this schema (Java,
  CML++, Python, ?F90)
• automatic documentation
• per-element examples and validation
• tools library (manipulation, validation)
• stylesheets
• rendering (8D: chemical (2D), 3D (structure), crystalline)
• dictionaries
• editors (2D)
• parsing of program (e.g. FORTRAN) output
• ant control
• wrappers for legacy codes
• CMLRSS
Everything is OpenSource, OpenData and OpenAccess.
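The "parsing of program output" and "wrappers for legacy codes" items can be pictured with a small sketch: a wrapper reads the plain-text output of a legacy code and emits CML. The input format (ATOM and ENERGY records), the function name legacy_to_cml, the dictRef value and the namespace URI are all hypothetical. This is not the toolkit's actual parser.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # assumed namespace URI
ET.register_namespace("", CML_NS)          # serialise CML as the default namespace


def q(tag):
    """Qualify a tag name with the CML namespace."""
    return f"{{{CML_NS}}}{tag}"


def legacy_to_cml(text, mol_id="m1"):
    """Convert hypothetical 'ATOM sym x y z' / 'ENERGY value' records to CML."""
    molecule = ET.Element(q("molecule"), {"id": mol_id})
    atom_array = ET.SubElement(molecule, q("atomArray"))
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ATOM" and len(fields) == 5:
            symbol, x, y, z = fields[1:]
            for value in (x, y, z):
                float(value)  # raises ValueError on a malformed coordinate
            atom_id = f"a{len(atom_array) + 1}"
            ET.SubElement(
                atom_array,
                q("atom"),
                {"id": atom_id, "elementType": symbol, "x3": x, "y3": y, "z3": z},
            )
        elif fields[0] == "ENERGY" and len(fields) == 2:
            float(fields[1])  # validate before storing
            # "demo:totalEnergy" is a made-up dictionary reference.
            prop = ET.SubElement(molecule, q("property"), {"dictRef": "demo:totalEnergy"})
            scalar = ET.SubElement(prop, q("scalar"), {"dataType": "xsd:double"})
            scalar.text = fields[1]
    return molecule


if __name__ == "__main__":
    sample = "ATOM O 0.000 0.000 0.117\nATOM H 0.000 0.757 -0.469\nENERGY -76.0267\n"
    print(ET.tostring(legacy_to_cml(sample), encoding="unicode"))
```

Unrecognised lines are silently skipped, so the legacy program itself need not be modified.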
Gridification supported by DTI/Cambridge eScience. Includes:
• database (Xindice)
• CMLCondor
• CVS
• CMLWiki
• CML++
Adopters
Main adopters of CML are:
o Government and similar agencies: FDA, Eur. Med. Eval. Ag., Eur. Pat. Off., WHO
o Public orgs: Nat. Canc. Inst., Nat. Inst. Stds. Technology (NIST)
o Multidisciplinary projects: BIOPAX (reaction pathways), eMinerals (UK NIEeS, UCL, DL, etc.), Comb-e-chem (UK - soton), MaciE (enzyme
reactions, EBI/Cambridge)
o Early adopter publishers (Roy. Soc. Chem., Nature). But slowly.
o Technology pushers (e.g. XML tool developers - Adobe).
Collaborative tools
Our tools are heavily used by the collaborators.
They are completely mainstream and almost trivial to install.
Mainly sourceforge, Apache and W3C:
o CVS. Several orgs use our CVS server for schemas, codes, etc.
o Wiki. The main project planning mechanism.
o IRC. Enormous use. Software development, strategy, etc.
o RSS. Very exciting. This can be a lightweight data portal, allowing us to
develop metadata in an evolutionary manner.
o Servers. Molecular transformations, normalization, canonicalization,
indexing, etc. are all provided as services.
o Dictionaries. Our ontologies are based on XMLSchema rather than OWL
as we need string datatyping, aggregation and programmatic semantics.
These are authored communally and provide much of the domain
metadata.
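The idea of RSS as a lightweight data portal can be sketched as a feed whose items carry a CML molecule alongside the usual title and link. The sketch uses RSS 2.0 for brevity; the real CMLRSS layout is not described here, so the feed structure, the example.org URLs, the function names and the namespace URI are all assumptions.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # assumed namespace URI
ET.register_namespace("cml", CML_NS)


def build_feed(title, link, molecules):
    """Build an RSS 2.0 feed; molecules is a list of (title, link, symbols)."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Molecules published as CML"
    for item_title, item_link, symbols in molecules:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
        # The CML payload travels inside the item in its own namespace.
        mol = ET.SubElement(item, f"{{{CML_NS}}}molecule", {"id": item_title})
        atoms = ET.SubElement(mol, f"{{{CML_NS}}}atomArray")
        for i, symbol in enumerate(symbols, start=1):
            ET.SubElement(atoms, f"{{{CML_NS}}}atom", {"id": f"a{i}", "elementType": symbol})
    return ET.tostring(rss, encoding="unicode")


def read_feed(feed_xml):
    """Return {item title: [element symbols]} for every item in the feed."""
    root = ET.fromstring(feed_xml)
    ns = {"cml": CML_NS}
    result = {}
    for item in root.findall("channel/item"):
        atoms = item.findall("cml:molecule/cml:atomArray/cml:atom", ns)
        result[item.findtext("title")] = [a.get("elementType") for a in atoms]
    return result


if __name__ == "__main__":
    feed = build_feed(
        "Demo molecules",
        "http://example.org/feed",
        [("water", "http://example.org/water", ["O", "H", "H"])],
    )
    print(read_feed(feed))
```

An ordinary RSS reader would ignore the namespaced payload and still show titles and links, while a chemistry-aware client can extract the molecules.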
Grid tools
o Condor. Very successful. 750,000 molecular jobs run. Main problem is
processing results!! Many thanks to eMinerals people, Jon Wakelin, Mark
Calleja.
o AccessGrid. Very successful meeting with Am. Chem. Soc. and Roy. Soc.
Chem. Keen to adopt mini-access grid (developed by
eMinerals/earthSciences). Can see this being a very useful addition to the
IRC.
o Globus. We are Globus compatible.
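To give a feel for what running many molecular jobs under Condor involves, here is a minimal sketch that writes a submit description for a batch of per-molecule jobs and then lists which jobs produced no output. The file naming scheme (mol_N.cml, job_N.out), the function names and the executable name are hypothetical, and this is not the CMLCondor implementation.

```python
from pathlib import Path


def condor_submit_description(executable, n_jobs):
    """Return the text of a Condor submit file for n_jobs independent jobs.

    $(Process) is expanded by Condor to 0, 1, ..., n_jobs - 1, so each job
    reads its own (hypothetically named) input file mol_<n>.cml.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "arguments = mol_$(Process).cml",
        "transfer_input_files = mol_$(Process).cml",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "output = job_$(Process).out",
        "error = job_$(Process).err",
        "log = batch.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


def missing_results(result_dir, n_jobs):
    """Return the indices of jobs whose output file is absent or empty."""
    directory = Path(result_dir)
    missing = []
    for i in range(n_jobs):
        out_file = directory / f"job_{i}.out"
        if not out_file.is_file() or out_file.stat().st_size == 0:
            missing.append(i)
    return missing


if __name__ == "__main__":
    print(condor_submit_description("run_molecule.sh", 3))
    print(missing_results(".", 3))
```

With hundreds of thousands of jobs, a bookkeeping pass such as missing_results is the sort of step that makes processing results manageable, since failed jobs can be resubmitted selectively.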