Motivation   Vocabulary   Wrappers

Looking for a (standard) Common Format for Computational (Quantum) Chemistry

A WG activity within COST action 23 (WG D23/0006/01)
- Elda Rossi, Andrew Emerson - CINECA
- Gian Luigi Bendazzoli, Antonio Monari - Università di Bologna
- Renzo Cimiraglia, Celestino Angeli, Stefano Borini - Università di Ferrara
- Daniel Maynau, Stefano Evangelisti - IRSAMC, Toulouse
- José Sanchez-Marin - Universitat de Valencia
- Peter Szalay - Eötvös Loránd University
- Rosa Caballol - Universitat Rovira i Virgili, Tarragona


Motivation for the work

To build a meta-system for supporting research collaboration in the field of "Localised Orbitals in post-SCF methods … Linear Scaling methods in a Multi-Reference context"


The scenario

- Different laboratories need to collaborate
- Different "home-made" codes need to be used together, since they give different views of the same problem
- General-purpose "basic" codes are needed to pre-compute data in a sort of pipeline
- Programs should remain on their original sites, under the responsibility of their authors
- Different platforms
- Network connections (grid architecture)
- Workflow


The need for a Common Format

The first problem we faced: how can different codes (on different platforms) communicate?
We need a Common Format for (at least) Quantum Chemistry codes.


Preliminary steps

Looking around …
   o CML has been available for a long time
   o XML is used by Accelrys for internal files
   o XML is used by ArgusLab for internal files
None of them is completely suited to computational chemistry: they mainly cover structural chemistry, with no Quantum Chemistry properties.
XML seems the best technology, so we decided to try another XML-based format.
HDF5 looked well suited to storing the large binary data typical of QC.


How the engine should work

[Diagram labels: IN-wrapper, IN-files, Program, OUT-files, OUT-wrapper, Data Repository (XML/HDF)]

- Leaves the program unchanged
- One wrapper for each program
   o If a code is added, only one wrapper has to be written


QCML: an XML format for QC

To be as general as possible, we need to write down a hierarchical schema of Quantum Chemistry quantities.
As a first approximation, three domains can be identified:
- Base FACTS: initial data describing the physics of the system
- DERIVED: quantities computed from FACTS using QC algorithms (energies, properties, integrals, coefficients, …)
- W-FLOW: which codes are in the pipeline, specific input data, …

Fact Parameters
• A base fact is a fact that is a given in the world and is remembered (stored) in the system.
• A derived fact is created by an inference or a mathematical calculation from terms, facts, other derivations, or even action assertions.


FACT: molecule

Outline of elements and their attributes:

<system title date program author>
  <molecule nElectrons charge spinMultiplicity spaceSymmetry>
    <symmetry groupName/>
    <geometry type unit numAtoms symmetryRef>
      <atom symbol isotope x3 y3 z3/>
    <basis name type numOrbitals>
      <atomBase angularMomMAX symbol>
        <angularMom value symbol numOrbitals>
          <orbital id numPrimitives>
            <exps/>
            <coeffs/>

- Symmetry: group name & other symmetry data
- Geometry: only cartesian, full or unique for sym
- Basis: by name or fully defined


DERIVED data: computedData

<system …>
  <computedData>
    <energy unit levelOfTheory quality value>
      <state spaceSymmetry spinMultiplicity excitationLevel/>
    <property unit levelOfTheory quality value>
      <state "bra" spaceSymmetry spinMultiplicity excitationLevel/>
      <state "ket" spaceSymmetry spinMultiplicity excitationLevel/>
      <operator order name/>
    <file address URL/>

A "schema" has been written for QCML.


DERIVED: computedData/file

Problem with large binary datasets: include the reference, not the actual data.
Two possible strategies:
1. Leave the data in their native format and translate them only when needed. Maintain different versions (formats) of the same data.
2. Define a "standard" format for binary data and convert them anyway.
The second was the solution of choice. HDF5 appears to be a good solution.


HDF Mission

To develop, promote, deploy, and support open and free technologies that facilitate scientific data storage, exchange, access, analysis and discovery.
- Format and software for scientific data
- Stores images, multidimensional arrays, tables, etc.
- Emphasis on storage and I/O efficiency
- Free and commercial software support
- Emphasis on standards
- Users from many engineering and scientific fields


Example HDF5 file

[Figure: tree of an example HDF5 file. Group labels: "/" (root), "/AO", "/MO", "/mono", "/bi", "/coefficients". Dataset labels: Overlap, Kinetic, Repulsion, Kinetic + Repulsion, Property, 4-D array (under "/bi"), Table.]

Example table:

Orb | occ | energy
----|-----|-------
1   | 0   | 0.35
2   | 0.5 | 0.26
3   | 2.  | 0.69


HDF file structure for QC

[Diagram labels, as extracted]
Root: Name, QCML_ref, Norb AO, Norb MO
   Spin Polar.: a=b, a, b
   Orb Classif: Core, Active, Virtual
   Orb Energies
   Orb Symm
First integral set: <i/j>, <i/T/j>, <i/Vnuc/j>, <i/T/j>+<i/Vnuc/j>, <ij/kl>
Second integral set: <i/T/j>, <i/V/j>, <i/T/j>+<i/Vnuc/j>, <ij/kl>
Coefficients: coeff(i,j)
[1-order] Property: <i/p/j>
+ format metadata (integer, binary, endianness, …)


QCML processing: wrappers

- One pair of wrappers for each code in the metasystem
- They should be written and maintained by the authors of the chemical codes
- XML processing (DOM) can be used, but … in which language?
   o Fortran: no easy and stable DOM available
   o Scripting languages (Perl/Python/Java): not known by chemists
- We tried both ways (Fortran and Python)


Fortran DOM: drawbacks

- The only problem is the Fortran binding
   o It doesn't exist (at least as of last year …)
   o DOM is OO and Fortran is not
- A C binding exists (Gdome2)
- Gdome2 was installed, with very hard work, on a mainframe platform (it was conceived for Linux)
- We are currently converting it to Fortran, adopting the DOM recommendations (simplified …)


Why Fortran

GOOD
• Users don't need to learn a new language
• Homogeneous environment
BAD
• Tricky: needs an external library (f77xml) built on top of gdome2
• Porting problems for gdome2/libxml2 may arise


F77xml library

- Still in development
   o v0.4 is out (experimental, with limited features)
   o v1.0 upcoming, API changed to be nearly DOM2 compliant
- Written in C on top of gdome2: http://gdome2.cs.unibo.it/index.html
- Designed for interfacing to F77 (also F90 soon)
- Reduced namespace pollution
Cons:
● F77 syntax is difficult (DOM2 + tricks)
● F90 syntax is simpler
● A pre-processor will convert F90 syntax to F77
http://freshmeat.net/projects/f77xml


F77xml library - V1.0 example

Gdome2 (C):
   GdomeNode* gdome_el_firstChild (GdomeElement *self, GdomeException *exc);

F90:
   Call f77xml_el_firstChild(nodeCode, elemCode, exc)
   First position: return value
   nodeCode, elemCode, exc are mapped to INTEGER

F77:
   Func='el_firstChild'
   Call xp3t1(nodeCode,func,elemCode,exc)
   Multiplexer function:
      x:
      p3: 3 parameters (+ function name)
      t1: type 1 parameter schema (code/code/error)


Why Python

GOOD
• Very easy object-oriented language
• Works well with strings
• Simple and efficient DOM interface for XML
• Present in almost all UNIX/Linux distributions
BAD
• Users do need to learn a new language
• Maybe less powerful than Perl
• Usually not used by chemists


Python Wrapper
At present a prototype works with the molpro-fci chain.
- It takes information from the XML repository
- Writes the proper MOLPRO and FCI input
- Starts the two programs
With a different XML file, users should only have to specify the file name and some simple parameters (orbital guess for FCI).


Python or not

- Python is very simple to learn and works very efficiently with XML
- Scripts written in Python (at least for prototypes) are quite clear, linear and easy to maintain or upgrade
- The possibility of a GUI could make our project much more user-friendly


What we have done …

- Single platform: IBM SP4
- Two code chains
   o MolPro to FCI
   o MolPro to CasDI

[Diagram labels: Start here, QCML Repository, HDF5 Repository, IN-wrapper, MolPro IN-file, MolPro, OUT-wrapper, FCIDUMP, IN-wrapper, FCI IN-file, Bin file for FCI, FCI, Stop here]


In conclusion …

Two important hints on data …
1. Use some XML dialect for describing simple structured data
2. Use HDF5 for storing large arrays and binary data

- Need for a good and easy API to XML & HDF
- How to manage the workflow
- How to manage the grid connection
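

Appendix: IN-wrapper sketch (illustrative)

The IN-wrapper idea described in the slides (read the QCML repository through a DOM interface, then write an input file for a program) can be sketched roughly as follows. This is not the authors' prototype. The element and attribute names are taken from the "FACT: molecule" outline, but the nesting, the sample values (an H2 molecule in bohr) and the output text format are assumptions made up for illustration; the output is not real MOLPRO or FCI syntax. The helper names read_molecule and write_input are hypothetical.

```python
from xml.dom import minidom

# Hypothetical QCML fragment. Tag and attribute names follow the slides;
# the nesting and all values are assumptions chosen for illustration only.
QCML_EXAMPLE = """\
<system title="H2 test" program="example">
  <molecule nElectrons="2" charge="0" spinMultiplicity="1">
    <geometry type="cartesian" unit="bohr" numAtoms="2">
      <atom symbol="H" x3="0.0" y3="0.0" z3="0.0"/>
      <atom symbol="H" x3="0.0" y3="0.0" z3="1.4"/>
    </geometry>
    <basis name="sto-3g"/>
  </molecule>
</system>
"""


def read_molecule(qcml_text):
    """Parse a QCML string with the DOM API and collect the molecule data."""
    dom = minidom.parseString(qcml_text)
    molecule = dom.getElementsByTagName("molecule")[0]
    geometry = molecule.getElementsByTagName("geometry")[0]
    basis = molecule.getElementsByTagName("basis")[0]
    atoms = [
        (
            atom.getAttribute("symbol"),
            float(atom.getAttribute("x3")),
            float(atom.getAttribute("y3")),
            float(atom.getAttribute("z3")),
        )
        for atom in geometry.getElementsByTagName("atom")
    ]
    return {
        "charge": int(molecule.getAttribute("charge")),
        "multiplicity": int(molecule.getAttribute("spinMultiplicity")),
        "unit": geometry.getAttribute("unit"),
        "basis": basis.getAttribute("name"),
        "atoms": atoms,
    }


def write_input(mol):
    """Render a made-up plain-text input (not any real program's syntax)."""
    lines = [
        f"charge {mol['charge']}",
        f"multiplicity {mol['multiplicity']}",
        f"basis {mol['basis']}",
        f"units {mol['unit']}",
        "geometry",
    ]
    for symbol, x, y, z in mol["atoms"]:
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    lines.append("end")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(write_input(read_molecule(QCML_EXAMPLE)))
```

In this layout only write_input depends on the target program, which matches the point made in the slides that adding a code to the metasystem requires writing only its own wrapper.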