Reproducibility of computations on complex systems
(La reproductibilité des calculs sur les systèmes complexes)
Konrad HINSEN
Centre de Biophysique Moléculaire, Orléans, France, and Synchrotron SOLEIL, Saint Aubin, France
20 November 2014

Computational studies of complex systems
Simple models
- Goal: study emergent complex phenomena
- Examples: networks, materials science
Complex models
- Goals: find simple patterns; compute model predictions for comparison with experiment
- Examples: climate models, biomolecular force fields

Climate models
[Figure 1. Conceptual view of the components and couplings of a coupled Earth system model. Components shown include atmosphere, ocean, sea ice, inland ice, land surface, and terrestrial vegetation, connected by couplers.]
From: Easterbrook & Johns, Comp. Sci. Eng. 11(6), 65-74 (2009)
Protein models
A small protein (lysozyme) in the Amber99 force field:
- 1960 atoms (1001 shown)
- 183 distinct atom types
- several thousand numerical parameters

Reproducibility
One of the ideals of science
- Scientific results should be verifiable.
- Verification requires reproduction by other scientists.
Few results actually are reproduced, but it's still important to make this possible:
- important for the credibility of science in society (remember "Climategate", cold fusion, ...)
- important for the credibility of a specific study
The more detail you provide about what you did, the more your peers are willing to believe that you did what you claim to have done.

Replicability
Replication
- re-doing exactly the same steps
- should be trivial for computation ... but isn't
- replicability is a proof of quality assurance
Reproduction
- re-implementing the basic idea of the method
- checks for the impact of details considered irrelevant
Reproducibility requires replicability as a foundation.

Reproducibility in practice
Near-perfect in mathematics
- No journal publishes a theorem unless the author provides a proof.
Often best-effort in experimental science
- Lab notebooks record all the details for in-house replication.
- Published protocols are less detailed, but often clear enough for an expert in the field.
- The main limitation is technical: lab equipment and concrete samples cannot be reproduced identically.
Lousy in computational science
- Papers give short method summaries and concentrate on results.
- Few scientists can replicate their own results after a few months.

Reproducibility in computational science
We could do better than experimentalists
- Results are deterministic, fully defined by input data and algorithms.
- If we published all input data and programs, anyone could replicate the results exactly.
- If our programs were understandable, reproduction would be much easier.
But we don't
- Lots of technical difficulties.
- Significant additional effort.
- Few incentives.

The Reproducible Research movement
Goals
- Create more awareness of the problem.
- Provide better tools.
French Reproducible Research community
- http://listes.univ-orleans.fr/sympa/info/recherche-reproductible

Reproducible Research matters
"Climategate"
- In 2009 a server at the Climatic Research Unit at the University of East Anglia was hacked and many of its files became public. One of them describes a scientist's difficulties in reproducing his colleagues' results. This has been used by climate change skeptics to discredit climate research.
Protein structure retractions
- In 2006, three important protein structures published in Science were retracted following the discovery of a bug in the software used for data processing.
For more examples and details:
- Z. Merali, "...Error ...
why scientific programming does not compute", Nature 467, 775 (2010)
- Science Special Issue on Computational Biology, 13 April 2012

Reproducibility and complexity
- Replicability requires archiving all computations (software, data, workflows, ...)
- Reproducibility requires documenting the computational methods for humans
Complexity makes this very difficult:
- no good notation for specifying complex models
- no good tools for human exploration of complex models
- simulation software is even more complex than the models

Reproducibility in biomolecular simulations (1/3)
Heavy resource requirements
- Most important technique: Molecular Dynamics
- Long runs (days to weeks) on big parallel machines
- Produces large datasets (a few GB per trajectory)
- Trajectory analysis can be as expensive as the simulation itself
Datasets are too big for version control.
Today's provenance trackers can't keep track of computations across machines.

Reproducibility in biomolecular simulations (2/3)
Complexity
- Computational models contain thousands of parameters.
- No complete formal description of the models (except for program source code).
- Simulation algorithms contain many approximations made for performance reasons. These approximations are usually undocumented.
Publishing a complete description of a molecular simulation is so difficult that very few scientists could do it and equally few could understand the resulting description.
There is no file format for storing a complete description of a molecular simulation in machine-readable form.
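The archiving requirement stated earlier (software, data, workflows) can be approached in small steps even without a complete file format. The following is only a minimal sketch of one such step: recording checksums of the inputs and the versions of the software used in a run. The file names, their contents, and the program name `hypothetical_md_engine` are made up for illustration and do not refer to any real package.

```python
import hashlib
import json
import platform


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string as hex digits."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(inputs: dict, software: dict) -> dict:
    """Record checksums of inputs and versions of software used in a run.

    inputs:   hypothetical mapping of file name -> file contents (bytes)
    software: hypothetical mapping of program name -> version string
    """
    return {
        "python_version": platform.python_version(),
        "inputs": {name: sha256_hex(inputs[name]) for name in sorted(inputs)},
        "software": {name: software[name] for name in sorted(software)},
    }


# Made-up example data; a real study would read the actual input files.
example_inputs = {
    "parameters.dat": b"bond C-H 340.0 1.09\n",
    "start_configuration.xyz": b"1\n\nH 0.0 0.0 0.0\n",
}
example_software = {"hypothetical_md_engine": "1.2.3"}

manifest = build_manifest(example_inputs, example_software)
print(json.dumps(manifest, indent=2))
```

Such a manifest does not describe the model itself, so it helps with replication at best. It only makes it possible to check later that the same inputs and program versions are being used.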
Reproducibility in biomolecular simulations (3/3)
A split community
- A small number of "developers" with backgrounds in physics and chemistry
- A large number of "users" with backgrounds in biology and biochemistry
- A negligibly small number of computing specialists
Users don't care about the technical details of the simulations.
Users (have to) trust the developers blindly.
Developers understand the models but have no training in software development or data management.
The field is dominated by a few large monolithic simulation packages whose internal workings are complex and mostly undocumented.
Intermediate stages of processing in a simulation workflow are impossible to access or available only in program-specific formats.

Biomolecular force fields (1/2)
This is how a typical research paper describes the AMBER force field:

\[
U = \sum_{\text{bonds } ij} k_{ij} \left( r_{ij} - r_{ij}^{(0)} \right)^2
  + \sum_{\text{angles } ijk} k_{ijk} \left( \phi_{ijk} - \phi_{ijk}^{(0)} \right)^2
  + \sum_{\text{dihedrals } ijkl} k_{ijkl} \cos\left( n_{ijkl} \theta_{ijkl} - \delta_{ijkl} \right)
  + \sum_{\text{all pairs } ij} 4 \epsilon_{ij} \left( \frac{\sigma_{ij}^{12}}{r_{ij}^{12}} - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}} \right)
  + \sum_{\text{all pairs } ij} \frac{q_i q_j}{4 \pi \epsilon_0 r_{ij}}
\]

Biomolecular force fields (2/2)
Protein parameter files for Amber 12, in somewhat documented formats:

lines  filename
984    amino12.in
814    aminoct12.in
782    aminont12.in
533    frcmod.ff12SB
744    parm99.dat

For the details... read the source code!
For the algorithms that select the parameters for a given protein structure... read the source code!

Biomolecular simulation software
[Slide shows the first page of the following article.]
SOFTWARE ENGINEERING FOR CSE
Streamlining Development of a Multimillion-Line Computational Chemistry Code
Robin M.
Betz and Ross C. Walker | San Diego Supercomputer Center
Software engineering methodologies can be helpful in computational science and engineering projects. Here, a continuous integration software engineering strategy is applied to a multimillion-line molecular dynamics code; the implementation both streamlines the development and release process and unifies a team of widely distributed, academic developers.
Betz & Walker, Comp. Sci. Eng. 16(3), 10-17 (2014)

A real-life case of (non-)reproducibility (1/5)
Goal: find the gas-phase structure of short peptide sequences by molecular simulation
Earlier work on this topic [1] finds that Ac-A15K + H+ forms a helix, but Ac-KA15 + H+ forms a globule.
My simulations predict that both sequences form globules.
What did I do differently?
[1] M.F. Jarrold, Phys. Chem. Chem. Phys. 9, 1659 (2007)

A real-life case of (non-)reproducibility (2/5)
From the paper:
"Molecular Dynamics (MD) simulations were performed to help interpret the experimental results. The simulations were done with the MACSIMUS suite of programs [31] using the CHARMM21.3 parameter set. A dielectric constant of 1.0 was employed."
The URL in ref. 31 is broken, but Google helps me find MACSIMUS nevertheless. It's free and comes with a manual!
I download the "latest release" dated 2012-11-09. But which one was used for that paper in 2007?

A real-life case of (non-)reproducibility (3/5)
MACSIMUS comes with a file charmm21.par that starts with

    ! Parameter File for CHARMM version 21.3 [June 24, 1991]
    ! Includes parameters for both polar and all hydrogen topology files
    ! Based on QUANTA Parameter Handbook [Polygen Corporation, 1990]
    ! Modified by JK using various sources

Did that paper use the "polar" or the "all hydrogen" topology files? I can find only one set of topology files named charmm21, and that's with polar hydrogens only, so I guess "polar".
Now I have the parameters, but I don't know the rules of the CHARMM force field, nor can I be sure that MACSIMUS uses the same rules as the CHARMM software.

A real-life case of (non-)reproducibility (4/5)
From the paper:
"A variety of starting structures were employed (such as helix, sheet, and extended linear chain) and a number of simulated annealing schedules were used in an effort to escape high energy local minima. Often, hundreds of simulations were performed to explore the energy landscape of a particular peptide. In some cases, MD with simulated annealing was unable to locate the lowest energy conformation and more sophisticated methods were used (see description of evolutionary based methods below)."
I might as well give up here...
My point is not to criticize this particular paper. The level of description is typical for the field.
A real-life case of (non-)reproducibility (5/5)
What I would have liked to get:
- a machine-readable file containing a full specification of the simulated system:
  - chemical structure
  - all force field terms with their parameters
  - the initial atom positions
- a script implementing the annealing protocol
- links to all the software used in the simulation, with version numbers
All that with persistent references (DOIs).

Enough whining...
Let's do something to improve the situation!

1. A data model for molecular simulations
2. A data model for executable papers

Data handling in molecular simulation
There is no standard file format for molecular simulations.
Best approximation: the PDB format
- made for a different purpose (crystallography)
- lacks precision for coordinates
- cannot store anything but coordinates
- limited to 99,999 atoms (five-digit atom serial number)
- few if any programs respect it completely
OK... so let's define something better...
... which is a far from trivial task.
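The coordinate precision limit mentioned for the PDB format can be illustrated with a small sketch. It assumes the fixed-width layout of PDB ATOM records, in which each coordinate occupies an 8-character field with three decimals (in Å). The function names and the sample coordinates are made up for this illustration.

```python
# Sketch: precision loss when coordinates pass through PDB-style fields.
# Assumption: x, y, z are written with the "%8.3f" layout of ATOM records.


def to_pdb_field(value: float) -> str:
    """Format one coordinate as an 8-character field with 3 decimals."""
    return "%8.3f" % value


def from_pdb_field(field: str) -> float:
    """Parse a coordinate field back into a float."""
    return float(field)


def roundtrip_error(coords) -> float:
    """Largest absolute change after writing and re-reading the coordinates."""
    return max(abs(c - from_pdb_field(to_pdb_field(c))) for c in coords)


# Made-up coordinates with more digits than the format can hold.
original = [12.3456789, -0.0004321, 101.9999994]
error = roundtrip_error(original)
print("largest round-trip error:", error)
```

Whatever the input, the round-trip error of such a field can reach 0.0005 Å, so a configuration written to PDB and read back is not the configuration the simulation actually used. An exact replication needs a format that stores the full floating-point values.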
MOSAIC
[Slide shows the first page of the following article.]
Article, pubs.acs.org/jcim
MOSAIC: A Data Model and File Formats for Molecular Simulations
Konrad Hinsen
Centre de Biophysique Moléculaire (UPR 4301 CNRS), Rue Charles Sadron, 45071 Orléans Cedex 2, France
Synchrotron SOLEIL, Division Expériences, 91192 Gif-sur-Yvette, France
ABSTRACT: The MOlecular SimulAtion Interchange Conventions (MOSAIC) consist of a data model for molecular simulations and of concrete implementations of this data model in the form of file formats. MOSAIC is designed as a modular set of specifications, of which the initial version covers molecular structure and configurations. A reference implementation in the Python language facilitates the development of simulation software based on MOSAIC.
K. Hinsen, J. Chem. Inf. Model. 54(1), 131-137 (2014)
What do we want to archive and exchange?
- the system we are simulating
  - the chemical structure of the molecules
  - the number of each kind of molecule
  - the topology of the system (infinite, 3D periodic, 2D periodic, ...)
  - symmetries (crystals, ...)
- configurations (positions of all atoms plus PBC parameters)
- velocities, masses, charges, and other per-atom properties
- force field parameters
- force fields (the rules)
- trajectories
- experimental data used in simulations
- ... lots of other stuff ...
Don't try to do everything at once: we want a modular set of conventions.

Which systems, which simulation techniques?
- anything at the atomic and molecular scale: gases, liquids, solids
- from atoms via small molecules to macromolecules
- all-atom and coarse-grained models
- from quantum models via classical mechanics to stochastic dynamics
Application domains:
- Chemical/statistical physics
- Solid-state physics
- Chemistry
- Molecular/structural biology

Mosaic: a data model for molecular simulation
MOlecular SimulAtion Interchange Conventions
http://mosaic-data-model.github.io/
Mosaic is a data model with (currently) two representations.
Data model
- expresses molecular simulation data in terms of basic data items and structures: numbers, strings, lists, trees, ...
Representation
- defines how a data model is stored as a sequence of bytes in a file.
Conversion between representations is exact.
This is work in progress. In order to succeed, it must become a community project.

Mosaic: current state
Data items
- System topology, chemical structure
- Configurations
- Per-atom/site numerical properties (velocities, charges, masses, ...)
- Per-atom/site labels (atom names, atom types, ...)
- Subsets / atom selections
- Trajectories
Representations
- HDF5 for big data sets, fast access, and easy interfacing to C/C++/Fortran code
- XML for processing with standard XML tools
- Three in-memory representations for Python

HDF5 (Hierarchical Data Format, version 5)
- Platform-independent binary storage
- Efficient library for data handling
- Metadata (provenance, dependencies, ...)
- Well-established: used by big organizations (NASA, ...) for very large data sets (meteorology ...)
- Ideal for exchanging and archiving data
- Suitable for very large data sets
[Figure: an HDF5 file shown as a tree with a root group, groups, and datasets (≈ arrays).]

So here's my model for the gas phase peptide...
<?xml version="1.0" encoding="utf-8"?>
<mosaic version="0.1">
  <universe cell_shape="infinite" convention="MMTK" id="universe">
    <molecules>
      <molecule count="1">
        <fragment label="" species="protein">
          <fragments>
            <fragment label="A" polymer_type="polypeptide" species="ACE,Ala,Ala,Ala,Ala,Ala,Ala,Al
              <fragments>
                <fragment label="ACE0" species="ace_beginning">
                  <atoms>
                    <atom label="CBond" name="C" type="element" />
                    <atom label="CH3" name="C" type="element" />
                    <atom label="HH31" name="H" type="element" />
                    <atom label="HH32" name="H" type="element" />
                    <atom label="HH33" name="H" type="element" />
                    <atom label="O" name="O" type="element" />
                  </atoms>
                  <bonds>
                    <bond atoms="CH3 HH31" order="" />
                    <bond atoms="CH3 HH32" order="" />
                    <bond atoms="CH3 HH33" order="" />
                    <bond atoms="CBond CH3" order="" />
                    <bond atoms="CBond O" order="" />
                  </bonds>
                </fragment>
. . . (686 lines in total)

Future developments
- Minor: atom pairs, constraints
- Major: force fields (algorithms)

Biomolecular simulation ... a few years from now
[Diagram: small independent programs (PDB import, chain extraction, force field, solvation, MD simulation) operate on a single Mosaic HDF5 file. The file holds: PDB crystal (universe + configuration), protein in vacuum (universe + configuration), force field for protein in vacuum, solvated protein (universe + configuration), force field for solvated protein, trajectory for solvated protein.]
All the data for the simulation study in one place.

Managing, publishing, and archiving reproducible research
- Reproducible research results in large sets of interrelated software, data sets, and documentation.
- Managing files, dependencies, and metadata on multiple computers is a nightmare.
- Publishing is difficult due to a lack of standards.
- Archiving is almost pointless: which software will still work 30 years from now?
ActivePapers (http://www.activepapers.org/)
- JVM edition: near-perfect but unusable
- Python edition: far from the ideal but ready for use
K. Hinsen, Procedia Computer Science 4 (2011): 579-588
K. Hinsen, F1000Research, submitted

Preserving data for a long time
Formalized descriptions
- data models
- ontologies
- well-defined file formats
Use standards, even bad ones
- Standards ensure the critical mass that will motivate future scientists to maintain the knowledge and the tools required to interpret your data.
Limited dependencies
- software: operating systems, compilers, libraries, ...
- infrastructure: Internet, communication protocols, ...
- organizations: research labs, publishers, ...

Code is data
There is nothing special about software, it's just a kind of data!
Everything is data:
- Scientific data is data.
- Code is data.
- Input parameters are data.
- Documentation is data.
All we need is good data models for everything, plus a way to describe dependencies.
The tricky part is finding a good data model for code.

Representing code as data
"Data types" for code:
- source code: defined by a language specification
  - there are lots of languages
  - programmers are attached to them
  - specifications are often incomplete or even absent
- binary/executable: defined by a hardware specification
  - very complex specification
  - evolves rapidly with technological progress
- bytecode: defined by a virtual machine specification (JVM, CLI, LLVM, ...)
  - clear and simple specification
  - not directly visible to the programmer
  - today's virtual machines were not designed for scientific computing → sub-optimal performance
JVM bytecode provides the best existing infrastructure.
There is not much scientific software for the JVM platform, which is [...]

Dependency handling
[Diagram: dataflow graph of program execution. Dataset 1 and dataset 2 are the input of a program, whose output is dataset 3.]
[Diagram: Directed Acyclic Graph of dependencies. Nodes: source code, compiler, program, dataset 1, dataset 2, dataset 3.]

ActivePapers overview (1/3)
An ActivePaper ...
- is an HDF5 file containing any number of data sets
- typically contains scientific data, code, and documentation
- is identified by a DOI (Digital Object Identifier)
- can refer to other ActivePapers through their DOIs
- stores the dependency graph between data sets in HDF5 metadata
Executable code in an ActivePaper ...
- cannot access files outside of ActivePaper files
- cannot modify data except inside its own ActivePaper
- comes in three varieties: calclet (reads and writes data), importlet (imports "outside" data), and library (called by other code)

ActivePapers overview (2/3)
Special cases
- a raw dataset (usually with some documentation)
- a software library (usually with some documentation)
External dependencies (must be maintained for 30 years ...)
JVM edition:
- a Java Virtual Machine with the Java standard library
- the HDF5 library (could be rewritten as a JVM library)
- JHDF5, a Java interface to HDF5 (could be rewritten as a JVM library)
Python edition:
- Python 2.7 or 3.3 (hopefully also later versions)
- the HDF5 library
- h5py, a Python interface to HDF5
- NumPy, an array processing library used by h5py

ActivePapers overview (3/3)
Problems solved (at least in principle)
- future-proof publishing and archiving of computational research
- secure replicability of computational studies
- reusability of scientific data and software
- integration of scientific data and software into bibliometry
Problems that remain
- JVM edition:
  - very small choice of scientific libraries for the JVM
  - floating-point issues in the JVM
- Python edition:
  - many libraries depend on C code
  - occasional incompatible changes in Python, NumPy, and h5py
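The dependency graph that an ActivePaper stores between its data sets can be sketched with the Python standard library alone. This is not the ActivePapers implementation and does not use its API: the item names and the two helper functions are hypothetical, and the graph is held in a plain dictionary rather than in HDF5 metadata.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency records: each item maps to the items it was
# derived from. Raw data and code have no dependencies.
dependencies = {
    "code/analysis": set(),
    "data/dataset_1": set(),
    "data/dataset_2": set(),
    "data/dataset_3": {"code/analysis", "data/dataset_1", "data/dataset_2"},
    "documentation/figure": {"data/dataset_3"},
}


def recomputation_order(deps):
    """Return all items so that every item follows its dependencies.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


def stale_items(deps, changed):
    """Return every item that depends, directly or not, on 'changed'."""
    stale = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for item, sources in deps.items():
            if current in sources and item not in stale:
                stale.add(item)
                frontier.append(item)
    return stale


order = recomputation_order(dependencies)
print(order)
print(sorted(stale_items(dependencies, "data/dataset_1")))
```

Keeping the graph acyclic is what makes a well-defined recomputation order possible. A cycle would mean that some data set is needed to compute itself, and the sorter rejects such a graph with an error.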