Data Management Plans: A good idea, but not sufficient
Andreas Rauber
Department of Software Technology and Interactive Systems
Vienna University of Technology & Secure Business Austria
rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi

Outline
Why are Data Management Plans good but insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary

Sustainable (e-)Science
Data is a key enabler in science
  - Basis for evaluation and verification
  - Basis for re-use
  - Basis for meta-studies
Safeguarding investment made in data
Need to preserve and curate the data
Preservation: keeping data usable over time, fighting mostly technical & semantic obsolescence
How to avoid data being lost after projects end?

Sustainable (e-)Science
Data Management Plans as integral part of research proposals
Need recognized by researchers, funding bodies, …
Focus on
  - Data descriptions
  - Declarations of activities to ensure long-term availability of data
Data Management Plans are good, but not sufficient!
https://dmp.cdlib.org/
https://data.uni-bielefeld.de/de/datamanagement-plan
https://dmponline.dcc.ac.uk/

Data Management Plans
Short, free-form text, requiring human interpretation
Declarations of intent
Not enforceable, hardly verifiable
(Burden remains with researchers / institutions, who need to become data management experts)
Focus solely on data, ignoring the process: pre-processing, processing, analysis
Limits
  - availability of data & results
  - verification of results
  - re-use and re-purposing
http://rci.ucsd.edu/_files/DMP%20Example%20Cosman.pdf
http://deepblue.lib.umich.edu/bitstream/handle/2027.42/86586/CoE_DMP_template_v1.pdf?sequence=1

From Data to Processes
Excursion: Scientific Processes

From Data to Processes
Rhythm Pattern Feature Set
  - extracts numeric descriptors from audio
  - basically 2 Fourier Transforms
  - some psycho-acoustic modelling
  - some filters (Gaussian, gradient) to make features more robust
Used for
  - music genre classification
  - clustering of music by similarity
  - retrieval
Implemented first in Matlab, then in Java
  - both publicly available on website
same same but different...

From Data to Processes
Excursion: Scientific Processes
[Figure panels: set1_freq440Hz_Am11.0Hz, set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz; Java, Matlab]

From Data to Processes
Excursion: Scientific Processes
Bug?
Psychoacoustic transformation tables?
Forgetting a transformation?
Different implementation of filters?
Limited accuracy of calculation?
Difference in FFT implementation?
...?

From Data to Processes
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234

From Data to Processes
To sum up: Data
  - is the fuel for scientific processes
  - is the result of scientific processes
Curation of data thus needs to consider these processes
Data Management Plans
  - are data centric
  - put too little focus on the processes associated with data
  - are written by humans for humans

Outline
Why are Data Management Plans insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary

Process Management Plans
Process Management Plans (PMPs) go beyond data to cover the research process:
  - ideas, steps, tools, documentation, results, …
  - data is only one (important) element, commonly actually a result of a research (pre-)process
Ensure re-executability, re-usability
Must be machine-actionable & verifiable
Basis for preservation and re-use of research
Similar to “research objects”, “executable papers”, …

Process Management Plans
Need to establish models for representing such process management plans (PMPs)
  - must be machine-readable and machine-actionable
Identify “minimum set” of information
Devise means to automate (most of) the activity in creating and maintaining those PMPs
Establish them to replace (enhance / subsume / …) Data Management Plans

Process Management Plans
Structure of PMPs (following concept of DMPs):
1. Overview and context
2. Description of processes and their implementation
   Process description | Process implementation | Data used and produced by process
3. Preservation
   Preservation history | Long term storage and funding
4. Sharing and reuse
   Sharing | Reuse | Verification | Legal aspects
5. Monitoring and external dependencies
6. Adherence and Review

Outline
Why are Data Management Plans insufficient?
From Data to Process Management Plans
How to capture process & context?
Summary

Process Capture
Need to establish what forms part of a process:
  - analyzing process documentation
  - establishing context of process, relationships between elements
  - monitoring of process activities
Capture and describe this in a context model

Architectural Concepts
Based on Enterprise Architecture Framework (Zachman), taxonomies (e.g.
PREMIS), …
DIO: Domain-Independent Ontology
DSO: Domain-Specific Ontologies (legal, sensor, multimedia codecs, …)
[Figure: DIO (ArchiMate) connected to DSO-1 and DSO-2 via the DIO-DSO1 and DIO-DSO2 Transformation Maps]

Process Capture
Example: Music Classification Process
Input: music (e.g. MP3 format)
Input: training data, i.e. music with genre labels
Output: classification of music, e.g. into genres
Intermediate steps
  - extract numeric description (features) from music
  - combine features with ground truth into specific file format, …

Process Capture
[Figure: Taverna]

Process Capture
Software setup can be automatically detected in OS with software packages (e.g. Linux); allows detection of licenses, dependencies

Process Capture
Example: Music Classification Workflow

Process Re-deployment
Preservation and Re-deployment
“Encapsulate” as complex “research objects” (RO)
Re-deployment beyond original environment
  - Format migration of elements of ROs
  - Cross-compilation of code
  - Emulation-as-a-Service, virtual machines, …

Process Re-deployment
Verification, Validation & Data
Verify correctness of re-execution
  - validation and verification framework
  - process instance data points of capture
  - Metrics
Data and data citation
  - Identifying subsets of data in large and dynamic databases
  - Timestamping and versioning of data
  - Assigning PID (DOI, …) to time-stamped query
[Figure: PID Provider, PID Store, Query Store, Query, Data (Table A, Table B), Subsets]

Sustainable (e-)Science
How to get there?
Research infrastructure support
  - Versioning systems
  - Logging (“virtual lab-book”)
  - Virtual machines / pre-configured virtual labs for research
  - Data citation support for large, dynamic databases
R&D in process preservation, re-deployment & verification
  - Evolving research environments, code migration, …
  - Verification of process re-execution
  - Financial impact, business models

Summary
Need to move beyond concept of data
Need to move beyond the focus on description
Process Management Plans (PMPs) extending DMPs
Process capture, preservation & verification
Capture “all” elements of a research process
Machine-readable and -actionable
Data and process re-use as basis for data driven science

Thank you!
http://www.ifs.tuwien.ac.at/imp