Data Management Plans: A good idea, but not sufficient

Andreas Rauber
Department of Software Technology and Interactive Systems
Vienna University of Technology
&
Secure Business Austria
rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi
Outline

Why are Data Management Plans good but insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary
Sustainable (e-)Science

Data is a key enabler in science
- Basis for evaluation and verification
- Basis for re-use
- Basis for meta-studies

Safeguarding investment made in data

Need to preserve and curate the data

Preservation: keeping data usable over time,
fighting mostly technical & semantic obsolescence

How to avoid data being lost after projects end?
Sustainable (e-)Science

Data Management Plans
as integral part of research proposals

Need recognized by researchers, funding bodies,…

Focus on
- Data
- Descriptions
- Declarations of activities to ensure long-term availability of data
Data Management Plans are good, but not sufficient!
https://dmp.cdlib.org/
https://data.uni-bielefeld.de/de/datamanagement-plan
https://dmponline.dcc.ac.uk/
Data Management Plans

Short, free-form text, requiring human interpretation

Declarations of intent

Not enforceable, hardly verifiable

(Burden remains with researchers / institutions,
who need to become data management experts)

Focuses solely on data, ignoring the process:
pre-processing, processing, analysis

Limits
- availability of data & results
- verification of results
- re-use and re-purposing
http://rci.ucsd.edu/_files/DMP%20Example%20Cosman.pdf
http://deepblue.lib.umich.edu/bitstream/handle/2027.42/86586/CoE_DMP_template_v1.pdf?sequence=1
From Data to Processes

Excursion: Scientific Processes
From Data to Processes

Rhythm Pattern Feature Set
- extracts numeric descriptors from audio
- basically 2 Fourier Transforms
- some psycho-acoustic modelling
- some filters (gaussian, gradient) to make features more robust

Used for
- music genre classification
- clustering of music by similarity
- retrieval

Implemented first in Matlab, then in Java
- both publicly available on website
same same but different...
From Data to Processes

Excursion: scientific processes
[Figure: feature output of the Java and Matlab implementations for the test signals set1_freq440Hz_Am11.0Hz, set1_freq440Hz_Am12.0Hz and set1_freq440Hz_Am05.5Hz]
From Data to Processes

Excursion: Scientific Processes
- Bug?
- Psychoacoustic transformation tables?
- Forgetting a transformation?
- Different implementation of filters?
- Limited accuracy of calculation?
- Difference in FFT implementation?
- ...?
From Data to Processes
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
From Data to Processes
To sum up:

Data
- is the fuel for scientific processes
- is the result of scientific processes

Curation of data thus needs to consider these processes

Data Management Plans
- are data centric
- put too little focus on the processes associated with data
- are written by humans for humans
Outline

Why are Data Management Plans insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary
Process Management Plans
Process Management Plans (PMPs)

Go beyond data to cover the research process:
- ideas, steps, tools, documentation, results, …
- data is only one (important) element, commonly actually a result of a research (pre-)process

Ensure re-executability, re-usability

Must be machine-actionable & verifiable

Basis for preservation and re-use of research

Similar to “research objects”, “executable papers”, …
Process Management Plans
Need to establish

Models for representing such process management
plans (PMPs)

Must be machine-readable and machine-actionable

Identify “minimum set” of information

Devise means to automate (most of) the activity in
creating and maintaining those PMPs

Establish them to replace (enhance / subsume / …)
Data Management Plans
Process Management Plans
Structure of PMPs (following concept of DMPs):
1. Overview and context
2. Description of processes and their implementation
   - Process description | Process implementation | Data used and produced by process
3. Preservation
   - Preservation history | Long term storage and funding
4. Sharing and reuse
   - Sharing | Reuse | Verification | Legal aspects
5. Monitoring and external dependencies
6. Adherence and Review
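
To illustrate what "machine-actionable" could mean for this structure, the six sections can be held in a typed record that a tool can check for completeness and serialise. This is only a sketch under stated assumptions: the class `ProcessManagementPlan`, its field names and the JSON layout are hypothetical and do not represent a published PMP schema.

```python
import json
from dataclasses import dataclass, field, asdict

# Section names follow the six-part structure listed above; the field
# names themselves are hypothetical, not a published PMP schema.
REQUIRED_SECTIONS = (
    "overview_and_context",
    "processes",
    "preservation",
    "sharing_and_reuse",
    "monitoring_and_external_dependencies",
    "adherence_and_review",
)


@dataclass
class ProcessManagementPlan:
    overview_and_context: str = ""
    processes: list = field(default_factory=list)
    preservation: dict = field(default_factory=dict)
    sharing_and_reuse: dict = field(default_factory=dict)
    monitoring_and_external_dependencies: list = field(default_factory=list)
    adherence_and_review: str = ""

    def missing_sections(self):
        """Names of sections that are still empty."""
        return [name for name in REQUIRED_SECTIONS if not getattr(self, name)]

    def to_json(self):
        """Serialise the plan so other tools can read and act on it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


plan = ProcessManagementPlan(
    overview_and_context="Music genre classification experiment",
    processes=[{
        "description": "extract features, train classifier, classify",
        "implementation": "workflow definition (placeholder)",
        "data": {"used": ["audio files", "genre labels"],
                 "produced": ["genre predictions"]},
    }],
)
print(plan.missing_sections())
```

Unlike a free-form text, a plan in this form lets a tool report automatically which sections are still empty.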
Outline

Why are Data Management Plans insufficient?

From Data to Process Management Plans

How to capture process & context?

Summary
Process Capture

Need to establish what forms part of a process:
- analyzing process documentation
- establishing context of process, relationships between elements
- monitoring of process activities

Capture and describe this in a context model
Architectural Concepts
- Based on Enterprise Architecture Framework (Zachman), taxonomies (e.g. PREMIS), …
- DIO: Domain-Independent Ontology
- DSO: Domain-Specific Ontologies (legal, sensor, multimedia codecs, …)

[Diagram: DIO (ArchiMate) linked to DSO-1 via the DIO-DSO1 Transformation Map and to DSO-2 via the DIO-DSO2 Transformation Map]
Process Capture
Example: Music Classification Process
- Input: music (e.g. MP3 format)
- Input: training data, i.e. music with genre labels
- Output: classification of music, e.g. into genres
- Intermediate steps
  - extract numeric description (features) from music
  - combine features with ground truth into specific file format, …
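
The idea of monitoring process activities for such an example can be sketched as a small wrapper that runs each step and records digests of its input and output. This is an illustrative sketch only: `CapturedProcess`, `digest` and the toy step functions are hypothetical stand-ins, and the "features" computed here have nothing to do with real audio analysis.

```python
import hashlib
import json


def digest(value):
    """Stable SHA-256 digest of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class CapturedProcess:
    """Runs named steps and logs what went in and what came out."""

    def __init__(self):
        self.log = []

    def run(self, name, func, data):
        result = func(data)
        self.log.append(
            {"step": name, "input": digest(data), "output": digest(result)}
        )
        return result


# Toy stand-ins for the real steps; actual feature extraction works on audio.
def extract_features(tracks):
    return {title: [len(title), sum(map(ord, title)) % 7] for title in tracks}


def combine_with_ground_truth(args):
    features, labels = args
    return [{"features": features[t], "genre": labels[t]} for t in sorted(features)]


labels = {"track_a": "jazz", "track_b": "rock"}  # hypothetical training labels
process = CapturedProcess()
features = process.run("extract_features", extract_features, sorted(labels))
training = process.run(
    "combine_with_ground_truth", combine_with_ground_truth, [features, labels]
)

for entry in process.log:
    print(entry["step"], entry["input"][:12], "->", entry["output"][:12])
```

A later re-execution that produces different digests for the same step signals that something in the process or its environment has changed.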
Process Capture
Taverna
Process Capture


Software setup can be detected automatically on operating systems
with package management (e.g. Linux);
this allows detection of licenses and dependencies
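
As a rough illustration of such automatic detection, the sketch below reads package names, versions and declared dependencies from `dpkg-query` on a Debian-style system. It is an assumption-laden example, not the tooling referred to in the slides: the helper names and the sample listing are hypothetical, and license information is not covered because it is not a field of the package database (on Debian it is typically documented in `/usr/share/doc/<package>/copyright`).

```python
import subprocess

# Format string for dpkg-query on Debian-style systems: one package per line,
# tab-separated name, version and declared dependencies.
DPKG_FORMAT = "${Package}\t${Version}\t${Depends}\n"


def parse_package_listing(text):
    """Turn tab-separated dpkg-query output into a dict keyed by package."""
    packages = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Pad with empty strings so lines with missing fields still unpack.
        name, version, depends = (line.split("\t") + ["", ""])[:3]
        packages[name] = {
            "version": version,
            "depends": [d.strip() for d in depends.split(",") if d.strip()],
        }
    return packages


def installed_packages():
    """Query the local package database (requires dpkg-query to be present)."""
    out = subprocess.run(
        ["dpkg-query", "-W", "--showformat=" + DPKG_FORMAT],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_package_listing(out)


# Hypothetical listing, used here so the sketch runs on any platform.
sample = "sox\t14.4.2\tlibc6 (>= 2.14), libsox3\nlibsox3\t14.4.2\t\n"
print(parse_package_listing(sample))
```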
Process Capture
Example: Music Classification Workflow
Process Re-deployment
Preservation and Re-deployment
- “Encapsulate” as complex “research objects” (RO)
- Re-Deployment beyond original environment
- Format migration of elements of ROs
- Cross-compilation of code
- Emulation-as-a-Service, virtual machines, …
Process Re-deployment
Verification, Validation & Data
- Verify correctness of re-execution
  - validation and verification framework
  - process instance data
  - points of capture
  - Metrics
- Data and data citation
  - Identifying subsets of data in large and dynamic databases
  - Timestamping and versioning of data
  - Assigning PID (DOI, …) to time-stamped query

[Diagram: PID Provider, PID Store, Query, Query Store, Data (Table A, Table B), Subsets]
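
The combination of versioned data, a query store and a PID assigned to a time-stamped query can be sketched with an in-memory SQLite database. This is a minimal illustration under stated assumptions, not the system behind the slides: the table layout, the integer "logical" timestamps, the `OPEN` sentinel and the `pid:` identifiers are all made up for the example, and a real deployment would obtain identifiers such as DOIs from a PID provider.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tracks (id INTEGER, genre TEXT, valid_from INTEGER, valid_to INTEGER);
    CREATE TABLE query_store (pid TEXT PRIMARY KEY, sql TEXT, executed_at INTEGER);
""")

OPEN = 2**62  # sentinel meaning "this row version is still current"


def upsert(track_id, genre, now):
    """Versioned write: close the current row version, then add a new one."""
    db.execute(
        "UPDATE tracks SET valid_to = ? WHERE id = ? AND valid_to = ?",
        (now, track_id, OPEN),
    )
    db.execute("INSERT INTO tracks VALUES (?, ?, ?, ?)", (track_id, genre, now, OPEN))


def cite(sql, now):
    """Store the query text with its timestamp and return a made-up PID."""
    pid = "pid:" + hashlib.sha256(f"{sql}@{now}".encode()).hexdigest()[:16]
    db.execute("INSERT INTO query_store VALUES (?, ?, ?)", (pid, sql, now))
    return pid


def resolve(pid):
    """Re-run the stored query against the data as it was at citation time."""
    sql, at = db.execute(
        "SELECT sql, executed_at FROM query_store WHERE pid = ?", (pid,)
    ).fetchone()
    versioned = (
        f"SELECT id, genre FROM ({sql}) "
        "WHERE valid_from <= ? AND valid_to > ? ORDER BY id"
    )
    return db.execute(versioned, (at, at)).fetchall()


upsert(1, "jazz", now=1)
upsert(2, "jazz", now=1)
pid = cite("SELECT * FROM tracks WHERE genre = 'jazz'", now=2)
upsert(2, "rock", now=3)  # later correction of the live data

print(resolve(pid))
```

Resolving the identifier returns the subset as it existed when the query was cited, even though the live table has changed since.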
Sustainable (e-)Science
How to get there?


Research infrastructure support
- Versioning systems
- Logging (“virtual lab-book”)
- Virtual machines / pre-configured virtual labs for research
- Data citation support for large, dynamic databases

R&D in process preservation, re-deployment & verification
- Evolving research environments, code migration, …
- Verification of process re-execution
- Financial impact, business models
Summary

Need to move beyond concept of data

Need to move beyond the focus on description

Process Management Plans (PMPs) extending DMPs

Process capture, preservation & verification

Capture “all” elements of a research process

Machine-readable and -actionable

Data and process re-use as basis for data driven science
Thank you!
http://www.ifs.tuwien.ac.at/imp