revised notes from sessions

Notes from the RDA Data Foundation and Terminology (DFT) WG session 1 at Plenary 3
Co-chairs: Gary Berg-Cross, Raphael Ritz, Peter Wittenburg
Attendance: ~25 & 15 people
Co-Chair Gary Berg-Cross kicked off the meeting with a signup sheet, a 5-page handout
on core term definitions with supporting graphics, and an overview of the agenda for the
2 sessions.
After some brief introductory notes on the mission (describe a basic, abstract (but clear)
data organization model that systematizes the already large body of definition work on
data management terms, especially as involved in RDA’s efforts) and on past work, a
summary was given of where we are now as part of a 5-step process to develop and agree
on a core vocabulary with the aid of the newly developed Term Definition Tool (TeD-T).
We now have a table of Core Terms with some initial Definitions, and some are also in
TeD-T as examples. Some are still being updated.
A summary of where we think we are going included:
• Using this P3 meeting as an opportunity to take stock and do some editing,
  testing of ideas and refining, as well as to strategize on next steps.
• Seeing to what extent we can get some sense of agreement/buy-in on the
  WG-Core.
• Tool Demo
• Discussion of working Core issues
A Checklist of Issues Needed for DFT Term Progress was presented:
• Ramp up of effort by DFT WG Community
• Review of table, categories and definition refinement
• Confirmation of scope of work
• How do we handle points of contention?
• What is the process by which we converge and move to adoption?
• Training in and exposure of Term Tool (Demo tomorrow)
• Use by other WGs for their needs
• Is our table example useful as a model for them?
• Further test of Use Case Scenarios
Most of the remainder of the first session was spent presenting and discussing Use
Cases.
Peter Wittenburg presented several scenarios of use, starting with:
P1 A CLARIN Collection Builder scenario, which creates collections out of data and gives
the data citable PIDs and metadata in various repositories. The collection itself also has
a PID and metadata.
P2 Replication of data (a replication workflow) from different communities in the
EUDAT Data Domain is not trivial and may require adding a PID and other additional
metadata as part of replication in EUDAT data storage. For this we need several PID Info
Types including:
- cksm
- cksm type
- data_URL+
- metadata_URL+
- ROR_flag
- mutability_flag
- access_rights_store
- etc.
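A PID record carrying these info types might be sketched as follows. This is only an illustration: the field names follow the list above (with `cksm_type` as an identifier-safe spelling of "cksm type"), but the Python types, the reading of `+` as "one or more", the boolean flags, the `add_replica` helper, and any example handles or URLs are assumptions, not part of an RDA or EUDAT specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PIDRecord:
    """Hypothetical PID record holding the info types listed above."""
    pid: str
    cksm: str                      # checksum of the stored bit stream
    cksm_type: str                 # e.g. "md5" or "sha256" (assumed values)
    data_URL: List[str] = field(default_factory=list)      # "+" read as one or more
    metadata_URL: List[str] = field(default_factory=list)  # "+" read as one or more
    ROR_flag: bool = False         # assumed to be a boolean flag
    mutability_flag: bool = False  # assumed to be a boolean flag
    access_rights_store: str = ""  # assumed: pointer to where access rights are kept

    def add_replica(self, url: str) -> None:
        """Record an additional copy created by replication (no duplicates)."""
        if url not in self.data_URL:
            self.data_URL.append(url)
```

Under these assumptions, a replication step would call `add_replica` with the URL of the new copy, leaving the PID and checksum unchanged.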
P3 Curate Gappy Sensor Data (delays in transmission) in Seismology.
The data being collected has gaps because sensing is opportunistic, and it is dynamic
because gaps are filled at random times. Hence a data collection looks different over
time, with fewer gaps as time passes. But researchers have to be able to refer to the
data collection as it was before gaps were filled.
In conversation it was noted that some thought that gappy data referred to “incomplete”
data. Defining it by time makes sense.
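One way to read "defining it by time" is to keep the arrival time of every segment, so that the state of the collection at any earlier moment can be reconstructed and cited. The sketch below illustrates that reading only; the `Segment` type, the integer time units, and the function names are hypothetical and were not presented in the session.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    start: int       # first sample time covered by the segment
    end: int         # end of coverage (exclusive)
    arrived_at: int  # time at which the segment reached the repository


def segments_as_of(segments: List[Segment], t: int) -> List[Segment]:
    """Return the segments that had already arrived at time t, ordered by start."""
    return sorted((s for s in segments if s.arrived_at <= t), key=lambda s: s.start)


def gaps_as_of(segments: List[Segment], t: int,
               span_start: int, span_end: int) -> List[Tuple[int, int]]:
    """List the uncovered intervals of [span_start, span_end) as seen at time t."""
    gaps = []
    cursor = span_start
    for s in segments_as_of(segments, t):
        if s.start > cursor and cursor < span_end:
            gaps.append((cursor, min(s.start, span_end)))
        cursor = max(cursor, s.end)
    if cursor < span_end:
        gaps.append((cursor, span_end))
    return gaps
```

A reference of the form (collection, time t) then always resolves to the same set of segments and gaps, even after later transmissions fill the gaps.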
P4 Crowd Sourcing Data Management requires curation/structuring of a stream of data so
that it is permanently stored with proper MD (which may need to be created for that store),
with all needed annotations of relations to other data, and with registered PIDs for all
objects.
P5 Language Technology Workflows
As collections evolve over time they are modified by workflow processing modules/service
objects, which are informed by the PID system and metadata.
An example of the processing in such service objects is:
1. read collection MD
2. read MD
3. interpret MD & get PID
4. get data
5. process data & create new data
6. register PID
7. create MD*
8. update collection
9. if more in collection go to 2
10. end
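The ten steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions: the in-memory dictionaries standing in for the metadata store, data store and PID registry, the `uuid`-based PID minting, and the `derived_from` metadata key are all hypothetical and do not reflect any actual CLARIN or EUDAT service.

```python
import uuid
from typing import Callable, Dict, List


def run_service_object(collection_md: Dict[str, List[str]],
                       metadata: Dict[str, Dict[str, str]],
                       data_store: Dict[str, str],
                       process: Callable[[str], str]) -> List[str]:
    """Process every member of a collection and add the derived objects to it."""
    created = []
    members = list(collection_md["members"])        # 1. read collection MD (snapshot)
    for md_id in members:                           # 9. loop while more in collection
        md = metadata[md_id]                        # 2. read MD
        pid = md["pid"]                             # 3. interpret MD & get PID
        data = data_store[pid]                      # 4. get data
        new_data = process(data)                    # 5. process data & create new data
        new_pid = f"pid:{uuid.uuid4()}"             # 6. register PID (hypothetical minting)
        data_store[new_pid] = new_data
        new_md_id = f"md:{new_pid}"                 # 7. create MD for the new object
        metadata[new_md_id] = {"pid": new_pid, "derived_from": pid}
        collection_md["members"].append(new_md_id)  # 8. update collection
        created.append(new_md_id)
    return created                                  # 10. end
```

Iterating over a snapshot of the member list keeps the loop from re-processing the objects it has just added to the collection.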
Reagan Moore’s scenario concerned reproducible data-driven research and its capture
as a service object/workflow of chained operations. To be reproducible one must
understand processes & operations:
• Where did the data come from?
• How was the data created?
• How was the data managed?
Researcher operations for the RHESSys workflow, which develops a nested watershed
parameter file (worldfile) containing a nested ecogeomorphic object framework and the
full initial system state, include human operations such as those below, along with
repository operations (further below). An abstracting graphic was provided to show some
of the relations.
1. Pick the location of a stream gauge and a date
2. Access USGS data sets to determine the watershed that surrounds the stream
gauge (may need a named digital object)
3. Access USDA for soils data for the watershed
4. Access NASA for LandSat data
5. Access NOAA for precipitation data
6. Access USDOT for roads and dams
7. Project each data set to the region of interest
8. Generate the appropriate environment variables
9. Conduct the watershed analysis
10. Store the workflow, the input files, and the results
There are an equal number of data repository operations such as:
1. Authenticate the user (a response of the repository system)
2. Authorize the deposition
3. Add a retention period
4. Extract descriptive metadata
5. Record provenance information
6. Log the event
7. Create derived data products (image thumbnails)
8. Add access controls (collection sticky bits)
9. Verify checksum
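A few of these repository operations can be chained in a single deposit function, as sketched below. This is an illustration only: the dictionary-based repository, the function names, the SHA-256 default, the order of the checks, and the collapsing of authentication and authorization into one membership test are all assumptions, not the behaviour of any specific repository system discussed in the session.

```python
import hashlib
from datetime import datetime, timezone


def verify_checksum(data: bytes, expected: str, algorithm: str = "sha256") -> bool:
    """Recompute the digest of the deposited bytes and compare it with the declared one."""
    return hashlib.new(algorithm, data).hexdigest() == expected


def deposit(repo: dict, user: str, name: str, data: bytes,
            expected_cksm: str, retention_days: int = 365) -> dict:
    """Run a simplified chain of repository operations for one deposit."""
    log = repo.setdefault("log", [])
    # Authenticate/authorize: simplified here to a single membership check.
    if user not in repo.get("authorized_users", set()):
        log.append(("rejected", user, name))
        raise PermissionError(f"{user} may not deposit into this collection")
    # Verify checksum before anything is stored.
    if not verify_checksum(data, expected_cksm):
        log.append(("checksum_mismatch", user, name))
        raise ValueError(f"checksum mismatch for {name}")
    record = {
        "data": data,
        "cksm": expected_cksm,
        "retention_days": retention_days,          # add a retention period
        "metadata": {"size_bytes": len(data)},     # extract descriptive metadata
        "provenance": {                            # record provenance information
            "depositor": user,
            "deposited_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    repo.setdefault("objects", {})[name] = record
    log.append(("deposited", user, name))          # log the event
    return record
```

In a policy-driven system, values such as `retention_days` and the set of authorized users would come from the collection's policies rather than from function arguments.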
We need to understand the vocabulary of each of these types of operations. Reagan
proposed some definitions for key terms, ranging from bits, digital object, data object
and representation object to operations, workflow and workflow object.
As part of the Practical Policy work, these operations are driven by policy, so there are
collection policies.

Hans Pfeiffenberger provided some relationship scenarios for research data objects,
which are “complex objects” as framed by the Open Archives Initiative Object Reuse
and Exchange (OAI-ORE). He considered what we mean by a research data object in a
simple case involving an article in a classical journal which is related to an article in a
data journal, each of which points to data in a repository. Much more complex
relationships arise in full studies, such as clinical trials, which include connections to
method protocols, data collection forms, raw and cleaned data records, and a published
primary report along with results DBs and conference reports. Thousands of pages may be
involved in the aggregate research object, and we need to take all of these relations into
account.

“• A plethora of objects per research data set!
  i.e.: a complex (yet fixed!) object
• In different formats, under different control, ...
  i.e.: in different repositories
• Each with a distinct timeline (no circles or cycles!)
• And then: Reliable and stable bidirectional linkages
  between all these elements”
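The requirement for bidirectional linkages can be illustrated with a small link registry in which every link is stored in both directions. This is a hypothetical sketch: the `LinkRegistry` class, its methods, and the plain string identifiers are assumptions and are not an implementation of OAI-ORE.

```python
from collections import defaultdict
from typing import Dict, Set


class LinkRegistry:
    """Hypothetical registry that stores every link in both directions."""

    def __init__(self) -> None:
        self._links: Dict[str, Set[str]] = defaultdict(set)

    def link(self, a: str, b: str) -> None:
        """Record that a and b are related; the inverse link is added as well."""
        self._links[a].add(b)
        self._links[b].add(a)

    def related(self, pid: str) -> Set[str]:
        """Return the objects directly linked to pid."""
        return set(self._links.get(pid, set()))

    def aggregate(self, pid: str) -> Set[str]:
        """Return every object reachable from pid, i.e. the aggregate research object."""
        seen = {pid}
        stack = [pid]
        while stack:
            current = stack.pop()
            for neighbour in self._links.get(current, set()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return seen
```

For the simple case described above, linking a journal article, a data journal article and a dataset lets any one of the three be used as the entry point to the other two.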