Defining Data Centric Functionality of SEAD: Bridging Domain Analysis Report and Existing Medici System
Md Aktaruzzaman
Introduction:
One of the primary objectives of SEAD is to create a virtual organization that provides
data services, such as storage and active curation, to sustainability scientists. To better
understand the needs of these scientists, SEAD’s Domain Engagement Committee interviewed
researchers at the National Center for Earth-surface Dynamics (NCED). Two members of the
Domain Engagement Committee were responsible for establishing connections with NCED
researchers, conducting the interviews and documenting the findings in terms of user needs.
The use cases are documented comprehensively in the report titled ‘User study of the
National Center for Earth-surface Dynamics’. SEAD aims to meet these needs by using the
open source project Medici, developed at NCSA. The Medici Content Management System is a
web- and desktop-enabled content management system that allows users to upload, collate,
annotate, and run analytics on a variety of file types (www.ncsa.illinois.edu). This report
bridges the user study at NCED and the existing Medici system.
Findings from Domain Analysis Report:
The interviews with the NCED researchers brought up an array of system requirements
that the researchers would like to see in SEAD. The following use cases were identified
as the most prominent during the interviews:
- Ability to organize data
- Ability to share data
- Ability to navigate data
- Ability to annotate data
- Ability to keep track of latest version
- Ability to pull in data from external sources
- Ability to stitch data
Each of the functionalities listed above is discussed in detail in the following sections.
Data organization:
NCED is a place where interdisciplinary research takes place at full scale. Researchers at NCED
are engaged in a wide spectrum of research activities aimed at better understanding the coupled
dynamics of landscapes and ecosystems. These activities have given rise to an enormous
amount of data, either collected from field sites or generated in indoor lab facilities and from
simulation model outputs. All the researchers interviewed by the two members of the SEAD
Domain Engagement Committee stressed the need to organize heterogeneous datasets
using web-based tools. The users want to see the following functions in the SEAD system:
i) Selection of desired data
ii) Sorting data by name, date and file extension
iii) Creating an ad-hoc data collection
iv) Assigning data to an active project
In Medici, users can upload a variety of datasets and group them into collections using the
built-in ‘Collections’ function. Figure 1.1 shows a schematic diagram of the existing Medici
data collections function.
Figure 1.1: Data Collections in existing Medici
Figure 1.2: Proposed Structure of Medici Data Collections
The user interviews at NCED made it clear that researchers would like to have a way to
organize their data by project and to assign data collections to an active project. The existing
Medici system allows only one level of hierarchy, which needs to be upgraded to accommodate
a multi-level tree structure. Figure 1.2 presents a schematic diagram of the proposed
multi-layered data organization structure in the Medici system. Users can sort their data with
the ‘Sort by’ function, which offers options to sort data by date and name.
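The proposed multi-level structure can be pictured as a tree in which a project holds collections and each collection may hold further sub-collections. The following is a minimal sketch under that assumption, not Medici's actual data model: the class names `DataItem` and `Collection`, the method names, and the sample project and file names are all hypothetical. It also illustrates the requested sorting by name, date and file extension.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataItem:
    """One uploaded data object (hypothetical model, not Medici's)."""
    name: str
    uploaded: date

    @property
    def extension(self) -> str:
        # Text after the last dot, lower-cased; empty if the name has no dot.
        return self.name.rsplit(".", 1)[1].lower() if "." in self.name else ""


@dataclass
class Collection:
    """A tree node: a project, a collection, or a nested sub-collection."""
    name: str
    items: list[DataItem] = field(default_factory=list)
    children: list[Collection] = field(default_factory=list)

    def add_child(self, child: Collection) -> Collection:
        self.children.append(child)
        return child

    def walk(self):
        """Yield every data item in this node and in all nested collections."""
        yield from self.items
        for child in self.children:
            yield from child.walk()

    def sorted_items(self, key: str = "name") -> list[DataItem]:
        """Sort all items below this node by 'name', 'date' or 'extension'."""
        keys = {
            "name": lambda item: item.name.lower(),
            "date": lambda item: item.uploaded,
            "extension": lambda item: item.extension,
        }
        return sorted(self.walk(), key=keys[key])


# Made-up example: one active project with two collections.
project = Collection("Example project")
field_data = project.add_child(Collection("Field data"))
field_data.items.append(DataItem("survey.csv", date(2012, 5, 1)))
lab_runs = project.add_child(Collection("Lab runs"))
lab_runs.items.append(DataItem("flume.txt", date(2012, 3, 15)))

print([item.name for item in project.sorted_items("date")])
```

Because every node has the same type, assigning a collection to a project is just attaching it as a child, and the depth of the hierarchy is not fixed in advance.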
Data Sharing:
Efficient data sharing is a challenge for most researchers. Ineffective and temporary
solutions such as USB sticks and hard drives are frequently used when there is an urgent
need to share and store data. Researchers at NCED unanimously mentioned using Dropbox
because of its user-friendliness and simplicity. According to the interviewees, Dropbox
offers version control and data backup functions that can be used simultaneously. The SEAD
system is expected to provide data sharing tools alongside data organization functionality.
The users want to see the following functions in SEAD:
i) Users select datasets to be shared
ii) Users are able to choose email IDs of the individuals with whom they want to share data
iii) SEAD system will generate email notifications to the individuals who will have access to the shared data
Figure 1.3: Data sharing at different data levels
Although data sharing is of high importance for making sustainability science more
meaningful, researchers still face significant barriers to sharing data efficiently with the
right people while retaining their copyright and intellectual property rights over the data.
Researchers are interested in sharing their data only after they have published their
scientific work. Data exists at different levels, such as raw data, level 1 data and so on, and
users want to be able to share data at these different levels. Figure 1.3 illustrates data
sharing at different levels as controlled by the SEAD system.
Despite the availability of data sharing tools such as Dropbox and FileZilla, researchers
still need a data sharing platform that offers more reliable and effective means of sharing.
The SEAD system can play an important role by providing researchers with data sharing
facilities that cover all of these options. The SEAD system will allow users to share data at
all of its levels and will incorporate data sharing protocols. The protocols will have the
following layers:
i) A user alone uses the dataset. No data sharing at this point
ii) A user is able to select preferred individuals and share data with them
iii) A user can share data with a group
iv) A user should have access to multiple project datasets. At this point data can be shared among project members
v) Data is open to all
Figure 1.4: Proposed Data Sharing Protocols in SEAD system
Figure 1.4 shows a schematic diagram of the proposed data sharing protocols in the SEAD
(Medici) system. The SEAD system will implement permission levels that restrict users’ ability
to download, modify and annotate data. Figure 1.5 presents a schematic diagram of the
proposed data access levels in Medici (SEAD).
The SEAD system will incorporate the following data access levels:
i) Users can only read data and cannot download or modify the data
ii) Users can both read and write the data. Users can download, modify and annotate data
iii) Users can see all the metadata at all levels of data access and sharing
Figure 1.5: Proposed data access levels in SEAD system
Currently Medici has only a very basic data sharing function called ‘embed’, with which a
single data object can be shared. The Medici system needs to be upgraded so that users can
share a collection of data and so that the system supports the different levels of data
sharing protocols.
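To make the interplay between the five sharing layers and the access levels concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the SEAD or Medici implementation: the names `Scope`, `Access` and `SharedDataset` are hypothetical, group and project membership are simplified to a flat set of email IDs, and the rule that metadata is always visible is only one possible reading of the third access level.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    """The five proposed sharing layers (names are hypothetical)."""
    PRIVATE = 1      # i) the owner alone uses the dataset
    INDIVIDUALS = 2  # ii) shared with selected individuals
    GROUP = 3        # iii) shared with a group
    PROJECT = 4      # iv) shared among project members
    PUBLIC = 5       # v) open to all


class Access(Enum):
    READ = "read"              # view only, no download or modification
    READ_WRITE = "read_write"  # download, modify and annotate


@dataclass
class SharedDataset:
    owner: str
    scope: Scope = Scope.PRIVATE
    access: Access = Access.READ
    # Assumption: individuals, group members or project members are all
    # resolved to a flat set of email IDs before being stored here.
    audience: set[str] = field(default_factory=set)

    def can_view(self, user: str) -> bool:
        if user == self.owner or self.scope is Scope.PUBLIC:
            return True
        if self.scope is Scope.PRIVATE:
            return False
        return user in self.audience

    def can_modify(self, user: str) -> bool:
        """Download, modify and annotate require read-write access."""
        if user == self.owner:
            return True
        return self.can_view(user) and self.access is Access.READ_WRITE

    def can_see_metadata(self, user: str) -> bool:
        # One reading of level iii): metadata is visible at every level.
        return True


dataset = SharedDataset(owner="owner@example.org")
dataset.scope = Scope.INDIVIDUALS
dataset.audience.add("colleague@example.org")
print(dataset.can_view("colleague@example.org"), dataset.can_modify("colleague@example.org"))
```

Keeping the sharing scope and the access level as two separate fields means a dataset can, for example, be open to all while remaining read-only.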
Data Navigation:
Researchers at NCED mentioned the need for an active data repository where they can
upload their data, search for data items and browse through data. The SEAD system will allow
users to navigate the data or datasets they have uploaded. It will implement the following
data navigation functionalities:
i) User is able to see all the data that he/she has uploaded into the SEAD system
ii) The system will notify the user whether a particular dataset is public or private
iii) SEAD system will help the users visualize the data on an interface. Data having a spatial property should be represented with a georeferenced map in the background.
iv) When a data item is selected, the user is able to see the associated metadata on the interface
v) User is also able to browse data uploaded by others if he/she has permission
Figure 1.6 shows a snapshot of the Medici geo-application interface, where geospatial data
has been overlaid on GIS maps.
Figure 1.6: Visualizing Geospatial data in Medici
Currently Medici has data navigation functionality, and users can upload and browse their data.
Medici can also read and display shapefiles on the interface. Since the data sharing protocols
are yet to be implemented in Medici, at present users have access only to their own data and
cannot explore others’ data.
Data Annotation:
Data annotation is the task of enriching data with explanatory notes and comments, mainly
from the data producer’s end, so that data consumers have sufficient information about the
data and its associated metadata. Researchers at NCED stressed the importance of effective
annotation, which would help other researchers and data consumers discover data efficiently
by typing keywords. SEAD’s Active Content Repository (ACR) provides researchers with a
platform where they can record explanatory notes and all forms of additional information to
add more value to the data. Researchers at NCED wanted to see the following data annotation
functionalities in the SEAD system:
i) Users choose a data item in order to annotate it. The data item can be a single data object (e.g., txt or doc file) or a subset of it.
ii) Users give some explanatory notes about the data
iii) Users create annotations by providing a short description. SEAD system will provide a text input area where users will be able to type
iv) Users can also create annotations by tags
Figure 1.7: Data annotation facilities in Medici
Currently Medici offers data annotation functionality through both textual input and
tagging options. Figure 1.7 presents a snapshot highlighting data annotation functionality in
Medici. Explanatory notes about a data item can be given both as textual input and as
tags.
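The link between annotation and keyword-based discovery can be sketched as follows. This is an illustrative model only, not Medici's annotation API: `AnnotatedItem`, `annotate`, `matches` and `search` are hypothetical names, and the matching rule (exact tag match or substring match in a note, both case-insensitive) is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AnnotatedItem:
    """A data item carrying free-text notes and tags (hypothetical model)."""
    name: str
    notes: list[str] = field(default_factory=list)
    tags: set[str] = field(default_factory=set)

    def annotate(self, note: str = "", tags=()) -> None:
        """Add a textual note and/or tags; blank input is ignored."""
        if note.strip():
            self.notes.append(note.strip())
        self.tags.update(t.strip().lower() for t in tags if t.strip())

    def matches(self, keyword: str) -> bool:
        """True if the keyword equals a tag or occurs in any note."""
        kw = keyword.strip().lower()
        if not kw:
            return False
        return kw in self.tags or any(kw in note.lower() for note in self.notes)


def search(items, keyword: str) -> list[str]:
    """Return the names of the items whose annotations match the keyword."""
    return [item.name for item in items if item.matches(keyword)]


# Made-up example items.
run = AnnotatedItem("flume_run.txt")
run.annotate("Sediment flux measured in the indoor flume", tags=["Flume", "sediment"])
dem = AnnotatedItem("dem.tif")
dem.annotate(tags=["lidar"])

print(search([run, dem], "flume"))
```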
Data Versioning:
SEAD system will incorporate the following functions for keeping track of the latest version
of a dataset:
i) User is able to upload the most current version of data. The system will refer users to the most recent version of data
ii) The system will keep track of any change made to a dataset and notify the users who share access to it
iii) The system should provide a way to compare versions of the dataset
Currently Medici doesn’t provide a direct way for users to keep track of data versions.
However, Medici offers an alternative: establishing a relationship between two datasets.
Figure 1.8 shows the ‘Create Relationship’ function in the existing Medici system.
Figure 1.8: Defining relationship between datasets
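The three proposed versioning functions (pointing to the latest version, notifying users who share access, and comparing versions) can be sketched with Python's standard `difflib` module. This is a hypothetical illustration, not how Medici stores data: `VersionedDataset` and its methods are made-up names, versions are assumed to be plain text, and notification is reduced to returning the list of recipients.

```python
from __future__ import annotations

import difflib
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Keeps every uploaded version of a text dataset, oldest first."""
    name: str
    versions: list[str] = field(default_factory=list)
    watchers: set[str] = field(default_factory=set)  # users sharing access

    def upload(self, content: str) -> list[str]:
        """Store a new version and return the users who should be notified."""
        self.versions.append(content)
        return sorted(self.watchers)

    def latest(self) -> str:
        """Return the most recent version."""
        return self.versions[-1]

    def compare(self, old: int, new: int) -> list[str]:
        """Unified diff between two 1-based version numbers."""
        return list(difflib.unified_diff(
            self.versions[old - 1].splitlines(),
            self.versions[new - 1].splitlines(),
            fromfile=f"{self.name} v{old}",
            tofile=f"{self.name} v{new}",
            lineterm="",
        ))


# Made-up example data.
gauge = VersionedDataset("gauge.csv", watchers={"colleague@example.org"})
gauge.upload("time,height\n1,0.5")
gauge.upload("time,height\n1,0.5\n2,0.7")
print("\n".join(gauge.compare(1, 2)))
```

A line-based diff only suits text formats such as CSV; binary data such as imagery would need a different comparison method.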
Pulling in Data from external sources:
During the user interviews at NCED, the modelers, who were postdocs working extensively
with satellite imagery, mentioned that they would like a convenient way to access data from
external agencies such as NOAA, NASA and JPL. The SEAD system will implement the
following functions to help researchers access these external sources:
i) Users register external data sources through SEAD
ii) The system will connect the users to the links of external sources like USGS, NASA and EPA
Currently the Medici system doesn’t provide a way for researchers to establish a connection
with an external link.
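One simple way to picture the registration of external sources is a registry that stores each link under a label, so that the link itself can be handled like any other stored item. The sketch below is hypothetical: `ExternalSourceRegistry` and its methods are not part of Medici, the URL is a placeholder rather than a real agency address, and the only validation assumed is that the link is an http or https URL.

```python
from urllib.parse import urlparse


class ExternalSourceRegistry:
    """Stores user-registered links to external data sources (hypothetical)."""

    def __init__(self):
        self._sources = {}

    def register(self, label: str, url: str) -> None:
        """Register a link after a basic check that it is an http(s) URL."""
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a valid http(s) link: {url!r}")
        self._sources[label] = url

    def link_for(self, label: str) -> str:
        """Return the stored link for a registered source."""
        return self._sources[label]

    def labels(self) -> list:
        """List the registered source labels in alphabetical order."""
        return sorted(self._sources)


registry = ExternalSourceRegistry()
# Placeholder URL; a real deployment would store the agency's actual address.
registry.register("Example agency", "https://data.example.org/imagery")
print(registry.labels())
```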
Stitching/combining datasets:
SEAD system will provide the following functions in terms of data combination:
i) SEAD system will allow the users to stitch and combine various datasets
ii) It will also allow users to extract a subset from the original data and make a new dataset
Currently Medici system doesn’t have any data stitching tools. In the future, efforts will be
made to provide tools for data stitching.
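The report does not define stitching precisely. As one possible interpretation for tabular data, the sketch below treats it as row-wise concatenation over the union of columns, and treats subsetting as filtering rows (and optionally columns) into a new dataset. The function names `stitch` and `subset`, the list-of-dicts representation and the sample values are all assumptions; spatial data such as imagery would need a different approach.

```python
def stitch(*datasets):
    """Concatenate tabular datasets, keeping the union of their columns.

    Each dataset is a list of dict rows; missing values become None.
    """
    columns = []
    for dataset in datasets:
        for row in dataset:
            for column in row:
                if column not in columns:
                    columns.append(column)
    return [
        {column: row.get(column) for column in columns}
        for dataset in datasets
        for row in dataset
    ]


def subset(dataset, predicate, columns=None):
    """Build a new dataset from the rows for which predicate(row) is true.

    If columns is given, only those columns are kept. The original
    dataset is left unchanged.
    """
    rows = [row for row in dataset if predicate(row)]
    if columns is None:
        return [dict(row) for row in rows]
    return [{column: row.get(column) for column in columns} for row in rows]


# Made-up example data.
field_data = [{"site": "A", "depth": 1.0}]
lab_data = [{"site": "B", "flow": 3.2}]
combined = stitch(field_data, lab_data)
print(subset(combined, lambda row: row["site"] == "B", ["site", "flow"]))
```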
Table 1.1 compares the existing and proposed Medici functionalities.
| Use cases | Existing Medici functionality | Proposed Medici functionality |
|---|---|---|
| Ability to Organize Data | Medici ‘Collections’ function helps users organize their datasets; data can be sorted by date, name and file type | ‘Collections’ function should have hierarchy so that data can be assigned to an active project |
| Ability to Share Data | Medici has a function called ‘Embed’ with which one single data object can be shared | Medici will provide a function that will allow users to share data collections |
| Ability to Navigate/Browse Data | Basic data navigation and previewing functions are available in Medici | A list of metadata will be presented on the interface when a data item is selected |
| Ability to Annotate Data | Medici offers data annotation functions such as textual input and tagging | |
| Ability to keep track of latest version of Data | Medici doesn’t provide a direct way to keep track of latest version of data; Medici provides an alternative way to establish a relationship between datasets | |
| Ability to Pull in External Data | | Medici will point to the links of USGS, EPA, NASA and JPL; links of the external sources will be treated as data |
| Ability to Stitch/Combine Data | | Medici will provide tools to stitch/combine data |
Table 1.1: Comparative scenario in Medici system
Conclusion:
Existing Medici functionalities such as collections, embed, sort by, tags, relationships and
previewing offer ways to fulfill many of the use cases, if not all. The Medici system can
be upgraded to the next level, where it will be able to accommodate all the use cases required
by the NCED researchers. Some issues, however, will be barriers to implementing all the use
cases one after another. The biggest concern is the feasibility of a workable technical
solution. File sharing is a case in point: some projects at NCED contain as many as 100K
files, and implementing Dropbox-like data sharing functionality for such a large number of
files may pose a technical challenge.