Defining Data Centric Functionality of SEAD: Bridging Domain Analysis Report and Existing Medici System

Md Aktaruzzaman

Introduction:
One of the primary objectives of SEAD is to create a virtual organization that provides data services, such as storage and active curation, to sustainability scientists. To better understand the needs of these scientists, SEAD’s Domain Engagement Committee interviewed researchers at the National Center for Earth-Surface Dynamics (NCED). Two members of the committee were responsible for establishing connections with NCED researchers, conducting the interviews and documenting the findings as user needs. The use cases are documented comprehensively in the report titled ‘User study of the National Center for Earth-surface Dynamics’. SEAD aims to meet these needs by using the open source project Medici, developed at NCSA. The Medici Content Management System is a web and desktop-enabled content management system that allows users to upload, collate, annotate, and run analytics on a variety of file types (www.ncsa.illinois.edu). This report bridges the user study at NCED and the existing Medici system.

Findings from Domain Analysis Report:
The interviews with the NCED researchers brought up an array of system requirements that the researchers would like to see in SEAD. The following use cases were identified as the most prominent during the interviews:
i) Ability to organize data
ii) Ability to share data
iii) Ability to navigate data
iv) Ability to annotate data
v) Ability to keep track of latest version
vi) Ability to pull in data from external sources
vii) Ability to stitch data

Each of these functionalities is discussed in detail in the following sections.

Data organization:
NCED is a place where interdisciplinary research takes place at full scale.
Researchers at NCED are engaged in a wide spectrum of research activities aimed at better understanding the coupled dynamics of landscapes and ecosystems. These activities have given rise to an enormous amount of data, collected from field sites or generated in indoor lab facilities and from simulation model output. All the researchers interviewed by the two members of the SEAD Domain Engagement Committee stressed the need to organize heterogeneous datasets using web-based tools. The users want to see the following functions in the SEAD system:
i) Selection of desired data
ii) Sorting data by name, date and file extension
iii) Creating an ad-hoc data collection
iv) Assigning data to an active project

In Medici, users can upload a variety of datasets and group them into collections using the embedded ‘Collections’ function. Figure 1.1 shows the schematic diagram of the existing Medici data collections function.

Figure 1.1: Data Collections in existing Medici

Figure 1.2: Proposed Structure of Medici Data Collections

From the user interviews at NCED it was clear that researchers would like a way to organize their data by project and to assign data collections to an active project. The existing Medici system allows only one level of hierarchy, which needs to be upgraded to accommodate a multi-level tree structure. Figure 1.2 presents the schematic diagram of the proposed multi-layered data organization structure in the Medici system. Users can already sort their data with the ‘Sort by’ function, which offers options to sort by date and name.

Data Sharing:
Efficient data sharing is a challenge for most researchers. Ineffective, temporary solutions such as USB sticks and hard drives are frequently used when there is an urgent need to share and store data. Researchers at NCED unanimously mentioned using Dropbox because of its user-friendliness and simplicity.
According to the interviewees, Dropbox offers version control and data backup functions that can be used simultaneously. The SEAD system is expected to provide data sharing tools alongside data organization functionality. The users want to see the following functions in SEAD:
i) Users select datasets to be shared
ii) Users are able to choose the email IDs of the individuals with whom they want to share data
iii) The SEAD system will generate email notifications to the individuals who will have access to the shared data

Figure 1.3: Data sharing at different data levels

Although data sharing is highly important for making sustainability science more meaningful, researchers still face significant barriers to sharing data efficiently with the right people while retaining their copyright and intellectual property rights over the data. Researchers are interested in sharing their data only after they have published their scientific work. Data exists at different levels, such as raw data, level 1 data and so on, and users want to be able to share data at each of these levels. Figure 1.3 illustrates the schematic diagram of data sharing at different levels controlled by the SEAD system.

Despite the ready availability of data sharing tools such as Dropbox and FileZilla, researchers still need a data sharing platform that offers more reliable and effective means of sharing. The SEAD system can play an important role by providing researchers with data sharing facilities that cover these options. The SEAD system will allow users to share data at all of its levels and will incorporate data sharing protocols. The protocols will have the following layers:
i) A user alone uses the dataset. No data sharing at this point
ii) A user is able to select preferred individuals and share data with them
iii) A user can share data with a group
iv) A user should have access to multiple project datasets.
At this point data can be shared among project members
v) Data is open to all

Figure 1.4: Proposed Data Sharing Protocols in SEAD system

Figure 1.4 shows the schematic diagram of the proposed data sharing protocols in the SEAD (Medici) system. The SEAD system will implement permission levels that restrict the ability of users to download, modify and annotate data. Figure 1.5 presents the schematic diagram of the proposed data access levels in Medici (SEAD). The SEAD system will incorporate the following data access levels:
i) Users can only read data and cannot download or modify the data
ii) Users can both read and write the data. Users can download, modify and annotate data
iii) Users can see all the metadata at all levels of data access and sharing

Figure 1.5: Proposed data access levels in SEAD system

Currently Medici has a very basic data sharing function called ‘embed’, with which a single data object can be shared. The Medici system needs to be upgraded so that users can share a collection of data and so that the system supports the different levels of the data sharing protocols.

Data Navigation:
Researchers at NCED mentioned the need for an active data repository where they can upload their data, search for data items and browse through data. The SEAD system will allow users to navigate the data or datasets they have uploaded. The SEAD system will implement the following data navigation functionalities:
i) The user is able to see all the data that he/she has uploaded into the SEAD system
ii) The system will notify the user whether a particular dataset is public or private
iii) The SEAD system will help users visualize the data on an interface. Data with a spatial property should be represented with a georeferenced map in the background.
iv) When a data item is selected, the user is able to see the associated metadata on the interface
v) The user is also able to browse data uploaded by others if he/she has permission

Figure 1.6 shows a snapshot from the Medici geo-application interface, where geospatial data has been overlaid on GIS maps.

Figure 1.6: Visualizing Geospatial data in Medici

Currently Medici has data navigation functionality, and users can upload and browse their data. Medici can also read and display shapefiles on the interface. Since the data sharing protocols are yet to be implemented in Medici, users at present have access only to their own data and cannot explore data uploaded by others.

Data Annotation:
Data annotation is the task of enriching data with explanatory notes and comments, mainly from the data producer, so that data consumers have sufficient information about the data and its associated metadata. Researchers at NCED stressed that effective annotation would help other researchers and data consumers discover data efficiently by typing keywords. SEAD’s Active Content Repository (ACR) provides researchers with a platform where they can record explanatory notes and other additional information that adds value to the data. Researchers at NCED wanted to see the following data annotation functionalities in the SEAD system:
i) Users choose a data item in order to annotate it. The data item can be a single data object (e.g., txt or doc file) or a subset of it.
ii) Users give some explanatory notes about the data
iii) Users create annotations by providing a short description. The SEAD system will provide a text input area where users will be able to type
iv) Users can also create annotations using tags

Figure 1.7: Data annotation facilities in Medici

Currently Medici offers data annotation functionality by providing both textual input and tagging options. Figure 1.7 presents a snapshot highlighting the data annotation functionality in Medici.
Explanatory notes about a data item can be given both as textual input and as tags.

Data Versioning:
The SEAD system will incorporate the following functions for keeping track of the latest version of a dataset:
i) The user is able to upload the most current version of data. The system will refer users to the most recent version of data
ii) The system will keep track of any change made to a dataset and notify the users who share access to it
iii) The system should provide a way to compare versions of the dataset

Currently Medici doesn’t provide a direct way for users to keep track of data versions. However, Medici offers an alternative: users can establish a relationship between two datasets. Figure 1.8 shows the ‘Create Relationship’ function in the existing Medici system.

Figure 1.8: Defining relationship between datasets

Pulling in Data from external source:
During the user interviews at NCED, the modelers, who were postdocs working extensively with satellite imagery, mentioned that they would like a convenient way to access data from external agencies such as NOAA, NASA and JPL. The SEAD system will implement the following functions to help researchers access external links:
i) Users register external data sources through SEAD
ii) The system will connect users to the links of external sources such as USGS, NASA and EPA

Currently the Medici system doesn’t provide a way for researchers to establish a connection with an external link.

Stitching/combining datasets:
The SEAD system will provide the following functions for combining data:
i) The SEAD system will allow users to stitch and combine various datasets
ii) It will also allow users to extract a subset from the original data and make a new dataset

Currently the Medici system doesn’t have any data stitching tools. In the future, efforts will be made to provide tools for data stitching.
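The two stitching functions listed above can be illustrated with a minimal Python sketch. It is not part of Medici or SEAD. It assumes, purely for illustration, that a dataset is a list of records and that records can be matched on a shared key. The names `stitch`, `extract_subset` and `sample_id`, and the sample values, are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical representation: a dataset is a list of records (dicts).
Record = Dict[str, object]


def stitch(datasets: Iterable[List[Record]], key: str) -> List[Record]:
    """Combine several datasets, keeping the first record seen for each key value."""
    seen = set()
    combined: List[Record] = []
    for dataset in datasets:
        for record in dataset:
            if record[key] not in seen:
                seen.add(record[key])
                combined.append(record)
    return combined


def extract_subset(dataset: List[Record], keep: Callable[[Record], bool]) -> List[Record]:
    """Return a new dataset holding copies of the records that satisfy `keep`."""
    return [dict(record) for record in dataset if keep(record)]


# Made-up sample values, used only to exercise the two functions.
run_a = [{"sample_id": 1, "depth_cm": 4.2}, {"sample_id": 2, "depth_cm": 5.1}]
run_b = [{"sample_id": 2, "depth_cm": 5.1}, {"sample_id": 3, "depth_cm": 6.8}]

merged = stitch([run_a, run_b], key="sample_id")
deep = extract_subset(merged, lambda r: r["depth_cm"] > 5.0)
print(len(merged), len(deep))
```

Copying records in `extract_subset` keeps the new dataset independent of the original, which matches the requirement that a subset becomes a dataset in its own right. A real implementation would also need to handle files that are not record-based.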
Table 1.1 presents the comparative scenario of existing and proposed Medici functionalities.

Use case: Ability to Organize Data
  Existing Medici functionality: Medici ‘Collections’ function helps users organize their datasets. Data can be sorted by date, name and file type.
  Proposed Medici functionality: ‘Collections’ function should have a hierarchy so that data can be assigned to an active project.

Use case: Ability to Share Data
  Existing Medici functionality: Medici has a function called ‘Embed’ with which one single data object can be shared.
  Proposed Medici functionality: Medici will provide a function that allows users to share data collections.

Use case: Ability to Navigate/Browse Data
  Existing Medici functionality: Basic data navigation and previewing functions are available in Medici.
  Proposed Medici functionality: A list of metadata will be presented on the interface when a data item is selected.

Use case: Ability to Annotate Data
  Existing Medici functionality: Medici offers data annotation functions such as textual input and tagging.
  Proposed Medici functionality: (none listed)

Use case: Ability to keep track of latest version of Data
  Existing Medici functionality: Medici doesn’t provide a direct way to keep track of the latest version of data. Medici provides an alternative way to establish a relationship between datasets.
  Proposed Medici functionality: (none listed)

Use case: Ability to Pull in External Data
  Existing Medici functionality: (none listed)
  Proposed Medici functionality: Medici will point to the links of USGS, EPA, NASA and JPL. Links of the external sources will be treated as data.

Use case: Ability to Stitch/Combine Data
  Existing Medici functionality: (none listed)
  Proposed Medici functionality: Medici will provide tools to stitch/combine data.

Table 1.1: Comparative scenario in Medici system

Conclusion:
Existing Medici functionalities such as collections, embed, sort by, tags, relationships and previewing offer ways to fulfill many of the use cases, though not all. The Medici system can be upgraded to the next level, where it will be able to accommodate all the use cases required by the NCED researchers. Some issues will be barriers to implementing all the use cases one after another. The biggest concern is the technical feasibility of the solutions. File sharing is a case in point. Some projects at NCED contain as many as 100K files.
Implementing Dropbox-like data sharing functionality for such a large number of files may pose a challenge from a technical perspective.