ADVANCED DATA INFRASTRUCTURE
FOR SCIENTIFIC RESEARCH
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Waseem Rehmat
Jari Veijalainen
Permission is granted to make and distribute copies of this document provided the copyright notice and this permission notice are preserved on all copies.
Permission is granted to copy and distribute modified versions of this document under conditions set by the authors, provided that the entire resulting derived work references this document and includes the authors in its ownership.
Permission is granted to copy and distribute translations of this document into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the authors.
Table of Contents

Acknowledgments
Summary
1. Introduction
   1.1. Purpose
   1.2. Scope
   1.3. Envisioned Process Flow
   1.4. Documentation Flow
   1.5. Project Phases
   1.6. Stakeholders of Scientific Projects
2. Stakeholder Analysis and Current Use Cases
   2.1. Department of Music
   2.2. Faculty of Sport and Health Sciences
3. Functional Requirements of the Solution (What should the system do?)
   3.1. Dynamics of the Planned Data Infrastructure in General
   3.2. Key Addressable Areas of the Proposed Data Infrastructure
   3.3. Objectives of the Planned Data Infrastructure
   3.4. Challenges of the Planned Data Infrastructure
4. New Use Cases and Specifications
   4.1. Actors of the Advanced Data Infrastructure
   4.2. Links between Actors and Use Cases
   4.3. Use Cases of the Proposed Structure
   4.4. Conclusion
ACKNOWLEDGMENTS
Data Infrastructure has been a project in which professionals from diverse backgrounds and divisions of the University were involved. Without a doubt, it would not have been possible to develop this document into a useful resource without the valuable contributions of the entire project team, which extends to departmental reference personnel, interviewees and stakeholders.
We wish to thank Mr. Aki Karjalainen (Development Manager at the Faculty of Sport and Health Sciences) for joining us from the start of the project documentation and briefing us on current data handling practices at the Faculty of Sport and Health Sciences of the University of Jyväskylä. He also provided us with useful resources and links for a better understanding of the faculty's needs and expectations.
The vision of the proposed solution was elaborated by Professor Taru Lintunen. She has not only been available for any queries but has also contributed to better visibility of the project path and stakeholder expectations. We also thank Professor Sulin Cheng for her candid feedback during presentations of previous drafts.
The technology perspective of the project was consistently highlighted and brought to the table by Joonas Kesäniemi. We wish to thank him here for his contributions.
During the data gathering and interviewing phase of our project report, Professor Lyytinen from the Department of Psychology referred us to Mr. Kenneth Eklund, who owns a similar solution at a different level. It was informative to have a session with him about our project and to receive his valuable recommendations.
The Department of Music's perspective on the project was discussed with Tuomas Eerola, who guided us through the current data handling processes at the department. Meetings with him were informative and constructive.
Last but not least, we would like to thank the team members and professionals contributing to project refinement in many ways. We are grateful to the University management and funding authorities for providing the opportunity and platform to excel in this project.
Warm Regards,
Jari Veijalainen
Waseem Rehmat
SUMMARY
The primary motive of this document is to understand the data infrastructure currently existing at the University of Jyväskylä, to identify areas where improvement is possible, and to explore alternative solutions to data-related issues where it has been identified that optimal data handling is not in practice.
We have moved toward domain understanding and requirements elicitation in two ways. First, we developed a better understanding of the systems by interviewing and questioning stakeholders; second, we analyzed standard modern data infrastructure management practices. Those practices were then adapted to fit our specific needs and the expectations of the stakeholders.
The process flow is described in this document as a cycle in which we identify needs, define and elaborate a solution, incorporate the solution into the system, evaluate the system with respect to the incorporated solution, and then identify needs again. Following this process flow enables us to build a data infrastructure with constant improvement inherent in it.
According to the data needs of the stakeholders, who currently are the Faculty of Sport and Health Sciences, the Department of Music, the University administration, researchers and funding authorities, we have covered the elicitation of data-related needs from the types of data needed to the end point of data archiving, addressing the stakeholders' main data-related concerns.
All project phases are defined in this document in a flow diagram, starting from basic requirements and proceeding to technical requirements. This document will thus serve as a base guide for the different project steps and for measuring the progress of milestones.
Current data practices are depicted in the form of use cases diagrammed after discussing them with stakeholders and identified personnel of the departments concerned. On the basis of current practices, we identified functional requirements of the system, covering the functionality, usability, reliability, performance and supportability of the proposed solution. Key addressable issues of the solution are also identified to assist developers at the technical level.
In the last part of the document, we present the proposed use cases and process flows. Each use case includes its actors and the functions of each actor. This document includes seven meta-level use cases and three detailed process flows.
1.0 Introduction:
1.1 Purpose:
 The purpose of this document is to develop an understanding of the business and technical requirements that will be handled and solved by the project team of the DataInfra project.
 This document will serve as a baseline for solution development, performance control and implementation.
 This document will provide a holistic view of the project to new members of the project (researchers, stakeholders, etc.) and will provide background information.
 Through this RM documentation, we will measure the work, progress and activities of the project for consistency.
 This RM document may determine whether the end solutions provided are in line with stakeholder expectations.
 This requirements management document is a living document that will be updated and supplemented throughout the project lifecycle.
1.2 Scope:
The scope of this RM document includes:
 A meta-level description of the project flow
 The current state of matters at the stakeholders' end with respect to data
 An initial classification of data encountered in research projects
 What needs to be done with reference to the above data
 How it shall be done (meta-level view)
 What level of quality must be achieved in the different phases of the project
1.3 Envisioned Process Flow:
Requirements are identified and gathered directly from stakeholders. These requirements are clarified to ensure the understanding of the technical specification development team, including any required decomposition of requirements.
The following diagram shows the generic process flow of DataInfra requirements management.
1.4 Documentation Flow:
1.5 Project Phases:
The flow diagram below details the flow of the project and the suggested phases to be followed during implementation.
In terms of the data resources possessed by the University of Jyväskylä, research data is the most valuable source. Currently, this data is stored in different data sources in various formats, and the data is growing exponentially at many locations, giving rise to data volumes measured in gigabytes and terabytes.
With such an amount of data at the university, there is a rising need to find ways to develop relations between data sources (data synchronization and metadata management). Another primary need that grows with increasing research data is the management of property rights, ownership and consent, which is currently either not managed at all or managed in paper format.
The generic (meta-level) research process is depicted in the following diagram.
Generic Research Process
The above diagram shows a schematic research project flow from the initial idea to the end. If individual persons are not the object of the study, the steps in the middle (select test persons, gather their consents) can be skipped. Data gathering, storing and analysis are present in all projects, but in different forms. The source can be TV news, Internet sources, newspapers, scientific literature, etc. We especially observed cases where the data about persons is gathered by various sensors or interviews. The sensors can also gather data directly or indirectly about phenomena in nature (experiments in physics, observations in astronomy). The collected digital raw data and the information carried by it are then analysed by software (or manually), and new kinds of data are produced and stored. This cycle can continue several times, and additional raw data can be fed in. The main results of a scientific project are scientific publications, patents, and perhaps also new substances, products, business models, methods, etc.
In this context we consider only project-based information that can be encoded into bit strings and stored and transmitted as bit strings, i.e. as data. These are produced in various phases of each real project. The types of data vary from project to project and from phase to phase.
The final phase is data archiving. This can already happen during the project, especially if the project lasts years, but most often it is performed as part of closing the project. It is an issue that needs special treatment as concerns adherence to legal and ethical requirements, privacy preservation, access rights to the data, storage requirements, retention time, etc.
1.6 Stakeholders of Scientific Projects:
The stakeholders of the project are the following:
 Faculties and departments (case study: the Faculties of Sports, Social Sciences and Humanities)
 University administration
 Internal and external researchers
 Funding authorities
This document is also intended to update stakeholders on our current standing in the DataInfra project.
Current practices linked with data creation and management at two faculties are described below under the following basic headings.
 Amount of data
 Type of data in terms of data format
 Data ownership
 Data privacy matters
 Data protection practices
 Consent handling (object and first user)
 Copyright issues
 Data sharing
 Data archiving and data backup
 Current metadata
 Retention time of data and disposal
 Physical data storage mediums
2.0 Stakeholder Analysis and Current Use Cases:
2.1 Department of Music:
The Department of Music is highly data intensive due to the nature of its activities. The data formats inevitably created and used in the department are high-quality sound and video files in many formats, including WAV, AIFF, WMA, MP3 and MP4, RA, MSV and AMR for audio, and MPG, MOV, WMV and RM for video.
Data ownership matters are handled on an individual basis; a mechanism for this exists but might need automation and to be brought under a structured flow. Data ownership and privacy are of prime importance at the department, and this is to be taken into account while structuring any future data infrastructure. A structured consent handling mechanism, including automated consent forms and their presence in the research data handling flow, is to be developed.
A very basic and simple assumed practice linked with data creation is depicted in the flow diagram below.
Prerequisites of the above-mentioned data flow, including consents and ownership issues, are managed manually before research data creation and are currently not directly linked with each data generation instance.
Multimedia data, consisting of alphanumeric, graphics, image, animation, video and audio objects, is clearly different in terms of semantics and viewing. From the viewing perspective, multimedia data is huge in size and has time-dependent characteristics for flawless viewing and retrieval, and this can be one of the challenges while structuring the advanced infrastructure. The architecture of the database system may consist of modules for query processing, process management, buffer management, file management, recovery and security, along with IPR handling.
To understand the difference in the required structure, note that query processing in a multimedia database has to differ from that in a standard alphanumeric database. The results of a simple query may contain multimedia items, and the intended result may not be exact but rather matched at a certain degree of similarity. One example query can be "Show all results from archived videos where person X appears", where the query has a picture of that person or a name connected to another database of static images. According to Jari Veijalainen, successful implementation of such a repository might be doable without guaranteeing full serializability and recoverability properties for interleaved executions. Unless sufficient knowledge is available in advance about the behavior and semantics of the transactions inserting, retrieving and updating the data, however, it is better to guarantee serializability and recoverability.
At the current moment, we are not aware of the amount of data possessed by the Department of Music, or whether the data is stored at a single data store.
2.2 Faculty of Sports and health Sciences:
Faculty of sports and health sciences deals with varied data types .amount of data created and
handled on concurrent basis is not very high but high quality video data created during events is
intense . Most common data is alphanumeric consisting of evaluation results by different
measurement machines with reference to sports activity and health indicators measurement. Other
data sources are Video captures used in different lab activities and possibly, RFID tags linked to
experiments. Amount of data currently available for data mining purposes is not very high and to be
specific, network centric, ubiquitous, knowledge intensive management system has to be created
which will have capability of providing indexed data useful for data mining.
Data created and utilized at the faculty by individuals is under ownership of the faculty and object
consent system is in place which maintains needed consents in paper format .So, automation of
current consent’s and ownership issues will mean digitization of paper data . However, no prime
need for this has been given by faculty.
Data privacy has not emerged as a concern as yet because of the fact that Objects health and
voluntary responses results are used locally and data creating apparatus is not networked. However,
in a networked system which is to be established and implemented, this can be a matter of concern
because complete anonymity is not all the time possible and data is imperatively linked with object.
As the amount of digital networked data will grow, there will be a greater chance of data loss and
misplacements through thumb drives, external hard drives and other storage mediums. So, to
protect the secrecy of the entire data lifetime, we may have confidential ways to store data.
Another important aspect to be taken in to account might be the fact that data created at the
Laboratory can be highly scattered and inconsistent due to its dynamic nature. Do we approximate
such data or other ways of handling that are topics of discussion.
In current system, no archiving or backup issues are noticed. Data backup is done through external
drives and serves the purpose in current situation.
Each data creation practice is currently considered as a new data instance even if the object is being
monitored nth time. A basic Faculty lab instance is diagrammed in below mention drawing.
We can see that diagnostic results are taken directly from the diagnostic machine and, in the usual setting, readings are preserved on paper in a standardized manner. Through this practice, the researcher can manage any data type or convention change manually. However, in a networked system, we need standardized data creation. For example, if an object records his/her body temperature at two machines, both diagnostic machines should present the results in Fahrenheit or in Celsius, or else we may need conversion mediation so that stored data is more consistent and minable.
The above-mentioned diagram shows the extended generic process with a maximum of three links (levels of network).
One possible concern can be the motivation of the subject and researcher to record and update data linked with a subject every time some measurement is made through diagnostic machines. Some subjects may like to just monitor some statistics for instant reference and may not want to update, or network the updating of, such data. Here, we also need to discuss whether we are interested in having every data creation instance updated in the database, or whether only certain experimental data is needed. By the term 'subject' we mean the person whose data is being monitored at the Faculty of Sport and Health Sciences, i.e. an athlete or a patient.
From the viewing perspective, multimedia data can be huge in size and can have time-dependent characteristics for proper viewing. This can be a challenge in the proposed solution. The proposed data infrastructure may contain query, process and buffer management systems to handle these issues.
3.0 Functional Requirements of the Solution (What should the system do?):
The following section of the document covers the functional requirements of the solution, which are classified as follows:
1. Functionality
2. Usability
3. Reliability
4. Performance
5. Supportability
3.1 Dynamics of the Planned Data Infrastructure in General:
In the following section of the document, we discuss the general characteristics of the advanced data infrastructure. At a later stage in the same document, we discuss requirements specific to each faculty (Sport and Music).
Data Redundancy:
The amount of data appears to be a non-issue at the Faculty of Sport and the Department of Music, due to the fact that data storage is not expensive and it is possible to store data of all kinds. However, this can lead to duplication and redundancy of data, and to the storage of different and conflicting versions of the same data at different locations (in the advanced data infrastructure). For example, if the temperature data of a subject is stored in two databases, it is vital that the data is stored in the same data convention (in Fahrenheit or in Celsius). At a later stage, conflicting data can yield data inconsistency, which affects the integrity of information created on the basis of that data. Therefore, data redundancy has to be controlled in the advanced data infrastructure.
Classification of Data:
It is imperative to classify data, especially if we intend to provide a solution in a decentralized or cloud format of data handling. Data should be classified into a number of categories, of which some suggested categories are mentioned below. These are connected to different phases of the project and to different kinds of projects. The classification below is tentative and is related to the inner loop (data gathering, storage and analysis) of the meta project cycle:
 Discrete data
 Continuous data
 Human input data
 Auto-generated data
 Descriptive data or metadata
 Positional and environmental data
 Sensor data
If we look at the entire project duration, we can identify the need for another categorization. It relates to the phases of the project and the different media types.
Raw data: data that is produced by measuring/capturing devices or input by humans.
Derived data: data that is produced algorithmically from raw data and/or other derived data.
Operational data: data that is produced by measuring devices or humans (input), or derived from that data.
Metadata: data that characterizes the operational data in terms of type, source, IPRs, privacy restrictions, retention time, access history, etc. It is itself discrete data.
Discrete data: consists of single values or sets of values; does not have special requirements for storage or retrieval rate.
Streaming data: data that has minimum requirements as concerns the storage or retrieval/processing rate (e.g. video, audio, or a stream of discrete measuring values).
Single media data: text, audio, image, voiceless video/animation stream, etc.
Multimedia data: consists of at least two synchronized single media types.
Compressed data: operational data can be compressed from the beginning (video); compressed data can also be obtained from operational data or metadata by lossy or lossless compression.
Uncompressed data: operational data or metadata that is not compressed or has been decompressed.
This classification (metadata vs. operational data) determines the category of each item and is instrumental for rights management. For example, it would then be possible to grant access in different tiers and to any certain type of data (privacy issues). Furthermore, it will also allow better and swifter data search results if the search is restricted to a certain class. For example, a data retriever can limit his/her search to auto-generated data if logs of RFIDs are needed.
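A classification-restricted search of this kind can be sketched as follows; the catalog entries and class labels are illustrative assumptions, not items from any existing university system.

```python
# Sketch: tag each stored item with a data class so searches can be
# limited to one category (e.g. auto-generated RFID logs).
from dataclasses import dataclass

@dataclass
class DataItem:
    name: str
    data_class: str   # e.g. "sensor", "human-input", "auto-generated"

catalog = [
    DataItem("rfid-log-2011-03", "auto-generated"),
    DataItem("interview-07", "human-input"),
    DataItem("hr-monitor-raw", "sensor"),
]

def search(items, data_class):
    """Return only items belonging to the requested data category."""
    return [i.name for i in items if i.data_class == data_class]
```

Restricting the scan to one class keeps irrelevant categories out of the result set entirely, which is what makes the search both swifter and more precise.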
The distinction between discrete and streaming data helps to guide the implementation of the system. Typically, streaming data (video) is voluminous and has real-time retrieval and storage rate requirements, whereas discrete data does not require as much storage space and does not have real-time access requirements.
Classification of data should be done at the general level and at the faculty level, due to the fact that we also intend to develop a system capable of integration with all current data infrastructures and, ideally, with any future data structures developed in the University and partner institutions.
3.2 Key Addressable Areas of the Proposed Data Infrastructure:

Data Ownership: This relates to who has the legal right to the data and who retains the data, including rights to transfer the data.
Data Collection: This pertains to collecting project data in a consistent, systematic manner (i.e., reliability) and establishing an ongoing system for evaluating and recording changes to the project protocol (i.e., validity).
Data Storage: This concerns the amount of data that should be stored and what real-time requirements there are.
Data Protection: This relates to protecting written and electronic data from physical damage and protecting data integrity, including damage from tampering or theft.
Data Retention: This refers to the length of time one needs to keep the project data according to the sponsor's or funder's guidelines. It also includes secure destruction of data.
Data Analysis: This pertains to how raw data are chosen, evaluated, and interpreted into meaningful and significant conclusions that other researchers and the public can understand and use.
Data Sharing: This concerns how project data and research results are disseminated to other researchers and the general public, and when data should not be shared.
Data Reporting: This pertains to the publication of conclusive findings.
(Steneck 2004)
3.3 Objectives of the Planned Data Infrastructure:
In the following section, we discuss the core objectives we wish to achieve through the proposed data structure. The planned solution should be capable of serving at least the objectives mentioned below.
 Data repository
A data repository should be formed by collecting data from multiple internal and external sources, and this data should be summarized to provide a single source for any mining applications or for any processing by legal users.
 Systematic storage of data through classification and metadata generation
Data should be classified into a number of categories on the basis of data type (some classes are mentioned above in this document) and also according to the level of internal controls needed to protect the data against theft and improper use (ensuring data confidentiality). To achieve this, it is imperative to classify data on the basis of sensitivity and confidentiality along with the data type classification. Three classes of data on the basis of control are suggested below [2]. These classes are to be defined with ratings of Low, Moderate and High in a nine-box grid format:
1. Confidentiality
2. Integrity
3. Availability
                  Low      Moderate      High
Confidentiality
Integrity
Availability
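The grid above can be encoded as a per-data-set lookup; the sketch below (with hypothetical data-set names and ratings) derives an overall handling level from the strictest of the three ratings, which is one plausible way to drive access and storage rules from the grid.

```python
# Sketch of the nine-box control grid: each data set gets a
# Low/Moderate/High rating on the confidentiality, integrity and
# availability axes; the strictest rating drives handling rules.
LEVELS = {"Low": 0, "Moderate": 1, "High": 2}

def overall_sensitivity(ratings: dict) -> str:
    """Return the strictest of the three axis ratings."""
    return max(ratings.values(), key=LEVELS.__getitem__)

# Hypothetical example: subject health data at the sports faculty.
subject_health_data = {
    "confidentiality": "High",      # personal health data
    "integrity": "Moderate",
    "availability": "Low",
}
```

Here the High confidentiality rating dominates, so the data set as a whole would be handled under the strictest controls.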
 Efficient access to stored data
Providing secure and efficient access to data for the end user is very important, especially in a distributed system or information cloud. This can be ensured by using the secure data access mechanism of Weichao Wang et al., presented in their paper "Secure and Efficient Access to Outsourced Data" [3], in which the authors provide a certification method to handle access. A diagram of the method is shown below.
A similar practice is followed for online data access and for the recovery of passwords over the Internet.
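As a hedged illustration only (this is not the certification method of Wang et al., whose paper [3] should be consulted for the actual scheme), the following sketch shows the general idea of checking access against a keyed certificate instead of trusting the storage provider; the key, user and resource names are assumptions.

```python
# Generic illustration: the data owner issues a keyed certificate
# binding a user to a resource; the verifier recomputes the MAC and
# compares it in constant time before granting access.
import hashlib
import hmac

SECRET = b"owner-side key"  # assumption: held by the certificate issuer

def issue_certificate(user: str, resource: str) -> str:
    """Issue a certificate binding a user to a resource."""
    msg = f"{user}:{resource}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify_access(user: str, resource: str, certificate: str) -> bool:
    """Recompute the expected certificate and compare in constant time."""
    expected = issue_certificate(user, resource)
    return hmac.compare_digest(expected, certificate)
```

A certificate issued for one user cannot be replayed by another, since the user name is bound into the MAC.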
 To ensure data integrity and security
Security of data may include protecting data from the anomalies mentioned below:
1. Data corruption
2. Data destruction (where data is altered or destroyed by a human factor)
3. Undesired modification
4. Data loss (technical defect)
5. Unauthorized disclosure
A verification and validation mechanism should be devised to ensure data integrity. Furthermore, proper archiving and backup should be practiced according to standardized, defined procedures to guard against data loss and destruction. Backing up of systems may be done according to the Grandfather-Father-Son principle. Backing up of data is discussed in this context because data backup has a direct impact on data security.
 Concurrent access to data
In the case of concurrent access to resources, data access should have mechanisms to prevent undesired effects when multiple users try to modify resources that other users are actively using. There are two basic categories of concurrency handling:
1. Pessimistic concurrency control, where a system of locks prevents users from modifying data in a manner which affects the data negatively or affects other users. After a user performs an action that causes a lock to be applied, other users cannot perform actions that would conflict with the lock until the owner releases it.
2. Optimistic concurrency control, where users do not lock data when they read it. When a user updates data, the system checks whether another user changed the data after it was read. If another user updated the data, an error is raised and the transaction is aborted [4].
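The optimistic scheme in point 2 can be sketched as follows; the record key and contents are illustrative assumptions. Each record carries a version number, and an update succeeds only if the version is unchanged since the read.

```python
# Sketch of optimistic concurrency control: no locks on read; an update
# is accepted only if the record's version still matches the version
# the writer originally read, otherwise the transaction is aborted.
class ConflictError(Exception):
    """Raised when another user changed the data after it was read."""

store = {"subject-7": {"value": 36.8, "version": 1}}

def update(key: str, new_value, read_version: int) -> None:
    record = store[key]
    if record["version"] != read_version:
        raise ConflictError(key)   # abort the conflicting transaction
    record["value"] = new_value
    record["version"] += 1
```

A second writer who read version 1 before the first writer committed will see a ConflictError and must re-read before retrying, which is exactly the abort-and-retry behavior described above.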
 To maintain the quality of data
The quality of data should be ensured at all stages of the data management process, from capture to digitization (if any), storage, analysis and end usage. Data quality can be improved in two ways:
1. Prevention
2. Correction
We can introduce both measures to ensure the quality of data. However, the prevention of anomalies is considered more efficient, because data cleaning and correction at later stages do not provide better results.
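Prevention can be sketched as a validation gate at the point of capture, so that anomalous values never enter the store and no later cleaning pass is needed; the plausibility range below is an illustrative assumption for body-temperature readings.

```python
# Sketch of quality-by-prevention: reject implausible readings at the
# point of entry instead of cleaning them out of the store later.
PLAUSIBLE_CELSIUS = (30.0, 45.0)  # assumed range for body temperature

def accept_reading(celsius: float) -> bool:
    """Accept only readings inside the plausible range."""
    low, high = PLAUSIBLE_CELSIUS
    return low <= celsius <= high

# A Fahrenheit value entered by mistake (98.6) and a sensor glitch
# (-1.0) are both rejected before storage.
clean = [r for r in [36.8, 98.6, 37.2, -1.0] if accept_reading(r)]
```

The rejected values would be flagged back to the data generator immediately, while the readings are still easy to re-take, which is why prevention beats after-the-fact correction.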
 To ensure the privacy of data
Privacy concerns the rights and obligations of individuals with respect to data collection, usage, disclosure and disposal. Data privacy matters come under the risk management function of the University.
Section 10 of the Constitution of Finland, entitled "The right to privacy", provides guidance about data privacy. The Personal Data Act of 1999 introduces the concepts of informed consent and self-determination into Finnish law, giving data subjects the right to access or correct their data, or to prohibit their use for stated purposes [5]. Privacy matters are to be discussed with legal authorities during the development of the data infrastructure.
A separate paper on EU legislation can be attached to this document for guidance. The main directives on which the Finnish legislation is based are the privacy directives 2002/2006 and some other directives. These are mainly implemented in Henkilötietolaki 22.4.1999/523, accessible at www.finlex.fi.
 Efficient handling of IPR issues
Any data which is assumed to bear a level of novelty or distinctiveness may be considered copyrighted material and may be handled under the guidance of a standardized process defined for IPR handling. IPR-controlled data can involve:
1. Copyrights
2. Patents (not of concern for operational data, because data is not patented in the EU)
3. Usage of trademarks (signs or emblems)
4. Business plans or strategic plans
IPR issues might also arise in collaborative research, where more than one individual or group coordinates to handle segments of a project; for example, individual researchers, laboratory technicians (Faculty of Sport and Health Sciences) and objects involved in research data. Details of IPR handling mechanisms are to be discussed with experts in the specific field.
3.4 Challenges of the Planned Data Infrastructure:
 Data redundancy:
An object or set of objects with unique identities can be stored at two different locations (as bit strings, in different places). For example, data about an object may be stored in two files on non-volatile storage mediums. Some of the information about the object might be changing, such as the object's status in the information system. Therefore, it is possible that when data about the object is changed in the master file, the same data is not updated in the other files. This can raise issues of duplication and can later contribute to integration issues.
 Logical representation of physical data:
Logical representation of physical data is a key feature of a data structure. The term LOGICAL here refers to the view of the data as presented to the user, and the term PHYSICAL refers to the data actually stored in the storage medium. Narrowing the gap in data volume between the physical and the logical can be challenging, and is also a good measure of the efficacy of the data infrastructure. By narrowing the gap, we mean the gap between the information available in physical storage and its logical representation to the end user.
4.0 Use Cases and Specifications:
4.1 Actors of the Advanced Data Infrastructure:
The actors of the proposed data infrastructure are the following:
 Subject – the person, source, or system that is the first contact or point of data
collection.
 Data Generator – the first user of the raw data, or the person/system gathering it. This
also includes the person or system interpreting the information carried by the operational
data.
 Data User – a researcher or system utilizing data or information created by the data
generator.
 Data Analyst – an individual, research group, or system manipulating research data is
referred to as a data analyst in this document.
 Data Manager (or management system) – an individual or system responsible for the
storage, backup, and archiving of data. At the first level, data should be managed by the
data generator, or otherwise by the system when data is inserted.
 Data Importer (if any) – a system or individual importing data from external sources.
Note that in this case there is no direct link with the subject of the data.
 Data Exporter (if any) – a system or individual exporting data to an external source. IPR
and ownership issues are to be handled by the data exporter in case of data export.
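The roles above can be summarized as a simple permission map, as in the sketch below. The operations listed are drawn from the role descriptions in this section and are illustrative, not a final access-control design.

```python
# Sketch of the actor roles above as a permission map. The operation
# names are taken loosely from the role descriptions and are illustrative.

ACTOR_PERMISSIONS = {
    "subject":        {"create", "update", "verify"},
    "data_generator": {"gather", "interpret", "manage_first_level"},
    "data_user":      {"read", "reuse"},
    "data_analyst":   {"read", "transform"},
    "data_manager":   {"store", "backup", "archive", "set_access_rights",
                       "create_metadata", "remove"},
    "data_importer":  {"import"},
    "data_exporter":  {"export", "handle_ipr"},
}

def may(actor: str, operation: str) -> bool:
    """Return True if the given actor role is allowed the operation."""
    return operation in ACTOR_PERMISSIONS.get(actor, set())
```

A map like this makes it explicit, for example, that removing data belongs to the data manager rather than to data users.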
4.2 Links between actors and Use cases:
In the following section of the document, we shall elaborate generic links between actors
and their link to solution use cases. Purpose of the section is to categorize interactions and
to make technical specifications concrete and under the scope of solution.
Actor 1 - Subject
The subject is the person, source, or system that is the first contact or point of data
collection. In the scope of the Faculty of Sports and Health Sciences, this may be the
individual or group undergoing certain fitness tests; in the context of the Department of
Music, it may be the founding owner of a piece of data.
Data creation and updating are first-contact operations, whereas verification is usually a
post-action operation.
Actor 2 - Data Generator:
The data generator is the first user of the data, or the person/system gathering it. This also
includes the person or system transforming data into information. In the context of the Faculty
of Sports and Health Sciences, this may be the researcher or the fitness-test administrator
(e.g. a doctor), whereas in the context of the Department of Music, it can be the person
updating the data in the system, for example the person playing and recording melodies from a
manuscript for a folk-tune record (discussion case with Tuomas Eerola).
In the above-mentioned use case the data generator can be an individual or an automated
system, and transformation of data extends to synchronization without a direct link to the
data generator.
Actor 3 - Data User:
A data user is a researcher or system utilizing data or information created by the data
generator. In the scope of our project, researchers and data-cleaning personnel may be
referred to as data users.
Relaying and reusing data may involve handling IPR issues and ownership constraints. These
issues are to be handled by the data management system or the data manager, who can create
metadata, set access rights, remove data, and so on.
Actor 4 - Data Analyst
An individual, research group, or system manipulating research data is referred to as a data
analyst in this document.
In longitudinal research projects, the data analyst can be different from the actual
researchers. Transformation of data into information in the above-mentioned use case means
processing and refining data into a logically meaningful piece of information for end users
and others referring to the data.
Actor 5 - Data Manager or Management System:
An individual or system responsible for the storage, backup, and archiving of data is referred
to as the data manager. The same actor may be responsible for handling IPR issues and data
ownership and transfer rules.
Actor 6 - Data Importer (if any):
A system or individual importing data from external sources (sources outside JY) is referred
to as the data importer. Note that in this case there is no direct link with the subject of
the data.
Data sharing agreements are to be defined prior to any data import; establishing them is not
among the responsibilities of the data importer.
Actor 7 - Data Exporter (if any):
A system or individual exporting data to an external source is referred to as the data
exporter. IPR and ownership issues are to be handled by the data exporter in case of data
export.
Data sharing agreements are to be defined prior to any data export; establishing them is not
among the responsibilities of the exporting individual or system.
4.3 Process flows of the proposed structure:
 Research Data Entry
 Data Collection
 Data Integration
 Data Storage
 Data IPR handling
 Data Ownership
 Data Quality
 Database query system
Proposed PF 1 (Research Data Entry):
(Process-flow diagram.)
Proposed PF 2 (Data Collection):
Data collection, and the segmentation of the collected data, are vital to the efficiency of
our data infrastructure, because inaccurate data collection, or collection of data under vague
classifications, can lead to difficulties in processing the data and to invalid results.
Data Collection Protocol:
1. Identification of need for data
2. Data collection plan
3. Approval
4. Data collection
5. Analyzing and synthesizing data
6. Sharing results and decision making
7. Storage and destruction
* Process derived from Hastings and Prince Edward District School Board, 2005 (not a reference but a synthesis).
The main data classes and subclasses to be adhered to by data collectors (systems and
researchers) are mentioned below.
 Quantitative data collection
 Qualitative data collection
Quantitative data collection:
Quantitative data collection methods rely on random sampling and structured data collection
instruments that fit diverse experiences into predetermined response categories [6].
The types of quantitative data categories are:
1. Experiments and trials
2. Observing events
3. Obtaining data from other management information systems
4. Surveys with closed-ended questions
Qualitative data collection:
Qualitative methods are usually used to find out the reasons behind certain quantitative
results and to explore the subject or area of research further. Qualitative methods are also
used to improve the accuracy of quantitative methods.
The defined standards mentioned in the diagram may include the correct form of responses
(Yes, No), the completion of all fields in the case of electronic questionnaires, and any
other standards defined by the administrator depending on project requirements.
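A check against such defined standards can be sketched as below. The required fields and the Yes/No rule are illustrative stand-ins for whatever standards a project administrator defines.

```python
# Sketch of validating questionnaire entries against "defined standards":
# closed responses must be Yes/No and, for electronic questionnaires, all
# fields must be filled. Field names and rules here are illustrative.

REQUIRED_FIELDS = {"respondent_id", "q1", "q2"}
CLOSED_RESPONSES = {"Yes", "No"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for field in ("q1", "q2"):
        value = entry.get(field)
        if value is not None and value not in CLOSED_RESPONSES:
            problems.append(f"{field}: {value!r} is not Yes/No")
    return problems

ok = validate_entry({"respondent_id": "r1", "q1": "Yes", "q2": "No"})
bad = validate_entry({"respondent_id": "r2", "q1": "Maybe"})
```

Running such a check at entry time, rather than afterwards, is what later lets the infrastructure operate at the proactive rather than reactive quality level.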
Proposed PF 3 (Data Integration):
In cases where data is updated manually by an individual instead of by data-creating devices,
it is the responsibility of the person updating the data to make sure that the data is clean
and in line with the standards defined by the project and research team.
Proposed PF 4 (Data Storage):
Storage needs are primarily classified into two segments:
 During the project
 Long-term storage
The storage needs of these two phases of the data lifecycle differ in terms of performance,
replication, and backup needs. During the project, the researcher or research team should be
updating and storing the data in the system. After the project, however, the data should be
refreshed, updated, and reformatted by the University or the DBMS, unless it is disposed of
according to the need/nature of the data [7].
The research data lifecycle is outlined below to identify the storage needs. We discuss both
phases separately in this document.
During the project:
• Observe
• Experiment
• Annotate
• Index
• Publish
After the project:
• Refresh technology
• Refresh format according to the latest technology
• Store data reliably
During and after the project:
• Learning
Data Management during a Research Project:
The main purpose of data management during the project should be to ensure that the data can
be transformed into information by others. In technical terms, this means standard file
formats, access methods, descriptive metadata, proper indexing, and adequate publishing.
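A minimal descriptive-metadata record supporting those needs might look as follows. The fields loosely follow common Dublin Core-style elements; the class and all values are illustrative, not a fixed schema for the proposed infrastructure.

```python
# Sketch of a minimal descriptive-metadata record attached to a dataset
# during a project, covering the file-format, indexing (keywords), and
# publishing needs named above. All field names and values are illustrative.

from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    identifier: str
    title: str
    creator: str
    file_format: str                               # a standard format, e.g. "text/csv"
    keywords: list = field(default_factory=list)   # supports indexing and retrieval
    access_rights: str = "project-members"
    published: bool = False

meta = DatasetMetadata(
    identifier="jyu-sports-2011-001",
    title="Fitness test results, pilot group",
    creator="Faculty of Sports and Health Sciences",
    file_format="text/csv",
    keywords=["fitness", "longitudinal"],
)
```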
Data Management after a Research Project:
The main purpose of data management after the project should be to keep the data useful and
retrievable for future reference and to meet legal requirements. Efficiency of access can then
be traded for compatibility and availability of data. The following table summarizes data
storage preferences during and after the project [8].
                    During project    Long-term storage
Performance         High              Moderate
Survivability       High              Very high
Curation            High              Very high
Primary curator     Researcher        University
One concern with reference to the long-term preservation of data is its management when the
original funding and the original researchers who collected or created the data are no longer
available. This should be handled by a proper handover of the data to database administrators
when personnel leave. Another way of handling this issue is shared responsibility for the data
between researchers and database administrators during the project.
Classification of data and storage:
In the proposed data infrastructure, it is suggested that data be stored according to a
standard classification. This means that data may be stored at different locations according
to its type (audio, video, text, etc.). To achieve this goal, the data storage model needs an
embedded data sorting system that segregates the different data types and stores the data in a
way that makes retrieval easy and efficient. A suggested model is shown below.
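The sorting idea can be sketched as a routing function that maps each incoming file to a storage location by its data class. The extension map and location paths are hypothetical placeholders, not actual infrastructure paths.

```python
# Sketch of the embedded data sorting idea: incoming files are routed to a
# storage location by their data class (audio, video, text), so retrieval
# by class stays simple. Extensions and paths here are illustrative.

import os

STORAGE_LOCATIONS = {
    "audio": "/storage/audio",
    "video": "/storage/video",
    "text":  "/storage/text",
}
EXTENSION_CLASSES = {
    ".wav": "audio", ".mp3": "audio",
    ".mp4": "video", ".avi": "video",
    ".txt": "text",  ".csv": "text",
}

def route(filename: str) -> str:
    """Return the storage path for a file based on its data class."""
    ext = os.path.splitext(filename)[1].lower()
    data_class = EXTENSION_CLASSES.get(ext, "text")   # default bucket
    return STORAGE_LOCATIONS[data_class]
```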
Proposed PF 5 (Data IPR Handling):
Intellectual property (IP) rights may grant the creators or owners of a work certain controls
over its use. Below is the suggested framework for handling the IP rights of research data
according to the guidance of Finland's Personal Data Act (1999).
Proposed PF 6 (Data Ownership):
According to our project definition, data and information are valuable assets. This leads to
the notion of responsibility for the data and information available on university servers and
data stores. Ownership of data in this document covers the activities related to accessing,
creating, modifying, packaging, selling, or removing data, along with sharing the rights and
privileges attached to the data.
Proposed PF 7 (Data Quality):
Data can be rated as high quality if it satisfies the defined standards and assists the
researcher in his/her operations, planning, and decision making. In the scope of our project,
we may measure data quality by analyzing the completeness, validity, and consistency of the
data, the conformance of data values to project requirements, and the appropriateness of the
data for a specific use.
The level of quality can be measured on a scale of 1 to 4, depending on the activity level it
exhibits according to the table below [9].
The data quality activity levels:
Proactive – data quality issues are prevented
Active – data quality is supervised in real time
Reactive – data quality is supervised periodically
Passive – data quality is not measured
From the data quality point of view, it has been observed that the stakeholders of the project
are at the reactive stage, meaning that random checks are made on the available data to verify
consistency, supportability, and efficiency.
Through the proposed data collection and integration model, we aim to reach the proactive
level of data handling, because the proposed mechanism ensures data cleanliness and
supportability at the time of data entry and updating, instead of through periodic checks.
In terms of projects, data quality is to be ensured by the researcher and/or the project
manager. At the higher levels of the pyramid, however, it can be controlled by defining
parameters in the system and automated according to those parameters.
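Two of the quality dimensions named above, completeness and validity, can be computed as simple ratios over a batch of records, as in the sketch below. The field names and the validity range are illustrative, and the same functions could run periodically (reactive level) or at entry time (proactive level).

```python
# Sketch of measuring the completeness and validity of a batch of records,
# two of the quality dimensions named in PF 7. Field names and the
# validity rule are illustrative.

def completeness(records, required):
    """Share of records that contain every required field (0.0 to 1.0)."""
    if not records:
        return 1.0
    full = sum(1 for r in records if required <= r.keys())
    return full / len(records)

def validity(records, field, lo, hi):
    """Share of present values of `field` that fall inside [lo, hi]."""
    values = [r[field] for r in records if field in r]
    if not values:
        return 1.0
    return sum(1 for v in values if lo <= v <= hi) / len(values)

records = [
    {"id": 1, "height_cm": 181},
    {"id": 2, "height_cm": 999},   # out-of-range value
    {"id": 3},                     # incomplete record
]
c = completeness(records, {"id", "height_cm"})
v = validity(records, "height_cm", 50, 250)
```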
Data Quality Process:
Below is the suggested contamination control flow for research data in the proposed solution.
This process should be followed at the data entry level to achieve the proactive stage in
contamination control.
Moving Forward:
This document can be continued as a living document for recording project milestones and
progress as the project moves towards technical specifications and implementation.
Conclusion:
The report is based on a meta-description of the research project flow and discusses the
various forms of data produced in the different phases of that flow. The emphasis in this
report is on the actual performance of the research, because that is where the need for a data
infrastructure seems most urgent. It is nevertheless clear that the preparatory phase and the
funding type have ramifications for the data management activities needed in the main phase.
This is evidenced by questions like "who owns this data, who has access to it, what are the
privacy restrictions on this data?", etc. It is also clear that the data produced in the
project preparation phase is related to the data produced later. These relationships are left
for further study.
The current operational and metadata handling practices were analyzed through interviews and
system analysis (observations). It was identified that better data handling procedures are
needed, and this defines the scope of the advanced data infrastructure.
First, we developed a better understanding of the systems by interviewing and questioning
stakeholders; second, we analyzed standard modern data infrastructure management practices.
Those practices were then adapted to our specific needs and to the expectations of the
stakeholders, that is, the data handling needs of the faculties, university administration,
researchers, and funding authorities. So far, we have covered the elicitation of data-related
needs, from the types of data needed to the end point of data archiving, covering the
stakeholders' main data-related concerns. The university administration's concerns are left
for further study.
The document contains seven use cases and seven process flows to serve as an instructional
boundary for the system developers and technical personnel who are to develop the data
infrastructure required by the stakeholders.
Perhaps the most pressing issue is how to define and manage the metadata. Without a correct
definition and clear processes for its handling, the operational data cannot be kept in order
and protected against misuse, loss, confusion, etc.
References:
[1] D. Delen et al., "A Holistic Framework for Knowledge Discovery and Management," 2009.
[2] B. Markham, 2011. [Online]. Available:
http://www.oit.umd.edu/Publications/Data_Classification_Presentation_022908.pdf.
[3] W. Wang, "Secure and efficient access to outsourced data," New York, 2009.
[4] G. Weikum, G. Vossen, Transactional Information Systems. Morgan Kaufmann Publishers, USA, 2002.
[5] Privacy International, 2011. [Online]. Available:
https://www.privacyinternational.org/article/finland-privacy-profile#frame. [Accessed 2011].
[6] www.confluence.ucdavis.edu, University of California, 2011. [Accessed 2011].
[7] www.confluence.ucdavis.edu, University of California, 2011. [Accessed 2011].
[8] R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera and H. Haapasalo, "Managing one master
data – challenges and preconditions," Industrial Management & Data Systems, 2011.