Project Identifier: SHARD
Version:
Contact: Patricia Sleeman
Date: 21 July 2012
JISC Final Report
Project Information
Project Identifier
To be completed by JISC
Project Title
The Preservation of Historical Research: SHARD
Project Hashtag
Start Date
1st November 2011
Lead Institution
University of London Computer Centre (ULCC)
Project Director
Richard Davis
Project Manager
Patricia Sleeman
Contact email
p.sleeman@ulcc.ac.uk
Partner Institutions
Institute of Historical Research, University of London
Project Web URL
http://shard-jisc.blogspot.com/
Programme Name
Digital Preservation Programme
Programme Manager
Neil Grindley
End Date
31st July 2012
Document Information
Author(s)
Patricia Sleeman, Ed Pinsent
Project Role(s)
Project manager
Date
Filename
URL
http://shard-jisc.blogspot.co.uk/
Access
This report is for general dissemination
Document History
Version | Date | Comments
Version 1 | 18/7/2012 |
Version 2 | 25/7/2012 |
Version 3 | 26/7/2012 |
Table of Contents
1 ACKNOWLEDGEMENTS
2 PROJECT SUMMARY
3 MAIN BODY OF REPORT
3.1 PROJECT OUTPUTS AND OUTCOMES
3.1.1 Outcomes of SHARD
3.2 HOW WE WENT ABOUT ACHIEVING THE OUTPUTS
3.2.1 Phases of the project
3.2.2 Phase 1: Legacy data assessment
3.2.3 Stakeholder survey
3.2.4 Analysis of stakeholder survey
3.2.5 Survey of existing training at IHR
3.2.6 Phase 2: Training day and knowledge gathering exercise
3.2.7 Building SHARD training modules
3.2.8 Validation
3.2.9 Development of leaflet
3.2.10 FAQs page
3.2.11 How did you go about achieving your outputs / outcomes?
3.3 WHAT DID YOU LEARN?
3.3.1 Organisation
3.3.2 Resources
3.3.3 Technology
3.3.4 Other issues
3.4 IMMEDIATE IMPACT
3.4.1 What kind of difference has your project made in your institution?
3.4.2 How has the wider community benefitted from your project?
3.4.3 What evidence do you have for this?
3.4.4 How has your project changed the attitudes of your stakeholders?
3.5 FUTURE IMPACT
3.5.1 Who will be impacted?
3.5.2 Are you planning to track this impact? If so, how?
4 CONCLUSIONS
4.1 GENERAL CONCLUSIONS
4.2 CONCLUSIONS RELEVANT TO THE WIDER COMMUNITY
4.3 CONCLUSIONS RELEVANT TO JISC
5 RECOMMENDATIONS
5.1 GENERAL RECOMMENDATIONS
5.2 RECOMMENDATIONS FOR THE WIDER COMMUNITY
5.3 RECOMMENDATIONS FOR JISC
6 IMPLICATIONS FOR THE FUTURE
7 REFERENCES
8 APPENDICES
8.1 APPENDIX 1: FAQ
1 Acknowledgements
This project belongs to the Digital Preservation Programme and was funded by JISC 12/11: Strand
B3: Digital Preservation: Enhancing Capability within HEIs. Partners on the project included the
Institute of Historical Research (IHR). We received a lot of help with the legacy data aspect of the
project from the Centre for Metropolitan History (CMH), based within the Institute of Historical
Research. In addition, the University of London’s University Records Manager & Freedom of
Information (FOI) Officer, Kit Good, gave us invaluable advice regarding legislation which can affect
research data. We must also thank our project partner Jane Winters at IHR, Matt Philipott at IHR for
assistance with Moodle, attendees at the Training Day for their input, Malcolm Raggett (LSE) for the
leaflet design, the DICE and PREPARE projects for their work on the FAQs, and Ed Pinsent (ULCC) as
project supplier.
2 Project Summary
SHARD (preServAtion of Research Data) has built a set of digital preservation training modules
pitched at researchers and others with a non-specialist knowledge of digital archives. We
have integrated these with existing postgraduate research training courses at the Institute of Historical
Research, which are offered on a national basis. The course is embedded in the existing online
training offered on IHR’s History Spot site. We believe that embedding digital preservation
issues into existing methods of data creation will help avoid data loss. SHARD was seen as
necessary because of the amount of digital material being produced by researchers. They produce
research data in a variety of digital formats and sizes, and maintaining access to this valuable resource
in the short and long term is essential to enable reuse and sharing over time. This material is also
vulnerable, both in terms of format and media and because of a lack of intervention.
Researchers need to invest time in looking after their research material or data to ensure long term
access and sustainability. There is already a lot of information about digital
preservation available to read and guide people. However, much of this advice is aimed primarily at
practitioners in digital preservation and not at researchers. We thought it very important to
develop training materials which were stripped of technical language and spoke directly to
researchers in non-specialist terms.
These training materials were developed after an assessment of legacy data and after consulting
and surveying stakeholders. The materials have been designed specifically for ease of
access and pitched at the appropriate level using direct feedback from stakeholders targeted
during the project. The project has involved several stakeholders from within the University of London,
in keeping with one of its goals: to enhance awareness of the issues concerning the preservation of
research data both within the University and across the national HE sector.
We created a knowledge base consisting of assessments and surveys of legacy data, of
researchers and of existing training at IHR. We also delivered face to face training, which we used
as an opportunity to discover researchers’ requirements and needs in relation to the preservation of
their research data. In addition, these online resources are available under an Open
Educational Resources (OER) licence, which means they can be shared, used and adapted by other
educational institutions and made applicable to a range of academic disciplines. The materials have
been validated by IHR's Research Training Group and deposited in JORUM, a repository of learning
and teaching materials.
3 Main Body of Report
3.1 Project Outputs and Outcomes
3.1.1 Outcomes of SHARD
No. | Outcome | Deliverable type or URL
1 | Legacy data assessment | Knowledge base
2 | Stakeholder survey | Knowledge base
3 | Survey of extant training at IHR | Knowledge base
4 | Training day and knowledge gathering exercise | Knowledge base
5 | Online training materials | http://training.historyspot.org.uk/course/view.php?id=59&topic=1
6 | Publicity leaflet | http://training.historyspot.org.uk/mod/resource/view.php?id=835
7 | FAQs | http://training.historyspot.org.uk/mod/page/view.php?id=834
8 | SCORM files for JORUM |
Outcomes 1-4 of SHARD collectively amount to the project's knowledge base. Activities here were
focussed on assessment and survey work. These consisted of:
• Legacy data assessment
• Stakeholder survey and analysis
• Survey of extant training at IHR
• Knowledge gathering from the training day
There were two phases to the project. Phase 1 involved outcomes 1-3, while Phase 2 incorporated
the training day and the development of the online materials as well as the leaflet and FAQs.
The information gathered is held in the knowledge base. This information, once analysed, formed the
basis of outcome 6, the SHARD online training modules.
Outcomes 7 and 8 were additional to what was originally scoped in the project plan. They are the
summation of what SHARD, DICE and PREPARE thought appropriate to introduce digital
preservation to researchers.
3.2 How we went about achieving the outputs
Collaboration between IHR and ULCC had been under consideration for a few years, as we saw a real
need and opportunity to adapt the Digital Preservation Training Programme for research data. The initial
stage of assessment and research was extremely important and was greatly helped by earlier work,
such as the survey of legacy data held at IHR completed by Kevin Ashley. Building
on the existing relationship with IHR, we knew that we had a good sample of legacy data to assess, a
good cohort of researchers to survey, and a well established training scheme in place via History
Spot.
The first phase of the project saw information gathering exercises which populated the knowledge
base. The approach taken to building the digital preservation training materials has
been evidence-based, so building this knowledge base through the initial survey work has been vital
to the success of the project.
We aimed to develop a knowledge base which contains all information gathered in the course of:
• An assessment of legacy research data, both primary and other
• Interviews with stakeholders involved in data creation
• An assessment of existing IHR training frameworks
• Information acquired from face to face training
Legacy data assessment
Assessment of the legacy data demonstrated some gaps in the compilation of documentation and
descriptive information for the data as well as gaps regarding procedures for data sharing within our
own institution.
Survey
The interviews with stakeholders, who were recruited by the IHR through their own contacts as well as
through social networking on Twitter, demonstrated the variety of research material being produced in
the humanities but also, in most cases, the commonality of the problems researchers face. The fact
that the researchers took part in the survey implies a certain sympathy towards data sharing and
preservation; even allowing for this, most researchers wanted to share their data as well as use other
people’s data. What was striking was how some people seemed to lack appreciation of the value of
their own research and even considered deleting it after their PhD was completed. We suspect that
this is due to a lack of institutional support and guidance about research data, its management and its
value beyond the lifespan of the project. The lack of awareness of the value of research
data and the lack of awareness of the risks associated with digital data are key points raised by
researchers in both the surveys and the training days.
Training
The face to face training day provided SHARD with an opportunity to develop and adapt our Digital
Preservation Training Programme for a different audience, with different expectations from those who
traditionally attend our entry level courses on digital preservation.
Throughout the project we have focussed on the needs of the researchers by working with
them, listening to them and adapting our own knowledge to the research environment.
The knowledge base informed us how best to design such a training event and also gave us an idea
of how best to find out from those attending what their needs were. These needs were assessed
by giving attendees opportunities to express them, which allowed us to further refine the ideas we had
about designing our online course. This included avoiding alienating terminology at the outset and
explaining terms when they were introduced. Examples were ‘data’ and ‘metadata’, which people
said they felt were not relevant to them, as they assumed the terms referred to tabular data
and complicated standards.
Online course
Developing the Moodle version of the course allowed us to hone our e-learning skills. A draft of the
syllabus was sent to IHR for review and validation, and once it was accepted we set to work on
storyboarding each course (four in total). These were then passed to the IHR again for validation and
set up in the Moodle environment within History Spot at IHR. The Moodle course is exportable as a
SCORM package, which has been deposited in JORUM.
Leaflet and FAQ
The idea of developing a leaflet arose with IHR and again at the meeting with the PREPARE and DICE
projects in March 2012. At this meeting we realised that much of our research was reaching similar
conclusions about research data and the gaps regarding preservation. We all agreed on the use of
simple language and on how such a leaflet should be organised, as we had all found much the
same things within our own projects. We also thought it would be a good idea to provide an expanded
version of the leaflet in FAQ form, so that any questions researchers had which arose from the
leaflet could be answered in the FAQ. The three projects worked on a wiki hosted at the University of
Cambridge to develop this, with each project taking responsibility for certain questions.
Communication
The SHARD blog was a good place to write up our thinking about the project as we progressed stage
by stage. Twitter was also used to disseminate information and updates about the project.
Our aim was to develop training materials that:
• Deliver the basics, not the complexity, of digital preservation
• Deliver practicalities, not theory, of digital preservation
• Are tailored to the needs of the target audience
• Will have wide applicability to other academic disciplines in Humanities and social sciences
• Can be reused by other HEIs
3.2.1 Phases of the project
The project was divided into two phases. Phase 1 was an investigation period and
involved the building of the knowledge base; Phase 2 was our implementation period, with the
development of our training materials.
3.2.2 Phase 1: Legacy data assessment
Our assessment started with historical research datasets created by our SHARD partner IHR, and in
particular by the Centre for Metropolitan History (CMH), which is based within IHR at UoL. CMH was
established by the IHR in 1988, and is one of the world’s leading centres for the study of the history of
London and other metropolises. It specialises in innovative research projects, covering a wide range
of periods, themes and problems in metropolitan history, publishing the results and data online and in
print. ULCC had surveyed the legacy data held at CMH a few years earlier, so we expected
to have good material available for our SHARD study. The data covers
metropolitan history, looking at health and social issues in London and at comparative studies with
other cities.
Using research data referred to in the 2009 preliminary survey of IHR’s digital research data, ULCC
conducted an assessment using an adapted version of AIDA, a tool for assessing an Institution's
capability to support digital preservation. The tool was adapted to apply to the research data test
corpus. The AIDA toolkit is structured as a set of simple elements, each one describing an aspect of
digital asset management. The assessment is spread over three discrete areas, which tell us what we
need to know about the assets being assessed from an
Organisational, Technology and Resources point of view. As Kenney and McGovern have
suggested, effective digital preservation needs to consider the organizational infrastructure,
technological infrastructure, and resources to succeed 1. Considering these three aspects helps
decision-making as technology evolves, taking account of the resources available. For the legacy data
assessment we did not use the 5 stages assessment model which is part of the AIDA toolkit, as we did
not think it suitable for our purposes.
1 ‘The Five Organizational Stages of Digital Preservation’, Anne R. Kenney & Nancy Y. McGovern, Digital Libraries: A Vision for
the 21st Century: A Festschrift in Honor of Wendy Lougee on the Occasion of her Departure from the University of Michigan
Assessment criteria
We used a reduced, adapted version of the AIDA data assessment tool for this purpose, and it soon
became clear which aspects were useful. The following questions were used to assess the data:
• Reference code?
• Description of data available?
• Is it kept in a repository?
• Type of data produced?
• Any technical specifications
• Instructions/Methodologies used
• Descriptive metadata?
• Technical metadata?
• Administrative metadata?
• Information about the structure of data?
• Data capture information available?
• Variable descriptions
• Additional information which is vital to access and enable comprehension of the data
• Could all files be opened? If not, what were the problems?
• Legal issues
• Are rights clearly indicated? Is retention period of data clearly indicated?
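As an illustration only, the assessment questions above could be recorded per dataset in a simple structure, so that gaps in documentation are easy to list. The sketch below is not the AIDA toolkit or the spreadsheet used by the project; the criterion keys, the class name and the example dataset are all hypothetical, and only a subset of the questions is shown.

```python
from dataclasses import dataclass, field

# Hypothetical short keys for a subset of the assessment questions.
CRITERIA = (
    "reference_code",
    "description_available",
    "kept_in_repository",
    "technical_specifications",
    "descriptive_metadata",
    "technical_metadata",
    "administrative_metadata",
    "rights_clearly_indicated",
)


@dataclass
class DatasetAssessment:
    """Yes/no answers to the assessment questions for one dataset."""
    name: str
    answers: dict = field(default_factory=dict)  # criterion key -> True/False

    def missing(self):
        """Return the criteria that are unanswered or answered 'no'."""
        return [c for c in CRITERIA if not self.answers.get(c, False)]

    def coverage(self):
        """Return the fraction of criteria answered 'yes'."""
        return (len(CRITERIA) - len(self.missing())) / len(CRITERIA)


# Hypothetical example, not one of the CMH datasets.
example = DatasetAssessment(
    name="example-dataset",
    answers={"reference_code": True, "description_available": True},
)
print(example.missing())
print(f"{example.coverage():.0%}")
```

Listing the missing criteria, rather than only a score, keeps the focus on what documentation a data owner would still need to write.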
Outcomes:
a. Organisation of data
Why is organisation important?
Well organised data of any kind will enhance and enable access and usability. This is as important for
the creator of the research data as it is for future scholars who will want to use the data at a later
point in time. The data owner (the person who originally gathered the data) will need to be clear about
how they organised it because, over time, information about the data will fade from memory and what
was once clear becomes hazy. Organisation implies a description of how the data was organised and
where it was kept, with descriptions ranging from the general, high level down to the level of
individual files. As the research data is digital, it is also necessary to record the technical aspects of
the data, from high level software and hardware requirements to the detailed technical metadata for
the different formats used.
What we found
What was apparent after this assessment was the paucity of information about the organisation of the
legacy data. There simply was no guide to how the data was organised or structured. In addition,
contextual information about some of this legacy data was absent. The data was clearly rich in
content and varied in scale and format, but the lack of descriptive information rendered the
data inaccessible. This was deeply frustrating. There was a lack of adequate descriptive, structural
and technical metadata about the data. There were also very often no clear indications about legal
issues. There was very little declaration of ownership or rights regarding the data; copyright and the
right to use the data were not clearly explained, or not referred to at all. 43% of the data assessed had
no description at all, while the remaining 57% had varying degrees of this information. Figure 1.3
indicates the percentage of the legacy data which was undocumented to the degree that it rendered
the data and its context difficult to understand.
Description of research data available?
57% of the data assessed had some form of general introduction to the data as well as descriptions to
help understand and use the research data while the remaining 43% had none.
b. Technology
Why is technology important?
Many people may think that technology is the primary inhibitor to accessing legacy data. This can
often be the case, and it is important to be alert to issues such as fit for purpose storage media,
reliable formats, and obsolete media, but as stated previously the biggest risk to the data is any lack of
contextual information. Within this contextual information, i.e. the documentation about your research
data, the technological requirements for using the data must be stated clearly. So it is important to
write down useful information about the technology used for your research data.
What we found
It was challenging to extract information about the technology used to create the data, because there
was little description of that technology or of what was required to open the data.
Very often there were no data dictionaries or code books for encoded data. There was little evidence
of the use of authority files for naming conventions. Some data is held in formats which are at risk of
soon becoming unreadable, such as old versions of MS Word files. 42% of the data could be opened
easily, 29% could be partially opened and a further 29% could not be opened at all. Less than half the
data could be opened successfully. Within a set of data, some files could be opened but not all, which
obviously undermines the usability of the data.
Could all the data be successfully opened using current software?
42% of the CMH legacy data could be opened easily; some 29% could be opened only partially,
owing to the use of unidentifiable software. The remaining 29% could not be opened at all due to
software issues.
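The arithmetic behind a breakdown of this kind can be sketched as follows. The report gives only percentages, so the per-dataset outcomes below are hypothetical: they assume seven datasets (the number mentioned elsewhere in the assessment) and are chosen purely to show how such a summary might be computed, not to reproduce the project's own figures.

```python
from collections import Counter


def percentage_breakdown(outcomes):
    """Return {category: percentage of all outcomes}, rounded to one decimal."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


# Hypothetical outcomes of trying to open seven datasets with current software.
outcomes = ["opened"] * 3 + ["partial"] * 2 + ["failed"] * 2
breakdown = percentage_breakdown(outcomes)
for category, pct in breakdown.items():
    print(f"{category}: {pct}%")
```

With so few datasets, each one shifts the percentages by about 14 points, which is worth bearing in mind when reading the figures.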
c. Resources
Why are they important?
Resources include time, money, skills, responsibilities and awareness. These play a significant part in
the preservation of research data. If no time is dedicated to informing oneself of the issues and to
organising, planning and managing research data, then the data will be at risk of being lost rather than
preserved.
What we found
There was little consideration of preservation at any level and little about what data should be kept
and how. No responsibility for the data post-project was expressed, with the exception of two
projects which were funded by a research agency. All data management plans should indicate this
and should be available at an institutional level. No digital asset register existed previously, and this
lessened intellectual control over the data as well as the capacity to share the data easily. The
data was kept on a network that was closed to the rest of the University, which made the data
inaccessible to most people and researchers.
d. Type of data
The data consisted of structured data, text, audio and some map data. A variety of software
was used, including dBASE, MS Access, MS Word, MS Excel and Notepad. Out of the 7 datasets
assessed, only 1 described the technical aspects of the data and the software and hardware
requirements for opening and using it. As for methodology, 2 out of 7 explained how the work had
been done. Figure 3.1 indicates the formats in which the legacy data
being assessed was held.
Figure 3.1: Formats of research data found in CMH legacy assessment
At 41%, the largest share of the research data was held in text form, followed by structured data. The
‘other’ category indicates formats which we could not identify, even after using many external sources
of file format information, including PRONOM.
Figure 3.2 shows which software was used to create the legacy data being assessed.
Figure 3.2: Software used to create legacy data
The majority of the material was created in Microsoft applications in database, spreadsheet and text
form, but certain data was held in an Oracle database and in dBASE.
Deliverable: Knowledge base: legacy_data_assessment.xls contains the results for each dataset
being assessed and legacy_data_post_assessment.xls contains a summary of the assessment.
3.2.3 Stakeholder survey
ULCC conducted face to face interviews with a cohort of 10 researchers. These were selected from
IHR staff and teachers, as well as researchers from other institutions within the University of London
and beyond. They reflected a variety of research environments, from the individual
researcher at an early stage of their academic career to academics involved in
established research programmes. This was to help us understand the actual preservation needs
for the sorts of research data that are being created, and to build an evidence base. It also enabled us
to discover the training framework we could integrate with.
Using AIDA to help us structure the questions we divided the interviews into three areas namely:
resources, technology and organisation.
A summary of the questions asked follows:
• Do you manage research data in the course of your work?
• Ownership of research data
• Definition of rights. How are the rights of data owners, including IPR, defined?
• Who has responsibility for managing the data and associated documentation once the project
has terminated?
• Policies for formal documentation of ownership. Are formal documented statements of
responsibility written?
• How do you document your data? (Some responses have interpreted this as meaning
"Descriptive Metadata")
• Have you ever needed to use other people’s data for your research? Conventions or
expectations for citing secondary data (i.e. data produced by others) that the Institution's
researchers utilize in their own research.
• Requirement for Metadata creation. To what degree of detail do you describe the structure
and organization of your data?
• Use of metadata standards
• Method of storage / backup. How is digital data currently stored?
• Location of storage. Where is digital data currently stored?
• Future storage strategy. How do you plan to store digital data in the future?
• Responsibility to share
• Responsibility to limit access. Do you need to regulate access to data (by staff or students)?
• Processes in place to review data for legal issues, for example confidentiality, consent,
commercial use, or patents.
• Processes in place to review data for other compliance issues, including data retention,
preservation, destruction and anonymisation of datasets.
• What are your preservation needs? Understanding and knowledge of stakeholders'
expectations for the preservation of research data.
• How long is it needed? Methods to assess whether research data needs to be preserved.
• Considerations for providing access to data over time
• Written policies for preservation
• Responsibility to preserve
• Right to preserve. Do you have the right to preserve your data?
3.2.4 Analysis of stakeholder survey
Using the data obtained from each researcher for each question, we assessed the results in relation
to the 5 Stages, indicating a stage of maturity or development. Stage 1 is the least developed, Stage
5 is the most developed. These 5 Stages (Acknowledge, Act, Consolidate, Institutionalise and
Externalise) are based on the Cornell University maturity model, originally designed to assess an
Institution’s readiness for digital preservation.
Here is an explanation of each stage.
5 Stages
Stage 1: Inaction. Research data is not preserved at all.
Stage 2: Action is determined locally or unilaterally.
Stage 3: Action is determined informally.
Stage 4: Actions are backed up with documentary evidence of their action.
Stage 5: Actions are carried out on the basis of an agreement with a funder or by a parent policy
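To make the scoring concrete, the sketch below shows one way that stage assignments for a single question could be tallied across interviewees. It is an illustration under stated assumptions rather than the method actually used: the stage labels paraphrase the list above, and the function name and the ten stage assignments are hypothetical.

```python
from collections import Counter

# Stage labels paraphrased from the explanation above.
STAGES = {
    1: "Inaction",
    2: "Action determined locally or unilaterally",
    3: "Action determined informally",
    4: "Actions backed up with documentary evidence",
    5: "Actions based on an agreement with a funder or a parent policy",
}


def tally_stages(assigned):
    """Count respondents per stage, including stages nobody reached."""
    for stage in assigned:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
    counts = Counter(assigned)
    return {stage: counts.get(stage, 0) for stage in STAGES}


# Hypothetical stage assignments for ten interviewees on one question.
responses = [1, 2, 2, 2, 2, 2, 2, 3, 4, 4]
tally = tally_stages(responses)
for stage in sorted(tally, reverse=True):
    print(f"Stage {stage}: {'#' * tally[stage]} ({tally[stage]})")
```

Printing the stages from 5 down to 1 gives a rough text version of a horizontal bar chart, one bar per stage.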
Figure 3.3: Do you manage research data in the course of your work?
(Bar chart of responses by stage, Stage 1 to Stage 5.)
Most researchers were at Stage 2 or 3 with regard to the management of their research data. They
each had their own approach, which was local to their needs. Stage 2 responses did not indicate a
plan as such, with one Stage 2 researcher indicating that ‘the information about the research data was
mostly in his mind and unsystematic. Not good practice in his mind.’ One example of Stage 3 involved
backups once a week to an external hard drive and use of an application called ‘Papers’, which acts
like a library for research data. This researcher also used Dropbox 2. Solutions such as Dropbox are
not suitable for preservation, as their primary purpose is to enable sharing of data and not long term
access. The single researcher at Stage 5 was involved in a well established programme; the data was
regularly sent to the publishers for storage on a dedicated server, in addition to daily backups on the
network drive.
² Dropbox is a free service which enables sharing of data. Dropbox was founded in 2007 by Drew Houston and Arash
Ferdowsi, two MIT students tired of emailing files to themselves to work from more than one computer.
3.4 Why, in your opinion, should we keep research data?
Figure 3.4 shows where those surveyed thought the importance of preservation lay in relation to
availability and access over time. Use and reuse and reinterpretation were all seen by researchers as
good reasons to preserve research data.
‘Frequently the public outputs of a research project are only a limited snapshot of a much wider
body of work. The latter needs to remain available.’ SHARD researcher survey
Documenting research data
When researchers were asked whether and how they documented their data, their responses were as varied
as their approaches. One researcher was at stage 1 but most (6/10) were at stage 2, with very local,
unstandardised approaches to documenting their material.
[Bar chart: number of researchers at each stage, Stage 1 to Stage 5]
3.5 How do you document your data?³
As graph 3.5 shows, most researchers were at stage 2 in our 5 stages, indicating that standards
were not being used to describe or document the research data. Those who were at Stage 4 were at a
programme stage and needed to transfer data to another organisation, so standardised metadata was
essential. The researcher at stage 1 kept all information in their head.
One researcher at stage 2 said ‘…Messily, a couple of Excel spreadsheets with all file names of data.
Has excel file of all files existing in his data, haven’t used it consistently as all stuff I get doesn’t fit in
equal measure into spreadsheet. Not a great programme, but one that I know well.’ Most were at
stage 2 with very inconsistent approaches to documenting the data. It is interesting to note that no one was at stage 5.
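The spreadsheet of file names described by the stage 2 researcher could be produced consistently by a short script rather than by hand. The following is a minimal sketch only, not a tool used or recommended by SHARD; the function name build_inventory, the column names and the blank "description" column are all assumptions made for illustration.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Column names are illustrative assumptions, not a SHARD standard.
FIELDS = ["relative_path", "size_bytes", "modified_utc", "description"]


def build_inventory(data_dir, inventory_csv):
    """Write one CSV row per file found under data_dir; return the row count."""
    root = Path(data_dir)
    out = Path(inventory_csv).resolve()
    rows = []
    for path in sorted(root.rglob("*")):
        # Skip folders, and skip the inventory itself if it lives in data_dir.
        if not path.is_file() or path.resolve() == out:
            continue
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        rows.append({
            "relative_path": path.relative_to(root).as_posix(),
            "size_bytes": stat.st_size,
            "modified_utc": modified.strftime("%Y-%m-%d %H:%M:%S"),
            "description": "",  # left blank for the researcher to fill in
        })
    with open(out, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling build_inventory("my_research_data", "inventory.csv") (both names hypothetical) lists every file once, so the researcher only has to add descriptions rather than maintain the file list manually.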
Use of other research data
Most researchers had used other research data and 50% expressed frustration with the process due
to idiosyncratic abbreviations/descriptions/cataloguing. One researcher noted that she would like to
use a specific set of data but the researcher had closed access to the material which was very
frustrating.
³ The graph shows only 9 people as the question was not put to one of our researchers during our survey.
3.6 Have you ever needed to use other people’s data for your research?
Interestingly most researchers who participated in the survey have used research data created by
other people. 50% of them found it problematic due to a lack of documentation, inconsistencies in
management and descriptions.
‘I was gifted research data in relation to my field by an esteemed scholar. This person used
abbreviations which were unknown to anyone else but scholar and many acronyms used which
had lost their context. No contextual explanation or glossary provided. This made using the
data very difficult and time consuming.’ Researcher, SHARD
To what degree of detail do you describe the structure and organization of your data?
50% of the researchers were at stage 1-2 regarding how they organised their research data. They
expressed the importance of good file structure and well named files or relied on existing file structure.
Few described their data at file level, with 1 being at a stage 5. 6/10 researchers used no standard at
all for describing their data. The remainder used either an in-house thesaurus or Library of Congress
thesaurus terms.
To what degree of detail do you describe the structure and organization of your data?
Those at Stage 1 (2 researchers) were either relying on their memory or were waiting until the end of
the project to document their data. The 3 researchers at stage 2 had written some things about their
research data but thought it was self explanatory and not too urgent. 2 researchers were at stage 3
and 2 at stage 4; these described their data at a more detailed level, with those at stage 4 documenting
their data very well to field level ‘to enable good data transfer’. Again none were at stage 5.
‘Will document it (research data) once handover takes place. Would like to draft a users’
manual.’ Researcher, SHARD
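The tallying of responses against the 5 Stages can be sketched in a few lines. This is an illustration of the counting step only, not the spreadsheet method used for the knowledge base; the function name tally_stages is an assumption, and the list of responses simply restates the counts given above for this question (2, 3, 2, 2 and 0 researchers at Stages 1 to 5).

```python
from collections import Counter

# Stage names as listed in this report for the 5 Stages.
STAGE_NAMES = {
    1: "Acknowledge",
    2: "Act",
    3: "Consolidate",
    4: "Institutionalise",
    5: "Externalise",
}


def tally_stages(assigned_stages):
    """Count how many responses were assessed at each stage (zero if none)."""
    counts = Counter(assigned_stages)
    unknown = set(counts) - set(STAGE_NAMES)
    if unknown:
        raise ValueError(f"Not one of the 5 Stages: {sorted(unknown)}")
    return {stage: counts.get(stage, 0) for stage in STAGE_NAMES}


# One entry per researcher, using the counts reported for this question.
responses = [1, 1, 2, 2, 2, 3, 3, 4, 4]
summary = tally_stages(responses)
for stage, number in summary.items():
    print(f"Stage {stage} ({STAGE_NAMES[stage]}): {number}")
```

Reporting a zero for empty stages, rather than omitting them, keeps gaps such as the absence of any Stage 5 response visible in the results.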
How is your research data currently stored?
[Bar chart: number of responses at each stage, Stage 1 to Stage 5]
3.7 Method of storage / backup. How is your research data currently stored?⁴
Most researchers were at stage 2 in terms of storage. This was reflected by copies being made
locally, on an ad-hoc basis. There were a variety of methods such as external hard drives and backups
saved on CDs and DVDs. Some used Dropbox and CrashPlan⁵, a US-based company. Others used
USB keys for backing up. Those at stage 3 had embarked on systematic storage, some using
network storage provided by their institution. The researchers who were at Stage 5 were in advanced
programmes where continued support of material is well established. None were using a repository.
How long do you need your research data to be kept for?
[Chart categories: Forever/indefinitely; For as long as needed; Lifespan of PhD]
3.8 How long do you need your research data kept (forever?)
⁴ The graph shows 13 responses as a single researcher used a combination of storage approaches and this is expressed in our
graph.
⁵ CrashPlan provides several automatic backup solutions enabling backups to other computers and attached external hard
drives.
Most researchers wanted to keep their research data for as long as possible. 3 thought they would
keep it as long as they were doing their research for the particular project and had no plan or thought
about keeping it any longer than the lifespan of the project.
‘I didn’t know what to do with my data. No incentives provided, I just started panicking that I
would lose all my data and took action.’
Researcher, SHARD project
3.9 Factors inhibiting preservation. What, in your opinion, currently inhibits preservation of research data?
Researchers surveyed considered the lack of time and money to dedicate to the establishment of
adequate preservation mechanisms as the main obstacle. Lack of awareness of the issues came
second which is a recurring theme throughout SHARD. Technology which has traditionally been seen
as the major problem interestingly came third.
3.10 Which three things would improve the preservation of research data?
Here we posed the reverse of the question illustrated in figure 3.9. Surprisingly, the answers did not
exactly mirror the earlier ones but were similar: awareness-raising of the issues came top, good storage
solutions came a close second and training came third.
‘There is a lot of repetition in academia which isn’t necessary. Some academics think they have
climbed up a hill and others should have to climb this hill too.’ SHARD researcher survey

Deliverable: Knowledge base: Survey_requirements_analysis.xls contains the
researchers’ answers to the initial survey; Survey_AIDA assessment.xls contains the results
assessed using the 5 stages maturity model.
3.2.5 Survey of existing training at IHR
We assessed existing postgraduate research training available at IHR. Current provision on History
Spot (see Figure 3.11) offers a number of practical skills (language and palaeography, database
construction and data modelling, Geographic Information Systems etc.) and generic /transferable
skills (presentation and writing); to these skills it would be feasible to add digital preservation skills
and start to embed them in the fabric of research learning. ULCC identified characteristics of IHR
training. These characteristics helped us set up some requirements for developing our training for
SHARD.
3.11: IHR current provision on History Spot
Outcome:
The output of this survey is held in the knowledge base. This was useful for our purposes of designing
a course which would fit into the IHR style. SHARD assessed the material available (primarily the
handbooks as the VLE is in draft form) using the following characteristics.
The characteristics included:

Design
This included identifying the IHR style; the use of a variety of media and resources; the variety and
type of activities, both online and offline. The students are assigned relevant readings.

Content
SHARD identified content presented by IHR for the purposes of a VLE or other resource. In general
there is always a course and module description with reasonably clear indications of required previous
knowledge and the text introduces and builds on concepts throughout. Assessments are also clearly
explained.

Production
Documentation of course indicating target audience and features is clear at start of each course or
resource. Learning objectives are clearly stated at outset. For technical aspects specialist knowledge
is not required and this is clearly stated. Copyright has been clearly indicated for all of the material
under a Creative Commons License. The content is very clearly laid out and easy to follow. At time of
assessment by SHARD we did not see a glossary but a bibliography is included for each resource.

Delivery
There is no demonstration course available, owing to the scale of the courses being delivered. The
courses appear to be self led, with self assessments and embedded tasks such as dummy data and
exercises, but with clear access to help if needed.

Teaching
IHR uses effective strategies for teaching content using the test-teach-test model.

Students
Students’ participation is encouraged throughout the course through tests, and also by open
questions within the handbooks, although these questions are not formally assessed. They act as a
mechanism to get students engaged and thinking about the matter at hand.

Assessment
The resources are self guided and as a result encourage the student to check their understanding
throughout the courses provided.

Deliverable: knowledge base: Survey_of_IHR_training_resources.doc
3.2.6 Phase 2: Training day and knowledge gathering exercise
Phase 2 saw the initiation of the development of training materials in light of the results of the
knowledge base.
A. Output: Face to face training day
Training day
The premise of our training on March 14th 2012 was to lure folk in to speak about their experiences of
preserving research data in the course of their research. The event was called ‘Preservation of
research data: what’s in it for me?’ It was advertised on the main IHR events website as well as on
Twitter and the ULCC events page. 20 researchers attended. We saw it as an opportunity to learn a whole
lot from them and what they needed so we could best plan and design an online course on this for the
History Spot site at IHR. The people attending our training day came from a variety of
research backgrounds and different stages in their career path, which made it a rich day for information
gathering about their needs.
Why did people come to our workshop? People spoke about various drivers which brought them to
us. Experiencing the loss of data seems to sharpen the mind somewhat when it comes to
preservation of data. People also spoke about being 'swamped with data and the information
overload', wanting to take care of the material they had gathered over the years and worried they
might lose it. Language struck a chord with many around the table. A lot of people don't use the word
'data' to describe their research material. The term 'data' is regarded as scientific and as a result
people in the Humanities often feel alienated by the use of the term.
Many people interviewed simply did not remember what permissions they had regarding use of the
primary material they had copied or recorded. They had signed a piece of paper in the library or
archive but didn't remember what it said. As a result they would not be able to share this data in the
future as copyright and usage was not clear.
Four good ideas
Arising from our legacy data study and our survey we gave an overview of four things which we
thought they could all do easily to enhance the preservation of their research data. Here are the main
ideas for each of which we gave practical solutions.
1. Write everything down.
2. Store your data safely.
3. Interventions are needed, the earlier the better!
4. Consider sharing, the why and how.
There was a need to demonstrate to researchers the value of creating good data for its long term use
and reuse to the research community, and also to enhance their reputation. In a group work session,
we felt participants could examine issues and think and share their perspective and we could find out
more about their requirements.
The format of the face-to-face training day followed the outline below in Figure 3.12:
3.12 Training Day outline
Session 1: Introduction to SHARD
Session 2: Setting the scene: ‘Why bother to preserve research data?’
Session 2: Examine case studies of good and bad examples of the preservation of data.
Session 3: Having identified the problem, what do we do? Introduction to general practical advice for
preserving research data.
Session 4: Next steps
Session 5: Feedback and conclusions from the day
The training event was an opportunity to adapt the Digital Preservation Training Programme to the
research data environment. It also gave us an opportunity to work with researchers and gather
information about current practice, what they wanted and what they needed in relation to the
preservation of their data.
Eight opportunities
We provided the group with eight feedback opportunities planted at specific points in each module.
The idea was to ask simple questions and get written results from the audience. These responses
were gathered up and added to a PowerPoint show and displayed back to the class at the end of the
training day to enable further discussion, amendments and review.
The aim of this exercise was to get some actual feedback and requirements which can inform the way
the training modules are built. We had our ideas which we had developed from Phase 1 of the project
but this exercise gave us a very valuable opportunity to find out more through group work. The
opportunities were as follows:
#1 - Why are you here?
#2 - Why bother to keep research data?
#3 - What are the risks of data loss?
#4 - Your examples of good or bad practice
#5 - What do you need from storage?
#6 - What would be the magic desktop application that solves everything?
#7 - Do you do anything to preserve data, and if so what?
#8 - Are you comfortable with sharing data? What are the Pros and Cons?
After the training day we compiled the answers received and grouped similar responses together to
enable analysis.
3.13 Why are you here?
The most common reason for attending the event was a recognition of the importance of starting early
on in their project, to make sure that data was safe. The next most common reason was that they
wanted to try and understand digital content and its preservation more than they already did.
Complicated and ‘alienating’ terminology were mentioned as issues which inhibited understanding.
Many had suffered data loss and were keen not to let this happen again. In addition many expressed
the idea of being overwhelmed by ‘information overload’ as the scale of the data being produced was
becoming difficult to deal with unless managed properly.
Why bother to keep research data?
The responses to this found that first place was shared by future use and reuse, and not wanting to
waste resources spent gathering this research. In second place came the importance of preservation
actions to prevent data loss. In third place came sharing and review.
What are the risks of data loss?
The groups responded that data loss would lead to disorganisation, loss of control, and waste of
resources as the research would need to be redone which would be a waste of time and money.
Examples of good practice
3.14 Examples of good practice
Having intellectual control over research data was the highest scoring example of good practice.
Having intellectual control over research data is very important as otherwise one is unable to retrieve
the research data in a meaningful way. Factors which contribute to this such as good metadata and
well organised data as well as searchable data came next.
3.15 Examples of bad practice
Lack of planning and lack of documentation were the most cited examples of bad practice in
relation to the preservation of research data. The lowest scoring were obscure formats and failure to
make copies of the research.
3.16 Storage Needs
3.17 Features of a magic desktop application that solves everything?
In response to this question there was great interest in the creation of an automated service which
would perform preservation on the research data. One person suggested the need for the automated
extraction of useful metadata from software such as Mendeley⁶ or other research data tools. Interest
was also expressed in automated backups and systems which create data in formats which last a
long time.
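As an illustration of what a very small automated backup step might look like, the sketch below copies a file to a backup folder and confirms that the copy matches the original. It is not a tool built or evaluated by SHARD; the function names, the folder layout and the use of SHA-256 checksums to compare the two files are assumptions chosen for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy source into backup_dir and check that the copy is identical."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also keeps the file's timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target
```

A script like this only covers the copying step. Where the copies are kept, and how many, remains a decision for the researcher or their institution.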
Outcomes:
The responses to these feedback exercises were assessed, clustered together, and entered into the
knowledge base. What was essential to us here was to raise awareness of the issues regarding
preservation and research data but also to ascertain what researchers needed in terms of
functionality. This in turn helped us build our online training modules.
Deliverable: Knowledge base: Workshop_analysis.xls
3.2.7 Building SHARD training modules
We devised practical preservation training modules based on our findings emerging from the
assessments and face to face training day. The materials were designed and evaluated as story
boards as text files prior to input into a Moodle environment. The modules are similar to ones we have
devised for the e-learning version of the DPTP, comprising text, images, links, reading lists, position
papers, online resources, tools, quizzes, and exercises. We aimed for generic preservation modules,
concentrating on practical and achievable approaches.
How the knowledge base helped us
The analysis of our knowledge base was key to helping us develop the training materials.
⁶ Mendeley is a free reference manager and academic social network that can help its users to organize research, collaborate
with others online, and discover the latest research.
The legacy data assessment highlighted how important metadata was for accessibility over time
from high level to low level file descriptions for both descriptive and technical information. So we
decided to ensure that this was a key message of our material and this was reinforced by the training
day. In particular the ‘opportunities’ provided us with evidence of what researchers were looking for in
terms of their requirements e.g. best practice. These examples enabled us to design our Preservation
Practices, along with the results of our legacy data assessment which also indicated that organisation
and descriptions were crucial to long term understanding of research data.
Researchers in both the survey and the face to face training day also highlighted the importance of
metadata as well as keeping lots of copies. Researchers also fed back that they felt alienated by
technical language used very often in relation to digital preservation and we made a concerted effort
in our course to avoid the use of jargon. It is important to remember that terms which seem
commonplace to the digital preservation community are not accessible or have little meaning for other
communities. We decided that if we were to use a term such as "metadata" then we must make it
clear what was meant. Even the term "research data" seems to be an alienating term for people
involved in humanities research as they associated it often with scientific data.
To explain best practice we took a lot of what we learned from the legacy assessment and
demonstrated good and bad practice in relation to the preservation of research data. This was
followed by lists of good and bad practice in relation to descriptive practice and metadata for the data,
documentation for the data and technical considerations for the data.
All these examples were fed into our advice in this course. Both Ed Pinsent and Patricia Sleeman
have had training in how to build a Moodle, and this enabled us to storyboard and design the courses
to easily be implemented in IHR’s existing Moodle environment. Within each course we applied a
pedagogical methodology which we had learned previously, and applied the teach-test idea to the
material. As this is self led material it is entirely up to the student to carry this out but we aim to lead
the student through the material this way.
The material consists of:
1. Introduction to course
An introduction to implementable best practice for ensuring the continued access to and
preservation of your digital research data over time. Starting a research project can be
overwhelming, so it is important to manage your data to ensure you don’t lose it. We aim to
provide an overview of the issues and suggest solutions on how to preserve your research data.
2. Preservation: why bother?
An explanation of why researchers should consider preserving digital research data to ensure
access to the data over time and thus enable sharing and contribution to the body of research. It
was important to convey the risks of not looking after the data. Awareness-raising was mentioned as a key concern in our research.
3. Preservation practices
This course suggests practical steps which can be taken to ensure your research data can
achieve longevity. The course covers data documentation, the creation of metadata, some simple
technical interventions you can perform yourself, and options for safe storage.
4. Sharing and access
This course looks at various access issues which affect research data in relation to preservation.
It is essential to have a clear understanding of what it is you are allowed to do with the research
you have gathered in order to be able to successfully share it. We had learned that many times
researchers were not recording adequately the IPR issues connected with primary or secondary
sources for their data gathered in the course of their research. As a result they did not know what
right they had to share the data in the course of their work and over time. This in turn affects the
rights to their own body of research.
All courses have the same structure which includes the following:
Course Name: Name of course
Scope / outline: Outline of course
Learning Objectives: Stating the aim of this course
Purpose: The purpose of the topic
Topics: Expansion of topic of course
3.2.8 Validation
The training materials have been validated by the IHR's Research Training Group. The outcomes will
include identification of generic preservation components, and components specific to history. This
will help to fine-tune the content and make the training materials directly applicable to a wider range of
students and course subjects.
Deliverable: Online Moodle course: http://training.historyspot.org.uk/course/view.php?id=59&topic=1
3.2.9 Development of leaflet
SHARD joined up to prepare a leaflet and FAQ sheet, in collaboration with PrePARE at Cambridge,
Datasafe at Bristol and DICE at LSE to promote research data and its preservation. The group met at
the LSE in March and shared project experience and collaborated to draft content for a leaflet about
the preservation of research data. Malcolm Raggett of the LSE designed the leaflet and ULCC bought
1,000 copies to be distributed at IHR. A digital version is also available online on the History Spot
website at http://training.historyspot.org.uk/mod/resource/view.php?id=835
Content of leaflet:
The meeting at LSE agreed as a result of our projects’ findings that we should have different sections
under the headings:




Start early
Explain it
Store it safely
Share it
3.18 Image of leaflet
3.2.10 FAQs page
The FAQs are organised under the same headings as the leaflet but with detailed generic descriptions
for various questions under each heading. They were developed in cooperation with the DICE and
PrePARE projects on a shared Wiki hosted at the University of Cambridge. Each project took an equal
share and drafted simple responses to some issues which could act as a first line of support online on
our institutions’ websites. We addressed the main issues which we found were arising most
frequently in the course of our work. See appendix 1 of this report for the FAQs. They are also
available on the History Spot website at http://training.historyspot.org.uk/mod/page/view.php?id=834.
3.2.11 How did you go about achieving your outputs / outcomes?
The aims of the project consisted of:





Delivering the basics, not the complexity, of digital preservation
Delivering practicalities, not theory, of digital preservation
Building courses tailored to the needs of the target audience
Creating course content that has wide applicability to other academic disciplines in Humanities
and social sciences
Creating courses that can be shared and reused by other HEIs
To meet these aims we:




Developed a knowledge base which contains all information gathered, e.g. survey of
researchers, legacy data assessment, and assessment of existing training.
Developed online training and delivery of face-to-face training about preservation of research
data.
Drafted a leaflet and FAQs with other JISC funded projects.
Deposited the completed course modules in JORUM.
The aims and objectives did not alter throughout the project, with the exception of the additional
deliverables of a leaflet and FAQs.
The key aspect of this project was the engagement with the researchers. SHARD aimed to ask good
questions to elicit good answers about current research practice, needs and requirements. The
assessments were phrased in language which was not technical. We wanted to hear what
researchers needed, and not what we thought they should do. Our survey and face-to-face training
were all opportunities to gather information from them. This information, which was assessed through
the use of AIDA, allowed us to reflect back what we had learnt from them in our online Moodle, the
main deliverable of the project.
3.3 What did you learn?
We supply our findings below. The information compiled in our knowledge base was essential for the
analysis of researchers’ requirements and this analysis resulted in the development of our online
training course in the preservation of research data. As stated and explained previously we organised
our findings under the headings of organisation, technology and resources.
3.3.1 Organisation
Organisation in this context reflects how the material is kept and planned and organised. This is a
responsibility of the researcher and without good organisational planning, the research data material
will be less usable.
Organisation of data
One of the most outstanding gaps in the legacy data assessed at IHR was the lack of descriptive
information about the data as well as a lack of data documentation.
Lack of sufficient documentation about the data
We anticipated technical problems associated with old versions of software or particular proprietary
software. This was the case as seen in the legacy data survey but one of the most surprising
problematic issues was the lack of documentation describing the data from high level to file level. This
lack of information thus affected the intellectual control and organisation and inhibited understanding
of the material.
Lack of descriptive information about the data, i.e. lack of metadata
Certain datasets had an introductory text describing them and covering the basic issues such as
name, rights, technical issues and contacts. There were some very good examples of projects
providing guides and introductions to the data, and about half the datasets provided some sort of
documentation. However, this documentation lacked consistency in approach and the remainder of the
data assessed lacked any documentation at all. In addition researchers did not record their
research methodology. By way of contrast, figure 3.19 shows an exemplary introduction to a body of
research data.
3.19 Views of hosts: Reporting the Alien Commodity Trade,
1440-45 database guide, Centre for Metropolitan History (CMH)
At a more detailed level there were many instances of fields which were encoded but with no
explanations of the codes available, thus rendering the data unusable. Half of the datasets assessed
did not include anything explaining where to find these codes or how to use them. In addition, there
was evidence suggesting the use of place name and personal names authority files, but few referred
to the standard used. There was little or no evidence that IPR or any other issues had been
considered. No statements for use or sharing were in evidence with the exception of one dataset. An
obvious exception to this was an oral history project which was very explicit about copyright and
permissions and generally was very well documented. Oral history is traditionally strong on good data
collection and management procedures.
Where this information was missing, the data was very difficult to understand and,
as certain information, e.g. rights information, was not explicitly detailed, it would be difficult for a data
user to know what use could be made of the data. A brief document detailing some codes/technical
specifications and whatever else might be useful for the non-expert would address this. Since the primary
investigator will usually be the authority on the data, arguably all other users could be deemed "non-experts".
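The problem of encoded fields with no explanation of the codes can be checked for mechanically once a list of codes exists. The sketch below is illustrative only: the field name, the codes and their meanings are invented for the example and do not come from any IHR dataset, and the function name undocumented_codes is an assumption.

```python
def undocumented_codes(records, codebook):
    """Return {field: sorted codes} for coded values the codebook does not explain.

    Fields that do not appear in the codebook are treated as free text and ignored.
    """
    missing = {}
    for record in records:
        for field, value in record.items():
            if field not in codebook:
                continue
            if value not in codebook[field]:
                missing.setdefault(field, set()).add(value)
    return {field: sorted(codes) for field, codes in missing.items()}


# Hypothetical codebook: one coded field with two documented codes.
codebook = {"occupation": {"MER": "merchant", "BRO": "broker"}}

# Hypothetical records: the second uses a code that nobody has explained.
records = [
    {"name": "Record A", "occupation": "MER"},
    {"name": "Record B", "occupation": "HST"},
]

print(undocumented_codes(records, codebook))
```

Running a check like this before handing data over shows which codes still need an entry in the brief document described above.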
Lack of information about methodology
The lesson learned here was "Write down and record your research methodology, don't just keep it in
your head." This line of thought is crucial, as it will encourage validation of results and ensure that the
reliability of the data can be verified.
Lack of appropriate institutional guidance
As stated at the beginning of this report, the lack of institutional guidance has forced many
researchers to develop their own strategies regarding managing and organising their research data.
There are many websites providing guidance on these matters such as DCC, or the Essex data
archive. However most researchers interviewed did not demonstrate an awareness of local guidance
being available at their particular institution. Researchers spoke about the alienating language used in
certain data management advice provided. Researchers in the humanities very often do not identify
with the term ‘data’ as they associate it with scientific research. So they assume the advice is not for
them. In addition ‘metadata’ is meaningless to many of the researchers, unless explained to them in
straightforward terms (as we did during our training day).
Institutions should ideally have practical data management plans which are easy to implement with
some quick wins. Such systematic and consistent approaches to data management should be
introduced at an early stage in a research project. Researchers are busy doing research and some
time should be explicitly dedicated in projects to write supporting documentation for the data.
Otherwise sharing and reuse is going to become less of an option as time goes by. Research data
supports a narrative of research and is an invaluable resource to the researcher while the research is
active. A worst case scenario is that once completed, the data is "orphaned", dropped, deleted or
abandoned to fight for its survival. Some data survive to enhance the narrative of further research
projects or validate the research published, while other data are not so lucky.
3.3.2 Resources
In this context resources are defined as time, content, money, training opportunities and
awareness. Many of those surveyed identified a lack of resources and awareness as an inhibitor
to preservation.
Awareness and lack of training
The most pressing issue which we identified as being crucial to convey to researchers in our training
was awareness. Those who participated in the survey and training were to a certain degree by default
aware of some of the issues which affect their data. This was principally due to a brush with data loss
to a lesser or greater degree. Awareness also has implications regarding what researchers actually
consider to be research data in the first place. Many do not think of structured information held on
their PC as research data.
Many identified targeted training as a real need if researchers are to preserve their data.
Reticence gap
The other need expressed was an appreciation of research data as having value. If this is not present
then very often the investment in preservation will not occur.
‘Value your data, don't think "people won't be interested in it". SHARD must overcome the "reticence
gap".’ – Researcher, SHARD
In addition those who are aware of the issue feel that there is very little time or money allowed to them
to put effort into the tasks needed in order to preserve their data.
Researchers are busy doing research and some time should be explicitly dedicated in projects to write
supporting documentation for the data. Once completed, the data is dropped, deleted or abandoned to
fight for its survival. Some make it to enhance the narrative of further research projects or validate the
published research while other data is not so lucky.
Figure 1.11 Factors inhibiting preservation. What, in your opinion, currently inhibits
preservation of research data?
3.20 What are the greatest risks to the preservation of research data?
Researchers surveyed were clear about the most challenging aspect for them as regards
preservation. They viewed the lack of time and money to dedicate to the establishment of adequate
preservation mechanisms as a major problem. Lack of awareness of the issues, a recurring theme
throughout SHARD, came second. Interestingly, technology, which has traditionally been seen as the
major problem, came third.
3.21 What three things would improve the preservation of research data?
The diagrams indicate a need for awareness-raising coupled with good storage and training.
3.3.3 Technology
Technology in this context reflects the technical aspects affecting research data and its creation;
including storage, file formats, technological infrastructure, operating systems etc. A high proportion
of researchers in the survey and training day identified technology as a potential inhibitor to the
preservation of their data. These ideas are expanded on below.
Multiplicity of formats
'...whereas scientific data tends to be large scale, homogenous, numeric, and generated (or
collected/sampled) automatically, humanities data has a tendency to be fuzzy, small scale,
heterogeneous, of varying quality, and transcribed by human researchers, making humanities data
difficult (and different) to deal with computationally.'
Melissa Terras, 'Number Crunching Historians' http://melissaterras.blogspot.com/2012/01/numbercrunching-historians.html
The legacy data we assessed consisted of many file types, including text-based files, some structured
numerical data, some audio, some mapping data, and other unidentified files. With two exceptions,
where the creators noted this in the guide to their data, the applications used for creating or
manipulating the data were not explicitly indicated anywhere.
Advice on best formats
In the survey and the training day researchers expressed concerns about best formats for their data.
Some also raised the issue of the use of proprietary software which locked in the data to the particular
application which was used to create it. Mapping data is an example, with proprietary
systems such as Arcview (see footnote 7) being used.
Lack of technical information
The legacy data was being assessed not with a view to its content but rather to see how well it had been
managed to enable access and sharing over time. How easy would it be to open up this data and
use it again if one was not the data owner or creator? How much time and money would it cost to
recreate these rich data experiences? Researchers should look at their data through the eyes of
someone who had no involvement in its creation and use. This way they will clearly describe it for
further use. During the legacy data assessment, two examples were found which had documents
describing technical information i.e. technical metadata about their operating systems. This enabled
the project team to understand what technology we needed to open and use the data.
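A minimal sketch of how such technical metadata might be recorded alongside a dataset is given below. It is an illustration only, not the approach used by the two projects mentioned above: the function names, the `created_with` notes and the shape of the manifest are all assumptions made for this example.

```python
import hashlib
import platform
from pathlib import Path


def describe_file(path, created_with="unknown"):
    """Collect basic technical metadata for one file."""
    path = Path(path)
    data = path.read_bytes()
    return {
        "name": path.name,
        "extension": path.suffix.lower(),
        "size_bytes": len(data),
        # The checksum lets a later user confirm the file is unchanged.
        "sha256": hashlib.sha256(data).hexdigest(),
        # Free-text note on the application used (supplied by the researcher).
        "created_with": created_with,
    }


def build_manifest(folder, software_notes=None):
    """Describe every file in a folder, plus the operating system in use."""
    software_notes = software_notes or {}
    files = [
        describe_file(p, software_notes.get(p.name, "unknown"))
        for p in sorted(Path(folder).iterdir())
        if p.is_file()
    ]
    return {"operating_system": platform.system(), "files": files}
```

The resulting dictionary could be saved as a small text or JSON file next to the data, so that someone with no involvement in its creation can see which formats and applications are needed to open it.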
Storage
Researchers store their data in many different ways; most back up their data manually onto an external
hard drive, and some use the Cloud. One person spoke of having no backups at all, holding all their
data on their PC. The lack of any institutional support and guidance as regards storage was raised.
Those who were most aware of the problem made lots of backups and occasionally used the Cloud.
None used a repository.
The legacy data we assessed, while safe as they were being backed up regularly by central computer
services at the University of London, were locally stored on a secure network drive and not very
accessible as no one beyond CMH would find them unless they knew about them. They do not
appear on any catalogue or database of holdings. Thus access was immediately extremely limited.
7 Arcview is the entry-level licensing level of ArcGIS Desktop, a geographic information system software product produced by Esri.
Projects which were funded by research bodies were deposited in a repository as this was a
requirement of the funding. This applied to two of the legacy datasets which we assessed.
Do software problems inhibit access to research data?
58% of the CMH legacy data assessed in our legacy data assessment could not be successfully
opened due to software redundancy.
3.22: Software redundancy
Some other lessons
Of course not everything we learned fitted neatly into our three categories of Organisation, Resources
and Technology. The reality is that these three categories can overlap and intertwine. An example of
this is data sharing.
Data sharing lessons
The initial legacy data assessment of the material in the Centre for Metropolitan History based at IHR
raised some interesting learning opportunities. Most data, understandably given its sheer size, is kept
locally and is rarely managed in centralised storage or a repository, unlike the findings or conclusions,
which are usually published, well maintained and stored in a suitable repository, ensuring access to
them over time. Moving the legacy data from IHR to ULCC was challenging at times because of a lack
of written procedures; the size of the data was also an issue. We then decided to use Dropbox,
removing the data once it had been assessed. The experience clarified the need for procedures and
for a client-focussed approach since, to state the obvious, different people and different departments
have different needs. It also reinforced that old adage
‘assume nothing’.
The experience reinforces the need for some basic rules of internal data sharing.
1. Context matters. Different people have differing needs; a one-size-fits-all approach does not work, as
not all data are equal: some are more special or unique than others.
2. Procedures matter. There should be clear procedures and these should be agreed on. The oral
tradition of remembering belongs in a folk archive, not a data archive.
3. Metadata matters. Information about data matters. It matters almost as much as data, as without it
the research data can be very difficult to understand, so there is a need to ensure that this is
maintained and shared as well as the data. Otherwise the data can be meaningless and without
context.
4. Trust. This is difficult to gain and easy to lose and very hard to regain. Personal connections
between different departments such as researchers/academics and IT departments are important.
3.3.4 Other issues
Researchers when asked why they were interested in training about the preservation of their data
gave four responses:
1. They wanted to learn how to get it right from the start of their project/career
They wanted to be certain of what good practice consists of in relation to ensuring the preservation of
their research data. They feared that if they didn’t they would lose organisational control over their
research. Many of these researchers were embarking on their careers and needed awareness raising
and advice on how to preserve their data. It is important to get researchers to practise good habits in
relation to managing their research data early on in their careers.
2. They wanted to understand digital preservation on their terms
They also wanted to understand digital preservation issues better, as they often found advice
alienating and off-putting. They wanted to be able to integrate preservation into the research process,
e.g. through the use of software (such as Mendeley) for managing their research. Many worried about obsolete formats and
wanted advice on best formats for preservation purposes which they could implement themselves.
3. They felt they were suffering from information overload
Many felt they had reached critical mass with the amount of research data they held and, without good
and adequate approaches to ensure they kept their data in accessible form, they were very worried
about losing it. They also found advice about digital preservation difficult to digest.
4. The inevitable result of inaction regarding preservation was data loss
It is to be expected that those who participated in both the survey and the training day were aware of
the risks to their data.
3.4 Immediate Impact
3.4.1 What kind of difference has your project made in your institution?
The project has brought together institutions within the University of London who have synergy in
terms of skills and aims. ULCC have wanted to work with IHR on digital preservation training for
researchers for some time. SHARD has raised awareness of the need for the preservation and
management of research data through a combination of the various approaches of investigation and
face to face training and dissemination of the outputs. The project engaged the Centre for
Metropolitan History (CMH) as well as IHR, Kit Good (University Records Manager & FOI Officer) and
ULCC. The project has enhanced and built on existing connections. ULCC has also
established a connection with the University of London library at Senate House.
3.4.2 How has the wider community benefitted from your project?
The investigations in phase 1 of the project brought us in immediate contact with a cohort of
researchers. Use of social media such as twitter has promoted the work we are doing in relation to the
preservation of research data. The research process for SHARD engaged researchers and
encouraged them to consider the issues we were discussing. The investigations of
phase 1 produced the content for face to face training and the online course. We aimed to
benefit the wider community beyond IHR and UoL through the presence of SHARD on the History
Spot site.
3.4.3 What evidence do you have for this?
The feedback for the face to face training day was good and we had a full attendance for the day.
The publicity of the blog and deliverables such as the leaflet and FAQs have resulted in several
retweets on twitter.
3.4.4 How has your project changed the attitudes of your stakeholders?
Our use of jargon-free language and non-technical terms has served to involve the community, as
opposed to alienating them. This is a good thing as researchers had often expressed frustration and
disinterest in much of the professional literature and tools produced for the preservation and
management of their research material. It is hoped that the language and style of delivery used in
SHARD online has reflected this. Many researchers were very concerned about managing their data
and avoiding data loss but lacked a consistent approach. SHARD provided them with several
opportunities to express their requirements and allowed us to assess what stage the people involved
were at which enabled us to develop appropriate material.
3.5 Future Impact
3.5.1 Who will be impacted?
The online course on the preservation of research data will be freely available at History Spot under
an OER license. IHR’s training has considerable traffic and it has a national reputation for delivering
seminars and training in relation to historical research. The online courses in History Spot will be
strongly linked to the IHR website training section. IHR are also uploading a handbook on online
catalogues, palaeography and digital tools in addition to SHARD and the databases course.
History Spot currently has 706 registered users and it is anticipated that the outreach potential for the
SHARD online course will be the same if not more as additional users register with and avail
themselves of the service. In addition depositing the course content into JORUM will make sure the
courses are accessible to a wide community of researchers and are kept safe over time.
3.5.2 Are you planning to track this impact? If so, how?
We will be tracking the use of the Moodle course online through History Spot.
4 Conclusions
4.1 General conclusions
The greatest risk to your data is you.
Approaches by default
It is admirable how many researchers in the absence of guidance and direction from their institutions
have established a method of preserving their research data. Most are aware to some degree of the
fragility of research data over time and have made degrees of effort to keep it. This has resulted in a
hybrid approach to the safety of their data. A variety of approaches can be seen with use of external
hard drives and the cloud being the most common. Many use applications such as Mendeley or
Papers for the management of their material and perhaps somehow extracting information from these
applications could help in providing good metadata for the long term preservation of this material.
What has become clear throughout this project is how little institutional support or guidance is
available.
Lack of guidance
Researchers are as a result preserving their data in increasingly disparate ways which involve
minimal or no use of standards for description or attention to durable formats over time. This spells
danger for long term access to the body of research being created at the present time. Standards of
description can make sharing easier and access easier over time. Adequate descriptions, at both high
and low levels, descriptive and technical and administrative, will help keep this data for longer periods
of time than if these are absent.
Need for applicable jargon free institutional guidance
Institutional guidance must be available to further this endeavour of preserving research data. This
advice should be user-friendly with clear steps to guide the researcher. It is important to always
remember that researchers on the whole are not interested in the finer points of digital preservation,
rather they are interested in how to keep their material safe and sound and avoid the panic of data
loss. As such we should cater to these real and present needs of a community which as demonstrated
by this project has been doing its best to manage its material by itself.
Admonition of their current approaches serves little purpose as the guidance on how to preserve
research data is not readily accessible or easy to absorb for a non digital archivist/asset manager. In
addition researchers will simply do what is practical and easy to maintain their research over time.
4.2 Conclusions relevant to the wider community
The biggest threat to the preservation of research data is not technology but researchers’ lack of
awareness and their failure to realise that preservation is an issue which needs addressing. Researchers
also find some advice on the preservation of research data alienating and seemingly irrelevant to their
needs.
The SHARD course aims to bridge a gap by providing free online training about preservation in jargon
free language and aims to raise awareness of the issues and practical solutions.
As SHARD discovered, the paucity of descriptive information and poor organisation of research data
results in inaccessible data and the loss of the data from an intellectual perspective. There is also a
big gap in institutional guidance about preservation of research data, although the research
community does have a responsibility to look after its data. It has a responsibility in the course of its
work within an institution to take care of the material it is producing. This is an obligation to future
researchers as well as a legal one. However, as has been discovered, researchers are often unaware of
what they can legally do with their research material once the project has completed. This is due to
poor record keeping of rights information regarding sources gathered and used.
4.3 Conclusions relevant to JISC
It is the case that many professions are only now waking up to the fact that digital assets have an
inherent risk of loss. The research community despite a lot of effort seems to be amongst these
professions. It is important as professionals that we do not lose sight of this and apply our knowledge
as information professionals carefully and appropriately in order not to alienate communities as well
as lose data. People are slow to share failure and stories of data loss unless there is an element of
trust. Working with researchers and their requirements can enable suitable and appropriate guidance
to be designed. We hope this is the case with the SHARD course on the preservation of research data
and that it serves to bridge the gap.
5 Recommendations
5.1 General recommendations
Researchers need to be given more institutional support and advice about preserving their data. They
are operating in a vacuum and as a result approaches to preservation are ad-hoc and inconsistent,
meaning the risk of data loss is high. SHARD encourages researchers to start early, describe their
research data and store it safely. IPR issues are also crucially important to enable sharing and use
over time.
5.2 Recommendations for the wider community
Institutions need to provide clear and relevant support for research data management and
preservation. It is good to understand what researchers are currently doing and what they suggest
their requirements are; after all it is their data. There is a lack of clarity about whether repositories are
capable of holding certain types of research data, and this needs to be considered. There is a need
for further face-to-face training and awareness-raising which is inclusive rather than exclusive, looking at general
issues such as those dealt with by SHARD but also looking at specific issues in more detail.
5.3 Recommendations for JISC
Clear inclusive guidance on the preservation of research data is needed. It would be good to build on
the results and output of SHARD to further research and investigation into other areas which affect
preservation of research data such as citations.
6 Implications for the future

What are the implications of your work for other professionals in the field, for users, or
for the community?
The provision of an open and free course providing practical guidance on the preservation of research
data will benefit not just IHR but the wider HE community involved in the creation of research.

What new development work could be undertaken to build on your work or carry it
further?
It would be interesting to continue development of more in depth training at UoL focussing on aspects
of the SHARD curriculum, such as IPR for researchers. The development of a series of small
publications about these aspects would also be of value, especially if made available as mobile
applications.

Provide information on the sustainability of your project outputs. How are things going
to work now the funding is over?
The e-learning output will be maintained over time as part of IHR History Spot under an OER license.
IHR have committed to maintain this. Materials are being deposited with JORUM. The JISC will
maintain copies of the knowledge base, as they are project outputs.

Provide information (where applicable) on the long term project contact, how your
outputs (e.g. software, Open Source code, toolkits etc.) will be managed, and whether
there is a user community that interested individuals could get involved with.
The knowledge base (consisting of the 3 surveys) will be held by JISC, and the Moodle materials will
be preserved in JORUM under the University of London account with an OER CC license. The
Leaflet and FAQ will be held by IHR on their website. The leaflet has been deposited in the ULCC
publications repository.
7 References
Benefits from the Infrastructure Projects in the JISC Managing Research Data Programme – Neil
Beagrie, 2011
http://www.jisc.ac.uk/media/documents/programmes/mrd/RDM_Benefits_FinalReport-Sept.pdf
Creative Commons website: www.creativecommons.org
Digital Curation Centre’s online data management plan toolkit: http://www.dcc.ac.uk/dmponline
Digital Curation Centre has useful guides on all aspects of data management here:
http://www.dcc.ac.uk/resources/how-guides. Here is a good video from them about managing
research data: http://youtu.be/2JBQS0qKOBU
Data documentation and metadata (University of Edinburgh Information Services):
http://www.ed.ac.uk/schools-departments/information-services/services/research-support/datalibrary/research-data-mgmt/data-mgmt/data-documentation
Digital Preservation Europe: http://www.digitalpreservationeurope.eu/
Documenting your data: create and manage your data: The UK Data Archive: http://www.dataarchive.ac.uk/create-manage/document
Documentation and Metadata, Cambridge University Library:
http://www.lib.cam.ac.uk/dataman/pages/metadata.html
Documentation and Metadata MIT: http://libraries.mit.edu/guides/subjects/datamanagement/metadata.html
First insights into digital preservation of research output: PARSE project: http://www.parseinsight.eu/downloads/PARSE-Insight_D3-5_InterimInsightReport_final.pdf
The Five Organizational Stages of Digital Preservation Anne R. Kenney & Nancy Y. McGovern Digital
Libraries: A Vision for the 21st Century: A Festschrift in Honor of Wendy Lougee on the Occasion of
her Departure from the University of Michigan
Guide to copyright related matters. The National Archives:
http://www.nationalarchives.gov.uk/documents/information-management/copyright-related-rights.pdf
Information Commissioner’s Office: http://www.ico.gov.uk/for_organisations/data_protection.aspx
Keeping research data safe (Phase 1)
http://www.jisc.ac.uk/publications/reports/2008/keepingresearchdatasafe.aspx
Long-term preservation (University of Edinburgh Information Services)
http://www.ed.ac.uk/schools-departments/information-services/services/research-support/datalibrary/research-data-mgmt/data-sharing/preservation
Metadata & Data Documentation, University of Oregon Libraries:
http://libweb.uoregon.edu/datamanagement/metadata.html
Open Source Initiative: www.opensource.org/licenses/index.html
SHARD: http://shard-jisc.blogspot.co.uk/2012/03/research-data-preservation-projects.html
8 Appendices
8.1 Appendix 1: FAQ
What material and data should I preserve?
To enable the use and reuse of research data over time by others it is important to ensure that you
provide documentation which describes the research data as well as the context of its creation as part
of the research project. Technical information about the research data should also be kept to enable
its reuse. If the data is encoded then code details must be kept. So in addition to the core research
material you should provide a clear introduction to the entirety of the research data to enable future
understanding and use.
Documentation such as emails and other material accompanying the core research data may seem
irrelevant but they will all provide important contextualisation of the research project and can be
appraised for relevance. Cambridge University uses terms such as embedded, supported and
catalogue data to describe data which should accompany the research data itself.
Will I lose control over the material if I preserve it?
A significant number of research funders require that data produced in the course of the research they
fund should be made available for other researchers to discover, examine and build upon to allow for
new knowledge to be discovered through use, reuse, comparing data and so on. However you are
responsible for deciding what data is legally obliged to be open or closed according to various pieces
of legislation such as FOI and data protection. This should be stated at time of deposit.
Why shouldn't I just keep my data/material on my hard drive?
Keeping all your research data in one place is not a good idea in general. It is essential not to keep
your research data only on your hard drive, as hard drives inevitably fail and you will lose your data. You
should always back up your data on at least two more devices or systems (ideally a repository) external
to your hard drive.
I have all my data on an external hard drive - do I need to do anything else?
Ensure that your data is well documented and is held on at least two external devices/systems,
ideally including an institutional digital repository.
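For researchers comfortable with a little scripting, the advice in the two answers above (copies on at least two devices or systems beyond the working drive) might be followed along the lines of the sketch below. It is an illustration under stated assumptions rather than a recommended tool: the function names are invented for this example, and the destination folders stand in for whatever external drive, network share or synchronised folder is actually available.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, destinations):
    """Copy one file to each destination folder and verify every copy."""
    source = Path(source)
    if len(destinations) < 2:
        raise ValueError("back up to at least two separate devices or systems")
    expected = sha256_of(source)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        shutil.copy2(source, target)  # copy2 also keeps file timestamps
        if sha256_of(target) != expected:
            raise IOError(f"copy at {target} does not match the original")
        copies.append(target)
    return copies
```

Comparing checksums after copying is a simple way to confirm that each backup is a faithful copy. It does not replace deposit in a repository, which also takes care of description and long-term access.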
Why should I preserve research material?
Researchers from all disciplines accumulate material in the course of their research. Considerable
time, effort and money is spent in this endeavour. The preservation of research data is essential in
order to further research through sharing of the data; to enable validation of results and demonstrate
the process behind the conclusions and results of research.
What is a digital repository?
A digital repository is a system which provides a convenient infrastructure through which to store,
manage, re-use and preserve digital materials. They are used by a variety of communities, may carry
out many different functions, and can take many forms but essentially they are a secure way to keep
data safe and accessible.
What archives/repositories are there for preserving my data?
There is no single UK repository for research data. Instead many are being developed within
universities. The OpenDOAR initiative provides a comprehensive list of open repositories worldwide
and in the UK. Here are some UK wide repositories for specific types of data:



The Archaeology Data Service supports research, learning and teaching with freely available,
high quality and dependable digital resources. It does this by preserving digital data in the
long term, and by promoting and disseminating a broad range of data in archaeology. The
ADS promotes good practice in the use of digital data in archaeology, it provides technical
advice to the research community, and supports the deployment of digital technologies.
The University of Oxford Text Archive develops, collects, catalogues and preserves
electronic literary and linguistic resources for use in Higher Education, in research, teaching
and learning. We also give advice on the creation and use of these resources, and are
involved in the development of standards and infrastructure for electronic language
resources.
The History Data Service (HDS) collects, preserves, and promotes the use of digital
resources, which result from or support historical research, learning and teaching. The History
Data Service is a successor service to AHDS History which from 1996 to March 2008 was
one of the five centres of the Arts and Humanities Data Service.
Can I use my institutional repository for data preservation?
Yes, you should be able to do this, if your institution has an institutional repository which collects
research material. You should enquire of your institution if this is the case.
Can/should I deposit in more than one repository/archive?
No, it should be more than adequate to deposit in one repository but it depends on the service offered
by the specific repository, e.g. does it guarantee that it will maintain access to the data over time?