Project Identifier: SHARD
Version:
Contact: Patricia Sleeman
Date: 21 July 2012

JISC Final Report

Project Information
Project Identifier: To be completed by JISC
Project Title: The Preservation of Historical Research: SHARD
Project Hashtag:
Start Date: 1st November 2011
End Date: 31st July 2012
Lead Institution: University of London Computer Centre (ULCC)
Project Director: Richard Davis
Project Manager: Patricia Sleeman
Contact email: p.sleeman@ulcc.ac.uk
Partner Institutions: Institute of Historical Research, University of London
Project Web URL: http://shard-jisc.blogspot.com/
Programme Name: Digital Preservation Programme
Programme Manager: Neil Grindley

Document Information
Author(s): Patricia Sleeman, Ed Pinsent
Project Role(s): Project manager
Date:
Filename:
URL: http://shard-jisc.blogspot.co.uk/
Access: This report is for general dissemination

Document History
Version 1: 18/7/2012
Version 2: 25/7/2012
Version 3: 26/7/2012
Comments:

Document title: JISC Final Report Template. Last updated: Feb 2011 - v11.0

Table of Contents
1 Acknowledgements
2 Project Summary
3 Main Body of Report
  3.1 Project Outputs and Outcomes
    3.1.1 Outcomes of SHARD
  3.2 How we went about achieving the outputs
    3.2.1 Phases of the project
    3.2.2 Phase 1: Legacy data assessment
    3.2.3 Stakeholder survey
    3.2.4 Analysis of stakeholder survey
    3.2.5 Survey of existing training at IHR
    3.2.6 Phase 2: Training day and knowledge gathering exercise
    3.2.7 Building SHARD training modules
    3.2.8 Validation
    3.2.9 Development of leaflet
    3.2.10 FAQs page
    3.2.11 How did you go about achieving your outputs / outcomes?
  3.3 What did you learn?
    3.3.1 Organisation
    3.3.2 Resources
    3.3.3 Technology
    3.3.4 Other issues
  3.4 Immediate Impact
    3.4.1 What kind of difference has your project made in your institution?
    3.4.2 How has the wider community benefitted from your project?
    3.4.3 What evidence do you have for this?
    3.4.4 How has your project changed the attitudes of your stakeholders?
  3.5 Future Impact
    3.5.1 Who will be impacted?
    3.5.2 Are you planning to track this impact? If so, how?
4 Conclusions
  4.1 General conclusions
  4.2 Conclusions relevant to the wider community
  4.3 Conclusions relevant to JISC
5 Recommendations
  5.1 General recommendations
  5.2 Recommendations for the wider community
  5.3 Recommendations for JISC
6 Implications for the future
7 References
8 Appendices
  8.1 Appendix 1: FAQ

1 Acknowledgements

This project belongs to the Digital Preservation Programme and was funded by JISC 12/11: Strand B3: Digital Preservation: Enhancing Capability within HEIs. Partners on the project included the Institute of Historical Research (IHR). We received a lot of help with the legacy data aspect of the project from the Centre for Metropolitan History (CMH) based within the Institute of Historical Research.
In addition, Kit Good, the University of London’s University Records Manager & Freedom of Information (FOI) Officer, gave us invaluable advice regarding legislation which can affect research data. We must also thank our project partner Jane Winters at IHR, Matt Philipott at IHR for assistance with Moodle, attendees at the Training Day for their input, Malcolm Raggett (LSE) for the leaflet design, the DICE and PREPARE projects for their work on the FAQs, and Ed Pinsent (ULCC) as project supplier.

2 Project Summary

SHARD (preServAtion of Research Data) has built a set of digital preservation training modules pitched at researchers and others with a non-specialist knowledge of digital archives. We have integrated these with existing postgraduate research training courses at the Institute of Historical Research, which are offered on a national basis. The course is embedded in the existing online training offered on IHR’s History Spot site. We believe that embedding digital preservation issues into existing methods of data creation will help avoid data loss.

SHARD was seen as necessary because of the amount of digital material being produced by researchers. They produce research data in a variety of digital formats and sizes, and maintaining access to this valuable resource in the short and long term is essential to enable reuse and sharing over time. This material is also vulnerable, both in terms of format and media and through lack of intervention. Researchers need to invest time in looking after their research material or data to ensure long term access and sustainability.

It would seem that there is already a lot of information about digital preservation available to read and guide people. In fact much of this advice is aimed primarily at practitioners in the area of digital preservation and not at researchers.
We thought it very important to develop training materials which were stripped of technical language and addressed researchers directly in non-specialist language. These training materials were developed after an assessment of legacy data as well as consultation with, and surveying of, stakeholders. The materials have been designed to be easy to access and pitched at the appropriate level, using direct feedback from the stakeholders targeted during the project.

The project has involved several stakeholders from within the University of London. This reflects one goal of the project: to enhance awareness of the issues concerning the preservation of research data both within the University and across the national HE sector. We created a knowledge base drawing on an assessment of legacy data, a survey of researchers and a survey of existing training at IHR. We also delivered face to face training, which we used as an opportunity to discover researchers’ requirements and needs in relation to the preservation of their research data.

In addition, these online materials are available under an Open Educational Resources (OER) licence, which means they can be shared, used and adapted by other educational institutions and made applicable to a range of academic disciplines. The materials have been validated by IHR's Research Training Group and deposited in JORUM, a repository of learning and teaching materials.

3 Main Body of Report

3.1 Project Outputs and Outcomes

3.1.1 Outcomes of SHARD

No.  Outcome                                        Deliverable type or URL
1    Legacy data assessment                         Knowledge base
2    Stakeholder survey                             Knowledge base
3    Survey of extant training at IHR               Knowledge base
4    Training day and knowledge gathering exercise  Knowledge base
5    Online training materials                      http://training.historyspot.org.uk/course/view.php?id=59&topic=1
6    Publicity leaflet                              http://training.historyspot.org.uk/mod/resource/view.php?id=835
7    FAQs                                           http://training.historyspot.org.uk/mod/page/view.php?id=834
8    SCORM files for JORUM

Outcomes 1-4 of SHARD collectively amount to the project's knowledge base. Activities here were focussed on assessment and survey work. These consisted of:
- Legacy data assessment
- Stakeholder survey and analysis
- Survey of extant training at IHR
- Knowledge gathering from the training day

There were two phases to the project. Phase 1 involved outcomes 1-3, while phase 2 incorporated the training day and the development of the online materials as well as the leaflet and FAQs.

The information gathered is held in the knowledge base. This information, once analysed, formed the basis of outcome 6 - the SHARD online training modules. Outcomes 7 and 8 were additional to what was originally scoped in the project plan. They are the summation of what SHARD, DICE and PREPARE thought appropriate to introduce digital preservation to researchers.

3.2 How we went about achieving the outputs

Collaboration between IHR and ULCC had been under consideration for a few years, as we saw a real need and opportunity to adapt the Digital Preservation Training Programme for research data. The initial stage of assessment and research was extremely important and was made possible largely by earlier work, in particular the survey of legacy data held at IHR completed by Kevin Ashley. Building on the existing relationship with IHR, we knew that we had a good sample of legacy data to assess, a good cohort of researchers to survey, and a well established training scheme in place via History Spot.

The first phase of the project consisted of information gathering exercises which populated the knowledge base. Our approach to building the digital preservation training materials has been evidence-based, so building this knowledge base has been vital to the success of the project.
We aimed to develop a knowledge base which contains all information gathered in the course of:
- An assessment of legacy research data, both primary and other
- Interviews with stakeholders involved in data creation
- An assessment of existing IHR training frameworks
- Information acquired from face to face training

Legacy data assessment

Assessment of the legacy data demonstrated some gaps in the compilation of documentation and descriptive information for the data, as well as gaps regarding procedures for data sharing within our own institution.

Survey

The stakeholders were recruited by the IHR through their own contacts and through Twitter. The interviews demonstrated the variety of research material being produced in the humanities, but also how many problems the researchers had in common. The fact that the researchers took part in the survey implies a certain sympathy towards data sharing and preservation; with that caveat, most researchers wanted to share their data as well as use other people's data. What was striking was how some people seemed to lack appreciation for the value of their own research, and even considered deleting it after their PhD was completed. We suspect that this is due to a lack of institutional support and guidance about research data, its management and its value beyond the lifespan of the project. A lack of awareness of the value of research data and a lack of awareness of the risks associated with digital data were key points raised by researchers in both the surveys and the training days.

Training

The face to face training day provided SHARD with an opportunity to develop and adapt our Digital Preservation Training Programme for a different audience, with different expectations from those who traditionally attend our entry level courses on digital preservation.
Throughout the project we have focussed on the needs of the researchers by working with them, listening to them and adapting our own knowledge to the research environment. The knowledge base informed us how best to design such a training event and also gave us an idea of how best to find out from those attending what their needs were. Assessing these needs during the day allowed us to further refine the ideas we had about designing our online course. This included avoiding the use of alienating terminology at the start and explaining what the terms meant. These included words such as ‘data’ and ‘metadata’, which people said they felt were not relevant to them, as they assumed the terms referred to tabular data and complicated standards.

Online course

Developing the Moodle version of the course allowed us to hone our e-learning skills. A draft of the syllabus was sent to IHR for review and validation, and once it was accepted we set to work on storyboarding each course (four in total). These were then submitted to the IHR again for validation and were set into the Moodle environment within History Spot at IHR. The Moodle course is exportable as a SCORM package and has been deposited in JORUM.

Leaflet and FAQ

The idea to develop a leaflet arose with IHR and again at the meeting with the PREPARE and DICE projects in March 2012. At this meeting we realised that our research was reaching similar conclusions about research data and the gaps regarding preservation. We all agreed on the use of simple language and on how such a leaflet should be organised, as we had all found out much the same things within our own projects. We also thought it would be a good idea to provide an expanded version of the leaflet in FAQ form, so that any questions researchers had which arose from the leaflet could be answered in the FAQ.
The three projects worked on a Wiki hosted at the University of Cambridge to develop this, with each project taking responsibility for certain questions.

Communication

The SHARD blog was a good place to write up our thinking about the project as we progressed stage by stage. Twitter was also used to disseminate information and updates about the project.

Our aim was to develop training materials that:
- Deliver the basics, not the complexity, of digital preservation
- Deliver practicalities, not theory, of digital preservation
- Are tailored to the needs of the target audience
- Will have wide applicability to other academic disciplines in Humanities and social sciences
- Can be reused by other HEIs

3.2.1 Phases of the project

For our purposes the project was divided into two phases. Phase 1 was an investigation period and involved the building of the knowledge base. Phase 2 was our implementation period, with the development of our training materials.

3.2.2 Phase 1: Legacy data assessment

Our assessment started with historical research datasets created by our SHARD partner IHR, and in particular the Centre for Metropolitan History (CMH), which is based within IHR at UoL. CMH was established by the IHR in 1988, and is one of the world’s leading centres for the study of the history of London and other metropolises. It specialises in innovative research projects, covering a wide range of periods, themes and problems in metropolitan history, publishing the results and data online and in print.

ULCC had surveyed the legacy data held at CMH a few years earlier, so we expected to have some good material available for our SHARD study. The data covers metropolitan history, looking at health and social issues of London, and comparative studies with other cities.
Using research data referred to in the 2009 preliminary survey of IHR’s digital research data, ULCC conducted an assessment using an adapted version of AIDA, a tool for assessing an Institution's capability to support digital preservation. The tool was adapted to apply to the research data test corpus.

The AIDA toolkit is structured as a set of simple elements, each one describing an aspect of digital asset management. The process is spread over three discrete areas: three components which tell us what we need to know about the assets being assessed from an Organisational, Technology and Resources point of view. As McGovern and Kenney have suggested, effective digital preservation needs to consider the organizational infrastructure, technological infrastructure, and resources to succeed [1]. Considering these three aspects helps decision-making as technology evolves, in light of the resources available. For the legacy data assessment we did not use the 5 stages assessment model which is part of the AIDA toolkit, as we did not think it suitable for our purposes.

[1] ‘The Five Organizational Stages of Digital Preservation’, Anne R. Kenney & Nancy Y. McGovern, in Digital Libraries: A Vision for the 21st Century: A Festschrift in Honor of Wendy Lougee on the Occasion of her Departure from the University of Michigan.

Assessment criteria

We used a reduced, adapted version of the AIDA data assessment tool for this purpose and it soon became clear which aspects were useful. The following questions were used to assess the data:
- Reference code?
- Description of data available?
- Is it kept in a repository?
- Description of data available
- Type of data produced?
- Any technical specifications
- Instructions/Methodologies used
- Descriptive metadata?
- Technical metadata?
- Administrative metadata?
- Information about the structure of data?
- Data capture information available?
- Variable descriptions
- Additional information which is vital to access and enable comprehension of the data
- Could all files be opened? If not, what were the problems?
- Legal issues
- Are rights clearly indicated?
- Is retention period of data clearly indicated?

Outcomes:

a. Organisation of data

Why is organisation important?

Well organised data of any kind will enhance and enable access and usability. This is as important for the creator of the research data as it is for future scholars who will want to use the data. The data owner (the person who originally gathered the data) will need to be clear about how they organised it because, over time, information about the data will fade from memory and what was once clear becomes hazy. Organisation implies a description of how the data was organised and where it was kept, with descriptions running from the general, high level down to the individual file level. As the research data is digital, it is also necessary to record its technical aspects, from the general software/hardware requirements to the detailed technical metadata for the different formats used.

What we found

What was apparent after this assessment was the paucity of information about the organisation of the legacy data. There simply was no guide to how the data was organised or structured. In addition, contextual information about some of this legacy data was absent. The data was clearly rich in content and varied in scale and format, but the lack of descriptive information rendered the data inaccessible. This was deeply frustrating. There was a lack of adequate descriptive, structural and technical metadata about the data. There were also very often no clear indications about legal issues.
There was very little declaration of ownership or rights regarding the data; copyright and the right to use the data were either not clearly explained or not referred to at all. 43% of the data assessed had no description at all, while the remaining 57% had varying degrees of this information. Figure 1.3 indicates the percentage of the legacy data which was undocumented to the degree that it rendered the data and its context difficult to understand.

[Figure: Description of research data available?]

57% of the data assessed had some form of general introduction to the data as well as descriptions to help understand and use the research data, while the remaining 43% had none.

b. Technology

Why is technology important?

Many people may think that technology is the primary inhibitor to accessing legacy data. This can often be the case, and it is important to be alert to issues such as fit for purpose storage media, reliable formats and obsolete media. As stated previously, however, the biggest risk to the data is any lack of contextual information. That contextual information, i.e. the documentation about your research data, must state clearly the technological requirements for using the data. So it is important to write down useful information about the technology used for your research data.

What we found

It was challenging to extract information about the technology used to create the data, because there was little description of that technology or of what was required to open the data. Very often there were no data dictionaries or code books for encoded data. There was little evidence of the use of authority files for naming conventions. Some data is held in formats at risk of becoming unreadable soon, such as old versions of MS Word files. 42% of the data could be opened easily, 29% could be partially opened, and a further 29% could not be opened at all.
Less than half the data could be opened successfully. Within a set of data, some files could be opened but not all, which obviously undermines the usability of the data.

[Figure: Could all the data be successfully opened using current software?]

42% of the CMH legacy data could be opened successfully. Some 29% could only be partially opened, because some of the files used unidentifiable software. The remaining 29% could not be opened at all.

c. Resources

Why are they important?

Resources include time, money, skills, responsibilities and awareness. These play a significant part in the preservation of research data. If no time is dedicated to becoming informed about the issues and to organising, planning and managing research data, then the data will be at risk of being lost rather than preserved.

What we found

There was little consideration of preservation at any level, and little about what data should be kept and how. No responsibility for the data post-project was expressed, with the exception of two projects which were funded by a research agency. All data management plans should indicate this and should be available at an institutional level. No digital asset register existed previously, and this lessened the intellectual control over the data as well as the capacity to share the data easily. The data was kept on a network that was closed to the rest of the University, which made the data inaccessible to most people and researchers.

d. Type of data

The data consisted of structured data, text, audio and some map data. A variety of software was used, including Dbase, MS Access, MS Word, MS Excel and Notepad. Out of the 7 datasets assessed, only 1 described the technical aspects of the data and the software/hardware requirements to open and use it. Only 2 out of 7 described their methodology, that is, how they did what they did. Figure 3.1 indicates the formats in which the legacy data being assessed was held.
[Figure 3.1: Formats of research data found in CMH legacy assessment]

At 41%, the largest share of the research data was held in text form, followed by structured data. The ‘other’ category indicates formats which we could not identify, even after using many external sources of file format information, including PRONOM. Figure 3.2 shows which software was used to create the legacy data being assessed.

[Figure 3.2: Software used to create legacy data]

The majority of the material was created in Microsoft applications in database, spreadsheet and text form, but certain data was held in an Oracle database and Dbase.

Deliverable: Knowledge base: legacy_data_assessment.xls contains the results for each dataset being assessed and legacy_data_post_assessment.xls contains a summary of the assessment.

3.2.3 Stakeholder survey

ULCC conducted face to face interviews with a cohort of 10 researchers. These were selected from IHR staff and teachers, as well as researchers from other institutions within the University of London and beyond. They reflected a variety of research environments, from individual researchers at an early stage of their academic careers to academics involved in established research programmes. This was to help us understand the actual preservation needs for the sorts of research data that are being created, and to build an evidence base. It also enabled us to discover the training framework we could integrate with.

Using AIDA to help us structure the questions, we divided the interviews into three areas: resources, technology and organisation. A summary of the questions asked:
- Do you manage research data in the course of your work?
- Ownership of research data
- Definition of rights. How are the rights of data owners, including IPR, defined?
- Who has responsibility for managing the data and associated documentation once the project has terminated?
- Policies for formal documentation of ownership. Are formal documented statements of responsibility written?
- How do you document your data? (Some responses have interpreted this as meaning "Descriptive Metadata")
- Have you ever needed to use other people’s data for your research?
- Conventions or expectations for citing secondary data (i.e. data produced by others) that the Institution's researchers utilize in their own research.
- Requirement for Metadata creation. To what degree of detail do you describe the structure and organization of your data?
- Use of metadata standards
- Method of storage / backup. How is digital data currently stored?
- Location of storage. Where is digital data currently stored?
- Future storage strategy. How do you plan to store digital data in the future?
- Responsibility to share
- Responsibility to limit access. Do you need to regulate access to data (by staff or students)?
- Processes in place to review data for legal issues, for example confidentiality, consent, commercial use, or patents.
- Processes in place to review data for other compliance issues, including data retention, preservation, destruction and anonymisation of datasets.
- What are your preservation needs?
- Understanding and knowledge of stakeholders' expectations for the preservation of research data. How long is it needed?
- Methods to assess whether research data needs to be preserved.
- Considerations for providing access to data over time
- Written policies for preservation
- Responsibility to preserve
- Right to preserve. Do you have the right to preserve your data?

3.2.4 Analysis of stakeholder survey

Using the data obtained from each researcher for each question, we assessed the results in relation to the 5 Stages, each indicating a stage of maturity or development. Stage 1 is the least developed, Stage 5 is the most developed.
These 5 Stages (Acknowledge, Act, Consolidate, Institutionalise and Externalise) are based on the Cornell University maturity model, originally designed to assess an Institution’s readiness for digital preservation. Here is an explanation of each stage.

5 Stages
Stage 1: Inaction. Research data is not preserved at all.
Stage 2: Action is determined locally or unilaterally.
Stage 3: Action is determined informally.
Stage 4: Actions are backed up with documentary evidence of their action.
Stage 5: Actions are carried out on the basis of an agreement with a funder or by a parent policy.

[Figure 3.3: Do you manage research data in the course of your work? Bar chart of the number of researchers at each of Stages 1 to 5.]

Most researchers were at Stage 2 or 3 with regard to the management of their research data. They each had their own approach, which was local to their needs. Stage 2 responses did not indicate a plan as such, with one stage 2 researcher indicating that ‘the information about the research data was mostly in his mind and unsystematic. Not good practice in his mind.’ One stage 3 researcher made backups once a week to an external hard drive and used an application called ‘Papers’, which acts like a library for research data. They also used Dropbox [2]. Solutions such as Dropbox are not suitable for preservation, as their primary purpose is to enable sharing of data and not long term access. The single researcher at Stage 5 was involved in a well established programme; the data was regularly sent to the publishers for storage on a dedicated server, in addition to daily backups on the network drive.

[2] Dropbox is a free service which enables sharing of data. Dropbox was founded in 2007 by Drew Houston and Arash Ferdowsi, two MIT students tired of emailing files to themselves to work from more than one computer.
Figure 3.4 Why, in your opinion, should we keep research data?

Figure 3.4 shows where those surveyed thought the importance of preservation lay in relation to availability and access over time. Use, reuse and reinterpretation were all seen by researchers as good reasons to preserve research data.

‘Frequently the public outputs of a research project are only a limited snapshot of a much wider body of work. The latter needs to remain available.’ SHARD researcher survey

Documenting research data

When researchers were asked whether and how they documented their data, the responses were as varied as their approaches. One researcher was at Stage 1 but most (6/10) were at Stage 2, with very local, unstandardised approaches to documenting their material.

Figure 3.5 How do you document your data? [3] (bar chart: number of researchers at each of Stages 1 to 5)

As Figure 3.5 shows, most researchers were at Stage 2 in our 5 Stages, indicating that standards were not being used to describe or document the research data. Those who were at Stage 4 were at a programme stage and needed to transfer data to another organisation, so standardised metadata was essential. The researcher at Stage 1 kept all information in their head. One researcher at Stage 2 said ‘…Messily, a couple of Excel spreadsheets with all file names of data. Has excel file of all files existing in his data, haven’t used it consistently as all stuff I get doesn’t fit in equal measure into spreadsheet. Not a great programme, but one that I know well.’ Most were at Stage 2 with very inconsistent approaches to documenting the data. It is interesting to note that no one was at Stage 5.

Use of other research data

Most researchers had used other research data and 50% expressed frustration with the process due to idiosyncratic abbreviations/descriptions/cataloguing.
One researcher noted that she would like to use a specific set of data, but its creator had closed access to the material, which was very frustrating.

[3] The graph shows only 9 people as the question was not put to one of our researchers during our survey.

Figure 3.6 Have you ever needed to use other people’s data for your research?

Interestingly, most researchers who participated in the survey had used research data created by other people. 50% of them found it problematic due to a lack of documentation and inconsistencies in management and descriptions.

‘I was gifted research data in relation to my field by an esteemed scholar. This person used abbreviations which were unknown to anyone else but scholar and many acronyms used which had lost their context. No contextual explanation or glossary provided. This made using the data very difficult and time consuming.’ Researcher, SHARD

To what degree of detail do you describe the structure and organization of your data?

50% of the researchers were at Stage 1-2 regarding how they organised their research data. They expressed the importance of good file structure and well named files, or relied on existing file structure. Few described their data at file level, with 1 being at Stage 5. 6/10 researchers used no standard at all for describing their data. The remainder used either an in-house thesaurus or Library of Congress thesaurus terms.

To what degree of detail do you describe the structure and organization of your data?

Those at Stage 1 (2 researchers) were either relying on their memory or were waiting until the end of the project to document their data. The 3 researchers at Stage 2 had written some things about their research data, but thought it was self-explanatory and not too urgent.
The 2 researchers at Stage 3 and the 2 at Stage 4 described their data at a more detailed level, with those at Stage 4 documenting their data very well, down to field level, ‘to enable good data transfer’. Again none were at Stage 5.

‘Will document it (research data) once handover takes place. Would like to draft a users’ manual.’ Researcher, SHARD

How is your research data currently stored?

Figure 3.7 Method of storage / backup. How is your research data currently stored? [4] (bar chart: number of responses at each of Stages 1 to 5)

Most researchers were at Stage 2 in terms of storage. This was reflected in copies being made locally, on an ad hoc basis. There was a variety of methods, such as external hard drives and backups saved on CDs and DVDs. Some used Dropbox and CrashPlan [5], a US based company. Others used USB keys for backing up. Those at Stage 3 had embarked on systematic storage, some using network storage provided by their institution. The researchers at Stage 5 were in advanced programmes where continued support of material is well established. None were using a repository.

[4] The graph shows 13 responses as a single researcher used a combination of storage approaches and this is expressed in our graph.
[5] CrashPlan provides several automatic backup solutions enabling backups to other computers and attached external hard drives.

How long do you need your research data to be kept for?

Figure 3.8 How long do you need your research data kept (forever?) (chart categories: Forever/indefinitely; For as long as needed; Lifespan of PhD)

Most researchers wanted to keep their research data for as long as possible. 3 thought they would keep it as long as they were doing their research for the particular project and had no plan or thought about keeping it any longer than the lifespan of the project.

‘I didn’t know what to do with my data.
No incentives provided, I just started panicking that I would lose all my data and took action.’ Researcher, SHARD project

Figure 3.9 Factors inhibiting preservation. What, in your opinion, currently inhibits preservation of research data?

Researchers surveyed considered the lack of time and money to dedicate to the establishment of adequate preservation mechanisms as the main obstacle. Lack of awareness of the issues came second, which is a recurring theme throughout SHARD. Technology, which has traditionally been seen as the major problem, interestingly came third.

Figure 3.10 Which three things would improve the preservation of research data?

Here we posed the reverse of the question illustrated in figure 3.9. Surprisingly, the answers did not exactly mirror those to the previous question, although they were similar: awareness-raising of the issues came top, good storage solutions came a close second, and training came third.

‘There is a lot of repetition in academia which isn’t necessary. Some academics think they have climbed up a hill and others should have to climb this hill too.’ SHARD researcher survey

Deliverable: Knowledge base:
Survey_requirements_analysis.xls contains the researchers' answers to the initial survey
Survey_AIDA assessment.xls contains the results assessed using the 5 stages maturity model.

3.2.5 Survey of existing training at IHR

We assessed existing postgraduate research training available at IHR. Current provision on History Spot (see Figure 3.11) offers a number of practical skills (language and palaeography, database construction and data modelling, Geographic Information Systems etc.) and generic/transferable skills (presentation and writing); to these skills it would be feasible to add digital preservation skills and start to embed them in the fabric of research learning. ULCC identified characteristics of IHR training.
These characteristics helped us set up some requirements for developing our training for SHARD.

Figure 3.11: IHR current provision on History Spot

Outcome: The output of this survey is held in the knowledge base. This was useful for our purposes of designing a course which would fit into the IHR style. SHARD assessed the material available (primarily the handbooks, as the VLE is in draft form) using the following characteristics. The characteristics included:

Design
This included identifying the IHR style; the use of a variety of media and resources; the variety and type of activities, both online and offline. The students are assigned relevant readings.

Content
SHARD identified content presented by IHR for the purposes of a VLE or other resource. In general there is always a course and module description with reasonably clear indications of required previous knowledge, and the text introduces and builds on concepts throughout. Assessments are also clearly explained.

Production
Documentation of each course, indicating target audience and features, is clear at the start of each course or resource. Learning objectives are clearly stated at the outset. For technical aspects specialist knowledge is not required and this is clearly stated. Copyright has been clearly indicated for all of the material under a Creative Commons License. The content is very clearly laid out and easy to follow. At the time of assessment by SHARD we did not see a glossary, but a bibliography is included for each resource.

Delivery
There is no demonstration course available, owing to the scale of the courses being delivered. The courses appear to be self-led, with self-assessments and embedded tasks such as dummy data and exercises, but with clear access to help if needed.

Teaching
IHR uses effective strategies for teaching content using the test teach test model.
Students
Students’ participation is encouraged throughout the courses through tests, in the form of open questions within the handbooks which are not formally assessed. These act as a mechanism to get students engaged and thinking about the matter at hand.

Assessment
The resources are self-guided and as a result encourage the student to check their understanding throughout the courses provided.

Deliverable: knowledge base: Survey_of_IHR_training_resources.doc

3.2.6 Phase 2: Training day and knowledge gathering exercise

Phase 2 saw the initiation of the development of training materials in light of the results of the knowledge base.

A. Output: Face to face training day

Training day

The premise of our training on March 14th 2012 was to lure folk in to speak about their experiences of preserving research data in the course of their research. The event was called ‘Preservation of research data: what’s in it for me?’ It was advertised on the main IHR events website as well as on Twitter and the ULCC events page. 20 researchers attended. We saw it as an opportunity to learn a great deal from them about what they needed, so that we could best plan and design an online course on this for the History Spot site at IHR. The people attending our training day came from a variety of research backgrounds and were at different stages in their careers, which made it a rich day for gathering information about their needs.

Why did people come to our workshop?

People spoke about various drivers which brought them to us. Experiencing the loss of data seems to sharpen the mind somewhat when it comes to preservation of data. People also spoke about being 'swamped with data and the information overload', wanting to take care of the material they had gathered over the years and worrying that they might lose it. Language struck a chord with many around the table. A lot of people don't use the word 'data' to describe their research material.
The term 'data' is regarded as scientific and as a result people in the Humanities often feel alienated by the use of the term. Many people interviewed simply did not remember what permissions they had regarding use of the primary material they had copied or recorded. They had signed a piece of paper in the library or archive but didn't remember what it said. As a result they would not be able to share this data in the future, as copyright and usage were not clear.

Four good ideas

Arising from our legacy data study and our survey, we gave an overview of four things which we thought they could all do easily to enhance the preservation of their research data. Here are the main ideas, for each of which we gave practical solutions.

1. Write everything down.
2. Store your data safely.
3. Interventions are needed, the earlier the better!
4. Consider sharing, the why and how.

There was a need to demonstrate to researchers the value of creating good data for its long term use and reuse to the research community, and also to enhance their reputation. We felt that in a group work session participants could examine the issues and share their perspectives, and we could find out more about their requirements.

The format of the face-to-face training day followed the outline below in Figure 3.12:

Figure 3.12 Training Day outline
Session 1: Introduction to SHARD
Session 2: Setting the scene: ‘Why bother to preserve research data?’
Session 2: Examine case studies of good and bad examples of the preservation of data.
Session 3: Having identified the problem, what do we do? Introduction to general practical advice for preserving research data
Session 4: Next steps
Session 5: Feedback and conclusions from the day

The training event was an opportunity to adapt the Digital Preservation Training Programme to the research data environment.
It also gave us an opportunity to work with researchers and gather information about current practice, and about what they wanted and needed in relation to the preservation of their data.

Eight opportunities

We provided the group with eight feedback opportunities planted at specific points in each module. The idea was to ask simple questions and get written results from the audience. These responses were gathered up, added to a PowerPoint show and displayed back to the class at the end of the training day to enable further discussion, amendments and review. The aim of this exercise was to get some actual feedback and requirements which could inform the way the training modules are built. We had our ideas, which we had developed from Phase 1 of the project, but this exercise gave us a very valuable opportunity to find out more through group work. The opportunities were as follows:

#1 - Why are you here?
#2 - Why bother to keep research data?
#3 - What are the risks of data loss?
#4 - Your examples of good or bad practice
#5 - What do you need from storage?
#6 - What would be the magic desktop application that solves everything?
#7 - Do you do anything to preserve data, and if so what?
#8 - Are you comfortable with sharing data? What are the Pros and Cons?

After the training day we compiled the answers received and grouped similar responses together to enable analysis.

Figure 3.13 Why are you here?

The most common reason for attending the event was a recognition of the importance of starting early in a project to make sure that data was safe. The next most common reason was that attendees wanted to understand digital content and its preservation better than they already did. Complicated and ‘alienating’ terminology was mentioned as an issue which inhibited understanding. Many had suffered data loss and were keen not to let this happen again.
In addition many expressed the idea of being overwhelmed by ‘information overload’, as the scale of the data being produced was becoming difficult to deal with unless managed properly.

Why bother to keep research data?

In the responses to this question, first place was shared by future use and reuse, and not wanting to waste the resources spent gathering the research. In second place came the importance of preservation actions to prevent data loss. In third place came sharing and review.

What are the risks of data loss?

The groups responded that data loss would lead to disorganisation, loss of control, and a waste of resources, as the research would need to be redone at a cost of time and money.

Examples of good practice

Figure 3.14 Examples of good practice

Having intellectual control over research data was the highest scoring example of good practice; without it one is unable to retrieve the research data in a meaningful way. Factors which contribute to this, such as good metadata, well organised data and searchable data, came next.

Figure 3.15 Examples of bad practice

Lack of planning and lack of documentation were the most cited examples of bad practice in relation to the preservation of research data. The lowest scoring were obscure formats and failure to make copies of the research.

Figure 3.16 Storage Needs

Figure 3.17 Features of a magic desktop application that solves everything?

In response to this question there was great interest in the creation of an automated service which would perform preservation on the research data, with one person suggesting the need for the automated extraction of useful metadata from software such as Mendeley [6] or other research data tools.
Interest was also expressed in automated backups and in systems which create data in formats which last a long time.

Outcomes: The responses to these feedback exercises were assessed, clustered together, and entered into the knowledge base. What was essential to us here was to raise awareness of the issues regarding preservation and research data, but also to ascertain what researchers needed in terms of functionality. This in turn helped us build our online training modules.

Deliverable: Knowledge base: Workshop_analysis.xls

3.2.7 Building SHARD training modules

We devised practical preservation training modules based on the findings emerging from the assessments and the face to face training day. The materials were designed and evaluated as storyboards, in the form of text files, prior to input into a Moodle environment. The modules are similar to ones we have devised for the e-learning version of the DPTP, comprising text, images, links, reading lists, position papers, online resources, tools, quizzes, and exercises. We aimed for generic preservation modules, concentrating on practical and achievable approaches.

[6] Mendeley is a free reference manager and academic social network that can help its users to organize research, collaborate with others online, and discover the latest research.

How the knowledge base helped us

The analysis of our knowledge base was key to helping us develop the training materials. The legacy data assessment highlighted how important metadata was for accessibility over time, from high level to low level file descriptions, for both descriptive and technical information. So we decided to ensure that this was a key message of our material, and this was reinforced by the training day. In particular the ‘opportunities’ provided us with evidence of what researchers were looking for in terms of their requirements e.g. best practice.
These examples enabled us to design our Preservation Practices, along with the results of our legacy data assessment, which also indicated that organisation and descriptions were crucial to long term understanding of research data. Researchers in both the survey and the face to face training day also highlighted the importance of metadata as well as of keeping lots of copies.

Researchers also fed back that they felt alienated by the technical language very often used in relation to digital preservation, and we made a concerted effort in our course to avoid the use of jargon. It is important to remember that terms which seem commonplace to the digital preservation community are not accessible or have little meaning for other communities. We decided that if we were to use a term such as "metadata" then we must make it clear what was meant. Even the term "research data" seems to be alienating for people involved in humanities research, as they often associate it with scientific data.

To explain best practice we took a lot of what we learned from the legacy assessment and demonstrated good and bad practice in relation to the preservation of research data. This was followed by lists of good and bad practice in relation to descriptive practice and metadata for the data, documentation for the data and technical considerations for the data. All these examples were fed into our advice in this course.

Both Ed Pinsent and Patricia Sleeman have had training in how to build Moodle courses, and this enabled us to storyboard and design the courses so that they could easily be implemented in IHR’s existing Moodle environment. Within each course we applied a pedagogical methodology which we had learned previously, applying the teach-test idea to the material. As this is self-led material it is entirely up to the student to carry this out, but we aim to lead the student through the material this way.

The material consists of: 1.
Introduction to course
An introduction to implementable best practice for ensuring the continued access to and preservation of your digital research data over time. Starting a research project can be overwhelming, so it is important to manage your data to ensure you don’t lose it. We aim to provide an overview of the issues and suggest solutions on how to preserve your research data.

2. Preservation: why bother?
An explanation of why researchers should consider preserving digital research data to ensure access to the data over time and thus enable sharing and contribution to the body of research. It was important to convey the risks of not looking after the data. Awareness-raising was mentioned as a key concern in our research.

3. Preservation practices
This course suggests practical steps which can be taken to ensure your research data can achieve longevity. The course covers data documentation, the creation of metadata, some simple technical interventions you can perform yourself, and options for safe storage.

4. Sharing and access
This course looks at various access issues which affect research data in relation to preservation. It is essential to have a clear understanding of what it is you are allowed to do with the research you have gathered in order to be able to share it successfully. We had learned that researchers were often not adequately recording the IPR issues connected with the primary or secondary sources for the data gathered in the course of their research. As a result they did not know what rights they had to share the data in the course of their work and over time. This in turn affects the rights to their own body of research.
All courses have the same structure, which includes the following:

Course Name: Name of course
Scope / outline: Outline of course
Learning Objectives: Stating the aim of this course
Purpose: The purpose of the topic
Topics: Expansion of topic of course

3.2.8 Validation

The training materials have been validated by the IHR's Research Training Group. The outcomes will include identification of generic preservation components, and components specific to history. This will help to fine-tune the content and make the training materials directly applicable to a wider range of students and course subjects.

Deliverable: Online Moodle course: http://training.historyspot.org.uk/course/view.php?id=59&topic=1

3.2.9 Development of leaflet

SHARD joined up with PrePARE at Cambridge, Datasafe at Bristol and DICE at LSE to prepare a leaflet and FAQ sheet promoting research data and its preservation. The group met at the LSE in March, shared project experience and collaborated to draft content for a leaflet about the preservation of research data. Malcolm Raggett of the LSE designed the leaflet and ULCC bought 1,000 copies to be distributed at IHR. A digital version is also available online on the History Spot website at http://training.historyspot.org.uk/mod/resource/view.php?id=835

Content of leaflet: The meeting at LSE agreed, as a result of our projects’ findings, that we should have different sections under the headings:

Start early
Explain it
Store it safely
Share it

Figure 3.18 Image of leaflet

3.2.10 FAQs page

The FAQs are organised under the same headings as the leaflet but with detailed generic descriptions for various questions under each heading. They were developed in cooperation with the DICE and PrePARE projects on a shared Wiki hosted at the University of Cambridge.
Each project took an equal share and drafted simple responses to some issues which could act as a first line of support online on our institutions’ websites. We addressed the main issues which we found were arising most frequently in the course of our work. See appendix 1 of this report for the FAQs. They are also available on the History Spot website at http://training.historyspot.org.uk/mod/page/view.php?id=834.

3.2.11 How did you go about achieving your outputs / outcomes?

The aims of the project consisted of:

Delivering the basics, not the complexity, of digital preservation
Delivering practicalities, not theory, of digital preservation
Building courses tailored to the needs of the target audience
Creating course content that has wide applicability to other academic disciplines in Humanities and social sciences
Creating courses that can be shared and reused by other HEIs

To meet these aims we:

Developed a knowledge base which contains all information gathered, e.g. survey of researchers, legacy data assessment, and assessment of existing training.
Developed online training and delivery of face-to-face training about preservation of research data.
Drafted a leaflet and FAQs with other JISC funded projects.
Deposited the completed course modules in JORUM.

The aims and objectives did not alter throughout the project, with the exception of the additional deliverables of a leaflet and FAQs.

The key aspect of this project was the engagement with the researchers. SHARD aimed to ask good questions to elicit good answers about current research practice, needs and requirements. The assessments were phrased in language which was not technical. We wanted to hear what researchers needed, and not what we thought they should do. Our survey and face-to-face training were all opportunities to gather information from them.
This information, which was assessed through the use of AIDA, allowed us to reflect back what we had learnt from them in our online Moodle course, the main deliverable of the project.

3.3 What did you learn?

We supply our findings below. The information compiled in our knowledge base was essential for the analysis of researchers’ requirements, and this analysis resulted in the development of our online training course in the preservation of research data. As stated and explained previously, we organised our findings under the headings of organisation, technology and resources.

3.3.1 Organisation

Organisation in this context reflects how the material is kept, planned and organised. This is a responsibility of the researcher, and without good organisational planning the research data will be less usable.

Organisation of data

One of the most outstanding gaps in the legacy data assessed at IHR was the lack of descriptive information about the data as well as a lack of data documentation.

Lack of sufficient documentation about the data

We anticipated technical problems associated with old versions of software or particular proprietary software. This was the case, as seen in the legacy data survey, but one of the most surprising problems was the lack of documentation describing the data from high level to file level. This lack of information affected intellectual control and organisation and inhibited understanding of the material.

Lack of descriptive information about the data, i.e. lack of metadata

Certain datasets had an introductory text describing them and covering the basic issues such as name, rights, technical issues and contacts. There were some very good examples of projects providing guides and introductions to the data, and about half of the data came with some sort of documentation. However, this documentation lacked consistency in approach, and the remainder of the data assessed lacked any documentation about the data.
In addition researchers did not record their research methodology. By way of contrast, figure 3.19 shows an exemplary introduction to a body of research data.

Figure 3.19 Views of hosts: Reporting the Alien Commodity Trade, 1440-45 database guide, Centre for Metropolitan History (CMH)

At a more detailed level there were many instances of fields which were encoded but with no explanations of the codes available, thus rendering the data unusable. Half of the datasets assessed did not include anything explaining where to find these codes or how to use them. In addition, there was evidence suggesting the use of place name and personal name authority files, but few referred to the standard used.

There was little or no evidence that IPR or any other issues had been considered. No statements for use or sharing were in evidence, with one obvious exception: an oral history project which was very explicit about copyright and permissions and generally was very well documented. Oral history is traditionally strong on good data collection and management procedures. Most of the data which did not have this information was very difficult to understand, and as certain information, e.g. rights information, was not explicitly detailed, it would be difficult for a data user to know what use could be made of the data. What is needed is a brief document detailing codes, technical specifications and whatever else might be useful for the non-expert. Since the primary investigator will usually be the authority on the data, arguably all other users could be deemed "non-experts".

Lack of information about methodology

The lesson learned here was "Write down and record your research methodology, don't just keep it in your head." This practice is crucial, as it will encourage validation of results and ensure that the reliability of the data can be verified.
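The problem of encoded fields with no explanation of the codes could be reduced by keeping a small codebook alongside the data and checking the data against it. The following is a minimal sketch in Python, not a tool produced by SHARD. The field names, codes, meanings and sample rows are all invented for illustration and are not taken from any IHR dataset.

```python
import csv
import io

# Hypothetical codebook: every field name, code and meaning below is invented
# for illustration only.
CODEBOOK = {
    "occupation": {"MER": "merchant", "CLK": "clerk", "UNK": "not recorded"},
    "origin": {"FL": "Flanders", "LOM": "Lombardy"},
}

# Hypothetical sample of an encoded dataset.
SAMPLE_ROWS = [
    {"name": "A", "occupation": "MER", "origin": "FL"},
    {"name": "B", "occupation": "WVR", "origin": "LOM"},
]


def undocumented_codes(rows, codebook):
    """Return sorted (field, code) pairs used in the data but absent from the codebook."""
    missing = set()
    for row in rows:
        for field, codes in codebook.items():
            value = row.get(field, "")
            if value and value not in codes:
                missing.add((field, value))
    return sorted(missing)


def write_codebook(codebook, handle):
    """Write the codebook as a plain CSV file that can be stored next to the data."""
    writer = csv.writer(handle)
    writer.writerow(["field", "code", "meaning"])
    for field, codes in sorted(codebook.items()):
        for code, meaning in sorted(codes.items()):
            writer.writerow([field, code, meaning])


if __name__ == "__main__":
    # "WVR" is used in the data but never explained in the codebook.
    print(undocumented_codes(SAMPLE_ROWS, CODEBOOK))  # prints [('occupation', 'WVR')]
    out = io.StringIO()
    write_codebook(CODEBOOK, out)
    print(out.getvalue())
```

A plain CSV codebook was chosen here because it can be opened without specialist software, which suits the non-expert reader described above.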
Lack of appropriate institutional guidance

As stated at the beginning of this report, the lack of institutional guidance has forced many researchers to develop their own strategies for managing and organising their research data. There are many websites providing guidance on these matters, such as DCC or the Essex data archive. However most researchers interviewed did not demonstrate an awareness of local guidance being available at their particular institution. Researchers spoke about the alienating language used in certain data management advice provided. Researchers in the humanities very often do not identify with the term ‘data’ as they associate it with scientific research, so they assume the advice is not for them. In addition ‘metadata’ is meaningless to many of the researchers, unless explained to them in straightforward terms (as we did during our training day).

Institutions should ideally have practical data management plans which are easy to implement, with some quick wins. Such systematic and consistent approaches to data management should be introduced at an early stage in a research project. Researchers are busy doing research and some time should be explicitly dedicated in projects to write supporting documentation for the data. Otherwise sharing and reuse is going to become less of an option as time goes by. Research data supports a narrative of research and is an invaluable resource to the researcher while the research is active. A worst case scenario is that once the research is completed, the data is "orphaned", dropped, deleted or abandoned to fight for its survival. Some data survive to enhance the narrative of further research projects or validate the research published, while other data are not so lucky.

3.3.2 Resources

In this context resources are defined as time, content, money, training opportunities and awareness.
Many of those surveyed identified a lack of resources and awareness as an inhibitor to preservation.

Awareness and lack of training
The most pressing issue which we identified as being crucial to convey to researchers in our training was awareness. Those who participated in the survey and training were, to a certain degree, by default aware of some of the issues which affect their data. This was principally due to a brush with data loss to a lesser or greater degree. Awareness also has implications regarding what researchers actually consider to be research data in the first place. Many do not think of structured information held on their PC as research data. Many expressed a real need for targeted training for researchers to enable preservation.

Reticence gap
The other need expressed was an appreciation of research data as having value. If this is not present then very often the investment in preservation will not occur.

‘Value your data, don't think "people won't be interested in it". SHARD must overcome the "reticence gap".’ – Researcher, SHARD

In addition, those who are aware of the issue feel that very little time or money is allowed to them to put effort into the tasks needed in order to preserve their data. Researchers are busy doing research, and some time should be explicitly dedicated in projects to writing supporting documentation for the data. Once the research is completed, the data is dropped, deleted or abandoned to fight for its survival. Some make it to enhance the narrative of further research projects or validate the published research, while other data is not so lucky.

Figure 1.11 Factors inhibiting preservation. What, in your opinion, currently inhibits preservation of research data?

Figure 3.20: What are the greatest risks to the preservation of research data?

Researchers surveyed were clear about the most challenging aspect for them as regards preservation.
They viewed the lack of time and money to dedicate to the establishment of adequate preservation mechanisms as a major problem. Lack of awareness of the issues came second, which is a recurring theme throughout SHARD. Technology, which has traditionally been seen as the major problem, interestingly came third.

Figure 3.21: What three things would improve the preservation of research data?

The diagrams indicate a need for awareness-raising coupled with good storage and training.

3.3.3 Technology
Technology in this context reflects the technical aspects affecting research data and its creation, including storage, file formats, technological infrastructure, operating systems etc. A high proportion of researchers in the survey and training day identified technology as a potential inhibitor to the preservation of their data. These ideas are expanded on below.

Multiplicity of formats
'...whereas scientific data tends to be large scale, homogenous, numeric, and generated (or collected/sampled) automatically, humanities data has a tendency to be fuzzy, small scale, heterogeneous, of varying quality, and transcribed by human researchers, making humanities data difficult (and different) to deal with computationally.'
Melissa Terras, 'Number Crunching Historians' http://melissaterras.blogspot.com/2012/01/numbercrunching-historians.html

The legacy data we assessed consisted of many file types, including text-based files, some structured numerical data, some audio, some mapping data, and other unidentified files. With two exceptions, where this was noted in the guide to the data, the applications used for creating or manipulating the data were not explicitly indicated anywhere.

Advice on best formats
In the survey and the training day researchers expressed concerns about best formats for their data.
Some also raised the issue of proprietary software which locks the data in to the particular application used to create it. Mapping data is one example, with proprietary systems such as ArcView [7] being used.

Lack of technical information
The legacy data was being assessed not with a view to its content but rather to how well it had been managed to enable access and sharing over time. How easy would it be to open up this data and use it again if one was not the data owner or creator? How much time and money would it cost to recreate these rich data experiences? Researchers should look at their data through the eyes of someone who had no involvement in its creation and use. This way they will clearly describe it for further use. During the legacy data assessment, two examples were found which had documents describing technical information, i.e. technical metadata about their operating systems. This enabled the project team to understand what technology we needed to open and use the data.

Storage
Researchers store their data in many different ways; most back up their data manually onto an external hard drive, and some use the Cloud. One person spoke of having no backups at all, holding all their data on their PC. The lack of any institutional support and guidance as regards storage was raised. Those who were most aware of the problem made lots of backups and occasionally used the Cloud. None used a repository. The legacy data we assessed were safe, as they were being backed up regularly by central computer services at the University of London, but they were stored locally on a secure network drive and were not very accessible, as no one beyond CMH would find them unless they already knew about them. They do not appear on any catalogue or database of holdings. Thus access was immediately extremely limited.

[7] ArcView is the entry-level licensing level of ArcGIS Desktop, a geographic information system software product produced by Esri.
Projects which were funded by research bodies were deposited in a repository, as this was a requirement of the funding. This applied to two of the legacy research datasets which we assessed.

Do software problems inhibit access to research data?
58% of the CMH data examined in our legacy data assessment could not be successfully opened due to software redundancy.

Figure 3.22: Software redundancy

Some other lessons
Of course not everything we learned fitted neatly into our three categories of Organisation, Resources and Technology. The reality is that these three categories can overlap and intertwine. An example of this is data sharing.

Data sharing lessons
The initial legacy data assessment of the material in the Centre for Metropolitan History based at IHR raised some interesting learning opportunities. Most data, understandably due to its sheer size, is kept locally and is rarely managed in centralised storage or a repository, unlike the findings or conclusions, which are usually published, well maintained and stored in a suitable repository, ensuring access to them over time. Moving the legacy data from IHR to ULCC was challenging at times due to the lack of written procedures. The size of the data was an issue, among other things. We then decided to use Dropbox, removing the data once it had been assessed. The experience clarified the need for procedures and a client-focussed approach since, to state the obvious, different people and different departments have different needs. It also reinforced that old adage ‘assume nothing’. The experience reinforces the need for some basic rules of internal data sharing.

1. Context matters. Different people have differing needs. It is not a one size fits all approach, as all data are not equal; some are more special/unique than others.

2. Procedures matter. There should be clear procedures and these should be agreed on.
The oral tradition of remembering belongs in a folk archive, not a data archive.

3. Metadata matters. Information about data matters. It matters almost as much as the data itself, as without it the research data can be very difficult to understand, so there is a need to ensure that it is maintained and shared along with the data. Otherwise the data can be meaningless and without context.

4. Trust. This is difficult to gain, easy to lose and very hard to regain. Personal connections between different departments, such as researchers/academics and IT departments, are important.

3.3.4 Other issues
Researchers, when asked why they were interested in training about the preservation of their data, gave four responses:

1. They wanted to learn how to get it right from the start of their project/career
They wanted to be certain of what good practice consists of in relation to ensuring the preservation of their research data. They feared that if they didn’t they would lose organisational control over their research. Many of these researchers were embarking on their careers and needed awareness raising and advice on how to preserve their data. It is important to get researchers to practise good habits in relation to managing their research data early on in their careers.

2. They wanted to understand digital preservation on their terms
They also wanted to understand digital preservation issues better, as they often found advice alienating and off-putting. They wanted to be able to integrate preservation into the research process, e.g. through the use of software (such as Mendeley) for managing their research. Many worried about obsolete formats and wanted advice on best formats for preservation purposes which they could implement themselves.

3.
They felt they were suffering from information overload
Many felt they had reached critical mass with the amount of research data they held and, without good and adequate approaches to ensure they kept their data in accessible form, they were very worried about losing it. They also found advice about digital preservation difficult to digest.

4. The inevitable result of inaction regarding preservation was data loss
It is to be expected that those who participated in both the survey and the training day were aware of the risks to their data.

3.4 Immediate Impact

3.4.1 What kind of difference has your project made in your institution?
The project has brought together institutions within the University of London which have synergy in terms of skills and aims. ULCC have wanted to work with IHR on digital preservation training for researchers for some time. SHARD has raised awareness of the need for the preservation and management of research data through a combination of investigation, face to face training and dissemination of the outputs. The project engaged the Centre for Metropolitan History (CMH) as well as IHR, Kit Good (University Records Manager & FOI Officer) and ULCC. The project has absolutely enhanced and built on existing connections. ULCC has also established a connection with the University of London library at Senate House.

3.4.2 How has the wider community benefitted from your project?
The investigations in phase 1 of the project brought us into immediate contact with a cohort of researchers. Use of social media such as Twitter has promoted the work we are doing in relation to the preservation of research data. The research process for SHARD engaged researchers and encouraged them to consider the issues we were discussing. The investigations of phase 1 have provided the content for the face to face training and the online course.
We aimed to benefit the wider community beyond IHR and UoL through the presence of SHARD on the History Spot site.

3.4.3 What evidence do you have for this?
The feedback for the face to face training day was good and we had a full attendance for the day. The publicity of the blog and deliverables such as the leaflet and FAQs have resulted in several retweets on Twitter.

3.4.4 How has your project changed the attitudes of your stakeholders?
Our use of jargon-free language and non-technical terms has served to involve the community, as opposed to alienating them. This is a good thing, as researchers had often expressed frustration with, and a lack of interest in, much of the professional literature and tools produced for the preservation and management of their research material. It is hoped that the language and style of delivery used in SHARD online has reflected this. Many researchers were very concerned about managing their data and avoiding data loss but lacked a consistent approach. SHARD provided them with several opportunities to express their requirements and allowed us to assess what stage the people involved were at, which enabled us to develop appropriate material.

3.5 Future Impact

3.5.1 Who will be impacted?
The online course on the preservation of research data will be freely available at History Spot under an OER licence. IHR’s training has considerable traffic and it has a national reputation for delivering seminars and training in relation to historical research. The online courses in History Spot will be strongly linked to the IHR website training section. IHR are also uploading a handbook on online catalogues, palaeography and digital tools in addition to SHARD and the databases course. History Spot currently has 706 registered users and it is anticipated that the outreach potential for the SHARD online course will be the same, if not more, as additional users register with and avail themselves of the service.
In addition, depositing the course content in JORUM will make sure the courses are accessible to a wide community of researchers and are kept safe over time.

3.5.2 Are you planning to track this impact? If so, how?
We will be tracking the use of the Moodle course online through History Spot.

4 Conclusions

4.1 General conclusions
The greatest risk to your data is you.

Approaches by default
It is admirable how many researchers, in the absence of guidance and direction from their institutions, have established a method of preserving their research data. Most are aware to some degree of the fragility of research data over time and have made varying degrees of effort to keep it. This has resulted in a hybrid approach to the safety of their data. A variety of approaches can be seen, with use of external hard drives and the cloud being the most common. Many use applications such as Mendeley or Papers for the management of their material, and perhaps extracting information from these applications could help in providing good metadata for the long term preservation of this material. What has become clear throughout this project is how little institutional support or guidance is available.

Lack of guidance
Researchers are as a result preserving their data in increasingly disparate ways which involve minimal or no use of standards for description or attention to durable formats over time. This spells danger for long term access to the body of research being created at the present time. Standards of description can make sharing and access easier over time. Adequate descriptions, at both high and low levels, whether descriptive, technical or administrative, will help keep this data for longer periods of time than if these are absent.

Need for applicable jargon free institutional guidance
Institutional guidance must be available to further this endeavour of preserving research data.
This advice should be user-friendly, with clear steps to guide the researcher. It is important to always remember that researchers on the whole are not interested in the finer points of digital preservation; rather, they are interested in how to keep their material safe and sound and avoid the panic of data loss. As such we should cater to these real and present needs of a community which, as demonstrated by this project, has been doing its best to manage its material by itself. Admonition of their current approaches serves little purpose, as the guidance on how to preserve research data is not readily accessible or easy to absorb for someone who is not a digital archivist or asset manager. In addition, researchers will simply do what is practical and easy to maintain their research over time.

4.2 Conclusions relevant to the wider community
The biggest threat to the preservation of research data is not technology but researchers’ lack of awareness that preservation is an issue which needs addressing. Researchers also find some advice on the preservation of research data alienating and seemingly irrelevant to their needs. The SHARD course aims to bridge a gap by providing free online training about preservation in jargon free language, and aims to raise awareness of the issues and practical solutions. As SHARD discovered, the paucity of descriptive information and poor organisation of research data result in inaccessible data and the loss of the data from an intellectual perspective. There is also a big gap in institutional guidance about preservation of research data, although the research community does have a responsibility to look after its data. It has a responsibility, in the course of its work within an institution, to take care of the material it is producing. This is an obligation to future researchers as well as a legal one.
However, as has been discovered, researchers are often unaware of what they can legally do with their research material once the project has completed. This is due to poor record keeping of rights information regarding sources gathered and used.

4.3 Conclusions relevant to JISC
Many professions are only now waking up to the fact that digital assets have an inherent risk of loss. The research community, despite a lot of effort, seems to be amongst these professions. It is important that we do not lose sight of this and that we apply our knowledge as information professionals carefully and appropriately, so that we neither alienate communities nor lose data. People are slow to share failure and stories of data loss unless there is an element of trust. Working with researchers and their requirements can enable suitable and appropriate guidance to be designed. We hope this is the case with the SHARD course on the preservation of research data and that it serves to bridge the gap.

5 Recommendations

5.1 General recommendations
Researchers need to be given more institutional support and advice about preserving their data. They are operating in a vacuum and as a result approaches to preservation are ad hoc and inconsistent, meaning the risk of data loss is high. SHARD encourages researchers to start early, describe their research data and store it safely. IPR issues are also crucially important to enable sharing and use over time.

5.2 Recommendations for the wider community
Institutions need to provide clear and relevant support for research data management and preservation. It is good to understand what researchers are currently doing and what they suggest their requirements are; after all, it is their data. There is a lack of clarity about whether repositories are capable of holding certain types of research data, and this needs to be considered.
There is a need for further face-to-face training and awareness raising which is inclusive rather than exclusive, looking at general issues such as those dealt with by SHARD but also at specific issues in more detail.

5.3 Recommendations for JISC
Clear, inclusive guidance on the preservation of research data is needed. It would be good to build on the results and output of SHARD with further research and investigation into other areas which affect preservation of research data, such as citations.

6 Implications for the future

What are the implications of your work for other professionals in the field, for users, or for the community?
The provision of an open and free course providing practical guidance on the preservation of research data will benefit not just IHR but the wider HE community involved in the creation of research.

What new development work could be undertaken to build on your work or carry it further?
It would be interesting to continue development of more in depth training at UoL focussing on aspects of the SHARD curriculum, such as IPR for researchers. The development of a series of small publications about these aspects would also be of value, especially if made available as mobile applications.

Provide information on the sustainability of your project outputs. How are things going to work now the funding is over?
The e-learning output will be maintained over time as part of IHR History Spot under an OER licence. IHR have committed to maintain this. Materials are being deposited with JORUM. JISC will maintain copies of the knowledge base, as they are project outputs.

Provide information (where applicable) on the long term project contact, how your outputs (e.g. software, Open Source code, toolkits etc.)
will be managed, and whether there is a user community that interested individuals could get involved with.
The knowledge base (consisting of the 3 surveys) will be held by JISC, and the Moodle materials will be preserved in JORUM under the University of London account with an OER CC licence. The leaflet and FAQ will be held by IHR on their website. The leaflet has been deposited in the ULCC publications repository.

7 References

Benefits from the Infrastructure Projects in the JISC Managing Research Data Programme, Neil Beagrie, 2011: http://www.jisc.ac.uk/media/documents/programmes/mrd/RDM_Benefits_FinalReport-Sept.pdf
Creative Commons website: www.creativecommons.org
Digital Curation Centre’s online data management plan toolkit: http://www.dcc.ac.uk/dmponline
Digital Curation Centre guides on all aspects of data management: http://www.dcc.ac.uk/resources/how-guides
Digital Curation Centre video about managing research data: http://youtu.be/2JBQS0qKOBU
Data documentation and metadata (University of Edinburgh Information Services): http://www.ed.ac.uk/schools-departments/information-services/services/research-support/datalibrary/research-data-mgmt/data-mgmt/data-documentation
Digital Preservation Europe: http://www.digitalpreservationeurope.eu/
Documenting your data: create and manage your data, the UK Data Archive: http://www.dataarchive.ac.uk/create-manage/document
Documentation and Metadata, Cambridge University Library: http://www.lib.cam.ac.uk/dataman/pages/metadata.html
Documentation and Metadata, MIT: http://libraries.mit.edu/guides/subjects/datamanagement/metadata.html
First insights into digital preservation of research output, PARSE project: http://www.parseinsight.eu/downloads/PARSE-Insight_D3-5_InterimInsightReport_final.pdf
The Five Organizational Stages of Digital Preservation, Anne R. Kenney & Nancy Y.
McGovern. Digital Libraries: A Vision for the 21st Century: A Festschrift in Honor of Wendy Lougee on the Occasion of her Departure from the University of Michigan
Guide to copyright related matters, The National Archives: http://www.nationalarchives.gov.uk/documents/information-management/copyright-related-rights.pdf
Information Commissioner’s Office: http://www.ico.gov.uk/for_organisations/data_protection.aspx
Keeping research data safe (Phase 1): http://www.jisc.ac.uk/publications/reports/2008/keepingresearchdatasafe.aspx
Long-term preservation (University of Edinburgh Information Services): http://www.ed.ac.uk/schools-departments/information-services/services/research-support/datalibrary/research-data-mgmt/data-sharing/preservation
Metadata & Data Documentation, University of Oregon Libraries: http://libweb.uoregon.edu/datamanagement/metadata.html
Open Source Initiative: www.opensource.org/licenses/index.html
SHARD: http://shard-jisc.blogspot.co.uk/2012/03/research-data-preservation-projects.html

8 Appendices

8.1 Appendix 1: FAQ

What material and data should I preserve?
To enable the use and reuse of research data over time by others it is important to ensure that you provide documentation which describes the research data as well as the context of its creation as part of the research project. Technical information about the research data should also be kept to enable its reuse. If the data is encoded then code details must be kept. So in addition to the core research material you should provide a clear introduction to the entirety of the research data to enable future understanding and use. Documentation such as emails and other material accompanying the core research data may seem irrelevant but they will all provide important contextualisation of the research project and can be appraised for relevance.
Cambridge University uses terms such as embedded, supported and catalogue data to describe data which should accompany the research data itself.

Will I lose control over the material if I preserve it?
A significant number of research funders require that data produced in the course of the research they fund should be made available for other researchers to discover, examine and build upon, to allow for new knowledge to be discovered through use, reuse, comparing data and so on. However, you are responsible for deciding what data is legally obliged to be open or closed according to various pieces of legislation such as FOI and data protection. This should be stated at the time of deposit.

Why shouldn't I just keep my data/material on my hard drive?
Keeping all your research data in one place is not a good idea in general. It is essential not to keep your research data only on your hard drive, as inevitably hard drives fail and you will lose your data. You should always back up your data to at least two more devices or systems (ideally including a repository) external to your hard drive.

I have all my data on an external hard drive - do I need to do anything else?
Ensure that your data is well documented and held on at least two external devices/systems, ideally including an institutional digital repository.

Why should I preserve research material?
Researchers from all disciplines accumulate material in the course of their research. Considerable time, effort and money are spent in this endeavour. The preservation of research data is essential in order to further research through sharing of the data, to enable validation of results and to demonstrate the process behind the conclusions and results of research.

What is a digital repository?
A digital repository is a system which provides a convenient infrastructure through which to store, manage, re-use and preserve digital materials.
Repositories are used by a variety of communities, may carry out many different functions, and can take many forms, but essentially they are a secure way to keep data safe and accessible.

What archives/repositories are there for preserving my data?
There is no single UK repository for research data. Instead many are being developed within universities. The OpenDOAR initiative provides a comprehensive list of open repositories worldwide and in the UK. Here are some UK wide repositories for specific types of data:

The Archaeology Data Service supports research, learning and teaching with freely available, high quality and dependable digital resources. It does this by preserving digital data in the long term, and by promoting and disseminating a broad range of data in archaeology. The ADS promotes good practice in the use of digital data in archaeology, provides technical advice to the research community, and supports the deployment of digital technologies.

The University of Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. It also gives advice on the creation and use of these resources, and is involved in the development of standards and infrastructure for electronic language resources.

The History Data Service (HDS) collects, preserves, and promotes the use of digital resources which result from or support historical research, learning and teaching. The History Data Service is a successor service to AHDS History, which from 1996 to March 2008 was one of the five centres of the Arts and Humanities Data Service.

Can I use my institutional repository for data preservation?
Yes, you should be able to do this if your institution has an institutional repository which collects research material. You should ask your institution whether this is the case.
Can/should I deposit in more than one repository/archive?
No, it should be more than adequate to deposit in one repository but it depends on the service offered by the specific repository, e.g. does it guarantee that it will maintain access to the data over time?