AERIFinalReport - DSpace at The University of Texas at Austin

From the Server to the Virtual Machine: Archiving the AERI 2013 Website for Preservation in a Trusted Digital Repository

Nicole Feldman
Lauren Gaylord
Amye McCarther
Jim Rizkalla

INF392K: Digital Archiving and Preservation
Dr. Patricia Galloway
April 30, 2014

TABLE OF CONTENTS

INTRODUCTION
    OUR TASK
    OUR ROLES
PROJECT DESCRIPTION
    ADMINISTRATIVE HISTORY
    SCOPE AND CONTENTS
PROJECT GOAL & COLLECTION ASSESSMENT
SECTION I: PRESERVATION METHODS
    AERI 2013 CONFERENCE WEBSITE
    WEBSITE OUTLINKS AND WIDGETS
        Twitter
        Google Maps
        Outlinks
        Flickr
    SUPPORTING DOCUMENTS
        Introduction
        Sarah Buchanan
        Lorrie Dong
        Patricia Galloway
        Jane Gruning
        Sarah Kim
        Virginia Luehrsen
        Katie Pierce Meyer
SECTION II: DEVELOPMENT OF ARRANGEMENT
SECTION III: INGEST OF MATERIALS AND METADATA
DISCUSSION & CONCLUSION
    LESSONS AND RECOMMENDATIONS
    CONCLUSIONS FOR FUTURE WEB ARCHIVING PROJECTS
APPENDICES
    APPENDIX 1: INVENTORY OF WEBSITE FILE DIRECTORY
    APPENDIX 2: JQUERY SCRIPT TO OUTPUT #AERI2013 TWEETS
    APPENDIX 3: ARCHIVE READY REPORT
    APPENDIX 4: BASECAMP DOCUMENTATION

INTRODUCTION

Our Task

This project was tasked with archiving the AERI 2013 Conference Website, as well as the happenings of the conference event, itself.
The 2013 conference was hosted at the University of Texas at Austin and ran from June 17 to 21, 2013. Its website, logo, and program were designed by faculty and doctoral students at UT’s School of Information. Planning for the conference began in the fall of 2012, and the initial website was established on November 28, 2012, as was the website’s email address. The four project members responsible for archiving the website were students in Dr. Patricia Galloway’s INF392K: Digital Archiving and Preservation course. These students were charged both with archiving the website as it existed on the School of Information’s server and with capturing attendant documentation generated during the creation of the website and over the course of the conference.

Our Roles

Lauren Gaylord acted as our community administrator and was chiefly responsible for managing the workflow of our project and ensuring that materials were properly ingested into the repository. Amye McCarther served as the group’s metadata specialist, managed this aspect of our project workflow, and created all the documentation needed for a successful batch ingest. Nicole Feldman researched and gathered the widget materials and ran the script to collect conference tweets. She also took the lead on documenting the group’s process. Jim Rizkalla was initially in charge of collecting emails, but as the scope of our project changed he established contact with many of the creators and technical staff and set up meetings to gather information and documents.

PROJECT DESCRIPTION

The project’s primary focus was to capture any information relating to the AERI 2013 website. The first step was an initial inventory of the various components comprising the website, including any widgets and outgoing links. Once we had gained a general sense of the website’s contents and functionality, we contacted the students and faculty members identified as contributors to the website and the conference.
Administrative History

AERI (Archival Education and Research Institutes) is a series of annual week-long conferences funded by a four-year grant from the Institute of Museum and Library Services (IMLS) that began in 2009 with the intention of strengthening education and research and supporting academic cohort-building and mentoring. Goals of AERI include the fostering of curriculum development and collaborative work. A key part of AERI’s mission is the encouragement of diversity amongst doctoral candidates, supported by minority scholarships for undergraduate and graduate students interested in pursuing doctoral studies in Archival Studies, fellowships for current doctoral students, and mentorship opportunities.

AERI 2013 was held at the University of Texas at Austin from June 17 to 21, 2013. The AERI 2013 website and logo were designed by doctoral candidates at UT’s School of Information, and supporting materials were supplied by faculty and other students. Previous institutes were held at UCLA, the University of Michigan, and Simmons College. The next AERI institute is scheduled to be hosted at the University of Pittsburgh in July of 2014. The 2015 conference will be hosted at the University of Maryland, College Park.

The accompanying AERI 2013 website was initially designed and implemented by PhD student Virginia Luehrsen. It was established on November 28, 2012 at http://www.ischool.utexas.edu/aeri2013/ and contained application forms, scholarship forms, and Travel and Accommodations information. Edits were made to the website in December 2012. After registration for AERI 2013 concluded on April 15, 2013, the planning team posted participants’ bios, the preliminary schedule, an Explore Austin page, and an Information for Presenters page to the website.
Scope and Contents

The AERI 2013 Website community documents the website maintained by the University of Texas at Austin School of Information for the 2013 Archival Education and Research Institute (AERI). The community contains HTML code, PDFs, JPGs, GIFs, PNGs, mp4 files, docx and doc files, css files, kml files, text files, tar files, and a virtual machine disk file. The materials in this community were collected and created by Nicole Feldman, Lauren Gaylord, Amye McCarther, and Jim Rizkalla as part of a class project for Digital Archiving and Preservation (INF 392K) taught by Dr. Patricia Galloway in Spring 2014.

The website subcommunity contains the underlying directory structure and encoded representation of the AERI 2013 website. This subcommunity includes a rendering of the pages of the AERI 2013 website in DSpace’s HTML support, screen captures of the AERI 2013 website, a virtualization of the AERI website in VirtualBox, and a zipped .tar file, created with the Linux rsync command, that captures the directory structure of the site as it lives on its hosting server.

The supporting documentation subcommunity contains materials created by members of the AERI 2013 planning team in the course of conference and website development. Materials include 7-Zip files preserving directory structure as well as JPG and GIF logo files. Within the 7-Zip creator files are MSWord documents, MSExcel spreadsheets, PowerPoint presentations, MSAccess databases, PDFs, PNGs, JPGs, GIFs, TIFs, HTML files, CSS files, and TXT files, as well as materials generated by Adobe Photoshop, Illustrator, and Dreamweaver, including EPS, PSD, AI, and MNO files.

The website project documentation subcommunity contains the materials generated by the INF 392K team regarding the archiving of the website. Files include a sample spreadsheet used for batch ingest, the final report, a data management plan, the SIP Agreement, and a group presentation.
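As an illustration of how a directory capture like the zipped .tar file described above preserves structure, the following is a minimal sketch using Python's standard tarfile module. It is not the procedure the team followed (the report describes rsync and a .tar created on the server); the function name and the paths passed to it are hypothetical.

```python
import tarfile
from pathlib import Path


def archive_directory(source_dir: str, archive_path: str) -> list[str]:
    """Pack source_dir into a gzip-compressed tar and return its member names.

    Hypothetical helper for illustration; archive_path should live outside
    source_dir so the archive does not try to include itself.
    """
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the site root rather than absolute
        # paths, so the directory tree can be restored anywhere.
        tar.add(source, arcname=source.name)
    # Reopen the archive to list what was actually stored.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Listing the members after writing gives a simple inventory that can be compared against the directory on the server, in the spirit of the file inventory in Appendix 1.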
PROJECT GOAL & COLLECTION ASSESSMENT

Web archiving has traditionally been conducted using web crawler technologies. However, the code underlying the AERI 2013 Website lives entirely on an iSchool server, which presented us with the opportunity to capture the look, feel, and underlying technical architecture of the AERI 2013 Website entirely from the back end. The website was coded exclusively in HTML and CSS, and capturing the directory structure did not require extensive programming knowledge. We ultimately chose to use the Linux rsync command to create a preservation-quality .tar file of the website, to display the look and feel of the website through the HTML support available in DSpace, and to create a virtualization of the website in VirtualBox in order to extend access to the site further.

The AERI 2013 Website also contains several outlinks and widgets not hosted by a local server, and this added a level of complexity to our project. Our group chose to exclude the outlinks on the page (excepting the professional websites of conference participants), feeling that these items were largely ancillary to the website and the conference event. Widgets seemed more crucial to providing a portrait of the website and the conference event, and we decided to document the Google Maps, tweets, and Flickr page established for the event in a manner that would outlive the inevitably limited lifespan of these proprietary web applications.

Lastly, the website was designed and the conference was planned by faculty members and doctoral students at UT’s iSchool, many of whom are still based in Austin and retained their personal planning documents. Since this supporting documentation was readily available to be securely captured by members of our group and provides users with a richer sense of the website and the event, our group decided it would be instructive to include these materials as well.
Items with confidentiality issues, such as personal emails or registration lists, as well as very early draft stages of planning materials, were deemed beyond the scope of our collecting focus. In the end, our group chose to set up separate collections by creator within a supporting documents subcommunity and to ingest these supplementary AERI website materials in the order in which they were received.

SECTION I: PRESERVATION METHODS

AERI 2013 Conference Website

Using the Linux rsync command to capture the entire file directory of the AERI 2013 Conference Website enabled our group to archive and display the website in a variety of ways. We created a .tar file to preserve both the directory structure and all of the public_html files of the AERI Website in a single zipped file. Our first attempt to create the .tar failed due to a lack of server space allotted to Lauren Gaylord’s iSchool account. After being temporarily granted more server space by the iSchool webmaster, we successfully created a .tar file and ingested it into DSpace. We also installed the open-source Windows utility 7-Zip onto our workstation and were able to extract all of the component files from the .tar. We saved these files to a folder on the desktop of our workstation.

DSpace has built-in HTML support, which allows websites to be hosted and encoded within the trusted digital repository via an internal server. We ingested the AERI 2013 Website index.html as our primary bitstream, and then manually ingested all of the component files and pages as subsequent bitstreams linked to that main page. Visitors to DSpace see only the single “index.html” file but are able to experience the AERI 2013 Conference Website in full.

VirtualBox is an open-source virtualization tool developed by Oracle. Virtualization software is a desktop application that allows users to emulate the software environment in which an object was created by installing a guest operating system on top of a host machine.
While Windows 7 is currently the operating system du jour, “agile development” seems to be the rallying cry of the entire technology industry, and Windows 7 could very well fade into obsolescence in the near future. Our group decided it would be worthwhile to include virtualization in our preservation strategy and installed a 32-bit Windows 7 virtual machine on our workstation. As other groups had mistakenly downloaded malware while building their virtual machines, we installed anti-virus software before proceeding with downloads or file transfers. Though we were originally planning to use the open-source Clam AntiVirus software, we found that the only free version available for a Windows operating system was Immunet 3.0, powered by ClamAV. The free version of Immunet 3.0 offers only cloud-based protection, which is not helpful for virus protection in a virtual machine. Thus, we downloaded Windows Endpoint Protection, which was available to us through UT Austin’s BevoWare website. We also downloaded the Google Chrome and Mozilla Firefox web browsers, since these were the browsers most commonly used to access and view the site during its creation. Internet Explorer 6 was already installed on the machine.

Once the virtual machine was set up, we used the secure shell client to create an identical .tar file within the virtual environment. We likewise installed 7-Zip in this environment and were able to extract all component files from the .tar. After this step of the project had been completed, we opened the local file of the AERI 2013 Conference Website in Chrome and Firefox. In order to prepare the virtual machine for ingest, we uninstalled the secure shell client and 7-Zip, which would be of no use to users, and left the AERI Website open in the two web browsers.
Once we were content with the saved machine state we had established in VirtualBox, we manually ingested the Virtual Machine Disk (VMDK) file into DSpace and provided users with enough descriptive context to explore the site in the virtual environment.

Website Outlinks and Widgets

Widgets, while especially attractive to website users for their interactivity and dynamic content, present extraordinary difficulties for preservation. They often utilize proprietary or commercial formats and draw from externally hosted content. Unlike other outlinks such as hyperlinks to repositories and transportation services, widgets involve embedded content, and in this case they were more directly relevant to the conference and to the look and feel of the website than other outlinks. Therefore, we decided to focus on preserving the widgets over other, less relevant outlinks.

Twitter

An account with the handle @aeri2013 was created and maintained throughout the planning stages and the 2013 conference event. However, this account was passed on to the group at the University of Pittsburgh that is hosting the 2014 AERI Conference. Twitter gives users the opportunity to request and download their entire archive as a zipped text file. However, since the AERI Website creators were no longer in control of this account, this was not a possibility for our project. Additionally, the account was little used, and mostly for the purpose of announcing registration and a call for papers.

Conference participants were encouraged to generate tweets with the hashtag #AERI2013 over the course of the event, and a widget containing all the tweets tagged this way is embedded directly on the AERI Website homepage. These tweets provide a wonderfully granular perspective on the conference event, and our group thought it would be highly valuable to capture this facet of the website.
As the 2013 Library of Congress report on the institution’s ongoing Twitter project remarks, “Archiving and preserving outlets such as Twitter will enable future researchers access to a fuller picture of today’s cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes” (Twitter Report, 1). Our group was eager to ensure that this aspect of the AERI Website was accessible to a future audience.

Ultimately, we chose to capture these tweets in two ways. First, we took screenshots of the widget as it appears on the AERI homepage, in addition to taking screenshots of all of the tweets tagged with #AERI2013, which are still live on Twitter’s web application. Given the vast amount of data a highly trafficked site like Twitter processes daily, it is unlikely that these tweets will remain easily searchable on the web for much longer, and we wanted to capture these items in a more persistent fashion. To accomplish this, we experimented with writing code that would output all the #AERI2013 tweets in a static HTML page that would preserve the informational and contextual value of these tweets. Initially we tried writing a Ruby script that would grab the tweets from Twitter’s public API, or Application Programming Interface, which large websites like Twitter often make freely available in order to encourage application development. However, Twitter only includes the preceding six months of tweets in its public API, and accordingly the #AERI2013 tweets were no longer available through it. Instead, we had to grab tweets directly from the Twitter website, which took some more maneuvering. After some trial and error, we were able to develop a jQuery script (see Appendix 2) which captured the #AERI2013 tweets in a chronologically ordered list.
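The idea of a static, chronologically ordered HTML list can be sketched as follows. This is a Python illustration only, not the jQuery script in Appendix 2. It assumes the tweets have already been collected as records with hypothetical `user`, `created_at` (ISO 8601), and `text` fields, and the link targets built for mentions and hashtags are likewise assumptions.

```python
import html
import re
from datetime import datetime

# One pass over the text: URLs first, then @mentions, then #hashtags.
TOKEN = re.compile(
    r"(?P<url>https?://\S+)|(?<!\w)@(?P<user>\w+)|(?<!\w)#(?P<tag>\w+)"
)


def linkify(text: str) -> str:
    """Escape tweet text and wrap URLs, mentions, and hashtags in links."""
    parts, last = [], 0
    for match in TOKEN.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        label = html.escape(match.group(0))
        if match.group("url"):
            href = label
        elif match.group("user"):
            href = "https://twitter.com/" + match.group("user")
        else:
            href = "https://twitter.com/hashtag/" + match.group("tag")
        parts.append(f'<a href="{href}">{label}</a>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def render_page(tweets: list[dict]) -> str:
    """Return a static HTML page listing tweets from oldest to newest."""
    ordered = sorted(
        tweets, key=lambda t: datetime.fromisoformat(t["created_at"])
    )
    items = "\n".join(
        f'<li><strong>{html.escape(t["user"])}</strong> '
        f'<time>{html.escape(t["created_at"])}</time>: {linkify(t["text"])}</li>'
        for t in ordered
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>#AERI2013 tweets</title></head>\n"
        f"<body><ol>\n{items}\n</ol></body></html>"
    )
```

Because the output is plain HTML with no scripts or remote widgets, it can be ingested and rendered without depending on Twitter remaining available.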
The resulting HTML page provided working outlinks to websites embedded in tweets, user pages explicitly mentioned in tweets, and other hashtags users included in tweets. This second archiving approach offers a considerably more dynamic insight into this aspect of the conference event, and we included the HTML page in the same subcommunity.

Google Maps

As with the Twitter widget, the Google Maps widgets found on the “Travel Accommodations” webpage presented difficulties, as the content is hosted on an external site and utilizes a unique, proprietary display format. While capturing both look and feel is ideal, for the map widgets we settled for preserving the look on the webpage through screenshots and preserving the data through the KML file download option provided by Google. KML stands for Keyhole Markup Language and is an XML notation used to communicate and visualize geographic data. Though mostly employed by Google applications including Google Earth and Google Maps, KML is an international standard maintained by the Open Geospatial Consortium (OGC). The OGC offers the official schema for download on its website. Because KML is an XML language and a file format maintained by the OGC, we felt confident preserving the files in that format. The file contains information about location (with geographic coordinates), icons, and names of places (e.g. Wendy's Restaurant). Thus we anticipate that if the file is opened in ten or fifteen years, it will display the location and place name that was plotted by the AERI 2013 coordinators for the conference and will not update to new locations or place names.

Outlinks

The website contains outlinks to other websites regarding transportation, housing, local sites to visit, and participant web pages. After discussing to what degree our group should pursue capturing content hosted on other web pages, we decided that commercial websites were outside the scope of our collection.
Likewise, institutional websites were deemed out of scope due to their size and complexity as well as the potential for encountering rights issues regarding their contents. However, given that participant websites are a direct reflection of the archiving community’s members and their research interests, and that they are unlikely to remain stable over the long term, we decided that it would be appropriate to archive the homepages of participants’ websites linked to from the AERI 2013 “Participants’ Bios” page. Screenshots of these pages were taken by Amye McCarther at various scroll positions down each page and reassembled as composite images in Photoshop. These composite images were exported as TIFs and ingested on April 26, 2014.

Flickr

After the conclusion of the AERI 2013 conference, a Flickr group was created on June 26, 2013 and a link to the group was placed on the AERI 2013 website. Digital photographs of the event were uploaded to Flickr by Lorrie Dong. Because Flickr is an active website and its interface and design are likely to change in the coming years, we decided to take a screenshot of its presentation to capture its look while the AERI 2013 website was actively used. This screenshot was taken by Lauren Gaylord on April 23, 2014 and ingested on April 26, 2014. Additionally, we decided to download the photographs in the group for ingest into DSpace so that they would be accessible even if Flickr changes its format or ceases to exist. The group consisted of 85 photographs, though Flickr mistakenly listed the count as 96. These JPG files were downloaded at their original size by Lauren Gaylord on April 23, 2014 and were ingested on April 26, 2014.

Supporting Documents

Introduction

Many of the doctoral students involved in both the creation of the AERI 2013 Website and the planning of the AERI 2013 Conference Event are still based in Austin.
These students were very responsive to our inquiries when we were conducting research about AERI 2013, and we decided it would be instructive to include supporting documentation within the scope of our project. Our group had read an abundance of literature about how easy it is to inadvertently alter metadata when retrieving digital materials from creators, and we knew we had to be extremely cautious in executing this phase of the project. Our initial plan was to schedule individual meetings with each AERI 2013 creator and to securely retrieve their files using the write-blocker included in the Forensic Recovery of Evidence Data (FRED) workstation. Unfortunately, we were unable to locate a USB A-to-A cable, which would have made the passage of files from a creator’s host machine, onto an external storage device, and onto our workstation in the Digital Archaeology Lab a completely secure pathway. That said, we were able to transfer files from a host machine onto an external hard drive in a secure way, and to pull up files on our workstation while keeping the write-blocker’s “read-only” settings activated. This allowed us to view supporting documentation without altering vital metadata fields like date created. It was a worthwhile learning experience to experiment with the write-blocker and to gain a sense of how delicately one must proceed when retrieving materials from creators.

In-person transfer was not always a possibility. One creator, Sarah Kim, is based in South Korea, and retrieving her materials by email was our only feasible option. In addition, Jane Gruning, who did not have a large role in designing the website, was unable to schedule an in-person meeting with us and elected to share her files with us over Dropbox, a cloud-based file storage and transfer service. We were displeased to see that Dropbox changed the “creation date” and “date modified” metadata fields of files and also compressed files in an appreciable but not significant way.
Finally, we were able to grant the DSpace administrator, Dr. Patricia Galloway, collection privileges that afforded her the ability to upload all her supporting documentation herself, which might be the preferred mode of transfer for any trusted digital repository.

Sarah Buchanan

Sarah Buchanan is a PhD student at the University of Texas at Austin and was part of the planning team for AERI 2013. Sarah created and edited content for the website, organized and transcribed participants’ information, and kept minutes of the AERI 2013 planning meetings. Her materials were retrieved by Amye McCarther using a clean external hard drive on April 9, 2014. These documents, which included MSWord documents and MSExcel spreadsheets, were reviewed, and any items containing confidential information were removed. The remaining documents were then prepared for batch ingest (see below).

Lorrie Dong

Lorrie Dong is a PhD student at the University of Texas at Austin and was part of the planning team for AERI 2013. She created the content for all of the non-CFP pages of the website, including lodging, schedule, bios, and transportation maps. Materials were retrieved from Lorrie Dong by Amye McCarther using a clean external hard drive on April 13, 2014. The documents were converted to a 7-Zip file by Nicole Feldman on April 23, 2014. The zipped file contains MSWord documents, PDFs, MSExcel spreadsheets, PowerPoint presentations, JPGs, GIFs, HTML, TXT, and MNO documents. Two folders and one spreadsheet were removed prior to the documents being zipped, as they contained confidential information.

Patricia Galloway

Dr. Patricia Galloway is a professor at the University of Texas at Austin School of Information. She played a leadership role on the planning team for AERI 2013. Because she is the administrator for the iSchool DSpace repository, we gave her submitter permission to ingest documents into her creator collection.
Jane Gruning

Jane Gruning was part of the AERI 2013 planning team, but had a very minor role in creating the website. She reviewed materials that went onto the site and helped form the restaurant list. Her content was generated using a 2011 MacBook Air running Mac OS and Microsoft Word and Excel 2008. Materials were retrieved from Jane Gruning by Jim Rizkalla via Dropbox on April 13, 2014. The documents were converted to a 7-Zip file by Nicole Feldman on April 23, 2014. The zipped file contains four MSWord documents and one PDF file. These files consist of drafts of application forms, programs, and schedules.

Sarah Kim

Sarah Kim was a PhD student at the University of Texas at Austin and was part of the planning team for AERI 2013. She initiated the @AERI2013 Twitter account for the Conference Event and designed the official logos for the event and the website (which were adopted by the institution hosting the 2014 AERI Conference). She also transformed the text prepared by Sarah Buchanan for the program into a booklet PDF. Kim currently lives and works in South Korea, so our group was restricted to receiving her materials virtually. Materials were retrieved from Sarah Kim by Nicole Feldman via email on March 31, 2014. The files consist of various versions of the AERI logo in JPG and GIF formats.

Virginia Luehrsen

Virginia Luehrsen is a PhD student at the University of Texas at Austin and was part of the planning team for AERI 2013. She was involved in the early stages of planning and designed the initial website, but played a diminished role after February 2013. Materials were retrieved from Virginia Luehrsen by Lauren Gaylord and Jim Rizkalla using a clean external hard drive on April 16, 2014. The documents were converted to a 7-Zip file by Nicole Feldman on April 23, 2014. The zipped file contains MSWord documents, MSExcel spreadsheets, JPGs, PNGs, HTML and CSS documents, and one MSAccess database file.
Two folders were removed prior to the documents being zipped, as they contained confidential information or duplicate files.

Katie Pierce Meyer

Katie Pierce Meyer is a PhD student at the University of Texas at Austin and was part of the AERI 2013 planning team, though she had a very minor role in creating the website. She contributed content to the website for the schedule and room assignments. She also designed the AERI 2013 tote bag. Materials were retrieved from Katie Pierce Meyer by Lauren Gaylord using a clean external hard drive on April 18, 2014. The documents were converted to a 7-Zip file by Nicole Feldman on April 23, 2014. The zipped file contains MSWord documents, PDFs, MSExcel spreadsheets, PowerPoint presentations, JPGs, PNGs, GIFs, TIFs, Bitmap images, HTML, PSD, AI, and EPS documents. Three folders and two spreadsheets were removed prior to the documents being zipped, as they contained confidential information.

SECTION II: DEVELOPMENT OF ARRANGEMENT

Before working directly in DSpace, it was necessary to develop the architecture of the arrangement for our collection. Our primary goal in this project was to preserve the AERI 2013 website, so naturally that was selected as the first subcommunity in our collection. This subcommunity contains archived components, aggregate versions, and externally hosted content of the AERI 2013 website. The collections within the subcommunity are the AERI 2013 Virtual Machine, AERI 2013 Website Component Files, and AERI 2013 Website Externally Hosted Content. The AERI 2013 Virtual Machine consists of the Virtual Machine Disk (VMDK) file that captures the virtualization of the AERI 2013 website. AERI 2013 Website Component Files consists of two items: the individually ingested HTML pages, showing the functionality (as of April 2014) of the AERI 2013 website created by UT doctoral students and faculty, and a .tar file of the public_html directory of the AERI 2013 website.
AERI 2013 Website Externally Hosted Content contains nine items: KML Files of AERI 2013 Google Maps, Lorrie Dong AERI 2013 Flickr Photographs, a persistent HTML page of #aeri2013 tweets, Screenshot of AERI 2013 Flickr Page, Screenshot of DIPIR Web Page, Screenshot of Lorrie Dong Web Page, Screenshot of Sarah Ramdeen Web Page, Screenshots of AERI 2013 GoogleMaps, and Twitter Screenshots.

The next subcommunity captures our group’s efforts to archive the AERI 2013 Website. The AERI 2013 Website Project Documentation subcommunity consists of documentation about the archiving of the 2013 website created by the 392K team, including SIP Agreements, project reports, and project notes.

Our group felt it would be instructive to also include supporting materials that documented the creation of the AERI 2013 Website as well as the planning of the 2013 Conference Event. The AERI 2013 Website Supporting Documents subcommunity contains supporting documents used during the creation of the AERI 2013 Website and the planning of the accompanying conference. Collections are arranged according to creator and maintained in the order received. There are seven collections in this subcommunity: Jane Gruning Materials, Katie Pierce Meyer Materials, Lorrie Dong Materials, Patricia Galloway Materials, Sarah Buchanan Materials, Sarah Kim Materials, and Virginia Luehrsen Materials. Each collection description is specific enough to indicate whether a creator’s role lay more in planning the conference or in creating the website.

SECTION III: INGEST OF MATERIALS AND METADATA

Accurate identification and description of digital objects is critical to their longevity. For much of history, archives have been populated with paper records, which may remain stable over long periods with little need of intervention and whose functionality depends only on the durability of their material substrate and the fixity of any markings thereon.
By contrast, digital records are highly unstable and depend entirely on computing environments in order for their contents to be interpreted in a way that is humanly readable. Composed of bytestreams, literally sequences of magnetized and demagnetized particles signifying 1s and 0s, these objects and the hardware and software environments that support their creation quickly obsolesce. Without detailed metadata about the file formats represented in these bytestreams and the environments that may render them, the objects become opaque and unusable. Hence, while traditional records archiving could allow a degree of inattention given the proper storage conditions, digital records afford no such luxury.

In addition to serving the functional purpose of keeping digital objects usable, metadata is also key to establishing authenticity. The evidentiary value of metadata has been recognized in the digital forensics community, and the same concepts can be applied within archival practice. Metadata may be used to authenticate, track, and safeguard digital assets. The ease with which digital objects may be transferred or altered makes the detailed and accurate collection of metadata regarding their original instantiation of utmost importance. Digital objects are also vulnerable to invisible changes as they move from one operating environment to another. A checksum is a short value calculated mathematically from the contents of a digital object; comparing the checksums taken before and after a transfer validates that the object has been transferred without change. Some file formats, such as Broadcast Wave Format (BWF) WAV files, may have checksums embedded in the file itself, while others require separate documentation. Additionally, tools developed by the digital forensics community, such as write-blockers, can facilitate the transfer of digital objects between devices without alteration.

DSpace is equipped to store and create many types of metadata about the objects it ingests.
These include metadata describing contents and creators as well as functional aspects. DSpace currently borrows most of its qualifiers from the Dublin Core Libraries Working Group Application Profile (LAP), which it adapts and appends as needed. Some metadata fields, such as accession date and checksum values, are populated automatically by DSpace when items are ingested, while other fields may be entered manually or harvested separately and ingested as a batch.

Once our team had identified the type and extent of the materials gathered to be archived, we informed the UT DSpace administrator, Dr. Pat Galloway, of the MIME types that were present in the collection so that DSpace could be prepared to ingest them. The items comprising the AERI 2013 website, including KML files of the Google maps and live content harvested from Twitter and Flickr, were ingested manually. In addition to these documents, a VirtualBox virtual machine containing the .tar of the website and contemporary web browsers was ingested. Together these provide a virtualization of the website as most viewers would have seen it while it was active, in anticipation of a time when those web browsers and their rendering of web content will no longer exist. Narrative descriptions of all of the file types and the methods of their retrieval were provided for each item, in addition to identifying file formats in the appropriate DSpace field. The remainder of the project documentation followed a similar methodology for description.

Supporting documentation was ingested by two means: batch ingest, and manual ingest of zipped file directories or individual items. A batch ingest was employed as a test case. The smallest collection of materials was chosen for batch ingest so that any problems occurring during the process could be quickly accounted for and amendments made. Preparation for the batch ingest involved a two-step process: harvesting metadata and preparing documents for DSpace to ingest.
The New Zealand Metadata Harvester was downloaded to our workstation and used to extract XML metadata tags for each of the items. While the process was performed quickly, the tags produced did not conform to the Dublin Core schema used in DSpace. To produce the proper tags, a spreadsheet was made of the tag elements we selected to ingest with the items, with columns left blank between tags to be populated accordingly. These values were populated using a combination of narrative information from creators and metadata extracted using the New Zealand Harvester. Each row was then concatenated and compiled as an XML document for each item.

Batch ingest further requires that each item reside in its own folder named item_000, item_001, item_002, etc., and that each folder contain the digital item, an XML document named dublin_core.xml containing the Dublin Core metadata tags, and a text document named contents containing only the filename of the item. With the exception of one folder containing other nested folders, all of the items prepared for the batch ingest were individual files. For that folder, a nested folder was removed due to confidentiality concerns and the remaining contents were zipped to retain their structure.

Figure 1. Example of a directory prepared for batch ingest.

The batch ingest was conducted with the help of information technology coordinator Carlos Ovalle.

For the remainder of the supporting documents it was decided that manual ingest would be used; however, these documents were too numerous to be ingested individually. As with the folder in the batch ingest, the file directories of each creator were zipped into 7-zip files and these 7-zip files were ingested into their respective collections. The advantage of this method was that the structure of the file directories was maintained in the order received from the contributors, providing evidence of the organization and file naming schemes employed by each.
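The batch ingest layout described above (one item_NNN folder per item, each holding the file, dublin_core.xml, and contents) can be sketched in Python. This is a minimal illustration rather than the procedure the team actually ran: the function name build_item, the sample file example.txt, and the metadata values are hypothetical, and the dcvalue element and qualifier attributes assume the conventions of DSpace's simple archive format.

```python
import tempfile
from pathlib import Path
from xml.etree import ElementTree as ET


def build_item(batch_dir: Path, index: int, source_file: Path, fields: list) -> Path:
    """Create one item_NNN folder holding the file, dublin_core.xml, and contents."""
    item_dir = batch_dir / f"item_{index:03d}"
    item_dir.mkdir(parents=True)

    # Copy the digital item itself into the item folder.
    (item_dir / source_file.name).write_bytes(source_file.read_bytes())

    # One <dcvalue> per populated spreadsheet cell: (element, qualifier, value).
    root = ET.Element("dublin_core")
    for element, qualifier, value in fields:
        dcvalue = ET.SubElement(root, "dcvalue", {"element": element, "qualifier": qualifier})
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # The contents file lists only the filename of the item.
    (item_dir / "contents").write_text(source_file.name + "\n", encoding="utf-8")
    return item_dir


# Demo with hypothetical values; a real run would take them from the spreadsheet rows.
workspace = Path(tempfile.mkdtemp())
sample = workspace / "example.txt"
sample.write_text("placeholder item\n", encoding="utf-8")
fields = [
    ("title", "none", "Example item"),
    ("contributor", "author", "Example Creator"),
]
item_dir = build_item(workspace / "batch", 0, sample, fields)
print(sorted(p.name for p in item_dir.iterdir()))
```

Generating the XML from structured rows in this way would avoid manual concatenation in the spreadsheet, at the cost of still needing someone to map harvested values onto Dublin Core elements.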
Finally, documentation produced during the course of archiving the AERI 2013 website was zipped and submitted manually by members of the project group.

DISCUSSION & CONCLUSION

Lessons and Recommendations

The paper backlog is a familiar obstacle that stymies the efficiency of archival institutions and has led to the widespread adoption of policies like “More Product, Less Process” (MPLP). MPLP recognizes that archival theory does not always align neatly with archival practice, and holds that institutions should work toward an operational strategy that best serves users. The daily hurdles faced by archivists working with analog materials befall digital archivists, too. Accordingly, our group had to navigate the murky territory between theory and practice throughout our efforts to archive the AERI 2013 website. For example, the batch ingest process undertaken by our group proved to be very time-intensive due to the incompatibility of the metadata tags exported by the New Zealand Metadata Harvester with the DSpace metadata schema. As a result, tedious copying and pasting of harvested values was necessary to prepare the ingest XML document, which reduced the time our group had to complete other parts of our project.

In a similar vein, a common type of problem we encountered was the unavailability of critical resources and dependency on other people. Though we would have liked to gather all available supporting documents from their creators with a write-blocker and external hard drive, in practice this was not possible. We were unable to locate an A-to-A USB cable (also known as male-to-male), which we required to copy materials from the originating device to our external drive via a write-blocker. Due to time constraints, we proceeded to copy supporting materials without using the write-blocker. Additionally, an external hard drive was not immediately available to us, delaying the collecting of supporting documents.
Another instance of lack of resources occurred when we were unable to create a .tar file because our accounts had not been allotted enough space on the iSchool server. Our group’s unfamiliarity with many of the technical processes needed for our project also made us dependent on iSchool IT staff for command-line functions and batch ingest processes.

Towards the end of our process of collecting supporting documents from creators, we learned of the planning team’s use of the online project management tool Basecamp. It was not universally adopted by the group and thus is an incomplete record of their activities. While it would have been worthwhile to explore preserving the documentation on Basecamp, time constraints prevented us from gathering and ingesting the materials. Additionally, many members of the planning team found the tool cumbersome and difficult to navigate, leading to its reduced use towards the end of the planning stages. Many of the sixteen files uploaded to Basecamp were duplicated within the existing creator collections, but the Writeboards contain information created within Basecamp and thus not available anywhere else. The AERI 2013 Basecamp contains seven Writeboards and allows users to review their successive revisions. Basecamp allows these boards to be exported as HTML or TXT files, should they be considered worthy of preservation in the future. A list of the files and Writeboards is included in Appendix 4 for reference.

The late discovery by our group of the Basecamp documentation speaks to the difficulties of coordinating with a large group of creators. Few of the creators remembered the site’s existence until prompted, and their frustration with it as a project management tool led many to discourage its preservation. On-the-fly decisions such as this characterize the exciting and challenging nature of digital archiving.
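The .tar file of the public_html directory mentioned above can be illustrated with Python's standard tarfile module. This is a minimal sketch, not the command the iSchool IT staff ran: the function name archive_directory and the demo directory contents are hypothetical. Because the output path is a separate argument, the archive could be written to a location other than the space-limited account that held the site.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_directory(source_dir: Path, tar_path: Path) -> list:
    """Write source_dir into an uncompressed .tar and return the member names."""
    with tarfile.open(tar_path, "w") as tar:
        # arcname stores paths relative to the directory name (public_html/...).
        tar.add(source_dir, arcname=source_dir.name)
    # Re-open the archive to confirm what was actually stored.
    with tarfile.open(tar_path, "r") as tar:
        return sorted(tar.getnames())


# Demo on a small stand-in directory; the file contents are placeholders.
workspace = Path(tempfile.mkdtemp())
site = workspace / "public_html"
(site / "forms").mkdir(parents=True)
(site / "index.html").write_text("<html></html>", encoding="utf-8")
members = archive_directory(site, workspace / "public_html.tar")
print(members)
```

Listing the members after writing gives a simple inventory that can be compared against a directory listing such as the one in Appendix 1.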
Conclusions for Future Web Archiving Projects

One of the biggest obstacles our group faced in completing this project was our minimal technical knowledge. We surmounted this obstacle primarily by working collaboratively with IT staff at the iSchool, who were instrumental in seeing that our project tasks were executed correctly. Going forward, our group feels that archivists should adopt a similarly collaborative attitude and work closely with IT support staff as well as with web designers and web developers at their institutions. Simple Linux commands like rsync, which were critical to our archival strategy, are well known to IT support staff, and archivists stand to gain tremendously from working with these individuals. In addition, by working directly with web designers and web developers, archivists can help ensure that websites are designed to meet archival benchmarks. Sites like Archive Ready (see Appendix 3) generate comprehensive reports on a website’s archival compatibility and should be consulted before embarking on an archival project.

While our group enjoyed the imagination required to capture widgets, we would recommend that future website archiving projects not devote considerable resources or time to these endeavors unless the content hosted on these items is essential to the underlying archival mission.
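Section III notes that checksums can validate that a digital object was transferred without change, which matters most in cases like ours where a write-blocker was not available. A minimal Python sketch of such a fixity check follows. It compares every file in an original directory against its copy. The choice of SHA-256, the function names, and the demo files are assumptions made for illustration; the report does not state which algorithm DSpace or the team used.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(original: Path, copy: Path) -> list:
    """List files that are missing, extra, or different between the two trees."""
    before, after = manifest(original), manifest(copy)
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))


# Demo: a faithful copy reports no changes; an altered copy is detected.
base = Path(tempfile.mkdtemp())
original = base / "original"
original.mkdir()
(original / "about.html").write_text("<p>AERI 2013</p>", encoding="utf-8")
copy = base / "copy"
shutil.copytree(original, copy)
unchanged = changed_files(original, copy)
(copy / "about.html").write_text("<p>altered</p>", encoding="utf-8")
altered = changed_files(original, copy)
print(unchanged, altered)
```

Saving the manifest alongside the transferred files would also document their state at the moment of acquisition, which supports the authenticity argument made in Section III.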
APPENDICES

Appendix 1: Inventory of website file directory

    8 -rw-r--r-- 1 lgaylord users 5486 Jun 1 2013 about.html
    16 -rw-r--r-- 1 lgaylord users 12419 May 14 2013 AERI_2013_PAYMENT_FORM.docx
    72 -rw-r--r-- 1 lgaylord users 67056 Apr 15 2013 AERI2013_PreliminaryProgram.pdf
    1328 -rw-r--r-- 1 lgaylord users 1353967 Jun 12 2013 AERI2013-Program.pdf
    72 -rw-r--r-- 1 lgaylord users 66014 Apr 30 2013 AERI2013_Week-at-a-Glance.pdf
    8 -rw-r--r-- 1 lgaylord users 5731 Sep 13 2010 aeri2.css
    8 -rw-r--r-- 1 lgaylord users 5721 Sep 6 2010 aeri3.css
    8 -rw-r--r-- 1 lgaylord users 5338 Sep 8 2010 aeri4.css
    8 -rw-r--r-- 1 lgaylord users 5451 May 20 2013 aeri.css
    316 -rw-r--r-- 1 lgaylord users 317050 Jun 25 2013 AERIgroup.JPG
    32 -rw-r--r-- 1 lgaylord users 28689 May 20 2013 AERI-logo-bold.jpg
    16 -rw-r--r-- 1 lgaylord users 13498 Jun 1 2013 AERI-logo-web.gif
    8 -rw-r--r-- 1 lgaylord users 4566 Nov 28 2012 aeri.png
    68 -rw-r--r-- 1 lgaylord users 63488 Dec 26 2012 application.doc
    148 -rw-r--r-- 1 lgaylord users 145917 Jun 3 2013 bios.html
    24 -rw-r--r-- 1 lgaylord users 22956 Apr 29 2013 Briscoe.jpg
    44 -rw-r--r-- 1 lgaylord users 43520 Dec 27 2012 chair.doc
    4 -rw-r--r-- 1 lgaylord users 1407 Dec 26 2012 chair_guidelines.doc
    4 -rw-r--r-- 1 lgaylord users 1274 Dec 26 2012 chair_guidelines.html
    28 -rw-r--r-- 1 lgaylord users 25172 Apr 18 2012 CHIPS.jpg
    8 -rw-r--r-- 1 lgaylord users 4303 Jun 3 2013 doctoral_proposal.html
    36 -rw-r--r-- 1 lgaylord users 34069 Dec 17 2012 EASP_Application_2013.docx
    8 -rw-r--r-- 1 lgaylord users 7687 Jun 3 2013 emerging_scholars.html
    8 -rw-r--r-- 1 lgaylord users 6689 Jun 3 2013 explore.html
    8 -rw-r--r-- 1 lgaylord users 4662 Jun 3 2013 faculty_proposal.html
    4 -rw-r--r-- 1 lgaylord users 3995 May 21 2013 follow_bird-b.png
    4 -rw-r--r-- 1 lgaylord users 3090 May 11 2013 food.html
    4 drwxr-xr-x 2 lgaylord users 4096 Apr 1 2013 forms
    170400 -rw-r--r-- 1 lgaylord users 174313472 Jun 18 2013 HRC-Mamlet1931-video-2013-0617-Kim.mp4
    28 -rw-r--r-- 1 lgaylord users 27170 Apr 3 2013 IMLS_logo.jpg
    8 -rw-r--r-- 1 lgaylord users 4465 Jun 25 2013 index.html
    4 -rw-r--r-- 1 lgaylord users 4016 Jun 3 2013 info.html
    12 -rw-r--r-- 1 lgaylord users 8236 Apr 29 2013 iSchool.jpg
    8 -rw-r--r-- 1 lgaylord users 5114 May 11 2013 lodging.html
    8 -rw-r--r-- 1 lgaylord users 4669 May 20 2013 meeting.html
    48 -rw-r--r-- 1 lgaylord users 45568 Dec 27 2012 mentor.doc
    4 -rw-r--r-- 1 lgaylord users 1774 Dec 27 2012 mentor_guidelines.html
    44 -rw-r--r-- 1 lgaylord users 44544 Dec 26 2012 paper_poster.doc
    4 -rw-r--r-- 1 lgaylord users 1318 Jan 31 2013 paper_poster_guidelines.html
    8 -rw-r--r-- 1 lgaylord users 5407 Jun 3 2013 proposals.html
    4 -rw-r--r-- 1 lgaylord users 3184 Jun 3 2013 registration.html
    28 -rw-r--r-- 1 lgaylord users 27030 Jun 16 2013 schedule.html
    36 -rw-r--r-- 1 lgaylord users 33280 Dec 26 2012 scholarship.doc
    8 -rw-r--r-- 1 lgaylord users 4357 Jun 3 2013 scholarships.html
    12 -rw-r--r-- 1 lgaylord users 11137 Apr 3 2013 SJH bathroom.jpg
    16 -rw-r--r-- 1 lgaylord users 12490 Apr 3 2013 SJH bed.jpg
    4 -rw-r--r-- 1 lgaylord users 105 Apr 30 2013 top.gif
    8 -rw-r--r-- 1 lgaylord users 5741 May 8 2013 transportation.html
    16 -rw-r--r-- 1 lgaylord users 15367 Jun 11 2013 travel.html
    136 -rw-r--r-- 1 lgaylord users 134546 May 14 2013 UT_PAYEE_INFORMATION_FORM.pdf
    56 -rw-r--r-- 1 lgaylord users 51712 Dec 27 2012 workshop.doc
    4 -rw-r--r-- 1 lgaylord users 1262 Dec 26 2012 workshop_guidelines.doc
    4 -rw-r--r-- 1 lgaylord users 1224 Dec 26 2012 workshop_guidelines.html
    4 -rw-r--r-- 1 lgaylord users 1224 Dec 26 2012 workshop_guidelines.txt
    4 drwxr-xr-x 2 lgaylord users 4096 May 20 2013 zzz

Appendix 2: JQuery Script to Output #AERI2013 Tweets

    // For each tweet container on the page, collect the author's name and the tweet's HTML.
    $('.content').map(function(i, el) {
        return {
            // The key is quoted because '#' is not valid in a bare property name.
            '#AERI2013': $(el).find('.fullname').text(),
            tweet: $(el).find('.tweet-text')[0].innerHTML
        };
    })

Appendix 3: Archive Ready Report

Appendix 4: Basecamp Documentation

Writeboards:

    AERI Workshops – categories
    General notes – Katie
    Groups
    Schedule for the Week
    Schedule for the Week – another option
    Student day schedule – Wednesday
    Workshop idea: Promotion and Advancement

Files Uploaded:

    AERI-banner-Kim-small.jpg
    AERI-master-black.tif
    AERI-logo.gif
    AERI-master.tif
    HRC-Hamlet1931-video-2013-06-17-Kim.mp4
    AERI_Program.doc
    AERI_2013_Schedule_feb25.doc
    Expenses list for AERI 2012.xlsx
    CFP_aeri2013.doc
    ABOUT AUSTIN.docx
    AERI_Travel_website.doc
    Austin-Restaurants.docx
    AERI Application Form - Draft 1.doc
    AERI WORKSHOPS.docx
    Timeline for Preparing AERI 2012.doc
    To-do for AERI 2013 planning.xlsx

Works Cited

January 2013 White Paper entitled “Update on the Twitter Archive At the Library of Congress.” Accessed on April 12, 2014 at http://courses.ischool.utexas.edu/galloway/2014/spring/INF392K/LOC_TwitterReport_2013jan.pdf
