Dr Birgit Plietzsch, Arts Computing Advisor
Swithun Crowe, Developer for Arts and Humanities Computing projects
bp10@st-andrews.ac.uk, cs2@st-andrews.ac.uk
IT Services, University of St Andrews

1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco

Digital Preservation is …
• the active management of digital information over time to ensure its accessibility
• long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required for
  • Long-term is defined as "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely".
  • Retrieval means obtaining needed digital files from the long-term, error-free digital storage, without possibility of corrupting the continued error-free storage of the digital files.
  • Interpretation means that the retrieved digital files (for example texts, charts, images or sounds) are decoded and transformed into usable representations. This is often interpreted as "rendering", i.e. making it available for a human to access. However, in many cases it will mean that the files can be processed by computational means.
(Source: Wikipedia)

• Legal requirements (e.g.
Freedom of Information Act)
• Protection of institutional intellectual property
• Funding body requirements
  • until 2008: Arts and Humanities Data Service (national depository for arts and humanities research data)
  • no such body exists now for the Arts and Humanities
  • in other subjects, national support is patchy
• Moral obligations
  • protection of cultural and corporate memory

www.rps.ac.uk
• proceedings of the Scottish Parliament from the first surviving act of 1235 to the union of 1707
• 10 years of research
• no print publication
• c. 16.5m words
• issues:
  • inconsistent editorial practices
  • obsolescence of software originally used
  • long-term sustainability of research data

• Pilot project
• Scope:
  • data contained in electronic resources produced within the Faculty of Arts, University of St Andrews
• Aims:
  • ensure long-term sustainability of RPS data
  • investigate the requirements of digital archiving and obtain experience
  • meet funding body requirement
  • flexible implementation (to allow for additional future uses)

Concepts and Properties of Archives and Hosting in the Strategy and their Relationships
© Charles Beagrie Ltd 2009. Creative Commons Attribution-Share Alike 3.0
Key: solid colour represents core properties and fading colour represents weaker properties of archives and hosting services.

1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco

• An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
• reference model: ISO 14721:2003

Seven functions
• Ingest
• Archival Storage
• Data Management
• Administration
• Preservation Planning
• Access
• Management

SIP: Submission Information Package
AIP: Archival Information Package
DIP: Dissemination Information Package

Implementation
• Content Information:
  • XML
  • TIFF
  • DOC
  • etc.
• Preservation Description Information:
  • PREMIS
• Descriptive Information:
  • MODS
• Packaging Information:
  • METS

• What needs to be preserved?
  • data
  • layout
  • functionality
  • user experience
• What are the significant properties?
  • generic low-level properties (e.g. basic data unit, byte-level encoding, data type, and logical schema)
  • data type specific properties (example: text)
    • underlying abstract forms (font, spacing, layout)
    • sub-properties (e.g. font type, style, family, size, colour)
• How do we preserve?
  • bit stream preservation
  • emulation
  • migration
• Adopted approach:
  • data is preserved
  • combination of bit stream preservation and file format migration upon ingest

• description needs of different types of material
  • electronic resources
  • digital images
  • video
  • research papers
  • University records
  • etc.
• introduce flexibility
  • future wider uses of the archive

Resource Discovery Metadata
• expressed in MODS
• 3 layers
• use for pilot
• more models can be developed
Diagram labels: Project, Resource type, Digital object; Research data, Documentation, Code

Monolithic approach
• DSpace
  • issues with Archival Storage and Data Management functions
• EPrints
  • issues with Administration and Access functions
• RODA
  • technical issues
No support for Preservation Planning

Breakdown into OAIS requirements
• Repository framework: Fedora Commons
  • issues with suitable front end for Ingest, Access, Preservation Planning, or Administration functions
  • highly customisable
• Metadata
  • MODS
  • METS
  • PREMIS

Software used
• Alfresco (www.alfresco.com)
  • Share
  • Explorer
  • Records Management
• Fedora Commons (fedoracommons.org)
• Planets Suite (www.openplanetsfoundation.org)
  • Plato
  • Testbed
OAIS functions shown in the diagram: Ingest, Administration, Management, Access, Archival storage & Data Management, Preservation Planning

• Version control of AIPs
  • Alfresco / Fedora interaction?
• Access front end
  • Fedora Commons front ends do not normally support OAIS functions
• Can extra properties be added to folders and files in a Records Management site?
We welcome ideas that might help us resolve the above three issues.

1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3.
Developing the OAIS Ingest function in Alfresco

Introduction
• FITS and PREMIS
  • Technical metadata
• RPS and MODS
  • Resource discovery metadata
• Antivirus scanning
• METS
  • Wrapping files and metadata

Introduction
• FITS (File Information Tool Set)
  • http://code.google.com/p/fits/
  • Consolidates file format metadata from 3rd party tools
    • Jhove, DROID, NLNZ ME, Exiftool and others
  • Output as XML
• PREMIS (PREservation Metadata: Implementation Strategies)
  • http://www.loc.gov/standards/premis/
  • Data dictionary of semantic units, maps to XML
• Transform FITS XML to PREMIS using XSLT

The action
• Text property defined in custom aspect for storing FITS XML in node metadata
• Create temporary file containing content of node
• Run FITS on temporary file
• Put output into custom property
• Later on, transform this to PREMIS XML
• Can be run as space rule
• Compile to AMP using Alfresco SDK

fits-action-context.xml
    <!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
    <beans>
        <!-- I18N messages for the action -->
        <bean id="fits-action-messages" class="org.alfresco.i18n.ResourceBundleBootstrapComponent">
            <property name="resourceBundles">
                <list><value>alfresco.module.FitsAction.fits-action-messages</value></list>
            </property>
        </bean>
        <!-- content model defining the custom FITS aspect -->
        <bean id="fits-model-bootstrap" parent="dictionaryModelBootstrap" depends-on="dictionaryBootstrap">
            <property name="models">
                <list><value>alfresco/module/FitsAction/context/fitsModel.xml</value></list>
            </property>
        </bean>
        <!-- the action itself -->
        <bean id="fits-action" class="uk.ac.st_andrews.repo.action.executer.FitsActionExecuter" parent="action-executer">
            <property name="serviceRegistry"><ref bean="ServiceRegistry"/></property>
        </bean>
    </beans>

FitsActionExecuter
    package uk.ac.st_andrews.repo.action.executer;

    // signatures only; method bodies are omitted
    public class FitsActionExecuter extends ActionExecuterAbstractBase
    {
        public void setServiceRegistry(ServiceRegistry serviceRegistry);
        protected void addParameterDefinitions(List<ParameterDefinition> paramList);
        protected void
            executeImpl(Action action, NodeRef actionedUponNodeRef);
    }

FitsActionExecuter.executeImpl (fragment)
    // make sure node exists
    if (!nodeService.exists(actionedUponNodeRef))
    {
        throw new Exception("no node");
    }

    // make sure that node has fits aspect
    QName fitsAspect = QName.createQName(fitsURI, "fitsAspect");
    if (!nodeService.hasAspect(actionedUponNodeRef, fitsAspect))
    {
        this.nodeService.addAspect(actionedUponNodeRef, fitsAspect, null);
    }

    // create new FITS instance
    Fits fits = new Fits();
    Fits.allowRounding = true;
    FitsOutput result = null;

FitsActionExecuter.executeImpl (fragment cont.)
    // put input into temp file
    ContentReader reader = contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
    String fileName = (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
    File inputFile = TempFileProvider.createTempFile("FitsActionExecuter_", "."
        + fileName);
    reader.getContent(inputFile);

    // run FITS on the temp file to obtain technical metadata
    result = fits.examine(inputFile);
    Document doc = result.getFitsXml();

    // serialise the FITS XML document to a string
    XMLOutputter serializer = new XMLOutputter(Format.getPrettyFormat());
    String output = serializer.outputString(doc);

    // write the FITS XML into the custom property on the node
    QName fitsProp = QName.createQName(fitsURI, "fitsOutput");
    nodeService.setProperty(actionedUponNodeRef, fitsProp, output);

Fragment of FITS XML showing conflicting file formats
    <identification status="CONFLICT">
        <identity format="Microsoft Word" mimetype="application/msword">
            <tool toolname="Exiftool" toolversion="8.25" />
            <tool toolname="file utility" toolversion="5.04" />
            <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
            <tool toolname="ffident" toolversion="0.2" />
        </identity>
        <identity format="OLE2 Compound Document Format" mimetype="application/octet-stream">
            <tool toolname="Droid" toolversion="3.0" />
            <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/111</externalIdentifier>
        </identity>
    </identification>

Corresponding fragment of PREMIS XML
    <premis:format>
        <premis:formatDesignation>
            <premis:formatName>Microsoft Word</premis:formatName>
        </premis:formatDesignation>
    </premis:format>
    <premis:format>
        <premis:formatDesignation>
            <premis:formatName>OLE2 Compound Document Format</premis:formatName>
        </premis:formatDesignation>
        <premis:formatRegistry>
            <premis:formatRegistryName>Droid (3.0)</premis:formatRegistryName>
            <premis:formatRegistryKey>fmt/111</premis:formatRegistryKey>
            <premis:formatRegistryRole>puid</premis:formatRegistryRole>
        </premis:formatRegistry>
    </premis:format>

Introduction
• Records of the Parliaments of Scotland marked up in thousands of XML documents
  • http://www.rps.ac.uk
• Using Text Encoding Initiative (TEI)
  • http://www.tei-c.org/index.xml
• TEI headers contain resource discovery metadata
• Extract metadata from documents and populate custom metadata fields
• Can be run as
  space rule
• Compile as AMP using Alfresco SDK

TEI example
    <!-- id: unique ID for document -->
    <!-- n: document belongs to translated version of records from reign of William and Mary -->
    <TEI.2 id="_william_and_mary_t1689_3_6_d6_trans" n="william_and_mary_trans">
        <teiHeader>
            <fileDesc>
                <titleStmt>
                    <!-- main heading in document -->
                    <title>A committee appointed for controverted elections</title>
                </titleStmt>
                <editionStmt>
                    <!-- pointer to session that document belongs to -->
                    <edition n="session">william_and_mary_t1689_3_1_d2_trans</edition>
                </editionStmt>
                <publicationStmt>
                    <!-- date of document, in YYYYMMDD format -->
                    <date>16890314</date>
                </publicationStmt>
            </fileDesc>
        </teiHeader>
        <text>...</text>
    </TEI.2>

RPSMetadataExtracter
    package uk.ac.st_andrews.repo.content.metadata;

    // signatures only; method bodies are omitted
    public class RPSMetadataExtracter extends org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
    {
        public RPSMetadataExtracter();
        protected Map<String, Serializable> extractRaw(ContentReader reader);
    }

RPSMetadataExtracter.extractRaw
    // set up parser
    SAXParser sp = spf.newSAXParser();
    InputStream cis = reader.getContentInputStream();
    InputSource is = new InputSource(cis);
    RPSSaxParser teip = new RPSSaxParser();

    // do parsing
    teip.setProperties(map);
    sp.parse(is, teip);
    map = teip.getProperties();

    // loop over properties found
    Set s = map.entrySet();
    Iterator it = s.iterator();
    while (it.hasNext())
    {
        Map.Entry m = (Map.Entry) it.next();
        putRawValue((String) m.getKey(), (String) m.getValue(), rawProperties);
    }

RPSSaxParser
    package uk.ac.st_andrews.repo.content.metadata;

    // signatures only; method bodies are omitted
    public class RPSSaxParser extends org.xml.sax.helpers.DefaultHandler
    {
        public void setProperties(Map<String, Serializable> prop);
        public Map<String, Serializable> getProperties();
        public void startElement(String uri, String localName, String qName, Attributes attributes);
        public void endElement(String uri, String localName, String qName);
        public void characters(char[] ch, int start, int length);
        private void handleID(String id);
        private void handleDate(String d);
    }

RPSSaxParser
    // property names
    private static final String KEY_ID = "rpsID";
    private static final String KEY_REIGN = "rpsReign";
    private static final String KEY_VERSION = "rpsVersion";
    private static final String KEY_HEADING = "rpsHeading";
    private static final String KEY_SESSION = "rpsSession";
    private static final String KEY_DATE = "rpsDate";
    private static final String KEY_TITLE = "cmTitle";

    // some properties get set in RPSSaxParser.characters
    if (inTitle)
    {
        rawProperties.put(KEY_TITLE, new String(ch, start, length));
    }
    else if (inSession)
    {
        rawProperties.put(KEY_SESSION, new String(ch, start, length));
    }

RPSMetadataExtracter.properties
    # Namespaces
    namespace.prefix.rps=http://www.rps.ac.uk/ns/1.0
    namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

    # Mapping of property names to Qualified names used in model
    rpsID=rps:id
    rpsReign=rps:reign
    rpsSession=rps:session
    rpsDate=rps:date
    rpsVersion=rps:version
    rpsHeading=rps:heading
    cmTitle=cm:title

rpsModel.xml (fragment showing aspect)
    <aspect name="rps:metadata">
        <title>RPS Metadata</title>
        <properties>
            <property name="rps:id"><type>d:text</type></property>
            <property name="rps:reign"><type>d:text</type></property>
            <property name="rps:session"><type>d:text</type></property>
            <property name="rps:date"><type>d:text</type></property>
            <property name="rps:heading"><type>d:text</type></property>
            <property name="rps:version"><type>d:text</type></property>
        </properties>
    </aspect>

webclient.properties
    # I18N strings
    rpsID=RPS ID
    rpsReign=RPS Reign
    rpsSession=RPS Session
    rpsDate=RPS Date
    rpsVersion=RPS Version
    rpsHeading=RPS Heading

Using MODS
• Metadata Object Description Schema
  • http://www.loc.gov/standards/mods/
• MODS is a resource discovery metadata standard
• Working on defining MODS data models
  • For Project, Resource Type and Digital Object levels
• Will
  move RPS metadata into MODS fields

Introduction
• Creates an action for scanning files for viruses
• Uses ClamAV
  • http://www.clamav.net/lang/en/
  • Can be configured for other tools
• Emails creator of file if virus found
• Deletes file from repository if virus found
• Can be run as space rule
• Compile as AMP using Alfresco SDK

antivirus-action.xml (fragment)
    <bean id="antivirus-action" class="uk.ac.st_andrews.repo.action.executer.AntivirusActionExecuter" parent="action-executer">
        <!-- services needed by bean -->
        <property name="contentService"><ref bean="contentService" /></property>
        <property name="nodeService"><ref bean="nodeService" /></property>
        <property name="templateService"><ref bean="templateService" /></property>
        <property name="actionService"><ref bean="actionService" /></property>
        <property name="personService"><ref bean="personService" /></property>
        <!-- person that email will come from, defined in alfresco-global.properties -->
        <property name="fromEmail">
            <value>${antivirus.mailer}</value>
        </property>
        <!-- path to Freemarker template, defined in alfresco-global.properties -->
        <property name="emailTemplate">
            <value>${antivirus.template}</value>
        </property>

antivirus-action.xml (fragment, cont.)
        <property name="command">
            <bean class="org.alfresco.util.exec.RuntimeExec">
                <property name="commandMap">
                    <map>
                        <!-- command to run; ${antivirus.exe} is set in alfresco-global.properties, ${source} in the Java class -->
                        <entry key=".*" value="${antivirus.exe} ${source}"/>
                    </map>
                </property>
                <property name="errorCodes">
                    <value>1</value><!-- exit code 1 indicates that virus was found -->
                </property>
            </bean>
        </property>
    </bean>

AntivirusActionExecuter
    package uk.ac.st_andrews.repo.action.executer;

    // signatures only; method bodies are omitted
    public class AntivirusActionExecuter extends ActionExecuterAbstractBase
    {
        public void setContentService(ContentService contentService);
        public void setNodeService(NodeService nodeService);
        public void setTemplateService(TemplateService templateService);
        public void setActionService(ActionService actionService);
        public void setPersonService(PersonService personService);
        public void setFromEmail(String fromEmail);
        public void setCommand(RuntimeExec command);
        public void setEmailTemplate(String emailTemplate);
        public void init();
        protected void addParameterDefinitions(List<ParameterDefinition> paramList);
        protected void executeImpl(final Action ruleAction, final NodeRef actionedUponNodeRef);
    }

AntivirusActionExecuter.executeImpl (fragment)
    // put content into temp file
    ContentReader reader = contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
    String fileName = (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
    File sourceFile = TempFileProvider.createTempFile("anti_virus_check_", "_" + fileName);
    reader.getContent(sourceFile);

    // set source property for command
    Map<String, String> properties = new HashMap<String, String>(1);
    properties.put(VAR_SOURCE, sourceFile.getAbsolutePath());

    // execute the antivirus command
    ExecutionResult result = null;
    try
    {
        result = command.execute(properties);
    }
    catch (Throwable e)
    {
        throw new
            AlfrescoRuntimeException("Antivirus check error: \n" + command, e);
    }

AntivirusActionExecuter.executeImpl (fragment, cont.)
    // try to get document creator's details
    String creatorName = (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_CREATOR);
    if (null == creatorName || 0 == creatorName.length())
    {
        throw new Exception("couldn't get creator's name");
    }
    NodeRef creator = personService.getPerson(creatorName);
    if (null == creator)
    {
        throw new Exception("couldn't get creator");
    }
    String creatorEmail = (String) nodeService.getProperty(creator, ContentModel.PROP_EMAIL);
    if (null == creatorEmail || 0 == creatorEmail.length())
    {
        throw new Exception("couldn't get creator's email address");
    }

AntivirusActionExecuter.executeImpl (fragment, cont.)
    // put together message
    Map<String, Object> model = new HashMap<String, Object>(8, 1.0f);
    model.put("filename", fileName);
    model.put("message", result);
    String emailMsg = templateService.processTemplate("freemarker", emailTemplate, model);

    // send email message
    Action emailAction = actionService.createAction("mail");
    emailAction.setParameterValue(MailActionExecuter.PARAM_TO, creatorEmail);
    emailAction.setParameterValue(MailActionExecuter.PARAM_FROM, fromEmail);
    emailAction.setParameterValue(MailActionExecuter.PARAM_SUBJECT, "Virus found in " + fileName);
    emailAction.setParameterValue(MailActionExecuter.PARAM_TEXT, emailMsg);
    emailAction.setExecuteAsynchronously(true);
    actionService.executeAction(emailAction, null);

    // delete node
    nodeService.addAspect(actionedUponNodeRef, ContentModel.ASPECT_TEMPORARY, null);
    nodeService.deleteNode(actionedUponNodeRef);

Introduction
• Metadata Encoding and Transmission Standard (METS)
  • http://www.loc.gov/standards/mets/
• METS is a wrapper for other metadata documents
• Plan to generate METS documents
  containing/referencing:
  • Ingested files
  • Renderings of these files (thumbnails, reference copies, archival formatted versions etc.)
  • Resource discovery metadata
  • Technical metadata
• Fedora Commons can ingest METS documents as SIPs
  • http://fedora-commons.org/

Project source code available on Alfresco Forge
• FITS in Alfresco
  • http://forge.alfresco.com/projects/fitsinalfresco/
• RPS Metadata Extracter
  • http://forge.alfresco.com/projects/rpsmetadata/
• Antivirus
  • http://forge.alfresco.com/projects/antivirus/
University of St Andrews Digital Archiving Project
• http://www.st-andrews.ac.uk/itsupport/academic/arts
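
Sketch: transforming FITS XML to PREMIS with XSLT

The slides state that the stored FITS XML is later transformed to PREMIS using XSLT, but the stylesheet itself is not shown. The following is a minimal sketch of that step using only the JDK's built-in javax.xml.transform API. The class name FitsToPremisSketch, the inline stylesheet and the PREMIS namespace URI are assumptions made for illustration, not the project's actual code; the stylesheet handles only the format name and ignores the namespace that real FITS output carries.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FitsToPremisSketch
{
    // Hypothetical, heavily simplified stylesheet: copies the format name of
    // each FITS <identity> into a PREMIS <formatName>. The namespace URI is an
    // assumption (PREMIS 2.x style); registry keys such as PUIDs are left out.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:premis='info:lc/xmlns/premis-v2'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/identification'>"
        + "<premis:objectCharacteristics>"
        + "<xsl:for-each select='identity'>"
        + "<premis:format><premis:formatDesignation>"
        + "<premis:formatName><xsl:value-of select='@format'/></premis:formatName>"
        + "</premis:formatDesignation></premis:format>"
        + "</xsl:for-each>"
        + "</premis:objectCharacteristics>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a FITS-like XML string and returns the result.
    static String transform(String fitsXml) throws TransformerException
    {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(fitsXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException
    {
        // Un-namespaced stand-in for the conflicting-format fragment shown earlier.
        String fits =
            "<identification status='CONFLICT'>"
            + "<identity format='Microsoft Word' mimetype='application/msword'/>"
            + "<identity format='OLE2 Compound Document Format' mimetype='application/octet-stream'/>"
            + "</identification>";
        System.out.println(transform(fits));
    }
}
```

In an Alfresco action, the input string would presumably be read from the custom FITS property on the node rather than built inline; a production stylesheet would also need to match the FITS namespace and carry over registry information such as the DROID PUID.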