Using Alfresco to create an Open Archival Information System

Dr Birgit Plietzsch, Arts Computing Advisor (bp10@st-andrews.ac.uk)
&
Swithun Crowe, Developer for Arts and Humanities Computing projects (cs2@st-andrews.ac.uk)
IT Services, University of St Andrews
1. Introduction to the University of St Andrews Digital
Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco
Digital Preservation is …
• the active management of digital information over time to ensure its
accessibility
• long-term, error-free storage of digital information, with means for
retrieval and interpretation, for the entire time span the information is
required for.
• Long-term is defined as "long enough to be concerned with the impacts of
changing technologies, including support for new media and data formats, or
with a changing user community. Long Term may extend indefinitely".
• Retrieval means obtaining needed digital files from the long-term, error-free
digital storage, without possibility of corrupting the continued error-free storage
of the digital files.
• Interpretation means that the retrieved digital files (for example texts,
charts, images or sounds) are decoded and transformed into usable
representations. This is often interpreted as "rendering", i.e. making the content
available for a human to access. However, in many cases it will mean that the
files can be processed by computational means.
(Source: Wikipedia)
• Legal requirements (e.g. Freedom of Information Act)
• Protection of institutional intellectual property
• Funding body requirements
  • until 2008: Arts and Humanities Data Service (national depository for arts and humanities research data)
  • no such body exists now for the Arts and Humanities
  • for other subjects, national support is patchy
• Moral obligations
  • protection of cultural and corporate memory
www.rps.ac.uk
• proceedings of the Scottish Parliament from the first surviving act of 1235 to the union of 1707
• 10 years of research
• no print publication
• c. 16.5m words
• issues:
  • inconsistent editorial practices
  • obsolescence of software originally used
  • long-term sustainability of research data
• Pilot project
• Scope:
  • data contained in electronic resources produced within the Faculty of Arts, University of St Andrews
• Aims:
  • ensure long-term sustainability of RPS data
  • investigate the requirements of digital archiving and gain practical experience
  • meet funding body requirement
  • flexible implementation (to allow for additional future uses)
[Figure: Concepts and Properties of Archives and Hosting in the Strategy and their Relationships. © Charles Beagrie Ltd 2009. Creative Commons Attribution-Share Alike 3.0. Key: solid colour represents core properties and fading colour represents weaker properties of archives and hosting services.]
2. The DAP Open Archival Information System
• An Open Archival Information System (or OAIS) is an
archive, consisting of an organization of people and
systems, that has accepted the responsibility to preserve
information and make it available for a Designated
Community.
• reference model: ISO 14721:2003
Seven functions
• Ingest
• Archival Storage
• Data Management
• Administration
• Preservation Planning
• Access
• Management
Information packages
• SIP: Submission Information Package
• AIP: Archival Information Package
• DIP: Dissemination Information Package
Implementation
• Content Information: XML, TIFF, DOC, etc.
• Preservation Description Information: PREMIS
• Descriptive Information: MODS
• Packaging Information: METS
• What needs to be preserved?
  • data
  • layout
  • functionality
  • user experience
• What are the significant properties?
  • generic low-level properties (e.g. basic data unit, byte-level encoding, data type, and logical schema)
  • data type specific properties (example: text)
    • underlying abstract forms (font, spacing, layout)
    • sub-properties (e.g. font type, style, family, size, colour)
• How do we preserve?
  • bit stream preservation
  • emulation
  • migration
• Adopted approach:
  • data is preserved
  • combination of bit stream preservation and file format migration upon ingest
• description needs of different types of material
  • electronic resources
  • digital images
  • video
  • research papers
  • University records
  • etc.
• introduce flexibility
  • future wider uses of the archive
Resource Discovery Metadata
• expressed in MODS
• 3 layers: Project; Resource type (Research data, Documentation, Code); Digital object
• use for pilot
• more models can be developed
Monolithic approach
• DSpace
  • issues with Archival Storage and Data Management functions
• EPrints
  • issues with Administration and Access functions
• RODA
  • technical issues
No support for Preservation Planning
Breakdown into OAIS requirements
• Repository framework: Fedora Commons
  • issues with suitable front end for Ingest, Access, Preservation Planning, or Administration functions
  • highly customisable
• Metadata
  • MODS
  • METS
  • PREMIS
Software used
• OAIS functions covered: Administration, Management, Access, Ingest, Archival storage & Data Management, Preservation Planning
• Alfresco (www.alfresco.com)
  • Share
  • Explorer
  • Records Management
• Fedora Commons (fedoracommons.org)
• Planets Suite (www.openplanetsfoundation.org)
  • Plato
  • Testbed
• Version control of AIPs
  • Alfresco / Fedora interaction?
• Access front end
  • Fedora Commons front ends do not normally support OAIS functions
• Can extra properties be added to folders and files in Records Management site?
We welcome ideas that might help us resolve the above three issues.
3. Developing the OAIS Ingest function in Alfresco
Introduction
• FITS and PREMIS
• Technical metadata
• RPS and MODS
• Resource discovery metadata
• Antivirus scanning
• METS
• Wrapping files and metadata
Introduction
• FITS (File Information Tool Set)
• http://code.google.com/p/fits/
• Consolidates file format metadata from 3rd party tools
• Jhove, DROID, NLNZ ME, Exiftool and others
• Output as XML
• PREMIS (PREservation Metadata: Implementation
Strategies)
• http://www.loc.gov/standards/premis/
• Data dictionary of semantic units, maps to XML
• Transform FITS XML to PREMIS using XSLT
The action
• Text property defined in custom aspect for storing FITS
XML in node metadata
• Create temporary file containing content of node
• Run FITS on temporary file
• Put output into custom property
• Later on, transform this to PREMIS XML
• Can be run as space rule
• Compile to AMP using Alfresco SDK
fits-action-context.xml
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
  <!-- I18N messages for the action -->
  <bean id="fits-action-messages" class="org.alfresco.i18n.ResourceBundleBootstrapComponent">
    <property name="resourceBundles">
      <list><value>alfresco.module.FitsAction.fits-action-messages</value></list>
    </property>
  </bean>
  <!-- content model defining the custom FITS aspect -->
  <bean id="fits-model-bootstrap" parent="dictionaryModelBootstrap" depends-on="dictionaryBootstrap">
    <property name="models">
      <list><value>alfresco/module/FitsAction/context/fitsModel.xml</value></list>
    </property>
  </bean>
  <!-- the action itself -->
  <bean id="fits-action" class="uk.ac.st_andrews.repo.action.executer.FitsActionExecuter" parent="action-executer">
    <property name="serviceRegistry"><ref bean="ServiceRegistry"/></property>
  </bean>
</beans>
FitsActionExecuter
package uk.ac.st_andrews.repo.action.executer;
public class FitsActionExecuter extends ActionExecuterAbstractBase
{
public void setServiceRegistry(ServiceRegistry serviceRegistry);
protected void addParameterDefinitions(List<ParameterDefinition> paramList);
protected void executeImpl(Action action, NodeRef actionedUponNodeRef);
}
FitsActionExecuter.executeImpl (fragment)
// make sure node exists
if (!nodeService.exists(actionedUponNodeRef))
{
    throw new Exception("no node");
}
// make sure that node has fits aspect
QName fitsAspect = QName.createQName(fitsURI, "fitsAspect");
if (!nodeService.hasAspect(actionedUponNodeRef, fitsAspect))
{
    nodeService.addAspect(actionedUponNodeRef, fitsAspect, null);
}
// create new FITS instance
Fits fits = new Fits();
Fits.allowRounding = true;
FitsOutput result = null;
FitsActionExecuter.executeImpl (fragment cont.)
// put input into temp file
ContentReader reader =
    contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
String fileName =
    (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
File inputFile =
    TempFileProvider.createTempFile("FitsActionExecuter_", "." + fileName);
reader.getContent(inputFile);
// transform into technical metadata
result = fits.examine(inputFile);
Document doc = result.getFitsXml();
// serialise the FITS XML document to a string
XMLOutputter serializer = new XMLOutputter(Format.getPrettyFormat());
String output = serializer.outputString(doc);
// store the FITS XML in the custom property on the node
QName fitsProp = QName.createQName(fitsURI, "fitsOutput");
nodeService.setProperty(actionedUponNodeRef, fitsProp, output);
Fragment of FITS XML showing conflicting file formats
<identification status="CONFLICT">
  <identity format="Microsoft Word" mimetype="application/msword">
    <tool toolname="Exiftool" toolversion="8.25" />
    <tool toolname="file utility" toolversion="5.04" />
    <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
    <tool toolname="ffident" toolversion="0.2" />
  </identity>
  <identity format="OLE2 Compound Document Format"
            mimetype="application/octet-stream">
    <tool toolname="Droid" toolversion="3.0" />
    <externalIdentifier toolname="Droid" toolversion="3.0"
                        type="puid">fmt/111</externalIdentifier>
  </identity>
</identification>
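To make the conflict in the fragment above concrete, the following is a minimal sketch that reads the identification status and counts how many tools agreed on each format. It uses only the JDK's DOM and XPath APIs. The class and method names are hypothetical, and the sketch assumes the fragment is namespace-free, as printed on the slide; real FITS output may carry a namespace.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of FITS or of the project code.
final class FitsIdentificationReport {

    /** Parses a namespace-free FITS identification fragment into a DOM document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the value of identification/@status, e.g. "CONFLICT". */
    static String status(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/identification/@status", doc);
    }

    /** Maps each reported format name to the number of tool elements listed under it. */
    static Map<String, Integer> toolVotes(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList identities = (NodeList) xpath.evaluate(
                "/identification/identity", doc, XPathConstants.NODESET);
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (int i = 0; i < identities.getLength(); i++) {
            Element identity = (Element) identities.item(i);
            Number tools = (Number) xpath.evaluate("count(tool)", identity, XPathConstants.NUMBER);
            votes.put(identity.getAttribute("format"), tools.intValue());
        }
        return votes;
    }
}
```

For the fragment shown, four tools report Microsoft Word and one reports the OLE2 container format. How such a conflict should be resolved is a policy decision that this sketch deliberately leaves open.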
Corresponding fragment of PREMIS XML
<premis:format>
  <premis:formatDesignation>
    <premis:formatName>Microsoft Word</premis:formatName>
  </premis:formatDesignation>
</premis:format>
<premis:format>
  <premis:formatDesignation>
    <premis:formatName>OLE2 Compound Document Format</premis:formatName>
  </premis:formatDesignation>
  <premis:formatRegistry>
    <premis:formatRegistryName>Droid (3.0)</premis:formatRegistryName>
    <premis:formatRegistryKey>fmt/111</premis:formatRegistryKey>
    <premis:formatRegistryRole>puid</premis:formatRegistryRole>
  </premis:formatRegistry>
</premis:format>
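The mapping from the FITS fragment to the PREMIS fragment can be sketched with the JDK's built-in XSLT 1.0 processor. This is an illustration only and not the project's stylesheet: the class name, the inline stylesheet, the wrapping premis:objectCharacteristics element and the choice of the PREMIS 2 namespace URI are all assumptions, and the input is assumed to be namespace-free as on the slide.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch of a FITS-to-PREMIS transform.
final class FitsToPremisSketch {

    // Illustrative stylesheet: one premis:format per FITS identity,
    // plus a premis:formatRegistry for each externalIdentifier.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:premis='info:lc/xmlns/premis-v2'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/identification'>"
        + "<premis:objectCharacteristics>"
        + "<xsl:apply-templates select='identity'/>"
        + "</premis:objectCharacteristics>"
        + "</xsl:template>"
        + "<xsl:template match='identity'>"
        + "<premis:format>"
        + "<premis:formatDesignation>"
        + "<premis:formatName><xsl:value-of select='@format'/></premis:formatName>"
        + "</premis:formatDesignation>"
        + "<xsl:apply-templates select='externalIdentifier'/>"
        + "</premis:format>"
        + "</xsl:template>"
        + "<xsl:template match='externalIdentifier'>"
        + "<premis:formatRegistry>"
        + "<premis:formatRegistryName>"
        + "<xsl:value-of select=\"concat(@toolname, ' (', @toolversion, ')')\"/>"
        + "</premis:formatRegistryName>"
        + "<premis:formatRegistryKey><xsl:value-of select='.'/></premis:formatRegistryKey>"
        + "<premis:formatRegistryRole><xsl:value-of select='@type'/></premis:formatRegistryRole>"
        + "</premis:formatRegistry>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the illustrative stylesheet to a FITS identification fragment. */
    static String transform(String fitsXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(fitsXml)), new StreamResult(out));
        return out.toString();
    }
}
```

Keeping the mapping in a stylesheet rather than in Java means it can be revised without recompiling the module, which fits the approach of storing raw FITS XML first and transforming it later.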
Introduction
• Records of the Parliaments of Scotland marked up in
thousands of XML documents
• http://www.rps.ac.uk
• Using Text Encoding Initiative (TEI)
• http://www.tei-c.org/index.xml
• TEI headers contain resource discovery metadata
• Extract metadata from documents and populate custom
metadata fields
• Can be run as space rule
• Compile as AMP using Alfresco SDK
TEI example
<TEI.2 id="_william_and_mary_t1689_3_6_d6_trans" n="william_and_mary_trans">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A committee appointed for controverted elections</title>
      </titleStmt>
      <editionStmt>
        <edition n="session">william_and_mary_t1689_3_1_d2_trans</edition>
      </editionStmt>
      <publicationStmt>
        <date>16890314</date>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>...</text>
</TEI.2>
Annotations:
• id attribute: unique ID for document
• n attribute: document belongs to translated version of records from reign of William and Mary
• title: main heading in document
• edition: pointer to session that document belongs to
• date: date of document, in YYYYMMDD format
RPSMetadataExtracter
package uk.ac.st_andrews.repo.content.metadata;
public class RPSMetadataExtracter extends
org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
{
public RPSMetadataExtracter();
protected Map<String, Serializable> extractRaw(ContentReader reader);
}
RPSMetadataExtracter.extractRaw
// set up parser
SAXParser sp = spf.newSAXParser();
InputStream cis = reader.getContentInputStream();
InputSource is = new InputSource(cis);
RPSSaxParser teip = new RPSSaxParser();
// do parsing
teip.setProperties(map);
sp.parse(is, teip);
map = teip.getProperties();
// loop over properties found and copy each into the raw property map
for (Map.Entry<String, Serializable> entry : map.entrySet())
{
    putRawValue(entry.getKey(), entry.getValue(), rawProperties);
}
RPSSaxParser
package uk.ac.st_andrews.repo.content.metadata;
public class RPSSaxParser extends org.xml.sax.helpers.DefaultHandler
{
    public void setProperties(Map<String, Serializable> prop);
    public Map<String, Serializable> getProperties();
    public void startElement(String uri, String localName, String qName, Attributes attributes);
    public void endElement(String uri, String localName, String qName);
    public void characters(char[] ch, int start, int length);
    private void handleID(String id);
    private void handleDate(String d);
}
RPSSaxParser
// property names
private static final String KEY_ID = "rpsID";
private static final String KEY_REIGN = "rpsReign";
private static final String KEY_VERSION = "rpsVersion";
private static final String KEY_HEADING = "rpsHeading";
private static final String KEY_SESSION = "rpsSession";
private static final String KEY_DATE = "rpsDate";
private static final String KEY_TITLE = "cmTitle";
// some properties get set in RPSSaxParser.characters
if (inTitle)
{
    rawProperties.put(KEY_TITLE, new String(ch, start, length));
}
else if (inSession)
{
    rawProperties.put(KEY_SESSION, new String(ch, start, length));
}
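The header extraction described above can be sketched as a self-contained SAX handler using only the JDK. This is not the project's RPSSaxParser: the class name and the decision to buffer text are assumptions made for illustration. A SAX parser may deliver the text of one element through several characters() calls, so the sketch accumulates text and stores the value in endElement() rather than in characters().

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects resource discovery metadata from a TEI header.
final class TeiHeaderHandler extends DefaultHandler {
    private final Map<String, String> properties = new HashMap<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inHeader;     // true while inside <teiHeader>
    private String pendingKey;    // property fed by the current element's text, or null

    Map<String, String> getProperties() {
        return properties;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0);
        pendingKey = null;
        if ("TEI.2".equals(qName)) {
            properties.put("rpsID", attributes.getValue("id"));
        } else if ("teiHeader".equals(qName)) {
            inHeader = true;
        } else if (inHeader && "title".equals(qName)) {
            pendingKey = "cmTitle";
        } else if (inHeader && "edition".equals(qName)
                && "session".equals(attributes.getValue("n"))) {
            pendingKey = "rpsSession";
        } else if (inHeader && "date".equals(qName)) {
            pendingKey = "rpsDate";
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append instead of overwriting.
        if (pendingKey != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("teiHeader".equals(qName)) {
            inHeader = false;
        }
        if (pendingKey != null) {
            properties.put(pendingKey, text.toString().trim());
            pendingKey = null;
        }
    }

    /** Parses a TEI document and returns the properties found in its header. */
    static Map<String, String> extract(String xml) throws Exception {
        TeiHeaderHandler handler = new TeiHeaderHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getProperties();
    }
}
```

The keys match the property names used in the slides (rpsID, rpsSession, rpsDate, cmTitle). Deriving the reign and version from the ID is left out because the slides do not show how handleID works.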
RPSMetadataExtracter.properties
# Namespaces
namespace.prefix.rps=http://www.rps.ac.uk/ns/1.0
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
# Mapping of property names to Qualified names used in model
rpsID=rps:id
rpsReign=rps:reign
rpsSession=rps:session
rpsDate=rps:date
rpsVersion=rps:version
rpsHeading=rps:heading
cmTitle=cm:title
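To show how a mapping file of this shape ties raw keys to qualified names, here is a minimal sketch that loads the properties and expands a prefixed name into the "{namespace-uri}localName" form. The class is hypothetical and stands in for what Alfresco's mapping extracter does internally; it uses only java.util.Properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical resolver for a mapping file with namespace.prefix.* entries.
final class MappingResolver {
    private static final String NS_PREFIX_KEY = "namespace.prefix.";
    private final Properties props = new Properties();

    MappingResolver(String propertiesText) throws IOException {
        props.load(new StringReader(propertiesText));
    }

    /** Resolves a raw key such as "rpsID" to "{namespace-uri}localName". */
    String resolve(String rawKey) {
        String prefixed = props.getProperty(rawKey);
        if (prefixed == null || prefixed.indexOf(':') < 0) {
            throw new IllegalArgumentException("No prefixed mapping for " + rawKey);
        }
        int colon = prefixed.indexOf(':');
        String prefix = prefixed.substring(0, colon);
        String uri = props.getProperty(NS_PREFIX_KEY + prefix);
        if (uri == null) {
            throw new IllegalArgumentException("Undeclared namespace prefix " + prefix);
        }
        return "{" + uri + "}" + prefixed.substring(colon + 1);
    }
}
```

With the file above, resolving rpsID yields {http://www.rps.ac.uk/ns/1.0}id, which is the qualified name of the rps:id property declared in the content model.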
rpsModel.xml (fragment showing aspect)
<aspect name="rps:metadata">
  <title>RPS Metadata</title>
  <properties>
    <property name="rps:id"><type>d:text</type></property>
    <property name="rps:reign"><type>d:text</type></property>
    <property name="rps:session"><type>d:text</type></property>
    <property name="rps:date"><type>d:text</type></property>
    <property name="rps:heading"><type>d:text</type></property>
    <property name="rps:version"><type>d:text</type></property>
  </properties>
</aspect>
webclient.properties
# I18N strings
rpsID=RPS ID
rpsReign=RPS Reign
rpsSession=RPS Session
rpsDate=RPS Date
rpsVersion=RPS Version
rpsHeading=RPS Heading
Using MODS
• Metadata Object Description Schema
• http://www.loc.gov/standards/mods/
• MODS is a resource discovery metadata standard
• Working on defining MODS data models
• For Project, Resource Type and Digital Object levels
• Will move RPS metadata into MODS fields
Introduction
• Creates an action for scanning files for viruses
• Uses ClamAV
• http://www.clamav.net/lang/en/
• Can be configured for other tools
• Emails creator of file if virus found
• Deletes file from repository if virus found
• Can be run as space rule
• Compile as AMP using Alfresco SDK
antivirus-action.xml (fragment)
<bean id="antivirus-action" class="uk.ac.st_andrews.repo.action.executer.AntivirusActionExecuter"
      parent="action-executer">
  <!-- services needed by bean -->
  <property name="contentService"><ref bean="contentService" /></property>
  <property name="nodeService"><ref bean="nodeService" /></property>
  <property name="templateService"><ref bean="templateService" /></property>
  <property name="actionService"><ref bean="actionService" /></property>
  <property name="personService"><ref bean="personService" /></property>
  <!-- person that email will come from, defined in alfresco-global.properties -->
  <property name="fromEmail">
    <value>${antivirus.mailer}</value>
  </property>
  <!-- path to Freemarker template, defined in alfresco-global.properties -->
  <property name="emailTemplate">
    <value>${antivirus.template}</value>
  </property>
antivirus-action.xml (fragment, cont.)
  <property name="command">
    <bean class="org.alfresco.util.exec.RuntimeExec">
      <property name="commandMap">
        <map>
          <!-- command to run, ${antivirus.exe} set in alfresco-global.properties, ${source} in Java class -->
          <entry key=".*" value="${antivirus.exe} ${source}"/>
        </map>
      </property>
      <property name="errorCodes">
        <value>1</value><!-- exit code 1 indicates that virus was found -->
      </property>
    </bean>
  </property>
</bean>
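The command template and exit-code handling configured above can be illustrated outside Alfresco with plain JDK classes. This is a sketch under stated assumptions, not the RuntimeExec implementation: the class, enum and method names are hypothetical, and the exit codes follow ClamAV's clamscan convention (0 for no virus, 1 for virus found, anything else treated as an error).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for running a scanner from a "${antivirus.exe} ${source}" template.
final class ScanCommandSketch {

    enum Verdict { CLEAN, INFECTED, ERROR }

    /** Splits the template on whitespace and substitutes ${source} with the file path. */
    static List<String> buildCommand(String template, File source) {
        List<String> command = new ArrayList<>();
        for (String token : template.trim().split("\\s+")) {
            command.add(token.replace("${source}", source.getAbsolutePath()));
        }
        return command;
    }

    /** Interprets a clamscan-style exit code. */
    static Verdict interpret(int exitCode) {
        switch (exitCode) {
            case 0:
                return Verdict.CLEAN;
            case 1:
                return Verdict.INFECTED;
            default:
                return Verdict.ERROR;
        }
    }

    /** Runs the scanner on the file and returns the verdict. */
    static Verdict scan(String template, File source) throws Exception {
        Process process = new ProcessBuilder(buildCommand(template, source))
                .inheritIO()
                .start();
        return interpret(process.waitFor());
    }
}
```

Passing the command as a token list avoids shell quoting problems with the temporary file's path, although a scanner path that itself contains spaces would need a different way of splitting the template.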
AntivirusActionExecuter
package uk.ac.st_andrews.repo.action.executer;
public class AntivirusActionExecuter extends ActionExecuterAbstractBase
{
public void setContentService(ContentService contentService);
public void setNodeService(NodeService nodeService);
public void setTemplateService(TemplateService templateService);
public void setActionService(ActionService actionService);
public void setPersonService(PersonService personService);
public void setFromEmail(String fromEmail);
public void setCommand(RuntimeExec command);
public void setEmailTemplate(String emailTemplate);
public void init();
protected void addParameterDefinitions(List<ParameterDefinition> paramList);
protected void executeImpl(final Action ruleAction, final NodeRef actionedUponNodeRef);
}
AntivirusActionExecuter.executeImpl (fragment)
// put content into temp file
ContentReader reader =
    contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
String fileName =
    (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
File sourceFile =
    TempFileProvider.createTempFile("anti_virus_check_", "_" + fileName);
reader.getContent(sourceFile);
// set source property for command
Map<String, String> properties = new HashMap<String, String>(1);
properties.put(VAR_SOURCE, sourceFile.getAbsolutePath());
// execute the antivirus command
ExecutionResult result = null;
try
{
    result = command.execute(properties);
}
catch (Throwable e)
{
    throw new AlfrescoRuntimeException("Antivirus check error: \n" + command, e);
}
AntivirusActionExecuter.executeImpl (fragment, cont.)
// try to get document creator's details
String creatorName = (String) nodeService.getProperty(actionedUponNodeRef,
    ContentModel.PROP_CREATOR);
if (null == creatorName || 0 == creatorName.length())
{
    throw new Exception("couldn't get creator's name");
}
NodeRef creator = personService.getPerson(creatorName);
if (null == creator)
{
    throw new Exception("couldn't get creator");
}
String creatorEmail = (String) nodeService.getProperty(creator,
    ContentModel.PROP_EMAIL);
if (null == creatorEmail || 0 == creatorEmail.length())
{
    throw new Exception("couldn't get creator's email address");
}
AntivirusActionExecuter.executeImpl (fragment, cont.)
// put together message
Map<String, Object> model = new HashMap<String, Object>(8, 1.0f);
model.put("filename", fileName);
model.put("message", result);
String emailMsg = templateService.processTemplate("freemarker", emailTemplate, model);
// send email message
Action emailAction = actionService.createAction("mail");
emailAction.setParameterValue(MailActionExecuter.PARAM_TO, creatorEmail);
emailAction.setParameterValue(MailActionExecuter.PARAM_FROM, fromEmail);
emailAction.setParameterValue(MailActionExecuter.PARAM_SUBJECT,
    "Virus found in " + fileName);
emailAction.setParameterValue(MailActionExecuter.PARAM_TEXT, emailMsg);
emailAction.setExecuteAsynchronously(true);
actionService.executeAction(emailAction, null);
// delete node (the temporary aspect makes the deletion bypass the archive store)
nodeService.addAspect(actionedUponNodeRef, ContentModel.ASPECT_TEMPORARY, null);
nodeService.deleteNode(actionedUponNodeRef);
Introduction
• Metadata Encoding and Transmission Standard (METS)
• http://www.loc.gov/standards/mets/
• METS is a wrapper for other metadata documents
• Plan to generate METS documents containing/referencing:
• Ingested files
• Renderings of these files (thumbnails, reference copies, archival
formatted versions etc.)
• Resource discovery metadata
• Technical metadata
• Fedora Commons can ingest METS documents as SIPs
• http://fedora-commons.org/
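Since the METS generation is described only as a plan, the following is a rough sketch of what a minimal wrapper referencing ingested files could look like, built with the JDK's DOM API. The class name, the FILE_n identifiers and the flat structure are assumptions; a real SIP for Fedora Commons would also need the metadata sections and whatever profile Fedora expects.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: wraps a list of file URLs in a minimal METS document.
final class MetsWrapperSketch {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    static String wrap(List<String> fileUrls) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        Element mets = doc.createElementNS(METS_NS, "mets:mets");
        // Declare both namespaces explicitly on the root element.
        mets.setAttributeNS(XMLNS_NS, "xmlns:mets", METS_NS);
        mets.setAttributeNS(XMLNS_NS, "xmlns:xlink", XLINK_NS);
        doc.appendChild(mets);

        // fileSec lists the files; structMap points at them by ID.
        Element fileGrp = child(doc, child(doc, mets, "fileSec"), "fileGrp");
        Element div = child(doc, child(doc, mets, "structMap"), "div");

        for (int i = 0; i < fileUrls.size(); i++) {
            String id = "FILE_" + (i + 1);
            Element file = child(doc, fileGrp, "file");
            file.setAttribute("ID", id);
            Element locat = child(doc, file, "FLocat");
            locat.setAttribute("LOCTYPE", "URL");
            locat.setAttributeNS(XLINK_NS, "xlink:href", fileUrls.get(i));
            child(doc, div, "fptr").setAttribute("FILEID", id);
        }

        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Creates a METS-namespaced child element and appends it to the parent. */
    private static Element child(Document doc, Element parent, String localName) {
        Element element = doc.createElementNS(METS_NS, "mets:" + localName);
        parent.appendChild(element);
        return element;
    }
}
```

Renderings, resource discovery metadata and technical metadata would be added as further file groups and as dmdSec and amdSec sections, which this sketch omits.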
Project source code available on Alfresco Forge
• FITS in Alfresco
• http://forge.alfresco.com/projects/fitsinalfresco/
• RPS Metadata Extracter
• http://forge.alfresco.com/projects/rpsmetadata/
• Antivirus
• http://forge.alfresco.com/projects/antivirus/
University of St Andrews Digital Archiving Project
• http://www.st-andrews.ac.uk/itsupport/academic/arts