Incompatible or Interoperable?

A METS bridge for a small gap between two digital
preservation software packages
Aaron Collie
Digital Curation Librarian
collie@msu.edu
Lucas Mak
Metadata & Catalog Librarian
makw@msu.edu
What we wanted
What we found
What we did
We bridged a gap, but we didn’t close the bridge
Archivematica Output:
▪ AIP
▪ DIP
▪ METS.xml
Fedora Commons Input:
▪ METS Fedora Extension
• fedora-batch-ingest.sh
▪ Datastreams!
[Diagram labels: ProQuest, Humans, XSL, METS 1.8, METS Fedora Ext. 1.1, AIP, DIP, Staging (12 TB), Serving (5 TB), Dark Archive (84 TB)]
[Diagram labels: METS 1.8, METS Fedora Ext. 1.1, Staging (12 TB), Persistent ID (PID), PQ_DATA, DC, MODS, PREMIS, (…), METS, DIP, AIP (“E”)]
Why?
▪ We wanted to be able to control and systematize ingest at the microservice level
▪ And we like the direction Archivematica is taking
▪ We wanted to pipe technical and preservation metadata into Fedora Commons
▪ This was the reason we got started
▪ We haven’t contributed to the open source community, and we wanted something to learn on.
▪ We are thinking of it as professional development…
Comparing Archivematica & Fedora METS
▪ Different schema
▪ Archivematica: METS v. 1.8
• http://www.loc.gov/standards/mets/version18/mets.xsd
▪ Fedora: Fedora METS 1.1
• http://fedora-commons.org/definitions/1/0/mets-fedora-ext1-1.xsd
▪ Differences in structure, elements, attributes, & values allowed
▪ <structMap>
▪ Archivematica
• Physical structMap of the bag (i.e. directory structure)
▪ Fedora: No <structMap> per v.1.1*
▪ Solution:
▪ Structure represented by the GROUPID & SEQ attributes of <mets:file>
▪ SEQ set from the page no. embedded in the filename
• Only physical arrangement is possible unless the file naming convention is changed to include logical info
▪ GROUPID set by file type/usage (e.g. preservation master, high/low resolution access copies)
* <mets:structMap> is allowed in schema v.1.0 (used until Fedora 3.0)
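A minimal Python sketch of how SEQ and GROUPID might be derived from a filename, as described above. The filename pattern (page number taken as the last run of digits before the extension), the extension-to-group table, and the group names are all assumptions made for illustration; the slides do not state the actual naming convention or group labels.

```python
import re
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a usage group (GROUPID).
GROUP_BY_EXTENSION = {
    ".tif": "PRESERVATION_MASTER",
    ".jp2": "ACCESS_HIGH",
    ".jpg": "ACCESS_LOW",
}

# Assumption: the page number is the last run of digits before the extension,
# e.g. "thesis_0012.tif" is page 12.
PAGE_NUMBER = re.compile(r"(\d+)(?=\.[^.]+$)")


def file_attributes(filename):
    """Return the GROUPID (and SEQ, if a page number is found) for one file."""
    path = PurePosixPath(filename)
    attrs = {"GROUPID": GROUP_BY_EXTENSION.get(path.suffix.lower(), "OTHER")}
    match = PAGE_NUMBER.search(path.name)
    if match:
        # SEQ is an integer in METS, so leading zeros are dropped.
        attrs["SEQ"] = str(int(match.group(1)))
    return attrs
```

Because SEQ comes only from the filename, this reflects the limitation noted above: the ordering is purely physical unless the naming convention itself carries logical information.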
▪ <fileSec>
▪ Archivematica
• Two file groups: “Original” & “Submission documentation”
– Original: digital objects
– Submission documentation: descriptive metadata XML files
▪ Fedora
• Datastreams to be ingested as files
– Files of digital objects and others (e.g. Archivematica METS)
– Descriptive metadata XML files are ingested as “inline XML datastreams”
» Copy all XML files in “Submission documentation” into separate <dmdSecFedora> elements
▪ <amdSec>
▪ Archivematica: Hierarchical structure
<amdSec ID="amdSec1">
  <techMD ID="techMD1"/>
  …
  <digiProvMD ID="digiProvMD1"/>
</amdSec>
<amdSec ID="amdSec2">
  <techMD ID="techMD2"/>
  …
  <digiProvMD ID="digiProvMD2"/>
</amdSec>
• 1 digital file has 1 <amdSec>
• All <techMD>, <rightsMD>, <sourceMD> and <digiProvMD> pertaining to the same file are nested under the same <amdSec>
▪ Fedora: Flat structure
<amdSec ID="tech1">
  <techMD ID="tech1.0"/>
</amdSec>
<amdSec ID="digiProv1">
  <digiProvMD ID="digiProv1.0"/>
</amdSec>
<amdSec ID="tech2">
  <techMD ID="tech2.0"/>
</amdSec>
• To accommodate inline XML datastream versioning
– ID (syntax DSn.v) contains both:
» the number of the inline datastream (n) and
» the version number of the datastream (v)
– Each individual <amdSec> serves as a container, and its ID indicates the datastream number
– <techMD> and the like use their IDs to indicate the datastream version number
▪ ADMID attribute in <mets:file>
▪ Archivematica
• Points to one <amdSec> per file, which has <techMD>, <rightsMD>, <sourceMD>, and <digiProvMD> nested within
– <mets:file ID="file1" ADMID="amdSec1"/>
▪ Fedora
• Points to multiple <amdSec> per file, each of which contains a <techMD>, <rightsMD>, <sourceMD>, or <digiProvMD>
– <mets:file ID="file1" ADMID="tech1 rights1 source1 digiProv1"/>
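The restructuring described on the last two slides (one nested <amdSec> per file split into one flat <amdSec> per metadata section, with the ADMID pointers rewritten to match) can be sketched in Python as follows. This is an illustration under stated assumptions, not the XSL actually used: the function name, the ID prefixes, the per-type counters, and the fixed version suffix ".0" are all hypothetical choices modelled on the example IDs shown in the slides.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# Assumed ID prefixes, modelled on the "tech1" / "digiProv1" examples in the slides.
PREFIXES = {
    "techMD": "tech",
    "rightsMD": "rights",
    "sourceMD": "source",
    "digiProvMD": "digiProv",
}


def q(name):
    """Qualify a local element name with the METS namespace."""
    return f"{{{METS_NS}}}{name}"


def flatten_amdsecs(root):
    """Split nested amdSecs into one amdSec per section and rewrite ADMID."""
    old_secs = root.findall(q("amdSec"))
    if not old_secs:
        return {}
    insert_at = list(root).index(old_secs[0])
    counters, id_map, new_secs = {}, {}, []
    for old in old_secs:
        new_ids = []
        for md in list(old):
            prefix = PREFIXES.get(md.tag.rsplit("}", 1)[-1])
            if prefix is None:
                continue  # unknown children are dropped in this sketch
            counters[prefix] = counters.get(prefix, 0) + 1
            sec_id = f"{prefix}{counters[prefix]}"  # datastream number (n)
            md.set("ID", f"{sec_id}.0")             # version (v) fixed at 0
            new_sec = ET.Element(q("amdSec"), {"ID": sec_id})
            new_sec.append(md)
            new_secs.append(new_sec)
            new_ids.append(sec_id)
        id_map[old.get("ID")] = new_ids
        root.remove(old)
    for offset, sec in enumerate(new_secs):
        root.insert(insert_at + offset, sec)
    # Each file now points to several flat amdSecs instead of one nested one.
    for file_el in root.iter(q("file")):
        admid = file_el.get("ADMID")
        if admid:
            ids = []
            for old_id in admid.split():
                ids.extend(id_map.get(old_id, [old_id]))
            file_el.set("ADMID", " ".join(ids))
    return id_map
```

The counters are kept per metadata type so that every generated <amdSec> ID stays unique even when one file carries several sections of the same type.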
▪ <dmdSec>
▪ Archivematica
• Only 1 Dublin Core record is allowed to describe the SIP
• Constrained by Archivematica workflow instead of METS schema
• Additional descriptive metadata XML records are included in the “Submission documentation” folder
▪ Fedora
• Fedora extension element: <dmdSecFedora>
• Allowed MDTYPE: MARC, EAD, DC, NISOIMG, LC-AV, VRA, TEI Header, DDI, FGDC, & OTHER
• Copy XML files in “Submission documentation” folder into separate <dmdSecFedora>
– MODS has to be labeled as “OTHER”
– Use namespace URI to assign correct “MDTYPE”
» Does not work with TEI Header or EAD
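A small Python sketch of the namespace-based MDTYPE assignment mentioned above. The lookup table and the function name are assumptions for illustration; only the rule that MODS is labeled “OTHER” comes from the slides. Anything not in the table, including records with no namespace, falls back to “OTHER”, and TEI Header and EAD are deliberately left out because the slides report that this approach does not work for them.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table from root-element namespace URI to MDTYPE.
NAMESPACE_TO_MDTYPE = {
    "http://purl.org/dc/elements/1.1/": "DC",
    "http://www.openarchives.org/OAI/2.0/oai_dc/": "DC",
    "http://www.loc.gov/MARC21/slim": "MARC",
    "http://www.loc.gov/mods/v3": "OTHER",  # MODS has to be labeled "OTHER"
}


def mdtype_for(xml_text):
    """Pick an MDTYPE from the namespace URI of the record's root element."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    else:
        namespace = ""  # record has no namespace
    return NAMESPACE_TO_MDTYPE.get(namespace, "OTHER")
```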
▪ <mets:metsHdr>
▪ Archivematica
• Does not use (optional in METS schema)
▪ Fedora
• RECORDSTATUS attribute to indicate whether the object is “active”, “inactive” or “deleted”
• Hard-coded with constant data
<mets:metsHdr RECORDSTATUS="A">
  <mets:agent ROLE="IPOWNER" TYPE="ORGANIZATION">
    <mets:name>MSU Libraries Digital and Multimedia Center</mets:name>
  </mets:agent>
</mets:metsHdr>
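For illustration, the constant header above could be added to a METS document with a few lines of Python. This is a sketch only: the function name and parameters are hypothetical, and it assumes the header should replace any existing <mets:metsHdr> and be placed first, since the METS schema puts metsHdr ahead of the other sections.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(name):
    """Qualify a local element name with the METS namespace."""
    return f"{{{METS_NS}}}{name}"


def add_mets_header(root, owner_name, status="A"):
    """Insert a constant metsHdr as the first child of the METS root."""
    existing = root.find(q("metsHdr"))
    if existing is not None:
        root.remove(existing)  # assumption: replace rather than merge
    header = ET.Element(q("metsHdr"), {"RECORDSTATUS": status})
    agent = ET.SubElement(
        header, q("agent"), {"ROLE": "IPOWNER", "TYPE": "ORGANIZATION"}
    )
    ET.SubElement(agent, q("name")).text = owner_name
    root.insert(0, header)
    return header
```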
▪ OWNERID attribute in <mets:file>
▪ Archivematica
• Does not use (optional in METS schema)
▪ Fedora
• To indicate whether the file is “managed by Fedora internally”, “externally referenced”, or “redirected”
– Though optional according to Fedora-METS schema
• Determined based on filename or file format
– Archivematica adds “checksum” into the filename for files generated during the preservation workflow
Proposed Workflow
[Diagram labels: Staging Area (12 TB), METS, AIP, Dark Archive(s) (84 TB), DIP, Serving Share(s), Web Display]
What a bridge gets us:
▪ Automatically extracts and captures technical & preservation metadata
▪ Eases handling of complex objects with lots of metadata or parts
▪ Maintains and manages separate AIP/DIP packages
What a full integration might benefit from:
▪ Archivematica A/DIP Content Model & Solution Pack
▪ Integrated AIP management
▪ Including dashboard GUI
▪ Including JMS messaging
▪ Integrated rebuilds from filesystem
▪ Currently supported in Fedora Commons
▪ On roadmap for Archivematica
▪ Automated ingest, improved handling
Questions?
▪ Lucas Mak (makw@msu.edu)
▪ Aaron Collie (collie@msu.edu)