Archiving
David Nathan
ELDP Training Workshop, March 2010

Archiving: what do you think of?

What is a language archive, then?

What is a digital language archive?
- a forum / platform for data providers and data users to negotiate and exchange
- a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material
- has policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats
- a collection of managed materials

OAIS model
- OAIS archives define three types of ‘packages’: ingestion, archive, dissemination
[Diagram: Producers -> Ingestion -> Archive -> Dissemination -> Designated communities]

What is archiving of language materials?
- preparing materials in a structured, well-documented, and complete form
- building long-term relationships
- it is not just backup
- it is not just dissemination/publication
- it does not define good linguistic practice

What can a language archive offer?
- Security - keep your electronic materials safe
- Preservation - store your materials for the long term
- Discovery - help others to find out about your materials, and you to find out about users
- Protocols - respect and implement sensitivities, restrictions
- Sharing - share results of your work, if appropriate
- Acknowledgement - create citable acknowledgement
- Mobilisation - create usable language materials for communities
- Quality and standards - advice for assuring your materials are of the highest quality and robust standards

Kinds of language archives
- many cross-cutting classifications:
  - Indigenous and local, e.g. Squamish Nation, “language centres”
  - regional, e.g. AILLA, Paradisec
  - international, e.g. DoBeS, ELAR
  - associated with research institute, e.g. AIATSIS, ANLC
  - grant-driven deposits, e.g. DoBeS, ELAR
  - digital vs physical vs mixed, e.g.
    DoBeS vs Vienna Sound Archive, ANLC

Potential users
- depositors: deposit, access or update materials
- speakers and their descendants (“majority of users of Berkeley Language Center archive are community members”)
- other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc.
- other “stakeholders”, e.g. educationalists
- journalists and the wider public

Archives networks and bodies
- foundation concepts and technologies from library initiatives, e.g. D-LIB http://www.dlib.org/
- OAI (Open Archives Initiative)
- OAIS (Open Archival Information System; NASA and other space agencies, incl. JAXA)
- Open Language Archives Community (OLAC)
- Digital Endangered Languages and Musics Archives Network (DELAMAN)
  - ELAR, DoBeS, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori)

Archives networks and bodies (continued)
- DELAMAN’s interests and activities include:
  - language archiving training coordination and syllabus
  - citation of deposits (for academic recognition of deposited corpora)
  - archive federations (for seamless access to resources across archives)

Citation examples
(courtesy Heidi Johnson of AILLA)
Collection:
  Sherzer, Joel. "Kuna Collection." The Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Media: audio, text, image. Access: 0% restricted.
File/resource:
  Sherzer, Joel (Researcher). (1970). "Report of a curing specialist." Kuna Collection. Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Type: transcription & translation. Media: text. Access: public. Resource ID: CUK001R001.

Why is language archiving different?
- what is a language?
- the data is not conventionalised (like $, age, year of publication etc.): what and how to code?
- varying and competing expectations

And endangered languages archiving?
- extremely diverse context: languages, cultures, communities, individuals, projects
- typical source is fieldworkers
- no established genres
- difficult for archive staff to manage
- sensitivities and restrictions
- extremely high priority

Endangered Languages ARchive (ELAR)
- one of 3 semi-autonomous programs of the Hans Rausing Endangered Languages Project
- staff of 3: archivist, software developer, technician (plus research assistants etc.)
- develop policies, preservation infrastructure, cataloguing and dissemination, facilities, training, advice, materials development and publishing

ELAR’s holdings
- ELAR currently holds about 50 deposits with a total volume of approx. 4 TB
- the average deposit is about 80 GB
- sizes vary widely, with a small number of huge deposits; the median size is around 15 GB
- we expect volume to nearly double over the next 18 months
- see next slides for distribution of data types

ELAR holdings by data type
- data types for a 25% sample of holdings (early 2008)
- data type by volume (MB) and number of files, sorted by volume

Data type Volume (MB)   Files
audio         360,411   6,312
video         208,995     895
image          28,592   2,221
msword            223     404
pdf               196     134
eaf                33     176
text               32     781
lex                 9      29
trs                 5     246
xls                 1      19
imdi                1      26

The way we were ...
- ASEDA: Aboriginal Studies Electronic Data Archive, AIATSIS Canberra, founded early 1990s (modelled on Oxford Text Archive)
- receive and catalogue electronic materials that were at risk or not accessible
  - lexica
  - grammars
  - texts

How things have changed ...
- types of data (modalities and genres)
  - now predominantly media / documentation
- storage methods
  - now “professional”, mass data systems
- standardisation and metadata
  - now various standards for data and metadata
- dissemination
  - now web-based dissemination
- expanded influence into practice and workflow of linguists

Why digital?
- preservation: digitisation is the only way that media (audio and video) can be preserved for the future, because digital data can be copied and transmitted with zero loss
- cataloguing, sharing, dissemination all facilitated

Digital disadvantages
- digital data is fragile and ephemeral
- cost (human, equipment, maintenance)
- requires strategy and luck to get infrastructure right
- preservation depends on file and data formats
  - depends on tools and software
  - depends on formats (prefer standard, open, explicit, long-lasting)
  - materials may have to be converted and migrated
  - some formats require particular software (can we archive the software?)
- these issues impact on archive policy
  - how to balance the cost of handling and preservation with the value of materials?
  - how to provide long-term preservation when our funding is time-limited?

The archiving process (depositors’ view)

Documenter and archive interactions
- grant formulation and application
- communications, questions, advice
- training
- archiving services (transfer, conversion etc.)
- ongoing management of materials

Query/interaction topics
- analysis of approx. 150 queries from documenters/linguists

ELAR Feedback template
ELAR Data Sample Evaluation
Prepared for:    By:    Date:

TEXT - xx file
- Document type
- Document format/layout/data structures
- Character/language representation
- Linking/references
- Consistency

AUDIO
- Document type/format
- Resolution
- Quality
- Editing
- Length
- Annotation/transcription
- Consistency

VIDEO
- Document type/format
- Resolution
- Quality
- Editing
- Length
- Annotation/transcription
- Consistency

GENERAL
- File naming
- Data volume
- Delivery
- Consistency

Example detail (section: Document format)
Use of typography (size, underlining, bold, spaces etc.) to make headings and other structures is weak - at least Styles should be used (with complete consistency).
Using tables to represent interlinear data is reasonably appropriate, although they would need to be converted later. Is it clear from this document, or somewhere else, where to look up codes etc., such as the speaker initials? While the language is consistently labelled in the interlinear section, it is identified only by the alternation in font in the first section.

Example detail (section: Audio quality)
- AD-MD03a 4Noe Song thami miya.wav - quality good.
- AD-MD04b 33Boa Sr. LongNarrativeOnTsunami.wav - quality reasonable, but background hiss is too loud in proportion to the signal. Was this part of your original recording (on what equipment?) or was it introduced by digitisation, in which case it would be a good idea to try re-digitising?
- AD-MD05b 34Peje Phonetic Variation.wav - quality quite good. Stereo separation of voices is nice.
- CIILQ Seasons Contd 699-703.wav - suffers a number of faults, including severe clipping (overmodulation), background noise, microphone physical handling, and poor acoustic representation (probably due to poor microphone and/or recorder?).

Audio evaluation
- using Dobbin software from Cube-Tec, who make Quadriga
- audio evaluation, conversion and reporting
[Dobbin screenshots]

What can you archive (at ELAR)?
- media - sound, video
- graphics - images, scans
- text - fieldnotes, grammars, description, analysis
- structured data - aligned and annotated transcriptions, databases, lexica
- metadata - structured, standardised contextual information about the materials

Archive objects
- an “object” could be a file, a set of files, a directory, a “session” or a set of files with relationships between them
- these are often called “bundles”
- like all structures, these should be made explicit, e.g. through metadata
- our new catalogue system will provide a facility to create and label bundles

Data “portability” (Bird & Simons 2003)
- data should also be “portable” (Bird & Simons “Seven Dimensions ...”)
  - complete
  - explicit
  - documented
  - preservable
  - transferable
  - accessible
  - adaptable
  - not technology-specific
  - (also appropriate, accurate, useful etc!!)

Archive material should be selected
- example:
  - Depositor’s question: How much video can I archive?
  - answer: ...
- however, unlikely that linguist is in position to plan and consistently create excellent video, so selection is unavoidable
- data has always been edited and selected!

(...
selection)
- in your linguistic work you also:
  - selected
  - labeled
  - transformed/processed/edited
  - added, corrected, expanded
  - made links
  - made or assumed relationships between “whole” and processed units
  - invented labels, IDs, scope etc.
  - imposed formats

File organisation example 1
IPF10011-Disk3-Story-WulaTuki-LunarEclipse
  IMDI_3.0.xsd
  WulaTuki_LunarEclipse.eaf
  WulaTuki_LunarEclipse.imdi
  WulaTuki_LunarEclipse.imdi.backup
  WulaTuki_LunarEclipse.pfs
  WulaTuki_LunarEclipse.txt
  WulaTuki_LunarEclipse.wav

File organisation example 2
/
  labelling-system.doc
  AngryD-Bsi
    AngryD-Bsi.pdf
    AngryD-Bsi.wav
    AngryD-Bsi.doc

File organisation example 3
/
  archivist_notes.txt
  ELAN transcription key FTG0025.pdf
  Overview metadata FTG0025.xls
  [open]
    Kay07-aud
      Kay07-aud-jul03a.wav
      Kay07-aud-jul03b.wav
      Kay07-aud-jul03c.wav

Metadata
- metadata is the data about data that enables the management, identification, retrieval and understanding of that data
- reflects the knowledge and practice of data providers
- defines and constrains audiences and usages for data
- documentation’s goals heighten the importance of metadata

Metadata formats
- common or standard:
  - IMDI (‘ISLE Metadata Initiative’, from DoBeS)
  - OLAC (Open Language Archives Community)
  - EAD, and others
- ELAR:
  - has created its own set, currently in implementation
  - deposit-wide metadata in deposit form
  - file-level metadata (will be) by web form
  - also, depositor’s own metadata

On metadata formats
- each depositor can also have different metadata!
- types of metadata are relative to each project, consultants, community ...
- our goal: to maximise the amount and quality of metadata
- quality and extent are more important than standards and comparability
- many depositors are sending extensive metadata in a variety of formats, including spreadsheets

Types of metadata
- depositor's / delegates' details
- descriptive metadata
- administrative metadata
- preservation metadata
- access protocols
- metadata for individual files

Depositors and delegates
- name
- address
- contact details (telephone, fax, email, URL)
- role
- affiliation
- date of birth
- nationality

Descriptive metadata
- title, description, subject, summary
- keywords
- subject language, community
- location
- time span

Administrative metadata
- project details
- funding and hosting institutions
- details of external copies
- modifications and status
- details of accession agreement (cf. deposit form)
- access: access protocols (see elsewhere)
- group membership
- identification

Preservation metadata
- carrier media
- formats, size
- provenance (source/history)

File-level metadata
- media files
  - duration, file size
  - MIME type, content type
- text files
  - font, character set, encoding
  - format, markup
- access protocols

Access protocol
- sensitivities, restrictions: identification, description and implementation
- deposit, file or object-level protocol
- depositor-oriented
- change/manage protocol over time
- delegate
- other rights holders
- sunset clause

Protocol grows naturally with documentation
- focus on recorded data » more people, more genres, less researcher knowledge
- community participation » framework for speakers to shape documentation process and products
- mobilisation » selecting, juxtaposing; community participation
- focus on revitalisation » which language to teach? who to host and teach? who can learn?
  etc.
- time » significance and sensitivities change over time
- access » increasing scope for dissemination, control of IP

Other kinds of metadata
- information to make resources accessible to community members
  - genres, e.g. songs
  - languages, e.g. community language
  - materials for language teaching and learning
- types of metadata are relative to the particular project, consultants, community ...

Archiving and data management
- Most data-related issues are properly part of linguistic data management
- There are now few data-related issues that are archive-specific
- But teaching curricula, training, and practices need time to catch up
- Ultimate goal of documenting languages well means that we must find the optimal “division of labour” in each case

ELAR assists depositors
- preserve your deposited materials
- implement your access restrictions etc.
- provide advice, general and specific assistance, e.g. data conversion
- provide web-based deposit management
- allow updates and additions
- provide some equipment and services on a case by case basis, develop resources

What is required to make a deposit?
- resource(s) for an endangered language
  - it could be just one file
- inventory / metadata
- deposit form
  - an online version will be available soon
- deposits can be updated, supplemented, metadata added/modified

How do depositors deliver data?
- Hard disks
  - we return them
  - we send them out
  - some grant applicants factor them into grants
- Email
  - good for samples for evaluation
  - OK for most text materials
- Flash cards and USB sticks
- A web upload facility will be available later
- Web download

What about CDs and DVDs?
- we have found CDs, and especially DVDs, to be very unreliable
  - DVD fail rate about 10%
- they cause confusion, as files are allocated to fit on disks, not according to corpus structure
- they create a lot of work for depositors and for ELAR

We ask depositors to
- manage materials well
- collect and provide protocol information
- deliver materials, metadata
- send trial samples etc.
- (funded grantees) not withhold materials
- share/manage/delegate custodianship of materials
- maintain relationships with language stakeholders and ELAR

ELAR online
- We now have an ELAR online archive, although data is only just starting to be released to public view: http://elar.soas.ac.uk/
- The archive has been implemented using a Content Management System, in this case Drupal:
  - open-source web software based on PHP, MySQL and JavaScript
  - implements user, role and group-based access to materials
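
Sketch: user, role and group-based access

The last point above can be illustrated with a minimal sketch of how a user, role and group-based access check might work. This is not Drupal's API and not ELAR's implementation. The class names, field names, role and group labels, and resource ID are all hypothetical, chosen only to show the idea that a resource carries its own access protocol and each request is checked against it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """A person requesting access (hypothetical model, not Drupal's user object)."""
    name: str
    roles: frozenset = frozenset()
    groups: frozenset = frozenset()


@dataclass(frozen=True)
class Resource:
    """An archived file or bundle with its access protocol (hypothetical fields)."""
    resource_id: str
    depositor: str
    public: bool = False
    allowed_roles: frozenset = frozenset()
    allowed_groups: frozenset = frozenset()


def can_access(user: User, resource: Resource) -> bool:
    """Return True if the user may view the resource under its protocol."""
    if resource.public:
        return True                        # open to everyone
    if user.name == resource.depositor:
        return True                        # depositors always see their own materials
    if user.roles & resource.allowed_roles:
        return True                        # at least one matching role
    return bool(user.groups & resource.allowed_groups)  # at least one matching group


# Illustrative data only; the names and IDs are invented.
depositor = User("depositor_a", roles=frozenset({"depositor"}))
community_member = User("speaker_b", groups=frozenset({"community_x"}))
visitor = User("visitor_c")

recording = Resource(
    resource_id="example-001",
    depositor="depositor_a",
    allowed_groups=frozenset({"community_x"}),
)

print(can_access(depositor, recording))         # True
print(can_access(community_member, recording))  # True
print(can_access(visitor, recording))           # False
```

Keeping the protocol on the resource rather than on the user is one possible design. It would let a depositor change a restriction over time, or add a sunset clause, by editing a single record.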