Archiving
David Nathan

Endangered Languages ARchive (ELAR)
  one of 3 semi-autonomous programs of the Hans Rausing Endangered Languages Project
  staff of 3: archivist, software developer, technician, (research assistants etc)
  develop preservation infrastructure, cataloguing and dissemination; policies; facilities; training and advice; materials development and publishing

What is a digital language archive?
  A trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material
  Will have policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats
  A collection of managed materials

What is archiving of language materials?
  Preparing materials in a structured form suitable for long-term preservation
  Creating long-term relationships
  It is not backup
  It is not dissemination/publication
  It should not impinge on good linguistic practice

Kinds of language archives
  Many cross-cutting classifications:
  Indigenous vs outsider, eg. Squamish Nation
  Regional vs international, eg. AILLA, Paradisec; DoBeS, ELAR
  Associated with research institute, eg. AIATSIS, ANLC
  Granter-funded, eg. DoBeS, ELAR, OTA
  Digital vs physical vs mixed, eg. DoBeS vs Vienna Sound Archive, ANLC

Potential users
  Speakers and their descendants - up to 95% of users of UCB are community members
  Depositors - to create or renew materials
  Other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc
  Other “stakeholders”, eg educationalists
  Journalists and the wider public

Archives networks
  Digital Endangered Languages and Musics Archives Network (DELAMAN): ELAR, DOBES, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori)
  Open Language Archives Community (OLAC)
  Others, eg. D-LIB http://www.dlib.org/, Open Archives Initiative

Archive architectures
  [Figure: Producers -> Ingestion -> Archive -> Dissemination -> Designated communities]
  The archive needs to define three types of ‘packages’: ingestion, archive and dissemination.

The way we were ...
  ASEDA: Aboriginal Studies Electronic Data Archive at AIATSIS Canberra (modeled on Oxford Text Archive)
  opportunistically collect and catalogue electronic materials that were at risk or not accessible
  lexica, grammars, texts etc

How things have changed ...
  types of data (modalities and some genres)
  means of storage
  standardisation and metadata
  dissemination (most explosive)
  expanded into practice and workflow of linguists

What can a language archive offer?
  Security - keep your electronic materials safe
  Preservation - store your materials for the long term
  Discovery - help others to find out about your materials
  Protocols - respect and implement sensitivities, restrictions
  Sharing - share results of your work, if appropriate
  Acknowledgement - create citable acknowledgement
  Mobilisation - create usable language materials for communities
  Quality and standards - advice for assuring your materials are of the highest quality and robust standards

Preservation issues
  making materials robust
  making storage robust
  organisational, ownership and policy issues
  changing technologies
  refreshing
  migrating

Changing technologies
  data
  data model
  advantages of digital preservation
  based around copying
  also transmission, dissemination
  software implications
  robust formats (standard, open, explicit)
  formats with long horizons
  formats easy to refresh
  formats that don’t require particular software (but distinguish where software is intrinsic)
  may have to describe software or even archive the software

Two preservation models
  “preserving the bytestream”
  LOCKSS: “lots of copies keep stuff safe” http://lockss.stanford.edu/
  guess which community it came from! (plus ...)
  distributed archiving

Some backup issues
  risk management
  undetected problems
  useless backups
  under some circumstances, ELAR may provide backup

Documenter & archive interactions
  Grant formulation and application
  Communications, questions, advice
  Training
  Archiving

What can you archive?
  Media - sound, video
  Graphics - images, scans
  Text - fieldnotes, grammars, description, analysis
  Structured data - aligned and annotated transcriptions, databases, lexica
  Metadata - structured, standardised contextual information about the materials

Data portability
  Bird and Simons 2003: (for language documentation) our data needs to have integrity, flexibility, longevity and broad utility
  complete
  explicit
  documented
  preservable
  transferable
  accessible
  adaptable
  not technology-specific
  (also appropriate, accurate, useful etc!!)

Formats
  sound - WAV
  image - BMP, TIFF, JPEG. See full advice about images
  video - MPEG2
  text - plain text, with or without markup
  documents - plain text, PDF or postscript
  structured text - XML, other markup (with description of markup system)
  structured data in commonly available Office formats - ELAR will convert them to archive-suitable formats
  character encoding: preferred encoding is ASCII or Unicode
    clearly document any other encodings used, e.g. ISO 8859-5
    discuss with us if you use font substitution to handle non-Roman characters

Basic management points
  filenames - extension, ASCII, no spaces, check capitalisation
  use directory structures wisely
  versions
  distinguish formats - working, presentation, and archiving
  handling of characters, fonts, character sets
  file encoding and format (many file types are ASCII)
  metadata!
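
The filename points above can be checked mechanically before delivery. The following Python sketch is only an illustration, not an ELAR tool: the function name filename_problems is hypothetical, the limits (permitted characters, 30-character names, directory depth 8) are those given under "Format details" later in this deck, and the way directory depth is counted here (number of directories above the file) is an assumption.

```python
import re
from pathlib import PurePosixPath

# Limits as listed under "Format details"; how depth is counted is an assumption.
MAX_NAME_LENGTH = 30
MAX_DIR_DEPTH = 8
# Letters, digits and underscore, with exactly one period before the extension.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9_]+")


def filename_problems(relative_path: str) -> list[str]:
    """Return a list of problems for one file path inside a deposit (hypothetical helper)."""
    path = PurePosixPath(relative_path)
    name = path.name
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(
            "use only A-Z, a-z, 0-9, underscore and a single period before the extension"
        )
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"name is longer than {MAX_NAME_LENGTH} characters")
    if len(path.parts) - 1 > MAX_DIR_DEPTH:
        problems.append(f"directory depth is greater than {MAX_DIR_DEPTH}")
    if name != name.lower():
        problems.append("favour lower case letters")
    return problems


if __name__ == "__main__":
    for candidate in ("session1/rec6_lel.wav", "rec6-LEL.wav", "my notes.txt"):
        print(candidate, filename_problems(candidate))
```

A path with no problems returns an empty list, so the same function can be run over a whole directory listing and only the offending files reported.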
Data format duty cycle examples
                Raw        Working            Interchange  Archive   Dissemination
  Video         DVI        software-specific  MPEG-2       MPEG-2    MPEG2, AVI, QT
  Fieldnotes    Shoebox    Shoebox            FOSF         XML       WWW, print dictionary
  Audio         ATRAC      WAV                WAV          BWF       MP3
  Complex data  multiple   FM Pro database    RTF, XML     XML       Interactive application
  Multimodal    multiple   multiple           as above     as above  Multimedia application page

Archive objects
  informed by traditions, eg document archives
  sometimes, simply called a “resource”
  it could be a file, a set of files, a directory, a “session” or a coherent item with many parts
  should have archival qualities eg Bird & Simons “7 Dimensions” (or see Thieberger in LDD2)
  may impose standard structures or formats
  need deposit event and processes
    legal and protocol verification
    accession
    ongoing processes

Selection
  Example: video: How much volume allocated? Answer: ...
  However: unlikely that linguist is in position to plan and consistently create excellent video, so selection is unavoidable
  data has always been selected! (... selection)
  you also:
    selected
    labeled
    transformed/processed/edited
    added, corrected, expanded
    made links
    made or assumed relationships between “whole” and processed units
    invented labels, IDs, scope etc
    imposed formats

Examples

Characters
  Did my characters come through? Answer: ...
  há pa ki hená mázaska wikcémna nú pa iyóphewa-ye ks t
  DBW wóz?az?a-s?ni yeló
  DB OK wash things-NEG ASS.M
  perhaps ELAR should do it?
  'he didn't do the wash'
  However:
  wóz az a-s ni yeló
  DB OK wash things-NEG ASS.M
  'he didn't do the wash'

Preservation
  Is my file preservable? Note:
    characters?
    inconsistent segmentation
    data as comments
    conventions/metadata

  Text transcription: “Korimáka”
  Language: Choguita Rarámuri
  Language used for transcription: Spanish
  Consultant: Luz Elena León Ramírez
  Linguist: abriela Cabaero
  Transcription: erth Fuen & Gabrela Cabaero
  Date recorded: 11/02/2006
  Date tranbscribed: 11/02/2006
  Recording: rec6-LEL.wav

Knowledge representation 1 - before
  wama momol chi naron mon chayako (LB) / wama momol chi naron chayako (MD)
  wama momol chi nan mon chayako (more emphatic(LB) / wama momol chi nan chayako (MD)
  Why don't you and him do it?
  + Notes have both of these sentences without the negator mon.
  OK runon naynangkroy ile ri
  He ate their sago.
  * kipin kannangkroy ngolu
  intended: We ate their cassowary.
  OK kipin kanangkroy ngolu
  We ate their cassowary.

Knowledge representation 1 - after
  * kipin kannangkroy ngolu
  intended: We ate their cassowary.
  OK kipin kanangkroy ngolu
  We ate their cassowary.

  <sentence.set num="75">
    <version>
      <walman>Kipin kannangkroy ngolu</walman>
      <judgement>*</judgement>
    </version>
    <english>We ate their cassowary.</english>
  </sentence.set>
  <sentence.set num="76">
    <version>
      <walman>Kipin kanangkroy ngolu</walman>
      <judgement>OK</judgement>
    </version>
    <english>We ate their cassowary.</english>
  </sentence.set>

Knowledge representation 2
  <?xml version="1.0" encoding="UTF-8"?>
  <FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
  <PRODUCT BUILD="06/26/2002" NAME="FileMaker Pro" VERSION="6.0v2"/>
  <DATABASE DATEFORMAT="M/d/yyyy" LAYOUT="" NAME="Videos" RECORDS="13" TIMEFORMAT="h:mm:ss a"/>
  <METADATA>
  <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Index name" TYPE="TEXT"/>
  <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Image desc" TYPE="TEXT"/>
  <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Date" TYPE="TEXT"/>
  <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Content" TYPE="TEXT"/>
  </METADATA>
  <RESULTSET FOUND="13">
  <ROW MODID="16" RECORDID="40">
  <COL><DATA>Morly Beeta</DATA></COL>
  <COL><DATA>Interview with Morly Beeta</DATA></COL>
  <COL><DATA>Jan/13/05</DATA></COL>
  <COL><DATA>Obu history by Morly Beeta</DATA></COL>
  </ROW>

ELAR conversion - original
  Language                   Unangam Tunuu [Aleut Language]
  Dialects                   Qawalangin [Eastern Aleut]
                             Nii}u}i{ [Western Aleut]
  Speakers                   Maria Turnpaugh, Nick Lekanoff, Clara Golodoff
  Place recorded             Unalaska, AK. Ray Hudson Room, Unalaska Public Library.
  Date recorded              7.21.04
  Recording name             UNAK2trk1
  Duration                   16:21 min.
  Recorded by                Alice Taff
  Recording equipment        Marantz CDR 300 recorder with one flat filtered table-mounted cardiod microphone. Also audio/video miniDV - Canon GL2.
  Translated by              Alice Taff with Maria Turnpaugh 000-493sec. Millie Prokopeuff 455-499sec.
  Transcribed by             Alice Taff
  Reviewed and corrected by  Moses Dirks

  129 ET Kamagala, afternoon afternoon
  135 CG Aang yes
  136 ET Sla{chxisaada{, ii? Nice weather. nice weather
  140 CG Yeah. Maku{ that's all right
  143 ET Alqutaadaltxichin? How are you? How are you all?
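
A conversion from transcript turns like those above into table markup can be sketched in Python. This is a hedged illustration, not ELAR's actual conversion code: the function transcript_table is hypothetical, the input is assumed to be already split into (time, speaker, transcription, translation) tuples, and the cell class names are the ones visible in the XHTML version of a transcript shown in this deck.

```python
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # non-breaking space, used to fill the empty cells of a translation row


def transcript_table(turns):
    """Build a <table class="transcript"> string from (time, speaker, transcription, translation) tuples."""
    table = ET.Element("table", {"class": "transcript"})
    for time, speaker, transcription, translation in turns:
        # First row: time, speaker code and transcription, each in a classed cell.
        row = ET.SubElement(table, "tr")
        for css_class, text in (
            ("time", str(time)),
            ("speaker", speaker),
            ("transcription", transcription),
        ):
            ET.SubElement(row, "td", {"class": css_class}).text = text
        # Second row: two filler cells, then the translation under the transcription.
        translation_row = ET.SubElement(table, "tr")
        ET.SubElement(translation_row, "td").text = NBSP
        ET.SubElement(translation_row, "td").text = NBSP
        ET.SubElement(translation_row, "td", {"class": "translation"}).text = translation
    return ET.tostring(table, encoding="unicode")


if __name__ == "__main__":
    print(transcript_table([(135, "CG", "Aang", "yes")]))
```

Because the markup is built as a tree rather than by pasting strings, characters such as & or < in a transcription are escaped automatically and every opened element is closed.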
ELAR conversion - XHTML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>ANC14trk1</title>
  <link href="taff.css" type="text/css" rel="stylesheet"></link></head><body>
  <table class="metadata">
  <tr><td>Language</td><td class="language">Unangax̌ (Aleut)</td></tr>
  <tr><td>Dialect</td><td class="dialect">Niiĝuĝix̌ (Western Aleut)</td></tr>
  <tr><td>Speakers</td><td class="speaker">Alice Petrivelli, Vera Snigaroff, Mary Snigaroff, Vivian Koenig</td></tr>
  <tr><td>Place recorded</td><td class="place">Anchorage, Alaska </td></tr>
  <tr><td>Date recorded</td><td class="date">Mar. 15, 2005</td></tr>
  <tr><td>Recording name</td><td class="rec_name">ANC14trk1</td></tr>
  <tr><td>Recorded by</td><td class="rec_by">Alice Taff, Piama Oleyer</td></tr>
  <tr><td>Recording equipment</td><td class="rec_equip">Marantz CDR300 CD recorder with one flat-filtered, table-mounted cardioid microphone. </td></tr>
  <tr><td>Translated/Transcribed by</td><td>Simeon L. Snigaroff, December 2005</td></tr>
  </table>
  <table class="transcript">
  <tr><td class="time">1</td><td class="speaker">ap</td><td class="transcription">Uqlaĝiix̌, x̌aayax̌, uqlaĝil agach aliguutax̌ ax̌.</td></tr>
  <tr><td>&nbsp;</td><td>&nbsp;</td><td class="translation">To take a bath, Steam bath, to take a bath is the one that is Aleut</td></tr>
  <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
  <tr><td class="time">5</td><td class="speaker">vs</td><td class="transcription">uhmm</td></tr>

ELAR conversion - in browser
  Language                   Unangax̌ (Aleut)
  Dialect                    Niiĝuĝix̌ (Western Aleut)
  Speakers                   Alice Petrivelli, Vera Snigaroff, Mary Snigaroff, Vivian Koenig
  Place recorded             Anchorage, Alaska
  Date recorded              Mar. 15, 2005
  Recording name             ANC14trk1
  Recorded by                Alice Taff, Piama Oleyer
  Recording equipment        Marantz CDR300 CD recorder with one flat-filtered, table-mounted cardioid microphone.
  Translated/Transcribed by  Simeon L. Snigaroff, December 2005

  1  ap  Uqlaĝiix̌, x̌aayax̌, uqlaĝil agach aliguutax̌ ax̌.
         To take a bath, Steam bath, to take a bath is the one that is Aleut

Deposit form is on the web

Protocol
  Sensitivities, restrictions: identification, description and implementation

ELAR Deposit Form “Section C”
  ELAR pays careful attention to any sensitivities or restrictions that apply to any part of your deposit. There are four ways that Access Protocol is implemented:
    You define permissions for the whole deposit or for individual files (or parts of files)
    We provide defaults to protect your data if you do not define permissions
    You/we keep permissions up to date
    You list other rights holders

ELAR Deposit Form “Section C”
  P1. Anyone
    Any person may view/listen to or receive a digital copy of any part of the deposit
  P2. Certain people or groups
    Choose any combination of P2A, P2B, and P2C:
    P2A. Research community members
      What level of access (choose one only)?
      P2A1. They can receive a digital copy of requested material
      P2A2. They can view/listen but cannot receive a digital copy
    P2B. Language community members
      See below regarding identifying members
      What level of access (choose one only)?
      P2B1. They can receive a digital copy of requested material
      P2B2. They can view/listen but cannot receive a digital copy
    P2C. Particular named people or bodies
      See below regarding identifying people/bodies
  P3. Depositor is asked permission for each request
    You will be contacted and asked for permission on each request. How do you want to be contacted?
    P3A. Requester is given address to contact you directly
    P3B. ELAR will relay requests to you
  P4. Only the depositor has access
    Persons other than the depositor will not be able to request access.
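
The Section C choices above form a small rule set that could be checked when a deposit form is submitted. A minimal Python sketch follows; protocol_problems and its parameters are hypothetical names, not part of any ELAR system, and the rule that a P2 choice must include at least one of P2A, P2B or P2C is an assumption rather than something the form states.

```python
# Access levels within each P2 group; the form says "choose one only" for each.
P2_LEVELS = {
    "P2A": {"P2A1", "P2A2"},  # research community: receive a copy vs view/listen only
    "P2B": {"P2B1", "P2B2"},  # language community: receive a copy vs view/listen only
}
P2_ALLOWED = {"P2A1", "P2A2", "P2B1", "P2B2", "P2C"}


def protocol_problems(p_code, p2_choices=frozenset(), p3_contact=None):
    """Return a list of problems with one set of Section C answers (hypothetical helper)."""
    if p_code not in {"P1", "P2", "P3", "P4"}:
        return [f"unknown access code: {p_code}"]
    problems = []
    if p_code == "P2":
        choices = set(p2_choices)
        unknown = choices - P2_ALLOWED
        if unknown:
            problems.append(f"unknown P2 choices: {sorted(unknown)}")
        if not choices:
            # Assumption: selecting P2 without any group is treated as incomplete.
            problems.append("P2 needs at least one of P2A, P2B, P2C")
        for group, levels in P2_LEVELS.items():
            if len(choices & levels) > 1:
                problems.append(f"{group}: choose one access level only")
    if p_code == "P3" and p3_contact not in {"P3A", "P3B"}:
        problems.append("P3 needs a contact method: P3A or P3B")
    return problems


if __name__ == "__main__":
    print(protocol_problems("P2", {"P2A2", "P2B1", "P2C"}))
    print(protocol_problems("P2", {"P2A1", "P2A2"}))
    print(protocol_problems("P3"))
```

Returning a list of problems rather than raising on the first one lets a depositor see every issue with the form in a single pass.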
ELAR Deposit Form “Section C”
  Identifying people/bodies
    If you chose P2B or P2C, tell us how ELAR should determine who is a member of a group (e.g. language community, educational body). Choose one of the following:
    M1. You tell ELAR how to determine membership (tell us in Part D)
    M2. ELAR will ask you on each occasion
    M3. ELAR will make a judgement about membership
    If you chose P2C, then list the names of the people or bodies in Part D.
  Contacting you
    If you chose P3A or P3B, you will be able to decide about each particular request. If the choice is P3A, we will send your address to the requester, who can then ask you directly for permission. You then send us your decision. If the choice is P3B, ELAR will act as an intermediary, and pass on the request to you, so that your privacy is maintained.
    However, if you chose one of P3A or P3B and you (or your delegate) are not contactable, ELAR will need to make the decision or change the access permissions. Similarly, if we need to contact you to ask about group membership, and you (or your delegate) are not contactable, we will need to make the decision or change the access permissions.

Other aspects
  defaults
  sunset clause
  we provide means to change/manage protocol
  file or object-level protocol
  delegate
  other rights holders
  effort to identify depositor for long-term
  depositor-oriented

ELAR’s holdings
  ELAR currently holds 36 deposits with a total volume of approx 0.9 TB. The average deposit is about 25 GB; however, the sizes vary widely, with a few much larger deposits, and the median size is around 10 GB.
  We expect this to nearly double over the next year
  See next slides for distribution of data types

ELAR holdings by data type
  This table analyses some data types of interest for a representative sample (70%) of holdings
  Data type by volume and number of files, sorted by volume

  Data type  Volume (MB)  Files
  audio          360,411  6,312
  video          208,995    895
  image           28,592  2,221
  msword             223    404
  pdf                196    134
  eaf                 33    176
  text                32    781
  lex                  9     29
  trs                  5    246
  xls                  1     19
  imdi                 1     26

ELAR holdings by data type
  This table analyses some data types of interest for a representative sample (70%) of holdings
  Data type by number of files and volume, sorted by number of files

  Data type  Files  Volume (MB)
  audio      6,312      360,411
  image      2,221       28,592
  video        895      208,995
  text         781           32
  msword       404          223
  trs          246            5
  eaf          176           33
  pdf          134          196
  lex           29            9
  imdi          26            1
  xls           19            1

Metadata
  ELAR metadata set =
    Selection from IMDI*, OLAC*, EAD, TEI
    ELAR-specific (e.g. protocol, geographical)
    Depositor metadata
  * ie. a set of metadata elements that maps onto both IMDI and OLAC
  [Diagram: Archive - ELAR metadata set; Deposit - your metadata, all other files]

Types of Metadata
  Depositor's / delegates' details
  Descriptive metadata
  Administrative metadata
  Preservation metadata
  Access protocols
  Metadata for individual files

Depositors and delegates
  Name
  Address
  Contact details (telephone, fax, email, URL)
  Role
  Affiliation
  Date of birth, Nationality

Descriptive metadata
  Title, Description, Subject, Summary
  Keywords
  Subject Language, Community
  Location
  Timespan
  Helps in cataloguing

Administrative metadata
  Project details
  funding and hosting institutions
  Details of external copies
  Details of accession agreement, cf.
  Deposit form

Preservation metadata
  Carrier media
  Provenance (Source)

Access
  access protocols (see elsewhere)
  group membership identification

File-level metadata
  Media files: duration, file size, MIME type, content type
  Text files: font, character set, encoding, format, markup
  Metadata files: schema, scope, validity

ELAR Metadata Set
  [Diagram: input (Deposit Form; Metadata edited via ELAR website) -> Full ELAR metadata set -> output (Export to IMDI; Export to OLAC; Export to TEI)]

ELAR set
  Columns: Type (Descriptive, Structural, Technical, Administrative, Preservation), ELAR set, Field, ELAR Comments. Fields of each element are indented beneath it.

A  Accession_acquisition: = EAD <acqinfo>
A  Accession_agreement
     depositor_has_signed, depositor_sign_date, ELAR_has_signed, ELAR_sign_date, hard_copy_location
A  Accession_appraisal: = EAD <appraisal>
A  Accession_date: Fixed at accession
A  Accession_number
A  Accession_request *: Requests to ELAR e.g. anonymisation, conversions
     requestnum, request, action_note
A  Accession_status
P  Carrier *: The transmission medium (carrier) e.g. CD, DV tape, La Cie 256MB external hard disk
     medium, labeling_system, info
D  Community *: Culture or community group(s) represented
D  Metacontent_file *: For depositors' metadata of various kinds. Contextual (eg lg info resource, bio of speaker; methodology; local history etc); metadata (eg depositor's IMDI metadata or file inventory), related resources (imdi 75).
     type
     scope: List or description of files covered by this metadata file
     format: Format of data organisation
     schema: A formalisation of a format; by url; or name of schema; or indicate schema file within deposit
     validated
     info
D  Creator *: Person primarily associated with production of resource
     name, notes
D  Date *: Dates in the lifecycle of the resource; e.g.
     start and end dates of data collection event
     date_or_start_date, end_date
A  Delegate: Depositor’s delegate; has ability to administer deposit
     address_line_1, address_line_2, address_line_3, address_state_county, address_country, address_postcode, email, fax, family_name, given_name, title, nationality, telephone, url
A  Depositor: Person who has rights in materials, provides them to ELAR, and makes agreement via deposit form
     address_line_1, address_line_2, address_line_3, address_state_county, address_country, address_postcode, affiliation, dob, email, fax, family_name, given_name, title, nationality, telephone, url
     role: e.g. collector, fieldworker, donor
D  Description: Description of the resource: see Summary which has higher priority
D  Description_language *: ISO 639-2b, = EAD <langusage>
     language_name, language_code
P  Dissemination_format: Information about presentation formats, status of presentation objects etc
D  Features: Any unique or outstanding features of the deposit
A  Handle: ELAR use
P  ID: Accession number
D  Keywords: From 2 to 6 keywords related to the content; separate by commas
D  Language *: ISO 639-2b,A15; See alt_names; = EAD <langmaterial>
     name, code, alt_names, info
D  Linguistic_genre *: Covering type, conventions, key, links, as OLAC type 121122; imdi62; orthographic, phonetic, morphologic, syntactic, translation, …
     type
D  Location
     address: Address or more specific place than village, e.g. “Lydia’s house”, “Primary school”
     country: Preferably name in English, standard spelling; otherwise recommended to add country_code
     country_code: ISO country code to allow variant naming of country
     latitude: In Decimal Degree format
     longitude: In Decimal Degree format
     region, town, village
T  Mediafile
     format
     size_time_duration: allow free text, eg about 1hr 15 mins
     size_data_volume
     type
M  Other_info *: ELAR use only, see 205+G95
     info, importance, domain
D  Participant *
     role
     A lifecycle: Lifecycle attribute of participant; need for major participants, whether alive or dead, who are descendants etc
     alt_name *: Use name, then Alt_name (eg as referred to in the content) also Abbreviation, eg as used in transcription or annotation
     anonymise: Need action notes and record of dates anonymised in order to account for anonymisation obligations
     name: As person name abbrev or similar
     type: Also covers imdi participant.role, see also Metacontent_file
A  Project
     contact
     description: Only if short - not meant to include narrative descriptions of over 30 words
     funder: Project_sponsors {sponsor, type [host, funder …]}
     host, code, title
A  Protocol_acknowledgement: See deposit form
     yes/no, ack_text
A  Protocol_maintenance: ELAR use only. Need to define actions and vocabulary
     action, date
A  Protocol_M_code: Describes how ELAR determines a person’s group membership or access status. See deposit form
     M_code, instructions
A  Protocol_other_rights_holders: See deposit form
     name, role, address
     resource_ID: identify particular file(s) affected by these rights
A  Protocol_P2_names: Where P2C was selected, to list individuals’/organisations’ names and access types
     name, type [individual, organisation], access_type
A  Protocol_Pcode: Identifies people and access conditions.
     One only P_code, if P_code = P2, then P2_code can be any combination of P2A, P2B, and P2C; ..... Rights * {right_type, right_holder, contact_info} Access * {ELAR_code, date_from, date_to, revise_date, revise_notes, other}
     P_code, P2_codes
A  Protocol_Usage_restrictions: Default is private study only, = EAD <userestrict>
A  Publisher *: Where item is also disseminated, eg other archive
     name, address
T  Quality: Distinguish archive's entry from depositor's. Add here recording conditions and equipment. For quality, do not use subjective terms like good or poor - describe phenomena, eg medium tape hiss, slight clipping, road traffic in background between 5 and 7 mins etc
S  Relation *: = EAD <revisiondesc>
     type: Update, correction, contains, part, replaces, extends/enriches, transcription, translation, annotation, other_alignment, other_description, contains_examples, alternative, authorship, protocol_info
     this_role, this_id, that_role, that_id, info
P  Relation_external *: materials are deposited with another archive or institution, ~ EAD <repository>
     future, repository_name, repository_id, date, info, yes/no, existing_id, info
A  Revision: Does depositor intend to update or revise the deposit
     yes/no, info
P  Source
     Processed: Detail the data processing in deriving the deposit/file from the source, e.g. digitisation parameters
     Provenance: Use where depositor was not the main creator, or where deposit/file is drawn from existing media or data sources. Where materials were located, how depositor came into possession, collector, protocol history, access and protocol for source etc; Reference to a specific tape with a unique label. Element characterizing the media format such as DAT, DV, VHS, Hi-8, … See also Accession_acquisition
A  Status: Deposit, complete or not, ELAR only, complete or not, with narr notes
D  Subject *: A short description
D  Summary: A description/summary/abstract (may also describe genres, media, formats). = EAD <abstract>
T  Textfile
     Characterset: e.g.
       Cyrillic, Chinese-traditional
     Encoding: Encoding, eg. Unicode, Latin-1 Extended, Big-5, ASCII
     Font: e.g. IPA-Kiel
     Format: Can be a flag to apply to any part of deposit - encourage structured narrative field to tabulate
     Markup: e.g. TEI-lite, Shoebox, EAF; give or link to schema etc
D  Title: Note that Deposit here means a bundle which is the whole set of files deposited
D  Type: Genre and/or type of content/event; see Johnson & Dwyer

Format details
  Filenames
    characters [A-Z], [a-z], [0-9], underscore and a single period before the extension
    correct MIME extension e.g. http://www.utoronto.ca/webdocs/HTMLdocs/Book/Book-3ed/appb/mimetype.html
    favour lower case letters
    maximum length 30 characters
    maximum directory depth 8
  File formats, see http://www.hrelp.org/archive/depositors/formats.html

Dobbin
  Audio evaluation, processing and reporting

ELAR will
  preserve your deposited materials
  provide for making changes where possible
  provide web-based metadata management
  implement your access restrictions etc
  give feedback about materials
  provide advice, general and specific assistance, eg data conversion
  provide some equipment and services on a case by case basis, develop resources

We ask you to
  manage materials well
  collect and provide protocol information
  deliver materials, metadata
  send trial samples etc
  not withhold materials
  share/manage/delegate custodianship of materials
  maintain relationships with language stakeholders and ELAR

Delivery of materials
  mostly we expect to receive copies on computer-readable media such as CD/DVD/HD
  DVDs seem consistently unreliable
  some digitisation of media may be possible

Questions?