Improving Access to Recorded Language Data

Simon Musgrave (Monash University, Australian National Corpus)

Researchers in different disciplines collect and store data which includes human language recorded in real time, for example musicologists, linguists, scholars working in performance studies, and others. Discovery of such data should be easy across disciplines, but is currently impeded by different disciplinary approaches and standards. For example, a linguist may have collected recordings of songs performed by speakers of the language they are studying; these recordings are stored in an archive intended primarily for other linguists, and a musicologist may not easily discover the resource even though it might be very relevant to their research. And of course the opposite situation might equally occur with a musicologist collecting data which might be of interest to a linguist. This paper will discuss the work of a recently formed Working Group within the Research Data Alliance which aims to address this problem by working towards standardisation of metadata elements in two areas: codes for identification of languages and language varieties, and categories for describing the content of resources.

1. Identifying languages

It is difficult to provide a single characterisation of the range of data which is of concern for this Working Group, but the presence of some language content (within which we include sign languages) seems basic. Therefore a first step in effective discovery across disciplines and resources is the possibility of accurately identifying the language or languages which are present in a resource. An international standard, ISO639-3,¹ exists which provides a set of three-letter codes to identify languages. But this is not unproblematic for a variety of reasons.
Firstly, it is not adopted everywhere; secondly, the standard implies a rather rigid view of what can be defined as a ‘language’, even though this is a notoriously difficult concept to pin down. We illustrate both of these problems with an example from Australia. The digital collections of the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) use a set of identifiers different from ISO639-3. The divisions recognised by ISO639-3 do not always align with expert understanding, and this has been a particular issue for Australian languages, with a number of change requests filed with the registration authority for ISO639-3. A number of these changes relate to issues of granularity, that is, delineating languages from linguistic entities below that level (such as dialects) and above that level (such as macrolanguages and language families). But differences between ISO639-3 and the identifiers used by AIATSIS also reflect differences between insider and outsider views of the relevant distinctions. Table 1 illustrates some of this complexity by comparing ISO639-3 identifiers and AIATSIS identifiers for the Dhangu language group within the Yolngu/Yuulngu family of Australian languages. It also includes information from Glottolog,² yet another source for language identifiers. Glottolog uses the term ‘languoid’ to refer to “any type of lingual entity: language, dialect, family, language area” (Good and Hendryx-Parker 2006, fn7).
¹ http://www.iso.org/iso/home/standards/language_codes.htm (11/11/2013)
² http://glottolog.org/ (14/11/2013)

Ethnologue (Group: Dhangu)
  Language       ID
  Djangu         dhg
  Yan-nhangu     jay

Glottolog
  Languoid       ID
  Dhangu         dhan1270
  Dhangu-Djangu  dhan1271

AIATSIS (Group: Dhangu)
  Language       ID
  Galpu          N139
  Lamamirri      N147
  Murru          N116.T
  Ngaymil        N116.X
  Rirratjingu    N140
  Wangurri       N134
  Woralul        N132

Table 1: Comparison of ISO639-3, Glottolog and AIATSIS identifiers for the Dhangu language group (AIATSIS has two alternate spellings for Ngaymil which are omitted here)³

The Working Group is starting from the position that ISO639-3 is sufficiently entrenched that it cannot be abandoned, but that improvements can be made both in the substance of the standard and in the processes around its maintenance. The various parts of ISO639 currently have different Registration Authorities; for example ISO639-3 is administered by SIL International,⁴ while ISO639-2 (a partially superseded set of three-letter codes) is administered by the Library of Congress.⁵ The Working Group will participate in efforts which have already begun to unify all the parts of ISO639 under a single Registration Authority, which would allow for the construction of a single database documenting the complete standard. A single administration would also have advantages in terms of the processes around seeking amendments to the standard. This will remain an important consideration: languages and linguistic scholarship are not static, and both what should be identifiable and what is recognised as identifiable will change over time.

As part of the ongoing work of ISO Technical Committee 37⁶ (which has responsibility for ISO639), proposals for identification of entities at different levels of granularity are being considered within the ISO process. These efforts cover both the identification of linguistic entities above the level of ‘language’ (e.g. macro-languages and language families) and entities below that level (e.g. dialects and varieties).
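The practical consequence of the mismatch shown in Table 1 is that a search using one scheme's code will miss resources catalogued under another scheme. The following is a minimal sketch of a group-level crosswalk, not a description of any existing service: the names `DHANGU_GROUP`, `scheme_of` and `expand_query` are hypothetical, and the only data used are the identifiers listed in Table 1. The sketch records only that these identifiers fall under the same group; it does not claim that any two of them denote the same entity.

```python
# Identifiers for the Dhangu group as listed in Table 1, grouped by scheme.
# Assumption: membership of the same group is the only relation recorded;
# no identifier is treated as equivalent to any other.
DHANGU_GROUP = {
    "ISO639-3": {"dhg", "jay"},
    "Glottolog": {"dhan1270", "dhan1271"},
    "AIATSIS": {"N139", "N147", "N116.T", "N116.X", "N140", "N134", "N132"},
}


def scheme_of(identifier, group=DHANGU_GROUP):
    """Return the scheme that lists this identifier, or None if none does."""
    for scheme, identifiers in group.items():
        if identifier in identifiers:
            return scheme
    return None


def expand_query(identifier, group=DHANGU_GROUP):
    """Expand one identifier to every identifier recorded for its group.

    Returns a dict mapping scheme name to a sorted list of identifiers,
    or an empty dict if the identifier is not in the group.
    """
    if scheme_of(identifier, group) is None:
        return {}
    return {scheme: sorted(ids) for scheme, ids in group.items()}
```

A discovery tool could call `expand_query("dhg")` and then query each archive with the identifiers from the scheme that archive uses. Expansion at group level will over-generate, which is one reason finer-grained, agreed identifiers would be preferable.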
The Working Group aims to ensure that expert input to these processes is maximised, that the principles underlying the ISO639 standard sets have a sound linguistic basis, and that registration and revision processes are consistent and transparent. These aims will be achieved by direct input to ISO TC37 (one member of the Working Group sits on this committee), and by encouraging national standards bodies to be involved in the work of the Technical Committee, for example by seeking observer status in its meetings and by creating national mirror committees. Our assumption is that progress with these issues will lead to more consistent use of the resulting standard by archives and repositories.

³ Sources: http://www1.aiatsis.gov.au/thesaurus/language/mtw.exe?k=default&l=60&linkType=term&w=837&n=1&s=5&t=2 (11/11/2013), http://www.ethnologue.com/subgroups/yuulngu (11/11/2013)
⁴ http://www-01.sil.org/iso639-3/default.asp (11/11/2013)
⁵ http://www.loc.gov/standards/iso639-2/iso639jac.html (11/11/2013)
⁶ http://www.iso.org/iso/home/standards_development/list_of_iso_technical_committees/iso_technical_committee.htm?commid=48104 (11/11/2013)

2. Content description

Existing metadata schemas for language data (e.g. IMDI, OLAC)⁷ include a vocabulary for describing the genres represented in linguistic resources, but these do not necessarily correspond to usage or needs of different disciplines. The Working Group aims to develop a vocabulary for describing the relevant resources which will be sufficiently broad that it can cover the range of material represented, sufficiently accessible that it will be useful to researchers across a range of disciplines, but also sufficiently precise that it will aid discovery. On this last point, we will work from the assumption that an optimal solution will be a high-level, coarse-grained vocabulary which can be extended by individual research communities to achieve the levels of precision in resource discovery which will best serve their needs.
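The idea of a coarse-grained core vocabulary that individual communities can extend can be sketched as a simple broader/narrower hierarchy. This is an illustration under stated assumptions, not the Working Group's design: the `Vocabulary` class and its methods are hypothetical, and the terms used ("song", "lullaby" and so on) are invented examples rather than proposed categories.

```python
class Vocabulary:
    """A controlled vocabulary with a shared core and community extensions."""

    def __init__(self, core_terms):
        # Map each term to its broader term; core terms have no broader term.
        self._broader = {term: None for term in core_terms}

    def extend(self, term, broader):
        """Register a community-specific term under an existing term."""
        if broader not in self._broader:
            raise KeyError(f"unknown broader term: {broader!r}")
        if term in self._broader:
            raise ValueError(f"term already defined: {term!r}")
        self._broader[term] = broader

    def core_term(self, term):
        """Walk up the hierarchy to the coarse-grained core term."""
        while self._broader[term] is not None:
            term = self._broader[term]
        return term

    def matches(self, query, tag):
        """Return True if a resource tagged `tag` should be found by `query`.

        A query matches the tag itself and any term narrower than the query.
        """
        current = tag
        while current is not None:
            if current == query:
                return True
            current = self._broader[current]
        return False


# Hypothetical terms, for illustration only.
vocab = Vocabulary(["song", "narrative", "conversation"])
vocab.extend("lullaby", broader="song")      # e.g. a musicology extension
vocab.extend("work song", broader="song")
```

With this structure, a linguist searching for the core term "song" would find a resource that a musicologist tagged "lullaby", while a search for "lullaby" would not return every resource tagged only as "song". The core supports cross-disciplinary discovery and the extensions supply precision within a community.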
The Working Group will consult across the different research communities to establish the range of resource types which need to be covered and vocabularies for describing that range. The Working Group will implement the results of this consultation by creating a set of metadata elements within the frameworks of the Component Metadata Initiative (CMDI)⁸ and the ISOCat data category registry.⁹ These technical solutions seem appropriate for the problem being tackled. CMDI is based on the idea that common metadata elements should be usable across different sites without the imposition of a rigid metadata scheme. Given that any solution resulting from the Working Group will have to be retrofitted to existing metadata catalogues, this approach is suitably flexible. The CMDI framework will also accommodate the type of extensibility mentioned above.

The ISOCat framework (see also the contribution of Broeder and Lannom in this issue) seeks to make explicit, easily accessible records of the semantic content of data categories. This is desirable in any case, but seems to us to be particularly desirable in work such as ours which crosses disciplinary boundaries. Although we anticipate that the content of some proposed data categories may be the subject of considerable debate and discussion between representatives of different disciplines, we see considerable advantages in the outcome of such discussions being treated as part of a data commons rather than being tied to any individual discipline.

We hope that the activities of the Working Group will lead to improved discovery and access for researchers across disciplines who work with recorded language data, as well as improved possibilities for inter-repository data exchange.

Reference:

Good, Jeff & Calvin Hendryx-Parker. 2006. Modeling contested categorization in linguistic databases. In Proceedings of the EMELD 2006 Workshop on Digital Language Documentation: Tools and Standards: The State of the Art. Lansing, Michigan. Online: http://emeld.org/workshop/2006/papers/GoodHendryxParkerModelling.pdf (14/11/2013)

⁷ http://www.mpi.nl/imdi/ (11/11/2013), http://www.language-archives.org/OLAC/metadata.html (11/11/2013)
⁸ http://www.clarin.eu/node/3219 (11/11/2013)
⁹ http://www.isocat.org/ (11/11/2013)