D_Lib_Draft3 - Research Data Alliance

Improving Access to Recorded Language Data
Simon Musgrave (Monash University, Australian National Corpus)
Researchers in many disciplines, including musicologists, linguists and scholars working in
performance studies, collect and store data which includes human language recorded in real time.
Discovery of such data should be easy across disciplines, but is currently impeded by differing
disciplinary approaches and standards. For example, a linguist may have collected recordings of
songs performed by speakers of the language they are studying; if these recordings are stored in
an archive intended primarily for other linguists, a musicologist may not easily discover the
resource even though it might be very relevant to their research. The opposite situation might
equally occur, with a musicologist collecting data which would be of interest to a linguist. This
paper discusses the work of a recently formed Working Group within the Research Data Alliance
which aims to address this problem by working towards standardisation of metadata elements in
two areas: codes for identification of languages and language varieties, and categories for
describing the content of resources.
1. Identifying languages
It is difficult to provide a single characterisation of the range of data which is of concern for this
Working Group, but the presence of some language content (within which we include sign
languages) seems basic. A first step towards effective discovery across disciplines and resources
is therefore the possibility of accurately identifying the language or languages present in a resource.
An international standard, ISO639-3,[1] provides a set of three-letter codes to identify languages,
but it is problematic for two reasons. Firstly, it is not adopted everywhere; secondly, the standard
implies a rather rigid view of what can be defined as a ‘language’, even though this is a notoriously
difficult concept to pin down. We illustrate both problems with an example from Australia.
The digital collections of the Australian Institute of Aboriginal and Torres Strait Islander Studies
(AIATSIS) use a set of identifiers different from those of ISO639-3. The divisions recognised by
ISO639-3 do not always align with expert understanding, and this has been a particular issue for
Australian languages, with a number of change requests filed with the registration authority for
ISO639-3. A number of these requests relate to issues of granularity, that is, delineating languages
from linguistic entities below that level (such as dialects) and above that level (such as
macrolanguages and language families). But differences between ISO639-3 and the identifiers used
by AIATSIS also reflect differences between insider and outsider views of the relevant distinctions.
Table 1 illustrates some of this complexity by comparing ISO639-3 identifiers and AIATSIS
identifiers for the Dhangu language group within the Yolngu/Yuulngu family of Australian
languages. It also includes information from Glottolog,[2] yet another source for language
identifiers. Glottolog uses the term ‘languoid’ to refer to “any type of lingual entity: language,
dialect, family, language area” (Good and Hendryx-Parker 2006, fn7).
[1] http://www.iso.org/iso/home/standards/language_codes.htm (11/11/2013)
[2] http://glottolog.org/ (14/11/2013)
Source                  Group     Language / languoid   ID
Ethnologue (ISO639-3)   Dhangu    Djangu                dhg
                                  Yan-nhangu            jay
Glottolog                         Dhangu                dhan1270
                                  Dhangu-Djangu         dhan1271
AIATSIS                 Dhangu    Galpu                 N139
                                  Lamamirri             N147
                                  Murru                 N116.T
                                  Ngaymil               N116.X
                                  Rirratjingu           N140
                                  Wangurri              N134
                                  Woralul               N132
Table 1: Comparison of ISO639-3, Glottolog and AIATSIS identifiers for the Dhangu language group
(AIATSIS has two alternate spellings for Ngaymil which are omitted here)[3]
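One way to make the identifiers in Table 1 usable for discovery across schemes is to record them
against a shared group. The following Python sketch is an illustration only: the Identifier record,
the GROUPS registry and the related_codes function are hypothetical names, not part of any of the
three identifier schemes, and the grouping asserts only common membership of the Dhangu group,
not equivalence between individual codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    scheme: str  # "iso639-3", "glottolog" or "aiatsis"
    code: str
    name: str


# Rows copied from Table 1. Placing them in one list asserts only that they
# belong to the same group (Dhangu), not that any two codes are equivalent.
DHANGU_GROUP = [
    Identifier("iso639-3", "dhg", "Djangu"),
    Identifier("iso639-3", "jay", "Yan-nhangu"),
    Identifier("glottolog", "dhan1270", "Dhangu"),
    Identifier("glottolog", "dhan1271", "Dhangu-Djangu"),
    Identifier("aiatsis", "N139", "Galpu"),
    Identifier("aiatsis", "N147", "Lamamirri"),
    Identifier("aiatsis", "N116.T", "Murru"),
    Identifier("aiatsis", "N116.X", "Ngaymil"),
    Identifier("aiatsis", "N140", "Rirratjingu"),
    Identifier("aiatsis", "N134", "Wangurri"),
    Identifier("aiatsis", "N132", "Woralul"),
]

# Hypothetical registry with one entry per language group.
GROUPS = {"Dhangu": DHANGU_GROUP}


def related_codes(code, target_scheme):
    """Return the codes in target_scheme that share a group with code."""
    found = []
    for members in GROUPS.values():
        if any(m.code == code for m in members):
            found.extend(m.code for m in members if m.scheme == target_scheme)
    return found


# A catalogue that knows only the AIATSIS code N139 could still suggest
# ISO639-3 codes to search under.
print(related_codes("N139", "iso639-3"))
```

Linking at the level of the group is deliberately coarse. It avoids claiming a one-to-one
correspondence between schemes which, as Table 1 shows, draw their distinctions differently.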
The Working Group is starting from the position that ISO639-3 is sufficiently entrenched that it
cannot be abandoned, but that improvements can be made both in the substance of the standard
and in the processes around its maintenance. The various parts of ISO639 currently have different
Registration Authorities; for example, ISO639-3 is administered by SIL International,[4] while
ISO639-2 (a partially superseded set of three-letter codes) is administered by the Library of
Congress.[5] The Working Group will participate in efforts, already begun, to unify all the parts of
ISO639 under a single Registration Authority, which would allow for the construction of a single
database documenting the complete standard. A single administration would also simplify the
processes for seeking amendments to the standard. This will remain an important consideration:
languages and linguistic scholarship are not static, and both what should be identifiable and what
is recognised as identifiable will change over time.
As part of the ongoing work of ISO Technical Committee 37[6] (which has responsibility for ISO639),
proposals for identification of entities at different levels of granularity are being considered within
the ISO process. These efforts cover both the identification of linguistic entities above the level of
‘language’ (e.g. macrolanguages and language families) and entities below that level (e.g. dialects
and varieties). The Working Group aims to ensure that expert input to these processes is maximised,
that the principles underlying the ISO639 standard sets have a sound linguistic basis, and that
registration and revision processes are consistent and transparent. These aims will be pursued
through direct input to ISO TC37 (one member of the Working Group sits on this committee), and by
encouraging national standards bodies to be involved in the work of the Technical Committee, for
example by seeking observer status in its meetings and by creating national mirror committees. Our
assumption is that progress with these issues will lead to more consistent use of the resulting
standard by archives and repositories.
[3] Sources: http://www1.aiatsis.gov.au/thesaurus/language/mtw.exe?k=default&l=60&linkType=term&w=837&n=1&s=5&t=2 (11/11/2013), http://www.ethnologue.com/subgroups/yuulngu (11/11/2013)
[4] http://www-01.sil.org/iso639-3/default.asp (11/11/2013)
[5] http://www.loc.gov/standards/iso639-2/iso639jac.html (11/11/2013)
[6] http://www.iso.org/iso/home/standards_development/list_of_iso_technical_committees/iso_technical_committee.htm?commid=48104 (11/11/2013)
2. Content description
Existing metadata schemas for language data (e.g. IMDI, OLAC)[7] include vocabularies for describing
the genres represented in linguistic resources, but these do not necessarily correspond to the usage
or needs of different disciplines. The Working Group aims to develop a vocabulary for describing the
relevant resources which will be sufficiently broad that it can cover the range of material
represented, sufficiently accessible that it will be useful to researchers across a range of disciplines,
but also sufficiently precise that it will aid discovery. On this last point, we will work from the
assumption that an optimal solution will be a high-level, coarse-grained vocabulary which can be
extended by individual research communities to achieve the levels of precision in resource discovery
which will best serve their needs.
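The idea of a coarse-grained vocabulary extended by individual communities can be sketched in a
few lines of Python. Everything named here is a hypothetical illustration: the core terms, the
community-prefixed extension terms and the resource identifiers are invented for the example and
are not terms proposed by the Working Group.

```python
# Hypothetical coarse-grained core vocabulary shared by all disciplines.
CORE_TERMS = {"song", "narrative", "conversation"}

# Hypothetical community extensions: each fine-grained term names the core
# term it refines.
EXTENSIONS = {
    "musicology:lullaby": "song",
    "linguistics:elicited-narrative": "narrative",
}


def core_term(term):
    """Map a core or extension term to its coarse-grained core term."""
    if term in CORE_TERMS:
        return term
    if term in EXTENSIONS:
        return EXTENSIONS[term]
    raise KeyError(f"unknown vocabulary term: {term}")


def discover(resources, query):
    """Return ids of resources tagged with the query term or a refinement of it."""
    return [
        resource_id
        for resource_id, term in resources.items()
        if term == query or core_term(term) == query
    ]


# Invented catalogue entries, each tagged with one vocabulary term.
resources = {
    "rec-001": "musicology:lullaby",
    "rec-002": "song",
    "rec-003": "linguistics:elicited-narrative",
}

# A search on the shared core term finds the musicologist's record as well.
print(discover(resources, "song"))
```

A query on the core term retrieves material described by any community, while a query on an
extension term retrieves only records tagged at that finer level of precision.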
The Working Group will consult across the different research communities to establish the range of
resource types which need to be covered and vocabularies for describing that range. The Working
Group will implement the results of this consultation by creating a set of metadata elements within
the frameworks of the Component Metadata Initiative (CMDI)[8] and the ISOCat data category
registry.[9] These technical solutions seem appropriate for the problem being tackled. CMDI is based
on the idea that common metadata elements should be usable across different sites without the
imposition of a rigid metadata scheme. Given that any solution resulting from the Working Group
will have to be retrofitted to existing metadata catalogues, this approach is suitably flexible. The
CMDI framework will also accommodate the type of extensibility mentioned above. The ISOCat
framework (see also the contribution of Broeder and Lannom in this issue) seeks to make explicit,
easily accessible records of the semantic content of data categories. This is desirable in
any case, but seems to us to be particularly desirable in work such as ours which crosses disciplinary
boundaries. Although we anticipate that the content of some proposed data categories may be the
subject of considerable debate between representatives of different disciplines, we
see considerable advantages in the outcome of such discussions being treated as part of a data
commons rather than being tied to any individual discipline.
We hope that the activities of the Working Group will lead to improved discovery and access for
researchers across disciplines who work with recorded language data as well as improved
possibilities for inter-repository data exchange.
Reference:
Good, Jeff & Calvin Hendryx-Parker. 2006. Modeling contested categorization in linguistic databases. In
Proceedings of the EMELD 2006 Workshop on Digital Language Documentation: Tools and Standards: The
State of the Art. Lansing, Michigan. Online: http://emeld.org/workshop/2006/papers/GoodHendryxParkerModelling.pdf (14/11/2013)
[7] http://www.mpi.nl/imdi/ (11/11/2013), http://www.language-archives.org/OLAC/metadata.html (11/11/2013)
[8] http://www.clarin.eu/node/3219 (11/11/2013)
[9] http://www.isocat.org/ (11/11/2013)