Proposal_v1 - Research Data Alliance

Improving Access to Recorded Language Data
Simon Musgrave (Monash University, Australian National Corpus)
Linda Barwick (University of Sydney, PARADISEC)
Michael Walsh (University of Sydney, AIATSIS)
Researchers in different disciplines collect and store data which includes human language recorded
in real time. Discovery of such data should be easy across disciplines, but is currently impeded by
different disciplinary approaches and standards. For example, a linguist may have collected
recordings of songs performed by speakers of the language they are studying; these recordings are
stored in an archive intended primarily for other linguists, and a musicologist may not easily discover
the resource even though it might be very relevant to their research. This paper will discuss the work
of a recently formed Working Group within the Research Data Alliance which aims to address this
problem by working towards standardisation of metadata elements in two areas: codes for
identification of languages and language varieties, and categories for describing the content of
resources.
For language identification, ISO 639-3 provides a set of three-letter codes to identify languages, but
relying on it is problematic for several reasons. Firstly, it is not adopted everywhere; for example,
the digital collections of the Australian Institute of Aboriginal and Torres Strait Islander Studies use a
different set of identifiers. This example also illustrates two further problems for language
identification. The divisions recognised by ISO 639-3 do not always align with expert understanding.
This has been a particular issue for Australian languages, with a number of change requests filed
with the registration authority for ISO 639-3. A number of these changes relate to delineating
languages from linguistic entities below that level (such as dialects) and above that level (such as
macrolanguages and language families). Proposals for identification of entities at different levels of
granularity are being considered within the ISO process; the Working Group aims to ensure that
expert input to these processes is maximised, that the principles underlying the ISO 639 set of
standards have a sound linguistic basis, and that registration and revision processes are consistent and
transparent. Our assumption is that progress with these issues will lead to more consistent use of
the standard by archives and repositories.
Existing metadata schemas (e.g. IMDI, OLAC) include vocabularies for describing the genres
represented in linguistic resources, but these do not necessarily correspond to the needs of different
disciplines. Consultation across different research communities is needed to establish the range of
resource types which need to be covered and vocabularies for describing that range. The Working
Group will implement the results of this consultation by creating a set of metadata elements within
the frameworks of the Component Metadata Initiative (CMDI) and the ISOcat data category registry.
CMDI allows for the use of common metadata elements across different sites without imposing a
rigid metadata scheme, while the ISOcat framework ensures that the semantics of (meta)data
elements are explicit and accessible.
We hope that the activities of the Working Group will lead to improved discovery and access for
researchers across disciplines who work with recorded language data as well as improved
possibilities for inter-repository data exchange.
Biographies:
Simon Musgrave is a lecturer in the School of Languages, Cultures and Linguistics at Monash
University. Previously, he was a post-doctoral researcher at Leiden University and an Australian
Research Council post-doctoral fellow at Monash. His research interests include Austronesian
languages, language documentation and language endangerment, African languages in Australia,
communication in medical interactions, the history of English in Australia, and the use of technology
in linguistic research. He is also involved in the Australian National Corpus project, serving on the
steering committee from an early stage as well as being the treasurer of Australian National Corpus
Inc.
Linda Barwick is Associate Dean, Research, at the Sydney Conservatorium of Music. She is an
ethnomusicologist, specialising in the study of Australian Indigenous and immigrant musics, and the
digital humanities (particularly archiving and repatriation of ethnographic field recordings as a site of
interaction between researchers and cultural heritage communities). She has studied community
music practices through fieldwork in Australia, Italy and the Philippines. Themes of her research
include analysis of musical action in place, the language of song, and the aesthetics of cross-cultural
musical practice. She has also published on theoretical issues, including analysis of non-Western
music, and research implications of digital technologies.
Michael Walsh's research has focussed on the Top End of the Northern Territory over the last 30
years. This research includes descriptive and typological studies of Aboriginal languages as well as
investigations into language use among indigenous Australians. An interest in lexical semantics has
given rise to such studies as one on body part metaphors and another on nominal classification.
Outside of strictly linguistic matters he has carried out research or advised on land claims,
assessment of Aboriginal witnesses in legal settings and Native Title matters. One spin-off of these
interests is a focus on cross-cultural communication problems between indigenous and other
Australians.