The Spoken Language Corpus Project (SLCP)

Progress in Developing Spoken Language Corpus of Indigenous Languages in South Africa
Mtholeni N. Ngcobo and Nozibele Nomdebevana
University of South Africa
ngcobmn@unisa.ac.za
nomden@unisa.ac.za
Outline
1. Introduction
2. The importance of the spoken corpus approach
3. The description of SLCP
4. Progress and Problems in the development of Spoken Language corpora of indigenous languages
5. Recommended solutions
INTRODUCTION
- Multilingual language policy: 11 official languages, of which 2 are developed and 9 underdeveloped
- An urgent need to develop spoken language corpora for indigenous languages, as explained by Allwood and Hendrikse (2003)
- A corpus supports empirical research, as opposed to the Chomskyan intuitive approach
- Written language corpora have already been compiled for some languages, e.g. Zulu and Sepedi (NS)
- A good start has also been made on spoken language corpora: the Spoken Language Corpus Project (SLCP), an open-ended corpus project
- Started in 2000
Introd…
- Collaboration between UNISA and Gothenburg University
- Initially funded by NRF and Sida
- UNISA is the host institution
- UNISA has now approved funds for the project as it falls under the strategic projects
- Goal: 1M words/tokens per language
AIMS
- Establish a corpus research centre
- Adapt and develop computational linguistic software suitable for the agglutinating languages of South Africa
- Develop the indigenous languages of South Africa
- Understand the role of language and communication in real-life situations
The Importance of Spoken Corpus
Allwood and Hagman (1994) on spoken language:
- Fundamental trait of the human species
- Integrated with the human brain and human society
- There is limited knowledge of spoken language as opposed to written language
The importance…
- The corpus linguistics approach allows the use of statistical performance measures and the observation of language use in real life.
- This approach contrasts with earlier Chomskyan linguistics, which focused on ideal written language.
- Allwood and Hagman (1994:1): “the progress in audio, video and computer technology enables us to record and analyse spoken language without having to rely on either memory or written language.”
Contrast at a glance
- Chomskyan linguistics focused on language competence (langue) while corpus linguistics also considers language performance (parole) as important
- Chomskyan linguistics is unable to cope with many areas in linguistic study, since the emphasis is put on the ideal speaker/hearer to the exclusion of complexity/variation
- Chomskyan linguistics views language as an innate mental faculty while corpus linguistics views language as a social phenomenon
- Chomskyan linguistics relies on intuitive evidence whereas corpus linguistics relies on empirical evidence
- Corpus linguistics looks at differences in languages while Chomskyan linguistics concentrates on universals
- The focus of Chomskyan linguistics is on grammar (form) while corpus linguistics focuses also on meaning (semantics).
The Description of SLCP
- First task: compilation of a body of texts (a corpus)
- Computer: stores large quantities of data and allows statistical performance measures
- Research potential: covers linguistic, social, cultural, educational, technological, inter-lingual and inter-communicational aspects
Descript…
- SLCP has chosen video recordings. Why? Allwood and Hendrikse (2003, 191) give the following reason: “…face-to-face spoken language is interactive, multimodal and context-dependent.” So we want to capture all the dynamics of language in communication.
- Compilation is a process with four major phases:
1. Collecting video-recorded spoken language activities
2. Transcribing video recordings
3. Quality control (checking and editing)
4. Annotation of raw data
Process diagram
CORPUS BUILDING: RECORDING -> TRANSCRIBING -> QUALITY CONTROL -> TAGGING
1. Recording phase
Biber et al. (1998:246) state that “...a corpus is not simply a collection of texts. Rather, a corpus seeks to represent a language or some part of a language. The appropriate design for corpus therefore depends upon what it is meant to represent.”
- Parameters: representativity of the corpus, control of variables in language varieties, recording, volume or size of the corpus and length of each sample
- In SLCP we use socio-economic activities as a representativity measure, e.g. meetings, sermons, interviews, etc.
2. Transcription phase
- The most crucial phase
- Allwood and Hendrikse (2003:195): without transcriptions there will be no computer-readable corpus
- Two components of a transcript: Header and Body
The Header E.G.
@ Recorded activity Identity code (ID): U-ZV-01-01-01
@ Name of recorder: Magda Altman, Brenda Gonzales
@ Duration of recorded activity: 3 hours
@ Recorded activity date: 2006-08-04
@ Recorded activity type: Interview with Traditional Healers
@ Recorded activity title: Interview with Traditional Healers
@ Short name: TH Interview 1
@ Recorded activity location: Queen Ntuli’s home, Folweni, Umbumbulu
@ Activity mode: Face to face Interview
@ Participant: B=F1 (Makhosi Queen Ntuli)
@ Participant: M=F8 (Philisiwe Mkhize)
@ Participant: K=F3 (Jabu Eunice Ncikazi)
@ Participant: J=F2 (Thokozile Shezi)
@ Participant: G=F4 (Thembeni Roge Magubane)
@ Participant: H=GR (All participants)
@ Tape ID code: U-ZV-01-01
@ Transcription name: U-ZV-01-01-T1
@ Transcriber: Mtholeni
@ Transcription date:
@ Transcription system:
@ Electronic checking
@ Editor:
@ Checker:
@ Checking dates:
@ Section:
@ Section:
@ Time coding
@ Comment(s):
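Because every header line follows the same "@ key: value" pattern, the header can be read by a short script. The sketch below is only an illustration under that assumption; parse_header is a hypothetical helper and not part of the SLCP tools. Keys that occur more than once (e.g. Participant) are collected into lists.

```python
def parse_header(lines):
    """Collect '@ key: value' header lines into a dict of lists (hypothetical helper)."""
    header = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue  # not a header line
        key, sep, value = line.lstrip("@ ").partition(":")
        if not sep:
            continue  # lines without a colon carry no value, e.g. "@ Time coding"
        header.setdefault(key.strip(), []).append(value.strip())
    return header


sample = [
    "@ Recorded activity Identity code (ID): U-ZV-01-01-01",
    "@ Recorded activity date: 2006-08-04",
    "@ Participant: B=F1 (Makhosi Queen Ntuli)",
    "@ Participant: M=F8 (Philisiwe Mkhize)",
    "@ Time coding",
]
header = parse_header(sample)
print(header["Participant"])
```

Empty fields such as "@ Editor:" are kept with an empty string as value, so missing information can be spotted later.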
The Body
3 types of lines in the transcription body:
§ - section line (the topic of discussion)
$ - contribution (interlocutor’s speech)
@ - information line (comment)
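The three line types above can be told apart by their first character. The following is a minimal sketch, assuming that a line starting with none of the three symbols is a wrapped continuation of the previous line; the names classify and split_contribution and the label "continuation" are illustrative assumptions, not SLCP terminology.

```python
LINE_TYPES = {"§": "section", "$": "contribution", "@": "information"}


def classify(line):
    """Return the line type based on the first non-blank character."""
    stripped = line.strip()
    # Anything else is assumed to continue the previous line.
    return LINE_TYPES.get(stripped[:1], "continuation")


def split_contribution(line):
    """Split a '$A: text' contribution line into (speaker, speech)."""
    speaker, _, speech = line.strip()[1:].partition(":")
    return speaker.strip(), speech.strip()


body = ["§ Religion", "$A: uyakhonza konje", "@ <name: person>"]
print([classify(line) for line in body])
print(split_contribution(body[1]))
```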

The Body…
- Standardised orthography is used, but without capital letters or punctuation marks
- Plain text format is used to make the transcription machine-readable
- Own communication management (e.g. hesitations) and interactive communication management (e.g. feedback) are indicated
The Body...
Certain symbols are used to transcribe the following:
Elisions { } - curly brackets
Overlaps [ ] - square brackets
Comments < > - angle brackets
Pauses / or // or /// - slashes
Lengthening : - colon
Unclear speech (. . .) - three bracketed dots
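Since the symbols listed above are plain characters, they can be stripped automatically before words are counted. The sketch below shows one possible way; plain_tokens is a hypothetical helper, and the decisions to restore elided letters (dropping only the curly brackets) and to count every remaining whitespace-separated string as a token are assumptions, not the SLCP definition of a token. It expects the speech of one contribution, without the "$A:" speaker label.

```python
import re


def plain_tokens(text):
    """Strip transcription symbols from one contribution and split it into words."""
    text = re.sub(r"\(\s*\.\s*\.\s*\.\s*\)", " ", text)  # unclear speech (. . .)
    text = re.sub(r"<\d*|>\d*", " ", text)               # comment brackets, e.g. <1 ... >1
    text = re.sub(r"[\[\]{}]", "", text)                 # overlap and elision brackets
    text = re.sub(r"/+", " ", text)                      # pauses /, //, ///
    text = text.replace(":", "")                         # lengthening
    return text.split()


tokens = plain_tokens(
    "ngiyakhonza ngiyamthand{a} <1 unkulunkulu>1 [ ngiyamthanda angisoze ngimlahle ]"
)
print(len(tokens), tokens)
```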

The Body E.G.
Example: Elisions, overlaps, comments, pauses, lengthening
§ Religion
$A: uyakhonza konje
$B: ngiyakhonza ngiyamthand{a} <1 unkulunkulu>1 [ ngiyamthanda angisoze ngimlahle ]
@ <name: person>
$A: [nanso_ke <1 sisi>1 // e: e:]
@ <adoptive: English: sister>
Example: Unclear speech and code-switching
$Z: sekuphoqelekile ukuba (. . .) <1 neclaim>1 futhi (. . .) <2 that is why>2 <3 ngiclaimile>3
@ <1 code-mix: English>
@ <2 code-switch: English>
@ <3 code-mix: English>
3. The checking phase
- The transcription is manually checked by someone other than the transcriber to ensure quality control and reliability
- The transcription is also checked electronically for correctness of format before it is inserted into the corpus
- We currently use a GTS checking tool to monitor compliance with the transcription standards.
4. The tagging phase
- The process whereby the corpus is annotated by means of various tags: enriching a raw corpus with grammatical tags.
  E.G. abantwana a«prepref»ba«pref»ntwa«nstem»ana«dimsuf»
- Corpus-driven approach (information retrieved from raw data) vs. corpus-based approach (information retrieved from an annotated corpus)
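The tagged form of abantwana shown above pairs each morpheme with a tag in « » marks, so it can be split into (morpheme, tag) pairs with one regular expression. This is a sketch only, assuming every tag directly follows its morpheme; split_tagged is a hypothetical helper, not the project's tagger.

```python
import re

TAGGED = "a«prepref»ba«pref»ntwa«nstem»ana«dimsuf»"


def split_tagged(word):
    """Return (morpheme, tag) pairs from a word annotated as morpheme«tag»."""
    return re.findall(r"([^«»]+)«([^«»]+)»", word)


pairs = split_tagged(TAGGED)
for morph, tag in pairs:
    print(f"{morph}\t{tag}")
```

Note that joining the morphemes gives abantwaana rather than the surface form abantwana, so the original word has to be kept alongside its analysis.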
4. Tagging …

Allwood and Hendrikse (2003, 199) argue
that while the corpus driven approach
works well with isolating languages, in
agglutinating languages the corpus based
approach may be used. They also note
that Leech (1991) has warned against the
danger of bias underlying any form of
annotation. However, they argue that the
tagging of corpora is now fairly general
practice (Allwood and Hendrikse 2003,
199). The tagging set for the agglutinating
languages has been discussed in detail by
Allwood et al. (2003).
Progress and success in SLCP
- Of the nine languages, only Xhosa has shown significant progress
- Why? It was used for piloting the project and has a consistent transcriber
- Zulu follows with almost 20 000 transcribed tokens so far
Progress…
Number of Tokens per language
[Bar chart: tokens per language, y-axis from 0 to 350 000; languages: Zulu, Ndebele, N. Sotho, S. Sotho, Swati, Tsonga, Tswana, Venda, Xhosa]
Progress…
Recordings  N. of recordings  Hours  Un-transcribed  Transcribed  Checked  Tokens
Audio       33                38     2               31           31       45 723
Video       112               128    17              68           54       201 292
Progress…
- We also have some un-transcribed recordings for Tsonga, especially of children's speech
- The people behind the progress, the corpus group, share on issues of progress, motivating one another and presenting on key research aspects of the project
Problems
- Little or nothing is currently happening in the development of corpora for the remaining official languages
- Lack of appropriate monitoring: some of the video recordings get damaged and some of the digitized recordings have been lost
- Poor quality of recordings
- Lost data
- Uncoordinated individual activities
- Insufficient tools, financial and human resources
- …etc.
Recommendations and solutions: hope for the future
- Ultimately: a fully developed spoken corpus resource centre
- The establishment of a resource centre will not only be a sign of growth and prosperity, but also a sign of an investment in the future of languages and their speakers.
Recommend…
To this end, the following recommendations need to be considered:
1. More recordings and more trained transcribers are required in order to expedite the process.
2. Recorders and transcribers for all the languages should be remunerated to encourage them to do more in their work.
3. A network with other institutions, such as universities, should be created.
4. Short and medium term corpus development targets should be set up.
5. A server dedicated to the corpus must be established as a matter of urgency.
6. A properly structured corpus archive must be set up and maintained by a web master.
7. All the various corpus-related projects should be re-organised under one corpus management structure.
8. Corpus maintenance, tagging and mining tools designed for the agglutinating languages and other peculiar searches (e.g. communicative gestures) must be developed.
9. Preliminary corpus mining should begin for the benefit of the tool development enterprise and to encourage the use of corpora for language research and development.
This will lead to the establishment of a dedicated corpus publication series for the indigenous languages of South Africa.
References
- Allwood J, Grönqvist L, and Hendrikse AP. 2003. Developing a tag set and tagger for the African Languages of South Africa with special reference to Xhosa. Southern African Linguistics and Applied Language Studies 21 (4) 221-235.
- Allwood J and Hagman J. 1994. Some simple automatic measures of spoken interaction. Proc. of the 14th Scand. Conf. of Linguistics & 8th Conf. of Nordic and Gen. Linguistics, Vol. 72, Univ. of Göteborg.
- Allwood J and Hendrikse AP. 2003. Spoken language corpora for the nine official languages of South Africa. Southern African Linguistics and Applied Language Studies 21 (4) 187-199.
- Biber D, Conrad S, and Reppen R. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.