DEVELOPING AND MANAGING RESOURCE SCARCE LANGUAGES: THE SOUTH AFRICAN CASE

JUSTUS C ROUX

IMS STUTTGART

13.07.2015

OUTLINE

• The concept of resource scarce languages

• Overview of the language situation in South Africa

• Lack of language resources and high level support for development of resources

• Co-ordination of activities in resource development and management

• The demand for localised language services over digital devices and related opportunities


Resource scarce languages

“Under-resourced languages are generally described as languages that suffer from a chronic lack of available resources, from human, financial, and time resources to linguistic ones (language data and language technology), and often also experience the fragmentation of efforts in resource development.”

(Language Resources and Evaluation (LRE) Journal Special Issue Call, August 2014).


Resource scarce languages (2)

"This situation is exacerbated by the realization that as technology progresses and the demand for localised languages services over digital devices increases, the divide between adequately- and underresourced languages keeps widening."

(Language Resources and Evaluation (LRE) Journal Special Issue Call, August 2014).


Issues are

• A chronic lack of available resources, from human, financial, and time resources to linguistic ones

• Fragmentation of efforts in resource development

• As technology progresses the demand for localised languages services over digital devices increases

• But first, consider the language situation in South Africa


Language Situation in South Africa

11 official languages

Home language (n = 52 million speakers)

• Zulu 22%

• Xhosa 18%

• Afrikaans 16%

• N Sotho 10%

• English 9%

• Tswana 7%

• Swati 3%

• Tsonga 4%

• Venda 2%

• S Sotho 7%

• Ndebele 2%


The official African languages grouped

Nguni group (45%)

• isiZulu

• isiXhosa

• Siswati

• isiNdebele

Sotho group (24%)

• Northern Sotho / Sepedi

• Southern Sotho / Sesotho

• Western Sotho / Setswana

Tshivenda / Xitsonga group (4%)

• Tshivenda

• Xitsonga

Cross-border languages: Mozambique, Zimbabwe, Swaziland, Lesotho, Botswana


Similarities at different levels within groups

Sotho group - disjunctive spelling – lexical items

• Ke tla bolela Sepedi.

I will speak Sepedi.

• Ke tla bua Setswana.

I will speak Setswana.

• Ke tla bua Sesotho.

I will speak Sesotho.

Nguni group - conjunctive spelling – lexical items

• Ngi zo khuluma isiZulu.

I will speak isiZulu.

• Ndi zo thetha isiXhosa.

I will speak isiXhosa.

Implications for NLP

• Grammatical structures are the same across language groups

• Regular spelling: grapheme-to-phoneme conversion is direct

• Tone languages – specific implications and challenges for TTS systems
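To illustrate why regular spelling makes grapheme-to-phoneme conversion direct, here is a minimal sketch of a greedy longest-match converter. The rule table is an assumption for illustration only: a small hand-picked subset of Nguni-style letter-to-sound correspondences, not a complete or authoritative inventory. The names `G2P_RULES` and `grapheme_to_phoneme` are hypothetical. Tone is not handled at all, which is the separate TTS challenge noted above.

```python
# Illustrative subset of grapheme-to-phoneme correspondences (assumption:
# a handful of common Nguni digraphs and click letters, not a full table).
G2P_RULES = {
    "kh": "kʰ",
    "th": "tʰ",
    "ph": "pʰ",
    "sh": "ʃ",
    "hl": "ɬ",
    "c": "ǀ",
    "q": "ǃ",
    "x": "ǁ",
}


def grapheme_to_phoneme(word: str) -> list[str]:
    """Greedy longest-match conversion; letters without a rule map to themselves."""
    word = word.lower()
    max_len = max(len(grapheme) for grapheme in G2P_RULES)
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate grapheme first, then shorter ones.
        for size in range(max_len, 0, -1):
            chunk = word[i:i + size]
            if chunk in G2P_RULES:
                phonemes.append(G2P_RULES[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched: pass the letter through unchanged.
            phonemes.append(word[i])
            i += 1
    return phonemes


if __name__ == "__main__":
    for word in ("khuluma", "thetha"):
        print(word, grapheme_to_phoneme(word))
```

A table-driven design like this keeps the language-specific knowledge in data rather than code, so a sketch for a related language would mostly mean swapping the rule table.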


Afrikaans and its Germanic roots

• English: My hand is in warm water.

• Afrikaans: My hand is in warm water.

• Dutch: Mijn hand is in warm water.

• German: Meine Hand ist in warmem Wasser.

• Danish: Min hånd er i varmt vand.

• Norwegian: Min hånd er i varmt vann.

• Swedish: Min hand är i varmt vatten.

Implications

• Bootstrapping Afrikaans systems from e.g. Dutch.
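As a rough illustration of why Dutch is a natural starting point for bootstrapping, the sketch below computes token overlap (Jaccard similarity) between the Afrikaans example sentence and its counterparts in the other Germanic languages. The function names are hypothetical, and exact token overlap on a single sentence is only a crude proxy for linguistic closeness, not a method taken from the talk.

```python
import string

# Example sentences for "My hand is in warm water." in each language.
SENTENCES = {
    "English": "My hand is in warm water.",
    "Afrikaans": "My hand is in warm water.",
    "Dutch": "Mijn hand is in warm water.",
    "German": "Meine Hand ist in warmem Wasser.",
    "Danish": "Min hånd er i varmt vand.",
    "Norwegian": "Min hånd er i varmt vann.",
    "Swedish": "Min hand är i varmt vatten.",
}


def tokens(sentence: str) -> set[str]:
    """Lowercase the sentence, strip ASCII punctuation, return the set of words."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def jaccard(a: str, b: str) -> float:
    """Share of distinct words that two sentences have in common."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def overlap_with(target: str = "Afrikaans") -> dict[str, float]:
    """Jaccard overlap between the target language's sentence and all others."""
    return {
        lang: jaccard(SENTENCES[target], sentence)
        for lang, sentence in SENTENCES.items()
        if lang != target
    }


if __name__ == "__main__":
    for lang, score in sorted(overlap_with().items(), key=lambda kv: -kv[1]):
        print(f"{lang:10s} {score:.2f}")
```

On this one sentence, English scores 1.0 (the sentences are identical) and Dutch shares 5 of 7 distinct words with Afrikaans, while the Scandinavian languages share at most one.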


ISSUE #1

Chronic lack of available (digital) resources, from human, financial, and time resources to linguistic ones

• Digital resources for previously marginalised languages extremely limited: newspapers, periodicals, relatively low presence on the Web

• Lack of language expertise: no tradition of Computational Linguistics; limited number of students in local languages; only North-West University offers degree courses in language technologies ("Linguists are still needed", Ed Grefenstette)

• Growing expertise in Computer Science and Signal processing with focus on natural languages in most of the larger universities.

• Financial support mainly ad-hoc from private sources


• Various initiatives for text and speech data collections over a number of decades – mainly for linguistic / phonetic research at academic institutions – difficult to share resources

• Continued academic pressure (on grounds of the constitution) on government for support of research and development of Language Technologies - not to marginalise the indigenous languages again

• Large data acquisition projects sponsored by national government since 1999 – part of the National Language Plan (RSA and India are the only countries with an official policy regarding LT development).


• Ministerial Panel: HLT Strategy for South Africa (2002)

• Focus on digital resources: text & speech (SA official languages)

• 2008: Human Language Technology Expert Panel (HLTEP) established

• commissions HLT application projects annually with governmental funds

• these projects invariably create digital resources

• obvious that it was necessary to create a central depository for all newly created language resources

• Ongoing major projects since 2000 in text and speech domains

• Refer to RMA resources to be discussed


ISSUE #2

Fragmentation of efforts in resource development

• Various language projects across the country generating text and speech resources for different purposes – availability of the data (?)

• Resources from projects commissioned by the HLTEP (i.e. funded by taxpayers' money) needed to be deposited in a central place

• 2012: The National Department of Arts and Culture (DAC) established the Resource Management Agency (RMA) at the North-West University (Potchefstroom) under the auspices of the Centre for Text Technology (CTexT) as a 3-year project (www.rma.nwu.ac.za)


Contents of the RMA

LANGUAGE

• Afrikaans (31)

• English (30)

• isiNdebele (20)

• isiXhosa (23)

• isiZulu (27)

• Sesotho sa Leboa (Sepedi) (22)

• Setswana (20)

• Sesotho (Southern Sotho) (22)

• Siswati (20)

• Tshivenda (20)

• Xitsonga (24)

• Dutch (4)

• Yoruba (3)

PROJECT

• Autshumato (18)

• Lwazi (36)

• NCHLT Text (43)

• NCHLT Speech (13)

• African Speech Technology (15)

DATABASE TYPE

• Monolingual Speech Corpora: Annotated (22)

• Multilingual Text Corpora: Aligned (3)

• Monolingual Text Corpora: Annotated (1)

RESOURCE TYPES

• Data

• Modules

• Applications

• Tools / Platforms


FROM RMA TO NATIONAL CENTRE FOR DIGITAL LANGUAGE RESOURCES (NCDLR)

• RMA: status 3-4 year project (2012 – 2015) (Dept of Arts & Culture)

• Untenable as development of resources is ongoing (living archive)

• National Department of Science and Technology (DST) (2014):

• International panel to determine a new South African Research Infrastructure Roadmap (SARIR)

• Presentations made to include language (Humanities) and technology in a Roadmap dominated by natural science, medicine, engineering, earth sciences etc.

• June 2015: The National Centre for Digital Language Resources approved – long term funding (Press statement of DST to follow soon)


National Centre for Digital Language Resources

• University of Pretoria, Department of African Languages

• ICELDA Partnership

• North-West University, Centre for Text Technology (CTexT)

• University of South Africa, Department of African Languages

• CSIR Meraka Institute (Human Language Technologies Research Group)


NATIONAL CENTRE FOR DIGITAL LANGUAGE RESOURCES

Functions

• Single point of entry for information on SA language resources (portal)

• Free open access for academic research

• Licensed access for commercial applications

• Includes RMA resources

• Systematic digitisation of scientifically valuable language resources – historical nature (Scientific committee)


• Systematic digitisation of different registers/modes of language resources by the Centre, as well as by academics/public as open call funded projects

[Table: registers (Dialects, Child language, Urban slang, Natural discourse) by language (isiZulu, isiXhosa, Setswana, Sesotho, Afrikaans, Lang n), each cell marked with one of the symbols below]

X = available; (X) = limited data; ? = uncertain, should be acquired.

• Combine these projects with MA / PhD studies with data to be deposited at Centre

• Resource centre for studies in the domain of Digital Humanities


ISSUE #3

Demand for localised language services over digital devices increases

Available

• At text level

• Spelling checkers for all SA languages – CTexT (Microsoft) http://www.nwu.CTexT.ac.za

• Machine translation – government documents – CTexT (Autshumato IMT) http://www.autshumato.sourceforge.net

• On-line translations: e.g. www.Translate.org, www.Freelang.net and various other software programs ranging from word lists to communication phrases

• At speech/text level (interactive telephone based systems) (Major projects)

• African Speech Technology: Hotel reservation system in 5 languages (prototype) www.lrecconf.org/proceedings/lrec2004/summaries/445.htm

• LWAZI I and II: Various community based applications www.meraka.org.za/lwazi/


Why do we need to speed up localised language services?

There is a demand for a wide array of language based communication systems:

• Interactive multilingual voice systems as information systems

• Interactive text-to-speech systems

• Literacy training in different languages

• Language specific reading support for the blind

• Machine translation systems for public use

• Speech-to-speech communication systems with various language pairs

• Etc.

• There are specific research and business opportunities – consider the following


Mobile telephone penetration, selected countries

Source: http://www.itu.int/ITU-D/ict/statistics/explorer/index.html

Mobile cellular subscriptions (million)

• Japan: 149

• Nigeria: 127

• Germany: 100

• South Africa: 76

• Korea (Rep): 55

• France: 36

Mobile cellular subscriptions per 100 inhabitants

• South Africa: 146

• Germany: 121

• Japan: 117

• Korea (Rep): 111

• France: 98

• Nigeria: 73


Conclusion

• Challenges for the development and management of different types of language resources and applicable tools:

• Academic considerations: insights into language structures and use

• Commercial considerations: providing multilingual applications for a growing market, specifically in the African context

• To meet these challenges it is necessary to develop and update language resources not only on a case-by-case basis, but also systematically, in a coordinated manner, over as long a period as possible.

• This is what we are attempting to do in the South African context.


Thank you for listening.
