Cross language information retrieval in South African


CLIR: opening up possibilities for indigenous languages in South Africa?

Research team: Erica Cosijn (1), Heikki Keskustalo (2), Ari Pirkola (2), Karen de Wet (1) & Kalervo Järvelin (2)

(1) University of Pretoria, Pretoria, South Africa

(2) University of Tampere, Finland

1

Introduction

• What is CLIR?

• General methodology

• Afrikaans-English CLIR

• Zulu-English CLIR

• The road ahead

• Conclusions

2

What is CLIR?

• The basic idea is to bridge the language boundary by providing access in one language (the source language) to documents written in another language (the target language)

• Source language: the language that gives access to the required information, i.e. the query language

• Target language: the language of the content in the database

3

CLIR (cont.)

• CLIR works by:

– translating the query from the source language and/or translating the documents

• Main strategies for query translation

– dictionary-based methods

– corpus-based methods, and

– machine translation

4

CLIR approaches

• Corpus-based methods: work with frequency analysis

– Implication: aboutness of the two collections should be similar

• Machine translation: uses morphological parser etc.

5

CLIR: Machine translation

• Translates source language texts into target language using:

– Translation dictionaries

– Other linguistic resources

– Syntax analysis

• Limited availability

6

CLIR: Dictionary Based

• Problems

– Limitations of dictionaries

– Inflected word forms

– Phrases and compound words

– Lexical ambiguity

• Possible solution

– Approximate string matching

7

[Diagram: dictionary-based query translation]

Source language query → dictionary translation (using a bilingual source-English dictionary and other linguistic resources) → English language query → retrieval in English language database → English result

8

CLEF

The Cross-Language Evaluation Forum supports global digital library applications by (i) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (ii) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes

9

Retrieval system and test data

• Inquery – commercially available

• Probabilistic – i.e. best match, not exact

• “Bag of words” or structured queries

• used by Finnish partners in their projects

• TEST DATA: CLEF 2001

– 112 000 newspaper articles

– 35 queries (title and description)

– English to English baseline for comparison

– 2 sets

• Afrikaans/Zulu title

• Afrikaans/Zulu title and description

10

Afrikaans-English CLIR

• Afrikaans is spoken as a first language by the third largest group in South Africa

• Originated mainly from Dutch

• Germanic language

• Not inflectional

• Good technical vocabulary

• Good resources – e.g. dictionaries, spell checkers, parsers, compound splitters.

11

Methodology : Resources

• Electronic bilingual dictionary

– Filtered commercial dictionary

• Stopword list

– Translated from English and adapted

• Morphological analyzer

– Derived statistically from analysis of large newspaper text body

12

Dictionary Filtering

• Headwords identified by string-based rules

• Alternative spellings separated and listed as separate headwords

• Homonyms: each sense listed as separate headword

• Compounds identified and listed as separate headwords

• Plurals not included, but solved by morph analyzer

• Manual checking and fine-tuning

13

Stopword list

• Translation of existing English stopword list

• Check homonyms, e.g. again = weer = weather

• Large text body – Afrikaans language newspaper articles – 3500 words

• Frequency analysis compared to translated list

• Ad hoc additions

• Accented words added

• N=341
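The frequency-analysis step above can be sketched roughly as follows. This is a minimal illustration, not the team's actual procedure: the function name `stopword_candidates`, the `top_n` cut-off and the sample tokens are all assumptions made for the example.

```python
from collections import Counter


def stopword_candidates(tokens, translated_stopwords, top_n=50):
    """List frequent corpus words that the translated stopword list misses.

    tokens: words from a text body (e.g. newspaper articles).
    translated_stopwords: set of stopwords translated from English.
    top_n: hypothetical cut-off for "high frequency".
    """
    counts = Counter(token.lower() for token in tokens)
    frequent = [word for word, _ in counts.most_common(top_n)]
    # Words that are frequent but absent from the translated list are
    # candidates for ad hoc addition after manual checking.
    return [word for word in frequent if word not in translated_stopwords]


tokens = "die kat en die hond en die muis".split()
print(stopword_candidates(tokens, {"die"}, top_n=2))
```

Candidates produced this way would still need the manual homonym check described above before being added to the list.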

14

Morphological analyser (1)

• Based on patterns in language

• Newspaper text used for manual analysis

• 3500 words sorted by frequency facilitated duplicate removal

• 1200 unique words

15

Morphological analyser: Plurals

• All plural forms manually identified from 1200 words

• 62% of Afrikaans plurals formed by adding -e, -s or -’s to singular

• 13% of plurals have a double vowel in singular and plural is formed by removing one vowel and adding an -e to the end of the word

• Thus 75% of plurals solved by two simple rules
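The two plural rules can be sketched as below. This is only an illustration under stated assumptions: the function names, the tiny headword set and the example words (boom/bome, hond/honde, tafel/tafels, foto/foto's) are chosen for the example and are not taken from the study's data. Candidates are accepted only if they appear in the dictionary headword list.

```python
DOUBLE_VOWELS = "aeou"  # singulars containing aa, ee, oo or uu


def singular_candidates(word):
    """Return possible singular forms of a (possibly plural) word."""
    candidates = []
    # Rule 1: plural = singular + -e, -s or -'s
    for suffix in ("'s", "’s", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidates.append(word[: -len(suffix)])
    # Rule 2: the singular has a double vowel; the plural drops one vowel
    # and adds -e (boom -> bome), so reverse it: bome -> bom -> boom
    if len(word) >= 4 and word.endswith("e"):
        stem = word[:-1]
        vowel, consonant = stem[-2], stem[-1]
        if (vowel in DOUBLE_VOWELS and consonant not in "aeiou"
                and stem[-3:-2] != vowel):
            candidates.append(stem[:-1] + vowel + consonant)
    return candidates


def normalize_plural(word, headwords):
    """Return the first candidate singular found among the headwords."""
    for candidate in singular_candidates(word):
        if candidate in headwords:
            return candidate
    return None


headwords = {"boom", "hond", "tafel", "foto"}
for plural in ("bome", "honde", "tafels", "foto's"):
    print(plural, "->", normalize_plural(plural, headwords))
```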

16

Morphological analyser: Affixes

Manual analysis of text shows

• Past tense indicated by ge- prefix, but sometimes embedded, e.g. aangesteek

• Various suffixes are common: -te, -ste, -er, -ing, -ke, -le, -de, etc.

• Suffix stripping possible by longest common substring (LCS) matching
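A minimal sketch of suffix stripping by longest common substring matching is given below. The acceptance rule (the whole headword must occur at the start of the word) and the `min_len` threshold are assumptions for the illustration; the slides do not state the exact criteria used. The example words are illustrative only.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def strip_suffix(word, headwords, min_len=3):
    """Return the longest headword that matches the start of the word."""
    best = None
    for head in headwords:
        common = longest_common_substring(word, head)
        # Assumed rule: accept only if the shared substring is the whole
        # headword and the word begins with it (the rest is the suffix).
        if common == head and word.startswith(head) and len(head) >= min_len:
            if best is None or len(head) > len(best):
                best = head
    return best


print(strip_suffix("grootste", ["groot", "werk", "mooi"]))
```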

17

Morphological analyser: Compounds

Manual analysis of text shows

• Relatively high occurrence of compounds in Afrikaans - 1%

• Different types of compounds

• With or without fogemorphemes (joining morphemes)

• Only two fogemorphemes identified, namely -s- and -e-
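Compound splitting with the two fogemorphemes might look like the sketch below. Note the substitution: the slides describe multiple LCS runs, whereas this example simply looks both parts up in a headword set, which is a simpler stand-in. The function name, the `min_part` threshold and the example words stadsraad and perdekar are assumptions for illustration; babakos comes from the example query in this deck.

```python
FOGEMORPHEMES = ("s", "e")  # the two joining morphemes identified above


def split_compound(word, headwords, min_part=3):
    """Split a two-part compound into headwords, dropping a joining -s-/-e-."""
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if right not in headwords:
            continue
        if left in headwords:
            return (left, right)
        # Try again with a joining morpheme stripped from the left part.
        for joiner in FOGEMORPHEMES:
            stem = left[:-1]
            if left.endswith(joiner) and len(stem) >= min_part and stem in headwords:
                return (stem, right)
    return None


headwords = {"baba", "kos", "stad", "raad", "perd", "kar"}
for compound in ("babakos", "stadsraad", "perdekar"):
    print(compound, "->", split_compound(compound, headwords))
```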

18

Morphological analyser test data: Statistics - solvable

Category | N | %
1 Stopwords | 150 | 14,0
2 Headwords | 565 | 52,7
3 oo, aa, ee, uu rule solvable | 18 | 1,7
4 e, s, ’s rule (OR Longest Common Substring) | 85 | 7,9
5 More LCS matching | 59 | 5,5
6 Stripping prefix, e.g. ge- | 13 | 1,2
7 Compound splitting (multiple LCS runs + fogemorpheme stripping) | 50 | 4,7
Total | 940 | 87,7

19

Morphological analyser test data: Statistics – not solvable

Category | N | %
8 Compounds incorrectly solved | 8 | 0,7
9 Past tense ge- embedded in word | 26 | 2,4
10 Not solvable by morphological analyser | 16 | 1,5
11 Misspelt in original text | 2 | 0,2
Total (8 to 11) | 52 | 4,8
12 Proper nouns | 80 | 7,5

20

[Flow chart: processing of an original Afrikaans query key]

Decisions and actions shown in the chart:

• Preprocess key (verify character set used: preserve both uppercase and lowercase letters)

• Is the key found as-is (i.e. as a translation dictionary entry)?

• Does the key start with an uppercase letter? If yes, modify uppercase to lowercase

• Is the key found after removal of the ge- prefix? If yes, remove the prefix from the word

• Is the key recognized as the plural of a "double vowel singular" case? If yes, normalize the word to singular form

• Is the key a compound (i.e. decomposable using the LCS method)? If yes, decompose the word utilizing fogemorphemes

• Modify lowercase to uppercase: is the uppercase form found as-is in the dictionary?

• Is the key, or the word (or decomposed part), a stopword? If yes, remove it

• Translate using the Afr-Eng dictionary: word (or component) translations in English

• Unrecognized Afrikaans key: fuzzy matching (target index) gives the most similar words from the English database

21

Morphological analyser – steps

(condensed from flow chart)

• Match words found in dictionary

• Uppercase becomes lower case

• Remove ge- prefix

• Double vowel plural case

• Match longest common subsequence (suffixes as well as compounds solved)

• Modify lower case to uppercase (probably proper noun)

• Fuzzy match “as is” with target language database
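The condensed steps can be read as a cascade in which each rule is tried only if the previous ones failed. The sketch below is a simplified illustration under assumptions: the function name `resolve_key`, the step labels, the sample headwords and the prefix-based stand-in for LCS matching are not from the original system, and stopword removal is left out.

```python
def resolve_key(key, headwords):
    """Try each normalization step in order; return (step, result)."""
    if key in headwords:
        return ("as-is", key)
    word = key.lower()
    if word in headwords:
        return ("lowercased", word)
    if word.startswith("ge") and word[2:] in headwords:
        return ("ge- prefix removed", word[2:])
    # Double vowel plural: bome -> boom (re-double the vowel, drop final -e)
    if len(word) >= 4 and word.endswith("e") and word[-3] in "aeou":
        singular = word[:-3] + word[-3] * 2 + word[-2]
        if singular in headwords:
            return ("double vowel plural", singular)
    # Stand-in for LCS matching: longest headword found at the word start
    starts = [h for h in headwords if len(h) >= 3 and word.startswith(h)]
    if starts:
        return ("LCS match", max(starts, key=len))
    capitalized = word.capitalize()
    if capitalized in headwords:
        return ("proper noun", capitalized)
    # Nothing matched: pass the key unchanged to fuzzy matching
    return ("fuzzy match", key)


headwords = {"boom", "hond", "maak", "kos", "Pretoria"}
for key in ("Kos", "gemaak", "bome", "honde", "pretoria", "xyzzy"):
    print(key, "->", resolve_key(key, headwords))
```

Ordering the cheap exact lookups before the looser rules keeps the fuzzy match as a last resort, which mirrors the order of the list above.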

22

Example

Database used: CLEF

English title: Pesticides in Baby Food

Afrikaans source query: Plaagdoders in babakos

English baseline query: #sum(pesticide baby food)

The English target query translated from the Afrikaans source query: #sum(#syn(nullstr lues die van plague plague blight infestation pest affliction vexation killer) #syn( nullstr) #syn( baby food))
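A structured query of this shape could be assembled from per-word translation lists roughly as follows. The #sum and #syn operators are copied from the example above; the function name, the removal of duplicates and the skipping of empty translation sets are assumptions of this sketch and differ from the raw output shown, which keeps duplicates and nullstr placeholders.

```python
def build_structured_query(translation_sets):
    """Wrap each source word's translations in #syn, then combine with #sum."""
    groups = []
    for translations in translation_sets:
        if not translations:
            continue  # assumed: drop words that produced no translation
        unique = list(dict.fromkeys(translations))  # de-duplicate, keep order
        groups.append("#syn(" + " ".join(unique) + ")")
    return "#sum(" + " ".join(groups) + ")"


query = build_structured_query([
    ["plague", "plague", "pest", "killer"],  # plaagdoders (split compound)
    [],                                      # in (stopword, no translation)
    ["baby", "food"],                        # babakos (split compound)
])
print(query)
```

Grouping the alternatives for one source word under a single synonym operator keeps a word with many dictionary senses from outweighing the other query words.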

23

Results

24

Conclusions

• Dictionary probably too large

• Normalizer worked quite well

• Compound splitting by LCS methods mostly successful

• Stopword list adequate

• Results quite promising

25

Zulu-English CLIR

• isiZulu spoken by 8,8 million – largest number of speakers for a single language in SA

• Agglutinative – grammatical information conveyed by attaching pre- and suffixes to roots and stems

• Nouns: Grammatical genders – 8 classes in Zulu with distinctive prefixes in every class for singular and plural forms

• Verbs: Affixes mark grammatical relations such as object, subject, tense, mood, aspect

26

Methodology: Zulu to English

[Diagram: Zulu to English query translation]

Zulu source query → approximate dictionary matching (against a monolingual Zulu dictionary) → Zulu base form query → dictionary translation (Zulu-English dictionary) → English query → retrieval in CLEF English database → English result

27

Methodology (1)

• Monolingual word list

– No electronic bilingual dictionary

• Approximate matching

– Of all five metric and non-metric similarity measures tested, skipgrams yielded best results

– The Zulu word could be identified within three words 80% of the time
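Approximate matching with skipgrams can be illustrated with the sketch below. It is a simplified variant under assumptions: character pairs are formed with 0 or 1 skipped characters and compared with a plain set-overlap (Jaccard) score, which may differ from the measure used in the study. The function names and the short word list are hypothetical; isizwe and izizwe appear elsewhere in this deck.

```python
def skipgrams(word, skips=(0, 1)):
    """Collect character pairs formed by skipping 0 or 1 characters."""
    grams = set()
    for skip in skips:
        for i in range(len(word) - skip - 1):
            grams.add((skip, word[i] + word[i + skip + 1]))
    return grams


def similarity(a, b, skips=(0, 1)):
    """Share of skipgrams common to both words (0.0 to 1.0)."""
    grams_a, grams_b = skipgrams(a, skips), skipgrams(b, skips)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_matches(word, word_list, n=3):
    """Return the n entries of word_list most similar to word."""
    ranked = sorted(word_list, key=lambda w: similarity(word, w), reverse=True)
    return ranked[:n]


word_list = ["isizwe", "umuntu", "indlu"]
print(best_matches("izizwe", word_list, n=3))
```

Returning the top three candidates corresponds to the "within three words" criterion reported above.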

28

Methodology (2)

• Translations from Zulu source words into English done manually

• Problems experienced in this process

– Paraphrasing due to disparate vocabularies

E.g. isinyabulala – person weak from age

– Homonyms – single words with various meanings

E.g. –zwe isizwe izizwe = tribe OR rapidly spreading brain disease

29

Example of paraphrasing

Find documents that describe acts of terrorism or vandalism against European synagogues since the end of the Second World War .

Thola

Find izenzo acts izinto the breaking of things nezindlu the houses imibhalo echaza scriptures that describe zokuphekulazikhuni noma of terror ngobudlova with violent force and elwa that fight zesonto of Sunday ase-Europe of Europe kwezimpi of the war kusukela from zesibili of second zamaJuda of the Jews ekupheleni the end zomhlaba of the world

30

Analysis of translation problems

[Bar chart: number of occurrences per type of translation problem. Categories shown: interrogative, enclitic, verb extensions, conjunctives, locatives, homonyms, vowel elision, vowel coalescence, pre-nasalisation, palatalisations, paraphrases, proper names, zululisations, borrowed words. Data labels in ascending order: 2, 2, 5, 5, 6, 9, 10, 12, 17, 20, 28, 41, 46, 55.]

31

The road forward

• Parsers and morphological analysers are in development

• Spellcheckers have extensive word lists

• Increasing web presence of indigenous languages, especially on government sites and in newspapers, leads to the possibility of parallel corpora

• Cross Cultural Information Retrieval?

32

Conclusions

• Indigenous Knowledge is a valuable resource – it is important to make it accessible

• Learn from international research and create a good product from the outset

• Many opportunities for research

33

Cross Language Information Retrieval (CLIR)

• To provide access in one language to documents written in another language

• Query translation or document translation

• Approaches

– Corpus-based techniques

– Machine translation

– Dictionary-based techniques

34
