Research team: Erica Cosijn (1), Heikki Keskustalo (2), Ari Pirkola (2), Karen de Wet (1) & Kalervo Järvelin (2)
(1) University of Pretoria, Pretoria, South Africa
(2) University of Tampere, Finland
1
• What is CLIR?
• General methodology
• Afrikaans-English CLIR
• Zulu-English CLIR
• The road ahead
• Conclusions
2
• The basic idea is to bridge the language boundary by providing access in one language (the source language) to documents written in another language (the target language)
• Source language: the language that gives access to the required information, i.e. the query language
• Target language: the language of the content in the database
3
• CLIR can be implemented by:
– translating the query from the source language and/or translating the documents
• Main strategies for query translation
– dictionary-based methods
– corpus-based methods, and
– machine translation
4
• Corpus-based methods: work with frequency analysis
– Implication: aboutness of the two collections should be similar
• Machine translation: uses morphological parser etc.
5
• Translates source language texts into target language using:
– Translation dictionaries
– Other linguistic resources
– Syntax analysis
• Limited availability
6
• Problems
– Limitations of dictionaries
– Inflected word forms
– Phrases and compound words
– Lexical ambiguity
• Possible solution
– Approximate string matching
7
Source language query
→ Dictionary translation (bilingual source-English dictionary, other linguistic resources)
→ English language query
→ Retrieval in English language database
→ English result
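The translation step of this pipeline can be sketched in a few lines of Python. This is only an illustration: the dictionary entries, the stopword set and the function name `translate_query` are hypothetical and are not the project's actual resources.

```python
# Toy resources (hypothetical entries, for illustration only).
AFR_ENG = {
    "plaagdoder": ["pesticide"],
    "baba": ["baby"],
    "kos": ["food"],
}
STOPWORDS = {"in", "die", "van"}


def translate_query(query, dictionary, stopwords):
    """Return one list of English candidates per source-language key."""
    translated = []
    for key in query.lower().split():
        if key in stopwords:
            continue  # stopwords are dropped before translation
        # Keys missing from the dictionary are passed through unchanged.
        translated.append(dictionary.get(key, [key]))
    return translated


print(translate_query("plaagdoder in baba kos", AFR_ENG, STOPWORDS))
```

The resulting lists of candidates would then be turned into an English query and run against the English database.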
8
The Cross-Language Evaluation Forum (CLEF) supports global digital library applications by
(i) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (ii) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes
9
• Inquery – commercially available
• Probabilistic – i.e. best match, not exact
• “Bag of words” or structured queries
• Used by the Finnish partners in their projects
• TEST DATA: CLEF 2001
– 112 000 newspaper articles
– 35 queries (title and description)
– English to English baseline for comparison
– 2 sets
• Afrikaans/Zulu title
• Afrikaans/Zulu title and description
10
• Afrikaans is spoken as a first language by the third largest group in South Africa
• Originated mainly from Dutch
• Germanic language
• Not inflectional
• Good technical vocabulary
• Good resources – e.g. dictionaries, spell checkers, parsers, compound splitters.
11
• Electronic bilingual dictionary
– Filtered commercial dictionary
• Stopword list
– Translated from English and adapted
• Morphological analyzer
– Derived statistically from analysis of large newspaper text body
12
• Headwords identified by string-based rules
• Alternative spellings separated and listed as separate headwords
• Homonyms: each sense listed as separate headword
• Compounds identified and listed as separate headwords
• Plurals not included, but solved by morph analyzer
• Manual checking and fine-tuning
13
• Translation of existing English stopword list
• Check homonyms, e.g. again = weer = weather
• Large text body – Afrikaans language newspaper articles – 3500 words
• Frequency analysis compared to translated list
• Ad hoc additions
• Accented words added
• N=341
14
• Based on patterns in language
• Newspaper text used for manual analysis
• 3500 words sorted by frequency facilitated duplicate removal
• 1200 unique words
15
• All plural forms manually identified from 1200 words
• 62% of Afrikaans plurals formed by adding -e, -s or -’s to singular
• 13% of plurals have a double vowel in singular and plural is formed by removing one vowel and adding an -e to the end of the word
• Thus 75% of plurals solved by two simple rules
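The two plural rules above can be sketched as follows. This is a minimal illustration, not the project's analyzer: the function names, the length thresholds and the sample headwords are assumptions.

```python
VOWELS = "aeiou"
DOUBLE_VOWELS = "aeou"  # the aa, ee, oo, uu cases


def singular_candidates(word):
    """Generate possible singular forms for an Afrikaans plural."""
    candidates = []
    # Rule 1: plural = singular + -e, -s or -'s, so strip that ending.
    for suffix in ("'s", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidates.append(word[: -len(suffix)])
            break
    # Rule 2: the singular has a double vowel; the plural drops one vowel
    # and adds -e (boom -> bome). Reverse it: strip -e, double the vowel.
    if word.endswith("e") and len(word) >= 3:
        stem = word[:-1]
        single_vowel = (
            stem[-1] not in VOWELS
            and stem[-2] in DOUBLE_VOWELS
            and (len(stem) < 3 or stem[-3] not in VOWELS)
        )
        if single_vowel:
            candidates.append(stem[:-1] + stem[-2] + stem[-1])
    return candidates


def to_singular(word, headwords):
    """Return the first candidate that is a dictionary headword, else None."""
    for candidate in singular_candidates(word):
        if candidate in headwords:
            return candidate
    return None


HEADWORDS = {"boom", "boek", "tafel", "foto"}  # hypothetical sample
for plural in ("bome", "boeke", "tafels", "foto's"):
    print(plural, "->", to_singular(plural, HEADWORDS))
```

Candidates are only accepted when they occur as headwords, so a spurious stem such as "bom" is discarded in favour of "boom".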
16
Manual analysis of text shows
• Past tense indicated by the ge- prefix, but sometimes embedded, e.g. aangesteek (aan-ge-steek)
• Various suffixes are common: -te, -ste, -er, -ing, -ke, -le, -de, etc.
• Suffix stripping possible by longest common substring (LCS) matching
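A rough sketch of suffix stripping by longest common substring matching is given below. The acceptance criteria (the common substring must start both the word and the headword, with a minimum length of 3) and the tie-breaking rule are assumptions made for this illustration, as are the sample headwords.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous string shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def match_headword(word, headwords, min_len=3):
    """Pick the headword sharing the longest word-initial substring."""
    best, best_key = None, None
    for head in headwords:
        common = longest_common_substring(word, head)
        if len(common) < min_len:
            continue
        # Assumption: only suffixes are stripped, so the shared part
        # must begin both the inflected word and the headword.
        if not (word.startswith(common) and head.startswith(common)):
            continue
        key = (len(common), -len(head))  # prefer the shorter headword on ties
        if best_key is None or key > best_key:
            best, best_key = head, key
    return best


HEADWORDS = ["wet", "werk", "werkgewer", "groot"]  # hypothetical sample
print(match_headword("werking", HEADWORDS))
print(match_headword("grootste", HEADWORDS))
```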
17
Manual analysis of text shows
• Relatively high occurrence of compounds in Afrikaans: 1%
• Different types of compounds
• With or without fogemorphemes (joining morphemes)
• Only two fogemorphemes identified, namely -s- and -e-
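Compound splitting with the two fogemorphemes can be illustrated as below. Note that this sketch uses plain dictionary lookup on both parts instead of repeated LCS runs, and handles two-part compounds only; the headword sample and the `min_part` threshold are assumptions.

```python
FOGEMORPHEMES = ("", "s", "e")  # no joining morpheme, -s-, -e-


def split_compound(word, headwords, min_part=3):
    """Split a two-part compound into headwords, or return None."""
    for i in range(min_part, len(word) - min_part + 1):
        head = word[:i]
        if head not in headwords:
            continue
        rest = word[i:]
        for joint in FOGEMORPHEMES:
            if not rest.startswith(joint):
                continue
            tail = rest[len(joint):]  # strip the joining morpheme
            if len(tail) >= min_part and tail in headwords:
                return [head, tail]
    return None


# Hypothetical sample of dictionary headwords.
HEADWORDS = {"baba", "kos", "stad", "raad", "hond", "hok"}
print(split_compound("babakos", HEADWORDS))    # no fogemorpheme
print(split_compound("stadsraad", HEADWORDS))  # joined with -s-
print(split_compound("hondehok", HEADWORDS))   # joined with -e-
```

Each component can then be translated separately through the bilingual dictionary.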
18
Step                                                                   N      %
1  Stopwords                                                         150   14,0
2  Headwords                                                         565   52,7
3  oo, aa, ee, uu rule solvable                                       18    1,7
4  e, s, ’s rule (OR Longest Common Substring)                        85    7,9
5  More LCS matching                                                  59    5,5
6  Stripping prefix, e.g. ge-                                         13    1,2
7  Compound splitting (multiple LCS runs + fogemorpheme stripping)    50    4,7
Total                                                                940   87,7
19
Category                                      N      %
8   Compounds incorrectly solved              8   0,7%
9   Past tense ge- embedded in word          26   2,4%
10  Not solvable by morphological analyser   16   1,5%
11  Misspelt in original text                 2   0,2%
Total                                        52   4,8%
12  Proper nouns                             80   7,5%
20
Flow chart: processing of an Afrikaans query key
• Input: original Afrikaans query key
• Preprocess key (verify character set used: preserve both uppercase and lowercase letters)
• Is the key a stopword? Yes: remove
• Is the key found as-is (i.e. as a translation dictionary entry)? Yes: translate
• Does the key start with an uppercase letter? Yes: modify uppercase to lowercase
• Is the key found after removal of the ge- prefix? Yes: remove the prefix from the word
• Is the key recognized as the plural of a “double vowel singular” case? Yes: normalize the word to singular form
• Is the key a compound (i.e. decomposable using the LCS method)? Yes: decompose the word utilizing fogemorphemes
• Is the word (or decomposed part) a stop word? Yes: remove. No: translate using the Afr-Eng dictionary, giving word (or component) translations in English
• Modify lowercase to uppercase. Is the uppercase form found as-is in the dictionary? No: unrecognized Afrikaans key
• Unrecognized Afrikaans key: fuzzy matching (target index), giving the most similar words from the English database
21
(condensed from flow chart)
• Match words found in dictionary
• Uppercase becomes lower case
• Remove ge- prefix
• Double vowel plural case
• Match longest common substring (suffixes as well as compounds solved)
• Modify lower case to uppercase (probably proper noun)
• Fuzzy match “as is” with target language database
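The condensed procedure amounts to a cascade of fallbacks, which might be sketched as follows. Several simplifications are assumed here: the LCS step is reduced to "longest headword that begins the key", the fuzzy step uses Python's `difflib.get_close_matches` as a stand-in for the similarity measure actually used, and all dictionary entries are hypothetical.

```python
import difflib


def resolve_key(key, dictionary, target_index):
    """Try each fallback in order; return (step name, translations)."""
    if key in dictionary:
        return "as-is", dictionary[key]
    lower = key.lower()
    if lower in dictionary:
        return "lowercased", dictionary[lower]
    if lower.startswith("ge") and lower[2:] in dictionary:
        return "ge- removed", dictionary[lower[2:]]
    # Double vowel plural: strip -e and double the stem vowel (bome -> boom).
    if lower.endswith("e") and len(lower) >= 3:
        stem = lower[:-1]
        singular = stem[:-1] + stem[-2] + stem[-1]
        if singular in dictionary:
            return "double vowel plural", dictionary[singular]
    # Simplified stand-in for LCS matching: longest headword starting the key.
    prefixes = [h for h in dictionary if len(h) >= 3 and lower.startswith(h)]
    if prefixes:
        return "lcs", dictionary[max(prefixes, key=len)]
    # Probably a proper noun: retry with an initial capital.
    capital = lower.capitalize()
    if capital in dictionary:
        return "capitalised", dictionary[capital]
    # Last resort: fuzzy match the key as-is against the target index.
    return "fuzzy", difflib.get_close_matches(lower, target_index, n=3)


DICTIONARY = {"werk": ["work"], "boom": ["tree"], "Pretoria": ["Pretoria"]}
TARGET_INDEX = ["europe", "food"]
for key in ("Werk", "gewerk", "bome", "werking", "pretoria", "europa"):
    print(key, resolve_key(key, DICTIONARY, TARGET_INDEX))
```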
22
Database used: CLEF
English title: Pesticides in Baby Food
Afrikaans source query: Plaagdoders in babakos
English baseline query: #sum(pesticide baby food)
The English target query translated from the Afrikaans source query: #sum(#syn(nullstr lues die van plague plague blight infestation pest affliction vexation killer) #syn( nullstr) #syn( baby food))
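The structured query format shown above, with the translation alternatives of each source key grouped under #syn and the groups combined under #sum, can be assembled with simple string handling. The helper name `build_inquery_query` and the choice to leave single translations unwrapped are assumptions of this sketch.

```python
def build_inquery_query(translation_sets):
    """Combine per-key translation candidates into a #sum/#syn query string."""
    parts = []
    for candidates in translation_sets:
        if not candidates:
            continue  # nothing to add for an untranslated key
        if len(candidates) == 1:
            parts.append(candidates[0])
        else:
            # Alternatives for one source key are treated as synonyms.
            parts.append("#syn(" + " ".join(candidates) + ")")
    return "#sum(" + " ".join(parts) + ")"


print(build_inquery_query([["pesticide"], ["baby"], ["food"]]))
print(build_inquery_query([["plague", "pest", "killer"], ["baby", "food"]]))
```

Grouping alternatives under #syn keeps a source key with many dictionary translations from outweighing keys with a single translation.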
23
24
• Dictionary probably too large
• Normalizer worked quite well
• Compound splitting by LCS methods mostly successful
• Stopword list adequate
• Results quite promising
25
• isiZulu spoken by 8,8 million – largest number of speakers for a single language in SA
• Agglutinative – grammatical information conveyed by attaching prefixes and suffixes to roots and stems
• Nouns: Grammatical genders – 8 classes in Zulu with distinctive prefixes in every class for singular and plural forms
• Verbs: Affixes mark grammatical relations such as object, subject, tense, mood, aspect
26
Zulu source query
→ Approximate dictionary matching (monolingual Zulu dictionary)
→ Zulu base form query
→ Dictionary translation (Zulu-English dictionary)
→ English query
→ Retrieval in CLEF English database
→ English result
27
• Monolingual word list
– No electronic bilingual dictionary
• Approximate matching
– Of all five metric and non-metric similarity measures tested, skipgrams yielded best results
– The Zulu word could be identified within three words 80% of the time
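A skipgram comparison can be sketched as below. The exact formulation used in the experiments is not given here, so this version is an assumption: character pairs are formed with 0 or 1 skipped characters and the two gram sets are compared by set overlap (Jaccard). The word list is a hypothetical sample.

```python
def skipgrams(word, max_skip=1):
    """Character pairs with 0..max_skip characters skipped between them."""
    grams = set()
    for i in range(len(word)):
        for skip in range(max_skip + 1):
            j = i + skip + 1
            if j < len(word):
                grams.add(word[i] + word[j])
    return grams


def similarity(a, b, max_skip=1):
    """Jaccard overlap of the two skipgram sets (0.0 if either is empty)."""
    ga, gb = skipgrams(a, max_skip), skipgrams(b, max_skip)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def best_matches(word, wordlist, n=3):
    """Return the n most similar entries from a monolingual word list."""
    return sorted(wordlist, key=lambda w: similarity(word, w), reverse=True)[:n]


WORDLIST = ["isizwe", "umuntu", "indlu"]  # hypothetical sample
print(best_matches("izizwe", WORDLIST))
```

Returning the top three candidates mirrors the observation above that the correct Zulu word was usually found within the first three matches.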
28
• Translations from Zulu source words into English done manually
• Problems experienced in this process
– Paraphrasing due to disparate vocabularies
E.g. isinyabulala – person weak from age
– Homonyms – single words with various meanings
E.g. –zwe isizwe izizwe = tribe OR rapidly spreading brain disease
29
Find documents that describe acts of terrorism or vandalism against European synagogues since the end of the Second World War .
Thola
Find izenzo acts izinto the breaking of things nezindlu the houses imibhalo echaza scriptures that describe zokuphekulazikhuni noma of terror ngobudlova with violent force and elwa that fight zesonto of Sunday ase-Europe of Europe kwezimpi of the war kusukela from zesibili of second zamaJuda of the Jews ekupheleni the end zomhlaba of the world
30
[Bar chart. Axis: number of occurrences (scale 0 to 60). Categories: interrogative, enclitic, verb extensions, conjunctives, locatives, homonyms, vowel elision, vowel coalescence, pre-nasalisation, palatalisations, paraphrases, proper names, zululisations, borrowed words. Bar values as extracted: 2, 2, 5, 5, 6, 9, 10, 12, 17, 46, 55, 20, 28, 41.]
31
• Parsers and morphological analysers are in progress
• Spellcheckers have extensive word lists
• Increasing web presence of indigenous languages, especially on government sites and in newspapers, opens up the possibility of parallel corpora
• Cross Cultural Information Retrieval?
32
• Indigenous Knowledge is a valuable resource – it is important to make it accessible
• Learn from international research and create a good product from the outset
• Many opportunities for research
33
• To provide access in one language to documents written in another language
• Query translation or document translation
• Approaches
– Corpus-based techniques
– Machine translation
– Dictionary-based techniques
34