Effectiveness of stemming for Turkish text retrieval

Program, Vol 34 No 2, 2000
© Aslib, The Association for Information Management.
All rights reserved. Except as otherwise permitted under the Copyright, Designs and Patents Act
1988, no part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying or otherwise without the
prior written permission of the publisher.
F. ÇUNA EKMEKÇIOGLU and PETER WILLETT
1. Introduction
Stemming algorithms are widely used to conflate morphological variants in
searches of free-text databases.1,2 Many of the stemmers that have been
reported in the literature (e.g., those for English3, Slovene4, Dutch5, Greek6 and
Malay7) operate by matching the ending of a word against a dictionary of suffixes, and removing any suffix that is identified subject to constraints specified
in an associated set of context-sensitive rules; recoding may also be applied to
the root that remains after the suffix has been removed. An alternative
approach to stemming involves the use of a morphological analyser to remove
suffixes from words according to their internal structure. Such algorithms proceed from left to right through a word (rather than from right to left, as in
suffix-stripping) and first look in a lexicon for a root matching an initial substring of the word; they then use grammatical information stored in the lexical
entry to determine what possible suffixes may follow. When a suffix matching
a further substring is found, grammatical information in the lexical entry for
that suffix again determines what class of suffixes may follow. The stemming
is successful if the end of the word can be reached by iteration of this process,
and if the last suffix is one which may end a word. In this paper, we discuss the
evaluation of a stemming algorithm of this type for searching a database of
Turkish text and compare the results with those obtained in comparable, but
non-stemmed, searches.
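The left-to-right procedure described above can be illustrated with a minimal sketch. The tiny lexicon, the suffix classes and the function names below are hypothetical and chosen only for illustration; the sketch ignores vowel harmony and all other phonological rules, and it is not the analyser used in this study.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: each root names the class of suffixes that may follow it.
ROOTS: Dict[str, str] = {"ev": "NOUN", "enflasyon": "NOUN"}

# Hypothetical suffix table. For each class: (suffix, class that may follow it,
# whether a word is allowed to end immediately after this suffix).
SUFFIXES: Dict[str, List[Tuple[str, str, bool]]] = {
    "NOUN": [("ler", "PLURAL", True), ("in", "CASE", True), ("de", "CASE", True)],
    "PLURAL": [("in", "CASE", True), ("de", "CASE", True)],
    "CASE": [],
}


def _follow(rest: str, cls: str, can_end: bool) -> List[List[str]]:
    """Return every suffix sequence that consumes `rest`, starting from class `cls`."""
    if not rest:
        # End of word reached: valid only if the last item may end a word.
        return [[]] if can_end else []
    sequences = []
    for suffix, next_cls, may_end in SUFFIXES.get(cls, []):
        if rest.startswith(suffix):
            for tail in _follow(rest[len(suffix):], next_cls, may_end):
                sequences.append([suffix] + tail)
    return sequences


def analyse(word: str) -> List[Tuple[str, List[str]]]:
    """Return all (root, suffix list) analyses found by scanning left to right."""
    analyses = []
    for root, cls in ROOTS.items():
        if word.startswith(root):
            # Assumption: a bare root may stand as a complete word.
            for suffixes in _follow(word[len(root):], cls, True):
                analyses.append((root, suffixes))
    return analyses


print(analyse("evlerde"))
```

An empty result means that the end of the word could not be reached, i.e. the stemming was unsuccessful in the sense described above.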
Turkish is a member of the south-western or Oghuz group of Turkic languages, which also includes Turkmen, Azerbaijani or Azeri, Ghasghai and
Gagauz.8 The language uses a Latin alphabet consisting of 29 letters, of which
eight are vowels and 21 are consonants, and is an agglutinative language, i.e.
one in which words contain a basic root, with one or more suffixes being combined with this root in order to extend its meaning or to create other classes
of words.9 Agglutination can result in long words that can contain as much
semantic information as a whole English phrase, clause or sentence. An
extreme example of this characteristic of the language is provided by the word
avrupalılaştırılamıyabilenlerdenmişsiniz, which means ‘you seem to be one
of those who may be incapable of being Europeanised’. The root here is
avrupa (Europe) and the remainder consists of suffixes which have certain
functions and add meaning to the root; for example, the first suffix -lı modifies
the meaning to ‘person from Europe’ and the second suffix -laş then alters the
function to a verb which means ‘to become one of the persons from Europe’.
Program, vol. 34, no. 2, April 2000, pp. 195–200
Aslib, The Association for Information Management
Staple Hall, Stone House Court, London EC3A 7PB
Tel: +44 (0) 171 903 0000, Fax: +44 (0) 171 903 0011
Email: pubs@aslib.co.uk, WWW: http://www.aslib.co.uk/aslib
This complex morphological structure means that a single Turkish word can
give rise to a very large number of variants, with a consequent need for effective stemming if high recall is to be achieved in searches of Turkish text databases. For example, Ekmekçioglu et al. found that the root ol (to become, to
happen) occurred in no fewer than 248 different forms in a file of ca. 50,000 word types extracted from a Turkish political news database.10 Such behaviour is
extremely difficult to encompass by a conventional suffix-stripping algorithm;
instead, a morphological analyser is required if high-quality stemming is to be
achieved, and we have thus used this approach in the work reported here.
2. Experimental details
The experiments used a file containing the titles and abstracts for 6,289 economic and political news stories extracted from Turkish newspapers in the
period 1991–93. These texts were searched using a set of 50 natural-language
queries provided by 30 Turkish native students who were studying at the
University of Sheffield and who agreed to provide both natural-language
queries and relevance judgements on the stemmed and unstemmed search outputs discussed below. In previous work,10,11 we have developed a list of 730
stopwords (mainly conjunctions, prepositions and pronouns) for Turkish text
retrieval, and each of the queries was checked against this list. The resulting set
of queries contained an average of 5.1 words (minimum and maximum of
3 and 9 words, respectively), and these were then searched in both stemmed
and unstemmed form, using the OKAPI text retrieval system.12
OKAPI calculates weights for each of the terms in a document or query
using a probabilistic model of retrieval. It then scores a document on the basis
of the sum of the weights for those terms that it has in common with a query;
documents are ranked in decreasing order of this summed weight and the top-ranked documents are then presented to the user for relevance assessment
(although not tested in our experiments, these judgements can be used for the
calculation of new relevance weights and then for a second, feedback search).
The particular weighting scheme used here was the BM25 weight13, with the
users’ relevance judgements being based on the information contained in the
title and abstract of the 10 or 20 top-ranked documents.
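The ranking step described above can be sketched as follows. This is a minimal illustration of a commonly quoted form of the BM25 weight, not the OKAPI implementation: the function name, the toy documents and the parameter values k1 = 1.2 and b = 0.75 are assumptions, since the values used in the experiments are not reported here.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document as the sum of BM25 weights of the query terms it contains."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Classic inverse-document-frequency component (can be negative
            # for terms that occur in more than half of the documents).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency component with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection of tokenised documents.
docs = [["enflasyon", "yukseldi"], ["faiz", "dustu"], ["borsa", "yukseldi"],
        ["ihracat", "artti"], ["butce", "acik"]]
scores = bm25_scores(["enflasyon"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking[:2])
```

Sorting the documents by decreasing score and cutting the list at 10 or 20 gives the sets that would be shown to a searcher for relevance assessment.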
Stemming was effected using the two-level morphological analyser,
PC-KIMMO.9,14 PC-KIMMO, which is named after its inventor Kimmo
Koskenniemi, is designed to generate (produce) and/or recognise (parse)
words using a two-level model of word structure in which a word is represented as a correspondence between its lexical level form and its surface level
form. The generator component of PC-KIMMO accepts as input a lexical
form, applies the appropriate rules, and returns the corresponding surface
form. The recogniser component accepts as input a surface form, applies the
appropriate rules, consults the lexicon, and returns the corresponding lexical
form with its gloss. Oflazer has developed a description of the Turkish language that is suitable for processing by PC-KIMMO,15 in which the phonetic
rules of contemporary Turkish have been encoded using two-level rules while
the morphotactics of the agglutinative word structures have been encoded as
finite-state machines for verbal, nominal and other classes of words. Oflazer’s
program is based on a list of ca. 23,000 word roots: this list is divided into
a number of separate lexicons for nouns, adjectives, verbs, compound nouns,
proper nouns, pronouns, adverbs, connectives, exclamations, postpositions,
acronyms, technical jargon and nominal word structures that exhibit a number
of special cases. The description consists of two files: a rules file, which specifies the alphabet and the phonological (or spelling) rules; and a lexicon file,
which lists lexical items (words or morphemes) and their glosses, and encodes
morphotactic constraints. These two files were incorporated in PC-KIMMO,
and the resulting program used for stemming.
Stemming is normally carried out upon both document and query texts, but
we have only stemmed the query words in our experiments. Such a procedure
is feasible because a Turkish word root is generally unaffected when a suffix
is added to its right-hand end. Accordingly, there is no need for the recoding
procedures that are required in many other languages, and the use of a simple
truncation search thus ensures that a stemmed query word is able to retrieve all
of the variants in the database that are derived from it. For example, the word
enflasyonla (with/by inflation) in one of the queries was stemmed to enflasyon,
which resulted in matches with words such as enflasyonu, enflasyonunu, enflasyonun and enflasyonist.
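A truncation search of this kind amounts to prefix matching against the indexed vocabulary. The sketch below assumes a hypothetical inverted index that maps each unstemmed word in the database to the set of documents containing it; the index contents and the function name are illustrative only.

```python
from typing import Dict, Set


def truncation_search(stem: str, index: Dict[str, Set[int]]) -> Set[int]:
    """Return the ids of all documents containing a word that begins with `stem`."""
    hits: Set[int] = set()
    for word, doc_ids in index.items():
        if word.startswith(stem):
            hits |= doc_ids
    return hits


# Hypothetical inverted index over unstemmed document words.
index = {
    "enflasyonu": {1},
    "enflasyonunu": {2},
    "enflasyonun": {2, 3},
    "enflasyonist": {4},
    "faiz": {5},
}
print(sorted(truncation_search("enflasyon", index)))
```

Because only the query word is stemmed, the document index can be built once from the raw text. The linear scan over the vocabulary is kept for clarity; a sorted word list with binary search would serve the same purpose for a larger index.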
3. Results and discussion
The mean numbers of relevant documents retrieved, when averaged over the set
of 50 queries, are listed in Table 1. It will be seen that the stemmed searches
retrieve noticeably more relevant documents; indeed, an inspection of the
retrieved sets showed that in most cases, the unstemmed relevant documents
were a proper subset of the stemmed relevant documents. The statistical significance of this difference in performance was established using the sign test.16
This test focuses on the direction of the difference in some parameter between
two related samples, noting whether the sign of the difference is positive, negative or whether the two values are equal. In the present context, the experimental parameter is the number of relevant documents retrieved in a search; when
used in a one-tailed manner, the test can determine whether the stemmed
searches retrieve significantly more relevant documents than do the unstemmed
searches. The results of this analysis are summarised in Table 2, from which it
will be seen that the stemmed searches are better than the unstemmed searches
at a very high level of statistical significance.

Table 1. Mean numbers of relevant documents retrieved
when averaged over a set of 50 searches using both
stemmed and unstemmed query words

Search Type     Cutoff-10     Cutoff-20
Unstemmed       4.04          6.84
Stemmed         5.34          9.08

Table 2. Sign test analysis based on the numbers of
relevant documents retrieved over a set of 50 searches
using both stemmed and unstemmed query words

                        Cutoff-10       Cutoff-20
Stemmed > Unstemmed     33              34
Stemmed < Unstemmed     17              16
Stemmed = Unstemmed     0               0
1-tailed probability    p < 0.0001      p < 0.0001
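The one-tailed sign test described above can be sketched as an exact binomial calculation. This is a minimal illustration of the standard test, in which tied pairs are discarded; the function name and the counts in the usage line are hypothetical and are not taken from the experiments reported here.

```python
from math import comb


def sign_test_one_tailed(n_pos: int, n_neg: int) -> float:
    """Probability of at least `n_pos` positive signs among the non-tied pairs,
    if positive and negative signs were equally likely."""
    n = n_pos + n_neg  # tied pairs are excluded before calling
    return sum(comb(n, k) for k in range(n_pos, n + 1)) / 2 ** n


# Illustrative counts only: 8 pairs favour one method, 2 favour the other.
print(sign_test_one_tailed(8, 2))
```

A small probability indicates that a preponderance of positive signs as large as the one observed would be unlikely if neither method were better than the other.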
Despite the general superiority of the stemmed searches, it will be noted
that the unstemmed query words gave a better level of performance for about
one-third of the searches. Such occurrences typically arose from limitations in
our implementation of the stemmer, specifically because PC-KIMMO will
sometimes suggest several different possible analyses for a word that is presented to it, i.e., several different potential stems in the present context; for
example, the word adamın (the man’s) yields the stems adam (man) and ada
(island). Inspection of the PC-KIMMO output for a sample file of ca. 3500
Turkish words showed that the shortest stem output by the program was the
correct one in 83% of the cases considered and we thus used these shortest
stems for the searches reported here. However, this does mean that overstemming (i.e., the removal of too much of the original word as with the adamın
example above) occurred for 17% of the query words, and we would hence
expect that the retrieval performance detailed in Tables 1 and 2 could have
been further improved by asking the searchers to check each of the potential
stems output by PC-KIMMO prior to the submission of the query. The stemming results presented here thus represent a lower bound of performance; even
so, they lead us to believe that stemming using a morphological analyser provides a highly effective way of handling the problem of word variants in
searches of Turkish text databases.
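The shortest-stem heuristic described above can be sketched in a few lines. The function name is hypothetical, and the candidate list stands in for the alternative analyses returned by the morphological analyser.

```python
from typing import List


def pick_shortest_stem(candidates: List[str]) -> str:
    """Choose the shortest candidate stem; the first one wins if lengths tie."""
    return min(candidates, key=len)


# Candidate stems from the example in the text: adam (man) and ada (island).
print(pick_shortest_stem(["adam", "ada"]))
```

In this example the heuristic selects ada, which is the overstemmed alternative; this is the kind of error that checking the candidate stems before submitting the query would avoid.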
Acknowledgements
We thank Michael Lynch, Kemal Oflazer, Sandy Robertson and Stephen
Walker for assistance and helpful discussions, Hilmi Celik for the provision of
the political news database used in this study, the 30 members of the
University of Sheffield who provided the queries and relevance judgements,
and the Turkish government for funding.
References
1. Martin Lennon, David S. Peirce, Brian D. Tarry and Peter Willett. An
evaluation of some conflation algorithms for information retrieval.
Journal of Information Science, vol. 3, 1981, pp. 177–183.
2. William B. Frakes. Stemming algorithms. In: William B. Frakes and
Ricardo Baeza-Yates (eds). Information retrieval: data structures and
algorithms. Englewood Cliffs, NJ: Prentice Hall, 1992, pp. 131–160.
3. Martin F. Porter. An algorithm for suffix stripping. Program, vol. 14,
1980, pp. 130–137.
4. Mirko Popovic and Peter Willett. The effectiveness of stemming for natural language access to Slovene textual data. Journal of the American
Society for Information Science, vol. 43, 1992, pp. 384–390.
5. Wessel Kraaij and Renée Pohlmann. Viewing stemming as recall
enhancement. In: Hans-Peter Frei, Donna Harman, Peter Schäuble and
Ross Wilkinson (eds). Proceedings of the 19th annual international ACM
SIGIR conference on research and development in information retrieval.
New York: Association for Computing Machinery, 1996, pp. 40–49.
6. T.Z. Kalamboukis. Suffix stripping with modern Greek. Program, vol.
29, 1995, pp. 313–321.
7. Fatimah Ahmad, Mohammed Yusoff and Tengu M.T. Sembok.
Experiments with a stemming algorithm for Malay words. Journal of the
American Society for Information Science, vol. 47, 1996, pp. 896–908.
8. Geoffrey L. Lewis. Turkish grammar. Oxford: Oxford University Press,
1991.
9. Richard Sproat. Morphology and computation. Cambridge MA: MIT
Press, 1992.
10. F. Çuna Ekmekçioglu, Michael F. Lynch and Peter Willett. Development
and evaluation of conflation techniques for the implementation of a
document retrieval system for Turkish text databases. New Review of
Document and Text Management, vol. 1, 1995, pp. 131–146.
11. F. Çuna Ekmekçioglu, Michael F. Lynch, Alexander M. Robertson,
Tengu M.T. Sembok and Peter Willett. Comparison of n-gram matching
and stemming for term conflation in English, Malay, and Turkish texts.
Text Technology, vol. 6, 1996, pp. 1–14.
12. OKAPI is described in detail in a series of papers comprising a special
issue of Journal of Documentation, vol. 53, 1997, pp. 1–106.
13. Stephen E. Robertson, Stephen Walker, Susan Jones, Micheline
Hancock-Beaulieu and Michael Gatford. Okapi at TREC-3. In: Donna K.
Harman (ed.) Overview of the third text retrieval conference (TREC-3).
Washington DC: National Institute of Standards and Technology, 1995,
pp. 109–126. (NIST Special Publication 500–22.)
14. Evan L. Antworth. Glossing text with the PC-KIMMO morphological
parser. Computers and the Humanities, vol. 26, 1993, pp. 475–484.
15. Kemal Oflazer. Two-level description of Turkish morphology. Literary
and Linguistic Computing, vol. 9, 1994, pp. 137–148.
16. Sidney Siegel and N. John Castellan. Nonparametric statistics for the
behavioral sciences. New York: McGraw-Hill, 1998.
Authors
F. Çuna Ekmekçioglu, Project Manager, SCONE Project, CDLR/Andersonian
Library, Curran Building, 101 St. James Road, Glasgow G4 0NS, UK. E-mail:
cuna.ekmekcioglu@strath.ac.uk
Peter Willett, Professor, Department of Information Studies, University of Sheffield,
Western Bank, Sheffield S10 2TN, UK. E-mail: p.willett@sheffield.ac.uk