Program, Vol 34 No 2, 2000 © Aslib, The Association for Information Management. All rights reserved. Except as otherwise permitted under the Copyright, Designs and Patents Act 1988, no part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying or otherwise without the prior written permission of the publisher.

Effectiveness of stemming for Turkish text retrieval

F. ÇUNA EKMEKÇIOGLU and PETER WILLETT

1. Introduction

Stemming algorithms are widely used to conflate morphological variants in searches of free-text databases.1,2 Many of the stemmers that have been reported in the literature (e.g., those for English3, Slovene4, Dutch5, Greek6 and Malay7) operate by matching the ending of a word against a dictionary of suffixes, and removing any suffix that is identified subject to constraints specified in an associated set of context-sensitive rules; recoding may also be applied to the root that remains after the suffix has been removed.

An alternative approach to stemming involves the use of a morphological analyser to remove suffixes from words according to their internal structure. Such algorithms proceed from left to right through a word (rather than from right to left, as in suffix-stripping) and first look in a lexicon for a root matching an initial substring of the word; they then use grammatical information stored in the lexical entry to determine what possible suffixes may follow. When a suffix matching a further substring is found, grammatical information in the lexical entry for that suffix again determines what class of suffixes may follow. The stemming is successful if the end of the word can be reached by iteration of this process, and if the last suffix is one which may end a word.
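The left-to-right procedure described above can be illustrated with a minimal sketch. The tiny lexicon, the suffix classes and the function name analyse below are hypothetical and chosen only for illustration; they ignore vowel harmony and are not taken from any real description of Turkish morphology.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: each root names the suffix class that may follow it.
ROOTS: Dict[str, str] = {"ev": "NOUN", "ada": "NOUN", "adam": "NOUN"}

# Hypothetical suffix classes: class -> (suffix, class that may follow it).
CONTINUATIONS: Dict[str, List[Tuple[str, str]]] = {
    "NOUN": [("ler", "PLURAL"), ("m", "POSS"), ("in", "CASE"), ("de", "CASE")],
    "PLURAL": [("in", "CASE"), ("de", "CASE")],
    "POSS": [("in", "CASE"), ("de", "CASE")],
    "CASE": [],
}

# Classes whose members may end a word (all of them in this toy grammar).
MAY_END_WORD = {"NOUN", "PLURAL", "POSS", "CASE"}


def analyse(word: str) -> List[Tuple[str, List[str]]]:
    """Return every (root, suffixes) analysis that consumes the whole word."""
    results: List[Tuple[str, List[str]]] = []

    def extend(pos: int, state: str, root: str, suffixes: List[str]) -> None:
        if pos == len(word):
            # Success only if the last element reached is allowed to end a word.
            if state in MAY_END_WORD:
                results.append((root, suffixes))
            return
        # Try each suffix that the current class permits at this position.
        for suffix, next_state in CONTINUATIONS[state]:
            if word.startswith(suffix, pos):
                extend(pos + len(suffix), next_state, root, suffixes + [suffix])

    # First look for roots that match an initial substring of the word.
    for root, state in ROOTS.items():
        if word.startswith(root):
            extend(len(root), state, root, [])
    return results


if __name__ == "__main__":
    print(analyse("evlerde"))
    print(analyse("adamin"))
```

With this toy lexicon a single word can receive more than one analysis, because two different roots can both match an initial substring and both be followed by a permitted suffix sequence.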
In this paper, we discuss the evaluation of a stemming algorithm of this type for searching a database of Turkish text and compare the results with those obtained in comparable, but non-stemmed, searches. Turkish is a member of the south-western or Oghuz group of Turkic languages, which also includes Turkmen, Azerbaijani or Azeri, Ghasghai and Gagauz.8 The language uses a Latin alphabet consisting of 29 letters, of which eight are vowels and 21 are consonants, and is an agglutinative language, i.e. one in which words contain a basic root, with one or more suffixes being combined with this root in order to extend its meaning or to create other classes of words.9 Agglutination can result in long words that can contain as much semantic information as a whole English phrase, clause or sentence. An extreme example of this characteristic of the language is provided by the word avrupalIlastIrIlamIyabilenlerdenmissiniz, which means ‘you seem to be one of those who may be incapable of being Europeanised’. The root here is avrupa (Europe) and the remainder consists of suffixes which have certain functions and add meaning to the root; for example, the first suffix -lI modifies the meaning to ‘person from Europe’ and the second suffix -las then alters the function to a verb which means ‘to become one of the persons from Europe’.

Program, vol. 34, no. 2, April 2000, pp. 195–200
Aslib, The Association for Information Management, Staple Hall, Stone House Court, London EC3A 7PB
Tel: +44 (0) 171 903 0000, Fax: +44 (0) 171 903 0011
Email: pubs@aslib.co.uk, WWW: http://www.aslib.co.uk/aslib

This complex morphological structure means that a single Turkish word can give rise to a very large number of variants, with a consequent need for effective stemming if high recall is to be achieved in searches of Turkish text databases. For example, Ekmekçioglu et al. found that the root ol (to become, to happen) occurred in no fewer than 248 different forms in a file of ca. 50,000 word types extracted from a Turkish political news database.10 Such behaviour is extremely difficult to encompass by a conventional suffix-stripping algorithm; instead, a morphological analyser is required if high-quality stemming is to be achieved, and we have thus used this approach in the work reported here.

2. Experimental details

The experiments used a file containing the titles and abstracts for 6,289 economic and political news stories extracted from Turkish newspapers in the period 1991–93. These texts were searched using a set of 50 natural-language queries provided by 30 native Turkish-speaking students who were studying at the University of Sheffield and who agreed to provide both natural-language queries and relevance judgements on the stemmed and unstemmed search outputs discussed below. In previous work,10,11 we have developed a list of 730 stopwords (mainly conjunctions, prepositions and pronouns) for Turkish text retrieval, and each of the queries was checked against this list.
The resulting set of queries contained an average of 5.1 words (minimum and maximum of 3 and 9 words, respectively), and these were then searched in both stemmed and unstemmed form, using the OKAPI text retrieval system.12 OKAPI calculates weights for each of the terms in a document or query using a probabilistic model of retrieval. It then scores a document on the basis of the sum of the weights for those terms that it has in common with a query; documents are ranked in order of decreasing score and the top-ranked documents are then presented to the user for relevance assessment (although not tested in our experiments, these judgements can be used for the calculation of new relevance weights and then for a second, feedback search). The particular weighting scheme used here was the BM25 weight13, with the users’ relevance judgements being based on the information contained in the title and abstract of the 10 or 20 top-ranked documents.

Stemming was effected using the two-level morphological analyser, PC-KIMMO.9,14 PC-KIMMO, which is named after its inventor Kimmo Koskenniemi, is designed to generate (produce) and/or recognise (parse) words using a two-level model of word structure in which a word is represented as a correspondence between its lexical level form and its surface level form. The generator component of PC-KIMMO accepts as input a lexical form, applies the appropriate rules, and returns the corresponding surface form. The recogniser component accepts as input a surface form, applies the appropriate rules, consults the lexicon, and returns the corresponding lexical form with its gloss.
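The sum-of-weights ranking described for OKAPI can be sketched as follows. This is a minimal illustration of a commonly quoted simplified form of the BM25 weight, not the configuration used in the experiments: the function name bm25_scores, the parameter values k1 = 1.2 and b = 0.75, and the whitespace-tokenised toy documents are all assumptions, and no relevance information is used.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query_terms: Sequence[str],
                documents: Sequence[Sequence[str]],
                k1: float = 1.2,   # assumed term-frequency saturation parameter
                b: float = 0.75    # assumed document-length normalisation parameter
                ) -> List[float]:
    """Score each tokenised document as the sum of its matching term weights.

    Assumes a non-empty collection in which at least one document has tokens.
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # only terms shared by query and document contribute
            n = doc_freq[term]
            # Collection-frequency weight without relevance information; it is
            # negative for a term that occurs in more than half the documents.
            idf = math.log((n_docs - n + 0.5) / (n + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [["enflasyon", "artti"], ["faiz", "dustu"], ["kur", "sabit"]]
    scores = bm25_scores(["enflasyon"], docs)
    # Rank document indices by decreasing score, as in a ranked-output search.
    print(sorted(range(len(docs)), key=lambda i: scores[i], reverse=True))
```

A cutoff of 10 or 20 then corresponds to keeping only the first 10 or 20 entries of the ranked list.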
Oflazer has developed a description of the Turkish language that is suitable for processing by PC-KIMMO,15 in which the phonetic rules of contemporary Turkish have been encoded using two-level rules while the morphotactics of the agglutinative word structures have been encoded as finite-state machines for verbal, nominal and other classes of words. Oflazer’s program is based on a list of ca. 23,000 word roots: this list is divided into a number of separate lexicons for nouns, adjectives, verbs, compound nouns, proper nouns, pronouns, adverbs, connectives, exclamations, postpositions, acronyms, technical jargon and nominal word structures that exhibit a number of special cases. The description consists of two files: a rules file, which specifies the alphabet and the phonological (or spelling) rules; and a lexicon file, which lists lexical items (words or morphemes) and their glosses, and encodes morphotactic constraints. These two files were incorporated in PC-KIMMO, and the resulting program used for stemming.

Stemming is normally carried out upon both document and query texts, but we have only stemmed the query words in our experiments. Such a procedure is feasible because Turkish word roots are generally unaffected when a suffix is added to their right-hand end.
Accordingly, there is no need for the recoding procedures that are required in many other languages, and the use of a simple truncation search thus ensures that a stemmed query word is able to retrieve all of the variants in the database that are derived from it. For example, the word enflasyonla (with/by inflation) in one of the queries was stemmed to enflasyon, this resulting in matches with words such as enflasyonu, enflasyonunu, enflasyonun and enflasyonist, inter alia.

3. Results and discussion

The mean numbers of relevant documents retrieved, when averaged over the set of 50 queries, are listed in Table 1. It will be seen that the stemmed searches retrieve noticeably more relevant documents; indeed, an inspection of the retrieved sets showed that in most cases, the unstemmed relevant documents were a proper subset of the stemmed relevant documents.

Table 1. Mean numbers of relevant documents retrieved when averaged over a set of 50 searches using both stemmed and unstemmed query words

Search Type    Cutoff-10    Cutoff-20
Unstemmed      4.04         6.84
Stemmed        5.34         9.08

The statistical significance of this difference in performance was established using the sign test.16 This test focuses on the direction of the difference in some parameter between two related samples, noting whether the sign of the difference is positive, negative or whether the two values are equal. In the present context, the experimental parameter is the number of relevant documents retrieved in a search; when used in a one-tailed manner, the test can determine whether the stemmed searches retrieve significantly more relevant documents than do the unstemmed searches. The results of this analysis are summarised in Table 2, from which it will be seen that the stemmed searches are better than the unstemmed searches at a very high level of statistical significance.

Table 2. Sign test analysis based on the numbers of relevant documents retrieved over a set of 50 searches using both stemmed and unstemmed query words

                        Cutoff-10     Cutoff-20
Stemmed > Unstemmed     33            34
Stemmed < Unstemmed     17            16
Stemmed = Unstemmed     0             0
1-tailed probability    p < 0.0001    p < 0.0001

Despite the general superiority of the stemmed searches, it will be noted that the unstemmed query words gave a better level of performance for about one-third of the searches. Such occurrences typically arose from limitations in our implementation of the stemmer, specifically because PC-KIMMO will sometimes suggest several different possible analyses for a word that is presented to it, i.e., several different potential stems in the present context; for example, the word adamin (the man’s) yields the stems adam (man) and ada (island). Inspection of the PC-KIMMO output for a sample file of ca. 3,500 Turkish words showed that the shortest stem output by the program was the correct one in 83% of the cases considered and we thus used these shortest stems for the searches reported here.
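The shortest-stem choice, combined with the truncation search on the stemmed query word, can be sketched as below. This is only an illustration: the function names shortest_stem and truncation_search are hypothetical, the candidate stems are supplied by hand rather than produced by PC-KIMMO, and the vocabularies are toy lists rather than the news database.

```python
from typing import Iterable, List


def shortest_stem(candidate_stems: Iterable[str]) -> str:
    """Pick the shortest of the candidate stems proposed for one word."""
    return min(candidate_stems, key=len)


def truncation_search(stem: str, vocabulary: Iterable[str]) -> List[str]:
    """Return every vocabulary word that begins with the stem (right truncation)."""
    return sorted(word for word in vocabulary if word.startswith(stem))


if __name__ == "__main__":
    # A correctly stemmed query word retrieves its variants and nothing else.
    words = ["enflasyonu", "enflasyonunu", "enflasyonun", "enflasyonist", "ekonomi"]
    print(truncation_search("enflasyon", words))

    # Choosing the shortest candidate for adamin gives ada rather than adam,
    # so the truncation search also matches words unrelated to adam.
    stem = shortest_stem(["adam", "ada"])
    print(stem, truncation_search(stem, ["adam", "adamlar", "ada", "adalet"]))
```

The second call shows why an overstemmed query word can lower precision: the shorter stem is a prefix of words that have nothing to do with the intended root.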
However, this does mean that overstemming (i.e., the removal of too much of the original word as with the adamin example above) occurred for 17% of the query words, and we would hence expect that the retrieval performance detailed in Tables 1 and 2 could have been further improved by asking the searchers to check each of the potential stems output by PC-KIMMO prior to the submission of the query. The stemming results presented here thus represent a lower bound of performance; even so, they lead us to believe that stemming using a morphological analyser provides a highly effective way of handling the problem of word variants in searches of Turkish text databases.

Acknowledgements

We thank Michael Lynch, Kemal Oflazer, Sandy Robertson and Stephen Walker for assistance and helpful discussions, Hilmi Celik for the provision of the political news database used in this study, the 30 members of the University of Sheffield who provided the queries and relevance judgements, and the Turkish government for funding.

References

1. Martin Lennon, David S. Peirce, Brian D. Tarry and Peter Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, vol. 3, 1981, pp. 177–183.
2. William B. Frakes. Stemming algorithms. In: William B. Frakes and Ricardo Baeza-Yates (eds). Information retrieval: data structures and algorithms. Englewood Cliffs, NJ: Prentice Hall, 1992, pp. 131–160.
3. Martin F. Porter. An algorithm for suffix stripping. Program, vol. 14, 1980, pp. 130–137.
4. Mirko Popovic and Peter Willett. The effectiveness of stemming for natural language access to Slovene textual data. Journal of the American Society for Information Science, vol. 43, 1992, pp. 384–390.
5. Wessel Kraaij and Renée Pohlmann. Viewing stemming as recall enhancement. In: Hans-Peter Frei, Donna Harman, Peter Schäuble and Ross Wilkinson (eds). Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval. New York: Association for Computing Machinery, 1996, pp. 40–49.
6. T.Z. Kalamboukis. Suffix stripping with modern Greek. Program, vol. 29, 1995, pp. 313–321.
7. Fatimah Ahmad, Mohammed Yusoff and Tengu M.T. Sembok. Experiments with a stemming algorithm for Malay words. Journal of the American Society for Information Science, vol. 47, 1996, pp. 896–908.
8. Geoffrey L. Lewis. Turkish grammar. Oxford: Oxford University Press, 1991.
9. Richard Sproat. Morphology and computation. Cambridge MA: MIT Press, 1992.
10. F. Çuna Ekmekçioglu, Michael F. Lynch and Peter Willett. Development and evaluation of conflation techniques for the implementation of a document retrieval system for Turkish text databases. New Review of Document and Text Management, vol. 1, 1995, pp. 131–146.
11. F. Çuna Ekmekçioglu, Michael F. Lynch, Alexander M. Robertson, Tengu M.T. Sembok and Peter Willett. Comparison of n-gram matching and stemming for term conflation in English, Malay, and Turkish texts. Text Technology, vol. 6, 1996, pp. 1–14.
12. OKAPI is described in detail in a series of papers comprising a special issue of Journal of Documentation, vol. 53, 1997, pp. 1–106.
13. Stephen E. Robertson, Stephen Walker, Susan Jones, Micheline Hancock-Beaulieu and Michael Gatford. Okapi at TREC-3. In: Donna K. Harman (ed.) Overview of the third text retrieval conference (TREC-3). Washington DC: National Institute of Standards and Technology, 1995, pp. 109–126. (NIST Special Publication 500–22.)
14. Evan L. Antworth. Glossing text with the PC-KIMMO morphological parser. Computers and the Humanities, vol. 26, 1993, pp. 475–484.
15. Kemal Oflazer. Two-level description of Turkish morphology. Literary and Linguistic Computing, vol. 9, 1994, pp. 137–148.
16. Sidney Siegel and N. John Castellan. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill, 1998.

Authors

F. Çuna Ekmekçioglu, Project Manager, SCONE Project, CDLR/Andersonian Library, Curran Building, 101 St. James Road, Glasgow G4 0NS, UK. E-mail: cuna.ekmekcioglu@strath.ac.uk

Peter Willett, Professor, Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK. E-mail: p.willett@sheffield.ac.uk