Term Weighting Schemes Experiment on Malay Text Retrieval System

Muhamad Taufik Abdullah, Fatimah Ahmad, Ramlan Mahmod
Faculty of Computer Science and Information Technology, Universiti Putra Malaysia
{taufik,fatimah,ramlan}@fsktm.upm.edu.my

Tengku Mohd Tengku Sembok
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia
tmts@ftsm.ukm.my

Abstract

The components of the vectors in the vector space model are determined by the term weighting scheme, a function of the frequencies of the terms in the document or query. In this paper we discuss term weighting schemes and report the results of an experiment on a Malay text retrieval system with a Quranic text collection.

Keywords: text retrieval, term weighting, vector space method, Malay document

1 Introduction

Text retrieval systems are developed based on information retrieval models such as the Boolean model, the probabilistic model, or the vector space model. We focus on the vector space model (VSM). The VSM represents documents and queries as vectors and computes similarity scores using an inner product. The VSM needs an additional term weighting algorithm before it can be implemented, and its performance depends on the term weighting scheme [Salton and Buckley 1988]. A term weighting scheme is the function that determines the components of the vectors.

2 Term Weighting Scheme

Weighting of search terms is an important factor in the performance of information retrieval systems. Literally thousands of term weighting algorithms have been used experimentally during the last 25 years, especially within the Smart projects [Hiemstra 2000]. Proper term weighting can greatly improve the performance of the vector space method.

A weighting scheme is composed of three different types of term weighting: local, global, and normalization. The term weight wij is given by wij = Lij Gi Nj, where Lij is the local weight for term i in document j, Gi is the global weight for term i, and Nj is the normalization factor for document j. Local weights are functions of how many times each term appears in a document, global weights are functions of how many times each term appears in the entire collection, and the normalization factor compensates for discrepancies in the lengths of documents.

The local weight is computed from the terms in the given document or query. The global weight, however, is based on the document collection, regardless of whether we are weighting documents or queries. Normalization is applied after the local and global weighting. Normalizing the query is not necessary because it does not affect the relative order of the ranked document list [Chisholm and Kolda 1999].

Local weighting formulas perform well if they work on the principle that terms with a higher within-document frequency are more pertinent to that document. The local weighting formulas are listed in Table 1. Global weighting tries to assign a “discrimination value” to each term; many schemes are based on the idea that the less frequently a term appears in the whole collection, the more discriminating it is [Salton and Buckley 1988]. The global weighting formulas are listed in Table 2. The third component of the weighting scheme is the normalization factor, which corrects for discrepancies in document lengths: it is useful to normalize the document vectors so that documents are retrieved independently of their lengths. The normalization factors are listed in Table 3.
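As a concrete reading of this decomposition, the following minimal Python sketch composes a document weight vector as wij = Lij Gi Nj and scores it against a query with the inner product. It is purely illustrative: the function and parameter names (weight_vector, inner_product, tf, global_w, normalize) are chosen here for exposition and do not come from any particular system, and any formula from Tables 1-3 can be plugged in as the local, global, or normalization component.

def weight_vector(tf, local, global_w, normalize):
    """Compose a document's weight vector as wij = Lij * Gi * Nj.

    tf        -- {term: within-document frequency fij} for document j
    local     -- function mapping tf to {term: Lij}      (a formula from Table 1)
    global_w  -- precomputed {term: Gi}                   (a formula from Table 2)
    normalize -- function mapping {term: Gi*Lij} to Nj    (a formula from Table 3)
    """
    lw = local(tf)
    unnormalized = {t: lw[t] * global_w.get(t, 0.0) for t in lw}
    n_j = normalize(unnormalized)
    return {t: w * n_j for t, w in unnormalized.items()}

def inner_product(query_vec, doc_vec):
    """Similarity score used by the vector space model."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())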
3 Experimental Details

A text retrieval test collection consists of a document database, a set of queries for that database, and relevance judgments formulated for the queries. Our Quranic test collection consists of Quranic documents, natural-language query words, relevance judgments, and a stopword list. The Quranic document collection consists of 114 chapters, each containing a variable number of documents; in total there are 6236 Quranic documents, translated into the Malay language.

A query is a formal statement of the information need of the user, often expressed as a short natural-language question or statement. In this research the queries are taken from Ahmad's collection [Ahmad 1995]; there are 36 natural-language query words. The relevance of each document retrieved for each query is assessed to measure retrieval effectiveness. The other component of the Quranic collection is the list of relevance judgments. Ahmad formulated this list from the natural-language queries; it gives, for every query, the numbers of the documents that should be retrieved. Retrieval effectiveness on the Quranic collection is therefore measured against relevance judgments that are already available.

Retrieval effectiveness is measured using standard recall and precision. Recall is the proportion of the relevant material that is retrieved, while precision is the proportion of the retrieved material that is relevant. Precision is reported against recall at the 11 standard levels 0%, 10%, ..., 100% (a sketch of this computation appears after the references).

4 Result and Discussion

The experimental results in Table 4, Table 5 and Table 6 show recall and precision for retrieval using nine local weighting, eight global weighting and three normalization formulas. Table 4 shows that the highest average precision among the local weights is obtained by normalized log (LOGN): the average precision increases from 7.78% for within-document frequency (FREQ) to 7.91% for normalized log (LOGN). Table 5 shows that probabilistic inverse (IDFP) achieves the highest average precision among the global weights: the average precision increases from 7.91% with no global weight (NONE) to 9.27% for probabilistic inverse (IDFP). Furthermore, Table 6 shows that the average precisions for the three normalization formulas are equal.

Table 7 shows recall and precision for retrieval using the combination of the normalized log (LOGN) local weight and the probabilistic inverse (IDFP) global weight for both documents and queries. The average precision increases from 7.78% for within-document frequency (FREQ) to 12.35% for normalized log with probabilistic inverse (LOGN-IDFP).

5 Conclusion

The experiment shows that combining the right local weight and global weight is more effective at retrieving relevant documents than the other weighting schemes examined. However, the retrieval precision is still low, at only 12.35%. This implies that a term weighting scheme on its own is not enough: a user may retrieve only a few relevant documents in response to a query. The most obvious way to extend the work would be to combine term weighting with stemming and a thesaurus.

References

Ahmad, F. 1995. A Malay Language Document Retrieval System: An Experimental Approach and Analysis. Ph.D. Thesis. Universiti Kebangsaan Malaysia.

Chisholm, E. and Kolda, T.G. 1999. New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Technical Report. Oak Ridge National Laboratory.

Hiemstra, D. 2000. Using Language Models for Information Retrieval. CTIT Ph.D. Thesis Series. Centre for Telematics and Information Technology.

Salton, G. and Buckley, C. 1988. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.
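The recall-precision figures in the tables that follow are per-level precision values, presumably averaged over the 36 queries. The paper does not spell out the interpolation procedure, so the Python sketch below assumes the standard 11-point interpolated recall-precision computation; the function name and its arguments are illustrative only.

def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for a single query.

    ranking  -- list of document identifiers in retrieved (ranked) order
    relevant -- set of document identifiers judged relevant for this query
    """
    points = []                       # (recall, precision) after each relevant hit
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # interpolated precision at level r is the maximum precision at any recall >= r
    return [max((p for r, p in points if r >= level / 10), default=0.0)
            for level in range(11)]

Averaging these eleven values over the queries, level by level, produces one precision column of the result tables.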
Table 1 Local weighting formulas

Local weight                          Abbreviation   Formula
Binary                                BNRY           Lij = 1 if fij > 0; 0 otherwise
Within-document frequency             FREQ           Lij = fij
Log                                   LOGA           Lij = 1 + log fij if fij > 0; 0 otherwise
Normalized log                        LOGN           Lij = (1 + log fij) / (1 + log aj) if fij > 0; 0 otherwise
Augmented normalized term frequency   ATF1           Lij = 0.5 + 0.5 (fij / xj) if fij > 0; 0 otherwise
Changed-coefficient ATF1              ATFC           Lij = 0.2 + 0.8 (fij / xj) if fij > 0; 0 otherwise
Augmented average term frequency      ATFA           Lij = 0.9 + 0.1 (fij / aj) if fij > 0; 0 otherwise
Augmented log                         LOGG           Lij = 0.2 + 0.8 log(fij + 1) if fij > 0; 0 otherwise
Square root                           SQRT           Lij = sqrt(fij - 0.5) + 1 if fij > 0; 0 otherwise

where: fij is the frequency of term i in document j; aj is the average frequency of the terms that appear in document j; xj is the maximum frequency of any term in document j. All logs are base two.

Table 2 Global weighting formulas

Global weight                      Abbreviation   Formula
No global weight                   NONE           Gi = 1
Inverse document frequency         IDFB           Gi = log(N / ni)
Probabilistic inverse              IDFP           Gi = log((N - ni) / ni)
Entropy                            ENPY           Gi = 1 + [sum over j = 1..N of (fij / Fi) log(fij / Fi)] / log N
Global frequency IDF               IGFF           Gi = Fi / ni
Log-global frequency IDF           IGFL           Gi = log(Fi / ni + 1)
Incremented global frequency IDF   IGFI           Gi = Fi / ni + 1
Square root global frequency IDF   IGFS           Gi = sqrt(Fi / ni - 0.9)

where: N is the number of documents in the collection; ni is the number of documents in which term i appears; Fi is the frequency of term i throughout the entire collection. All logs are base two.

Table 3 Normalization factor formulas

Normalization factor           Abbreviation   Formula
None                           NONE           Nj = 1
Cosine normalization           COSN           Nj = 1 / sqrt(sum over i of (Gi Lij)^2)
Pivoted unique normalization   PUQN           Nj = 1 / ((1 - slope) * pivot + slope * lj)

where: lj is the number of distinct terms in document j; slope is set to 0.2; pivot is set to the average number of distinct terms per document in the entire collection.
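Read as code, the best-performing combination from Section 4 (the LOGN local weight and the IDFP global weight, with cosine normalization as the COSN factor) might be implemented as in the Python sketch below. This is an illustrative rendering of the formulas in Tables 1-3, not an implementation taken from the experimental system; logs are base two as specified in the tables.

import math

def logn_local(tf):
    """LOGN local weight (Table 1): Lij = (1 + log2 fij) / (1 + log2 aj) for fij > 0."""
    a_j = sum(tf.values()) / len(tf)     # average frequency of the terms in document j
    return {t: (1 + math.log2(f)) / (1 + math.log2(a_j)) for t, f in tf.items()}

def idfp_global(df, n_docs):
    """IDFP global weight (Table 2): Gi = log2((N - ni) / ni).

    Terms appearing in more than half the documents get a negative weight, and a
    term that appears in every document is skipped (log of zero is undefined).
    """
    return {t: math.log2((n_docs - n_i) / n_i) for t, n_i in df.items() if n_i < n_docs}

def cosn_factor(weighted):
    """COSN normalization (Table 3): Nj = 1 / sqrt(sum of (Gi * Lij)^2)."""
    s = math.sqrt(sum(w * w for w in weighted.values()))
    return 1.0 / s if s else 0.0

These functions slot directly into the weight_vector sketch given after Section 2: local = logn_local, global_w = idfp_global(df, N), normalize = cosn_factor.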
Table 4 Local weighting results (precision at 11 standard recall levels)

Recall    FREQ      BNRY      LOGA      LOGN      ATF1      ATFC      ATFA      LOGG      SQRT
0.0       0.253813  0.230861  0.223065  0.223065  0.220068  0.212249  0.234046  0.226499  0.217000
0.1       0.115772  0.098850  0.132802  0.132801  0.107050  0.116689  0.103554  0.119602  0.107807
0.2       0.105547  0.091604  0.118803  0.118790  0.100016  0.107310  0.098694  0.110358  0.100702
0.3       0.095437  0.086017  0.109834  0.109871  0.088621  0.096353  0.088984  0.094553  0.087609
0.4       0.078298  0.078583  0.078756  0.078756  0.079377  0.078206  0.077955  0.077741  0.078458
0.5       0.065569  0.060908  0.065312  0.065312  0.064088  0.065144  0.062655  0.064649  0.063987
0.6       0.042292  0.041004  0.042775  0.042775  0.041636  0.042309  0.041024  0.041719  0.041275
0.7       0.033308  0.032697  0.033481  0.033481  0.032471  0.033636  0.032363  0.032973  0.032583
0.8       0.023772  0.023821  0.023771  0.023771  0.023802  0.023866  0.023852  0.023795  0.023830
0.9       0.022723  0.022690  0.022810  0.022810  0.022479  0.022784  0.022638  0.022548  0.022662
1.0       0.019514  0.019402  0.019517  0.019517  0.019244  0.019502  0.019332  0.019429  0.019200
Average   0.077822  0.071494  0.079175  0.079177  0.072623  0.074368  0.073191  0.075806  0.072283

Table 5 Global weighting results with the LOGN local weight (precision at 11 standard recall levels)

Recall    NONE      IDFB      IDFP      ENPY      IGFF      IGFL      IGFI      IGFS
0.0       0.223065  0.273379  0.273751  0.271984  0.218206  0.217214  0.217719  0.229144
0.1       0.132801  0.173896  0.173005  0.174722  0.133157  0.131540  0.131249  0.131750
0.2       0.118790  0.141502  0.142318  0.140634  0.119045  0.118877  0.118510  0.119942
0.3       0.109871  0.125107  0.126126  0.124061  0.110449  0.110060  0.110120  0.111833
0.4       0.078756  0.087595  0.088007  0.087066  0.080516  0.079540  0.079408  0.081618
0.5       0.065312  0.069245  0.069828  0.068795  0.066116  0.065616  0.065310  0.066556
0.6       0.042775  0.045232  0.045520  0.045295  0.042546  0.042543  0.042668  0.043425
0.7       0.033481  0.034696  0.035262  0.034612  0.033285  0.033253  0.033137  0.033736
0.8       0.023771  0.024888  0.025343  0.024860  0.023826  0.023853  0.023850  0.024032
0.9       0.022810  0.023542  0.023947  0.023466  0.022857  0.022916  0.022877  0.022962
1.0       0.019517  0.020131  0.019788  0.019977  0.019303  0.019440  0.019456  0.019475
Average   0.079177  0.092656  0.092990  0.092316  0.079028  0.078623  0.078573  0.080407

Table 6 Normalization results with the LOGN local weight and the IDFP global weight (precision at 11 standard recall levels)

Recall    NONE      COSN      PUQN
0.0       0.273751  0.273751  0.273751
0.1       0.173005  0.173005  0.173005
0.2       0.142318  0.142318  0.142318
0.3       0.126126  0.126126  0.126126
0.4       0.088007  0.088007  0.088007
0.5       0.069828  0.069828  0.069828
0.6       0.045520  0.045520  0.045520
0.7       0.035262  0.035262  0.035262
0.8       0.025343  0.025343  0.025343
0.9       0.023947  0.023947  0.023947
1.0       0.019788  0.019788  0.019788
Average   0.092990  0.092990  0.092990

Table 7 Combined weight for both document and query (precision at 11 standard recall levels)

Recall    FREQ      LOGN-IDFP
0.0       0.253813  0.371309
0.1       0.115772  0.263053
0.2       0.105547  0.215744
0.3       0.095437  0.155442
0.4       0.078298  0.107050
0.5       0.065569  0.078310
0.6       0.042292  0.052000
0.7       0.033308  0.038905
0.8       0.023772  0.028529
0.9       0.022723  0.026338
1.0       0.019514  0.021823
Average   0.077822  0.123500
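A quick way to read the Average row: it is the arithmetic mean of the eleven interpolated precision values above it, which can be checked for the LOGN-IDFP column of Table 7 with a couple of lines of Python:

logn_idfp = [0.371309, 0.263053, 0.215744, 0.155442, 0.107050, 0.078310,
             0.052000, 0.038905, 0.028529, 0.026338, 0.021823]
print(round(sum(logn_idfp) / len(logn_idfp), 6))   # prints 0.1235, the 12.35% cited in Section 4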