Term Weighting Schemes Experiment on Malay Text Retrieval System
Muhamad Taufik Abdullah
Fatimah Ahmad
Ramlan Mahmod
Faculty of Computer Science and Information Technology
Universiti Putra Malaysia
{taufik,fatimah,ramlan}@fsktm.upm.edu.my
Tengku Mohd Tengku Sembok
Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia
tmts@ftsm.ukm.my
Abstract
The components of the vectors in the vector space model are determined by the term weighting scheme, a function of the frequencies of the terms in the document or query. In this paper we discuss term weighting schemes and report results from experiments on a Malay text retrieval system with a Quranic text collection.
Keywords: text retrieval, term weighting, vector space
method, Malay document
1 Introduction
Text retrieval systems are developed based on information retrieval models such as the Boolean model, the probabilistic model, or the vector space model. We focus on the vector space model (VSM). The VSM models documents and queries as vectors and computes similarity scores using an inner product. The VSM needs a term weighting scheme before it can be implemented, and its performance depends on that scheme [Salton and Buckley 1988]. The term weighting scheme is the function that determines the components of the vectors.
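As a minimal illustration of the inner-product ranking described above (our own sketch, not code from the system under study; the document ids and terms are invented):

```python
# Minimal sketch of vector space retrieval. Documents and the query are
# represented as term-weight vectors over a shared vocabulary; documents
# are ranked by the inner product with the query vector.

def inner_product(doc_vec, query_vec):
    """Similarity score: sum of products over terms shared by both vectors."""
    return sum(w * query_vec[t] for t, w in doc_vec.items() if t in query_vec)

def rank(doc_vecs, query_vec):
    """Return document ids sorted by decreasing similarity to the query."""
    scores = {d: inner_product(v, query_vec) for d, v in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": {"solat": 0.8, "puasa": 0.2},
    "d2": {"puasa": 0.9},
}
query = {"solat": 1.0}
print(rank(docs, query))  # d1 scores 0.8, d2 scores 0.0
```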
2 Term Weighting Scheme
Weighting of search terms is an important factor in the performance of information retrieval systems. Literally thousands of term weighting algorithms have been used experimentally over the last 25 years, especially within the Smart projects [Hiemstra 2000]. Proper term weighting can greatly improve the performance of the vector space method. A weighting scheme is composed of three different types of term weighting: local, global, and normalization.
The term weight w_ij is given by
w_ij = L_ij * G_i * N_j,
where L_ij is the local weight for term i in document j, G_i is the global weight for term i, and N_j is the normalization factor for document j.
Local weights are functions of how many times each term appears in a document, global weights are functions of how many times each term appears in the entire collection, and the normalization factor compensates for discrepancies in the lengths of documents. The local weight is computed from the terms in the given document or query.
The global weight, however, is based on the document collection, regardless of whether we are weighting documents or queries. Normalization is done after the
local and global weighting. Normalizing the query is not
necessary because it does not affect the relative order of the
ranked document list [Chisholm and Kolda 1999].
Local weighting formulas perform well if they work on the
principle that the terms with higher within-document
frequency are more pertinent to that document. A list of the
local weight formulas is shown in Table 1.
Global weighting tries to give a “discrimination value” to
each term. Many schemes are based on the idea that the
less frequently a term appears in the whole collection, the
more discriminating it is [Salton and Buckley 1988]. A list
of global weight formulas is shown in Table 2.
The third component of the weighting scheme is the
normalization factor, which is used to correct discrepancies
in document lengths. It is useful to normalize the document
vectors so that documents are retrieved independent of their
lengths. A list of the normalization factors is shown in
Table 3.
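Putting the three components together, a document vector under, say, LOGN local weight, IDFP global weight and cosine normalization could be computed as in the following sketch (an illustration assuming the Table 1-3 formulas; the function names and the zero-guards for edge cases are our own choices, not the paper's):

```python
import math

# Illustrative computation of w_ij = L_ij * G_i * N_j using the normalized log
# local weight (LOGN), probabilistic inverse global weight (IDFP) and cosine
# normalization (COSN). `docs` maps a document id to its term-frequency counts.

def logn(f_ij, a_j):
    # LOGN: (1 + log2 f_ij) / (1 + log2 a_j) when f_ij > 0, else 0
    return (1 + math.log2(f_ij)) / (1 + math.log2(a_j)) if f_ij > 0 else 0.0

def idfp(N, n_i):
    # IDFP: log2((N - n_i) / n_i); returning 0 outside 0 < n_i < N is our guard
    return math.log2((N - n_i) / n_i) if 0 < n_i < N else 0.0

def weight_documents(docs):
    N = len(docs)
    n = {}                                   # document frequency of each term
    for counts in docs.values():
        for t in counts:
            n[t] = n.get(t, 0) + 1
    vectors = {}
    for d, counts in docs.items():
        a_j = sum(counts.values()) / len(counts)  # average term frequency in d
        raw = {t: logn(f, a_j) * idfp(N, n[t]) for t, f in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # COSN factor
        vectors[d] = {t: w / norm for t, w in raw.items()}
    return vectors
```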
3 Experimental Details
A text retrieval test collection consists of a document database, a set of queries for the database, and relevance judgments formulated from the queries. Our Quranic test collection consists of Quranic documents, natural language queries, relevance judgments, and a stopword list. The Quranic document collection consists of 114 chapters, where every chapter contains a variable number of documents. In total there are 6236 Quranic documents, translated into the Malay language.
A query is a formal statement of the information need of the user. It is often expressed as a short natural language question or statement. In this research, the queries are taken from Ahmad's collection [Ahmad 1995]. There are 36 natural language queries.
The relevance of each document retrieved for each query is assessed to measure effectiveness. The other component of the Quranic collection is the list of relevance judgments. Ahmad formulated the relevance judgments based on the natural language queries; the list gives, for every query, the document numbers that should be retrieved.
Retrieval effectiveness on the Quranic collection is therefore measured against these pre-existing relevance judgments, using standard recall and precision. Recall is defined as the proportion of relevant material that is retrieved, while precision is the proportion of retrieved material that is relevant. Precision is reported at the 11 standard recall levels 0%, 10%, ..., 100%.
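The 11-point evaluation can be sketched as follows (a standard interpolated-precision procedure for one query; the paper does not spell out its exact interpolation rule, so this is an assumption):

```python
# Sketch of 11-point recall/precision evaluation for a single query, using the
# common interpolation rule: precision at recall level r is the maximum
# precision observed at any recall >= r.

def eleven_point(ranked_ids, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    precisions, recalls = [], []
    hits = 0
    for k, d in enumerate(ranked_ids, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / k)          # precision after rank k
            recalls.append(hits / len(relevant)) # recall after rank k
    points = []
    for level in [i / 10 for i in range(11)]:
        ps = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(ps) if ps else 0.0)
    return points
```

Averaging these 11 values per query, then over all 36 queries, yields table entries like those in Tables 4-7.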
4 Result and Discussion
The experimental results in Table 4, Table 5 and Table 6 show recall and precision for retrieval using nine local weight, eight global weight and three normalization formulas. Table 4 shows that the highest average precision among local weights is obtained by normalized log (LOGN): average precision increases from 7.78% for within-document frequency (FREQ) to 7.91% for normalized log (LOGN). Table 5 shows that probabilistic inverse (IDFP) achieves the highest average precision among global weights: average precision increases from 7.91% with no global weight (NONE) to 9.27% for probabilistic inverse (IDFP). Furthermore, Table 6 shows that the average precision values for the three normalization formulas are equal.
Table 7 then shows recall and precision when the combination of normalized log (LOGN) local weight and probabilistic inverse (IDFP) global weight is applied to both documents and queries. Average precision increases from 7.78% for within-document frequency (FREQ) to 12.35% for the combined scheme (LOGN-IDFP).
5 Conclusion
The experiment shows that combining the right local weight and global weight is more effective at retrieving relevant documents than the other weighting schemes. However, the retrieval precision is still low, at only 12.35%. This implies that a term weighting scheme alone is not enough: a user may retrieve only a few relevant documents in response to a query. The most obvious way to extend this work would be to combine term weighting with stemming and a thesaurus.
References
Ahmad, F. 1995. A Malay Language Document Retrieval System: An Experimental Approach and Analysis. Ph.D. Thesis. Universiti Kebangsaan Malaysia.
Chisholm, E. and Kolda, T.G. 1999. New Term Weighting
Formulas for The Vector Space Method in Information
Retrieval. Technical Report. Oak Ridge National
Laboratory.
Hiemstra, D. 2000. Using Language Models for
Information Retrieval. CTIT Ph.D. Thesis Series.
Centre for Telematics and Information Technology.
Salton, G. and Buckley, C. 1988. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.
Table 1 Local weighting formulas

Local weight                                Abbreviation  Formula (for f_ij > 0; L_ij = 0 otherwise)
Binary                                      BNRY          L_ij = 1
Within-document frequency                   FREQ          L_ij = f_ij
Log                                         LOGA          L_ij = 1 + log f_ij
Normalized log                              LOGN          L_ij = (1 + log f_ij) / (1 + log a_j)
Augmented normalized term frequency         ATF1          L_ij = 0.5 + 0.5 (f_ij / x_j)
Changed-coefficient ATF1                    ATFC          L_ij = 0.2 + 0.8 (f_ij / x_j)
Augmented normalized average term frequency ATFA          L_ij = 0.9 + 0.1 (f_ij / a_j)
Augmented log                               LOGG          L_ij = 0.2 + 0.8 log(f_ij + 1)
Square root                                 SQRT          L_ij = sqrt(f_ij - 0.5) + 1

where:
f_ij is the frequency of term i in document j;
a_j is the average frequency of the terms that appear in document j;
x_j is the maximum frequency of any term in document j;
All logs are base two.
Table 2 Global weighting formulas

Global weight                     Abbreviation  Formula
No global weight                  NONE          G_i = 1
Inverse document frequency        IDFB          G_i = log(N / n_i)
Probabilistic inverse             IDFP          G_i = log((N - n_i) / n_i)
Entropy                           ENPY          G_i = 1 + sum_{j=1..N} [(f_ij / F_i) log(f_ij / F_i)] / log N
Global frequency IDF              IGFF          G_i = F_i / n_i
Log-global frequency IDF          IGFL          G_i = log(F_i / n_i + 1)
Incremented global frequency IDF  IGFI          G_i = F_i / n_i + 1
Square root global frequency IDF  IGFS          G_i = sqrt(F_i / n_i - 0.9)

where:
N is the number of documents in the collection;
n_i is the number of documents in which term i appears;
F_i is the frequency of term i throughout the entire collection;
All logs are base two.
Table 3 Normalization factor formulas

Normalization factor          Abbreviation  Formula
None                          NONE          N_j = 1
Cosine normalization          COSN          N_j = 1 / sqrt(sum_i (G_i L_ij)^2)
Pivoted unique normalization  PUQN          N_j = 1 / ((1 - slope) * pivot + slope * l_j)

where:
l_j is the number of distinct terms in document j;
slope is set to 0.2;
pivot is set to the average number of distinct terms per document in the entire collection.
Table 4 Local weighting results (precision at 11 recall levels)

Recall   FREQ      BNRY      LOGA      LOGN      ATF1      ATFC      ATFA      LOGG      SQRT
0.0      0.253813  0.230861  0.223065  0.223065  0.220068  0.212249  0.234046  0.226499  0.217000
0.1      0.115772  0.098850  0.132802  0.132801  0.107050  0.116689  0.103554  0.119602  0.107807
0.2      0.105547  0.091604  0.118803  0.118790  0.100016  0.107310  0.098694  0.110358  0.100702
0.3      0.095437  0.086017  0.109834  0.109871  0.088621  0.096353  0.088984  0.094553  0.087609
0.4      0.078298  0.078583  0.078756  0.078756  0.079377  0.078206  0.077955  0.077741  0.078458
0.5      0.065569  0.060908  0.065312  0.065312  0.064088  0.065144  0.062655  0.064649  0.063987
0.6      0.042292  0.041004  0.042775  0.042775  0.041636  0.042309  0.041024  0.041719  0.041275
0.7      0.033308  0.032697  0.033481  0.033481  0.032471  0.033636  0.032363  0.032973  0.032583
0.8      0.023772  0.023821  0.023771  0.023771  0.023802  0.023866  0.023852  0.023795  0.023830
0.9      0.022723  0.022690  0.022810  0.022810  0.022479  0.022784  0.022638  0.022548  0.022662
1.0      0.019514  0.019402  0.019517  0.019517  0.019244  0.019502  0.019332  0.019429  0.019200
Average  0.077822  0.071494  0.079175  0.079177  0.072623  0.074368  0.073191  0.075806  0.072283
Table 5 Global weighting results (LOGN local weight; precision at 11 recall levels)

Recall   NONE      IDFB      IDFP      ENPY      IGFF      IGFL      IGFI      IGFS
0.0      0.223065  0.273379  0.273751  0.271984  0.218206  0.217214  0.217719  0.229144
0.1      0.132801  0.173896  0.173005  0.174722  0.133157  0.131540  0.131249  0.131750
0.2      0.118790  0.141502  0.142318  0.140634  0.119045  0.118877  0.118510  0.119942
0.3      0.109871  0.125107  0.126126  0.124061  0.110449  0.110060  0.110120  0.111833
0.4      0.078756  0.087595  0.088007  0.087066  0.080516  0.079540  0.079408  0.081618
0.5      0.065312  0.069245  0.069828  0.068795  0.066116  0.065616  0.065310  0.066556
0.6      0.042775  0.045232  0.045520  0.045295  0.042546  0.042543  0.042668  0.043425
0.7      0.033481  0.034696  0.035262  0.034612  0.033285  0.033253  0.033137  0.033736
0.8      0.023771  0.024888  0.025343  0.024860  0.023826  0.023853  0.023850  0.024032
0.9      0.022810  0.023542  0.023947  0.023466  0.022857  0.022916  0.022877  0.022962
1.0      0.019517  0.020131  0.019788  0.019977  0.019303  0.019440  0.019456  0.019475
Average  0.079177  0.092656  0.092990  0.092316  0.079028  0.078623  0.078573  0.080407
Table 6 Normalization results (LOGN local weight, IDFP global weight; precision at 11 recall levels)

Recall   NONE      COSN      PUQN
0.0      0.273751  0.273751  0.273751
0.1      0.173005  0.173005  0.173005
0.2      0.142318  0.142318  0.142318
0.3      0.126126  0.126126  0.126126
0.4      0.088007  0.088007  0.088007
0.5      0.069828  0.069828  0.069828
0.6      0.045520  0.045520  0.045520
0.7      0.035262  0.035262  0.035262
0.8      0.025343  0.025343  0.025343
0.9      0.023947  0.023947  0.023947
1.0      0.019788  0.019788  0.019788
Average  0.092990  0.092990  0.092990
Table 7 Combined weight for both document and query (precision at 11 recall levels)

Recall   FREQ      LOGN-IDFP
0.0      0.253813  0.371309
0.1      0.115772  0.263053
0.2      0.105547  0.215744
0.3      0.095437  0.155442
0.4      0.078298  0.107050
0.5      0.065569  0.078310
0.6      0.042292  0.052000
0.7      0.033308  0.038905
0.8      0.023772  0.028529
0.9      0.022723  0.026338
1.0      0.019514  0.021823
Average  0.077822  0.123500