Combining evidence for effective information filtering

Susan T. Dumais
Bellcore
445 South St.
Morristown, NJ 07960
email: std@bellcore.com
From: AAAI Technical Report SS-96-05. Compilation copyright © 1996, AAAI (www.aaai.org). All rights reserved.
Abstract
As part of NIST/ARPA's TREC Workshop, we used Latent Semantic Indexing (LSI) for filtering 336k incoming documents from diverse sources (newswires, patents, technical abstracts) for 50 topics of interest. We developed representations of user interests, or filters, for these topics using two sources of training information. A Word Filter used just the words in the topic statements, and a RelDocs Filter used just the known relevant training documents and ignored the topic statements. Using the relevant training documents (a variant of relevance feedback) was more effective than using a detailed natural language description of interests. Combining these two vectors provided some additional improvements in matching. On average, 7 of the top 10 documents and 44 of the top 100 documents were relevant using the combined query method. Data combination - merging the results of the Word and RelDocs retrieval runs - was not generally successful in improving performance compared to the best individual method, although we believe it might be if additional sources are used. These combination methods are quite general and applicable to a variety of routing and filtering tasks.
Introduction

As part of NIST/ARPA's TREC Workshop, we used Latent Semantic Indexing (LSI) for filtering 336k documents from diverse sources for 50 topics of interest. We examined how different sources of information (e.g., a natural language description of interests, feedback about previous documents) can be used to best predict which new objects will be of interest. An LSI model which combines the initial topic description with relevant training documents is quite effective.

Latent Semantic Indexing (LSI)

LSI is a variant of vector retrieval in which the dependencies between terms are explicitly taken into account (see Deerwester et al., 1990 for mathematical details and examples). Most retrieval models (e.g., Boolean, standard vector, probabilistic) treat words as if they are independent, although it is quite obvious that they are not (Salton & McGill, 1983). The central theme of LSI is that important inter-relationships among terms can be automatically derived, explicitly modeled, and used to improve retrieval. LSI uses singular-value decomposition (SVD), a statistical technique closely related to eigenvector decomposition, to model the associative relationships. A large term-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal factors. These derived indexing dimensions, rather than individual words, are the basis of the vector space used for retrieval.

Each term and document is represented by a vector in the resulting k-dimensional LSI space. Terms that are used in similar contexts (documents) will have similar vectors in this space. One important consequence of the LSI analysis is that users' queries can retrieve documents that do not share any words with the query - e.g., a query about "automobiles" would also retrieve articles about "cars" and even articles about "drivers" to a lesser extent. Retrieval operates in the same way in the reduced-dimension vector space as it does in standard vector models. A query vector is located at the weighted vector sum of its constituent term vectors. Documents are ranked by their similarity to the query vector and the most similar documents are returned. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained. This makes it easy to use LSI for relevance feedback and information filtering.

The LSI method has been applied to many of the standard information retrieval test collections with favorable results. Using the same tokenization and term weightings, the LSI method has equaled or outperformed standard vector methods in almost every case, and was as much as 30% better in some cases (Deerwester et al., 1990; Dumais, 1995). As with the standard vector method, differential term weighting and relevance feedback both improve LSI performance substantially (Dumais, 1991).
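The decomposition just described can be sketched on a toy term-document matrix. This is an illustrative reconstruction, not the system used in the paper: the five-term corpus, raw term-frequency weights, and k = 2 are all assumptions (the paper's collections are large, use differential term weighting, and keep 100 to 300 factors).

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# "automobile" and "car" never co-occur in a document, but both
# co-occur with "driver", so LSI should place them close together.
terms = ["automobile", "car", "driver", "recipe", "oven"]
A = np.array([
    [1, 0, 0, 0],  # automobile: doc0
    [0, 1, 0, 0],  # car:        doc1
    [1, 1, 0, 0],  # driver:     doc0, doc1
    [0, 0, 1, 1],  # recipe:     doc2, doc3
    [0, 0, 1, 0],  # oven:       doc2
], dtype=float)

# Truncated SVD: keep only the k largest orthogonal factors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]    # term vectors in the k-dim LSI space
doc_vecs = Vt[:k, :].T * s[:k]  # document vectors in the same space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

i = {t: n for n, t in enumerate(terms)}
print(cos(term_vecs[i["automobile"]], term_vecs[i["car"]]))     # high
print(cos(term_vecs[i["automobile"]], term_vecs[i["recipe"]]))  # near zero
```

Even though "automobile" and "car" share no documents, their reduced-dimension vectors end up nearly parallel because both co-occur with "driver" - the behavior behind the "automobiles"/"cars" example above.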
TREC - Information Filtering
We used LSI in NIST/ARPA's TREC-3 Workshop (Dumais, 1995; Harman, 1995). For the TREC filtering (or routing) task, we were given 50 topics of interest and asked to find articles relevant to these interests from a new stream of 336,306 documents (1.2 gig of ascii text). The 1000 documents most similar to each of the 50 topics of interest are returned, and performance is evaluated using precision and recall.

The TREC topic statements were quite detailed, structured, and specific. They are representative of the profiles that an information analyst might develop over time for standing interests. An example filtering topic is given below.
<num> Number: 108
<dom> Domain: International Economics
<title> Topic: Japanese Protectionist Measures

<desc> Description: Document will report on Japanese policies or practices which help protect Japan's domestic market from foreign competition.

<narr> Narrative: A relevant document will identify a Japanese law or regulation, a governmental policy or administrative procedure, a corporate custom, or a business practice which discourages, or even prevents, entry into the Japanese market by foreign goods and services. A document which reports generally on market penetration difficulties but which does not identify a specific Japanese barrier to trade is NOT relevant.

<con> Concept(s):
1. Japan
2. Ministry of International Trade and Industry, MITI, Ministry of Foreign Affairs
3. protectionism, protect
4. tariffs, quotas, dumping, obstacles, retaliation
5. structural impediment, product standard
6. trade dispute, barrier, tension, imbalance, practice
7. market access, free trade, liberalize, reciprocity
8. Super 301, 301 clause

<nat> Nationality: Japan

Training information was available for each topic. Known relevant and non-relevant documents from a different (although related) corpus of documents were identified. On average, there were 215 known relevant documents and 896 non-relevant documents for each topic. The topic statement and the training data were to be used to develop profiles or filters for each topic.

Basic Word and RelDocs Filters

LSI has previously been applied to information filtering with promising results (Foltz & Dumais, 1992), and we extended this work to the large, diverse TREC corpus. We used the training corpus to construct a 346-dimensional LSI representation. For information filtering, we begin by identifying a vector for each topic of interest in the LSI space. New documents are "folded in" to the space, and suggested as relevant to a topic if they are near enough the filter vector. Folding in works just like query formation. Each new document is located at the weighted vector sum of its constituent term vectors.

We compared two basic methods for creating filters - one using only the text of the topic statement (Word Filter); the other using only relevant documents from the training set (RelDocs Filter). The Word Filter is located at the weighted vector sum of all the words in the topic statement. On average, topics contain 192 words, of which 52 are unique content words. This method ignores all the training data about relevant and non-relevant documents. The RelDocs Filter is located at the centroid or vector sum of the relevant training documents, and ignores the topic statement. This is a somewhat unusual variant of relevance feedback. Typically, users' queries are modified by adding words from relevant documents and omitting words from non-relevant documents. We replaced the users' topic statement with relevant documents.

For each of the 50 filtering topics we created two filters or profiles - a Word Filter and a RelDocs Filter. Figure 1 illustrates how these vectors would be formed for one topic using a 2-dimensional LSI space. The vectors for the words in the topic statement are labeled w, and the vectors for the relevant training documents are labeled R. As can be seen in Figure 1, the Word Vector is located at the weighted sum of the vectors for topic words, and the RelDocs Vector is located at the sum of the relevant training document vectors. In both cases, the filter was a single vector. New documents were "folded in" to this LSI space, as described above, and ranked in decreasing order of similarity to the filter vector.

FIGURE 1. Baseline RelDocs and Word Vectors. (Figure omitted; legend: R = relevant doc, N = non-rel doc, w = topic words; axes are LSI dimensions 1 and 2.)
These two methods provide baselines for examining various combinations of information from the users' topic statement and from feedback using relevant training documents.

Combining Information - Query Combination and Data Combination

The Word and RelDocs filtering methods described above constitute the main sources of information. We also examined methods for combining evidence from these sources. Belkin et al. (1993, 1994) have described the methods we used as query combination and data combination (or data fusion).

For query combination, we combine two (or more) representations into a single new query vector for retrieval. In the work described in this paper we explored linear combinations of the basic Words and RelDocs vectors. The combined vector, with equal weight given to Words and RelDocs, is shown in Figure 2. A new ranking of documents was produced for the Combined vector.

FIGURE 2. Query Combination. (Figure omitted; the Combined Vector lies between the Word and RelDocs vectors in the LSI space.)

Data combination (or data fusion, as Belkin et al. call it) combines information about the results of two (or more) systems which rank the same documents in response to the same requests. We examined various ways of combining the top 1000 items from the Word and RelDocs matches to produce a new ranking of documents for each topic. We used information about the ranks and cosines, and combined them using max, min, and sum operators. Although we did not examine the parameter space as systematically as Bartell et al. (1994) did, we did choose data combination methods which had previously been successful in the context of TREC (Belkin et al., 1994; Fox & Shaw, 1994).

Figure 3 shows geometrically what the combination looks like for the max(cos(doc,Word), cos(doc,RelDocs)) measure. The cross-hatched region shows documents that have a high value on this measure - these documents are near either the Word or RelDocs filter and would be top ranked for this data combination method.

FIGURE 3. Data Combination - max cosine measure. (Figure omitted.)

In this paper we explored combinations of document rankings which were derived from the same LSI space and can thus be represented geometrically in the same figure. The method is much more general than this, however, and could be used to combine rank and/or similarity information from very different retrieval systems in order to determine a final document ranking. Data combination is thus much more general than query combination.

Results

For the TREC filtering task, performance was evaluated using the top 1000 documents returned for each topic. We report the total number of relevant documents returned, precision at 10 and 100 documents, and average precision.
Query combination
Table 1 presents the results for the two basic LSI filters as well as several combinations of them. Not surprisingly, the RelDocs filter vectors, which take advantage of the known relevant documents, are better than the Word vectors. The improvement in average precision is 30% (.3737 vs. .2880). RelDocs was one of the best TREC filtering methods. Users get an average of 2 additional relevant documents in the top 10 using the RelDocs method for filtering (6.7 vs. 4.6). Even though the topic statements are quite rich, using known relevant training documents still provides sizeable retrieval benefits. Recall that the RelDocs method ignores the topic description! It is also important to note how much information was filtered out by these methods. By looking at just 100 documents per topic, or 1.5% of the data (5000/336306), 24% of the known relevant documents are found. By looking at less than 15% of the data (50000/336306), 74% of the relevant documents are found using RelDocs.
TABLE 1. Query Combinations of LSI Word (W) and RelDocs (R) vectors

              Rel Docs   Prec at 10   Prec at 100   Avg Prec
Words (W)       6252       .4620        .3532        .2880
RelDocs (R)     6878       .6720        .4544        .3737
.5W+.5R         7078       .6820        .4590        .3792
.25W+.75R       7010       .6500        .4400        .3827
.75W+.25R       6930       .5540        .4118        .3561
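The linear mixtures evaluated in Table 1 can be sketched as follows. The 2-D vectors standing in for LSI filter vectors are assumptions for illustration; only the mixing scheme (a weighted sum of the Word and RelDocs vectors, ranked by cosine) comes from the paper.

```python
import numpy as np

def combine_queries(word_vec, reldocs_vec, w_weight=0.5):
    """Query combination: a linear mix of the two filter vectors,
    e.g. .5W+.5R, .25W+.75R, .75W+.25R as in Table 1."""
    v = w_weight * word_vec + (1.0 - w_weight) * reldocs_vec
    return v / np.linalg.norm(v)

def rank_docs(filter_vec, doc_vecs):
    """Rank documents by cosine similarity to a single filter vector."""
    sims = {d: float(v @ filter_vec / np.linalg.norm(v))
            for d, v in doc_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)

# Toy 2-D setup (cf. Figures 1-2): W and R point in different directions.
W = np.array([1.0, 0.2]); W /= np.linalg.norm(W)
R = np.array([0.2, 1.0]); R /= np.linalg.norm(R)
docs = {"a": np.array([1.0, 0.1]),   # near the Word vector only
        "b": np.array([0.7, 0.7]),   # near both
        "c": np.array([0.1, 1.0])}   # near the RelDocs vector only
print(rank_docs(combine_queries(W, R, 0.5), docs))  # 'b' rises to the top
```

With the equal-weight mixture, the document near both source vectors ranks first; pushing the weight toward either source promotes the documents near that source instead.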
We examined three mixtures varying the amount that the Word and RelDocs sources contribute to the combined vector (.25, .50, .75). Small improvements in performance are found when the RelDocs vector is given equal or more weight than the Word vector. This same pattern of results was observed for a different set of 50 routing topics in a previous TREC, so we have reason to believe these results generalize to other filtering applications.

The RelDocs methods described above used all known relevant documents from the training collection. On average, there were 216 relevant documents for a topic. Table 2 shows how much training data it takes to achieve this level of performance. With only 5 relevant documents performance is poor. It is comparable to the Words filter in precision, and somewhat worse in the total number of relevant documents. (Knowing 5 relevant items is as good as a 192 word description of the topic of interest.) Performance improves steadily, and with only 20 relevant documents it is within 6% of the maximum. With relatively small amounts of training data, filtering performance is quite good.

TABLE 2. Performance as a function of number of training documents used in RelDocs vector

           Rel Docs   Prec at 10   Prec at 100   Avg Prec
5 Rel        5436       .5840        .2564        .2777
10 Rel       6313       .5860        .3904        .3096
20 Rel       6669       .6400        .4290        .3526
30 Rel       6642       .6560        .4428        .3565
50 Rel       6806       .6640        .4470        .3663
100 Rel      6822       .6800        .4514        .3706
max Rel      6878       .6720        .4544        .3737

The best performance one can achieve for this LSI representation is obtained by locating the filter vector at the centroid of all the relevant test documents. Clearly one could not achieve this in practice, since the relevant test documents are not known ahead of time, but it does set an upper bound on performance. Placing the filter vector there would increase average precision by 28% to .4776, so there is room for improvement given the current representation! We can move in this direction by combining the RelDocs vector with a vector based on new relevant test documents as they arrive. Combining the RelDocs vector with only 1 new relevant test document improves average precision 7% to .4017, and adding 10 new relevant documents improves average precision 16% to .4353.

Data combination

Data combination begins with the top 1000 documents returned for the Word and RelDocs filters and combines the results (not the vectors) in various ways to arrive at a new ranking. We used information about the ranks and cosines, and combined them using max, min, and sum operators. Similar methods have previously been tested in the context of TREC, although without the kind of systematic parameter analysis used by Bartell et al. (1994). Consider, for example, the sum of cosines. For each document, a new measure of similarity (its cosine in the Word set plus its cosine in the RelDocs set) is computed and used to derive a new ranking. Retrieval performance is evaluated using the new ranking. Table 3 summarizes the performance of six data combination methods along with the Word and RelDocs vector baselines.

TABLE 3. Data Combinations of LSI Word (W) and RelDocs (R) returns

                     Rel Docs   Avg Prec
Words                  6252       .2880
RelDocs                6878       .3737
Ranks - sum(W,R)       6960       .3687
Ranks - min(W,R)       6938       .3498
Ranks - max(W,R)       6764       .3538
Cosines - sum(W,R)     6891       .3660
Cosines - min(W,R)     6764       .3229
Cosines - max(W,R)     6885       .3741

As can be seen in Table 3, there are only small, inconsistent improvements in performance when different sources of data are combined. Differentially weighting the contributions of the Words and RelDocs measures does not help either.
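The fusion operators evaluated in Table 3 can be sketched as follows. One detail is assumed: documents appearing in only one of the two top-1000 lists are given a score of zero in the other, which the paper does not specify; the scores and document names are illustrative.

```python
def data_combine(word_cos, reldocs_cos, op=max):
    """Data combination (fusion): merge two result sets over the same
    collection by applying an operator (max, min, or sum over the two
    scores) per document, then re-ranking by the fused score.
    Missing documents get a score of 0.0 (an assumption)."""
    docs = set(word_cos) | set(reldocs_cos)
    fused = {d: op([word_cos.get(d, 0.0), reldocs_cos.get(d, 0.0)])
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Toy cosine scores for each document in the two result sets.
word_cos    = {"a": 0.90, "b": 0.60, "c": 0.10}
reldocs_cos = {"a": 0.20, "b": 0.70, "d": 0.80}

print(data_combine(word_cos, reldocs_cos, op=max))  # 'a' leads: near one filter
print(data_combine(word_cos, reldocs_cos, op=min))  # 'b' leads: near both filters
print(data_combine(word_cos, reldocs_cos, op=sum))  # 'b' leads again
```

The same function works on (negated) ranks instead of cosines, giving the rank-based rows of Table 3.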
Another way of combining data from the two basic filters is to pick the best filter for each topic. Although the RelDocs filter is best on average, there are 14 topics for which the Words filter is better. If we use the training data to select which method to use for each topic, performance is 6890 relevant docs and .3731 average precision, and this is no better than using just the RelDocs filter. (If we select optimally using the test data, small improvements are seen - 7058 relevant docs and .3962 average precision.)

Others (Belkin et al., 1994; Fox & Shaw, 1994) have reported some success with data fusion methods, and it is not entirely clear why we were not successful. Perhaps our two sources are not sufficiently different from each other. Both the Word and RelDocs vectors were represented in the same LSI space and are likely to reflect many of the same derived indexing features. (However, remember that the query combination method does improve performance.) A related possibility is that two sources of information may not be sufficient. Additional sources of information are easy to include (e.g., dot product similarity, keyword matching, phrase matching), but it would be nice to do so in a principled manner. Finally, it may be that non-linear combinations would lead to performance improvements.
Summary and Conclusions

We used LSI for filtering 336k documents from diverse sources for 50 interest profiles. Using relevant training documents was more effective than using a detailed natural language description of interests. Combining these two vectors provided some additional improvements in filtering. On average, 7 of the top 10 documents are relevant using the combined method. Performance can be improved by continually adding new relevant documents. Data combination using these two methods was not successful, although we believe it might be if additional sources of information are used. These combination methods are quite general and applicable to a wide variety of filtering tasks.

References

Bartell, B. T., Cottrell, G. W., and Belew, R. K. Automatic combination of multiple ranked retrieval systems. In Proceedings of SIGIR'94, ACM Press, 1994.

Belkin, N. J., Kantor, P., Cool, C. and Quatrain, R. Combining evidence for information retrieval. In D. Harman (Ed.), The Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, 1994, pp. 35-44.

Belkin, N. J., Cool, C., Croft, W. B. and Callan, J. P. The effect of multiple query representations on information retrieval performance. In Proceedings of SIGIR'93, 1993, pp. 339-346.

Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6), 391-407.

Dumais, S. T. Using LSI for information filtering: TREC-3 experiments. In D. Harman (Ed.), The Third Text REtrieval Conference (TREC-3), National Institute of Standards and Technology Special Publication 500-225, 1995, pp. 219-230.

Foltz, P. W. and Dumais, S. T. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, 1992, 35(12), 51-60.

Fox, E. A. and Shaw, J. A. Combination of multiple searches. In D. Harman (Ed.), The Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, 1994, pp. 243-252.

Harman, D. (Ed.), The Third Text REtrieval Conference (TREC-3), National Institute of Standards and Technology Special Publication 500-225, 1995.

Salton, G. and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.