Combining evidence for effective information filtering

Susan T. Dumais
Bellcore
445 South St.
Morristown, NJ 07960
email: std@bellcore.com

From: AAAI Technical Report SS-96-05. Compilation copyright © 1996, AAAI (www.aaai.org). All rights reserved.

Abstract

As part of NIST/ARPA's TREC Workshop, we used Latent Semantic Indexing (LSI) for filtering 336k incoming documents from diverse sources (newswires, patents, technical abstracts) for 50 topics of interest. We developed representations of user interests, or filters, for these topics using two sources of training information. A Word Filter used just the words in the topic statements, and a RelDocs Filter used just the known relevant training documents and ignored the topic statement. Using the relevant training documents (a variant of relevance feedback) was more effective than using a detailed natural language description of interests. Combining these two vectors provided some additional improvements in filtering. On average, 7 of the top 10 documents and 44 of the top 100 documents were relevant using the combined method. Data combination, merging the results of the Word and RelDocs retrieval lists, was not generally successful in improving performance compared to the best individual method, although we believe it might be if additional sources are used. These combination methods are quite general and applicable to a variety of routing and feedback tasks.

Introduction

As part of NIST/ARPA's TREC Workshop, we used Latent Semantic Indexing (LSI) for filtering 336k documents from diverse sources for 50 topics of interest. We examined how different sources of information (e.g., a natural language description of interests, feedback about previous documents) can be used to best predict which new objects will be of interest. An LSI model which combines the initial topic description with relevant training documents is quite effective.

Latent Semantic Indexing (LSI)

LSI is a variant of vector retrieval in which the dependencies between terms are explicitly taken into account (see Deerwester et al., 1990 for mathematical details and examples). Most retrieval models (e.g., Boolean, standard vector, probabilistic) treat words as if they are independent, although it is quite obvious that they are not (Salton & McGill, 1983). The central theme of LSI is that important inter-relationships among terms can be automatically derived, explicitly modeled, and used to improve retrieval. LSI uses singular-value decomposition (SVD), a statistical technique closely related to eigenvector decomposition, to model the associative relationships. A large term-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal factors. These derived indexing dimensions, rather than individual words, are the basis of the vector space used for retrieval. Each term and document is represented by a vector in the resulting k-dimensional LSI space. Terms that are used in similar contexts (documents) will have similar vectors in this space. One important consequence of the LSI analysis is that users' queries can retrieve documents that do not share any words with the query; e.g., a query about "automobiles" would also retrieve articles about "cars" and even articles about "drivers" to a lesser extent.

Retrieval operates in the same way in the reduced-dimension vector space as it does in standard vector models. A query vector is located at the weighted vector sum of its constituent term vectors. Documents are ranked by their similarity to the query vector, and the most similar documents are returned. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained. This makes it easy to use LSI for relevance feedback and information filtering.

The LSI method has been applied to many of the standard information retrieval test collections with favorable results. Using the same tokenization and term weightings, the LSI method has equaled or outperformed standard vector methods in almost every case, and was as much as 30% better in some cases (Deerwester et al., 1990; Dumais, 1995). As with the standard vector method, differential term weighting and relevance feedback both improve LSI performance substantially (Dumais, 1991).
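The decomposition just described can be sketched in a few lines of numpy. Everything below is illustrative, not the paper's implementation: the paper decomposes large weighted term-frequency matrices into 100-300 factors, while this toy example uses raw counts and k = 2.

```python
import numpy as np

# Toy term-document count matrix A (terms x documents); illustrative only.
terms = ["car", "automobile", "driver", "elephant"]
A = np.array([
    [1, 0, 1, 0, 0],   # "car"
    [0, 1, 1, 0, 0],   # "automobile"
    [1, 1, 0, 0, 0],   # "driver"
    [0, 0, 0, 1, 1],   # "elephant"
], dtype=float)

# Singular-value decomposition: A = U diag(s) Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest factors; terms and documents then live in the
# same k-dimensional LSI space.
k = 2
term_vecs = U[:, :k] * s[:k]
doc_vecs = Vt[:k, :].T * s[:k]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "car" and "automobile" never co-occur in A, yet they are used in similar
# contexts, so their LSI vectors are close; "elephant" stays far away.
print(cos(term_vecs[0], term_vecs[1]))   # near 1.0
print(cos(term_vecs[0], term_vecs[3]))   # near 0.0
```

Because the derived dimensions, not individual words, define the space, a query about "automobiles" can retrieve a document that only mentions "cars", which is the behavior described above.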
TREC - Information Filtering

We used LSI in NIST/ARPA's TREC-3 Workshop (Dumais, 1995; Harman, 1995). For the TREC filtering (or routing) task, we were given 50 topics of interest and asked to find articles relevant to these interests from a new stream of 336,306 documents (1.2 gigabytes of ASCII text). The 1000 documents most similar to each of the 50 topics of interest are returned, and performance is evaluated using precision and recall.

The TREC topic statements were quite detailed, structured, and specific. They are representative of the profiles that an information analyst might develop over time for standing interests. An example filtering topic is given below.

<num> Number: 108
<dom> Domain: International Economics
<title> Topic: Japanese Protectionist Measures
<desc> Description: Document will report on Japanese policies or practices which help protect Japan's domestic market from foreign competition.
<narr> Narrative: A relevant document will identify a Japanese law or regulation, a governmental policy or administrative procedure, a corporate custom, or a business practice which discourages, or even prevents, entry into the Japanese market by foreign goods and services. A document which reports generally on market penetration difficulties but which does not identify a specific Japanese barrier to trade is NOT relevant.
<con> Concept(s):
1. Japan
2. Ministry of International Trade and Industry, MITI, Ministry of Foreign Affairs
3. protectionism, protect
4. tariff, quota, dumping, obstacle, retaliation
5. structural impediment, product standard
6. trade dispute, barrier, tension, imbalance, practice
7. market access, free trade, liberalize, reciprocity
8. Super 301, 301 clause
<nat> Nationality: Japan

Training information was available for each topic. Known relevant and non-relevant documents from a different (although related) corpus of documents were identified. On average, there were 215 known relevant documents and 896 non-relevant documents for each topic.

LSI had previously been applied to information filtering with promising results (Foltz & Dumais, 1992), and we extended this work to the large, diverse TREC corpus. We used the training corpus to construct a 346-dimensional LSI representation. For information filtering, we begin by identifying a vector for each topic of interest in the LSI space. New documents are "folded in" to the space, and suggested as relevant to a topic if they are near enough to the filter vector. Folding in works just like query formation: each new document is located at the weighted vector sum of its constituent term vectors.

Basic Word and RelDocs Filters

We compared two basic methods for creating filters: one using only the text of the topic statement (Word Filter); the other using only relevant documents from the training set (RelDocs Filter). The Word Filter is located at the weighted vector sum of all the words in the topic statement. On average, topics contain 192 words, of which 52 are unique content words. This method ignores all the training data about relevant and non-relevant documents. The RelDocs Filter is located at the centroid or vector sum of the relevant training documents, and ignores the topic statement. This is a somewhat unusual variant of relevance feedback: typically, users' queries are modified by adding words from relevant documents and omitting words from non-relevant documents. We replaced the users' topic statement with relevant documents.

For each of the 50 filtering topics we created two filters or profiles: a Word Filter and a RelDocs Filter. Figure 1 illustrates how these vectors would be formed for one topic using a 2-dimensional LSI space. The vectors for the words in the topic statement are labeled w, and the vectors for the relevant training documents are labeled R. As can be seen in Figure 1, the Word Vector is located at the weighted sum of the vectors for topic words, and the RelDocs Vector is located at the sum of the relevant training document vectors. In both cases, the filter was a single vector. New documents were "folded in" to this LSI space, as described above, and ranked in decreasing order of similarity to the filter vector.
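The two filters and the folding-in step can be sketched as follows. This is a minimal illustration under assumed inputs: the tiny term_vecs array stands in for an LSI space, the function names are mine rather than the paper's, and all weights are invented.

```python
import numpy as np

def fold_in(term_weights, term_vecs):
    """Fold a new document (or query) into the LSI space: the weighted
    vector sum of its constituent term vectors."""
    return term_weights @ term_vecs

def reldocs_filter(relevant_doc_vecs):
    """RelDocs filter: centroid (vector sum) of the relevant training
    documents; the topic statement is ignored."""
    return relevant_doc_vecs.sum(axis=0)

def rank_by_cosine(filter_vec, doc_vecs):
    """Rank documents in decreasing order of cosine to the filter vector."""
    sims = doc_vecs @ filter_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(filter_vec))
    return np.argsort(-sims)

# Illustrative data: 4 terms in a k=2 LSI space.
term_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Word filter: weighted sum of the words in the topic statement
# (here the topic uses terms 0 and 1 with equal weight).
word_filter = fold_in(np.array([1.0, 1.0, 0.0, 0.0]), term_vecs)

# Two incoming documents, given as term weights, folded into the space.
new_docs = np.array([[1.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0, 1.0]])
doc_vecs = np.vstack([fold_in(d, term_vecs) for d in new_docs])

# RelDocs filter built from (here) a single known relevant document vector.
rel_filter = reldocs_filter(doc_vecs[:1])

print(rank_by_cosine(word_filter, doc_vecs))   # document 0 ranks first
print(rank_by_cosine(rel_filter, doc_vecs))
```

With real data, the document term weights would use the same tokenization and weighting applied when the LSI space was built.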
The topic statement and the training data were thus the two sources used to develop profiles or filters for each topic.

FIGURE 1. Baseline RelDocs and Word Vectors (w = topic words, R = relevant documents, N = non-relevant documents)

These two methods provide baselines for examining various combinations of information from the users' topic statement and from feedback using relevant training documents.

Combining Information - Query Combination and Data Combination

The two basic filtering methods described above constitute the main sources of information. We also examined methods for combining evidence from these sources. Belkin et al. (1993, 1994) have described the methods we used as query combination and data combination (or data fusion).

For query combination, we combine two (or more) representations into a single new query vector for retrieval. In the work described in this paper we explored linear combinations of the basic Words and RelDocs vectors. The combined vector, with equal weight given to Words and RelDocs, is shown in Figure 2. A new ranking of documents was produced for the Combined vector.

FIGURE 2. Query Combination

Data combination (or data fusion, as Belkin et al. call it) combines information about the results of two (or more) systems which rank the same documents in response to the same requests. We examined various ways of combining the top 1000 items from the Word and RelDocs matches to produce a new ranking of documents for each topic. We used information about the ranks and cosines, and combined them using max, min, and sum operators. Although we did not examine the parameter space as systematically as Bartell et al. (1994) did, we did choose data combination methods which had previously been successful in the context of TREC (Belkin et al., 1994; Fox & Shaw, 1994).

Figure 3 shows geometrically what the combination looks like for the max(cos(doc, Word), cos(doc, RelDocs)) measure. The cross-hatched region shows documents that have a high value on this measure; these documents are near either the Word or the RelDocs filter and would be top ranked for this data combination method.

FIGURE 3. Data Combination - max cosine measure

In this paper we explored combinations of document rankings which were derived from the same LSI space and can thus be represented geometrically in the same figure. The method is much more general than this, however, and could be used to combine rank and/or similarity information from very different retrieval systems in order to determine a final document ranking. Data combination is thus much more general than query combination.

Results

For the TREC filtering task, performance was evaluated using the top 1000 documents returned for each topic. We used the total number of relevant documents returned, precision at 10 and 100 documents, and average precision.

Query combination

Table 1 presents the results for the two basic LSI filters as well as several combinations of them. Not surprisingly, the RelDocs filter vectors, which take advantage of the known relevant documents, are better than the Word vectors. The improvement in average precision is 30% (.3737 vs. .2880). RelDocs was one of the best TREC filtering methods. Users get an average of 2 additional relevant documents in the top 10 using the RelDocs method for filtering (6.7 vs. 4.6). Even though the topic statements are quite rich, using known relevant training documents still provides sizeable retrieval benefits. Recall that the RelDocs method ignores the topic description! It is also important to note how much information was filtered out by these methods. By looking at just 100 documents per topic, or 1.4% of the data (5000/336306), 24% of the known relevant documents are found. By looking at less than 15% of the data (50000/336306), 74% of the relevant documents are found using RelDocs.
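The query combination evaluated here is just a weighted sum of the two filter vectors. A minimal sketch follows, with made-up vectors; normalizing each source before mixing is my own assumption, since the paper does not spell out the scaling.

```python
import numpy as np

def combine_queries(word_vec, reldocs_vec, w=0.5):
    """Query combination: one new filter vector as a linear mix of the
    Word and RelDocs vectors; w is the weight on the Word vector.
    Unit-normalizing each source first (an assumption made here) keeps
    either vector's length from dominating the mix."""
    wv = word_vec / np.linalg.norm(word_vec)
    rv = reldocs_vec / np.linalg.norm(reldocs_vec)
    return w * wv + (1.0 - w) * rv

# Made-up 3-dimensional filter vectors for illustration.
word_vec = np.array([1.0, 0.2, 0.0])
reldocs_vec = np.array([0.3, 1.0, 0.4])

# The three mixtures examined in the paper: .25, .50, .75 on the Word side.
for w in (0.25, 0.50, 0.75):
    print(w, combine_queries(word_vec, reldocs_vec, w))
```

The combined vector is then used exactly like either baseline filter: new documents are ranked by their cosine to it.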
TABLE 1. Query Combinations of LSI Word (W) and RelDocs (R) vectors

Method        Rel Docs   Prec at 10   Prec at 100   Avg Prec
Words (W)       6252       .4620        .3532         .2880
RelDocs (R)     6878       .6720        .4544         .3737
.5W+.5R         7078       .6820        .4400         .3792
.25W+.75R       7010       .6500        .4590         .3827
.75W+.25R       6930       .5540        .4118         .3561

We examined three mixtures varying the amount that the Word and RelDocs sources contribute to the combined vector (.25, .50, .75). Small improvements in performance are found when the RelDocs vector is given equal or more weight than the Word vector. This same pattern of results was observed for a different set of 50 routing topics in a previous TREC, so we have reason to believe these results generalize to other filtering applications.

The RelDocs methods described above used all known relevant documents from the training collection. On average, there were 216 relevant documents for a topic. Table 2 shows how much training data it takes to achieve this level of performance.

TABLE 2. Performance as a function of number of training documents used in the RelDocs vector

Num Rel    Rel Docs   Prec at 10   Prec at 100   Avg Prec
5 Rel        5436       .5840        .2564         .2777
10 Rel       6313       .5860        .3904         .3096
20 Rel       6669       .6400        .4290         .3526
30 Rel       6642       .6560        .4428         .3565
50 Rel       6806       .6640        .4470         .3663
100 Rel      6822       .6800        .4514         .3706
max Rel      6878       .6720        .4544         .3737

With only 5 relevant documents performance is poor: it is comparable to the Words filter in precision, and somewhat worse in the total number of relevant documents. (Knowing 5 relevant items is as good as a 192-word description of the topic of interest.) Performance improves steadily, and with only 20 relevant documents is within 6% of the maximum. With relatively small amounts of training data, filtering performance is quite good.

The best performance one can achieve for this LSI representation is obtained by locating the filter vector at the centroid of all the relevant test documents. Clearly one could not achieve this in practice, since the relevant test documents are not known ahead of time, but it does set an upper bound on performance. Placing the filter vector here would increase average precision by 28% to .4776, so there is room for improvement with the current representation! We can move in this direction by combining the RelDocs vector with a vector based on new relevant test documents as they arrive. Combining the RelDocs vector with only 1 new relevant test document improves average precision 7% to .4017, and adding 10 new relevant documents improves average precision 16% to .4353.

Data combination

Data combination begins with the top 1000 documents returned for the Word and RelDocs filters and combines the results (not the vectors) in various ways to arrive at a new ranking. We used information about the ranks and cosines, and combined them using max, min, and sum operators. Similar methods have previously been tested in the context of TREC, although without the kind of systematic parameter analysis used by Bartell et al. (1994). Consider, for example, the sum of cosines. For each document, a new measure of similarity (its cosine in the Word set plus its cosine in the RelDocs set) is computed and used to derive a new ranking. Retrieval performance is evaluated using the new ranking. Table 3 summarizes the performance of six data combination methods along with the Word and RelDocs vector baselines.

TABLE 3. Data Combinations of LSI Word (W) and RelDocs (R) results

Method                Rel Docs   Avg Prec
Words                   6252       .2880
RelDocs                 6878       .3737
Ranks - sum(W,R)        6960       .3687
Ranks - min(W,R)        6938       .3498
Ranks - max(W,R)        6764       .3538
Cosines - sum(W,R)      6891       .3660
Cosines - min(W,R)      6764       .3229
Cosines - max(W,R)      6885       .3741

As can be seen in Table 3, there are only small, inconsistent improvements in performance when different sources of data are combined. Differentially weighting the contributions of the Words and RelDocs measures does not help either.

Another way of combining data from the two basic filters is to pick the best filter for each topic. Although the RelDocs filter is best on average, there are 14 topics for which the Words filter is better.
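The rank and cosine fusions of Table 3 can be sketched as follows. The five documents and their cosines are invented for illustration; the paper applies the same operators to the top 1000 items from each filter.

```python
import numpy as np

def data_combine(scores_w, scores_r, op="sum"):
    """Data combination: fuse two systems' scores for the same documents
    (cosines here, but ranks work the same way) with sum, min, or max,
    then re-rank by the fused score."""
    ops = {"sum": np.add, "min": np.minimum, "max": np.maximum}
    fused = ops[op](scores_w, scores_r)
    return np.argsort(-fused)   # document indices, best first

# Illustrative cosines for five documents under the two filters.
cos_w = np.array([0.90, 0.10, 0.50, 0.40, 0.30])
cos_r = np.array([0.20, 0.80, 0.60, 0.40, 0.30])

print(data_combine(cos_w, cos_r, "sum"))
print(data_combine(cos_w, cos_r, "min"))   # favors documents near BOTH filters
print(data_combine(cos_w, cos_r, "max"))   # favors documents near EITHER filter
```

Unlike query combination, this operates only on the output rankings, so the same code could fuse scores from entirely different retrieval systems.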
If we use the training data to select which method to use for each topic, performance is 6890 relevant docs and .3731 average precision, and this is no better than using just the RelDocs filter. (If we select optimally using the test data, small improvements are seen: 7058 relevant docs and .3962 average precision.)

Others (Belkin et al., 1994; Fox & Shaw, 1994) have reported some success with data fusion methods, and it is not entirely clear why we were not successful. Perhaps our two sources are not sufficiently different from each other. Both the Word and RelDocs vectors were represented in the same LSI space and are likely to reflect many of the same derived indexing features. (However, remember that a query combination method does improve performance.) A related possibility is that two sources of information may not be sufficient. Additional sources of information are easy to include (e.g., dot product similarity, keyword matching, phrase matching), but it would be nice to do so in a principled manner. Finally, it may be that non-linear combinations would lead to performance improvements.

Summary and Conclusions

We used LSI for filtering 336k documents from diverse sources for 50 interest profiles. Using relevant training documents was more effective than using a detailed natural language description of interests. Combining these two vectors provided some additional improvements in filtering. On average, 7 of the top 10 documents are relevant using the combined method. Performance can be improved by continually incorporating new relevant documents. Data combination using these two methods was not successful, although we believe it might be if additional sources of information are used. These combination methods are quite general and applicable to a wide variety of filtering tasks.

References

Bartell, B. T., Cottrell, G. W., and Belew, R. K. Automatic combination of multiple ranked retrieval systems. In Proceedings of SIGIR'94, ACM Press, 1994.

Belkin, N. J., Cool, C., Croft, W. B. and Callan, J. P. The effect of multiple query representations on information retrieval performance. In Proceedings of SIGIR'93, 1993, pp. 339-346.

Belkin, N. J., Kantor, P., Cool, C. and Quatrain, R. Combining evidence for information retrieval. In D. Harman (Ed.), The Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, 1994, pp. 35-44.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6), 391-407.

Dumais, S. T. Using LSI for information filtering: TREC-3 experiments. In D. Harman (Ed.), The Third Text REtrieval Conference (TREC-3), National Institute of Standards and Technology Special Publication 500-225, 1995, pp. 219-230.

Foltz, P. W. and Dumais, S. T. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, 1992, 35(12), 51-60.

Fox, E. A. and Shaw, J. A. Combination of multiple searches. In D. Harman (Ed.), The Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, 1994, pp. 243-252.

Harman, D. (Ed.), The Third Text REtrieval Conference (TREC-3), National Institute of Standards and Technology Special Publication 500-225, 1995.

Salton, G. and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.