Effects of Query and Database Sizes on Classification of News Stories using Memory Based Reasoning

Brij Masand
Thinking Machines Corporation
245 First Street, Cambridge, Massachusetts, 02142 USA

From: AAAI Technical Report SS-93-07. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Abstract

In this paper we explore the effects of query and database size on news story classification performance. Memory Based Reasoning (MBR), a k-nearest neighbor method, is used as the classification method. There are 360 different possible codes. Close matches to a new story are found using an already coded training database of about 87,000 stories from the Dow Jones Press Release News Wire, and a Connection Machine Document Retrieval system (CMDRS, [Stanfill]) that supports full-text queries, as the underlying match engine. By combining the codes from the near matches, new stories are coded with a recall of about 80% and precision of about 70%, as reported in [Masand]. When the query size is varied from 10 terms to more than 200 terms (matched against the full text of documents) the recall-precision product changes from 0.53 to 0.6. While this is a significant change, we find that moderate sized queries of 40-80 terms can suffice for finding relevant matches for classification. By changing the size of the database from 10,000 stories to 80,000 we found that the recall-precision product changed from 0.22 to 0.57. This shows that with our current MBR approach the database size can't be reduced significantly without compromising performance. We also find that fewer retrieved matches are needed with larger query and database sizes.

1 Introduction

Finding near matches by designing appropriate metrics is an important step in the Case Based Reasoning (CBR) and the MBR paradigm.
The availability of massively parallel machines with large memory and computation power has made it easier to use large example databases with relatively simple similarity metrics for retrieving relevant matches. In particular, increased computation power has made it easier to use all the features of a case as a query for matching against databases as large as several gigabytes.

We have previously reported the results of using a CM-based IR system for classification of news stories [Masand] and the use of a large training database of Census returns for automatically classifying new Census returns [Creecy]. Both of these projects use a large number (more than 50,000) of examples as training cases. The large size of the databases used has allowed fairly simple forms of indexing and metrics for assessing similarity.

In this paper we report experiments to determine the optimal number of terms in a query for finding relevant news story matches for good classification performance. We also investigate how the classification performance scales with the size of the training database. The parameters of the classification system are optimized automatically to find the best performance for different query and database sizes.

Sections 2 and 3 describe the coding problem and representative results. Section 4 reviews MBR and the details of the classification algorithm. The effect of different classification system parameters on performance and their automatic optimization is described in Section 5. Variations with respect to query and database size are described in Section 6. We conclude with a discussion of results and future directions.

2.0 Text Categorization: Related Work

Various successful systems have been developed to classify text documents including telegraphic messages [Young] [Goodman], physics abstracts [Biebricher], and full text news stories [Hayes] [Lewis] [Rau]. Some of the approaches rely on constructing topic definitions that require selection of relevant words and phrases, or use case frames and other NLP techniques intended for tasks more sophisticated than classification, e.g. extraction of relational information from text [Young] [Jacobs].

Alternative systems [Biebricher] [Lewis] use statistical approaches such as conditional probabilities on summary representations of the documents. One problem with statistical representations of the training database is the high dimensionality of the training space, generally at least 150k unique single features (words). Such a large feature space makes it difficult to compute probabilities involving conjunctions or co-occurrence of features. It also makes the application of neural networks a daunting task. The case based telex routing work described by [Goodman] is closest in approach to our present work, although it uses a much smaller case base combined with more post-processing of the retrieved matches.

Using an MBR approach for classifying news stories using their full text, we are able to achieve high recall and at least moderate precision without requiring manual definitions of the various topics, as required by most of the earlier approaches.

3.0 The News Story Classification Problem

Each day editors at Dow Jones assign codes to hundreds of stories originating from diverse sources such as newspapers, magazines, newswires, and press releases. Each editor must master the 350 or so distinct codes, grouped into six categories: industry, market sector, product, subject, government agency, and region. (See Fig. 1 for examples from each category.) Due to the high volume of stories, typically several thousand per day, manually coding all stories consistently and with high recall in a timely manner is impractical. In general, different editors may code documents with varying levels of consistency, accuracy, and completeness.

FIGURE 1 Some Sample Codes

Code     Name                      # of Documents
R/CA     California                9811
R/TX     Texas                     2813
M/TEC    Technology                9364
M/FIN    Financial                 7264
N/PDT    New Products/Services     4149
N/ERN    Earnings                  9841
I/CPR    Computers                 2880
I/BNK    All Banks                 2869
P/CAR    Cars                      380
P/PCR    Personal Computers        315
G/CNG    Congress                  307
G/FDA    Food and Drug Admin.      214

The coding task consists of assigning one or more codes to a text document, from a possible set of about 350 codes. Fig. 2 shows the text of a typical story with codes. The codes appearing in the header are the ones assigned by the editors, and the codes following "Suggested Codes" are those suggested by the automated system. Each code has a score in the left hand column, representing the contributions of several near matches. In this particular case the system suggests 11 of the 14 codes assigned by the editors (overlap marked by *) and assigns three extra codes. By varying the score threshold, we can trade off recall and precision.

3.1 Representative Results

The table below groups performance by code category for a random test set of 200 articles. The last column lists the number of different codes in each code category.

Category   Name            Recall   Precision   # of Codes
I/         industry        91       85          112
M/         market sector   93       91          9
G/         government      85       87          28
R/         region          82       74          121
N/         subject         70       76          70
P/         product         69       89          21
Total                      81       72          361

The automated system achieves fair to high recall and precision for all the code categories. Given CMDRS, the text retrieval system, as the underlying match engine, these basic results were achieved in about 2 person-months. By comparison, [Hayes] and [Creecy] report efforts of 2.5 and 8 person-years, respectively, for developing rule/pattern based concept descriptions for classification tasks with comparable numbers of categories. Our current speed of coding stories is about one story every 2 seconds on a 4k CM-2 system.

4 The Memory Based Reasoning Approach

Memory Based Reasoning (MBR) consists of variations on the nearest neighbor techniques (see [Dasarathy] for a comprehensive review of NN techniques). For a review of MBR see [Waltz and Stanfill] and [Waltz]. In its simplest formulation, MBR solves a new task by looking up examples of tasks similar to the new task and using similarity with these remembered solutions to determine the new solution. For example, to assign occupation and industry codes to a new Census return, one can look up near matches from a large (already coded) database and choose codes based on considering several near matches [Creecy]. In a similar fashion, codes are assigned to new, unseen news stories by finding near matches from the training database and then choosing the best few codes based on a confidence threshold.

From the CBR point of view the examples here are microcases with no annotated structures. This allows matching on entire examples including all the features (rather than matching on a summary or indexed representation of an example or case). Partly because of the nature of the classification task and partly because of the excellent matches provided by full-text matching, very little post-processing or "reasoning" is done to assign the codes, other than a straightforward summing of evidence for different codes from a number of near matches.

FIGURE 2 Sample News Story and Codes

0023000PR PR 910820
I/AUT I/CPR I/ELQ M/CYC M/IDU M/TEC R/EU R/FE R/GE R/JA R/MI R/PRM R/TX R/WEU

Suggested Codes:
* 3991   R/FE     Far East
* 3991   M/IDU    Industrial
* 3991   I/ELQ    Electrical Components & Equipment
* 3067   R/JA     Japan
* 2813   M/TEC    Technology
* 2813   M/CYC    Consumer, Cyclical
* 2813   I/CPR    Computers
* 2813   I/AUT    Automobile Manufacturers
  2460   P/MCR    Mainframes
  1555   R/CA     California
  1495   M/UTI    Utilities
* 1285   R/MI     Michigan
* 1178   R/PRM    Pacific Rim
* 1175   R/EU     Europe

"DAIMLER-BENZ UNIT SIGNS $11,000,000 AGREEMENT FOR HITATCHI DATA SYSTEMS DISK DRIVES"

SANTA CLARA, Calif.--(BUSINESS WIRE)--Debis Systemhaus GmbH, a 100 percent subsidiary of Daimler-Benz, has signed a contract to purchase approximately $11 million (U.S.) of 7390 Disk Storage Subsystems. The 7390s will be installed in debis' data centers throughout Germany over the next 6 months.

Daimler-Benz is a diversified manufacturing and services company whose corporate units include Mercedes-Benz, AEG, Deutsche Aerospace and debis. Debis provides computing, communications and financial services, along with insurance, trading and marketing services.

The 7390 Disk Storage Subsystems are HDS' most advanced high-capacity storage subsystems, capable of storing up to 22.7 gigabytes per cabinet. [...] First shipped in October of 1990, the 7390s are used in conjunction with high-performance mainframe computers in a wide variety of businesses and enterprises.

Hitachi Data Systems is a joint venture company owned by Hitachi, Ltd. and Electronic Data Systems (EDS). The company markets a broad range of mainframe systems, peripheral products and services. Headquartered in Santa Clara, HDS employs 2,600 people and has products installed in more than 30 countries worldwide.

4.1 The Training Database

Dow Jones publishes a variety of news sources in electronic form. We used the source for press releases called PR Newswire, most of which is concerned with business news. Editors assign codes to stories daily. On average, a story has about 2,700 bytes (or 500 words) and 8 codes. For the experiments reported here the training database consists of 87,000 examples (total size about 240 Mbytes). The database was not specially created for the project; it just contains stories from several months of the newswire. The training database has different numbers of stories for different codes and code categories. Figs. 1 and 3 show some representative codes and code categories and their sizes.

FIGURE 3 Code Frequencies by Categories

Category   # of Documents   # of Occurrences
I/         38308            57430
M/         38562            42058
G/         3926             4200
R/         47083            116358
N/         41902            52751
P/         2242             2523

4.2 The Classification Algorithm

Following the general approach of MBR, we first find the near matches for each document to be classified. This is done by constructing a full-text query out of the text of the document, including both words and capitalized pairs. This query returns a weighted list of near matches (see Fig. 4). We assign codes to the unknown document by combining the codes assigned to the k nearest matches; for these experiments, we used up to 11 nearest neighbors. Codes are assigned weights by summing similarity scores from the near matches. Finally we choose the best codes based on a score threshold. Fig. 4 shows the headlines and the normalized scores for the example used in Fig. 2 and the first few near matches from the relevance feedback search.

FIGURE 4 Sample News Story with Eleven Nearest Neighbors

Score   Size   Headline
1000    2k     Daimler-Benz unit signs $11,000,000 agreement for Hitatchi Data
 924    2k     MCI signs agreement for Hitachi Data Systems disk drives
 654    2k     Delta Air Lines takes delivery of industry's first...
 631    2k     Crowley Maritime Corp. installs HDS EX
 607    2k     HDS announces 15 percent performance boost for EX Series processors
 604    2k     L.M. Ericsson installs two Hitachi Data Systems 420 mainframes
 571    2k     Gaz de France installs HDS EX 420 mainframe
 568    5k     Hitachi Data Systems announces two new models of EX Series mainframes
 568    2k     HDS announces ESA/390 schedule
 543    2k     SPRINT installs HDS EX 420
 543    4k     Hitachi Data Systems announces new model of EX Series mainframes
 485    4k     HDS announces upgrades for installed 7490 subsystems

4.3 Defining Features

Although MBR is conceptually simple, its implementation requires identifying features and associated metrics that enable easy and quantitative comparisons between different examples. A news story has a consistent structure: headline, author, date, main text, etc. Potentially one can use words and phrases and their co-occurrence from all these fields to create features [Creecy]. For the purpose of this project we used single words and capital word pairs as features, largely because CMDRS, the underlying document retrieval system used as a match engine, provides support for this functionality.
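The feature definition above and the score-summing step of Section 4.2 can be illustrated with a minimal Python sketch. The stop-word list, the tokenizer, the toy neighbor list and all function names are assumptions made for illustration; they do not reproduce CMDRS, and the sketch does no term weighting.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, far shorter than the one a real engine would use.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "on", "in", "to", "is"}


def extract_features(text):
    """Return single-word features plus adjacent capitalized word pairs."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9'-]*", text)
    words = {t.lower() for t in tokens if t.lower() not in STOP_WORDS}
    # Simplification: punctuation between two tokens is ignored when pairing.
    pairs = {
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if a[0].isupper() and b[0].isupper()
        and a.lower() not in STOP_WORDS and b.lower() not in STOP_WORDS
    }
    return words | pairs


def score_codes(neighbors, k=11):
    """Sum similarity scores per code over the k best-scoring near matches."""
    totals = defaultdict(float)
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    for similarity, codes in top:
        for code in codes:
            totals[code] += similarity
    return dict(totals)


def assign_codes(neighbors, k=11, threshold=1000.0):
    """Keep the codes whose summed score reaches the threshold, best first."""
    ranked = sorted(score_codes(neighbors, k).items(),
                    key=lambda item: (-item[1], item[0]))
    return [(code, score) for code, score in ranked if score >= threshold]


# Toy near matches: (similarity score, codes already assigned by editors).
EXAMPLE_NEIGHBORS = [
    (924, ["I/CPR", "M/TEC"]),
    (654, ["I/CPR", "R/CA"]),
    (631, ["M/TEC"]),
]

if __name__ == "__main__":
    print(sorted(extract_features("Hitachi Data Systems signs agreement for disk drives")))
    print(assign_codes(EXAMPLE_NEIGHBORS, k=3, threshold=1000.0))
```

Raising the threshold in `assign_codes` drops weakly supported codes (higher precision, lower recall), while lowering it has the opposite effect.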
If the entire story is used as the query, as is the case for CMDRS, this can result in a large query of several hundred terms even after ignoring stop words.

4.4 The Match Engine (CMDRS)

CMDRS is the production version of the text retrieval system reported in [Stanfill].¹ The text is compressed by eliminating stop words (368 non-content bearing words such as "the", "on" and "and") and then by eliminating the most common words that account for 20% of the occurrences in the database. The second step removes a total of 72 additional words. The remaining words, known as searchable terms, are assigned weights inversely proportional to their frequencies in the database. Although general phrases are ignored, pairs of capital words that occur more than once are recognized and are also searchable. There are over 250,000 searchable words and word pairs in this database. Relevance feedback is performed by constructing queries from all the text of the document. Response time for a retrieval request is under a second. All the work for this paper was done on a 4k CM-2 Connection Machine System.

¹ Although the experiments were conducted at Thinking Machines Corp, a live version of the system is available from Dow Jones News Retrieval as DowQuest.

5 Optimization of parameters

Different trade-offs between recall and precision can be achieved by varying the parameters of retrieval and classification. Two important parameters are the number of near matches (k) used and the thresholds used for assigning the codes. The results in Section 3.1 represent manual optimization of the number of near neighbors and the confidence thresholds. As the number of near matches considered for classification increases, recall increases while precision decreases (more correct codes are found but also more noise is added). In general, higher thresholds are effective when considering an increasing number (k) of near neighbors. The optimal combination of score threshold and k seems to differ depending on code categories and requires further study, possibly using more than eleven near matches.

Another parameter, the code-gap-ratio, measures the ratio between the scores of successive codes. By adjusting this parameter, the tail end of the assigned codes can be cut off whenever there is a sharp drop in successive (ranked) scores. In order to reduce the effort required to find optimal combinations of the number of near neighbors, score and code-gap-ratio thresholds, we experimented with automatic optimization through a random search.

5.1 Automated optimization

Before trying techniques such as hill-climbing we started by testing random n-tuples of parameters (we have experimented so far with up to 3 parameters) and choosing the best performance. To our surprise we found that in about a thousand random trials (about half an hour's worth of computation) parameters are found for which the performance is quite close to or better than the values obtained by manual optimization. This is important both for avoiding the cumbersome task of manual optimization and for removing the subjective factor when comparing the relative performance for varying database and query sizes. All the results in the following sections represent the outcome of automatic selection of optimal parameters through a simple random search.

FIGURE 5 Variation of Classification performance by query size
[Plot: recall-precision product versus query size (terms)]

FIGURE 6 Variation of Classification performance by database size
[Plot: recall-precision product versus database size (thousand documents)]

6.0 Variation of Query and Database Sizes

Different sets of near neighbors for the test set were found from the entire database, using different query sizes, and then the best performance was found by automatically optimizing the number of near neighbors and associated thresholds.

6.1 Varying the query size

The following table and Fig. 5 describe the variation of the recall-precision product with respect to different numbers of terms in the query that is used to find the near matches. The query size represents the n best terms (based on inverse frequency weights) selected for the query from the article being classified.

Query Size (terms)   R*P   Recall   Precision   k
10                   .53   68       79          7
40                   .58   74       78          4
80                   .6    73       82          4
100                  .6    74       80          3
160                  .6    72       82          3

We see that while the change is significant from 10 to 100 terms, a modest query size of 40 to 80 terms suffices for the best performance. The average size of a story in the database is about 500 words. Only the size of the query is varied, matching against the full text of the documents. The terms consist of single words and pairs of adjacent capital words. The last column represents the average optimized number of near neighbors for a query size. Fewer matches seem to be needed as the query size increases, presumably because the queries are more specific as the size increases.

6.2 Varying the database size

The following table and Fig. 6 describe the variation of recall-precision product with increasing database size.

Database size (thousands)   R*P   Recall   Precision   k
10                          .22   42       52          5
20                          .31   65       47          5
30                          .44   62       71          3
40                          .49   64       78          4
60                          .55   69       80          4
80                          .57   68       83          2

We see that performance improves significantly as stories are added in increments of 10,000.
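Each row in the tables of Sections 6.1 and 6.2 is the best result found by the random search of Section 5.1. The Python sketch below shows one way such a search could be set up, with the recall-precision product as the objective. The parameter ranges, the toy test set, the micro-averaged recall and precision, and the reading of the code-gap-ratio as a cut-off on the ratio of successive ranked scores are all assumptions, not details given in this paper.

```python
import random
from collections import defaultdict


def assign_with_cutoffs(neighbors, k, threshold, gap_ratio):
    """Sum scores over the k best matches, then apply both cut-offs."""
    totals = defaultdict(float)
    for similarity, codes in sorted(neighbors, key=lambda n: -n[0])[:k]:
        for code in codes:
            totals[code] += similarity
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    chosen, previous = [], None
    for code, score in ranked:
        if score < threshold:
            break
        # Assumed reading of the code-gap-ratio: stop at a sharp drop
        # between successive ranked scores.
        if previous and score / previous < gap_ratio:
            break
        chosen.append(code)
        previous = score
    return chosen


def recall_precision(test_set, k, threshold, gap_ratio):
    """Micro-averaged recall and precision over all test documents."""
    correct = suggested = expected = 0
    for neighbors, true_codes in test_set:
        predicted = set(assign_with_cutoffs(neighbors, k, threshold, gap_ratio))
        correct += len(predicted & true_codes)
        suggested += len(predicted)
        expected += len(true_codes)
    recall = correct / expected if expected else 0.0
    precision = correct / suggested if suggested else 0.0
    return recall, precision


def random_search(test_set, trials=1000, seed=0):
    """Try random parameter triples and keep the best recall-precision product."""
    rng = random.Random(seed)
    best_product, best_params = -1.0, None
    for _ in range(trials):
        params = (
            rng.randint(1, 11),        # k: number of near matches
            rng.uniform(0.0, 4000.0),  # score threshold (range assumed)
            rng.uniform(0.0, 1.0),     # code-gap-ratio (range assumed)
        )
        recall, precision = recall_precision(test_set, *params)
        product = recall * precision
        if product > best_product:
            best_product, best_params = product, params
    return best_product, best_params


# Toy test set: (near matches as (score, codes), codes assigned by editors).
TOY_TEST_SET = [
    ([(924, {"I/CPR", "M/TEC"}), (654, {"I/CPR", "R/CA"}), (631, {"M/TEC"})],
     {"I/CPR", "M/TEC"}),
    ([(800, {"N/ERN"}), (500, {"N/ERN", "I/BNK"}), (300, {"R/TX"})],
     {"N/ERN", "I/BNK"}),
]

if __name__ == "__main__":
    print(random_search(TOY_TEST_SET, trials=200, seed=1))
```

Because every trial is independent, the same loop can be rerun unchanged for each query size or database size, which keeps the comparison between settings free of manual tuning.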
While the rate of increase seems to decrease, the data suggest that performance can be improved further by increasing the size of the database.

Different sets of near matches for the test set were found for different fractions of the database, while keeping the maximum query size constant at a few hundred terms. It is likely that there is a relationship between the number of different classifications (in this case 361) and the rate of increase as the database is expanded.

The last column indicates the average number of near matches (k) for optimal performance. As the database grows, fewer matches are needed, as one expects to find more close matches.

6.3 Other variations

One promising direction would be to explore the effect of using summary representations of the documents themselves while still retaining a large training database. It would also be useful to see the interaction of query size with different database sizes and different numbers of codes to be assigned in various code categories.

7 Discussion

For the results reported in this paper we used n-way cross validation, which involves excluding each test example one at a time from the database and performing the classification on it. We used a randomly chosen set of 200 articles for the test set.

While the automatically optimized performance seems less dramatic than that of certain systems that use manually constructed definitions (such as the 90% recall and precision reported by [Hayes] and [Rau]), we believe that an MBR approach offers significant advantages in terms of dramatically reduced development time, automated optimization, ease of deployment, and maintenance. We should be able to improve the performance further by increasing the size of the training database. Our test database can hold more than 120,000 stories on the existing hardware (a 4k processor CM-2).

While we have shown that fairly small query sizes suffice to locate relevant near matches, one should note that the query is searched against the full text of the documents, for which the total number of features exceeds 250,000. In addition, the smaller query sizes may benefit from a larger database since the probability of a close match increases with the size of the database. It is also possible that a large training database compensates for a simple similarity metric.

It would be interesting to see if automated optimization of parameters through a random search works for more than 3 parameters at one time. For larger sets of parameters, hill climbing or optimization through genetic algorithms might be necessary.

We are studying the effect of scaling the database on different categories of codes with different numbers of codes in them, to judge the effect of training database size with respect to the granularity of classification. The increase of performance with increasing number of examples is probably different for different code categories with different numbers of training examples.

8 Conclusions

We have shown that recall and precision using the MBR approach for news story classification increased significantly when we increased the database from 10,000 to 87,000 stories. Along with increased performance, fewer retrieved matches seem to be needed for a larger database.

In addition we have shown that query sizes of about 40 to 80 terms (best terms as ranked by inverse frequency weights) are sufficient for locating the best relevant matches for the purpose of classification when matched against a full-text representation of the documents. Fewer matches seem to be needed for larger queries.

A simple random search yielded classification parameters that result in respectable performance, comparable to manual optimization.

This approach can be used to provide case based classification at little extra cost where a news retrieval system with relevance feedback already exists.

9 Future Work

Although we have shown that the performance of news classification with the MBR approach depends on having a large example-base, it may be possible to prune the database by removing redundant examples and also removing examples that aren't often retrieved as matches.

We would like to study the combined effect of the variation of query size with respect to the scaling of the database, as well as the effect of matching against a summary representation of the documents in the database.

Another extension of the current approach might be to do more in-depth reasoning on the retrieved matches, as exemplified in the case based reasoning approach.

10 Acknowledgments

I would like to thank Gordon Linoff, Dave Waltz and Steve Smith from Thinking Machines for help with this project.

11 References

Biebricher, Peter; Fuhr, Norbert et al. "The Automatic Indexing System AIR/PHYS -- From Research to Application." Internal report, TH Darmstadt, Department of Computer Science, Darmstadt, Germany.

Creecy, R. H., Masand, B., Smith, S., and Waltz, D. "Trading MIPS and Memory for Knowledge Engineering: Classifying Census Returns on the Connection Machine." Comm. ACM (August 1992).

Dasarathy, B. V. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, California (1991).

Goodman, M. "Prism, An AI Case Based Text Classification System." In R. Smith and E. Rappaport (eds), Innovative Applications of AI, 1991, pp. 25-30.

Hayes, P. J. and Weinstein, S. P. "CONSTRUE/TIS: A System for Content-based Indexing of a Database of News Stories." Innovative Applications of Artificial Intelligence 2. The AAAI Press/The MIT Press, Cambridge, MA, pp. 49-64, 1991.

Hayes, P. J. et al. "A Shell for Content-based Text Categorization." 6th IEEE AI Applications Conference, Santa Monica, CA, March 1990.

Jacobs, Paul and Rau, Lisa. "SCISOR: Extracting Information from On-Line News." Communications of the ACM, 33(11):88-97, November 1990.

Lewis, David D. "An Evaluation of Phrasal and Clustered Representation on a Text Categorization Task." Proceedings, SIGIR '92, Copenhagen, Denmark.

Masand, Brij M., Linoff, Gordon S., and Waltz, David L. "Classifying News Stories on the Connection Machine using Memory Based Reasoning." Proceedings, SIGIR '92, Copenhagen, Denmark.

Rau, Lisa F. and Jacobs, Paul S. "Creating Segmented Databases From Free Text for Text Retrieval." Proceedings, SIGIR 1991 (Chicago, Illinois).

Stanfill, C. and Kahle, B. "Parallel Free-Text Search on the Connection Machine System." Comm. ACM 29, 12 (December 1986), pp. 1229-1239.

Stanfill, C. and Waltz, D. L. "The Memory-Based Reasoning Paradigm." Proc. Case-Based Reasoning Workshop, Clearwater Beach, FL (May 1988), pp. 414-424.

Stanfill, C. and Waltz, D. L. "The Memory-Based Reasoning Paradigm." Comm. ACM 29, 12 (December 1986), pp. 1213-1228.

Young, Sheryl R. and Hayes, Philip J. "Automatic Classification and Summarization of Banking Telexes." Proceedings of the Second IEEE Conference on AI Applications, 1985, Miami Beach, FL.

Waltz, D. L. "Memory-Based Reasoning." In M. A. Arbib and J. A. Robinson (eds), Natural and Artificial Parallel Computation, The MIT Press, Cambridge, MA (1990), pp. 251-276.