Effects of Query and Database Sizes on Classification of News Stories using Memory Based Reasoning

Brij Masand
Thinking Machines Corporation
245 First Street, Cambridge, Massachusetts, 02142 USA

From: AAAI Technical Report SS-93-07. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Abstract

In this paper we explore the effects of query and database size on news story classification performance. Memory Based Reasoning (MBR), a k-nearest neighbor method, is used as the classification method. There are 360 different possible codes. Close matches to a new story are found using an already coded training database of about 87,000 stories from the Dow Jones Press Release News Wire, and a Connection Machine Document Retrieval system (CMDRS, [Stanfill]) that supports full text queries, as the underlying match engine. By combining the codes from the near matches, new stories are coded with a recall of about 80% and precision of about 70%, as reported in [Masand]. When the query size is varied from 10 terms to more than 200 terms (matched against the full text of documents), the recall-precision product changes from 0.53 to 0.6. While this is a significant change, we find that moderate sized queries of 40-80 terms can suffice for finding relevant matches for classification. By changing the size of the database from 10,000 stories to 80,000 we found that the recall-precision product changed from 0.22 to 0.57. This shows that with our current MBR approach the database size can't be reduced significantly without compromising performance. We also find that fewer retrieved matches are needed with larger query and database sizes.
1 Introduction

Finding near matches by designing appropriate metrics is an important step in the Case Based Reasoning (CBR) and the MBR paradigm. The availability of massively parallel machines with large memory and computation power has made it easier to use large example databases with relatively simple similarity metrics for retrieving relevant matches. In particular, increased computation power has made it easier to use all the features of a case as a query for matching against databases as large as several Gigabytes.

We have previously reported the results of using a CM-based IR system for classification of news stories [Masand] and the use of a large training database of Census returns for automatically classifying new Census returns [Creecy]. Both of these projects use large (more than 50,000) numbers of examples as training cases. The large size of the databases used has allowed fairly simple forms of indexing and metrics for assessing similarity. In this paper we report experiments to determine the optimal number of terms in a query for finding relevant news story matches for good classification performance. We also investigate how the classification performance scales with the size of the training database. The parameters of the classification system are optimized automatically for finding the best performance related to different query and database sizes.
Sections 2 and 3 describe the coding problem and representative results. Section 4 reviews MBR and the details of the classification algorithm. The effect of different classification system parameters on performance and their automatic optimization is described in Section 5. Variations with respect to query and database size are described in Section 6. We conclude with a discussion of results and future directions.
2.0 Text Categorization: Related Work

Various successful systems have been developed to classify text documents, including telegraphic messages [Young] [Goodman], physics abstracts [Biebricher], and full text news stories [Hayes] [Lewis] [Rau]. Some of the approaches rely on constructing topic definitions that require selection of relevant words and phrases, or use case frames and other NLP techniques intended for tasks more sophisticated than classification, e.g. for tasks such as extraction of relational information from text [Young] [Jacobs].

Alternative systems [Biebricher] [Lewis] use statistical approaches such as conditional probabilities on summary representations of the documents. One problem with statistical representations of the training database is the high dimensionality of the training space, generally at least 150k unique single features -- or words. Such a large feature space makes it difficult to compute probabilities involving conjunctions or co-occurrence of features. It also makes the application of neural networks a daunting task. The case based telex routing work described by [Goodman] is closest in approach to our present work, although it uses a much smaller case base combined with more post processing of the retrieved matches.

Using an MBR approach for classifying news stories using their full text, we are able to achieve high recall and at least moderate precision without requiring manual definitions of the various topics, as required by most of the earlier approaches.
3.0 The News Story Classification Problem

Each day editors at Dow Jones assign codes to hundreds of stories originating from diverse sources such as newspapers, magazines, newswires, and press releases. Each editor must master the 350 or so distinct codes, grouped into seven categories: industry, market sector, product, subject, government agency, and region. (See Fig. 1 for examples from each category.) Due to the high volume of stories, typically several thousand per day, manually coding all stories consistently and with high recall in a timely manner is impractical. In general, different editors may code documents with varying levels of consistency, accuracy, and completeness.

The coding task consists of assigning one or more codes to a text document, from a possible set of about 350 codes. Fig. 2 shows the text of a typical story with codes. The codes appearing in the header are the ones assigned by the editors, and the codes following "Suggested Codes" are those suggested by the automated system. Each code has a score in the left hand column, representing the contributions of several near matches. In this particular case the system suggests 11 of the 14 codes assigned by the editors (overlap marked by *) and assigns three extra codes. By varying the score threshold, we can trade off recall and precision.

FIGURE 1  Some Sample Codes

Code    Name                    # of Documents
R/CA    California                        9811
R/TX    Texas                             2813
M/TEC   Technology                        9364
M/FIN   Financial                         7264
N/PDT   New Products/Services             4149
N/ERN   Earnings                          9841
I/CPR   Computers                         2880
I/BNK   All Banks                         2869
P/CAR   Cars                               380
P/PCR   Personal Computers                 315
G/CNG   Congress                           307
G/FDA   Food and Drug Admin.               214

3.1 Representative Results

The table below groups performance by code category for a random test set of 200 articles. The last column lists the number of different codes in each code category.

Category  Name           Recall  Precision  # of Codes
I/        industry           91         85         112
M/        market sector      93         91           9
G/        government         85         87          28
R/        region             82         74         121
N/        subject            70         76          70
P/        product            69         89          21
          Total              81         72         361

The automated system achieves fair to high recall and precision for all the code categories. Given CMDRS, the text retrieval system, as the underlying match engine, these basic results were achieved in about 2 person-months. By comparison, [Hayes] and [Creecy] report efforts of 2.5 and 8 person-years, respectively, for developing rule/pattern based concept descriptions for classification tasks with comparable numbers of categories. Our current speed of coding stories is about a story every 2 seconds on a 4k CM-2 system.

4 The Memory Based Reasoning Approach

Memory Based Reasoning (MBR) consists of variations of the nearest neighbor techniques (see [Dasrathy] for a comprehensive review of NN techniques). For a review of MBR see [Waltz and Stanfill] and [Waltz]. In its simplest formulation, MBR solves a new task by looking up examples of tasks similar to the new task and using similarity with these remembered solutions to determine the new solution. For example, to assign occupation and industry codes to a new Census return, one can look up near matches from a large (already coded) database and choose codes based on considering several near matches [Creecy]. In a similar fashion, codes are assigned to new unseen news stories by finding near matches from the training database and then choosing the best few codes based on a confidence threshold.

From the CBR point of view the examples here are micro-cases with no annotated structures. This allows matching on entire examples including all the features (rather than matching based on a summary/indexed representation of an example or case). Partly because of the nature of the classification task and partly because of the excellent matches provided by using full-text matching, very little post-processing or "reasoning" is done to assign the codes, other than a straightforward summing of evidence for different codes from a number of near matches.
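The evidence-summing step just described can be sketched as follows. This is a minimal illustration, not the production system: the function name, the default k and threshold values, and the toy match list are assumptions invented for the example.

```python
from collections import defaultdict

def assign_codes(near_matches, k=11, threshold=1200):
    """Combine the codes of the k nearest stories (MBR voting).

    near_matches: (similarity_score, set_of_codes) pairs sorted by
    descending similarity. k and threshold are illustrative values.
    """
    weights = defaultdict(int)
    for score, codes in near_matches[:k]:
        for code in codes:
            weights[code] += score  # sum evidence for each code
    # keep codes whose summed score clears the confidence threshold
    return sorted(((c, w) for c, w in weights.items() if w >= threshold),
                  key=lambda cw: -cw[1])

matches = [(1000, {"R/FE", "M/TEC"}), (924, {"M/TEC"}), (654, {"I/CPR"})]
# M/TEC accumulates 1000 + 924 = 1924 and is the only code above 1200
```

Because the scores are simply summed, a code supported by several moderately similar stories can outrank a code from a single very close match.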
FIGURE 2  Sample News Story and Codes

0023000PR PR 910820
I/AUT I/CPR I/ELQ M/CYC M/IDU M/TEC
R/EU R/FE R/GE R/JA R/MI R/PRM R/TX R/WEU

Suggested Codes:
* 3991  R/FE   Far East
* 3991  M/IDU  Industrial
* 3991  I/ELQ  Electrical Components & Equipment
* 3067  R/JA   Japan
* 2813  M/TEC  Technology
* 2813  M/CYC  Consumer, Cyclical
* 2813  I/CPR  Computers
* 2813  I/AUT  Automobile Manufacturers
  2460  P/MCR  Mainframes
  1555  R/CA   California
  1495  M/UTI  Utilities
* 1285  R/MI   Michigan
* 1178  R/PRM  Pacific Rim
* 1175  R/EU   Europe

DAIMLER-BENZ UNIT SIGNS $11,000,000 AGREEMENT FOR HITATCHI DATA SYSTEMS DISK DRIVES

SANTA CLARA, Calif.--(BUSINESS WIRE)--Debis Systemhaus GmbH, a 100 percent subsidiary of Daimler-Benz, has signed a contract for approximately $11 million to purchase 7390 Disk Storage Subsystems from Hitachi Data Systems (U.S.). [Remainder of the story text, describing the 7390 subsystems, first shipped in October of 1990, and the businesses of debis (a diversified Daimler-Benz services company) and Hitachi Data Systems (a joint venture of Hitachi, Ltd. and Electronic Data Systems (EDS), headquartered in Santa Clara), is garbled in this copy.]

4.1 The Training Database

Dow Jones publishes a variety of news sources in electronic form. We used the source for press releases called PR Newswire, most of which is concerned with business news. Editors assign codes to stories daily. On average, a story has about 2,700 bytes (or 500 words) and 8 codes. For the experiments reported here the training database consists of 87,000 examples (total size about 240 Mbytes). The database was not specially created for the project; it just contains stories from several months of the newswire. The training database has different numbers of stories for different codes and code categories. Figs. 1 and 3 show some representative codes and code categories and their sizes.

FIGURE 3  Code Frequencies by Categories

Category  # of Documents  # of Occurrences
I/                 38308             57430
M/                 38562             42058
G/                  3926              4200
R/                 47083            116358
N/                 41902             52751
P/                  2242              2523

4.2 The Classification Algorithm

Following the general approach of MBR, we first find the near matches for each document to be classified. This is done by constructing a full-text query out of the text of the document, including both words and capitalized pairs. This query returns a weighted list of near matches (see Fig. 4). We assign codes to the unknown document by combining the codes assigned to the k nearest matches; for these experiments, we used up to 11 nearest neighbors. Codes are assigned weights by summing similarity scores from the near matches. Finally we choose the best codes based on a score threshold. Fig. 4 shows the headlines and the normalized scores for the example used in Fig. 2 and the first few near matches from the relevance feedback search.
FIGURE 4  Sample News Story with Eleven Nearest Neighbors

Score  Size  Headline
 1000    2k  Daimler-Benz unit signs $11,000,000 agreement for Hitatchi Data
  924    2k  MCI signs agreement for Hitachi Data Systems disk drives
  654    2k  Delta Air Lines takes delivery of industry's first...
  631    2k  Crowley Maritime Corp. installs HDS EX
  607    2k  HDS announces 15 percent performance boost for EX Series processors
  604    2k  L.M. Ericsson installs two Hitachi Data Systems 420 mainframes
  571    2k  Gaz de France installs HDS EX 420 mainframe
  568    5k  Hitachi Data Systems announces two new models of EX Series mainframes
  568    2k  HDS announces ESA/390 schedule
  543    2k  SPRINT installs HDS EX 420
  543    4k  Hitachi Data Systems announces new model of EX Series mainframes
  485    4k  HDS announces upgrades for installed 7490 subsystems

4.3 Defining Features
Although MBR is conceptually simple, its implementation requires identifying features and associated metrics that enable easy and quantitative comparisons between different examples. A news story has a consistent structure: headline, author, date, main text, etc. Potentially one can use words and phrases and their co-occurrence from all these fields to create features [Creecy]. For the purpose of this project we used single words and capital word pairs as features, largely because CMDRS, the underlying document retrieval system used as a match engine, provides support for this functionality. If the entire story is used as a query, as is the case for CMDRS, this can result in a large query of several hundred terms even after ignoring stop words.
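A rough sketch of this feature scheme (single words plus adjacent capitalized-word pairs); the tokenizer and the tiny stop list below are simplifications assumed for illustration, not CMDRS's actual processing:

```python
import re

STOP_WORDS = {"the", "on", "and", "a", "of", "for"}  # tiny illustrative subset

def extract_features(text):
    """Return single words plus adjacent capitalized-word pairs."""
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    words = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])
             if a[0].isupper() and b[0].isupper()]
    return words + pairs

feats = extract_features("Hitachi Data Systems ships the 7390 disk subsystems")
# includes "disk" and the capital pairs "Hitachi Data", "Data Systems",
# but not the stop word "the"
```

The capital-pair features approximate proper-name phrases without any phrase dictionary, which is why company names match so well in Fig. 4.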
4.4 The Match Engine (CMDRS)

CMDRS is the production version of the text retrieval system reported in [Stanfill].¹ The text is compressed by eliminating stop words (368 non-content bearing words such as "the", "on" and "and") and then by eliminating the most common words that account for 20% of the occurrences in the database. The second step removes a total of 72 additional words. The remaining words, known as searchable terms, are assigned weights inversely proportional to their frequencies in the database. Although general phrases are ignored, pairs of capital words that occur more than once are recognized and are also searchable. There are over 250,000 searchable words and word pairs in this database. Relevance feedback is performed by constructing queries from all the text of the document. Response time for a retrieval request is under a second. All the work for this paper was done on a 4k CM-2 Connection Machine System.

¹ Although the experiments were conducted at Thinking Machines Corp, a live version of the system is available from Dow Jones News/Retrieval as DowQuest.
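The paper says only that searchable terms are weighted inversely proportionally to their database frequency; the logarithmic form below is one common choice and is assumed here for illustration, not taken from the paper:

```python
import math

def term_weights(query_terms, doc_freq, n_docs):
    """Weight each term inversely to its database frequency (assumed log form)."""
    return {t: math.log(n_docs / doc_freq[t])
            for t in query_terms if t in doc_freq}

w = term_weights(["mainframe", "company"],
                 {"mainframe": 500, "company": 40000}, 87000)
# the rarer term "mainframe" receives a larger weight than "company"
```

Under any inverse-frequency scheme, rare content-bearing terms dominate the similarity score, which is what lets a short, specific query find good matches.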
5 Optimization of Parameters

Different trade-offs between recall and precision can be achieved by varying the parameters of retrieval and classification. Two important parameters are the number of near matches (k) used and the thresholds used for assigning the codes. The results in section 3.1 represent manual optimization of the number of near neighbors and the confidence thresholds. As the number of near matches considered for classification increases, recall increases while precision decreases (more correct codes are found but also more noise is added). In general, higher thresholds are effective when considering increasing numbers of (k) near neighbors. The optimal combination of score threshold and k seems to differ depending on code categories and requires further study, possibly using more than eleven near matches.

Another parameter, the code-gap-ratio, measures the ratio between successive codes. By adjusting this parameter, the tail end of the assigned codes can be cut off whenever there is a sharp drop in successive (ranked) scores. In order to reduce the effort required to find optimal combinations of the number of near neighbors, score and code-gap-ratio thresholds, we experimented with automatic optimization through a random search.

5.1 Automated Optimization

Before trying techniques such as hill-climbing, we started by testing random n-tuples of parameters (we have experimented so far with up to 3 parameters) and choosing the best performance. To our surprise we found that in about a thousand random trials -- about half an hour's worth of computation -- optimal parameters are found for which the performance is quite close to or better than the values obtained by manual optimization. This is important both from the point of view of avoiding the cumbersome task of manual optimization as well as removing the subjective factor when comparing the relative performance for varying database and query sizes. All the results in the following sections represent the outcome of automatic selection of optimal parameters through a simple random search.
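The random search over parameter triples can be sketched as below. The sampling ranges and the evaluation interface are illustrative assumptions (the paper does not give them), and `evaluate` stands in for a classification run over the 200-article test set:

```python
import random

def random_search(evaluate, trials=1000, seed=0):
    """Pick the best of `trials` random (k, threshold, gap-ratio) triples.

    evaluate(k, threshold, gap_ratio) should return the recall-precision
    product; the sampling ranges below are illustrative guesses.
    """
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(trials):
        params = (rng.randint(1, 11),     # number of near neighbors k
                  rng.uniform(0, 2000),   # summed-score threshold
                  rng.uniform(1.0, 5.0))  # code-gap-ratio cutoff
        score = evaluate(*params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score
```

With only three parameters and a cheap evaluation, a thousand blind samples cover the space densely enough that this tends to land near the manually tuned optimum, matching the paper's observation.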
FIGURE 5  Variation of classification performance by query size (recall-precision product, roughly 0.53 to 0.60, plotted against query size in terms, 10 to 200)

FIGURE 6  Variation of classification performance by database size (recall-precision product, roughly 0.2 to 0.6, plotted against database size in thousands of documents, 10 to 100)
6.0 Variation of Query and Database Sizes

Different sets of near neighbors for the test set were found from the entire database, using different query sizes, and then the best performance was found by automatically optimizing the number of near neighbors and associated thresholds.

6.1 Varying the Query Size

The following table and Fig. 5 describe the variation of the recall-precision product with respect to different numbers of terms in the query that is used to find the near matches. The query size represents the n best terms (based on inverse frequency weights) selected for the query from the article being classified.

Query Size (terms)  R*P  Recall  Precision  k
                10  .53      68         79  7
                40  .58      74         78  4
                80  .60      73         82  4
               100  .60      74         80  3
               160  .60      72         82  3

We see that while the change is significant from 10 to 100 terms, a modest query size of 40 to 80 terms suffices for the best performance. The average size of a story in the database is about 500 words. Only the size of the query is varied, matching against the full text of the documents. The terms consist of single words and pairs of adjacent capital words. The last column represents the average optimized number of near neighbors for a query size. Fewer matches seem to be needed as the query size increases, presumably because the queries are more specific as the size increases.

6.2 Varying the Database Size

The following table and Fig. 6 describe the variation of the recall-precision product with increasing database size.

Database size (thousands)  R*P  Recall  Precision  k
                       10  .22      42         52  5
                       20  .31      65         47  5
                       30  .44      62         71  3
                       40  .49      64         78  4
                       60  .55      69         80  4
                       80  .57      68         83  2

We see that performance improves significantly as stories are added in increments of 10,000. While the rate of increase seems to decrease, the data suggest that performance can be improved further by increasing the size of the database.

Different sets of near matches for the test set were found for different fractions of the database, while keeping the maximum query size constant at a few hundred terms. It's likely that there is a relationship between the number of different classifications (in this case 361) and the rate of increase as the database is expanded.

The last column indicates the average number of near matches for optimal performance. As the database increases fewer matches are needed, as one expects to find more close matches.
6.3 Other Variations

One promising direction would be to explore the effect of using summary representations of the documents themselves while still retaining a large training database. It would also be useful to see the interaction of query size with different database sizes and different numbers of codes to be assigned in various code categories.

7 Discussion

For the results reported in this paper we used n-way cross validation, which involves excluding each test example one at a time from the database and performing the classification on it. We used a randomly chosen set of 200 articles for the test set.

While the automatically optimized performance seems less dramatic than certain systems that use manually constructed definitions (such as the 90% recall and precision reported by [Hayes] and [Rau]), we believe that an MBR approach offers significant advantages in terms of dramatically reduced development time, automated optimization, ease of deployment, and maintenance. We should be able to improve the performance further by increasing the size of the training database. Our test database can hold more than 120,000 stories on the existing hardware (a 4k processor CM-2).

This approach can be used to provide case based classification at little extra cost where a news retrieval system with relevance feedback already exists.

While we have shown that fairly small query sizes suffice to locate relevant near matches, one should note that the query is searched against the full text of the documents, for which the total number of features exceeds 250,000. In addition, the smaller query sizes may benefit from a larger database since the probability of a close match increases with the size of the database. It's also possible that a large training database compensates for a simple similarity metric.

It would be interesting to see if automated optimization of parameters through a random search works for more than 3 parameters at one time. For larger sets of parameters, hill climbing or optimization through genetic algorithms might be necessary.

We are studying the effect of scaling the database on different categories of codes with different numbers of codes in them, to judge the effect of training database size with respect to the granularity of classification.

The increase of performance with increasing numbers of examples is probably different for different code categories with different numbers of training examples.

8 Conclusions

We have shown that recall and precision using the MBR approach for news story classification increased significantly when we increased the database from 10,000 to 87,000 stories. Along with increased performance, fewer retrieved matches seem to be needed for a larger database.

In addition we have shown that query sizes of about 40 to 80 terms (best terms as ranked by inverse frequency weights) are sufficient for locating the best relevant matches for the purpose of classification when matched against a full-text representation of the documents. Fewer matches seem to be needed for larger queries.

A simple random search yielded classification parameters that result in respectable performance, comparable to manual optimization.

9 Future Work

Although we have shown that the performance of news classification with the MBR approach depends on having a large example base, it may be possible to prune the database by removing redundant examples and also removing examples that aren't often retrieved as matches.

We would like to study the combined effect of the variation of query size with respect to the scaling of the database, as well as the effect of matching against a summary representation of the documents in the database. Another extension of the current approach might be to do more in-depth reasoning on the retrieved matches, as exemplified in the case-based reasoning approach.
10 Acknowledgments

I would like to thank Gordon Linoff, Dave Waltz and Steve Smith from Thinking Machines for help with this project.
11 References

Biebricher, Peter; Fuhr, Norbert et al., "The Automatic Indexing System AIR/PHYS -- From Research to Application." Internal report, TH Darmstadt, Department of Computer Science, Darmstadt, Germany.

Creecy, R. H., Masand, B., Smith, S., Waltz, D., "Trading MIPS and Memory for Knowledge Engineering: Classifying Census Returns on the Connection Machine." Comm. ACM (August 1992).

Dasrathy, B. V., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, California (1991).

Goodman, M., "Prism, An AI Case Based Text Classification System." In R. Smith and E. Rappaport (eds), Innovative Applications of AI, 1991, pp. 25-30.

Hayes, P. J. and Weinstein, S. P., "CONSTRUE/TIS: A System for Content-based Indexing of a Database of News Stories." Innovative Applications of Artificial Intelligence 2. The AAAI Press/The MIT Press, Cambridge, MA, pp. 49-64, 1991.

Hayes, P. J. et al., "A Shell for Content-based Text Categorization." 6th IEEE AI Applications Conference, Santa Monica, CA, March 1990.

Jacobs, Paul and Rau, Lisa, "SCISOR: Extracting Information from On-Line News." Communications of the ACM, 33(11):88-97, November 1990.

Lewis, David D., "An Evaluation of Phrasal and Clustered Representations on a Text Categorization Task." Proceedings, SIGIR '92, Copenhagen, Denmark.

Masand, Brij M., Linoff, Gordon S., and Waltz, David L., "Classifying News Stories on the Connection Machine using Memory Based Reasoning." Proceedings, SIGIR '92, Copenhagen, Denmark.

Rau, Lisa F. and Jacobs, Paul S., "Creating Segmented Databases From Free Text for Text Retrieval." Proceedings, SIGIR 1991 (Chicago, Illinois).

Stanfill, C. and Kahle, B., "Parallel Free-Text Search on the Connection Machine System." Comm. ACM 29, 12 (December 1986), pp. 1229-1239.

Stanfill, C. and Waltz, D. L., "The Memory-Based Reasoning Paradigm." Proc. Case-Based Reasoning Workshop, Clearwater Beach, FL (May 1988), pp. 414-424.

Stanfill, C. and Waltz, D. L., "Toward Memory-Based Reasoning." Comm. ACM 29, 12 (December 1986), pp. 1213-1228.

Young, Sheryl R., Hayes, Philip J., "Automatic Classification and Summarization of Banking Telexes." Proceedings of the Second IEEE Conference on AI Applications, 1985, Miami Beach, FL.

Waltz, D. L., "Memory-Based Reasoning." In M. A. Arbib and J. A. Robinson (eds), Natural and Artificial Parallel Computation, The MIT Press, Cambridge, MA (1990), pp. 251-276.