Joins that Generalize: Text Classification ... William W. Cohen Haym Hirsh

From: KDD-98 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.
Joins that Generalize: Text Classification
William W. Cohen
AT&T Labs-Research
180 Park Avenue
Florham Park NJ 07932
wcohen@research.att.com
Abstract
WHIRL is an extensionof relational databasesthat can perform “soft joins” based on the similarity of textual identifiers; these soft joins extend the traditional operation of joining tables based on the equivalenceof atomic values. This
paper evaluatesWHIRL on a number of inductive classification tasksusing data from the World Wide Web. We show
that although WHIRL is designedfor more generalsimilaritybasedreasoningtasks,it is competitive with mature inductive
classificationsystemson these classificationtasks. In particular, WHIRL generally achieveslower generalization error than C4.5, RIPPER, and severalnearest-neighbormethods. WHIRL is also fast-p to 500 times faster than C4.5
on some benchmark problems. We also show that WHIRL
can be efficiently used to select from a large pool of unlabeled items those that can be classified correctly with high
confidence.
Introduction
Consider the problem of exploratory analysis of data obtained from the Internet. Assuming that one has already narrowed the set of available information sourcesto a manageable size, and developedsome sort of automatic procedure
for extracting data from the sites of interest, what remains
is still a difficult problem. There are severaloperationsthat
one would like to support on the resulting data, including
standardrelation databasemanagementsystem(DBMS) operations,since some information is in the form of data; text
retrieval, since some information is in the form of text; and
text categorization. Since data can be collected from many
sources,one must also support the (non-standard)DBMS
operation of data integration (Monge and Elkan 1997;
Hemandezand Stolfo 1995).
As an example of data integration, suppose you have
two relations, place(univ,state), which contains university names and the states in which they are located, and
job(uuiv,dept), which contains university departmentsthat
are hiring. Suppose further that you are interested in job
openingslocated in a particular state. Normally a user could
join the two relations to answer this query. However, what
if the relations were extracted from two different and unrelated Web sites? The same university may be referred to
Copyright @ 1998, American Association for Artificial Intelligence(www.aaai.org).All rights reserved.
Using WHIRL
Haym Hirsh
Departmentof Computer Science
RutgersUniversity
New Brunswick, NJ 08903
hirsh@cs.rutgers.edu
as “Rutgers University” in one relation, and “Rutgers, the
State University of New Jersey”in another. To solve this
problem traditional databaseswould require some form of
key normalization or data cleaning on the relations before
they could be joined.
An alternative approach is taken by the databasemanagement system (DBMS) WHIRL (Cohen 1998a; 1998b).
WHIRL is a conventional DBMS that has been extendedto
use statistical similarity metrics developedin the information retrieval community to compare and reason about the
similarity of pieces of text. These metrics can be used to
implement “similarity joins”, an extension of regular joins
in which tuples are constructedbased on the similarity of
values, rather than on equality. Constructedtuples are then
presentedto the user in an ordered list, with tuples containing the most similar pairs of fields coming first.
As an example, the relations described above could be
integratedusing the WHIRL query
SELECT place.univ as ul, place.state,
job.uuiv as u2, job.dept
FROM place, job WHERE place.uuiv~job.univ
where place.univNjob.univ indicates that these items must
be similar. The result of this query is a table with columns
~1, state, u2, and dept, and rows sorted by the similarity of
~1 and ~2. Thus rows in which ~l=d? (i.e., the rows that
would be in a standardequijoin of the two relations) will appear first in this table, followed by rows for which ul and ~2
are similar, but not identical (such as the pair “Rutgers University” and “Rutgers, the StateUniversity of New Jersey”).
Although WHIRL was designedto query heterogeneous
sources of information, WHIRL can also be used in a
straight-forward manner for traditional text retrieval and relational data operations. Less obviously, WHIRL can be
used for inductive classification of text. To see this, note
that a simple form of rote learning can be implementedwith
a conventionalDBMS. Assume we are given training data in
the form of a relation train(inst,label) associatinginstances
inst with labels label from some fixed set. (For example,
instancesmight be news headlines,and labels might be subject categoriesfrom the set {business, sports, other}.) To
classify an unlabeled object X one can store X as the only
element of a relation test(inst) and use the SQL query:
SELECT test .inst ,train.label
FROM train,test WHERE test.inst=train.inst
KDD-98
169
In a conventionalDBMS this retrievesthe correct label for
any X that hasbeenexplicitly storedin the training settrain.
However, if one replacesthe equality condition test.inst=
train.inst with the similarity condition, test.instNtrain.inst,
the resulting WHIRL query will find training instancesthat
are similar to X, and associatetheir labels with X. More
precisely, the answer to the correspondingsimilarity join
will be a table
Every tuple in this table associatesa label Li with X.
Each Li is in the table becausethere were some instances
Y.51,. . . ,Yi, in the training data that had label Li and were
similar to X. As we will describebelow, every tuple in the
table also has a “score”, which dependson the number of
theseYi j ’s and their similarities to X. This methodcan thus
be viewed as a sort of nearestneighborclassificationalgorithm.
Elsewherewe haveevaluatedWHIRL’on dataintegration
tasks (Cohen 1998a). In this paper we show that WHIRL
is also well-suited to inductive text classification,a task we
would also like a generaldata manipulationsystemsuch as
WHIRL to support.
An Overview of WHIRL
We assumethe readeris familiar with relational databases
and the SQL query language. WHIRL extendsSQL with
a new TEXT data type and a similarity operatorXNY that
applies to any two items X and Y of type TEXT.’ Associated with this similarity operatoris a similarity function
SIM(X, Y), to be describedshortly, whoserangeis [0, l],
with larger valuesrepresentinggreatersimilarity.
This paperfocuseson queriesof the form
SELECT test .inst ,train.label
FROM test,train WHERE test.inst~train.inst
and we therefore tailor our explanation to this class of
queries. Given a query of this form and a user-specified
parameterK, WHIRL will first find the set of K tuples
(Xi, yj, Lj) from the Cartesianproduct of test and train
such that SrM(Xi, Yj) is largest (where Lj is Yj’s label in
train). Thus, for example,if test containsa single element
X, then the resultingset of tuples correspondsdirectly to the
K nearestneighborsto X, plus their associatedlabels. Bach
of thesetuples has a numericscore; in this casethe scoreof
(Xi,~,~~)iSSimplySIM(Xi,Y$).
The next step for WHIRL is to SELECT test.inst,
train.label, i.e., to selectthe first and third columns of the
table of (Xi, Yj, Lj) tuples. In this projection step, tuples
with equal Xi and Lj but different Yj will be combined.
‘Other publisheddescriptionsused a Prolog-like notation for
the samelanguage(Cohen 1998a; 1998b). We refer the readerto
theseearlier papersfor descriptionsof the full scopeof WHIRL
and efficient methodsfor evaluatinggeneralWHIRL queries.
170
Cohen
The scorefor eachnew tuple (X, L) is
l-fi(l-Pj)
(1)
j=l
wherepi ,..., p, arethescoresofthentuples(X,Yi,L) ,...,
(X, Y,, L) that contain both X and L.
The similarity of eachX and Y is assessedvia the widelyused “vector space”model of text (Salton 1989). Assume
a vocabularyT of atomic terms that appearin each document.2 A TEXT object is representedas a vector of real
nufnbersv E 72ITI, where each componentcorrespondsto
a term. We denotewith 21~the componentof v that correspondsto t E T.
The generalidea behind the vector-spacerepresentation
is that the magnitudeof a componentut should be related
to the “importance”of t in the documentrepresentedby v,
with higher weightsassignedto termsthat arefrequentin the
documentand infrequent in the collection as a whole; the latter often correspondto proper namesand other particularly
informative terms. We use the TF-IDF weighting scheme
(Salton 1989) in which ut = log( T&J + 1) * log( ID&).
Here Z’FVIt is the numberof times that t occursin the document representedby v, and IDFt = 5, whereN is the total
numberof records,and 7~~is the total numberof recordsthat
contain the term t in the given field. The similarity of two
documentvectorsv and w is then given by the formula
tET
vt’wt
m f(v,w)= cllvll
* llwll
This measureis large if the two vectors sharemany “important”terms, and small otherwise. Notice that due to the
normalizing term S1&?(v, w) is always betweenzero and
one.
By combining graph searchmethodswith indexing and
pruning methodsfrom informationretrieval,WHIRL can be
implementedfairly efficiently. Details of the implementation and a full descriptionof the semanticsof WHIRL can
be found elsewhere(Cohen 1998a).
Experimental Results
To evaluateWHIRL as a tool for Web-basedtext classification tasks, we creatednine benchmarkproblems,listed in
Table 1, from pagesfound on the World Wide Web. Although they vary in domain and size, all were generatedin
a straight-forwardfashion from the relevantpages,and all
associatetext - typically short, name-like strings- with
labels from a relatively small set3 The table gives a summary descriptionof eachdomain,the numberof classesand
numberof terms found in eachproblem, and the numberof
training and testing examplesusedin our text classification
2Currently,T containsall “word stems”(morphologicallyinspiredprefixes)producedby the Porter(1980) stemmingalgorithm
on textsappearingin the database.
3Althoughmuch work in this area hasfocusedon longer documents,we focus on short text itemsthat moretypically appearin
suchdataintegrationtasks.
experiments. On smaller problems we used lo-fold crossvalidation(“lOcv”), and on larger problemsa singleholdout
set of the specifiedsize was used. In the caseof the cdroms
andnetvet domains,duplicateinstanceswith different labels
were removed.
Relative accuracy
100
80
60
Using WHIRL for Inductive Classification
To label each new test example,the examplewas placedin
a table by itself, and a similarity join was performedwith
it and the training-settable. This results in a table where
the test example is the tist item of each tuple, and every
label that occurredfor any of the top K most similar text
items in the training-settable appearsas the seconditem of
some tuple. We used Ii’ = 30, basedon exploratoryanalysis of a small number of datasets,plus the experienceof
Yang and Chute (1994) with a relatedclassifier(seebelow).
The highest-scoringlabel was usedas the test item’s classification. If the table was empty, the most frequentclasswas
used.
We compared WHIRL to several baseline learning algorithms. C4.5 (Quinlan 1994) is a feature-vectorbased
decision-treelearner.We usedas binary-valuedfeaturesthe
sameset of terms used in WHIRL’s vectors; however,for
tractability reasons,only termsthat appearedat least3 times
in the training set were used as features. RIPPER (Cohen
1995) is a rule learning systemthat has beenused for text
categorization(CohenandSinger 1996).We usedRIPPER s
set-valuedfeatures(Cohen 1996)to representthe instances;
this is functionally equivalentto using all terms as binary
features,but more efficient. No feature selectionwas performed with RIPPER. l-NN finds the nearestitem in the
training-settable (using the vector-spacesimilarity measure
SIM) and givesthe test item that item’s label. We aIsoused
Yang and Chute’s (1994) distance-weightedk-NN method,
hereaftercalled K-NNs. This is closely relatedto WHIRL,
but combinesthe scoresof the the K nearestneighborsdifferently, selectingthe label L that maximizesc pi. Finally,
I<-NNM is like Yang and Chute’smethod,but picks the label L that is most frequentamongthe I{ nearestneighbors.
The accuracyof WHIRL and each baselinemethod, as
well as the accuracyof the “default rule” (alwayslabeling
an item with the most frequent class in the training data)
are shown in Table 2; the highest accuracyfor each problem is shown in boldface. The same information is also
shown graphically in Figure 1, where eachpoint represents
one of the nine test problems.4 With respectto accuracy
WHIRL uniformly outperformsK -NNM , outperformsboth
l-NN and C4.5 eight of nine times, and outperformsRIPPER sevenof nine times. WHIRL was never significantly
worsethan any other learneron any individual problem.5
4The y-value of a point indicatesthe accuracyof WHIRJ.,on
a problem, and the x-value indicates the accuracyof a competing method; points that lie abovethe line y = x are caseswhere
WHJRL was better. The upper graph comparesWHIRL to C4.5
andRIPPER,and the lower graphcomparesWHIRL to the nearestneighbormethods.
5WeusedMcNemar’stest to test for significantdifferenceson
the problemsfor which a single holdout was used, and a paired
t-test on the folds of the cross-validationwhen 1Ocvwasused.
40
20
Accuraz of seconYk3am*r
Relative accuracy
80
100
(lNN,wE
7
(KNN-sunyhiri 1 +
(KNN-mal,whirl) q
0
20
AccuraEj of seconsdqearnf3r
I&
80
Figure 1: Scatterplotsof learneraccuracies
The algorithm that comesclosestto WHIRL is K-NNs.
WHIRL outperformsK-NNs eight of nine times, but all of
theseeight differencesare small. However,five are statistically significant,indicating that WHIRL representsa small
but statisticallyreal improvementover K-BINS.
Using WHIRL for SelectiveClassification
Although the error rates are still quite high on some tasks,
this is not too surprising given the nature of the data. For
example,considerthe problem of labeling companynames
with the industry in which they operate (as in Mine and
hcoarse). It is fairly easy to guessthe businessof “Watson Pharmaceuticals,Inc.“, but correctly labeling “Adolph
Coors Company”or “AT&T” requiresdomainknowledge.
Of course,in many applications,it is not necessaryto label every test instance; often it is enoughto find a subset
of the test data that can be reliably labeled. For example,
in constructinga mailing list, it is not necessarydetermine
if every householdin the US is likely to buy the advertised product; one needsonly to find a relatively small set
of householdsthat are likely customers.The basicproblem
can be statedas follows: given M test cases,give labels to
N of these,for N < M, whereN is set by the user.We call
this problem selective classifcation.
One way to perform selectiveclassificationis to use a
learner’shypothesisto make a prediction on eachtest item,
andthenpick the N itemspredictedwith highestconfidence.
Associatinga confidencewith a learner’sprediction is usuKDD-98
171
problem #train #/test #classesI #terms I text-valuedfield
I label
memos
334 1Ocv
documenttitle
category
cdroms
category
1133 CDRom gamename
798 1Ocv
birdcom
914 1ocv
phylogenicorder
674 commonnameof bird
birdsci
914 1Ocv
1738 common + scientificname of bird phylogenicorder
hcoarse 1875 600
industry (coarsegrain)
2098 companyname
hfine
1875 600
2098 companyname
industry (fine grain)
books
3501 1800
subjectheading
7019 book title
species 3119 1600
7231 animal name
phylum
netvet
3596 2000
category
5460 URL title
-ion-
Table 1: Description If benchmarkproblems
default RIPPER C4.5
memos
57.5
19.8
50.9
40.5
cdrom
26.4
38.3 39.2
70.4
88.8 79.6
42.8
birdcom
42.8
91.0
83.3
82.8
birdsci
hcoarse
11.3
28.0 30.2
20.2
12.7
hfine
4.4
16.5 17.2
53.0
books
5.7
42.3
52.2
87.5
90.6
89.4
species
51.8
50.9
netvet
67.1 68.8
22.4
57.52 57.41 53.76 1
1 average 25.24
I
-7mRy WHIRL
66.5
64.4
43.0
78.9
81.5
29.5
17.8
54.7
90.7
64.8
56.64 1
I
47.5
47.2
83.9
82.3
90.3
89.2
32.9
32.7
20.8
21.0
61.4
60.1
943
93.6
68.1
67.8
62.00 62.90
Table2: Experimentalresultsfor accuracy
ally straightforward.6
However,WHIRL also suggestsanotherapproachto selective classification. One can againuse the query
SELECT test .inst ,train.label
FROM traintest WHERE test.instNtrain.inst
but let test containall unlabeledtest cases,ratherthan a sin-
gle test case.The result of this query will be a table of tuples
(Xi, Lj ) where Xi is a test caseand Lj is a label associated
with one or more training instancesthat are similar to Xi.
In implementing this approachwe used the score of a tuple (as computedby WHIRL) as a measureof confidence.
To avoid returning the same example multiple times with
a different labels, we also discardedpairs (X, L) such that
(X, L’) (with L’ # L) also appearsin the tablewith a higher
score; this guaranteesthat each example is labeled at most
once.
Rather than uniformly selecting K nearest neighbors
for each test example, this approach(henceforththe “join
method”) finds the K pairs of training and test examples
that are most similar, and then projects the final table from
this intermediateresult. Hence,the join methodis not identical to repeatinga K-NN style classification for every test
query (for any value of Ii’).
An advantageof the join method is that since only one
query is executedfor all the test instances,the join method
is simpler and fasterthan using repeatedK-NN queries.Table 3 comparesthe efficiency of the baselinelearnersto the
efficiency of the join method. In each case we report the
6For C4.5 and RIPPER,a confidencemeasurecan be obtained
using the proportion of each class of the training examplesthat
reachedthe terminal nodeor rule used for classifyingthe test case.
For nearest-neighborapproachesa confidencemeasurecan be obtained from the score associatedwith the method’s combination
rule.
172
Cohen
combinedtime to learn and classify the test cases.The lefthand side of the table gives the absoluteruntime for each
method, and the right-hand side gives the runtime for the
baseline learners as a multiple of the runtime of the join
method. The join method is much faster than symbolic
methods,with averagespeedupsof nearly 200 for C4.5 with 500 in the best case- and averagespeedupsof nearly
30 for RIPPER - with 60 in the best case. It is also significantly fasterthan the straight-forwardWHIRL approach,
or evenusing l-NN on this task.7 This speedupis obtained
even though the size K of the intermediatetable must be
fairly large, since it is sharedby all test instances.We used
I< = 5000 in the experimentsbelow, and from preliminary
experimentsK = 1000 appearsto give measurablyworse
performance.
A potential disadvantageof the join method is that test
caseswhich are not nearany training caseswill not be given
any classification;for the five larger problems,for example,
only l/4 to l/2 of the test examplesreceive labels. In selective classification, however,this is not a problem. If one
considersonly the N top-scoring classifications for small
N, there is no apparentloss in accuracyin using the join
method rather than WHIRL; indeed, there appearsto be a
slight gain. The upper graph in Figure 2 plots the error on
the N selectedpredictions (also known as the precision at
rank N) against N for the join method, and comparesthis
to the baselinelearnersthat perform best, accordingto this
metric. All points are averagesacrossall nine datasets.On
average,thejoin methodtendsto outperformWHIRL for N
less than about 200; after this point WHIRL’s performance
beginsto dominate.Generally,thejoin methodis very competitive with C4.5 and RIPPER.* However, theseaverages
‘Runtimes for the K-NIV variantsare comparableto WHIRL’s.
*RIPPERperformsbetter than the join method,on average,for
Table3: Run time of selectedlearningmethods
Average over all datasets
0.95
of examples.Our experimentsshow that, despiteits greater
generality,WHIRL is extremelycompetitivewith state-ofthe-artmethodsfor inductiveclassification.WHIRL is fast,
sincespecialindexingmethodscanbe usedto find the neighbors of an instance,and no time is spentbuilding a model.
WHIRL was also shownto be generallysuperiorin generalization performanceto existingmethods.
0.9
0.85
0.8
0.75
0.7
References
0.85
0.6
0.55 ’ ’
50
’
100
*
’
150 200
I
I
50
60
’
’
patkNOO
n
350
’
400
*
’
450 500
Precision at N=l 00
I
I
n
.-$
.O_
1
Pre%ion
80 IearneY’
of other
100
Figure2: Experimentalresultsfor selectiveclassification
conceala lot of individual variation; while the join method
is beston average,therearemanyindividual casesfor which
its performanceis not best.This is shownin the lower scatter
plot in Figure2.g
Summary
WHIRL extendsrelational databasesto reasonabout the
similarity of text-valuedfields using information-retrieval
technology.This paperevaluatesWHIRL on inductiveclassification tasks, where it can be applied in a very natural
way - in fact, a single unlabeledexamplecan be classified using one simple WHIRL query, and a closely related
WHIRL query can be used to “selectively classify”a pool
W . W . Cohen. 1995. Fast effectiverule induction. In Machine
Learning: Proceedingsof the TwelfthInternational Conference.
MorganKaufmann.
W . W . Cohen. 1996. Learningwith set-valuedfeatures.In Proceedingsof the ThirteenthNational Conferenceon Artiflciul Intelligence.AAAI Press.
W . W . Cohen.1998.Integrationof heterogeneous
databases
without commondomainsusing queriesbasedon textual similarity.
In Proceedingsof the I998 ACM SIGMODInternationalConferenceon Managementof Data. ACM Press.
W . W . Cohen.1998. A Web-basedinformationsystemthat reasonswith structuredcollectionsof text. In Proceedingsof the
SecondInternational ACM Conferenceon AutonomousAgents.
ACM Press.
W . W . Cohen and Y. Singer. 1996. Context-sensitivelearning
methodsfor text categorization.In Proceedingsof the 19th Annual InternationalACM Conferenceon Researchand Developmentin InformationRetrieval,pages307-315.ACM Press.
M. Hernandezand S. Stolfo. The merge/purgeproblemfor large
databases.
In Proceedingsof the 1995ACM SIGMOD,May 1995.
A. Monge and C. Elkan. An efficient domain-independent
algorithm for detectingapproximatelyduplicatedatabaserecords.In
Theproceedingsof the SIGMODI997 workshopon data mining
and knowledgediscovery,May 1997.
M. F. Porter.1980. An algorithmfor suffix stripping. Program,
14(3):13&137.
J. R. Quinlan.1994.C4.5: Programsfor machinelearning.Morgan Kaufmann.
G. Salton,editor.1989.AutomaticTextProcessing.AddisonWesley.
Y. Yang and C.G. Chute. 1994. An example-based
mapping
methodfor text classificationand retrieval. ACM Transactions
on InformationSystems,12(3).
valuesof N < 60, but the differenceis small;for instance,the join
methodaveragesonly 1.3 moreerrorsthan RIPPERfor N = 50.
‘Each point here comparesthe join method with someother
baselinemethodat N = 100,a point at which all four methodsare
closein averageperformance.
KDD-98
173