From: AAAI Technical Report WS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Toward Parallel and Distributed Learning by Meta-Learning
Philip K. Chan and Salvatore J. Stolfo
Department of Computer Science
Columbia University
New York, NY 10027
pkc@cs.columbia.edu and sal@cs.columbia.edu
Abstract
Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of very large network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real world importance. Learning techniques are central to knowledge discovery and the approach proposed in this paper may substantially increase the amount of data a knowledge discovery system can handle effectively. Meta-learning is proposed as a general technique for integrating a number of distinct learning processes. This paper details several meta-learning strategies for integrating independently learned classifiers by the same learner in a parallel and distributed computing environment. Our strategies are particularly suited for massive amounts of data that main-memory-based learning algorithms cannot efficiently handle. The strategies are also independent of the particular learning algorithm used and the underlying parallel and distributed platform. Preliminary experiments using different data sets and algorithms demonstrate encouraging results: parallel learning by meta-learning can achieve comparable prediction accuracy in less space and time than purely serial learning.
Keywords: machine learning, inductive learning, meta-learning, parallel and distributed processing, and large databases.
AAAI-93
1 Introduction
Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of very large network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real world importance. The Grand Challenges of HPCC [20] are perhaps the best examples. Learning techniques are central to knowledge discovery [11] and the approach proposed here may substantially increase the amount of data a knowledge discovery system can handle effectively.
Quinlan [14] approached the problem of efficiently applying learning systems to data that are substantially larger than available main memory with a windowing technique. A learning algorithm is applied to a small subset of training data, called a window, and the learned concept is tested on the remaining training data. This is repeated on a new window of the same size, with some of the incorrectly classified data replacing some of the data in the old window, until all the data are correctly classified. Wirth and Catlett [21] show that the windowing technique does not significantly improve speed on reliable data. On the contrary, for noisy data, windowing considerably slows down the computation. Catlett [3] demonstrates that larger amounts of data improve accuracy, but he projects that ID3 [15] on modern machines will take several months to learn from a million records in the flight data set obtained from NASA. He proposes some improvements to the ID3 algorithm, particularly for handling attributes with real numbers, but the processing time is still prohibitive due to the algorithm's complexity. In addition, typical learning systems like ID3 are not designed to handle data that exceed the size of a monolithic memory on a single processor. Although most modern operating systems support virtual memory, the application of complex algorithms like ID3 to the large amount of disk-resident data we realistically assume can result in an intolerable amount of I/O or even thrashing of external disk storage. Clearly, parallel and distributed processing provides the best hope of dealing with such large amounts of data.
Knowledge Discovery in Databases Workshop 1993
One approach to this problem is to parallelize the learning algorithms and apply the parallelized algorithm to the entire data set (presumably utilizing multiple I/O channels to handle the I/O bottleneck). Zhang et al.'s work [23] on parallelizing the backpropagation algorithm on a Connection Machine is one example. This approach requires optimizing the code for a particular algorithm on a specific architecture. Another approach, which we propose in this paper, is to run the serial code on a number of data subsets in parallel and combine the results in an intelligent fashion, thus reducing and limiting the amount of data inspected by any one learning process. This approach has the advantage of using the same serial code without the time-consuming process of parallelizing it. Since the framework for combining the results of learned concepts is independent of the learning algorithms, it can be used with different learners. In addition, this approach is independent of the computing platform used. However, this approach cannot guarantee the accuracy of the learned concepts to be the same as the serial version since clearly a considerable amount of information is not accessible to each of the learning processes. Furthermore, because of the proliferation of networks of workstations and distributed databases, our approach of not relying on a specific parallel and distributed environment is particularly attractive.
In this paper we introduce the concept of meta-learning and its use in combining results from a set of parallel or distributed learning processes. Section 2 discusses meta-learning and how it facilitates parallel and distributed learning. Section 3 details our strategies for parallel learning by meta-learning. Section 4 discusses our preliminary experiments and Section 5 presents the results. Section 6 discusses our findings and work in progress. Section 7 concludes with a summary of this study.
2 Meta-learning
Meta-learning can be loosely defined as learning from information generated by a learner(s). It can also be viewed as the learning of meta-knowledge on the learned information. In our work we concentrate on learning from the output of inductive learning (or learning-from-examples) systems. Meta-learning, in this case, means learning from the classifiers produced by the learners and the predictions of these classifiers on training data. A classifier (or concept) is the output of an inductive learning system and a prediction (or classification) is the predicted class generated by a classifier when an instance is supplied. That is, we are interested in the output of the learners, not the learners themselves. Moreover, the training data presented to the learners initially are also available to the meta-learner if warranted.
Meta-learning is a general technique to coalesce the results of multiple learners. In this paper we concentrate on using meta-learning to combine parallel learning processes for higher speed and to maintain the prediction accuracy that would be achieved by the sequential version. This involves applying the same algorithm on different subsets of the data in parallel and the use of meta-learning to combine the partial results. We are not aware of any work in the literature on this approach beyond what was first reported in [18] in the domain of speech recognition. Work on using meta-learning for combining different learning systems is reported elsewhere [4, 6] and is further discussed at the end of this paper. In the next section we discuss our approach to using meta-learning for parallel learning with only one learning algorithm.
3 Parallel Learning
The objective here is to speed up the learning process by divide-and-conquer. The data set is partitioned into subsets and the same learning algorithm is applied on each of these subsets. Several issues arise.
First, how many subsets should be generated? This largely depends on the number of processors available and the size of the training set. The number of processors puts an upper bound on the number of subsets. Another consideration is the desired accuracy we wish to achieve. As we will see in our experiments, there may be a tradeoff between the number of subsets and the final accuracy. Moreover, the size of each subset cannot be too small because sufficient data must be available for each learning process to produce an effective classifier. We varied the number of subsets from 2 to 64 in our experiments.
Second, what is the distribution of training examples in the subsets? The subsets can be disjoint or overlap. The class distribution can be random, or follow some deterministic scheme. We experimented with disjoint equal-size subsets with random distributions of classes. Disjoint subsets imply that no data is shared between learning processes and thus no communication overhead is paid during training in a parallel execution environment.
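As an illustration of the partitioning just described, the following is a minimal sketch of splitting a training set into disjoint, nearly equal-size random subsets. The function name `partition_disjoint` and the `seed` parameter are assumptions made for this example; the paper does not specify how its subsets were drawn beyond their being random, disjoint, and of equal size.

```python
import random


def partition_disjoint(examples, s, seed=0):
    """Split examples into s disjoint random subsets of (nearly) equal size."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Dealing round-robin keeps the subset sizes within one of each other.
    return [shuffled[i::s] for i in range(s)]


subsets = partition_disjoint(range(10), s=4)
```

Because no example appears in more than one subset, each learning process can be started on its own subset without exchanging data with the others.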
Third, what is the strategy to coalesce the partial results generated by the learning processes? This is the more important question. The simplest approach is to allow the separate learners to vote and use the prediction with the most votes as the classification. Our approach is meta-learning arbiters in a bottom-up binary-tree fashion. (The choice of a binary tree is discussed later.)
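The simple voting baseline mentioned above can be sketched in a few lines. The tie-breaking behavior here (the label seen first wins) is an assumption of this example, since the paper does not say how ties were resolved.

```python
from collections import Counter


def vote(predictions):
    """Return the prediction with the most votes.

    Ties go to the label that appears first in the list (an assumption
    of this sketch, not something stated in the paper).
    """
    return Counter(predictions).most_common(1)[0][0]


label = vote(["a", "b", "b"])
```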
An arbiter, together with an arbitration rule, decides a final classification outcome based upon a number of candidate predictions. An arbiter is learned from the output of a pair of learning processes and, recursively, an arbiter is learned from the output of two arbiters. A binary tree of arbiters (called an arbiter tree) is generated with the initially learned classifiers at the leaves. (The arbiters themselves are essentially classifiers.) For s subsets and s classifiers, there are $\log_2(s)$ levels in the generated arbiter tree. The manner in which arbiters are computed and used is the subject of the following sections.
3.1 Classifying using an arbiter tree
When an instance is classified by the arbiter tree, predictions flow from the leaves to the root. First, each of the leaf classifiers produces an initial prediction; i.e., a classification of the test instance. From a pair of predictions and the parent arbiter's prediction, a combined prediction is produced by some arbitration rule. These arbitration rules are dependent upon the manner in which the arbiter is learned, as detailed below. This process is applied at each level until a final prediction is produced at the root of the tree. Since at each level the leaf classifiers and arbiters are independent, predictions are generated in parallel. Before we discuss the arbitration process in detail, we first describe how arbiters are learned.
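The bottom-up flow of predictions can be sketched as a recursive traversal. The `Leaf` and `ArbiterNode` types and the function names are assumptions introduced for illustration; the arbitration rule used is the first one given in Section 3.3 (majority vote with ties resolved by the arbiter). The recursion below is sequential, whereas the paper notes that nodes at the same level can be evaluated in parallel.

```python
from dataclasses import dataclass
from typing import Callable, Union

Predictor = Callable[[object], str]  # maps an instance to a class label


@dataclass
class Leaf:
    classifier: Predictor


@dataclass
class ArbiterNode:
    arbiter: Predictor
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, ArbiterNode]


def arbitrate(p1: str, p2: str, a: str) -> str:
    """Majority vote of p1, p2 and the arbiter's prediction a; ties go to a."""
    return a if p1 != p2 else p1


def classify(tree: Tree, instance: object) -> str:
    """Let predictions flow from the leaves up to the root."""
    if isinstance(tree, Leaf):
        return tree.classifier(instance)
    p1 = classify(tree.left, instance)
    p2 = classify(tree.right, instance)
    return arbitrate(p1, p2, tree.arbiter(instance))
```

A tree is built by nesting nodes, for example `ArbiterNode(a12, Leaf(c1), Leaf(c2))`, where `a12`, `c1`, and `c2` stand for any callables that return a class label.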
3.2 Meta-learning an arbiter tree
We experimented with several schemes to meta-learn a binary tree of arbiters. The training examples for an arbiter are selected from the original training examples used in its two subtrees.
In all these schemes the leaf classifiers are first learned from randomly chosen disjoint data subsets and the classifiers are grouped in pairs. (The strategy for pairing classifiers is discussed later.) For each pair of classifiers, the union of the data subsets on which the classifiers are trained is generated. This union set is then classified by the two classifiers. A selection rule compares the predictions from the two classifiers and selects instances from the union set to form the training set for the arbiter of the pair of classifiers. Thus, the rule acts as a data filter to produce a training set with a particular distribution of the examples. The arbiter is learned from this set with the same learning algorithm. In essence, we seek to compute a training set of data for the arbiter that the classifiers together do a poor job of classifying. The process of forming the union of data subsets, classifying it using a pair of arbiter trees, comparing the predictions, forming a training set, and training the arbiter is recursively performed until the root arbiter is formed.
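The recursive procedure described above can be sketched as a level-by-level loop. Everything named here is an assumption for illustration: `majority_learner` is a toy stand-in for a real learner such as ID3, `select_different` keeps only the examples on which the two subtrees disagree, and the sketch omits the cap on arbiter training set size discussed later. Neighbouring subtrees are paired, and the number of subsets is assumed to be a power of two.

```python
from collections import Counter


def majority_learner(examples):
    """Toy stand-in for a real learner: always predicts the most common label."""
    label = Counter(y for _, y in examples).most_common(1)[0][0] if examples else None
    return lambda x: label


def select_different(union, preds1, preds2):
    """Keep the examples on which the two sibling subtrees disagree."""
    return [ex for ex, a, b in zip(union, preds1, preds2) if a != b]


def build_arbiter_tree(subsets, learn=majority_learner, select=select_different):
    """Return a prediction function for the whole arbiter tree.

    subsets: list of disjoint lists of (x, label) pairs.
    """
    # Each entry pairs a subtree's prediction function with the data beneath it.
    level = [(learn(data), data) for data in subsets]
    while len(level) > 1:
        next_level = []
        # Each pair is independent, so a parallel system could process them at once.
        for (f1, d1), (f2, d2) in zip(level[0::2], level[1::2]):
            union = d1 + d2
            chosen = select(union, [f1(x) for x, _ in union], [f2(x) for x, _ in union])
            arbiter = learn(chosen)

            def subtree(x, f1=f1, f2=f2, arbiter=arbiter):
                p1, p2 = f1(x), f2(x)
                return arbiter(x) if p1 != p2 else p1

            next_level.append((subtree, union))
        level = next_level
    return level[0][0]
```

Passing a different `select` function changes which examples reach each arbiter, which is the only place where the selection strategies differ in this sketch.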
For example, suppose there are initially four training data subsets (T1 - T4). First, four classifiers (C1 - C4) are generated in parallel from T1 - T4. The union of subsets T1 and T2, U12, is then classified by C1 and C2, which generates two sets of predictions (P1 and P2). Based on predictions P1 and P2, and the subset U12, a selection rule generates a training set (T12) for the arbiter. The arbiter (A12) is then trained from the set T12 using the same learning algorithm used to learn the initial classifiers. Similarly, arbiter A34 is generated in the same fashion starting from T3 and T4, in parallel with A12, and hence all the first-level arbiters are produced. Then U14 is formed by the union of subsets T1 through T4 and is classified by the arbiter trees rooted with A12 and A34. Similarly, T14 and A14 (root arbiter) are generated and the arbiter tree is completed (see Figure 1).

[Figure 1 diagram omitted. Levels shown, from top to bottom: arbiters, classifiers, training data subsets.]
Figure 1: Sample arbiter tree

3.3 Detailed strategies
We experimented with three strategies for the selection rule, which generates training examples for the arbiters. Based on the predictions from two arbiter subtrees $AT_1$ and $AT_2$ (or two leaf classifiers) rooted at two sibling arbiters, and a set of training examples, $E$, the strategy generates a set of arbiter training examples, $T$. $AT_1(x)$ denotes the prediction of training example $x$ by arbiter subtree $AT_1$. $class(x)$ denotes the given classification of example $x$. The three versions of this selection rule implemented and reported here are as follows:

1. Return instances with predictions that disagree, i.e., $T = T_d = \{x \in E \mid AT_1(x) \neq AT_2(x)\}$. Thus, the arbiter will be used to decide between conflicting classifications. Note, however, that it cannot distinguish classifications that agree but are incorrect. (For further reference, this scheme is denoted as meta-different.)

2. Return instances with predictions that disagree, $T_d$, as in the first case, but also predictions that agree but are incorrect; i.e., $T = T_d \cup T_i$, where $T_i = \{x \in E \mid (AT_1(x) = AT_2(x)) \wedge (class(x) \neq AT_1(x))\}$. Note that we lump together both cases of data that are incorrectly classified or are in disagreement. (Henceforth, denoted as meta-different-incorrect.)

3. Return a set of three training sets: $T_d$ and $T_i$, as defined above, and $T_c$ with examples that have the same correct predictions; i.e., $T = \{T_d, T_i, T_c\}$, where $T_c = \{x \in E \mid (AT_1(x) = AT_2(x)) \wedge (class(x) = AT_1(x))\}$. Here we attempt to separate the data into three cases and distinguish each case by learning a separate "subarbiter." $T_d$, $T_i$, and $T_c$ generate $A_d$, $A_i$, and $A_c$, respectively. The first arbiter is like the one computed in the first case to arbitrate disagreements. The second and third arbiters attempt to distinguish the cases when the two predictions agree but are either incorrect or correct. (Henceforth, denoted as meta-different-incorrect-correct.)

Sample training sets generated by the three schemes are depicted in Figure 2.
The learned arbiters are trained on the particular distinguished distributions of training data and are used in generating predictions. (Note that the arbiters are trained by the same learner used to train the leaf classifiers.) Recall, however, that at each arbiter we have two predictions, $p_1$ and $p_2$, from two lower level arbiter subtrees (or leaf classifiers) and the arbiter's, $A$, own prediction to arbitrate between. $A_i(x)$ is denoted as the prediction of training example $x$ by arbiter $A_i$. Two versions of the arbitration rule have been implemented. The first version corresponds to the first two selection strategies, while the second version corresponds to the third strategy. We denote by instance the test instance to be classified.

1 & 2. Return the majority vote of $p_1$, $p_2$, and $A(instance)$, with preference given to the arbiter's choice; i.e., if $p_1 \neq p_2$ return $A(instance)$ else return $p_1$.
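The three selection rules of this section differ only in which of the sets $T_d$, $T_i$, and $T_c$ they return, so they can be sketched with one function. The function name, the `scheme` strings, and the dictionary returned for the third scheme are assumptions of this example. The usage data mirror the four-example illustration of Figure 2, with the two prediction columns taken to be those of the two sibling subtrees.

```python
def select_examples(E, at1, at2, scheme="different"):
    """Apply one of the three selection rules to the training examples E.

    E: list of (x, label) pairs; at1, at2: prediction functions of the
    two sibling subtrees (or leaf classifiers).
    """
    Td, Ti, Tc = [], [], []
    for x, label in E:
        p1, p2 = at1(x), at2(x)
        if p1 != p2:
            Td.append((x, label))      # predictions disagree
        elif p1 != label:
            Ti.append((x, label))      # predictions agree but are incorrect
        else:
            Tc.append((x, label))      # predictions agree and are correct
    if scheme == "different":
        return Td
    if scheme == "different-incorrect":
        return Td + Ti
    if scheme == "different-incorrect-correct":
        return {"Td": Td, "Ti": Ti, "Tc": Tc}
    raise ValueError(f"unknown scheme: {scheme}")


E = [("attrvec1", "a"), ("attrvec2", "b"), ("attrvec3", "c"), ("attrvec4", "b")]
at1 = {"attrvec1": "a", "attrvec2": "a", "attrvec3": "b", "attrvec4": "b"}.get
at2 = {"attrvec1": "a", "attrvec2": "b", "attrvec3": "b", "attrvec4": "b"}.get
different = select_examples(E, at1, at2, "different")
```

Here `different` holds the single example on which the two subtrees disagree, and switching the `scheme` argument yields the larger training sets of the other two strategies.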

Example  Class     Attribute vector  Predictions
x        class(x)  attrvec(x)        AT1(x)  AT2(x)
x1       a         attrvec1          a       a
x2       b         attrvec2          a       b
x3       c         attrvec3          b       b
x4       b         attrvec4          b       b

Training set from the meta-different arbiter scheme
Instance  Class  Attribute vector
1         b      attrvec2

Training set from the meta-different-incorrect arbiter scheme
Instance  Class  Attribute vector
1         b      attrvec2
2         c      attrvec3

Training set from the meta-different-incorrect-correct arbiter scheme
Set             Instance  Class  Attribute vector
Different (Td)  1         b      attrvec2
Incorrect (Ti)  1         c      attrvec3
Correct (Tc)    1         a      attrvec1
Correct (Tc)    2         b      attrvec4

Figure 2: Sample training sets generated by the three arbiter strategies

3. if $p_1 \neq p_2$ return $A_d(instance)$
   else if $p_1 = A_c(instance)$ return $A_c(instance)$
   else return $A_i(instance)$,
   where $A = \{A_d, A_i, A_c\}$.

To achieve significant speed-up, the training set size for an arbiter, in these three schemes, is restricted to be no larger than the training set size for a classifier. That is, the amount of computation in training an arbiter is bounded by the time to train a leaf classifier. In a parallel computation model each level of arbiters can be learned as efficiently as the leaf classifiers. (In the third scheme the three subarbiters are produced in parallel.) With this restriction, substantial speed-up of the learning phase can be predicted. Assume the number of data subsets of the initial distribution is $s$. Let $t = N/s$ be the size of each data subset, where $N$ is the total number of training examples. Furthermore, assume the learning algorithm takes $O(n^2)$ time in the sequential case. In the parallel case, if we have $s$ processors, there are $\log(s)$ iterations in building the arbiter tree and each takes $O(t^2)$ time. The total time is therefore $O(t^2 \log(s))$, which implies a potential $O(s^2 / \log(s))$ fold speed-up. Similarly, $O(n\sqrt{n})$ algorithms yield $O(s\sqrt{s} / \log(s))$ fold speed-up and $O(n)$ algorithms yield $O(s / \log(s))$.
This rough analysis also assumes that each data subset fits in the main memory. In addition, the estimates do not take into account the burden of communication overhead and speed gained by multiple I/O channels in the parallel case (which will be addressed in future papers). Furthermore, we assume that the processors have relatively the same speed; load balancing and other issues in a heterogeneous environment are beyond the scope of this paper. We also note that once an arbiter tree is computed, its application in a parallel environment can be done efficiently according to the scheme proposed in [18].
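To make the rough speed-up analysis concrete, the ratio of serial to parallel cost can be evaluated for a few values of $s$. This is only a sketch of the asymptotic estimate: constants, communication, and I/O are ignored, and the function name and the use of a base-2 logarithm (matching the binary tree) are assumptions of this example. With serial cost $N^k$ and parallel cost $(N/s)^k \log_2(s)$, $N$ cancels and the ratio is $s^k / \log_2(s)$.

```python
import math


def predicted_speedup(s: int, exponent: float) -> float:
    """Serial cost N**exponent over parallel cost (N/s)**exponent * log2(s)."""
    if s < 2:
        raise ValueError("need at least two subsets")
    return s ** exponent / math.log2(s)


for s in (2, 4, 8, 16, 32, 64):
    quadratic = predicted_speedup(s, 2.0)   # O(n^2) learners
    n_sqrt_n = predicted_speedup(s, 1.5)    # O(n sqrt(n)) learners
    linear = predicted_speedup(s, 1.0)      # O(n) learners
    print(s, round(quadratic, 1), round(n_sqrt_n, 1), round(linear, 1))
```

The loop shows how quickly the three estimates diverge as the number of subsets grows, which is why the learner's complexity matters so much for the predicted benefit.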
4 Experiments
Four inductive learning algorithms were used in our experiments. ID3 [15] and CART [1] were obtained from NASA Ames Research Center in the IND package [2]. They are both decision tree learning algorithms. WPEBLS is the weighted version of PEBLS [8], which is a memory-based learning algorithm. BAYES is a simple Bayesian learner based on conditional probabilities, which is described in [7]. The latter two algorithms were reimplemented in C.
Two data sets, obtained from the UCI Machine Learning Database, were used in our studies. The secondary protein structure data set (SS) [13], courtesy of Qian and Sejnowski, contains sequences of amino acids and the secondary structures at the corresponding positions. There are three structures (three classes) and 20 amino acids (21 attributes because of a spacer) in the data. The amino acid sequences were split into shorter sequences of length 13 according to a windowing technique used in [13]. The sequences were then divided into a training and test set, which are disjoint, according to the distribution described in [13]. The training set has 18105 instances and the test set has 3520. The DNA splice junction data set (SJ) [19], courtesy of Towell, Shavlik and Noordewier, contains sequences of nucleotides and the type of splice junction, if any, (three classes) at the center of each sequence. Each sequence has 60 nucleotides with 8 different values each (four base ones plus four combinations). Some 2552 sequences were randomly picked as the training set and the rest, 638 sequences, became the test set. Although these are not very large data sets, they give us an idea of how our strategies perform.
As mentioned above, we varied the number of subsets from 2 to 64 and the equal-size subsets are disjoint with random distribution of classes. The prediction accuracy on the test set is our primary comparison measure. The three meta-learning strategies for arbiters were run on the two data sets with the four learning algorithms. In addition, we applied a simple voting scheme on the leaf classifiers. The results are plotted in Figure 3. The accuracy for the serial case is plotted as "one subset."
If we relax the restriction on the size of the data set for training an arbiter, we might expect an improvement in accuracy, but a decline in execution speed. To test this hypothesis, a number of experiments were performed varying the maximum training set size for the arbiters. The different sizes are constant multiples of the size of a data subset. The results plotted in Figure 4 were obtained from using the meta-different-incorrect strategy on the SJ data.
5 Results
In Figure 3, for the three arbiter strategies, we observe that the accuracy stayed roughly the same for the SS data and slightly decreased for the SJ data when the number of subsets increased. With 64 subsets, most of the learners exhibited approximately a 10% drop in accuracy, with the exception of BAYES and one case in CART. The sudden drop in accuracy in those cases was likely due to the lack of information in the training data subsets. In the SJ data there are only about 40 training examples in each of the 64 subsets. If we look at the case with 32 subsets (about 80 examples each), all the learners sustained a drop in accuracy of at most 10%. As we observe, the big drops did not happen in the SS data. This shows that the data subset size cannot be too small. The voting scheme performed poorly for the SJ data, but was better than most schemes in the SS data. Basically, its accuracy was roughly the same as the percentage of the most frequent class in the data, and in the SS data case simple voting performed relatively better. The behavior on the training set was similar to the test set and those results are not presented here due to space limitations. Furthermore, the three strategies had comparable performance and since the first strategy produces fewer examples in the arbiter training sets, it is the preferred strategy.
As we expected, by increasing the maximum arbiter training set size, higher accuracy can be obtained (see Figure 4; only the results on the SJ data are presented). When the maximum size was just two times the size of the original subsets, the largest accuracy drop was less than 5%, cutting in half the 10% drop that occurred when the maximum size was the same as the subset size, as mentioned above. Furthermore, when the maximum size was unlimited (i.e., at most the size of the entire training set), the accuracy was roughly the same as in the serial case.

[Figure 3 plots omitted. Panels: ID3, CART, WPEBLS, and BAYES, each on the SS and SJ data. Curves: different, different-incorrect, different-incorrect-correct, and voting. x-axis: number of subsets (1 to 64).]
Figure 3: Results on different strategies

[Figure 4 plots omitted. Panels on the SJ data include ID3, WPEBLS, and BAYES. Curves: Max x1, Max x2, Max x3, and Unlimited. x-axis: number of subsets (1 to 64).]
Figure 4: Results on different maximum arbiter training set sizes

Since our experiments are still preliminary, the system is not implemented on a parallel/distributed platform and we do not have relevant timing results. However, according to the theoretical analysis, significant speed-up can be obtained when the maximum arbiter training set size is fixed to the subset size (see Section 3.3; ID3 and CART are $O(nl)$ [15], where $l$ is the number of leaves, and assuming $l$ is proportional to $\sqrt{n}$, the complexity becomes $O(n\sqrt{n})$; WPEBLS is $O(n^2)$; BAYES is $O(n)$).
When the maximum arbiter training set size is unlimited, we calculate the speed-ups based on the largest training set size at each level (which takes the longest to finish). In this case, since we assume we do not have a parallelized version of the learners available, the computing resource being utilized is reduced in half at each level. That is, at the root level, only one processor will be in use. The following discussion is based on theoretical calculations using recorded arbiter training set sizes obtained from the SJ data. Again, the effects of communication and multiple I/O channels on speed are not taken into account. For WPEBLS ($O(n^2)$), the speed-up steadily increased as the number of subsets increased and leveled off after eight subsets to a factor of six (see Figure 5). (This case is closest to a linear speed-up.) For ID3 and CART ($O(n\sqrt{n})$), the speed-up slowly increased and leveled off after eight subsets to a factor of three. BAYES did not exhibit parallel speed-up due to its linear complexity. The leveling-off was mainly due to the bottleneck at the root level, which had the largest training set.
Next, we investigate the size and location of the largest training set in the entire arbiter tree. This gives us a notion of the memory requirement at any processing site and the location of the main bottleneck. Our empirical results indicate that the largest training set size was always around 30% of the total training set and always happened at the root level, independent of the number of subsets when that number was larger than two. (Note that when the number of subsets was two, the training set size was 50% of the original set at the leaves and became the largest in the tree.) This implies that the bottleneck was in processing around 30% of the entire training data set at the root level. This also implies that this parallel meta-learning strategy required only around 30% of the memory used by the serial case at any single processing site. Strategies for reducing the largest training set size are discussed in the next section. Recall that the accuracy level of this parallel strategy is roughly the same as the serial case. Thus, the parallel meta-learning strategy (with no limit on the arbiter training set size) can perform the same job as the serial case with less time and memory without parallelizing the learning algorithms.

[Figure 5 plots omitted. Panels: ID3, CART, WPEBLS, and BAYES on the SJ data. Curves: accuracy (x10%), largest training set size (x10%), and speed-up. x-axis: number of subsets (1 to 64).]
Figure 5: Results on unlimited maximum arbiter training set size

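The calculation for the unlimited case, where the slowest job at each level sets the pace, can be sketched as follows. This is one reading of the description above rather than the authors' actual computation: it charges each level the training cost of its largest training set, ignores the cost of classifying the union sets, and takes the learner's complexity as a simple power of the training set size. The function name and the sizes in the usage line are hypothetical.

```python
def speedup_from_sizes(n_total, level_sizes, exponent):
    """Estimate speed-up from the largest training set size at each level.

    n_total: size of the full training set (serial case).
    level_sizes: largest training set size at each level, leaves first.
    exponent: k for a learner whose cost grows as n**k.
    """
    serial = n_total ** exponent
    parallel = sum(size ** exponent for size in level_sizes)
    return serial / parallel


# Hypothetical sizes: two subsets of 50 examples and a root arbiter trained on 50.
estimate = speedup_from_sizes(100, [50, 50], 2)
```

Because the root level contributes the largest term, the estimate stops improving once the leaf subsets become small relative to the root's training set, which is consistent with the leveling-off described above.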
6 Discussion
The two data sets chosen in our experiments represent two different kinds of data sets: one is difficult to learn (SS with 50+% accuracy) and the other is easy to learn (SJ with 90+% accuracy). Our arbiter schemes maintained the low accuracy in the first case and mildly degraded the high accuracy in the second case with a restriction on the arbiter training set size. When the restriction on the size of the training set for an arbiter was lifted, the same level of accuracy could be achieved with less time and memory for the second case. Since we assert that this approach is scalable due to the independence of each learning process, this indicates the robustness of our strategies and hence their effectiveness on massive amounts of data.
Largest arbiter training set size. As mentioned in the previous section, we discovered that our scheme required at most 30% of the entire training set at any moment to maintain the same prediction accuracy as in the serial case for the SJ data. However, the percentage is dependent on several factors: the prediction accuracy of the algorithm on the given data set, the distribution of the data in the subsets, and the pairing of learned classifiers and arbiters at each level.
If the prediction accuracy is high, the arbiter training sets will be small because the predictions will usually be correct and few disagreements will occur. In our experiments, the distribution of data in the subsets was random and later we discovered that half of the final arbiter tree was trained on examples with only two of the three classes. That is, half of the tree was not aware of the third class appearing in the entire training data. We postulate that if the class distribution in the subsets is uniform, the leaf classifiers and arbiters in the arbiter tree will be more accurate and hence the training sets for the arbiters will be smaller. Indeed, results from our additional experiments on different class distributions (using the meta-different-incorrect strategy on the SJ data set), shown in Figure 6, indicate that uniform class distributions can achieve higher accuracy than random distributions.

[Figure 6 plots omitted. Panels on the SJ data include CART, WPEBLS, and BAYES. Curves: Random dist. and Uniform dist. x-axis: number of subsets (1 to 64).]
Figure 6: Accuracy with different class distributions

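One way to obtain subsets with a uniform class distribution is to deal the examples of each class round-robin across the subsets. This is a sketch under the assumption that "uniform" means every subset sees each class in roughly the same proportion; the paper does not spell out its procedure, and the function name and `seed` parameter are introduced here for illustration.

```python
import random
from collections import defaultdict


def partition_uniform(examples, s, seed=0):
    """Split (x, label) pairs into s disjoint subsets with evenly spread classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, label in examples:
        by_class[label].append((x, label))
    subsets = [[] for _ in range(s)]
    i = 0  # running index, so overall subset sizes also stay balanced
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for example in members:
            subsets[i % s].append(example)
            i += 1
    return subsets
```

With this scheme no subset can miss a class entirely unless that class has fewer examples than there are subsets, which avoids the situation described above where half of the tree never saw the third class.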
And lastly, the "neighboring" leaf classifiers and arbiters were paired in our experiments. One might use more sophisticated schemes for pairing to reduce the size of the arbiter training sets. One scheme is to pair classifiers and arbiters that agree most often with each other and produce smaller training sets (called min-size). Another scheme is to pair those that disagree the most and produce larger training sets (called max-size). At first glance the first scheme would seem to be more attractive. However, since disagreements are present, if they do not get resolved at the bottom of the tree, they will all surface near the root of the tree, which is also when the choice of pairings is limited or nonexistent (there are only two arbiters one level below the root). Hence, it might be more beneficial to resolve conflicts near the leaves, leaving fewer disagreements near the root.
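The min-size and max-size pairing schemes can be sketched as a greedy matching on pairwise disagreement counts. The greedy procedure, the function name, and the idea of measuring disagreement on a shared set of instances are all assumptions of this example; the paper names the two schemes but does not describe how the pairs were computed.

```python
from itertools import combinations


def pair_by_agreement(predictions, max_size=False):
    """Greedily pair classifiers by how often their predictions disagree.

    predictions: one list of predicted labels per classifier, all made on
    the same instances. With max_size=False the pairs that disagree least
    are formed first (min-size); with max_size=True, those that disagree most.
    """
    def disagreements(i, j):
        return sum(a != b for a, b in zip(predictions[i], predictions[j]))

    candidates = sorted(
        combinations(range(len(predictions)), 2),
        key=lambda pair: disagreements(*pair),
        reverse=max_size,
    )
    used, pairs = set(), []
    for i, j in candidates:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs
```

Computing the pairwise counts requires every classifier's predictions to be gathered in one place, which illustrates the extra communication and synchronization that such pairing schemes introduce.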
These sophisticated pairing schemes might decrease the arbiter training set size, but they might also increase the communication overhead. When pairing is performed at every level, the overhead is incurred at every level. The schemes also create synchronization points at each level, instead of at each node when no special pairings are performed. A compromise strategy might be to perform pairing only at the leaf level. This indirectly affects the subsequent training sets at each level, but synchronization occurs only at each node and not at each level.
Figure 7: Arbiter training set size with different class distributions and pairing strategies (panels include ID3 (SJ), CART (SJ), and BAYES (SJ); x-axis: number of subsets; curves: Random dist., Random dist. 2, Random dist. 2 with max-size pairing, Random dist. 2 with min-size pairing, Uniform dist.)
Some experiments were performed on the two
pairing strategies applied only at the leaf level and
the results are shown in Figure 7. All these experiments
were conducted on the SJ data set and
used the meta-different-incorrect strategy for meta-learning
arbiters. In addition to the initial random
class distribution, a uniform class distribution and a
second random class distribution (Random dist. 2)
were used. The second random distribution does not
have the property that half of the learned arbiter
tree was not aware of one of the classes, as in the
initial random distribution. Different pairing strategies
were used on the uniform distribution and the
second random distribution. As shown in Figure 7,
the uniform distribution achieved smaller training
sets than the other two random distributions. The
largest training set size was around 10% of the original
data when the number of subsets was larger than
eight, except for BAYES with 64 subsets (BAYES
seemed to be not able to gather enough statistics
on small subsets, which can also be observed from
results presented earlier). (Note that when the number
of subsets is eight or fewer, the training sets
for the leaf classifiers are larger than 10% of the
original data set and become the largest in the arbiter
tree.) The two pairing strategies did not affect
the sizes for the uniform distribution and are not
shown in the figure. One possible explanation is
that the uniform distribution produced the smallest
training sets possible and the pairing strategies did
not matter. However, the max-size pairing strategy
did generally reduce the sizes for the second random
distribution. The min-size pairing strategy, on
the other hand, did not affect, or sometimes even
increased, the sizes. In summary, a uniform class
distribution tends to produce the smallest training sets
and the max-size pairing strategy can reduce the set
sizes in random class distributions.
In our discussion so far, we have assumed that
the arbiter training set is unbounded in order to
determine how the pairing strategies may behave in
the case where the training set size is bounded. The
max-size strategy aims at resolving conflicts near the
leaves where the maximum possible arbiter training
set size is small (the union of the two subtrees) leaving
fewer conflicts near the root. If the training set
size is bounded at each node, a random sample (with
the bounded size) of a relatively small set near the
root would be representative of the set chosen when
the size is unbounded.
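Bounding the arbiter training set by random sampling can be sketched as below. The function name `bound_training_set` and the `seed` parameter are assumptions introduced for illustration; the point is only that a small candidate set is either kept whole or sampled lightly, so the sample stays close to the unbounded set.

```python
import random

def bound_training_set(candidates, max_size, seed=0):
    """Return the candidates unchanged when they fit within the bound,
    otherwise a random sample of max_size of them (hypothetical helper)."""
    candidates = list(candidates)
    if len(candidates) <= max_size:
        return candidates
    return random.Random(seed).sample(candidates, max_size)
```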
Order of the arbiter tree. A binary arbiter tree
configuration was chosen for experimental purposes.
There is no apparent reason why the arbiter
tree cannot be n-ary. However, the different strategies
proposed above are designed for n to be equal
to two. When n is greater than two, a majority
classification from the n predictions might be sufficient
as an arbitration rule. The examples that do not
receive a majority classification constitute the training
set for an arbiter. It might be worthwhile to have a
large value of n since the final tree will be shallow,
and thus training may be faster. However, more
disagreements and higher communication overhead
will appear at each level in the tree due to the
arbitration of many more predictions at a single arbitration
site.
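A minimal sketch of this n-ary rule follows. It assumes that "majority" means a strict majority (more than half of the n predictions); a plurality rule would be an equally plausible reading. The function names and the input layout are hypothetical.

```python
from collections import Counter

def majority_or_none(predictions):
    """Return the label predicted by a strict majority, else None."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None

def arbiter_training_set(examples, classifier_predictions):
    """Select the examples whose n predictions have no majority label.

    classifier_predictions[i] holds the n predictions for examples[i].
    """
    return [
        example
        for example, preds in zip(examples, classifier_predictions)
        if majority_or_none(preds) is None
    ]
```

For instance, with three classifiers an example predicted as `a`, `b`, and `c` goes to the arbiter, while one predicted as `a`, `a`, and `b` is settled by the majority.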
Alternate approach. An anonymous reviewer of
another paper proposed an "optimal" formula based
on Bayes' Theorem to combine the results of classifiers,
namely, $P(x) = \sum_c P(c)\,P(x \mid c)$, where
$x$ is a prediction and $c$ is a classifier. $P(c)$ is the
prior, which represents how likely classifier $c$ is the
true model, and $P(x \mid c)$ represents the probability
that classifier $c$ guesses $x$. Therefore, $P(x)$ represents
the combined probability of prediction $x$ to be the
correct answer. Unfortunately, to be optimal, Bayes'
Theorem requires the priors $P(c)$ to be known,
which they usually are not, and it also requires the
summation to be over all possible classifiers, which is
almost impossible to achieve. However, an approximate
$P(x)$ can still be calculated by approximating
the priors using various established techniques
on the training data and using only the classifiers
available. This technique is essentially a "weighted
voting scheme" and can be used as an alternative
to generating arbiters. This and the aforementioned
strategies and issues are the subject matter of ongoing
experimentation.
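The weighted voting approximation can be sketched as follows. Two simplifications are assumed here and are not taken from the text: the prior P(c) is approximated by each classifier's accuracy on the training data, normalized to sum to one, and P(x|c) is taken to be 1 when classifier c predicts x and 0 otherwise. The function names are hypothetical.

```python
from collections import defaultdict

def estimate_priors(accuracies):
    """Approximate P(c) by normalizing each classifier's accuracy.

    Using normalized accuracy as the prior is an assumption of this sketch.
    """
    total = sum(accuracies.values())
    return {name: acc / total for name, acc in accuracies.items()}

def weighted_vote(priors, predictions):
    """Approximate P(x) = sum_c P(c) P(x|c), with P(x|c) = 1 when
    classifier c predicts x and 0 otherwise (a further assumption)."""
    scores = defaultdict(float)
    for name, label in predictions.items():
        scores[label] += priors[name]
    best = max(scores, key=scores.get)
    return best, dict(scores)
```

Under these assumptions, a single accurate classifier can still be outvoted by two weaker ones that agree, as when accuracies of 0.9, 0.6, and 0.5 give weights of 0.45 against a combined 0.55.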
Schapire's hypothesis boosting. Our ideas are related
to using meta-learning to improve accuracy.
The most notable work in this area is due to Schapire
[16], which he refers to as hypothesis boosting.
Based on an initial learned hypothesis for some concept
derived from a random distribution of training
data, Schapire's scheme iteratively generates two
additional distributions of examples. The first newly
derived distribution includes randomly chosen training
examples that are equally likely to be correctly
or incorrectly classified by the first learned classifier.
A new classifier is formed from this distribution. Finally,
a third distribution is formed from the training
examples on which the first two classifiers
disagree. A third classifier (in effect, an arbiter) is
computed for this distribution. The predictions of
the three learned classifiers are combined using a
simple arbitration rule similar to one of the rules
we presented above. Schapire rigorously proves that
the overall accuracy is higher than the one achieved
by simply applying the learning algorithm to the initial
distribution under the PAC learning model. In
fact, he shows that arbitrarily high accuracy can be
achieved by recursively applying the same procedure.
However, his approach is limited to the PAC
model of learning, and furthermore, the manner in
which the distributions are generated does not lend
itself to parallelism. Since the second distribution
depends on the first and the third depends on the
second, the distributions are not available at the same
time and their respective learning processes cannot
be run concurrently. We use three distributions as
well, but the first two are independent and are available
simultaneously. The third distribution, for the
arbiter, however, depends on the first two. Freund [9]
has a similar approach, but with potentially
many more distributions. Again, the distributions
can only be generated iteratively.
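The difference in available parallelism can be illustrated with a small sketch in which the first two classifiers are trained concurrently on independent subsets and only the arbiter's training set waits for both. Everything named here is hypothetical: `train_majority` is a trivial stand-in learner, and selecting the examples on which the two classifiers disagree is a simplified selection rule, not the meta-different-incorrect strategy itself.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_majority(examples):
    """Stand-in learner (hypothetical): always predicts the most common
    label in its training subset."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: label

def train_pair_and_select(subset_1, subset_2, learner=train_majority):
    """Train two classifiers concurrently on independent subsets, then
    collect the examples on which they disagree for the arbiter."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        clf_1, clf_2 = pool.map(learner, [subset_1, subset_2])
    # The arbiter's training set depends on both classifiers, so it can
    # only be built after both training runs have finished.
    arbiter_set = [
        (x, y) for x, y in subset_1 + subset_2 if clf_1(x) != clf_2(x)
    ]
    return clf_1, clf_2, arbiter_set
```

In an iterative scheme of the kind described above, the second training run could not start until the first had finished, so the two calls to the learner could not share a pool in this way.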
Work in progress. In addition to applying meta-learning
to combining results from a set of parallel
or distributed learning processes, meta-learning
can also be used to coalesce the results from multiple
different inductive learning algorithms applied
to the same set of data to improve accuracy [5].
The premise is that different algorithms have different
representations and search heuristics, different
search spaces are being explored and hence potentially
diverse results can be obtained from different
algorithms. Mitchell [12] refers to this phenomenon
as inductive bias. We postulate that by
combining the different results intelligently through
meta-learning, higher accuracy can be obtained. We
call this approach multistrategy hypothesis boosting.
Preliminary results reported in [4] are encouraging.
Zhang et al.'s [24] and Wolpert's [22] work is in this
direction. Silver et al.'s [17] and Holder's [10] work
also employs multiple learners, but no learning is
involved at the meta level. Since the ultimate goal of
this work is to improve both the accuracy and efficiency
of machine learning, we have been working
on combining ideas in parallel learning, described
in this paper, with those in multistrategy hypothesis
boosting. We call this approach multistrategy
parallel learning. Preliminary results reported in
[6] are encouraging. To our knowledge, not much
work in this direction has been attempted by others.
7 Concluding Remarks
Several meta-learning schemes for parallel learning
are presented in this paper. In particular, schemes
for building arbiter trees are detailed. Preliminary
empirical results from bounded arbiter training sets
indicate that the presented strategies are viable in
speeding up learning algorithms with small degradation
in prediction accuracy. When the arbiter training
sets are unbounded, the strategies can preserve
prediction accuracy with less training time and required
memory than the serial version.
The schemes presented here are a step toward the
multistrategy parallel learning approach and the
preliminary results obtained are encouraging. More
experiments are being performed to ensure that the
results we have achieved to date are indeed statistically
significant, and to study how meta-learning
scales with much larger data sets. We intend to further
explore the diversity and possible "symbiotic"
effects of multiple learners to improve our meta-learning
schemes in a parallel environment.
Acknowledgements
This work has been partially supported by grants
from New York State Science and Technology Foundation,
Citicorp, and NSF CISE. We thank David
Wolpert for many useful and insightful discussions
that substantially improved the ideas presented in
this paper.
References
[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[2] W. Buntine and R. Caruana. Introduction to IND and Recursive Partitioning. NASA Ames Research Center, 1991.
[3] J. Catlett. Megainduction: A test flight. In Proc. Eighth Intl. Work. Machine Learning, pages 596-599, 1991.
[4] P. Chan and S. Stolfo. Experiments on multistrategy learning by meta-learning. Submitted to CIKM93, 1993.
[5] P. Chan and S. Stolfo. Meta-learning for multistrategy and parallel learning. In Proc. Second Intl. Work. on Multistrategy Learning, 1993. To appear.
[6] P. Chan and S. Stolfo. Toward multistrategy parallel and distributed learning in sequence analysis. In Proc. First Intl. Conf. Intel. Sys. Mol. Biol., 1993. To appear.
[7] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261-285, 1987.
[8] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10:57-78, 1993.
[9] Y. Freund. Boosting a weak learning algorithm by majority. In Proc. 3rd Work. Comp. Learning Theory, pages 202-216, 1990.
[10] L. Holder. Selection of learning methods using an adaptive model of knowledge utility. In Proc. MSL-91, pages 247-254, 1991.
[11] C. Matheus, P. Chan, and G. Piatetsky-Shapiro. Systems for knowledge discovery in databases. IEEE Trans. Know. Data. Eng., 1993. To appear.
[12] T. M. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Dept. Comp. Sci., Rutgers Univ., 1980.
[13] N. Qian and T. Sejnowski. Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202:865-884, 1988.
[14] J. R. Quinlan. Induction over large data bases. Technical Report STAN-CS-79-739, Comp. Sci. Dept., Stanford Univ., 1979.
[15] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
[16] R. Schapire. The strength of weak learnability. Machine Learning, 5:197-226, 1990.
[17] B. Silver, W. Frawley, G. Iba, J. Vittal, and K. Bradford. ILS: A framework for multi-paradigmatic learning. In Proc. Seventh Intl. Conf. Machine Learning, pages 348-356, 1990.
[18] S. Stolfo, Z. Galil, K. McKeown, and R. Mills. Speech recognition in parallel. In Proc. Speech Nat. Lang. Work., pages 353-373. DARPA, 1989.
[19] G. Towell, J. Shavlik, and M. Noordewier. Refinement of approximate domain theories by knowledge-based neural networks. In Proc. AAAI-90, pages 861-866, 1990.
[20] B. Wah et al. High performance computing and communications for grand challenge applications: Computer vision, speech and natural language processing, and artificial intelligence. IEEE Trans. Know. Data. Eng., 5(1):138-154, 1993.
[21] J. Wirth and J. Catlett. Experiments on the costs and benefits of windowing in ID3. In Proc. Fifth Intl. Conf. Machine Learning, pages 87-99, 1988.
[22] D. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.
[23] X. Zhang, M. Mckenna, J. Mesirov, and D. Waltz. An efficient implementation of the backpropagation algorithm on the connection machine CM-2. Technical Report RL89-1, Thinking Machines Corp., 1989.
[24] X. Zhang, J. Mesirov, and D. Waltz. A hybrid system for protein secondary structure prediction. J. Mol. Biol., 225:1049-1063, 1992.