Getting Order Independence in Incremental Learning

Antoine Cornuéjols
Equipe Inférence et Apprentissage
Laboratoire de Recherche en Informatique (LRI), UA 410 du CNRS
Université de Paris-Sud, Orsay
Bâtiment 490, 91405 Orsay (France)
email (UUCP): antoine@lri.lri.fr

From: AAAI Technical Report SS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Abstract*

It is empirically known that most incremental learning systems are order dependent, i.e. provide results that depend on the particular order of the data presentation. This paper aims at uncovering the reasons behind this, and at specifying the conditions that would guarantee order independence. It is shown that both an optimality and a storage criterion are sufficient for ensuring order independence. Given that these correspond to very strong requirements however, it is interesting to study necessary, hopefully less stringent, conditions. The results obtained prove that these necessary conditions are equally difficult to meet in practice. Besides its main outcome, this paper provides an interesting method to transform a history dependent bias into a history independent one.

* This paper appeared in the Proc. of the European Conference on Machine Learning (ECML-93), Vienna, April 5-8, 1993.

1 Introduction

Ordering effects in Incremental Learning have been widely mentioned in the literature without, however, being the subject of much specific study except for some rare pioneering works [1,4,9, and, incidentally, in 2]1. In short, ordering effects are observed when, given a collection of data (e.g. examples in inductive concept learning), different ordered sequences of these data lead to different learning results. In this respect, ordering of data therefore seems to be equivalent to a preference bias that makes a choice among all the models or hypotheses that the learning system could reach given the collection of data (that is, the models that would be obtained had the collection of data been presented in every possible order). Hence, ordering undoubtedly amounts to some additional knowledge supplied to the system. This is why teachers have some value: by selecting suitable pedagogical presentations of the material to be learned, they provide further knowledge that hopefully helps the learning process.

Learning without some bias that allows the reduction of the search space for the target concept or model is impossible, except in the crudest form of rote learning. When looking more closely, it is usual to distinguish between:
- representation bias: the search space is constrained because not all partitions of the example space can be expressed in the hypothesis space considered by the system (this is the basis for inductive generalization and is the main topic of current Machine Learning theory [12]), and
- preference bias: dictates which subspace should be preferred in the search space (e.g. prefer short, simple hypotheses over more complex ones) (this type of bias has been much less studied because it touches on procedural aspects instead of declarative ones only).

Because ordering of inputs allows one to favor some models over others, it seems to amount to a preference bias that chooses between competing hypotheses. In spite of this resemblance however, there is a deep difference with the biases generally discussed in Machine Learning. Indeed, with ordering effects, one observes the preference but cannot pinpoint directly where in the learning system it lies and how it works.

1 See also the AAAI Technical Report corresponding to the recent AAAI Spring Symposium (March 23-25, 1993, Stanford University) devoted to "Training Issues in Incremental Learning".
This is in contrast with what is considered classically as a bias, where one can identify operational constraints, e.g. isolate representation constraints or procedures for choice between hypotheses. Thus we use the term global preference bias to denote preference among models due to ordering effects after a sequence of inputs has been observed, and the term local preference bias to denote the local choice strategy followed by the system when, at each learning step, it must choose to follow some paths and discard others.

Two questions then immediately come up:

1- What is the relationship between a global preference bias and a local one?

2- What is the relationship between a global preference bias that is observed or aimed at and a corresponding teaching strategy that specifies the order of inputs? In other words, how to design a teaching strategy so as to get a certain global preference bias?

The second question is related to the recently introduced concept of teachability [5,8,15,16,17] and the not so recent concern for good training sequences [19,20]. However, it differs in a fundamental point in that, in the former, the problem is essentially the determination of good examples to speed up learning, whereas in our setting we assume that the collection of instances is given a priori and our only degree of freedom lies in the choice of a good ordering. Additionally, researchers in the teachability field have not been interested in the idea of guiding the learner toward some preferred model and away from others; they seek to characterize the learnability of concept classes irrespective of particular preferences within the classes. Keeping in mind these differences, the emphasis on providing additional knowledge to the learner through an educational strategy is the same.

In order to answer these questions, and particularly the first one, it is necessary to determine the causes of the ordering effects.

2 Causes of ordering effects

It is instructive to look at incremental learners that are NOT order dependent, like the candidate elimination (CE) algorithm in Version Space [10,11], ID5R [18], or systems that are not usually considered as learning systems but could be, such as TMS [3] or some versions of the Bayesian inference nets of Pearl [13]. They all have in common that they do not forget any information present in the input data. Thus, even when they make a choice between alternative hypotheses, like ID5 or TMS and unlike the CE algorithm, they keep enough information to be able to compare all potential competing models so as to select the best one at any moment, and change their mind if needed. They are therefore equivalent to non-incremental learning systems that get all the data at once and focus on the best hypothesis given the information supplied.

To sum up, order independent incremental learners (i) are able to focus on an optimal hypothesis when they have to choose among the current potential ones; and (ii) they keep enough information so as to not forget any potential hypothesis. If one or both of these properties is lacking, then incremental learning is prone to be order dependent. A closer look at each of these properties in turn will help to see why.

(i) Optimality vs. non optimality.

Since the influential thesis and papers of Mitchell [10,11], it has become commonplace to consider learning as a search process in a concept or solution space. Whatever the form (generalization, explanation, ...) and the constraints on the hypothesis space (e.g. representation language bias), the learner is searching for the best solution in this space given the data at hand. In order to compare the solutions in the search space, it is possible to imagine that the learner evaluates and grades each one of them and then chooses the top one. Of course, the grade and therefore the rank of these solutions can be modified if the data are changed.

Because incremental learners usually function by adapting a current solution (or a small solution set) to cope with new information, they proceed typically in a hill-climbing fashion (see figure 1), open to the drawback of missing the global optimum solution in favor of local ones.
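The hill-climbing drawback just described can be made concrete with a toy sketch (illustrative code of ours, not from the paper): a greedy incremental learner that keeps only its current solution set ends in different states for different orders of the same data.

```python
def online_cluster(points, radius=1.0):
    """Greedy incremental clustering: each point joins the first
    cluster whose founding point lies within `radius`, otherwise it
    founds a new cluster. Only the current clusters are kept, so the
    outcome depends on the order of presentation."""
    clusters = []                       # list of (founder, members)
    for p in points:
        for founder, members in clusters:
            if abs(p - founder) <= radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters

# The same three points, two different orders:
a = online_cluster([0.0, 0.9, 1.8])    # 0.0 founds; 1.8 is too far: 2 clusters
b = online_cluster([0.9, 0.0, 1.8])    # 0.9 founds and absorbs both: 1 cluster
```

Both runs see exactly the same collection of data, yet the final states differ because each greedy step is taken relative to a state that itself depends on the history.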
Fig. 1. At time n-1, the system is in a state corresponding to a given hypothesis. At time n, with a new arriving piece of data, the value of each hypothesis is re-evaluated and the system follows the direction of greatest gradient to reach a new state.

As underlined above, one must realize that, in contrast to the common optimization problems where the optimums and the whole topology of the search space are set once and for all before the search starts, in incremental learning the topology is changed with each new input, and the solutions in the hypothesis space are therefore open to re-evaluation, new optimums replacing old ones. Because of this, incremental learning, wandering from one local and temporary optimum to the next, can be order dependent unless it reaches the current global optimum at each step, that is, with each arriving piece of data2. In that case of course, at the end of the training period and given that all the training data have been observed, the system would reach the global optimum for this set of data regardless of its ordering during learning.

(ii) To forget or not to forget.

In fact, it is easy to see that there is a second caveat in addition to the requirement of finding the global optimum at each time. Finding the optimal solution at each step is operative only to the extent that the set of all possible solutions is brought to consideration; if only part of it is available, then the whole optimization process is bound to be sub-optimal. Now, beside adapting the current solution to new constraints expressed under the form of new data, incremental learners, because they entertain only some preferred hypothesis or model, can also discard in the process part of the information present in the past inputs. If this forgetting is dependent upon the current state of the system, then it follows that it may also be dependent upon the ordering of the inputs. Therefore the resulting learning may be order dependent.

Forgetting of information lies therefore at the heart of order dependence in incremental learning. But forgetting can take two faces. In the first one, information present in the input data is lost, meaning that the current hypothesis space considered by the learner is underconstrained. In the second one, by contrast, what is lost are potential alternatives to the current preferred hypotheses, which amounts to overconstraining the space of possibilities.

It is important to note that these two reasons, even if often tied (it is because one keeps only a current solution for adaptation that one is tempted to forget information about other possibilities), are nonetheless independent in principle and can be observed and studied separately. For reasons that will be exposed shortly, we will concentrate on the forgetting aspect of incremental learning and will accordingly assume, by way of an adequate framework, that the best alternatives among the available ones can always be chosen by the learner.

2 Finding the global optimum at each step requires either to keep in memory all possible hypotheses, re-evaluate them and take the best one, or, short of this memory-intensive method, to be able to re-construct any possible hypothesis with enough accuracy and keep track of the best one so far. This latter method is close in spirit to Simulated Annealing or to Genetic Algorithms. In mathematics, this corresponds to ergodic systems.
This last form of forgetting is equivalent to a local preference bias which chooses, among competing hypotheses, which ones to pursue. This raises a more specific question than the aforementioned ones, but one which contributes to the same overall goal:

3- In which case can an incremental learner be order independent? Or, in other words, which information can be safely forgotten without altering the result of learning, whatever the ordering of inputs?

It is this last question that this paper focuses on. It must be kept in mind that it is equivalent to the question: what local preference bias leads to a null global preference bias (i.e. to order independence)?

In the following of the paper, we will restrict ourselves to a simple concept learning model in which the learner attempts to infer an unknown target concept f, chosen from a known concept class F of {0,1}-valued functions over an instance space X. This framework allows us, in the next section, to define a measure of the information gained by the learning system and of the effect of a local bias on this information. This measure naturally suggests an equivalence relation between local preference bias and additional instances, which is detailed in section 4. Then, in section 5, it becomes a relatively simple matter to answer question 3 above. The conclusion compares the framework adopted here with the emerging one of teachability and discusses the results obtained.

3 Information measure and local preference bias

In this section, we are interested in formalizing and quantifying the effect of a local preference bias on what is learned by the system. For this, we first define a characterization of the information maintained by a learner.

Let F be a concept class over the instance space X, and f ∈ F be a target concept. The teacher has a collection of examples EX = {xi, f(xi)} at his disposal, and makes a sequence x = x1, x2, ..., xm, xm+1, ... with xm ∈ EX for all m. The learner receives information about f incrementally via the label sequence f(x1), ..., f(xm), f(xm+1), ... For any m ≥ 1, we define (with respect to x and f) the m-th version space:

    Fm(x, f) = { f' ∈ F : f'(x1) = f(x1), ..., f'(xm) = f(xm) }

The version space at time m is simply the class of all concepts in F consistent with the first m labels of f (with respect to x). Fm(x, f) will serve as a characterization of what is known to the learner at time m about the target concept f.
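The definition of Fm(x, f) can be illustrated with a small sketch (our own code; the threshold concept class is an assumption introduced for illustration, not taken from the paper). The version space is just the subset of F consistent with the labels seen so far and, being defined by a set of constraints, it does not depend on the order in which those constraints arrive.

```python
def version_space(thresholds, labeled):
    """F_m(x, f): all concepts of F consistent with the m labels seen.
    Here F is the class of threshold concepts f_t(x) = 1 iff x >= t."""
    return [t for t in thresholds
            if all(int(x >= t) == y for x, y in labeled)]

F = range(7)                       # candidate thresholds t = 0..6
f = lambda x: int(x >= 3)          # hidden target concept, t = 3
labels = [(x, f(x)) for x in [1, 4, 2, 3]]   # teacher's sequence x

assert version_space(F, labels) == [3]
# Any reordering of the same labeled examples yields the same F_m:
assert version_space(F, list(reversed(labels))) == [3]
```

This is the sense in which a learner that retains the whole version space is trivially order independent: Fm is a function of the *set* of labeled instances observed.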
We know from Mitchell [10] that the version space can be economically represented and stored using the boundary sets: the S-set (the set of the most specific concepts in the version space) and the G-set (the set of the most general concepts in the version space)3. Each new example (xm, f(xm)) provides new information if it allows the version space to be reduced by modifying, through the CE algorithm, either one of the boundary sets. Generally, the S-set and the G-set contain many elements, and in the worst cases they can grow exponentially over some sequences of examples [6].

A local preference bias is a choice strategy which, at any time m, discards parts of the current version space, generally in order to keep the boundary sets manageable. In this way, it reduces the version space and acts as if there had been some additional information that had allowed the space of hypotheses to be constrained. The next section takes a closer look at this equivalence.

3 To be exact, a subset of the concept space can be represented by its S-set and G-set if and only if it is closed under the partial order of generality of the description language and bounded. This is the case for most learning tasks and concept representations, and particularly when the description language consists in the set of all conjunctive expressions over a finite set of boolean features. See [7] for more details.

4 Bias and additional instances

We assume that the incremental learner maintains a version space of potential concepts by keeping the boundary sets. We assume further that the local preference bias, if any, acts by removing elements of the S-set and/or of the G-set, thus reducing the version space. Indeed, in so doing, it removes from the version space all concepts or hypotheses that are no longer more general than some element of the S-set and more specific than some element of the G-set. Besides, the resulting version space keeps its consistency since, in this operation, no element of the resulting S-set becomes more general than other elements of the S-set or of the G-set, and, vice-versa, no element of the G-set can become more specific than other elements of the G-set or of the S-set.

To sum up, we now have a learning system that forgets pieces of information during learning by discarding potential hypotheses, but at the same time is optimal since, in principle, by keeping the set of all the remaining hypotheses, it could select the best among them. In that way, we isolate the effect of forgetting without intermingling it with non-optimality effects.

We are interested in studying the action of a local preference bias (which has been shown to be equivalent to the forgetting of potential hypotheses) along all possible sequences of data. More specifically, we want to find what type of local bias (or forgetting strategy) leads to no ordering effects, that is, for which all training sequences conduct to the same resulting state.

Fig. 2. For n instances, there are n! possible training sequences. We look for conditions under which all sequences would result in the same state.

What makes this problem difficult is that the action of the local preference bias depends on which state the system is in, and therefore depends on the training sequence followed. This implies in turn that all training sequences should be compared in order to find conditions on the local bias.

A very simple but very important idea will allow us to circumvent this obstacle. It has three parts:

(i) forgetting hypotheses amounts to overconstraining the search space;

(ii) supplying extra instances to an order independent learning algorithm would result in constraining the search space;

(iii) if an incremental learner using a local bias b1 (leading to forgetting of hypotheses) could be made equivalent to an order independent incremental learner using a bias b2 (leading to the consideration of extra instances), then finding conditions on the local bias b1 would be the same as finding conditions on the bias b2, only this time irrespective of the training sequence (since b2 is used by an order independent learner).

(i) and (ii) allow us to realize (iii) if it can be shown that the effect of the local bias b1 is the same as the effect of additional instances given by an oracle or bias b2 to an order independent learner. In other words, if it can be proved that any forgetting of hypotheses is equivalent to observing extra instances, then conditions on b1 will amount to conditions on the addition of fictitious instances to an order independent learning algorithm.
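For concreteness, here is a compact sketch of the boundary-set update of the CE algorithm for conjunctions over binary features (our own illustrative rendering, under the simplifying assumptions that the first example is positive and that hypotheses are tuples over {0, 1, '?'}):

```python
def covers(h, x):
    """A conjunction h (tuple over {0, 1, '?'}) covers instance x."""
    return all(a == '?' or a == b for a, b in zip(h, x))

def more_general(h1, h2):
    """h1 is at least as general as h2 (covers everything h2 covers)."""
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, n):
    """Maintain the S-set and G-set over n binary features."""
    S = [None]             # 'None' stands for the empty bottom hypothesis
    G = [('?',) * n]       # the single most general hypothesis
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            newS = []
            for s in S:
                if s is None:                    # first positive example
                    s2 = tuple(x)
                elif covers(s, x):
                    s2 = s
                else:                            # minimal generalization
                    s2 = tuple(a if a == b else '?' for a, b in zip(s, x))
                if any(more_general(g, s2) for g in G):
                    newS.append(s2)
            S = newS
        else:
            S = [s for s in S if s is None or not covers(s, x)]
            newG = []
            for g in G:
                if not covers(g, x):
                    newG.append(g)
                    continue
                for i, a in enumerate(g):        # minimal specializations
                    if a == '?':
                        g2 = g[:i] + (1 - x[i],) + g[i + 1:]
                        if any(s is None or more_general(g2, s) for s in S):
                            newG.append(g2)
            G = [g for g in newG                 # prune non-maximal elements
                 if not any(h != g and more_general(h, g) for h in newG)]
    return S, G

E = [((1, 0, 1), True), ((1, 1, 1), False), ((0, 0, 1), True)]
S, G = candidate_elimination(E, 3)
# S == [('?', 0, 1)]   and   G == [('?', 0, '?')]
```

A local preference bias, in the sense of the text, would intervene in this loop by deleting elements of S or G; the sketch itself deletes nothing and so retains the whole version space.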
Fig. 3. The equivalence needed to allow the study of conditions on the local bias. The effect of the local bias (forgetting of some element of the S-set or G-set) is the same as the effect of additional instances made available to an order independent algorithm such as the CE algorithm.

Of this powerful idea, we will make a theorem.

Theorem 1: With each choice it makes, the local preference bias acts as if additional examples had been known to an order independent learner.

Proof: (i) Case of the reduction of the S-set. For each element gi of the S-set, it is possible to find additional fictitious examples which, if considered by the CE algorithm, would lead to its elimination from the S-set. It suffices to take positive instances covered by (or more specific than) all gj such that (gj ∈ S-set and j ≠ i), and not covered by gi or excluded by the G-set. As a result, the CE algorithm would not have to modify the G-set nor the gj such that (gj ∈ S-set and j ≠ i), and would generalize gi just enough to cover the new instances. But since {gj : gj ∈ S-set and j ≠ i} covers the set of all past instances plus the new fictitious ones, gi can only become more general than one or several gj, and hence will be eliminated from the new S-set.

(ii) Case of the reduction of the G-set. In the same way, in order to eliminate an element gi of the G-set through the CE algorithm, it suffices to provide negative instances covered by gi but not covered by the other elements of the G-set nor by the S-set. As for (i) above, the CE algorithm would then specialize gi just enough to exclude the negative instances, and this would result in an element of the G-set that would be more specific than others, hence eliminated. []

Fig. 4. In this figure, {g1, g2, g3} are assumed to be the S-set of the positive instances (E1, E2, E3). If E4 is observed, then neither g1 nor g2 need to be modified. In fact, {g1, g2} is the S-set of all positive instances that belong to the gray area. g3 will need to be generalized just enough to cover E4, and this will make it more general than g1 and g2, hence it will be eliminated from the S-set.

The following example will help to understand this. Let us assume that we present positive and negative instances of scenes made of several objects, each one described by a conjunction of attribute-values, and that we use three families of attributes, each one organized as a tree with increasing order of generality toward the root.
[Figure: the three attribute trees. Sizes: large, small. Shapes: convex subsumes polygon and ellipse; polygon subsumes triangle (itself subsuming equilateral), pentagon and quadrilateral; ellipse subsumes circle; quadrilateral subsumes parallelogram, which subsumes rectangle and lozenge, both subsuming square. Colors: color subsumes composed colors (col. composed), base colors (col. base: green, purple, and red, yellow, blue, noted R-Y-B) and B/W (black, white).]

In the following, we show how an element of the S-set could be removed by observing more instances. Let us suppose that we observe the following sequence of positive instances:

E1: + {(large red square) & (small white lozenge)}
  -> S-set-1 = {(large red square) & (small white lozenge)}

E2: + {(large red parallelogram) & (small blue equilateral)}
  -> S-set-2 = {(large red parallelogram) & (small col.-base polygon),
                (? R-Y-B polygon) & (? col.-base parallelogram)}

E3: + {(large yellow triangle) & (small blue rectangle)}
  -> S-set-3 = {(large R-Y-B polygon) & (small col.-base polygon),      <-- g1
                (? R-Y-B parallelogram) & (? col.-base polygon),        <-- g2
                (? R-Y-B polygon) & (? col.-base parallelogram)}        <-- g3
Let us assume that at this point the local bias, for some reason, decides to discard g3 from the S-set. This would result in:

S-set-4 = {(large R-Y-B polygon) & (small col.-base polygon),
           (? R-Y-B parallelogram) & (? col.-base polygon)}

The very same S-set would be obtained if the learning algorithm was the CE algorithm and, after E1, E2 and E3, it observed a new instance covered by g1 and by g2 and not by g3, such as:

E4: + {(large yellow parallelogram) & (small black pentagon)}
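The minimal generalizations used throughout this example climb the attribute trees to the deepest common ancestor of two values. A small sketch (our encoding of the Shapes tree is approximate, reconstructed from the figure):

```python
# Hypothetical child -> parent encoding of the paper's "Shapes" tree.
PARENT = {
    "square": "rectangle", "rectangle": "parallelogram",
    "lozenge": "parallelogram", "parallelogram": "quadrilateral",
    "quadrilateral": "polygon", "triangle": "polygon",
    "pentagon": "polygon", "equilateral": "triangle",
    "polygon": "convex", "circle": "ellipse", "ellipse": "convex",
}

def ancestors(v):
    """The value itself followed by its chain of parents up to the root."""
    chain = [v]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lgg(a, b):
    """Least general generalization: the deepest common ancestor."""
    up = set(ancestors(a))
    for v in ancestors(b):
        if v in up:
            return v

assert lgg("square", "lozenge") == "parallelogram"
assert lgg("square", "triangle") == "polygon"
```

Applied attribute by attribute, this is how, for instance, red square and red parallelogram generalize to red parallelogram, while square and equilateral generalize all the way to polygon.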
5 Bias and order independence

What we have seen so far is that a local preference bias (forgetting hypotheses) can be made equivalent to another bias that would throw in chosen extra instances for the learner to observe. Thanks to this, we can now tackle the main topic of this paper, namely what kind of local preference bias a learner can implement so as to stay order independent. Indeed, instead of studying a strategy of forgetting of hypotheses that depends on the current version space, and therefore on the past history, we now study the addition of extra fictitious instances to the CE algorithm, which is order independent. In other words, we are now in a position to specify conditions on the local preference bias by stating to which extra instances it should amount.

We assume that the teacher has a set of n examples EX, and draws a sequence x of these according to her requirements. Furthermore, we assume order independence, i.e.:

    (1)  For all x, F_LB(x, f) = VSwb, where VSwb is constant.

(We use the notation F_LB to differentiate a learner implementing a local bias (LB) from one that does not and only implements the CE algorithm, noted Fm(x, f) in section 3. VSwb means the version space obtained with bias.)

Theorem 2: An incremental learner implementing a local preference bias is order independent for a collection EX of instances if the action of this bias is equivalent, for all possible sequences x of elements of EX, to the supply of the same set of additional instances.

Proof: It follows immediately from theorem 1. []

Now, we want to enlarge theorem 2 and give conditions upon the set of fictitious examples that the local bias, acting as an oracle, can provide to the learner so that the learner be order independent and reach a VSwb possibly different from VSnb. The CE algorithm without bias would reach the state VSnb (version space with no bias). If VSwb differs from VSnb, then it follows that extra instances should be provided to the CE algorithm so that it reaches VSwb. Theorem 3 states which ones.
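The premise of theorem 2, that supplying the same set of instances in any order pins down the same version space, can be checked by brute force on a tiny instance of the conjunctive language (illustrative code of ours, not from the paper):

```python
from itertools import product, permutations

def covers(h, x):
    """A conjunction h (tuple over {0, 1, '?'}) covers instance x."""
    return all(a == '?' or a == b for a, b in zip(h, x))

# The whole concept class F: all 27 conjunctions over 3 binary features.
H = list(product([0, 1, '?'], repeat=3))

def version_space(examples):
    """Concepts consistent with a set of labeled instances."""
    return frozenset(h for h in H
                     if all(covers(h, x) == pos for x, pos in examples))

EX = [((1, 0, 1), True), ((1, 1, 1), False), ((0, 0, 1), True)]
spaces = {version_space(list(p)) for p in permutations(EX)}
assert len(spaces) == 1    # all 3! orders reach the same version space
```

Order dependence can therefore only enter through the bias: if the fictitious instances it amounts to vary with the sequence, different sequences select different version spaces.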
Fig. 5. Which fictitious instances should the oracle corresponding to the local bias provide to the CE algorithm so that it gets to the version space VSwb? Answering this question amounts to giving conditions on the local bias that would conduct a learner using it to VSwb irrespective of the training sequence followed. The answer is that fictitious positive instances should be drawn from the gray area, whereas fictitious negative instances should be drawn from the striped area. The '+' and '-' figure the real instances that conduct the CE algorithm to the VSnb version space.

Let Snb and Gnb be respectively the S-set and the G-set that the CE algorithm would obtain from the collection of instances EX, and let Swb and Gwb be respectively the S-set and the G-set of VSwb in (1) (the version space obtained on any sequence x of EX by the learner implementing the local preference bias).

Theorem 3: For an incremental learner implementing a local preference bias to be order independent for EX, leading to the version space VSwb, it is necessary that the action of this bias be equivalent to the supply of:
- a set of fictitious positive instances such that they, together with the real positive instances, are covered and bounded by Swb, and
- a set of fictitious negative instances such that they, together with the real negative ones, are excluded and bounded by Gwb.

Proof: Self-evident. []

The next theorem is the application of theorem 3 to the case where VSwb = Fn(EX, f) = VSnb, that is, the case where the local preference bias leads to the null global preference bias, i.e. has no effect. Such a local bias can be seen as eliminating options judiciously, since the result obtained after any sequence x of EX is the same as what the CE algorithm would get on EX. In this case Snb and Swb are one and the same, as are Gnb and Gwb.

Theorem 4: For an incremental learner implementing a local preference bias leading to the same result as an incremental learner without a local bias, it is necessary that the action of this bias be equivalent to the supply of:
- a set of positive instances such that each one is covered by all elements of Snb, and
- a set of negative instances such that each one is excluded by all elements of Gnb.

It is as if this local preference bias knew "in advance" the collection EX of instances, and eliminated elements of the S-set and of the G-set judiciously. This leads to the final theorem.

Theorem 5: It is not possible for a local deterministic preference bias to lead to a null global preference bias for any arbitrary collection EX of examples.

Proof: Indeed, at each step, the action of the local bias can only depend on the past training instances and the current state. In order to lead to order independence, it would have to be equivalent to an oracle that provides instances drawn from an area of the instance space that can be defined only with respect to all the training instances. This would mean that the learner, through its preference bias, was always perfectly informed in advance of the collection EX of examples held by the teacher. []
6 Conclusion

In this research, we are interested in the following general question: given a collection of examples (or data in general), how can a teacher, a priori, best put them in sequence so that the learner, a deterministic incremental learning system that does not ask questions during learning, can acquire some target concept (or knowledge)? This question, which corresponds to situations where the teacher does not have the choice of the examples and cannot interpret the progress made by the student until the end of the learning period, leads to the study of incremental learning per se, independently of any particular system. The solution to this general interrogation could be of some use in several realistic settings where a teacher has an incremental learning system and a collection of data, or when data arrive sequentially but time allows them to be kept in small buffers that can be ordered before being processed.

This framework is to be compared with the recent surge of interest for "teaching strategies" that allow to optimally teach a concept to a learner, thus providing lower bounds on learnability complexity [5,17]. The difference with the former framework is that in one case the teacher can only play on the order of the sequence of inputs, whereas in the other case the teacher chooses the most informative ideal examples but does not look for the best order (there are some exceptions such as [13]).

This paper has outlined some first results concerning order sensitivity in supervised conceptual incremental learning. The most important ones are:

(i) that order dependence is due to non-optimality and/or to forgetting of possibilities, corresponding to a local preference bias that heuristically selects the most promising hypotheses,

(ii) that this bias can be seen as the result of additional instances given to the learner (i.e. prior knowledge built into the system),

(iii) that (ii) allows us to replace the difficult problem of determining the action of the local bias along different sequences of instances by a problem of addition (which is commutative, i.e. order independent) of instances to an order independent learner, which leads to

(iv) that there are strong contingencies for an incremental learner to be order independent on some collections of instances (either the corresponding prior knowledge is well-tailored to the future potential collections of inputs, or there is no prior knowledge, thus no reduction of storage and computational complexity).

In this study, we have given sufficient conditions only on the fictitious examples that should be provided by an oracle equivalent to a local preference bias. It would be nice to obtain necessary conditions that state which instances should necessarily be given by the oracle to get order independence. We are currently working on this question.

It should be clear that the problem of noisy data is of no concern to us here. We characterize through the version space what is known to the learner, irrespective of the quality of the data. Noise would become an important issue if it was dependent on the ordering of the data (e.g. sensor equipment that would degrade with time).

Issues for future research include: are these results extensible to more general learning situations (e.g. unsupervised)? Given a local preference bias, how to determine a good sequence ordering so as to best guide the system towards the target knowledge?

Acknowledgments: I thank all the members of the Equipe Inférence et Apprentissage, and particularly Yves Kodratoff, for the good humored and research conducive atmosphere so beneficial to intellectual work. Comments from the reviewers helped to make this paper clearer.
References

1. Cornuéjols A. (1989) "An Exploration into Incremental Learning: the INFLUENCE System". In Proc. of the 6th Intl. Conf. on Machine Learning, Ithaca, June 29-July 1, 1989, pp.383-386.
2. Daley & Smith (1986) "On the Complexity of Inductive Inference". Information and Control, 69, pp.12-40, 1986.
3. Doyle J. (1979) "A truth maintenance system". Artificial Intelligence, 12, pp.231-272, 1979.
4. Fisher, Xu & Zard (1992) "Ordering Effects in COBWEB and an Order-Independent Method". To appear in Proc. of the 9th Int. Conf. on Machine Learning, Aberdeen, June 29-July 1st, 1992.
5. Goldman & Kearns (1991) "On the Complexity of Teaching". In Proc. of COLT-91 (Workshop on Computational Learning Theory), Santa Cruz, Aug. 5-7, 1991, pp.303-314.
6. Haussler D. (1988) "Quantifying inductive bias: AI learning algorithms and Valiant's learning framework". Artificial Intelligence, 36, pp.177-222, 1988.
7. Hirsh H. (1990) "Incremental Version-Space Merging". In Proc. of the 7th Int. Conf. on Machine Learning, Univ. of Austin, Texas, June 21-23, 1990, pp.330-338.
8. Ling (1991) "Inductive Learning from Good Examples". In Proc. of the 12th Int. Joint Conf. on Artificial Intelligence (IJCAI-91), pp.751-756.
9. MacGregor J. (1988) "The Effects of Order on Learning Classifications by Example: Heuristics for Finding the Optimal Order". Artificial Intelligence, vol.34, pp.361-370, 1988.
10. Mitchell T. (1978) Version Spaces: an Approach to Concept Learning. PhD Thesis, Stanford, December 1978.
11. Mitchell T. (1982) "Generalization as Search". Artificial Intelligence, vol.18, pp.203-226, 1982.
12. Natarajan B. (1991) Machine Learning. A Theoretical Approach. Morgan Kaufmann, 1991.
13. Pearl J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
14. Porat & Feldman (1988) "Learning Automata from Ordered Examples". In Proc. of COLT'88, Cambridge, 1988, pp.69-79.
15. Rivest & Sloan (1988) "Learning Complicated Concepts Reliably and Usefully". In Proc. of COLT'88, Cambridge, 1988.
16. Shinohara & Miyano (1990) "Teachability in Computational Learning". In Proc. of the Workshop on Algorithmic Learning Theory, 1990, pp.247-255.
17. Salzberg, Delcher, Heath & Kasif (1991) "Learning with a helpful teacher". In Proc. of IJCAI-91, pp.705-711.
18. Utgoff P. (1989) "Incremental Induction of Decision Trees". Machine Learning, 4, pp.161-186, 1989.
19. Van Lehn (1987) "Learning one subprocedure per lesson". Artificial Intelligence, 31(1), pp.1-40, January 1987.
20. Winston (1970) "Learning structural descriptions from examples". AI-TR-231, MIT Artificial Intelligence Laboratory, Cambridge, MA, 1970.