Learn++: An Ensemble Based Incremental Learning and Data Fusion Algorithm
Learn++ is a new algorithm capable of incrementally learning from additional data, estimating the confidence of its own classifications, and combining information from different sources. Learn++ employs an ensemble-of-classifiers approach for this purpose [1,2]. The algorithm's incremental learning capability is described first, followed by its use in estimating classification confidence, and finally its implementation for data fusion.
Learn++ for Incremental Learning
The idea of generating an ensemble of classifiers and combining their outputs through a voting or gating mechanism was developed in the early 1990s, and it has since received increasing attention in the pattern recognition community. For example, Wolpert suggested combining hierarchical levels of classifiers using a procedure called stacked generalization [3], where the classifier outputs are combined by another classifier (hence stacked classifiers). Jordan and Jacobs introduced the hierarchical mixture of experts (HME), where multiple classifiers are highly trained (hence experts) in different regions of the feature space and their outputs are weighted by a gating network [4,5]. Freund and Schapire introduced the AdaBoost algorithm, with which they proved that an ensemble of weak classifiers whose outputs are combined by weighted majority voting is as strong as, if not stronger than, a single strong classifier [6-8]. This seemingly paradoxical concept, that a strong classifier is not necessarily better than an ensemble of weak classifiers, augments the No Free Lunch theorem, which states that no classifier is fundamentally better than any other. Ji and Ma [9] and Ali and Pazzani [10] proposed alternative approaches that generate simple classifiers with random parameters and then combine their outputs using majority voting. Ji and Ma also give an excellent review of various methods for combining classifiers [11], Kittler et al. analyze the error sensitivities of various voting and combination schemes [12], and Dietterich [13] compares ensembles of classifiers to other types of learners, such as reinforcement and stochastic learners.
A consensus of the above-mentioned studies is that using an ensemble of classifiers, rather than a single classifier, generally improves classification accuracy and robustness to noise. The Learn++ algorithm was inspired by these ensemble-based methods, in particular by the AdaBoost algorithm. In addition to improved accuracy and robustness, however, Learn++ also seeks incremental acquisition of information from new data. Unlike the above-mentioned techniques, Learn++ can learn additional information from new data even when new classes are introduced, and even when previous data are no longer available; that is, Learn++ does not forget previously learned information. As described below, the structure of the Learn++ algorithm also makes it naturally suited to data fusion and classification confidence estimation problems.
In essence, for each set of data that becomes available, Learn++ generates a number of base classifiers, whose outputs are combined through weighted majority voting to obtain the final classification. The base classifiers, also referred to as hypotheses, are generated (trained) using strategically chosen subsets of the currently available database. Fundamentally, Learn++ employs a divide-and-conquer approach: instead of learning a difficult classification rule with a single classifier, the algorithm divides the information to be learned into simpler, easier classification problems through the strategic selection of training subsets, as described below.
Figure 1 conceptually illustrates the underlying idea of the Learn++ algorithm. The white curve represents a simple hypothetical decision boundary to be learned; the classifier's job is to identify whether a given point lies inside the boundary. The decision boundaries (hypotheses) generated by the base classifiers BC_i, i = 1, 2, ..., 8, are illustrated with simple geometric figures. Each hypothesis decides whether a data point falls within its decision boundary. The hypotheses are hierarchically combined through weighted majority voting to form composite hypotheses H_t, t = 1, ..., 7, which are then combined to form the final hypothesis H_final.
Figure 1. Conceptual illustration of the Learn++ algorithm.
Figure 2 illustrates the block diagram of the algorithm. For each dataset D_k, k = 1, ..., K, that becomes available to Learn++, the inputs to the algorithm are (i) a sequence of m training data instances x_i along with their correct labels y_i, S = [(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)], (ii) an arbitrary supervised classification algorithm Learn+ (as opposed to Learn++) to be used as the base classifier, and (iii) an integer T_k specifying the number of classifiers (hypotheses) to be generated for that database. Learn+ can be any supervised classifier that can achieve at least 50% classification accuracy on its own training dataset. Note that this is not a very restrictive requirement; in fact, for a two-class problem, it is the least we can expect from a classifier.
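For concreteness, the per-dataset inputs might be gathered as in the following minimal sketch (illustrative names only, not part of the algorithm's specification):

from dataclasses import dataclass
from typing import Any, Callable
import numpy as np

@dataclass
class LearnPPInputs:
    X: np.ndarray                          # the m instances x_1, ..., x_m (one row per instance)
    y: np.ndarray                          # their correct labels y_1, ..., y_m
    make_base_learner: Callable[[], Any]   # Learn+: any supervised classifier expected to exceed 50% training accuracy
    T_k: int                               # number of hypotheses to generate for this dataset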
[Block diagram: steps 1-7 of the algorithm, from training-subset selection through final classification. Notation: D_t, weight distribution of the training data; h_t, single hypothesis (classifier) generated at iteration t; H_t, composite hypothesis at iteration t; w_t, instance weights at iteration t; ε_t, error of h_t; E_t, error of H_t; T, maximum number of classifiers; WMV, weighted majority voting.]
Figure 2. Block diagram of the Learn++ algorithm.
Learn++ starts by initializing a weight distribution, according to which a training subset TR_t and a test subset TE_t are drawn at the t-th iteration of the algorithm. Unless there is prior reason to choose otherwise, this distribution is initially set to be uniform, giving each instance equal probability of being selected into the first training subset:

w_1(i) = D_1(i) = \frac{1}{m}, \quad \forall i = 1, 2, \ldots, m     (1)
At each iteration t, the weights w_t from the previous iteration are first normalized to ensure that a legitimate probability distribution D_t is obtained (the initialized weights are shown as an input in Figure 2):

D_t = \frac{w_t}{\sum_{i=1}^{m} w_t(i)}     (2)
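As a minimal illustration (not the authors' code), Equations (1) and (2) amount to the following NumPy sketch:

import numpy as np

def init_weights(m: int) -> np.ndarray:
    """Equation (1): start with a uniform weight for each of the m instances."""
    return np.full(m, 1.0 / m)

def to_distribution(w_t: np.ndarray) -> np.ndarray:
    """Equation (2): normalize the current weights into a probability distribution D_t."""
    return w_t / w_t.sum()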
A training subset TR_t and a test subset TE_t for the t-th iteration are then selected according to D_t (step 1), and the base classifier, Learn+, is trained with this training subset (step 2). A hypothesis h_t is obtained as the t-th classification rule, whose error is computed on the test subset simply by adding the current distribution weights of the misclassified instances (step 3):

\varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)     (3)
The Learn+ condition that at least half of the training data be correctly classified is enforced by requiring that the error, as defined in Equation (3), be less than 1/2. If this is the case, the error is normalized to obtain the normalized error β_t (not shown in Figure 2):

\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}, \quad 0 \le \beta_t < 1     (4)
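A hedged sketch of steps 1-3 and Equations (3)-(4) follows. The names are illustrative: base_learner_factory is a hypothetical callable returning a classifier with scikit-learn-style fit/predict methods, rng is a NumPy random Generator, and treating the instances left out of TR_t as the test subset TE_t is one reasonable choice rather than a prescription from the text.

import numpy as np

def train_and_evaluate(X, y, D_t, base_learner_factory, subset_size, rng):
    """One attempt at steps 1-3: draw TR_t according to D_t, train Learn+, and
    compute the error of Eq. (3) and, if acceptable, the normalized error of
    Eq. (4). Returns (h_t, beta_t), or None if the error is not below 1/2."""
    m = len(y)
    # Step 1: sample a training subset according to D_t; remaining instances form TE_t.
    train_idx = rng.choice(m, size=subset_size, replace=False, p=D_t)
    test_idx = np.setdiff1d(np.arange(m), train_idx)

    # Step 2: train the base classifier (Learn+) on the training subset.
    h_t = base_learner_factory()
    h_t.fit(X[train_idx], y[train_idx])

    # Step 3: error = sum of distribution weights of misclassified test instances, Eq. (3).
    wrong = np.asarray(h_t.predict(X[test_idx])) != y[test_idx]
    eps_t = D_t[test_idx][wrong].sum()
    if eps_t >= 0.5:
        return None                          # discard h_t; the caller redraws a new subset
    beta_t = eps_t / (1.0 - eps_t)           # normalized error, Eq. (4)
    return h_t, beta_t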
If the error is not less than 1/2, the current hypothesis is discarded and a new training subset is selected. All hypotheses generated thus far are then combined using weighted majority voting to obtain the composite hypothesis H_t (step 4):

H_t = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log\frac{1}{\beta_t}     (5)
The weighted majority voting chooses the class receiving the highest total vote from all hypotheses. The voting is less than democratic, however, since each hypothesis has a voting weight that is inversely proportional to its normalized error: hypotheses with proven records of good classification performance on their test data, as measured by the logarithm of the inverse normalized error, are awarded higher voting weights (not to be confused with the training data weights).
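The weighted majority voting of Equation (5) can be sketched as follows; the names are illustrative and the classifiers are assumed to expose a scikit-learn-style predict(). A small floor on beta avoids a division by zero when a hypothesis happens to make no test errors.

import numpy as np

def weighted_majority_vote(hypotheses, betas, X, classes):
    """Eq. (5): each hypothesis votes for its predicted class with weight
    log(1/beta_t); the composite hypothesis picks the class with the largest
    weighted vote. Returns the predicted classes and the vote tallies."""
    X = np.asarray(X)
    classes = np.asarray(classes)
    votes = np.zeros((len(X), len(classes)))
    for h, beta in zip(hypotheses, betas):
        weight = np.log(1.0 / max(beta, 1e-10))   # lower-error hypotheses vote with more weight
        preds = np.asarray(h.predict(X))
        for col, c in enumerate(classes):
            votes[preds == c, col] += weight
    return classes[np.argmax(votes, axis=1)], votes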
The error of the composite hypothesis is then computed in a similar fashion, as the sum of the distribution weights of the instances that are misclassified by H_t (step 5):

E_t = \sum_{i:\, H_t(x_i) \neq y_i} D_t(i)     (6)
The normalized composite error B_t is computed as (not shown in Figure 2)

B_t = \frac{E_t}{1 - E_t}, \quad 0 \le B_t < 1     (7)

which is then used in updating the training data weights (step 6):
w_{t+1}(i) = w_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}     (8)
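A minimal sketch of steps 5 and 6 (Equations (6)-(8)), assuming the composite predictions H_t(x_i) have already been obtained with a voting routine such as the one sketched above:

import numpy as np

def update_weights(w_t, H_t_pred, y):
    """Composite error E_t (Eq. 6), normalized composite error B_t (Eq. 7), and
    the weight update of Eq. (8). The sketch assumes E_t < 1/2, so that B_t < 1
    and correctly classified instances have their weights reduced."""
    y = np.asarray(y)
    wrong = np.asarray(H_t_pred) != y
    D_t = w_t / w_t.sum()
    E_t = D_t[wrong].sum()                     # Eq. (6)
    B_t = E_t / (1.0 - E_t)                    # Eq. (7)
    w_next = np.where(wrong, w_t, w_t * B_t)   # Eq. (8): shrink weights of correctly classified instances
    return w_next, E_t, B_t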
The weight update rule in Equation (8) reduces the weights of the training data instances that are correctly classified, so that their probability of being selected into the next training subset is reduced. The probability of misclassified instances being selected into the next training subset is therefore effectively increased. In essence, the algorithm is forced to focus more and more on instances that are difficult to classify, that is, instances that have not yet been properly learned. This property of Learn++ results in significantly better classification performance and robustness compared to single classifiers [2]. Moreover, this procedure also enables incremental learning, since instances from a new database are precisely the instances that have not yet been learned, and the algorithm quickly focuses on them by generating new hypotheses. At any point, for example after a pre-specified maximum number of hypotheses has been generated, or after a predetermined classification performance has been achieved, a final hypothesis H_final can be obtained by combining all hypotheses generated thus far using the weighted majority voting rule:
H_{\mathrm{final}}(x) = \arg\max_{y \in Y} \sum_{k=1}^{K} \sum_{t:\, h_t(x) = y} \log\frac{1}{\beta_t}     (9)
It is important to emphasize that the weight update rule is the heart of the Learn++ algorithm, as it allows effective acquisition of novel information when new databases are introduced. The weight update rule, being based on the performance of the composite hypothesis, is one of the most important distinctions between Learn++ and other ensemble approaches such as AdaBoost. Details of the Learn++ algorithm, a proof of an upper bound on its error, and its classification performance on a number of real-world applications can be found in [2].
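Putting the pieces together, the following self-contained sketch (illustrative, not the authors' implementation) shows how one incremental call on a new dataset extends an existing ensemble; the returned list of (hypothesis, normalized error) pairs is all that Equation (9) needs to form H_final.

import numpy as np

def learn_pp_increment(ensemble, X, y, classes, base_learner_factory,
                       T_k, subset_size, seed=0):
    """Extend `ensemble` (a list of (h, beta) pairs from earlier datasets) with
    T_k new hypotheses trained on the newly available data (X, y)."""
    rng = np.random.default_rng(seed)
    X, y, classes = np.asarray(X), np.asarray(y), np.asarray(classes)
    m = len(y)
    w = np.full(m, 1.0 / m)                                         # Eq. (1)

    def wmv(pairs, X_eval):
        # Weighted majority voting, as in Eqs. (5) and (9).
        votes = np.zeros((len(X_eval), len(classes)))
        for h, beta in pairs:
            preds = np.asarray(h.predict(X_eval))
            for col, c in enumerate(classes):
                votes[preds == c, col] += np.log(1.0 / beta)
        return classes[np.argmax(votes, axis=1)]

    t = 0
    while t < T_k:
        D = w / w.sum()                                             # Eq. (2)
        tr = rng.choice(m, size=subset_size, replace=False, p=D)    # step 1
        te = np.setdiff1d(np.arange(m), tr)
        h = base_learner_factory()
        h.fit(X[tr], y[tr])                                         # step 2
        eps = D[te][np.asarray(h.predict(X[te])) != y[te]].sum()    # Eq. (3)
        if eps >= 0.5:
            continue                                                # discard h and redraw a subset
        beta = max(eps, 1e-10) / (1.0 - eps)                        # Eq. (4); floor avoids log(1/0)
        ensemble = ensemble + [(h, beta)]
        wrong = wmv(ensemble, X) != y                               # composite hypothesis H_t, Eq. (5)
        E = D[wrong].sum()                                          # Eq. (6)
        B = max(E, 1e-10) / (1.0 - E)                               # Eq. (7); sketch assumes E < 1/2
        w = np.where(wrong, w, w * B)                               # Eq. (8)
        t += 1
    return ensemble   # H_final: apply wmv() with the full ensemble, as in Eq. (9)

Because earlier hypotheses are never retrained or discarded, each call only adds classifiers to the ensemble, which is how previously learned information is retained.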
Learn++ for Confidence Estimation
The voting mechanism inherent in Learn++ hints at a simple procedure for estimating the confidence of the algorithm in its own classification decisions. In essence, if the majority of the (weighted) hypotheses agree on the class of a particular instance, we can interpret this outcome as a high-confidence decision. If, on the other hand, the individual hypothesis votes are distributed roughly equally among the different classes, the final decision can be interpreted as a low-confidence decision. To formalize this approach, let us assume that a total of T hypotheses have been generated for classifying instances into one of C classes.
We can then define ξ_c, the total vote that class c receives, as

\xi_c = \sum_{t:\, h_t(x) = c} \log\frac{1}{\beta_t}, \quad t = 1, \ldots, T, \; c = 1, \ldots, C     (10)

The final classification is then the class for which ξ_c is maximum. Normalizing the votes received by each class,

\psi_c = \frac{\xi_c}{\sum_{c=1}^{C} \xi_c}     (11)

allows us to interpret ψ_c as a measure of confidence in the decision on a 0-to-1 scale, with 1 corresponding to maximum confidence and 0 to no confidence. We note that the normalized ψ_c values do not represent the accuracy of the results, nor are they related to the statistical definition of confidence intervals determined through hypothesis testing; this is merely a measure of the confidence of the algorithm in its own decision for any given data instance. Keeping this distinction in mind, we can heuristically define the following ranges: ψ_c ≤ 0.6, very low confidence; 0.6 < ψ_c ≤ 0.7, low confidence; 0.7 < ψ_c ≤ 0.8, medium confidence; 0.8 < ψ_c ≤ 0.9, high confidence; and 0.9 < ψ_c ≤ 1, very high confidence. Through this mechanism, Learn++ can alert the user by flagging any classification decision about which it is not confident. Our initial results with Learn++ on real-world (nonbiological) data indicate that Learn++ catches over 80% of its own missed detections and false alarms by outputting low or very low confidence on those decisions. We also note, however, that
Learn++, by using multiple classifiers, allows us to perform formal statistical confidence level calculations on the outputs for any given class. An initial set of results on using the Learn++ voting mechanism to estimate the classification confidence of NDE signals is provided in [14].
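A small sketch of Equations (10) and (11) together with the heuristic ranges above; the array passed in holds the per-class weighted votes ξ_c (as produced, for example, by a voting routine like the one sketched earlier), and the threshold values are those listed in the text.

import numpy as np

def classification_confidence(votes_per_class):
    """Normalize the weighted votes (Eq. 11) and map the winning class's share
    psi_c to the heuristic confidence labels described in the text."""
    xi = np.asarray(votes_per_class, dtype=float)    # xi_c, Eq. (10)
    psi = xi / xi.sum()                              # psi_c, Eq. (11)
    winner = int(np.argmax(psi))
    conf = psi[winner]
    if conf > 0.9:
        label = "very high"
    elif conf > 0.8:
        label = "high"
    elif conf > 0.7:
        label = "medium"
    elif conf > 0.6:
        label = "low"
    else:
        label = "very low"
    return winner, conf, label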
Learn++ for Data Fusion
We note that Learn++ does not specify which base classifier should be used as Learn+, nor does it require that the base classifiers be trained with identical features. In fact, the algorithm has been shown to work with a variety of classifiers, including different types of neural networks, on a number of practical real-world applications [15]. In this study, different ensembles of classifiers, each trained with signals of a different modality used as features, can be incorporated into the identification system. Such classifiers may include neural networks, rule-based classifiers, or Bayes classifiers based on acoustic emission signals, thermal images, etc. To work in data fusion mode, Learn++ is modified according to the structure in Figure 3, combining the pertinent information from all identifiers. Learn++ can then provide a more informed decision than any single identifier can individually.
[Figure 3: preprocessed acoustic emission signals from different sources (AE_1, ..., AE_N), thermal imaging data (TI), and data from other identifiers each feed a feature-specific expert ensemble of classifiers; the ensemble outputs pass through rule-based weight assigning and are combined by weighted majority voting to produce the final decision.]
Figure 3. Learn++ for data fusion application.
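A minimal sketch of this fusion mode, under the assumption that each modality-specific ensemble is a list of (hypothesis, normalized error) pairs and that each classifier sees only its own feature view of the same set of instances; the rule-based weight-assigning stage of Figure 3 is not modeled here, and all names are illustrative.

import numpy as np

def fuse_ensembles(feature_ensembles, feature_views, classes):
    """Combine feature-specific expert ensembles by weighted majority voting:
    every hypothesis from every modality votes, with weight log(1/beta), into a
    common tally, and the final decision is the class with the largest vote."""
    classes = np.asarray(classes)
    n = len(feature_views[0])                      # number of instances to classify
    votes = np.zeros((n, len(classes)))
    for ensemble, X_view in zip(feature_ensembles, feature_views):
        for h, beta in ensemble:
            preds = np.asarray(h.predict(X_view))  # each classifier sees only its own features
            for col, c in enumerate(classes):
                votes[preds == c, col] += np.log(1.0 / max(beta, 1e-10))
    return classes[np.argmax(votes, axis=1)]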
References
(1) Polikar R, Udpa L, Udpa SS, Honavar V. Learn++: An incremental learning algorithm for multilayer perceptrons. Proc of IEEE 25th Int Conf on Acoustics, Speech and Signal Processing 2000; 6:3466-3469.
(2) Polikar R, Udpa L, Udpa SS, Honavar V. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man and Cybernetics (C). In press.
(3) Wolpert DH. Stacked generalization. Neural Networks 1992; 5(2):241-259.
(4) Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixtures of local experts. Neural Computation 1991; 3:79-87.
(5) Jordan MI, Jacobs RA. Hierarchical mixtures of experts and the EM algorithm. Neural Computation 1994; 6(2):181-214.
(6) Schapire R. The strength of weak learnability. Machine Learning 1990; 5:197-227.
(7) Freund Y. Boosting a weak learning algorithm by majority. Information and Computation 1995; 121(2):256-285.
(8) Freund Y, Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997; 55(1):119-139.
(9) Ji C, Ma S. Combination of weak classifiers. IEEE Trans on Neural Networks 1997; 8(1):32-42.
(10) Ali K, Pazzani MJ. Error reduction through learning multiple descriptions. Machine Learning 1996; 24(3):173-202.
(11) Ji C, Ma S. Performance and efficiency: recent advances in supervised learning. Proceedings of the IEEE 1999; 87(9):1519-1535.
(12) Kittler J, Hatef M, Duin RPW, Matas J. On combining classifiers. IEEE Trans on Pattern Analysis and Machine Intelligence 1998; 20(3):226-239.
(13) Dietterich TG. Machine learning research: four current directions. AI Magazine 1997; 18(4):97-136.
(14) Polikar R. Incremental learning of NDE signals with confidence estimation. Proc of 28th Review of Progress in Quantitative Nondestructive Evaluation (QNDE 2001), Brunswick, ME, 29 July - 3 August 2001.
(15) Polikar R, Byorick J, Kraus S, Marino A, Moreton M. Learn++: A classifier independent incremental learning algorithm for supervised neural networks. Proc of IEEE/INNS Int Joint Conf on Neural Networks. In press.