
Active Classification with Bounded Resources
AnYuan Guo
Department of Computer Science
University of Massachusetts at Amherst
140 Governors Drive
Amherst, MA 01003
anyuan@cs.umass.edu
Abstract
Traditionally, the input to a classifier is an instance vector with fixed values, and little attention is paid to how these values are acquired. In this paper, we assume that the values of all attributes are initially unobserved, that a cost is associated with observing each attribute, and that a problem-specific misclassification penalty function is used to assess the final decision. Framed this way, active classification becomes a resource-bounded optimization problem: finding the best information gathering strategy with respect to a given loss function. We formalize this problem and present a principled approach to its solution by mapping it onto a partially observable Markov decision process (POMDP) and solving for a finite-horizon optimal policy.
1. Introduction
With the advent of database engineering, vast amounts of data have been accumulated in realms such as finance, manufacturing, and bioinformatics. How can we transform this passive domain knowledge into active information gathering strategies for the task at hand? In this paper, we present a framework for automated information gathering in the context of classification problems.
A probabilistic classifier is a function that maps domain
instances to a distribution over class labels. A domain instance is a vector of attribute values given as input to the
classifier. This mapping can be learned from a data set of labeled instances using standard supervised learning techniques. Given a classifier, classification reduces to feeding an instance to the classifier and reading off the assigned label.
In real life, costs are often associated with attribute measurements. In medical domains, this can be the monetary
cost of tests performed on patients; or, in a robotics domain,
the cost of sensing different attributes of the environment.
Aside from attribute measurement costs, misclassification costs are also ignored by most machine learning algorithms. Simple classification accuracy is often used, which does not account for variations in penalty among different types of misclassification. In medical diagnosis, false positives are temporarily distressing, yet they are more desirable than false negatives, which may let patients go untreated. Most learning systems lack the expressiveness to model these costs.
In recent years, the cost-sensitive learning community has
started to tackle the problem of representing costs in classification (Turney 2000). The approaches taken have been myopic, using value-of-information heuristics to decide which queries to perform (Pedersen et al. 2001).
In this paper, we present a principled, decision-theoretic approach to the active classification problem.
Given a knowledge base, active classification is defined
as finding the optimal information gathering strategy with
respect to any given misclassification cost function under
bounded resources.
As a concrete example, during the 1998 Antarctic expedition sponsored by NASA (Moorehead et al. 1999), an autonomous rover was deployed to independently search for and gather meteorites. On board the robot was a rock classifier built by expert geologists from rock databases and extensive laboratory results. The robot could observe different attributes of a rock with its camera, with a metal detector, and by performing a spectroscopy test involving the deployment of a robot arm. The robot's task was to decide which sensors to deploy at each step. These decisions had to be made carefully, since only a limited number of observations could be carried out, owing to resource bounds on battery power and the intrinsic monetary costs of sensor deployment.
Diagnostic problems in general fit well within the active classification framework, since in most instances costs are associated with information gathering. In medical diagnosis, for example, given a knowledge base of diseases, how can this knowledge be used to guide the doctor's decisions about further testing as current test results become available, given certain costs for misdiagnosis?
The main contribution of our work is the novel treatment of the true class of an instance as the state of a partially observable Markov decision process (POMDP), and the formulation of attribute gathering as observations on that state. With these insights, we are able to map the active classification problem onto a POMDP, thereby enabling the application of general solution techniques to find an optimal solution.
In the rest of this paper, we first survey related work in section 2, then lay out the relevant background in section 3. In section 4, we define the active classification problem formally; in section 5, we present a solution technique by mapping our problem onto a POMDP. Finally, in section 6, we conclude with a discussion of future work.
2. Related Work
Aspects of this problem have been studied in several literatures. Decision tree induction uses a greedy information-theoretic strategy to minimize misclassification (Quinlan 1986). Imposing the constraints of our problem would mean specifying a cutoff depth on the induced decision tree. One problem with this approach is that it is not straightforward to incorporate the cost of attribute observation and the misclassification penalty into the decision tree induction procedure.
In (Pedersen et al. 2001), which uses the robotic collection of meteorites as its domain, a greedy entropy reduction heuristic is used to predict the utility of observing an attribute. As that paper points out, this approach has some apparent problems. First, information gain is always positive, yet entropy sometimes increases after new data is obtained: contradictory data can raise entropy. Another problem with using entropy reduction is that an observation that changes P(X) from 10% to 90% has an entropy reduction of zero, indicating that the observation has no effect, which is obviously untrue, since information has definitely been obtained. Furthermore, the utility measure in that work is simple classification accuracy; no cost is attached to misclassification. We can easily imagine cases where false positives should be weighted differently than false negatives.
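To make the zero-entropy-reduction point concrete, here is a minimal sketch in Python (illustrative only, not from the original paper) showing that binary entropy is symmetric, so a belief shift from 10% to 90% leaves entropy unchanged:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli distribution with parameter p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# A belief shift from P(X) = 0.1 to P(X) = 0.9 is clearly informative,
# yet the entropy reduction is exactly zero, because H(p) = H(1 - p).
print(binary_entropy(0.1), binary_entropy(0.9))  # both ~= 0.469 bits
```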
The cost-sensitive learning literature is devoted to this last issue: problem-specific loss functions are used to calculate utility, rather than raw misclassification rates (Turney 2000).
The work in (Greiner, Grove, and Roth 1996) studies a similar problem, and we borrow parts of our problem definition from it. However, the focus there is on how to learn an active classifier; in our case, we start with a pre-learned classifier, and the task is to construct an optimal policy for attribute observation. Our representation also differs: we use a Naive Bayesian classifier to compactly represent the joint distribution of the class and attribute variables.
3. Background
The same mathematical formalism for classification can give rise to two different interpretations. One is to assume that there is a stationary distribution over the space of domain instances, from which instances are drawn independently at random. Once drawn, the attributes of an instance stay fixed, and observations made on them are deterministic. In this case, there is no reason ever to observe the same attribute twice.

Another interpretation is to assume that the classes are objects whose attributes assume different values according to a probability distribution. In this case, observations are non-deterministic, so repeated observation of the same attribute can reduce the uncertainty associated with that attribute.
In this paper, we will address classification problems of this latter kind. The meteorite collecting rover example falls into this category: the non-determinism can arise from sensor noise or from hidden environment variables. Photographing a rock at varying distances and under different lighting conditions, for instance, could have significant effects on the final readings.

Within this general framework, we are interested in performing active classification using a specific representation: the Bayesian classifier. Prior knowledge of a domain can be easily incorporated into a Bayesian classifier and easily updated as more data becomes available. The classifier outputs a belief distribution over the class labels rather than a single label, increasing the informativeness of the answer.

A Bayesian classifier assigns labels to data instances by computing the posterior distribution of a class variable given the values of the attribute variables. The joint probability distribution of the class variable and the attribute variables is encoded by a Bayesian network, denoted by $B = \langle G, \Theta \rangle$, where $G$ is a directed acyclic graph (DAG) capturing the relationships between the class variable $C$, which takes its values from a discrete set of class labels, and the attribute variables $a_1, \ldots, a_n$, and $\Theta$ is a vector of conditional probabilities specifying the conditional distribution of each attribute given the class.

Figure 1: A general Bayesian classifier with attribute dependences: the class node $C$ points to the attribute nodes $a_1, \ldots, a_n$.

The conditional probabilities in a Bayesian network can be learned from data using standard Bayesian learning algorithms; see (Heckerman 1996) for a comprehensive survey. In our work, we start with a Bayesian network whose conditional probabilities have already been learned.
4. Active Classification
Active classification is the problem of finding the optimal information gathering strategy with respect to any given misclassification cost function under bounded resources.
An active classification episode proceeds as follows. First, an instance, drawn according to a prior distribution over the classes, is presented to the classifier; the values of its attributes are not yet known. The active classifier chooses an attribute to observe, obtains its value, and then decides again what attribute to observe next. This process continues until a deadline is reached, at which point the classifier outputs a class label.
An instance is identified by a finite vector of attribute values $\vec{x} = \langle x_1, \ldots, x_n \rangle$. Let $X = V_1 \times \cdots \times V_n$ be the set of all possible data instances, where $V_i$ denotes the value set of attribute $a_i$. A labeled instance of a concept is a pair $\langle \vec{x}, c \rangle \in X \times C$, where $C$ is the set of all classes.
The problem description of active classification can be defined by the tuple $\langle B, D, A, C, t, L \rangle$:

- $B$ is a Bayesian classifier, denoted by $\langle G, \Theta \rangle$, where $G$ is a directed acyclic graph capturing the relationships between the class variable, which takes values in $C$, and the attribute variables $a_i \in A$, and $\Theta$ is a vector of conditional probabilities quantifying these relationships.

- $D$ is the deadline for the classification episode. It is a bound on the amount of time available to perform observation actions.

- $A \cup C$ is the set of all possible actions: an action in $A$ is an attribute observation action, and an action in $C$ is a classification action.

- $t: A \to \mathbb{Z}^+$ is a function that assigns a positive integer time duration to each of the attribute observation actions.

- $L: C \times C \to \mathbb{Z}$ is a function that maps a pair of class values, the first being the actual class of the instance and the second being the class that the classifier outputs, to an integer denoting the loss for misclassification.

Formalized in this way, the active classification problem is one of decision making in a partially observable stochastic environment. Such problems have been studied extensively in the Operations Research community, where they are modeled as partially observable Markov decision processes, and general solution techniques have been developed that find optimal control strategies (Lovejoy 1991). In the next section, we will show how the active classification problem maps onto a POMDP. Once this mapping is complete, solutions can be found by applying standard POMDP algorithms.
Figure 2: A Naive Bayesian classifier for meteorite identification: the class node Rock Type points to the attribute nodes B&W Image, Color Image, Metal Detector Reading, and Spectrometer Reading.
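For concreteness, the problem tuple above might be encoded as a small container like the following sketch (the field names are our own, chosen for readability; nothing here is prescribed by the formalism):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ActiveClassificationProblem:
    """Container for the tuple <B, D, A, C, t, L> defined above."""
    classifier: object                 # B: the (Naive) Bayesian classifier
    deadline: int                      # D: time budget for observation actions
    attributes: List[str]              # A: attribute observation actions
    classes: List[str]                 # C: class labels / classification actions
    duration: Dict[str, int]           # t: A -> Z+, time cost per observation
    loss: Callable[[str, str], float]  # L: (true class, output class) -> loss
```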
5. Solution Technique
Given the formal description outlined above, the problem is to find an active classifier $\pi: \tilde{X} \to A \cup C$, a function mapping partially observed instances $\tilde{x} \in \tilde{X}$ to actions, which optimizes classification performance as measured by the loss function $L$ while staying within the time bound $D$. The partial instance is initially completely unobserved.
In this paper, we restrict the distribution of the data to those representable by a Naive Bayesian classifier: a Bayesian classifier that assumes the conditional independence of the attributes given the class (Mitchell 1997). Graphically, there are no links between attribute variables. Naive Bayesian classifiers require less training data to learn, and in practice they often achieve performance comparable to that of the optimal Bayesian classifier (Domingos and Pazzani 1997). Figure 2 shows a Naive Bayesian classifier for the robot rover example.
The rock in question can be a meteorite, sandstone, basalt, or schist. The attributes gathered are a black and white image, a color image, the metal detector reading, and the spectroscopy signature. The value of the spectroscopy test, for example, is a vector of spectral signatures. We will not go into further detail here about the specifics of rock classification.
A partially observable Markov decision process can be described as a tuple $\langle S, A, T, R, \Omega, O \rangle$, where $S$ is a finite set of states of the world; $A$ is a finite set of actions; $T$ is the state-transition function, mapping each state-action pair to a probability distribution over next states; $R$ is the reward function, giving the expected immediate reward for each state-action pair; $\Omega$ is a finite set of possible observations; and $O$ is the observation function, giving for each state-action pair a probability distribution over the possible observations.

In this section, we show how the active classification problem $\langle B, D, A, C, t, L \rangle$, under the Naive Bayes assumption, maps onto a POMDP $\langle S, A, T, R, \Omega, O \rangle$.
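Because the class component of the state never changes during an episode (see section 5.5), the POMDP belief update reduces to a plain Bayes filter over class labels. A minimal sketch, reusing the hypothetical CPT layout from the earlier snippet:

```python
from typing import Dict

def belief_update(belief: Dict[str, float], attr: str, value: str,
                  cpt: Dict[str, Dict[str, Dict[str, float]]]) -> Dict[str, float]:
    """One correction step: b'(c) proportional to P(attr = value | C = c) * b(c).
    The transition is the identity on the class, so no prediction step is needed."""
    unnormalized = {c: cpt[attr][c][value] * b for c, b in belief.items()}
    z = sum(unnormalized.values())
    return {c: b / z for c, b in unnormalized.items()}
```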
5.1 State Representation

The true class label of an instance can be thought of as its underlying state. Given this simple interpretation, the state space of the POMDP is simply the set of class labels. Since we also need to keep track of the time elapsed, to ensure that the deadline for the classification episode has not been reached, a time stamp has to be included as part of the state representation. Thus, a state is a class, time stamp pair:

$$S = \{ \langle c, t \rangle \mid c \in C,\ 0 \le t \le D \} \cup \{ s_{absorb} \}$$

The absorbing state, denoted $s_{absorb}$, is reached when either the deadline has passed or a classification action has been taken.
g
¢¢
¢¢
p Ÿ
¢¢
(‚^ƒ„{…{†
"
…
s
„{„‡rˆ‰Š
or H
¢¢
¢¢
¢¢
¢¢
¢
—
†
3?o ŸipxŸ q
(
o prqr
¢¢
¢¢
¢¢
¢¢
¢¢
¢¢
if ¢¢
¢¢
p Ÿ
¢¢
¢¢
k
,01
, 1
is the value set of
Observation function
When the underlying states are not known with certainty,
actions taken at each state give rise to different observations
according to some distribution P. The observation function
l’Pf“8&g”8&k•RT–j , assigns the probability of observing
a certain value given the class of the instance and the action
taken. For observation actions, this corresponds to the entry
in the conditional probability table of the attribute observed
indexed by the class value and attribute value. For classification actions, this value is 0.
†
3
ƒ6(
o iprq[i˜
ƒ„
s
—
5™%\š
ž
†
3
ƒ›( 5
if pœLž
if p % >
The Naive Bayes assumption is crucial for this mapping.
In general Bayesian classifiers, attributes can be correlated
with other attributes. This means we can no longer compactly represent the state space given the current state and
action. Past observations will also need to be included as
part of the state space, leading to an exponential increase of
its dimensionality. We are currently working on extensions
that will relax the Naive Bayes assumption.
5.5
œ~>
%
>
%
%
%
H
stvuxwyit
%
p¥©|
>
„„‡rˆ‰Š
otherwise
ž
An absorbing state is reached either after a classification
action was taken or when the deadline was reached.
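A minimal sketch of this observation function in Python. The string encoding of actions (an attribute name for an observation action, "classify:<label>" for a classification action) and the dummy "null" observation emitted by classification actions are our own implementation conveniences, not part of the formalism:

```python
from typing import Dict, Tuple

# CptTable[attr][cls][val] = P(attr = val | C = cls), as in the earlier sketches.
CptTable = Dict[str, Dict[str, Dict[str, float]]]

def observation_prob(state: Tuple[str, int], action: str, obs: str,
                     cpt: CptTable) -> float:
    """O(s, a, o) for a state s = (true class, elapsed time)."""
    true_class, _t = state
    if action.startswith("classify:"):
        # Classification actions yield no attribute value; all probability
        # mass goes to a dummy "null" observation.
        return 1.0 if obs == "null" else 0.0
    # Observation action: CPT entry indexed by class value and attribute value.
    return cpt[action][true_class][obs]
```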
The Naive Bayes assumption is crucial for this mapping. In general Bayesian classifiers, attributes can be correlated with other attributes, so the observation distribution could no longer be represented compactly given only the current state and action: past observations would also need to be included as part of the state, leading to an exponential increase in its dimensionality. We are currently working on extensions that relax the Naive Bayes assumption.

5.5 Transition Function

The state transitions made during a classification episode are essentially self loops: once chosen, the instance's class value remains constant. Since we are also keeping track of the current time as part of the state representation, the transition function of the POMDP needs to reflect this constraint. Taking an observation action $a \in A$ in state $\langle c, t \rangle$ leads deterministically to $\langle c, t + t(a) \rangle$, provided the new time stamp does not exceed the deadline:

$$T(\langle c, t \rangle, a, s') = \begin{cases} 1 & \text{if } a \in A,\ t + t(a) \le D,\ \text{and } s' = \langle c,\, t + t(a) \rangle \\ 1 & \text{if } a \in A,\ t + t(a) > D,\ \text{and } s' = s_{absorb} \\ 1 & \text{if } a \in C \text{ and } s' = s_{absorb} \\ 0 & \text{otherwise} \end{cases}$$

The absorbing state is reached either after a classification action has been taken or when the deadline has been reached.
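A sketch of this deterministic transition (same string encoding of actions as in the observation-function snippet; ABSORB is our own marker for the absorbing state):

```python
from typing import Dict, Tuple

State = Tuple[str, int]         # (true class, elapsed time)
ABSORB: State = ("absorb", -1)  # marker for the absorbing state

def transition(state: State, action: str, duration: Dict[str, int],
               deadline: int) -> State:
    """T(s, a): the class component never changes (a self loop); only the
    time stamp advances by the duration of the observation action."""
    if state == ABSORB or action.startswith("classify:"):
        return ABSORB
    c, t = state
    t_next = t + duration[action]   # action is an attribute name here
    return (c, t_next) if t_next <= deadline else ABSORB
```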
5.6 Rewards

The immediate rewards for all attribute observation actions are 0. For the classification actions, we query the loss function to find the cost of misclassifying an instance of true class $x$ as class $y$. Under 0-1 loss, this is a simple indicator function: the loss is 0 if the assigned class label matches the true class and 1 otherwise. More generally, the loss function can be specified by the user of the classifier to reflect domain-specific misclassification costs:

$$R(\langle x, t \rangle, a) = \begin{cases} -L(x, y) & \text{if } a \text{ is the classification action for class } y \\ 0 & \text{if } a \in A \end{cases}$$
For example, false positives and false negatives can be
associated with different penalties. This is the cost that the
classifier will have to optimize against.
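For instance, an asymmetric loss for a hypothetical binary medical task might look like the following sketch (the class names and numbers are made up; the string encoding of actions matches the earlier snippets):

```python
from typing import Dict, Tuple

# Hypothetical asymmetric loss L(true, predicted): a false negative (a sick
# patient labeled healthy) costs far more than a false positive.
LOSS: Dict[Tuple[str, str], float] = {
    ("sick", "sick"): 0.0,      ("sick", "healthy"): 100.0,
    ("healthy", "sick"): 10.0,  ("healthy", "healthy"): 0.0,
}

def reward(state: Tuple[str, int], action: str) -> float:
    """R(s, a): 0 for attribute observations, negative loss for classification."""
    if not action.startswith("classify:"):
        return 0.0
    true_class, _t = state
    predicted = action.split(":", 1)[1]
    return -LOSS[(true_class, predicted)]
```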
Given this mapping of our original active classification
problem onto a POMDP, it is possible to show that an optimal policy for the POMDP corresponds to the optimal active classifier. Finding the optimal solution for a POMDP
is PSPACE-complete (Papadimitriou and Tsitsiklis 1987).
Since the active classification problem exhibits more structure, namely, the state transitions are essentially self-loops,
we are exploring the possibility of constructing a polynomial
time algorithm for this problem.
6. Conclusion and Future Work
The main contribution of this work is to put the traditional
classification problem into a decision theoretic framework,
which balances the cost of acquiring additional information
against penalties for misclassification. Once complete, standard POMDP solution techniques from decision theory can
be adopted to find the optimal policy for active classification.
This framework opens up avenues for both empirical and theoretical work. On the empirical end, we would like to compare the performance of this full-fledged decision-theoretic approach to that of a greedy information-theoretic approach. We have applied standard POMDP algorithms to small examples on simulated data; we have yet to analyze the results and make detailed comparisons with other approaches. Our goal is to establish heuristic characterizations of problems that suggest the applicability of one approach over the other.
The special structure of the active classification problem leads us to believe that it is simpler than a general POMDP, which warrants exploring the existence of a polynomial time algorithm.
Acknowledgements
Special thanks to Dan Bernstein and Ted Perkins for technical discussions and insights; Paul Cohen, Kelly Porpiglia, Charles Sutton, and Brent Heeringa for proofreading the paper; and Louis Theran for graphics.
References
Domingos, P., and Pazzani, M. 1997. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Machine Learning 29(2/3):103–130.

Greiner, R.; Grove, A.; and Roth, D. 1996. Learning active classifiers. In Proceedings of the International Conference on Machine Learning.

Heckerman, D. 1996. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research.

Lovejoy, W. S. 1991. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research 28(1):47–65.

Mitchell, T. M. 1997. Machine Learning. McGraw-Hill.

Moorehead, S.; Simmons, R.; Apostolopoulos, D.; and Whittaker, W. L. 1999. Autonomous navigation field results of a planetary analog robot in Antarctica. In Proceedings of the International Symposium on Artificial Intelligence, Robotics and Automation in Space.

Papadimitriou, C., and Tsitsiklis, J. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3):441–450.

Pedersen, L.; Wagner, M. D.; Apostolopoulos, D.; and Whittaker, W. L. 2001. Autonomous robotic meteorite identification in Antarctica. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation, 4158–4165.

Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1:81–106.

Turney, P. 2000. Types of cost in inductive learning. In Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, 15–21. Stanford, CA.