Neural-Symbolic Learning and Reasoning
AAAI Technical Report WS-12-11
Scalable Inverse Reinforcement Learning via
Instructed Feature Construction
Tomas Singliar
Dragos D Margineantu
Boeing Research & Technology
P.O. Box 3707, M/C 7L-44
Seattle, WA 98124-2207
{tomas.singliar, dragos.d.margineantu}@boeing.com
Abstract

Inverse reinforcement learning (IRL) techniques (Ng & Russell, 2000) provide a foundation for detecting abnormal agent behavior and for predicting agent intent by estimating the agent's reward function. Unfortunately, IRL algorithms suffer from the high dimensionality of the reward function space. Meanwhile, most applications that can benefit from an IRL-based approach to assessing agent intent involve interaction with an analyst or domain expert. This paper proposes a procedure for scaling up IRL by eliciting good IRL basis functions from the domain expert. Further, we propose a new paradigm for modeling limited rationality. Unlike traditional models of limited rationality, which assume an agent making stochastic choices with the value function treated as known, we propose that observed irrational behavior is actually due to uncertainty about the cost of future actions. This treatment normally leads to a POMDP formulation, which is unnecessarily complicated; we show that adding a simple noise term to the value function approximation accomplishes the same at a much smaller cost.

Introduction

Intelligently analyzing large volumes of data has become a challenging, often overwhelming task. Scalable machine learning algorithms offer practical opportunities for automating it. Our focus is intelligence, surveillance, and reconnaissance (ISR), where the interest is in analyzing the behavior of people, who are often well approximated by the rational agent assumption.

Inverse reinforcement learning (IRL) methods learn the reward function of an agent from the past behavior of the same or similar agents and thus provide a means for detecting “abnormal” behavior. Once the reward function is approximately known, solving a form of MDP predicts the agent's actions. Unfortunately, IRL algorithms suffer from the high dimensionality of the reward function space. The traditional solution is to approximate the reward or value function by a linear combination of basis functions, which transforms the problem into one of designing a compact set of good basis functions. Basis functions here fill the role that features play in supervised learning, namely mapping the perceptual space into a simpler decision space. Applying experience from the DARPA Bootstrapped Learning program, whose hypothesis is that it should be easier to learn from a benevolent teacher than from the open world, we design a procedure for eliciting the basis functions from the domain expert, the analyst. Starting with a minimal set of basis functions, the system presents the analyst with anomalies, which the analyst then typically explains away by indicating domain features that the algorithm should have taken into account. The value function approximation basis is expanded, and the process iterates until only true positives remain.

We applied our proposed technique to Moving Target Indicator (MTI) data. MTI is an application of Doppler radar which allows simultaneous tracking of multiple moving objects, typically vehicles.
Approach
We employ the IRL paradigm to characterize the behavior
of a class of agents compactly, by means of bounds on the
reward function. The principal difficulty of IRL is the
dimensionality of the reward function space. Traditionally,
the dimensionality is reduced by assuming a linear approximation to the reward function,

R(s, a) = \sum_{i=1}^{k} w_i \phi_i(s, a),

where the \phi_i are basis functions and the w_i are weights to be learned. The optimization problems are then greatly simplified, and the difficulty shifts to finding good basis functions.

When the analyst provides an explanation, a new basis function is created and the learning problem is solved again. The analyst also has the option of excluding the example (action sequence) from learning as an “agent in a different class”.

The mechanism for detecting anomalies relies on estimating the reward function of a single agent, represented by its track. Tracks are processed in an online fashion. If no reward function can make the behavior consistent with the reward bounds derived from previously seen tracks, an alert is created; this is detected simply by the corresponding linear program being infeasible. Otherwise, the track is added to the repository. An alert can also be created if the reward function satisfies interestingness criteria predefined by the analyst.
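The infeasibility test can be sketched as a small linear program. The following is a minimal illustration rather than the exact formulation used by the system: it assumes deterministic dynamics, that alternative paths to the observed destination have been enumerated, and simple box bounds on the weights instead of bounds derived from previous tracks; the names track_features and is_anomalous are illustrative.

import numpy as np
from scipy.optimize import linprog

def track_features(track, basis_fns):
    """Sum each basis function over the (state, action) pairs of a track."""
    return np.array([sum(phi(s, a) for s, a in track) for phi in basis_fns])

def is_anomalous(observed, alternatives, basis_fns, w_bounds=(0.0, 1.0)):
    """Raise an alert when no weight vector within the bounds makes the observed
    track at least as rewarding as every enumerated alternative path, i.e. when
    the feasibility LP below has no solution."""
    if not alternatives:
        return False
    f_obs = track_features(observed, basis_fns)
    k = len(basis_fns)
    # (f_alt - f_obs) . w <= 0  for every alternative path
    A_ub = np.array([track_features(alt, basis_fns) - f_obs for alt in alternatives])
    b_ub = np.zeros(len(alternatives))
    # Exclude the trivial solution w = 0 by requiring the weights to sum to one.
    A_eq, b_eq = np.ones((1, k)), np.array([1.0])
    res = linprog(c=np.zeros(k), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[w_bounds] * k)
    return not res.success  # infeasible LP: behavior inconsistent, raise an alert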
We test our approach on a collection of GMTI (Ground Moving Target Indicator) data. GMTI is a technology based on Doppler or synthetic-aperture radar which is capable of capturing the movement tracks of a large number of vehicles simultaneously. The data are discretized into a grid and mapped onto the nodes of a state-space graph, where each grid cell corresponds to a node, as shown in Figure 1. Cells that were never occupied are discarded to reduce the state space. For simplicity, and to avoid data sparsity problems, we assume that all observed action sequences were performed by agents that share the reward function (a class of identical agents).
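As an illustration of this preprocessing step, the following sketch bins track points into grid cells and connects occupied neighboring cells into a state-space graph; the cell size and the (x, y) track format are assumptions made for the example, not details taken from the system.

from collections import defaultdict

def build_state_graph(tracks, cell_size):
    """Map tracks onto occupied grid cells and join 4-neighboring cells."""
    occupied = set()
    for track in tracks:
        for x, y in track:
            occupied.add((int(x // cell_size), int(y // cell_size)))
    graph = defaultdict(set)           # node -> set of neighboring nodes
    for (i, j) in occupied:
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbor = (i + di, j + dj)
            if neighbor in occupied:   # keep only cells that were ever occupied
                graph[(i, j)].add(neighbor)
    return graph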
Example. In the schematic of Figure 2, two vehicles come into the area, briefly stop at an intersection, and continue their travel. The system flags their meeting as intentional, rather than coincidental, because travel incurs a positive cost and shorter routes to their destinations exist. When presented with the alert, the analyst considers the tracks and dismisses the alert, using a menu interface to specify that “it is relevant that the meeting point is at a gas station”. This creates a new basis function which takes on a high value at any gas station and will explain away further false alerts of a similar nature. To create a basis function, one of a number of templates is selected and parameterized; for example, a Gaussian function centered on the feature of interest (here, the gas station), with the standard deviation as the parameter. Note that this example requires that the analyst previously specified a concept of meeting through a basis function expressing proximity.
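A basis-function template of the kind just described might look as follows. This is a hedged sketch in which the state is assumed to carry planar coordinates; gaussian_basis and gas_station_xy are illustrative names, not part of the system.

import math

def gaussian_basis(center, sigma):
    """Template: a Gaussian bump centered on a feature of interest,
    parameterized by its standard deviation."""
    cx, cy = center
    def phi(state, action=None):
        x, y = state                                 # assumed state layout
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        return math.exp(-d2 / (2.0 * sigma ** 2))    # high value near the feature
    return phi

# e.g., phi_gas = gaussian_basis(center=gas_station_xy, sigma=50.0)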
Figure 1. A state space of the IRL problem for the GMTI task
constructed from road network and track occurrence data.
Further simplifying assumptions that ensure tractability are determinism in action outcomes and perfect rationality. This yields a very simple problem solvable by pure linear programming, as opposed to the more complex gradient-descent learning procedure that arises when a distribution penalizing deviation from the optimal action is placed on behavior, e.g., as in (Ziebart et al., 2009).
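Concretely, under determinism and perfect rationality the constraints behind such a linear program can be written, in the spirit of (Ng & Russell, 2000) though not necessarily in the exact form used here, as a requirement that each observed track t_obs collect at least as much reward as any alternative path t' between the same endpoints:

\sum_{(s,a) \in t_{\mathrm{obs}}} \sum_{i=1}^{k} w_i \phi_i(s, a) \;\ge\; \sum_{(s,a) \in t'} \sum_{i=1}^{k} w_i \phi_i(s, a) \qquad \text{for every alternative path } t'.

Because these constraints are linear in w, both feasibility checking and weight estimation reduce to linear programming.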
In order to obtain a compact yet representative feature set, the best solution is to build the basis functions from the domain expert's knowledge. However, domain experts are rarely inclined to build mathematical abstractions of what they view as common-sense knowledge. A novel aspect of our work is that we elicit this knowledge in the course of a natural, problem-focused “dialog” with the domain expert. The analyst observes the alerts that the system produces and either acts on the true positives or explains to the system why an alert is false; each explanation is turned into a new basis function, as described above.
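This elicitation cycle can be summarized in a few lines. In the sketch below, fit_and_alert, explanation_to_basis_fn, and ask_analyst are hypothetical stand-ins for the IRL fit, the menu-driven template parameterization, and the analyst interaction, respectively; none of them are names used by the system.

def elicit_basis_functions(tracks, basis_fns, fit_and_alert,
                           explanation_to_basis_fn, ask_analyst):
    """Iterate until the analyst accepts all remaining alerts as true positives."""
    while True:
        alerts = fit_and_alert(tracks, basis_fns)   # tracks inconsistent with learned bounds
        explanations = [e for e in (ask_analyst(a) for a in alerts) if e is not None]
        if not explanations:                        # only true positives remain
            return basis_fns, alerts
        basis_fns = basis_fns + [explanation_to_basis_fn(e) for e in explanations]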
Figure 2. Actions of two vehicles that stop at the
same time at a certain location.
Relaxing the perfect rationality assumption
Assuming perfect rationality with a small set of basis functions quickly leads to infeasible formulations, because the variation in agent behavior in GMTI data is quite wide. Thus we need some allowance for mildly suboptimal behavior that does not raise alarms, and a way to quantify the degree of “irrationality”. To this end, we posit that the true reward of each action to the agent (which is unobservable to us) is in fact the expression above, a straight linear combination of basis functions. However, the reward the agent believes it will experience is augmented as follows:
\hat{R}(s, a, t) = n(s, a, t) + \sum_{i=1}^{k} w_i \phi_i(s, a),
where the hat over R denotes that this is the reward that agent t, at the time of planning its actions, believes it will receive. The term n is the “valuation noise” and can take on arbitrary values. Of course, introducing this many degrees of freedom makes the problem under-determined and renders all observed behavior rational under some value of n. Therefore, we must modify the optimization problem with a regularizer on n. Using the convenient L1 norm yields the final optimization objective
\max_{w,\,n} \quad [\text{planning margin}] + C \, \| n(\cdot, \cdot, \cdot) \|_1
\text{subject to} \quad \text{the behavior is perfectly rational}, \qquad C < 0.
Here C is a constant trading off the planning margin (how “rational” the agents are) against the accuracy of cost perception. The planning margin and the constraints on rationality are modeled closely on the treatment in (Ng & Russell, 2000), with the exception that we use a finite-horizon formulation, which better fits the task of analyzing finite instances of recorded behavior than assuming a continuously executing agent maximizing an infinite-horizon discounted criterion.
Our optimization problem can be solved efficiently as a linear program. Also note that there is one set of weights w for the entire class of agents, while the noise variables are defined per track (observed behavior sequence).
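One way such an L1-regularized program could be set up is sketched below, under simplifying assumptions rather than as the actual implementation: the valuation noise is attached only to the steps of each observed track, alternative paths are assumed to be enumerated, cvxpy is used for brevity (the problem remains a linear program), and the penalty weight C is taken positive here, playing the role of the negative constant above. fit_class_weights, basis_fns, and alternatives are illustrative names.

import numpy as np
import cvxpy as cp

def fit_class_weights(tracks, alternatives, basis_fns, C=1.0):
    """tracks: observed (s, a) sequences for one agent class.
    alternatives[j]: alternative (s, a) sequences for track j.
    Returns the shared weights, the per-track noise, and the planning margin."""
    k = len(basis_fns)
    feats = lambda tr: np.array([[phi(s, a) for phi in basis_fns] for s, a in tr])
    w = cp.Variable(k)
    margin = cp.Variable()
    noise = [cp.Variable(len(tr)) for tr in tracks]      # one n(s, a, t) per step
    cons = [cp.sum(w) == 1, w >= 0]                      # normalize the weights
    for j, tr in enumerate(tracks):
        perceived = feats(tr) @ w + noise[j]             # perceived reward along the track
        for alt in alternatives[j]:
            # Observed track must beat every alternative by at least the margin.
            cons.append(cp.sum(perceived) >= cp.sum(feats(alt) @ w) + margin)
    objective = cp.Maximize(margin - C * sum(cp.norm1(nj) for nj in noise))
    cp.Problem(objective, cons).solve()
    return w.value, [nj.value for nj in noise], margin.value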
Detection occurs by finding the m tracks maximizing the value of the “total delusion”,

D(t_i) = \sum_{(s,a) \in t_i} |n(s, a, i)|,

representing the amount of action-cost perception noise that must be assumed in order to explain the behavior as conformant with the value of w shared by the agent class.
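The ranking step is then a few lines of bookkeeping. In this sketch, noise_per_track is assumed to hold the fitted per-step noise values for each track (one array per track), as in the fitting sketch above, and top_delusion_tracks is an illustrative name.

import numpy as np

def top_delusion_tracks(tracks, noise_per_track, m):
    """Return the m tracks with the largest total delusion D(t_i)."""
    delusion = [float(np.abs(n).sum()) for n in noise_per_track]
    ranked = sorted(range(len(tracks)), key=lambda i: delusion[i], reverse=True)
    return [(tracks[i], delusion[i]) for i in ranked[:m]]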
Experimental Validation

An example of anomalous behavior we detected is a track that circles some location. This behavior cannot be considered a rational plan under any set of weights and the baseline assumption that all tracks have the goal of traveling from origin to destination while minimizing a combination of time and distance.

References

Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004).
Baker, C. L., Tenenbaum, J. B., and Saxe, R. R. 2007. Goal inference as inverse planning. In Proceedings of the Twenty-Ninth Annual Conference of the Cognitive Science Society, pp. 779-784.
Baker, C. L., Saxe, R. R., and Tenenbaum, J. B. 2009. Action understanding as inverse planning. Cognition 113: 329-349.
Ng, A. Y., and Russell, S. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000).
Ramachandran, D., and Amir, E. 2007. Bayesian inverse reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2007), pp. 2586-2591.
Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. In Proceedings of AAAI 2008, pp. 1433-1439.
Ziebart, B. D., Ratliff, N., Gallagher, G., Mertz, C., Peterson, K., Bagnell, J. A., Hebert, M., Dey, A. K., and Srinivasa, S. 2009. Planning-based prediction for pedestrians. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3931-3936.