Neural-Symbolic Learning and Reasoning: AAAI Technical Report WS-12-11

Scalable Inverse Reinforcement Learning via Instructed Feature Construction

Tomas Singliar and Dragos D. Margineantu
Boeing Research & Technology
P.O. Box 3707, M/C 7L-44, Seattle, WA 98124-2207
{tomas.singliar, dragos.d.margineantu}@boeing.com

Abstract

Inverse reinforcement learning (IRL) techniques (Ng & Russell, 2000) provide a foundation for detecting abnormal agent behavior and predicting agent intent by estimating the agent's reward function. Unfortunately, IRL algorithms suffer from the large dimensionality of the reward function space. Meanwhile, most applications that can benefit from an IRL-based approach to assessing agent intent involve interaction with an analyst or domain expert. This paper proposes a procedure for scaling up IRL by eliciting good IRL basis functions from the domain expert. Further, we propose a new paradigm for modeling limited rationality. Unlike traditional models of limited rationality, which assume an agent making stochastic choices with a value function treated as if it were known, we propose that observed irrational behavior is actually due to uncertainty about the cost of future actions. This treatment normally leads to an unnecessarily complicated POMDP formulation; we show that adding a simple noise term to the value function approximation accomplishes the same at a much smaller cost.

Introduction

Intelligently analyzing large volumes of data has become an overwhelming task, and scalable machine learning algorithms offer practical opportunities for automating it. Our focus is intelligence, surveillance, and reconnaissance (ISR), where the interest is in analyzing the behavior of people, who are often well approximated by the rational agent assumption. Inverse reinforcement learning (IRL) methods learn the reward function of an agent from past behavior of the same or similar agents, and thus provide a means for detecting "abnormal" behavior. Once the reward function is approximately known, solving a form of MDP predicts the agent's actions.

Unfortunately, IRL algorithms suffer from the large dimensionality of the reward function space. The traditional solution is to approximate the reward or value function by a linear combination of basis functions, which transforms the problem into one of designing a compact set of good basis functions. Basis functions here fill the role that features play in supervised learning, that is, mapping the perceptual space into a simpler decision space.

Applying experience from the DARPA Bootstrapped Learning program, whose hypothesis is that it should be easier to learn from a benevolent teacher than from the open world, we design a procedure for eliciting the basis functions from the domain expert – the analyst. Starting with a minimal set of basis functions, the system presents the analyst with anomalies, which the analyst typically explains away by indicating domain features that the algorithm should have taken into account. The value function approximation basis is expanded and the process iterates until only true positives remain.

We applied the proposed technique to Moving Target Indicator (MTI) data. MTI is an application of Doppler radar that allows simultaneous tracking of multiple moving objects, typically vehicles.
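The elicitation cycle described above can be summarized as a simple loop. The Python sketch below is purely illustrative: the callables passed in and the analyst-feedback interface are hypothetical placeholders for the components described in the Approach section, not an actual implementation.

```python
def instructed_feature_construction(tracks, analyst, initial_basis,
                                    solve_irl, detect_anomalies, make_basis_function):
    """Iterate until every remaining alert is a true positive.

    `solve_irl`, `detect_anomalies`, and `make_basis_function` stand in for the
    linear-programming learner, the infeasibility-based anomaly detector, and
    the basis-function templates discussed in the Approach section.
    """
    basis = list(initial_basis)                  # minimal starting set of basis functions
    while True:
        weights = solve_irl(tracks, basis)       # fit reward weights for the agent class
        alerts = detect_anomalies(tracks, basis, weights)
        expanded = False
        for alert in alerts:
            feedback = analyst.review(alert)     # hypothetical analyst interface
            if feedback.is_true_positive:
                continue                         # analyst acts on the alert; nothing to explain
            if feedback.exclude_track:
                tracks.remove(alert.track)       # "agent in a different class"
            else:
                # The analyst names a relevant domain feature (e.g., "gas station");
                # it is turned into a new basis function from a template.
                basis.append(make_basis_function(feedback.feature))
                expanded = True
        if not expanded:
            return basis, weights                # only true positives remain
```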
Approach

We employ the IRL paradigm to characterize the behavior of a class of agents compactly, by means of bounds on the reward function. The principal difficulty of IRL is the dimensionality of the reward function space. Traditionally, the dimensionality is reduced by assuming a linear approximation to the reward function:

R(s, a) = Σ_{i=1}^{k} w_i φ_i(s, a),

where the φ_i are basis functions and the w_i their weights. The optimization problems are then greatly simplified and the difficulty shifts to finding good basis functions.

We test our approach on a collection of GMTI (Ground Moving Target Indicator) data. GMTI is a technology based on Doppler or synthetic aperture radar that can capture the movement tracks of a large number of vehicles simultaneously. The data is discretized into a grid and mapped onto the nodes of a state space graph, where each grid cell corresponds to a node, as shown in Figure 1. Cells that were never occupied are discarded to reduce the state space. For simplicity and to avoid data sparsity problems, we assume that all observed action sequences were performed by agents that share the reward function (a class of identical agents).

Figure 1. A state space of the IRL problem for the GMTI task, constructed from road network and track occurrence data.

Further simplifying assumptions that assure tractability are determinism in action outcomes and perfect rationality. This yields a very simple problem solvable by pure linear programming, as opposed to the more complex gradient-descent learning procedure that arises when a distribution penalizing deviation from the optimal action is placed on behavior, e.g., as in (Ziebart et al., 2009).

In order to have a compact yet representative feature set, the best solution is to build the basis functions from the domain expert's knowledge. However, domain experts are rarely inclined to build mathematical abstractions of what they view as common-sense knowledge. A novel aspect of our work is that we elicit this knowledge in the course of a natural, problem-focused "dialog" with the domain expert. The analyst observes the alerts that the system produces and either acts on the true positives or explains to the system why an alert is false. When the analyst provides an explanation, a new basis function is created and the learning problem is solved again. The analyst also has the option of excluding the example (action sequence) from learning as "agent in a different class."

The mechanism for detecting anomalies relies on estimating the reward function of the single agent represented by a track. Tracks are processed in an online fashion. If no reward function can make the behavior consistent with the reward bounds derived from previously seen tracks, an alert is created; this is detected simply by the corresponding linear program being infeasible. Otherwise, the track is added to the repository. An alert can also be created if the reward function satisfies interestingness criteria predefined by the analyst.

Example. In the schematic of Figure 2, two vehicles come into the area, briefly stop at an intersection, and continue their travel. The system flags their meeting as intentional rather than coincidental, because travel incurs a positive cost and shorter routes to their destinations exist. When presented with the alert, the analyst considers the tracks and dismisses the alert, using a menu interface to specify that "it is relevant that the meeting point is at a gas station." This creates a new basis function that takes on a high value at any gas station and will explain away further false alerts of a similar nature. To create a basis function, one of a number of templates is selected and parameterized, e.g., a Gaussian function centered on the feature of interest (here, the gas station), with the standard deviation as the parameter. Note that this example requires that the analyst previously specified a concept of meeting through a basis function expressing proximity.

Figure 2. Actions of two vehicles that stop at the same time at a certain location.
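To make the template mechanism concrete, the sketch below shows one plausible way a Gaussian basis-function template and a proximity ("meeting") basis function could be instantiated. The signatures, parameterization, and coordinates are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def gaussian_basis(center, sigma):
    """Template: a basis function that takes high values near a named map
    feature (e.g., a gas station at `center`), parameterized by its standard
    deviation, as in the gas-station example above."""
    center = np.asarray(center, dtype=float)

    def phi(cell, t):
        # `cell` is the (x, y) grid cell occupied at time step t (t is unused here).
        d2 = float(np.sum((np.asarray(cell, dtype=float) - center) ** 2))
        return float(np.exp(-d2 / (2.0 * sigma ** 2)))

    return phi

def proximity_basis(other_track, sigma):
    """A hypothetical 'meeting' concept: proximity to another track's position
    at the same time step."""
    def phi(cell, t):
        other = np.asarray(other_track[t], dtype=float)
        d2 = float(np.sum((np.asarray(cell, dtype=float) - other) ** 2))
        return float(np.exp(-d2 / (2.0 * sigma ** 2)))

    return phi

# Usage: the analyst indicates "the meeting point is at a gas station", and a
# basis function centered on that station is added to the current basis.
phi_gas = gaussian_basis(center=(12, 47), sigma=2.0)   # coordinates are illustrative
```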
Relaxing the perfect rationality assumption

Assuming perfect rationality with a small set of basis functions quickly leads to infeasible formulations – the variation in agent behavior in GMTI data is quite wide. We therefore need some allowance for mildly suboptimal behavior that does not raise alarms, and a way to quantify the degree of "irrationality." To this end, we posit that the true reward of each action to the agent (which is unobservable to us) is in fact the expression above, a straight linear combination of basis functions, while the reward the agent believes it will experience is augmented as follows:

R̂(s, a, t) = n(s, a, t) + Σ_{i=1}^{k} w_i φ_i(s, a),

where the hat over R denotes that this is the reward that agent t believes, at the time of planning its action, it will receive. The term n is the "valuation noise" and can take on arbitrary values. Of course, introducing this many degrees of freedom makes the problem underdetermined and renders all observed behavior rational under some value of n. Therefore, we must modify the optimization problem with a regularizer on n. Using the convenient L1 norm yields the final optimization objective

maximize over w, n:  [planning margin] + C ||n(·, ·, ·)||_1
subject to:  the behavior is perfectly rational under R̂,

where C < 0 is a constant trading off the planning margin (how "rational" the agents are) against the accuracy of cost perception. The planning margin and the rationality constraints are modeled closely on the treatment in (Ng & Russell, 2000), with the exception that we use a finite-horizon formulation, which better fits the task of analyzing finite instances of recorded behavior than assuming a continuously executing agent maximizing an infinite-horizon discounted criterion. Our optimization problem can be solved efficiently as a linear program. Note also that there is one set of weights w for the entire class of agents, while the noise variables are defined per track (observed behavior sequence).

Detection occurs by finding the m tracks maximizing the "total delusion"

D(t_i) = Σ_{(s,a) ∈ t_i} |n(s, a, i)|,

which is the amount of action-cost perception noise that must be assumed in order to explain the behavior as conformant with the value of w shared by the agent class.
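To make the detection step concrete, the sketch below evaluates the believed reward R̂ and ranks tracks by total delusion, assuming the noise values n(s, a, i) have already been recovered by the linear program. The data structures and function names are illustrative, not the authors' implementation; within the LP itself, the absolute values in the L1 penalty would typically be handled by the standard split into nonnegative positive and negative parts.

```python
import numpy as np

def believed_reward(cell, t, w, basis, n_sat):
    """R_hat(s, a, t) = n(s, a, t) + sum_i w_i * phi_i(s, a): the reward the
    agent is assumed to believe it will receive; `n_sat` is the valuation-noise
    value recovered for this (state, action) on this track."""
    return n_sat + sum(w_i * phi(cell, t) for w_i, phi in zip(w, basis))

def total_delusion(noise_for_track):
    """D(t_i) = sum over (s, a) in track t_i of |n(s, a, i)|."""
    return float(np.sum(np.abs(noise_for_track)))

def top_m_alerts(noise_by_track, m):
    """Flag the m tracks that need the most cost-perception noise to appear
    rational under the class-wide weights w."""
    scores = {tid: total_delusion(n) for tid, n in noise_by_track.items()}
    return sorted(scores, key=scores.get, reverse=True)[:m]

# Toy usage with per-step noise values as the linear program might return them:
noise_by_track = {"track_07": np.array([0.0, 0.1, 0.0]),
                  "track_12": np.array([1.4, -0.9, 2.2])}
print(top_m_alerts(noise_by_track, m=1))   # -> ['track_12']
```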
Experimental Validation

An example of anomalous behavior we detected is a track that circles some location. This behavior cannot be considered a rational plan under any set of weights and the baseline assumption that all tracks have the goal of traveling from origin to destination while minimizing a combination of time and distance.

References

Abbeel, P., and Ng, A.Y. 2004. Apprenticeship learning via inverse reinforcement learning. Proceedings of the Twenty-First International Conference on Machine Learning, ICML 2004.

Baker, C.L., Tenenbaum, J.B., and Saxe, R.R. 2007. Goal inference as inverse planning. Proceedings of the Twenty-Ninth Annual Conference of the Cognitive Science Society, pp. 779-784.

Baker, C.L., Saxe, R.R., and Tenenbaum, J.B. 2009. Action understanding as inverse planning. Cognition, 113, pp. 329-349.

Ng, A.Y., and Russell, S. 2000. Algorithms for inverse reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning, ICML 2000.

Ramachandran, D., and Amir, E. 2007. Bayesian inverse reinforcement learning. Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI 2007, pp. 2586-2591.

Ziebart, B.D., Maas, A., Bagnell, J.A., and Dey, A.K. 2008. Maximum entropy inverse reinforcement learning. Proceedings of AAAI 2008, pp. 1433-1439.

Ziebart, B.D., Ratliff, N., Gallagher, G., Mertz, C., Peterson, K., Bagnell, J.A., Hebert, M., Dey, A.K., and Srinivasa, S. 2009. Planning-based prediction for pedestrians. Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3931-3936.