CS 6702: Topics in Computational Sustainability
Reaction Paper
Inverse reinforcement learning and discrete decision choice problem of pastoralists
Nikhil Kejriwal
2/17/2011
This paper is inspired by a survey collected by the USAID Global Livestock Collaborative
Research Support Program (GL CRSP) “Improving Pastoral Risk Management on East African
Rangelands” (PARIMA; Barrett et al., April 2008). The project focuses on six locations in
Northern Kenya and Southern Ethiopia in one intact semi-arid and arid livestock production and
marketing region.
The survey was structured to shed light on the following key issues: (1) pastoralist risk exposure
and management behavior, (2) livestock marketing, (3) rural financial institutions, and (4) public
service delivery systems. In addition, the survey contains a variety of standard household and
individual survey questions concerning household composition, labor allocation, income,
expenditures, etc. The data collected aim to provide insight into the cross-sectional (across
individuals, households, communities, and subregions) and intertemporal (seasonal and
interannual) variation in the risk and uncertainty faced by pastoralists (Barrett et al., April 2008).
This paper focuses on a subsection of the problem: the decision-making process of pastoralists as
they manage risk and maintain livestock based on expectations of coming weather patterns and on
the availability of, and access to, common resources.
This applied problem centers on explaining the movement over time and space of animal herds in
the arid and semi-arid lands (ASAL) of Northern Kenya and Southern Ethiopia. In this region crop
agriculture is difficult, and hence pastoralists (animal herders) have adopted the livelihood
strategy of managing large livestock herds (camels, cattle, sheep, and goats), which are both their
primary asset/wealth base and their primary source of income (through livestock transactions,
milk, blood, meat, and skins). The movement of herders is driven by highly variable rainfall. In
between winters and summers there are dry seasons with virtually no precipitation.
At such times the males in the household migrate with the herds to remote water points, dozens if
not hundreds of kilometers away from the main town location. Every few years these dry seasons
can be more severe than usual and turn into droughts, and at such times the pastoralists can suffer
greatly, losing large portions of their herds. Even during relatively mild dry seasons there are a
number of reasons to be interested in the herders' spatiotemporal movement problem: we might
ask whether their herd allocation choices are optimal given the state of the environment, and we
might worry about the environmental degradation caused by herd grazing pressure and about the
inter-tribal violence and raids that occur during migration seasons. We are interested in
understanding herders' choices in order to formulate effective policies to correct such issues
(more police, more control of grazing, drilling more water points, etc.) and to gauge the effects of
environmental changes (e.g., more variable or lower mean rainfall) on the tribes. Hence the
problem can be formulated as developing a robust predictive model of pastoralist movement
patterns based on observed states of the environment. Ideally, we would like to forecast the
outcomes of policy experiments that involve changes in state variables, even when such state
changes are not observed in the data used for estimation.
A typical structural modelling exercise in econometrics is concerned with a similar problem.
Structural Estimation (SE), as used by economists and statisticians, is a technique for estimating
the deep "structural" parameters of theoretical models (Aguirregabiria and Mira, 2010); that is, it
estimates the parameters of the agent's preferences (utility function) that best match the observed
data. However, the typical approach involves imposing a particular behavioural model on the
agent. While this can be attractive, since the estimated parameters are interpretable, it means that
there are
definite limits on the complexity of the state space and, in particular, the action space that such
methods can handle.
The technical challenge here is that herd movement is a large and complex dynamic
discrete choice problem: pastoralists must decide which particular water points to move to, and
how to allocate each of their animal types between the base camp location and remote water
points, as a function of a large and complex state space. Application of the typical econometric
modeling techniques would seem to run into two immediate challenges: (i) the curse of modeling
(a particular, explicit behavioural model describing the mechanics of pastoralist decision-making
might be difficult to specify correctly), and (ii) the curse of dimensionality (even if a behavioural
model can be specified, it can be difficult to estimate the parameters of a sufficiently complex
model without virtually unlimited data).
Keane and Wolpin (1997) point out several limitations of the structural estimation approach. One
approach to structural estimation (SE) is the full-solution approach, based on completely solving
the optimization problem confronting the agents; the parameters of the optimal decision rules are
estimated by iterating over the complete feasible set. This is computationally very expensive and
limits the complexity of the structural models that can be handled. Another approach to SE is
based on the first-order conditions (FOCs) of the optimization problem, which is generally less
computationally demanding than the full-solution method, but it is unsuitable for agents with
discrete choice variables. A reduced-form decision rule may also be used to structure the model,
but reduced forms are in most cases not versatile enough to capture the dynamics of a complex
system.
Russell (1998) suggested that Inverse Reinforcement Learning (IRL) can provide a good model
for animal and human learning in uncertain environments where the reward function is unknown.
Unlike the SE method, in the IRL method the parameters of a multi-attribute reward function are
not determined a priori, and no restrictions are imposed on decision rules. It is, in a sense, a
"model-free" approach: it derives a reward function directly as a function of the states of the
world and the actions taken, and hence puts much less structure on the agent's decision problem.
The model can be thought of as a Markov Decision Process (MDP) in which the agent takes
actions in order to maximize the expected reward that it receives from the environment. Once the
reward function is recovered, it provides a description of how agents map states and actions into
payoffs. We can then carry out policy simulations that vary elements of the state space and
predict how agents would respond. Such a model allows us to simulate changes in state (e.g.,
policy interventions in ASAL Africa) and predict the actions that the agents (e.g., pastoralists)
would take in response.
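
To make this framing concrete, the sketch below sets up a toy two-location herd allocation MDP in Python and solves it by value iteration for a hypothesized reward. The states, dynamics, costs, and probabilities are invented purely for illustration; they are not the PARIMA data or any estimated model.

# Toy illustration only: a two-location pastoralist MDP with invented dynamics and rewards.
# State: (forage at the base camp, forage at the remote water point), each in {0, 1, 2}.
# Action: 0 = keep the herd at the base camp, 1 = migrate to the remote water point.
LEVELS = 3
STATES = [(b, r) for b in range(LEVELS) for r in range(LEVELS)]
ACTIONS = [0, 1]
GAMMA = 0.95
MIGRATION_COST = 0.5  # hypothetical penalty for moving the herd

def reward(state, action):
    """Hypothesized reward: forage available at the chosen location, minus any migration cost."""
    base, remote = state
    return base if action == 0 else remote - MIGRATION_COST

def transition(state, action):
    """Return {next_state: probability}: grazed forage tends to fall, rested forage to recover."""
    def step(level, grazed):
        out = {}
        nxt = max(level - 1, 0) if grazed else min(level + 1, LEVELS - 1)
        p = 0.7 if grazed else 0.6
        out[nxt] = out.get(nxt, 0.0) + p
        out[level] = out.get(level, 0.0) + (1.0 - p)
        return out
    base, remote = state
    probs = {}
    for b, pb in step(base, action == 0).items():
        for r, pr in step(remote, action == 1).items():
            probs[(b, r)] = probs.get((b, r), 0.0) + pb * pr
    return probs

def q_value(state, action, V):
    return reward(state, action) + GAMMA * sum(p * V[s2] for s2, p in transition(state, action).items())

# Value iteration under the hypothesized reward; the resulting policy says, for each forage
# configuration, whether the herd should stay at the base camp or migrate.
V = {s: 0.0 for s in STATES}
for _ in range(500):
    V = {s: max(q_value(s, a, V) for a in ACTIONS) for s in STATES}
policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
print(policy)

IRL inverts this exercise: instead of positing the reward and computing the policy, it starts from the observed (assumed near-optimal) behaviour and recovers a reward function consistent with it.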
The basic idea behind IRL is the "problem of extracting a reward function given observed,
optimal behavior" (Ng and Russell, 2000). We are usually given "(1) measurements of an agent's
behavior over time, in a variety of circumstances, (2) if needed, measurements of the sensory
inputs to that agent, (3) if available, a model of the environment," with which we determine "the
reward function being optimized" (Ng and Russell, 2000). IRL is particularly
relevant in studying animal and human behavior, because in such cases we often observe data
which we can assume is the result of a learning process, and hence is in some sense optimal. So
the reward function is taken as an unknown to be ascertained through empirical investigation.
IRL is a form of apprenticeship learning, where we can observe an expert demonstrating the task
that we want to learn to perform. Solving the IRL problem generates a reward function, from
observations of the expert's behavior, that can be used to teach the agent the optimal actions in
given states of the environment. It is particularly suitable for tasks such as walking, diving, and driving, where the
designer of an artificial system may have only an intuitive idea of the appropriate reward function
to be supplied to a Reinforcement Learning (RL) algorithm in order to achieve “desirable”
behavior. Thus instead of learning direct control functions from experts explicitly, it may be
better to solve the inverse reinforcement learning problem to learn simpler reward functions.
“Such reward functions usually are simple monotonic functions of the current sensory inputs, and
thus may be much simpler than the direct decision mapping" (Abbeel & Ng, 2004). The reward
function can thus be thought of as a compact and indirect, but robust, representation of expert behaviour.
Ng and Russell (2000) give a solution to the IRL problem by formulating it as a Linear
Programming (LP) problem, and show that standard LP techniques yield efficient solutions.
However, the solutions are often not unique. They alleviate this problem by adding ad hoc criteria
with a natural interpretation, such as maximizing the difference in value between the best and the
next-best actions. In a large or infinite state space, finding a solution becomes an
infinite-dimensional problem; the remedy is then to use a linear function approximation for the reward.
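
For the finite-state case, that linear program can be sketched roughly as follows, assuming the transition matrices and the expert's (assumed optimal) policy are known. The penalty weight, the reward bound, and all variable names are illustrative choices rather than values from the paper.

import numpy as np
from scipy.optimize import linprog

def lp_irl(P, policy, gamma=0.9, l1_penalty=1.0, r_max=1.0):
    """Finite-state IRL posed as a linear program (after Ng & Russell, 2000) -- illustrative sketch.

    P: (n_actions, n_states, n_states) array, P[a, s, t] = Pr(t | s, a)
    policy: length-n_states array of the expert's observed (assumed optimal) actions
    Returns an estimated reward vector over states.
    """
    n_actions, n_states, _ = P.shape
    # Transition matrix under the expert policy, and its discounted resolvent.
    P_pi = np.array([P[policy[s], s, :] for s in range(n_states)])
    inv = np.linalg.inv(np.eye(n_states) - gamma * P_pi)

    # Decision variables x = [R (n), t (n), u (n)]: t_s lower-bounds the margin by which the
    # expert's action beats every alternative in state s, and u_s bounds |R_s| for the L1 penalty.
    n = n_states
    c = np.concatenate([np.zeros(n), -np.ones(n), l1_penalty * np.ones(n)])  # linprog minimizes

    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            if a == policy[s]:
                continue
            row = (P[policy[s], s, :] - P[a, s, :]) @ inv  # value-margin coefficients on R
            # Expert optimality: row @ R >= 0
            A_ub.append(np.concatenate([-row, np.zeros(n), np.zeros(n)]))
            b_ub.append(0.0)
            # Margin variable: t_s <= row @ R
            t_row = np.zeros(n)
            t_row[s] = 1.0
            A_ub.append(np.concatenate([-row, t_row, np.zeros(n)]))
            b_ub.append(0.0)
    for s in range(n_states):
        e = np.zeros(n)
        e[s] = 1.0
        A_ub.append(np.concatenate([e, np.zeros(n), -e]))   # R_s - u_s <= 0
        b_ub.append(0.0)
        A_ub.append(np.concatenate([-e, np.zeros(n), -e]))  # -R_s - u_s <= 0
        b_ub.append(0.0)

    bounds = [(-r_max, r_max)] * n + [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds, method="highs")
    return res.x[:n]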
Abbeel and Ng (2004) suggest expressing the reward function as a "linear combination of known
features". The reward function is given as R(s) = w · φ(s) = Σ_i w_i φ_i(s), where w is a vector of
weights and φ maps each state s to a k-dimensional vector of features taking binary values in
{0, 1}. They show that the weights w produced by their algorithm induce policies that perform
nearly as well as the expert's, whether or not the recovered reward function is the expert's true
one, and that the algorithm converges quickly. They also do not need to assume that the expert is
actually succeeding all the time in maximizing its own reward function.
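
As a minimal illustration of the feature-expectation idea (with invented function names, not the authors' code), the linear reward R(s) = w · φ(s) and a Monte Carlo estimate of the discounted feature expectations μ(π) = E[Σ_t γ^t φ(s_t)], which their algorithm matches between apprentice and expert, might look like this:

import numpy as np

def reward(w, phi, s):
    """Linear reward R(s) = w . phi(s) for a k-dimensional feature map phi."""
    return float(np.dot(w, phi(s)))

def feature_expectations(trajectories, phi, gamma=0.95):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t)] from sampled trajectories.

    trajectories: list of state sequences generated by a policy (the expert's or the apprentice's)
    phi: feature map from a state to a length-k vector of features
    """
    k = len(phi(trajectories[0][0]))
    mu = np.zeros(k)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * np.asarray(phi(s), dtype=float)
    return mu / len(trajectories)

Their guarantee is stated in terms of these vectors: if the apprentice's feature expectations are close to the expert's, then for any bounded weight vector the two policies achieve similar expected reward.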
Ramachandran and Amir (2007) take a Bayesian approach to solving the IRL problem. They
assume that the modeler has the ability to put some prior belief on the reward function, and then
"consider the actions of the expert as evidence that we use to update a prior on reward functions."
This allows the modeler to use data from multiple experts, does not require assuming the expert is
infallible, and does not require a completely specified optimal policy. The basic idea behind their
approach is to derive a posterior distribution for the rewards from a prior distribution and a
probabilistic model of the expert’s actions given the reward function. Given data, they use Bayes
Rule to update prior beliefs on the distribution over reward functions.
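
The core of the approach can be sketched as follows, assuming a hypothetical helper solve_q (not shown) that solves the MDP for the Q-values of a candidate reward; the Boltzmann likelihood and the random-walk Metropolis step below are a simplified stand-in for the authors' PolicyWalk sampler.

import numpy as np

def log_posterior(R, demos, prior_logpdf, solve_q, alpha=1.0):
    """Unnormalized log posterior over reward vectors, in the spirit of Ramachandran & Amir (2007).

    demos: list of (state, action) pairs observed from the expert
    prior_logpdf: function giving the log prior density of a reward vector R
    solve_q: assumed (hypothetical) helper returning Q(s, a) for the MDP with reward R,
        as an (n_states, n_actions) array
    alpha: confidence that the expert behaves (nearly) optimally
    """
    Q = solve_q(R)
    # Boltzmann likelihood: the expert picks action a in state s with probability
    # proportional to exp(alpha * Q(s, a)).
    log_lik = sum(alpha * Q[s, a] - np.log(np.sum(np.exp(alpha * Q[s]))) for s, a in demos)
    return log_lik + prior_logpdf(R)

def metropolis_step(R, log_post, step=0.1, rng=None):
    """One random-walk Metropolis update on the reward vector (PolicyWalk refines this idea)."""
    rng = rng or np.random.default_rng()
    R_new = R + step * rng.standard_normal(R.shape)
    if np.log(rng.uniform()) < log_post(R_new) - log_post(R):
        return R_new
    return R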
Neu and Szepesvari (2007) claim to improve on Abbeel and Ng (2004) by requiring weaker
assumptions (on the specification of the reward function's features), through the use of a gradient
algorithm. The paper nicely expresses the difference between direct and indirect approaches. The
basic idea is that in apprenticeship learning, we want to learn optimal actions from an expert.
Direct approaches try to learn the policy itself, usually by optimizing some loss function that
measures deviations from the expert's choices; the disadvantage is that such an approach has
difficulty learning about the policy in parts of the state space where few actions are observed. In
the indirect methods (e.g., IRL), it is assumed that we observe an
expert taking optimal actions, and try to learn the unknown reward function of the expert, which
can then provide a more succinct description of the expert’s decision problem, and can be
mapped back into policies. Their algorithm combines the direct and indirect approaches by
minimizing a loss function that penalizes deviations from the expert’s policy like in supervised
learning, but the policy is obtained by tuning a reward function and solving the resulting MDP,
instead of finding the parameters of a policy directly. One general problem in IRL that they point
out is the scale problem: if R is a valid reward function for the observed behaviour, then so is λR
for any λ > 0.
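
A rough sketch of this hybrid idea is given below, with a hypothetical helper solve_soft_policy standing in for the step that solves the MDP induced by the current reward weights; the squared loss and the finite-difference gradient are simplifications (the paper derives analytic gradients).

import numpy as np

def hybrid_loss(w, demos, features, solve_soft_policy):
    """Loss in the spirit of Neu & Szepesvari (2007): penalize deviations of the policy
    induced by the reward r_w = features @ w from the expert's observed choices.

    demos: list of (state, action) pairs observed from the expert
    solve_soft_policy: assumed (hypothetical) helper returning pi(a|s) as an
        (n_states, n_actions) array for the MDP with reward vector features @ w
    """
    pi = solve_soft_policy(features @ w)
    # Squared deviation between the induced policy and the expert's choices.
    return sum((1.0 - pi[s, a]) ** 2 for s, a in demos)

def gradient_step(w, loss_fn, lr=0.05, eps=1e-4):
    """Crude finite-difference gradient descent step on the reward weights."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        grad[i] = (loss_fn(w + e) - loss_fn(w - e)) / (2 * eps)
    return w - lr * grad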
Syed and Schapire (2007) use an approach based on zero-sum games to approximate the expert's
policy, and claim that it can in fact achieve a policy better than that of the expert itself. The
approach poses the problem as learning to play a two-player zero-sum game in which the
apprentice chooses a policy and the environment chooses a reward function. The goal of the
apprentice is to maximize performance relative to the expert, even though the reward function is
selected by the environment. Whereas the algorithm of Abbeel and Ng (2004) requires on the
order of O(k log k) iterations (where k is the number of features), this algorithm achieves the
same result within O(log k) iterations.
Ziebart et al. (2008) criticize the matching of feature counts to feature expectations in Abbeel and
Ng (2004), arguing that the approach is ambiguous, since many policies can lead to the same
feature counts, and that when sub-optimal behaviour appears in the data, mixtures of policies may
be needed to satisfy feature matching. The innovation in Ziebart et al. (2008) is to use the
principle of maximum entropy, i.e., to maximize the likelihood of the observed data under the
maximum entropy (exponential family) distribution, to pick a specific stochastic policy under the
constraint of matching feature expectations. The optimum can be obtained using gradient-based
optimization methods, and they provide a specific algorithm, "Expected Edge Frequency
Calculation". They then
demonstrate their method on what they claim to be the largest IRL problem to date in terms of
demonstrated data size. They look at route planning on roads in Pittsburgh, which has 300,000
states (road segments) and 900,000 actions (transitions at intersections). They consider details of
road segments such as road type, speed, lanes and transitions, and use 100,000 miles of travel
data from 25 Yellow Cab taxi drivers over a 12-week period. They apply their Max Entropy IRL
model to the task of learning the taxi drivers' collective utility function for the different features
describing paths in the road network. They show that their method leads to significant
improvement over alternatives (e.g., Ratliff et al., 2006; Ramachandran and Amir, 2007; Neu and
Szepesvari, 2007). They mention improving their algorithm by incorporating contextual
features (e.g., time of day, weather and region-based) or specific road features (e.g., rush hour,
steep road during winter weather).
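
A compact, simplified sketch of the resulting update follows: the gradient of the log-likelihood is the expert's empirical feature count minus the feature count expected under the current reward, with expected state visitation frequencies computed by a forward recursion under a soft (maximum-entropy) policy. The fixed horizon, the softmax value recursion, and all names below are my own simplifications rather than the authors' exact "Expected Edge Frequency Calculation".

import numpy as np

def maxent_irl_step(P, features, expert_svf, p0, w, gamma=0.99, horizon=50, lr=0.1):
    """One gradient-ascent step of a simplified MaxEnt IRL (after Ziebart et al., 2008).

    P: (n_actions, n_states, n_states) transition probabilities, P[a, s, t] = Pr(t | s, a)
    features: (n_states, k) feature matrix; the reward is features @ w
    expert_svf: (n_states,) empirical state visitation frequencies of the expert's paths
    p0: (n_states,) initial state distribution
    """
    n_actions, n_states, _ = P.shape
    r = features @ w

    # Soft (maximum-entropy) Bellman recursion yielding a stochastic policy pi(a | s).
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = r[:, None] + gamma * np.einsum("ast,t->sa", P, V)
        V = np.log(np.sum(np.exp(Q), axis=1))  # log-sum-exp over actions
    policy = np.exp(Q - V[:, None])

    # Forward pass: expected state visitation frequencies under that policy.
    D = np.zeros(n_states)
    d = p0.copy()
    for _ in range(horizon):
        D += d
        d = np.einsum("s,sa,ast->t", d, policy, P)

    # Gradient of the log-likelihood: empirical minus expected feature counts.
    grad = features.T @ (expert_svf - D)
    return w + lr * grad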
Summary
Pastoral decision making and risk management present a relatively complex problem, involving
discrete choices, state variables that change over time, and other subjective human factors.
Structural estimation techniques for determining the herders' decision rule become challenging,
as they require detailed modelling of that rule. An unconventional approach is suggested: to
develop or borrow methods from the emerging field of IRL, modelling the herders' decisions as
driven by perceived rewards in a Markov Decision Process.
Acknowledgement:
This work is inspired by unpublished work by Russell Toth, a Ph.D. student in the Department of
Economics at Cornell University. He has been studying the pastoral system in Chris Barrett's
research group and has been instrumental in formulating the pastoral problem definition.
References:
Christopher B. Barrett, Sommarat Chantarat, Getachew Gebru, John G. McPeak, Andrew G.
Mude, Jacqueline Vanderpuye-Orgle, Amare T. Yirbecho, “Codebook For Data Collected Under
The Improving Pastoral Risk Management on East African Rangelands (PARIMA) Project,” April
2008.
Aguirregabiria, Victor & Mira, Pedro, 2010. "Dynamic discrete choice structural models: A
survey," Journal of Econometrics, Elsevier, vol. 156(1), pages 38-67, May
Keane, M. and K. Wolpin, 1997. Introduction to the JBES Special Issue on Structural Estimation in
Applied Microeconomics. Journal of Business and Economic Statistics, 15:2, 111-114
Russell, Stuart. 1998. "Learning Agents for Uncertain Environments (extended abstract)." In Proc.
COLT-98, Madison, Wisconsin: ACM Press.
Ng, Andrew Y. and Stuart Russell. 2000. "Algorithms for Inverse Reinforcement Learning." In
Proceedings of the Seventeenth International Conference on Machine Learning.
Abbeel, Pieter and Andrew Y. Ng. 2004. "Apprenticeship Learning via Inverse Reinforcement
Learning." In Proceedings of the Twenty-first International Conference on Machine Learning.
Ramachandran, Deepak and Eyal Amir. 2007. "Bayesian Inverse Reinforcement Learning." In
20th International Joint Conference on Artificial Intelligence (IJCAI-07).
Neu, Gergely and Csaba Szepesvari. 2007. "Apprenticeship Learning using Inverse Reinforcement
Learning and Gradient Methods." In Conference on Uncertainty in Artificial Intelligence (UAI), pp.
295-302.
Syed, Umar, and Robert E. Schapire. 2007. "A Game Theoretic Approach to Apprenticeship
Learning." In Advances in Neural Information Processing Systems 20 (NIPS 2007).
Syed, Umar, Michael Bowling and Robert E. Schapire. 2008. "Apprenticeship Learning Using
Linear Programming." In Proceedings of the Twenty-Fifth International Conference on Machine
Learning (ICML 2008).
Ziebart, Brian D., Andrew Maas, J. Andrew Bagnell and Anind K. Dey. 2008. "Maximum Entropy
Inverse Reinforcement Learning." In Proceedings of the Twenty-Third AAAI Conference on
Artificial Intelligence.