CS 6702: Topics in Computational Sustainability
Reaction Paper
Inverse reinforcement learning and discrete decision choice problem of pastoralists
Nikhil Kejriwal
2/17/2011
This paper is inspired by a survey collected by the USAID Global Livestock Collaborative Research Support Program (GL CRSP) “Improving Pastoral Risk Management on East African Rangelands” (PARIMA, C. B. Barrett, April 2008). The project focuses on six locations in Northern Kenya and Southern Ethiopia in one intact semi-arid and arid livestock production and marketing region. The survey was structured to shed light on the following key issues: (1) pastoralist risk exposure and management behavior, (2) livestock marketing, (3) rural financial institutions, and (4) public service delivery systems. In addition, the survey contains a variety of standard household and individual survey questions concerning household composition, labor allocation, income, expenditures, etc. The data collected aim to provide insight into the cross-sectional (across individuals, households, communities, and subregions) and intertemporal (seasonal and interannual) variation of risk and uncertainty faced by pastoralists (C. Barrett, April 2008).
This paper focuses on a subsection of the problem: the decision-making process by which pastoralists manage risk and maintain livestock, based on their expectations of coming weather patterns and on the availability of, and access to, common resources. This applied problem is centered on explaining the movement over time and space of animal herds in the arid and semi-arid lands (ASAL) of Northern Kenya and Southern Ethiopia. In this region crop agriculture is difficult, and hence pastoralists (animal herders) have adopted the livelihood strategy of managing large livestock herds (camels, cattle, sheep, and goats), which are their primary asset and wealth base, and their primary source of income as well (through livestock transactions, milk, blood, meat and skins).
The movement of herders is driven by highly variable rainfall. In between winters and summers there are dry seasons with virtually no precipitation. At such times the males in the household migrate with the herds to remote water points, dozens if not hundreds of kilometers away from the main town location. Every few years these dry seasons are more severe than usual and turn into droughts. At such times the pastoralists can suffer greatly, losing large portions of their herds. Even during relatively mild dry seasons there are a number of reasons to be interested in the herders' spatiotemporal movement problem: we might ask whether their herd allocation choices are optimal given the state of the environment, and we might worry about the environmental degradation caused by herd grazing pressure and about the inter-tribal violence and raids that occur during migration seasons. We are interested in understanding herders' choices so as to formulate effective policies to correct such issues (more police, more control of grazing, drilling more water points, etc.) and to gauge the effects of environmental changes (e.g., more variable or lower mean rainfall) on the tribes.
The problem can therefore be formulated as developing a robust predictive model of pastoralist movement patterns based on observed states of the environment. Ideally, we would like to forecast the outcomes of policy experiments that involve changes in state variables, even if such state changes were not observed in the data used for estimation. The typical structural modelling exercise in econometrics is concerned with a similar problem. Structural Estimation (SE), used by economists and statisticians, is a technique for estimating deep "structural" parameters of theoretical models (Aguirregabiria, May 2010). It estimates the parameters of the agent's preferences (utility function) that best match the observed data.
However, the typical approach involves imposing a particular behavioural model on the agent. While this is attractive because the estimated parameters are interpretable, it means that there are definite limits on the complexity of the state space and, in particular, the action space that such methods can handle. The technical challenge is that herd movement is a large and complex dynamic discrete choice problem: pastoralists must decide which particular water points to move to, and how to allocate each of their animal types between the base camp location and remote water points, as a function of a large and complex state space. Application of the typical econometric modeling techniques would seem to run into two immediate challenges: (i) the curse of modeling (a particular, explicit behavioural model describing the mechanics of pastoralist decision-making might be difficult to specify correctly), and (ii) the curse of dimensionality (even if a behavioural model can be specified, it can be difficult to estimate the parameters of a sufficiently complex model without virtually unlimited data).
Keane and Wolpin (1997) point out several limitations of the structural estimation approach. One approach to SE is the full-solution approach, based on a complete solution of the optimization problem confronting agents. Parameters for the optimal decision rules are estimated by iterating over the complete feasible set. This is computationally very expensive and limits the complexity of the structural models that can be estimated. Another approach to SE is based on the first-order conditions (FOCs) of the optimization problem, which is generally less computationally demanding than the full-solution method. But this method is unsuitable for agents with discrete choice variables. A reduced-form decision rule may also be used to structure the model, but such rules are in most cases not versatile enough to capture the dynamics of a complex system.
Russell (1998) suggested that Inverse Reinforcement Learning (IRL) can provide a good model for animal and human learning in uncertain environments where the reward function is unknown. Unlike the SE method, the IRL method does not fix the parameters of multi-attribute reward functions a priori, and it does not impose restrictions on decision rules. It is in this sense a "model-free" approach: it derives a reward function directly as a function of states of the world and actions, and hence puts much less structure on the agent's decision problem. The model can be thought of as a Markov Decision Process (MDP) in which the agent takes actions in order to maximize the expected reward that it receives from the environment. If the reward function is known, it describes how agents map states and actions into payoffs. We can then carry out policy simulations that vary elements of the state space (e.g., policy interventions in ASAL Africa) and predict the action that the agent (e.g., a pastoralist) will take in response to that change.
The basic idea behind IRL is the “problem of extracting a reward function given observed, optimal behavior” (Ng and Russell, 2000). We are usually given the following: “(1) measurements of an agent’s behavior over time, in a variety of circumstances, (2) if needed, measurements of the sensory inputs to that agent, (3), if available, a model of the environment,” with which we determine “the reward function being optimized” (Ng and Russell, 2000). IRL is particularly relevant in studying animal and human behavior, because in such cases we often observe data which we can assume are the result of a learning process, and hence are in some sense optimal. The reward function is therefore taken as an unknown to be ascertained through empirical investigation.
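The forward problem described above (given a reward function, predict how the agent responds to a change in the environment) can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration and is not taken from the PARIMA data: a two-location MDP (base camp and remote water point), deterministic "stay" and "move" actions, the reward values, and the discount factor. It uses standard value iteration to compute the optimal policy before and after a simulated intervention.

```python
# Hypothetical toy MDP: state 0 = base camp, state 1 = remote water point.
# All numbers are illustrative assumptions, not estimates from survey data.

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a][s][t] is the probability of moving from s to t under action a;
    R[s] is the reward received in state s. Returns (values, greedy policy)."""
    n_actions, n_states = len(P), len(R)
    V = [0.0] * n_states
    while True:
        # Bellman backup: Q[s][a] = R(s) + gamma * E[V(next state)]
        Q = [[R[s] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        V_new = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            policy = [max(range(n_actions), key=lambda a: Q[s][a])
                      for s in range(n_states)]
            return V_new, policy
        V = V_new

# Action 0 = stay, action 1 = move to the other location (deterministic).
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # stay
    [[0.0, 1.0], [1.0, 0.0]],  # move
]

R_dry = [0.0, 1.0]  # dry season: only the remote water point is rewarding
R_new = [2.0, 1.0]  # simulated intervention: a new water point at base camp

_, policy_dry = value_iteration(P, R_dry)
_, policy_new = value_iteration(P, R_new)
print(policy_dry, policy_new)
```

In this toy setting the optimal policy switches from migrating to the remote water point to staying at (or returning to) base camp once the reward at base camp is raised, which is the kind of counterfactual prediction a recovered reward function would be used for.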
IRL is a form of apprenticeship learning, in which we observe an expert demonstrating the task that we want to learn to perform. Solving the IRL problem yields a reward function, inferred from the expert's observed behavior, that can then be used to teach an agent the optimal actions in given states of the environment. It is particularly suitable for tasks such as walking, diving, and driving, where the designer of an artificial system may have only an intuitive idea of the appropriate reward function to supply to a Reinforcement Learning (RL) algorithm in order to achieve “desirable” behavior. Thus, instead of learning direct control functions from experts explicitly, it may be better to solve the inverse reinforcement learning problem and learn simpler reward functions. “Such reward functions usually are simple monotonic functions of the current sensory inputs, and thus may be much simpler than the direct decision mapping” (Abbeel & Ng, 2004). The reward function can be thought of as a compact, indirect, but robust representation of expert behaviour.
Ng and Russell (2000) give a solution to the IRL problem by formulating it as a Linear Programming (LP) problem. They show that standard LP techniques yield efficient solutions. However, the solutions are often not unique. They suggest alleviating this problem by adding ad hoc criteria that have some natural interpretation, such as maximizing the difference in value between the best and next-best actions. In a large or infinite state space, finding a solution becomes an infinite-dimensional problem. The solution then is to find a linear approximation to the value function.
Abbeel and Ng (2004) suggest expressing the reward function as a "linear combination of known features". The reward function is given as $R(s) = \sum_{i=1}^{k} w_i \phi_i(s)$, where $w = (w_1, \dots, w_k)$ is a vector of weights and $\phi$ maps each state $s$ to a $k$-dimensional feature vector whose components take binary values in {0, 1}.
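To make the linear reward concrete, the sketch below evaluates a reward that is a weighted sum of state features and computes the empirical discounted feature expectations of a set of demonstrated trajectories, the quantity that feature-matching methods in the style of Abbeel and Ng (2004) compare between expert and learner. The state names, binary features, weights, trajectories, and discount factor are all hypothetical and chosen only for illustration.

```python
def reward(w, phi_s):
    """Linear reward: weighted sum of the feature values of one state."""
    return sum(w_i * f_i for w_i, f_i in zip(w, phi_s))

def feature_expectations(trajectories, phi, gamma=0.9):
    """Average over trajectories of sum_t gamma^t * phi(s_t)."""
    k = len(next(iter(phi.values())))
    mu = [0.0] * k
    for traj in trajectories:
        for t, s in enumerate(traj):
            for i in range(k):
                mu[i] += (gamma ** t) * phi[s][i]
    return [m / len(trajectories) for m in mu]

# Hypothetical binary features: [is_base_camp, is_water_point]
phi = {"base_camp": [1, 0], "water_point": [0, 1]}

# Hypothetical expert demonstrations (sequences of visited states)
trajectories = [
    ["base_camp", "water_point", "water_point"],
    ["base_camp", "base_camp", "water_point"],
]

w = [0.2, 0.8]  # hypothetical weights
mu_expert = feature_expectations(trajectories, phi, gamma=0.5)
r_water = reward(w, phi["water_point"])
print(mu_expert, r_water)
```

Because the reward is linear in the features, the discounted return of any set of trajectories is simply the inner product of w with its feature expectations, which is why matching feature expectations is enough to match performance under any such reward.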
Abbeel and Ng claim that the values of w that come out of their model will produce policies that are close to the expert’s. They claim that their algorithm converges quickly and that, whether or not it recovers the expert's true reward function, the recovered reward function will produce policies that are close to those of the expert. They also claim that they do not need to assume that the expert always succeeds in maximizing his reward function.
Ramachandran and Amir (2007) take a Bayesian approach to solving the IRL problem. They assume that the modeler can place a prior belief on the reward function, and then "consider the actions of the expert as evidence that we use to update a prior on reward functions." This allows the modeler to use data from multiple experts, does not require assuming the expert is infallible, and does not require a completely specified optimal policy. The basic idea behind their approach is to derive a posterior distribution for the rewards from a prior distribution and a probabilistic model of the expert’s actions given the reward function. Given data, they use Bayes' rule to update prior beliefs on the distribution over reward functions.
Neu and Szepesvari (2007) claim to improve on Abbeel and Ng (2004) by requiring weaker assumptions (on specifying the features of the reward function), through the use of a gradient algorithm. The paper nicely expresses the difference between direct and indirect approaches. The basic idea is that in apprenticeship learning, we want to learn optimal actions from an expert. Direct approaches try to learn the policy directly, usually by optimizing some loss function measuring deviations from the expert’s choices; the disadvantage is that such an approach has difficulty learning about policies in regions of the state space that the expert rarely visits, because there are few observations there.
In the indirect methods (e.g., IRL) it is assumed that we observe an expert taking optimal actions, and we try to learn the unknown reward function of the expert, which can then provide a more succinct description of the expert’s decision problem and can be mapped back into policies. Their algorithm combines the direct and indirect approaches by minimizing a loss function that penalizes deviations from the expert’s policy, as in supervised learning, but the policy is obtained by tuning a reward function and solving the resulting MDP, instead of finding the parameters of a policy directly. One general problem in IRL that they point out is the scale problem: if R is a solution as a reward function, then so is λR for any positive scalar λ.
Syed and Schapire (2007) use an approach based on zero-sum games to approximate the expert policy, and claim that it can in fact achieve a policy better than that of the expert itself. The approach poses the problem as learning to play a two-player zero-sum game in which the apprentice chooses a policy and the environment chooses a reward function. The goal of the apprentice is to maximize performance relative to the expert, even though the reward function is selected by the environment. While the algorithm of Abbeel and Ng (2004) converges within O(k log k) iterations, this algorithm achieves the same result within O(log k) iterations.
Ziebart et al. (2008) criticize the approach of Abbeel and Ng (2004), which matches feature counts to feature expectations, arguing that it is ambiguous, since many policies can lead to the same feature counts. Moreover, when sub-optimal behaviour appears in the data, mixtures of policies may be needed to satisfy feature matching. The innovation in Ziebart et al. (2008) is to use the principle of maximum entropy, i.e., to maximize the likelihood of the observed data under the maximum entropy (exponential family) distribution in order to pick a specific stochastic policy, under the constraint of matching feature expectations.
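The maximum entropy formulation can be written out as follows. This is a sketch of the standard form for deterministic transitions; the notation ($\zeta$ for a path, $f_\zeta$ for its summed feature counts, $\theta$ for the reward weights, $\tilde{f}$ for the average feature counts of the $M$ demonstrated paths $\tilde{\zeta}_1, \dots, \tilde{\zeta}_M$) is introduced here for illustration and is not taken from the text above.

```latex
% Maximum entropy distribution over paths (deterministic transitions assumed)
P(\zeta \mid \theta) = \frac{\exp\!\left(\theta^{\top} f_{\zeta}\right)}{Z(\theta)},
\qquad
Z(\theta) = \sum_{\zeta'} \exp\!\left(\theta^{\top} f_{\zeta'}\right)

% Average log-likelihood of the M demonstrated paths
L(\theta) = \frac{1}{M} \sum_{m=1}^{M} \log P(\tilde{\zeta}_m \mid \theta)

% Gradient: empirical feature counts minus expected feature counts
\nabla L(\theta) = \tilde{f} - \sum_{\zeta} P(\zeta \mid \theta)\, f_{\zeta},
\qquad
\tilde{f} = \frac{1}{M} \sum_{m=1}^{M} f_{\tilde{\zeta}_m}
```

The gradient vanishes exactly when the expected feature counts under the model equal the empirical ones, so maximizing the likelihood enforces the feature-matching constraint while leaving the distribution otherwise as uncommitted as possible.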
The optimum can be obtained using gradient-based optimization methods. They provide a specific algorithm, “Expected Edge Frequency Calculation”. They then demonstrate their method on what they claim to be the largest IRL problem to date in terms of demonstrated data size. They look at route planning on roads in Pittsburgh, a problem with 300,000 states (road segments) and 900,000 actions (transitions at intersections). They consider details of road segments such as road type, speed, lanes and transitions, and use 100,000 miles of travel data from 25 Yellow Cab taxi drivers over a 12-week period. They apply their Max Entropy IRL model to the task of learning the taxi drivers’ collective utility function for the different features describing paths in the road network. They show that their method leads to significant improvement over alternatives (e.g., Ratliff et al. 2006, Ramachandran and Amir 2007, and Neu and Szepesvari 2007). They mention improving their algorithm by incorporating contextual features (e.g., time of day, weather and region-based features) or specific road features (e.g., rush hour, steep road during winter weather).
Summary
Pastoral decision making and risk management present a relatively complex problem in terms of discrete state choices and the inclusion of state variables that change over time and other subjective human factors. Applying structural estimation techniques to determine the decision rule for the herders is challenging, as it requires detailed modelling of the decision rule. An unconventional approach is suggested: to develop or borrow methods from the emerging field of IRL to model decisions based on the perceived rewards in a Markov Decision Process.

Acknowledgement: This work is inspired by unpublished work from Russell Toth. Russell is a Ph.D. student in the Department of Economics at Cornell University. He has been studying the pastoral system in Chris Barrett's research group and has been instrumental in formulating the pastoral problem definition.