Learning Distinctions and Rules in a Continuous World through Active Exploration
Paper by Jonathan Mugan & Benjamin Kuipers
Presented by Daniel Hough

The Challenge
To build a robot which learns its environment the way children do.
Piaget [1952] theorised that children construct this knowledge in stages.
Cohen [2002] proposed that children have a domain-general information processing system for bootstrapping knowledge.

Foundations
The focus of the work: how a developing agent can learn temporal contingencies in the form of predictive rules over events.
Watson [2001] proposed a model of contingencies based on his observations of infant behaviour:
Prospective temporal contingency: event B tends to follow event A with a likelihood greater than chance.
Retrospective temporal contingency: event A tends to come before event B more often than chance.
Distinctions must be found to determine when an event has occurred.

Foundations
Drescher [1991] proposed a model inspired by Piaget in which contingencies (there called schemas) are found using marginal attribution; results that follow actions are identified in a manner similar to Watson's model.
For each schema (in the form of an action + result), the algorithm searches for a context (situation) that makes the result more likely to follow that action.

The Method: Introduction
Here, prospective contingencies, as well as contingencies in which events occur simultaneously, are represented using predictive rules.
These rules are learned using a method inspired by marginal attribution.
The difference from Drescher is that the variables are continuous. This raises the issue of determining when events occur, so distinctions must be found.

The Method: Introduction
The motor babbling method from last week also learns distinctions and contingencies, but it is undirected and does not allow learning for larger problems: too much effort is wasted on uninteresting portions of the state space.

The Method: Introduction
In this algorithm, the agent receives as input the values of time-varying continuous variables, but it can only represent, reason about and construct knowledge using discrete values. Continuous values are discretised using distinctions in the form of landmarks:
A discrete value v(t) is defined for each continuous variable v'(t); if, for landmarks v1 and v2, v1 < v'(t) < v2, then v(t) has the open interval between v1 and v2 as its value, v = (v1, v2).
This association means the agent can focus on changes of v, i.e. events.
The agent greedily learns rules that use one event to predict another.

The Method: How it's evaluated
The method is evaluated using a simulated robot based on the situation of a baby sitting in a high chair.
[Fig. 1: Adorable. Fig. 2: Less adorable.]

The Method: Knowledge Representation & Learning
The goal is for the agent to learn to identify landmark values from its own experience. The importance of a qualitative distinction is estimated from the reliability of the rules that can be learned given that distinction.
The qualitative representation is based on QSIM [Kuipers, 1994].

The Method: Knowledge Representation & Learning
A continuous variable x'(t) ranges over some subset of the real number line (-∞, +∞) and is represented by a discrete variable x(t) for its magnitude and a discrete variable x''(t) for its direction of change.
In QSIM, magnitude is abstracted to a discrete variable x(t) that ranges over a quantity space Q(x) of qualitative values:
Q(x) = L(x) ∪ I(x), where L(x) = {x1, ..., xn} is the set of landmark values and I(x) = {(-∞, x1), (x1, x2), ..., (xn, +∞)} is the set of mutually disjoint open intervals between them.
A quantity space with two landmarks might be described as (x1, x2), which implies five distinct qualitative values, Q(x) = {(-∞, x1), x1, (x1, x2), x2, (x2, +∞)}.
The discrete variable x''(t) for the direction of change of x'(t) has a single intrinsic landmark at 0, so its initial quantity space is Q(x'') = {(-∞, 0), 0, (0, +∞)}.
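As a concrete illustration of this discretisation, here is a minimal Python sketch (my own, not code from the paper). Given a sorted list of landmarks, a continuous reading is mapped either to a landmark it (approximately) equals or to the open interval between neighbouring landmarks, mirroring Q(x) = L(x) ∪ I(x); the function name qualitative_value and the eps tolerance are assumptions made for illustration.

```python
from bisect import bisect_left

def qualitative_value(x_cont, landmarks, eps=1e-6):
    """Map a continuous reading onto its qualitative value in Q(x).

    `landmarks` is a sorted list [x1, ..., xn]; the result is either a
    landmark itself or the open interval between neighbouring landmarks.
    """
    for lm in landmarks:
        if abs(x_cont - lm) <= eps:        # close enough to count as "at" the landmark
            return lm
    i = bisect_left(landmarks, x_cont)     # index of the first landmark above x_cont
    lower = landmarks[i - 1] if i > 0 else float("-inf")
    upper = landmarks[i] if i < len(landmarks) else float("inf")
    return (lower, upper)                  # open interval (lower, upper)

# Two landmarks give the five qualitative values listed above:
print(qualitative_value(0.7, [0.5, 1.0]))  # -> (0.5, 1.0)
print(qualitative_value(0.5, [0.5, 1.0]))  # -> 0.5
```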
The Method: Knowledge Representation & Learning: Events
If a is a qualitative value of a discrete variable A, meaning a ∈ Q(A), then the event At→a is defined by A(t-1) ≠ a and A(t) = a.
That is, an event takes place when the discrete variable A changes to the value a at time t from some other value.

The Method: Knowledge Representation & Learning: Predictive Rules
This is how temporal contingencies are described. There are two types of predictive rules:
Causal: one event occurs after the other, later in time.
Functional: the events are linked by a function, so they occur at the same time.

The Method: Learning a predictive rule
The agent wants to learn a rule which predicts a certain event h.
It looks at other events; if it finds that one of them, u, leads to h with a likelihood greater than chance, it creates a rule with that event as the antecedent.
It does so by starting with an initial rule with no context.

The Method: Landmarks
When a new landmark is inserted into Q(x), one interval is replaced with two intervals and the dividing landmark: for a new landmark x* inserted into (xi, xi+1), we get (xi, x*), x*, (x*, xi+1).
Whenever a new landmark is inserted, statistics about the previous state space are thrown out and new ones are built up. This means that the reliability of each rule must be re-checked.

The Method: The Learning Process
Do 7 times:
1. (a) Actively explore the world for 1,000 timesteps, with the set of candidate goals coming from the discrete variables in M.
   (b) Learn new causal and functional rules.
   (c) Learn new landmarks by examining the statistics stored in rules and events.
2. Gather 3,000 more timesteps of experience to solidify the learned rules.
3. Update the strata.
4. Go to 1.
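A rough Python sketch of this outer loop; the helper functions are hypothetical placeholders for the steps named on the slide, not functions from the paper.

```python
def explore(timesteps):     # act in the world, recording variable values and events
    pass

def learn_rules():          # learn new causal and functional rules from stored statistics
    pass

def learn_landmarks():      # learn new landmarks from the statistics in rules and events
    pass

def update_strata():        # update the strata
    pass

def developmental_learning(iterations=7):
    for _ in range(iterations):
        explore(timesteps=1000)   # 1(a): active exploration with candidate goals
        learn_rules()             # 1(b)
        learn_landmarks()         # 1(c)
        explore(timesteps=3000)   # 2: solidify the newly learned rules
        update_strata()           # 3
        # 4: go to 1, i.e. the next loop iteration
```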
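Step 1(b), learning a predictive rule, can be sketched in the same spirit: events are extracted wherever a discrete variable changes value (the At→a definition above), and a rule u → h is proposed when some antecedent event u makes the target event h more likely than its base rate. The fixed time horizon and the simple counting test below are my own simplifications of the statistics the algorithm actually keeps.

```python
def extract_events(history):
    """history: dict mapping a variable name to its list of discrete values over time."""
    events = []
    for var, values in history.items():
        for t in range(1, len(values)):
            if values[t] != values[t - 1]:
                events.append((t, var, values[t]))   # event: var takes values[t] at time t
    return events

def propose_rule(events, target, horizon, total_timesteps):
    """Find an antecedent event type that makes `target` more likely than chance."""
    target_times = [t for t, var, val in events if (var, val) == target]
    base_rate = len(target_times) / total_timesteps
    best, best_prob = None, base_rate
    candidates = {(var, val) for _, var, val in events if (var, val) != target}
    for u in candidates:
        u_times = [t for t, var, val in events if (var, val) == u]
        followed = sum(any(t < ht <= t + horizon for ht in target_times) for t in u_times)
        prob = followed / len(u_times)
        if prob > best_prob:        # u raises the likelihood of target above chance
            best, best_prob = u, prob
    return best, best_prob          # candidate causal rule: best -> target
```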
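Step 1(c) relies on the landmark insertion described on the Landmarks slide. A small sketch, assuming a quantity space is represented as an ordered list of landmark values and open-interval tuples (my own representation, not the paper's):

```python
def insert_landmark(quantity_space, x_star):
    """Replace the interval containing x_star with (lower, x_star), x_star, (x_star, upper)."""
    new_space = []
    for qv in quantity_space:
        if isinstance(qv, tuple) and qv[0] < x_star < qv[1]:
            new_space.extend([(qv[0], x_star), x_star, (x_star, qv[1])])
        else:
            new_space.append(qv)
    # statistics gathered under the old quantity space are then discarded and
    # rebuilt, so the reliability of existing rules must be re-checked
    return new_space

# insert_landmark([(float("-inf"), 0.5), 0.5, (0.5, float("inf"))], 1.0)
# -> [(-inf, 0.5), 0.5, (0.5, 1.0), 1.0, (1.0, inf)]
```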
Evaluation: Experimental Setup
The robot has two motor variables, one for each of its degrees of freedom.
A perceptual system creates variables for each of the two tracked objects in the environment: the hand and the block.
There are too many variables to reasonably explain here; each has various constraints.
During learning, if the block is knocked off the tray or is not moved for 300 timesteps, it is put back on the tray in a random position within reach of the agent.

Evaluation: Experimental Results
The algorithm was evaluated using the simple task of moving the block in a specified direction.
It was run five times using passive learning and five times using active learning; each run lasted 120,000 timesteps.
Each active run of the algorithm resulted in an average of 62 predictive rules.
The agent gains proficiency as it learns until levelling off at approximately 70,000 timesteps under both conditions.

Evaluation: Experimental Results
Active exploration clearly appears to do better: at 40,000 timesteps, active learning achieves the level that passive learning only reaches at 60,000 timesteps.

The Complexity of Space and Time
The storage required to learn new rules is O(e²) in the number of events e, as is the number of possible rules, but only a small number of rules are actually learned by the agent. Using marginal attribution, each rule requires O(e) storage, although all pairs of events are stored for simplicity.

Conclusion
At first the agent could only determine the direction of movement of an object. Through active exploration of its environment, using rules to learn distinctions and then using those distinctions to learn more rules, the agent progressed from a very simple representation towards one aligned with the natural "joints" of its environment.