TDDD10 AI Programming Machine Learning Cyrille Berger, PhD Unmanned Aircraft Systems Technologies Laboratory Artificial Intelligence and Integrated Computer Systems Division Department of Computer and Information Science Linköping University, Sweden Cyrille Berger Machine Learning 1/74 What is Machine Learning about? To imbue the capacity to learn into machines Our only reference frame for learning is from the animal kingdom …but brains are hideously complex machines, the result of ages of random mutation Like much of AI, Machine Learning takes an engineering approach! Humanity didn’t first master flight by just imitating birds! Although there is some occasional biological inspiration Cyrille Berger Machine Learning 2/74 Why Machine Learning? • It may be impossible to manually program for every situation in advance. • The model of the environment is unknown, incomplete or outdated, if the agent cannot adapt it will fail. • Approximating complex functions. • Many argue that learning is required for AI to scale up. • Data collections are too large for humans to cope with. • …but primarily, the algorithms we have so far have shown themselves to be useful in a wide range of applications! Cyrille Berger Machine Learning 3/74 Some Application Aspects May not be as versatile as human learning, but domain specific problems can often be processed much faster than by a human Computers work 24/7 and you can often scale performance by piling on more of them Data Mining Companies can collect ever more data and processing power is cheap Put it to use automatically analyzing the performance of products! Machine Learning is almost ubiquitous on the web: Mail filters, search engines, product recommendations , customized content, ad serving… Robotics, Computer Vision Many capabilities that humans take for granted like locomotion and recognizing objects have turned out to be ridiculously difficult to manually construct rules for. Cyrille Berger Machine Learning 4/74 Demo – Stanford Helicopter Acrobatics …in narrow applications machine learning can even rival human performance Cyrille Berger Machine Learning 5/74 To Define Machine Learning A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E -Tom Mitchell From the agent perspective: Performance Metric Input (Sensors) Agent Environment Output (Actuators) Cyrille Berger Machine Learning 6/74 Supervised Learning at a glance In Supervised Learning: • The correct output is given to the algorithm during a training phase • Experience E is thus tuples of training data consisting of (Input,Output) pairs • Performance metric P is some function of how well the predicted output matches the given correct output Mathematically, can be seen as trying to approximate an unknown function f(x) = y given examples of (x, y) Some real world examples • Spam Filter: f(mail) = spam? • Microsoft Kinect: f(pixels) = body part 7 Cyrille Berger Machine Learning 7/74 Unsupervised Learning at a glance In Unsupervised Learning: • The Task is to find a more concise representation of the data • Neither the correct answer, nor a reward is given • Experience E is just the given data • Performance metric P depends on the task T Examples: Clustering – When the data distribution is confined to lie in a small number of “clusters “ we can find these and use them instead of the original representation Dimensionality Reduction – Finding a suitable lower dimensional representation while preserving as much information as possible Cyrille Berger Machine Learning 8/74 Reinforcement Learning at a glance In Reinforcement Learning: • A reward is given at each step instead of the correct input • Experience E consists of the history of inputs, the chosen outputs and the rewards • Performance metric P is some sum of how much reward the agent can accumulate Inspired by early work in psychology and how pets are trained The agent can learn on its own as long as the reward signal can be concisely defined. Real world examples – Robot Control, Game Playing (Checkers…) Cyrille Berger Machine Learning 9/74 Evolutionary algorithms at a glance In Evolution Learning: • Random agents are generated • At each step, the best performing agents are kept and combined to generate new solutions • Experience E consists of the history of solutions • Performance metric P is some sum of how much an agent perform Inspired by early work by the theory of evolution Most classical exemple is the so called "genetic algorithm" Cyrille Berger Machine Learning 10/74 Supervised Learning • Linear Regression • Neural Network • Bayesian Network Cyrille Berger Machine Learning 11/74 The Steps of Supervised Learning Can be seen as searching for an approximation to the unknown function y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn) The goal is to have the algorithm learn from a small number of examples to successfully generalize to new examples • First construct a feature vector xi of examples by encoding relevant problem data. Examples of xi, yi is the training set. • The algorithm is trained on the examples by searching a family of functions (the hypothesis space) for the one that most closely approximates the unknown true function • If the quality is poor at this point, change algorithm or features Cyrille Berger Machine Learning 12/74 Simple Classification Example I. Construct a feature vector xi to be used with examples of yi II. Select algorithm and train on training set by searching for a good approximation to the unknown function Fictional example: Classify if a day was sunny or rainy based on daily sales of umbrellas and ice cream at a store. Feature vector xi = (umbrellas, ice creams), yi = “sunny” or “rainy” Cyrille Berger Machine Learning 13/74 Training a learning algorithm... Feature vector xi = (umbrellas), yi = (ice creams) Want to find approximation h(x) to the unknown function f(x) How do we find parameters that result in a good hypothesis h? • Loss function: L(f(x),h(x)) • Optimisation Cyrille Berger Machine Learning 14/74 Limitations of Training a Learning Algorithm Locally greedy optimization approaches only work if the loss function is convex w.r.t. w, i.e. there is only one minima Linear regression models are always convex, however more advanced models that we will look at later are vulnerable to getting stuck in local minina Care should be taken when training such models by utilizing for example random restarts Start positions in red area will get stuck in local minima! Cyrille Berger Machine Learning 15/74 Linear Models in Summary Advantages •Linear algorithms are simple and computationally efficient •Training them is often a convex problem, so one is guaranteed to find the best hypothesis in the hypothesis space Disadvantages •The hypothesis space is very restricted, it cannot handle non-linear relations well They are widely used in applications • Recommender Systems – Initial Netflix Cinematch was a linear regression, before their $1 million competition to improve it • At the core of the recent Google Gmail Priority feature is a linear classifier • And many more… Cyrille Berger Machine Learning 16/74 Neural Networks Biologically inspired with some differences Basically a function approximation •Based on units •Connected by links •A unit is acivted by stimulation •Structured in a network Cyrille Berger Machine Learning 17/74 Neuron Each neuron is a linear model of all the inputs passed through a non-linear activation function g Cyrille Berger Machine Learning 18/74 Activation function The activation function is typically the sigmoid The activation function in a perceptron is the step function Cyrille Berger Machine Learning 19/74 Design of Neural Nets An artificial neural network is specified by: • a topology (the number of layers, the number of neurons in each layer and the interconnections between them), and • a training algorithm. A representation of the input and an interpretation of the output are also needed. “Neural networks are the second best way of doing just about anything” – John Denker Cyrille Berger Machine Learning 20/74 Network Topology • Networks are composed of layers • All neurons in a layer are typically connected to all neurons in the next layer, but not to each other Cyrille Berger Machine Learning 21/74 Network Topology A two layer network (one hidden layer) with a sufficient number of neurons is good enough for most problems Cyrille Berger Machine Learning 22/74 Training of neural networks Like before, training is the result of an optimization! We can compute a loss function (errors) for outputs of the last layer against the training examples And for hidden layers, use backpropagation: ”The trick is to assess the blame for an error and divide it among the contributing weights” – Russel & Norvig Cyrille Berger Machine Learning 23/74 When to Consider Neural Nets • • • • • • Input is high-dimensional discrete or real-valued Output is discrete or real-valued Output is a vector of values Possibly noisy data Form of target function is unknown Human readability of result is unimportant Cyrille Berger Machine Learning 24/74 Video of robot avoiding obstacles Cyrille Berger Machine Learning 25/74 Artificial Neural Networks – Summary Advantages •Very large hypothesis space, under some conditions it is a universal approximator to any function f(x) •The model can be rather compact •Has been moderately popular in applications ever since the 90ies •Tolerant to noisy data Disadvantages •Training is a non-convex problem with many local minima •Takes time to learn •Has many tuning parameters to twiddle with (number of neurons, layers, starting weights…) •Topology design is critical •Very difficult to interpret or debug weights in the network •The biological similarity is vastly over-simplified (and frequently overstated) Cyrille Berger Machine Learning 26/74 Abuse of Neural Network in RoboCup Soccer Cyrille Berger Machine Learning 27/74 Bayesian Network A Bayesian Network is a graph of probabilistic dependencies. It contains: • Variables as nodes • Joint-Probability as directed edges Cyrille Berger Machine Learning 28/74 Bayesian Network for modeling dependencies (Source: Wikipedia) Cyrille Berger Machine Learning 29/74 Bayesian Network for solving crimes Cyrille Berger Machine Learning 30/74 Bayesian Network for SPAM Filtering Most known for popularizing learning spam filters Spam classification example: Each mail is an input, some mails are flagged as spam or not spam to create training examples. Feature vector Feature Exists? “Accept” True “Bank” False “Pay” True …. … Encode the “Customer” existence of a fixed set of True relevant key words in each mail as the “Dollar” False feature vector. “Nigeria” False xi = wordsi = Cyrille Berger Machine Learning 31/74 Bayesian Network for decision making Cold TakeDrug Allergy UD RunnyNose Fever UT Temp Cyrille Berger MeasTemperature Machine Learning 32/74 Training a probabilistic model Resulting in How do we train a probabilistic model given examples? Generally by optimization like we did with linear deterministic models, but deriving the loss function is a bit more complex In this case the hypothesis space is the family of fully conditionally independent and binary word distributions The only parameters in the model that need to be searched over are the probability distributions on the right hand side. In this case they have a simple closed form solution: word frequencies for examples of spam and non-spam in the training set, in addition to the frequency of spam in general! Cyrille Berger Machine Learning 33/74 Bayesian Networks – Summary Advantages •Very easy to debug or interpret the weights in the network. •Expose the relationships between variables and easy conversion in decision. Disadvantages •Difficult to learn. •Topology design is critical. And especially which nodes to include ? •Difficulty to train cyclic networks Cyrille Berger Machine Learning 34/74 Decision Trees It is a graph of possible decisions and their possible consequences. Used to identify the best strategy to reach a given goal. Rush to ball. Ball kickable? no Pass. no yes Good shoot? yes Cyrille Berger Machine Learning Shoot at the goal. 35/74 Decision Trees Learning What is great about Decision Trees? ...you can learn them. It is one of the most widely used and practical method for classification. The goal is to create a model that predicts the output value based on several input variables. It is usefull when one wants a compact and easily understood representation. Cyrille Berger Machine Learning 36/74 Exemple of Decision Trees Learning Outlook = Sunny, Humidity = Normal, Wind = Weak Cyrille Berger Machine Learning 37/74 Decision Trees Learning Decision trees are learnt by constructing them top-down: • Select the best root of the tree: • statistical test to find how well it classifies the training sample • the best attribute is selected • Descendants are created for each possible value of the attribute and the attribute matching are propagated down • The entire process is repeated until each descendent of the node select the best attribute Cyrille Berger Machine Learning 38/74 Information Gain We want the most useful attribute to classify the examples. It measures how well a given attribute separates the training examples according to their target classification. Cyrille Berger Machine Learning 39/74 Information Gain Entropy measures the impurity of a sample of training examples S: Entropy(S) = ∑v -pv log2 pv Gain(S,A) the expected reduction in entropy due to sorting on A: Gain(S,A) = Entropy(S) - ∑v∊A(Entropy(Sv) |Sv|/|S|) Cyrille Berger Machine Learning 40/74 Decision Trees Learning Day Outlook Temp Humidity Wind Play D1 D2 D3 D4 D5 D6 D7 D8 D9 Sunny Sunny Overcast Rain Rain Rain Overcast Sunny Sunny Hot Hot Hot Mild Cool Cool Cool Mild Cool High High High High Normal Normal Normal High Normal Weak Strong Weak Weak Weak Strong Strong Weak Weak No No Yes Yes Yes No Yes No Yes D10 D11 D12 D13 D14 Rain Sunny Overcast Overcast Rain Mild Mild Mild Hot Mild Normal Normal High Normal High Weak Strong Strong Weak Strong Yes Yes Yes Yes No Cyrille Berger Machine Learning 41/74 Selecting the "Best" Attribute Cyrille Berger Machine Learning 42/74 Decision Trees - Summary Advantages • They have a naturally interpretable representation. • Very fast to evaluate, as they are minimal. • They can support continuous inputs and multiple classes output. • There are robust algorithms and software that are widely used. Disadvantages • Not necesseraly an optimal representation. • Risk of overfitting. Cyrille Berger Machine Learning 43/74 Some Applications of Supervised Learning Robotics – Learning The Dynamics of Motion •Realistic problems are often highly non-linear •A ground robot have at least a handful dimensions •Air vehicles (UAVs) have at least a dozen dimensions •Humanoid robots have at least 30-60 dimensions •The human body is said to have over 600 muscles Multi-Agent Problems – Learning Behaviors •If there are multiple agents, the naive way of supercomposing them into one problem will scale the dimensionality accordingly •The successful approaches here seem to decompose them into behaviours for individual agents and cooperation, although reinforcement learning may be more suitable Cyrille Berger Machine Learning 44/74 Evolutionary Techniques Genetic algorithms • Bitsrings as genes Genetic programming • Programs as genes Evolutionary programming • Fix program structure, numerical parameters are allowed to evolve. Cyrille Berger Machine Learning 45/74 Natural Selection Individuals compete and the most fit survives and generates new offspring for the next generation. x x x x Environment x x Individuals Cyrille Berger Machine Learning 46/74 Algorithmic Perspective While objective not reached do • Select solutions (Selection) • Generate offspring(Recombination) • Mutate (Mutation) x x x x Problem space x x Solutions Cyrille Berger Machine Learning 47/74 Selection Fitness-value – computed by a fitness-function and determines how fit the individual is. Selection schemes: • Roulette-based selection The fitter the individual is, the more chance of being selected. • Tournament selection 2+ Individuals randomly selected, fitness-value determines probability of winning. ”Restricted” roulette-based selection. • Elitism selection A number of the most fit individuals are always selected, the rest is decided by another algorithm. Cyrille Berger Machine Learning 48/74 Generate Offspring Encoding • Matching an individuals genes to a solution in the problem space. Crossover • N individuals → M new individuals Mutation • Small change in one individual Cyrille Berger Machine Learning 49/74 Offspring Operators Cyrille Berger Machine Learning 50/74 A Genetic Algorithm Problem Encoding • How do we encode so that the solution space exactly matches the problem space? Fitness-function • What is a “good” solution? How much does it differ from another solution? Objective function • What is the objective? • When are we done? Other variables • Initial population size • Selection / recombination algorithms • Mutation rate • End criteria Cyrille Berger Machine Learning 51/74 Genetic Algorithm For Robots Can be usefull to solve a path finding problem: Cyrille Berger Machine Learning 52/74 Evolutionary Techniques – Summary Advantages • Moves a population of solutions toward the optimal solution defined by the fitness-function over a number of generations. • An optimisation technique, if you can encode it • Solves problem with multiple solutions and multiple minima! Disadvantages • Slow to converge. • Some optimisation problems cannot be solved. • No guarantee to find a global optimum. • Not reasonnable for online applications! Cyrille Berger Machine Learning 53/74 Reenforcement Learning Examples: • Game of chess • Soccer • Control of a robot: helicopter, humanoid robot... Cyrille Berger Machine Learning 54/74 Reinforcement Learning Definition Agents are given sensory inputs: • State s ∊ S • Reward R ∊ ℝ At each steps, agents select an output: • Action a ∊ A Cyrille Berger Machine Learning 55/74 Naïve Approach Use supervised learning to learn: f(s, a) = R For any input state, pick the best action: a = arg max f(s,a) a∊A Will that work? Cyrille Berger Machine Learning 56/74 Markov Decision Process The agent need to think ahead ! It needs a good sequence of actions. Formalized in the Markov Decision Process framework ! Cyrille Berger Machine Learning 57/74 Markov Decision Process Finite set of states S, finite set of actions A. At each discrete time step t: • agent observes state st ∊ S • chooses action at ∊ A • and receives an immediate reward rt. The state changes to st+1 ∊ S. Markov assumption is st+1 = δ(st, at) and rt = r(st, at). Cyrille Berger Machine Learning 58/74 Policy Function The policy function decides which action to take in each state: at=π(st) The policy function is what we want to learn! Cyrille Berger Machine Learning 59/74 Rewards To think ahead: • an agents looks at future rewards: r(st+1,at+1), r(st+2,at+2)... • formalized as the sum of rewards (also called utility or value): V= ∑ɣ t r(st,at) ɣ is the discount factor making rewards far off into the future less valuable. If we follow a specific policy π: Vπ(st) = r(st,π(st)) + ɣVπ(st+1) Cyrille Berger Machine Learning 60/74 Value Function Example Value function for random movement Cyrille Berger Optimal policy Optimal value function Machine Learning 61/74 Optimal Policy If we follow a specific policy π: Vπ(st) = r(st,π(st)) + ɣVπ(st+1) If we know Vπ(st) then the policy π is given by: π(s) = argmaxa( r(s,a) + ɣVπ(δ(s, a))) Finding the optimal policy is about finding π(s) or Vπ(st) or both. Cyrille Berger Machine Learning 62/74 Value Iteration Initialize the function V(s) with random values For each state st do: • compute ΔV(st) = maxa(r(st,a) + ɣV(st+1)) - V(xt) • update V(xt) with ΔV(xt) Cyrille Berger Machine Learning 63/74 Q-Function Learning in unknown environment Optimal policy • π*(s) = argmaxa( r(s,a) + V*(δ(s, a))) • What if we do not know δ(s, a)? or r(s,a)? Q-Function • Q(s,a) = r(s,a) + V*(δ(s, a))) • π*(s) = argmaxa(Q(s,a)) Cyrille Berger Machine Learning 64/74 Update the Q-Function Q and V* are closely related: V *(s) = maxa Q(s,a) Q can be written as: Q(st ,at) = r(st, at) + V*(δ(st, at)) = r(st, at) + ɣ maxa' Q (st+1 ,a' ) If Q^ denote the current approximation of Q then it can be updated by: Q^(s,a) := r + ɣ maxa' Q^(s',a') Cyrille Berger Machine Learning 65/74 Q-Learning for Deterministic Worlds For each s, a initialize table entry Q^(s,a) ← 0. Observe current state s. Do forever: 1. Select an action a and execute it 2. Receive immediate reward r 3. Observe the new state s' 4. Update the table entry for Q^(s,a): Q^(s,a) := r + ɣ maxa' Q^(s',a') 5. s ← s' Cyrille Berger Machine Learning 66/74 Q-Learning Example +100 +100 Cyrille Berger Machine Learning 67/74 Reinforcement Learning Summary Advantages • Allows an agent to adapt to maximise rewards in a potentially unknown environment. Disadvantages • Requires computation polynomial in the number of states! • The number of states grows exponentially with input dimensions! • Reinforcement Learning assumes discrete states and action spaces. Cyrille Berger Machine Learning 68/74 Summary Supervised Learning • Linear Regression • Neural Network • Bayesian Network Genetic Algorithm Reenforcement Learning: Q-Learning Decision Trees Cyrille Berger Machine Learning 69/74 Summary: when to use what? There is no universal solution to the problem of machine learning. Approximation of a continuous function ⟹ Supervised Learning: linear regression, neural network Problem expressed with rewards ⟹ Reinforcement Learning Understanding of the structure of the problem ⟹ Decision Trees or Bayesian Networks Optimization technique ⟹ Genetic Algorithms Always use the most appropriate technique! Cyrille Berger Machine Learning 70/74 Layered Learning A mapping directly from inputs to outputs is not tractably learnable. A bottom-up hierarchical task decomposition is given. Machine learning exploits data to train and/or adapt. Learning occurs separately at each level. The output of learning in one layer feeds into the next layer. Cyrille Berger Machine Learning 71/74 Task allocation learning • Fires are grouped • Local decision for which building to extinguish • Selective perception • Decision tree with Q-Values • At each step, use the tree to get the reward for extinguishing a specific building Cyrille Berger Machine Learning 72/74 Limitations of Machine Learning • We noted earlier that the first phase of learning is usually to select the ”features” to use as input into the algorithm • In the spam classification example we restricted ourselves to a set of relevant words, but even that could be thousands #features • Even for such binary features we would have needed O(2 examples to calculate the full joint probability distribution • Likewise, in a continuous feature space, if we need a grid of 10 examples along each feature dimension to approximate the #features underlying true function, we need O(10 ) examples. Cyrille Berger Machine Learning ) 73/74 The Curse of Dimensionality • This is known as the curse of dimensionality and applies to all machine learning techniques • However, this is a worst case scenario. The true amount of data needed to learn a model depends on the complexity of the underlying function • Many learning attempts work rather well even for many features, however being able to construct a small set of informative features can be the difference between success and failure in many domains. (Bishop, 74 2006) 17:53 Cyrille Berger Machine Learning 74/74