9/29/2015 A Gentle Introduction to Machine Learning First Lecture Olov Andersson, AIICS Linköpings Universitet What is Machine Learning about? • • To imbue the capacity to learn into machines Our only frame of reference for learning is from biology …but brains are hideously complex, the result of ages of evolution • • Like much of AI, Machine Learning mainly takes an engineering approach1 Humanity didn’t first master flight by just imitating birds! 1. 2015-09-29 Although there is some occasional biological inspiration 2 1 9/29/2015 Why Machine Learning • Difficult to manually program agents for every possible situation • The world changes, if the agent cannot adapt it will fail • Many argue that learning is required to scale up AI • We are still far from a general learning agent! but the algorithms we have so far have shown themselves to be useful in a wide range of applications! 2015-09-29 3 Some Application Aspects • • Not as versatile as human learning, but domain specific problems can often be processed much faster Computers work 24/7 and you can often scale performance by piling on more of them Data Mining Companies collect ever more data and processing power is cheap Put it to use automatically analyzing the performance of products! Machine Learning is almost ubiquitous on the web: Mail filters, search engines, product recommendations , customized content, ad serving… “Big Data” – much hyped technology trend. Robotics, Computer Vision Many capabilities that humans take for granted like locomotion, grasping and recognizing objects have turned out to be ridiculously difficult to manually construct rules for. 2015-09-29 4 2 9/29/2015 Demo – Stanford Helicopter Acrobatics …in narrow applications machine learning can even rival human performance 2015-09-29 5 To Define Machine Learning A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E -Tom Mitchell From the agent perspective: Performance Metric Input (Sensors) Output (Actuators) 2015-09-29 Task (Environment) Agent 7 3 9/29/2015 The Three Main Types of Machine Learning Machine learning is a young science that is still changing, but traditionally algorithms are usually divided into three types depending on their purpose. • Supervised Learning • Reinforcement Learning • Unsupervised Learning 2015-09-29 8 Supervised Learning at a Glance A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E -Tom Mitchell In Supervised Learning: • Examples of correct behavior is given in a training phase. • Experience: Tuples of correct (Input,Output) pairs. • Performance metric: How well predicted output matches the correct output. Mathematically, can be seen as trying to approximate an unknown function f(x) = y given examples of (x, y) 2015-09-29 9 4 9/29/2015 Supervised Learning – Agent Perspective Representation from agent perspective: Performance Metric state Input (Sensors) f(input)=output or f(state) = action Output (Actuators) Task (Environment) Reactive Agent action …but it can also be used as a component in other architectures Supervised Learning is surprisingly powerful and ubiquitous Some real world examples • Spam Filter: f(mail) = spam? • Microsoft Kinect: f(pixels, distance) = body part 2015-09-29 10 Body Part Classification on the Microsoft Kinect right hand neck left shoulder right elbow Shotton et al @ MSR, CVPR 2011 2015-09-29 11 5 9/29/2015 Reinforcement Learning at a Glance A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E -Tom Mitchell In Reinforcement Learning: • A reward is given at each step instead of the correct input/output • Instead of mimicking examples, the agent plans ahead. • Experience E: history of inputs, chosen outputs and rewards • Performance metric P is sum of reward over time Inspired by early work in psychology and how pets are trained The agent can learn on its own if the reward signal can be mathematically defined. 2015-09-29 12 Reinforcement Learning at a glance II RL is based on a utility (reward) maximizing agent framework • Rewards of actions in different states are learned • Agent plans ahead to maximize reward over time Performance Metric (reward) state Input (Sensors) Maximize total rewards Output (Actuators) Task (Environment) RL Agent f(state, action) = reward action Real world examples – Robot Control, Game Playing (Checkers…) 2015-09-29 13 6 9/29/2015 Unsupervised Learning at a glance A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E -Tom Mitchell In Unsupervised Learning: • The Task is to find a more concise representation of the data • Neither the correct answer, nor a reward is given • Experience E can be arbitrary data • P varies from task to task Examples: Clustering – When the data distribution is confined to lie in a small number of “clusters “ we can find these and use them instead of the original representation Dimensionality Reduction – Finding a suitable lower dimensional representation while preserving as much information as possible 2015-09-29 14 Clustering – Continuous Data Two-dimensional continuous input 2015-09-29 (Bishop, 2006) 16 7 9/29/2015 Topic Modeling – Science Articles (Blei et al, 2009) Two-dimensional continuous input (Bishop, 2006) 2015-09-29 17 Outline of Machine Learning Lectures First we will talk about Supervised Learning • Definition • Main Concepts • Some Approaches & Applications • Pitfalls & Limitations • In-depth: Decision Trees (a supervised learning approach) Then finish with a short introduction to Reinforcement Learning The idea is that you will be informed enough to find and try a learning algorithm if the need arises. 2015-09-29 18 8 9/29/2015 Formalizing Supervised Learning Remember, in Supervised Learning: • Given tuples of training data consisting of (x,y) pairs • The objective is to learn to predict the output y’ for a new input x’ Formalized as searching for an approximation to an unknown function y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn) • A candidate approximation is sometimes called a hypothesis Two major classes of supervised learning • Classification – Output is a discrete category label Example: Detecting disease, y = “healthy” or “ill” • Regression – Output is a numeric value Example: Predicting temperature, y = 15.3 degrees In either case input data xi can be vector valued and discrete, continuous or mixed. Example: x1 = (12.5, “cloud free”, 1.35). 2015-09-29 19 Supervised Learning in Practice Can be seen as searching for an approximation to the unknown function y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn) Want the algorithm to generalize from training examples to new inputs x’, so that y’=f(x’) is “close” to the correct answer. • First construct an input vector xi of examples by encoding relevant problem data. This is often called the feature vector. • Examples of such (xi, yi) is the training set. • A model is selected and trained on the examples by searching for parameters (the hypothesis space) that yield a good approximation to the unknown true function. • Evaluate performance, (carefully) tweak algorithm or features. 2015-09-29 20 9 , , , 9/29/2015 Feature Vector Construction Want to learn y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn) • • Most standard algorithms work on real number variables. If the inputs x or outputs y contain categorical values like “book” or “car”, we need to encode them with numbers. With only two classes we get y in {0,1}, called binary classification Classification into multiple classes can be reduced to a sequence of binary one-vs-all classifiers • • The variables may also be structured like in text, graphs, audio, image or video data Finding a suitable feature representation can be non-trivial, but there are standard approaches for the common domains 2015-09-29 21 Example of Feature Vector Construction One of the early successes was learning spam filters Spam classification example: Each mail is an input, some mails are flagged as spam or not spam to create training examples. Feature vector: Encode the existence of a fixed set of relevant key words in each mail as the feature vector. xi = wordsi = Feature Exists? “Customer” 1 (Yes) “Dollar” 0 (No) “Nigeria” 0 “Accept” 1 “Bank” 0 …. … yi = 1 (spam) or 0 (not spam) Simply learn f(x)=y using suitable classifier! 2015-09-29 22 10 9/29/2015 Simple Linear Classification Example I. II. Construct a feature vector xi to be used with examples of yi Select algorithm and train on training data by searching for a good approximation to the unknown function Fictional example: A learning smartphone app that determines if silent mode should be on or off based on background noise, light level and previous user choices. Feature vector xi = (Light level, Noise), yi = {“silent on”, “silent off”} • • • Select the familiy of linear discriminant functions Train the algorithm by searching for a line that separates the classes well New cases will be classified according to which side they fall 2015-09-29 23 2015-09-29 24 11 9/29/2015 Simple Linear Regression Example I. II. Construct a feature vector xi to be used with examples of yi Select algorithm and train on training set by searching for a good approximation to the unknown function Fictional example: Same smartphone app but now we want to predict the ring volume based on noise level (only) Feature vector xi = (Noise), yi = (Volume) • • Select the familiy of linear functions Train the algorithm by searching for a line that fits the data well …but how does ”training” really work? 2015-09-29 25 2015-09-29 26 12 9/29/2015 Training a Learning Algorithm Feature vector xi = (Noise), outputs yi = (Volume) • • Want to find approximation h(x) to the unknown function f(x) As an example we select the hypothesis space to be the family of polynomials of degree one, that is linear functions: • • The hypothesis space of has two parameters How do we find parameters that result in a good hypothesis h? Three (poor) linear hypotheses 2015-09-29 27 Training a Learning Algorithm – Loss Functions How do we find parameters w that result in a good hypothesis • What is a ”good” hypothesis? • Maximizing some function of how well it fits the data ? Equivalently: minimize how well it doesn’t fit the data Loss functions • For regression one common choice is a sum square loss function: • Search in continuous domains like w is known as optimization (see Ch4.2 in course book AIMA) 2015-09-29 28 13 9/29/2015 Training a Learning Algorithm – Optimization How do we find parameters w that minimize the loss? • • Optimization approaches typically move in the direction that locally decreases the loss function Simple and popular approach: gradient descent Initialize w to some random point in the parameter space loop until decrease in loss is small for each in w do Note: 2015-09-29 29 Training a Learning Algorithm – Limitations Limitations • Locally greedy – Gets stuck in local minima unless the loss function is convex w.r.t. w, i.e. there is only one minima. • Linear models are convex, however most more advanced models are vulnerable to getting stuck in local minina. • Care should be taken when training such models by using for example random restarts and picking the least bad minima. Start positions in red area will get stuck in a local minima! 2015-09-29 30 14 9/29/2015 Training a Learning Algorithm – Loss Functions II • What about classification? Squared error does not make sense when target output in {0,1} • Custom loss functions for classification Number of missclassifications (unsmooth w.r.t. parameter changes) (inverse) information gain (used in decision trees, next lecture) • • These require specialized parameter search methods Alternative: Squash a numeric output to [0,1], then use square error! Sigmoid functions allow us to use any regression method for classification. Common choice is the logistic function: 2015-09-29 31 Linear Models in Summary Advantages • Linear algorithms are simple and computationally efficient. • Training them is a convex problem, so one is guaranteed to find the best hypothesis in the hypothesis space. • Can be extended by non-linear feature transformations. Disadvantages • The hypothesis space is very restricted, it cannot handle nonlinear relations well They are widely used in applications • Recommender Systems – Initial Netflix Cinematch was a linear regression, before their $1 million competition to improve it • At the core of many big internet services. Ad systems at Twitter, Facebook, Google etc... 2015-09-29 32 15 9/29/2015 Beyond Linear Models – Artificial Neural Networks • One non-linear model that has captivated people for decades is Artificial Neural Networks (ANNs) • These draw upon inspiration from the physical structure of the brain as an interconnected network of ”neurons”, represented by non-linear ”activation functions” of the inputs The Neuron The Network 2015-09-29 33 Artificial Neural Networks – The Neuron • In univariate (one input) linear regression we used the model: • Each neuron in an ANN is a linear model of all the inputs passed through a non-linear activation function g • The activation function is typically the sigmoid 2015-09-29 34 16 9/29/2015 Artificial Neural Networks – The Neuron II • • • However, there is not just one neuron, but a network of neurons! Each neuron gets inputs from all neurons in the previous layer. We rewrite our neuron definition using ai for the input, aj for the output and wi,j for the weight parameters: 2015-09-29 35 Artificial Neural Networks – The Network • • • The networks are composed into layers All neurons in a layer are typically connected to all neurons in the next layer, but not to each other Expanding the output of a second layer neuron (5) we get 2015-09-29 36 17 9/29/2015 Artificial Neural Networks – Training • • • • How do we train an ANN to find the best parameters wi,j? By optimization, minimizing a loss function like before! Compute a loss function (errors) for outputs of the last layer against the training examples. A classical approach is backpropagation, a fast gradient descent by using the chain rule and caching shared terms. Predict output on training set Compute errors w.r.t. a loss function Propagate errors backwards and adjust weights in all layers 2015-09-29 37 Recent Development: Deep Learning Abstraction • ANNs (among others) can use many layers to capture abstractions. • Some tasks are uniquely suited to this like vision, text and speech recognition. • Already used by Google, MSFT etc. • Training these require a lot of care, often in combination with unsupervised learning methods. Faces Facial parts Edges (Honglak Lee, 2009) 2015-09-29 38 18 9/29/2015 Artificial Neural Networks – Summary Advantages • Very large hypothesis space, under some conditions it is a universal approximator to any function f(x) • Some (but weak) biological justification • Has been moderately popular in applications ever since the 90ies • Recently: Can be layered to capture abstraction (deep learning) Used for speech, object and text recognition at Google, MSFT etc. Often using millions of neurons/parameters and GPU acceleration. Disadvantages • Training is a non-convex problem with many local minima • Has many tuning parameters to twiddle with (number of neurons, layers, starting weights…) • Very difficult to interpret or debug weights in the network 2015-09-29 39 Overfitting • • Models can overfit if you have too many parameters in relation to the training set size. Example: 9th degree polynomial regression model (10 parameters) on 15 data points: Green: True function (unknown) Blue: Training examples (noisy!) Red: Trained model (Bishop, 2006) • • This is not a local minima during training, it is the best fit possible on the given training examples! The trained model captured ”noise”, variations independent of f(x) 2015-09-29 40 19 9/29/2015 Overfitting – Where Does the Noise Come From? • Noise are small variations in the data due to unknown variables, that cannot be predicted by the feature vector x • Example: Predict the temperature based on season, time-of-day and cloud cover. What about atmospheric changes like a cold front? As they are not included in the model, nor predictable from the other variables, this variation will show up as random noise in the model! • With few data points to go on, the model can mistake the effect of unmodeled variables has on y as coming from variables x that are included. • Since this relationship is merely chance, the model will not generalize well to future situations 2015-09-29 41 Model Selection – Choosing Between Models • • In conclusion, we want to avoid unnecessarily complex models This is a fairly general concept throughout science and is often referred to as Ockham’s Razor: “Pluralitas non est ponenda sine necessitate” -Willian of Ockham “Everything should be kept as simple as possible, but no simpler.” -Albert Einstein (paraphrased) • • There are several mathematically principled ways to penalize model complexity during training However, the most straightforward way is to keep a separate validation set of known examples that is only used for evaluating the performance of different models and not for training them. 2015-09-29 42 20 9/29/2015 Model Selection – Hold-out Validation • • • • This is called a hold-out validation set as we keep the data away from the training phase Measuring performance on such a validation set is a better metric of actual generalization error to unseen examples With the validation set we can compare models of different complexity to select the one which generalizes best. Examples could be polynomial models of different order, the number of neurons or layers in an ANN etc. Given example data: Training Set Validation Set 2015-09-29 43 Model Selection – Selection Strategy • As the number of parameters increases, the size of the hypothesis space also increases, allowing a better fit to training data • However, at some point it is sufficiently flexible to capture the underlying patterns. Any more will just capture noise, leading to worse generalization to new examples! Example: Prediction error vs. model complexity over many (simulated) data sets. (Hastie et al., 2009) Best choice Overfitting Red: Validation Set (Actual) Error Blue: Training Set Error 2015-09-29 44 21 9/29/2015 Thank you for listening! 2015-09-29 56 22