A Gentle Introduction to Machine Learning What is Machine Learning about? 9/29/2015

advertisement
9/29/2015
A Gentle Introduction to
Machine Learning
First Lecture
Olov Andersson, AIICS
Linköpings Universitet
What is Machine Learning about?
•
•
To imbue the capacity to learn into machines
Our only frame of reference for learning is from biology
 …but brains are hideously complex, the result of ages of evolution
•
•
Like much of AI, Machine Learning mainly takes an engineering
approach1
Humanity didn’t first master flight by just imitating birds!
1.
2015-09-29
Although there is some
occasional biological
inspiration
2
1
9/29/2015
Why Machine Learning
•
Difficult to manually program agents for every possible situation
•
The world changes, if the agent cannot adapt it will fail
•
Many argue that learning is required to scale up AI
•
We are still far from a general learning agent!
 but the algorithms we have so far have shown themselves to be useful
in a wide range of applications!
2015-09-29
3
Some Application Aspects
•
•
Not as versatile as human learning, but domain specific problems
can often be processed much faster
Computers work 24/7 and you can often scale performance by
piling on more of them
Data Mining
 Companies collect ever more data and processing power is cheap
 Put it to use automatically analyzing the performance of products!
 Machine Learning is almost ubiquitous on the web: Mail filters, search
engines, product recommendations , customized content, ad
serving…
 “Big Data” – much hyped technology trend.
Robotics, Computer Vision
 Many capabilities that humans take for granted like locomotion,
grasping and recognizing objects have turned out to be ridiculously
difficult to manually construct rules for.
2015-09-29
4
2
9/29/2015
Demo – Stanford Helicopter Acrobatics
…in narrow applications machine learning can even rival human
performance
2015-09-29
5
To Define Machine Learning
A machine learns with respect to a particular task T, performance
metric P, and type of experience E, if the system reliably
improves its performance P at task T, following experience E
-Tom Mitchell
From the agent perspective:
Performance Metric
Input (Sensors)
Output (Actuators)
2015-09-29
Task
(Environment)
Agent
7
3
9/29/2015
The Three Main Types of Machine Learning
Machine learning is a young science that is still changing, but
traditionally algorithms are usually divided into three types
depending on their purpose.
•
Supervised Learning
•
Reinforcement Learning
•
Unsupervised Learning
2015-09-29
8
Supervised Learning at a Glance
A machine learns with respect to a particular task T, performance
metric P, and type of experience E, if the system reliably
improves its performance P at task T, following experience E
-Tom Mitchell
In Supervised Learning:
• Examples of correct behavior is given in a training phase.
• Experience: Tuples of correct (Input,Output) pairs.
• Performance metric: How well predicted output matches the
correct output.
Mathematically, can be seen as trying to approximate an
unknown function f(x) = y given examples of (x, y)
2015-09-29
9
4
9/29/2015
Supervised Learning – Agent Perspective
Representation from agent perspective:
Performance Metric
state
Input (Sensors)
f(input)=output
or
f(state) = action
Output (Actuators)
Task
(Environment)
Reactive
Agent
action
…but it can also be used as a component in other architectures
Supervised Learning is surprisingly powerful and ubiquitous
Some real world examples
• Spam Filter: f(mail) = spam?
• Microsoft Kinect: f(pixels, distance) = body part
2015-09-29
10
Body Part Classification on the Microsoft Kinect
right hand
neck
left
shoulder
right
elbow
Shotton et al @ MSR, CVPR 2011
2015-09-29
11
5
9/29/2015
Reinforcement Learning at a Glance
A machine learns with respect to a particular task T, performance
metric P, and type of experience E, if the system reliably
improves its performance P at task T, following experience E
-Tom Mitchell
In Reinforcement Learning:
• A reward is given at each step instead of the correct input/output
• Instead of mimicking examples, the agent plans ahead.
• Experience E: history of inputs, chosen outputs and rewards
• Performance metric P is sum of reward over time
Inspired by early work in psychology and how pets are trained
The agent can learn on its own if the reward signal can be
mathematically defined.
2015-09-29
12
Reinforcement Learning at a glance II
RL is based on a utility (reward) maximizing agent framework
• Rewards of actions in different states are learned
• Agent plans ahead to maximize reward over time
Performance Metric
(reward)
state
Input (Sensors)
Maximize total rewards
Output (Actuators)
Task
(Environment)
RL Agent
f(state, action) = reward
action
Real world examples – Robot Control, Game Playing (Checkers…)
2015-09-29
13
6
9/29/2015
Unsupervised Learning at a glance
A machine learns with respect to a particular task T, performance
metric P, and type of experience E, if the system reliably
improves its performance P at task T, following experience E
-Tom Mitchell
In Unsupervised Learning:
• The Task is to find a more concise representation of the data
• Neither the correct answer, nor a reward is given
• Experience E can be arbitrary data
• P varies from task to task
Examples:
Clustering – When the data distribution is confined to lie in a small
number of “clusters “ we can find these and use them instead of the
original representation
Dimensionality Reduction – Finding a suitable lower dimensional
representation while preserving as much information as possible
2015-09-29
14
Clustering – Continuous Data
Two-dimensional continuous input
2015-09-29
(Bishop, 2006)
16
7
9/29/2015
Topic Modeling – Science Articles (Blei et al, 2009)
Two-dimensional continuous input
(Bishop, 2006)
2015-09-29
17
Outline of Machine Learning Lectures
First we will talk about Supervised Learning
• Definition
• Main Concepts
• Some Approaches & Applications
• Pitfalls & Limitations
• In-depth: Decision Trees (a supervised learning approach)
Then finish with a short introduction to Reinforcement Learning
The idea is that you will be informed enough to find and try a
learning algorithm if the need arises.
2015-09-29
18
8
9/29/2015
Formalizing Supervised Learning
Remember, in Supervised Learning:
• Given tuples of training data consisting of (x,y) pairs
• The objective is to learn to predict the output y’ for a new input x’
Formalized as searching for an approximation to an unknown
function y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn)
• A candidate approximation is sometimes called a hypothesis
Two major classes of supervised learning
• Classification – Output is a discrete category label
 Example: Detecting disease, y = “healthy” or “ill”
•
Regression – Output is a numeric value
 Example: Predicting temperature, y = 15.3 degrees
In either case input data xi can be vector valued and discrete,
continuous or mixed. Example: x1 = (12.5, “cloud free”, 1.35).
2015-09-29
19
Supervised Learning in Practice
Can be seen as searching for an approximation to the unknown
function y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn)
Want the algorithm to generalize from training examples to new
inputs x’, so that y’=f(x’) is “close” to the correct answer.
•
First construct an input vector xi of examples by encoding
relevant problem data. This is often called the feature vector.
•
Examples of such (xi, yi) is the training set.
•
A model is selected and trained on the examples by searching
for parameters (the hypothesis space) that yield a good
approximation to the unknown true function.
•
Evaluate performance, (carefully) tweak algorithm or features.
2015-09-29
20
9
,
,
,
9/29/2015
Feature Vector Construction
Want to learn y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn)
•
•
Most standard algorithms work on real number variables.
If the inputs x or outputs y contain categorical values like “book” or
“car”, we need to encode them with numbers.
 With only two classes we get y in {0,1}, called binary classification
 Classification into multiple classes can be reduced to a sequence of
binary one-vs-all classifiers
•
•
The variables may also be structured like in text, graphs, audio,
image or video data
Finding a suitable feature representation can be non-trivial, but
there are standard approaches for the common domains
2015-09-29
21
Example of Feature Vector Construction
One of the early successes was learning spam filters
Spam classification example:
Each mail is an input, some mails are flagged as
spam or not spam to create training examples.
Feature vector:
Encode the existence of a fixed set of relevant
key words in each mail as the feature vector.
xi = wordsi =
Feature
Exists?
“Customer”
1 (Yes)
“Dollar”
0 (No)
“Nigeria”
0
“Accept”
1
“Bank”
0
….
…
yi = 1 (spam) or 0 (not spam)
Simply learn f(x)=y using suitable classifier!
2015-09-29
22
10
9/29/2015
Simple Linear Classification Example
I.
II.
Construct a feature vector xi to be used with examples of yi
Select algorithm and train on training data by searching for a
good approximation to the unknown function
Fictional example: A learning smartphone app that determines if
silent mode should be on or off based on background noise, light
level and previous user choices.
Feature vector xi = (Light level, Noise), yi = {“silent on”, “silent off”}
•
•
•
Select the familiy of linear
discriminant functions
Train the algorithm by searching for
a line that separates the classes
well
New cases will be classified
according to which side they fall
2015-09-29
23
2015-09-29
24
11
9/29/2015
Simple Linear Regression Example
I.
II.
Construct a feature vector xi to be used with examples of yi
Select algorithm and train on training set by searching for a
good approximation to the unknown function
Fictional example: Same smartphone app but now we want to
predict the ring volume based on noise level (only)
Feature vector xi = (Noise), yi = (Volume)
•
•
Select the familiy of linear functions
Train the algorithm by searching for
a line that fits the data well
…but how does ”training” really work?
2015-09-29
25
2015-09-29
26
12
9/29/2015
Training a Learning Algorithm
Feature vector xi = (Noise), outputs yi = (Volume)
•
•
Want to find approximation h(x) to the unknown function f(x)
As an example we select the hypothesis space to be the family of
polynomials of degree one, that is linear functions:
•
•
The hypothesis space of
has two parameters
How do we find parameters that result in a good hypothesis h?
Three (poor)
linear hypotheses
2015-09-29
27
Training a Learning Algorithm – Loss Functions
How do we find parameters w that result in a good hypothesis
•
What is a ”good” hypothesis?
•
Maximizing some function of how well it fits the data
?
 Equivalently: minimize how well it doesn’t fit the data
 Loss functions
•
For regression one common choice is a sum square loss function:
•
Search in continuous domains like w is known as optimization
 (see Ch4.2 in course book AIMA)
2015-09-29
28
13
9/29/2015
Training a Learning Algorithm – Optimization
How do we find parameters w that minimize the loss?
•
•
Optimization approaches typically move in the direction that
locally decreases the loss function
Simple and popular approach: gradient descent
Initialize w to some random point in the parameter space
loop until decrease in loss is small
for each
in w do
Note:
2015-09-29
29
Training a Learning Algorithm – Limitations
Limitations
• Locally greedy – Gets stuck in local minima unless the loss function
is convex w.r.t. w, i.e. there is only one minima.
• Linear models are convex, however most more advanced models
are vulnerable to getting stuck in local minina.
• Care should be taken when training such models by using for
example random restarts and picking the least bad minima.
Start positions in red area will
get stuck in a local minima!
2015-09-29
30
14
9/29/2015
Training a Learning Algorithm – Loss Functions II
•
What about classification?
 Squared error does not make sense when target output in {0,1}
•
Custom loss functions for classification
 Number of missclassifications (unsmooth w.r.t. parameter changes)
 (inverse) information gain (used in decision trees, next lecture)
•
•
These require specialized parameter search methods
Alternative: Squash a numeric output to [0,1], then use square
error!
Sigmoid functions allow us to use any
regression method for classification.
Common choice is the logistic function:
2015-09-29
31
Linear Models in Summary
Advantages
• Linear algorithms are simple and computationally efficient.
• Training them is a convex problem, so one is guaranteed to find
the best hypothesis in the hypothesis space.
• Can be extended by non-linear feature transformations.
Disadvantages
• The hypothesis space is very restricted, it cannot handle nonlinear relations well
They are widely used in applications
• Recommender Systems – Initial Netflix Cinematch was a linear
regression, before their $1 million competition to improve it
• At the core of many big internet services. Ad systems at Twitter,
Facebook, Google etc...
2015-09-29
32
15
9/29/2015
Beyond Linear Models – Artificial Neural
Networks
•
One non-linear model that has captivated people for decades is
Artificial Neural Networks (ANNs)
•
These draw upon inspiration from the physical structure of the
brain as an interconnected network of ”neurons”, represented by
non-linear ”activation functions” of the inputs
The Neuron
The Network
2015-09-29
33
Artificial Neural Networks – The Neuron
•
In univariate (one input) linear regression we used the model:
•
Each neuron in an ANN is a linear model of all the inputs passed
through a non-linear activation function g
•
The activation function is typically the sigmoid
2015-09-29
34
16
9/29/2015
Artificial Neural Networks – The Neuron II
•
•
•
However, there is not just one neuron, but a network of neurons!
Each neuron gets inputs from all neurons in the previous layer.
We rewrite our neuron definition using ai for the input, aj for the
output and wi,j for the weight parameters:
2015-09-29
35
Artificial Neural Networks – The Network
•
•
•
The networks are composed into layers
All neurons in a layer are typically connected to all neurons in the
next layer, but not to each other
Expanding the output of a second layer neuron (5) we get
2015-09-29
36
17
9/29/2015
Artificial Neural Networks – Training
•
•
•
•
How do we train an ANN to find the best parameters wi,j?
By optimization, minimizing a loss function like before!
Compute a loss function (errors) for outputs of the last layer
against the training examples.
A classical approach is backpropagation, a fast gradient descent
by using the chain rule and caching shared terms.
Predict output on training set
Compute errors
w.r.t. a loss function
Propagate errors backwards
and adjust weights in all layers
2015-09-29
37
Recent Development: Deep Learning
Abstraction
•
ANNs (among others) can use many
layers to capture abstractions.
•
Some tasks are uniquely suited to this
like vision, text and speech
recognition.
•
Already used by Google, MSFT etc.
•
Training these require a lot of care,
often in combination with
unsupervised learning methods.
Faces
Facial parts
Edges
(Honglak Lee, 2009)
2015-09-29
38
18
9/29/2015
Artificial Neural Networks – Summary
Advantages
• Very large hypothesis space, under some conditions it is a
universal approximator to any function f(x)
• Some (but weak) biological justification
• Has been moderately popular in applications ever since the 90ies
• Recently: Can be layered to capture abstraction (deep learning)
 Used for speech, object and text recognition at Google, MSFT etc.
 Often using millions of neurons/parameters and GPU acceleration.
Disadvantages
• Training is a non-convex problem with many local minima
• Has many tuning parameters to twiddle with (number of neurons,
layers, starting weights…)
• Very difficult to interpret or debug weights in the network
2015-09-29
39
Overfitting
•
•
Models can overfit if you have too many parameters in relation to
the training set size.
Example: 9th degree polynomial regression model (10 parameters)
on 15 data points:
Green: True function (unknown)
Blue: Training examples (noisy!)
Red: Trained model
(Bishop, 2006)
•
•
This is not a local minima during training, it is the best fit possible
on the given training examples!
The trained model captured ”noise”, variations independent of f(x)
2015-09-29
40
19
9/29/2015
Overfitting – Where Does the Noise Come From?
•
Noise are small variations in the data due to unknown variables,
that cannot be predicted by the feature vector x
•
Example: Predict the temperature based on season, time-of-day and
cloud cover. What about atmospheric changes like a cold front? As
they are not included in the model, nor predictable from the other
variables, this variation will show up as random noise in the model!
•
With few data points to go on, the model can mistake the effect of
unmodeled variables has on y as coming from variables x that are
included.
•
Since this relationship is merely chance, the model will not
generalize well to future situations
2015-09-29
41
Model Selection – Choosing Between Models
•
•
In conclusion, we want to avoid unnecessarily complex models
This is a fairly general concept throughout science and is often
referred to as Ockham’s Razor:
“Pluralitas non est ponenda sine necessitate”
-Willian of Ockham
“Everything should be kept as simple as possible, but no simpler.”
-Albert Einstein (paraphrased)
•
•
There are several mathematically principled ways to penalize
model complexity during training
However, the most straightforward way is to keep a separate
validation set of known examples that is only used for
evaluating the performance of different models and not for
training them.
2015-09-29
42
20
9/29/2015
Model Selection – Hold-out Validation
•
•
•
•
This is called a hold-out validation set as we keep the data away
from the training phase
Measuring performance on such a validation set is a better metric
of actual generalization error to unseen examples
With the validation set we can compare models of different
complexity to select the one which generalizes best.
Examples could be polynomial models of different order, the
number of neurons or layers in an ANN etc.
Given example data:
Training Set
Validation Set
2015-09-29
43
Model Selection – Selection Strategy
•
As the number of parameters increases, the size of the
hypothesis space also increases, allowing a better fit to
training data
•
However, at some point it is sufficiently flexible to capture
the underlying patterns. Any more will just capture noise,
leading to worse generalization to new examples!
Example: Prediction error
vs. model complexity over
many (simulated) data sets.
(Hastie et al., 2009)
Best choice Overfitting
Red: Validation Set (Actual) Error
Blue: Training Set Error
2015-09-29
44
21
9/29/2015
Thank you for listening!
2015-09-29
56
22
Download