TDDD10 AI Programming Machine Learning Cyrille Berger, PhD

advertisement
TDDD10 AI
Programming
Machine Learning
Cyrille Berger, PhD
Unmanned Aircraft Systems Technologies Laboratory
Artificial Intelligence and Integrated Computer Systems Division
Department of Computer and Information Science
Linköping University, Sweden
Cyrille Berger
Machine Learning
1/74
What is Machine Learning about?
To imbue the capacity to learn into machines
Our only reference frame for learning is from the animal kingdom
…but brains are hideously complex machines, the result of ages of
random mutation
Like much of AI, Machine Learning takes an engineering approach!
Humanity didn’t first master flight by just imitating birds!
Although there is some
occasional biological inspiration
Cyrille Berger
Machine Learning
2/74
Why Machine Learning?
•
It may be impossible to manually program for every situation in
advance.
•
The model of the environment is unknown, incomplete or outdated,
if the agent cannot adapt it will fail.
•
Approximating complex functions.
•
Many argue that learning is required for AI to scale up.
•
Data collections are too large for humans to cope with.
•
…but primarily, the algorithms we have so far have shown themselves
to be useful in a wide range of applications!
Cyrille Berger
Machine Learning
3/74
Some Application Aspects
May not be as versatile as human learning, but domain specific problems
can often be processed much faster than by a human
Computers work 24/7 and you can often scale performance by piling on
more of them
Data Mining
Companies can collect ever more data and processing power is cheap
Put it to use automatically analyzing the performance of products!
Machine Learning is almost ubiquitous on the web: Mail filters, search
engines, product recommendations , customized content, ad
serving…
Robotics, Computer Vision
Many capabilities that humans take for granted like locomotion and
recognizing objects have turned out to be ridiculously difficult to
manually construct rules for.
Cyrille Berger
Machine Learning
4/74
Demo – Stanford Helicopter
Acrobatics
…in narrow applications machine learning can even
rival human performance
Cyrille Berger
Machine Learning
5/74
To Define Machine Learning
A machine learns with respect to a particular task T,
performance metric P, and type of experience E, if
the system reliably improves its performance P at
task T, following experience E
-Tom Mitchell
From the agent perspective:
Performance Metric
Input (Sensors)
Agent
Environment
Output (Actuators)
Cyrille Berger
Machine Learning
6/74
Supervised Learning
at a glance
In Supervised Learning:
•
The correct output is given to the algorithm during a training
phase
•
Experience E is thus tuples of training data consisting of
(Input,Output) pairs
•
Performance metric P is some function of how well the
predicted output matches the given correct output
Mathematically, can be seen as trying to approximate an unknown
function f(x) = y given examples of (x, y)
Some real world examples
•
Spam Filter: f(mail) = spam?
•
Microsoft Kinect: f(pixels) = body part
7
Cyrille Berger
Machine Learning
7/74
Unsupervised Learning
at a glance
In Unsupervised Learning:
•
The Task is to find a more concise representation of the data
•
Neither the correct answer, nor a reward is given
•
Experience E is just the given data
•
Performance metric P depends on the task T
Examples:
Clustering – When the data distribution is confined to lie in a small
number of “clusters “ we can find these and use them instead of the
original representation
Dimensionality Reduction – Finding a suitable lower dimensional
representation while preserving as much information as possible
Cyrille Berger
Machine Learning
8/74
Reinforcement Learning
at a glance
In Reinforcement Learning:
•
A reward is given at each step instead of the correct input
•
Experience E consists of the history of inputs, the chosen outputs
and the rewards
•
Performance metric P is some sum of how much reward the agent
can accumulate
Inspired by early work in psychology and how pets are trained
The agent can learn on its own as long as the reward signal can be
concisely defined.
Real world examples – Robot Control, Game Playing (Checkers…)
Cyrille Berger
Machine Learning
9/74
Evolutionary algorithms
at a glance
In Evolution Learning:
•
Random agents are generated
•
At each step, the best performing agents are kept and combined
to generate new solutions
•
Experience E consists of the history of solutions
•
Performance metric P is some sum of how much an agent perform
Inspired by early work by the theory of evolution
Most classical exemple is the so called "genetic algorithm"
Cyrille Berger
Machine Learning
10/74
Supervised Learning
•
Linear Regression
•
Neural Network
•
Bayesian Network
Cyrille Berger
Machine Learning
11/74
The Steps of Supervised Learning
Can be seen as searching for an approximation to the unknown
function y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn)
The goal is to have the algorithm learn from a small number of
examples to successfully generalize to new examples
•
First construct a feature vector xi of examples by encoding
relevant problem data. Examples of xi, yi is the training set.
•
The algorithm is trained on the examples by searching a family
of functions (the hypothesis space) for the one that most
closely approximates the unknown true function
•
If the quality is poor at this point, change algorithm or features
Cyrille Berger
Machine Learning
12/74
Simple Classification Example
I.
Construct a feature vector xi to be used with
examples of yi
II.
Select algorithm and train on training set by
searching for a good approximation to the
unknown function
Fictional example: Classify if a day was
sunny or rainy based on daily sales of
umbrellas and ice cream at a store.
Feature vector xi = (umbrellas, ice creams),
yi = “sunny” or “rainy”
Cyrille Berger
Machine Learning
13/74
Training a learning algorithm...
Feature vector xi = (umbrellas), yi = (ice creams)
Want to find approximation h(x) to the unknown function f(x)
How do we find parameters that result in a good hypothesis h?
•
Loss function: L(f(x),h(x))
•
Optimisation
Cyrille Berger
Machine Learning
14/74
Limitations of
Training a Learning Algorithm
Locally greedy optimization approaches only work if the loss function is
convex w.r.t. w, i.e. there is only one minima
Linear regression models are always convex, however more advanced
models that we will look at later are vulnerable to getting stuck in
local minina
Care should be taken when training such models by utilizing for example
random restarts
Start positions in red area will
get stuck in local minima!
Cyrille Berger
Machine Learning
15/74
Linear Models in Summary
Advantages
•Linear algorithms are simple and computationally efficient
•Training them is often a convex problem, so one is guaranteed to find
the best hypothesis in the hypothesis space
Disadvantages
•The hypothesis space is very restricted, it cannot handle non-linear
relations well
They are widely used in applications
•
Recommender Systems – Initial Netflix Cinematch was a linear
regression, before their $1 million competition to improve it
•
At the core of the recent Google Gmail Priority feature is a linear
classifier
•
And many more…
Cyrille Berger
Machine Learning
16/74
Neural Networks
Biologically inspired with some differences
Basically a function approximation
•Based on units
•Connected by links
•A unit is acivted by stimulation
•Structured in a network
Cyrille Berger
Machine Learning
17/74
Neuron
Each neuron is a linear model of all the inputs
passed through a non-linear activation function g
Cyrille Berger
Machine Learning
18/74
Activation function
The activation function is typically the sigmoid
The activation function in a perceptron is the step function
Cyrille Berger
Machine Learning
19/74
Design of Neural Nets
An artificial neural network is specified by:
• a topology (the number of layers, the number of neurons in each
layer and the interconnections between them), and
• a training algorithm.
A representation of the input and an interpretation of the output are
also needed.
“Neural networks are the second best way of doing just about
anything” – John Denker
Cyrille Berger
Machine Learning
20/74
Network Topology
•
Networks are composed of layers
•
All neurons in a layer are typically connected to all neurons in the
next layer, but not to each other
Cyrille Berger
Machine Learning
21/74
Network Topology
A two layer network (one hidden layer) with a sufficient number of
neurons is good enough for most problems
Cyrille Berger
Machine Learning
22/74
Training of neural networks
Like before, training is the result of an optimization!
We can compute a loss function (errors) for outputs of the last layer
against the training examples
And for hidden layers, use backpropagation:
”The trick is to assess the blame for an error and divide it among the
contributing weights” – Russel & Norvig
Cyrille Berger
Machine Learning
23/74
When to Consider Neural Nets
•
•
•
•
•
•
Input is high-dimensional discrete or real-valued
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant
Cyrille Berger
Machine Learning
24/74
Video of robot
avoiding obstacles
Cyrille Berger
Machine Learning
25/74
Artificial Neural Networks –
Summary
Advantages
•Very large hypothesis space, under some conditions it is a universal
approximator to any function f(x)
•The model can be rather compact
•Has been moderately popular in applications ever since the 90ies
•Tolerant to noisy data
Disadvantages
•Training is a non-convex problem with many local minima
•Takes time to learn
•Has many tuning parameters to twiddle with (number of neurons, layers,
starting weights…)
•Topology design is critical
•Very difficult to interpret or debug weights in the network
•The biological similarity is vastly over-simplified (and frequently overstated)
Cyrille Berger
Machine Learning
26/74
Abuse of Neural Network
in RoboCup Soccer
Cyrille Berger
Machine Learning
27/74
Bayesian Network
A Bayesian Network is a graph of probabilistic dependencies.
It contains:
• Variables as nodes
• Joint-Probability as directed edges
Cyrille Berger
Machine Learning
28/74
Bayesian Network
for modeling dependencies
(Source: Wikipedia)
Cyrille Berger
Machine Learning
29/74
Bayesian Network
for solving crimes
Cyrille Berger
Machine Learning
30/74
Bayesian Network
for SPAM Filtering
Most known for popularizing learning spam filters
Spam classification example:
Each mail is an input, some mails are
flagged as spam or not spam to create
training examples.
Feature vector
Feature
Exists?
“Accept”
True
“Bank”
False
“Pay”
True
….
…
Encode the “Customer”
existence of
a
fixed
set
of
True
relevant key
words in each
mail as the
“Dollar”
False
feature
vector.
“Nigeria”
False
xi = wordsi =
Cyrille Berger
Machine Learning
31/74
Bayesian Network
for decision making
Cold
TakeDrug
Allergy
UD
RunnyNose
Fever
UT
Temp
Cyrille Berger
MeasTemperature
Machine Learning
32/74
Training a probabilistic model
Resulting in
How do we train a probabilistic model given examples?
Generally by optimization like we did with linear deterministic models,
but deriving the loss function is a bit more complex
In this case the hypothesis space is the family of fully conditionally
independent and binary word distributions
The only parameters in the model that need to be searched over are
the probability distributions on the right hand side.
In this case they have a simple closed form solution: word
frequencies for examples of spam and non-spam in the training set,
in addition to the frequency of spam in general!
Cyrille Berger
Machine Learning
33/74
Bayesian Networks – Summary
Advantages
•Very easy to debug or interpret the weights in the network.
•Expose the relationships between variables and easy conversion in
decision.
Disadvantages
•Difficult to learn.
•Topology design is critical. And especially which nodes to include ?
•Difficulty to train cyclic networks
Cyrille Berger
Machine Learning
34/74
Decision Trees
It is a graph of possible decisions and their possible consequences.
Used to identify the best strategy to reach a given goal.
Rush
to ball.
Ball kickable?
no
Pass.
no
yes
Good shoot?
yes
Cyrille Berger
Machine Learning
Shoot at
the goal.
35/74
Decision Trees Learning
What is great about Decision Trees?
...you can learn them.
It is one of the most widely used and practical method for classification.
The goal is to create a model that predicts the output value based on
several input variables.
It is usefull when one wants a compact and easily understood
representation.
Cyrille Berger
Machine Learning
36/74
Exemple of Decision Trees Learning
Outlook = Sunny,
Humidity = Normal,
Wind = Weak
Cyrille Berger
Machine Learning
37/74
Decision Trees Learning
Decision trees are learnt by constructing them top-down:
• Select the best root of the tree:
• statistical test to find how well it classifies the training sample
• the best attribute is selected
• Descendants are created for each possible value of the attribute
and the attribute matching are propagated down
• The entire process is repeated until each descendent of the node
select the best attribute
Cyrille Berger
Machine Learning
38/74
Information Gain
We want the most useful attribute to classify the examples.
It measures how well a given attribute separates the training
examples according to their target classification.
Cyrille Berger
Machine Learning
39/74
Information Gain
Entropy measures the impurity of a sample of training examples S:
Entropy(S) =
∑v -pv log2 pv
Gain(S,A) the expected reduction in entropy due to sorting on A:
Gain(S,A) = Entropy(S) - ∑v∊A(Entropy(Sv) |Sv|/|S|)
Cyrille Berger
Machine Learning
40/74
Decision Trees Learning
Day
Outlook
Temp
Humidity
Wind
Play
D1
D2
D3
D4
D5
D6
D7
D8
D9
Sunny
Sunny
Overcast
Rain
Rain
Rain
Overcast
Sunny
Sunny
Hot
Hot
Hot
Mild
Cool
Cool
Cool
Mild
Cool
High
High
High
High
Normal
Normal
Normal
High
Normal
Weak
Strong
Weak
Weak
Weak
Strong
Strong
Weak
Weak
No
No
Yes
Yes
Yes
No
Yes
No
Yes
D10
D11
D12
D13
D14
Rain
Sunny
Overcast
Overcast
Rain
Mild
Mild
Mild
Hot
Mild
Normal
Normal
High
Normal
High
Weak
Strong
Strong
Weak
Strong
Yes
Yes
Yes
Yes
No
Cyrille Berger
Machine Learning
41/74
Selecting the "Best" Attribute
Cyrille Berger
Machine Learning
42/74
Decision Trees - Summary
Advantages
• They have a naturally interpretable representation.
• Very fast to evaluate, as they are minimal.
• They can support continuous inputs and multiple classes output.
• There are robust algorithms and software that are widely used.
Disadvantages
• Not necesseraly an optimal representation.
• Risk of overfitting.
Cyrille Berger
Machine Learning
43/74
Some Applications
of Supervised Learning
Robotics – Learning The Dynamics of Motion
•Realistic problems are often highly non-linear
•A ground robot have at least a handful dimensions
•Air vehicles (UAVs) have at least a dozen dimensions
•Humanoid robots have at least 30-60 dimensions
•The human body is said to have over 600 muscles
Multi-Agent Problems – Learning Behaviors
•If there are multiple agents, the naive way of supercomposing them
into one problem will scale the dimensionality accordingly
•The successful approaches here seem to decompose them into
behaviours for individual agents and cooperation, although
reinforcement learning may be more suitable
Cyrille Berger
Machine Learning
44/74
Evolutionary Techniques
Genetic algorithms
• Bitsrings as genes
Genetic programming
• Programs as genes
Evolutionary programming
• Fix program structure, numerical parameters are allowed to evolve.
Cyrille Berger
Machine Learning
45/74
Natural Selection
Individuals compete and the most fit survives and
generates new offspring for the next generation.
x
x
x
x
Environment
x
x
Individuals
Cyrille Berger
Machine Learning
46/74
Algorithmic Perspective
While objective not reached do
• Select solutions
(Selection)
• Generate offspring(Recombination)
• Mutate
(Mutation)
x
x
x
x
Problem space
x
x
Solutions
Cyrille Berger
Machine Learning
47/74
Selection
Fitness-value – computed by a fitness-function and determines how fit
the individual is.
Selection schemes:
• Roulette-based selection
The fitter the individual is, the more chance of being selected.
• Tournament selection
2+ Individuals randomly selected, fitness-value determines probability
of winning. ”Restricted” roulette-based selection.
• Elitism selection
A number of the most fit individuals are always selected, the rest is
decided by another algorithm.
Cyrille Berger
Machine Learning
48/74
Generate Offspring
Encoding
• Matching an individuals genes to a solution in the problem space.
Crossover
• N individuals → M new individuals
Mutation
• Small change in one individual
Cyrille Berger
Machine Learning
49/74
Offspring Operators
Cyrille Berger
Machine Learning
50/74
A Genetic Algorithm Problem
Encoding
• How do we encode so that the solution space exactly matches the
problem space?
Fitness-function
• What is a “good” solution? How much does it differ from another solution?
Objective function
• What is the objective?
• When are we done?
Other variables
• Initial population size
• Selection / recombination algorithms
• Mutation rate
• End criteria
Cyrille Berger
Machine Learning
51/74
Genetic Algorithm For Robots
Can be usefull to solve a path finding problem:
Cyrille Berger
Machine Learning
52/74
Evolutionary Techniques –
Summary
Advantages
• Moves a population of solutions toward the optimal solution defined by
the fitness-function over a number of generations.
• An optimisation technique, if you can encode it
• Solves problem with multiple solutions and multiple minima!
Disadvantages
• Slow to converge.
• Some optimisation problems cannot be solved.
• No guarantee to find a global optimum.
• Not reasonnable for online applications!
Cyrille Berger
Machine Learning
53/74
Reenforcement Learning
Examples:
• Game of chess
• Soccer
• Control of a robot: helicopter, humanoid robot...
Cyrille Berger
Machine Learning
54/74
Reinforcement Learning Definition
Agents are given sensory inputs:
• State s ∊ S
• Reward R ∊ ℝ
At each steps, agents select an output:
• Action a ∊ A
Cyrille Berger
Machine Learning
55/74
Naïve Approach
Use supervised learning to learn:
f(s, a) = R
For any input state, pick the best action:
a = arg max f(s,a)
a∊A
Will that work?
Cyrille Berger
Machine Learning
56/74
Markov Decision Process
The agent need to think ahead !
It needs a good sequence of actions.
Formalized in the Markov Decision Process framework !
Cyrille Berger
Machine Learning
57/74
Markov Decision Process
Finite set of states S, finite set of actions A.
At each discrete time step t:
• agent observes state st ∊ S
• chooses action at ∊ A
• and receives an immediate reward rt.
The state changes to st+1 ∊ S.
Markov assumption is st+1 = δ(st, at) and rt = r(st, at).
Cyrille Berger
Machine Learning
58/74
Policy Function
The policy function decides which action to take in each state:
at=π(st)
The policy function is what we want to learn!
Cyrille Berger
Machine Learning
59/74
Rewards
To think ahead:
• an agents looks at future rewards: r(st+1,at+1), r(st+2,at+2)...
• formalized as the sum of rewards (also called utility or value):
V=
∑ɣ
t
r(st,at)
ɣ is the discount factor making rewards far off into the future less
valuable.
If we follow a specific policy π:
Vπ(st) = r(st,π(st)) + ɣVπ(st+1)
Cyrille Berger
Machine Learning
60/74
Value Function Example
Value function
for random
movement
Cyrille Berger
Optimal policy
Optimal value
function
Machine Learning
61/74
Optimal Policy
If we follow a specific policy π:
Vπ(st) = r(st,π(st)) + ɣVπ(st+1)
If we know Vπ(st) then the policy π is given by:
π(s) = argmaxa( r(s,a) + ɣVπ(δ(s, a)))
Finding the optimal policy is about finding π(s) or Vπ(st) or both.
Cyrille Berger
Machine Learning
62/74
Value Iteration
Initialize the function V(s) with random values
For each state st do:
• compute ΔV(st) = maxa(r(st,a) + ɣV(st+1)) - V(xt)
• update V(xt) with ΔV(xt)
Cyrille Berger
Machine Learning
63/74
Q-Function Learning
in unknown environment
Optimal policy
• π*(s) = argmaxa( r(s,a) + V*(δ(s, a)))
• What if we do not know δ(s, a)? or r(s,a)?
Q-Function
• Q(s,a) = r(s,a) + V*(δ(s, a)))
• π*(s) = argmaxa(Q(s,a))
Cyrille Berger
Machine Learning
64/74
Update the Q-Function
Q and V* are closely related:
V *(s) = maxa Q(s,a)
Q can be written as:
Q(st ,at) = r(st, at) + V*(δ(st, at))
= r(st, at) + ɣ maxa' Q (st+1 ,a' )
If Q^ denote the current approximation of Q then it can be updated
by:
Q^(s,a) := r + ɣ maxa' Q^(s',a')
Cyrille Berger
Machine Learning
65/74
Q-Learning for
Deterministic Worlds
For each s, a initialize table entry Q^(s,a) ← 0.
Observe current state s.
Do forever:
1. Select an action a and execute it
2. Receive immediate reward r
3. Observe the new state s'
4. Update the table entry for Q^(s,a):
Q^(s,a) := r + ɣ maxa' Q^(s',a')
5. s ← s'
Cyrille Berger
Machine Learning
66/74
Q-Learning Example
+100
+100
Cyrille Berger
Machine Learning
67/74
Reinforcement Learning Summary
Advantages
• Allows an agent to adapt to maximise rewards in a potentially
unknown environment.
Disadvantages
• Requires computation polynomial in the number of states!
• The number of states grows exponentially with input dimensions!
• Reinforcement Learning assumes discrete states and action spaces.
Cyrille Berger
Machine Learning
68/74
Summary
Supervised Learning
• Linear Regression
• Neural Network
• Bayesian Network
Genetic Algorithm
Reenforcement Learning: Q-Learning
Decision Trees
Cyrille Berger
Machine Learning
69/74
Summary: when to use what?
There is no universal solution to the problem of machine learning.
Approximation of a continuous function
⟹ Supervised Learning: linear regression, neural network
Problem expressed with rewards
⟹ Reinforcement Learning
Understanding of the structure of the problem
⟹ Decision Trees or Bayesian Networks
Optimization technique
⟹ Genetic Algorithms
Always use the most appropriate technique!
Cyrille Berger
Machine Learning
70/74
Layered Learning
A mapping directly from inputs to outputs is not tractably learnable.
A bottom-up hierarchical task decomposition is given.
Machine learning exploits data to train and/or adapt. Learning occurs
separately at each level.
The output of learning in one layer feeds into the next layer.
Cyrille Berger
Machine Learning
71/74
Task allocation learning
• Fires are grouped
• Local decision for which building to
extinguish
• Selective perception
• Decision tree with Q-Values
• At each step, use the tree to get
the reward for extinguishing a
specific building
Cyrille Berger
Machine Learning
72/74
Limitations of Machine Learning
•
We noted earlier that the first phase of learning is usually to select
the ”features” to use as input into the algorithm
•
In the spam classification example we restricted ourselves to a set of
relevant words, but even that could be thousands
#features
•
Even for such binary features we would have needed O(2
examples to calculate the full joint probability distribution
•
Likewise, in a continuous feature space, if we need a grid of 10
examples along each feature dimension to approximate the
#features
underlying true function, we need O(10
) examples.
Cyrille Berger
Machine Learning
)
73/74
The Curse of Dimensionality
•
This is known as the curse of dimensionality and applies to all
machine learning techniques
•
However, this is a worst case scenario. The true amount of data
needed to learn a model depends on the complexity of the
underlying function
•
Many learning attempts work rather well even for many features,
however being able to construct a small set of informative features can
be the difference between success and failure in many domains.
(Bishop,
74
2006)
17:53
Cyrille Berger
Machine Learning
74/74
Download