Learning

Learning
R&N: ch 19, ch 20
Types of Learning
Supervised Learning – classification,
prediction
Unsupervised Learning – clustering,
segmentation, pattern discovery
Reinforcement Learning – learning
MDPs, online learning
Supervised Learning
Someone gives you a bunch of
examples, telling you what each one is
Eventually, you figure out the mapping
from properties (features) of the
examples to their type (classification)
Supervised Learning
Logic-based approaches:
learn a function f(X) → {true, false}
Statistical/Probabilistic approaches:
learn a probability distribution P(Y|X)
Outline
Logic-based Approaches
Decision Trees (lecture 24)
Version Spaces (today)
PAC learning (we won’t cover)
Instance-based approaches
Statistical Approaches
Predicate-Learning Methods
Version space
Need to provide H with some “structure”: an explicit representation of hypothesis space H
Putting Things Together
[Diagram: an Object set, described via a Goal predicate and Observable predicates, yields an Example set X, split into a Training set and a Test set. A Learning procedure L — constrained by the Hypothesis space H and its Bias — induces a hypothesis h from the Training set; Evaluation on the Test set answers yes/no.]
Version Spaces
The “version space” is the set of all
hypotheses that are consistent with the
training instances processed so far.
An algorithm:
V := H   ;; the version space V is ALL hypotheses in H
For each example e:
  Eliminate any member of V that disagrees with e
  If V is empty, FAIL
Return V as the set of consistent hypotheses
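The steps above can be sketched directly. This is a minimal illustration, with hypotheses represented as arbitrary boolean predicate functions (a simplifying assumption; all names are mine):

```python
# Naive version-space learning: keep every hypothesis consistent
# with the examples processed so far.

def version_space(hypotheses, examples):
    """hypotheses: list of functions x -> bool
       examples:   list of (x, label) pairs"""
    V = list(hypotheses)                      # V := H
    for x, label in examples:
        # eliminate any member of V that disagrees with the example
        V = [h for h in V if h(x) == label]
        if not V:
            raise ValueError("FAIL: no hypothesis is consistent")
    return V

# Tiny illustration: learn "x is even" from three candidate hypotheses
H = {"even":     lambda x: x % 2 == 0,
     "positive": lambda x: x > 0,
     "small":    lambda x: x < 10}
examples = [(4, True), (7, False), (-2, True)]
V = version_space(list(H.values()), examples)   # only "even" survives
```

Note the problem already visible here: V starts as all of H, which is intractable for realistic hypothesis spaces — the motivation for the tricks on the next slides.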
Version Spaces: The Problem
PROBLEM: V is huge!!
Suppose you have N attributes, each with k possible values
Suppose you allow a hypothesis to be any disjunction of instances
There are k^N possible instances ⇒ |H| = 2^(k^N)
If N=5 and k=2, |H| = 2^32!!
Version Spaces: The Tricks
First Trick: Don’t allow arbitrary disjunctions

Organize the feature values into a hierarchy of allowed disjunctions, e.g.:
  any-color
    dark
      black
      blue
    pale
      white
      yellow
Now there are only 7 “abstract values” instead of 16 disjunctive combinations (e.g., “black or white” isn’t allowed)
Second Trick: Define a partial ordering on H (“general
to specific”) and only keep track of the upper bound
and lower bound of the version space
RESULT: An incremental, possibly efficient algorithm!
Rewarded Card Example
(r=1) ∨ … ∨ (r=10) ∨ (r=J) ∨ (r=Q) ∨ (r=K) ⇔ ANY-RANK(r)
(r=1) ∨ … ∨ (r=10) ⇔ NUM(r)
(r=J) ∨ (r=Q) ∨ (r=K) ⇔ FACE(r)
(s=♠) ∨ (s=♣) ∨ (s=♦) ∨ (s=♥) ⇔ ANY-SUIT(s)
(s=♠) ∨ (s=♣) ⇔ BLACK(s)
(s=♦) ∨ (s=♥) ⇔ RED(s)
A hypothesis is any sentence of the form:
R(r) ∧ S(s) ⇒ IN-CLASS([r,s])
where:
• R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j)
• S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k)
Simplified Representation
For simplicity, we represent a concept by rs, with:
• r ∈ {a, n, f, 1, …, 10, j, q, k}
• s ∈ {a, b, r, ♠, ♣, ♦, ♥}
For example:
• n♠ represents:
NUM(r) ∧ (s=♠) ⇒ IN-CLASS([r,s])
• aa represents:
ANY-RANK(r) ∧ ANY-SUIT(s) ⇒ IN-CLASS([r,s])
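The rs encoding can be sketched in code. This is an illustrative sketch; the ASCII letters S, C, D, H stand in for the four suit symbols, and all function names are my own:

```python
# A hypothesis is a (rank, suit) pair: rank is 'a' (any rank),
# 'n' (num), 'f' (face), or a specific rank; suit is 'a' (any suit),
# 'b' (black), 'r' (red), or a specific suit.

NUM   = {'1', '2', '3', '4', '5', '6', '7', '8', '9', '10'}
FACE  = {'j', 'q', 'k'}
BLACK = {'S', 'C'}      # spades, clubs  (ASCII stand-ins)
RED   = {'D', 'H'}      # diamonds, hearts

def rank_ok(R, r):
    return R == 'a' or (R == 'n' and r in NUM) \
        or (R == 'f' and r in FACE) or R == r

def suit_ok(S, s):
    return S == 'a' or (S == 'b' and s in BLACK) \
        or (S == 'r' and s in RED) or S == s

def satisfies(h, card):
    """Does card (r, s) satisfy hypothesis h = (R, S)?"""
    (R, S), (r, s) = h, card
    return rank_ok(R, r) and suit_ok(S, s)
```

For instance, `('n', 'S')` encodes NUM(r) ∧ (s=♠) ⇒ IN-CLASS([r,s]).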
Extension of a Hypothesis
The extension of a hypothesis h is
the set of objects that satisfy h
Examples:
• The extension of f is: {j, q, k}
• The extension of aa is the set of all cards
More General/Specific Relation
Let h1 and h2 be two hypotheses in H
h1 is more general than h2 iff the
extension of h1 is a proper superset of
the extension of h2
Examples:
• aa is more general than f♠
• f♠ is more general than q♠
• fr and nr are not comparable
The inverse of the “more general”
relation is the “more specific” relation
The “more general” relation defines a
partial ordering on the hypotheses in H
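The relation can be checked directly from extensions, as defined above (a sketch; the group names and ASCII suit letters are my own):

```python
# h1 is more general than h2 iff extension(h1) is a PROPER superset
# of extension(h2).

from itertools import product

RANKS = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'j', 'q', 'k']
SUITS = ['S', 'C', 'D', 'H']                 # ASCII stand-ins for suit symbols
RGROUP = {'a': set(RANKS), 'n': set(RANKS[:10]), 'f': {'j', 'q', 'k'}}
SGROUP = {'a': set(SUITS), 'b': {'S', 'C'}, 'r': {'D', 'H'}}

def extension(h):
    """The set of cards (r, s) satisfying hypothesis h = (R, S)."""
    R, S = h
    return set(product(RGROUP.get(R, {R}), SGROUP.get(S, {S})))

def more_general(h1, h2):
    return extension(h1) > extension(h2)     # proper superset
```

This reproduces the slide’s examples: aa is more general than f♠, f♠ is more general than q♠, and “face red” vs. “num red” are incomparable (disjoint extensions).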
Example: Subset of Partial Order
aa
na
4a
ab
a
nb
n
4b
4
Construction of Ordering Relation
Rank hierarchy:
  a
    n: 1 … 10
    f: j … k
Suit hierarchy:
  a
    b: ♠, ♣
    r: ♦, ♥
G-Boundary / S-Boundary of V
A hypothesis in V is most general iff no
hypothesis in V is more general
G-boundary G of V: Set of most general
hypotheses in V
A hypothesis in V is most specific iff no
hypothesis in V is more specific
S-boundary S of V: Set of most specific
hypotheses in V
Example: G-/S-Boundaries of V
Now suppose that 4♣ is given as a positive example. We replace every hypothesis in S whose extension does not contain 4♣ by its generalization set.
[Lattice diagram: G = {aa} at the top; below it na, 4a, ab, a♣, nb, n♣, 4b; at the bottom, the fully specific hypotheses 1, …, 4, …, k.]
Example: G-/S-Boundaries of V
aa
na G and Sab
Here, both
have size 1.
This is not the case in general!
4a
a
nb
n
4b
4
Example: G-/S-Boundaries of V
na
4a
The
aa generalization set
of an hypothesis h is the
set of the
ab hypotheses
that are immediately more
general
than a
h
nb
n
4b
Let 7 be the next
(positive) example
4
Generalization
set of 4
Example: G-/S-Boundaries of V
aa
na
4a
ab
a
nb
n
4b
Let 7 be the next
(positive) example
4
Example: G-/S-Boundaries of V
Specialization
set of aa
aa
na
ab
a
nb
Let 5 be the next
(negative) example
n
Example: G-/S-Boundaries of V
G and S, and all hypotheses in between
form exactly the version space
ab
a
nb
n
Example: G-/S-Boundaries of V
At this stage …
ab
No
Yes
Maybe
Do 8, 6, j
satisfy CONCEPT?
a
nb
n
Example: G-/S-Boundaries of V
ab
a
nb
Let 2 be the next
(positive) example
n
Example: G-/S-Boundaries of V
ab
nb
Let j be the next
(negative) example
Example: G-/S-Boundaries of V
+ 4♣ 7♣ 2♠
– 5♥ j♠
The version space is reduced to the single hypothesis nb:
NUM(r) ∧ BLACK(s) ⇒ IN-CLASS([r,s])
Example: G-/S-Boundaries of V
Let us return to the
version space …
… and let 8 be the next
(negative) example
ab
nb
The only most specific
hypothesis disagrees with n
this example, so no
hypothesis in H agrees with
all examples
a
Example: G-/S-Boundaries of V
Let us return to the
version space …
… and let j be the next
(positive) example
ab
nb
The only most general
hypothesis disagrees with n
this example, so no
hypothesis in H agrees with
all examples
a
Version Space Update
1. x ← new example
2. If x is positive then (G,S) ← POSITIVE-UPDATE(G,S,x)
3. Else (G,S) ← NEGATIVE-UPDATE(G,S,x)
4. If G or S is empty then return failure
POSITIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in G that do not agree with x
2. Minimally generalize all hypotheses in S until they are consistent with x, using the generalization sets of the hypotheses
3. Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G (this step was not needed in the card example)
4. Remove from S every hypothesis that is more general than another hypothesis in S
5. Return (G,S)
NEGATIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in S that do not
agree with x
2. Minimally specialize all hypotheses in G
until they are consistent with x
3. Remove from G every hypothesis that is
neither more general than nor equal to a
hypothesis in S
4. Remove from G every hypothesis that is
more specific than another hypothesis in G
5. Return (G,S)
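The update procedures can be sketched for the card domain. Below is an illustrative POSITIVE-UPDATE only (NEGATIVE-UPDATE is the dual, specializing G via specialization sets); the hand-coded hierarchies, ASCII suit letters, and names are my own:

```python
from itertools import product

RANKS = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'j', 'q', 'k']
SUITS = ['S', 'C', 'D', 'H']                 # ASCII stand-ins for suit symbols
RG = {'a': set(RANKS), 'n': set(RANKS[:10]), 'f': {'j', 'q', 'k'}}
SG = {'a': set(SUITS), 'b': {'S', 'C'}, 'r': {'D', 'H'}}
R_PARENT = {**{r: 'n' for r in RANKS[:10]},
            **{r: 'f' for r in ('j', 'q', 'k')}, 'n': 'a', 'f': 'a'}
S_PARENT = {'S': 'b', 'C': 'b', 'D': 'r', 'H': 'r', 'b': 'a', 'r': 'a'}

def ext(h):
    """Extension of hypothesis h = (R, S): the set of cards it covers."""
    R, S = h
    return set(product(RG.get(R, {R}), SG.get(S, {S})))

def gen_set(h):
    """Generalization set: hypotheses immediately more general than h."""
    R, S = h
    out = []
    if R in R_PARENT:
        out.append((R_PARENT[R], S))
    if S in S_PARENT:
        out.append((R, S_PARENT[S]))
    return out

def positive_update(G, S_, x):
    # 1. eliminate members of G that disagree with the positive example
    G = [g for g in G if x in ext(g)]
    # 2. minimally generalize members of S until consistent with x
    frontier, new_S = list(S_), set()
    while frontier:
        h = frontier.pop()
        if x in ext(h):
            new_S.add(h)
        else:
            frontier.extend(gen_set(h))
    # 3. keep only hypotheses more specific than or equal to some g in G
    new_S = {s for s in new_S if any(ext(s) <= ext(g) for g in G)}
    # 4. remove hypotheses more general than another hypothesis in S
    new_S = {s for s in new_S
             if not any(t != s and ext(t) < ext(s) for t in new_S)}
    return G, new_S

# After positive 4-clubs then positive 7-clubs:
# G stays {aa}, S generalizes from {4-clubs} to {n-clubs}
G, S = positive_update([('a', 'a')], [('4', 'C')], ('7', 'C'))
```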
Example-Selection Strategy
(aka Active Learning)
Suppose that at each step the learning
procedure has the possibility to select
the object (card) of the next example
Let it pick the object such that, whether
the example is positive or not, it will
eliminate one-half of the remaining
hypotheses
Then a single hypothesis will be isolated
in O(log |H|) steps
Example
[Lattice diagram: candidate next examples 9?, j?, j? shown against the current version space (aa, na, ab, a, nb, n).]
Example-Selection Strategy
But picking the object that eliminates half the
version space may be expensive
Noise
If some examples are misclassified, the
version space may collapse
Possible solution:
Maintain several G- and S-boundaries,
e.g., consistent with all examples, all
examples but one, etc…
Version Spaces
Useful for understanding logical
(consistency-based) approaches to
learning
Practical if there is a strong hypothesis bias
(concept hierarchies, conjunctive
hypotheses)
Don’t handle noise well
VSL vs DTL
Decision tree learning (DTL) is more efficient
if all examples are given in advance; else, it
may produce successive hypotheses, each
poorly related to the previous one
Version space learning (VSL) is incremental
DTL can produce simplified hypotheses that
do not agree with all examples
DTL has been more widely used in practice
Statistical Approaches
Instance-based Learning (20.4)
Statistical Learning (20.1)
Neural Networks (20.5)
Nearest Neighbor Methods
To classify a new input vector x, examine the k-closest training
data points to x and assign the object to the most frequently
occurring class
[Figure: a query point x classified with a k=1 and a k=6 neighborhood.]
Issues
Distance measure
  Most common: Euclidean
  Better distance measures: normalize each variable by its standard deviation
Choosing k
  Increasing k reduces variance, increases bias
For high-dimensional spaces, the nearest neighbor may not be very close at all!
Memory-based technique: must make a pass through the data for each classification. This can be prohibitive for large data sets. Indexing the data can help, for example with k-d trees.
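A minimal sketch of the classifier described above, normalizing each variable by its (population) standard deviation as suggested; the names and toy data are mine:

```python
import statistics
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (feature_vector, label).
    Returns the majority label of the k nearest neighbors under
    std-normalized Euclidean distance."""
    dims = len(query)
    # per-feature standard deviation (guard against zero)
    sds = [statistics.pstdev([x[i] for x, _ in train]) or 1.0
           for i in range(dims)]
    def dist(a):
        return sum(((a[i] - query[i]) / sds[i]) ** 2 for i in range(dims))
    neighbors = sorted(train, key=lambda e: dist(e[0]))[:k]
    return Counter(lab for _, lab in neighbors).most_common(1)[0][0]

train = [((0, 0), 'A'), ((0, 1), 'A'), ((1, 0), 'A'),
         ((5, 5), 'B'), ((5, 6), 'B'), ((6, 5), 'B')]
```

Note this is the memory-based scheme criticized above: every query scans the whole training set; a k-d tree index would avoid that.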
Bayesian Learning
Example: Candy Bags
Candy comes in two flavors: cherry and lime
Candy is wrapped, can’t tell which flavor until opened
There are 5 kinds of bags of candy:
H1 = 100% cherry
H2 = 75% cherry, 25% lime
H3 = 50% cherry, 50% lime
H4 = 25% cherry, 75% lime
H5 = 100% lime
Given a new bag of candy, predict H
Observations: D1, D2 , D3, …
Bayesian Learning
Calculate the probability of each hypothesis, given
the data, and make prediction weighted by this
probability (i.e. use all the hypothesis, not just the
single best)
P(h_i | d) = P(d | h_i) P(h_i) / P(d) ∝ P(d | h_i) P(h_i)
Now, if we want to predict some unknown quantity X:
P(X | d) = Σ_i P(X | h_i) P(h_i | d)
Bayesian Learning cont.
Calculating P(h|d)
P(h_i | d) ∝ P(d | h_i) P(h_i)   (likelihood × prior)
Assume the observations are i.i.d. — independent and identically distributed:
P(d | h_i) = Π_j P(d_j | h_i)
Example:
Hypothesis Prior over h1, …, h5 is
{0.1,0.2,0.4,0.2,0.1}
Data:
Q1: After seeing d1, what is P(hi|d1)?
Q2: After seeing d1, what is P(d2 = lime | d1)?
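Q1 and Q2 can be worked out numerically, assuming d1 is an observed lime candy (the standard version of this example):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(candy is lime | hi)

# Q1: P(hi | d1) is proportional to P(d1 | hi) P(hi); then normalize
unnorm = [l * p for l, p in zip(p_lime, priors)]
z = sum(unnorm)
posterior = [u / z for u in unnorm]       # [0, 0.1, 0.4, 0.3, 0.2]

# Q2: P(d2 = lime | d1) = sum_i P(lime | hi) P(hi | d1)
p_next_lime = sum(l * p for l, p in zip(p_lime, posterior))   # 0.65
```

Note the prediction uses all the hypotheses, weighted by posterior, exactly as the previous slide prescribes.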
Statistical Learning Approaches
Bayesian
MAP – maximum a posteriori
MLE – assume uniform prior over H
Naïve Bayes
aka Idiot Bayes
particularly simple BN
makes overly strong independence
assumptions
but works surprisingly well in practice…
Bayesian Diagnosis
suppose we want to make a diagnosis D and there
are n possible mutually exclusive diagnosis d1, …, dn
suppose there are m boolean symptoms, E1, …, Em
P(d_i | e_1, …, e_m) = P(d_i) P(e_1, …, e_m | d_i) / P(e_1, …, e_m)
how do we make the diagnosis?
we need: P(d_i) and P(e_1, …, e_m | d_i)
Naïve Bayes Assumption
Assume each piece of evidence
(symptom) is independent given the
diagnosis
then:
P(e_1, …, e_m | d_i) = Π_{k=1}^{m} P(e_k | d_i)
what is the structure of the corresponding BN?
Naïve Bayes Example
possible diagnosis: Allergy, Cold and OK
possible symptoms: Sneeze, Cough and Fever
            Well   Cold   Allergy
P(d)        0.9    0.05   0.05
P(sneeze|d) 0.1    0.9    0.9
P(cough|d)  0.1    0.8    0.7
P(fever|d)  0.01   0.7    0.4
my symptoms are: sneeze & cough, what is
the diagnosis?
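The diagnosis can be computed from the table above, treating fever as observed to be absent (an assumption; only sneeze and cough were reported):

```python
p_d      = {'Well': 0.9,  'Cold': 0.05, 'Allergy': 0.05}
p_sneeze = {'Well': 0.1,  'Cold': 0.9,  'Allergy': 0.9}
p_cough  = {'Well': 0.1,  'Cold': 0.8,  'Allergy': 0.7}
p_fever  = {'Well': 0.01, 'Cold': 0.7,  'Allergy': 0.4}

def score(d, sneeze, cough, fever):
    """Unnormalized naive Bayes score: P(d) * product of P(e_k | d)."""
    s = p_d[d]
    s *= p_sneeze[d] if sneeze else 1 - p_sneeze[d]
    s *= p_cough[d]  if cough  else 1 - p_cough[d]
    s *= p_fever[d]  if fever  else 1 - p_fever[d]
    return s

scores = {d: score(d, True, True, False) for d in p_d}
diagnosis = max(scores, key=scores.get)   # 'Allergy'
```

The unnormalized scores come out roughly 0.0089 (Well), 0.0108 (Cold), 0.0189 (Allergy), so Allergy wins.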
Learning the Probabilities
aka parameter estimation
we need:
• P(di) – prior
• P(ek|di) – conditional probability
use training data to estimate
Maximum Likelihood Estimate
(MLE)
use frequencies in the training set to estimate:
p(d_i) = n_i / N
p(e_k | d_i) = n_ik / n_i
where n_x is shorthand for the counts of events in the training set
what is: P(Allergy)?
P(Sneeze | Allergy)?
P(Cough | Allergy)?
Example:
D        Sneeze  Cough  Fever
Allergy  yes     no     no
Well     yes     no     no
Allergy  yes     no     yes
Allergy  yes     no     no
Cold     yes     yes    yes
Allergy  yes     no     no
Well     no      no     no
Well     no      no     no
Allergy  no      no     no
Allergy  yes     no     no
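The MLE questions above can be answered by counting in this table (a sketch; yes/no encoded as 1/0):

```python
data = [  # (D, sneeze, cough, fever)
    ('Allergy', 1, 0, 0), ('Well', 1, 0, 0), ('Allergy', 1, 0, 1),
    ('Allergy', 1, 0, 0), ('Cold', 1, 1, 1), ('Allergy', 1, 0, 0),
    ('Well', 0, 0, 0), ('Well', 0, 0, 0), ('Allergy', 0, 0, 0),
    ('Allergy', 1, 0, 0),
]
N = len(data)
n_allergy = sum(1 for d, *_ in data if d == 'Allergy')

p_allergy = n_allergy / N                                     # 6/10
p_sneeze_allergy = sum(s for d, s, _, _ in data
                       if d == 'Allergy') / n_allergy         # 5/6
p_cough_allergy = sum(c for d, _, c, _ in data
                      if d == 'Allergy') / n_allergy          # 0/6
```

P(Cough | Allergy) comes out exactly zero — the kind of estimate the Laplace smoothing on the next slide is designed to fix.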
Laplace Estimate (smoothing)
use smoothing to eliminate zeros:
p(d_i) = (n_i + 1) / (N + n)
p(e_k | d_i) = (n_ik + 1) / (n_i + 2)
where n is the number of possible values for d, and e is assumed to have 2 possible values
(many other smoothing schemes exist…)
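Applied to the counts from the example (6 of 10 cases are Allergy, none of which cough), with n = 3 diagnosis values:

```python
N, n_values = 10, 3            # ten examples, three possible diagnoses
n_allergy, n_cough_allergy = 6, 0

# Laplace-smoothed estimates
p_allergy = (n_allergy + 1) / (N + n_values)                    # 7/13
p_cough_given_allergy = (n_cough_allergy + 1) / (n_allergy + 2) # 1/8
```

The zero estimate for P(Cough | Allergy) becomes a small positive value (0.125) instead of vetoing the diagnosis outright.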
Comments
Generally works well despite blanket
assumption of independence
Experiments show it is competitive with
decision trees on some well-known test
sets (UCI)
handles noisy data
Learning more complex
Bayesian networks
Two subproblems:
learning structure: combinatorial search
over space of networks
learning parameter values: easy if all
of the variables are observed in the
training set; harder if there are ‘hidden
variables’
Neural Networks
Neural function
Brain function (thought) occurs as the result of the
firing of neurons
Neurons connect to each other through synapses,
which propagate action potential (electrical
impulses) by releasing neurotransmitters
Synapses can be excitatory (potential-increasing) or
inhibitory (potential-decreasing), and have varying
activation thresholds
Learning occurs as a result of the synapses’
plasticity: they exhibit long-term changes in
connection strength
There are about 10^11 neurons and about 10^14
synapses in the human brain
Biology of a neuron
Brain structure
Different areas of the brain have different functions
  Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent
  Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly
We don’t know how different functions are “assigned” or acquired
  Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)
  Partly the result of experience (learning)
We really don’t understand how this neural structure
leads to what we perceive as “consciousness” or
“thought”
Our neural networks are not nearly as complex or
intricate as the actual brain structure
Comparison of computing power
Computers are way faster than neurons…
But there are a lot more neurons than we can reasonably model
in modern digital computers, and they all fire in parallel
Neural networks are designed to be massively parallel
The brain is effectively a billion times faster
Neural networks
Neural networks are made up of nodes
or units, connected by links
Each link has an associated weight and
activation level
Each node has an input function
(typically summing over weighted
inputs), an activation function, and
an output
Neural unit
Model Neuron
Neuron modeled as a unit j
weight on the input from unit i to j: w_ji
net input to unit j is:
net_j = Σ_i w_ji o_i
threshold T_j
o_j is 1 if net_j > T_j
Neural Computation
McCulloch and Pitts (1943) showed how LTUs
can be used to compute logical functions
  AND?
  OR?
  NOT?
Two layers of LTUs can represent any boolean
function
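One way to sketch the construction, with hand-picked weights and thresholds (the helper name `ltu` is mine):

```python
def ltu(weights, threshold):
    """Linear threshold unit: output 1 iff the weighted sum of the
    inputs exceeds the threshold."""
    return lambda *inputs: int(
        sum(w * x for w, x in zip(weights, inputs)) > threshold)

AND = ltu([1, 1], 1.5)    # fires only when both inputs are 1
OR  = ltu([1, 1], 0.5)    # fires when at least one input is 1
NOT = ltu([-1], -0.5)     # fires when the input is 0
```

Stacking two such layers gives any boolean function, as the slide states (e.g., a disjunction of ANDs in DNF form).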
Learning Rules
Rosenblatt (1959) suggested that if a target
output value is provided for a single neuron
with fixed inputs, one can incrementally change
weights to learn to produce these outputs
using the perceptron learning rule
  assumes binary-valued inputs/outputs
  assumes a single linear threshold unit
Perceptron Learning rule
If the target output for unit j is t_j:
w_ji ← w_ji + η (t_j − o_j) o_i
Equivalent to the intuitive rules:
If output is correct, don’t change the weights
If output is low (o_j=0, t_j=1), increment weights for all the inputs which are 1
If output is high (o_j=1, t_j=0), decrement weights for all inputs which are 1
Must also adjust the threshold, or equivalently assume there is a weight w_j0 for an extra input unit that has o_0=1
Perceptron Learning Algorithm
Repeatedly iterate through examples, adjusting weights
according to the perceptron learning rule, until all outputs are
correct
  Initialize the weights to all zero (or random)
  Until outputs for all training examples are correct:
    for each training example e do
      compute the current output o_j
      compare it to the target t_j and update the weights
each execution of the outer loop is an epoch
for multiple-category problems, learn a separate perceptron for
each category and assign to the class whose perceptron most
exceeds its threshold
Q: when will the algorithm terminate?
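A sketch of the algorithm on the (linearly separable) AND function, using the extra-input trick for the threshold; the names and the choice η = 1 are mine. It terminates here because the data is linearly separable — the convergence theorem below:

```python
def predict(w, x):
    """LTU output with w[0] acting as the bias weight (o_0 = 1)."""
    xs = [1] + list(x)
    return int(sum(wi * xi for wi, xi in zip(w, xs)) > 0)

def train_perceptron(examples, eta=1.0, max_epochs=100):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                 # weights initialized to zero
    for _ in range(max_epochs):         # each pass is one epoch
        errors = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:                  # perceptron learning rule
                errors += 1
                xs = [1] + list(x)
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if errors == 0:                 # all outputs correct: terminate
            return w
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)
```

On a non-separable target such as XOR, the loop would never reach an error-free epoch — which answers the termination question.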
Representation Limitations of
a Perceptron
Perceptrons can only represent linear
threshold functions and can therefore
only learn functions which linearly
separate the data, i.e., the positive and
negative examples are separable by a
hyperplane in n-dimensional space
Perceptron Learnability
Perceptron Convergence Theorem:
If there is a set of weights consistent
with the training data (i.e., the data is
linearly separable), the perceptron
learning algorithm will converge
(Minsky & Papert, 1969)
Unfortunately, many functions (like
parity) cannot be represented by an LTU
Layered feed-forward network
Output units
Hidden units
Input units
“Executing” neural networks
Input units are set by some exterior function (think of
these as sensors), which causes their output links to
be activated at the specified level
Working forward through the network, the input function of each unit is applied to compute the input value
  Usually this is just the weighted sum of the activation on the links feeding into this node
The activation function transforms this input function into a final value
  Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node