Learning II

Statistical Learning Methods
Russell and Norvig: Chapter 20 (20.1, 20.2, 20.4, 20.5)
CMSC 421 – Fall 2006
Statistical Approaches
Statistical Learning (20.1)
Naïve Bayes (20.2)
Instance-based Learning (20.4)
Neural Networks (20.5)
Statistical Learning (20.1)
Example: Candy Bags
Candy comes in two flavors: cherry and lime
Candy is wrapped, so you can't tell which flavor it is until it is opened
There are 5 kinds of bags of candy:
H1 = all cherry
H2 = 75% cherry, 25% lime
H3 = 50% cherry, 50% lime
H4 = 25% cherry, 75% lime
H5 = 100% lime
Given a new bag of candy, predict H
Observations: d1, d2, d3, …
Bayesian Learning
Calculate the probability of each hypothesis given the data, and make predictions weighted by these probabilities (i.e. use all the hypotheses, not just the single best one)
P(hi | d) = P(d | hi) P(hi) / P(d) ∝ P(d | hi) P(hi)

Now, if we want to predict some unknown quantity X:

P(X | d) = Σ_i P(X | hi) P(hi | d)
Bayesian Learning cont.
Calculating P(h|d)
P(hi | d) ∝ P(d | hi) P(hi)    (likelihood × prior)
Assume the observations are i.i.d.—
independent and identically distributed
P(d | hi) = Π_j P(dj | hi)
Example:
Hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}
Data: a sequence of observed candies (all lime in the textbook version of the example)
Q1: After seeing d1, what is P(hi | d1)?
Q2: After seeing d1, what is P(d2 = lime | d1)?
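A minimal Python sketch of this update, assuming (as in the textbook version of the example) that the observed candies are all lime; the `posterior` helper is illustrative only:

```python
# Bayesian update for the candy-bag example (sketch; assumes lime observations).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | hi) for each hypothesis

def posterior(num_limes):
    """P(hi | d) after observing num_limes lime candies (i.i.d. observations)."""
    unnorm = [p * q ** num_limes for p, q in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post1 = posterior(1)
print(post1)                                        # Q1: [0.0, 0.1, 0.4, 0.3, 0.2]
print(sum(q * p for q, p in zip(p_lime, post1)))    # Q2: P(d2 = lime | d1) = 0.65
```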
Making Statistical Inferences
Bayesian
- predictions are made using all hypotheses, weighted by their probabilities
MAP – maximum a posteriori
- uses the single most probable hypothesis to make predictions
- often much easier than Bayesian; as we get more and more data, it approaches the Bayesian optimal
ML – maximum likelihood
- assumes a uniform prior over H
- coincides with MAP when the prior is uniform
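To make the contrast concrete, here is a small sketch comparing the three prediction styles on the candy-bag numbers above (the choice of three lime observations is an illustrative assumption):

```python
# Compare Bayesian, MAP, and ML predictions of P(next = lime) (sketch).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
n_limes = 3                                             # assumed observations

unnorm = [p * q ** n_limes for p, q in zip(priors, p_lime)]
post = [u / sum(unnorm) for u in unnorm]                # P(hi | d)

bayes = sum(q * p for q, p in zip(p_lime, post))        # weight all hypotheses
h_map = max(range(5), key=lambda i: post[i])            # most probable a posteriori
h_ml  = max(range(5), key=lambda i: p_lime[i] ** n_limes)  # ignores the prior
print(bayes, p_lime[h_map], p_lime[h_ml])
```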
Naïve Bayes (20.2)
Naïve Bayes
aka Idiot Bayes
particularly simple BN
makes overly strong independence
assumptions
but works surprisingly well in practice…
Bayesian Diagnosis
suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn
suppose there are m boolean symptoms, E1, …, Em

P(di | e1, …, em) = P(di) P(e1, …, em | di) / P(e1, …, em)

how do we make a diagnosis?
we need: P(di) and P(e1, …, em | di)
Naïve Bayes Assumption
Assume each piece of evidence (symptom) is independent given the diagnosis
then

P(e1, …, em | di) = Π_{k=1..m} P(ek | di)
what is the structure of the corresponding BN?
Naïve Bayes Example
possible diagnoses: Well, Cold, and Allergy
possible symptoms: Sneeze, Cough and Fever
             Well    Cold    Allergy
P(d)         0.9     0.05    0.05
P(sneeze|d)  0.1     0.9     0.9
P(cough|d)   0.1     0.8     0.7
P(fever|d)   0.01    0.7     0.4
my symptoms are: sneeze & cough, what is
the diagnosis?
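A hedged sketch of this query; the dictionary names are illustrative, and fever is treated as unobserved, so it is simply left out of the product:

```python
# Naive Bayes diagnosis for evidence sneeze=yes, cough=yes (sketch).
p_d      = {'Well': 0.9,  'Cold': 0.05, 'Allergy': 0.05}
p_sneeze = {'Well': 0.1,  'Cold': 0.9,  'Allergy': 0.9}
p_cough  = {'Well': 0.1,  'Cold': 0.8,  'Allergy': 0.7}

unnorm = {d: p_d[d] * p_sneeze[d] * p_cough[d] for d in p_d}
z = sum(unnorm.values())
print({d: round(v / z, 3) for d, v in unnorm.items()})
# With fever unobserved, Cold edges out Allergy; including fever=no as
# additional evidence (using 1 - P(fever|d)) would favor Allergy instead.
```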
Learning the Probabilities
aka parameter estimation
we need:
- P(di) – prior
- P(ek | di) – conditional probability
use the training data to estimate these
Maximum Likelihood Estimate (MLE)
use frequencies in the training set to estimate:

p(di) = ni / N
p(ek | di) = nik / ni

where ni is the number of training examples with diagnosis di, nik is the number of those that also show symptom ek, and N is the total number of training examples
what is: P(Allergy)?
P(Sneeze| Allergy)?
P(Cough| Allergy)?
Example:

D         Sneeze   Cough   Fever
Allergy   yes      no      no
Well      yes      no      no
Allergy   yes      no      yes
Allergy   yes      no      no
Cold      yes      yes     yes
Allergy   yes      no      no
Well      no       no      no
Well      no       no      no
Allergy   no       no      no
Allergy   yes      no      no
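A small sketch of the MLE counts for the three questions above, assuming the ten rows of this table are the entire training set (variable names are illustrative):

```python
from collections import Counter

# Rows are (diagnosis, sneeze, cough, fever), copied from the table above.
data = [
    ('Allergy', 'yes', 'no', 'no'), ('Well', 'yes', 'no', 'no'),
    ('Allergy', 'yes', 'no', 'yes'), ('Allergy', 'yes', 'no', 'no'),
    ('Cold', 'yes', 'yes', 'yes'), ('Allergy', 'yes', 'no', 'no'),
    ('Well', 'no', 'no', 'no'), ('Well', 'no', 'no', 'no'),
    ('Allergy', 'no', 'no', 'no'), ('Allergy', 'yes', 'no', 'no'),
]

N = len(data)
n_d = Counter(row[0] for row in data)                              # n_i
sneeze_allergy = sum(1 for d, s, c, f in data if d == 'Allergy' and s == 'yes')
cough_allergy  = sum(1 for d, s, c, f in data if d == 'Allergy' and c == 'yes')

print(n_d['Allergy'] / N)                   # P(Allergy)        = 6/10
print(sneeze_allergy / n_d['Allergy'])      # P(Sneeze|Allergy) = 5/6
print(cough_allergy / n_d['Allergy'])       # P(Cough|Allergy)  = 0/6  <- a zero!
```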
Laplace Estimate (smoothing)
use smoothing to eliminate zeros:
p(di) = (ni + 1) / (N + n)
p(ek | di) = (nik + 1) / (ni + 2)

(many other smoothing schemes exist)

where n is the number of possible values for d, and e is assumed to have 2 possible values
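Continuing the MLE sketch above with the Laplace formulas (assuming n = 3 diagnosis values and boolean symptoms):

```python
# Laplace-smoothed estimates from the same counts: N=10, n_Allergy=6,
# n_Allergy,Cough=0, and n=3 possible diagnoses (sketch).
N, n_allergy, n_allergy_cough, n_diag = 10, 6, 0, 3
print((n_allergy + 1) / (N + n_diag))            # P(Allergy)       = 7/13
print((n_allergy_cough + 1) / (n_allergy + 2))   # P(Cough|Allergy) = 1/8, no longer zero
```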
Comments
Generally works well despite blanket
assumption of independence
Experiments show it is competitive with decision trees on some well-known test sets (UCI)
Handles noisy data
Learning more complex
Bayesian networks
Two subproblems:
learning structure: combinatorial search
over space of networks
learning parameter values: easy if all of the variables are observed in the training set; harder if there are 'hidden variables'
Instance-based Learning
Instance/Memory-based Learning
Non-parametric
- hypothesis complexity grows with the data
Memory-based learning
- constructs hypotheses directly from the training data itself
Nearest Neighbor Methods
To classify a new input vector x, examine the k-closest training
data points to x and assign the object to the most frequently
occurring class
[Figure: a query point x shown with its k=1 and k=6 nearest neighbors]
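A minimal sketch of a k-nearest-neighbor classifier as described above, using Euclidean distance; the training points and function name are illustrative:

```python
import math
from collections import Counter

def knn_classify(train, x, k=1):
    """train: list of (feature_vector, label); returns the majority label of
    the k training points closest to x (Euclidean distance)."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy usage with made-up 2-D points
train = [((0.0, 0.0), 'A'), ((0.1, 0.2), 'A'), ((1.0, 1.0), 'B'), ((0.9, 1.1), 'B')]
print(knn_classify(train, (0.2, 0.1), k=3))      # -> 'A'
```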
Issues
Distance measure
- Most common: Euclidean
- Better distance measures: normalize each variable by its standard deviation
- For discrete data, can use Hamming distance
Choosing k
- Increasing k reduces variance but increases bias
For high-dimensional spaces, the nearest neighbor may not be very close at all!
Memory-based technique: must make a pass through the data for each classification, which can be prohibitive for large data sets
- Indexing the data can help, e.g. with KD-trees
Neural Networks (20.5)
Neural function
Brain function (thought) occurs as the result of the
firing of neurons
Neurons connect to each other through synapses,
which propagate action potential (electrical
impulses) by releasing neurotransmitters
Synapses can be excitatory (potential-increasing) or
inhibitory (potential-decreasing), and have varying
activation thresholds
Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength
There are about 10^11 neurons and about 10^14 synapses in the human brain
Biology of a neuron
Brain structure
Different areas of the brain have different functions
- Some areas seem to have the same function in all humans (e.g., Broca's region); the overall layout is generally consistent
- Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly
We don't know how different functions are "assigned" or acquired
- Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)
- Partly the result of experience (learning)
We really don't understand how this neural structure leads to what we perceive as "consciousness" or "thought"
Our neural networks are not nearly as complex or intricate as the actual brain structure
Comparison of computing power
Computers are way faster than neurons…
But there are a lot more neurons than we can reasonably model
in modern digital computers, and they all fire in parallel
Neural networks are designed to be massively parallel
The brain is effectively a billion times faster
Neural networks
Neural networks are made up of nodes
or units, connected by links
Each link has an associated weight and
activation level
Each node has an input function
(typically summing over weighted
inputs), an activation function, and
an output
Neural unit
Linear Threshold Unit (LTU)
[Figure: LTU with inputs x0 = 1, x1, …, xn, weights w0, w1, …, wn, a summation node computing Σ_{i=0..n} wi xi, and a threshold output]

o(x1, …, xn) = 1 if w0 + w1 x1 + … + wn xn > 0
               -1 otherwise
Sigmoid Unit
[Figure: sigmoid unit with inputs x0 = 1, x1, …, xn and weights w0, w1, …, wn]

net = Σ_{i=0..n} wi xi
o = σ(net) = 1 / (1 + e^(−net))

σ(x) = 1 / (1 + e^(−x)) is the sigmoid function
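A one-function sketch of a sigmoid unit under these definitions (the weights and inputs are made-up values):

```python
import math

def sigmoid_unit(weights, x):
    """weights[0] is w0 (paired with the constant input x0 = 1)."""
    net = sum(w * xi for w, xi in zip(weights, [1.0] + list(x)))
    return 1.0 / (1.0 + math.exp(-net))              # o = sigma(net)

print(sigmoid_unit([0.0, 1.0, -2.0], (0.5, 0.25)))   # net = 0, so output is 0.5
```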
Neural Computation
McCulloch and Pitts (1943) showed how LTUs can be used to compute logical functions
- AND?
- OR?
- NOT?
Two layers of LTUs can represent any boolean function
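One possible choice of weights (a sketch, not the only one) realizing AND, OR, and NOT with a single LTU each, using the threshold output defined earlier:

```python
# LTU output: 1 if w0 + sum_i wi*xi > 0, else -1 (inputs are 0/1).
def ltu(weights, x):
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else -1

AND = [-1.5, 1, 1]     # fires only when both inputs are 1
OR  = [-0.5, 1, 1]     # fires when at least one input is 1
NOT = [0.5, -1]        # fires when its single input is 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, ltu(AND, x), ltu(OR, x))
print(ltu(NOT, (0,)), ltu(NOT, (1,)))
```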
Learning Rules
Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be incrementally changed to learn to produce these outputs, using the perceptron learning rule
- assumes binary-valued inputs/outputs
- assumes a single linear threshold unit
Perceptron Learning rule
If the target output for unit j is tj
w ji  w ji  (t j  o j )oi
Equivalent to the intuitive rules:
If output is correct, don’t change the weights
If output is low (oj=0, tj=1), increment weights for all the
inputs which are 1
If output is high (oj=1, tj=0), decrement weights for all
inputs which are 1
Must also adjust threshold. Or equivalently assume there is
a weight wj0 for an extra input unit that has o0=1
Perceptron Learning Algorithm
Repeatedly iterate through the examples, adjusting the weights according to the perceptron learning rule, until all outputs are correct
- Initialize the weights to all zero (or random)
- Until outputs for all training examples are correct:
  - for each training example e do
    - compute the current output oj
    - compare it to the target tj and update the weights
each execution of the outer loop is an epoch
for multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold
Q: when will the algorithm terminate?
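A minimal sketch of the algorithm with 0/1 inputs and targets, the threshold folded into an extra always-1 input, and an epoch cap so it stops even on non-separable data (the function name and learning rate are illustrative):

```python
def train_perceptron(examples, eta=1.0, max_epochs=100):
    """examples: list of (input_tuple, target) with targets in {0, 1}."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                           # w[0] is the bias weight
    for _ in range(max_epochs):
        all_correct = True
        for x, t in examples:
            xs = [1] + list(x)                    # prepend the constant input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            if o != t:
                all_correct = False
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if all_correct:                           # terminates only when the
            return w                              # data is linearly separable
    return w

# Learns OR; on XOR (not linearly separable) it would run until max_epochs.
print(train_perceptron([((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]))
```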
Representation Limitations of a Perceptron
Perceptrons can only represent linear threshold functions and can therefore only learn functions that linearly separate the data, i.e. the positive and negative examples are separable by a hyperplane in n-dimensional space
Perceptron Learnability
Perceptron Convergence Theorem: if there is a set of weights consistent with the training data (i.e. the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969)
Unfortunately, many functions (like parity) cannot be represented by an LTU
Layered feed-forward network
[Figure: layered feed-forward network with input units at the bottom, hidden units in the middle, and output units at the top]
Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, do:
- For each training example, do:
  1. Input the training example and compute the outputs
  2. For each output unit k:
     δ_k ← o_k (1 − o_k)(t_k − o_k)
  3. For each hidden unit h:
     δ_h ← o_h (1 − o_h) Σ_{k ∈ outputs} w_{k,h} δ_k
  4. Update each network weight w_{j,i}:
     w_{j,i} ← w_{j,i} + Δw_{j,i}, where Δw_{j,i} = η δ_j x_{j,i}
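A compact sketch of these update rules for one hidden layer of sigmoid units (the network sizes, learning rate, and the XOR toy problem are illustrative assumptions):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_train(examples, n_in, n_hid, n_out, eta=0.5, epochs=5000):
    """Stochastic gradient descent with the delta rules above; biases are the
    index-0 weights paired with a constant input of 1."""
    w_hid = [[random.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hid)]
    w_out = [[random.uniform(-0.1, 0.1) for _ in range(n_hid + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            xi = [1.0] + list(x)
            oh = [1.0] + [sigmoid(sum(w * v for w, v in zip(wh, xi))) for wh in w_hid]
            ok = [sigmoid(sum(w * v for w, v in zip(wo, oh))) for wo in w_out]
            dk = [o * (1 - o) * (tk - o) for o, tk in zip(ok, t)]         # delta_k
            dh = [oh[h + 1] * (1 - oh[h + 1]) *                            # delta_h
                  sum(w_out[k][h + 1] * dk[k] for k in range(n_out))
                  for h in range(n_hid)]
            for k in range(n_out):                       # w <- w + eta * delta * x
                for i in range(n_hid + 1):
                    w_out[k][i] += eta * dk[k] * oh[i]
            for h in range(n_hid):
                for i in range(n_in + 1):
                    w_hid[h][i] += eta * dh[h] * xi[i]
    return w_hid, w_out

# Toy usage: attempt to learn XOR, which a single perceptron cannot represent.
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
weights = backprop_train(xor, n_in=2, n_hid=2, n_out=1)
```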
“Executing” neural networks
Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level
Working forward through the network, the input function of each unit is applied to compute the input value
- Usually this is just the weighted sum of the activations on the links feeding into this node
The activation function transforms this input function into a final value
- Typically this is a nonlinear function, often a sigmoid function corresponding to the "threshold" of that node
Neural Nets for Face Recognition
[Figure: network with 30x32 pixel inputs and outputs for head pose (left, straight, right, up), with typical input images]
90% accurate at learning head pose and recognizing 1 of 20 faces
Summary: Statistical Learning Methods
Statistical Inference
- use the likelihood of the data and the probability of each hypothesis to predict a value for the next instance
  - Bayesian
  - MAP
  - ML
Naïve Bayes
Nearest Neighbor
Neural Networks