Neural Networks
Le Song
Machine Learning I
CSE 6740, Fall 2013
Learning nonlinear decision boundary
[Figures: a linearly separable problem, a nonlinearly separable problem, the XOR gate, speech recognition.]
A decision tree for Tax Fraud

Input: a vector of attributes X = [Refund, MarSt, TaxInc]
Output: Y = Cheating or Not

The hypothesis H as a procedure:
Each internal node: tests one attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predicts Y

[Figure: the decision tree]
  Refund?
    Yes -> NO
    No  -> MarSt?
      Married -> NO
      Single, Divorced -> TaxInc?
        < 80K -> NO
        > 80K -> YES
Apply model to test data

Query: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the query at each node:
Refund = No, so take the "No" branch to the MarSt node.
MarSt = Married, so take the "Married" branch.
That branch ends in the leaf NO, so assign Cheat to "No".

[Figure, repeated on each of the five steps: the tax-fraud decision tree with the path Refund -> No -> MarSt -> Married -> NO progressively highlighted.]
Expressiveness of decision trees

Decision trees can express any function of the input attributes.
E.g., for Boolean functions, each truth table row maps to a path from root to leaf.
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example.
We prefer to find more compact decision trees.
Decision tree learning

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Workflow: a tree induction algorithm learns a model (a decision tree) from the Training Set (induction); the model is then applied to the Test Set to predict the Class labels (deduction).
Top-Down Induction of Decision Trees

Main loop (a minimal sketch in code follows this list):
A <- the "best" decision attribute for the next node
Assign A as the decision attribute for the node
For each value of A, create a new descendant of the node
Sort the training examples to the leaf nodes
If the training examples are perfectly classified, then STOP;
else iterate over the new leaf nodes
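A minimal sketch of this loop in Python, assuming each example is a dict of attribute values plus a "label" key and that the "best" attribute is chosen by the information-gain criterion introduced on the later slides; all names here are illustrative, not from the slides.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, attributes):
    """The attribute with the largest information gain (greedy choice)."""
    labels = [e["label"] for e in examples]
    def gain(a):
        remainder = 0.0
        for v in set(e[a] for e in examples):
            sub = [e["label"] for e in examples if e[a] == v]
            remainder += len(sub) / len(examples) * entropy(sub)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def induce_tree(examples, attributes):
    """Top-down induction: grow the tree until nodes are perfectly classified."""
    labels = [e["label"] for e in examples]
    if len(set(labels)) == 1 or not attributes:          # stop: pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]      # leaf: predict the majority label
    A = best_attribute(examples, attributes)             # the "best" decision attribute
    tree = {A: {}}
    for v in set(e[A] for e in examples):                # one descendant per value of A
        subset = [e for e in examples if e[A] == v]      # sort examples to that descendant
        tree[A][v] = induce_tree(subset, [a for a in attributes if a != A])
    return tree

This handles categorical attributes such as Refund or MarSt; a continuous attribute such as TaxInc would additionally need a threshold test (see "How to specify the attribute test condition?" below).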
Tree Induction

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

Issues:
Determine how to split the records
  How to specify the attribute test condition?
  How to determine the best split?
Determine when to stop splitting
Example of a decision tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
  Refund?
    Yes -> NO
    No  -> MarSt?
      Married -> NO
      Single, Divorced -> TaxInc?
        < 80K -> NO
        > 80K -> YES
Another example of a decision tree

[Training Data: the same table as on the previous slide.]

A different tree that fits the same data:
  MarSt?
    Married -> NO
    Single, Divorced -> Refund?
      Yes -> NO
      No  -> TaxInc?
        < 80K -> NO
        > 80K -> YES

There could be more than one tree that fits the same data!
How to determine the Best Split

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
A homogeneous subset has a low degree of impurity; a non-homogeneous subset has a high degree of impurity.

Greedy approach:
Splits that produce nodes with a homogeneous class distribution are preferred.
We therefore need a measure of node impurity. How do we compare attributes?
Entropy
Entropy H(X) of a random variable X
Entropy quatifies the randomness of a random variable.
Information theory: H(X) is the expected number of bits needed to encode a
randomly drawn value of X (under most efficient code)
Examples for computing Entropy

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
  Entropy = - 0 log2 0 - 1 log2 1 = - 0 - 0 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
  Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
  Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
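A short check of these numbers in Python, for binary class counts as above (the entropy helper name is ours, not from the slides):

from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given raw counts."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]       # 0 log 0 is treated as 0
    return -sum(p * log2(p) for p in probs)

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # ~0.65
print(entropy([2, 4]))   # ~0.92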
Sample Entropy

S is a sample of training examples; p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
Entropy measures the impurity of S:
Entropy(S) = - p+ log2 p+ - p- log2 p-
How to compare attributes? Conditional Entropy

Conditional entropy of variable X given variable Y:
The entropy H(X | Y = j) of X given a specific value Y = j:
H(X | Y = j) = - Σ_i P(x = i | y = j) log2 P(x = i | y = j)

The conditional entropy H(X | Y) of X given Y is the average of H(X | Y = j):
H(X | Y) = Σ_j P(y = j) H(X | y = j) = - Σ_i Σ_j P(x = i, y = j) log2 [ P(x = i, y = j) / P(y = j) ]

Mutual information (aka information gain) of X given Y:
I(X; Y) = H(X) - H(X | Y)
Information Gain

Information gain (after splitting a node):
GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)

where the n samples in the parent node p are split into k partitions, and n_i is the number of records in partition i.

Information gain measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
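A minimal sketch of this computation in Python; splitting the ten-record Cheat data on Refund is our own illustrative choice, not a step taken on the slides:

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, partition_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition i)."""
    n = sum(parent_counts)
    remainder = sum(sum(part) / n * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - remainder

# Illustrative: the Cheat data has 3 Yes / 7 No overall; splitting on Refund
# gives partitions Refund=Yes (0 Yes, 3 No) and Refund=No (3 Yes, 4 No).
print(information_gain([3, 7], [[0, 3], [3, 4]]))   # ~0.19 bits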
Stopping Criteria for Tree Induction
Stop expanding a node when all the records belong to the
same class
Stop expanding a node when all the records have similar
attribute values
Early termination
Biologically inspired learning model

Consider humans:
Neuron switching time: ~0.001 second
Number of neurons: ~10^10
Connections per neuron: ~10^4-5
Scene recognition time: ~0.1 second

100 inference steps doesn't seem like enough -> much parallel computation
Perceptron

From biological neuron to artificial neuron (perceptron).

[Figure: a biological neuron (dendrites, soma, axon, synapses) alongside an artificial neuron: input signals x1, x2 with weights w1, w2 feed a linear combiner Σ, followed by a nonlinear transform with threshold θ that produces the output signal.]
Artificial neural nets (ANN)

Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed processes

[Figure: a network with an input layer, a middle (hidden) layer, and an output layer producing Y.]
Jargon Pseudo-Correspondence
Independent variable = input variable
Dependent variable = output variable
Coefficients = “weights”
Logistic Regression Model (the sigmoid unit) is a perceptron

[Figure: a single sigmoid unit. Inputs (independent variables x1, x2, x3): Age = 34, Gender = 1, Stage = 4; coefficients ("weights") a, b, c; the weighted sum is passed through a sigmoid Σ to give the output (dependent variable p), here 0.6, the "Probability of being Alive" — the prediction.]
What is the logistic regression model?

Assume that the posterior distribution p(y = 1 | x) takes a particular form:
p(y = 1 | x, θ) = 1 / (1 + exp(-θ^T x))

The logistic function (or sigmoid function) is σ(u) = 1 / (1 + exp(-u)), with derivative
∂σ(u)/∂u = σ(u) (1 - σ(u))
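A tiny numerical check of the derivative identity (function and variable names are ours):

import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

u, eps = 0.3, 1e-6
numeric = (sigmoid(u + eps) - sigmoid(u - eps)) / (2 * eps)    # finite-difference derivative
analytic = sigmoid(u) * (1 - sigmoid(u))                       # sigma(u) * (1 - sigma(u))
print(numeric, analytic)   # the two agree closely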
Learning parameters in logistic regression

Find θ such that the conditional likelihood of the labels is maximized:
max_θ l(θ) := log Π_{i=1..m} P(y_i | x_i, θ)

Good news: l(θ) is a concave function of θ, so there is a single global optimum.
Bad news: there is no closed-form solution (we resort to numerical methods).
The objective function l(θ)

Logistic regression model:
p(y = 1 | x, θ) = 1 / (1 + exp(-θ^T x))

Note that
p(y = 0 | x, θ) = 1 - 1 / (1 + exp(-θ^T x)) = exp(-θ^T x) / (1 + exp(-θ^T x))

Plugging in,
l(θ) := log Π_{i=1..m} P(y_i | x_i, θ) = Σ_i [ (y_i - 1) θ^T x_i - log(1 + exp(-θ^T x_i)) ]
The gradient of l(θ)

l(θ) := log Π_{i=1..m} P(y_i | x_i, θ) = Σ_i [ (y_i - 1) θ^T x_i - log(1 + exp(-θ^T x_i)) ]

Gradient:
∂l(θ)/∂θ = Σ_i [ (y_i - 1) x_i + exp(-θ^T x_i) / (1 + exp(-θ^T x_i)) x_i ]

Setting it to 0 does not lead to a closed-form solution.
Gradient Ascent algorithm

Initialize parameter θ^0
Do
  θ^(t+1) <- θ^t + η Σ_i [ (y_i - 1) x_i + exp(-(θ^t)^T x_i) / (1 + exp(-(θ^t)^T x_i)) x_i ]
While ||θ^(t+1) - θ^t|| > ε
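A minimal sketch of this loop in Python/NumPy, under the assumption that X is an m-by-d matrix of inputs and y a vector of 0/1 labels (the function name, data, and default step size are ours):

import numpy as np

def logistic_regression_ga(X, y, eta=0.1, eps=1e-6, max_iter=1000):
    """Maximize the conditional log-likelihood by gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        s = X @ theta                                                # theta^T x_i for all i
        grad = X.T @ ((y - 1) + np.exp(-s) / (1 + np.exp(-s)))       # sum_i [(y_i-1) + e^{-s}/(1+e^{-s})] x_i
        theta_new = theta + eta * grad                               # ascent step
        if np.linalg.norm(theta_new - theta) <= eps:                 # stop when the update is small
            return theta_new
        theta = theta_new
    return theta

# toy usage: two 1-D clusters, label 1 for larger x; last column is a bias feature
X = np.array([[0.0, 1.0], [0.5, 1.0], [2.0, 1.0], [2.5, 1.0]])
y = np.array([0, 0, 1, 1])
print(logistic_regression_ga(X, y))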
Learning with square loss

Find θ such that the conditional probability of the labels is close to the actual labels (0 or 1):
min_θ l(θ) := Σ_{i=1..n} ( y_i - P(y = 1 | x_i, θ) )^2 = Σ_i ( y_i - σ(θ^T x_i) )^2

This is not a convex objective function.
Use gradient descent to find a local optimum.
The gradient of l(θ)

l(θ) := Σ_{i=1..n} ( y_i - P(y = 1 | x_i, θ) )^2 = Σ_i ( y_i - σ(θ^T x_i) )^2

Let u_i = θ^T x_i and recall ∂σ(u)/∂u = σ(u)(1 - σ(u)).

Gradient:
∂l(θ)/∂θ = - Σ_i 2 ( y_i - σ(u_i) ) σ(u_i) (1 - σ(u_i)) x_i
Gradient Descent algorithm

Initialize parameter θ^0
Do
  u_i := (θ^t)^T x_i
  θ^(t+1) <- θ^t - η ∂l(θ^t)/∂θ = θ^t + η Σ_i 2 ( y_i - σ(u_i) ) σ(u_i) (1 - σ(u_i)) x_i
While ||θ^(t+1) - θ^t|| > ε
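A sketch of this update in Python/NumPy, with the same assumptions about X and y as in the earlier gradient-ascent sketch (names are ours):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def square_loss_gd(X, y, eta=0.1, eps=1e-6, max_iter=1000):
    """Minimize sum_i (y_i - sigmoid(theta^T x_i))^2 by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        u = X @ theta                                    # u_i = theta^T x_i
        s = sigmoid(u)
        grad = -2 * X.T @ ((y - s) * s * (1 - s))        # gradient of the squared loss
        theta_new = theta - eta * grad                   # descent step
        if np.linalg.norm(theta_new - theta) <= eps:
            return theta_new
        theta = theta_new
    return theta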
Decision boundary of a perceptron

Truth table (NAND), with z shown as the point's color:
x  y  z
0  0  1
0  1  1
1  0  1
1  1  0

A single perceptron: inputs x1, x2 with weights w1, w2, threshold θ = 0.5, output y = f(x1 w1 + x2 w2), where
f(a) = 1 for a > θ, and 0 for a <= θ.

Desired behaviour:
f(0 w1 + 0 w2) = 1
f(0 w1 + 1 w2) = 1
f(1 w1 + 0 w2) = 1
f(1 w1 + 1 w2) = 0

Some possible values for w1 and w2:
w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20
Difficult problem for a perceptron

Truth table (XOR), with z shown as the point's color:
x  y  z
0  0  0
0  1  1
1  0  1
1  1  0

The same single perceptron (threshold θ = 0.5, f(a) = 1 for a > θ, 0 for a <= θ) would need:
f(0 w1 + 0 w2) = 0
f(0 w1 + 1 w2) = 1
f(1 w1 + 0 w2) = 1
f(1 w1 + 1 w2) = 0

There are no possible values for w1 and w2 that satisfy all four constraints.
Connecting perceptrons = neural networks

Truth table (XOR), with z shown as the point's color:
x  y  z
0  0  0
0  1  1
1  0  1
1  1  0

[Figure: a two-layer network. Inputs x and y feed two hidden units via weights w1-w4; the hidden outputs feed the output unit via weights w5, w6.] θ = 0.5 for all units, with
f(a) = 1 for a > θ, and 0 for a <= θ.

A possible set of values for (w1, w2, w3, w4, w5, w6):
(0.6, -0.6, -0.7, 0.8, 1, 1)
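A quick check in Python that these weights reproduce the XOR table, assuming w1, w2 feed the first hidden unit, w3, w4 the second, and w5, w6 connect the hidden units to the output (the wiring is our reading of the figure):

def step(a, theta=0.5):
    """Threshold unit: 1 if a > theta, else 0."""
    return 1 if a > theta else 0

def two_layer_net(x, y, w1, w2, w3, w4, w5, w6):
    h1 = step(w1 * x + w2 * y)          # first hidden unit
    h2 = step(w3 * x + w4 * y)          # second hidden unit
    return step(w5 * h1 + w6 * h2)      # output unit

weights = (0.6, -0.6, -0.7, 0.8, 1, 1)
for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, two_layer_net(x, y, *weights))   # prints the XOR outputs 0, 1, 1, 0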
Neural Network Model

[Figure: inputs Age = 34, Gender = 2, Stage = 4 (independent variables) connect through a first set of weights to a hidden layer of sigmoid units, which connect through a second set of weights to an output sigmoid unit; the output (dependent variable / prediction) is 0.6, the "Probability of being Alive".]
“Combined logistic models”

[Figure, shown three times with a different subset of connections highlighted each time: the same network of inputs Age = 34, Gender = 2, Stage = 4, a hidden layer of sigmoid units, and an output sigmoid unit giving 0.6, the "Probability of being Alive". Each sigmoid unit looks like a logistic regression model.]
Not really, no target for hidden units...

[Figure: the same network (Age, Gender, Stage -> hidden sigmoid units -> output sigmoid unit -> 0.6 "Probability of being Alive"); the training data provides no target values for the hidden units.]
Training hidden units

Find θ, α, β such that the value of the output unit is close to the actual labels (0 or 1):
min_{θ,α,β} l(θ, α, β) := Σ_{i=1..n} ( y_i - σ(θ^T z_i) )^2
where z_1^i = σ(α^T x_i) and z_2^i = σ(β^T x_i).

This is not a convex objective function.
Use gradient descent to find a local optimum.
The gradient of l(θ, α, β)

l(θ, α, β) := Σ_{i=1..n} ( y_i - σ(θ^T z_i) )^2, where z_1^i = σ(α^T x_i), z_2^i = σ(β^T x_i)

Let u_i = θ^T z_i.

Gradient:
∂l(θ, α, β)/∂θ = - Σ_i 2 ( y_i - σ(u_i) ) σ(u_i) (1 - σ(u_i)) z_i

z_i is computed using α and β from the previous iteration: z_1^i = σ(α^T x_i), z_2^i = σ(β^T x_i).
Gradient for α and β

Use the chain rule of derivatives. Note z_1^i = σ(α^T x_i), z_2^i = σ(β^T x_i); let u_i = θ^T z_i and v_i = α^T x_i.

Gradient:
∂l(θ, α, β)/∂α = Σ_i [ ∂l(θ, α, β)/∂z_1^i ] [ ∂z_1^i/∂α ]
              = - Σ_i 2 ( y_i - σ(u_i) ) σ(u_i) (1 - σ(u_i)) θ_1 σ(v_i) (1 - σ(v_i)) x_i

The gradient for β has the same form, with θ_2 and β^T x_i in place of θ_1 and v_i.
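A minimal sketch in Python/NumPy of these formulas for a network with exactly two hidden units as on the slides, followed by a finite-difference check of one entry of ∂l/∂α; the toy data and all names are ours:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def loss(theta, alpha, beta, X, y):
    """l(theta, alpha, beta) = sum_i (y_i - sigmoid(theta^T z_i))^2."""
    z1, z2 = sigmoid(X @ alpha), sigmoid(X @ beta)
    return np.sum((y - sigmoid(theta[0] * z1 + theta[1] * z2)) ** 2)

def gradients(theta, alpha, beta, X, y):
    """The three gradients derived on the slides (delta is the shared factor dl/du_i)."""
    z1, z2 = sigmoid(X @ alpha), sigmoid(X @ beta)
    out = sigmoid(theta[0] * z1 + theta[1] * z2)
    delta = -2 * (y - out) * out * (1 - out)
    g_theta = np.array([delta @ z1, delta @ z2])
    g_alpha = X.T @ (delta * theta[0] * z1 * (1 - z1))
    g_beta = X.T @ (delta * theta[1] * z2 * (1 - z2))
    return g_theta, g_alpha, g_beta

# toy data and parameters, plus a numerical check of dl/dalpha_0
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), np.array([0., 1., 1., 0., 1.])
theta, alpha, beta = rng.normal(size=2), rng.normal(size=3), rng.normal(size=3)
g_theta, g_alpha, g_beta = gradients(theta, alpha, beta, X, y)
eps = 1e-6
da = np.array([eps, 0, 0])
numeric = (loss(theta, alpha + da, beta, X, y) - loss(theta, alpha - da, beta, X, y)) / (2 * eps)
print(g_alpha[0], numeric)   # analytic and numerical values agree closely

Gradient descent then updates θ, α, and β with these gradients, exactly as in the earlier descent loop.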
Expressive Capabilities of ANNs

Boolean functions:
Every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs.

Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Abdominal Pain Perceptron

[Figure: a single-layer network with adjustable weights mapping patient findings (Age, Male, Temp, WBC, Pain Intensity, Pain Duration) to candidate diagnoses: Duodenal Ulcer, Non-specific Pain, Perforated, Cholecystitis, Appendicitis, Diverticulitis, Small Bowel Obstruction, Pancreatitis.]
Myocardial Infarction Network

[Figure: a network mapping inputs Pain Duration = 2, Pain Intensity = 4, ECG: ST Elevation = 1, Smoker = 1, Age = 50, Male = 1 to an output of 0.8, the "Probability" of Myocardial Infarction.]
The "Driver" Network
Application: ANN for Face Recognition
The model
The learned hidden unit weights