Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo
6.873/HST.951 Medical Decision Support
Spring 2005
Artificial Neural Networks
Lucila Ohno-Machado
(with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)
Overview
• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective
Motivation
[Images removed due to copyright reasons: a benign lesion and a malignant lesion.]
Motivation
• Human brain
– Parallel processing
– Distributed representation
– Fault tolerant
– Good generalization capability
• Goal: mimic this structure and processing in a computational model
Biological Analogy
[Figure: a biological neuron with dendrites, excitatory (+) and inhibitory (−) synapses, and an axon; in the computational analogy, synapses become weights and neurons become nodes.]
Perceptrons
[Figure: binary input patterns (e.g., 10, 01, 11, 00) presented to an input layer are mapped by an output layer into sorted patterns.]
Activation functions
Perceptrons (linear machines)
[Figure: a single-layer network with input units (Cough, Headache), adjustable weights, and output units (No disease, Pneumonia, Flu, Meningitis). The Δ rule changes the weights to decrease the error, i.e., the difference between what we got and what we wanted.]
Abdominal Pain Perceptron
[Figure: a perceptron with input units (Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1), adjustable weights, and output units (shown as 0/1) for Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, and Pancreatitis.]
AND (θ = 0.5)
[Figure: a single unit y = f(x1w1 + x2w2) with inputs x1, x2 and weights w1, w2.]

input (x1 x2)   output (y)
0 0             0
0 1             0
1 0             0
1 1             1

f(x1w1 + x2w2) = y
f(0w1 + 0w2) = 0
f(0w1 + 1w2) = 0
f(1w1 + 0w2) = 0
f(1w1 + 1w2) = 1

f(a) = 1, for a > θ; 0, for a ≤ θ

some possible values for w1 and w2:
w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20
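A minimal Python sketch of the step-activation perceptron above, using one of the weight pairs from the table (w1 = 0.25, w2 = 0.30) to reproduce the AND truth table:

```python
# Step-activation perceptron for AND, using weights from the slide's table.
THETA = 0.5          # threshold from the slide
W1, W2 = 0.25, 0.30  # one of the "possible values for w1 and w2"

def f(a, theta=THETA):
    """Step activation: 1 if a > theta, else 0."""
    return 1 if a > theta else 0

def perceptron(x1, x2, w1=W1, w2=W2):
    return f(x1 * w1 + x2 * w2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", perceptron(x1, x2))   # prints 0, 0, 0, 1 (AND)
```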
Single layer neural network
Output of unit j: oj = 1 / (1 + e^−(aj + θj))
Input to unit j: aj = Σi wij ai
Input to unit i: ai = measured value of variable i
[Figure: input units i feed output units j through weights wij.]
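A small Python sketch of these equations, computing the output of one sigmoid unit j from input values ai and weights wij (the weights, inputs, and θ below are illustrative, not taken from the slides):

```python
import math

def unit_output(inputs, weights, theta):
    """o_j = 1 / (1 + e^-(a_j + theta_j)), with a_j = sum_i w_ij * a_i."""
    a_j = sum(w * a for w, a in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(a_j + theta)))

# Illustrative values: two input units feeding one sigmoid output unit.
print(unit_output(inputs=[1.0, 0.0], weights=[0.4, -0.2], theta=0.1))
```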
Single layer neural network
Output of unit j: oj = 1 / (1 + e^−(aj + θj))
Input to unit j: aj = Σi wij ai
Input to unit i: ai = measured value of variable i
[Figure: sigmoid curves plotted for inputs from −15 to 15 and outputs between 0 and 1, for different values of θ; increasing θ shifts the curve along the input axis.]
Training: Minimize Error
[Figure: a single-layer network with input units (Cough, Headache), adjustable weights, and output units (No disease, Pneumonia, Flu, Meningitis). The Δ rule changes the weights to decrease the error: what we got minus what we wanted.]
Error Functions
• Mean Squared Error (for regression problems), where t is the target and o is the output:
  Σ (t − o)² / n
• Cross-Entropy Error (for binary classification):
  −Σ [t log o + (1 − t) log(1 − o)]
with oj = 1 / (1 + e^−(aj + θj))
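A short Python sketch of the two error functions on illustrative targets t and outputs o (the numbers are made up for demonstration):

```python
import math

def mean_squared_error(targets, outputs):
    """Sum of (t - o)^2 over cases, divided by n."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n

def cross_entropy_error(targets, outputs):
    """-sum over cases of [t*log(o) + (1-t)*log(1-o)], for binary targets."""
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

t = [1, 0, 1, 1]          # illustrative binary targets
o = [0.9, 0.2, 0.7, 0.6]  # illustrative network outputs
print(mean_squared_error(t, o), cross_entropy_error(t, o))
```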
Error function
• Convention: w := (w0, w), x := (1, x)
• w0 is the “bias”
• o = f(w · x)
• Class labels ti ∈ {+1, −1}
• Error measure: E = −Σ (over misclassified i) ti (w · xi)
• How to minimize E?
[Figure: a perceptron y = f(x1w1 + x2w2) with threshold θ = 0.5, redrawn with the threshold as a bias weight w0 = 0.5 on a constant input of 1.]
Minimizing the Error
[Figure: error surface over a weight, showing the initial error at w_initial, the derivative, and the final error at a local minimum reached at w_trained.]
Gradient descent
[Figure: an error curve with both a global minimum and a local minimum.]
Perceptron learning
• Find the minimum of E by iterating wk+1 = wk − η gradw E
• E = −Σ (over misclassified i) ti (w · xi) ⇒ gradw E = −Σ (over misclassified i) ti xi
• “Online” version: pick a misclassified xi and update wk+1 = wk + η ti xi
Perceptron learning
• Update rule: wk+1 = wk + η ti xi
• Theorem: perceptron learning converges for linearly separable sets
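A minimal Python sketch of the online update wk+1 = wk + η ti xi, assuming class labels ti ∈ {+1, −1} and augmented inputs x = (1, x) as in the convention above; the training data here are illustrative:

```python
def predict(w, x):
    """Sign of w . x, with x already augmented by a leading 1 for the bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(data, eta=0.1, epochs=100):
    """Online perceptron learning: update on each misclassified example."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        errors = 0
        for x, t in data:
            if predict(w, x) != t:                 # misclassified
                w = [wi + eta * t * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                            # converged (linearly separable set)
            break
    return w

# Illustrative linearly separable data: x = (1, x1, x2), t in {+1, -1}.
data = [((1, 0, 0), -1), ((1, 0, 1), -1), ((1, 1, 0), -1), ((1, 1, 1), +1)]
print(train_perceptron(data))
```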
Gradient descent
• Simple function minimization algorithm
• Gradient is vector of partial derivatives
• Negative gradient is direction of steepest
descent
[Figures by MIT OCW: a function surface and its contour plot, with the negative gradient pointing in the direction of steepest descent.]
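A small Python sketch of gradient descent, w ← w − η grad E, applied to an illustrative quadratic error function (not one of the functions plotted in the figures):

```python
def grad_descent(grad, w, eta=0.1, steps=100):
    """Iterate w <- w - eta * grad(w); grad returns the gradient vector at w."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Illustrative error E(w) = (w1 - 1)^2 + (w2 + 2)^2, minimum at (1, -2).
grad_E = lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)]
print(grad_descent(grad_E, w=[3.0, 3.0]))   # approaches [1, -2]
```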
Classification Model
[Figure: inputs Age = 34, Gender = 1, Stage = 4 (independent variables x1, x2, x3) feed through weights (coefficients a, b, c) into a summation unit Σ; the output 0.6 is the “Probability of being Alive” (the dependent variable p, i.e., the prediction).]
Terminology
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
• Iterative step = cycle, epoch
XOR (θ = 0.5)
[Figure: a single unit y = f(x1w1 + x2w2) with inputs x1, x2 and weights w1, w2.]

input (x1 x2)   output (y)
0 0             0
0 1             1
1 0             1
1 1             0

f(x1w1 + x2w2) = y
f(0w1 + 0w2) = 0
f(0w1 + 1w2) = 1
f(1w1 + 0w2) = 1
f(1w1 + 1w2) = 0

f(a) = 1, for a > θ; 0, for a ≤ θ

some possible values for w1 and w2:
w1    w2
(none exist: XOR is not linearly separable, so no single perceptron can compute it)
XOR

input (x1 x2)   output
0 0             0
0 1             1
1 0             1
1 1             0

[Figure: inputs x1 and x2 connect through weights w1, w2 to a hidden unit and through weights w3, w4 directly to the output unit z; the hidden unit connects to z with weight w5; both units use θ = 0.5. Output = f(w1, w2, w3, w4, w5).]

a possible set of values for the ws:
(w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, −2)

f(a) = 1, for a > θ; 0, for a ≤ θ
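A short Python check of the network above with the slide’s weights (0.3, 0.3, 1, 1, −2); the wiring assumed here (w1, w2 into the hidden unit, w3, w4 directly into the output, w5 from hidden to output) is one reading of the figure:

```python
THETA = 0.5
w1, w2, w3, w4, w5 = 0.3, 0.3, 1, 1, -2   # weights from the slide

def f(a):
    """Step activation: 1 if a > theta, else 0."""
    return 1 if a > THETA else 0

def xor_net(x1, x2):
    h = f(x1 * w1 + x2 * w2)              # hidden unit: fires only for input (1, 1)
    return f(x1 * w3 + x2 * w4 + h * w5)  # output unit z

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # prints 0, 1, 1, 0 (XOR)
```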
XOR

input (x1 x2)   output
0 0             0
0 1             1
1 0             1
1 1             0

[Figure: inputs x1 and x2 feed two hidden units (weights w1–w4), and the two hidden units feed the output unit (weights w5, w6); θ = 0.5 for all units. Output = f(w1, w2, w3, w4, w5, w6).]

a possible set of values for the ws:
(w1, w2, w3, w4, w5, w6) = (0.6, −0.6, −0.7, 0.8, 1, 1)

f(a) = 1, for a > θ; 0, for a ≤ θ
From perceptrons to
multilayer perceptrons
Why?
Abdominal Pain
[Figure: a multilayer network with input units (Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1), adjustable weights, and output units (shown as 0/1) for Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, and Pancreatitis.]
Heart Attack Network
[Figure: a network with inputs Duration of Pain = 2, Intensity of Pain = 4, ECG: ST elevation = 1, Smoker = 1, Age = 50, Male = 1, and a single output: “Probability” of Myocardial Infarction = 0.8.]
Multilayered Perceptrons
Output of unit k: ok = 1 / (1 + e^−(ak + θk))
Input to unit k: ak = Σj wjk oj
Output of unit j (hidden): oj = 1 / (1 + e^−(aj + θj))
Input to unit j: aj = Σi wij ai
Input to unit i: ai = measured value of variable i
[Figure: input units i feed hidden units j, which feed output units k; the output layer by itself is a perceptron, and the full stack is a multilayered perceptron.]
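A compact Python sketch of this forward pass for one hidden layer, using the sigmoid equations above; the example weights, thresholds, and inputs are illustrative:

```python
import math

def sigmoid(a, theta=0.0):
    return 1.0 / (1.0 + math.exp(-(a + theta)))

def layer(inputs, weights, thetas):
    """One layer: each unit computes sigmoid(sum_i w_ij * a_i + theta_j)."""
    return [sigmoid(sum(w * a for w, a in zip(w_j, inputs)), th)
            for w_j, th in zip(weights, thetas)]

def mlp(inputs, w_hidden, th_hidden, w_out, th_out):
    hidden = layer(inputs, w_hidden, th_hidden)   # hidden units j
    return layer(hidden, w_out, th_out)           # output units k

# Illustrative network: 3 inputs, 2 hidden units, 1 output unit.
x = [34.0, 1.0, 4.0]                          # e.g., Age, Gender, Stage
w_h = [[0.01, 0.2, -0.1], [-0.02, 0.4, 0.3]]  # made-up hidden-layer weights
w_o = [[0.8, -0.5]]                           # made-up output-layer weights
print(mlp(x, w_h, [0.0, 0.0], w_o, [0.0]))
```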
Neural Network Model
[Figure: inputs Age = 34, Gender = 2, Stage = 4 (independent variables) connect through weights to a hidden layer of Σ units, which connect through a second set of weights to the output, 0.6 = “Probability of being Alive” (the dependent variable, i.e., the prediction).]
“Combined logistic models”
[Figure: the same network drawn to emphasize that one hidden unit, together with its incoming weights from the inputs (Age = 34, Gender = 2, Stage = 4) and its connection to the output (0.6, “Probability of being Alive”), resembles a single logistic regression model.]
[Figure: the same network with another hidden unit’s path from the inputs to the output shown, again resembling a logistic regression model.]
[Figure: the same network once more; each hidden unit’s path from the inputs to the output looks like a logistic model of its own.]
Not really: there is no target for the hidden units…
[Figure: the full network again (Age = 34, Gender = 2, Stage = 4 → hidden Σ units → output 0.6, “Probability of being Alive”); the hidden units cannot be fit as separate logistic models because no target values are observed for them.]
Hidden Units and Backpropagation
[Figure: the error at the output units (what we got minus what we wanted) is used by the Δ rule to update the output-layer weights, and is propagated backward (backpropagation) to update the weights between the input units and the hidden units.]
Multilayer perceptrons
• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons
ECG Interpretation
[Figure: a network with inputs QRS amplitude, R-R interval, QRS duration, AVF lead, S-T elevation, and P-R interval, and outputs SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, and Myocardial infarction.]
Linear Separation
Separate an n-dimensional space using an (n − 1)-dimensional hyperplane.
[Figure: a 2D example with corners 00 (no cough, no headache: no disease), 01 (headache, no cough: meningitis or flu), 10 (cough, no headache: pneumonia), and 11 (cough and headache); a line separates the “treatment” from the “no treatment” cases. A 3D example shows the corners 000 through 111 of a cube.]
Another way of thinking about this…
• Have a data set D = {(xi, ti)} drawn from a probability distribution P(x, t)
• Model P(x, t) given the samples D by an ANN with adjustable parameters w
• Statistics analogy: maximum likelihood estimation
Maximum Likelihood Estimation
• Maximize likelihood of data D
• Likelihood L = Π p(xi,ti) = Π p(ti|xi)p(xi)
• Minimize -log L = -Σ log p(ti|xi) -Σ log p(xi)
• Drop second term: does not depend on w
• Two cases: “regression” and classification
Likelihood for classification
(i.e., categorical target)
• For classification, targets t are class labels
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = y(xi,w)^ti (1 − y(xi,w))^(1−ti) ⇒
  −log p(ti|xi) = −ti log y(xi,w) − (1 − ti) log(1 − y(xi,w))
• Minimizing −log L is equivalent to minimizing
  −Σ [ti log y(xi,w) + (1 − ti) log(1 − y(xi,w))]
  (cross-entropy error)
Likelihood for “regression”
(i.e., continuous target)
• For regression, targets t are real values
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = (1/Z) exp(−(y(xi,w) − ti)² / (2σ²)) ⇒
  −log p(ti|xi) = (1/(2σ²)) (y(xi,w) − ti)² + log Z
• y(xi,w) is the network output
• Minimizing −log L is equivalent to minimizing
  Σ (y(xi,w) − ti)² (sum-of-squares error)
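A small Python sketch, with illustrative numbers, checking that the Bernoulli negative log-likelihood is exactly the cross-entropy error and that the Gaussian negative log-likelihood is the sum-of-squares error up to a constant (here σ = 1 and Z is the Gaussian normalizer):

```python
import math

y = [0.9, 0.3, 0.7]        # illustrative network outputs
t_class = [1, 0, 1]        # illustrative class labels
t_reg = [0.8, 0.1, 1.0]    # illustrative real-valued targets

# Classification: -log prod y^t (1-y)^(1-t)  ==  cross-entropy error
nll_class = -sum(t * math.log(yi) + (1 - t) * math.log(1 - yi)
                 for yi, t in zip(y, t_class))

# Regression: -log prod (1/Z) exp(-(y-t)^2 / 2)  ==  0.5 * sum-of-squares + const
Z = math.sqrt(2 * math.pi)
nll_reg = -sum(math.log((1 / Z) * math.exp(-(yi - t) ** 2 / 2))
               for yi, t in zip(y, t_reg))
sos = sum((yi - t) ** 2 for yi, t in zip(y, t_reg))
print(nll_class, nll_reg, 0.5 * sos + len(y) * math.log(Z))   # last two match
```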
Backpropagation algorithm
• Minimize the error function by gradient descent: wk+1 = wk − η gradw E
• Iterative gradient calculation by propagating error signals
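A minimal Python sketch of backpropagation for one sigmoid hidden layer and one sigmoid output trained with cross-entropy error; the XOR training set, network size, and learning rate are illustrative choices, not taken from the slides:

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train(data, n_hidden=2, eta=0.5, epochs=5000):
    """Gradient descent with error signals propagated from output to hidden layer."""
    n_in = len(data[0][0])
    random.seed(0)
    w_h = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [random.uniform(-1, 1) for _ in range(n_hidden + 1)]      # last entry = bias
    for _ in range(epochs):
        for x, t in data:
            xb = x + [1.0]                                           # bias input
            h = [sigmoid(sum(w * xi for w, xi in zip(wj, xb))) for wj in w_h]
            hb = h + [1.0]
            o = sigmoid(sum(w * hi for w, hi in zip(w_o, hb)))
            d_o = o - t                        # output error signal (sigmoid + cross-entropy)
            d_h = [hj * (1 - hj) * w_o[j] * d_o for j, hj in enumerate(h)]
            w_o = [w - eta * d_o * hi for w, hi in zip(w_o, hb)]     # gradient step
            for j in range(n_hidden):
                w_h[j] = [w - eta * d_h[j] * xi for w, xi in zip(w_h[j], xb)]
    return w_h, w_o

# Illustrative training set (XOR) and a check of the trained outputs;
# the outputs should approach the targets, though training can settle in a local minimum.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
w_h, w_o = train(data)
for x, t in data:
    xb = x + [1.0]
    hb = [sigmoid(sum(w * xi for w, xi in zip(wj, xb))) for wj in w_h] + [1.0]
    print(x, t, round(sigmoid(sum(w * hi for w, hi in zip(w_o, hb))), 2))
```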
Backpropagation algorithm
Problem: how to set learning rate η ?
[Figures by MIT OCW: plots of an error surface illustrating how the learning rate affects the gradient-descent steps.]
Better: use more advanced minimization algorithms (second-order information)
Backpropagation algorithm
• Classification: cross-entropy error
• Regression: sum-of-squares error
Overfitting
[Figure: the real distribution versus an overfitted model.]
Improving generalization
Problem: memorizing (x, t) combinations (“overtraining”)

x1     x2     t
 0.7    0.5   0
-0.5    0.9   1
-0.2   -1.2   1
 0.3    0.6   1
-0.2    0.5   ?
Improving generalization
• Need test set to judge performance
• Goal: represent information in data set, not
noise
• How to improve generalization?
– Limit network topology
– Early stopping
– Weight decay
Limit network topology
• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:
Limit network topology
Early Stopping
[Figure: left, CHD versus age with an overfitted model and the “real” model; right, error versus training cycles for the training set and the holdout set.]
Early stopping
• Idea: stop training when information (but
not noise) is modeled
• Need hold-out set to determine when to
stop training
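A minimal Python sketch of an early-stopping loop; train_one_epoch and holdout_error are hypothetical callables standing in for one training pass and for evaluation on the hold-out set, and “stop as soon as the hold-out error stops improving” is one simple stopping criterion:

```python
def early_stopping(train_one_epoch, holdout_error, max_epochs=1000):
    """Stop training as soon as the hold-out error stops improving."""
    best = float("inf")
    for epoch in range(max_epochs):
        train_one_epoch()                 # one pass over the training set
        err = holdout_error()             # error on the hold-out set
        if err >= best:                   # no improvement: stop here
            return epoch, best
        best = err
    return max_epochs, best

# Illustrative stand-ins: hold-out error falls, then rises (overtraining begins).
errs = iter([0.9, 0.7, 0.5, 0.45, 0.47, 0.6])
print(early_stopping(lambda: None, lambda: next(errs)))   # stops at epoch 4
```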
Overfitting
[Figure: total sum of squares (tss) versus epochs for the hold-out set (tss a) and the training set (tss b); the stopping criterion is min(Δtss) on the hold-out set, beyond which the model is overfitted.]
Early stopping
Weight decay
• Idea: control smoothness of network
output by controlling size of weights
• Add term α||w||2 to error function
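A one-function illustration in Python of adding the weight-decay penalty α||w||² to an error function; error_fn, the weights, and α below are placeholders:

```python
def penalized_error(error_fn, w, alpha=0.01):
    """E_total = E(w) + alpha * ||w||^2 (weight decay keeps the weights small)."""
    return error_fn(w) + alpha * sum(wi * wi for wi in w)

# Illustrative: a made-up error function and weight vector.
print(penalized_error(lambda w: (w[0] - 1) ** 2, w=[3.0, -2.0]))
```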
Weight decay
Bayesian perspective
• Error function minimization corresponds to
maximum likelihood (ML) estimate: single
best solution wML
• Can lead to overtraining
• Bayesian approach: consider weight
posterior distribution p(w|D).
Bayesian perspective
• Posterior = likelihood * prior
• p(w|D) = p(D|w) p(w)/p(D)
• Two approaches to approximating p(w|D):
– Sampling
– Gaussian approximation
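A tiny Python sketch, on an illustrative one-weight logistic model, of forming the (unnormalized) posterior as likelihood × prior over a grid of weight values; the data, prior width, and grid are all made up for demonstration:

```python
import math

data = [(0.0, 0), (1.0, 1), (2.0, 1)]          # illustrative (x, t) pairs

def likelihood(w):
    """Bernoulli likelihood of the data under a one-weight logistic model."""
    prob = 1.0
    for x, t in data:
        y = 1.0 / (1.0 + math.exp(-w * x))
        prob *= y ** t * (1 - y) ** (1 - t)
    return prob

def prior(w, sigma=1.0):
    """Gaussian prior on the weight (unnormalized)."""
    return math.exp(-w * w / (2 * sigma ** 2))

grid = [i / 10 for i in range(-30, 31)]
posterior = [likelihood(w) * prior(w) for w in grid]    # p(w|D) up to a constant
z = sum(posterior)
posterior = [p / z for p in posterior]                  # normalize over the grid
print(grid[posterior.index(max(posterior))])            # MAP weight on the grid
```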
Sampling from p(w|D)
[Figure: the prior and the likelihood over the weights.]
Sampling from p(w|D)
prior × likelihood = posterior
[Figure: the resulting posterior.]
Bayesian example for
regression
Bayesian example for
classification
Model Features
(with strong personal biases)

                         Modeling Effort   Examples Needed   Explanation
Rule-based Exp. Syst.    high              low               high?
Classification Trees     low               high+             “high”
Neural Nets, SVM         low               high              low
Regression Models        high              moderate          moderate
Learned Bayesian Nets    low               high+             high
(beautiful when it works)
Regression vs. Neural Networks
[Figure: left, a regression model Y = a(X1) + b(X2) + c(X3) + d(X1X2) + … in which the modeler chooses among the (2³ − 1) possible combinations of X1, X2, X3 and their interaction terms; right, a neural network with inputs X1, X2, X3 whose hidden units play the role of terms such as “X1”, “X2”, “X1X3”, and “X1X2X3”.]
Summary
• ANNs inspired by functionality of brain
• Nonlinear data model
• Trained by minimizing error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions
Some References
Introductory and Historical Textbooks
• Rumelhart DE, McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986. (H)
• Hertz JA, Palmer RG, Krogh AS. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.