Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Spring 2005
Artificial Neural Networks
Lucila Ohno-Machado
(with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)

Overview
• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective

Motivation
[Images removed due to copyright reasons: a benign lesion and a malignant lesion.]

Motivation
• Human brain
– Parallel processing
– Distributed representation
– Fault tolerant
– Good generalization capability
• Mimic structure and processing in a computational model

Biological Analogy
[Diagram: a biological neuron with dendrites, synapses (excitatory +, inhibitory −), and an axon, alongside an artificial network whose nodes and weights play the corresponding roles.]

Perceptrons
[Diagram: unsorted input patterns (10, 01, 11, 10, 00, 11, 00, 00, 11) pass through an input layer and an output layer and come out as sorted patterns (00, 00, 00, 01, 10, 10, 11, 11, 11).]

Activation functions
[figure]

Perceptrons (linear machines)
[Diagram: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis). The Δ rule changes the weights to decrease the error, where error = what we got − what we wanted.]

Abdominal Pain Perceptron
[Diagram: input units Age = 20, Male = 1, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1, connected through adjustable weights to output units Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis.]

AND
input   output
00      0
01      0
10      0
11      1
y = f(x1·w1 + x2·w2), with threshold θ = 0.5 and
f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 0
f(1·w1 + 0·w2) = 0
f(1·w1 + 1·w2) = 1
Some possible values for w1 and w2:
w1     w2
0.20   0.35
0.20   0.40
0.25   0.30
0.40   0.20

Single layer neural network
Output of unit j: o_j = 1 / (1 + e^−(a_j + θ_j))
Input to unit j: a_j = Σ_i w_ij a_i
Input to unit i: a_i = measured value of variable i

Single layer neural network
[Plot: sigmoid output curves 1 / (1 + e^−(a + θ)) for three values of θ; the curve shifts along the input axis as θ increases.]

Training: Minimize Error
[Diagram: the same Cough/Headache network as above; the Δ rule changes the weights to decrease the error = what we got − what we wanted.]

Error Functions
• Mean Squared Error (for regression problems), where t is the target and o is the output:
  Σ (t − o)² / n
• Cross-Entropy Error (for binary classification):
  −Σ [t log o + (1 − t) log(1 − o)]
with o_j = 1 / (1 + e^−(a_j + θ_j))

Error function
• Convention: w := (w0, w), x := (1, x); w0 is the "bias"
• o = f(w · x)
• Class labels ti ∈ {+1, −1}
• Error measure: E = −Σ_{i miscl.} ti (w · xi)
• How to minimize E?

[Diagram: a unit y = f(x1·w1 + x2·w2) with threshold θ = 0.5, redrawn with an extra constant input of 1 and bias weight w0 = 0.5.]

Minimizing the Error
[Figure: error surface over the weights, showing the initial error and its derivative at w_initial and the final error at a local minimum w_trained.]

Gradient descent
[Figure: error curve with a global minimum and a local minimum.]

Perceptron learning
• Find the minimum of E by iterating w_{k+1} = w_k − η grad_w E
• E = −Σ_{i miscl.} ti (w · xi) ⇒ grad_w E = −Σ_{i miscl.} ti xi
• "Online" version: pick a misclassified xi and update w_{k+1} = w_k + η ti xi

Perceptron learning
• Update rule: w_{k+1} = w_k + η ti xi
• Theorem: perceptron learning converges for linearly separable sets

Gradient descent
• Simple function minimization algorithm
• Gradient is the vector of partial derivatives
• Negative gradient is the direction of steepest descent
[Surface and contour plots of an error function. Figures by MIT OCW.]
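The online perceptron update described above can be made concrete in a few lines of code. The following is a minimal sketch (not from the original slides) using the w := (w0, w), x := (1, x) convention and ±1 class labels from the "Error function" slide, applied to the AND example; the variable names and learning rate are illustrative.

```python
import numpy as np

# Online perceptron rule w_{k+1} = w_k + eta * t_i * x_i for misclassified cases,
# with a constant input 1 standing in for the bias w0. Illustrative sketch only.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, +1])            # AND, encoded with labels in {+1, -1}

X = np.hstack([np.ones((4, 1)), X])       # prepend the constant input 1 (bias w0)
w = np.zeros(3)
eta = 0.1

for epoch in range(100):
    misclassified = 0
    for xi, ti in zip(X, t):
        if ti * np.dot(w, xi) <= 0:       # xi is on the wrong side of the boundary
            w += eta * ti * xi            # online perceptron update
            misclassified += 1
    if misclassified == 0:                # converges for linearly separable sets
        break

print(w)                                  # separating weights; w[0] acts as -theta
```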
Classification Model
[Diagram: inputs Age = 34, Gender = 1, Stage = 4 (independent variables x1, x2, x3) are multiplied by weights/coefficients (a, b, c; e.g., 5, 4, 8) and summed (Σ) to give the output 0.6, the "probability of being alive" (dependent variable p, the prediction).]

Prediction Terminology
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
• Iterative step = cycle, epoch

XOR
input   output
00      0
01      1
10      1
11      0
y = f(x1·w1 + x2·w2), with threshold θ = 0.5 and
f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ
We would need
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0
Some possible values for w1 and w2:
w1     w2
(the table is empty: no single pair of weights can satisfy all four conditions at once)

XOR
input   output
00      0
01      1
10      1
11      0
[Diagram: x1 and x2 feed a hidden unit z through weights w1, w2 and feed the output unit y directly through weights w3, w4; z feeds y through w5; θ = 0.5 for both units, with f(a) = 1 for a > θ and 0 for a ≤ θ.]
The output is a function of (w1, w2, w3, w4, w5).
A possible set of values for the ws: (w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, −2)

XOR
input   output
00      0
01      1
10      1
11      0
[Diagram: x1 and x2 feed two hidden units through weights w1–w4; the hidden units feed the output unit through w5 and w6; θ = 0.5 for all units, with f(a) = 1 for a > θ and 0 for a ≤ θ.]
The output is a function of (w1, w2, w3, w4, w5, w6).
A possible set of values for the ws: (w1, w2, w3, w4, w5, w6) = (0.6, −0.6, −0.7, 0.8, 1, 1)

From perceptrons to multilayer perceptrons
Why?

Abdominal Pain
[Diagram: inputs Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1, connected through adjustable weights to output units Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis.]

Heart Attack Network
[Diagram: inputs Pain Duration = 2, Pain Intensity = 4, ECG: ST elevation = 1, Smoker = 1, Age = 50, Male = 1 feed a network whose output, Myocardial Infarction = 0.8, is the "probability" of MI.]

Multilayered Perceptrons
Output of unit k: o_k = 1 / (1 + e^−(a_k + θ_k))
Input to unit k: a_k = Σ_j w_jk o_j
Output of hidden unit j: o_j = 1 / (1 + e^−(a_j + θ_j))
Input to unit j: a_j = Σ_i w_ij a_i
Input to unit i: a_i = measured value of variable i
[Diagram: input units i, hidden units j, and output units k; the perceptron appears as the special case without the hidden layer.]

Neural Network Model
[Diagram: inputs Age = 34, Gender = 2, Stage = 4 (independent variables) connected by weights (e.g., .6, .2, .4, .5, .1, .2, .3, .7) to a hidden layer of Σ units, which connect by further weights (.8, .2) to an output Σ unit; the output 0.6 is the "probability of being alive" (dependent variable, the prediction).]

"Combined logistic models"
[Three diagrams, each highlighting the path through a single hidden unit of the same network: inputs Age, Gender, and Stage, the weights into one hidden Σ unit, and its weight to the output 0.6, the "probability of being alive".]
Not really, no target for hidden units...
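The XOR construction with one hidden unit shown earlier in this section can be checked numerically. The sketch below is not part of the original slides; it assumes one plausible reading of the diagram, z = f(w1·x1 + w2·x2) and y = f(w3·x1 + w4·x2 + w5·z), with θ = 0.5 for both units and the slide's values (0.3, 0.3, 1, 1, −2).

```python
# Verify the XOR solution with one hidden unit (illustrative sketch).
def f(a, theta=0.5):
    # threshold activation: 1 for a > theta, 0 for a <= theta
    return 1 if a > theta else 0

w1, w2, w3, w4, w5 = 0.3, 0.3, 1, 1, -2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = f(w1 * x1 + w2 * x2)           # hidden unit: fires only for input 11
    y = f(w3 * x1 + w4 * x2 + w5 * z)  # output unit
    print(x1, x2, "->", y)             # prints 0, 1, 1, 0: the XOR truth table
```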
[Diagram: the full Neural Network Model from above, with all hidden units and weights shown.]

Hidden Units and Backpropagation
[Diagram: the error (what we got − what we wanted) drives the Δ rule at the output units and is propagated backward (backpropagation) so the Δ rule can also be applied at the hidden units and input weights.]

Multilayer perceptrons
• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons

ECG Interpretation
[Diagram: inputs QRS amplitude, R-R interval, QRS duration, AVF lead, S-T elevation, and P-R interval feed a network with outputs SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, and Myocardial infarction.]

Linear Separation
Separate n-dimensional space using one (n − 1)-dimensional hyperplane.
[Diagrams: in two dimensions (Cough, Headache), the corners 00 (No disease), 01 (no cough, headache: Meningitis, Flu), 10 (cough, no headache: Pneumonia), and 11 (cough, headache) are split by a line into Treatment vs. No treatment; a three-dimensional cube with corners 000–111 illustrates the same idea with a plane.]

Another way of thinking about this…
• Have a data set D = {(xi, ti)} drawn from probability distribution P(x, t)
• Model P(x, t) given samples D by an ANN with adjustable parameter w
• Statistics analogy: Maximum Likelihood Estimation
• Maximize the likelihood of data D
• Likelihood L = Π_i p(xi, ti) = Π_i p(ti|xi) p(xi)
• Minimize −log L = −Σ_i log p(ti|xi) − Σ_i log p(xi)
• Drop the second term: it does not depend on w
• Two cases: "regression" and classification

Likelihood for classification (i.e., categorical target)
• For classification, targets t are class labels
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = y(xi, w)^ti (1 − y(xi, w))^(1 − ti)
  ⇒ −log p(ti|xi) = −ti log y(xi, w) − (1 − ti) log(1 − y(xi, w))
• Minimizing −log L is equivalent to minimizing
  −Σ [ti log y(xi, w) + (1 − ti) log(1 − y(xi, w))]  (cross-entropy error)

Likelihood for "regression" (i.e., continuous target)
• For regression, targets t are real values
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = (1/Z) exp(−(y(xi, w) − ti)² / (2σ²))
  ⇒ −log p(ti|xi) = (1/(2σ²)) (y(xi, w) − ti)² + log Z
• y(xi, w) is the network output
• Minimizing −log L is equivalent to minimizing Σ (y(xi, w) − ti)²  (sum-of-squares error)

Backpropagation algorithm
• Minimize the error function by gradient descent: w_{k+1} = w_k − η grad_w E
• Iterative gradient calculation by propagating error signals

Backpropagation algorithm
Problem: how to set the learning rate η?
[Contour plots of gradient descent trajectories for different learning rates. Figures by MIT OCW.]
Better: use more advanced minimization algorithms (second-order information)

Backpropagation algorithm
Classification: cross-entropy error
Regression: sum-of-squares error
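A compact sketch of the gradient-descent/backpropagation recipe above, for a one-hidden-layer network with sigmoid units and cross-entropy error, matching o_j = 1 / (1 + e^−(a_j + θ_j)). This is an illustrative implementation rather than the course's code; the data, layer size, and learning rate are made up for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # 200 cases, 3 input variables
t = (X[:, 0] * X[:, 1] > 0).astype(float)       # a nonlinear (XOR-like) target

n_hidden, eta = 4, 0.5
W1 = rng.normal(scale=0.5, size=(3, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=n_hidden);      b2 = 0.0

for epoch in range(5000):
    # forward pass: hidden outputs o_j and network output y(x, w)
    o1 = sigmoid(X @ W1 + b1)
    y = sigmoid(o1 @ W2 + b2)

    # backward pass: error signals propagated from the output to the hidden layer
    delta2 = y - t                              # dE/da at the output (cross-entropy + sigmoid)
    delta1 = np.outer(delta2, W2) * o1 * (1 - o1)

    # gradient-descent updates w_{k+1} = w_k - eta * grad_w E
    W2 -= eta * o1.T @ delta2 / len(X); b2 -= eta * delta2.mean()
    W1 -= eta * X.T @ delta1 / len(X);  b1 -= eta * delta1.mean(axis=0)

y = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
E = -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))   # cross-entropy error
print(round(E, 3))
```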
Overfitting
[Figure: an overfitted model compared with the real distribution.]

Improving generalization
Problem: memorizing (x, t) combinations ("overtraining")
[Table of numeric (x, t) training cases, with one new case whose target is marked "?".]

Improving generalization
• Need a test set to judge performance
• Goal: represent the information in the data set, not the noise
• How to improve generalization?
  – Limit network topology
  – Early stopping
  – Weight decay

Limit network topology
• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:

Limit network topology
[Figure illustrating the polynomial interpolation analogy.]

Early Stopping
[Figure: CHD risk vs. age, comparing the "real" model with an overfitted model; training error keeps decreasing with training cycles while hold-out error starts to rise.]

Early stopping
• Idea: stop training when the information (but not the noise) is modeled
• Need a hold-out set to determine when to stop training

Overfitting
[Figure: total sum of squares (tss) vs. epochs for the training set (b) and the hold-out set (a); the stopping criterion is based on min(Δtss) on the hold-out set.]

Early stopping
[figure]

Weight decay
• Idea: control the smoothness of the network output by controlling the size of the weights
• Add a term α‖w‖² to the error function

Weight decay
[figure]

Bayesian perspective
• Error function minimization corresponds to the maximum likelihood (ML) estimate: a single best solution wML
• Can lead to overtraining
• Bayesian approach: consider the weight posterior distribution p(w|D)

Bayesian perspective
• Posterior = likelihood × prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
  – Sampling
  – Gaussian approximation

Sampling from p(w|D)
[Figure: prior and likelihood.]

Sampling from p(w|D)
[Figure: prior × likelihood = posterior.]

Bayesian example for regression
[figure]

Bayesian example for classification
[figure]

Model Features (with strong personal biases)
                              Modeling Effort   Examples Needed   Explanation
Rule-based Expert Systems     high              low               high?
Classification Trees          low               high+             "high"
Neural Nets, SVM              low               high              low
Regression Models             high              moderate          moderate
Learned Bayesian Nets         low               high+             high
(beautiful when it works)

Regression vs. Neural Networks
[Diagram: a regression model Y = a(X1) + b(X2) + c(X3) + d(X1X2) + … with explicit interaction terms (X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3: (2³ − 1) possible combinations), compared with a neural network in which hidden units ("X1", "X2", "X1X3", "X1X2X3") play the role of the interaction terms.]

Summary
• ANNs inspired by the functionality of the brain
• Nonlinear data model
• Trained by minimizing an error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions

Some References
Introductory and Historical Textbooks
• Rumelhart DE, McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986. (H)
• Hertz JA, Palmer RG, Krogh AS. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
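As a closing illustration of the early-stopping and weight-decay ideas discussed above, here is a small sketch (not from the original slides) that trains a single sigmoid unit by gradient descent, adds an α‖w‖² penalty to the error, and stops once the error on a hold-out set stops improving; all data and parameter values are synthetic and illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Xs, ts):
    y = sigmoid(Xs @ w)
    return -np.mean(ts * np.log(y) + (1 - ts) * np.log(1 - y))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
t = (X @ rng.normal(size=5) + rng.normal(scale=2.0, size=300) > 0).astype(float)

X_train, t_train = X[:200], t[:200]            # training set
X_hold, t_hold = X[200:], t[200:]              # hold-out set for early stopping

w = np.zeros(5)
eta, alpha = 0.1, 0.01                         # learning rate, weight-decay strength
best_w, best_err, patience = w.copy(), np.inf, 0

for epoch in range(5000):
    y = sigmoid(X_train @ w)
    # gradient of cross-entropy plus the weight-decay term alpha * ||w||^2
    grad = X_train.T @ (y - t_train) / len(t_train) + 2 * alpha * w
    w -= eta * grad

    hold_err = cross_entropy(w, X_hold, t_hold)
    if hold_err < best_err:                    # hold-out error still improving
        best_err, best_w, patience = hold_err, w.copy(), 0
    else:
        patience += 1
        if patience >= 50:                     # stop once hold-out error keeps rising
            break

print(epoch, round(best_err, 3))               # best_w holds the early-stopped weights
```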