Neural Networks
Le Song
Machine Learning I, CSE 6740, Fall 2013

Learning a nonlinear decision boundary
- Some problems are linearly separable; others are not.
- Classic nonlinearly separable examples: the XOR gate, speech recognition.

A decision tree for Tax Fraud
- Input: a vector of attributes X = [Refund, MarSt, TaxInc]
- Output: Y = Cheating or Not
- The hypothesis H as a procedure:
  - each internal node tests one attribute X_i
  - each branch from a node selects one value for X_i
  - each leaf node predicts Y
- The tree:
  - Refund = Yes: NO
  - Refund = No: test MarSt
    - MarSt = Married: NO
    - MarSt = Single or Divorced: test TaxInc
      - TaxInc < 80K: NO
      - TaxInc > 80K: YES

Applying the model to test data
- Query record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
- Start from the root of the tree and follow the branch that matches the record at each node:
  Refund = No leads to the MarSt test, and MarSt = Married leads directly to a leaf.
- Assign Cheat = "No".

Expressiveness of decision trees
- Decision trees can express any function of the input attributes.
- E.g., for Boolean functions, each truth table row corresponds to a path from the root to a leaf.
- Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example.
- We prefer to find more compact decision trees.

Decision tree learning
Training set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes
Test set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?
- Induction: a tree induction algorithm learns a model (a decision tree) from the training set.
- Deduction: the model is applied to the test set to predict the unknown class labels.

Top-down induction of a decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for the node.
3. For each value of A, create a new descendant of the node.
4. Sort the training examples to the leaf nodes.
5. If the training examples are perfectly classified, STOP; else iterate over the new leaf nodes.

Tree induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - How to split the records: how to specify the attribute test condition, and how to determine the best split?
  - When to stop splitting?
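As a concrete companion to the Tax Fraud example and the walk-through above, here is a minimal Python sketch (the dictionary encoding of records and the function name are my own illustration, not from the slides) that hard-codes the tree and applies it to the query record. The slide does not say which branch a taxable income of exactly 80K takes, so that tie-break is an assumption.

```python
def classify(record):
    """Walk the Tax Fraud decision tree from the slides and return 'Yes'/'No' for Cheat."""
    if record["Refund"] == "Yes":
        return "No"
    # Refund == "No": test marital status
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (threshold 80K; the behavior at exactly
    # 80K is not specified on the slide, so sending it to the "NO" branch is an assumption)
    return "No" if record["TaxInc"] <= 80 else "Yes"

# Query record from the "applying the model to test data" walk-through
query = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(query))  # -> "No", matching the slide's conclusion
```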
Example of a decision tree
Training data:
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes
Model (splitting attributes):
  - Refund = Yes: NO
  - Refund = No: MarSt
    - Married: NO
    - Single or Divorced: TaxInc < 80K: NO; TaxInc > 80K: YES

Another example of a decision tree
For the same training data:
  - MarSt = Married: NO
  - MarSt = Single or Divorced: Refund
    - Refund = Yes: NO
    - Refund = No: TaxInc < 80K: NO; TaxInc > 80K: YES
There can be more than one tree that fits the same data!

How to determine the best split
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
- A homogeneous class distribution has a low degree of impurity; a non-homogeneous one has a high degree of impurity.
- Greedy approach: splits that produce nodes with homogeneous class distributions are preferred.
- This requires a measure of node impurity.

How to compare attributes? Entropy
- The entropy H(X) of a random variable X quantifies its randomness:
  H(X) = - Σ_x p(X = x) log2 p(X = x)
- Information theory: H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).

Examples of computing entropy
- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = -0 log2 0 - 1 log2 1 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) ≈ 0.65
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) ≈ 0.92

Sample entropy
- S is a sample of training examples; p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
- The entropy H(S) = -p+ log2 p+ - p- log2 p- measures the impurity of S.

How to compare attributes? Conditional entropy
- The entropy of X given a specific value Y = k:
  H(X | Y = k) = - Σ_l p(x = l | y = k) log2 p(x = l | y = k)
- The conditional entropy H(X | Y) of X given Y is the average of H(X | Y = k) over Y:
  H(X | Y) = Σ_k p(y = k) H(X | Y = k) = - Σ_k Σ_l p(x = l, y = k) log2 [ p(x = l, y = k) / p(y = k) ]
- The mutual information (aka information gain) of X given Y:
  I(X; Y) = H(X) - H(X | Y)

Information gain
- Information gain after splitting a node:
  GAIN_split = Entropy(parent) - Σ_{i=1..k} (n_i / n) Entropy(i)
  where the n samples in the parent node are split into k partitions and n_i is the number of records in partition i.
- It measures the reduction in entropy achieved because of the split.
- Choose the split that achieves the most reduction (i.e., maximizes GAIN).
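The entropy examples and the gain formula above can be checked with a short Python sketch. The helper names and the Refund split below are my own illustration, not from the slides; the class counts are read off the Cheat training table (3 Yes / 7 No overall; Refund = Yes gives 0 Yes / 3 No, Refund = No gives 3 Yes / 4 No).

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# The worked examples from the slides:
print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92

# Splitting the 10-record Cheat training data (3 Yes / 7 No) on Refund:
print(round(information_gain([3, 7], [[0, 3], [3, 4]]), 3))  # ~0.19
```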
Stopping criteria for tree induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination.

A biologically inspired learning model
Consider humans:
- Neuron switching time: ~0.001 second
- Number of neurons: ~10^10
- Connections per neuron: ~10^4 to 10^5
- Scene recognition time: ~0.1 second
- 100 inference steps doesn't seem like enough → much of the computation must be parallel.

Perceptron
From the biological neuron to the artificial neuron (perceptron):
[Figure: the dendrites, soma, axon and synapses of a biological neuron mapped onto an artificial neuron: input signals x1, x2 with weights w1, w2, a linear combiner Σ, a nonlinear transform with threshold θ, and an output signal.]

Artificial neural nets (ANN)
- Many neuron-like threshold switching units.
- Many weighted interconnections among units.
- Highly parallel, distributed processing.
[Figure: input layer → middle (hidden) layer → output layer Y.]

Jargon pseudo-correspondence
- Independent variable = input variable
- Dependent variable = output variable
- Coefficients = "weights"

The logistic regression model (the sigmoid unit) is a perceptron
[Figure: inputs Age = 34, Gender = 1, Stage = 4 feed through coefficients into a sigmoid unit Σ whose output, 0.6, is the "probability of being alive".]
- Independent variables x1, x2, x3; coefficients a, b, c; dependent variable p (the prediction).

What is the logistic regression model?
- Assume that the posterior distribution p(y = 1 | x) takes a particular form:
  p(y = 1 | x, θ) = 1 / (1 + exp(-θᵀx))
- The logistic function (or sigmoid function) is σ(u) = 1 / (1 + exp(-u)), and its derivative is
  dσ(u)/du = σ(u)(1 - σ(u))

Learning parameters in logistic regression
- Find θ such that the conditional likelihood of the labels is maximized:
  max_θ l(θ) := log Π_{i=1..m} p(y^i | x^i, θ)
- Good news: l(θ) is a concave function of θ, so there is a single global optimum.
- Bad news: there is no closed-form solution (we resort to numerical methods).

The objective function l(θ)
- The logistic regression model: p(y = 1 | x, θ) = 1 / (1 + exp(-θᵀx))
- Note that p(y = 0 | x, θ) = 1 - 1/(1 + exp(-θᵀx)) = exp(-θᵀx) / (1 + exp(-θᵀx))
- Plugging these into the likelihood (with y^i ∈ {0, 1}):
  l(θ) := log Π_i p(y^i | x^i, θ) = Σ_i [ (y^i - 1) θᵀx^i - log(1 + exp(-θᵀx^i)) ]

The gradient of l(θ)
- l(θ) = Σ_i [ (y^i - 1) θᵀx^i - log(1 + exp(-θᵀx^i)) ]
- Gradient:
  ∂l(θ)/∂θ = Σ_i [ (y^i - 1) x^i + exp(-θᵀx^i) / (1 + exp(-θᵀx^i)) x^i ]
- Setting it to 0 does not lead to a closed-form solution.

Gradient ascent algorithm
- Initialize the parameter θ^0.
- Do:
  θ^{t+1} ← θ^t + η Σ_i [ (y^i - 1) x^i + exp(-θ^{t}ᵀx^i) / (1 + exp(-θ^{t}ᵀx^i)) x^i ]
- While ||θ^{t+1} - θ^t|| > ε.
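A minimal Python (NumPy) sketch of this gradient-ascent loop, assuming a design matrix X with one example per row, labels y in {0, 1}, and a fixed step size. The dataset, function name, and step size below are illustrative assumptions, not from the slides; the update uses the slide's gradient rewritten in the equivalent form Σ_i (y^i - σ(θᵀx^i)) x^i.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def logistic_gradient_ascent(X, y, eta=0.1, eps=1e-6, max_iter=10000):
    """Maximize l(theta) = sum_i [(y_i - 1) theta^T x_i - log(1 + exp(-theta^T x_i))]."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        u = X @ theta
        # Slide's gradient, sum_i [(y_i - 1) + exp(-u_i) / (1 + exp(-u_i))] x_i,
        # which simplifies to sum_i (y_i - sigmoid(u_i)) x_i.
        grad = X.T @ (y - sigmoid(u))
        theta_new = theta + eta * grad
        if np.linalg.norm(theta_new - theta) <= eps:
            return theta_new
        theta = theta_new
    return theta

# A tiny hand-made dataset (assumed for illustration): one feature plus a constant bias column.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [-0.5, 1.0], [0.5, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_ascent(X, y)
print(theta)  # theta[0] > 0: the estimated probability of y = 1 increases with the feature
```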
Learning with square loss
- Find θ such that the predicted conditional probabilities are close to the actual labels (0 or 1):
  min_θ l(θ) := Σ_i ( y^i - p(y = 1 | x^i, θ) )² = Σ_i ( y^i - σ(θᵀx^i) )²
- This is not a convex objective function.
- Use gradient descent to find a local optimum.

The gradient of l(θ)
- l(θ) := Σ_i ( y^i - σ(θᵀx^i) )²
- Let u^i = θᵀx^i and recall that dσ(u)/du = σ(u)(1 - σ(u)).
- By the chain rule,
  ∂l(θ)/∂θ = - Σ_i 2 ( y^i - σ(u^i) ) σ(u^i) (1 - σ(u^i)) x^i

Gradient descent algorithm
- Initialize the parameter θ^0.
- Do (with u^i := θ^{t}ᵀx^i):
  θ^{t+1} ← θ^t - η ∂l(θ^t)/∂θ = θ^t + η Σ_i 2 ( y^i - σ(u^i) ) σ(u^i) (1 - σ(u^i)) x^i
- While ||θ^{t+1} - θ^t|| > ε.

Decision boundary of a perceptron
- A perceptron computes y = f(x1 w1 + x2 w2) with the threshold function f(a) = 1 if a > θ, and 0 if a ≤ θ.
- Consider the NAND truth table (x, y → z): (0,0) → 1, (0,1) → 1, (1,0) → 1, (1,1) → 0.
- This function is linearly separable: with θ = 0.5, each of the weight pairs (w1, w2) = (0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20) gives a decision boundary w1 x1 + w2 x2 = θ that separates the input (1,1) from the other three inputs, so a single threshold unit can represent NAND.

Difficult problem for a perceptron
- Consider the XOR truth table (x, y → z): (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0.
- The requirements f(0·w1 + 0·w2) = 0, f(0·w1 + 1·w2) = 1, f(1·w1 + 0·w2) = 1, f(1·w1 + 1·w2) = 0 cannot all be satisfied: no values of w1 and w2 work, because XOR is not linearly separable.

Connecting perceptrons = neural networks
- XOR can be computed by connecting perceptrons into a network: two hidden threshold units feed an output threshold unit, with θ = 0.5 for all units.
- A possible set of values for the weights (w1, w2, w3, w4, w5, w6) is (0.6, -0.6, -0.7, 0.8, 1, 1), where w1 through w4 connect the two inputs to the two hidden units and w5, w6 connect the hidden units to the output.
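A minimal Python check that this hand-wired network of threshold units reproduces XOR with the weight values from the slide. The slide does not spell out which weight connects which pair of units, so the wiring below (w1, w2 into the first hidden unit, w3, w4 into the second, w5, w6 from the hidden units to the output) is an assumption.

```python
def step(a, theta=0.5):
    """Threshold unit: f(a) = 1 if a > theta, else 0."""
    return 1 if a > theta else 0

def xor_net(x1, x2, w=(0.6, -0.6, -0.7, 0.8, 1.0, 1.0)):
    w1, w2, w3, w4, w5, w6 = w
    h1 = step(w1 * x1 + w2 * x2)    # first hidden threshold unit (wiring assumed)
    h2 = step(w3 * x1 + w4 * x2)    # second hidden threshold unit (wiring assumed)
    return step(w5 * h1 + w6 * h2)  # output unit

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))  # prints 0, 1, 1, 0: the XOR table
```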
Neural network model
[Figure: inputs Age = 34, Gender = 2, Stage = 4 feed through a first set of weights into a hidden layer of sigmoid units Σ, whose outputs feed through a second set of weights into an output sigmoid unit; the output, 0.6, is the "probability of being alive".]
- Independent variables → weights → hidden layer → weights → dependent variable (the prediction).

"Combined logistic models"
- Each hidden unit and the output unit is a sigmoid unit, so each path from the inputs through a hidden unit to the output looks like one logistic regression model feeding into another.
- Not really: there is no target value for the hidden units, so they cannot be fit as separate logistic regressions.

Training hidden units
- Find θ, α, β such that the value of the output unit is close to the actual labels (0 or 1):
  min_{θ,α,β} l(θ, α, β) := Σ_i ( y^i - σ(θᵀz^i) )²
  where z1^i = σ(αᵀx^i) and z2^i = σ(βᵀx^i).
- This is not a convex objective function.
- Use gradient descent to find a local optimum.

The gradient of l(θ, α, β)
- l(θ, α, β) := Σ_i ( y^i - σ(θᵀz^i) )², where z1^i = σ(αᵀx^i) and z2^i = σ(βᵀx^i).
- Let u^i = θᵀz^i. The gradient with respect to θ is
  ∂l(θ, α, β)/∂θ = - Σ_i 2 ( y^i - σ(u^i) ) σ(u^i) (1 - σ(u^i)) z^i
- Here z^i is computed using the α and β from the previous iteration: z1^i = σ(αᵀx^i), z2^i = σ(βᵀx^i).

The gradients for α and β
- Use the chain rule of derivatives. Note z1^i = σ(αᵀx^i) and z2^i = σ(βᵀx^i); let v^i = αᵀx^i.
- Gradient with respect to α:
  ∂l(θ, α, β)/∂α = Σ_i [∂l/∂z1^i] [∂z1^i/∂α]
                 = - Σ_i 2 ( y^i - σ(u^i) ) σ(u^i) (1 - σ(u^i)) θ1 σ(v^i) (1 - σ(v^i)) x^i
- The gradient with respect to β is analogous, with θ2 and v^i = βᵀx^i.

Expressive capabilities of ANNs
- Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units exponential in the number of inputs.
- Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Abdominal pain perceptron
[Figure: a single-layer network with inputs age, male, temperature, WBC, pain intensity and pain duration, adjustable weights, and one output per diagnosis: appendicitis, diverticulitis, perforated duodenal ulcer, non-specific abdominal pain, cholecystitis, small bowel obstruction, pancreatitis.]

Myocardial infarction network
[Figure: inputs pain duration = 2, pain intensity = 4, ECG ST elevation = 1, smoker = 1, age = 50, male = 1 feed a network whose output, 0.8, is the "probability" of myocardial infarction.]

The "Driver" network
[Figure.]

Application: ANN for face recognition
[Figures: the model, and the learned hidden-unit weights.]
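To make the chain-rule derivation for the two-hidden-unit network concrete, here is a minimal Python (NumPy) sketch that implements the squared loss, the gradients for θ, α, and β exactly as derived in the "training hidden units" slides, a finite-difference check of one gradient component, and a few plain gradient-descent steps. The toy data, step size, and random initialization are assumptions for illustration, not part of the slides.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def loss(theta, alpha, beta, X, y):
    """l(theta, alpha, beta) = sum_i (y_i - sigmoid(theta^T z_i))^2, z = (sigmoid(alpha^T x), sigmoid(beta^T x))."""
    z = np.column_stack([sigmoid(X @ alpha), sigmoid(X @ beta)])
    return float(np.sum((y - sigmoid(z @ theta)) ** 2))

def gradients(theta, alpha, beta, X, y):
    """Chain-rule gradients from the slides."""
    z = np.column_stack([sigmoid(X @ alpha), sigmoid(X @ beta)])
    p = sigmoid(z @ theta)                       # sigma(u_i), with u_i = theta^T z_i
    delta = -2 * (y - p) * p * (1 - p)           # d loss_i / d u_i
    g_theta = z.T @ delta
    g_alpha = X.T @ (delta * theta[0] * z[:, 0] * (1 - z[:, 0]))  # via dz1/dalpha
    g_beta = X.T @ (delta * theta[1] * z[:, 1] * (1 - z[:, 1]))   # via dz2/dbeta
    return g_theta, g_alpha, g_beta

# Tiny assumed problem: 5 examples, 3 input features, random labels and parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
theta, alpha, beta = rng.normal(size=2), rng.normal(size=3), rng.normal(size=3)

# Finite-difference check of the first component of the gradient with respect to theta.
g_theta, g_alpha, g_beta = gradients(theta, alpha, beta, X, y)
eps = 1e-6
num = (loss(theta + np.array([eps, 0.0]), alpha, beta, X, y)
       - loss(theta - np.array([eps, 0.0]), alpha, beta, X, y)) / (2 * eps)
print(g_theta[0], num)  # the two numbers should agree to several decimal places

# A few gradient-descent steps with a small fixed step size.
print("initial loss:", loss(theta, alpha, beta, X, y))
for _ in range(2000):
    g_theta, g_alpha, g_beta = gradients(theta, alpha, beta, X, y)
    theta -= 0.05 * g_theta
    alpha -= 0.05 * g_alpha
    beta -= 0.05 * g_beta
print("final loss:", loss(theta, alpha, beta, X, y))  # should be lower than the initial loss
```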