Business School, Institute of Business Informatics
Supervised Learning
Uwe Lämmel
www.wi.hs-wismar.de/~laemmel
U.laemmel@wi.hs-wismar.de

Neural Networks
– Idea
– Artificial Neuron & Network
– Supervised Learning
– Unsupervised Learning
– Data Mining – other Techniques

Supervised Learning – Feed-Forward Networks
– Perceptron – Adaline – TLU
– Multi-layer networks
– Backpropagation algorithm
– Pattern recognition
– Data preparation
Examples
– Bank Customer
– Customer Relationship

Connections
– Feed-forward
  – input layer – hidden layer – output layer
– Feed-back / auto-associative
  – from the (output) layer back to a previous (hidden/input) layer
  – all neurons fully connected to each other: Hopfield network

Perceptron – Adaline – TLU
– one layer of trainable links only
– Adaline: adaptive linear element
– TLU: threshold logic unit
– a class of neural networks with a special architecture: ...

Papert, Minsky and the Perceptron – History
"Once upon a time two daughter sciences were born to the new science of cybernetics. One sister was natural, with features inherited from the study of the brain, from the way nature does things. The other was artificial, related from the beginning to the use of computers. … But Snow White was not dead. What Minsky and Papert had shown the world as proof was not the heart of the princess; it was the heart of a pig."
Seymour Papert, 1988

Perception
– Perception: the first step of recognition – becoming aware of something via the senses
(figure: picture → mapping layer via fixed 1-to-1 links → output layer via trainable, fully connected links)

Perceptron
– input layer: binary input, passed through, no trainable links
– propagation function: net_j = Σ_i o_i·w_ij
– activation function: o_j = a_j = 1 if net_j ≥ θ_j, 0 otherwise
– A perceptron can learn every function it can represent, in finite time.
  (perceptron convergence theorem, F. Rosenblatt)

Linearly separable
Neuron j should output 0 iff neurons 1 and 2 have the same value (o1 = o2), otherwise 1:
  net_j = o1·w1j + o2·w2j
  0·w1j + 0·w2j < θ_j
  0·w1j + 1·w2j ≥ θ_j
  1·w1j + 0·w2j ≥ θ_j
  1·w1j + 1·w2j < θ_j
(figure: neuron j with threshold θ_j and weights w1j, w2j from neurons 1 and 2)

Linearly separable
– net_j = o1·w1j + o2·w2j, so o1·w1j + o2·w2j = θ_j is a line in a 2-dimensional space
– the line would have to divide the plane so that (0,1) and (1,0) lie on one side and (0,0) and (1,1) on the other – no such line exists
– the network cannot solve the problem
– a perceptron can represent only some functions
– a neural network representing the XOR function needs hidden neurons
(figure: the four input points (0,0), (0,1), (1,0), (1,1) in the o1–o2 plane)

Learning is easy
  repeat
    while input patterns remain do begin
      next input pattern;
      calculate output;
      for each j in OutputNeurons do
        if o_j <> t_j then
          if o_j = 0 then            { output = 0, but 1 expected }
            for each i in InputNeurons do w_ij := w_ij + o_i
          else if o_j = 1 then       { output = 1, but 0 expected }
            for each i in InputNeurons do w_ij := w_ij - o_i;
    end
  until desired behaviour

Exercise – Decoding
– input: binary code of a digit
– output – unary representation: as many 1-digits as the digit represents, e.g. 5 : 11111
– architecture: (figure)

Exercise – Decoding
– input: binary code of a digit
– output: classification: 0 ~ 1st neuron, 1 ~ 2nd neuron, ..., 5 ~ 6th neuron, ...
– architecture: (figure)

Exercises
1. Look at the EXCEL file of the decoding problem.
2. Implement (in PASCAL/Java) a 4-10 perceptron which transforms the binary representation of a digit (0..9) into a decimal number. Implement the learning algorithm and train the network; a sketch of such a perceptron follows below.
3. Which task can be learned faster? (unary representation or classification)
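As a starting point for exercise 2, a minimal Java sketch of such a perceptron and its learning rule, under the assumption of binary inputs and the threshold activation given above; class and variable names are illustrative, and the thresholds are kept fixed here (in practice they could be trained as an extra bias weight):

  // Minimal single-layer perceptron sketch (illustrative, not the course's reference code).
  public class Perceptron {
      private final double[][] w;      // w[i][j]: weight from input i to output j
      private final double[] theta;    // theta[j]: threshold of output neuron j

      public Perceptron(int inputs, int outputs) {
          w = new double[inputs][outputs];
          theta = new double[outputs];
      }

      // o_j = 1 if net_j = sum_i o_i * w_ij >= theta_j, else 0
      public int[] output(int[] in) {
          int[] out = new int[theta.length];
          for (int j = 0; j < out.length; j++) {
              double net = 0.0;
              for (int i = 0; i < in.length; i++) net += in[i] * w[i][j];
              out[j] = net >= theta[j] ? 1 : 0;
          }
          return out;
      }

      // one learning step: w_ij := w_ij + o_i if output is 0 but 1 was expected,
      //                    w_ij := w_ij - o_i if output is 1 but 0 was expected
      public void learn(int[] in, int[] teach) {
          int[] out = output(in);
          for (int j = 0; j < out.length; j++) {
              if (out[j] == teach[j]) continue;
              for (int i = 0; i < in.length; i++) {
                  w[i][j] += (teach[j] == 1 ? in[i] : -in[i]);
              }
          }
      }
  }

For exercise 2 such a network would be created as new Perceptron(4, 10) and repeatedly presented with the ten pairs of binary input pattern and 1-out-of-10 teaching output until the desired behaviour is reached.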
Exercises
5. Develop a perceptron for the recognition of the digits 0..9 (pixel representation), input layer: 3x7 input neurons. Use the SNNS or JavaNNS.
6. Can we recognise numbers greater than 9 as well?
7. Develop a perceptron for the recognition of capital letters (input layer 5x7).

Multi-layer Perceptron
cancels the limits of a perceptron:
– several trainable layers
– a two-layer perceptron can classify convex polygons
– a three-layer perceptron can classify any sets
multi-layer perceptron = feed-forward network = backpropagation network
(figures: multi-layer feed-forward networks)

Evaluation of the net output in a feed-forward network
– training pattern p
– input layer: neurons N_i with output o_i = p_i
– hidden layer(s): neurons N_j with net input net_j and output o_j = act_j
– output layer: neurons N_k with net input net_k and output o_k = act_k

Backpropagation Learning Algorithm
– supervised learning
– the error is a function of the weights w_i: E(W) = E(w1, w2, ..., wn)
– we are looking for a minimal error
– minimal error = hollow in the error surface
– backpropagation uses the gradient for weight adaptation
(figure: error surface over weight1 and weight2)

Problem
– error in the output layer: the difference between output and teaching output
– error in a hidden layer?
(figure: input layer, hidden layer, output layer, teaching output)

Gradient descent
– gradient: a vector orthogonal to a surface, pointing in the direction of the strongest slope
– the derivative of a function in a certain direction is the projection of the gradient onto this direction
(figure: example of an error curve of a weight w_i)

Example: Newton approximation
– calculation of a square root: f(x) = x² – a, here f(x) = x² – 5
– tan α = f'(x) = 2x and tan α = f(x) / (x – x'), hence x' = ½·(x + a/x)
– x = 2
– x' = ½·(2 + 5/2) = 2.25
– x'' = ½·(2.25 + 5/2.25) ≈ 2.2361

Backpropagation – Learning
– gradient-descent algorithm
– supervised learning: an error signal is used for weight adaptation
– error signal:
  – teaching output minus calculated output, if j is an output neuron
  – weighted sum of the error signals of the successor neurons, if j is a hidden neuron
– weight adaptation: w'_ij = w_ij + η·o_i·δ_j
  – η: learning rate
  – δ_j: error signal

Standard backpropagation rule
– gradient descent requires the derivative of the activation function
– logistic function: f_Logistic(x) = 1 / (1 + e^(–x))
  f'_act(net_j) = f_act(net_j)·(1 – f_act(net_j)) = o_j·(1 – o_j)
– the error signal δ_j is therefore:
  δ_j = o_j·(1 – o_j)·Σ_k (δ_k·w_jk)   if j is a hidden neuron
  δ_j = o_j·(1 – o_j)·(t_j – o_j)      if j is an output neuron
– weight adaptation: w'_ij = w_ij + η·o_i·δ_j
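A compact Java sketch of this rule for a small 2-2-1 network (logistic activation; bias/threshold terms and the training loop are omitted, and all names and sizes are illustrative assumptions):

  // Sketch of the standard backpropagation rule for one hidden layer.
  public class BackpropSketch {
      static double logistic(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

      // weights should be initialised with small random values before training
      double[][] wIH = new double[2][2];   // weights input -> hidden
      double[] wHO = new double[2];        // weights hidden -> output
      double eta = 0.5;                    // learning rate

      double[] hidden = new double[2];
      double out;

      double forward(double[] in) {
          for (int j = 0; j < hidden.length; j++) {
              double net = 0.0;
              for (int i = 0; i < in.length; i++) net += in[i] * wIH[i][j];
              hidden[j] = logistic(net);
          }
          double net = 0.0;
          for (int j = 0; j < hidden.length; j++) net += hidden[j] * wHO[j];
          out = logistic(net);
          return out;
      }

      void backward(double[] in, double teach) {
          // output neuron: delta = o(1-o)(t-o)
          double deltaOut = out * (1 - out) * (teach - out);
          // hidden neurons: delta_j = o_j(1-o_j) * sum_k delta_k w_jk
          double[] deltaHid = new double[hidden.length];
          for (int j = 0; j < hidden.length; j++)
              deltaHid[j] = hidden[j] * (1 - hidden[j]) * deltaOut * wHO[j];
          // weight adaptation: w'_ij = w_ij + eta * o_i * delta_j
          for (int j = 0; j < hidden.length; j++) wHO[j] += eta * hidden[j] * deltaOut;
          for (int i = 0; i < in.length; i++)
              for (int j = 0; j < hidden.length; j++) wIH[i][j] += eta * in[i] * deltaHid[j];
      }
  }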
Backpropagation
– Examples: XOR (Excel), Bank Customer

Backpropagation – Problems
(figure: three problematic error-curve shapes A, B, C)
– A: flat plateau
  – weight adaptation is slow
  – finding a minimum takes a lot of time
– B: oscillation in a narrow gorge
  – the algorithm jumps from one side to the other and back
– C: leaving a minimum
  – if the modification in one training step is too high, the minimum can be lost

Solutions: looking at the values
– change the parameters of the logistic function in order to get other values
– the modification of a weight depends on the output: if o_i = 0, no modification takes place
– if we use binary input we probably have a lot of zero values: change [0,1] into [-½, ½] or [-1,1]
– use another activation function, e.g. tanh, and use [-1,1] values

Solution: Quickprop
– assumption: the error curve is a square (quadratic) function
– calculate the vertex of the parabola:
  Δw_ij(t) = ( S(t) / (S(t-1) – S(t)) ) · Δw_ij(t-1)
– slope of the error curve: S(t) = ∂E / ∂w_ij(t)

Resilient Propagation (RPROP)
– sign and size of the weight modification are calculated separately
– b_ij(t) – size of the modification:
  b_ij(t) = η⁺ · b_ij(t-1)   if S(t-1)·S(t) > 0
  b_ij(t) = η⁻ · b_ij(t-1)   if S(t-1)·S(t) < 0
  b_ij(t) = b_ij(t-1)        otherwise
  η⁺ > 1: both ascents have the same sign – a "big" step
  0 < η⁻ < 1: the ascents differ – a "smaller" step
– weight modification:
  Δw_ij(t) = –b_ij(t)              if S(t-1) > 0 and S(t) > 0
  Δw_ij(t) = +b_ij(t)              if S(t-1) < 0 and S(t) < 0
  Δw_ij(t) = –Δw_ij(t-1)           if S(t-1)·S(t) < 0 (*)
  Δw_ij(t) = –sgn(S(t)) · b_ij(t)  otherwise
  (*) S(t) is then set to 0 (S(t) := 0); at time t+1 the 4th case is applied.
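A Java sketch of this update rule for a single weight; η⁺ = 1.2 and η⁻ = 0.5 are typical but assumed values, and the computation of the slope S(t) = ∂E/∂w_ij is not shown:

  // Sketch of the RPROP update for one weight (illustrative).
  public class RpropSketch {
      static final double ETA_PLUS = 1.2, ETA_MINUS = 0.5;

      double w;               // the weight w_ij
      double b = 0.1;         // b_ij: current step size
      double prevS = 0.0;     // S(t-1): previous slope dE/dw_ij
      double prevDelta = 0.0; // previous weight change

      // one RPROP step, given the current slope s = S(t)
      void update(double s) {
          double delta;
          if (prevS * s > 0) {            // same sign: enlarge the step
              b = ETA_PLUS * b;
          } else if (prevS * s < 0) {     // sign change: shrink the step and backtrack
              b = ETA_MINUS * b;
              delta = -prevDelta;
              w += delta;
              prevDelta = delta;
              prevS = 0.0;                // (*) S(t) := 0, so the next step uses the last case
              return;
          }
          delta = -Math.signum(s) * b;    // step against the sign of the gradient
          w += delta;
          prevDelta = delta;
          prevS = s;
      }
  }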
Limits of the Learning Algorithm
– it is not a model of biological learning
  – there is no teaching output in natural learning
  – there are no such feedback paths in a natural neural network (at least none has been discovered yet)
– training an ANN is rather time consuming

Exercise – JavaNNS
– Implement a feed-forward network consisting of 2 input neurons, 2 hidden neurons and one output neuron. Train the network so that it simulates the XOR function.
– Implement a 4-2-4 network which works like the identity function (encoder-decoder network). Try other versions: 4-3-4, 8-4-8, ... What can you say about the training effort?

Pattern Recognition
(figure: input layer, 1st hidden layer, 2nd hidden layer, output layer)

Example: Pattern Recognition
JavaNNS example: Font

"font" Example
– input = 24x24 pixel array
– output layer: 75 neurons, one neuron for each character:
  – digits
  – letters (lower case, capital)
  – separators and operator characters
– two hidden layers of 4x6 neurons each
– all neurons of a row of the input layer are linked to one neuron of the first hidden layer
– all neurons of a column of the input layer are linked to one neuron of the second hidden layer

Exercise
– load the network "font_untrained"
– train the network using various learning algorithms (look at the SNNS documentation for the parameters and their meaning):
  – Backpropagation
  – Backpropagation with momentum
  – Quickprop
  – Rprop
  parameter values: 2.0  0.8  0.1  0.6  mu=0.6  c=0.1  mg=2.0  n=0.0001
– use various values for the learning parameter, momentum, and noise:
  – learning parameter: 0.2  0.3  0.5  1.0
  – momentum:           0.9  0.7  0.5  0.0
  – noise:              0.0  0.1  0.2

Example: Bank Customer
– attributes: A1: credit history, A2: debt, A3: collateral, A4: income
– the network architecture depends on the coding of input and output
– How can we code values like good, bad, 1, 2, 3, ...?

Data Pre-processing
– objectives:
  – prospects of better results
  – adaptation to algorithms
  – data reduction
  – trouble shooting
– methods:
  – selection and integration
  – completion
  – transformation: normalization, coding, filter

Selection and Integration
– unification of data (different origins)
– selection of attributes/features – reduction
– omit obviously non-relevant data:
  – all values are equal
  – key values
  – meaning not relevant
  – data protection

Completion / Cleaning
– missing values:
  – ignore / omit the attribute
  – add values: manually, a global constant ("missing value"), the average, a highly probable value
  – remove the data set
– noisy data
– inconsistent data

Transformation
– Normalization
– Coding
– Filter

Normalization of values
– normalization – equally distributed
  – in the range [0,1], e.g. for the logistic function:
    act = (x – minValue) / (maxValue – minValue)
  – in the range [-1,+1], e.g. for the activation function tanh:
    act = (x – minValue) / (maxValue – minValue) · 2 – 1
– logarithmic normalization:
  act = (ln(x) – ln(minValue)) / (ln(maxValue) – ln(minValue))

Binary Coding of nominal values I
– no order relation, n values
– n neurons, each neuron represents one and only one value
– example: red, blue, yellow, white, black
  1,0,0,0,0   0,1,0,0,0   0,0,1,0,0   ...
– disadvantage: n neurons are necessary, and there are lots of zeros in the input

Bank Customer
Are these customers good ones?
      credit history   debt   collateral   income
  1:   bad              high   adequate     3
  2:   good             low    adequate     2
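A small Java sketch combining the normalization formula and the binary (1-out-of-n) coding above to build an input vector for customer 1 of the table; the value sets of the nominal attributes and the income range 1..3 are assumptions made only for illustration:

  import java.util.Arrays;
  import java.util.List;

  // Sketch: coding nominal and numeric attributes as network input (illustrative only).
  public class CustomerCoding {
      // min-max normalization into [0,1], as used for the logistic function
      static double normalize(double x, double min, double max) {
          return (x - min) / (max - min);
      }

      // binary (one-out-of-n) coding of a nominal value: n neurons, exactly one is 1
      static double[] oneHot(String value, List<String> values) {
          double[] code = new double[values.size()];
          code[values.indexOf(value)] = 1.0;
          return code;
      }

      public static void main(String[] args) {
          List<String> history = Arrays.asList("bad", "unknown", "good");   // assumed value set
          List<String> debt = Arrays.asList("low", "high");
          List<String> collateral = Arrays.asList("none", "adequate");

          // customer 1: bad credit history, high debt, adequate collateral, income 3
          double[] h = oneHot("bad", history);
          double[] d = oneHot("high", debt);
          double[] c = oneHot("adequate", collateral);
          double income = normalize(3, 1, 3);                               // assumed range 1..3

          double[] input = new double[h.length + d.length + c.length + 1];
          System.arraycopy(h, 0, input, 0, h.length);
          System.arraycopy(d, 0, input, h.length, d.length);
          System.arraycopy(c, 0, input, h.length + d.length, c.length);
          input[input.length - 1] = income;
          System.out.println(Arrays.toString(input));
      }
  }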
Data Mining Cup 2002 – The Problem: A Mailing Action
– mailing action of a company: a special offer
– estimated annual income per customer:
                    will cancel   will not cancel
  gets an offer     43.80€        66.30€
  gets no offer      0.00€        72.00€
– given: 10,000 sets of customer data containing 1,000 cancellers (training)
– problem: a test set containing 10,000 customer data sets – Who will cancel? Whom to send an offer?

Mailing Action – Aim?
– no mailing action:                        9,000 · 72.00                  = 648,000
– everybody gets an offer:                  1,000 · 43.80 + 9,000 · 66.30  = 640,500
– maximum (100% correct classification):    1,000 · 43.80 + 9,000 · 72.00  = 691,800

Goal Function: Lift
– basis: no mailing action: 9,000 · 72.00
– goal = extra income:
  lift_M = 43.80 · c_M + 66.30 · nk_M – 72.00 · nk_M
  (c_M: cancellers in the mailing group M, nk_M: non-cancellers in M)

Data
– 32 input attributes per customer, some of them with missing values
(figure: the input data fields)

Feed-Forward Network – What to do?
– train the net with the training set (10,000 data sets)
– test the net using the test set (another 10,000)
– classify all 10,000 customers into cancellers or loyal customers
– evaluate the additional income

Results
(figure: results of the Data Mining Cup 2002 and of the neural network project 2004)
– gain: additional income by the mailing action if the target group was chosen according to the analysis

Review – Students' Project
– copy of the data mining cup: real data, known results – motivation
– contest – enthusiasm – better results
– wishes: an engineering approach to data mining, real data for teaching purposes

Data Mining Cup 2007
– started on April 10
– check-out couponing: Who will get a rebate coupon?
– 50,000 data sets for training
(figure: the data)

DMC 2007
– ~75% of the output values are "N(o)"
– i.e. a classification has to be better than 75% correct!
– first experiments: no success
– deadline: May 31st

Optimization of Neural Networks
objectives:
– good results in an application: better generalisation (improve correctness)
– faster processing of patterns (improve efficiency)
– good presentation of the results (improve comprehension)

Ability to generalize
– a trained net can classify data (out of the same class as the learning data) that it has never seen before
– the aim of every ANN development
– network too large:
  – all training patterns are learned by heart
  – no ability to generalize
– network too small:
  – the rules of pattern recognition cannot be learned (simple example: perceptron and XOR)

Development of an NN application
(flow chart:) build a network architecture → input of training patterns → calculate the network output → compare to the teaching output → if the error is too high, modify the weights and repeat; if the quality is good enough, use the test set data → evaluate the output → compare to the teaching output → if the error is too high, change parameters and repeat; if the quality is good enough, the development is finished.

Possible Changes
– architecture of the NN:
  – size of the network
  – shortcut connections
  – partially connected layers
  – remove/add links
  – receptive areas
– find the right parameter values:
  – learning parameter
  – size of layers
  – using genetic algorithms

Memory Capacity
Number of patterns a network can store without generalisation
– to figure out the memory capacity:
  – change the output layer: output layer = copy of the input layer
  – train the network with an increasing number of random patterns
  – the error becomes small: the network stores all patterns
  – the error remains: the network cannot store all patterns
  – in between: the memory capacity

Memory Capacity – Experiment
– the output layer is a copy of the input layer
– training set consisting of n random patterns
– error:
  – error = 0: the network can store more than n patterns
  – error >> 0: the network cannot store n patterns
– memory capacity n: error > 0, while error = 0 for n-1 patterns and error >> 0 for n+1 patterns

Layers Not Fully Connected
– partially connected (e.g. 75%)
– remove links if the weight has been near 0 for several training steps
– build new connections (at random)
(figure: connections – new, removed, remaining)

Summary
– feed-forward network
– perceptron (has limits)
– learning is math
– backpropagation is a backpropagation-of-error algorithm – it works like gradient descent
– activation functions: logistic, tanh
– application in data mining, pattern recognition – data preparation is important
– finding an appropriate architecture
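As a closing worked example, a Java sketch of the DMC 2002 goal function (lift) from the mailing-action slides above; the two calls reproduce the figures 691,800 – 648,000 and 640,500 – 648,000 from the aim slide:

  // Sketch: extra income (lift) of a mailing action compared to no mailing.
  // cM = cancellers in the mailing group, nkM = non-cancellers in the mailing group.
  public class MailingLift {
      static double lift(int cM, int nkM) {
          return 43.80 * cM + 66.30 * nkM - 72.00 * nkM;
      }

      public static void main(String[] args) {
          // perfect targeting: all 1,000 cancellers get the offer, no loyal customer does
          System.out.printf("%.2f%n", lift(1000, 0));      // 43800.00 extra income
          // mailing to everybody: 1,000 cancellers and 9,000 loyal customers get the offer
          System.out.printf("%.2f%n", lift(1000, 9000));   // -7500.00, i.e. a loss
      }
  }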