Neural Networks

Business School
Institute of Business Informatics
Supervised Learning
Uwe Lämmel
www.wi.hs-wismar.de/~laemmel
U.laemmel@wi.hs-wismar.de
Neural Networks
– Idea
– Artificial Neuron & Network
– Supervised Learning
– Unsupervised Learning
– Data Mining – Other Techniques
Supervised Learning
Feed-Forward Networks
– Perceptron – Adaline – TLU
– Multi-layer networks
– Backpropagation algorithm
– Pattern recognition
– Data preparation
Examples
– Bank Customer
– Customer Relationship
Connections
– Feed-forward
– Input layer
– Hidden layer
– Output layer
– Feed-back / autoassociative
– From (output) layer back to previous (hidden/input) layer
– All neurons fully connected to each other
  → Hopfield network
Perceptron – Adaline – TLU
– one layer of trainable links only
– Adaline: adaptive linear element
– TLU: threshold linear unit
– a class of neural networks with a special architecture:
...
Papert, Minsky and Perceptron - History
"Once upon a time two daughter sciences were born to the new science
of cybernetics.
One sister was natural, with features inherited from the study of the
brain, from the way nature does things.
The other was artificial, related from the beginning to the use of
computers.
…
But Snow White was not dead.
What Minsky and Papert had shown the world as proof was not the
heart of the princess; it was the heart of a pig."
Seymour Papert, 1988
Perception
– perception: the first step of recognition, becoming aware of something via the senses
[Figure: picture → mapping layer (fixed 1-1 links) → output layer (trainable, fully connected links)]
Perceptron
– Input layer
  – binary input, passed through
  – no trainable links
– Propagation function: netj = Σi oi·wij
– Activation function:
  oj = aj = 1 if netj ≥ θj, 0 otherwise
A perceptron can learn, in finite time, every function it can represent.
(perceptron convergence theorem, F. Rosenblatt)
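A minimal sketch of this propagation and activation rule in Java (the class name, the weight array and the threshold handling are illustrative assumptions, not taken from the slides):

    // One perceptron output unit j: net_j = sum_i o_i * w_ij,
    // output 1 if net_j >= theta_j, otherwise 0.
    public class PerceptronUnit {
        private final double[] w;     // weights w_ij from every input neuron i
        private final double theta;   // threshold theta_j

        public PerceptronUnit(double[] w, double theta) {
            this.w = w;
            this.theta = theta;
        }

        /** Propagation function: weighted sum of the binary inputs. */
        public double net(int[] o) {
            double net = 0.0;
            for (int i = 0; i < w.length; i++) net += o[i] * w[i];
            return net;
        }

        /** Activation function: binary threshold. */
        public int output(int[] o) {
            return net(o) >= theta ? 1 : 0;
        }
    }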
Linearly Separable
Neuron j should be 0 iff neurons 1 and 2 have the same value (o1 = o2), otherwise 1:
  netj = o1·w1j + o2·w2j
  (1)  0·w1j + 0·w2j < θj
  (2)  0·w1j + 1·w2j ≥ θj
  (3)  1·w1j + 0·w2j ≥ θj
  (4)  1·w1j + 1·w2j < θj
Linearly Separable
[Figure: neuron j with inputs 1 and 2, weights w1j and w2j, threshold θj; the o1–o2 plane with the points (0,0), (0,1), (1,0), (1,1)]
– netj = o1·w1j + o2·w2j
– o1·w1j + o2·w2j = θj is a line in the 2-dimensional o1–o2 plane
– the line would have to divide the plane so that (0,1) and (1,0) lie on a different side than (0,0) and (1,1)
– no such line exists: the network cannot solve the problem
– a perceptron can represent only some functions
– ⇒ a neural network representing the XOR function needs hidden neurons
Learning is easy
repeat
  for each input pattern do begin
    calculate output;
    for each j in OutputNeurons do
      if oj <> tj then
        if oj = 0 then        { output = 0, but 1 expected }
          for each i in InputNeurons do
            wij := wij + oi
        else                  { output = 1, but 0 expected }
          for each i in InputNeurons do
            wij := wij - oi;
  end
until desired behaviour
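The same rule as a compact Java sketch (the class name, the data layout and the fixed thresholds are illustrative assumptions; repeat trainEpoch until it returns 0, i.e. until the desired behaviour is reached):

    // Perceptron learning rule from the pseudocode above:
    // binary inputs/outputs, weights w[i][j], fixed thresholds theta[j].
    public class PerceptronTrainer {
        final double[][] w;     // w[i][j]: weight from input i to output neuron j
        final double[] theta;   // theta[j]: threshold of output neuron j

        PerceptronTrainer(int nIn, int nOut, double threshold) {
            w = new double[nIn][nOut];            // weights start at 0
            theta = new double[nOut];
            java.util.Arrays.fill(theta, threshold);
        }

        int output(int[] in, int j) {
            double net = 0.0;
            for (int i = 0; i < in.length; i++) net += in[i] * w[i][j];
            return net >= theta[j] ? 1 : 0;
        }

        /** One pass over all patterns; returns the number of wrong outputs. */
        int trainEpoch(int[][] patterns, int[][] targets) {
            int errors = 0;
            for (int p = 0; p < patterns.length; p++) {
                for (int j = 0; j < theta.length; j++) {
                    int o = output(patterns[p], j);
                    int t = targets[p][j];
                    if (o != t) {
                        errors++;
                        // t - o is +1 if 1 was expected but 0 produced,
                        // and -1 in the opposite case: both branches of the pseudocode
                        for (int i = 0; i < patterns[p].length; i++)
                            w[i][j] += (t - o) * patterns[p][i];
                    }
                }
            }
            return errors;
        }
    }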
Exercise
– Decoding
– input: binary code of a digit
– output: unary representation:
  as many 1-digits as the digit represents, e.g. 5 : 11111
– architecture:
Exercise
– Decoding
– input: Binary code of a digit
– output: classification:
0~ 1st Neuron, 1~ 2nd Neuron, ... 5~ 6th Neuron,
...
– architecture:
Exercises
1. Look at the EXCEL-file of the decoding problem
2. Implement (in PASCAL/Java) a 4-10 perceptron which transforms the binary
   representation of a digit (0..9) into a decimal number.
   Implement the learning algorithm and train the network.
3. Which task can be learned faster?
(Unary representation or classification)
Exercises
5. Develop a perceptron for the
recognition of digits 0..9. (pixel
representation)
input layer: 3x7-input neurons
Use the SNNS or JavaNNS
6. Can we recognize numbers greater
than 9 as well?
7. Develop a perceptron for the
recognition of capital letters. (input
layer 5x7)
Multi-Layer Perceptron
Overcomes the limits of a single-layer perceptron:
– several trainable layers
– a two-layer perceptron can classify convex polygons
– a three-layer perceptron can classify arbitrary sets
[Figure: multi-layer perceptron]
Multi-layer feed-forward network
= feed-forward network
= backpropagation network
Feed-Forward Network
Training
Evaluation of the net output in a feed-forward network for a pattern p:
[Figure: input layer neuron Ni with oi = pi → netj → hidden layer neuron Nj with oj = actj → netk → output layer neuron Nk with ok = actk]
Backpropagation-Learning Algorithm
– supervised learning
– the error is a function of the weights wi:
  E(W) = E(w1, w2, ..., wn)
– we are looking for a minimal error
– minimal error = a hollow in the error surface
– backpropagation uses the gradient of E for the weight adaptation (see the formula below)
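In symbols (standard gradient descent, anticipating the delta rule on the following slides):

    \Delta w_{ij} = -\eta \, \frac{\partial E(W)}{\partial w_{ij}}, \qquad
    w_{ij}' = w_{ij} + \Delta w_{ij}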
[Figure: error surface over two weights, weight1 and weight2]
Problem
[Figure: network with input layer, hidden layer, output layer, and the teaching output]
– error in the output layer:
  – difference between output and teaching output
– error in a hidden layer?
Gradient descent
– gradient:
  – vector orthogonal to the level curves of the error surface, pointing in the direction of the steepest slope
[Figure: example of the error curve of a single weight wi]
– the derivative of a function in a certain direction is the projection of the gradient onto this direction

Example: Newton Approximation
[Figure: tangent to f(x) at x with angle α, crossing the x-axis at x′]
– calculation of the square root: f(x) = x² − a, here f(x) = x² − 5
– tan α = f′(x) = 2x
– tan α = f(x) / (x − x′)
– ⇒ x′ = ½(x + a/x)
– x = 2
– x′ = ½(x + 5/x) = 2.25
– x″ = ½(x′ + 5/x′) ≈ 2.2361
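The iteration x′ = ½(x + a/x) in a few lines of Java (purely illustrative; the slide computes √5 by hand):

    // Newton approximation of the square root of a:
    // f(x) = x^2 - a, x' = x - f(x)/f'(x) = (x + a/x) / 2
    public class NewtonSqrt {
        public static void main(String[] args) {
            double a = 5.0;
            double x = 2.0;                         // starting value as on the slide
            for (int step = 1; step <= 5; step++) {
                x = 0.5 * (x + a / x);
                System.out.printf("step %d: x = %.6f%n", step, x);
            }
            // prints 2.250000, 2.236111, 2.236068, ... -> sqrt(5)
        }
    }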
Backpropagation - Learning
– gradient-descent algorithm
– supervised learning: an error signal is used for the weight adaptation
– error signal δj:
  – teaching output − calculated output, if j is an output neuron
  – weighted sum of the error signals of the successor neurons, if j is a hidden neuron
– weight adaptation:
  w′ij = wij + η·oi·δj
  – η: learning rate
  – δj: error signal
Standard Backpropagation Rule
– gradient descent needs the derivative of the activation function
– logistic function: fact(x) = 1 / (1 + e^(−x))
– f′act(netj) = fact(netj)·(1 − fact(netj)) = oj·(1 − oj)
– the error signal δj is therefore:
  δj = oj·(1 − oj)·Σk δk·wjk        if j is a hidden neuron
  δj = oj·(1 − oj)·(tj − oj)        if j is an output neuron
– weight adaptation: w′ij = wij + η·oi·δj
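A self-contained Java sketch of this rule, training a 2-2-1 network on XOR with online updates (the layer sizes, learning rate η = 0.5, number of epochs and random seed are illustrative choices, not prescribed by the slides):

    import java.util.Random;

    // Standard backpropagation with logistic units on the XOR example.
    public class XorBackprop {
        static double logistic(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        public static void main(String[] args) {
            double[][] in = {{0,0},{0,1},{1,0},{1,1}};
            double[] teach = {0, 1, 1, 0};
            Random rnd = new Random(42);

            double[][] wHid = new double[2][2];   // wHid[i][j]: input i -> hidden j
            double[] bHid = new double[2];        // hidden biases (thresholds)
            double[] wOut = new double[2];        // hidden j -> output
            double bOut = 0;
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++) wHid[i][j] = rnd.nextDouble() - 0.5;
            for (int j = 0; j < 2; j++) wOut[j] = rnd.nextDouble() - 0.5;

            double eta = 0.5;                     // learning rate
            for (int epoch = 0; epoch < 20000; epoch++) {
                for (int p = 0; p < in.length; p++) {
                    // forward pass
                    double[] oHid = new double[2];
                    for (int j = 0; j < 2; j++)
                        oHid[j] = logistic(bHid[j] + in[p][0]*wHid[0][j] + in[p][1]*wHid[1][j]);
                    double net = bOut;
                    for (int j = 0; j < 2; j++) net += oHid[j] * wOut[j];
                    double o = logistic(net);

                    // error signals: delta = o(1-o)(t-o) for the output neuron,
                    // delta_j = o_j(1-o_j) * sum_k delta_k w_jk for hidden neurons
                    double deltaOut = o * (1 - o) * (teach[p] - o);
                    double[] deltaHid = new double[2];
                    for (int j = 0; j < 2; j++)
                        deltaHid[j] = oHid[j] * (1 - oHid[j]) * deltaOut * wOut[j];

                    // weight adaptation: w' = w + eta * o_i * delta_j
                    for (int j = 0; j < 2; j++) wOut[j] += eta * oHid[j] * deltaOut;
                    bOut += eta * deltaOut;
                    for (int i = 0; i < 2; i++)
                        for (int j = 0; j < 2; j++) wHid[i][j] += eta * in[p][i] * deltaHid[j];
                    for (int j = 0; j < 2; j++) bHid[j] += eta * deltaHid[j];
                }
            }
            for (double[] x : in) {
                double[] oHid = new double[2];
                for (int j = 0; j < 2; j++)
                    oHid[j] = logistic(bHid[j] + x[0]*wHid[0][j] + x[1]*wHid[1][j]);
                double o = logistic(bOut + oHid[0]*wOut[0] + oHid[1]*wOut[1]);
                System.out.printf("%.0f XOR %.0f -> %.3f%n", x[0], x[1], o);
            }
        }
    }

With an unlucky initialisation a 2-2-1 net can end up in a flat plateau or a local minimum (see the problem slides below); re-initialising the weights or using a third hidden neuron usually helps.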
Backpropagation
– Examples:
– XOR (Excel)
– Bank Customer
Backpropagation - Problems
[Figure: error surface with three problem regions A, B, and C]
Backpropagation-Problems
– A: flat plateau
– weight adaptation is slow
– finding a minimum takes a lot of time
– B: Oscillation in a narrow gorge
– it jumps from one side to the other and back
– C: leaving a minimum
– if the modification in one training step is too high, the minimum can be lost
Solutions: looking at the values
– change the parameters of the logistic function in order to get other values
– the modification of a weight depends on the output:
  if oi = 0, no modification takes place
– if we use binary input we probably have a lot of zero values: change [0,1] into [−½, ½] or [−1,1]
– use another activation function, e.g. tanh, and use values in [−1, 1]
Solution: Quickprop
– assumption: the error curve is a quadratic (square) function
– calculate the vertex of this parabola:
  Δwij(t) = S(t) / (S(t−1) − S(t)) · Δwij(t−1)
– slope of the error curve:
  S(t) = ∂E / ∂wij(t)
[Figure: parabola approximating the error curve]
Resilient Propagation (RPROP)
– sign and size of the weight modification are calculated separately:
  bij(t) – size of the modification:
    bij(t) = η⁺·bij(t−1)   if S(t−1)·S(t) > 0
           = η⁻·bij(t−1)   if S(t−1)·S(t) < 0
           = bij(t−1)      otherwise
  η⁺ > 1:     both slopes have the same sign ⇒ "big" step
  0 < η⁻ < 1: the slopes differ ⇒ "smaller" step
  weight modification:
    Δwij(t) = −bij(t)           if S(t−1) > 0 ∧ S(t) > 0
            = +bij(t)           if S(t−1) < 0 ∧ S(t) < 0
            = −Δwij(t−1)        if S(t−1)·S(t) < 0   (*)
            = −sgn(S(t))·bij(t) otherwise
  (*) S(t) is set to 0 (S(t) := 0); at time t+1 the 4th case will be applied.
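A sketch of one RPROP step for a single weight in Java; the constants η⁺ = 1.2, η⁻ = 0.5 and the initial step size 0.1 are common defaults, not values given on the slide:

    // One RPROP update for a single weight.
    // s is the current slope S(t) = dE/dw_ij; st stores S(t-1), b_ij and dw_ij(t-1).
    public class RpropStep {
        static final double ETA_PLUS = 1.2, ETA_MINUS = 0.5;

        static class State { double sPrev = 0.0, b = 0.1, dwPrev = 0.0; }

        /** Returns the weight change dw_ij(t) and updates the state. */
        static double step(State st, double s) {
            double prod = st.sPrev * s;
            if (prod > 0) {                      // same sign: enlarge the step size
                st.b *= ETA_PLUS;
            } else if (prod < 0) {               // sign change: shrink step, backtrack
                st.b *= ETA_MINUS;
                double dw = -st.dwPrev;          // revert the previous change
                st.sPrev = 0.0;                  // (*) so that the 4th case applies at t+1
                st.dwPrev = dw;
                return dw;
            }
            double dw = -Math.signum(s) * st.b;  // step against the slope (cases 1, 2, 4)
            st.sPrev = s;
            st.dwPrev = dw;
            return dw;
        }
    }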
Limits of the Learning Algorithm
– it is not a model for biological learning
– no teaching output in natural learning
– no such error feedback in a natural neural network
  (at least none has been discovered yet)
– training of an ANN is rather time consuming
Exercise - JavaNNS
– Implement a feed-forward network consisting of 2 input neurons, 2 hidden
  neurons, and one output neuron.
  Train the network so that it simulates the XOR function.
– Implement a 4-2-4-network, which works like the
identity function. (Encoder-Decoder-Network).
Try other versions: 4-3-4, 8-4-8, ...
What can you say about the training effort?
Pattern Recognition
[Figure: feed-forward network with an input layer, a 1st hidden layer, a 2nd hidden layer, and an output layer]
Example: Pattern Recognition
JavaNNS example: Font
„font“ Example
– input: 24x24 pixel array
– output layer: 75 neurons, one neuron for each character:
  – digits
  – letters (lower case, capital)
  – separators and operator characters
– two hidden layers of 4x6 neurons each
  – all neurons of a row of the input layer are linked to one neuron of the first hidden layer
  – all neurons of a column of the input layer are linked to one neuron of the second hidden layer
Exercise
– load the network “font_untrained”
– train the network, use various learning algorithms
  (look at the SNNS documentation for the parameters and their meaning):
  – Backpropagation                (η = 2.0)
  – Backpropagation with momentum  (η = 0.8, mu = 0.6, c = 0.1)
  – Quickprop                      (η = 0.1, mg = 2.0, n = 0.0001)
  – Rprop                          (η = 0.6)
– use various values for learning parameter, momentum, and noise:
  – learning parameter: 0.2, 0.3, 0.5, 1.0
  – momentum:           0.9, 0.7, 0.5, 0.0
  – noise:              0.0, 0.1, 0.2
Example: Bank Customer
A1: Credit history
A2: debt
A3: collateral
A4: income
• network architecture depends on the coding of input and output
• How can we code values like good, bad, 1, 2, 3, ...?
Data Pre-processing
– objectives
  – prospects of better results
  – adaptation to algorithms
  – data reduction
  – trouble shooting
– methods
– selection and integration
– completion
– transformation
– normalization
– coding
– filter
Selection and Integration
– unification of data (different origins)
– selection of attributes/features
– reduction
– omit obviously non-relevant data
– all values are equal
– key values
– meaning not relevant
– data protection
Completion / Cleaning
– Missing values
– ignore / omit attribute
– add values
– manual
– global constant („missing value“)
– average
– highly probable value
– remove data set
– noisy data
– inconsistent data
Transformation
– Normalization
– Coding
– Filter
Normalization of values
– Normalization – equally distributed
– in the range [0,1]
– e.g. for the logistic function
act = (x-minValue) / (maxValue - minValue)
– in the range [-1,+1]
– e.g. for activation function tanh
act = (x-minValue) / (maxValue - minValue)*2-1
– logarithmic normalization
– act = (ln(x) - ln(minValue)) / (ln(maxValue)-ln(minValue))
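The three variants as small Java helpers (class and method names are illustrative):

    // Min-max and logarithmic normalization as defined above.
    public final class Normalize {
        /** to [0,1], e.g. for the logistic activation function */
        static double toUnit(double x, double min, double max) {
            return (x - min) / (max - min);
        }
        /** to [-1,+1], e.g. for tanh */
        static double toSym(double x, double min, double max) {
            return (x - min) / (max - min) * 2.0 - 1.0;
        }
        /** logarithmic normalization (requires x, min, max > 0) */
        static double toLog(double x, double min, double max) {
            return (Math.log(x) - Math.log(min)) / (Math.log(max) - Math.log(min));
        }
    }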
Binary Coding of nominal values I
– no order relation, n values
– n neurons, each neuron represents one and only one value
– example: red, blue, yellow, white, black
  1,0,0,0,0   0,1,0,0,0   0,0,1,0,0   ...
– disadvantage:
  n neurons necessary ⇒ lots of zeros in the input
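A small Java sketch of this 1-of-n coding (names are illustrative):

    import java.util.List;

    public final class OneHot {
        /** encode("yellow", List.of("red","blue","yellow","white","black")) -> 0,0,1,0,0 */
        static int[] encode(String value, List<String> values) {
            int[] code = new int[values.size()];   // n neurons, mostly zeros
            int idx = values.indexOf(value);
            if (idx >= 0) code[idx] = 1;           // exactly one neuron is active
            return code;
        }
    }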
Bank Customer
Are these customers good ones?

     credit history   debt   collateral   income
  1: bad              high   adequate     3
  2: good             low    adequate     2
Data Mining Cup 2002
The Problem: A Mailing Action
– mailing action of a company: a special offer
– estimated annual income per customer:

                   will cancel   will not cancel
  gets an offer      43.80 €         66.30 €
  gets no offer       0.00 €         72.00 €

– given:
  – 10,000 sets of customer data, containing 1,000 cancellers (training)
– problem:
  – test set containing 10,000 customer data sets
  – Who will cancel? Whom to send an offer?
Mailing Action – Aim?
                   will cancel   will not cancel
  gets an offer      43.80 €         66.30 €
  gets no offer       0.00 €         72.00 €

– no mailing action:
  – 9,000 × 72.00 = 648,000
– everybody gets an offer:
  – 1,000 × 43.80 + 9,000 × 66.30 = 640,500
– maximum (100% correct classification):
  – 1,000 × 43.80 + 9,000 × 72.00 = 691,800
Goal Function: Lift
                   will cancel   will not cancel
  gets an offer      43.80 €         66.30 €
  gets no offer       0.00 €         72.00 €

basis: no mailing action: 9,000 · 72.00
goal = extra income:
  liftM = 43.80·cM + 66.30·nkM − 72.00·nkM
  (cM: addressed customers who will cancel, nkM: addressed customers who will not cancel)
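Since a mailed non-canceller brings 66.30 € instead of 72.00 €, the lift simplifies; the break-even condition is a derived remark, not taken from the slide:

    \mathrm{lift}_M = 43.80\,c_M + 66.30\,nk_M - 72.00\,nk_M = 43.80\,c_M - 5.70\,nk_M

    43.80\,p - 5.70\,(1-p) > 0 \;\Longleftrightarrow\; p > \tfrac{5.70}{49.50} \approx 0.115

So sending the offer pays off for a customer whose estimated cancellation probability exceeds roughly 11.5 %.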
Data
[Figure: excerpt of the customer data: 32 input attributes, the important result columns, and visible missing values]
Feed Forward Network – What to do?
– train the net with the training set (10,000 customers)
– test the net using the test set (another 10,000)
– classify all 10,000 customers as canceller or loyal
– evaluate the additional income
Results
[Figure: results of the data mining cup 2002 and of the neural network project 2004]
– gain:
  – additional income from the mailing action if the target group is chosen according to the analysis
Review Students Project
– copy of the data mining cup
  – real data
  – known results
  ⇒ motivation
– contest
  ⇒ enthusiasm
  ⇒ better results
• wishes
  – engineering approach to data mining
  – real data for teaching purposes
Data Mining Cup 2007
– started on April 10
– check-out couponing
– Who will get a rebate coupon?
– 50,000 data sets for training
Data
DMC2007
– ~75% of the data sets have output = N(o)
– i.e. a useful classification has to be better than 75% correct!
– first experiments: no success
– deadline: May 31st
Optimization of Neural Networks
objectives
– good results in an application: better generalisation (improve correctness)
– faster processing of patterns (improve efficiency)
– good presentation of the results (improve comprehension)
Ability to generalize
• a trained net can classify data
(out of the same class as the learning data)
that it has never seen before
– aim of every ANN development
– network too large:
  – all training patterns are simply memorised ("learned by heart")
  – no ability to generalize
– network too small:
  – the rules of pattern recognition cannot be learned
    (simple example: perceptron and XOR)
Development of an NN-application
[Flowchart: build a network architecture → input of a training pattern → calculate the network output → compare to the teaching output;
 if the error is too high: modify the weights and continue training;
 if the quality is good enough: use the test set data → evaluate the output, compare to the teaching output;
 if the error is too high: change parameters / the architecture and train again;
 if the quality is good enough: the development is finished]
Possible Changes
– Architecture of the NN
  – size of the network
  – shortcut connections
  – partially connected layers
  – remove/add links
  – receptive areas
– Find the right parameter values
  – learning parameter
  – size of layers
  – using genetic algorithms
Memory Capacity
Number of patterns a network can store without generalisation.
– figure out the memory capacity:
  – change the output layer: output layer = copy of the input layer
  – train the network with an increasing number of random patterns:
    – error becomes small: the network stores all patterns
    – error remains: the network cannot store all patterns
    – in between: the memory capacity
Memory Capacity - Experiment
– the output layer is a copy of the input layer
– training set consisting of n random patterns
– error:
  – error = 0: the network can store more than n patterns
  – error >> 0: the network cannot store n patterns
– memory capacity n: error > 0, while the error is 0 for n−1 patterns and >> 0 for n+1 patterns
Layers Not fully Connected
[Figure: partially connected layers with new, removed, and remaining connections]
– partially connected (e.g. 75%)
– remove links if the weight has stayed near 0 for several training steps
– build new connections (at random)
Summary
– Feed-forward network
– Perceptron (has limits)
– Learning is Math
– Backpropagation is a backpropagation-of-error algorithm
  – works like gradient descent
– Activation functions: logistic, tanh
– Application in Data Mining, Pattern Recognition
– data preparation is important
– Finding an appropriate Architecture