Machine Learning and Neural Networks
Professor Tony Martinez
Computer Science Department
Brigham Young University
http://axon.cs.byu.edu/~martinez
Tutorial Overview

Introduction and Motivation
Neural Network Model Descriptions
– Perceptron
– Backpropagation
Issues
– Overfitting
– Applications
Other Models
– Decision Trees, Nearest Neighbor/IBL, Genetic Algorithms, Rule Induction, Ensembles
More Information

You can download this presentation from:
ftp://axon.cs.byu.edu/pub/papers/NNML.ppt

An excellent introductory text on Machine Learning:
Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
What is Inductive Learning?

Gather a set of input-output examples from some application: the Training Set (e.g., speech recognition, financial forecasting)
Train the learning model (neural network, etc.) on the training set until it solves it well
The goal is to generalize on novel data not yet seen
Gather a further set of input-output examples from the same application: the Test Set
Use the learning system on actual data
Motivation

Costs and errors in programming
Our inability to program "subjective" problems
A general, easy-to-use mechanism for a large set of applications
Improvement in application accuracy - empirical
Example Application - Heart Attack Diagnosis

The patient has a set of symptoms - age, type of pain, heart rate, blood pressure, temperature, etc.
Given these symptoms in an Emergency Room setting, a doctor must diagnose whether a heart attack has occurred.
How do you train a machine learning model to solve this problem using the inductive learning model?
– Consistent approach
– Knowledge of the ML approach is not critical
– Need to select a reasonable set of input features
Examples and Discussion

Loan Underwriting
– Which input features (data)?
– Divide into Training Set and Test Set
– Choose a learning model
– Train the model on the Training Set
– Predict accuracy with the Test Set
– How to generalize better?
   Different input features
   Different learning model
– Issues
   Intuition vs. prejudice
   Social response
UC Irvine Machine Learning Data Base
Iris Data Set
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
Voting Records Data Base
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?
republican,n,y,n,y,y,n,n,n,n,n,?,?,y,y,n,n
republican,n,y,n,y,y,y,n,n,n,n,y,?,y,y,?,?
democrat,n,y,y,n,n,n,y,y,y,n,n,n,y,n,?,?
democrat,y,y,y,n,n,y,y,y,?,y,y,?,n,n,y,?
republican,n,y,n,y,y,y,n,n,n,n,n,y,?,?,n,?
republican,n,y,n,y,y,y,n,n,n,y,n,y,y,?,n,?
democrat,y,n,y,n,n,y,n,y,?,y,y,y,?,n,n,y
democrat,y,?,y,n,n,n,y,y,y,n,n,n,y,n,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,?,y,y,n,n
Machine Learning Sketch History

Neural Networks - Connectionist - Biological Plausibility
– Late 50's, early 60's, Rosenblatt, Perceptron
– Minsky & Papert 1969 - The Lull, symbolic expansion
– Late 80's - Backpropagation, Hopfield, etc. - The explosion
Machine Learning - Artificial Intelligence - Symbolic - Psychological Plausibility
– Samuel (1959) - Checkers evaluation strategies
– 1970's and on - ID3, Instance Based Learning, Rule Induction, …
– Currently - symbolic and connectionist approaches are lumped together under ML
Genetic Algorithms - 1970's
– Originally lumped in with the connectionist approaches
– Now an exploding area - Evolutionary Algorithms
Inductive Learning - Supervised

Assume a set T of examples of the form (x, y), where x is a vector of features/attributes and y is a scalar or vector output
By examining the examples, postulate a hypothesis H(x) => y for arbitrary x
Spectrum of Supervised Algorithms
– Unsupervised Learning
– Reinforcement Learning
Other Machine Learning Areas

Case Based Reasoning
Analogical Reasoning
Speed-up Learning
Inductive Learning is the most studied and successful to date
Data Mining
COLT - Computational Learning Theory
Perceptron Node – Threshold Logic Unit

[Figure: inputs x1, x2, ..., xn with weights w1, w2, ..., wn feeding a single threshold node with output Z]

Z = 1 if Σ(i=1..n) xi wi ≥ θ
Z = 0 if Σ(i=1..n) xi wi < θ
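As a minimal sketch (not from the original slides), the threshold logic unit above can be written directly in Python; the input, weight, and threshold values below are illustrative:

def perceptron_output(x, w, theta):
    """Threshold logic unit: output 1 if the weighted sum of inputs reaches theta."""
    net = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if net >= theta else 0

# Example using the weights and threshold that appear later in the tutorial
print(perceptron_output([0.8, 0.3], [0.4, -0.2], 0.1))  # -> 1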
Learning Algorithm

[Figure: two-input node with initial weights w1 = .4, w2 = -.2 and threshold θ = .1]

Training set:
x1   x2   T
.8   .3   1
.4   .1   0

Z = 1 if Σ(i=1..n) xi wi ≥ θ, else Z = 0
First Training Instance

Present x1 = .8, x2 = .3 (T = 1) to the node with w1 = .4, w2 = -.2, θ = .1

Net = .8*.4 + .3*-.2 = .26

Since Net = .26 ≥ θ, Z = 1, which matches T = 1, so the weights are left unchanged.
Second Training Instance

Present x1 = .4, x2 = .1 (T = 0) to the node with the same weights w1 = .4, w2 = -.2, θ = .1

Net = .4*.4 + .1*-.2 = .14

Since Net = .14 ≥ θ, Z = 1, but T = 0, so the weights must be adjusted:

Δwi = (T - Z) * C * xi
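A quick numeric check of the two training presentations above (a sketch; the learning rate C is not given on the slide, so C = 1 is assumed here):

w, theta, C = [0.4, -0.2], 0.1, 1.0  # initial weights, threshold, assumed learning rate

for x, T in [([0.8, 0.3], 1), ([0.4, 0.1], 0)]:
    net = sum(xi * wi for xi, wi in zip(x, w))
    Z = 1 if net >= theta else 0
    print(f"net = {net:.2f}, Z = {Z}, T = {T}")          # .26 then .14
    w = [wi + C * (T - Z) * xi for xi, wi in zip(x, w)]  # delta rule update
print("weights after both instances:", w)                # only the second instance changes w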
Delta Rule Learning

Δwij = C (Tj – Zj) xi

Create a network with n input and m output nodes
Each iteration through the training set is an epoch
Continue training until the error is less than some epsilon
Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists
As can be seen from the node activation function, the decision surface is an n-dimensional hyperplane:

Z = 1 if Σ(i=1..n) xi wi ≥ θ, else Z = 0
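A compact sketch of the epoch loop described above, using the delta rule for a single-layer perceptron; the dataset, learning rate, and stopping epsilon are illustrative, not from the slides:

def train_perceptron(data, n_inputs, C=0.1, theta=0.0, epsilon=0, max_epochs=100):
    """Delta-rule training: one pass over the training set per epoch."""
    w = [0.0] * n_inputs
    for epoch in range(max_epochs):
        errors = 0
        for x, T in data:
            net = sum(xi * wi for xi, wi in zip(x, w))
            Z = 1 if net >= theta else 0
            if Z != T:
                errors += 1
                w = [wi + C * (T - Z) * xi for xi, wi in zip(x, w)]
        if errors <= epsilon:   # converged (guaranteed only if the data is linearly separable)
            break
    return w

# Learns logical AND, a linearly separable function (bias handled as a constant first input)
AND = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
print(train_perceptron(AND, n_inputs=3))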
Linear Separability
Linear Separability and Generalization
When is data noise vs. a legitimate exception?
Limited Functionality of Hyperplane
Gradient Descent Learning
Error Landscape

[Figure: error landscape - TSS (Total Sum Squared Error) plotted against weight values]
Deriving a Gradient Descent Learning Algorithm

Goal: decrease overall error (or other objective function) each time a weight is changed
Total Sum Squared error E = Σ (Ti – Zi)^2
Seek a weight-changing algorithm such that ∂E/∂wij is negative
If such a formula can be found, then we have a gradient descent learning algorithm
The Perceptron/Delta rule is a gradient descent learning algorithm
Linearly-separable problems have no local minima
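For readers who want the missing step, here is a brief derivation (not spelled out on the slide) of why the delta rule is gradient descent for a linear output unit; the 1/2 factor in E is a common convention assumed here:

E = \tfrac{1}{2}\sum_j (T_j - Z_j)^2, \qquad Z_j = \sum_i w_{ij}\, x_i

\frac{\partial E}{\partial w_{ij}} = -(T_j - Z_j)\, x_i
\quad\Longrightarrow\quad
\Delta w_{ij} = -C\,\frac{\partial E}{\partial w_{ij}} = C\,(T_j - Z_j)\, x_i

which recovers exactly the Δwij = C (Tj – Zj) xi rule above, and shows why each update moves downhill on the error surface.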
Multi-layer Perceptron

Can compute arbitrary mappings
Assumes a non-linear activation function
Training algorithms are less obvious
The backpropagation learning algorithm was not exploited until the 1980's
First of many powerful multi-layer learning algorithms
Responsibility Problem
[Figure: the network outputs 1 when 0 was wanted]
Multi-Layer Generalization
Backpropagation

Multi-layer supervised learner
Gradient descent weight updates
Sigmoid activation function (smoothed threshold logic)
Backpropagation requires a differentiable activation function
Multi-layer Perceptron Topology
[Figure: Input Layer (nodes i) -> Hidden Layer(s) (nodes j) -> Output Layer (nodes k)]
Backpropagation Learning Algorithm

Until convergence (low error or other criteria) do
– Present a training pattern
– Calculate the error of the output nodes (based on T - Z)
– Calculate the error of the hidden nodes (based on the error of the output nodes, which is propagated back to the hidden nodes)
– Continue propagating error back until the input layer is reached
– Update all weights based on the standard delta rule with the appropriate error function δ:

Δwij = C δj Zi
Activation Function and its Derivative

The node activation function f(net) is typically the sigmoid:

Zj = f(netj) = 1 / (1 + e^(-netj))

[Plot: sigmoid rising from 0 toward 1 over net = -5..5, with value .5 at net = 0]

The derivative of the activation function is a critical part of the algorithm:

f'(netj) = Zj (1 - Zj)

[Plot: derivative over net = -5..5, peaking at .25 at net = 0]
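A small sketch of the sigmoid and its derivative as used above (plain Python, standard library only):

import math

def sigmoid(net):
    """f(net) = 1 / (1 + e^-net), the smoothed threshold."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    """f'(net) = Z(1 - Z), conveniently expressed in terms of the output itself."""
    Z = sigmoid(net)
    return Z * (1.0 - Z)

print(sigmoid(0), sigmoid_prime(0))  # 0.5 and 0.25, the midpoint and the derivative's peak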
Backpropagation Learning Equations
Δwij = C δj Zi

δj = (Tj - Zj) f'(netj)        [Output Node]

δj = Σk (δk wjk) f'(netj)      [Hidden Node]

[Figure: the same input (i) - hidden (j) - output (k) topology as before]
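To make the two δ equations concrete, here is a minimal one-hidden-layer sketch in Python with NumPy; the network size, learning rate C, number of epochs, and the XOR data are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)  # last column = constant bias input
T = np.array([[0], [1], [1], [0]], dtype=float)                          # XOR targets
C = 0.5                                                                  # assumed learning rate

W1 = rng.normal(scale=1.0, size=(3, 4))    # input i -> hidden j weights
W2 = rng.normal(scale=1.0, size=(5, 1))    # hidden j (plus bias) -> output k weights

def f(net):                                # sigmoid activation
    return 1.0 / (1.0 + np.exp(-net))

for _ in range(20000):
    Zh = f(X @ W1)                                 # hidden outputs Z_j
    Zh_b = np.hstack([Zh, np.ones((len(X), 1))])   # append a constant bias node
    Zo = f(Zh_b @ W2)                              # output values Z_k
    d_out = (T - Zo) * Zo * (1 - Zo)               # delta_k = (T_k - Z_k) f'(net_k)
    d_hid = (d_out @ W2[:-1].T) * Zh * (1 - Zh)    # delta_j = sum_k(delta_k w_jk) f'(net_j)
    W2 += C * Zh_b.T @ d_out                       # Dw_jk = C delta_k Z_j
    W1 += C * X.T @ d_hid                          # Dw_ij = C delta_j Z_i

Zo = f(np.hstack([f(X @ W1), np.ones((len(X), 1))]) @ W2)
print(np.round(Zo.ravel(), 2))   # typically close to [0, 1, 1, 0]; results vary with the seed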
Backpropagation Summary

Excellent empirical results
Scaling - the pleasant surprise
– Local minima are very rare as problem and network complexity increase
Most common neural network approach
User-defined parameters lead to more difficulty of use
– Number of hidden nodes, layers, learning rate, etc.
Many variants
– Adaptive parameters, ontogenic (growing and pruning) learning algorithms
– Higher-order gradient descent (Newton, Conjugate Gradient, etc.)
– Recurrent networks
Inductive Bias

The approach used to decide how to generalize novel cases
Occam's Razor - the simplest hypothesis which fits the data is usually the best - still many remaining options

A  B  C  -> Z
A  B' C  -> Z
A  B  C' -> Z
A  B' C' -> Z
A' B' C' -> Z'

Now you receive the new input A' B C. What is your output?
Overfitting
Noise vs. Exceptions revisited
The Overfit Problem

[Figure: TSS vs. training epochs for the Training Set and for a Validation/Test Set]

Newer powerful models can have very complex decision surfaces which can converge well on most training sets by learning noisy and irrelevant aspects of the training set in order to minimize error (memorization in the limit)
This makes them susceptible to overfit if not carefully considered
Avoiding Overfit

Inductive bias - simplest accurate model
More training data (vs. overtraining - one-epoch limit)
Validation Set (requires a separate test set) - see the early-stopping sketch below
Backpropagation - tends to build from a simple model (0 weights) to just-large-enough weights (Validation Set)
Stopping criteria with any constructive model (accuracy increase vs. statistical significance) - noise vs. exceptions
Specific techniques
– Weight decay, pruning, jitter, regularization
Ensembles
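A hedged sketch of validation-set early stopping, one common way to apply the ideas above; the train/validation split, the patience value, and the model interface (train_one_epoch, error, get_weights, set_weights) are hypothetical assumptions, not an API from the slides:

def train_with_early_stopping(model, train_set, val_set, max_epochs=500, patience=10):
    """Stop when validation error has not improved for `patience` epochs,
    then keep the weights from the best validation epoch."""
    best_err, best_weights, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train_one_epoch(train_set)      # e.g., one backprop pass (hypothetical method)
        err = model.error(val_set)            # TSS on the held-out validation set
        if err < best_err:
            best_err, best_weights, since_best = err, model.get_weights(), 0
        else:
            since_best += 1
            if since_best >= patience:        # validation error has stopped improving
                break
    model.set_weights(best_weights)           # roll back to the best validation epoch
    return model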
Ensembles

Many different ensemble approaches
– Stacking, Gating/Mixture of Experts, Bagging, Boosting, Wagging, Mimicking, Combinations
Multiple diverse models are trained on the same problem and their outputs are combined
The specific overfit of each learning model is averaged out
If the models are diverse (uncorrelated errors), then even if the individual models are weak generalizers, the ensemble can be very accurate

[Figure: models M1, M2, M3, ..., Mn feeding into a combining technique]
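As a minimal illustration of one combining technique (simple majority voting over already-trained classifiers; the `models` list and their `predict` interface are assumptions):

from collections import Counter

def ensemble_predict(models, x):
    """Majority vote over the predictions of diverse base models."""
    votes = [m.predict(x) for m in models]        # each model votes with its own prediction
    return Counter(votes).most_common(1)[0][0]    # the most common class wins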
Application Issues

Choose relevant features
Normalize features (a scaling sketch follows below)
Can learn to ignore irrelevant features, but will have to fight the curse of dimensionality
The more data (training examples) the better
Slower training is acceptable for complex and production applications if accuracy improves (the "week" phenomenon)
Execution is normally fast regardless of training time
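A small sketch of one common normalization choice, min-max scaling of each feature to [0, 1]; z-score standardization is an equally reasonable alternative, and the data below is illustrative:

def min_max_normalize(rows):
    """Scale each feature column to [0, 1] so no single feature dominates a distance or net sum."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

print(min_max_normalize([[4.8, 3.0], [7.0, 3.2], [6.3, 2.7]]))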
Decision Trees - ID3/C4.5

Top-down induction of decision trees
Highly used and successful
Attribute features - discrete nominal (mutually exclusive)
– Real-valued features are discretized
The search for the smallest tree is too complex (NP-hard)
C4.5 uses the common symbolic ML philosophy of a greedy iterative approach
Decision Tree Learning

Mapping by Hyper-Rectangles

[Figure: a decision tree partitions the A1-A2 feature space into hyper-rectangles]
ID3 Learning Approach

C is the current set of examples
A test on attribute A partitions C into {C1, C2, ..., Cw} where w is the number of values of A

[Figure: attribute Color splits C into C1 (Red), C2 (Green), and C3 (Purple)]
Decision Tree Learning Algorithm

Start with the Training Set as C and test how each attribute partitions C
Choose the best A for the root
The goodness measure is based on how well attribute A divides C into different output classes - a perfect attribute would divide C into partitions that contain only one output class each - a poor (irrelevant) attribute would leave each partition with the same ratio of classes as in C
20-questions analogy - good questions quickly minimize the possibilities
Continue recursively until sets are unambiguously classified or a stopping criterion is reached (a skeleton of this recursion is sketched below)
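A compact, hedged skeleton of the recursion just described (assumes discrete attributes and uses information gain as the goodness measure; the example format and helper names are illustrative):

import math
from collections import Counter

def entropy(examples):
    """Entropy of the class labels in a list of (feature_dict, label) examples."""
    counts = Counter(label for _, label in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(examples, attr):
    """Information gain of partitioning the examples on one attribute."""
    n = len(examples)
    parts = {}
    for ex in examples:
        parts.setdefault(ex[0][attr], []).append(ex)
    return entropy(examples) - sum(len(p) / n * entropy(p) for p in parts.values())

def id3(examples, attrs):
    labels = {label for _, label in examples}
    if len(labels) == 1 or not attrs:              # pure partition, or nothing left to test on
        return Counter(l for _, l in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))
    tree = {}
    for value in {ex[0][best] for ex in examples}:
        subset = [ex for ex in examples if ex[0][best] == value]
        tree[value] = id3(subset, [a for a in attrs if a != best])
    return (best, tree)

Calling id3(examples, attrs) on (feature-dict, label) pairs returns either a class label or an (attribute, {value: subtree}) pair; real ID3/C4.5 adds pruning and stopping criteria on top of this skeleton.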
ID3 Example and Discussion
Temperature   P   N
Hot           2   2
Mild          4   2
Cool          3   1
Gain: .029

Humidity      P   N
High          3   4
Normal        6   1
Gain: .151

14 examples. Uses Information Gain. Attributes which best discriminate between classes are chosen.
If the same ratios are found in the partitioned set, then the gain is 0.
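A quick numeric check of the two gains above (assumes the standard 14-example split of 9 positive and 5 negative examples behind these counts):

import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(total, partitions):
    """Information gain = entropy of the whole set minus the weighted partition entropies."""
    n = sum(total)
    return entropy(total) - sum(sum(p) / n * entropy(p) for p in partitions)

print(round(gain([9, 5], [[2, 2], [4, 2], [3, 1]]), 3))  # Temperature -> 0.029
print(round(gain([9, 5], [[3, 4], [6, 1]]), 3))          # Humidity -> 0.152 (shown as .151 on the slide)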
ID3 - Conclusions

Good empirical results
Comparable application robustness and accuracy with neural networks - faster learning (though NNs are more natural with continuous features - both input and output)
Most used and well known of current symbolic systems - used widely to aid in creating rules for expert systems
Nearest Neighbor Learners

Broad spectrum
– Basic k-NN, Instance Based Learning, Case Based Reasoning, Analogical Reasoning
Simply store all or some representative subset of the examples in the training set
Generalize on the fly rather than use a pre-acquired hypothesis - faster learning, slower execution, information retained, memory intensive
Nearest Neighbor Algorithms
Nearest Neighbor Variations

How many examples to store?
How do stored examples vote (distance-weighted, etc.)?
Can we choose a smaller set of near-optimal examples (prototypes/exemplars)?
– Storage reduction
– Faster execution
– Noise robustness
Distance metrics - non-Euclidean
Irrelevant features - feature weighting
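A minimal distance-weighted k-NN sketch covering the basic variations above (Euclidean distance and inverse-distance weighting are the assumed choices; the data in the example call is made up):

import math
from collections import defaultdict

def knn_predict(train, x, k=3):
    """Distance-weighted vote among the k nearest stored examples.
    `train` is a list of (feature_vector, label) pairs."""
    dists = sorted((math.dist(x, xi), label) for xi, label in train)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + 1e-9)          # closer neighbors get larger votes
    return max(votes, key=votes.get)

print(knn_predict([([0, 0], "A"), ([1, 0], "A"), ([5, 5], "B"), ([6, 5], "B")], [0.5, 0.2]))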
Evolutionary Computation/Algorithms
Genetic Algorithms

Simulate "natural" evolution of structures via selection and reproduction, based on performance (fitness)
A type of heuristic search - discovery, not inductive in isolation
Genetic operators - recombination (crossover) and mutation are most common

1 1 0 2 3 1 0 2 2 1   (Fitness = 10)
2 2 0 1 1 3 1 1 0 0   (Fitness = 12)
2 2 0 1 3 1 0 2 2 1   (Fitness = calculated or f(parents))
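A short sketch of single-point crossover and mutation on integer strings like the ones above; the crossover point and mutation rate are illustrative, and the third string on the slide appears consistent with a crossover after the fourth gene:

import random

def crossover(p1, p2, point):
    """Single-point crossover: prefix from one parent, suffix from the other."""
    return p1[:point] + p2[point:]

def mutate(genome, values=(0, 1, 2, 3), rate=0.1):
    """Each gene is replaced by a random allowed value with probability `rate`."""
    return [random.choice(values) if random.random() < rate else g for g in genome]

p1 = [1, 1, 0, 2, 3, 1, 0, 2, 2, 1]   # Fitness = 10
p2 = [2, 2, 0, 1, 1, 3, 1, 1, 0, 0]   # Fitness = 12
print(crossover(p2, p1, 4))           # -> [2, 2, 0, 1, 3, 1, 0, 2, 2, 1], the child on the slide
print(mutate(crossover(p2, p1, 4)))   # possibly with a few random changes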
Evolutionary Algorithms

Start with an initialized population P(t) - random, domain knowledge, etc.
The population is usually made up of possible parameter settings for a complex problem
Typically a fixed population size is used (like beam search)
Selection
– Parent_Selection P(t) - promising parents used to create new children
– Survive P(t) - pruning of unpromising candidates
Evaluate P(t) - calculate the fitness of population members; ranges from simple metrics to complex simulations
Evolutionary Algorithm
Procedure EA
  t = 0;
  Initialize Population P(t);
  Evaluate P(t);
  Until Done {  /* Sufficiently "good" individuals discovered */
    t = t + 1;
    Parent_Selection P(t);
    Recombine P(t);
    Mutate P(t);
    Evaluate P(t);
    Survive P(t);
  }
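A runnable Python sketch that mirrors Procedure EA; the fitness function (counting 1s in a bit string), population size, and operator rates are assumptions for illustration only:

import random

def evolve(fitness, length=20, pop_size=30, generations=100, mutation_rate=0.05):
    """Generational EA: evaluate, select parents, recombine, mutate, survive."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)       # Evaluate P(t)
        parents = scored[:pop_size // 2]                      # Parent_Selection P(t)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            point = random.randint(1, length - 1)             # Recombine P(t): 1-point crossover
            child = a[:point] + b[point:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                          # Mutate P(t)
            children.append(child)
        pop = parents + children                              # Survive P(t): elites plus offspring
    return max(pop, key=fitness)

best = evolve(fitness=sum)        # "count the 1s" fitness, an assumed toy objective
print(sum(best), best)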
EA Example

Goal: discover a new automotive engine to maximize performance, reliability, and mileage while minimizing emissions
Features: CID (cubic inch displacement), fuel system, # of valves, # of cylinders, presence of turbo-charging
Assume a test unit which tests possible engines and returns an integer measure of goodness
Start with a population of random engines
Genetic Operators

Crossover variations - multi-point, uniform probability, averaging, etc.
Mutation - random changes in features, adaptive, different for each feature, etc.
Others - many schemes mimicking natural genetics: dominance, selective mating, inversion, reordering, speciation, knowledge-based, etc.
Reproduction - terminology for selection based on fitness (keep the best around) - supported in the algorithms
Critical to maintain a balance of diversity and quality in the population
Evolutionary Algorithms

There exist mathematical proofs that evolutionary techniques are efficient search strategies
There are a number of different evolutionary strategies
– Genetic Algorithms
– Evolutionary Programming
– Evolution Strategies
– Genetic Programming
Strategies differ in representations, selection, operators, evaluation, etc.
Most were independently discovered; initially function optimization (EP, ES)
Strategies continue to "evolve"
Genetic Algorithm Comments

Much current work and many extensions
Numerous application attempts. Can plug into many algorithms requiring search. Has a built-in heuristic. Could augment with domain heuristics.
A "Lazy Man's Solution" to any tough parameter search
Rule Induction

Creates a set of symbolic rules to solve a classification problem
Sequential Covering Algorithms
Until no good and significant rules can be created:
– Create all first-order rules Ax -> Classy
– Score each rule based on goodness (accuracy) and significance using the current training set
– Iteratively (greedily) expand the best rules to n+1 attributes, score the new rules, and prune weak rules to keep the total candidate list at a fixed size (beam search)
– Pick the one best rule and remove all instances from the training set that the rule covers
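A hedged, high-level skeleton of the sequential covering loop described above; the rule representation and the scoring/growing/stopping helpers are all illustrative assumptions supplied by the caller:

def sequential_covering(examples, score, grow_rule, good_enough):
    """Learn rules one at a time, removing the instances each accepted rule covers.
    `score`, `grow_rule`, and `good_enough` are caller-supplied (hypothetical) helpers."""
    rules = []
    remaining = list(examples)
    while remaining:
        rule = grow_rule(remaining, score)        # greedy beam-search expansion of one rule
        if not good_enough(rule, remaining):      # stop when no good, significant rule remains
            break
        rules.append(rule)
        remaining = [ex for ex in remaining if not rule.covers(ex)]
    return rules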
Rule Induction Variants

Ordered rule lists (decision lists) - naturally support multiple output classes
– A=Green and B=Tall -> Class 1
– A=Red and C=Fast -> Class 2
– Else Class 1
Placing new rules at the beginning or the end of the list
Unordered rule lists for each output class (must handle multiple matches)
Rule induction can handle noise by no longer creating new rules when the gain is negligible or not statistically significant
Conclusion

Many new algorithms and approaches are being proposed
Application areas are rapidly increasing
– The amount of available data and information is growing
– Users desire more adaptive and user-specific computer interaction
– This need for specific and adaptable user interaction will make machine learning a more important tool in user interface research and applications