Integrated Learning in Multi-net Systems

Matthew Casey
6th February 2004
Neural Computing Group
Department of Computing
University of Surrey
http://www.computing.surrey.ac.uk/personal/st/M.Casey/
Introduction
• In-situ learning in multi-net systems
• Classification
– Parallel combination of networks: ensemble
– Sequential combination of networks: modular
• Simulation
– Parallel and sequential combination of networks
– Quantification and addition
• Formal framework and algorithm for multi-net
systems
Introduction
• Learning:
– “process which leads to the modification of behaviour” [1]
• Biological motivation
– Hebb’s neurophysiological postulate [2]
– Learning across cell assemblies: neural integration
– Functional specialism: analogy to multi-net systems
• Theoretical motivation
– Generalisation improvements with multi-net systems
– Ensemble and modular
• Learning in collaboration with modularisation
Single-net Systems
• Systems of one or more artificial neurons
combined together in a single network
– Parallel distributed processing [3]
– Learning to generalise
• Learning algorithms
– Supervised: delta [4, 5, 6], backpropagation [7, 8, 9]
– Unsupervised: Hebbian [2], SOM [10]
Single-nets as Multi-nets?
[Figure: XOR of inputs x1 and x2 giving output y, solved by a combination of linear decision boundaries; True (1), False (-1)]
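A minimal sketch of the idea on this slide, assuming the bipolar encoding shown (True = 1, False = -1): two threshold units each draw one linear decision boundary and a third unit combines them. The function names and the hand-picked weights are illustrative assumptions, not values read from the figure.

```python
def sign(a):
    """Bipolar threshold: 1 if a > 0, otherwise -1."""
    return 1 if a > 0 else -1


def unit(weights, bias, inputs):
    """A single linear threshold unit (one linear decision boundary)."""
    return sign(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor(x1, x2):
    # Hand-picked weights (an assumption for illustration).
    h_or = unit((1, 1), 1, (x1, x2))       # true unless both inputs are false
    h_nand = unit((-1, -1), 1, (x1, x2))   # true unless both inputs are true
    return unit((1, 1), -1, (h_or, h_nand))  # AND of the two boundaries


truth_table = [xor(a, b) for a in (-1, 1) for b in (-1, 1)]
```

Neither hidden unit solves XOR alone; only their combination does, which is the sense in which a single multi-layer net already resembles a small multi-net system.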
From Single-nets to Multi-nets
• Multi-net systems appear to be a development of
the parallel processing paradigm
• Can multi-net systems improve generalisation?
– Modularisation with simpler networks?
– Limited theoretical and empirical evidence
• Generalisation:
– Balance prior knowledge and training
– VC Dimension [11, 12, 13]
– Bias/variance dilemma [14]
Multi-net Systems:
Ensemble or Modular?
• Ensemble systems:
– Parallel combination
– Each network performs the same task
– Simple ensemble
– AdaBoost [15]
• Modular systems:
– Each network performs a different (sub-)task
– Mixture-of-experts [16] (top-down parallel competitive)
– Min-max [17] (bottom-up static parallel/sequential)
Categorising Multi-net Systems
• Sharkey’s [18, 19] combination strategies:
– Parallel: co-operative or competitive; top-down or bottom-up; static or dynamic
– Sequential
– Supervisory
• Component networks may be [20]:
– Pre-trained (independent)
– Incrementally trained (iterative)
– In-situ trained (simultaneous)
Multi-net Systems
• Categorisation schemes appear not to support the
generalisation of multi-net system properties
beyond specific examples
– Ensemble: bias, variance and diversity [21]
– Modular: bias and variance
– What about measures such as the VC Dimension?
• Some use of in-situ learning
– ME and HME [22]
– Negative correlation learning [23]
Research
• Multi-net systems seem to offer a way in which
generalisation performance and learning speed can
be improved:
– Yet limited theoretical and empirical evidence
– Focus on parallel systems
• Limited use of in-situ learning despite motivation
– Existing results show improvement
– Can the approach be generalised?
• No general framework for multi-net systems
– Difficult to generalise properties from categorisation
Research
• Explore in-situ learning in multi-net systems:
– Parallel: in-situ learning in the simple ensemble
– Sequential: combining networks with in-situ learning
– Does in-situ learning provide improved generalisation?
– Can we combine ‘simple’ networks to solve ‘complex’
problems: ‘superordinate’ systems with faster learning?
• Propose a formal framework for multi-net systems
– A method to describe the architecture and learning
algorithm for a general multi-net system
Multi-net System Framework
• Previous work:
– Framework for the co-operation of learning
algorithms [24]
– Stochastic model [25]
– Importance Sampled Learning Ensembles (ISLE) [26]
– Focus on supervised learning and specific architectures
• Jordan and Xu’s (1995) definition of HME [27]:
– Generalisation of HME
– Abstraction of architecture from algorithm
– Theoretical results for convergence
Multi-net System Framework
• Propose a modification to Jordan and Xu’s
definition of HME to provide a generalised
multi-net system framework
– HME combines the output of the expert networks
through a weighting generated by a gating network
– Replace the weighting by the (optional) operation of a
network
– Can be used for parallel, sequential and supervisory
systems
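As a rough illustration of the weighting that the framework replaces, the sketch below combines scalar expert outputs using softmax gate weights, as in a mixture-of-experts. The function names are assumptions, and the gate scores are passed in directly rather than computed by a trained gating network.

```python
import math


def softmax(scores):
    """Normalise gate scores into positive weights that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_output(expert_outputs, gate_scores):
    """Weighted sum of expert outputs, y = sum_i g_i * y_i."""
    gates = softmax(gate_scores)
    return sum(g * y for g, y in zip(gates, expert_outputs))


# Two experts that disagree, with a gate that has no preference.
y = mixture_output([1.0, -1.0], [0.0, 0.0])
```

In the proposed framework this fixed weighted sum becomes just one possible choice for the operation performed at a non-leaf node.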
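Example: HME
[Figure: HME drawn as a tree of depth 2 (levels r=0 to r=2) with input x: root v1 (output y1); nodes v11 and v12 (outputs y11 and y12, gate weights g11 and g12); leaves v111 and v112 (outputs y111 and y112, gate weights g111 and g112)]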
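Example: Framework
[Figure: the framework drawn as a tree of depth 2 (levels r=0 to r=2) with input x: root v1 (output y1); nodes v11, v12 and v13 (outputs y11, y12 and y13); leaves v111, v112 and v113 (outputs y111, y112 and y113)]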
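Definition
• A multi-net system consists of the ordered
tree of depth $r$ defined by the nodes $v_\alpha$, with $v_1$
the root of the tree associated with the
output $y_1 \in \mathbb{R}^m$, such that:

$$
y_\alpha =
\begin{cases}
f_\alpha(x, \theta_\alpha) & \text{if } K_\alpha = 0 \\
f_\alpha(y_{\alpha 1}, \ldots, y_{\alpha K_\alpha}, \theta_\alpha) & \text{if } K_\alpha > 0
\end{cases}
$$

where $K_\alpha$ is the number of child nodes of $v_\alpha$ and $\theta_\alpha$ is the state associated with $v_\alpha$.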
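The definition can be read as a recursive evaluation over a tree. The sketch below is one possible rendering under stated assumptions: the `Node` class, the `linear`, `mean` and `scale` functions and their parameter values are all hypothetical and serve only to show a leaf (no children) consuming the input x and a non-leaf combining its children's outputs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One node of the tree: a function f, its state theta, and K children."""
    f: Callable[..., List[float]]
    theta: List[float]
    children: List["Node"] = field(default_factory=list)

    def output(self, x: List[float]) -> List[float]:
        if not self.children:            # K = 0: a leaf operates on the input x
            return self.f(x, self.theta)
        # K > 0: operate on the outputs of the children, in order
        ys = [child.output(x) for child in self.children]
        return self.f(ys, self.theta)


def linear(x, theta):
    """Leaf function: a single linear unit."""
    return [sum(w * xi for w, xi in zip(theta, x))]


def mean(ys, theta):
    """Parallel combination: element-wise mean of the child outputs."""
    return [sum(col) / len(ys) for col in zip(*ys)]


def scale(ys, theta):
    """Sequential combination: transform the single child's output."""
    return [theta[0] * v for v in ys[0]]


ensemble = Node(mean, [], [Node(linear, [1.0, 0.0]), Node(linear, [0.0, 1.0])])
sequential = Node(scale, [2.0], [Node(linear, [1.0, 1.0])])

ensemble_y = ensemble.output([2.0, 4.0])      # mean of [2.0] and [4.0]
sequential_y = sequential.output([2.0, 4.0])  # 2.0 * (2.0 + 4.0)
```

A parallel system is a node with several children and a sequential system is a node with one child, so both fit the same structure without reference to a categorisation scheme.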
Multi-net System Framework
• Learning algorithm operates by modifying the
state of the system, as defined by the state $\theta_\alpha$ associated with
each node $v_\alpha$
• Includes:
– Pre-training
– In-situ training
– Incremental training (through pre- or in-situ training)
• Examples demonstrate how framework can be
used to describe existing types of multi-net system
– However does not rely upon categorisation schemes
In-situ Learning
• Evaluation of in-situ learning in multi-net systems
• Explore parallel and sequential in-situ learning
with definitions using the proposed framework
– Simple learning ensemble (SLE)
– Sequential learning modules (SLM)
• Benchmark classification tasks [28]:
– XOR
– MONK’s problems (MONK 1, 2, 3) [29]
– Wisconsin Breast Cancer Database (WBCD) [30]
– Proben1 Thyroid (thyroid1 data set) [31]
Simple Learning Ensemble
• Ensembles:
– Train each network individually
– Parallel combination of network outputs: mean output
– Pre-training: how can we understand or control the
combined performance of the ensemble?
– Incremental: AdaBoost [15]
– In-situ: negative correlation [23]
• In-situ learning:
– Train in-situ and assess combined performance during
training using early stopping
– Generalisation loss early stopping criterion [32]
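The two ingredients named on this slide, the mean combination and the generalisation loss criterion, can be sketched as follows. The generalisation loss follows Prechelt's definition, GL(t) = 100 · (E_va(t) / E_opt(t) − 1), where E_opt(t) is the lowest validation error seen up to epoch t. The threshold `alpha=5.0`, the function names and the example error curve are assumptions, since the slides do not state them.

```python
def ensemble_output(member_outputs):
    """Combined output of a simple ensemble: the mean of the member outputs."""
    return sum(member_outputs) / len(member_outputs)


def generalisation_loss(validation_errors):
    """GL at the latest epoch, in percent, relative to the best error so far.

    Assumes all validation errors are strictly positive.
    """
    e_opt = min(validation_errors)
    return 100.0 * (validation_errors[-1] / e_opt - 1.0)


def stopping_epoch(validation_errors, alpha=5.0):
    """First epoch (1-based) at which GL exceeds alpha, or None if it never does."""
    for t in range(1, len(validation_errors) + 1):
        if generalisation_loss(validation_errors[:t]) > alpha:
            return t
    return None


# Hypothetical validation error of the combined ensemble output per epoch.
errors = [0.5, 0.4, 0.3, 0.31, 0.33]
stop_at = stopping_epoch(errors)
```

For in-situ learning the validation error passed in would be that of the combined output, so training stops on the ensemble's behaviour rather than on any single member's.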
Benchmark Results
• Compare SLE results with:
– Simple ensemble (SE): all networks pre-trained
using early stopping
– Single-net: MLP with backpropagation – with
and without early stopping
• For SLE and SE:
– From 2 to 20 MLP networks
– 100 trials per configuration: mean response
Benchmark Results
• XOR
– SLE and SE equivalent training (no early stopping)
– SLE uses fewer epochs to give equivalent responses
• MONK’s problems/WBCD/Thyroid
– SLE improves upon SE, SE improves upon single-net
– SLE trains for longer: combined performance
– Adding more networks gives better generalisation
– More networks: desired performance achieved more often
• MONK 1/2
– SLE improves upon non-early stopping single-net
Sequential Learning Modules
• Sequential systems:
– Can a combination of (simpler) networks give good
generalisation and learning speed?
– Typically pre-trained and for specific processing
• Sequential in-situ learning:
– How can in-situ learning be achieved with sequential
networks: target error/output?
– Use unsupervised networks
– Last network has target output and hence can be
supervised
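One way to picture sequential in-situ learning is a SOM feeding a single-layer unit, with both updated on every pattern presentation. The sketch below is an illustration under several assumptions that the slides do not specify: the SOM output is passed on as a one-hot encoding of the winning unit, the neighbourhood and learning rate decay linearly, and all names and parameter values are hypothetical. Whether a given run separates the classes depends on the map size and these settings.

```python
import random


def winner(som, x):
    """Index of the map unit whose weight vector is closest to x."""
    dists = [sum((w - xi) ** 2 for w, xi in zip(unit, x)) for unit in som]
    return dists.index(min(dists))


def train_slm(data, side=3, epochs=200, som_rate=0.5, delta_rate=0.1, seed=0):
    """Train a side x side SOM and a single-layer output unit together."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    som = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(side * side)]
    out_w = [0.0] * (side * side)        # one output weight per map unit
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        radius = int((side - 1) * decay)  # neighbourhood shrinks over time
        for x, target in data:
            # Unsupervised step: move the winner and its neighbours towards x.
            wr, wc = divmod(winner(som, x), side)
            for j, unit in enumerate(som):
                r, c = divmod(j, side)
                if max(abs(r - wr), abs(c - wc)) <= radius:
                    for k in range(dim):
                        unit[k] += som_rate * decay * (x[k] - unit[k])
            # Supervised step: delta rule on the one-hot winner encoding,
            # so only the winning unit's output weight changes.
            win = winner(som, x)
            out_w[win] += delta_rate * (target - out_w[win])
    return som, out_w


def predict(som, out_w, x):
    """Bipolar class: 1 if the winner's output weight is positive, else -1."""
    return 1 if out_w[winner(som, x)] > 0 else -1


xor_data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
som, out_w = train_slm(xor_data)
predictions = [predict(som, out_w, x) for x, _ in xor_data]
```

Only the last network needs a target: the SOM learns without one, and the delta rule is applied to whatever representation the SOM currently provides.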
Benchmark Results
• Compare SLM results with:
– Single-net: single layer network with delta learning
– Single-net: MLP with backpropagation – without early
stopping
• Networks:
– Combine a SOM with a single layer network with delta
learning
– Varying map sizes of SOM used to see effect on
classification performance
– 100 trials per configuration: mean response
Benchmark Results
• XOR
– Cannot be solved by SOM or single layer network
– Can be solved by SLM with 3x3 map
– Faster learning times than MLP with backpropagation
• MONK’s problems/WBCD
– SLM can learn classification of training examples
– Generalisation: improves upon SE with early stopping
• MONK 2/3
– SLM improves upon MLP without early stopping
– MONK 3 improves upon SLE
Benchmark Results
• MONK 1/WBCD
– Generalisation: similar, but slightly worse
• Thyroid
– Poor learning of training examples
– Can SOM perform sufficient pre-processing for
single layer network?
• Results seem to depend upon problem type
and map size
In-situ Learning
• In-situ learning in multi-net systems:
– Can give better training and generalisation performance
– Comparison with SE (early stopping) and single-net
systems (with and without early stopping)
– Computational effort?
• Ensemble systems:
– Effect of training times on bias, variance and diversity?
• Sequential systems:
– Encouraging empirical results: theoretical results?
– Automatic classification of unsupervised clusters
In-situ Learning
System     MONK 1   MONK 2   MONK 3   WBCD     Thyroid
MLP        84.44%   66.29%   83.39%   92.68%   95.71%
MLP (ES)   57.13%   65.21%   63.10%   76.01%   90.37%
SE         55.75%   66.25%   66.03%   87.23%   90.86%
SLE        90.21%   69.49%   78.57%   88.47%   93.96%
SLM        75.63%   69.24%   84.10%   92.14%   25.26%
In-situ Learning
[Figure: bar chart of mean correct validation responses (0% to 100%) on MONK 1, MONK 2, MONK 3, WBCD and Thyroid for each system: MLP, MLP (ES), SE, SLE and SLM]
In-situ Learning and Simulation
• Biological motivation for in-situ learning
– Hebb’s ‘neural integration’
– Functional specialism in cognitive systems
• Use in-situ learning in modular multi-net systems
to simulate numerical abilities:
– Quantification: subitization and counting
– Addition: fact retrieval and ‘count all’
– Combine ME and SLM to allow abilities to ‘compete’
Strategy Learning System
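[Figure: the strategy learning system drawn as a tree. The root v1 (K1=3) is a parallel (ME) combination of v11 (K11=0), v12 (K12=1) and v13 (K13=0); v12 is a sequential (SLM) combination with the single child v121 (K121=0). Node labels: gate, unsupervised strategy, supervised strategy.]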
Multi-net Simulation of
Quantification
• Single-net and multi-net systems
– Trained on scenes consisting of ‘objects’
– Logarithmic probability model
• Subitization SOM:
– Ordering of numbers with compressive scheme
– Limit: training data, object frequency and map size
• Counting MLP with backpropagation:
– Correct responses to training data
– Conventional, stable non-conventional and non-stable
Multi-net Simulation of
Quantification
• MNQ:
– Subitization: SOM (pre-trained)
– Single layer network with delta learning
– Counting: MLP with backpropagation
• Successfully learnt to quantify
– To subitize or count
– Decision based upon input
• Subitization limit attributable to interaction of
modules
– Lower numbers subitized, higher numbers counted
Multi-net Simulation of
Addition
• Single-net and multi-net systems
– Trained on scenes consisting of two sets of ‘objects’
– Equal probability model
• Fact retrieval SOM:
– Each addend associated with a map axis
– Some overlap of problems
– Commutative information used
• ‘Count all’ MLP with backpropagation:
– Correct responses to training data
– Some relationship to observed human errors
Multi-net Simulation of
Addition
• MNA:
– Fact retrieval: SOM
– Single layer network with delta learning
– ‘Count all’: MLP with backpropagation
• Successfully learnt to add
– To count or from facts
– Decision based upon input
• Predominant use of ‘count all’, rather than facts
– Does demonstrate change in strategy
In-situ Learning and Simulation
• In-situ learning in SLS:
– Simulate developmental progression
– At least as capable as equivalent monolithic solutions
– Competition between supervised and unsupervised
learning paradigms
• In-situ learning and simulation of the interaction
between multiple abilities:
– Interaction between different abilities: simulating
psychological phenomena
– Integrated learning
Contribution
• Multi-net systems:
– Seem to offer empirical and theoretical improvement to
generalisation
– Properties under-explored
• Framework for multi-net systems
– Generalisation of multi-net systems beyond
categorisation schemes
– Foundation to explore multi-net properties
• In-situ learning
– Biological and theoretical motivation
Contribution
• Compared different multi-net techniques and the
use of in-situ learning
• Demonstrated that in-situ learning can lead to
improved training and generalisation
• SLE
– Assessment of combined performance
• SLM
– Combining ‘simple’ networks to solve ‘complex’ tasks
– ‘Superordinate’?
– Also shows automatic classification using SOM
Contribution
• Integrated learning
– Simulations using modular parallel and
sequential networks
– In-situ learning used to explore the interaction
of modules during learning
– Demonstrated simulation of abilities and
observed psychological phenomena
Future Work
• Framework:
– Modify learning algorithm – recursive using tree
– Explore properties such as bias, variance, diversity and
VC Dimension
• In-situ learning:
– Further comparison
– SLE: does learning promote diversity?
– SLM: expand and explore limitations
• Simulations:
– Explore further the effect of integrated learning
Questions?
References
1. Simpson, J.A. & Weiner, E.S.C. (Ed) (1989). Oxford English Dictionary, 2nd. Oxford, UK: Clarendon Press.
2. Hebb, D.O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York: John Wiley & Sons.
3. McClelland, J.L. & Rumelhart, D.E. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2: Psychological and Biological Models. Cambridge, MA.: A Bradford Book, MIT Press.
4. Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, vol. 65(6), pp. 386-408.
5. Widrow, B. & Hoff, M.E.Jr. (1960). Adaptive Switching Circuits. IRE WESCON Convention Record, pp. 96-104.
6. Minsky, M.L. & Papert, S. (1988). Perceptrons: An Introduction to Computational Geometry, Expanded Ed. Cambridge, MA.: MIT Press.
7. Werbos, P.J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Unpublished doctoral thesis. Cambridge, MA.: Harvard University.
8. Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). Learning Internal Representations by Error Propagation. In Rumelhart, D.E. & McClelland, J.L. (Ed), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pp. 318-362. Cambridge, MA.: MIT Press.
9. Elman, J.L. (1990). Finding Structure in Time. Cognitive Science, vol. 14, pp. 179-211.
10. Kohonen, T. (1982). Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, vol. 43, pp. 59-69.
11. Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the Uniform Convergence of Relative Frequencies of Events to their Probabilities. Theory of Probability and Its Applications, vol. XVI(2), pp. 264-280.
12. Baum, E.B. & Haussler, D. (1989). What Size Net Gives Valid Generalisation? Neural Computation, vol. 1(1), pp. 151-160.
13. Koiran, P. & Sontag, E.D. (1997). Neural Networks With Quadratic VC Dimension. Journal of Computer and System Sciences, vol. 54(1), pp. 190-198.
14. Geman, S., Bienenstock, E. & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, vol. 4(1), pp. 1-58.
15. Freund, Y. & Schapire, R.E. (1996). Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the 13th International Conference, pp. 148-156. Morgan Kaufmann.
16. Jacobs, R.A., Jordan, M.I. & Barto, A.G. (1991). Task Decomposition through Competition in a Modular Connectionist Architecture: The What and Where Vision Tasks. Cognitive Science, vol. 15, pp. 219-250.
17. Lu, B. & Ito, M. (1999). Task Decomposition and Module Combination Based on Class Relations: A Modular Neural Network for Pattern Classification. IEEE Transactions on Neural Networks, vol. 10(5), pp. 1244-1256.
18. Sharkey, A.J.C. (1999). Multi-Net Systems. In Sharkey, A.J.C. (Ed), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, pp. 1-30. London: Springer-Verlag.
19. Sharkey, A.J.C. (2002). Types of Multinet System. In Roli, F. & Kittler, J. (Ed), Proceedings of the Third International Workshop on Multiple Classifier Systems (MCS 2002), pp. 108-117. Berlin, Heidelberg, New York: Springer-Verlag.
20. Liu, Y., Yao, X., Zhao, Q. & Higuchi, T. (2002). An Experimental Comparison of Neural Network Ensemble Learning Methods on Decision Boundaries. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), vol. 1, pp. 221-226. Los Alamitos, CA.: IEEE Computer Society Press.
21. Kuncheva, L.I. (2002). Switching Between Selection and Fusion in Combining Classifiers: An Experiment. IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 32(2), pp. 146-156.
22. Jordan, M.I. & Jacobs, R.A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, vol. 6(2), pp. 181-214.
23. Liu, Y. & Yao, X. (1999a). Ensemble Learning via Negative Correlation. Neural Networks, vol. 12(10), pp. 1399-1404.
24. Bottou, L. & Gallinari, P. (1991). A Framework for the Cooperation of Learning Algorithms. In Lippmann, R.P., Moody, J.E. & Touretzky, D.S. (Ed), Advances in Neural Information Processing Systems, vol. 3, pp. 781-788.
25. Amari, S.-I. (1995). Information Geometry of the EM and em Algorithms for Neural Networks. Neural Networks, vol. 8(9), pp. 1379-1408.
26. Friedman, J.H. & Popescu, B. (2003). Importance Sampling: An Alternative View of Ensemble Learning. Presented at the 4th International Workshop on Multiple Classifier Systems (MCS 2003). Guildford, UK.
27. Jordan, M.I. & Xu, L. (1995). Convergence Results for the EM Approach to Mixtures of Experts Architectures. Neural Networks, vol. 8, pp. 1409-1431.
28. Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA.: University of California, Irvine, Department of Information and Computer Sciences.
29. Thrun, S.B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K., Dzeroski, S., Fahlman, S.E., Fisher, D., Hamann, R., Kaufman, K., Keller, S., Kononenko, I., Kreuziger, J., Michalski, R.S., Mitchell, T., Pachowicz, P., Reich, Y., Vafaie, H., van de Welde, W., Wenzel, W., Wnek, J. & Zhang, J. (1991). The MONK's Problems: A Performance Comparison of Different Learning Algorithms. Technical Report CMU-CS-91-197. Pittsburgh, PA.: Carnegie-Mellon University, Computer Science Department.
30. Wolberg, W.H. & Mangasarian, O.L. (1990). Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology. Proceedings of the National Academy of Sciences, USA, vol. 87(23), pp. 9193-9196.
31. Prechelt, L. (1994). Proben1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94. Karlsruhe, Germany: University of Karlsruhe.
32. Prechelt, L. (1996). Early Stopping - But When? In Orr, G.B. & Müller, K-R. (Ed), Neural Networks: Tricks of the Trade, 1524, pp. 55-69. Berlin, Heidelberg, New York: Springer-Verlag.