Integrated Learning in Multi-net Systems Matthew Casey 6th February 2004 Neural Computing Group Department of Computing University of Surrey http://www.computing.surrey.ac.uk/personal/st/M.Casey/ Introduction • In-situ learning in multi-net systems • Classification – Parallel combination of networks: ensemble – Sequential combination of networks: modular • Simulation – Parallel and sequential combination of networks – Quantification and addition • Formal framework and algorithm for multi-net systems 2 Introduction • Learning: – “process which leads to the modification of behaviour”1 • Biological motivation – Hebb’s neurophysiological postulate2 – Learning across cell assemblies: neural integration – Functional specialism: analogy to multi-net systems • Theoretical motivation – Generalisation improvements with multi-net systems – Ensemble and modular • Learning in collaboration with modularisation 3 Single-net Systems • Systems of one or more artificial neurons combined together in a single network – Parallel distributed processing3 – Learning to generalise • Learning algorithms – Supervised: delta4,5,6, backpropagation7,8,9 – Unsupervised: Hebbian2, SOM10 4 Single-nets as Multi-nets? 1 XOR x1 x2 y 1 1 1 -1 -1 -1 -1 -1 Combination of Linear Decision Boundaries -1 1 1 1 1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 True (1) -1 1 False (-1) 5 From Single-nets to Multi-nets • Multi-net systems appear to be a development of the parallel processing paradigm • Can multi-net systems improve generalisation? – Modularisation with simpler networks? – Limited theoretical and empirical evidence • Generalisation: – Balance prior knowledge and training – VC Dimension11,12,13 – Bias/variance dilemma14 6 Multi-net Systems: Ensemble or Modular? • Ensemble systems: – – – – Parallel combination Each network performs the same task Simple ensemble AdaBoost15 • Modular systems: – Each network performs a different (sub-)task – Mixture-of-experts16 (top-down parallel competitive) – Min-max17 (bottom-up static parallel/sequential) 7 Categorising Multi-net Systems • Sharkey’s18,19 combination strategies: – Parallel: co-operative or competitive top-down or bottom-up static or dynamic – Sequential – Supervisory • Component networks may be20: – Pre-trained (independent) – Incrementally trained (iterative) – In-situ trained (simultaneous) 8 Multi-net Systems • Categorisation schemes appear not to support the generalisation of multi-net system properties beyond specific examples – Ensemble: bias, variance and diversity21 – Modular: bias and variance – What about measures such as the VC Dimension? • Some use of in-situ learning – ME and HME22 – Negative correlation learning23 9 Research • Multi-net systems seem to offer a way in which generalisation performance and learning speed can be improved: – Yet limited theoretical and empirical evidence – Focus on parallel systems • Limited use of in-situ learning despite motivation – Existing results show improvement – Can the approach be generalised? • No general framework for multi-net systems – Difficult to generalise properties from categorisation 10 Research • Explore in-situ learning in multi-net systems: – – – – Parallel: in-situ learning in the simple ensemble Sequential: combining networks with in-situ learning Does in-situ learning provide improved generalisation? Can we combine ‘simple’ networks to solve ‘complex’ problems: ‘superordinate’ systems with faster learning? • Propose a formal framework for multi-net systems – A method to describe the architecture and learning algorithm for a general multi-net system 11 Multi-net System Framework • Previous work: – Framework for the co-operation of learning algorithms24 – Stochastic model25 – Importance Sampled Learning Ensembles (ISLE)26 – Focus on supervised learning and specific architectures • Jordan and Xu’s (1995) definition of HME27: – Generalisation of HME – Abstraction of architecture from algorithm – Theoretical results for convergence 12 Multi-net System Framework • Propose a modification to Jordan and Xu’s definition of HME to provide a generalised multinet system framework – HME combines the output of the expert networks through a weighting generated by a gating network – Replace the weighting by the (optional) operation of a network – Can be used for parallel, sequential and supervisory systems 13 Example: HME y1 Depth v1 r=0 g11 g12 y11 x r=1 y12 v12 v11 g111 g112 v111 r=2 x x y111 y112 v112 x x 14 Example: Framework y1 Depth v1 r=0 y11 r=1 v12 v11 y111 r=2 x y13 y12 v13 y113 y112 v111 v112 x x v113 x x 15 Definition • A multi-net system consists of the ordered tree of depth r defined by the nodes v , with v the root of the tree associated with the output y R , such that: 1 m 1 f x , y f y 1 ,..., yK , if K 0 if K 0 16 Multi-net System Framework • Learning algorithm operates by modifying the state of the system as defined by associated with each node v • Includes: – Pre-training – In-situ training – Incremental training (through pre- or in-situ training) • Examples demonstrate how framework can be used to describe existing types of multi-net system – However does not rely upon categorisation schemes 17 In-situ Learning • Evaluation of in-situ learning in multi-net systems • Explore parallel and sequential in-situ learning with definitions using the proposed framework – Simple learning ensemble (SLE) – Sequential learning modules (SLM) • Benchmark classification tasks28: – – – – XOR MONK’s problems (MONK 1, 2, 3)29 Wisconsin Breast Cancer Database (WBCD)30 Proben1 Thyroid (thyroid1 data set)31 18 Simple Learning Ensemble • Ensembles: – Train each network individually – Parallel combination of network outputs: mean output – Pre-training: how can we understand or control the combined performance of the ensemble? – Incremental: AdaBoost15 – In-situ: negative correlation23 • In-situ learning: – Train in-situ and assess combined performance during training using early stopping – Generalisation loss early stopping criterion32 19 Benchmark Results • Compare SLE results with: – Simple ensemble (SE): all networks pre-trained using early stopping – Single-net: MLP with backpropagation – with and without early stopping • For SLE and SE: – From 2 to 20 MLP networks – 100 trials per configuration: mean response 20 Benchmark Results • XOR – SLE and SE equivalent training (no early stopping) – SLE uses less epochs to give equivalent responses • MONK’s problems/WBCD/Thyroid – – – – SLE improves upon SE, SE improves upon single-net SLE trains for longer: combined performance Adding more networks gives better generalisation More networks, more achieved desired performance • MONK 1/2 – SLE improves upon non-early stopping single-net 21 Sequential Learning Modules • Sequential systems: – Can a combination of (simpler) networks give good generalisation and learning speed? – Typically pre-trained and for specific processing • Sequential in-situ learning: – How can in-situ learning be achieved with sequential networks: target error/output? – Use unsupervised networks – Last network has target output and hence can be supervised 22 Benchmark Results • Compare SLM results with: – Single-net: single layer network with delta learning – Single-net: MLP with backpropagation – without early stopping • Networks: – Combine a SOM with a single layer network with delta learning – Varying map sizes of SOM used to see effect on classification performance – 100 trials per configuration: mean response 23 Benchmark Results • XOR – Cannot be solved by SOM or single layer network – Can be solved by SLM with 3x3 map – Faster learning times than MLP with backpropagation • MONK’s problems/WBCD – SLM can learn classification of training examples – Generalisation: improves upon SE with early stopping • MONK 2/3 – SLM improves upon MLP without early stopping – MONK 3 improves upon SLE 24 Benchmark Results • MONK 1/WBCD – Generalisation: similar, but slightly worse • Thyroid – Poor learning of training examples – Can SOM perform sufficient pre-processing for single layer network? • Results seem to depend upon problem type and map size 25 In-situ Learning • In-situ learning in multi-net systems: – Can give better training and generalisation performance – Comparison with SE (early stopping) and single-net systems (with and without early stopping) – Computational effort? • Ensemble systems: – Effect of training times on bias, variance and diversity? • Sequential systems: – Encouraging empirical results: theoretical results? – Automatic classification of unsupervised clusters 26 In-situ Learning System MONK 1 MLP 84.44% MLP (ES) 57.13% SE 55.75% SLE 90.21% SLM 75.63% Benchmark MONK 2 MONK 3 WBCD 66.29% 83.39% 92.68% 65.21% 63.10% 76.01% 66.25% 66.03% 87.23% 69.49% 78.57% 88.47% 69.24% 84.10% 92.14% Thyroid 95.71% 90.37% 90.86% 93.96% 25.26% 27 In-situ Learning Mean Correct Validation Responses 100.00% 80.00% 60.00% 40.00% 20.00% 0.00% MONK 1 MONK 2 MONK 3 WBCD Thyroid System MLP MLP (ES) SE SLE SLM 28 In-situ Learning and Simulation • Biological motivation for in-situ learning – Hebb’s ‘neural integration’ – Functional specialism in cognitive systems • Use in-situ learning in modular multi-net systems to simulate numerical abilities: – Quantification: subitization and counting – Addition: fact retrieval and ‘count all’ – Combine ME and SLM to allow abilities to ‘compete’ 29 Strategy Learning System v1 K1=3 Parallel (ME) v11 K11=0 v12 K12=1 v13 K13=0 Sequential (SLM) v121 Gate Unsupervised Strategy K121=0 Supervised Strategy 30 Multi-net Simulation of Quantification • Single-net and multi-net systems – Trained on scenes consisting of ‘objects’ – Logarithmic probability model • Subitization SOM: – Ordering of numbers with compressive scheme – Limit: training data, object frequency and map size • Counting MLP with backpropagation: – Correct responses to training data – Conventional, stable non-conventional and nonstable 31 Multi-net Simulation of Quantification • MNQ: – Subitization: SOM (pre-trained) – Single layer network with delta learning – Counting: MLP with backpropagation • Successfully learnt to quantify – To subitize or count – Decision based upon input • Subitization limit attributable to interaction of modules – Lower numbers subitized, higher numbers counted 32 Multi-net Simulation of Addition • Single-net and multi-net systems – Trained on scenes consisting of two sets of ‘objects’ – Equal probability model • Fact retrieval SOM: – Each addend associated with a map axis – Some overlap of problems – Commutative information used • ‘Count all’ MLP with backpropagation: – Correct responses to training data – Some relationship to observed human errors 33 Multi-net Simulation of Addition • MNA: – Fact retrieval: SOM – Single layer network with delta learning – ‘Count all’: MLP with backpropagation • Successfully learnt to add – To count or from facts – Decision based upon input • Predominant use of ‘count all’, rather than facts – Does demonstrate change in strategy 34 In-situ Learning and Simulation • In-situ learning in SLS: – Simulate developmental progression – At least as capable as equivalent monolithic solutions – Competition between supervised and unsupervised learning paradigms • In-situ learning and simulation of the interaction between multiple abilities: – Interaction between different abilities: simulating psychological phenomena – Integrated learning 35 Contribution • Multi-net systems: – Seem to offer empirical and theoretical improvement to generalisation – Properties under-explored • Framework for multi-net systems – Generalisation of multi-net systems beyond categorisation schemes – Foundation to explore multi-net properties • In-situ learning – Biological and theoretical motivation 36 Contribution • Compared different multi-net techniques and the use of in-situ learning • Demonstrated that in-situ learning can lead to improved training and generalisation • SLE – Assessment of combined performance • SLM – Combining ‘simple’ networks to solve ‘complex’ tasks – ‘Superordinate’? – Also shows automatic classification using SOM 37 Contribution • Integrated learning – Simulations using modular parallel and sequential networks – In-situ learning used to explore the interaction of modules during learning – Demonstrated simulation of abilities and observed psychological phenomena 38 Future Work • Framework: – Modify learning algorithm – recursive using tree – Explore properties such as bias, variance, diversity and VC Dimension • In-situ learning: – Further comparison – SLE: does learning promote diversity? – SLM: expand and explore limitations • Simulations: – Explore further the effect of integrated learning 39 Questions? 40 References 1. 2. 3. 4. 5. 6. 7. Simpson, J.A. & Weiner, E.S.C. (Ed) (1989). Oxford English Dictionary, 2nd. Oxford, UK: Clarendon Press. Hebb, D.O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York: John Wiley & Sons. McClelland, J.L. & Rumelhart, D.E. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2: Psychological and Biological Models. Cambridge, MA.: A Bradford Book, MIT Press. Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, vol. 65(6), pp. 386408. Widrow, B. & Hoff, M.E.Jr. (1960). Adaptive Switching Circuits. IRE WESCON Convention Record, pp. 96-104. Minsky, M.L. & Papert, S. (1988). Perceptrons: An Introduction to Computational Geometry, Expanded Ed. Cambridge, MA.: MIT Press. Werbos, P.J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Unpublished doctoral thesis. Cambridge, MA.: Harvard University. 41 References 8. 9. 10. 11. 12. 13. 14. Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). Learning Internal Representations by Error Propagation. In Rumelhart, D. E. & McClelland, J. L. (Ed), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pp. 318-362. Cambridge, MA.: MIT Press. Elman, J.L. (1990). Finding Structure in Time. Cognitive Science, vol. 14, pp. 179211. Kohonen, T. (1982). Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, vol. 43, pp. 59-69. Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the Uniform Convergence of Relative Frequencies of Events to their Probabilities. Theory of Probability and Its Applications, vol. XVI(2), pp. 264-280. Baum, E.B. & Haussler, D. (1989). What Size Net Gives Valid Generalisation? Neural Computation, vol. 1(1), pp. 151-160. Koiran, P. & Sontag, E.D. (1997). Neural Networks With Quadratic VC Dimension. Journal of Computer and System Sciences, vol. 54(1), pp. 190-198. Geman, S., Bienenstock, E. & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, vol. 4(1), pp. 1-58. 42 References 15. 16. 17. 18. 19. 20. Freund, Y. & Schapire, R.E. (1996). Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the 13th International Conference, pp. 148-156. Morgan Kaufmann. Jacobs, R.A., Jordan, M.I. & Barto, A.G. (1991). Task Decomposition through Competition in a Modular Connectionist Architecture: The What and Where Vision Tasks. Cognitive Science, vol. 15, pp. 219-250. Lu, B. & Ito, M. (1999). Task Decomposition and Module Combination Based on Class Relations: A Modular Neural Network for Pattern Classification. IEEE Transactions on Neural Networks, vol. 10(5), pp. 1244-1256. Sharkey, A.J.C. (1999). Multi-Net Systems. In Sharkey, A. J. C. (Ed), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, pp. 1-30. London: Springer-Verlag. Sharkey, A.J.C. (2002). Types of Multinet System. In Roli, F. & Kittler, J. (Ed), Proceedings of the Third International Workshop on Multiple Classifier Systems (MCS 2002), pp. 108-117. Berlin, Heidelberg, New York: Springer-Verlag. Liu, Y., Yao, X., Zhao, Q. & Higuchi, T. (2002). An Experimental Comparison of Neural Network Ensemble Learning Methods on Decision Boundaries. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), vol. 1, pp. 221-226. Los Alamitos, CA.: IEEE Computer Society Press. 43 References 21. 22. 23. 24. 25. 26. 27. Kuncheva, L.I. (2002). Switching Between Selection and Fusion in Combining Classifiers: An Experiment. IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 32(2), pp. 146-156. Jordan, M.I. & Jacobs, R.A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, vol. 6(2), pp. 181-214. Liu, Y. & Yao, X. (1999a). Ensemble Learning via Negative Correlation. Neural Networks, vol. 12(10), pp. 1399-1404. Bottou, L. & Gallinari, P. (1991). A Framework for the Cooperation of Learning Algorithms. In Lippmann, R.P., Moody, J.E. & Touretzky, D.S. (Ed), Advances in Neural Information Processing Systems, vol. 3, pp. 781-788. Amari, S.-I. (1995). Information Geometry of the EM and em Algorithms for Neural Networks. Neural Networks, vol. 8(9), pp. 1379-1408. Friedman,J.H. & Popescu,B. (2003). Importance Sampling: An Alternative View of Ensemble Learning. Presented at the 4th International Workshop on Multiple Classifier Systems (MCS 2003). Guildford, UK. Jordan, M.I. & Xu, L. (1995). Convergence Results for the EM Approach to Mixtures of Experts Architectures. Neural Networks, vol. 8, pp. 1409-1431. 44 References 28. 29. 30. 31. 32. Blake,C.L. & Merz,C.J. (1998). UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA.: University of California, Irvine, Department of Information and Computer Sciences. Thrun, S.B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K., Dzeroski, S., Fahlman, S.E., Fisher, D., Hamann, R., Kaufman, K., Keller, S., Kononenko, I., Kreuziger, J., Michalski, R.S., Mitchell, T., Pachowicz, P., Reich, Y., Vafaie, H., van de Welde, W., Wenzel, W., Wnek, J. & Zhang, J. (1991). The MONK's Problems: A Performance Comparison of Different Learning Algorithms. Technical Report CMU-CS-91-197. Pittsburgh, PA.: Carnegie-Mellon University, Computer Science Department. Wolberg, W.H. & Mangasarian, O.L. (1990). Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology. Proceedings of the National Academy of Sciences, USA, vol. 87(23), pp. 9193-9196. Prechelt, L. (1994). Proben1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21 / 94. Karlsruhe, Germany: University of Karlsruhe. Prechelt, L. (1996). Early Stopping - But When? In Orr, G. B. & Müller, K-R. (Ed), Neural Networks: Tricks of the Trade, 1524, pp. 55-69. Berlin, Heidelberg, New York: Springer-Verlag. 45