NEATwise: Evolving Heterogeneous Artificial Neural Networks through Combinatorially Generated Piecewise Activation Functions
Justin Chen & Gabriele Giaroli
April 21, 2016

Introduction
- Our original idea: optimize the NEAT algorithm by introducing piecewise activation functions ("NEATwise")
- Results:
  - NEATwise performs comparably to original NEAT on its hardest benchmark
  - Combinatorially generated piecewise activation functions are competitive with conventional activation functions

Motivation
- Yann LeCun, Yoshua Bengio & Geoffrey Hinton, "Deep Learning" (2015), Nature
- ReLU activation functions reported as the most effective choice for CNNs
- Combinatorially generated piecewise activation functions are unexplored
- Piecewise functions model resting and activated states (example activation function: ReLU)

What is NEAT
- NEAT stands for "NeuroEvolution of Augmenting Topologies", developed by Professors Kenneth O. Stanley and Risto Miikkulainen in 2001
- Genetic algorithm for stochastic reinforcement learning control tasks
- Searches a protean n-polytopic space
- NEAT's three main characteristics:
  1. Growing from minimal structure
  2. Principled crossover
  3. Speciation

What is NEATwise
- NEATwise = NEAT + piecewise activation functions
- Sample pool of canonical activation functions: ReLU, ELU, Bent Identity, ArcTan, TanH, Sine
- Combinatorially generated piecewise activation functions produce heterogeneous networks (a sketch follows the canonical function list below)
- Capable of combining differentiable and nondifferentiable functions (e.g., ArcTan and Sine)

Piecewise Approach
Initially:
- Selected canonical activation functions that could be combined into piecewise differentiable functions
- Analyzed the NEAT code and made modifications to create NEATwise
- Realized that a uniform selection function makes NEAT fail on the Double Pole Balancing No Velocity (DPNV) benchmark
Then:
- Changed approach: a restricted version of NEATwise
- Tested original NEAT with each canonical function to find the best performing canonical functions
- Finally, thoroughly tested Sigmoid/Arctan-NEATwise (SA-NEATwise) on DPNV

Canonical Activation Functions
1. Sigmoid
2. Tanh (Hyperbolic Tangent)
3. ReLU (Rectified Linear Unit)
4. Arctan
5. ELU (Exponential Linear Unit)
6. BentId (Bent Identity)
7. Sine
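For concreteness, here is a minimal Python sketch of how piecewise activations can be generated combinatorially from the canonical pool above. It is not the authors' NEAT code base; the function definitions, the split point at x = 0, and the uniform pairing are illustrative assumptions.

```python
import math
import random

# The seven canonical activation functions used as building blocks.
# Exact constants in the NEATwise implementation may differ.
CANONICAL = {
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh":    math.tanh,
    "relu":    lambda x: max(0.0, x),
    "arctan":  math.atan,
    "elu":     lambda x: x if x >= 0.0 else math.exp(x) - 1.0,
    "bentid":  lambda x: (math.sqrt(x * x + 1.0) - 1.0) / 2.0 + x,
    "sine":    math.sin,
}

def make_piecewise(left_name, right_name, split=0.0):
    """Combine two canonical functions into one piecewise activation:
    the 'left' function applies below the split point, the 'right' function
    at or above it.  The result may be discontinuous at the split."""
    left, right = CANONICAL[left_name], CANONICAL[right_name]
    def activation(x):
        return left(x) if x < split else right(x)
    activation.__name__ = f"{left_name}-{right_name}"
    return activation

def random_piecewise(rng=random):
    """Draw an ordered pair of canonical functions and combine them;
    7 canonical functions give 7 * 7 = 49 ordered pairs (same-function
    pairs reduce to the canonical function itself)."""
    return make_piecewise(rng.choice(list(CANONICAL)), rng.choice(list(CANONICAL)))
```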
Uniform Random Selection
- NEATwise fails when selecting activation functions uniformly at random
- Uniform random noise destroys the genetic algorithm's search bias
- Activation function selection requires a bias
- Redesigned our approach to use homogeneous networks as a baseline
- The activation functions of the two best performing homogeneous networks, Arctan and Sigmoid, were selected for further study

Homogeneous Network Baselines
- 7 canonical functions
- Version: NEAT.1.2.1
- Test: Double Pole Balancing No Velocity
- Number of tests: 20 per function
- Parameter file: p2nv.ne
(Figure: example homogeneous ELU network)

Homogeneous Network Baselines
Average accuracy for each function over 20 runs of original NEAT (the out-of-box configuration was later run 100 times for a fair comparison):

Function                  Accuracy
Out-of-box (default NEAT) 96.70%
Arctan                    94.20%
ELU                       87.90%
Sine                      87.40%
Tanh                      84.10%
BentId                    82.05%
ReLU                      80.00%
Sigmoid                   54.20%

Benchmark
Double Pole Balancing No Velocity (DPNV):
- Main benchmark used by original NEAT
- Hardest of the pole balancing tests
Easier variants:
- Single Pole with Velocity
- Single Pole without Velocity
- Double Pole experiment (Double Pole with Velocity)
- XOR approximation

Performance Metrics
- Metrics: evaluations, generalization, network size, activation function distribution
- 100 experiments per instance, 100 instances
Example test results:
- Failures: 0 out of 100 runs
- Evaluation info: total evaluations 2,968,877; total samples 100; average evaluations 29,688.8
- Generalization info: total generalization score 26,741; total generalization champs 100; average generalization 267
- Network size info: total nodes 17,923; total samples 3,026; average network size 5.923

Sigmoid-Arctan NEATwise III
- Activation functions: Arctan(x) & Sigmoid(x)
- 50% bias towards Arctan, 50% bias towards Sigmoid
- Slope: unaltered
- Discontinuous
- Piecewise functions: Sigmoid-Arctan, Arctan-Sigmoid
- Success: 0 experiments; Failure: 100 experiments

50-50 SA-NEATwise 1, Genome 468 (population statistics; damaged genome)
- Nodes 1 & 262: Arctan
- Nodes 2, 4 & 5: Sigmoid
- Nodes 3 & 10: Arctan-Sigmoid
- Node 155: Sigmoid-Arctan

Sigmoid-Arctan NEATwise II
- Activation functions: Arctan(x) & Sigmoid(x)
- 87.5% bias towards Arctan, 12.5% bias towards Sigmoid
- Slope: unaltered
- Discontinuous
- Piecewise functions: Sigmoid-Arctan, Arctan-Sigmoid
- Success: 87 experiments; Failure: 13 experiments

Sigmoid-Arctan NEATwise 1, Genome 350 (population statistics; damaged genome)
- Nodes 1, 2, 4, 5 & 87: Arctan
- Node 3: Sigmoid-Arctan
- Node 45: Arctan-Sigmoid

Sigmoid-Arctan NEATwise 88, Genome 175 (population statistics)
- Accuracy: 99%
- Avg network size: 6.04
- Avg evals: 35,984.10
- Avg generalization: 263.00
Genome 175:
- Nodes 1, 3, 4, 6, 158 & 388: Arctan
- Node 2: Sigmoid-Arctan

Sigmoid-Arctan NEATwise I
- Activation functions: Arctan(x) & Sigmoid(4.92*x) - 0.5
- 87.5% bias towards Arctan, 12.5% bias towards Sigmoid
- Continuous
- Success: 88 experiments; Failure: 12 experiments
- Piecewise functions: Sigmoid-Arctan, Arctan-Sigmoid
(A sketch of the biased selection and the continuous Sigmoid-Arctan piece follows the MARI/O slide below.)

Sigmoid-Arctan NEATwise 17, Genome 177 (population statistics; damaged genome)
- Nodes 1 & 2: Arctan
- Nodes 4, 5 & 46: Sigmoid
- Node 3: Sigmoid-Arctan

Sigmoid-Arctan NEATwise 27, Genome 14 (population statistics)
- Accuracy: 100%
- Avg network size: 6.53
- Avg evals: 33,701.00
- Avg generalization: 271.00
Genome 14:
- Nodes 1 & 2: Arctan-Sigmoid

NEATwise MARI/O
- Generations: 274
- Max fitness: 4115
- Training: 5 days
- 87.5% bias for Arctan, 12.5% bias for Sigmoid
- 4.924273 slope for Sigmoid
- Discontinuous
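As a concrete reading of SA-NEATwise I, the sketch below shows the biased 87.5/12.5 draw between the two canonical pieces and the continuous Sigmoid-Arctan combination. This is an assumption-laden Python illustration rather than the authors' implementation: the 4.924273 slope is taken from the MARI/O slide, the draw is modelled as a single coin flip per choice, and which piece covers which side of zero is not specified on the slides.

```python
import math
import random

SIGMOID_SLOPE = 4.924273  # steepened-sigmoid slope reported on the MARI/O slide

def shifted_sigmoid(x):
    # Sigmoid(4.92 * x) - 0.5: the -0.5 shift makes f(0) = 0, matching
    # arctan(0) = 0, so a piecewise combination of the two is continuous at 0.
    return 1.0 / (1.0 + math.exp(-SIGMOID_SLOPE * x)) - 0.5

def arctan(x):
    return math.atan(x)

def pick_activation(arctan_bias=0.875, rng=random):
    # Biased draw between the two canonical pieces: 87.5% Arctan, 12.5%
    # Sigmoid.  The uniform 50/50 draw of SA-NEATwise III is the setting
    # that failed on all 100 DPNV experiments.
    return arctan if rng.random() < arctan_bias else shifted_sigmoid

def sigmoid_arctan(x):
    # Continuous Sigmoid-Arctan piecewise candidate (side assignment assumed).
    return shifted_sigmoid(x) if x < 0.0 else arctan(x)

def arctan_sigmoid(x):
    # The mirrored ordering, Arctan-Sigmoid.
    return arctan(x) if x < 0.0 else shifted_sigmoid(x)
```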
NEAT vs. NEATwise

Method        Accuracy  Evaluations  Generalization  No. Nodes
NEAT          96.70%    28,925       269             5.72
NEATwise I    97.27%    36,452       279             6.56
NEATwise II   98.39%    36,792       266             6.28
NEATwise III  0.00%     0            0               0
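The averaged columns above are computed the same way as the totals on the Performance Metrics slide: total divided by the number of samples. Below is a minimal Python sketch of that aggregation with illustrative argument names (not taken from the NEATwise code base); the accuracy column is derived from the success/failure counts and is not recomputed here.

```python
from statistics import mean

def summarize(evaluations, generalization_scores, network_sizes):
    """Aggregate raw logs into the averages reported in the comparison table.
    Arguments are flat lists: evaluations and generalization scores have one
    entry per run (100 runs per instance), while network sizes have one entry
    per sampled network (3,026 in the example test results)."""
    return {
        "total_evaluations": sum(evaluations),
        "avg_evaluations": mean(evaluations),
        "avg_generalization": mean(generalization_scores),
        "avg_network_size": mean(network_sizes),
    }

# Sanity check against the Performance Metrics slide:
#   2,968,877 evaluations / 100 samples      -> 29,688.8 avg evaluations
#   26,741 generalization score / 100 champs -> ~267 avg generalization
#   17,923 nodes / 3,026 sampled networks    -> 5.923 avg network size
```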
Future work
- Gaussian NEATwise to select among N activation functions
- Expanding the choices for canonical function selection
- Analyzing the optimal bias for choosing each activation function of the piecewise
- Multi-piece piecewise activation functions
- Combinatorially generated piecewise activation functions in different architectures
- Analyzing network structures to explain hidden failures
- As the problem domain changes, how do you determine the optimal nontopological hyperparameters, such as drop-off age, mutation rate, and initial population size?

Conclusions
1. Combinatorially generated piecewise activation functions outperform conventional choices
2. Automatic model selection
3. Enhanced expressive power
4. NEATwise does not require parameter-tuned slopes
5. Slightly decreased average generalization
6. Requires negligibly more evaluations
7. Adaptable to various problem domains

References
1. Stanley, Kenneth O., and Risto Miikkulainen. "Evolving neural networks through augmenting topologies." Evolutionary Computation 10.2 (2002): 99-127.
2. LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
3. Stanley, Kenneth O. "Compositional pattern producing networks: A novel abstraction of development." Genetic Programming and Evolvable Machines 8.2 (2007): 131-162.
4. Nair, Vinod, and Geoffrey E. Hinton. "Rectified linear units improve restricted Boltzmann machines." Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.
5. Stanley, Kenneth O., Bobby D. Bryant, and Risto Miikkulainen. "Evolving adaptive neural networks with and without adaptive synapses." Proceedings of the 2003 Congress on Evolutionary Computation (CEC'03). Vol. 4. IEEE, 2003.
6. Stanley, Kenneth O., David B. D'Ambrosio, and Jason Gauci. "A hypercube-based encoding for evolving large-scale neural networks." Artificial Life 15.2 (2009): 185-212.
7. Agostinelli, Forest, et al. "Learning activation functions to improve deep neural networks." arXiv preprint arXiv:1412.6830 (2014).
8. Yao, Xin. "Evolving artificial neural networks." Proceedings of the IEEE 87.9 (1999): 1423-1447.
9. Turner, Andrew James, and Julian Francis Miller. "NeuroEvolution: Evolving heterogeneous artificial neural networks." Evolutionary Intelligence 7.3 (2014): 135-154.
10. White, D., and P. Ligomenides. "GANNet: A genetic algorithm for optimizing topology and weights in neural network design." Proceedings of the International Workshop on Artificial Neural Networks (IWANN'93), Lecture Notes in Computer Science, vol. 686. Berlin: Springer-Verlag, 1993. 322-327.
11. Liu, Y., and X. Yao. "Evolutionary design of artificial neural networks with different nodes." Proceedings of the 1996 IEEE International Conference on Evolutionary Computation (ICEC'96), Nagoya, Japan. 670-675.
12. Hwang, M. W., J. Y. Choi, and J. Park. "Evolutionary projection neural networks." Proceedings of the 1997 IEEE International Conference on Evolutionary Computation (ICEC'97). 667-671.
13. Whitley, Darrell. "A genetic algorithm tutorial." Statistics and Computing 4.2 (1994): 65-85.
14. Duch, Włodzisław, and Norbert Jankowski. "Survey of neural transfer functions." Neural Computing Surveys 2.1 (1999): 163-212.
15. Duch, Włodzisław, and Norbert Jankowski. "Transfer functions: hidden possibilities for better neural networks." ESANN. 2001.
16. Weingaertner, Daniel, et al. "Hierarchical evolution of heterogeneous neural networks." Proceedings of the 2002 Congress on Evolutionary Computation (CEC'02). Vol. 2. IEEE, 2002.
17. Turner, Andrew James, and Julian Francis Miller. "Cartesian genetic programming encoded artificial neural networks: a comparison using three benchmarks." Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. ACM, 2013.
18. Sebald, Anthony V., and Kumar Chellapilla. "On making problems evolutionarily friendly part 1: evolving the most convenient representations." Evolutionary Programming VII. Springer Berlin Heidelberg, 1998.
19. Annunziato, M., et al. "Evolving weights and transfer functions in feed forward neural networks." Proc. EUNITE 2003, Oulu, Finland (2003).
20. Duch, Włodzisław, and Norbert Jankowski. "New neural transfer functions." Applied Mathematics and Computer Science 7 (1997): 639-658.
21. Yao, Xin, and Yong Liu. "Ensemble structure of evolutionary artificial neural networks." Proceedings of the IEEE International Conference on Evolutionary Computation. IEEE, 1996.
22. Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." International Conference on Artificial Intelligence and Statistics. 2011.