Herding: The Nonlinear Dynamics of Learning
Max Welling, SCIVI LAB, UC Irvine

Yes, All Models Are Wrong, but…
…from a CS/ML perspective this may not necessarily be such a big problem.
• Training: we want to gain an optimal amount of predictive accuracy per unit of time.
• Testing: we want to engage the model that yields optimal accuracy within the time allowed to make a decision (fight or flight).

• Computer scientists are mostly interested in prediction. Example: ML researchers do not care about identifiability (as long as the model predicts well).
• Computer scientists care a lot about computation. Example: ML researchers are willing to trade off estimation bias for computation if this means we can handle bigger datasets (e.g. variational inference vs. MCMC).

Not Bayesian, Nor Frequentist, But Mandelbrotist…
Is there a deep connection between learning, computation and chaos theory?

Perspective
[Schematic] The standard pipeline is: data x_n \sim p(x) \to inference/learning (with a model / inductive bias) \to estimated model \hat{p}(x) \to integration \to prediction, E_{\hat{P}}[f] \approx \frac{1}{N}\sum_n f(x_n). Herding instead runs a nonlinear dynamical system that generates pseudo-samples S directly; predictions g(S), f(S) are averages over these pseudo-samples, and consistency means they converge to E_{\hat{P}}[f].

Herding
S^* = \arg\max_S \sum_k W_k f_k(S)
W_k \leftarrow W_k + E_{\hat{P}}[f_k] - f_k(S^*)
• The weights do not converge, but the Monte Carlo sums do.
• The maximization does not have to be perfect (see the Perceptron Cycling Theorem).
• Deterministic.
• No step size.
• Only very simple operations (no exponentiation, logarithms, etc.).
(A code sketch of these updates is given further below, after the Topological Entropy slide.)

Ising/Hopfield Model (Network)
\sum_k W_k f_k(s) = \sum_{ij} W_{ij} s_i s_j + \sum_i W_i s_i
s_i^* = \mathrm{sign}\Big(\sum_j W_{ij} s_j + W_i\Big)
W_{ij} \leftarrow W_{ij} + E_{\hat{p}}[s_i s_j] - s_i^* s_j^*
W_i \leftarrow W_i + E_{\hat{p}}[s_i] - s_i^*
• A neuron fires if its input exceeds the threshold.
• A synapse depresses if the pre- and postsynaptic neurons both fire.
• The threshold depresses after the neuron fires.

Pseudo-Samples From a Critical Ising Model
[Figure: pseudo-samples generated by herding a critical Ising model.]

Herding as a Dynamical System
W_{k,t+1} = F(W_{k,t}) = W_{k,t} + E_{\hat{P}}[f_k] - f_k(S_t) — a constant Markov process in W, where
S_t(W_{\cdot,t}) = \arg\max_S \sum_k W_{k,t} f_k(S) is a piecewise constant function of W.
Equivalently, an infinite-memory process in S driven by the data:
S_t = G(S_1, S_2, \ldots, S_{t-1}) = \arg\max_S \sum_k W_{k,t} f_k(S), with W_{k,t} = W_{k,0} + \sum_{i=1}^{t-1}\big(E_{\hat{P}}[f_k] - f_k(S_i)\big).

Example in 2-D
W_{t+1} = W_t + E_{\hat{P}}[f] - f(S_t)
[Figure: six states s = 1, …, 6 partition the 2-D weight space; itinerary s = 1, 1, 2, 5, 2, …]

Convergence
Translation of the weights: v_t = E_{\hat{P}}[f] - f(S_t). Choose S_t such that
\sum_k W_{k,t}\big(E_{\hat{P}}[f_k] - f_k(S_t)\big) \le 0.
Then
\Big|\frac{1}{T}\sum_{t=1}^{T} f_k(S_t) - E_{\hat{P}}[f_k]\Big| \sim O\!\Big(\frac{1}{T}\Big).
This is equivalent to the "Perceptron Cycling Theorem" (Minsky '68).

Period Doubling
As we change R (respectively the temperature T), the number of fixed points changes:
W_{k,t+1} = W_{k,t} + E_{\hat{P}}[f_k] - \frac{\sum_x f_k(x)\,\exp\big(\sum_{k'} W_{k',t} f_{k'}(x)/T\big)}{\sum_x \exp\big(\sum_{k'} W_{k',t} f_{k'}(x)/T\big)}
compared with the logistic map W_{t+1} = R\,W_t(1 - W_t). At T = 0 the update reduces to herding, which sits at the "edge of chaos".

Applications
• Classification
• Compression
• Modeling Default Swaps
• Monte Carlo Integration
• Image Segmentation
• Natural Language Processing
• Social Networks

Example
Classifier from local image features: P(Object Category | Local Image Information).
Classifier from boundary detection: P(Object Categories are Different across Boundary | Boundary Information).
Combine the two with herding: herding will generate samples such that the local probabilities are respected as much as possible (a projection onto the marginal polytope).

Topological Entropy
Theorem [Goetz00]: Let W(T) denote the number of possible subsequences of length T (e.g. S = 1, 3, 2). Then the topological entropy of herding is
h_{\mathrm{top}} = \lim_{T\to\infty} \frac{\log W(T)}{T} \le \lim_{T\to\infty} \frac{w \log T}{T} = 0.
However, we are interested in the sub-extensive entropy [Nemenman et al.]:
h_{\mathrm{subtop}} = \lim_{T\to\infty} \frac{\log W(T)}{\log T} \le \lim_{T\to\infty} \frac{w \log T}{\log T} = w.
Theorem: h_{\mathrm{subtop}} \le K.
Conjecture: h_{\mathrm{subtop}} = K for typical herding systems (K = number of parameters).
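To make the herding updates on the slides above concrete, here is a minimal, self-contained sketch (not code from the talk; the synthetic dataset, the choice d = 4 and all names are illustrative assumptions). It herds a fully-visible binary model with singleton and pairwise features; the state space is small enough to enumerate, so the maximization step is exact.

    # Minimal herding sketch: match empirical moments of a tiny binary dataset.
    # Illustrative only; dataset, sizes and names are assumptions, not the talk's code.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    d = 4                                        # number of binary (+/-1) variables
    data = rng.choice([-1, 1], size=(500, d))    # synthetic "training data"

    def features(s):
        # f(s) = (s_i for all i,  s_i * s_j for all i < j)
        s = np.asarray(s, dtype=float)
        pairs = np.outer(s, s)[np.triu_indices(len(s), k=1)]
        return np.concatenate([s, pairs])

    target = np.mean([features(x) for x in data], axis=0)          # E_Phat[f_k]
    states = [np.array(c) for c in itertools.product([-1, 1], repeat=d)]
    feats = np.array([features(s) for s in states])                # f(S) for every state

    W = np.zeros(feats.shape[1])       # herding weights
    running = np.zeros_like(W)         # running sum of f(S_t)
    for t in range(1, 2001):
        i = int(np.argmax(feats @ W))              # S_t = argmax_S sum_k W_k f_k(S)
        running += feats[i]
        W += target - feats[i]                     # W_k <- W_k + E_Phat[f_k] - f_k(S_t)
        if t % 500 == 0:
            err = np.max(np.abs(running / t - target))
            print(f"t={t:5d}  max moment error = {err:.4f}")

The printed maximum moment error shrinks roughly as O(1/t), in line with the Convergence slide, and the loop uses no step size, no randomness and no exponentiation.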
Learning Systems
Bayesian evidence:
\log P(X) \sim \text{extensive terms} - H[p(\theta \mid X)] \sim \text{extensive terms} - \frac{K}{2}\log N.
The \frac{K}{2}\log N term is the information we learn from the random IID data. Herding is not random and not IID, due to its negative auto-correlations; the information in its sequence is K \log N. We can therefore represent the original (random) data sample by a much smaller subset without loss of information content (N instead of N^2 samples). These shorter herding sequences can be used to efficiently approximate averages by Monte Carlo sums (see the numerical sketch at the end).

Conclusions
• Herding is an efficient alternative for learning in MRFs.
• Edge-of-chaos dynamics provides more efficient information processing than random sampling.
• Is this a general principle that underlies information processing in the brain?
• We advocate exploring the potentially interesting connections between computation, learning and the theory of nonlinear dynamical systems and chaos.

What can we learn from viewing learning as a nonlinear dynamical process?
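As a numerical companion to the claim that N herding pseudo-samples carry roughly the information of N^2 random samples, the sketch below (again an illustration under assumed toy settings, not material from the talk) herds a small discrete distribution with indicator features and compares the moment-matching error of herding pseudo-samples against i.i.d. Monte Carlo draws; the former decays roughly as O(1/T), the latter as O(1/sqrt(T)).

    # Herding vs. i.i.d. sampling on a toy 4-state distribution (illustrative assumptions).
    import numpy as np

    rng = np.random.default_rng(1)
    p = np.array([0.1, 0.2, 0.3, 0.4])   # toy target distribution over 4 states
    f = np.eye(4)                        # indicator features: f_k(s) = 1 if s == k else 0
    target = p.copy()                    # E_p[f_k] = p_k

    W = np.zeros(4)                      # herding weights
    herd_sum = np.zeros(4)               # running feature sum for herding pseudo-samples
    mc_sum = np.zeros(4)                 # running feature sum for i.i.d. draws
    for t in range(1, 10001):
        s = int(np.argmax(W))            # S_t = argmax_s sum_k W_k f_k(s) = argmax of W
        herd_sum += f[s]
        W += target - f[s]               # W_k <- W_k + E_p[f_k] - f_k(S_t)
        x = rng.choice(4, p=p)           # i.i.d. Monte Carlo draw from p
        mc_sum += f[x]
        if t in (100, 1000, 10000):
            herd_err = np.max(np.abs(herd_sum / t - target))
            mc_err = np.max(np.abs(mc_sum / t - target))
            print(f"T={t:6d}  herding error {herd_err:.5f}   i.i.d. error {mc_err:.5f}")

With indicator features the argmax simply picks the state with the largest weight, so this degenerate case also doubles as a deterministic way to "sample" a discrete distribution while matching its probabilities at the faster O(1/T) rate.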