Herding: The Nonlinear Dynamics of Learning
Max Welling, SCIVI LAB, UC Irvine

Yes, All Models Are Wrong, but…
…from a CS/ML perspective this may not necessarily be such a big problem.
• Training: we want to gain an optimal amount of predictive accuracy per unit of time.
• Testing: we want to engage the model that yields optimal accuracy within the time allowed to make a decision (fight or flight).

• Computer scientists are mostly interested in prediction. Example: ML researchers do not care about identifiability (as long as the model predicts well).
• Computer scientists care a lot about computation. Example: ML researchers are willing to trade off estimation bias for computation if this means we can handle bigger datasets (e.g. variational inference vs. MCMC).

Not Bayesian, Nor Frequentist, But Mandelbrotist…
Is there a deep connection between learning, computation and chaos theory?

Perspective
[Schematic] The standard pipeline is: data x_n \sim p(x) \to inference/learning (with a model / inductive bias) \to estimated model \hat{p}(x) \to integration \to prediction, E_{\hat{P}}[f] \approx \frac{1}{N}\sum_n f(x_n). Herding instead runs a nonlinear dynamical system that generates pseudo-samples S directly; predictions g(S), f(S) are averages over these pseudo-samples, and consistency means they converge to E_{\hat{P}}[f].

Herding
S^* = \arg\max_S \sum_k W_k f_k(S)
W_k \leftarrow W_k + E_{\hat{P}}[f_k] - f_k(S^*)
• The weights do not converge, but the Monte Carlo sums do.
• The maximization does not have to be perfect (see the Perceptron Cycling Theorem).
• Deterministic.
• No step size.
• Only very simple operations (no exponentiation, logarithms, etc.).
(A code sketch of these updates is given further below, after the Topological Entropy slide.)

Ising/Hopfield Model (Network)
\sum_k W_k f_k(s) = \sum_{ij} W_{ij} s_i s_j + \sum_i W_i s_i
s_i^* = \mathrm{sign}\Big(\sum_j W_{ij} s_j + W_i\Big)
W_{ij} \leftarrow W_{ij} + E_{\hat{p}}[s_i s_j] - s_i^* s_j^*
W_i \leftarrow W_i + E_{\hat{p}}[s_i] - s_i^*
• A neuron fires if its input exceeds the threshold.
• A synapse depresses if the pre- and postsynaptic neurons both fire.
• The threshold depresses after the neuron fires.

Pseudo-Samples From a Critical Ising Model
[Figure: pseudo-samples generated by herding a critical Ising model.]

Herding as a Dynamical System
W_{k,t+1} = F(W_{k,t}) = W_{k,t} + E_{\hat{P}}[f_k] - f_k(S_t) — a constant Markov process in W, where
S_t(W_{\cdot,t}) = \arg\max_S \sum_k W_{k,t} f_k(S) is a piecewise constant function of W.
Equivalently, an infinite-memory process in S driven by the data:
S_t = G(S_1, S_2, \ldots, S_{t-1}) = \arg\max_S \sum_k W_{k,t} f_k(S), with W_{k,t} = W_{k,0} + \sum_{i=1}^{t-1}\big(E_{\hat{P}}[f_k] - f_k(S_i)\big).

Example in 2-D
W_{t+1} = W_t + E_{\hat{P}}[f] - f(S_t)
[Figure: six states s = 1, …, 6 partition the 2-D weight space; itinerary s = 1, 1, 2, 5, 2, …]

Convergence
Translation of the weights: v_t = E_{\hat{P}}[f] - f(S_t). Choose S_t such that
\sum_k W_{k,t}\big(E_{\hat{P}}[f_k] - f_k(S_t)\big) \le 0.
Then
\Big|\frac{1}{T}\sum_{t=1}^{T} f_k(S_t) - E_{\hat{P}}[f_k]\Big| \sim O\!\Big(\frac{1}{T}\Big).
This is equivalent to the "Perceptron Cycling Theorem" (Minsky '68).

Period Doubling
As we change R (respectively the temperature T), the number of fixed points changes:
W_{k,t+1} = W_{k,t} + E_{\hat{P}}[f_k] - \frac{\sum_x f_k(x)\,\exp\big(\sum_{k'} W_{k',t} f_{k'}(x)/T\big)}{\sum_x \exp\big(\sum_{k'} W_{k',t} f_{k'}(x)/T\big)}
compared with the logistic map W_{t+1} = R\,W_t(1 - W_t). At T = 0 the update reduces to herding, which sits at the "edge of chaos".

Applications
• Classification
• Compression
• Modeling Default Swaps
• Monte Carlo Integration
• Image Segmentation
• Natural Language Processing
• Social Networks

Example
Classifier from local image features: P(Object Category | Local Image Information).
Classifier from boundary detection: P(Object Categories are Different across Boundary | Boundary Information).
Combine the two with herding: herding will generate samples such that the local probabilities are respected as much as possible (a projection onto the marginal polytope).

Topological Entropy
Theorem [Goetz00]: Let W(T) denote the number of possible subsequences of length T (e.g. S = 1, 3, 2). Then the topological entropy of herding is
h_{\mathrm{top}} = \lim_{T\to\infty} \frac{\log W(T)}{T} \le \lim_{T\to\infty} \frac{w \log T}{T} = 0.
However, we are interested in the sub-extensive entropy [Nemenman et al.]:
h_{\mathrm{subtop}} = \lim_{T\to\infty} \frac{\log W(T)}{\log T} \le \lim_{T\to\infty} \frac{w \log T}{\log T} = w.
Theorem: h_{\mathrm{subtop}} \le K.
Conjecture: h_{\mathrm{subtop}} = K for typical herding systems (K = number of parameters).
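To make the herding updates on the slides above concrete, here is a minimal, self-contained sketch (not code from the talk; the synthetic dataset, the choice d = 4 and all names are illustrative assumptions). It herds a fully-visible binary model with singleton and pairwise features; the state space is small enough to enumerate, so the maximization step is exact.

    # Minimal herding sketch: match empirical moments of a tiny binary dataset.
    # Illustrative only; dataset, sizes and names are assumptions, not the talk's code.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    d = 4                                        # number of binary (+/-1) variables
    data = rng.choice([-1, 1], size=(500, d))    # synthetic "training data"

    def features(s):
        # f(s) = (s_i for all i,  s_i * s_j for all i < j)
        s = np.asarray(s, dtype=float)
        pairs = np.outer(s, s)[np.triu_indices(len(s), k=1)]
        return np.concatenate([s, pairs])

    target = np.mean([features(x) for x in data], axis=0)          # E_Phat[f_k]
    states = [np.array(c) for c in itertools.product([-1, 1], repeat=d)]
    feats = np.array([features(s) for s in states])                # f(S) for every state

    W = np.zeros(feats.shape[1])       # herding weights
    running = np.zeros_like(W)         # running sum of f(S_t)
    for t in range(1, 2001):
        i = int(np.argmax(feats @ W))              # S_t = argmax_S sum_k W_k f_k(S)
        running += feats[i]
        W += target - feats[i]                     # W_k <- W_k + E_Phat[f_k] - f_k(S_t)
        if t % 500 == 0:
            err = np.max(np.abs(running / t - target))
            print(f"t={t:5d}  max moment error = {err:.4f}")

The printed maximum moment error shrinks roughly as O(1/t), in line with the Convergence slide, and the loop uses no step size, no randomness and no exponentiation.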
Learning Systems
Bayesian evidence:
\log P(X) \sim \text{extensive terms} - H[p(\theta \mid X)] \sim \text{extensive terms} - \frac{K}{2}\log N.
The \frac{K}{2}\log N term is the information we learn from the random IID data. Herding is not random and not IID, due to its negative auto-correlations; the information in its sequence is K \log N. We can therefore represent the original (random) data sample by a much smaller subset without loss of information content (N instead of N^2 samples). These shorter herding sequences can be used to efficiently approximate averages by Monte Carlo sums (see the numerical sketch at the end).

Conclusions
• Herding is an efficient alternative for learning in MRFs.
• Edge-of-chaos dynamics provides more efficient information processing than random sampling.
• Is this a general principle that underlies information processing in the brain?
• We advocate exploring the potentially interesting connections between computation, learning and the theory of nonlinear dynamical systems and chaos.

What can we learn from viewing learning as a nonlinear dynamical process?
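As a numerical companion to the claim that N herding pseudo-samples carry roughly the information of N^2 random samples, the sketch below (again an illustration under assumed toy settings, not material from the talk) herds a small discrete distribution with indicator features and compares the moment-matching error of herding pseudo-samples against i.i.d. Monte Carlo draws; the former decays roughly as O(1/T), the latter as O(1/sqrt(T)).

    # Herding vs. i.i.d. sampling on a toy 4-state distribution (illustrative assumptions).
    import numpy as np

    rng = np.random.default_rng(1)
    p = np.array([0.1, 0.2, 0.3, 0.4])   # toy target distribution over 4 states
    f = np.eye(4)                        # indicator features: f_k(s) = 1 if s == k else 0
    target = p.copy()                    # E_p[f_k] = p_k

    W = np.zeros(4)                      # herding weights
    herd_sum = np.zeros(4)               # running feature sum for herding pseudo-samples
    mc_sum = np.zeros(4)                 # running feature sum for i.i.d. draws
    for t in range(1, 10001):
        s = int(np.argmax(W))            # S_t = argmax_s sum_k W_k f_k(s) = argmax of W
        herd_sum += f[s]
        W += target - f[s]               # W_k <- W_k + E_p[f_k] - f_k(S_t)
        x = rng.choice(4, p=p)           # i.i.d. Monte Carlo draw from p
        mc_sum += f[x]
        if t in (100, 1000, 10000):
            herd_err = np.max(np.abs(herd_sum / t - target))
            mc_err = np.max(np.abs(mc_sum / t - target))
            print(f"T={t:6d}  herding error {herd_err:.5f}   i.i.d. error {mc_err:.5f}")

With indicator features the argmax simply picks the state with the largest weight, so this degenerate case also doubles as a deterministic way to "sample" a discrete distribution while matching its probabilities at the faster O(1/T) rate.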