Learning with Weakly Chaotic Nonlinear Dynamical Systems

Herding:
The Nonlinear Dynamics of Learning
Max Welling
SCIVI LAB - UC Irvine
Yes, All Models Are Wrong, but…
…from a CS/ML perspective this may not necessarily be such a big problem.
• Training: We want to gain an optimal amount of predictive accuracy per unit time.
• Testing: We want to engage the model that results in optimal accuracy within the
time allowed to make a decision.
Fight or flight
• Computer scientists are mostly interested in prediction.
Example: ML practitioners do not care about identifiability (as long as the model predicts well).
• Computer scientists care a lot about computation.
Example: ML practitioners are willing to trade off estimation bias for computation
(if this means we can handle bigger datasets, e.g. variational inference vs. MCMC).
Neither Bayesian Nor Frequentist
But Mandelbrotist…

Is there a deep connection between learning, computation and chaos theory?
Perspective
[Diagram] Standard pipeline: data $x_n \sim p(x)$ → learning (model / inductive bias) → inference and prediction by integration,
$$\int p(x)\, f(x)\, dx \;\approx\; \frac{1}{N}\sum_n f(x_n).$$
Herding maps the data directly to pseudo-samples, which are then used to approximate $E_{\hat P}[f]$ for prediction.
Herding
[Diagram] Herding is a nonlinear dynamical system that generates pseudo-samples "S". Averages of functions $g(S)$ over the pseudo-samples are used for prediction; for the features themselves, the pseudo-sample averages of $f(S)$ match $E_{\hat P}[f]$ (consistency).
Herding


$$S^* \;=\; \arg\max_{S}\ \sum_k W_k\, f_k(S)$$
$$W_k \;\leftarrow\; W_k \;+\; E_{\hat P}[f_k] \;-\; f_k(S^*)$$
• Weights do not converge, but Monte Carlo sums do.
• Maximization does not have to be perfect (see PCT theorem).
• Deterministic
• No step-size
• Only very simple operations (no exponentiation, logarithms etc.)
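Below is a minimal sketch (Python with NumPy) of the update pair above, assuming a state space small enough to enumerate and user-supplied `features` and `target_moments` (both hypothetical placeholders, not from the slides); the maximization is done by brute force:

```python
import numpy as np

def herd(states, features, target_moments, T=1000):
    """Run the herding updates for T steps and return the pseudo-sample itinerary.

    states:         list of candidate states S (small enough to enumerate)
    features:       function mapping a state to its feature vector f(S)
    target_moments: data averages E_Phat[f_k] to be matched
    """
    F = np.array([features(s) for s in states])   # f_k(S) for every candidate S
    W = np.array(target_moments, dtype=float)     # initialize weights (W_0 = E[f] is one common choice)
    samples = []
    for _ in range(T):
        s_idx = int(np.argmax(F @ W))             # S* = argmax_S sum_k W_k f_k(S)
        samples.append(states[s_idx])
        W += target_moments - F[s_idx]            # W_k <- W_k + E_Phat[f_k] - f_k(S*)
    return samples

# Toy usage: six states with 2-D features on a circle; the moments are made-up numbers.
states = list(range(6))
feats = lambda s: np.array([np.cos(2 * np.pi * s / 6), np.sin(2 * np.pi * s / 6)])
moments = np.array([0.1, 0.2])
print(herd(states, feats, moments, T=20))         # deterministic itinerary of state indices
```

The loop is deterministic, has no step size, and uses only additions, comparisons, and an argmax, matching the bullets above.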
Ising/Hopfield Model Network
$$\sum_k W_k\, f_k(S) \;=\; \sum_{ij} W_{ij}\, s_i s_j \;+\; \sum_i W_i\, s_i$$
$$s_i^* \;=\; \operatorname{sign}\!\Big(\sum_j W_{ij}\, s_j \;+\; W_i\Big)$$
$$W_{ij} \;\leftarrow\; W_{ij} \;+\; E_{\hat p}[s_i s_j] \;-\; s_i^*\, s_j^*$$
$$W_i \;\leftarrow\; W_i \;+\; E_{\hat p}[s_i] \;-\; s_i^*$$
• Neuron fires if input exceeds threshold.
• Synapse depresses if pre- & postsynaptic neurons fire.
• Threshold depresses after neuron fires.
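A sketch of the same loop for this Ising/Hopfield case, assuming ±1 spins and target moments `E_pair` / `E_single` supplied by the user (illustrative names, not from the slides); a few coordinate-ascent sweeps stand in for the exact maximization, which the PCT bullet above says is permissible:

```python
import numpy as np

def herd_ising(E_pair, E_single, T=500, sweeps=5, seed=0):
    """Herding pseudo-samples for a fully connected Ising/Hopfield model with +/-1 spins.

    E_pair:   matrix of target pair moments   E_phat[s_i s_j]
    E_single: vector of target single moments E_phat[s_i]
    """
    n = len(E_single)
    rng = np.random.default_rng(seed)
    W = np.array(E_pair, dtype=float)            # dynamic pairwise weights W_ij
    w = np.array(E_single, dtype=float)          # dynamic bias weights W_i
    s = rng.choice([-1.0, 1.0], size=n)          # arbitrary initial spin configuration
    samples = []
    for _ in range(T):
        # Approximate maximization of sum_ij W_ij s_i s_j + sum_i W_i s_i by a few
        # coordinate-ascent sweeps: set each spin to the sign of its input
        # ("neuron fires if input exceeds threshold").
        for _ in range(sweeps):
            for i in range(n):
                inp = W[i] @ s - W[i, i] * s[i] + w[i]
                s[i] = 1.0 if inp >= 0 else -1.0
        samples.append(s.copy())
        # Weight updates: W_ij += E[s_i s_j] - s_i* s_j* ;  W_i += E[s_i] - s_i*
        W += E_pair - np.outer(s, s)
        w += E_single - s
    return samples
```

Per step this uses only sign tests, additions, and an outer product, which is what licenses the neural reading in the bullets above.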
Pseudo-Samples From Critical Ising Model
Herding as a Dynamical System
$$W_{k,t+1} \;=\; F(W_{k,t}) \;=\; W_{k,t} \;+\; E_{\hat P}[f_k] \;-\; f_k(S_t)$$
(a Markov process in $W$; the term $E_{\hat P}[f_k]$ is constant)
$$S_t(W_t) \;=\; \arg\max_S \sum_k W_{k,t}\, f_k(S)$$
(a piecewise constant function of $W$)
Unrolling the recursion gives an infinite-memory process in $S$:
$$S_t \;=\; G(S_1, S_2, \ldots, S_{t-1}) \;=\; \arg\max_S \sum_k W_{k,t}\, f_k(S), \qquad W_{k,t} \;=\; W_{k,0} \;+\; \sum_{i=1}^{t-1}\Big(E_{\hat P}[f_k] - f_k(S_i)\Big)$$
Example in 2-D
[Figure] Six candidate states s = 1, …, 6 in the 2-D feature plane; the weight vector evolves under
$$W_{t+1} \;=\; W_t \;+\; E_{\hat P}[f] \;-\; f(S_t)$$
and produces the itinerary s = [1, 1, 2, 5, 2, …].
Convergence
Translation: $v_{k,t} = E_{\hat P}[f_k] - f_k(S_t)$.
Choose $S_t$ such that:
$$\sum_k W_k\Big(E_{\hat P}[f_k] - f_k(S_t)\Big) \;\le\; 0$$
Then:
$$\Big|\,\frac{1}{T}\sum_{t=1}^{T} f_k(S_t) \;-\; E_{\hat P}[f_k]\,\Big| \;\sim\; O\!\Big(\frac{1}{T}\Big)$$
Equivalent to “Perceptron Cycling Theorem”
(Minsky ’68)
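The $O(1/T)$ rate follows directly from the weight recursion on the previous slide (a sketch, assuming the weights stay bounded, which is exactly what the condition on $S_t$ / the PCT guarantees):
$$W_{k,T+1} - W_{k,1} \;=\; \sum_{t=1}^{T}\Big(E_{\hat P}[f_k] - f_k(S_t)\Big) \;=\; T\,E_{\hat P}[f_k] \;-\; \sum_{t=1}^{T} f_k(S_t),$$
so
$$\Big|\,\frac{1}{T}\sum_{t=1}^{T} f_k(S_t) \;-\; E_{\hat P}[f_k]\,\Big| \;=\; \frac{\big|W_{k,T+1} - W_{k,1}\big|}{T} \;=\; O\!\Big(\frac{1}{T}\Big).$$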
Period Doubling
As we change $R$ (respectively $T$), the number of fixed points changes.

At temperature $T$ the herding maximization is replaced by an expectation under the tempered model $p_t(x) \propto \exp\!\big(\sum_{k'} W_{k',t}\, f_{k'}(x)/T\big)$:
$$W_{k,t+1} \;=\; W_{k,t} \;+\; E_{\hat P}[f_k] \;-\; \frac{\sum_x f_k(x)\,\exp\!\big(\sum_{k'} W_{k',t}\, f_{k'}(x)/T\big)}{\sum_x \exp\!\big(\sum_{k'} W_{k',t}\, f_{k'}(x)/T\big)}$$
Compare with the logistic map:
$$W_{t+1} \;=\; R\, W_t\,(1 - W_t)$$
$T = 0$: herding ("edge of chaos")
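A minimal sketch of the period-doubling behaviour of the logistic map quoted above; the particular values of $R$ and the iteration counts are arbitrary illustrative choices:

```python
import numpy as np

def logistic_attractor(R, w0=0.3, burn_in=1000, keep=64):
    """Iterate W_{t+1} = R * W_t * (1 - W_t) and return the states visited
    after a burn-in, rounded so periodic orbits collapse to a small set."""
    w = w0
    for _ in range(burn_in):
        w = R * w * (1 - w)
    visited = []
    for _ in range(keep):
        w = R * w * (1 - w)
        visited.append(round(w, 6))
    return sorted(set(visited))

# As R grows the attractor doubles: 1 fixed point, then 2, 4, ... then chaos.
for R in [2.8, 3.2, 3.5, 3.9]:
    pts = logistic_attractor(R)
    print(f"R = {R}: {len(pts)} distinct points")
```

As $R$ increases the attractor doubles from one fixed point to 2, 4, … points and eventually becomes chaotic; the herding analogue of interest is the boundary regime, the "edge of chaos" named on the slide.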
Applications
• Classification
• Compression
• Modeling Default Swaps
• Monte Carlo Integration
• Image Segmentation
• Natural Language Processing
• Social Networks
Example
Classifier from local image features: P(Object Category | Local Image Information)
Classifier from boundary detection: P(Object Categories are Different across Boundary | Boundary Information)
Combine with herding:
Herding will generate samples such that the local probabilities are respected as much as possible (project on marginal polytope)
Topological Entropy
Example subsequence: S = 1, 3, 2.
Theorem [Goetz00]: Call $W(T)$ the number of possible subsequences of length $T$; then the topological entropy for herding is
$$h_{top} \;=\; \lim_{T\to\infty}\frac{\log W(T)}{T} \;=\; \lim_{T\to\infty}\frac{w\log(T)}{T} \;=\; 0.$$
However, we are interested in the sub-extensive entropy [Nemenman et al.]:
$$h_{subtop} \;=\; \lim_{T\to\infty}\frac{\log W(T)}{\log(T)} \;=\; \lim_{T\to\infty}\frac{w\log(T)}{\log(T)} \;=\; w.$$
Theorem: $h_{subtop} \le K$.
Conjecture: $h_{subtop} = K$ (for typical herding systems).
($K$ = nr. of parameters)
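One way to read the conjecture, assuming $h_{subtop} = K$ so that the number of distinct itineraries grows only polynomially, $W(T) \propto T^{K}$: a herding sequence of length $T$ then carries information of order
$$\log W(T) \;\approx\; K\,\log T,$$
which is the $K\log(N)$ count that appears on the next slide.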
Learning Systems
Bayesian evidence: $\log P(X) \;\sim\; \text{extensive terms} \;+\; H[p(w \mid X)] \;\sim\; \cdots \;+\; \frac{K}{2}\log(N)$.
This $\frac{K}{2}\log(N)$ term is the information we learn from the random IID data.
Herding is not random and not IID, due to negative auto-correlations.
The information in its sequence is $K\log(N)$.
We can therefore represent the original (random) data sample by a much smaller
subset without loss of information content ($N$ instead of $N^2$ samples).
These shorter herding sequences can be used to efficiently approximate averages by
Monte Carlo sums.
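The same $N$-versus-$N^2$ gap can be read off from the error rates (a sketch: herding's $O(1/T)$ rate is from the Convergence slide; the $O(1/\sqrt{T})$ rate is the standard Monte Carlo rate for random IID sampling):
$$\underbrace{O\!\Big(\tfrac{1}{N}\Big)}_{N \text{ herding samples}} \;=\; \underbrace{O\!\Big(\tfrac{1}{\sqrt{N^{2}}}\Big)}_{N^{2} \text{ IID samples}}, \qquad K\log N \;=\; \frac{K}{2}\,\log\!\big(N^{2}\big).$$
Both the accuracy and the information content of $N$ herding pseudo-samples match those of $N^{2}$ random samples.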
Conclusions
• Herding is an efficient alternative for learning in MRFs.
• Edge of chaos dynamics provides more efficient information processing
than random sampling.
• General principle that underlies information processing in the brain ?
• We advocate exploring the potentially interesting connections between
computation, learning, and the theory of nonlinear dynamical systems and chaos.
What can we learn from viewing learning as a nonlinear dynamical process?