

Autoencoders, Unsupervised Learning, and Deep Architectures

P. Baldi

University of California, Irvine

1. General Definition

2. Historical Motivation (1950s, 1980s, 2010s)

3. Linear Autoencoders over Infinite Fields

4. Non-Linear Autoencoders: the Boolean Case

5. Summary and Speculations

General Definition

• x_1, …, x_M training vectors in E^N (e.g. E = ℝ or E = {0,1})

• Learn A and B to minimize: Σ_i Δ[F_AB(x_i) - x_i]

[Diagram: input layer of size N, encoding map B to a hidden layer of size H, decoding map A back to an output layer of size N]

Key scaling parameters: N, H, M
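A minimal sketch of this objective, assuming E = ℝ, squared error for the distortion Δ, and linear maps A and B (all variable names below are illustrative, not from the slides):

```python
import numpy as np

def reconstruction_error(A, B, X):
    """Sum_i Δ[F_AB(x_i) - x_i] with Δ taken to be squared Euclidean error.

    A: (N, H) decoding matrix, B: (H, N) encoding matrix,
    X: (M, N) array whose rows are the training vectors x_1, ..., x_M.
    """
    # F_AB(x) = A(B(x)): encode to H dimensions, then decode back to N.
    reconstructions = X @ B.T @ A.T          # shape (M, N)
    return np.sum((reconstructions - X) ** 2)

# Toy usage with the key scaling parameters N, H, M.
M, N, H = 100, 20, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(M, N))
A = rng.normal(size=(N, H))
B = rng.normal(size=(H, N))
print(reconstruction_error(A, B, X))
```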

Autoencoder Zoo

[Taxonomy diagram]

• Linear: over the complex field, the real field, or finite fields (GF(2))

• Non-Linear: Boolean (threshold gates), Boolean/Linear, neural networks (sigmoidal), Boltzmann machines (RBMs)

Historical Motivation

• Three time periods: 1950s, 1980s, 2010s.

• Three motivations:

– Fundamental Learning Problem (1950s)

– Unsupervised Learning (1980s)

– Deep Architectures (2010s)

2010: Deep Architectures

1950s

Where do you store your telephone number?

THE SYNAPTIC BASIS OF MEMORY CONSOLIDATION

[Images © 2004 Graham Johnson; © 2007 Paul De Koninck]

Structure             Size (m)       Scaled x10^6 (m)   Analogy
Diameter of atom      10^-10         10^-4              hair
Diameter of DNA       10^-9          10^-3
Diameter of synapse   10^-7          10^-1              fist
Diameter of axon      10^-6          1
Diameter of neuron    10^-5          10                 room
Length of axon        10^-3 - 1      10^3 - 10^6        park - nation
Length of brain       10^-1          10^5               state
Length of body        1              10^6               nation

The Organization of Behavior: A Neuropsychological Theory (1949)

“Let us assume that the persistence or repetition of a reverberatory activity (or ‘trace’) tends to induce lasting cellular changes that add to its stability…”

“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”

Δw_ij ~ x_i x_j
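A one-line illustration of this rule (the weight matrix W, activity vector x, and learning rate eta below are illustrative names, not from the slides):

```python
import numpy as np

def hebbian_update(W, x, eta=0.01):
    """Hebb's rule Δw_ij ~ x_i x_j: strengthen w_ij in proportion to the co-activity of units i and j."""
    return W + eta * np.outer(x, x)
```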

1980s

• Hopfield

• PDP group

Back-Propagation (1985)

[Diagram: error E = F(w); weight w_ij connects unit j in the layer below to unit i in the layer above, between the input and output layers]

Gradient descent: Δw_ij = μ · out_j · ε_i, where μ is the learning rate, out_j is the output of unit j, and ε_i is the back-propagated error at unit i.

First Autoencoder

• x_1, …, x_M training points (real-valued vectors)

• Learn A and B to minimize: Σ_i ||F_AB(x_i) - x_i||^2

[Diagram: input layer of N units, encoder B, hidden layer of H sigmoidal neurons, decoder A, output layer of N sigmoidal neurons]
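A rough sketch of this first autoencoder, assuming a single hidden layer of logistic sigmoid units trained by plain gradient descent on the squared reconstruction error (hyperparameters, initialization, and the toy data are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, H, lr=0.5, epochs=500, seed=0):
    """Single-hidden-layer sigmoidal autoencoder trained by gradient descent.

    X: (M, N) inputs in [0, 1]. Returns encoder (B, b1) and decoder (A, b2).
    """
    M, N = X.shape
    rng = np.random.default_rng(seed)
    B, b1 = rng.normal(scale=0.1, size=(H, N)), np.zeros(H)   # encoder
    A, b2 = rng.normal(scale=0.1, size=(N, H)), np.zeros(N)   # decoder
    for _ in range(epochs):
        Hact = sigmoid(X @ B.T + b1)            # (M, H) hidden activities
        Y = sigmoid(Hact @ A.T + b2)            # (M, N) reconstructions F_AB(x)
        d_out = (Y - X) * Y * (1 - Y)           # error back-propagated through the output sigmoid
        d_hid = (d_out @ A) * Hact * (1 - Hact) # error at the hidden layer
        A -= lr * d_out.T @ Hact / M
        b2 -= lr * d_out.mean(axis=0)
        B -= lr * d_hid.T @ X / M
        b1 -= lr * d_hid.mean(axis=0)
    return (B, b1), (A, b2)

# Toy run: M = 200 binary vectors of dimension N = 16, compressed to H = 4.
rng = np.random.default_rng(1)
X = (rng.random((200, 16)) > 0.5).astype(float)
train_autoencoder(X, H=4)
```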

Linear Autoencoder

• x_1, …, x_M training vectors over ℝ^N

• Find two matrices A (N×H) and B (H×N) that minimize: Σ_i ||AB x_i - x_i||^2

[Diagram: input layer N, encoder B, hidden layer H, decoder A, output layer N]

Linear Autoencoder Theorem (IR)

• A and B are defined only up to multiplication by an invertible H×H matrix C: W = AB = (A C^-1)(C B).

• Although the cost function is quadratic and the transformation W = AB is linear, the problem is NOT convex.

• The problem becomes convex if A or B is fixed. Assuming the covariance matrix Σ_XX is invertible and A and B have full rank: B* = (A^t A)^-1 A^t and A* = Σ_XX B^t (B Σ_XX B^t)^-1.

• Alternating minimization of A and B is an EM algorithm.
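A sketch of this alternating (EM-style) minimization using the two closed-form updates above; it assumes the data are given as rows x_i of a matrix and that Σ_XX is estimated from the (assumed centered) data; sizes and initialization are illustrative:

```python
import numpy as np

def alternating_linear_autoencoder(X, H, iters=200, seed=0):
    """Alternate the closed-form updates for B (A fixed) and A (B fixed).

    X: (M, N) data matrix with rows x_i. Returns A (N x H) and B (H x N).
    """
    M, N = X.shape
    Sigma = X.T @ X / M                      # Σ_XX estimated from the data (assumed centered)
    A = np.random.default_rng(seed).normal(size=(N, H))
    B = None
    for _ in range(iters):
        B = np.linalg.solve(A.T @ A, A.T)    # B* = (A^t A)^-1 A^t, optimal B for fixed A
        A = Sigma @ B.T @ np.linalg.inv(B @ Sigma @ B.T)   # A* = Σ_XX B^t (B Σ_XX B^t)^-1
    return A, B
```

Each half-step solves a convex subproblem, so the reconstruction error is non-increasing along the alternation and the pair (A, B) settles at a critical point of E.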

Linear Autoencoder Theorem (IR)

• The overall landscape of E has no local minima. All the critical points, where the gradient is 0, are associated with projections onto subspaces spanned by H eigenvectors of the covariance matrix.

• At any critical point: A = U_I C and B = C^-1 U_I^t, where the columns of U_I are the H eigenvectors of Σ_XX associated with the index set I. In this case W = AB = P_{U_I}, the projection onto the span of U_I. Generalization is easy to measure and understand.

• Projections onto the top H eigenvectors correspond to a global minimum. All other critical points are saddle points.

Landscape of E

[Figure: the error landscape of E, with saddle points and a single global minimum and no local minima]
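A small self-contained check of this claim: enumerate the critical points W = P_{U_I} (projections onto H of the eigenvectors of Σ_XX) and verify that the index set I with the largest eigenvalues gives the smallest error E (toy data and sizes are illustrative):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
M, N, H = 500, 6, 2
X = rng.normal(size=(M, N)) @ rng.normal(size=(N, N))   # correlated toy data
Sigma = X.T @ X / M
eigvals, U = np.linalg.eigh(Sigma)                       # eigenvalues in ascending order

def E(W):                                                # E = Σ_i ||W x_i - x_i||^2
    return np.sum((X @ W.T - X) ** 2)

# One critical point per index set I: W = P_{U_I}, the projection onto the chosen eigenvectors.
errors = {I: E(U[:, list(I)] @ U[:, list(I)].T) for I in combinations(range(N), H)}
best = min(errors, key=errors.get)
print(best)   # indices of the two largest eigenvalues, i.e. (N-2, N-1)
```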

Linear Autoencoder Theorem (IR)

• Thus any critical point performs a form of clustering by hyperplane: for any vector x, all the vectors of the form x + Ker B are mapped onto the same vector y = AB(x) = AB(x + Ker B).

• At any critical point where C = Identity, A = B^t. The constraint A = B^t can be imposed during learning by weight sharing, or symmetric connections, and is consistent with a Hebbian rule that is symmetric between pre- and post-synaptic units (folded autoencoder, or clamping input and output units).

Linear Autoencoder Theorem (IR)

• At any critical point, reverberation is stable for every x: (AB)^2 x = AB x.

• The global minimum remains the same if additional matrices of rank ≥ H are introduced anywhere in the architecture. There is no gain in expressivity by adding such matrices.

• However, such matrices could be introduced for other reasons. Vertical composition law: “N-H1-H-H1-N ~ N-H1-N + H1-H-H1”.

• Results can be extended to the linear case with given output targets and to the complex field.

Vertical Composition

• N-H1-H-H1-N ~ N-H1-N + H1-H-H1

[Diagram: a deep autoencoder with layers N, H1, H, H1, N decomposed into an outer autoencoder N-H1-N and an inner autoencoder H1-H-H1]
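An illustrative sketch of this composition law in the linear case: train the outer N-H1-N autoencoder on the data, then train the inner H1-H-H1 autoencoder on the resulting hidden codes (each stage uses the closed-form projection solution; sizes are illustrative):

```python
import numpy as np

def linear_ae(X, H):
    """Optimal linear autoencoder N -> H -> N: project onto the top-H eigenvectors of Σ_XX."""
    Sigma = X.T @ X / len(X)
    U = np.linalg.eigh(Sigma)[1][:, -H:]     # top H eigenvectors
    return U, U.T                            # A = U (decode), B = U^t (encode)

# Vertical composition N-H1-H-H1-N ~ (N-H1-N) + (H1-H-H1):
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))               # N = 20
A1, B1 = linear_ae(X, H=8)                   # outer autoencoder, H1 = 8
codes = X @ B1.T                             # hidden representations, shape (300, 8)
A2, B2 = linear_ae(codes, H=3)               # inner autoencoder, H = 3
deep_map = A1 @ A2 @ B2 @ B1                 # composed N-H1-H-H1-N map, shape (20, 20)
```

In this linear case the composed map again projects onto the top H eigenvectors of Σ_XX, which is the sense in which the extra layer adds no expressivity here.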

Linear Autoencoder Theorem (IR)


• Provides some intuition for the non-linear case.

Boolean Autoencoder

• x_1, …, x_M training vectors over {0,1}^N (binary)

• Find Boolean functions A and B that minimize: Σ_i H[AB(x_i), x_i], where H = Hamming distance

• Variation 1: enforce AB(x_i) ∈ {x_1, …, x_M}

• Variation 2: restrict A and B (connectivity, threshold gates, etc.)
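A direct transcription of this objective, with A and B passed as arbitrary Boolean functions (callables) and X holding the binary training vectors (names are illustrative):

```python
import numpy as np

def boolean_ae_cost(A, B, X):
    """Σ_i H[AB(x_i), x_i], with H the Hamming distance.

    B: maps a length-N binary vector to a length-H binary code;
    A: maps a length-H code back to a length-N binary vector;
    X: (M, N) array of binary training vectors.
    """
    return sum(int(np.sum(A(B(x)) != x)) for x in X)
```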

Boolean Autoencoder: Fix A

[Figure sequence: with A fixed, each hidden code h (e.g. h = 10010) decodes to a codeword y = A(h) (e.g. 11010110010). The codewords A(h1), A(h2), A(h3), … partition the input space into Voronoi regions, and the optimal encoder maps the entire Voronoi region of A(h) back to h: B({Voronoi A(h)}) = h.]

Boolean Autoencoder: Fix B

[Figure sequence: with B fixed, a hidden code h (e.g. h = 10100) has a preimage B^-1(h) of training vectors (e.g. 00110101001, 11010100101, 10101010101); the optimal decoder sets A(h) to their bitwise majority: A(h) = Majority[B^-1(h)] = 10110100101.]
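A sketch of the resulting alternating algorithm, essentially k-means in Hamming space with K = 2^H centroids: the fix-A step assigns each input to the Voronoi region of the nearest codeword, and the fix-B step recomputes each codeword as a bitwise majority (initialization and toy sizes are illustrative):

```python
import numpy as np

def boolean_autoencoder(X, H, iters=20, seed=0):
    """Alternate the two steps illustrated above:
       fix A: B(x) = h whose codeword A(h) is Hamming-closest to x (Voronoi assignment);
       fix B: A(h) = bitwise Majority of the inputs assigned to h.
    X: (M, N) binary array. Returns the codebook {A(h)} as a (2**H, N) binary array."""
    rng = np.random.default_rng(seed)
    K = 2 ** H
    codebook = X[rng.choice(len(X), size=K, replace=False)].copy()  # initialize A(h) from the data
    for _ in range(iters):
        # Fix A: assign each x_i to its nearest codeword (this defines the encoder B).
        dists = np.sum(X[:, None, :] != codebook[None, :, :], axis=2)   # (M, K) Hamming distances
        h = dists.argmin(axis=1)
        # Fix B: recompute each codeword as the bitwise majority of its cluster.
        for k in range(K):
            members = X[h == k]
            if len(members):
                codebook[k] = (members.mean(axis=0) >= 0.5).astype(X.dtype)
    return codebook

# Usage: cluster M = 200 binary vectors of dimension N = 16 into K = 2^3 = 8 clusters.
X = (np.random.default_rng(1).random((200, 16)) > 0.5).astype(int)
codebook = boolean_autoencoder(X, H=3)
```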

Boolean Autoencoder Theorem

• A and B are defined only up to the group of permutations of the 2^H points of the H-dimensional hypercube of the hidden layer.

• The overall optimization problem is non-trivial. Polynomial-time solutions exist when H is held constant (centroids in the training set). When H ~ ε log M, the problem becomes NP-complete.

• The problem has a simple solution when A is fixed or when B is fixed: A*(h) = Majority{B^-1(h)}; B*{Voronoi A(h)} = h [i.e. B*(x) = h such that A(h) is closest to x among {A(h)}].

• Every “critical point” (A* and B*) corresponds to a clustering into K = 2^H clusters. The optimum corresponds to the best such clustering. Plenty of approximate algorithms exist (k-means, hierarchical clustering, belief propagation), with centroids in the training set.

• Generalization is easy to measure and understand.

Boolean Autoencoder Theorem

• At any critical point, reverberation is stable.

• The global minimum remains the same if additional Boolean functions with layers >=H are introduced anywhere in the architecture. There is no gain in expressivity by adding such functions.

• However such functions could be introduced for other reasons.

Composition law: “N-H1-H-H1-N ~ N-H1-N + H1-H-H1”. Can achieve hierarchical clustering in input space.

• Results can be extended to the case with given output targets.

Learning Complexity

• The linear autoencoder over infinite fields can be solved analytically.

• The Boolean autoencoder is NP-complete as soon as the number of clusters (K = 2^H) scales like M^ε (for ε > 0). It is solvable in polynomial time when K is fixed.

• The linear autoencoder over finite fields is NP-complete in the general case.

• RBM learning is NP-complete in the general case.

Embedding of Square Lattice in

Hypercube

• 4x3 square lattice with embedding in H 7

0000111 1111111

0000000 1111000
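One concrete embedding consistent with the corner codes shown, using a thermometer (unary) code for each lattice coordinate; this particular construction is an assumption about the figure, but it makes Hamming distance in H^7 equal Manhattan distance on the lattice:

```python
import numpy as np
from itertools import product

def thermometer(i, bits):
    """Unary/thermometer code: i ones followed by zeros, e.g. thermometer(2, 4) -> 1100."""
    return [1] * i + [0] * (bits - i)

# Embed the lattice point (i, j), 0 <= i <= 4, 0 <= j <= 3, into H^7 (4 + 3 bits);
# the corners reproduce the codes on the slide: 0000000, 1111000, 0000111, 1111111.
embed = {(i, j): np.array(thermometer(i, 4) + thermometer(j, 3))
         for i, j in product(range(5), range(4))}

# Hamming distance between codes equals Manhattan distance on the lattice.
for (p, cp), (q, cq) in product(embed.items(), repeat=2):
    assert np.sum(cp != cq) == abs(p[0] - q[0]) + abs(p[1] - q[1])
```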

Vertical Composition

Horizontal Composition

Autoencoders with H>N

• The identity map provides a trivial solution.

• Remedies: regularization, horizontal composition, noise (see the sketch below).
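One way to make the over-complete case (H > N) non-trivial, per the noise bullet above, is to reconstruct the clean input from a corrupted copy (a denoising-style setup); a minimal sketch of that objective, with the corruption level p and the reconstruction map F as illustrative names:

```python
import numpy as np

def corrupt(X, p, rng):
    """Flip each bit of the binary input with probability p."""
    return np.where(rng.random(X.shape) < p, 1 - X, X)

def denoising_cost(F, X, p=0.2, seed=0):
    """Σ_i ||F(corrupt(x_i)) - x_i||^2: with noisy inputs, even an over-complete
    hidden layer (H > N) cannot simply copy the input, so the identity map is no
    longer a trivial optimum."""
    rng = np.random.default_rng(seed)
    X_noisy = corrupt(X, p, rng)
    return float(np.sum((F(X_noisy) - X) ** 2))
```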

Information and Coding

(Transmission and Storage)

[Diagram: message -> parity bits -> noisy channel -> decoded message]

Summary and Speculations

[Taxonomy diagram]

• Linear: over the complex field, the real field, or finite fields (GF(2))

• Non-Linear: Boolean (threshold gates); Boolean/Linear over R or C (sigmoidal neural networks); Boolean/Linear over GF(2) (Boltzmann machines, RBMs)

Unsupervised Learning

[Diagram: autoencoders at the intersection of clustering and Hebbian learning]

Information and Coding Theory

[Diagram: autoencoders connect compression and communication]

Deep Architectures

[Diagram: vertical composition and horizontal composition of autoencoders]

Summary and Speculations

• Unsupervised Learning: Hebb, Autoencoders, RBMs, Clustering

• Conceptually clustering is the fundamental operation

• Clustering can be combined with targets

• Clustering is composable: horizontally, vertically, recursively, etc.

• Autoencoders implement clustering and labeling simultaneously

• Deep architecture conjecture
