Deep Learning

Bing-Chen Tsai
1/21
Outline
- Neural networks
- Graphical model
- Belief nets
- Boltzmann machine
- DBN
- Reference
Neural networks
- Supervised learning
  - The training data consists of input information with the corresponding output information.
- Unsupervised learning
  - The training data consists of input information without the corresponding output information.
Neural networks
- Generative model
  - Models the distribution of the input as well as the output, P(x, y).
- Discriminative model
  - Models the posterior probabilities, P(y | x).
[Figure: joint densities P(x, y1), P(x, y2) compared with posteriors P(y1 | x), P(y2 | x)]
Neural networks
- What is a neuron? (a short sketch of the four types follows below)
[Figure: a single neuron with inputs x1, x2, weights w1, w2, bias b, and output y]
- Linear neurons: y = b + Σ_i x_i w_i
- Binary threshold neurons: z = b + Σ_i x_i w_i, y = 1 if z ≥ 0, 0 otherwise
- Sigmoid neurons: z = b + Σ_i x_i w_i, y = 1 / (1 + e^(-z))
- Stochastic binary neurons: z = b + Σ_i x_i w_i, p(y = 1) = 1 / (1 + e^(-z))
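As a concrete illustration of the four neuron types above, here is a minimal NumPy sketch; the function names and the example inputs are my own, not from the slides:

import numpy as np

def linear_neuron(x, w, b):
    # y = b + sum_i x_i w_i
    return b + x @ w

def binary_threshold_neuron(x, w, b):
    # y = 1 if z >= 0 else 0, with z = b + sum_i x_i w_i
    z = b + x @ w
    return 1.0 if z >= 0 else 0.0

def sigmoid_neuron(x, w, b):
    # y = 1 / (1 + exp(-z))
    z = b + x @ w
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_binary_neuron(x, w, b, rng=np.random.default_rng()):
    # Outputs 1 with probability p(y = 1) = 1 / (1 + exp(-z)).
    p = sigmoid_neuron(x, w, b)
    return 1.0 if rng.random() < p else 0.0

x = np.array([0.5, -1.0])   # inputs x1, x2
w = np.array([0.8, 0.3])    # weights w1, w2
b = 0.1                     # bias
print(linear_neuron(x, w, b), sigmoid_neuron(x, w, b))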
Neural networks
- Two-layer neural networks (sigmoid neurons)
- Back-propagation (see the sketch below)
  - Step 1: randomly initialize the weights and determine the output vector.
  - Step 2: evaluate the gradient of an error function.
  - Step 3: adjust the weights.
  - Repeat steps 1-3 until the error is low enough.
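A minimal sketch of these three steps for a two-layer sigmoid network trained with squared error; the layer sizes, learning rate, and toy XOR data are my own illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # toy targets (XOR)

# Step 1: randomly initialize the weights
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)

lr = 0.5
for epoch in range(5000):
    # Step 1 (cont.): determine the output vector (forward pass)
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Step 2: evaluate the gradient of the error E = 0.5 * sum (Y - T)^2
    dY = (Y - T) * Y * (1 - Y)          # delta at the output layer
    dH = (dY @ W2.T) * H * (1 - H)      # delta back-propagated to the hidden layer

    # Step 3: adjust the weights; repeat until the error is low enough
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print(np.round(Y, 2))   # should approach the targets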
Neural networks
- Back-propagation is not good for deep learning:
  - It requires labeled training data, and most data is unlabeled.
  - The learning time is very slow in networks with multiple hidden layers.
  - It can get stuck in poor local optima, which for deep nets are often far from optimal.
- Learn P(input), not P(output | input).
  - What kind of generative model should we learn?
Outline
- Neural networks
- Graphical model
- Belief nets
- Boltzmann machine
- DBN
- Reference
Graphical model
- A graphical model is a probabilistic model for which a graph denotes the conditional dependence structure between random variables.
- In this example: D depends on A, D depends on B, D depends on C, C depends on B, and C depends on D.
Graphical model
- Directed graphical model (see the sketch below)
  P(A, B, C, D) = P(A) P(B | A) P(C | A) P(D | B, C)
[Figure: directed graph with A pointing to B and C, and B and C pointing to D]
- Undirected graphical model
  P(A, B, C, D) = (1/Z) · φ(A, B, C) · φ(B, C, D)
[Figure: undirected graph over A, B, C, D]
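To make the directed factorization concrete, here is a small Python sketch that evaluates P(A, B, C, D) = P(A) P(B|A) P(C|A) P(D|B, C) for binary variables; the probability tables are made up purely for illustration:

# Hypothetical conditional probability tables for binary A, B, C, D.
P_A = {1: 0.6, 0: 0.4}
P_B_given_A = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.2, 0: 0.8}}   # P_B_given_A[a][b]
P_C_given_A = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.1, 0: 0.9}}   # P_C_given_A[a][c]
P_D_given_BC = {(1, 1): {1: 0.9, 0: 0.1}, (1, 0): {1: 0.6, 0: 0.4},
                (0, 1): {1: 0.4, 0: 0.6}, (0, 0): {1: 0.05, 0: 0.95}}

def joint(a, b, c, d):
    # P(A, B, C, D) = P(A) P(B|A) P(C|A) P(D|B, C)
    return P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c] * P_D_given_BC[(b, c)][d]

# The joint sums to 1 over all 16 configurations.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(joint(1, 1, 0, 1), total)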
Outline
- Neural networks
- Graphical model
- Belief nets
- Boltzmann machine
- DBN
- Reference
Belief nets
- A belief net is a directed acyclic graph composed of stochastic variables.
[Figure: layers of stochastic hidden causes above a layer of visible units]
- The units are stochastic binary neurons:
  z = b + Σ_i x_i w_i, p(y = 1) = 1 / (1 + e^(-z))
- This is a sigmoid belief net.
Belief nets
- We would like to solve two problems:
  - The inference problem: infer the states of the unobserved variables.
  - The learning problem: adjust the interactions between variables to make the network more likely to generate the training data.
[Figure: stochastic hidden causes above the visible units]
Belief nets
- It is easy to generate a sample from P(v | h) (see the sketch below).
- It is hard to infer P(h | v), because of explaining away.
[Figure: stochastic hidden causes above the visible units]
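A minimal sketch of why generation is easy in a sigmoid belief net: sample each layer top-down with the stochastic binary rule p(y = 1) = σ(b + Σ_i x_i w_i). The layer sizes and the random weights are my own illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_top, n_mid, n_vis = 3, 5, 8                       # illustrative layer sizes
W_top = rng.normal(scale=0.5, size=(n_top, n_mid))  # top hidden -> middle hidden
W_mid = rng.normal(scale=0.5, size=(n_mid, n_vis))  # middle hidden -> visible
b_top, b_mid, b_vis = np.zeros(n_top), np.zeros(n_mid), np.zeros(n_vis)

def sample_layer(parents, W, b):
    # Sample children given parents: p(child = 1) = sigmoid(b + parents @ W).
    p = sigmoid(b + parents @ W)
    return (rng.random(p.shape) < p).astype(float)

# Generating a sample of P(v | h) is easy: one top-down ancestral pass.
h_top = (rng.random(n_top) < sigmoid(b_top)).astype(float)
h_mid = sample_layer(h_top, W_top, b_mid)
v = sample_layer(h_mid, W_mid, b_vis)
print(v)

# Inferring P(h | v), by contrast, has no such single pass: the hidden units
# become dependent given v (explaining away), so exact inference requires
# summing over exponentially many hidden configurations.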
Belief nets
- Explaining away (a worked numeric sketch follows below)
[Figure: two hidden causes H1 and H2 both pointing to a visible effect V]
  - H1 and H2 are independent, but they can become dependent when we observe an effect that they can both influence.
  - P(H1 | V) and P(H2 | V) are dependent.
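A small numeric sketch of explaining away, using made-up probabilities for two independent binary causes H1, H2 and a common effect V:

# Hypothetical priors and a noisy-OR style effect, purely for illustration.
P_H1, P_H2 = 0.1, 0.1
def P_V_given(h1, h2):                     # P(V = 1 | H1, H2)
    return {(0, 0): 0.01, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}[(h1, h2)]

def joint(h1, h2, v):
    p = (P_H1 if h1 else 1 - P_H1) * (P_H2 if h2 else 1 - P_H2)
    return p * (P_V_given(h1, h2) if v else 1 - P_V_given(h1, h2))

def posterior_H1(v, h2=None):
    # P(H1 = 1 | V = v) or P(H1 = 1 | V = v, H2 = h2).
    h2s = (0, 1) if h2 is None else (h2,)
    num = sum(joint(1, h, v) for h in h2s)
    den = sum(joint(h1, h, v) for h1 in (0, 1) for h in h2s)
    return num / den

# Before observing V, H1 and H2 are independent. After observing V = 1,
# also learning that H2 = 1 "explains away" V and lowers the probability of H1.
print(posterior_H1(1))          # P(H1 = 1 | V = 1)
print(posterior_H1(1, h2=1))    # P(H1 = 1 | V = 1, H2 = 1)  -- noticeably smaller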
Belief nets
- Some methods for learning deep belief nets:
  - Monte Carlo methods
    - But they are painfully slow for large, deep belief nets.
  - Learning with samples from the wrong distribution
  - Use Restricted Boltzmann Machines
Outline
- Neural networks
- Graphical model
- Belief nets
- Boltzmann machine
- DBN
- Reference
Boltzmann Machine
- It is an undirected graphical model.
[Figure: hidden units (j, k, ...) and visible units (i, ...) with symmetric connections]
- The energy of a joint configuration:
  -E(v, h) = Σ_{i∈vis} v_i b_i + Σ_{k∈hid} h_k b_k + Σ_{i<j} v_i v_j w_ij + Σ_{i,k} v_i h_k w_ik + Σ_{k<l} h_k h_l w_kl
  p(v, h) = e^(-E(v, h)) / Σ_{u,g} e^(-E(u, g))
  p(v) = Σ_h e^(-E(v, h)) / Σ_{u,g} e^(-E(u, g))
Boltzmann Machine
- An example of how weights define a distribution: list every joint configuration (v, h), compute -E and e^(-E), then normalize to obtain p(v, h) and p(v) (a sketch of this enumeration follows below).
[Figure: a small network with hidden units h1, h2 and visible units v1, v2 connected by weights +2, -1, +1; table of v, h, -E, e^(-E), p(v, h), p(v)]
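A sketch of this enumeration in Python. The connectivity and weights below (v1-h1 = +2, h1-h2 = -1, v2-h2 = +1, no biases) are how I read the figure, so treat them as an assumption:

import itertools
import numpy as np

# Units: v1, v2 (visible), h1, h2 (hidden). Edges and weights are assumptions
# read off the slide's figure.
edges = {("v1", "h1"): 2.0, ("h1", "h2"): -1.0, ("v2", "h2"): 1.0}
units = ["v1", "v2", "h1", "h2"]

def neg_energy(state):
    # -E(v, h) = sum over edges of w_ij s_i s_j (no biases here).
    return sum(w * state[a] * state[b] for (a, b), w in edges.items())

rows = []
for values in itertools.product([0, 1], repeat=len(units)):
    state = dict(zip(units, values))
    rows.append((state, neg_energy(state)))

Z = sum(np.exp(ne) for _, ne in rows)
for state, ne in rows:
    print(state, f"-E={ne:+.1f}", f"p(v,h)={np.exp(ne) / Z:.3f}")

# p(v) marginalizes over the hidden units.
p_v = {}
for state, ne in rows:
    key = (state["v1"], state["v2"])
    p_v[key] = p_v.get(key, 0.0) + np.exp(ne) / Z
print(p_v)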
Boltzmann Machine
- A very surprising fact (checked numerically in the sketch below):
  ∂log p(v) / ∂w_ij = <s_i s_j>_v - <s_i s_j>_model
  - The left-hand side is the derivative of the log probability of one training vector v under the model.
  - <s_i s_j>_v: the expected value of the product of states at thermal equilibrium when v is clamped on the visible units.
  - <s_i s_j>_model: the expected value of the product of states at thermal equilibrium with no clamping.
  Δw_ij ∝ <s_i s_j>_data - <s_i s_j>_model
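A small sketch that checks this identity by exact enumeration on a tiny fully connected Boltzmann machine. The model (4 units, random symmetric weights, no biases) and the finite-difference check are my own illustration:

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # units 0, 1 visible; units 2, 3 hidden
W = rng.normal(scale=0.5, size=(n, n))
W = np.triu(W, 1); W = W + W.T          # symmetric weights, zero diagonal, no biases

def neg_energy(s, W):
    return 0.5 * s @ W @ s              # = sum_{i<j} w_ij s_i s_j

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
v = np.array([1.0, 0.0])                # one training vector on the visible units
i, j = 0, 2                             # check the gradient for w_{0,2}

un = np.array([np.exp(neg_energy(s, W)) for s in states])
clamped = np.array([np.allclose(s[:2], v) for s in states])
p_model = un / un.sum()
p_clamped = un * clamped / (un * clamped).sum()

# Analytic form: <s_i s_j> with v clamped minus <s_i s_j> under the model.
analytic = (p_clamped * states[:, i] * states[:, j]).sum() \
         - (p_model * states[:, i] * states[:, j]).sum()

# Numeric check: finite differences on log p(v).
def log_pv(W, v):
    u = np.array([np.exp(neg_energy(s, W)) for s in states])
    c = np.array([np.allclose(s[:2], v) for s in states])
    return np.log((u * c).sum()) - np.log(u.sum())

eps = 1e-5
Wp = W.copy(); Wp[i, j] += eps; Wp[j, i] += eps
Wm = W.copy(); Wm[i, j] -= eps; Wm[j, i] -= eps
numeric = (log_pv(Wp, v) - log_pv(Wm, v)) / (2 * eps)

print(analytic, numeric)                # the two values should agree closely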
Boltzmann Machines
- Restricted Boltzmann Machine
  - We restrict the connectivity to make learning easier:
    - Only one layer of hidden units (we will deal with more layers later).
    - No connections between hidden units.
  - This makes the updates more parallel.
[Figure: one hidden layer fully connected to the visible layer, with no within-layer connections]
Boltzmann Machines
- The Boltzmann machine learning algorithm for an RBM:
[Figure: alternating Gibbs sampling between the visible and hidden layers from t = 0 to t = ∞, measuring <v_i h_j>^0 at t = 0 and <v_i h_j>^∞ at equilibrium]
  Δw_ij = ε ( <v_i h_j>^0 - <v_i h_j>^∞ )
Boltzmann Machines
- Contrastive divergence: a very surprising short-cut (a sketch of CD-1 follows below)
[Figure: start from the data at t = 0 and measure <v_i h_j>^0, take one Gibbs step to the reconstruction at t = 1 and measure <v_i h_j>^1]
  Δw_ij = ε ( <v_i h_j>^0 - <v_i h_j>^1 )
  - This is not following the gradient of the log likelihood, but it works well.
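A minimal NumPy sketch of CD-1 for a binary RBM; the toy dataset, layer sizes, and learning rate are my own illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 3, 0.1
W = rng.normal(scale=0.01, size=(n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

# A toy binary dataset (two repeating patterns), purely for illustration.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 20, dtype=float)

for epoch in range(200):
    for v0 in data:
        # t = 0: sample the hidden units given the data vector
        p_h0 = sigmoid(b_hid + v0 @ W)
        h0 = (rng.random(n_hid) < p_h0).astype(float)

        # t = 1: reconstruct the visible units, then recompute hidden probabilities
        p_v1 = sigmoid(b_vis + h0 @ W.T)
        v1 = (rng.random(n_vis) < p_v1).astype(float)
        p_h1 = sigmoid(b_hid + v1 @ W)

        # CD-1 update: dw_ij = lr * ( <v_i h_j>^0 - <v_i h_j>^1 )
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_vis += lr * (v0 - v1)
        b_hid += lr * (p_h0 - p_h1)

# The reconstruction of a training pattern should be close after training.
v = data[0]
p_h = sigmoid(b_hid + v @ W)
print(np.round(sigmoid(b_vis + p_h @ W.T), 2))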
Outline
- Neural networks
- Graphical model
- Belief nets
- Boltzmann machine
- DBN
- Reference
DBN
- It is easy to generate a sample from P(v | h).
- It is hard to infer P(h | v), because of explaining away.
[Figure: stochastic hidden causes above the visible units]
- Using RBMs to initialize the weights can lead to a good optimum.
DBN
- Combining two RBMs to make a DBN (a stacking sketch follows below):
  - Train this RBM first: weights W1 between v and h1.
  - Copy the binary state of h1 for each v, then train this RBM: weights W2 between h1 and h2.
  - Compose the two RBM models to make a single DBN model.
  - It's a deep belief net!
[Figure: RBM (v, h1, W1) and RBM (h1, h2, W2) composed into the DBN v - h1 - h2]
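A sketch of this greedy layer-wise procedure, using a train_rbm helper written the same way as the CD-1 sketch above; the dataset and layer sizes are again illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hid, lr=0.1, epochs=200, rng=np.random.default_rng(0)):
    # CD-1 training as in the earlier sketch; returns (W, b_vis, b_hid).
    n_vis = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_vis, n_hid))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(b_hid + v0 @ W)
            h0 = (rng.random(n_hid) < p_h0).astype(float)
            p_v1 = sigmoid(b_vis + h0 @ W.T)
            v1 = (rng.random(n_vis) < p_v1).astype(float)
            p_h1 = sigmoid(b_hid + v1 @ W)
            W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
            b_vis += lr * (v0 - v1); b_hid += lr * (p_h0 - p_h1)
    return W, b_vis, b_hid

rng = np.random.default_rng(0)
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 20, dtype=float)

# Train this RBM first: v <-> h1 with weights W1.
W1, b_v, b_h1 = train_rbm(data, n_hid=4)

# Copy the binary state of h1 for each v ...
h1_data = (rng.random((len(data), 4)) < sigmoid(b_h1 + data @ W1)).astype(float)

# ... then train this RBM: h1 <-> h2 with weights W2.
W2, _, b_h2 = train_rbm(h1_data, n_hid=2)

# Composing the two RBMs gives a DBN: W1 becomes the generative weights of the
# directed layer v <- h1, topped by the undirected RBM (h1, h2, W2).
print(W1.shape, W2.shape)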
DBN
- Why can we use an RBM to initialize the weights of a belief net?
  - There is an infinite sigmoid belief net that is equivalent to an RBM.
[Figure: an infinite directed net with layers ..., h2, v2, h1, v1, h0, v0, with tied weights W and W transpose alternating between layers]
  - Inference in a directed net with replicated weights:
    - Inference is trivial: we just multiply v0 by W transpose.
    - The model above h0 implements a complementary prior.
    - Multiplying v0 by W transpose gives the product of the likelihood term and the prior term.
DBN
- Complementary prior
[Figure: a Markov chain X1 → X2 → X3 → X4]
- A Markov chain is a sequence of variables X1, X2, ... with the Markov property:
  P(X_t | X_1, ..., X_{t-1}) = P(X_t | X_{t-1})
- A Markov chain is stationary if the transition probabilities do not depend on time:
  P(X_t = x' | X_{t-1} = x) = T(x → x')
  T(x → x') is called the transition matrix.
- If a Markov chain is ergodic, it has a unique equilibrium distribution:
  P_t(X_t = x) → P_∞(X = x) as t → ∞
DBN
- Most Markov chains used in practice satisfy detailed balance (checked in the sketch below):
  P_∞(X) T(X → X') = P_∞(X') T(X' → X)
  e.g. Gibbs sampling, Metropolis-Hastings, slice sampling, ...
- Such Markov chains are reversible:
  P_∞(X1) T(X1 → X2) T(X2 → X3) T(X3 → X4) = T(X1 ← X2) T(X2 ← X3) T(X3 ← X4) P_∞(X4)
[Figure: the chain X1, X2, X3, X4 read forwards and backwards]
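A small sketch checking detailed balance and convergence to the equilibrium distribution for a simple two-state chain; the transition matrix is made up for illustration:

import numpy as np

# A made-up 2-state transition matrix T[x, x'] = T(x -> x').
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Equilibrium distribution: left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

# Detailed balance: P_inf(x) T(x -> x') == P_inf(x') T(x' -> x) for all x, x'.
for x in range(2):
    for x2 in range(2):
        print(x, x2, pi[x] * T[x, x2], pi[x2] * T[x2, x])

# Running the chain: P_t converges to the equilibrium distribution as t -> infinity.
p = np.array([1.0, 0.0])
for t in range(50):
    p = p @ T
print(p, pi)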
DBN
- Conditional distributions in the net with replicated weights:
  P(Y_k = 1 | X_{k+1}) = σ(W^T X_{k+1} + c)
  P(X_k = 1 | Y_k) = σ(W Y_k + b)
DBN
- Combining two RBMs to make a DBN:
  - Train this RBM first: weights W1 between v and h1.
  - Copy the binary state of h1 for each v, then train this RBM: weights W2 between h1 and h2.
  - Compose the two RBM models to make a single DBN model.
  - It's a deep belief net!
[Figure: RBM (v, h1, W1) and RBM (h1, h2, W2) composed into the DBN v - h1 - h2]
Reference
- Deep Belief Nets, 2007 NIPS tutorial, G. Hinton
- https://class.coursera.org/neuralnets-2012001/class/index
- Machine learning lecture notes
- http://en.wikipedia.org/wiki/Graphical_model