Deep Learning
Bing-Chen Tsai, 1/21

Outline
- Neural networks
- Graphical model
- Belief nets
- Boltzmann machine
- DBN
- Reference

Neural networks

Supervised learning: the training data consist of inputs together with their corresponding outputs.
Unsupervised learning: the training data consist of inputs without their corresponding outputs.

Generative model: model the joint distribution of input and output, P(x, y).
Discriminative model: model the posterior probabilities, P(y | x).
(Figure: class-conditional densities P(x, y1) and P(x, y2) versus posteriors P(y1 | x) and P(y2 | x).)

What is a neuron? Each unit combines its inputs x_i, weights w_i, and bias b into z = b + \sum_i x_i w_i, and the types differ in how they produce the output y:
- Linear neurons: y = z = b + \sum_i x_i w_i
- Binary threshold neurons: y = 1 if z \ge 0, and y = 0 otherwise
- Sigmoid neurons: y = 1 / (1 + e^{-z})
- Stochastic binary neurons: p(y = 1) = 1 / (1 + e^{-z})

Two-layer neural networks of sigmoid neurons are trained with back-propagation:
- Step 1: Randomly initialize the weights and determine the output vector.
- Step 2: Evaluate the gradient of an error function.
- Step 3: Adjust the weights.
Repeat the forward pass, gradient evaluation, and weight update until the error is low enough (a minimal code sketch follows below).

Back-propagation is not good for deep learning:
- It requires labeled training data, and almost all data is unlabeled.
- The learning time is very slow in networks with multiple hidden layers.
- It can get stuck in poor local optima; for deep nets these are far from optimal.
Instead, learn P(input), not P(output | input). What kind of generative model should we learn?
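The three back-propagation steps above can be made concrete with a small example. Below is a minimal NumPy sketch of a two-layer network of sigmoid neurons trained with back-propagation; the XOR toy data, the layer sizes, the squared-error loss, and the learning rate are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

# Toy data (assumption: XOR, purely for illustration).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)

def sigmoid(z):
    # Sigmoid neuron: y = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: randomly initialize the weights (and biases).
W1 = rng.normal(scale=0.5, size=(2, 4))    # input -> hidden
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))    # hidden -> output
b2 = np.zeros(1)
lr = 0.5                                   # learning rate (assumption)

for epoch in range(5000):
    # Step 1 (continued): the forward pass determines the output vector.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Step 2: evaluate the gradient of the squared-error function.
    err = y - T
    delta2 = err * y * (1 - y)              # error signal at the output layer
    delta1 = (delta2 @ W2.T) * h * (1 - h)  # back-propagated to the hidden layer

    # Step 3: adjust the weights; repeat until the error is low enough.
    W2 -= lr * h.T @ delta2
    b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * X.T @ delta1
    b1 -= lr * delta1.sum(axis=0)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```

For a shallow net like this, plain back-propagation works well; the difficulties listed above appear once many hidden layers are stacked.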
Graphical model

A graphical model is a probabilistic model in which a graph denotes the conditional dependence structure between random variables. In the example graph: D depends on A, D depends on B, D depends on C, C depends on B, and C depends on D.

Directed graphical model:
P(A, B, C, D) = P(A) P(B | A) P(C | A) P(D | B, C)

Undirected graphical model:
P(A, B, C, D) = \frac{1}{Z} \varphi(A, B, C) \varphi(B, C, D)
where Z is the normalization constant.

Belief nets

A belief net is a directed acyclic graph composed of stochastic variables (stochastic hidden causes at the top, visible units at the bottom). When the units are stochastic binary neurons, with z = b + \sum_i x_i w_i and p(y = 1) = 1 / (1 + e^{-z}), it is a sigmoid belief net.

We would like to solve two problems:
- The inference problem: infer the states of the unobserved variables.
- The learning problem: adjust the interactions between variables to make the network more likely to generate the training data.

It is easy to generate a sample from P(v | h), but it is hard to infer P(h | v), because of explaining away.

Explaining away: H1 and H2 are independent, but they can become dependent when we observe an effect V that they can both influence; P(H1 | V) and P(H2 | V) are dependent.

Some methods for learning deep belief nets:
- Monte Carlo methods, but they are painfully slow for large, deep belief nets.
- Learning with samples from the wrong distribution: use Restricted Boltzmann Machines.

Boltzmann machine

A Boltzmann machine is an undirected graphical model over visible units v and hidden units h. The energy of a joint configuration is
-E(v, h) = \sum_{i \in vis} v_i b_i + \sum_{k \in hid} h_k b_k + \sum_{i < j} v_i v_j w_{ij} + \sum_{i, k} v_i h_k w_{ik} + \sum_{k < l} h_k h_l w_{kl}
and the energy defines the distribution
p(v, h) = \frac{e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}, \qquad p(v) = \frac{\sum_h e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}

An example of how weights define a distribution: (figure) a small Boltzmann machine with weights +2, -1, and +1, together with a table listing -E, e^{-E}, p(v, h), and p(v) for every configuration (v, h).

A very surprising fact: the derivative of the log probability of one training vector v under the model is
\frac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_v - \langle s_i s_j \rangle_{model}
where \langle s_i s_j \rangle_v is the expected product of states at thermal equilibrium when v is clamped on the visible units, and \langle s_i s_j \rangle_{model} is the expected product of states at thermal equilibrium with no clamping. This gives the learning rule
\Delta w_{ij} \propto \langle s_i s_j \rangle_{data} - \langle s_i s_j \rangle_{model}

Restricted Boltzmann Machine: we restrict the connectivity to make learning easier.
- Only one layer of hidden units (we will deal with more layers later).
- No connections between hidden units, which makes the updates more parallel.

The Boltzmann machine learning algorithm for an RBM: run alternating Gibbs sampling between the visible and hidden layers (t = 0, 1, 2, ..., \infty) and update
\Delta w_{ij} = \varepsilon (\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty)

Contrastive divergence, a very surprising short-cut: start at the data (t = 0), take a single step of alternating Gibbs sampling to get a reconstruction (t = 1), and update
\Delta w_{ij} = \varepsilon (\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1)
This is not following the gradient of the log likelihood, but it works well (a code sketch follows below).
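To make the short-cut concrete, here is a minimal NumPy sketch of CD-1 training for an RBM on binary data. The toy data, the number of hidden units, the learning rate, and the number of epochs are illustrative assumptions, and using the hidden probabilities rather than sampled states for the correlation statistics is a common practical choice, not something specified on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary training data (assumption): 20 six-dimensional visible vectors.
V = rng.integers(0, 2, size=(20, 6)).astype(float)

n_vis, n_hid = 6, 3
W = 0.01 * rng.normal(size=(n_vis, n_hid))   # weights w_ij
b = np.zeros(n_vis)                          # visible biases
c = np.zeros(n_hid)                          # hidden biases
eps = 0.1                                    # learning rate (epsilon)

for epoch in range(1000):
    # t = 0: clamp the data and sample h from p(h = 1 | v).
    v0 = V
    p_h0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # t = 1: one step of alternating Gibbs sampling gives the reconstruction.
    p_v1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + c)

    # CD-1 update: eps * (<v_i h_j>^0 - <v_i h_j>^1), averaged over the batch.
    W += eps * (v0.T @ p_h0 - v1.T @ p_h1) / len(V)
    b += eps * (v0 - v1).mean(axis=0)
    c += eps * (p_h0 - p_h1).mean(axis=0)
```

Running the chain for many more alternations instead of one step would approximate the full \langle v_i h_j \rangle^\infty rule above; CD-1 trades that exactness for speed.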
DBN

In a sigmoid belief net it is easy to generate a sample from P(v | h), but it is hard to infer P(h | v) because of explaining away. Using an RBM to initialize the weights lets the network reach a good optimum.

Combining two RBMs to make a DBN (a sketch of this stacking appears after the references):
- Train the first RBM (v and h1, connected by weights W1) first.
- Copy the binary state of h1 for each v, and train a second RBM (h1 and h2, connected by weights W2) on those states.
- Compose the two RBM models to make a single DBN model: it's a deep belief net!

Why can we use an RBM to initialize the weights of a belief net? Consider an infinite sigmoid belief net with replicated weights (layers ..., v2, h1, v1, h0, v0, with W and W^T alternating between them); it is equivalent to an RBM. Inference in this directed net with replicated weights is trivial: we just multiply v0 by W^T. The model above h0 implements a complementary prior, so multiplying v0 by W^T gives the product of the likelihood term and the prior term.

Complementary priors and Markov chains:
- A Markov chain is a sequence of variables X_1, X_2, ... with the Markov property
P(X_t | X_1, ..., X_{t-1}) = P(X_t | X_{t-1})
- A Markov chain is stationary if the transition probabilities do not depend on time,
P(X_t = x' | X_{t-1} = x) = T(x \to x')
where T(x \to x') is called the transition matrix.
- If a Markov chain is ergodic, it has a unique equilibrium distribution:
P_t(X_t = x) \to P_\infty(X = x) as t \to \infty
- Most Markov chains used in practice satisfy detailed balance, e.g. Gibbs sampling, Metropolis-Hastings, slice sampling:
P_\infty(X) T(X \to X') = P_\infty(X') T(X' \to X)
- Such Markov chains are reversible:
P_\infty(X_1) T(X_1 \to X_2) T(X_2 \to X_3) T(X_3 \to X_4) = T(X_1 \leftarrow X_2) T(X_2 \leftarrow X_3) T(X_3 \leftarrow X_4) P_\infty(X_4)

In the infinite net with tied weights, the alternating layers X and Y are related by
P(Y_k = 1 | X_{k+1}) = \sigma(W^T X_{k+1} + c)
P(X_k = 1 | Y_k) = \sigma(W Y_k + b)

Reference
- G. Hinton, "Deep Belief Nets", NIPS 2007 tutorial.
- https://class.coursera.org/neuralnets-2012001/class/index
- Machine learning lecture notes.
- http://en.wikipedia.org/wiki/Graphical_model
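To tie the DBN section together, here is a minimal sketch of the greedy, layer-wise stacking described above (train the first RBM on v, copy its hidden states upward, train the second RBM on them), followed by top-down generation through the sigmoid conditional P(X_k = 1 | Y_k) = \sigma(W Y_k + b). The cd1_train helper, the layer sizes, the toy data, and the number of Gibbs steps are illustrative assumptions, not code from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def cd1_train(V, n_hid, eps=0.1, epochs=1000):
    """Minimal CD-1 RBM trainer (same scheme as the earlier sketch)."""
    W = 0.01 * rng.normal(size=(V.shape[1], n_hid))
    b, c = np.zeros(V.shape[1]), np.zeros(n_hid)
    for _ in range(epochs):
        p_h0 = sigmoid(V @ W + c)
        h0 = sample(p_h0)
        v1 = sample(sigmoid(h0 @ W.T + b))
        p_h1 = sigmoid(v1 @ W + c)
        W += eps * (V.T @ p_h0 - v1.T @ p_h1) / len(V)
        b += eps * (V - v1).mean(axis=0)
        c += eps * (p_h0 - p_h1).mean(axis=0)
    return W, b, c

# Toy binary data (assumption).
v = rng.integers(0, 2, size=(50, 8)).astype(float)

# Train the first RBM (v, h1) with weights W1.
W1, b1, c1 = cd1_train(v, n_hid=6)

# Copy the binary state of h1 for each v and train the second RBM (h1, h2).
h1 = sample(sigmoid(v @ W1 + c1))
W2, b2, c2 = cd1_train(h1, n_hid=4)

# Generate from the composed DBN: alternating Gibbs sampling in the top RBM
# (h1, h2), then one top-down pass through the directed sigmoid layer W1.
h1_s = sample(0.5 * np.ones((1, 6)))
for _ in range(200):
    h2_s = sample(sigmoid(h1_s @ W2 + c2))
    h1_s = sample(sigmoid(h2_s @ W2.T + b2))
v_gen = sigmoid(h1_s @ W1.T + b1)   # P(v = 1 | h1) = sigma(W1 h1 + b1)
print(np.round(v_gen, 2))
```

In a real DBN the stacking can continue for more layers; this sketch omits any fine-tuning of the composed model.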