Parameter Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Sampling Methods
- Direct sampling: works only for easy distributions (multinomial, Gaussian, etc.).
- Rejection sampling: create samples as in direct sampling, but only count the samples consistent with the given evidence.
- Importance sampling: create samples as in direct sampling and assign weights to the samples.
- Gibbs sampling: often used for high-dimensional problems; sample each variable conditioned on its Markov blanket.

Gibbs Sampling in Formulas
Gibbs sampling:
- Initialize $x^0$.
- For $t = 1$ to $N$:
  $x_1^t \sim P(X_1 \mid x_2^{t-1}, \ldots, x_K^{t-1})$
  $x_2^t \sim P(X_2 \mid x_1^t, x_3^{t-1}, \ldots, x_K^{t-1})$
  $\ldots$
  $x_K^t \sim P(X_K \mid x_1^t, \ldots, x_{K-1}^t)$
For graphical models, we only need to condition on the variables in the Markov blanket.
Variants:
- Randomly pick the variable to sample.
- Sample block by block, e.g. $(x_1^t, x_2^t) \sim P(X_1, X_2 \mid x_3^{t-1}, \ldots, x_K^{t-1})$.
(Figure: a small graphical model over $X_1, \ldots, X_5$.)

Gibbs Sampling: Image Segmentation
Pairwise Markov random field: $P(X) = \frac{1}{Z} \prod_i \Psi(x_i) \prod_{ij} \Psi(x_i, x_j)$.
We need the conditional $P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$:
$\frac{P(x_1, \ldots, x_n)}{P(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)} = \frac{\frac{1}{Z} \prod_i \Psi(x_i) \prod_{ij} \Psi(x_i, x_j)}{\frac{1}{Z} \sum_{x_i} \prod_i \Psi(x_i) \prod_{ij} \Psi(x_i, x_j)} \propto \Psi(x_i) \prod_{j \in N(i)} \Psi(x_i, x_j)$
- Terms without $x_i$ cancel out.
- $x_i$ is summed out in the denominator.
(Figure: a grid-structured MRF over pixels $X_1, \ldots, X_9$.)

Convergence of Gibbs Sampling
Not all samples $x^0, \ldots, x^N$ are independent.
Consider a particular marginal $P(x_i \mid u_i)$: the empirical estimate approaches the true $P(x_i \mid u_i)$ as $t$ grows. Discard the initial burn-in period and take samples from there on.
(Figure: empirical estimate of $P(x_i \mid u_i)$ versus $t$, converging to the true value after the burn-in.)
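To make the Gibbs sweep above concrete, here is a minimal sketch for a binary pairwise MRF on an image-like grid, assuming Ising-style potentials $\Psi(x_i) = \exp(b\, x_i)$ and $\Psi(x_i, x_j) = \exp(w\, x_i x_j)$ with $x_i \in \{-1, +1\}$; the function name and parameter values are illustrative, not taken from the slides.

```python
import numpy as np

def gibbs_pairwise_mrf(H, W, w=1.0, b=0.0, n_iters=200, seed=None):
    """Gibbs sampling for a binary (+1/-1) pairwise MRF on an H x W grid.

    Node potential  Psi(x_i)      = exp(b * x_i)
    Edge potential  Psi(x_i, x_j) = exp(w * x_i * x_j)
    so P(x_i | rest) depends only on the Markov blanket (the grid neighbours).
    """
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(H, W))           # x^0: random initial state
    samples = []
    for t in range(n_iters):
        for i in range(H):                          # one full sweep = one Gibbs iteration
            for j in range(W):
                # Sum over the Markov blanket of pixel (i, j): its grid neighbours.
                nb = 0.0
                if i > 0:     nb += x[i - 1, j]
                if i < H - 1: nb += x[i + 1, j]
                if j > 0:     nb += x[i, j - 1]
                if j < W - 1: nb += x[i, j + 1]
                # P(x_ij = +1 | blanket) is proportional to exp(b + w * nb);
                # normalising over the two states gives a logistic form.
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * (b + w * nb)))
                x[i, j] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())
    return samples

# Discard the burn-in portion before using the samples (cf. the convergence slide).
samples = gibbs_pairwise_mrf(8, 8, w=0.5, n_iters=500, seed=0)
post_burn_in = samples[100:]
```

Because the conditional of each pixel depends only on its grid neighbours, every update touches a constant number of potentials regardless of the image size.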
Parameter Learning Example
Estimate the probability $\theta$ of landing heads using a biased coin.
Given a sequence of $N$ independently and identically distributed (iid) flips, e.g. $D = \{x_1, x_2, \ldots, x_N\} = \{1, 0, 1, \ldots, 0\}$, $x_n \in \{0, 1\}$.
Model: $P(x \mid \theta) = \theta^x (1-\theta)^{1-x}$, i.e. $P(x \mid \theta) = 1-\theta$ for $x = 0$ and $\theta$ for $x = 1$.
Likelihood of a single observation $x_n$: $P(x_n \mid \theta) = \theta^{x_n} (1-\theta)^{1-x_n}$.
(Figure: plate model with $\theta$ as the parent of $x_n$, $n = 1, \ldots, N$.)

Bayesian Parameter Estimation
The Bayesian approach treats the unknown parameter as a random variable whose distribution can be inferred using Bayes' rule:
$p(\theta \mid D) = \frac{P(D \mid \theta)\, p(\theta)}{P(D)} = \frac{P(D \mid \theta)\, p(\theta)}{\int P(D \mid \theta)\, p(\theta)\, d\theta}$
Prior over $\theta$: Beta distribution, $P(\theta; \alpha, \beta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}$.
Posterior distribution:
$P(\theta \mid x_1, \ldots, x_N) = \frac{P(x_1, \ldots, x_N \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_N)} \propto \theta^{n_h} (1-\theta)^{n_t} \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}$
Posterior mean estimation:
$\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C \int \theta \cdot \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}\, d\theta = \frac{n_h+\alpha}{N+\alpha+\beta}$

MLE for Biased Coin
$\theta_{ML} = \arg\max_\theta P(D \mid \theta)$
Objective function, the log-likelihood:
$\ell(\theta; D) = \log P(D \mid \theta) = \log \theta^{n_h} (1-\theta)^{N-n_h} = n_h \log\theta + (N - n_h) \log(1-\theta)$
We need to maximize this w.r.t. $\theta$. Taking the derivative w.r.t. $\theta$:
$\frac{\partial \ell}{\partial \theta} = \frac{n_h}{\theta} - \frac{N-n_h}{1-\theta} = 0 \;\Rightarrow\; \theta_{MLE} = \frac{n_h}{N}$, or equivalently $\theta_{MLE} = \frac{1}{N}\sum_n x_n$.

How Should Estimators Be Used?
$\theta_{MAP} = \arg\max_\theta p(\theta \mid D)$ is not Bayesian (even though it uses a prior), since it is a point estimate.
Consider predicting the future. A sensible way is to combine the predictions based on all possible values of $\theta$, weighted by their posterior probability; this is called Bayesian prediction:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta, D)\, p(\theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, p(\theta \mid D)\, d\theta$
A frequentist prediction will typically use a "plug-in" estimator such as ML or MAP:
$P(x_{new} \mid D) = P(x_{new} \mid \theta_{ML})$ or $P(x_{new} \mid D) = P(x_{new} \mid \theta_{MAP})$.

Learning Graphical Models
The goal: given a set of independent samples (assignments of the random variables), find the best (most likely) graphical model, i.e. both the structure and the parameters.
Data, e.g.:
(A,F,S,N,H) = (T,F,F,T,F)
(A,F,S,N,H) = (T,F,T,T,F)
…
(A,F,S,N,H) = (F,T,T,T,T)
Structure learning finds the graph over A, F, S, N, H; parameter learning then fills in CPTs such as $P(S \mid F, A)$:

  F,A    T,T   T,F   F,T   F,F
  S=t    0.9   0.7   0.8   0.2
  S=f    0.1   0.3   0.2   0.8

MLE for General Bayesian Networks
If we assume that the parameters of each CPT are globally independent, and that all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$\ell(\theta; D) = \log P(D \mid \theta) = \log \prod_n \prod_i P(x_{n,i} \mid x_{n,\pi_i}, \theta_i) = \sum_i \sum_n \log P(x_{n,i} \mid x_{n,\pi_i}, \theta_i)$
For each variable $X_i$:
$\theta_{MLE}(X_i = x_i \mid X_{\pi_i} = u) = \frac{\#(X_i = x_i,\, X_{\pi_i} = u)}{\#(X_{\pi_i} = u)}$
Why? The derivation follows on the next slides.
(Figure: the Allergy/Flu/Sinus/Nose/Headache network.)

Decomposable Likelihood of a Directed Model
For the example network:
$\ell(\theta; D) = \log P(D \mid \theta) = \sum_n \log P(f_n \mid \theta_f) + \sum_n \log P(a_n \mid \theta_a) + \sum_n \log P(s_n \mid f_n, a_n, \theta_s) + \sum_n \log P(h_n \mid s_n, \theta_h)$
There is one term for each CPT, so the MLE problem breaks up into independent subproblems.
Because of the factorization of the distribution, we can estimate each CPT separately.
(Figure: the network split into its families (Allergy; Flu; Flu, Allergy to Sinus; Sinus to Headache), each learned separately.)

MLE for BNs with Tabular CPTs
Assume each CPT is represented as a table (multinomial): $\theta_{ijk} = P(X_i = j \mid X_{\pi_i} = k)$.
Note that in the case of multiple parents, $X_{\pi_i}$ has a composite state, and the CPT becomes a high-dimensional table.
The sufficient statistics are the counts of family configurations: $n_{ijk} = \#(X_i = j,\, X_{\pi_i} = k)$.
The log-likelihood is
$\ell(\theta; D) = \log \prod_{ijk} \theta_{ijk}^{n_{ijk}} = \sum_{ijk} n_{ijk} \log \theta_{ijk}$
Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get
$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$

MLE and Kullback-Leibler Divergence
KL divergence: $D(Q(x) \,\|\, P(x)) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$.
Empirical distribution: $\tilde{P}(x) = \frac{1}{N} \sum_{n=1}^N \delta(x, x_n)$, where $\delta(x, x_n)$ is a Kronecker delta function.
$\theta_{MLE} = \arg\min_\theta D(\tilde{P}(x) \,\|\, P(x \mid \theta)) = \arg\min_\theta \left( \sum_x \tilde{P}(x) \log \tilde{P}(x) - \sum_x \tilde{P}(x) \log P(x \mid \theta) \right) = \arg\min_\theta \left( \text{const.} - \frac{1}{N} \sum_{n=1}^N \log P(x_n \mid \theta) \right) = \arg\min_\theta \left( \text{const.} - \frac{1}{N} \ell(\theta; D) \right)$
So minimizing the KL divergence from the empirical distribution is equivalent to maximizing the likelihood.

Bayesian Estimator for Tabular CPTs
Factorization: $P(X = x) = \prod_i P(x_i \mid x_{\pi_i}, \theta_i)$.
Local CPT: multinomial distribution, $P(X_i = j \mid X_{\pi_i} = k) = \theta_{ijk}$.
Put a prior distribution over the parameters: $P(\theta_f, \theta_a, \theta_s, \theta_h)$.
(Figure: the example network with parameter nodes $\theta_f, \theta_a, \theta_s, \theta_h$ attached to Flu, Allergy, Sinus, Headache.)

Global and Local Parameter Independence
Global parameter independence (parameters of different nodes in the GM):
$P(\theta_f, \theta_a, \theta_s, \theta_h) = P(\theta_f)\, P(\theta_a)\, P(\theta_s)\, P(\theta_h)$
Local parameter independence (parameters within each node):
$P(X_i = j \mid X_{\pi_i} = k) = \theta_{ijk}$, with $P(\theta_i) = \prod_k P(\theta_{i \cdot k})$, i.e. the parameters for different parent configurations are independent.

Parameter Independence Example
Provided all variables are observed, we can perform the Bayesian update for each parameter independently.
(Figure: global parameter independence, with separate parameter nodes $\theta_1$ and $\theta_2$ for $X_1$ and $X_2$; local parameter independence, with the parameter of $X_2$ split into separate nodes, one per parent configuration.)
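To make the tabular-CPT estimators concrete, here is a minimal sketch for fully observed discrete data. With `alpha = 0` it returns the counting MLE $\theta^{ML}_{ijk} = n_{ijk} / \sum_{j'} n_{ij'k}$; with `alpha > 0` it adds Dirichlet pseudo-counts and returns the posterior-mean (smoothed) estimate, the per-family Bayesian update that the parameter independence assumptions above justify. The function name, variable names, and toy samples are illustrative only, not taken from the slides.

```python
import numpy as np
from collections import Counter

def fit_cpt(data, child, parents, arity, alpha=0.0):
    """Estimate P(child | parents) for a fully observed tabular CPT.

    data  : list of dicts mapping variable name -> value in {0, ..., arity-1}
    alpha : Dirichlet pseudo-count; alpha = 0 gives the MLE by counting,
            alpha > 0 gives the Bayesian posterior-mean (smoothed) estimate.
    Returns a dict: parent configuration (tuple) -> probability vector over the child.
    Parent configurations never seen in the data are omitted in this sketch.
    """
    counts = Counter()
    for row in data:
        u = tuple(row[p] for p in parents)          # composite parent state
        counts[(u, row[child])] += 1

    cpt = {}
    for u in {u for (u, _) in counts}:
        n = np.array([counts[(u, j)] + alpha for j in range(arity)], dtype=float)
        cpt[u] = n / n.sum()                        # normalise over the child's values
    return cpt

# Toy fully observed samples over Flu (F), Allergy (A), Sinus (S); values are made up.
data = [
    {"F": 1, "A": 0, "S": 1},
    {"F": 1, "A": 0, "S": 1},
    {"F": 0, "A": 1, "S": 1},
    {"F": 0, "A": 0, "S": 0},
    {"F": 1, "A": 1, "S": 1},
]
print(fit_cpt(data, child="S", parents=["F", "A"], arity=2))           # MLE
print(fit_cpt(data, child="S", parents=["F", "A"], arity=2, alpha=1))  # smoothed
```

Each CPT is fitted from its own family counts, mirroring the decomposition of the log-likelihood into one term per node.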
Which Distributions Satisfy the Independence Assumptions?
Discrete DAG models: $X_i \mid X_{\pi_i} \sim \text{Multinomial}(\theta)$, with a Dirichlet prior
$P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
Gaussian DAG models: $X_i \mid X_{\pi_i} \sim \text{Normal}(\mu, \Sigma)$, with a Normal prior
$P(\mu \mid \nu, \Psi) \propto \exp\left(-\tfrac{1}{2} (\mu - \nu)' \Psi^{-1} (\mu - \nu)\right)$

Parameter Sharing
Consider a time-invariant (stationary) first-order Markov model.
Initial state probability vector: $\pi_k = P(X_1 = k)$.
State transition probability matrix: $A_{ij} = P(X_t = j \mid X_{t-1} = i)$.
The likelihood of one sequence: $P(x_{1:T} \mid \theta) = P(x_1 \mid \pi) \prod_{t=2}^T P(x_t \mid x_{t-1})$.
The log-likelihood of $N$ sequences:
$\ell(\theta; D) = \sum_{n=1}^N \log P(x_1^n \mid \pi) + \sum_{n=1}^N \sum_{t=2}^T \log P(x_t^n \mid x_{t-1}^n, A)$
Again, we can optimize each parameter separately: $\pi$ can simply be estimated by frequencies. How about $A$?
(Figure: a chain $X_1 \to X_2 \to X_3 \to \cdots \to X_T$ sharing the parameters $\pi$ and $A$.)

Learning a Markov Chain Transition Matrix
$A$ is a stochastic matrix, $\sum_j A_{ij} = 1$, so each row of $A$ is a multinomial distribution.
The MLE of $A_{ij}$ is the fraction of transitions from $i$ to $j$:
$A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)}$
Application: if the states $X_t$ represent words, this is called a bigram language model.
Small sample size problem: if $i \to j$ did not occur in the data, we will have $A_{ij}^{ML} = 0$.
A standard hack is backoff smoothing: $A_{i\cdot}^{\lambda} = \lambda\, \eta + (1-\lambda) A_{i\cdot}^{ML}$, where $\eta$ is a background distribution and $0 \le \lambda \le 1$.

Bayesian Model for a Markov Chain
With global and local parameter independence, the posterior over the rows $A_{i \to \cdot}$ and $A_{i' \to \cdot}$ factorizes despite the v-structure.
Assign a Dirichlet prior $\beta_i$ to each row of the transition matrix. The posterior mean is
$A_{ij}^{Bayes} = \frac{\#(i \to j) + \beta_{ij}}{\#(i \to \cdot) + \sum_{j'} \beta_{ij'}} = \lambda\, \beta'_{ij} + (1-\lambda) A_{ij}^{ML}$,
where $\lambda = \frac{\sum_j \beta_{ij}}{\#(i \to \cdot) + \sum_j \beta_{ij}}$ and $\beta'_{ij} = \beta_{ij} / \sum_{j'} \beta_{ij'}$ is the normalized prior.
So the Bayesian estimate has exactly the form of backoff smoothing.

MLE for the Exponential Family
Consider a pairwise model in exponential family form:
$P(X_1, \ldots, X_K \mid \theta) = \frac{1}{Z(\theta)} \prod_i \exp(\theta_i X_i) \prod_{ij} \exp(\theta_{ij} X_i X_j)$
The (scaled) log-likelihood is
$\ell(\theta; D) = \frac{1}{N} \sum_{n=1}^N \log P(x^n \mid \theta) = \frac{1}{N} \sum_{n=1}^N \left( \sum_i \theta_i x_i^n + \sum_{ij} \theta_{ij} x_i^n x_j^n - \log Z(\theta) \right)$
(The products $X_i X_j$ can be replaced by other feature functions $f(x)$.)
Taking the derivative with respect to $\theta_{ij}$:
$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{n=1}^N x_i^n x_j^n - \frac{\partial \log Z(\theta)}{\partial \theta_{ij}}$,
where
$\frac{\partial \log Z(\theta)}{\partial \theta_{ij}} = \frac{1}{Z(\theta)} \frac{\partial Z(\theta)}{\partial \theta_{ij}} = \frac{1}{Z(\theta)} \sum_x x_i x_j \prod_i \exp(\theta_i x_i) \prod_{ij} \exp(\theta_{ij} x_i x_j)$.
Computing this term needs care, since it involves a sum over all configurations.

Moment Matching Condition
$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{n=1}^N x_i^n x_j^n - \frac{1}{Z(\theta)} \sum_x x_i x_j \prod_i \exp(\theta_i x_i) \prod_{ij} \exp(\theta_{ij} x_i x_j) = E_{\tilde{P}(X_i, X_j)}[X_i X_j] - E_{P(X \mid \theta)}[X_i X_j]$
The first term is the empirical covariance (moment) from the data, with $\tilde{P}(X_i, X_j) = \frac{1}{N} \sum_{n=1}^N \delta(X_i, x_i^n)\, \delta(X_j, x_j^n)$; the second term is the corresponding moment under the model.

MLE Learning Algorithm for Exponential Models
$\max_\theta \ell(\theta; D)$ is a convex optimization problem (the log-likelihood is concave in $\theta$). It can be solved by many methods, such as gradient ascent or conjugate gradient:
- Initialize the model parameters $\theta$.
- Loop until convergence:
  - Compute $\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = E_{\tilde{P}(X_i, X_j)}[X_i X_j] - E_{P(X \mid \theta)}[X_i X_j]$.
  - Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial \ell(\theta; D)}{\partial \theta_{ij}}$.
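Returning to the Markov chain transition matrix, here is a minimal sketch of the count-and-normalize estimator; the function name and toy sequences are illustrative. Setting `beta > 0` adds Dirichlet pseudo-counts, which gives the posterior-mean estimate and has the same effect as the backoff smoothing above (a $\lambda$-mixture of the prior mean and the MLE).

```python
import numpy as np

def fit_transition_matrix(sequences, K, beta=0.0):
    """MLE / Dirichlet-smoothed estimate of a K x K transition matrix A.

    sequences : list of state sequences, each a list of ints in {0, ..., K-1}
    beta      : pseudo-count added to every transition count;
                beta = 0 gives A_ij^ML = #(i -> j) / #(i -> .),
                beta > 0 avoids zero probabilities for unseen transitions
                (and is required if some state never appears as a source).
    """
    counts = np.zeros((K, K))
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    counts += beta
    return counts / counts.sum(axis=1, keepdims=True)   # normalise each row

# Two toy sequences over K = 3 states (made-up data).
seqs = [[0, 1, 1, 2, 0, 1], [2, 2, 0, 1, 2]]
A_ml = fit_transition_matrix(seqs, K=3)             # may contain exact zeros
A_bayes = fit_transition_matrix(seqs, K=3, beta=1)  # every entry strictly positive
```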
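Finally, a sketch of the gradient-ascent MLE loop for a small, fully observed binary pairwise model. It assumes 0/1 variables and computes $Z(\theta)$ by brute-force enumeration of all $2^K$ states, so it is only feasible for small $K$; the function name and data are illustrative. In larger models the model moments $E_{P(x \mid \theta)}[X_i X_j]$ cannot be enumerated and would themselves be approximated, e.g. with the Gibbs sampler from the beginning of this lecture.

```python
import numpy as np
from itertools import product

def fit_pairwise_exp_family(X, n_steps=500, eta=0.1):
    """Gradient-ascent MLE for a fully observed binary pairwise model
        P(x | theta) = (1/Z) exp( sum_i theta_i x_i + sum_{i<j} theta_ij x_i x_j ),
    using the moment-matching gradient: empirical moments minus model moments.
    Z(theta) is computed exactly by enumerating all 2^K states (small K only).
    """
    N, K = X.shape
    states = np.array(list(product([0, 1], repeat=K)), dtype=float)  # all 2^K configs

    def suff_stats(S):
        # Sufficient statistics: singleton terms x_i and pairwise terms x_i * x_j (i < j).
        iu = np.triu_indices(K, k=1)
        pair = (S[:, :, None] * S[:, None, :])[:, iu[0], iu[1]]
        return np.hstack([S, pair])

    T_data = suff_stats(X).mean(axis=0)        # empirical moments E_{P~}[f(x)]
    T_all = suff_stats(states)
    theta = np.zeros(T_all.shape[1])

    for _ in range(n_steps):
        logits = T_all @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                            # model distribution P(x | theta)
        T_model = p @ T_all                     # model moments E_{P(x|theta)}[f(x)]
        grad = T_data - T_model                 # moment-matching gradient of the log-likelihood
        theta += eta * grad                     # gradient ascent on the concave objective
    return theta

# Made-up binary data over K = 4 variables.
rng = np.random.default_rng(0)
X = (rng.random((100, 4)) < 0.5).astype(float)
theta_hat = fit_pairwise_exp_family(X)
```

At convergence the gradient vanishes, so the fitted model's moments match the empirical moments, which is exactly the moment matching condition on the slides.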