Multidimensional space
“The Last Frontier”

• Optimization
• Expectation
• Exhaustive search
• Random sampling
• “Probabilistic random” sampling

Lecture 18, CS567

Optimization – “I want the bestest there is”

• Problem: Given F(x1, x2, …, xn), find argmin{x} F(x1, x2, …, xn), where xi = Data AND/OR Model AND/OR Parameters, and F = P or some other function, e.g., energy
• Ideal goal: Finding the global minimum (or the global maximum, depending on the problem)
  – “Finding the happiest person in the whole wide world”
  – “Finding the deepest trench in the ocean”
• Ideal approach: Exhaustive search
  – Guaranteed to find the global minimum
  – Brute force
  – Subject to resource limits (“Will take a lot of diving!”)

Optimization - Random Sampling

• Spectrum of approaches:
  – Exhaustive search at one end, random sampling at the other
• Random sampling:
  – Find several local minima starting from random points
  – Pick the lowest minimum as the approximation of the global minimum
  – Examples:
    • “Pick a few air travel websites at random, and buy your ticket from the one giving the lowest fare” (finding the cheapest website)
    • “Pick a few casinos at random, and try your luck at each” (finding the luckiest casino in the world)
  – Problems:
    • If the number of websites/casinos is very large, the best of a small sample may be far from the global minimum
    • If a large sample is taken, the method begins to approximate exhaustive search
  – Example: Monte Carlo algorithm

Optimization

• Dynamic programming
  – Equivalent to exhaustive search without having to use brute force
  – However, applicable only to problems that satisfy the principle of optimality
• Gradient descent: On its own it finds only a local minimum, so it needs to be combined with some random/iterative element to find the global minimum
  – Backpropagation in NN training: Classic gradient descent
  – Line search: Update the direction only when the minimum along a particular direction is reached (“Go down the mid-line of the saddle before turning towards the flank”)
  – Second derivative based methods
    • Step size based on the second derivative of the function being minimized, and the history of past steps taken
    • Examples: Newton-Raphson, Conjugate gradient (scales better)
• Evolutionary/Genetic algorithms
  – Simulation based on principles of biological evolution, with a cost function of choice
  – Benefiting from no-holds-barred multiple inheritance

Optimization – Simulated Annealing

• Analogies
  – Annealing steel to a stable form
  – Crystallization works best with slow evaporation
  – Using a pogo stick with a bounce that decreases with time, to find the lowest valley
• Cost function has an additional term that is directly proportional to temperature
  – Relative importance of this term is progressively decreased
• Heat the system to a high temperature, i.e., give it a lot of energy (all states become roughly equally probable)
• Cool the system slowly (the probability distribution gradually approaches the underlying normal-temperature one)
• Example:
  – Producing a 3D structural model of a molecule, given diffraction/NMR data (constraints)

Computing Expectations – “How great am I?”

• Problem: Given F(x1, x2, …, xn), find E[F(x1, x2, …, xn)], where xi = Data AND/OR Model AND/OR Parameters, and F = P or some other function, e.g., energy
• Ideal goal: Finding the global average
  – “What is the average happiness in the whole wide world?”
  – “What is the average depth of the ocean?”
• Ideal approach: Exhaustive search
  – Guaranteed to find the global average
  – Brute force
  – Subject to resource limits (“Will take a lot of divers and a lot of diving!”)
• Typical application: Once you know E[F]
  – I scored 3. How great am I?
    • If this is soccer or baseball, GREAT!
    • If this is basketball, time for practice…
  – What is the statistical significance of a particular value of F?
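The exhaustive approach to E[F], and the “I scored 3. How great am I?” question, can be illustrated on a space small enough to enumerate. The following is only a minimal sketch under stated assumptions: the toy space (three fair dice, with F counting sixes) and the helper names `exhaustive_expectation` and `tail_probability` are hypothetical and not part of the lecture.

```python
from itertools import product

def exhaustive_expectation(f, p, domains):
    """Exact E[F]: sum of F(x) * P(x) over every state x in the space."""
    return sum(f(x) * p(x) for x in product(*domains))

def tail_probability(f, p, domains, observed):
    """Total probability of the states whose score is at least `observed`."""
    return sum(p(x) for x in product(*domains) if f(x) >= observed)

# Hypothetical toy space: three fair dice, F = number of sixes rolled.
domains = [range(1, 7)] * 3

def uniform(x):
    return 1 / 6 ** 3  # all 216 states are equally probable

def sixes(x):
    return sum(value == 6 for value in x)

mean_sixes = exhaustive_expectation(sixes, uniform, domains)
p_three = tail_probability(sixes, uniform, domains, 3)
print(mean_sixes, p_three)
```

Here a score of 3 sits far above the average, and the tail probability puts a number on how unusual it is. Enumeration visits all 6^3 states, which is exactly the resource limit the slide warns about once the number of dimensions grows.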

Expectation - Random Sampling

• Spectrum of approaches:
  – Exhaustive search at one end, random sampling at the other
• Random sampling:
  – Take the average of the values of the function at the random points
  – Examples:
    • “Pick a few air travel websites at random, and average the fares they quote” (getting the average price for a market survey)
    • “Pick a few casinos at random, and try your luck at each” (computing the general expectation of winning at a casino)
  – Problems:
    • If the number of websites/casinos is very large and/or highly variable in odds of winning (complex space), an estimate based on a small sample may be far from the global average
    • If a large sample is taken, the method begins to approximate exhaustive search
  – Example: Monte Carlo algorithm

Best of Both Worlds - MCMC

• Pragmatic goal: To approximate the expectation
  – “What is the average happiness in the world, give or take a laugh or two?”
  – “What is the average depth of the ocean, rounded off to miles?”
• Pragmatic approach:
  – Markov Chain Monte Carlo (“Probabilistic random” sampling)
  – Principles:
    • Monte Carlo:
      E[F(x1, x2, …, xn)] = Σ{x} F(x1, x2, …, xn) P(x1, x2, …, xn)
      is approximated by
      E[F(x1, x2, …, xn)] ≈ (1/T) Σ{t=1..T} F(x1(t), x2(t), …, xn(t))
      where T is the number of transitions from one state of the multidimensional variables to another, and xi(t) is the value of xi after transition t
      AND
    • Markov Chain approximation:
      – State = A particular set of values for {x}
      – Transition to the next state depends only on the current values of the variables
      – Stationary Markov chain: Constant transition probabilities
      – Ergodic distribution: The average number of transitions between two states is equal in either direction (a condition often called detailed balance), as represented in the Markov chain transition matrix. Thus, following a series of Markov steps will not alter the distribution. (Ergodicity: The chain converges to this distribution, irrespective of the starting distribution)
      – Represent the target distribution as the equilibrium (ergodic) distribution of a Markov chain, and sample from it

MCMC – Metropolis algorithm

• Goal: To sample states (and compute F for each of them) based on MCMC
• Metropolis:
  – Separate the probability of transition from state k to state i into two conditional probabilities
    • qik: Probability of selecting state i as the next step, while in state k (Clark Kent [while dancing with Lois Lane]: “May I have the next dance, Lana?”)
    • rik: Having selected state i as a candidate next step, probability of i actually becoming the next step (Lana Lang: “I’d love to, Clark! See you tomorrow, Lois”)
    • Thus, tik = qik rik
    • rik is given by the relative probabilities P(si) and P(sk). If P(si) < P(sk), rik = P(si)/P(sk), else rik = 1. (If Clark usually dances with Lois Lane, then the probability of him switching partners follows the relative probabilities; if Clark usually dances with Lana, then he should switch!)
  – Algorithm: For a chosen number of iterations
    1. Start in some state sk
    2. Pick a possible next state si based on qik
    3. Evaluate si for acceptance based on rik
    4. If accepted, calculate F(si), and go back to 1 with si as the current state
    5. If not accepted, stay in sk (counting F(sk) again in the average), and go back to 2
    Calculate E[F(s)] as the average of F over the sampled states

MCMC – Gibbs sampling algorithm

• Goal: To sample states (and compute F for each of them) based on MCMC
• Gibbs sampling:
  – At every step, subclassify the variables into free and fixed, and evaluate the probability of transition
  – If the transition is made, the next state differs from the previous one only in the values of the free variables
  – Algorithm: For a chosen number of iterations
    1. Start in some state Sk = (s1, s2, s3, …, sn)
    2. Make the transition to Si based on P({sub} | fixed variables), i.e., the conditional probability of the |sub| free variables given the n − |sub| fixed ones
    3. Repeat 2, but with a different variable freed
       • Choice of free variable(s) based on cycling, or probabilistic sampling
    Calculate E[F(s)] as the average of F over the sampled states

Optimizing Expectation

• Expectation Maximization
  – Example: Baum-Welch algorithm (déjà vu)
    • Given training data {s}, maximize E[P(s|w,M)]
    • “The average probability of a sequence that is part of the model should be as high as possible” or “On the average, the probability of a sequence that is part of the model should be high”
  – Most useful when both parameters and data are to be optimized. More generally, when two subclasses of parameters need to be optimized together.
  – Generalized form (GEM) just looks for a higher/lower value at each step, not necessarily the maximum/minimum
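
Looking back at the Metropolis slide, the accept/reject loop can be sketched in a few lines. This is an illustration under stated assumptions rather than the lecture's own code: the target distribution (states 0 to 9 with P(s) proportional to s + 1), the ring-shaped proposal, and all function names are hypothetical, and the proposal is assumed symmetric (qik = qki), which the simple acceptance rule rik = min(1, P(si)/P(sk)) relies on.

```python
import random

def metropolis_expectation(f, p, propose, start, n_steps, rng):
    """Estimate E[F] under a distribution proportional to p.

    Assumes `propose` is symmetric: proposing i from k is as likely
    as proposing k from i.
    """
    state = start
    total = 0.0
    for _ in range(n_steps):
        candidate = propose(state, rng)
        # r_ik = min(1, P(s_i) / P(s_k)); only the ratio of p is needed.
        ratio = p(candidate) / p(state)
        if ratio >= 1 or rng.random() < ratio:
            state = candidate
        # A rejected move leaves the chain in place, and the current
        # state is counted again in the running average.
        total += f(state)
    return total / n_steps

# Hypothetical target: states 0..9 with P(s) proportional to s + 1.
def weight(s):
    return s + 1

def ring_step(s, rng):
    """Symmetric proposal: one step left or right on a ring of 10 states."""
    return (s + rng.choice((-1, 1))) % 10

rng = random.Random(0)
estimate = metropolis_expectation(lambda s: s, weight, ring_step, 0, 200_000, rng)
exact = sum(s * weight(s) for s in range(10)) / sum(weight(s) for s in range(10))
print(estimate, exact)
```

Because only the ratio P(si)/P(sk) appears, the normalizing constant of P is never computed, which is much of the practical appeal of the method. The exact value is available here only because the toy space has ten states.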