Multidimensional space
“The Last Frontier”
• Optimization
• Expectation
• Exhaustive search
• Random sampling
• “Probabilistic random” sampling
Optimization – “I want the bestest there is”
• Problem: Given F(x1, x2, …, xn), find argmin_{x} F(x1, x2, …, xn)
  where xi = Data AND/OR Model AND/OR Parameters
  and F = P or some other function, e.g., energy
• Ideal goal: Finding global minimum (xor maximum)
– “Finding the happiest person in the whole wide world”
– “Finding the deepest trench in the ocean”
• Ideal approach: Exhaustive search
– Guaranteed to find global minimum
• Brute force
– Subject to resource limits (“Will take a lot of diving!”)
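As a toy illustration of exhaustive (“brute force”) search, a minimal grid-search sketch in Python; the cost function f and the grid resolution are invented for the example:

import itertools

def f(x, y):
    # Made-up cost function standing in for F(x1, ..., xn)
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

# Exhaustive search: evaluate F at every point of a discretized grid.
# Guaranteed to find the grid's global minimum, but the number of
# evaluations grows exponentially with the number of dimensions.
grid = [i / 10.0 for i in range(-50, 51)]   # 101 points per axis
best = min(itertools.product(grid, grid), key=lambda p: f(*p))
print("argmin ~", best, "min ~", f(*best))  # ~ (1.0, -2.0), 0.0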
Optimization - Random Sampling
• Spectrum of Approaches:
– Exhaustive ……… Random Sampling
• Random sampling:
– Find several local minima starting from random points
– Pick the lowest minimum as the approximation of the global minimum
– Examples:
• “Pick a few air travel websites at random, and buy your ticket from the one giving
the lowest fare” (finding the cheapest website)
• “Pick a few casinos at random, and try your luck at each” (finding the luckiest casino
in the world)
– Problems
• If the number of websites/casinos is very large, a small sample may be far from the
global minimum
• If a large sample is taken, the approach begins to approximate exhaustive search
– Example: Monte Carlo algorithm
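A minimal random-sampling sketch along these lines in Python: run a local search from several random starting points and keep the lowest minimum found. The multimodal cost function and the crude finite-difference local descent are invented stand-ins:

import math
import random

def f(x):
    # Made-up multimodal 1-D cost function with many local minima
    return math.sin(5 * x) + 0.1 * x * x

def local_descent(x, step=1e-3, iters=2000, eps=1e-6):
    # Crude local gradient descent using a finite-difference gradient
    for _ in range(iters):
        grad = (f(x + eps) - f(x - eps)) / (2 * eps)
        x -= step * grad
    return x

# Random restarts: descend from random points, keep the lowest minimum
# as the approximation of the global minimum.
random.seed(0)
starts = [random.uniform(-5, 5) for _ in range(10)]
minima = [local_descent(x0) for x0 in starts]
best = min(minima, key=f)
print("estimated global minimum at x =", best, "F =", f(best))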
Optimization
• Dynamic programming
– Equivalent to exhaustive search without having to use brute force
– However, applicable only to problems that satisfy the principle of optimality
• Gradient descent: Needs to be combined with some random/iterative element to
find the global minimum (a minimal sketch follows this list)
– Backpropagation in NN training: Classic gradient descent
– Line search: Update direction only when minimum along a particular
direction is reached (“Go down mid-line of saddle before turning towards
flank”)
– Second derivative based methods
• Step size based on second derivative of minimization function, and history of
past steps taken
• Examples: Newton-Raphson, Conjugate gradient (scales better)
• Evolutionary/Genetic algorithms
– Simulation based on principles of biological evolution, with cost function
of choice
– Benefiting from no-holds-barred multiple inheritance
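The sketch promised above: plain gradient descent on a made-up quadratic cost; the step size and iteration count are arbitrary choices:

def grad_f(x, y):
    # Analytic gradient of the made-up cost F(x, y) = (x - 1)^2 + (y + 2)^2
    return 2 * (x - 1.0), 2 * (y + 2.0)

# Plain gradient descent: repeatedly step downhill along -grad F.
x, y, step = 5.0, 5.0, 0.1
for _ in range(200):
    gx, gy = grad_f(x, y)
    x -= step * gx
    y -= step * gy
print("converged near", (round(x, 3), round(y, 3)))  # ~ (1, -2)

Note that this converges only to the nearest local minimum, which is why it is combined with restarts, line search, or annealing for global optimization.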
Optimization – Simulated Annealing
• Analogies
– Annealing steel to a stable form
– Crystallization best with slow evaporation
– Using a pogo stick with a bounce that decreases with time, to find lowest
valley
• Cost function has an additional term that is directly proportional
to temperature
– Relative importance of this term is progressively decreased
• Heat system to high temperature, i.e., give it a lot of energy (all
states become roughly equally probable)
• Cool system slowly (the probability distribution gradually approaches the
underlying normal-temperature one)
• Example:
– Producing a 3D structural model of a molecule, given diffraction/NMR
data (constraints)
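A minimal simulated-annealing sketch: accept uphill moves with probability exp(-dF/T) and cool T slowly. The cost function, proposal width, and cooling schedule are invented for the example:

import math
import random

def f(x):
    # Made-up multimodal cost function
    return math.sin(5 * x) + 0.1 * x * x

random.seed(0)
x, fx = 4.0, f(4.0)
T = 5.0                                   # start hot: nearly all moves accepted
while T > 1e-3:
    cand = x + random.gauss(0, 0.5)       # propose a nearby state
    fc = f(cand)
    # Always accept downhill moves; accept uphill moves with
    # probability exp(-dF / T), which shrinks as the system cools.
    if fc < fx or random.random() < math.exp(-(fc - fx) / T):
        x, fx = cand, fc
    T *= 0.999                            # slow cooling schedule
print("final state x =", round(x, 3), "F =", round(fx, 3))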
Computing Expectations
“How great am I?”
• Problem: Given F(x1, x2, …xn), find E[F(x1, x2, …xn)]
where xi = Data AND/OR Model AND/OR Parameters
and F = P or some other function, e.g., energy
• Ideal goal: Finding global average
– “What is the average happiness in the whole wide world?”
– “What is the average depth of the ocean?”
• Ideal approach: Exhaustive search
– Guaranteed to find global average
• Brute force
– Subject to resource limits (“Will take a lot of divers and a lot of diving!”)
• Typical application: Once you know E[F]
– I scored 3. How great am I?
• If this is soccer or baseball, GREAT!
• If this is basketball, time for practice….
– What is the statistical significance of a particular value of F?
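As an illustration of judging a particular score once E[F] is known, a sketch that converts a value into a z-score against a sampled mean and standard deviation; the numbers are invented:

import statistics

# Invented sample of F values (e.g., goals scored across many games)
samples = [0, 1, 2, 1, 0, 3, 1, 2, 0, 1, 4, 1, 2, 1, 0, 2]
mean = statistics.mean(samples)    # estimate of E[F]
sd = statistics.stdev(samples)

score = 3
z = (score - mean) / sd
print(f"E[F] ~ {mean:.2f}, z-score of {score}: {z:.2f}")
# A large positive z-score means the value is far above average,
# i.e., "GREAT!" relative to the estimated E[F].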
Expectation - Random Sampling
• Spectrum of Approaches:
– Exhaustive ……… Random Sampling
• Random sampling:
– Take the average of the values of the function at the random points
– Examples:
• “Pick a few air travel websites at random, and note the fare each one offers” (get the
average price for a market survey)
• “Pick a few casinos at random, and try your luck at each” (compute the general
expectation of winning at a casino)
– Problems
• If the number of websites/casinos is very large, and/or the odds of winning are highly
variable (a complex space), an estimate based on a small sample may be far from the
global average
• If a large sample is taken, the approach begins to approximate exhaustive search
– Example: Monte Carlo algorithm
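A minimal Monte Carlo averaging sketch: estimate E[F] by averaging F over random draws from P; the choice of a standard normal P and F(x) = x^2 is an invented example:

import random

def F(x):
    # Made-up function whose expectation we want under P(x)
    return x * x

# Monte Carlo estimate: draw samples from P (here, a standard normal)
# and average F over them. E[x^2] = 1 for a standard normal, so the
# estimate should approach 1 as the sample size grows.
random.seed(0)
n = 100_000
estimate = sum(F(random.gauss(0, 1)) for _ in range(n)) / n
print("E[F] ~", round(estimate, 3))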
Best of Both Worlds - MCMC
• Pragmatic goal: To approximate expectation
– “What is the average happiness in the world, give or take a laugh or two?”
– “What is the average depth of the ocean, rounded off to miles?”
• Pragmatic approach:
– Markov Chain Monte Carlo (“Probabilistic random” sampling)
– Principles:
• Monte Carlo
E[F(x1, x2, …, xn)] = Σ_{x} F(x1, x2, …, xn) P(x1, x2, …, xn) is approximated by
E[F(x1, x2, …, xn)] ≈ (1/T) Σ_{t=1}^{T} F(x1^(t), x2^(t), …, xn^(t))
where t indexes the T transitions from one state of the multi-dimensional variables to another
AND
• Markov Chain approximation
State = A particular set of values for {x}
Transition to the next state depends only on the current values of the variables
Stationary Markov chain: Constant transition probabilities
Ergodic distribution: The average number of transitions between any two states, as
represented in the Markov chain transition matrix, is equal in either direction; thus,
following a series of Markov steps does not alter the distribution. (Ergodicity: the
chain converges to this distribution irrespective of the starting distribution)
Represent the distribution as a Markov chain at equilibrium (ergodic) and sample from it
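A small numeric sketch of this convergence: repeatedly applying a fixed (stationary) transition matrix drives any starting distribution to the same equilibrium. The 2-state matrix is invented for the demo:

# Invented 2-state transition matrix: row k holds P(next = i | current = k)
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(dist):
    # One Markov step: new_dist[i] = sum_k dist[k] * P[k][i]
    return [sum(dist[k] * P[k][i] for k in range(2)) for i in range(2)]

for start in ([1.0, 0.0], [0.0, 1.0]):   # two very different starting points
    dist = start
    for _ in range(100):
        dist = step(dist)
    print(start, "->", [round(p, 4) for p in dist])
# Both starts converge to the same stationary distribution (~[0.833, 0.167])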
MCMC – Metropolis algorithm
• Goal: To sample states (and compute F for each of them) based on MCMC
• Metropolis:
– Separate the probability of transition from state k to state i into two conditional
probabilities:
• qik: Probability of selecting state i as the candidate next step, while in state k (Clark
Kent [while dancing with Lois Lane]: “May I have the next dance, Lana?”)
• rik: Having selected state i as a candidate next step, the probability of i actually
becoming the next step (Lana Lang: “I’d love to, Clark! See you tomorrow, Lois”)
• Thus, tik = qik rik
• rik is given by the relative probabilities of P(si) and P(sk): if P(si) < P(sk), rik =
P(si)/P(sk); otherwise rik = 1. (If Clark usually dances with Lois Lane, then the
probability of him switching partners follows the relative probabilities; if Clark
usually dances with Lana, then he should switch!)
– Algorithm:
For x number of iterations:
1. Start in some state sk
2. Pick a candidate next state si based on qik
3. Evaluate si for acceptance based on rik
4. If accepted, move to si, calculate F(si), and go back to step 2
5. If not accepted, stay in sk and go back to step 2
Calculate E[F(s)]
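A minimal Metropolis sketch for sampling from an unnormalized target density; the standard-normal target and the symmetric Gaussian proposal (under which rik reduces to min(1, P(si)/P(sk))) are illustrative assumptions:

import math
import random

def p_unnorm(x):
    # Target density up to a normalizing constant (standard normal, invented)
    return math.exp(-0.5 * x * x)

random.seed(0)
x = 0.0
samples = []
for _ in range(50_000):
    cand = x + random.gauss(0, 1.0)             # qik: symmetric random proposal
    r = min(1.0, p_unnorm(cand) / p_unnorm(x))  # rik = min(1, P(si)/P(sk))
    if random.random() < r:
        x = cand          # accepted: move to the candidate state
    samples.append(x)     # if rejected, the current state is counted again

# E[F] with F(x) = x^2; should approach 1 for a standard normal target
print("E[x^2] ~", round(sum(s * s for s in samples) / len(samples), 3))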
MCMC – Gibbs sampling algorithm
• Goal: To sample states (and compute F for each of them) based on MCMC
• Gibbs sampling:
– At every step, subclassify the variables into free and fixed, and evaluate the
probability of transition. If a transition is made, the next state differs from the
previous one only in the value of the free variable(s).
– Algorithm:
For x number of iterations:
1. Start in some state Sk = (s1, s2, s3, …, sn)
2. Make the transition to Si based on P({free subset} | the remaining n − |subset|
fixed variables)
3. Repeat 2, but with a different variable freed
• Choice of free variable(s) based on cycling, or on probabilistic sampling
Calculate E[F(s)]
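A minimal Gibbs sketch for a bivariate normal with correlation rho, where each full conditional is itself a normal; the target distribution and rho = 0.8 are invented for the demo:

import random

random.seed(0)
rho = 0.8                 # invented correlation of the bivariate normal target
x, y = 0.0, 0.0
samples = []
for _ in range(20_000):
    # Free x, fix y: the full conditional is x | y ~ N(rho * y, 1 - rho^2)
    x = random.gauss(rho * y, (1 - rho ** 2) ** 0.5)
    # Cycle: free y, fix x, using y | x ~ N(rho * x, 1 - rho^2)
    y = random.gauss(rho * x, (1 - rho ** 2) ** 0.5)
    samples.append((x, y))

# E[x * y] should approach rho for this target
print("E[x*y] ~", round(sum(a * b for a, b in samples) / len(samples), 3))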
Optimizing Expectation
• Expectation Maximization
– Example: Baum-Welch algorithm (déjà vu)
• Given training data {s}, maximize E[P(s|w,M)]
• “The average probability of a sequence that is part of the model should be as high as
possible” or “On the average, the probability of a sequence that is part of the model
should be high”
– Most useful when both parameters and data are to be optimized or, more generally,
when two subclasses of parameters need to be optimized together.
– General form (GEM) just looks for a higher/lower value, not necessarily
maximum/minimum
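Baum-Welch itself needs the full HMM machinery, so here is a compact EM sketch in the same spirit on a simpler model: fitting the means of a two-component 1-D Gaussian mixture. The data and initial means are invented:

import math

# Invented 1-D data drawn from two clusters (around 0 and around 5)
data = [-0.2, 0.1, 0.3, -0.4, 0.2, 4.8, 5.1, 5.3, 4.9, 5.2]

def gauss(x, mu):
    # Unit-variance Gaussian, up to a constant
    return math.exp(-0.5 * (x - mu) ** 2)

mu1, mu2 = 1.0, 4.0    # arbitrary initial means
for _ in range(50):
    # E-step: responsibility of component 1 for each point
    r = [gauss(x, mu1) / (gauss(x, mu1) + gauss(x, mu2)) for x in data]
    # M-step: update each mean as a responsibility-weighted average
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
print("estimated means ~", round(mu1, 2), round(mu2, 2))  # near 0 and 5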