Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

CHAPTER 2
DIRECT METHODS FOR STOCHASTIC SEARCH

• Organization of chapter in ISSO
  – Introductory material
  – Random search methods
    • Attributes of random search
    • Blind random search (algorithm A)
    • Two localized random search methods (algorithms B and C)
  – Random search with noisy measurements
  – Nonlinear simplex (Nelder-Mead) algorithm
    • Noise-free and noisy measurements

Some Attributes of Direct Random Search with Noise-Free Loss Measurements

• Ease of programming
• Use of only L values (vs. gradient values)
  – Avoids "artful contrivance" of more complex methods
• Reasonable computational efficiency
• Generality
  – Algorithms apply to virtually any function
• Theoretical foundation
  – Performance guarantees, sometimes in finite samples
  – Global convergence in some cases
2-2

Algorithm A: Simple Random ("Blind") Search

Step 0 (initialization) Choose an initial value θ = θ̂_0 inside Θ. Set k = 0.
Step 1 (candidate value) Generate a new independent value θ_new(k+1) ∈ Θ according to the chosen probability distribution. If L(θ_new(k+1)) < L(θ̂_k), set θ̂_{k+1} = θ_new(k+1). Else take θ̂_{k+1} = θ̂_k.
Step 2 (return or stop) Stop if the maximum number of L evaluations has been reached or the user is otherwise satisfied with the current estimate for θ; else, return to step 1 with the new k set to the former k + 1.
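The accept/reject logic of algorithm A takes only a few lines. The sketch below is illustrative, not from ISSO: the function names, the uniform sampler over a box Θ, and the quadratic test loss are all invented for the example.

```python
import numpy as np

def blind_random_search(loss, sample_theta, theta_0, max_evals=1000):
    """Sketch of algorithm A: draw independent candidates from Theta and
    keep a candidate only if it strictly lowers the loss."""
    theta_hat = np.asarray(theta_0, dtype=float)   # step 0: initial value in Theta
    best = loss(theta_hat)
    for _ in range(max_evals):
        theta_new = sample_theta()                 # step 1: independent draw from Theta
        l_new = loss(theta_new)
        if l_new < best:                           # accept only on strict improvement
            theta_hat, best = theta_new, l_new
    return theta_hat, best                         # step 2: stop after max_evals

# Example: quadratic loss with minimum at [1, 1]; Theta = [0, 3]^2
rng = np.random.default_rng(0)
loss_fn = lambda th: float(np.sum((np.asarray(th) - 1.0) ** 2))
theta, val = blind_random_search(loss_fn, lambda: rng.uniform(0.0, 3.0, size=2),
                                 theta_0=[2.0, 2.0], max_evals=500)
```

Because only L values are used (no gradients), the same few lines apply to essentially any loss a user can evaluate, which is the "generality" attribute noted above.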
2-3

First Several Iterations of Algorithm A on Problem with Solution θ* = [1.0, 1.0]^T (Example 2.1 in ISSO)

Iteration k   θ_new(k)^T      L(θ_new(k))   θ̂_k^T           L(θ̂_k)
0             -               -             [2.00, 2.00]    8.00
1             [2.25, 1.62]    7.69          [2.25, 1.62]    7.69
2             [2.81, 2.58]    14.55         [2.25, 1.62]    7.69
3             [1.93, 1.19]    5.14          [1.93, 1.19]    5.14
4             [2.60, 1.92]    10.45         [1.93, 1.19]    5.14
5             [2.23, 2.58]    11.63         [1.93, 1.19]    5.14
6             [1.34, 1.76]    4.89          [1.34, 1.76]    4.89
2-4

Functions for Convergence (Parts (a) and (b)) and Nonconvergence (Part (c)) of Blind Random Search

(a) Continuous L(θ); probability density for θ_new is > 0 on Θ = [0, ∞)
(b) Discrete L(θ); discrete sampling for θ_new with P(θ_new = θ_i) > 0 for θ_i = 0, 1, 2, …
(c) Noncontinuous L(θ); probability density for θ_new is > 0 on Θ = [0, ∞)
2-5

Algorithm B: Localized Random Search

Step 0 (initialization) Choose an initial value θ = θ̂_0 inside Θ. Set k = 0.
Step 1 (candidate value) Generate a random d_k. Check whether θ̂_k + d_k ∈ Θ. If not, generate a new d_k or move θ̂_k + d_k to the nearest valid point. Let θ_new(k+1) be θ̂_k + d_k or the modified point.
Step 2 (check for improvement) If L(θ_new(k+1)) < L(θ̂_k), set θ̂_{k+1} = θ_new(k+1). Else take θ̂_{k+1} = θ̂_k.
Step 3 (return or stop) Stop if the maximum number of L evaluations has been reached or the user is satisfied with the current estimate; else, return to step 1 with the new k set to the former k + 1.
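Steps 1 and 2 of the localized search can be sketched as follows for a box-shaped Θ, where "move to the nearest valid point" becomes a componentwise clip. The Gaussian perturbation d_k, the function names, and the test loss are illustrative assumptions, not ISSO's tuned choices.

```python
import numpy as np

def localized_random_search(loss, theta_0, bounds, step_std=0.5,
                            max_evals=1000, rng=None):
    """Sketch of algorithm B: perturb the current estimate with a random d_k,
    project back into a box Theta, and keep strict improvements."""
    rng = rng or np.random.default_rng()
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    theta_hat = np.asarray(theta_0, float)         # step 0
    best = loss(theta_hat)
    for _ in range(max_evals):
        d_k = rng.normal(scale=step_std, size=theta_hat.shape)   # step 1
        theta_new = np.clip(theta_hat + d_k, lo, hi)  # nearest valid point
        l_new = loss(theta_new)
        if l_new < best:                              # step 2
            theta_hat, best = theta_new, l_new
    return theta_hat, best                            # step 3
```

Unlike algorithm A, each candidate stays near the current estimate, so the search exploits local structure at the cost of possible capture by a local minimum.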
2-6

Algorithm C: Enhanced Localized Random Search

• Similar to algorithm B
• Exploits knowledge of good/bad directions
• If a move in one direction produces a decrease in loss, add a bias to the next iteration to keep the algorithm moving in the "good" direction
• If a move in one direction produces an increase in loss, add a bias to the next iteration to move the algorithm the opposite way
• Slightly more complex implementation than algorithm B
2-7

Formal Convergence of Random Search Algorithms

• Well-known results on convergence of random search
  – Apply to convergence of θ and/or L values
  – Apply when noise-free L measurements are used in the algorithms
• Algorithm A (blind random search) converges under very general conditions
  – Applies to continuous or discrete functions
• Conditions for convergence of algorithms B and C are somewhat more restrictive, but still quite general
  – ISSO presents a theorem for continuous functions
  – Other convergence results exist
• Convergence rate theory also exists: how fast does the algorithm converge?
  – Algorithm A is generally slow in high-dimensional problems
2-8

Example Comparison of Algorithms A, B, and C

• Relatively simple p = 2 problem (Examples 2.3 and 2.4 in ISSO)
  – Quartic loss function (plot on next slide)
• One global solution; several local minima/maxima
• Started all algorithms at a common initial condition and compared them based on a common number of loss evaluations
  – Algorithm A needed no tuning
  – Algorithms B and C required "trial runs" to tune algorithm coefficients
2-9

Multimodal Quartic Loss Function for p = 2 Problem (Example 2.3 in ISSO)

[Plot of the multimodal quartic loss surface]
2-10

Example 2.3 in ISSO (cont'd): Sample Means of Terminal Values L(θ̂_k) − L(θ*) in Multimodal Loss Function (with Approximate 95% Confidence Intervals)

              Algorithm A    Algorithm B    Algorithm C
Sample mean   2.51           0.78           0.49
95% CI        [1.94, 3.08]   [0.51, 1.04]   [0.32, 0.67]

Notes: Sample means from 40 independent runs.
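The good/bad-direction bias described for algorithm C (slide 2-6) might be sketched as below. The bias-update coefficients (0.2 and 0.4) are invented for illustration and are not ISSO's tuned values, and constraint handling is omitted for brevity.

```python
import numpy as np

def enhanced_localized_search(loss, theta_0, step_std=0.5,
                              max_evals=1000, rng=None):
    """Sketch of algorithm C: like algorithm B, but carry a bias term that
    drifts toward recent loss-decreasing directions and away from
    loss-increasing ones (coefficients below are illustrative)."""
    rng = rng or np.random.default_rng()
    theta_hat = np.asarray(theta_0, float)
    best = loss(theta_hat)
    bias = np.zeros_like(theta_hat)
    for _ in range(max_evals):
        d_k = rng.normal(scale=step_std, size=theta_hat.shape)
        theta_new = theta_hat + bias + d_k
        l_new = loss(theta_new)
        if l_new < best:
            theta_hat, best = theta_new, l_new
            bias = 0.2 * bias + 0.4 * d_k   # reinforce the "good" direction
        else:
            bias = 0.2 * bias - 0.4 * d_k   # bias away from the "bad" direction
    return theta_hat, best
```

The extra state (one vector) is the "slightly more complex implementation" the slide refers to; in exchange, consecutive accepted steps tend to line up, which speeds travel along descent directions.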
Confidence intervals for algorithms B and C overlap slightly, since 0.51 < 0.67.
2-11

Examples 2.3 and 2.4 in ISSO (cont'd): Typical Adjusted Loss Values L(θ̂_k) − L(θ*) and θ Estimates in Multimodal Loss Function (One Run)

                   Algorithm A      Algorithm B      Algorithm C
Adjusted L value   2.60             0.80             0.49
θ estimate         [2.55, 2.90]^T   [3.09, 2.74]^T   [2.68, 2.96]^T

Note: θ* = [2.904, 2.904]^T
2-12

Random Search Algorithms with Noisy Loss Function Measurements

• Basic implementation of random search assumes perfect (noise-free) values of L
• Some applications require the use of noisy measurements: y(θ) = L(θ) + noise
• Simplest modification is to form an average of y values at each iteration as an approximation to L
• Alternative modification is to set a threshold τ > 0 of improvement before a new value is accepted in the algorithm
• Thresholding in algorithm B with modified step 2:
  Step 2 (modified) If y(θ_new(k+1)) < y(θ̂_k) − τ, set θ̂_{k+1} = θ_new(k+1). Else take θ̂_{k+1} = θ̂_k.
• Very limited convergence theory with noisy measurements
2-13

Nonlinear Simplex (Nelder-Mead) Algorithm

• Nonlinear simplex method is a popular search method (e.g., fminsearch in MATLAB)
• A simplex is the convex hull of p + 1 points in ℝ^p
  – The convex hull is the smallest convex set enclosing the p + 1 points
  – For p = 2 the convex hull is a triangle
  – For p = 3 the convex hull is a pyramid
• Algorithm searches for θ* by moving the convex hull within Θ
• If the algorithm works properly, the convex hull shrinks/collapses onto θ*
• No injected randomness (contrast with algorithms A, B, and C), but allowance for noisy loss measurements
• Frequently effective, but no general convergence theory and many numerical counterexamples to convergence
2-14

Steps of Nonlinear Simplex Algorithm

Step 0 (Initialization) Generate an initial set of p + 1 extreme points θ_i (i = 1, 2, …, p + 1) in ℝ^p, the vertices of the initial simplex.
Step 1 (Reflection) Identify where the max, second-highest, and min loss values occur; denote the corresponding vertices by θ_max, θ_2max, and θ_min, respectively. Let θ_cent = centroid (mean) of all θ_i except θ_max.
Generate a candidate vertex θ_refl by reflecting θ_max through θ_cent using θ_refl = (1 + α)θ_cent − αθ_max (α > 0).
Step 2a (Accept reflection) If L(θ_min) ≤ L(θ_refl) < L(θ_2max), then θ_refl replaces θ_max; proceed to step 3; else go to step 2b.
Step 2b (Expansion) If L(θ_refl) < L(θ_min), then expand the reflection using θ_exp = γθ_refl + (1 − γ)θ_cent, γ > 1; else go to step 2c. If L(θ_exp) < L(θ_refl), then θ_exp replaces θ_max; otherwise reject the expansion and replace θ_max by θ_refl. Go to step 3.
2-15

Steps of Nonlinear Simplex Algorithm (cont'd)

Step 2c (Contraction) If L(θ_refl) ≥ L(θ_2max), then contract the simplex: either case (i) L(θ_refl) < L(θ_max), or case (ii) L(θ_max) ≤ L(θ_refl). The contraction point is θ_cont = βθ_max/refl + (1 − β)θ_cent, 0 < β < 1, where θ_max/refl = θ_refl in case (i) and θ_max/refl = θ_max in case (ii). In case (i), accept the contraction if L(θ_cont) ≤ L(θ_refl); in case (ii), accept the contraction if L(θ_cont) < L(θ_max). If accepted, replace θ_max by θ_cont and go to step 3; otherwise go to step 2d.
Step 2d (Shrink) If L(θ_cont) ≥ L(θ_max), shrink the entire simplex using a factor 0 < δ < 1, retaining only θ_min. Go to step 3.
Step 3 (Termination) Stop if the convergence criterion or the maximum number of function evaluations is met; else return to step 1.
2-16

Illustration of Steps of Nonlinear Simplex Algorithm with p = 2

[Figure with panels showing: reflection of θ_max through θ_cent; expansion when L(θ_refl) < L(θ_min); "outside" contraction when L(θ_refl) < L(θ_max); "inside" contraction when L(θ_refl) ≥ L(θ_max); shrink after a failed contraction]
2-17
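Steps 0-3 above can be collected into a compact sketch. Coefficient names follow the slides (α reflection, γ expansion, β contraction, δ shrink); the stopping rule (a fixed iteration count), tie-breaking, and helper names are simplifying assumptions rather than ISSO's exact specification.

```python
import numpy as np

def nelder_mead(loss, simplex, alpha=1.0, gamma=2.0, beta=0.5, delta=0.5,
                max_iter=200):
    """Sketch of the nonlinear simplex (Nelder-Mead) steps on the slides.
    `simplex` is a (p+1) x p array of initial vertices."""
    simplex = np.asarray(simplex, dtype=float)
    for _ in range(max_iter):
        vals = np.array([loss(v) for v in simplex])
        order = np.argsort(vals)                 # ascending: min ... 2max, max
        simplex, vals = simplex[order], vals[order]
        cent = simplex[:-1].mean(axis=0)         # centroid excluding theta_max
        refl = (1 + alpha) * cent - alpha * simplex[-1]       # step 1
        l_refl = loss(refl)
        if vals[0] <= l_refl < vals[-2]:                      # step 2a
            simplex[-1] = refl
        elif l_refl < vals[0]:                                # step 2b: expand
            exp = gamma * refl + (1 - gamma) * cent
            simplex[-1] = exp if loss(exp) < l_refl else refl
        else:                                                 # step 2c: contract
            case_i = l_refl < vals[-1]
            base = refl if case_i else simplex[-1]
            accept_thresh = l_refl if case_i else vals[-1]
            cont = beta * base + (1 - beta) * cent
            if loss(cont) <= accept_thresh:
                simplex[-1] = cont
            else:                                             # step 2d: shrink
                simplex = simplex[0] + delta * (simplex - simplex[0])
    vals = np.array([loss(v) for v in simplex])
    return simplex[np.argmin(vals)], float(vals.min())
```

On a smooth unimodal loss the simplex typically collapses onto θ* as described on slide 2-13, but as that slide also notes, there is no general convergence guarantee; MATLAB's fminsearch implements essentially the same scheme.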