Slides for Introduction to Stochastic Search and Optimization ( ISSO ) by J. C. Spall
CHAPTER 4
STOCHASTIC APPROXIMATION FOR ROOT-FINDING IN NONLINEAR MODELS
• Organization of chapter in ISSO
  – Introduction and potpourri of examples
    • Sample mean
    • Quantile and CEP
    • Production function (contrast with maximum likelihood)
  – Convergence of the SA algorithm
  – Asymptotic normality of SA and choice of gain sequence
  – Extensions to standard root-finding SA
    • Joint parameter and state estimation
    • Higher-order methods for algorithm acceleration
    • Iterate averaging
    • Time-varying functions
• Focus is on finding θ (i.e., θ*) such that g(θ*) = 0
  – g(θ) is typically a nonlinear function of θ (contrast with Chapter 3 in ISSO)
• Assume only noisy measurements of g(θ) are available:
  Y_k(θ) = g(θ) + e_k(θ), k = 0, 1, 2, …
• Above problem arises frequently in practice
– Optimization with noisy measurements (g(θ) represents gradient of loss function) (see Chapter 5 of ISSO)
– Quantile-type problems
– Equation solving in physics-based models
– Machine learning (see Chapter 11 of ISSO )
• Basic algorithm published in Robbins and Monro (1951)
• Algorithm is a stochastic analogue to steepest descent when used for optimization
– Noisy measurement Y_k(θ̂_k) replaces exact gradient g(θ̂_k)
• Generally wasteful to average measurements at given value of θ
  – Average across iterations (changing θ)
• Core Robbins-Monro algorithm for unconstrained root-finding is
  θ̂_(k+1) = θ̂_k − a_k Y_k(θ̂_k), where a_k > 0
• Constrained version of algorithm also exists
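The core recursion above can be sketched in a few lines of Python. The toy function, gain constants, and noise level below are illustrative assumptions, not values from ISSO:

```python
import numpy as np

def robbins_monro(y_noisy, theta0, a=0.5, A=10.0, alpha=1.0, n_iter=5000):
    """Unconstrained Robbins-Monro recursion:
    theta_{k+1} = theta_k - a_k * Y_k(theta_k), with gains a_k = a/(k+1+A)^alpha."""
    theta = float(theta0)
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha   # decaying, strictly positive gain
        theta -= a_k * y_noisy(theta)    # step on one noisy measurement of g
    return theta

# Toy root-finding problem: g(theta) = theta**3 - 8 has root theta* = 2;
# each measurement of g is corrupted by independent Gaussian noise.
rng = np.random.default_rng(0)
est = robbins_monro(lambda th: th**3 - 8 + rng.normal(scale=1.0), theta0=1.0)
```

Note that each iteration uses a single noisy measurement; per the slide above, averaging repeated measurements at a fixed θ is generally wasteful compared with averaging across iterations as θ changes.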
• Interested in estimating radius of circle about target such that half of impacts lie within circle (θ is scalar radius)
• Define success variable
  s_k(θ̂_k) = 1 if ‖X_k‖ ≤ θ̂_k (success)
  s_k(θ̂_k) = 0 otherwise (nonsuccess)
• Root-finding algorithm becomes
  θ̂_(k+1) = θ̂_k − a_k [s_k(θ̂_k) − 0.5], i.e., Y_k(θ̂_k) = s_k(θ̂_k) − 0.5
• Figure on next slide illustrates results for one study
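A quick simulation of this quantile recursion can be sketched as follows. The gain constants are made-up illustrative choices, and the impact points are taken as unit 2-D Gaussian, for which the true CEP is √(2 ln 2) ≈ 1.177:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.5                    # initial guess at the median-containing radius
a, A, alpha = 1.0, 20.0, 1.0   # illustrative gain-sequence constants
for k in range(20_000):
    X = rng.normal(size=2)                           # one impact point about the target
    s = 1.0 if np.linalg.norm(X) <= theta else 0.0   # success indicator s_k
    theta -= (a / (k + 1 + A) ** alpha) * (s - 0.5)  # Y_k = s_k - 0.5

true_cep = np.sqrt(2 * np.log(2))  # radius containing half the probability mass
```

Each binary success/failure observation nudges the radius down or up by a shrinking step, so the iterate settles at the radius where successes occur exactly half the time.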
[Figure: results for one CEP estimation study (Example 4.3 in ISSO)]
• Central aspect of root-finding SA is the set of conditions for formal convergence of the iterate to a root
– Provides rigorous basis for many popular algorithms (LMS, backpropagation, simulated annealing, etc.)
• Section 4.3 of ISSO contains two sets of conditions:
– “Statistics” conditions based on classical assumptions about g(θ), noise, and gains a_k
– “Engineering” conditions based on connection to deterministic ordinary differential equation (ODE)
• Convergence and stability of ODE dZ(τ)/dτ = −g(Z(τ)) closely related to convergence of SA algorithm (Z(τ) represents p-dimensional time-varying function and τ denotes time)
• Neither the statistics conditions nor the engineering conditions is a special case of the other
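The ODE connection can be illustrated by Euler-integrating dZ/dτ = −g(Z) for a toy g (an assumed function, not Example 4.6): when the ODE trajectory is attracted to the root, the noise-free part of the SA recursion behaves the same way.

```python
import numpy as np

def g(z):
    # illustrative nonlinear root-finding function with root z* = (0, 0)
    return np.array([z[0] + 0.5 * z[1] ** 2, z[1] + 0.5 * z[0] ** 2])

# Euler integration of the associated ODE dZ/dtau = -g(Z); asymptotic
# stability of z* for this ODE is what the "engineering" conditions exploit
z = np.array([0.8, -0.6])
dt = 0.01
for _ in range(5000):
    z = z - dt * g(z)   # trajectory is drawn toward the root of g
```

Starting inside the domain of attraction, the trajectory decays to the root; a start outside that domain would not, which is why the engineering conditions require stability properties of the ODE.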
[Figure: ODE convergence paths in the (Z_1, Z_2) plane for nonlinear problem in Example 4.6 in ISSO; satisfies ODE conditions due to asymptotic stability and global domain of attraction]
• Choice of the gain sequence a_k is critical to the performance of SA
• Famous conditions for convergence are
  Σ_(k≥0) a_k = ∞ and Σ_(k≥0) a_k² < ∞
• A common practical choice of gain sequence is
  a_k = a / (k + 1 + A)^α, where 1/2 < α ≤ 1, a > 0, and A ≥ 0
0
• Strictly positive A (“stability constant”) allows for larger a (possibly faster convergence) without risking unstable behavior in early iterations
• α and A can usually be pre-specified; critical coefficient a usually chosen by “trial-and-error”
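The two summability conditions and the practical gain form can be checked numerically. The constants below (a = 0.16, A = 100, α = 0.602) are illustrative choices for the sketch, not recommendations from ISSO:

```python
import numpy as np

a, A, alpha = 0.16, 100.0, 0.602   # illustrative constants with 1/2 < alpha <= 1
k = np.arange(1_000_000)
a_k = a / (k + 1 + A) ** alpha     # common practical gain sequence

partial_sums = np.cumsum(a_k)      # should grow without bound (sum of a_k diverges)
sum_sq = np.sum(a_k ** 2)          # should stay bounded (sum of a_k**2 converges)
```

With α ≤ 1 the partial sums keep growing, so the algorithm can travel any distance to the root, while 2α > 1 makes the squared-gain sum finite, which is what damps the accumulated measurement noise.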
(Section 4.5 of ISSO)
• Joint Parameter and State Evolution
– There exists state vector x_k related to system being optimized
– E.g., state-space model governing evolution of x_k, where model depends on values of θ
• Adaptive Estimation and Higher-Order Algorithms
– Adaptively estimating gain a k
– SA analogues of fast Newton-Raphson search
• Iterate Averaging
– See slides to follow
• Time-Varying Functions
– See slides to follow
• Iterate averaging is important and relatively recent development in SA
• Provides means for achieving optimal asymptotic performance without using optimal gains a k
• Basic iterate average uses following sample mean as final estimate:
  θ̄_k = (k + 1)^(−1) Σ_(j=0)^(k) θ̂_j
• Results in finite-sample practice are mixed
• Success relies on large proportion of individual iterates hovering in some balanced way around θ*
  – Many practical problems have iterate approaching θ* in roughly monotonic manner
– Monotonicity not consistent with good performance of iterate averaging; see plot on following slide
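The idea can be sketched on a linear toy problem (the function, gains, and noise level are assumptions for illustration): run the basic recursion with a slowly decaying gain (α < 1) and report the sample mean of the iterates as the final estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, iterates = 0.0, []
for k in range(4000):
    a_k = 1.0 / (k + 1) ** 0.7                  # slowly decaying gain, alpha = 0.7
    Y = (theta - 3.0) + rng.normal(scale=0.5)   # noisy g(theta) = theta - 3, root theta* = 3
    theta -= a_k * Y
    iterates.append(theta)

theta_bar = np.mean(iterates)                   # basic iterate average as final estimate
```

Here the iterates quickly reach and then bounce around θ* = 3, the balanced regime where averaging helps; when the iterate instead creeps toward θ* monotonically, the average lags behind and can do worse than the final iterate.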
• In some problems, the root-finding function varies with iteration: g_k(θ) (rather than g(θ))
– Adaptive control with time-varying target vector
– Experimental design with user-specified input values
– Signal processing based on Markov models (Subsection 4.5.1 of ISSO)
• Suppose that lim_(k→∞) g_k(θ*) = 0 for some fixed value θ* (equivalent to the fixed θ* in conventional root-finding)
• In such cases, much standard theory continues to apply
• Plot on following slide shows case when g_k(θ) represents a gradient function with scalar θ
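A tracking sketch for the time-varying case (the drifting root and all constants are made up for illustration): each iteration measures a different g_k, whose root converges to a fixed θ*, and the standard decaying-gain recursion still homes in on θ*.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.0
for k in range(5000):
    root_k = 3.0 + 2.0 / (k + 1)                  # root of g_k, converging to theta* = 3
    Y = (theta - root_k) + rng.normal(scale=0.3)  # noisy measurement of g_k(theta)
    theta -= Y / (k + 1)                          # gain a_k = 1/(k+1)
```

Because g_k(θ*) → 0, the early iterations chase the moving root while the later ones average out the noise, just as in the conventional fixed-function setting.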
[Figure: Time-Varying g_k(θ) = ∂L_k(θ)/∂θ for Loss Functions with Limiting Minimum θ*]