Slides for Introduction to Stochastic Search
and Optimization (ISSO) by J. C. Spall
CHAPTER 10
EVOLUTIONARY COMPUTATION II:
GENERAL METHODS AND THEORY
• Organization of chapter in ISSO
– Introduction
– Evolution strategy and evolutionary programming;
comparisons with GAs
– Schema theory for GAs
– What makes a problem hard?
– Convergence theory
– No free lunch theorems
Methods of EC
• Genetic algorithms (GAs), evolution strategy (ES), and
evolutionary programming (EP) are most common EC
methods
• Many modern EC implementations borrow aspects from
one or more EC methods
• Generally: ES is aimed at function optimization; EP at AI
applications such as automatic programming
ES Algorithm with Noise-Free Loss
Measurements
Step 0 (initialization) Randomly or deterministically
generate initial population of N values of θ and
evaluate L for each of the values.
Step 1 (offspring) Generate λ offspring from current
population of N candidate θ values such that all λ values
satisfy direct or indirect constraints on θ.
Step 2 (selection) For (N + λ)-ES, select N best values from
combined population of N original values plus λ offspring;
for (N, λ)-ES, select N best values from population of λ > N
offspring only.
Step 3 (repeat or terminate) Repeat steps 1 and 2 or
terminate.
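The three steps can be sketched in a few lines of Python; the quadratic loss, the [0, 1] constraint interval, and the Gaussian mutation scale below are illustrative assumptions, not part of the slides:

```python
import random

def evolution_strategy(loss, n=10, lam=20, plus=True, n_iters=100, sigma=0.1, seed=1):
    """Sketch of (N + lambda)-ES (plus=True) or (N, lambda)-ES (plus=False)
    for a scalar theta constrained to the interval [0, 1]."""
    rng = random.Random(seed)
    # Step 0: random initial population of N theta values
    pop = [rng.uniform(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iters):
        # Step 1: lambda offspring, clipped so every value satisfies the constraint
        offspring = [min(1.0, max(0.0, rng.choice(pop) + rng.gauss(0.0, sigma)))
                     for _ in range(lam)]
        # Step 2: (N + lambda) selects from parents plus offspring;
        # (N, lambda) selects from the lambda > N offspring only
        candidates = pop + offspring if plus else offspring
        pop = sorted(candidates, key=loss)[:n]
    return pop

# Usage: minimize an illustrative quadratic loss with minimum at 0.3
best = evolution_strategy(lambda th: (th - 0.3) ** 2)[0]
```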
Schema Theory for GAs
• Key innovation in Holland (1975) is a form of theoretical
foundation for GAs based on schemas
– Represents first attempt at serious theoretical analysis
– But not entirely successful, as “leap of faith” required to
relate schema theory to actual convergence of GA
• “GAs work by discovering, emphasizing, and recombining
good ‘building blocks’ of solutions in a highly parallel
fashion.” (Melanie Mitchell, An Introduction to Genetic
Algorithms [p. 27], 1996, paraphrasing John Holland)
– Statement above more intuitive than formal
– Notion of building block is characterized via schemas
– Schemas are propagated or destroyed according to the
laws of probability
Schema Theory for GAs
• Schema is template for chromosomes in GAs
• Example: [* 1 0 * * * * 1], where the * symbol represents a
don’t care (or free) element
– [1 1 0 0 1 1 0 1] is specific instance of this schema
• Schemas sometimes called building blocks of GAs
• Two fundamental results: Schema theorem and implicit
parallelism
• Schema theorem says that better templates dominate the
population as generations proceed
• Implicit parallelism says that GA processes >> N schemas
at each iteration
• Schema theory is controversial
– Not connected to algorithm performance in same direct way
as usual convergence theory for iterates of algorithm
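The schema-matching rule is simple to state in code; a minimal sketch using the schema from the example above (the helper `matches` is hypothetical, not from the slides):

```python
def matches(schema, chromosome):
    # A chromosome is an instance of a schema if it agrees with the
    # schema at every fixed (non-*) position
    return all(s == "*" or s == c for s, c in zip(schema, chromosome))

schema = list("*10****1")                  # the schema from the slide
print(matches(schema, list("11001101")))   # instance of the schema
print(matches(schema, list("10001101")))   # violates a fixed position

# A schema with d fixed positions has 2**(B - d) instances; conversely,
# each B-bit chromosome is an instance of 2**B schemas, which underlies
# the "implicit parallelism" claim
n_instances = sum(matches(schema, format(x, "08b")) for x in range(2 ** 8))
```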
Convergence Theory via Markov Chains
• Schema theory inadequate
– Mathematics behind schema theory not fully rigorous
– Unjustified claims about implications of schema theory
• More rigorous convergence theory exists
– Pertains to noise-free loss (fitness) measurements
– Pertains to finite representation (e.g., bit coding or floating
point representation on digital computer)
• Convergence theory relies on Markov chains
• Each state in chain represents possible population
• Markov transition matrix P contains all information for
Markov chain analysis
GA Markov Chain Model
• GAs with binary bit coding can be modeled as (discrete
state) Markov chains
• Recall states in chain represent possible populations
• ith element of probability vector pk represents probability of
achieving ith population at iteration k
• Transition matrix: The i, j element of P represents the
probability of population i producing population j through
the selection, crossover and mutation operations
– Depends on loss (fitness) function, selection method, and
reproduction and mutation parameters
• Given transition matrix P, it is known that
  p_{k+1}^T = p_k^T P
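The recursion above can be checked on a toy chain; the two-state transition matrix below is an illustrative assumption, not a real GA model:

```python
# Toy two-state chain: state 0 = population without the optimum,
# state 1 = population containing it; P[i][j] = prob. state i -> state j
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(p, P):
    # one application of p_{k+1}^T = p_k^T P
    return [sum(p[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P[0]))]

p = [1.0, 0.0]            # start in state 0 with probability 1
for _ in range(200):
    p = step(p, P)
# p converges to this chain's stationary distribution (0.8, 0.2)
```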
Rudolph (1994) and Markov Chain
Analysis for Canonical GA
• Rudolph (1994, IEEE Trans. Neural Nets.) uses Markov
chain analysis to study “canonical GA” (CGA)
• CGA includes binary bit coding, crossover, mutation, and
“roulette wheel” selection
– CGA is focus of seminal book, Holland (1975)
• CGA does not include elitism; lack of elitism is critical
aspect of theoretical analysis
• CGA assumes mutation probability 0 < Pm < 1 and single-point
crossover probability 0 ≤ Pc ≤ 1
• Key preliminary result: CGA is ergodic Markov chain:
– Exists a unique limiting distribution for the states of chain
– Nonzero probability of being in any state regardless of
initial condition
Rudolph (1994) and Markov Chain
Analysis for CGA (cont’d)
• Ergodicity for CGA provides a negative result on
convergence in Rudolph (1994)
• Let L̂_min,k denote lowest of N (= population size) loss
values within population at iteration k
– L̂_min,k represents loss value for θ in population k that has
maximum fitness value
• Main theorem: CGA satisfies
  lim_{k→∞} P(L̂_min,k = L(θ*)) < 1
(above limit on left-hand side exists by ergodicity)
• Implies CGA does not converge to the global optimum
Rudolph (1994) and Markov Chain
Analysis for CGA (cont’d)
• Fundamental problem with CGA is that optimal solutions
are found but then lost
• CGA has no mechanism for retaining optimal solution
• Rudolph discusses modification to CGA yielding positive
convergence results
• Appends “super individual” to each population
– Super individual represents best chromosome so far
– Not eligible for GA operations (selection, crossover, mutation)
– Not same as elitism
• CGA with added super individual converges in
probability
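Rudolph's modification can be mimicked in a few lines: carry a best-so-far chromosome alongside the population, outside all GA operations. The GA loop below is a deliberately crude stand-in (uniform resampling replaces roulette selection and crossover), so only the super-individual bookkeeping should be taken literally:

```python
import random

def cga_with_super_individual(loss, n_iters=200, pop_size=6, n_bits=8, pm=0.05, seed=2):
    rng = random.Random(seed)
    decode = lambda bits: int("".join(map(str, bits)), 2)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    super_ind = min(pop, key=lambda c: loss(decode(c)))   # best chromosome so far
    trace = []
    for _ in range(n_iters):
        # crude stand-in for selection/crossover: resample parents uniformly,
        # then flip each bit with probability pm (mutation, as in the CGA)
        pop = [[b ^ (rng.random() < pm) for b in rng.choice(pop)]
               for _ in range(pop_size)]
        gen_best = min(pop, key=lambda c: loss(decode(c)))
        # update the super individual, but never feed it back into the GA
        if loss(decode(gen_best)) < loss(decode(super_ind)):
            super_ind = gen_best
        trace.append(loss(decode(super_ind)))
    return super_ind, trace

# the retained best-so-far loss can never increase, unlike the CGA's own best
_, trace = cga_with_super_individual(lambda x: (x - 200) ** 2)
```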
Contrast of Suzuki (1995) and Rudolph
(1994) in Markov Chain Analysis for GA
• Suzuki (1995, IEEE Trans. Systems, Man, and Cyber.)
uses Markov chain analysis to study GA with elitism
– Same as CGA of Rudolph (1994) except for elitism
• Suzuki (1995) only considers unique states (populations)
– Rudolph (1994) includes redundant states
• With N = population size and B = no. of bits/chromosome:
  C(N + 2^B − 1, N) = (N + 2^B − 1)! / [(2^B − 1)! N!]
unique states in Suzuki (1995), versus
2^(NB) states in Rudolph (1994) (much larger than number of
unique states above)
• Above affects bookkeeping; does not fundamentally
change relative results of Suzuki (1995) and Rudolph
(1994)
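The two state counts are easy to compare numerically, e.g., for the N = B = 6 case used later in the chapter:

```python
from math import comb

def unique_states(N, B):
    # Distinct populations = multisets of N chromosomes drawn with
    # repetition from the 2**B possible bit strings (Suzuki, 1995)
    return comb(N + 2**B - 1, N)

def redundant_states(N, B):
    # Ordered populations, as in Rudolph (1994): (2**B)**N = 2**(N*B)
    return 2 ** (N * B)

# For N = B = 6: about 1.2e8 unique states versus about 6.9e10 ordered
# ones; the former is why P is roughly a 1e8 x 1e8 matrix in that example
print(unique_states(6, 6), redundant_states(6, 6))
```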
Convergence Under Elitism
• In both CGA case (Rudolph, 1994) and case with elitism
(Suzuki, 1995) the limit p̄ exists:
  p̄^T = lim_{k→∞} p_0^T P^k
(dimension of p̄ differs according to definition of states,
unique or nonunique as on previous slide)
• Suzuki (1995) assumes each population includes one elite
element and that crossover probability Pc = 1
• Let p̄_j represent jth element of p̄, and J represent indices j
where population j includes chromosome achieving L(θ*)
• Then from Suzuki (1995): Σ_{j∈J} p̄_j = 1
• Implies GA with elitism converges in probability to set
of optima
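The effect of elitism can be illustrated on a toy chain in which every state containing the optimum is absorbing (elitism never discards the best chromosome); the 3-state matrix below is an illustrative assumption:

```python
# States 0 and 1 lack the optimum; state 2 contains it.  With elitism the
# optimum is never lost, so state 2 is absorbing and J = {2}.
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.0, 0.0, 1.0]]

p = [1.0, 0.0, 0.0]   # start with no optimum in the population
for _ in range(500):
    # one application of p_{k+1}^T = p_k^T P
    p = [sum(p[i] * P[i][j] for i in range(3)) for j in range(3)]
# the limiting probability mass on J approaches 1
```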
Calculation of Stationary Distribution
• Markov chain theory provides useful conceptual device
• Practical calculation difficult due to explosive growth of
number of possible populations (states)
• Growth is in terms of factorials of N and bit string length
(B)
• Practical calculation of pk usually impossible due to
difficulty in getting P
• Transition matrix can be very large in practice
– E.g., if N = B = 6, P is a 10^8 × 10^8 matrix!
– Real problems have N and B much larger than 6
• Ongoing work attempts to severely reduce dimension by
limiting states to only most important (e.g., Spears, 1999;
Moey and Rowe, 2004)
Example 10.2 from ISSO: Markov Chain
Calculations for Small-Scale Implementation
• Consider L(θ) = −θ sin(θ), θ ∈ Θ = [0, 15]
• Function has local and global minimum; plot on next slide
• Several GA implementations with very small population
sizes (N) and numbers of bits (B)
• Small scale implementations imply Markov transition
matrices are computable
– But still not trivial, as matrix dimensions range from
approximately 2000 × 2000 to 4000 × 4000
Loss Function for Example 10.2 in ISSO
Markov chain theory provides probability of finding
solution (θ = 15) in given number of iterations
Example 10.2 (cont’d): Probability
Calculations for Very Small-Scale GAs
Probability that GA with elitism produces
population containing optimal solution:

GA iteration:                        0     5     10    20    30    40    50    100   150
Pc = 1.0, Pm = 0.05, N = 2, B = 6:   0.03  0.08  0.15  0.32  0.48  0.62  0.74  0.97  1.00
Pc = 1.0, Pm = 0.05, N = 4, B = 4:   0.21  0.51  0.69  0.92  1.00  --    --    --    --
Pc = 1.0, Pm = 0.05, N = 2, B = 4:   0.12  0.23  0.34  0.55  0.75  0.93  1.00  --    --

(Pc = crossover probability, Pm = mutation probability,
N = population size, B = bit length)
Summary of GA Convergence Theory
• Schema theory (Holland, 1975) was most popular method
for theoretical analysis until approximately mid-1990s
– Schema theory not fully rigorous and not fully connected to
actual algorithm performance
• Markov chain theory provides more formal means of
convergence—and convergence rate—analysis
• Rudolph (1994) used Markov chains to provide largely
negative result on convergence for canonical GAs
– Canonical GA does not converge to optimum
• Suzuki (1995) considered GAs with elitism; unlike Rudolph
(1994), GA is now convergent
• Challenges exist in practical calculation of Markov
transition matrix
No Free Lunch Theorems (Reprise, Chap. 1)
• No free lunch (NFL) Theorems apply to EC algorithms
– Theorems imply there can be no universally efficient EC
algorithm
– Performance of one algorithm when averaged over all
problems is identical to that of any other algorithm
• Suppose EC algorithm A applied to loss L
– Let L̂(n) denote lowest loss value from most recent N
population elements after n ≥ N unique function evaluations
• Consider the probability that L̂(n) = ε after n unique
evaluations of the loss:
  P(L̂(n) = ε | L, A)
• NFL theorems state that the sum of above probabilities
over all loss functions is independent of A
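The averaging claim can be verified exhaustively on a tiny problem; the three-point domain, binary loss range, and the two fixed search orders below are illustrative assumptions:

```python
from itertools import product

points = [0, 1, 2]        # tiny search domain
n = 2                     # number of unique loss evaluations allowed

def found_min(order, losses):
    # Did this fixed, non-repeating search order see the global minimum
    # value within its first n unique evaluations of the loss?
    return min(losses[x] for x in order[:n]) == min(losses)

alg_A = [0, 1, 2]
alg_B = [2, 0, 1]
# sum the success indicator over ALL binary loss functions on the domain
count_A = sum(found_min(alg_A, f) for f in product([0, 1], repeat=3))
count_B = sum(found_min(alg_B, f) for f in product([0, 1], repeat=3))
# NFL: the totals agree, whatever the two search orders are
```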
Comparison of Algorithms for Stochastic
Optimization in Chaps. 2 – 10 of ISSO
• Table next slide is rough summary of relative merits of
several algorithms for stochastic optimization
– Comparisons based on semi-subjective impressions from
numerical experience (author and others) and theoretical or
analytical evidence
– NFL theorems not generally relevant as only considering
“typical” problems of interest, not all possible problems
• Table does not consider root-finding per se
• Table is for “basic” implementation forms of algorithms
• Ratings range from L (low) through ML (medium-low), M
(medium), and MH (medium-high) to H (high)
– These scales are for stochastic optimization setting and have
no meaning relative to classical deterministic methods
Comparison of Algorithms
[Table: ratings from L (low) to H (high) of random search, RLS,
stochastic gradient, SPSA (basic), ASP, SAN, and GA across ease of
implementation, efficiency in high dimensions, generality of loss
function, global optimization, handling of noise in loss/gradient,
real-time applications, theoretical foundation, and "sexiness"]