Program Structure-Fitness Disconnect and Its Impact On Evolution In

advertisement
Title: Program Structure-Fitness Disconnect and Its Impact On
Evolution In Genetic Programming
Authors: A.A. Almal, C.D. MacLean, W.P. Worzel
Keywords: phenotype, genotype, evolutionary dynamics, GP
structure, GP content, speciation, population, fitness
I. Abstract
Simple Genetic Programming (GP) is generally considered to lack the
strong separation between genotype and phenotype found in natural
evolution. In many cases, the genotype and the phenotype are
considered identical in GP since the program representation does not
undergo any modification prior to its encounter with “environment” in
the form of inputs and a fitness function. However, this view overlooks
a key fact: fitness in GP is determined without reference to the
makeup of the individual programs but evolutionary changes occur in
the structure and content of the individual without reference to its
fitness. This creates a disconnect between “genetic recombination” and
fitness similar to that in nature that can create unexpected effects
during the evolution of a population andsuggests an important
dynamic that has not been thoroughly considered by the GP
community. This paper describes some of the observed effects of this
disconnect and suggests some metrics for the informational diversity
of a population which could lead to a new way of modeling the
dynamics of GP. We also speculate on the similarity of these effects
and some recently studied aspects of natural evolution.
II. Introduction/Background
In comparison to natural genetics and evolution, genetic programming (GP) is a crude
mechanism. In a simple GP system, the phenotype is seemingly the same as the genotype
with the individual program applied to a problem being identical to the entity that is
involved in such operations as crossover and mutation. There are no equivalents to the
natural mechanisms of transcription or translation let alone such complexities as
embryonic development and maturation of the phenotype.
Moreover, in most GP applications, fitness is determined in a static environment as
represented by a single, unchanging fitness function and the data it learns from is often
retrospective and static. Finally, the genetic operation of crossover is quite disruptive
having no concept of particulate genes that are exchanged intact during meiosis.
Crossover in GP is more like a genetic translocation in nature wherein a piece of genetic
material is randomly swapped between the strands of DNA in the potential zygote,
which almost inevitably leads to death of the embryo.
In addition, the conventional view of evolution in genetic programming populations is
that a superior individual propagates fairly quickly throughout a population, creating after a period of diffusion - a much less diverse population (Poli, 1997). Although such
mechanisms as demes, tournament selection and geographies slow this progression, as
long as there is free communication between all members of the population, the overall
population is expected to reach a relatively stable state where the comparative diversity is
relatively small until the next major mutational event creates a notable change in the
fitness of members of the population.
(Daida, 2003) and (Almal, 2005) showed that the structure and content of an entire
population could be imaged creating a dynamic “movie” of its progression through many
generations. Daida used this to show that most genotypic changes occur toward the leaves
of the program tree and that this creates structural limitations in the space that could be
effectively searched, even within the space of all possible structures. Figure 1 shows one
of the structural visualizations of a population from (Daida, 2003) where the larger the
number of individuals that share a structure within a program tree, the darker the branch.
The reader is urged to review this work and, if possible, the “movies” that show the
evolution of simple GP systems as they reveal much about the mechanisms of GP.
Figure 1: Representation of structures used in a population from a overlay of
tree structures created by Daida from (Daida, 2003)
(Almal, 2005) showed that the content of program trees could be could be visualized
along with the structure by using a technique developed for visualizing DNA sequences.
The method of visualizing DNA is where a sequence of bases are plotted in order from
the center of a square to the corner of the squared labeled with the next DNA base in the
sequence where each successive line segment is plotted from the end of the previous
segment and travels 1/(n+1) of the distance where n is the number of steps taken since
starting at the beginning of a sequence (Jeffrey, 1990).
(Almal, 2005) modified this by plotting all possible variables and operators used in a
GP tree on a circle and then used the same mechanism of plotting line segments as the
program tree is traversed. Each branch of the tree that resolved to a terminal is plotted
separately creating a set of endpoints for each branch. This appears as a set of ovals
around a common center. Figure 2 shows an example of such a population representation.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Figure 2: Plot of structure and content using modified chaos game method to
represent an entire population from (Almal, 2005)
Each cluster of endpoints can be viewed as a different path through structure-content
space and collectively they can be viewed as a population in the statistical sense so that
statistical metrics of the dispersion of the endpoints can be used to evaluate the diversity
of the population. Using this idea, we have begun to calculate the informational entropy
of GP populations using this approach. Conceptually, the smaller the distribution of the
endpoints, the more uniform the population and the less informational entropy there is in
the population.
III. The Nature of Genotype and Phenotype In GP
Programs or “genotypes” in GP are usually created by assembling a random selection
of possible variables and operators. There is sometimes bias for or against certain
terminal elements, but in general a “generation 0” population is the most diverse
population of an entire GP run because there has been no selection of individuals to
remove combinations from the population.
However, it is important to remember that not all possible combinations create viable
individuals. For instance, an individual that contains combinations such as ‘+ * *’ is not
viable and will either be prevented or removed by syntax driven selection, repair, or in
the worst case, negative selection by assignment of a poor fitness. (Yu, 1998) outlines all
the methods used to remove deleterious individuals due to bad composition. The fact that
individuals are removed at some point, or even several points in the process of creation,
as well as during crossover and mutation, has a correspondence to natural processes such
as selection during embryonic development, DNA repair and even human interventions
such as gene therapy. What this suggests is that in genetic programming, as in nature, not
all genetic paths are allowable - and that this affects the probabilities associated with
evolutionary pathways.
Moreover, as Daida has suggested in (Daida, 2003) , even within the set of allowable
evolutionary paths, not all are equally likely to be selected because of the tendency of GP
to weight crossover and mutation toward points lower in the program trees, leaving the
roots mostly untouched beyond the early stages of evolution.
Taken together, these factors suggest that not all evolutionary paths are equally likely,
and that there some paths are unlikely to be retraced. For example, if, through chance,
crossover occurs high in a program tree, it is unlikely that the inverse crossover will
occur as the joint probability of two high-tree crossover’s occurring is increasingly small
as the typical tree size grows.
All of the GP operations involving population creation, crossover, and mutation occur
without reference to fitness, they involve only structural changes to the program tree.
This corresponds loosely to genetic alterations of an individual genotype in nature, which
suggests that even without embryonic and developmental stages in GP individuals, there
is still a de facto phenotype even though the phenotype is the same entity as the genotype.
What, then, is a phenotype in GP? On paper it is the same entity as the genotype: the
program representation. However, in general, fitness is not determined by structure or
content – it is determined by evaluating its fitness in some way using a particular set of
inputs or running the program in a specific environment. In other words, it is the
application of a genetically derived program representation to an environment in the form
of a set of input data or its injection into a simulation, that a fitness is calculated. Even
though the fitness measure and environment used to test the system may be different for
different problems, the phenotype is always defined by the individual’s encounter with
these factors.
Whether a GP individual has an embryonic stage or not, whether it “grows” after it is
“born” (as has been done in various applications, particularly in real world engineering
designs), the characteristics of structure and content manipulation define the GP
genotype, while test data and fitness calculation define the phenotype.
IV. Implications of GP Genotype-Phenotype Definition
The implication of this is that while constraints on the structure and content of a GP
genotype control its path in structure-content space, the selection of individuals in the
population is controlled by fitness. Importantly, the fitness is not directly connected to
changes to structure-content made by the genetic operations of crossover and mutation.
The impact of structural changes on fitness are not known at the time the structure and
content changes are made, while the impact of fitness is only related to which individuals
are combined during crossover, fitness does not direct the nature of the changes made.
Thus the path through fitness space is not directly related to the mechanism of changes to
the genotype.
It is worth noting that the only exception to this is when the fitness is directly
connected to the size, shape or content of the individuals in a GP population. For
example, many parsimony rules reduce the fitness of large, complex individuals.
Similarly, it is not unusual to bias the fitness of an individual if there are certain content
elements that are more or less desirable than others. There are even fitness measures in
extreme cases, such as the LiD function described in (Daida, 2004), where fitness is
determined by the size and shape of the structure in order to show how structures evolve,
but these cases only influence the selection of individuals by changing their fitness
according to their structure, not the mechanism of recombination and mutation.
Given that there is a clear separation of mechansims, what are the implications of this
separation in a population as it evolves? By using the structure-content plots, and adding
a heat map for fitness, we can examine the evolution of different structures with roughly
similar fitness in a single population. We have begun to search for this behavior within
populations and have seen some early cases where similar fitness occurs in very different
parts of the structure-contents space. We have yet to determine the longevity of these
occurrences and must still show that the distance between the endpoints in Figure 2 are
truly definitive in characterizing differences between individuals, but we believe based on
preliminary analysis that both this conjectures are true.
Because the crossover operator in GP tends to operate as a macromutation operator, it
can create significant changes in the structure and content of an individual program in a
GP population. This can lead to the sudden appearance of two distinct populations.
Occasionally, whether by accident (Angeline, 1997) or design (Ryan, 2004), the average
fitness of these populations may be roughly equivalent so that they endure despite their
differences. If a distinctly different individual appears because of a crossover event that
occurs high up in a tree, then when this individual is crossed with another individual in
the population, their offspring will tend to resemble the different individual more than it
does the other individuals immediate ancestors in the sense that the offspring with tend to
be located closer to the distinct individual in structure-content space than do the other
members of the population. Given roughly equivalent fitness in these two groups, this
new population will not disappear back into the main population as the change that
created it is unlikely to be reversed since the probability of having a second crossover at
the high in the tree is very small. The only way such offspring are likely to disappear is if
the structure-content neighborhood they have moved is not as “fitness rich” as the
neighborhood they came from.
Population mechanisms such as tournament selection (Koza, 1992), demes (Iwashita,
2002), and “trivial” geography (Spector, 2005) will increase this tendency by allowing a
pattern to be fixed in a sub-population and reducing the chance of dilution that occurs in a
larger population, but such sub-populations may not be necessary for a sub-population to
be created and once created, to persist. Once a distinctly different individual occurs
through a high-tree crossover event, it begins to take on some of the attributes of species
in a natural environment.
If two distinct sub-populations with similar fitness among their respective individuals
occur, then they can get into a arms-race where one sub-population’s size increases
slightly, reducing the “resources” of available places in the overall population. This is
then followed by the other sub-population gaining an advantage by additional refinement
of it’s fitness and wins back the space in the population. Eventually however, this
behavior can collapse as one sub-population gains a clear upper hand and drives the other
sub-population out of existence, leading to a newly stable system. This is again
reminiscent of natural species that are in competition with one another.
V. Is there a relationship between GP behavior and natural
selection?
Walter Fontana, in “The Topology of the Possible” (Fontana, 2003) described the
exploration of different phenotypes, as represented by transcribed RNA structures, and
structural changes at key decision points in the variation of DNA sequences. He focuses
on the mechanisms of change within a genotype that lead to a differentiation in the
corresponding phenotype. Fontana studied whether all DNA sequences that lead to a
transcription coding for the same protein are co-equal in evolutionary terms and
concludes that in fact, from an evolutionary perspective, they are quite different, as they
may be on very different paths that are more or less likely to lead to structural changes in
a phenotype. He makes the argument that some sequence changes can lead in directions
where change is more likely in one direction than it is in the other and that this creates a
gradient where evolution is more likely in one direction that in the other.
GP is not quite the same, as a macromuatation from crossover with a significant
variance in structure is unlikely to occur. But if a significant structural change does occur,
and if it has equivalent fitness to the individuals the original population, it is unlikely that
it will be go back to the structure as the original population. The point is that significant
changes that occur near the root node are very unlikely to recur and even changes that are
not near the root but are noticeably higher than the terminal leaf nodes may survive long
enough to create a fixed sub-population.
Fontana also describes protein folding constraints in RNA that mediate exploration of
natural systems while in GP (Daida, 2003) and (Almal, 2005) suggest that exploration
occurs at certain points as constrained by both syntactic and structural behaviors of
crossover in GP. Simply put, in both systems some combinations are very unlikely or
unreachable while others are much easier to reach. The difference is that, in RNA, there
is an unequal gradient to movement in both directions while in GP, movement shares an
equal probability
At a macrobiology level, the disconnect between genotype and phenotype suggests
that if significant structural changes are possible in the genotype, then evolutionary
changes are possible without an environmental change. There can be changes that simply
occur through random exploration that provide an equivalent fitness without competing
with the original phenotypes. A possible example of this is the recently discovered
genotypic and phenotypic emergence of a new sub-species of the Darwin finch on the
Galapagos Islands within the same range as the original species (Huber, 2006) without
any obvious ecological change that would put pressure on the Darwin finches to change.
The changes to beak structure in the new sub-species allows them to feed on different
plant seeds and therefore does not put them into direct competition for resources with the
original species. Indeed, the original species remains and flourishes while the emergent
sub-species grows in numbers. This phenotypic change correlates with a specific
variation in the DNA of the finches. One could consider such a change to be setting the
stage for speciation if there was a future environmental change that led to further changes
in the new sub-species away from the originating species. It is a marvelous twist of
history that such a case should emerge in Darwin finches, one of the key species Charles
Darwin used to make his case for natural selection (Darwin, 1859).
This leads to further questions about the similarities and differences in natural and
artificial evolutionary systems. For example, given the energy cost of creating individuals
that do not develop successfully through the embryonic stage, is natural selection more
conservative when compared to the relatively trivial cost of evaluating a new phenotype
in an artificial evolutionary system? If so, this would suggest that as artificially
evolutionary systems are applied to more complex problems where the costs to evaluate
fitness increases, a more conservative mechanism for genotypic change may be called
for. Conversely, the comparative cheapness of fitness evaluation in simpler problems
may suggest that the greater degree of exploration allowed in artificial evolutionary
systems is effort well spent.
Following from this, does diploidy encourage structural change “testing” at a lower
evolutionary cost in nature since some individuals with a bad allele on one strand of
DNA can survive while others with a double allele cannot? There are many cases
suggestive of this such as Sickle Cell Anemia, Cystic Fibrosis and Tay’sachs Disease
where a single allele confers an evolutionary advantage in the form of resistance to a
disease while the double allele is deadly. One could imagine a structural change in DNA
leading to this sort of occurrence with the diploidal relationship conferring some
protection from dangerous mutations.
By comparison, prokaryotes, with their lower cost of testing structural changes (from
an evolutionary perspective) may evolve more aggressively since populations are large,
generations are short, and there is no embryo to support during development. Monoploidy
would also have the advantage of allowing easier testing of mutations in a highly
competitive environment with an immediate payoff for success.
Finally, can the lessons learned from watching the dynamics of GP behavior using the
modified Chaos Game approach described in (Almal, 2005) be turned back to the study
of structural differences in DNA in nature? Rapid whole genome mapping is becoming
feasible (Wicker, 2006) and it may be possible, to watch structural changes in the DNA
of creatures with short generational timeframes, to appear and disappear or become fixed
in the genome.
VI. Summary and Future Directions
Our notions of the evolutionary mechanisms of even “simple” GP may need to be
reconsidered. The Building Block Hyptohesis and the Schema Theorem are just the start
to understanding a complex dynamic. More investigation needs to be done on the use of
tool such as Daida’s population structure displays and the modified Chaos Game
structure-content heat maps described here. In particular more classes of problems need
to be studied using these techniques and the effect of changing genetic operators and
parameters on the structure and content of populations should be studied.
We expect that it may be possible to “tune” the genetic parameters based on this view
of population dynamics where it is possible to characterize the cost of exploration in
terms of fitness changes within a population and to visualize the effect of such changes
on both a population’s diversity and the overall fitness of the individuals in it. If, for
example, an increased crossover rate leads to greater structural diversity but without a
corresponding increase in fitness, it may be that there’s a need to decrease the crossover
rate to exploit the local search characteristics of mutation. Conversely, a relatively
uniform population in terms of fitness and structural diversity may suggest that the
problem may require an increase in the crossover rate.
Finally, while it has been suggested that GP and similar, simple evolutionary systems
do not really reveal much about natural selection, the convergence of our ideas with
Fontana’s, despite our being unaware of his work at the time, suggests that that even a
simple evolutionary system can have some of the same behavioral complexities as a
natural system at the genotypic level.
References
Angeline. P. J. (1997). Subtree crossover: Building
block engine or macromutation? In John R. Koza and
Kalyanmoy Deb and Marco Dorigo and David B. Fogel and
Max Garzon and Hitoshi Iba and Rick L. Riolo, editors,
Genetic Programming 1997: Proceedings of the Second
Annual Conference, pages 9--17, 13-16 July 1997, pages
9-17, San Francisco, Morgan Kaufmann.
Daida, J.M. (2003). What Makes a Problem GP-Hard? A
Look at How Structure Affects Content. In Rick L. Riolo
and Bill Worzel, editors, Genetic Programming Theory
and Practice, volume pages 99-118, Norwell, Kluwer.
Darwin, C. (1859) On the Origin of Species by Means of
Natural Selection or the Preservation of Favoured Racs
in the Struggle for Life. Murray, London, UK.
Fontana, W. (2003). The Topology of the Possible.
Working Papers, Paper No. 03-03-017, Santa Fe, The
Santa Fe Institute.
Iwashita, M. and Iba, H. (2002). Island Model GP with
Immigrants Aging and Depth-Dependent Crossover. In
David B. Fogel and Mohamed A. El-Sharkawi and Xin Yao
and Garry Greenwood and Hitoshi Iba and Paul Marrow and
Mark Shackleton, editors, Proceedings of the 2002
Congress on Evolutionary Computation CEC2002, pages
267-272, Piscataway, IEEE Press.
Huber, S.K. (2006). Premating isolation of sympatric
morphs in a population of Darwin's finches (Geospiza
fortis). Animal Behavior Society 43rd Annual Meeting.
Aug. 12-16. Snowbird, Utah.
Jeffrey, H.J. (1990). Chaos game representation of gene
structure.
Nucleic Acids Res., 1990 Apr 25;18(8):2163-70.
Koza, J. R. (1992) Genetic Programming: On the
programming of computers by means of natural selection,
pp. 604-607. Cambridge, MIT Press.
Poli, R. and Langdon, W. B. (1997). An experimental
analysis of schema creation, propagation and disruption
in genetic programming. In T. Back, editor, Genetic
Algorithms: Proceedings of the Seventh International
Conference, pages 18-25, San Francisco, Morgan
Kaufmann.
Ryan, C., Majeed, H. and Azad, A. (2004). A Competitive
Building Block Hypothesis. In Kalyanmoy Deb and
Riccardo Poli and Wolfgang Banzha and Hans-Georg Beyer
and Edmund Burke and Paul Darwen and Dipankar Dasgupta
and Dario Floreano and James
Foster and Mark Harman and Owen Holland and Pier Luca
Lanzi and Lee Spector and Andrea Tettamanzi and Dirk
Thierens and Andy Tyrrell, editors, Genetic and
Evolutionary Computation -- GECCO-2004, Part II, volume
3103 of LNCS, pages 654-665, Heidelberg, SpringerVerlag.
Spector, L. and Klein, J (2005). Trivial Geography in
Genetic Programming. In Tina Yu and Rick L. Riolo and
Bill Worzel, editors, Genetic Programming Theory and
Practice {III}, volume 9 of Genetic Programming, pages
109-123, New York, Springer.
Wicker, et al. 454 sequencing put to the test using the
complex genome of barley. BMC Genomics. 2006; 7: 275.
Published online 2006 October 26. doi: 10.1186/14712164-7-275.
Yu, T. and Bentley, P. (1998). Methods to Evolve Legal
Phenotypes. In Eiben, A.E., Back, T., Schoenauer, M.
and Schwefel, H., editors, Fifth International
Conference on Parallel Problem Solving from Nature,
volume 1498 of LNCS, pages 280-291, Berlin, SpringerVerlag.
Download