Self-Emergence of Structures in Gene Expression Programming Xin Li Department of Computer Science, University of Illinois at Chicago, Chicago, IL, 60607 Email: xli1@cs.uic.edu Abstract This thesis work aims at improving the problem solving ability of the Gene Expression Programming (GEP) algorithm to fulfill complex data mining tasks by preserving and utilizing the self-emergence of structures during its evolutionary process. The main contributions include the investigation of the constant creation techniques for promoting good functional structures emergent in the evolution, analysis of the limitation with the current implementation scheme of GEP, and introduction of a novel utilization of the emergent structures to achieve a flexible search process for solutions at a higher level. Problem Statement First introduced by Candida Ferreira (Ferreira 2001), GEP is a recently developed evolutionary computation method for data analysis and knowledge discovery. Born from Genetic Algorithms (GAs) and Genetic Programming (GP), GEP is both flexible at genetic operations due to its linear genotype and capable of retaining a certain extent of functional complexity due to its phenotype as expression trees. Distinguished from other traditional machine learning methods, GEP searches the global optimum through a population of candidate solutions in parallel and is able to produce solutions of any possible form as computer programs. Previous research work has shown its powerful capabilities over a large range of domains, including regression modeling, classification tasks, parameter optimization, time series prediction, logic synthesis, and so on (Ferreira 2002; Zhou et al. 2003). However, the algorithm could be improved with respect to both time efficiency and solution quality when dealing with complex problems. These problems are usually featured by large data sets, high dimensional feature sets and non-linear form of hidden knowledge within the data. The proper solutions to these applications are structurally intricate enough to require a more dedicated derivation process. Currently, the exploration of search space is purely driven by the disruptive genetic operations and fitness measure in GEP. However, the structure of a solution, which actually largely determines its functionality, is yet to be explicitly maintained or utilized. We hypothesize that a more effective search of the solution can be built upon the evolving procedure of the solution Copyright © 2005, AAAI (www.aaai.org). All rights reserved. structures, which we call as the self-emergence of structures in GEP. Here structures mean physical elements or building blocks that help compose the optimal solution for a problem, which, for example, could be either the overall functional structure of the solution or sub-parts of it (also called substructures in the latter case). By saying self-emergence, we mean structures are generated, preserved, and evolved automatically during the evolutionary process. Proposed Plan for Research Although building blocks, modularity and hierarchy have been historically addressed in GAs and GP (Holland 1975; Koza 1994), our research stands out by accentuating the utilization of naturally emergent structures based on the representation scheme of GEP. We believe only structures adapted to the environment via fitness measurement during the evolution are most competitive candidates for desirable structures of the unknown optimal solution. The proposed research plan gradually digs into the thesis topic as follows: (1) Constant creation methods in GEP Since finding the optimal solution structure and the appropriate constant coefficients are intermingled together in the GEP evolutionary process, a discovered good solution structure is less likely inherited if its combination with improper constant coefficients results in a low fitness value, and it may require extra search effort to be rediscovered later. Therefore, given the evolved candidate solution structures at the end of each generation, it might be helpful to incorporate some local search algorithms to optimize constants for speeding up the learning process. This forms the motivation to investigate the constant creation methods in the context of GEP. As of now, work that has been achieved includes: an extensive survey for existing constant creation methods in the literature about related algorithms; five constant creation methods were defined for GEP based on two basic constant mutation techniques, and all of these methods have been implemented and experimented with several benchmark symbolic regression problems (Li et al. 2004); the experiments have proved the feasibility of specific assistance for GEP in discovering useful numeric constants, which then helps identify good functional structures; it was found that constant creation methods applied to the whole population for selected generations AAAI-05 Doctoral Consortium / 1650 perform better than methods that are applied only to the best individuals, and have achieved significant improvement in the average fitness of the best solutions. The initial work shows support for the hypothesis that the GEP evolutionary process propels the emergence of solution structures in its genotype. (2) Analysis of limitations with the implementation scheme of GEP Since in GEP, genotype is linear character strings of a fixed length as in GAs, it is possible to analyze the driving force behind the evolutionary process of GEP following the paradigm of the schema theorem from GAs (Holland 1975). This would facilitate us to better understand how the solution structures evolve under the current implementation scheme of GEP, and further propose appropriate techniques to effectively utilize the emergent structures in GEP. Work has been done includes: a derivation of schema theorem for GEP, which defines the pattern of solutions (i.e., schema) in GEP as a set of gene groups corresponding to the sub-trees in its expression tree representation; based on the schema theorem, it was demonstrated that genetic operators like crossover and rotation, which are designed more in favor of passing genetic material down the generations, are actually as destructive as mutations to the evolved solution structures in GEP; it was also drawn from the analysis that shorter schemata defined by closely related elements in the linear genotype can receive the best exploration of their instances in the later evolutionary process. The current work shows a more direct mapping between the genotype and phenotype is necessary so that the solution structure encoded in the linear genotype can be better preserved and utilized during evolution. (3) Utilization of emergent structures in GEP Motivated by the analysis on the implementation scheme of GEP, it is appealing to preserve and utilize the emergent structures in GEP from simpler components, which will subsequently result in the emergence of higher level solution structures. Moreover, since the final solution is most likely formed by combining gene segments from other candidates, this will ideally help save repeated search effort for discovering useful functional structure components, and yield a more natural way to tackle complex problems by constructing the solution hierarchically within its genotype representation. Fulfilled work regarding this idea includes: a definition for emergent structural components (called derived genes) in GEP, which is basically a set of genes in the genotype corresponding to a complete sub-tree in the phenotype and having a high appearance frequency in the fittest individuals (called the elite group) among the whole population (e.g., given “sqrt.*.+.*.a.*.sqrt.a.b.c./.1.-.c.d” as the genotype string of a solution, the substring “-.c.d” forms a sub-tree in the phenotype, and if the appearance frequency requirement is met, it then defines a qualified derived gene); introduction of two novel genetic operators, namely, compression and expansion, to implement derived genes, and as with their application, GEP is able to preserve and utilize the useful emergent structural components completely determined by the evolution itself; implementation of a dynamic substructure library based on the compression and expansion operators in order to incorporate the utilization of emergent structures into the design of GEP algorithm. The initial experimental results have revealed the promising advantages of this proposed implementation scheme regarding emergent structures. (4) Further plan for this thesis research Up to now, the investigation of this thesis topic has basically established the framework. However, many more interesting ideas need further research, and some of them are planned as follows in decreasing order of priority: • To propose a new genotype-phenotype mapping mechanism so that the linear chromosome can also encode the natural functional hierarchy defined by the expression tree. This would benefit the emergence of the structures in GEP and ease the implementation of the derived genes. • Further exploration of different kinds of emergent structural components, which generally fall into categories of derived constant, derived attribute and derived function. This can possibly produce a more general and viable scheme to utilize the emergent solution structures. • To introduce parametric constants into the GEP algorithm. The general idea is that constant genes will participate in the evolution as a symbol. Before the final evaluation of the solutions, the constant symbols will tune up around the original value to maximize the fitness of the current functional structure. • Extensive benchmark testing on both regression and classification problems, to measure the effectiveness of the new design for the algorithm from different perspectives. The results will be compared to the original GEP, as well as some other traditional methods like neural network and C4.5 decision tree. References Ferreira, C. 2001. Gene Expression Programming: a New Adaptive Algorithm for Solving Problems. Complex Systems 13(2): 87-129. Ferreira, C. 2002. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Angra do Heroismo, Portugal. Zhou, C.; Xiao, W.; Nelson, P. C.; and Tirpak, T. M. 2003. Evolving Accurate and Compact Classification Rules with Gene Expression Programming. IEEE Transactions on Evolutionary Computation 7(6): 519 – 531. Holland, J. H. 1975. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press. Koza, J. 1994. Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge, MA: MIT Press. Li X.; Zhou, C; Nelson, P. C.; and Tirpak, T. M. 2004. Investigation of Constant Creation Techniques in the Context of Gene Expression Programming. In Late Breaking Paper at Genetic and Evolutionary Computation Conference (GECCO2004). Seattle, WA, USA. AAAI-05 Doctoral Consortium / 1651