Dominic Thirlwall ID: 1300676

Function Optimisation

This program uses an evolutionary algorithm to find expressions that best match empirical data. The program is specifically designed to find expressions linking the energy consumption of a building to three independent variables, although the program’s evolutionary algorithm assumes no knowledge of the patterns in the data. The program only knows the structure of the file containing the empirical data and the names of the variables, and can adapt to any relationship.

When data, such as that in the energy consumption example, appears to follow a complicated unknown pattern or contains too many independent variables to analyse, evolutionary algorithms can be the best tools for finding expressions to predict the data. In the energy consumption example, the data contains some anomalous points and small spikes that don’t appear to coincide with any of the independent variables. An evolutionary algorithm can handle these anomalies, given enough computation time to find an optimal expression that either matches the smaller fluctuations in the data or ignores them while maintaining high fitness, which would suggest they are caused by unmeasured variables.

Algorithm

This program uses an evolutionary algorithm that performs recombination, mutation, selection and replacement on a population of mathematical expressions in order to find close matches to the data.

The initial (0th) generation is generated randomly, with each expression independent of the others and containing a mixture of constants, variables and operators. I experimented with changing the probability of certain types of elements appearing in the expressions of the initial generation, but such differences have little to no effect on the population after the first few generations, due to crossover and the elimination of poorly generated expressions.
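As an illustration of the random initial generation described above, the following is a minimal sketch of how expression trees might be generated from constants, variables and operators. The report does not show the program's actual classes, so every name here (`ExprSketch`, `Const`, `Var`, `BinOp`, `random`), the four operators, the constant range of 0 to 100 and the probabilities `pLeaf` and `pConst` are assumptions made for the example.

```java
import java.util.Random;

// Hypothetical sketch; none of these names come from the original program.
class ExprSketch {
    interface Expr {
        double eval(double[] vars);  // vars holds the independent variable values
        int operatorCount();         // number of operator nodes in the tree
    }

    static final class Const implements Expr {
        final double value;
        Const(double value) { this.value = value; }
        public double eval(double[] vars) { return value; }
        public int operatorCount() { return 0; }
        public String toString() { return String.valueOf(value); }
    }

    static final class Var implements Expr {
        final int index;
        Var(int index) { this.index = index; }
        public double eval(double[] vars) { return vars[index]; }
        public int operatorCount() { return 0; }
        public String toString() { return "x" + index; }
    }

    static final class BinOp implements Expr {
        final char op;
        final Expr left, right;
        BinOp(char op, Expr left, Expr right) {
            this.op = op; this.left = left; this.right = right;
        }
        public double eval(double[] vars) {
            double a = left.eval(vars), b = right.eval(vars);
            switch (op) {
                case '+': return a + b;
                case '-': return a - b;
                case '*': return a * b;
                default:  return a / b;  // '/': may produce Infinity or NaN
            }
        }
        public int operatorCount() {
            return 1 + left.operatorCount() + right.operatorCount();
        }
        public String toString() { return "(" + left + " " + op + " " + right + ")"; }
    }

    static final char[] OPS = {'+', '-', '*', '/'};

    // Builds a random tree no deeper than maxDepth.
    // pLeaf: chance of stopping early with a leaf; pConst: chance a leaf is a constant.
    static Expr random(Random rng, int maxDepth, int numVars, double pLeaf, double pConst) {
        if (maxDepth == 0 || rng.nextDouble() < pLeaf) {
            if (rng.nextDouble() < pConst) {
                return new Const(rng.nextDouble() * 100.0);  // assumed constant range
            }
            return new Var(rng.nextInt(numVars));
        }
        char op = OPS[rng.nextInt(OPS.length)];
        return new BinOp(op,
                random(rng, maxDepth - 1, numVars, pLeaf, pConst),
                random(rng, maxDepth - 1, numVars, pLeaf, pConst));
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int i = 0; i < 3; i++) {
            System.out.println("c = " + random(rng, 3, 3, 0.3, 0.5));
        }
    }
}
```

Changing `pLeaf` and `pConst` corresponds to the kind of experiment mentioned above, where the mix of element types in the initial generation is varied.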
All of the unsuitable expressions are quickly replaced in the first couple of generations, and the population converges on members of similar fitness regardless of the ratios used during initialisation.

Recombination and Mutation

While advancing to the next generation, each expression in the population (except for the least fit, if the population contains an odd number) is paired with another, and two offspring are produced. The offspring are initially duplicates of the parent expressions, before undergoing recombination and mutation.

For recombination, the two offspring expressions swap one section of themselves with each other. This is equivalent to two data trees swapping subtrees, as the expressions are stored in a similar style. Only one section of each expression is swapped for each offspring, because even the exchange of small subtrees can strongly affect the nature of the expression and its fitness; higher subtree-crossover rates appear to have no real effect.

Mutation is performed by multiplying some of the constants in the new expressions by a randomly chosen value, drawn from a Gaussian distribution with a mean of 1 and a standard deviation of 0.1. I found that, after enough generations, the fittest expressions would very rarely change in layout, as swapping elements of the expressions becomes a very inefficient way of improving fitness once a close match is found. At this point, mutation of the constants in the expression becomes a useful method of incrementally improving fitness without changing the expression very much. Giving each constant a 50% chance of being mutated in every offspring allows old and new constants to mix, in the hope of keeping appropriate values while slowly improving the others. Test results showed that this is the case: giving each constant a 50% chance of mutation allowed for quicker improvement of fitness once changes in expression layout become rare between generations.
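The constant mutation described above can be sketched in a few lines. This is an illustrative example rather than the program's own code: it assumes, for simplicity, that the constants of an expression are available as a plain `double[]`, and the names `ConstantMutation` and `mutate` are hypothetical. The 50% probability and the standard deviation of 0.1 are the values given in the report.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of the multiplicative Gaussian mutation of constants.
class ConstantMutation {
    // Returns a mutated copy; each constant is independently mutated with
    // probability pMutate by multiplying it by a value drawn from N(1, sigma).
    static double[] mutate(double[] constants, double pMutate, double sigma, Random rng) {
        double[] out = constants.clone();
        for (int i = 0; i < out.length; i++) {
            if (rng.nextDouble() < pMutate) {
                // nextGaussian() has mean 0 and standard deviation 1,
                // so this factor has mean 1 and standard deviation sigma.
                out[i] *= 1.0 + sigma * rng.nextGaussian();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] constants = {32.6972, 0.3583, 0.2122};
        double[] mutated = mutate(constants, 0.5, 0.1, new Random());
        System.out.println(Arrays.toString(constants) + " -> " + Arrays.toString(mutated));
    }
}
```

Because the factor is multiplicative, each change is proportional to the size of the constant, which suits the small incremental adjustments the report describes for later generations.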
Rough tests of lower probabilities of mutation per constant showed the increase in fitness taking effect slightly more slowly, while higher probabilities prevented too many of the most suitable constants in the expressions from being preserved. Higher standard deviations were found to make the mutation method less effective, so a default value of 0.1 was picked. Choosing values lower than 0.1 appears to have a negligible effect on the final results, unless they are close to or equal to zero, where very little or no mutation occurs at all.

Selection and Replacement

The fitness of the newly created offspring of each pair of expressions is then calculated, so that selection by fitness can take place. The array containing the current population also has space at the end for the new offspring, so that all old expressions and new offspring can be sorted by fitness together. This pushes expressions with low fitness towards the back of the array, out of the active population, where they are no longer used and are overwritten during the creation of the next generation.

For each generation, the number of new offspring is equal to the size of the population, rounded down to the nearest even integer. The same number of expressions is removed for each generation after the old and new members of the population have been sorted by fitness. Any number of the old population’s members, from none to all, can therefore be replaced each time a new generation is calculated. In later generations, it becomes very unlikely that many expressions in the existing population are replaced by offspring, as the evolutionary progress slows down.

Parameters

I performed tests, changing one parameter at a time, to find the set of values that gives the most effective and consistent results. The program’s default parameters are set to the values which I have found to be optimal.
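The selection and replacement step described earlier (old population and offspring sorted together, with the worst members dropped) might look roughly like the sketch below. It is not the program's own code: `SelectionSketch`, `Candidate` and `nextGeneration` are hypothetical names, and it assumes that a lower fitness value is better, as the report states for its error-based fitness measure.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of sorting parents and offspring together by fitness.
class SelectionSketch {
    static final class Candidate {
        final String label;
        final double fitness;  // lower is better
        Candidate(String label, double fitness) {
            this.label = label;
            this.fitness = fitness;
        }
    }

    // Double.compare orders NaN after every other value, including
    // positive infinity, so non-number fitness values sink to the back.
    static final Comparator<Candidate> BY_FITNESS =
            (a, b) -> Double.compare(a.fitness, b.fitness);

    // Returns the next population: the best population.length candidates
    // out of the old population plus the offspring.
    static Candidate[] nextGeneration(Candidate[] population, Candidate[] offspring) {
        Candidate[] pool = Arrays.copyOf(population, population.length + offspring.length);
        System.arraycopy(offspring, 0, pool, population.length, offspring.length);
        Arrays.sort(pool, BY_FITNESS);
        return Arrays.copyOf(pool, population.length);
    }

    public static void main(String[] args) {
        Candidate[] population = {
            new Candidate("old A", 30.0),
            new Candidate("old B", 50.0),
            new Candidate("old C", Double.NaN)
        };
        Candidate[] offspring = {
            new Candidate("new D", 40.0),
            new Candidate("new E", Double.POSITIVE_INFINITY)
        };
        for (Candidate c : nextGeneration(population, offspring)) {
            System.out.println(c.label + " " + c.fitness);
        }
    }
}
```

Since parents and offspring compete in one sorted pool, anywhere from none to all of the old population can be replaced in a single generation, which matches the behaviour described in the report.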
During the fitness calculation process, the number of mathematical operators in each expression is counted and penalties are applied to expressions which are deemed to be too complex. Unless the threshold number of operators was set too low, fitness penalties for complex expressions were only slightly detrimental to the results (fitness values are lowered by approximately 2 on average without any fitness penalties, after accounting for the changes to the fitness values caused by the penalties themselves), so a moderate level of fitness penalties is applied by default as a compromise. Expressions with over 11 operators are penalised quite heavily and are very unlikely to make it into the final generations, although they are sometimes considered quite fit in earlier generations, when the expressions hardly match the input data at all. With this setting applied, the expressions in the final generations are generally much simpler, but fitness penalties can be turned off for testing purposes.

Testing with different population sizes shows a rough improvement in results at the expense of the processing time required per generation. The default population size has been set to 1000, although higher values can produce good results more often, at the cost of considerably more time.

Fig. 1 Progression of highest fitness expression after each generation (vertical axis: fitness value; horizontal axis: generation; results for generation 0 omitted as they vary too much)

Regardless of the population size, the improvement in fitness per generation will have become very small by the 100th generation. Increasing the number of calculated generations linearly increases the time required by the program, so knowing when to stop calculation is important. Figure 1 shows that by the 75th generation, most improvement is due to small expression changes and mutation of constant values in the top expressions, lowering the fitness value by a small amount each time.
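The complexity penalty discussed in this section could be combined with an error measure roughly as follows. This is only a sketch under stated assumptions: the report says that fitness is equivalent to the root-mean-squared difference between actual and predicted values, adjusted by penalties, and that expressions with more than 11 operators are penalised, but it does not give the penalty formula. The linear penalty, the `penaltyPerOperator` weight and all names below are therefore hypothetical.

```java
// Hypothetical sketch of an RMS-based fitness with a complexity penalty.
class FitnessSketch {
    // Root-mean-squared difference between actual and predicted values.
    static double rmsError(double[] actual, double[] predicted) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    // Adds an assumed linear penalty for each operator above the threshold.
    // Lower fitness is better, so the penalty makes complex expressions worse.
    static double penalisedFitness(double rms, int operatorCount,
                                   int threshold, double penaltyPerOperator) {
        int excess = Math.max(0, operatorCount - threshold);
        return rms + excess * penaltyPerOperator;
    }

    public static void main(String[] args) {
        double[] actual = {100.0, 150.0, 120.0};
        double[] predicted = {110.0, 140.0, 125.0};
        double rms = rmsError(actual, predicted);
        System.out.println("RMS error: " + rms);
        System.out.println("11 operators: " + penalisedFitness(rms, 11, 11, 5.0));
        System.out.println("14 operators: " + penalisedFitness(rms, 14, 11, 5.0));
    }
}
```

Setting `penaltyPerOperator` to zero in this sketch corresponds to turning the fitness penalties off for testing.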
The resulting improvement is slow and incremental, and could be considered negligible by the 100th generation. There is little advantage to running the program for much more than 100 generations, as the impracticality starts to outweigh the quality of the results, so this is set as the default.

The program has a user interface to make testing with other values and settings easy. Options for the file location and the number of log files created can also be set. Calculation progress can be seen in the terminal while the program is working.

Evolutionary Process

Fig. 2 First 100 data points compared to predicted values by high and low fitness expressions (vertical axis: power consumption, c; horizontal axis: data point ID; series: actual data; unfit expression, best after 1 generation; fit expression, best after 100 generations)

Initial generations will only be able to find expressions of poor fitness, but after 100 generations the expressions can match the input data much more closely. In Figure 2, the match after one generation had a fitness value of approximately 50 and the match after 100 generations had a fitness value of approximately 30 (lower is better, as fitness is equivalent to the root-mean-squared difference between actual and predicted values of ‘c’, after adjustment by the fitness penalties). These fitness values only apply to the building power consumption data; any other data will have different scales, so different values would be considered fit or unfit.

In the initial populations, expressions are generated with fitness values ranging right up to the allowed limits of some data types. Many initial expressions may therefore have ‘infinity’ or ‘NaN’ fitness, especially if the original data contains variables with very high or zero values. These unsuitable expressions are replaced very quickly, as non-number fitness values are always sorted below numerical ones.

Fig. 3 Best expression and its fitness for each of 10 trial runs

Trial run   Fitness of best expression   Best expression
1           37.5783                      c = ((32.6972 + T) - (T * ((((|(0.3583 / s)| ^ T) - 0.2122) ^ (-1)) ((T + (81.3043 - 105.017)) ^ (-1)))))
2           38.6327                      c = ((T + 34.5647) * (|T| ^ (((|((42.5516 + T) + T)| ^ (s ^ (-1))) ^ (-1)) * 0.3041)))
3           38.6695                      c = ((28.3514 + T) + ((T + (T + 23.6637)) * (|((s ^ (-1)) * 0.4521)| ^ ((T + (-0.4213 * 143.4796)) ^ (-1)))))
4           40.6542                      c = ((|T| ^ (T ^ (-1))) + (|92.4792| ^ (|0.522| ^ (|((T - (|s| ^ (108.5878 + s))) - (-0.6967 ^ (-1)))| ^ -0.6019))))
5           42.9630                      c = ((((|0.307| ^ ((56.6919 / 50.4108) * (s ^ (-1)))) + 66.9119) + T) - ((0.0352 + (|0.7385| ^ (s ^ (-1)))) ^ (-1)))
6           38.3007                      c = ((29.5895 + T) + ((|(s * s)| ^ (38.7536 ^ (-1))) * (T + ((|53.376| ^ 0.6006) + T))))
7           39.7089                      c = ((|T| ^ 0.1397) * (72.0246 - ((2.2104 ^ (-1)) * (79.9815 * (|(((|t| ^ s) - s) ^ (-1))| ^ 0.8902)))))
8           39.3507                      c = (45.0981 - ((60.6252 * T) / ((|34.624| ^ (32.6612 ((|(31.4166 + T)| ^ s) / (s - T)))) - 13.083)))
9           44.4718                      c = ((29.3784 + T) + ((|((|t| ^ 1.0936) - 16.6562)| ^ 1.2683) + ((|t| ^ 1.2083) - t)))
10          38.8227                      c = (106.7684 - (50.9451 * ((|(T * 0.3276)| ^ ((41.2359 41.7532) * s)) + (|-0.8288| ^ T))))

Mean                 39.9152
Standard deviation   2.0837

Figure 3 shows the best expressions and their fitness from 10 trial runs with the default settings. Turning off the fitness penalties for long or complex expressions is the easiest way to improve the best fitness results (even when taking into account the lower fitness values that come from the lack of penalties), but the expressions become long and unwieldy if they are allowed to become more complex than the ones in the table.

Fig. 4 Distribution of fitness values for 10 independent trial runs (horizontal axis: fitness value for best expression of final generation; series: fitness results for trial runs; mean +/- 1 standard deviation)

Figure 4 shows the distribution of fitness values for the best expressions at the end of each trial run.
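The summary values in Figure 3 can be rechecked from the ten fitness values in the table. The sketch below is an illustration only, with hypothetical names (`TrialStats`, `mean`, `populationStdDev`). It uses the population form of the standard deviation (dividing by n), which appears to be the form that agrees with the reported 2.0837; the sample form (dividing by n - 1) gives a slightly larger value.

```java
// Hypothetical sketch: recomputing the mean and standard deviation of Figure 3.
class TrialStats {
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population standard deviation: divides the sum of squares by n.
    static double populationStdDev(double[] values) {
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / values.length);
    }

    public static void main(String[] args) {
        // Fitness of the best expression from each of the 10 trial runs.
        double[] fitness = {
            37.5783, 38.6327, 38.6695, 40.6542, 42.9630,
            38.3007, 39.7089, 39.3507, 44.4718, 38.8227
        };
        System.out.println("Mean: " + mean(fitness));
        System.out.println("Standard deviation: " + populationStdDev(fitness));
    }
}
```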
Decreasing the population size, using fewer generations and introducing fitness penalties for even shorter expressions were all found to increase the standard deviation in these results. I could find no way of achieving more consistent fitness values across independent tests than with the chosen optimal parameters.

The program’s classes, as well as this report, were all written solely by Dominic Thirlwall. No code was copied from external sources, and no evolutionary algorithm packages were used. The program was written using the Eclipse IDE for Java Developers.