Dominic Thirlwall EC report 2

Dominic Thirlwall ID: 1300676
Function Optimisation
This program uses an evolutionary algorithm to find expressions to best match empirical data. The
program is specially designed to find expressions linking the energy consumption of a building to
three independent variables, although no knowledge of the patterns in the data is assumed by the
program’s evolutionary algorithm. The program only knows the structure of the file containing
empirical data and the names of the variables, and can adapt to any relationship.
When data like the energy consumption example appears to follow a complicated unknown pattern
or contains too many independent variables to analyse, evolutionary algorithms can be the best tools
for finding expressions to predict the data. In the energy consumption example, the data contains
some anomalous points and small spikes that don’t appear to coincide with any of the independent
variables. These anomalies are something that an evolutionary algorithm can handle, given enough
computation time to find an optimal expression that either matches the smaller fluctuations in the
data or ignores them while maintaining high fitness, suggesting they are caused by unmeasured
variables.
Algorithm
This program uses an evolutionary algorithm that performs recombination, mutation, selection and
replacement on a population of mathematical expressions in order to find close matches to the data.
The initial (0th) generation is generated randomly, with each expression independent of the others
and containing a mixture of constants, variables and operators. I experimented with changing the
probability of certain types of elements appearing in the expressions of the initial generation, but
any differences in the method have little to no effect on the population after the first few
generations due to crossover and elimination of poorly generated expressions. All of the unsuitable
expressions are replaced within the first couple of generations, and the population converges on
members of similar fitness regardless of the ratios used during initialisation.
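As an illustration of the random initialisation described above, here is a minimal sketch that builds one random expression as a string. The class name, the leaf probability, the constant range and the depth limit are all assumptions made for the example, and the variable names are simply those that appear in the results table; the report does not state the values the program actually uses.

```java
import java.util.Locale;
import java.util.Random;

/** Hypothetical sketch; not the original program's generator. */
final class RandomExpression {
    static final String[] OPERATORS = {"+", "-", "*", "/", "^"};
    static final String[] VARIABLES = {"T", "s", "t"}; // names as they appear in Fig. 3
    static final double LEAF_PROBABILITY = 0.3;        // assumed value
    static final double MAX_CONSTANT = 100.0;          // assumed range for constants

    /** Builds a random expression with at most maxDepth nested operators. */
    static String generate(Random rng, int maxDepth) {
        if (maxDepth == 0 || rng.nextDouble() < LEAF_PROBABILITY) {
            // Leaf: either a variable or a non-negative constant.
            if (rng.nextBoolean()) {
                return VARIABLES[rng.nextInt(VARIABLES.length)];
            }
            return String.format(Locale.ROOT, "%.4f", rng.nextDouble() * MAX_CONSTANT);
        }
        // Internal node: one binary operator joining two random sub-expressions.
        String op = OPERATORS[rng.nextInt(OPERATORS.length)];
        return "(" + generate(rng, maxDepth - 1) + " " + op + " "
                + generate(rng, maxDepth - 1) + ")";
    }
}
```

Calling `RandomExpression.generate(new Random(), 4)` once per population member would give independent expressions mixing constants, variables and operators.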
Recombination and Mutation
While advancing to the next generation, each expression in the population (except for the least fit if
the population contains an odd number) is paired with another, and two offspring are produced. The
offspring are initially duplicates of the parent expressions, before undergoing recombination and
mutation. For recombination, the offspring expressions swap one section of themselves with the
other. This is equivalent to two data trees swapping subtrees, as the expressions are stored in a
similar style. Only one section of each expression is swapped for each offspring because even the
exchange of small subtrees can strongly affect the nature of the expression and its fitness;
higher subtree-crossover rates appear to have no real effect. Mutation is performed by multiplying
some of the constants in the new expressions by a randomly chosen value, picked with Gaussian
probability distribution with a mean of 1 and a standard deviation of 0.1.
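The subtree swap described above can be sketched as follows. The `Node` class and the uniform choice of crossover point are assumptions made for the illustration; the report only says that the expressions are stored in a tree-like style and that one section is swapped per offspring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical expression-tree node; the original classes are not shown in the report. */
final class Node {
    String label;                              // operator symbol, variable name or constant
    List<Node> children = new ArrayList<>();

    Node(String label, Node... kids) {
        this.label = label;
        for (Node k : kids) children.add(k);
    }

    /** Deep copy, so that offspring never share nodes with their parents. */
    Node copy() {
        Node n = new Node(label);
        for (Node c : children) n.children.add(c.copy());
        return n;
    }

    /** Adds this node and all of its descendants to out. */
    void collect(List<Node> out) {
        out.add(this);
        for (Node c : children) c.collect(out);
    }

    int size() {
        int n = 1;
        for (Node c : children) n += c.size();
        return n;
    }

    @Override
    public String toString() {
        return children.isEmpty() ? label : label + children;
    }
}

final class SubtreeCrossover {
    /** Returns two offspring: duplicates of the parents with one subtree exchanged. */
    static Node[] cross(Node parentA, Node parentB, Random rng) {
        Node a = parentA.copy();
        Node b = parentB.copy();
        List<Node> nodesA = new ArrayList<>();
        List<Node> nodesB = new ArrayList<>();
        a.collect(nodesA);
        b.collect(nodesB);
        // Pick one crossover point in each offspring (uniform choice is an assumption).
        Node pickA = nodesA.get(rng.nextInt(nodesA.size()));
        Node pickB = nodesB.get(rng.nextInt(nodesB.size()));
        // Exchanging label and children of the two nodes swaps the subtrees rooted there.
        String tmpLabel = pickA.label;
        pickA.label = pickB.label;
        pickB.label = tmpLabel;
        List<Node> tmpKids = pickA.children;
        pickA.children = pickB.children;
        pickB.children = tmpKids;
        return new Node[] {a, b};
    }
}
```

Because the offspring are copied before the swap, the parents stay intact and can still compete with their offspring during selection.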
I found that, after enough generations, the fittest expressions would very rarely change in layout, as
swapping elements of the expressions becomes a very inefficient way of improving fitness once a
close match is found. At this point, mutation of constants in the equation becomes a useful method
of incrementally improving fitness without changing the expression very much. Giving each
constant a 50% chance of being mutated in every offspring allows for mixing of old and new
constants in the hope of slowly keeping appropriate values while improving the others. Test results
showed that this is the case; giving each constant a 50% chance of mutation allowed for quicker
improvement of fitness once the changes in expression layout become rare between generations.
Rough tests of lower probabilities of mutation per constant showed the increase in fitness to take
effect slightly slower, and tests of higher probabilities prevented too many of the most suitable
constants in the expressions from being preserved.
Higher standard deviations were found to make the mutation method less effective, so a
default value of 0.1 was picked. Choosing values lower than 0.1 appears to have a negligible effect on
the final results, unless they are close to or equal to zero, where very little or no mutation occurs at
all.
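The mutation step can be sketched as below, using the 50% per-constant probability and the standard deviation of 0.1 given above. The class and method names are hypothetical, and the constants are held in a plain array purely for illustration.

```java
import java.util.Random;

/** Hypothetical helper; not taken from the original program. */
final class ConstantMutation {
    static final double MUTATION_PROBABILITY = 0.5; // chance per constant, as stated in the report
    static final double STANDARD_DEVIATION = 0.1;   // default spread, as stated in the report

    /** Returns a mutated copy; the input array is left unchanged. */
    static double[] mutate(double[] constants, Random rng) {
        double[] result = constants.clone();
        for (int i = 0; i < result.length; i++) {
            if (rng.nextDouble() < MUTATION_PROBABILITY) {
                // Multiplier drawn from a Gaussian with mean 1 and the chosen standard deviation.
                double factor = 1.0 + STANDARD_DEVIATION * rng.nextGaussian();
                result[i] *= factor;
            }
        }
        return result;
    }
}
```

Since the change is multiplicative, a constant keeps its sign in almost every case and moves by roughly 10% of its own magnitude, which suits small incremental refinements.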
Selection and Replacement
The fitness of the newly created offspring of each pair of expressions is then calculated, in order for
selection by fitness to take place. An array containing the current population also contains space for
the new offspring to be added to the end, in order for all old expressions and new offspring to be
sorted by fitness. This pushes expressions with low fitness towards the back of the array, out of the
active population, where they are no longer used and are overwritten during the creation of the next
generation. For each generation, the number of new offspring is equal to the size of the population,
rounded down to the nearest even integer. The same number of expressions is removed for each
generation after the old and new members of the population have been sorted by fitness. Any
amount of the old population, from none to all, can therefore be replaced each time a new
generation is calculated. In later generations, it becomes very unlikely that many expressions in the
existing population are replaced by offspring, as the evolutionary progress slows down.
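A minimal sketch of this sort-and-truncate replacement follows. The `Candidate` and `Replacement` classes are hypothetical, and the sketch builds a fresh combined array rather than reusing spare capacity at the end of the population array as the program does. Lower fitness values are treated as better, and `Double.compare` orders NaN after every other value, including positive infinity, so invalid expressions end up at the back.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical pairing of an expression with its fitness. */
final class Candidate {
    final String expression;
    final double fitness; // lower is better

    Candidate(String expression, double fitness) {
        this.expression = expression;
        this.fitness = fitness;
    }
}

final class Replacement {
    /** Merges old population and offspring, sorts by fitness and keeps the best population.length. */
    static Candidate[] survivors(Candidate[] population, Candidate[] offspring) {
        Candidate[] pool = new Candidate[population.length + offspring.length];
        System.arraycopy(population, 0, pool, 0, population.length);
        System.arraycopy(offspring, 0, pool, population.length, offspring.length);
        // comparingDouble uses Double.compare, so infinite and NaN fitness values sort last.
        Arrays.sort(pool, Comparator.comparingDouble((Candidate c) -> c.fitness));
        return Arrays.copyOf(pool, population.length);
    }
}
```

With this scheme anywhere from none to all of the old population can be displaced in one generation, depending only on how the offspring rank.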
Parameters
I performed tests, changing one parameter at a time, to find the best set of values to use for the
most effective and consistent results. The program’s default parameters are set to values which I
have found to be optimal.
During the fitness calculation, the number of mathematical operators in each expression is
counted and penalties are applied to expressions deemed too complex. Unless the
threshold number of operators was set too low, fitness penalties for complex expressions were only
slightly detrimental to the results (fitness values are lower by approximately 2 on average without
any fitness penalties, after accounting for the changes to the fitness values caused by the penalties
themselves), so a moderate level of fitness penalty is applied by default. Expressions
with more than 11 operators are penalised quite heavily and are very unlikely to reach the final
generations, although they are sometimes considered quite fit in earlier generations, when the
expressions hardly match the input data at all. With this setting applied, the expressions in the final
generations are generally much simpler, but fitness penalties can be turned off for testing
purposes.
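The report does not give the exact penalty formula, so the following is only a sketch of one plausible form: a fixed amount added to the fitness value for every operator beyond the threshold of 11. The class name, the linear shape and the value of `PENALTY_PER_OPERATOR` are assumptions.

```java
/** Hypothetical complexity penalty; the original formula is not given in the report. */
final class ComplexityPenalty {
    static final int OPERATOR_THRESHOLD = 11;       // threshold stated in the report
    static final double PENALTY_PER_OPERATOR = 5.0; // assumed value for illustration

    /** Adds a penalty to a raw fitness value (lower is better) for each excess operator. */
    static double apply(double rawFitness, int operatorCount) {
        int excess = Math.max(0, operatorCount - OPERATOR_THRESHOLD);
        return rawFitness + excess * PENALTY_PER_OPERATOR;
    }
}
```

Under these assumed numbers, an expression with 13 operators and a raw fitness of 30 would be scored as 40, enough to rank it behind most simpler expressions of similar accuracy.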
Testing with larger population sizes shows a rough improvement in results at the expense of
processing time per generation. The default population size has been set to
1000, although higher values can produce good results more often after taking considerably more
time.
Fig. 1. Progression of highest fitness expression after each generation. Axes: fitness value (20 to 60)
against generation (0 to 70); results for generation 0 omitted as they vary too much.
Regardless of the population size, the improvement in fitness per generation will have become very
small by the 100th generation. Increasing the number of calculated generations linearly increases the
time required by the program, so knowing when to stop calculation is important. Figure 1 shows that
by the 75th generation, most improvement is due to small expression changes and mutation of
constant values in the top expressions, lowering the fitness value a small amount each time. This
leads to a slow incremental improvement that could be considered negligible by the 100th
generation. There is little advantage to running the program for much more than 100 generations, as
the extra run time starts to outweigh the gain in result quality, so this is set as the default.
The program has a user interface to make testing with other values and settings easy. Options for
file location and the number of log files created can also be set. Calculation progress can be seen in
the terminal while the program is working.
Evolutionary Process
Fig. 2. First 100 data points compared to predicted values by high and low fitness expressions.
Axes: power consumption, C (0 to 400) against data point ID (0 to 100). Series: actual data; unfit
expression, best after 1 generation; fit expression, best after 100 generations.
Initial generations will only be able to find expressions of poor fitness, but after 100 generations, the
expressions can match the input data much more closely. In Figure 2, the match after one
generation had a fitness value of approximately 50 and the match after 100 generations had a
fitness value of approximately 30 (lower is better, as fitness is equivalent to the root-mean-squared
difference between actual and predicted values of ‘c’, after adjustment by the fitness penalties).
These fitness values only apply to the building power consumption data, as other data sets will have
different scales, so different values would be considered fit or unfit.
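For reference, the unpenalised part of this fitness measure, the root-mean-squared difference between actual and predicted values, can be sketched as follows. The class and method names are hypothetical.

```java
/** Hypothetical helper computing the root-mean-squared difference (lower is better). */
final class RmsFitness {
    static double rms(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumSquares += diff * diff;
        }
        // A NaN or infinite prediction propagates through the sum into the result.
        return Math.sqrt(sumSquares / actual.length);
    }
}
```

Because the result is in the same units as ‘c’, a fitness of 30 means predictions are off by roughly 30 units of power consumption in the root-mean-squared sense.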
In the initial populations, expressions are generated with fitness values ranging right up to the
allowed limits of some data types. Many initial expressions may therefore have ‘infinity’ or ‘NaN’
fitness, especially if the original data contains variables with very high or zero values. These
unsuitable expressions are replaced very quickly as non-number fitness values are always sorted
below numerical ones.
Fig. 3
Trial run             Fitness of best expression    Best expression
1                     37.5783                       c = ((32.6972 + T) - (T * ((((|(0.3583 / s)| ^ T) - 0.2122) ^ (-1)) ((T + (81.3043 - 105.017)) ^ (-1)))))
2                     38.6327                       c = ((T + 34.5647) * (|T| ^ (((|((42.5516 + T) + T)| ^ (s ^ (-1))) ^ (-1)) * 0.3041)))
3                     38.6695                       c = ((28.3514 + T) + ((T + (T + 23.6637)) * (|((s ^ (-1)) * 0.4521)| ^ ((T + (-0.4213 * 143.4796)) ^ (-1)))))
4                     40.6542                       c = ((|T| ^ (T ^ (-1))) + (|92.4792| ^ (|0.522| ^ (|((T - (|s| ^ (108.5878 + s))) - (-0.6967 ^ (-1)))| ^ -0.6019))))
5                     42.9630                       c = ((((|0.307| ^ ((56.6919 / 50.4108) * (s ^ (-1)))) + 66.9119) + T) - ((0.0352 + (|0.7385| ^ (s ^ (-1)))) ^ (-1)))
6                     38.3007                       c = ((29.5895 + T) + ((|(s * s)| ^ (38.7536 ^ (-1))) * (T + ((|53.376| ^ 0.6006) + T))))
7                     39.7089                       c = ((|T| ^ 0.1397) * (72.0246 - ((2.2104 ^ (-1)) * (79.9815 * (|(((|t| ^ s) - s) ^ (-1))| ^ 0.8902)))))
8                     39.3507                       c = (45.0981 - ((60.6252 * T) / ((|34.624| ^ (32.6612 ((|(31.4166 + T)| ^ s) / (s - T)))) - 13.083)))
9                     44.4718                       c = ((29.3784 + T) + ((|((|t| ^ 1.0936) - 16.6562)| ^ 1.2683) + ((|t| ^ 1.2083) - t)))
10                    38.8227                       c = (106.7684 - (50.9451 * ((|(T * 0.3276)| ^ ((41.2359 41.7532) * s)) + (|-0.8288| ^ T))))
Mean                  39.9152
Standard deviation    2.0837
Figure 3 shows the best expressions and their fitness from 10 trial runs with the default settings.
Turning the fitness penalties for long/complex expressions off is the easiest way to improve the best
fitness results (even when taking into account the lower fitness values from the lack of penalties),
but the expressions become long and unwieldy if they are allowed to become more complex than
the ones in the table.
Fig. 4. Distribution of fitness values for 10 independent trial runs. Horizontal axis: fitness value for
best expression of final generation (35 to 45). Series: fitness results for trial runs; mean +/- 1
standard deviation.
Figure 4 shows the distribution of fitness values for the best fitness expressions at the end of each
trial run. Decreasing the population size, using fewer generations and introducing fitness penalties
for even shorter expressions were all found to increase the standard deviation in these results. I
could find no way of achieving more consistent fitness values on independent tests than with the
chosen optimal parameters.
The program’s classes, as well as this report, were all written solely by Dominic Thirlwall. No code was copied from external
sources, and no evolutionary algorithm packages were used. The program was written using Eclipse IDE for Java
developers.