information theory, fitness and sampling semantics

advertisement
information theory, fitness and sampling semantics
colin johnson / university of kent
john woodward / university of stirling
Schtick
•
•
•
That we can use the set of ideas around
entropy, information theory and algorithmic
complexity as a way of assigning fitness in
program synthesis systems.
This involves the idea of an information
distance between sampling semantics vectors
and problem target definitions across a set of
training cases.
In particular we describe how to assign fitness
to subprograms without putting them in the
Semantics in GP etc.
Canonical-representation Semantics: Program Text → Canonical form of I/O mapping
Why do we Care about Semantics?
• In the end, problems cash out as input-output behaviour.
• By having an understanding of program semantics, we can:
  • avoid duplicating programs with different representations but the same I/O behaviour in the population
  • choose points for crossover/mutation in a more informed way
  • build new frameworks (such as geometric semantic GP) that manipulate program meanings.
Semantics in GP etc.
Canonical-representation Semantics: Program Text → Canonical form of I/O mapping
Sampling Semantics: Program Text → Vector of outputs on training set
Sampling Semantics
• Sampling semantics (O’Neill, Nguyen, et al.) are a data-driven way of defining a semantic representation for any kind of function.
• The sampling semantics of a function over a particular (ordered) training set T is simply the vector of the outputs of that function over T.
• This emphasis on the set of outputs (rather than just, say, a sum of errors) allows us to define metrics on pairs of population members, e.g. to define how similar two programs are.
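As a minimal sketch (the function names and the tiny training set here are ours, invented for illustration), the sampling semantics of a program is just its output vector over the ordered training cases, which lets us detect semantic duplicates directly:

```python
def sampling_semantics(fn, training_set):
    # The sampling semantics of fn over an ordered training set
    # is simply the vector of its outputs over that set.
    return tuple(fn(*case) for case in training_set)

# Two syntactically different programs over two Boolean inputs...
prog_a = lambda x, y: x ^ y                    # XOR
prog_b = lambda x, y: (x | y) & (1 - (x & y))  # XOR, written another way

T = [(0, 0), (0, 1), (1, 0), (1, 1)]

# ...share a sampling semantics, so the duplicate is detectable:
assert sampling_semantics(prog_a, T) == sampling_semantics(prog_b, T)
```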
What do we really want?
• GP assigns fitness on the basis of counting how many fitness cases are solved by each program in the population, or by summing the total error.
• This is the wrong thing to measure.
• We want to measure whether subprograms add information/structure that will make it easier for later parts of the program to solve the problem.
The Semantics of Wrong Programs
• Much of computer science is interested in reasoning about correct programs (or reasoning about whether programs are correct).
• But most programs are wrong most of the time during development.
• We need ideas that help us to reason about wrong programs, and their relationship to the target specification.
• Can we measure how much problem-specific structure a program fragment is contributing?
Similarity Measures
• When are two things similar? For example, two programs, or the output from a program and the target value?
• Clearly, bitwise difference is not the most important thing.
Information Distance
• Instead of pointwise distance, consider information distance (Vitányi et al.).
• An example of an information distance is the length of the shortest program required to transform one thing into the other.
• Against the target 01010101010101010101, the program that outputs 10101010101010101010 is “better” than one that outputs 01000010110100010110, even though its bitwise error is larger: a single NOT transforms the first output into the target.
Information Distance Fitness
• Combine the idea of information distance with sampling semantics to get a new notion of fitness.
• The fitness of a program fragment is the length of the smallest program required to transform its sampling semantics vector into the target vector.
• A computationally grounded notion of “wrong” should be grounded in how much computation is needed to make the program “right”.
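A sketch of this fitness using the approximations named later in the talk (gzip as the compressor, bitwise XOR as the difference for Boolean vectors); the helper names are ours, and the three vectors reuse the earlier 0101…/1010… example:

```python
import gzip

def compressed_size(bits):
    # Approximate algorithmic complexity by gzip-compressed length.
    return len(gzip.compress(bytes(bits)))

def info_distance_fitness(semantics, target):
    # For Boolean vectors, the transformation from output to target is
    # captured by the bitwise XOR difference; the more compressible that
    # difference, the shorter the program needed to fix the output.
    diff = [a ^ b for a, b in zip(semantics, target)]
    return compressed_size(diff)     # lower = closer to the target

target   = [0, 1] * 10   # 0101...: the target from the earlier slide
inverted = [1, 0] * 10   # every bit wrong, but the error is regular
noisy    = [0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
            0, 1, 0, 0, 0, 1, 0, 1, 1, 0]   # the slide's third string

# The fully inverted output is "better": its difference compresses further.
assert info_distance_fitness(inverted, target) < info_distance_fitness(noisy, target)
```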
Programs by Accumulation
• Rather than the GP notion of a population of complete programs, we will find it easier to work with a set of program fragments.
• Let us call these fragments “theories”.
• Good theories represent partial solutions to all (or many) training cases, not complete solutions to some training cases. (We “cut horizontally” rather than “cutting vertically”.)
• We can compare theories by their information distance to the target.
Assigning Quality to Program Fragments
• Most GP research to date assigns fitness to complete programs: we need a complete program before we can assign fitness, and we don’t assign fitness directly to substructures.
• In machine learning (e.g. C4.5), we assign “fitness” to combinations of features using ideas like information gain. That is, we assign a fitness to a partial “program”.
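For comparison, a minimal sketch of the C4.5-style information-gain score the slide alludes to; the dataset and the feature vectors f1 and f2 are invented for illustration:

```python
import math

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(feature, labels):
    # Entropy reduction from splitting the labels on a feature's values:
    # a "fitness" for a partial program (a single feature test).
    gain, n = entropy(labels), len(labels)
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

labels = [0, 0, 1, 1]
f1 = [0, 0, 1, 1]   # splits the classes perfectly: gain 1.0
f2 = [0, 1, 0, 1]   # tells us nothing about the class: gain 0.0
assert information_gain(f1, labels) > information_gain(f2, labels)
```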
Which is best: f1 or f2?
Compressibility
• One way to describe strings or mappings is in terms of their algorithmic information complexity, such as the Kolmogorov complexity.
• Roughly speaking, this is the length of the shortest possible program required to compute the string.
• So, for example, 1010101010101010 can be described by a shorter program than 1001010101111010.
• This is non-computable; but we can approximate it by running a compression algorithm.
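We can check this approximation directly on the slide’s two strings, with gzip standing in for the non-computable Kolmogorov complexity:

```python
import gzip

# gzip as a computable upper bound on algorithmic complexity:
regular   = b"1010101010101010"   # the slide's repetitive string
irregular = b"1001010101111010"   # the slide's irregular string

# The repetitive string compresses to fewer bytes than the irregular one.
assert len(gzip.compress(regular)) < len(gzip.compress(irregular))
```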
Which is best: f1 or f2?
• The difference between f1 and the parity target is more compressible: we can find a shorter description of it.
Compression-based Program Synthesis (TDFcomp)
Choose a set of functions F.
Create a construction set C, initially containing all of the input variables.
LOOP:
• create a number (500) of sub-programs by applying functions from F to members of C
• calculate the difference between the output from these sub-programs and the target on all inputs (Hamming)
• choose the most compressible difference and add the relevant sub-program to C (gzip)
UNTIL C contains a program that is a solution to the whole problem
Example: 4-bit even parity
run:
0: (XNOR v_3 v_2) with quality 26
1: (XNOR v_2 v_3) with quality 26
2: (XNOR v_2 v_3) with quality 26
3: (XOR v_2 v_3) with quality 26
4: (XOR v_3 v_2) with quality 26
5: (AND v_2 v_1) with quality 27
6: (XNOR v_1 v_2) with quality 27
7: (XOR v_2 v_1) with quality 27
8: (1st v_1 v_0) with quality 28
9: (OR v_2 v_0) with quality 28
10: (OR v_2 v_3) with quality 28
...
Typical Run (2)
run:
***************** Iteration 1
(XNOR v_2 v_3) with quality 26
***************** Iteration 2
(XOR (XNOR v_2 v_3) v_1) with quality 24
***************** Iteration 3
(XOR (XOR (XNOR v_2 v_3) v_1) v_0) with quality 23
######################################
1 perfect solution found, which is:
(XOR (XOR (XNOR v_2 v_3) v_1) v_0) with quality 23
BUILD SUCCESSFUL (total time: 1 second)
Typical Run (3)
run:
***************** Iteration 1
(XNOR v_3 v_2) with quality 26
***************** Iteration 2
(XNOR (XNOR v_3 v_2) v_1) with quality 24
***************** Iteration 3
(XOR (XNOR (XNOR v_3 v_2) v_1) v_0) with quality 23
***************** Iteration 4
(NOT2 v_0 (XOR (XNOR (XNOR v_3 v_2) v_1) v_0)) with quality 23
######################################
14 perfect solutions found, which are:
(NOT2 v_0 (XOR (XNOR (XNOR v_3 v_2) v_1) v_0)) with quality 23
....
...and traditional GP for contrast!
run:
0
2.0
XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0)))
XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0))
OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2)))))
AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2))
XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3)))
XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0))))))
1
2.0
XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0)))
XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0))
OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2)))))
AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2))
XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3)))
XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0))))))
2
2.0
XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0)))
XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0))
OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2)))))
AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2))
XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3)))
XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0))))))
3
1.0
XOR(XOR(XOR(XOR(XOR(OR(d2 d3) AND(d3 d3)) OR(NAND(d0 XOR(d2 d2)) AND(d0 d2)))
OR(OR(NAND(d3 d3) OR(d2 d3)) AND(XOR(d2 d3) OR(d3 d2)))) XOR(OR(NAND(NAND(d1 d3) NAND(d3 d0))
NAND(AND(d1 d3) XOR(d1 d3))) NAND(OR(XOR(d1 d3) NAND(d0 d2)) OR(AND(d2 d0) XOR(d0 d1)))))
NAND(XOR(AND(OR(OR(d2 d3) OR(d0 d1)) OR(AND(d1 d0) AND(d3 d3))) AND(AND(XOR(d0 d3) AND(d0 d1))
XOR(AND(d2 d1) OR(d3 d0)))) NAND(NAND(AND(d2 XOR(d3 d0)) XOR(OR(d1 d3) d2)) AND(OR(AND(d2 d2) AND(d1
d1)) XOR(AND(d3 d2) XOR(d3 d3))))))
4
0.0
XOR(XOR(XOR(XOR(XOR(OR(d2 d3) AND(d3 d3)) OR(NAND(d0 d2) AND(d0 d2)))
OR(OR(NAND(d3 d3) OR(d2 d3)) AND(XOR(d2 d3) OR(d3 d2)))) XOR(OR(NAND(NAND(XOR(d3 d0) d3) NAND(d3 d0))
NAND(AND(d1 d3) XOR(d1 d3))) NAND(OR(XOR(d1 d3) NAND(d0 d2)) OR(AND(d2 d0) XOR(d0 d1)))))
NAND(XOR(AND(OR(OR(d2 d3) OR(d0 d1)) OR(AND(OR(d1 d2) d0) AND(d3 d3))) AND(AND(XOR(d0 d3) AND(d0 d1))
XOR(AND(d2 d1) OR(d3 d0)))) NAND(NAND(d1 XOR(OR(d1 d3) d1)) AND(OR(AND(d2 d2) AND(d1 d1)) XOR(AND(d3 d2)
...is this a fair example?
• Perhaps this isn’t the fairest of examples.
• The parity problem has the advantage that, once you have combined two variables with the XOR or XNOR operator, you have extracted all of the information out of them.
• In other problems this is not the case; e.g. in the multiplexer problem you need to use the address bits more than once.
• ...but there are ways of dealing with this.
The Big Picture
• GP is measuring the wrong thing:
  • we want to measure how (algorithmically) complex the “gap” is between the current program fragment and the target, not the error between the current program (fragment) and the target.
• We have shown a way to give a fitness value to small components of a program during program synthesis, rather than having to always evaluate full programs.
• Can we do more to remove the “bio-inspired” from the methods and replace it with computational/informational ideas?
Questions/Comments
Download