Discovering Patterns by Searching for Simplicity

From: AAAI Technical Report SS-95-03. Compilation copyright © 1995, AAAI (www.aaai.org). All rights reserved.
Alexander P.M. van den Bosch
Department of Philosophy and
Center for Behavioural, Cognitive and Neuro Sciences (BCN)
Groningen University
PA A-Weg 30, 9718 CW Groningen, The Netherlands
E-mail: vdbosch@bcn.rug.nl
Abstract

I discuss the use of Kolmogorov complexity and Bayes' theorem in Solomonoff's inductive method to explicate a general concept of simplicity. This makes it possible to understand how the search for simple, i.e., short, computational descriptions of (empirical) data leads to the discovery of patterns, and hence to more probable predictions. I show how the simplicity bias of Langley's BACON.2 and Thagard's PI is subsumed by Rissanen's Minimum Description Length principle, which is a computable approximation of Solomonoff's uncomputable inductive method.
Introduction
This paper is about the role of simplicity in discovery, explanation, and prediction. I pursue and answer two questions: How can simplicity most generally be defined? And why should we prefer a simpler theory over a more complex one? I discuss definitions of simplicity that stem from research in cognitive science and machine learning. In those approaches simplicity plays an important role in the process of scientific discovery, as implemented in Langley and Simon's computer model BACON; in inference to the best explanation, as implemented in Thagard's computer model PI; and in the probability of predictions, as explicated by Solomonoff.

Langley and Simon claim that the BACON programs search for simple consistent laws without making explicit what is meant by 'simple' laws and why we should pursue simplicity (Langley et al. 1987). Thagard proposed an explicit definition of simplicity and employs it in his model PI, without providing a satisfying reason for it (Thagard 1988). Solomonoff, however, proposed an explication of induction which makes use of a concept that can be used to understand simplicity and which has a satisfying justification for its preference.
According to Solomonoff we should trust the theory whose implications can be generated by the shortest computer program that can generate a description of our known observational data. It is argued that a shorter computer program provides more probable predictions because it uses more patterns from that data. It is proved that this simplicity measure is reasonably independent of the computer language that is used. However, it has one drawback: it is uncomputable. Yet it is claimed that computable approaches to induction in machine learning are approximations of Solomonoff's method (Li & Vitányi 1994).

In this paper I demonstrate how Solomonoff's approach can elegantly be used to make a universal prior probability distribution for Bayes' theorem. First it is shown that Rissanen's Minimum Description Length principle (MDL) can be derived from Solomonoff's approach. From there I show that simplicity in Langley et al.'s BACON.2 and simplicity in Thagard's PI are nicely subsumed by MDL.

An important general conclusion is that for any method that searches for regularities in data, it is justified, given found alternatives, to provisionally prefer the simplest explanation, i.e., the shortest computational description consistent with our observations.
Prior probabilities and programs
In 1964 an article by Solomonoff was published that
contained a proposal for a general theory of induction.
The objective was the extrapolation of a long sequence
of symbols by finding the probability that a sequence
is followed by one of a number of given symbols. It
was Solomonoff’s conviction that all forms of induction
could be expressed in this form.
He argued that any method of extrapolation
can
only work if the sequence is very long and that all the
information for an extrapolation is in the sequence.
Solomonoff proposed a general solution that involved
Bayes’ theorem. This theorem requires that a prior
probability of a hypothesis is known to determine
the posterior probability with use of the known data. Solomonoff's solution is to provide for a universal distribution of prior probabilities, making use of a formal
definition of computation.
It is widely believed that the notion of computation is fundamentally defined as the operation of a Turing-machine. This machine, conceived of by the mathematician Alan Turing, consists of a long tape and a finite automaton which controls a 'head' that can read, delete and print symbols on the tape. To compute the value of a function y = f(x), write a program for f(x) on the tape, together with a symbolic representation of x, and start the Turing-machine. The computation is completed when the Turing-machine halts and the value of y is left on the tape as output. Turing proved that there is a universal Turing-machine that can compute every function that can be computed by any Turing-machine. The famous Church-Turing thesis claims that every function that can be computed, can be computed by a Turing-machine.
What Solomonoff did was to correlate all possible
sequences of symbols with programs for a universal
Turing-machine that has a program as input and the
sequence as output. He assigned a sequence that can be computed with short and/or numerous programs a high prior probability. Sequences that need long programs, and have only a few, receive a low prior probability.
For Solomonoff the validity of giving a shorter program a higher prior probability is suggested by a conceptual interpretation of Ockham's razor. But it is justified because a shorter program utilises more patterns in the sequence to make the program smaller. So, if we trust the data as being representative of things to come, then the shorter program is the more probable.

If we, e.g., have a sequence that has x as a prefix and x = 1234123412341234123, then we could write a program that describes x as daaaa123, where d is a definition of 1234 as a. If we want to predict the following symbol we can entertain the hypothesis p1 = daaaaa, or still shorter p1 = d5a. Another option is p2 = d4cd231. The first hypothesis predicts a 4 and the second a 1. Both hypotheses are compatible with the known data x. Now Solomonoff argues that the prediction of p1 is more probable because it requires a shorter program to generate a continuation of x than p2 (Solomonoff 1964, p. 10).
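As a toy illustration of this argument (my own sketch in Python, not something from Solomonoff's paper), we can turn the two candidate descriptions above into prior weights of the form 2^(-length), where length is simply the character count of the hypothetical program:

# Toy sketch: give each candidate description of a continuation of
# x = 1234123412341234123 a prior weight of 2**(-length), where length is
# just the character count of the hypothetical programs from the text.
candidates = {
    "4": "d5a",      # p1: the shorter description, continues x with a 4
    "1": "d4cd231",  # p2: the longer description, continues x with a 1
}

def prior_weight(program):
    # Solomonoff-style weighting: shorter programs get exponentially more mass.
    return 2.0 ** (-len(program))

weights = {prediction: prior_weight(p) for prediction, p in candidates.items()}
total = sum(weights.values())
for prediction, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(prediction, "relative probability:", round(w / total, 3))
# The prediction '4' is preferred because its description is shorter.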
That a sequence with many programs gets a high prior probability is suggested by the idea that if an occurrence has many possible causes, then it is more likely. The principle of indifference is integrated by giving programs of the same length the same prior probability.

Unfortunately this approach has an important problem. It is not determinable whether a given program is the shortest program that computes a sequence. If that were determinable then there would exist a Turing-machine that could determine for every possible program whether it would generate a given sequence. However, most of the possible Turing-machine programs will never halt. Due to this halting problem we cannot know for every program whether the program computes a given sequence. But before I go into that problem I want to make Solomonoff's theory more specific by first discussing Kolmogorov complexity and its application in probability theory.

Kolmogorov complexity

The Kolmogorov complexity of a sequence or string is actually a measure of randomness or, when inverted, of the regularity of the patterns in that string. We can use a Turing-machine to measure that regularity by the length of the shortest program for that Turing-machine that has the sequence as output. We can call such a program a description of the sequence. This description is relative to the Turing-machine that takes the description as input. So when we have a sequence x, a description program p and a Turing-machine T, we can define the descriptional complexity of x, relative to T, as follows (cf. (Li & Vitányi 1993, p. 352)):

Definition 1 The descriptional complexity C_T of x, relative to Turing-machine T, is defined by:

C_T(x) = min{ l(p) : p ∈ {0,1}*, T(p) = x }

or C_T(x) = ∞ if no such p exists.
We consider T to be a Turing-machine that takes as input program a binary string of zeros and ones, so the program is an element of the set {0,1}*, the set of all finite binary strings. We use binary strings because everything that can be coded, like, e.g., scientific theories, can be coded with a string of zeros and ones. The length of the program, l(p), is its number of zeros and ones. So the definition takes as the complexity of a string x the length of the program p that consists of the least number of bits and that will generate x when given to T. If no such program exists then the complexity is considered to be infinite.
When x is a finite string then there is always a program that will describe it. Just take a program that will merely print the string literally. This program will be larger than the string. However, if x is infinite and no finite program exists, then x is uncomputable by definition.
This complexity measure is relative but, surprisingly, not (very) dependent on the Turing-machine that is used to describe a string, as long as it is a universal Turing-machine. This is because there exists a universal Turing-machine that computes the same or a lower complexity than the complexity computed by any other Turing-machine, plus some constant dependent on that Turing-machine. So, e.g., when I have a string and two programs in different computer languages that compute that string, the difference in length between those programs cannot be more than a constant that is independent of the string. This is called the invariance theorem (cf. (Li & Vitányi 1993, p. 353)).
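To make the definition concrete, here is a small self-contained Python sketch (my own illustration, not something from the paper or from Li and Vitányi) that computes an analogue of C_T(x) relative to a deliberately tiny 'machine' whose programs are (pattern, repeat count) pairs; because every program of this restricted machine halts, the minimisation can be done by exhaustive search, which is exactly what is impossible for full Turing-machines.

# Toy descriptional complexity relative to a tiny, always-halting "machine".
# A program here is a pair (pattern, n) and the machine outputs pattern * n.
# Program length is measured in bits: 8 bits per pattern character plus 8 bits
# for the repeat count. This is only an analogue of C_T, chosen so that the
# minimisation is actually computable.

def run(program):
    pattern, n = program
    return pattern * n

def length_in_bits(program):
    pattern, n = program
    return 8 * len(pattern) + 8  # crude encoding cost of the program

def toy_complexity(x):
    # Exhaustive search over all programs of this machine that could output x.
    best = 8 * len(x) + 8        # the "print literally" program (pattern=x, n=1)
    for size in range(1, len(x) + 1):
        if len(x) % size == 0:
            pattern, n = x[:size], len(x) // size
            if run((pattern, n)) == x:
                best = min(best, length_in_bits((pattern, n)))
    return best

print(toy_complexity("12341234123412341234"))  # regular: short description
print(toy_complexity("31415926535897932384"))  # no repetition: near-literal cost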
In the literature Kolmogorov complexity K(x) is defined as a variant of descriptional complexity C(x) which makes use of a slightly different kind of Turing-machine. In the definition of descriptional complexity a Turing-machine was used with one infinite tape that can move in two directions and that starts with an input program on it and halts with a string on the tape as output. For the definition of Kolmogorov complexity a prefix machine is used. This kind of Turing-machine uses three tapes: an input tape and an output tape, which both move in only one direction, and a working tape that moves in two directions. The prefix Turing-machine reads program p from the input tape and writes string x on the output tape.

Kolmogorov complexity yields a measure similar to descriptional complexity, where C(x) and K(x) differ by at most a 2 log K(x) additive term. This difference is important for the use of K(x) in Bayes' formula. (Without it the sum of the probabilities of all possible hypotheses would not stay below one.) The invariance theorem for K(x) is similar to that of C(x).
Now how can this measure be useful in extrapolating a
sequence? First we will take a brief look at how Bayes’
formula requires a prior probability.
Bayesian inference
We will start with a hypothesis space that consists of a countable set of hypotheses which are mutually exclusive, i.e., only one can be right, and exhaustive, i.e., at least one is right. Each hypothesis must have an
associated prior probability P(Hn) such that the sum
of the probabilities of all hypotheses is one. If we want
the probability of a hypothesis Hn given some known
data D then we can compute that probability with Bayes' formula:

P(Hn|D) = P(D|Hn) P(Hn) / P(D)

where P(D) = Σ_n P(D|Hn) P(Hn). This formula determines the a posteriori probability P(Hn|D) of a
hypothesis given the data, i.e., the probability of Hn
modified from the prior probability P(Hn) after seeing
the data D. The conditional probability P(D]Hn) of
seeing D when Hn is true is usually forced by Hn, i.e., P(D|Hn) = 1 if Hn can generate D, and P(D|Hn) = 0 if Hn is inconsistent with D. So when we only consider hypotheses that are consistent with the data, the prior probability becomes crucial, because for all Hn where P(D|Hn) = 1 the posterior probability of Hn will become:

P(Hn|D) = P(Hn) / P(D)
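As a small worked example (my own, with made-up priors), the following Python snippet applies this special case: hypotheses that cannot generate the data get likelihood 0, the rest get likelihood 1, and each surviving hypothesis keeps its prior, renormalised by P(D):

# Posterior for 0/1 likelihoods: consistent hypotheses keep their prior,
# renormalised by P(D) = sum of the priors of the consistent hypotheses.
priors = {"H1": 0.5, "H2": 0.25, "H3": 0.25}        # assumed prior distribution
consistent_with_data = {"H1": True, "H2": True, "H3": False}

p_data = sum(p for h, p in priors.items() if consistent_with_data[h])
posterior = {
    h: (p / p_data if consistent_with_data[h] else 0.0)
    for h, p in priors.items()
}
print(posterior)   # {'H1': 0.666..., 'H2': 0.333..., 'H3': 0.0}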
Now let us see what happens when we apply Bayes’
formula to an example of Solomonoff’s inductive inference. In this example we only consider a discrete
sample space, i.e., the set of all finite binary sequences {0,1}*. What we want to do is, given a finite prefix of a sequence, assign probabilities to possible continuations of that sequence. What we do is, given the known data, make a probability distribution over all hypotheses that are consistent with the data. So if we have a sequence x of bits, we want to know the probability that x is continued with y. So in Bayes' formula:

P(xy|x) = P(x|xy) P(xy) / P(x)
We can take P(x|xy) = 1 no matter what we take for y, so we can say that:

P(xy|x) = P(xy) / P(x)
Hence if we want to determine the probability that
sequence x is continued by y we only need the prior
probability distribution for P(xy). Solomonoff’s approach is ingenious because he first identifies x with
the computer programs that generate xy. In this way
y can be determined in two ways: y is the next element that is predicted, i.e., generated, by the smallest
Turing-machine program that can generate x; or that element is predicted which is generated by most of the programs that can generate x.
So we can define the prior probability of a hypothesis in two different ways. We can give the shortest program the highest prior probability and define the probability of xy as

P_K(xy) := 2^(-K(xy))

i.e., two to the power of minus the length of the shortest program that computes xy (Li & Vitányi 1990, p. 216). Or we can define P_U(xy) as the sum of 2^(-l(p)) over every program p (so not only the shortest) that generates xy on a reference universal prefix machine (Li & Vitányi 1993, p. 356). The latter is known as the Solomonoff-Levin distribution. Both have the property that the sum of prior probabilities is equal to or less than one, i.e., Σ_x P_K(x) ≤ 1 and Σ_x P_U(x) ≤ 1.
However, it can be shown that if there are many 'long' programs that generate x and predict the same y, then a smaller program must exist that does the same. And it is proved that both prior probability measures coincide up to an independent fixed multiplicative constant (Li & Vitányi 1993, p. 357).
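The following Python sketch (my own toy enumeration, not Solomonoff's actual construction; the 'machine' below is neither universal nor a prefix machine, and its encoding is an arbitrary assumption) shows how both priors can be computed by brute force over a finite program space, and how the resulting mass can be used to weigh continuations of a prefix x:

# Toy version of the two priors discussed above, over a deliberately tiny
# "machine": a program is a bit string whose first 3 bits encode a repeat
# count k in 1..8 and whose remaining bits are a pattern; the output is the
# pattern repeated k times. The machine only serves to make P_K and P_U
# computable by enumeration.
from itertools import product
from collections import defaultdict

def run(program):
    if len(program) < 4:
        return None
    k = int(program[:3], 2) + 1          # repeat count 1..8
    pattern = program[3:]
    return pattern * k

MAX_PROGRAM_LENGTH = 12
p_u = defaultdict(float)                  # sum of 2**(-l(p)) per output string
shortest = {}                             # shortest program length per output
for length in range(4, MAX_PROGRAM_LENGTH + 1):
    for bits in product("01", repeat=length):
        program = "".join(bits)
        out = run(program)
        if out is None:
            continue
        p_u[out] += 2.0 ** (-length)
        shortest[out] = min(shortest.get(out, length), length)

p_k = {s: 2.0 ** (-l) for s, l in shortest.items()}

# Predict the continuation of a prefix x: the weight of continuation y is the
# prior mass of all outputs that start with x followed by y.
x = "010101"
for y in "01":
    mass = sum(p for s, p in p_u.items() if s.startswith(x + y))
    print("P_U-style weight for continuation", y, ":", round(mass, 6))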
So we can take the Kolmogorov complexity of a sequence as the widest possible notion of shortness of description of that sequence. And if we interpret shortness of description, defined by Kolmogorov complexity, as a measure of parsimony, then the Solomonoff-Levin distribution presents a formal representation of the conceptual variant of Ockham's razor: the predictions of a simple, i.e., short, description of a phenomenon are more probable than the predictions of a more complex, i.e., longer, description.
Minimising the description length
While both the Kolmogorov and Solomonoff-Levin
measures are not computable, there are computable approximations of them. It has been demonstrated that several independently developed inductive methods are actually computable approximations of Solomonoff's method (Li & Vitányi 1993). I will first demonstrate
this for Rissanen’s MDL.
Rissanen made an effort to develop an inductive
method that could be used in practice. Inspired by the ideas of Solomonoff he eventually proposed the Minimum Description Length principle. This principle states
that the best theory given some data is the one which
minimises the sum of the length of the binary encoded
theory plus the length of the data, encoded with the
help of the theory. The space of possible theories does
not have to consist of all possible Turing-machine programs, but can just as well be restricted to polynomials, finite automata, Boolean formulas, or any other
practical class of computable functions.
To derive Rissanen’s principle I first need to introduce a definition of the complexity of a string given
some extra information, which is known as conditional Kolmogorov complexity:

Definition 2 The conditional Kolmogorov complexity K_T of x, relative to some universal prefix Turing-machine T and additional information y, is defined by:

K_T(x|y) = min{ l(p) : p ∈ {0,1}*, T(p, y) = x }

or K_T(x|y) = ∞ if no such p exists.
This definition subsumes the definition of (unconditional) Kolmogorov complexity when we take y to be empty. Now Rissanen's principle can elegantly be derived from Solomonoff's method. We start with Bayes'
theorem:
P(H|D) = P(D|H) P(H) / P(D)
The hypothesis H can be any computable description
of some given data D. Our goal is to find an H that
will maximise P(H|D). Now first we take the negative logarithm of all probabilities in Bayes' equation. The negative logarithm is taken because probabilities are smaller than or equal to one and we want to ensure positive quantities. This results in:

- log P(H|D) = - log P(D|H) - log P(H) + log P(D)

When we consider P(D) to be a constant, maximising P(H|D) is equivalent to minimising its negative logarithm. Therefore we should minimise:

- log P(D|H) - log P(H)
This results in the Minimum Description Length principle if we consider that the probability of H is approximated by the probability of the shortest program for H, i.e., P(H) ≈ 2^(-K(H)). Then the negative logarithm of the probability of H is matched by the length of the shortest program for H, i.e., the Kolmogorov complexity K(H). The same goes for P(D|H) and hence we should minimise:

K(D|H) + K(H)
i.e., the description or program length of the data,
given the hypothesis, plus the description length of
the hypothesis (Li & Vitányi 1990, p. 218). To make this principle practical all that remains is formulating a space of computable hypotheses that together have a prior probability smaller than or equal to one, and searching this space effectively. It has been shown in several applications that this approach is an effective way of learning (Li & Vitányi 1993, p. 371).
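A crude Python sketch of how this trade-off is applied in practice (my own illustration; the bit counts and the residual code below are arbitrary choices, not Rissanen's actual coding scheme): fit polynomials of increasing degree to noisy data and pick the degree that minimises an estimate of the two description lengths.

import numpy as np

def mdl_score(xs, ys, degree, bits_per_coefficient=16):
    # L(H): a fixed, arbitrary cost per stored coefficient.
    model_bits = bits_per_coefficient * (degree + 1)
    coeffs = np.polyfit(xs, ys, degree)
    residuals = ys - np.polyval(coeffs, xs)
    # L(D|H): a simple code for the residuals, roughly log2 of their magnitude.
    data_bits = sum(2 * np.log2(abs(r) + 1) + 1 for r in residuals)
    return model_bits + data_bits

rng = np.random.default_rng(0)
xs = np.arange(1.0, 21.0)
ys = 3 * xs**2 + 5 * xs + 7 + rng.normal(0, 2, size=xs.size)  # quadratic + noise

scores = {d: mdl_score(xs, ys, d) for d in range(0, 8)}
best = min(scores, key=scores.get)
print("degree with the smallest description length estimate:", best)  # typically 2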
BACON and PI
Let's look at the simplicity bias of BACON.2. (BACON.2 is not representative of the other BACON programs; I discuss the simplicity bias of the other BACON programs in (Van den Bosch 1994).) BACON.2 will always construct the simplest consistent law in its range of search. The method it uses is called the differencing method. With this method BACON.2 is able to find polynomial and exponential (polynomial) laws that summarise given numerical data (Langley et al. 1987). One could define BACON.2's simplicity bias as follows:
Definition 3 The simplicity of a polynomial in BACON.2 decreases with the increase of the polynomial's highest power. And, when possible, a variable power will result in a simpler polynomial than a polynomial with a high constant degree.
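The differencing method mentioned above can be sketched in a few lines of Python (my own reconstruction of the idea, not Langley et al.'s code): take successive differences of the dependent values until they become constant; the number of differencing steps is the degree of the fitting polynomial.

# A minimal sketch of the differencing idea behind BACON.2: if the k-th
# differences of evenly spaced data become constant, the data fit a
# polynomial of degree k in the independent variable.
def polynomial_degree(values, tolerance=1e-9):
    level = 0
    current = list(values)
    while len(current) > 1:
        if max(current) - min(current) <= tolerance:
            return level                      # constant at this level: degree found
        current = [b - a for a, b in zip(current, current[1:])]
        level += 1
    return None                               # not enough points to decide

# y = 2x^3 + x for x = 0..7: the third differences are constant, so degree 3.
data = [2 * x**3 + x for x in range(8)]
print(polynomial_degree(data))   # 3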
Langley et al. give no epistemic reason for preferring simplicity. However, after discussing simplicity in Thagard's PI I will argue that BACON.2's simplicity bias, as defined above, is justified.
Thagard’s account of the simplicity of a hypothesis
does not depend on the simplicity of the hypothesis
itself, but on the number of auxiliary hypotheses that
the hypothesis needs to explain a given number of facts (Thagard 1988). In PI, Thagard's cognitive model of scientific problem solving, discovery and evaluation are
scientific problem solving, discovery and evaluation are
closely related. Simplicity plays an important role in
both.
PI considers two hypotheses to be co-hypotheses if
they were formed together for an explanation by abduction. With the number of co-hypotheses and facts
explained by a hypothesis its simplicity is calculated
according to the following definition:
Definition 4 The simplicity of hypothesis H, with a number of c co-hypotheses and f facts explained by H, is calculated in PI as simplicity(H) = (f - c)/f, or 0 if f < c.
To assess the best explanation PI considers both consilience (i.e., explanatory success, or, in PI, the number of facts explained) and simplicity. This is not a difficult decision if one of the dimensions is superior while the values for the other dimension are more or less equal. If both explain the same number of facts but one is simpler than the other, or if they are both equally simple but one explains more facts than the other, then there is no difficult choice. But when, e.g., the first theory
explains most facts while the second is the simplest,
that conflict seems to make the choice more difficult.
In that case PI computes a value for both hypotheses
according to the following definition:
Definition 5 The explanatory value of hypothesis H for IBE is calculated in PI as value(H) = simplicity(H) × consilience(H).
In this way PI can pick out explanations that do not
explain as much as their competitors but have a higher simplicity or explain more important facts. It also renders ad hoc hypotheses useless, because if we add an extra hypothesis for every explanation then the simplicity of that theory will decrease at the same rate as the consilience increases.
One feature of IBE in PI is that its valuation formula implies a much simpler definition, which follows easily from the definitions of the simplicity and the value of a hypothesis:

Definition 6 For IBE the explanatory value of a hypothesis H, for the number of c co-hypotheses and f facts explained, can be calculated in PI as value(H) = f - c, or zero if f < c.
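The three definitions can be written down directly; a tiny Python sketch (my own, following the formulas above rather than Thagard's actual PI implementation) that also shows how value(H) reduces to f - c:

# PI's simplicity and explanatory value as given in Definitions 4-6
# (f = number of facts explained, c = number of co-hypotheses; assumes f > 0).
def simplicity(f, c):
    return (f - c) / f if f >= c else 0.0

def consilience(f):
    return f                      # in PI: the number of facts explained

def value(f, c):
    return simplicity(f, c) * consilience(f)   # equals f - c whenever f >= c

print(value(f=6, c=2))   # 4.0, i.e. f - c
print(value(f=3, c=5))   # 0.0: more co-hypotheses than explained facts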
Thagard does not satisfactorily argue why we should prefer this kind of simple hypothesis. He only demonstrates that several famous scientists used it as a defending argument. But he did not show that simplicity
promotes the goals of inference to the best explanation,
like truth, explanation and prediction.
Approximation
I will now compare Rissanen's Minimum Description Length principle (MDL) with BACON.2, and with inference to the best explanation (IBE) as implemented by Thagard in PI.
For BACON.2, Rissanen's principle suggests an improvement, because in the case of noisy data BACON.2 would probably come up with a polynomial as long as that data, while it could construct a much simpler one if it employed and encoded deviations from the polynomial as well.
An important difference between Rissanen's principle and BACON is that the former requires searching the whole problem space, while the latter searches it heuristically. But BACON's search is guided by the same patterns that will eventually be described by a law. However, a heuristic search like BACON's can be aided by Rissanen's principle. Actually BACON does search for a minimal description, but it does not try to minimise it, i.e., if BACON finds a description, it halts, and will not search for a shorter one.
BACON.2determines the shortest polynomial that
can describe a given sequence. A Turing-machine can
be constructed that needs a longer description for a
more complex polynomial. It can be demonstrated
that a polynomial with an exponential term needs a
shorter description than a non-exponential polynomial
that describes the same sequence. BACON.2's method always finds the simplest polynomial that exactly fits the data. So I will make the following claim:

Claim 1 The polynomial constructed by BACON.2 with the differencing method, based on a given sequence x, is the polynomial with the shortest description that exactly describes x, if x can be described at all with a polynomial.
The validity of this claim can be derived from the differencing method. Every preference of BACON.2 between two polynomials that are compatible with the data is in agreement with the Minimum Description Length principle. However, MDL can seriously improve BACON.2 by including a valuation of a description of the sequence, given a possible polynomial. A shorter description of the sequence may result when deviations from a possible polynomial are encoded as well.
In Thagard's explication of inference to the best explanation in PI, the simplicity of a hypothesis is determined by the number of additional assumptions or co-hypotheses that the hypothesis needs for its explanations. Rissanen's MDL accounts for the importance of auxiliary hypotheses as well. MDL requires that we minimise both the description of an explaining hypothesis and the description of the data with the aid of the hypothesis. If a hypothesis can explain something right away, the description of the data is minimal, while if the hypothesis requires additional assumptions, the description of the data will be longer. So Thagard's simplicity satisfies at least one of the requirements of MDL. Hence I want to make the following claim and argue for its plausibility.
Claim 2 In case of equal consilience, the explanation
that will be selected by IBE in PI will provide a shorter
description of the facts, given the explanation, or at least no longer a description than the available alternatives provide.
This claim follows easily from the definition stating
that PI's IBE values hypotheses by subtracting the number of co-hypotheses c from the number f of explained facts, i.e., f - c. Two hypotheses that are of equal consilience explain the same number of facts, in which case the hypothesis with the smallest number of co-hypotheses is preferred. Hence, assuming that every such extra assumption is of reasonably equal length, the simpler hypothesis will provide a shorter description of the facts given the hypothesis. However, if both have the same number of co-hypotheses, then PI cannot make a choice, because both will provide a description of reasonably similar length.
Note that the length of the description of the hypothesis H is not relevant in this description of the facts. Only an approximation of the first term, K(D|H), is minimised, and not the second term, K(H). Recall that the conditional Kolmogorov complexity K(D|H) is defined as the length of the shortest program that will generate D with the aid of H. Such a program will be of length zero if H can generate D on its own. Hence the shortest program that generates D given H is an effective measure
of the length of the co-hypotheses that are needed by
H to explain D.
The simplest hypothesis
One question
may now come to mind: will the
Solomonoff approach yield a unique preference when
several simple hypotheses are compatible with the
data? It seems possible that more than one theory
or program, consistent with given data, can be of the
same length. So in that case we cannot make a decision based on a simplicity consideration, because all
alternatives are of equal simplicity.
To answer that critique we first have to distinguish
between the next symbol y that is predicted given a
sequence x and the different programs p that can generate a prediction. Our goal can be a proper prediction
y, given x, or a proper explanation of x. In the case
of the need for a proper prediction, if two programs are of the same length it may turn out that they predict different things. However, Solomonoff's method supplies two ways to solve this dilemma.

The first is the universal Solomonoff-Levin distribution, with which probabilities can be assigned to different continuations of a sequence. A given prediction y
not only receives a higher probability if it is predicted
by a short program, but also if numerous programs
make the same prediction. So if there is more than
one shortest program, the prediction of the program
that predicts the same as numerous other longer programs is preferred.
The second way out of the dilemma arises when the given amount of data x is very long. It can be proved that in the limit all reasonably short programs will converge to the same prediction, so you can pick any of them. This feature of the Solomonoff approach is nice for practical and computable approximations, because you can make reasonably good predictions with a given short program that may not be the shortest one.
However, if you value the best explanation of a given amount of observations, then you won't be satisfied by a grab-bag of possible hypotheses that may not even make the same predictions. Scientists who want to understand the world usually look for one best explanation. In this situation a case could be made for the simplest hypothesis as the best explanation. But with such an aim the Solomonoff approach seems troublesome, because you can never know whether a given short program that computes x is also the shortest program possible. The only effective way to find out would be to test all possible Turing-machines one by one to see if they generate x. But any of those possible Turing-machines may never halt, and there is no way to find out whether it ever will. You may put a limit on the time you allow a machine to run before you test the following one. But a shorter program can always be just one second further away.
The philosopher Ernst Mach once made the drastic claim that the best thing science can do is make predictions about phenomena, without explaining the success of such predictions with the (ontological) assumptions of the possible hypotheses. However, the best explanation, and hence possibly the simplest program, can be the ultimate goal of science. And a nice property of this kind of simplicity is that we can measure our progress. We may not have an effective, i.e., computable, method to establish whether a hypothesis is the simplest, but given a large amount of data we can establish the relative simplicity of any two hypotheses.
Conclusion
I will try to state my conclusion in one sentence, though it is probably not the shortest description of that conclusion:

Claim 3 In systematic induction we should, given discovered alternatives, prefer the simplest explanation, i.e., the shortest computational description of the known data given a hypothesis plus the shortest description of that hypothesis, because it provides more probable predictions.

In my discussion I tried to be as substantial as possible. Those who are interested in a more lengthy discussion, including several other approaches to simplicity, beginning with Aristotle's, can contact me for a copy of my (Van den Bosch 1994).
References
P. Langley, H.A. Simon, G.L. Bradshaw, and J.M. Zytkow. Scientific Discovery: Computational Explorations of the Creative Processes. Cambridge: The MIT Press, 1987.

M. Li and P.M.B. Vitányi. "Kolmogorov Complexity and its Applications". In: J. van Leeuwen (ed.), Algorithms and Complexity, Handbook of Theoretical Computer Science, volume A, Elsevier, 187-254, 1990.

M. Li and P.M.B. Vitányi. "Inductive Reasoning and Kolmogorov Complexity". In: Journal of Computer and System Sciences, vol. 44, no. 2, 1993.

M. Li and P.M.B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Addison-Wesley, Reading, MA, 1994.

R.J. Solomonoff. "A Formal Theory of Inductive Inference, part I". In: Information and Control, 7, 1-22, 1964.

P. Thagard. Computational Philosophy of Science. Cambridge: The MIT Press, 1988.

A.P.M. van den Bosch. Computing Simplicity: About the role of simplicity in discovery, explanation, and prediction. Master's thesis, Department of Philosophy, University of Groningen, 1994.