Information Theoretic Definitions of Complexity

• Office hours shifted: 2-4 Tuesdays
• Summary & Discussion of Gell-Mann
• Definitions & Examples
– Shannon Information
– Mutual Information
– Kolmogorov Complexity (AIC)
• Turing Machine
– Effective Complexity
Shannon Information
• Shannon Entropy H measures basic information
capacity:
– For a random variable X with probability mass function p(x),
the entropy H(X) is defined as:
H(X) = -Σ_x p(x) log₂ p(x)
– Entropy is measured in bits.
– H measures the average uncertainty in the random variable.
• Example 1:
– Consider a random variable with uniform distribution over 32
outcomes.
– To identify an outcome, we need a label that takes on 32
different values,
e.g., 5-bit strings.
H(X) = -Σ_{i=1}^{32} p(i) log₂ p(i) = -Σ_{i=1}^{32} (1/32) log₂ (1/32) = log₂ 32 = 5 bits
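As a quick check of the example above, the entropy formula can be evaluated directly (a minimal Python sketch; the `entropy` helper is just for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum p(x) log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over 32 outcomes
uniform32 = [1 / 32] * 32
print(entropy(uniform32))  # 5.0 bits
```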
What is a Random Variable?
• A function defined on a sample space.
– Should be called “random function.”
– Independent variable is a point in a sample space (e.g., the
outcome of an experiment).
• A function of outcomes, rather than a single given outcome.
• Probability distribution of the random variable X:
P{X = x_j} = f(x_j),   j = 1, 2, ...
• Example:
– Toss 3 fair coins.
– Let X denote the number of heads appearing.
– X is a random variable taking on one of the values (0, 1, 2, 3).
– P{X=0} = 1/8; P{X=1} = 3/8; P{X=2} = 3/8; P{X=3} = 1/8.
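The coin-toss distribution above can be verified by brute-force enumeration (an illustrative sketch, not part of the original slides):

```python
from itertools import product

# Enumerate all 2^3 equally likely outcomes of tossing 3 fair coins
outcomes = list(product("HT", repeat=3))

counts = {k: 0 for k in range(4)}
for o in outcomes:
    counts[o.count("H")] += 1  # X = number of heads

# P{X = k} = counts[k] / 8
probs = {k: c / len(outcomes) for k, c in counts.items()}
print(probs)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```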
• Example 2:
– A horse race with 8 horses competing.
– The probabilities of the 8 horses are:
(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
– Calculate the entropy H of the horse race:
H(X) = (1/2) log₂ 2 + (1/4) log₂ 4 + (1/8) log₂ 8 + (1/16) log₂ 16 + 4 × (1/64) log₂ 64 = 2 bits
– Suppose that we wish to send a (short) message to another
person indicating which horse won the race.
– Could send the index of the winning horse (3 bits).
– Alternatively, could use the following set of labels:
• 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
• Average description length is 2 bits (instead of 3).
• Huffman code: a variable-length ‘prefix-free’ code for maximum
lossless compression
• More generally, the entropy of a random variable is a lower bound on
the average number of bits required to represent the random
variable.
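The horse-race example can be checked numerically: the average length of the prefix-free code above equals the 2-bit entropy, matching the lower bound exactly (a small Python sketch):

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

# Entropy of the race and average description length of the code
H = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * len(c) for p, c in zip(probs, codes))

print(H)        # 2.0 bits
print(avg_len)  # 2.0 bits: the code meets the entropy lower bound
```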
• Shannon information is the amount of ‘surprise’ when you read a
message
• Entropy is the amount of uncertainty you have before you read the
message; information measures how much reading the message
reduces that uncertainty.
• The uncertainty (complexity) of a random variable can be extended to
define the descriptive complexity of a single string.
– E.g., Kolmogorov (or algorithmic) complexity is the length of the shortest
computer program that prints out the string.
• Entropy is the uncertainty of a single random variable.
• Conditional entropy is the entropy of a random variable given another
random variable.
Mutual Information
• Measures the amount of information that one random variable
contains about another random variable.
– Mutual information is a measure of reduction of uncertainty due
to another random variable.
– That is, mutual information measures the dependence between
two random variables.
– It is symmetric in X and Y, and is always non-negative.
• Recall: Entropy of a random variable X is H(X).
• Conditional entropy of a random variable X given another
random variable Y = H(X | Y).
• The mutual information of two random variables X and Y is:
I(X; Y) = H(X) - H(X | Y) = Σ_{x,y} p(x, y) log₂ [ p(x, y) / ( p(x) p(y) ) ]
[Figure: Venn diagram of H(X) and H(Y); their overlap is I(X; Y), and H(X | Y) is the part of H(X) outside H(Y).]
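The mutual-information formula can be computed directly from a joint distribution (an illustrative sketch; the joint distribution here is made up for demonstration):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]."""
    px, py = {}, {}
    for (x, y), p in joint.items():  # accumulate the marginals
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# X and Y agree 3/4 of the time: knowing one reduces uncertainty in the other
joint = {(0, 0): 3/8, (0, 1): 1/8, (1, 0): 1/8, (1, 1): 3/8}
print(mutual_information(joint))  # about 0.189 bits

# Independent variables carry no information about each other
indep = {(0, 0): 1/4, (0, 1): 1/4, (1, 0): 1/4, (1, 1): 1/4}
print(mutual_information(indep))  # 0.0
```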
Algorithmic Complexity (AIC)
(also known as Kolmogorov-Chaitin complexity)
• Kolmogorov-Chaitin complexity or Algorithmic Information
Content, K(x) or K_U(x), is the length, in bits, of the smallest
program that when run on a Universal Turing Machine outputs
(prints) x and then halts.
• Example: What is K(x) where x is the first 10 even natural
numbers? Where x is the first 5 million even natural numbers?
• Possible representations:
– 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, …, (2n − 2): K(x) = O(n log n) bits
– for (j = 0; j < n; j++) printf(“%d\n”, j * 2): K(x) = O(log n) bits
Two problems with AIC
– Calculation of K(x) depends on the machine we
have available (e.g., what if we have a machine
with an instruction “print the first 10 even natural
numbers”?)
– Determining K(x) for arbitrary x is uncomputable
AIC cont.
• AIC formalizes what it means for a set of numbers to be
compressible and incompressible.
– Data that are redundant can be more easily described and
have lower AIC.
– Data that have no clear pattern and no easy algorithmic
description have high AIC.
• Random strings are incompressible: they contain no
regularities to compress, so
– K(x) ≈ | Print(x) |
• Implication: The more random a system, the greater its
AIC.
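A practical, computable stand-in for this idea is a general-purpose compressor: patterned data shrink, (pseudo-)random data do not (a rough sketch; compressed size is only an upper-bound proxy for K(x)):

```python
import random
import zlib

random.seed(1)  # deterministic "randomness" for reproducibility

regular = ("10010" * 2000).encode()                          # highly patterned
noise = bytes(random.getrandbits(8) for _ in range(10000))   # pseudo-random

# A compressor exploits regularities; random bytes leave it nothing to exploit.
print(len(zlib.compress(regular)))  # tiny: the pattern compresses away
print(len(zlib.compress(noise)))    # nearly 10000: essentially incompressible
```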
• Contrast with statistical simplicity:
– Random strings are simple because you can
approximate them statistically
– Coin tosses, random walks, Gaussian (normal) distributions
– You can compress random numbers with statistical
descriptions and only a few parameters
Gell-Mann’s Effective Complexity
• The length of the shortest description of a set’s regularities
• EC(x) = K(r), where r is the set of regularities in x
and the Kolmogorov complexity K(r) is the length of the shortest
description of that set
• Highest for entities that are not strictly regular or random
[Figure: Algorithmic Complexity increases monotonically with randomness, from orderly strings (low Shannon information, high compressibility) to random strings (high Shannon information, low compressibility); Effective Complexity is low at both extremes and peaks in between.]
– Gell Mann suggests one formal way to identify regularities
– Determine mutual AIC between parts of the string
• If x = [x1, x2], the mutual AIC is K(x1) + K(x2) − K(x)
• That is, the sum of the AICs of the parts minus the AIC of the whole
• E.g., 10010 10011 10010: the whole has more regularity
than the sum of the regularities in the parts
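This mutual-AIC idea can be approximated with a compressor standing in for K (a rough sketch; zlib compressed length is only a crude upper bound on AIC):

```python
import zlib

def c(s):
    """Compressed length in bytes: a crude, computable proxy for K(s)."""
    return len(zlib.compress(s.encode(), 9))

x1 = "10010" * 1000          # part 1
x2 = "10011" * 1000          # part 2: shares almost all of part 1's regularity
whole = x1 + x2              # x = [x1, x2]

# Mutual AIC estimate: K(x1) + K(x2) - K(x)
mutual = c(x1) + c(x2) - c(whole)
print(mutual > 0)  # True: the parts share regularities
```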
Total Information
– Alternative approach in Gell-Mann & Lloyd 1998
– EC(x) = K(E) where E is the set of entities of which x is a
typical member
– Then K(E) is the length of the shortest program required to
specify the members of the set of which x is a typical
member
– Effective complexity measures knowledge—the extent to
which the entity is nonrandom and predictable
– Total Information is Effective complexity, K(E), + the
Shannon Information (peculiarities, randomness)
– TI(x) = K(E) + H(x)
– There is a tradeoff between the effective complexity (how
complete a description of the regularities) and the
remaining randomness
– Ex: 10010 10011 10010 10010 10010
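One toy way to read this example: treat “repeat 10010” as the regularity E and count the bits that deviate from it as the remaining randomness (an illustrative sketch, not Gell-Mann & Lloyd’s formalism):

```python
# Toy reading of TI(x) = K(E) + H(x): pick a candidate regularity E
# ("repeat 10010"), then count the deviations as the residual randomness.
x = "10010 10011 10010 10010 10010".replace(" ", "")

pattern = "10010" * 5   # what a perfectly regular x would look like
exceptions = [i for i, (a, b) in enumerate(zip(pattern, x)) if a != b]

print(exceptions)  # [9]: a single bit deviates from the pure pattern
```

A richer E (one that also describes the exception) would raise K(E) but shrink the residual randomness, which is exactly the tradeoff the slide describes.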
Language Complexity
• How complex is the English (or Spanish or Chinese or
Thai or C programming or Java…) language?
• What is the Shannon information content of a
message in this language, as a function of the
message length?
• What is the effective complexity (AIC of the
regularities)?
• What is the total information content?
• How well do these measures approximate the
complexity you perceive in these languages?
What is the information content, AIC, Effective
Complexity & logical depth of this string?
0413433021131473713868974402394801381716598485518981513440862714
2027932522312442988890890859944935463236713411532481714219947455
6443658237932020095610583305754586176522220703854106467494942849
8145339172620056875566595233987560382563722564800409510712838906
1184470277585428541980111344017500242858538249833571552205223608
7250291678860362674527213399057131606875345083433934446103706309
4520191158769724322735898389037949462572512890979489867683346116
2688911656312347446057517953912204556247280709520219819909455858
1946136877445617396074115614074243754435499204869180982648652368
4387027996490173977934251347238087371362116018601281861020563818
1835409759847796417390032893617143215987824078977661439139576403
7760537119096932066998361984288981837003229412030210655743295550
3888458497370347275321219257069584140746618419819610061296401614
8771294441590140546794180019813325337859249336588307045999993837
5411726563553016862529032210862320550634510679399023341675
Summary of Complexity Measures
• Information-theoretic methods:
– Shannon entropy
– Algorithmic complexity
– Mutual information
• Effective Complexity:
– Neither regular nor random
• Logical depth:
– Run-time of the shortest program that generates the phenomena and halts.
• The language/machine hierarchy:
– How complex a machine is needed to compute a function?
• Computational complexity:
– How many resources does it take to compute a function?
• Asymptotic behavior of dynamical systems:
– Fixed points, limit cycles, chaos.
– Wolfram’s CA classification: the outcome of complex CA cannot be predicted any faster than it
can be simulated.
Frozen Accidents &
Sensitive dependence on initial conditions
For Want of a Nail
For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the battle was lost.
For want of a battle the kingdom was lost.
And all for the want of a horseshoe nail.
Hierarchy, Interactions, Frozen Accidents
HELA: An exponential that has lasted forever
Under the microscope, a cell looks a lot like a fried egg: It has a white (the cytoplasm) that's full of water
and proteins to keep it fed, and a yolk (the nucleus) that holds all the genetic information that
makes you you. The cytoplasm buzzes like a New York City street. It's crammed full of molecules
and vessels endlessly shuttling enzymes and sugars from one part of the cell to another, pumping
water, nutrients, and oxygen in and out of the cell. All the while, little cytoplasmic factories work
24/7, cranking out sugars, fats, proteins, and energy to keep the whole thing running and feed the
nucleus. The nucleus is the brains of the operation; inside every nucleus within each cell in your
body, there's an identical copy of your entire genome. That genome tells cells when to grow and
divide and makes sure they do their jobs, whether that's controlling your heartbeat or helping your
brain understand the words on this page.
Defler paced the front of the classroom telling us how mitosis — the process of cell division — makes it
possible for embryos to grow into babies, and for our bodies to create new cells for healing wounds
or replenishing blood we've lost. It was beautiful, he said, like a perfectly choreographed dance.
All it takes is one small mistake anywhere in the division process for cells to start growing out of control,
he told us. Just one enzyme misfiring, just one wrong protein activation, and you could have cancer.
Mitosis goes haywire, which is how it spreads.
"We learned that by studying cancer cells in culture," Defler said. He grinned and spun to face the
board, where he wrote two words in enormous print: HENRIETTA LACKS.
I had the idea that I'd write a book that was a biography of both the cells and the woman they came
from — someone's daughter, wife, and mother.
From The immortal life of Henrietta Lacks.
http://www.npr.org/templates/story/story.php?storyId=123232331
Defining Complexity
Suggested References
• Papadimitriou, Computational Complexity. Addison-Wesley (1994).
• Cover and Thomas, Elements of Information Theory. Wiley (1991).
• Kauffman, At Home in the Universe (1996) and Investigations (2002).
• Per Bak, How Nature Works: The Science of Self-Organized Criticality (1996).
• Gell-Mann, The Quark and the Jaguar (1994).
• Ay, Müller & Szkoła, Effective Complexity and its Relation to Logical Depth, arXiv (2008).
Mitchell Ch. 3
Information
• CAS process information
– CAS are computers
• Energy is Conserved
• Entropy increases: the arrow of time
• Statistical mechanics bridges classical Newtonian
physics to thermodynamics: S = k log W
• Maxwell’s Demon
– Information costs energy
• Real world information—analyzed for meaning,
processed for some outcome, dependent on context
Mitchell Ch. 4
Gödel & Turing
• Gödel: Mathematics is incomplete
“This statement is false.”
• Turing: Mathematics is undecidable
• Turing Machine defines definite procedure
• UTM: tape contains both the input I and the
machine (or program) M
• I could be the M of another machine (it could
even be the same M)
Turing Machines
1. A tape
2. A head that can r/w & move l/r
3. An instruction table
4. A state register
Tape is infinite, all else is finite and discrete
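The four parts above can be sketched as a tiny interpreter (an illustrative toy, with a step limit standing in for the truly infinite tape and run time):

```python
def run_tm(table, tape, state="start", blank="_", max_steps=1000):
    """Tiny Turing machine: tape, r/w head, instruction table, state register."""
    cells = dict(enumerate(tape))  # sparse dict plays the "infinite" tape
    head, steps = 0, 0
    while state != "halt" and steps < max_steps:
        symbol = cells.get(head, blank)
        write, move, state = table[(state, symbol)]  # consult instruction table
        cells[head] = write                          # write
        head += 1 if move == "R" else -1             # move l/r
        steps += 1
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# Instruction table: flip every bit, halt at the first blank.
flip = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
print(run_tm(flip, "10110"))  # 01001
```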
Proof by Contradiction that math/logic/computation is
undecidable
• Turing Statement (to be contradicted):
A TM, H, given input I and Machine M will halt in finite
time and return Yes if M will halt on I, or No if M will
not halt on I
H(M,I) = 1 if M halts on I
H(M,I) = 0 if M does not halt on I
• Equivalent to saying we can design an infinite loop
detector H
• Why can’t H just run M on I?
Inspiration from Gödel
Create H’ that calculates H(M, M), except:
H’(M, M) does not halt if M halts on M
H’(M, M) halts if M does not halt on M
• What does H’ do when it is its own input?
• H’(H’,H’) halts if H’ doesn’t halt on its own input;
• H’ doesn’t halt if H’ halts on its own input
CONTRADICTION
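The diagonal argument can be mirrored in code: assume the detector `halts` exists and build H’ from it (a sketch; `halts` is deliberately left unimplemented, since the argument shows no such total function can exist):

```python
def halts(program, data):
    """Hypothetical infinite-loop detector H: True iff program(data) halts.
    The proof shows no such total function can exist."""
    raise NotImplementedError

def h_prime(program):
    # H': invert whatever H predicts about (program, program)
    if halts(program, program):
        while True:      # loop forever if `program` would halt on itself
            pass
    return "halted"      # halt if `program` would loop on itself

# Feeding H' to itself yields the contradiction: if halts(h_prime, h_prime)
# returned True, h_prime(h_prime) would loop; if False, it would halt.
# Either answer H gives is wrong, so H cannot exist.
```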
• Gödel encodes logical/mathematical statements so they talk about
themselves.
• Turing encoded logic in TMs and ran TMs on themselves.
• Both demonstrate that math/logic is incomplete and undecidable.
My view of CAS
• Incomplete, undecidable
– We know it when we see it
• Process information
• Interactions between components
• Hierarchy of components
• Result from frozen accidents & arrow of time
– Largely from evolution
– Exception: climate & weather: memory at different
scales
• Predictable at some scales, sometimes
Logical Depth
• Bennett 1986;1990:
– The Logical depth of x is the run time of the shortest program
that will cause a UTM to produce x and then halt.
– Logical depth is not a measure of randomness; it is small both
for trivially ordered and random strings.
• Drawbacks:
– Uncomputable.
– Loses the ability to distinguish between systems that can be
described by computational models less powerful than Turing
Machines (e.g., finite-state machines).
• Ay et al. 2008 propose a proof that strings with high effective
complexity also have high logical depth, and strings with low effective
complexity have small logical depth.