Outline of Ergodic Theory Steven Arthur Kalikow
(second edition) 2/6/16
Table of contents:
FOREWORD
PREFACE
INTRODUCTION
PRELIMINARIES
Lebesgue spaces
Stationary
Methods of making a measure preserving transformation
naming a function
stationary process
cutting and stacking
ergodic
homomorphism
Increasing Partitions
Measurable Zorn's lemma & Rohlin tower theorem
superimposition of partitions & Countable Generator Theorem
Birkhoff ergodic theorem
measure from a monkey sequence
Stationary is convex combination of ergodic (proof 1)
conditional expectation
conditional expectation of a measure & Stationary is convex combination of ergodic measures (proof 2)
Subsequential limit & Extending the monkey method & Comparison
Martingales & martingale convergence & Probability of present given past
Coupling
MEAN HAMMING METRIC & DBAR METRIC & VARIATION METRIC
Markov Chains
Preparation for the Shannon McMillan Theorem
ENTROPY
Entropy of process and transformation
Shannon McMillan Breiman Theorem
Entropy of a partition
Conditioning property & Independent joining
Krieger's Finite Generator Theorem
induced transformation
BERNOULLI TRANSFORMATIONS
Definitions of EX, VWB, FD, IC, B and FB
Extremality of independent processes
Final preliminaries of big equivalence theorem
B ⇒ FB ⇒ IC ⇒ EX ⇒ VWB ⇒ FD.
coupling infinite paths
Stationary
Ergodic
IC
FD
VWB
WB
ORNSTEIN ISOMORPHISM THEOREM
Copying In Distribution
Coding
The Land of P
Capturing entropy
The dbar metric
Sinai's and Ornstein's theorems
ABBREVIATION OF BOOK (primarily a capsule of the book, but also serves as an extended table of contents.)
INDEX OF DEFINITIONS BY PAGE (also serves as an extended table of contents.)
INDEX OF DEFINITIONS ALPHABETIZED
Outline of Ergodic Theory
February 6, 2016
Foreword:
The scheme of the book is that statements of theorems and definitions are made as
clear as possible; although some rigorous proofs are presented, for the most part only
the idea of the proof is presented, not the rigorous proof. Books that give rigorous proofs
give you an excellent view of the trees, but give a poor view of the forest. The main
benefit received from this book is that the student develops the ability to think
conceptually rather than in terms of detailed symbol pushing. This format benefits
a) the serious beginner who wishes to work out the proofs for himself with helpful
guidance from the book, and who does not want to be overburdened with notation while
trying to understand the ideas.
b) the student who just wants an overview of Ergodic Theory.
c) the person who already knows Ergodic theory and wants a review.
Most graduate student pure math books provide detailed proofs of theorems
followed by homework exercises at the end. Why not make the theorems themselves the
homework exercises? The answer is that proving the theorems is too hard for the student
without help. In this book, for the most part the theorems are the exercises, which is
possible because the book presents the essence of the proofs so that the student’s job is
manageable.
If this book is used as the text of the course, I would suggest that the professor
insist that the students not take notes in class. The professor should assign specific
theorems as homework. The students would have the option of reproducing the
professor's proof, filling in the details of the proof in the book, or coming up with their
own proofs. The professor should select theorems that are written up in the book very nonrigorously, because if I were a student, it would strain my sense of honesty to have to
write up a proof that is already completely there. In some cases part of the proof is
written rigorously. It is the job of the professor to isolate for the student the part that has
not been.
At some point the graduate student has to grow up and become a mathematician.
The student can’t just read other people’s work. The student has to creatively come up
with his own thinking. This book helps to bridge that gap.
An elementary understanding of real analysis, point set topology and
probability theory is assumed. The book is written at the level of a second year graduate
student in mathematics. The reader is expected to already have read and written many
rigorous proofs and be quite confident about his ability to know when he can make an
argument rigorous, and when he cannot.
PREFACE
This preface is written in an intuitive manner in the hope that a scientifically oriented
nonmathematician can get something out of it. It is the only part of the book that a nonmathematician is likely to
follow.
Example 1: A pool table has no pockets or friction. Part of the table is painted white and
part of the table is painted black. A pool ball is placed in a random spot and shot in a
random direction with a random velocity. You are blindfolded. You do not get to see the
shape of the table. All you get to know is when the ball is in the white part and when it is
in the black part. From this information you are to deduce as much as you can about the
entire process such as whether or not it is possible that the table is in the shape of a
rectangle.
Example 2: We are receiving a sequence of signals from outer space coming from a
random stationary (stationary means that the probability law is stationary in time)
process. We are unable to detect a precise signal but we can encode it by interpreting 5
successive signals as one signal, but this code loses information. Furthermore, we make
occasional mistakes. We wish to get as much information as possible about the original
process.
The subject that addresses these examples is called ergodic theory. To explain this
a little more mathematically, we need to introduce two words: “measure” and
“transformation”. There is a concept of measure that tells you how big a set is. In the
language of probability theory, how probable it is. A transformation is a way of assigning
one point to another, (e.g. taking the position and direction of a pool ball into the position
and direction that the ball will be in a minute later). Fundamentally, ergodic theory is the
study of transformations on a probability space which are measure preserving (e.g. if a set
of points has measure 1/3, then the set of points which map into that set also has measure
1/3.) This means, for instance, that the probability of rain 3 years from now is the same as
it is today. Of course this is false, but I am just pretending it to be true. When you apply
such a transformation over and over you get a process. In example 1, consider the process
you get if you only look at the ball after 1 minute, after 2 minutes etc. and don’t pay
attention to where the ball is in between (e.g. 1½ minutes). This process arises from
repeatedly applying the following transformation: Transform a position and velocity of a
ball into a position and velocity a minute later.
A process in which the probability laws do not change over time is called a
stationary process (e.g. the probability of rain 3 years from now is the same as it is
today). If the weather could be interpreted as a stationary process, an output might be a
listing of every day in the future and past, and whether it is raining or sunny. It would
output a doubly infinite string of R's and S's,
…R R S (S) R R…, where the parentheses indicate today (it is sunny today, will be
rainy tomorrow, and was sunny yesterday).
The transformation shifts tomorrow into today so it shifts the sequence one to the left.
…R R S S (R) R….
Here is a summary of the important concepts and theorems in this book.
1) Isomorphism: Suppose that in example 1 we were to change which part of the table we
paint white and which we paint black. Then you would have a different process. But our
new process could end up being equivalent to the original process in the sense that if you
know the output of either process, it determines the output of the other process. Any two
such equivalent processes are called isomorphic. If you are simply given two processes
and want to know whether they are isomorphic, then you need to find out whether there is
a nice correspondence between them so that each determines the other.
2) Ergodic: An ergodic process or transformation is a process (or transformation) which
cannot be written as a convex combination of two other processes (or transformations). If
all men watch baseball all the time and all women watch cooking shows all the time, then the
process describing what a fetus will watch is not ergodic if you don't know what sex that baby
will be, because it is either the baseball process or the cooking process and you don't
know which. Mathematically, it can be regarded as ½ (baseball process) + ½ (cooking
process). Such a process is not ergodic.
3) Birkhoff Ergodic Theorem: When an ergodic transformation is repeatedly applied to
form an ergodic process, the frequency of time that process spends in a given set is the
measure of that set, e.g. it spends 1/3 of the time in a given set of measure 1/3.
4) The Rohlin Tower theorem: Fix a positive integer, say 5. For any measure preserving
transformation that does not simply rotate finite sets of points around, you can break
almost the whole space into 5 equally sized disjoint sets which can be ordered so that the
transformation takes each set to the next.
5) Shannon Macmillan Breiman theorem: In an ergodic process with finite alphabet,
consider an infinite name. If you look at its sequence of finite initial names (e.g. a, ab,
abb, abba…), that sequence will have probabilities decreasing at a rate which will
approach exponential decay, and the rate of decay will not depend on the sequence you
choose.
6) Entropy: The exponential rate of decay just mentioned is called the entropy of the
process. Since the number of reasonable names is the reciprocal of the typical probability
of a name, entropy can also be thought of as being the limiting exponential number of
reasonable names. Example: Suppose you repeatedly flip a fair coin. All of the head, tail
names of length n are reasonable, and there are 2^n of them, so the entropy of the
process is log 2.
7) Kolmogorov entropy invariance: Two isomorphic processes must have the same
entropy.
8) Independent process: a stationary process on an alphabet in which all letters are
completely independent of each other is called an independent process (e.g. repeatedly
flipping a fair coin).
9) Ornstein Isomorphism theorem: Two stationary independent processes are isomorphic
if and only if they have the same entropy.
INTRODUCTION
1) Ergodic Theory is the study of measure preserving transformations on a probability
space, (Ω, A, μ, T). Here Ω, A, μ, T are the space, σ-algebra, measure, and transformation
respectively.
2) A Lebesgue space is the unit interval (only as a measure space, no topology, no metric)
possibly plus and/or minus a set of measure zero, possibly with atoms. A Lebesgue space
is always a probability space.
All Ergodic theory is done on Lebesgue spaces. Non-Lebesgue
counterexamples don’t count.
3) An isomorphism between two four-tuples (Ω, A, μ, T) and (Ω′, A′, μ′, T′) is a bijection
from Ω to Ω′ which preserves the σ-algebra and the measure and commutes with the
transformations.
4) A homomorphism, φ, is the same as an isomorphism except it is only onto, not necessarily
one to one. Here, as in an isomorphism, φ^-1 takes members of A′ to members of A, but
the image of that map does not have to include all of A (distinct sets in A do not have to
map to distinct sets in A′). A homomorphism is the same as throwing away part of the σ-algebra.
5) Stationary processes are an example of such a four-tuple, where Ω is the space of
doubly infinite strings of letters in some finite or countable alphabet, A is the σ-algebra
on Ω generated by cylinder sets (a cylinder set is a set which depends on only finitely
many coordinates), μ is a measure on A which causes the process to be a stationary
process, and T shifts the doubly infinite strings to the left by one.
6) The difference between probability and ergodic theory with respect to stationary
processes, is that ergodic theory considers two processes to be essentially the same if they
are isomorphic.
7) Start with a homomorphism, φ, of one stationary process to another. φ^-1 of the set of
doubly infinite strings with a given fixed letter at the origin is a measurable set, and every
measurable set is approximately equal to a cylinder set. This gives an approximate coding
of one process from another.
8) A collection of transformations indexed by the reals can form a flow, where T_t
composed with T_s is T_(t+s), if proper measurability conditions are met.
9) A transformation can turn into an operator, where T(f) is defined by
T(f)(ω) = f(T(ω)).
10) Some ergodic theorists like to add structure to their Lebesgue space, considering for
instance, flows on a differentiable manifold.
11) The assumption that T is a measure preserving transformation is sometimes replaced
with the weaker assumption that T is a nonsingular transformation, which means that T
sends sets of measure 0 to sets of measure 0.
12) Some look at measure preserving transformations on an infinite measure space.
13) Here are some definitions that we think that every ergodic theorist should be familiar
with, but most of these definitions are outside the scope of this book. Throughout these
definitions AT) is as in (1) and every stationary process is to be regarded as such a
fourtuple by (5).
Ergodic: T is called ergodic if any Q  A, (T-1(Q)) = (Q).
Bernoulli: T is called Bernoulli if it is isomorphic to a completely independent identically
distributed process.
K: A stationary process is called K (for Kolmogorov) if any measurable S ⊆ Ω with the following property
has measure either 0 or 1:
For any two words …a-2a-1a0a1a2… and …b-2b-1b0b1b2…
such that there exists N such that for all n > N, an = bn, either both words are in S or neither of
them are.
Mixing: Measure preserving transformation T is called mixing iff for all P, Q ∈ A,
lim (i → ∞) μ(T^-i(P) ∩ Q) = μ(P)μ(Q).
Weak mixing: There are three equivalent definitions of weak mixing. Any of these 3 ALONE
defines weak mixing if T is a measure preserving transformation on Ω.
1) There is a set of integers I of density 1 such that for all P, Q ∈ A,
lim (i → ∞, i ∈ I) μ(T^-i(P) ∩ Q) = μ(P)μ(Q).
2) Define T × T on Ω × Ω with measure μ × μ by T × T(a,b) = (T(a), T(b)).
Then T × T is ergodic.
3) There does not exist measurable f: Ω → S¹ (the unit circle of the complex numbers)
and λ ≠ 1 such that for all ω ∈ Ω, f(T(ω)) = λf(ω).
Minimal self joinings: Define S on Ω × Ω by S(a,b) = (T(a), T(b)). For any measure preserving
transformation T, here are two ways to define a measure ν on Ω × Ω so that the projections of S
on the axes are both isomorphic to T.
1) ν = μ × μ
2) Fix an integer n in advance. The support of ν is on the pairs {(ω, T^n(ω))}, and for any
A ∈ A, ν({(ω, T^n(ω)): ω ∈ A}) = μ(A).
If these are the only ways to define such a ν, then T is said to have minimal self joinings.
Entropy: There is a theorem that says that for any ergodic process
…X-2, X-1, X0, X1, X2…,
the sequence of events X0 = a; X0 = a, X1 = b; X0 = a, X1 = b, X2 = c; … has measures decreasing at a
constant exponential rate except on a set of words abc… of measure 0. This rate is called the
entropy of the process.
Subshift of finite type: Fix a finite alphabet. Let F be a finite list of finite words. Let Ω be the
space of doubly infinite words in your alphabet in which no word of F occurs as a subword. Let T
be the transformation on Ω that shifts the doubly infinite words to the left by one. Let A be the σ-algebra generated by cylinder sets and μ be the measure on A that maximizes the entropy of T.
Then (Ω, A, μ, T) is called a subshift of finite type.
Pinsker algebra: The set of all S ∈ A with the following property turns out to be a σ-algebra and
is called the Pinsker algebra of T: the process …X-2, X-1, X0, X1, X2… such that Xi(ω) = 1 if T^i(ω)
∈ S and Xi(ω) = 0 if T^i(ω) ∉ S has 0 entropy.
Mean Hamming Distance: The Hamming Distance between two words
a1, a2, … an and b1, b2, … bn is (1/n)#{i ∈ {1, 2, 3, …, n}: ai ≠ bi}.
Rank 1: The most natural way to define rank 1 is to learn cutting and stacking (page 11) and then
define T to be rank 1 if it is constructed by cutting and stacking so that at each stage there is only
one column. However, if you don’t wish to read beyond this introduction, the following theorem
defines rank 1:
A process …X-2, X-1, X0, X1, X2… is isomorphic to a rank 1 transformation iff for all ε > 0
there is an n such that with probability 1, part of the doubly infinite word can be tiled by
disjoint subwords of length n such that
1) limsup (m → ∞) (1/(2m+1)) #{i ∈ {-m, -m+1, …, m}: Xi is not in the part covered} < ε
2) the mean Hamming distance between any two such subwords is < ε.
PRELIMINARIES
In this book, when we write “Proof” we regard our proof as rigorous. However, when
we write “Idea of proof” we expect the reader to be capable of writing out a more
careful proof, although part of the given proof may already be rigorous.
Measurable and measure preserving.
DEFINITION 1: Let Ω be a set of points and let T be a function from Ω onto Ω. Then T is
called a transformation on Ω.
DEFINITION 2: Let (Ω, A, μ, T) be a collection of points, a σ-algebra on Ω, a probability
measure on A, and a transformation on Ω respectively. Then T is measurable if T^-1(S) ∈ A
for all S ∈ A.
DEFINITION 3: Let (Ω, A, μ, T) be a space, σ-algebra, probability measure and
measurable transformation respectively. T is called measure preserving if μ(T^-1(S)) =
μ(S) for all S ∈ A.
DEFINITION 4: Measurable and measure preserving are similarly defined when T is a
transformation from one probability space to another.
Lebesgue spaces.
DEFINITION 5: Let 0 < t ≤ 1. Let Ω be a set consisting of the interval [0,t] and a finite
set F:
Ω = [0,t] ∪ F. Place a probability measure μ on Ω which, when restricted to [0,t], is
uniform Lebesgue measure, with the standard Lebesgue σ-algebra and total measure t. (Ω, μ) is
called an interval space, with atoms if t ≠ 1, without atoms if t = 1.
DEFINITION 6: Let (I, ν) be an interval space, with or without atoms. Let (Ω, A, μ) be a
probability space. If there are sets A ⊆ I and B ⊆ Ω, both of measure one, and a bijection
between A and B which is bimeasurable and measure preserving in both directions, then
(Ω, A, μ) is called a Lebesgue space.
ABBREVIATION: You get a Lebesgue space by taking the unit interval possibly
squashing some subintervals into atoms and then perhaps removing and adding sets of
measure 0.
CONVENTION 7: In this book all transformations are one to one and all spaces are
atomless Lebesgue spaces.
COMMENT 8: Since in essence, the only Lebesgue space is the unit interval, one can be
deceived into thinking that we are talking about a specific space. Actually, the concept of
being isomorphic to the unit interval is quite general. It includes every probability space
you are likely to come across in a standard probability course, including Poisson
Processes, Brownian motion, and White noise. The only time you should suspect that you
are not dealing with a Lebesgue space, is when either
a) the space is endowed with a topology that is not separable
or
b) you used the axiom of choice to make the space.
Stationary
DEFINITION 9: A stationary process is a doubly infinite sequence of random variables,
...X-3,X-2,X-1,X0,X1,X2,X3... whose probability law is time invariant (e. g. the probability
that X0 is “a”, X1 is “b”, and X2 is “c” is the same as the probability that X100 is “a”, X101
is “b”, and X102 is “c”). The alphabet is the set of values an Xi is allowed to take.
CONVENTION 10: Unless stated otherwise, the alphabet should be regarded to be at
most countably infinite.
EXAMPLE 11: Let ...X-3,X-2,X-1,X0,X1,X2,X3... be an independent sequence of heads
and tails, obtained by a doubly infinite independent sequence of flips of a fair coin. This
is a stationary process. The alphabet is {h,t}.
EXERCISE 12: Show that a stationary process with an at most countably infinite alphabet,
endowed with the σ-algebra generated by cylinder sets (a cylinder set is a set
determined by X-n,...,X-3,X-2,X-1,X0,X1,X2,X3,...,Xn for some n), is a Lebesgue space.
Methods of making a measure preserving transformation
naming a function,
Define a transformation explicitly, e.g. let T, from the unit interval to itself, be
defined by T(x) = e^x mod 1. Then you have to look for a measure on the unit interval that
makes the transformation measure preserving. This particular transformation is not one to
one.
stationary process,
Define a transformation by letting the measure be a stationary process on the space
of doubly infinite words, and then let T be the shift to the left,
i.e. T( ...a-3,a-2,a-1,a0,a1,a2,a3...) = (...b-3,b-2,b-1,b0,b1,b2,b3...)
where ai+1 = bi for all i.
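To make the shift concrete, here is a minimal Python sketch (ours, not the book's); a doubly infinite word is modeled as a function from the integers to letters, and the names word and shift are ours.

# A doubly infinite word modeled as a function from the integers to letters;
# this one is ...ababab... with an 'a' at coordinate 0.
def word(i):
    return 'a' if i % 2 == 0 else 'b'

# The shift T: the letter that used to sit at coordinate i+1 now sits at coordinate i,
# i.e. T(...a-1,a0,a1...) = (...b-1,b0,b1...) with b_i = a_(i+1).
def shift(w):
    return lambda i: w(i + 1)

Tw = shift(word)
print(word(0), Tw(0))   # prints: a b  (coordinate 0 of the shifted word is coordinate 1 of the original)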
cutting and stacking
Start with the real line with Lebesgue measure. Cut off a finite piece (an interval) of
the line (whose measure is not required to be 1). Define a transformation on that piece by
cutting it into many equal sized pieces, and stacking them into a vertical tower, defining
the transformation to go straight up.
EXAMPLE 13: Perhaps the finite piece you cut off is [0, 3/4] and you cut it into [0,
1/4], [1/4, 2/4], [2/4, 3/4], and stack them vertically with [0, 1/4] on the bottom, [1/4, 2/4]
above it, and [2/4, 3/4] on top. The picture looks like this:
[Picture: a tower of three rungs, with [0, 1/4] on the bottom, [1/4, 2/4] in the middle, and [2/4, 3/4] on top]
and the transformation goes straight up, taking x to x + 1/4 on the bottom two rungs, and
so far undefined on the top rung. So to define it on part of that top, cut the tower
vertically into columns and stack some of those columns on top of others.
You may add more of the line at this stage, by cutting off from it more pieces
called spacers, and putting them on the top of columns before stacking them. This would
cause the space we are considering to increase in measure. The transformation still goes
straight up, which makes its definition the same as it was on the part already defined.
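For concreteness, here is a tiny Python sketch (ours) of the partial transformation built so far in Example 13, before any further cutting and stacking: it goes straight up the tower and is still undefined on the top rung.

def T_example13(x):
    # Tower from Example 13: rungs [0, 1/4), [1/4, 2/4), [2/4, 3/4).
    # Going straight up means adding 1/4 on the bottom two rungs;
    # the map is not yet defined on the top rung.
    if 0 <= x < 0.5:
        return x + 0.25
    return None   # top rung [2/4, 3/4): defined only at a later stage

print(T_example13(0.1), T_example13(0.3), T_example13(0.6))   # 0.35 0.55 None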
EXAMPLE 14: Take the previous stack and split it into 5 equal sized columns. Arrange
these columns into two stacks. Place the second column on top of the first to form the
first stack, and then place the fourth column on the third, and the fifth column on the
fourth, to form the second stack. Then place two spacers between the third and the fourth
columns.
[Picture: the five columns of width 1/20, arranged into the two stacks described above, with two spacers inserted between the third and fourth columns]
Continue this process of cutting and stacking infinitely many times, making sure that
your transformation ends up being defined almost everywhere. If you choose to keep
adding spacers as you continue, the total measure of the space will keep increasing. You
are required to make sure that in the end the whole space we are considering is finite in
measure. Hence you cannot add arbitrarily many spacers at each stage. Then normalize
the measure to be measure 1.
ergodic
DEFINITION 15: A transformation is called ergodic if there is no set S, of measure
strictly between 0 and 1, such that T-1(S) = S.
COMMENT 16: A typical example of a nonergodic process is to take two coins, one fair
and one unfair, and randomly pick one of them, each with probability 1/2. Once you have
picked your coin, use it to produce a doubly infinite sequence of heads and tails. This
gives a stationary process, and hence, by above, gives a measure preserving
transformation. It is nonergodic, because S can be chosen to be the event that the
frequency of heads approaches 1/2. This is a stationary process, but an unnatural one,
since it is obviously a convex combination of two more natural processes. We consider
nonergodic processes to be unnatural, and usually ignore them.
COMMENT 17: You notice that the definition of ergodic refers to T^-1(S) rather than T(S),
and we will frequently prefer to write T^-i rather than T^i. This is because
T^i(ω) ∈ S iff ω ∈ T^-i(S).
This is important when T is not one to one because in that case it is not always true that
T(S) has the same measure as S.
DEFINITION 18: A stationary process …X-2, X-1, X0, X1, X2… is called an independent
process if all the Xi's are independent of each other
(e.g. P(X0 = a, X1 = b, X2 = c) = P(X0 = a)P(X1 = b)P(X2 = c)).
THEOREM 19: Independent processes are ergodic.
idea of proof: Approximate a set S by a cylinder set. If T-1(S) = S, then T-n of that
cylinder set also approximates S for any n. By choosing two far apart values for n, S can
be approximated by two completely independent sets.//
EXERCISE 20: Prove that a mixing Markov chain is also ergodic.
COMMENT 21: If you don’t know what a Markov chain is, this book has a section
introducing Markov chains to you.
homomorphism
COMMENT 22: In Ergodic theory, sets of measure zero don't count. Lebesgue set and
Borel set mean the same thing. When we speak of the whole σ-algebra, it does not matter
whether you regard us to be talking about the Lebesgue sets or the Borel sets.
DEFINITION 23: Let (Ω, A, μ, T) be the space, σ-algebra, measure, and transformation
respectively and let (Ω′, A′, μ′, T′) be another such four-tuple. A function
f: (Ω, A, μ, T) → (Ω′, A′, μ′, T′) is called measurable if
f^-1(S) ∈ A for all S ∈ A′.
f is called measure preserving if
μ(f^-1(S)) = μ′(S) for all S ∈ A′.
If f is measurable and measure preserving, and if T′(f(ω)) = f(T(ω)) for all ω, then f is
a homomorphism.
A one to one homomorphism is called an isomorphism.
DEFINITION 24: A homomorphic image of a four-tuple (Ω, A, μ, T) (or process) is called
a factor of that four-tuple (or process).
An isomorphic image of a four-tuple (Ω, A, μ, T) (or process) is called isomorphic to that
four-tuple (or process).
Increasing Partitions
THEOREM 25: Assume the whole space to be a Lebesgue space. Let P1,P2,... be an
increasing collection of finite partitions which separate any two points. Then every set in
the -algebra can be approximated by the union of some pieces of Pi for some i, (i.e. the
symmetric difference of the set and its approximation can have arbitrarily small measure
by making i sufficiently large).
Idea of proof: The definition of Lebesgue space does not include topology, but
since every Lebesgue space is essentially the unit interval, we can add the topology of the
unit interval to the space. Let A be your chosen set. Every set can be approximated by an
open set containing it and a closed set contained in it. Fix ε > 0. Show that every set in Pi
can be approximated by a closed set from within so that
a) For all n, the union of the closed approximating sets has measure > 1 − ε.
b) If we choose a sequence
A1 ∈ P1, A2 ∈ P2, A3 ∈ P3, …
such that
A1 ⊇ A2 ⊇ A3 ⊇ …
and if C1, C2, C3, … are the closed approximations for A1, A2, A3, … then
C1 ⊇ C2 ⊇ C3 ⊇ … .
The assumption that we separate points means that every point is the intersection of
precisely one decreasing sequence of sets from P1, from P2, from P3 etc. Replace A and
its complement with approximating open sets containing them. Fix a point and replace
the decreasing sequence of sets with approximating closed sets. Then eventually one of
these closed sets is entirely inside our replacement for A, or inside our replacement for
the complement of A. Thus, for sufficiently large i, most (I mean “most” in measure, not
necessarily in number.) of the approximating closed sets in Pi are in the approximation
for A or in the approximation for the complement of A. //
COROLLARY 26: Notation as in Theorem 25. P1, P2, ... generate the whole σ-algebra.
THEOREM 27: If T is one to one measure preserving then T(A) is measurable for every
measurable set A.
Idea of proof: Since we are talking about a Lebesgue space it is easy to exhibit partitions
P1, P2, ... as in Theorem 25. It easily follows that T^-1(P1), T^-1(P2), ... also separate points.
Apply Theorem 25 to get a set B approximating A such that T(B) is measurable. Conclude
by taking limits.
COROLLARY 28: If T is a one to one measure preserving transformation from a space to itself,
μ(T(A)) = μ(A) for all measurable A. If T is an isomorphism from one space to another,
T^-1 is also an isomorphism.
DEFINITION 29: Let P be a partition, and let T be a measure preserving
transformation. The P,T process is defined to be the stationary process
...X-3,X-2,X-1,X0,X1,X2,X3...,
whose alphabet is the set of pieces of P, such that Xi takes a point ω to the value p if p is the
member of P containing T^i(ω).
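Here is a small Python sketch (ours, with an assumed example) of Definition 29: T is rotation of [0,1) by an irrational α, which preserves Lebesgue measure, and P is the two-piece partition {[0,1/2), [1/2,1)}; the function prints a finite window of the P,T name of a point ω.

import math

ALPHA = math.sqrt(2) - 1              # irrational rotation amount (assumed example)

def T(x):
    return (x + ALPHA) % 1.0          # rotation of the unit interval; preserves Lebesgue measure

def P(x):
    return 'L' if x < 0.5 else 'R'    # the piece of the partition containing x

def pt_name(omega, n):
    # X_0, X_1, ..., X_{n-1} of the P,T process started at omega
    letters = []
    for _ in range(n):
        letters.append(P(omega))
        omega = T(omega)
    return ''.join(letters)

print(pt_name(0.1, 20))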
DEFINITION 30: If the P,T process separates points (i.e. for every ω1 ≠ ω2 there exists
i with Xi(ω1) ≠ Xi(ω2)), we say that P generates T, or that P is a generator for T.
COROLLARY 31: P generates T iff the P,T process generates the whole σ-algebra.
“If” is obvious for a Lebesgue space. “Only if” follows from Corollary 26.//
THEOREM 32: If P generates T, and Q generates T, then the P,T process, Q,T process,
and T are all isomorphic.
Idea of proof: Your isomorphism is the map that takes
the P name of ω to
the Q name of ω to
ω.
It is obvious that this is a bijection. Measurability in both directions follows from
Theorem 25.//
Measurable Zorn’s lemma
Start with any probability space. Order a collection of measurable sets by inclusion.
Then if each countable chain has an upper bound, the collection has a maximal element,
up to measure zero.
Idea of proof: Pick a chain, where (the measure of Si+1) - (the measure of Si) is always
more than half of what it could be, and then take an upper bound. This proof does not
make use of the axiom of choice.//
Rohlin tower theorem
THEOREM 33:
If T is ergodic, then for all N and ε > 0, there is a set S such that S, T(S),
T^2(S), ..., T^N(S) are disjoint and cover all but ε of the space. This is also true if T is
nonergodic, if you presume that with probability 1, T^i(x) ≠ x, for any i, x ∈ Ω.
Idea of proof: In the ergodic case, start with a tiny set c, and let S be the set of all points
T^n(p) such that p ∈ c, n is a nonnegative multiple of N, and
{T^1(p), T^2(p), T^3(p), ..., T^(n+N)(p)} is entirely outside of c. Then the union of S, T(S), T^2(S),
..., T^N(S) will disjointly cover all of the space except
c ∪ T^-1(c) ∪ T^-2(c) ∪ ... ∪ T^-N(c),
because by ergodicity, c and all of its translates cover the whole space.
In the nonergodic case, since the space is Lebesgue, we can regard it to be the unit
interval, and endow that interval with its usual metric. Let M >> N. Use the same proof as
above, letting (by measurable Zorn) c be a maximal set such that for every x in c,
T(x), T^2(x), ..., T^M(x) are all outside c. The only problem is to show that c and all of its
translates cover the whole space. If not, note that by our non-periodicity assumption, for
ε small enough, there is a positive probability that T(x), T^2(x), ..., T^M(x) are all more
than ε away from x, and x is not in the union of c and its translates. There must be some
interval of diameter ε which intersects that event on a set of positive measure, and that
intersection is a set of positive measure outside the translates of c, such that for any x in
that set, T(x), T^2(x), ..., T^M(x) are all outside of that set. Add that set to c to contradict
maximality. //
DEFINITION 34: If S, T(S),T2(S),... TN(S) are disjoint sets, they form a Rohlin tower of
height N. When we speak of a Rohlin tower, we presume that the union of S,
T(S),T2(S),... TN(S) has measure almost 1.
DEFINITION 35: If S, T(S),T2(S),... TN(S) is a Rohlin tower, the complement of
S T(S)  T2(S)  ... TN(S) is called the error set.
COMMENT 36: In all arguments, always assume the error set is sufficiently negligible in
measure.
EXERCISE 37: Prove that the error set can have size exactly ε for any ε > 0.
DEFINITION 38: If S, T(S), T^2(S), ..., T^N(S) is a Rohlin tower, each T^i(S), 0 ≤ i ≤ N, is
called a rung of the tower, and S is called the first rung of the tower. S is also called the base
of the tower.
DEFINITION 39: Let S, T(S),T2(S),... TN(S) be a Rohlin tower, and P be a
partition of the space. Define two points in the base of the tower to be equivalent, if they
have the same P name of length N. It is helpful to regard an equivalence class in the base
of the tower, to be an interval. A column, or P,T column (for a given tower and partition)
is defined to be the union of E, T(E),T2(E),... TN(E), where E is an equivalence class in
the base of the tower.
DEFINITION 40: The intersection of any rung of the tower, with a P,T column of the
tower, is called a rung of that column.
DEFINITION 41: When P and T are understood, the n name of a point ω is the sequence
of sets of P containing ω, T(ω), T^2(ω), …, T^(n-1)(ω), in that order.
COMMENT 42: A P,T rung of a P,T column of a tower, is entirely inside one piece of
P. Indeed, the n name of any point in the base of a column of height n is precisely the
pieces of P containing the base rung, second rung, third rung, etc. of the column
containing that point, in that order.
THEOREM 43: Let P be a partition of the space, and N be arbitrary. Then there is a
Rohlin tower S, T(S),T2(S),... TN(S), such that S and P are independent.
Idea of proof: Just take a Rohlin tower, height much bigger than N, do the following (see
picture) to every P,T column, and let the union of the shaded areas of all the columns be
the base (here we are letting N = 4).
This will give you your desired Rohlin tower, except that the base will be almost
independent of P, but not quite. To get perfection, shave off a small fraction of the base,
and the column that small fraction defines, and put it into the error set.//
COMMENT 44: If P,T is a process, and you pick an n tower whose base is
independent of the partition of n names, then you get the beautiful situation where the
distribution of names defined by the columns is precisely the distribution of all n names.
superimposition of partitions
COMMENT 45: It is possible to superimpose countably many finite partitions and get an
uncountable partition.
THEOREM 46: Let Si be a sequence of sets whose measures are summable. Let Pi be
finite partitions of the whole space, such that one of the pieces of Pi is the complement of
Si. Then the superimposition of the Pi forms a countable partition (As always, ignore sets
of measure 0.).
Idea of proof: Let Ai be the union, j > i, of Sj. Then the intersection of the Ai has measure
zero, and it is easily seen that the superimposition partition partitions Ai - Ai-1 into only
finitely many pieces for all i.//
Countable Generator Theorem
THEOREM 47:
Every measure preserving transformation on a Lebesgue space is
isomorphic to a stationary process with a countable alphabet.
Idea of proof: All we need is a countable partition of the space so that the map which
takes points to the name of points with respect to that partition is an isomorphism, i.e.
such that the point to name map is one to one. To do this, create Rohlin towers with bases
S1,S2,...so that the measures of Si are summable, and insignificantly small error sets.
Assume the heights of these towers are N1,N2,... respectively. Let Pi be a partition of the
whole space, consisting of the complement of Si together with the following partition of
Si. Two points x and y are in the same atom of that partition iff
for k ∈ {1, 2, ..., Ni}, T^k(x) and T^k(y), when thought of as being real numbers with binary
expansion, have the same first Ni digits.
The reader should verify that the superimposition of the Pi does the trick (the error sets
can be obnoxious but Borel Cantelli can be used to ignore all but finitely many of
them).//
Birkhoff ergodic theorem
THEOREM 48: Let T be a measure preserving transformation on Ω. Let f be an
integrable function on Ω. Then (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) converges as n → ∞, for almost every ω.
Idea of proof: Fix b > a. Consider points ω where
lim sup (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) > b, and
lim inf (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) < a.
Let B be chosen large enough that we can write f = f1 + f2, where |f1| < B (i.e. f1 is
bounded between −B and B) and the integral of |f2| is small.
Now pick N so big that the set of all points which don't have an ergodic average
for f almost as big as b by time N (to be called bad points) has a probability which is a
small fraction of (b−a)/B. Let M >> N. Show that there is an S ⊆ {ω, T(ω), T^2(ω), ...,
T^M(ω)} such that
i) the average of f over S is not much smaller than b,
ii) only bad points, and points among the last N points, are outside S.
Hint: Express S as a disjoint union of intervals for which the average of f is not much less
than b.
For the points ω1, ω2, ... outside S,
f(ω1) + f(ω2) + ... = [f1(ω1) + f1(ω2) + ...] + [f2(ω1) + f2(ω2) + ...],
where the first bracket is (*) and the second bracket is (**).
(*) is usually a small fraction of (b−a)M, because the terms are bounded above by B, and
the fraction of bad points is usually a small fraction of (b−a)/B.
Note that since T is measure preserving, the integral of f2(T^i(ω)) equals the integral of f2(ω).
(**) is usually a small fraction of M, because it is dominated by
(|f2(ω)| + |f2(T(ω))| + ... + |f2(T^M(ω))|), whose integral is a small fraction of M.
This proves that the average of f over all the points ω, T(ω), ..., T^M(ω) is usually much closer to b
than to a. We get a contradiction by using an exactly analogous argument to establish
that the average is usually much closer to a than to b.//
COMMENT and EXERCISE 49: Too many sets of measure zero ?
The reader might object that there is a probability 0 that things could go wrong for every
a,b, and all these sets of measure zero could add up to a problem. Do you see how to get
around this problem?
Birkhoff ergodic theorem 2
The limit of (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω)))
is the integral of f, for almost all ω.
In the ergodic case:
Idea of proof: The limit is a constant by ergodicity. Define f1 and f2 as above.
(1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) =
(1/n)(f1(ω) + f1(T(ω)) + ... + f1(T^n(ω))) + (1/n)(f2(ω) + f2(T(ω)) + ... + f2(T^n(ω))).
All these sequences converge by Theorem 48. The first sum converges to the integral
of f1 by the bounded convergence theorem, and the second term converges to something
small by Fatou's lemma.//
COMMENT 50: It is important for the reader to focus on the case where f is an indicator
function. The above then says that the frequency of time the stationary process hits a
certain set is the probability of that set.
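To see Comment 50 numerically, here is a short Python experiment (ours), using the assumption that T is an irrational rotation of [0,1), which is ergodic for Lebesgue measure: the fraction of time the orbit spends in [0, 1/3) approaches 1/3, the measure of that set.

import math

ALPHA = math.sqrt(2) - 1              # irrational rotation; ergodic for Lebesgue measure

def T(x):
    return (x + ALPHA) % 1.0

def visit_frequency(omega, n):
    # (1/n) #{0 <= i < n : T^i(omega) lies in [0, 1/3)} -- the ergodic average of an indicator
    hits = 0
    for _ in range(n):
        if omega < 1.0 / 3.0:
            hits += 1
        omega = T(omega)
    return hits / n

for n in (100, 10_000, 1_000_000):
    print(n, visit_frequency(0.123, n))   # tends to 1/3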
EXERCISE 51: Show that a stationary process with a countable alphabet is ergodic iff
the frequency of every word is constant.
DEFINITION 52: The invariant -algebra, is the collection of all sets A such that T-1(A)
= A.
In the nonergodic case:
Definition of conditional expectation will soon be given. The limit is
the conditional expectation of f with respect to the invariant σ-algebra.
Idea of proof:
It is easy to see that the limit is measurable with respect to the invariant σ-algebra, and for any A
in that σ-algebra, the integral over A of the limit = the integral over A of f, by the exact same
proof as above. When you learn what conditional expectation is you will understand that
this is all you need to show.//
EXERCISE 53: Use the Birkhoff ergodic theorem to prove the strong law of large
numbers, which states that
1/n(X0+X1+X2+X3+...Xn)
converges to the integral of X1, when
...X-3,X-2,X-1,X0,X1,X2,X3... are independent identically distributed random variables.
measure from a monkey sequence
Let a monkey who lives forever type any sequence of 0's and 1's it wants to,
ω1, ω2, ω3, ....
Obtain a stationary measure as follows. Select increasing integers mi such that the
frequencies of cylinders of length 1 in the following sequence of finite words converge:
ω1, ω2, ω3, ..., ωm1,
ω1, ω2, ω3, ..., ωm2,
…
Then take a subsequence ni of the mi such that the frequencies of
cylinders of length 2 in the following sequence of finite words converge:
ω1, ω2, ω3, ..., ωn1,
ω1, ω2, ω3, ..., ωn2,
…
Continue the pattern, to get measures for all cylinder sets.
DEFINITION 54: The above method is called the Monkey method for extracting a
stationary process from a sequence from a finite alphabet.
EXERCISE 55: Prove the limiting measure to be stationary.
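Here is a minimal Python sketch (ours) of the quantities behind Definition 54: the empirical frequencies of the cylinders (words) of length k inside the prefix ω1…ωm. The monkey method then chooses nested subsequences of prefix lengths along which these frequencies converge, first for k = 1, then for k = 2, and so on; the limits are the cylinder measures.

from collections import Counter

def cylinder_frequencies(prefix, k):
    # empirical frequency of each word of length k occurring inside the finite word `prefix`
    counts = Counter(tuple(prefix[i:i + k]) for i in range(len(prefix) - k + 1))
    total = len(prefix) - k + 1
    return {w: c / total for w, c in counts.items()}

# An arbitrary "monkey" sequence of 0's and 1's (any sequence would do):
monkey = [(i * i) % 3 % 2 for i in range(10_000)]

print(cylinder_frequencies(monkey[:100], 1))
print(cylinder_frequencies(monkey[:100], 2))
# The stationary measure of a cylinder is the limit of these frequencies along the
# chosen subsequence of prefix lengths.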
THEOREM 56: Stationary measures are convex combinations of ergodic measures.
Stationary is convex combination of ergodic (proof 1)
Idea of proof:
If the above point came out of a stationary process, rather than
monkey, then taking subsequences of subsequences is unnecessary because everything
already converges. The limit that the point uniquely defines will be an ergodic measure
with probability 1. (Warning: ergodicity is not so easy to prove.) Letting f(ω) be the
measure that ω gives rise to, we have that the integral of f(ω) with respect to μ gives
back the original stationary measure.//
conditional expectation
DEFINITION 57:
Let f be a function measurable with respect to the whole σ-algebra and let s be a sub-σ-algebra. Then the Radon–Nikodym theorem guarantees the
existence of a function g, called the conditional expectation of f with respect to s, denoted
E(f|s)(ω), which is measurable with respect to s, and such that for all A in s, the integral
of f over A is the integral of g over A. It is only unique up to a set of measure zero.
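Since Definition 57 can be hard to picture, here is a Python sketch (ours, not from the book) of the simplest special case: s is the σ-algebra generated by a finite partition of [0,1) with Lebesgue measure, so E(f|s) is constant on each piece and equals the average of f over that piece; its integral over any piece then agrees with the integral of f.

def conditional_expectation(f, pieces, x, steps=10_000):
    # E(f|s)(x), where s is generated by the finite partition `pieces` of [0,1).
    # The value at x is the average of f over the piece containing x,
    # computed here by a midpoint Riemann sum.
    for a, b in pieces:
        if a <= x < b:
            total = sum(f(a + (b - a) * (i + 0.5) / steps) for i in range(steps))
            return total / steps
    raise ValueError("x is not covered by the partition")

pieces = [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]
f = lambda x: x * x
print(conditional_expectation(f, pieces, 0.1))   # average of x^2 over [0, 0.25)
print(conditional_expectation(f, pieces, 0.7))   # average of x^2 over [0.5, 1.0)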
DEFINITION 58: A definite choice for E(f|s)(ω) for all ω is called a version for E(f|s). If
g is a given version of E(f|s), then h is another version for E(f|s) iff g and h differ only on a set
of measure zero.
DEFINITION 59: For random variables X, Y, Z or more, the conditional expectation of f with respect
to the σ-algebra generated by X, Y and Z is denoted E(f|X,Y,Z).
COMMENT 60:
For any f1, f2, f3 ... whose sum converges in a dominated way,
E(f1 + f2 + f3 + ... |s)(ω) = E(f1|s)(ω) + E(f2|s)(ω) + E(f3|s)(ω) + ...,
except on a set of ω's of measure 0 (for any σ-algebra s).
conditional expectation of a measure
We want the map A ↦ (the conditional expectation of the indicator function of A
with respect to s) to be a measure for almost every ω. The trouble is that forcing
additivity causes too many bad sets of measure zero, another bad set for every special
case of
(*) Σi P(Ai|s) = P(∪(i = 1 to ∞) Ai|s).
To get around this, just restrict to cylinder sets and to equations of type (*) with only
finitely many terms, and then use the Caratheodory extension theorem on each ω to extend to a
measure. The result is a function which is jointly measurable in A and ω, which is a
measure when ω is held fixed, and a version of E(1A|s) when A is held fixed. The original
measure of A is the integral of the conditional measure of A.
DEFINITION 61: Let μ be a probability measure on a Lebesgue space, and let s be a σ-algebra of that space. Then the conditional expectation of μ with respect to s, denoted
E(μ|s)(A,ω), is a function which is jointly measurable in A and ω, such that fixing the set A,
you get a version of the conditional expectation of the indicator of A, and fixing ω, you
get a probability measure.
Stationary is convex combination of ergodic measures (proof 2)
Idea of proof: Let s be the invariant σ-algebra. For each ω, where ω is a doubly infinite
word of the stationary process, E(μ|s)(ω) is a measure if ω is held fixed. The definition
of conditional expectation implies that μ is the integral of E(μ|s)(ω), with ω running in
accordance with the stationary measure. This expresses μ as a convex combination of the
measures E(μ|s)(ω). All that remains is to show that for almost all ω, E(μ|s)(ω) is
ergodic. It should be ergodic because it assigns to any set A the value E(1A|s)(ω), and in the case
where A is in the invariant sigma algebra,
E(1A|s)(ω) = 1A(ω), which is either 1 or 0.
However, the above equation is allowed to fail on a set of measure 0.
Once again we have to contend with too many bad sets of measure 0, because you could
get a bad set of measure 0 for every A. This can be handled, but is not easy.//
Stationary is convex combination of ergodic measures (proof 3)
Idea of proof: There is still another proof that stationary is convex combination of
ergodic. Just apply the Krein Milman theorem, which says that in certain spaces, such as
the space of measures that we look at, every point in any convex set is a convex
combination of its extreme points. //
EXERCISE 62: Show that a measure in the space of stationary processes is ergodic iff it
is an extreme point of that space.
Subsequential limit
DEFINITION 63: For each i, let μi be a measure on words of length i, not necessarily
stationary. Take a subsequence whose measure on the first coordinate converges. Take a
subsequence of that subsequence whose measure on the first two coordinates converges.
Take a subsequence of that subsequence whose measures on the first three coordinates
converge. Continue. The limiting measure on words is a subsequential limit of the μi (if
we want, we can just assume the μi's to have increasing length or infinite length rather
than length i).
COMMENT 64: The subsequential limit need not be stationary if the approximating
measures are not.
Extending the monkey method
We already exhibited a way to extract a stationary process from a sequence of
letters. You can also extract a stationary process from a sequence of measures on finite
sequences with increasing length or from a sequence of measures on infinite sequences
(A measure on infinite sequences is called a stochastic process.)
Let 1,2,3 … be measures on words of finite but increasing lengths. Let those
lengths be L1, L2, L3, …respectively. By passing to a subsequence if necessary, you can
assume Ln > 100n for all n. What is important here is that Ln >> 2n so that the subwords
of length n occur sufficiently often. For each n, derive from n a nearly stationary
measure n on words of length n defined by
Ln-n
n (a1, a2, a3… an) = (1/( Ln-n+1))n ( xi+1 = a1, xi+2= a2,… xi+n= an),
i=0
and then take a subsequential limit of the n. In the case where 1,2 … are stochastic
processes, just truncate them in order to reduce to the case above. In either case
DEFINITION 65: A subsequential limit of the n above is called a stationary process
obtained by the monkey method.
EXERCISE 66: Prove it to be stationary.
COMMENT 67: Comparison of subsequential limit and monkey method.
Subsequential limit has the advantage of maintaining the local behavior. If you
use the monkey method, the limit measure on the first 10 coordinates obtained from the
approximating measures may have absolutely nothing to do with the measure on the first
10 coordinates of the approximating measures.
The monkey method has the advantage that it gives a stationary measure.
Subsequential limits might not be stationary.
However, if the approximating measures are stationary, then even a subsequential
limit is stationary, and it is pointless to use the monkey method.
Martingales
DEFINITION 68: A sequence of random variables Xn such that
E(Xn| Xn-1, Xn-2... ) = Xn-1 for every n>1.
(You are gambling and for every time n-1 you don’t care if you go home or play another
round.) is called a Martingale.
EXAMPLE 69: You flip a fair coin over and over, getting a dollar for every heads, and
losing a dollar for every tails. Xn is the amount of money you have at time n.
X0, X1, X2… is a martingale.
Stopping times
DEFINITION 70: A stopping time is a rule for stopping that depends only on information
that you know by the time you stop.
EXAMPLE 71: In the above example, “Stop as soon as you are ahead.” is a stopping time
that makes money.
COMMENT 72: The stopping time, “Stop at whichever comes first: you are ahead, or time
1000,” does not make money (in expectation), because you might just as well have used
999. If you reach time 999, it does not help to play another turn.
DEFINITION 73: a^b means the minimum of a and b.
Bounded time theorem: Using the reasoning of Comment 72, for any martingale, XT^n
never makes money for any stopping time T and any positive integer n. More precisely,
E(XT^n|X0) = X0.
Bounded money theorem: In a bounded martingale, stopping times never make money,
i.e. E(XT|X0) = E(X0).
Proof: E(XT^n) converges to E(XT) by the bounded convergence theorem.//
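Here is a short simulation (ours) of the bounded time theorem for the fair-coin martingale of Example 69, with the stopping time "stop as soon as you are ahead": even though the untruncated stopping time makes a dollar, the truncated variable XT^n averages out to X0 = 0.

import random

def x_T_min_n(n):
    # Run the fair +/-1 martingale, stopping when first ahead or at time n; return the fortune.
    x = 0
    for _ in range(n):
        if x >= 1:               # the stopping time T: "stop as soon as you are ahead"
            break
        x += random.choice((1, -1))
    return x                     # this is X_{T ^ n}

trials = 100_000
average = sum(x_T_min_n(1000) for _ in range(trials)) / trials
print(average)                   # close to X_0 = 0, as the bounded time theorem predicts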
EXAMPLES 74:
1) Unbiased random walk is recurrent.
Idea of proof: Take a probability 1/2 one-step backward, probability 1/2 one step
forward random walk. Suppose the amount of money you have at time n is m, where m is
the integer you are sitting on at time n. Then your fortune at time n is a martingale. Start
at 1. Stop when you hit either 0 or 1,000,000. Letting p be the probability that you go to 0
before going to 1,000,000, the bounded money theorem gives 1 = p·0 + (1−p)·1,000,000.//
EXERCISE 75: Prove that with probability 1, you will not spend forever without hitting
either 0 or 1,000,000.
2) Biased random walk is transient.
Idea of proof: Take a probability 2/3 one-step forward, probability 1/3 one step
backward random walk. Fortune at time n is 1/2^m if you are sitting on m at time n. This
gives a martingale. Start at 1. Letting p be the probability that you go to 0 before going to
1,000,000, the bounded money theorem gives
1/2 = p·1 + (1−p)/2^1,000,000.//
3) Solution to the Dirichlet problem.
Suppose h is a harmonic function and X(t) is Brownian motion. Since h(x) is the average
of h around any circle centered at x, it is easy to see that h(X(t)) is a continuous time
martingale (see if you can guess what a continuous time martingale is). If you want h(x),
and you know h on some curve about x, just start a Brownian path at x and stop
when you hit the curve (stopping time T). h(x) = E(h(X(T))) by the bounded money
theorem.//
martingale convergence
THEOREM 76: A bounded martingale converges.
Idea of proof: Let us make the absurd assumption that the stock market is a bounded
martingale (say bounded by 0 and 1). Let a < b. You keep selling when the stock market
is higher than b, and buying when it is less than a. To put some limit on the situation,
assume you sell any stock you may have at some very late date. You could get very rich.
But this is impossible with high probability, because you cannot lose more than one
dollar, and your expected net gain has to be 0. Therefore the martingale only crosses a,b
finitely many times for any a,b, and since this is true for every rational a and b, it
converges. Careful examination of this argument gives an upper bound for the probability
for crossing more than n times.//
COMMENT 77: A backward martingale also converges.
Exactly the same argument works.
DEFINITION 78: Let ...X-3,X-2,X-1,X0,X1,X2,X3... be a stationary process. Then the σ-algebra generated by X-1,X-2,X-3,... is called the past.
EXAMPLE 79: Probability of present given past.
Let ...X-3,X-2,X-1,X0,X1,X2,X3... be a stationary process. We
frequently talk about P(X0 = 0|past). Naively you would define this to be
P((X0 = 0) ∩ past)/P(past). However, this is ridiculous, because the numerator and
denominator of that fraction are both 0. Here is the careful way to define P(X0= 0|past).
It turns out that P(X0 = 0|X-1), P(X0 = 0|X-1,X-2), P(X0 = 0|X-1,X-2,X-3)...
forms a martingale, which converges almost everywhere by the martingale convergence
theorem. Its limit is our desired P(X0= 0|past). Alternatively, just use probability given
sub -algebra (definition 57). //
Coupling
DEFINITION 80: A coupling of two measures is a measure on the product space with
the two measures as marginals (i.e. the projection on the X-axis and the projection on the
Y-axis.)
EXAMPLES:
Let μ be the measure on {0,1} with μ(0) = 1/3, μ(1) = 2/3.
Let ν be the measure on {0,1} with ν(0) = 1/4, ν(1) = 3/4.
Coupling EXAMPLE 1) Independent coupling of μ and ν:
P(0,0) = 1/12, P(0,1) = 1/4, P(1,0) = 1/6, P(1,1) = 1/2.
COMMENT 81: You can couple any two measures together, because if you can't find
any other way to do it you can always use independent coupling, (i.e. product measure).
Coupling EXAMPLE 2) Coupling of μ and ν to maximize the probability that the two
coordinates are equal:
P(0,0) = 1/4, P(0,1) = 1/12, P(1,0) = 0, P(1,1) = 2/3.
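Coupling Example 2 is an instance of the general "maximal coupling" recipe: put mass min(μ(x), ν(x)) on the diagonal and pair the leftover μ-mass with the leftover ν-mass off the diagonal. A Python sketch (ours) that reproduces the table above:

from fractions import Fraction as F

mu = {0: F(1, 3), 1: F(2, 3)}
nu = {0: F(1, 4), 1: F(3, 4)}

# As much mass as possible on the diagonal ...
coupling = {(x, x): min(mu[x], nu[x]) for x in mu}
# ... then pair the leftover mu-mass with the leftover nu-mass off the diagonal.
mu_left = {x: mu[x] - coupling[(x, x)] for x in mu}
nu_left = {y: nu[y] - coupling[(y, y)] for y in nu}
for x in mu_left:
    for y in nu_left:
        if x != y and mu_left[x] > 0 and nu_left[y] > 0:
            m = min(mu_left[x], nu_left[y])
            coupling[(x, y)] = m
            mu_left[x] -= m
            nu_left[y] -= m

print(coupling)   # (0,0): 1/4, (1,1): 2/3, (0,1): 1/12, and nothing on (1,0), as in the text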
Coupling EXAMPLE 3) Coupling by induction:
Let X1,X2,X3...and Y1,Y2,Y3...be two distinct sequences of random variables. Here
is an inductive technique for coupling the two processes together. Indeed, it is a general
technique in that you can achieve any possible coupling this way.
Assume you have already coupled X1,X2,X3...Xn with Y1,Y2,Y3...Yn, and you want
to couple X1,X2,X3...Xn+1 with Y1,Y2,Y3...Yn+1. Pick a conditional probability law for
Xn+1 given X1,X2,X3...Xn and Y1,Y2,Y3...Yn, such that when you integrate that
probability law over all Y1,Y2,Y3...Yn, you get the conditional probability of Xn+1 given
X1,X2,X3...Xn. Similarly, pick a conditional probability law for Yn+1 given Y1,Y2,Y3...Yn
and X1,X2,X3...Xn, such that when you integrate that probability law over all
X1,X2,X3...Xn, you get the conditional probability of Yn+1 given Y1,Y2,Y3...Yn. Now
simply couple the conditional Xn+1 with the conditional Yn+1.
Coupling EXAMPLE 4) A special case of (3):
Assume that Xn+1 is independent of Y1,Y2,Y3...Yn, and that Yn+1 is independent of
X1,X2,X3...Xn. This means that you are simply coupling the conditional probability of
Xn+1 given X1,X2,X3...Xn, with the conditional probability of Yn+1 given Y1,Y2,Y3...Yn.
If you can’t think of any other way of coupling these two conditional probability
laws, you can always use product measure.
Coupling EXAMPLE 5) A special case of (4):
Suppose X1,X2,X3...Xn is {0,1} valued, and you wish to couple it with the independent
process on {0,1} which assigns probability 1/2 to each. You would like this coupling to
live on pairs such that the two words tend to agree on many coordinates. Just, for each n,
couple Xn+1 conditioned on X1,X2,X3...Xn with the measure obtained by flipping a fair
coin, so that you maximize the probability that the two coordinates are equal.
*Note* This doesn’t necessarily give the best result.
Coupling EXAMPLE 6) Glue together two couplings to form a coupling of three
measures.
Let P1, P2, P3 be three processes. You are given a coupling of P1 and P2 and another
coupling of P2 and P3. Put them together to get a coupling of P1, P2, P3 by first putting
down P2 in accordance with its probability law, then computing the conditional measure
of P1 in the first coupling, given the word of P2 you have put down, computing the
conditional probability of P3 in the second coupling, given the word of P2 you have put
down, and then couple those two conditional measures (perhaps independently).
Coupling EXAMPLE 7) Using coupling, prove that in a random walk, for any set of
integers, the probability that you are in that set at time 1,000,000 starting at 0, is close
to the probability that you are in that set at time 1,000,010 starting at 0.
Idea of proof: Put a measure on pairs of random walks X(n) and Y(n), such that the two
walks walk independently of each other until a time n0, in which X(n0+10) = Y(n0) (The
existence of n0 is guaranteed by recurrence of random walk). Then for all n > n0, we
inductively continue the coupling so that X(n+10) = Y(n). n0 will probably be less than
1,000,000.//
DEFINITION 82: Let X be a random variable. The σ-algebra generated by X is the σ-algebra generated by X^-1((−∞, a]) for all real a.
DEFINITION 83: If ...X-3,X-2,X-1,X0,X1,X2,X3... is a process, the n future is the σ-algebra generated by the union of the σ-algebras generated by Xn, by Xn+1, by Xn+2, etc.
DEFINITION 84: If ...X-3,X-2,X-1,X0,X1,X2,X3... is a process, the tailfield is the
intersection, over all n, of the n future.
DEFINITION 85: Two specific outputs of a process a1a2a3... and b1b2b3... are said to
have the same tail if there is an N such that for all n > N,
an = bn.
THEOREM 86: A measurable set is in the tailfield iff, (up to a set of measure 0), the set
is a union of equivalence classes, where two sequences are equivalent iff they have the
same tail.
Idea of proof: Otherwise you would be able to get, in the tailfield, a set of positive
measure, where each name has an equivalent name which is not in the set. From that, get
an n, and a set of positive measure in the tailfield, where for each name in the set, there is
a name which is not in the set which agrees with it after time n. That set would not be in
the n future.//
DEFINITION 87: A process is said to have trivial tail if every set in the tailfield has
measure either 0 or 1.
THEOREM 88: The Kolmogorov 0,1 law:
A process ...X-3,X-2,X-1,X0,X1,X2,X3... in which all Xi are independent of each other, has
trivial tail.
Coupling EXAMPLE 8) Condition an independent process on a specific cylinder set.
Now consider the unconditioned independent process. Couple the conditioned process
with the unconditioned process so that the tail of the coupled pair is identical and use this
coupling to prove the Kolmogorov 0,1 law.
mean hamming metric
DEFINITION 89: The mean Hamming distance between two words of the same length is
the fraction of the letters which are different e.g. d(slashed, plaster) = 3/7.
EXERCISE 90: Prove the triangle inequality for the mean Hamming distance.
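A two-line Python version (ours) of Definition 89, checked against the example in the text:

def mean_hamming(u, v):
    # fraction of coordinates on which two words of the same length differ
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v)) / len(u)

print(mean_hamming("slashed", "plaster"))   # 3/7, as in Definition 89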
dbar metric
DEFINITION 91: Let  be the set of words of a given finite length from a given
alphabet. The dbar distance between two measures on , is the infimum expected mean
hamming distance between coupled words, taken over all couplings of the two measures.
In the case of two stationary processes, the dbar distance between them is the limit of the
dbar distance of finite chunks. This can be shown to approach a limit. Furthermore, this
limit can be achieved with a stationary coupling of the whole spaces by using the
extension of the “monkey” method mentioned earlier (We take finite possibly
nonstationary couplings and use the monkey method to obtain an infinite stationary
coupling).
If the processes involved are themselves not stationary we use the same definition
for dbar distance, except instead of limit we use lim sup. In that case, a stationary
coupling of the whole space is obviously impossible.
EXERCISE 92: Show that the word infimum in the previous definition can be replaced
with the word minimum.
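For measures on a tiny set of words, the dbar distance of Definition 91 is just a transportation linear program whose cost is the mean Hamming distance. The Python sketch below (my own illustration, using scipy; it brute-forces over all pairs of words, so it is only for toy examples) makes the definition concrete.

import numpy as np
from scipy.optimize import linprog

def dbar_distance(p, q, words):
    # Minimize the expected mean Hamming distance over all couplings of the
    # measures p and q (dicts mapping each word in `words` to its probability).
    n = len(words)
    cost = np.array([sum(a != b for a, b in zip(u, v)) / len(u)
                     for u in words for v in words], dtype=float)
    A_eq, b_eq = [], []
    for i in range(n):                       # row marginals must equal p
        row = np.zeros(n * n); row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row); b_eq.append(p.get(words[i], 0.0))
    for j in range(n):                       # column marginals must equal q
        col = np.zeros(n * n); col[j::n] = 1.0
        A_eq.append(col); b_eq.append(q.get(words[j], 0.0))
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
    return res.fun

words = ["00", "01", "10", "11"]
p = {"00": 0.5, "11": 0.5}
q = {w: 0.25 for w in words}
print(dbar_distance(p, q, words))            # 0.25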
Variation metric
DEFINITION 93: Let P and Q be two partitions with the same number of sets such that
the ith set of P corresponds to the ith set of Q for all i. The minimum over all couplings of
P and Q, of the measure of {(i,j) in the coupling: i does not correspond to j} is called the
variation distance of P and Q.
If the probabilities of P and Q are p1, p2,… pn and q1, q2,… qn respectively, this amounts
to (1/2)(|p1 − q1| + |p2 − q2| + … + |pn − qn|).
DEFINITION 94: Let p1, p2,… pn and q1, q2,… qn be two probability measures on a set
{ x1, x2,… xn }. The variation distance between p1, p2,… pn and q1, q2,… qn is
|p1- q1|+|p2- q2|+…|pn- qn|
DEFINITION 95: is a set of words of a finite length from a given alphabet. The
variation distance between two measures on  is the minimum over all couplings of P
and Q, of the measure of {(i,j) in the coupling: i j}.
DEFINITION 96: Let X and Y be real valued random variables possibly with different
domains. The variation distance between X and Y is the minimum of the probability that
X ≠ Y over all couplings of X and Y (although their domains are different, their ranges
are presumed to both be the real numbers, so a coupling is a probability measure on
(the reals × the reals)).
COMMENT 97: Saying that “the variation distance between two things is small” is a
very strong statement. For measures on words of a given length, it is much stronger than
saying the dbar distance is small.
EXERCISE 98: Prove that the minimum is actually achieved.
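One way to do Exercise 98 in the finite case is to exhibit the optimal coupling explicitly: put mass min(pi, qi) on each matched pair and distribute the leftover mass of the two marginals arbitrarily among mismatched pairs, so the mismatch probability is 1 − Σ min(pi, qi). A small Python sketch of that construction (my own illustration, not the book's):

def maximal_coupling(p, q):
    # p, q: lists of probabilities over the same indices.  Returns a matrix
    # c with c[i][j] = coupling mass on (i, j); its off-diagonal mass,
    # 1 - sum(min(p_i, q_i)), is the smallest mismatch any coupling can have.
    n = len(p)
    c = [[0.0] * n for _ in range(n)]
    left_p = [pi - min(pi, qi) for pi, qi in zip(p, q)]   # leftover row mass
    left_q = [qi - min(pi, qi) for pi, qi in zip(p, q)]   # leftover column mass
    for i in range(n):
        c[i][i] = min(p[i], q[i])
    for i in range(n):                       # pair the leftovers greedily
        for j in range(n):
            m = min(left_p[i], left_q[j])
            c[i][j] += m
            left_p[i] -= m
            left_q[j] -= m
    return c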
The theory of Markov chains (presented as a subsection of coupling section)
COMMENT 99: You do not have to understand this section on Markov chains to read
this book because we have decided that it is easier to use the “ε startover process”
(definition 212) than the approximation “n step Markov process” (definition 119) in
proofs. However Markov chains are very important to ergodic theory (see, for example,
exercise 274). Furthermore the proof of the renewal theorem is a very nice application of
coupling. However you can skip everything from definition 100 through theorem 124 if
you wish.
DEFINITION 100: A Possibly Non-stationary Markov Chain (abbreviated pnm) is a
possibly non-stationary probability measure on the space of doubly or singly infinite
words with a finite alphabet (unless we state otherwise, assume the alphabet to be finite)
in which
P(Xn = an | Xn-1 = an-1, Xn-2 = an-2, Xn-3 = an-3 ….) (abbreviated P(an-1,an))
depends only on an-1 and an (i.e. not on n or an-2, an-3,an-4…)
DEFINITION 101: the terms P(an-1, an) are called transition probabilities.
DEFINITION: Let a and b be in the alphabet of a pnm. We say that there is a path from a
to b if there exists a finite sequence in the alphabet, c0, c1, c2…cn such that
P(a,c0), P(c0,c1), P(c1,c2)… P(cn-2,cn-1), P(cn-1,cn), P(cn,b)
are all positive. In this case we say that there is a path of length n+2 from a to b (length 1
if P(a,b) is positive; in that case we are not considering any ci).
DEFINITION 102: Let a and b be in the alphabet of a pnm. We say that a and b
communicate if there is a path from a to b and a path from b to a.
DEFINITION 103: Let “a” be in the alphabet of a pnm. We say that a is transient if there
is no path from a to itself.
DEFINITION 104: Let a be in the alphabet of a pnm. If a is not transient, let
Λ(a) = {n > 0: there is a path of length n from a to itself}.
DEFINITION 105: Let a be in the alphabet of a pnm. If a is not transient,
L(a) is the greatest common divisor of Λ(a).
THEOREM 106: If a is not transient, there must exist n such that Λ(a) contains every
multiple of L(a) greater than n.
Idea of proof: Show that Λ(a) is closed under addition and that every set of positive
integers closed under addition contains every multiple of its greatest common divisor past
a certain point. To prove the latter, just let S be a finite subset of Λ(a) such that L(a) is a
linear combination of S with integer coefficients, let t be the sum of S, and show that for
all sufficiently large k, every multiple of L(a) of the form kt + i with i < t can be written as
a linear combination of S with positive integer coefficients. //
THEOREM 107: If a is not transient and a and b communicate, then b is not transient and
L(a) = L(b).
Proof: Let k be the length of any path connecting a to a. Let m and n be lengths of paths
connecting a to b and b to a respectively. Then there are paths connecting b to b of lengths
m+n and m+k+n, so L(b) divides k; since k was an arbitrary element of Λ(a), L(b) divides
L(a). By the symmetric argument, L(a) divides L(b).//
We now prove the big theorem of Markov chains using coupling.
Coupling EXAMPLE 9:
THEOREM 108: The renewal theorem. Suppose …X-2, X-1, X0, X1, X2… (or X0, X1,
X2…) is a pnm, ε > 0, a, b and c all communicate, and L(a) = 1. Furthermore, assume that
any state which “a” has a path to communicates with “a”. Then for all sufficiently large n
and m, |P(Xm = c| X0 = a) – P(Xn = c| X0 = b)| < ε.
Idea of proof: First we show that
(*) If we let d,e,and f be any states which “a” has a path to and run two processes
independently starting at d and e respectively, that with probability 1 they will eventually
both be simultaneously at state f.
Since they all communicate, and since “a” has positive probability paths to itself at all
sufficiently large times, it follows that a process starting at any state that communicates
with “a” will have positive probability paths to f for all sufficiently large times. In fact
exactly how large “sufficiently” is does not depend on which state we start at because we
can simply maximize over all states which communicate with “a”. Now let I be
sufficiently large and run the two processes starting at any two points communicating with
“a”, and let δ be the minimum probability (minimized over any two such starting points)
that they are both at state f at time I. Then the probability that they are not simultaneously
at state f at any of the times I, 2I, 3I, …, kI is at most (1 − δ)^k, proving (*).
We now proceed exactly as we did in coupling example 7. Just as in that case the
“criminal” gets a head start but if the “police car” catches him the police handcuff him to
the car so that the two stay together from that point onward. The difference is that in that
case the criminal had a small head start and in this case we let the criminal get a possibly
huge one. We run the two processes starting at a and b, giving the second process a head
start of n-m (We are assuming W.L.O.G. that n > m.) and then run them both for time m
independently until they meet, gluing them together if they meet. The difference between
this example and example 7 is that in this example the criminal can run until the cows
come home and he still can’t get far away from the police car because he is running on a
finite set.//
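For concreteness, a tiny numerical illustration (my own toy chain, not from the text) of what the renewal theorem asserts: when all states communicate and L(a) = 1, the conditional distributions P(Xm = c | X0 = a) and P(Xn = c | X0 = b) nearly coincide once m and n are large.

import numpy as np

# a toy transition matrix on 3 states; all states communicate and state 0
# has a self loop, so L(0) = 1
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

def dist_after(start, steps):
    # the row vector P(X_steps = . | X_0 = start)
    row = np.zeros(len(P))
    row[start] = 1.0
    return row @ np.linalg.matrix_power(P, steps)

print(dist_after(0, 50))   # these two rows agree to many decimal places,
print(dist_after(1, 60))   # whatever the starting states and (large) times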
DEFINITION 109: A Markov chain is a doubly infinite stationary pnm.
THEOREM 110: A stationary process …X-2, X-1, X0, X1, X2… is a Markov chain iff for
any state a0, conditioned on X0 = a0
X1, X2… and …X-2, X-1 are independent.
Proof left to reader. //
THEOREM 111: A Markov process has no transient states and in a Markov process, if
there is a path from a to b then there is a path from b to a (We assume, of course, that a
and b both have positive probability).
Idea of proof: Show that with probability one any state which occurs in a stationary
process occurs infinitely often (Don’t use the Birkhoff ergodic theorem because it is sick
to use such a powerful theorem when you can easily avoid it.) If a is transient or there is
no path from b to a you could visit a and then never come back.//
THEOREM 112: If the transition probabilities are such that all states communicate and
for a given state “a”, L(a) = 1, then there is precisely one measure on …X-2, X-1, X0, X1,
X2… consistent with those transition probabilities that makes that measure a Markov
chain.
Idea of proof: Show that in the case of a Markov chain, the measure on X0 determines the
entire measure of the chain. Using the renewal theorem, show that the given conditions
imply that
lim(i  )(P(Xi = c| X0 = a) exists and is independent of a.
Show that the assumption that we end up with Markov chain forces that limit to be
P(X0) = c for all c, and that we can, in fact, extend that measure to a Markov chain.
(Hint: You need to define a measure on cylinder sets and then use Cartheodory extension
theorem to extend to all Borel sets. Try to do it, and show that if you run into trouble
there must exist c such that
P(Xi+1 = c| X0 = a ) is very different from P(Xi = c| X0 = a) for arbitrarily big i.)//
THEOREM 113: A Markov process is ergodic iff all states communicate.
Only if: Easy.
If: Every state must occur infinitely often in every doubly infinite word and in fact every
word of positive probability must occur infinitely often for every doubly infinite word.
Let T be the first occurrence of a given word. Use theorem 110 and the fact that for any
event A, P(A) = E(P(A| X0, X1, X2…XT)) to show that for any invariant set A,
P(A|X0 = a0, X1 = a1, X2 = a2,… Xn = an) does not depend on a0, a1, a2… an. Now
approximate A with a cylinder set.//
DEFINITION 114: When …X-2, X-1, X0, X1, X2… is an ergodic Markov process, by
theorems 107 and 113, L(a) does not depend on a and we will call it the periodicity of the
process.
THEOREM 115: Renewal theorem for Markov processes (including case where there are
countably many states).
Let …X-2, X-1, X0, X1, X2… be an ergodic Markov process with periodicity 1, possibly
with countably infinitely many states. Let ε > 0. Let S1, S2 be collections of those
states. Then there is an N, dependent only on a lower bound for the measures of S1 and
S2, not on S1 and S2 themselves, such that for all n > N,
|P(Xn ∈ S1 | X0 ∈ S2) − P(X0 ∈ S1)| < ε
Idea of proof: Essentially repeat the proof of the renewal theorem, using the fact that there is
a finite set of states of measure nearly 1, making the problem essentially finite.//
COMMENT 116: In the finite case, N does not depend on anything.
DEFINITION 117: An ergodic Markov process with periodicity 1 is called a mixing
Markov process.
DEFINITION 118: A stationary process …X-2, X-1, X0, X1, X2… is called an n step
Markov process if for all …a-2, a-1, a0,
P(X0 = a0| X-1 = a-1, X-2 = a-2, X-3 = a-3…) =
P(X0 = a0| X-1 = a-1, X-2 = a-2, X-3 = a-3… X-n = a-n)
DEFINITION 119: Let X = …X-2, X-1, X0, X1, X2… be a stationary process and
Y = …Y-2, Y-1, Y0, Y1, Y2… be an n step Markov process. Y is called the n step Markov
process corresponding to X iff they have the same distribution on words of length n and
P(Y0 = a0| Y-1 = a-1, Y-2 = a-2, Y-3 = a-3… Y-n = a-n) =
P(X0 = a0| X-1 = a-1, X-2 = a-2, X-3 = a-3… X-n = a-n)
for all a0, a-1, a-2… a-n
(This is the same as saying that they have the same distribution on words of length n+1).
DEFINITION 120: Let X =…X-2, X-1, X0, X1, X2… be an n step Markov process.
A Markov process …Y-2, Y-1, Y0, Y1, Y2…, constructed from X, is defined as
follows:
The states of Y are the words of length n from the X process. The measure on Y0 is the
same as the measure on X0, X1, X2… Xn-1. The transition probabilities are
P(Y1 = a0 a1 a2… an-1 | Y0 = b0 b1 b2… bn-1) =
P(X0 = an-1 | X-1 = bn-1, X-2 = bn-2, … X-n = b0) if bi+1 = ai for all i ∈ {0, 1, …, n−2},
0 otherwise.
COMMENT 121: It is important that the reader understand what the above Y process
really is. It is simply the X process in disguise. For example, suppose n = 4. The reader
should be required to demonstrate that if you start with the X process, and change the
name
a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 …
to
(a0 a1 a2 a3) (a1 a2 a3 a4) (a2 a3 a4 a5) (a3 a4 a5 a6) (a4 a5 a6 a7) (a5 a6 a7 a8) (a6 a7 a8 a9)…..
for every sequence a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 …
you get the Y process.
The purpose of constructing the Y process is to demonstrate that every n step Markov
process is really a one step Markov process in disguise.
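The recoding of Comment 121 is easy to express in code; the sketch below (an illustration, not part of the text) reads the Y process of Definition 120 off of an X sequence by taking overlapping n blocks.

def blocked_names(x, n):
    # recode x0 x1 x2 ... into (x0..x_{n-1})(x1..x_n)(x2..x_{n+1})..., which
    # is exactly the disguised Y process of Comment 121 (there n = 4)
    return [tuple(x[i:i + n]) for i in range(len(x) - n + 1)]

print(blocked_names("abcdefgh", 4))
# [('a','b','c','d'), ('b','c','d','e'), ('c','d','e','f'),
#  ('d','e','f','g'), ('e','f','g','h')]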
DEFINITION 122: The periodicity of an n step Markov process is the periodicity of the
corresponding Markov process.
THEOREM 123: An n step Markov process is ergodic iff its periodicity is 1.
Proof left to reader.//
EXERCISE 124: Suppose …X-2, X-1, X0, X1, X2… is an ergodic process.
Show that its approximating n step Markov process is also ergodic.
(Hint: Show that if two words don’t communicate in the approximating process, they
don’t communicate in the original process either.)
THEOREM 125: Renewal theorem for n step Markov processes (including case where
there are countably many states).
Let …X-2, X-1, X0, X1, X2… be an ergodic n step Markov process with periodicity 1,
possibly with countably infinitely many states. Let ε > 0. Let S1, S2 be collections of n
tuples of those states. Then there is an M, dependent only on a lower bound for the
measures of S1 and S2, not on S1 and S2 themselves (in the finite case, M does not depend
on anything), such that for all m > M,
|P(Xm, Xm+1, Xm+2… Xm+n-1 ∈ S1 | X0, X1, X2… Xn-1 ∈ S2) –
P(X0, X1, X2… Xn-1 ∈ S1)| < ε
Proof: Immediate from theorem 115. //
Preparation for the Shannon Macmillan theorem (presented as a subsection of coupling section)
COMMENT 126: We conclude our “Preliminaries” chapter with some lemmas designed
to prepare for the most important theorem of the next chapter, the Shannon Macmillan
Breiman theorem.
DEFINITION 127: Let fN be defined as follows: fN(x) = x if |x| > N, and fN(x) = 0
otherwise.
DEFINITION 128: A sequence a1,a2,a3... is essentially bounded if
lim(N → ∞) lim sup(n → ∞) [|fN(a1)| + |fN(a2)| + … + |fN(an)|]/n = 0
We conclude our section on coupling by using coupling to prove the following.
Coupling EXAMPLE 10:
THEOREM: Let ai be any sequence of nonnegative numbers such that the numbers i·ai are summable.
If P(i < |Xn| < i+1 | Xn-1, Xn-2...X1) < ai for all i and n, then X1, X2, X3,… is essentially
bounded (with probability 1).
Idea of proof: Choose N large enough so that aN-1 + aN + aN+1 + ... < 1. Define a distribution
for a random variable YN,
P(YN = 0) = 1 − (aN-1 + aN + aN+1 + ...), and for all i ≥ N, P(YN = i) = ai-1.
Let YN(1), YN(2), YN(3),.... be an independent sequence of random variables each with
the distribution of YN. Now couple the two sequences of random variables
1) |fN(X1)|, |fN(X2)|, ...|fN(Xn)|
2) YN(1), YN(2), YN(3),....
inductively so that |fN(Xi)| ≤ YN(i) for all i. (Here we make our coupling so that
i − 1 < |fN(Xi)| < i only when YN(i) ≥ i, for i > N.)
(1/n)(YN (1) + YN(2) +YN(3) +....YN(n)) approaches something small when N is large.//
Square lemmas:
COMMENT 129: Only the third of these square lemmas has any relevance to the text.
Our only reason for including the first two is that we believe every graduate student
should be familiar with them.
1)
Let ai,j be reals such that ai,j converges to aj uniformly as i → ∞, and ai,j
converges to ai as j → ∞. Then lim(i → ∞) ai, lim(j → ∞) aj, and lim((i,j) → ∞) ai,j all
converge, and they converge to the same thing.
2)
Let ai,j be reals such that ai,j converges to aj uniformly as i → ∞, and lim(j → ∞) aj
converges. Then lim((i,j) → ∞) ai,j converges to the same thing.
3)
Let Xi,j be real valued random variables such that the columns form a stationary
process (i.e. any n successive columns have the same distribution). Suppose Xi,j
converges to Xj as i approaches infinity. If both Xj and Xi,i are essentially bounded, then
both
(1/n)(X1,1 + X2,2 +X3,3 + ... Xn,n) and 1/n (X1 + X2 +X3 + ... Xn)
converge, and their limits are the same.
Idea of proof: By the Birkhoff ergodic theorem and essential boundedness,
(1/n)(X1 + X2 + X3 + ... Xn) converges. Fix ε and N, and let
Yi = 1 if there is any j > N such that |Xi,j − Xi| > ε and
Yi = 0 else. Then for large N, (1/n)(Y1 + Y2 + Y3 + ... Yn) converges to something
small.//
COMMENT 130: The expectation of the thing they are converging to is E(X1). When the
limit is constant, the thing they are converging to is E(X1).//
ENTROPY
DEFINITION 131: The n shift is obtained by taking an n sided fair coin, and flipping it
doubly infinitely many times to get an independent process on n letters.
THEOREM 132: The three shift is not a factor of the two shift.
Idea of proof: Suppose  is a homomorphism of the two shift to the three shift.
Approximate -1 of the canonical three set partition with a three set partition of cylinder
sets in the two shift. These cylinder sets have a length (say length 9) (the length of a
cylinder set is the number of letters it depends on).
Presume for the moment, that this approximation is not just an approximation, but
rather is exactly -1 of the canonical three set partition. Then each word of length 200 in
the three shift is determined from a distinct word of length 208 in the two shift. For
example, the 7th letter in the three shift is determined by the 3rd, 4th , 5th, 6th, 7th, 8th, 9th,
10th and 11th terms of the two shift. This is impossible because 2^208 < 3^200.
However, since the cylinder set partition is only an approximation, words of length
208 determine words of length 200 only after ε·200 of the letters are altered for some
small ε (on more than half of the space). Nevertheless, by making such changes we still
won’t get enough words to account for (1/2)(3^200) words. //
DEFINITION 133: A small exponential number of words of length n means 2^(εn) words,
for a small number ε. If the number of words in a set of such words is 2^(Hn), then we say
that H is the exponential number of words in the set. If the probability of a word is 2^(−Hn),
the exponential size of a word of length n is H. Note that the smaller the probability of the
name, the bigger its exponential size.
DEFINITION 134: We are about to prove a theorem saying that in every stationary
process, the exponential size of the first n letter subword of an infinite word, approaches
a limit as n approaches infinity. In the ergodic case, this limit is a constant, called the
entropy of the process.
COMMENT AND DEFINITION 135: The theorem we just referred to (namely the
Shannon Macmillan Breiman theorem, which we will soon state and prove) implies that
for large n, after removing a small set, all words of length n have approximately the same
exponential size. Therefore (after removal of the small set) the exponential number of
words is approximately the entropy of the process. Henceforth, names with the
approximately correct exponential size will be called reasonable names.
THEOREM 136: Two isomorphic ergodic processes must have the same entropy.
Idea of proof: By comment 135, this theorem is a generalization of theorem 132, and
the proof is identical.
DEFINITION 137: The entropy of an ergodic transformation which has a finite
generator, P, is the entropy of the process P,T. Theorem 136 says that it does not matter
which generator you use.
COMMENT 138: The reader may object that it is hard work to prove the existence of
generators for transformations and that he resents having to have a definition of entropy
that requires such hard work. Such a complainer will be happier to define the entropy of a
transformation T to be
sup (over all finite partitions P) of the entropy of P,T.
Since every partition can be extended to a generator just by joining it with a generator
(“joining” is another word for “superimposing”), the two definitions are equivalent. In
fact, this definition is more encompassing because it can be used for transformations that
have no finite generator.
CONVENTION 139: Throughout this book “log” means log base 2.
THEOREM 140: (Shannon Macmillan Breiman):
Assume an ergodic stationary process on a finite alphabet. With probability 1, a
name b1b2b3.... has the property that
lim(n  ) [-(1/n)log(P(b1b2b3....bn))] exists. The limit is constant almost everywhere.
Idea of proof: We will, in fact, show that the limit exists for any stationary process.
Ergodicity is only used to get the limit to be constant.
-(1/n)log(P(b1b2b3....bn)) =
-(1/n)log(P(b1)) - (1/n)log(P(b2|b1)) -(1/n)log(P(b3|b2b1))...
-(1/n)log(P(bn|bn-1bn-2...b1))
which we will compare with
-(1/n)log(P(b1|b0b-1b-2b-3....)) - (1/n)log(P(b2|b1b0b-1b-2...))
-(1/n)log(P(b3|b2,b1b0b-1....))-... (1/n)log(P(bn|bn-1bn-2...)).
By the third square lemma, all we need to prove is that both sequences are essentially
bounded, which we show from coupling example 10. We now determine the ai of coupling
example 10.
We can always let a0 = 1, because all probabilities are less than or equal to 1.
Hence, we only have to look for ai, for i > 1. For n > 0, in order for
-log(P(bn|bn-1bn-2...b1))
to be between i and i+1, bn has to be a letter which, given
bn-1, bn-2,...
has probability less than 2^(−i). Since we are assuming a finite alphabet of size, say A, it
follows that the probability that
−log(P(bn|bn-1bn-2...b1)) is between i and i + 1, given bn-1, bn-2,..., is less than A·2^(−i). The same
is true for −log(P(bn|bn-1bn-2...)). //
THEOREM 141: The entropy of …X-2, X-1, X0, X1, X2… is
E(−log(P(X0 | past)))
Idea of proof: Apply the proof of theorem 140, along with comment 130 (which is the
comment following the third square lemma).//
DEFINITION 142: Let P be a partition such that the probabilities of the pieces are
p1,p2,p3,..pn. Let ...X-3,X-2,X-1,X0,X1,X2,X3... be an independent process on an n letter
alphabet, whose probabilities are p1,p2,p3,..pn. Then the entropy of P is defined to be the
entropy of that process.
THEOREM 143: The entropy of the above partition is −p1·log(p1) − p2·log(p2) − p3·log(p3)
− ... − pn·log(pn).
Idea of one proof: Use theorem 141.
Idea of another proof: A typical word of length n has about n·p1 occurrences of the first letter, about
n·p2 of the second letter etc., so that it has probability approximately
p1^(n·p1) · p2^(n·p2) · ... · pn^(n·pn) = 2^[n·p1·log(p1) + n·p2·log(p2) + ... + n·pn·log(pn)] //
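A quick numerical sanity check of Theorem 143 (a sketch of my own, not part of the text): compute −Σ pi·log pi directly, and compare it with the empirical exponential size −(1/n)·log P(word) of a long word drawn from the corresponding independent process; by Shannon Macmillan Breiman the two agree for large n.

import random
from math import log2

def partition_entropy(probs):
    # -p1 log p1 - p2 log p2 - ... - pn log pn (log base 2)
    return -sum(p * log2(p) for p in probs if p > 0)

def empirical_exponential_size(probs, n=100_000):
    # draw an n letter word from the independent process with these letter
    # probabilities and return -(1/n) log of its probability
    letters = random.choices(range(len(probs)), weights=probs, k=n)
    return -sum(log2(probs[i]) for i in letters) / n

probs = [0.5, 0.25, 0.25]
print(partition_entropy(probs), empirical_exponential_size(probs))  # both about 1.5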
COMMENT 144: In the near future, we will use theorem 143 as the definition of entropy
for a partition, and see what we can learn about entropy of a partition from that definition.
The reader should try to prove everything from the other definition also.
DEFINITION 145: Let P1 and P2 be two partitions with probabilities of parts [p1,p2, p3,
...pn] and [q1,q2, q3, ...qm] respectively. If we say put P1 into the first piece of P2, we
mean that we want you to consider the partition whose pieces have sizes
q1p1, q1p2, q1p3, ...q1pn, q2, q3, ...qm
THEOREM 146: Conditioning property: Let the first piece of P2 have probability q1. Let
P1 and P2 have entropies H1 and H2 respectively. Then if you put P1 into the first piece
of P2, the resulting partition has entropy H2 + q1H1.
Idea of proof: You can prove this by straightforward computation, or use a slicker proof,
based upon the idea that entropy of a partition is the expectation of -log(the probability of
the piece you are in).//
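The “straightforward computation” is short enough to check numerically; here is a small Python verification of the conditioning property on an arbitrary example (my own sketch).

from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def put_into_first_piece(P1, P2):
    # pieces of the partition obtained by splitting the first piece of P2
    # in the proportions given by P1 (Definition 145)
    return [P2[0] * p for p in P1] + list(P2[1:])

P1, P2 = [0.5, 0.25, 0.25], [0.4, 0.6]
lhs = H(put_into_first_piece(P1, P2))
rhs = H(P2) + P2[0] * H(P1)                  # H2 + q1*H1, Theorem 146
assert abs(lhs - rhs) < 1e-9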
DEFINITION 147: A join of two partitions P1,P2, with m and n pieces respectively,
denoted P1 V P2 , is a partition with mn pieces (although some of those pieces may be
empty, so that in reality you have less than mn pieces) so that each of these tiny pieces is
regarded as the intersection of a piece of one, with a piece of the other. The sum of the
probabilities of the pieces in a piece of P1 is the probability of that piece, and the sum of
the probabilities of the pieces in a piece of P2 is the probability of that piece.
COMMENT 148: Note that coupling and join really mean the same thing.
THEOREM 149: The entropy of the independent join of two partitions, is the sum of
their entropies.
Idea of proof: Repeatedly use the conditioning property, by putting P1 into every piece of
P2.//
DEFINITION 150: Let P be a partition and T be an ergodic transformation. H(P) means
entropy of P and H(T) means entropy of T
DEFINITION 151: (Convex combination of partitions)
Let P1, P2,… Pm be n set partitions whose sets are identified, (e.g. the third set of
P1 is identified with the third set of P2 etc.). For i ≤ m and j ≤ n, let the probability of the
jth set of Pi be pi,j. Let p1, p2,.. pm be positive numbers which add to 1. Then p1 P1+ p2 P2+..
pm Pm is the partition whose jth set has probability
p1p1,j + p2p2,j + … + pmpm,j.
COMMENT 152: Convex combination of random variables and convex combination of
probability measures is defined similarly.
THEOREM 153: Suppose p1 P1+ p2 P2+.. pm Pm is a convex combination of partitions.
Then
H(p1 P1+ p2 P2+.. pm Pm) ≥ p1 H(P1)+ p2 H(P2)+ .. + pm H(Pm).
Proof: Use the notation of the previous definition. By concavity of −x·log x,
−(p1p1,j + p2p2,j +… pmpm,j)·log(p1p1,j + p2p2,j +… pmpm,j) ≥
p1(−p1,j·log(p1,j)) + p2(−p2,j·log(p2,j)) +… pm(−pm,j·log(pm,j));
now sum over j.//
THEOREM 154: Independent joining maximizes entropy, (i.e., it gives higher entropy
than any other joining).
Idea of proof:
Let the probabilities of P1 be p1, p2, p3, ...pm.
P1 V P2 can be obtained by selecting Q1, Q2,… Qn such that
p1 Q1+ p2 Q2+ … pn Qn = P2
and then putting Q1 into the first piece of P1, Q2 into the second piece of P1, etc. Apply the
conditioning property and theorem 153.//
THE CAVE MAN’S PROOF OF THE ABOVE THEOREM:
Idea of proof:
We use only the conditioning property, and the fact that the maximum
entropy of a two set partition is obtained by the 1/2,1/2 partition. We do not use the
definition of entropy, formula for entropy, or even use the continuity of entropy. From
this we can get the above theorem, when the P probabilities
p1,p2,p3,...pm are dyadic rationals, and the Q probabilities q1,q2,q3,...qn arbitrary.
If we then want to get it for any pairs p1,p2,p3,...pm and q1,q2,q3,...qn we need continuity
of entropy.
Step 1: First show by induction on m, that the maximum entropy of a 2^m set partition is
obtained with the (1/2^m, 1/2^m, 1/2^m, ..., 1/2^m) partition.
Step 2: Show the theorem to be true when P1 is the (1/2^m, 1/2^m, ..., 1/2^m)
partition.
Idea of proof: by using the conditioning property, show that entropy goes up on any
piece of P2, when we exchange the partition of it for the (1/2^m, 1/2^m, ..., 1/2^m) partition
conditioned on that piece.
Step 3: Show it when P1 is composed of dyadic rational pieces.
Proof: P1 can be extended to a (1/2^m, 1/2^m, ..., 1/2^m) partition for some m.
More precisely, we can get partitions R1, R2, ...Rn so that if we put each Ri into
the ith piece of P1, we will get a (1/2^m, 1/2^m, ..., 1/2^m) partition. Now consider two
joinings of P2 and P1, which we will call J1 and J2. J2 will be the independent joining, and
J1 will be an arbitrary joining. Make J1 into an even finer partition, by, for all i, putting Ri
into the ith piece of P1 independent of J1 inside the ith piece of P1. Do the same with J2. It
can easily be seen by the conditioning property that the amount of entropy added, in both
cases, is identical. When you are done, the extension of J2 will be an independent joining
of the (1/2^m, 1/2^m, ..., 1/2^m) partition with P2, and the extension of J1 will be a joining of
the (1/2^m, 1/2^m, ..., 1/2^m) partition with P2 which is not independent. Since the former
has bigger entropy than the latter, and we added the same amount to get them, it follows
that the entropy of J2 is bigger than that of J1.//
DEFINITION 155: Let P and Q be partitions which are joined into a big partition P V Q.
Then the entropy of P over Q, written H(P|Q), is H(P V Q) - H(Q).
COMMENT 156: H(P|Q) < H(P)
THEOREM 157: Let Q be a sub partition of Q1, and P be joined with Q1.
Then H(P|Q1) ≤ H(P|Q).
Proof: You can only increase the entropy of P V Q1 by rearranging the P on each piece of
Q so that P is independent of Q1 on that piece. This will not alter H(P|Q), and cannot
decrease H(P|Q1), and in the end you will have
H(P|Q1) = H(P|Q).//
COROLLARY 158: Let ...X-3,X-2,X-1,X0,X1,X2,X3... be a stationary process, with a
finite alphabet. H(Xn|Xn-1,Xn-2,Xn-3,...X0) is nonincreasing as n approaches infinity.
Idea of proof: H(Xn|Xn-1...X0) = H(X0|X-1...X-n) by stationarity.//
DEFINITION 159: Let H be the limit of H(Xn|Xn-1,Xn-2,Xn-3,...X0) as n approaches
infinity.
COMMENT: 160: We will later see that H is the entropy of the process.
COMMENT 161: This is H(X0|the past)
COMMENT 162: H is also lim (1/n)H(X0 V X1 V X2 ... V Xn).
Here X0 V X1 V X2 ... V Xn means the join of X0, X1, X2, ...Xn.
Proof: H(X0 V X1 V X2 ... V Xn) = H(X0) + H(X1|X0) + H(X2|X1,X0) + ... +
H(Xn|Xn-1,Xn-2,Xn-3,...X0) //
THEOREM 163: H is the entropy of the process.
Proof: X0 V X1 V X2 ... V Xn is a partition whose pieces correspond to the n names of
the process, each having the probability of the corresponding name as its measure. Let S
be the entropy of the process. By the Shannon Macmillan theorem, except on a small set
of size ε, all names have measure 2^(−(S + δ)n) where |δ| < δo, δo a small positive number.
On the set of size ε, there are at most A^n names, where A is the size of the alphabet,
because that is the total number of names altogether. Hence, by the conditioning
property,
H(X0 V X1 V X2 ... V Xn-1) is between (1−ε)(S − δo)n and (1−ε)(S + δo)n + ε(log A)n.//
DEFINITION 164: Let P be a finite partition, and T be a measure preserving
transformation. Then P,T defines a stationary process, and H(P,T) is the entropy of that
process.
COMMENT 165: H(P,T) = H(Q,T) for any two finite generators P and Q, because the
two processes are isomorphic.
DEFINITION 166: The entropy of a transformation, H(T), is the supremum of H(P,T)
over all finite partitions P.
COMMENT 167: If a transformation has a finite generator, P, then H(T) = H(P,T)
Proof: For any other partition Q, PVQ is a generator also, so
H(Q,T) < H(PVQ,T) = H(P,T).
THEOREM 168: Let A1, A2, A3,... be a countable generator for T, and let Ω be the whole
space. Then
H([A1, A2, A3, ...An, (Ω − [A1 ∪ A2 ∪ A3 ... ∪ An])], T)
converges to H(T) as n → ∞.
Idea of proof: Let Qn be the partition
[A1, A2, A3, ...An, (Ω − [A1 ∪ A2 ∪ A3 ... ∪ An])]
Let P be any other finite partition. Since the A’s are a generator,
it follows that if you know the A name, you know the P name. For large enough n, the Qn
name on times –n, -n+1,…n nearly determines which element of P you are in by theorem
24.
Thus H(Qn,T) is not much less than H(P,T) by the same argument we used to
show that the three shift is not a factor of the two shift. //
THEOREM 169: KRIEGER’S FINITE GENERATOR THEOREM:
If a transformation has finite entropy, then it has a finite generator. Indeed, if the entropy
is less than log(k), we can find a generator B = B1, B2, ...Bk.
We first give a vague idea of the proof followed by more details of proof.
The student should read the “vague idea” and the “more details” before attempting to
write out a rigorous proof.
Proof scheme:
Let An and Qn be as above. We will look at a Rohlin tower; then a much
bigger one; then a much bigger one; etc. Start with a large m. Let n>>m. We will choose
the first tower of height n, so that the base is independent of the partition of n names of
Qm. By Shannon Macmillan, there are enough k letter names to label all reasonable
columns with the letters B1,B2,...Bk, on the rungs of the columns, so that all the
reasonable columns are distinguished from each other, because the number of reasonable
columns is smaller than the number of such names. Carry out such a labeling. If you
know the B1,B2,...Bk name of a column containing the base of the tower, you probably
know the Qm name of that column. The only reason you might turn out to be wrong is
that you might be looking at an unreasonable column.
This is just the first stage. We will end up altering our choice of B1,B2,...Bk over
and over again, to accommodate larger towers. However we will find that we will alter
less and less so that, starting with a random point ω in Ω, once we know our initial choice
of which of B1, B2, ...Bk contains ω, it is not likely to change by the time we are done. In
the end, we will be able to tell our entire A name from our B name, and hence we will be
able to tell what point we are, from our B name. Hence we will know that B generates.
Vague idea of proof:
There is a complication. We will have to know whenever we enter one of these
towers from just looking at the B name. If we saw an infinite B name but had no idea
which coordinates were in the base of the tower, we would be quite confused. To handle
this, we need to reserve certain words to label the bases of the towers, so that we know
when we are in a base of a tower. You can arrange that so few words are reserved that no
damage is done to the proof. Use labels like 1000001 for the first tower and
100000000000001 for the second (the reader needs to figure out how fast the length of
these names must grow so that the restriction that these names cannot be used except for
labeling the base will not cause damage).
More details of the proof:
1) Start off with Q10 (10 chosen so that Q10 has nearly full entropy). Pick n1 large enough
so that Shannon Macmillan kicks in for Q10 names by time n1. Pick a Rohlin tower of
height n1, whose base is independent of the partition of Q10 names of length n1. Label all
reasonable columns with a distinct B name.
2) Pick your second tower to be much bigger than your first, and as before pick your base
independent of Q1000 names of tower length (1000 chosen so that Q1000 has very very
nearly full entropy). Your B process defined at the previous stage, already captures a
great deal of entropy, because B names almost determine which atom of Q10 you are in.
Prove that there are enough B names so that, since our entropy is strictly less than log(k),
it is possible to let the top ¾ of the rungs remain as they are and change the bottom ¼ of
the rungs to B names and thereby distinguish all reasonable Q1000 columns.
3) Continue, except that replace the numbers ¾ and ¼ with 7/8 and 1/8 at the next stage,
then with 15/16 and 1/16 in the following stage etc.
4) Here is why the final B name separates points . At the end of the first stage you
probably know what reasonable atom of Q10 you are in and there is not much more than a
¼ chance that the information will be damaged. Continue this reasoning at higher stages
using Borel Cantelli to show that information is damaged only finitely many times.
For a given point, it is possible that the n tower name of its n tower column will
be damaged if that tower turns out to be one of those bottom altered n+1 tower rungs.
Borel Cantelli will say that happens only finitely many times. //
ABBREVIATION OF PROOF OF THEOREM: Make big tower distinguishing
reasonable columns in mean Hamming distance. Improve approximation with a bigger
tower by altering a few columns at the bottom. Label the base of all your towers with
words that are just used to label bases.
COMMENT 170: WE REGARD IT AS ESSENTIAL THAT THE STUDENT BE
REQUIRED TO WRITE OUT THE ABOVE PROOF IN DETAIL.
THEOREM 171: Topologize the space of stationary measures by the dbar metric. Then
the function that takes a process to its entropy is continuous.
Idea of proof: If Process 1 is close to Process 2 in dbar, then most
reasonable names of one are within ε in the mean hamming distance of a reasonable name
of the other. There are exponentially few names close to a given name in the mean
hamming distance.//
DEFINITION 172: If a process homomorphism has the property that the inverse of the
set of points with any specific letter in the origin (in the range) is a cylinder set (in the
domain), then the range is called a coded factor of the domain, and the map from cylinder
set to letter is called a code.
LEMMA 173: A coded factor of a given process cannot have more entropy than the
process.
Idea of proof: If the code has length M, then every name of length N of the coded
process, is determined by a name of length N+M of the original process. //
THEOREM 174: A factor of a process cannot have more entropy than the process.
Idea of proof: entropy is continuous, and coded factors are dense in all factors.//
THEOREM 175: A process has entropy zero iff the past determines the present.
Idea of proof: The entropy of ...X-3,X-2,X-1,X0,X1,X2,X3...
is the expectation of the entropy of X0 given the past.//
DEFINITION 176: Let T be an ergodic transformation, and Γ be a set of positive
measure. Let A be the σ-algebra on Γ that is obtained by restricting the σ-algebra of the
whole space to subsets of Γ. Let ν be the probability measure on Γ defined by
ν(γ) = μ(γ)/μ(Γ)
for every γ ⊂ Γ, where μ is the measure on the whole space. Let the transformation S from
Γ to Γ be defined by S(x) = T^i x, where i is the least positive integer such that T^i x is in Γ.
Then (Γ, A, ν, S) is a measure preserving transformation. S is called the induced
transformation of T on Γ.
EXERCISE 177: Let x and i be as defined in definition 176; i is a function of x. Prove that the
union, over all x with i = ∞, of {x, T(x), T²x, …} has measure 0.
EXERCISE 178: Prove S to be measure preserving.
THEOREM 179: Let  be a measure for a Lebesgue space, T be an ergodic
measure preserving transformation on that space,  be a subset of the space, S be the
induced transformation on . Then S the entropy of S on , is H(T)/().
Idea of proof: Let (,A,,S) be as in definition 176. Let P be a countable generator for T.
Define a new countable alphabet whose letters are finite strings of letters in P. x   is
assigned a particular string if that string is the P name of
x, T(x), T2(x), ...Ti-1(x),
such that i is the least positive integer with Ti(x) in . Show that this partition generates A
using transformation S.
By the Birkhoff ergodic theorem, the number of times S returns to , by time n, is about
n().Hence names of size n() in the induced process look like names of size n in T
(more precisely, they are between names of size .9n and names of size 1.1n where .9 and
.1 are arbitrarily close to 1).
We would be done, with a beautifully simple argument, if all
partitions in question were finite. Since they aren’t, the above argument proves nothing.
The entropy of T can be approximated with a finite partition, obtained by lumping all but
finitely many of the pieces of P. Our new partition is P = (P1,P2,...Pn, lump).
We can approximate the entropy of the induced transformation, by using P instead
of P in our construction of a partition on . This partition on  is still infinite, but we can
make it finite by lumping all points in  whose return time is greater than some large
number N, without seriously altering the entropy of S,. These arguments can be made to
be rigorous with proper use of theorem 168. Now carry out the argument of the first
paragraph.
of length n() look like words of length n in T and words of length
n() in the infinite process look like words of length between .99n() and n() in the
approximating process so the entropy of the approximating process is between H(T)/()
and H(T)/(.99()).//
More detail: Words
BERNOULLI TRANSFORMATIONS
DEFINITION 180: A transformation is called Bernoulli (abbreviated B) if it is
isomorphic to an independent process.
PLAN: We are about to define a bunch of conditions, which are equivalent to Bernoulli.
We will prove them all equivalent, and then prove that Bernoulli implies them. We will
not prove that they imply Bernoulli in this section, because that proof is harder. We will
leave that to next section, where we will prove the Ornstein isomorphism theorem.
DEFINITION 181: The trivial transformation is the transformation where your Lebesgue
space consists of precisely one point and the transformation takes that point to itself. The
only partition of that space is called the trivial partition. If P and T are trivial, the P,T
process, namely the process which assigns probability 1 to the name …a,a,a,a… is called
the trivial process.
COMMENT 182: If the reader would like to skip this section, and go on to the next, just
read the definition of finitely determined. There are three results from this section you will
need if you intend to skip it:
i) finitely determined processes are closed under taking factors,
ii) a nontrivial finitely determined process has positive entropy, and
iii) Bernoulli implies finitely determined.
You won’t need (iii) until the proof of the isomorphism theorem (classical form) at the
very end of the next section.
DEFINITION 183: Let u, u1, u2 be three probability measures, t be a real number, 0<t<1,
such that tu1+(1-t)u2 = u. Then u1 is a submeasure of u.
COMMENT 184:
There is a u2 such that tu1+(1-t)u2 = u iff
for all sets A, tu1(A) < u(A).
COMMENT 185: Note that submeasure is a generalization of subset. For example,
instead of picking a subset consisting of 3 points, pick a “subset” consisting of 1/3 of one
point, 1/5 of another, and 1/7 of a third. This example suggests that we should define a
submeasure to be tu1 instead of u1, but we find it easier to consider probability measures.
DEFINITION 186: Notation as in the above definition. The size of the submeasure is t.
DEFINITION 187: If these measures are on words of length n, and the size
of the submeasure is > 2^(−εn) for small ε, then it is said to be an exponentially fat submeasure.
COMMENT 188: Note that in the special case of a subset of {0,1}^n, we are saying that
the number of points in the set is greater than 2^((1−ε)n).
DEFINITION 189: A stationary process μ is called extremal (abbreviated EX) if for
every ε, there is a δ, such that for all sufficiently large n, there is a set of measure less
than ε (which we call error) of words of length n, such that the following holds:
Any submeasure μ1 of μ restricted to words of length n, of size > 2^(−δn), whose support is
off error, is within ε of μ restricted to words of length n in dbar.
ABBREVIATED DEFINITION : A process is extremal if for sufficiently large n,
exponentially fat submeasures whose support is off some small subset are dbar close to
the whole measure.
DEFINITION 190: A process is called Very Weak Bernoulli, abbreviated VWB, if the
dbar distance between the n future conditioned on the past and the unconditioned n
future [a function from pasts to numbers] goes to zero in probability as n approaches ∞.
DEFINITION 191: A stationary process P is called finitely determined (abbreviated FD),
if
a) It is ergodic.
b) for every ε, there is a δ and an N, such that for any other ergodic process P1,
if
1) the P and P1 probability laws for words of length N are within δ of each other,
and
2) the entropies of P and P1 are within δ of each other,
then
3) the two processes are within ε in dbar of each other.
ABBREVIATED DEFINITION: A process is finitely determined if a good distribution
and entropy approximation to the process, is a good dbar approximation to the process.
DEFINITION 192: An independent concatenation is obtained by taking any measure on
words of length n, using that measure to choose the first n letters, using that measure
again to choose the next n letters independently, and continuing again and again forever
(and backwards in time).
COMMENT 193: In general, an independent concatenation is not stationary.
DEFINITION 194: A dbar limit of independent concatenations (abbreviated IC) is a
stationary process which is the dbar limit of independent concatenations.
DEFINITION 195: We will use B to indicate a Bernoulli process (i.e. isomorphic to an
independent process), FB to indicate a factor of an independent process.
PLAN: The goal of this section is to prove that
B ⇒ FB ⇒ IC ⇒ EX ⇒ VWB ⇒ FD.
We prove FD ⇒ FB ⇒ B in the next section.
COMMENT 196: FD is the criterion generally used in proofs. VWB is the criterion
generally used to check a given candidate to see if it is Bernoulli.
DEFINITION 197: Completely extremal is the same as the abbreviated definition of
extremal, except delete the words “whose support is off some small set” (in other words,
there is no error set)
INTRODUCTION TO COMPLETELY EXTREMAL:
Define 1 to be the distance from the northern end of the earth to the southern end of
the earth. Suppose you would like to couple the northern half of the earth to all of the
earth, so that the average distance between coupled points is less than .01. It is rather
obvious that you can’t do that. Thus, you can take half of the earth that is far from all of
the earth.
But the same is not true for {0,1}n under the mean hamming distance. Not only is it
impossible to extract half the space such that the half you extract is far away from the
whole space when n is large; it is even impossible to extract an exponentially fat subset
such that the subset is far away from the whole space. In other words {0,1}n is
completely extremal.
THEOREM 198: An independent process is completely extremal.
We will postpone the proof until we first do some preparatory work.
Lemma 199 Let  be a space and let Xn from  to {-1,1} be independent random
variables each assigning probabilities ½ to both -1 and 1. Let
Sn = Sum (1 to n) (Xi )
Let  be a submeasure of  (remember that in this book submeasures of probability
measures are probability measures) and E be the expectation operator corresponding to .
For every  there is a  such that
if
n is sufficiently large and E(Sn) > n
then
The size (recall the definition of size of a submeasure) of  < e-n,
(i.e.  is not exponentially fat.)
Proof:
i) E(Sn) > n.
Let R be the set of names in {-1,1}n such that
Sn > (/2)n.
ii) Then it is an elementary fact about random walk that for fixed , n going to infinity, R
is not exponentially fat.
From (i) and (ii), the reader can easily establish that  is not exponentially fat.//
EXERCISE 200: If the reader wishes to establish (ii) himself, he should write down the
probability of R precisely. You will get a sum of terms where the first is bigger than all
the others put together. Show that probability decreases exponentially as i increases.
EXERCISE 201: Our argument could be improved. Show that if you replace the
statement
“E(Sn) > εn”
with the statement
“With probability one Sn = εn (obviously we have to assume εn to be an integer)”
the long run exponential rate that the size of ν shrinks does not increase.
Hint: In the proof of lemma 199, replace (ε/2)n with (.9ε)n.
LEMMA 202: Let P = {p1, p2, p3, … pm} be a set of finitely many points with two
measures μ and ν on it. Suppose the variation distance between μ and ν is ε. Let I be the
unit interval, and λ be Lebesgue measure on I. Consider two measures on P × I, namely μ
× λ and ν × λ. Then there is a set S ⊂ P × I such that
μ × λ(S) = ½, and ν × λ(S) ≥ ½ + ε/4.
COMMENT 203: Don’t read the following proof of lemma 202. Prove this lemma
yourself, or at least convince yourself that something like it must be true.
Proof: We know that  | (pi) – (pi) | =  which means that you can partition the set P
into two classes so that
The sum in the first class of (pi)– (pi) =  /2
and
The sum in the second class of  (pi) – (pi) =  /2. Let
r = (first class) and r +  /2 = (first class )
1-r = (first class) and 1-r- /2 = (second class)
If r > ½, then let S be {(,t):  is in the first class and t< 1/(2r)}. Then
 X  (S) = (r +  /2)/(2r) > ½ +  /4.
If r < ½, then let S be { (,t): ( is in the first class) or ( is in the second class and t <
(½ - r))/(1- r))}.
Then  X  (S) = (r +  /2) +(1-r- /2)(½ - r))/(1- r) = ½ + 4(1- r) > ½ + /4.
The reader can verify that in any case  X  (S) = ½. I thought I told you not to read
this!//
Restatement of THEOREM 198: Let P = {p1, p2, p3, … pm} be a set of finitely many
points and let μ be a measure on P.
Let Ω = P^n and endow Ω with the product measure ν = μ^n (we are defining ν).
Let S be a subset of Ω and let σ be the restriction of ν to S, normalized so that σ is a
probability measure.
For any c, there is a d such that if n is large enough and the dbar distance between (S, σ)
and (Ω, ν) is greater than c, then ν(S) < e^(−dn). In other words,
(P, μ) is completely extremal.
COMMENT and EXERCISE 204: Actually, the definition of completely extremal refers
to submeasure rather than subset. The reader should prove that this is equivalent
Proof of theorem: Let I be the unit interval and let λ be Lebesgue measure on I. Now we
extend the space Ω to Ω × I^n and endow that space with measure ν × λ^n. Similarly,
extend the space S to S × I^n and endow that space with measure σ × λ^n.
Our goal is to prove that
ν(S) < e^(−dn),
which is equivalent to showing that
(ν × λ^n)(S × I^n) < e^(−dn).
Our first step is to construct independent random variables on (Ω × I^n, ν × λ^n)
such that each has probability ½ of being 1 and ½ of being −1. Let k < n, and select some
initial word of length k,
a1a2a3 … ak, where each ai is a member of P. Note this still makes sense when k = 0; it
merely means that we are considering the empty string. If the initial word a1a2a3 … ak
occurs as an initial word of some member of S (always true when k = 0, where we assume
S not to be empty), then let
σ(a1a2a3 … ak) be the conditional measure (under σ) of the (k+1)st letter of a word in S given
that the first k letters are a1a2a3 … ak. For such a word, let d(a1a2a3 … ak) be the variation
distance between μ and σ(a1a2a3 … ak). Invoking lemma 202, let Λ(a1a2a3 … ak) be a
subset of P × I which has
μ × λ measure ½ and
σ(a1a2a3 … ak) × λ measure ≥ ½ + d(a1a2a3 … ak)/4.
Now we define the random variable Xk+1 on Ω × I^n.
Xk+1((a1a2a3 … an), (t1t2t3 … tn)) =
−1 if a1a2a3 … ak is not the initial segment of any word in S and tk+1 < ½;
1 if a1a2a3 … ak is not the initial segment of any word in S and tk+1 > ½;
1 if a1a2a3 … ak is the initial segment of a word in S and (ak+1, tk+1) is in Λ(a1a2a3 … ak);
−1 if a1a2a3 … ak is the initial segment of a word in S and (ak+1, tk+1) is not in Λ(a1a2a3 … ak).
(*) Under  X n, the variables Xi are all independent of each other taking on the values
–1, 1 with probabilities ½ each.
It is time to make use of the fact that the dbar distance between (S, σ) and (Ω, ν)
is greater than c. We couple (S, σ) and (Ω, ν) by induction. We assume that we already
know how the first k coordinates of S are coupled with the first k coordinates of Ω.
Consider such a coupled pair of k-tuples (a1a2a3 … ak), (b1b2b3 … bk) and conditioned on
that coupled pair we want the joint distribution of (ak+1, bk+1). Let ak+1 be distributed with
distribution σ(a1a2a3 … ak) and let bk+1 be distributed with distribution μ. Couple those
two distributions as closely as possible. It follows that the probability that the coordinates
do not match is d(a1a2a3 … ak).
The expected mean hamming distance of the two spaces, obtained by this
coupling is greater than c.
1)
c < (h1+ h2+ h3+… hn )/n
where hk+1 is the probability that the (k+1)th coordinates differ.
The rest is definition chasing so we will let the reader do some work. Let E be the
expectation operator of the measure σ × λ^n. Show that
(**) c/2 < (1/n)E(Σ (i from 1 to n) Xi)
(*), (**), and lemma 199 give us our result.//
ABBREVIATION: S dbar far away. Xi independent processes defined inductively so that
Xk takes on 1, -1 with probability ½ apiece but in such a way that it prejudices for 1 as
much as possible on S. Far dbar implies prejudice can be strong enough to force
X1 + X2 … to grow in expectation linearly on S, proving S to shrink exponentially.
COMMENT 205: Note that in the statement of theorem 198, it turns out that d as a
function of c does not depend on what independent process you use.
EXERCISE 206: We proved the theorem for subsets. Extend the proof to submeasures.
COMMENT 207: It follows immediately from definition, that IC is closed in dbar. In
order to prove that IC implies extremal, we will need that extremal is closed in dbar. The
reason we need to use extremal, rather than completely extremal, is that completely
extremal is not closed in dbar.
DEFINITION and LEMMA 208: Let  be a measure on S1 X S2.  is a coupling of the
projection of  onto S1, which we will write P( , S1), and the projection of  onto S2,
which we will write P( , S2).
LEMMA 209: All notation as above,
if tu1 + (1−t)u2 = μ,
then
tP(u1, S1) + (1−t)P(u2, S1) = P(μ, S1).
THEOREM 210: Extremal is closed under dbar limits.
Proof: Suppose S1, S2, S3... is a sequence of extremal processes which converge in dbar to
S. Select Sm close to S. Regard Sm and S to be measures on words of length n, n large.
Let λ be a good dbar match, i.e. a measure on the product space with S and Sm as
marginals. There are two small bad sets to consider.
There is a set B1 on the product space with small λ measure, such that off that set every
ordered pair in S × Sm is close in the mean hamming distance.
There is a set B2 in Sm of small measure, such that every exponentially fat submeasure of
Sm whose support is off B2 is close in dbar to Sm.
Let us refer to {(x,y): y ∉ B2 and (x,y) ∉ B1} as the good set.
There is a small set B3 on S, such that
for any x0 ∈ S − B3, λ{(x0,y): y ∈ Sm and (x0,y) is good} / λ{(x0,y): y ∈ Sm} is close
to one.
Fix an exponentially fat set F on S − B3. Again I am looking at a subset rather than a
submeasure, expecting the reader to generalize. Restrict λ to {(x,y): x ∈ F} and normalize
to get a probability measure ρ. ρ is a coupling between F and P(ρ, Sm) and furthermore is
a good expected mean Hamming match. P(ρ, Sm) is just as exponentially fat as F. Most of
P(ρ, Sm) is off B2 so it can be changed slightly to get P(ρ, Sm)′ completely off B2. F is close
to P(ρ, Sm), which is close to P(ρ, Sm)′, which is close to Sm, which is close to S in dbar.//
ABBREVIATION: We wish to show that approximating S with extremal Sm tends to
make S extremal. After dodging obnoxious sets, we extract from the dbar coupling of S
and Sm a dbar coupling of a prechosen fat submeasure of S with a fat submeasure of Sm.
COMMENT 211: The reader should note what goes wrong with the above proof if he
tries to prove that dbar limit of completely extremal is completely extremal.
Final preliminaries to the big equivalence theorem:
DEFINITION 212: Let …X-2, X-1, X0, X1, X2… be a stationary process and ε > 0. Let
…Y-2, Y-1, Y0, Y1, Y2… be the independent stationary process on {0,1} with
P(Yi = 1) = ε. We define a new process …Z-2, Z-1, Z0, Z1, Z2… as follows. Select i such
that Yi = 1. Let j be the least number greater than i such that Yj = 1. For each such i, j we let
Zi, Zi+1, …Zj-1 have the same distribution as Xi, Xi+1…Xj-1 but we require that
Zi, Zi+1, …Zj-1 be independent of …Zi-3, Zi-2, Zi-1.
…Z-2, Z-1, Z0, Z1, Z2… is called the ε startover process of …X-2, X-1, X0, X1, X2….
ABBREVIATION: The ε startover process is obtained by, at each time, running like the X
process with probability 1 − ε and starting over with probability ε.
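A simulation sketch of Definition 212 (my own illustration): the run lengths between restarts are geometric with parameter ε, and each run is filled with a fresh, independent block of the X process. The sampler sample_x_block is hypothetical; it stands for any way of drawing k consecutive letters of X.

import random

def startover_path(sample_x_block, eps, length):
    # one finite stretch of the eps startover process: at each time, with
    # probability eps, forget the past and begin an independent X block
    out = []
    while len(out) < length:
        run = 1                              # run length is geometric(eps)
        while len(out) + run < length and random.random() > eps:
            run += 1
        out.extend(sample_x_block(run))      # a fresh, independent X block
    return out[:length]

For instance, with sample_x_block = lambda k: [random.choice("ab") for _ in range(k)] the result is a startover version of an independent process.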
Preliminary 1)
Let X = …X-2, X-1, X0, X1, X2… be a stationary process.
Let Z = …Z-2, Z-1, Z0, Z1, Z2… be its  startover process. Then Z is an IC.
Idea of proof: Let m be large enough so that with very high probability Z has started over
by time m. Let n >> m. We independently concatenate the X1, X2, X3… Xn measure to form
Y = (Y1, Y2, Y3… Yn)(Yn+1, Yn+2, Yn+3… Y2n)(Y2n+1, Y2n+2, Y2n+3… Y3n)…. Now we
couple Y to Z using the following scheme.
i) We will first couple times 1 to n−m, n to 2n−m, 2n to 3n−m etc.
ii) Conditioned on how we coupled the above times we couple the remaining times
independently.
To accomplish (i), couple each set of times {kn+1,…(k+1)n−m} independently if Z did
not start over during times {kn−m,…kn} and identically the same if Z did start over
during {kn−m,…kn}.//
Preliminary 2) A coding of an independent process is IC.
Idea of proof: Essentially the same proof, letting m be twice the length of the cylinder
used to do the coding.//
Preliminary 3) EX implies ergodic.
Idea of proof: Non-ergodicity implies the existence of a finite word whose frequency of
occurrence in the randomly chosen doubly infinite word is non-constant. Select a < b
such that both the event that the frequency of the word is less than a and the event that
the frequency of the word is greater than b have positive probability. Show that these two
fat sets cannot possibly be close in dbar.
DEFINITION 213: Two stationary processes, X0, X1, X2… and Y0, Y1, Y2…
are said to be close in distribution if there is a large n such that
X0, X1, X2… Xn and Y0, Y1, Y2… Yn are within 1/n in the variation metric.
COMMENT 214: This ergodic theory concept of close in distribution should be
distinguished from the notion of close in distribution used by probabilists to indicate that
the distribution functions are close.
LEMMA 215: Let …X-2, X-1, X0, X1, X2… be a process and H be its entropy. For all m,
H = (1/m) lim(n → ∞) H(Xn+1 V Xn+2 V Xn+3,...Xn+m | Xn V Xn-1 V Xn-2 V Xn-3,...X0)
Proof: lim(n → ∞) H(Xn+1 V Xn+2 V Xn+3,...Xn+m | Xn V Xn-1 V Xn-2 V Xn-3,...X0) =
lim(n → ∞) H(X1 V X2 V X3,...Xm | X0 V X-1 V X-2 V X-3,... X-n)
which exists by theorem 157. We already showed that H = lim(n → ∞) (1/n)H(X0 V X1
V X2... V Xn)
and we already showed that
lim(n → ∞)[H(Xn|Xn-1Xn-2Xn-3,...X0)] = lim(n → ∞)[(1/n)H(X0 V X1 V X2... V Xn)]
by writing
H(X0 V X1 V X2... V Xn) = H(X0) + H(X1|X0) + H(X2|X1 V X0) + ... +
H(Xn|Xn-1 V Xn-2 V Xn-3,...X0).
67
Similarly, we can see that
lim( n ) H(Xn+1 V Xn+2 V Xn+3,...Xn+m| Xn V Xn-1 V Xn-2 V Xn-3,...X0)
= lim 1/nH(X0 V X1 V X2... V Xn) by writing for any k and r,
H( X0 V X1 V X2... V Xkn+r) =
= H(( X0 V X1 V X2... V Xr) + H(( X0 V X1 V X2... V Xn+r | ( X0 V X1 V X2... V Xr)
+
H(X0 V X1 V X2... V X2n+r | X0 V X1 V X2... V Xn+r) +
... H(X0 V X1 V X2... V Xkn+r | X0 V X1 V X2... V X(k-1)n+r).//
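As a numerical illustration of the chain rule used in this proof, the following sketch (mine; it assumes a two-state Markov chain, chosen only so the block laws can be enumerated exactly) computes (1/n)H(X0 V X1 V … V Xn-1) and watches it decrease toward the entropy H.

```python
import itertools
import numpy as np

Q = np.array([[0.9, 0.1], [0.3, 0.7]])                       # made-up transition matrix
pi = np.array([Q[1, 0], Q[0, 1]]) / (Q[0, 1] + Q[1, 0])      # stationary distribution

def block_entropy(n):
    """H(X0 V X1 V ... V X(n-1)) in bits, by enumerating all n-words."""
    h = 0.0
    for word in itertools.product(range(2), repeat=n):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= Q[a, b]
        h -= p * np.log2(p)
    return h

H = -sum(pi[a] * Q[a, b] * np.log2(Q[a, b]) for a in range(2) for b in range(2))
for n in (1, 2, 4, 8, 12):
    print(n, block_entropy(n) / n, H)    # (1/n)H(n-block) decreases toward H
```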
COMMENT 216: We are about to talk about the entropy of the m future given the past. It
is important not to get confused with the notation. H(the m future | the past) is not the
same thing as the conditioned entropy of the m future given the past. The former is a
fixed number, defined by taking a limit of
H(the m future | the n past) as n → ∞, and the latter is a random variable depending on
the past. However,
E(the conditioned entropy of the m future given the n past) = H(the m future | the n past)
by the conditioning property.
LEMMA 217: For all m,
H = (1/m)E(the conditioned entropy of the m future given the past).
Proof: By translation, lemma 215 says that (1/m)H(the m future | the n past) converges to
H as n → ∞. By the conditioning property,
E ( the conditioned entropy of the m future given the n past) =
H(the m future | the n past).//
Preliminary 4) Let ...X-3,X-2,X-1,X0,X1,X2,X3... be a stationary ergodic process. If you
select ε and let m be large enough, then for most (probability at least 1-ε) pasts, the m
future given the past is close (distance at most ε) in the variation metric to an
exponentially fat (size at least exp(-εmH)) submeasure of the unconditioned m future.
ABBREVIATION: Most conditioned futures are nearly exponentially fat submeasures of
the unconditioned future.
Idea of proof:
Consider the set of all names that eventually become and stay reasonable. It has
measure 1 by the Shannon Macmillan Breiman theorem. Hence for all pasts P except a set of
measure 0, the conditioned future given P assigns measure 1 to that set of names.
Fix a past P that is not in that set of measure 0. Conditioned on the past being P, for
sufficiently large m, most conditioned m names are reasonable for the unconditioned
future.
Now we make use of an elementary fact. If you have a random variable with a
finite expectation, and the variable usually takes on values at most slightly greater than its
expectation, and never takes on values much bigger than its expectation, then it rarely
takes on values much lower than its expectation.
By lemma 217,
i) E(the conditioned entropy of the m future given the past) = mH
ii) the conditioned entropy of the m future given the past is rarely much bigger than mH
because
a) usually most of the m future is on reasonable names, so
b) usually by making only a small change in the m future you can get a measure
which lives entirely on reasonable names.
c) Making this small change only slightly affects entropy.
d) The most entropy you can get if you live on reasonable names is when you
assign every reasonable name the same measure.
e) If you do that you get about mH entropy because by Shannon Macmillan there
are about 2^(mH) reasonable names.
iii) There is an absolute upper bound of mA for
(the conditional entropy of the m future given the past),
where A is the size of the alphabet.
Hence the conditioned entropy of the m future given the past is rarely much smaller than
mH. Use the conditioning property to show that when a measure lies almost entirely on
reasonable names and has entropy not much smaller than mH, then it is rare for a name to
have measure much bigger than 2^(-mH). Alter such a measure to live only on reasonable
names whose measure is not much bigger than 2^(-mH). If you then multiply their measures
by 2^(-εmH) you will get only reasonable names with measures less than reasonable. More
precisely, prove:
If you select ε and let m be large enough, then for most (probability at least 1-ε) pasts
you can alter the probability law of the m future given the past by less than ε in the
variation metric to get a measure ν such that
exp(-εmH) times ν is strictly less than
the unconditional probability law of the m future. Write
(the unconditional probability law of the m future) =
tν + (1-t)ρ for some measure ρ, where t = exp(-εmH).//
COMMENT 218: Suppose S is a small set in the unconditional m future. Then the
conditional measure of S given the past will be small for most pasts P. After you apply
Preliminary (4), moving the conditional future given such a P slightly to get an
exponentially fat submeasure, your exponentially fat submeasure will still assign small
probability to S. By making another small change you will get an exponentially fat
submeasure disjoint from S.
This ability to dodge a tiny set and still get your exponentially fat submeasure will
come in handy when we try to make deductions from extremality.
ABBREVIATION: Most conditioned futures are nearly exponentially fat submeasures of
the unconditioned future disjoint from a prechosen tiny set.
THEOREM 219: B ⇔ FB ⇔ IC ⇔ EX ⇔ VWB ⇔ FD.
Ideas of proof:
B ⇒ FB
Obvious.//
FB ⇒ IC
Every FB is a factor of an independent process, and such a factor is in turn a dbar limit of
codings of that independent process.//
IC ⇒ EX
We proved that independent processes Fn are extremal. An independent concatenation is
like Fn, where the letters of F are words of a fixed length. Then we use the fact that EX is
closed in dbar.//
EX ⇒ VWB
This follows from final preliminary (4) and the comment to preliminary (4) (namely, comment
218). We need the comment to preliminary (4), not just preliminary (4) itself, because EX
only means extremal, not completely extremal.//
EX ⇒ FD
Preliminary 4 merely says that sufficiently far in the future conditioned futures are nearly
exponentially fat subsets of the unconditioned, but it does not tell you how long to wait.
But the proof tells you how long to wait. As soon as most names are reasonable, you have
waited long enough. First pick n big enough so that most names are reasonable, and so
that n names are extremal. Then pick an approximating process close in n distribution and
entropy. Both conditioned n futures are usually fat submeasures of the unconditional, and
they both have essentially the same unconditional, so they can both be coupled in dbar to
that unconditional because the unconditional is extremal. Hence they can be dbar coupled
to each other. Throughout this proof we should be working with the comment to
preliminary (4) not with just preliminary (4) itself, because EX only means extremal, not
completely extremal.
This couples times 1,2…n given 0,-1,-2.... continue inductively coupling n+1,
…2n given n,n-1,..0,-1…etc. until you have a coupling of 1,2…nm given the past for
some m. Along the way you may come across some bad pasts which force a stage of
coupling to be bad but the expected mean Hamming distance overall is small. Of course
we have only shown that we can couple to a multiple of n, but if we have a big number
that is not a multiple of n, the remainder when divided by n is insignificant.
Preliminary 3 gives the rest of the definition of FD.//
COMMENT 220: You may wonder where we used the fact that processes are close in
entropy. We used preliminary (4) on both the original and the approximating process. To
guarantee that preliminary (4) would kick in by time n on the latter, we needed that the n
names of the approximating process tended to have “reasonable” probability. That would
not be the case if the entropy of the approximating process was not appropriate.
VWB ⇒ IC
Make an independent concatenation by taking n big, taking the measure on n names, and
repeating over and over. Couple this with the original inductively, piece by piece.//
FD ⇒ IC
Just observe that the ε startover process of a process is close to it in distribution and
entropy. Apply preliminary 1.//
THEOREM 221: The equivalent properties above are closed under taking factors.
Idea of proof: IC is closed under finite code factors, and under taking dbar limits.//
THEOREM 222: The equivalent properties above imply that either we are talking about
the trivial process or the process has positive entropy.
Idea of proof: 0 entropy means past determines future, easily contradicting IC.//
COUPLING INFINITE PATHS
COMMENT 223: All of the terms of theorem 219 are defined finitistically. However
frequently there are more natural descriptions of terms when we are allowed to consider
couplings of the infinite future. For the remainder of this section, we will explore what
we can do when such infinite path couplings are considered.
Stationary
THEOREM 224: Let …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… be two
stationary processes. Then
i) the dbar distance of X1, X2…Xn and Y1, Y2…Yn approaches a limit, say d, as
n → ∞.
ii) There is a stationary coupling of X1, X2… and Y1, Y2… so that with probability
1, a coupled pair ((a1, a2…),(b1, b2…)) has the property that the fraction of
i: ai ≠ bi in (a1, a2… an) and (b1, b2… bn) approaches a limit as n → ∞ and
the integral of that limit is d.
Idea of proof:
1) For any k > 0, n > 0,
the dbar distance between X1, X2…Xkn and Y1, Y2…Ykn ≥
the dbar distance between X1, X2…Xn and Y1, Y2…Yn,
because the mean Hamming distance between words a1, a2… akn and b1, b2… bkn is the
average of the mean Hamming distances of
{(ain+1, ain+2…a(i+1)n) and (bin+1, bin+2… b(i+1)n) : 0 ≤ i < k}.
2) Obviously, #{j ≤ kn+r : Xj ≠ Yj} – #{j ≤ kn : Xj ≠ Yj} ≤ r
for all k, n, and r.
3) Show that (1) and (2) give (i).
4) Let d be as in (i). We can get for all n a coupling
cn between X0, X1, X2… Xn and Y0, Y1, Y2… Yn
so that the mean Hamming distance obtained by cn converges to d as n → ∞. Use the
extension of the monkey method on the cn to get a stationary measure c on
(X0, X1, X2…) × (Y0, Y1, Y2…). Show
a) c is a coupling of
(X0, X1, X2…) and (Y0, Y1, Y2…)
b) When restricted to (X0, X1, X2… Xn) and (Y0, Y1, Y2… Yn), c achieves mean
Hamming distance d.
c) By the Birkhoff ergodic theorem, the fraction (ii) converges to a limit.
d) The integral of that limit is d by the bounded convergence theorem.//
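Since the dbar distance over n-blocks is an optimal-coupling problem, it can be computed outright for tiny examples. The sketch below (mine; the two block laws and the use of scipy's linear-programming solver are my own choices, not anything from the text) finds the coupling of two length-n block distributions that minimizes the expected mean Hamming distance.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def dbar_n(p, q, n, alphabet=(0, 1)):
    """Minimal expected mean Hamming distance over couplings of two n-block laws.
    p, q: dicts mapping length-n words (tuples) to probabilities."""
    words = list(itertools.product(alphabet, repeat=n))
    W = len(words)
    cost = np.array([[sum(a != b for a, b in zip(w1, w2)) / n for w2 in words]
                     for w1 in words]).ravel()
    # Equality constraints: row sums give the first marginal, column sums the second.
    A, b = [], []
    for i, w1 in enumerate(words):
        row = np.zeros(W * W); row[i * W:(i + 1) * W] = 1
        A.append(row); b.append(p.get(w1, 0.0))
    for j, w2 in enumerate(words):
        col = np.zeros(W * W); col[j::W] = 1
        A.append(col); b.append(q.get(w2, 0.0))
    res = linprog(cost, A_eq=np.array(A), b_eq=np.array(b), bounds=(0, None))
    return res.fun

# Example: fair-coin 2-blocks versus a biased, highly dependent 2-block law.
p = {w: 0.25 for w in itertools.product((0, 1), repeat=2)}
q = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
print(dbar_n(p, q, 2))
```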
Ergodic
THEOREM 225: If the …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… above are
ergodic, the coupling constructed above can be made to be an ergodic coupling.
Idea of proof: Theorem 224 gives a stationary coupling. Every stationary measure is a
convex combination of ergodic measures. Those ergodic measures must also be couplings
of the two processes …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2…,
because otherwise at least one of those two processes could be decomposed into distinct
processes.//
COMMENT 226: In the case of theorem 225, where we actually achieve an ergodic
coupling, the limiting mean hamming distance between the two coordinates is constant.
Hence any single ordered pair (except in a set of ordered pairs that the coupling assigns
measure 0) tells you the dbar distance between the processes.
IC
COMMENT 227: An independent concatenation, with period p, is in general not
stationary, but under Tp it is an independent process (prove this) and hence is both stationary and
ergodic. Thus theorem 225 applies when one or both of your processes is an IC as long as
you regard the transformation to be Tp instead of T.
FD
COMMENT 228: FD says close in distribution and entropy implies close in dbar.
Theorem 225 applies to get infinite couplings manifesting that close dbar match.
VWB
THEOREM 230: …X-2, X-1, X0, X1, X2… is VWB iff there is a set of pasts of
probability zero such that if past 1 = …a-2 a-1 and past 2 = …b-2 b-1 are not members of
that set, then
the conditioned measure of X0, X1, X2… given …X-2 = a-2, X-1 = a-1 and
the conditioned measure of X0, X1, X2… given …X-2 = b-2, X-1 = b-1
can be coupled with each other so that with probability 1 in the coupled process, a pair
((c0c1c2…),(d0d1d2…)) will obey
(1/n) #{i: 0 ≤ i ≤ n-1 and ci ≠ di} → 0 as n → ∞.
Proof:
Our goal is to construct a good dbar coupling between the future given past 1 and
the future given past 2. We construct that coupling as follows. Let k1, k2, k3… be a rapidly
increasing sequence of integers and let Ni be integers such that
Ni >> ki+1
for all i. Define Ii inductively as follows.
I0 = 0. Ii = Niki + Ii-1.
We now define our coupling piece by piece by induction. First we couple
the measure on terms 0,1,2… k1–1 given the first past.
with
the measure on terms 0,1,2… k1–1 given the second past.
We use the following procedure for doing this. The definition of VWB guarantees
that with high probability, if k1 is big enough, the measure on terms 0,1,2… k1–1 given
the first past is close in dbar to the unconditioned measure on words of length k1.
Couple these conditional and unconditional spaces as well as possible. Similarly couple
the measure on terms 0,1,2… k1–1 given the second past with the unconditional as well
as possible. Join these two couplings together to get a coupling of the two conditional
processes. Note that the expected mean hamming distance this coupling gives to the two
conditionals is at least as small as the sum of the distance between the first conditional
and the unconditioned and that of the second conditional and the unconditioned.
We now want to extend the above coupling to a coupling of
the measure on terms 0,1,2…2k1–1 given the first past.
with
the measure on terms 0,1,2…2k1–1 given the second past.
Recall the inductive method for extending a coupling to more terms. Let
((a0a1a2. . . ak1-1), (b0b1b2. . . bk1-1))
be a coupled pair in the earlier coupling. We need to couple
terms k1, k1+1…2k1-1 given a0a1a2. . . ak1-1 and past 1
with
terms k1, k1+1…2k1-1 given b0b1b2. . . bk1-1 and past 2.
Again we do this by optimally coupling both with the unconditioned process. Once we do
that for every pair
((a0a1a2. . . ak1-1), (b0b1b2. . . bk1-1))
we will have extended our coupling so that we have a coupling to terms 0,1,2… 2k1–1
given our two pasts. We now continue the process, looking at each coupled pair in that
coupling
((a0a1a2. . . a2k1-1), (b0b1b2. . . b2k1-1))
and using it to couple the terms 2k1, 2k1+1…3k1–1 so that in the end we have coupled
terms
0,1,2…3k1–1. Continue… 0,1,2…4k1–1 etc. until you reach 0,1,2… I1–1.
Continue using the same technique. Next extend to 0,1,2…. I1 + k2–1. Then
extend to 0,1,2…. I1 + 2k2–1. Then to 0,1,2…. I1 + 3k2–1 and on and on and on until
0,1,2…. I2–1, followed by 0,1,2…. I2 + k3–1…and by now I presume you see the pattern.
When you get done you have coupled the whole infinite future given past 1 with the
whole infinite future given past 2. We now show that this match will have the desired
property if the ki’s increase rapidly enough, and if each Ni is big enough in comparison to
ki+1.
Let us first consider the terms from I1 to I2–1. Let m be a positive integer less than
N2 and assume we have already defined the coupling for terms 0,1,2… I1 + mk2 –1. We
are now about to define the coupling for times I1 + mk2, I1 + mk2+1,… I1 + (m+1)k2 -1.
At time I1 + mk2, you think of I1 + mk2 as being the origin so that from the standpoint of
I1 + mk2, the “past” means times
I1 + mk2-1, I1 + mk2-2….
Hence the first past at time I1 + mk2, which we will call past(1, I1 + mk2), is obtained
from a knowledge of past 1 and
a0a1a2. . . aI1+mk2-1.
Similarly the second past at time I1 + mk2, which we will call past(2, I1 + mk2), is
obtained from a knowledge of past 2 and
b0b1b2. . . bI1+mk2-1.
VWB says that as long as k2 is sufficiently big, if past(1, I1 + mk2) is chosen
randomly, it will usually turn out to be “good” in the sense that with regard to times
I1 + mk2, I1 + mk2 + 1…. I1 + (m+1)k2 - 1
we will be able to achieve a good dbar match between the conditioned process
conditioned on
past(1, I1 + mk2) and the unconditioned process. If both
past(1, I1 + mk2) and past(2, I1 + mk2) are good,
then a nice match will occur between the two conditioned processes at times
I1 + mk2, I1 + mk2 + 1…. I1 + (m+1)k2 - 1.
However two things can go wrong. The first thing that can go wrong is that either
past(1, I1 + mk2) or past(2, I1 + mk2) can turn out to be bad. We will call this a type 1
error.
The second thing that can go wrong is that although both pasts turn out to be
good, and hence the match will be good, and even though the joint coupling gives high
probability to picking a pair
((aI1+mk2 aI1+mk2+1 … aI1+(m+1)k2-1), (bI1+mk2 bI1+mk2+1 … bI1+(m+1)k2-1))
that are close, when we actually use that coupling to pick this pair we may be unlucky
and pick them so that they are far apart in mean Hamming distance. We will call this a type 2
error.
The mere fact that these two kinds of errors have small probability would be
enough for us if all we wanted to do was to get convergence in probability, but we are
looking for convergence almost surely. We need to know that the frequency of type 1 and
type 2 errors not only get small but also stay small. We are about to handle type 1 errors
using the Birkhoff ergodic theorem, and type 2 errors using the concept of independence.
Type 1 errors. Let T be the transformation corresponding to the process.
We use the following:
a) Bad pasts have small probability,
b) Tk2 is measure preserving,
c) The Birkhoff ergodic theorem.
From these we conclude with probability almost 1,
there is a large n depending on k2 but not on N2 or I1 , (i.e. N2 and I1 can be chosen to be
arbitrarily large in comparison to n.) such that
“For all i with N2 >i>n,
1/i #{j: 0<j<i and past(1, I1+jk2) is bad}
is small, and
for all i with N2 >i>n,
1/i #{j: 0<j<i and past(2, I1+jk2) is bad}
is small.”
This says that with probability almost 1,
“For all i such that N2 > i>n,
1/i #{j: 0<j<i and an error of type 1 during times {I1+jk2, I1+jk2+1,… I1+(j+1) k2-1}
occurs} is small”
Type 2 errors: We use the following general principle. In an arbitrary probability space
suppose we consider a sequence of events such that each has small probability even when
we condition on what happened to the previous ones. Then for large n, with probability
almost 1,
“For all i greater than n,
(1/i)#(the subset of the first i events which actually occur) is small.”
Here # means “number of”.
That principle immediately implies that with probability almost 1, again letting n be large
(but not in comparison to N2 and I1),
“For all i with N2 > i > n, 1/i #{j: 0<j<i and an error of type 2 occurs during times
{I1+jk2, I1+jk2+1,… I1+(j+1)k2-1}} is small.”
Conclusion: Let ((a0, a1, a2,…),(b0, b1, b2,…)) be a randomly chosen couple from our final
coupling using the measure of that coupling. The above type 1, type 2 analysis implies
that if k2 is large and N1 is sufficiently large in comparison to k2, we can get the following
for a small ε1.
With probability at least 1-ε1,
“For all i such that N2 > i > ε1N1k2,
(1/(ik2))#{j: I1 ≤ j ≤ I1+ik2-1 and aj ≠ bj} < ε1.”
Going through a similar analysis, we could get a rapidly decreasing sequence of
ε1, ε2, ε3… such that
With probability at least 1-εg,
“For all i such that Ng+1 > i > εgNgkg+1,
(1/(ikg+1))#{j: Ig ≤ j ≤ Ig+ikg+1-1 and aj ≠ bj} < εg.”
By choosing the εg’s to be summable, Borel Cantelli says that with probability 1,
“For all sufficiently large g,
for all i such that Ng+1 > i > εgNgkg+1,
(1/(ikg+1))#{j: Ig ≤ j ≤ Ig+ikg+1-1 and aj ≠ bj} < εg.”
Upon reflection the reader will see that this last sentence implies that the density of times
that aj ≠ bj approaches zero.//
ABBREVIATION:
Ki rapidly increasing, Ni >> Ki+1. Couple two futures to times
K1,2 K1, … N1 K1, N1 K1 + K2, N1 K1 +2 K2, …N1 K1 + N2 K2, N1 K1 + N2 K2+ K3 …
successively inductively as well as possible. Focussing on just times between
N1 K1 + N2 K2+… Ng Kg and N1 K1 + N2 K2+… Ng+1 Kg+1 , the frequency of times that
either past is bad usually gets and stays small by Birkhoff and frequency of times pasts
are good but coupling is bad gets and stays small because each time you have another big
independent chance to get a good coupling.
WB
DEFINITION 231: Let …X-2, X-1, X0, X1, X2… be a stationary process.
Define a (generally nonstationary) process …Y-2, Y-1, Y0, Y1, Y2… by
…Y-2, Y-1 has the same distribution as …X-2, X-1
Y0, Y1, Y2… has the same distribution as X0, X1, X2… but unlike the X process,
…Y-2, Y-1 and Y0, Y1, Y2… are independent of each other.
Suppose that for every ε there is an m such that for every n,
X-(m+n),…X-(m+2),X-(m+1),X-m, Xm, Xm+1, Xm+2… Xm+n and
Y-(m+n),…Y-(m+2),Y-(m+1),Y-m, Ym, Ym+1, Ym+2… Ym+n
are within ε in the variation metric.
Then …X-2, X-1, X0, X1, X2… is said to be weak Bernoulli (abbreviated WB).
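For a concrete feel for the definition, the sketch below (mine; it assumes the process is a two-state Markov chain with an invented transition matrix) computes the variation distance between the joint law of a past block and a future block separated by a gap of 2m, and the law in which the two blocks are made independent. For a mixing Markov chain this gap distance drops quickly as m grows, which is why such chains are weak Bernoulli.

```python
import itertools
import numpy as np

Q = np.array([[0.9, 0.1], [0.3, 0.7]])                     # assumed transition matrix
pi = np.array([Q[1, 0], Q[0, 1]]) / (Q[0, 1] + Q[1, 0])    # stationary distribution

def word_prob(word):
    p = pi[word[0]]
    for a, b in zip(word, word[1:]):
        p *= Q[a, b]
    return p

def wb_gap(n, m):
    """Variation distance between (past n-block, future n-block across a gap of 2m)
    and the same pair made independent."""
    Qgap = np.linalg.matrix_power(Q, 2 * m)    # law of the bridge across the gap
    total = 0.0
    for past in itertools.product(range(2), repeat=n):
        for future in itertools.product(range(2), repeat=n):
            joint = word_prob(past) * Qgap[past[-1], future[0]] * \
                    word_prob(future) / pi[future[0]]
            total += abs(joint - word_prob(past) * word_prob(future))
    return total / 2

for m in (0, 1, 3, 6):
    print(m, wb_gap(3, m))    # the gap distance shrinks as m grows
```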
COMMENT 232: The reason weak Bernoulli was not defined earlier is that it is not
equivalent to the equivalent properties of the bold face theorem (theorem 219). It is
stronger than they are.
COMMENT 233: I am hereby establishing a probably false rumor that there are people
who went insane upon finding out that weak Bernoulli is stronger than Bernoulli. Let me
see if I can explain this rather perverted notation.
The word “Bernoulli” has two definitions depending on whether you are doing
probability theory or ergodic theory. In Probability theory it means an independent
process. In ergodic theory and in this book, it means isomorphic to an independent
process. The word weak Bernoulli was chosen because the property is weaker than the
former definition. But it is stronger than the latter definition.
DEFINITION 234: [ ] means greatest integer of (e.g. [5.326] = 5, [6.77772] = 6,
[π] = 3, [-4.02] = -5). This notation will only be used when we state that we are using it,
because we want to be able to use brackets to mean just brackets.
DEFINITION 235: Parity tells you evenness or oddness (e.g. parity of 4 is even)
EXERCISE 236: We use greatest integer notation here. Let i be an irrational number and
let …X-2, X-1, X0, X1, X2… be an independent process, each Xi taking values 1 or –1
with probability ½ each. Let Sn be
0 if n is 0,
X0 + X1… +Xn-1 if n is positive, and
-(X-n…+X-2+ X-1 ) if n is negative
and U be uniformly distributed on the unit interval. Show that “parity of [2U + i Sn]” is a
stationary process (as n runs) and in fact that it is very weak Bernoulli, but it is not weak
Bernoulli. (Hint: For VWB, run conditional futures U+iSn independently until they are
close and then keep them close.)
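A simulation sketch of the process in this exercise (mine; alpha plays the role of the irrational number that the text calls i):

```python
import math
import numpy as np

def parity_process(T, alpha=math.sqrt(2), seed=0):
    """Sample the process 'parity of [2U + alpha*S_n]' for n = 0..T-1,
    where S_n is a +/-1 random walk and U is uniform on [0,1)."""
    rng = np.random.default_rng(seed)
    U = rng.random()
    X = rng.choice([-1, 1], size=T)                    # the independent +/-1 process
    S = np.concatenate(([0], np.cumsum(X[:-1])))       # S_0 = 0, S_n = X_0+...+X_{n-1}
    return [int(math.floor(2 * U + alpha * s)) % 2 for s in S]

print(parity_process(20))
```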
We now develop some techniques for altering a coupling to get another coupling. These
techniques will be used to help us analyze weak Bernoulli.
EXERCISE 237: Let and  be two probability measures and let c be a coupling of the
two of them. Let  be a positive measure that is less than or equal to than c on all sets and
let  be a measure that has the same marginals as . (i.e. it has the same projections to the
axes.). Then c –  +  is also a coupling of  and .
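Here is a finite toy version of this operation (mine; the measures are matrices over a three-point space and the numbers are arbitrary): remove a sub-measure of the coupling c, put back any measure with the same marginals, and check that the marginals of c are unchanged.

```python
import numpy as np

# c: a coupling of mu (row sums) and nu (column sums) on a 3 x 3 grid.
c = np.array([[0.2, 0.1, 0.0],
              [0.0, 0.3, 0.1],
              [0.1, 0.0, 0.2]])
lam = np.minimum(c, 0.1)            # a positive measure with lam <= c everywhere
# lam_new: any measure with the same marginals as lam (here: the product measure
# built from lam's marginals, rescaled to lam's total mass).
r, s = lam.sum(axis=1), lam.sum(axis=0)
lam_new = np.outer(r, s) / lam.sum()

c_new = c - lam + lam_new
print(np.allclose(c_new.sum(axis=1), c.sum(axis=1)),   # same mu
      np.allclose(c_new.sum(axis=0), c.sum(axis=0)),   # same nu
      (c_new >= -1e-12).all())                         # still a nonnegative measure
```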
COMMENT 238: If μ and ν are positive measures on X × Y and μX and νX are their
projections onto X respectively, then a coupling of μX and νX can be extended to a
coupling of μ and ν. We already showed that (see example 4 of our section on coupling)
when μ and ν are probability measures, but the reason we are repeating it is to point out
that this statement is still valid when μ and ν are not probability measures.
LEMMA A: We now extend the exercise above to more than one dimension to form a
more elaborate way of altering a coupling. Suppose μ and ν are measures on X × Y.
Let c (a measure on (X × Y) × (X × Y)) be a coupling of μ and ν.
Let λ be a positive measure on (X × Y) × (X × Y)
that is less than or equal to c on all sets.
Let λX be a measure (on X × X) with the same marginals as the projection of λ on X × X.
λX can be extended to a measure λ' on (X × Y) × (X × Y) which has the same marginals as λ.
c - λ + λ' is another coupling of μ and ν.
Idea of proof: This is just exercise 237 and comment 238 put together.
DEFINITION 239: On a space X × X, the set of all points (x,x) in which the first
coordinate is the same as the second coordinate is called the diagonal.
LEMMA B: Given:
μ a measure on X × Y,
ν a measure on X × Y,
c a coupling of μ and ν,
d a coupling of the projection of μ on X with the projection of ν on X.
If
all but ε of the measure of d is on the diagonal, and
all but δ of the measure of c is on the diagonal,
then
d extends to a coupling g of μ and ν such that
all but at most ε + δ of the measure of g is on the diagonal.
ABBREVIATION: Given μ, ν, and a coupling c of μ and ν,
a coupling d of (μ on X) and (ν on X) extends to a coupling g
of μ and ν, and g can be almost diagonal if c and d are.
Idea of proof: Show the existence of a λ which is less than or equal to c,
whose support is entirely on the diagonal of (X × Y) × (X × Y),
whose projection on X × X is less than or equal to d, and
whose measure is at least 1 – (ε + δ).
Remove λ from c and remove the projection of λ from d. The projection of what remains
from c has the same marginals as what remains from d. Apply lemma A and then put λ
back.
DEFINITION 240: Let …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… be two
processes and c be a coupling of the two processes. Suppose that c assigns probability 1
to paths in the product space such that there is an n0 (n0 depends on the path) such that
for all n > n0, Xn = Yn. Then c is said to eventually agree.
THEOREM 241: Let …X-2, X-1, X0, X1, X2… be a stationary process. It is a weak
Bernoulli process iff for all pasts P off a set of pasts of measure 0, we can couple the
conditioned future, conditioned on the past being P, to the unconditioned future with a
coupling that eventually agrees.
Idea of proof: We leave “if” as an exercise to the reader. We now prove “only if”.
Consider the process …Y-2, Y-1, Y0, Y1, Y2… in the definition of weak Bernoulli. That
definition implies that for all ε there is an m such that for all n, there is a coupling of the
two processes (…X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2…) which, when restricted to
times
(-(m+n), -(m+n-1),…..-m, m, m+1….m+n),
is on the diagonal with probability at least 1-ε, where ε is small. We can shift the
coupling by m-1 to get times
(-n-1,-n….-1,2m-1,2m,2m+1…2m+n-1).
We now apply lemma A to take that part of the coupling in which there is disagreement
on any time (-n-1,-n….-1) and replace it with something which is on the diagonal for
times (-n-1,-n….-1). What we have now is a coupling of …X-2, X-1, X0, X1, X2… and
…Y-2, Y-1, Y0, Y1, Y2… on times (-n-1,-n….-1,2m-1,2m,2m+1…2m+n-1) which is on the
diagonal with probability 1 for times (-n-1,-n….-1) and has a probability of at most ε of
disagreeing on any time 2m-1,2m,2m+1…2m+n-1.
Take a subsequential limit as n → ∞ so that our coupling is now defined
for all times except times 0,1,…2m. We get an m for every ε. To indicate this we will
write m = m(ε). Fix a large M and consider the coupling cM determined by 1/5^M and
m(1/5^M). By observing cM, cM-1 and lemma B (where we project onto all times except
(0,1,…2m(1/5^M))), we get a coupling which
is defined for all times except (0,1,…2m(1/5^(M-1))),
is the diagonal on the past,
is the diagonal except probability 1/5^M on all times greater than m(1/5^M),
and
is the diagonal except 1/5^M + 1/5^(M-1) < 1/2^(M-1) on all times except (0,1,…2m(1/5^(M-1))).
Repeating the process using that coupling and cM-2 we get
a coupling defined on all times except (0,1,…2m(1/5^(M-2))) which
is the diagonal on the past,
is the diagonal except probability 1/5^M on all times greater than m(1/5^M),
is the diagonal except 1/2^(M-1) on all times except (0,1,…2m(1/5^(M-1))),
and
is the diagonal except 1/2^(M-2) on all times except (0,1,…2m(1/5^(M-2))).
Repeat again and again using cM-3, cM-4…etc. until you have a coupling defined on all
times, which is the diagonal on all times greater than ci with probability at least 1-(1/2)^i
for all i < M. Take a subsequential limit as M → ∞ to get a coupling of the two processes
which is on the diagonal for the past and which eventually agrees with probability 1 in
the future.
Now it is simply a matter of conditioning this coupling on a given past. The future
Y process is the unconditioned future, the future X process is the conditional process
given the given past, and those two processes are coupled so that they will eventually
agree. //
ABBREVIATION: Couple times –n,-n+1…-m, m, …n with the almost identity
coupling. Shift to m-n,m-n+1…0, 2m, …m+n. Fudge the coupling to get the identity on
m-n,m-n+1…0. Take the limit n → ∞ to get all times …-2,-1,0,m,m+1…. For
k < m there is a way to take one of these couplings on times
…-2,-1,0,m,m+1… and another on …-2,-1,0,k,k+1… and get another on
…-2,-1,0,k,k+1… so that if the former is the identity except measure ε and the second is
the identity except measure δ, the third will be an extension of the first and the identity
except measure ε + δ. Repeating this type of manipulation and taking a limit we end up
with a coupling on all times; identity on the past and eventually agreeing on the future.
COMMENT 242: This comment refers to both our discussions of weak Bernoulli and
very weak Bernoulli. We have perhaps confused you by sometimes coupling the
conditioned measure given P1 with the conditioned measure given P2 for any two pasts P1
and P2, and sometimes coupling the conditioned future given P1 with the unconditioned
future
for the purpose of defining a metric. The purpose of this comment is to indicate that it
doesn’t matter which of these two things we do. They are equivalent.
If we can get an ε match by coupling a conditioned future given any past P with
the unconditioned future, then for any two P1 and P2 we can glue together two couplings
to get a coupling that gives a 2ε match between the two conditioned futures.
Conversely, if we can couple any two conditioned futures in a measurable way, then
we can consider the coupling for P1 and P2, regard P2 as a random variable and integrate
out the coupling as you let P2 run, to form a coupling of the conditioned future with the
unconditioned future which gives an ε match. The difficulty in this proof is that you can’t
just choose your coupling with the axiom of choice for each P2 or you will get a
nonmeasurable mess. You need to show that you can choose your coupling without the
axiom of choice.
ORNSTEIN ISOMORPHISM THEOREM
COMMENT 243: This comment is only directed to people who already know the theory.
There is no marriage lemma in this proof of the Ornstein isomorphism theorem.
1) Copying In Distribution
THEOREM 244: Let P,T be a process, ε > 0. Let m be chosen. If n is sufficiently large,
then for any Rohlin tower R of
i) height n
ii) error set less than ½ in measure,
for all but measure ε of the P columns of R, the distribution of m names in a
column is within ε in the variation metric of the distribution of m names of the process.
ABBREVIATION: Columns in a large Rohlin tower tend to have the right m distribution.
Idea of proof: n >> M >> m. Birkhoff ergodic says that most words of length M have
about the right frequency of names of length m so if the theorem is false there must be
many columns with many inappropriate M words. A typical word of length 5n would be
very likely to contain such a bad column, contradicting Birkhoff ergodic for M names.//
TECHNIQUE 245: How to copy in distribution: Suppose P,T is a process. S is another
transformation, perhaps on another space, and you want a partition Q on the latter space,
so that Q,S has similar distribution to P,T. Here is a procedure for getting Q.
DEFINITION 246: If P,T is a process, and if S is a transformation on another space with
Rohlin tower R of height n on the latter space, then we say we are copying P,T to R, if
we construct a partition Q in the latter space in the following manner. For every set in P
associate a distinct letter of the alphabet. These letters will designate sets in Q. Place
another Rohlin tower R’ of height n on the former space (space of T). P breaks R’ into
columns. Break R into columns of the same size as the columns of R’ and identify each
column of R with a column of R’ which has the same size. If the rungs of a given column
of R’ are in P1P2...Pn (from bottom to top) and if Q1,Q2,...Qn are the letters
corresponding to P1,P2,...Pn respectively, define Q so that the rungs of the corresponding
column, (from bottom to top) will be in Q1,Q2,... Qn respectively. As always, assume tiny
error sets for R and R’.
THEOREM 247: Let P,T be a process, S a transformation on a possibly different space.
Suppose you would like a partition Q on the latter space such that the P,T process and the
Q,S process have approximately the same distribution on n names. All you have to do is
to make a Rohlin tower R in the latter space sufficiently big (much bigger than n) and
copy P,T to R.
Idea of proof: Immediate from theorem 244.
DEFINITION 248: If we follow the procedure of theorem 247, we say that we are
copying the n distribution of P,T to S.
DEFINITION 249: In both of the above definitions, if we don’t wish to explicitly
mention the n or the R we can simply say copy P,T to get Q so that Q,S looks like P,T.
When you do that it is presumed that n is large.
COMMENT and DEFINITION 250:
Suppose that T and S are transformations on possibly different spaces. P and B are
partitions on the same space as T.
If you have already copied P,T to R, to get a partition Q, so that Q,S looks like
P,T and you want to get a partition C so that (Q V C),S looks like (P V B),T then you can
do so in the obvious way by making more columns in R. We say that we are copying
P,B,T to get Q,C,S, or we are copying B to get C so that Q,C,S looks like P,B,T.
COMMENT 251: Here the conclusion is strong. It says that if the copy was made to
force the n distribution of (P,T) to look like the n distribution of (Q,S), then we can get the n
distribution of (Q V C,S) to be close to the n distribution of (P V B,T).
COMMENT 252: Suppose you know that (Q,S) looks like (P,T). More precisely,
(*) every set of Q is associated with a set of P and for some large n we know that the
distribution of n names of (P,T) is close to the distribution of n names of (Q,S).
Comment and definition 250 presupposes that (*) was achieved by copying (P,T) to (Q,S)
using Rohlin towers. However, suppose (*) just happens to be true but we did not force it
to be true using Rohlin towers. The next exercise shows that in that case comment 251
turns out to be false.
EXERCISE 253: Let ({a,b},T) be the process on two letters which takes on the
following four words with probability ¼ each.
…aaaaaaaa…
…abababab…
…babababa…
…bbbbbbbb…
Let C be defined to be T(a) and D be defined to be T(b). Let ({heads, tails},S) be
independent coin flipping on letters {heads, tails} with probability (½,½) apiece. Show
that ({a,b},T) has the same 2 distribution as ({heads,tails},S), but there is no way to get
partition Q such that the 2 distribution of (Q V {heads,tails},S) is less than ¼ in the
variation distance from the 2 distribution of ({C,D} V {a,b},T).
COMMENT 254: However we can get the following weaker result.
THEOREM 255: Let ε > 0 and processes (P,T) and (Q,S) be given. Then for every integer
m > 0, there exist an integer n > 0 and a real number δ > 0 such that for any partition
B on the space of T, if the n distributions of (P,T) and (Q,S) are closer than δ in the
distribution distance, then you can get a partition C such that the m distributions of
(P V B,T) and (Q V C,S) are closer than ε in the variation distance.
Idea of proof: This time we have to pick our Rohlin towers more carefully. Choose n so
large that by Birkhoff most of the words of length n have the right distribution of m
words. By comment 44, put towers R and R’ on the T and S spaces respectively so that
the P columns (respectively Q columns) have exactly the same distributions as the
distribution of n names of (P,T) (resp. (Q,S)). Since those distributions are essentially the
same, you can copy B to get C so that (P,B,R) looks almost exactly like (Q,C,R’).//
COROLLARY 256: If (Q,S) is a perfect copy of (P,T) (exactly the same n distribution
for all n), then for any m you can copy B to get C so that (P V B,T) and (Q V C,S) are as
close as you want in m distribution.
DEFINITION 257: If P and Q are partitions on the same space, and we are identifying Q
with P, by corresponding each Pi in P with a Qi in Q, we define P intersect Q to be the
union over all i of Pi intersect Qi. The symmetric difference between P and Q is the
complement of P intersect Q.
DEFINITION 258: The distance between P and Q, denoted |P-Q|, is the measure of the
symmetric difference between P and Q.
COMMENT 259: You can also use the above definition to refer to two sets as opposed to
two partitions. From now on, if we just say “distance” or “close” without mentioning the
words “variation” or “mean hamming” or “dbar” and if it is not obvious that there is
some other metric that we are discussing, then we are referring to the above metric.
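A toy computation of this partition distance (mine; the space is a six-point set with uniform measure and the partitions are given as arrays of piece labels):

```python
import numpy as np

def partition_distance(P_labels, Q_labels):
    """|P - Q|: the measure of the set where the identified pieces disagree,
    for partitions of a finite uniform space given as arrays of piece labels."""
    P_labels, Q_labels = np.asarray(P_labels), np.asarray(Q_labels)
    return np.mean(P_labels != Q_labels)

P = [0, 0, 1, 1, 2, 2]       # piece of P containing each of the 6 points
Q = [0, 1, 1, 2, 2, 2]       # Q, with Q_i identified with P_i
print(partition_distance(P, Q))   # 2 of 6 points lie in the symmetric difference -> 1/3
```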
2) Coding
DEFINITION 260: Suppose f is a homomorphism from P,T to Q,S. Then f can be
approximated by a finite code, i.e. there is some n such that the
-n,-n+1...n-1,n P name determines the atom of Q that the image of f is in, up to some
small probability ε. We will say P,T ε codes Q,S well by time n. Saying the same thing in
the language of processes, if …Y-2, Y-1, Y0, Y1, Y2… is a factor of …X-2, X-1, X0, X1,
X2… , we will say that the X process ε codes the Y process well by time n, if
X-n…X-2, X-1, X0, X1, X2… …Xn determines Y0 up to probability ε.
DEFINITION 261: If …X-2, X-1, X0, X1, X2… codes …Y-2, Y-1, Y0, Y1, Y2…
well by time n then the code is a function from words of length 2n +1 in the X process to
words of length 1 in the Y process such that if you guess Y0 to be the image of
X-n…X-2, X-1, X0, X1, X2… …Xn,
you have a high probability of being right.
DEFINITION 260 in more detail:
Of course, when we know that
…Y-2, Y-1, Y0, Y1, Y2… is a factor of …X-2, X-1, X0, X1, X2…
and we merely say that the X process ε codes the Y process by time n, we mean that, using
the same code for all i,
Xi-n… Xi+n makes a guess as to what Yi is
and that guess is correct with probability at least 1-ε. This follows from definition 260 by
stationarity if we know the code came from a homomorphism.
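As a concrete (and entirely hypothetical) illustration, the sketch below estimates how well an X process ε codes a factor Y by time n; the factor is my own invention, chosen so that it needs an unbounded but usually short window, so the empirical error drops roughly like 2^-n.

```python
import numpy as np

def sample_XY(T, seed=0):
    """X is an i.i.d. fair-bit process; Y_i is the parity of the distance from i to the
    next time X equals 1 (a hypothetical factor needing an unbounded window)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=T)
    Y = np.zeros(T, dtype=int)
    for i in range(T):
        j = i + 1
        while j < T and X[j] != 1:
            j += 1
        Y[i] = (j - i) % 2
    return X, Y

def code_error(n, T=20000):
    """Empirical error of the finite code that only looks at X_{i+1},...,X_{i+n}."""
    X, Y = sample_XY(T)
    wrong = 0
    for i in range(T - n - 1):
        window = X[i + 1:i + n + 1]
        ones = np.flatnonzero(window == 1)
        guess = (ones[0] + 1) % 2 if ones.size else 0   # no 1 seen: guess arbitrarily
        wrong += (guess != Y[i])
    return wrong / (T - n - 1)

for n in (1, 2, 4, 8):
    print(n, code_error(n))   # the X process (roughly 2^-n)-codes Y by time n
```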
DEFINITION 262: In the above definition, if T = S then we simplify our language by
saying P,T codes Q well by time n.
COMMENT 263: The reason that P,T must code Q by time n for some n is that f-1(Q) is
a partition which can be approximated by a partition of cylinder sets.
THEOREM 264: S,Q is a factor of T,P iff there is a code from (T,P) to (S,Q) for each n
in such a way that
for all ε there exists n such that (T,P) ε codes (S,Q) by time n.
Idea of proof:
Only if: Let f: (T,P) → (S,Q) be the factor map. f-1(Q) is a partition in the σ-algebra
generated by P. This partition can be approximated by cylinder sets.
If: Let εi be summable and ni be the corresponding value of n for εi. By Borel Cantelli the
ni code correctly tells what set in Q the point f(x) is in for all but finitely many ni. By
translation the same is true for Ti(Q) for any i. This defines the homomorphism.//
DEFINITION 265: Suppose P and P’ are two partitions identified with each other in that
they have the same number of pieces, and every distinct set in P is identified with a
distinct set in P’. Suppose Q and Q’ are two similarly identified partitions. We say that
P,T codes Q,S well by time n in about the same way that P’,T’ codes Q’,S’ if P,T
codes Q,S well by time n, P’,T’ codes Q’,S’ well by time n, and furthermore the codes
we use are the same.
COMMENT 266: Note that if P,T codes Q well by time n in about the same way that P,T
codes Q’, (Remember that in this case T is the only transformation under consideration)
then Q and Q’ are close to each other.
COMMENT 267: Fix positive integer n. Let T and S be transformations. Suppose Q,T is
a factor of P,T (so that P,T codes Q,T by time n). Suppose we copy P,Q,T to get P’,Q’,S.
If the copy is good enough, P,T codes Q about the same as P’,S codes Q’ by time n.
Note that if we use a Rohlin tower to do our copying, it will have to be much bigger than
n.
COMMENT 268: True or false: Suppose
εi > 0 for all i, Σ(εi) < ∞, there are partitions Pi, integers ni, |Pi+1 - Pi| < εi, and for all i,
(T, Pi) εi codes Q well by time ni.
Then
Pi converges to a P such that T,Q is a factor of T,P.
Answer: False. It may be the case that for all i and all j > i, (T, Pj) does not code Q well
by time ni so that (T, P) does not code Q well for any n. The point is that we only know
(T, Pj) codes well by time nj, not by time ni. However this statement can be modified to be
true as follows:
THEOREM 269: Suppose
εi > 0 for all i, Σ(εi) < ∞, there are partitions Pi, increasing integers ni, |Pi+1 - Pi| < εi/ni, and
for all i, (T, Pi) εi codes Q well by time ni.
Then
Pi converges to a P such that (T, Q) is a factor of (T,P).
Idea of proof: For all j > i, moving from Pj-1 to Pj worsens the ni code by at most
(2ni + 1)εj/ni, because each coordinate from -ni to ni is altered with probability at most
εj/nj, which is less than εj/ni.
------------------------------------------
3) The Land of P
DEFINITION 270: Let T be a transformation, and let P be a partition which does not
necessarily generate. The σ-algebra generated by the Ti(P), where i runs over all integers, is
called the land of P.
DEFINITION 271: We can make copies of σ-algebras, partitions etcetera in the land of
P, just by simply doing everything in that σ-algebra (the land of P), because the land of P
is a Lebesgue space.
Example 1: Suppose T is a transformation and you want to copy a partition in the land of
P. You can make a Rohlin tower in the land of P, then partition the base in the land of P
etc. The beauty of all this, is that whatever you end up with ends up being a factor of P,T.
Example 2: Fix a transformation T and an integer n. Suppose P, P1, P2 are partitions, and
you have a copy P3 of P1 so that P3,T has exactly the same distribution as P1,T. You
would like to make a good copy P4, in the land of P, of P2 so that P3,P4,T has about the
same n distribution as P1,P2,T. You can do so as long as the P3,T process is in the land of
P, even if the P1 and P2 processes are not. On the other hand, you cannot do it if the P3,T
process is not in the land of P, because you use that process to construct P4.
------------------------------------------
4) Capturing entropy
COMMENT 272: Let P,T be a process and S be another transformation. Copy the n
distribution of P,T to S, thereby getting a process Q,S whose n distribution is close to that
of P,T. Is there anything we can say about the comparison of the entropy of Q,S with that
of the entropy of P,T? To answer that, recall that the entropy of P,T is the decreasing
limit of
(*) (1/n)H(V(i from 1 to n) T-i(P)).
This means that the n distribution of P,T already gives an upper bound for the entropy of
P,T. Hence if Q,S has been selected to have approximately the same n distribution as P,T
and if n is large enough for (*) to be close to its limit, we will already know that Q,S could
not possibly have much more entropy than P,T.
However, the entropy of Q,S could conceivably be much less than the entropy of
P,T no matter how well we copy the distribution of P,T to get Q,S. In short, a good copy
of distribution guarantees a big enough entropy (approximately), but not a small enough
entropy. If we want to guarantee that Q,S has approximately the same entropy as P,T, we
need to have some way of holding its entropy up. The purpose of this section is to
develop techniques which generate a copy Q,S in such a way as to hold up its entropy,
thereby insuring that Q,S and P,T have approximately the same entropy.
DEFINITION 273: Let P be a partition, T a transformation and n be a positive integer.
The P,T,n entropy drop is defined to be
(1/n)H(V(i from 1 to n) T-i(P)) – H(P,T).
EXERCISE 274: Fix a stationary distribution for words of size n. Prove that, among processes
with that n distribution, the n-1 step Markov chain has minimum P,T,n entropy drop.
THEOREM 275: Let Q refine P. Then
H(P) – ½H(P V T(P)) < H(Q) - ½H(Q V T(Q)).
Idea of proof: Recall the following principle: If A,B, and C are three partitions such that
C refines B, then
H(A V C) – H (C) < H(A V B) – H(B). Hence
H(Q V T(Q)) - H(P V T(Q)) < H(Q) - H(P) and
H(P V T(Q)) - H(P V T(P)) < H(T(Q)) - H(T(P)) = H(Q) - H(P).
Add these two inequalities together.//
COROLLARY 276: Let Q refine P. Then
(1/n)H(V (i from 1 to n) Ti(P)) – (1/(2n))H(V (i from 1 to 2n) Ti(P)) <
(1/n)H(V (i from 1 to n) Ti(Q)) – (1/(2n))H(V (i from 1 to 2n) Ti(Q)).
Proof: Just plug in theorem 275, replacing
P with V (i from 1 to n) Ti(P),
Q with V (i from 1 to n) Ti(Q), and
T with Tn.//
COROLLARY 277 : Let Q refine P. Then P,T,n entropy drop < Q,T,n entropy drop for
all n.
Idea of proof: The P,T,n entropy drop is
(1/n)H(V (i from 1 to n) Ti(P)) – (1/(2n))H(V (i from 1 to 2n) Ti(P)) +
(1/(2n))H(V (i from 1 to 2n) Ti(P)) – (1/(4n))H(V (i from 1 to 4n) Ti(P)) + …
THEOREM 278: P,T ergodic with entropy H, ε > 0 arbitrary: for all n sufficiently large,
for all Rohlin towers R of height n, for all but measure ε of the P,T columns of R, the
width of the column is between exp(-(Hn+εn)) of the measure of the base and exp(-(Hn-εn)) of the measure of the base.
Idea of proof: This theorem immediately follows if we know that the bad set of the
Shannon Macmillan theorem does not contain a large fraction of the base because the
width of a column is exactly the size of a word in the base of that column. We can’t
easily prove that the base has a small bad set, but we will now point out that there are
nearby rungs with small bad set.
Pick δ and γ such that 0 < δ << γ << ε. By Shannon Macmillan Breiman, pick
the height of your tower, n, so that for all N > (1-γ)n, outside of a bad set of size δ, all N
names have size between exp(-N(H+δ)) and exp(-N(H-δ)). Among the sets
(first rung, second rung, … (γn)th rung)
there must be a rung r such that less than ε/2 of the measure of r is in the bad set, and
among the sets
(T-1(first rung), T-2(first rung)… T-γn(first rung)) (prove these sets to be disjoint)
there must be a set s such that less than ε/2 of s is in the bad set. The existence of r
establishes that all but ε/2 of the columns are sufficiently small and the existence of s
shows that all but ε/2 of the columns are sufficiently big, because every name of size n in
the base is contained in a name of size at least n(1 - γ) in r and contains a name of size at
most n(1 + γ) in s.//
ABBREVIATION: Since the Shannon Macmillan bad set is small, there are rungs near the
base, above and below it, with small bad set.
THEOREM 279: Young children’s puzzle theorem:
Let P be a partition containing m sets and Q be a partition whose sets are all smaller than
ε in measure. Then there is a partition P’ such that
i) Q is finer than P’
ii) The variation distance between the distributions of P and P’ is less than mε.
Proof left to reader.
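One way the construction might go (a sketch of mine, under the assumption that the total measures match; it is not necessarily the author's intended proof): hand the small pieces of Q out greedily, always to the set of P whose target measure is currently most underfilled. Each set of P' is then a union of whole pieces of Q whose measure tracks the corresponding set of P closely.

```python
def childrens_puzzle(p_measures, q_pieces):
    """Greedily group the pieces of Q (a list of their measures) into len(p_measures)
    groups so that group i has total measure close to p_measures[i]."""
    assignment = [None] * len(q_pieces)
    filled = [0.0] * len(p_measures)
    for j, piece in sorted(enumerate(q_pieces), key=lambda t: -t[1]):
        # put the piece wherever the deficit (target minus current fill) is largest
        i = max(range(len(p_measures)), key=lambda i: p_measures[i] - filled[i])
        assignment[j] = i
        filled[i] += piece
    return assignment, filled

targets = [0.5, 0.3, 0.2]                       # measures of the sets of P
pieces = [0.04] * 10 + [0.03] * 20              # measures of the (small) sets of Q
assignment, filled = childrens_puzzle(targets, pieces)
print(filled)    # each group's measure stays close to its target in P
```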
COMMENT 280: The young children’s puzzle theorem and a theorem we will later state
called the mature children’s puzzle theorem are obviously meant to be our substitution
for a marriage lemma. However they are much easier to prove than a marriage lemma.
Even your young children can tell you how to make big pieces out of little pieces. The
advantage of not using a powerful marriage lemma is that if you want to investigate the
details of the isomorphism that you get, it will be much harder to do that if you used a
deep theorem to get that isomorphism.
COMMENT 281:
True or false: If we have a partition and take a set of pieces of the partition of small
measure and lump it together you won’t decrease the entropy of the partition by very
much.
Answer: False. Let S be that small set. By the conditioning property, if we blow up P
when restricted to S by normalizing the measure on S to a probability measure, and blow
up P on the complement of S by blowing up its measure to a probability measure,
then
H(P) = μ(S)H(blow up of P on S) + (1-μ(S))H(blow up of P on the complement of S).
It is conceivable that
μ(S)H(blow up of P on S) >> (1-μ(S))H(blow up of P on the complement of S)
so that lumping some of S could have a big effect. However, we nevertheless have the
following.
LEMMA 282: Let P be a finite partition with #P pieces and S be a set of names in the P,T
process. If μ(S) << 1/log(#P), lumping S together will not significantly alter
(1/n)H(the set of n names), regardless of what n is.
Idea of proof: The total number of possible names of length n is (#P)^n. Hence the blown
up partition on S can have at most (#P)^n pieces. The largest its entropy can conceivably
be is if all those sets have the same measure, in which case it is n·log(#P).
Technique 283: We are finally ready to answer the question posed in the beginning of
this section: how to copy a distribution while holding up the entropy.
We are given a process P,T, an integer n, and another transformation S whose entropy is
greater than that of T. Here is how to construct a partition Q so that the n distributions and
entropies of P,T and Q,S are as close as you want (as long as you don’t insist on
equality).
Among other things we are copying the n distribution and we already know how
to do that. As you will recall, the process for doing that is to let N >> n. Choose N
towers R and R’ for T and S respectively, and then copy P from R to R’. Our goal is to
carry out that process carefully so that we will keep our entropy.
Let H be the entropy of P,T. Let H’ be the entropy of S. Let H’- H >> ε > 0 be
given. By theorem 278, only a small measure of P,T,R columns have measure smaller
than exp(-(H+ε)N). Alter P by lumping those few columns together into one big
column with one name (any given name you choose) and that won’t change either entropy
or distribution of P,T very much by lemma 282. Now you have at most exp((H+ε)N)
columns.
Let G be a generator for S. Consider the S, G, R’ columns. All but a very small measure
of those columns are small enough so that we can copy P,T to R’ as usual , using the
young children’s puzzle theorem to make sure that the columns of G are finer than the
columns of your copy. The big S,G,R’ columns together with the error in the puzzle
theorem only make the copy not quite as nice as it otherwise would have been.
That constructs the desired copy Q. It has approximately the right n distribution
and is coarser than the S,G,R’ columns. Let B be the base of R’ and let Bc be its
complement. Define partitions PG, PQ:
PG = {B ∩ G columns, Bc} and PQ = {B ∩ Q columns, Bc}
are partitions which generate the same σ-algebras, under S, as G V {B, Bc} and Q V {B,
Bc} respectively (not precisely the same because of the error set, but this is negligible), so
H(PG,S) is approximately H(G,S) = H(S), and H(PQ,S) is approximately H(Q,S). By
corollary 277,
the PQ,S,N entropy drop < the PG,S,N entropy drop.
It does not hurt to assume that n is bigger than it is, so choose N big enough in the very
beginning to assure that the PG,S,N entropy drop is small.//
ABBREVIATION: To get n distribution and keep entropy, copy one N >> n picture onto
another like usual. First put G columns on the range tower where G is a generator and
then make your copy columns to contain G columns by fudging columns to make puzzle
theorem applicable. Intersect columns with base to get PQ and PG.
the PQ,S,n entropy drop < the PG,S,n entropy drop.
THEOREM 284: Technique 283 can be made to work even when
the entropy of S = the entropy of T
Idea of proof: Since we are only interested in copying a distribution close to that of the
original process, don’t copy P,T itself, but just copy something close to it. Just start with
a process slightly less than full entropy, which is close to the original process in
distribution and then technique 283 is applicable. One way to get the approximating
process, is to form a Rohlin tower and lump a small set of columns together. This can’t
change entropy by much because of Lemma 282. To see how this procedure can force a
drop in entropy, consider the PQ partition developed for technique 283. By lumping a few
columns together we can force the N distribution entropy of PQ,S to be too small,
forcing H(Q,S), which is approximately H(PQ,S), to be smaller than it was.
5) The dbar metric
LEMMA 285: Fix a, ε > 0 with a+ε < ½. For sufficiently large n, the number of words of
zeros and ones of length n with less than “an” ones is less than
exp[-((a+ε)log(a+ε) + (1-(a+ε))log(1-(a+ε)))n].
Proof: We could give a proof using Stirling’s formula but we prefer this proof. Consider
a random independent process on zeros and ones where “one” has probability
“a+ε” and consider words of length n. The process has entropy
-((a+ε)log(a+ε) + (1-(a+ε))log(1-(a+ε))),
so the typical reasonable word has probability about
exp[((a+ε)log(a+ε) + (1-(a+ε))log(1-(a+ε)))n],
and all of the words of the given set are exponentially bigger than these reasonable
words. The result follows because the given set has probability less than 1.//
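A quick numerical sanity check of this count (mine): compare the exact number of length-n binary words with fewer than an ones against the exponential bound of the lemma.

```python
from math import comb, exp, log

def words_with_few_ones(n, a):
    """Exact number of 0-1 words of length n with fewer than a*n ones."""
    return sum(comb(n, k) for k in range(int(a * n)))

def lemma_bound(n, a, eps):
    x = a + eps
    return exp(-(x * log(x) + (1 - x) * log(1 - x)) * n)

a, eps = 0.2, 0.05
for n in (50, 100, 200, 400):
    print(n, words_with_few_ones(n, a), lemma_bound(n, a, eps))
```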
THEOREM 286: Fix a, ε > 0 with a+ε < ½. Let A be the number of elements in your alphabet.
For sufficiently large n, the number of words of length “n” within “a” of a given word in the
mean Hamming distance is less than
exp[(-((a+ε)log(a+ε) + (1-(a+ε))log(1-(a+ε))) + (a+ε)log(A-1))n].
Idea of proof: Don’t use lemma 285 directly. Use the idea of its proof. First describe a set
which has the same number of elements as the given set.//
THEOREM 287: Mature children’s puzzle theorem:
Let M,P be two partitions such that M V P contains m sets, and let M’ have just as
many sets as M. Let Q be a partition whose sets are all smaller than ε in measure, such
that Q is finer than M’. Then there is a partition P’ such that
i) Q is finer than M’ V P’
ii) The variation distance between the distributions of M V P and M’ V P’ is less than
mε + the variation distance between the distributions of M and M’.
Proof left to reader.
COMMENT 288: Be the first on your block to own one of these swell toys!
DEFINITION BY CONSTRUCTION 289: Let
…X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2…
be two processes on the same alphabet, perhaps on different spaces, and c be a coupling
of X1, X2… Xn and Y1, Y2… Yn. This coupling is a measure on the product space which
associates with every ordered pair (w1, w2) from X1, X2… Xn and Y1, Y2… Yn
respectively a probability p(w1, w2).
We now consider a completely new Lebesgue space. Place a Rohlin tower of
height n on our new space (You don’t even have to assume there to be any error set
because we have no transformation in mind, just n sets of measure 1/n each). Break it into
columns so that for every ordered pair (w1, w2) there is an associated column whose
measure is p(w1, w2).
Let { a1, a2,… ap } be the common alphabet of the X and Y processes. We are
about to regard ordered pairs of these letters as representing sets in our new space.
In particular, we regard the mth rung of the column associated with (w1, w2) to be in
(ai,aj) if the mth letter of w1 is ai and the mth letter of w2 is aj. Our tower, endowed with
these sets we have represented by ordered pairs (ai,aj) is called a picture of the coupling
between X1, X2… Xn and Y1, Y2… Yn .
EXERCISE 290:
Show that the union of all ordered pairs (ai,aj) such that i ≠ j has
measure exactly equal to the expected mean Hamming distance achieved by the coupling.
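A small sketch of the picture and of exercise 290 (mine; the length-3 coupling is hard-coded and arbitrary): build the rung sets labeled by ordered pairs and check that the off-diagonal measure equals the expected mean Hamming distance.

```python
n = 3
# A toy coupling: probabilities p(w1, w2) for a few pairs of length-3 words.
coupling = {(('0','0','0'), ('0','0','0')): 0.4,
            (('0','1','1'), ('0','0','1')): 0.3,
            (('1','1','1'), ('1','1','0')): 0.3}

# The picture: a tower of n rungs; each coupled pair (w1, w2) contributes a column
# of measure p(w1, w2), and rung m of that column carries the label (w1[m], w2[m]).
rung_measure = {}          # measure assigned to each ordered pair of letters, per rung
for (w1, w2), p in coupling.items():
    for m in range(n):
        key = (m, w1[m], w2[m])
        rung_measure[key] = rung_measure.get(key, 0.0) + p / n

off_diagonal = sum(v for (m, a, b), v in rung_measure.items() if a != b)
expected_mean_hamming = sum(p * sum(x != y for x, y in zip(w1, w2)) / n
                            for (w1, w2), p in coupling.items())
print(off_diagonal, expected_mean_hamming)    # these agree (exercise 290)
```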
DEFINITION 291: We want to discuss the notation we just developed where instead of
regarding the processes under discussion to be …X-2, X-1, X0, X1, X2… and …Y-2, Y-1,
Y0, Y1, Y2… we regard them to be P,T and Q,S. The translation is that
P is the p set partition { X0 = a1, X0 = a2,… X0 = ap } and
Q is the p set partition { Y0 = a1, Y0 = a2,… Y0 = ap }.
The statement in the previous definition
“Our tower, endowed with these sets we have represented by ordered pairs (ai,aj) is called
a picture of the dbar coupling.”
is replaced with the statement
Rohlin R’’ of height n endowed with partitions P’’ corresponding to P, and Q’’
corresponding to Q, is a picture of the coupling.
where
R’’ is the tower of the picture, P’’ is the partition of picture given by the first coordinate
of the pair and Q’’ is the partition of picture given by the second coordinate of the pair.
COMMENT 292: In the previous definition, the partition of the picture into ordered pairs
is precisely P’’ V Q’’.
EXERCISE 293: All notation as in definitions 289 and 291. Let R be a Rohlin tower of height n on
the space of T such that the base of the tower is independent of the partition of the space
into n names (under P,T). Show that P’’, R’’ is a precise copy of P,R.
THEOREM 294: Let P,T and Q,S be two processes whose distance in dbar is d < ¼, and let A
be the number of letters in the alphabet. Suppose
H(T) > H(P,T) + d^(1/2) log(A-1) - (d^(1/2) log d^(1/2) + (1 - d^(1/2)) log(1 - d^(1/2))).
(Note that -(d^(1/2) log d^(1/2) + (1 - d^(1/2)) log(1 - d^(1/2))) is positive.)
Then it is possible to find a partition Q' on the same space as P such that
i) |P - Q'| < 2 d^(1/2)
ii) Q',T is as close to Q,S as you want in distribution and entropy (except that you
are not allowed to insist that their n distributions or entropy be equal).
COMMENT 295: (ii) means you can pick any n you want in advance, and then get them
as close as you want in entropy while getting them as close as you want in n distribution.
COMMENT 296: The only significance of the expressions
d^(1/2) log(A-1) - (d^(1/2) log d^(1/2) + (1 - d^(1/2)) log(1 - d^(1/2))) and 2 d^(1/2)
for this book is that they go to zero as d does.
Proof: This proof is separated into two parts: a computation and then a proof.
COMPUTATION: The statement of the theorem introduces values d and A. The
second line includes the expression d^(1/2) log(A-1) - (d^(1/2) log d^(1/2) + (1 - d^(1/2)) log(1 - d^(1/2))), which
we will call X, and (i) includes the expression 2 d^(1/2), which we will call Y. When deriving
the theorem we did not yet know X and Y, so let us now pretend that we do not know
what they are. We have the following quantities to consider.
Cast of characters: X, Y, d and A.
We wish to derive the values of X and Y in terms of d and A.
In the proof we will introduce another quantity, B.
Guest star: B
In the proof we will find that we can let
X = B log(A-1) - (B log B + (1 - B) log(1 - B))
Y = B + d/B.
Since we can let B be anything we want, we will choose B to be d^(1/2) in order to force Y to
be pretty small. This forces X and Y to be what they are.
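Spelling out the substitution (a one-line check, not part of the original text): with B = d^(1/2),
Y = B + d/B = d^(1/2) + d^(1/2) = 2 d^(1/2), and
X = d^(1/2) log(A-1) - (d^(1/2) log d^(1/2) + (1 - d^(1/2)) log(1 - d^(1/2))),
and both X and Y go to 0 as d does.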
PROOF:
Start off with the large n for which we want to copy the n distribution of P.
Let N >> n. There is a coupling of the N distribution of (P,T) with the N distribution of
(Q,S) which achieves an expected mean Hamming distance < d.
Let G be a generator for T. Let R be a tower of height N.
On an arbitrary space, let Rohlin R’’ of height N endowed with partitions P’’
corresponding to P, and Q’’ corresponding to Q, be a picture of the coupling.
For the moment ignore P’’ V Q’’ and just focus on P’’ columns. In preparation for
copying, lump together all unreasonably small P’’ columns (totally ignoring what Q’’
looks like on those columns) as we did in the proof of technique 283.
Now we will add some more Q’’ V P’’ columns to that lumped column.
Consider a Q’’ V P’’ column in R’’ to be “bad” if the fraction of the rungs in the column
in which the first and second coordinates disagree exceeds B (The value of B is given in
“computation”). Our problem is that the number of bad columns may be too big for us to
apply the mature children’s puzzle theorem. We resolve the problem by adding all bad
columns to the lumped column so that we have one big lumped column. Before doing
that, the lumped column was negligibly small. Now, since the expected mean Hamming distance
achieved by the coupling is less than d, Markov's inequality gives
(*) the size of the lumped column is at most d/B (we need not take into consideration the
negligibly small part).
On the lumped column, we erase all knowledge of Q’’ so that we no longer have
Q’’ V P’’ columns, but rather only P’’ columns.
Now we count columns. The number of unlumped P'' columns can't be very
much bigger than
exp(H(P,T) N).
The number of Q'' V P'' columns in each unlumped P'' column can't be much bigger than
exp[(B log(A-1) - (B log B + (1 - B) log(1 - B))) N]
by Theorem 286. Hence the total number of columns cannot be very much bigger than
exp(H(P,T) N) exp[(B log(A-1) - (B log B + (1 - B) log(1 - B))) N].
If we insist that
H(T) > H(P,T) + B log(A-1) - (B log B + (1 - B) log(1 - B)),
there are enough G columns to apply the puzzle theorems.
Finally we are ready to copy Q’’ to get Q’. On the unlumped columns, use the
mature children’s puzzle theorem to get G finer than Q’ and to get Q’,P to look like
Q’’,P’’. On the lumped column, use the YOUNG children’s puzzle theorem to get G finer
than Q’, but here we don’t care how Q’ and P compare with each other.
Conditioned on our being in the unlumped columns, |P'' - Q''| < B. On good columns |P - Q'| < B, so by (*),
|P - Q'| < B + d/B. //
ABBREVIATION: In a picture of a dbar coupling, lump small P'' columns together with
P'' columns having a higher fraction than d^(1/2) of rungs of disagreement, yielding a lump of
size about d^(1/2). Now copy Q'' to get Q' with Q' looking like Q everywhere, close to P except on the
copy of the lump, and filled with generator columns to insure high entropy.
COMMENT 297: In order to motivate what is coming, we need to point out some of the
difficulty you might have had if you had tried to prove theorem 294 yourself without
reading our proof. The difficulty is that we don't have any idea what the picture of the
coupling looks like. The result is an awkward, only partially useful theorem.
Here is another problem. We will end up using the theorem when we know that
Q,S is finitely determined, and instead of knowing that P,T and Q,S are close in dbar, as
the theorem requires, we will know that P,T and Q,S are close in distribution and
entropy. That is O.K. because for finitely determined processes, close in distribution and
entropy implies close in dbar. The trouble is that we don't know how close in distribution
and entropy you have to be in order to be close in dbar unless you know precisely which
finitely determined transformation we are talking about.
We will use the theorem but it will not be enough. We will also need that in a
special case we can do much better. In our special case we will have a much better
picture of the coupling, and we will not have to make use of the finitely determined
property when we use it. We now proceed to develop that special case.
DEFINITION 298: Let f: …X-2, X-1, X0, X1, X2… → …Y-2, Y-1, Y0, Y1, Y2… be a
factor map. The coupling corresponding to f is the coupling of …X-2, X-1, X0, X1, X2…
and …Y-2, Y-1, Y0, Y1, Y2… assigning measure 0 to
{(w1, w2) : f(w1) ≠ w2}
and, for every set A in the …X-2, X-1, X0, X1, X2… space, assigning the measure of A to
{(w1, w2) : f(w1) = w2 and w1 ∈ A}.
EXERCISE 299: Prove that it does indeed couple the two spaces.
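In a single formula (a restatement for convenience, not the book's notation): if μ denotes the law of the …X-2, X-1, X0, X1, X2… process, the coupling corresponding to f is the measure λ determined by
λ(A × B) = μ(A ∩ f^(-1)(B));
its first marginal is μ and its second marginal is μ ∘ f^(-1), the law of the …Y-2, Y-1, Y0, Y1, Y2… process, which is what exercise 299 asks for.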
EXERCISE 300: Prove that if the coupling corresponding to f is restricted to words of
length n (i.e. erase all knowledge of Xi or Yi when i ∉ {1,2,…,n}), then for large n the
picture of it is made up almost entirely of approximately 2^(Hn) reasonable names, where H
is the entropy of the X process.
Hint: …X-2, X-1, X0, X1, X2… V …Y-2, Y-1, Y0, Y1, Y2… is isomorphic to …X-2, X-1, X0,
X1, X2….//
THEOREM 301: Suppose S is ergodic, f: Q,S → q,S is a homomorphism, and suppose Q
has the same number of sets as q and that the sets of Q are identified with the sets of q, so
that it makes sense to ask about the expected mean Hamming distance that a coupling
between Q and q achieves. Suppose the expected mean Hamming distance achieved by
the coupling corresponding to f, when restricted to words of length n, is d. Then
if
T is another transformation, possibly on a different space, P is a partition on that latter
space such that P,T is a perfect copy of q,S, and H(T) > H(Q,S),
then
it is possible to find a partition Q' on the same space as P such that
i) |P - Q'| < d.
ii) Q',T is as close to Q,S as you want in distribution and entropy (except that you
are not allowed to insist that their n distributions or entropy be equal).
Idea of proof: Same idea as the proof of theorem 294 except easier. Use exercise 300. We
don’t have any big lumps in this proof to worry about.
As a technicality we have to admit that this proof really only forces |P-Q’| to be at
most a tiny bit bigger than d. However,
i) This is good enough for our purposes. No harm will be done if you alter condition (i)
to say |P-Q’|<2d.
ii) If you want to achieve |P-Q’|<d, show you can do so by fudging Q’ slightly without
seriously hurting distribution or entropy.//
4) Sinai’s and Ornstein’s theorems.
DEFINITION 302: Let …X-2, X-1, X0, X1, X2… be a stationary process, let a0 a1 a2…
be an infinite word in the alphabet of the process, and define the event An(a0 a1 a2…),
abbreviated An, by:
a point obeys An iff
Xi = a0, Xi+1 = a1, Xi+2 = a2, … Xi+n-1 = an-1 for some i ∈ {-n+1, -n+2, … 0}.
COMMENT 303: Finitely determined implies positive entropy, so picking a0 a1 a2… to
obey Shannon Macmillan, the measure of its initial n-word decreases exponentially; since An
is covered by the n possible starting positions i, P(An) goes to 0 as n approaches infinity.
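In symbols (a back-of-the-envelope version of the comment, with δ > 0 a small slack): if the entropy is H > 0 and a0 a1 a2… obeys Shannon Macmillan, then for large n
P(Xi = a0, …, Xi+n-1 = an-1) ≤ 2^(-(H-δ)n) for each fixed i, so P(An) ≤ n · 2^(-(H-δ)n) → 0.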
LEMMA 304: The average entropy lemma:
Let P and Q be partitions and T an ergodic transformation. Let B, T(B), T2(B), … Tn(B)
be a Rohlin tower for T, where n is odd, and let e be its error set. Let two points
p ∈ B and q ∈ B be said to be equivalent iff
the P,T (n+1)/2 name of p and the Q,T (n+1)/2 name of T(n+1)/2(p) are the same as
the P,T (n+1)/2 name of q and the Q,T (n+1)/2 name of T(n+1)/2(q), respectively.
Break B into equivalence classes by the above equivalence relation and define a column
to be any set
A ∪ T(A) ∪ T2(A) ∪ … ∪ Tn(A)
where A is such an equivalence class. Let PP be the partition made up of the columns
and e. Let d > 0. Then if n is sufficiently large and e sufficiently small,
(1/n) H(PP) < (H(T,P) + H(T,Q))/2 + d.
ABBREVIATION: Let the bottom half of a Rohlin tower be one process and the top half
be another process. The entropy is not much bigger than the average of that of the two
processes.
Idea of proof: There must be an i ∈ {1,2,3,…, εn}, ε small, such that most of Ti(B) obeys
the Shannon Macmillan theorem for P,T by time n/4 onward and most of Ti+(n+1)/2(B)
obeys Shannon Macmillan for Q,T by time n/4 onward. Use lemma 282 to ignore the
columns where either
Ti(B) or Ti+(n+1)/2(B) fails to obey the above mentioned Shannon Macmillan theorems. All
other columns are determined by the following triple, where ω ∈ B:
( the i-1 name of ω; the (n+1)/2 name of Ti(ω) (which is reasonable); the ((n+1)/2) - i
name of Ti+(n+1)/2(ω) (which is reasonable) ).
The number of such triples is less than exp[((H(T,P) + H(T,Q))/2 + d/2) n] for any prechosen
small d.//
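One way to organize that count (a sketch under the assumptions above, with δ a small slack in the Shannon Macmillan estimates and A the size of the alphabet): for i ≤ εn,
number of triples ≤ A^(i-1) · exp[((n+1)/2)(H(T,P)+δ)] · exp[(((n+1)/2) - i)(H(T,Q)+δ)] ≤ exp[((H(T,P) + H(T,Q))/2 + d/2) n]
once ε and δ are small compared to d.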
THEOREM 305: Let P,T be an ergodic process with positive entropy. Then there are
factors Pi,T, where the sets of Pi correspond to the sets of P, such that
i) (Pi, T) is a factor of (Pi+1, T) with factor map fi: (Pi+1, T) → (Pi, T).
ii) The mean Hamming distance given by the coupling corresponding to fi goes to
zero as i → ∞.
iii) |P - Pi| → 0 as i → ∞.
iv) H(Pi,T) < H(P,T) for all i.
Idea of proof: We consider an alphabet {p1, p2, p3,… pn, D} where p1, p2, p3,… pn are
letters corresponding to the sets of P and D is an additional letter (i.e. P is defined by
letting ω be assigned the letter pi whenever ω lies in the ith set of P). Let C = {c1, c2, c3,…} be
a countable partition of the space. Define Pi to be the same as P on c1 ∪ c2 ∪ c3 ∪ … ∪ ci,
and throw all the sets cj, j > i, into D.
However there are some technicalities. You can't just pick the ci's arbitrarily.
By comment 303, pick a0 a1 a2… so that P(An) goes to 0 and let c1 = Am for some large m.
Note that An, n > m, is measurable with respect to all (Pi,T) no matter how we pick
ci, i ≥ 2, and if you recall the proof of the Rohlin tower theorem, arbitrarily large Rohlin
towers can be constructed from arbitrarily small sets; so using the An, n > m, we get
arbitrarily large Rohlin towers which we know to be measurable with respect to all (Pi,T)
even before we know the values of the ci, i ≥ 2. Define ci inductively by considering
(Pi,T), defining D to be the set of points labeled D at that time, selecting a huge Rohlin
tower defined out of some An, and letting ci+1 be the intersection of D with the bottom
half of that tower. The measurability of our Rohlin towers for all (Pi,T), together with the
fact that c1 is measurable for all (Pi,T), will give us (i). (ii) and (iii) are easy. (iv) follows
inductively from the average entropy lemma. //
Sinai's theorem:
Let T be an ergodic transformation with finite entropy. Every finitely determined
transformation of the same entropy as T is a factor of T.
Idea of proof: Let P,S be the given full entropy finitely determined process.
Step 1: Get the Pi,S of theorem 305. (They are all finitely determined because finitely
determined is closed under taking factors.)
Step 2: Let P0,0,T approximate P1,S in distribution and entropy (hence in dbar) (Technique
229).
Step 3: Move P0,0,T slightly to get P0,1,T such that P0,1,T is a much better approximation
of P1,S. (Theorem 294)
Step 4: Continue to get P0,2,T, P0,3,T, ... converging to a process P1,0,T that has the same
distribution as P1,S.
Step 5: Move P1,0,T slightly to get P1,1,T such that P1,1,T is a good
approximation of P2,S. (Theorem 301)
Step 6: Get (P1,2,T), (P1,3,T), (P1,4,T),… approaching (P2,0,T), which has the same distribution
as (P2,S). (Theorem 294)
Step 7: Continuing the same recipe over and over, get (P3,0,T), (P4,0,T), (P5,0,T),…, which
converge to our desired copy. //
COMMENT 306: The reason we could not go after P,S directly, is that P,S has full
entropy, and theorems 294 and 301 require us to be working in an environment of higher
entropy.
COMMENT 307: If the reader only wants to prove Sinai's theorem in the case where
H(P,S) is strictly less than H(T), the proof of Sinai's theorem actually contains within it a
proof of that case. It is a much easier proof because it uses only a piece of the full argument and
does not require theorems 301 or 305. We consider it important that the reader focus on
that case.
LEMMA 308: Let P,T and Q,S be processes.
Suppose the dbar distance between P,T and Q,S is ε.
Take a factor of the P,T process using a code of length n (i.e. the ith coordinate of the
image is determined by the i-n, i-n+1,… i+n terms of the P,T process). Take a factor of
the Q,S process using a code of length n, and in fact use the exact same code as you used
for the P,T process. Then
the dbar distance of your factors is < (2n+1)ε.
Idea of proof: A good dbar coupling of the processes induces a good dbar coupling of the
images, where k errors on terms from -(M + n) to (M + n) induce at most (2n+1)k errors
on terms from -M to M.
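Quantitatively (a short check of the constant, not from the text): if the coupled pair disagrees at k of the coordinates -(M+n), …, M+n, then each disagreement can spoil at most 2n+1 of the coded coordinates -M, …, M, so the fraction of spoiled coded coordinates is at most
(2n+1)k/(2M+1) = (2n+1) · (k/(2M+2n+1)) · ((2M+2n+1)/(2M+1)),
which tends to (2n+1)ε as M → ∞.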
LEMMA 309: Let P,T and Q,S be processes.
Suppose the dbar distance between P,T and Q,S is ε.
Take a factor of the P,T process using a code of length n. Take a factor of the Q,S process
using a code of length n, and in fact use the exact same code as you used for the P,T
process. Suppose you use the same alphabet for the factors as you use for the processes
themselves. Then it makes sense to ask whether a given point ends in the same piece of P
and its factor.
Suppose the probability that a given point ends in a different piece of P and its factor is
δ.
Then
the probability that a given point ends in a different piece of Q and its factor is < (2n+1)ε +
δ.
Idea of proof: As before, induce a coupling from a coupling. If the -n to n terms of the
two processes are the same, and the 0 coordinate of the P,T process and its factor are the
same, then the 0 coordinates of the P,T process, its factor, the Q,S process, and its factor
are all the same.//
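In symbols (a rough version of the idea, assuming a stationary dbar coupling so that each single coordinate of the coupled pair disagrees with probability about ε):
P(Q differs from its factor at time 0) ≤ P(some coordinate in [-n, n] of the coupled pair disagrees) + P(P differs from its factor at time 0) ≤ (2n+1)ε + δ.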
Sinai's theorem (strong form):
THEOREM: For all ε there exists δ such that if Q,S is within δ in distribution and
entropy of an FD distribution whose entropy does not exceed that of S, then Q can be
moved a distance less than ε to get a partition Q' such that Q',S has that FD distribution.
Idea of proof: Let P,T have the target distribution. Let Pi be as in theorem 305. Then the
map that takes the P,T name of a point to the Pi,T name of it is a factor map which for large
i rarely alters what set the point is in. Fix a large i and suppose
the probability that a given point ends in a different piece of P and its Pi is β.
Since every factor map is a limit of finite code maps, Pi is very close to a Pi' that is coded
from P with a code of length n for some n. We can presume that
the probability that a given point ends in a different piece of P and Pi' is at most 2β.
Now suppose the dbar distance between Q,S and P,T is a small fraction of 1/n, which is
forced if δ is sufficiently small. Let
ε' be the dbar distance between Q,S and P,T.
Use the same n code to code a factor q,S of Q,S.
The distance between q and Q is < (2n+1)ε' + 2β by lemma 309.
The dbar distance between q,S and Pi',T is < (2n+1)ε' by lemma 308,
and Pi',T in turn is very close to Pi,T.
Proceed as in the proof of Sinai's theorem.//
COMMENT: There is a subtlety in the above proof. ε' has to be chosen small even in
comparison to ε. If the dbar distance between q,S and Pi,T were on the order of ε, that
might be too big to apply Theorem 294 in the proof of Sinai's theorem.
Ornstein isomorphism theorem:
Two finitely determined processes with the same entropy are isomorphic.
Idea of proof: We are given transformations T and S and partitions P and Q such that P,T
and Q,S are finitely determined of the same entropy, where P generates. Our goal is to
establish a Q’ which also generates T such that Q’,T has the same distribution as Q,S.
1) By Sinai, get q such that q,T has the same distribution as Q,S.
By theorem 269, it suffices to get q’ such that
q’,T has the same distribution as Q,S,
q’ is arbitrarily close to q, and
q’ codes P arbitrarily well.
2) Copy P in the land of q to get P', so that
i) (P',T) V (q,T) looks like (P,T) V (q,T) in distribution and entropy.
By the strong form of Sinai's theorem, by moving the copy a little after making it, we can assume
ii) P',T has the same distribution as P,T.
3) Copy q to get q', so that
i) (P,T) V (q',T) looks like (P',T) V (q,T) in distribution and entropy.
By the strong form of Sinai's theorem, by moving the copy a little after making it, we can assume
ii) q',T has the same distribution as Q,S.
What all this means:
Since P generates, there is some n such that P codes q well by time n.
(P,T) V (q,T), (P’,T) V (q,T) and (P,T) V (q’,T) all look alike so P codes q like P codes q’
by time n. Thus
q is close to q’.
P’ is in the land of q so there is an n1 such that q codes P’ extremely well by time n1.
Hence, by making the copy q' well enough, we can guarantee that
q' codes P extremely well by time n1.
We are done. However, there is a technicality to be concerned about when we play
this game. It takes 2n1+1 letters from the q' partition to code a letter from P when we are
using an n1 code. If any of those 2n1+1 letters are altered, the code codes wrong.
Therefore, if we move the q' partition even as much as the order of 1/n1, that is enough to
devastate the coding. Furthermore, you will recall that we did in fact move q' using
Sinai’s Theorem.
Relax. We did not make the q’ coding until after we already knew n1, and by
making it arbitrarily well, we can get the distance we needed to move it to be small, even
in comparison to 1/n1.//
ABBREVIATION: Let P,T be a factor of Q,T of full entropy which generates Q,T pretty
well. You would like to move P slightly to get an isomorphic copy P’,T of P,T which
generates Q,T even better. Make Q’ factor of P,T: (Q’,T) V P,T looks like (Q,T) V P,T.
Make (P’ V Q,T) look like (P V Q’,T).
P’ is close to P because T,Q codes them about the same.
(P’,T) captures (Q,T) well because it copies the way P,T captures Q’,T.
Corollaries:
1) The Ornstein Isomorphism theorem (Classical form):
Two independent processes of the same entropy are isomorphic.
2) FD ⇔ FB ⇔ B.
Proof: Every FD is isomorphic to an independent process of the same entropy.
Abbreviation of Book
WARNING: Material in this abbreviation can fail to be valid or meaningful mathematics.
It may appear to be complete gobbledygook if the reader tries to read it before having
read the book. However, if the reader has established that he can make all “ideas of
proof” in this book rigorous, then by comparing this abbreviation with the book he can
speed up his thinking and be able to quickly review the whole book by just reading this
abbreviation. It also serves as an extended table of contents.
A backslash is used to separate theorems from their proofs.
----------------------------------------------------------------
PRELIMINARIES page 8
page 11
Stationary process = measure preserving transformation\ Just shift word to the left.
page 11
Making a transformation\ name a function, stationary process, cutting and stacking
page11
Cutting and stacking\ first stage put interval [.4, .5] on top of [.3, .4] on top of [.2, .3] on
top of [.1, .2] on top of [0, .1] and go up: x → x + .1
second stage left third of the stack goes on top of middle third of stack goes on top of
right third of that stack to make new stack. You can put spacers between successive thirds
and make more than one stack. Continue.
page12
Ergodic\ T-1(S) = S ⇒ measure of S ∈ {0,1}
page13
Independent implies ergodic\ Approximate S with cylinder C . T-n (C) and T-m (C)
independent for |n-m| big.
page14
Let P1, P2,... be increasing and separating. Arbitrary S almost union of pieces of Pi for i
big\ Closed approximation for a piece of Pi is either in the open approximation for S or in the open
approximation for the complement of S.
page 15
Above Pi generate.
page 16
P,T process spits out word p0, p1,… pn if those are the sets containing ω, T(ω),… Tn(ω).
page 16
P, Q generate T ⇒ P,T, Q,T, T isomorphic.
page 16
For measurable sets: every countable chain has an upper bound ⇒ the collection has a maximal element.
page 17
Exists S, T(S), T2(S),... TN(S) disjoint and almost cover\ Ergodic case: you can simply
take every nth point in every orbit, but you would get a non-measurable set. Fix this
problem by using a small set as a starting set.
Nonergodic case: you also need an aperiodicity assumption. This allows you to extract a small
subset of any set whose iterates are disjoint for a long time. Apply measurable Zorn to get
a maximal such set for your starting set.
page 18
P,T breaks Rohlin tower into columns.
page 19
Base of tower can be independent of n names\ Accomplish this by making it independent
of every rung of every column.
page 20
Measure(Si) summable, (complement of Si) ∈ Pi ⇒ superimposition of the Pi is countable.
page 20
Every transformation is a process with countable alphabet\ Bi bases of towers with
summable measures. Partition each Bi with a finite but huge amount of information, tack
on complement of Bi, and then superimpose them all.
page 21
1/n (f(ω) + f(T(ω)) + ... + f(Tn(ω))) converges as n → ∞\ Suppose lim sup > b and lim inf < a.
Cover most of the orbit by disjoint blocks over each of which the average is near b; this gives an
overall average near b. Get it near a also, a contradiction.
page 22
Ergodic case: above converges to the integral of f\ Lebesgue dominated convergence if f is
bounded; truncation argument.
Nonergodic case: it converges to the conditional expectation of f with respect to the invariant σ-algebra.
page 23
Monkey sequence\ Given sequence. Take subsequence of finite truncations where
frequency of words of length 1 converge, subsequence of that for words of length 2 etc.
Get stationary measure.
page 23
Stationary is a convex combination of ergodic\ Three proofs: 1) The monkey measure of
a given word is the ergodic component it is in. 2) Conditioning with respect to the
invariant σ-algebra gives an ergodic measure. 3) Krein-Milman.
page 26
Subsequential limit\ Same as Monkey except we start with a sequence of measures and
instead of looking at frequencies we look at the probability law of the first letter, of the first two
letters, etc.
page 26
Monkey method can also be regarded as starting with sequences of measures instead of
sequences\ Let N >> n. We extract a nearly stationary measure of length n from the Nth
measure restricted to 1,2,3,…N by averaging every n length submeasure. Take a
subsequential limit as n → ∞.
page 27
Martingale\ E(Xn| Xn-1, Xn-2... ) = Xn-1 for every n > 1.
page 28
For a martingale, if T is a stopping time and all Xn are uniformly bounded, E(XT|X0) = X0. In
general E(XT^n|X0) = X0 for any n.
page 29
A bounded Martingale converges\ Buy whenever the Martingale goes above b and sell
when it goes below a. First time above, first time below, second time above, etc. are all
stopping times, so your expected gain is zero, and you can't lose much, so you can't gain
much.
page 30
Coupling\ measure on product space with two measures as marginals.
page 30
Independent coupling, gluing two couplings together, coupling by induction.
page 32
The tailfield of a process is the intersection for all n of the n future of the process.
page 32
Trivial tail, Kolmogorov zero one law.
page 33
mean Hamming distance, dbar metric, variation metric.
page 35
Markov process, P(present|past) depends only on X-1.
page 36
Aperiodic communicating Markov, n, m large, |P(Xn = c | X0 = a) - P(Xm = c | X0 = b)|
small / Give one process a head start, couple independently until they meet, stay together.
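A minimal simulation sketch of that coupling argument (the 3-state transition matrix below is made up for illustration, and the head start that compares time n with time m is omitted):

import random

P = {0: [0.5, 0.3, 0.2],   # made-up transition probabilities out of each state
     1: [0.2, 0.5, 0.3],
     2: [0.3, 0.3, 0.4]}

def step(state):
    # one step of the chain from the given state
    return random.choices([0, 1, 2], weights=P[state])[0]

def coupled_run(a, b, steps):
    # run two copies: independently until they first meet, then together
    x, y, met = a, b, None
    for t in range(1, steps + 1):
        if x == y:
            x = y = step(x)            # after meeting, stay together forever
        else:
            x, y = step(x), step(y)    # before meeting, move independently
            if x == y:
                met = t                # record the first meeting time
    return x, y, met

random.seed(0)
times = [coupled_run(0, 2, 200)[2] for _ in range(1000)]
times = [t for t in times if t is not None]
print(len(times), "of 1000 runs met; average meeting time", sum(times) / len(times))

Once the two copies have met they move together, so their laws at later times coincide; that is what makes the displayed difference small.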
page 40
P(present|past) depends on n past…really a one step Markov process in disguise.
page 42
Essentially bounded\ lim(N → ∞) lim sup(n → ∞) [|fN(X1)| + |fN(X2)| + ... + |fN(Xn)|]/n = 0
page 42
i·ai summable, P(i < |Xn| < i+1 | Xn-1, Xn-2, ...X1) < ai, then Xi essentially bounded\
Dominate |fN(X1)|, |fN(X2)|, ...|fN(Xn)| with a small independent process.
page 43
“Square lemma”: Xi,j random variables with stationary columns, Xi,j → Xj as i → ∞, and the Xi,i essentially
bounded. Then lim 1/n (X1,1 + X2,2 + X3,3 + ... Xn,n) = lim 1/n (X1 + X2 + X3 + ... Xn)\
For n large |Xn,n-Xn| is usually dominated by a small stationary process.
ENTROPY PAGE 44
Three shift not a homomorphic image of two shift\ If we use an 8 code, every distinct 200
length three shift word would come from a distinct 208 length two shift word, violating the pigeon
hole principle since 3^200 > 2^208. It doesn't help to allow a small fraction of errors in the code.
page 44
Entropy of process is λ if there are about 2^(λn) reasonable words of length n.
page 45
Isomorphic ergodic processes have same entropy\ Same as 2 shift 3 shift proof.
page 45
H(T) = Entropy of T = sup H(P,T) = H(P,T) for any finite generator P.
page 45
Shannon Macmillan Breiman: exponential decrease rate in measure of b1b2b3....bn
converges (essentially constant if ergodic)\ Compare -(1/n)log(P(b1b2b3....bn)) with
-(1/n)log(P(b1|b0b-1b-2b-3....)) - … - (1/n)log(P(bn|bn-1bn-2...)) using the square lemma. We have
essential boundedness of everything because the probability that any of those terms is
between i and i + 1, given previous terms, is less than the size of the alphabet times 2^(-i).
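The identity behind that comparison (just the chain rule for conditional probabilities, spelled out):
-(1/n) log P(b1 b2 … bn) = (1/n) Σ_{k=1}^{n} [-log P(bk | bk-1, …, b1)],
and the square lemma lets one replace the kth summand by -log P(bk | bk-1, bk-2, …), conditioned on the infinite past, whose Birkhoff averages converge.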
page 46
H(P) is H(P,T) where P is an independent generator for T, and this turns out to be
-p1 log(p1) - p2 log(p2) - p3 log(p3) - ... - pn log(pn).
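A one-line worked example (with logarithms base 2, matching the 2^(λn) convention for reasonable words): for an independent generator with probabilities ½, ¼, ¼,
H(P) = ½·1 + ¼·2 + ¼·2 = 1.5.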
page 47
Conditioning property; Put P1 into the first piece of P2. Get entropy H2 + q1H1.
page 47
Join of two partitions (a way of superimposing them).
page 48
Independent joining maximizes entropy\ 2 proofs. 1) Use the convexity of x log x to prove
that a convex combination of partitions has at least the entropy of the convex combination of
their entropies. 2) Use only that ½, ½ maximizes two set entropy, and the conditioning
property: i) Show 1/2^n, 1/2^n, …, 1/2^n maximizes 2^n set partition entropy. ii) True when one
partition is 1/2^n, 1/2^n, …, 1/2^n. iii) True when one partition has dyadic rational probabilities.
page 50
H(P|Q1) < H(P|Q) if Q1 is finer.
page 50
Entropy = limit of H(Xn|Xn-1, Xn-2, Xn-3, ...X0) = lim (1/n) H(X0 V X1 V X2 ... V Xn) =
lim H([A1, A2, A3, ...An, (Ω - [A1 ∪ A2 ∪ A3 ... ∪ An])], T) when the A's are a countable
generator.
page 52
Finite entropy ⇒ finite generator\ Make big tower distinguishing reasonable columns.
Improve approximation with a bigger tower making columns distinct by altering a few
rungs at the bottom of that big tower. Label the base of all your towers.
page 54
Entropy 0 iff past determines future.
page 54
Induce T on γ ⇒ induced entropy = H(T)/μ(γ)\ Given the obvious countable generator on γ,
words of length n for γ give words of length approximately n/μ(γ) for T, but this does not
give a valid proof because the generator is infinite. Make the proof valid with finite
truncation approximations.
BERNOULLI SHIFTS PAGE 56
B = Bernoulli = isomorphic to independent process.
FB = factor of Bernoulli
EX = extremal = exponentially fat n word submeasures (off some small subset) are dbar
close to the whole measure.
VWB = very weak Bernoulli = conditioned and unconditioned futures dbar close.
FD = finitely determined = good distribution and entropy approximation to the process
gives a good dbar approximation.
IC = dbar limit of independent concatenations = stationary limit of picking first n then
next n etc. in an i.i.d. manner.
page 59
Independent processes are extremal\ S dbar far away. Xi independent process defined
inductively so that Xk takes on 1, -1 with probability ½ apiece but in such a way that it
prejudices for 1 as much as possible on S. Far dbar implies prejudice can be strong
enough to force X1 + X2 …to grow in expectation linearly on S, proving S to shrink
exponentially.
page 64
Extremality is closed under dbar limits\ We wish to show that approximating S with
extremal Sm tends to make S extremal. After dodging obnoxious sets, we extract from the
dbar coupling of S and Sm a dbar coupling of a prechosen fat submeasure of S with a fat
submeasure of Sm.
page 65
ε startover process is IC / n >> m >> 1/ε; couple 1 to n; n+m to 2n+m; 2n+2m to 3n+2m…
page 65
coding of independent is IC / same proof.
page 66
EX ⇒ ergodic: else there are a < b and a cylinder C such that the sets where the frequency of C
is bigger than b and where it is less than a are both fat and dbar far apart.
page 68
Most conditioned futures are nearly exponentially fat submeasures of the unconditioned
future\ Expectation of conditioned entropy is unconditioned entropy but conditioned
entropy cannot be very big because typical names from the conditioned process come
from reasonable names of the unconditioned process. Therefore it is unlikely to be very
small.
page 70
B ⇒ FB ⇒ IC ⇔ EX ⇔ VWB ⇔ FD\ FB ⇒ IC: FB dbar close to coding of
independent. IC ⇒ EX: Independent concatenation looks like independent.
EX ⇒ VWB: Conditioned future fat subset of unconditional future (most conditioned
futures dodge the small bad set). EX ⇒ FD: Process and approximating process,
conditioned n futures are usually fat subsets of same unconditional. Couple inductively.
VWB ⇒ IC: Couple your VWB with its finite word measures repeated to form
independent concatenation. FD ⇒ IC: make process “start over” rarely to get
approximating IC.
page 71
They are closed under factors\ Use IC. A finite code of ( )( )( )… is close to ( )( )( )…
page 71
They have positive entropy\ IC contradicts past determines future.
page 72
(i) n word dbar between processes approaches limit d (ii) coupling with limiting error
frequency whose integral is d\ For i, dbar (…Xkn , …Ykn ) > dbar( …Xk , …Yk), and
# (Xi ≠ Yi : i < k+n) - # (Xi ≠ Yi : i < n) < k.
For ii, Monkey, Birkhoff, Bounded convergence.
page 73
Ergodic coupling of ergodic processes\ Just take ergodic component.
page 73
Almost ergodic coupling of IC with its approximation.
page 73
FD definition manifests in infinite coupling.
page 74
Two conditioned VWB futures can be infinitely coupled in dbar\
Ki rapidly increasing, Ni >> Ki+1. Couple two futures to times
K1, 2K1, … N1K1, N1K1 + K2, N1K1 + 2K2, … N1K1 + N2K2, N1K1 + N2K2 + K3, …
successively inductively as well as possible. Focussing on just times between
N1 K1 + N2 K2+… Ng Kg and N1 K1 + N2 K2+… Ng+1 Kg+1, the frequency of times that
either past is bad usually gets and stays small by Birkhoff and frequency of times pasts
are good but coupled pair is bad gets and stays small because each time you have another
big independent chance to get a good match.
page 80
WB means that making past independent of future has small effect in variation metric if
you restrict yourself to times –n,-n+1…-m, m, …n. m large.
page 83
Conditioned and unconditioned WB futures can be coupled to eventually agree\ Couple
times -n, -n+1, …, -m, m, …, n with the almost identity coupling. Shift to m-n, m-n+1, …, 0, 2m,
…, m+n. Fudge the coupling to get the identity on m-n, m-n+1, …, 0. Take the limit n → ∞ to get all
times …, -2, -1, 0, 2m, 2m+1, …. For k < m there is a way to take one of these couplings on
times …, -2, -1, 0, 2m, 2m+1, … and another on …, -2, -1, 0, 2k, 2k+1, … and get another on
…, -2, -1, 0, 2k, 2k+1, … so that if the former is the identity except for measure ε and the second is
the identity except for measure δ, the third will be an extension of the first and the identity
except for measure ε + δ. Repeating this type of manipulation and taking a limit, we end up
with a coupling on all times: identity on the past and eventually agreeing on the future.
ORNSTEIN ISOMORPHISM THEOREM PAGE 86
page 85
Columns in a large Rohlin tower tend to have the right m distribution\ Otherwise
Birkhoff fails for M words, M >> m.
page 85
Copy distribution: Rohlin towers on both spaces, copy from one to other.
page 89
Coding: Given homomorphism, Inverse of 0 coordinate partition approximately cylinder.
page 91
Land of P: σ-algebra generated by translates of partition P.
page 93
Q finer ⇒ H(P) - ½ H(P V T(P)) < H(Q) - ½ H(Q V T(Q)) / Add H(Q V T(Q)) - H(P V
T(Q)) < H(Q) - H(P) with H(P V T(Q)) - H(P V T(P)) < H(T(Q)) - H(T(P)) = H(Q) - H(P).
page 94
Q finer ⇒ entropy drop from time n onward greater\ Apply the above, replacing P with the n
name partition and T with Tn.
page 94
Columns of Rohlin tower are reasonable even when base not independent\
B base. Shannon Macmillan good on most Ti(B) for some small positive i and small
absolute value negative i.
page 96
Copy distribution and entropy: Exists P’ so that P’,S looks like P,T in distribution and
entropy if H(S) > H(P,T)\Copy P on a Rohlin tower to get a P’ on another Rohlin tower
so that the columns of a generator for S are inside the columns of your copy.
page 100
dbar: dbar P,T and Q,S small, H(T) > H(P,T) +O(dbar) implies copy Q’ of Q in entropy
and distribution, |P-Q’| = O(dbar) \ P’’ V Q’’ on R’’ manifests dbar match. Copy of Q’’ to
form Q’ with generator columns inside, but only insist that P V Q’ look like P’’ V Q’’
where there are not too many Q’’ columns per P’’ column so that H(T) does not have to
be too big.
page 103
P,T a perfect copy of q,S. f: Q,S → q,S both a homomorphism and a mean Hamming
coupling, distance d. H(T) > H(Q,S) ⇒ entropy and distribution copy Q' of Q with |P-Q'| < d.
page 105
(Pi,T) approaching (T,P) strictly increasingly as factors, in entropy and in dbar./ Pick
each Pi to be P on the bottom of a Pi measurable Rohlin tower after picking P0 in such a
way as to insure that the Rohlin tower is also P0 measurable.
page 106
FD full entropy factor\ Pi approaching target as above. If Pi,0 is a perfect copy of Pi ,
move it slightly to get approximate copy Pi of Pi+1. move again and again to get Pi ,
Pi …approaching Pi+1,0 a perfect copy of Pi+1.
1
2
3
page 108
FD, not too much entropy, close ent. dist. approximate copy can be moved slightly to get
perfection; close factor./ Pi above. First move slightly to get close to Pi for large i.
Proceed as above.
page 109
two FD’s same entropy isomorphic\ Let P,T be a factor of Q,T of full entropy which
generates Q,T pretty well. You would like to move P slightly to get a perfect copy P’,T of
P,T which generates Q,T even better. Make Q’ factor of P,T: (Q’,T) V P,T looks like
(Q,T) V P,T. Make (P’ V Q,T) look like (P V Q’,T).
P’ is close to P because T,Q codes them about the same.
(P’,T) captures (Q,T) well because it copies the way P,T captures Q’,T.
page 109
Two independent’s same entropy isomorphic.
page 109
FD ⇔ FB ⇔ B.
Index of definitions by page
HOW TO USE THIS INDEX: Example: Look at the third row. It says that Definition 3 is a definition of
measure preserving and that it is on page 8.
Definition  Term  Page
001  transformation  009
002  measurable  009
003  measure preserving  009
004  Measurable and measure preserving  009
005  interval space, with atoms, without atoms  009
006  Lebesgue space  009
009  stationary process  010
009  alphabet  010
015  ergodic  012
023  measurable  014
023  measure preserving  014
023  homomorphism  014
023  isomorphism  014
024  factor  014
024  isomorphic  014
029  P,T process  016
030  P generates T, P is a generator for T.  016
035  Rohlin tower of height N  017
035  error set  017
038  rung of the tower, first rung of the tower, base of the tower.  018
039  column, or P,T column  018
040  a rung of that column  018
041  n name  018
052  invariant σ-algebra  022
054  Monkey method  023
055  E(f|X,Y,Z).  024
057  conditional expectation of f with respect to s  024
058  version for E(f|s)  024
061  conditional expectation of μ with respect to s  025
063  subsequential limit of the μi  026
065  a stationary process obtained by the monkey method  026
068  Martingale  026
070  stopping time  027
073  a^b  028
078  past  036
080  coupling  030
082  σ-algebra generated by X  032
083  the n future  032
084  the tailfield  032
085  same tail  032
087  trivial tail  032
089  mean Hamming distance  033
091  dbar distance  033
093  variation distance  033
094  The variation distance between p1, p2,… pn and q1, q2,… qn  034
095  variation distance  034
096  variation distance  034
100  Possibly Non-stationary Markov Chain  035
101  transition probabilities  035
101.5  there is a path from a to b  035
102  a and b communicate  035
103  a is transient  035
104  θ(a)  036
105  L(a)  036
109  Markov chain  037
114  periodicity of the process  039
117  mixing Markov process  039
118  n step Markov process  039
119  Y is called the n step Markov process corresponding to X  039
122  periodicity of an n step Markov process  040
127  fN  042
128  essentially bounded  042
131  n shift  044
133  small exponential number of words, exponential number of words  044
133  exponential size of a word  044
134  entropy of the process  044
135  reasonable names  044
137  entropy of a transformation  045
142  entropy of P  046
145  put P1 into the first piece of P2  047
145  Conditioning property  047
147  join of two partitions  047
150  H(P) H(T)  047
151  Convex combination of partitions p1 P1+ p2 P2+.. pm Pm  048
155  the entropy of P over Q  049
159  H  050
164  H(P,T)  051
166  The entropy of a transformation  051
172  coded factor, code  054
176  the induced transformation of T on γ  054
180  Bernoulli  056
181  trivial transformation  056
183  submeasure  056
186  size  057
187  exponentially fat submeasure  057
189  extremal  057
190  Very Weak Bernoulli  057
191  finitely determined  057
192  independent concatenation  058
194  dbar limit of independent concatenations  058
195  B, FB  058
197  Completely extremal  058
208  P(ε, S1)  063
212  ε startover process of …X-2, X-1, X0, X1, X2….  065
213  close in distribution  065
231  weak Bernoulli WB  080
234  [ ], greatest integer  080
235  Parity  080
239  diagonal  082
241  eventually agree  083
246  we are copying P,T to R  086
249  copy P,T to get Q so that Q,S looks like P,T  087
250  we are copying P,B,T to get Q,C,S. or we are copying B to get C so that Q,C,S looks like P,B,T.  087
257  define P intersect Q. The symmetric difference between P and Q  088
258  distance between P and Q, |P-Q|  089
260  P,T codes Q,S well by time n; X process codes the Y process well by time n  089
261  code  089
262  P,T codes Q well by time n  090
265  P,T codes Q,S well by time n in about the same way that P’,T’ codes Q’,S’  090
270  the land of P  091
271  copies of σ-algebras, partitions etceteras in the land of P  091
273  P,T,n entropy drop  093
289  a picture of the coupling between X1, X2… Xn and Y1, Y2… Yn  098
291  Rohlin R’’ of height n endowed with partitions P’’ corresponding to P, and Q’’ corresponding to Q, is a picture of the coupling  099
298  The coupling corresponding to f  103
302  An(a0 a1 a2…), abbreviated An  104
Index of definitions alphabetized
HOW TO USE THIS INDEX: example: Look at the first row. It says that definition 009 is the definition of
alphabet and it occurs on page 10.
SOME ITEMS ARE LISTED MORE THAN ONCE: Example: Definition 89 of the mean Hamming distance on
page 33 is listed under mhd and under ham.
Abbrev.  Definition  Term  Page
alp  009  alphabet  010
alp  009  alphabet  010
an  302  An(a0 a1 a2…), abbreviated An  104
an  302  An(a0 a1 a2…), abbreviated An  104
b  195  B, FB  058
b  195  B, FB  058
bas  038  rung of the tower, first rung of the tower, base of the tower.  018
bas  038  rung of the tower, first rung of the tower, base of the tower.  018
ber  180  Bernoulli  056
ber  180  Bernoulli  056
ce  057  conditional expectation of f with respect to s  024
ce  061  conditional expectation of μ with respect to s  025
ce  197  Completely extremal  058
ce  057  conditional expectation of f with respect to s  024
ce  061  conditional expectation of μ with respect to s  025
ce  197  Completely extremal  058
cf  172  coded factor, code  054
cf  172  coded factor, code  054
clo  213  close in distribution  065
clo  213  close in distribution  065
cod  172  coded factor, code  054
cod  260  P,T codes Q,S well by time n; X process codes the Y process well by time n  089
cod  261  code  089
cod  262  P,T codes Q well by time n  090
cod  265  P,T codes Q,S well by time n in about the same way that P’,T’ codes Q’,S’  090
cod  172  coded factor, code  054
cod  260  P,T codes Q,S well by time n; X process codes the Y process well by time n  089
cod  261  code  089
cod  262  P,T codes Q well by time n  090
cod  265  P,T codes Q,S well by time n in about the same way that P’,T’ codes Q’,S’  090
col  039  column, or P,T column  018
col  039  column, or P,T column  018
com  102  a and b communicate  035
com  197  Completely extremal  058
com  102  a and b communicate  035
com  197  Completely extremal  058
con  055  E(f|X,Y,Z).  024
con  057  conditional expectation of f with respect to s  024
con  061  conditional expectation of μ with respect to s  025
con  145  Conditioning property  047
con  151  Convex combination of partitions p1 P1+ p2 P2+.. pm Pm  048
con  055  E(f|X,Y,Z).  024
con  057  conditional expectation of f with respect to s  024
con  061  conditional expectation of μ with respect to s  025
con  145  Conditioning property  047
con  151  Convex combination of partitions p1 P1+ p2 P2+.. pm Pm  048
cop  246  we are copying P,T to R  086
cop  249  copy P,T to get Q so that Q,S looks like P,T  087
cop  250  we are copying P,B,T to get Q,C,S. or we are copying B to get C so that Q,C,S looks like P,B,T.  087
cop  271  copies of σ-algebras, partitions etceteras in the land of P  091
cop  246  we are copying P,T to R  086
cop  249  copy P,T to get Q so that Q,S looks like P,T  087
cop  250  we are copying P,B,T to get Q,C,S. or we are copying B to get C so that Q,C,S looks like P,B,T.  087
cou  080  coupling  030
cou  289  a picture of the coupling between X1, X2… Xn and Y1, Y2… Yn  098
cou  298  The coupling corresponding to f  103
cp  145  Conditioning property  047
dba  091  dbar distance  033
dia  239  diagonal  082
dis  258  distance between P and Q, |P-Q|  089
ea  241  eventually agree  083
eb  128  essentially bounded  042
efs  187  exponentially fat submeasure  057
efx  055  E(f|X,Y,Z).  024
en  133  small exponential number of words, exponential number of words  044
ent  134  entropy of the process  044
ent  137  entropy of a transformation  045
ent  142  entropy of P  046
ent  155  the entropy of P over Q  049
ent  166  The entropy of a transformation  051
ent  273  P,T,n entropy drop  093
eps  212  ε startover process of …X-2, X-1, X0, X1, X2….  065
erg  015  ergodic  012
err  035  error set  017
es  035  error set  017
es  133  exponential size of a word  044
ess  128  essentially bounded  042
eve  241  eventually agree  083
exp  133  small exponential number of words, exponential number of words  044
exp  133  exponential size of a word  044
exp  187  exponentially fat submeasure  057
ext  189  extremal  057
ext  197  Completely extremal  058
fac  024  factor  014
fb  195  B, FB  058
fd  191  finitely determined  057
fir  145  put P1 into the first piece of P2  047
fn  127  fN  042
fut  083  the n future  032
gen  030  P generates T, P is a generator for T.  016
gen  082  σ-algebra generated by X  032
gi  234  [ ], greatest integer  080
h  159  H  050
ham  089  mean Hamming distance  033
hom  023  homomorphism  014
hp  150  H(P) H(T)  047
hpt  164  H(P,T)  051
ht  150  H(P) H(T)  047
ic  192  independent concatenation  058
ic  194  dbar limit of independent concatenations  058
ind  176  the induced transformation of T on γ  054
int  005  interval space, with atoms, without atoms  009
int  234  [ ], greatest integer  080
int  257  define P intersect Q. The symmetric difference between P and Q  088
inv  052  invariant σ-algebra  022
is  005  interval space, with atoms, without atoms  009
iso  023  isomorphism  014
iso  024  isomorphic  014
it  176  the induced transformation of T on γ  054
joi  147  join of two partitions  047
l  105  L(a)  036
lan  270  the land of P  091
lan  271  copies of σ-algebras, partitions etceteras in the land of P  091
leb  006  Lebesgue space  009
lop  270  the land of P  091
lop  271  copies of σ-algebras, partitions etceteras in the land of P  091
ls  006  Lebesgue space  009
mar  068  Martingale  026
mar  100  Possibly Non-stationary Markov Chain  035
mar  109  Markov chain  037
mar  117  mixing Markov process  039
mar  118  n step Markov process  039
mar  119  Y is called the n step Markov process corresponding to X  039
mc  100  Possibly Non-stationary Markov Chain  035
mc  109  Markov chain  037
mea  002  measurable  009
mea  003  measure preserving  009
mea  004  Measurable and measure preserving  009
mea  023  measurable  014
mhd  089  mean Hamming distance  033
min  073  a^b  028
mm  117  mixing Markov process  039
mon  054  Monkey method  023
mon  065  a stationary process obtained by the monkey method  026
mp  003  measure preserving  009
mp  004  Measurable and measure preserving  009
mp  023  measure preserving  014
mp  117  mixing Markov process  039
mp  118  n step Markov process  039
mp  119  Y is called the n step Markov process corresponding to X  039
nan  041  n name  018
nfu  083  the n future  032
nmp  119  Y is called the n step Markov process corresponding to X  039
nn  041  n name  018
nsh  131  n shift  044
P  208  P(ε, S1)  063
par  235  Parity  080
pas  078  past  036
pat  101.5  there is a path from a to b  035
per  114  periodicity of the process  039
per  122  periodicity of an n step Markov process  040
pic  289  a picture of the coupling between X1, X2… Xn and Y1, Y2… Yn  098
pnc  100  Possibly Non-stationary Markov Chain  035
poc  289  a picture of the coupling between X1, X2… Xn and Y1, Y2… Yn  098
pro  029  P,T process  016
ptn  273  P,T,n entropy drop  093
ptp  029  P,T process  016
put  145  put P1 into the first piece of P2  047
rea  135  reasonable names  044
roh  035  Rohlin tower of height N  017
roh  291  Rohlin R’’ of height n endowed with partitions P’’ corresponding to P, and Q’’ corresponding to Q, is a picture of the coupling  099
run  038  rung of the tower, first rung of the tower, base of the tower.  018
run  040  a rung of that column  018
sa  082  σ-algebra generated by X  032
sd  257  define P intersect Q. The symmetric difference between P and Q  088
shi  131  n shift  044
siz  186  size  057
sp  009  stationary process  010
spm  065  a stationary process obtained by the monkey method  026
st  085  same tail  032
sta  009  stationary process  010
sta  212  ε startover process of …X-2, X-1, X0, X1, X2….  065
sto  070  stopping time  027
sto  070  stopping time  027
sub  063  subsequential limit of the μi  026
sub  183  submeasure  056
sym  257  define P intersect Q. The symmetric difference between P and Q  088
tai  084  the tailfield  032
tai  085  same tail  032
tai  087  trivial tail  032
th  104  θ(a)  036
tp  101  transition probabilities  035
tra  001  transformation  009
tra  101  transition probabilities  035
tra  103  a is transient  035
tri  087  trivial tail  032
tri  181  trivial transformation  056
tt  087  trivial tail  032
tt  181  trivial transformation  056
var  093  variation distance  033
var  094  The variation distance between p1, p2,… pn and q1, q2,… qn  034
var  095  variation distance  034
var  096  variation distance  034
vd  093  variation distance  033
vd  094  The variation distance between p1, p2,… pn and q1, q2,… qn  034
vd  095  variation distance  034
vd  096  variation distance  034
ver  058  version for E(f|s)  024
vwb  190  Very Weak Bernoulli  057
wb  231  weak Bernoulli WB  080