Outline of Ergodic Theory
Steven Arthur Kalikow (second edition), 2/6/16

Table of contents:

FOREWORD
PREFACE
INTRODUCTION
PRELIMINARIES
Lebesgue spaces
Stationary
Methods of making a measure preserving transformation: naming a function, stationary process, cutting and stacking
ergodic
homomorphism
Increasing Partitions
Measurable Zorn's lemma & Rohlin tower theorem
superimposition of partitions & Countable Generator Theorem
Birkhoff ergodic theorem
measure from a monkey sequence
Stationary is convex combination of ergodic (proof 1)
conditional expectation
conditional expectation of a measure
Stationary is convex combination of ergodic measures (proof 2)
Subsequential limit & Extending the monkey method & Comparison
Martingales & martingale convergence & Probability of present given past
Coupling
MEAN HAMMING METRIC & DBAR METRIC & VARIATION METRIC
Markov Chains
Preparation for the Shannon McMillan Theorem
ENTROPY
Entropy of process and transformation
Shannon McMillan Breiman Theorem
Entropy of a partition
Conditioning property & Independent joining
Krieger's Finite Generator Theorem
induced transformation
BERNOULLI TRANSFORMATIONS
Definitions of EX, VWB, FD, IC, B and FB
Extremality of independent processes
Final preliminaries of big equivalence theorem
B FB IC EX VWB FD
coupling infinite paths
Stationary Ergodic IC FD VWB WB
ORNSTEIN ISOMORPHISM THEOREM
Copying In Distribution
Coding
The Land of P
Capturing entropy
The dbar metric
Sinai's and Ornstein's theorems
ABBREVIATION OF BOOK (primarily a capsule of the book, but also serves as an extended table of contents)
INDEX OF DEFINITIONS BY PAGE (also serves as an extended table of contents)
INDEX OF DEFINITIONS ALPHABETIZED

Foreword:

The scheme of the book is that statements of theorems and definitions are made as clear as possible, but although some rigorous proofs are presented, for the most part only the idea of the proof is presented, not the rigorous proof. Books that give rigorous proofs give you an excellent view of the trees, but a poor view of the forest. The main benefit received from this book is that the student develops the ability to think conceptually rather than in terms of detailed symbol pushing. This format benefits a) the serious beginner who wishes to work out the proofs for himself with helpful guidance from the book, and who does not want to be overburdened with notation while trying to understand the ideas, b) the student who just wants an overview of ergodic theory, and c) the person who already knows ergodic theory and wants a review.

Most graduate level pure math books provide detailed proofs of theorems followed by homework exercises at the end. Why not make the theorems themselves the homework exercises? The answer is that proving the theorems is too hard for the student without help. In this book, for the most part the theorems are the exercises, which is possible because the book presents the essence of the proofs so that the student's job is manageable.

If this book is used as the text of a course, I would suggest that the professor insist that the students not take notes in class. The professor should assign specific theorems as homework.
The students would have the option of reproducing the professor's proof, filling in the details of the proof in the book, or coming up with his own proof. The professor should select theorems that are written up in the book very nonrigorously, because if I were a student, it would strain my sense of honesty to have to write up a proof that is already completely there. In some cases part of the proof is written rigorously; it is the job of the professor to isolate for the student the part that has not been.

At some point the graduate student has to grow up and become a mathematician. The student can't just read other people's work; the student has to creatively come up with his own thinking. This book helps to bridge that gap.

An elementary understanding of real analysis, point set topology and probability theory is assumed. The book is written at the level of a second year graduate student in mathematics. The reader is expected to already have read and written many rigorous proofs and to be quite confident about his ability to know when he can make an argument rigorous, and when he cannot.

PREFACE

This preface is written in an intuitive manner in the hope that a scientifically oriented nonmathematician can get something out of it. It is the only part of the book that a nonmathematician is likely to follow.

Example 1: A pool table has no pockets or friction. Part of the table is painted white and part of the table is painted black. A pool ball is placed in a random spot and shot in a random direction with a random velocity. You are blindfolded. You do not get to see the shape of the table. All you get to know is when the ball is in the white part and when it is in the black part. From this information you are to deduce as much as you can about the entire process, such as whether or not it is possible that the table is in the shape of a rectangle.

Example 2: We are receiving a sequence of signals from outer space coming from a random stationary process (stationary means that the probability law is stationary in time). We are unable to detect a precise signal, but we can encode it by interpreting 5 successive signals as one signal; this code loses information. Furthermore, we make occasional mistakes. We wish to get as much information as possible about the original process.

The subject that addresses these examples is called ergodic theory. To explain this a little more mathematically, we need to introduce two words: "measure" and "transformation". There is a concept of measure that tells you how big a set is, or, in the language of probability theory, how probable it is. A transformation is a way of assigning one point to another (e.g. taking the position and direction of a pool ball into the position and direction that the ball will be in a minute later). Fundamentally, ergodic theory is the study of transformations on a probability space which are measure preserving (e.g. if a set of points has measure 1/3, then the set of points which map into that set also has measure 1/3). This means, for instance, that the probability of rain 3 years from now is the same as it is today. Of course this is false, but I am just pretending it to be true. When you apply such a transformation over and over you get a process. In example 1, consider the process you get if you only look at the ball after 1 minute, after 2 minutes, etc., and don't pay attention to where the ball is in between (e.g. 1½ minutes).
This process arises from repeatedly applying the following transformation: transform the position and velocity of the ball into its position and velocity a minute later.

A process in which the probability laws do not change over time is called a stationary process (e.g. the probability of rain 3 years from now is the same as it is today). If the weather could be interpreted as a stationary process, an output might be a listing of every day in the future and past, and whether it is raining or sunny. It would be a doubly infinite output of R's and S's,

…R R S (S) R R…,

where the parentheses indicate today (it is sunny today, will be rainy tomorrow, and was sunny yesterday). The transformation shifts tomorrow into today, so it shifts the sequence one to the left:

…R R S S (R) R….

Here is a summary of the important concepts and theorems in this book.

1) Isomorphism: Suppose that in example 1 we were to change which part of the table we paint white and which we paint black. Then you would have a different process. But our new process could end up being equivalent to the original process in the sense that if you know the output of either process, it determines the output of the other process. Any two such equivalent processes are called isomorphic. If you are simply given two processes and want to know whether they are isomorphic, then you need to find out whether there is a way to correspond them in a nice way so that each determines the other.

2) Ergodic: An ergodic process or transformation is a process (or transformation) which cannot be written as a linear combination of two other processes (or transformations). If all men watch baseball all the time and all women watch cooking shows all the time, the process of what a fetus will watch all the time is not ergodic if you don't know what sex that baby will be, because it is either the baseball process or the cooking process and you don't know which. Mathematically, it can be regarded as (1/2)(baseball process) + (1/2)(cooking process). Such a process is not ergodic.

3) Birkhoff Ergodic Theorem: When an ergodic transformation is repeatedly applied to form an ergodic process, the frequency of time that the process spends in a given set is the measure of that set, e.g. it spends 1/3 of the time in a given set of measure 1/3.

4) The Rohlin Tower theorem: Fix a positive integer, say 5. For any measure preserving transformation that does not simply rotate finite sets of points around, you can break almost the whole space into 5 equally sized disjoint sets which can be ordered so that the transformation takes each set to the next.

5) Shannon McMillan Breiman theorem: In an ergodic process with finite alphabet, consider an infinite name. If you look at its sequence of finite initial names (e.g. a, ab, abb, abba, …), that sequence will have probabilities decreasing at a rate which will approach exponential decay, and the rate of decay will not depend on the sequence you choose.

6) Entropy: The exponential rate of decay just mentioned is called the entropy of the process. Since the number of reasonable names is the reciprocal of the typical probability of a name, entropy can also be thought of as the limiting exponential number of reasonable names. Example: Suppose you repeatedly flip a fair coin. All of the head, tail names of length n are reasonable, and there are 2^n of them, so the entropy of the process is log 2.

7) Kolmogorov entropy invariance: Two isomorphic processes must have the same entropy.
8) Independent process: A stationary process on an alphabet in which all letters are completely independent of each other is called an independent process (e.g. repeatedly flipping a fair coin).

9) Ornstein Isomorphism theorem: Two stationary independent processes are isomorphic if and only if they have the same entropy.

INTRODUCTION

1) Ergodic theory is the study of measure preserving transformations on a probability space (Ω, A, μ, T). Here Ω, A, μ, T are the space, σ-algebra, measure, and transformation respectively.

2) A Lebesgue space is the unit interval (only as a measure space, no topology, no metric), possibly plus and/or minus a set of measure zero, possibly with atoms. A Lebesgue space is always a probability space. All ergodic theory is done on Lebesgue spaces. Non-Lebesgue counterexamples don't count.

3) An isomorphism between two four-tuples (Ω, A, μ, T) and (Ω', A', μ', T') is a bijection from Ω to Ω' which preserves the σ-algebra and the measure and commutes with the transformations.

4) A homomorphism, φ, is the same as an isomorphism except only onto, not necessarily one to one. Here, as in an isomorphism, φ^-1 takes members of A' to members of A, but the image of that map does not have to include all of A (distinct sets in A do not have to map to distinct sets in A'). A homomorphism is the same as throwing away part of the σ-algebra.

5) Stationary processes are an example of such a four-tuple, where Ω is the space of doubly infinite strings of letters in some finite or countable alphabet, A is the σ-algebra on Ω generated by cylinder sets (a cylinder set is a set which depends on only finitely many coordinates), μ is a measure on A which causes the process to be a stationary process, and T shifts the doubly infinite strings to the left by one.

6) The difference between probability theory and ergodic theory with respect to stationary processes is that ergodic theory considers two processes to be essentially the same if they are isomorphic.

7) Start with a homomorphism, φ, of one stationary process to another. φ^-1 of the set of doubly infinite strings with a given fixed letter at the origin is a measurable set, and every measurable set is approximately equal to a cylinder set. This gives an approximate coding of one process from another.

8) A collection of transformations indexed by the reals can form a flow, where the composition of T_t and T_s is T_(t+s), if proper measurability conditions are met.

9) A transformation can turn into an operator, where T(f) is defined by T(f)(ω) = f(T(ω)).

10) Some ergodic theorists like to add structure to their Lebesgue space, considering, for instance, flows on a differentiable manifold.

11) The assumption that T is a measure preserving transformation is sometimes replaced with the weaker assumption that T is a nonsingular transformation, which means that T sends sets of measure 0 to sets of measure 0.

12) Some look at measure preserving transformations on an infinite measure space.

13) Here are some definitions that we think every ergodic theorist should be familiar with, but most of these definitions are outside the scope of this book. Throughout these definitions (Ω, A, μ, T) is as in (1) and every stationary process is to be regarded as such a four-tuple by (5).

Ergodic: T is called ergodic if any Q ∈ A with T^-1(Q) = Q has measure either 0 or 1.

Bernoulli: T is called Bernoulli if it is isomorphic to a completely independent identically distributed process.
K: A stationary process is called K (for Kolmogorov) if any measurable set S with the following property has measure either 0 or 1: for any two words …a-2 a-1 a0 a1 a2… and …b-2 b-1 b0 b1 b2… such that there exists N such that for all n > N, an = bn, either both words are in S or neither of them is.

Mixing: A measure preserving transformation T is called mixing iff for all P, Q ∈ A, lim (i → ∞) μ(T^-i(P) ∩ Q) = μ(P)μ(Q).

Weak mixing: There are three equivalent definitions of weak mixing. Any of these 3 alone defines weak mixing when T is a measure preserving transformation on Ω.
1) There is a set of integers of density 1 such that for all P, Q ∈ A, the limit of μ(T^-i(P) ∩ Q), as i → ∞ along that set, is μ(P)μ(Q).
2) Define T × T on Ω × Ω, with measure μ × μ, by T × T(a,b) = (T(a), T(b)). Then T × T is ergodic.
3) There does not exist a measurable f from Ω to the unit circle of the complex numbers and a λ ≠ 1 such that for all ω, f(T(ω)) = λf(ω).

Minimal self joinings: Define S on Ω × Ω by S(a,b) = (T(a), T(b)). For any measure preserving transformation T, here are two ways to define a measure ν on Ω × Ω so that the projections of S on the axes are both isomorphic to T.
1) ν = μ × μ.
2) Fix an integer n in advance. The support of ν is on {(ω, T^n(ω))}, and for any Q ∈ A, ν{(ω, T^n(ω)): ω ∈ Q} = μ(Q).
If these are the only ways to define such a measure, then T is said to have minimal self joinings.

Entropy: There is a theorem that says that for any ergodic process …X-2, X-1, X0, X1, X2…, the sequence of events {X0 = a}; {X0 = a, X1 = b}; {X0 = a, X1 = b, X2 = c}; … has measures decreasing at a constant exponential rate except on a set of words abc… of measure 0. This rate is called the entropy of the process.

Subshift of finite type: Fix a finite alphabet. Let W be a finite list of finite words. Let Ω be the space of doubly infinite words in your alphabet in which no word of W occurs as a subword. Let T be the transformation on Ω that shifts the doubly infinite words to the left by one. Let A be the σ-algebra generated by cylinder sets and μ be the measure on A that maximizes the entropy of T. Then (Ω, A, μ, T) is called a subshift of finite type.

Pinsker algebra: The set of all S ∈ A with the following property turns out to be a σ-algebra and is called the Pinsker algebra of T: the process …X-2, X-1, X0, X1, X2… defined by Xi(ω) = 1 if T^i(ω) ∈ S and 0 if T^i(ω) ∉ S has 0 entropy.

Mean Hamming Distance: The Hamming distance between two words a1, a2, … an and b1, b2, … bn is (1/n)#{i ∈ {1,2,3,…,n}: ai ≠ bi}.

Rank 1: The most natural way to define rank 1 is to learn cutting and stacking (described in the Preliminaries) and then define T to be rank 1 if it is constructed by cutting and stacking so that at each stage there is only one column. However, if you don't wish to read beyond this introduction, the following theorem defines rank 1: a process …X-2, X-1, X0, X1, X2… is isomorphic to a rank 1 transformation iff for all ε > 0 there is an n such that with probability 1, part of the doubly infinite word can be tiled by disjoint subwords of length n such that
1) limsup (N → ∞) (1/N)#{i ∈ {-N, -N+1, …, N}: Xi is not in the part covered} < ε, and
2) the mean Hamming distance between any two such subwords is < ε.

PRELIMINARIES

In this book, when we write "Proof" we regard our proof to be rigorous. However, when we write "Idea of proof" we expect the reader to be capable of writing out a more carefully written proof, although part of the proof may already be rigorous.

Measurable and measure preserving.

DEFINITION 1: Let Ω be a set of points and T be a function from Ω onto Ω. Then T is called a transformation on Ω.

DEFINITION 2: Let (Ω, A, μ, T) be a collection of points, a σ-algebra on Ω, a probability measure on A, and a transformation on Ω respectively.
Then T is measurable if T^-1(S) ∈ A for all S ∈ A.

DEFINITION 3: Let (Ω, A, μ, T) be a space, σ-algebra, probability measure and measurable transformation respectively. T is called measure preserving if μ(T^-1(S)) = μ(S) for all S ∈ A.

DEFINITION 4: Measurable and measure preserving are similarly defined when T is a transformation from one probability space to another.

Lebesgue spaces.

DEFINITION 5: Let 0 < t ≤ 1. Let Ω be a set consisting of the interval [0,t] and a finite set F, Ω = [0,t] ∪ F. Place a probability measure μ on Ω which, when restricted to [0,t], is uniform Lebesgue measure with the standard Lebesgue σ-algebra and total measure t. (Ω, μ) is called an interval space, with atoms if t ≠ 1, without atoms if t = 1.

DEFINITION 6: Let (Ω, μ) be an interval space with or without atoms. Let (Ω', A', μ') be a probability space. If there are sets A ⊆ Ω and B ⊆ Ω', both of measure one, and a bijection between A and B which is bimeasurable and measure preserving in both directions, then (Ω', A', μ') is called a Lebesgue space.

ABBREVIATION: You get a Lebesgue space by taking the unit interval, possibly squashing some subintervals into atoms, and then perhaps removing and adding sets of measure 0.

CONVENTION 7: In this book all transformations are one to one and all spaces are atomless Lebesgue spaces.

COMMENT 8: Since, in essence, the only Lebesgue space is the unit interval, one can be deceived into thinking that we are talking about a specific space. Actually, the concept of being isomorphic to the unit interval is quite general. It includes every probability space you are likely to come across in a standard probability course, including Poisson processes, Brownian motion, and white noise. The only time you should suspect that you are not dealing with a Lebesgue space is when either a) the space is endowed with a topology that is not separable or b) you used the axiom of choice to make the space.

Stationary

DEFINITION 9: A stationary process is a doubly infinite sequence of random variables ...X-3,X-2,X-1,X0,X1,X2,X3... whose probability law is time invariant (e.g. the probability that X0 is "a", X1 is "b", and X2 is "c" is the same as the probability that X100 is "a", X101 is "b", and X102 is "c"). The alphabet is the set of values an Xi is allowed to take.

CONVENTION 10: Unless stated otherwise, the alphabet should be regarded to be at most countably infinite.

EXAMPLE 11: Let ...X-3,X-2,X-1,X0,X1,X2,X3... be an independent sequence of heads and tails, obtained by a doubly infinite independent sequence of flips of a fair coin. This is a stationary process. The alphabet is {h,t}.

EXERCISE 12: Show that a stationary process with an at most countably infinite alphabet, endowed with the σ-algebra generated by cylinder sets (a cylinder set is a set determined by X-n,...,X-1,X0,X1,...,Xn for some n), is a Lebesgue space.

Methods of making a measure preserving transformation

naming a function: Define a transformation explicitly, e.g. let T from the unit interval to itself be defined by T(x) = e^x mod 1. Then you have to look for a measure on the unit interval that makes the transformation measure preserving. This particular transformation is not one to one.

stationary process: Define a transformation by letting the measure be a stationary process on the space of doubly infinite words, and then let T be the shift to the left, i.e. T(...a-3,a-2,a-1,a0,a1,a2,a3...) = (...b-3,b-2,b-1,b0,b1,b2,b3...) where bi = ai+1 for all i.

cutting and stacking: Start with the real line with Lebesgue measure.
Cut off a finite piece (an interval) of the line (whose measure is not required to be 1). Define a transformation on that piece by cutting it into many equal sized pieces, and stacking them into a vertical tower, defining the transformation to go straight up.

EXAMPLE 13: Perhaps the finite piece you cut off is [0, 3/4] and you cut it into [0, 1/4], [1/4, 2/4], [2/4, 3/4], and stack them vertically with [0, 1/4] on the bottom, [1/4, 2/4] above it, and [2/4, 3/4] on top. The picture looks like this:

[2/4, 3/4]
[1/4, 2/4]
[0, 1/4]

and the transformation goes straight up, taking x to x + 1/4 on the bottom two rungs, and so far undefined on the top rung. So, to define it on part of that top, cut the tower vertically into columns and stack some of those columns on top of others. You may add more of the line at this stage, by cutting off from it more pieces called spacers, and putting them on the top of columns before stacking them. This would cause the space we are considering to increase in measure. The transformation still goes straight up, which makes its definition the same as it was on the part already defined.

EXAMPLE 14: Take the previous stack and split it into 5 equal sized columns. Arrange these columns into two stacks: place the second column on top of the first to form the first stack, and then place the fourth column on the third, and the fifth column on the fourth, to form the second stack. Then place two spacers between the third and the fourth columns. In terms of twentieths of the line, the first stack has rungs, from bottom to top,

[0, 1/20], [5/20, 6/20], [10/20, 11/20], [1/20, 2/20], [6/20, 7/20], [11/20, 12/20],

and the second stack has rungs, from bottom to top,

[2/20, 3/20], [7/20, 8/20], [12/20, 13/20], spacer, spacer, [3/20, 4/20], [8/20, 9/20], [13/20, 14/20], [4/20, 5/20], [9/20, 10/20], [14/20, 15/20].

Continue this process of cutting and stacking infinitely many times, making sure that your transformation ends up being defined almost everywhere. If you choose to keep adding spacers as you continue, the total measure of the space will keep increasing. You are required to make sure that in the end the whole space we are considering is finite in measure. Hence you cannot add arbitrarily many spacers at each stage. Then normalize the measure to be measure 1.

ergodic

DEFINITION 15: A transformation is called ergodic if there is no set S, of measure strictly between 0 and 1, such that T^-1(S) = S.

COMMENT 16: A typical example of a nonergodic process is to take two coins, one fair and one unfair, and randomly pick one of them, each with probability 1/2. Once you have picked your coin, use it to produce a doubly infinite sequence of heads and tails. This gives a stationary process, and hence, by the above, gives a measure preserving transformation. It is nonergodic, because S can be chosen to be the event that the frequency of heads approaches 1/2. This is a stationary process, but an unnatural one, since it is obviously a convex combination of two more natural processes. We consider nonergodic processes to be unnatural, and usually ignore them.

COMMENT 17: You will notice that the definition of ergodic refers to T^-1(S) rather than T(S), and we will frequently prefer to write T^-i rather than T^i. This is because T^i(ω) ∈ S iff ω ∈ T^-i(S). This is important when T is not one to one, because in that case it is not always true that T(S) has the same measure as S.

DEFINITION 18: A stationary process …X-2, X-1, X0, X1, X2… is called an independent process if all the Xi's are independent of each other (e.g. P(X0 = a, X1 = b, X2 = c) = P(X0 = a)P(X1 = b)P(X2 = c)).

THEOREM 19: Independent processes are ergodic.

Idea of proof: Approximate a set S by a cylinder set. If T^-1(S) = S, then T^-n of that cylinder set also approximates S for any n. By choosing two far apart values for n, S can be approximated by two completely independent sets.//
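The contrast between Theorem 19 and the nonergodic two-coin process of Comment 16 is easy to see numerically. The following Python sketch is my own illustration (the 0.9 bias and the run lengths are not from the text): it prints the empirical frequency of heads along long sample paths of an independent fair-coin process and of the fair/biased mixture. For the independent process the frequency lands near 1/2 on every run; for the mixture it lands near 1/2 or near 0.9 depending on which coin was secretly picked, which is exactly the failure of ergodicity.

import random

def fair_path(n):
    # independent fair coin flips: an ergodic (in fact independent) process
    return [random.random() < 0.5 for _ in range(n)]

def mixture_path(n, biased_p=0.9):
    # Comment 16: first pick one of the two coins at random, then flip it forever
    p = 0.5 if random.random() < 0.5 else biased_p
    return [random.random() < p for _ in range(n)]

n = 100_000
for trial in range(3):
    fair = fair_path(n)
    mixed = mixture_path(n)
    print("trial", trial,
          "| fair-coin heads frequency:", round(sum(fair) / n, 3),
          "| mixture heads frequency:", round(sum(mixed) / n, 3))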
EXERCISE 20: Prove that a mixing Markov chain is also ergodic.

COMMENT 21: If you don't know what a Markov chain is, this book has a section introducing Markov chains to you.

homomorphism

COMMENT 22: In ergodic theory, sets of measure zero don't count. Lebesgue set and Borel set mean the same thing. When we speak of the whole σ-algebra, it does not matter whether you regard us to be talking about the Lebesgue sets or the Borel sets.

DEFINITION 23: Let (Ω, A, μ, T) be the space, σ-algebra, measure, and transformation respectively, and let (Ω', A', μ', T') be another such four-tuple. A function f: (Ω, A, μ, T) → (Ω', A', μ', T') is called measurable if f^-1(S) ∈ A for all S ∈ A'. f is called measure preserving if μ(f^-1(S)) = μ'(S) for all S ∈ A'. If f is measurable and measure preserving, and if T'(f(ω)) = f(T(ω)) for all ω, then f is a homomorphism. A one to one homomorphism is called an isomorphism.

DEFINITION 24: A homomorphic image of a four-tuple (Ω, A, μ, T) (or process) is called a factor of that four-tuple (or process). An isomorphic image of a four-tuple (Ω, A, μ, T) (or process) is called isomorphic to that four-tuple (or process).

Increasing Partitions

THEOREM 25: Assume the whole space to be a Lebesgue space. Let P1, P2, ... be an increasing collection of finite partitions which separate any two points. Then every set in the σ-algebra can be approximated by the union of some pieces of Pi for some i (i.e. the symmetric difference of the set and its approximation can have arbitrarily small measure by making i sufficiently large).

Idea of proof: The definition of Lebesgue space does not include topology, but since every Lebesgue space is essentially the unit interval, we can add the topology of the unit interval to the space. Let A be your chosen set. Every set can be approximated by an open set containing it and a closed set contained in it. Fix ε > 0. Show that every set in Pi can be approximated by a closed set from within so that
a) for all i, the union of the closed approximating sets has measure > 1 − ε, and
b) if we choose a sequence A1 ∈ P1, A2 ∈ P2, A3 ∈ P3, … such that A1 ⊇ A2 ⊇ A3 ⊇ ..., and if C1, C2, C3, … are the closed approximations for A1, A2, A3, …, then C1 ⊇ C2 ⊇ C3 ⊇ ....
The assumption that we separate points means that every point is the intersection of precisely one decreasing sequence of sets from P1, from P2, from P3, etc. Replace A and its complement with approximating open sets containing them. Fix a point and replace its decreasing sequence of sets with the approximating closed sets. Then eventually one of these closed sets is entirely inside our replacement for A, or inside our replacement for the complement of A. Thus, for sufficiently large i, most (I mean "most" in measure, not necessarily in number) of the approximating closed sets in Pi are in the approximation for A or in the approximation for the complement of A.//

COROLLARY 26: Notation as in Theorem 25. P1, P2, ... generate the whole σ-algebra.

THEOREM 27: If T is one to one and measure preserving, then T(A) is measurable for every measurable set A.

Idea of proof: Since we are talking about a Lebesgue space it is easy to exhibit partitions P1, P2, ... as in Theorem 25. It easily follows that T^-1(P1), T^-1(P2), ... also separate points. Apply Theorem 25 to get a set B approximating A such that T(B) is measurable. Conclude by taking limits.//
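Here is a small numerical companion to Theorem 25, with choices of my own that are not in the text: the space is the unit interval, Pi is the partition into the 2^i dyadic intervals of length 1/2^i (an increasing sequence of finite partitions that separates points), and the chosen set is A = [1/5, 11/20). Approximating A from within by the union of the dyadic pieces entirely contained in it, the measure of the symmetric difference shrinks to 0 as i grows.

from fractions import Fraction

A_LEFT, A_RIGHT = Fraction(1, 5), Fraction(11, 20)   # A = [0.2, 0.55), measure 7/20

def inner_approximation(i):
    # measure of the union of dyadic intervals [k/2^i, (k+1)/2^i) contained in A
    piece = Fraction(1, 2 ** i)
    count = sum(1 for k in range(2 ** i)
                if k * piece >= A_LEFT and (k + 1) * piece <= A_RIGHT)
    return count * piece

target = A_RIGHT - A_LEFT
for i in range(1, 11):
    print("level", i, " measure of symmetric difference:",
          float(target - inner_approximation(i)))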
COROLLARY 28: If T is a one to one measure preserving transformation from a space to itself, μ(T(A)) = μ(A) for all measurable A. If T is an isomorphism from one space to another, T^-1 is also an isomorphism.

DEFINITION 29: Let P be a partition, and let T be a measure preserving transformation. The P,T process is defined to be the stationary process ...X-3,X-2,X-1,X0,X1,X2,X3... whose alphabet is the set of pieces of P, such that Xi(ω) takes the value p if p is the member of P containing T^i(ω).

DEFINITION 30: If the P,T process separates points (i.e. for every ω1 ≠ ω2 there exists i with Xi(ω1) ≠ Xi(ω2)), we say that P generates T, or that P is a generator for T.

COROLLARY 31: P generates T iff the P,T process generates the whole σ-algebra. If: obvious for a Lebesgue space. Only if: follows from Corollary 26.//

THEOREM 32: If P generates T, and Q generates T, then the P,T process, the Q,T process, and T are all isomorphic.

Idea of proof: Your isomorphism is the map that takes the P name of ω to the Q name of ω. It is obvious that this is a bijection. Measurability in both directions follows from Theorem 25.//

Measurable Zorn's lemma

Start with any probability space. Order a collection of measurable sets by inclusion. Then if each countable chain has an upper bound, the collection has a maximal element, up to measure zero.

Idea of proof: Pick a chain where (the measure of Si+1) − (the measure of Si) is always more than half of what it could be, and then take an upper bound. This proof does not make use of the axiom of choice.//

Rohlin tower theorem

THEOREM 33: If T is ergodic, for all N and ε > 0 there is a set S such that S, T(S), T^2(S), ..., T^N(S) are disjoint and cover all but ε of the space. This is also true if T is nonergodic, if you presume that with probability 1, T^i(x) ≠ x for any i and x.

Idea of proof: In the ergodic case, start with a tiny set c, and let S be the set of all points T^n(p) such that p ∈ c, n is a nonnegative multiple of N, and {T^1(p), T^2(p), T^3(p), ..., T^(n+N)(p)} is entirely outside of c. Then the union of S, T(S), T^2(S), ..., T^N(S) will disjointly cover all of the space except c ∪ T^-1(c) ∪ T^-2(c) ∪ ... ∪ T^-N(c), because by ergodicity, c and all of its translates cover the whole space. In the nonergodic case: since the space is Lebesgue, we can regard it to be the unit interval, and endow that interval with its usual metric. Let M >> N. Use the same proof as above, letting (by measurable Zorn) c be a maximal set such that for every x in c, T(x), T^2(x), ..., T^M(x) are all outside c. The only problem is to show that c and all of its translates cover the whole space. If not, note that by our non-periodicity assumption, for small enough δ > 0, there is a positive probability that T(x), T^2(x), ..., T^M(x) are all more than δ away from x, and x is not in the union of c and its translates. There must be some interval of diameter δ which intersects that event in a set of positive measure, and that intersection is a set of positive measure outside the translates of c, such that for any x in that set, T(x), T^2(x), ..., T^M(x) are all outside of that set. Add that set to c to contradict maximality.//

DEFINITION 34: If S, T(S), T^2(S), ..., T^N(S) are disjoint sets, they form a Rohlin tower of height N. When we speak of a Rohlin tower, we presume that the union of S, T(S), T^2(S), ..., T^N(S) has measure almost 1.

DEFINITION 35: If S, T(S), T^2(S), ..., T^N(S) is a Rohlin tower, the complement of S ∪ T(S) ∪ T^2(S) ∪ ... ∪ T^N(S) is called the error set.

COMMENT 36: In all arguments, always assume the error set is sufficiently negligible in measure.
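To make Theorem 33 concrete, here is a Python sketch of the construction in the idea of the proof for one specific ergodic transformation, the irrational rotation T(x) = x + α mod 1. The rotation, the base interval c, and the height are my own choices, and my bookkeeping builds the N rungs S, T(S), ..., T^(N-1)(S) (indexed slightly differently from the statement above): a sampled point is declared covered when the block of N rungs it falls into is completed before its orbit returns to c, exactly as in the proof. The estimated covered measure should come out close to 1.

import math, random

ALPHA = math.sqrt(2) - 1        # irrational rotation angle (my choice)
DELTA = 0.002                   # the tiny base set c = [0, DELTA)
N = 5                           # number of rungs S, T(S), ..., T^(N-1)(S) built here

def T(x, steps=1):
    return (x + steps * ALPHA) % 1.0

def in_c(x):
    return x < DELTA

def steps_since_last_visit(x):
    k = 0
    while not in_c(T(x, -k)):
        k += 1
    return k

def steps_until_next_visit(x):
    f = 1
    while not in_c(T(x, f)):
        f += 1
    return f

def covered(x):
    # x lies k steps after its orbit's last visit to c; it is covered iff the
    # block of N consecutive rungs starting N*floor(k/N) steps after that visit
    # is completed before the orbit returns to c.
    k = steps_since_last_visit(x)
    f = steps_until_next_visit(x)
    return (k // N) * N + N <= k + f

random.seed(1)
samples = 5_000
hits = sum(covered(random.random()) for _ in range(samples))
print("estimated measure covered by the tower rungs:", hits / samples)
print("crude upper bound for the error set (N * DELTA):", N * DELTA)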
EXERCISE 37: Prove that the error set can have size exactly ε for any ε > 0.

DEFINITION 38: If S, T(S), T^2(S), ..., T^N(S) is a Rohlin tower, each T^i(S), 0 ≤ i ≤ N, is called a rung of the tower. S is called the first rung of the tower. S is also called the base of the tower.

DEFINITION 39: Let S, T(S), T^2(S), ..., T^N(S) be a Rohlin tower, and P be a partition of the space. Define two points in the base of the tower to be equivalent if they have the same P name of length N. It is helpful to regard an equivalence class in the base of the tower as an interval. A column, or P,T column (for a given tower and partition), is defined to be the union of E, T(E), T^2(E), ..., T^N(E), where E is an equivalence class in the base of the tower.

DEFINITION 40: The intersection of any rung of the tower with a P,T column of the tower is called a rung of that column.

DEFINITION 41: When P and T are understood, the n name of a point ω is the sequence of sets of P containing ω, T(ω), T^2(ω), …, T^(n-1)(ω), in that order.

COMMENT 42: A P,T rung of a P,T column of a tower is entirely inside one piece of P. Indeed, the n name of any point in the base of a column of height n is precisely the pieces of P containing the base rung, second rung, third rung, etc. of the column containing that point, in that order.

THEOREM 43: Let P be a partition of the space, and N be arbitrary. Then there is a Rohlin tower S, T(S), T^2(S), ..., T^N(S) such that S and P are independent.

Idea of proof: Just take a Rohlin tower whose height is much bigger than N, and do the following to every P,T column: shade every (N+1)st rung, starting at the bottom rung and stopping within N rungs of the top, and let the union of the shaded areas of all the columns be the base (here we are letting N = 4). This will give you your desired Rohlin tower, except that the base will be almost independent of P, but not quite. To get perfection, shave off a small fraction of the base, and the column that small fraction defines, and put it into the error set.//

COMMENT 44: If P,T is a process, and you pick an n tower whose base is independent of the partition into n names, then you get the beautiful situation where the distribution of names defined by the columns is precisely the distribution of all n names.

superimposition of partitions

COMMENT 45: It is possible to superimpose countably many finite partitions and get an uncountable partition.

THEOREM 46: Let Si be a sequence of sets whose measures are summable. Let Pi be finite partitions of the whole space, such that one of the pieces of Pi is the complement of Si. Then the superimposition of the Pi forms a countable partition (as always, ignore sets of measure 0).

Idea of proof: Let Ai be the union, over j > i, of the Sj. Then the intersection of the Ai has measure zero, and it is easily seen that the superimposition partitions the complement of Ai into only finitely many pieces for every i.//

Countable Generator Theorem

THEOREM 47: Every measure preserving transformation on a Lebesgue space is isomorphic to a stationary process with a countable alphabet.

Idea of proof: All we need is a countable partition of the space such that the map which takes points to their names with respect to that partition is an isomorphism, i.e. such that the point to name map is one to one. To do this, create Rohlin towers with bases S1, S2, ... so that the measures of the Si are summable, and with insignificantly small error sets. Assume the heights of these towers are N1, N2, ... respectively.
Let Pi be a partition of the whole space, consisting of the complement of Si together with the following partition of Si: two points x and y of Si are in the same atom of that partition iff for each k ∈ {1, 2, ..., Ni}, T^k(x) and T^k(y), when thought of as real numbers with binary expansions, have the same first Ni digits. The reader should verify that the superimposition of the Pi does the trick (the error sets can be obnoxious, but Borel-Cantelli can be used to ignore all but finitely many of them).//

Birkhoff ergodic theorem

THEOREM 48: Let T be a measure preserving transformation on Ω. Let f be an integrable function on Ω. Then (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) converges as n → ∞, for almost every ω.

Idea of proof: Fix b > a. Consider points ω where limsup (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) > b and liminf (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) < a. Let B be chosen large enough that we can write f = f1 + f2, where |f1| < B (i.e. f1 is bounded between −B and B) and the integral of |f2| is small. Now pick N so big that the set of all points which don't have an ergodic average for f almost as big as b by time N (to be called bad points) has a probability which is a small fraction of (b−a)/B. Let M >> N. Show that there is an S ⊆ {ω, T(ω), T^2(ω), ..., T^M(ω)} such that i) the average of f over S is not much smaller than b, and ii) only bad points, and points among the last N points, are outside S. Hint: express S as a disjoint union of intervals of the orbit for which the average of f is not much less than b. For the points outside S, ω1, ω2, ..., we have f(ω1) + f(ω2) + ... = (*) + (**), where (*) = f1(ω1) + f1(ω2) + ... and (**) = f2(ω1) + f2(ω2) + .... (*) is usually a small fraction of (b−a)M, because the terms are bounded above by B and the number of bad points is usually a small fraction of (b−a)M/B. Note that since T is measure preserving, the integral of f2(T^i(ω)) equals the integral of f2(ω). (**) is usually a small fraction of M, because it is dominated by |f2(ω)| + |f2(T(ω))| + ... + |f2(T^M(ω))|, whose integral is a small fraction of M. This proves that the average of f over ω, T(ω), ..., T^M(ω) is usually much closer to b than to a. We get a contradiction by using an exactly analogous argument to establish that the average is usually much closer to a than to b.//

COMMENT and EXERCISE 49: Too many sets of measure zero? The reader might object that there is a probability 0 that things could go wrong for every a, b, and all these sets of measure zero could add up to a problem. Do you see how to get around this problem?

Birkhoff ergodic theorem 2: The limit of (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) is the integral of f for almost all ω.

In the ergodic case:

Idea of proof: The limit is a constant by ergodicity. Define f1 and f2 as above. (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) = (1/n)(f1(ω) + f1(T(ω)) + ... + f1(T^n(ω))) + (1/n)(f2(ω) + f2(T(ω)) + ... + f2(T^n(ω))). All these sequences converge by Theorem 48. The first sum converges to the integral of f1 by the bounded convergence theorem, and the second term converges to something small by Fatou's lemma.//

COMMENT 50: It is important for the reader to focus on the case where f is an indicator function. The above then says that the frequency of time the stationary process hits a certain set is the probability of that set.

EXERCISE 51: Show that a stationary process with a countable alphabet is ergodic iff the frequency of every word is constant.

DEFINITION 52: The invariant σ-algebra is the collection of all sets A such that T^-1(A) = A.

In the nonergodic case: The definition of conditional expectation will soon be given. The limit is the conditional expectation of f with respect to the invariant σ-algebra.

Idea of proof: It is easy to see that the limit is measurable with respect to the invariant σ-algebra, and for any A in that σ-algebra, the integral over A of the limit equals the integral over A of f, by the exact same proof as above. When you learn what conditional expectation is, you will understand that this is all you need to show.//
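A quick numerical check of Comment 50, with choices of my own that the book does not make: for the ergodic rotation T(x) = x + √2 mod 1 of the unit interval and f the indicator of [0, 1/3), the ergodic averages (1/n)(f(ω) + f(T(ω)) + ... + f(T^(n-1)(ω))) should converge to the integral of f, namely 1/3, and the Python sketch below watches them do so.

import math

ALPHA = math.sqrt(2) % 1.0       # irrational rotation angle (my choice)
CHECKPOINTS = {10, 100, 1_000, 10_000, 100_000}

def T(x):
    return (x + ALPHA) % 1.0

def f(x):
    # indicator function of [0, 1/3); its integral is 1/3
    return 1.0 if x < 1.0 / 3.0 else 0.0

x = 0.123456                     # an arbitrary starting point omega
total = 0.0
for n in range(1, 100_001):
    total += f(x)
    x = T(x)
    if n in CHECKPOINTS:
        print("n =", n, " ergodic average =", round(total / n, 5))
print("integral of f =", round(1 / 3, 5))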
EXERCISE 53: Use the Birkhoff ergodic theorem to prove the strong law of large numbers, which states that (1/n)(X0 + X1 + X2 + X3 + ... + Xn) converges to the integral of X1 when ...X-3,X-2,X-1,X0,X1,X2,X3... are independent identically distributed random variables.

measure from a monkey sequence

Let a monkey who lives forever type any sequence of 0's and 1's it wants to, α1, α2, α3, .... Obtain a stationary measure as follows. Select increasing integers mi such that the frequencies of cylinders of length 1 in the following sequence of finite words converge:

α1, α2, α3, ..., αm1 ;  α1, α2, α3, ..., αm2 ;  ...

Then take a subsequence ni of the mi such that the frequencies of cylinders of length 2 in the following sequence of finite words converge:

α1, α2, α3, ..., αn1 ;  α1, α2, α3, ..., αn2 ;  ...

Continue the pattern to get measures for all cylinder sets.

DEFINITION 54: The above method is called the monkey method for extracting a stationary process from a sequence from a finite alphabet.

EXERCISE 55: Prove the limiting measure to be stationary.

THEOREM 56: Stationary measures are convex combinations of ergodic measures.

Stationary is convex combination of ergodic (proof 1)

Idea of proof: If the above point came out of a stationary process, rather than a monkey, then taking subsequences of subsequences is unnecessary because everything already converges. The limit that the point uniquely defines will be an ergodic measure with probability 1 (warning: ergodicity is not so easy to prove). Letting f(ω) be the measure that ω gives rise to, we have that the integral of f(ω) with respect to μ gives back the original stationary measure.//

conditional expectation

DEFINITION 57: Let f be a function measurable with respect to the whole σ-algebra and let s be a sub σ-algebra. Then the Radon-Nikodym theorem guarantees the existence of a function g, called the conditional expectation of f with respect to s, denoted E(f|s)(ω), which is measurable with respect to s, and such that for all A in s, the integral of f over A is the integral of g over A. It is only unique up to a set of measure zero.

DEFINITION 58: A definite choice of E(f|s)(ω) for all ω is called a version of E(f|s). If g is a given version of E(f|s), then h is another version of E(f|s) iff g and h differ on a set of measure zero.

DEFINITION 59: For random variables X, Y, Z or more, the expectation of f with respect to the σ-algebra generated by X, Y and Z is denoted E(f|X,Y,Z).

COMMENT 60: For any f1, f2, f3, ... whose sum converges in a dominated way, E(f1 + f2 + f3 + ... |s)(ω) = E(f1|s)(ω) + E(f2|s)(ω) + E(f3|s)(ω) + ..., except on a set of measure 0 (for any σ-algebra s).

conditional expectation of a measure

We want the map taking A to the conditional expectation of the indicator function of A with respect to s to be a measure for almost every ω. The trouble is that forcing countable additivity causes too many bad sets of measure zero, another bad set for every special case of

(*) P(A1|s) + P(A2|s) + P(A3|s) + ... = P(A1 ∪ A2 ∪ A3 ∪ ... |s) for disjoint A1, A2, A3, ....

To get around this, just restrict to cylinder sets and to equations of type (*) with only finitely many terms, and then use the Carathéodory extension theorem on each ω to extend to a measure.
The result is a function which is jointly measurable in A and ω, which is a measure when ω is held fixed, and which is a version of E(1A|s) when A is held fixed. The original measure of A is the integral of the conditional measure of A.

DEFINITION 61: Let μ be a probability measure on a Lebesgue space, and let s be a σ-algebra of that space. Then the conditional expectation of μ with respect to s, denoted E(μ|s)(A,ω), is a function which is jointly measurable in A and ω, such that fixing the set A you get a version of the conditional expectation of the indicator of A, and fixing ω you get a probability measure.

Stationary is convex combination of ergodic measures (proof 2)

Idea of proof: Let s be the invariant σ-algebra. For each ω, where ω is a doubly infinite word of the stationary process, E(μ|s)(ω) is a measure when ω is held fixed. The definition of conditional expectation implies that μ is the integral of E(μ|s)(ω), with ω running in accordance with the stationary measure. This expresses μ as a convex combination of the measures E(μ|s)(ω). All that remains is to show that for almost all ω, E(μ|s)(ω) is ergodic. It should be ergodic because it assigns to any set A the value E(1A|s)(ω), and in the case where A is in the invariant σ-algebra, E(1A|s)(ω) = 1A(ω), which is either 1 or 0. However, the above equation is allowed to fail on a set of measure 0. Once again we have to contend with too many bad sets of measure 0, because you could get a bad set of measure 0 for every A. This can be handled, but is not easy.//

Stationary is convex combination of ergodic measures (proof 3)

Idea of proof: There is still another proof that stationary is a convex combination of ergodic. Just apply the Krein-Milman theorem, which says that in certain spaces, such as the space of measures that we look at, every point in any compact convex set is a convex combination of its extreme points.//

EXERCISE 62: Show that a measure in the space of stationary processes is ergodic iff it is an extreme point of that space.

Subsequential limit

DEFINITION 63: For each i, let μi be a measure on words of length i, not necessarily stationary. Take a subsequence whose measures on the first coordinate converge. Take a subsequence of that subsequence whose measures on the first two coordinates converge. Take a subsequence of that subsequence whose measures on the first three coordinates converge. Continue. The limiting measure on words is a subsequential limit of the μi (if we want, we can just assume the μi to have increasing length or infinite length rather than length i).

COMMENT 64: The subsequential limit need not be stationary if the approximating measures are not.

Extending the monkey method

We already exhibited a way to extract a stationary process from a sequence of letters. You can also extract a stationary process from a sequence of measures on finite sequences of increasing length, or from a sequence of measures on infinite sequences (a measure on infinite sequences is called a stochastic process). Let μ1, μ2, μ3, … be measures on words of finite but increasing lengths. Let those lengths be L1, L2, L3, … respectively. By passing to a subsequence if necessary, you can assume Ln > 100n for all n. What is important here is that Ln >> 2n, so that subwords of length n occur sufficiently often. For each n, derive from μn a nearly stationary measure νn on words of length n defined by

νn(a1, a2, a3, …, an) = (1/(Ln − n + 1)) · Σ (i = 0 to Ln − n) μn(xi+1 = a1, xi+2 = a2, …, xi+n = an),

and then take a subsequential limit of the νn. In the case where the μ1, μ2, … are stochastic processes, just truncate them in order to reduce to the case above. In either case:

DEFINITION 65: A subsequential limit of the νn above is called a stationary process obtained by the monkey method.

EXERCISE 66: Prove it to be stationary.
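To see what the formula for νn actually does, here is a small Python sketch with an example of my own (the source generating the word is not from the text): it takes one long word of length L, forms νn for n = 2 by averaging the empirical distribution of 2-blocks over all starting positions, and then checks near-stationarity by comparing the two coordinate marginals of ν2, which can differ only by boundary terms of order 1/L.

import random
from collections import Counter

random.seed(0)
L = 10_000
# one long sample word from some source (my choice: a two-state chain that
# stays put with probability 0.7), playing the role of a word of length L
word, prev = [], 0
for _ in range(L):
    prev = prev if random.random() < 0.7 else 1 - prev
    word.append(prev)

n = 2
blocks = [tuple(word[i:i + n]) for i in range(L - n + 1)]
total = len(blocks)                      # L - n + 1 starting positions
nu = Counter(blocks)                     # this is nu_n for n = 2

print("nu_2 on 2-blocks:")
for block in sorted(nu):
    print(" ", block, round(nu[block] / total, 4))

# near-stationarity: the two coordinate marginals of nu_2 agree up to O(1/L)
first = Counter(b[0] for b in blocks)
second = Counter(b[1] for b in blocks)
for letter in (0, 1):
    print("letter", letter,
          "| first-coordinate marginal:", round(first[letter] / total, 4),
          "| second-coordinate marginal:", round(second[letter] / total, 4))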
COMMENT 67: Comparison of subsequential limit and monkey method. The subsequential limit has the advantage of maintaining the local behavior. If you use the monkey method, the limiting measure on the first 10 coordinates may have absolutely nothing to do with the measures on the first 10 coordinates of the approximating measures. The monkey method has the advantage that it gives a stationary measure. Subsequential limits might not be stationary. However, if the approximating measures are stationary, then even a subsequential limit is stationary, and it is pointless to use the monkey method.

Martingales

DEFINITION 68: A sequence of random variables Xn such that E(Xn|Xn-1, Xn-2, ...) = Xn-1 for every n > 1 is called a martingale. (You are gambling, and at every time n−1 you don't care whether you go home or play another round.)

EXAMPLE 69: You flip a fair coin over and over, getting a dollar for every heads and losing a dollar for every tails. Xn is the amount of money you have at time n. X0, X1, X2, … is a martingale.

Stopping times

DEFINITION 70: A stopping time is a rule for stopping that depends only on information that you know by the time you stop.

EXAMPLE 71: In the above example, "Stop as soon as you are ahead" is a stopping time that makes money.

COMMENT 72: The stopping time "Stop at whichever comes first: you are ahead, or time 1000" does not make money (in expectation), because you might just as well have used 999. If you reach time 999, it does not help to play another turn.

DEFINITION 73: a∧b means the minimum of a and b.

Bounded time theorem: Using the reasoning of Comment 72, for any martingale, X(T∧n) never makes money for any stopping time T and any positive integer n. More precisely, E(X(T∧n)|X0) = X0.

Bounded money theorem: In a bounded martingale, stopping times never make money, i.e. E(XT|X0) = X0. Proof: E(X(T∧n)) converges to E(XT) by the bounded convergence theorem.//

EXAMPLES 74:

1) Unbiased random walk is recurrent. Idea of proof: Take a probability 1/2 one-step-backward, probability 1/2 one-step-forward random walk. Suppose the amount of money you have at time n is m, where m is the integer you are sitting on at time n. Then your fortune at time n is a martingale. Start at 1. Stop when you hit either 0 or 1,000,000. Letting p be the probability that you go to 0 before going to 1,000,000, the bounded money theorem gives 1 = p·0 + (1−p)·1,000,000.//

EXERCISE 75: Prove that with probability 1, you will not spend forever without hitting either 0 or 1,000,000.

2) Biased random walk is transient. Idea of proof: Take a probability 2/3 one-step-forward, probability 1/3 one-step-backward random walk. Your fortune at time n is (1/2)^m if you are sitting on m at time n. This gives a martingale. Start at 1. Letting p be the probability that you go to 0 before going to 1,000,000, the bounded money theorem gives 1/2 = p·1 + (1−p)·(1/2)^1,000,000.//

3) Solution to the Dirichlet problem. Suppose h is a harmonic function and X(t) is Brownian motion. Since h(x) is the average of h around any circle centered at x, it is easy to see that h(X(t)) is a continuous time martingale (see if you can guess what a continuous time martingale is). If you want h(x), and you know h on some curve around x, just start a Brownian path at the point x and stop when you hit the curve (stopping time T). Then h(x) = E(h(X(T))) by the bounded money theorem.//
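Example 74(1) is easy to test numerically. In the Python sketch below the barrier is M = 100 rather than the text's 1,000,000, purely to keep the run short; everything else follows the example. The bounded money theorem gives 1 = p·0 + (1−p)·M, so the probability of reaching M before 0, starting from 1, should be 1/M.

import random

M = 100            # absorbing barrier (the text uses 1,000,000; I use 100)
TRIALS = 50_000

def reaches_M_first(start=1):
    position = start
    while 0 < position < M:
        # Exercise 75 guarantees this loop ends with probability 1
        position += 1 if random.random() < 0.5 else -1
    return position == M

wins = sum(reaches_M_first() for _ in range(TRIALS))
print("empirical P(reach M before 0):", wins / TRIALS)
print("martingale prediction 1/M:    ", 1 / M)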
martingale convergence

THEOREM 76: A bounded martingale converges.

Idea of proof: Let us make the absurd assumption that the stock market is a bounded martingale (say bounded by 0 and 1). Let a < b. You keep selling when the stock market is higher than b, and buying when it is less than a. To put some limit on the situation, assume you sell any stock you may have at some very late date. If the market were to cross from below a to above b many times, you could get very rich. But this is impossible with high probability, because you cannot lose more than one dollar and your expected net gain has to be 0. Therefore the martingale only crosses a,b finitely many times for any a,b, and since this is true for every rational a and b, it converges. Careful examination of this argument gives an upper bound for the probability of crossing more than n times.//

COMMENT 77: A backward martingale also converges, by exactly the same argument.

DEFINITION 78: Let ...X-3,X-2,X-1,X0,X1,X2,X3... be a stationary process. Then the σ-algebra generated by X-1, X-2, X-3, ... is called the past.

EXAMPLE 79: Probability of present given past. Let ...X-3,X-2,X-1,X0,X1,X2,X3... be a stationary process. We frequently talk about P(X0 = 0|past). Naively you would define this to be P((X0 = 0) and the observed past)/P(the observed past). However, this is ridiculous, because the numerator and denominator of that fraction are both 0. Here is the careful way to define P(X0 = 0|past). It turns out that P(X0 = 0|X-1), P(X0 = 0|X-1,X-2), P(X0 = 0|X-1,X-2,X-3), ... forms a martingale, which converges almost everywhere by the martingale convergence theorem. Its limit is our desired P(X0 = 0|past). Alternatively, just use probability given a sub σ-algebra (Definition 57).//

Coupling

DEFINITION 80: A coupling of two measures is a measure on the product space with the two measures as marginals (i.e. as the projections on the X-axis and on the Y-axis).

EXAMPLES: Let μ be the measure on {0,1} with μ(0) = 1/3, μ(1) = 2/3. Let ν be the measure on {0,1} with ν(0) = 1/4, ν(1) = 3/4.

Coupling EXAMPLE 1) Independent coupling of μ and ν:
P(0,0) = 1/12, P(0,1) = 1/4, P(1,0) = 1/6, P(1,1) = 1/2.

COMMENT 81: You can couple any two measures together, because if you can't find any other way to do it you can always use the independent coupling (i.e. product measure).

Coupling EXAMPLE 2) Coupling of μ and ν to maximize the probability that the two coordinates are equal:
P(0,0) = 1/4, P(0,1) = 1/12, P(1,0) = 0, P(1,1) = 2/3.

Coupling EXAMPLE 3) Coupling by induction: Let X1,X2,X3,... and Y1,Y2,Y3,... be two distinct sequences of random variables. Here is an inductive technique for coupling the two processes together. Indeed, it is a general technique in that you can achieve any possible coupling this way. Assume you have already coupled X1,X2,X3,...,Xn with Y1,Y2,Y3,...,Yn, and you want to couple X1,X2,X3,...,Xn+1 with Y1,Y2,Y3,...,Yn+1. Pick a conditional probability law for Xn+1 given X1,X2,X3,...,Xn and Y1,Y2,Y3,...,Yn, such that when you integrate that probability law over all Y1,Y2,Y3,...,Yn, you get the conditional probability of Xn+1 given X1,X2,X3,...,Xn. Similarly, pick a conditional probability law for Yn+1 given Y1,Y2,Y3,...,Yn and X1,X2,X3,...,Xn, such that when you integrate that probability law over all X1,X2,X3,...,Xn, you get the conditional probability of Yn+1 given Y1,Y2,Y3,...,Yn. Now simply couple the conditional Xn+1 with the conditional Yn+1.
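Coupling Example 2 can be produced mechanically: put mass min(μ(x), ν(x)) on each diagonal pair (x, x), and then distribute the leftover mass of the two marginals off the diagonal in any way that exhausts it. The Python sketch below is my own rendering of that recipe (the function name is mine); applied to μ = (1/3, 2/3) and ν = (1/4, 3/4) it reproduces the table of Example 2, and it is the same building block that Example 5 below applies coordinate by coordinate.

from fractions import Fraction as F

def maximal_coupling(mu, nu):
    # put min(mu[x], nu[x]) on each diagonal pair, then pair up the leftover
    # mass of the two marginals off the diagonal in any way that exhausts it
    coupling = {(x, x): min(mu[x], nu[x]) for x in mu}
    mu_left = {x: mu[x] - coupling[(x, x)] for x in mu}
    nu_left = {y: nu[y] - coupling[(y, y)] for y in nu}
    for x in mu_left:
        for y in nu_left:
            if x != y and mu_left[x] > 0 and nu_left[y] > 0:
                mass = min(mu_left[x], nu_left[y])
                coupling[(x, y)] = coupling.get((x, y), F(0)) + mass
                mu_left[x] -= mass
                nu_left[y] -= mass
    return coupling

mu = {0: F(1, 3), 1: F(2, 3)}
nu = {0: F(1, 4), 1: F(3, 4)}
coupling = maximal_coupling(mu, nu)
for pair in sorted(coupling):
    print("P", pair, "=", coupling[pair])
print("P(coordinates agree) =",
      sum(mass for (x, y), mass in coupling.items() if x == y))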
Coupling EXAMPLE 4) A special case of (3): Assume that Xn+1 is independent of Y1,Y2,Y3,...,Yn, and that Yn+1 is independent of X1,X2,X3,...,Xn. This means that you are simply coupling the conditional probability of Xn+1 given X1,X2,X3,...,Xn with the conditional probability of Yn+1 given Y1,Y2,Y3,...,Yn. If you can't think of any other way of coupling these two conditional probability laws, you can always use product measure.

Coupling EXAMPLE 5) A special case of (4): Suppose X1,X2,X3,...,Xn is {0,1} valued, and you wish to couple it with the independent process on {0,1} which assigns probability 1/2 to each. You would like this coupling to live on pairs such that the two words tend to agree on many coordinates. Just, for each n, couple Xn+1 conditioned on X1,X2,X3,...,Xn with the measure obtained by flipping a fair coin, so that you maximize the probability that the two coordinates are equal. (Note: this doesn't necessarily give the best result.)

Coupling EXAMPLE 6) Glue together two couplings to form a coupling of three measures. Let P1, P2, P3 be three processes. You are given a coupling of P1 and P2 and another coupling of P2 and P3. Put them together to get a coupling of P1, P2, P3 by first putting down P2 in accordance with its probability law, then computing the conditional measure of P1 in the first coupling given the word of P2 you have put down, computing the conditional measure of P3 in the second coupling given the word of P2 you have put down, and then coupling those two conditional measures (perhaps independently).

Coupling EXAMPLE 7) Using coupling, prove that in a random walk, for any set of integers, the probability that you are in that set at time 1,000,000 starting at 0 is close to the probability that you are in that set at time 1,000,010 starting at 0. Idea of proof: Put a measure on pairs of random walks X(n) and Y(n) such that the two walks walk independently of each other until a time n0 at which X(n0+10) = Y(n0) (the existence of n0 is guaranteed by the recurrence of random walk). Then for all n > n0, we inductively continue the coupling so that X(n+10) = Y(n). n0 will probably be less than 1,000,000.//

DEFINITION 82: Let X be a random variable. The σ-algebra generated by X is the σ-algebra generated by the sets X^-1((−∞, a]) for all real a.

DEFINITION 83: If ...X-3,X-2,X-1,X0,X1,X2,X3... is a process, the n future is the σ-algebra generated by the union of the σ-algebras generated by Xn, by Xn+1, by Xn+2, etc.

DEFINITION 84: If ...X-3,X-2,X-1,X0,X1,X2,X3... is a process, the tailfield is the intersection, over all n, of the n future.

DEFINITION 85: Two specific outputs of a process, a1a2a3... and b1b2b3..., are said to have the same tail if there is an N such that for all n > N, an = bn.

THEOREM 86: A measurable set is in the tailfield iff (up to a set of measure 0) the set is a union of equivalence classes, where two sequences are equivalent iff they have the same tail.

Idea of proof: Otherwise you would be able to get, in the tailfield, a set of positive measure where each name has an equivalent name which is not in the set. From that, get an n, and a set of positive measure in the tailfield, where for each name in the set there is a name which is not in the set which agrees with it after time n. That set would not be in the n future.//

DEFINITION 87: A process is said to have trivial tail if every set in the tailfield has measure either 0 or 1.

THEOREM 88: The Kolmogorov 0,1 law: A process ...X-3,X-2,X-1,X0,X1,X2,X3... in which all the Xi are independent of each other has trivial tail.
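Returning to Coupling Example 7 for a moment, the "criminal and police car" coupling can be watched numerically. In the Python sketch below (the time horizon of 1,000 and the number of trials are my own choices), X is a simple random walk given a head start of 10 steps and Y is an independent copy; they run independently until X's position at time n+10 equals Y's position at time n, after which the coupling would keep them glued. The fraction of runs in which the gluing has not yet happened bounds how much the laws of X(n+10) and Y(n) can differ.

import random

def step():
    return 1 if random.random() < 0.5 else -1

def glued_by(time_horizon=1_000, head_start=10):
    # X gets a head start of head_start steps; afterwards the two walks move
    # independently until X's position at time n + head_start equals Y's
    # position at time n, when the coupling glues them together for good.
    x = sum(step() for _ in range(head_start))    # X(head_start)
    y = 0                                         # Y(0)
    for _ in range(time_horizon):
        if x == y:
            return True
        x += step()
        y += step()
    return x == y

trials = 2_000
met = sum(glued_by() for _ in range(trials))
print("fraction of runs glued by time 1,000:", met / trials)
print("so the laws of X(n + 10) and Y(n) differ by at most about",
      round(1 - met / trials, 3), "on any set, for n around 1,000 or later")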
Coupling EXAMPLE 8) Condition an independent process on a specific cylinder set. Now consider the unconditioned independent process. Couple the conditioned process with the unconditioned process so that the tails of the coupled pair are identical, and use this coupling to prove the Kolmogorov 0,1 law.

mean Hamming metric

DEFINITION 89: The mean Hamming distance between two words of the same length is the fraction of the letters which are different, e.g. d(slashed, plaster) = 3/7.

EXERCISE 90: Prove the triangle inequality for the mean Hamming distance.

dbar metric

DEFINITION 91: Let Ω be the set of words of a given finite length from a given alphabet. The dbar distance between two measures on Ω is the infimum expected mean Hamming distance between coupled words, taken over all couplings of the two measures. In the case of two stationary processes, the dbar distance between them is the limit of the dbar distances of finite chunks. This can be shown to approach a limit. Furthermore, this limit can be achieved with a stationary coupling of the whole spaces by using the extension of the "monkey" method mentioned earlier (we take finite, possibly nonstationary couplings and use the monkey method to obtain an infinite stationary coupling). If the processes involved are themselves not stationary, we use the same definition for dbar distance, except that instead of the limit we use the lim sup. In that case, a stationary coupling of the whole space is obviously impossible.

EXERCISE 92: Show that the word infimum in the previous definition can be replaced with the word minimum.

Variation metric

DEFINITION 93: Let P and Q be two partitions with the same number of sets such that the ith set of P corresponds to the ith set of Q for all i. The minimum over all couplings of P and Q of the measure of {(i,j) in the coupling: i does not correspond to j} is called the variation distance of P and Q. If the probabilities of P and Q are p1, p2,…, pn and q1, q2,…, qn respectively, this amounts to 1 − (min(p1,q1) + min(p2,q2) + … + min(pn,qn)), which is half of the sum in the next definition.

DEFINITION 94: Let p1, p2,…, pn and q1, q2,…, qn be two probability measures on a set {x1, x2,…, xn}. The variation distance between p1, p2,…, pn and q1, q2,…, qn is |p1 − q1| + |p2 − q2| + … + |pn − qn|.

DEFINITION 95: Let Ω be a set of words of a finite length from a given alphabet. The variation distance between two measures on Ω is the minimum over all couplings of the two measures of the measure of {(i,j) in the coupling: i ≠ j}.

DEFINITION 96: Let X and Y be real valued random variables, possibly with different domains. The variation distance between X and Y is the minimum of the probability that X ≠ Y over all couplings of X and Y (although their domains are different, their ranges are presumed to both be the real numbers, so a coupling is a probability measure on (the reals) × (the reals)).

COMMENT 97: Saying that "the variation distance between two things is small" is a very strong statement. For measures on words of a given length, it is much stronger than saying the dbar distance is small.

EXERCISE 98: Prove that the minimum is actually achieved.
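Before moving on, here is a small Python companion to Definitions 89 through 95 (the helper names are mine): it checks d(slashed, plaster) = 3/7, computes the Definition 94 sum for a pair of three-point measures, and evaluates the minimal coupling probability of disagreement, which equals 1 − Σ min(pi, qi), i.e. half of the Definition 94 sum, as noted after Definition 93.

def mean_hamming(u, v):
    # Definition 89: the fraction of positions at which the two words differ
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v)) / len(u)

def variation_sum(p, q):
    # Definition 94: the sum of |p_i - q_i|
    return sum(abs(p[x] - q[x]) for x in p)

def min_coupling_disagreement(p, q):
    # the best possible probability of disagreement over all couplings,
    # namely 1 - sum of min(p_i, q_i), i.e. half of the Definition 94 sum
    return 1 - sum(min(p[x], q[x]) for x in p)

print("mean Hamming distance of 'slashed' and 'plaster':",
      mean_hamming("slashed", "plaster"))                    # 3/7

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print("Definition 94 sum:            ", round(variation_sum(p, q), 3))
print("minimal coupling disagreement:", round(min_coupling_disagreement(p, q), 3))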
However, you can skip everything from definition 100 through theorem 124 if you wish.

DEFINITION 100: A Possibly Non-stationary Markov Chain (abbreviated pnm) is a possibly non-stationary probability measure on the space of doubly or singly infinite words with a finite alphabet (unless we state otherwise, assume the alphabet to be finite) in which P(Xn = an | Xn-1 = an-1, Xn-2 = an-2, Xn-3 = an-3 ….) (abbreviated P(an-1,an)) depends only on an-1 and an (i.e. not on n or an-2, an-3, an-4…).

DEFINITION 101: The terms P(an-1, an) are called transition probabilities.

DEFINITION: Let a and b be in the alphabet of a pnm. We say that there is a path from a to b if there exists a finite sequence in the alphabet, c0, c1, c2…cn, such that P(a,c0), P(c0,c1), P(c1,c2)… P(cn-2,cn-1), P(cn-1,cn), P(cn,b) are all positive. In this case we say that there is a path of length n+2 from a to b (length 1 if P(a,b) is positive and we are not considering any ci).

DEFINITION 102: Let a and b be in the alphabet of a pnm. We say that a and b communicate if there is a path from a to b and a path from b to a.

DEFINITION 103: Let "a" be in the alphabet of a pnm. We say that a is transient if there is no path from a to itself.

DEFINITION 104: Let a be in the alphabet of a pnm. If a is not transient, Δ(a) = {n > 0: there is a path of length n from a to itself}.

DEFINITION 105: Let a be in the alphabet of a pnm. If a is not transient, L(a) is the greatest common divisor of Δ(a).

THEOREM 106: If a is not transient, there must exist n such that Δ(a) contains every multiple of L(a) greater than n. Idea of proof: Show that Δ(a) is closed under addition and that every set of positive integers closed under addition contains every multiple of its greatest common divisor past a certain point. To prove the latter, just let S be a finite subset of Δ(a) such that L(a) is a linear combination of S with integer coefficients, let t be the sum of S, and show that for all sufficiently large k, kt + iL(a), i < t, can be written as a linear combination of S with positive integer coefficients.//

THEOREM 107: If a is not transient and a and b communicate, then b is not transient and L(a) = L(b). Proof: Let k be the length of a path connecting a to a. Let m and n be lengths of paths connecting a to b and b to a respectively. Then there are paths connecting b to b of length n+m and n+k+m, so b is not transient and L(b) divides k; since k was an arbitrary element of Δ(a), L(b) divides L(a). With a similar proof, L(a) divides L(b).//

We now prove the big theorem of Markov chains using coupling.

Coupling EXAMPLE 9: THEOREM 108: The renewal theorem. Suppose …X-2, X-1, X0, X1, X2… (or X0, X1, X2…) is a pnm, ε > 0, a, b and c all communicate, and L(a) = 1. Furthermore, assume that any state which "a" has a path to communicates with "a". Then for all sufficiently large n and m, |P(Xm = c | X0 = a) – P(Xn = c | X0 = b)| < ε.

Idea of proof: First we show that (*) if we let d, e, and f be any states which "a" has a path to and run two processes independently starting at d and e respectively, then with probability 1 they will eventually both be simultaneously at state f. Since they all communicate, and since "a" has positive probability paths to itself at all sufficiently large times, it follows that a process starting at any state that communicates with "a" will have positive probability paths to f for all sufficiently large times. In fact exactly how large "sufficiently" is does not depend on which state we start at, because we can simply maximize over all states which communicate with "a".
Now let I be sufficiently large, run the two processes starting at any two points communicating with "a", and let δ be the minimum probability (minimized over any two such starting points) that they are both at state f by time I. Then the probability that they are not both at state f at any of the times I, 2I, 3I, …, kI is at most (1-δ)^k, proving (*).

We now proceed exactly as we did in coupling example 7. Just as in that case, the "criminal" gets a head start, but if the "police car" catches him the police handcuff him to the car so that the two stay together from that point onward. The difference is that in that case the criminal had a small head start and in this case we let the criminal get a possibly huge one. We run the two processes starting at a and b, giving the second process a head start of n-m (we are assuming W.L.O.G. that n > m) and then run them both for time m independently until they meet, gluing them together if they meet. The difference between this example and example 7 is that in this example the criminal can run until the cows come home and he still can't get far away from the police car, because he is running on a finite set.//

DEFINITION 109: A Markov chain is a doubly infinite stationary pnm.

THEOREM 110: A stationary process …X-2, X-1, X0, X1, X2… is a Markov chain iff for any state a0, conditioned on X0 = a0, the future X1, X2… and the past …X-2, X-1 are independent. Proof left to reader.//

THEOREM 111: A Markov process has no transient states, and in a Markov process, if there is a path from a to b then there is a path from b to a (we assume, of course, that a and b both have positive probability). Idea of proof: Show that with probability one any state which occurs in a stationary process occurs infinitely often (don't use the Birkhoff ergodic theorem, because it is sick to use such a powerful theorem when you can easily avoid it). If a is transient or there is no path from b to a, you could visit a and then never come back.//

THEOREM 112: If the transition probabilities are such that all states communicate and for a given state "a", L(a) = 1, then there is precisely one measure on …X-2, X-1, X0, X1, X2… consistent with those transition probabilities that makes that measure a Markov chain. Idea of proof: Show that in the case of a Markov chain, the measure on X0 determines the entire measure of the chain. Using the renewal theorem, show that the given conditions imply that lim(i→∞) P(Xi = c | X0 = a) exists and is independent of a. Show that the assumption that we end up with a Markov chain forces that limit to be P(X0 = c) for all c, and that we can, in fact, extend that measure to a Markov chain. (Hint: You need to define a measure on cylinder sets and then use the Carathéodory extension theorem to extend to all Borel sets. Try to do it, and show that if you run into trouble there must exist c such that P(Xi+1 = c | X0 = a) is very different from P(Xi = c | X0 = a) for arbitrarily big i.)//

THEOREM 113: A Markov process is ergodic iff all states communicate. Only if: Easy. If: Every state must occur infinitely often in every doubly infinite word, and in fact every word of positive probability must occur infinitely often in every doubly infinite word. Let T be the first occurrence of a given word. Use theorem 110 and the fact that for any event A, P(A) = E(P(A | X0, X1, X2…XT)) to show that for any invariant set A, P(A | X0 = a0, X1 = a1, X2 = a2,… Xn = an) does not depend on a0, a1, a2… an.
Now approximate A with a cylinder set.//

DEFINITION 114: When …X-2, X-1, X0, X1, X2… is an ergodic Markov process, by theorems 107 and 113, L(a) does not depend on a, and we will call it the periodicity of the process.

THEOREM 115: Renewal theorem for Markov processes (including the case where there are countably many states). Let …X-2, X-1, X0, X1, X2… be an ergodic Markov process with periodicity 1, possibly with countably infinitely many states. Let ε > 0. Let S1, S2 be collections of those states. Then there is an N, dependent only on a lower bound for the measures of S1 and S2, not on S1 and S2 themselves, such that for all n > N, |P(Xn ∈ S1 | X0 ∈ S2) - P(X0 ∈ S1)| < ε. Idea of proof: Essentially repeat the proof of the renewal theorem, using the fact that there is a finite set of states of measure nearly 1, making the problem essentially finite.//

COMMENT 116: In the finite case, N does not depend on anything.

DEFINITION 117: An ergodic Markov process with periodicity 1 is called a mixing Markov process.

DEFINITION 118: A stationary process …X-2, X-1, X0, X1, X2… is called an n step Markov process if for all …a-2, a-1, a0,
P(X0 = a0 | X-1 = a-1, X-2 = a-2, X-3 = a-3…) = P(X0 = a0 | X-1 = a-1, X-2 = a-2, X-3 = a-3… X-n = a-n).

DEFINITION 119: Let X = …X-2, X-1, X0, X1, X2… be a stationary process and Y = …Y-2, Y-1, Y0, Y1, Y2… be an n step Markov process. Y is called the n step Markov process corresponding to X iff they have the same distribution of n words and
P(Y0 = a0 | Y-1 = a-1, Y-2 = a-2,… Y-n = a-n) = P(X0 = a0 | X-1 = a-1, X-2 = a-2,… X-n = a-n)
for all a0, a-1, a-2,… a-n (this is the same as saying that they have the same distribution of n+1 words).

DEFINITION 120: Let X = …X-2, X-1, X0, X1, X2… be an n step Markov process. A Markov process …Y-2, Y-1, Y0, Y1, Y2…, constructed from X, is defined as follows: The states of Y are the words of length n from the X process. The measure on Y0 is the same as the measure on X0, X1, X2… Xn-1. The transition probabilities are
P(Y1 = a0 a1 a2… an-1 | Y0 = b0 b1 b2… bn-1) = P(X0 = an-1 | X-1 = bn-1, X-2 = bn-2,… X-n = b0) if ai = bi+1 for all i in {0,1,…n-2}, and 0 otherwise.

COMMENT 121: It is important that the reader understand what the above Y process really is. It is simply the X process in disguise. For example, suppose n = 4. The reader should be required to demonstrate that if you start with the X process, and change the name a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 … to (a0 a1 a2 a3) (a1 a2 a3 a4) (a2 a3 a4 a5) (a3 a4 a5 a6) (a4 a5 a6 a7) (a5 a6 a7 a8) (a6 a7 a8 a9)….. for every sequence a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 …, you get the Y process. The purpose of constructing the Y process is to demonstrate that every n step Markov process is really a one step Markov process in disguise.

DEFINITION 122: The periodicity of an n step Markov process is the periodicity of the corresponding Markov process.

THEOREM 123: An n step Markov process is ergodic iff its periodicity is 1. Proof left to reader.//

EXERCISE 124: Suppose …X-2, X-1, X0, X1, X2… is an ergodic process. Show that its approximating n step Markov process is also ergodic. (Hint: Show that if two words don't communicate in the approximating process, they don't communicate in the original process either.)

THEOREM 125: Renewal theorem for n step Markov processes (including the case where there are countably many states). Let …X-2, X-1, X0, X1, X2… be an ergodic n step Markov process with periodicity 1, possibly with countably infinitely many states. Let ε > 0. Let S1, S2 be collections of n tuples of those states.
Then there is an M, dependent only on a lower bound for the measures of S1 and S2, not on S1 and S2 themselves (in the finite case, M does not depend on anything), such that for all m > M,
|P((Xm, Xm+1, Xm+2… Xm+n-1) ∈ S1 | (X0, X1, X2… Xn-1) ∈ S2) – P((X0, X1, X2… Xn-1) ∈ S1)| < ε.
Proof: Immediate from theorem 115, applied to the corresponding Markov process of definition 120.//

Preparation for the Shannon Macmillan theorem (presented as a subsection of the coupling section)

COMMENT 126: We conclude our "Preliminaries" chapter with some lemmas designed to prepare for the most important theorem of the next chapter, the Shannon Macmillan Breiman theorem.

DEFINITION 127: Let fN be defined as follows: fN(x) = x if |x| > N, and fN(x) = 0 otherwise.

DEFINITION 128: A sequence a1,a2,a3... is essentially bounded if lim(N→∞) lim sup(n→∞) [|fN(a1)| + |fN(a2)| + ... + |fN(an)|]/n = 0.

We conclude our section on coupling by using coupling to prove the following theorem.

Coupling EXAMPLE 10: THEOREM: Let ai be any sequence of nonnegative numbers such that the iai are summable. If P(i < |Xn| < i+1 | Xn-1, Xn-2...X1) < ai for all i and n, then X1, X2, X3,… is essentially bounded (with probability 1). Idea of proof: Choose N large enough so that aN-1 + aN + aN+1 + ... < 1. Define a distribution for a random variable YN by P(YN = 0) = 1 - (aN-1 + aN + aN+1 + ...), and for all i > N, P(YN = i) = ai-1. Let YN(1), YN(2), YN(3),.... be an independent sequence of random variables each with the distribution of YN. Now couple the two sequences of random variables
1) |fN(X1)|, |fN(X2)|, ...|fN(Xn)|,...
2) YN(1), YN(2), YN(3),....
inductively so that |fN(Xi)| ≤ YN(i) for all i. (Here we make our coupling so that YN(j) = i whenever i-1 < |Xj| < i with i > N; this is possible because, given the past, the probability that i-1 < |Xj| < i is less than ai-1 = P(YN = i).) Then (1/n)(YN(1) + YN(2) + YN(3) + .... + YN(n)) approaches E(YN), which is small when N is large because the iai are summable.//

Square lemmas:

COMMENT 129: Only the third of these square lemmas has any relevance to the text. Our only reason for including the first two is that we believe every graduate student should be familiar with them.

1) Let ai,j be reals such that ai,j converges to aj uniformly as i → ∞, and ai,j converges to ai as j → ∞. Then lim(i→∞) ai, lim(j→∞) aj, and lim ai,j as (i,j) → ∞ all exist, and they are equal.

2) Let ai,j be reals such that ai,j converges to aj uniformly as i → ∞, and lim(j→∞) aj exists. Then lim ai,j as (i,j) → ∞ exists and converges to the same thing.

3) Let Xi,j be real valued random variables such that the columns form a stationary process (i.e. any n successive columns have the same distribution). Suppose Xi,j converges to Xj as i approaches infinity. If both Xj and Xi,i are essentially bounded, then both (1/n)(X1,1 + X2,2 + X3,3 + ... + Xn,n) and (1/n)(X1 + X2 + X3 + ... + Xn) converge, and their limits are the same.

Idea of proof: By the Birkhoff ergodic theorem and essential boundedness, (1/n)(X1 + X2 + X3 + ... + Xn) converges. Fix ε and N, and let Yi = 1 if there is any j > N such that |Xi,j - Xi| > ε, and Yi = 0 otherwise. Then for large N, (1/n)(Y1 + Y2 + Y3 + ... + Yn) converges to something small.//

COMMENT 130: The expectation of the thing they are converging to is E(X1). When the limit is constant, the thing they are converging to is E(X1) itself.//

ENTROPY

DEFINITION 131: The n shift is obtained by taking an n sided fair coin, and flipping it doubly infinitely many times to get an independent process on n letters.

THEOREM 132: The three shift is not a factor of the two shift. Idea of proof: Suppose φ is a homomorphism of the two shift to the three shift. Approximate φ^(-1) of the canonical three set partition with a three set partition of cylinder sets in the two shift.
These cylinder sets have a length (say length 9) (the length of a cylinder set is the number of letters it depends on). Presume for the moment that this approximation is not just an approximation, but rather is exactly φ^(-1) of the canonical three set partition. Then each word of length 200 in the three shift is determined from a distinct word of length 208 in the two shift. For example, the 7th letter in the three shift is determined by the 3rd, 4th, 5th, 6th, 7th, 8th, 9th, 10th and 11th terms of the two shift. This is impossible because 2^208 < 3^200. However, since the cylinder set partition is only an approximation, words of length 208 determine words of length 200 only after ε·200 of the letters are altered, for some small ε (on more than half of the space). Nevertheless, by making such changes we still won't get enough words to account for (1/2)·3^200 words.//

DEFINITION 133: A small exponential number of words of length n means 2^(εn) words, for a small number ε. If the number of words in a set of such words is 2^(Hn), then we say that H is the exponential number of words in the set. If the probability of a word is 2^(-Hn), the exponential size of a word of length n is H. Note that the smaller the probability of the name, the bigger its exponential size.

DEFINITION 134: We are about to prove a theorem saying that in every stationary process, the exponential size of the first n letter subword of an infinite word approaches a limit as n approaches infinity. In the ergodic case, this limit is a constant, called the entropy of the process.

COMMENT AND DEFINITION 135: The theorem we just referred to (namely the Shannon Macmillan Breiman theorem, which we will soon state and prove) implies that for large n, after removing a small set, all words of length n have approximately the same exponential size. Therefore (after removal of the small set) the exponential number of words is approximately the entropy of the process. Henceforth, names with the approximately correct exponential size will be called reasonable names.

THEOREM 136: Two isomorphic ergodic processes must have the same entropy. Idea of proof: By comment 135, this theorem is a generalization of theorem 132, and the proof is identical.

DEFINITION 137: The entropy of an ergodic transformation which has a finite generator, P, is the entropy of the process P,T. Theorem 136 says that it does not matter which generator you use.

COMMENT 138: The reader may object that it is hard work to prove the existence of generators for transformations, and that he resents having to have a definition of entropy that requires such hard work. Such a complainer will be happier to define the entropy of a transformation T to be the sup (over all finite partitions P) of the entropy of P,T. Since every partition can be extended to a generator just by joining it with a generator ("joining" is another word for "superimposing"), the two definitions are equivalent. In fact, this definition is more encompassing because it can be used for transformations that have no finite generator.

CONVENTION 139: Throughout this book "log" means log base 2.

THEOREM 140: (Shannon Macmillan Breiman): Assume an ergodic stationary process on a finite alphabet. With probability 1, a name b1b2b3.... has the property that lim(n→∞) [-(1/n)log(P(b1b2b3....bn))] exists. The limit is constant almost everywhere. Idea of proof: We will, in fact, show that the limit exists for any stationary process. Ergodicity is only used to get the limit to be constant.
-(1/n)log(P(b1b2b3....bn)) = -(1/n)log(P(b1)) - (1/n)log(P(b2|b1)) - (1/n)log(P(b3|b2b1)) - ... - (1/n)log(P(bn|bn-1bn-2...b1)),
which we will compare with
-(1/n)log(P(b1|b0b-1b-2b-3....)) - (1/n)log(P(b2|b1b0b-1b-2...)) - (1/n)log(P(b3|b2b1b0b-1....)) - ... - (1/n)log(P(bn|bn-1bn-2...)).

By the third square lemma, all we need to prove is that both sequences are essentially bounded, which we show from coupling example 10. We now determine the ai of coupling example 10. We can always let a0 = 1, because all probabilities are less than or equal to 1. Hence, we only have to look for ai, for i ≥ 1. For n > 0, in order for -log(P(bn|bn-1bn-2...b1)) to be between i and i+1, bn has to be a letter which, given bn-1, bn-2,..., has probability less than 2^(-i). Since we are assuming a finite alphabet of size, say, A, it follows that the probability that -log(P(bn|bn-1bn-2...b1)) is between i and i+1, given bn-1, bn-2,..., is less than A·2^(-i). The same is true for -log(P(bn|bn-1bn-2...)).//

THEOREM 141: The entropy of …X-2, X-1, X0, X1, X2… is E(-log(P(X0|past))). Idea of proof: Apply the proof of theorem 140, along with comment 130 (which is the comment following the third square lemma).//

DEFINITION 142: Let P be a partition such that the probabilities of the pieces are p1,p2,p3,..pn. Let ...X-3,X-2,X-1,X0,X1,X2,X3... be an independent process on an n letter alphabet whose probabilities are p1,p2,p3,..pn. Then the entropy of P is defined to be the entropy of that process.

THEOREM 143: The entropy of the above partition is -p1log(p1) - p2log(p2) - p3log(p3) ... - pnlog(pn). Idea of one proof: Use theorem 141. Idea of another proof: A typical word of length n has about n(p1) of the first letter, about n(p2) of the second letter etc., so that it has probability approximately p1^(n p1) p2^(n p2) ... pn^(n pn) = 2^[n(p1)log(p1) + n(p2)log(p2) + ... + n(pn)log(pn)].//

COMMENT 144: In the near future, we will use theorem 143 as the definition of entropy for a partition, and see what we can learn about entropy of a partition from that definition. The reader should try to prove everything from the other definition also.

DEFINITION 145: Let P1 and P2 be two partitions with probabilities of parts [p1,p2, p3, ...pn] and [q1,q2, q3, ...qm] respectively. If we say put P1 into the first piece of P2, we mean that we want you to consider the partition whose pieces have sizes q1p1, q1p2, q1p3, ...q1pn, q2, q3, ...qm.

THEOREM 146: Conditioning property: Let the first piece of P2 have probability q1. Let P1 and P2 have entropies H1 and H2 respectively. Then if you put P1 into the first piece of P2, the resulting partition has entropy H2 + q1H1. Idea of proof: You can prove this by straightforward computation, or use a slicker proof, based upon the idea that the entropy of a partition is the expectation of -log(the probability of the piece you are in).//

DEFINITION 147: A join of two partitions P1, P2, with m and n pieces respectively, denoted P1 V P2, is a partition with mn pieces (although some of those pieces may be empty, so that in reality you may have fewer than mn pieces) such that each of these tiny pieces is regarded as the intersection of a piece of one with a piece of the other. The sum of the probabilities of the pieces in a piece of P1 is the probability of that piece, and the sum of the probabilities of the pieces in a piece of P2 is the probability of that piece.

COMMENT 148: Note that coupling and join really mean the same thing.

THEOREM 149: The entropy of the independent join of two partitions is the sum of their entropies.
Idea of proof: Repeatedly use the conditioning property, by putting P1 into every piece of P2.//

DEFINITION 150: Let P be a partition and T be an ergodic transformation. H(P) means the entropy of P and H(T) means the entropy of T.

DEFINITION 151: (Convex combination of partitions) Let P1, P2,… Pm be n set partitions whose sets are identified (e.g. the third set of P1 is identified with the third set of P2 etc.). For i ≤ m and j ≤ n, let the probability of the jth set of Pi be pi,j. Let p1, p2,.. pm be positive numbers which add to 1. Then p1P1 + p2P2 + .. + pmPm is the partition whose jth set has probability p1p1,j + p2p2,j + … + pmpm,j.

COMMENT 152: Convex combination of random variables and convex combination of probability measures are defined similarly.

THEOREM 153: Suppose p1P1 + p2P2 + .. + pmPm is a convex combination of partitions. Then H(p1P1 + p2P2 + .. + pmPm) ≥ p1H(P1) + p2H(P2) + .. + pmH(Pm). Proof: Use the notation of the previous definition. By concavity of -x log x,
-(p1p1,j + p2p2,j + … + pmpm,j) log(p1p1,j + p2p2,j + … + pmpm,j) ≥ p1(-p1,j log(p1,j)) + p2(-p2,j log(p2,j)) + … + pm(-pm,j log(pm,j));
now sum over j.//

THEOREM 154: Independent joining maximizes entropy (i.e., it gives entropy at least as big as any other joining). Idea of proof: Let the probabilities of P1 be p1, p2, p3, ...pm. P1 V P2 can be obtained by selecting Q1, Q2,… Qm such that p1Q1 + p2Q2 + … + pmQm = P2 and then putting Q1 into the first piece of P1, Q2 into the second piece of P1, etc. Apply the conditioning property and theorem 153.//

THE CAVE MAN'S PROOF OF THE ABOVE THEOREM: Idea of proof: We use only the conditioning property, and the fact that the maximum entropy of a two set partition is obtained by the 1/2,1/2 partition. We do not use the definition of entropy, the formula for entropy, or even the continuity of entropy. From this we can get the above theorem when the P1 probabilities p1,p2,p3,...pm are dyadic rationals and the P2 probabilities q1,q2,q3,...qn are arbitrary. If we then want to get it for any pairs p1,p2,p3,...pm and q1,q2,q3,...qn, we need continuity of entropy.

Step 1: First show, by induction on m, that the maximum entropy of a 2^m set partition is obtained by the 1/2^m, 1/2^m, 1/2^m,...1/2^m partition.

Step 2: Show the theorem to be true when P1 is the 1/2^m, 1/2^m,...1/2^m partition. Idea of proof: by using the conditioning property, show that entropy goes up on any piece of P2 when we exchange the partition of it for the 1/2^m, 1/2^m,...1/2^m partition conditioned on that piece.

Step 3: Show it when P1 is composed of dyadic rational pieces. Proof: P1 can be extended to a 1/2^m, 1/2^m,...1/2^m partition for some m. More precisely, we can get partitions R1, R2,...Rm so that if we put each Ri into the ith piece of P1, we will get a 1/2^m, 1/2^m,...1/2^m partition. Now consider two joinings of P2 and P1, which we will call J1 and J2. J2 will be the independent joining, and J1 will be an arbitrary joining. Make J1 into an even finer partition by, for all i, putting Ri into the ith piece of P1, independent of J1 inside the ith piece of P1. Do the same with J2. It can easily be seen by the conditioning property that the amount of entropy added, in both cases, is identical. When you are done, the extension of J2 will be an independent joining of the 1/2^m,...1/2^m partition with P2, and the extension of J1 will be a joining of the 1/2^m,...1/2^m partition with P2 which is in general not independent.
Since the former has entropy at least as big as the latter, and we added the same amount to get them, it follows that the entropy of J2 is at least as big as that of J1.//

DEFINITION 155: Let P and Q be partitions which are joined into a big partition P V Q. Then the entropy of P over Q, written H(P|Q), is H(P V Q) - H(Q).

COMMENT 156: H(P|Q) ≤ H(P).

THEOREM 157: Let Q be a sub partition of Q1, and let P be joined with Q1. Then H(P|Q1) ≤ H(P|Q). Proof: You can only increase the entropy of P V Q1 by rearranging the P on each piece of Q so that P is independent of Q1 on that piece. This will not alter H(P|Q), and cannot decrease H(P|Q1), and in the end you will have H(P|Q1) = H(P|Q).//

COROLLARY 158: Let ...X-3,X-2,X-1,X0,X1,X2,X3... be a stationary process with a finite alphabet. H(Xn|Xn-1,Xn-2,Xn-3,...X0) is nonincreasing as n approaches infinity. Idea of proof: H(Xn|Xn-1...X0) = H(X0|X-1...X-n) by stationarity.//

DEFINITION 159: Let H be the limit of H(Xn|Xn-1,Xn-2,Xn-3,...X0) as n approaches infinity.

COMMENT 160: We will later see that H is the entropy of the process.

COMMENT 161: This is H(X0 | the past).

COMMENT 162: H is also lim (1/n)H(X0 V X1 V X2... V Xn). Here X0 V X1 V X2... V Xn means the join of X0,X1,X2...Xn. Proof: H(X0 V X1 V X2... V Xn) = H(X0) + H(X1|X0) + H(X2|X1,X0) + ... + H(Xn|Xn-1,Xn-2,Xn-3,...X0).//

THEOREM 163: H is the entropy of the process. Proof: X0 V X1 V X2... V Xn is a partition whose pieces correspond to the n names of the process, each having the probability of the corresponding name as its measure. Let S be the entropy of the process. By the Shannon Macmillan theorem, except on a small set of size ε, all names have measure 2^(-(S+δ)n), where |δ| < εo, εo a small positive number. On the set of size ε, there are at most A^n names, where A is the size of the alphabet, because that is the total number of names altogether. Hence, by the conditioning property, H(X0 V X1 V X2... V Xn-1) is between (1-ε)(S - εo)n and (1-ε)(S + εo)n + ε(log A)n.//

DEFINITION 164: Let P be a finite partition, and T be a measure preserving transformation. Then P,T defines a stationary process, and H(P,T) is the entropy of that process.

COMMENT 165: H(P,T) = H(Q,T) for any two finite generators P and Q, because the two processes are isomorphic.

DEFINITION 166: The entropy of a transformation, H(T), is the supremum of H(P,T) over all finite partitions P.

COMMENT 167: If a transformation has a finite generator, P, then H(T) = H(P,T). Proof: For any other partition Q, P V Q is a generator also, so H(Q,T) ≤ H(P V Q,T) = H(P,T).

THEOREM 168: Let A1,A2,A3,... be a countable generator for T, and let Ω be the whole space. Then H([A1,A2,A3,...An, Ω - (A1 U A2 U A3... U An)], T) converges to H(T) as n → ∞. Idea of proof: Let Qn be the partition [A1,A2,A3,...An, Ω - (A1 U A2 U A3... U An)]. Let P be any other finite partition. Since the A's are a generator, it follows that if you know the A name, you know the P name. For large enough n, the Qn name on times –n, -n+1,…n nearly determines which element of P you are in, by theorem 24. Thus H(Qn,T) is not much less than H(P,T), by the same argument we used to show that the three shift is not a factor of the two shift.//

THEOREM 169: KRIEGER'S FINITE GENERATOR THEOREM: If a transformation has finite entropy, then it has a finite generator. Indeed, if the entropy is less than log(k), we can find a generator B = B1,B2,...Bk. We first give a vague idea of the proof, followed by more details of the proof.
The student should read the "vague idea" and the "more details" before attempting to write out a rigorous proof.

Proof scheme: Let An and Qn be as above. We will look at a Rohlin tower; then a much bigger one; then a much bigger one; etc. Start with a large m. Let n >> m. We will choose the first tower of height n so that the base is independent of the partition of n names of Qm. By Shannon Macmillan, there are enough k letter names to label all reasonable columns with the letters B1,B2,...Bk on the rungs of the columns, so that all the reasonable columns are distinguished from each other, because the number of reasonable columns is smaller than the number of such names. Carry out such a labeling. If you know the B1,B2,...Bk name of a column containing the base of the tower, you probably know the Qm name of that column. The only reason you might turn out to be wrong is that you might be looking at an unreasonable column. This is just the first stage. We will end up altering our choice of B1,B2,...Bk over and over again, to accommodate larger towers. However, we will find that we will alter less and less, so that, starting with a random point in Ω, once we know our initial choice of which of B1,B2,...Bk contains it, that choice is not likely to change by the time we are done. In the end, we will be able to tell our entire A name from our B name, and hence we will be able to tell what point we are at from our B name. Hence we will know that B generates.

Vague idea of proof: There is a complication. We will have to know whenever we enter one of these towers from just looking at the B name. If we saw an infinite B name but had no idea which coordinates were in the base of the tower, we would be quite confused. To handle this, we need to reserve certain words to label the bases of the towers, so that we know when we are in a base of a tower. You can arrange that so few words are reserved that no damage is done to the proof. Use labels like 1000001 for the first tower and 100000000000001 for the second (the reader needs to figure out how fast the length of these names must grow so that the restriction that these names cannot be used except for labeling the base will not cause damage).

More details of the proof:

1) Start off with Q10 (10 chosen so that Q10 has nearly full entropy). Pick n1 large enough so that Shannon Macmillan kicks in for Q10 names by time n1. Pick a Rohlin tower of height n1, whose base is independent of the partition of Q10 names of length n1. Label all reasonable columns with a distinct B name.

2) Pick your second tower to be much bigger than your first, and as before pick your base independent of Q1000 names of tower length (1000 chosen so that Q1000 has very very nearly full entropy). Your B process defined at the previous stage already captures a great deal of entropy, because B names almost determine which atom of Q10 you are in. Prove that there are enough B names so that, since our entropy is strictly less than log(k), it is possible to let the top ¾ of the rungs remain as they are and change the bottom ¼ of the rungs to B names and thereby distinguish all reasonable Q1000 columns.

3) Continue, except replace the numbers ¾ and ¼ with 7/8 and 1/8 at the next stage, then with 15/16 and 1/16 in the following stage, etc.

4) Here is why the final B name separates points. At the end of the first stage you probably know what reasonable atom of Q10 you are in, and there is not much more than a ¼ chance that the information will be damaged.
Continue this reasoning at higher stages, using Borel Cantelli to show that information is damaged only finitely many times. For a given point, it is possible that the n tower name of its n tower column will be damaged if that tower turns out to be one of those bottom altered n+1 tower rungs. Borel Cantelli will say that happens only finitely many times.//

ABBREVIATION OF PROOF OF THEOREM: Make a big tower distinguishing reasonable columns in mean Hamming distance. Improve the approximation with a bigger tower by altering a few columns at the bottom. Label the bases of all your towers with words that are just used to label bases.

COMMENT 170: WE REGARD IT AS ESSENTIAL THAT THE STUDENT BE REQUIRED TO WRITE OUT THE ABOVE PROOF IN DETAIL.

THEOREM 171: Topologize the space of stationary measures by the dbar metric. Then the function that takes a process to its entropy is continuous. Idea of proof: If Process 1 is close to Process 2 in dbar, then most reasonable names of one are within ε in the mean Hamming distance of a reasonable name of the other. There are exponentially few names close to a given name in the mean Hamming distance.//

DEFINITION 172: If a process homomorphism has the property that the inverse of the set of points with any specific letter at the origin (in the range) is a cylinder set (in the domain), then the range is called a coded factor of the domain, and the map from cylinder set to letter is called a code.

LEMMA 173: A coded factor of a given process cannot have more entropy than the process. Idea of proof: If the code has length M, then every name of length N of the coded process is determined by a name of length N+M of the original process.//

THEOREM 174: A factor of a process cannot have more entropy than the process. Idea of proof: Entropy is continuous, and coded factors are dense in all factors.//

THEOREM 175: A process has entropy zero iff the past determines the present. Idea of proof: The entropy of ...X-3,X-2,X-1,X0,X1,X2,X3... is the expectation of the entropy of X0 given the past.//

DEFINITION 176: Let T be an ergodic transformation, and Ψ be a set of positive measure. Let A be the σ-algebra on Ψ that is obtained by restricting the σ-algebra of the whole space to subsets of Ψ. Let ν be the probability measure on Ψ defined by ν(Δ) = μ(Δ)/μ(Ψ) for every Δ in A. Let the transformation S from Ψ to Ψ be defined by S(x) = T^i(x), where i is the least positive integer such that T^i(x) is in Ψ. Then (Ψ, A, ν, S) is a measure preserving transformation. S is called the induced transformation of T on Ψ.

EXERCISE 177: Let x and i be as defined in definition 176; i is a function of x. Prove that the union, over all x with i = ∞, of {x, T(x), T^2(x), …} has measure 0.

EXERCISE 178: Prove S to be measure preserving.

THEOREM 179: Let μ be the measure of a Lebesgue space, T be an ergodic measure preserving transformation on that space, Ψ be a subset of the space of positive measure, and S be the induced transformation on Ψ. Then the entropy of S on Ψ is H(T)/μ(Ψ). Idea of proof: Let (Ψ, A, ν, S) be as in definition 176. Let P be a countable generator for T. Define a new countable alphabet whose letters are finite strings of letters in P. x is assigned a particular string if that string is the P name of x, T(x), T^2(x), ...T^(i-1)(x), where i is the least positive integer with T^i(x) in Ψ. Show that this partition generates A using the transformation S.
By the Birkhoff ergodic theorem, the number of times S returns to Ψ by time n is about nμ(Ψ). Hence names of size nμ(Ψ) in the induced process look like names of size n in T (more precisely, they are between names of size .9n and names of size 1.1n, where .9 and 1.1 are arbitrarily close to 1). We would be done, with a beautifully simple argument, if all partitions in question were finite. Since they aren't, the above argument proves nothing. The entropy of T can be approximated with a finite partition, obtained by lumping all but finitely many of the pieces of P. Our new partition is P' = (P1,P2,...Pn, lump). We can approximate the entropy of the induced transformation by using P' instead of P in our construction of a partition on Ψ. This partition on Ψ is still infinite, but we can make it finite by lumping all points in Ψ whose return time is greater than some large number N, without seriously altering the entropy of S. These arguments can be made rigorous with proper use of theorem 168. Now carry out the argument of the first paragraph.

More detail: Words of length nμ(Ψ) in the induced process look like words of length n in T, and words of length nμ(Ψ) in the infinite process look like words of length between .99nμ(Ψ) and nμ(Ψ) in the approximating process, so the entropy of the approximating process is between H(T)/μ(Ψ) and H(T)/(.99μ(Ψ)).//

BERNOULLI TRANSFORMATIONS

DEFINITION 180: A transformation is called Bernoulli (abbreviated B) if it is isomorphic to an independent process.

PLAN: We are about to define a bunch of conditions which are equivalent to Bernoulli. We will prove them all equivalent, and then prove that Bernoulli implies them. We will not prove that they imply Bernoulli in this section, because that proof is harder. We will leave that to the next section, where we will prove the Ornstein isomorphism theorem.

DEFINITION 181: The trivial transformation is the transformation where your Lebesgue space consists of precisely one point and the transformation takes that point to itself. The only partition of that space is called the trivial partition. If P and T are trivial, the P,T process, namely the process which assigns probability 1 to the name …a,a,a,a…, is called the trivial process.

COMMENT 182: If the reader would like to skip this section and go on to the next, just read the definition of finitely determined. There are three results from this section you will need if you intend to skip it: i) finitely determined processes are closed under taking factors, ii) a nontrivial finitely determined process has positive entropy, and iii) Bernoulli implies finitely determined. You won't need (iii) until the proof of the isomorphism theorem (classical form) at the very end of the next section.

DEFINITION 183: Let u, u1, u2 be three probability measures and t be a real number, 0 < t < 1, such that tu1 + (1-t)u2 = u. Then u1 is a submeasure of u.

COMMENT 184: There is a u2 such that tu1 + (1-t)u2 = u iff for all sets A, tu1(A) ≤ u(A).

COMMENT 185: Note that submeasure is a generalization of subset. For example, instead of picking a subset consisting of 3 points, pick a "subset" consisting of 1/3 of one point, 1/5 of another, and 1/7 of a third. This example suggests that we should define the submeasure to be tu1 instead of u1, but we find it easier to consider probability measures.

DEFINITION 186: Notation as in the above definition. The size of the submeasure is t.
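The following is a minimal computational sketch (in Python, which is not part of the text) of definitions 183-186 and comment 184: u1 is a size-t submeasure of u exactly when tu1 never exceeds u, in which case the complementary measure u2 = (u - t·u1)/(1-t) is itself a probability measure. The function name and the sample measures are illustrative assumptions only.

# sketch: check whether u1 is a size-t submeasure of u, and if so return u2
def complementary_measure(u, u1, t):
    # u and u1 are dicts mapping points to probabilities, 0 < t < 1
    u2 = {}
    for x in set(u) | set(u1):
        mass = u.get(x, 0.0) - t * u1.get(x, 0.0)
        if mass < 0:        # tu1(A) > u(A) somewhere, so u1 is not a size-t submeasure
            return None
        u2[x] = mass / (1.0 - t)
    return u2               # satisfies t*u1 + (1-t)*u2 = u

u  = {'a': 1/3, 'b': 1/3, 'c': 1/3}
u1 = {'a': 0.5, 'b': 0.3, 'c': 0.2}
print(complementary_measure(u, u1, t=0.25))   # a probability measure: u1 is a size-1/4 submeasure of u
print(complementary_measure(u, u1, t=0.9))    # None: 0.9(0.5) > 1/3, so u1 is not a size-0.9 submeasure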
DEFINITION 187: If these measures are measures on words of length n, and the size of the submeasure is greater than 2^(-εn) for small ε, then it is said to be an exponentially fat submeasure.

COMMENT 188: Note that in the special case of a subset of {0,1}^n, we are saying that the number of points in the set is greater than 2^((1-ε)n).

DEFINITION 189: A stationary process is called extremal (abbreviated EX) if for every ε there is a δ such that for all sufficiently large n, there is a set of measure less than ε (which we call error) of words of length n, such that the following holds: any submeasure of the process's distribution on words of length n, of size greater than 2^(-δn), whose support is off error, is within ε in dbar of the distribution on words of length n.

ABBREVIATED DEFINITION: A process is extremal if for sufficiently large n, exponentially fat submeasures whose support is off some small subset are dbar close to the whole measure.

DEFINITION 190: A process is called Very Weak Bernoulli (abbreviated VWB) if the dbar distance between the n future conditioned on the past and the unconditioned n future [a function from pasts to numbers] goes to zero in probability as n approaches ∞.

DEFINITION 191: A stationary process P is called finitely determined (abbreviated FD) if a) it is ergodic, and b) for every ε there is a δ and N such that for any other ergodic process P1, if 1) the P and P1 probability laws for words of length N are within δ of each other, and 2) the entropies of P and P1 are within δ of each other, then 3) the two processes are within ε of each other in dbar.

ABBREVIATED DEFINITION: A process is finitely determined if a good distribution and entropy approximation to the process is a good dbar approximation to the process.

DEFINITION 192: An independent concatenation is obtained by taking any measure on words of length n, using that measure to choose the first n letters, using that measure again to choose the next n letters independently, and continuing again and again forever (and backwards in time).

COMMENT 193: In general, an independent concatenation is not stationary.

DEFINITION 194: A dbar limit of independent concatenations (abbreviated IC) is a stationary process which is the dbar limit of independent concatenations.

DEFINITION 195: We will use B to indicate a Bernoulli process (i.e. one isomorphic to an independent process), and FB to indicate a factor of an independent process.

PLAN: The goal of this section is to prove that B ⇒ FB ⇒ IC ⇒ EX ⇒ VWB ⇒ FD. We prove FD ⇒ FB ⇒ B in the next section.

COMMENT 196: FD is the criterion generally used in proofs. VWB is the criterion generally used to check a given candidate to see if it is Bernoulli.

DEFINITION 197: Completely extremal is the same as the abbreviated definition of extremal, except delete the words "whose support is off some small set" (in other words, there is no error set).

INTRODUCTION TO COMPLETELY EXTREMAL: Define 1 to be the distance from the northern end of the earth to the southern end of the earth. Suppose you would like to couple the northern half of the earth to all of the earth, so that the average distance between coupled points is less than .01. It is rather obvious that you can't do that. Thus, you can take half of the earth that is far from all of the earth. But the same is not true for {0,1}^n under the mean Hamming distance. Not only is it impossible to extract half the space such that the half you extract is far away from the whole space when n is large; it is even impossible to extract an exponentially fat subset such that the subset is far away from the whole space.
In other words, {0,1}^n is completely extremal.

THEOREM 198: An independent process is completely extremal. We will postpone the proof until we first do some preparatory work.

LEMMA 199: Let (Ω, μ) be a probability space and let Xn from Ω to {-1,1} be independent random variables, each assigning probability ½ to both -1 and 1. Let Sn = Sum(1 to n)(Xi). Let ν be a submeasure of μ (remember that in this book submeasures of probability measures are probability measures) and let Eν be the expectation operator corresponding to ν. For every ε there is a δ such that if n is sufficiently large and Eν(Sn) > εn, then the size (recall the definition of size of a submeasure) of ν is less than e^(-δn) (i.e. ν is not exponentially fat).

Proof: i) Eν(Sn) > εn. Let R be the set of names in {-1,1}^n such that Sn > (ε/2)n. ii) Then it is an elementary fact about random walk that, for fixed ε and n going to infinity, R is not exponentially fat. From (i) and (ii), the reader can easily establish that ν is not exponentially fat.//

EXERCISE 200: If the reader wishes to establish (ii) himself, he should write down the probability of R precisely. You will get a sum of terms where the first is bigger than all the others put together. Show that this probability decreases exponentially as n increases.

EXERCISE 201: Our argument could be improved. Show that if you replace the statement "Eν(Sn) > εn" with the statement "with probability one Sn = εn (obviously we have to assume εn to be an integer)", the long run exponential rate at which the size of ν shrinks does not increase. Hint: In the proof of lemma 199, replace (ε/2)n with (.9ε)n.

LEMMA 202: Let P = {p1, p2, p3, … pm} be a set of finitely many points with two measures μ and ν on it. Suppose the variation distance between μ and ν is ε. Let I be the unit interval, and λ be Lebesgue measure on I. Consider two measures on P X I, namely μ X λ and ν X λ. Then there is a set S contained in P X I such that (μ X λ)(S) = ½ and (ν X λ)(S) ≥ ½ + ε/4.

COMMENT 203: Don't read the following proof of lemma 202. Prove this lemma yourself, or at least convince yourself that something like it must be true.

Proof: We know that the sum of |ν(pi) – μ(pi)| is ε, which means that you can partition the set P into two classes so that the sum over the first class of ν(pi) – μ(pi) is ε/2 and the sum over the second class of μ(pi) – ν(pi) is ε/2. Let r = μ(first class), so that r + ε/2 = ν(first class), 1-r = μ(second class), and 1-r-ε/2 = ν(second class).

If r ≥ ½, then let S be {(ω,t): ω is in the first class and t < 1/(2r)}. Then (ν X λ)(S) = (r + ε/2)/(2r) ≥ ½ + ε/4.

If r < ½, then let S be {(ω,t): (ω is in the first class) or (ω is in the second class and t < (½ - r)/(1- r))}. Then (ν X λ)(S) = (r + ε/2) + (1-r-ε/2)(½ - r)/(1- r) = ½ + ε/(4(1- r)) > ½ + ε/4.

The reader can verify that in either case (μ X λ)(S) = ½. I thought I told you not to read this!//

Restatement of THEOREM 198: Let P = {p1, p2, p3, … pm} be a set of finitely many points and let μ be a measure on P. Let Ω = P^n and endow Ω with measure ν = μ^n (we are defining ν). Let S be a subset of Ω and let ρ be the restriction of ν to S, normalized so that ρ is a probability measure. For any c, there is a d such that if n is large enough and the dbar distance between (S, ρ) and (Ω, ν) is greater than c, then ν(S) < e^(-dn). In other words, (P, μ) is completely extremal.

COMMENT and EXERCISE 204: Actually, the definition of completely extremal refers to a submeasure rather than a subset. The reader should prove that this is equivalent.

Proof of theorem: Let I be the unit interval and let λ be Lebesgue measure on I. Now we extend the space Ω to Ω X I^n and endow that space with measure ν X λ^n.
Similarly, extend the space S to S X I^n and endow that space with measure ρ X λ^n. Our goal is to prove that ν(S) < e^(-dn), which is equivalent to showing that (ν X λ^n)(S X I^n) < e^(-dn).

Our first step is to construct independent random variables on (Ω X I^n, ν X λ^n) such that each has probability ½ of being 1 and ½ of being -1. Let k < n, and select some initial word of length k, a1a2a3 … ak, where each ai is a member of P. Note this still makes sense when k = 0; it merely means that we are considering the empty string. If the initial word a1a2a3 … ak occurs as an initial word of some member of S (always true when k = 0, where we assume S not to be empty), then let ρ(a1a2a3 … ak) be the conditional measure (under ρ) of the (k+1)th letter of a word in S given that the first k letters are a1a2a3 … ak. For such a word, let d(a1a2a3 … ak) be the variation distance between μ and ρ(a1a2a3 … ak). Invoking lemma 202, let Δ(a1a2a3 … ak) be a subset of P X I which has μ X λ measure ½ and ρ(a1a2a3 … ak) X λ measure ≥ ½ + d(a1a2a3 … ak)/4.

Now we define the random variable Xk+1 on Ω X I^n:
Xk+1((a1a2a3 … an),(t1t2t3 … tn)) =
-1 if a1a2a3 … ak is not the initial segment of any word in S and tk+1 < ½;
1 if a1a2a3 … ak is not the initial segment of any word in S and tk+1 > ½;
1 if a1a2a3 … ak is the initial segment of a word in S and (ak+1, tk+1) is in Δ(a1a2a3 … ak);
-1 if a1a2a3 … ak is the initial segment of a word in S and (ak+1, tk+1) is not in Δ(a1a2a3 … ak).

(*) Under ν X λ^n, the variables Xi are all independent of each other, taking on the values -1, 1 with probabilities ½ each.

It is time to make use of the fact that the dbar distance between (S, ρ) and (Ω, ν) is greater than c. We couple (S, ρ) and (Ω, ν) by induction. We assume that we already know how the first k coordinates of S are coupled with the first k coordinates of Ω. Consider such a coupled pair of k-tuples (a1a2a3 … ak), (b1b2b3 … bk); conditioned on that coupled pair, we want the joint distribution of (ak+1, bk+1). Let ak+1 be distributed with distribution ρ(a1a2a3 … ak) and let bk+1 be distributed with distribution μ. Couple those two distributions as closely as possible. It follows that the probability that the coordinates do not match is d(a1a2a3 … ak). The expected mean Hamming distance of the two spaces obtained by this coupling is greater than c:

1) c < (h1 + h2 + h3 + … + hn)/n, where hk+1 is the probability that the (k+1)th coordinates differ.

The rest is definition chasing, so we will let the reader do some work. Let E be the expectation operator of the measure ρ X λ^n. Show that

(**) c/2 < (1/n)E(Sum(i from 1 to n) Xi).

(*), (**), and lemma 199 give us our result.//

ABBREVIATION: S dbar far away. Xi independent processes defined inductively so that Xk takes on 1, -1 with probability ½ apiece, but in such a way that it prejudices for 1 as much as possible on S. Far in dbar implies the prejudice can be strong enough to force X1 + X2 + … to grow in expectation linearly on S, proving S to shrink exponentially.

COMMENT 205: Note that in the statement of theorem 198, it turns out that d as a function of c does not depend on what independent process you use.

EXERCISE 206: We proved the theorem for subsets. Extend the proof to submeasures.

COMMENT 207: It follows immediately from the definition that IC is closed in dbar. In order to prove that IC implies extremal, we will need that extremal is closed in dbar. The reason we need to use extremal, rather than completely extremal, is that completely extremal is not closed in dbar.

DEFINITION and LEMMA 208: Let π be a measure on S1 X S2.
Then π is a coupling of the projection of π onto S1, which we will write P(π, S1), and the projection of π onto S2, which we will write P(π, S2).

LEMMA 209: All notation as above, if tu1 + (1-t)u2 = π, then tP(u1, S1) + (1-t)P(u2, S1) = P(π, S1).

THEOREM 210: Extremal is closed under dbar limits.

Proof: Suppose S1,S2,S3... is a sequence of extremal processes which converge in dbar to S. Select Sm close to S. Regard Sm and S as measures on words of length n, n large. Let π be a good dbar match, i.e. a measure on the product space with S and Sm as marginals. There are two small bad sets to consider. There is a set Δ on the product space with small measure, such that off that set every ordered pair in S X Sm is close in the mean Hamming distance. There is a set Γ in Sm of small measure, such that every exponentially fat submeasure of Sm whose support is off Γ is close in dbar to Sm. Let us refer to {(x,y): y is not in Γ and (x,y) is not in Δ} as the good set. There is a small set Θ on S such that for any x0 in S - Θ,

π{(x0,y): y in Sm and (x0,y) is good} / π{(x0,y): y in Sm}

is close to one. Fix an exponentially fat set F on S - Θ. (Again I am looking at a subset rather than a submeasure, expecting the reader to generalize.) Restrict π to {(x,y): x in F} and normalize to get a probability measure π'. π' is a coupling between F and P(π', Sm), and furthermore it is a good expected mean Hamming match. P(π', Sm) is just as exponentially fat as F. Most of P(π', Sm) is off Γ, so it can be changed slightly to get P(π', Sm)' completely off Γ. F is close to P(π', Sm), which is close to P(π', Sm)', which is close to Sm, which is close to S, in dbar.//

ABBREVIATION: We wish to show that approximating S with extremal Sm tends to make S extremal. After dodging obnoxious sets, we extract from the dbar coupling of S and Sm a dbar coupling of a prechosen fat submeasure of S with a fat submeasure of Sm.

COMMENT 211: The reader should note what goes wrong with the above proof if he tries to prove that a dbar limit of completely extremal processes is completely extremal.

Final preliminaries to the big equivalence theorem:

DEFINITION 212: Let …X-2, X-1, X0, X1, X2… be a stationary process and ε > 0. Let …Y-2, Y-1, Y0, Y1, Y2… be the independent stationary process on {0,1} with P(Yi = 1) = ε. We define a new process …Z-2, Z-1, Z0, Z1, Z2… as follows. Select i such that Yi = 1. Let j be the least number greater than i such that Yj = 1. For each such i,j we let Zi, Zi+1, …Zj-1 have the same distribution as Xi, Xi+1…Xj-1, but we require that Zi, Zi+1, …Zj-1 be independent of …Zi-3, Zi-2, Zi-1. …Z-2, Z-1, Z0, Z1, Z2… is called the startover process of …X-2, X-1, X0, X1, X2….

ABBREVIATION: The startover process is obtained by at each time running like the X process with probability 1-ε and starting over with probability ε.

Preliminary 1) Let X = …X-2, X-1, X0, X1, X2… be a stationary process. Let Z = …Z-2, Z-1, Z0, Z1, Z2… be its startover process. Then Z is an IC.

Idea of proof: Let m be large enough so that with very high probability Z has started over by time m. Let n >> m. We independently concatenate the measure on X1, X2, X3… Xn to form Y = (Y1, Y2, Y3… Yn)(Yn+1, Yn+2, Yn+3… Y2n)(Y2n+1, Y2n+2, Y2n+3… Y3n)…. Now we couple Y to Z using the following scheme. i) We will first couple times 1 to n-m, n to 2n-m, 2n to 3n-m, etc. ii) Conditioned on how we coupled the above times, we couple the remaining times independently.
To accomplish (i), couple each set of times {kn+1,…(k+1)n-m} independently if Z did not start over during times {kn-m,…kn}, and identically the same if Z did start over during {kn-m,…kn}.//

Preliminary 2) A coding of an independent process is IC. Idea of proof: Essentially the same proof, letting m be twice the length of the cylinder used to do the coding.//

Preliminary 3) EX implies ergodic. Idea of proof: Non-ergodicity implies the existence of a finite word whose frequency of occurrence in the randomly chosen doubly infinite word is non-constant. Select a < b such that the event that the frequency of the word is less than a and the event that the frequency of the word is greater than b both have positive probability. Show that these two fat sets cannot possibly be close in dbar.//

DEFINITION 213: Two stationary processes, X0, X1, X2… and Y0, Y1, Y2…, are said to be close in distribution if there is a large n such that X0, X1, X2… Xn and Y0, Y1, Y2… Yn are within 1/n in the variation metric.

COMMENT 214: This ergodic theory concept of close in distribution should be distinguished from the notion of close in distribution used by probabilists to indicate that the distribution functions are close.

LEMMA 215: Let …X-2, X-1, X0, X1, X2… be a process and H be its entropy. For all m,
H = (1/m) lim(n→∞) H(Xn+1 V Xn+2 V Xn+3...V Xn+m | Xn V Xn-1 V Xn-2 V Xn-3...V X0).

Proof: lim(n→∞) H(Xn+1 V Xn+2 V Xn+3...V Xn+m | Xn V Xn-1 V Xn-2 V Xn-3...V X0) = lim(n→∞) H(X1 V X2 V X3...V Xm | X0 V X-1 V X-2 V X-3...V X-n), which exists by theorem 157. We already showed that H = lim(n→∞) (1/n)H(X0 V X1 V X2...V Xn), and we already showed that lim(n→∞) H(Xn|Xn-1,Xn-2,Xn-3,...X0) = lim(n→∞) (1/n)H(X0 V X1 V X2...V Xn) by writing
H(X0 V X1 V X2...V Xn) = H(X0) + H(X1|X0) + H(X2|X1 V X0) + ... + H(Xn|Xn-1 V Xn-2 V Xn-3...V X0).

Similarly, we can see that lim(n→∞) H(Xn+1 V Xn+2 V Xn+3...V Xn+m | Xn V Xn-1 V Xn-2 V Xn-3...V X0) = lim (1/n)H(X0 V X1 V X2...V Xn) by writing, for any k and r,
H(X0 V X1 V X2...V Xkm+r) = H(X0 V X1 V X2...V Xr) + H(X0 V X1 V X2...V Xm+r | X0 V X1 V X2...V Xr) + H(X0 V X1 V X2...V X2m+r | X0 V X1 V X2...V Xm+r) + ... + H(X0 V X1 V X2...V Xkm+r | X0 V X1 V X2...V X(k-1)m+r).//

COMMENT 216: We are about to talk about the entropy of the m future given the past. It is important not to get confused with the notation. H(the m future | the past) is not the same thing as the conditioned entropy of the m future given the past. The former is a fixed number, defined by taking a limit of H(the m future | the n past) as n → ∞, and the latter is a random variable depending on the past. However, E(the conditioned entropy of the m future given the n past) = H(the m future | the n past), by the conditioning property.

LEMMA 217: For all m, H = (1/m)E(the conditioned entropy of the m future given the past). Proof: By translation, lemma 215 says that (1/m)H(the m future | the n past) converges to H as n → ∞. By the conditioning property, E(the conditioned entropy of the m future given the n past) = H(the m future | the n past).//

Preliminary 4) Let ...X-3,X-2,X-1,X0,X1,X2,X3... be a stationary ergodic process. If you select ε and let m be large enough, then for most (probability at least 1-ε) pasts, the m future given the past is close (distance at most ε) in the variation metric to an exponentially fat (size at least exp(-εm)) submeasure of the unconditioned m future.

ABBREVIATION: Most conditioned futures are nearly exponentially fat submeasures of the unconditioned future.
Idea of proof: Let Γ be the set of all names that eventually become and stay reasonable. Γ has measure 1 by the Shannon Macmillan Breiman theorem. Hence for all pasts P except a set of measure 0, the conditioned future given P assigns measure 1 to Γ. Fix a past P that is not in that set of measure 0. Conditioned on the past being P, for sufficiently large m, most conditioned m names are reasonable in the unconditioned future.

Now we make use of an elementary fact. If you have a random variable with a finite expectation, and the variable usually takes on values at most slightly greater than its expectation, and never takes on values much bigger than its expectation, then it rarely takes on values much lower than its expectation.

By lemma 217,
i) E(the conditioned entropy of the m future given the past) = mH.
ii) The conditioned entropy of the m future given the past is rarely much bigger than mH, because a) usually most of the m future is on reasonable names, so b) usually, by making only a small change in the m future, you can get a measure which lives entirely on reasonable names; c) making this small change only slightly affects entropy; d) the most entropy you can get if you live on reasonable names is when you assign every reasonable name the same measure; e) if you do that, you get about mH entropy, because by Shannon Macmillan there are about 2^(mH) reasonable names.
iii) There is an absolute upper bound for (the conditioned entropy of the m future given the past) of m log(A), where A is the size of the alphabet.

Hence the conditioned entropy of the m future given the past is rarely much smaller than mH. Use the conditioning property to show that when a measure lies almost entirely on reasonable names and has entropy not much smaller than mH, then it is rare for a name to have measure much bigger than 2^(-mH). Alter such a measure to live only on reasonable names whose measure is not much bigger than 2^(-mH). If you then multiply their measures by 2^(-εmH), you will get only reasonable names with measures less than reasonable. More precisely, prove:

If you select ε and let m be large enough, then for most (probability at least 1-ε) pasts, you can alter the probability law of the m future given the past by less than ε in the variation metric to get a measure ν such that exp(-εmH) times ν is strictly less than the unconditional probability law of the m future. Write (the unconditional probability law of the m future) = tν + (1-t)ν' for some ν', where t = exp(-εmH).//

COMMENT 218: Suppose S is a small set in the unconditional m future. Then the conditional measure of S given the past will be small for most pasts P. After you apply Preliminary (4), moving the conditional future given such a P slightly to get an exponentially fat submeasure, your exponentially fat submeasure will still assign small probability to S. By making another small change you will get an exponentially fat submeasure disjoint from S. This ability to dodge a tiny set and still get your exponentially fat submeasure will come in handy when we try to make deductions from extremality.

ABBREVIATION: Most conditioned futures are nearly exponentially fat submeasures of the unconditioned future, disjoint from a prechosen tiny set.

THEOREM 219: B ⇒ FB ⇒ IC ⇒ EX ⇒ VWB ⇒ FD.

Ideas of proof:

B ⇒ FB: Obvious.//

FB ⇒ IC: Every FB is a factor of an independent process, and any such factor is in turn a dbar limit of codings of that independent process.

IC ⇒ EX: We proved that independent processes F^n are extremal. An independent concatenation is like F^n, where the letters of F are words of a fixed length.
Then we use the fact that EX is closed in dbar. // EX VWB This follows from final preliminary (4), and comment to preliminary 4 (namely, comment 180). We need the comment to preliminary (4) not just preliminary (4) itself, because EX only means extremal, not completely extremal. // EX FD Preliminary 4 merely says that sufficiently far in the future conditioned futures are nearly exponentially fat subsets of the unconditioned, but it does not tell you how long to wait. But the proof tells you how long to wait. As soon as most names are reasonable, you have waited long enough. First pick n big enough so that most names are reasonable, and so that n names are extremal. Then pick an approximating process in n distribution and entropy. Both conditioned n futures are usually fat submeasures of the unconditional, and they both have essentially the same unconditional, so they can both be coupled in dbar to that unconditional because the unconditional is extremal. Hence they can be dbar coupled to each other. Throughout this proof we should be working with the comment to preliminary (4) not with just preliminary (4) itself, because EX only means extremal, not completely extremal. This couples times 1,2…n given 0,-1,-2.... continue inductively coupling n+1, …2n given n,n-1,..0,-1…etc. until you have a coupling of 1,2…nm given the past for some m. Along the way you may come across some bad pasts which force a stage of coupling to be bad but the expected mean Hamming distance overall is small. Of course we have only shown that we can couple to a multiple of n, but if we have a big number that is not a multiple of n, the remainder when divided by n is insignificant. Preliminary 3 gives the rest of the definition of FD.// 71 COMMENT 220: You may wonder where we used the fact that processes are close in entropy. We used preliminary (4) on both the original and the approximating process. To guarantee that preliminary (4) would kick in by time n on the latter, we needed that the n names of the approximating process tended to have “reasonable” probability. That would not be the case if the entropy of the approximating process was not appropriate. VWB IC Make a independent concatenation by taking n big, taking the measure on n names, and repeating over and over. Couple this with original inductively, piece by piece. // FD IC Just observe that the startover process of a process is close to it in distribution and entropy. Apply preliminary 1. // THEOREM 221: The equivalent properties above are closed under taking factors. Idea of proof: IC is closed under finite code factors, and under taking dbar limits.// THEOREM 222: The equivalent properties above imply that either we are talking about the trivial process or the process has positive entropy. Idea of proof: 0 entropy means past determines future, easily contradicting IC.// COUPLING INFINITE PATHS COMMENT 223: All of the terms of theorem 219 are defined finitistically. However frequently there are more natural descriptions of terms when we are allowed to consider couplings of the infinite future. For the remainder of this section, we will explore what we can do when such infinite path couplings are considered. 72 Stationary THEOREM 224: Let …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… be two stationary processes. Then i) ii) the dbar distance of X1, X2…Xn and Y1, Y2…Yn approaches a limit, say d, as n . 
There is a stationary coupling of X1, X2… and Y1, Y2… so that with probability 1, a coupled pair ((a1, a2,…), (b1, b2,…)) has the property that the fraction of i with ai ≠ bi in (a1, a2,…, an) and (b1, b2,…, bn) approaches a limit as n → ∞, and the integral of that limit is d. Idea of proof: 1) For any k > 0, n > 0 the dbar distance between X0, X1, X2,…, Xkn and Y0, Y1, Y2,…, Ykn > the dbar distance between X1, X2,…, Xn and Y1, Y2,…, Yn, because the mean Hamming distance between words a1, a2,…, akn and b1, b2,…, bkn is the average of the mean Hamming distances of {ain+1, ain+2,…, a(i+1)n and bin+1, bin+2,…, b(i+1)n : 0 ≤ i < k}. 2) Obviously, #{j < kn+r : Xj ≠ Yj} − #{j < kn : Xj ≠ Yj} < r for all k, n, and r. 3) Show that (1) and (2) give (i). 4) Let d be as in (i). We can get for all n a coupling cn between X0, X1, X2,…, Xn and Y0, Y1, Y2,…, Yn so that the mean Hamming distance obtained by cn converges to d as n → ∞. Use the extension of the monkey method on the cn to get a stationary measure c on (X0, X1, X2…) × (Y0, Y1, Y2…). Show a) c is a coupling of (X0, X1, X2…) and (Y0, Y1, Y2…); b) when restricted to (X0, X1, X2,…, Xn) and (Y0, Y1, Y2,…, Yn), c achieves mean Hamming distance d; c) by the Birkhoff ergodic theorem, the fraction in (ii) converges to a limit; d) the integral of that limit is d by the bounded convergence theorem.// Ergodic THEOREM 225: If the …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… above are ergodic, the coupling constructed above can be made to be an ergodic coupling. Idea of proof: Theorem 224 gives a stationary coupling. Every stationary measure is a convex combination of ergodic measures. Those ergodic measures must also be couplings of the two processes …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2…, because otherwise at least one of those two processes could be decomposed into distinct processes.// COMMENT 226: In the case of theorem 185, where we actually achieve an ergodic coupling, the limiting mean Hamming distance between the two coordinates is constant. Hence any single ordered pair (except in a set of ordered pairs that the coupling assigns measure 0) tells you the dbar distance between the processes. IC COMMENT 227: An independent concatenation, of periodicity p, is in general not stationary, but under T^p it is an independent process (prove this) and hence is both stationary and ergodic. Thus theorem 185 applies when one or both of your processes is an IC, as long as you regard the transformation to be T^p instead of T. FD COMMENT 228: FD says close in distribution and entropy implies close in dbar. Theorem 225 applies to get infinite couplings manifesting that close dbar match. VWB THEOREM 230: …X-2, X-1, X0, X1, X2… is VWB iff there is a set of pasts of probability zero such that if past 1 = …a-2, a-1 and past 2 = …b-2, b-1 are not members of that set, the conditioned measure of X0, X1, X2… given …X-2 = a-2, X-1 = a-1 and the conditioned measure of X0, X1, X2… given …X-2 = b-2, X-1 = b-1 can be coupled with each other so that with probability 1 in the coupled process, a pair ((c0, c1, c2,…), (d0, d1, d2,…)) will obey (1/n) #{i : 0 ≤ i ≤ n−1 and ci ≠ di} → 0 as n → ∞. Proof: Our goal is to construct a good dbar coupling between the future given past 1 and the future given past 2. We construct that coupling as follows. Let k1, k2, k3,… be a rapidly increasing sequence of integers and let Ni be integers such that Ni >> ki+1 for all i. Define Ii inductively as follows: I0 = 0, Ii = Ni ki + Ii-1. We now define our coupling piece by piece by induction. First we couple the measure on terms 0, 1, 2,…, k1−1 given the first past
with the measure on terms 0,1,2… k1–1 given the second past. We use the following procedure for doing this. The definition of VWB guarantees that with high probability, if k1 is big enough, the measure on terms 0,1,2… k1–1 given the first the past is close in dbar to the unconditioned measure on words of length k1. Couple these conditional and unconditional spaces as well as possible. Similarly couple the measure on terms 0,1,2… k1–1 given the second past with the unconditional as well as possible. Join these two couplings together to get a coupling of the two conditional processes. Note that the expected mean hamming distance this coupling gives to the two conditionals is at least as small as the sum of the distance between the first conditional and the unconditioned and that of the second conditional and the unconditioned. 75 We now want to extend the above coupling to a coupling of the measure on terms 0,1,2…2k1–1 given the first the past. with the measure on terms 0,1,2…2k1–1 given the second past. Recall the inductive method for extending a coupling to more terms. Let ((a0a1a2. . . ak -1), (b0b1b2. . . bk -1)) 1 1 be a coupled pair in the earlier coupling. We need to couple terms k1, k1+1…2k1-1 given a0a1a2. . . ak -1 1 and past 1 with terms k1, k1+1…2k1-1given b0b1b2. . . bk -1 1 and past 2. Again we do this by optimally coupling both with the unconditioned process. Once we do that for every pair ((a0a1a2. . . ak -1), (b0b1b2. . . bk -1)) 1 1 we will have extended our coupling so that we have a coupling to terms 0,1,2… 2k1–1 given our two pasts. We now continue the process, looking at each coupled pair in that coupling ((a0a1a2. . . a2k -1), (b0b1b2. . . b2k -1)) 1 1 and using it to couple the terms 2k1, k1+1…3k1–1 so that in the end we have coupled terms 0,1,2…3k1–1. Continue… 0,1,2…4k1–1 etc. until you reach 0,1,2… I1–1. 76 Continue using the same technique. Next extend to 0,1,2…. I1 + k2–1. Then extend to 0,1,2…. I1 + 2k2–1. Then to 0,1,2…. I1 + 3k2–1 and on and on and on until 0,1,2…. I2–1, followed by 0,1,2…. I2 + k3–1…and by now I presume you see the pattern. When you get done you have coupled the whole infinite future given past 1 with the whole infinite future given past 2. We now show that this match will have the desired property if the ki’s increase rapidly enough, and if each Ni is big enough in comparison to ki+1. Let us first consider the terms from I1 to I2–1. Let m be a positive integer less than N2 and assume we have already defined the coupling for terms 0,1,2… I1 + mk2 –1. We are now about to define the coupling for times I1 + mk2, I1 + mk2+1,… I1 + (m+1)k2 -1. At time I1 + mk2, you think of I1 + mk2 as being the origin so that from the stand point of I1 + mk2, the “past” means times I1 + mk2-1, I1 + mk2-2…. Hence the first past at time I1 + mk2, which we will call past(1, I1 + mk2), is obtained from a knowledge of past 1 and a0a1a2. . . a I 1+mk 2 - 1 _ _ Similarly the second past at time I1 + mk2, which we will call past(2, I1 + mk2), is obtained from a knowledge of past 2 and b0b1b2. . . b I 1+mk 2 – 1 _ _ VWB says that as long as k2 is sufficiently big, if past(1, I1 + mk2) is chosen randomly, it will usually turn out to be “good” in the sense that with regard to times I1 + mk2, I1 + mk2 + 1…. (m+1)k2 - 1 77 we will be able to achieve a good dbar match between the conditioned process conditioned on past(1, I1 + mk2) and the unconditioned process. 
If both past(1, I1 + mk2) and past(2, I1 + mk2) are good, then a nice match will occur between the two conditioned processes at times I1 + mk2, I1 + mk2 + 1,…, I1 + (m+1)k2 − 1. However two things can go wrong. The first thing that can go wrong is that either past(1, I1 + mk2) or past(2, I1 + mk2) can turn out to be bad. We will call this a type 1 error. The second thing that can go wrong is that although both pasts turn out to be good, and hence the match will be good, and even though the joint coupling gives high probability to picking a pair ((aI1+mk2, aI1+mk2+1,…, aI1+(m+1)k2−1), (bI1+mk2, bI1+mk2+1,…, bI1+(m+1)k2−1)) that are close, when we actually use that coupling to pick this pair we may be unlucky and pick them so that they are far apart in the mean Hamming distance. We will call this a type 2 error. The mere fact that these two kinds of errors have small probability would be enough for us if all we wanted to do was to get convergence in probability, but we are looking for convergence almost surely. We need to know that the frequency of type 1 and type 2 errors not only gets small but also stays small. We are about to handle type 1 errors using the Birkhoff ergodic theorem, and type 2 errors using the concept of independence. Type 1 errors: Let T be the transformation corresponding to the process. We use the following: a) bad pasts have small probability, b) T^(k2) is measure preserving, c) the Birkhoff ergodic theorem. From these we conclude that with probability almost 1, there is a large n, depending on k2 but not on N2 or I1 (i.e. N2 and I1 can be chosen to be arbitrarily large in comparison to n), such that "For all i with N2 > i > n, (1/i) #{j: 0 ≤ j < i and past(1, I1+jk2) is bad} is small, and for all i with N2 > i > n, (1/i) #{j: 0 ≤ j < i and past(2, I1+jk2) is bad} is small." This says that with probability almost 1, "For all i such that N2 > i > n, (1/i) #{j: 0 ≤ j < i and an error of type 1 occurs during times {I1+jk2, I1+jk2+1,…, I1+(j+1)k2−1}} is small." Type 2 errors: We use the following general principle. In an arbitrary probability space suppose we consider a sequence of events such that each has small probability even when we condition on what happened to the previous ones. Then for large n, with probability almost 1, "For all i greater than n, (1/i) #(the subset of the first i events which actually occur) is small." Here # means "number of". That principle immediately implies that with probability almost 1, again letting n be large but not large in comparison to N2 and I1, "For all i with N2 > i > n, (1/i) #{j: 0 ≤ j < i and an error of type 2 occurs during times {I1+jk2, I1+jk2+1,…, I1+(j+1)k2−1}} is small." Conclusion: Let ((a0, a1, a2,…), (b0, b1, b2,…)) be a randomly chosen couple from our final coupling using the measure of that coupling. The above type 1, type 2 analysis implies that if k2 is large and N1 is sufficiently large in comparison to k2, we can get the following for a small ε1.
With probability at least 1−ε1, "For all i such that N2 > i > ε1N1k2, (1/(ik2)) #{j: I1 ≤ j ≤ I1 + ik2 − 1 and aj ≠ bj} < ε1." Going through a similar analysis, we could get a rapidly decreasing sequence ε1, ε2, ε3,… such that with probability at least 1−εg, "For all i such that Ng+1 > i > εgNgkg+1, (1/(ikg+1)) #{j: Ig ≤ j ≤ Ig + ikg+1 − 1 and aj ≠ bj} < εg." By choosing the εg's to be summable, Borel Cantelli says that with probability 1, "For all sufficiently large g, for all i such that Ng+1 > i > εgNgkg+1, (1/(ikg+1)) #{j: Ig ≤ j ≤ Ig + ikg+1 − 1 and aj ≠ bj} < εg." Upon reflection the reader will see that this last sentence implies that the density of times that aj ≠ bj approaches zero.// ABBREVIATION: Ki rapidly increasing, Ni >> Ki+1. Couple two futures to times K1, 2K1,…, N1K1, N1K1 + K2, N1K1 + 2K2,…, N1K1 + N2K2, N1K1 + N2K2 + K3,… successively, inductively, as well as possible. Focusing on just the times between N1K1 + N2K2 + … + NgKg and N1K1 + N2K2 + … + Ng+1Kg+1, the frequency of times that either past is bad usually gets and stays small by Birkhoff, and the frequency of times that the pasts are good but the coupling is bad gets and stays small because each time you have another big independent chance to get a good coupling. WB DEFINITION 231: Let …X-2, X-1, X0, X1, X2… be a stationary process. Define a (generally nonstationary) process …Y-2, Y-1, Y0, Y1, Y2… by: …Y-2, Y-1 has the same distribution as …X-2, X-1; Y0, Y1, Y2… has the same distribution as X0, X1, X2…; but unlike the X process, …Y-2, Y-1 and Y0, Y1, Y2… are independent of each other. Suppose that for every ε there is an m such that for every n, X-(m+n),…, X-(m+2), X-(m+1), X-m, Xm, Xm+1, Xm+2,…, Xm+n and Y-(m+n),…, Y-(m+2), Y-(m+1), Y-m, Ym, Ym+1, Ym+2,…, Ym+n are within ε in the variation metric. Then …X-2, X-1, X0, X1, X2… is said to be weak Bernoulli (abbreviated WB). COMMENT 232: The reason weak Bernoulli was not defined earlier is that it is not equivalent to the equivalent properties of the bold face theorem (theorem 181). It is stronger than they are. COMMENT 233: I am hereby establishing a probably false rumor that there are people who went insane upon finding out that weak Bernoulli is stronger than Bernoulli. Let me see if I can explain this rather perverted notation. The word "Bernoulli" has two definitions depending on whether you are doing probability theory or ergodic theory. In probability theory it means an independent process. In ergodic theory, and in this book, it means isomorphic to an independent process. The word weak Bernoulli was chosen because the property is weaker than the former definition. But it is stronger than the latter definition. DEFINITION 234: [ ] means greatest integer (e.g. [5.326] = 5, [6.77772] = 6, [π] = 3, [−4.02] = −5). This notation will only be used when we state that we are using it, because we want to be able to use brackets to mean just brackets. DEFINITION 235: Parity tells you evenness or oddness (e.g. the parity of 4 is even). EXERCISE 236: We use greatest integer notation here. Let i be an irrational number and let …X-2, X-1, X0, X1, X2… be an independent process, each Xj taking values 1 or −1 with probability ½ each. Let Sn be 0 if n is 0, X0 + X1 + … + Xn-1 if n is positive, and −(X-n + … + X-2 + X-1) if n is negative, and let U be uniformly distributed on the unit interval. Show that "parity of [2U + iSn]" is a stationary process (as n runs) and in fact that it is very weak Bernoulli, but it is not weak Bernoulli.
(Hint: For VWB, run conditional futures of U + iSn independently until they are close and then keep them close.) We now develop some techniques for altering a coupling to get another coupling. These techniques will be used to help us analyze weak Bernoulli. EXERCISE 237: Let μ and ν be two probability measures and let c be a coupling of the two of them. Let γ be a positive measure that is less than or equal to c on all sets and let γ′ be a measure that has the same marginals as γ (i.e. it has the same projections to the axes). Then c − γ + γ′ is also a coupling of μ and ν. COMMENT 238: If μ and ν are positive measures on X × Y and we take their projections onto X, then a coupling of those two projections can be extended to a coupling of μ and ν. We already showed that (see example 4 of our section on coupling) when μ and ν are probability measures, but the reason we are repeating it is to point out that this statement is still valid when μ and ν are not probability measures. LEMMA A: We now extend the exercise above to more than one dimension to form a more elaborate way of altering a coupling. Suppose μ and ν are measures on X × Y. Let c (a measure on (X × Y) × (X × Y)) be a coupling of μ and ν. Let γ be a positive measure on (X × Y) × (X × Y) that is less than or equal to c on all sets. Let γX be a measure on X × X with the same marginals as the projection of γ on X × X. Then γX can be extended to a measure γ′ on (X × Y) × (X × Y) which has the same marginals as γ, and c − γ + γ′ is another coupling of μ and ν. Idea of proof: This is just exercise 196 and comment 197 put together. DEFINITION 239: On a space X × X the set of all points (x, x) in which the first coordinate is the same as the second coordinate is called the diagonal. LEMMA B: Given: a measure μ on X × Y, a measure ν on X × Y, c a coupling of μ and ν, and d a coupling of the projection of μ on X with the projection of ν on X. If all but ε of the measure of d is on the diagonal, and all but δ of the measure of c is on the diagonal, then d extends to a coupling g of μ and ν such that all but at most ε + δ of the measure of g is on the diagonal. ABBREVIATION: Given μ, ν, c of μ and ν: d of (μ on X), (ν on X) extends to g of μ and ν, and g can be almost diagonal if c, d are. Idea of proof: Show the existence of a γ which is less than or equal to c, whose support is entirely on the diagonal of (X × Y) × (X × Y), whose projection on X × X is less than or equal to d, and whose measure is at least 1 − (ε + δ). Remove γ from c and remove the projection of γ from d. The projection of what remains of c has the same marginals as what remains of d. Apply lemma A and then put γ back. DEFINITION 240: Let …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… be two processes and c be a coupling of the two processes. Suppose that c assigns probability 1 to the set of paths in the product space for which there is an n0 (n0 depends on the path) such that for all n > n0, Xn = Yn. Then c is said to eventually agree. THEOREM 241: Let …X-2, X-1, X0, X1, X2… be a stationary process. It is a weak Bernoulli process iff for all pasts P off a set of pasts of measure 0, we can couple the conditioned future, conditioned on the past being P, to the unconditioned future with a coupling that eventually agrees. Idea of proof: We leave "if" as an exercise to the reader. We now prove "only if". Consider the process …Y-2, Y-1, Y0, Y1, Y2… in the definition of weak Bernoulli.
That definition implies that for all ε there is an m such that for all n, there is a coupling of the two processes (…X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2…) which, when restricted to times (−(m+n), −(m+n−1),…, −m, m, m+1,…, m+n), is on the diagonal with probability at least 1−ε, where ε is small. We can shift the coupling by m−1 to get times (−n−1, −n,…, −1, 2m−1, 2m, 2m+1,…, 2m+n−1). We now apply lemma A to take that part of the coupling in which there is disagreement on any time (−n−1, −n,…, −1) and replace it with something which is on the diagonal for times (−n−1, −n,…, −1). What we have now is a coupling of …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… on times (−n−1, −n,…, −1, 2m−1, 2m, 2m+1,…, 2m+n−1) which is on the diagonal with probability 1 for times (−n−1, −n,…, −1) and has a probability of at most ε of disagreeing on any time 2m, 2m+1,…, 2m+n−1. Take a subsequential limit as n → ∞ so that our coupling is now defined for all times except times 0, 1,…, 2m. We get an m for every ε. To indicate this we will write m = m(ε). Fix a large M and consider the couplings cM determined by ε = 1/5^M and m(1/5^M). By observing cM, cM-1 and lemma B (where we project onto all times except (0, 1,…, 2m(1/5^M))), we get a coupling defined for all times except (0, 1,…, 2m(1/5^(M−1))) which is the diagonal on the past, is the diagonal except probability 1/5^M on all times greater than m(1/5^M), and is the diagonal except 1/5^M + 1/5^(M−1) < 1/2^(M−1) on all times except (0, 1,…, 2m(1/5^(M−1))). Repeating the process using that coupling and cM-2 we get a coupling of all times except (0, 1,…, 2m(1/5^(M−2))) which is the diagonal on the past, is the diagonal except probability 1/5^M on all times greater than m(1/5^M), is the diagonal except 1/2^(M−1) on all times except (0, 1,…, 2m(1/5^(M−1))), and is the diagonal except 1/2^(M−2) on all times except (0, 1,…, 2m(1/5^(M−2))). Repeat again and again using cM-3, cM-4,… etc. until you have a coupling defined on all times which is the diagonal on all times greater than 2m(1/5^i) with probability at least 1 − (1/2)^i for all i < M. Take a subsequential limit as M → ∞ to get a coupling of the two processes which is on the diagonal for the past and which eventually agrees with probability 1 in the future. Now it is simply a matter of conditioning this coupling on a given past. The future Y process is the unconditioned future, the future X process is the conditional process given the given past, and those two processes are coupled so that they will eventually agree.// ABBREVIATION: Couple times −n, −n+1,…, −m, m,…, n with the almost identity coupling. Shift to times m−n, m−n+1,…, 0, 2m,…, m+n. Fudge the coupling to get the identity on m−n, m−n+1,…, 0. Take the limit as n → ∞ to get all times …, −2, −1, 0, m, m+1,… For k < m there is a way to take one of these couplings on times …, −2, −1, 0, m, m+1,… and another on …, −2, −1, 0, k, k+1,… and get another on …, −2, −1, 0, k, k+1,… so that if the former is the identity except measure ε and the second is the identity except measure δ, the third will be an extension of the first and the identity except measure ε + δ. Repeating this type of manipulation and taking a limit we end up with a coupling on all times: identity on the past and eventually agreeing on the future. COMMENT 242: This comment refers to both our discussions of weak Bernoulli and very weak Bernoulli. We have perhaps confused you by sometimes coupling the conditioned measure given P1 with the conditioned measure given P2 for any two pasts P1 and P2, and sometimes coupling the conditioned future given P1 with the unconditioned future for the purpose of defining a metric. The purpose of this comment is to indicate that it doesn't matter which of these two things we do.
They are equivalent. If we can get an ε match by coupling a conditioned future given any past P with the unconditioned future, then for any two pasts P1 and P2 we can glue together two such couplings to get a coupling that gives a 2ε match between the two conditioned futures. Conversely, if we can couple any two conditioned futures in a measurable way, then we can consider the coupling for P1 and P2, regard P2 as a random variable and integrate out the coupling as you let P2 run, to form a coupling of the conditioned future with the unconditioned future which gives an ε match. The difficulty in this proof is that you can't just choose your coupling with the axiom of choice for each P2 or you will get a nonmeasurable mess. You need to show that you can choose your coupling without the axiom of choice. ORNSTEIN ISOMORPHISM THEOREM COMMENT 243: This comment is only directed to people who already know the theory. There is no marriage lemma in this proof of the Ornstein isomorphism theorem. 1) Copying In Distribution THEOREM 244: Let P,T be a process and ε > 0. Let m be chosen. If n is sufficiently large, then for any Rohlin tower R of i) height n and ii) error set less than ½ in measure, for all but ε in measure of the P columns of R, the distribution of m names in a column is within ε in the variation metric of the distribution of m names of the process. ABBREVIATION: Columns in a large Rohlin tower tend to have the right m distribution. Idea of proof: n >> M >> m. Birkhoff ergodic says that most words of length M have about the right frequency of names of length m, so if the theorem is false there must be many columns with many inappropriate M words. A typical word of length 5n would be very likely to contain the name of such a bad column, contradicting Birkhoff ergodic for M names.// TECHNIQUE 245: How to copy in distribution: Suppose P,T is a process. S is another transformation, perhaps on another space, and you want a partition Q on the latter space, so that Q,S has similar distribution to P,T. Here is a procedure for getting Q. DEFINITION 246: If P,T is a process, and if S is a transformation on another space with Rohlin tower R of height n on the latter space, then we say we are copying P,T to R if we construct a partition Q in the latter space in the following manner. For every set in P associate a distinct letter of the alphabet. These letters will designate sets in Q. Place another Rohlin tower R′ of height n on the former space (the space of T). P breaks R′ into columns. Break R into columns of the same size as the columns of R′ and identify each column of R with a column of R′ which has the same size. If the rungs of a given column of R′ are in P1, P2,…, Pn (from bottom to top) and if Q1, Q2,…, Qn are the letters corresponding to P1, P2,…, Pn respectively, define Q so that the rungs of the corresponding column (from bottom to top) will be in Q1, Q2,…, Qn respectively. As always, assume tiny error sets for R and R′. THEOREM 247: Let P,T be a process, S a transformation on a possibly different space. Suppose you would like a partition Q on the latter space such that the P,T process and the Q,S process have approximately the same distribution on n names. All you have to do is to make a Rohlin tower R in the latter space sufficiently big (much bigger than n) and copy P,T to R. Idea of proof: Immediate from theorem 203. DEFINITION 248: If we follow the procedure of theorem 206, we say that we are copying the n distribution of P,T to S.
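As a finite toy illustration of Definitions 246 and 248, the sketch below cuts the base of a target tower into columns whose widths are proportional to the probabilities of the n-names of the source process and labels the rungs accordingly; the function and variable names are invented for this sketch, and the real construction of course takes place on a Lebesgue space with a genuine Rohlin tower and a small error set.

```python
# Toy version of "copying the n distribution of P,T to a Rohlin tower".
# The base of the target tower is modelled as the interval [0, base_mass);
# it is cut into one subinterval per n-name, with length proportional to that
# name's probability, and the rungs above a subinterval carry the letters of
# the name.  Names here are illustrative, not the book's notation.

def copy_n_distribution(name_dist, base_mass):
    """name_dist: dict {n-name (a tuple of letters): its probability under P,T}.
    base_mass:  measure of the base of the target tower (roughly 1/n).
    Returns a list of columns (left, right, name); rung m of a column lies in
    the Q-set labelled by the m-th letter of its name."""
    columns, left = [], 0.0
    for name, p in sorted(name_dist.items()):
        width = p * base_mass        # column width proportional to the name's probability
        columns.append((left, left + width, name))
        left += width
    return columns

# Example: copying a distribution on 3-names over the alphabet {'a','b'}.
dist = {('a', 'a', 'b'): 0.5, ('a', 'b', 'a'): 0.25, ('b', 'a', 'a'): 0.25}
for lo, hi, name in copy_n_distribution(dist, base_mass=1/3):
    print(f"column [{lo:.3f}, {hi:.3f})  labelled {''.join(name)}")
```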
DEFINITION 249: In both of the above definitions, if we don’t wish to explicitly mention the n or the R we can simply say copy P,T to get Q so that Q,S looks like P,T. When you do that it is presumed that n is large. COMMENT and DEFINITION 250: Suppose that T and S are transformations on possibly different spaces. P and B are partitions on the same space as T. If you have already copied P,T to R, to get a partition Q, so that Q,S looks like P,T and you want to get a partition C so that (Q V C),S looks like (P V B),T then you can do so in the obvious way by making more columns in R. We say that we are copying P,B,T to get Q,C,S. or we are copying B to get C so that Q,C,S looks like P,B,T. COMMENT 251: Here the conclusion is strong. It says that if the copy was made to force the n distribution (P,T) to look like the n distribution of (Q,S), then we can get the n distribution of (Q V C,S) to be close to the n distribution of (P V B,T). COMMENT: 252 Suppose you know that (Q,S) looks like (P,T). More precisely, (*) every set of Q is associated with a set of P and for some large n we know that the distribution of n names (P,T) is close to distribution of n name (Q,T). Comment and definition 250 presupposes that (*) was achieved by copying (P,T) to (Q,S) using Rohlin towers. However, suppose (*) just happens to be true but we did not force it to be true using Rohlin towers. The next exercise shows that in that case comment 251 turns out to be false. 88 EXERCISE 253: Let ({a,b},T) be the process on two letters which takes on the following four words with probablility ¼ each. …aaaaaaaa… …abababab… …babababa… …bbbbbbbb… Let C be defined to be T(a) and D be defined to be T(b). Let ({heads, tails},S) be independent coin flipping on letters {heads, tails}with probability (½,½) apiece. Show that ({a,b},T) has the same 2 distribution as ({heads,tails},S), but there is no way to get partition Q such that the 2 distribution of (Q V {heads,tails},S) is less than ¼ in the variation distance from the 2 distribution of ({C,D} V {a,b},T). COMMENT 254: However we can get the following weaker result. THEOREM 255: Let > 0, process (P,T) and (Q,S) be given. Then for every integer m > 0, there exists integer integer n > 0 and real number > 0 such that for any partition B on the space of T, if the n distribution of (P,T) and (Q,S) is closer than in the distribution distance, then you can get partition C such that the m distributions of (P V B,T) and (Q V C,S) are closer than in the variation distance. Idea of proof: This time we have to pick our Rohlin towers more carefully. Choose n so large that by Birkhoff most of the words of length n have the right distribution of m words. By comment 44, put towers R and R’ on the T and S spaces respectively so that the P columns (respectively Q columns) have exactly the same distributions as the distribution of n names of (P,T) (resp. (Q,S)). Since those distributions are essentially the same, you can copy B to get C so that (P,B,R) looks almost exactly like (P,C,R’). COROLLARY 256: If (Q,S) is a perfect copy of (P,T) (exactly the same n distribution for all n), then for any m you can copy B to get C so that (P V B,T) and (Q V C,S) are as close as you want in m distribution. DEFINITION 257: If P and Q are partitions on the same space, and we are identifying Q with P, by corresponding each Pi in P with a Qi in Q, we define P intersect Q to be the union over all i of Pi intersect Qi .The symmetric difference between P and Q is the complement of P intersect Q. 
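A finite toy model may help fix the meaning of Definition 257; the atoms, weights and labels below are invented solely for this illustration.

```python
# Finite toy model of Definition 257.  The space is a set of weighted atoms;
# the identified partitions P and Q are given by labelling maps (atom -> piece
# index).  "P intersect Q" is the union of the atoms whose two labels agree,
# and the symmetric difference is its complement.

weights = {"w1": 0.4, "w2": 0.3, "w3": 0.2, "w4": 0.1}
P_label = {"w1": 1, "w2": 1, "w3": 2, "w4": 2}
Q_label = {"w1": 1, "w2": 2, "w3": 2, "w4": 1}

intersect_mass = sum(w for x, w in weights.items() if P_label[x] == Q_label[x])
sym_diff_mass  = sum(w for x, w in weights.items() if P_label[x] != Q_label[x])
print(intersect_mass, sym_diff_mass)   # about 0.6 and 0.4; they sum to 1
```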
89 DEFINITION 258: The distance between P and Q, denoted |P-Q| is the measure of the symmetric distance between P and Q. COMMENT 259: You can also use the above definition to refer to two sets as opposed to two partitions. From now on, if we just say “distance” or “close” without mentioning the words “variation” or “mean hamming” or “dbar” and if it is not obvious that there is some other metric that we are discussing, then we are referring to the above metric. 2) Coding DEFINITION 260: Suppose f is a homomorphism from P,T to Q,S. Then f can be approximated by a finite code, i.e. there is some n such that the -n,-n+1...n-1,n P name determines the atom of Q that the image of f is in, up to some small probability . We will say P,T codes Q,S well by time n. Saying the same thing in the language of processes, if …Y-2, Y-1, Y0, Y1, Y2… is a factor of …X-2, X-1, X0, X1, X2… , we will say that the X process codes the Y process well by time n, if X-n…X-2, X-1, X0, X1, X2… …Xn determines Y0 up to probability . DEFINITION 261: If …X-2, X-1, X0, X1, X2… codes …Y-2, Y-1, Y0, Y1, Y2… well by time n then the code is a function from words of length 2n +1 in the X process to words of length 1 in the Y process such that if you guess Y0 to be the image of X-n…X-2, X-1, X0, X1, X2… …Xn, you have a high probability of being right. DEFINITION 260 in more detail. Of course, when we know that …Y-2, Y-1, Y0, Y1, Y2… is a factor of …X-2, X-1, X0, X1, X2… and we merely say that the X process codes the Y process by time n we mean that using the same code for all i Xi-n… Xi+n makes a quess as to what Xi is and that guess is correct with probability at least 1-. This follows from definition 260 by stationarity if we know the code came from a homomorphism. 90 DEFINITION 262: In the above definition, if T = S then we simplify our language by saying P,T codes Q well by time n. COMMENT 263: The reason that P,T must code Q by time n for some n is that f-1(Q) is a partition which can be approximated by a partition of cylinder sets. THEOREM 264: S,Q is a factor of T,P iff there is a code from (T,P) to (S,Q) for each n in such a way that For all there exists n such that (T,P) codes (S,Q) by time n. Idea of proof: Only if: Let f: (T,P) (S,Q) be the factor map.f-1(Q) is a partition in the -algebra generated by P. This partition can be approximated by cylinder sets. If: Let i be summable and ni be the corresponding value of n for i. By Borel Cantelli the ni code correctly tells what the set in Q f() is in for all but finitely many ni. By translation the same is true for Ti(Q) for any i. This defines the homomorphism.// DEFINITION 265: Suppose P and P’ are two partitions identified with each other in that they have the same number of pieces, and every distinct set in P is identified with a distinct set in P’. Suppose Q and Q’ are two similarly identified partitions. We say that P,T codes Q,S well by time n in about the same way that P’,T’ codes Q’,S’ if that P,T codes Q,S well by time n, P’,T’ codes Q’,S’ well by time n, and furthermore the codes we use are the same. COMMENT 266: Note that if P,T codes Q well by time n in about the same way that P,T codes Q’, (Remember that in this case T is the only transformation under consideration) then Q and Q’ are close to each other. COMMENT 267: Fix positive integer n. Let T and S be transformations. Suppose Q,T is a factor of P,T (so that P,T codes Q,T by time n). Suppose we copy P,Q,T to get P’,Q’,S. If the copy is good enough, P,T codes Q about the same as P’,T’ codes Q’ by time n. 
Note that if we use a Rohlin tower to do our copying, it will have to be much bigger than n. 91 COMMENT 268: True or false: Suppose i > 0 for all i, (i) < , there are partitions Pi, integers ni, |Pi+1 - Pi | < i and for all i, (T, Pi) i codes Q well by time ni. Then Pi converges to a P such that T,Q is a factor of T,P. Answer: False. It may be the case that for all i and all j > i, (T, Pj) does not code Q well by time ni so that (T, P) does not code Q well for any n. The point is that we only know (T, Pj) codes well by time j, not by time i. However this statement can be modified to be true as follows: THEOREM 269: Suppose i > 0, for all i,(i) < there are partitions Pi, increasing integers ni, |Pi+1 - Pi | < i/ni, and for all i, (T, Pi) codes Q well by time ni. then Pi converges to a P such that (T, Q) is a factor of (T,P). Idea of proof: For all j > i, moving from Pj-1 to Pj worsens the ni code by at most 2j + 1 because each coordinate from -ni to ni is altered with probability at most j/nj which is less than j/ni. -----------------------------------------3) The Land of P DEFINITION 270: Let T be a transformation, and let P be a partition which does not necessarily generate. The - algebra generated by Ti(P) where i runs over all integers, is called the land of P. DEFINITION 271: We can make copies of - algebras, partitions etceteras in the land of P, just by simply doing everything in that - algebra (the land of P) because the land of P is a Lebesgue space. 92 Example 1: Suppose T is a transformation and you want to copy a partition in the land of P. You can make a Rohlin tower in the land of P, then partition the base in the land of P etc. The beauty of all this, is that whatever you end up with ends up being a factor of P,T. Example 2: Fix a transformation T and an integer n. Suppose P, P1, P2 are partitions, and you have a copy P3 of P1 so that P3,T has exactly the same distribution as P1,T. You would like to make a good copy P4, in the land of P, of P2 so that P3,P4,T has about the same n distribution as P1,P2,T. You can do so as long as the P3,T process is in the land of P, even if the P1 and P2, processes are not. On the other hand, you cannot do it if the P3,T process is not in the land of P, because you use that process to construct P4. -----------------------------------------4) Capturing entropy COMMENT 272: Let P,T be a process and S be another transformation. Copy the n distribution of P,T to S, thereby getting a process Q,S whose n distribution is close to that of P,T. Is there anything we can say about the comparison of the entropy of Q,S with that of the entropy of P,T? To answer that, recall that the entropy of P,T is the decreasing limit of * 1/n H (V(i from 1 to n) T-i (P)). This means that the n distribution of P,T already gives an upper bound for the entropy of P,T. Hence if Q,S has been selected to have approximately the same n distribution as P,T and if n is large enough for * to be close to its limit, we will already know that Q,S could not possibly have much more entropy than P,T. However, the entropy of Q,S could conceivably be much less than the entropy of P,T no matter how well we copy the distribution of P,T to get Q,S. In short, a good copy of distribution guarantees a big enough entropy (approximately), but not a small enough entropy. If we want to guarantee that Q,S has approximately the same entropy as P,T, we need to have some way of holding its entropy up. 
The purpose of this section is to develop techniques which generate a copy Q,S in such a way as to hold up its entropy, thereby ensuring that Q,S and P,T have approximately the same entropy. DEFINITION 273: Let P be a partition, T a transformation and n a positive integer. The P,T,n entropy drop is defined to be (1/n)H(V(i from 1 to n) T^(-i)(P)) − H(P,T). EXERCISE 274: Fix a stationary distribution for words of size n. Prove that the process given by the (n−1)-step Markov chain has minimum P,T,n entropy drop. THEOREM 275: Let Q refine P. Then H(P) − ½H(P V T(P)) < H(Q) − ½H(Q V T(Q)). Idea of proof: Recall the following principle: if A, B, and C are three partitions such that C refines B, then H(A V C) − H(C) < H(A V B) − H(B). Hence H(Q V T(Q)) − H(P V T(Q)) < H(Q) − H(P) and H(P V T(Q)) − H(P V T(P)) < H(T(Q)) − H(T(P)) = H(Q) − H(P). Add these two inequalities together.// COROLLARY 276: Let Q refine P. Then (1/n)H(V(i from 1 to n) T^i(P)) − (1/(2n))H(V(i from 1 to 2n) T^i(P)) < (1/n)H(V(i from 1 to n) T^i(Q)) − (1/(2n))H(V(i from 1 to 2n) T^i(Q)). Proof: Just plug in theorem 275, replacing P with V(i from 1 to n) T^i(P), Q with V(i from 1 to n) T^i(Q) and T with T^n.// COROLLARY 277: Let Q refine P. Then the P,T,n entropy drop < the Q,T,n entropy drop for all n. Idea of proof: The P,T,n entropy drop is [(1/n)H(V(i from 1 to n) T^i(P)) − (1/(2n))H(V(i from 1 to 2n) T^i(P))] + [(1/(2n))H(V(i from 1 to 2n) T^i(P)) − (1/(4n))H(V(i from 1 to 4n) T^i(P))] + … THEOREM 278: P,T ergodic with entropy H, ε > 0 arbitrary: for all n sufficiently large, for all Rohlin towers R of height n, for all but ε in measure of the P,T columns of R, the width of the column is between exp(−(H+ε)n) times the measure of the base and exp(−(H−ε)n) times the measure of the base. Idea of proof: This theorem would follow immediately if we knew that the bad set of the Shannon Macmillan theorem does not contain a large fraction of the base, because the width of a column is exactly the size of a word in the base of that column. We can't easily prove that the base has a small bad set, but we will now point out that there are nearby rungs with small bad sets. Pick δ and γ such that 0 < δ << γ << ε. By Shannon Macmillan Breiman, pick the height of your tower, n, so that for all N > (1−γ)n, outside of a bad set of size δ, all N names have size between exp(−N(H+δ)) and exp(−N(H−δ)). Among the sets (first rung, second rung,…, (γn)th rung) there must be a rung r such that less than ε/2 of the measure of r is in the bad set, and among the sets (T^(-1)(first rung), T^(-2)(first rung),…, T^(-γn)(first rung)) (prove these sets to be disjoint) there must be a set s such that less than ε/2 of s is in the bad set. The existence of r establishes that all but ε/2 of the columns are sufficiently small and the existence of s shows that all but ε/2 of the columns are sufficiently big, because every name of size n in the base is contained in a name of size at least n(1−γ) in r and contains a name of size at most n(1+γ) in s.// ABBREVIATION: Since the Shannon Macmillan theorem fails only on a small set, there are rungs near the base, above and below the base, with small bad set. THEOREM 279: Young children's puzzle theorem: Let P be a partition containing m sets and Q be a partition whose sets are all smaller than ε in measure. Then there is a partition P′ such that i) Q is finer than P′ and ii) the variation distance between the distributions of P and P′ is less than mε. Proof left to reader.
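One way to see why the young children's puzzle theorem should be true is the greedy assembly sketched below; the greedy strategy, the function name and the sample numbers are all choices made for this illustration and are not taken from the book.

```python
# A greedy sketch of the young children's puzzle theorem.
def young_childrens_puzzle(targets, pieces):
    """targets: the measures of the m sets of P (they sum to 1).
    pieces:  the measures of the sets of Q, each smaller than eps, summing to 1.
    Build P' by lumping whole pieces of Q: pour pieces into the current set of
    P' until its target measure is reached, then move to the next set.  Every
    set of P' except the last overshoots its target by less than max(pieces),
    so the variation distance between the distributions of P and P' is less
    than m * max(pieces)."""
    assignment = [[] for _ in targets]   # which pieces of Q were lumped into each set of P'
    masses = [0.0] * len(targets)
    i = 0
    for p in pieces:
        while i < len(targets) - 1 and masses[i] >= targets[i]:
            i += 1                       # current set of P' is full enough; move on
        assignment[i].append(p)
        masses[i] += p
    return assignment, masses

# Example with pieces of size at most 0.06:
assignment, masses = young_childrens_puzzle([0.5, 0.3, 0.2], [0.06] * 10 + [0.04] * 10)
print(masses)                            # each entry is within 0.06 of its target
```

The whole point is the one your children would make: big pieces can be assembled out of little pieces, with an error of at most one little piece per big piece.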
COMMENT 280: The young children’s puzzle theorem and a theorem we will later state called the mature children’s puzzle theorem are obviously meant to be our substitution for a marriage lemma. However they are much easier to prove than a marriage lemma. Even your young children can tell you how to make big pieces out of little pieces. The advantage of not using a powerful marriage lemma is that if you want to investigate the details of the isomorphism that you get, it will be much harder to do that if you used a deep theorem to get that isomorphism. COMMENT 281: True or false: If we have a partition and take a set of pieces of the partition of small measure and lump it together you won’t decrease the entropy of the partition by very much. Answer: False. Let S be that small set. By the conditioning property, if we blow up P when restricted to S by normalizing the measure on S to a probability measure, and blow up P on the complement of S by blowing up the its measure to a probability measure, then H(P) = (S)H(blow up of P on S) + (1-(S)) H(blow up of P on the complement of S). It is conceivable that (S)H(blow up of P on S) >> (1-(S)) H(blow up of P on the complement of S) so that lumping some of S could have big effect. However, we nevertheless have the following. 96 Lemma 282: Let P be a finite partition with #P pieces and S be a set of names in the P,T process. If (S) << 1/ log(#P), lumping S together will not significantly alter (1/n)H(the set of n names), regardless what n is. Idea of proof: The total number of possible names of length n is (#P)n. Hence the blow up partition on S can have at most (#P)n pieces. The largest its entropy can conceivably be is if all those sets have the same entropy, in which case it is n(log(#P)). Technique 283: We are finally ready to answer the question posed in the beginning of this section: Technique: How to copy a distribution while holding up the entropy. We are given a Process P,T, an integer n, and another transformation S whose entropy is greater than T. Here is how to construct a partition Q so that the n distributions and entropies of P,T and Q,S are as close as you want. (as long as you don’t insist on equality) Among other things we are copying the n distribution and we already know how to do that. As you will recall, the process for doing that is to let N >> n. Choose N towers R and R’ for T and S respectively, and then copy P from R to R’. Our goal is to carry out that process carefully so that we will keep our entropy. Let H be the entropy of P,T. Let H’ be the entropy of S. Let H’- H >> > 0 be given. By theorem 203, only a small measure of P,T,R columns have measure smaller than exp(-(HT+ )N). Alter P by lumping those few columns together into one big column with one name(any given name you choose) and that won’t change either entropy or distribution of P,T very much by lemma 228.1. Now you have at most exp((HT+ )N) columns. Let G be a generator for S. Consider the S, G, R’ columns. All but a very small measure of those columns are small enough so that we can copy P,T to R’ as usual , using the young children’s puzzle theorem to make sure that the columns of G are finer than the columns of your copy. The big S,G,R’ columns together with the error in the puzzle theorem only make the copy not quite as nice as it otherwise would have been. That constructs the desired copy Q. It has approximately the right n distribution and is more course than the S,G,R’ columns. Let B be the base of R’ and let Bc be its complement. 
Define partitions PG and PQ: PG = {B ∩ (G columns), Bc} and PQ = {B ∩ (Q columns), Bc}. These are partitions which generate the same σ-algebras under S as G V {B, Bc} and Q V {B, Bc} respectively (not precisely the same because of the error set, but this is negligible), so H(PG,S) is approximately H(G,S) = H(S), and H(PQ,S) is approximately H(Q,S). By corollary 225.5, the PQ,S,N entropy drop < the PG,S,N entropy drop. It does not hurt to assume that n is bigger than it actually is, so choose N big enough at the very beginning to assure that the PG,S,N entropy drop is small.// ABBREVIATION: To get the n distribution and keep entropy, copy one N >> n picture onto another as usual. First put G columns on the range tower, where G is a generator, and then make your copy's columns contain G columns by fudging columns to make the puzzle theorem applicable. Intersect columns with the base to get PQ and PG. The PQ,S,n entropy drop < the PG,S,n entropy drop. THEOREM 284: Technique 283 can be made to work even when the entropy of S = the entropy of T. Idea of proof: Since we are only interested in copying a distribution close to that of the original process, don't copy P,T itself, but just copy something close to it. Just start with a process slightly less than full entropy, which is close to the original process in distribution, and then technique 283 is applicable. One way to get the approximating process is to form a Rohlin tower and lump a small set of columns together. This can't change entropy by much because of Lemma 282. To see how this procedure can force a drop in entropy, consider the PQ partition developed for technique 283. By lumping a few columns together we can force the N distribution entropy of PQ,T to be too small, forcing H(Q,T), which is approximately H(PQ,T), to be smaller than it was. 5) The dbar metric LEMMA 285: Fix a, ε > 0 with a + ε < ½. For sufficiently large n, the number of words of zeros and ones of length n with fewer than an ones is less than exp[−((a+ε)log(a+ε) + (1−(a+ε))log(1−(a+ε)))n]. Proof: We could give a proof using Stirling's formula but we prefer this proof. Consider a random independent process on zeros and ones where "one" has probability a+ε and consider words of length n. The process has entropy −(a+ε)log(a+ε) − (1−(a+ε))log(1−(a+ε)), so the typical reasonable word has probability about exp[((a+ε)log(a+ε) + (1−(a+ε))log(1−(a+ε)))n], and all of the words of the given set have probability exponentially bigger than these reasonable words. The result follows because the given set has probability less than 1.// THEOREM 286: Fix a, ε > 0 with a + ε < ½. Let A be the number of elements in your alphabet. For sufficiently large n, the number of words of length n within a of any fixed word of length n in the mean Hamming distance is less than exp[(−(a+ε)log(a+ε) − (1−(a+ε))log(1−(a+ε)) + (a+ε)log(A−1))n]. Idea of proof: Don't use lemma 231 directly. Use the idea of its proof. First describe a set which has the same number of elements as the given set.// THEOREM 287: Mature children's puzzle theorem: Let M, P be two partitions such that M V P contains m sets and let M′ have just as many sets as M. Let Q be a partition whose sets are all smaller than ε in measure and such that Q is finer than M′. Then there is a partition P′ such that i) Q is finer than M′ V P′ and ii) the variation distance between the distributions of M V P and M′ V P′ is less than mε + the variation distance between the distributions of M and M′. Proof left to reader. COMMENT 288: Be the first on your block to own one of these swell toys!
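Since Lemma 285 and Theorem 286 are pure counting statements, they can be sanity-checked numerically; the script below does so with natural logarithms and arbitrary illustrative values of n, a, ε and the alphabet size A (it checks the stated inequalities for one choice of parameters, nothing more).

```python
from math import comb, exp, log

n, a, eps, A = 60, 0.2, 0.05, 4           # arbitrary illustrative values; A = alphabet size

def h(p):                                  # entropy of a p-coin, natural logarithms
    return -(p * log(p) + (1 - p) * log(1 - p))

# Lemma 285: 0-1 words of length n with fewer than a*n ones.
count_285 = sum(comb(n, k) for k in range(n + 1) if k < a * n)
bound_285 = exp(h(a + eps) * n)
print(count_285 < bound_285)               # True

# Theorem 286: words of length n within mean Hamming distance a of a fixed word
# over an alphabet of size A: choose the disagreement positions, then one of
# A-1 letters at each of them.
count_286 = sum(comb(n, k) * (A - 1) ** k for k in range(n + 1) if k <= a * n)
bound_286 = exp((h(a + eps) + (a + eps) * log(A - 1)) * n)
print(count_286 < bound_286)               # True
```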
DEFINITION BY CONSTRUCTION 289: Let …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… be two processes on the same alphabet, perhaps on different spaces, and c be a coupling of X1, X2,…, Xn and Y1, Y2,…, Yn. This coupling is a measure on the product space which associates with every ordered pair (w1, w2) from X1, X2,…, Xn and Y1, Y2,…, Yn respectively a probability p(w1, w2). We now consider a completely new Lebesgue space. Place a Rohlin tower of height n on our new space (you don't even have to assume there to be any error set, because we have no transformation in mind, just n sets of measure 1/n each). Break it into columns so that for every ordered pair (w1, w2) there is an associated column whose measure is p(w1, w2). Let {a1, a2,…, ap} be the common alphabet of the X and Y processes. We are about to regard ordered pairs of these letters as representing sets in our new space. In particular, we regard the mth rung of the column associated with (w1, w2) to be in (ai, aj) if the mth letter of w1 is ai and the mth letter of w2 is aj. Our tower, endowed with these sets we have represented by ordered pairs (ai, aj), is called a picture of the coupling between X1, X2,…, Xn and Y1, Y2,…, Yn. EXERCISE 290: Show that the union of all ordered pairs (ai, aj) such that i ≠ j has measure exactly equal to the expected mean Hamming distance achieved by the coupling. DEFINITION 291: We want to discuss the notation we just developed where, instead of regarding the processes under discussion to be …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2…, we regard them to be P,T and Q,S. The translation is that P is the p set partition {X0 = a1, X0 = a2,…, X0 = ap} and Q is the p set partition {Y0 = a1, Y0 = a2,…, Y0 = ap}. The statement in the previous definition, "Our tower, endowed with these sets we have represented by ordered pairs (ai, aj), is called a picture of the dbar coupling," is replaced with the statement "Rohlin tower R′′ of height n, endowed with partitions P′′ corresponding to P and Q′′ corresponding to Q, is a picture of the coupling," where R′′ is the tower of the picture, P′′ is the partition of the picture given by the first coordinate of the pair and Q′′ is the partition of the picture given by the second coordinate of the pair. COMMENT 292: In the previous definition, the partition of the picture into ordered pairs is precisely P′′ V Q′′. EXERCISE 293: All notation as in definition 236. Let R be a Rohlin tower of height n on the space of T such that the base of the tower is independent of the partition of the space into n names (under P,T). Show that P′′, R′′ is a precise copy of P, R. THEOREM 294: Let P,T and Q,S be two processes whose distance in dbar is d < ¼. A is the number of letters in the alphabet. Suppose H(T) > H(P,T) + d^½ log(A−1) − (d^½ log d^½ + (1−d^½) log(1−d^½)). (Note that −(d^½ log d^½ + (1−d^½) log(1−d^½)) is positive.) Then it is possible to find a partition Q′ on the same space as P such that i) |P−Q′| < 2d^½ and ii) Q′,T is as close to Q,S as you want in distribution and entropy (except that you are not allowed to insist that their n distributions or entropy be equal). COMMENT 295: (ii) means you can pick any n you want in advance, and then get them as close as you want in entropy while getting them as close as you want in n distribution. COMMENT 296: The only significance of the expressions d^½ log(A−1) − (d^½ log d^½ + (1−d^½) log(1−d^½)) and 2d^½ for this book is that they go to zero as d does. Proof: This proof is separated into two parts: a computation and then a proof.
COMPUTATION: The statement of the theorem introduces values d and A. The second line includes the expression d^½ log(A−1) − (d^½ log d^½ + (1−d^½) log(1−d^½)), which we will call X, and (i) includes the expression 2d^½, which we will call Y. When deriving the theorem we did not yet know X and Y, so let us now pretend that we do not know what they are. We have the following quantities to consider. Cast of characters: X, Y, d and A. We wish to derive the values of X and Y in terms of d and A. In the proof we will introduce another quantity, B. Guest star: B. In the proof we will find that we can let X = B log(A−1) − (B log B + (1−B) log(1−B)) and Y = B + d/B. Since we can let B be anything we want, we will choose B to be d^½ in order to force Y to be pretty small. This forces X and Y to be what they are. PROOF: Start off with the large n for which we want to copy the n distribution of P. Let N >> n. There is a coupling of the N distribution of (P,T) with the N distribution of (Q,S) which achieves an expected mean Hamming distance < d. Let G be a generator for T. Let R be a tower of height N. On an arbitrary space, let Rohlin tower R′′ of height N, endowed with partitions P′′ corresponding to P and Q′′ corresponding to Q, be a picture of the coupling. For the moment ignore P′′ V Q′′ and just focus on P′′ columns. In preparation for copying, lump together all unreasonably small P′′ columns (totally ignoring what Q′′ looks like on those columns) as we did in the proof of technique 283. Now we will add some more Q′′ V P′′ columns to that lumped column. Consider a Q′′ V P′′ column in R′′ to be "bad" if the fraction of the rungs in the column in which the first and second coordinates disagree exceeds B (the value of B is given in "computation"). Our problem is that the number of bad columns may be too big for us to apply the mature children's puzzle theorem. We resolve the problem by adding all bad columns to the lumped column so that we have one big lumped column. Before doing that, the lumped column was negligibly small. Now, since d is the mean Hamming distance achieved by the coupling, (*) the size of the lumped column is at most d/B (we need not take into consideration the negligibly small part). On the lumped column, we erase all knowledge of Q′′ so that we no longer have Q′′ V P′′ columns, but rather only P′′ columns. Now we count columns. The number of unlumped P′′ columns can't be very much bigger than exp(H(P,T)N). The number of Q′′ V P′′ columns in each unlumped P′′ column can't be much bigger than exp((B log(A−1) − (B log B + (1−B) log(1−B)))N) by Theorem 286. Hence the total number of columns cannot be very much bigger than exp(H(P,T)N) exp((B log(A−1) − (B log B + (1−B) log(1−B)))N). If we insist that H(T) > H(P,T) + B log(A−1) − (B log B + (1−B) log(1−B)), there are enough G columns to apply puzzle theorems. Finally we are ready to copy Q′′ to get Q′. On the unlumped columns, use the mature children's puzzle theorem to get G finer than Q′ and to get Q′,P to look like Q′′,P′′. On the lumped column, use the YOUNG children's puzzle theorem to get G finer than Q′, but here we don't care how Q′ and P compare with each other. Conditioned on our being in the unlumped columns, |P′′ − Q′′| < B. On good columns |P − Q′| < B, so by (*), |P − Q′| < B + d/B.// ABBREVIATION: In a picture of a dbar coupling, lump small P′′ columns together with P′′ columns having a higher fraction than d^½ of rungs of disagreement, yielding a lump of size d^½.
Now copy Q’’ to get Q’ with Q’ looking like Q everywhere, close to P except on copy of lump, and filled with generator columns to insure high entropy. COMMENT 297: In order to motivate what is coming, we need to point out some of the difficulty you might have had if you had tried to prove the theorem 294 yourself without reading our proof. The difficulty is that we don’t have any idea what the picture of the coupling looks like. The result is an awkward only partially useful theorem. Here is another problem. We will end up using the theorem when we know that Q,S is finitely determined, and instead of knowing that P,T and Q,S are close in dbar, as the theorem requires, we will know that P,T and Q, S are close in distribution and entropy. That is O.K. because for finitely determined processes, close in distribution and entropy implies close in dbar. The trouble is that we don’t know how close in distribution and entropy you have to be in order to be close in dbar unless you know precisely what finitely determined transformation we are talking about. We will use the theorem but it will not be enough. We will also need that in a special case we can do much better. In our special case we will have a much better picture of the coupling, and we will not have to make use of the finitely determined property when we use it. We now proceed to develop that special case. 103 DEFINITION 298: Let f: …X-2, X-1, X0, X1, X2… …Y-2, Y-1, Y0, Y1, Y2… be a factor map. The coupling corresponding to f is the coupling of …X-2, X-1, X0, X1, X2… and …Y-2, Y-1, Y0, Y1, Y2… assigning measure 0 to {(w1, w2) : f(w1) w2} and for every set A in the …X-2, X-1, X0, X1, X2… space, it assigns the measure of A to {(w1, w2) : f(w1) w2 and w1 A} EXERCISE 299: Prove that it does indeed couple the two spaces. EXERCISE 300: Prove that if the coupling corresponding to f is restricted to words of length n, (i.e. Erase all knowledge of Xi or Yi when i {1,2,…n}) for large n, then the picture of it is made up almost entirely of approximately 2Hn reasonable names where H is the entropy of the X process. Hint: …X-2, X-1, X0, X1, X2… V …Y-2, Y-1, Y0, Y1, Y2…is isomorphic to …X-2, X-1, X0, X1, X2….// THEOREM 301: Suppose S ergodic, f:Q,S q,S is a homomorphism and suppose Q has the same number of sets as q and that the sets of Q are identified with the sets of q so that it makes sense to ask about the expected mean hamming distance that a coupling between Q and q achieves. Suppose the expected mean hamming distance achieved by the coupling corresponding to f when restricted to words of length n is d. Then if T is another transformation, possibly on a different space, P is a partition on that latter space such that P,T is a perfect copy of q,S, and H(T) > H(Q,S), then it is possible to find a partition Q’ on the same space as P such that i) |P-Q’| < d. ii) Q’,T is as close to Q,S as you want in distribution and entropy (except that you are not allowed to insist that their n distributions or entropy be equal). Idea of proof: Same idea as the proof of theorem 294 except easier. Use exercise 300. We don’t have any big lumps in this proof to worry about. As a technicality we have to admit that this proof really only forces |P-Q’| to be at most a tiny bit bigger than d. However, i) This is good enough for our purposes. No harm will be done if you alter condition (i) to say |P-Q’|<2d. 
THEOREM 301: Suppose S is ergodic, f: Q,S → q,S is a homomorphism, and suppose Q has the same number of sets as q and that the sets of Q are identified with the sets of q, so that it makes sense to ask about the expected mean Hamming distance that a coupling between Q and q achieves. Suppose the expected mean Hamming distance achieved by the coupling corresponding to f, when restricted to words of length n, is d. If T is another transformation, possibly on a different space, P is a partition on that latter space such that P,T is a perfect copy of q,S, and H(T) > H(Q,S), then it is possible to find a partition Q’ on the same space as P such that
i) |P - Q’| < d.
ii) Q’,T is as close to Q,S as you want in distribution and entropy (except that you are not allowed to insist that their n distributions or entropy be equal).

Idea of proof: Same idea as the proof of Theorem 294, except easier. Use Exercise 300. We don’t have any big lumps in this proof to worry about. As a technicality we have to admit that this proof really only forces |P - Q’| to be at most a tiny bit bigger than d. However,
i) This is good enough for our purposes. No harm will be done if you alter condition (i) to say |P - Q’| < 2d.
ii) If you want to achieve |P - Q’| < d, show you can do so by fudging Q’ slightly without seriously hurting distribution or entropy.//

4) Sinai’s and Ornstein’s theorems.

DEFINITION 302: Let …X-2, X-1, X0, X1, X2… be a stationary process, let a0 a1 a2… be an infinite word in the alphabet of the process, and define the event An(a0 a1 a2…), abbreviated An, by: An occurs iff Xi = a0, Xi+1 = a1, Xi+2 = a2, …, Xi+n-1 = an-1 for some i ∈ {-n+1, -n+2, …, 0}.

COMMENT 303: Finitely determined implies positive entropy, so picking a0 a1 a2… to obey Shannon Macmillan, its size will decrease exponentially, and hence P(An) goes to 0 as n approaches infinity.
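A sketch of the estimate behind Comment 303, assuming the process has entropy h > 0 and a0 a1 a2… obeys the Shannon Macmillan theorem:

\[
P(A_n) \;\le\; \sum_{i=-n+1}^{0} P\bigl(X_i = a_0, \dots, X_{i+n-1} = a_{n-1}\bigr) \;=\; n\,P\bigl(X_1 = a_0, \dots, X_n = a_{n-1}\bigr) \;\approx\; n\,2^{-hn} \;\longrightarrow\; 0,
\]

using a union bound over the n possible starting positions and stationarity for the middle equality.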
LEMMA 304 (The average entropy lemma): Let P and Q be partitions and T an ergodic transformation. Let B, T(B), T^2(B), …, T^n(B) be a Rohlin tower for T where n is odd, and let e be its error set. Let two points p ∈ B and q ∈ B be said to be equivalent iff the P,T (n+1)/2 name of p and the Q,T (n+1)/2 name of T^((n+1)/2)(p) are the same as the P,T (n+1)/2 name of q and the Q,T (n+1)/2 name of T^((n+1)/2)(q), respectively. Break B into equivalence classes by the above equivalence relation and define a column to be any set A ∪ T(A) ∪ T^2(A) ∪ … ∪ T^n(A) where A is such an equivalence class. Let PP be the partition made up of the columns and e. Let d > 0. Then if n is sufficiently large and e sufficiently small,
(1/n) H(PP) < (H(T,P) + H(T,Q))/2 + d.

ABBREVIATION: Let the bottom half of a Rohlin tower be one process and the top half be another process. The entropy is not much bigger than the average of that of the two processes.

Idea of proof: There must be an i ∈ {1, 2, 3, …, n}, small, such that most of T^i(B) obeys the Shannon Macmillan theorem for P,T by time n/4 onward and most of T^(i+(n+1)/2)(B) obeys Shannon Macmillan for Q,T by time n/4 onward. Use Lemma 282 to ignore the columns where either T^i(B) or T^(i+(n+1)/2)(B) fails to obey the above mentioned Shannon Macmillan theorems. All other columns are determined by the following triple, where ω ∈ B: (the i-1 name of ω; the (n+1)/2 name of T^i(ω) (which is reasonable); the ((n+1)/2) - i name of T^(i+(n+1)/2)(ω) (which is reasonable)). The number of such triples is less than 2^(((H(T,P) + H(T,Q))/2 + d/2)n) for any prechosen small d.//

THEOREM 305: Let P,T be an ergodic process with positive entropy. Then there are factors Pi,T, where sets of Pi correspond to sets of P, such that
i) (Pi, T) is a factor of (Pi+1, T) with factor map fi: (Pi+1, T) → (Pi, T).
ii) The mean Hamming distance given by the coupling corresponding to fi goes to zero as i → ∞.
iii) |P - Pi| → 0 as i → ∞.
iv) H(Pi,T) < H(P,T) for all i.

Idea of proof: We consider an alphabet {p1, p2, p3, …, pn, D} where p1, p2, p3, …, pn are letters corresponding to the sets of P and D is an additional letter (i.e. P is defined by assigning the letter pi to any point in the ith set of P, for all i). Let C = {c1, c2, c3, …} be a countable partition of the space. Define Pi to be the same as P on c1 ∪ c2 ∪ c3 ∪ … ∪ ci, and throw all the sets cj, j > i, into D. However, there are some technicalities. You can’t just pick the ci’s arbitrarily. By Comment 303 pick a0 a1 a2… so that P(An) goes to 0 and let c1 = Am for some large m. Note that An, n > m, is measurable with respect to all (Pi,T) no matter how we pick ci, i ≥ 2, and if you recall the proof of the Rohlin tower theorem, arbitrarily large Rohlin towers can be constructed from arbitrarily small sets, so using the An, n > m, we get arbitrarily large Rohlin towers which we know to be measurable with respect to all (Pi,T) even before we know the values of the ci, i ≥ 2. Define the ci inductively by considering (Pi,T), defining D to be the set of points labeled D at that time, selecting a huge Rohlin tower defined out of some An, and letting ci+1 be the intersection of D with the bottom half of that tower. The measurability of our Rohlin towers for all (Pi,T), together with the fact that c1 is measurable for all (Pi,T), will give us (i). (ii) and (iii) are easy. (iv) follows inductively from the average entropy lemma. //

Sinai's theorem: Let T be an ergodic transformation with finite entropy. Every finitely determined transformation of the same entropy as T is a factor of T.

Idea of proof: Let P,S be the given full entropy finitely determined process.
Step 1: Get the Pi,S of Theorem 305. (They are all finitely determined because finitely determined is closed under taking factors.)
Step 2: Let P0,0,T approximate P1,S in distribution and entropy (hence in dbar) (Technique 229).
Step 3: Move P0,0,T slightly to get P0,1,T such that P0,1,T is a much better approximation of P1,S. (Theorem 294)
Step 4: Continue to get P0,2,T, P0,3,T, … converging to a process P1,0,T that has the same distribution as P1,S.
Step 5: Move P1,0,T slightly to get P1,1,T such that P1,1,T is a good approximation of P2,S. (Theorem 301)
Step 6: Get (P1,2,T), (P1,3,T), (P1,4,T), … approaching (P2,0,T), which has the same distribution as (P2,S). (Theorem 294)
Step 7: Continuing the same recipe over and over, get (P3,0,T), (P4,0,T), (P5,0,T), … which converge to our desired copy. //

COMMENT 306: The reason we could not go after P,S directly is that P,S has full entropy, and Theorems 294 and 301 require us to be working in an environment of higher entropy.

COMMENT 307: If the reader only wants to prove Sinai’s theorem in the case where H(P,S) is strictly less than H(T), the proof of Sinai’s theorem actually contains within it a proof of that theorem. It is a much easier proof because it is only a piece of that proof and does not require Theorems 301 or 305. We consider it important that the reader focus on that.

LEMMA 308: Let P,T and Q,S be processes. Suppose the dbar distance between P,T and Q,S is ε. Take a factor of the P,T process using a code of length n (i.e. the ith coordinate of the image is determined by the i-n, i-n+1, …, i+n terms of the P,T process). Take a factor of the Q,S process using a code of length n, and in fact use the exact same code as you used for the P,T process. Then the dbar distance of your factors is < (2n+1)ε.

Idea of proof: A good dbar coupling of the processes induces a good dbar coupling of the images, where k errors on terms from -(M+n) to (M+n) induce at most (2n+1)k errors on terms from -M to M.

LEMMA 309: Let P,T and Q,S be processes. Suppose the dbar distance between P,T and Q,S is ε. Take a factor of the P,T process using a code of length n. Take a factor of the Q,S process using a code of length n, and in fact use the exact same code as you used for the P,T process. Suppose you use the same alphabet for the factors as you use for the processes themselves. Then it makes sense to ask whether a given point ends up in the same piece of P and its factor. Suppose the probability that a given point ends up in a different piece of P and its factor is δ. Then the probability that a given point ends up in a different piece of Q and its factor is < (2n+1)ε + δ.

Idea of proof: As before, induce a coupling from a coupling. If the -n to n terms of the two processes are the same, and if the 0 coordinate of the P,T process and its factor are the same, then the 0 coordinates of the P,T process, its factor, the Q,S process, and its factor are all the same.//
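A sketch of the counting behind Lemmas 308 and 309, under the assumption that the dbar coupling can be taken stationary, so that each individual coordinate disagrees with probability ε. A disagreement of the coupled processes at coordinate j can only affect the coded images at coordinates j-n, …, j+n, so

\[
\#\{i \in [-M, M] : \text{the coded images disagree at } i\} \;\le\; (2n+1)\,\#\{j \in [-M-n, M+n] : \text{the coupled processes disagree at } j\},
\]

and dividing by 2M+1 and letting M → ∞ gives a coupling of the factors with error frequency at most (2n+1)ε, which is Lemma 308. For Lemma 309,

\[
P(\text{Q and its factor differ at } 0) \;\le\; \sum_{|j| \le n} P(\text{the coupled processes differ at } j) \;+\; P(\text{P and its factor differ at } 0) \;\le\; (2n+1)\varepsilon + \delta .
\]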
Sinai's theorem (strong form): THEOREM: For all ε there exists δ such that if Q,S is within δ in distribution and entropy of an FD distribution whose entropy does not exceed that of S, then Q can be moved a distance less than ε to get a partition Q’ such that Q’,S has that FD distribution.

Idea of proof: Let P,T have the target distribution. Let Pi be as in Theorem 305. Then the map that takes the P,T name of a point to the Pi,T name of it is a factor map which, for large i, rarely alters what set the point is in. Fix a large i and suppose the probability that a given point ends up in a different piece of P and its Pi is α. Since every factor map is a limit of finite code maps, Pi is very close to a Pi’ that is coded from P with a code of length n for some n. We can presume that the probability that a given point ends up in a different piece of P and Pi’ is at most 2α. Now suppose the dbar distance between Q,S and P,T is a small fraction of 1/n, which is forced if δ is sufficiently small. Let ε’ be the dbar distance between Q,S and P,T. Use the same n code to code a factor q,S of Q,S. The distance between q and Q is < (2n+1)ε’ + 2α by Lemma 309. The dbar distance between q,S and Pi’,T is < (2n+1)ε’ by Lemma 308, and Pi’,T in turn is very close to Pi,T. Proceed as in the proof of Sinai’s theorem.//

COMMENT: There is a subtlety in the above proof. ε’ has to be chosen small even in comparison to ε. If the dbar distance between q,S and Pi,T were on the order of ε, that might be too big to apply Theorem 294 in the proof of Sinai’s theorem.

Ornstein isomorphism theorem: Two finitely determined processes with the same entropy are isomorphic.

Idea of proof: We are given transformations T and S and partitions P and Q such that P,T and Q,S are finitely determined of the same entropy, where P generates. Our goal is to establish a Q’ which also generates T such that Q’,T has the same distribution as Q,S.
1) By Sinai, get q such that q,T has the same distribution as Q,S. By Theorem 269, it suffices to get q’ such that q’,T has the same distribution as Q,S, q’ is arbitrarily close to q, and q’ codes P arbitrarily well.
2) Copy P in the land of q to get P’, so that i) (P’,T) V (q,T) looks like (P,T) V (q,T) in distribution and entropy. By the strong form of Sinai’s theorem, by moving the copy a little after making it, we can assume ii) P’,T has the same distribution as P,T.
3) Copy q to get q’, so that i) (P,T) V (q’,T) looks like (P’,T) V (q,T) in distribution and entropy. By the strong form of Sinai’s theorem, by moving the copy a little after making it, we can assume ii) q’,T has the same distribution as Q,S.

What all this means: Since P generates, there is some n such that P codes q well by time n. (P,T) V (q,T), (P’,T) V (q,T) and (P,T) V (q’,T) all look alike, so P codes q like P codes q’ by time n. Thus q is close to q’. P’ is in the land of q, so there is an n1 such that q codes P’ extremely well by time n1. Hence, by making the copy q’ well enough, we can guarantee that q’ codes P extremely well by time n1. We are done. However, there is a technicality to be concerned about when we play this game. It takes 2n1+1 letters from the q’ partition to code a letter from P when we are using an n1 code. If any of those 2n1+1 letters are altered, the code codes wrong. Therefore, if we move the q' partition even as much as the order of 1/n1, that is enough to devastate the coding. Furthermore, you will recall that we did in fact move q' using Sinai’s theorem. Relax. We did not make the q’ coding until after we already knew n1, and by making it arbitrarily well, we can get the distance we needed to move it to be small, even in comparison to 1/n1.//
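A sketch of the estimate behind that last technicality, in the spirit of Lemma 309; here η is a hypothetical bound on how far the q' partition is moved (it is not a quantity named in the text):

\[
P\bigl(\text{the } n_1\text{-code applied to the moved } q' \text{ misreads coordinate } 0\bigr) \;\le\; P\bigl(\text{it misread before the move}\bigr) + (2n_1+1)\,\eta,
\]

so the coding survives provided η is small compared to 1/n1, which is exactly what the last two sentences arrange.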
ABBREVIATION: Let P,T be a factor of Q,T of full entropy which generates Q,T pretty well. You would like to move P slightly to get an isomorphic copy P’,T of P,T which generates Q,T even better. Make Q’ a factor of P,T: (Q’,T) V P,T looks like (Q,T) V P,T. Make (P’ V Q,T) look like (P V Q’,T). P’ is close to P because T,Q codes them about the same. (P’,T) captures (Q,T) well because it copies the way P,T captures Q’,T.

Corollaries:
1) The Ornstein Isomorphism theorem (classical form): Two independent processes of the same entropy are isomorphic.
2) FD ⇔ FB ⇔ B. Proof: Every FD is isomorphic to an independent process of the same entropy.

Abbreviation of Book

WARNING: Material in this abbreviation can fail to be valid or meaningful mathematics. It may appear to be complete gobbledygook if the reader tries to read it before having read the book. However, if the reader has established that he can make all “ideas of proof” in this book rigorous, then by comparing this abbreviation with the book he can speed up his thinking and be able to quickly review the whole book by just reading this abbreviation. It also serves as an extended table of contents. A backslash is used to separate theorems from their proofs.
---------------------------------------------------------------
PRELIMINARIES page 8
page 11 Stationary process = measure preserving transformation\ Just shift the word to the left.
page 11 Making a transformation\ name a function, stationary process, cutting and stacking.
page 11 Cutting and stacking\ First stage: put interval [.4, .5] on top of [.3, .4] on top of [.2, .3] on top of [.1, .2] on top of [0, .1] and go up; x → x + .1. Second stage: the left third of the stack goes on top of the middle third of the stack, which goes on top of the right third of that stack, to make a new stack. You can put spacers between successive thirds and make more than one stack. Continue.
page 12 Ergodic\ T^(-1)(S) = S ⇒ measure of S ∈ {0, 1}.
page 13 Independent implies ergodic\ Approximate S with a cylinder C. T^(-n)(C) and T^(-m)(C) independent for |n-m| big.
page 14 Let P1, P2, ... be increasing and separating. Arbitrary S almost a union of pieces of Pi for i big\ Closed approximation for a piece of Pi is either in an open approximation for S or in an open approximation for the complement of S.
page 15 Above ⇒ the Pi generate.
page 16 P,T process spits out the word p0, p1, …, pn if those are the sets containing ω, T(ω), …, T^n(ω).
page 16 P, Q generate T ⇒ P,T, Q,T and T are all isomorphic.
page 16 For measurable sets: every countable chain has an upper bound ⇒ the set has a maximal element.
page 17 Exists S with S, T(S), T^2(S), ..., T^N(S) disjoint and almost covering\ Ergodic case: you can simply take every Nth point in every orbit, but you would get a non-measurable set. Fix this problem by using a small set as a starting set. Nonergodic case: you also need an aperiodic assumption. This allows you to extract a small subset of any set whose iterates are disjoint for a long time. Apply measurable Zorn to get a maximal such set for your starting set.
page 18 P,T breaks a Rohlin tower into columns.
page 19 Base of tower can be independent of n names\ Accomplish this by making it independent of every rung of every column.
page 20 Measure(Si) summable, complement of Si a piece of Pi ⇒ superimposition of the Pi countable.
page 20 Every transformation is a process with countable alphabet\ Bi bases of towers with summable measures. Partition each Bi with a finite but huge amount of information, tack on the complement of Bi, and then superimpose them all.
page 21 (1/n)(f(ω) + f(T(ω)) + ... + f(T^n(ω))) converges as n → ∞\ Suppose lim sup > b and lim inf < a. Cover most of the orbit with blocks over which the average is near b, giving an overall average near b. Get an overall average near a also, a contradiction.
page 22 Ergodic case: the above converges to the integral of f\ Lebesgue dominated convergence if f is bounded. Truncation argument. Nonergodic case: it converges to the conditional expectation of f with respect to the invariant σ-algebra.
page 23 Monkey sequence\ Given a sequence, take a subsequence of finite truncations where the frequencies of words of length 1 converge, a subsequence of that for words of length 2, etc. Get a stationary measure.
page 23 Stationary is a convex combination of ergodic\ Three proofs: 1) The monkey measure of a given word is the ergodic component it is in. 2) Conditioning with respect to the invariant σ-algebra gives an ergodic measure. 3) Krein-Milman.
page 26 Subsequential limit\ Same as Monkey except we start with a sequence of measures, and instead of looking at frequencies we look at the probability law of the first letter, of the first two letters, etc.
page 26 The monkey method can also be regarded as starting with sequences of measures instead of sequences\ Let N >> n. We extract a nearly stationary measure of length n from the Nth measure restricted to 1, 2, 3, …, N by averaging every n length submeasure. Take a subsequential limit as n → ∞.
page 27 Martingale\ E(Xn | Xn-1, Xn-2, ...) = Xn-1 for every n > 1.
page 28 For a martingale, if T is a stopping time and all Xn are uniformly bounded, E(XT | X0) = X0. In general E(XT∧n | X0) = X0 for any n.
page 29 A bounded martingale converges\ Buy whenever the martingale goes below a and sell when it goes above b. First time below, first time above, second time below, etc. are all stopping times, so your expected gain is zero, and you can’t lose much, so you can’t get much.
page 30 Coupling\ A measure on the product space with the two measures as marginals.
page 30 Independent coupling, gluing two couplings together, coupling by induction.
page 32 The tailfield of a process is the intersection over all n of the n future of the process.
page 32 Trivial tail, Kolmogorov zero-one law.
page 33 Mean Hamming distance, dbar metric, variation metric.
page 35 Markov process: P(present | past) depends only on X-1.
page 36 Aperiodic communicating Markov, n, m large ⇒ |P(Xn = c | X0 = a) - P(Xm = c | X0 = b)| small\ Give one process a head start, couple independently until they meet, stay together.
page 40 P(present | past) depends on the n past … really a one step Markov process in disguise.
page 42 Essentially bounded\ lim(N → ∞) lim sup(n → ∞) [|fN(X1)| + |fN(X2)| + ... + |fN(Xn)|]/n = 0.
page 42 i·ai summable, P((i < |Xn| < i+1) | Xn-1, Xn-2, ..., X1) < ai ⇒ the Xi essentially bounded\ Dominate |fN(X1)|, |fN(X2)|, ..., |fN(Xn)| with a small independent process.
page 43 “Square lemma”: Xi,j random variables with stationary columns, Xj and Xi,i essentially bounded. Then lim 1/n (X1,1 + X2,2 + X3,3 + ... + Xn,n) = lim 1/n (X1 + X2 + X3 + ... + Xn)\ For n large, |Xn,n - Xn| is usually dominated by a small stationary process.
ENTROPY PAGE 44
The three shift is not a homomorphic image of the two shift\ If we use an 8 code, every distinct 200 length three shift word would come from a distinct 208 length two shift word, violating the pigeonhole principle. It doesn’t help to allow a small fraction of errors in the code.
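For concreteness, the count behind the pigeonhole step above:

\[
3^{200} = 2^{200\log_2 3} \approx 2^{317} \;>\; 2^{208},
\]

so there are not enough two shift words of length 208 to code all the three shift words of length 200.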
page 44 Entropy of the process is α if there are about 2^(αn) reasonable words of length n.
page 45 Isomorphic ergodic processes have the same entropy\ Same as the 2 shift 3 shift proof.
page 45 H(T) = entropy of T = sup H(P,T) = H(P,T) for any finite generator P.
page 45 Shannon Macmillan Breiman: the exponential decrease rate in the measure of b1b2b3...bn converges (essentially constant if ergodic)\ Compare -(1/n)log(P(b1b2b3...bn)) with -(1/n)log(P(b1 | b0b-1b-2b-3...)) - … - (1/n)log(P(bn | bn-1bn-2...)) using the square lemma. We have essential boundedness of everything because the probability that any of those terms is between i and i+1, given previous terms, is < the size of the alphabet times 2^(-i).
page 46 H(P) is H(P,T) where P is an independent generator for T, and this turns out to be -p1 log(p1) - p2 log(p2) - p3 log(p3) - ... - pn log(pn).
page 47 Conditioning property: put P1 into the first piece of P2. Get entropy H2 + q1 H1.
page 47 Join of two partitions (a way of superimposing them).
page 48 Independent joining maximizes entropy\ Two proofs. 1) Use the convexity of x log x to prove that a convex combination of partitions has at least the entropy of the convex combination of their entropies. 2) Use only that ½, ½ maximizes two set entropy, and the conditioning property. i) Show 1/2^n, 1/2^n, …, 1/2^n maximizes among 2^n set partitions. ii) True when one partition is 1/2^n, 1/2^n, …, 1/2^n. iii) True when one partition has dyadic rational probabilities.
page 50 H(P|Q1) ≤ H(P|Q) if Q1 is finer.
page 50 Entropy = limit of H(Xn | Xn-1, Xn-2, Xn-3, ..., X0) = lim (1/n) H(X0 V X1 V X2 ... V Xn) = lim H([A1, A2, A3, ..., An, complement of (A1 ∪ A2 ∪ A3 ... ∪ An)], T) when the A’s are a countable generator.
page 52 Finite entropy ⇒ finite generator\ Make a big tower distinguishing reasonable columns. Improve the approximation with a bigger tower, making columns distinct by altering a few rungs at the bottom of that big tower. Label the base of all your towers.
page 54 Entropy 0 iff past determines future.
page 54 Induce T on B. Induced entropy = H(T)/μ(B)\ The obvious countable generator on words of length n for the induced transformation gives words of length approximately n/μ(B) for T, but this does not give a valid proof because the generator is infinite. Make the proof valid with finite truncation approximations.
BERNOULLI SHIFTS PAGE 56
B = Bernoulli = isomorphic to an independent process. FB = factor of Bernoulli. EX = extremal = exponentially fat n word submeasures (off some small subset) are dbar close to the whole measure. VWB = very weak Bernoulli = conditioned and unconditioned futures dbar close. FD = finitely determined = a good distribution and entropy approximation to the process gives a good dbar approximation. IC = dbar limit of independent concatenations = stationary limit of picking first n, then next n, etc. in an i.i.d. manner.
page 59 Independent processes are extremal\ S dbar far away. Xi an independent process defined inductively so that Xk takes on 1, -1 with probability ½ apiece, but in such a way that it prejudices for 1 as much as possible on S. Far dbar implies the prejudice can be strong enough to force X1 + X2 + … to grow in expectation linearly on S, proving S to shrink exponentially.
page 64 Extremality is closed under dbar limits\ We wish to show that approximating S with extremal Sm tends to make S extremal. After dodging obnoxious sets, we extract from the dbar coupling of S and Sm a dbar coupling of a prechosen fat submeasure of S with a fat submeasure of Sm.
page 65 The startover process is IC\ n >> m >> 1/ε; couple 1 to n; n+m to 2n+m; 2n+2m to 3n+2m; ….
page 65 A coding of independent is IC\ Same proof.
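As a quick numerical instance of the entropy formula above (page 46): for an independent generator with probabilities (1/2, 1/4, 1/4),

\[
H(P) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5,
\]

so the corresponding independent process has about 2^(1.5n) reasonable words of length n.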
page 66 EX ⇒ ergodic\ Else there are a < b and a cylinder C such that the set where the frequency of C is bigger than b and the set where it is less than a are both fat and dbar far apart.
page 68 Most conditioned futures are nearly exponentially fat submeasures of the unconditioned future\ The expectation of the conditioned entropy is the unconditioned entropy, but the conditioned entropy cannot be very big because typical names from the conditioned process come from reasonable names of the unconditioned process. Therefore it is unlikely to be very small.
page 70 B ⇒ FB ⇒ IC ⇒ EX ⇒ VWB ⇒ FD\ FB ⇒ IC: FB dbar close to a coding of independent. IC ⇒ EX: An independent concatenation looks like independent. EX ⇒ VWB: Conditioned future a fat subset of the unconditional future (most conditioned futures dodge the small bad set). EX ⇒ FD: Process and approximating process; conditioned n futures are usually fat subsets of the same unconditional. Couple inductively. VWB ⇒ IC: Couple your VWB with its finite word measures repeated to form an independent concatenation. FD ⇒ IC: Make the process “start over” rarely to get an approximating IC.
page 71 They are closed under factors\ Use IC. A finite code of ( )( )( )… is close to ( )( )( )….
page 71 They have positive entropy\ IC contradicts past determines future.
page 72 (i) The n word dbar between processes approaches a limit d. (ii) There is a coupling with limiting error frequency whose integral is d\ For (i), dbar(…Xkn, …Ykn) ≥ dbar(…Xk, …Yk), and #(Xi ≠ Yi : i < k+n) - #(Xi ≠ Yi : i < n) ≤ k. For (ii), Monkey, Birkhoff, bounded convergence.
page 73 Ergodic coupling of ergodic processes\ Just take an ergodic component.
page 73 Almost ergodic coupling of IC with its approximation.
page 73 The FD definition manifests in an infinite coupling.
page 74 Two conditioned VWB futures can be infinitely coupled in dbar\ Ki rapidly increasing, Ni >> Ki+1. Couple two futures to times K1, 2K1, …, N1K1, N1K1 + K2, N1K1 + 2K2, …, N1K1 + N2K2, N1K1 + N2K2 + K3, … successively, inductively, as well as possible. Focusing on just the times between N1K1 + N2K2 + … + NgKg and N1K1 + N2K2 + … + Ng+1Kg+1, the frequency of times that either past is bad usually gets and stays small by Birkhoff, and the frequency of times that the pasts are good but the coupled pair is bad gets and stays small because each time you have another big independent chance to get a good match.
page 80 WB means that making the past independent of the future has a small effect in the variation metric if you restrict yourself to times -n, -n+1, …, -m, m, …, n, m large.
page 83 Conditioned and unconditioned WB futures can be coupled to eventually agree\ Couple times -n, -n+1, …, -m, m, …, n with the almost identity coupling. Shift to m-n, m-n+1, …, 0, 2m, …, m+n. Fudge the coupling to get the identity on m-n, m-n+1, …, 0. Take the limit n → ∞ to get all times …-2, -1, 0, 2m, 2m+1, …. For k < m there is a way to take one of these couplings on times …-2, -1, 0, 2m, 2m+1, … and another on …-2, -1, 0, 2k, 2k+1, … and get another on …-2, -1, 0, 2k, 2k+1, … so that if the former is the identity except for measure ε and the second is the identity except for measure δ, the third will be an extension of the first and the identity except for measure ε + δ. Repeating this type of manipulation and taking a limit, we end up with a coupling on all times: identity on the past and eventually agreeing on the future.
ORNSTEIN ISOMORPHISM THEOREM PAGE 86
page 85 Columns in a large Rohlin tower tend to have the right m distribution\ Otherwise Birkhoff fails for M words, M >> m.
page 85 Copy distribution: Rohlin towers on both spaces, copy from one to the other.
page 89 Coding: Given a homomorphism, the inverse image of the 0 coordinate partition is approximately a cylinder.
page 91 Land of P: the σ-algebra generated by translates of the partition P.
page 93 Q finer H(P) – ½(H(P V T(P)) < H(Q) - ½ H(Q V T(Q)) / Add H(Q V T(Q)) - H(P V T(Q)) < H(Q) - H(P) with H(P V T(Q)) - H(P V T(P)) < H(T(Q)) - H(T(P)) = H(Q) - H(P). page 94 Q finer entropy drop from time n onward greater. Apply above replacing P with n name partition and T with Tn. page 94 Columns of Rohlin tower are reasonable even when base not independent\ B base. Shannon Macmillan good on most Ti(B) for some small positive i and small 117 absolute value negative i. page 96 Copy distribution and entropy: Exists P’ so that P’,S looks like P,T in distribution and entropy if H(S) > H(P,T)\Copy P on a Rohlin tower to get a P’ on another Rohlin tower so that the columns of a generator for S are inside the columns of your copy. page 100 dbar: dbar P,T and Q,S small, H(T) > H(P,T) +O(dbar) implies copy Q’ of Q in entropy and distribution, |P-Q’| = O(dbar) \ P’’ V Q’’ on R’’ manifests dbar match. Copy of Q’’ to form Q’ with generator columns inside, but only insist that P V Q’ look like P’’ V Q’’ where there are not too many Q’’ columns per P’’ column so that H(T) does not have to be too big. page 103 P,T a perfect copy of q,S. f:Q,S q,S both a homomorphism and a mean Hamming coupling, distance d. H(T) > H(Q,S). entropy distribution copy Q’ of Q with |P-Q’| < d page 105 Pi)approaching (T,P) strictly increasingly as factors, in entropy and in dbar./ Pick each Pi to be P on the bottom of a Pi measurable Rohlin tower after picking P0 in such a way as to insure that the Rohlin tower is also P0 measurable. page 106 FD full entropy factor\ Pi approaching target as above. If Pi,0 is a perfect copy of Pi , move it slightly to get approximate copy Pi of Pi+1. move again and again to get Pi , Pi …approaching Pi+1,0 a perfect copy of Pi+1. 1 2 3 page 108 FD not too much entropy, close ent. dist. approximate copy can be moved slightly to get perfection.close factor./ Pi above. first move slightly to get close to Pi for large i. Proceed as above. page 109 two FD’s same entropy isomorphic\ Let P,T be a factor of Q,T of full entropy which generates Q,T pretty well. You would like to move P slightly to get a perfect copy P’,T of P,T which generates Q,T even better. Make Q’ factor of P,T: (Q’,T) V P,T looks like (Q,T) V P,T. Make (P’ V Q,T) look like (P V Q’,T). P’ is close to P because T,Q codes them about the same. (P’,T) captures (Q,T) well because it copies the way P,T captures Q’,T. page 109 Two independent’s same entropy isomorphic. page 109 FDFBB. 118 Index of definitions by page HOW TO USE THIS INDEX: Example: Look at the third row. It says that Definition 3 is a definition of measure preserving and that it is on page 8. 001 002 003 004 005 006 009 009 015 023 023 023 023 024 024 029 030 035 035 038 039 040 041 052 054 055 057 058 061 063 065 068 070 073 078 080 082 083 084 transformation measurable measure preserving Measurable and measure preserving interval space, with atoms , without atoms Lebesgue space stationary process alphabet ergodic measurable measure preserving homomorphism isomorphism factor isomorphic P,T process P generates T, P is a generator for T. Rohlin tower of height N error set rung of the tower, first rung of the tower, base of the tower. column, or P,T column a rung of that column n name invariant -algebra Monkey method E(f|X,Y,Z). 
conditional expectation of f with respect to s version for E(f|s) conditional expectation of with respect to s subsequential limit of the i a stationary process obtained by the monkey method Martingale stopping time a^b past coupling -algebra generated by X the n future the tailfield 009 009 009 009 009 009 010 010 012 014 014 014 014 014 014 016 016 017 017 018 018 018 018 022 023 024 024 024 025 026 026 026 027 028 036 030 032 032 032 119 085 087 089 091 093 094 095 096 100 101 101.5 102 103 104 105 109 114 117 118 119 122 127 128 131 133 133 134 135 137 142 145 145 147 150 151 155 159 164 166 172 176 180 181 183 same tail trivial tail mean Hamming distance dbar distance variation distance The variation distance between p1, p2,… pn and q1, q2,… qn variation distance variation distance Possibly Non-stationary Markov Chain transition probabilities there is a path from a to b a and b communicate a is transient (a) L(a) Markov chain periodicity of the process mixing Markov process n step Markov process Y is called the n step Markov process corresponding to X periodicity of an n step Markov process fN essentially bounded n shift small exponential number of words, exponential number of words exponential size of a word entropy of the process reasonable names entropy of a transformation entropy of P put P1 into the first piece of P2 Conditioning property join of two partitions H(P) H(T) Convex combination of partitions p1 P1+ p2 P2+.. pm Pm the entropy of P over Q H H(P,T) The entropy of a transformation coded factor, code the induced transformation of T on Bernoulli trivial transformation submeasure 032 032 033 033 033 034 034 034 035 035 035 035 035 036 036 037 039 039 039 039 040 042 042 044 044 044 044 044 045 046 047 047 047 047 048 049 050 051 051 054 054 056 056 056 120 186 187 189 190 191 192 194 195 197 208 212 213 231 234 235 239 241 246 249 250 257 258 260 261 262 265 270 271 273 289 291 298 302 size exponentially fat submeasure extremal Very Weak Bernoulli finitely determined independent concatenation dbar limit of independent concatenations B, FB Completely extremal P( , S1) startover process of …X-2, X-1, X0, X1, X2…. close in distribution weak Bernoulli WB [ ], greatest integer Parity diagonal eventually agree we are copying P,T to R copy P,T to get Q so that Q,S looks like P,T we are copying P,B,T to get Q,C,S. or we are copying B to get C so that Q,C,S looks like P,B,T. define P intersect Q .The symmetric difference between P and Q distance between P and Q, |P-Q| P,T codes Q,S well by time n X process codes the Y process well by time n code P,T codes Q well by time n P,T codes Q,S well by time n in about the same way that P’,T’ codes Q’,S’ the land of P copies of - algebras, partitions etceteras in the land of P P,T,n entropy drop a picture of the coupling between X1, X2… Xn and Y1, Y2… Yn Rohlin R’’ of height n endowed with partitions P’’ corresponding to P, and Q’’ corresponding to Q, is a picture of the coupling The coupling corresponding to f An(a0 a1 a2…), abbreviated An 057 057 057 057 057 058 058 058 058 063 065 065 080 080 080 082 083 086 087 087 088 089 089 089 090 090 091 091 093 098 099 103 104 121 Index of definitions alphabetized HOW TO USE THIS INDEX: example: Look at the first row. It says that definition 009 is the definition of alphabet and it occurs on page 10. SOME ITEMS ARE LISTED MORE THAN ONCE: Example: Definition 89 of the mean Hamming distance on page 33 is listed under mhd and under ham. 
alp alp an an b b bas bas ber ber ce ce ce ce ce ce cf cf clo clo cod cod 009 009 302 302 195 195 038 038 180 180 057 061 197 057 061 197 172 172 213 213 172 260 cod cod cod 261 262 265 cod cod 172 260 cod cod cod 261 262 265 col col com com 039 039 102 197 alphabet alphabet An(a0 a1 a2…), abbreviated An An(a0 a1 a2…), abbreviated An B, FB B, FB rung of the tower, first rung of the tower, base of the tower. rung of the tower, first rung of the tower, base of the tower. Bernoulli Bernoulli conditional expectation of f with respect to s conditional expectation of with respect to s Completely extremal conditional expectation of f with respect to s conditional expectation of with respect to s Completely extremal coded factor, code coded factor, code close in distribution close in distribution coded factor, code P,T codes Q,S well by time n X process codes the Y process well by time n code P,T codes Q well by time n P,T codes Q,S well by time n in about the same way that P’,T’ codes Q’,S’ coded factor, code P,T codes Q,S well by time n X process codes the Y process well by time n code P,T codes Q well by time n P,T codes Q,S well by time n in about the same way that P’,T’ codes Q’,S’ column, or P,T column column, or P,T column a and b communicate Completely extremal 010 010 104 104 058 058 018 018 056 056 024 025 058 024 025 058 054 054 065 065 054 089 089 090 090 054 089 089 090 090 018 018 035 058 122 com com con con con con con con con con con con cop cop cop 102 197 055 057 061 145 151 055 057 061 145 151 246 249 250 cop cop cop cop 271 246 249 250 cou cou cou cp dba dia dis ea eb efs efx en ent ent ent ent ent ent eps erg err es es 080 289 298 145 091 239 258 241 128 187 055 133 134 137 142 155 166 273 212 015 035 035 133 a and b communicate Completely extremal E(f|X,Y,Z). conditional expectation of f with respect to s conditional expectation of with respect to s Conditioning property Convex combination of partitions p1 P1+ p2 P2+.. pm Pm E(f|X,Y,Z). conditional expectation of f with respect to s conditional expectation of with respect to s Conditioning property Convex combination of partitions p1 P1+ p2 P2+.. pm Pm we are copying P,T to R copy P,T to get Q so that Q,S looks like P,T we are copying P,B,T to get Q,C,S. or we are copying B to get C so that Q,C,S looks like P,B,T. copies of - algebras, partitions etceteras in the land of P we are copying P,T to R copy P,T to get Q so that Q,S looks like P,T we are copying P,B,T to get Q,C,S. or we are copying B to get C so that Q,C,S looks like P,B,T. coupling a picture of the coupling between X1, X2… Xn and Y1, Y2… Yn The coupling corresponding to f Conditioning property dbar distance diagonal distance between P and Q, |P-Q| eventually agree essentially bounded exponentially fat submeasure E(f|X,Y,Z). small exponential number of words, exponential number of words entropy of the process entropy of a transformation entropy of P the entropy of P over Q The entropy of a transformation P,T,n entropy drop startover process of …X-2, X-1, X0, X1, X2…. 
ergodic error set error set exponential size of a word 035 058 024 024 025 047 048 024 024 025 047 048 086 087 087 091 086 087 087 030 098 103 047 033 082 089 083 042 057 024 044 044 045 046 049 051 093 065 012 017 017 044 123 ess eve exp exp exp ext ext fac fb fd fir fn fut gen gen gi h ham hom hp hpt ht ic ic ind int int int inv is iso iso it joi l lan lan leb lop lop ls mar mar 128 241 133 133 187 189 197 024 195 191 145 127 083 030 082 234 159 089 023 150 164 150 192 194 176 005 234 257 052 005 023 024 176 147 105 270 271 006 270 271 006 068 100 essentially bounded eventually agree small exponential number of words, exponential number of words exponential size of a word exponentially fat submeasure extremal Completely extremal factor B, FB finitely determined put P1 into the first piece of P2 fN the n future P generates T, P is a generator for T. -algebra generated by X [ ], greatest integer H mean Hamming distance homomorphism H(P) H(T) H(P,T) H(P) H(T) independent concatenation dbar limit of independent concatenations the induced transformation of T on interval space, with atoms , without atoms [ ], greatest integer define P intersect Q .The symmetric difference between P and Q invariant -algebra interval space, with atoms , without atoms isomorphism isomorphic the induced transformation of T on join of two partitions L(a) the land of P copies of - algebras, partitions etceteras in the land of P Lebesgue space the land of P copies of - algebras, partitions etceteras in the land of P Lebesgue space Martingale Possibly Non-stationary Markov Chain 042 083 044 044 057 057 058 014 058 057 047 042 032 016 032 080 050 033 014 047 051 047 058 058 054 009 080 088 022 009 014 014 054 047 036 091 091 009 091 091 009 026 035 124 mar mar mar mar mc mc mea mea mea mea mhd min mm mon mon mp mp mp mp mp mp nan nfu nmp nn nsh P par pas pat per per pic pnc poc pro ptn ptp put rea roh roh run 109 117 118 119 100 109 002 003 004 023 089 073 117 054 065 003 004 023 117 118 119 041 083 119 041 131 208 235 078 101.5 114 122 289 100 289 029 273 029 145 135 035 291 038 Markov chain mixing Markov process n step Markov process Y is called the n step Markov process corresponding to X Possibly Non-stationary Markov Chain Markov chain measurable measure preserving Measurable and measure preserving measurable mean Hamming distance a^b mixing Markov process Monkey method a stationary process obtained by the monkey method measure preserving Measurable and measure preserving measure preserving mixing Markov process n step Markov process Y is called the n step Markov process corresponding to X n name the n future Y is called the n step Markov process corresponding to X n name n shift P( , S1) Parity past there is a path from a to b periodicity of the process periodicity of an n step Markov process a picture of the coupling between X1, X2… Xn and Y1, Y2… Yn Possibly Non-stationary Markov Chain a picture of the coupling between X1, X2… Xn and Y1, Y2… Yn P,T process P,T,n entropy drop P,T process put P1 into the first piece of P2 reasonable names Rohlin tower of height N Rohlin R’’ of height n endowed with partitions P’’ corresponding to P, and Q’’ corresponding to Q, is a picture of the coupling rung of the tower, first rung of the tower, base of the tower. 
037 039 039 039 035 037 009 009 009 014 033 028 039 023 026 009 009 014 039 039 039 018 032 039 018 044 063 080 036 035 039 040 098 035 098 016 093 016 047 044 017 099 018 125 run sa sd shi siz sp 040 082 257 131 186 009 spm 065 st 085 sta 009 sta 212 sto 070 sto 070 sub 063 sub 183 sym 257 tai 084 tai 085 tai 087 th 104 tp 101 tra 001 tra 101 tra 103 tri 087 tri 181 tt 087 tt 181 var 093 var 094 var 095 var 096 vd 093 vd 094 vd 095 vd 096 ver 058 vwb 190 wb 231 a rung of that column -algebra generated by X define P intersect Q .The symmetric difference between P and Q n shift size stationary process a stationary process obtained by the monkey method same tail stationary process startover process of …X-2, X-1, X0, X1, X2…. stopping time stopping time subsequential limit of the i submeasure define P intersect Q .The symmetric difference between P and Q the tailfield same tail trivial tail (a) transition probabilities transformation transition probabilities a is transient trivial tail trivial transformation trivial tail trivial transformation variation distance The variation distance between p1, p2,… pn and q1, q2,… qn variation distance variation distance variation distance The variation distance between p1, p2,… pn and q1, q2,… qn variation distance variation distance version for E(f|s) Very Weak Bernoulli weak Bernoulli WB 018 032 088 044 057 010 026 032 010 065 027 027 026 056 088 032 032 032 036 035 009 035 035 032 056 032 056 033 034 034 034 033 034 034 034 024 057 080