Free probability: Basic concepts, tools, applications, and relations to other fields

Øyvind Ryan

February 21, 2008

Abstract

In (at least) two talks, I will formally define the concepts of free probability theory, state its main theorems, and present some of the useful tools it provides which may be applicable to other fields. Free probability, its connection with random matrix theory, and some applications were presented in my talk at the CMA seminar on the 13th of December. These talks go deeper, in that all concepts from that talk are formally defined and the theorems proved, while many new facets of the theory are added.

Contents

1 Introduction to free probability and the concept of freeness
2 Free convolution, its analytical machinery, and its combinatorial facet
3 Connection to random matrix theory
4 Results from classical probability which have their analogue in free probability
5 The free central limit theorem
6 An important result from free probability, applicable in many situations
7 Applications to wireless communication
  7.1 Channel capacity estimation using free probability theory [6]
  7.2 Estimation of power and the number of users in CDMA systems [7]
8 Applications to portfolio optimization
  8.1 Interpretation of eigenvalues and eigenvectors
  8.2 Markowitz portfolio optimization
  8.3 Cleaning of correlation matrices
  8.4 Other ways of forming an empirical matrix
  8.5 Dynamics of the top eigenvalue and eigenvector
  8.6 Relation between the correlation matrix and the empirical correlation matrix

1 Introduction to free probability and the concept of freeness

A useful view of classical probability is in terms of the pair $(C(\Omega), E)$, where $C(\Omega)$ is the space of (real-valued) functions (i.e. the space of random variables) on the probability space $(\Omega, \mu)$, and $E$ is the linear functional $E(f) = \int f(x)\,d\mu(x)$ on $C(\Omega)$ (i.e. the expectation). In this classical setting, all random variables commute with respect to multiplication. Is it possible to find a useful theory where the space $C(\Omega)$ is replaced by an algebra (in particular a matrix algebra), where the random variables do not necessarily commute? And what does an expectation look like in such a theory? By a useful theory, we mean a theory where a concept analogous to independence exists, so that classical results also hold within the new setting (with random variables, expectation, and independence replaced by the new concepts). The candidate for our new theory is the following:

Definition 1. By a (noncommutative) probability space we mean a pair $(A, \phi)$, where $A$ is a unital $*$-algebra, and where $\phi$ (the expectation) is a unital linear functional on $A$. The elements of $A$ are called random variables. A family of unital $*$-subalgebras $(A_i)_{i \in I}$ is called a free family if

\[
\left.
\begin{array}{c}
a_j \in A_{i_j} \\
i_1 \neq i_2,\ i_2 \neq i_3,\ \ldots,\ i_{n-1} \neq i_n \\
\phi(a_1) = \phi(a_2) = \cdots = \phi(a_n) = 0
\end{array}
\right\}
\;\Rightarrow\; \phi(a_1 \cdots a_n) = 0.
\tag{1.1}
\]

A family of random variables $a_i$ is called a free family if the algebras they generate form a free family.

The algebra $A$ will mostly be some subalgebra of the $n \times n$ matrices $M_n(\mathbb{C})$ (for instance the unitary matrices $U(n)$, or the diagonal matrices), or some subalgebra of $n \times n$ random matrices.
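To see how the freeness condition (1.1) is used in computations, consider two free random variables $a$ and $b$, and write $a^\circ = a - \phi(a)I$, $b^\circ = b - \phi(b)I$, so that $\phi(a^\circ) = \phi(b^\circ) = 0$. Definition 1 gives $\phi(a^\circ b^\circ) = 0$, and expanding this determines the mixed moment:

\[
0 = \phi(a^\circ b^\circ) = \phi(ab) - \phi(a)\phi(b) - \phi(b)\phi(a) + \phi(a)\phi(b) = \phi(ab) - \phi(a)\phi(b),
\]

so $\phi(ab) = \phi(a)\phi(b)$, just as for classically independent variables. Higher mixed moments are computed in the same way, but, as the remark below shows, they do not always agree with the classical ones.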
The freeness relation will typically not be found for small matrices, but, as we will see, it is very useful for describing relationships between the spectra of random matrices when the matrices get large.

Remark: Freeness is actually quite different from classical independence. We have that $E(a_1a_2a_1a_2) = E(a_1^2)E(a_2^2)$ when $a_1$ and $a_2$ are independent, but

\begin{align*}
\phi(a_1a_2a_1a_2)
&= \phi\big((a_1^\circ + \phi(a_1)I)(a_2^\circ + \phi(a_2)I)(a_1^\circ + \phi(a_1)I)(a_2^\circ + \phi(a_2)I)\big) \\
&= \underbrace{\phi(a_1^\circ a_2^\circ a_1^\circ a_2^\circ) + \cdots}_{0}
   + \phi(a_1)\phi(a_2)\phi(a_1)\phi(a_2)
   + \phi\big((a_1^\circ)^2\big)\phi(a_2)^2 + \phi\big((a_2^\circ)^2\big)\phi(a_1)^2 \\
&= \phi(a_1)^2\phi(a_2)^2
   + \big(\phi(a_1^2) - \phi(a_1)^2\big)\phi(a_2)^2
   + \phi(a_1)^2\big(\phi(a_2^2) - \phi(a_2)^2\big) \\
&= \phi(a_1^2)\phi(a_2)^2 + \phi(a_2^2)\phi(a_1)^2 - \phi(a_1)^2\phi(a_2)^2
 \;\neq\; \phi(a_1^2)\phi(a_2^2)
\end{align*}

when $a_1$ and $a_2$ are free. Here we have used the shorthand notation $a^\circ = a - \phi(a)I$. The order of the random variables is highly relevant in free probability.

2 Free convolution, its analytical machinery, and its combinatorial facet

The definition of freeness is seen to give many combinatorial challenges in terms of computation. We introduce here the basic combinatorial machinery needed to prove some useful results from the definition of freeness. We denote by $\mathcal{P}(n)$ the set of partitions of $\{1, \ldots, n\}$. In classical probability, any compactly supported probability measure can be associated with its (classical) cumulants $c_n$:

Definition 2. The classical cumulants $c_n$ of the random variable $x$ are defined through the recursive relation

\[
E(x^n) = \sum_{\pi \in \mathcal{P}(n)} c_\pi,
\tag{2.1}
\]

where $c_\pi = \prod_{i=1}^{|\pi|} c_{|\pi_i|}$ when $\pi = \{\pi_1, \ldots, \pi_{|\pi|}\}$. We also denote the cumulants of $x$ by $c_n[x]$.

The nice property of cumulants in the classical case is the following:

Theorem 1. If $x$ and $y$ are independent, then $c_n[x+y] = c_n[x] + c_n[y]$ for all $n$.

One can show that

\[
\log \mathcal{F}(t) = \sum_{n=1}^{\infty} \frac{(-it)^n}{n!}\, c_n,
\]

where $\mathcal{F}(t) = \int_{\mathbb{R}} e^{-itx}\,d\nu(x)$ is the Fourier transform of the probability measure $\nu$. In other words, the classical cumulants are the coefficients in the power series expansion of the logarithm of the Fourier transform. This also proves theorem 1, since $\log\mathcal{F}$ is known to have the same linearizing property for the sum of independent random variables.

In free probability, a similar definition of cumulants can be made so that the mentioned additivity property is maintained (with independence replaced by freeness). One simply replaces the set of all partitions in (2.1) with the set of noncrossing partitions:

Definition 3. A partition is said to be noncrossing if, whenever $i < j < k < l$ with $i$ and $k$ in the same block and $j$ and $l$ in the same block, all of $i, j, k, l$ are in the same block. The set of noncrossing partitions of $\{1, \ldots, n\}$ is denoted by $NC(n)$.

In the following, we will also say that a partition (noncrossing or not) is a pairing if all of its blocks have exactly two elements. The following provides the replacement of the classical cumulants and of $\log\mathcal{F}$ in the setting of free random variables:

Definition 4. The free cumulants $\alpha_n$ of the random variable $x$ are defined through the recursive relation

\[
\phi(x^n) = \sum_{\pi \in NC(n)} \alpha_\pi,
\tag{2.2}
\]

where $\alpha_\pi = \prod_{i=1}^{|\pi|} \alpha_{|\pi_i|}$ when $\pi = \{\pi_1, \ldots, \pi_{|\pi|}\}$. We also denote the cumulants of $x$ by $\alpha_n[x]$. The power series

\[
R_x(z) = \sum_{i=1}^{\infty} \alpha_i z^i
\]

is called the R-transform (of the distribution of $x$).
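To make the moment–cumulant formula (2.2) concrete, here are its first few instances, writing $m_n = \phi(x^n)$. The only difference from the classical formula (2.1) at these orders is that the crossing pairing $\{\{1,3\},\{2,4\}\}$ is excluded for $n = 4$, so the classical term $3c_2^2$ is replaced by $2\alpha_2^2$:

\begin{align*}
m_1 &= \alpha_1, \\
m_2 &= \alpha_2 + \alpha_1^2, \\
m_3 &= \alpha_3 + 3\alpha_1\alpha_2 + \alpha_1^3, \\
m_4 &= \alpha_4 + 4\alpha_1\alpha_3 + 2\alpha_2^2 + 6\alpha_1^2\alpha_2 + \alpha_1^4.
\end{align*}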
We will also need the more general definition of mixed cumulants:

Definition 5. The mixed (free) cumulants $\alpha[x_1, \ldots, x_k]$ for a sequence of random variables are defined through the recursive relation

\[
\phi(x_{i_1} \cdots x_{i_n}) = \sum_{\pi \in NC(n)} \alpha_\pi,
\tag{2.3}
\]

where $\alpha_\pi = \prod_{i=1}^{|\pi|} \alpha[x_{\pi_{i1}}, \ldots, x_{\pi_{i|\pi_i|}}]$ when $\pi = \{\pi_1, \ldots, \pi_{|\pi|}\}$ and $\pi_i = \{\pi_{i1}, \ldots, \pi_{i|\pi_i|}\}$.

One can show that the functional $(x_1, \ldots, x_n) \to \alpha[x_1, \ldots, x_n]$ (also called the cumulant functional) is linear in all variables. The formulas (2.1) and (2.2) are also called moment–cumulant formulas. Analogous to theorem 1, we have the following result:

Theorem 2. If $a$ and $b$ are free, then $\alpha_n[a+b] = \alpha_n[a] + \alpha_n[b]$ for all $n$. In other words, $R_{a+b}(z) = R_a(z) + R_b(z)$ (i.e. the R-transform takes the role of $\log\mathcal{F}$ in the free setting).

This theorem has a nice interpretation in terms of probability measures. We can associate with the moments of $a$ and $b$ two probability measures $\mu_a$ and $\mu_b$ (at least in the case of compactly supported probability measures). Since the definition of freeness really gives us a rule for computing the moments of $a + b$ when they are free (and thereby a new probability measure), we can alternatively view addition of free random variables as a binary operation on probability measures. This operation is denoted $\boxplus$ and is called (additive) free convolution. Using this notation, theorem 2 can be rewritten as $R_{\mu_a \boxplus \mu_b}(z) = R_{\mu_a}(z) + R_{\mu_b}(z)$, which corresponds to the relationship between (classical) convolution of probability measures and the logarithm of the Fourier transform.

To prove theorem 2, we need the following result:

Theorem 3. Assume that $a$ and $b$ are free. Then the mixed cumulant $\alpha[x_1, \ldots, x_n]$ is zero whenever at least one $x_i = a$ and at least one $x_i = b$ (with all $x_i$ taking their values in $\{a, b\}$). The converse also holds: whenever all such mixed cumulants of $a$ and $b$ vanish, $a$ and $b$ must be free.

We will not give the proof in its entirety (since more background on the combinatorics of partitions is needed), but only sketch how it goes:

1. The first step is noting that we may just as well assume that $\phi(a) = \phi(b) = 0$, due to the linearity of the cumulant functional and since $\alpha[x_1, \ldots, x_n] = 0$ whenever one of the $x_i$ is a scalar (this last part requires a proof of its own, which goes by induction).

2. If $x_1 \neq x_2 \neq \cdots \neq x_n$, then the definition of freeness gives us (with a little extra argument) that $\alpha[x_1, \ldots, x_n] = 0$.

3. If not $x_1 \neq x_2 \neq \cdots \neq x_n$, we can group neighbouring $a$'s and $b$'s together (at least once), so that we obtain an element $y_1 \cdots y_m$ with $m < n$, where each $y_i$ is a power of $a$ or a power of $b$. The proof proceeds by analyzing $y_1 \cdots y_m$ instead of the longer product $x_1 \cdots x_n$, and using induction (the details are quite involved, however).

It is easy to prove theorem 2 once theorem 3 is proved: (2.3) says that

\begin{align*}
\phi\big((a+b)^n\big)
&= \sum_{\pi \in NC(n)} \alpha_\pi[a+b]
 = \sum_{\pi \in NC(n)} \prod_{i=1}^{|\pi|} \alpha[\underbrace{a+b, \ldots, a+b}_{|\pi_i| \text{ times}}] \\
&= \sum_{\pi \in NC(n)} \left( \prod_{i=1}^{|\pi|} \alpha[\underbrace{a, \ldots, a}_{|\pi_i| \text{ times}}]
 + \prod_{i=1}^{|\pi|} \alpha[\underbrace{b, \ldots, b}_{|\pi_i| \text{ times}}] \right)
 = \sum_{\pi \in NC(n)} \big(\alpha_\pi[a] + \alpha_\pi[b]\big),
\end{align*}

where we have used the vanishing of mixed cumulants (theorem 3) between the second and third equality. This proves that $\alpha_n[a+b] = \alpha_n[a] + \alpha_n[b]$, so that $R_{a+b}(z) = R_a(z) + R_b(z)$, which is the content of theorem 2.

One distribution will have a very special role as the replacement for the Gaussian law in free probability theory:

Definition 6. A random variable $a$ is called (standard) semicircular if its R-transform is of the form $R_a(z) = z^2$.

In section 5 it will be explained why such random variables are called semicircular.
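As a small illustration of theorem 2, consider two free standard semicircular elements $a$ and $b$. Their R-transforms add:

\[
R_{a+b}(z) = R_a(z) + R_b(z) = 2z^2,
\]

so $a+b$ has second cumulant 2 and all other cumulants 0. Since $\alpha_n[tx] = t^n\alpha_n[x]$ (as follows directly from (2.2)), the normalized sum $(a+b)/\sqrt{2}$ is again standard semicircular. This stability under free addition is what makes the semicircular distribution play the role of the Gaussian in the free central limit theorem of section 5.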
It remains to explain how the probability density can be recovered from the moments and the cumulants. For this, a connection with the Cauchy transform can be used:

Definition 7. The Cauchy transform of a probability measure $\mu$ is defined by

\[
G_\mu(z) = \int_{\mathbb{R}} \frac{d\mu(t)}{z - t}.
\]

The probability measure $\mu$ can be recovered from its Cauchy transform $G_\mu$ via the Stieltjes inversion formula, which says that its density satisfies

\[
f_\mu(t) = -\frac{1}{\pi} \lim_{\epsilon \to 0^+} \Im\, G_\mu(t + i\epsilon)
\]

for all $t \in \mathbb{R}$. To see the connection between the Cauchy transform and the R-transform, the following result can be used (we omit the proof, as it requires more background on combinatorics):

\[
G_\mu\!\left[ \frac{1}{z}\big(1 + R_\mu(z)\big) \right] = z.
\tag{2.4}
\]

In other words, the Cauchy transform can be found from the R-transform as the inverse function of $\frac{1}{z}\big(1 + R_\mu(z)\big)$. Once the Cauchy transform has been found, we can recover the density using the Stieltjes inversion formula. This will be exemplified for the semicircular and the free Poisson distributions (to be defined) in later sections.

There also exists a transform which does for multiplication of free random variables what the R-transform does for addition of free random variables. This transform is called the S-transform, and it has the property $S_{\mu_a \boxtimes \mu_b}(z) = S_{\mu_a}(z)\,S_{\mu_b}(z)$, where $\boxtimes$ (multiplicative free convolution) is defined just as $\boxplus$, but with addition replaced by multiplication. We will not go into details on this, but only remark that the same type of analytical machinery can be used to find the density of $\mu_a \boxtimes \mu_b$ as was used above for $\mu_a \boxplus \mu_b$. To be more precise, one can show the relationship

\[
S(z) = \frac{1}{z}\, R^{-1}(z),
\]

where $R^{-1}$ denotes the inverse of $R$ under composition. This enables one to compute the R-transform, from which we have already seen how to compute the Cauchy transform, and thus the density.

3 Connection to random matrix theory

We first need some more terminology:

Definition 8. A sequence of random variables $a_{n1}, a_{n2}, \ldots$ in probability spaces $(A_n, \phi_n)$ is said to converge in distribution if, for any $m_1, \ldots, m_r \in \mathbb{N}$ and $k_1, \ldots, k_r \in \{1, 2, \ldots\}$, the limit $\lim_{n\to\infty} \phi_n(a_{nk_1}^{m_1} \cdots a_{nk_r}^{m_r})$ exists. If in addition

\[
\lim_{n\to\infty} \phi_n\big(a_{nk_1}^{m_1} \cdots a_{nk_r}^{m_r}\big) = \phi\big(A_{k_1}^{m_1} \cdots A_{k_r}^{m_r}\big)
\]

for any $m_1, \ldots, m_r \in \mathbb{N}$ and $k_1, \ldots, k_r \in \{1, 2, \ldots\}$, with $A_1, A_2, \ldots$ free in some probability space $(A, \phi)$, then we say that $a_{n1}, a_{n2}, \ldots$ are asymptotically free.

We will prove that independent Gaussian matrices are asymptotically free as the matrices grow in size. To be more precise, we consider $n \times n$ matrices $A_n = \frac{1}{\sqrt{n}} a_n = \frac{1}{\sqrt{n}} (a_n(i,j))_{1 \le i,j \le n}$, where

1. the entries $a_n(i,j)$, $1 \le i \le j \le n$, form a set of $\frac{1}{2}n(n+1)$ independent, complex-valued random variables,

2. $a(k,k)$ is real-valued with standard Gaussian distribution (i.e. mean 0, variance 1),

3. when $i < j$, the real and imaginary parts $\Re(a(i,j))$ and $\Im(a(i,j))$ are independent and identically Gaussian distributed with mean 0 and variance $\frac{1}{2}$,

4. $a(i,j) = \overline{a(j,i)}$ (so that $A_n$ is selfadjoint).

The following holds:

Theorem 4. Let $S_p$ be the set of permutations of the $p$ elements $\{1, 2, \ldots, p\}$. For $\pi \in S_p$, let $\hat\pi$ be the order 2 permutation in $S_{2p}$ defined by

\[
\hat\pi(2j-1) = 2\pi^{-1}(j), \qquad \hat\pi(2j) = 2\pi(j) - 1, \qquad j \in \{1, 2, \ldots, p\},
\tag{3.1}
\]

let $\sim_{\hat\pi}$ denote the equivalence relation on $\{1, \ldots, 2p\}$ generated by

\[
j \sim_{\hat\pi} \hat\pi(j) + 1 \qquad \text{(addition formed mod } 2p\text{)},
\tag{3.2}
\]

and let $d(\hat\pi)$ denote the number of equivalence classes of $\sim_{\hat\pi}$. If $A_{n1}, A_{n2}, \ldots$ are Gaussian, independent matrices as described above, then

\[
E\big[\mathrm{tr}_n(A_{ni_1} \cdots A_{ni_{2p}})\big] = \sum_{\hat\pi \le \sigma} n^{d(\hat\pi) - p - 1},
\]

where the sum is over those $\hat\pi$ (coming from $\pi \in S_p$ as above) which satisfy $\hat\pi \le \sigma$, $\sigma$ being the partition with blocks $\sigma_j = \{k \mid i_k = j\}$.
Moreover, $d(\hat\pi) \le p + 1$, with equality if and only if $\hat\pi$ is noncrossing. Also, $d(\hat\pi) - p - 1$ is always an even number.

We will not show this in its entirety, only sketch some parts of the proof. The proof builds heavily on the fact that the entries are Gaussian. The following fact simplifies matters when the entries are Gaussian: if $f, g$ are real, standard, Gaussian, and independent, then

\[
E\big((f + ig)^m (f - ig)^n\big) = 0 \quad \text{unless } m = n.
\tag{3.3}
\]

One multiplies the matrices $A_{ni_1} \cdots A_{ni_{2p}}$ entry by entry, and keeps only certain terms using (3.3). One then considers all possible identifications of dependent (i.e. equal) entries from different matrices (this is what leads us to consider only $\hat\pi$ satisfying $\hat\pi \le \sigma$). These identifications give rise to the permutation $\hat\pi$ from the statement of the theorem. It turns out that, again due to the Gaussian property, a cancellation phenomenon occurs so that one only has to consider $\pi$ which are pairings. The term $n^{d(\hat\pi)-p-1}$ follows from another careful count of the terms for such pairings.

An important consequence of theorem 4 is the following:

Corollary 1. The $A_{n1}, A_{n2}, \ldots$ are asymptotically free as $n \to \infty$. Moreover, each $A_{ni}$ converges in distribution to a standard semicircular random variable $A_i$, and the convergence is almost everywhere, i.e.

\[
\lim_{n\to\infty} \mathrm{tr}_n\big(A_{ni}^{2p}\big) = \phi\big(A_i^{2p}\big)
\]

almost everywhere.

Proof: We see from theorem 4 that

\[
E\big[\mathrm{tr}_n(A_{ni_1} \cdots A_{ni_{2p}})\big]
= \sum_{\substack{\hat\pi \le \sigma \\ \hat\pi \text{ noncrossing}}} 1 \;+\; O\big(n^{-2}\big),
\tag{3.4}
\]

so that

\[
\lim_{n\to\infty} E\big[\mathrm{tr}_n(A_{ni_1} \cdots A_{ni_{2p}})\big] = \sum_{\substack{\hat\pi \le \sigma \\ \hat\pi \text{ noncrossing}}} 1.
\]

Denoting this limit by $\phi(A_{i_1} \cdots A_{i_{2p}})$, this says that $\alpha[A_{i_1}, A_{i_2}, \ldots] = 0$ whenever two different values $i_j$ occur (i.e. the mixed cumulants vanish). This means that $A_1, A_2, \ldots$ are free (theorem 3), hence the $A_{n1}, A_{n2}, \ldots$ are asymptotically free. Moreover, since

\[
\lim_{n\to\infty} E\big[\mathrm{tr}_n(A_{ni}^{2p})\big] = \sum_{\hat\pi \text{ noncrossing pairing}} 1,
\]

we see that $\alpha_2[A_i] = 1$, while all other $\alpha_j[A_i] = 0$, since we only sum over noncrossing pairings. But this implies that $A_i$ is a standard semicircular random variable.

Finally, the almost everywhere convergence follows from the fact that the deviation term in (3.4) is $O(n^{-2})$. We will not present a complete proof of this, but remark that it follows from the Borel–Cantelli lemma once we have shown that

\[
\sum_{n=1}^{\infty} P\Big( \big|\mathrm{tr}_n\big(A_{ni_1} \cdots A_{ni_{2p}}\big) - E\big[\mathrm{tr}_n\big(A_{ni_1} \cdots A_{ni_{2p}}\big)\big]\big| > \epsilon \Big) < \infty.
\]

This in turn follows from the Chebyshev inequality if we show that

\[
\sum_{n=1}^{\infty} E\Big[ \big|\mathrm{tr}_n\big(A_{ni_1} \cdots A_{ni_{2p}}\big) - E\big[\mathrm{tr}_n\big(A_{ni_1} \cdots A_{ni_{2p}}\big)\big]\big|^2 \Big]
= \sum_{n=1}^{\infty} \Big( E\Big[\big|\mathrm{tr}_n\big(A_{ni_1} \cdots A_{ni_{2p}}\big)\big|^2\Big] - \big|E\big[\mathrm{tr}_n\big(A_{ni_1} \cdots A_{ni_{2p}}\big)\big]\big|^2 \Big) < \infty.
\tag{3.5}
\]

We have already shown that the last term in this expression satisfies

\[
\big|E\big[\mathrm{tr}_n\big(A_{ni_1} \cdots A_{ni_{2p}}\big)\big]\big|^2
= \Big| \sum_{\substack{\hat\pi \le \sigma \\ \hat\pi \text{ noncrossing}}} 1 \Big|^2 + O\big(n^{-2}\big).
\tag{3.6}
\]

A more complicated argument shows that the first term satisfies

\[
E\Big[\big|\mathrm{tr}_n\big(A_{ni_1} \cdots A_{ni_{2p}}\big)\big|^2\Big]
= \Big| \sum_{\substack{\hat\pi \le \sigma \\ \hat\pi \text{ noncrossing}}} 1 \Big|^2 + O\big(n^{-2}\big).
\tag{3.7}
\]

The proof is completed by inserting (3.6) and (3.7) into (3.5).

In figure 1, a histogram is shown of the eigenvalues of a $1000 \times 1000$ selfadjoint standard Gaussian random matrix. The shape of a semicircle (an ellipse would be more correct to say) of radius 2 centered at the origin is clearly visible. The following Matlab code produces the plot:

    A = (1/sqrt(2000)) * (randn(1000,1000) + j*randn(1000,1000));
    A = (sqrt(2)/2)*(A+A');
    hist(eig(A),40)

Figure 1: Histogram of the eigenvalues of a 1000 × 1000 selfadjoint standard Gaussian random matrix.
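The same kind of simulation can be used to check the freeness calculation from the remark in section 1 numerically (the matrix size 1000 below is just an illustrative choice): for two independent Gaussian matrices $A_1, A_2$, the normalized trace of $A_1A_2A_1A_2$ should be close to the free value 0, and not to the value $E(a_1^2)E(a_2^2) = 1$ obtained for classically independent commuting variables.

    n = 1000;
    A1 = (1/sqrt(2*n)) * (randn(n,n) + j*randn(n,n));  A1 = (sqrt(2)/2)*(A1+A1');
    A2 = (1/sqrt(2*n)) * (randn(n,n) + j*randn(n,n));  A2 = (sqrt(2)/2)*(A2+A2');
    real(trace(A1*A2*A1*A2)/n)                   % close to 0, the free value
    real(trace(A1^2)/n) * real(trace(A2^2)/n)    % close to 1, the classical value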
Not only Gaussian matrices display the freeness property when the matrices get large. It would perhaps be more correct to say that this occurs for any random matrix system where the eigenvector structure is uniformly distributed (i.e. points in each direction with equal probability); it can be shown that our Gaussian matrices have this property. Also, our Gaussian matrices are not only asymptotically free from other Gaussian matrices: one can show that, in a very general sense, Gaussian matrices are asymptotically free from any other random matrix system which is independent from them and which converges in distribution. Another type of random matrix frequently used, with similar properties to the Gaussian matrices, is the standard unitary matrix. These are unitary random matrices with distribution equal to the Haar measure on $U(n)$.

Another much used random matrix system which exhibits asymptotic freeness is $\frac{1}{N} XX^H$ (which has the structure of a sample covariance matrix), where $X$ is $n \times N$ with i.i.d. complex standard Gaussian entries ($n$ is interpreted as the number of parameters in a system, $N$ as the number of samples taken from the system). One here lets $n$ and $N$ go to infinity at a given ratio, say $\lim_{n\to\infty} \frac{n}{N} = c$. With a slight generalization of theorem 4, one can show that

\[
\lim_{n\to\infty} E\left[ \mathrm{tr}_n\!\left( \Big(\frac{1}{N} XX^H\Big)^{p} \right) \right]
= \sum_{\pi \in NC(p)} c^{\,p - |\pi|}
\]

almost everywhere. We call the limit distribution the Marčenko–Pastur law. We quickly see from (2.2) that its free cumulants are $1, c, c^2, \ldots$. In the next section, we will introduce analytic methods to calculate the density of this law.

4 Results from classical probability which have their analogue in free probability

Besides the free central limit theorem in the next section, several other results and concepts can be generalized from the classical to the free probability setting. For instance, the following is the analogue of the Poisson distribution:

Definition 9. Let $\lambda \ge 0$ and $\alpha \in \mathbb{R}$. The limit distribution as $N \to \infty$ of

\[
\nu_N = \left( \Big(1 - \frac{\lambda}{N}\Big)\delta_0 + \frac{\lambda}{N}\,\delta_\alpha \right)^{\boxplus N}
\tag{4.1}
\]

is called the free Poisson distribution with rate $\lambda$ and jump size $\alpha$.

Denote by $m_n(\nu_N)$ the $n$'th moment of $(1 - \frac{\lambda}{N})\delta_0 + \frac{\lambda}{N}\delta_\alpha$. It is clear that $m_n(\nu_N) = \frac{\lambda\alpha^n}{N}$. It is not hard to show that the cumulants $\alpha_n(\nu_N)$ of the approximation (4.1) are of the form $\lambda\alpha^n + O(N^{-1})$. Taking the limit shows that the $n$'th cumulant of the free Poisson distribution with rate $\lambda$ and jump size $\alpha$ is $\lambda\alpha^n$, i.e. its R-transform is given by

\[
R(z) = \sum_{n=1}^{\infty} \lambda\alpha^n z^n = \frac{\lambda\alpha z}{1 - \alpha z}.
\]

This corresponds nicely to the cumulants of the classical Poisson distribution.

Example. The limit distribution of $\frac{1}{N}XX^H$ encountered above (which had cumulants $1, c, c^2, \ldots$) is the free Poisson distribution with rate $\frac{1}{c}$ and jump size $c$. Let us compute the density of this law. First of all,

\[
R(z) = z + cz^2 + c^2z^3 + \cdots = \frac{z}{1 - cz}.
\]

Using (2.4), we can find the Cauchy transform as the inverse function of $\frac{1}{z}\big(1 + \frac{z}{1-cz}\big)$. We therefore solve

\[
w = \frac{1}{z}\left(1 + \frac{z}{1 - cz}\right),
\]

which implies that $wcz^2 + (1 - c - w)z + 1 = 0$. We find the roots of this as

\begin{align*}
z &= \frac{w + c - 1 \pm \sqrt{(1 - c - w)^2 - 4wc}}{2wc}
   = \frac{w + c - 1 \pm \sqrt{w^2 - 2(1+c)w + (1-c)^2}}{2wc} \\
  &= \frac{w + c - 1 \pm \sqrt{\big(w - (1 - \sqrt{c})^2\big)\big(w - (1 + \sqrt{c})^2\big)}}{2wc}.
\end{align*}

Using the Stieltjes inversion formula (i.e. taking the limit of the imaginary part of this as $w \to x$, $x$ real), we obtain the density

\[
f(x) = \frac{\sqrt{\big(x - (1 - \sqrt{c})^2\big)\big((1 + \sqrt{c})^2 - x\big)}}{2\pi x c}
\]

for $(1 - \sqrt{c})^2 \le x \le (1 + \sqrt{c})^2$ (note that we have to treat the case $x = 0$ separately).
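This density is easy to compare against a simulation. The following Matlab sketch (the sizes n = 500, N = 1000 are arbitrary illustrative choices) draws the eigenvalue histogram of $\frac{1}{N}XX^H$ and overlays the Marčenko–Pastur density just derived:

    n = 500; N = 1000; c = n/N;
    X = (randn(n,N) + j*randn(n,N))/sqrt(2);   % i.i.d. standard complex Gaussian entries
    W = X*X'/N;
    [counts, centers] = hist(real(eig(W)), 40);
    bar(centers, counts/(n*(centers(2)-centers(1))));   % histogram normalized to a density
    hold on
    x = linspace((1-sqrt(c))^2, (1+sqrt(c))^2, 200);
    f = sqrt((x-(1-sqrt(c))^2).*((1+sqrt(c))^2-x))./(2*pi*c*x);
    plot(x, f, 'r');
    hold off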
In figure 2, four different Marčenko–Pastur laws $\mu_c$ are plotted.

In classical probability, a probability measure $\mu$ is said to be infinitely divisible if, for any $n$, it can be written in the form $\mu = \mu_n^{*n}$ for some probability measure $\mu_n$, where $*n$ denotes $n$-fold (classical) convolution. In classical probability, the Lévy–Hinčin formula states which probability measures are infinitely divisible. A similar result exists in free probability (with $*$ replaced by $\boxplus$). To be more precise, a compactly supported probability measure $\mu$ is $\boxplus$-infinitely divisible if and only if its R-transform has the form

\[
R_\mu(z) = z\left( \alpha_1[\mu] + \int_{\mathbb{R}} \frac{z\,d\rho(x)}{1 - xz} \right)
\]

for some finite measure $\rho$ on $\mathbb{R}$ with compact support.

Figure 2: Four different Marčenko–Pastur laws $\mu_c$ ($c = 0.5, 0.2, 0.1, 0.05$).

5 The free central limit theorem

The free central limit theorem is analogous to the classical central limit theorem: we simply need to replace independence with freeness, and the Gaussian law with another law, called the semicircle law:

Definition 10. The semicircle law $w_{m,r}$ (of radius $r$ and centered at $m$) is the probability measure on $\mathbb{R}$ with density

\[
w_{m,r}(x) =
\begin{cases}
\frac{2}{\pi r^2}\sqrt{r^2 - (x-m)^2} & \text{if } m - r \le x \le m + r, \\
0 & \text{otherwise.}
\end{cases}
\]

$w_{0,2}$ is also called the standard semicircle law.

Note that the term semicircular is a bit misleading. More correct would be to say semielliptic, since it is only the case $r = \sqrt{2/\pi}$ which yields a circular shape of the density; all other values of $r$ give an elliptic shape.

The following lemma will be helpful when we connect to the semicircle law in the proof of the free central limit theorem:

Lemma 1. The $k$'th moment of the standard semicircle law is equal to the number of noncrossing pairings of $\{1, \ldots, k\}$. Equivalently, $R_{w_{0,2}}(z) = z^2$, i.e. the standard semicircle law is the distribution of a standard semicircular random variable. In particular, the odd moments of the standard semicircle law are all 0.

Proof: That the odd moments of the standard semicircle law are all zero follows immediately by symmetry. Assume that $k = 2s$. Writing $x^{2s}\sqrt{4-x^2} = x^{2s-1}(4-x^2)\cdot\frac{x}{\sqrt{4-x^2}}$ and integrating by parts gives

\begin{align*}
m_{2s} &= \frac{1}{2\pi}\int_{-2}^{2} x^{2s}\sqrt{4 - x^2}\,dx
        = -\frac{1}{2\pi}\int_{-2}^{2} x^{2s-1}(4 - x^2)\,\frac{-x}{\sqrt{4 - x^2}}\,dx \\
       &= \frac{1}{2\pi}\int_{-2}^{2} \sqrt{4 - x^2}\,\big(x^{2s-1}(4 - x^2)\big)'\,dx
        = 4(2s-1)\,m_{2s-2} - (2s+1)\,m_{2s}.
\end{align*}

This means that the recursion $m_{2s} = \frac{2(2s-1)}{s+1}\, m_{2s-2}$ holds. By induction, we will prove that this implies $m_{2s} = \frac{1}{s+1}\binom{2s}{s}$ (these numbers are also called the Catalan numbers $c_s$). This obviously holds for $s = 0$. Assume that we have proved that $m_{2(s-1)} = \frac{1}{s}\binom{2(s-1)}{s-1}$. Then

\[
m_{2s} = \frac{2(2s-1)}{s+1}\, m_{2s-2}
       = \frac{2(2s-1)}{s+1} \cdot \frac{1}{s}\binom{2s-2}{s-1}
       = \frac{2s(2s-1)}{s(s+1)} \cdot \frac{(2s-2)\cdots(s+1)}{1\cdots(s-1)}
       = \frac{2s\cdots(s+2)}{1\cdots s}
       = \frac{1}{s+1}\binom{2s}{s},
\]

which proves the induction step. It now suffices to prove that $m_{2s} = c_s = \frac{1}{s+1}\binom{2s}{s}$ equals the number $r_{2s}$ of noncrossing pairings of $\{1, \ldots, 2s\}$. Note that these numbers satisfy the equation

\[
r_{2s} = r_{2s-2}r_0 + r_{2s-4}r_2 + \cdots + r_2 r_{2s-4} + r_0 r_{2s-2}
\]

(with $r_0 = 1$). Setting $g(x) = \sum_{s=0}^{\infty} r_{2s} x^{s+1}$, this can equivalently be expressed as the power series equation $g(x)^2 = g(x) - x$. It is easily checked that $\sum_{s=0}^{\infty} m_{2s} x^{s+1} = \sum_{s=0}^{\infty} \frac{1}{s+1}\binom{2s}{s} x^{s+1}$ is the Taylor series of $f(x) = \frac{1}{2}\big(1 - \sqrt{1 - 4x}\big)$, and that $f(x)$ satisfies $f(x)^2 = f(x) - x$. Therefore $f(x)$ and $g(x)$ are the same power series, so that $m_{2s} = r_{2s}$, which is what we had to show.
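A quick Matlab check of lemma 1, computing the even moments of the standard semicircle law by numerical integration and comparing them with the Catalan numbers $1, 2, 5, 14, 42, \ldots$:

    for s = 1:5
        m = integral(@(x) x.^(2*s).*sqrt(4-x.^2)/(2*pi), -2, 2);   % 2s'th moment of w_{0,2}
        c = nchoosek(2*s, s)/(s+1);                                % Catalan number c_s
        fprintf('s = %d: moment = %.4f, Catalan number = %d\n', s, m, c);
    end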
Here we could also have used (2.4) and the Stieltjes inversion formula to obtain the density from the R-transform $R(z) = z^2$: the Cauchy transform is simply the inverse function of

\[
w = \frac{1}{z}\big(z^2 + 1\big) = z + \frac{1}{z}.
\]

We thus need to solve $z^2 - wz + 1 = 0$, which implies that $z = \frac{w \pm \sqrt{w^2 - 4}}{2}$. Taking imaginary parts, we obtain the density $\frac{1}{2\pi}\sqrt{4 - x^2}$ on $[-2, 2]$, which is the density of the standard semicircle law.

The free central limit theorem goes as follows:

Theorem 5. If

• $a_1, a_2, \ldots$ are free and self-adjoint,

• $\phi(a_i) = 0$,

• $\phi(a_i^2) = 1$,

• $\sup_i |\phi(a_i^k)| < \infty$ for all $k$,

then the sequence $(a_1 + \cdots + a_n)/\sqrt{n}$ converges in distribution to the standard semicircle law.

Proof: The proof goes by computing the moments of $(a_1 + \cdots + a_n)/\sqrt{n}$ and comparing them with the moments of the semicircle law. Note that the $k$'th moment can be written

\[
\phi\!\left( \frac{(a_1 + \cdots + a_n)^k}{n^{k/2}} \right)
= \sum_{s=1}^{k} \sum_{\mathcal{V}} \sum \frac{\phi\big(a_{i(1)} a_{i(2)} \cdots a_{i(k)}\big)}{n^{k/2}},
\tag{5.1}
\]

where the second summation is over all partitions $\mathcal{V} = \{V_1, V_2, \ldots, V_s\}$ with $s$ blocks (the blocks being the equivalence classes of the equivalence relation $u \sim v$ if $i(u) = i(v)$), and the third summation is over all choices of $(i(1), \ldots, i(k))$ which give rise to this partition. By repeatedly using definition 1, we see that each term in (5.1) can be written as a polynomial in $\phi(a_i^m)$, $1 \le m \le k$, and the polynomial depends only on the partition $\mathcal{V}$. The fourth assumption of the theorem now gives us that there exists a constant $C_k$ such that $|\phi(a_{i(1)} a_{i(2)} \cdots a_{i(k)})| \le C_k$ for all choices of $i(1), \ldots, i(k)$.

We now split the first summation in (5.1) into three parts: $s < k/2$, $s = k/2$, and $s > k/2$. For $s < k/2$,

\[
\left| \sum_{s < k/2} \sum_{\mathcal{V}} \sum \frac{\phi\big(a_{i(1)} a_{i(2)} \cdots a_{i(k)}\big)}{n^{k/2}} \right|
\le \sum_{s < k/2} \sum_{\mathcal{V}} \frac{n(n-1)\cdots(n-s+1)}{n^{k/2}}\, C_k,
\]

and this tends to 0 as $n \to \infty$. For $s > k/2$, it is easy to show that

\[
\sum_{s > k/2} \sum_{\mathcal{V}} \sum \frac{\phi\big(a_{i(1)} a_{i(2)} \cdots a_{i(k)}\big)}{n^{k/2}} = 0,
\]

since at least one of the blocks $V_i$ must then contain only one element (here we use that $\phi(a_i) = 0$ for all $i$). Therefore, only the case $s = k/2$ is of interest (in particular, $k$ must be even). We must therefore consider the terms

\[
\sum_{\mathcal{V}: s = k/2} \sum \frac{\phi\big(a_{i(1)} a_{i(2)} \cdots a_{i(k)}\big)}{n^{k/2}}.
\]

We will prove by induction that $\phi(a_{i(1)} a_{i(2)} \cdots a_{i(k)}) = 0$ if $\mathcal{V}$ has crossings, and that it is 1 if $\mathcal{V}$ is noncrossing. Assume that this has been shown for all lengths smaller than $k$. Assume first that there exists a $j$ such that $i(j) = i(j+1)$. Then it is easy to see that

\begin{align*}
\phi\big(a_{i(1)} a_{i(2)} \cdots a_{i(k)}\big)
&= \phi\Big(a_{i(1)} \cdots a_{i(j-1)} \big((a_{i(j)}^2)^\circ + \phi(a_{i(j)}^2)I\big)\, a_{i(j+2)} \cdots a_{i(k)}\Big) \\
&= \phi\Big(a_{i(1)} \cdots a_{i(j-1)} (a_{i(j)}^2)^\circ\, a_{i(j+2)} \cdots a_{i(k)}\Big)
 + \phi\big(a_{i(j)}^2\big)\,\phi\big(a_{i(1)} \cdots a_{i(j-1)} a_{i(j+2)} \cdots a_{i(k)}\big) \\
&= \phi\big(a_{i(j)}^2\big)\,\phi\big(a_{i(1)} \cdots a_{i(j-1)} a_{i(j+2)} \cdots a_{i(k)}\big).
\end{align*}

Since the partition defined by $i(1), \ldots, i(j-1), i(j+2), \ldots, i(k)$ is noncrossing if and only if $\mathcal{V}$ is noncrossing, and since $\phi(a_{i(j)}^2) = 1$, we obtain by induction that $\phi(a_{i(1)} a_{i(2)} \cdots a_{i(k)}) = 1$ if and only if $\mathcal{V}$ is noncrossing (and zero otherwise), whenever there exists a $j$ such that $i(j) = i(j+1)$.

Assume now that there exists no $j$ such that $i(j) = i(j+1)$. Then the condition $\phi(a_i) = 0$, coupled with the definition of freeness, gives that $\phi(a_{i(1)} a_{i(2)} \cdots a_{i(k)}) = 0$. Note also that $\mathcal{V}$ always has crossings in this case.
Since the partitions with crossings do not contribute, we have proved that (denoting by $j(1), \ldots, j(k/2)$ representatives from the equivalence classes of $\mathcal{V}$)

\begin{align*}
\lim_{n\to\infty} \phi\!\left( \frac{(a_1 + \cdots + a_n)^k}{n^{k/2}} \right)
&= \lim_{n\to\infty} \frac{1}{n^{k/2}} \sum_{\mathcal{V} \text{ noncrossing pairing}} \;\sum_{j(1), \ldots, j(k/2)} 1 \\
&= \lim_{n\to\infty} \sum_{\mathcal{V} \text{ noncrossing pairing}} \frac{n(n-1)\cdots(n-k/2+1)}{n^{k/2}} \\
&= \sum_{\mathcal{V} \text{ noncrossing pairing}} 1
 \;=\; \#\{\text{noncrossing pairings}\},
\end{align*}

since there are $n(n-1)\cdots(n-k/2+1)$ choices of $i(1), \ldots, i(k)$ which give rise to $\mathcal{V}$. By lemma 1, this is simply the $k$'th moment of the standard semicircle law, which concludes the proof of the free central limit theorem.

6 An important result from free probability, applicable in many situations

Theorem 6. Assume that $R_n$ and $X_n$ are independent random matrices of dimension $n \times N$, where $X_n$ contains i.i.d. standard (i.e. mean 0, variance 1) complex Gaussian entries. Assume that the empirical eigenvalue distribution of $\Gamma_n = \frac{1}{N} R_n R_n^H$ converges in distribution almost everywhere to a compactly supported probability measure $\eta_\Gamma$ as $n, N$ go to infinity with $\frac{n}{N} \to c$. Then the empirical eigenvalue distribution of

\[
W_n = \frac{1}{N} (R_n + \sigma X_n)(R_n + \sigma X_n)^H
\tag{6.1}
\]

converges in distribution almost surely to a compactly supported probability measure $\eta_W$ uniquely identified by

\[
\eta_W \oslash \mu_c = (\eta_\Gamma \oslash \mu_c) \boxplus \delta_{\sigma^2},
\tag{6.2}
\]

where $\mu_c$ is the Marčenko–Pastur law with ratio $c$, and $\delta_{\sigma^2}$ is the Dirac measure (point mass) at $\sigma^2$.

Here $\oslash$ denotes multiplicative free deconvolution, "the opposite" of the multiplicative free convolution $\boxtimes$. (6.1) can be thought of as the sample covariance matrix of the random vectors $r_n + \sigma x_n$. Here $r_n$ can be interpreted as a vector containing the system characteristics (direction of arrival in radar applications, for instance, or portfolio assets in financial applications), while $x_n$ represents additive noise, with $\sigma$ a measure of the strength of the noise. Theorem 6 is important since it opens up for estimation of the information part of such systems by filtering out the noise.

7 Applications to wireless communication

7.1 Channel capacity estimation using free probability theory [6]

Theorem 6 can be used for estimation of the channel capacity in wireless systems, which is defined as follows:

Definition 11. The capacity per receiving antenna (in the case where the noise is spatially white additive Gaussian) of a channel with $n \times m$ channel matrix $H$ and signal to noise ratio $\rho = \frac{1}{\sigma^2}$ is given by

\[
C = \frac{1}{n} \log_2 \det\left( I_n + \frac{1}{m\sigma^2} HH^H \right)
  = \frac{1}{n} \sum_{l=1}^{n} \log_2\left( 1 + \frac{1}{\sigma^2}\lambda_l \right),
\tag{7.1}
\]

where the $\lambda_l$ are the eigenvalues of $\frac{1}{m} HH^H$.

Assume that we have $L$ observations $\hat H_i$ of the form

\[
\hat H_i = H + \sigma X_i
\tag{7.2}
\]

in a MIMO system. To adapt to theorem 6, we form the $n \times mL$ random matrices

\[
\hat H_{1\ldots L} = H_{1\ldots L} + \frac{\sigma}{\sqrt{L}} X_{1\ldots L}
\tag{7.3}
\]

with

\[
\hat H_{1\ldots L} = \frac{1}{\sqrt{L}}\big[\hat H_1, \hat H_2, \ldots, \hat H_L\big], \qquad
H_{1\ldots L} = \frac{1}{\sqrt{L}}\big[H, H, \ldots, H\big], \qquad
X_{1\ldots L} = \big[X_1, X_2, \ldots, X_L\big].
\]

Noting that $H_{1\ldots L} H_{1\ldots L}^H = HH^H$, theorem 6 now gives us the approximation

\[
\nu_{\frac{1}{m}\hat H_{1\ldots L}\hat H_{1\ldots L}^H} \oslash \mu_{\frac{n}{mL}}
\approx \Big( \nu_{\frac{1}{m}HH^H} \oslash \mu_{\frac{n}{mL}} \Big) \boxplus \delta_{\sigma^2}.
\tag{7.4}
\]

This can be used to obtain estimates of the moments of the channel matrix $\frac{1}{m}HH^H$ from the observation matrices. These can in turn be used to obtain an estimate of the eigenvalues, and thus the capacity. The estimates prove to work better than existing methods, at least for the common case where the channel matrix has low rank ($\le 4$).
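For reference, (7.1) is straightforward to evaluate numerically once $H$ is known; the free probability machinery is only needed because in practice we observe $H$ through the noisy $\hat H_i$. A minimal Matlab sketch (the channel matrix and noise level below are arbitrary stand-ins):

    n = 10; m = 10; sigma2 = 0.1;
    H = randn(n,m) + j*randn(n,m);      % some channel matrix (illustrative only)
    lambda = real(eig(H*H'/m));         % eigenvalues of (1/m) H H^H
    C = mean(log2(1 + lambda/sigma2))   % capacity per receiving antenna, eq. (7.1)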
Existing methods for estimating the capacity from the observation matrices are

\begin{align*}
C_1 &= \frac{1}{nL} \sum_{i=1}^{L} \log_2 \det\left( I_n + \frac{1}{m\sigma^2} \hat H_i \hat H_i^H \right), \\
C_2 &= \frac{1}{n} \log_2 \det\left( I_n + \frac{1}{L\sigma^2 m} \sum_{i=1}^{L} \hat H_i \hat H_i^H \right), \\
C_3 &= \frac{1}{n} \log_2 \det\left( I_n + \frac{1}{\sigma^2 m} \Big(\frac{1}{L}\sum_{i=1}^{L} \hat H_i\Big)\Big(\frac{1}{L}\sum_{i=1}^{L} \hat H_i\Big)^H \right).
\tag{7.5}
\end{align*}

These are compared with the free probability based estimator $C_f$, defined through (7.4), in figure 3. The estimation gets worse when the rank of the channel matrix is increased, as can be seen in figure 4.

Figure 3: Comparison of the classical capacity estimators for various numbers of observations. $\sigma^2 = 0.1$, $n = 10$ receive antennas, $m = 10$ transmit antennas. The rank of $H$ was 3.

Figure 4: $C_f$ for various numbers of observations. $\sigma^2 = 0.1$, $n = 10$ receive antennas, $m = 10$ transmit antennas. The rank of $H$ was 3, 5 and 6.

7.2 Estimation of power and the number of users in CDMA systems [7]

In communication applications, one needs to determine the number of users in a cell in a CDMA type network, as well as the power with which they are received. Denoting by $n$ the spreading length, the received vector at the base station in an uplink CDMA system is given by

\[
y_i = WP^{\frac{1}{2}} s_i + b_i,
\tag{7.6}
\]

where $y_i$, $W$, $P$, $s_i$ and $b_i$ are, respectively, the $n \times 1$ received vector, the $n \times N$ spreading matrix with i.i.d. zero mean, variance $\frac{1}{n}$ entries, the $N \times N$ diagonal power matrix, the $N \times 1$ i.i.d. Gaussian unit variance modulation signals, and the $n \times 1$ additive white zero mean Gaussian noise. Adaptation of theorem 6 to this case gives

\[
\left( \Big( \frac{N}{n}\big(\mu_{\frac{N}{n}} \boxtimes \mu_P\big) + \Big(1 - \frac{N}{n}\Big)\delta_0 \Big) \boxplus \mu_{\sigma^2 I} \right) \boxtimes \mu_{\frac{n}{L}} \approx \mu_{\hat\Theta},
\tag{7.7}
\]

where $\hat\Theta$ is the sample covariance matrix formed from $L$ observations $y_1, \ldots, y_L$. We will show that this enables us to estimate the number of users $N$ through a best-match procedure in the following way: try all values of $N$ with $1 \le N \le n$, and choose the $N$ which gives the best match between the left and right hand sides of (7.7). By best match we mean the value of $N$ which gives the smallest deviation in the first moments, when compared to the moments we observe in the sample covariance matrix.

We use a $36 \times 36$ ($N = 36$) diagonal matrix as our power matrix $P$, with $\mu_P = \delta_1$. In this case there exists a common method which tries to find just the rank: it counts the number of eigenvalues greater than $\sigma^2$, using some threshold in the process. We will set the threshold at $1.5\sigma^2$, so that only eigenvalues larger than $1.5\sigma^2$ are counted. There are no generally known rules for where the threshold should be set, so some guessing is inherent in this method. Also, choosing a wrong threshold can lead to a need for a very high number of observations for the method to be precise.

The two methods are tested with a varying number of observations, from $L = 1$ to $L = 4000$. In figure 5 it is seen that when $L$ increases, we get a prediction of $N$ which is closer to the actual value 36. The classical method starts to predict values close to the right one only for a number of observations close to 4000, while the method based on free probability predicts values close to the right one with far fewer observations.

Figure 5: Estimation of the number of users with a classical method and with the method based on free convolution. L = 1024 observations have been used.
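A minimal Matlab sketch of the classical threshold method described above (all sizes and parameter values below are illustrative stand-ins, not those used for figure 5):

    n = 256; N = 36; L = 1024; sigma2 = 0.1;
    W = randn(n,N)/sqrt(n);              % spreading matrix, i.i.d. entries of variance 1/n
    P = eye(N);                          % diagonal power matrix (all powers equal to 1)
    S = randn(N,L);                      % unit variance modulation signals
    B = sqrt(sigma2)*randn(n,L);         % additive white Gaussian noise
    Y = W*sqrt(P)*S + B;                 % received vectors y_1, ..., y_L as columns
    Theta = Y*Y'/L;                      % sample covariance matrix
    Nest = sum(eig(Theta) > 1.5*sigma2)  % eigenvalue count = estimated number of users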
8 Applications to portfolio optimization

Certain companies have specialized in automatic trading strategies based on research and results from random matrix theory. An example is Capital Fund Management (http://www.cfm.fr), which employs 25 PhDs (mostly in physics); the publications from this company can be found at http://www.cfm.fr/us/publications.php. Two of the founders of Capital Fund Management have written the book [1] and the paper [2], on which the topics of this section are based.

We will follow the notation used in [2]:

• We assume that we have a portfolio with $N$ assets with weight $w_i$ on the $i$'th asset, i.e. $w_i = \frac{n_i y_0^i}{W}$, where $n_i$ is the number of shares of asset $i$ in the portfolio, $y_0^i$ is the initial price of asset $i$ (at $t = 0$), and $W$ is the total wealth invested in the portfolio (at time $t = 0$). We will also write $\mathbf{w} = (w_1, \ldots, w_N)$ for the portfolio.

• $r_t^i$ will denote the daily return of asset $i$ in our portfolio at time $t$, i.e. $r_t^i = \frac{y_{t+1}^i - y_t^i}{y_t^i}$, where $y_t^i$ is the price of asset $i$ at time $t$. The expected return of the portfolio is thus $\sum_i w_i r_t^i$.

• Denote by $(\sigma_t^i)^2$ the (daily) variance of $r_t^i$. We denote the correlation matrix of the $r_t^i$ by

\[
C_{ij}^t = \frac{E(r_t^i r_t^j) - E(r_t^i)E(r_t^j)}{\sigma_t^i \sigma_t^j}.
\]

Similarly, the covariance matrix is defined by $D_{ij}^t = E(r_t^i r_t^j) - E(r_t^i)E(r_t^j) = \sigma_t^i \sigma_t^j C_{ij}^t$. The correlation matrix may or may not evolve over time. We will mostly assume that it does not, in which case we write $\sigma_i$ and $C_{ij}$, i.e. drop the subscript $t$. It is common to assume (without loss of generality) that $E(r_t^i) = 0$.

• The (daily) variance/risk of the portfolio return is given by

\[
\sum_{ij} w_i \sigma_i C_{ij} \sigma_j w_j,
\tag{8.1}
\]

where $\sigma_i^2$ is the (daily) variance of asset $i$.

• The empirical correlation matrix $E$ is given by

\[
E_{ij} = \frac{1}{T} \sum_{t=1}^{T} x_t^i x_t^j,
\tag{8.2}
\]

where $x_t^i = r_t^i / \sigma_i$. The empirical correlation matrix is typically very different from the true correlation matrix.

• The risk of our portfolio can be faithfully measured by

\[
\frac{1}{T} \sum_{ijt} w_i \sigma_i x_t^i x_t^j \sigma_j w_j
= \sum_{ij} w_i \sigma_i \left( \frac{1}{T}\sum_t x_t^i x_t^j \right) \sigma_j w_j,
\tag{8.3}
\]

which is (8.1) with the correlation matrix replaced by the empirical correlation matrix.

8.1 Interpretation of eigenvalues and eigenvectors

If the portfolio is given by weights from a normalized eigenvector $\mathbf{w}_a = (w_a^1, \ldots, w_a^N)$ of the covariance matrix with eigenvalue $\lambda_a$, then (8.1) says that the variance/risk of the portfolio return is

\[
\sum_{ij} w_a^i \sigma_i C_{ij} \sigma_j w_a^j = \sum_{ij} w_a^i D_{ij} w_a^j = \mathbf{w}_a \cdot D\mathbf{w}_a = \lambda_a.
\]

Now consider two portfolios corresponding to (orthogonal) eigenvectors $\mathbf{w}_a$ and $\mathbf{w}_b$. Then the covariance of their returns is given by

\begin{align*}
E\left( \sum_{i=1}^{N} w_a^i r_t^i \sum_{j=1}^{N} w_b^j r_t^j \right)
- E\left( \sum_{i=1}^{N} w_a^i r_t^i \right) E\left( \sum_{j=1}^{N} w_b^j r_t^j \right)
&= \sum_{ij} w_a^i E(r_t^i r_t^j) w_b^j - \sum_{ij} w_a^i E(r_t^i)E(r_t^j) w_b^j \\
&= \sum_{ij} w_a^i D_{ij} w_b^j = \mathbf{w}_b \cdot D\mathbf{w}_a = 0.
\end{align*}

Therefore, the eigenvectors of the covariance matrix correspond to uncorrelated portfolios, and the corresponding eigenvalues are the risks of these portfolios. We can now decompose the original portfolio as a sum of these uncorrelated portfolios:

\[
(w_1, \ldots, w_N) = \sum_a s_a \mathbf{w}_a.
\]

This decomposition is called a principal component analysis.

8.2 Markowitz portfolio optimization

Markowitz portfolio optimization helps us find weights for a portfolio which give us maximal expected return for a given risk or, equivalently, minimal risk for a given return $G$:

\[
\mathbf{w}_C = G\, \frac{C^{-1}\mathbf{g}}{\mathbf{g}^T C^{-1} \mathbf{g}},
\tag{8.4}
\]

where $\mathbf{g} = (g_1, \ldots, g_N)$ are the predicted gains for the assets.
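A minimal Matlab sketch of (8.4) and of the corresponding portfolio risk (the correlation matrix, gains and target return below are arbitrary stand-ins):

    N = 5; G = 1;
    g = [0.5; 0.2; 0.3; 0.1; 0.4];        % predicted gains
    A = randn(N); C = A*A'/N + eye(N);    % some positive definite stand-in for C
    wC = G * (C\g) / (g'*(C\g));          % Markowitz weights, eq. (8.4)
    risk2 = wC'*C*wC                      % portfolio variance; equals G^2/(g'*(C\g))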
Typically, the correlation matrix $C$ is not known, so the Markowitz portfolio estimate (8.4) is computed from an empirical correlation matrix:

\[
\mathbf{w}_E = G\, \frac{E^{-1}\mathbf{g}}{\mathbf{g}^T E^{-1} \mathbf{g}}.
\tag{8.5}
\]

What is the true risk of the Markowitz optimized portfolio? This is called the "true" minimal risk $R_{\mathrm{true}}^2$. Assuming that $C$ is perfectly known, we get

\[
R_{\mathrm{true}}^2
= \mathbf{w}_C^T C \mathbf{w}_C
= G\,\frac{\mathbf{g}^T C^{-1}}{\mathbf{g}^T C^{-1}\mathbf{g}}\, C\, G\,\frac{C^{-1}\mathbf{g}}{\mathbf{g}^T C^{-1}\mathbf{g}}
= G^2\, \frac{\mathbf{g}^T C^{-1} C C^{-1} \mathbf{g}}{(\mathbf{g}^T C^{-1}\mathbf{g})^2}
= G^2\, \frac{\mathbf{g}^T C^{-1}\mathbf{g}}{(\mathbf{g}^T C^{-1}\mathbf{g})^2}
= \frac{G^2}{\mathbf{g}^T C^{-1}\mathbf{g}}.
\]

When $C$ is not known, we use an empirical correlation matrix as in (8.5). The "in-sample" risk $R_{\mathrm{in}}^2$ is defined as the risk of the optimized portfolio over the period used to construct it:

\[
R_{\mathrm{in}}^2 = \frac{G^2}{\mathbf{g}^T E^{-1}\mathbf{g}}.
\]

One can show that $R_{\mathrm{in}}^2 \le R_{\mathrm{true}}^2$. In particular, in the case where $C = I$, one has that

\[
R_{\mathrm{in}}^2 = \sqrt{1 - \frac{N}{T}}\, \frac{G^2}{\mathbf{g}^T\mathbf{g}}
\qquad \text{and} \qquad
R_{\mathrm{true}}^2 = \frac{G^2}{\mathbf{g}^T\mathbf{g}};
\]

see also figure 1 in [2]. Only when $T$ is much larger than $N$ will these values be close. Thus, using past returns (in the form of an empirical correlation matrix) to optimize a portfolio strategy leads to an over-optimistic estimate of the risk. To eliminate this bias in the estimation of risk, one can attempt to do some cleaning of the empirical correlation matrix. When this is done properly, we will see that the "cleaned" matrix can provide more reliable risk estimation.

8.3 Cleaning of correlation matrices

A large part of an empirical correlation matrix must be considered as "noise" and cannot be trusted for risk management. To describe a useful way of cleaning out this noise, let us first look at what happens when the prices of the assets are independent, identically distributed random variables. This is also called the null hypothesis of independent assets. When $C = I$, the empirical correlation matrix has the form $\frac{1}{T}\sum_{t=1}^{T} x_t^i x_t^j$, where the $x_t^i$ are real, standard, Gaussian, and independent. When $N$ and $T$ go to infinity at a given ratio, it is known that the eigenvalues are almost everywhere close to the Marčenko–Pastur law, as shown in figure 2. The derivation of the density of this law was given in section 4, using R-transform techniques.

The Marčenko–Pastur law thus serves as a theoretical prediction under the assumption that the market is "all noise". Deviations from this theoretical limit in the eigenvalue distribution should indicate non-noisy components, i.e. they should reveal information about the market: most eigenvalues can be explained in terms of a purely random correlation matrix, except for the largest ones, which correspond to the fluctuations of the market as a whole and of several industrial sectors.

In figure 1 in [2], the effect on the risk estimates is shown after a cleaning described as follows:

• Replace all "low-lying" eigenvalues with a unique value.

• Keep all the high eigenvalues. These should correspond to meaningful economic information (sectors).

This boils down to finding a $k^*$ (the number of meaningful economic sectors) and setting $\lambda_c^k = 1 - \delta$ if $k > k^*$ and $\lambda_c^k = \lambda_E^k$ if $k \le k^*$, where $\delta$ is chosen such that the trace of the correlation matrix is exactly preserved. Here the eigenvalues have been listed in descending order. $k^*$ is found by first finding the theoretical edge of the eigenvalues under the assumption that the null hypothesis holds (i.e. the upper limit of the support of the Marčenko–Pastur law), and then choosing $k^*$ such that $\lambda_E^{k^*}$ is close to this edge.
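A minimal Matlab sketch of this eigenvalue-clipping procedure (the function name clean_clip is a hypothetical choice, and k* is assumed to have been determined from the Marčenko–Pastur edge as described above):

    function Ec = clean_clip(E, kstar)
        % Keep the kstar largest eigenvalues of the empirical correlation
        % matrix E, and replace the remaining ones by a common value 1-delta
        % chosen so that the trace is exactly preserved.
        N = size(E,1);
        [V,D] = eig((E+E')/2);
        [lambda, idx] = sort(real(diag(D)), 'descend');
        V = V(:,idx);
        lambda(kstar+1:end) = (trace(E) - sum(lambda(1:kstar))) / (N - kstar);
        Ec = V*diag(lambda)*V';
    end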
Another cleaning strategy used in the literature is to shift the empirical matrix closer to the identity matrix, i.e. to replace $E$ with $E_c = \alpha E + (1 - \alpha)I$. The new eigenvalues are $\lambda_c^k = 1 + \alpha(\lambda^k - 1)$. This method of cleaning is also called the shrinkage estimator.

8.4 Other ways of forming an empirical matrix

In finance, it is standard practice to form an exponentially weighted moving average (EWMA) correlation matrix

\[
E_{ij}^T = \frac{1}{T} \sum_{t=1}^{T} \left(1 - \frac{1}{T}\right)^{T-t} x_t^i x_t^j,
\tag{8.6}
\]

instead of the standard empirical matrix (8.2), where each sample has equal weight $\frac{1}{T}$. This can also be written in matrix notation as

\[
E^T = \sum_{t=1}^{T} \left(1 - \frac{1}{T}\right)^{T-t} \delta E^t,
\]

where $\delta E^t$ is the rank one matrix given by $\delta E_{ij}^t = \frac{1}{T} x_t^i x_t^j$. We can perform cleaning of the correlation matrix (8.6) in a similar way, but now the null hypothesis gives us a different law. In order to compute this law, we restrict ourselves to a covariance matrix equal to the identity. Note first that

\[
E^T = \sum_{t=1}^{T} \left(1 - \frac{1}{T}\right)^{T-t} \delta E^t
    = \left(1 - \frac{1}{T}\right)\left( \sum_{t=1}^{T-1} \left(1 - \frac{1}{T}\right)^{T-1-t} \delta E^t \right) + \delta E^T
    = \left(1 - \frac{1}{T}\right) E^{T-1} + \delta E^T.
\tag{8.7}
\]

It is easily checked that

\[
\delta E^T (x_T^1, \ldots, x_T^N) = \frac{1}{T}\left( \sum_i (x_T^i)^2 \right) (x_T^1, \ldots, x_T^N) \approx c\,(x_T^1, \ldots, x_T^N),
\]

so that for large $N$, $\delta E^T$ has one eigenvalue close to $c = \frac{N}{T}$, while all the other $N - 1$ eigenvalues are 0. This means that

\[
G_{\mu_{\delta E^T}}(z) = \frac{1}{N}\,\frac{1}{z - c} + \frac{N-1}{N}\,\frac{1}{z}.
\]

The inverse function of this is found by solving

\[
w = \frac{1}{N}\,\frac{1}{z - c} + \frac{N-1}{N}\,\frac{1}{z},
\]

which gives

\[
wz^2 - (wc + 1)z + \frac{N-1}{N}c = 0,
\qquad
z = \frac{wc + 1 \pm \sqrt{(wc+1)^2 - 4cw\frac{N-1}{N}}}{2w}
  = \frac{wc + 1 \pm \sqrt{(wc-1)^2 + \frac{4cw}{N}}}{2w}
  \approx \frac{1}{w} + \frac{c}{N(1 - wc)},
\]

where we have used the Taylor expansion

\[
\sqrt{(wc-1)^2 + \frac{4cw}{N}} = wc - 1 + \frac{2cw}{N(wc-1)} + \cdots .
\]

Using (2.4), we see that

\[
R_{\mu_{\delta E^T}}(z) \approx z\left( \frac{1}{z} + \frac{c}{N(1 - cz)} \right) - 1 = \frac{cz}{N(1 - cz)}.
\]

If we replace $E^T$ and $E^{T-1}$ in (8.7) with $E$, then, since the matrices $\delta E^t$ are rotationally invariant, a result from free probability gives us that

\[
R_{\mu_E}(z) \approx R_{\mu_{(1-\frac{1}{T})E}}(z) + \frac{cz}{N(1 - cz)}
            = R_{\mu_E}\!\left(\Big(1 - \frac{1}{T}\Big)z\right) + \frac{cz}{N(1 - cz)},
\]

where we have used the well-known R-transform property $R_{aE}(z) = R_E(az)$. Denoting $R(z) = R_{\mu_E}(z)$, we now have the equation

\[
R(z) = R\!\left(\Big(1 - \frac{1}{T}\Big)z\right) + \frac{cz}{N(1 - cz)},
\]

which can also be written

\[
R\!\left(\Big(1 - \frac{1}{T}\Big)z\right) - R(z) + \frac{cz}{N(1 - cz)} = 0.
\]

Approximating $R\big((1 - \frac{1}{T})z\big) - R(z) \approx -\frac{1}{T} z R'(z)$, this becomes $\frac{1}{T} z R'(z) = \frac{cz}{N(1 - cz)}$, i.e. (using $c = \frac{N}{T}$)

\[
R'(z) = \frac{1}{1 - cz},
\qquad
R(z) = -\frac{\ln(1 - cz)}{c}.
\]

To find the density of this law, we must first find the Cauchy transform by solving

\[
G\!\left[ \frac{1}{z}\big(1 + R(z)\big) \right]
= G\!\left[ \frac{1}{z}\left(1 - \frac{\ln(1 - cz)}{c}\right) \right] = z,
\]

and then use the Stieltjes inversion formula to find the density. The resulting density is shown in [2].

8.5 Dynamics of the top eigenvalue and eigenvector

The largest eigenvalue of the empirical correlation matrix is typically much larger than the value predicted under the null hypothesis. The interpretation of the corresponding eigenvector is "the market itself", i.e. it has roughly equal components on all the $N$ assets. A first approximation to the market could be that, to begin with, all stocks move up or down together. One way to state this is through the model

\[
r_t^i = \beta_i \phi_t + \epsilon_t^i,
\]

where $\phi_t$ (the market mode) is common to all stocks, $\beta_i$ is the market exposure of stock $i$, and $\epsilon_t^i$ is noise, uncorrelated from stock to stock. The covariance matrix for such a model is

\[
C_{ij} = \beta_i \beta_j \sigma_\phi^2 + \sigma_i^2 \delta_{ij},
\]

where $\sigma_\phi^2$ is the variance of $\phi_t$ and $\sigma_i^2$ is the variance of $\epsilon_t^i$. When all the $\sigma_i$ are equal, with common value $\sigma$, the largest eigenvalue of $C$ is

\[
\Lambda_0 = \Big(\sum_j \beta_j^2\Big)\sigma_\phi^2 + \sigma^2,
\]

it has multiplicity 1, and its eigenvector is $(\beta_1, \ldots, \beta_N)$. The other $N - 1$ eigenvalues are all equal to $\Lambda_\alpha = \sigma^2$.
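A quick Matlab check of this spectral structure (the values of N, the market exposures and the variances below are arbitrary stand-ins):

    N = 50; sigmaphi2 = 1; sigma2 = 0.25;
    beta = 0.8 + 0.4*rand(N,1);                      % market exposures
    C = sigmaphi2*(beta*beta') + sigma2*eye(N);      % one-factor covariance matrix
    lambda = sort(eig(C), 'descend');
    [lambda(1), sum(beta.^2)*sigmaphi2 + sigma2]     % largest eigenvalue equals Lambda_0
    lambda(2:4)'                                     % the rest are all equal to sigma^2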
It would be very interesting to see how the top eigenvector and eigenvalue of the empirical covariance matrix fluctuate, and how this is related to the top eigenvector/eigenvalue of the covariance matrix itself (note that here we consider covariance matrices, not correlation matrices). We also consider EWMA covariance matrices here, i.e. in (8.6) and (8.7) we make the replacements

\[
\delta E_{ij}^t = \frac{1}{T} r_t^i r_t^j,
\qquad
E_{ij}^T = \frac{1}{T} \sum_{t=1}^{T} \left(1 - \frac{1}{T}\right)^{T-t} r_t^i r_t^j.
\]

It turns out that the largest eigenvalue $\lambda_{0t}$ of $E^t$ (with corresponding eigenvector $\psi_{0t}$) has the following relationship to the largest eigenvalue $\Lambda_0$ of the actual covariance matrix $C$:

\[
\mathrm{Var}\big(\lambda_{0(t+\tau)} - \lambda_{0t}\big) = \frac{2}{T}\big(1 - e^{-\tau/T}\big),
\qquad
\psi_{0(t+\tau)} \cdot \psi_{0t} \approx 1 - \frac{\Lambda_1}{T\Lambda_0}\big(1 - e^{-\tau/T}\big),
\tag{8.8}
\]

where $\Lambda_1$ is the second largest eigenvalue of $C$. We have assumed that the actual covariance matrix stays constant; measurements suggesting deviations from (8.8) would suggest otherwise.

8.6 Relation between the correlation matrix and the empirical correlation matrix

Assume that

• $N$ and $T$ go to infinity at the ratio $\lim_{N,T\to\infty} \frac{N}{T} = c$,

• the eigenvalue distribution of the empirical covariance matrix converges almost everywhere to a measure $\mu_E$.

Then "very often" the eigenvalue distribution of the corresponding covariance matrices converges almost everywhere to a measure $\mu_C$, and the following relationship determines the one from the other [8]:

\[
\mu_E = \mu_C \boxtimes \mu_c.
\]

This also establishes the connection with theorem 6, since multiplicative free convolution (through the deconvolution operation $\oslash$) appears there as well.

Further reading

Most of the material on the combinatorics of free probability presented here is taken from [5]. For a survey of free probability covering many other facets as well, [3] is a good reference. For a survey of random matrix results, take a look at [4]. The result for the exact moments of Gaussian matrices was taken from [9]. [10] is a good reference for the connection between random matrices, free probability, and wireless communication.

References

[1] J.-P. Bouchaud and M. Potters. Theory of Financial Risk and Derivative Pricing: From Statistical Physics to Risk Management. Cambridge University Press, Cambridge, 2000.

[2] J.-P. Bouchaud and M. Potters. Financial applications of random matrix theory: Old laces and new pieces. pages 1–11, 2005. arxiv.org/abs/physics/0507111.

[3] F. Hiai and D. Petz. The Semicircle Law, Free Random Variables and Entropy. American Mathematical Society, 2000.

[4] M. L. Mehta. Random Matrices. Academic Press, New York, 2nd edition, 1991.

[5] A. Nica and R. Speicher. Lectures on the Combinatorics of Free Probability. Cambridge University Press, 2006.

[6] Ø. Ryan and M. Debbah. Channel capacity estimation using free probability theory. Submitted to IEEE Trans. Signal Process., 2007. http://arxiv.org/abs/0707.3095.

[7] Ø. Ryan and M. Debbah. Free deconvolution for signal processing applications. Submitted to IEEE Trans. on Information Theory, 2007. http://arxiv.org/abs/cs.IT/0701025.

[8] Ø. Ryan and M. Debbah. Multiplicative free convolution and information-plus-noise type matrices. 2007. http://arxiv.org/abs/math.PR/0702342.

[9] S. Thorbjørnsen. Mixed moments of Voiculescu's Gaussian random matrices. J. Funct. Anal., 176(2):213–246, 2000.

[10] A. M. Tulino and S. Verdú. Random Matrix Theory and Wireless Communications. www.nowpublishers.com, 2004.