An NLA Look at σmin Universality (& the Stoch Diff Operator)
Alan Edelman and Po-Ru Loh
MIT Applied Mathematics
Random Matrices
October 10, 2010

Outline
• History of σmin universality
• Proof idea of Tao-Vu (in NLA language)
• What the proof doesn't prove
• Do stochastic differential operators say more?

A bit of history
• E ('89): Explicit formula for the distribution of σn (for finite iid Gaussian n x n matrices)
  – Analytic techniques: integration of a joint density function, Tricomi functions, Kummer's differential equation, etc.
• Striking convergence to the limiting distribution even in the non-Gaussian case
  – Parallel Matlab in the MIT computer room
  – "Please don't turn off" (no way to save work in the background on those old Sun workstations)

Empirical smallest singular value distribution: n x n Gaussian entries (figure)

Empirical smallest singular value distribution: n x n ±1 entries (figure)

Extending to non-Gaussians: How?
• The central limit theorem is a mathematical statement and a "way of life"
  – Formally: a (series of) theorems, with assumptions (e.g., iid); if the assumptions are not met, the theorems don't apply
  – Way of life: "When a bunch of random variables are mixed up enough, they behave as if Gaussian"
  – Example from our discussions: Does the square root of a sum of squares of (almost) iid random variables go to χn? Probably an application of the CLT, but not precisely the CLT without some tinkering (what does "go to" mean when n is changing?)

Outline
• History of σmin universality
• Proof idea of Tao-Vu (in NLA language)
• What the proof doesn't prove
• Do stochastic differential operators say more?

Tao-Vu ('09): "the rigorous proof"!
• Basic idea (NLA reformulation)... Consider a 2x2 block QR decomposition of M:

      M = [ M1  M2 ] = QR = [ Q1  Q2 ] [ R11  R12 ]
                                       [  0   R22 ]

  where M1, Q1 have n-s columns, M2, Q2 have s columns, and R22 is s x s.
  Note: Q2ᵀ M2 = R22

1. The smallest singular value of R22, scaled by √(s/n), is a good estimate for σn! (Equivalently, σs(R22) ≈ √(n/s)·σn.)
2. R22 (viewed as the product Q2ᵀ M2) is roughly s x s Gaussian

Basic idea part 1: √(s/n)·σs(R22) ≈ σn
• The smallest singular value of M is the reciprocal of the largest singular value of M⁻¹
• The singular values of R22 are exactly the inverse singular values of an s-row subsample of M⁻¹ (the bottom s rows of M⁻¹ equal R22⁻¹ Q2ᵀ)
• The largest singular value of an approximately low-rank matrix reliably shows up in a random sample (Vempala et al.; Rokhlin, Tygert et al.)
  – Referred to as "property testing" in theoretical CS terminology
  – The s sampled rows capture roughly a √(s/n) fraction of the norm of the (approximately rank-one) M⁻¹, which is where the rescaling comes from

Basic idea part 2: R22 ≈ Gaussian
• Recall R22 = Q2ᵀ M2
• Note that Q1 is determined by M1 and is thus independent of M2
• Q2 can be any orthogonal completion of Q1
• Thus, multiplying by Q2ᵀ "randomly stirs up" the entries of the (independent) n x s matrix M2
• Any "rough edges" of M2 should be smoothed away in the s x s result R22

Basic idea (recap)
1. The smallest singular value of R22, scaled by √(s/n), is a good estimate for σn!
2. R22 (viewed as the product Q2ᵀ M2) ≈ s x s Gaussian
• We feel comfortable (from our CLT "way of life") that part 2 works well
• How well does part 1 work?

Outline
• History of σmin universality
• Proof idea of Tao-Vu (in NLA language)
• What the proof doesn't prove
• Do stochastic differential operators say more?

How good is the s x s estimator? (figure)
Fluctuates within about ±10% of the truth (at least for this matrix)

How good is the s x s estimator? A few more tries... (figure)

More s x s estimator experiments: Gaussian entries, ±1 entries (figure)
A lot more tries... again, accurate to, say, 10%

More s x s estimator experiments: n = 100 vs. n = 200 (figure)
A lot more tries, now comparing matrix sizes, with s = 10 to 50% (of n)...
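To make the estimator in these experiments concrete, here is a minimal Matlab sketch (ours, not from the original slides) of the s x s corner estimator, using the √(s/n) rescaling described above; the ±1 entries are just one choice of non-Gaussian input, and the sizes n and s are arbitrary.

  % Sketch (not from the slides): estimate sigma_min(M) from the trailing
  % s x s block of R in the QR decomposition M = Q*R.
  n = 100; s = 10;
  M = sign(randn(n));                  % ±1 entries; use randn(n) for the Gaussian case
  [Q, R] = qr(M);
  R22 = R(n-s+1:end, n-s+1:end);       % bottom-right s x s corner
  est   = sqrt(s/n) * min(svd(R22));   % corner estimate of sigma_min(M)
  truth = min(svd(M));                 % exact smallest singular value
  relative_error = est/truth - 1       % the experiments above see errors of roughly 10%

Repeating this over many trials (and over a range of s) is essentially the experiment tallied in the plots above.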
In the n = 100 vs. n = 200 comparison, n = 200 looks a bit better.

How good is the s x s estimator?
• On one hand, surprisingly good, especially if you weren't expecting any such result
  – "Did you know you can get the smallest singular value to within 10% just by looking at a corner of the QR?"
• On the other hand, still clearly an approximation: n would need to be huge in order to reach human-indistinguishable agreement

Bounds from the proof
• "C is a sufficiently large constant (10^4 suffices)"
• Implied constants in O(...) depend on E|ξ|^C
  – For ξ = Gaussian, this is 9999!! (a double factorial)
• s = n^(500/C)
  – With C = 10^4, s = n^0.05, so to get s = 10 we need n ≈ 10^20?
• Various tail bounds go as n^(-1/C)
  – To get a 1% chance of failure, n ≈ 10^20000??

Reality check: n x n Gaussian entries, revisited (figure)

Reality check: n x n ±1 entries, revisited (figure)

What the proof doesn't prove
• The s x s estimator is pretty nifty...
• ... but the truth is far stronger than what the approximation can tell us

Outline
• History of σmin universality
• Proof idea of Tao-Vu (in NLA language)
• What the proof doesn't prove
• Do stochastic differential operators say more?

Can another approach get us closer to the truth?
• Recall the standard numerical SVD algorithm, which starts with Householder bidiagonalization
• In the case of Gaussian random matrices, each Householder step puts a χ distribution on the bidiagonal and leaves the remaining subrectangle Gaussian (a small illustrative sketch appears at the end of this deck)
• At each stage, all χ's and Gaussians in the entries are independent of each other (due to the isotropy of multivariate Gaussians)

Bidiagonalization process for an n x n Gaussian matrix (figure)

A stochastic operator connection
• E ('03) argued that the bidiagonal of χ's can be viewed as a discretization of a stochastic Bessel operator
  – √x d/dx + "noise"/√2
• As n grows, the discretization becomes smoother, and the (scaled) singular value distributions of the matrices ought to converge to those of the operator

A stochastic operator connection
How close are we if we use k x k χ's at the bottom, with the rest Gaussian?

  n=200; t=1000000;
  for k=1                              % k = number of exact χ's kept at the bottom (the figure varies k = 0 through 10)
    x=sqrt(n:-1:1); y=sqrt(n-1:-1:1); v=zeros(t,1); k
    endx=n-k+1:n; endy=n-k+1:n-1;
    dofx=k:-1:1;  dofy=(k-1):-1:1;
    for i=1:t
      yy=y+randn(1,n-1)/sqrt(2);       % χ_m ≈ sqrt(m)+N(0,1)/sqrt(2) on the off-diagonal...
      xx=x+randn(1,n)/sqrt(2);         % ...and on the diagonal,
      xx(endx)=sqrt(chi2rnd(dofx));    % ...but exact χ's in the bottom k diagonal entries
      yy(endy)=sqrt(chi2rnd(dofy));    % ...and the bottom k-1 off-diagonal entries
      v(i)=min(bidsvd(xx,yy));         % bidsvd: singular values of the bidiagonal with diagonals xx, yy
      if rem(i,500)==0, [i k], end     % progress display
    end
    hold off
    v=v*sqrt(n);                       % record sqrt(n)*σmin for this k
  end

(Figure: densities of √n·σmin for k = 0, 1, ..., 10 and k = ∞, at n = 100 and n = 200; 1 million trials in each experiment; area of detail roughly 0 to 0.2. Probably as n→∞, there is still a little upswing for finite k?)

A stochastic operator connection
• Ramírez and Rider ('09) produced a proof
• In further work with Virág, they have applied the SDO machinery to obtain similar convergence results for the largest eigenvalues of beta ensembles, etc.

Extending to non-Gaussians: How?
• The bidiagonalization mechanism shouldn't care too much about the difference...
• Each Householder "spin" stirs up the entries of the remaining subrectangle, making them "more Gaussian" (by Berry-Esseen, qᵀx is close to Gaussian as long as the entries of q are evenly spread out)
• Almost-Gaussians combine into (almost-independent) almost-χ's
• The original n^2 entries compress down to 2n-1

SDO mechanism
• Old intuition: non-Gaussian n x n matrices act like Gaussian n x n matrices (which we understand)
• New view: non-Gaussian and Gaussian n x n matrices are both discretizations of the same object
• Non-random discretizations have graininess too: step size, where to take finite differences, etc.
• SDO discretizations have issues like almost-independence... but can be overcome?

Some grand questions
• Could an SDO approach circumvent property testing (sampling the bottom-right s x s) and thereby get closer to the truth?
• Does the mathematics of today have enough technology for this? (If not, can someone invent the new technology we need?)
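A footnote to the bidiagonalization discussion above (the sketch promised there): a minimal Matlab illustration, ours rather than the speakers', of a single Householder step on an iid Gaussian matrix, showing why a χ lands on the bidiagonal while the remaining block stays Gaussian. The size n = 8 is arbitrary.

  % Sketch (not from the slides): one Householder step on an iid Gaussian matrix.
  n = 8;
  A = randn(n);
  x = A(:,1);
  v = x;  v(1) = v(1) + sign(x(1))*norm(x);   % Householder vector for the first column
  H = eye(n) - 2*(v*v')/(v'*v);               % reflector: H*x = -sign(x(1))*norm(x)*e1
  B = H*A;
  abs(B(1,1))                                 % = norm(x), a chi_n-distributed value
  B(2:end, 2:end)                             % again iid Gaussian, independent of B(1,1), by isotropy
  % Alternating such reflections on columns and rows drives A to bidiagonal form,
  % with independent chi_n, ..., chi_1 on the diagonal and chi_{n-1}, ..., chi_1 above it.

This is the mechanism behind the x = sqrt(n:-1:1), y = sqrt(n-1:-1:1) bidiagonal model used in the k x k χ experiment above.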