An NLA Look at σmin Universality (& the Stochastic Differential Operator)
Alan Edelman and Po-Ru Loh
MIT Applied Mathematics
Random Matrices
October 10, 2010
Outline
• History of σmin universality
• Proof idea of Tao-Vu (in NLA language)
• What the proof doesn't prove
• Do stochastic differential operators say more?
A bit of history
• E ('89): Explicit formula for the distribution of σn for finite n x n iid Gaussian matrices
  – Analytic techniques: integration of a joint density function, Tricomi functions, Kummer's differential equation, etc.
• Striking convergence to the limiting distribution even in the non-Gaussian case
  – Parallel Matlab in the MIT computer room
  – "Please don't turn off" (no way to save work in the background on those old Sun workstations)
Empirical smallest singular value distribution: n x n Gaussian entries
[Figure]
Empirical smallest singular value distribution: n x n ±1 entries
[Figure]
Extending to non-Gaussians: How?
• The central limit theorem is a mathematical statement and a "way of life"
  – Formally: a (series of) theorems – with assumptions (e.g., iid) – and if the assumptions are not met, the theorems don't apply
  – Way of life: "When a bunch of random variables are mixed up enough, they behave as if Gaussian"
  – Example from our discussions: does the square root of a sum of squares of (almost) iid random variables converge to χn? Probably an application of the CLT, but not precisely the CLT without some tinkering (what does "converge to" mean when n is changing?) – see the sketch below
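A minimal numerical illustration of the χn question (a sketch, not from the original slides; uniform entries are chosen here because ±1 entries make the sum of squares exactly n, which is too degenerate for the demo):

  % Does sqrt(x1^2 + ... + xn^2) behave like chi_n for non-Gaussian iid x_i?
  n = 100; t = 100000;
  x = (rand(n,t) - 0.5) * sqrt(12);   % iid uniform, mean 0, variance 1
  z = sqrt(sum(x.^2));                % candidate "almost chi_n" samples
  [mean(z), std(z)]                   % chi_n itself: mean ~ sqrt(n-1/2), std ~ 1/sqrt(2);
                                      % here the spread depends on E[x^4] --
                                      % exactly the "tinkering" alluded to above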
Outline
• History of σmin universality
• Proof idea of Tao-Vu (in NLA language)
• What the proof doesn't prove
• Do stochastic differential operators say more?
Tao-Vu ('09) “the rigorous proof”!
• Basic idea (NLA reformulation)...
Consider a 2x2 block QR decomposition of M, where M1 and Q1 have n−s columns, M2 and Q2 have s columns, and R22 is s x s:

  M = [ M1  M2 ] = QR = [ Q1  Q2 ] [ R11  R12 ]
                                   [  0   R22 ]

Note: Q2ᵀ M2 = R22
1. The smallest singular value of R22, scaled by √(s/n), is a good estimate for σn!
2. R22 (viewed as the product Q2ᵀ M2) is roughly s x s Gaussian
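A minimal MATLAB sketch of the estimator (an illustration of the idea under the setup above, not the construction used in the proof; the sizes n = 200, s = 20 are arbitrary choices):

  % Estimate sigma_n(M) from the bottom-right s x s block of R in M = QR.
  n = 200; s = 20;
  M = sign(randn(n));                % n x n random +/-1 matrix
  [Q, R] = qr(M);
  R22 = R(n-s+1:end, n-s+1:end);     % bottom-right s x s corner
  est = sqrt(s/n) * min(svd(R22));   % scaled smallest singular value of R22
  truth = min(svd(M));               % the true sigma_n
  [truth, est]                       % typically within ~10% of each other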
Basic idea part 1: σs(R22) · √(s/n) ≈ σn(M)
• The smallest singular value of M is the reciprocal of the largest singular value of M⁻¹
• The singular values of R22 are exactly the inverse singular values of an s-row subsample of M⁻¹: from M⁻¹ = R⁻¹Qᵀ, the bottom s rows of M⁻¹ equal R22⁻¹Q2ᵀ
• The largest singular value of an approximately low-rank matrix reliably shows up in a random sample (Vempala et al.; Rokhlin, Tygert et al.)
  – Referred to as "property testing" in theoretical CS terminology
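The second bullet can be checked numerically; a small self-contained sketch (inv(M) is fine here since M is small and square):

  % Check: the singular values of the bottom s rows of inv(M) are exactly
  % 1 ./ svd(R22), because those rows equal R22^{-1} * Q2'.
  n = 200; s = 20;
  M = sign(randn(n));
  [Q, R] = qr(M);
  R22 = R(n-s+1:end, n-s+1:end);
  Minv = inv(M);
  sv1 = svd(Minv(n-s+1:end, :));           % bottom s rows of M^{-1}
  sv2 = sort(1 ./ svd(R22), 'descend');    % reciprocals, matched ordering
  norm(sv1 - sv2)                          % ~ machine precision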
Basic idea part 2: R22 ≈ Gaussian
• Recall R22 = Q2ᵀ M2
• Note that Q1 is determined by M1 and is thus independent of M2
• Q2 can be any orthogonal completion of Q1
• Thus, multiplying by Q2ᵀ "randomly stirs up" the entries of the (independent) n x s matrix M2
• Any "rough edges" of M2 should be smoothed away in the s x s result R22, as the sketch below suggests
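A hedged empirical peek at that smoothing (a sketch; sizes are arbitrary): for a ±1 matrix, the entries of R22 already have nearly Gaussian moments, with kurtosis 3 as the Gaussian benchmark.

  % Do the entries of R22 = Q2' * M2 look Gaussian when M has +/-1 entries?
  n = 400; s = 20;
  M = sign(randn(n));
  [Q, R] = qr(M);
  R22 = Q(:, n-s+1:end)' * M(:, n-s+1:end);   % equals R(n-s+1:end, n-s+1:end)
  e = R22(:);
  m2 = var(e);                                % should be near 1
  m4 = mean((e - mean(e)).^4) / m2^2;         % kurtosis; 3 for a Gaussian
  [mean(e), m2, m4]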
Basic idea (recap)
1. The smallest singular value of R22, scaled by √(s/n), is a good estimate for σn!
2. R22 (viewed as the product Q2ᵀ M2) ≈ s x s Gaussian
• We feel comfortable (from our CLT "way of life") that part 2 works well
• How well does part 1 work?
Outline
• History of σmin universality
• Proof idea of Tao-Vu (in NLA language)
• What the proof doesn't prove
• Do stochastic differential operators say more?
How good is the s x s estimator?
[Figure: fluctuates within about ±10% of the truth (at least for this matrix)]
How good is the s x s estimator?
[Figure: a few more tries...]
More s x s estimator experiments
[Figures: Gaussian entries vs. ±1 entries – a lot more tries... again, accurate to within, say, 10%]
More s x s estimator experiments
[Figures: n = 100 vs. n = 200 – a lot more tries, now comparing matrix sizes, with s = 10% to 50% of n... n = 200 does a bit better]
How good is the s x s estimator?
• On one hand, surprisingly good, especially when no such result was expected at all
  – "Did you know you can get the smallest singular value to within 10% just by looking at a corner of the QR?"
• On the other hand, still clearly an approximation: n would need to be huge before the agreement became indistinguishable to the eye
Bounds from the proof
• "C is a sufficiently large constant (10^4 suffices)"
• Implied constants in O(...) depend on E|ξ|^C
  – For ξ = Gaussian, this is 9999!! (a double factorial!)
• s = n^(500/C) – that is, s = n^(1/20) for C = 10^4
  – To get s = 10, n ≈ 10^20?
• Various tail bounds go as n^(-1/C)
  – To get a 1% chance of failure, n ≈ 10^20000??
Reality check: n x n Gaussian entries, revisited
[Figure]
Reality check: n x n ±1 entries, revisited
[Figure]
What the proof doesn't prove
• The s x s estimator is pretty nifty...
• ... but the truth is far stronger than what the approximation can tell us
Outline
• History of σmin universality
• Proof idea of Tao-Vu (in NLA language)
• What the proof doesn't prove
• Do stochastic differential operators say more?
Can another approach get us closer to the truth?
• Recall the standard numerical SVD algorithm, which begins with Householder bidiagonalization
• In the case of Gaussian random matrices, each Householder step puts a χ distribution on the bidiagonal and leaves the remaining subrectangle Gaussian
• At each stage, all the χ's and Gaussians in the entries are independent of each other (due to the isotropy of multivariate Gaussians)
Bidiagonalization process for n x n Gaussian matrix
[Figure: the resulting bidiagonal matrix of independent χ's – degrees of freedom n, n−1, ..., 1 down the diagonal and n−1, ..., 1 along the superdiagonal]
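A minimal sampler of this bidiagonal model (a sketch; chi2rnd is from the Statistics Toolbox, and the degrees of freedom match the experiment code later in the talk):

  % Sample the bidiagonal model for an n x n Gaussian matrix:
  % independent chi's, d.o.f. n..1 on the diagonal, n-1..1 on the superdiagonal.
  n = 200;
  d = sqrt(chi2rnd(n:-1:1));       % diagonal: chi_n, ..., chi_1
  e = sqrt(chi2rnd(n-1:-1:1));     % superdiagonal: chi_{n-1}, ..., chi_1
  B = diag(d) + diag(e, 1);
  sigma = sqrt(n) * min(svd(B))    % same law as sqrt(n)*sigma_n of a Gaussian M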
A stochastic operator connection
• E ('03) argued that the bidiagonal of χ's can be viewed as a discretization of a stochastic Bessel operator:
    −√x d/dx + "noise"/√2
• As n grows, the discretization becomes smoother, and the (scaled) singular value distributions of the matrices ought to converge to those of the operator
A stochastic operator connection
[Figure]

How close are we if we use k x k chi's in the bottom corner, rest Gaussian?

[Figure: distributions of the rescaled smallest singular value for k = 0, 1, 2, ..., 10 and k = inf (all chi's), with n = 100 vs. n = 200; area of detail 0 ≤ x ≤ 0.2, 0.9 ≤ y ≤ 1.6. One million trials in each experiment. (Probably as n → inf, there is still a little upswing for finite k?)]

The experiment code (bidsvd computes the singular values of the bidiagonal matrix with diagonal xx and superdiagonal yy):

  % Bidiagonal model with exact chi's only in the bottom k x k corner;
  % elsewhere chi_m is replaced by its Gaussian approximation sqrt(m) + N(0,1/2).
  n = 200; t = 1000000;
  for k = 1                              % rerun with k = 0, 1, 2, ...
    x = sqrt(n:-1:1);                    % sqrt of d.o.f. for the diagonal chi's
    y = sqrt(n-1:-1:1);                  % sqrt of d.o.f. for the superdiagonal chi's
    v = zeros(t,1);
    endx = n-k+1:n;   dofx = k:-1:1;     % bottom-corner diagonal positions/d.o.f.
    endy = n-k+1:n-1; dofy = (k-1):-1:1; % bottom-corner superdiagonal positions/d.o.f.
    for i = 1:t
      xx = x + randn(1,n)/sqrt(2);       % Gaussian approximations of the chi's
      yy = y + randn(1,n-1)/sqrt(2);
      xx(endx) = sqrt(chi2rnd(dofx));    % exact chi's in the bottom corner
      yy(endy) = sqrt(chi2rnd(dofy));
      v(i) = min(bidsvd(xx, yy));        % smallest singular value
      if rem(i,500)==0, [i k], end       % progress report
    end
    v = v*sqrt(n);                       % rescale to compare across n
  end
A stochastic operator connection
• Ramírez and Rider ('09) produced a proof
• In further work with Virág, they have applied the SDO machinery to obtain similar convergence results for the largest eigenvalues of beta ensembles, etc.
Extending to non-Gaussians: How?
• The bidiagonalization mechanism shouldn't care too much about the difference...
• Each Householder spin "stirs up" the entries of the remaining subrectangle, making them "more Gaussian" (by Berry-Esseen, qᵀx is close to Gaussian as long as the entries of q are spread out evenly – see the sketch below)
• Almost-Gaussians combine into (almost-independent) almost-χ's
• The original n² entries compress to 2n − 1
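A minimal check of the Berry-Esseen parenthetical (a sketch; the choice of a Gaussian-direction q is just a convenient way to get a typical, delocalized unit vector):

  % Berry-Esseen intuition: q'*x is close to N(0,1) for a fixed delocalized
  % unit vector q and random +/-1 entries x.
  n = 200; t = 100000;
  q = randn(n,1); q = q / norm(q);   % typical unit vector, entries spread out
  z = q' * sign(randn(n,t));         % t samples of q'*x
  [mean(z), var(z)]                  % ~ 0 and ~ 1
  hist(z, 50)                        % visibly bell-shaped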
SDO mechanism
• Old intuition: non-Gaussian n x n matrices act like Gaussian n x n matrices (which we understand)
• New view: non-Gaussian and Gaussian n x n matrices are both discretizations of the same object
• Non-random discretizations have graininess in step size, where to take finite differences, etc.
• SDO discretizations have issues like almost-independence... but can these be overcome?
Some grand questions
• Could an SDO approach circumvent property testing (sampling the bottom-right s x s block) and thereby get closer to the truth?
• Does the mathematics of today have enough technology for this? (If not, can someone invent the new technology we need?)