Lp Row Sampling by Lewis Weights
Richard Peng
Georgia Tech, CS
Joint with Michael Cohen (M.I.T.)
PROBLEM: ROW SAMPLING
Given an n × d matrix A and a norm p,
pick a few (rescaled) rows of A to form A'
so that ║Ax║_p ≈_{1+ε} ║A'x║_p for all x

[Figure: tall matrix A next to its much shorter sketch A']

Multiplicative error notation ≈_k:
a ≈_k b if there exist k_min, k_max s.t.
k_max/k_min ≤ k and k_min·a ≤ b ≤ k_max·a
APPLICATION: DATA
Often data lives in a dimension far larger than its true rank:
• Over-constrained regression
• PCA / low-rank approximation
• Stochastic gradient descent
║Ax║_p ≈_{1+ε} ║A'x║_p for all x:
can solve on the smaller A' instead
INSTANTIATION: REGRESSION
min_x ║Ax − b║_p
• p = 2: least squares
• p = 1: robust regression
Simplified view:
• Ax − b = [A, b] [x; −1]
• min ║Ax║_p with one entry of x fixed (see the numpy sketch below)
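A minimal numpy check of this reduction for p = 2 (the random instance here is illustrative, not from the talk):

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Solve min_x ||Ax - b||_2 directly.
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# Augmented view: [A, b] @ [x; -1] = Ax - b, so the residual is a
# fixed-coordinate instance of min ||[A, b] y||_2 over y = [x; -1].
Ab = np.hstack([A, b[:, None]])
y = np.append(x, -1.0)
assert np.allclose(Ab @ y, A @ x - b)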
ON GRAPHS
A: edge-vertex incidence matrix
x: indicator vector of cut
A’: subset of (rescaled) edges
Widely used tools in graph algorithms
• [Benczur-Karger `96]: cut sparsifiers
• [Spielman-Teng `04]: spectral sparsification
PREVIOUS POLY-TIME ALGORITHMS
[CW`12][NN`13][LMP`13][CLMMPS`15]
Input sparsity time: expensive steps only on size-d matrices, O(nnz(A) + poly(d)) total

p          # rows       By
2          d log d      matrix concentration bounds ([Rudelson-Vershynin `07], [Tropp `12])
2          d            [Batson-Spielman-Srivastava `09]
1          d^{2.5}      [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
1 < p < 2  d^{p/2+2}    [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
2 < p      d^{p+1}      [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]

Assuming ε = constant
L2 VS. L1
Example: A is the n × 1 matrix (here n = 4) with every entry 1/2;
A' keeps a single row, rescaled to 1.
1-dimensional: the only `interesting' vector is x = [1]
• L2 distance: ║Ax║_2 = 1 vs. ║A'x║_2 = 1
• L1 distance: ║Ax║_1 = 2 vs. ║A'x║_1 = 1
Difference = distortion between L2 and L1: n^{1/2}
CONNECTION
Functional analysis view:
• ║Ax║_p: a Banach space
• A' = SA: embedding into a lower-dimensional space
For p = 1:
• Embeddability of Banach spaces: A' with O(d log d) rows
• Beck-Fiala conjecture: implied by A' with O(d) rows
OUR RESULTS: ROW COUNT

p          Previous     Ours                     Uses
1          d^{2.5}      d log d                  [Talagrand `90]
1 < p < 2  d^{p/2+2}    d log d (log log d)^2    [Talagrand `95]
2 < p      d^{p+1}      d^{p/2} log d            [Bourgain-Milman-Lindenstrauss `89]

• Runtime: input-sparsity time, O(nnz(A) + poly(d))
• When p < 4, overhead is O(d^ω)
Will focus on p = 1 for this talk
OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration
IMPORTANCE SAMPLING
[Figure: matrix with only one non-zero row, which must be kept while sampling the rest of A]
Sampling scheme: a probability p_i for each row.
Can ensure E[║A'x║_p^p] = ║Ax║_p^p by rescaling the rows picked;
what remains is concentration (see the sketch below).
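A minimal numpy sketch of this scheme (the function name `sample_rows` is illustrative): keep row i with probability p_i and rescale it by (1/p_i)^{1/p}, so each surviving term of ║A'x║_p^p is unbiased.

import numpy as np

def sample_rows(A, probs, p=1, seed=0):
    """Keep row i w.p. probs[i], rescaled by (1/probs[i])**(1/p),
    so that E[||A'x||_p^p] = ||Ax||_p^p for every x."""
    rng = np.random.default_rng(seed)
    keep = rng.random(A.shape[0]) < probs
    return A[keep] / probs[keep, None] ** (1.0 / p)

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 4))
x = rng.standard_normal(4)
probs = np.full(A.shape[0], 0.5)

# The average of ||A'x||_1 over trials approaches ||Ax||_1.
est = np.mean([np.abs(sample_rows(A, probs, p=1, seed=s) @ x).sum()
               for s in range(200)])
print(est, np.abs(A @ x).sum())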
ISSUES WITH SAMPLING BY NORM
Norm sampling: p_i ∝ ║a_i║_2^2
Problem: a column with a single non-zero entry.
That entry can be tiny, so its row is rarely sampled,
yet we need ║A[1; 0; …; 0]║_p ≠ 0.
MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores
τ_i = a_i^T (A^T A)^{-1} a_i     (a_i: row i of A)
Matrix concentration ([Rudelson-Vershynin `07], [Tropp `12]): sampling rows w.p. p_i = τ_i log d
gives ║Ax║_2 ≈ ║A'x║_2 ∀x w.h.p.
[Foster `49]: Σ_i τ_i = rank ≤ d ⟹ O(d log d) rows in A'
[CW`12][NN`13][LMP`13][CLMMPS`15]: can estimate L2 leverage scores in O(nnz(A) + d^{ω+α}) time
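Exact leverage scores take only a few lines of numpy (a dense sketch, not the input-sparsity-time estimators cited above): τ_i is the squared norm of row i of Q in a thin QR factorization A = QR.

import numpy as np

def leverage_scores(A):
    """tau_i = a_i^T (A^T A)^{-1} a_i, computed stably via thin QR."""
    Q, _ = np.linalg.qr(A)          # A = QR, Q has orthonormal columns
    return (Q ** 2).sum(axis=1)     # squared row norms of Q

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 6))
tau = leverage_scores(A)
print(tau.sum())   # = rank(A) = 6, matching [Foster `49]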
WHAT IS A LEVERAGE SCORE?
Norm of a_i after a `whitening transform'.
Approximations are basis independent:
max_x ║Ax║_p / ║A'x║_p = max_x ║AUx║_p / ║A'Ux║_p for any invertible U
So we can change basis: transform so that A^T A = I (isotropic position).
Interpretation of matrix-Chernoff:
when A is in isotropic position, norm sampling works.
WHAT WORKS FOR ANY p
For any p, there exists a function f of d s.t. for any n × d matrix A where
• n > f(d),
• A is isotropic: A^T A = I,
• all rows satisfy ║a_i║_2^2 < 2d/n,
sampling with p_i = ½ gives ║Ax║_p ≈ ║A'x║_p ∀x
LEWIS CHANGE OF DENSITY
w*: L1 Lewis weights
w*_i^2 = a_i^T (A^T W*^{-1} A)^{-1} a_i     (a recursive definition)

Symmetrization ([Rudelson-Vershynin `07]): importance sampling using
2-approximate Lp Lewis weights gives A' s.t. ║Ax║_p ≈ ║A'x║_p ∀x

p          # rows                   Citation
1          d log d                  [Talagrand `90]
1 < p < 2  d log d (log log d)^2    [Talagrand `95]
2 < p      d^{p/2} log d            [Bourgain-Milman-Lindenstrauss `89]
REST OF TALK
An algorithm: computation of approximate Lewis weights,
and a proof: an elementary proof of 1-norm concentration.
Key ideas:
• Isotropic position
• Analyze intermediate matrices as operators using variants of relative condition number
• Rademacher processes under L1
OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration
ALGORITHM
w*: L1 Lewis weights
w*_i^2 = a_i^T (A^T W*^{-1} A)^{-1} a_i     (recursive definition)
Algorithm: pretend w is the right answer, and iterate:
w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}
Resembles iteratively reweighted least squares.
Each iteration: compute leverage scores w.r.t. w, O(nnz(A) + d^{ω+α}) time
(see the sketch below)
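A direct numpy sketch of this iteration for p = 1 (dense linear algebra rather than the fast leverage-score estimation above; `lewis_weights` is an illustrative name):

import numpy as np

def lewis_weights(A, iters=20):
    """Fixed-point iteration w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}."""
    w = np.ones(A.shape[0])
    for _ in range(iters):
        M = np.linalg.inv(A.T @ (A / w[:, None]))       # (A^T W^{-1} A)^{-1}
        w = np.sqrt(np.einsum('ij,jk,ik->i', A, M, A))  # a_i^T M a_i, per row
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 5))
w = lewis_weights(A)
print(w.sum())   # = d = 5: the w_i are the L2 leverage scores of W^{-1/2} A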
OPERATORS
w*: L1 Lewis weights
w*_i^2 = a_i^T (A^T W*^{-1} A)^{-1} a_i
Key to analysis: the Lewis quadratic form A^T W^{-1} A
Spectral similarity: P ≈_k Q if x^T P x ≈_k x^T Q x ∀x
Implications of P ≈_k Q:
• P^{-1} ≈_k Q^{-1}
• U^T P U ≈_k U^T Q U for all matrices U
• relative condition number of P and Q ≤ k
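These closure properties are easy to sanity-check numerically (a sketch; `similarity` is an illustrative helper returning the best k with P ≈_k Q):

import numpy as np

def similarity(P, Q):
    """Smallest k with P ~_k Q: ratio of extreme eigenvalues of Q^{-1/2} P Q^{-1/2}."""
    lam, V = np.linalg.eigh(Q)
    Qih = V @ np.diag(lam ** -0.5) @ V.T      # Q^{-1/2}
    ev = np.linalg.eigvalsh(Qih @ P @ Qih)
    return ev.max() / ev.min()

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 5))
P = B.T @ B
Q = P + 0.1 * np.eye(5)                       # a spectrally similar PSD matrix
U = rng.standard_normal((5, 3))

k = similarity(P, Q)
print(k)
print(similarity(np.linalg.inv(P), np.linalg.inv(Q)))       # same k
print(similarity(U.T @ P @ U, U.T @ Q @ U) <= k + 1e-9)     # no worse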
CONVERGENCE PROOF OUTLINE
Update rule:
w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}
w''_i ← (a_i^T (A^T W'^{-1} A)^{-1} a_i)^{1/2}
…
Will show:
• if 0 < p < 4, the distance between w and w' rapidly decreases
• the distance between w' and w'' is less than the distance between w and w': a contraction mapping
CONVERGENCE FOR L1
Updates:
w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}
w''_i ← (a_i^T (A^T W'^{-1} A)^{-1} a_i)^{1/2}

Assume w ≈_k w', i.e. W ≈_k W'. Then:
Invert:               W^{-1} ≈_k W'^{-1}
Compose:              A^T W^{-1} A ≈_k A^T W'^{-1} A
Invert:               (A^T W^{-1} A)^{-1} ≈_k (A^T W'^{-1} A)^{-1}
Apply to vector a_i:  a_i^T (A^T W^{-1} A)^{-1} a_i ≈_k a_i^T (A^T W'^{-1} A)^{-1} a_i
So w'_i^2 ≈_k w''_i^2, i.e. w'_i ≈_{k^{1/2}} w''_i.

k' ← k^{1/2} per step: convergence in O(log log k) steps!
OVERALL ALGORITHM
Can also show: if initialized with w_i = 1, after 1 step we have w_i ≈_n w'_i.
The convergence bound then gives w^{(t)} ≈_2 w^{(t+1)} in O(log log n) rounds.
Sample with probabilities p_i = min{1, 2 w_i log d} (see the sketch below).
• Uniqueness: w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2} is a contraction mapping, same convergence rate to the fixed point
• Sum: the w*_i are the L2 leverage scores of W*^{-1/2} A, so Σ_i w*_i ≤ d
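Putting the pieces together for p = 1 (a sketch reusing the illustrative `lewis_weights` and `sample_rows` helpers above; ε is treated as a constant, as in the talk):

import numpy as np

def lewis_sample(A, seed=0):
    """L1 row sampling: keep row i w.p. min{1, 2 w_i log d}, rescaled by 1/p_i."""
    d = A.shape[1]
    w = lewis_weights(A)                           # sketch from the ALGORITHM slide
    probs = np.minimum(1.0, 2.0 * np.log(d) * w)
    return sample_rows(A, probs, p=1, seed=seed)   # sketch from IMPORTANCE SAMPLING

Since Σ_i w_i ≤ d, the expected number of rows kept is O(d log d).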
OPTIMIZATION FORMULATION
Lp Lewis weights: w*_i^{2/p} = a_i^T (A^T W*^{1−2/p} A)^{-1} a_i
Poly-time algorithm: solve
  max  det(M)
  s.t. Σ_i (a_i^T M a_i)^{p/2} ≤ d
       M P.S.D.
and read off w*_i = (a_i^T M a_i)^{p/2} at the optimum.
• Convex problem when p > 2
• Also leads to input-sparsity time algorithms
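One way to prototype this program is cvxpy (an illustrative sketch for p > 2; max det is expressed through the concave log_det, and the solver choice is left to cvxpy):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 60, 4, 3.0
A = rng.standard_normal((n, d))

M = cp.Variable((d, d), PSD=True)
# a_i^T M a_i is linear in M; raising it to p/2 > 1 keeps the constraint convex.
terms = [cp.power(cp.sum(cp.multiply(M, np.outer(a, a))), p / 2) for a in A]
prob = cp.Problem(cp.Maximize(cp.log_det(M)), [cp.sum(cp.hstack(terms)) <= d])
prob.solve()

w = np.einsum('ij,jk,ik->i', A, M.value, A) ** (p / 2)  # w*_i = (a_i^T M a_i)^{p/2}
print(w.sum())   # ≈ d: the constraint is tight at the optimum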
OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration
PROOFS OF KNOWN RESULTS
Will show: an elementary proof for p = 1

p          Citation                                             # pages
1          [Talagrand `90] + [Pisier `89, Ch. 2]                8 + ~30
1 < p < 2  [Talagrand `95] + [Ledoux-Talagrand `90, Ch. 15.5]   16 + 12
2 < p      [Bourgain-Milman-Lindenstrauss `89]                  69

Tools used:
• Gaussian processes
• K-convexity for p = 1
• Majorizing measures + generic chaining for p > 1
CONCENTRATION
`Nice' case:
• A^T A = I (isotropic position)
• ║a_i║_2^2 < ε^2 / log n
• sampling with p_i = ½: pick half the rows, double them
s_i: multiplier of row i: 0 w.p. ½, 2 w.p. ½
║Ax║_1 − ║A'x║_1 = Σ_i |a_i^T x| − Σ_i s_i |a_i^T x| = Σ_i (1 − s_i) |a_i^T x|
RADEMACHER PROCESSES
Need to bound (over choices of σ = 1 − s):
max_{x: ║Ax║_1 ≤ 1} Σ_i (1 − s_i) |a_i^T x|
Rademacher random variables: σ_i = 1 − s_i = ±1 w.p. ½ each
= max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i |a_i^T x|
MAIN STEPS
Goal: show w.h.p. over choices of σ = ±1:
max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i |a_i^T x| ≤ ε
• Comparison lemma: turn sum of absolute values into a single sum
• Dual norms: reduce to bounding the maximum of a single operator
• Finish with scalar concentration bounds
COMPARISON THEOREM
[Ledoux-Talagrand `89], implication:
E_σ[max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i |a_i^T x|] ≤ 2 E_σ[max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i a_i^T x]
Proof via Hall's theorem on the hypercube
Invoke with moment generating functions: suffices to show w.h.p. over σ:
max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i a_i^T x ≤ ε
TRANSFORMATION + DUAL NORM
max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i a_i^T x
= max_{x: ║Ax║_1 ≤ 1} σ^T A x
= max_{x: ║Ax║_1 ≤ 1} σ^T A A^T A x     (A^T A = I by assumption)
≤ max_{y: ║y║_1 ≤ 1} σ^T A A^T y        (substituting y = Ax)
= ║σ^T A A^T║_∞
Dual norm: max_{y: ║y║_1 ≤ 1} b^T y = ║b║_∞
EACH ENTRY
(σ^T A A^T)_j = Σ_i σ_i a_i^T a_j
• The σ_i are the only random variables.
• Khintchine's inequality (with log n moments): w.h.p. Σ_i σ_i b_i ≤ O((log n · ║b║_2^2)^{1/2})
• Suffices to bound Σ_i b_i^2 = Σ_i (a_i^T a_j)^2
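A quick Monte Carlo illustration of the Khintchine bound used here (a sketch; the constant inside the O(·) is not tuned):

import numpy as np

rng = np.random.default_rng(0)
n = 4000
b = rng.standard_normal(n) / np.sqrt(n)        # a fixed vector with ||b||_2 = O(1)

# Tail of |sum_i sigma_i b_i| over random sign vectors sigma.
sigma = rng.choice([-1.0, 1.0], size=(2000, n))
samples = np.abs(sigma @ b)
bound = np.sqrt(np.log(n) * (b ** 2).sum())
print((samples <= bound).mean())   # nearly all mass falls below sqrt(log n) * ||b||_2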
LAST STEPS
Σ_i (a_i^T a_j)^2 = Σ_i a_j^T a_i a_i^T a_j
                  = a_j^T (Σ_i a_i a_i^T) a_j
                  = a_j^T A^T A a_j
`Nice case' assumptions:
• A^T A = I (isotropic position)
• ║a_i║_2^2 < ε^2 / log n
so this equals ║a_j║_2^2 < ε^2 / log n
Consequence of Khintchine's inequality:
w.h.p. each entry < O((log n · ║b║_2^2)^{1/2}) = O(ε)
UNWIND PROOF STACK:
• Finish with scalar concentration bounds:
  w.h.p. ║σ^T A A^T║_∞ < ε
• Dual norms: reduce to bounding the maximum of a single operator:
  max_{x: ║Ax║_1 ≤ 1} σ^T A x < ε
• Pass moment generating functions through the comparison theorem to get the result
OPEN PROBLEMS
• Elementary proof for p ≠ 1?
• O(d) rows for 1 < p < 2?
• Better algorithms for p ≥ 4
• Generalize to low-rank approximations
• SGD using Lewis weights?
• Fewer rows for structured matrices (e.g. graphs) when p > 2? Conjecture: O(d log^{f(p)} d) for graphs

Reference: http://arxiv.org/abs/1412.0588