Iterative Row Sampling
Richard Peng
CMU → MIT
Joint work with Mu Li (CMU) and Gary Miller (CMU)
OUTLINE
• Matrix Sketches
• Existence
• Samples → better samples
• Iterative algorithms
DATA
• n-by-d matrix A, m nonzero entries
• Columns: data points
• Rows: attributes
Goal:
• Classification/ clustering
• Identify patterns
• Interpret new data
LINEAR MODEL
Ax = x_1·A_{:,1} + x_2·A_{:,2} + x_3·A_{:,3} + …
• Can add/scale data points
• x: coefficients; Ax: the resulting combination
PROBLEM
Interpret a new data point b as a
combination of the known ones: Ax
REGRESSION
• Express b as a combination of
  current examples
• Regression: min_x ║Ax − b║_p
• p = 2: least squares
• p = 1: compressive sensing
• ║x║_2: Euclidean norm of x
• ║x║_1: sum of absolute values
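A minimal least-squares example (p = 2) in NumPy; the matrix and vector are synthetic stand-ins, not data from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))   # n >> d: many attributes per data point
b = rng.standard_normal(n)        # new point to express via A's columns

# p = 2: least squares, min_x ||Ax - b||_2
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print("residual:", np.linalg.norm(A @ x - b))
```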
VARIANTS OF COMPRESSIVE SENSING
• min_x ║Ax − b║_1 + ║x║_1
• min_x ║Ax − b║_2 + ║x║_1
• min_x ║x║_1 s.t. Ax = b
• min_x ║Ax║_1 s.t. Bx = y
• min_x ║Ax − b║_1 + ║Bx − y║_1
All similar to min_x ║Ax − b║_1
SIMPLIFIED
• min_x ║Ax − b║_p = min_x ║[A, b]·[x; −1]║_p
• Regression is equivalent to min ║Ax║_p
  with one entry of x fixed
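A quick NumPy check of this reduction (synthetic data; the point is only that the two objectives agree at [x; −1]):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 4))
b = rng.standard_normal(200)

x, *_ = np.linalg.lstsq(A, b, rcond=None)

Ab = np.hstack([A, b[:, None]])   # [A, b]
y = np.append(x, -1.0)            # [x; -1]: last entry fixed to -1
# ||Ax - b||_2 and ||[A, b][x; -1]||_2 agree
print(np.linalg.norm(A @ x - b), np.linalg.norm(Ab @ y))
```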
‘BIG’ DATA POINTS
• Each data point has
many attributes
• #rows (n) >> #columns (d)
• Examples:
• Genetic data
• Time series (videos)
• Reverse (d >> n) also
  common: images + SIFT
FASTER?
A → A′
• Smaller, equivalent A′
• A matrix sketch
ROW SAMPLING
• Pick some rows of A to be A′
• How to pick? Randomly
SHORTER EQUIVALENT
• Find shorter A’ that
preserves answer
• ║Ax║_p ≈_{1+ε} ║A′x║_p for all x
• Run algorithm on A′; the
  answer is also good for A

Simplified error notation ≈:
a ≈_k b if there exist k_1, k_2 s.t.
k_2/k_1 ≤ k and k_1·a ≤ b ≤ k_2·a
OUTLINE
• Matrix Sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
SKETCHES EXIST
║Ax║_p ≈ ║A′x║_p for all x
• Linear sketches: A′ = S·A
• [Drineas et al. `12]:
  row sampling, one nonzero
  in each row of S
• [Clarkson-Woodruff `12]:
  S = CountSketch, one
  nonzero per column
SKETCHES EXIST

Rows needed so that ║Ax║_p ≈ ║A′x║_p:

                          p=2 size        p=1 size
Dasgupta et al. `09                       d^2.5
Magdon-Ismail `10         d·log^2(d)
Sohler & Woodruff `11                     d^3.5
Drineas et al. `12        d·log(d)
Clarkson et al. `12                       d^4.5·log^1.5(d)
Clarkson & Woodruff `12   d^2·log(d)      d^8
Mahoney & Meng `12        d^2             d^3.5
Nelson & Nguyen `12       d^{1+α}
This paper                d·log(d)        d^3.66

Hidden: runtime costs, ε^{-2} dependency
WHY IS ≈ d POSSIBLE?
║Ax║_p ≈ ║A′x║_p for all x
• ║Ax║_2^2 = x^T·A^T·A·x
• A^T·A: d-by-d matrix
• Any factorization (e.g. QR)
  of A^T·A suffices as A′
A^T·A
• Covariance matrix
• Dot products of all pairs
  of columns (data)
• Covariance:
  cov(j_1, j_2) = Σ_i A_{i,j_1}·A_{i,j_2}
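A small NumPy check of the factorization claim above: any R with R^T·R = A^T·A is an exact d-row sketch for p = 2 (Cholesky is used here for convenience; a QR of A gives the same R):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 6))   # n >> d
C = A.T @ A                         # d-by-d covariance matrix

R = np.linalg.cholesky(C).T         # upper triangular, R^T R = C
x = rng.standard_normal(6)
# ||Ax||_2^2 = x^T C x = ||Rx||_2^2: R is a d-row A' with no error
print(np.linalg.norm(A @ x), np.linalg.norm(R @ x))
```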
USE OF COVARIANCE MATRIX
C = A^T·A
• Clustering: ℓ_2 distances of
  all pairs given by C
• Kernel methods: all-pairs
  dot products suffice for
  many models
OTHER USE OF COVARIANCE
• Covariance of attributes
used to tune parameters
• Images + SIFT: many data
  points, few attributes
• http://www.image-net.org/:
  14,197,122 images,
  1000 SIFT features
HOW EXPENSIVE IS THIS?
• d^2 dot products of length-n vectors
• Total: O(n·d^2)
• Faster: O(n·d^{ω−1})
• Expensive: n·d^2 > n·d ≥ m
EQUIVALENT VIEW OF SKETCHES
• Approximate covariance
  matrix: C′ = (A′)^T·A′
• ║Ax║_2 ≈ ║A′x║_2 is the
  same as C ≈ C′
APPLICATION OF SKETCHES
• A′: n′ rows
• d^2 dot products of length-n′ vectors
• Total cost: O(n′·d^{ω−1})
SKETCHES IN INPUT SPARSITY TIME
• Need: cost of computing C′
  < cost of computing C = A^T·A
• Two goals:
  • n′ small
  • A′ found efficiently
COST AND QUALITY OF A′

p = 2:
                          cost                 size
Magdon-Ismail `10         n·d^2/log(d)         d·log^2(d)
Drineas et al. `12        n·d·log(d) + d^ω     d·log(d)
Clarkson & Woodruff `12   m                    d^2·log(d)
Mahoney & Meng `12        m                    d^2
Nelson & Nguyen `12       m                    d^{1+α}
This paper                m + d^{ω+α}          d·log(d)

p = 1:
                          cost                 size
Dasgupta et al. `09       n·d^5                d^2.5
Sohler & Woodruff `11     n·d^{ω−1+α}          d^3.5
Clarkson et al. `12       n·d·log(d)           d^4.5·log^1.5(d)
Clarkson & Woodruff `12   m + d^7              d^8
Mahoney & Meng `12        m·log(n) + d^8       d^3.5
This paper                m + d^{ω+α}          d^3.66
OUTLINE
• Matrix Sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
PREVIOUS APPROACHES
A (m entries) → (a miracle
happens) → A′ (poly(d) rows)
• Go to poly(d) rows directly
• Projection to obtain key
  info, or the sketch itself
OUR MAIN APPROACH
A → A″ → A′
• Utilize the robustness of sketches,
covariance matrices, and sampling
• Iteratively reduce errors and sizes
BETTER ALGORITHM FOR P=2

(Same table as above; the p = 2 column is the target.)

                          cost                 size
Magdon-Ismail `10         n·d^2/log(d)         d·log^2(d)
Drineas et al. `12        n·d·log(d) + d^ω     d·log(d)
Clarkson & Woodruff `12   m                    d^2·log(d)
Mahoney & Meng `12        m                    d^2
Nelson & Nguyen `12       m                    d^{1+α}
This paper                m + d^{ω+α}          d·log(d)
COMPOSING SKETCHES
A (n rows) → [O(m)] → A″ (n′ = d^{1+α} rows)
  → [O(n′·d·log(d) + d^ω)] → A′ (d·log(d) rows)
Total cost: O(m + n′·d·log(d) + d^ω)
          = O(m + d^ω)
ACCUMULATION OF ERRORS
A (n rows) → A″ (n′ = d^{1+α} rows) → A′ (d·log(d) rows)
• ║Ax║_2 ≈_k ║A″x║_2
• ║A″x║_2 ≈_k′ ║A′x║_2
• ⇒ ║Ax║_2 ≈_{k·k′} ║A′x║_2
ACCUMULATION OF ERRORS
║Ax║_2 ≈_{k·k′} ║A′x║_2
• Final error: product of both errors
• Cost depends on error:
  usually ε^{-2} or more for 1 ± ε error
• [Avron & Toledo `11]: only final
step needs to be accurate
• Idea: compute sketches indirectly
ROW SAMPLING
• Pick some rows of A to be A′
• How to pick? Randomly
ARE ALL ROWS EQUAL?
Not always: consider a column
with a single nonzero entry
• ║A·[1; 0; …; 0]║_p ≠ 0 depends
  on that one row alone, so it
  must be kept
ROW SAMPLING
• τ′: weights on rows → a distribution
• Pick a number of rows independently from
  this distribution; rescale to form A′
MATRIX CHERNOFF BOUNDS
• What property of τ′ suffices?
• τ: statistical leverage scores
• If τ′ ≥ τ, then ║τ′║_1·log(d)
  (rescaled) rows suffice for A′ ≈ A
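A sketch of this sampling rule in NumPy, using the exact leverage scores defined on the next slide as τ′ (the oversampling constant c is an arbitrary demo choice, not a tuned bound):

```python
import numpy as np

def leverage_scores(A):
    # tau_i = A_i (A^T A)^{-1} A_i^T, via the pseudoinverse for safety
    C_inv = np.linalg.pinv(A.T @ A)
    return np.einsum('ij,jk,ik->i', A, C_inv, A)

def row_sample(A, tau, rng, c=5.0):
    d = A.shape[1]
    s = int(c * tau.sum() * np.log(d))        # ~ ||tau'||_1 log d rows
    p = tau / tau.sum()
    idx = rng.choice(A.shape[0], size=s, p=p)
    # rescale so that E[(A')^T A'] = A^T A
    return A[idx] / np.sqrt(s * p[idx])[:, None]

rng = np.random.default_rng(3)
A = rng.standard_normal((2000, 5))
Ap = row_sample(A, leverage_scores(A), rng)
x = rng.standard_normal(5)
print(np.linalg.norm(A @ x), np.linalg.norm(Ap @ x))   # close for all x, whp
```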
STATISTICAL LEVERAGE SCORES
• Studied in statistics since the 1970s
• Measure importance of rows
• Leverage score of row i, A_i:
  τ_i = A_i·(A^T·A)^{-1}·A_i^T
• Key fact: ║τ║_1 = rank(A) ≤ d
  ⇒ ║τ′║_1·log(d) = d·log(d) rows
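The key fact is easy to check numerically (random instance for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((1000, 8))

C_inv = np.linalg.pinv(A.T @ A)
tau = np.einsum('ij,jk,ik->i', A, C_inv, A)   # tau_i = A_i C^{-1} A_i^T
print(tau.sum(), np.linalg.matrix_rank(A))    # ||tau||_1 = rank <= d
```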
COMPUTING LEVERAGE SCORES
τ_i = A_i·(A^T·A)^{-1}·A_i^T = A_i·C^{-1}·A_i^T
• A^T·A: the covariance matrix C
• Given C^{-1}, can compute
  each τ_i in O(d^2) time
• Total cost: O(n·d^2 + d^ω)
COMPUTING LEVERAGE SCORES
τ_i = A_i·C^{-1}·A_i^T = ║A_i·C^{-1/2}║_2^2
• The 2-norm of the vector A_i·C^{-1/2}
• Puts rows in isotropic position
• Decorrelates the columns
ASIDE: WHAT IS LEVERAGE?
A_i → A_i·C^{-1/2}
Geometric view:
• Rows define 'energy' directions
• Normalize so total energy is uniform
• τ_i: norm of row i after normalizing
ASIDE: WHAT IS LEVERAGE?
How to interpret statistical
leverage scores?
• Statistics ([Hoaglin-Welsch `78],
  [Chatterjee-Hadi `86]):
  • Influence on the data set
  • Likelihood of being an outlier
  • Uniqueness of the row
ASIDE: WHAT IS LEVERAGE?
High Leverage Score:
• Key attribute?
• Outlier (measuring error)?
ASIDE: WHAT IS LEVERAGE?
My current view (motivated
by graph sparsification):
• They are sampling probabilities
• Use them to find sketches
COMPUTING LEVERAGE SCORES
τ_i = ║A_i·C^{-1/2}║_2^2
• Only need τ′ ≥ τ
• Can use approximations
  after scaling them up
• Error leads to larger ║τ′║_1
DIMENSIONALITY REDUCTION
x → Gx
║x║_2^2 ≈_jl ║Gx║_2^2
• Johnson-Lindenstrauss transform
• G: d-by-O(1/α) Gaussian matrix
• Error_jl = d^α
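A one-vector illustration of the transform (sizes are demo choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d, r = 500, 50
x = rng.standard_normal(d)
G = rng.standard_normal((d, r)) / np.sqrt(r)   # scaled so E||xG||^2 = ||x||^2
print(np.linalg.norm(x), np.linalg.norm(x @ G))  # norms comparable
```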
ESTIMATING LEVERAGE SCORES
τ_i = ║A_i·C^{-1/2}║_2^2 ≈_jl ║A_i·C^{-1/2}·G║_2^2
• G: d-by-O(1/α) Gaussian
• C^{-1/2}·G: d-by-O(1/α)
• Cost: O(nnz(A_i)/α) per row;
  total: O((m + d^2·log(d))/α)
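A sketch of this estimator (C^{-1/2} is formed by eigendecomposition here purely for illustration; the sketch width r = 20 is a demo choice):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((2000, 10))

# exact scores, for comparison
C_inv = np.linalg.pinv(A.T @ A)
tau = np.einsum('ij,jk,ik->i', A, C_inv, A)

# JL estimate: tau_i ~ ||A_i C^{-1/2} G||_2^2 with a thin Gaussian G
w, V = np.linalg.eigh(A.T @ A)
C_inv_half = V @ np.diag(w ** -0.5) @ V.T
r = 20
G = rng.standard_normal((10, r)) / np.sqrt(r)
tau_est = np.linalg.norm(A @ (C_inv_half @ G), axis=1) ** 2
print(np.corrcoef(tau, tau_est)[0, 1])   # estimates track the exact scores
```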
ESTIMATING LEVERAGE SCORES
τ_i = ║A_i·C^{-1/2}║_2^2 ≈ ║A_i·C′^{-1/2}║_2^2
• C ≈_k C′ gives ║C^{-1/2}·x║_2 ≈_k ║C′^{-1/2}·x║_2
• Uses C′ as a preconditioner for C
• Can also combine with JL
ESTIMATING LEVERAGE SCORES
τ′_i = ║A_i·C′^{-1/2}·G║_2^2
     ≈_jl ║A_i·C′^{-1/2}║_2^2
     ≈_{jl·k} τ_i
• (jl·k)·τ′ ≥ τ
• Total number of rows:
  ║jl·k·τ′║_1 ≤ jl·k·║τ′║_1 ≤ k·d^{1+α}
ESTIMATING LEVERAGE SCORES
• (jl·k)·τ′ ≥ τ
• ║jl·k·τ′║_1 ≤ jl·k·d^{1+α}
• Quality of A′ does not depend
  on quality of τ′
• C ≈_k C′ gives A′ ≈_2 A with
  O(k·d^{1+α}) rows in O(m + d^ω) time
• Some fixable issues when n >>> d
SIZE REDUCTION
A″ → C″ → τ′ → A′
• A″ ≈_O(1) A
• C″ ≈_O(1) C
• τ′ ≈_O(1) τ
• A′ ≈_O(1) A, with O(d^{1+α}·log(d)) rows
HIGH ERROR SETTING
A″ → C″ → τ′ → A′
• A″ ≈_k A
• C″ ≈_k C
• τ′ ≈_k τ
• A′ ≈_O(1) A, with O(k·d^{1+α}·log(d)) rows
ACCURACY BOOSTING
A → A′ → A″
• Can reduce any error k in
  O(m + k·d^{ω+α}) time
• All intermediate steps can
  have large (constant) error
OUTLINE
• Matrix Sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
ONE STEP SKETCHING
A (m entries) → (a miracle happens)
→ A″ (poly(d) rows) → A′ (d·log(d) rows)
• Obtain a sketch of size poly(d)
• Error-correct to O(d·log(d)) rows
  in poly(d) time
WHAT WE WILL SHOW
• A number of iterative steps
can give a similar result
• More work, less miraculous,
more robust
• Key idea: find leverage scores
ALGORITHMIC PICTURE
A′ → C′ → τ′ → (loop)
A sketch, covariance matrix, or leverage
scores with error k gives all three with
high accuracy in O(m + k·d^{ω+α}) time
OBSERVATIONS
A′ ≈_k → C′ ≈_k → τ′ → A′ ≈_O(1), with O(k) size increase
• Error does not accumulate
• Can loop around many times
• Unused parameter: size of A
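A toy rendering of this loop (my own simplification of the idea, not the paper's algorithm: it recomputes scores for all n rows each round, so it is not input-sparsity time):

```python
import numpy as np

def sample_rows(A, tau, s, rng):
    p = tau / tau.sum()
    idx = rng.choice(len(tau), size=s, p=p)
    return A[idx] / np.sqrt(s * p[idx])[:, None]

rng = np.random.default_rng(7)
n, d = 5000, 6
A = rng.standard_normal((n, d))

tau = np.ones(n)                     # crude initial scores (uniform)
for _ in range(3):                   # sample -> covariance -> better scores
    Ap = sample_rows(A, tau, 40 * d, rng)
    C_inv = np.linalg.pinv(Ap.T @ Ap)            # covariance from the sketch
    tau = np.einsum('ij,jk,ik->i', A, C_inv, A)  # refreshed leverage scores

x = rng.standard_normal(d)
print(np.linalg.norm(A @ x), np.linalg.norm(Ap @ x))
```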
OUR APPROACH
A → A_s
Create a shorter matrix A_s
s.t. the total leverage score of
each block is close
LEVERAGE SCORE OF A BLOCK
║τ_{1..k}║_1 = ║A_{1:k}·C^{-1/2}║_F^2
            ≈ ║G·A_{1:k}·C^{-1/2}║_F^2
• Total leverage score of a block:
  squared Frobenius norm of A_{1:k}·C^{-1/2}
• Preserved (≈) under random projection
• G: O(1)-by-k, so G·A_{1:k} has O(1) rows
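A quick check that a few Gaussian rows preserve the squared Frobenius norm (B stands in for a block A_{1:k}·C^{-1/2}):

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((300, 10))     # stand-in for A_{1:k} C^{-1/2}

r = 10                                 # a few projection rows suffice
G = rng.standard_normal((r, 300)) / np.sqrt(r)
print(np.linalg.norm(B, 'fro')**2, np.linalg.norm(G @ B, 'fro')**2)
```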
SIZE REDUCTION
A → A_s
Recursing on A_s gives
leverage scores that:
• Sum to ≤ d
• Can be used to row sample A
ALGORITHM
A′ ≈_k → C′ ≈_k → τ′ → ≈_O(1), with O(k) size increase
• Decrease size by d^α, recurse
• Bring back leverage scores
• Reduce error
PROBLEM
• Leverage scores in A_s are
  measured using C_s = A_s^T·A_s
• Already have a bound on the total;
  suffices to show
  ║x·C^{-1/2}║_2 ≤ k·║x·C_s^{-1/2}║_2
PROOF SKETCH
Need: ║x·C^{-1/2}║_2 ≤ k·║x·C_s^{-1/2}║_2
• Show ║C_s^{1/2}·x║_2 ≤ k·║C^{1/2}·x║_2
• Invert both sides
• Some issues when A_s has
  smaller rank than A
║C_s^{1/2}·x║_2 ≤ k·║C^{1/2}·x║_2
b: blocks of A_s
║C_s^{1/2}·x║_2^2 = ║A_s·x║_2^2
  = Σ_b Σ_i (G_{i,b}·A_b·x)^2
  ≤ Σ_b Σ_i ║G_{i,b}║_2^2·║A_b·x║_2^2
  ≤ max_{b,i} ║G_{i,b}║_2^2 · ║Ax║_2^2
  ≤ O(k·log(n))·║Ax║_2^2
P=1, OR ARBITRARY P
║Ax║_p ≈ ║A′x║_p for all x
• The same approach can still work
• Needs p-norm leverage scores
• Need: a well-conditioned basis
  U for the column space
QUALITY OF BASIS (P=1)
• Quality of U: maximum
  distortion in the dual norm:
  β = max_{x≠0} ║Ux║_∞ / ║x║_∞
• Analog of leverage scores:
  τ_i = β·║U_{i,:}║_1
• Total number of rows: β·║U║_1
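Reading β as the ∞-to-∞ operator norm (which equals the largest row ℓ1 norm; my reading of the definition above, not spelled out in the talk), these quantities are one line each:

```python
import numpy as np

rng = np.random.default_rng(9)
U = rng.standard_normal((1000, 5))   # some basis for the column space

row_l1 = np.abs(U).sum(axis=1)
beta = row_l1.max()        # max_x ||Ux||_inf / ||x||_inf
tau = beta * row_l1        # p = 1 analog of leverage scores
print(beta, tau.sum())     # tau.sum() = beta * ||U||_1, the row count
```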
BASIS CONSTRUCTION
C_1, U ≈_k → A′, τ′ → ≈_O(1), with O(k) size increase
• Basis via a linear transform: U = A·C_1
• Compute ║U_i║_1 using p-stable
  distributions (Indyk `06) instead of JL
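The p-stable idea for p = 1, sketched with Cauchy projections and a median estimator (the sketch size 101 is a demo choice):

```python
import numpy as np

rng = np.random.default_rng(10)
v = rng.standard_normal(500)

S = rng.standard_cauchy((101, 500))
# Each entry of Sv is ||v||_1 times a standard Cauchy, so the
# median of |Sv| concentrates around ||v||_1 (median of |Cauchy| is 1)
print(np.abs(v).sum(), np.median(np.abs(S @ v)))
```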
ITERATIVE ALGORITHM FOR P=1
• C_1 = C^{-1/2}, the ℓ_2 basis
• Quality of U = A·C_1: β·║U║_1 = n^{1/2}·d
• Too coarse for a single step, but
  good enough to iterate
• n approaches poly(d) quickly
• Needs the ℓ_2 algorithm for C
SUMMARY

p = 2 (cost for d·log(d) rows):
Drineas et al. `12        n·d·log(d) + d^ω
Clarkson & Woodruff `12   m + d^3·log^2(d)
Mahoney & Meng `12        m + d^3·log(d)
Nelson & Nguyen `12       m + d^ω
This paper                m + d^{ω+α}

p = 1:
                          cost                 size
Sohler & Woodruff `11     n·d^{ω−1+α}          d^3.5
Clarkson et al. `12       n·d·log(d)           d^4.5·log^1.5(d)
Clarkson & Woodruff `12   m + d^7              d^8
Mahoney & Meng `12        m·log(n) + d^8       d^3.5
This paper                m + d^{ω+α}          d^3.66
• Robust steps → algorithms
• ℓ_2: more complicated than sketching
• Smaller overhead for general p-norms
FUTURE WORK
• What are leverage scores???
• Iterative low rank approximation?
• Better p-norm leverage scores?
• More streamlined view of the
projections in our algorithm?
• Empirical evaluation?
THANK YOU!
Questions?