ITERATIVE ROW SAMPLING
Richard Peng (CMU → MIT)
Joint work with Mu Li (CMU) and Gary Miller (CMU)

OUTLINE
• Matrix Sketches
• Existence
• Samples → better samples
• Iterative algorithms

DATA
• n-by-d matrix A, m (non-zero) entries
• Columns: data
• Rows: attributes
Goal:
• Classification / clustering
• Identify patterns
• Interpret new data

LINEAR MODEL
• Ax = x1·A:,1 + x2·A:,2 + x3·A:,3 + …
• Can add/scale data points
• x: coefficient vector; the combination is Ax

PROBLEM
Interpret a new data point b as a combination of known ones: b ≈ Ax

REGRESSION
• Express b as a combination of current examples
• Regression: minx ║Ax − b║p
• p = 2: least squares
• p = 1: compressive sensing
• ║x║2: Euclidean norm of x
• ║x║1: sum of absolute values

VARIANTS OF COMPRESSIVE SENSING
• minx ║Ax − b║1 + ║x║1
• minx ║Ax − b║2 + ║x║1
• minx ║x║1 s.t. Ax = b
• minx ║Ax║1 s.t. Bx = y
• minx ║Ax − b║1 + ║Bx − y║1
All similar to minx ║Ax − b║1

SIMPLIFIED
• minx ║Ax − b║p = minx ║[A, b]·[x; −1]║p
• Regression is equivalent to min ║Ax║p with one entry of x fixed

'BIG' DATA POINTS
• Each data point has many attributes
• #rows (n) >> #columns (d)
• Examples:
  • Genetic data
  • Time series (videos)
• The reverse (d >> n) is also common: images + SIFT

FASTER?
• Replace A by a smaller, equivalent A'
• A': a matrix sketch

ROW SAMPLING
• Pick some rows of A to be A'
• How to pick? Random sampling

SHORTER EQUIVALENT
• Find a shorter A' that preserves the answer
• ║Ax║p ≈1+ε ║A'x║p for all x
• Run the algorithm on A'; the answer is good for A
Simplified error notation: a ≈k b if there exist k1, k2 s.t. k2/k1 ≤ k and k1·a ≤ b ≤ k2·a

OUTLINE
• Matrix Sketches
• How? Existence
• Samples → better samples
• Iterative algorithms

SKETCHES EXIST
║Ax║p ≈ ║A'x║p for all x
• Linear sketches: A' = SA
• [Drineas et al. `12]: row sampling, one non-zero in each row of S
• [Clarkson-Woodruff `12]: S = CountSketch, one non-zero per column

SKETCHES EXIST (SIZES)
                          p=2            p=1
Dasgupta et al. `09                      d^2.5
Magdon-Ismail `10         d·log²d
Sohler & Woodruff `11                    d^3.5
Drineas et al. `12        d·logd
Clarkson et al. `12                      d^4.5·log^1.5 d
Clarkson & Woodruff `12   d²·logd        d^8
Mahoney & Meng `12        d²             d^3.5
Nelson & Nguyen `12       d^(1+α)
This paper                d·logd         d^3.66
Hidden: runtime costs, ε^(-2) dependency

WHY IS ≈D POSSIBLE?
║Ax║p ≈ ║A'x║p for all x
• ║Ax║2² = xᵀAᵀAx
• AᵀA: d-by-d matrix
• Any factorization (e.g. QR) of AᵀA suffices as A'

AᵀA
• Covariance matrix
• Dot products of all pairs of columns (data)
• Covariance: cov(j1, j2) = Σi Ai,j1·Ai,j2

USE OF COVARIANCE MATRIX C = AᵀA
• Clustering: ℓ2 distances of all pairs are given by C
• Kernel methods: all-pair dot products suffice for many models

OTHER USES OF COVARIANCE
• Covariance of attributes is used to tune parameters
• Images + SIFT: many data points, few attributes
• http://www.image-net.org/: 14,197,122 images, 1000 SIFT features

HOW EXPENSIVE IS THIS?
• d² dot products of length-n vectors
• Total: O(nd²)
• Faster: O(nd^(ω−1))
• Expensive: nd² > nd ≥ m

EQUIVALENT VIEW OF SKETCHES
• Approximate covariance matrix: C' = (A')ᵀA'
• ║Ax║2 ≈ ║A'x║2 is the same as C ≈ C'

APPLICATION OF SKETCHES
• A': n' rows
• d² dot products of length-n' vectors
• Total cost: O(n'·d^(ω−1))

SKETCHES IN INPUT SPARSITY TIME
• Need: cost of computing C' < cost of computing C = AᵀA
• 2 goals:
  • n' small
  • A' found efficiently
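To make the CountSketch option concrete, here is a minimal NumPy sketch (not the paper's code) of A' = SA with one non-zero per column of S; the row count n_rows = 4d² and all names are illustrative.

    import numpy as np

    def countsketch(A, n_rows, seed=0):
        """A' = S A for a CountSketch S: each row of A is hashed to one
        of n_rows buckets and added with a random sign, so forming A'
        costs one pass over the non-zeros of A."""
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        bucket = rng.integers(0, n_rows, size=n)    # h(i): target row in A'
        sign = rng.choice([-1.0, 1.0], size=n)      # s(i): random +/- 1
        A_sk = np.zeros((n_rows, A.shape[1]))
        np.add.at(A_sk, bucket, sign[:, None] * A)  # A'[h(i)] += s(i) * A[i]
        return A_sk

    # C' = (A')^T A' approximates C = A^T A using n' = O(d^2) rows.
    rng = np.random.default_rng(1)
    n, d = 20000, 10
    A = rng.standard_normal((n, d))
    A_sk = countsketch(A, n_rows=4 * d * d)
    C, C_sk = A.T @ A, A_sk.T @ A_sk
    # Relative spectral error; shrinks as n_rows grows.
    print(np.linalg.norm(C_sk - C, 2) / np.linalg.norm(C, 2))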
COST AND QUALITY OF A'
                          p=2                           p=1
                          cost            size          cost             size
Dasgupta et al. `09                                     n·d^5            d^2.5
Magdon-Ismail `10         nd²/logd        d·log²d
Sohler & Woodruff `11                                   n·d^(ω−1+α)      d^3.5
Drineas et al. `12        nd·logd + d^ω   d·logd
Clarkson et al. `12                                     nd·logd          d^4.5·log^1.5 d
Clarkson & Woodruff `12   m               d²·logd       m + d^7          d^8
Mahoney & Meng `12        m               d²            m·logn + d^8     d^3.5
Nelson & Nguyen `12       m               d^(1+α)
This paper                m + d^(ω+α)     d·logd        m + d^(ω+α)      d^3.66

OUTLINE
• Matrix Sketches
• How? Existence
• Samples → better samples
• Iterative algorithms

PREVIOUS APPROACHES
• A (m entries) → "a miracle happens" → A' (poly(d) rows)
• Go to poly(d) rows directly
• Projection to obtain key info, or the sketch itself

OUR MAIN APPROACH
• A → A'' → A'
• Utilize the robustness of sketches, covariance matrices, and sampling
• Iteratively reduce errors and sizes

BETTER ALGORITHM FOR P=2
• (Same table as COST AND QUALITY OF A' above; the p=2 column is the target here.)
• This paper: p=2 cost m + d^(ω+α), size d·logd

COMPOSING SKETCHES
• A (n rows) → O(m) → A'' (n' = d^(1+α) rows) → O(n'·d·logd + d^ω) → A' (d·logd rows)
• Total cost: O(m + n'·d·logd + d^ω) = O(m + d^ω)

ACCUMULATION OF ERRORS
• ║Ax║2 ≈k ║A''x║2 and ║A''x║2 ≈k' ║A'x║2 give ║Ax║2 ≈k·k' ║A'x║2
• A (n rows) → A'' (n' = d^(1+α) rows) → A' (d·logd rows)

ACCUMULATION OF ERRORS
• Final error: product of both errors
• Dependency of error in cost: usually ε^(-2) or more for 1 ± ε error
• [Avron & Toledo `11]: only the final step needs to be accurate
• Idea: compute sketches indirectly

ROW SAMPLING
• Pick some rows of A to be A'
• How to pick? Random sampling

ARE ALL ROWS EQUAL?
• Not always: if A has a column with a single non-zero entry, then
  ║A·[1; 0; …; 0]║p ≠ 0 is determined by one row, which must be kept

ROW SAMPLING
• τ': weights on rows → a distribution
• Pick a number of rows independently from this distribution, rescale to form A'

MATRIX CHERNOFF BOUNDS
• Sufficient property of τ'
• τ: statistical leverage scores
• If τ' ≥ τ, then ║τ'║1·logd (scaled) rows suffice for A' ≈ A

STATISTICAL LEVERAGE SCORES
• Studied in statistics since the 70s
• Importance of rows
• Leverage score of row i, Ai: τi = Ai(AᵀA)^(-1)Aiᵀ
• Key fact: ║τ║1 = rank ≤ d
• So ║τ║1·logd = d·logd rows

COMPUTING LEVERAGE SCORES
• τi = Ai(AᵀA)^(-1)Aiᵀ = Ai·C^(-1)·Aiᵀ
• AᵀA: covariance matrix, C
• Given C^(-1), each τi can be computed in O(d²) time
• Total cost: O(nd² + d^ω)

COMPUTING LEVERAGE SCORES
• τi = Ai·C^(-1)·Aiᵀ = ║Ai·C^(-1/2)║2²
• 2-norm of a vector, Ai·C^(-1/2)
• C^(-1/2) puts the rows in isotropic position
• Decorrelates the columns
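A minimal NumPy sketch of the two steps above: exact leverage scores τi = Ai·C^(-1)·Aiᵀ, then leverage-score row sampling with rescaling as in MATRIX CHERNOFF BOUNDS. The pseudo-inverse (for rank safety) and the constant 20 in the O(d·logd) sample count are assumptions beyond the slides.

    import numpy as np

    def leverage_scores(A):
        """tau_i = A_i (A^T A)^+ A_i^T, i.e. the squared row norms of A C^{-1/2}."""
        C = A.T @ A                      # covariance matrix, d x d
        Cinv = np.linalg.pinv(C)         # pseudo-inverse for rank safety
        # tau_i = A_i . Cinv . A_i^T for all rows at once
        return np.einsum('ij,jk,ik->i', A, Cinv, A)

    def leverage_sample(A, tau, n_prime, seed=0):
        """Sample n_prime rows i.i.d. with prob. p_i = tau_i / sum(tau),
        rescaling row i by 1/sqrt(n_prime * p_i) so E[(A')^T A'] = A^T A."""
        rng = np.random.default_rng(seed)
        p = tau / tau.sum()
        idx = rng.choice(len(tau), size=n_prime, p=p)
        return A[idx] / np.sqrt(n_prime * p[idx])[:, None]

    A = np.random.default_rng(1).standard_normal((20000, 10))
    tau = leverage_scores(A)
    print(tau.sum())                     # = rank(A) <= d
    n_prime = int(tau.sum() * np.log(A.shape[1]) * 20)  # O(d log d) rows
    A_prime = leverage_sample(A, tau, n_prime)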
ASIDE: WHAT IS LEVERAGE?
Geometric view:
• Rows define 'energy' directions
• Normalize so total energy is uniform
• τi: norm of row i after normalizing

ASIDE: WHAT IS LEVERAGE?
How to interpret statistical leverage scores?
• Statistics ([Hoaglin-Welsh `78], [Chatterjee-Hadi `86]):
  • Influence on the data set
  • Likelihood of being an outlier
  • Uniqueness of the row

ASIDE: WHAT IS LEVERAGE?
High leverage score:
• Key attribute?
• Outlier (measuring error)?

ASIDE: WHAT IS LEVERAGE?
My current view (motivated by graph sparsification):
• Sampling probabilities
• Use them to find sketches

COMPUTING LEVERAGE SCORES
• τi = ║Ai·C^(-1/2)║2²
• Only need τ' ≥ τ
• Can use approximations after scaling them up
• Error leads to a larger ║τ'║1

DIMENSIONALITY REDUCTION
• ║x║2² ≈jl ║Gx║2²
• Johnson-Lindenstrauss transform
• G: d-by-O(1/α) Gaussian
• Error factor: jl = d^α

ESTIMATING LEVERAGE SCORES
• τi = ║Ai·C^(-1/2)║2² ≈jl ║Ai·C^(-1/2)·G║2²
• G: d-by-O(1/α) Gaussian
• C^(-1/2)·G: d-by-O(1/α)
• Cost: O(α·nnz(Ai)) per row; total O(α·m + α·d²·logd)

ESTIMATING LEVERAGE SCORES
• τi = ║Ai·C^(-1/2)║2² ≈ ║Ai·C'^(-1/2)║2²
• C ≈k C' gives ║C^(-1/2)·x║2 ≈k ║C'^(-1/2)·x║2
• Using C' as a preconditioner for C
• Can also combine with JL

ESTIMATING LEVERAGE SCORES
• τi' = ║Ai·C'^(-1/2)·G║2² ≈jl ║Ai·C'^(-1/2)║2² ≈k τi, so τi' ≈jl·k τi
• (jl·k)·τ' ≥ τ
• Total number of rows: ║jl·k·τ'║1 ≤ jl·k·║τ'║1 ≤ k·d^(1+α)

ESTIMATING LEVERAGE SCORES
• (jl·k)·τ' ≥ τ
• ║jl·k·τ'║1 ≤ jl·k·d^(1+α)
• Quality of A' does not depend on the quality of τ'
• C ≈k C' gives A' ≈2 A with O(k·d^(1+α)) rows in O(m + d^ω) time
• Some fixable issues remain when n >>> d
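A sketch of the combined estimator τi' = ║Ai·C'^(-1/2)·G║2², assuming some coarse sketch supplies C' ≈k C (a Gaussian sketch stands in below); jl_cols and the scale-up factor slack are illustrative stand-ins for the O(1/α) columns and the jl·k factor.

    import numpy as np

    def approx_leverage(A, A_coarse, jl_cols=10, slack=4.0, seed=0):
        """tau'_i = slack * ||A_i C'^{-1/2} G||_2^2, C' = A_coarse^T A_coarse.

        B = C'^{-1/2} G is formed once (d x jl_cols), so each row costs
        O(nnz(A_i) * jl_cols); scaling by `slack` aims at tau' >= tau."""
        rng = np.random.default_rng(seed)
        d = A.shape[1]
        # (Pseudo-)inverse square root of C' via an eigendecomposition
        w, V = np.linalg.eigh(A_coarse.T @ A_coarse)
        inv_sqrt = np.zeros_like(w)
        good = w > 1e-12 * w.max()
        inv_sqrt[good] = 1.0 / np.sqrt(w[good])
        G = rng.standard_normal((d, jl_cols)) / np.sqrt(jl_cols)  # JL columns
        B = (V * inv_sqrt) @ (V.T @ G)                            # C'^{-1/2} G
        return slack * np.sum((A @ B) ** 2, axis=1)               # per-row estimate

    # A coarse Gaussian sketch stands in for whatever produced C' ~k C.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((20000, 10))
    S = rng.standard_normal((400, A.shape[0])) / np.sqrt(400)
    tau_prime = approx_leverage(A, S @ A)
    print(tau_prime.sum())   # inflated total, vs. exact sum = rank <= d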
SIZE REDUCTION
• A'' ≈O(1) A
• C'' ≈O(1) C
• τ' ≈O(1) τ
• A' ≈O(1) A, with O(d^(1+α)·logd) rows

HIGH ERROR SETTING
• A'' ≈k A
• C'' ≈k C
• τ' ≈k τ
• A' ≈O(1) A, with O(k·d^(1+α)·logd) rows

ACCURACY BOOSTING
• Can reduce any error k in O(m + k·d^(ω+α)) time
• All intermediate steps can have large (constant) error

OUTLINE
• Matrix Sketches
• How? Existence
• Samples → better samples
• Iterative algorithms

ONE STEP SKETCHING
• A (m entries) → "a miracle happens" → A'' (poly(d) rows) → A' (d·logd rows)
• Obtain a sketch of size poly(d)
• Error-correct to O(d·logd) rows in poly(d) time

WHAT WE WILL SHOW
• A number of iterative steps can give a similar result
• More work, less miraculous, more robust
• Key idea: find leverage scores

ALGORITHMIC PICTURE
• Sketch, covariance matrix, and leverage scores (A', C', τ') with error k
  give all three with high accuracy in O(m + k·d^(ω+α)) time

OBSERVATIONS
• C', A', τ' with error k → error O(1) at an O(k) size increase
• Error does not accumulate
• Can loop around many times
• Unused parameter: the size of A

OUR APPROACH
• Create a shorter matrix As s.t. the total leverage score of each block of A is close

LEVERAGE SCORE OF A BLOCK
• Total leverage of a block: Σ_{i in block} τi = ║A_block·C^(-1/2)║F² ≈ ║G·A_block·C^(-1/2)║F²
• Sum of leverage scores = squared Frobenius norm of A_block·C^(-1/2)
• Preserved (≈) under random projection
• G: O(1)-by-k, so G·A_block has O(1) rows

SIZE REDUCTION
Recursing on As gives leverage scores that:
• Sum to ≤ d
• Can be used to row sample A

ALGORITHM
• C', A', τ' with error k → error O(1), O(k) size increase
• Decrease size by a factor of d^α, recurse
• Bring back leverage scores
• Reduce error

PROBLEM
• Leverage scores in As are measured using Cs = AsᵀAs
• Already have a bound on the total; it suffices to show ║x·C^(-1/2)║2 ≤ k·║x·Cs^(-1/2)║2

PROOF SKETCH
Need: ║x·C^(-1/2)║2 ≤ k·║x·Cs^(-1/2)║2
• Show ║Cs^(1/2)·x║2 ≤ k·║C^(1/2)·x║2
• Invert both sides
• Some issues when As has smaller rank than A

║Cs^(1/2)·x║2 ≤ k·║C^(1/2)·x║2
b: blocks of As
║Cs^(1/2)·x║2² = ║As·x║2²
              = Σb Σi (Gi,b·Ab·x)²
              ≤ Σb Σi ║Gi,b║2²·║Ab·x║2²
              ≤ (maxb Σi ║Gi,b║2²)·║Ax║2²
              ≤ O(k·logn)·║Ax║2²

P=1, OR ARBITRARY P
║Ax║p ≈ ║A'x║p for all x
• The same approach can still work
• p-norm leverage scores
• Need: a well-conditioned basis U for the column space

QUALITY OF BASIS (P=1)
• Quality of U: maximum distortion in the dual norm: β = max_{x≠0} ║Ux║∞ / ║x║∞
• Analog of leverage scores: τi = β·║Ui,:║1
• Total number of rows: β·║U║1

BASIS CONSTRUCTION
• C1, U with error k → error O(1), O(k) size increase
• Basis via a linear transform: U = A·C1
• Compute ║Ui║1 using p-stable distributions (Indyk `06) instead of JL

ITERATIVE ALGORITHM FOR P=1
• C1 = C^(-1/2), the ℓ2 basis
• Quality of U = A·C1: β·║Ui,:║1 = n^(1/2)·d
• Too coarse for a single step, but good enough to iterate
• n approaches poly(d) quickly
• Need to run the ℓ2 algorithm for C

SUMMARY
                          p=2                      p=1
                          cost for d·logd rows     cost             size
Sohler & Woodruff `11                              n·d^(ω−1+α)      d^3.5
Drineas et al. `12        nd·logd + d^ω
Clarkson et al. `12                                nd·logd          d^4.5·log^1.5 d
Clarkson & Woodruff `12   m + d³·log²d             m + d^7          d^8
Mahoney & Meng `12        m + d³·logd              m·logn + d^8     d^3.5
Nelson & Nguyen `12       m + d^ω
This paper                m + d^(ω+α)              m + d^(ω+α)      d^3.66
• Robust steps → algorithms
• ℓ2: more complicated than sketching
• Smaller overhead for general p-norms

FUTURE WORK
• What are leverage scores???
• Iterative low-rank approximation?
• Better p-norm leverage scores?
• A more streamlined view of the projections in our algorithm?
• Empirical evaluation?

THANK YOU!
Questions?
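Backup: a minimal sketch of the block compression step from OUR APPROACH and LEVERAGE SCORE OF A BLOCK, with illustrative block_size and g_rows; the recursion and the redistribution of scores back to rows of A are only indicated in comments.

    import numpy as np

    def compress_blocks(A, block_size, g_rows=5, seed=0):
        """Form A_s by replacing each block A_b (block_size rows) with G_b @ A_b,
        G_b a small Gaussian, so ||G_b A_b C^{-1/2}||_F ~ ||A_b C^{-1/2}||_F:
        each block's total leverage mass is roughly preserved while the
        row count drops by a factor of block_size / g_rows."""
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        out = []
        for start in range(0, n, block_size):
            block = A[start:start + block_size]
            G = rng.standard_normal((g_rows, block.shape[0])) / np.sqrt(g_rows)
            out.append(G @ block)
        return np.vstack(out)

    A = np.random.default_rng(1).standard_normal((20000, 10))
    A_s = compress_blocks(A, block_size=200)   # 20000 -> 500 rows
    # Recurse on A_s to estimate leverage scores, then sample rows of A itself,
    # distributing each compressed row's score back over its block.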