Min-wise hashing for large-scale regression
Rajen D. Shah (Statistical Laboratory, University of Cambridge), joint work with Nicolai Meinshausen (Seminar für Statistik, ETH Zürich)
Theory of Big Data, UCL, 8 January 2015

Large-scale data

High-dimensional data: large number p of predictors, small number n of observations. The main challenge is a statistical one: how can we obtain accurate estimates with so few observations?

Large-scale data: large p, large n. Data is not the only relevant resource to consider: the computational budget is also an issue (both memory and computing power).

Large-scale sparse regression

Text analysis. Given a collection of documents, construct variables which count the number of occurrences of different words. One can add variables giving the frequency of pairs of words (bigrams) or triples of words (trigrams). An example from Kogan (2009) contains 4,272,227 predictor variables for n = 16,087 documents.

URL reputation scoring (Ma et al., 2009). Information about a URL comprises more than 3 million variables, which include word-stem presence and geographical information, for example. Number of observations n > 2 million.

In both of these examples the design matrices are sparse. This is very different from sparsity in the signal, which is the typical assumption in high-dimensional statistics.

Dimension reduction

Our computational budget may mean that OLS and ridge regression are computationally infeasible. The computer science literature has a variety of algorithms for forming a low-dimensional "sketch" of the design matrix, i.e. a mapping

  X ∈ R^{n×p} ↦ S ∈ R^{n×L},  L ≪ p.

The idea is then to perform the regression on S rather than the larger X.
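As a minimal illustration of this sketch-then-regress pattern (not code from the talk), the toy Python snippet below uses a Gaussian random projection as the sketch map; random projections reappear later as a comparison method, while the min-wise hashing sketch that is the subject of the talk is constructed on the following slides. All names, sizes and parameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, L = 500, 10_000, 100                          # L << p

X = (rng.random((n, p)) < 0.001).astype(float)      # sparse binary design (toy)
X[:, :10] = (rng.random((n, 10)) < 0.2).astype(float)   # a few denser columns
beta = np.zeros(p)
beta[:10] = 1.0                                     # informative variables (toy choice)
y = X @ beta + 0.1 * rng.standard_normal(n)

A = rng.standard_normal((p, L)) / np.sqrt(L)        # Gaussian sketch map
S = X @ A                                           # the n x L sketch of X

lam = 1.0
b_hat = np.linalg.solve(S.T @ S + lam * np.eye(L), S.T @ y)   # ridge on S only
print("in-sample correlation:", np.corrcoef(S @ b_hat, y)[0, 1])
```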
Linear model with sparse design

[Schematic: the target Y ∈ R^n is approximately the sparse design matrix X ∈ R^{n×p} times a sparse coefficient vector β* ∈ R^p, plus noise ε ∈ R^n; non-zero entries are marked with *.]

Can we safely reduce our sparse p-dimensional problem to a (possibly dense) L-dimensional one with L ≪ p?

[Schematic: sparse X ∈ R^{n×p} and β* ∈ R^p replaced by a dense S ∈ R^{n×L} and b ∈ R^L with Xβ* ≈ Sb.]

The approach we take here is based on min-wise hashing (Broder, 1997), and more specifically on a variant called b-bit min-wise hashing (Li and König, 2011). Note that PCA may seem like an obvious choice, but it may be too computationally expensive.

Min-wise hashing matrix M

b-bit min-wise hashing begins with the creation of a min-wise hashing matrix M. Writing z_i = {k : X_ik ≠ 0} for the set of non-zero variables in row i, column l of M is generated by a random permutation π_l of the variables:

  M_il = min_{k ∈ z_i} π_l(k).

[Worked example on the slides: a 5 × 4 binary X and two random permutations π_1, π_2 give the 5 × 2 matrix M with columns (1, 2, 2, 1, 1) and (3, 1, 1, 1, 2).]

b-bit min-wise hashing (Li and König, 2011)

b-bit min-wise hashing stores the lowest b bits of each entry of M when expressed in binary. In the case b = 1 this is the residue mod 2:

  M^(1)_il ≡ M_il (mod 2).

[In the example, M^(1) has columns (1, 0, 0, 1, 1) and (1, 1, 1, 1, 0).]

Perform the regression using the binary n × L matrix M^(1) rather than X. When L ≪ p this gives large computational savings, and empirical studies report good performance (mostly for classification with SVMs).
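A short Python sketch of this construction (my own illustrative code and variable names, assuming a binary X with no empty rows; the small sizes are arbitrary):

```python
import numpy as np

def minwise_M(X, L, rng):
    """M[i, l] = min_{k in z_i} pi_l(k), where z_i indexes the non-zeros of row i."""
    n, p = X.shape
    M = np.empty((n, L), dtype=int)
    for l in range(L):
        pi = rng.permutation(p) + 1            # permutation pi_l with values 1, ..., p
        for i in range(n):
            z_i = np.flatnonzero(X[i])         # non-zero variables of row i
            M[i, l] = pi[z_i].min()
    return M

rng = np.random.default_rng(1)
X = (rng.random((5, 4)) < 0.5).astype(int)
X[:, 0] = 1                                    # make sure no row is empty
M = minwise_M(X, L=2, rng=rng)
M1 = M % 2                                     # 1-bit min-wise hashing: lowest bit of M
print(M)
print(M1)
```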
Random sign min-wise hashing

A procedure that is closely related to 1-bit min-wise hashing is what we call random sign min-wise hashing. It is easier to analyse, deals with sparse design matrices with real-valued entries, and allows for the construction of a variable importance measure.

Whereas 1-bit min-wise hashing keeps the last bit of each entry of M, random sign min-wise hashing chooses random sign assignments {1, . . . , p} → {−1, 1}, independently for all columns l = 1, . . . , L, when going from M_l to S_l. [Example: with the M above and signs drawn independently for each column, the slides obtain S with columns (1, −1, −1, 1, 1) and (−1, −1, −1, −1, 1).]

The H and M matrices

Rather than storing M, we can store the "responsible" variables in H:

  M_il = min_{k ∈ z_i} π_l(k),   H_il = argmin_{k ∈ z_i} π_l(k).

[In the example, H has columns (2, 3, 3, 2, 2) and (4, 3, 3, 3, 1).]

Continuous variables

By using H rather than M, we can handle continuous variables. We obtain n × L matrices H and S given by

  H_il = argmin_{k ∈ z_i} π_l(k),   S_il = Ψ_{H_il, l} X_{i, H_il},

where Ψ_{h,l} is the random sign of the h-th variable in the l-th permutation. [Example: replacing two of the 1s in the binary X above by 7.1 and 4.2 simply replaces the corresponding ±1 entries of S by ±7.1 and ±4.2.]

Random sign min-wise hashing and Jaccard similarity

When X is binary, S_il = Ψ_{H_il, l}. Take i ≠ j and write z_i = {k : X_ik = 1}. Then

  E(S_il S_jl) = E(S_il S_jl | H_il = H_jl) P(H_il = H_jl) + E(S_il S_jl | H_il ≠ H_jl) P(H_il ≠ H_jl)
             = P(H_il = H_jl)
             = |z_i ∩ z_j| / |z_i ∪ z_j|.

The final quantity is known as the Jaccard similarity between z_i and z_j.
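The construction of H and S, together with a Monte Carlo check of the identity E(S_il S_jl) = |z_i ∩ z_j|/|z_i ∪ z_j| for two binary rows, might look as follows (again my own illustrative code, not the authors'):

```python
import numpy as np

def sign_minwise(X, L, rng):
    """Return H (responsible variables) and S (their randomly signed values)."""
    n, p = X.shape
    H = np.empty((n, L), dtype=int)
    S = np.empty((n, L))
    for l in range(L):
        pi = rng.permutation(p)                       # permutation pi_l
        psi = rng.choice([-1.0, 1.0], size=p)         # signs Psi_{., l}
        for i in range(n):
            z_i = np.flatnonzero(X[i])
            h = z_i[np.argmin(pi[z_i])]               # H_il = argmin_{k in z_i} pi_l(k)
            H[i, l] = h
            S[i, l] = psi[h] * X[i, h]                # S_il = Psi_{H_il, l} X_{i, H_il}
    return H, S

rng = np.random.default_rng(2)
X = (rng.random((2, 50)) < 0.3).astype(float)         # two binary rows
X[:, 0] = 1.0                                         # ensure no empty rows
H, S = sign_minwise(X, L=20_000, rng=rng)

z0, z1 = set(np.flatnonzero(X[0])), set(np.flatnonzero(X[1]))
print("mean of S_0l * S_1l:", (S[0] * S[1]).mean())
print("Jaccard similarity :", len(z0 & z1) / len(z0 | z1))
```

The helper sign_minwise defined here is reused in the later sketches.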
Random sign min-wise hashing and Jaccard similarity

Consider performing ridge regression using S with regularisation parameter Lλ. The fitted values are

  S(LλI + SᵀS)⁻¹SᵀY = SSᵀ(LλI + SSᵀ)⁻¹Y = (1/L) SSᵀ(λI + (1/L) SSᵀ)⁻¹Y.

As L → ∞, the fitted values tend to J(λI + J)⁻¹Y, where J is the n × n matrix of Jaccard similarities. For comparison, the fitted values from ridge regression on the original X are XXᵀ(λI + XXᵀ)⁻¹Y. Thus ridge regression using S approximates a kernel ridge regression using the Jaccard similarity kernel.
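A quick numerical check of this limit, under a toy setup of my own and reusing the sign_minwise helper from the previous sketch: as L grows, the ridge-on-S fitted values (written in their kernel form) approach the Jaccard kernel ridge fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 30, 200, 0.5
X = (rng.random((n, p)) < 0.1).astype(float)
X[:, 0] = 1.0                                          # avoid empty rows
y = rng.standard_normal(n)                             # any response will do here

# Jaccard kernel ridge fitted values: J (lam I + J)^{-1} y
Z = [set(np.flatnonzero(X[i])) for i in range(n)]
J = np.array([[len(Z[i] & Z[j]) / len(Z[i] | Z[j]) for j in range(n)]
              for i in range(n)])
fit_J = J @ np.linalg.solve(lam * np.eye(n) + J, y)

# Ridge on S with parameter L*lam, in kernel form (1/L) S S^T (lam I + (1/L) S S^T)^{-1} y
for L in (100, 1000, 5000):
    _, S = sign_minwise(X, L, rng)                     # helper from the earlier sketch
    K = S @ S.T / L
    fit_S = K @ np.linalg.solve(lam * np.eye(n) + K, y)
    print(L, np.max(np.abs(fit_S - fit_J)))            # should shrink as L grows
```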
Approximation error

Can we construct a b* ∈ R^L such that Xβ* is close to Sb* on average? Equivalently, is there a b* such that Sb* is unbiased for Xβ* when averaged over the random permutations and sign assignments?

[Schematic: sparse X ∈ R^{n×p} and β* ∈ R^p versus dense S ∈ R^{n×L} and b* ∈ R^L with E_{π,Ψ}(Sb*) = Xβ*.]

Assume for now that there are q ≤ p non-zero entries in each row of X; unequal row sparsity can also be dealt with. Consider a single permutation (L = 1) with min-hash values H_i, i = 1, . . . , n, and random signs ψ_k, k = 1, . . . , p, so that S has entries S_i = ψ_{H_i} X_{i, H_i}. Take

  b* := q Σ_{k=1}^p β*_k ψ_k.

Since the signs are independent and P(H_i = k) = 1/q for every k ∈ z_i,

  E_{π,ψ}(S_i b*) = q Σ_{k=1}^p X_ik β*_k P(H_i = k) = Σ_{k=1}^p X_ik β*_k = (Xβ*)_i,

so the approximation is unbiased.

Theorem. Let b* ∈ R^L be defined by

  b*_l = (q/L) Σ_{k=1}^p β*_k Ψ_kl w_{π_l(k)},

where w is a vector of weights. Then there is a choice of w such that:
(i) the approximation is unbiased: E_{π,Ψ}(Sb*) = Xβ*;
(ii) E_{π,Ψ}(‖b*‖₂²) ≤ 2q‖β*‖₂²/L;
(iii) if ‖X‖_∞ ≤ 1, then (1/n) E_{π,Ψ}(‖Sb* − Xβ*‖₂²) ≤ 2q‖β*‖₂²/L.

Linear model

Assume the model Y = α*1 + Xβ* + ε, where the random noise ε ∈ R^n satisfies E(ε_i) = 0, E(ε_i²) = σ² and Cov(ε_i, ε_j) = 0 for i ≠ j. We give bounds on a mean-squared prediction error (MSPE) of the form

  MSPE((α̂, b̂)) := E_{ε,π,Ψ} ‖α*1 + Xβ* − α̂1 − Sb̂‖₂² / n.

Ordinary least squares

Theorem. Let (α̂, b̂) be the least squares estimator and let L* = √(2qn) ‖β*‖₂ / σ. We have

  MSPE((α̂, b̂)) ≤ 2 max(L/L*, L*/L) √(2q/n) ‖β*‖₂ σ + σ²/n.

Imagine a situation where more predictors are added to the design matrix but their associated coefficients are all 0, so ‖β*‖₂ = O(1). The MSPE bound then grows only like √q, compared to the factor of p we would see if OLS were applied to the original X. Additionally assume n = O(q) (this ensures the MSPE is bounded asymptotically). Then L* = O(q), which could be a substantial reduction over p.

Ridge regression

Define

  (α̂, b̂^η) := argmin_{(α,b) ∈ R×R^L} ‖Y − α1 − Sb‖₂²  subject to  ‖b‖₂² ≤ (1 + η) 2q‖β*‖₂²/L.

Theorem. Let ρ := exp(−Lη² / (36q(36 + η))). Then

  MSPE((α̂, b̂^η)) ≤ √(2q) ‖β*‖₂ (2σ√(1 + η) + σ L*/L) / √n + σ²/n + ρ ‖Xβ*‖₂²/n.

A similar result is available for logistic regression.

Interaction models

Let f* ∈ R^n be given by

  f*_i = Σ_{k=1}^p X_ik θ^{*,(1)}_k + Σ_{k,k₁=1}^p X_ik 1{X_{i,k₁} = 0} Θ^{*,(2)}_{k,k₁},  i = 1, . . . , n.

Assume ‖X‖_∞ ≤ 1. The previous results hold if ‖β*‖₂ is replaced by

  ℓ(Θ*) := ‖θ^{*,(1)}‖₂ + ( Σ_{k,k₁,k₂} Θ^{*,(2)}_{k,k₁} Θ^{*,(2)}_{k,k₂} )^{1/2}.

Theorem. There exists b* ∈ R^L such that
(i) E_{π,Ψ}(Sb*) = f*;
(ii) E_{π,Ψ}(‖Sb* − f*‖₂²)/n ≤ 2q ℓ²(Θ*)/L.

If there are a finite number of non-zero interaction terms with finite values, the approximation error becomes very small once L ≫ q².
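As a rough illustration of the interaction result (a toy simulation of my own, reusing sign_minwise from the earlier sketch; the sizes are chosen so that L is comfortably larger than q²), one can generate a response driven purely by an interaction between two sparse binary variables and fit a main-effects ridge regression on S:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, L, lam = 250, 500, 1200, 1.0
X = (rng.random((n, p)) < 0.02).astype(float)           # sparse binary background
X[:, 0] = 1.0                                           # constant "token": no empty rows
X[:, 1:3] = (rng.random((n, 2)) < 0.5).astype(float)    # two denser variables
y = 2.0 * X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(n)   # pure interaction signal

_, S = sign_minwise(X, L, rng)                          # helper from the earlier sketch
tr = rng.random(n) < 0.5                                # random train/test split
b_hat = np.linalg.solve(S[tr].T @ S[tr] + lam * np.eye(L), S[tr].T @ y[tr])
print("test correlation:", np.corrcoef(S[~tr] @ b_hat, y[~tr])[0, 1])
```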
Prediction error

Assume the linear model from before, but with Xβ* replaced by f*.

Theorem. Let b̂ be the least squares estimator and let L* = √(2qn) ℓ(Θ*)/σ. We have

  MSPE(b̂) ≤ 2 max(L/L*, L*/L) √(2q/n) σ ℓ(Θ*) + σ²/n.

Consider a situation where there are a fixed number of interaction and main effects of fixed size, so ℓ(Θ*) = O(√q). If n, q and p increase by collecting new data and adding uninformative variables, then in order for the MSPE to vanish asymptotically, we require q²/n → 0.

Variable importance

The predicted values are f̂ = Sb̂. Let f̂^{(−k)} be the predictions obtained when setting X_k = 0. If the underlying model is linear and contains only main effects, then f̂ − f̂^{(−k)} ≈ X_k β*_k.

Construct S̃ in exactly the same way as S, but using a matrix H̃ rather than H, with H̃ defined by

  H̃_il := argmin_{k ∈ z_i \ {H_il}} π_l(k).

Store the n × L matrices S, S̃ and H. Then

  f̂_i − f̂_i^{(−k)} = Σ_{l=1}^L (S_il − S̃_il) 1{H_il = k} b̂_l.
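A sketch of how these importance quantities can be computed from stored H, S and the runner-up matrices (my own implementation of the construction above; variable names and sizes are arbitrary):

```python
import numpy as np

def sign_minwise_full(X, L, rng):
    """H, S as before, plus the runner-up matrices Ht (H~) and St (S~)."""
    n, p = X.shape
    H = np.empty((n, L), dtype=int);  S = np.empty((n, L))
    Ht = np.empty((n, L), dtype=int); St = np.empty((n, L))
    for l in range(L):
        pi = rng.permutation(p)
        psi = rng.choice([-1.0, 1.0], size=p)
        for i in range(n):
            z_i = np.flatnonzero(X[i])                 # needs at least two non-zeros
            first, second = z_i[np.argsort(pi[z_i])[:2]]
            H[i, l], S[i, l] = first, psi[first] * X[i, first]
            Ht[i, l], St[i, l] = second, psi[second] * X[i, second]
    return H, S, Ht, St

def importance_k(k, H, S, St, b_hat):
    """f_hat_i - f_hat_i^{(-k)} = sum_l (S_il - S~_il) 1{H_il = k} b_hat_l."""
    return ((S - St) * (H == k)) @ b_hat

rng = np.random.default_rng(5)
n, p, L = 60, 40, 300
X = (rng.random((n, p)) < 0.25).astype(float)
X[:, :2] = 1.0                                         # guarantee >= 2 non-zeros per row
y = X @ (np.arange(p) / p) + 0.1 * rng.standard_normal(n)
H, S, Ht, St = sign_minwise_full(X, L, rng)
b_hat = np.linalg.solve(S.T @ S + np.eye(L), S.T @ y)  # ridge fit on S
print(importance_k(5, H, S, St, b_hat)[:5])            # contribution of X_5, first 5 rows
```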
Aggregation

The compressed design matrix S is generated in a random fashion, so we can repeat the construction B > 1 times to obtain B different S matrices. In the spirit of bagging (Breiman, 1996), we can then aggregate the predictions obtained from the different random mappings by averaging them. Using B > 1 often gives large improvements, and our experience has been that L can then be chosen much lower than for B = 1 to achieve the same predictive accuracy.

Text data

We have p = 4,272,227 predictor variables for n = 16,087 observations. We created artificial responses from (i) a linear model and (ii) a model with interactions, and compared prediction accuracy with random projections: this method takes a matrix A ∈ R^{p×L} with i.i.d. N(0, 1) entries and forms a sketch XA ∈ R^{n×L}.

[Figure: correlation between prediction and response under the linear-model response, at noise levels σ = 0, 0.5, 2, 5. Red: min-wise hashing; blue: random projections.]

[Figure: correlation between prediction and response under the interaction-model response, at σ = 0, 0.5, 2, 5. Red: min-wise hashing; blue: random projections.]

URL identification

Large-scale classification of malicious URLs with n ≈ 2 million and p ≈ 3 million. The data are ordered into consecutive days, and the response Y ∈ {0, 1}^n is a binary vector where 1 corresponds to a malicious URL. In order to compare min-wise hashing with Lasso- and ridge-penalised logistic regression, we split the data into the separate days, training on the first half of each day and testing on the second. This gives on average n ≈ 20,000 and p ≈ 100,000.

[Figure: misclassification error against L for B = 1 and B = 20, with ridge regression on the raw data as a reference.] Ridge regression following min-wise hashing performs better here than ridge regression applied to the original data.

[Figure: misclassification error against L for B = 1 and B = 20, with the Lasso on the raw data as a reference.] The Lasso with and without min-wise hashing have similar performance here.
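A minimal version of the aggregation scheme used in these experiments (my own toy setup, reusing sign_minwise from the earlier sketch): build B independent sketches, fit ridge regression on each, and average the test-set predictions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, L, B, lam = 200, 200, 200, 20, 1.0
X = (rng.random((n, p)) < 0.05).astype(float)
X[:, 0] = 1.0                                          # no empty rows
beta = np.zeros(p)
beta[:10] = 1.0                                        # toy sparse signal
y = X @ beta + 0.5 * rng.standard_normal(n)
tr = rng.random(n) < 0.5                               # train/test split of the rows

pred = np.zeros((~tr).sum())
for b in range(B):                                     # B independent random sketches
    _, S = sign_minwise(X, L, rng)                     # helper from the earlier sketch
    b_hat = np.linalg.solve(S[tr].T @ S[tr] + lam * np.eye(L), S[tr].T @ y[tr])
    pred += (S[~tr] @ b_hat) / B                       # aggregate by averaging
print("test correlation of aggregated predictions:", np.corrcoef(pred, y[~tr])[0, 1])
```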
Discussion

b-bit min-wise hashing and the closely related random sign min-wise hashing are interesting dimensionality reduction techniques for large-scale sparse design matrices. The prediction error following compression can be bounded with a slow rate (in the absence of assumptions on the design), and the method behaves similarly to random projections if only main effects are present. Regression using only main effects from the compressed, dense, low-dimensional matrix can capture interactions among the large number of original sparse variables.