Min-wise hashing for large-scale regression

Rajen D. Shah (Statistical Laboratory, University of Cambridge)
joint work with Nicolai Meinshausen (Seminar für Statistik, ETH
Zürich)
Theory of Big Data, UCL
8 January 2015
Large-scale data
High-dimensional data: large number p of predictors, small number n
of observations.
Main challenge is a statistical one: how can we obtain accurate
estimates with so few observations?
Large-scale data: large p, large n.
Data is not the only relevant resource to consider. The computational
budget is also an issue (both memory and computing power).
Large-scale sparse regression
Text analysis. Given a collection of documents, construct variables
which count the number of occurrences of different words. Can add
variables giving the frequency of pairs of words (bigrams) or triples of
words (trigrams). An example from (Kogan, 2009) contains
4,272,227 predictor variables for n = 16,087 documents.
URL reputation scoring (Ma et al., 2009). Information about a
URL comprises > 3 million variables which include word-stem
presence and geographical information, for example. Number of
observations, n > 2 million.
In both of these examples, the design matrices are sparse.
This is very different from sparsity in the signal, which is the typical
assumption in high-dimensional statistics.
Dimension reduction
Our computational budget may mean that OLS and Ridge regression
are computationally infeasible.
The computer science literature has a variety of algorithms to form a
low-dimensional “sketch” of the design matrix, i.e. a mapping
X ∈ R^{n×p} ↦ S ∈ R^{n×L}, with L ≪ p.
The idea is then to perform the regression on S rather than the larger
X.
Linear model with sparse design
Y ≈ Xβ* + ε, with target Y ∈ R^n, sparse design matrix X ∈ R^{n×p}, coefficient
vector β* ∈ R^p and noise ε ∈ R^n.
[Schematic: the vectors and matrices are drawn with their non-zero entries marked
with ∗; most entries of X are zero.]
Can we safely reduce our sparse p-dimensional problem to a (possibly
dense) L-dimensional one with L ≪ p?
[Schematic: the sparse X ∈ R^{n×p} with coefficients β* ∈ R^p is replaced by a
dense sketch S ∈ R^{n×L} with coefficients b ∈ R^L, so that Xβ* ≈ Sb.]
The approach we take here is based on min-wise hashing (Broder,
1997), and more specifically a variant called b-bit min-wise hashing
(Li and König, 2011).
Note that PCA may seem like an obvious choice, but it may be too
expensive to compute.
Min-wise hashing matrix M
b-bit min-wise hashing begins with the creation of a min-wise hashing
matrix M ∈ R^{n×L}. Each column of M is generated by a random permutation π_l
of the variables: M_il is the smallest value of π_l(k) over the variables k that are
non-zero in row i.
[Worked example on the slides: a 5 × 4 binary X is mapped, under two random
permutations π1 and π2, to the two columns (1, 2, 2, 1, 1) and (3, 1, 1, 1, 2) of M.]
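A minimal Python sketch of this construction (an illustration, not the authors' implementation), assuming a dense binary numpy array X whose rows each contain at least one non-zero entry; the function name minwise_hash_matrix is made up here.

import numpy as np

def minwise_hash_matrix(X, L, rng=None):
    # M[i, l] = min over k in z_i of pi_l(k), where z_i = {k : X[i, k] != 0}
    rng = np.random.default_rng(rng)
    n, p = X.shape
    nonzero = [np.flatnonzero(X[i]) for i in range(n)]   # the sets z_i
    M = np.empty((n, L), dtype=int)
    for l in range(L):
        pi = rng.permutation(p) + 1                      # random permutation pi_l of {1, ..., p}
        M[:, l] = [pi[z].min() for z in nonzero]
    return M

# Toy 5 x 4 binary X (entries chosen arbitrarily) giving a 5 x L matrix M.
X = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [1, 0, 0, 1]])
print(minwise_hash_matrix(X, L=2, rng=0))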
b-bit min-wise hashing (Li and König, 2011)
b-bit min-wise hashing stores the lowest b bits of each entry of M when
expressed in binary. In the case where b = 1, this is the residue mod 2:
M^(1)_il ≡ M_il (mod 2).
[Worked example continued: M with columns (1, 2, 2, 1, 1) and (3, 1, 1, 1, 2) is
mapped to the binary matrix M^(1) with columns (1, 0, 0, 1, 1) and (1, 1, 1, 1, 0).]
Perform regression using the binary n × L matrix M^(1) rather than X.
When L ≪ p this gives large computational savings, and empirical
studies report good performance (mostly for classification with SVMs).
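A hedged continuation of the sketch above: form M^(1) = M mod 2 and run a small ridge regression on it. The toy response and penalty value are arbitrary illustrations, not settings from the talk; X and minwise_hash_matrix come from the earlier sketch.

import numpy as np

L = 200
M = minwise_hash_matrix(X, L=L, rng=1)
M1 = (M % 2).astype(float)                # M^(1): keep the lowest bit of each entry (b = 1)

y = np.array([1.0, 0.5, 2.0, 0.0, 1.5])   # toy response, one entry per row of X
lam = 0.1                                 # illustrative ridge penalty
# Ridge regression on the binary sketch: coef = (lam*I + M1'M1)^{-1} M1'y
coef = np.linalg.solve(lam * np.eye(L) + M1.T @ M1, M1.T @ y)
fitted = M1 @ coef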
Random sign min-wise hashing
A procedure that is closely related to 1-bit min-wise hashing is what we
call random sign min-wise hashing.
Easier to analyse.
Deals with sparse design matrices with real-valued entries.
Allows for the construction of a variable importance measure.
Random sign min-wise hashing
1-bit min-wise hashing keeps the last bit of each entry: X ↦ M ↦ M^(1), as in the
worked example above.
Random sign min-wise hashing: random sign assignments
{1, . . . , p} → {−1, 1} are chosen independently for all columns
l = 1, . . . , L when going from M_l to S_l.
[Worked example: instead of the bits (1, 0, 0, 1, 1) and (1, 1, 1, 1, 0), the columns
of S carry random signs, here (1, −1, −1, 1, 1) and (−1, −1, −1, −1, 1).]
The H and M matrices
Rather than storing M, we can store the “responsible” variables in H:
M_il = min_{k ∈ z_i} π_l(k),    H_il = arg min_{k ∈ z_i} π_l(k).
[Worked example: X maps to M with columns (1, 2, 2, 1, 1) and (3, 1, 1, 1, 2), and
to H with columns (2, 3, 3, 2, 2) and (4, 3, 3, 3, 1); both lead to the same signed
matrix S with columns (1, −1, −1, 1, 1) and (−1, −1, −1, −1, 1).]
Continuous variables
By using H rather than M, we can handle continuous variables.
[Worked example: a sparse X with real-valued entries such as 7.1 and 4.2 maps to
H with columns (2, 3, 3, 2, 2) and (4, 3, 3, 3, 1), and to an S whose entries, e.g.
−4.2 and 7.1, carry the signed values of the responsible variables.]
We get n × L matrices H and S given by
H_il = arg min_{k ∈ z_i} π_l(k),    S_il = Ψ_{H_il, l} X_{i H_il},
where Ψ_hl is the random sign of the hth variable in the lth permutation.
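A minimal Python sketch of these two formulas (again an illustration rather than the authors' code), assuming a dense numpy array X whose rows each have at least one non-zero entry; sign_minwise_hash is a made-up name.

import numpy as np

def sign_minwise_hash(X, L, rng=None):
    # H[i, l] = argmin_{k in z_i} pi_l(k);  S[i, l] = Psi[H[i, l], l] * X[i, H[i, l]]
    rng = np.random.default_rng(rng)
    n, p = X.shape
    nonzero = [np.flatnonzero(X[i]) for i in range(n)]   # the sets z_i
    H = np.empty((n, L), dtype=int)
    S = np.empty((n, L))
    for l in range(L):
        pi = rng.permutation(p)                          # permutation pi_l
        psi = rng.choice([-1.0, 1.0], size=p)            # random signs Psi_{., l}
        for i, z in enumerate(nonzero):
            h = z[np.argmin(pi[z])]                      # responsible variable H_il
            H[i, l] = h
            S[i, l] = psi[h] * X[i, h]
    return H, S

X = np.array([[0.0, 1.0, 7.1, 0.0],
              [1.0, 0.0, 0.0, 4.2],
              [0.0, 1.0, 1.0, 0.0]])
H, S = sign_minwise_hash(X, L=4, rng=0)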
Random sign min-wise hashing and Jaccard similarity
When X is binary, S_il = Ψ_{H_il, l}.
Take i ≠ j and write z_i = {k : X_ik = 1}. Then
E(S_il S_jl) = E(S_il S_jl | H_il = H_jl) P(H_il = H_jl)
             + E(S_il S_jl | H_il ≠ H_jl) P(H_il ≠ H_jl)
           = P(H_il = H_jl)
           = |z_i ∩ z_j| / |z_i ∪ z_j|.
The final quantity is known as the Jaccard similarity between z_i and z_j.
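A small Monte Carlo check of this identity (illustrative only, reusing the sign_minwise_hash sketch above): the empirical average of S_il S_jl over many columns should be close to the Jaccard similarity of the two rows.

import numpy as np

Xb = np.array([[1, 1, 0, 1, 0, 0],
               [1, 0, 0, 1, 1, 0]], dtype=float)
z0, z1 = np.flatnonzero(Xb[0]), np.flatnonzero(Xb[1])
jaccard = len(np.intersect1d(z0, z1)) / len(np.union1d(z0, z1))

_, S = sign_minwise_hash(Xb, L=20000, rng=0)
print(jaccard, np.mean(S[0] * S[1]))      # the two numbers should be close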
Random sign min-wise hashing and Jaccard similarity
Consider performing ridge regression using S with regularisation
parameter Lλ. The fitted values are
S(LλI + S^T S)^{-1} S^T Y = SS^T(LλI + SS^T)^{-1} Y = (1/L) SS^T(λI + (1/L) SS^T)^{-1} Y.
As L → ∞, (1/L) SS^T tends to the matrix J of pairwise Jaccard similarities, so the
fitted values tend to J(λI + J)^{-1} Y.
The fitted values from ridge regression on the original X are
XX^T(λI + XX^T)^{-1} Y.
Ridge regression using S approximates a kernel ridge regression using
the Jaccard similarity kernel.
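A small numerical illustration of this limit (a sketch on arbitrary simulated data, reusing sign_minwise_hash from earlier): ridge on S with penalty Lλ against kernel ridge with the Jaccard kernel J.

import numpy as np

rng = np.random.default_rng(0)
n, p, L, lam = 6, 30, 5000, 1.0
X = (rng.random((n, p)) < 0.2).astype(float)
X[X.sum(axis=1) == 0, 0] = 1.0               # make sure every row has a non-zero entry
y = rng.normal(size=n)

# Kernel ridge with the Jaccard similarity matrix J
sets = [set(np.flatnonzero(x)) for x in X]
J = np.array([[len(a & b) / len(a | b) for b in sets] for a in sets])
kernel_fit = J @ np.linalg.solve(lam * np.eye(n) + J, y)

# Ridge on S with penalty L*lam; its fitted values equal G(lam*I + G)^{-1} y with G = SS'/L
_, S = sign_minwise_hash(X, L=L, rng=1)
G = S @ S.T / L
sketch_fit = G @ np.linalg.solve(lam * np.eye(n) + G, y)
print(np.max(np.abs(kernel_fit - sketch_fit)))   # small for large L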
Approximation error
Can we construct a b* ∈ R^L such that Xβ* is close to Sb* on average?
[Schematic: the sparse X ∈ R^{n×p} with coefficients β* ∈ R^p versus the dense
sketch S ∈ R^{n×L} with coefficients b ∈ R^L.]
Approximation error
Is there a b* such that the approximation is unbiased, i.e.
E_{π,ψ}(Sb*) = Xβ* when averaged over the random permutations and sign
assignments? Consider first a single column, S ∈ R^{n×1} and b* ∈ R^1.
Approximation error
Assume for now that there are q ≤ p non-zero entries in each row of
X. Unequal row sparsity can also be dealt with.
Consider one permutation with min-hash variables H_i for i = 1, . . . , n
and random signs ψ_k, k = 1, . . . , p, so that S ∈ R^{n×1} has entries
S_i = ψ_{H_i} X_{iH_i}. Set
b* := q Σ_{k=1}^p β*_k ψ_k ∈ R^1.
Approximation error
Can we find a b* ∈ R^L such that Xβ* is close to Sb* on average? For
the single column above,
E_{π,ψ}(S_i b*) = E_{π,ψ}( ψ_{H_i} X_{iH_i} · q Σ_{k=1}^p β*_k ψ_k ) = Σ_{k=1}^p X_ik β*_k q P(H_i = k),
and since P(H_i = k) = 1/q for each of the q non-zero variables k in row i, this
equals (Xβ*)_i. Hence E_{π,ψ}(Sb*) = Xβ* (unbiased).
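A Monte Carlo check of this single-column calculation (an illustration with simulated data, not from the talk): with exactly q non-zero entries per row and b* = q Σ_k β*_k ψ_k, averaging S_i b* over many draws of (π, ψ) recovers (Xβ*)_i.

import numpy as np

rng = np.random.default_rng(0)
n, p, q = 5, 12, 3
X = np.zeros((n, p))
for i in range(n):                                     # exactly q non-zeros per row
    X[i, rng.choice(p, size=q, replace=False)] = rng.normal(size=q)
beta = rng.normal(size=p)

reps, acc = 100000, np.zeros(n)
for _ in range(reps):
    pi = rng.permutation(p)                            # one permutation pi
    psi = rng.choice([-1.0, 1.0], size=p)              # random signs psi_k
    b_star = q * np.sum(beta * psi)                    # b* = q * sum_k beta*_k psi_k
    for i in range(n):
        z = np.flatnonzero(X[i])
        h = z[np.argmin(pi[z])]                        # H_i
        acc[i] += psi[h] * X[i, h] * b_star            # S_i * b*
print(np.round(acc / reps, 2))
print(np.round(X @ beta, 2))                           # the two vectors should roughly agree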
Approximation error
Theorem
Let b* ∈ R^L be defined by
b*_l = (q/L) Σ_{k=1}^p β*_k Ψ_kl w_{π_l(k)},
where w is a vector of weights. Then there is a choice of w such that:
(i) The approximation is unbiased: E_{π,Ψ}(Sb*) = Xβ*.
(ii) E_{π,Ψ}(||b*||_2^2) ≤ 2q||β*||_2^2 / L.
(iii) If ||X||_∞ ≤ 1, then (1/n) E_{π,Ψ}(||Sb* − Xβ*||_2^2) ≤ 2q||β*||_2^2 / L.
Linear model
Assume the model
Y = α*1 + Xβ* + ε.
The random noise ε ∈ R^n satisfies E(ε_i) = 0, E(ε_i^2) = σ² and
Cov(ε_i, ε_j) = 0 for i ≠ j.
We give bounds on a mean-squared prediction error (MSPE) of the form
MSPE((α̂, b̂)) := E_{ε,π,Ψ} ||α*1 + Xβ* − α̂1 − Sb̂||_2^2 / n.
Ordinary least squares
Theorem
Let (α̂, b̂) be the least squares estimator and let L* = √(2qn) ||β*||_2 / σ. We have
MSPE((α̂, b̂)) ≤ 2 max{L/L*, L*/L} √(2q/n) σ ||β*||_2 + σ²/n.
Imagine a situation where more predictors are added to the design
matrix but their associated coefficients are all 0, so ||β*||_2 = O(1).
The MSPE only increases like √q, compared to the factor of p we
would see if OLS were used.
Additionally assume n = O(q) (this ensures the MSPE is bounded
asymptotically). Then L* = O(q). This could be a substantial
reduction over p.
Ridge regression
Define
(α̂, b̂^η) := arg min_{(α,b) ∈ R×R^L} ||Y − α1 − Sb||_2^2   such that   ||b||_2^2 ≤ (1 + η) 2q||β*||_2^2 / L.
Theorem
Let ρ := exp( − Lη² / (36q(36 + η)) ). Then
MSPE((α̂, b̂^η)) ≤ 2σ√(1 + η) √(2q) ||β*||_2 / √n + (L*/L) σ²/n + ρ ||Xβ*||_2^2 / n.
A similar result is available for logistic regression.
Interaction models
Let f* ∈ R^n be given by
f*_i = Σ_{k=1}^p X_{ik} θ*^{(1)}_k + Σ_{k,k_1=1}^p X_{ik} 1{X_{ik_1} = 0} Θ*^{(2)}_{k,k_1},   i = 1, . . . , n.
Assume ||X||_∞ ≤ 1. The previous results hold if ||β*||_2 is replaced by
ℓ(Θ*) := ||θ*^{(1)}||_2 + 2√q ( Σ_{k,k_1,k_2} Θ*^{(2)}_{k k_1} Θ*^{(2)}_{k k_2} )^{1/2}.
Theorem
There exists b* ∈ R^L such that
(i) E_{π,Ψ}(Sb*) = f*;
(ii) E_{π,Ψ}(||Sb* − f*||_2^2)/n ≤ 2q ℓ²(Θ*)/L.
If there are a finite number of non-zero interaction terms with finite values,
the approximation error becomes very small if L ≫ q².
Prediction error
Assume the linear model from before, but with Xβ* replaced by f*.
Theorem
Let b̂ be the least squares estimator and let L* = √(2qn) ℓ(Θ*)/σ. We have
MSPE(b̂) ≤ 2 max{L/L*, L*/L} √(2q/n) σ ℓ(Θ*) + σ²/n.
Consider a situation where there are a fixed number of interaction and
main effects of fixed size, so ℓ(Θ*) = O(√q).
If n, q and p increase by collecting new data and adding
uninformative variables, then in order for the MSPE to vanish
asymptotically, we require q²/n → 0.
Variable importance
The predicted values are f̂ = Sb̂.
Let f̂^{(−k)} be the predictions obtained when setting X_k = 0. If the
underlying model is linear and contains only main effects,
f̂ − f̂^{(−k)} ≈ X_k β*_k.
Construct S̃ in exactly the same way as S but using a matrix H̃ rather
than H, with H̃ defined by
H̃_il := arg min_{k ∈ z_i \ {H_il}} π_l(k).
Store the n × L matrices S, S̃ and H. Then
f̂_i − f̂_i^{(−k)} = Σ_{l=1}^L (S_il − S̃_il) 1{H_il = k} b̂_l.
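A minimal Python sketch of this computation (illustrative, not the authors' code). The handling of rows with a single non-zero entry, where z_i \ {H_il} is empty and S̃_il is set to 0, is a simplifying assumption not specified on the slides; the toy data, penalty and function names are made up.

import numpy as np

def sign_minwise_hash_full(X, L, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    H = np.zeros((n, L), dtype=int)
    S = np.zeros((n, L))
    S_tilde = np.zeros((n, L))
    nonzero = [np.flatnonzero(X[i]) for i in range(n)]
    for l in range(L):
        pi = rng.permutation(p)
        psi = rng.choice([-1.0, 1.0], size=p)
        for i, z in enumerate(nonzero):
            order = z[np.argsort(pi[z])]          # z_i sorted by pi_l
            h = order[0]                          # H_il, the responsible variable
            H[i, l], S[i, l] = h, psi[h] * X[i, h]
            if len(order) > 1:                    # H~_il: runner-up under pi_l
                S_tilde[i, l] = psi[order[1]] * X[i, order[1]]
    return H, S, S_tilde

def importance(k, H, S, S_tilde, b_hat):
    # f_hat_i - f_hat_i^{(-k)} = sum_l (S_il - S~_il) 1{H_il = k} b_hat_l
    return ((S - S_tilde) * (H == k)) @ b_hat

X = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 0.0, 3.0],
              [1.0, 1.0, 0.0, 0.0]])
y = np.array([1.0, 2.0, 0.5])
H, S, S_tilde = sign_minwise_hash_full(X, L=50, rng=0)
b_hat = np.linalg.solve(0.1 * np.eye(50) + S.T @ S, S.T @ y)
print(importance(2, H, S, S_tilde, b_hat))        # contribution of variable k = 2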
Aggregation
The compressed design matrix S is generated in a random fashion.
We can repeat the construction B > 1 times to obtain B different S
matrices.
In the spirit of bagging (Breiman, 1996) we can then aggregate the
predictions obtained from the different random mappings by
averaging them.
Using B > 1 often gives large improvements and our experience has
been that L can be chosen much lower than for B = 1 to achieve the
same predictive accuracy.
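A minimal sketch of this aggregation step (reusing the sign_minwise_hash sketch from earlier; B, L and the ridge penalty are illustrative choices).

import numpy as np

def aggregated_predictions(X, y, L, B, lam=1.0, seed=0):
    preds = np.zeros(len(y))
    for b in range(B):
        # a fresh S from new random permutations and signs for each repetition
        _, S = sign_minwise_hash(X, L=L, rng=seed + b)
        coef = np.linalg.solve(lam * np.eye(L) + S.T @ S, S.T @ y)
        preds += S @ coef
    return preds / B                                  # average over the B sketches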
Text data
We have p = 4,272,227 predictor variables for n = 16,087 observations.
Artificial responses were created from (i) a linear model and (ii) a model with
interactions.
Prediction accuracy was compared with random projections: this method
takes a matrix A ∈ R^{p×L} with i.i.d. N(0, 1) entries and forms the sketch
XA ∈ R^{n×L}.
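For comparison, a one-line sketch of the random-projection baseline described above (illustrative, not the exact setup used in the experiments).

import numpy as np

def random_projection_sketch(X, L, rng=None):
    rng = np.random.default_rng(rng)
    A = rng.normal(size=(X.shape[1], L))    # A in R^{p x L} with i.i.d. N(0, 1) entries
    return X @ A                            # the sketch XA in R^{n x L}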
Response: linear model
[Figure: correlation between prediction and response (scale 0.0 to 1.0) for noise
levels σ = 0, 0.5, 2, 5. Red: min-wise hashing. Blue: random projections.]
Response: interaction model
[Figure: correlation between prediction and response (scale 0.0 to 0.3) for noise
levels σ = 0, 0.5, 2, 5. Red: min-wise hashing. Blue: random projections.]
URL identification
Large-scale classification of malicious URLs with
n ≈ 2 million and p ≈ 3 million.
Data are ordered into consecutive days.
Response Y ∈ {0, 1}n is a binary vector where 1 corresponds to a
malicious URL.
In order to compare min-wise hashing with Lasso- and ridge-penalised
logistic regression, we split the data into the separate days, training on the
first half of each day and testing on the second. This gives on average
n ≈ 20,000 and p ≈ 100,000 per day.
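A hedged sketch of this per-day comparison for the min-wise hashing arm, assuming scikit-learn is available and reusing minwise_hash_matrix from earlier; the arrays X_day and y_day, the regularisation value C = 1.0 and the 0.5 threshold are placeholders rather than the settings used in the talk.

import numpy as np
from sklearn.linear_model import LogisticRegression

def misclassification_for_day(X_day, y_day, L, B=1, seed=0):
    # Train on the first half of the day, test on the second half.
    n = X_day.shape[0]
    train, test = np.arange(n // 2), np.arange(n // 2, n)
    probs = np.zeros(len(test))
    for b in range(B):
        M1 = (minwise_hash_matrix(X_day, L, rng=seed + b) % 2).astype(float)  # 1-bit sketch
        clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)          # ridge-penalised
        clf.fit(M1[train], y_day[train])
        probs += clf.predict_proba(M1[test])[:, 1]
    y_hat = (probs / B > 0.5).astype(int)                # average the B predicted probabilities
    return np.mean(y_hat != y_day[test])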
URL identification: Ridge regression
[Figure: misclassification error (0.05 to 0.20, log scale) against L, for B = 1 with
L = 100 to 1600 and for B = 20 with L = 50 to 400, compared with ridge
regression on the raw data.]
Ridge regression following min-wise hashing performs better here than
ridge regression applied to the original data.
URL identification: Lasso regression
[Figure: misclassification error (0.05 to 0.20, log scale) against L, for B = 1 with
L = 100 to 1600 and for B = 20 with L = 50 to 400, compared with the Lasso on
the raw data.]
Lasso with and without min-wise hashing have similar performance here.
Discussion
b-bit min-wise hashing and closely related random sign min-wise hashing
are interesting dimensionality reduction techniques for large-scale sparse
design matrices.
Prediction error following compression can be bounded with a slow
rate (in the absence of assumptions on the design).
Behaves similarly to random projections if only main effects are
present.
Regression using only main effects from the compressed, dense,
low-dimensional matrix can capture interactions among the large
number of original sparse variables.