Mike Salwan
November 2, 2006
Stat 882
“Notorious” problem of automatic model building algorithms for linear regression
Implicit Assumption
Replacing Y by something without loss of info
Selecting variables
Summary
We have n x m matrix X and n-vector Y
P is the projection onto the column space of (1, X)
LARS assumes we can replace Y with Ŷ = PY without loss of information; in large samples this holds if F(y|x) = F(y|x'β)
We estimate the residual variance by σ̂² = ||(I − P)Y||² / (n − m − 1)
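A minimal R sketch of this setup (simulated data, not from the paper): P is the projection onto the columns of (1, X), Ŷ = PY, and σ̂² comes from the full-model residuals.

```r
set.seed(1)
n <- 50; m <- 4
X <- matrix(rnorm(n * m), n, m)
Y <- drop(X %*% rep(1, m) + rnorm(n))

Z <- cbind(1, X)                            # design matrix with intercept
P <- Z %*% solve(crossprod(Z)) %*% t(Z)     # projection onto col(1, X)
Yhat <- drop(P %*% Y)                       # same as fitted(lm(Y ~ X))
sigma2 <- sum((Y - Yhat)^2) / (n - m - 1)   # residual variance estimate
all.equal(Yhat, unname(fitted(lm(Y ~ X))))  # TRUE (up to rounding)
```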
If this assumption does not hold, then LARS is unlikely to produce useful results
Alternative: let F(y|x) = F(y|x'B), where B is an m × d matrix of rank d. The smallest such d is called the structural dimension of the regression problem
The R package dr can be used to estimate d using methods such as sliced inverse regression
Then model y with a smooth function that operates on this small set of projections x'B
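A hedged sketch of estimating d with the dr package on simulated data; the formula interface, method = "sir", and the nslices argument are as I recall from the package documentation, so treat the exact call as an assumption.

```r
library(dr)                                    # Weisberg's dimension-reduction package
set.seed(2)
n <- 200
x <- matrix(rnorm(n * 5), n, 5)
y <- (x[, 1] + x[, 2])^3 + rnorm(n)            # y depends on one projection of x, nonlinearly
fit <- dr(y ~ x, method = "sir", nslices = 8)  # sliced inverse regression
summary(fit)                                   # marginal dimension tests; should suggest d = 1 here
```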
In the paper, the predictor set was expanded from 10 to 65 variables so that F(y|x) = F(y|x'β) plausibly holds
LARS relies too much on correlations
Correlation measures degree of linear association (obviously)
Using correlations sensibly requires linearity in the conditional distributions of y and of a'x given b'x, for all a and b; otherwise bizarre results can follow
Any method replacing Y by PY cannot be sensitive to nonlinearity
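A quick simulated illustration of this point: correlation measures only linear association, so it can entirely miss a strong nonlinear dependence.

```r
set.seed(3)
x <- rnorm(1e4)
cor(x, x^2)   # roughly 0, even though x^2 is a deterministic function of x
```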
Methods based on PY alone can also be strongly influenced by outliers and high-leverage cases
Consider Cp(μ̂) = ||Y − μ̂||²/σ̂² − n + 2 Σ_i cov(μ̂_i, y_i)/σ̂², where μ̂ is the fitted vector at the current LARS step
Estimate σ² by σ̂² = ||(I − P)Y||² / (n − m − 1)
Thus the ith term is given by: Cp_i(μ̂) = (y_i − μ̂_i)²/σ̂² − 1 + 2 cov(μ̂_i, y_i)/σ̂²
Ŷ_i is the ith element of PY, and h_i is the ith leverage, a diagonal element of P
From the simulations in the article, we can take cov(μ̂_i, y_i) ≈ σ̂² u_i, where u_i is the ith diagonal of the projection matrix onto the columns of (1, X) that are active at the current step of the algorithm
Thus, Cp_i(μ̂) = (Ŷ_i − μ̂_i)²/σ̂² + u_i − (h_i − u_i)
This is the same formula as in an earlier paper by Weisberg, except that here μ̂ is computed from LARS instead of from a projection
The value of Cp_i depends on the agreement between μ̂_i and Ŷ_i, on the leverage u_i in the subset model, and on the difference in leverage (h_i − u_i) between the full and subset models
Neither of the latter two terms has much to do with the problem of interest (the study of the conditional distribution of y given x); they are determined by the predictors only
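A minimal sketch of the decomposition above, using an ordinary projection (subset) fit in place of the LARS fit μ̂; for a projection fit cov(μ̂_i, y_i) = σ² u_i exactly. The data and variable names are illustrative, not from the paper.

```r
set.seed(4)
n <- 100; m <- 5
X <- matrix(rnorm(n * m), n, m)
y <- drop(X[, 1] + 0.5 * X[, 2] + rnorm(n))

full <- lm(y ~ X)                        # full model: gives sigma-hat^2 and h_i
sub  <- lm(y ~ X[, 1:2])                 # subset fit standing in for the LARS fit
sigma2 <- sum(residuals(full)^2) / (n - m - 1)
h <- hatvalues(full)                     # leverages in the full model
u <- hatvalues(sub)                      # leverages in the subset model
Cpi <- (fitted(full) - fitted(sub))^2 / sigma2 + u - (h - u)

# The per-case terms aggregate to the usual Cp of the subset fit:
sum(Cpi)
sum(residuals(sub)^2) / sigma2 - n + 2 * 3   # same value (3 = trace of the subset projection)
```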
We want to decompose x into two parts, x_u and x_a, where x_a represents the active predictors
We want the smallest x_a such that F(y|x) = F(y|x_a), often using some criterion
Standard methods are too greedy
LARS permits highly correlated predictors to be used
Example illustrating a problem with LARS
Added nine new variables by multiplying original variables by 2.2, then rounding to the nearest integer
LARS method applied to both sets
LARS selects two of the rounded variables, including both one variable and its rounded version (BP)
Inclusion or exclusion of a variable depends on the marginal distribution of x as much as on the conditional distribution of y|x
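A hedged sketch of this kind of experiment with the lars package; the data are simulated (not the diabetes data from the paper) and the column names are illustrative.

```r
library(lars)
set.seed(5)
n <- 200
X <- matrix(rnorm(n * 9), n, 9)
y <- drop(X %*% c(3, 1.5, 0, 0, 2, rep(0, 4)) + rnorm(n))
Xr <- round(2.2 * X)                       # rounded near-copies of the original predictors
fit <- lars(cbind(X, Xr), y, type = "lar")
unlist(fit$actions)                        # order in which the 18 columns enter the active set
```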
Ex: Two variables have a high correlation.
LARS selects one for its active set
Modify the other so that it is now uncorrelated with the first
This doesn't change y|x, but it does change the marginal distribution of x
Could change set of active predictors selected by LARS or any method that uses correlation
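A small simulated sketch of this thought experiment: replacing the second predictor by its residual on the first is just a reparameterization (y|x carries the same information), yet it changes the marginal correlations that LARS works from.

```r
library(lars)
set.seed(6)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + 0.1 * rnorm(n)                  # highly correlated with x1
y  <- drop(x1 + x2 + rnorm(n))
x2new <- residuals(lm(x2 ~ x1))            # reparameterize: now uncorrelated with x1
a <- lars(cbind(x1, x2), y, type = "lar")
b <- lars(cbind(x1, x2new), y, type = "lar")
unlist(a$actions); unlist(b$actions)       # the entry order / active set can differ
```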
LARS results are invariant under rescaling, but not under reparameterization of related predictors
Scaling the predictors first and then adding all cross-products and quadratics gives a different model than forming the cross-products and quadratics first and then scaling
This can be solved by considering them simultaneously, but this is self-defeating in terms of subset selection
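A simulated sketch of the ordering issue: lars standardizes columns internally, so the two designs below span the same space (with the intercept) but present different columns to the algorithm.

```r
library(lars)
set.seed(7)
n <- 200
x <- matrix(rexp(n * 3), n, 3)             # skewed predictors, so centering matters
y <- drop(x[, 1] + x[, 1]^2 + rnorm(n))
A <- cbind(scale(x), scale(x)^2)           # scale first, then form quadratics
B <- cbind(x, x^2)                         # form quadratics first (lars scales internally)
fa <- lars(A, y, type = "lar")
fb <- lars(B, y, type = "lar")
unlist(fa$actions); unlist(fb$actions)     # the two paths need not agree
```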
Problems gain notoriety because their solution is elusive but of wide interest
Neither LARS nor any other automatic model selection method considers the context of the problem
There seems to be no foreseeable solution to this problem