Instrumental weighted variables - algorithm

Jan Ámos Víšek*

Faculty of Social Sciences, Charles University, Smetanovo nábřeží 6, 110 01 Prague, the Czech Republic, visek@mbox.fsv.cuni.cz
Summary. An algorithm for the Instrumental Weighted Variables (IWV) is presented. It represents a modification of the algorithm for the Least Weighted Squares. The heuristics for the IWV and the "history" of the algorithm (originally proposed for the Least Trimmed Squares) are recalled. Numerical illustrations are also presented.

Key words: Robust regression, failure of orthogonality condition, controllable level of robustness, heuristics of algorithm and its verification, heteroscedasticity.
1 Introduction and notations
Sensitivity studies have always been an inseparable part of statistics (see e.g. [ChHa88, BKW80, Vi96a, Vi02c] or [Ro87] and the references there). Nevertheless, the surprising findings by Hettmansperger and Sheather in 1992 [HeSh92] for the Engine Knock Data (see [MGH89]) have largely increased the interest in them. In [HeSh92] the reasons for the inherent instability of the high-breakdown-point robust procedures with respect to a (small) shift of an observation were thoroughly discussed for the first time (see also [Vi96b, Vi00a]). Their study also revealed the crucial role of the reliability of the algorithm and its implementation. When later the algorithm proposed by Boček and Lachout [BoLa95]² found an alternative to Hettmansperger and Sheather's Least-Median-of-Squares model (LMS), see [Vi94], an alternative with a smaller value of the minimized 11th order statistic among the squared residuals and a much different model, it became fully clear that studies of robust estimators cannot be restricted to their classical statistical properties.
Since we usually do not know the precise value of the robust estimator, even for simulated data (as the estimator is typically given as a solution of an extremal problem), verifying the reliability of the algorithm (and of its implementation) requires inventing a "trick" which gives at least a reasonable hope that the respective value is a tight approximation of the precise one. For the Boček-Lachout algorithm such a hope might stem from the fact that it gave a smaller value of the minimized functional than any other available algorithm (e.g. in PROGRESS or S-PLUS, see
[Vi96b, Vi00a]). It gave an even smaller value of the h-th order statistic among the squared residuals than the exact Least-Trimmed-Squares model (LTS) (for data sets for which we were able to evaluate the LS-models for all subsamples of size h, we can establish the exact LTS-model). The tables presented below offer examples of such data sets. Now, let us introduce the notation and recall some definitions.

* Research was supported by grant of the GA ČR number 402/06/0408.
² The algorithm is based on a procedure similar to the simplex method. The implementation is available on request from the author of the present paper.
Let N denote the set of all positive integers, R the real line and R^p the p-dimensional Euclidean space. The linear regression model given as

\[
Y_i = X_i' \beta^0 + e_i = \sum_{j=1}^{p} X_{ij} \beta_j^0 + e_i, \qquad i = 1, 2, \ldots, n \tag{1}
\]

will be considered (all vectors throughout the paper are assumed to be column ones). For any β ∈ R^p, r_i(β) = Y_i − X_i'β denotes the i-th residual and r²_{(h)}(β) the h-th order statistic among the squared residuals, i.e. we have

\[
r^2_{(1)}(\beta) \le r^2_{(2)}(\beta) \le \ldots \le r^2_{(n)}(\beta). \tag{2}
\]
2 Definitions of estimators - LMS and LTS
Definition 1. Let [n/2] + [(p + 1)/2] ≤ h ≤ n, then

\[
\hat\beta^{(LMS,n,h)} = \arg\min_{\beta \in R^p} r^2_{(h)}(\beta) \qquad \text{and} \qquad \hat\beta^{(LTS,n,h)} = \arg\min_{\beta \in R^p} \sum_{i=1}^{h} r^2_{(i)}(\beta) \tag{3}
\]

are called the Least Median of Squares and the Least Trimmed Squares estimators, respectively. Let, for any n ∈ N, 1 = w_1 ≥ w_2 ≥ ... ≥ w_n, w_i ∈ [0, 1], be some weights. Then

\[
\hat\beta^{(LWS,n,w)} = \arg\min_{\beta \in R^p} \sum_{i=1}^{n} w_i\, r^2_{(i)}(\beta) \tag{4}
\]

is called the Least Weighted Squares estimator (see [Vi01] and also [Vi02a]).
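For illustration (our own examples, not from the paper), here are two admissible choices of the weights from Definition 1: zero-one weights, which reproduce the LTS (the LTS is thus a special case of the LWS, as recalled in Section 4 below), and linearly decreasing weights.

```python
import numpy as np

def lts_weights(n, h):
    """Zero-one weights reproducing the LTS: w_i = 1 for i <= h, else 0."""
    return (np.arange(1, n + 1) <= h).astype(float)

def linear_weights(n):
    """Smoothly decreasing weights with 1 = w_1 >= ... >= w_n >= 0."""
    return np.linspace(1.0, 0.0, n)
```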
Now, following [HaSi67], let us define the ranks of the squared residuals: For any i ∈ {1, 2, ..., n} put

\[
\pi(\beta, i) = j \in \{1, 2, \ldots, n\} \quad\Leftrightarrow\quad r_i^2(\beta) = r^2_{(j)}(\beta). \tag{5}
\]

Then we arrive at

\[
\hat\beta^{(LWS,n,w)} = \arg\min_{\beta \in R^p} \sum_{i=1}^{n} w\bigl( n^{-1} (\pi(\beta, i) - 1) \bigr)\, r_i^2(\beta) \tag{6}
\]

(where the weight function w(·) generates the weights of (4) via w_i = w(n^{-1}(i − 1))). It is then easy to show that β̂^{(LWS,n,w)} is (one of) the solution(s) of the normal equations

\[
\sum_{i=1}^{n} w\bigl( n^{-1} (\pi(\beta, i) - 1) \bigr)\, X_i \bigl( Y_i - X_i' \beta \bigr) = 0. \tag{7}
\]
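To make the implicit weighting concrete, a minimal sketch (our own naming and layout, assuming the data are in a design matrix X and the weights are as in Definition 1) of evaluating the LWS objective (4)/(6) for a given β:

```python
import numpy as np

def lws_objective(beta, X, Y, weights):
    """LWS objective: weights are assigned to the order statistics of the
    squared residuals (implicit weighting), not to the observations."""
    r2 = (Y - X @ beta) ** 2              # squared residuals r_i^2(beta)
    return np.sum(weights * np.sort(r2))  # sum_i w_i r^2_(i)(beta)
```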
For the consistency, asymptotic normality and optimality of the weights see e.g. [Ma03, Ma04b, Pl04, Vi02a]. If E{e_i | X_i} ≠ 0, the (Ordinary) Least Squares are generally biased and inconsistent, as the following relations show:

\[
\hat\beta^{(OLS,n)} = \beta^0 + \Bigl( \frac{1}{n} \sum_{k=1}^{n} X_k X_k' \Bigr)^{-1} \frac{1}{n} \sum_{i=1}^{n} X_i e_i
\quad\text{and}\quad
\mathop{\rm plim}_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i e_i = \mathbb{E}\, X_1 e_1. \tag{8}
\]

The classical theory then offers the method of Instrumental Variables (see e.g. [CRS95, JGHLL85]), which defines the estimator as the solution of the normal equations Z'(Y − Xβ) = 0³. We are going to recall a robustified version of the Instrumental Variables.
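For reference, a minimal sketch (our own naming) of the classical IV estimator solving Z'(Y − Xβ) = 0, assuming Z'X is nonsingular:

```python
import numpy as np

def iv_estimator(Z, X, Y):
    """Classical Instrumental Variables: beta = (Z'X)^{-1} Z'Y."""
    return np.linalg.solve(Z.T @ X, Z.T @ Y)
```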
Definition 2. For any sequence of random vectors {Z_i}_{i=1}^∞ ⊂ R^p, the solution(s) of the normal equations

\[
\sum_{i=1}^{n} w\bigl( n^{-1} (\pi(\beta, i) - 1) \bigr)\, Z_i \bigl( Y_i - X_i' \beta \bigr) = 0 \tag{9}
\]

will be called the Instrumental Weighted Variables estimator and denoted by β̂^{(IWV,n,w)} (see [Vi04]).
For the consistency, asymptotic normality and Bahadur representation see [Vi06]; the algorithm will be discussed in the rest of the paper.
3 Algorithm for LTS - heuristics, history and verification
Let us return to the above-promised tables. Nevertheless, prior to that, let us recall two things. First of all, Hettmansperger and Sheather first found the LMS-model for "contaminated" data (the "Air/Fuel ratio" for the 2nd case was inadvertently typed as 15.1); later they considered the correct data with the "Air/Fuel ratio" equal to 14.1. Secondly, nowadays, due to quick processors, we are able to find the exact LTS-model (in a reasonable time) for data sets with, say, up to 25 observations⁴. Nevertheless, for data sets containing, say, more than 40 cases we need an algorithm for evaluating procedures like the LTS. Such an algorithm was described and tested in [Vi96a, Vi00a]⁵ (let us call it "Iterative LTS") and its slight modification for the LWS will be recalled below (PRO-LMS and Bo-La-LMS stand for the LMS-models found by PROGRESS, by Peter Rousseeuw and Annick Leroy, and by the Boček-Lachout algorithm, respectively).
³ An alternative approach, suitable especially for "exact" sciences, is known as Total Least Squares, see Van Huffel, S.: Total least squares and error-in-variables modelling. In: Antoch, J. (ed) COMPSTAT 2004, 539-555. Physica-Verlag, Heidelberg (2004).
⁴ It requires evaluating about 25 × 10⁶ LS-models, which is possible in a few minutes, e.g. in MATLAB.
⁵ Recently a similar algorithm appeared in [HaOl].
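To illustrate footnote 4, a brute-force sketch (our own, not the implementation referred to in the text) of establishing the exact LTS-model: since the LTS optimum is attained by the LS-fit of some subsample of size h, it suffices to evaluate the LS-model on all C(n, h) subsamples and keep the fit with the smallest sum of the h smallest squared residuals.

```python
import numpy as np
from itertools import combinations

def exact_lts(X, Y, h):
    """Exact LTS by exhaustive search over all subsamples of size h."""
    n = len(Y)
    best_beta, best_obj = None, np.inf
    for subsample in combinations(range(n), h):
        idx = list(subsample)
        beta, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)  # LS on the subsample
        r2 = np.sort((Y - X @ beta) ** 2)   # all n squared residuals, ordered
        obj = r2[:h].sum()                  # sum of the h smallest
        if obj < best_obj:
            best_beta, best_obj = beta, obj
    return best_beta, best_obj
```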
Table 1. Engine Data (Air/Fuel 14.1, n = 16, p = 5, h = 11) (p is the dimension of the data including the intercept)

Method          11th order stat.   Sum of squares
PRO-LMS         0.3221             0.4239
Bo-La-LMS       0.22783            0.3575
Exact LTS       0.3092             0.2707
Iterative LTS   0.3092             0.2707
Table 2. Engine Data (Air/Fuel 15.1, n = 16, p = 5, h = 11)

Method          11th order stat.   Sum of squares
PRO-LMS         0.5729             1.0481
Bo-La-LMS       0.4506             1.432
Exact LTS       0.5392             0.7283
Iterative LTS   0.5392             0.7283
Table 3. Stackloss Data (n = 21, p = 4, h = 12). The data were first considered by [Br65], later by e.g. [RoLe87, ChHa88]. There is nowadays a huge literature on them.

Method          12th order stat.   Sum of squares
PRO-LMS         0.6640             2.4441
Bo-La-LMS       0.5321             1.9358
Exact LTS       0.7014             1.6371
Iterative LTS   0.7014             1.6371
Firstly, notice that in all tables the h-th order statistic among the squared residuals for "Bo-La-LMS" is smaller than for "PRO-LMS" as well as for "Exact LTS". The same happened for all the data samples we have analyzed since 1992. It gives a strong hope that the algorithm and its implementation⁶ are good. Secondly, the sum of the h smallest order statistics among the squared residuals for "Exact LTS" is smaller than for "Bo-La-LMS" as well as for "PRO-LMS". It of course has to be so, and hence it only confirms that the search through all subsamples of size h is perhaps correctly implemented. Finally, the values for "Iterative LTS" are the same as those for "Exact LTS" (and they were evaluated in a much shorter time). It raises a hope that the iterative algorithm (and its implementation) may be rather good. Let us consider larger data samples.
Table 4. Demographical Data (n = 49, p = 7, h = 28). The data were first considered by [GuMa80], later by e.g. [ChHa88].

Method          28th order stat.   Sum of squares
PRO-LMS         131.50             134260
Bo-La-LMS        95.38             132340
Iterative LTS   104.20              64159
We can observe again that the h-th order statistic among the squared residuals for "Bo-La-LMS" is smaller than for "PRO-LMS" as well as for "Iterative LTS" (the number of observations does not allow us to obtain the "Exact LTS" by searching through all subsamples of size 28 and 27, respectively).

⁶ Available on request (in MATLAB or in MATHEMATICA).
Table 5. Educational Data (n = 50, p = 4, h = 27). The data were first considered by [ChPr77], later by e.g. [RoLe87, ChHa88, PoKo97, At94], to give some among many others.

Method          27th order stat.   Sum of squares
PRO-LMS         19.3562            3605.5
Bo-La-LMS       16.63511           3728.6
Iterative LTS   19.0378            3414.5
Notice that again the sum of the h smallest order statistics among the squared residuals for "Iterative LTS" is smaller than for "Bo-La-LMS" as well as for "PRO-LMS". It may strengthen our belief that the algorithm works well⁷.
4 LWS - comparison with Chatterjee & Price
Now, let us give at least some reasons why we have used the idea of implicit weighting⁸ for the robustification of the Instrumental Variables. Employing the implicit weighting in the definition of the LWS, we got rid of the (general) instability of the LTS (the LTS is of course a special case of the LWS) and attained a higher flexibility (or even adaptivity) to the data. Moreover, when processing panel data, we cannot trim off any observation completely without possibly damaging the autocorrelation structure of the disturbances and/or of the explanatory variables, i.e. the LTS or the LMS cannot be used. Finally, since the LWS, in contrast to the classical Weighted Least Squares, accommodates implicitly to the magnitude of the residuals, it is able to give more "appropriate" models, especially for heteroscedastic data. For some examples with generated data with various levels of heteroscedasticity see [Pl03].
We will give an example employing the Educational Data (which we have already used; they were first considered by Chatterjee and Price in 1977, see [ChPr77]). We will consider the results from [ChPr77], the 2nd edition, see paragraph 5.6. The data record information about the per capita expenditure on education in the 50 states of the U.S. projected for 1975 (SE75, the response variable), the per capita income in 1973 (PI73), the number of residents per thousand under 18 years of age in 1974 (Y74) and the number of residents per thousand living in urban areas in 1970 (URB70) (the explanatory variables). Classical diagnostic tools (Cook's distance and DFIT⁹) indicated Alaska as an outlier (any robust processing of the data, such as L₁- or M-estimators with various ψ-functions, LMS, LTS and LWS, found Alaska to be a very significant outlier, see also [RoLe87]). Then they applied the LS (on the whole data set except Alaska), divided the states into 4 groups (Northeast, North central, South and West), evaluated for each group the sum of squared residuals s²_A, A = N, NC, S, W, and the total sum of squared residuals s² (see Table 5.5 on page 137 of the 2nd edition). Finally, they defined weights
for the cases in the respective group as w_A = s/s_A, A = N, NC, S, W. In the next table w*_A denote the normalized weights (so that the largest one is 1) and d_A the codes of the observations in group A (they are the same for all cases in the respective group, e.g. d₂ = 1, d₂₁ = 2, d₃₄ = 3, etc., and we will need them later).

⁷ For other examples of results see [Vi96b, Vi00a].
⁸ By implicit weighting we mean the process of assigning the weights to the order statistics of the squared residuals rather than to the residuals themselves. This is the idea the LWS is based on.
⁹ Originally proposed by [WeKu77], see also [ChPr77], 2nd ed., p. 86.
Table 6. Chatterjee-Price framework

Group of States   number of cases   s²_A/s   w_A     w*_A    d_A
South                   16           0.383   2.611   1       1
West                    12           0.794   1.259   0.482   2
Northeast                9           1.110   0.901   0.348   3
North central           12           1.433   0.698   0.267   4
Alaska                   1           –       0       0       5
Having assigned the weights, Chatterjee and Price employed the classical Weighted Least Squares (for the results see below or Table 5.6 of [ChPr77]). Using their estimates of the coefficients, we can reconstruct the residuals in their model, and ordering their absolute values (increasingly) we obtain their ranks, say R(i). Now, we can expect that these ranks for the observations from the group (S) will have the smallest values (as the residuals were expected to be the smallest, in absolute value, and hence obtained the largest weight 2.611), etc. Let us put d*(i) = 1, i = 1, 2, ..., 16, d*(i) = 2, i = 17, 18, ..., 28, d*(i) = 3, i = 29, 30, ..., 37, d*(i) = 4, i = 38, 39, ..., 49 and d*(50) = 5. Then the statistic

\[
D = \sum_{i=1}^{50} \bigl| d_i - d^*(R(i)) \bigr|
\]

indicates how far our expectation that the observations from (S) would have the smallest residuals, etc., has been justified. We obtained D = 35, which indicates (as the total number of observations is 50) that two thirds of the observations have somewhat inappropriate weights. Finally, we applied on the Educational Data the LWS with the weights w*_A. In the next table the results of both analyses are collected. We have added the sample variance of the residuals and the coefficient of determination evaluated for the data multiplied by the square roots of the weights (for the idea see PROGRESS or [RoLe87], e.g. p. 490). The approach by Rousseeuw and Leroy seems to be more appropriate than the approach by Chatterjee and Price (see [ChPr77], p. 137), who propose to consider all squared residuals without weighting them. According to the latter approach, for the LMS or the LTS the residuals of (sometimes evidently) contaminated observations would also be considered in the respective sums of squares (σ̂²_r and R₀² denote the sample variance of residuals and the coefficient of determination, respectively).
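A small sketch (our own naming) of the diagnostic D defined above, with the ranks taken over the absolute residuals of the fitted model and the group codes d as in Table 6:

```python
import numpy as np

def discrepancy_D(residuals, d):
    """D = sum_i |d_i - d*(R(i))|, where R(i) is the rank of |r_i| and d*
    maps ranks to the group codes expected under the assigned weights."""
    ranks = np.argsort(np.argsort(np.abs(residuals)))        # 0-based ranks, i.e. R(i) - 1
    d_star = np.repeat([1, 2, 3, 4, 5], [16, 12, 9, 12, 1])  # 16 x S, 12 x W, 9 x NE, 12 x NC, Alaska
    return int(np.sum(np.abs(np.asarray(d) - d_star[ranks])))
```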
Table 7. Educational Data

Method                  Intercept   PI73    Y74     URB70   σ̂²_r     R₀²
Chatterjee-Price-WLS    -320.1      0.027   0.063   0.877   377.55   0.890
LWS                     -306.8      0.066   0.052   0.926   370.09   0.909
5 Algorithm for LWS and IWV
Let us now describe the algorithm for the LWS and then for the IWV. As we have already said, the algorithm for the LWS is a slight modification of the algorithm for the LTS which was tested in [Vi96b] and [Vi00a], and also implemented in EXPLORE, see [CiVi00]. Let w₁ = 1 ≥ w₂ ≥ ... ≥ wₙ ≥ 0 be the vector of weights and denote by W = diag{w₁, w₂, ..., wₙ} the corresponding diagonal matrix. Further, select some maximal number of iterations of the algorithm, say MAX, and a minimal number of models, say MIN.
A  Select randomly p observations¹⁰ and evaluate the (regression) plane going through them (if it is given uniquely; otherwise repeat the selection of the observations). It gives a starting estimate of the regression coefficients, say β̂^{(WLS,n,W)}_{(I)}. Evaluate the squared residuals of all observations,

\[
r_i^2\bigl(\hat\beta^{(WLS,n,W)}_{(I)}\bigr) = \Bigl( Y_i - X_i' \hat\beta^{(WLS,n,W)}_{(I)} \Bigr)^2, \qquad i = 1, 2, \ldots, n,
\]

establish the order statistics r²_{(i)}(β̂^{(WLS,n,W)}_{(I)}) of them (see (2)), and then find the sum of the weighted order statistics

\[
S_{(0)}\bigl(\hat\beta^{(WLS,n,W)}_{(I)}\bigr) = \sum_{i=1}^{n} w_i\, r^2_{(i)}\bigl(\hat\beta^{(WLS,n,W)}_{(I)}\bigr).
\]

B  Reorder the observations according to the order statistics from the previous step and apply the Weighted Least Squares, i.e. put

\[
\hat\beta^{(WLS,n,W)}_{(t)} = \bigl( X_{(t-1)}' W X_{(t-1)} \bigr)^{-1} X_{(t-1)}' W Y_{(t-1)}
\]

(where X_{(t−1)} is the design matrix containing X₁, X₂, ..., Xₙ but in the order given by the order statistics from the previous step; similarly, Y_{(t−1)} is the vector of the response variables with the coordinates Y₁, Y₂, ..., Yₙ, again in the order given by the order statistics from the previous step).

C  Evaluate the squared residuals r_i²(β̂^{(WLS,n,W)}_{(t)}) = (Y_i − X_i'β̂^{(WLS,n,W)}_{(t)})², i = 1, 2, ..., n, and establish the order statistics of them, i.e.

\[
r^2_{(1)}\bigl(\hat\beta^{(WLS,n,W)}_{(t)}\bigr) \le r^2_{(2)}\bigl(\hat\beta^{(WLS,n,W)}_{(t)}\bigr) \le \ldots \le r^2_{(n)}\bigl(\hat\beta^{(WLS,n,W)}_{(t)}\bigr),
\]

and find the sum of the weighted order statistics, i.e.

\[
S_{(t)}\bigl(\hat\beta^{(WLS,n,W)}_{(t)}\bigr) = \sum_{i=1}^{n} w_i\, r^2_{(i)}\bigl(\hat\beta^{(WLS,n,W)}_{(t)}\bigr).
\]

If S_{(t)}(β̂^{(WLS,n,W)}_{(t)}) < S_{(t−1)}(β̂^{(WLS,n,W)}_{(t−1)}), go to B. Otherwise, denote the last sum of the weighted order statistics by S_{(final)}(β̂^{(WLS,n,W)}_{(final)}) and go to D.

¹⁰ p is the number of explanatory variables including the intercept, if any.
D  Keep in memory the MIN previously found models, each of them found in repetitions of the steps B and C and ordered according to their S_{(final)}(β̂^{(WLS,n,W)}_{(final)})'s. If the model just found in the previous B-C cycle has S_{(final)}(β̂^{(WLS,n,W)}_{(final)}) smaller than some model among the MIN models kept in memory, include it at the appropriate place and restrict the number of the saved models again to MIN. If all the models kept in memory are the same, stop the evaluation and return this model as the solution. If the algorithm has already passed MAX times through A, stop and return that model among the MIN models kept in memory which has the minimal sum S_{(final)}(β̂^{(WLS,n,W)}_{(final)}). Otherwise, go to A.
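In code, the steps A-D may be sketched as follows (a minimal illustration under our own naming, not the MATLAB implementation mentioned above; the diagonal matrix W is represented by the vector of weights):

```python
import numpy as np

def iterative_lws(X, Y, weights, MAX=500, MIN=10, seed=None):
    """Iterative LWS algorithm, steps A-D; weights: 1 = w_1 >= ... >= w_n >= 0."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    kept = []                                        # the MIN best models: (S_final, beta)

    for _ in range(MAX):                             # step A: random p-subset start
        idx = rng.choice(n, size=p, replace=False)
        beta, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        S_prev = np.inf
        while True:                                  # steps B and C
            order = np.argsort((Y - X @ beta) ** 2)  # reorder by squared residuals
            Xo, Yo = X[order], Y[order]
            XtW = Xo.T * weights                     # X'_(t-1) W
            beta_new = np.linalg.solve(XtW @ Xo, XtW @ Yo)
            S = np.sum(weights * np.sort((Y - X @ beta_new) ** 2))
            if S < S_prev:
                beta, S_prev = beta_new, S           # improved: go back to B
            else:
                break                                # S_(final) reached: go to D
        kept.append((S_prev, beta))                  # step D: keep the MIN best models
        kept.sort(key=lambda m: m[0])
        kept = kept[:MIN]
        if len(kept) == MIN and all(np.allclose(kept[0][1], b) for _, b in kept):
            break                                    # all stored models coincide
    return kept[0][1]                                # model with the minimal S_(final)
```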
In the steps B and C the algorithm for the IWV evaluates

\[
\hat\beta^{(IWV,n,W)}_{(t)} = \bigl( Z_{(t-1)}' W X_{(t-1)} \bigr)^{-1} Z_{(t-1)}' W Y_{(t-1)}
\]

(where Z_{(t−1)} is the reordered matrix of instruments, reordered in the same way as X_{(t−1)}) and

\[
S_{(t)}\bigl(\hat\beta^{(IWV,n,W)}_{(t)}\bigr) = \bigl( Y_{(t)} - X_{(t)} \hat\beta^{(IWV,n,W)}_{(t)} \bigr)' W Z_{(t)} Z_{(t)}' W \bigl( Y_{(t)} - X_{(t)} \hat\beta^{(IWV,n,W)}_{(t)} \bigr)
\]

instead of β̂^{(WLS,n,W)}_{(t)} and S_{(t)}(β̂^{(WLS,n,W)}_{(t)}), respectively, as one of the solutions of the normal equations (9) minimizes (Y − Xβ)' W Z Z' W (Y − Xβ) in β ∈ R^p, see [JGHLL85, Vi06]. All other steps are the same as in the algorithm for the LWS.
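The corresponding modification of the B/C cycle, again a sketch under our own naming:

```python
import numpy as np

def iwv_step(X, Y, Z, weights, order):
    """One B/C step of the IWV variant: the normal equations use Z'WX and
    Z'WY, and the criterion is (Y - X beta)' W Z Z' W (Y - X beta)."""
    Xo, Yo, Zo = X[order], Y[order], Z[order]     # data and instruments reordered alike
    ZtW = Zo.T * weights                          # Z'_(t-1) W
    beta = np.linalg.solve(ZtW @ Xo, ZtW @ Yo)    # beta_(t)
    r = Y - X @ beta
    u = weights[np.argsort(np.argsort(r ** 2))] * r  # W(Y - X beta), weighted by rank
    v = u @ Z                                     # Z'W(Y - X beta)
    return beta, v @ v                            # S_(t) = ||Z'W(Y - X beta)||^2
```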
References
[At94] Atkinson, A.C.: Fast very robust methods for the detection of multiple outliers. JASA, 89, 1329-1338 (1994)
[BKW80] Belsley, D.A., Kuh, E., Welsch, R.E.: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, NY (1980)
[BoLa95] Boček, P., Lachout, P.: Linear programming approach to LMS-estimation. Mem. Vol. Comput. Statist. & Data Analysis, 19, 129-134 (1997)
[Br65] Brownlee, K.A.: Statistical Theory and Methodology in Science and Engineering. 2nd ed., Wiley, NY (1965)
[CRS95] Carroll, R.J., Ruppert, D., Stefanski, L.A.: Measurement Error in Nonlinear Models. Chapman & Hall/CRC, NY (1995)
[ChHa88] Chatterjee, S., Hadi, A.S.: Sensitivity Analysis in Linear Regression. Wiley, NY (1988)
[ChPr77] Chatterjee, S., Price, B.: Regression Analysis by Example. 1st ed., Wiley, NY (1977); 2nd ed., Wiley, NY (1991)
[CiVi00] Čížek, P., Víšek, J.Á.: The least trimmed squares. In: Härdle, W. (ed) User Guide of Explore (2000)
[GuMa80] Gunst, R.F., Mason, R.L.: Regression Analysis and Its Application: A Data-Oriented Approach. Marcel Dekker, NY (1980)
[HaSi67] Hájek, J., Šidák, Z.: Theory of Rank Tests. Academic Press, NY (1967)
[HaOl] Hawkins, D.M., Olive, D.J.: Improved feasible solution algorithms for high breakdown estimation. CSDA, 30, 1-12 (1999)
[HeSh92] Hettmansperger, T.P., Sheather, S.J.: A Cautionary Note on the Method of Least Median Squares. The American Statistician, 46, 79-83 (1992)
[JGHLL85] Judge, G.G., Griffiths, W.E., Hill, R.C., Lütkepohl, H., Lee, T.C.: The Theory and Practice of Econometrics. Wiley, NY (1985)
[MGH89] Mason, R.L., Gunst, R.F., Hess, J.L.: Statistical Design and Analysis of Experiments. Wiley, NY (1989)
[Ma03] Mašíček, L.: Consistency of the least weighted squares estimator. In: Hubert, M. et al. (eds) ICORS 2003, 183-194. Birkhäuser (2003)
[Ma04b] Mašíček, L.: Optimality of the least weighted squares estimator. Kybernetika, 40, 715-734 (2004)
[Pl03] Plát, P.: The Least Weighted Squares. PhD thesis, Czech Technical University, Prague (2003)
[Pl04] Plát, P.: The Least Weighted Squares Estimator. In: Antoch, J. (ed) COMPSTAT 2004, 1653-1660. Physica-Verlag, Heidelberg (2004)
[PoKo97] Portnoy, S., Koenker, R.: The Gaussian hare and the Laplacian tortoise. Statistical Science, 12, 279-300 (1997)
[Ro87] Ronchetti, E.: Bounded influence inference in regression: A review. In: Dodge, Y. (ed) Statistical Data Analysis Based on the L1-norm. North-Holland, NY (1987)
[RoLe87] Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley, NY (1987)
[Vi94] Víšek, J.Á.: A cautionary note on the method of the Least Median of Squares reconsidered. In: Lachout, P., Víšek, J.Á. (eds) Trans. Twelfth Prague Conf., 254-259, Prague (1994)
[Vi96a] Víšek, J.Á.: Sensitivity analysis of M-estimates. Ann. Inst. Statist. Math., 48, 469-495 (1996)
[Vi96b] Víšek, J.Á.: On high breakdown point estimation. Computational Statistics, 11, 137-146 (1996)
[Vi00a] Víšek, J.Á.: On the diversity of estimates. CSDA, 34, 67-89 (2000)
[Vi01] Víšek, J.Á.: Regression with high breakdown point. In: Antoch, J., Dohnal, G. (eds) Robust 2000, 324-356. UCMP, Prague (2001)
[Vi02a] Víšek, J.Á.: LWS: asymptotic linearity, consistency, asymptotic normality. Bull. of the Czech Econometric Soc., 9(15), 31-58 and 9(16), 1-28 (2002)
[Vi02c] Víšek, J.Á.: Sensitivity analysis of M-estimates of a regression model: Influence of data subsets. Ann. Inst. Statist. Math., 54, 261-290 (2002)
[Vi04] Víšek, J.Á.: Robustifying instrumental variables. In: Antoch, J. (ed) COMPSTAT 2004, 1947-1954. Physica-Verlag, Heidelberg (2004)
[Vi06] Víšek, J.Á.: Instrumental weighted variables. In: Jurečková, J. (ed) Proc. Conf. on Perspectives in Modern Statistical Inference III. Springer (2006)
[WeKu77] Welsch, R.E., Kuh, E.: Linear regression diagnostics. Technical Report 923-77, Sloan School of Management, Cambridge, Massachusetts (1977)