MEMORANDUM No 6/2001 HOW IS GENERALIZED LEAST SQUARES RELATED TO WITHIN AND BETWEEN ESTIMATORS IN UNBALANCED PANEL DATA? By Erik Biørn ISSN: 0801-1117 Department of Economics University of Oslo This series is published by the University of Oslo Department of Economics In co-operation with The Frisch Centre for Economic Research P. O.Box 1095 Blindern N-0317 OSLO Norway Telephone: + 47 22855127 Fax: + 47 22855035 Internet: http://www.oekonomi.uio.no/ e-mail: econdep@econ.uio.no Gaustadalleén 21 N-0371 OSLO Norway Telephone: +47 22 95 88 20 Fax: +47 22 95 88 25 Internet: http://www.frisch.uio.no/ e-mail: frisch@frisch.uio.no No 05 No 04 No 03 No 02 No 01 No 43 No 42 No 41 No 40 No 39 List of the last 10 Memoranda: By Atle Seierstad: NECESSARY CONDITIONS AND SUFFICIENT CONDITIONS FOR OPTIMAL CONTROL OF PIECEWISE DETERMINISTIC CONTROL SYSTEMS. 39 p. By Pål Longva: Out-Migration of Immigrants: Implications for Assimilation Analysis. 48 p. By Øystein Kravdal: The Importance of Education for Fertility in SubSaharan Africa is Substantially Underestimated When Community Effects are Ignored. 33 p. By Geir B. Asheim and Martin L. Weitzman: DOES NNP GROWTH INDICATE WELFARE IMPROVEMENT? 8 p. By Tore Schweder and Nils Lid Hjort: Confidence and Likelihood. 35 p. By Mads Greaker: Strategic Environmental Policy when the Governments are Threatened by Relocation. 22 p. By Øystein Kravdal: The Impact of Individual and Aggregate Unemployment on Fertility in Norway. 36 p. By Michael Hoel and Erik Magnus Sæther: Private health care as a supplement to a public health system with waiting time for treatment: 47 p. By Geir B. Asheim, Anne Wenche Emblem and Tore Nilssen: Health Insurance: Treatment vs. Compensation. 15 p. By Diderik Lund and Tore Nilssen: Cream Skimming, Dregs Skimming, and Pooling: On the Dynamics of Competitive Screening. 21 p. A complete list of this memo-series is available in a PDF® format at: http://www.oekonomi.uio.no/memo/ HOW IS GENERALIZED LEAST SQUARES RELATED TO WITHIN AND BETWEEN ESTIMATORS IN UNBALANCED PANEL DATA ? ) by ERIK BIRN ABSTRACT For a random eects regression model with unbalanced panel data, we demonstrate that the Generalized Least Squares (GLS) estimator can be expressed as a (matrix) weighted average of estimators which utilize the within individual and the between individual variation in the data set. We thus generalize a relationship familiar for balanced panel data. Specic attention must be given to the intercept of the regression. We also dene an estimator containing the GLS, the within individual, and the between individual estimators for balanced and unbalanced data as special cases. Keywords: Panel Data. Unbalanced panels. Missing observations. Random eects. Generalized Least Squares. Within estimation. Between estimation JEL classication: C13, C23 ) I thank Terje Skjerpen for valuable comments. 1 Introduction It is well known from textbook expositions of xed and random eects regression models with balanced panel data that the Ordinary (OLS) and the Generalized Least Squares (GLS) estimators of the coecient vector can be interpreted as (matrix) weighted averages of the estimators which utilize only the within individual and only the between individual variation in the data set, often denoted as within and between estimators [see Maddala (1977, chapter 14-3) and Hsiao (1986, section 3.3.2)]. Unbalanced situations, however, are more common in practice than balanced ones, in particular when using micro data, due to entry or exit of respondents, non-response, rotation designs, etc. Therefore, the interest of this weighting relationship from a practical point of view is somewhat limited. There exists a growing literature on GLS estimators of random individual eects models in unbalanced situations [see, e.g., Birn (1981) and Baltagi (1985)]. The question of whether, and possibly how, this estimator can be related to estimators which can be interpreted as within and between estimators has not been addressed in this literature. Our focus in this note is on the latter question. We demonstrate that a weighting relationship for GLS with unbalanced panel data and random individual eects similar to that in the balanced case exists, provided that the within and the between variation in the data are dened in a suitable way. In deriving this estimator, we show that specic attention must be given to the intercept term of the equation. Finally, we present a general, and easily implementable, estimator which contains the GLS, the OLS, the within individual, and the between individual estimators for balanced and unbalanced situations as special cases. 2 Model and estimators Consider a one-way error components regression model for unbalanced panel data in which individual i (i = 1; : : :; N ) is observed in Ti periods (not all equal), and let t denote the observation number (which diers from the calendar period if the starting period of the individuals dier or if gaps occur in the time series of some of them): (1) yit = xit + k + it ; it = i + uit; i IID(0; 2 ); uit IID(0; 2); i; uit; xit are independent for all i; t; i = 1; : : :; N ; t = 1; : : :; Ti; where xit is a (row) vector of regressors, its (column) vector of coecients, i an individual specic random eect, and uit a disturbance. Let y i = (yi1; : : :; yiTi )0 , y = (y01 ; : : :; y0N )0, X i = (x0i1 ; : : :; x0iTi )0, X = (X 01 ; : : :; X 0N )0 , etc., and let I m be the m 1 dimensional identity matrix and em the (m 1) vector of ones. The number of observaP tions, i.e., the number of rows in y and X , is n = Ni=1 Ti . Compactly, the model can then be written y = X + enk + ; = [eT0 1 1; : : :; eT0 N N ] 0 + u; (2) E() = 0n;1 ; E( 0 ) = ; where (3) = diag( 1; : : :; N ); (4) i = E(ii0) = 2 eTi eT0 i + 2I Ti = 2K Ti + (2 + Ti2 )J Ti ; i = 1; : : :; N; `diag' denoting a block-diagonal matrix, J m = (em e0m )=m, and K m = I m ; J m ; m = 1; 2; : : : . Since the latter two matrices are idempotent and have orthogonal columns, we simply have (5) ;i 1 = 12 (K Ti + iJ Ti ); where 2 (6) i = 2 + T 2 ; i = 1; : : :; N: i We use the following notation for the within individual, the between individual, and the total covariation in arbitrary matrices, Z and Q, constructed in the same way as X above: Ti N X X W ZQ = (z it ; zi ) 0(qit ; qi ) = Z 0diag(K T1 ; : : :; K TN )Q; i=1 t=1 N X B ZQ = Ti(zi ; z) 0(qi ; q) = Z 0diag(J T1 ; : : :; J TN )Q ; Z 0J n Q; i=1 Ti N X X T ZQ = (z it ; z) 0(qit ; q) = W ZQ + B ZQ = Z 0 (I n ; J n )Q; i=1 t=1 P P i z = n;1 PN T z . ; 1 PTi where zi = Ti t=1 z it and z = n;1 Ni=1 Tt=1 i=1 i i it . 0 0 f i = (X i .. eTi ) and X f = (X f1 ; : : :; X f N )0. In the following we do not, however, Let X include the intercept term and the ones attached to it in the coecient vectors and regressor matrices, as in, e.g., Baltagi (1995, section 9.2), but specify them explicitly in the formulae. This is essential in dening between estimators and decomposing the GLS estimator into within and between estimators for the unbalanced case. The OLS and GLS estimators of ( 0 k) 0 are 2 3 2P 3;1 2 P 3 0X P X 0 e 0y b OLS X X 0 0 fX f );1 (X f y ) = 4 P i i P i Ti 5 4 P i i 5 4 5 = (X (7) kbOLS eT0 i X i eT0 i eTi eT0 i yi 2 P T x 0 x P T x 0 3;1 2 W + P T x 0 y 3 W + i i 5 4 XY i i i 5 = 4 XXP i i i P Tixi n Ti yi 2 and 2 3 b GLS f 0X f ) ; 1 (X f 0 y) 5 = (X (8) 4 b kGLS 2P 3;1 2 P 3 0 ;1 X P X 0 ;1 e 0 ;1 y X X = 4 P 0 i ;i 1 i P 0 i ;i 1 Ti 5 4 P 0 i ;i 1 i 5 eTi i X i eTi i eTi eTi i yi 2 P T x 0 x P T x 0 3;1 2 W + P T x 0 y 3 W + XX =4 P T xi i i i Pi iT i 5 4 XY P T yi i i i 5 ; i i i i i i i i respectively, the last equality following from (5). Since the formula for the partitioned inverse [see, e.g., Lutkepohl (1996, section 3.5.3)] implies 2 3 2 0 ;1 bc0 3 0b !;1 A b Q ; Q b 5 4 5 4 Q = A; c ; (9) b c = ; bc Q bc Q bc0 + 1c ; where A is a symmetric matrix, b a row vector, and c a scalar, (7) and (8) can be written as b OLS = [W XX + P Ti(xi ; x ) 0P(xi ; x)];1 (10) [W XY + Ti(xi ; x) 0(yi ; y)]; kbOLS = y ; xb OLS ; and b GLS = [W XX + P i Ti(xi ; xeP) 0(xi ; xe)];1 (11) [W XY + iTi(xi ; xe ) 0(yi ; ye)]; kbGLS = ye ; xe b GLS ; where P T y P T x P T y P T x i i i i i i i e e (12) x = P Ti ; y = P Ti ; x = P iTi ; y = P ii Ti ii : Note that the global means occurring in the denitions of the OLS and the GLS estimators dier when Ti depends on i. The between and within estimators corresponding to b OLS and kbOLS , obtained by running OLS on p p p p (13) Tiyi = Tixi + Tik + Tii; i = 1; : : :; N; and on (1) with the k + i 's considered as N unknown constants, are, respectively, 1 b B = [P Ti(xi ; x ) 0(xi ; x)];1 [P Ti(xi ; x) 0 (yi ; y)] = B ;XX B XY ; (14) kbB = y ; xb B ; and (15) b W = W ;XX1 W XY : p Note that the disturbances in (13), Ti i , are homoskedastic when no individual eects occur (2 = 0), and heteroskedastic otherwise. 3 3 The relationship between the estimators We next consider the relationships between the estimators (10), (11), (14), and (15). From (10), (14), and (15) we have (16) b OLS = (W XX + B XX );1(W XX b W + B XX b B ); regardless of whether the panel data set is balanced or unbalanced. In the balanced case, where Ti = T and i = = 2 =( 2 + T2 ) for all i, it follows from (11), (14) and (15) that (17) b GLS = (W XX + B XX );1(W XX b W + B XX b B ): We will now derive a relationship for the unbalanced case similar to the latter. Let vi (i = 1; : : :; N ) be an arbitrary weight for individual i and multiply its equation in individual means by the square root of this weight, which generalizes (13) to (18) pv y = pv x + pv k + pv ; i i i i i i i i = 1; : : :; N: Running OLS on this equation, we obtain generalized between estimators of and k 2 3 2P 3 2 3 0 x P v x 0 ;1 P vi x 0 y e B (v) v x i i i i i i i 4 5=4 P 5 4 P 5; (19) keB (v) vixi P vi vi yi where v = (v1 ; : : :; vN ), which, when we again use (9), leads to e B = e B (v) = [P vi (xi ; xe (v ))0(xi ; xe (v ))];1 [P vi (xi ; xe (v))0(yi ; ye(v))] ; (20) e e kB = kB (v) = ye(v) ; xe (v)e B (v); where (21) P P xe (v) = Pvivxi i ; ye(v) = Pvivyi : i This brings us to the main result in this note: The estimators b OLS , b W b B , e B , and b GLS are all matrix weighted means of b W and e B (v ) for a suitable choice of v, since they all belong to the class (22) b (W ; B; v) = [W W XX + B f B XX (v)];1[W W XX b W + B f B XX (v)e B (v)]; where W and B are arbitrary scalar constants and f BXX (v) = P vi[xi ; xe (v)]0[xi ; xe (v)]: fXX (v) and b (W ; B ; v) are homogeneous in v of degrees one and zero, Note that B respectively, while b (W ; B ; v) is homogeneous in (W ; B ) of degree zero. In particular, 4 we have b OLS = b (1; 1; T ); b W = b (1; 0; T ) = b (1; 0; T ); b B = e B (T ) = b (0; 1; T ); e B (T ) = b (0; 1; T ); b GLS = b (1; 1; T ); where T = (T1; : : :; TN ) and T = (1 T1; : : :; N TN ). In practical applications, the i 's have to be estimated, which requires estimation of and 2 . This problem, for unbalanced panel data, is discussed in Searle, Casella, and McCulloch (1992, section 3.6) and Birn (1999, section 3). 2 4 Conclusion Our conclusions then are the following: 1. If we dene a modied between estimator of , (20), by choosing the weight vi such that the weighted equation in individual means, (18), has disturbances which are homoskedastic with variance 2, we obtain the between estimator for the unbalanced panel data set, e B (T ). Since var(i ) = 2 + 2=Ti , this choice inplies vi = i Ti (i = 1; : : :; N ). 2. The GLS estimator for the unbalanced case can be interpreted as a matrix weighted mean of b W and e B (T ), with weights depending on X . Unlike the OLS estimator for the unbalanced case, it cannot, however, in general be interpreted as a matrix weighted mean of b W and b B . 3. In the balanced case, in which T = (T; : : :; T ), we have f B XX (T ) = B XX and e B (T ) = e B (T ) = b B . This gives the familiar decomposition formula (17). 5 References Baltagi, B.H. (1985): Pooling Cross-Sections with Unequal Time-Series Lengths. Economics Letters, 18 (1985), 133 { 136. Baltagi, B.H. (1995): Econometric Analysis of Panel Data. Chichester: Wiley, 1995. Birn, E. (1981): Estimating Economic Relations from Incomplete Cross-Section/TimeSeries Data. Journal of Econometrics, 16 (1981), 221 { 236. Birn, E. (1999): Estimating Regression Systems from Unbalanced Panel Data: A Stepwise Maximum Likelihood Procedure. Department of Economics, University of Oslo, Memorandum No. 20/1999. Hsiao, C. (1986): Analysis of Panel Data. Cambridge: Cambridge University Press, 1986. Lutkepohl, H. (1996): Handbook of Matrices. Chichester: Wiley, 1996. Maddala, G.S. (1977): Econometrics. Auckland: McGraw-Hill, 1977. Searle, S.R., Casella, G., and McCulloch, C.E. (1992): Variance Components. New York: Wiley, 1992. 6