Part 14: Nonlinear Models [ 1/84]
Econometric Analysis of Panel Data
William Greene
Department of Economics
Stern School of Business

Part 14: Nonlinear Models [ 2/84]
Nonlinear Models
Nonlinear Models
Estimation Theory for Nonlinear Models
  Estimators
  Properties
  M Estimation
  GMM Estimation
  Minimum Distance Estimation
  Minimum Chi-square Estimation
Computation - Nonlinear Optimization
  Nonlinear Least Squares
  Maximum Likelihood Estimation
  Newton-like Algorithms; Gradient Methods
(Background: JW, Chapters 12-14; Greene, Chapters 12-14)

Part 14: Nonlinear Models [ 3/84]
What is a 'Model?'
Unconditional 'characteristics' of a population
Conditional moments: E[g(y)|x]: median, mean, variance, quantile, correlations, probabilities...
Conditional probabilities and densities
Conditional means and regressions
Fully parametric and semiparametric specifications
Parametric specification: known up to parameter \theta
Parameter spaces
Conditional means: E[y|x] = m(x, \theta)

Part 14: Nonlinear Models [ 4/84]
What is a Nonlinear Model?
Model: E[g(y)|x] = m(x, \theta)
Objective: learn about \theta from y, X; usually "estimate" \theta.
Linear model: closed form, \hat\theta = h(y, X).
Nonlinear model:
  Nonlinear with respect to m(x, \theta), e.g., y = exp(\theta'x + \epsilon).
  Nonlinear with respect to the estimator: implicitly defined, h(y, X, \hat\theta) = 0, e.g., E[y|x] = exp(\theta'x).

Part 14: Nonlinear Models [ 5/84]
What is an Estimator?
Point and interval:
  \hat\theta = f(data | model); interval I(\hat\theta) = \hat\theta +/- sampling variability.
Classical and Bayesian:
  \hat\theta = E[\theta | data, prior f(\theta)] = expectation from the posterior;
  I(\hat\theta) = narrowest interval from the posterior density containing the specified probability (mass).

Part 14: Nonlinear Models [ 6/84]
Parameters
Model parameters; the parameter space; the true parameter(s).
Example: f(y_i | x_i) = (1/\lambda_i) exp(-y_i/\lambda_i), \lambda_i = exp(\beta'x_i)
Model parameters: \beta
Conditional mean: E[y_i | x_i] = \lambda_i = exp(\beta'x_i)

Part 14: Nonlinear Models [ 7/84]
The Conditional Mean Function
m(x, \theta_0) = E[y|x] for some \theta_0 in \Theta.
A property of the conditional mean: E_{y,x}[(y - m(x, \theta))^2] is minimized by m(x, \theta) = E[y|x].
(Proof, pp. 343-344, JW)

Part 14: Nonlinear Models [ 8/84]
M Estimation
A classical estimation method:
  \hat\theta = argmin_\theta (1/n) \sum_{i=1}^n q(data_i, \theta)
Example: nonlinear least squares,
  \hat\theta = argmin_\theta (1/n) \sum_{i=1}^n [y_i - E(y_i | x_i, \theta)]^2

Part 14: Nonlinear Models [ 9/84]
An Analogy Principle for M Estimation
The estimator \hat\theta minimizes \bar{q} = (1/n) \sum_{i=1}^n q(data_i, \theta).
The true parameter \theta_0 minimizes q* = E[q(data, \theta)].
The weak law of large numbers: \bar{q} = (1/n) \sum_{i=1}^n q(data_i, \theta) -> q* = E[q(data, \theta)] in probability.

Part 14: Nonlinear Models [ 10/84]
Estimation
\bar{q} -> q* in probability.
The estimator \hat\theta minimizes \bar{q}; the true parameter \theta_0 minimizes q*.
Does \bar{q} ->p q* imply \hat\theta ->p \theta_0? Yes, if ...

Part 14: Nonlinear Models [ 11/84]
Identification
Uniqueness: if \theta_1 != \theta_0, then m(x, \theta_1) != m(x, \theta_0) for some x.
Examples when this does not occur:
(1) Multicollinearity, generally.
(2) Need for normalization: E[y|x] = m(x'\beta/\sigma), so only \beta/\sigma is identified.
(3) Indeterminacy: m(x, \theta) = \beta_1 + \beta_2 x + \beta_3 x^{\beta_4}; \beta_4 is not identified when \beta_3 = 0.

Part 14: Nonlinear Models [ 12/84]
Continuity
q(data_i, \theta) is
(a) continuous in \theta for all data_i and all \theta in \Theta;
(b) continuously differentiable -- first derivatives are also continuous;
(c) twice differentiable -- second derivatives must be nonzero, though they need not be continuous functions of \theta (e.g., linear LS, where they are constants).

Part 14: Nonlinear Models [ 13/84]
Consistency
\bar{q} ->p q*; the estimator \hat\theta minimizes \bar{q}; the true parameter \theta_0 minimizes q*.
Does this imply \hat\theta ->p \theta_0? Yes: consistency follows from identification and continuity, together with the other assumptions.
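The slides carry out estimation in NLOGIT; as a supplement, here is a minimal Python sketch (mine, not the slides') of M estimation for the exponential conditional-mean model: the estimator minimizes the sample average criterion (1/n) \sum_i q(data_i, \theta). All data and names here are illustrative.

    # M estimation sketch: minimize (1/n) sum_i q(data_i, theta) for the
    # NLS criterion with E[y|x] = exp(theta'x). Simulated data for illustration.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one regressor
    theta_true = np.array([0.5, -0.3])
    y = rng.exponential(scale=np.exp(X @ theta_true))      # E[y|x] = exp(theta'x)

    def qbar(theta):
        # Nonlinear least squares criterion: (1/n) sum_i (y_i - exp(theta'x_i))^2
        return np.mean((y - np.exp(X @ theta)) ** 2)

    start = np.array([np.log(y.mean()), 0.0])              # start at a0 = log(ybar), rest 0
    res = minimize(qbar, start, method="BFGS")
    print(res.x)   # close to theta_true in large samples, per the analogy principle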
Part 14: Nonlinear Models [ 14/84]
Asymptotic Normality of M Estimators
First order conditions:
  (1/n) \sum_{i=1}^n \partial q(data_i, \hat\theta)/\partial\hat\theta = 0
  (1/n) \sum_{i=1}^n g(data_i, \hat\theta) = \bar{g}(data, \hat\theta) = 0
For any given \theta, (1/n) \sum_i g(data_i, \theta) is the mean of a random sample. We apply the Lindeberg-Feller CLT to assert the limiting normal distribution of \sqrt{n} \bar{g}(data, \theta_0).

Part 14: Nonlinear Models [ 15/84]
Asymptotic Normality
A Taylor series expansion of the derivative:
  \bar{g}(data, \hat\theta) ~ \bar{g}(data, \theta_0) + \bar{H}(\bar\theta)(\hat\theta - \theta_0) = 0,
where \bar\theta is some point between \hat\theta and \theta_0 and
  \bar{H}(\bar\theta) = (1/n) \sum_{i=1}^n \partial^2 q(data_i, \bar\theta)/\partial\bar\theta \partial\bar\theta'.
Then
  (\hat\theta - \theta_0) = -[\bar{H}(\bar\theta)]^{-1} \bar{g}(data, \theta_0)
and
  \sqrt{n}(\hat\theta - \theta_0) = -[\bar{H}(\bar\theta)]^{-1} \sqrt{n} \bar{g}(data, \theta_0).

Part 14: Nonlinear Models [ 16/84]
Asymptotic Normality
\sqrt{n}(\hat\theta - \theta_0) = -[\bar{H}(\bar\theta)]^{-1} \sqrt{n} \bar{g}(data, \theta_0)
[\bar{H}(\bar\theta)]^{-1} converges to its expectation (a matrix).
\sqrt{n} \bar{g}(data, \theta_0) converges to a normally distributed vector (Lindeberg-Feller).
This implies a limiting normal distribution for \sqrt{n}(\hat\theta - \theta_0). The limiting mean is 0; the limiting variance is to be obtained. The asymptotic distribution is then obtained by the usual means.

Part 14: Nonlinear Models [ 17/84]
Asymptotic Variance
\hat\theta - \theta_0 ~ -[\bar{H}(\theta_0)]^{-1} \bar{g}(data, \theta_0): asymptotically normal, mean 0.
Asy.Var[\hat\theta] = [\bar{H}(\theta_0)]^{-1} Var[\bar{g}(data, \theta_0)] [\bar{H}(\theta_0)]^{-1}
(a sandwich estimator, as usual).
What is Var[\bar{g}(data, \theta_0)]? It is (1/n) E[g(data_i, \theta_0) g(data_i, \theta_0)'].
It is not known, but it is easy to estimate:
  (1/n) [ (1/n) \sum_{i=1}^n g(data_i, \hat\theta) g(data_i, \hat\theta)' ].

Part 14: Nonlinear Models [ 18/84]
Estimating the Variance
Asy.Var[\hat\theta] = [\bar{H}(\theta_0)]^{-1} Var[\bar{g}(data, \theta_0)] [\bar{H}(\theta_0)]^{-1}
Estimate \bar{H}(\theta_0) with (1/n) \sum_{i=1}^n \partial^2 q(data_i, \hat\theta)/\partial\hat\theta \partial\hat\theta'.
Estimate Var[\bar{g}(data, \theta_0)] with (1/n) [ (1/n) \sum_{i=1}^n g(data_i, \hat\theta) g(data_i, \hat\theta)' ].
E.g., if this is linear least squares, with q = (1/2) \sum_{i=1}^n (y_i - x_i'b)^2 so q_i = (1/2)(y_i - x_i'b)^2:
  (1/n) \sum_i \partial^2 q_i/\partial\theta \partial\theta' = (X'X/n), and the middle matrix is estimated by (1/n^2) \sum_{i=1}^n e_i^2 x_i x_i'.

Part 14: Nonlinear Models [ 19/84]
Nonlinear Least Squares
Gauss-Marquardt algorithm.
The criterion q_i is built on the conditional mean function m(x_i, \theta).
Pseudo-regressors: x_i^0 = \partial m(x_i, \theta)/\partial\theta, one row of the pseudo-regressor matrix X^0.
Algorithm - iteration:
  \hat\theta^{(k+1)} = \hat\theta^{(k)} + [X^{0'}X^0]^{-1} X^{0'} e^0,
where e^0 is the residual vector and X^0 and e^0 are evaluated at \hat\theta^{(k)}.

Part 14: Nonlinear Models [ 20/84]
Application - Income
German Health Care Usage Data: 7,293 individuals, varying numbers of periods. Data downloaded from the Journal of Applied Econometrics Archive. This is an unbalanced panel with 7,293 individuals and altogether 27,326 observations; the data can be used for regression, count models, binary choice, ordered choice, and bivariate binary choice. The number of observations per person ranges from 1 to 7 (frequencies: 1=1525, 2=2158, 3=825, 4=926, 5=1051, 6=1000, 7=987). The variable NUMOBS tells how many observations there are for each person; it is repeated in each row of that person's data.
Variables used below:
HHNINC = household nominal monthly net income in German marks / 10000 (4 observations with income = 0 were dropped)
HHKIDS = 1 if children under age 16 in the household; 0 otherwise
EDUC = years of schooling
AGE = age in years

Part 14: Nonlinear Models [ 21/84]
Income Data
[Figure: kernel density estimate for INCOME (HHNINC), plotted on (0, 5).]
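Before turning to the application, here is a minimal Python sketch (mine, not the slides' NLOGIT) of the Gauss-Marquardt/Gauss-Newton iteration above, specialized to the exponential mean function m(x, \theta) = exp(\theta'x); names are illustrative.

    # Gauss-Newton iteration for NLS: theta(k+1) = theta(k) + (X0'X0)^{-1} X0'e0,
    # with pseudo-regressors x_i^0 = dm/dtheta = exp(theta'x_i) x_i.
    import numpy as np

    def gauss_newton(y, X, theta0, tol=1e-10, max_iter=100):
        theta = theta0.copy()
        for _ in range(max_iter):
            m = np.exp(X @ theta)              # conditional mean at current theta
            e0 = y - m                         # residuals
            X0 = m[:, None] * X                # pseudo-regressor matrix X^0
            step = np.linalg.solve(X0.T @ X0, X0.T @ e0)
            theta = theta + step
            # convergence measure used on slide 26: e0'X0 (X0'X0)^{-1} X0'e0
            if e0 @ X0 @ np.linalg.solve(X0.T @ X0, X0.T @ e0) < tol:
                break
        return theta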
Part 14: Nonlinear Models [ 22/84]
Exponential Model
f(Income | Age, Educ, Married) = (1/\lambda_i) exp(-HHNINC_i/\lambda_i)
\lambda_i = exp(a_0 + a_1 Educ + a_2 Married + a_3 Age)
E[HHNINC | Age, Educ, Married] = \lambda_i
Starting values for the iterations: E[y_i | nothing else] = exp(a_0).
Start a_0 = log(mean of HHNINC), a_1 = a_2 = a_3 = 0.

Part 14: Nonlinear Models [ 23/84]
Conventional Variance Estimator
\hat\sigma^2 (X^{0'}X^0)^{-1}, where \hat\sigma^2 = \sum_{i=1}^n [y_i - m(x_i, \hat\theta)]^2 / (n - #parameters).
The degrees of freedom correction (#parameters) is sometimes omitted.

Part 14: Nonlinear Models [ 24/84]
Variance Estimator for the M Estimator
q_i = (1/2)[y_i - exp(\theta'x_i)]^2 = (1/2)(y_i - \lambda_i)^2
g_i = -e_i \lambda_i x_i
H_i is estimated with y_i \lambda_i x_i x_i' (a consistent estimator of the expected Hessian, since E[y_i|x_i] = \lambda_i).
The estimator is
  [\sum_i H_i]^{-1} [\sum_i g_i g_i'] [\sum_i H_i]^{-1}
  = [\sum_i y_i \lambda_i x_i x_i']^{-1} [\sum_i e_i^2 \lambda_i^2 x_i x_i'] [\sum_i y_i \lambda_i x_i x_i']^{-1}.
This is the White estimator. See JW, p. 359.

Part 14: Nonlinear Models [ 25/84]
Computing NLS
Reject ; hhninc = 0 $
Calc   ; b0 = log(xbr(hhninc)) $
Nlsq   ; lhs = hhninc
       ; fcn = exp(a0+a1*educ+a2*married+a3*age)
       ; start = b0, 0, 0, 0
       ; labels = a0,a1,a2,a3 $
Name   ; x = one,educ,married,age $
Create ; thetai = exp(x'b) ; ei = hhninc - thetai
       ; gi = ei*thetai ; hi = hhninc*thetai $
Matrix ; varM = <x'[hi] x> * x'[gi^2]x * <x'[hi] x> $
Matrix ; stat(b,varm,x) $

Part 14: Nonlinear Models [ 26/84]
Iterations
Convergence is judged by the 'gradient' measure e^{0'}X^0 (X^{0'}X^0)^{-1} X^{0'}e^0.

Part 14: Nonlinear Models [ 27/84]
NLS Estimates with Different Variance Estimators

Part 14: Nonlinear Models [ 28/84]
Hypothesis Tests for M Estimation
Null hypothesis: c(\theta) = 0 for some set of J functions, with
(1) c(\theta) continuous;
(2) c(\theta) differentiable, with J x K Jacobian R(\theta) = \partial c(\theta)/\partial\theta';
(3) functionally independent restrictions: rank R(\theta) = J.
Wald: given \hat\theta and \hat{V} = Est.Asy.Var[\hat\theta], the Wald distance is
  W = [c(\hat\theta) - 0]' {R(\hat\theta) \hat{V} R(\hat\theta)'}^{-1} [c(\hat\theta) - 0] -> chi-squared[J] under H0.

Part 14: Nonlinear Models [ 29/84]
Change in the Criterion Function
\bar{q} = (1/n) \sum_{i=1}^n q(data_i, \theta) ->p q* = E[q(data, \theta)]
The estimator \hat\theta minimizes \bar{q}; the estimator \hat\theta_0 minimizes \bar{q} subject to the restrictions c(\theta) = 0. So \bar{q}_0 >= \bar{q}, and
  D = 2n(\bar{q}_0 - \bar{q}) -> chi-squared[J] in distribution.

Part 14: Nonlinear Models [ 30/84]
Score Test - LM Statistic
The derivative of the objective function (1/n) \sum_{i=1}^n q(data_i, \theta) is the score vector \bar{g}(data, \theta).
Without restrictions, \bar{g}(data, \hat\theta) = 0.
With the null hypothesis c(\theta) = 0 imposed, \bar{g}(data, \hat\theta_0) is generally not equal to 0. Is it close -- within sampling variability?
  LM = [\bar{g}(data, \hat\theta_0)]' {Var[\bar{g}(data, \hat\theta_0)]}^{-1} [\bar{g}(data, \hat\theta_0)] -> chi-squared[J] in distribution.

Part 14: Nonlinear Models [ 31/84]
Exponential Model
f(Income | Age, Educ, Married) = (1/\lambda_i) exp(-HHNINC_i/\lambda_i)
\lambda_i = exp(a_0 + a_1 Educ + a_2 Married + a_3 Age)
Test H0: a_1 = a_2 = a_3 = 0.

Part 14: Nonlinear Models [ 32/84]
Wald Test
Matrix ; List ; R = [0,1,0,0 / 0,0,1,0 / 0,0,0,1]
       ; c = R*b ; Vc = R*Varb*R'
       ; Wald = c'<Vc>c $
Matrix R (3 rows, 4 columns): the selection matrix above.
Matrix C (3 rows, 1 column): .05472, .23756, .00081.
Matrix VC (3 rows, 3 columns):
   .1053686D-05   .4530603D-06   .3649631D-07
   .4530603D-06   .5859546D-04  -.3565863D-06
   .3649631D-07  -.3565863D-06   .6940296D-07
Matrix WALD (1 row, 1 column): 3627.17514.
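As an illustration of the Wald distance just computed, here is a minimal Python sketch (mine; NLOGIT's Matrix commands do the same algebra above) of W = c'[R V R']^{-1} c for linear restrictions R\theta = 0.

    # Wald test of H0: R theta = 0, with W ~ chi-squared[J], J = rows of R.
    import numpy as np
    from scipy.stats import chi2

    def wald_test(b, V, R):
        """b: estimates; V: estimated asy. covariance matrix; R: J x K."""
        c = R @ b                                  # c(theta_hat) = R b
        Vc = R @ V @ R.T                           # Var of c
        W = c @ np.linalg.solve(Vc, c)             # Wald distance
        return W, chi2.sf(W, df=R.shape[0])        # statistic and p-value

    # Example restriction matrix: last three of four coefficients are zero.
    R = np.array([[0., 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])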
Part 14: Nonlinear Models [ 33/84]
Change in Function
Calc ; b0 = log(xbr(hhninc)) $
Nlsq ; lhs = hhninc ; labels = a0,a1,a2,a3
     ; start = b0,0,0,0
     ; fcn = exp(a0+a1*educ+a2*married+a3*age) $
Calc ; qbar = sumsqdev/n $
Nlsq ; lhs = hhninc ; labels = a0,a1,a2,a3
     ; start = b0,0,0,0 ; fix = a1,a2,a3
     ; fcn = exp(a0+a1*educ+a2*married+a3*age) $
Calc ; cm = 2*n*(sumsqdev/n - qbar) $
(Sumsqdev = 763.767; Sumsqdev_0 = 854.682)

Part 14: Nonlinear Models [ 34/84]
Constrained Estimation

Part 14: Nonlinear Models [ 35/84]
LM Test
Function: q_i = (1/2)[y_i - exp(a_0 + a_1 Educ + ...)]^2
Derivative: g_i = e_i \lambda_i x_i
LM statistic:
  LM = (\sum_{i=1}^n g_i)' [\sum_{i=1}^n g_i g_i']^{-1} (\sum_{i=1}^n g_i),
all evaluated at \hat\theta = (a_0 = log \bar{y}, 0, 0, 0).

Part 14: Nonlinear Models [ 36/84]
LM Test
Name   ; x = one,educ,married,age $
Create ; thetai = exp(x'b) $
Create ; ei = hhninc - thetai $
Create ; gi = ei*thetai $
Matrix ; list ; LM = x'gi * <x'[gi^2]x> * gi'x $
Matrix LM (1 row, 1 column): 1915.03286.

Part 14: Nonlinear Models [ 37/84]
Maximum Likelihood Estimation
Fully parametric estimation: the density of y_i is fully specified.
The likelihood function = the joint density of the observed random variables.
Example: density for the exponential model,
  f(y_i | x_i) = (1/\lambda_i) exp(-y_i/\lambda_i), \lambda_i = exp(x_i'\beta),
  E[y_i | x_i] = \lambda_i, Var[y_i | x_i] = \lambda_i^2.
The NLS (M) estimator examined earlier operated only on E[y_i | x_i] = \lambda_i.

Part 14: Nonlinear Models [ 38/84]
The Likelihood Function
f(y_1, ..., y_n | x_1, ..., x_n) factors by independence:
  L(\beta | data) = \prod_{i=1}^n (1/\lambda_i) exp(-y_i/\lambda_i), \lambda_i = exp(x_i'\beta).
The MLE, \hat\beta_{MLE}, maximizes the likelihood function.

Part 14: Nonlinear Models [ 39/84]
Log Likelihood Function
The log is a monotonic function; therefore \hat\beta_{MLE} also maximizes the log likelihood:
  logL(\beta | data) = \sum_{i=1}^n [-log \lambda_i - y_i/\lambda_i].

Part 14: Nonlinear Models [ 40/84]
Conditional and Unconditional Likelihood
Unconditional joint density: f(y_i, x_i | \beta, \delta); \beta = our parameters of interest, \delta = parameters of the marginal density of x_i.
Unconditional likelihood function:
  L(\beta, \delta | y, X) = \prod_{i=1}^n f(y_i, x_i | \beta, \delta) = \prod_{i=1}^n f(y_i | x_i, \beta, \delta) g(x_i | \beta, \delta).
Assuming the parameter space partitions,
  logL(\beta, \delta | y, X) = \sum_{i=1}^n log f(y_i | x_i, \beta) + \sum_{i=1}^n log g(x_i | \delta)
  = conditional log likelihood + marginal log likelihood.

Part 14: Nonlinear Models [ 41/84]
Concentrated Log Likelihood
\hat\theta_{MLE} maximizes logL(\theta | data). Consider a partition into two parts, \theta = (\beta, \gamma).
The maximum occurs where \partial logL/\partial\beta = 0 and \partial logL/\partial\gamma = 0; the joint solution equates both derivatives to 0.
If \partial logL/\partial\gamma = 0 admits an implicit solution for \gamma in terms of \beta, \hat\gamma = \hat\gamma(\beta), then write
  logL_c(\beta, \hat\gamma(\beta)) = a function only of \beta.
The concentrated log likelihood can be maximized for \beta, then the solution computed for \gamma.
The solution must occur where \hat\gamma_{MLE} = \hat\gamma(\hat\beta), so restrict the search to this subspace of the parameter space.

Part 14: Nonlinear Models [ 42/84]
Concentrated Log Likelihood
Fixed effects exponential regression: \lambda_{it} = exp(\alpha_i + \beta'x_{it}).
  logL = \sum_{i=1}^n \sum_{t=1}^T (-log \lambda_{it} - y_{it}/\lambda_{it})
       = \sum_{i=1}^n \sum_{t=1}^T [-(\alpha_i + \beta'x_{it}) - y_{it} exp(-\alpha_i - \beta'x_{it})]
  \partial logL/\partial\alpha_i = \sum_{t=1}^T [-1 + y_{it} exp(-\alpha_i - \beta'x_{it})] = 0
Solve this for \alpha_i:
  exp(\alpha_i) = (1/T) \sum_{t=1}^T y_{it}/exp(\beta'x_{it}), i.e., \hat\alpha_i(\beta) = log[(1/T) \sum_{t=1}^T y_{it}/exp(\beta'x_{it})].
The concentrated log likelihood has
  \hat\lambda_{it} = [(1/T) \sum_{s=1}^T y_{is}/exp(\beta'x_{is})] exp(\beta'x_{it}).
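A minimal Python sketch (mine, assuming a balanced panel and illustrative array shapes) of concentrating the fixed effects out of the exponential log likelihood above:

    # Concentrated log likelihood for the FE exponential regression:
    # alpha_i(beta) has the closed form above, so only beta is searched over.
    import numpy as np
    from scipy.optimize import minimize

    def neg_logl_concentrated(beta, y, X):
        # y: (n, T) panel of outcomes; X: (n, T, K) regressors
        xb = X @ beta                                          # beta'x_it, (n, T)
        alpha_hat = np.log(np.mean(y * np.exp(-xb), axis=1))   # alpha_i(beta), (n,)
        lam = np.exp(alpha_hat[:, None] + xb)                  # lambda_it at concentrated alpha
        return -np.sum(-np.log(lam) - y / lam)                 # negative of logL_c

    # beta_hat = minimize(neg_logl_concentrated, beta0, args=(y, X)).x
    # then each alpha_i follows from the closed form, per the slide.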
Part 14: Nonlinear Models [ 43/84]
ML and M Estimation
logL(\theta) = \sum_{i=1}^n log f(y_i | x_i, \theta)
\hat\theta_{MLE} = argmax \sum_{i=1}^n log f(y_i | x_i, \theta) = argmin -(1/n) \sum_{i=1}^n log f(y_i | x_i, \theta)
The MLE is an M estimator. We can use all of the previous results for M estimation.

Part 14: Nonlinear Models [ 44/84]
'Regularity' Conditions
Conditions for the MLE to be consistent, etc., augment the continuity and identification conditions for M estimation. Regularity:
  Three times continuous differentiability of the log density.
  Finite third moments of the log density.
  Conditions needed to obtain expected values of derivatives of the log density are met.
(See Greene, Chapter 14.)

Part 14: Nonlinear Models [ 45/84]
Consistency and Asymptotic Normality of the MLE
The conditions are identical to those for M estimation; the terms in the proofs are the log density and its derivatives. Nothing new is needed: the law of large numbers and the Lindeberg-Feller central limit theorem apply to the derivatives of the log likelihood.

Part 14: Nonlinear Models [ 46/84]
Asymptotic Variance of the MLE
Based on the results for M estimation:
  Asy.Var[\hat\theta_{MLE}] = {-E[Hessian]}^{-1} {Var[first derivative]} {-E[Hessian]}^{-1}
  = [-E(\partial^2 logL/\partial\theta \partial\theta')]^{-1} Var[\partial logL/\partial\theta] [-E(\partial^2 logL/\partial\theta \partial\theta')]^{-1}.

Part 14: Nonlinear Models [ 47/84]
The Information Matrix Equality
A fundamental result for MLE: the variance of the first derivative equals the negative of the expected second derivative,
  Var[\partial logL/\partial\theta] = -E[\partial^2 logL/\partial\theta \partial\theta'] = the information matrix.
Therefore
  Asy.Var[\hat\theta_{MLE}] = {-E[\partial^2 logL/\partial\theta \partial\theta']}^{-1}.

Part 14: Nonlinear Models [ 48/84]
Three Variance Estimators
Negative inverse of the expected second derivatives matrix (usually not known).
Negative inverse of the actual second derivatives matrix.
Inverse of the variance of the first derivatives.

Part 14: Nonlinear Models [ 49/84]
Asymptotic Efficiency
An M estimator based on the conditional mean is semiparametric -- not necessarily efficient. The MLE is fully parametric: it is efficient among all consistent and asymptotically normal estimators when the density is as specified. This is the Cramer-Rao bound. Note the implied comparison to nonlinear least squares for the exponential regression model.

Part 14: Nonlinear Models [ 50/84]
Invariance
A useful property of MLE: if \gamma = g(\theta) is a continuous function of \theta, the MLE of \gamma is g(\hat\theta_{MLE}).
E.g., in the exponential FE model, the MLE of \theta_i = exp(-\alpha_i) is exp(-\hat\alpha_{i,MLE}).

Part 14: Nonlinear Models [ 51/84]
Application: Exponential Regression - MLE and NLS
MLE assumes E[y|x] = exp(\beta'x) - note the sign reversal.
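For comparison with the NLOGIT output, a minimal Python sketch (mine, not the slides') of the exponential-regression MLE, written in the \lambda_i = exp(x_i'\beta) convention of slide 39 rather than NLOGIT's sign-reversed one:

    # MLE for the exponential regression: maximize
    # logL = sum_i [-log(lambda_i) - y_i/lambda_i], lambda_i = exp(x_i'beta).
    import numpy as np
    from scipy.optimize import minimize

    def neg_logl(beta, y, X):
        xb = X @ beta
        return np.sum(xb + y * np.exp(-xb))   # -logL_i = log(lambda_i) + y_i/lambda_i

    # Usage sketch, with y, X as arrays and the slides' starting values:
    # beta0 = np.r_[np.log(y.mean()), np.zeros(X.shape[1] - 1)]
    # fit = minimize(neg_logl, beta0, args=(y, X), method="BFGS")
    # For this model the expected-Hessian variance estimator is (X'X)^{-1}.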
Part 14: Nonlinear Models [ 52/84]
Variance Estimators
logL = \sum_{i=1}^n [-log \lambda_i - y_i/\lambda_i], \lambda_i = exp(x_i'\beta)
g = \partial logL/\partial\beta = \sum_{i=1}^n [-x_i + (y_i/\lambda_i) x_i] = \sum_{i=1}^n [(y_i/\lambda_i) - 1] x_i
Note E[y_i | x_i] = \lambda_i, so E[g] = 0.
H = \partial^2 logL/\partial\beta \partial\beta' = -\sum_{i=1}^n (y_i/\lambda_i) x_i x_i'
E[H] = -\sum_{i=1}^n x_i x_i' = -X'X (known for this particular model).

Part 14: Nonlinear Models [ 53/84]
Three Variance Estimators
Berndt-Hall-Hall-Hausman (BHHH): [\sum_{i=1}^n g_i g_i']^{-1} = {\sum_{i=1}^n [(y_i/\hat\lambda_i) - 1]^2 x_i x_i'}^{-1}
Based on actual second derivatives: [-\sum_{i=1}^n H_i]^{-1} = [\sum_{i=1}^n (y_i/\hat\lambda_i) x_i x_i']^{-1}
Based on expected second derivatives: [-\sum_{i=1}^n E(H_i)]^{-1} = [\sum_{i=1}^n x_i x_i']^{-1} = (X'X)^{-1}

Part 14: Nonlinear Models [ 54/84]
Variance Estimators
Loglinear ; Lhs = hhninc ; Rhs = x ; Model = Exponential $
Create ; thetai = exp(x'b) ; hi = hhninc*thetai ; gi2 = (hi-1)^2 $
Matrix ; he = <x'x> ; ha = <x'[hi]x> ; bhhh = <x'[gi2]x> $
Matrix ; stat(b,ha) ; stat(b,he) ; stat(b,bhhh) $
(With the sign reversal noted on slide 51, exp(x'b) = 1/\hat\lambda_i, so hi = y_i/\hat\lambda_i and hi - 1 is the score term.)

Part 14: Nonlinear Models [ 55/84]
Hypothesis Tests
The trinity of tests for nested hypotheses: Wald, likelihood ratio, Lagrange multiplier -- all as defined for the M estimators.

Part 14: Nonlinear Models [ 56/84]
Example: Exponential vs. Gamma
Gamma distribution: f(y_i | x_i, \beta, P) = [exp(-y_i/\lambda_i) y_i^{P-1}] / [\lambda_i^P \Gamma(P)]
Exponential: P = 1.

Part 14: Nonlinear Models [ 57/84]
Log Likelihood
logL = \sum_{i=1}^n [-P log \lambda_i - log \Gamma(P) - y_i/\lambda_i + (P-1) log y_i]
     = \sum_{i=1}^n [-log \lambda_i - y_i/\lambda_i] + \sum_{i=1}^n [(P-1) log(y_i/\lambda_i) - log \Gamma(P)]
     = the exponential logL + the part due to P != 1.
At P = 1 the second sum vanishes, since \Gamma(1) = 0! = 1.

Part 14: Nonlinear Models [ 58/84]
Estimated Gamma Model

Part 14: Nonlinear Models [ 59/84]
Testing P = 1
Wald: W = (5.10591 - 1)^2 / .04233^2 = 9408.5
Likelihood ratio: logL(P=1) = 1539.31; logL(gamma) = 14240.74;
  LR = 2(14240.74 - 1539.31) = 25402.86
Lagrange multiplier...

Part 14: Nonlinear Models [ 60/84]
Derivatives for the LM Test
logL = \sum_{i=1}^n [-log \lambda_i - y_i/\lambda_i + (P-1) log(y_i/\lambda_i) - log \Gamma(P)]
\partial logL/\partial\beta = \sum_{i=1}^n (y_i/\lambda_i - P) x_i = \sum_{i=1}^n g_{\beta,i} x_i
\partial logL/\partial P = \sum_{i=1}^n [log(y_i/\lambda_i) - \psi(P)] = \sum_{i=1}^n g_{P,i}
\psi(1) = -.5772156649
For the LM test, we compute these at the exponential MLE and P = 1.

Part 14: Nonlinear Models [ 61/84]
Psi Function
[Figure: \psi(P) = log-derivative of the gamma function, plotted for P from 0 to 3.5.]

Part 14: Nonlinear Models [ 62/84]
Score Test
Test the hypothesis that the derivative vector equals zero when evaluated for the larger model at the restricted coefficient vector.
The estimator of zero is \bar{g} = (1/n) \sum_{i=1}^n g_i.
Statistic = chi squared = \bar{g}' [Var \bar{g}]^{-1} \bar{g}.
Use (1/n)(1/n) \sum_{i=1}^n g_i g_i' (the n's will cancel):
  chi squared = (\sum_{i=1}^n g_i)' [\sum_{i=1}^n g_i g_i']^{-1} (\sum_{i=1}^n g_i).

Part 14: Nonlinear Models [ 63/84]
Calculated LM Statistic
Loglinear ; Lhs = hhninc ; Rhs = x ; Model = Exponential $
Create ; thetai = exp(x'b) ; gi = (hhninc*thetai - 1) $
Create ; gpi = log(hhninc*thetai) - psi(1) $
Create ; g1i = gi ; g2i = gi*educ ; g3i = gi*married ; g4i = gi*age ; g5i = gpi $
Namelist ; ggi = g1i,g2i,g3i,g4i,g5i $
Matrix ; list ; lm = 1'ggi * <ggi'ggi> * ggi'1 $
Matrix LM (1 row, 1 column): 23468.7
? Use built-in procedure.
? LM is computed with the actual Hessian instead of BHHH.
Loglinear ; Lhs = hhninc ; Rhs = x ; Model = Exponential $
Logl ; lhs = hhninc ; rhs = x ; model = gamma ; start = b,1 ; maxit = 0 $
| LM Stat. at start values 9604.33 |
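A minimal Python sketch (mine, not the NLOGIT above) of the outer-product (BHHH) form of the score test of slide 62, for H0: P = 1 in the gamma model, evaluated at the exponential MLE. Here the \lambda_i = exp(x_i'\beta) convention is used, so beta_hat is the exponential MLE in that parameterization.

    # LM/score test of exponential (P = 1) against gamma, using the
    # derivatives on slide 60 and the outer-product variance estimator.
    import numpy as np
    from scipy.special import digamma

    def lm_statistic(y, X, beta_hat):
        lam = np.exp(X @ beta_hat)                        # fitted means at the exponential MLE
        g_beta = (y / lam - 1.0)[:, None] * X             # dlogL/dbeta at P = 1
        g_P = (np.log(y / lam) - digamma(1.0))[:, None]   # dlogL/dP; psi(1) = -.5772...
        G = np.hstack([g_beta, g_P])                      # n x (K+1) matrix of score terms
        gsum = G.sum(axis=0)
        return gsum @ np.linalg.solve(G.T @ G, gsum)      # (sum g)'[sum g g']^{-1}(sum g)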
Part 14: Nonlinear Models [ 64/84]
Clustered Data and Partial Likelihood
Panel data: y_{it} | x_{it}, t = 1, ..., T_i, with some connection across observations within a group.
Assume the marginal density for y_{it} | x_{it} is f(y_{it} | x_{it}, \theta).
Treat the joint density for individual i as if it were f(y_{i1}, ..., y_{i,T_i} | X_i) = \prod_{t=1}^{T_i} f(y_{it} | x_{it}, \theta).
"Pseudo log likelihood": \sum_{i=1}^n log \prod_{t=1}^{T_i} f(y_{it} | x_{it}, \theta) = \sum_{i=1}^n \sum_{t=1}^{T_i} log f(y_{it} | x_{it}, \theta)
-- just the pooled log likelihood, ignoring the panel aspect of the data. Not the correct log likelihood.
Does maximizing it with respect to \theta work? Yes, if the marginal density is correctly specified.

Part 14: Nonlinear Models [ 65/84]
Inference with 'Clustering'
(1) The estimator is consistent.
(2) The asymptotic covariance matrix needs adjustment:
  Asy.Var[\hat\theta] = [Hessian]^{-1} Var[gradient] [Hessian]^{-1}
  H = \sum_{i=1}^n \sum_{t=1}^{T_i} H_{it}; g = \sum_{i=1}^n g_i, where g_i = \sum_{t=1}^{T_i} g_{it}.
Terms within g_i are not independent, so Var[g] cannot be estimated with \sum_i \sum_t g_{it} g_{it}'. But terms across i are independent, so we estimate Var[g] with \sum_{i=1}^n (\sum_t \hat{g}_{it})(\sum_t \hat{g}_{it})':
  Est.Var[\hat\theta_{PMLE}] = [\sum_i \sum_t \hat{H}_{it}]^{-1} [\sum_i (\sum_t \hat{g}_{it})(\sum_t \hat{g}_{it})'] [\sum_i \sum_t \hat{H}_{it}]^{-1}
(Stata inserts a term n/(n-1) before the middle term.)

Part 14: Nonlinear Models [ 66/84]
Cluster Estimation

Part 14: Nonlinear Models [ 67/84]
On Clustering
The theory is very loose: that the marginals would be correctly specified while there is 'correlation' across observations is ambiguous. It seems to work pretty well in practice (anyway). BUT it does not imply that one can safely just pool the observations in a panel and ignore unobserved common effects.

Part 14: Nonlinear Models [ 68/84]
'Robust' Estimation
If the model is misspecified in some way, then the information matrix equality does not hold. Assuming the estimator remains consistent, the appropriate asymptotic covariance matrix is the 'robust' matrix -- actually, the original sandwich:
  Asy.Var[\hat\theta_{MLE}] = [E[Hessian]]^{-1} Var[gradient] [E[Hessian]]^{-1}
(Software can be coerced into computing this by telling it that clusters all have one observation in them.)
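A minimal Python sketch (mine; not any package's API) of the cluster-robust sandwich of slide 65: scores are summed within each cluster before forming the middle matrix.

    # Cluster-robust sandwich: [sum H]^{-1} [sum_i g_i g_i'] [sum H]^{-1},
    # with g_i = sum_t g_it summed within cluster i.
    import numpy as np

    def cluster_sandwich(H_sum, scores, cluster_ids):
        """H_sum: K x K summed Hessian; scores: n x K rows g_it'; cluster_ids: length n."""
        ids = np.asarray(cluster_ids)
        K = scores.shape[1]
        meat = np.zeros((K, K))
        for c in np.unique(ids):
            g_i = scores[ids == c].sum(axis=0)   # sum_t g_it within cluster i
            meat += np.outer(g_i, g_i)
        Hinv = np.linalg.inv(H_sum)
        return Hinv @ meat @ Hinv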
Part 14: Nonlinear Models [ 69/84]
Two Step Estimation and Murphy/Topel
Likelihood function defined over two parameter vectors:
  logL = \sum_{i=1}^n log f(y_i | x_i, z_i, \theta, \gamma)
(1) Maximize the whole thing (FIML).
(2) Typical situation -- two steps. E.g.,
  f(HHNINC | educ, married, age, Ifkids) = (1/\lambda_i) exp(-y_i/\lambda_i),
  \lambda_i = exp(\beta_0 + \beta_1 Educ + \beta_2 Married + \beta_3 Age + \beta_4 Pr[IfKids]),
and IfKids | age, bluec is a logistic regression:
  Pr[IfKids] = exp(\gamma_0 + \gamma_1 Age + \gamma_2 Bluec) / [1 + exp(\gamma_0 + \gamma_1 Age + \gamma_2 Bluec)].
(3) Two step strategy: fit the stage one model (\gamma) by MLE first, insert the results in logL(\theta, \hat\gamma), and estimate \theta.

Part 14: Nonlinear Models [ 70/84]
Two Step Estimation
(1) Does it work? Yes, with the usual identification conditions, continuity, etc. The first step estimator is assumed to be consistent and asymptotically normally distributed.
(2) The asymptotic covariance matrix at the second step that takes \hat\gamma as if it were known is too small.
(3) Repair the covariance matrix by the Murphy-Topel result (the one published verbatim twice by JBES).

Part 14: Nonlinear Models [ 71/84]
Murphy-Topel - 1
logL_1(\gamma) defines the first step estimator \hat\gamma.
\hat{V}_1 = estimated asymptotic covariance matrix for \hat\gamma; g_{i,1} = \partial log f_{i,1}(..., \gamma)/\partial\gamma.
(\hat{V}_1 might be [\sum_{i=1}^n \hat{g}_{i,1} \hat{g}_{i,1}']^{-1}.)
logL_2(\theta, \hat\gamma) defines the second step estimator, using the estimated value of \gamma.
\hat{V}_2 = estimated asymptotic covariance matrix for \hat\theta | \hat\gamma; g_{i,2} = \partial log f_{i,2}(..., \theta, \hat\gamma)/\partial\theta.
(\hat{V}_2 might be [\sum_{i=1}^n \hat{g}_{i,2} \hat{g}_{i,2}']^{-1}.)
\hat{V}_2 is too small.

Part 14: Nonlinear Models [ 72/84]
Murphy-Topel - 2
\hat{V}_1 = est. asy. cov. matrix for \hat\gamma; g_{i,1} = \partial log f_{i,1}(..., \gamma)/\partial\gamma
\hat{V}_2 = est. asy. cov. matrix for \hat\theta | \hat\gamma; g_{i,2} = \partial log f_{i,2}(..., \theta, \hat\gamma)/\partial\theta
h_{i,2} = \partial log f_{i,2}(..., \theta, \hat\gamma)/\partial\hat\gamma
\hat{C} = \sum_{i=1}^n \hat{g}_{i,2} \hat{h}_{i,2}' (the off diagonal block of the Hessian)
\hat{R} = \sum_{i=1}^n \hat{g}_{i,2} \hat{g}_{i,1}' (cross products of derivatives for the two logL's)
M&T corrected:
  \hat{V}_2* = \hat{V}_2 + \hat{V}_2 [\hat{C} \hat{V}_1 \hat{C}' - \hat{C} \hat{V}_1 \hat{R}' - \hat{R} \hat{V}_1 \hat{C}'] \hat{V}_2

Part 14: Nonlinear Models [ 73/84]
Application of M&T
Reject    ; hhninc = 0 $
Logit     ; lhs = hhkids ; rhs = one,age,bluec ; prob = prifkids $
Matrix    ; v1 = varb $
Names     ; z1 = one,age,bluec $
Create    ; gi1 = hhkids - prifkids $
Loglinear ; lhs = hhninc ; rhs = one,educ,married,age,prifkids ; model = e $
Matrix    ; v2 = varb $
Names     ; z2 = one,educ,married,age,prifkids $
Create    ; gi2 = hhninc*exp(z2'b) - 1 $
Create    ; hi2 = gi2*b(5)*prifkids*(1-prifkids) $
Create    ; varc = gi1*gi2 ; varr = gi1*hi2 $
Matrix    ; c = z2'[varc]z1 ; r = z2'[varr]z1 $
Matrix    ; q = c*v1*c' - c*v1*r' - r*v1*c'
          ; mt = v2 + v2*q*v2 ; stat(b,mt) $

Part 14: Nonlinear Models [ 74/84]
M&T Application
Multinomial Logit Model, dependent variable HHKIDS
Variable  Coefficient   Standard Error  b/St.Er.  P[|Z|>z]  Mean of X
Constant   2.61232320    .05529365       47.245    .0000
AGE        -.07036132    .00125773      -55.943    .0000    43.5271942
BLUEC      -.02474434    .03052219        -.811    .4175     .24379621
Exponential (Loglinear) Regression Model, dependent variable HHNINC
Variable  Coefficient   Standard Error  b/St.Er.  P[|Z|>z]  Mean of X
Constant  -3.79588863    .44440782       -8.541    .0000
EDUC       -.05580594    .00267736      -20.844    .0000    11.3201838
MARRIED    -.20232648    .01487166      -13.605    .0000     .75869263
AGE         .08112565    .00633014       12.816    .0000    43.5271942
PRIFKIDS   5.23741034    .41248916       12.697    .0000     .40271576
With the Murphy-Topel corrected covariance matrix:
B_1       -3.79588863    .44425516       -8.544    .0000
B_2        -.05580594    .00267540      -20.859    .0000
B_3        -.20232648    .01486667      -13.609    .0000
B_4         .08112565    .00632766       12.821    .0000
B_5        5.23741034    .41229755       12.703    .0000
Why so little change? N = 27,000+. No new variation.

Part 14: Nonlinear Models [ 75/84]
GMM Estimation
\bar{g}(\beta) = (1/N) \sum_{i=1}^N m_i(y_i, x_i, \beta)
Asy.Var[\bar{g}(\beta)] is estimated with W = (1/N)(1/N) \sum_{i=1}^N m_i(y_i, x_i, \beta) m_i(y_i, x_i, \beta)'.
The GMM estimator of \beta then minimizes
  q = [(1/N) \sum_{i=1}^N m_i(y_i, x_i, \beta)]' W^{-1} [(1/N) \sum_{i=1}^N m_i(y_i, x_i, \beta)].
Est.Asy.Var[\hat\beta_{GMM}] = [G'W^{-1}G]^{-1}, G = \partial\bar{g}(\beta)/\partial\beta'.

Part 14: Nonlinear Models [ 76/84]
GMM Estimation - 1
GMM is broader than M estimation and ML estimation; both M and ML are GMM estimators:
  \bar{g}(\beta) = (1/n) \sum_{i=1}^n \partial log f(y_i | x_i, \beta)/\partial\beta for MLE
  \bar{g}(\beta) = (1/n) \sum_{i=1}^n e_i \partial E(y_i | x_i, \beta)/\partial\beta for NLSQ

Part 14: Nonlinear Models [ 77/84]
GMM Estimation - 2
Exactly identified GMM problems: when \bar{g}(\beta) = (1/N) \sum_{i=1}^N m_i(y_i, x_i, \beta) = 0 is K equations in K unknown parameters (the exactly identified case), the weighting matrix in
  q = \bar{g}(\beta)' W^{-1} \bar{g}(\beta)
is irrelevant to the solution, since we can set \bar{g}(\beta) = 0 exactly, so q = 0. And the asymptotic covariance matrix (estimator) is the product of 3 square matrices and becomes
  [G'W^{-1}G]^{-1} = G^{-1} W (G')^{-1}.
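A minimal Python sketch (mine; the moment function and names are placeholders) of the GMM criterion above, using the common two-step scheme in which W is built from first-step moments:

    # Two-step GMM: minimize q = gbar' W^{-1} gbar, with
    # W = (1/N) sum_i m_i m_i' evaluated at a first-step estimate.
    import numpy as np
    from scipy.optimize import minimize

    def gmm(moments, beta0, data):
        """moments(beta, data) returns an N x L matrix with rows m_i'."""
        def q(beta, W):
            gbar = moments(beta, data).mean(axis=0)
            return gbar @ np.linalg.solve(W, gbar)
        L = moments(beta0, data).shape[1]
        b1 = minimize(q, beta0, args=(np.eye(L),)).x    # step 1: W = I
        m1 = moments(b1, data)
        W = (m1.T @ m1) / len(m1)                       # W = (1/N) sum m_i m_i'
        return minimize(q, b1, args=(W,)).x             # step 2: efficient GMM

(The extra 1/N in the slide's W only rescales q and does not change the minimizer.)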
Part 14: Nonlinear Models [ 78/84]
Optimization - Algorithms
Maximize or minimize (optimize) a function F(\theta).
Algorithm = a rule for searching for the optimizer.
Iterative algorithm: \theta^{(k+1)} = \theta^{(k)} + update^{(k)}.
Derivative (gradient) based algorithm: \theta^{(k+1)} = \theta^{(k)} + update(g^{(k)}); the update is a function of the gradient.
Compare to 'derivative free' methods (for discontinuous criterion functions).

Part 14: Nonlinear Models [ 79/84]
Optimization Algorithms
Iteration: \theta^{(k+1)} = \theta^{(k)} + update^{(k)}. General structure:
  \theta^{(k+1)} = \theta^{(k)} + \lambda^{(k)} W^{(k)} g^{(k)}
g^{(k)} = derivative vector, points to a better value than \theta^{(k)}; W^{(k)} g^{(k)} = direction vector; \lambda^{(k)} = 'step size'; W^{(k)} = a weighting matrix.
Algorithms are defined by the choices of \lambda^{(k)} and W^{(k)}.

Part 14: Nonlinear Models [ 80/84]
Algorithms
Steepest ascent: \lambda^{(k)} = -g^{(k)'}g^{(k)} / g^{(k)'}H^{(k)}g^{(k)}, W^{(k)} = I
  (g^{(k)} = first derivative vector, H^{(k)} = second derivatives matrix)
Newton's method: \lambda^{(k)} = -1, W^{(k)} = [H^{(k)}]^{-1} (sometimes called Newton-Raphson)
Method of scoring: \lambda^{(k)} = -1, W^{(k)} = [E[H^{(k)}]]^{-1}
  (Scoring uses the expected Hessian. Usually inferior to Newton's method; takes more iterations.)
BHHH method (for MLE): same as Newton's method with H^{(k)} replaced by -\sum_{i=1}^n g_i^{(k)} g_i^{(k)'}.

Part 14: Nonlinear Models [ 81/84]
Line Search Methods
Squeezing: essentially trial and error; \lambda^{(k)} = 1, 1/2, 1/4, 1/8, ... until the function improves.
Golden section: interpolate between \theta^{(k)} and \theta^{(k-1)}.
Others: many different methods have been suggested.

Part 14: Nonlinear Models [ 82/84]
Quasi-Newton Methods
How to construct the weighting matrix -- variable metric methods:
  W^{(k)} = W^{(k-1)} + E^{(k-1)}, with W^{(1)} = I.
Rank one updates: W^{(k)} = W^{(k-1)} + a^{(k-1)} a^{(k-1)'} (Davidon-Fletcher-Powell).
There are rank two updates (Broyden) and higher.

Part 14: Nonlinear Models [ 83/84]
Stopping Rule
When to stop iterating: 'convergence.'
(1) Derivatives are small? Not good: the maximizer of F(\theta) is the same as that of .0000001 F(\theta), but the derivatives are small right away.
(2) Small absolute change in parameters from one iteration to the next? Problematic, because it is a function of the step size, which may be small.
(3) Commonly accepted 'scale free' measure: g^{(k)'} [H^{(k)}]^{-1} g^{(k)}.
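A minimal Python sketch (mine, for a generic maximization problem) of Newton's method with the scale-free stopping rule in item (3):

    # Newton iteration: theta(k+1) = theta(k) - H^{-1} g, stopping on g'H^{-1}g.
    import numpy as np

    def newton(grad, hess, theta0, tol=1e-10, max_iter=50):
        theta = theta0.copy()
        for _ in range(max_iter):
            g, H = grad(theta), hess(theta)
            step = np.linalg.solve(H, g)     # H^{-1} g
            if abs(g @ step) < tol:          # |g'H^{-1}g|: scale-free convergence measure
                break
            theta = theta - step             # Newton update
        return theta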
Part 14: Nonlinear Models [ 84/84]
For Example
Nonlinear Estimation of Model Parameters
Method=BFGS ; Maximum iterations= 4
Convergence criteria: gtHg .1000D-05  chg.F .0000D+00  max|dB| .0000D+00
Start values: -.10437D+01 .00000D+00 .00000D+00 .00000D+00 .10000D+01
1st derivs.   -.23934D+05 -.26990D+06 -.18037D+05 -.10419D+07 .44370D+05
Parameters:   -.10437D+01 .00000D+00 .00000D+00 .00000D+00 .10000D+01
Itr 1 F= .3190D+05 gtHg= .1078D+07 chg.F= .3190D+05 max|db|= .1042D+13
  Try = 0 F= .3190D+05 Step= .0000D+00 Slope= -.1078D+07
  Try = 1 F= .4118D+06 Step= .1000D+00 Slope=  .2632D+08
  Try = 2 F= .5425D+04 Step= .5214D-01 Slope=  .8389D+06
  Try = 3 F= .1683D+04 Step= .4039D-01 Slope= -.1039D+06
1st derivs.   -.45100D+04 -.45909D+05 -.18517D+04 -.95703D+05 -.53142D+04
Parameters:   -.10428D+01 .10116D-01 .67604D-03 .39052D-01 .99834D+00
Itr 2 F= .1683D+04 gtHg= .1064D+06 chg.F= .3022D+05 max|db|= .4538D+07
  Try = 0 F= .1683D+04 Step= .0000D+00 Slope= -.1064D+06
  Try = 1 F= .1006D+06 Step= .4039D-01 Slope=  .7546D+07
  Try = 2 F= .1839D+04 Step= .4702D-02 Slope=  .1847D+06
  Try = 3 F= .1582D+04 Step= .1855D-02 Slope=  .7570D+02
...
1st derivs.   -.32179D-05 -.29845D-04 -.28288D-05 -.16951D-03 .73923D-06
Itr 20 F= .1424D+05 gt<H>g= .13893D-07 chg.F= .1155D-08 max|db|= .1706D-08
* Converged. Normal exit from iterations. Exit status=0.
Function = .31904974915D+05 at entry, -.14237328819D+05 at exit.