Econometric Notes
30 December 2014

Houston H. Stokes
Department of Economics
University of Illinois at Chicago
hhstokes@uic.edu

An Overview of Econometrics
Objective of Notes
1. Purpose of statistics
2. Role of statistics
3. Basic Statistics
4. More complex setup to illustrate using MATLAB to estimate the model
5. Review of Linear Algebra and Introduction to Programming Regression Calculations
   Figure 5.1 X'X for a random Matrix X
6. A Sample Multiple Input Regression Model Dataset
   Figure 6.1 2-D Plots of Textile Data
   Figure 6.2 3-D Plot of Theil (1971) Textile Data
7. Advanced Regression Analysis
   Figure 7.1 Analysis of residuals of the YMA model
   Figure 7.2 Recursively estimated X1 and X3 coefficients for X1-sorted data
   Figure 7.3 CUSUM test of the y model estimated with sorted data
   Figure 7.4 CUMSQ test of the y model estimated with sorted data
   Figure 7.5 Quandt Likelihood Ratio tests of the y model estimated with sorted data
8. Advanced concepts
9. Summary

Objective of Notes

The objective of these notes is to introduce students to the basics of applied regression calculation using Stata setups of a number of very simple models. Computer code is shown to allow students to "get going" as soon as possible. More advanced sections show MATLAB code to make the calculations. The notes are organized around the estimation of regression models and the use of basic statistical concepts. The textbooks Introduction to Econometrics by Christopher Dougherty (4th Edition, Oxford, 2011) and Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge (5th Edition, South-Western Cengage, 2013) can be used to provide added information. A number of examples from these books will be shown.

Statistical analysis will be treated both as a means by which the data can be summarized and as a means by which it is possible to accept or reject a specific hypothesis. Four simple datasets are initially discussed:

- The price vs. age of cars dataset illustrates a simple two-variable OLS model where graphics and correlation analysis can be used to detect relationships.
- The Theil (1971) textile dataset illustrates the use of log transformations and contrasts 2-D and 3-D graphic analysis of data. A variable with a low correlation was shown to enter an OLS model only in the presence of another variable.
- The Brownlee (1965) stack loss dataset illustrates how, in a multiple regression context, variables with "significant" correlation may not enter a full model.
- The Brownlee (1965) stress dataset illustrates the dangers of relying on correlation analysis.

Finally, a number of statistical problems and procedures that might be used are discussed.
1. Purpose of statistics

- Summarize data
- Test models
- Allow one to generalize from a sample to the wider population

2. Role of statistics

Quote by Stanley (1856) in a presidential address to Section F of the British Association for the Advancement of Science:

"The axiom on which ... (statistics) is based may be stated thus: that the laws by which nature is governed, and more especially those laws which operate on the moral and physical condition of the human race, are consistent, and are, in all cases best discoverable - in some cases only discoverable - by the investigation and comparison of phenomena extending over a very large number of individual instances. In dealing with MAN in the aggregate, results may be calculated with the precision and accuracy of a mathematical problem... This then is the first characteristic of statistics as a science: that it proceeds wholly by the accumulation and comparison of registered facts; - that from these facts alone, properly classified, it seeks to deduce general principles, and that it rejects all a priori reasoning, employing hypothesis, if at all, only in a tentative manner, and subject to future verification."

(Note: underlining entered by H. H. Stokes.)

3. Basic Statistics

Key concepts:

- Mean: $\bar{x} = \sum x_i / N$
- Median: the middle data value
- Mode: the data value with the most cases
- Population variance: $\sigma_x^2$
- Sample variance: $s_x^2$
- Population standard deviation: $\sigma_x$
- Sample standard deviation: $s_x$
- Confidence interval with k%: a range of data values
- Correlation: $\rho_{xy}$
- Regression: $y = \alpha + \sum_{i=1}^{k} \beta_i X_i + e$, where $X_i$ is a column of the N by K matrix X of explanatory variables
- Percentile
- Quartile
- Z score
- t test
- SE of the mean
- Central Limit Theorem

Statistics attempts to generalize about a population from a sample. For the purposes of this discussion, consider the population of men in the US. A 1/1000 sample from this population would be a randomly selected sample of men such that the sample contained only one male for every 1000 in the population. The task of statistics is to be able to draw meaningful generalizations from the sample about the population. It is costly, and often impossible, to examine all the measurements in the population of interest. A sample must be selected in such a manner that it is representative of the population.

In a famous example of the potential for problems in sample selection, during the depression the Literary Digest attempted to sample the electorate before the 1936 presidential election. Numbers to call were selected from telephone listings, and in each call the question was asked, "Who will you vote for, Governor Landon or President Roosevelt?" Those called, for the most part, supported Governor Landon. When Mr. Roosevelt won the election, the question was asked: what went wrong in the sampling process? The problem was the assumption that those who had phones were a correct characterization of the population of voters. Those without phones in that period disproportionally went for Mr. Roosevelt, biasing the results of the study.

In summary, statistics allows us to use the information contained in a representative sample to correctly make inferences about the population. For example, if one were interested in ascertaining how long the light bulbs produced by a certain company last, one could hardly test them all. Sampling would be necessary. The bootstrap can be used to test the distribution of statistics estimated from a sample whose distribution is not known.
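Since the bootstrap was just mentioned, the following MATLAB sketch shows the idea in its simplest form: resample the data with replacement many times and look at the spread of the recomputed statistic. It is a minimal sketch for exposition only; the data vector and replication count are arbitrary illustrative choices, not values taken from these notes.

% Minimal bootstrap sketch: approximate the SE of the mean by resampling.
% The data vector and the number of replications are illustrative choices.
x = [1 2 3 4 5 6 7 8 9]';        % sample data (the vector used in section 3)
nboot = 1000;                    % number of bootstrap replications
n = length(x);
bmeans = zeros(nboot,1);
for i = 1:nboot
   idx = ceil(n*rand(n,1));      % draw n indices with replacement
   bmeans(i) = mean(x(idx));
end
disp('Bootstrap SE of the mean vs. the textbook formula s/sqrt(N)')
disp([std(bmeans), std(x)/sqrt(n)])

The two numbers printed should be close; the advantage of the bootstrap is that the same loop works for statistics, such as the median, for which no simple standard-error formula exists.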
In addition to sampling correctly, it is important to be able to detect a shift in the underlying population. The usual practice is to draw a sample from the population to be able to make inferences about the underlying population. If the population is shifting, such samples will give biased information. For example, assume a reservoir. If a rain comes and adds to and stirs up the water in the reservoir, samples of water would have to be taken more frequently than if there had been no rain and there was no change in water usage. The interesting question is: how do you know when to start increasing the sampling rate? A possible approach would be to increase the sampling rate when the water quality of previous samples begins to fall outside normal ranges for the focus variable. In this example, it is not possible to use the population of all the water in the reservoir to test the water. A number of key concepts are listed next.

Measures of central tendency. The mean is a measure of central tendency. Assume a vector x containing N observations. The mean is defined as

$\bar{x} = \sum_{i=1}^{N} x_i / N$   (3-1)

Assuming $x_i$ = (1 2 3 4 5 6 7 8 9), then N = 9 and $\bar{x} = 5$. The mean is often written as $\bar{x}$ or E(x), the expected value of x. The problem with the mean as a measure of central tendency is that it is affected by all observations. If instead of making $x_9 = 9$ we make $x_9 = 99$, then $\bar{x} = (45 + 90)/9 = 15$, which is bigger than all $x_i$ values except $x_9$. The median, defined as the middle term of an odd number of terms, or the average of the two middle terms, when the terms have been arranged in increasing order, is not affected by outlier terms. In the above example the median is 5 no matter whether $x_9 = 9$ or $x_9 = 99$. The final measure of central tendency is the mode, the value which has the highest frequency. The mode may not be unique. In the above example, it does not exist.

Variation. It has been reported that a poor statistician once drowned in a stream with a mean depth of 6 inches. How could this occur? To summarize the data, we also need to check on variation, something that can be done by looking at the standard deviation and variance. The population variance of a vector x is defined as

$\sigma_x^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2 / N$   (3-2)

while the sample variance $s_x^2$ is

$s_x^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2 / (N-1)$   (3-3)

The population standard deviation $\sigma_x$ is the square root of the population variance. For the purposes of these notes, the standard deviation will mean the sample standard deviation. There are alternative formulas for these values that may be easier to use. As an alternative to (3-2) and (3-3),

$\sigma_x^2 = (N \sum_{i=1}^{N} x_i^2 - (\sum_{i=1}^{N} x_i)^2) / N^2$   (3-4)

$s_x^2 = (N \sum_{i=1}^{N} x_i^2 - (\sum_{i=1}^{N} x_i)^2) / (N(N-1))$   (3-5)

For implementing the variance in a computer program, (3-2) is more accurate than (3-4). Why is this the case? (Hint: (3-4) subtracts two large, nearly equal numbers, $N\sum x_i^2$ and $(\sum x_i)^2$, so when the mean is large relative to the spread most significant digits cancel.)

A general rule (exact for the normal distribution) is that $x_i$ will lie within ±3 standard deviations of the mean 99% of the time, within ±2 standard deviations 95% of the time, and within ±1 standard deviation 68% of the time.

Given a vector of numbers, it is important to determine where a certain number might lie. The three quartile boundaries divide a series into four parts: quartile 1 is the top of the lower 25%, quartile 2 is the top of the lower 50% (the median), and quartile 3 is the top of the lower 75%.

The standard deviation gives information concerning where observations lie. Assume $\bar{x} = 10$, $s_x = 5$ and N = 300. The question asked is: how likely is a value > 14 to occur? To answer this question requires putting the data in Z form, where

$Z = (x_i - \bar{x}) / s_x$   (3-6)

Think of Z as a normalized deviation.
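For readers who prefer computing to table lookup, the upper-tail probability of a standard normal variable can be evaluated directly with MATLAB's built-in erfc function. The short sketch below applies it to the Z value computed in the example that follows; it is a convenience sketch only.

% Normal upper-tail probability without tables: P(Z > z) = 0.5*erfc(z/sqrt(2)).
% Applied to the example in the text: xbar = 10, s = 5, value of interest 14.
xbar = 10; s = 5;
z = (14 - xbar)/s;          % Z = 0.8, as computed in the text
p = 0.5*erfc(z/sqrt(2));    % P(Z > 0.8), about 0.212
disp([z p])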
Once we have Z, we can enter standard normal tables (or compute the probability directly, as in the sketch above) and determine how likely such a value is to occur. In this case Z = (14 - 10)/5 = .8.

Distribution of the mean. It often is desirable to know how the sample mean $\bar{x}$ is distributed. Assuming each $x_i$ value is mutually independent, the Central Limit Theorem states that if the vector $(x_1, \ldots, x_N)$ has any distribution with mean $\mu$ and finite variance $\sigma^2$, then the distribution of $\bar{x}$ approaches the normal distribution with mean $\mu$ and variance $\sigma^2 / N$ as the sample size N increases. Note that the standard deviation of the mean $\bar{x}$ is defined as

$\sigma_{\bar{x}} = \sigma / \sqrt{N}$   (3-7)

Given $\bar{x}$ and $\sigma_{\bar{x}}$, the 95% confidence interval around $\bar{x}$ is

$\bar{x} - 2\sigma_{\bar{x}} \le \mu \le \bar{x} + 2\sigma_{\bar{x}}$   (3-8)

For small samples (N < 30) the formula is

$\bar{x} - t_{.025}(s_x / \sqrt{N}) \le \mu \le \bar{x} + t_{.025}(s_x / \sqrt{N})$   (3-9)

Tests of two means. Assume two vectors x and y where we know $\bar{x}$, $\bar{y}$, $s_x^2$ and $s_y^2$. The simplest test of whether the means differ is

$Z = (\bar{x} - \bar{y}) / ((s_x^2 / N_x) + (s_y^2 / N_y))^{.5}$   (3-10)

where the small-sample approximation, assuming the two samples have the same population standard deviation, is

$t = (\bar{x} - \bar{y}) / ((s_p^2 / N_x) + (s_p^2 / N_y))^{.5}$   (3-11)

$s_p^2 = ((N_x - 1)s_x^2 + (N_y - 1)s_y^2) / (N_x + N_y - 2)$   (3-12)

Note that $s_p^2$ is an estimate of the population variance.

Correlation. If two variables are thought to be related, a possible summary measure would be the correlation coefficient $\rho$. Most calculators or statistical computer programs will make the calculation. The standard error of $\hat{\rho}$ is $(N-1)^{-.5}$ for small samples and $N^{-.5}$ for large samples. This means that $\hat{\rho} / (N-1)^{-.5}$ is distributed as a t statistic with asymptotic percentages as given above. The correlation coefficient is defined as

$\rho = (\overline{xy} - \bar{x}\,\bar{y}) / (\sigma_x \sigma_y)$   (3-13)

Perfect positive correlation is 1.0; perfect negative correlation is -1.0. The SE of $\hat{\rho}$ converges to 0.0 as $N \to \infty$. If N were 101, the SE of r would be 1/10 or .1, and $|\hat{\rho}|$ would have to be at least .2 to be significant at or better than the 95% level. Correlation is a major tool of analysis that allows a person to formalize what is shown in an x-y plot of the data. A simple dataset will be used to illustrate these concepts and introduce OLS models, as well as to show the flaws of correlation analysis as a diagnostic tool.

Single Equation OLS Regression Model. Data was obtained on 6 observations on the age and value of cars (from Freund [1960] Modern Elementary Statistics, page 332), two variables that are thought to be related. Table 1 lists this data and gives the means, the correlation between age and value, and a simple regression value = f(age). We expect the relationship to be negative and significant.

Table 1. Age and value of cars

Obs            Age       Value
1                1        1995
2                3         875
3                6         695
4               10         345
5                5         595
6                2        1795
Mean           4.5        1050
Variance      10.7      461750
Correlation        -0.85884

Next we show the Stata command files used to analyze this data. Assume you have a file car_age_data.do:

input double x y
 0.1E+01 0.1995E+04
 0.3E+01 0.8750E+03
 0.6E+01 0.6950E+03
 0.1E+02 0.3450E+03
 0.5E+01 0.5950E+03
 0.2E+01 0.1795E+04
end
label variable x "AGE OF CARS"
label variable y "PRICE OF CARS"

The commands used are:

// run the data file, then describe, summarize, list, correlate and regress
run car_age_data.do
describe
summarize
list
correlate (x y)
regress y x
twoway (scatter y x)

Edited output is:

. clear
. run car_age_data.do
. describe

Contains data
  obs:   6
 vars:   2
 size:  96
------------------------------------------------------------------------
              storage  display    value
variable name   type   format     label     variable label
------------------------------------------------------------------------
x               double %10.0g               AGE OF CARS
y               double %10.0g               PRICE OF CARS
------------------------------------------------------------------------
Sorted by:
     Note: dataset has changed since last saved

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |         6         4.5    3.271085          1         10
           y |         6        1050    679.5219        345       1995

. list

     +-----------+
     |  x      y |
     |-----------|
  1. |  1   1995 |
  2. |  3    875 |
  3. |  6    695 |
  4. | 10    345 |
  5. |  5    595 |
     |-----------|
  6. |  2   1795 |
     +-----------+

. correlate (x y)
(obs=6)

             |      x      y
-------------+----------------
           x |  1.0000
           y | -0.8588  1.0000

. regress y x

      Source |       SS       df       MS            Number of obs =       6
-------------+------------------------------         F(  1,     4) =   11.24
       Model |  1702935.05     1  1702935.05         Prob > F      =  0.0285
    Residual |  605814.953     4  151453.738         R-squared     =  0.7376
-------------+------------------------------         Adj R-squared =  0.6720
       Total |     2308750     5      461750         Root MSE      =  389.17

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -178.4112   53.20631    -3.35   0.028    -326.1356   -30.68683
       _cons |    1852.85   287.3469     6.45   0.003     1055.048    2650.653
------------------------------------------------------------------------------

. twoway (scatter y x)

end of do-file

[Figure here: scatter plot of PRICE OF CARS (0 to 2000) against AGE OF CARS (0 to 10).]

From the plot we see that the ten-year-old car appears to have a larger than expected value for its age. For this reason, more variables and observations are needed.

Remark: When there are two series, correlation and plots can be used effectively to determine the model. However, when there are more than two series, plots and correlation analysis are less useful and in many cases can give the wrong impression. This will be illustrated later.

In cases where there is more than one explanatory variable, regression is the appropriate approach, although this approach has many problems. A regression tries to write the dependent variable y as a linear function of the explanatory variables. In this case we have estimated a model of the form

$y = \alpha + \beta x + e$   (3-14)

where y = value is the price of the car, x = age is the age of the car, and e is the error term. Regression output produces

value = 1852.8505 - 178.41121*age   (3-15)
         (6.45)      (-3.35)

with $\bar{R}^2$ = .672, SEE = 389.17 and e'e = 605814.953, which can be verified from the printout. Note that SEE = $\sqrt{e'e/(n-k)} = \sqrt{605814.953/4}$ = 389.17.

The regression model suggests that for every year older a car gets, its value drops significantly, by $178.41. A car one year old should have a value of 1852.8505 - (1)*178.41121 = 1674.4. In the sample dataset the one-year-old car in fact had a value of 1995; for this observation the error was 320.56. Using the estimated equation (3-14) we have:

Age    Actual Value    Estimated Value      Error
  1         1995             1674.4        320.56
  3          875             1317.6       -442.62
  6          695             782.38       -87.383
 10          345             68.738        276.26
  5          595             960.79       -365.79
  2         1795             1496          298.97

t scores have been placed under the estimated coefficients in (3-15). Since for both coefficients |t| > 2, we can state that, given the assumptions of the linear regression model, both coefficients are significant.
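The estimated equation can be used directly for prediction. The MATLAB fragment below simply re-applies the coefficients from the regression printout above to reproduce the fitted values, the residuals, and the SEE; it is a sketch of the arithmetic rather than new estimation.

% Re-apply the fitted car-value equation from the Stata printout:
% value = 1852.85 - 178.4112*age, with SEE = sqrt(e'e/(n-k)).
b0 = 1852.85;  b1 = -178.4112;            % coefficients from the printout
age   = [1 3 6 10 5 2]';
value = [1995 875 695 345 595 1795]';
fitted = b0 + b1*age;                     % predicted values
resid  = value - fitted;                  % errors, matching the table above
SEE = sqrt(sum(resid.^2)/(length(age)-2));  % about 389.17
disp([age fitted resid]); disp(SEE)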
Before turning to an in-depth discussion of the regression model, we look at a few optional topics.

4. More complex setup to illustrate using MATLAB to estimate the model. Optional topic.

This optional topic implements the key ideas in Appendix E of Wooldridge (2013), which show how a linear econometric model can be estimated by OLS. As discussed in the text, a linear OLS model selects the coefficients so as to minimize the sum of squared errors. Define X as an N by K matrix, where N is the number of observations on each of K series. The OLS coefficient vector is $\hat{\beta} = (X'X)^{-1}X'y$, where y is the N-element vector of the dependent variable. The error vector is $e = y - X\hat{\beta}$. Standard errors of the coefficients can be obtained from the square roots of the diagonal elements of $\hat{\sigma}^2 (X'X)^{-1}$, where $\hat{\sigma}^2 = e'e/(N-K)$.

As an alternative to the Stata regress command that was shown above, the self-contained MATLAB program that is listed next can be used to estimate the model.

%% Cars Example using Matlab
% Load data
x=[1,1; 1,3; 1,6; 1,10; 1,5; 1,2];
y=[1995 875 695 345 595 1795];
y=y';
value=y;
disp('Mean of dependent (Value) and independent (Age) variable')
disp([mean(y),mean(x(:,2))])
age=x(:,2);
disp('Small and Large Variances for Age and Value')
disp([var(age,0),var(age,1),var(value,0),var(value,1)])
disp('Correlation using formula and built in function')
cor=(mean(age.*y)-(mean(age)*mean(y)))/(sqrt(var(age,1))*sqrt(var(value,1)))
% using built in function
cor=corr([age,value])
%% Estimate the model
% Logic works for any sized problem!!
% for large # of obs put ; at end of [y,yhat,res] line
beta=inv(x'*x)*x'*y;
yhat=x*beta;
res=y-yhat;
disp('   Value     Yhat      Res')
[y,yhat,res]
ssr=res'*res;
disp('Sum of squared residuals')
disp(ssr)
df=size(x,1)-size(x,2);
se=sqrt(diag((ssr/df)*inv(x'*x)));
disp('   Beta      se       t')
t=beta./se;
[beta,se,t]
plot(res)
% plot(age,y,age,yhat)
disp('Durbin Watson')
i=1:1:5;
dw=((res(i+1)-res(i))'*(res(i+1)-res(i)))/(res'*res);
disp(dw)

This produces the output:

Mean of dependent (Value) and independent (Age) variable
        1050          4.5

Small and Large Variances for Age and Value
        10.7       8.9167  4.6175e+005  3.8479e+005

Correlation using formula and built in function
cor =
    -0.85884
cor =
           1    -0.85884
    -0.85884           1

    Value     Yhat      Res
ans =
        1995       1674.4       320.56
         875       1317.6      -442.62
         695       782.38      -87.383
         345       68.738       276.26
         595       960.79      -365.79
        1795         1496       298.97

Sum of squared residuals
  6.0581e+005

    Beta        se          t
ans =
      1852.9      287.35      6.4481
     -178.41      53.206     -3.3532

Durbin Watson
    2.7979

These results match what was produced by the Stata regress command, which can give the user the impression of a "black box." Our findings indicate that on average, for every additional year of age, the car falls in value by $178.41.

Remark: Econometric calculations can easily be programmed using 4th generation languages without detailed knowledge of Fortran or C. This allows new techniques to be implemented without waiting for software developers to "hard wire" these procedures.

5. Review of Linear Algebra and Introduction to Programming Regression Calculations. Optional topic for those with the right math background.

Assume a problem where there are multiple x variables, all possibly related to y, and there is some relationship between the x variables (multicollinearity). The proposed solution is to fit a linear model of the form

$y = \alpha + \sum_{i=1}^{k} \beta_i x_i + e$   (5-1)

where y, $x_i$ and e are N-element column vectors, $\beta_i$ is the coefficient of $x_i$, and $\alpha$ is the intercept of the equation.
A linear model such as (5-1) can be estimated by OLS (ordinary least squares), which will minimize $e'e$, a good measure of the fit of the model. OLS is one of many methods to fit a line; others discussed include L1 estimation, which minimizes $\sum |e_i|$, and minimax estimation, which minimizes the largest element of |e|. After the coefficients are calculated, it is a good idea to estimate and report standard errors, which allow significance tests on the estimates of the parameters. OLS models can be estimated using matrix algebra directly, or using pre-programmed procedures like the regression command in Excel. There are, however, a number of ways to calculate the estimated parameters. Before this occurs we first illustrate a number of linear algebra calculations that include the LU factorization, eigenvalue analysis, the Cholesky factorization, the QR factorization, the Schur factorization (which always works even when eigenvalue routines may not) and the SVD calculation.

The LU factorization ($Z = LU$) is the appropriate way to invert a general matrix. Eigenvalue analysis decomposes $Z = V \Lambda V^{-1}$, where $\Lambda$ is a diagonal matrix and Z is a general matrix. For the positive definite case $Z = V \Lambda V'$, since here $V^{-1} = V'$. Inspection of the diagonal elements of $\Lambda$ indicates whether $\lim_{k\to\infty} Z^k$ explodes, if we note that $Z^k = V \Lambda^k V^{-1}$. The sum of the diagonal elements of $\Lambda$ is the trace of Z, while their product is |Z|. If Z is positive definite (all diagonal elements of $\Lambda > 0$), the Cholesky factorization writes $Z = R'R$, where R is upper triangular. The Schur factorization writes $Z = USU'$, where U is orthogonal ($UU' = I$) and S is block upper triangular. Unlike the eigenvalue decomposition, all elements of the Schur factorization are real for a general real matrix. The QR factorization writes $X = QR$, where Q is orthogonal and R is the Cholesky factor of $X'X$ calculated accurately, since it uses X, not $X'X$. The SVD calculates $X = U \Sigma V'$, where U and V are orthogonal and are N by K and K by K respectively, and $\Sigma$ is a K by K diagonal matrix whose elements are the square roots of the eigenvalues of $X'X$. The MATLAB script listed below self-documents these calculations and shows graphically $X'X$ where X was 100 by 50. How would this graph look if X were not a random matrix, where by assumption $E(X(:,i)'X(:,j)) = 0$ for $i \ne j$? How might it be used?

%% Linear Algebra Useful for Econometrics in Matlab
disp(' Short course in Math using Matlab(c)')
% 2 December 2006 Version
disp(' Houston H. Stokes')
disp(' All Matlab commands are indented. Cut and paste from this')
disp(' document into Matlab and execute.')
disp(' ')
disp(' If ; is left off result will print.')
disp(' Define x as a n by n matrix of random numbers.')
disp(' x = rand(n) ')
disp(' define x as a n by n matrix of random normal numbers')
disp(' xn = randn(n)')
disp(' Do a LU factorization and test answer')
disp(' Inverse using LU ')
disp(' ')
disp(' x = rand(n) ')
disp(' [l u] = lu(x) ')
disp(' test = l*u ')
disp(' error = l*u - x ')
disp(' ix = inv(x) ')
disp(' ix2 = inv(u)*inv(l) ')
disp(' error = ix - ix2 ')
n=3
x = rand(n)
[l u] = lu(x)
test = l*u
error = l*u - x
ix = inv(x)
ix2 = inv(u)*inv(l)
error = ix - ix2
disp(' Form PD Matrix and look at it.
') xx = randn(100,10); ') xpx = xx`*xx ') mesh(xpx) ') xx = randn(100,50); xpx = xx'*xx; mesh(xpx) Factor PD matrix into R(t)*R and test') xx = randn(100,n); ') xpx = xx(t)*xx ') r = chol(xpx) ') test1 = r(t)*r ') mesh(r) ') error = r(t)*r - xpx ') xx = randn(100,n); xpx = xx'*xx r = chol(xpx) test1 = r'*r error = r'*r - xpx disp(' Eigen and svd analysis. For pd matrix s = landa') disp(' xx = randn(100,n); ') disp(' xpx = xx(t)*xx ') disp(' lamda = eig(xpx) ') xx = randn(100,n); xpx = xx'*xx lamda = eig(xpx) 15 Econometric Notes disp(' show trace = sum eigen') disp(' det = prod(e) ') disp(' trace1 = trace(xpx) disp(' det1 = det(xpx) disp(' trace2 = sum(lamda) disp(' det2 = prod(lamda) trace1 = trace(xpx) det1 = det(xpx) trace2 = sum(lamda) det2 = prod(lamda) ') ') ') ') disp(' Test SVD') disp(' s = svd(xpx) ') disp(' [u ss v] = svd(xpx) ') disp(' test = u*ss*v(t)') disp(' error = xpx-test ') s = svd(xpx) [u ss v] = svd(xpx) test = u*ss*v' error = xpx-test disp(' Does X*V = V*Lamda') disp(' xx = randn(100,n); ') disp(' xpx = xx(t)*xx ') disp(' [v lamda] = eig(xpx) ') disp(' test = v*lamda*inv(v)') disp(' error = xpx-test ') disp(' vpv = v(t)*v ') disp(' s = svd(xpx) ') xx = randn(100,n); xpx = xx'*xx [v lamda] = eig(xpx) test = v*lamda*inv(v) error = xpx-test vpv = v'*v s = svd(xpx) disp(' Schur Factorization X = U S U(t) where U is orthogonal and') disp(' S is block upper triangural with 1 by 1 and 2 by 2 on the') disp(' diagonal. All elements of a Schur factorization real') disp(' xx = randn(100,n); ') disp(' xpx = xx(t)*xx ') disp(' [U,S] = schur(xpx) ') disp(' test = U*S*U(t) ') disp(' error = xpx-test ') xx = randn(100,n); xpx = xx'*xx [U,S] = schur(xpx) test = U*S*U' error = xpx-test disp(' Schur Factorization') disp(' xx = randn(n,n) ') disp(' [U,S] = schur(xx) ') 16 Econometric Notes disp(' disp(' disp(' disp(' disp(' disp(' disp(' disp(' disp(' disp(' test = U*S*U(t) ') error = xx-test ') xx = randn(n,n) [U,S] = schur(xx) test = U*S*U' error = xx-test QR Factorization preserves length and angles and does not magnify') errors. We express X = Q*R where Q is orthogonal and R is upper') triangular ') x = randn(n,n) ') [Q R] = qr(x) ') test1 = Q(t)*Q ') test2 = Q*R ') error = x - test2 ') x = randn(n,n) [Q R] = qr(x) test1 = Q'*Q test2 = Q*R error = x - test2 and produces output: Short course in Math using Matlab(c) Houston H. Stokes All Matlab commands are indented. Cut and paste from this document into Matlab and execute. If ; is left off result will print. Define x as a n by n matrix of random numbers. x = rand(n) define x as a n by n matrix of random normal numbers xn = randn(n) Do a LU factorization and test answer Inverse using LU x [l u] test error ix ix2 error = rand(n) = lu(x) = l*u = l*u - x = inv(x) = inv(u)*inv(l) = ix - ix2 n = 3 x = 0.84622 0.52515 0.20265 0.67214 0.83812 0.01964 0.68128 0.37948 0.8318 1 0.62059 0.23947 0 1 -0.33568 0 0 1 0.84622 0 0 0.67214 0.421 0 0.68128 -0.04331 0.65411 l = u = test = 17 Econometric Notes 0.84622 0.67214 0.68128 0.52515 0.83812 0.37948 0.20265 0.01964 0.8318 error = 0 0 0 0 0 0 0 -6.9389e-018 0 ix = 2.9596 -2.3417 -1.3557 -1.5445 2.4281 0.15727 -0.68458 0.51318 1.5288 ix2 = 2.9596 -2.3417 -1.3557 -1.5445 2.4281 0.15727 -0.68458 0.51318 1.5288 error = 0 0 0 0 0 0 -1.1102e-016 0 0 Form PD Matrix and look at it. 
xx = randn(100,10); xpx = xx`*xx mesh(xpx) Factor PD matrix into R(t)*R and test xx = randn(100,n); xpx = xx(t)*xx r = chol(xpx) test1 = r(t)*r mesh(r) error = r(t)*r - xpx xpx = 98.02 17.334 0.14022 17.334 104.66 -7.2052 0.14022 -7.2052 114.22 r = 9.9005 1.7508 0.014163 0 10.08 -0.71729 0 0 10.663 test1 = 98.02 17.334 0.14022 17.334 104.66 -7.2052 0.14022 -7.2052 114.22 error = 1.4211e-014 0 0 0 0 0 0 0 0 Eigen and svd analysis. For pd matrix s = landa xx = randn(100,n); xpx = xx(t)*xx lamda = eig(xpx) xpx = 95.217 -3.5453 12.006 -3.5453 96.003 -3.9312 12.006 -3.9312 92.989 lamda = 82.025 93.783 108.4 show trace = sum eigen det = prod(e) 18 Econometric Notes trace1 = trace(xpx) det1 = det(xpx) trace2 = sum(lamda) det2 = prod(lamda) trace1 = 284.21 det1 = 8.3388e+005 trace2 = 284.21 det2 = 8.3388e+005 Test SVD s = svd(xpx) [u ss v] = svd(xpx) test = u*ss*v(t) error = xpx-test s = 108.4 93.783 82.025 u = -0.67492 0.3165 0.66657 0.39135 0.91936 -0.040277 -0.62557 0.23368 -0.74435 ss = 108.4 0 0 0 93.783 0 0 0 82.025 v = -0.67492 0.3165 0.66657 0.39135 0.91936 -0.040277 -0.62557 0.23368 -0.74435 test = 95.217 -3.5453 12.006 -3.5453 96.003 -3.9312 12.006 -3.9312 92.989 error = -2.8422e-014 -1.199e-014 -4.0856e-014 -9.3703e-014 -2.8422e-014 5.9064e-014 -4.7962e-014 -5.3291e-015 1.4211e-014 Does X*V = V*Lamda xx = randn(100,n); xpx = xx(t)*xx [v lamda] = eig(xpx) test = v*lamda*inv(v) error = xpx-test vpv = v(t)*v s = svd(xpx) xpx = 98.321 -0.36605 1.9557 -0.36605 127.52 -2.4594 1.9557 -2.4594 112.74 v = 0.99127 0.1298 0.022941 0.0013134 0.1643 -0.98641 -0.13181 0.97783 0.1627 lamda = 98.061 0 0 0 112.59 0 19 Econometric Notes 0 0 127.93 test = 98.321 -0.36605 1.9557 -0.36605 127.52 -2.4594 1.9557 -2.4594 112.74 error = -1.4211e-014 2.7645e-014 2.2204e-015 2.9421e-014 -4.2633e-014 5.7732e-015 1.7764e-015 -1.3323e-015 0 vpv = 1 2.7756e-017 -2.0817e-017 2.7756e-017 1 -2.498e-016 -2.0817e-017 -2.498e-016 1 s = 127.93 112.59 98.061 Schur Factorization X = U S U(t) where U is orthogonal and S is block upper triangural with 1 by 1 and 2 by 2 on the diagonal. All elements of a Schur factorization real xx = randn(100,n); xpx = xx(t)*xx [U,S] = schur(xpx) test = U*S*U(t) error = xpx-test xpx = 75.062 11.465 -4.6863 11.465 135.28 7.6196 -4.6863 7.6196 87.647 U = -0.91599 -0.36457 -0.16747 0.20355 -0.062606 -0.97706 -0.34572 0.92907 -0.13156 S = 70.745 0 0 0 88.973 0 0 0 138.27 test = 75.062 11.465 -4.6863 11.465 135.28 7.6196 -4.6863 7.6196 87.647 error = 1.4211e-014 2.1316e-014 -1.4211e-014 2.4869e-014 -8.5265e-014 7.1054e-015 -1.0658e-014 5.3291e-015 1.4211e-014 Schur Factorization xx = randn(n,n) [U,S] = schur(xx) test = U*S*U(t) error = xx-test xx = 2.095 0.93943 -0.45994 0.34979 -0.047081 0.64722 2.0142 -1.4799 -1.8411 U = -0.89939 -0.19282 0.39233 -0.24726 -0.51574 -0.82029 -0.3605 0.83477 -0.41617 S = 2.1689 1.4404 -1.1939 20 Econometric Notes 0 0 -0.98103 -0.42339 2.3141 -0.98103 test = 2.095 0.93943 -0.45994 0.34979 -0.047081 0.64722 2.0142 -1.4799 -1.8411 error = 8.8818e-016 -2.2204e-016 -1.6653e-016 1.1102e-016 9.09e-016 8.8818e-016 4.4409e-016 3.3307e-015 8.8818e-016 QR Factorization preserves length and angles and does not magnify errors. 
We express X = Q*R where Q is orthogonal and R is upper triangular
 x = randn(n,n)
 [Q R] = qr(x)
 test1 = Q(t)*Q
 test2 = Q*R
 error = x - test2
x =
     -0.9756     0.55997     0.88166
    0.028304     0.62542     0.15174
   -0.050706     0.53695   -0.017682
Q =
    -0.99823   0.0094729   -0.058658
    0.028961    -0.78444    -0.61953
   -0.051883    -0.62013     0.78278
R =
     0.97733    -0.56872    -0.87479
           0    -0.81828   -0.099712
           0           0    -0.15956
test1 =
           1           0  6.9389e-018
           0           1 -1.6653e-016
 6.9389e-018 -1.6653e-016            1
test2 =
     -0.9756     0.55997     0.88166
    0.028304     0.62542     0.15174
   -0.050706     0.53695   -0.017682
error =
           0           0 -4.4409e-016
 3.4694e-018  2.2204e-016  8.3267e-017
           0           0  2.0817e-017

[Figure 5.1 here: 3-D mesh plot titled "X'X Where X was 100 by 50"; the diagonal elements are near 100 and the off-diagonal elements are near zero, with the vertical axis running from about -40 to 140.]

Figure 5.1 X'X for a random Matrix X

These ideas are illustrated using the Theil dataset discussed in more detail in the next section.

%% Use of Theil Data to Illustrate various ways to get Beta
% For more detail on these calculations see Stokes (200x) Chapter 10
disp('Theil (1971) data on Year CT Y RP')
data=[
1923  99.2  96.7 101.0;
1924  99.0  98.1 100.1;
1925 100.0 100.0 100.0;
1926 111.6 104.6  90.6;
1927 122.2 104.9  86.5;
1928 117.6 109.5  89.7;
1929 121.1 110.8  90.6;
1930 136.0 112.3  82.8;
1931 154.2 109.3  70.1;
1932 153.6 105.3  65.4;
1933 158.5 101.7  61.3;
-0.00012716 -0.0042094 -0.00012716 0.00022673 ans = 23.773 -0.2272 -0.0042094 24 Econometric Notes -0.2272 -0.0042094 0.0023008 -0.00012716 -0.00012716 0.00022673 -4.1231 0 0 cholr = 4.1231 0 0 SVD approach u = 0.26014 0.26123 0.26398 0.26026 0.25606 0.26662 0.26959 0.26301 0.2441 0.23275 0.22268 0.21455 0.2173 0.20664 0.22192 0.22049 0.22584 s = 530.48 0 0 v = 0.0077424 0.799 0.60128 pc_coef = 545.81 131.94 26.706 beta3 = 130.23 1.0659 -1.3822 -424.53 21.179 0 -314.64 11.878 -66.411 424.53 21.179 0 314.64 11.878 66.411 -0.42317 -0.39389 -0.37096 -0.17816 -0.11332 -0.1094 -0.10823 0.025612 0.18215 0.20748 0.22834 0.13929 0.13408 0.31251 0.26022 0.25419 0.25202 0.28267 0.21821 0.12977 -0.076467 -0.086908 -0.30402 -0.36537 -0.42853 -0.27778 -0.087337 0.08395 0.37648 0.32893 0.28251 0.052713 0.090163 -0.013903 0 53.304 0 0 0 0.20509 0.0056046 0.60125 -0.79904 0.99995 -0.0095564 -0.00017699 r = Remark: This section shows how to implement the basic linear algebra relationships that are useful in understanding modern econometric methods and calculations. In many cases these new approaches are required to be used for complex and multi-collinear datasets. 6. A Sample Multiple Input Regression Model Dataset In sections 3 and 4 we introduced a small (6 observation dataset) that relates age of cars to their value. We observed that since there are so few observations in this example, the correlation 25 Econometric Notes coefficient must be relatively large to be significant. The small sample standard error of the correlation coefficient is calculated using (3.3) which is this case is .4472 1/ 5 . Since the absolute value of the correlation coefficient (-.85884) is about 2 times the standard error, we can state that at about the 95% level, the correlation coefficient is significant. The problem with correlation analysis is that it is hard to make direct predictions. What is wanted is a relationship where, if given only the age of a car, we can make some prediction on its price. To obtain an answer to the prediction problem requires more advanced statistical techniques. Its solution will be discussed further below. As discussed earlier, when more complicated models are deemed appropriate or when predictions are required, the correlation coefficient statistical procedure, which restricts analysis to two variables, is no longer the best way to proceed. In the highly unlikely situation where all the variables influencing y (the x's) were unrelated among themselves (i. e., were orthogonal), correlation analysis would give the correct sign of the relationship between each x variable and y. This situation would occur if the x's were principal components. In a later example, using generated data, some of these possibilities will be illustrated with further examples. Table Two lists data on the consumption of textiles in the Netherlands from Theil( [1971] Principles of Econometrics, page 102) which was used as an example in the Matlab code in section 5. This example will be shown to provide a better fit than the previous example and, in addition, illustrates multiple input regression models. (It should be noted that not all economics examples work this well.) Usually time series models have higher R 2 than cross section models, because of the serial correlation (relationship between the error terms across time) implicit in most time series. 
In this example from Theil (1971) the consumption of textiles in the Netherlands (CT) between 1923 and 1939 is modeled as a function of income (Y) and the relative price of textiles (RP). The maintained hypothesis is that as income increases, the consumption of textiles should increase, and as the relative price of textiles increases, the consumption of textiles should decrease. Two models are tried, one with the raw data and one with the data logged to the base 10.

Table Two. Consumption of Textiles in the Netherlands: 1923-1939

Year      CT        Y       RP
1923     99.2     96.7    101.0
1924     99.0     98.1    100.1
1925    100.0    100.0    100.0
1926    111.6    104.6     90.6
1927    122.2    104.9     86.5
1928    117.6    109.5     89.7
1929    121.1    110.8     90.6
1930    136.0    112.3     82.8
1931    154.2    109.3     70.1
1932    153.6    105.3     65.4
1933    158.5    101.7     61.3
1934    140.6     95.4     62.5
1935    136.2     96.4     63.6
1936    168.0     97.6     52.6
1937    154.3    102.4     59.7
1938    149.0    101.6     59.5
1939    165.5    103.8     61.3

CT = consumption of textiles. Y = income. RP = relative price of textiles.

The linear model asserts

$CT = \alpha + \beta_1 Y + \beta_2 RP + e$   (6-1)

while the log form assumes the error is multiplicative, or that

$CT = \alpha Y^{\beta_1} (RP)^{\beta_2} e$   (6-2)

Equation (6-2) can be estimated in log form as

$\log_{10}(CT) = \log_{10}(\alpha) + \beta_1 \log_{10}(Y) + \beta_2 \log_{10}(RP) + \log_{10}(e)$   (6-3)

Actual estimates of the alternative models, with t scores in parentheses under the coefficients, were

CT = 130.7066 + 1.061710*Y - 1.382985*RP
      (4.824)    (3.981)      (-16.50)
$\bar{R}^2$ = .94432, e'e = 433.3
                                                              (6-4)
log10(CT) = 1.373914 + 1.143156*log10(Y) - .8288375*log10(RP)
             (4.4886)   (7.3279)           (-22.952)
$\bar{R}^2$ = .97070, e'e = .002568

Prior to preliminary estimation, raw correlations and plots were examined. The log transformation was attempted to make the time series data stationary. The B34S and SAS commands to analyze these data are shown next. Note that B34S requires the user to explicitly define, with the build statement, the variables to be created with gen statements when using the B34S data step. This allows checking of variable names in the gen statements. For SAS the following commands would be used:

data theil;
   INPUT CT Y RP;
   LABEL CT      = 'CONSUMPTION OF TEXTILES';
   LABEL LOG10CT = 'LOG10 OF CONSUMPTION';
   LABEL Y       = 'INCOME';
   LABEL LOG10Y  = 'LOG10 OF INCOME';
   LABEL RP      = 'RELATIVE PRICE OF TEXTILES';
   LABEL LOG10RP = 'LOG10 OF RELATIVE PRICE';
   LOG10CT = LOG10(CT);
   LOG10RP = LOG10(RP);
   LOG10Y  = LOG10(Y);
CARDS;
99.2 96.7 101
99 98.1 100.1
100 100 100
111.6 104.9 90.6
122.2 104.9 86.5
117.6 109.5 89.7
121.1 110.8 90.6
136 112.3 82.8
154.2 109.3 70.1
153.6 105.3 65.4
158.5 101.7 61.3
140.6 95.4 62.5
136.2 96.4 63.6
168 97.6 52.6
154.3 102.4 59.7
149 101.6 59.5
165.5 103.8 61.3
;
proc reg;     MODEL CT = Y RP;                                   run;
proc reg;     MODEL LOG10CT = LOG10Y LOG10RP;                    run;
proc autoreg; MODEL LOG10CT = LOG10Y LOG10RP / nlag=1 method=ml; run;
proc autoreg; MODEL LOG10CT = LOG10Y / nlag=1 method=ml;         run;

Edited output from B34S discussed below is:
Current missing value code is B34S Version 8.42e (D:M:Y) 04/01/99 (H:M:S) 16:14:15 DATA STEP Maximum 168.000 112.300 101.000 2.22531 2.05038 2.00432 1.00000 1 0.61769E-01 Var 2 RP Var 3 1 -0.94664 LOG10CT Var 4 1 0.99744 2 0.93936E-01 LOG10Y Var 5 1 0.66213E-01 2 0.99973 3 0.17511 LOG10RP Var 6 1 -0.93820 2 0.22599 3 0.99750 4 -0.93596 5 0.22212 CONSTANT Var 7 1 0.0000 2 0.0000 3 0.0000 4 0.0000 5 0.0000 2 0.17885 *************** Problem Number Subproblem Number F to enter F to remove Tolerance Maximum no of steps Dependent variable X( 1). Standard Error of Y = ............. Step Number 3 3 -0.94836 4 0.97862E-01 6 0.0000 4 1 0.99999998E-02 0.49999999E-02 0.10000000E-04 3 Variable Name CT 23.577332 for degrees of freedom = 16. Analysis of Variance for reduction in SS due to variable entering 28 99.0000 95.4000 52.6000 1.99564 1.97955 1.72099 1.00000 0.1000000000000000E+32 Correlation Matrix Y Minimum PAGE 2 Econometric Notes Variable Entering 2 Multiple R 0.975337 Std Error of Y.X 5.56336 R Square 0.951282 Source Due Regression Dev. from Reg. Total Multiple Regression Equation Variable Coefficient Std. Error CT = Y X- 2 1.061710 0.2666740 RP X- 3 -1.382985 0.8381426E-01 CONSTANT X- 7 130.7066 27.09429 Adjusted R Square -2 * ln(Maximum of Likelihood Function) Akaike Information Criterion (AIC) Scwartz Information Criterion (SIC) Akaike (1970) Finite Prediction Error Generalized Cross Validation Hannan & Quinn (1979) HQ Shibata (1981) Rice (1984) Residual Variance T Val. DF 2 14 16 SS 8460.9 433.31 8894.2 MS 4230.5 30.951 555.89 T Sig. P. Cor. Elasticity 3.981 -16.50 4.824 0.99863 0.7287 1.00000 -0.9752 0.99973 F 136.68 Partial Cor. for Var. not in equation Variable Coefficient F for selection 0.8129 -0.7846 0.944321908495049 103.294108058298 111.294108058298 114.626961434523 36.4128553394184 37.5832685467569 36.8112662258895 34.4851159390962 39.3920889580981 30.9509270385056 Order of entrance (or deletion) of the variables = 7 3 2 Estimate of computational error in coefficients = 1 -0.1889E-13 2 -0.2396E-14 3 0.7430E-11 Covariance Matrix of Regression Coefficients Row Row Row 1 Variable X- 2 0.71115004E-01 Y 2 Variable X- 3 RP -0.39974169E-02 0.70248306E-02 3 Variable X- 7 CONSTANT -7.0185405 -0.12441382 Program terminated. 734.10069 All variables put in. Residual Statistics for... Original Data Von Neumann Ratio 1 ... Von Neumann Ratio 2 ... For D. F. 2.14471 2.14471 14 t(.9999)= Infin 0 1.000 1.000 Cell No. Interval Act Per 2 1.000 1.000 2.01855 5.3624, t(.999)= 4.1403, t(.99)= 2.9768, t(.95)= 2.1448, t(.90)= 1.7613, t(.80)= 1.3450 Skewness test (Alpha 3) = -.232914E-01, t Stat Cell No. Interval Act Per Durbin-Watson TEST..... Peakedness test (Alpha 4)= 1.37826 Normality Test -- Extended grid cell size 1.761 1.345 1.076 0.868 0.692 0.537 2 2 4 2 0 2 0.900 0.800 0.700 0.600 0.500 0.400 1.000 0.882 0.765 0.529 0.412 0.412 = 0.393 2 0.300 0.294 Normality Test -- Small sample grid cell size = 6 2 4 0.800 0.600 0.400 0.882 0.529 0.412 1.70 0.258 0.128 1 2 0.200 0.100 0.176 0.118 3.40 3 0.200 0.176 Extended grid normality test - Prob of rejecting normality assumption Chi= 7.118 Chi Prob= 0.4760 F(8, 14)= 0.889706 F Prob =0.450879 Small sample normality test - Large grid Chi= 3.294 Chi Prob= 0.6515 F(3, 14)= 1.09804 F Prob =0.617396 Autocorrelation function of residuals 1) -0.1546 F( 6, 2) -0.2529 6) = 0.3219 Sum of squared residuals 433.3 3) 0.2272 1/F = 3.106 4) -0.3925 Heteroskedasticity at Mean squared residual 0.9032 25.49 Gen. Least Squares ended by satisfying tolerance. 
***************
Problem Number      4
Subproblem Number   2

F to enter                  0.99999998E-02
F to remove                 0.49999999E-02
Tolerance                   0.10000000E-04
Maximum no of steps         3
Dependent variable X( 4).   Variable Name LOG10CT
Standard Error of Y =       0.79113140E-01 for degrees of freedom = 16.

............. Step Number 3
Variable Entering           5
Multiple R                  0.987097
Std Error of Y.X            0.135425E-01
R Square                    0.974361

Analysis of Variance for reduction in SS due to variable entering
Source            DF    SS            MS            F
Due Regression     2    0.97575E-01   0.48787E-01   266.02
Dev. from Reg.    14    0.25676E-02   0.18340E-03
Total             16    0.10014       0.62589E-02

Multiple Regression Equation
Variable            Coefficient    Std. Error      T Val.    T Sig.     P. Cor.   Elasticity
LOG10CT =
LOG10Y    X- 5       1.143156      0.1560002        7.328    1.00000    0.8906     1.084
LOG10RP   X- 6      -0.8288375     0.3611136E-01  -22.95     1.00000   -0.9870    -0.7314
CONSTANT  X- 7       1.373914      0.3060903        4.489    0.99949

Partial Cor. for Var. not in equation: (none; all variables put in)

Adjusted R Square                           0.970697895872232
-2 * ln(Maximum of Likelihood Function)     -101.322167384484
Akaike Information Criterion (AIC)          -93.3221673844844
Schwartz Information Criterion (SIC)        -89.9893140082595
Akaike (1970) Finite Prediction Error       0.215763077505479D-003
Generalized Cross Validation                0.222698319282440D-003
Hannan & Quinn (1979) HQ                    0.218123847024249D-003
Shibata (1981)                              0.204340326343424D-003
Rice (1984)                                 0.233416420210472D-003
Residual Variance                           0.183398615879657D-003

Order of entrance (or deletion) of the variables = 7 6 5
Estimate of computational error in coefficients =
  1  0.5793E-11   2  0.2356E-12   3  0.2547E-11

Covariance Matrix of Regression Coefficients
Row 1 Variable X- 5 LOG10Y
   0.24336056E-01
Row 2 Variable X- 6 LOG10RP
  -0.12513115E-02   0.13040301E-02
Row 3 Variable X- 7 CONSTANT
  -0.46626424E-01   0.76017246E-04   0.93691270E-01

Program terminated. All variables put in.

Residual Statistics for... Original Data
Von Neumann Ratio 1 ...  2.04710    Durbin-Watson TEST.....  1.92669
Von Neumann Ratio 2 ...  2.04710
For D. F. 14  t(.9999)= 5.3624, t(.999)= 4.1403, t(.99)= 2.9768, t(.95)= 2.1448, t(.90)= 1.7613, t(.80)= 1.3450
Skewness test (Alpha 3) = -.159503, Peakedness test (Alpha 4) = 1.44345

Normality Test -- Extended grid, cell size = 1.70
Interval   1.761  1.345  1.076  0.868  0.692  0.537  0.393  0.258  0.128
Cell No.       1      1      5      1      1      3      1      2      1
Per        0.900  0.800  0.700  0.600  0.500  0.400  0.300  0.200  0.100
Act Per    0.941  0.882  0.824  0.529  0.471  0.412  0.235  0.176  0.059

Normality Test -- Small sample grid, cell size = 3.40
Cell No.       6      2      4      3
Per        0.800  0.600  0.400  0.200
Act Per    0.882  0.529  0.412  0.176

Extended grid normality test - Prob of rejecting normality assumption
  Chi= 9.471  Chi Prob= 0.6958   F(8, 14)= 1.18382   F Prob = 0.626481
Small sample normality test - Large grid
  Chi= 3.294  Chi Prob= 0.6515   F(3, 14)= 1.09804   F Prob = 0.617396

Autocorrelation function of residuals
  1) -0.0990   2) -0.1061   3) 0.0862   4) -0.3157
F( 6, 6) = 0.5544   1/F = 1.804   Heteroskedasticity at 0.7544 F Sig. level
Sum of squared residuals   0.2568E-02
Mean squared residual      0.1510E-03

Gen. Least Squares ended by satisfying tolerance.

We first show plots of the data.

[Figure 6.1 here: 2-D plots of the textile data. Top panel "Log Theil Data": LOG10CT, LOG10Y and LOG10RP plotted against OBS (1 to 17). Bottom panel "Linear Theil Data": CT, Y and RP plotted against OBS.]

Figure 6.1 2-D Plots of Textile Data

Two dimensional plots of this dataset do not capture the full relationships.
From the plots in Figure 6.1 it appears that the consumption of textiles increases when the relative price of textiles falls and that income (Y) has little effect. Figure 6.2, which is based on a three-dimensional extrapolation about each point, gives a better picture of the true relationship. This figure clearly shows that LOG10RP has the most effect on LOG10CT, which is on the Z axis, but that LOG10Y does have a positive effect. The OLS regression model attempts to capture this surface.

Remark: A 2-D plot may lead one to drop a variable that is in fact significant in a multi-dimensional context. A 3-D plot can help in cases where K = 3, but may be less useful for larger problems.

The plots of CT against RP and of LOG10CT against LOG10RP suggest a negative relationship, which is consistent with the economic theory that the quantity demanded of a good will increase as its relative price falls. The correlations between these two pairs of variables are negative (-.94664 and -.93596) and highly significant (at the .0001 level for both correlations). The plot between CT and Y and the plot between LOG10CT and LOG10Y do not show much of a relationship. The raw correlations are small (.06177 and .09786, respectively) and not significant. The preliminary finding might be that Y was not a good variable to use on the right-hand side of a model predicting CT. It will be shown later that such a conclusion would be wrong.

[Figure 6.2 here: 3-D plot of the Theil (1971) textile data, with log10ct on the Z axis (about 2.04 to 2.18) plotted against log10y and log10rp.]

Figure 6.2 3-D Plot of Theil (1971) Textile Data

Remark: The preliminary estimation of a model CT = f(constant, Y, RP) indicates that the coefficients are 1.0617 (t = 3.98) and -1.383 (t = -16.5), respectively. The results support the maintained hypothesis that CT is positively related to income and negatively related to relative price. The Y variable, which was not significantly correlated with CT, was found to be significant when included in a regression controlling for RP. This demonstrates that it is important to go beyond just raw cross-correlation analysis. If proposed variables are "prescreened out" by correlation analysis and not tried in regression models, many important variables may be incorrectly dropped from the analysis. It is important not to prematurely drop a proposed, theoretically plausible variable from a regression model specification, even if in preliminary specifications it does not enter significantly. Later in the paper an example will be presented that illustrates that, if other important variables are omitted from an equation, a variable that belongs in the equation may not show up as significant until the omitted variables enter the equation (omitted variable bias).

The preceding discussion suggests that regression analysis requires careful use of diagnostic tests before the results are to be used in a production environment.
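The omitted-variable point made in the Remark above can be previewed with generated data (a fuller generated-data example is promised later in the notes). In the MATLAB sketch below every constant (the sample size, the coefficients, and the correlation between the regressors) is an arbitrary illustrative assumption.

% Omitted variable bias with generated data: x1 and x2 are correlated,
% both belong in the model, but x2 is dropped in the second regression.
% All constants here are arbitrary, chosen only for illustration.
N  = 200;
x1 = randn(N,1);
x2 = 0.8*x1 + 0.6*randn(N,1);        % x2 correlated with x1
y  = 1 + 2*x1 + 3*x2 + randn(N,1);   % true model uses both regressors
X  = [ones(N,1) x1 x2];
bfull = X\y;                          % full model: roughly (1, 2, 3)
Xom = [ones(N,1) x1];
bom = Xom\y;                          % x2 omitted: x1 coefficient is biased
disp('Full-model coefficients (const, x1, x2):');  disp(bfull')
disp('Omitted-variable coefficients (const, x1):'); disp(bom')
% With x2 omitted, the x1 slope picks up about 3*cov(x1,x2)/var(x1) = 2.4 extra.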
A possible problem with the above formulation is that the error process might have heteroskedasticity, or nonconstant variance, since the time series values for CT are increasing over time. If all the variables in the model are transformed into logs (to the base 10), some of the potential for difficulty may be avoided. If heteroskedasticity were present, the estimated standard errors of the coefficients would be biased. In addition, the estimated standard error of the model in the top equation of (6-4), $5.5634 = \sqrt{433.31298/(17-3)}$, would be misleading, since it is an average: assuming the variance of the error was increasing, it would overstate the error at the beginning of the data set and understate the error at the end of the data set. Log transforms to the base 10 are made, and the model is estimated again and reported in the bottom equation of (6-4). The results indicate that the log-linear form of the model fits better (the adjusted R² now is .9707) and all coefficients, except for the constant, are more significant. Comparison of the estimated values with the actual values shows surprisingly good results, considering there are only two explanatory variables in the model.

One of the assumptions of an OLS regression is that the error process follows a random normal distribution with no serial correlation or heteroskedasticity (nonconstant variance). If the error process is normal but serially correlated or heteroskedastic, the estimated coefficients will be unbiased but the standard errors of the estimated coefficients will be biased. Another important assumption of OLS is that the error terms are not related across observations. If $e_t$ is the error term of the estimated model, $u_t$ is a random error and the model

$e_t = \sum_{i=1}^{p} \rho_i e_{t-i} + u_t$   (6-5)

is estimated, no autocorrelation up to order K implies that for $K \le p$, $\hat{\rho}_1, \ldots, \hat{\rho}_K$ are not significant. First-order serial correlation can be tested by the Durbin-Watson test statistic. If the Durbin-Watson statistic is around 2.0, there is no problem. If it is substantially below (above) 2.0 there is positive (negative) autocorrelation. This can be seen from the formula for the Durbin-Watson statistic,

$d = \sum_{t=2}^{T} (e_t - e_{t-1})^2 / \sum_{t=1}^{T} e_t^2$   (6-6)

If serial correlation is found, the appropriate procedure is generalized least squares (GLS), which involves a transformation of the data. If heteroskedasticity is found, there are other procedures that can be used to remove the problem. To illustrate GLS, assume

$y_t = \alpha + \beta x_t + e_t$   (6-7)

where t refers to the time period of the observation. If p = 1 and model (6-5) is estimated on the residuals of (6-7) and $\rho_1$ is significant, the appropriate procedure is to lag the original equation and multiply through by $\rho_1$,

$\rho_1 y_{t-1} = \rho_1 \alpha + \rho_1 \beta x_{t-1} + \rho_1 e_{t-1}$   (6-8)

and then subtract this from the original equation. This would give

$(y_t - \rho_1 y_{t-1}) = \alpha(1 - \rho_1) + \beta(x_t - \rho_1 x_{t-1}) + (e_t - \rho_1 e_{t-1})$   (6-9)

which will give unbiased estimates of $\alpha$ and $\beta$ and their standard errors, since from (6-5) $u_t = e_t - \rho_1 e_{t-1}$ and by assumption $u_t$ does not contain serial correlation.

As a test, a misspecified model (to induce serial correlation) containing only LOG10Y is run in SAS. This model will find LOG10Y not significant and will show evidence of serial correlation, as measured by the low Durbin-Watson test statistic (.241). In the presence of serial correlation, the best course of action is to attempt to add new variables to explain the serial correlation. The B34S reg command output is shown first and the SAS autoreg command output next.

REG Command. Version 1 February 1997
Real*8 space available     9000000
Real*8 space used               43

OLS Estimation
Dependent variable             LOG10CT
Adjusted R**2                  -5.645122564288263E-02
Standard Error of Estimate      8.131550225838252E-02
Sum of Squared Residuals        9.918316361299519E-02
Model Sum of Squares            9.590596648930139E-04
Total Sum of Squares            0.1001422232778882
F( 1, 15)                       0.1450437196128150
F Significance                  0.2913428904662156
1/Condition of XPX              1.523468705487359E-05
Number of Observations         17
Durbin-Watson                   0.2414802718079813

Variable:     LOG10Y { 0}   CONSTANT { 0}
Coefficient:   0.34782649    1.4222303
Std.
Error 0.91329943 1.8378693 t 0.38084606 0.77384737 SAS output next: The AUTOREG Procedure Dependent Variable LOG10CT LOG10 OF CONSUMPTION Ordinary Least Squares Estimates SSE MSE SBC Regress R-Square Durbin-Watson 0.00256758 0.0001834 -92.822527 0.9744 1.9267 DFE Root MSE AIC Total R-Square Variable DF Estimate Standard Error t Value Approx Pr > |t| Intercept LOG10Y LOG10RP 1 1 1 1.3739 1.1432 -0.8288 0.3061 0.1560 0.0361 4.49 7.33 -22.95 0.0005 <.0001 <.0001 35 14 0.01354 -95.322167 0.9744 Variable Label LOG10 OF INCOME LOG10 OF RELATIVE PRICE Econometric Notes Estimates of Autocorrelations Lag Covariance Correlation 0 1 0.000151 -0.00001 1.000000 -0.093221 -1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 | | |********************| **| | Preliminary MSE 0.000150 Estimates of Autoregressive Parameters Lag Coefficient Standard Error t Value 1 0.093221 0.276142 0.34 Algorithm converged. The SAS System 10:37 Wednesday, December 6, 2006 4 The AUTOREG Procedure Maximum Likelihood Estimates SSE MSE SBC Regress R-Square Durbin-Watson 0.0025352 0.0001950 -90.189374 0.9789 1.7932 DFE Root MSE AIC Total R-Square Variable DF Estimate Standard Error t Value Approx Pr > |t| Intercept LOG10Y LOG10RP AR1 1 1 1 1 1.3592 1.1487 -0.8271 0.1248 0.2941 0.1516 0.0343 0.3186 4.62 7.58 -24.09 0.39 0.0005 <.0001 <.0001 0.7017 13 0.01396 -93.522227 0.9747 Variable Label LOG10 OF INCOME LOG10 OF RELATIVE PRICE Autoregressive parameters assumed given. Variable DF Estimate Standard Error t Value Approx Pr > |t| Intercept LOG10Y LOG10RP 1 1 1 1.3592 1.1487 -0.8271 0.2875 0.1471 0.0338 4.73 7.81 -24.47 0.0004 <.0001 <.0001 The SAS System Variable Label LOG10 OF INCOME LOG10 OF RELATIVE PRICE 10:37 Wednesday, December 6, 2006 The AUTOREG Procedure Dependent Variable LOG10CT LOG10 OF CONSUMPTION Ordinary Least Squares Estimates SSE MSE SBC 0.09918316 0.00661 -33.537669 DFE Root MSE AIC 36 15 0.08132 -35.204096 5 Econometric Notes Regress R-Square Durbin-Watson 0.0096 0.2415 Total R-Square 0.0096 Variable DF Estimate Standard Error t Value Approx Pr > |t| Intercept LOG10Y 1 1 1.4222 0.3478 1.8379 0.9133 0.77 0.38 0.4510 0.7087 Variable Label LOG10 OF INCOME Estimates of Autocorrelations Lag Covariance Correlation 0 1 0.00583 0.00447 1.000000 0.765305 -1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 | | |********************| |*************** | Preliminary MSE 0.00242 Estimates of Autoregressive Parameters Lag Coefficient Standard Error t Value 1 -0.765305 0.172027 -4.45 Algorithm converged. The SAS System 10:37 Wednesday, December 6, 2006 6 The AUTOREG Procedure Maximum Likelihood Estimates SSE MSE SBC Regress R-Square Durbin-Watson 0.02423721 0.00173 -53.034484 0.0564 1.6157 DFE Root MSE AIC Total R-Square 14 0.04161 -55.534124 0.7580 Variable DF Estimate Standard Error t Value Approx Pr > |t| Intercept LOG10Y AR1 1 1 1 0.6643 0.7229 -0.8961 1.6320 0.8167 0.1312 0.41 0.89 -6.83 0.6901 0.3910 <.0001 Variable Label LOG10 OF INCOME Autoregressive parameters assumed given. Variable DF Estimate Standard Error t Value Approx Pr > |t| Intercept LOG10Y 1 1 0.6643 0.7229 1.5885 0.7910 0.42 0.91 0.6821 0.3762 Variable Label LOG10 OF INCOME The SAS first shows the complete model where the DW was 1.9267 indicating GLS was not needed. If in fact GLS is run, .1248 with t .39 . The DW fell to 1.79. For the mis-specified equation the DW was .215 before GLS and 1.6157 after GLS. Here .8961 with t 6.83 . 
For B34S, which uses a two-pass method to do GLS, the results were (condensed):

  First pass: OLS estimation of LOG10CT on LOG10Y.
  Variable    Coefficient   Std. Error      t       t Sig.
  LOG10Y        0.34783      0.91330      0.3808    0.2913
  CONSTANT      1.42223      1.83787      0.7738    0.5490
  R Square = .00958, Adjusted R Square = -.0565,
  Durbin-Watson = .24148
  Autocorrelation function of residuals: .813, .658, .546, .332
  Extended grid normality test Chi = 10.65 (Chi prob = .7775);
  heteroskedasticity at the .7389 level.

  Doing generalized least squares using a residual difference
  equation of order 1. Lag coefficient: .842413.
  Second pass: GLS estimation on the transformed data.
  Variable    Coefficient   Std. Error      t       t Sig.
  LOG10Y        0.29595      0.74282      0.3984    0.3037
  CONSTANT      1.60519      1.50475      1.067     0.6959
  R Square = .01121, Adjusted R Square = -.0594,
  Durbin-Watson = 2.16324
  Autocorrelation function of residuals: -.160, -.479, .254, -.035
  Gen. Least Squares ended by satisfying tolerance.

Here $\hat{\rho} = .8424$ and after GLS the DW was 2.16, which is higher than found with SAS. Note the change in the sign of $\hat{\rho}$ in the SAS output: although there is positive serial correlation (the residual autocorrelation was .7653), SAS reports its AR1 parameter as $-.8961$ because of its sign convention. The insignificant LOG10Y coefficient is now .2959 in place of the .7229 found with SAS, but very close to the OLS value of .3478.

Remarks: What can we conclude from the preceding results? Serial correlation was not the reason that LOG10Y was not significant (as measured by the low t value) in the OLS equation containing just LOG10Y on the right-hand side. In this equation, LOG10Y was not significant because of omitted variable bias. The B34S two-pass GLS procedure was able to remove more serial correlation than the SAS ML approach. We found that LOG10Y is a significant variable in a properly specified equation.
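The omitted variable story can be isolated with a tiny simulation. The sketch below (Python with numpy and statsmodels; the design of two highly correlated regressors entering with opposite signs, as income and relative price do in the textile model, is invented for illustration and is not the textile data) shows a variable that looks insignificant alone yet is strongly significant once its companion enters, just as LOG10Y behaved.

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(1)          # illustrative seed
  n = 17                                  # same length as the textile sample
  z = rng.normal(size=n)
  x1 = z + 0.2 * rng.normal(size=n)       # two highly correlated regressors
  x2 = z + 0.2 * rng.normal(size=n)
  y = 1.0 + x1 - x2 + 0.1 * rng.normal(size=n)   # opposite-signed effects

  short = sm.OLS(y, sm.add_constant(x1)).fit()                        # x2 omitted
  full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
  print("x1 alone:       t =", round(short.tvalues[1], 2))   # near zero
  print("x1, x2 jointly: t =", round(full.tvalues[1], 2))    # large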
This problem illustrates how it would be a mistake to remove LOG10Y from consideration as a potentially important variable just because it does not enter significantly into a serial-correlation-free equation that does not contain all the appropriate variables on the right.

An example having different problems is illustrated by a dataset from the engineering literature (from Brownlee [1965], Statistical Theory and Methodology, page 454) that is presented in Table Three. Here we have a maintained hypothesis that the stack loss of ingoing ammonia (Y) is related to the operation of a factory that converts ammonia to nitric acid by the process of oxidation. There is data on three variables for 21 days of plant operation: X1 = air flow, X2 = cooling water inlet temperature, X3 = acid concentration and Y = stack loss of ammonia.

____________________________________________________________
Table Three  Brownlee Engineering Stack Loss Data

 Obs   X1   X2   X3   Y1
   1   80   27   89   42
   2   80   27   88   37
   3   75   25   90   37
   4   62   24   87   28
   5   62   22   87   18
   6   62   23   87   18
   7   62   24   93   19
   8   62   24   93   20
   9   58   23   87   15
  10   58   18   80   14
  11   58   18   89   14
  12   58   17   88   13
  13   58   18   82   11
  14   58   19   93   12
  15   50   18   89    8
  16   50   18   86    7
  17   50   19   72    8
  18   50   19   79    8
  19   50   20   80    9
  20   56   20   82   15
  21   70   20   91   15

X1 = air flow.
X2 = cooling water inlet temperature.
X3 = acid concentration.
Y1 = stack loss of ammonia.
____________________________________________________________

The following B34S commands will load the data and perform the required analysis.

/$ Sample Data # 3
/$ Data from Brownlee (1965) page 454
b34sexec data corr$
INPUT X1 X2 X3 Y$
LABEL X1 = 'AIR FLOW'$
LABEL X2 = 'COOLING WATER INLET TEMPERATURE'$
LABEL X3 = 'ACID CONCENTRATION'$
LABEL Y = 'STACK LOSS' $
DATACARDS$
80 27 89 42
80 27 88 37
75 25 90 37
62 24 87 28
62 22 87 18
62 23 87 18
62 24 93 19
62 24 93 20
58 23 87 15
58 18 80 14
58 18 89 14
58 17 88 13
58 18 82 11
58 19 93 12
50 18 89 8
50 18 86 7
50 19 72 8
50 19 79 8
50 20 80 9
56 20 82 15
70 20 91 15
b34sreturn$
b34seend$

b34sexec regression maxgls=2 residuala$
MODEL Y = X1 X2 X3 $
b34seend$
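For readers working outside B34S, the correlations and OLS fit reported next can be reproduced in a few lines. A sketch in Python using numpy and statsmodels (an assumption of these notes, not part of the original setup), with the 21 observations of Table Three typed in directly:

  import numpy as np
  import statsmodels.api as sm

  # Table Three: columns X1 (air flow), X2 (inlet temp), X3 (acid), Y (loss)
  data = np.array([
      [80, 27, 89, 42], [80, 27, 88, 37], [75, 25, 90, 37], [62, 24, 87, 28],
      [62, 22, 87, 18], [62, 23, 87, 18], [62, 24, 93, 19], [62, 24, 93, 20],
      [58, 23, 87, 15], [58, 18, 80, 14], [58, 18, 89, 14], [58, 17, 88, 13],
      [58, 18, 82, 11], [58, 19, 93, 12], [50, 18, 89,  8], [50, 18, 86,  7],
      [50, 19, 72,  8], [50, 19, 79,  8], [50, 20, 80,  9], [56, 20, 82, 15],
      [70, 20, 91, 15]], dtype=float)

  print(np.corrcoef(data, rowvar=False).round(4))    # correlation matrix
  X = sm.add_constant(data[:, :3])
  res = sm.OLS(data[:, 3], X).fit()
  print(res.params.round(4), res.tvalues.round(3))   # matches equation (6-10)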
The results of the OLS model fit are reported next (condensed).

  Means: X1 = 60.43, X2 = 21.10, X3 = 86.29, Y = 17.52  (N = 21)

  Correlation matrix:
            X1        X2        X3
  X2      .78185
  X3      .50014    .39094
  Y       .91966    .87550    .39983

  Regression of Y on X1, X2, X3 and a constant:
  Variable    Coefficient   Std. Error      t
  X1            0.71564      0.13486      5.307
  X2            1.29529      0.36802      3.520
  X3           -0.15212      0.15629     -0.9733
  CONSTANT    -39.91967     11.89600     -3.356
  R Square = .91358, Adjusted R Square = .89833,
  F(3,17) = 59.90, Sum of squared residuals = 178.83,
  Durbin-Watson = 1.48513

$Y1 = -39.91967 + .7156402\,X1 + 1.295286\,X2 - .1521225\,X3$
       (-3.356)     (5.307)        (3.520)       (-.9733)

$\bar{R}^2 = .8983$, $e'e = 178.83$    (6-10)

Two of the three variables (in addition to the constant) are found to be significant (significantly different from zero at or better than the 95% level). The correlations with Y were .9197, .8755 and .3998, respectively. Of the three variables, X3 (acid concentration) was not significant at the 95% or better level, because its t statistic was less than 2 in absolute value. The variable X1 (air flow) was found to be positively related to stack loss, and the variable X2 (cooling water inlet temperature) was also found to be positively related to stack loss. In this model, 89.83% of the variance is explained by the three variables on the right. Clearly, stack loss can be lowered if X1 and X2 are lowered.
The X3 variable (acid concentration) was not significant, even though the raw correlation shows some relationship (correlation = .39983).¹ The OLS equation was found to have a Durbin-Watson statistic of 1.4851, showing some serial correlation. First-order GLS was tried but was not executed, since the residual correlation was less than the tolerance.

Remark: A close-to-significant correlation is no assurance that the variable will be significant in a more populated model.

From an economist's point of view, the results reported above suggest that the tradeoffs of a lower air flow and lower cooling water inlet temperature must be weighed against absorption technology changes that would lower the constant. While engineering considerations are clearly paramount in the decision process, the regression results, which can be readily obtained with a modern PC, can help summarize the data and highlight the relationships between the variables of interest. Of course it is important to select the appropriate data to use in the study. If data on key variables are omitted, the results of the study could be called into question. However, the problem may not be as bad as it seems. If a variable that was important was inadvertently omitted, then unless it was random, its effect should be visible in the error process. In the last analysis, the value of a model lies in how well it works. Inspection of the results is a key aspect of the validation of a model.

¹ Depending on whether the large or small sample SE is used, the value is $1/\sqrt{20} = .22361$ or $1/\sqrt{21} = .21822$.

Table Four shows a data set (taken from Brownlee [1965], op. cit., page 463). Here the number of deaths per 100,000 males aged 55-59 years from heart disease in a number of countries is related to the number of telephones per head (X1) (presumably as a measure of stress and/or income), the percent of total calories from fat (X2) and the percent of total calories from animal protein (X3).

____________________________________________________________
Table Four  Brownlee Health Data

 Obs    X1   X2   X3    Y
   1   124   33    8   81
   2    49   31    6   55
   3   181   38    8   80
   4     4   17    2   24
   5    22   20    4   78
   6   152   39    6   52
   7    75   30    7   88
   8    54   29    7   45
   9    43   35    6   50
  10    41   31    5   69
  11    17   23    4   66
  12    22   21    3   45
  13    16    8    3   24
  14    10   23    3   43
  15    63   37    6   38
  16   170   40    8   72
  17   125   38    6   41
  18    15   25    4   38
  19   221   39    7   52
  20   171   33    7   52
  21    97   38    6   66
  22   254   39    8   89

X1 = 1000 * telephones per head.
X2 = fat calories as a % of total calories.
X3 = animal protein as a % of total calories.
Y  = 100 * log number of deaths per 1000 males 55-59 years.
____________________________________________________________

The B34S commands to analyze this data are:

/$ Sample Data # 4
/$ From Brownlee (1965) page 463
b34sexec data corr$
INPUT X1 X2 X3 Y$
LABEL X1 = '1000 * TELEPHONES PER HEAD'$
LABEL X2 = 'FAT CALORIES AS A % OF TOTAL CALORIES'$
LABEL X3 = 'ANIMAL PROTEIN AS A % TO TOTAL CALORIES'$
LABEL Y = '100 * LOG # DEATHS PER 1GMALES 55-59'$
DATACARDS$
124 33 8 81
49 31 6 55
181 38 8 80
4 17 2 24
22 20 4 78
152 39 6 52
75 30 7 88
54 29 7 45
43 35 6 50
41 31 5 69
17 23 4 66
22 21 3 45
16 8 3 24
10 23 3 43
63 37 6 38
170 40 8 72
125 38 6 41
15 25 4 38
221 39 7 52
171 33 7 52
97 38 6 66
254 39 8 89
b34sreturn$
b34seend$

b34sexec regression residuala$
MODEL Y = X1 X2 X3 $
b34seend$

The raw correlation results show X1, X2 and X3 positively related to Y, with correlations of .46875, .44628 and .62110, respectively. The variable X3 appears to be the most important.
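Footnote 1 above used $1/\sqrt{n}$ as a quick standard error for a correlation. The short sketch below (Python; the $t = r\sqrt{n-2}/\sqrt{1-r^2}$ transform is the standard textbook test for a zero correlation, an addition to the footnote's rule of thumb) applies both yardsticks to the three correlations with Y:

  import numpy as np

  n = 22
  for name, r in [("X1", 0.46875), ("X2", 0.44628), ("X3", 0.62110)]:
      se = 1.0 / np.sqrt(n)                       # quick SE, as in footnote 1
      t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # textbook t test for r = 0
      print(name, " r =", r, " r/SE =", round(r / se, 2), " t =", round(t, 2))

All three correlations pass these univariate screens, yet equation (6-11) below shows that only X3 survives in the multiple regression, which is the point of this example.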
$Y = 23.9306 - .0067849\,X1 - .478240\,X2 + 8.496616\,X3$
     (1.499)    (-.0833)      (-.6315)      (2.21)

$\bar{R}^2 = .3017$, $e'e = 4686$    (6-11)

The OLS results indicate that only 30.17% of the variance can be explained and that the animal protein variable (X3) is the only significant variable. Clearly, this finding is interesting, but the large unexplained component suggests that more data need to be collected to improve the model. It may well be the case that the animal protein variable (X3) is related to other unspecified variables, and interpreting it without qualification would be dangerous. This will have to be investigated in future research if more data become available.

Remark: This dataset shows a case where correlation analysis suggested a result that did not stand up in a multiple regression model. This is in contrast to the Theil dataset, where correlation analysis failed to suggest a relationship that was found only with a regression model.

The B34S output (condensed):

  Means: X1 = 87.55, X2 = 30.32, X3 = 5.64, Y = 56.73  (N = 22)

  Correlation matrix:
            X1        X2        X3
  X2      .75915
  X3      .80220    .83018
  Y       .46875    .44628    .62110

  Regression of Y on X1, X2, X3 and a constant:
  Variable    Coefficient   Std. Error      t
  X1           -0.0067849    0.081441    -0.0833
  X2           -0.47824      0.75725     -0.6315
  X3            8.49662      3.84412      2.210
  CONSTANT     23.93061     15.96606      1.499
  R Square = .40147, Adjusted R Square = .30172,
  F(3,18) = 4.025, Sum of squared residuals = 4685.5
  Durbin-Watson = 2.11703; residual autocorrelations -.099, .136,
  -.405, .152; heteroskedasticity at the .6762 level.

The preceding sections have outlined some of the things that can be done with simple regression analysis. In the next section of the paper, data will be generated that will better illustrate problems of omitted variables and "hidden" nonlinearity.

7. Advanced Regression analysis

The below-listed B34S code shows how 250 observations for a number of series are generated. Regression models for the B34S are also shown.

/;
/; nonlinearity and serial correlation in generated data
/;
b34sexec data noob=250 maxlag=1 /; corr ;
* b0=1 b1=100 b2=-100 b3=80 $
* generate three output variables with different characteristics$
build x1 x2 x3 y ynlin yma e$
gen x1 = rn()$
gen x2 = x1*x1$
gen x3 = lag1(x1)$
/; gen e = 100.*rn()$
gen e = rn()$
* ;
* build three variables $
* y=f(x1,x2,x3) ;
* ynlin=f(x1,x2);
* yma =f(x1,x2,x3) + theta*lag(et);
* ;
gen y = 1.0 + 1.*x1 - 1.*x2 + .8*x3 + e $
gen ynlin = y - .8*x3 $
* generate an ma model;
gen yma = y + (-.95*lag1(e));
b34srun$
/;
/; end of data building
/;
b34sexec list iend=20$ b34seend$
b34sexec reg$ model y = x1 x2 $      b34seend$
b34sexec reg$ model y = x1 $         b34seend$
b34sexec reg$ model ynlin = x1 $     b34seend$
b34sexec reg$ model ynlin = x1 x2 $  b34seend$
b34sexec reg$ model yma = x1 x2 x3$  b34seend$
/$ do gls
b34sexec regression residuala maxgls=4$
model yma=x1 x2 x3 $
b34seend$

b34sexec matrix;
call loaddata;
call load(rrplots);
call load(data2acf);
call olsq(yma x1 x2 x3 :print);
call data2acf(%res,'Model yma=f(x1, x2, x3)',12,'yma_res_acf.wmf');
b34srun;
/;
/; sort data by variable we suspect is nonlinear
/; Then do RR analysis
/;
b34sexec sort $ by x1$ b34seend$
/; b34sexec list iend=20$ b34seend$
b34sexec reg$ model y = x1 $      b34seend$
b34sexec reg$ model y = x1 x3 $   b34seend$
b34sexec reg$ model ynlin = x1 $  b34seend$
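Before turning to the estimates, note that the same dataset can be generated outside B34S. A sketch in Python with numpy (the B34S rn() is replaced by a seeded standard normal generator, so the draws, and hence the estimates below, will not match the B34S run exactly):

  import numpy as np

  rng = np.random.default_rng(123)        # illustrative seed
  n = 250
  x1 = rng.normal(size=n)
  e = rng.normal(size=n)
  x2 = x1 ** 2                            # the hidden nonlinearity
  x3 = np.r_[np.nan, x1[:-1]]             # lag1(x1)
  elag = np.r_[np.nan, e[:-1]]            # lag1(e)
  y = 1.0 + x1 - x2 + 0.8 * x3 + e        # equation (7-1)
  ynlin = y - 0.8 * x3                    # equation (7-2)
  yma = y - 0.95 * elag                   # equation (7-3)
  keep = ~np.isnan(x3)                    # drop the first observation
  x1, x2, x3, y, ynlin, yma = (v[keep] for v in (x1, x2, x3, y, ynlin, yma))
  print(len(y))                           # 249 usable observations, as in B34S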
/;
/; recursive residual analysis
/; x2, which is a nonlinear x1 term, is missing. Can RR detect it?
/;
b34sexec matrix;
call loaddata;
call load(rrplots);
/; call print(rrplots);
call olsq(y x1 x3 :rr 1 :print);
/; call tabulate(%rrobs,%ssr1,%ssr2,%rr,%rrstd,%res);
call print('Sum of squares of std RR  ',sumsq(goodrow(%rrstd)):);
call print('Sum of squares of OLS RES ',sumsq(goodrow(%res)):);
/; call print(%rrcoef,%rrcoeft);
/; call rrplots(%rrstd,%rss,%nob,%k,%ssr1,%ssr2,1);
call rrplots(%rrstd,%rss,%nob,%k,%ssr1,%ssr2,0);
/; call names(all);
x1_coef=%rrcoef(,1);
x3_coef=%rrcoef(,2);
call graph(x1_coef,x3_coef :file 'coef_bias.wmf' :nolabel :heading 'Omitted Variable Bias x1 and x3 coef');
b34srun;

The above code builds three models:

$y = 1.0 + x1 - x2 + .8\,x3 + e$    (7-1)

$ynlin = 1.0 + x1 - x2 + e$    (7-2)

$yma = 1.0 + x1_t - x2_t + .8\,x3_t + e_t - .95\,e_{t-1}$    (7-3)

By construction of the data, and ignoring subscripts where there is no confusion:

$x2 = (x1)^2$    (7-4)

$x3_t = x1_{t-1}$    (7-5)

Since x1 is a random variable, there is no correlation between x1 and x3. Because of (7-4) there is correlation between x1 and x2. The purpose of the generated data set is to illustrate the conditions under which an omitted variable will and will not cause a bias in the coefficients estimated for an incomplete model, and to show a detection strategy.

The yma series illustrates the relationship between AR (autoregressive) and MA (moving average) error processes. Assume the lag operator $L$ defined such that $x_{t-k} = L^k x_t$. A simple OLS model with an MA error process is defined as

$y_t = \beta_1 + \beta x_t + \theta(L) e_t$    (7-6)

where $\theta(L)$ is a polynomial in the lag operator $L$. A simple OLS model with an AR error process is defined as

$y_t = \beta_1 + \beta x_t + (1/\phi(L)) e_t$    (7-7)

where $\phi(L)$ is a polynomial in the lag operator $L$. If we assume further that the maximum order in $\theta(L)$ is 1, i.e.,

$\theta(L) = (1 - \theta_1 L)$    (7-8)

it can be proved that a first-order MA model (MA(1)) is equal to an infinite-order AR model if $|\theta_1| < 1$. This can be seen if we note that

$(1 - \theta_1 L)^{-1} = (1 + \phi_1 L + \phi_2 L^2 + \cdots + \phi_k L^k + \cdots)$    (7-9)

where $\phi_i = (\theta_1)^i$. The importance of equation (7-9) is that it shows that if equation (7-3) is estimated with GLS, which is implicitly an AR error-correction technique, more than first-order GLS will be required to remove the serial correlation in the error term. In a transfer function model of the form

$y_t = \frac{\omega(L)}{\delta(L)} x_t + \frac{\theta(L)}{\phi(L)} e_t$    (7-10)

only one MA term (7-8) would be needed and $\phi(L) = 1$. An OLS model is a transfer function model that constrains $\delta(L) = \theta(L) = \phi(L) = 1$. GLS allows $\phi(L) \neq 1$.

The means of the data generated in accordance with equations (7-1) - (7-3) and OLS estimation of a number of models are given next.

  Variable   # Cases     Mean      Std Deviation
  X1           249      0.1508        1.0479
  X2           249      1.1164        1.5329
  X3           249      0.1610        1.0540
  Y            249      0.1976        2.2836
  YNLIN        249      0.0688        2.0852
  YMA          249      0.1594        2.3995
  E            249      0.0344        1.0592

The below-listed output shows that the coefficients for x1 and x2 are close to their population values of 1.0 and -1.0 even though x3 is missing from the model. This is because the omitted variable x3 is not correlated with an included variable.
  OLS: dependent variable Y, N = 249
  Variable    Coefficient   Std. Error      t
  X1            1.12282      0.08360      13.43
  X2           -1.02666      0.05715     -17.96
  CONSTANT      1.17444      0.10732      10.94
  Adjusted R**2 = .6417, Durbin-Watson = 1.8893

Since the omitted variable x2 is related to the included variable x1, the estimated coefficient for x1 in the next model is biased:

  OLS: dependent variable Y, N = 249
  Variable    Coefficient   Std. Error      t
  X1            0.92008      0.12569       7.32
  CONSTANT      0.05881      0.13281       0.44
  Adjusted R**2 = .1749, Durbin-Watson = 1.9444

The model for ynlin does not contain x3. Here the omission of x2 shows up as a bias on the included variable x1. The fact that x1 appears highly significant (t = 7.55) may fool the researcher. The task ahead is to investigate model specification in a systematic manner using simple tests.

  OLS: dependent variable YNLIN, N = 249
  Variable    Coefficient   Std. Error      t
  X1            0.86153      0.11413       7.55
  CONSTANT     -0.06113      0.12059      -0.51
  Adjusted R**2 = .1842, Durbin-Watson = 1.9220

Before beginning the analysis, note that the correct model for ynlin shows coefficients close to their population values:

  OLS: dependent variable YNLIN, N = 249
  Variable    Coefficient   Std. Error      t
  X1            1.06361      0.06489      16.39
  X2           -1.02337      0.04436     -23.07
  CONSTANT      1.05092      0.08330      12.62
  Adjusted R**2 = .7411, Durbin-Watson = 1.9339

The model for yma shows negative serial correlation (DW = 2.872) even though all variables are in the model.
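That an MA(1) error announces itself this way is easy to verify. The sketch below (Python with numpy; it rebuilds the generated data from scratch with an illustrative seed) regresses yma on x1, x2, x3 and prints the DW and the lag-1 residual autocorrelation. With $\theta = .95$ the theoretical lag-1 autocorrelation is $-\theta/(1+\theta^2) \approx -.50$, so a DW near $2(1 + .5) = 3$ is expected.

  import numpy as np

  rng = np.random.default_rng(123)        # illustrative seed
  n = 250
  x1 = rng.normal(size=n)
  e = rng.normal(size=n)
  x2 = x1 ** 2
  x3 = np.r_[np.nan, x1[:-1]]
  yma = 1 + x1 - x2 + 0.8 * x3 + e - 0.95 * np.r_[np.nan, e[:-1]]
  k = ~np.isnan(x3)

  X = np.column_stack([np.ones(k.sum()), x1[k], x2[k], x3[k]])
  b, *_ = np.linalg.lstsq(X, yma[k], rcond=None)
  r = yma[k] - X @ b
  dw = np.sum(np.diff(r) ** 2) / np.sum(r ** 2)
  acf1 = (r[:-1] @ r[1:]) / (r @ r)
  print("DW:", round(dw, 3), " lag-1 ACF:", round(acf1, 3))
  # negative serial correlation even though the fitted model is the true one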
  OLS: dependent variable YMA, N = 249
  Variable    Coefficient   Std. Error      t
  X1            1.09076      0.08812      12.38
  X2           -0.94688      0.06008     -15.76
  X3            0.75277      0.08680       8.67
  CONSTANT      0.93088      0.11361       8.19
  Adjusted R**2 = .6414, Durbin-Watson = 2.8717
  Autocorrelation function of residuals: -.438, -.087, .076, -.113

Note the residual ACF values of -.438 and -.087 for the OLS model. GLS is now attempted. The B34S regression command raised the order of the residual difference equation from 1 to 4; the condensed results are:

  Order   Lag coefficients                  DW      SSR      X1       X2       X3     CONSTANT
  OLS     -                               2.8717   505.92   1.0908  -0.9469   0.7528   0.9309
  1       -.4365                          2.2984   198.16   1.0533  -0.9570   0.7796   0.9432
  2       -.5874 -.3439                   2.1142    96.70   1.0640  -0.9423   0.7593   0.9306
  3       -.6483 -.4478 -.1763            2.1125    66.64   1.0278  -0.9486   0.7885   0.9429
  4       -.6947 -.5648 -.3452 -.2597     2.1064    39.02   1.0368  -0.9667   0.7804   0.9634

After first-order GLS the DW, now 2.298, is close to 2.0, but the residual ACF still shows a spike at lag 2 (-.152, -.322, -.000, -.079), so GLS of order 2 is attempted. That brings the DW to 2.114. GLS of orders 3 and 4 add little gain to the DW but show a slow decline in the GLS lag coefficients, which would be expected in view of (7-9). Gen. Least Squares ended by max. order reached.
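The pattern in the table, lag coefficients that decline roughly geometrically and an SSR that keeps falling as the order is raised, is what (7-9) predicts for an MA(1) error. The sketch below (Python with numpy; an illustrative re-implementation of the two-pass idea on a long simulated sample, not the B34S algorithm) fits an AR(p) to the OLS residuals and quasi-differences the data for p = 1, ..., 4:

  import numpy as np

  def fit_ols(y, X):
      b, *_ = np.linalg.lstsq(X, y, rcond=None)
      return b, y - X @ b

  def ar_fit(e, p):
      # regress e_t on e_{t-1}, ..., e_{t-p}; return the lag coefficients
      Z = np.column_stack([e[p - i - 1:len(e) - i - 1] for i in range(p)])
      rho, *_ = np.linalg.lstsq(Z, e[p:], rcond=None)
      return rho

  def quasi_diff(v, rho):
      # v_t - rho_1 v_{t-1} - ... - rho_p v_{t-p}
      p = len(rho)
      out = v[p:].copy()
      for i, r in enumerate(rho):
          out -= r * v[p - i - 1:len(v) - i - 1]
      return out

  rng = np.random.default_rng(123)    # illustrative seed
  n = 5000                            # long sample, to show the pattern clearly
  x = rng.normal(size=n)
  e = rng.normal(size=n)
  u = e.copy()
  u[1:] -= 0.95 * e[:-1]              # MA(1) error, theta = .95
  y = 1 + x + u
  X = np.column_stack([np.ones(n), x])
  _, res = fit_ols(y, X)

  for p in (1, 2, 3, 4):
      rho = ar_fit(res, p)
      ystar = quasi_diff(y, rho)
      Xstar = np.column_stack([quasi_diff(X[:, j], rho) for j in range(2)])
      _, rstar = fit_ols(ystar, Xstar)
      dw = np.sum(np.diff(rstar) ** 2) / np.sum(rstar ** 2)
      print("order", p, " lag coefs", rho.round(3), " DW", round(dw, 3))

The printed lag coefficients are negative and decline roughly like $-\theta^i$, the AR(∞) representation of the MA(1) error.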
The classic MA residual ACF is shown in Figure 7.1. There is one ACF spike, but the PACF suggests a longer AR model, which was shown to be captured by the GLS model above.

[Figure 7.1 here: plot of the residuals of the model yma = f(x1, x2, x3), with the ACF and PACF of the residual series in panels below.]

Figure 7.1 Analysis of residuals of the YMA model.

Remark: A low-order autoregressive structure in the error term is usually easily captured by a GLS model. However, a simple MA residual structure, which might occur in an overshooting situation, often requires a high-order GLS model to clean the residual. The problem is that with maximum likelihood GLS the autoregressive parameters are often hard to estimate because they are related, as is seen in (7-9).

Recall that the models $ynlin = f(x1)$, $y = f(x1)$ and $y = f(x1, x3)$ produced biased coefficients for the constant and x1, and for the constant, x1 and x3, respectively. How might one test such models for an excluded variable (x2) that is related to an included variable (x1)? One way to proceed is to sort the data with respect to one variable (x1 in the example to be shown) and inspect the Durbin-Watson statistic. Nonlinearity will be reflected in a low DW. This approach uses time series methods on cross section data. This is shown next (condensed results for the x1-sorted data):

  y = f(x1):      X1 coefficient .92008 (t = 7.32),  Durbin-Watson = .9304
  y = f(x1,x3):   X1 .85967 (t = 7.50), X3 .82544 (t = 7.24),
                  Durbin-Watson = .7352
  ynlin = f(x1):  X1 coefficient .86153 (t = 7.55),  Durbin-Watson = .7346

The Durbin-Watson tests for the three models were .9304, .7352 and .7346, respectively.
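The sort-then-test trick is easy to replicate. A sketch (Python with numpy; a stripped-down version of the ynlin setup with an illustrative seed) fits y = f(x1) with the x1² term omitted, then recomputes the DW after ordering the observations by x1:

  import numpy as np

  rng = np.random.default_rng(123)            # illustrative seed
  n = 250
  x1 = rng.normal(size=n)
  y = 1 + x1 - x1 ** 2 + rng.normal(size=n)   # x2 = x1**2 omitted from the fit

  def dw_after_sort(order):
      idx = np.argsort(order)
      X = np.column_stack([np.ones(n), x1[idx]])
      b, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
      r = y[idx] - X @ b
      return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

  print("unsorted DW:    ", round(dw_after_sort(np.arange(n)), 3))  # near 2
  print("sorted-by-x1 DW:", round(dw_after_sort(x1), 3))            # well below 2

Unsorted, the residuals are exchangeable and the DW sits near 2; sorted by x1, the missing -x1² term leaves a smooth hump in the residuals and the DW collapses, exactly as in the runs above.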
The above results show how the Durbin-Watson test, which was developed for time series models, can be used effectively in cross section models to test for equation misspecification. The results suggest that if a nonlinearity is suspected, the data should be sorted against each suspected variable in turn and the recursive coefficients analyzed. The recursively estimated coefficients for x1 and x3 for the model $y = f(x1, x3)$, when the data was sorted against x1, are displayed in Figure 7.2. The omitted variable bias is clearly shown by the movement in the x1 coefficient as higher and higher values of x1 are added to the sample.

[Figure 7.2 here: recursively estimated x1 and x3 coefficients plotted against observation number, headed "Omitted Variable Bias x1 and x3 coef".]

Figure 7.2 Recursively estimated X1 and X3 coefficients for X1 Sorted Data

[Figure 7.3 here: plot of the CUSUM test statistic with its upper and lower significance bounds.]

Figure 7.3 CUSUM test of the model estimated with sorted data

Figures 7.3-7.5 show, respectively, the CUSUM, CUSUMSQ and Quandt likelihood ratio tests. Further detail on these tests is contained in Stokes, Specifying and Diagnostically Testing Econometric Models (1997; see also third edition drafts), Chapter 9. Here we only sketch their use. Brown, Durbin and Evans (1975) proposed the CUSUM test as a summary measure of whether there is parameter stability. The test consists of plotting the quantity

$W_i = \sum_{j=K+1}^{i} w_j / \hat{\sigma}$    (7-11)

where $w_j$ is the normalized recursive residual. The CUSUM test is particularly good at detecting systematic departure of the $\beta_i$ coefficients, while the CUSUMSQ test is useful when the departure of the $\beta_i$ coefficients from constancy is haphazard rather than systematic. The CUSUMSQ test involves a plot of $s^*_i$, defined as

$s^*_i = \sum_{j=K+1}^{i} w_j^2 \Big/ \sum_{j=K+1}^{T} w_j^2$    (7-12)

Approximate bounds for $W_i$ and $s^*_i$ are given in Brown, Durbin and Evans (1975). Assuming a rectangular plot, the upper-right-hand value is 1.0 and the lower-left-hand value is 0.0.
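The quantities in (7-11) and (7-12) can be computed directly from the recursive least squares formulas. The sketch below (Python with numpy; a textbook implementation of the Brown, Durbin and Evans recursions rather than the B34S rrplots code, run on simulated sorted data) builds the normalized recursive residuals and the CUSUM and CUSUMSQ series:

  import numpy as np

  def recursive_residuals(y, X):
      # Brown-Durbin-Evans normalized recursive residuals w_j, j = K+1,...,T
      T, K = X.shape
      w = []
      for t in range(K, T):
          XtX_inv = np.linalg.inv(X[:t].T @ X[:t])
          b = XtX_inv @ X[:t].T @ y[:t]          # coefficients from first t obs
          fvar = 1.0 + X[t] @ XtX_inv @ X[t]     # forecast variance factor
          w.append((y[t] - X[t] @ b) / np.sqrt(fvar))
      return np.array(w)

  rng = np.random.default_rng(123)               # illustrative seed
  n = 250
  x1 = np.sort(rng.normal(size=n))               # data sorted by x1
  y = 1 + x1 - x1 ** 2 + rng.normal(size=n)      # x1**2 term omitted below
  X = np.column_stack([np.ones(n), x1])

  w = recursive_residuals(y, X)
  sigma = w.std(ddof=1)                          # one common estimate of sigma-hat
  cusum = np.cumsum(w) / sigma                   # W_i of equation (7-11)
  cusumsq = np.cumsum(w ** 2) / np.sum(w ** 2)   # s*_i of equation (7-12)
  print(round(cusum[-1], 1), round(cusumsq[len(w) // 2], 3))

Plotting cusum and cusumsq against the observation number reproduces the kind of pictures shown in Figures 7.3 and 7.4.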
A regression with stable coefficients will generate a $s^*_i$ plot up the diagonal. If the plot lies above the diagonal, the implication is that the regression is tracking poorly in the early subsample in comparison with the total sample. A plot below the diagonal suggests the reverse, namely, that the regression is tracking better in the early subsample than in the complete sample. The Quandt log-likelihood ratio test involves the calculation of $\lambda_i$, defined as

$\lambda_i = .5\, i \ln(\hat{\sigma}_1^2) + .5\,(T - i) \ln(\hat{\sigma}_2^2) - .5\, T \ln(\hat{\sigma}^2)$    (7-13)

where $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$ and $\hat{\sigma}^2$ are the variances of regressions fitted to the first $i$ observations, the last $T - i$ observations and the whole $T$ observations, respectively. The minimum of the plot of $\lambda_i$ can be used to select the "break" in the sample. Although no specific significance bounds are available for $\lambda_i$, the information suggested by the plot can be tested with the multiperiod Chow test, which is discussed next. If structural change is suspected, a homogeneity test (Chow) of $p$ equal segments of length $n$ can be performed. Given that $S(r, i)$ is the residual sum of squares from a regression calculated from observations $t = r$ to $i$, the appropriate statistic is distributed as $F(kp - k, T - kp)$ and defined as

$(F_1 / F_2)\,\big((T - kp) / (kp - k)\big)$    (7-14)

where

$F_1 = S(1, T) - \big(S(1, n) + S(n+1, 2n) + \cdots + S(pn-2n+1, pn-n) + S(pn-n+1, T)\big)$    (7-15)

and

$F_2 = S(1, n) + S(n+1, 2n) + \cdots + S(pn-2n+1, pn-n) + S(pn-n+1, T)$    (7-16)

[Figure 7.4 here: plot of the CUSUMSQ test statistic against the diagonal bounds.]

Figure 7.4 CUSUMSQ test of the y model estimated with sorted data.

[Figure 7.5 here: plot of the Quandt likelihood ratio statistic against observation number.]

Figure 7.5 Quandt Likelihood Ratio tests of the y model estimated with sorted data.

Remark: If an inadvertently excluded variable is correlated with an included variable, substantial bias in the estimated coefficients can occur. In cross section analysis, if the data is sorted by the included variables, what are usually thought of as time series techniques can be used to determine the nature of the problem. For more complex models, "automatic" techniques such as GAM, MARS and ACE can be employed. These are far too complex to discuss in this introductory analysis.

8. Advanced concepts

A problem with simple OLS models is that there may be situations where the estimated coefficients are biased, or the estimated standard errors are biased. While space precludes a detailed treatment, some of the problems and their solutions are outlined below.

_________________________________________________________
Table Five  Some Problems and Their Solution

Problem                                      Solution
Y a 0-1 variable.                            PROBIT, LOGIT
Y a bounded variable.                        TOBIT
X's not independent (i.e., X's not           2SLS, 3SLS, LIML, FIML
  orthogonal to e in the population).
Relationship not linear.                     Reparameterize model and/or
                                             NLS, MARS, GAM, ACE
Error not random.                            GLS, weighted least squares
Coefficients changing from changing          Recursive Residual Analysis
  population.
Time series problems.                        ARIMA model, transfer function
                                             model, vector model
Outlier problems.                            L1 & MINIMAX estimation
___________________________________________________________

The 0-1 left-hand variable problem arises if there are only two states for Y. For example, if Y is coded 0 = alive, 1 = dead, then a regression model that predicts more than dead (YHAT > 1) or less than alive (YHAT < 0) is clearly not using all of the information at hand.
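Before discussing why the OLS coefficients mislead in the 0-1 case, here is a minimal sketch of the LOGIT alternative (Python with statsmodels; the 0-1 data are simulated, since none of the sample datasets has a binary outcome):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(7)                 # illustrative seed
  n = 500
  x = rng.normal(size=n)
  p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))     # true P(Y = 1 | x)
  ybin = rng.binomial(1, p)

  res = sm.Logit(ybin, sm.add_constant(x)).fit(disp=0)
  print(res.params)                              # close to (0.5, 1.5)
  phat = res.predict(sm.add_constant(x))
  print(phat.min(), phat.max())                  # fitted values stay inside (0, 1)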
While the coefficients of an OLS model can be interpreted as partial derivatives, in the case of the 0-1 problem this is not the case. Assume that you have a number of variables, $x_1, x_2, \ldots, x_k$, and that high values are associated with a high probability of death before 45 years of age. Clearly, since you cannot be more than dead, if all variables are high, one additional unit of $x_1$ will not have the same effect as it would if all variables were low. For such problems, the appropriate procedure is LOGIT or PROBIT analysis. Due to space and time limitations, these techniques are not pursued further here beyond the sketch above.

A left-hand variable can be bounded on the upper or lower side. Examples of the former include scores on tests, and of the latter, money spent on cars. Assume a model where the score on a test (S) is a function of a number of variables, such as study time (ST), age (A), health (H) and experience (E). Clearly, one is going to run into diminishing returns on study time. If the number of hours were increased from 200 to 210, the increase in the score would not be the same as if the hours had been increased from 0 to 10 hours. Such problems require TOBIT procedures, which are not illustrated here due to space and time constraints. If an OLS model were fit to the above data, the coefficient for the study time variable would understate the effect of study time on exam scores for relatively low total study time and overstate the effect of study time on exam scores for relatively high total study time.

An important assumption of OLS is that the right-hand variables are independent. By this we mean that if $y = f(x_1, x_2, \ldots, x_k)$, where the $x_i$ are variables, then $x_1$ can be changed without $x_k$ changing. On the other hand, if the system is of the form

$y_1 = \alpha_1 + \beta_1 x_1 + \beta_2 y_2 + e_1$    (8-1)

$y_2 = \alpha_2 + \beta_3 x_2 + \beta_4 y_1 + e_2$    (8-2)

then one cannot use $\beta_1$ as a measure of how $y_1$ will change for a one-unit change in $x_1$, since there will be an effect on $y_1$ from the change in $y_2$ in the second equation, which will occur as $x_1$ changes. Such problems require two-stage least squares (2SLS) or limited information maximum likelihood (LIML) estimation. In addition, if the possible relationship between the error terms $e_1$ and $e_2$ is taken into consideration, three-stage least squares (3SLS) and/or full information maximum likelihood (FIML) estimation procedures should be used. These more advanced techniques will not be discussed further here, except to say that the appropriate procedures are available.

In OLS estimation, there is always the danger that the estimated linear model is being used to capture a nonlinear process. Over a short data range, a nonlinear process can look like a linear process. In the preliminary research on the way to kill live polio virus in order to make a "killed" vaccine, a graph was used to show that increased percentages of the virus were killed as more heat was applied. A straight line was fit and the appropriate temperature was selected. Much to the surprise of the researchers, it was later determined that the relationship was not linear; in fact, proportionately more heat was required, the lower the percentage of the virus still live. Poor statistical methodology resulted in people unexpectedly getting polio from the vaccine. One way to determine if the relationship is nonlinear is to put power and interaction terms in the regression. The problem is that it is easy to exhaust available CPU time and researcher time before all possibilities have been tested.
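One systematic way to screen for the neglected powers and interactions just mentioned is a nested F test: fit the linear model, fit it again with squared and cross-product terms added, and compare. A sketch (Python with numpy and statsmodels; simulated data with a hidden interaction, invented for illustration):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(11)                # illustrative seed
  n = 300
  x1, x2 = rng.normal(size=(2, n))
  y = 1 + x1 + x2 + 0.5 * x1 * x2 + rng.normal(size=n)   # hidden interaction

  linear = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
  augmented = sm.OLS(y, sm.add_constant(np.column_stack(
      [x1, x2, x1 ** 2, x2 ** 2, x1 * x2]))).fit()
  fstat, pval, df = augmented.compare_f_test(linear)      # nested F test
  print("F =", round(fstat, 2), " p =", round(pval, 4))   # rejects linearity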
The recursive residual procedure, which involves starting from a model estimated on a small sample and recursively re-estimating as observations are added, provides a way to detect if there are problems in the initial specification. More detail on this approach is provided in Stokes, Specifying and Diagnostically Testing Econometric Models (1997), Chapter 9. A brief introduction was given in section 7. The essential idea is that if the data set is sorted against one of the right-hand variables and regressions are run, adding one observation at a time, a plot or list of the coefficients will indicate whether they are stable for different ranges of the sorted variable. If other coefficients change, it indicates the need for interaction terms. If the sorted variable's coefficient changes, it indicates that there is a nonlinear relationship. If time is the variable on which the sort is made, it suggests that the coefficients are shifting over time.

The OLS model for time series data can be shown to be a special case of the more general ARIMA, transfer function and vector autoregressive moving average models. A preliminary look at some of these models and their uses is presented in the paper "The Box-Jenkins Approach - When Is It a Cost-effective Alternative?", which I wrote with Hugh Neuburger in the Columbia Journal of World Business (Vol. XI, No. 4, Winter 1976). As noted, if we were to write the model as

$y_t = \frac{\omega(L)}{\delta(L)} x_t + \frac{\theta(L)}{\phi(L)} e_t$    (8-3)

a more complex lag structure can be modeled than in the simple OLS case. If $\omega(L)/\delta(L) = 0$, then we have an ARIMA model and are modeling $y_t$ as a function of past shocks alone. If $\theta(L)/\phi(L) = 1$, then we have a rational distributed lag model. If neither restriction holds, we have a transfer function model. Systems of transfer-function-type models can be estimated using simultaneous transfer function estimation techniques or vector model estimation procedures. Space limits more comprehensive discussion of this important and general class of models beyond this brief treatment.

9. Summary

These short notes have attempted to outline the scope of elementary applied statistics. Students are encouraged to experiment with the sample data sets to perform further analysis.

* Editorial assistance was provided by Diana A. Stokes. Important suggestions were made by Evelyn Lehrer on a prior draft. I am responsible for any errors or omissions.