© 2003-2005, The Trustees of Indiana University Comparing Group Means: 1 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS Hun Myoung Park This document summarizes the method of comparing group means and illustrates how to conduct the t-test and one-way ANOVA using STATA 9.0, SAS 9.1, and SPSS 13.0. 1. 2. 3. 4. 5. 6. 7. Introduction Univariate Samples Paired (dependent) Samples Independent Samples with Equal Variances Independent Samples with Unequal Variances One-way ANOVA, GLM, and Regression Conclusion 1. Introduction The t-test and analysis of variance (ANOVA) compare group means. The mean of a variable to be compared should be substantively interpretable. A t-test may examine gender differences in average salary or racial (white versus black) differences in average annual income. The lefthand side (LHS) variable to be tested should be interval or ratio, whereas the right-hand side (RHS) variable should be binary (categorical). 1.1 T-test and ANOVA While the t-test is limited to comparing means of two groups, one-way ANOVA can compare more than two groups. Therefore, the t-test is considered a special case of one-way ANOVA. These analyses do not, however, necessarily imply any causality (i.e., a causal relationship between the left-hand and right-hand side variables). Table 1 compares the t-test and one-way ANOVA. Table 1. Comparison between the T-test and One-way ANOVA T-test One-way ANOVA LHS (Dependent) RHS (Independent) Null Hypothesis Interval or ratio variable Binary variable with only two groups µ1 = µ 2 * T distribution Prob. Distribution * In the case of one degree of freedom on numerator, F=t2. Interval or ratio variable Categorical variable µ1 = µ 2 = µ 3 = ... F distribution The t-test assumes that samples are randomly drawn from normally distributed populations with unknown population means. Otherwise, their means are no longer the best measures of central tendency and the t-test will not be valid. The Central Limit Theorem says, however, that http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 2 the distributions of y1 and y 2 are approximately normal when N is large. When n1 + n 2 ≥ 30 , in practice, you do not need to worry too much about the normality assumption. You may numerically test the normality assumption using the Shapiro-Wilk W (N<=2000), Shapiro-Francia W (N<=5000), Kolmogorov-Smirnov D (N>2000), and Jarque-Bera tests. If N is small and the null hypothesis of normality is rejected, you my try such nonparametric methods as the Kolmogorov-Smirnov test, Kruscal-Wallis test, Wilcoxon Rank-Sum Test, or Log-Rank Test, depending on the circumstances. 1.2 T-test in SAS, STATA, and SPSS In STATA, the .ttest and .ttesti commands are used to conduct t-tests, whereas the .anova and .oneway commands perform one-way ANOVA. SAS has the TTEST procedure for t-test, but the UNIVARIATE, and MEANS procedures also have options for ttest. SAS provides various procedures for the analysis of variance, such as the ANOVA, GLM, and MIXED procedures. The ANOVA procedure can handle balanced data only, while the GLM and MIXED can analyze either balanced or unbalanced data (having the same or different numbers of observations across groups). However, unbalanced data does not cause any problems in the t-test and one-way ANOVA. In SPSS, T-TEST, ONEWAY, and UNIANOVA commands are used to perform t-test and one-way ANOVA. 
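Before comparing means with any of these commands, it is worth checking the normality assumption discussed in 1.1. The following is a minimal sketch in STATA (commands only, output omitted), assuming the lung cancer death rate variable lung used in the examples below; the SAS and SPSS counterparts (the UNIVARIATE procedure with the NORMAL option and the EXAMINE command) appear in Table 2.

. swilk lung        // Shapiro-Wilk W test (N<=2000)
. sfrancia lung     // Shapiro-Francia W' test (N<=5000)
. sktest lung       // skewness and kurtosis test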
Table 2 summarizes STATA commands, SAS procedures, and SPSS commands that are associated with t-test and one-way ANOVA. Table 2. Related Procedures and Commands in STATA, SAS, and SPSS STATA 9.0 SE SAS 9.1 SPSS 13.0 .sktest; .swilk; UNIVARIATE EXAMINE Normality Test Equal Variance Nonparametric T-test ANOVA GLM* .sfrancia .oneway .ksmirnov; .kwallis .ttest .anova; .oneway TTEST NPAR1WAY TTEST; MEANS ANOVA GLM; MIXED T-TEST NPAR TESTS T-TEST ONEWAY UNIANOVA * The STATA .glm command is not used for the T test, but for the generalized linear model. 1.3 Data Arrangement There are two types of data arrangement for t-tests (Figure 1). The first data arrangement has a variable to be tested and a grouping variable to classify groups (0 or 1). The second, appropriate especially for paired samples, has two variables to be tested. The two variables in this type are not, however, necessarily paired nor balanced. SAS and SPSS prefer the first data arrangement, whereas STATA can handle either type flexibly. Note that the numbers of observations across groups are not necessarily equal. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Variable x x … y y … Comparing Group Means: 3 Figure 1. Two Types of Data Arrangement Group Variable1 0 0 … 1 1 … x x … Variable2 y y … The data set used here is adopted from J. F. Fraumeni’s study on cigarette smoking and cancer (Fraumeni 1968). The data are per capita numbers of cigarettes sold by 43 states and the District of Columbia in 1960 together with death rates per hundred thousand people from various forms of cancer. Two variables were added to categorize states into two groups. See the appendix for the details. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 4 2. Univariate Samples The univariate-sample or one-sample t-test determines whether an unknown population mean µ differs from a hypothesized value c that is commonly set to zero: H 0 : µ = c . The t statistic y−c follows Student’s T probability distribution with n-1 degrees of freedom, t = ~ t ( n − 1) , sy where y is a variable to be tested and n is the number of observations.1 Suppose you want to test if the population mean of the death rates from lung cancer is 20 per 100,000 people at the .01 significance level. Note the default significance level used in most software is the .05 level. 2.1 T-test in STATA The .ttest command conducts t-tests in an easy and flexible manner. For a univariate sample test, the command requires that a hypothesized value be explicitly specified. The level() option indicates the confidence level as a percentage. The 99 percent confidence level is equivalent to the .01 significance level. . ttest lung=20, level(99) One-sample t test -----------------------------------------------------------------------------Variable | Obs Mean Std. Err. Std. Dev. [99% Conf. Interval] ---------+-------------------------------------------------------------------lung | 44 19.65318 .6374133 4.228122 17.93529 21.37108 -----------------------------------------------------------------------------mean = mean(lung) t = -0.5441 Ho: mean = 20 degrees of freedom = 43 Ha: mean < 20 Pr(T < t) = 0.2946 Ha: mean != 20 Pr(|T| > |t|) = 0.5892 Ha: mean > 20 Pr(T > t) = 0.7054 STATA first lists descriptive statistics of the variable lung. The mean and standard deviation of the 44 observations are 19.653 and 4.228, respectively. The t statistic is -.544 = (19.653-20) / .6374. Finally, the degrees of freedom are 43 =44-1. 
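The pieces of this output can be reproduced by hand from the reported mean, standard deviation, and standard error. A minimal sanity check using STATA's display command and invttail() function (values taken from the output above):

. display 4.228122/sqrt(44)                          // standard error .6374
. display (19.65318 - 20)/.6374133                   // t statistic -0.5441
. display 19.65318 - invttail(43, .005)*.6374133     // lower 99% confidence limit 17.935
. display 19.65318 + invttail(43, .005)*.6374133     // upper 99% confidence limit 21.371

Here invttail(43, .005) returns the critical value that puts .005 in the upper tail of the t distribution with 43 degrees of freedom, which is what a two-tailed test at the .01 level requires.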
There are three t-tests at the bottom of the output above. The first and third are one-tailed tests, whereas the second is a two-tailed test. The t statistic -.544 and its large p-value do not reject the null hypothesis that the population mean of the death rate from lung cancer is 20 at the .01 level. The mean death rate may well be 20 per 100,000 people. Note that the hypothesized value 20 falls within the 99 percent confidence interval of 17.935-21.371.2

1 $\bar{y} = \frac{\sum y_i}{n}$, $s^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$, and the standard error is $s_{\bar{y}} = \frac{s}{\sqrt{n}}$.
2 The 99 percent confidence interval of the mean is $\bar{y} \pm t_{\alpha/2} s_{\bar{y}} = 19.653 \pm 2.695 \times .6374$, where 2.695 is the critical value with 43 degrees of freedom at the .01 level in a two-tailed test.

If you just have the aggregate data (i.e., the number of observations, mean, and standard deviation of the sample), use the .ttesti command to replicate the t-test above. Note that the hypothesized value is specified at the end of the summary statistics.

. ttesti 44 19.65318 4.228122 20, level(99)

2.2 T-test Using the SAS TTEST Procedure

The TTEST procedure conducts various types of t-tests in SAS. The H0 option specifies a hypothesized value, whereas the ALPHA option indicates a significance level. If omitted, the default values of zero and .05, respectively, are assumed.

PROC TTEST H0=20 ALPHA=.01 DATA=masil.smoking;
   VAR lung;
RUN;

The TTEST Procedure

                                  Statistics

             Lower CL            Upper CL  Lower CL            Upper CL
Variable   N     Mean      Mean      Mean   Std Dev   Std Dev   Std Dev   Std Err
lung      44   17.935    19.653    21.371    3.2994    4.2281    5.7989    0.6374

                                   T-Tests

Variable      DF    t Value    Pr > |t|
lung          43      -0.54      0.5892

The TTEST procedure reports descriptive statistics followed by a two-tailed t-test. You may have a summary data set containing the values of a variable (lung) and their frequencies (count). The FREQ statement of the TTEST procedure provides the solution for this case.

PROC TTEST H0=20 ALPHA=.01 DATA=masil.smoking;
   VAR lung;
   FREQ count;
RUN;

2.3 T-test Using the SAS UNIVARIATE and MEANS Procedures

The SAS UNIVARIATE and MEANS procedures also conduct a t-test for a univariate sample. The UNIVARIATE procedure is basically designed to produce a variety of descriptive statistics of a variable. Its MU0 option tells the procedure to perform a t-test using the hypothesized value specified. The VARDEF=DF option specifies the divisor (degrees of freedom) used in computing the variance (standard deviation).3 The NORMAL option examines whether the variable is normally distributed.

PROC UNIVARIATE MU0=20 VARDEF=DF NORMAL ALPHA=.01 DATA=masil.smoking;
   VAR lung;
RUN;

The UNIVARIATE Procedure
Variable: lung

                              Moments

N                          44    Sum Weights                44
Mean               19.6531818    Sum Observations       864.74
Std Deviation      4.22812167    Variance           17.8770129
Skewness            -0.104796    Kurtosis            -0.949602
Uncorrected SS      17763.604    Corrected SS       768.711555
Coeff Variation    21.5136751    Std Error Mean     0.63741333

Basic Statistical Measures Location Mean Median Mode Variability 19.65318 20.32000 .
Std Deviation Variance Range Interquartile Range 4.22812 17.87701 15.26000 6.53000 Tests for Location: Mu0=20 Test -Statistic- -----p Value------ Student's t Sign Signed Rank t M S Pr > |t| Pr >= |M| Pr >= |S| -0.5441 1 -36.5 0.5892 0.8804 0.6752 Tests for Normality Test --Statistic--- -----p Value------ Shapiro-Wilk Kolmogorov-Smirnov Cramer-von Mises Anderson-Darling W D W-Sq A-Sq Pr Pr Pr Pr 0.967845 0.086184 0.063737 0.382105 < > > > W D W-Sq A-Sq 0.2535 >0.1500 >0.2500 >0.2500 Quantiles (Definition 5) 3 Quantile Estimate 100% Max 27.270 The VARDEF=N uses N as a divisor, while VARDEF=WDF specifies the sum of weights minus one. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 7 99% 95% 90% 75% Q3 50% Median 25% Q1 27.270 25.950 25.450 22.815 20.320 16.285 Quantiles (Definition 5) Quantile 10% 5% 1% 0% Min Estimate 14.110 12.120 12.010 12.010 Extreme Observations -----Lowest---- ----Highest---- Value Obs Value Obs 12.01 12.11 12.12 13.58 14.11 39 33 30 10 36 25.45 25.88 25.95 26.48 27.27 16 1 27 18 8 The third block of the output above reports a t statistic and its p-value. The fourth block contains several statistics of normality test. Since N is less than 2,000, you should read the Shapiro-Wilk W, which suggests that lung is normally distributed (p<.2535) The MEANS procedure also conducts t-tests using the T and PROBT options that request the t statistic and its two-tailed p-value. The CLM option produces the two-tailed confidence interval (or upper and lower limits). The MEAN, STD, and STDERR respectively print the sample mean, standard deviation, and standard error. PROC MEANS MEAN STD STDERR T PROBT CLM VARDEF=DF ALPHA=.01 DATA=masil.smoking; VAR lung; RUN; The MEANS Procedure Analysis Variable : lung Lower 99% Upper 99% Mean Std Dev Std Error t Value Pr > |t| CL for Mean CL for Mean ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 19.6531818 4.2281217 0.6374133 30.83 <.0001 17.9352878 21.3710758 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 8 The MEANS procedure does not, however, have an option to specify a hypothesized value to anything other than zero. Thus, the null hypothesis here is that the population mean of death rate from lung cancer is zero. The t statistic 30.83 is (19.6532-0)/.6374. The large t statistic and small p-value reject the null hypothesis, reporting a consistent conclusion. 2.4 T-test in SPSS The SPSS has the T-TEST command for t-tests. The /TESTVAL subcommand specifies the value with which the sample mean is compared, whereas the /VARIABLES list the variables to be tested. Like STATA, SPSS specifies a confidence level rather than a significance level in the /CRITERIA=CI() subcommand. T-TEST /TESTVAL = 20 /VARIABLES = lung /MISSING = ANALYSIS /CRITERIA = CI(.99) . http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 9 3. Paired (Dependent) Samples When two variables are not independent, but paired, the difference of these two variables, d i = y1i − y 2i , is treated as if it were a single sample. This test is appropriate for pre-post treatment responses. The null hypothesis is that the true mean difference of the two variables is D0, H 0 : µ d = D0 .4 The difference is typically assumed to be zero unless explicitly specified. 
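In notation, the paired test is just a one-sample t-test applied to the differences. Restating footnote 4, with $\bar{d}$ denoting the sample mean of the differences:

$$ t_d = \frac{\bar{d} - D_0}{s_{\bar{d}}} \sim t(n-1), \qquad \bar{d} = \frac{\sum d_i}{n}, \quad s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n-1}, \quad s_{\bar{d}} = \frac{s_d}{\sqrt{n}} $$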
3.1 T-test in STATA In order to conduct a paired sample t-test, you need to list two variables separated by an equal sign. The interpretation of the t-test remains almost unchanged. The -1.871 = (-10.16670)/5.4337 at 35 degrees of freedom does not reject the null hypothesis that the difference is zero. . ttest pre=post0, level(95) Paired t test -----------------------------------------------------------------------------Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------pre | 36 176.0278 6.529723 39.17834 162.7717 189.2838 post0 | 36 186.1944 7.826777 46.96066 170.3052 202.0836 ---------+-------------------------------------------------------------------diff | 36 -10.16667 5.433655 32.60193 -21.19757 .8642387 -----------------------------------------------------------------------------mean(diff) = mean(pre – post0) t = -1.8711 Ho: mean(diff) = 0 degrees of freedom = 35 Ha: mean(diff) < 0 Pr(T < t) = 0.0349 Ha: mean(diff) != 0 Pr(|T| > |t|) = 0.0697 Ha: mean(diff) > 0 Pr(T > t) = 0.9651 Alternatively, you may first compute the difference between the two variables, and then conduct one-sample t-test. Note that the default confidence level, level(95), can be omitted. . gen d=pre–post0 . ttest d=0 3.2 T-test in SAS In the TTEST procedure, you have to use the PAIRED instead of the VAR statement. For the output of the following procedure, refer to the end of this section. PROC TTEST DATA=temp.drug; PAIRED pre*post0; RUN; 4 d − D0 td = ~ t ( n − 1) , where d = sd http://www.indiana.edu/~statmath ∑d n i , sd = 2 ∑ (d − d )2 s , and s d = d n −1 n i © 2003-2005, The Trustees of Indiana University Comparing Group Means: 10 The PAIRED statement provides various ways of comparing variables using asterisk (*) and colon (:) operators. The asterisk requests comparisons between each variable on the left with each variable on the right. The colon requests comparisons between the first variable on the left and the first on the right, the second on the left and the second on the right, and so forth. Consider the following examples. PROC TTEST; PAIRED pro: post0; PAIRED (a b)*(c d); /* Equivalent to PAIRED a*c a*d b*c b*d; */ PAIRED (a b):(c d); /* Equivalent to PAIRED a*c b*c; */ PAIRED (a1-a10)*(b1-b10); RUN; The first PAIRED statement is the same as the PAIRED pre*post0. The second and the third PAIRED statements contrast differences between asterisk and colon operators. The hyphen (–) operator in the last statement indicates a1 through a10 and b1 through b10. Let us consider an example of the PAIRED statement. PROC TTEST DATA=temp.drug; PAIRED (pre)*(post0-post1); RUN; The TTEST Procedure Statistics N Lower CL Mean Mean Upper CL Mean Lower CL Std Dev Std Dev Upper CL Std Dev Std Err 36 36 -21.2 -30.43 -10.17 -20.39 0.8642 -10.34 26.443 24.077 32.602 29.685 42.527 38.723 5.4337 4.9475 Difference pre - post0 pre - post1 T-Tests Difference DF t Value Pr > |t| pre - post0 pre - post1 35 35 -1.87 -4.12 0.0697 0.0002 The first t statistic for pre versus post0 is identical to that of the previous section. The second for pre versus post1 rejects the null hypothesis of no mean difference at the .01 level (p<.0002). In order to use the UNIVARIATE and MEANS procedures, the difference between two paired variables should be computed in advance. 
DATA temp.drug2; SET temp.drug; d1 = pre - post0; d2 = pre - post1; RUN; http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University PROC UNIVARIATE MU0=0 VARDEF=DF NORMAL; VAR d1 d2; RUN; PROC MEANS MEAN STD STDERR T PROBT CLM; VAR d1 d2; RUN; PROC TTEST ALPHA=.05; VAR d1 d2; RUN; 3.3 T-test in SPSS In SPSS, the PAIRS subcommand indicates a paired sample t-test. T-TEST PAIRS = pre post0 /CRITERIA = CI(.95) /MISSING = ANALYSIS . http://www.indiana.edu/~statmath Comparing Group Means: 11 © 2003-2005, The Trustees of Indiana University Comparing Group Means: 12 4. Independent Samples with Equal Variances You should check three assumptions first when testing the mean difference of two independent samples. First, the samples are drawn from normally distributed populations with unknown parameters. Second, the two samples are independent in the sense that they are drawn from different populations and/or the elements of one sample are not related to those of the other sample. Finally, the population variances of the two groups, σ 12 and σ 22 are equal.5 If any one of assumption is violated, the t-test is not valid. An example here is to compare mean death rates from lung cancer between smokers and nonsmokers. Let us begin with discussing the equal variance assumption. 4.1 F test for Equal Variances The folded form F test is widely used to examine whether two populations have the same s2 variance. The statistic is L2 ~ F (n L − 1, n S − 1) , where L and S respectively indicate groups sS with larger and smaller sample variances. Unless the null hypothesis of equal variances is rejected, the pooled variance estimate s 2pool is used. The null hypothesis of the independent sample t-test is H 0 : µ1 − µ 2 = D0 . t= ( y1 − y 2 ) − D0 s pool s 2 pool 1 1 + n1 n2 ∑(y = 1i ~ t (n1 + n2 − 2) , where − y1 ) 2 + ∑ ( y 2 j − y 2 ) 2 n1 + n2 − 2 (n1 − 1) s12 + (n 2 − 1) s 22 . = n1 + n2 − 2 When the assumption is violated, the t-test requires the approximations of the degree of freedom. The null hypothesis and other components of the t-test, however, remain unchanged. Satterthwaite’s approximation for the degree of freedom is commonly used. Note that the approximation is a real number, not an integer. y − y 2 − D0 t' = 1 ~ t (df Satterthwaite ) , where s12 s 22 + n1 n 2 df Satterthwaite = 5 (n1 − 1)(n2 − 1) s12 n1 c = and s12 n1 + s22 n2 (n1 − 1)(1 − c) 2 + (n2 − 1)c 2 E ( x1 − x 2 ) = µ1 − µ 2 , Var ( x1 − x 2 ) = http://www.indiana.edu/~statmath σ 12 n1 + σ 22 ⎛1 1 ⎞ = σ 2 ⎜⎜ + ⎟⎟ n2 ⎝ n1 n 2 ⎠ © 2003-2005, The Trustees of Indiana University Comparing Group Means: 13 The SAS TTEST procedure and SPSS T-TEST command conduct F tests for equal variance. SAS reports the folded form F statistic, whereas SPSS computes Levene's weighted F statistic. In STATA, the .oneway command produces Bartlett’s statistic for the equal variance test. The following is an example of Bartlett's test that does not reject the null hypothesis of equal variance. . oneway lung smoke Analysis of Variance Source SS df MS F Prob > F -----------------------------------------------------------------------Between groups 313.031127 1 313.031127 28.85 0.0000 Within groups 455.680427 42 10.849534 -----------------------------------------------------------------------Total 768.711555 43 17.8770129 Bartlett's test for equal variances: chi2(1) = 0.1216 Prob>chi2 = 0.727 STATA, SAS, and SPSS all compute Satterthwaite’s approximation of the degrees of freedom. 
In addition, the SAS TTEST procedure reports Cochran-Cox approximation and the STATA .ttest command provides Welch’s degrees of freedom. 4.2 T-test in STATA With the .ttest command, you have to specify a grouping variable smoke in this example in the parenthesis of the by option. . ttest lung, by(smoke) level(95) Two-sample t test with equal variances -----------------------------------------------------------------------------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------0 | 22 16.98591 .6747158 3.164698 15.58276 18.38906 1 | 22 22.32045 .7287523 3.418151 20.80493 23.83598 ---------+-------------------------------------------------------------------combined | 44 19.65318 .6374133 4.228122 18.36772 20.93865 ---------+-------------------------------------------------------------------diff | -5.334545 .9931371 -7.338777 -3.330314 -----------------------------------------------------------------------------diff = mean(0) - mean(1) t = -5.3714 Ho: diff = 0 degrees of freedom = 42 Ha: diff < 0 Pr(T < t) = 0.0000 Ha: diff != 0 Pr(|T| > |t|) = 0.0000 Ha: diff > 0 Pr(T > t) = 1.0000 sL2 3.41822 Let us first check the equal variance. The F statistic is 1.17 = 2 = ~ F (21,21) . The sS 3.1647 2 degrees of freedom of the numerator and denominator are 21 (=22-1). The p-value of .7273, virtually the same as that of Bartlett’s test above, does not reject the null hypothesis of equal variance. Thus, the t-test here is valid (t=-5.3714 and p<.0000). http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University t= Comparing Group Means: 14 (16.9859 − 22.3205) − 0 s 2pool = −5.3714 ~ t (22 + 22 − 2) , where 1 1 s pool + 22 22 (22 − 1)3.1647 2 + (22 − 1)3.4182 2 = = 10.8497 22 + 22 − 2 If only aggregate data of the two variables are available, use the .ttesti command and list the number of observations, mean, and standard deviation of the two variables. . ttesti 22 16.85591 3.164698 22 22.32045 3.418151, level(95) Suppose a data set is differently arranged (second type in Figure 1) so that one variable smk_lung has data for smokers and the other non_lung for non-smokers. You have to use the unpaired option to indicate that two variables are not paired. A grouping variable here is not necessary. Compare the following output with what is printed above. . ttest smk_lung=non_lung, unpaired Two-sample t test with equal variances -----------------------------------------------------------------------------Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------smk_lung | 22 22.32045 .7287523 3.418151 20.80493 23.83598 non_lung | 22 16.98591 .6747158 3.164698 15.58276 18.38906 ---------+-------------------------------------------------------------------combined | 44 19.65318 .6374133 4.228122 18.36772 20.93865 ---------+-------------------------------------------------------------------diff | 5.334545 .9931371 3.330313 7.338777 -----------------------------------------------------------------------------diff = mean(smk_lung) - mean(non_lung) t = 5.3714 Ho: diff = 0 degrees of freedom = 42 Ha: diff < 0 Pr(T < t) = 1.0000 Ha: diff != 0 Pr(|T| > |t|) = 0.0000 Ha: diff > 0 Pr(T > t) = 0.0000 This unpaired option is very useful since it enables you to conduct a t-test without additional data manipulation. 
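These equal-variance results can be reproduced by hand from the group statistics reported in 4.2. A minimal sketch using STATA's display command and Ftail() function (the folded F p-value doubles the upper-tail probability because the larger variance is always placed in the numerator):

. display 2*Ftail(21, 21, (3.418151/3.164698)^2)                    // folded F p-value .7273
. display ((22-1)*3.164698^2 + (22-1)*3.418151^2)/(22 + 22 - 2)     // pooled variance 10.8495
. display (16.98591 - 22.32045)/sqrt(10.849534*(1/22 + 1/22))       // t statistic -5.3714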
You may run the .ttest command with the unpaired option to compare two variables, say leukemia and kidney, as independent samples in STATA. In SAS and SPSS, however, you have to stack up two variables and generate a grouping variable before ttests. . ttest leukemia=kidney, unpaired Two-sample t test with equal variances -----------------------------------------------------------------------------Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------leukemia | 44 6.829773 .0962211 .6382589 6.635724 7.023821 kidney | 44 2.794545 .0782542 .5190799 2.636731 2.95236 ---------+-------------------------------------------------------------------combined | 88 4.812159 .2249261 2.109994 4.365094 5.259224 ---------+-------------------------------------------------------------------diff | 4.035227 .1240251 3.788673 4.281781 ------------------------------------------------------------------------------ http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 15 diff = mean(leukemia) - mean(kidney) Ho: diff = 0 Ha: diff < 0 Pr(T < t) = 1.0000 t = degrees of freedom = Ha: diff != 0 Pr(|T| > |t|) = 0.0000 32.5356 86 Ha: diff > 0 Pr(T > t) = 0.0000 The F 1.5119 = (.6532589^2)/(.5190799^2) and its p-value (=.1797) do not reject the null hypothesis of equal variance. The large t statistic 32.5356 rejects the null hypothesis that death rates from leukemia and kidney cancers have the same mean. 4.3 T-test in SAS The TTEST procedure by default examines the hypothesis of equal variances, and provides T statistics for either case. The procedure by default reports Satterthwaite’s approximation for the degrees of freedom. Keep in mind that a variable to be tested is grouped by the variable that is specified in the CLASS statement. PROC TTEST H0=0 ALPHA=.05 DATA=masil.smoking; CLASS smoke; VAR lung; RUN; The TTEST Procedure Statistics Variable lung lung lung smoke N 0 1 22 22 Diff (1-2) Lower CL Mean Mean Upper CL Mean Lower CL Std Dev Std Dev Upper CL Std Dev 15.583 20.805 -7.339 16.986 22.32 -5.335 18.389 23.836 -3.33 2.4348 2.6298 2.7159 3.1647 3.4182 3.2939 4.5226 4.8848 4.1865 Statistics Variable lung lung lung smoke 0 1 Diff (1-2) Std Err Minimum Maximum 0.6747 0.7288 0.9931 12.01 12.11 25.45 27.27 T-Tests Variable Method Variances lung lung Pooled Satterthwaite Equal Unequal DF t Value Pr > |t| 42 41.8 -5.37 -5.37 <.0001 <.0001 Equality of Variances Variable Method http://www.indiana.edu/~statmath Num DF Den DF F Value Pr > F © 2003-2005, The Trustees of Indiana University lung Folded F Comparing Group Means: 16 21 21 1.17 0.7273 The F test for equal variance does not reject the null hypothesis of equal variances. Thus, the ttest labeled as “Pooled” should be referred to in order to get the t -5.37 and its p-value .0001. If the equal variance assumption is violated, the statistics of “Satterthwaite” and “Cochran” should be read. If you have a summary data set with the values of variables (lung) and their frequency (count), specify the count variable in the FREQ statement. PROC TTEST DATA=masil.smoking; CLASS smoke; VAR lung; FREQ count; RUN; Now, let us compare the death rates from leukemia and kidney in the second data arrangement type of Figure 1. As mentioned before, you need to rearrange the data set to stack up two variables into one and generate a grouping variable (first type in Figure 1). 
DATA masil.smoking2; SET masil.smoking; death = leukemia; leu_kid ='Leukemia'; OUTPUT; death = kidney; leu_kid ='Kidney'; OUTPUT; KEEP leu_kid death; RUN; PROC TTEST COCHRAN DATA=masil.smoking2; CLASS leu_kid; VAR death; RUN; The TTEST Procedure Statistics Variable leu_kid N death death death Kidney Leukemia Diff (1-2) 44 44 Lower CL Mean Mean Upper CL Mean Lower CL Std Dev Std Dev Upper CL Std Dev Std Err 2.6367 6.6357 -4.282 2.7945 6.8298 -4.035 2.9524 7.0238 -3.789 0.4289 0.5273 0.5063 0.5191 0.6383 0.5817 0.6577 0.8087 0.6838 0.0783 0.0962 0.124 T-Tests Variable Method Variances death death death Pooled Satterthwaite Cochran Equal Unequal Unequal DF t Value Pr > |t| 86 82.6 43 -32.54 -32.54 -32.54 <.0001 <.0001 <.0001 Equality of Variances Variable http://www.indiana.edu/~statmath Method Num DF Den DF F Value Pr > F © 2003-2005, The Trustees of Indiana University death Folded F Comparing Group Means: 17 43 43 1.51 0.1794 Compare this SAS output with that of STATA in the previous section. 4.4 T-test in SPSS In the T-TEST command, you need to use the /GROUP subcommand in order to specify a grouping variable. SPSS reports Levene's F .0000 that does not reject the null hypothesis of equal variance (p<.995). T-TEST GROUPS = smoke(0 1) /VARIABLES = lung /MISSING = ANALYSIS /CRITERIA = CI(.95) . http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 18 5. Independent Samples with Unequal Variances If the assumption of equal variances is violated, we have to compute the adjusted t statistic using individual sample standard deviations rather than a pooled standard deviation. It is also necessary to use the Satterthwaite, Cochran-Cox (SAS), or Welch (STATA) approximations of the degrees of freedom. In this chapter, you compare mean death rates from kidney cancer between the west (south) and east (north). 5.1 T-test in STATA As discussed earlier, let us check equality of variances using the .oneway command. The tabulate option produces a table of summary statistics for the groups. . oneway kidney west, tabulate | Summary of kidney west | Mean Std. Dev. Freq. ------------+-----------------------------------0 | 3.006 .3001298 20 1 | 2.6183333 .59837219 24 ------------+-----------------------------------Total | 2.7945455 .51907993 44 Analysis of Variance Source SS df MS F Prob > F -----------------------------------------------------------------------Between groups 1.63947758 1 1.63947758 6.92 0.0118 Within groups 9.94661333 42 .236824127 -----------------------------------------------------------------------Total 11.5860909 43 .269443975 Bartlett's test for equal variances: chi2(1) = 8.6506 Prob>chi2 = 0.003 Bartlett’s chi-squared statistic rejects the null hypothesis of equal variance at the .01 level. It is appropriate to use the unequal option in the .ttest command, which calculates Satterthwaite’s approximation for the degrees of freedom. Unlike the SAS TTEST procedure, the .ttest command cannot specify the mean difference D0 other than zero. Thus, the null hypothesis is that the mean difference is zero. . ttest kidney, by(west) unequal level(95) Two-sample t test with unequal variances -----------------------------------------------------------------------------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. 
Interval] ---------+-------------------------------------------------------------------0 | 20 3.006 .0671111 .3001298 2.865535 3.146465 1 | 24 2.618333 .1221422 .5983722 2.365663 2.871004 ---------+-------------------------------------------------------------------combined | 44 2.794545 .0782542 .5190799 2.636731 2.95236 ---------+-------------------------------------------------------------------diff | .3876667 .139365 .1047722 .6705611 ------------------------------------------------------------------------------ http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University diff = mean(0) - mean(1) Ho: diff = 0 Ha: diff < 0 Pr(T < t) = 0.9957 Comparing Group Means: 19 t = Satterthwaite's degrees of freedom = Ha: diff != 0 Pr(|T| > |t|) = 0.0086 2.7817 35.1098 Ha: diff > 0 Pr(T > t) = 0.0043 See Satterthwaite’s approximation of 35.110 in the middle of the output. If you want to get Welch’s approximation, use the welch as well as unequal options; without the unequal option, the welch is ignored. . ttest kidney, by(west) unequal welch Two-sample t test with unequal variances -----------------------------------------------------------------------------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------0 | 20 3.006 .0671111 .3001298 2.865535 3.146465 1 | 24 2.618333 .1221422 .5983722 2.365663 2.871004 ---------+-------------------------------------------------------------------combined | 44 2.794545 .0782542 .5190799 2.636731 2.95236 ---------+-------------------------------------------------------------------diff | .3876667 .139365 .1050824 .6702509 -----------------------------------------------------------------------------diff = mean(0) - mean(1) t = 2.7817 Ho: diff = 0 Welch's degrees of freedom = 36.2258 Ha: diff < 0 Pr(T < t) = 0.9957 Ha: diff != 0 Pr(|T| > |t|) = 0.0085 Ha: diff > 0 Pr(T > t) = 0.0043 Satterthwaite’s approximation is slightly smaller than Welch’s 36.2258. Again, keep in mind that these approximations are not integers, but real numbers. The t statistic 2.7817 and its pvalue .0086 reject the null hypothesis of equal population means. The north and east have larger death rates from kidney cancer per 100 thousand people than the south and west. For aggregate data, use the .ttesti command with the necessary options. . ttesti 20 3.006 .3001298 24 2.618333 .5983722, unequal welch As mentioned earlier, the unpaired option of the .ttest command directly compares two variables without data manipulation. The option treats the two variables as independent of each other. The following is an example of the unpaired and unequal options. . ttest bladder=kidney, unpaired unequal welch Two-sample t test with unequal variances -----------------------------------------------------------------------------Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. 
Interval] ---------+-------------------------------------------------------------------bladder | 44 4.121136 .1454679 .9649249 3.827772 4.4145 kidney | 44 2.794545 .0782542 .5190799 2.636731 2.95236 ---------+-------------------------------------------------------------------combined | 88 3.457841 .1086268 1.019009 3.241933 3.673748 ---------+-------------------------------------------------------------------diff | 1.326591 .1651806 .9968919 1.65629 -----------------------------------------------------------------------------diff = mean(bladder) - mean(kidney) t = 8.0312 Ho: diff = 0 Welch's degrees of freedom = 67.0324 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Ha: diff < 0 Pr(T < t) = 1.0000 Comparing Group Means: 20 Ha: diff != 0 Pr(|T| > |t|) = 0.0000 Ha: diff > 0 Pr(T > t) = 0.0000 The F 3.4556 = (.9649249^2)/(.5190799^2) rejects the null hypothesis of equal variance (p<0001). If the welch option is omitted, Satterthwaite's degree of freedom 65.9643 will be produced instead. For aggregate data, again, use the .ttesti command without the unpaired option. . ttesti 44 4.121136 .9649249 44 2.794545 .5190799, unequal welch level(95) 5.2 T-test in SAS The TTEST procedure reports statistics for cases of both equal and unequal variance. You may add the COCHRAN option to compute Cochran-Cox approximations for the degree of freedom. PROC TTEST COCHRAN DATA=masil.smoking; CLASS west; VAR kidney; RUN; The TTEST Procedure Statistics Variable kidney kidney kidney s_west N 0 1 Lower CL Mean Mean Upper CL Mean Lower CL Std Dev Std Dev Upper CL Std Dev 2.8655 2.3657 0.0903 3.006 2.6183 0.3877 3.1465 2.871 0.685 0.2282 0.4651 0.4013 0.3001 0.5984 0.4866 0.4384 0.8394 0.6185 20 24 Diff (1-2) Statistics Variable kidney kidney kidney west 0 1 Diff (1-2) Std Err Minimum Maximum 0.0671 0.1221 0.1473 2.34 1.59 3.62 4.32 T-Tests Variable Method Variances kidney kidney kidney Pooled Satterthwaite Cochran Equal Unequal Unequal DF t Value Pr > |t| 42 35.1 . 2.63 2.78 2.78 0.0118 0.0086 0.0109 Equality of Variances Variable Method kidney Folded F http://www.indiana.edu/~statmath Num DF Den DF F Value Pr > F 23 19 3.97 0.0034 © 2003-2005, The Trustees of Indiana University Comparing Group Means: 21 F 3.9749 = (.5983722^2)/(.3001298^2) and p <.0034 reject the null hypothesis of equal variances. Thus, individual sample standard deviations need to be used to compute the adjusted t, and either Satterthwaite’s or the Cochran-Cox approximation should be used in computing the p-value. See the following computations. t' = 3.006 − 2.6183 = −2.78187 , .30012 .5984 2 + 20 24 2 s n .30012 20 c = 2 1 12 = = .2318 , and s1 n1 + s2 n2 .30012 20 + .5984 2 24 (n1 − 1)(n2 − 1) (20 − 1)(24 − 1) df Satterthwaite = = = 35.1071 2 2 (n1 − 1)(1 − c) + (n2 − 1)c (20 − 1)(1 − .2318) 2 + (24 − 1).2318 2 The t statistic 2.78 rejects the null hypothesis of no difference in mean death rates between the two regions (p<.0086). Now, let us compare death rates from bladder and kidney cancers using SAS. 
DATA masil.smoking3; SET masil.smoking; death = bladder; bla_kid ='Bladder'; OUTPUT; death = kidney; bla_kid ='Kidney'; OUTPUT; KEEP bla_kid death; RUN; PROC TTEST COCHRAN DATA=masil.smoking3; CLASS bla_kid; VAR death; RUN; The TTEST Procedure Statistics Variable bla_kid death death death Bladder Kidney Diff (1-2) N 44 44 Lower CL Mean Mean Upper CL Mean Lower CL Std Dev Std Dev Upper CL Std Dev Std Err 3.8278 2.6367 0.9982 4.1211 2.7945 1.3266 4.4145 2.9524 1.655 0.7972 0.4289 0.6743 0.9649 0.5191 0.7748 1.2226 0.6577 0.9107 0.1455 0.0783 0.1652 T-Tests Variable Method Variances DF t Value Pr > |t| death death death Pooled Satterthwaite Cochran Equal Unequal Unequal 86 66 43 8.03 8.03 8.03 <.0001 <.0001 <.0001 Equality of Variances http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Variable Method death Folded F Comparing Group Means: 22 Num DF Den DF F Value Pr > F 43 43 3.46 <.0001 Fortunately, the t-tests under equal and unequal variance in this case lead the same conclusion at the .01 level; that is, the means of the two death rates are not the same. 5.3 T-test in SPSS Like SAS, SPSS also reports t statistics for cases of both equal and unequal variance. Note that Levene's F 5.466 rejects the null hypothesis of equal variance at the .05 level (p<.024). T-TEST GROUPS = west(0 1) /VARIABLES = kidney /MISSING = ANALYSIS /CRITERIA = CI(.95) . http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 23 6. One-way ANOVA, GLM, and Regression The t-test is a special case of one-way ANOVA. Thus, one-way ANOVA produces equivalent results to those of the t-test. ANOVA examines mean differences using the F statistic, whereas the t-test reports the t statistic. The one-way ANOVA (t-test), GLM, and linear regression present essentially the same things in different ways. 6.1 One-way ANOVA Consider the following ANOVA procedure. The CLASS statement is used to specify categorical variables. The MODEL statement lists the variable to be compared and a grouping variable, separating them with an equal sign. PROC ANOVA DATA=masil.smoking; CLASS smoke; MODEL lung=smoke; RUN; The ANOVA Procedure Dependent Variable: lung Source DF Sum of Squares Model Error Corrected Total 1 42 43 313.0311273 455.6804273 768.7115545 Mean Square F Value Pr > F 313.0311273 10.8495340 28.85 <.0001 F Value 28.85 Pr > F <.0001 R-Square Coeff Var Root MSE lung Mean 0.407215 16.75995 3.293863 19.65318 Source smoke DF 1 Anova SS 313.0311273 Mean Square 313.0311273 STATA .anova and .oneway commands also conduct one-way ANOVA. . anova lung smoke Number of obs = 44 Root MSE = 3.29386 R-squared = Adj R-squared = 0.4072 0.3931 Source | Partial SS df MS F Prob > F -----------+---------------------------------------------------Model | 313.031127 1 313.031127 28.85 0.0000 | smoke | 313.031127 1 313.031127 28.85 0.0000 | Residual | 455.680427 42 10.849534 -----------+---------------------------------------------------Total | 768.711555 43 17.8770129 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 24 In SPSS, the ONEWAY command is used. ONEWAY lung BY smoke /MISSING ANALYSIS . 6.2 Generalized Linear Model (GLM) The SAS GLM and MIXED procedures and the SPSS UNIANOVA command also report the F statistic for one-way ANOVA. Note that STATA’s .glm command does not perform one-way ANOVA. 
PROC GLM DATA=masil.smoking; CLASS smoke; MODEL lung=smoke /SS3; RUN; The GLM Procedure Dependent Variable: lung Source DF Sum of Squares Model Error Corrected Total 1 42 43 313.0311273 455.6804273 768.7115545 Mean Square F Value Pr > F 313.0311273 10.8495340 28.85 <.0001 R-Square Coeff Var Root MSE lung Mean 0.407215 16.75995 3.293863 19.65318 Source smoke DF Type III SS Mean Square F Value Pr > F 1 313.0311273 313.0311273 28.85 <.0001 The MIXED procedure has the similar usage as the GLM procedure. The output here is skipped. PROC MIXED; CLASS smoke; MODEL lung=smoke; RUN; In SPSS, the UNIANOVA command estimates univariate ANOVA models using the GLM method. UNIANOVA lung BY smoke /METHOD = SSTYPE(3) /INTERCEPT = INCLUDE /CRITERIA = ALPHA(.05) /DESIGN = smoke . 6.3 Regression http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 25 The SAS REG procedure, STATA .regress command, and SPSS REGRESSION command estimate linear regression models. PROC REG DATA=masil.smoking; MODEL lung=smoke; RUN; The REG Procedure Model: MODEL1 Dependent Variable: lung Number of Observations Read Number of Observations Used 44 44 Analysis of Variance Source DF Sum of Squares Mean Square Model Error Corrected Total 1 42 43 313.03113 455.68043 768.71155 313.03113 10.84953 Root MSE Dependent Mean Coeff Var 3.29386 19.65318 16.75995 R-Square Adj R-Sq F Value Pr > F 28.85 <.0001 0.4072 0.3931 Parameter Estimates Variable Intercept smoke DF Parameter Estimate Standard Error t Value Pr > |t| 1 1 16.98591 5.33455 0.70225 0.99314 24.19 5.37 <.0001 <.0001 Look at the results above. The coefficient of the intercept 16.9859 is the mean of the first group (smoke=0). The coefficient of smoke is, in fact, mean difference between two groups with its sign reversed (5.33455=16.9859-22.3205). Finally, the standard error of the coefficient is the 1 1 1 1 , denominator of the independent sample t-test, .99314= s pool + = 3.2939 + n1 n2 22 22 where the pooled variance estimate 10.8497=3.2939^2 (see page 11 and 13). Thus, the t 5.37 is identical to the t statistic of the independent sample t-test with equal variance. The STATA .regress command is quite simple. Note that a dependent variable precedes a list of independent variables. . regress lung smoke Source | SS df MS -------------+------------------------------ http://www.indiana.edu/~statmath Number of obs = F( 1, 42) = 44 28.85 © 2003-2005, The Trustees of Indiana University Model | 313.031127 1 313.031127 Residual | 455.680427 42 10.849534 -------------+-----------------------------Total | 768.711555 43 17.8770129 Comparing Group Means: 26 Prob > F R-squared Adj R-squared Root MSE = = = = 0.0000 0.4072 0.3931 3.2939 -----------------------------------------------------------------------------lung | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------smoke | 5.334545 .9931371 5.37 0.000 3.330314 7.338777 _cons | 16.98591 .702254 24.19 0.000 15.5687 18.40311 ------------------------------------------------------------------------------ The SPSS REGRESSION command looks complicated compared to the SAS REG procedure and STATA .regress command. REGRESSION /MISSING LISTWISE /STATISTICS COEFF OUTS R ANOVA /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT lung /METHOD=ENTER smoke. Note that ANOVA, GLM, and regression report the same F (1, 42) 28.85, which is equivalent to t (42) -5.3714. 
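These identities can be verified by hand from the regression output above. A minimal sketch using STATA's display command (values taken from the PROC REG and .regress results):

. display 16.98591 + 5.334545              // intercept plus slope = 22.3205, the mean of the smoke=1 group
. display 3.293863*sqrt(1/22 + 1/22)       // Root MSE * sqrt(1/n1 + 1/n2) = .99314, the slope's standard error
. display (-5.3714)^2                      // the squared t statistic = 28.85, the reported F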
As long as the degrees of freedom of the numerator is 1, F is always t^2 (28.85=-5.3714^2). http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 27 7. Conclusion The t-test is a basic statistical method for examining the mean difference between two groups. One-way ANOVA can compare means of more than two groups. The number of observations in individual groups does not matter in the t-test or one-way ANOVA; both balanced and unbalanced data are fine. One-way ANOVA, GLM, and linear regression models all use the variance-covariance structure in their analysis, but present the results in different ways. Researchers must check four issues when performing t-tests. First, a variable to be tested should be interval or ratio so that its mean is substantively meaningful. Do not, for example, run a t-test to compare the mean of skin colors (white=0, yellow=1, black=2) between two countries. If you have a latent variable measured by several Likert-scaled manifest variables, first run a factor analysis to get that latent variable. Second, examine the normality assumptions before conducting a t-test. It is awkward to compare means of variables that are not normally distributed. Figure 2 illustrates a normal probability distribution on top and a Poisson distribution skewed to the right on the bottom. Although the two distributions have the same mean and variance of 1, they are not likely to be substantively interpretable. This is the rationale to conduct normality test such as Shapiro-Wilk W, Shapiro-Francia W, and Kolmogorov-Smirnov D statistics. If the normality assumption is violated, try to use nonparametric methods. Figure 2. Comparing Normal and Poisson Probability Distributions ( σ 2 =1 and µ =1) http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 28 Third, check the equal variance assumption. You should be careful when comparing means of normally distributed variables with different variances. You may conduct the folded form F test. If the equal variance assumption is violated, compute the adjusted t and approximations of the degree of freedom. Finally, consider the types of t-tests, data arrangement, and functionalities available in each statistical software (e.g., STATA, SAS, and SPSS) to determine the best strategy for data analysis (Table 3). The first data arrangement in Figure 1 is commonly used for independent sample t-tests, whereas the second arrangement is appropriate for a paired sample test. Keep in mind that the type II data sets in Figure 1 needs to be reshaped into type I in SAS and SPSS. Table 3. Comparison of T-test Functionalities of STATA, SAS and SPSS STATA 9.0 SAS 9.1 SPSS 13.0 Test for equal variance Approximation of the degrees of freedom (DF) Second Data Arrangement Aggregate Data Bartlett’s chi-squared (.ttest command) Satterthwaite’s DF Welch’s DF var1=var2 .ttesti command Folded form F (TTEST procedure) Satterthwaite’s DF Cochran-Cox DF Reshaping the data set FREQ option Levene’s weighted F (T-TEST command) Satterthwaite’s DF Reshaping the data set N/A SAS has several procedures (e.g., TTEST, MEANS, and UNIVARIATE) and useful options for t-tests. The STATA .ttest and .ttesti commands provide very flexible ways of handling different data arrangements and aggregate data. Table 4 summarizes usages of options in these two commands. Table 4. 
Summary of the Usages of the .ttest and .ttest Command Options by(group var) unequal welch Usage var=c Univariate sample var1=var2 Paired (dependent) sample Var O Equal variance (1 variable) ** var1=var2 Equal variance (2 variables) Var O O O Unequal variance (1 variable) var1=var2 O O Unequal variance (2 variables) * The .ttesti command does not allow the unpaired option. ** The “var1=var2” assumes second type of data arrangement in Figure 1. http://www.indiana.edu/~statmath * unpaired O O © 2003-2005, The Trustees of Indiana University Comparing Group Means: 29 Appendix: Data Set Literature: Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States," Journal of the National Cancer Institute, 41(5): 1205-1211. Data Source: http://lib.stat.cmu.edu The data are per capita numbers of cigarettes smoked (sold) by 43 states and the District of Columbia in 1960 together with death rates per 100 thousand people from various forms of cancer. The variables used in this document are, cigar = number of cigarettes smoked (hds per capita) bladder = deaths per 100k people from bladder cancer lung = deaths per 100k people from lung cancer kidney = deaths per 100k people from kidney cancer leukemia = deaths per 100k people from leukemia smoke = 1 for those whose cigarette consumption is larger than the median and 0 otherwise. west = 1 for states in the South or West and 0 for those in the North, East or Midwest. The followings are summary statistics and normality tests of these variables. . sum cigar-leukemia Variable | Obs Mean Std. Dev. Min Max -------------+----------------------------------------------------cigar | 44 24.91409 5.573286 14 42.4 bladder | 44 4.121136 .9649249 2.86 6.54 lung | 44 19.65318 4.228122 12.01 27.27 kidney | 44 2.794545 .5190799 1.59 4.32 leukemia | 44 6.829773 .6382589 4.9 8.28 . sfrancia cigar-leukemia Shapiro-Francia W' test for normal data Variable | Obs W' V' z Prob>z -------------+------------------------------------------------cigar | 44 0.93061 3.258 2.203 0.01381 bladder | 44 0.94512 2.577 1.776 0.03789 lung | 44 0.97809 1.029 0.055 0.47823 kidney | 44 0.97732 1.065 0.120 0.45217 leukemia | 44 0.97269 1.282 0.474 0.31759 . tab west smoke | smoke west | 0 1 | Total -----------+----------------------+---------0 | 7 13 | 20 1 | 15 9 | 24 -----------+----------------------+---------Total | 22 22 | 44 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Comparing Group Means: 30 References Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States," Journal of the National Cancer Institute, 41(5): 12051211. Ott, R. Lyman. 1993. An Introduction to Statistical Methods and Data Analysis. Belmont, CA: Duxbury Press. SAS Institute. 2005. SAS/STAT User's Guide, Version 9.1. Cary, NC: SAS Institute. SPSS. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc. STATA Press. 2005. STATA Reference Manual Release 9. College Station, TX: STATA Press. Walker, Glenn A. 2002. Common Statistical Methods for Clinical Research with SAS Examples. Cary, NC: SAS Institute. Acknowledgements I am grateful to Jeremy Albright, Takuya Noguchi, and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing, Indiana University, who provided valuable comments and suggestions. Revision History • • • 2003. First draft 2004. Second draft 2005. Third draft (Added data arrangements and conclusion). 
http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 1 Regression Models for Event Count Data Using SAS, STATA, and LIMDEP Hun Myoung Park This document summarizes regression models for event count data and illustrates how to estimate individual models using SAS, STATA, and LIMDEP. Example models were tested in SAS 9.1, STATA 9.0, and LIMDEP 8.0. 1. 2. 3. 4. 5. 6. 7. Introduction The Poisson Regression Model (PRM) The Negative Binomial Regression Model (NBRM) The Zero-Inflated Poisson Regression Model (ZIP) The Zero-Inflated Negative Binomial Regression Model (ZINB) Conclusion Appendix 1. Introduction An event count is the realization of a nonnegative integer-valued random variable (Cameron and Trivedi 1998). Examples are the number of car accidents per month, thunder storms per year, and wild fires per year. The ordinary least squares (OLS) method for event count data results in biased, inefficient, and inconsistent estimates (Long 1997). Thus, researchers have developed various nonlinear models that are based on the Poisson distribution and negative binomial distribution. 1.1 Count Data Regression Models The left-hand side (LHS) of the equation has event count data. Independent variables are, as in the OLS, located at the right-hand side (RHS). These RHS variables may be interval, ratio, or binary (dummy). Table 1 below summarizes the categorical dependent variable regression models (CDVMs) according to the level of measurement of the dependent variable. Table 1. Ordinary Least Squares and CDVMs Model Dependent (LHS) OLS CDVMs Ordinary least squares Interval or ratio Binary response Binary (0 or 1) Ordinal response Nominal response Event count data Ordinal (1st, 2nd , 3rd…) Nominal (A, B, C …) Count (0, 1, 2, 3…) Method Moment based method Maximum likelihood method Independent (RHS) A linear function of interval/ratio or binary variables β 0 + β 1 X 1 + β 2 X 2 ... The Poisson regression model (PRM) and negative binomial regression model (NBRM) are basic models for count data analysis. Either the zero-inflated Poisson (ZIP) or the zero-inflated http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 2 negative binomial regression model (ZINB) is used when there are many zero counts. Other count models are developed to handle censored, truncated, or sample selected count data. This document, however, focuses on the PRM, NBRM, ZIP, and ZINB. 1.2 Poisson Models versus Negative Binomial Models e −µ µ y , has the same mean and variance y! (equidispersion), Var(y)=E(y)= µ . As the mean of a Poisson distribution increases, the probability of zeros decreases and the distribution approximates a normal distribution (Figure 1). The Poisson distribution also has the strong assumption that events are independent. Thus, this distribution does not fit well if µ differs across observations (heterogeneity) (Long 1997). The Poisson probability distribution, P( y | µ ) = The Poisson regression model (PRM) incorporates observed heterogeneity into the Poisson distribution function, Var ( y i | xi ) = E ( y i | xi ) = µ i = exp( xi β ) . As µ increases, the conditional variance of y increases, the proportion of predicted zeros decreases, and the distribution around the expected value becomes approximately normal (Long 1997). The conditional mean of the errors is zero, but the variance of the errors is a function of independent variables, Var (ε | x) = exp( xβ ) . 
The errors are heteroscedastic. Thus, the PRM rarely fits in practice due to overdispersion (Long 1997; Maddala 1983).

Figure 1. Poisson Probability Distribution with Means of .5, 1, 2, and 5

The negative binomial probability distribution is
$P(y_i \mid x_i) = \frac{\Gamma(y_i + v_i)}{y_i!\,\Gamma(v_i)} \left(\frac{v_i}{v_i + \mu_i}\right)^{v_i} \left(\frac{\mu_i}{v_i + \mu_i}\right)^{y_i}$,
where $1/v = \alpha$ determines the degree of dispersion and $\Gamma$ is the gamma function. As the dispersion parameter $\alpha$ increases, the variance of the negative binomial distribution also increases, $Var(y_i \mid x_i) = \mu_i (1 + \mu_i / v_i)$. The negative binomial regression model (NBRM) incorporates observed and unobserved heterogeneity into the conditional mean, $\mu_i = \exp(x_i \beta + \varepsilon_i)$ (Long 1997). Thus, the conditional variance of y becomes larger than its conditional mean, $E(y_i \mid x_i) = \mu_i$, which remains unchanged. Figure 2 illustrates how the probabilities of small and large counts increase in the negative binomial distribution as the conditional variance of y increases, given $\mu = 3$.

Figure 2. Negative Binomial Probability Distribution with Alpha of .01, .5, 1, and 5

The PRM and NBRM, however, have the same mean structure. If $\alpha = 0$, the NBRM reduces to the PRM (Cameron and Trivedi 1998; Long 1997).

1.3 Overdispersion

When $Var(y_i \mid x_i) > E(y_i \mid x_i)$, we are said to have overdispersion. Estimates of a PRM for overdispersed data are unbiased, but inefficient with standard errors biased downward (Cameron and Trivedi 1998; Long 1997). The likelihood ratio test examines the null hypothesis of no overdispersion, $H_0: \alpha = 0$. The likelihood ratio follows the chi-squared distribution with one degree of freedom, $LR = 2(\ln L_{NB} - \ln L_{Poisson}) \sim \chi^2(1)$. If the null hypothesis is rejected, the NBRM is preferred to the PRM.

Zero-inflated models handle overdispersion by changing the mean structure to explicitly model the production of zero counts (Long 1997). These models assume two latent groups. One is the always-zero group and the other is the not-always-zero or sometimes-zero group. Thus, zero counts come from the former group and, with a certain probability, from some of the latter group. The likelihood ratio $LR = 2(\ln L_{ZINB} - \ln L_{ZIP}) \sim \chi^2(1)$ tests $H_0: \alpha = 0$ to compare the ZIP and ZINB. The PRM and ZIP, as well as the NBRM and ZINB, cannot, however, be compared by this likelihood ratio, since they are not nested. Vuong's statistic compares these non-nested models. If V is greater than 1.96, the ZIP or ZINB is favored. If V is less than -1.96, the PRM or NBRM is preferred (Long 1997).

1.4 Estimation in SAS, STATA, and LIMDEP

The SAS GENMOD procedure estimates Poisson and negative binomial regression models. STATA has individual commands (e.g., .poisson and .nbreg) for the corresponding count data models. LIMDEP has Poisson$ and Negbin$ commands to estimate various count data models including zero-inflated and zero-truncated models. Table 2 summarizes the procedures and commands for count data regression models.

Table 2.
Comparison of the Procedures and Commands for Count Data Models Model SAS 9.1 STATA 9.0 LIMDEP 8.0 Poisson Regression (PRM) Negative Binomial Regression (NBRM) Zero-Inflated Poisson (ZIP) Zero-inflated Negative Binomial (ZINB) Zero-truncated Poisson (ZTP) Zero-truncated Negative Binomial (ZTNB) GENMOD GENMOD - .poisson .nbreg .zip .zinb .ztp .ztnb Poisson$ Negbin$ Poisson; Zip; Rh2$ Negbin; Zip; Rh2$ Poisson; Truncation$ Negbin; Truncation$ The example here examines how waste quotas (emps) and the strictness of policy implementation (strict) affect the frequency of waste spill accidents of plants (accident). 1. 5 Long and Freese’s SPost Module STATA users may take advantages of user-written modules such as SPost written by J. Scott Long and Jeremy Freese. The module allows researchers to conduct follow-up analyses of various CDVMs including event count data models. See 2.3 for examples of major SPost commands. In order to install SPost, execute the following commands consecutively. For more details, visit J. Scott Long’s Web site at http://www.indiana.edu/~jslsoc/spost_install.htm. . net from http://www.indiana.edu/~jslsoc/stata/ . net install spost9_ado, replace . net get spost9_do, replace http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 5 2. The Poisson Regression Model The SAS GENMOD procedure, STATA .poisson command, and LIMDEP Poisson$ command estimate the Poisson regression model (PRM). 2.1 PRM in SAS SAS has the GENMOD procedure for the PRM. The /DIST=POISSON option tells SAS to use the Poisson distribution. PROC GENMOD DATA = masil.accident; MODEL accident=emps strict /DIST=POISSON LINK=LOG; RUN; The GENMOD Procedure Model Information Data Set Distribution Link Function Dependent Variable Observations Used COUNT.WASTE Poisson Log Accident 778 Criteria For Assessing Goodness Of Fit Criterion Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X2 Log Likelihood DF Value Value/DF 775 775 775 775 2827.2079 2827.2079 4944.9473 4944.9473 -667.2291 3.6480 3.6480 6.3806 6.3806 Algorithm converged. Analysis Of Parameter Estimates Parameter DF Estimate Standard Error Intercept Emps Strict Scale 1 1 1 0 0.3901 0.0054 -0.7042 1.0000 0.0467 0.0007 0.0668 0.0000 Wald 95% Confidence Limits ChiSquare Pr > ChiSq 0.2986 0.0040 -0.8350 1.0000 69.84 53.13 111.25 <.0001 <.0001 <.0001 0.4816 0.0069 -0.5733 1.0000 NOTE: The scale parameter was held fixed. You will need to run a restricted model without regressors in order to conduct the likelihood ratio test for goodness-of-fit, LR = 2 * (ln LUnrestricted − ln LRe stricted ) ~ χ 2 ( J ) , where J is the difference in http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 6 the number of regressors between the unrestricted and restricted models. The chi-squared statistic is 124.8218 = 2* [-667.2291 - (-729.6400)] (p<.0000). PROC GENMOD DATA = masil.accident; MODEL accident= /DIST=POISSON LINK=LOG; RUN; The GENMOD Procedure Model Information Data Set Distribution Link Function Dependent Variable MASIL.ACCIDENT Poisson Log accident Number of Observations Read Number of Observations Used 778 778 Criteria For Assessing Goodness Of Fit Criterion Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X2 Log Likelihood DF Value Value/DF 777 777 777 777 2952.0297 2952.0297 4919.9745 4919.9745 -729.6400 3.7993 3.7993 6.3320 6.3320 Algorithm converged. 
Analysis Of Parameter Estimates Parameter DF Estimate Standard Error Intercept Scale 1 0 0.3168 1.0000 0.0306 0.0000 Wald 95% Confidence Limits 0.2568 1.0000 0.3768 1.0000 ChiSquare Pr > ChiSq 107.20 <.0001 NOTE: The scale parameter was held fixed. 2.2 PRM in STATA STATA has the .poisson command for the PRM. This command provides likelihood ratio and Pseudo R2 statistics. . poisson accident emps strict Iteration 0: Iteration 1: Iteration 2: log likelihood = -1821.5112 log likelihood = -1821.5101 log likelihood = -1821.5101 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 7 Poisson regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -1821.5101 = = = = 778 124.82 0.0000 0.0331 -----------------------------------------------------------------------------accident | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------emps | .0054186 .0007434 7.29 0.000 .0039615 .0068757 strict | -.7041664 .0667619 -10.55 0.000 -.8350174 -.5733154 _cons | .3900961 .0466787 8.36 0.000 .2986076 .4815846 ------------------------------------------------------------------------------ Let us run a restricted model and then run the .display command in order to double check that the likelihood ratio for goodness-of-fit is 124.8218. . poisson accident Iteration 0: Iteration 1: log likelihood = log likelihood = -1883.921 -1883.921 Poisson regression Log likelihood = -1883.921 Number of obs LR chi2(0) Prob > chi2 Pseudo R2 = = = = 778 0.00 . 0.0000 -----------------------------------------------------------------------------accident | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------_cons | .3168165 .0305995 10.35 0.000 .2568426 .3767904 -----------------------------------------------------------------------------. display 2 * (-1821.5101 - (-1883.921)) 124.8218 2.3 Using the SPost Module in STATA The SPost module provides useful commands for follow-up analyses of various categorical dependent variable models. The .fitstat command calculates various goodness-of-fit statistics such as log likelihood, McFadden’s R2 (or Pseudo R2), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). . quietly poisson accident emps strict . fitstat Measures of Fit for poisson of accident Log-Lik Intercept Only: D(775): -1883.921 3643.020 McFadden's R2: Maximum Likelihood R2: AIC: BIC: 0.033 0.148 4.690 -1515.943 http://www.indiana.edu/~statmath Log-Lik Full Model: LR(2): Prob > LR: McFadden's Adj R2: Cragg & Uhler's R2: AIC*n: BIC': -1821.510 124.822 0.000 0.032 0.149 3649.020 -111.508 © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 8 The .listcoef command lists unstandardized coefficients (parameter estimates), factor and percent changes, and standardized coefficients to help interpret regression results. . 
listcoef, help poisson (N=778): Factor Change in Expected Count Observed SD: 2.9482675 ---------------------------------------------------------------------accident | b z P>|z| e^b e^bStdX SDofX -------------+-------------------------------------------------------emps | 0.00542 7.289 0.000 1.0054 1.2297 38.1548 strict | -0.70417 -10.547 0.000 0.4945 0.7031 0.5003 ---------------------------------------------------------------------b = raw coefficient z = z-score for test of b=0 P>|z| = p-value for z-test e^b = exp(b) = factor change in expected count for unit increase in X e^bStdX = exp(b*SD of X) = change in expected count for SD increase in X SDofX = standard deviation of X The .prtab command constructs a table of predicted values (events) for all combinations of categorical variables listed. The following example shows that the predicted number of accidents under the strict policy is .9172 at the mean waste quota (emps=42.0129). . prtab strict poisson: Predicted rates for accident ---------------------strict | Prediction ----------+----------0 | 1.8547 1 | 0.9172 ---------------------- x= emps 42.012853 strict .50771208 The .prvalue lists predicted values for a given set of values for the independent variables. For example, the predicted probability of a zero count is .3996 at the mean waste quota under the strict policy (strict=1). Note that the predicted rate of .917 is equivalent to .9172 in the .prtab above. . prvalue, x(strict=1) maxcnt(5) poisson: Predictions for accident Predicted rate: .917 95% CI [.827 , Predicted probabilities: Pr(y=0|x): Pr(y=2|x): Pr(y=4|x): x= emps 42.012853 0.3996 0.1681 0.0118 Pr(y=1|x): Pr(y=3|x): Pr(y=5|x): strict 1 http://www.indiana.edu/~statmath 0.3665 0.0514 0.0022 1.02] © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 9 The most useful command is the .prchange that calculates marginal effects (changes) and discrete changes. For instance, a standard deviation increase in waste quota form its mean will increase accidents by .3841 under the lenient policy (strict=0). . prchange, x(strict=0) poisson: Changes in Predicted Rate for accident emps strict min->max 2.3070 -0.9375 exp(xb): x= sd(x)= 0->1 0.0080 -0.9375 -+1/2 0.0101 -1.3332 -+sd/2 0.3841 -0.6568 MargEfct 0.0101 -1.3060 1.8547 emps 42.0129 38.1548 strict 0 .500262 SPost also includes the .prgen command, which computes a series of predictions by holding all variables but one constant and allowing that variable to vary (Long and Freese 2003). These SPost commands work with most categorical and count data models such as .logit, .probit, .poisson, .nbreg, .zip, and .zinb. 2.4 PRM in LIMDEP The LIMDEP Poisson$ command estimates the PRM. LIMDEP reports log likelihoods of both the unrestricted and restricted models. Keep in mind that you must include the ONE for the intercept. 
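Before turning to LIMDEP, it is easy to verify by hand where the .prtab and .prvalue figures above come from. The two .display lines below are only a sketch that plugs the coefficients from the .poisson output in section 2.2 into exp(xb):

. display exp(.3900961 + .0054186*42.012853 - .7041664*1)
. display exp(-exp(.3900961 + .0054186*42.012853 - .7041664*1))

The first line reproduces the predicted rate of .9172 for strict=1 at the mean of emps, and the second reproduces the predicted probability of a zero count, .3996, reported by .prvalue.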
POISSON; Lhs=ACCIDENT; Rhs=ONE,EMPS,STRICT$ +---------------------------------------------+ | Poisson Regression | | Maximum Likelihood Estimates | | Model estimated: Aug 24, 2005 at 04:56:45PM.| | Dependent variable ACCIDENT | | Weighting variable None | | Number of observations 778 | | Iterations completed 8 | | Log likelihood function -1821.510 | | Restricted log likelihood -1883.921 | | Chi squared 124.8218 | | Degrees of freedom 2 | | Prob[ChiSqd > value] = .0000000 | | Chi- squared = 4944.94781 RsqP= -.0051 | | G - squared = 2827.20794 RsqD= .0423 | | Overdispersion tests: g=mu(i) : 4.720 | | Overdispersion tests: g=mu(i)^2: 4.253 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 10 Constant .3900961420 .46678663E-01 8.357 .0000 EMPS .5418599057E-02 .74341923E-03 7.289 .0000 STRICT -.7041663804 .66761926E-01 -10.547 .0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) 42.012853 .50771208 SAS, STATA, and LIMDEP produce almost the same parameter estimates and standard errors (Table 3). The log likelihood in SAS is different from that of STATA and LIMDEP (-667.291 versus -1821.5101). This difference seems to come from the generalized linear model that the GENMOD procedure uses. These log likelihoods are, however, equivalent in the sense that they result in the same likelihood ratio. Table 3. Summary of the Poisson Regression Model in SAS, STATA, and LIMDEP Model SAS 9.1 STATA 9.0 LIMDEP 8.0 Intercept EMPS STRICT Log Likelihood (unrestricted) Log Likelihood (restricted) Likelihood Ratio for Goodness-of-fit http://www.indiana.edu/~statmath .3901 (.0467) .0054 (.0007) -.7042 (.0668) -667.2291 -729.6400 124.8218 .3901 (.0467) .0054 (.0007) -.7042 (.0668) -1821.5101 -1883.921 124.82 .3901 (.0467) .0054 (.0007) -.7042 (.0668) -1821.510 -1883.921 124.8218 © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 11 3. The Negative Binomial Regression Model The SAS GENMODE procedure, STATA .nbreg command, and LIMDEP Negbin$ command estimate the negative binomial regression model (NBRM). 3.1 NBRM in SAS The GENMOD procedure estimates the NBRM using the /DIST=NEGBIN option. Note that the dispersion parameter is equivalent to the alpha in STATA and LIMDEP. PROC GENMOD DATA = masil.accident; MODEL accident=emps strict /DIST=NEGBIN LINK=LOG; RUN; The GENMOD Procedure Model Information Data Set Distribution Link Function Dependent Variable Observations Used COUNT.WASTE Negative Binomial Log Accident 778 Criteria For Assessing Goodness Of Fit Criterion Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X2 Log Likelihood DF Value Value/DF 775 775 775 775 589.7752 589.7752 845.6033 845.6033 37.5628 0.7610 0.7610 1.0911 1.0911 Algorithm converged. Analysis Of Parameter Estimates Parameter Intercept Emps Strict Dispersion DF Estimate Standard Error 1 1 1 1 0.3851 0.0052 -0.6703 3.9554 0.1278 0.0023 0.1671 0.3501 Wald 95% Confidence Limits 0.1345 0.0008 -0.9978 3.3254 0.6357 0.0096 -0.3427 4.7048 ChiSquare Pr > ChiSq 9.07 5.29 16.09 0.0026 0.0214 <.0001 NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood. The restricted model produces a log likelihood of 28.8627. 
Thus, the likelihood ratio for goodness-of-fit is 17.4002 = 2 * (37.5628 – 28.8627) (p<.00017). PROC GENMOD DATA = masil.accident; MODEL accident= /DIST=NEGBIN LINK=LOG; RUN; http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 12 The likelihood ratio for overdispersion is 1409.5838 = 2 * (37.5628 - (-667.2291)). 3.2 NBRM in STATA STATA has the .nbreg command for the NBRM. The command reports three log likelihood statistics: for the PRM, restricted NBRM (constant-only model), and unrestricted NBRM (full model), which make it easy to conduct likelihood ratio tests. . nbreg accident emps strict Fitting comparison Poisson model: Iteration 0: Iteration 1: Iteration 2: log likelihood = -1821.5112 log likelihood = -1821.5101 log likelihood = -1821.5101 Fitting constant-only model: Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood = = = = = -1256.6761 -1152.6155 -1125.6643 -1125.4183 -1125.4183 = = = = -1117.1731 -1116.7201 -1116.7182 -1116.7182 Fitting full model: Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood Negative binomial regression Log likelihood = -1116.7182 Number of obs LR chi2(2) Prob > chi2 Pseudo R2 = = = = 778 17.40 0.0002 0.0077 -----------------------------------------------------------------------------accident | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------emps | .0051981 .0022595 2.30 0.021 .0007694 .0096267 strict | -.6702548 .1671191 -4.01 0.000 -.9978021 -.3427074 _cons | .3851111 .1278468 3.01 0.003 .134536 .6356861 -------------+---------------------------------------------------------------/lnalpha | 1.37509 .0885176 1.201599 1.548582 -------------+---------------------------------------------------------------alpha | 3.955434 .3501257 3.32543 4.704793 -----------------------------------------------------------------------------Likelihood ratio test of alpha=0: chibar2(01) = 1409.58 Prob>=chibar2 = 0.000 The restricted model or “constant-only model” gives us a log likelihood -1125.4183. Thus, the likelihood ratio for goodness-of-fit is 17.4002 = 2 * [-1116.7182 - (-1125.4183)] (p<.00017). The p-value is computed as follows (Note the .disp or .di is an abbreviation of the .display). . disp chi2tail(2, 17.4002) .00016657 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 13 The likelihood ratio test for overdispersion results in a chi-squared of 1409.5838 (p<.0000) and rejects the null hypothesis of alpha=0. The statistically significant evidence of overdispersion indicates that the NBRM is preferred to the PRM. . di 2 * (-1116.7182 - (-1821.5101)) 1409.5838 The p-value of the likelihood ratio for overdispersion is computed as, . di chi2tail(1, 1409.5838) 1.74e-308 Now, let us calculate marginal effects (or changes) at the means of independent variables. You should the read the discrete change labeled “0->1” of a binary variable strict, since its marginal change at the mean (.5077) is meaningless. . 
prchange nbreg: Changes in Predicted Rate for accident emps strict min->max 1.5326 -0.8931 exp(xb): x= sd(x)= 0->1 0.0055 -0.8931 -+1/2 0.0068 -0.8885 -+sd/2 0.2585 -0.4383 MargEfct 0.0068 -0.8721 1.3011 emps 42.0129 38.1548 strict .507712 .500262 3.3 NBRM in LIMDEP LIMDEP has the Negbin$ command for the NBRM that reports the PRM as well. Note that the standard errors of parameter estimates are slightly different from those of SAS and STATA. The Marginal Effects$ and the Means$ subcommands compute marginal effects at the mean of independent variables. You may not omit the Means$ subcommand. NEGBIN; Lhs=ACCIDENT; Rhs=ONE,EMPS,STRICT; Marginal Effects; Means$ +---------------------------------------------+ | Poisson Regression | | Maximum Likelihood Estimates | | Model estimated: Sep 08, 2005 at 09:35:36AM.| | Dependent variable ACCIDENT | | Weighting variable None | | Number of observations 778 | | Iterations completed 8 | | Log likelihood function -1821.510 | | Restricted log likelihood -1883.921 | | Chi squared 124.8218 | | Degrees of freedom 2 | http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 14 | Prob[ChiSqd > value] = .0000000 | | Chi- squared = 4944.94781 RsqP= -.0051 | | G - squared = 2827.20794 RsqD= .0423 | | Overdispersion tests: g=mu(i) : 4.720 | | Overdispersion tests: g=mu(i)^2: 4.253 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant .3900961420 .46678663E-01 8.357 .0000 EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853 STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) Normal exit from iterations. Exit status=0. +---------------------------------------------+ | Negative Binomial Regression | | Maximum Likelihood Estimates | | Model estimated: Sep 08, 2005 at 09:35:36AM.| | Dependent variable ACCIDENT | | Weighting variable None | | Number of observations 778 | | Iterations completed 8 | | Log likelihood function -1116.718 | | Restricted log likelihood -1821.510 | | Chi squared 1409.584 | | Degrees of freedom 1 | | Prob[ChiSqd > value] = .0000000 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant .3851110699 .12855240 2.996 .0027 EMPS .5198057234E-02 .22602075E-02 2.300 .0215 42.012853 STRICT -.6702547660 .16729839 -4.006 .0001 .50771208 Dispersion parameter for count data model Alpha 3.955434012 .35680876 11.086 .0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +-------------------------------------------+ | Partial derivatives of expected val. with | | respect to the vector of characteristics. | | They are computed at the means of the Xs. | | Observations used for means are All Obs. 
| | Conditional Mean at Sample Point 1.3011 | | Scale Factor for Marginal Effects 1.3011 | +-------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant .5010628939 .19396434 2.583 .0098 EMPS .6763123170E-02 .29746591E-02 2.274 .0230 42.012853 STRICT -.8720595665 .22469308 -3.881 .0001 .50771208 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 15 Read the coefficients (.0068 and -.8721) to confirm that they are identical to the corresponding marginal effects calculated in STATA. SAS, STATA, and LIMDEP produce almost the same parameter estimates and goodness-of-fit statistics (Table 4). Note that SAS reports different log likelihoods, but the same likelihood ratio. Table 4. Summary of the Negative Binomial Regression Model in SAS, STATA, and LIMDEP Model SAS 9.1 STATA 9.0 LIMDEP 8.0 .3851 .3851 .3851 (.1278) (.1278) (.1286) .0052 .0052 .0052 EMPS (.0023) (.0023) (.0023) -.6703 -.6703 -.6703 STRICT (.1671) (.1671) (.1673) 3.9554 3.9554 3.9554 Dispersion Parameter (Alpha) (.3501) (.3501) (.3568) 37.5628 -1116.7182 -1116.718 Log Likelihood (unrestricted) 28.8627 -1125.4183 -1125.418* Log Likelihood (restricted) 17.4002 17.40 17.4002 Likelihood Ratio for Goodness-of-fit 1409.5838 1409.5838 1409.5838 Likelihood Ratio for Overdispersion * LIMDEP mistakenly reports the log likelihood of the unrestricted Poisson regression model. Intercept The following plot compares the PRM and NBRM. Look at the predictions for zero counts of the two models. As the likelihood ratio test indicates, the NBRM seems to fit these data better than PRM. Figure 3. Comparison of the Poisson and Negative Binomial Regression Models http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 16 4. The Zero-Inflated Poisson Regression Model STATA and LIMDEP have commands for the zero-inflated Poisson regression model (ZIP). 4.1 ZIP in STATA (.zip) STATA has the .zip command to estimate the ZIP. The inflate() option specifies a list of variables that determines whether the observed count is zero. The vuong option computes the Vuong statistic to compare the ZIP and PRM. . zip accident emps strict, inflate(emps strict) vuong Fitting constant-only model: Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = -1627.0779 -1309.5825 -1272.433 -1270.9543 -1270.9523 -1270.9523 = = = = -1270.9523 -1269.7219 -1269.7206 -1269.7206 Fitting full model: Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood Zero-inflated Poisson regression Number of obs Nonzero obs Zero obs = = = 778 280 498 Inflation model = logit Log likelihood = -1269.721 LR chi2(2) Prob > chi2 = = 2.46 0.2918 -----------------------------------------------------------------------------| Coef. Std. Err. z P>|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------accident | emps | -.000277 .0008633 -0.32 0.748 -.001969 .001415 strict | -.0923911 .0729023 -1.27 0.205 -.2352771 .0504948 _cons | 1.361978 .0493222 27.61 0.000 1.265308 1.458647 -------------+---------------------------------------------------------------inflate | emps | -.0109897 .0022678 -4.85 0.000 -.0154344 -.006545 strict | 1.057031 .1767509 5.98 0.000 .7106059 1.403457 _cons | .488656 .1211099 4.03 0.000 .2512849 .726027 -----------------------------------------------------------------------------Vuong test of zip vs. standard Poisson: z = 8.40 Pr>z = 0.0000 The restricted model is estimated with the intercept only. . zip accident, inflate(emps strict) The Vuong statistic at the bottom compares the ZIP and PRM. Since the V 8.40 is greater than 1.96, we conclude that the ZIP is preferred to the PRM. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 17 4.2 ZIP in LIMDEP The LIMDEP Poisson$ command needs to have the Zip and Rh2 subcommands. The Rh2 is equivalent to the inflate() option in STATA. The Alg=Newton$ subcommand is needed to use the Newton-Raphson algorithm because the default Broyden algorithm failed to converge.1 POISSON; Lhs=ACCIDENT; Rhs=ONE,EMPS,STRICT; ZIP; Rh2=ONE,EMPS,STRICT; Alg=Newton$ +---------------------------------------------+ | Poisson Regression | | Maximum Likelihood Estimates | | Model estimated: Sep 06, 2005 at 00:25:07PM.| | Dependent variable ACCIDENT | | Weighting variable None | | Number of observations 778 | | Iterations completed 8 | | Log likelihood function -1821.510 | | Restricted log likelihood -1883.921 | | Chi squared 124.8218 | | Degrees of freedom 2 | | Prob[ChiSqd > value] = .0000000 | | Chi- squared = 4944.94781 RsqP= -.0051 | | G - squared = 2827.20794 RsqD= .0423 | | Overdispersion tests: g=mu(i) : 4.720 | | Overdispersion tests: g=mu(i)^2: 4.253 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant .3900961420 .46678663E-01 8.357 .0000 EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853 STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) Normal exit from iterations. Exit status=0. +----------------------------------------------------------------------+ | Zero Altered Poisson Regression Model | | Logistic distribution used for splitting model. | | ZAP term in probability is F[tau x Z(i) ] | | Comparison of estimated models | | Pr[0|means] Number of zeros Log-likelihood | | Poisson .27329 Act.= 498 Prd.= 212.6 -1821.51007 | 1 If you get a warning message of “Error: 806: Line search does not improve fn. Exit iterations. Status=3” or “Error: 805: Initial iterations cannot improve function. Status=3”, you may change the optimization algorithm or increase the maximum number of iterations (e.g., Maxit=1000$). http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 18 | Z.I.Poisson .64642 Act.= 498 Prd.= 502.9 -1259.88568 | | Note, the ZIP log-likelihood is not directly comparable. | | ZIP model with nonzero Q does not encompass the others. | | Vuong statistic for testing ZIP vs. 
unaltered model is 9.5740 | | Distributed as standard normal. A value greater than | | +1.96 favors the zero altered Z.I.Poisson model. | | A value less than -1.96 rejects the ZIP model. | +----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Poisson/NB/Gamma regression model Constant 1.361977491 .23944641E-01 56.880 .0000 EMPS -.2770010575E-03 .37770090E-03 -.733 .4633 42.012853 STRICT -.9239125073E-01 .33326502E-01 -2.772 .0056 .50771208 Zero inflation model Constant .4886559537 .12210013 4.002 .0001 EMPS -.1098971050E-01 .22152492E-02 -4.961 .0000 42.012853 STRICT 1.057031399 .17715551 5.967 .0000 .50771208 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) In order to estimate the restricted model, run the following command with the ONE only in the Lhs$ subcommand. The Rh2$ subcommand remains unchanged. POISSON; Lhs=ACCIDENT; Rhs=ONE; ZIP; Alg=Newton; Rh2=ONE,EMPS,STRICT$ Table 5 summarizes parameter estimates and goodness-of-fit statistics for the zero-inflated Poisson model. STATA and LIMDEP report the same parameter estimates, but they produce different standard errors and log likelihoods. In particular, LIMDEP returned a suspicious log likelihood for the restricted model, and thus ended up with the “unlikely” likelihood ratio of .0304. In addition, the Vuong statistics in STATA and LIMDEP are different. Table 5. Summary of the Zero-Inflated Poisson Regression Model in STATA, and LIMDEP Model SAS 9.1 STATA 9.0 LIMDEP 8.0 Intercept EMPS STRICT Intercept (Zero-inflated) EMPS (Zero-inflated) STRICT (Zero-inflated) Log Likelihood (unrestricted) Log Likelihood (restricted) Likelihood Ratio for Goodness-of-fit Vuong Statistic (ZINB versus NBRM) http://www.indiana.edu/~statmath 1.3620 (.0493) -.0003 (.0009) -.0924 (.0729) .4887 (.1211) -.0110 (.0023) 1.0570 (.1768) -1269.7206 -1270.9523 2.46 8.40 1.3620 (.0239) -.0003 (.0004) -.0924 (.0333) .4887 (.1221) -.0110 (.0022) 1.0570 (.1772) -1259.8857 -1259.8705 -.0304 9.5740 © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 19 5. The Zero-Inflated NB Regression Model STATA and LIMDEP can estimate the zero-inflated negative binomial regression model (ZINB). 5.1 ZINB in STATA (.zinb) The STATA .zinb command estimates the ZINB. The vuong option computes the Vuong statistic to compare the ZINB and NBRM. . 
zinb accident emps strict, inflate(emps strict) vuong Fitting constant-only model: Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: log log log log log log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = = = = = = -1190.5117 -1106.9874 -1098.8642 -1095.3638 -1094.0237 -1093.063 -1092.6216 -1091.798 -1091.7332 -1091.7329 -1091.7329 = = = = -1091.7329 -1089.5565 -1089.5198 -1089.5198 (not concave) Fitting full model: Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood Zero-inflated negative binomial regression Number of obs Nonzero obs Zero obs = = = 778 280 498 Inflation model = logit Log likelihood = -1089.52 LR chi2(2) Prob > chi2 = = 4.43 0.1094 -----------------------------------------------------------------------------| Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------accident | emps | -.0004407 .0020554 -0.21 0.830 -.0044691 .0035877 strict | -.3251317 .1659173 -1.96 0.050 -.6503235 .0000602 _cons | .7763065 .1508037 5.15 0.000 .4807367 1.071876 -------------+---------------------------------------------------------------inflate | emps | -.2087768 .0955122 -2.19 0.029 -.3959772 -.0215763 strict | 7.562388 3.055775 2.47 0.013 1.573179 13.5516 _cons | .1032115 .3800045 0.27 0.786 -.6415835 .8480065 -------------+---------------------------------------------------------------/lnalpha | .9252514 .1351387 6.85 0.000 .6603845 1.190118 -------------+---------------------------------------------------------------alpha | 2.522502 .3408876 1.935536 3.28747 -----------------------------------------------------------------------------Vuong test of zinb vs. standard negative binomial: z = 4.13 Pr>z = 0.0000 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 20 The likelihood ratio, 360.4024= 2*(-1089.5198 - (-1269.721)), rejects the null hypothesis of no overdispersion, indicating that the ZINB can improve goodness-of-fit over the ZIP (p<.0000). The Vuong test, 4.13 > 1.96, suggests that the ZINB is preferred to the NBRM. 5.2 ZINB in LIMDEP The LIMDEP Negbin$ command needs to have the Zip and Rh2 subcommands for the ZINB. The following command produces the Poisson regression model, negative binomial model, and zero-inflated negative binomial model. You may omit the Alg=Newton$ subcommand. 
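Before running it, the overdispersion test reported above can be double-checked with .display, in the same way as in section 3.2 (a sketch that reuses the log likelihoods printed by .zinb and .zip):

. di 2 * (-1089.5198 - (-1269.721))
. di chi2tail(1, 360.4024)

The first line returns the likelihood ratio of 360.4024 and the second returns a p-value that is essentially zero, so the null hypothesis of alpha=0 is rejected and the ZINB is preferred to the ZIP.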
NEGBIN; Lhs=ACCIDENT; Rhs=ONE,EMPS,STRICT; Rh2=ONE,EMPS,STRICT; ZIP; Alg=Newton$ +---------------------------------------------+ | Poisson Regression | | Maximum Likelihood Estimates | | Model estimated: Sep 10, 2005 at 00:20:00AM.| | Dependent variable ACCIDENT | | Weighting variable None | | Number of observations 778 | | Iterations completed 8 | | Log likelihood function -1821.510 | | Restricted log likelihood -1883.921 | | Chi squared 124.8218 | | Degrees of freedom 2 | | Prob[ChiSqd > value] = .0000000 | | Chi- squared = 4944.94781 RsqP= -.0051 | | G - squared = 2827.20794 RsqD= .0423 | | Overdispersion tests: g=mu(i) : 4.720 | | Overdispersion tests: g=mu(i)^2: 4.253 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant .3900961420 .46678663E-01 8.357 .0000 EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853 STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) Normal exit from iterations. Exit status=0. +---------------------------------------------+ | Negative Binomial Regression | | Maximum Likelihood Estimates | | Model estimated: Sep 10, 2005 at 00:20:00AM.| | Dependent variable ACCIDENT | | Weighting variable None | | Number of observations 778 | | Iterations completed 12 | | Log likelihood function -1116.718 | | Restricted log likelihood -1821.510 | | Chi squared 1409.584 | http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 21 | Degrees of freedom 1 | | Prob[ChiSqd > value] = .0000000 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant .3851110482 .12855240 2.996 .0027 EMPS .5198057322E-02 .22602075E-02 2.300 .0215 42.012853 STRICT -.6702547787 .16729839 -4.006 .0001 .50771208 Dispersion parameter for count data model Alpha 3.955434128 .35680877 11.086 .0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) Normal exit from iterations. Exit status=0. +----------------------------------------------------------------------+ | Zero Altered Neg.Binomial Regression Model | | Logistic distribution used for splitting model. | | ZAP term in probability is F[tau x Z(i) ] | | Comparison of estimated models | | Pr[0|means] Number of zeros Log-likelihood | | Poisson .27329 Act.= 498 Prd.= 212.6 -1821.51007 | | Neg. Bin. .32470 Act.= 498 Prd.= 252.6 -1116.71820 | | Z.I.Neg_Bin .62918 Act.= 498 Prd.= 489.5 -1089.51977 | | Note, the ZIP log-likelihood is not directly comparable. | | ZIP model with nonzero Q does not encompass the others. | | Vuong statistic for testing ZIP vs. unaltered model is 4.1270 | | Distributed as standard normal. A value greater than | | +1.96 favors the zero altered Z.I.Neg_Bin model. | | A value less than -1.96 rejects the ZIP model. 
| +----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Poisson/NB/Gamma regression model Constant .7763063017 .15178042 5.115 .0000 EMPS -.4407244013E-03 .20262626E-02 -.218 .8278 42.012853 STRICT -.3251315411 .16179883 -2.009 .0445 .50771208 Dispersion parameter Alpha 2.522502810 .29924002 8.430 .0000 Zero inflation model Constant .1032103951 .37413759 .276 .7827 EMPS -.2087767804 .68774937E-01 -3.036 .0024 42.012853 STRICT 7.562389399 2.2216392 3.404 .0007 .50771208 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) In order to estimate the restricted model, run the following command. You have to use the Alg=Newton$ subcommand to get the restricted model to converge. Negbin; Lhs=ACCIDENT; Rhs=ONE; Rh2=ONE,EMPS,STRICT; ZIP; Alg=Newton$ http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 22 Table 6 summarizes parameter estimates and goodness-of-fit statistics for the zero-inflated negative binomial regression model. STATA and LIMDEP reports the same results except standard errors and likelihood ratio for overdispersion. Table 6. Summary of the Zero-Inflated NBRM in STATA, and LIMDEP Model SAS 9.1 STATA 9.0 .7763 Intercept (.1508) -.0004 EMPS (.0021) -.3251 STRICT (.1659) .1032 Intercept (Zero-inflated) (.3800) -.2088 EMPS (Zero-inflated) (.0955) 7.5624 STRICT (Zero-inflated) (3.0558) 2.5225 Dispersion Parameter (Alpha) (.3409) -1089.5198 Log Likelihood (unrestricted) -1091.7329 Log Likelihood (restricted) 4.43 Likelihood Ratio for Goodness-of-fit 360.4024 Likelihood Ratio for Overdispersion 4.13 Vuong Statistic (ZINB versus NBRM) * The likelihood ratio for overdispersion is 340.7318 =2*(-1089.5198 - (-1259.8857)) LIMDEP 8.0 .7763 (.1518) -.0004 (.0020) -.3251 (.1618) .1032 (.3741) -.2088 (.0688) 7.5624 (2.2216) 2.5225 (.2992) -1089.5198 -1091.7329 4.43 340.7318* 4.1270 Figure 4. Comparison of the Zero-Inflated PRM and the Zero-Inflated NBRM http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 23 6. Conclusion Like other econometric models, researchers must first examine the data generation process of a dependent variable to understand its behavior. Sophisticated researchers pay special attention to excess zeros, censored and/or truncated counts, sample selection, and other particular patterns of the data generation, and then decide which model best describes the data generation process. The Poisson regression model and negative binomial regression model have the same mean structure, but they describe the behavior of a dependent variable in different ways. Zero-inflated regression models integrate two different data generation processes to deal with overdispersion. Truncated or censored regression models are appropriate when data are (left and/or right) truncated or censored. Researchers need to spend more time and effort interpreting the results substantively. Like other categorical dependent variable models, count data models produce estimates that are difficult to interpret intuitively. Reporting parameter estimates and goodness-of-fit statistics are not sufficient. J. Scott Long (1997) and Long and Freese (2003) provide good examples of meaningful count data model interpretations. 
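For instance, the factor-change column of .listcoef in section 2.3 already provides one such interpretation; a one-line check is

. display exp(-.7041664)

which returns .4945, meaning that, holding emps constant, strict policy implementation roughly halves the expected number of waste spill accidents.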
Regarding statistical software, I would recommend STATA for general count data models and LIMDEP for special types of models. Although able to handle various models, LIMDEP does not seem stable and reliable. The SAS GENMODE procedure estimates the Poisson regression model and the negative binomial model, but it does not have easy ways of estimating other models. We encourage SAS Institute to develop an individual procedure, say the CLIM (Count and Limited Dependent Variable Model) procedure, to handle a variety of count data models. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 24 Appendix: Data Set The data set used here is a part of the data provided for David H. Good’s class of the School of Public and Environmental Affairs, Indiana University. Note that these data have been manipulated for the sake of data security. The variables in the data set include, 1. emps: the size of the waste quotas 2. strict: strictness of policy implementation (1=strict) 3. accident: the frequency of waste spill accidents of plant The followings summarize descriptive statistics of these variables. Note that there are many zero counts that indicate an overdispersion problem. . summarize accident emps strict Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------accident | 778 1.372751 2.948267 0 31 emps | 778 42.01285 38.1548 1 174 strict | 778 .5077121 .5002621 0 1 . tab accident strict | strict accident | 0 1 | Total -----------+----------------------+---------0 | 214 284 | 498 1 | 41 29 | 70 2 | 38 32 | 70 3 | 28 13 | 41 4 | 16 13 | 29 5 | 10 3 | 13 6 | 12 7 | 19 7 | 4 3 | 7 8 | 4 2 | 6 9 | 3 2 | 5 10 | 0 2 | 2 11 | 3 1 | 4 12 | 2 0 | 2 13 | 1 0 | 1 14 | 1 0 | 1 15 | 3 0 | 3 16 | 1 0 | 1 17 | 0 2 | 2 18 | 1 1 | 2 21 | 0 1 | 1 31 | 1 0 | 1 -----------+----------------------+---------Total | 383 395 | 778 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 25 References Allison, Paul D. 1991. Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute. Cameron, A. Colin, and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press. Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall. Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide. Plainview, New York: Econometric Software. Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent Variables Using STATA, 2nd ed. College Station, TX: STATA Press. Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Advanced Quantitative Techniques in the Social Sciences. Sage Publications. Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press. SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute. STATA Press. 2005. STATA Base Reference Manual, Release 9. College Station, TX: STATA Press. Acknowledgements I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing, Indiana University, who provided valuable comments and suggestions. Revision History • • • 2003. First draft 2004. Second draft 2005. 
Third draft (Added LIMDEP examples) http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 1 Linear Regression Models for Panel Data Using SAS, STATA, LIMDEP, and SPSS Hun Myoung Park This document summarizes linear regression models for panel data and illustrates how to estimate each model using SAS 9.1, STATA 9.0, LIMDEP 8.0, and SPSS 13.0. This document does not address nonlinear models (i.e., logit and probit models), but focuses on linear regression models. 1. 2. 3. 4. 5. 6. 7. 8. 9. Introduction Least Squares Dummy Variable Regression Panel Data Models The Fixed Group Effect Model The Fixed Time Effect Model The Fixed Group and Time Effect Model Random Effect Models The Poolability Test Conclusion 1. Introduction Panel data are cross sectional and longitudinal (time series). Some examples are the cumulative General Social Survey (GSS) and Current Population Survey (CPS) data. Panel data may have group effects, time effects, or the both. These effects are analyzed by fixed effect and random effect models. 1.1 Data Arrangement A panel data set contains observations on n individuals (e.g., firms and states), each measured at T points in time. In other word, each individual (1 through n subject) includes T observations (1 through t time period). Thus, the total number of observations is nT. Figure 1 illustrates the data arrangement of a panel data set. Figure 1. Data Arrangement of Panel Data Group Time Variable1 1 1 … 1 2 2 ... 2 … 1 2 … T 1 2 … T … http://www.indiana.edu/~statmath … … … … … … … … … Variable2 Variable3 … … … … … … … … … … … … … … … … … … … … … … … … … … … … © 2005 The Trustees of Indiana University (12/10/2005) … n n … n … 1 2 … T … … … … … Linear Regression Model for Panel Data: 2 … … … … … … … … … … … … … … … 1.2 Fixed Effect versus Random Effect Models Panel data models estimate fixed and/or random effects models using dummy variables. The core difference between fixed and random effect models lies in the role of dummies. If dummies are considered as a part of the intercept, it is a fixed effect model. In a random effect model, the dummies act as an error term (see Table 1). The fixed effect model examines group differences in intercepts, assuming the same slopes and constant variance across groups. Fixed effect models use least square dummy variable (LSDV), within effect, and between effect estimation methods. Thus, ordinary least squares (OLS) regressions with dummies, in fact, are fixed effect models. Table 1. Fixed Effect and Random Effect Models Fixed Effect Model Functional form* y it = (α + μ i ) + X it' β + vit Intercepts Error variances Slopes Estimation Hypothesis test Varying across group and/or time Constant Constant LSDV, within effect, between effect Incremental F test Random Effect Model y it = α + X it' β + ( μ i + vit ) Constant Varying across group and/or time Constant GLS, FGLS Breusch-Pagan LM test * vit ~ IID(0,σ v ) 2 The random effect model, by contrast, estimates variance components for groups and error, assuming the same intercept and slopes. The difference among groups (or time periods) lies in the variance of the error term. This model is estimated by generalized least squares (GLS) when the Ω matrix, a variance structure among groups, is known. The feasible generalized least squares (FGLS) method is used to estimate the variance structure when Ω is not known. A typical example is the groupwise heteroscedastic regression model (Greene 2003). 
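To see what the "variance components" in Table 1 are (a standard result, added here only as a reminder), write out the second moments of the composite error μ_i + v_it of the one-way random group effect model:

Var(μ_i + v_it) = σ²_μ + σ²_v
Cov(μ_i + v_it, μ_i + v_is) = σ²_μ   (same group, t ≠ s)
Cov(μ_i + v_it, μ_j + v_js) = 0      (different groups, i ≠ j)

Observations within a group are therefore equicorrelated with ρ = σ²_μ / (σ²_μ + σ²_v), and this block structure of Ω is what GLS, or FGLS when the components must be estimated, exploits.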
There are various estimation methods for FGLS including maximum likelihood methods and simulations (Baltagi and Cheng 1994). Fixed effects are tested by the (incremental) F test, while random effects are examined by the Lagrange multiplier (LM) test (Breusch and Pagan 1980). If the null hypothesis is not rejected, the pooled OLS regression is favored. The Hausman specification test (Hausman 1978) compares fixed effect and random effect models. Table 1 compares the fixed effect and random effect models. Group effect models create dummies using grouping variables (e.g., country, firm, and race). If one grouping variable is considered, it is called a one-way fixed or random group effects model. Two-way group effect models have two sets of dummy variables, one for a grouping variable and the other for a time variable. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 3 1.3 Estimation and Software Issues LSDV regression, the within effect model, the between effect model (group or time mean model), GLS, and FGLS are fundamentally based on OLS in terms of estimation. Thus, any procedure and command for OLS is good for the panel data models. The REG procedure of SAS/STAT, STATA .regress (.cnsreg), LIMDEP regress$, and SPSS regression commands all fit LSDV1 dropping one dummy and have options to suppress the intercept (LSDV2). SAS, STATA, and LIMDEP can estimate OLS with restrictions (LSDV3), but SPSS cannot. Note that the STATA .cnsreg command requires the .constraint command that defines a restriction (Table 2). Table 2. Procedures and Commands in SAS, STATA, LIMDEP, and SPSS SAS 9.1 STATA 9.0 LIMDEP 8.0 Regression (OLS) LSDV1 LSDV2 LSDV3 Fixed effect (within effect) Two-way fixed (within effect) Between effect Random effect Two-way random PROC REG w/o a dummy /NOINT RESTRICT TSCSREG /FIXONE PANEL /FIXONE .regress w/o a dummy Noconstant .cnsreg .xtreg w/ fe TSCSREG /FIXTWO PANEL /FIXTWO N/A PANEL /BTWNG PANEL /BTWNT TSCSREG /RANONE PANEL /RANONE TSCSREG /RANTWO PANEL /RANTWO .xtreg w/ be .xtreg w/ re N/A Regress$ w/o a dummy w/o One in Rhs Cls: Regress;Panel;St r=;Pds=;Fixed$ Regress;Panel;St r=;Pds=;Fixed$ Regress;Panel;St r=;Pds=;Means$ Regress;Panel;St r=;Pds=;Random$ Problematic SPSS 13.0 Regression w/o a dummy /Origin N/A N/A N/A N/A N/A N/A SAS, STATA, and LIMDEP also provide the procedures (commands) that are designed to estimate panel data models conveniently. SAS/ETS has the TSCSREG and PANEL procedures to estimate one-way and two-way fixed and random effect models.1 For the fixed effect model, these procedures estimate LSDV1, which drops one of the dummy variables. For the random effects model, they by default use the Fuller-Battese method (1974) to estimate variance components for group, time, and error. These procedures also support other estimation methods such as Parks (1967) autoregressive model and Da Silva moving average method. The TSCSREG procedure can handle balanced data only, whereas the PANEL procedure is able to deal with balanced and unbalanced data. The former provides one-way and two-way fixed and random effect models, while the latter supports the between effect model and pooled OLS regression as well. Despite advanced features of PANEL, output from the two procedures looks alike. The STATA .xtreg command estimates within effect (fixed effect) models with the fe option, between effect models with the be option, and random effect models with the re option. 
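For example, with the airline panel data introduced in section 1.4, the three calls might look like the following sketch (the variable names cost, output, fuel, load, and airline are assumptions for illustration, not the data set's actual codebook):

. xtreg cost output fuel load, fe i(airline)
. xtreg cost output fuel load, be i(airline)
. xtreg cost output fuel load, re i(airline)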
This command, however, does not fit the two-way fixed and random effect models. The LIMDEP 1 SAS recently announced the PROC PANEL, an experimental procedure, for panel data models. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 4 regress$ command with the panel; subcommand estimates panel data models, but this command is not sufficiently stable. SPSS has limited ability to analyze panel data. 1.4 Data Sets This document uses two data sets. The cross-sectional data set contains research and development (R&D) expenditure data of the top 50 information technology firms presented in OECD Information Technology Outlook 2004. The panel data set has cost data for U.S. airlines (1970-1984) from Econometric Analysis (Greene 2003). See the Appendix for the details. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 5 2. Least Squares Dummy Variable Regression A dummy variable is a binary variable that is coded either 1 or zero. It is commonly used to examine group and time effects in regression. Consider a simple model of regressing R&D expenditure in 2002 on 2000 net income and firm type. The dummy variable d1 is set to 1 for equipment and software firms and zero for telecommunication and electronics. The variable d2 is coded in the opposite way. Take a look at the data structure (Figure 2). Figure 2. Dummy Variable Coding for Firm Type +-----------------------------------------------------------------+ | firm rnd income type d1 d2 | |-----------------------------------------------------------------| | Samsung 2,500 4,768 Electronics 0 1 | | AT&T 254 4,669 Telecom 0 1 | | IBM 4,750 8,093 IT Equipment 1 0 | | Siemens 5,490 6,528 Electronics 0 1 | | Verizon . 11,797 Telecom 0 1 | | Microsoft 3,772 9,421 Service & S/W 1 0 | … … … … … … … … 2.1 Model 1 without a Dummy Variable The ordinary least squares (OLS) regression without dummy variables, a pooled regression model, assumes a constant intercept and slope regardless of firm types. In the following regression equation, β 0 is the intercept; β1 is the slope of net income in 2000; and ε i is the error term. Model 1: R & Di = β 0 + β 1incomei + ε i The pooled model has the intercept of 1,482.697 and slope of .223. For a $ one million increase in net income, a firm is likely to increase R&D expenditure in 2002 by $ .223 million. . regress rnd income Source | SS df MS -------------+-----------------------------Model | 15902406.5 1 15902406.5 Residual | 83261299.1 37 2250305.38 -------------+-----------------------------Total | 99163705.6 38 2609571.2 Number of obs F( 1, 37) Prob > F R-squared Adj R-squared Root MSE = = = = = = 39 7.07 0.0115 0.1604 0.1377 1500.1 -----------------------------------------------------------------------------rnd | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------income | .2230523 .0839066 2.66 0.012 .0530414 .3930632 _cons | 1482.697 314.7957 4.71 0.000 844.8599 2120.533 ------------------------------------------------------------------------------ Pooled model: R&D = 1,482.697 + .223*income Despite moderate goodness of fit statistics such as F and t, this is a naïve model. R&D investment tends to vary across industries. 
http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 6 2.2 Model 2 with a Dummy Variable You may assume that equipment and software firms have more R&D expenditure than other types of companies. Let us take this group difference into account.2 We have to drop one of the two dummy variables in order to avoid perfect multicollinearity. That is, OLS does not work with both dummies in a model. The δ 1 in model 2 is the coefficient that is valid in equipment and software companies only. Model 2: R & Di = β 0 + β 1incomei + δ 1 d1i + ε i Unlike Model 1, this model results in two different regression equations for two groups. The difference lies in the intercepts, but the slope remains unchanged. . regress rnd income d1 Source | SS df MS -------------+-----------------------------Model | 24987948.9 2 12493974.4 Residual | 74175756.7 36 2060437.69 -------------+-----------------------------Total | 99163705.6 38 2609571.2 Number of obs F( 2, 36) Prob > F R-squared Adj R-squared Root MSE = = = = = = 39 6.06 0.0054 0.2520 0.2104 1435.4 -----------------------------------------------------------------------------rnd | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------income | .2180066 .0803248 2.71 0.010 .0551004 .3809128 d1 | 1006.626 479.3717 2.10 0.043 34.41498 1978.837 _cons | 1133.579 344.0583 3.29 0.002 435.7962 1831.361 ------------------------------------------------------------------------------ d1=1: R&D = 2,140.205 + .218*income = 1,113.579 +1,006.626*1 + .218*income d1=0: R&D = 1,133.579 + .218*income = 1,113.579 +1,006.626*0 + .218*income The slope .218 indicates a positive impact of two-year-lagged net income on a firm’s R&D expenditure. Equipment and software firms on average spend $1,007 million more for R&D than telecommunication and electronics companies. 2.3 Visualization of Model 1 and 2 There is only a tiny difference in the slope (.223 versus .218) between Model 1 and Model 2. The intercept 1,483 of Model 1, however, is quite different from 1,134 for equipment and software companies and 2,140 for telecommunications and electronics in Model 2. This result appears to support Model 2. Figure 3 highlights differences between Model 1 and 2 more clearly. The black line (pooled) in the middle is the regression line of Model 1; the red line at the top is one for equipment and software companies (d1=1) in Model 2; finally the blue line at the bottom is for telecommunication and electronics firms (d2=1 or d1=0). 2 The dummy variable (firm types) and regressors (net income) may or may not be correlated. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 7 Figure 3. Regression Lines of Model 1 and Model 2 This plot shows that Model 1 ignores the group difference, and thus reports the misleading intercept. The difference in the intercept between two groups of firms looks substantial. Moreover, the two models have the similar slopes. Consequently, Model 2 considering fixed group effects seems better than the simple Model 1. Compare goodness of fit statistics (e.g., F, t, R2, and SSE) of the two models. See Section 3.2.2 and 4.7 for formal hypothesis testing. 2.4 Alternatives to LSDV1 The least squares dummy variable (LSDV) regression is ordinary least squares (OLS) with dummy variables. 
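As a quick illustration of what the pooled equation implies (a hand calculation only, using IBM's figures from the data listing in Figure 2 above):

. display 1482.697 + .2230523*8093

returns about 3,288, the pooled model's predicted 2002 R&D expenditure for IBM given its 2000 net income of $8,093 million, versus IBM's actual R&D of $4,750 million. Fitted values for all firms can be obtained with .predict after .regress (e.g., .predict rndhat, xb, where rndhat is an arbitrary new variable name).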
The critical issue in LSDV is how to avoid the perfect multicollinearity or the so called “dummy variable trap.” LSDV has three approaches to avoid getting caught in the trap. They produce different parameter estimates of dummies, but their results are equivalent. The first approach, LSDV1, drops a dummy variable as in Model 2 above. The second approach includes all dummies and, in turn, suppresses the intercept (LSDV2). Finally, include the intercept and all dummies, and then impose a restriction that the sum of parameters of all dummies is zero (LSDV3). Take a look at the following functional forms to compare these three LSDVs. LSDV1: R & Di = β 0 + β1incomei + δ 1 d1i + ε i or R & Di = β 0 + β1incomei + δ 2 d 2i + ε i LSDV2: R & Di = β 1incomei + δ 1 d1i + δ 2 d 2i + ε i LSDV3: R & Di = β 0 + β 1incomei + δ 1 d1i + δ 2 d 2i + ε i , subject to δ 1 + δ 2 = 0 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 8 The main differences among these approaches exist in the meanings of the dummy variable parameters. Each approach defines the coefficients of dummy variables in different ways (Table 3). The parameter estimates in LSDV2 are actual intercepts of groups, making it easy to interpret substantively. LSDV1 reports differences from the reference point (dropped dummy variable). LSDV3 computes how far parameter estimates are away from the average group effect. Accordingly, null hypotheses of t-tests in the three approaches are different. Keep in mind that the R2 of LSDV2 is not correct. Table 3 contrasts the three LSDVs. Table 3. Three Approaches of Least Squares Dummy Variable Models LSDV1: LSDV2: LSDV3: Drop one dummy Suppress the intercept Impose a restriction Dummy included α a , d 2a − d da d1* − d d* α c , d1c − d dc Intercept? All dummy? Restriction? Yes No (d-1) No No Yes (d) No Yes Yes (d) Meaning of coefficient How far away from the reference point (dropped)? Fixed group effect How far away from the average group effect? Coefficients d i* = α a + d ia , d1* , d 2* ,… d d* d i* = α c + d ic , where 1 α c = ∑ d i* d 1 d i* − ∑ d i* = 0 d * d dropped =αa H0 of T-test * d i* − d dropped =0 d i* = 0 ∑d c i = 0* Source: David Good’s Lecture (2004) * This restriction reduces the number of parameters to be estimated, making the model identified. 2.5 Estimating Three LSDVs The SAS REG procedure, STATA .regress command, LIMDEP Regress$ command, and SPSS Regression command all fit OLS and LSDVs. Let us estimate three LSDVs using SAS and STATA. 2.5.1 LSDV 1 without a Dummy LSDV 1 drops a dummy variable. The intercept is the actual parameter estimate of the dropped dummy variable. The coefficient of the dummy included means how far its parameter estimate is away from the reference point or baseline (i.e., the intercept). Here we include d2 instead of d1 to see how a different reference point changes the result. Check the sign of the dummy coefficient included and the intercept. Dropping other dummies does not make any significant difference. 
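In SAS this is done with PROC REG as shown immediately below; the STATA equivalent, also listed in Table 4 later in this chapter, is simply:

. regress rnd income d2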
PROC REG DATA=masil.rnd2002; MODEL rnd = income d2; RUN; http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 9 The REG Procedure Model: MODEL1 Dependent Variable: rnd Number of Observations Read Number of Observations Used Number of Observations with Missing Values 50 39 11 Analysis of Variance Source DF Sum of Squares Mean Square Model Error Corrected Total 2 36 38 24987949 74175757 99163706 12493974 2060438 Root MSE Dependent Mean Coeff Var 1435.42248 2023.56410 70.93536 R-Square Adj R-Sq F Value Pr > F 6.06 0.0054 0.2520 0.2104 Parameter Estimates Variable Intercept income d2 DF Parameter Estimate Standard Error t Value Pr > |t| 1 1 1 2140.20468 0.21801 -1006.62593 434.48460 0.08032 479.37174 4.93 2.71 -2.10 <.0001 0.0101 0.0428 d2=0: R&D = 2,140.205 + .218*income = 2,140.205 - 1,006.626*0 + .218*income d2=1: R&D = 1,133.579 + .218*income = 2,140.205 - 1,006.626*1 + .218*income 2.5.2 LSDV 2 without the Intercept LSDV 2 includes all dummy variables and suppresses the intercept. The STATA .regress command has the noconstant option to fit LSDV2. The coefficients of dummies are actual parameter estimates; thus, you do not need to compute intercepts of groups. This LSDV, however, reports wrong R2 (.7135 ≠ .2520). . regress rnd income d1 d2, noconstant Source | SS df MS -------------+-----------------------------Model | 184685604 3 61561868.1 Residual | 74175756.7 36 2060437.69 -------------+-----------------------------Total | 258861361 39 6637470.79 Number of obs F( 3, 36) Prob > F R-squared Adj R-squared Root MSE = = = = = = 39 29.88 0.0000 0.7135 0.6896 1435.4 -----------------------------------------------------------------------------rnd | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------income | .2180066 .0803248 2.71 0.010 .0551004 .3809128 d1 | 2140.205 434.4846 4.93 0.000 1259.029 3021.38 d2 | 1133.579 344.0583 3.29 0.002 435.7962 1831.361 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 10 ------------------------------------------------------------------------------ d1=1: R&D = 2,140.205 + .218*income d2=1: R&D = 1,133.579 + .218*income 2.5.3 LSDV 3 with a Restriction LSDV 3 includes the intercept and all dummies and then imposes a restriction on the model. The restriction is that the sum of all dummy parameters is zero. The STATA .constraint command defines a constraint, while the .cnsreg command fits a constrained OLS using the constraint()option. The number in the parenthesis indicates the constraint number defined in the .constraint command. . constraint 1 d1 + d2 = 0 . cnsreg rnd income d1 d2, constraint(1) Constrained linear regression Number of obs = F( 2, 36) = Prob > F = Root MSE = 39 6.06 0.0054 1435.4 ( 1) d1 + d2 = 0 -----------------------------------------------------------------------------rnd | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------income | .2180066 .0803248 2.71 0.010 .0551004 .3809128 d1 | 503.313 239.6859 2.10 0.043 17.20749 989.4184 d2 | -503.313 239.6859 -2.10 0.043 -989.4184 -17.20749 _cons | 1636.892 310.0438 5.28 0.000 1008.094 2265.69 ------------------------------------------------------------------------------ d1=1: R&D = 2,140.205 + .218*income = 1,637 + 503 *1 + (-503)*0 + .218*income d2=1: R&D = 1,133.579 + .218*income = 1,637 + 503 *0 + (-503)*1 + .218*income The intercept is the average of actual parameter estimates: 1,636 = (2,140+1,133)/2. In the SAS output below, the coefficient of RESTRICT is virtually zero and, in theory, should be zero. PROC REG DATA=masil.rnd2002; MODEL rnd = income d1 d2; RESTRICT d1 + d2 = 0; RUN; The REG Procedure Model: MODEL1 Dependent Variable: rnd NOTE: Restrictions have been applied to parameter estimates. Number of Observations Read Number of Observations Used Number of Observations with Missing Values 50 39 11 Analysis of Variance Source Model http://www.indiana.edu/~statmath DF Sum of Squares Mean Square F Value Pr > F 2 24987949 12493974 6.06 0.0054 © 2005 The Trustees of Indiana University (12/10/2005) Error Corrected Total Linear Regression Model for Panel Data: 11 36 38 74175757 99163706 Root MSE Dependent Mean Coeff Var 1435.42248 2023.56410 70.93536 2060438 R-Square Adj R-Sq 0.2520 0.2104 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Intercept income d1 d2 RESTRICT 1 1 1 1 -1 1636.89172 0.21801 503.31297 -503.31297 1.81899E-12 310.04381 0.08032 239.68587 239.68587 0 5.28 2.71 2.10 -2.10 . Pr > |t| <.0001 0.0101 0.0428 0.0428 . * Probability computed using beta distribution. Table 4 compares how SAS, STATA, LIMDEP, and SPSS conducts LSDVs. SPSS is not able to fit the LSDV3. In LIMDEP, the b(2) of the Cls: indicates the parameter estimate of the second independent variable. In SPSS, pay attention to the /ORIGIN option for LSDV2. Table 4. Estimating Three LSDVs Using SAS, STATA, LIMDEP, and SPSS LSDV 1 LSDV 2 LSDV 3 PROC REG; PROC REG; PROC REG; SAS MODEL rnd = income d2; MODEL rnd = income d1 d2 /NOINT; MODEL rnd = income d1 d2; RUN; RUN; STATA . regress ind income d2 . regress rnd income d1 d2, noconstant LIMDEP REGRESS; Lhs=rnd; Rhs=ONE,income, d2$ REGRESS; Lhs=rnd; Rhs=income, d1, d2$ SPSS REGRESSION /MISSING LISTWISE /STATISTICS COEFF R ANOVA /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT rnd /METHOD=ENTER income d2. REGRESSION /MISSING LISTWISE /STATISTICS COEFF R ANOVA /CRITERIA=PIN(.05) POUT(.10) /ORIGIN /DEPENDENT rnd /METHOD=ENTER income d1 d2. http://www.indiana.edu/~statmath RESTRICT d1 + d2 = 0; RUN; . constraint 1 d1+ d2 = 0 . cnsreg rnd income d1 d2 const(1) REGRESS; Lhs=rnd; Rhs=ONE,income, d1, d2; Cls: b(2)+b(3)=0$ N/A © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 12 3. Panel Data Models Panel data may have group effects, time effects, or both. These effects are either fixed effect or random effect. A fixed effect model assumes differences in intercepts across groups or time periods, whereas a random effect model explores differences in error variances. A one-way model includes only one set of dummy variables (e.g., firm), while a two way model considers two sets of dummy variables (e.g., firm and year). Model 2 in Chapter 2, in fact, is a one-way fixed group effect panel data model. 3.1 Functional Forms and Notation The functional forms of one-way panel data models are as follows. 
Fixed group effect model: y it = (α + μ i ) + X it' β + vit , where vit ~ IID(0,σ v2 ) Random group effect model: y it = α + X it' β + ( μ i + vit ) , where vit ~ IID(0,σ v2 ) The dummy variable is a part of the intercept in the fixed effect model and a part of error in the random effect model. vit ~ IID(0,σ v2 ) indicates that errors are independent identically distributed. The notations used in this document are, • yi • : dependent variable (DV) mean of group i. • x•t : means of independent variables (IVs) at time t. • • • • • • y•• and x•• for overall means of the DV and IVs, respectively. n: the number of groups or firms T : the number of time periods N=nT : total number of observations k : the number of regressors excluding dummy variables K=k+1 (including the intercept) 3.2 Fixed Effect Models There are several strategies for estimating fixed effect models. The least squares dummy variable model (LSDV) uses dummy variables, whereas the within effect does not. These strategies produce the identical slopes of non-dummy independent variables. The between effect model also does not use dummies, but produces different parameter estimates. There are pros and cons of these strategies (Table 5). 3.2.1 Estimations: LSDV, Within Effect, and Between Effect Model As discussed in Chapter 2, LSDV is widely used because it is relatively easy to estimate and interpret substantively. This LSDV, however, becomes problematic when there are many http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 13 groups or subjects in the panel data. If T is fixed and N → ∞ , only coefficients of regressors are consistent. The coefficients of dummy variables, α + μi , are not consistent since the number of these parameters increases as N increases (Baltagi 2001). This is so called the incidental parameter problem. Under this circumstance, LSDV is useless, calling for another strategy, the within effect model. The within effect model does not use dummy variables, but uses deviations from group means. Thus, this model is the OLS of ( y it − y i• ) = β ' ( xit − xi• ) + (ε it − ε i• ) without an intercept.3 You do not need to worry about the incidental parameter problem any more. The parameter estimates of regressors are identical to those of LSDV. The within effect model in turn has several disadvantages. Table 5. Three Strategies for Fixed Effect Models LSDV1 Within Effect Functional form y i = iα i + X i β + ε i yit − yi • = xit − xi • + ε it − ε i • Dummy Dummy coefficient Transformation Intercept (estimation) R2 SSE MSE Standard error of β DFerror Observations Between Effect y i • = α + xi • + ε i Yes Presented No Yes Correct Correct Correct Correct No Need to be computed Deviation from the group means No Incorrect Correct Smaller Incorrect (smaller) No N/A Group means No nT-n-k nT nT-k (Larger) nT n-K n Since this model does not report dummy coefficients, you need to compute them using the formula d g* = y g • − β ' x g • Since no dummy is used, the within effect model has a larger degree of freedom for error, resulting in a small MSE (mean square error) and incorrect (larger) standard errors of parameter estimates. Thus, you have to adjust the standard error using the Within df error nT − k formula se = sek . Finally, R2 of the within effect model is not = sek LSDV nT − n − k df error correct because an intercept is suppressed. 
* k The between group effect model, so called the group mean regression, uses the group means of the dependent and independent variables. Then, run OLS of yi • = α + xi • + ε i The number of observations decreases to n. This model uses aggregated data to test effects between groups (or individuals), assuming no group and time effect. Table 5 contrasts LSDV, the within effect model, and the between group models. In two-way fixed effect model, LSDV2 and the between effect model are not valid. 3 You need to follow three steps: 1) compute group means of the dependent and independent variables; 2) transform variables to get deviations of individual values from the group means; 3) run OLS with the transformed variables without the intercept. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 14 3.2.2 Testing Group Effects The null hypothesis is that all dummy parameters except one are zero: H 0 : μ1 = ... = μ n −1 = 0 . This hypothesis is tested by the F test, which is based on loss of goodness-of-fit. The robust model in the following formula is LSDV and the efficient model is the pooled regression.4 (e' e Efficient − e' e Robust ) (n − 1) (e' e Robust ) (nT − n − k ) = 2 2 ( RRobust − REfficient ) (n − 1) 2 (1 − RRobust ) (nT − n − k ) ~ F (n − 1, nT − n − k ) If the null hypothesis is rejected, you may conclude that the fixed group effect model is better than the pooled OLS model. 3.2.3 Fixed Time Effect and Two-way Fixed Effect Models For the fixed time effects model, you need to switch n and T, and i and t in the formulas. • Model: y it = α + τ t + β ' X it + ε it • Within effect model: ( y it − y •t ) = β ' ( xit − x•t ) + (ε it − ε •t ) • Dummy coefficients: d t* = y •t − β ' x•t • Correct standard errors: sek* = sek • Between effect model: y •t = α + x•t + ε t • H 0 : τ 1 = ... = τ T −1 = 0 . (e' e Efficient − e' e Robust ) (T − 1) F-test: ~ F (T − 1, Tn − T − k ) . (e' e Robust ) (Tn − T − k ) • Within df error Tn − k = sek LSDV Tn − T − k df error The fixed group and time effect model uses slightly different formulas. The within effect model of this two-way fixed model has four approaches for LSDV (see 6.1 for details). • Model: y it = α + μ i + τ t + β ' X it + ε it . • Within effect Model: yit* = yit − yi • − y•t + y•• and xit* = xit − xi • − x•t + x•• . • Dummy coefficients: d g* = ( y g • − y •• ) − b' ( x g • − x•• ) and d t* = ( y •t − y •• ) − b' ( x•t − x•• ) • Correct standard errors: sek* = sek • H 0 : μ1 = ... = μ n −1 = 0 and τ 1 = ... = τ T −1 = 0 . (e' e Efficient − e' e Robust ) (n + T − 2) F-test: ~ F [(n + T − 2), (nT − n − T − k + 1)] (e' e Robust ) (nT − n − T − k + 1) • 4 Within df error nT − k = sek LSDV nT − n − T − k + 1 df error When comparing fixed effect and random effect models, the fixed effect estimates are considered as the robust estimates and random effect estimates as the efficient estimates. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 15 3.3 Random Effect Models The one-way random group effect model is formulated as y it = α + β ' X ti + μ i + vit , wit = μ i + vit where μi ~ IID(0,σ μ2 ) and vit ~ IID(0,σ v2 ) . The μ i are assumed independent of vit and X it , which are also independent of each other for all i and t. Remember that this assumption is not necessary in the fixed effect model. 
The components of Cov( wit , w js ) = E ( wit w js ) are σ μ2 + σ v2 if i=j and t=s and σ μ2 if i=j and t ≠ s .5 A random effect model is estimated by generalized least squares (GLS) when the variance structure is known and feasible generalized least squares (FGLS) when the variance is unknown. Compared to fixed effect models, random effect models are relatively difficult to estimate. This document assumes panel data are balanced. 3.3.1 Generalized Least Squares (GLS) When Ω is known (given), GLS based on the true variance components is BLUE and all the feasible GLS estimators considered are asymptotically efficient as either n or T approaches infinity (Baltagi 2001). The Ω matrix looks like, ⎡σ μ2 + σ v2 σ μ2 ⎢ σ μ2 σ μ2 + σ v2 ⎢ Ω = ⎢ ... T ×T ... ⎢ 2 σ μ2 ⎢⎣ σ μ σ μ2 σ μ2 ⎤ ⎥ ... ⎥ ... ... ⎥ ⎥ ... σ μ2 + σ v2 ⎥⎦ ... In GLS, you just need to compute θ using the Ω matrix: θ = 1 − σ v2 Tσ μ + σ 2 2 v .6 Then transform variables as follows. yit* = yit − θ yi • • • xit* = xit − θ xi • for all Xk • α * = 1−θ Finally, run OLS with the transformed variables: yit* = α * + xit* β * − ε it* . Since Ω is often unknown, FGLS is more frequently used rather than GLS. 3.3.2 Feasible Generalized Least Squares (FGLS) 5 This implies that Corr ( wit , w js ) is 1 if i=j and t=s, and 6 If σ μ2 (σ μ2 + σ v2 ) if i=j and t ≠ s . θ = 0 , run pooled OLS. If θ = 1 and σ v2 = 0 , then run the within effect model. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 16 If Ω is unknown, first you have to estimate θ using σˆ μ2 and σˆ v2 : θˆ = 1 − σˆ v2 Tσˆ μ2 + σˆ v2 =1− σˆ v2 . 2 Tσˆ between The σˆ v2 is derived from the SSE (sum of squares due to error) of the within effect model or from the deviations of residuals from group means of residuals: n SSE within e' ewithin σˆ v2 = = = nT − n − k nT − n − k T ∑∑ (v i =1 t =1 it − vi • ) 2 nT − n − k , where vit are the residuals of the LSDV1. The σˆ μ2 comes from the between effect model (group mean regression): 2 − σˆ μ2 = σˆ between σˆ v2 T 2 = , where σˆ between SSEbetween . n−K Next, transform variables using θˆ and then run OLS: yit* = α * + xit* β * − ε it* . • y * = y − θˆ y it • • it i• x = xit − θˆ xi • for all Xk α * = 1 − θˆ * it The estimation of the two-way random effect model is skipped here, since it is complicated. 3.3.3 Testing Random Effects (LM test) The null hypothesis is that cross-sectional variance components are zero, H 0 : σ u2 = 0 . Breusch and Pagan (1980) developed the Lagrange multiplier (LM) test (Greene 2003; Judge et al. 1988). In the following formula, e is the n X 1 vector of the group specific means of pooled regression residuals, and e' e is the SSE of the pooled OLS regression. The LM is distributed as chi-squared with one degree of freedom. 2 nT ⎡ e' DDe ⎤ nT ⎡ T 2 e ' e ⎤ LM μ = − = − 1⎥ ~ χ 2 (1) . 1 ⎢ ⎥⎦ 2(T − 1) ⎢⎣ e' e 2(T − 1) ⎣ e' e ⎦ 2 Baltagi (2001) presents the same LM test in a different way. 2 2 ⎤ ⎤ nT ⎡ ∑ (∑ eit ) nT ⎡ ∑ (Tei• ) ⎢ ⎥ LM μ = 1 1 − = − ⎢ ⎥ ~ χ 2 (1) . 2 T 2(T − 1) ⎢ ∑∑ eit2 2 ( 1 ) − e ⎥⎦ ⎢⎣ ∑∑ it ⎥⎦ ⎣ 2 2 The two way random effect model has the null hypothesis of H 0 : σ u2 = 0 and σ v2 = 0 . The LM test combines two one-way random effect models for group and time, LM μv = LM μ + LM v ~ χ 2 (2) . 
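Although the FGLS transformation above can be carried out step by step, STATA's .xtreg command with the re option estimates the one-way random group effect model by FGLS, and the .xttest0 command issued immediately afterwards reports the Breusch-Pagan LM test. The re option and .xttest0 are not illustrated elsewhere in this document; the sketch below uses the airline data analyzed in Chapter 4 and omits the output.

. xtreg cost output fuel load, re i(airline)   // random group effect model estimated by FGLS
. xttest0                                      // Breusch-Pagan LM test of H0: the cross-sectional variance component is zero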
http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 17 3.4 Hausman Test: Fixed Effects versus Random Effects The Hausman specification test compares the fixed versus random effects under the null hypothesis that the individual effects are uncorrelated with the other regressors in the model (Hausman 1978). If correlated (H0 is rejected), a random effect model produces biased estimators, violating one of the Gauss-Markov assumptions; so a fixed effect model is preferred. Hausman’s essential result is that the covariance of an efficient estimator with its difference from an inefficient estimator is zero (Greene 2003). ' ˆ −1 (bRobust − bEfficient ) ~ χ 2 (k ) , m = (bRobust − bEfficient ) ∑ ˆ = Var[b ∑ Robust − bEfficient ] = Var (bRobust ) − Var (bEfficient ) is the difference between the estimated covariance matrix of the parameter estimates in the LSDV model (robust) and that of the random effects model (efficient). It is notable that an intercept and dummy variables SHOULD be excluded in computation. 3.5 Poolability Test What is poolability? It asks if slopes are the same across groups or over time. Thus, the null hypothesis of the poolability test is H 0 : β ik = β k . Remember that slopes remain constant in fixed and random effect models; only intercepts and error variances matter. The poolability test is undertaken under the assumption of μ ~ N (0, s 2 I NT ) . This test uses the F statistic, Fobs = (e' e − ∑ ei' ei ) (n − 1) K ∑e e ' i i n(T − K ) ~ F [(n − 1) K , n(T − K )] , where e' e is the SSE of the pooled OLS and ei' ei is the SSE of the OLS regression for group i. If the null hypothesis is rejected, the panel data are not poolable. Under this circumstance, you may go to the random coefficient model or hierarchical regression model. Similarly, the null hypothesis of the poolability test over time is H 0 : β tk = β k . The F-test is Fobs = (e' e − ∑ et' et ) (T − 1) K ∑ et' et T (n − K ) regression at time t. http://www.indiana.edu/~statmath = F [(T − 1) K , T (n − K )] , where et' et is SSE of the OLS © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 18 4. The Fixed Group Effect Model The one-way fixed group model examines group differences in the intercepts. The LSDV for this fixed model needs to create as many dummy variables as the number of groups or subjects. When many dummies are needed, the within effect model is useful since it transforms variables using group means to avoid dummies. The between effect model uses group means of variables. 4.1 The Pooled OLS Regression Model Let us first consider the pooled model without dummy variables. . regress cost output fuel load // pooled model Source | SS df MS -------------+-----------------------------Model | 112.705452 3 37.5684839 Residual | 1.33544153 86 .01552839 -------------+-----------------------------Total | 114.040893 89 1.28135835 Number of obs F( 3, 86) Prob > F R-squared Adj R-squared Root MSE = 90 = 2419.34 = 0.0000 = 0.9883 = 0.9879 = .12461 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------output | .8827385 .0132545 66.60 0.000 .8563895 .9090876 fuel | .453977 .0203042 22.36 0.000 .4136136 .4943404 load | -1.62751 .345302 -4.71 0.000 -2.313948 -.9410727 _cons | 9.516923 .2292445 41.51 0.000 9.0612 9.972645 ------------------------------------------------------------------------------ cost = 9.517 + .883*output +.454*fuel -1.628*load. This model fits the data well (p<.0000 and R2=.9883). We may, however, suspect fixed group effects that produce different intercepts across groups. As discussed in Chapter 2, there are three equivalent approaches of LSDV. They report the identical parameter estimates of regresors excluding dummies. Let us begin with LSDV1. 4.2 LSDV1 without a Dummy LSDV1 drops a dummy variable to identify the model. LSDV1 produces correct ANOVA information, goodness of fit, parameter estimates, and standard errors. As a consequence, this approach is commonly used in practice. LSDV produces six regression equations for six groups (airlines). Group1: Group2: Group3: Group4: Group5: Group6: cost cost cost cost cost cost = = = = = = 9.706 9.665 9.497 9.891 9.730 9.793 + + + + + + .919*output .919*output .919*output .919*output .919*output .919*output +.417*fuel +.417*fuel +.417*fuel +.417*fuel +.417*fuel +.417*fuel -1.070*load -1.070*load -1.070*load -1.070*load -1.070*load -1.070*load In SAS, the REG procedure fits the OLS regression model. Let us drop the last dummy g6, the reference point. PROC REG DATA=masil.airline; MODEL cost = g1-g5 output fuel load; http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 19 RUN; The REG Procedure Model: MODEL1 Dependent Variable: cost Number of Observations Read Number of Observations Used 90 90 Analysis of Variance Source DF Sum of Squares Mean Square Model Error Corrected Total 8 81 89 113.74827 0.29262 114.04089 14.21853 0.00361 Root MSE Dependent Mean Coeff Var 0.06011 13.36561 0.44970 R-Square Adj R-Sq F Value Pr > F 3935.79 <.0001 0.9974 0.9972 Parameter Estimates Variable Intercept g1 g2 g3 g4 g5 output fuel load DF Parameter Estimate Standard Error t Value Pr > |t| 1 1 1 1 1 1 1 1 1 9.79300 -0.08706 -0.12830 -0.29598 0.09749 -0.06301 0.91928 0.41749 -1.07040 0.26366 0.08420 0.07573 0.05002 0.03301 0.02389 0.02989 0.01520 0.20169 37.14 -1.03 -1.69 -5.92 2.95 -2.64 30.76 27.47 -5.31 <.0001 0.3042 0.0941 <.0001 0.0041 0.0100 <.0001 <.0001 <.0001 Note that the parameter estimate of g6 is presented in the intercept (9.793). Other dummy parameter estimates are computed with the reference point. The actual intercept of the group 1, for example, is computed as 9.706 = 9.793 + (-.087)*1 + (-.1283)*0 + (-.2960)*0 + (.0975)*0 + (-.0630)*0, where 9.793 is the reference point. STATA has the .regress command for OLS regression (LSDV). . regress cost g1-g5 output fuel load Source | SS df MS -------------+-----------------------------Model | 113.74827 8 14.2185338 Residual | .292622872 81 .003612628 -------------+-----------------------------Total | 114.040893 89 1.28135835 Number of obs F( 8, 81) Prob > F R-squared Adj R-squared Root MSE = 90 = 3935.79 = 0.0000 = 0.9974 = 0.9972 = .06011 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 20 -------------+---------------------------------------------------------------g1 | -.0870617 .0841995 -1.03 0.304 -.2545924 .080469 g2 | -.1282976 .0757281 -1.69 0.094 -.2789728 .0223776 g3 | -.2959828 .0500231 -5.92 0.000 -.395513 -.1964526 g4 | .097494 .0330093 2.95 0.004 .0318159 .1631721 g5 | -.063007 .0238919 -2.64 0.010 -.1105443 -.0154697 output | .9192846 .0298901 30.76 0.000 .8598126 .9787565 fuel | .4174918 .0151991 27.47 0.000 .3872503 .4477333 load | -1.070396 .20169 -5.31 0.000 -1.471696 -.6690963 _cons | 9.793004 .2636622 37.14 0.000 9.268399 10.31761 ------------------------------------------------------------------------------ Now, run the LIMDEP Regress$ command to fit the LSDV1. Do not forget to include ONE for the intercept in the Rhs;. --> REGRESS;Lhs=COST;Rhs=ONE,G1,G2,G3,G4,G5,OUTPUT,FUEL,LOAD$ +-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 | | Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 | | Fit: R-squared= .997434, Adjusted R-squared = .99718 | | Model test: F[ 8, 81] = 3935.82, Prob value = .00000 | | Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 | | Autocorrel: Durbin-Watson Statistic = 1.02645, Rho = .48677 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant 9.793021272 .26366104 37.142 .0000 G1 -.8707201949E-01 .84199161E-01 -1.034 .3042 .16666667 G2 -.1283060033 .75727781E-01 -1.694 .0940 .16666667 G3 -.2959885994 .50022855E-01 -5.917 .0000 .16666667 G4 .9749253376E-01 .33009146E-01 2.954 .0041 .16666667 G5 -.6300770422E-01 .23891796E-01 -2.637 .0100 .16666667 OUTPUT .9192881432 .29889967E-01 30.756 .0000 -1.1743092 FUEL .4174910457 .15199071E-01 27.468 .0000 12.770359 LOAD -1.070395015 .20168924 -5.307 .0000 .56046016 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) What if you drop a different dummy variable, say g1, instead of g6? Since the different reference point is applied, you will get different dummy coefficients. The other statistics such as goodness-of-fits, however, remain unchanged. . regress cost g2-g6 output fuel load // LSDV1 dropping g1 Source | SS df MS -------------+-----------------------------Model | 113.74827 8 14.2185338 Residual | .292622872 81 .003612628 -------------+-----------------------------Total | 114.040893 89 1.28135835 Number of obs F( 8, 81) Prob > F R-squared Adj R-squared Root MSE = 90 = 3935.79 = 0.0000 = 0.9974 = 0.9972 = .06011 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------g2 | -.0412359 .0251839 -1.64 0.105 -.0913441 .0088722 g3 | -.2089211 .0427986 -4.88 0.000 -.2940769 -.1237652 g4 | .1845557 .0607527 3.04 0.003 .0636769 .3054345 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 21 g5 | .0240547 .0799041 0.30 0.764 -.1349293 .1830387 g6 | .0870617 .0841995 1.03 0.304 -.080469 .2545924 output | .9192846 .0298901 30.76 0.000 .8598126 .9787565 fuel | .4174918 .0151991 27.47 0.000 .3872503 .4477333 load | -1.070396 .20169 -5.31 0.000 -1.471696 -.6690963 _cons | 9.705942 .193124 50.26 0.000 9.321686 10.0902 ------------------------------------------------------------------------------ When you have not created dummy variables, take advantage of the .xi prefix command.7 Note that STATA by default drops the first dummy variable while the SAS TSCSREG and PANEL procedures in 4.5.2 drops the last dummy. . xi: regress cost i.airline output fuel load i.airline _Iairline_1-6 (naturally coded; _Iairline_1 omitted) Source | SS df MS -------------+-----------------------------Model | 113.74827 8 14.2185338 Residual | .292622872 81 .003612628 -------------+-----------------------------Total | 114.040893 89 1.28135835 Number of obs F( 8, 81) Prob > F R-squared Adj R-squared Root MSE = 90 = 3935.79 = 0.0000 = 0.9974 = 0.9972 = .06011 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------_Iairline_2 | -.0412359 .0251839 -1.64 0.105 -.0913441 .0088722 _Iairline_3 | -.2089211 .0427986 -4.88 0.000 -.2940769 -.1237652 _Iairline_4 | .1845557 .0607527 3.04 0.003 .0636769 .3054345 _Iairline_5 | .0240547 .0799041 0.30 0.764 -.1349293 .1830387 _Iairline_6 | .0870617 .0841995 1.03 0.304 -.080469 .2545924 output | .9192846 .0298901 30.76 0.000 .8598126 .9787565 fuel | .4174918 .0151991 27.47 0.000 .3872503 .4477333 load | -1.070396 .20169 -5.31 0.000 -1.471696 -.6690963 _cons | 9.705942 .193124 50.26 0.000 9.321686 10.0902 ------------------------------------------------------------------------------ 4.3 LSDV2 without the Intercept LSDV2 reports actual parameter estimates of the dummies. Because LSDV2 suppresses the intercept, you will get incorrect F and R2 statistics. In the SAS REG procedure, you need to use the /NOINT option to suppress the intercept. Note that the F value of 497,985 and R2 of 1 are not likely. PROC REG DATA=masil.airline; MODEL cost = g1-g6 output fuel load /NOINT; RUN; The REG Procedure Model: MODEL1 Dependent Variable: cost Number of Observations Read Number of Observations Used 7 90 90 The STATA .xi is used either as an ordinary command or a prefix command like .bysort. This command creates dummies from a categorical variable specified in the term i. and then run the command following the colon. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 22 NOTE: No intercept in model. R-Square is redefined. 
Analysis of Variance Source DF Sum of Squares Mean Square Model Error Uncorrected Total 9 81 90 16191 0.29262 16192 1799.03381 0.00361 Root MSE Dependent Mean Coeff Var 0.06011 13.36561 0.44970 R-Square Adj R-Sq F Value Pr > F 497985 <.0001 1.0000 1.0000 Parameter Estimates Variable g1 g2 g3 g4 g5 g6 output fuel load DF Parameter Estimate Standard Error t Value Pr > |t| 1 1 1 1 1 1 1 1 1 9.70594 9.66471 9.49702 9.89050 9.73000 9.79300 0.91928 0.41749 -1.07040 0.19312 0.19898 0.22496 0.24176 0.26094 0.26366 0.02989 0.01520 0.20169 50.26 48.57 42.22 40.91 37.29 37.14 30.76 27.47 -5.31 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 STATA uses the noconstant option to suppress the intercept. Note that noc is its abbreviation. . regress cost g1-g6 output fuel load, noc Source | SS df MS -------------+-----------------------------Model | 16191.3043 9 1799.03381 Residual | .292622872 81 .003612628 -------------+-----------------------------Total | 16191.5969 90 179.906633 Number of obs F( 9, 81) Prob > F R-squared Adj R-squared Root MSE = = = = = = 90 . 0.0000 1.0000 1.0000 .06011 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------g1 | 9.705942 .193124 50.26 0.000 9.321686 10.0902 g2 | 9.664706 .198982 48.57 0.000 9.268794 10.06062 g3 | 9.497021 .2249584 42.22 0.000 9.049424 9.944618 g4 | 9.890498 .2417635 40.91 0.000 9.409464 10.37153 g5 | 9.729997 .2609421 37.29 0.000 9.210804 10.24919 g6 | 9.793004 .2636622 37.14 0.000 9.268399 10.31761 output | .9192846 .0298901 30.76 0.000 .8598126 .9787565 fuel | .4174918 .0151991 27.47 0.000 .3872503 .4477333 load | -1.070396 .20169 -5.31 0.000 -1.471696 -.6690963 ------------------------------------------------------------------------------ In LIMDEP, you need to drop ONE out of the Rhs; to suppress the intercept. Unlike SAS and STATA, LIMDEP reports correct R2 and F even in LSDV2. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 23 --> REGRESS;Lhs=COST;Rhs=G1,G2,G3,G4,G5,G6,OUTPUT,FUEL,LOAD$ +-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 | | Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 | | Fit: R-squared= .997434, Adjusted R-squared = .99718 | | Model test: F[ 8, 81] = 3935.82, Prob value = .00000 | | Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 | | Model does not contain ONE. R-squared and F can be negative! 
| | Autocorrel: Durbin-Watson Statistic = 1.02645, Rho = .48677 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ G1 9.705949253 .19312325 50.258 .0000 .16666667 G2 9.664715269 .19898117 48.571 .0000 .16666667 G3 9.497032673 .22495746 42.217 .0000 .16666667 G4 9.890513806 .24176245 40.910 .0000 .16666667 G5 9.730013568 .26094094 37.288 .0000 .16666667 G6 9.793021272 .26366104 37.142 .0000 .16666667 OUTPUT .9192881432 .29889967E-01 30.756 .0000 -1.1743092 FUEL .4174910457 .15199071E-01 27.468 .0000 12.770359 LOAD -1.070395015 .20168924 -5.307 .0000 .56046016 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) 4.4 LSDV3 with Restrictions LSDV3 imposes a restriction that the sum of the dummy parameters is zero. The SAS REG procedure uses the RESTRICT statement to impose restrictions. PROC REG DATA=masil.airline; MODEL cost = g1-g6 output fuel load; RESTRICT g1 + g2 + g3 + g4 + g5 + g6 = 0; RUN; The REG Procedure Model: MODEL1 Dependent Variable: cost NOTE: Restrictions have been applied to parameter estimates. Number of Observations Read Number of Observations Used 90 90 Analysis of Variance Source http://www.indiana.edu/~statmath DF Sum of Squares Mean Square F Value Pr > F © 2005 The Trustees of Indiana University (12/10/2005) Model Error Corrected Total Linear Regression Model for Panel Data: 24 8 81 89 113.74827 0.29262 114.04089 Root MSE Dependent Mean Coeff Var 0.06011 13.36561 0.44970 14.21853 0.00361 R-Square Adj R-Sq 3935.79 <.0001 0.9974 0.9972 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > |t| Intercept g1 g2 g3 g4 g5 g6 output fuel load RESTRICT 1 1 1 1 1 1 1 1 1 1 -1 9.71353 -0.00759 -0.04882 -0.21651 0.17697 0.01647 0.07948 0.91928 0.41749 -1.07040 3.01674E-15 0.22964 0.04562 0.03798 0.01606 0.01942 0.03669 0.04050 0.02989 0.01520 0.20169 1.51088E-10 42.30 -0.17 -1.29 -13.48 9.11 0.45 1.96 30.76 27.47 -5.31 0.00 <.0001 0.8683 0.2023 <.0001 <.0001 0.6547 0.0532 <.0001 <.0001 <.0001 1.0000* * Probability computed using beta distribution. The dummy coefficients mean deviations from the averaged group effect (9.714). The actual intercept of group 2, for example, is 9.665 =9.714+ (-.049). Note that the 3.01674E-15 of RESTRICT below is virtually zero. In STATA, you have to use the .cnsreg command rather than .regress. The command, however, does not provide an ANOVA table and goodness-of-fit statistics. . constraint define 1 g1 + g2 + g3 + g4 + g5 + g6 = 0 . cnsreg cost g1-g6 output fuel load, constraint(1) Constrained linear regression Number of obs = 90 F( 8, 81) = 3935.79 Prob > F = 0.0000 Root MSE = .06011 ( 1) g1 + g2 + g3 + g4 + g5 + g6 = 0 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------g1 | -.0075859 .0456178 -0.17 0.868 -.0983509 .0831792 g2 | -.0488218 .0379787 -1.29 0.202 -.1243875 .0267439 g3 | -.2165069 .0160624 -13.48 0.000 -.2484661 -.1845478 g4 | .1769698 .0194247 9.11 0.000 .1383208 .2156189 g5 | .0164689 .0366904 0.45 0.655 -.0565335 .0894712 g6 | .0794759 .0405008 1.96 0.053 -.001108 .1600597 output | .9192846 .0298901 30.76 0.000 .8598126 .9787565 fuel | .4174918 .0151991 27.47 0.000 .3872503 .4477333 load | -1.070396 .20169 -5.31 0.000 -1.471696 -.6690963 _cons | 9.713528 .229641 42.30 0.000 9.256614 10.17044 ------------------------------------------------------------------------------ http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 25 LIMDEP has the Cls$ subcommand to impose restrictions. Again, do not forget to include ONE in the Rhs;. --> REGRESS;Lhs=COST;Rhs=ONE,G1,G2,G3,G4,G5,G6,OUTPUT,FUEL,LOAD; Cls:b(1)+b(2)+b(3)+b(4)+b(5)+b(6)=0$ +-----------------------------------------------------------------------+ | Linearly restricted regression | | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 | | Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 | | Fit: R-squared= .997434, Adjusted R-squared = .99718 | | (Note: Not using OLS. R-squared is not bounded in [0,1] | | Model test: F[ 8, 81] = 3935.82, Prob value = .00000 | | Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 | | Note, when restrictions are imposed, R-squared can be less than zero. | | F[ 1, 80] for the restrictions = .0000, Prob = 1.0000 | | Autocorrel: Durbin-Watson Statistic = 1.02645, Rho = .48677 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Constant 12.12205614 .27886962 43.469 .0000 G1 -2.416106889 .89836871E-01 -26.894 .0000 .16666667 G2 -2.457340873 .82929154E-01 -29.632 .0000 .16666667 G3 -2.625023469 .56175656E-01 -46.729 .0000 .16666667 G4 -2.231542336 .41557714E-01 -53.697 .0000 .16666667 G5 -2.392042574 .29995908E-01 -79.746 .0000 .16666667 G6 -2.329034870 .33569388E-01 -69.380 .0000 .16666667 OUTPUT .9192881432 .29889967E-01 30.756 .0000 -1.1743092 FUEL .4174910457 .15199071E-01 27.468 .0000 12.770359 LOAD -1.070395015 .20168924 -5.307 .0000 .56046016 LSDV3 in LIMDEP reports different dummy coefficients. But you may draw actual intercepts of groups in a manner similar to what you would do in SAS and STATA. The actual intercept of group 3, for example, is 9.497 = 12.122 + (-2.625). 4.5 Within Group Effect Model The within effect model does not use the dummies and thus has larger degrees of freedom, smaller MSE, and smaller standard errors of parameters than those of LSDV. As a consequence, you need to adjust standard errors. This model does not report individual dummy coefficients either. The SAS TSCSREG procedure and LIMDEP Regress$ command report the adjusted (correct) MSE, SEE (Root MSE), R2, and standard errors. 
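In addition to the manual transformation and the .xtreg command demonstrated below, STATA's .areg command fits the same within effect model by absorbing the group dummies; it reports the slope coefficients of LSDV together with standard errors adjusted for the absorbed dummies. A one-line sketch with the airline data (output omitted):

. areg cost output fuel load, absorb(airline)   // within (fixed group) effect model; airline dummies absorbed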
4.5.1 Estimating the Within Effect Model http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 26 First, let us manually estimate the within group effect model in STATA. You need to compute group means and transform dependent and independent variables using group means (log is skipped here). . . . . egen egen egen egen gm_cost=mean(cost), by(airline) // compute group means gm_output=mean(output), by(airline) gm_fuel=mean(fuel), by(airline) gm_load=mean(load), by(airline) You will get the following group means of variables. +------------------------------------------------------+ | airline gm_cost gm_output gm_fuel gm_load | |------------------------------------------------------| | 1 14.67563 .3192696 12.7318 .5971917 | | 2 14.37247 -.033027 12.75171 .5470946 | | 3 13.37231 -.9122626 12.78972 .5845358 | | 4 13.1358 -1.635174 12.77803 .5476773 | | 5 12.36304 -2.285681 12.7921 .5664859 | | 6 12.27441 -2.49898 12.7788 .5197756 | +------------------------------------------------------+ . . . . gen gen gen gen gw_cost = gw_output gw_fuel = gw_load = cost - gm_cost // compute deviations from the group means = output - gm_output fuel - gm_fuel load - gm_load Now, we are ready to run the within effect model. Keep in mind that you have to suppress the intercept. Carefully check MSE, SEE, R2, and standard errors. . regress gw_cost gw_output gw_fuel gw_load, noc // within effect Source | SS df MS -------------+-----------------------------Model | 39.0683861 3 13.0227954 Residual | .292622861 87 .003363481 -------------+-----------------------------Total | 39.361009 90 .437344544 Number of obs F( 3, 87) Prob > F R-squared Adj R-squared Root MSE = 90 = 3871.82 = 0.0000 = 0.9926 = 0.9923 = .058 -----------------------------------------------------------------------------gw_cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gw_output | .9192846 .028841 31.87 0.000 .86196 .9766092 gw_fuel | .4174918 .0146657 28.47 0.000 .3883422 .4466414 gw_load | -1.070396 .1946109 -5.50 0.000 -1.457206 -.6835858 ------------------------------------------------------------------------------ You may compute group intercepts using d g* = y g • − β ' x g • . For example, the intercept of airline 5 is computed as 9.730 = 12.363 – {.919*(-2.286) + .417*12.792 + (-1.073)*.566 }. In order to get the correct standard errors, you need to adjust them using the ratio of degrees of freedom of the within effect model and the LSDV. For example, the standard error of the logged output is computed as .0299=.0288*sqrt(87/81). 4.5.2 Using the SAS TSCSREG and PANEL Procedures The TSCSREG and PANEL procedures of SAS/ETS allows users to fit the within effect model conveniently. The procedures, in fact, report LSDV1, but you do not need to create dummy http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 27 variables and compute deviations from the group means. This procedures reports correct MSE, SEE, R2, and standard errors, and conducts the F test for the fixed group effect as well. 
PROC SORT DATA=masil.airline; BY airline year; PROC TSCSREG DATA=masil.airline; ID airline year; MODEL cost = output fuel load /FIXONE; RUN; The TSCSREG Procedure Dependent Variable: cost Model Description Estimation Method Number of Cross Sections Time Series Length FixOne 6 15 Fit Statistics SSE MSE R-Square 0.2926 0.0036 0.9974 DFE Root MSE 81 0.0601 F Test for No Fixed Effects Num DF Den DF F Value Pr > F 5 81 57.73 <.0001 Parameter Estimates DF Estimate Standard Error t Value Pr > |t| CS1 1 -0.08706 0.0842 -1.03 0.3042 CS2 1 -0.1283 0.0757 -1.69 0.0941 CS3 1 -0.29598 0.0500 -5.92 <.0001 CS4 1 0.097494 0.0330 2.95 0.0041 CS5 1 -0.06301 0.0239 -2.64 0.0100 Intercept output fuel load 1 1 1 1 9.793004 0.919285 0.417492 -1.0704 0.2637 0.0299 0.0152 0.2017 37.14 30.76 27.47 -5.31 <.0001 <.0001 <.0001 <.0001 Variable http://www.indiana.edu/~statmath Label Cross Sectional Effect 1 Cross Sectional Effect 2 Cross Sectional Effect 3 Cross Sectional Effect 4 Cross Sectional Effect 5 Intercept © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 28 Note that a data set needs to be sorted in advance by variables to appear in the ID statement of the TSCSREG and PANEL procedures. The following PANEL procedure returns the same output. PROC PANEL DATA=masil.airline; ID airline year; MODEL cost = output fuel load /FIXONE; RUN; 4.5.3 Using STATA The STATA .xtreg command fits the within group effect model without creating dummy variables. The command reports correct standard errors and the F test for fixed group effects. This command, however, does not provide an analysis of variance (ANOVA) table and correct R2 and F statistics. The .xtreg command should follow the .tsset command that specifies grouping and time variables. . tsset airline year panel variable: time variable: airline, 1 to 6 year, 1 to 15 The fe of .xtreg indicates the within effect model and i(airline) specifies airline as the independent unit. Note that this command reports adjusted (correct) standard errors. . xtreg cost output fuel load, fe i(airline) // within group effect Fixed-effects (within) regression Group variable (i): airline Number of obs Number of groups = = 90 6 R-sq: Obs per group: min = avg = max = 15 15.0 15 within = 0.9926 between = 0.9856 overall = 0.9873 corr(u_i, Xb) = -0.3475 F(3,81) Prob > F = = 3604.80 0.0000 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------output | .9192846 .0298901 30.76 0.000 .8598126 .9787565 fuel | .4174918 .0151991 27.47 0.000 .3872503 .4477333 load | -1.070396 .20169 -5.31 0.000 -1.471696 -.6690963 _cons | 9.713528 .229641 42.30 0.000 9.256614 10.17044 -------------+---------------------------------------------------------------sigma_u | .1320775 sigma_e | .06010514 rho | .82843653 (fraction of variance due to u_i) -----------------------------------------------------------------------------F test that all u_i=0: F(5, 81) = 57.73 Prob > F = 0.0000 The last line of the output tests the null hypothesis that all dummy parameters in LSDV1 are zero (e.g., g1=0, g2=0, g3=0, g4=0, and g5=0). Not the intercept of 9.714 is that of LSDV3. 4.5.4 Using LIMDEP http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 29 In LIMDEP, you have to specify the panel data model and stratification or time variables. 
The Panel$ and Fixed$ subcommands mean a fixed effect panel data model. The Str$ subcommand specifies a stratification variable. --> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=AIRLINE;Fixed$ +-----------------------------------------------------------------------+ | OLS Without Group Dummy Variables | | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 4, Deg.Fr.= 86 | | Residuals: Sum of squares= 1.335449522 , Std.Dev.= .12461 | | Fit: R-squared= .988290, Adjusted R-squared = .98788 | | Model test: F[ 3, 86] = 2419.33, Prob value = .00000 | | Diagnostic: Log-L = 61.7699, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -4.122, Akaike Info. Crt.= -1.284 | | Panel Data Analysis of COST [ONE way] | | Unconditional ANOVA (No regressors) | | Source Variation Deg. Free. Mean Square | | Between 74.6799 5. 14.9360 | | Residual 39.3611 84. .468584 | | Total 114.041 89. 1.28136 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ OUTPUT .8827386341 .13254552E-01 66.599 .0000 -1.1743092 FUEL .4539777119 .20304240E-01 22.359 .0000 12.770359 LOAD -1.627507797 .34530293 -4.713 .0000 .56046016 Constant 9.516912231 .22924522 41.514 .0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +-----------------------------------------------------------------------+ | Least Squares with Group Dummy Variables | | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 | | Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 | | Fit: R-squared= .997434, Adjusted R-squared = .99718 | | Model test: F[ 8, 81] = 3935.82, Prob value = .00000 | | Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 | | Estd. Autocorrelation of e(i,t) .573531 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ OUTPUT .9192881432 .29889967E-01 30.756 .0000 -1.1743092 FUEL .4174910457 .15199071E-01 27.468 .0000 12.770359 LOAD -1.070395015 .20168924 -5.307 .0000 .56046016 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) LIMDEP reports both the pooled OLS regression and the within effect model. Like the SAS TSCSREG procedure, LIMDEP provides correct MSE, SEE, R2, and standard errors. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 30 4.6 Between Group Effect Model: Group Mean Regression The between effect model uses aggregate information, group means of variables. In other words, the unit of analysis is not an individual observation, but groups or subjects. The number of observations jumps down to n from nT. This group mean regression produces different goodness-of-fits and parameter estimates from those of LSDV and the within effect model. Let us compute group means and run the OLS regression with them. 
The .collapse command computes aggregate information and saves into a new data set. Note that /// links two command lines. . collapse (mean) gm_cost=cost (mean) gm_output=output (mean) gm_fuel=fuel (mean) /// gm_load=load, by(airline) . regress gm_cost gm_output gm_fuel gm_load Source | SS df MS -------------+-----------------------------Model | 4.94698124 3 1.64899375 Residual | .031675926 2 .015837963 -------------+-----------------------------Total | 4.97865717 5 .995731433 Number of obs F( 3, 2) Prob > F R-squared Adj R-squared Root MSE = = = = = = 6 104.12 0.0095 0.9936 0.9841 .12585 -----------------------------------------------------------------------------gm_cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------gm_output | .7824568 .1087646 7.19 0.019 .3144803 1.250433 gm_fuel | -5.523904 4.478718 -1.23 0.343 -24.79427 13.74647 gm_load | -1.751072 2.743167 -0.64 0.589 -13.55397 10.05182 _cons | 85.8081 56.48199 1.52 0.268 -157.2143 328.8305 ------------------------------------------------------------------------------ The SAS PANEL procedure has the /BTWNG and /BTWNT option to estimate the between effect model. The TSCSREG procedure does not have this option. PROC PANEL DATA=masil.airline; ID airline year; MODEL cost = output fuel load /BTWNG; RUN; The PANEL Procedure Between Groups Estimates Dependent Variable: cost Model Description Estimation Method Number of Cross Sections Time Series Length BtwGrps 6 15 Fit Statistics SSE MSE http://www.indiana.edu/~statmath 0.0317 0.0158 DFE Root MSE 2 0.1258 © 2005 The Trustees of Indiana University (12/10/2005) R-Square Linear Regression Model for Panel Data: 31 0.9936 Parameter Estimates Variable DF Estimate Standard Error t Value Pr > |t| 1 1 1 1 85.80901 0.782455 -5.52398 -1.75102 56.4830 0.1088 4.4788 2.7432 1.52 7.19 -1.23 -0.64 0.2681 0.0188 0.3427 0.5886 Intercept output fuel load Label Intercept The STATA .xtreg command has the be option to fit the between effect model. This command, however, does not report the ANOVA table. . xtreg cost output fuel load, be i(airline) Between regression (regression on group means) Group variable (i): airline Number of obs Number of groups = = 90 6 R-sq: Obs per group: min = avg = max = 15 15.0 15 within = 0.8808 between = 0.9936 overall = 0.1371 sd(u_i + avg(e_i.))= .1258491 F(3,2) Prob > F = = 104.12 0.0095 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------output | .7824552 .1087663 7.19 0.019 .3144715 1.250439 fuel | -5.523978 4.478802 -1.23 0.343 -24.79471 13.74675 load | -1.751016 2.74319 -0.64 0.589 -13.55401 10.05198 _cons | 85.80901 56.48302 1.52 0.268 -157.2178 328.8358 ------------------------------------------------------------------------------ LIMDEP has the Mean; subcommand to fit the between effect model. --> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=AIRLINE;Means$ +-----------------------------------------------------------------------+ | Group Means Regression | | Ordinary least squares regression Weighting variable = none | | Dep. var. = YBAR(i.) 
Mean= 13.36560933 , S.D.= .9978636346 | | Model size: Observations = 6, Parameters = 4, Deg.Fr.= 2 | | Residuals: Sum of squares= .3167277206E-01, Std.Dev.= .12584 | | Fit: R-squared= .993638, Adjusted R-squared = .98410 | | Model test: F[ 3, 2] = 104.13, Prob value = .00953 | | Diagnostic: Log-L = 7.2185, Restricted(b=0) Log-L = -7.9538 | | LogAmemiyaPrCrt.= -3.635, Akaike Info. Crt.= -1.073 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ OUTPUT .7824472689 .10876126 7.194 .0000 .23025612E-11 FUEL -5.524437466 4.4786519 -1.234 .2174 .18642891 LOAD -1.750947653 2.7430470 -.638 .5233 .32541105 Constant 85.81483169 56.481148 1.519 .1287 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 32 4.7 Testing Fixed Group Effects (F-test) How do we know whether there are fixed group effects? The null hypothesis is that all dummy parameters except one are zero: H 0 : μ1 = ... = μ n −1 = 0 . In order to conduct a F-test, let us take the SSE (e’e) of 1.3354 from the pooled OLS regression and .2926 from the LSDVs (LSDV1 through LSDV3) or the within effect model. Alternatively, you may draw R2 of .9974 from LSDV1 or LSDV3 and .9883 from the pooled OLS. Do not, however, use LSDV2 and the within effect model for R2. The Fstatistic is computed as (1.3354 − .2926) (6 − 1) (.9974 − .9883) (6 − 1) = ~ 57.7319[5,81] . (.2926) (90 − 6 − 3) (1 − .9974) (90 − 6 − 3) The large F statistic rejects the null hypothesis in favor of the fixed group effect model (p<.0000). The SAS TSCSREG and PANEL procedures and STATA .xtreg command by default conduct the F test. Alternatively, you may conduct the same test with LSDV1. In SAS, add the TEST statement in the REG procedure and run the procedure again (other outputs are skipped). PROC REG DATA=masil.airline; MODEL cost = g1-g5 output fuel load; TEST g1 = g2 = g3 = g4 = g5 = 0; RUN; The REG Procedure Model: MODEL1 Test 1 Results for Dependent Variable cost Source DF Mean Square Numerator Denominator 5 81 0.20856 0.00361 F Value Pr > F 57.73 <.0001 In STATA, run the .test command, a follow-up command for the Wald test, right after estimating the model. . quietly regress cost g1-g5 output fuel load // LSDV1 . test g1 g2 g3 g4 g5 ( ( ( ( ( 1) 2) 3) 4) 5) g1 g2 g3 g4 g5 F( = = = = = 0 0 0 0 0 5, 81) = Prob > F = 57.73 0.0000 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 33 4.8 Summary Table 6 summarizes the estimation of panel data models in SAS, STATA, and LIMDEP. The SAS REG and TSCSREG procedures are generally preferred to STATA and LIMDEP commands. Table 6 Comparison of the Fixed Effect Model in SAS, STATA, LIMDEP* SAS 9.1 STATA 9.0 LIMDEP 8.0 OLS estimation LSDV1 LSDV2 PROC REG; Correct Incorrect F, (adjusted) R2 . regress (cnsreg) Correct Incorrect F, (adjusted) R2 LSDV3 Correct . cnsreg command No R2 , ANOVA table but F . 
xtreg Regress$ Correct (slightly different F) Correct (slightly different F) Correct R2 Correct (slightly different F) Different dummy coefficients Regress; Panel$ PROC TSCSREG; PROC PANEL; Estimation type LSDV1 Within and between effect Within effect SSE (e’e) Correct No Correct MSE or SEE Correct (adjusted) No Correct (adjusted) SEE Model test (F) No Incorrect Slightly different F (adjusted) R2 Correct Incorrect Correct Intercept Correct LSDV3 intercept No Coefficients Correct Correct Correct Standard errors Correct (adjusted) Correct (adjusted) Correct (adjusted) Effect test (F) Yes Yes No Between effect Yes (PROC PANEL;) N/A Yes (the be option) * “Yes/No” means whether the software reports the statistics. “Correct/incorrect” indicates whether the statistics are different from those of the least squares dummy variable (LSDV) 1 without a dummy variable. Panel Estimation http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 34 5. The Fixed Time Effect Model The fixed time effect model investigates how time affects the intercept using time dummy variables. The logic and method are the same as those of the fixed group effect model. 5.1 Least Squares Dummy Variable Models The least squares dummy variable (LSDV) model produces fifteen regression equations. This section does not present all outputs, but one or two for each LSDV approach. Time01: Time02: Time03: Time04: Time05: Time06: Time07: Time08: Time09: Time10: Time11: Time12: Time13: Time14: Time15: cost cost cost cost cost cost cost cost cost cost cost cost cost cost cost = = = = = = = = = = = = = = = 20.496 20.578 20.656 20.741 21.200 21.412 21.503 21.654 21.830 22.114 22.465 22.651 22.617 22.552 22.537 + + + + + + + + + + + + + + + .868*output .868*output .868*output .868*output .868*output .868*output .868*output .868*output .868*output .868*output .868*output .868*output .868*output .868*output .868*output - .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel .484*fuel -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load -1.954*load 5.1.1 LSDV1 without a Dummy Let us begin with the SAS REG procedure. The test statement examines fixed time effects. 
PROC REG DATA=masil.airline; MODEL cost = t1-t14 output fuel load; RUN; The REG Procedure Model: MODEL1 Dependent Variable: cost Number of Observations Read Number of Observations Used 90 90 Analysis of Variance Source DF Sum of Squares Mean Square Model Error Corrected Total 17 72 89 112.95270 1.08819 114.04089 6.64428 0.01511 Root MSE http://www.indiana.edu/~statmath 0.12294 R-Square F Value Pr > F 439.62 <.0001 0.9905 © 2005 The Trustees of Indiana University (12/10/2005) Dependent Mean Coeff Var Linear Regression Model for Panel Data: 35 13.36561 0.91981 Adj R-Sq 0.9882 Parameter Estimates Variable Intercept t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 output fuel load DF Parameter Estimate Standard Error t Value Pr > |t| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 22.53677 -2.04096 -1.95873 -1.88103 -1.79601 -1.33693 -1.12514 -1.03341 -0.88274 -0.70719 -0.42296 -0.07144 0.11457 0.07979 0.01546 0.86773 -0.48448 -1.95440 4.94053 0.73469 0.72275 0.72036 0.69882 0.50604 0.40862 0.37642 0.32601 0.29470 0.16679 0.07176 0.09841 0.08442 0.07264 0.01541 0.36411 0.44238 4.56 -2.78 -2.71 -2.61 -2.57 -2.64 -2.75 -2.75 -2.71 -2.40 -2.54 -1.00 1.16 0.95 0.21 56.32 -1.33 -4.42 <.0001 0.0070 0.0084 0.0110 0.0122 0.0101 0.0075 0.0076 0.0085 0.0190 0.0134 0.3228 0.2482 0.3477 0.8320 <.0001 0.1875 <.0001 The following are the corresponding STATA and LIMDEP commands for LSDV1 (outputs are skipped). . regress cost t1-t14 output fuel load REGRESS;Lhs=COST;Rhs=ONE,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,OUTPUT,FUEL,LOAD$ 5.1.2 LSDV2 without the Intercept Let us use LIMDEP to fit LSDV2 because it reports correct (although slightly different) F and R2 statistics. --> REGRESS;Lhs=COST;Rhs=T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,OUTPUT,FUEL,LOAD$ +-----------------------------------------------------------------------+ | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560929 , S.D.= 1.131971002 | | Model size: Observations = 90, Parameters = 18, Deg.Fr.= 72 | | Residuals: Sum of squares= 1.088190223 , Std.Dev.= .12294 | | Fit: R-squared= .990458, Adjusted R-squared = .98820 | | Model test: F[ 17, 72] = 439.62, Prob value = .00000 | | Diagnostic: Log-L = 70.9837, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -4.010, Akaike Info. Crt.= -1.177 | | Model does not contain ONE. R-squared and F can be negative! 
| | Autocorrel: Durbin-Watson Statistic = 2.93900, Rho = -.46950 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 36 T1 20.49580478 4.2095283 4.869 .0000 .66666667E-01 T2 20.57803885 4.2215262 4.875 .0000 .66666667E-01 T3 20.65573100 4.2241771 4.890 .0000 .66666667E-01 T4 20.74075857 4.2457497 4.885 .0000 .66666667E-01 T5 21.19983202 4.4403312 4.774 .0000 .66666667E-01 T6 21.41162082 4.5386212 4.718 .0000 .66666667E-01 T7 21.50335085 4.5713968 4.704 .0000 .66666667E-01 T8 21.65402827 4.6228858 4.684 .0000 .66666667E-01 T9 21.82957108 4.6569062 4.688 .0000 .66666667E-01 T10 22.11380260 4.7926483 4.614 .0000 .66666667E-01 T11 22.46532734 4.9499089 4.539 .0000 .66666667E-01 T12 22.65133704 5.0085924 4.522 .0000 .66666667E-01 T13 22.61655508 4.9861391 4.536 .0000 .66666667E-01 T14 22.55222832 4.9559418 4.551 .0000 .66666667E-01 T15 22.53676562 4.9405321 4.562 .0000 .66666667E-01 OUTPUT .8677267843 .15408184E-01 56.316 .0000 -1.1743092 FUEL -.4844835367 .36410849 -1.331 .1875 12.770359 LOAD -1.954404328 .44237771 -4.418 .0000 .56046015 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) The following are the corresponding SAS REG procedure and STATA command for LSDV2 (outputs are skipped). PROC REG DATA=masil.airline; MODEL cost = t1-t15 output fuel load /NOINT; RUN; . regress cost t1-t15 output fuel load, noc 5.1.3 LSDV3 with a Restriction In SAS, you need to use the RESTRICT statement to impose a restriction. PROC REG DATA=masil.airline; MODEL cost = t1-t15 output fuel load; RESTRICT t1+t2+t3+t4+t5+t6+t7+t8+t9+t10+t11+t12+t13+t14+t15=0; RUN; The REG Procedure Model: MODEL1 Dependent Variable: cost NOTE: Restrictions have been applied to parameter estimates. Number of Observations Read Number of Observations Used 90 90 Analysis of Variance Source DF Sum of Squares Mean Square Model Error Corrected Total 17 72 89 112.95270 1.08819 114.04089 6.64428 0.01511 http://www.indiana.edu/~statmath F Value Pr > F 439.62 <.0001 © 2005 The Trustees of Indiana University (12/10/2005) Root MSE Dependent Mean Coeff Var Linear Regression Model for Panel Data: 37 0.12294 13.36561 0.91981 R-Square Adj R-Sq 0.9905 0.9882 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Intercept t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 output fuel load RESTRICT 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 21.66698 -1.17118 -1.08894 -1.01125 -0.92622 -0.46715 -0.25536 -0.16363 -0.01296 0.16259 0.44682 0.79834 0.98435 0.94957 0.88524 0.86978 0.86773 -0.48448 -1.95440 -3.946E-15 4.62405 0.41783 0.40586 0.40323 0.38177 0.19076 0.09856 0.07190 0.04862 0.06271 0.17599 0.32940 0.38756 0.36537 0.33549 0.32029 0.01541 0.36411 0.44238 . 4.69 -2.80 -2.68 -2.51 -2.43 -2.45 -2.59 -2.28 -0.27 2.59 2.54 2.42 2.54 2.60 2.64 2.72 56.32 -1.33 -4.42 . Pr > |t| <.0001 0.0065 0.0090 0.0144 0.0178 0.0168 0.0116 0.0258 0.7907 0.0115 0.0133 0.0179 0.0132 0.0113 0.0102 0.0083 <.0001 0.1875 <.0001 . * Probability computed using beta distribution. In STATA, define the restriction with the .constraint command and specify the restriction using the constraint() option of the .cnsreg command. . 
constraint define 3 t1+t2+t3+t4+t5+t6+t7+t8+t9+t10+t11+t12+t13+t14+t15=0 . cnsreg cost t1-t15 output fuel load, constraint(3) Constrained linear regression Number of obs = 90 F( 17, 72) = 439.62 Prob > F = 0.0000 Root MSE = .12294 ( 1) t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9 + t10 + t11 + t12 + t13 + t14 + t15 = 0 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------t1 | -1.171179 .4178338 -2.80 0.007 -2.004115 -.3382422 t2 | -1.088945 .4058579 -2.68 0.009 -1.898008 -.2798816 t3 | -1.011252 .4032308 -2.51 0.014 -1.815078 -.2074266 t4 | -.9262249 .3817675 -2.43 0.018 -1.687265 -.1651852 t5 | -.4671515 .1907596 -2.45 0.017 -.8474239 -.0868791 t6 | -.2553627 .0985615 -2.59 0.012 -.4518415 -.0588839 t7 | -.1636326 .0718969 -2.28 0.026 -.3069564 -.0203088 t8 | -.0129552 .0486249 -0.27 0.791 -.1098872 .0839768 t9 | .1625876 .0627099 2.59 0.012 .0375776 .2875976 t10 | .4468191 .175994 2.54 0.013 .0959814 .7976568 t11 | .7983439 .3294027 2.42 0.018 .1416916 1.454996 t12 | .9843536 .3875583 2.54 0.013 .2117702 1.756937 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 38 t13 | .9495716 .3653675 2.60 0.011 .2212248 1.677918 t14 | .8852448 .3354912 2.64 0.010 .2164554 1.554034 t15 | .8697821 .3202933 2.72 0.008 .2312891 1.508275 output | .8677268 .0154082 56.32 0.000 .8370111 .8984424 fuel | -.4844835 .3641085 -1.33 0.188 -1.210321 .2413535 load | -1.954404 .4423777 -4.42 0.000 -2.836268 -1.07254 _cons | 21.66698 4.624053 4.69 0.000 12.4491 30.88486 ------------------------------------------------------------------------------ The following are the corresponding LIMDEP command for LSDV3 (outputs are skipped). REGRESS;Lhs=COST;Rhs=ONE,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,OUTPUT,FUEL,LOAD; Cls:b(1)+b(2)+b(3)+b(4)+b(5)+b(6)+b(7)+b(8)+b(9)+b(10)+b(11)+b(12)+b(13)+b(14)+b(15)=0$ 5.2 Within Time Effect Model The within effect mode for the fixed time effects needs to compute deviations from the time means. Keep in mind that the intercept should be suppressed. 5.2.1 Estimating the Time Effect Model Let us manually estimate the fixed time effect model first. . . . . egen egen egen egen tm_cost = tm_output tm_fuel = tm_load = mean(cost), by(year) // compute time means = mean(output), by(year) mean(fuel), by(year) mean(load), by(year) +---------------------------------------------------+ | year tm_cost tm_output tm_fuel tm_load | |---------------------------------------------------| | 1 12.36897 -1.790283 11.63606 .4788587 | | 2 12.45963 -1.744389 11.66868 .4868322 | | 3 12.60706 -1.577767 11.67494 .52358 | | 4 12.77912 -1.443695 11.73193 .5244486 | | 5 12.94143 -1.398122 12.26843 .5635266 | | 6 13.0452 -1.393002 12.53826 .5541809 | | 7 13.15965 -1.302416 12.62714 .5607425 | | 8 13.29884 -1.222963 12.76768 .5670587 | | 9 13.4651 -1.067003 12.86104 .6179098 | | 10 13.70187 -.9023156 13.23183 .6233943 | | 11 13.91324 -.9205539 13.66246 .5802577 | | 12 14.05984 -.8641667 13.82315 .5856243 | | 13 14.12841 -.7923916 13.75979 .5803183 | | 14 14.23517 -.6428015 13.67403 .5804528 | | 15 14.32062 -.5527684 13.62997 .5797168 | +---------------------------------------------------+ . . . . gen gen gen gen tw_cost = tw_output tw_fuel = tw_load = cost - tm_cost // transform variables = output - tm_output fuel - tm_fuel load - tm_load . 
regress tw_cost tw_output tw_fuel tw_load, noc Source | SS df MS -------------+-----------------------------Model | 75.6459391 3 25.215313 Residual | 1.08819023 87 .012507934 -------------+-----------------------------Total | 76.7341294 90 .852601437 http://www.indiana.edu/~statmath // within time effect Number of obs F( 3, 87) Prob > F R-squared Adj R-squared Root MSE = 90 = 2015.95 = 0.0000 = 0.9858 = 0.9853 = .11184 © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 39 -----------------------------------------------------------------------------tw_cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------tw_output | .8677268 .0140171 61.90 0.000 .8398663 .8955873 tw_fuel | -.4844836 .3312359 -1.46 0.147 -1.142851 .1738836 tw_load | -1.954404 .4024388 -4.86 0.000 -2.754295 -1.154514 ------------------------------------------------------------------------------ If you want to get intercepts of years, use d t* = y•t − β ' x•t . For example, the intercept of year 7 is 21.503=13.1597-{.8677*(-1.3024) + (-.4845)*12.6271 + (-1.9544)*.5607}. As discussed previously, the standard errors of the within effects model need to be adjusted. For instance, the correct standard error of fuel price is computed as .364 = .3312*sqrt(87/72). 5.2.2 Using the TSCSREG and PANEL procedures You need to sort the data set by variables (i.e., year and airline) to appear in the ID statement of the TSCSREG and PANEL procedures. PROC SORT DATA=masil.airline; BY year airline; PROC PANEL DATA=masil.airline; ID year airline; MODEL cost = output fuel load /FIXONE; RUN; The PANEL Procedure Fixed One Way Estimates Dependent Variable: cost Model Description Estimation Method Number of Cross Sections Time Series Length FixOne 15 6 Fit Statistics SSE MSE R-Square 1.0882 0.0151 0.9905 DFE Root MSE 72 0.1229 F Test for No Fixed Effects Num DF Den DF F Value Pr > F 14 72 1.17 0.3178 Parameter Estimates Variable DF Estimate http://www.indiana.edu/~statmath Standard Error t Value Pr > |t| Label © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 40 CS1 1 -2.04096 0.7347 -2.78 0.0070 CS2 1 -1.95873 0.7228 -2.71 0.0084 CS3 1 -1.88103 0.7204 -2.61 0.0110 CS4 1 -1.79601 0.6988 -2.57 0.0122 CS5 1 -1.33693 0.5060 -2.64 0.0101 CS6 1 -1.12514 0.4086 -2.75 0.0075 CS7 1 -1.03341 0.3764 -2.75 0.0076 CS8 1 -0.88274 0.3260 -2.71 0.0085 CS9 1 -0.70719 0.2947 -2.40 0.0190 CS10 1 -0.42296 0.1668 -2.54 0.0134 CS11 1 -0.07144 0.0718 -1.00 0.3228 CS12 1 0.114571 0.0984 1.16 0.2482 CS13 1 0.079789 0.0844 0.95 0.3477 CS14 1 0.015463 0.0726 0.21 0.8320 Intercept output fuel load 1 1 1 1 22.53677 0.867727 -0.48448 -1.9544 4.9405 0.0154 0.3641 0.4424 4.56 56.32 -1.33 -4.42 <.0001 <.0001 0.1875 <.0001 Cross Sectional Effect 1 Cross Sectional Effect 2 Cross Sectional Effect 3 Cross Sectional Effect 4 Cross Sectional Effect 5 Cross Sectional Effect 6 Cross Sectional Effect 7 Cross Sectional Effect 8 Cross Sectional Effect 9 Cross Sectional Effect 10 Cross Sectional Cross Sectional Effect 12 Cross Sectional Effect 13 Cross Sectional Effect 14 Intercept The following TSCSREG procedure gives the same outputs. PROC TSCSREG DATA=masil.airline; ID year airline; MODEL cost = output fuel load /FIXONE; RUN; 5.2.3 Using STATA The STATA .xtreg command uses the fe option for the fixed effect model. . 
xtreg cost output fuel load, fe i(year) Fixed-effects (within) regression Group variable (i): year Number of obs Number of groups = = 90 15 R-sq: Obs per group: min = avg = max = 6 6.0 6 within = 0.9858 between = 0.4812 overall = 0.5265 corr(u_i, Xb) = -0.1503 F(3,72) Prob > F = = 1668.37 0.0000 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 41 -------------+---------------------------------------------------------------output | .8677268 .0154082 56.32 0.000 .8370111 .8984424 fuel | -.4844835 .3641085 -1.33 0.188 -1.210321 .2413535 load | -1.954404 .4423777 -4.42 0.000 -2.836268 -1.07254 _cons | 21.66698 4.624053 4.69 0.000 12.4491 30.88486 -------------+---------------------------------------------------------------sigma_u | .8027907 sigma_e | .12293801 rho | .97708602 (fraction of variance due to u_i) -----------------------------------------------------------------------------F test that all u_i=0: F(14, 72) = 1.17 Prob > F = 0.3178 5.2.4 Using LIMDEP You need to pay attention to the Str=; subcommand for stratification. --> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=YEAR;Fixed$ +-----------------------------------------------------------------------+ | OLS Without Group Dummy Variables | | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 4, Deg.Fr.= 86 | | Residuals: Sum of squares= 1.335449522 , Std.Dev.= .12461 | | Fit: R-squared= .988290, Adjusted R-squared = .98788 | | Model test: F[ 3, 86] = 2419.33, Prob value = .00000 | | Diagnostic: Log-L = 61.7699, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -4.122, Akaike Info. Crt.= -1.284 | | Panel Data Analysis of COST [ONE way] | | Unconditional ANOVA (No regressors) | | Source Variation Deg. Free. Mean Square | | Between 37.3068 14. 2.66477 | | Residual 76.7341 75. 1.02312 | | Total 114.041 89. 1.28136 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ OUTPUT .8827386341 .13254552E-01 66.599 .0000 -1.1743092 FUEL .4539777119 .20304240E-01 22.359 .0000 12.770359 LOAD -1.627507797 .34530293 -4.713 .0000 .56046016 Constant 9.516912231 .22924522 41.514 .0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +-----------------------------------------------------------------------+ | Least Squares with Group Dummy Variables | | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 18, Deg.Fr.= 72 | | Residuals: Sum of squares= 1.088193393 , Std.Dev.= .12294 | | Fit: R-squared= .990458, Adjusted R-squared = .98820 | | Model test: F[ 17, 72] = 439.62, Prob value = .00000 | | Diagnostic: Log-L = 70.9836, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -4.010, Akaike Info. Crt.= -1.177 | | Estd. 
Autocorrelation of e(i,t) .573531 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 42 OUTPUT .8677268093 .15408179E-01 56.316 .0000 -1.1743092 FUEL -.4844946699 .36410984 -1.331 .1868 12.770359 LOAD -1.954414378 .44237791 -4.418 .0000 .56046016 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +------------------------------------------------------------------------+ | Test Statistics for the Classical Model | | | | Model Log-Likelihood Sum of Squares R-squared | | (1) Constant term only -138.35814 .1140409821D+03 .0000000 | | (2) Group effects only -120.52864 .7673414157D+02 .3271354 | | (3) X - variables only 61.76991 .1335449522D+01 .9882897 | | (4) X and group effects 70.98362 .1088193393D+01 .9904579 | | | | Hypothesis Tests | | Likelihood Ratio Test F Tests | | Chi-squared d.f. Prob. F num. denom. Prob value | | (2) vs (1) 35.659 14 .00117 2.605 14 75 .00404 | | (3) vs (1) 400.256 3 .00000 2419.329 3 86 .00000 | | (4) vs (1) 418.684 17 .00000 439.617 17 72 .00000 | | (4) vs (2) 383.025 3 .00000 1668.364 3 72 .00000 | | (4) vs (3) 18.427 14 .18800 1.169 14 72 .31776 | +------------------------------------------------------------------------+ 5.3 Between Time Effect Model The between effect model regresses time means of dependent variables on those of independent variables. See also 3.2 and 4.6. . collapse (mean) tm_cost=cost (mean) tm_output=output (mean) tm_fuel=fuel /// (mean) tm_load=load, by(year) . regress tm_cost tm_output tm_fuel tm_load // between time effect Source | SS df MS -------------+-----------------------------Model | 6.21220479 3 2.07073493 Residual | .005590631 11 .000508239 -------------+-----------------------------Total | 6.21779542 14 .444128244 Number of obs F( 3, 11) Prob > F R-squared Adj R-squared Root MSE = 15 = 4074.33 = 0.0000 = 0.9991 = 0.9989 = .02254 -----------------------------------------------------------------------------tm_cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------tm_output | 1.133337 .0512898 22.10 0.000 1.020449 1.246225 tm_fuel | .3342486 .0228284 14.64 0.000 .2840035 .3844937 tm_load | -1.350727 .2478264 -5.45 0.000 -1.896189 -.8052644 _cons | 11.18505 .3660016 30.56 0.000 10.37949 11.99062 ------------------------------------------------------------------------------ The SAS PANEL procedure has the /BTWNT option to estimate the between effect model. 
PROC PANEL DATA=masil.airline; ID airline year; MODEL cost = output fuel load /BTWNT; RUN; The PANEL Procedure http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 43 Between Time Periods Estimates Dependent Variable: cost Model Description Estimation Method Number of Cross Sections Time Series Length BtwTime 6 15 Fit Statistics SSE MSE R-Square 0.0056 0.0005 0.9991 DFE Root MSE 11 0.0225 Parameter Estimates Variable Intercept output fuel load DF Estimate Standard Error t Value Pr > |t| 1 1 1 1 11.18504 1.133335 0.334249 -1.35073 0.3660 0.0513 0.0228 0.2478 30.56 22.10 14.64 -5.45 <.0001 <.0001 <.0001 0.0002 Label Intercept You may use the be option in the STATA .xtreg command and the Means; subcommand in LIMDEP (outputs are skipped). . xtreg cost output fuel load, be i(year) // between time effect model --> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=YEAR;Means$ 5.4 Testing Fixed Time Effects. The null hypothesis is that all time dummy parameters except one are zero: (1.3354 − 1.0882) (15 − 1) H 0 : τ 1 = ... = τ T −1 = 0 . The F statistic is ~ 1.1683[14,72] . The p(1.0882) (6 *15 − 15 − 3) value of .3180 does not reject the null hypothesis. The SAS TSCSREG and PANEL procedures and the STATA .xtreg command conduct the Wald test. You may get the same test using the TEST statement in LSDV1 and the STATA .test command (the output is skipped). PROC REG DATA=masil.airline; MODEL cost = t1-t14 output fuel load; TEST t1=t2=t3=t4=t5=t6=t7=t8=t9=t10=t11=t12=t13=t14=0; RUN; . quietly regress cost t1-t14 output fuel load . test t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 44 6. The Fixed Group and Time Effect Model The two-way fixed model considers both group and time effects. This model thus needs two sets of group and time dummy variables. LSDV2 and the between effect model are not valid in this model. 6.1 Least Squares Dummy Variable Models There are four approaches to avoid the perfect multicollinearity or the dummy variable trap. You may not suppress the intercept under any circumstances. • Drop one cross-section and one time-series dummy variables. • Drop one cross-section dummy and impose a restriction on the time-series dummies of ∑τ t = 0 • Drop one time-series dummy and impose a restriction on the cross-section dummies of ∑μg = 0 • Include all dummy variables and impose two restrictions on the cross-section and timeseries dummies of ∑ μ g = 0 and ∑τ t = 0 6.2 LSDV1 without Two Dummies Let us first run LSDV1 using STATA. . regress cost g1-g5 t1-t14 output fuel load Source | SS df MS -------------+-----------------------------Model | 113.864044 22 5.17563838 Residual | .176848775 67 .002639534 -------------+-----------------------------Total | 114.040893 89 1.28135835 Number of obs F( 22, 67) Prob > F R-squared Adj R-squared Root MSE = 90 = 1960.82 = 0.0000 = 0.9984 = 0.9979 = .05138 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------g1 | .1742825 .0861201 2.02 0.047 .0023861 .346179 g2 | .1114508 .0779551 1.43 0.157 -.0441482 .2670499 g3 | -.143511 .0518934 -2.77 0.007 -.2470907 -.0399313 g4 | .1802087 .0321443 5.61 0.000 .1160484 .2443691 g5 | -.0466942 .0224688 -2.08 0.042 -.0915422 -.0018463 t1 | -.6931382 .3378385 -2.05 0.044 -1.367467 -.0188098 t2 | -.6384366 .3320802 -1.92 0.059 -1.301271 .0243983 t3 | -.5958031 .3294473 -1.81 0.075 -1.253383 .0617764 t4 | -.5421537 .3189139 -1.70 0.094 -1.178708 .0944011 t5 | -.4730429 .2319459 -2.04 0.045 -.9360088 -.0100769 t6 | -.4272042 .18844 -2.27 0.027 -.8033319 -.0510764 t7 | -.3959783 .1732969 -2.28 0.025 -.7418804 -.0500762 t8 | -.3398463 .1501062 -2.26 0.027 -.6394596 -.040233 t9 | -.2718933 .1348175 -2.02 0.048 -.5409901 -.0027964 t10 | -.2273857 .0763495 -2.98 0.004 -.37978 -.0749914 t11 | -.1118032 .0319005 -3.50 0.001 -.175477 -.0481295 t12 | -.033641 .0429008 -0.78 0.436 -.1192713 .0519893 t13 | -.0177346 .0362554 -0.49 0.626 -.0901007 .0546315 t14 | -.0186451 .030508 -0.61 0.543 -.0795393 .042249 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 45 output | .8172487 .031851 25.66 0.000 .7536739 .8808235 fuel | .16861 .163478 1.03 0.306 -.1576935 .4949135 load | -.8828142 .2617373 -3.37 0.001 -1.405244 -.3603843 _cons | 12.94004 2.218231 5.83 0.000 8.512434 17.36765 ------------------------------------------------------------------------------ The following is the corresponding SAS REG procedure (outputs are skipped). PROC REG DATA=masil.airline; MODEL cost = g1-g5 t1-t14 output fuel load; RUN; The LIMDEP example is skipped here, since many dummy variables need to be listed in the Regress$ command. 6.3 LSDV1 + LSDV3: Dropping a Dummy and Imposing a Restriction In the second approach, you may drop either one group dummy or one time dummy. The following drops one time dummy, includes all group dummies, and imposes a restriction on group dummies. PROC REG DATA=masil.airline; MODEL cost = g1-g6 t1-t14 output fuel load; RESTRICT g1 + g2 + g3 + g4 + g5 + g6 = 0; RUN; The REG Procedure Model: MODEL1 Dependent Variable: cost NOTE: Restrictions have been applied to parameter estimates. Number of Observations Read Number of Observations Used 90 90 Analysis of Variance Source DF Sum of Squares Mean Square Model Error Corrected Total 22 67 89 113.86404 0.17685 114.04089 5.17564 0.00264 Root MSE Dependent Mean Coeff Var 0.05138 13.36561 0.38439 R-Square Adj R-Sq Parameter Estimates Parameter http://www.indiana.edu/~statmath Standard F Value Pr > F 1960.82 <.0001 0.9984 0.9979 © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 46 Variable DF Estimate Error t Value Intercept g1 g2 g3 g4 g5 g6 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 output fuel load RESTRICT 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 12.98600 0.12833 0.06549 -0.18947 0.13425 -0.09265 -0.04596 -0.69314 -0.63844 -0.59580 -0.54215 -0.47304 -0.42720 -0.39598 -0.33985 -0.27189 -0.22739 -0.11180 -0.03364 -0.01773 -0.01865 0.81725 0.16861 -0.88281 -1.9387E-16 2.22540 0.04601 0.03897 0.01561 0.01832 0.03731 0.04161 0.33784 0.33208 0.32945 0.31891 0.23195 0.18844 0.17330 0.15011 0.13482 0.07635 0.03190 0.04290 0.03626 0.03051 0.03185 0.16348 0.26174 . 5.84 2.79 1.68 -12.14 7.33 -2.48 -1.10 -2.05 -1.92 -1.81 -1.70 -2.04 -2.27 -2.28 -2.26 -2.02 -2.98 -3.50 -0.78 -0.49 -0.61 25.66 1.03 -3.37 . 
Pr > |t| <.0001 0.0069 0.0975 <.0001 <.0001 0.0155 0.2733 0.0441 0.0588 0.0750 0.0938 0.0454 0.0266 0.0255 0.0268 0.0477 0.0040 0.0008 0.4357 0.6263 0.5432 <.0001 0.3061 0.0012 . * Probability computed using beta distribution. Alternatively, you may run the STATA .cnsreg command with the second constraint (output is skipped). . cnsreg cost g1-g6 t1-t14 output fuel load, constraint(2) The following drops one group dummy and imposes a restriction on time dummies. . cnsreg cost g1-g5 t1-t15 output fuel load, constraint(3) Constrained linear regression Number of obs = 90 F( 22, 67) = 1960.82 Prob > F = 0.0000 Root MSE = .05138 ( 1) t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9 + t10 + t11 + t12 + t13 + t14 + t15 = 0 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------g1 | .1742825 .0861201 2.02 0.047 .0023861 .346179 g2 | .1114508 .0779551 1.43 0.157 -.0441482 .2670499 g3 | -.143511 .0518934 -2.77 0.007 -.2470907 -.0399313 g4 | .1802087 .0321443 5.61 0.000 .1160484 .2443691 g5 | -.0466942 .0224688 -2.08 0.042 -.0915422 -.0018463 t1 | -.3740245 .191872 -1.95 0.055 -.7570026 .0089536 t2 | -.3193228 .1860877 -1.72 0.091 -.6907554 .0521097 t3 | -.2766893 .1833501 -1.51 0.136 -.6426576 .0892789 t4 | -.2230399 .1729671 -1.29 0.202 -.5682837 .1222038 t5 | -.1539291 .0864404 -1.78 0.079 -.3264649 .0186066 t6 | -.1080904 .0448591 -2.41 0.019 -.1976296 -.0185513 t7 | -.0768646 .0319336 -2.41 0.019 -.1406043 -.0131248 t8 | -.0207326 .0204506 -1.01 0.314 -.061552 .0200869 t9 | .0472205 .0290822 1.62 0.109 -.0108278 .1052688 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 47 t10 | .0917281 .0811525 1.13 0.262 -.0702531 .2537092 t11 | .2073105 .1491443 1.39 0.169 -.0903829 .5050039 t12 | .2854727 .1756365 1.63 0.109 -.0650993 .6360447 t13 | .3013791 .1660294 1.82 0.074 -.030017 .6327752 t14 | .3004686 .1536212 1.96 0.055 -.0061606 .6070978 t15 | .3191137 .1474883 2.16 0.034 .0247259 .6135015 output | .8172487 .031851 25.66 0.000 .7536739 .8808235 fuel | .16861 .163478 1.03 0.306 -.1576935 .4949135 load | -.8828142 .2617373 -3.37 0.001 -1.405244 -.3603843 _cons | 12.62093 2.074302 6.08 0.000 8.480603 16.76125 ------------------------------------------------------------------------------ You may run the following SAS REG procedure to get the same result (output is skipped). PROC REG DATA=masil.airline; /* LSDV3 */ MODEL cost = g1-g5 t1-t15 output fuel load; RESTRICT t1+t2+t3+t4+t5+t6+t7+t8+t9+t10+t11+t12+t13+t14+t15=0; RUN; 6.4 LSDV3 with Two Restrictions The third approach includes all group and time dummies and imposes two restrictions on group and time dummies. . cnsreg cost g1-g6 t1-t15 output fuel load, constraint(2 3) Constrained linear regression Number of obs = 90 F( 22, 67) = 1960.82 Prob > F = 0.0000 Root MSE = .05138 ( 1) g1 + g2 + g3 + g4 + g5 + g6 = 0 ( 2) t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9 + t10 + t11 + t12 + t13 + t14 + t15 = 0 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------g1 | .1283264 .0460126 2.79 0.007 .0364849 .2201679 g2 | .0654947 .0389685 1.68 0.097 -.0122867 .1432761 g3 | -.1894671 .0156096 -12.14 0.000 -.220624 -.1583102 g4 | .1342526 .0183163 7.33 0.000 .097693 .1708121 g5 | -.0926504 .0373085 -2.48 0.016 -.1671184 -.0181824 g6 | -.0459561 .0416069 -1.10 0.273 -.1290038 .0370916 t1 | -.3740245 .191872 -1.95 0.055 -.7570026 .0089536 t2 | -.3193228 .1860877 -1.72 0.091 -.6907554 .0521097 t3 | -.2766893 .1833501 -1.51 0.136 -.6426576 .0892789 t4 | -.2230399 .1729671 -1.29 0.202 -.5682837 .1222038 t5 | -.1539291 .0864404 -1.78 0.079 -.3264649 .0186066 t6 | -.1080904 .0448591 -2.41 0.019 -.1976296 -.0185513 t7 | -.0768646 .0319336 -2.41 0.019 -.1406043 -.0131248 t8 | -.0207326 .0204506 -1.01 0.314 -.061552 .0200869 t9 | .0472205 .0290822 1.62 0.109 -.0108278 .1052688 t10 | .0917281 .0811525 1.13 0.262 -.0702531 .2537092 t11 | .2073105 .1491443 1.39 0.169 -.0903829 .5050039 t12 | .2854727 .1756365 1.63 0.109 -.0650993 .6360447 t13 | .3013791 .1660294 1.82 0.074 -.030017 .6327752 t14 | .3004686 .1536212 1.96 0.055 -.0061606 .6070978 t15 | .3191137 .1474883 2.16 0.034 .0247259 .6135015 output | .8172487 .031851 25.66 0.000 .7536739 .8808235 fuel | .16861 .163478 1.03 0.306 -.1576935 .4949135 load | -.8828142 .2617373 -3.37 0.001 -1.405244 -.3603843 _cons | 12.66688 2.081068 6.09 0.000 8.513054 16.82071 ------------------------------------------------------------------------------ The following SAS REG procedure gives you the same result (output is skipped). PROC REG DATA=masil.airline; http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 48 MODEL cost = g1-g6 t1-t15 output fuel load; RESTRICT g1 + g2 + g3 + g4 + g5 + g6 = 0; RESTRICT t1+t2+t3+t4+t5+t6+t7+t8+t9+t10+t11+t12+t13+t14+t15=0; RUN; 6.5 Two-way Within Effect Model The two-way within group and time effect model requires a transformation of the data set as yit* = yit − yi • − y•t + y•• and xit* = xit − xi • − x•t + x•• . The following commands do this task. . . . . gen gen gen gen w_cost = w_output w_fuel = w_load = cost - gm_cost - tm_cost + m_cost = output - gm_output - tm_output + m_output fuel - gm_fuel - tm_fuel + m_fuel load - gm_load - tm_load + m_load . tabstat cost output fuel load, stat(mean) stats | cost output fuel load ---------+---------------------------------------mean | 13.36561 -1.174309 12.77036 .5604602 -------------------------------------------------- Now, run the OLS with the transformed variables. Do not forget to suppress the intercept. . regress w_cost w_output w_fuel w_load, noc // within effect Source | SS df MS -------------+-----------------------------Model | 1.87739643 3 .625798811 Residual | .176848774 87 .002032745 -------------+-----------------------------Total | 2.05424521 90 .022824947 Number of obs F( 3, 87) Prob > F R-squared Adj R-squared Root MSE = = = = = = 90 307.86 0.0000 0.9139 0.9109 .04509 -----------------------------------------------------------------------------w_cost | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------w_output | .8172487 .0279512 29.24 0.000 .7616927 .8728048 w_fuel | .16861 .1434621 1.18 0.243 -.1165364 .4537565 w_load | -.8828142 .2296907 -3.84 0.000 -1.339349 -.426279 ------------------------------------------------------------------------------ Note again that R2, MSE, standard errors, and DFerror are not correct. The dummy variable coefficients are computed as d g* = ( y g • − y •• ) − b' ( x g • − x•• ) and d t* = ( y•t − y •• ) − b' ( x•t − x•• ) . The standard errors also need to be adjusted; for instance, the standard error of the load factor is .2617=.2297*sqrt(87/67). 6.6 Using the TSCSREG and PANEL Procedures The SAS TSCSREG and PANEL procedures have the /FIXTWO option to fit the two-way fixed effect model. PROC TSCSREG DATA=masil.airline; ID airline year; MODEL cost = output fuel load /FIXTWO; RUN; http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 49 The TSCSREG Procedure Dependent Variable: cost Model Description Estimation Method Number of Cross Sections Time Series Length FixTwo 6 15 Fit Statistics SSE MSE R-Square 0.1768 0.0026 0.9984 DFE Root MSE 67 0.0514 F Test for No Fixed Effects Num DF Den DF F Value Pr > F 19 67 23.10 <.0001 Parameter Estimates DF Estimate Standard Error t Value Pr > |t| CS1 1 0.174283 0.0861 2.02 0.0470 CS2 1 0.111451 0.0780 1.43 0.1575 CS3 1 -0.14351 0.0519 -2.77 0.0073 CS4 1 0.180209 0.0321 5.61 <.0001 CS5 1 -0.04669 0.0225 -2.08 0.0415 TS1 1 -0.69314 0.3378 -2.05 0.0441 TS2 1 -0.63844 0.3321 -1.92 0.0588 TS3 1 -0.5958 0.3294 -1.81 0.0750 TS4 1 -0.54215 0.3189 -1.70 0.0938 TS5 1 -0.47304 0.2319 -2.04 0.0454 TS6 1 -0.4272 0.1884 -2.27 0.0266 TS7 1 -0.39598 0.1733 -2.28 0.0255 TS8 1 -0.33985 0.1501 -2.26 0.0268 Variable http://www.indiana.edu/~statmath Label Cross Sectional Effect 1 Cross Sectional Effect 2 Cross Sectional Effect 3 Cross Sectional Effect 4 Cross Sectional Effect 5 Time Series Effect 1 Time Series Effect 2 Time Series Effect 3 Time Series Effect 4 Time Series Effect 5 Time Series Effect 6 Time Series Effect 7 Time Series Effect 8 © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 50 TS9 1 -0.27189 0.1348 -2.02 0.0477 TS10 1 -0.22739 0.0763 -2.98 0.0040 TS11 1 -0.1118 0.0319 -3.50 0.0008 TS12 1 -0.03364 0.0429 -0.78 0.4357 TS13 1 -0.01773 0.0363 -0.49 0.6263 TS14 1 -0.01865 0.0305 -0.61 0.5432 Intercept output fuel load 1 1 1 1 12.94004 0.817249 0.16861 -0.88281 2.2182 0.0319 0.1635 0.2617 5.83 25.66 1.03 -3.37 <.0001 <.0001 0.3061 0.0012 Time Series Effect 9 Time Series Effect 10 Time Series Effect 11 Time Series Effect 12 Time Series Effect 13 Time Series Effect 14 Intercept The STATA .xtreg command does not fit the two-way fixed or random effect model. The following LIMDEP command fits the two-way fixed model. Note that this command has Str$ and Period$ specifications to specify stratification and time variables. This command presents the pooled model and one-way group effect model as well, but reports the incorrect intercept in the two-way fixed model, 12.667 (2.081). REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=AIRLINE;Period=YEAR;Fixed$ 6.7 Testing Fixed Group and Time Effects The null hypothesis is that parameters of group and time dummies are zero: H 0 : μ1 = ... = μ n −1 = 0 and τ 1 = ... = τ T −1 = 0 . The F test compares the pooled regression and two-way group and time effect model. 
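Since the test needs only the two error sums of squares, the statistic is easy to reproduce by hand. The following is a minimal Stata sketch for verification, using the SSE of the pooled OLS (1.3354) and of the two-way LSDV (.1768) reported above.

. display ((1.3354 - .1768)/(6 + 15 - 2)) / (.1768/(6*15 - 6 - 15 - 3 + 1))   // about 23.11 with the rounded SSEs
. display Ftail(19, 67, 23.1085)                                              // p-value of the F statistic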
The F statistic of 23.1085 rejects the null hypothesis at the .01 significance level (p<.0000):

   [(1.3354 - .1768)/(6 + 15 - 2)] / [.1768/(6*15 - 6 - 15 - 3 + 1)] ~ 23.1085[19, 67]

The SAS TSCSREG and PANEL procedures conduct the F-test for the group and time effects. You may also run the following SAS REG procedure with a TEST statement and the STATA .test command to perform the same test.

PROC REG DATA=masil.airline;
   MODEL cost = g1-g5 t1-t14 output fuel load;
   TEST g1=g2=g3=g4=g5=t1=t2=t3=t4=t5=t6=t7=t8=t9=t10=t11=t12=t13=t14=0;
RUN;

. quietly regress cost g1-g5 t1-t14 output fuel load
. test g1 g2 g3 g4 g5 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14

7. Random Effect Models

The random effects model examines how group and/or time affect error variances. This model is appropriate for n individuals who were drawn randomly from a large population. This chapter focuses on the feasible generalized least squares (FGLS) with variance component estimation methods from Baltagi and Chang (1994), Fuller and Battese (1974), and Wansbeek and Kapteyn (1989).8

8 Baltagi and Chang (1994) introduce various ANOVA estimation methods, such as a modified Wallace and Hussain method, the Wansbeek and Kapteyn method, the Swamy and Arora method, and Henderson's method III. They also discuss maximum likelihood (ML) estimators, restricted ML estimators, minimum norm quadratic unbiased estimators (MINQUE), and minimum variance quadratic unbiased estimators (MIVQUE). Based on a Monte Carlo simulation, they argue that ANOVA estimators are best quadratic unbiased estimators of the variance components for the balanced model, whereas ML, restricted ML, MINQUE, and MIVQUE are recommended for the unbalanced models.

7.1 The One-way Random Group Effect Model

When the omega matrix is not known, you have to estimate θ using the SSEs of the between effect model (.0317) and the fixed effect model (.2926). The variance component of error σ̂²ε is .00361263 = .292622872/(6*15-6-3). The variance component of group σ̂²u is .01559712 = .031675926/(6-4) - .00361263/15. Thus, θ̂ is .87668488 = 1 - sqrt(.00361263/(15*.031675926/(6-4))).

Now, transform the dependent and independent variables including the intercept.

. gen rg_cost = cost - .87668488*gm_cost          // transform variables
. gen rg_output = output - .87668488*gm_output
. gen rg_fuel = fuel - .87668488*gm_fuel
. gen rg_load = load - .87668488*gm_load
. gen rg_int = 1 - .87668488                      // for the intercept

Finally, run the OLS with the transformed variables. Do not forget to suppress the intercept. This is the groupwise heteroscedastic regression model (Greene 2003).

. regress rg_cost rg_int rg_output rg_fuel rg_load, noc

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  4,    86) =19642.72
       Model |  284.670313     4  71.1675783           Prob > F      =  0.0000
    Residual |  .311586777    86  .003623102           R-squared     =  0.9989
-------------+------------------------------           Adj R-squared =  0.9989
       Total |    284.9819    90  3.16646556           Root MSE      =  .06019

------------------------------------------------------------------------------
     rg_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      rg_int |   9.627911   .2101638    45.81   0.000     9.210119     10.0457
http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 52 rg_output | .9066808 .0256249 35.38 0.000 .8557401 .9576215 rg_fuel | .4227784 .0140248 30.15 0.000 .394898 .4506587 rg_load | -1.0645 .2000703 -5.32 0.000 -1.462226 -.6667731 ------------------------------------------------------------------------------ 7.2 Estimations in SAS, STATA, and LIMDEP The SAS TSCSREG and PANEL procedures have the /RANONE option to fit the one-way random effect model. These procedures by default use the Fuller and Battese (1974) estimation method, which produces slightly different estimates from FGLS. PROC TSCSREG DATA=masil.airline; ID airline year; MODEL cost = output fuel load /RANONE; RUN; The TSCSREG Procedure Dependent Variable: cost Model Description Estimation Method Number of Cross Sections Time Series Length RanOne 6 15 Fit Statistics SSE MSE R-Square 0.3090 0.0036 0.9923 DFE Root MSE 86 0.0599 Variance Component Estimates Variance Component for Cross Sections Variance Component for Error 0.018198 0.003613 Hausman Test for Random Effects DF m Value Pr > m 3 0.92 0.8209 Parameter Estimates Variable Intercept output http://www.indiana.edu/~statmath DF Estimate Standard Error t Value Pr > |t| 1 1 9.637 0.908024 0.2132 0.0260 45.21 34.91 <.0001 <.0001 © 2005 The Trustees of Indiana University (12/10/2005) fuel load 1 1 0.422199 -1.06469 Linear Regression Model for Panel Data: 53 0.0141 0.1995 29.95 -5.34 <.0001 <.0001 The PANEL procedure has the /VCOMP=WK option for the Wansbeek and Kapteyn (1989) method, which is close to groupwise heteroscedastic regression. The BP option of the MODEL statement, not available in the TSCSREG procedure, conducts the Breusch-Pagen LM test for random effects. Note that two procedures estimate the same variance component for error (.0036) but a different variance component for groups (.0182 versus .0160), PROC PANEL DATA=masil.airline; ID airline year; MODEL cost = output fuel load /RANONE BP VCOMP=WK; RUN; The PANEL Procedure Wansbeek and Kapteyn Variance Components (RanOne) Dependent Variable: cost Model Description Estimation Method Number of Cross Sections Time Series Length RanOne 6 15 Fit Statistics SSE MSE R-Square 0.3111 0.0036 0.9923 DFE Root MSE 86 0.0601 Variance Component Estimates Variance Component for Cross Sections Variance Component for Error Hausman Test for Random Effects DF m Value Pr > m 2 1.63 0.4429 Breusch Pagan Test for Random Effects (One Way) DF m Value Pr > m 1 334.85 <.0001 Parameter Estimates http://www.indiana.edu/~statmath 0.016015 0.003613 © 2005 The Trustees of Indiana University (12/10/2005) Variable Linear Regression Model for Panel Data: 54 DF Estimate Standard Error t Value Pr > |t| 1 1 1 1 9.629513 0.906918 0.422676 -1.06452 0.2107 0.0257 0.0140 0.2000 45.71 35.30 30.11 -5.32 <.0001 <.0001 <.0001 <.0001 Intercept output fuel load The STATA .xtreg command has the re option to produce FGLS estimates. The .iis command specifies the panel identification variable, such as a grouping or cross-section variable that is used in the i() option. . iis airline . 
xtreg cost output fuel load, re i(airline) theta Random-effects GLS regression Group variable (i): airline Number of obs Number of groups = = 90 6 R-sq: Obs per group: min = avg = max = 15 15.0 15 within = 0.9925 between = 0.9856 overall = 0.9876 Random effects u_i ~ Gaussian corr(u_i, X) = 0 (assumed) theta = .87668503 Wald chi2(3) Prob > chi2 = = 11091.33 0.0000 -----------------------------------------------------------------------------cost | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------output | .9066805 .025625 35.38 0.000 .8564565 .9569045 fuel | .4227784 .0140248 30.15 0.000 .3952904 .4502665 load | -1.064499 .2000703 -5.32 0.000 -1.456629 -.672368 _cons | 9.627909 .210164 45.81 0.000 9.215995 10.03982 -------------+---------------------------------------------------------------sigma_u | .12488859 sigma_e | .06010514 rho | .81193816 (fraction of variance due to u_i) ------------------------------------------------------------------------------ The theta option reports the estimated theta (.8767). The sigma_u and sigma_e are square roots of the variance components for groups and errors (.0036=.0601^2). In LIMDEP, you have to specify Panel$ and Het$ subcommands for the groupwise heteroscedastic model. Note that LIMDEP presents the pooled OLS regression and least square dummy variable model as well. --> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=AIRLINE;Het=AIRLINE$ +-----------------------------------------------------------------------+ | OLS Without Group Dummy Variables | | Ordinary least squares regression Weighting variable = none | | Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 4, Deg.Fr.= 86 | | Residuals: Sum of squares= 1.335449522 , Std.Dev.= .12461 | | Fit: R-squared= .988290, Adjusted R-squared = .98788 | | Model test: F[ 3, 86] = 2419.33, Prob value = .00000 | | Diagnostic: Log-L = 61.7699, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -4.122, Akaike Info. Crt.= -1.284 | http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 55 | Panel Data Analysis of COST [ONE way] | | Unconditional ANOVA (No regressors) | | Source Variation Deg. Free. Mean Square | | Between 74.6799 5. 14.9360 | | Residual 39.3611 84. .468584 | | Total 114.041 89. 1.28136 | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ OUTPUT .8827386341 .13254552E-01 66.599 .0000 -1.1743092 FUEL .4539777119 .20304240E-01 22.359 .0000 12.770359 LOAD -1.627507797 .34530293 -4.713 .0000 .56046016 Constant 9.516912231 .22924522 41.514 .0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +-----------------------------------------------------------------------+ | Least Squares with Group Dummy Variables | | Ordinary least squares regression Weighting variable = none | | Dep. var. 
= COST Mean= 13.36560933 , S.D.= 1.131971444 | | Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 | | Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 | | Fit: R-squared= .997434, Adjusted R-squared = .99718 | | Model test: F[ 8, 81] = 3935.82, Prob value = .00000 | | Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 | | LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 | | Estd. Autocorrelation of e(i,t) .573531 | | White/Hetero. corrected covariance matrix used. | +-----------------------------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ OUTPUT .9192881432 .19105357E-01 48.117 .0000 -1.1743092 FUEL .4174910457 .13532534E-01 30.851 .0000 12.770359 LOAD -1.070395015 .21662097 -4.941 .0000 .56046016 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +------------------------------------------------------------------------+ | Test Statistics for the Classical Model | | | | Model Log-Likelihood Sum of Squares R-squared | | (1) Constant term only -138.35814 .1140409821D+03 .0000000 | | (2) Group effects only -90.48804 .3936109461D+02 .6548513 | | (3) X - variables only 61.76991 .1335449522D+01 .9882897 | | (4) X and group effects 130.08647 .2926207777D+00 .9974341 | | | | Hypothesis Tests | | Likelihood Ratio Test F Tests | | Chi-squared d.f. Prob. F num. denom. Prob value | | (2) vs (1) 95.740 5 .00000 31.875 5 84 .00000 | | (3) vs (1) 400.256 3 .00000 2419.329 3 86 .00000 | | (4) vs (1) 536.889 8 .00000 3935.818 8 81 .00000 | | (4) vs (2) 441.149 3 .00000 3604.832 3 81 .00000 | | (4) vs (3) 136.633 5 .00000 57.733 5 81 .00000 | +------------------------------------------------------------------------+ Error: 425: REGR;PANEL. Could not invert VC matrix for Hausman test. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 56 +--------------------------------------------------+ | Random Effects Model: v(i,t) = e(i,t) + u(i) | | Estimates: Var[e] = .361260D-02 | | Var[u] = .119159D-01 | | Corr[v(i,t),v(i,s)] = .767356 | | Lagrange Multiplier Test vs. Model (3) = 334.85 | | ( 1 df, prob value = .000000) | | (High values of LM favor FEM/REM over CR model.) | | Fixed vs. Random Effects (Hausman) = .00 | | ( 3 df, prob value = 1.000000) | | (High (low) values of H favor FEM (REM).) | | Reestimated using GLS coefficients: | | Estimates: Var[e] = .362491D-02 | | Var[u] = .392309D-01 | | Var[e] above is an average. Groupwise | | heteroscedasticity model was estimated. | | Sum of Squares .147779D+01 | +--------------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ OUTPUT .9041238041 .24615477E-01 36.730 .0000 -1.1743092 FUEL .4238986905 .13746498E-01 30.837 .0000 12.770359 LOAD -1.064558659 .19933132 -5.341 .0000 .56046016 Constant 9.610634379 .20277404 47.396 .0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) Like SAS TSCSREG and PANEL procedures, LIMDEP estimates a slightly different variance component for groups (.0119), thus producing different parameter estimates. In addition, the Hausman test is not successful in this example. 
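Before moving on, note that the theta and rho reported by the STATA .xtreg command above are simple functions of sigma_u and sigma_e, so they can be reproduced by hand. The following is a minimal Stata sketch for verification only, using the values printed in the .xtreg output.

. display 1 - sqrt(.06010514^2/(15*.12488859^2 + .06010514^2))   // theta = .8767
. display .12488859^2/(.12488859^2 + .06010514^2)                // rho = .8119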
7.3 The One-way Random Time Effect Model

Let us compute θ̂ using the SSEs of the between effect model (.0056) and the fixed effect model (1.0882). The variance component for error σ̂²ε is .01511375 = 1.08819022/(15*6-15-3). The variance component for time σ̂²v is -.00201072 = .005590631/(15-4) - .01511375/6. The θ̂ is -1.226263 = 1 - sqrt(.01511375/(6*.005590631/(15-4))).

. gen rt_cost = cost - (-1.226263)*tm_cost          // transform variables
. gen rt_output = output - (-1.226263)*tm_output
. gen rt_fuel = fuel - (-1.226263)*tm_fuel
. gen rt_load = load - (-1.226263)*tm_load
. gen rt_int = 1 - (-1.226263)                      // for the intercept

. regress rt_cost rt_int rt_output rt_fuel rt_load, noc

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  4,    86) =       .
       Model |  79944.1804     4  19986.0451           Prob > F      =  0.0000
    Residual |  1.79271995    86  .020845581           R-squared     =  1.0000
-------------+------------------------------           Adj R-squared =  1.0000
       Total |  79945.9732    90  888.288591           Root MSE      =  .14438

------------------------------------------------------------------------------
     rt_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      rt_int |   9.516098   .1489281    63.90   0.000     9.220038    9.812157
   rt_output |   .8883838   .0143338    61.98   0.000     .8598891    .9168785
     rt_fuel |   .4392731   .0129051    34.04   0.000     .4136186    .4649277
     rt_load |  -1.279176   .2482869    -5.15   0.000    -1.772754   -.7855982
------------------------------------------------------------------------------

However, the negative value of the variance component for time is not likely. This section presents examples of procedures and commands for the one-way time random effect model without outputs. In SAS, use the TSCSREG or PANEL procedure with the /RANONE option.

PROC SORT DATA=masil.airline;
   BY year airline;
PROC TSCSREG DATA=masil.airline;
   ID year airline;
   MODEL cost = output fuel load /RANONE;
RUN;

PROC PANEL DATA=masil.airline;
   ID year airline;
   MODEL cost = output fuel load /RANONE BP;
RUN;

In STATA, you have to switch the grouping and time variables using the .tsset command.

. tsset year airline
       panel variable:  year, 1 to 15
        time variable:  airline, 1 to 6

. xtreg cost output fuel load, re i(year) theta

In LIMDEP, you need to use the Period$ and Random$ subcommands.

REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Pds=15;Het=YEAR$

7.4 The Two-way Random Effect Model in SAS

The random group and time effect model is formulated as y_it = α + β'X_it + u_i + γ_t + ε_it. Let us first estimate the two-way FGLS using the SAS PANEL procedure with the /RANTWO option. The BP2 option conducts the Breusch-Pagan LM test for the two-way random effect model.
PROC PANEL DATA=masil.airline;
   ID airline year;
   MODEL cost = output fuel load /RANTWO BP2;
RUN;

The PANEL Procedure
Fuller and Battese Variance Components (RanTwo)

Dependent Variable: cost

                Model Description
Estimation Method                         RanTwo
Number of Cross Sections                       6
Time Series Length                            15

                 Fit Statistics
SSE           0.2322        DFE             86
MSE           0.0027        Root MSE    0.0520
R-Square      0.9829

          Variance Component Estimates
Variance Component for Cross Sections   0.017439
Variance Component for Time Series      0.001081
Variance Component for Error            0.00264

     Hausman Test for Random Effects
DF      m Value      Pr > m
 3         6.93      0.0741

Breusch Pagan Test for Random Effects (Two Way)
DF      m Value      Pr > m
 2       336.40      <.0001

                 Parameter Estimates
Variable     DF    Estimate    Standard Error    t Value    Pr > |t|
Intercept     1    9.362677            0.2440      38.38      <.0001
output        1    0.866448            0.0255      33.98      <.0001
fuel          1    0.436163            0.0172      25.41      <.0001
load          1    -0.98053            0.2235      -4.39      <.0001

Similarly, you may run the TSCSREG procedure with the /RANTWO option.

PROC TSCSREG DATA=masil.airline;
   ID airline year;
   MODEL cost = output fuel load /RANTWO;
RUN;

7.5 Testing Random Effect Models

The Breusch-Pagan Lagrange multiplier (LM) test is designed to test random effects. The null hypothesis of the one-way random group effect model is that variances of groups are zero: H0: σ²u = 0. If the null hypothesis is not rejected, the pooled regression model is appropriate. The e'e of the pooled OLS is 1.33544153 and ē'ē is .0665147.

   LM is 334.8496 = [6*15/(2*(15-1))] * [15^2*.0665/1.3354 - 1]^2 ~ χ2(1) with p<.0000

With the large chi-squared, we reject the null hypothesis in favor of the random group effect model. The SAS PANEL procedure with the /BP option and the LIMDEP Panel$ and Het$ subcommands report the LM statistic. In STATA, run the .xttest0 command right after estimating the one-way random effect model.

. quietly xtreg cost output fuel load, re i(airline)
. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects:
        cost[airline,t] = Xb + u[airline] + e[airline,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                    cost |   1.281358       1.131971
                       e |   .0036126       .0601051
                       u |   .0155972       .1248886

        Test:   Var(u) = 0
                             chi2(1) =   334.85
                         Prob > chi2 =   0.0000

The null hypothesis of the one-way random time effect is that variance components for time are zero, H0: σ²v = 0. The following LM test uses Baltagi's formula, LM = [Tn/(2*(n-1))] * [Σt(n*ē•t)^2 / ΣΣe²it - 1]^2. The small chi-squared of 1.5472 does not reject the null hypothesis at the .01 level.

   LM is 1.5472 = [15*6/(2*(6-1))] * [.7817/1.3354 - 1]^2 ~ χ2(1) with p<.2135

. quietly xtreg cost output fuel load, re i(year)
. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects:
        cost[year,t] = Xb + u[year] + e[year,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                    cost |   1.281358       1.131971
                       e |   .0151138       .122938
                       u |          0              0

        Test:   Var(u) = 0
                             chi2(1) =     1.55
                         Prob > chi2 =   0.2135

The two-way random effects model has the null hypothesis that variance components for groups and time are all zero. The LM statistic with two degrees of freedom is 336.3968 = 334.8496 + 1.5472 (p<.0001).
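Both LM statistics are simple functions of residual sums of squares, so they are easy to reproduce by hand. The following is a minimal Stata sketch for verification, using the quantities quoted above.

. display 6*15/(2*(15-1)) * (15^2*.0665147/1.33544153 - 1)^2   // one-way group LM = 334.85
. display 15*6/(2*(6-1)) * (.7817/1.33544153 - 1)^2            // one-way time LM = 1.55
. display chi2tail(2, 334.8496 + 1.5472)                       // p-value of the two-way LM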
7.6 Fixed Effects versus Random Effects How do we compare a fixed effect model and its counterpart random effect model? The Hausman specification test examines if the individual effects are uncorrelated with the other regressors in the model. Since computation is complicated, let us conduct the test in STATA. . tsset airline year panel variable: time variable: airline, 1 to 6 year, 1 to 15 . quietly xtreg cost output fuel load, fe . estimates store fixed_group . quietly xtreg cost output fuel load, re . hausman fixed_group . ---- Coefficients ---| (b) (B) (b-B) sqrt(diag(V_b-V_B)) | fix_group . Difference S.E. -------------+---------------------------------------------------------------output | .9192846 .9066805 .0126041 .0153877 fuel | .4174918 .4227784 -.0052867 .0058583 load | -1.070396 -1.064499 -.0058974 .0255088 -----------------------------------------------------------------------------b = consistent under Ho and Ha; obtained from xtreg B = inconsistent under Ha, efficient under Ho; obtained from xtreg Test: Ho: difference in coefficients not systematic chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 2.12 Prob>chi2 = 0.5469 (V_b-V_B is not positive definite) The Hausman statistic 2.12 is different from the PANEL procedure’s 1.63 and Greene (2003)’s 4.16. It is because SAS, STATA, and LIMDEP use different estimation methods to produce slightly different parameter estimates. These tests, however, do not reject the null hypothesis in favor of the random effect model. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 61 7.7 Summary Table 7 summarizes random effect estimations in SAS, STATA, and LIMDEP. The SAS PANEL procedure is highly recommended. Table 7 Comparison of the Random Effect Model in SAS, STATA, LIMDEP* SAS 9.1 STATA 9.0 LIMDEP 8.0 . xtreg Regress; Panel$ Procedure/Command PROC TSCSREG PROC PANEL One-way /RANONE /RANONE WK re Str=;Pds=;Het;Random$ Two-way /RANTWO /RANTWO No Problematic SSE (e’e) Slightly different Correct No No MSE or SEE Slightly different Correct No No Model test (F) No No Wald test No (adjusted) R2 Slightly different Slightly different Incorrect No Intercept Slightly different Correct Correct Slightly different Coefficients Slightly different Correct Correct Slightly different Standard errors Slightly different Correct Correct Slightly different Variance for group Slightly different Correct Correct (sigma) Slightly different Variance for error Correct Correct Correct (sigma) Correct theta Theta No No No . xttest0 Breusch-Pagan (LM) No BP option Yes . hausman Hausman Test (H) Incorrect Yes Yes (unstable) * “Yes/No” means whether the software reports the statistics. “Correct/incorrect” indicates whether the statistics are different from those of the groupwise heteroscedastic regression. http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 62 8. The Poolability Test In order to conduct the poolability test, you need to run group by group OLS regressions and/or time by time OLS regressions. If the null hypothesis is rejected, the panel data are not poolable. In this case, you may consider the random coefficient model and hierarchical regression model. 8.1 Group by Group OLS Regression In SAS, use the BY statement in the REG procedure. Do not forget to sort the data set in advance. 
PROC SORT DATA=masil.airline; BY airline; PROC REG DATA=masil.airline; MODEL cost = output fuel load; BY airline; RUN; In STATA, the if qualifier makes it easy to run group by group regressions. . forvalues i= 1(1)6 { // run group by group regression display "OLS regression for group " `i' regress cost output fuel load if airline==`i' } OLS regression for group 1 Source | SS df MS -------------+-----------------------------Model | 3.41824348 3 1.13941449 Residual | .006798918 11 .000618083 -------------+-----------------------------Total | 3.4250424 14 .244645886 Number of obs F( 3, 11) Prob > F R-squared Adj R-squared Root MSE = 15 = 1843.46 = 0.0000 = 0.9980 = 0.9975 = .02486 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------output | 1.18318 .0968946 12.21 0.000 .9699164 1.396444 fuel | .3865867 .0181946 21.25 0.000 .3465406 .4266329 load | -2.461629 .4013571 -6.13 0.000 -3.34501 -1.578248 _cons | 10.846 .2972551 36.49 0.000 10.19174 11.50025 -----------------------------------------------------------------------------OLS regression for group 2 Source | SS df MS -------------+-----------------------------Model | 6.47622084 3 2.15874028 Residual | .007587838 11 .000689803 -------------+-----------------------------Total | 6.48380868 14 .463129191 Number of obs F( 3, 11) Prob > F R-squared Adj R-squared Root MSE = 15 = 3129.50 = 0.0000 = 0.9988 = 0.9985 = .02626 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------output | 1.459104 .0792856 18.40 0.000 1.284597 1.63361 fuel | .3088958 .0272443 11.34 0.000 .2489315 .36886 load | -2.724785 .2376522 -11.47 0.000 -3.247854 -2.201716 _cons | 11.97243 .4320951 27.71 0.000 11.02139 12.92346 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 63 -----------------------------------------------------------------------------OLS regression for group 3 Source | SS df MS -------------+-----------------------------Model | 3.79286673 3 1.26428891 Residual | .022869767 11 .00207907 -------------+-----------------------------Total | 3.8157365 14 .272552607 Number of obs F( 3, 11) Prob > F R-squared Adj R-squared Root MSE = = = = = = 15 608.10 0.0000 0.9940 0.9924 .0456 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------output | .7268305 .1554418 4.68 0.001 .3847054 1.068956 fuel | .4515127 .0381103 11.85 0.000 .3676324 .5353929 load | -.7513069 .6105989 -1.23 0.244 -2.095226 .5926122 _cons | 8.699815 .8985786 9.68 0.000 6.722057 10.67757 -----------------------------------------------------------------------------OLS regression for group 4 Source | SS df MS -------------+-----------------------------Model | 7.37252558 3 2.45750853 Residual | .034752343 11 .003159304 -------------+-----------------------------Total | 7.40727792 14 .52909128 Number of obs F( 3, 11) Prob > F R-squared Adj R-squared Root MSE = = = = = = 15 777.86 0.0000 0.9953 0.9940 .05621 -----------------------------------------------------------------------------cost | Coef. Std. Err. t P>|t| [95% Conf. 
 Interval]
-------------+----------------------------------------------------------------
      output |   .9353749   .0759266    12.32   0.000     .7682616    1.102488
        fuel |   .4637263    .044347    10.46   0.000     .3661192    .5613333
        load |  -.7756708   .4707826    -1.65   0.128    -1.811856    .2605148
       _cons |   9.164608   .6023241    15.22   0.000     7.838902    10.49031
------------------------------------------------------------------------------

OLS regression for group 5

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) = 1999.89
       Model |  7.08313716     3  2.36104572           Prob > F      =  0.0000
    Residual |  .012986435    11  .001180585           R-squared     =  0.9982
-------------+------------------------------           Adj R-squared =  0.9977
       Total |  7.09612359    14  .506865971           Root MSE      =  .03436

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   1.076299   .0771255    13.96   0.000     .9065471    1.246051
        fuel |   .2920542   .0434213     6.73   0.000     .1964845    .3876239
        load |  -1.206847   .3336308    -3.62   0.004    -1.941163   -.4725305
       _cons |   11.77079   .7430078    15.84   0.000     10.13544    13.40614
------------------------------------------------------------------------------

OLS regression for group 6

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) = 2602.49
       Model |  11.1173565     3  3.70578551           Prob > F      =  0.0000
    Residual |  .015663323    11  .001423938           R-squared     =  0.9986
-------------+------------------------------           Adj R-squared =  0.9982
       Total |  11.1330199    14  .795215705           Root MSE      =  .03774

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .9673393   .0321728    30.07   0.000     .8965275    1.038151
        fuel |   .3023258   .0308235     9.81   0.000     .2344839    .3701678
        load |   .1050328   .4767508     0.22   0.830    -.9442886    1.154354
       _cons |   10.77381   .4095921    26.30   0.000     9.872309    11.67532
------------------------------------------------------------------------------

8.2 Poolability Test across Groups

The null hypothesis of the poolability test across groups is H0: βik = βk. The e'e is 1.3354, the SSE of the pooled OLS regression. The Σei'ei is .1007 = .0068 + .0076 + .0229 + .0348 + .0130 + .0157, the sum of the SSEs of the six group by group regressions. Thus, with K = 4 parameters (including the intercept), the F statistic is

F = [(e'e − Σei'ei) / ((n−1)K)] / [Σei'ei / (n(T−K))] = [(1.3354 − .1007) / ((6−1)4)] / [.1007 / (6(15−4))] = 40.4812 ~ F(20, 66).

The large 40.4812 rejects the null hypothesis of poolability (p < .0000). We conclude that the panel data are not poolable with respect to group.

8.3 Poolability Test over Time

The null hypothesis of the poolability test over time is H0: βtk = βk. The sum of et'et is computed from the 15 time by time regressions.

. di .044807673 + .023093978 + .016506613 + .012170358 + .014104542 + ///
     .000469826 + .063648817 + .085430285 + .049329439 + .077112957 + ///
     .029913538 + .087240016 + .143348297 + .066075346 + .037256216
.7505079

The F statistic is

F = [(e'e − Σet'et) / ((T−1)K)] / [Σet'et / (T(n−K))] = [(1.3354 − .7505) / ((15−1)4)] / [.7505 / (15(6−4))] = .4175 ~ F(56, 30).

The small F statistic does not reject the null hypothesis in favor of poolable panel data with respect to time (p < .9991).
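Both poolability F statistics can likewise be reproduced with .display from the SSE figures above. The sketch below simply restates the arithmetic: the first line sums the six group residual sums of squares, and Ftail() gives the upper-tail p-values (roughly .0000 and .9991, as reported in the text).

. di .006798918 + .007587838 + .022869767 + .034752343 + .012986435 + .015663323
. di ((1.33544153-.1006586)/((6-1)*4)) / (.1006586/(6*(15-4)))
. di Ftail(20, 66, 40.4812)
. di ((1.33544153-.7505079)/((15-1)*4)) / (.7505079/(15*(6-4)))
. di Ftail(56, 30, .4175)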
9. Conclusion

Panel data models investigate group and time effects using fixed effect and random effect models. The fixed effect models ask how group and/or time affect the intercept, while the random effect models analyze error variance structures affected by group and/or time. Slopes are assumed unchanged in both fixed effect and random effect models.

Fixed effect models are estimated by least squares dummy variable (LSDV) regression, the within effect model, and the between effect model. LSDV has three approaches to avoid perfect multicollinearity: LSDV1 drops a dummy, LSDV2 suppresses the intercept, and LSDV3 includes all dummies and imposes restrictions instead. LSDV1 is commonly used since it produces correct statistics. LSDV2 provides actual parameter estimates of the group intercepts, but reports an incorrect R2 and F statistic. Note that the dummy parameters of the three LSDV approaches have different meanings and thus different t-tests. The within effect model does not use dummy variables but deviations from the group means. Thus, this model is useful when there are many groups and/or time periods in the panel data set (no incidental parameter problem at all). The dummy parameter estimates need to be computed afterward. Because of its larger degrees of freedom, the within effect model produces an incorrect MSE and incorrect standard errors of parameters. As a result, you need to adjust the standard errors to conduct correct t-tests.

Random effect models are estimated by generalized least squares (GLS) and feasible generalized least squares (FGLS). When the variance structure is known, GLS is used; if it is unknown, FGLS estimates theta. Parameter estimates may vary depending on the estimation method.

Fixed effects are tested by the F-test and random effects by the Breusch-Pagan Lagrange multiplier test. The Hausman specification test compares a fixed effect model and a random effect model; if the null hypothesis that the individual effects are uncorrelated with the regressors is rejected, the fixed effect model is preferred. Poolability is tested by running group by group or time by time regressions.

Among the four statistical packages addressed in this document, I would recommend SAS and STATA. In particular, the SAS PANEL procedure, although experimental now, provides various ways of analyzing panel data. STATA is very handy for manipulating panel data, but it does not fit two-way effect models. LIMDEP is able to estimate various panel data models, but it is not stable enough. SPSS is not recommended for panel data models.

APPENDIX: Data sets

Data set 1: Data of the top 50 information technology firms presented in OECD Information Technology Outlook 2004 (http://thesius.sourceoecd.org/).
   firm   = IT company name
   type   = type of IT firm
   rnd    = 2002 R&D investment in current USD millions
   income = 2000 net income in current USD millions
   d1     = 1 for equipment and software firms and 0 for telecommunication and electronics

. tab type d1

                |          d1
   Type of Firm |         0          1 |     Total
----------------+----------------------+----------
        Telecom |        18          0 |        18
    Electronics |        17          0 |        17
   IT Equipment |         0          6 |         6
Comm. Equipment |         0          5 |         5
  Service & S/W |         0          4 |         4
----------------+----------------------+----------
          Total |        35         15 |        50

. sum rnd income

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         rnd |        39    2023.564    1615.417          0       5490
      income |        50     2509.78    3104.585       -732      11797

Data set 2: Cost data for U.S. airlines (1970-1984) presented in Greene (2003).
URL: http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm airline = airline (six airlines) year = year (fifteen years) output0 = output in revenue passenger miles, index number cost0 = total cost in $1,000 fuel0 = fuel price load = load factor, the average capacity utilization of the fleet . tsset panel variable: time variable: airline, 1 to 6 year, 1 to 15 . sum output0 cost0 fuel0 load Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------output0 | 90 .5449946 .5335865 .037682 1.93646 cost0 | 90 1122524 1192075 68978 4748320 fuel0 | 90 471683 329502.9 103795 1015610 load | 90 .5604602 .0527934 .432066 .676287 http://www.indiana.edu/~statmath © 2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 67 References Baltagi, Badi H. 2001. Econometric Analysis of Panel Data. Wiley, John & Sons. Baltagi, Badi H., and Young-Jae Chang. 1994. "Incomplete Panels: A Comparative Study of Alternative Estimators for the Unbalanced One-way Error Component Regression Model." Journal of Econometrics, 62(2): 67-89. Breusch, T. S., and A. R. Pagan. 1980. "The Lagrange Multiplier Test and its Applications to Model Specification in Econometrics." Review of Economic Studies, 47(1):239-253. Fox, John. 1997. Applied Regression Analysis, Linear Models, and Related Methods. Newbury Park, CA: Sage. Freund, Rudolf J., and Ramon C. Littell. 2000. SAS System for Regression, 3rd ed. Cary, NC: SAS Institute. Fuller, Wayne A. and George E. Battese. 1973. "Transformations for Estimation of Linear Models with Nested-Error Structure." Journal of the American Statistical Association, 68(343) (September): 626-632. Fuller, Wayne A. and George E. Battese. 1974. "Estimation of Linear Models with CrossedError Structure." Journal of Econometrics, 2: 67-78. Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide, 4th ed. Plainview, New York: Econometric Software. Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall. Hausman, J. A. 1978. "Specification Tests in Econometrics." Econometrica, 46(6):1251-1271. SAS Institute. 2004. SAS/ETS 9.1 User’s Guide. Cary, NC: SAS Institute. SAS Institute. 2004. SAS/STAT 9.1 User’s Guide. Cary, NC: SAS Institute. http://www.sas.com/ STATA Press. 2005. STATA Base Reference Manual, Release 9. College Station, TX: STATA Press. STATA Press. 2005. STATA Longitudinal/Panel Data Reference Manual, Release 9. College Station, TX: STATA Press. STATA Press. 2005. STATA Time-Series Reference Manual, Release 9. College Station, TX: STATA Press. Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press. Acknowledgements I have to thank Dr. Heejoon Kang in the Kelley School of Business and Dr. David H. Good in the School of Public and Environmental Affairs, Indiana University at Bloomington, for their insightful lectures. I am also grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments and suggestions. Revision History z 2005.11 First draft http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 1 Categorical Dependent Variable Models Using SAS, STATA, LIMDEP, and SPSS Hun Myoung Park This document summarizes regression models for categorical dependent variables and illustrates how to estimate individual models using SAS 9.1, STATA 9.0, LIMDEP 8.0, and SPSS 13.0. 1. 2. 3. 4. 5. 6. 
7. 8. 9. 10. Introduction The Binary Logit Model The Binary Probit Model Bivariate Logit/Probit Models Ordered Logit/Probit Models The Multinomial Logit Model The Conditional Logit Model The Nested Logit Model Conclusion Appendix

1. Introduction

The categorical variable here refers to a variable that is binary, ordinal, or nominal. Event count data are discrete (categorical) but often considered continuous. When the dependent variable is categorical, the ordinary least squares (OLS) method can no longer produce the best linear unbiased estimator (BLUE); that is, OLS is biased and inefficient. Consequently, researchers have developed various categorical dependent variable models (CDVMs). The nonlinearity of CDVMs makes it difficult to interpret outputs, since the effect of a change in a variable depends on the values of all other variables in the model (Long 1997).

1.1 Categorical Dependent Variable Models

In CDVMs, the left-hand side (LHS) variable or dependent variable is neither interval nor ratio, but rather categorical. The level of measurement and data generation process (DGP) of a dependent variable determines the proper type of CDVM. Thus, binary responses are modeled with the binary logit and probit regressions, ordinal responses are formulated into the ordered logit/probit regression models, and nominal responses are analyzed by multinomial logit, conditional logit, or nested logit models. Independent variables on the right-hand side (RHS) may be interval, ratio, or binary (dummy).

The CDVMs adopt the maximum likelihood (ML) estimation method, whereas OLS uses the moment based method. The ML method requires assumptions about probability distribution functions, such as the logistic function and the complementary log-log function. Logit models use the standard logistic probability distribution, while probit models assume the standard normal distribution. This document focuses on logit and probit models only. Table 1 summarizes CDVMs in comparison with OLS.

Table 1. Ordinary Least Squares and CDVMs
Model    Dependent (LHS)                             Estimation                   Independent (RHS)
OLS      Interval or ratio                           Ordinary least squares       A linear function of
                                                     (moment based method)        interval/ratio or binary
CDVMs    Binary response    Binary (0 or 1)          Maximum likelihood method    variables:
         Ordinal response   Ordinal (1st, 2nd, 3rd…)                              β0 + β1X1 + β2X2 ...
         Nominal response   Nominal (A, B, C …)
         Event count data   Count (0, 1, 2, 3…)

1.2 Logit Models versus Probit Models

How do logit models differ from probit models? The core difference lies in the distribution of errors. In the logit model, errors are assumed to follow the standard logistic distribution with mean 0 and variance π²/3, whose density is λ(ε) = e^ε / (1 + e^ε)². The errors of the probit model are assumed to follow the standard normal distribution, φ(ε) = (1/√(2π)) e^(−ε²/2).

Figure 1. Comparison of the Standard Normal and Standard Logistic Probability Distributions (four panels: the PDF and CDF of the standard normal distribution, and the PDF and CDF of the standard logistic distribution)

The probability density function (PDF) of the standard normal probability distribution has a higher peak and thinner tails than the standard logistic probability distribution (Figure 1).
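If you want to draw a comparison like Figure 1 yourself, the following STATA sketch overlays the two densities and then the two cumulative distribution functions with twoway function; normalden() and normal() are STATA's built-in standard normal PDF and CDF, and the logistic expressions are typed directly from the formulas above.

. twoway function y=normalden(x), range(-4 4) || function y=exp(x)/(1+exp(x))^2, range(-4 4)
. twoway function y=normal(x), range(-4 4) || function y=exp(x)/(1+exp(x)), range(-4 4)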
The http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 3 standard logistic distribution looks as if someone has weighed down the peak of the standard normal distribution and strained its tails. As a result, the cumulative density function (CDF) of the standard normal distribution is steeper in the middle than the CDF of the standard logistic distribution and quickly approaches zero on the left and one on the right. The two models, of course, produce different parameter estimates. In binary response models, the estimates of a logit model are roughly π 3 times larger than those of the corresponding probit model. These estimators, however, are almost the same in terms of the standardized impacts of independent variables and predictions (Long 1997). In general, logit models reach convergence in estimation fairly well. Some (multinomial) probit models may take a long time to reach convergence, although the probit works well for bivariate models. 1.3 Estimation in SAS, STATA, LIMDEP, and SPSS SAS provides several procedures for CDVMs, such as LOGISTIC, PROBIT, GENMOD, QLIM, MDC, and CATMOD. Since these procedures support various models, a CDVM can be estimated by multiple procedures. For example, you may run a binary logit model using the LOGISTIC, PROBIT, GENMODE, and QLIM. The LOGISTIC and PROBIT procedures of SAS/STAT have been commonly used, but the QLIM and MDC procedures of SAS/ETS are noted for their advanced features. Table 2. Procedures and Commands for CDVMs Model SAS 9.1 Stata 9.0 LIMDEP 8.0 REG .regress Regress$ QLIM, GENMOD, .logit, Binary logit Logit$ LOGISTIC, PROBIT, logistic CATMOD Binary QLIM, GENMOD, Binary probit .probit Probit$ LOGISTIC, PROBIT Bivariate logit QLIM Bivariate Bivariate probit QLIM .biprobit Bivariateprobit$ QLIM, PROBIT, Ordered logit .ologit Ordered$, Logit$ LOGISTIC * Generalized logit .gologit Ordinal QLIM, PROBIT, .oprobit Ordered$ Ordered probit LOGISTIC Multinomial logit CATMOD .mlogit Mlogit$, Logit$ Conditional logit MDC, PHREG .clogit Clogit$, Logit$ Nominal Nested logit MDC .nlogit Nlogit$** Multinomial probit MDC .mprobit * User-written commands written by Fu (1998) and Williams (2005) ** The Nlogit$ command is supported by NLOGIT 3.0, which is sold separately. OLS (Ordinary least squares) SPSS13.0 Regression Logistic regression Probit Plum Plum Nomreg Coxreg - The QLIM (Qualitative and LImited dependent variable Model) procedure analyzes various categorical and limited dependent variable regression models such as censored, truncated, and sample-selection models. This QLIM procedure also handles Box-Cox regression and bivariate http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 4 probit and logit models. The MDC (Multinomial Discrete Choice) Procedure can estimate multinomial probit, conditional logit, and nested (multinomial) logit models. Unlike SAS, STATA has individualized commands for corresponding CDVMs. For example, the .logit and .probit commands respectively fit the binary logit and probit models. The LIMDEP Logit$ and Probit$ commands support a variety of CDVMs that are addressed in Greene’s Econometric Analysis (2003). SPSS supports some related commands for CDVMs but has limited ability to analyze categorical data. Because of its limitation, SPSS outputs are skipped here. Table 2 summarizes the procedures and commands for CDVMs. 1. 
4 Long and Freese’s SPost Module STATA users may take advantages of user-written modules such as J. Scott Long and Jeremy Freese’s SPost. The module allows researchers to conduct follow-up analyses of various CDVMs including event count data models. See section 2.2 for major SPost commands. In order to install SPost, execute the following commands consecutively. For more details, visit J. Scott Long’s Web site at http://www.indiana.edu/~jslsoc/spost_install.htm. . net from http://www.indiana.edu/~jslsoc/stata/ . net install spost9_ado, replace . net get spost9_do, replace If you want to use Vincent Kang Fu’s gologit (2000) and Richard Williams’ gologit2 (2005) for the generalized ordered logit model, type in the following. . net search gologit . net install gologit from(http://www.stata.com/users/jhardin) . net install gologit2 from(http://fmwww.bc.edu/RePEc/bocode/g) http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 5 2. The Binary Logit Regression Model exp( xβ ) = Λ ( xβ ) , where Λ 1 + exp( xβ ) indicates a link function, the cumulative standard logistic probability distribution function. This chapter examines how car ownership (owncar) is affected by monthly income (income), age, and gender (male). See the appendix for details about the data set. The binary logit model is represented as Pr ob( y = 1 | x) = 2.1 Binary Logit in STATA (.logit) STATA provides two equivalent commands for the binary logit model, which present the same result in different ways. The .logit command produces coefficients with respect to logit (log of odds), while the .logistic reports estimates as odd ratios. . logistic owncar income age male Logistic regression Number of obs LR chi2(3) Prob > chi2 Pseudo R2 Log likelihood = -273.84758 = = = = 437 18.24 0.0004 0.0322 -----------------------------------------------------------------------------owncar | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------income | .9898826 .5677504 -0.02 0.986 .3216431 3.046443 age | 1.279626 .088997 3.55 0.000 1.116561 1.466505 male | 1.513669 .3111388 2.02 0.044 1.011729 2.264633 -----------------------------------------------------------------------------. logit In order to get the coefficients (log of odds), simply run the .logit without any argument right after the .logistic command. Or run an independent .logit command with all arguments. . logit owncar income age male Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood Logistic regression Log likelihood = -273.84758 = = = = -282.96512 -273.93537 -273.84761 -273.84758 Number of obs LR chi2(3) Prob > chi2 Pseudo R2 = = = = 437 18.24 0.0004 0.0322 -----------------------------------------------------------------------------owncar | Coef. Std. Err. z P>|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------income | -.010169 .5735533 -0.02 0.986 -1.134313 1.113975 age | .2465678 .0695492 3.55 0.000 .1102539 .3828817 male | .4145366 .2055527 2.02 0.044 .0116606 .8174126 _cons | -4.682741 1.474519 -3.18 0.001 -7.572745 -1.792738 ------------------------------------------------------------------------------ http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 6 Note that a coefficient of the .logit is the logarithmic transformed corresponding estimator of the .logistic. For example, .2465678= log(1.279626). STATA has post-estimation commands that conduct follow-up analyses. The .predict command computes predictions, residuals, or standard errors of the prediction and stores them into a new variable. . predict r, residual The .test and .lrtest commands respectively conduct the Wald test and likelihood ratio test. . test income age ( 1) ( 2) income = 0 age = 0 chi2( 2) = Prob > chi2 = 12.57 0.0019 2.2 Using the SPost Module in STATA The SPost module provides useful follow-up analysis commands (ado files) for various categorical dependent variable models (Long and Freese 2003). The .fitstat command calculates various goodness-of-fit statistics such as log likelihood, McFadden’s R2 (or Pseudo R2), Akaike Information Criterion (AIC), and (Bayesian Information Criterion (BIC). . fitstat Measures of Fit for logistic of owncar Log-Lik Intercept Only: D(433): -282.965 547.695 McFadden's R2: 0.032 Maximum Likelihood R2: 0.041 McKelvey and Zavoina's R2: 0.059 Variance of y*: 3.495 Count R2: 0.638 AIC: 1.272 BIC: -2084.916 Log-Lik Full Model: LR(3): Prob > LR: McFadden's Adj R2: Cragg & Uhler's R2: Efron's R2: Variance of error: Adj Count R2: AIC*n: BIC': -273.848 18.235 0.000 0.018 0.056 0.040 3.290 -0.033 555.695 0.005 The likelihood ratio for goodness of fit is computed as, . di 2*(-273.848 - (-282.965)) 18.234 The .listcoef command lists unstandardized coefficients (parameter estimates), factor and percent changes, and standardized coefficients to help interpret results. The help option tells how to read the outputs. . listcoef, help http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 7 logistic (N=437): Factor Change in Odds Odds of: 1 vs 0 ---------------------------------------------------------------------owncar | b z P>|z| e^b e^bStdX SDofX -------------+-------------------------------------------------------income | -0.01017 -0.018 0.986 0.9899 0.9982 0.1792 age | 0.24657 3.545 0.000 1.2796 1.4876 1.6108 male | 0.41454 2.017 0.044 1.5137 1.2279 0.4953 ---------------------------------------------------------------------b = raw coefficient z = z-score for test of b=0 P>|z| = p-value for z-test e^b = exp(b) = factor change in odds for unit increase in X e^bStdX = exp(b*SD of X) = change in odds for SD increase in X SDofX = standard deviation of X The .prtab command constructs a table of predicted values (events) for all combinations of categorical variables listed. The following example shows that 60 percent of female and 70 percent of male students are likely to own cars, given the mean values of income and age. . 
prtab male logistic: Predicted probabilities of positive outcome for owncar ---------------------male | Prediction ----------+----------0 | 0.6017 1 | 0.6958 ---------------------- x= income .61683982 age 20.691076 male .57208238 The .prvalue lists predicted probabilities of positive and negative outcomes for a given set of values for the independent variables. Note both the .prtab and .prvalue commands report the identical predicted probability that male students own cars, .6017, holding other variables at their means. . prvalue, x(male=0) rest(mean) logistic: Predictions for owncar Pr(y=1|x): Pr(y=0|x): x= income .61683982 0.6017 0.3983 age 20.691076 95% ci: (0.5286,0.6706) 95% ci: (0.3294,0.4714) male 0 The most useful command is the .prchange, which calculates marginal effects (changes) and discrete changes at the given set of values of independent variables. The help option tells how to read the outputs. For instance, the predicted probability that a male students owns a car is .094 (0->1) higher than that of female students, holding other variables at their mean. . prchange, help http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 8 logit: Changes in Predicted Probabilities for owncar income age male min->max -0.0019 0.4404 0.0940 0->1 -0.0023 0.0032 0.0940 Pr(y|x) 0 0.3430 1 0.6570 x= sd(x)= income .61684 .17918 age 20.6911 1.61081 -+1/2 -0.0023 0.0555 0.0932 -+sd/2 -0.0004 0.0893 0.0462 MargEfct -0.0023 0.0556 0.0934 male .572082 .495344 Pr(y|x): probability of observing each y for specified x values Avg|Chg|: average of absolute value of the change across categories Min->Max: change in predicted probability as x changes from its minimum to its maximum 0->1: change in predicted probability as x changes from 0 to 1 -+1/2: change in predicted probability as x changes from 1/2 unit below base value to 1/2 unit above -+sd/2: change in predicted probability as x changes from 1/2 standard dev below base to 1/2 standard dev above MargEfct: the partial derivative of the predicted probability/rate with respect to a given independent variable The SPost module also includes the .prgen, which computes a series of predictions by holding all variables but one interval variable constant and allowing that variable to vary (Long and Freese 2003). . prgen income, from(.1) to(1.5) x(male=1) rest(median) generate(ppcar) logistic: Predicted values as income varies from .1 to 1.5. x= income .58200002 age 21 male 1 The above command computes predicted probabilities that male students own cars when income changes from $100 through $1,500, holding age at its median of 21 and stores them into a new variable ppcar. 2.3 Using the SAS LOGISTIC and PROBIT Procedures SAS has several procedures for the binary logit model such as the LOGISTIC, PROBIT, GENMOD, and QLIM. The LOGISTIC procedure is commonly used for the binary logit model, but the PROBIT procedure also estimates the binary logit. Let us first consider the LOGISTIC procedure. 
PROC LOGISTIC DESCENDING DATA = masil.students; MODEL owncar = income age male; RUN; The LOGISTIC Procedure http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 9 Model Information Data Set Response Variable Number of Response Levels Model Optimization Technique MASIL.STUDENTS owncar 2 binary logit Fisher's scoring Number of Observations Read Number of Observations Used 437 437 Response Profile Ordered Value owncar Total Frequency 1 2 1 0 284 153 Probability modeled is owncar=1. Model Convergence Status Convergence criterion (GCONV=1E-8) satisfied. Model Fit Statistics Criterion AIC SC -2 Log L Intercept Only Intercept and Covariates 567.930 572.010 565.930 555.695 572.015 547.695 Testing Global Null Hypothesis: BETA=0 Test Likelihood Ratio Score Wald Chi-Square DF Pr > ChiSq 18.2351 17.4697 16.7977 3 3 3 0.0004 0.0006 0.0008 Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq Intercept income 1 1 -4.6827 -0.0102 1.4745 0.5736 10.0855 0.0003 0.0015 0.9859 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University age male 1 1 0.2466 0.4145 Categorical Dependent Variable Models: 10 0.0695 0.2056 12.5686 4.0670 0.0004 0.0437 Odds Ratio Estimates Effect Point Estimate income age male 0.990 1.280 1.514 95% Wald Confidence Limits 0.322 1.117 1.012 3.046 1.467 2.265 Association of Predicted Probabilities and Observed Responses Percent Concordant Percent Discordant Percent Tied Pairs 58.9 34.3 6.8 43452 Somers' D Gamma Tau-a c 0.246 0.264 0.112 0.623 The SAS LOGISTIC, PROBIT, and GENMOD procedures by default uses a smaller value in the dependent variable as success. Thus, the magnitudes of the coefficients remain the same, but the signs are opposite to those of the QLIM procedure, STATA, and LIMDEP. The DESCENDING option forces SAS to use a larger value as success. Alternatively, you may explicitly specify the category of successful event using the EVENT option as follows. PROC LOGISTIC DESCENDING DATA = masil.students; MODEL owncar(EVENT=’1’) = income age male; RUN; The SAS LOGISTIC procedure computes odds changes when independent variables increase by the units specified in the UNITS statement. The SD below indicates a standard deviation increase in income and age (e.g., -2 means a two unit decrease in independent variables). PROC LOGISTIC DESCENDING DATA = masil.students; MODEL owncar = income age male; UNITS income=SD age=SD; RUN; The UNITS statement adds the “Adjusted Odds Ratios” to the end of the outputs above. Note that the odds changes of the two variables are identical to those under the “e^bStdX” of the previous SPost .listcoef output. Adjusted Odds Ratios Effect Unit Estimate income age 0.1792 1.6108 0.998 1.488 Now, let us use the PROBIT procedure to estimate the same binary logit model. The PROBIT requires the CLASS statement to list categorical variables. The /DIST=LOGISTIC option indicates the probability distribution to be used in maximum likelihood estimation. 
http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 11 PROC PROBIT DATA = masil.students; CLASS owncar; MODEL owncar = income age male /DIST=LOGISTIC; RUN; Probit Procedure Model Information Data Set Dependent Variable Number of Observations Name of Distribution Log Likelihood MASIL.STUDENTS owncar 437 Logistic -273.847577 Number of Observations Read Number of Observations Used 437 437 Class Level Information Name Levels owncar Values 2 0 1 Response Profile Ordered Value 1 2 owncar Total Frequency 0 1 153 284 PROC PROBIT is modeling the probabilities of levels of owncar having LOWER Ordered Values in the response profile table. Algorithm converged. Type III Analysis of Effects Effect DF Wald Chi-Square Pr > ChiSq income age male 1 1 1 0.0003 12.5686 4.0670 0.9859 0.0004 0.0437 Analysis of Parameter Estimates Parameter DF Estimate http://www.indiana.edu/~statmath Standard Error 95% Confidence Limits ChiSquare Pr > ChiSq © 2003-2005, The Trustees of Indiana University Intercept income age male 1 1 1 1 4.6827 0.0102 -0.2466 -0.4145 Categorical Dependent Variable Models: 12 1.4745 0.5736 0.0695 0.2056 1.7927 -1.1140 -0.3829 -0.8174 7.5727 1.1343 -0.1103 -0.0117 10.09 0.00 12.57 4.07 0.0015 0.9859 0.0004 0.0437 Unlike LOGISTIC, PROBIT does not have the DESCENDING option. Thus, you have to switch the signs of coefficients when comparing with those of STATA and LIMDEP. The PROBIT procedure also does not have the UNITS statement to compute changes in odds. 2.4 Using the SAS GENMOD and QLIM Procedures The GENMOD provides flexible methods to estimate generalized linear model. The DISTRIBUTION (DIST) and the LINK=LOGIT options respectively specify a probability distribution and a link function. PROC GENMOD DATA = masil.students DESC; MODEL owncar = income age male /DIST=BINOMIAL LINK=LOGIT; RUN; The GENMOD Procedure Model Information Data Set Distribution Link Function Dependent Variable Number Number Number Number of of of of MASIL.STUDENTS Binomial Logit owncar Observations Read Observations Used Events Trials 437 437 284 437 Response Profile Ordered Value 1 2 owncar Total Frequency 1 0 284 153 PROC GENMOD is modeling the probability that owncar='1'. Criteria For Assessing Goodness Of Fit Criterion Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X2 http://www.indiana.edu/~statmath DF Value Value/DF 433 433 433 433 547.6952 547.6952 436.4352 436.4352 1.2649 1.2649 1.0079 1.0079 © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 13 Log Likelihood -273.8476 Algorithm converged. Analysis Of Parameter Estimates Parameter DF Estimate Standard Error Intercept income age male Scale 1 1 1 1 0 -4.6827 -0.0102 0.2466 0.4145 1.0000 1.4745 0.5736 0.0695 0.2056 0.0000 Wald 95% Confidence Limits -7.5727 -1.1343 0.1103 0.0117 1.0000 -1.7927 1.1140 0.3829 0.8174 1.0000 ChiSquare Pr > ChiSq 10.09 0.00 12.57 4.07 0.0015 0.9859 0.0004 0.0437 NOTE: The scale parameter was held fixed. If you have categorical (string) independent variables, list the variables in the CLASS statement without creating dummy variables. PROC GENMOD DATA = masil.students DESC; CLASS male; MODEL owncar = income age male /DIST=BINOMIAL LINK=LOGIT; RUN; Users may also provide their own link functions using the FWDLINK and INVLINK statements instead of the LINK=LOGIT option. 
PROC GENMOD DATA = masil.students DESC; FWDLINK link=LOG(_MEAN_/(1-_MEAN_)); INVLINK invlink=1/(1+EXP(-1*_XBETA_)); MODEL owncar = income age male /DIST=BINOMIAL; RUN; All three GENMOD examples discussed so far produce the identical result. The QLIM procedure estimates not only logit and probit models, but also censored, truncated, and sample-selected models. You may provide characteristics of the dependent variable either in the ENDOGENOUS statement or the option of the MODEL statement. PROC QLIM DATA=masil.students; MODEL owncar = income age male; ENDOGENOUS owncar ~ DISCRETE (DIST=LOGIT); RUN; Or, PROC QLIM DATA=masil.students; MODEL owncar = income age male /DISCRETE (DIST=LOGIT); RUN; The QLIM Procedure http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 14 Discrete Response Profile of owncar Index Value 1 2 0 1 Frequency Percent 153 284 35.01 64.99 Model Fit Summary Number of Endogenous Variables Endogenous Variable Number of Observations Log Likelihood Maximum Absolute Gradient Number of Iterations AIC Schwarz Criterion 1 owncar 437 -273.84758 9.63219E-6 8 555.69515 572.01489 Goodness-of-Fit Measures Measure Likelihood Ratio (R) Upper Bound of R (U) Aldrich-Nelson Cragg-Uhler 1 Cragg-Uhler 2 Estrella Adjusted Estrella McFadden's LRI Veall-Zimmermann McKelvey-Zavoina Value 18.235 565.93 0.0401 0.0409 0.0563 0.0415 0.0234 0.0322 0.071 0.1699 Formula 2 * (LogL - LogL0) - 2 * LogL0 R / (R+N) 1 - exp(-R/N) (1-exp(-R/N)) / (1-exp(-U/N)) 1 - (1-R/U)^(U/N) 1 - ((LogL-K)/LogL0)^(-2/N*LogL0) R / U (R * (U+N)) / (U * (R+N)) N = # of observations, K = # of regressors Algorithm converged. Parameter Estimates Parameter Estimate Standard Error t Value Approx Pr > |t| Intercept income age male -4.682741 -0.010169 0.246568 0.414537 1.474519 0.573553 0.069549 0.205553 -3.18 -0.02 3.55 2.02 0.0015 0.9859 0.0004 0.0437 Finally, the CATMOD procedure fits the logit model to the functions of categorical response variables. This procedure, however, produces slightly different estimators compared to those of other procedures discussed so far. This procedure is, therefore, less recommended for the binary logit model. The DIRECT statement specifies interval or ratio variables used in the MODEL. The /NOPROFILE suppresses the display of the population profiles and the response profiles. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 15 PROC CATMOD DATA = masil.students; DIRECT income age; MODEL owncar = income age male /NOPROFILE; RUN; 2.5 Binary Logit in LIMDEP (Logit$) The Logit$ command in LIMDEP estimates various logit models. The dependent variable is specified in the Lhs$ (left-hand side) subcommand and a list of independent variables in the Rhs$ (right-hand side). You have to explicitly specify the ONE for the intercept. The Marginal Effects$ and the Means$ subcommands compute marginal effects at the mean values of independent variables. LOGIT; Lhs=owncar; Rhs=ONE,income,age,male; Marginal Effects; Means$ Normal exit from iterations. Exit status=0. 
+---------------------------------------------+ | Multinomial Logit Model | | Maximum Likelihood Estimates | | Model estimated: Sep 17, 2005 at 05:31:28PM.| | Dependent variable OWNCAR | | Weighting variable None | | Number of observations 437 | | Iterations completed 5 | | Log likelihood function -273.8476 | | Restricted log likelihood -282.9651 | | Chi squared 18.23509 | | Degrees of freedom 3 | | Prob[ChiSqd > value] = .3933723E-03 | | Hosmer-Lemeshow chi-squared = 8.44648 | | P-value= .39111 with deg.fr. = 8 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Characteristics in numerator of Prob[Y = 1] Constant -4.682741385 1.4745190 -3.176 .0015 INCOME -.1016896029E-01 .57355331 -.018 .9859 .61683982 AGE .2465677833 .69549211E-01 3.545 .0004 20.691076 MALE .4145365774 .20555276 2.017 .0437 .57208238 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +--------------------------------------------------------------------+ | Information Statistics for Discrete Choice Model. | | M=Model MC=Constants Only M0=No Model | | Criterion F (log L) -273.84758 -282.96512 -302.90532 | | LR Statistic vs. MC 18.23509 .00000 .00000 | | Degrees of Freedom 3.00000 .00000 .00000 | | Prob. Value for LR .00039 .00000 .00000 | | Entropy for probs. 273.84758 282.96512 302.90532 | | Normalized Entropy .90407 .93417 1.00000 | http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 16 | Entropy Ratio Stat. 58.11548 39.88039 .00000 | | Bayes Info Criterion 565.93495 584.17004 624.05044 | | BIC - BIC(no model) 58.11548 39.88039 .00000 | | Pseudo R-squared .03222 .00000 .00000 | | Pct. Correct Prec. 63.84439 .00000 50.00000 | | Means: y=0 y=1 y=2 y=3 yu=4 y=5, y=6 y>=7 | | Outcome .3501 .6499 .0000 .0000 .0000 .0000 .0000 .0000 | | Pred.Pr .3501 .6499 .0000 .0000 .0000 .0000 .0000 .0000 | | Notes: Entropy computed as Sum(i)Sum(j)Pfit(i,j)*logPfit(i,j). | | Normalized entropy is computed against M0. | | Entropy ratio statistic is computed against M0. | | BIC = 2*criterion - log(N)*degrees of freedom. | | If the model has only constants or if it has no constants, | | the statistics reported here are not useable. | +--------------------------------------------------------------------+ +-------------------------------------------+ | Partial derivatives of probabilities with | | respect to the vector of characteristics. | | They are computed at the means of the Xs. | +-------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Characteristics in numerator of Prob[Y = 1] Constant -1.055282283 .33183024 -3.180 .0015 INCOME -.2291632775E-02 .12925338 -.018 .9859 .61683982 AGE .5556544593E-01 .15534022E-01 3.577 .0003 20.691076 Marginal effect for dummy variable is P|1 - P|0. MALE .9403411023E-01 .46726710E-01 2.012 .0442 .57208238 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) 
+----------------------------------------+ | Fit Measures for Binomial Choice Model | | Logit model for variable OWNCAR | +----------------------------------------+ | Proportions P0= .350114 P1= .649886 | | N = 437 N0= 153 N1= 284 | | LogL = -273.84758 LogL0 = -282.9651 | | Estrella = 1-(L/L0)^(-2L0/n) = .04153 | +----------------------------------------+ | Efron | McFadden | Ben./Lerman | | .03963 | .03222 | .56318 | | Cramer | Veall/Zim. | Rsqrd_ML | | .04010 | .07099 | .04087 | +----------------------------------------+ | Information Akaike I.C. Schwarz I.C. | | Criteria 1.27161 572.01489 | +----------------------------------------+ Frequencies of actual & predicted outcomes Predicted outcome has maximum probability. Threshold value for predicting Y=1 = .5000 Predicted ------ ---------- + ----Actual 0 1 | Total ------ ---------- + ----0 21 132 | 153 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University 1 -----Total 26 258 ---------47 390 | + | Categorical Dependent Variable Models: 17 284 ----437 Note that the marginal effects above are identical to those of the SPost .prchange command in section 2.2. LIMDEP computes discrete changes for binary variables like male. 2.6 Binary Logit in SPSS SPSS has the Logistic regression command for the binary logit model. LOGISTIC REGRESSION VAR=owncar /METHOD=ENTER income age male /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) . Table 3 summarizes parameter estimates and goodness-of-fit statistics across procedures and commands for the binary logit model. Estimates and their standard errors produced are almost identical except some rounding errors. As shown in Table 3, the QLIM and LOGISTIC are recommended for categorical dependent variables. Note that the PROBIT procedure returns the opposite signs of estimates. Table 3. Parameter Estimates and Goodness-of-fit Statistics of the Binary Logit Model Intercept income age male LOGISTIC PROBIT GENMOD QLIM STATA LIMDEP -4.6827 (1.4745) -.0102 (.5736) .2466 (.0695) .4145 (.2056) 547.695* 18.2351 4.6827 (1.4745) .0102 (.5736) -.2466 (.0695) -.4145 (.2056) -273.8476 -4.6827 (1.4745) -.0102 (.5736) .2466 (.0695) .4145 (.2056) -273.8476 -4.6827 (1.4745) -.0102 (.5736) .2466 (.0695) .4145 (.2056) -273.8476 18.235 .0322 555.6952** 572.0149 -4.6827 (1.4745) -.0102 (.5736) .2466 (.0695) .4145 (.2056) -273.8476 18.24 .0322 -4.6827 (1.4745) -.0102 (.5736) .2466 (.0695) .4145 (.2056) -273.8476 18.2351 .0322 1.2716 572.0150 565.9350 Log likelihood Likelihood test Pseudo R2 555.695** AIC 572.015 Schwarz BIC * The LOGISTIC procedure reports (-2*log likelihood). ** AIC*N http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 18 3. The Binary Probit Regression Model The probit model is represented as Pr ob( y = 1 | x) = Φ( xβ ) , where Φ indicates the cumulative standard normal probability distribution function. 3.1 Binary Probit in STATA (.probit) STATA has the .probit command to estimate the binary probit regression model. . probit owncar income age male Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -282.96512 -273.84832 -273.81741 -273.81741 Probit regression Number of obs LR chi2(3) Prob > chi2 Pseudo R2 Log likelihood = -273.81741 = = = = 437 18.30 0.0004 0.0323 -----------------------------------------------------------------------------owncar | Coef. Std. Err. z P>|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------income | .0005613 .3476842 0.00 0.999 -.6808873 .6820098 age | .1487005 .0409837 3.63 0.000 .068374 .2290271 male | .2579112 .1256085 2.05 0.040 .0117231 .5040993 _cons | -2.823671 .8730955 -3.23 0.001 -4.534907 -1.112435 ------------------------------------------------------------------------------ In order to get standardized estimates and factor changes, run the SPost .listcoef command. . listcoef probit (N=437): Unstandardized and Standardized Estimates Observed SD: .47755228 Latent SD: 1.0371456 ------------------------------------------------------------------------------owncar | b z P>|z| bStdX bStdY bStdXY SDofX -------------+----------------------------------------------------------------income | 0.00056 0.002 0.999 0.0001 0.0005 0.0001 0.1792 age | 0.14870 3.628 0.000 0.2395 0.1434 0.2309 1.6108 male | 0.25791 2.053 0.040 0.1278 0.2487 0.1232 0.4953 ------------------------------------------------------------------------------- You may compute the marginal effects and discrete change using the SPost .prchange. . prchange, x(income=1 age=21 male=0) probit: Changes in Predicted Probabilities for owncar income age min->max 0.0002 0.4900 0->1 0.0002 0.0014 http://www.indiana.edu/~statmath -+1/2 0.0002 0.0567 -+sd/2 0.0000 0.0912 MargEfct 0.0002 0.0567 © 2003-2005, The Trustees of Indiana University male 0.0937 0.0937 Pr(y|x) 0 0.3822 1 0.6178 x= sd(x)= income 1 .17918 age 21 1.61081 0.0981 Categorical Dependent Variable Models: 19 0.0487 0.0984 male 0 .495344 3.2 Using the PROBIT and LOGISTIC Procedures The PROBIT and LOGISTIC procedures estimate the binary probit model. Keep in mind that the coefficients of PROBIT has opposite signs. PROC PROBIT DATA = masil.students; CLASS owncar; MODEL owncar = income age male; RUN; Probit Procedure Model Information Data Set Dependent Variable Number of Observations Name of Distribution Log Likelihood MASIL.STUDENTS owncar 437 Normal -273.8174115 Number of Observations Read Number of Observations Used 437 437 Class Level Information Name Levels owncar 2 Values 0 1 Response Profile Ordered Value 1 2 owncar 0 1 Total Frequency 153 284 PROC PROBIT is modeling the probabilities of levels of owncar having LOWER Ordered Values in the response profile table. Algorithm converged. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 20 Type III Analysis of Effects Effect DF Wald Chi-Square Pr > ChiSq income age male 1 1 1 0.0000 13.1644 4.2160 0.9987 0.0003 0.0400 Analysis of Parameter Estimates Parameter Intercept income age male DF Estimate 1 1 1 1 Standard Error 2.8237 -0.0006 -0.1487 -0.2579 0.8731 0.3477 0.0410 0.1256 95% Confidence Limits 1.1124 -0.6820 -0.2290 -0.5041 ChiSquare Pr > ChiSq 4.5349 0.6809 -0.0684 -0.0117 10.46 0.00 13.16 4.22 0.0012 0.9987 0.0003 0.0400 The LOGISTIC procedure requires a normal probability distribution as a link function (/LINK=PROBIT or /LINK=NORMIT). PROC LOGISTIC DATA = masil.students DESC; MODEL owncar = income age male /LINK=PROBIT; RUN; The LOGISTIC Procedure Model Information Data Set Response Variable Number of Response Levels Model Optimization Technique MASIL.STUDENTS owncar 2 binary probit Fisher's scoring Number of Observations Read Number of Observations Used 437 437 Response Profile Ordered Value owncar Total Frequency 1 2 1 0 284 153 Probability modeled is owncar=1. 
Model Convergence Status http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 21 Convergence criterion (GCONV=1E-8) satisfied. Model Fit Statistics Criterion AIC SC -2 Log L Intercept Only Intercept and Covariates 567.930 572.010 565.930 555.635 571.955 547.635 Testing Global Null Hypothesis: BETA=0 Test Likelihood Ratio Score Wald Chi-Square DF Pr > ChiSq 18.2954 17.4697 17.4690 3 3 3 0.0004 0.0006 0.0006 Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq Intercept income age male 1 1 1 1 -2.8237 0.000548 0.1487 0.2579 0.8796 0.3496 0.0413 0.1257 10.3048 0.0000 12.9602 4.2096 0.0013 0.9987 0.0003 0.0402 Association of Predicted Probabilities and Observed Responses Percent Concordant Percent Discordant Percent Tied Pairs 57.8 32.9 9.3 43452 Somers' D Gamma Tau-a c 0.249 0.274 0.113 0.624 3.3 Using the GENMODE and QLIM Procedures The GENMOD procedure also estimates the binary probit model using the /DIST=BINOMIAL and /LINK=PROBIT options in the MODEL statement. PROC GENMOD DATA = masil.students DESC; MODEL owncar = income age male /DIST=BINOMIAL LINK=PROBIT; RUN; The GENMOD Procedure Model Information Data Set http://www.indiana.edu/~statmath MASIL.STUDENTS © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 22 Distribution Link Function Dependent Variable Number Number Number Number of of of of Binomial Probit owncar Observations Read Observations Used Events Trials 437 437 284 437 Response Profile Ordered Value owncar 1 2 Total Frequency 1 0 284 153 PROC GENMOD is modeling the probability that owncar='1'. Criteria For Assessing Goodness Of Fit Criterion Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X2 Log Likelihood DF Value Value/DF 433 433 433 433 547.6348 547.6348 437.0270 437.0270 -273.8174 1.2647 1.2647 1.0093 1.0093 Algorithm converged. Analysis Of Parameter Estimates Parameter DF Estimate Standard Error Intercept income age male Scale 1 1 1 1 0 -2.8237 0.0006 0.1487 0.2579 1.0000 0.8731 0.3477 0.0410 0.1256 0.0000 Wald 95% Confidence Limits -4.5349 -0.6809 0.0684 0.0117 1.0000 -1.1124 0.6820 0.2290 0.5041 1.0000 ChiSquare Pr > ChiSq 10.46 0.00 13.16 4.22 0.0012 0.9987 0.0003 0.0400 NOTE: The scale parameter was held fixed. The QLIM procedure provides various goodness-of-fit statistics. The DIST=NORMAL option indicates the normal probability distribution used in estimation. 
PROC QLIM DATA=masil.students; MODEL owncar = income age male /DISCRETE (DIST=NORMAL); RUN; The QLIM Procedure http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 23 Discrete Response Profile of owncar Index Value 1 2 0 1 Frequency Percent 153 284 35.01 64.99 Model Fit Summary Number of Endogenous Variables Endogenous Variable Number of Observations Log Likelihood Maximum Absolute Gradient Number of Iterations AIC Schwarz Criterion 1 owncar 437 -273.81741 3.82848E-8 10 555.63482 571.95456 Goodness-of-Fit Measures Measure Likelihood Ratio (R) Upper Bound of R (U) Aldrich-Nelson Cragg-Uhler 1 Cragg-Uhler 2 Estrella Adjusted Estrella McFadden's LRI Veall-Zimmermann McKelvey-Zavoina Value 18.295 565.93 0.0402 0.041 0.0565 0.0417 0.0235 0.0323 0.0712 0.0702 Formula 2 * (LogL - LogL0) - 2 * LogL0 R / (R+N) 1 - exp(-R/N) (1-exp(-R/N)) / (1-exp(-U/N)) 1 - (1-R/U)^(U/N) 1 - ((LogL-K)/LogL0)^(-2/N*LogL0) R / U (R * (U+N)) / (U * (R+N)) N = # of observations, K = # of regressors Algorithm converged. Parameter Estimates Parameter Estimate Standard Error t Value Approx Pr > |t| Intercept income age male -2.823671 0.000561 0.148701 0.257911 0.873096 0.347684 0.040984 0.125608 -3.23 0.00 3.63 2.05 0.0012 0.9987 0.0003 0.0400 3.4 Binary Probit in LIMDEP (Probit$) The LIMDEP Probit$ command estimates various probit models. Do not forget to include the ONE for the intercept. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 24 PROBIT; Lhs=owncar; Rhs=ONE,income,age,male$ Normal exit from iterations. Exit status=0. +---------------------------------------------+ | Binomial Probit Model | | Maximum Likelihood Estimates | | Model estimated: Sep 17, 2005 at 10:28:56PM.| | Dependent variable OWNCAR | | Weighting variable None | | Number of observations 437 | | Iterations completed 4 | | Log likelihood function -273.8174 | | Restricted log likelihood -282.9651 | | Chi squared 18.29542 | | Degrees of freedom 3 | | Prob[ChiSqd > value] = .3822542E-03 | | Hosmer-Lemeshow chi-squared = 8.18372 | | P-value= .41573 with deg.fr. = 8 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Index function for probability Constant -2.823670829 .87309548 -3.234 .0012 INCOME .5612515407E-03 .34768423 .002 .9987 .61683982 AGE .1487005234 .40983697E-01 3.628 .0003 20.691076 MALE .2579111914 .12560848 2.053 .0400 .57208238 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +----------------------------------------+ | Fit Measures for Binomial Choice Model | | Probit model for variable OWNCAR | +----------------------------------------+ | Proportions P0= .350114 P1= .649886 | | N = 437 N0= 153 N1= 284 | | LogL = -273.81741 LogL0 = -282.9651 | | Estrella = 1-(L/L0)^(-2L0/n) = .04166 | +----------------------------------------+ | Efron | McFadden | Ben./Lerman | | .03984 | .03233 | .56327 | | Cramer | Veall/Zim. | Rsqrd_ML | | .04016 | .07121 | .04100 | +----------------------------------------+ | Information Akaike I.C. Schwarz I.C. | | Criteria 1.27148 571.95456 | +----------------------------------------+ Frequencies of actual & predicted outcomes Predicted outcome has maximum probability. 
Threshold value for predicting Y=1 = .5000 Predicted ------ ---------- + ----Actual 0 1 | Total ------ ---------- + ----- http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University 0 1 -----Total 5 148 8 276 ---------13 424 | | + | Categorical Dependent Variable Models: 25 153 284 ----437 3.5 Binary Probit in SPSS SPSS has the Probit command to fit the binary probit model. This command requires a variable (e.g., n in the following example) with constant 1. COMPUTE n=1. PROBIT owncar OF n WITH income age male /LOG NONE /MODEL PROBIT /PRINT FREQ /CRITERIA ITERATE(20) STEPLIMIT(.1). Table 4 summarizes parameter estimates and goodness-of-fit statistics produced. Note that the LOGISTIC procedure reports slightly different estimates and standard errors. I would recommend the SAS QLIM procedure, STATA, and LIMDEP for the binary probit model. Table 4.Parameter Estimates and Goodness-of-fit Statistics of the Binary Probit Model Intercept income age male LOGISTIC PROBIT GENMOD QLIM STATA LIMDEP -2.8237 (.8796) .0005 (.3496) .1487 (.0413) .2579 (.1257) 547.653* 18.2954 2.8237 (.8731) -.0006 (.3477) -.1487 (.0410) -.2579 (.1256) -273.8174 -2.8237 (.8731) .0006 (.3477) .1487 (.0410) .2579 (.1256) -273.8174 -2.8237 (.8731) .0006 (.3477) .1487 (.0410) .2579 (.1256) -273.8174 18.295 .0323 555.6348** 571.9546 -2.8237 (.8731) .0006 (.3477) .1487 (.0410) .2579 (.1256) -273.8174 18.30 .0323 -2.8237 (.8731) .0006 (.3477) .1487 (.0410) .2579 (.1256) -273.8174 18.2954 .0323 1.2715 571.9546 Log likelihood Likelihood test Pseudo R2 555.635** AIC 571.955 Schwarz BIC * The LOGISTIC procedure reports (-2*log likelihood). ** AIC*N http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 26 4. Bivariate Probit/Logit Regression Models Bivariate regression models have two equations for the two dependent variables. This chapter explains the bivariate regression model with two binary dependent variables. Like the seemingly unrelated regression model (SUR), biviriate probit/logit models assume that the “independent, identically distributed” errors are correlated (Greene 2003). The bivariate probit model, although consuming relatively much time, is more likely to converge than the bivariate logit model. SAS supports both the bivariate probit and logit models, while STATA and LIMDEP estimate the bivariate probit model. Here we consider a model for car ownership (owncar) and housing type (offcamp). 4.1 Bivariate Probit in STATA (.biprobit) STATA has the .biprobit command to estimate the bivariate probit model. The two dependent variables precede a set of independent variables. . 
biprobit owncar offcamp income age male Fitting comparison equation 1: Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -282.96512 -273.84832 -273.81741 -273.81741 Fitting comparison equation 2: Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: Comparison: log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = -54.97403 -45.919608 -43.685448 -43.32265 -43.309675 -43.309654 log likelihood = -317.12707 Fitting full model: Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: 6: 7: log log log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood Bivariate probit regression Log likelihood = -306.45392 = = = = = = = = -317.12707 -307.15684 -306.49535 -306.46018 -306.45493 -306.45408 -306.45395 -306.45392 Number of obs Wald chi2(6) Prob > chi2 = = = 437 30.13 0.0000 -----------------------------------------------------------------------------| Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 27 owncar | income | -.0017168 .347905 -0.00 0.996 -.6835982 .6801645 age | .1492475 .0409238 3.65 0.000 .0690383 .2294568 male | .2594624 .1255633 2.07 0.039 .0133628 .505562 _cons | -2.834625 .8719679 -3.25 0.001 -4.543651 -1.125599 -------------+---------------------------------------------------------------offcamp | income | .7519064 .8254937 0.91 0.362 -.8660316 2.369844 age | .5895658 .149221 3.95 0.000 .297098 .8820336 male | .3939644 .2834889 1.39 0.165 -.1616637 .9495925 _cons | -10.34593 2.947501 -3.51 0.000 -16.12293 -4.568938 -------------+---------------------------------------------------------------/athrho | 2.387522 27.20167 0.09 0.930 -50.92678 55.70182 -------------+---------------------------------------------------------------rho | .9832658 .9027811 -1 1 -----------------------------------------------------------------------------Likelihood-ratio test of rho=0: chi2(1) = 21.3463 Prob > chi2 = 0.0000 4.2 Bivariate Probit in SAS The SAS QLIM procedure is able to estimate both the bivariate logit and probit models. You need to provide two equations that may or may not have different sets of independent variables. PROC QLIM DATA=masil.students; MODEL owncar = income age male; MODEL offcamp = income age male; ENDOGENOUS owncar offcamp ~ DISCRETE(DIST=NORMAL); RUN; Or, simply, PROC QLIM DATA=masil.students; MODEL owncar offcamp = income age male /DISCRETE; RUN; The QLIM Procedure Discrete Response Profile of owncar Index Value 1 2 0 1 Frequency Percent 153 284 35.01 64.99 Discrete Response Profile of offcamp Index Value 1 2 0 1 Frequency Percent 12 425 2.75 97.25 Model Fit Summary Number of Endogenous Variables http://www.indiana.edu/~statmath 2 © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 28 Endogenous Variable Number of Observations Log Likelihood Maximum Absolute Gradient Number of Iterations AIC Schwarz Criterion owncar offcamp 437 -306.45392 2.16967E-6 27 628.90784 661.54730 Algorithm converged. 
Parameter Estimates Parameter owncar.Intercept owncar.income owncar.age owncar.male offcamp.Intercept offcamp.income offcamp.age offcamp.male _Rho Estimate Standard Error t Value Approx Pr > |t| -2.834511 -0.001723 0.149243 0.259462 -10.345002 0.751837 0.589515 0.393859 0.999990 0.871964 0.347904 0.040924 0.125563 2.947054 0.825398 0.149197 0.283458 0 -3.25 -0.00 3.65 2.07 -3.51 0.91 3.95 1.39 . 0.0012 0.9960 0.0003 0.0388 0.0004 0.3624 <.0001 0.1647 . 4.3 Bivariate Probit in LIMDEP (Bivariateprobit$) LIMDEP has the Bivariateprobit$ command to estimate the bivariate probit model. The Lhs$ subcommand lists the two binary dependent variables, whereas Rh1$ and Rh2$ respectively indicate independent variables for the two dependent variables. In this model, you may not switch the order of dependent variables (Lhs=owncar,offcamp;) to avoid convergence problems. BIVARIATEPROBIT; Lhs=offcamp,owncar; Rh1=ONE,income,age,male; Rh2= ONE,income,age,male$ Normal exit from iterations. Exit status=0. +---------------------------------------------+ | FIML Estimates of Bivariate Probit Model | | Maximum Likelihood Estimates | | Model estimated: Sep 17, 2005 at 10:36:25PM.| | Dependent variable OFFOWN | | Weighting variable None | | Number of observations 437 | | Iterations completed 35 | | Log likelihood function -306.4539 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Index equation for OFFCAMP http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 29 Constant INCOME AGE MALE -10.34508235 3.6592558 -2.827 .0047 .7518407011 .85274898 .882 .3780 .5895189160 .18572787 3.174 .0015 .3938599470 .29308051 1.344 .1790 Index equation for OWNCAR Constant -2.834513147 .84825468 -3.342 .0008 INCOME -.1723102966E-02 .34222451 -.005 .9960 AGE .1492426338 .39739762E-01 3.755 .0002 MALE .2594618946 .12565094 2.065 .0389 Disturbance correlation RHO(1,2) .9941311591 .73338053E+09 .000 1.0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) .61683982 20.691076 .57208238 .61683982 20.691076 .57208238 Joint Frequency Table: Columns=OWNCAR Rows =OFFCAMP (N) = Count of Fitted Values 0 0 1 TOTAL ( 12 0) ( 0 0) ( 12 0) ( 141 0) ( 284 437) ( 425 437) ( 153 0) ( 284 437) ( 437 437) 1 TOTAL SAS, STATA, and LIMDEP produce almost the same parameter estimates and standard errors with slight differences after the decimal point. 4.4 Bivariate Logit in SAS The QLIM procedure also estimates the bivariate logit model using the DIST=LOGIT option. Unfortunately, this model does not fit in SAS. PROC QLIM DATA=masil.students; MODEL owncar = income age male; MODEL offcamp = income age male; ENDOGENOUS offcamp owncar ~ DISCRETE(DIST=LOGIT); RUN; Or, PROC QLIM DATA=masil.students; MODEL owncar offcamp = income age male /DISCRETE(DIST=LOGIT); RUN; http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 30 5. Ordered Logit/Probit Regression Models Suppose we have an ordinal dependent variable such as the degree of illegal parking (0=none, 1=sometimes, and 2=often). The ordered logit and probit models have the parallel regression assumption, which is violated from time to time. 
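Before turning to the software, it helps to keep the common model in mind. With J ordered categories, the ordered regression model can be written as

   P(y ≤ m | x) = F(τ_m − xβ),   m = 1, ..., J−1,

where F is the logistic CDF for the ordered logit and the standard normal CDF for the ordered probit. The packages identify this model differently: STATA reports no intercept and estimates every cut point τ_m, whereas SAS (QLIM) and LIMDEP report an intercept, which corresponds to −τ_1 in STATA's parameterization, and a threshold parameter equal to the distance between adjacent cut points. The numerical cross-checks in the following sections rely on this correspondence.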
5.1 Ordered Logit/Probit in STATA (.ologit and .oprobit) STATA has the .ologit and .oprobit commands to estimate the ordered logit and probit models, respectively. . ologit parking income age male Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = -103.78713 -92.739147 -90.036393 -89.861679 -89.860105 -89.860105 Ordered logistic regression Number of obs LR chi2(3) Prob > chi2 Pseudo R2 Log likelihood = -89.860105 = = = = 437 27.85 0.0000 0.1342 -----------------------------------------------------------------------------parking | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------income | -.5140709 1.283192 -0.40 0.689 -3.029082 2.00094 age | -.7362588 .1894339 -3.89 0.000 -1.107542 -.3649752 male | -1.227092 .4705859 -2.61 0.009 -2.149423 -.3047605 -------------+---------------------------------------------------------------/cut1 | -12.74479 3.787616 -20.16839 -5.321203 /cut2 | -10.83295 3.801685 -18.28412 -3.381786 ------------------------------------------------------------------------------ STATA estimates τ m , /cut1 and /cut2, assuming β 0 = 0 (Long and Freese 2003). This parameterization is different from that of SAS and LIMDEP, which assume τ 1 = 0 . . oprobit parking income age male Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood Ordered probit regression Log likelihood = -89.430754 = = = = = -103.78713 -90.990455 -89.496288 -89.430915 -89.430754 Number of obs LR chi2(3) Prob > chi2 Pseudo R2 = = = = 437 28.71 0.0000 0.1383 -----------------------------------------------------------------------------parking | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------income | -.1869839 .6116037 -0.31 0.760 -1.385705 1.011737 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 31 age | -.3594853 .0924817 -3.89 0.000 -.540746 -.1782246 male | -.5867871 .2205253 -2.66 0.008 -1.019009 -.1545655 -------------+---------------------------------------------------------------/cut1 | -6.000986 1.869046 -9.664248 -2.337724 /cut2 | -5.118676 1.862909 -8.769911 -1.467442 ------------------------------------------------------------------------------ 5.2 The Parallel Assumption and the Generalized Ordered Logit Model The .brant command of SPost is valid only in the .ologit command. This command tests the parallel regression assumption of the ordinal regression model. The outputs here are skipped. . quietly ologit parking income male . brant The parallel regression assumption is often violated. If this is the case, you may use the multinomial regression model or estimate the generalized ordered logit model (GOLM) using either the .gologit command written by Fu (1998) or the .gologit2 command by Williams (2005). Note that Fu’s module does not impose the restriction of (τ j − xβ j ) ≥ (τ j −1 − xβ j −1 ) (Long’s class note 2003). . gologit2 parking income age male, autofit -----------------------------------------------------------------------------Testing parallel lines assumption using the .05 level of significance... 
Step Step Step Step 1: 2: 3: 4: male meets the pl assumption (P Value = 0.9901) income meets the pl assumption (P Value = 0.8958) age meets the pl assumption (P Value = 0.7964) All explanatory variables meet the pl assumption Wald test of parallel lines assumption for the final model: ( 1) ( 2) ( 3) [0]male - [1]male = 0 [0]income - [1]income = 0 [0]age - [1]age = 0 chi2( 3) = Prob > chi2 = 0.04 0.9982 An insignificant test statistic indicates that the final model does not violate the proportional odds/ parallel lines assumption If you re-estimate this exact same model with gologit2, instead of autofit you can save time by using the parameter pl(male income age) -----------------------------------------------------------------------------Generalized Ordered Logit Estimates Log likelihood = -89.860105 ( 1) [0]male - [1]male = 0 http://www.indiana.edu/~statmath Number of obs Wald chi2(3) Prob > chi2 Pseudo R2 = = = = 437 21.74 0.0001 0.1342 © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 32 ( 2) [0]income - [1]income = 0 ( 3) [0]age - [1]age = 0 -----------------------------------------------------------------------------parking | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------0 | income | -.5140709 1.283192 -0.40 0.689 -3.029082 2.00094 age | -.7362588 .1894339 -3.89 0.000 -1.107543 -.3649752 male | -1.227092 .4705859 -2.61 0.009 -2.149423 -.3047605 _cons | 12.74479 3.787616 3.36 0.001 5.321202 20.16839 -------------+---------------------------------------------------------------1 | income | -.5140709 1.283192 -0.40 0.689 -3.029082 2.00094 age | -.7362588 .1894339 -3.89 0.000 -1.107543 -.3649752 male | -1.227092 .4705859 -2.61 0.009 -2.149423 -.3047605 _cons | 10.83295 3.801686 2.85 0.004 3.381785 18.28412 ------------------------------------------------------------------------------ 5.3 Ordered Logit in SAS The QLIM, LOGISTIC, and PROBIT procedures estimate ordered logit and probit models. As shown in Tables 3 and 4, the QLIM procedure is most recommended. Note that the DIST=LOGISTIC indicates the logit model to be estimated. PROC QLIM DATA=masil.students; MODEL parking = income age male /DISCRETE (DIST=LOGISTIC); RUN; The QLIM Procedure Discrete Response Profile of parking Index Value 1 2 3 0 1 2 Frequency Percent 413 20 4 94.51 4.58 0.92 Model Fit Summary Number of Endogenous Variables Endogenous Variable Number of Observations Log Likelihood Maximum Absolute Gradient Number of Iterations AIC Schwarz Criterion 1 parking 437 -89.86011 8.14046E-7 23 189.72021 210.11988 Goodness-of-Fit Measures Measure Likelihood Ratio (R) http://www.indiana.edu/~statmath Value 27.854 Formula 2 * (LogL - LogL0) © 2003-2005, The Trustees of Indiana University Upper Bound of R (U) Aldrich-Nelson Cragg-Uhler 1 Cragg-Uhler 2 Estrella Adjusted Estrella McFadden's LRI Veall-Zimmermann McKelvey-Zavoina 207.57 0.0599 0.0618 0.1633 0.0662 0.0418 0.1342 0.1861 0.6462 Categorical Dependent Variable Models: 33 - 2 * LogL0 R / (R+N) 1 - exp(-R/N) (1-exp(-R/N)) / (1-exp(-U/N)) 1 - (1-R/U)^(U/N) 1 - ((LogL-K)/LogL0)^(-2/N*LogL0) R / U (R * (U+N)) / (U * (R+N)) N = # of observations, K = # of regressors Algorithm converged. 
Parameter Estimates Parameter Estimate Standard Error t Value Approx Pr > |t| Intercept income age male _Limit2 12.744794 -0.514071 -0.736259 -1.227092 1.911842 3.787615 1.283192 0.189434 0.470586 0.468050 3.36 -0.40 -3.89 -2.61 4.08 0.0008 0.6887 0.0001 0.0091 <.0001 The SAS QLIM procedure estimates the intercept and τ 2 , assuming τ 1 = 0 . The estimated intercept of SAS is equivalent to (0-/cut1) in STATA. The _Limit2 of SAS is the difference between cut points of STATA, 1.91184=-10.83295-(-12.74479). The SAS LOGISTIC and PROBIT procedures are also used to estimate the ordered logit and probit models. These procedures recognize binary or ordinal response models by examining the dependent variable. PROC LOGISTIC DATA = masil.students DESC; MODEL parking = income age male /LINK=LOGIT; RUN; Like the STATA .ologit command, The LOGISTIC procedure fits the model, assuming the intercept is zero. The parameter estimates and standard errors are slightly different from those of the QLIM procedure and the .ologit command. Other parts of the output are skipped. Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq 1 1 1 1 1 10.8324 12.7444 -0.5142 -0.7362 -1.2271 3.8112 3.8021 1.2908 0.1900 0.4709 8.0784 11.2354 0.1587 15.0221 6.7902 0.0045 0.0008 0.6904 0.0001 0.0092 Intercept 2 Intercept 1 income age male PROC PROBIT DATA = masil.students; http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 34 CLASS parking; MODEL parking = income age male /DIST=LOGISTIC; RUN; The PROBIT procedure returns almost the same results as the QLIM procedure except for the signs of the estimates. Other parts of the output are skipped. Analysis of Parameter Estimates Parameter Intercept Intercept2 income age male DF Estimate 1 -12.7448 1 1.9118 1 0.5141 1 0.7363 1 1.2271 Standard Error 95% Confidence Limits 3.7876 -20.1684 0.4680 0.9945 1.2832 -2.0009 0.1894 0.3650 0.4706 0.3048 -5.3212 2.8292 3.0291 1.1075 2.1494 ChiSquare Pr > ChiSq 11.32 16.68 0.16 15.11 6.80 0.0008 <.0001 0.6887 0.0001 0.0091 5.4 Ordered Probit in SAS The QLIM procedure by default estimates a probit model. The DIST=NORMAL, the default option, may be omitted. PROC QLIM DATA=masil.students; MODEL parking = income age male /DISCRETE (DIST=NORMAL); RUN; The QLIM Procedure Discrete Response Profile of parking Index Value 1 2 3 0 1 2 Frequency Percent 413 20 4 94.51 4.58 0.92 Model Fit Summary Number of Endogenous Variables Endogenous Variable Number of Observations Log Likelihood Maximum Absolute Gradient Number of Iterations AIC Schwarz Criterion Goodness-of-Fit Measures Measure http://www.indiana.edu/~statmath Value Formula 1 parking 437 -89.43075 4.69307E-6 17 188.86151 209.26117 © 2003-2005, The Trustees of Indiana University Likelihood Ratio (R) Upper Bound of R (U) Aldrich-Nelson Cragg-Uhler 1 Cragg-Uhler 2 Estrella Adjusted Estrella McFadden's LRI Veall-Zimmermann McKelvey-Zavoina 28.713 207.57 0.0617 0.0636 0.1682 0.0683 0.0439 0.1383 0.1915 0.3011 Categorical Dependent Variable Models: 35 2 * (LogL - LogL0) - 2 * LogL0 R / (R+N) 1 - exp(-R/N) (1-exp(-R/N)) / (1-exp(-U/N)) 1 - (1-R/U)^(U/N) 1 - ((LogL-K)/LogL0)^(-2/N*LogL0) R / U (R * (U+N)) / (U * (R+N)) N = # of observations, K = # of regressors Algorithm converged. 
Parameter Estimates Parameter Estimate Standard Error t Value Approx Pr > |t| Intercept income age male _Limit2 6.000986 -0.186984 -0.359485 -0.586787 0.882310 1.869053 0.611605 0.092482 0.220526 0.196555 3.21 -0.31 -3.89 -2.66 4.49 0.0013 0.7598 0.0001 0.0078 <.0001 The QLIM procedure and .oprobit command produce almost the same result except for the τ 2 estimate. The _Limit2 of SAS is the difference of the cut points of STATA, .88231=-5.118676(-6.000986). The PROBIT and LOGISTIC procedures also estimate the ordered probit model. Keep in mind that the signs of the coefficients are reversed in the PROBIT procedure. PROC LOGISTIC DATA = masil.students DESC; MODEL parking = income age male /LINK=PROBIT; RUN; Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq 1 1 1 1 1 5.1181 6.0004 -0.1869 -0.3595 -0.5868 1.8373 1.8441 0.6160 0.0908 0.2203 7.7601 10.5872 0.0921 15.6767 7.0941 0.0053 0.0011 0.7615 <.0001 0.0077 Intercept 2 Intercept 1 income age male PROC PROBIT DATA = masil.students; CLASS parking; MODEL parking = income age male /DIST=NORMAL; RUN; http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 36 Analysis of Parameter Estimates Parameter Intercept Intercept2 income age male DF Estimate 1 1 1 1 1 -6.0010 0.8823 0.1870 0.3595 0.5868 Standard Error 1.8691 0.1966 0.6116 0.0925 0.2205 95% Confidence Limits -9.6643 0.4971 -1.0117 0.1782 0.1546 -2.3377 1.2675 1.3857 0.5407 1.0190 ChiSquare Pr > ChiSq 10.31 20.15 0.09 15.11 7.08 0.0013 <.0001 0.7598 0.0001 0.0078 5.5 Ordered Logit/Probit in LIMDEP (Ordered$) The LIMDEP Ordered$ command estimates ordered logit and probit models. The Logit$ subcommand runs the ordered logit model. ORDERED; Lhs=parking; Rhs=ONE,income,age,male; Logit$ Normal exit from iterations. Exit status=0. +---------------------------------------------+ | Ordered Probability Model | | Maximum Likelihood Estimates | | Model estimated: Sep 18, 2005 at 05:53:44PM.| | Dependent variable PARKING | | Weighting variable None | | Number of observations 437 | | Iterations completed 13 | | Log likelihood function -89.86011 | | Restricted log likelihood -103.7871 | | Chi squared 27.85404 | | Degrees of freedom 3 | | Prob[ChiSqd > value] = .3896741E-05 | | Underlying probabilities based on Logistic | | Cell frequencies for outcomes | | Y Count Freq Y Count Freq Y Count Freq | | 0 413 .945 1 20 .045 2 4 .009 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Index function for probability Constant 12.74479424 3.7876161 3.365 .0008 INCOME -.5140708643 1.2831923 -.401 .6887 .61683982 AGE -.7362588281 .18943391 -3.887 .0001 20.691076 MALE -1.227091964 .47058590 -2.608 .0091 .57208238 Threshold parameters for index Mu(1) 1.911841923 .46804996 4.085 .0000 +---------------------------------------------------------------------------+ | Cross tabulation of predictions. Row is actual, column is predicted. | http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 37 | Model = Logistic . Prediction is number of the most probable cell. 
| +-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ | Actual|Row Sum| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | +-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ | 0| 413| 413| 0| 0| | 1| 20| 20| 0| 0| | 2| 4| 4| 0| 0| +-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ |Col Sum| 437| 437| 0| 0| 0| 0| 0| 0| 0| 0| 0| +-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ LIMDEP and SAS QLIM produce the same results for the ordered logit model. Note that _Limit2 in SAS is equivalent to Mu(1), the threshold parameter, in LIMDEP. The ordered probit model is estimated by the Ordered$ command without the Logit$ subcommand. The command by default fits the ordered logit model. The output is comparable to that of the QLIM procedure. ORDERED; Lhs=parking; Rhs=ONE,income,age,male$ Normal exit from iterations. Exit status=0. +---------------------------------------------+ | Ordered Probability Model | | Maximum Likelihood Estimates | | Model estimated: Sep 18, 2005 at 05:55:42PM.| | Dependent variable PARKING | | Weighting variable None | | Number of observations 437 | | Iterations completed 11 | | Log likelihood function -89.43075 | | Restricted log likelihood -103.7871 | | Chi squared 28.71275 | | Degrees of freedom 3 | | Prob[ChiSqd > value] = .2572557E-05 | | Underlying probabilities based on Normal | | Cell frequencies for outcomes | | Y Count Freq Y Count Freq Y Count Freq | | 0 413 .945 1 20 .045 2 4 .009 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Index function for probability Constant 6.000985035 1.8690536 3.211 .0013 INCOME -.1869836008 .61160494 -.306 .7598 .61683982 AGE -.3594852294 .92482090E-01 -3.887 .0001 20.691076 MALE -.5867870572 .22052578 -2.661 .0078 .57208238 Threshold parameters for index Mu(1) .8823095981 .19655461 4.489 .0000 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +---------------------------------------------------------------------------+ | Cross tabulation of predictions. Row is actual, column is predicted. | http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 38 | Model = Probit . Prediction is number of the most probable cell. | +-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ | Actual|Row Sum| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | +-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ | 0| 413| 413| 0| 0| | 1| 20| 20| 0| 0| | 2| 4| 4| 0| 0| +-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ |Col Sum| 874| 437| 0| 0| 0| 0| 0| 0| 0| 0| 0| +-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ 5.6 Ordered Logit/Probit in SPSS The Plum command estimates the ordered logit and probit models in SPSS. The Threshold points in SPSS are equivalent to the cut points in STATA. PLUM parking WITH income age male /CRITERIA = CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5) PCONVERGE(1.0E-6) SINGULAR(1.0E-8) /LINK = LOGIT /PRINT = FIT PARAMETER SUMMARY . PLUM parking WITH income age male /CRITERIA = CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5) PCONVERGE(1.0E-6) SINGULAR(1.0E-8) /LINK = PROBIT /PRINT = FIT PARAMETER SUMMARY . 
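Whichever package is used, the ordered logit and probit coefficients are difficult to interpret directly, so it is worth turning the estimates into predicted probabilities. The following STATA commands sketch one way to do this for the parking model; the last two commands assume that the SPost package (Long and Freese 2003) is installed.

. quietly ologit parking income age male
. predict pr_none pr_some pr_often
. prvalue, x(age=21 male=1) rest(mean)
. prchange age, rest(mean)

The .predict command saves the predicted probability of each outcome (none, sometimes, often) for every observation; .prvalue evaluates the three probabilities for a particular profile (here a 21-year-old male with income held at its mean); and .prchange reports the marginal and discrete changes in those probabilities. The variable names pr_none, pr_some, and pr_often are arbitrary labels chosen for this illustration.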
http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 39 6. The Multinomial Logit Regression Model Suppose we have a nominal dependent variable such as the mode of transportation (walk, bike, bus, and car). The multinomial logit and conditional logit models are commonly used; the multinomial probit model is not often used mainly due to the practical difficulty in estimation. However, STATA does have the .mprobit command to fit the model. In the multinomial logit model, the independent variables contain characteristics of individuals, while they are the attributes of the choices in the conditional logit model. In other words, the conditional logit estimates how alternative-specific, not individual-specific, variables affect the likelihood of observing a given outcome (Long 2003). Therefore, data need to be appropriately arranged in advance. 6.1 Multinomial Logit/Probit in STATA (.mlogit and .mprobit) STATA has the .mlogit command for the multinomial logit model. The base() option indicates the value of the dependent variable to be used as the base category for the estimation. You may omit the default option, base(0). . mlogit transmode income age male, base(0) Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood = = = = = Multinomial logistic regression Log likelihood = -406.32509 -444.84113 -411.18604 -406.36474 -406.3251 -406.32509 Number of obs LR chi2(9) Prob > chi2 Pseudo R2 = = = = 437 77.03 0.0000 0.0866 -----------------------------------------------------------------------------transmode | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------1 | income | 4.018021 1.34443 2.99 0.003 1.382986 6.653057 age | .1915917 .1392928 1.38 0.169 -.0814172 .4646006 male | .2582886 .4039971 0.64 0.523 -.5335311 1.050108 _cons | -6.903473 2.97678 -2.32 0.020 -12.73785 -1.069091 -------------+---------------------------------------------------------------2 | income | 8.951041 1.338539 6.69 0.000 6.327552 11.57453 age | .1374997 .1451938 0.95 0.344 -.1470749 .4220742 male | .1573179 .4191014 0.38 0.707 -.6641057 .9787415 _cons | -9.091051 3.088123 -2.94 0.003 -15.14366 -3.038442 -------------+---------------------------------------------------------------3 | income | 4.210485 1.032024 4.08 0.000 2.187755 6.233215 age | .3457236 .0995071 3.47 0.001 .1506932 .540754 male | .5402549 .2769887 1.95 0.051 -.0026329 1.083143 _cons | -8.388756 2.135792 -3.93 0.000 -12.57483 -4.202681 -----------------------------------------------------------------------------(transmode==0 is the base outcome) http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 40 Let us see if the base outcome changes. As shown in the following, the parameter estimates and standard errors are changed, whereas the goodness-of-fit remains unchanged. The two .mlogit commands with different bases fit the same model but present the result in different manner. The SAS CATMOD procedure in the next section uses the largest value as the base outcome. . 
mlogit transmode income age male, base(3) Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood = = = = = -444.84113 -411.18604 -406.36474 -406.3251 -406.32509 Multinomial logistic regression Log likelihood = -406.32509 Number of obs LR chi2(9) Prob > chi2 Pseudo R2 = = = = 437 77.03 0.0000 0.0866 -----------------------------------------------------------------------------transmode | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------0 | income | -4.210485 1.032024 -4.08 0.000 -6.233215 -2.187755 age | -.3457236 .0995071 -3.47 0.001 -.540754 -.1506932 male | -.5402549 .2769887 -1.95 0.051 -1.083143 .0026329 _cons | 8.388756 2.135792 3.93 0.000 4.202681 12.57483 -------------+---------------------------------------------------------------1 | income | -.1924639 1.00606 -0.19 0.848 -2.164305 1.779377 age | -.154132 .1131506 -1.36 0.173 -.3759031 .0676392 male | -.2819663 .3443963 -0.82 0.413 -.9569706 .3930379 _cons | 1.485283 2.430912 0.61 0.541 -3.279216 6.249783 -------------+---------------------------------------------------------------2 | income | 4.740556 .9447126 5.02 0.000 2.888953 6.592158 age | -.2082239 .1164954 -1.79 0.074 -.4365507 .0201028 male | -.382937 .3490247 -1.10 0.273 -1.067013 .3011389 _cons | -.7022953 2.460119 -0.29 0.775 -5.52404 4.119449 -----------------------------------------------------------------------------(transmode==3 is the base outcome) The SPost .mlogtest command conducts a variety of statistical tests for the multinomial logit model. This command supports not only Wald and likelihood ratio tests, but also Hausman and Small-Hsiao tests for the independence of irrelevant alternatives (IIA) assumption. The .mlogtest command works with the .mlogit command only. . mlogtest, hausman smhsiao base **** Hausman tests of IIA assumption Ho: Odds(Outcome-J vs Outcome-K) are independent of other alternatives. Omitted | chi2 df P>chi2 evidence ---------+-----------------------------------0 | 0.260 8 1.000 for Ho 1 | -3.307 8 1.000 for Ho 2 | -0.319 8 1.000 for Ho 3 | 2.315 8 0.970 for Ho ---------------------------------------------- http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 41 **** Small-Hsiao tests of IIA assumption Ho: Odds(Outcome-J vs Outcome-K) are independent of other alternatives. Omitted | lnL(full) lnL(omit) chi2 df P>chi2 evidence ---------+--------------------------------------------------------0 | -120.685 -116.139 9.092 4 0.059 for Ho 1 | -131.938 -128.574 6.728 4 0.151 for Ho 2 | -155.078 -150.308 9.540 4 0.049 against Ho 3 | -71.735 -67.571 8.327 4 0.080 for Ho ------------------------------------------------------------------- The STATA .mprobit command fits the multinomial probit model. The model took longer time to converge than the multinomial logit model. . mprobit transmode income age male Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood = = = = = Multinomial probit regression Log likelihood = -406.38431 -425.2053 -407.95972 -406.38652 -406.38431 -406.38431 Number of obs Wald chi2(9) Prob > chi2 = = = 437 64.47 0.0000 -----------------------------------------------------------------------------transmode | Coef. Std. Err. z P>|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------_outcome_2 | income | 2.651949 .8328566 3.18 0.001 1.01958 4.284318 age | .1501467 .0903614 1.66 0.097 -.0269584 .3272519 male | .1967795 .262047 0.75 0.453 -.3168232 .7103822 _cons | -5.075328 1.953866 -2.60 0.009 -8.904835 -1.245822 -------------+---------------------------------------------------------------_outcome_3 | income | 5.757611 .8443105 6.82 0.000 4.102793 7.412429 age | .1218625 .0942662 1.29 0.196 -.0628959 .3066209 male | .1662947 .2772189 0.60 0.549 -.3770444 .7096339 _cons | -6.547874 2.031953 -3.22 0.001 -10.53043 -2.565319 -------------+---------------------------------------------------------------_outcome_4 | income | 2.751622 .6936632 3.97 0.000 1.392067 4.111177 age | .2760071 .074178 3.72 0.000 .1306208 .4213933 male | .4232271 .2086763 2.03 0.043 .0142289 .8322252 _cons | -6.375609 1.598767 -3.99 0.000 -9.509134 -3.242083 -----------------------------------------------------------------------------(transmode=0 is the base outcome) 6.2 Multinomial Logit in SAS SAS has the CATMOD procedure for the multinomial logit model. In the CATMOD procedure, the RESPONSE statement is used to specify the functions of response probabilities. PROC CATMOD DATA = masil.students; DIRECT income age male; RESPONSE LOGITS; http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 42 MODEL transmode = income age male /NOPROFILE; RUN; The CATMOD Procedure Data Summary Response Weight Variable Data Set Frequency Missing transmode None STUDENTS 0 Response Levels Populations Total Frequency Observations 4 414 437 437 Maximum Likelihood Analysis Maximum likelihood computations converged. Maximum Likelihood Analysis of Variance Source DF Chi-Square Pr > ChiSq ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Intercept 3 15.91 0.0012 income 3 45.72 <.0001 age 3 14.66 0.0021 male 3 4.73 0.1927 Likelihood Ratio 1E3 778.33 1.0000 Analysis of Maximum Likelihood Estimates Function Standard ChiParameter Number Estimate Error Square Pr > ChiSq ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Intercept 1 8.3888 2.1358 15.43 <.0001 2 1.4853 2.4309 0.37 0.5412 3 -0.7023 2.4601 0.08 0.7753 income 1 -4.2105 1.0320 16.65 <.0001 2 -0.1925 1.0061 0.04 0.8483 3 4.7406 0.9447 25.18 <.0001 age 1 -0.3457 0.0995 12.07 0.0005 2 -0.1541 0.1132 1.86 0.1731 3 -0.2082 0.1165 3.19 0.0739 male 1 -0.5403 0.2770 3.80 0.0511 2 -0.2820 0.3444 0.67 0.4129 3 -0.3829 0.3490 1.20 0.2726 As mentioned before, the CATMOD procedure uses the largest value of the dependent variable as a base outcome. Accordingly, you need to compare the above with the STATA output of the base(3) option. The two outputs are the same except for the likelihood ratio. 6.3 Multinomial Logit in LIMDEP (Mlogit$) http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 43 In LIMDEP, you may use either the Mlogit$ or simply the Logit$ commands to fit the multinomial logit model. Both commands produce the identical result. Like STATA, LIMDEP by default uses the smallest value as the base outcome. MLOGIT; Lhs=transmod; Rhs=ONE,income,age,male$ Or, use the old style command. LOGIT; Lhs=transmod; Rhs=ONE,income,age,male$ Normal exit from iterations. Exit status=0. 
+---------------------------------------------+ | Multinomial Logit Model | | Maximum Likelihood Estimates | | Model estimated: Sep 19, 2005 at 09:19:23AM.| | Dependent variable TRANSMOD | | Weighting variable None | | Number of observations 437 | | Iterations completed 6 | | Log likelihood function -406.3251 | | Restricted log likelihood -444.8411 | | Chi squared 77.03209 | | Degrees of freedom 9 | | Prob[ChiSqd > value] = .0000000 | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| +---------+--------------+----------------+--------+---------+----------+ Characteristics in numerator of Prob[Y = 1] Constant -6.903472671 2.9767801 -2.319 .0204 INCOME 4.018021309 1.3444306 2.989 .0028 .61683982 AGE .1915916653 .13929281 1.375 .1690 20.691076 MALE .2582886230 .40399708 .639 .5226 .57208238 Characteristics in numerator of Prob[Y = 2] Constant -9.091051495 3.0881231 -2.944 .0032 INCOME 8.951040805 1.3385396 6.687 .0000 .61683982 AGE .1374996725 .14519378 .947 .3436 20.691076 MALE .1573178944 .41910143 .375 .7074 .57208238 Characteristics in numerator of Prob[Y = 3] Constant -8.388756169 2.1357918 -3.928 .0001 INCOME 4.210485161 1.0320242 4.080 .0000 .61683982 AGE .3457236198 .99507140E-01 3.474 .0005 20.691076 MALE .5402549359 .27698869 1.950 .0511 .57208238 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) +--------------------------------------------------------------------+ | Information Statistics for Discrete Choice Model. | | M=Model MC=Constants Only M0=No Model | | Criterion F (log L) -406.32509 -444.84113 -605.81064 | | LR Statistic vs. MC 77.03209 .00000 .00000 | http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 44 | Degrees of Freedom 9.00000 .00000 .00000 | | Prob. Value for LR .00000 .00000 .00000 | | Entropy for probs. 406.32509 444.84113 605.81064 | | Normalized Entropy .67071 .73429 1.00000 | | Entropy Ratio Stat. 398.97109 321.93900 .00000 | | Bayes Info Criterion 867.36958 944.40167 1266.34067 | | BIC - BIC(no model) 398.97109 321.93900 .00000 | | Pseudo R-squared .08658 .00000 .00000 | | Pct. Correct Prec. 64.98856 .00000 25.00000 | | Means: y=0 y=1 y=2 y=3 yu=4 y=5, y=6 y>=7 | | Outcome .1648 .0892 .0961 .6499 .0000 .0000 .0000 .0000 | | Pred.Pr .1648 .0892 .0961 .6499 .0000 .0000 .0000 .0000 | | Notes: Entropy computed as Sum(i)Sum(j)Pfit(i,j)*logPfit(i,j). | | Normalized entropy is computed against M0. | | Entropy ratio statistic is computed against M0. | | BIC = 2*criterion - log(N)*degrees of freedom. | | If the model has only constants or if it has no constants, | | the statistics reported here are not useable. | +--------------------------------------------------------------------+ Frequencies of actual & predicted outcomes Predicted outcome has maximum probability. -----Actual -----0 1 2 3 -----Total Predicted -------------------0 1 2 3 -------------------5 0 0 67 0 0 1 38 0 0 1 41 6 0 0 278 -------------------11 0 2 424 + | + | | | | + | ----Total ----72 39 42 284 ----437 Note that the variable name TRANSMOD was truncated because LIMDEP allows up to eight characters for a variable name. LIMDEP and STATA produce the same result of the multinomial logit model. 6.4 Multinomial Logit in SPSS SPSS has the Nomreg command to estimate the multinomial logit model. Like SAS, SPSS by default uses the largest value as the base outcome. 
NOMREG transmode WITH income age male /CRITERIA CIN(95) DELTA(0) MXITER(100) MXSTEP(5) CHKSEP(20) LCONVERGE(0) PCONVERGE(0.000001) SINGULAR(0.00000001) /MODEL /INTERCEPT INCLUDE /PRINT PARAMETER SUMMARY LRT . http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 45 7. The Conditional Logit Regression Model Imagine a choice of the travel modes among air flight, train, bus, and car. The data set and model here are adopted from Greene (2003). The model examines how the generalized cost measure (cost), terminal waiting time (time), and household income (income) affect the choice. These independent variables are not characteristics of subjects (individuals), but attributes of the alternatives. Thus, the data arrangement of the conditional logit model is different from that of the multinomial logit model (Figure 2). Figure 2. Data Arrangement for the Conditional Logit Model +------------------------------------------------------------------------------+ | subject mode choice air train bus cost time income air_inc | |------------------------------------------------------------------------------| | 1 1 0 1 0 0 70 69 35 35 | | 1 2 0 0 1 0 71 34 35 0 | | 1 3 0 0 0 1 70 35 35 0 | | 1 4 1 0 0 0 30 0 35 0 | | 2 1 0 1 0 0 68 64 30 30 | |------------------------------------------------------------------------------| | 2 2 0 0 1 0 84 44 30 0 | | 2 3 0 0 0 1 85 53 30 0 | | 2 4 1 0 0 0 50 0 30 0 | | 3 1 0 1 0 0 129 69 40 40 | | 3 2 0 0 1 0 195 34 40 0 | … … … … … … … … … … … … The example data set has four observations per subject, each of which contains attributes of using air flight, train, bus, and car. The dependent variable choice is coded 1 only if a subject chooses that travel mode. The four dummy variables, air, train, bus, and car, are flagging the corresponding modes of transportation. See the appendix for details about the data set. 7.1 Conditional Logit in STATA (.clogit) STATA has the .clogit command to estimate the condition logit model. The group() option specifies the variable (e.g., identification number) that identifies unique individuals. . clogit choice air train bus cost time air_inc, group(subject) Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood = = = = = -205.8187 -199.23679 -199.12851 -199.12837 -199.12837 Conditional (fixed-effects) logistic regression Log likelihood = -199.12837 Number of obs LR chi2(6) Prob > chi2 Pseudo R2 = = = = 840 183.99 0.0000 0.3160 -----------------------------------------------------------------------------choice | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------air | 5.207443 .7790551 6.68 0.000 3.680523 6.734363 http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 46 train | 3.869043 .4431269 8.73 0.000 3.00053 4.737555 bus | 3.163194 .4502659 7.03 0.000 2.280689 4.045699 cost | -.0155015 .004408 -3.52 0.000 -.024141 -.006862 time | -.0961248 .0104398 -9.21 0.000 -.1165865 -.0756631 air_inc | .013287 .0102624 1.29 0.195 -.0068269 .033401 -------------------------------------------------------------------------------------- Let us run the .listcoef command to compute factor changes in odds. 
For a one unit increase in the waiting time for a given travel mode, for example, we can expect a decrease in the odds of using that travel by 9 percent (or a factor of .9084), holding other variables constant. . listcoef clogit (N=840): Factor Change in Odds Odds of: 1 vs 0 -------------------------------------------------choice | b z P>|z| e^b -------------+-----------------------------------air | 5.20744 6.684 0.000 182.6265 train | 3.86904 8.731 0.000 47.8965 bus | 3.16319 7.025 0.000 23.6460 cost | -0.01550 -3.517 0.000 0.9846 time | -0.09612 -9.207 0.000 0.9084 air_inc | 0.01329 1.295 0.195 1.0134 -------------------------------------------------- 7.2 Conditional Logit in SAS SAS has the MDC procedure to fit the conditional logit model. The TYPE=CLOGIT indicates the conditional logit model; the ID statement specifies the identification variable; and the NCHOICE=4 tells that there are four choices of the travel mode. PROC MDC DATA=masil.travel; MODEL choice = air train bus cost time air_inc /TYPE=CLOGIT NCHOICE=4; ID subject; RUN; The MDC Procedure Conditional Logit Estimates Algorithm converged. Model Fit Summary Dependent Variable Number of Observations Number of Cases Log Likelihood Maximum Absolute Gradient Number of Iterations Optimization Method AIC http://www.indiana.edu/~statmath choice 210 840 -199.12837 2.73152E-8 5 Newton-Raphson 410.25674 © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 47 Schwarz Criterion 430.33938 Discrete Response Profile Index 0 1 2 3 CHOICE Frequency Percent 1 2 3 4 58 63 30 59 27.62 30.00 14.29 28.10 Goodness-of-Fit Measures Measure Value Likelihood Ratio (R) Upper Bound of R (U) Aldrich-Nelson Cragg-Uhler 1 Cragg-Uhler 2 Estrella Adjusted Estrella McFadden's LRI Veall-Zimmermann 183.99 582.24 0.467 0.5836 0.6225 0.6511 0.6212 0.316 0.6354 Formula 2 * (LogL - LogL0) - 2 * LogL0 R / (R+N) 1 - exp(-R/N) (1-exp(-R/N)) / (1-exp(-U/N)) 1 - (1-R/U)^(U/N) 1 - ((LogL-K)/LogL0)^(-2/N*LogL0) R / U (R * (U+N)) / (U * (R+N)) N = # of observations, K = # of regressors Conditional Logit Estimates Parameter Estimates Parameter air train bus cost time air_inc DF Estimate Standard Error t Value 1 1 1 1 1 1 5.2074 3.8690 3.1632 -0.0155 -0.0961 0.0133 0.7791 0.4431 0.4503 0.004408 0.0104 0.0103 6.68 8.73 7.03 -3.52 -9.21 1.29 Approx Pr > |t| <.0001 <.0001 <.0001 0.0004 <.0001 0.1954 Alternatively, you may use the PHREG procedure that estimates the Cox proportional hazards model for survival data and the conditional logit model. In order to make the data set consistent with the survival analysis data, you need to create a failure time variable, failure=1–choice. The identification variable is specified in the STRATA statement. The NOSUMMARY option suppresses the display of the event and censored observation frequencies. PROC PHREG DATA=masil.travel NOSUMMARY; STRATA subject; MODEL failure*choice(0)=air train bus cost time air_inc; RUN; http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 48 The PHREG Procedure Model Information Data Set Dependent Variable Censoring Variable Censoring Value(s) Ties Handling MASIL.TRAVEL failure choice 0 BRESLOW Number of Observations Read Number of Observations Used 840 840 Convergence Status Convergence criterion (GCONV=1E-8) satisfied. 
Model Fit Statistics Criterion -2 LOG L AIC SBC Without Covariates With Covariates 582.244 582.244 582.244 398.257 410.257 430.339 Testing Global Null Hypothesis: BETA=0 Test Likelihood Ratio Score Wald Chi-Square DF Pr > ChiSq 183.9869 173.4374 103.7695 6 6 6 <.0001 <.0001 <.0001 Analysis of Maximum Likelihood Estimates Variable air train bus cost time air_inc DF Parameter Estimate Standard Error Chi-Square Pr > ChiSq Hazard Ratio 1 1 1 1 1 1 5.20743 3.86904 3.16319 -0.01550 -0.09612 0.01329 0.77905 0.44313 0.45027 0.00441 0.01044 0.01026 44.6799 76.2343 49.3530 12.3671 84.7778 1.6763 <.0001 <.0001 <.0001 0.0004 <.0001 0.1954 182.625 47.896 23.646 0.985 0.908 1.013 While the MDC procedure reports t statistics, the PHREG procedure computes chi-squared (e.g., 12.3671=-3.52^2). The PHREG presents the hazard ratio at the last column of the output, which is equivalent to the factor changes under the e^b column of the SPost .listcoef command. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 49 7.3 Conditional Logit in LIMDEP (Clogit$) LIMDEP fits the conditional logit model using either the Clogit$ or the Logit$ command. The Clogit$ command has the Choices$ subcommand to list the choices available. CLOGIT; Lhs=choice; Rhs=air,train,bus,cost,time,air_inc; Choices=air,train,bus,car$ Normal exit from iterations. Exit status=0. +---------------------------------------------+ | Discrete choice (multinomial logit) model | | Maximum Likelihood Estimates | | Model estimated: Sep 19, 2005 at 09:20:39PM.| | Dependent variable Choice | | Weighting variable None | | Number of observations 210 | | Iterations completed 6 | | Log likelihood function -199.1284 | | Log-L for Choice model = -199.12837 | | R2=1-LogL/LogL* Log-L fncn R-sqrd RsqAdj | | Constants only -283.7588 .29825 .29150 | | Response data are given as ind. choice. | | Number of obs.= 210, skipped 0 bad obs. | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | +---------+--------------+----------------+--------+---------+ AIR 5.207443299 .77905514 6.684 .0000 TRAIN 3.869042702 .44312685 8.731 .0000 BUS 3.163194212 .45026593 7.025 .0000 COST -.1550152532E-01 .44079931E-02 -3.517 .0004 TIME -.9612479610E-01 .10439847E-01 -9.207 .0000 AIR_INC .1328702625E-01 .10262407E-01 1.295 .1954 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) The Clogit$ command has the Ias$ subcommand to conduct the Hausman test for the IIA assumption (e.g., Ias=air,bus$). Unfortunately, the subcommand does not work in this model because the Hessian is not positive definite. The Logit$ command takes the panel data analysis approach. The Pds$ subcommand specifies the number of time periods. The two commands produce the same result. LOGIT; Lhs=choice; Rhs=air,train,bus,cost,time,air_inc; Pds=4$ +--------------------------------------------------+ | Panel Data Binomial Logit Model | | Number of individuals = 210 | http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 50 | Number of periods = 4 | | Conditioning event is the sum of CHOICE | | Distribution of sums over the 4 periods: | | Sum 0 1 2 3 4 5 6 | | Number 0 210 0 0 0 5 10 | | Pct. .00100.00 .00 .00 .00 .00 .00 | +--------------------------------------------------+ Normal exit from iterations. Exit status=0. 
+---------------------------------------------+ | Logit Model for Panel Data | | Maximum Likelihood Estimates | | Model estimated: Sep 19, 2005 at 09:21:58PM.| | Dependent variable CHOICE | | Weighting variable None | | Number of observations 840 | | Iterations completed 6 | | Log likelihood function -199.1284 | | Hosmer-Lemeshow chi-squared = 251.24482 | | P-value= .00000 with deg.fr. = 8 | | Fixed Effects Logit Model for Panel Data | +---------------------------------------------+ +---------+--------------+----------------+--------+---------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | +---------+--------------+----------------+--------+---------+ AIR 5.207443299 .77905514 6.684 .0000 TRAIN 3.869042702 .44312685 8.731 .0000 BUS 3.163194212 .45026593 7.025 .0000 COST -.1550152532E-01 .44079931E-02 -3.517 .0004 TIME -.9612479610E-01 .10439847E-01 -9.207 .0000 AIR_INC .1328702625E-01 .10262407E-01 1.295 .1954 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.) 7.4 Conditional Logit in SPSS Like the SAS PHREG procedure, the SPSS Coxreg command, which was designed for survival analysis data, provides a backdoor way of estimating the conditional logit model. COXREG failure WITH air train bus cost time air_inc /STATUS=choice(1) /STRATA=subject. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 51 8. The Nested Logit Regression Model Now, consider a nested structure of choices. When the IIA assumption is violated, one of the alternatives is the nested (multinomial) logit model. This chapter replicates the nested logit model discussed in Greene (2003). P(choice, branch) = P(choice | branch) * P(branch) P(choice | branch) = Pchild (α1air + α 2train + α 3bus + β1 cos t + β 2time) P (branch) = Pparent (γ income air _ inc + τ fly IV fly + τ ground IVground ) 8.1 Nested Logit in STATA (.nlogit) The STATA .nlogit command estimates the nested multinomial logit model. First you need to create a variable based on the specification of the tree using the .nlogitgen command. From the top, the parent-level has fly and ground branches; the fly branch of the child-level has air flight (1); the ground branch has train (2), bus (3), and car (4). . nlogitgen tree = mode(fly: 1, ground: 2 | 3 | 4) new variable tree is generated with 2 groups label list lb_tree lb_tree: 1 fly 2 ground The .nlogittree command.displays the tree-structure defined by the .nlogitgen command. . nlogittree mode tree tree structure specified for the nested logit model top --> bottom tree mode -------------------------fly 1 ground 2 3 4 The .nlogit command consists of three parts. The dependent or choice variable follows the command. Utility functions of the parent and child-levels are then specified. The group()option specifies an identification or grouping variable. . nlogit choice (mode=air train bus cost time) (tree=air_inc), /// group(subject) notree nolog Nested logit regression Levels = Dependent variable = 2 choice http://www.indiana.edu/~statmath Number of obs LR chi2(8) = = 840 194.9313 © 2003-2005, The Trustees of Indiana University Log likelihood = -193.65615 Categorical Dependent Variable Models: 52 Prob > chi2 = 0.0000 -----------------------------------------------------------------------------| Coef. Std. Err. z P>|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------mode | air | 6.042255 1.198907 5.04 0.000 3.692441 8.39207 train | 5.064679 .6620317 7.65 0.000 3.767121 6.362237 bus | 4.096302 .6151582 6.66 0.000 2.890614 5.30199 cost | -.0315888 .0081566 -3.87 0.000 -.0475754 -.0156022 time | -.1126183 .0141293 -7.97 0.000 -.1403111 -.0849254 -------------+---------------------------------------------------------------tree | air_inc | .0153337 .0093814 1.63 0.102 -.0030534 .0337209 -------------+---------------------------------------------------------------(incl. value | parameters) | tree | /fly | .5859993 .1406199 4.17 0.000 .3103894 .8616092 /ground | .3889488 .1236623 3.15 0.002 .1465753 .6313224 -----------------------------------------------------------------------------LR test of homoskedasticity (iv = 1): chi2(2)= 10.94 Prob > chi2 = 0.0042 ------------------------------------------------------------------------------ The notree option does not show the tree-structure and the nolog suppresses an iteration log of the log likelihood. Note that the /// joins the next command line with the current line. 8.2 Nested Logit in SAS The SAD MDC procedure fits the conditional logit model as well as the nested multinomial logit model. For the nested logit model, you have to use the UTILITY statement to specify utility functions of the parent (level 2) and child level (level 1), and the NEST statement to construct the decision-tree structure. Note that “2 3 4 @ 2” reads that there are three nodes at the child level under the branch 2 at the parent-level. PROC MDC DATA=masil.travel; MODEL choice = air train bus cost time air_inc /TYPE=NLOGIT CHOICE=(mode); ID subject; UTILITY U(1,) = air train bus cost time, U(2, 1 2) = air_inc; NEST LEVEL(1) = (1 @ 1, 2 3 4 @ 2), LEVEL(2) = (1 2 @ 1); RUN; The MDC Procedure Nested Logit Estimates Algorithm converged. Model Fit Summary Dependent Variable Number of Observations Number of Cases http://www.indiana.edu/~statmath choice 210 840 © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 53 Log Likelihood Maximum Absolute Gradient Number of Iterations Optimization Method AIC Schwarz Criterion -193.65615 0.0000147 15 Newton-Raphson 403.31230 430.08916 Discrete Response Profile Index 0 1 2 3 mode Frequency Percent 1 2 3 4 58 63 30 59 27.62 30.00 14.29 28.10 Goodness-of-Fit Measures Measure Value Likelihood Ratio (R) Upper Bound of R (U) Aldrich-Nelson Cragg-Uhler 1 Cragg-Uhler 2 Estrella Adjusted Estrella McFadden's LRI Veall-Zimmermann 194.93 582.24 0.4814 0.6048 0.6451 0.6771 0.6485 0.3348 0.655 Formula 2 * (LogL - LogL0) - 2 * LogL0 R / (R+N) 1 - exp(-R/N) (1-exp(-R/N)) / (1-exp(-U/N)) 1 - (1-R/U)^(U/N) 1 - ((LogL-K)/LogL0)^(-2/N*LogL0) R / U (R * (U+N)) / (U * (R+N)) N = # of observations, K = # of regressors Nested Logit Estimates Parameter Estimates Parameter air_L1 train_L1 bus_L1 cost_L1 time_L1 air_inc_L2G1 INC_L2G1C1 INC_L2G1C2 DF Estimate Standard Error t Value 1 1 1 1 1 1 1 1 6.0423 5.0646 4.0963 -0.0316 -0.1126 0.0153 0.5860 0.3890 1.1989 0.6620 0.6152 0.008156 0.0141 0.009381 0.1406 0.1237 5.04 7.65 6.66 -3.87 -7.97 1.63 4.17 3.15 Approx Pr > |t| <.0001 <.0001 <.0001 0.0001 <.0001 0.1022 <.0001 0.0017 The /fly and /ground in the STATA output above are equivalent to the INC_L2G1C1 and INC_L2G1C2 in the SAS output. SAS and STATA produce the same result. http://www.indiana.edu/~statmath © 2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 54 9. 
Conclusion

The appropriate type of categorical dependent variable model (CDVM) is determined largely by the level of measurement of the dependent variable. The level of measurement should, however, be considered in conjunction with your theory and research questions (Long 1997). You must also examine the data generation process (DGP) of a dependent variable to understand its behavior. Sophisticated researchers pay special attention to censoring, truncation, sample selection, and other particular patterns of the DGP.

If your dependent variable is binary, you may use the binary logit or probit regression model. For ordinal responses, fit the ordered logit or probit regression models. If you have a nominal response variable, investigate the DGP carefully and then choose among the multinomial logit, conditional logit, and nested logit models. In order to use the conditional logit and nested logit models, the data set requires a different setup.

You should check the key assumptions of the CDVMs when fitting the models. Examples are the parallel regression assumption in the ordered logit model and the independence of irrelevant alternatives (IIA) assumption in the multinomial logit model. You may conduct the Brant test and the Hausman test for these assumptions.

Since CDVMs are nonlinear, they produce estimates that are difficult to interpret intuitively. Consequently, researchers need to spend more time and effort interpreting the results substantively; reporting parameter estimates and goodness-of-fit statistics is not sufficient. J. Scott Long (1997) and Long and Freese (2003) provide good examples of meaningful interpretation using predicted probabilities, factor changes in odds, and marginal/discrete changes in predicted probabilities.

Regarding statistical software for CDVMs, I would recommend the SAS QLIM and MDC procedures of SAS/ETS (see Tables 3 and 4). SAS has other procedures such as LOGISTIC, GENMOD, and PROBIT for CDVMs, but the QLIM procedure seems best for binary and ordinal response models, and the MDC procedure is good for nominal dependent variable models. I also strongly recommend STATA with SPost, since it has various useful commands for CDVMs such as .prchange, .listcoef, and .prtab; I encourage SAS Institute to develop similar statements. LIMDEP supports various CDVMs addressed in Greene (2003) but does not seem stable and reliable, so I recommend LIMDEP only for the CDVMs that SAS and STATA do not support. SPSS is not currently recommended for CDVMs.
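To illustrate the kind of substantive interpretation recommended above, the commands below revisit the binary probit model of car ownership summarized in Table 4 and ask how the predicted probability of owning a car responds to age and gender. This is only a sketch; as before, .prvalue and .prchange assume that the SPost package is installed.

. quietly probit owncar income age male
. prvalue, x(age=21 male=0) rest(mean)
. prvalue, x(age=21 male=1) rest(mean)
. prchange age male, rest(mean)

Comparing the two .prvalue results gives the discrete change in Pr(owncar=1) between otherwise identical female and male students, and .prchange summarizes the marginal and discrete changes in that probability for each regressor.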
Appendix: Data Sets

The first data set, students, is a subset of data provided for David H. Good's class in the School of Public and Environmental Affairs (SPEA). The data were manipulated for the sake of data security.

• owncar: 1 if a student owns a car
• parking: illegal parking (0=none, 1=sometimes, and 2=often)
• offcamp: 1 if a student lives off-campus
• transmode: mode of transportation (0=walk, 1=bike, 2=bus, 3=car)
• age: students' age
• income: monthly income
• male: 1 for male and 0 for female

. tab male owncar

           |        owncar
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |        76        111 |       187
         1 |        77        173 |       250
-----------+----------------------+----------
     Total |       153        284 |       437

. tab male offcamp

           |       offcamp
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |         7        180 |       187
         1 |         5        245 |       250
-----------+----------------------+----------
     Total |        12        425 |       437

. tab male parking

           |            parking
      male |         0          1          2 |     Total
-----------+---------------------------------+----------
         0 |       170         13          4 |       187
         1 |       243          7          0 |       250
-----------+---------------------------------+----------
     Total |       413         20          4 |       437

. tab male transmode

           |                 transmode
      male |         0          1          2          3 |     Total
-----------+--------------------------------------------+----------
         0 |        38         18         20        111 |       187
         1 |        34         21         22        173 |       250
-----------+--------------------------------------------+----------
     Total |        72         39         42        284 |       437

. sum income age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       437    .6168398      .17918         .4      1.227
         age |       437    20.69108    1.610812         18         29

The second data set, travel, on travel mode choice is adopted from Greene (2003). You may get the data from http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm

• subject: identification number
• mode: 1=Air, 2=Train, 3=Bus, 4=Car
• choice: 1 if the travel mode is chosen
• time: terminal waiting time, 0 for car
• cost: generalized cost measure
• income: household income
• air_inc: interaction of air flight and household income, air*income (see the sketch at the end of this appendix)
• air: 1 for the air flight mode, 0 for others
• train: 1 for the train mode, 0 for others
• bus: 1 for the bus mode, 0 for others
• car: 1 for the car mode, 0 for others
• failure: failure time variable, 1-choice

. tab choice mode

           |                    mode
    choice |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         0 |       152        147        180        151 |       630
         1 |        58         63         30         59 |       210
-----------+--------------------------------------------+----------
     Total |       210        210        210        210 |       840

. sum time income air_inc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        time |       840    34.58929    24.94861          0         99
      income |       840    34.54762    19.67604          2         72
     air_inc |       840    8.636905    17.91206          0         72
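For completeness, here is a minimal sketch of how the derived variables listed above could be generated in STATA from the long-form data (one record per subject-alternative pair, as the tabulation of choice against mode shows). The distributed data already contain these variables, so the commands simply document their definitions.

. * mode dummies and the air-by-income interaction
. gen air     = (mode == 1)
. gen train   = (mode == 2)
. gen bus     = (mode == 3)
. gen car     = (mode == 4)
. gen air_inc = air * income
. * failure is simply the complement of choice
. gen failure = 1 - choice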
References

Allison, Paul D. 1991. Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute.
Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide. Plainview, NY: Econometric Software.
Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Advanced Quantitative Techniques in the Social Sciences. Sage Publications.
Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent Variables Using STATA, 2nd ed. College Station, TX: STATA Press.
Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.
SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute.
SPSS Inc. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.
STATA Press. 2004. STATA Base Reference Manual, Release 8. College Station, TX: STATA Press.
Stokes, Maura E., Charles S. Davis, and Gary G. Koch. 2000. Categorical Data Analysis Using the SAS System, 2nd ed. Cary, NC: SAS Institute.

Acknowledgements

I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments and suggestions. I also thank J. Scott Long in Sociology and David H. Good in the School of Public and Environmental Affairs, Indiana University, for their insightful lectures and data set.

Revision History

• 2003. First draft
• 2004. Second draft
• 2005. Third draft (added bivariate logit/probit models and the nested logit model with LIMDEP examples)