Advanced topics in regression
Tron Anders Moger
18.10.2006

Last time:
• Had the model death rate per 1000 = a + b*car age + c*prop. light trucks, with K = no. of independent variables
• Pearson's r = √R2; R2 = 1 - SSE/SST; Adj. R2 = 1 - (SSE/(n-K-1))/(SST/(n-1))
• Degrees of freedom: d.f.(SSR) = K, d.f.(SSE) = n-K-1, d.f.(SST) = n-1
• MSR = SSR/K; MSE = se2 = σ^2 = SSE/(n-K-1)
• F test statistic = MSR/MSE, with a P-value for the test of all β's = 0 vs. at least one β not 0
• t test statistic = β/SE(β), with a P-value for the test of β = 0 vs. β not 0, and a 95% CI for each β

SPSS output (dependent variable: deaths; predictors: (Constant), lghttrks, carage):

Model Summary: R = 0.768, R2 = 0.590, adjusted R2 = 0.572, std. error of the estimate = 0.03871

ANOVA:
             Sum of Squares   df   Mean Square   F        Sig.
Regression   0.097            2    0.049         32.402   0.000
Residual     0.067            45   0.001
Total        0.165            47

Coefficients:
             B        Std. Error   Beta     t        Sig.    95% CI for B
(Constant)   2.668    0.895                 2.981    0.005   (0.865, 4.470)
carage       -0.037   0.013        -0.295   -2.930   0.005   (-0.063, -0.012)
lghttrks     0.006    0.001        0.622    6.181    0.000   (0.004, 0.009)

Why did we remove car weight and percentage imported cars from the model?
• They did not show a significant relationship with the dependent variable (β not significantly different from 0)
• Unless the independent variables are completely uncorrelated, you will get different b's when including several variables in your model compared to just one variable (collinearity)
• Hence, we would like to remove variables that have nothing to do with the dependent variable, but still influence the estimated effect of the important independent variables

Relationship between R2 and b
[Figure: two scatterplots, one showing high R2 with a low b (narrow CI), the other low R2 with a high b (wide CI)]
• Which result would make you most happy?

Centered variables
• Remember, we found the model birth weight = 2369.672 + 4.429*mother's weight
• Hence, the constant has no interpretation (it is the predicted birth weight for a mother weighing 0 lbs)
• Construct mother's weight 2 = mother's weight - mean(mother's weight), where the mean is 130 lbs
• Get

Coefficients (dependent variable: birthweight):
             B          Std. Error   Beta    t        Sig.    95% CI for B
(Constant)   2944.656   52.244               56.363   0.000   (2841.592, 3047.720)
lwt2         4.429      1.713        0.186   2.586    0.010   (1.050, 7.809)

• And the model birth weight = 2944.656 + 4.429*mother's weight 2
• The constant is now the predicted birth weight for a 130 lbs mother

Indicator variables
• Binary variables (yes/no, male/female, …) can be represented as 1/0, and used as independent variables.
• Also called dummy variables in the book.
• When used directly, they influence only the constant term of the regression
• It is also possible to use a binary variable so that it changes both the constant term and the slope of the regression

Example: Regression of birth weight with mother's weight and smoking status as independent variables

Model Summary (dependent variable: birthweight; predictors: (Constant), smoking status, weight in pounds):
R = 0.259, R2 = 0.067, adjusted R2 = 0.057, std. error of the estimate = 707.836

ANOVA:
             Sum of Squares   df    Mean Square   F       Sig.
Regression   6725224          2     3362612.165   6.711   0.002
Residual     93191828         186   501031.335
Total        99917053         188

Coefficients:
                   B          Std. Error   Beta     t        Sig.    95% CI for B
(Constant)         2500.174   230.833               10.831   0.000   (2044.787, 2955.561)
weight in pounds   4.238      1.690        0.178    2.508    0.013   (0.905, 7.572)
smoking status     -270.013   105.590      -0.181   -2.557   0.011   (-478.321, -61.705)

Interpretation:
• Have fitted the model birth weight = 2500.174 + 4.238*mother's weight - 270.013*smoking status
• If the mother starts to smoke (and her weight remains constant), what is the predicted influence on the infant's birth weight? -270.013*1 = -270 grams
• What is the predicted weight of the child of a 150 pound, smoking woman? 2500.174 + 4.238*150 - 270.013*1 = 2866 grams
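As a minimal sketch of how a regression with an indicator variable can be fitted outside SPSS, the snippet below uses Python with numpy and statsmodels on simulated data. The variable names (mwt, smoke, bwt), the sample size and the effect sizes are assumptions chosen to echo the slides, so the output will resemble, but not reproduce, the SPSS tables above.

```python
# Sketch: birth weight on mother's weight plus a smoking dummy.
# Simulated data for illustration only; not the textbook's data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 189
mwt = rng.normal(130, 30, n)               # mother's weight in pounds
smoke = rng.integers(0, 2, n)              # indicator: 1 = smoker, 0 = non-smoker
bwt = 2500 + 4.2 * mwt - 270 * smoke + rng.normal(0, 700, n)

X = sm.add_constant(np.column_stack([mwt, smoke]))  # intercept + two predictors
fit = sm.OLS(bwt, X).fit()
print(fit.summary())                       # B, SE, t, Sig., 95% CI, R2, F

# The slide's worked example: predicted weight for a 150 lb smoker
print(fit.predict([1.0, 150.0, 1.0]))
```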
Will R2 automatically be low for indicator variables?
[Figure: scatterplot of the dependent variable against an indicator variable taking only the values 0 and 1]

What if a categorical variable has more than two values?
• Example: ethnicity: black, white, other
• For categorical variables with m possible values, use m-1 indicators.
• Important: A model with two indicator variables will assume that the effect of one indicator adds to the effect of the other
• If this may be unsuitable, use an additional interaction variable (product of the indicators)

Model birth weight as a function of ethnicity
• Have constructed the variables black = 0 or 1 and other = 0 or 1
• Model: birth weight = a + b*black + c*other
• Get

Coefficients (dependent variable: birthweight):
             B          Std. Error   Beta     t        Sig.    95% CI for B
(Constant)   3103.740   72.882                42.586   0.000   (2959.959, 3247.521)
black        -384.047   157.874      -0.182   -2.433   0.016   (-695.502, -72.593)
other        -299.725   113.678      -0.197   -2.637   0.009   (-523.988, -75.462)

• Hence, predicted birth weight decreases by 384 grams for blacks and 300 grams for others
• Predicted birth weight for whites is 3104 grams

Interaction:
• Sometimes the effect (on y) of one independent variable (x1) depends on the value of another independent variable (x2)
• This means that you get, e.g., different slopes for x1 for different values of x2
• Usually modelled by constructing a product of the two variables, and including it in the model
• Example: bwt = a + b*mwt + c*smoking + d*mwt*smoking = a + (b + d*smoking)*mwt + c*smoking

Get SPSS to do the estimation:
• Get bwt = 2347 + 5.41*mwt + 47.87*smoking - 2.46*mwt*smoking

Coefficients (dependent variable: birthweight):
                   B          Std. Error   Beta     t        Sig.    95% CI for B
(Constant)         2347.507   312.717               7.507    0.000   (1730.557, 2964.457)
weight in pounds   5.405      2.335        0.227    2.315    0.022   (0.798, 10.012)
smoking status     47.867     451.163      0.032    0.106    0.916   (-842.220, 937.953)
smkwht             -2.456     3.388        -0.223   -0.725   0.470   (-9.140, 4.229)

• mwt = 100 lbs vs. mwt = 200 lbs for non-smokers: bwt = 2888 g and bwt = 3428 g, difference = 540 g
• mwt = 100 lbs vs. mwt = 200 lbs for smokers: bwt = 2690 g and bwt = 2985 g, difference = 295 g

What does this mean?
• Mother's weight has a greater impact on birth weight for non-smokers than for smokers (or the other way round)
[Figure: scatterplot of birthweight against weight in pounds, with separate fitted lines by smoking status (0.00 and 1.00); R Sq Linear = 0.042 for one group and 0.023 for the other]

What does this mean, cont'd?
• We see that the slope is steeper for non-smokers
• In fact, a model with mwt and mwt*smoking fits better than the model with mwt and smoking (a sketch of this comparison follows below):

Model Summary (predictors: (Constant), smkwht, weight in pounds):
R = 0.264, R2 = 0.070, adjusted R2 = 0.060, std. error of the estimate = 706.854

Coefficients (dependent variable: birthweight):
                   B          Std. Error   Beta     t        Sig.    95% CI for B
(Constant)         2370.504   224.809               10.545   0.000   (1927.000, 2814.007)
weight in pounds   5.237      1.713        0.220    3.057    0.003   (1.857, 8.616)
smkwht             -2.106     0.792        -0.191   -2.660   0.009   (-3.668, -0.544)
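A sketch of the same model comparison in Python/statsmodels, again on simulated stand-in data (all names and numbers are assumptions): build the product term smkwht = mwt*smoke, fit both models, and compare their adjusted R2.

```python
# Sketch: interaction via a product term, on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 189
mwt = rng.normal(130, 30, n)
smoke = rng.integers(0, 2, n)
bwt = 2350 + 5.4 * mwt + 48 * smoke - 2.5 * mwt * smoke + rng.normal(0, 700, n)

smkwht = mwt * smoke                       # the interaction (product) variable

# Full interaction model: bwt = a + b*mwt + c*smoke + d*smkwht
full = sm.OLS(bwt, sm.add_constant(np.column_stack([mwt, smoke, smkwht]))).fit()
# Reduced model from the slides: mwt and smkwht only
red = sm.OLS(bwt, sm.add_constant(np.column_stack([mwt, smkwht]))).fit()

print(full.rsquared_adj, red.rsquared_adj)  # compare adjusted R2

b = full.params                            # [a, b, c, d]
print("slope for non-smokers:", b[1])      # b
print("slope for smokers:   ", b[1] + b[3])  # b + d
```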
Should you always look at all possible interactions?
• No.
• The example shows an interaction between an indicator and a continuous variable, which is fairly easy to interpret
• Interaction between two continuous variables: slightly more complicated
• Interaction between three or more variables: difficult to interpret
• It doesn't matter if you have a good model if you can't interpret it
• Often you are interested in the interactions you believed were there before you did the study

Multicollinearity
• Means that two or more independent variables are closely correlated
• To discover it, make plots and compute correlations (or make a regression of one variable on the others)
• To deal with it:
– Remove unnecessary variables
– Define and compute an "index"
– If the variables are kept, the model could still be used for prediction

Example: Traffic deaths
• Recall: Used four variables to predict traffic deaths in the U.S.
• Among them: average car weight and prop. imported cars
• However, the correlation between these two variables is pretty high

Correlation car weight vs. imported cars
• Pearson's r is 0.94:
[Figure: scatterplot of vehwt (average car weight, roughly 3000-3800) against impcars (0-30)]
[SPSS Correlations table: carage vs. impcars, Pearson correlation 0.011, Sig. (2-tailed) 0.943, N = 49]
• Problematic to use both of these as independent variables in a regression

Choice of variables
• Include variables which you believe have a clear influence on the dependent variable, even if the variable is "uninteresting": this helps find the true relationship between the "interesting" variables and the dependent.
• Avoid including a pair (or a set) of variables whose values are clearly linearly related

Choice of values
• Should have a good spread: again, avoid collinearity
• Should cover the range for which the model will be used
• For categorical variables, one may choose to combine levels in a systematic way.

Specification bias
• Unless two independent variables are uncorrelated, the estimation of one will influence the estimation of the other
• Not including a variable that belongs in the model will therefore bias the estimation of the others
• Thus, one should be humble when interpreting regression results: there are probably always variables one could have added

Heteroscedasticity – what is it?
• In the standard regression model
yi = β0 + β1x1i + β2x2i + ... + βKxKi + εi
it is assumed that all εi have the same variance.
• If the variance varies with the independent variables or the dependent variable, the model is heteroscedastic.
• Sometimes, it is clear that data exhibit such properties.

Heteroscedasticity – why does it matter?
• Our standard methods for estimation, confidence intervals, and hypothesis testing assume equal variances.
• If we go on and use these methods anyway, our answers might be quite wrong!

Heteroscedasticity – how to detect it?
• Fit a regression model, and study the residuals:
– make a plot of them against the independent variables
– make a plot of them against the predicted values of the dependent variable
• Possibility: Test for heteroscedasticity by doing a regression of the squared residuals on the predicted values (a sketch of this is given below).
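The squared-residual regression described above is straightforward to do by hand; here is a sketch in Python/statsmodels on simulated data whose error variance deliberately grows with x (all names and numbers are made up). The built-in Breusch-Pagan test is a formalized version of the same idea.

```python
# Sketch: detect heteroscedasticity by regressing the squared residuals
# on the fitted values. Simulated data; error SD grows with x on purpose.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 0.5 + 0.4 * x, n)   # non-constant variance

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Auxiliary regression: squared residuals on the predicted values
aux = sm.OLS(fit.resid ** 2, sm.add_constant(fit.fittedvalues)).fit()
print("slope p-value:", aux.pvalues[1])    # small -> evidence of heteroscedasticity

# The Breusch-Pagan test packages the same idea:
lm, lm_pval, fval, f_pval = het_breuschpagan(fit.resid, fit.model.exog)
print("Breusch-Pagan p-value:", lm_pval)
```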
Example: The model traffic deaths = a + b*car age + c*light trucks
• Does not look too bad:
[Figure: scatterplot of the regression standardized residuals against the regression standardized predicted values (dependent variable: deaths), with no clear pattern]

What is bad?
[Figure: two sketches of residuals ε plotted against the predicted values ŷ, each showing a systematic pattern in the spread of the residuals]

Heteroscedasticity – what to do about it?
• Use a transformation of the dependent variable:
– log-linear models
• If the standard deviation of the errors appears to be proportional to the predicted values, a two-stage regression analysis is a possibility

Dependence over time
• Sometimes, y1, y2, …, yn are not completely independent observations (given the independent variables):
– Lagged values: yi may depend on yi-1 in addition to its independent variables
– Autocorrelated errors: the residuals εi are correlated
• Often relevant for time-series data

Lagged values
• In this case, we may run a multiple regression just as before, but including the previous dependent variable yt-1 as a predictor variable for yt.
• Use the model yt = β0 + β1x1t + γyt-1 + εt
• A 1-unit increase in x1 in the first time period yields an expected increase in y of β1, an increase of β1γ in the second period, β1γ^2 in the third period, and so on
• The total expected increase over all future periods is β1/(1-γ)

Example: Pension funds from the textbook CD
• Want to use the market return for stocks (say, in million $) as a predictor for the percentage of pension fund portfolios at market value (y) at the end of the year
• Have data for 25 years -> 24 observations

Model Summary (dependent variable: stocks; predictors: (Constant), lag, return):
R = 0.980, R2 = 0.961, adjusted R2 = 0.957, std. error of the estimate = 2.288, Durbin-Watson = 1.008

Coefficients (dependent variable: stocks):
             B       Std. Error   Beta    t        Sig.    95% CI for B
(Constant)   1.397   2.359                0.592    0.560   (-3.509, 6.303)
return       0.235   0.030        0.359   7.836    0.000   (0.172, 0.297)
lag          0.954   0.042        1.041   22.690   0.000   (0.867, 1.042)

Get the model:
• yt = 1.397 + 0.235*stock return + 0.954*yt-1
• A one million $ increase in stock return one year yields a 0.24% increase in pension fund portfolios at market value
• For the next year: 0.235*0.954 = 0.22%
• And the third year: 0.235*0.954^2 = 0.21%
• For all future years: 0.235/(1-0.954) = 5.1%
• What if you have a 2 million $ increase?

Autocorrelated errors
• In the standard regression model, the errors are independent.
• Using standard regression formulas anyway can lead to errors: typically, the uncertainty in the result is underestimated.

Autocorrelation – how to detect it?
• Plot the residuals against time! (an option in SPSS)
• The Durbin-Watson test compares the possibility of independent errors with a first-order autoregressive model:
εt = ρεt-1 + ut
• Test statistic:
d = Σt=2..n (et - et-1)^2 / Σt=1..n et^2
• Test H0: ρ = 0 vs. H1: ρ > 0:
– Reject H0 if d < dL
– Accept H0 if d > dU
– Inconclusive if dL < d < dU
• The critical values dL and dU depend on K (no. of independent variables), n (no. of observations) and the significance level α

Example: Pension funds
• From the model summary above: Durbin-Watson d = 1.008
• Want to test ρ = 0 at the 5% level
• Have one independent variable (K = 1 in table 12 on p. 876) and n = 24
• Find critical values dL = 1.27 and dU = 1.45
• Since d < dL: reject H0

Autocorrelation – what to do?
• It is possible to use a two-stage regression procedure: if a first-order autoregressive model εt = ρεt-1 + ut is appropriate, the model
yt - ρyt-1 = β0(1-ρ) + β1(x1t - ρx1,t-1) + ... + βK(xKt - ρxK,t-1) + ut
will have uncorrelated errors ut = εt - ρεt-1
• Estimate ρ from the Durbin-Watson statistic (for example, ρ^ = 1 - d/2), and estimate the β's from the model above (a sketch follows below)
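Below is a sketch of both steps in Python/statsmodels on simulated data with AR(1) errors: compute the Durbin-Watson d, turn it into the estimate ρ^ = 1 - d/2, and refit on the quasi-differences. The data, the true ρ = 0.7, and all variable names are assumptions; this is not the pension-fund data from the textbook CD.

```python
# Sketch: Durbin-Watson statistic and the two-stage (quasi-differencing)
# re-fit, on simulated data with AR(1) errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + rng.normal(scale=0.3)   # autocorrelated errors
y = 1.0 + 2.0 * x + eps

fit = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(fit.resid)               # d near 2 means no autocorrelation
rho = 1 - d / 2                            # textbook estimate of rho from d
print("d =", d, " rho_hat =", rho)

# Stage two: regress the quasi-differences, whose errors are (approximately)
# uncorrelated; the intercept now estimates beta0 * (1 - rho).
y_star = y[1:] - rho * y[:-1]
x_star = x[1:] - rho * x[:-1]
fit2 = sm.OLS(y_star, sm.add_constant(x_star)).fit()
print(fit2.params)
```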
Next time:
• What if the assumption of normality for your data is invalid?
• You have to forget all you have learnt so far, and do something else
• Non-parametric statistics