Regression Analysis: Outline
• Review of Regression Analysis
• Regression with Categorical Explanatory Variables
• Pooled Regression: Fixed Effect and Random Effect Models

Regression Analysis in the Overall Context of Research
• Research Purpose
  – Research questions, objectives, hypotheses
• Methodology
  – Type of study
  – Sampling plan and sample size determination
  – Data collection methods
  – Data analysis plan
• Execution
  – Data collection and data analysis
  – Discussion and conclusion
  – Research evaluations

Regression Analysis: Review
• What is regression?
• Dependence measure ~ estimates the overall relationship between the dependent and independent variables
• Examples of dependent and independent variables?
• Regression and causality (~ experiment, theory)
• Regression (~ predict the dependent variable) and correlation (~ linear association)
• Uses of regression
  – Descriptive ~ describe the relationship and how strong it is
  – Inference ~ which variables are most important/significant?
  – Predictive ~ forecasting
  – Hypothesis testing
• Sample size

Type of Variables in Regression Analysis
• Independent
• Dependent
• Moderating
• Mediating
• Moderation-mediation

Moderating Variables
• Testing moderation
  – Y = b0 + b1*X + b2*Z + b3*X*Z + e
  – Rearranged: Y = [b0 + b2*Z] + [b1 + b3*Z]*X + e, so the slope of X depends on Z

Mediator Variables
[Figure: mediation path diagram with nodes Attitude, B, and BI and paths a, b, c]

Multivariate Research Methods: Regression Analysis: Review
• How does it work?
• Formalization of the regression model:
  y = b0 + b1*X1 + b2*X2 + … + bk*Xk + error
  – Systematic part: b0 + b1*X1 + b2*X2 + … + bk*Xk
  – Unsystematic part: error
  – Intercept, slope, error
  – Examples??
• What do we observe? Y and the X's; we estimate the b's
• Which variables to include?
  – Theory, prior research, common sense
  – If you don't have any idea?
    » Statistical criteria: stepwise, forward and backward selection (in cases of only metric data??)
• Moderator effects ~ interaction variables
• How to Obtain Estimates?
  – Least squares method of regression
  – Any straight line you fit will have some error
  – The objective is to minimize that error, e.g. the sum of squared differences between Y and Y-predicted
  – Or: minimize the sum of squared errors
  – Y = a + b*X + e leads to e = Y - a - b*X
  – e^2 = (Y - a - b*X)^2 ~ minimize the sum of e^2

Multivariate Research Methods: Regression Analysis: Review
• Interpretation of parameter estimates?
• Intercept
  – Mean of the dependent variable when the values of all independent variables are zero
  – Mean of the dependent variable when all slopes are zero
  – Not always meaningful
• Slopes
  – Change in Y when X changes by one unit
  – Zero slope: X does not affect Y
• b1, b2, …, bk: partial regression coefficients
  – e.g. b1 = change in the value of Y if X1 is changed by one unit while all other explanatory variables (X2 … Xk) are kept constant

Multivariate Research Methods: Regression Analysis: Review
• Interpretation of parameter estimates?
• Size of the regression coefficient
  – Depends on the scale of the explanatory variable
  – The size of a coefficient is therefore not a good indicator of which variable is a good explanatory variable
  – Scales of the independent variables ~ within a factor of 10 of each other
• Beta coefficients (standardized coefficients)
  – Provide relative importance
• Elasticity: measures the percentage change in the dependent variable for a 1% change in the independent variable
  – $\text{elasticity} = b \cdot \frac{X}{Y}$ (commonly evaluated at the means of X and Y)

Multivariate Research Methods: Regression Analysis: Review
• Is the regression coefficient significant?
• Overall goodness of fit?
  – $r^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$, with $0 \le r^2 \le 1$
• Is Regression Significant?
• r ~ coefficient of multiple correlation
• adjusted r2
[Figure: Y plotted against X with the fitted line Y = b0 + b*X, showing TSS split into ESS and RSS (error)]

Multivariate Research Methods: Regression Analysis: Review
Major assumptions (name of the violation in parentheses)
• The variance of the error term is constant (Heteroscedasticity)
• There is no autocorrelation in the error term (Autocorrelation)
• There is no exact linear relationship in the independent variables (Multicollinearity)
• There must be variability in the independent variables
• The regression model is correctly specified
• The regression model is linear in parameters
• The mean value of the error term is zero
• No covariation between errors and independent variables
• The error term is normally distributed

Multivariate Research Methods: Regression Analysis: Review
• Detecting problems with the assumptions?
• Heteroscedasticity
  – Error variances are not the same
  – When errors are related to either dependent or independent variables
  – e.g. more stable saving (or consumption) with lower income families / larger variances with brand switchers than brand loyal customers
  [Figure: variance of saving plotted against income]
  – Remedy?? If we know the nature of heteroscedasticity, we can use WLS
  – Volatility ~ Finance??

Regression Analysis: Review
• Detecting problems with the assumptions?
• Autocorrelation ~ more a time-series problem
  – When errors are correlated with consecutive obs.
  – Reasons?
    » Omitted variables
    » Model mis-specification
  – Detection
    » Graphical methods
    » Durbin-Watson ~ DW = 2(1 - r); DW varies between 0 and 4; ideal number is 2
  – Problem?
    » Overestimates the coeff. of determination and underestimates the standard errors
  [Figure: plots labeled Positive and Negative autocorrelation; axis labels Y, X, e_t, e_{t-1}]

Multivariate Research Methods: Regression Analysis: Review
• Detecting problems with the assumptions?
• Multicollinearity
  [Figure: diagrams of the overlap among X1, X2, and Y]
  – Presence of very high interrelations among explanatory variables (does not violate any assumption)
  – Symptoms: the standard errors are likely to be high; estimates are not reliable
  – Detection
    » Bivariate correlation
    » Variance Inflation Factor (VIF) ~ 10 as a rule-of-thumb threshold
    » Tolerance = 1/VIF
    » $VIF_i = \frac{1}{1 - r_i^2}$, where $r_i^2$ comes from regressing X_i on the other explanatory variables
  – Remedies
    » Drop variables
    » Composite variables, e.g. family life cycle, social status
    » Factor analysis

Multivariate Research Methods: Regression Analysis: Review
• Detecting problems with the assumptions?
• Linear in parameters
  – Y = a + b*X^2 + e ~ linear in parameters but non-linear in variables
  – Y = a + b^2*X1 + b*X2 + e ~ non-linear in parameters: non-linear regression
• The regression model is correctly specified
  – Functional form, e.g. new consumer durable sales
• Influential observations
  – Outliers
  – Whether one or a few observations??

Regression Analysis: Review
• Outliers: In linear regression, an outlier is an observation with a large residual. Problem with the dependent variable??
• Leverage: An observation with an extreme value on an independent variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an unusually large effect on the estimates of the regression coefficients.
• Influence: An observation is said to be influential if removing it substantially changes the estimates of the coefficients.
• Detection
  – RESIDUAL CHECK
    » Standardized residual: $e_i^* = \frac{e_i}{s\sqrt{1 - h_i}}$
    » Studentized residual: $e_i^* = \frac{e_i}{s_{(i)}\sqrt{1 - h_i}}$
    » Here $h_i$ is the leverage of observation i, $s$ is the residual standard error, and $s_{(i)}$ is the same quantity computed with observation i deleted
    » Problem approx.: abs. value > 2

Regression Analysis: Review
• Transformation of variables
  – Dependent variable should be normally distributed, have constant variance, etc.
  – e.g. GNP per capita, Log(Price), etc.
  – Retransformation ??
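The transformation and retransformation idea can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `fit_log_linear` and `predict_original_scale` and the toy data are hypothetical (not from the slides), the dependent variable is assumed to be strictly positive, and the back-transform shown is the naive one (exponentiating the fitted log values).

```python
import numpy as np

def fit_log_linear(x, y):
    """Fit log(y) = a + b*x by least squares and return (a, b)."""
    # np.polyfit returns the highest-degree coefficient first: [b, a].
    b, a = np.polyfit(x, np.log(y), deg=1)
    return float(a), float(b)

def predict_original_scale(a, b, x):
    """Naive retransformation: exponentiate the fitted log-scale values."""
    return np.exp(a + b * np.asarray(x, dtype=float))

# Hypothetical noise-free toy data generated from log(y) = 1.0 + 0.5*x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(1.0 + 0.5 * x)

a, b = fit_log_linear(x, y)
y_hat = predict_original_scale(a, b, x)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With noisy data, exponentiating the fitted log values generally does not reproduce the conditional mean of Y on the original scale, which is presumably why the slide flags retransformation with a question mark.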
• Forecasting
  – Model fit versus forecasting
  – Forecasting independent variables
• Model selection / comparing models
  – Adjusted R-sq
• Model validation
  – Cross-validation
  – Jackknife validation

Multivariate Research Methods: Regression Analysis: Limitations
• Nominal independent variables ~ dummy variable regression
  – Gender, income groups, ethnicity, region, race, etc.
• Measurement error ~ structural equation models
  – XTrue = Xobs + ex
  – Y = b0 + b1*XTrue + eY
  – Y = b0 + b1*(Xobs + ex) + eY
  – Y = b0 + b1*Xobs + (b1*ex + eY)
  – The composite error term (b1*ex + eY) is correlated with the x-variable ~ this violates the regression assumption

Regression Analysis: Limitations
• Limited dependent variable
  – Censored dependent variable ~ lots of zeros
    » Expenditures in home buying
    » Demand in a supply-restricted situation
    » Vacation expenditures
    » Tobit regression
  [Figure: Y (e.g. housing exp.) plotted against X (e.g. income)]
  – Truncated dependent variable ~ duration analysis, available in LIMDEP
    » Interpurchase times
    » Duration of unemployment

Regression with Categorical Explanatory Variables
• Some modeling problems
  – Is gender important in determining the level of expenditure on medical expenses?
  – Do Nescafe’s supermarket coffee sales vary by state?
  – How would you model the impact of local crime on housing prices if crime rate were rated none, moderate or high?
  – How do I include income as a determinant of cigarette demand when data have only been collected by income class?
• Examples
  – Medical expenditure = intercept + b1*Gender + b2*Age group + error
  – Sales = intercept + b1*Provinces + error

Interpretation of regression coefficients: Binary Coding
• Midterm exam scores by sex
  – $Y_i = b_0 + b_1 D_i + e_i$
  – $Y_i$ = score; $D_i = 1$ if male, $D_i = 0$ if female
  – $\bar{Y}_{female} = b_0$
  – $\bar{Y}_{male} = b_0 + b_1$
  – So $b_1 = \bar{Y}_{male} - \bar{Y}_{female}$

Interpretation of regression coefficients: Effect Coding
• Midterm exam scores by sex
  – $Y_i = b_0 + b_1 D_i + e_i$
  – $Y_i$ = score; $D_i = 1$ if male, $D_i = -1$ if female
  – $\bar{Y}_{overall\ mean} = b_0$ (the average of the two group means)
  – $\bar{Y}_{male} = b_0 + b_1$
  – $\bar{Y}_{female} = b_0 - b_1$
  – Note: we are not estimating a separate female effect; the coding implies it is $-b_1$

Regression Analysis: Non-Linear Regression
• Example: sales and price dynamics of a new product
[Figure: first-purchase sales plotted against time, and price plotted against time]

Pooled Regression: Fixed Effect and Random Effect models
• Panel data ~ cross-sectional time-series data
  – Observations on “n” individuals (or countries, firms, etc.), each measured at T points in time (T can be different for each measuring unit)
  – Observations are not independent
  – Use the panel structure to get better parameter estimates
  – Control for fixed or random individual differences
• Example of data setup….
• Software: LIMDEP (also SAS…)
• Example: cross-sectional survey ~ 50% female participation in labor force??

Pooled Regression: Fixed Effect and Random Effect models
• Fixed Effect
  – Individual intercepts are different: each individual's regression line is shifted by a “fixed” amount
  – $y_{it} = \alpha_i + X'_{it}\beta + e_{it}$
• Random Effect
  – Individual differences are random rather than fixed: random intercept terms. The intercept is a mean intercept value plus a random error
  – $y_{it} = \alpha + X'_{it}\beta + (e_{it} + u_i)$
  – $u_i$: unobserved heterogeneity that is stable over time
  – This $u_i$ is uncorrelated with the X's

Pooled Regression: Fixed Effect and Random Effect models
• The Hausman Test
  – Model selection: Fixed Effect vs Random Effect
  – H0: the random effects estimator is consistent and efficient, versus
  – H1: the random effects estimator is inconsistent
  – Chi-square test statistic
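To make the fixed-effect idea concrete, the sketch below compares a pooled OLS slope with a fixed-effect (within) slope on simulated panel data. It is a minimal illustration, not the LIMDEP or SAS procedure: the function names, the single regressor, and the simulation settings (200 individuals, 5 periods, true slope 2.0, individual effect deliberately correlated with X) are all assumptions made for the example.

```python
import numpy as np

def pooled_ols_slope(x, y):
    """Slope from one regression of y on x with a single common intercept."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())

def fixed_effect_slope(x, y, ids):
    """Within estimator: demean x and y inside each individual, then run OLS."""
    x_within = np.empty_like(x, dtype=float)
    y_within = np.empty_like(y, dtype=float)
    for i in np.unique(ids):
        mask = ids == i
        x_within[mask] = x[mask] - x[mask].mean()
        y_within[mask] = y[mask] - y[mask].mean()
    return float((x_within * y_within).sum() / (x_within ** 2).sum())

def simulate_panel(n=200, t=5, beta=2.0, seed=0):
    """Hypothetical panel in which the individual effect is correlated with x."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n)                   # individual-specific intercepts
    ids = np.repeat(np.arange(n), t)             # individual index for each row
    x = alpha[ids] + rng.normal(size=n * t)      # x depends on the individual effect
    y = alpha[ids] + beta * x + rng.normal(scale=0.5, size=n * t)
    return x, y, ids

x, y, ids = simulate_panel()
print("pooled OLS slope:  ", round(pooled_ols_slope(x, y), 3))
print("fixed-effect slope:", round(fixed_effect_slope(x, y, ids), 3))
```

Because the simulated individual effect is correlated with X, the pooled slope is pushed away from the true value while the within estimator stays close to it. This is the situation in which the random-effect assumption (u_i uncorrelated with the X's) fails, and it is what the Hausman test is designed to detect.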