Introduction to Regression and Data Analysis
StatLab Workshop, February 28, 2003
Tom Pepinsky and Jennifer Tobin

I. Regression: An Introduction

A. What is regression?

The idea behind regression in the social sciences is that the researcher would like to find the relationship between two or more variables. Regression is a statistical technique that allows the researcher to examine the existence and extent of such a relationship. Given a population, if the researcher can either examine the entire population or draw a random sample of sufficient size, regression makes it possible to mathematically recover the parameters that describe the relationships among the variables. Once the researcher has established such a relationship, she can then use these parameters to predict values of the dependent variable for new values of the independent variable. Regression does not place restrictions on the way that the independent variables are distributed or measured (discrete, continuous, binary, etc.), but for regression to be the appropriate technique, the Gauss-Markov assumptions must be fulfilled.

In its simplest (bivariate) form, regression shows the relationship between one independent variable (X) and a dependent variable (Y). The magnitude and direction of that relationship are given by a slope parameter (β1), and an intercept term (β0) captures the level of the dependent variable when the independent variable is absent. A final error term (u) captures the amount of variation that is not predicted by the slope and intercept terms. The coefficient of determination (R2) shows how well the fitted values match the data. More sophisticated forms of regression allow for more independent variables, interactions between the independent variables, and other complexities in the way that one variable affects another.

Regression thus shows us how variation in one variable co-occurs with variation in another. What regression cannot show is causation; causation can only be established analytically, through substantive theory. For example, a regression with shoe size as an independent variable and foot size as a dependent variable would show a very high R2 and highly significant parameter estimates, but we should not conclude that larger shoes cause larger feet. All that the mathematics can tell us is whether or not the two are correlated, and if so, by how much.

B. Difference between correlation and regression

It is important to recognize that regression analysis is fundamentally different from ascertaining the correlations among different variables. Correlation can tell you how the values of your variables co-vary, but regression analysis aims at a stronger claim: describing how one variable, your independent variable, affects another variable, your dependent variable. Correlation measures the strength of the relationship between variables, while regression attempts to describe that relationship. Of course, such an analysis may lead to what is called "spurious correlation," where the co-variation of two variables suggests a causal relationship that does not exist. For example, we might find a significant relationship between being a basketball player and being tall. Being a basketball player does not cause one to become taller; the relationship almost certainly runs the other way.
It is important to recognize that regression analysis cannot itself establish causation, only describe correlation. Causation is established through theory.

SPSS syntax: Analyze: Correlate: Bivariate

Correlations
                               X            Y
X    Pearson Correlation       1           -.954(**)
     Sig. (2-tailed)           .            .001
     N                         7            7
Y    Pearson Correlation      -.954(**)     1
     Sig. (2-tailed)           .001         .
     N                         7            7
** Correlation is significant at the 0.01 level (2-tailed).

II. Before We Get Started: The Basics

A. Your variables may take several forms, and it will be important later that you are aware of, and understand, the nature of your variables. The following are the kinds of variables you are most likely to encounter in your research.

Categorical variables
Such variables include any sort of measure that is "qualitative" or otherwise not amenable to actual quantification. There are a few subclasses of such variables.

Dummy variables take only two possible values, 0 and 1. They signify conceptual opposites: war vs. peace, fixed exchange rate vs. floating exchange rate, etc.

Nominal variables can range over any number of non-negative integers. They signify conceptual categories that have no inherent relationship to one another: red vs. green vs. black, Christian vs. Jewish vs. Muslim, etc.

Ordinal variables are like nominal variables, only there is an ordered relationship among them: no vs. maybe vs. yes, etc.

Numerical variables
Such variables describe data that can be readily quantified. As with categorical variables, there are a few relevant subclasses.

Continuous variables can appear as fractions; in principle, they can take an infinite number of values. Examples include temperature, GDP, etc.

Discrete variables can only take the form of whole numbers. Most often, these appear as count variables, signifying the number of times that something occurred: the number of firms investing in a country, the number of hate crimes committed in a county, etc.

When you begin a statistical analysis of your data, a useful starting point is to get a handle on your variables. Are they qualitative or quantitative? If the latter, are they discrete or continuous? Another useful practice is to ascertain how your data are distributed. Do your variables all cluster around the same value, or do you have a large amount of variation in your variables? Are they normally distributed, or not?

B. We are only going to deal with the linear regression model

The simple (or bivariate) linear regression model (LRM) is designed to study the relationship between a pair of variables that appear in a data set. The multiple LRM is designed to study the relationship between one variable and several other variables. In both cases, the sample is considered a random sample from some population. The two variables, X and Y, are two measured outcomes for each observation in the data set. For example, let's say that we had data on the prices of homes on sale and the actual number of sales of new homes:

Price (thousands of $)   Sales of new homes
         X                       Y
        160                     126
        180                     103
        200                      82
        220                      75
        240                      82
        260                      40
        280                      20

And we want to know the relationship between X and Y. Well, what do our data look like?

SPSS syntax: Graph: Scatter: Simple: enter X and Y

[Scatterplot of Y (sales of new homes) against X (price of houses)]
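For readers who want to reproduce this outside SPSS, here is a minimal Python sketch (assuming the numpy and matplotlib packages are available; the variable names are ours) that computes the same Pearson correlation and draws the scatterplot:

import numpy as np
import matplotlib.pyplot as plt

# House prices (thousands of $) and sales of new homes, from the table above
price = np.array([160, 180, 200, 220, 240, 260, 280])
sales = np.array([126, 103, 82, 75, 82, 40, 20])

# Pearson correlation between price and sales (compare with the SPSS output: -.954)
r = np.corrcoef(price, sales)[0, 1]
print(round(r, 3))

# Scatterplot of sales against price
plt.scatter(price, sales)
plt.xlabel("Price (thousands of $)")
plt.ylabel("Sales of new homes")
plt.show()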
We need to specify the population regression function (PRF), the model we specify to study the relationship between X and Y. This can be written in any number of ways, but we will specify it as:

Yi = β0 + β1Xi + ui

where

Y is an observed random variable (also called the endogenous or left-hand-side variable).
X is an observed non-random or conditioning variable (also called the exogenous or right-hand-side variable).
β0 is an unknown population parameter, known as the constant or intercept term.
β1 is an unknown population parameter, known as the coefficient or slope parameter.
u is an unobserved random variable, known as the disturbance or error term.

Once we have specified our model, we can accomplish two things:

Estimation: How do we get "good" estimates of the PRF parameters β0 and β1? What assumptions make a given estimator a good one?

Inference: What can we infer about β0 and β1 from sample information? That is, how do we form confidence intervals for β0 and β1 and/or test hypotheses about them?

The answers to these questions depend upon the assumptions that the linear regression model makes about the variables. The Ordinary Least Squares (OLS) regression procedure will compute the values of the parameters β0 and β1 (the intercept and slope) that best fit the observations. We want to fit a straight line through the data from our example above, which would look like this:

[Scatterplot of Y against X with the fitted regression line; the vertical distance between an observation and the line is labeled ui]

Obviously, no straight line can run exactly through all of the points. The vertical distance between each observation and the line that fits "best" (the regression line) is called the error. The OLS procedure calculates our parameter values by minimizing the sum of the squared errors over all observations.

Why OLS? It is considered the most reliable method of estimating linear relationships between economic variables. It can be used in a variety of settings, but only when the following assumptions are met:

C. Assumptions of the linear regression model (the Gauss-Markov Theorem)

The Gauss-Markov Theorem is essentially a claim about the ability of regression to assess the relationship between a dependent variable and one or more independent variables. The Gauss-Markov Theorem, however, requires that for all Yi, Xi the following conditions are met:

1) The conditional expectation of Y is an unchanging linear function of known independent variables

That is, Y is generated through the following process:

Yi = β0 + β1X1i + ... + βkXki + εi

In the regression model, the dependent variable is assumed to be a function of one or more independent variables plus an error introduced to account for all other factors. In the equation specified above, Yi is the dependent variable, X1i, ..., Xki are the independent or explanatory variables, and εi is the disturbance or error term. The goal of regression analysis is to obtain estimates of the unknown parameters β1, ..., βk, which indicate how a change in one of the independent variables affects the values taken by the dependent variable. Note that the model also assumes that the relationships between the dependent variable and the independent variables are linear.

Examples of violations: non-linear relationship between variables, including the wrong variables

2) All X's are fixed in repeated samples

The Gauss-Markov Theorem also assumes that the independent variables are non-random.
In an experiment, the values of the independent variable would be fixed by the experimenter, and repeated samples could be drawn with the independent variables fixed at the same values in each sample. As a consequence of this assumption, the independent variables will in fact be independent of the disturbance. For non-experimental work, this must be assumed directly, along with the assumption that the independent variables have finite variances.

Examples of violations: endogeneity, measurement error, autoregression

3) The expected value of the disturbance term is zero

E[εi] = 0 and Cov[Xi, εi] = 0

The disturbance terms in the linear model above must also satisfy some special criteria. First, the Gauss-Markov Theorem assumes that the expected value of the disturbance term is zero. This means that, on average, the errors balance out.

Examples of violations: expected value of the disturbance term is not zero

4) Disturbances have uniform variance and are uncorrelated

V[εi] = E[εi²] = σ² for all i
Cov[εi, εj] = E[εiεj] = 0 for all i ≠ j

The Gauss-Markov Theorem further assumes that the variance of the error term is a constant for all observations and in all time periods. Formally, this assumption implies that the errors are homoskedastic. If the variance of the error term is not constant, then the error terms are heteroskedastic. Finally, the Gauss-Markov Theorem assumes that the error terms are uncorrelated with one another. More specifically, it assumes that the values of the error term at different time periods are independent of each other. So the error terms for all observations, or among observations at different time periods, are not correlated with each other.

Examples of violations: heteroskedasticity, serial correlation of error terms

5) No exact linear relationship between independent variables, and more observations than independent variables

|corr(Xi, Xj)| < 1 for all i ≠ j
Σt (Xt − X̄)² > 0
n > k + 1

The independent variables must be linearly independent of one another. That is, no independent variable can be expressed as a non-zero linear combination of the remaining independent variables. There must also be more observations than independent variables in order to ensure that there are enough degrees of freedom for the model to be identified.

Examples of violations: multicollinearity, micronumerosity

If the five Gauss-Markov assumptions listed above are met, then the Gauss-Markov Theorem states that the Ordinary Least Squares estimator bi is the Best Linear Unbiased Estimator of βi (OLS is BLUE): the OLS estimators are unbiased, and they have the smallest variance in the class of linear unbiased estimators. The formula for the OLS estimator of βi is as follows.

Scalar form (with xi and yi measured as deviations from their means):

b = Σ xiyi / Σ xi²

Matrix form:

b = (X'X)⁻¹ X'Y

The point of the regression equation is to find the best-fitting line relating the variables to one another. In this enterprise, we wish to minimize the sum of the squared deviations (residuals) from this line. OLS will do this better than any other procedure as long as these conditions are met.
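As an illustration of the matrix formula, here is a short Python sketch (assuming numpy is available; the simulated model and variable names are ours) that generates data from a known linear model and recovers the parameters with b = (X'X)⁻¹X'Y:

import numpy as np

# Simulate data from a known model, Y = 2 + 0.5*X + u, so we can check that
# the matrix formula recovers estimates close to the true parameters.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
u = rng.normal(size=n)                  # disturbance: mean zero, constant variance
y = 2.0 + 0.5 * x + u

X = np.column_stack([np.ones(n), x])    # column of ones gives the intercept term
b = np.linalg.inv(X.T @ X) @ (X.T @ y)  # b = (X'X)^(-1) X'Y
print(b)                                # approximately [2.0, 0.5]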
III. Now, Regression Itself

A. So, now that we know the assumptions of the OLS model, how do we estimate β0 and β1?

The Ordinary Least Squares estimates of β0 and β1 are defined as the particular values that minimize the sum of squared errors for the sample data. The best-fit line associated with the n points (x1, y1), (x2, y2), ..., (xn, yn) has the form y = mx + b, where

slope:     m = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]
intercept: b = [Σy − m(Σx)] / n

So we can take our data from above and substitute in to find our parameters:

      Price (thousands of $)   Sales of new homes
             X                       Y            xy         x²
            160                     126         20,160     25,600
            180                     103         18,540     32,400
            200                      82         16,400     40,000
            220                      75         16,500     48,400
            240                      82         19,680     57,600
            260                      40         10,400     67,600
            280                      20          5,600     78,400
Sum       1,540                     528        107,280    350,000

slope:     m = [7(107,280) − (1,540)(528)] / [7(350,000) − (1,540)²] = −62,160 / 78,400 = −0.79286
intercept: b = [Σy − m(Σx)] / n = [528 − (−0.79286)(1,540)] / 7 = 1,749 / 7 = 249.857

And now in SPSS: Analyze: Regression: Linear: Statistics: Confidence intervals; Dependent variable: Y; Independent variable: X

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .954(a)   .911       .893                11.75706
a Predictors: (Constant), X

Coefficients(a)
Model                B         Std. Error   Beta     t        Sig.   95% CI for B (Lower, Upper)
1   (Constant)       249.857   24.841                10.058   .000   186.000, 313.714
    Price of house   -.793     .111         -.954    -7.137   .001   -1.078, -.507
a Dependent Variable: # houses sold

Thus our least squares line is y = 249.857 − 0.79286x.

Interpreting the output: Let's take a look at the regression diagnostics from our example above.

Explaining the coefficient: for every one unit increase in the price of a house, about 0.793 fewer houses are sold. This doesn't make intuitive sense in this case, because you can't sell 0.793 of a house, but we could imagine this to be true for a continuous variable.

But how do we know that our coefficient estimate is meaningful? We can do a test of statistical significance. Usually we want to know whether our coefficient estimate is statistically different from zero; this is called a t-test. Formally, we say:

H0: Bprice of a house = 0

In other words, the price of a house has no effect on house sales. What we hope to do is reject this hypothesis.

Explaining the t-statistic: Our t-statistic is simply our coefficient estimate divided by its standard error. The next step is to compare this statistic to its critical value, which can be found in a table of t-statistics in any statistics textbook; for a large enough sample, most researchers use the 95% confidence level, with the associated critical value of 1.96. Thus, for any t-statistic smaller than 1.96 in absolute value, we cannot reject the null hypothesis that our coefficient is equal to 0. Note: we never say that we accept a hypothesis; we can only reject or fail to reject hypotheses about coefficient estimates.

p-value (in SPSS, Sig.):
p = .001: the probability that we would have obtained an estimate of B this far from zero if its true value were 0.

Confidence interval:
95% confidence interval: −1.078 to −0.507, computed as B̂ ± c·se(B̂). Under repeated sampling, intervals constructed in this way would contain the true B 95% of the time.

R-squared: Finally, we want to look at the R-squared statistic from the model summary above. Formally,

R² = 1 − RSS/TSS = 0.911

the proportion of the variance in Y explained by our independent variable X.
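To check these numbers without SPSS, here is a minimal Python sketch (assuming the numpy and statsmodels packages are available; the variable names are ours) that reproduces the coefficient estimates, t-statistics, p-values, confidence intervals, and R-squared reported above:

import numpy as np
import statsmodels.api as sm

price = np.array([160, 180, 200, 220, 240, 260, 280])
sales = np.array([126, 103, 82, 75, 82, 40, 20])

X = sm.add_constant(price)   # adds the intercept column
fit = sm.OLS(sales, X).fit()

print(fit.params)            # intercept ~ 249.857, slope ~ -0.793
print(fit.tvalues)           # t-statistics (slope ~ -7.14)
print(fit.pvalues)           # p-values (slope ~ .001)
print(fit.conf_int())        # 95% confidence intervals
print(fit.rsquared)          # ~ .911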
Now let's do it again for a multivariate regression, where we add the number of red cars in the neighborhood as a second predictor of housing sales:

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .958(a)   .917       .876                12.65044
a Predictors: (Constant), PRICE, REDCARS

Coefficients(a)
Model                B         Std. Error   Beta     t        Sig.   95% CI for B (Lower, Upper)
1   (Constant)       223.157   54.323                4.108    .015   72.332, 373.983
    Price of house   -.708     .191         -.853    -3.703   .021   -1.240, -.177
    Red cars         .376      .666         .130     .565     .603   -1.474, 2.226
a Dependent Variable: # houses sold

Explaining the coefficients: This time, the estimates cannot be explained in exactly the same manner. Here we would say: controlling for the effect of red cars, a one unit increase in the price of houses is associated with a decrease in housing sales of 0.708 units.

Let's look at our t-statistic on red cars. What does that tell us? So, should we take it out of our model? Again, let's think about the p-values (in SPSS, Sig.), the confidence intervals, and the R-squared. What happened to our R-squared in this case? It increased from .911 to .917. This is good, right? We explained more of the variation in Y by adding this variable. NO.

B. Interpreting log models

The log-log model:

Yi = B1 Xi^B2 e^ui
rewrite: ln Yi = a + B2 ln Xi + ui, where a = ln B1

Is this linear in the parameters? How do we estimate this?

The slope coefficient B2 measures the elasticity of Y with respect to X, that is, the percentage change in Y for a given percentage change in X. The model assumes that the elasticity coefficient between Y and X, B2, remains constant throughout: the change in lnY per unit change in lnX (the elasticity B2) is the same no matter where on lnX we measure it.

[Figure: demand plotted against price, and ln(demand) plotted against ln(price)]

The log-linear model:

ln Yt = B1 + B2t + ut

B2 measures the constant proportional change, or relative change, in Y for a given absolute change in the value of the regressor:

B2 = relative change in regressand / absolute change in regressor

If we multiply the relative change in Y by 100, we then get the percentage change in Y for an absolute change in X, the regressor.

[Figure: GNP plotted against var4, and ln(GNP) plotted against var4]

The lin-log model:

Now we are interested in finding the absolute change in Y for a percentage change in X:

Yi = B1 + B2 ln Xi + ui

B2 = absolute change in Y / relative change in X

The absolute change in Y is equal to B2 times the relative change in X; if the latter is multiplied by 100, then it gives the absolute change in Y for a percentage change in X.
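Before moving on, here is a brief Python sketch of the first of these, the log-log model (assuming numpy and statsmodels are available; the demand and price numbers are made up purely for illustration). Taking logs of both variables makes the model linear in the parameters, so it can be estimated by OLS and the slope read as an elasticity:

import numpy as np
import statsmodels.api as sm

# Made-up demand/price data that roughly follow demand = 10 / price,
# i.e. a constant elasticity of about -1.
price = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
demand = np.array([10.0, 5.2, 3.4, 2.6, 2.1, 1.8])

# Estimate ln(demand) = a + B2*ln(price) + u by OLS
X = sm.add_constant(np.log(price))
fit = sm.OLS(np.log(demand), X).fit()

a, b2 = fit.params
print(np.exp(a), b2)   # B1 = exp(a); B2 is the elasticity, close to -1 here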
IV. When Linear Regression Does Not Work

A. Violations of the Gauss-Markov assumptions

Some violations of the Gauss-Markov assumptions are more serious problems for linear regression than others. Consider an instance where you have more independent variables than observations. In such a case, in order to run linear regression, you must simply gather more observations. Similarly, in a case where you have two variables that are very highly correlated (say, GDP per capita and GDP), you may omit one of these variables from your regression equation. If the expected value of your disturbance term is not zero, then there is another independent variable that is systematically determining your dependent variable. Finally, in a case where your theory indicates that you need a number of independent variables, you may not have access to all of them. In this case, to run the linear regression, you must either find alternate measures of your independent variables or find another way to investigate your research question.

Other violations of the Gauss-Markov assumptions are addressed below. In all of these cases, be aware that Ordinary Least Squares regression, as we have discussed it today, gives biased estimates of parameters and/or standard errors.

B. Non-linear relationship between X and Y: use Non-linear Least Squares or Maximum Likelihood Estimation
Endogeneity: use Two-Stage Least Squares or a lagged dependent variable
Autoregression: use a Time-Series Estimator (ARIMA)
Serial correlation: use Generalized Least Squares
Heteroskedasticity: use Generalized Least Squares

Characteristics of the Dependent Variable

Until now, we have implicitly assumed that the dependent variable in our equation is a continuous, non-censored variable. This is necessary for linear regression analysis. However, it will often be the case that your dependent variable is a dummy variable (0 for peace, 1 for war), a count variable (the number of bills passed by the legislature in a given year), strictly non-negative (the distance from your house to the post office), or censored (yearly income, with no data collected for those making less than $1,000 per year). In these cases, linear regression is not an appropriate technique for uncovering the relationships between X and Y. I discuss the appropriate remedies for these problems below.

C. Dependent variable is binary: use Logistic Regression or Probit Regression
Dependent variable is a count variable: use Poisson Regression
Dependent variable is strictly non-negative or censored: use Tobit Regression

Help

If you believe that the nature of your data will force you to use a more sophisticated estimation technique than Ordinary Least Squares, you should first consult the resources listed on the Statlab Web Page at http://statlab.stat.yale.edu/links/index.jsp. You may also find that Google searches of these regression techniques often find simple tutorials for the methods that you must use, complete with help on estimation, programming, and interpretation. Finally, you should also be aware that the Statlab consultants are available throughout regular Statlab hours to answer your questions (see the areas of expertise for Statlab consultants at http://statlab.stat.yale.edu/people/showConsultants.jsp).