Understanding Relationships

To this point we have primarily analyzed data on one variable, i.e., univariate data. Now we would like to look at relating two or more variables, i.e., multivariate data. We have already looked at relating quantitative variables through covariance and correlation. In this section we return to relationships between two quantitative variables, but now our main focus will be on regression. The primary objective of this section is to learn how relationships between variables can be quantified and interpreted. In this context we will revisit causation, and we will show how the computer can be used to ease the computational burden.

When we talk about relationships between variables, we usually cannot conclude anything about causation (i.e., that a change in one variable causes a change in the other variable). We can only conclude whether or not the variables are related.

Linear Regression

With correlation we found a single number that represented the strength of the linear relationship between two variables. Another way to look at the relationship between two variables is to hypothesize the form of the linear relationship, and then estimate the equation of the line that relates them from the data. Our objectives for this section are to learn to relate two or more random variables in a meaningful way, to test the significance of the relationships, and to use the relationships to make predictions. Most of what we have learned so far will be used in this section: we will use histograms, summary measures, confidence intervals, and tests of hypotheses to talk about regression. In all of these discussions we will again be talking about linear relationships. We will see, however, that this isn't as restrictive as it first sounds.

Here we often assume that a set of the variables is under our control. These variables are referred to as the independent variables, and are denoted x. The variable that is not under our control is called the dependent (or response) variable, and is denoted y. For our purposes there will be only one dependent variable, but we can have many independent variables. Again, we assume nothing about causality; if we find a linear relationship, we cannot necessarily say that a change in the independent variables causes a change in the response of the dependent variable.

Let's begin with a simple example. The following table and graph show Narco Medical's advertising expense (in hundreds of dollars) in each period, and the associated sales (in thousands of dollars).

Advertising Expense    Sales
         8              15
         9              11
         7              10
         6              11
         5               8
         1               5

[Scatter plot of Sales (thousands of dollars) versus Advertising Expense (hundreds of dollars)]

When there is only one independent variable, as in this case, we call the modeling "simple linear regression." We want to fit a line to the data. We will assume a linear model of the form

    y = β0 + β1x + ε.

What criterion or criteria should we use to fit the line? What line would best fit the data?

Least Squares

Most often a method known as least squares is used to fit the line. The estimate of the line's intercept we will call b0, and that of the line's slope we will call b1. The estimated or predicted value of y is denoted ŷ. Hence our estimated line has the form

    ŷi = b0 + b1xi.

In least squares we minimize the sum of squared differences between the observed and predicted values. We define a residual to be ei = yi − ŷi, and minimize

    Σ (i = 1 to n) ei².
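As a quick check, here is a minimal Python sketch that fits this line to the Narco Medical data with numpy's least-squares routine; it reproduces the intercept and slope reported in the spreadsheet printout that follows.

```python
import numpy as np

# Advertising expense (hundreds of dollars) and sales (thousands of dollars)
x = np.array([8, 9, 7, 6, 5, 1])
y = np.array([15, 11, 10, 11, 8, 5])

# np.polyfit of degree 1 finds the b0 and b1 that minimize the sum of squared residuals
b1, b0 = np.polyfit(x, y, 1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")   # b0 = 4.00, b1 = 1.00

# Residuals e_i = y_i - yhat_i and the minimized sum of squared residuals
y_hat = b0 + b1 * x
print("sum of squared residuals:", np.sum((y - y_hat) ** 2))
```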
For you calculus fans, the procedure is to take the partial derivatives with respect to b0 and b1 and set them equal to 0. We will not go through the details, nor even worry about writing the result. The book shows the result, and for us the important thing is that the computer calculates the values of b0 and b1 for us. For the example, here is a partial printout from the spreadsheet:

                        Coefficients
Intercept                    4
Advertising Expense          1

So the estimated relationship is

What would we predict sales to be if the advertising expense is $400? What about an advertising expense of $1,500?

Multiple Linear Regression

We can write a similar model when there is more than one independent variable. The general form of the model is

    y = β0 + β1x1 + β2x2 + ... + βkxk + ε,

and the estimated relationship is

    ŷ = b0 + b1x1 + b2x2 + ... + bkxk.

We use least squares to find the values of b0, b1, ..., bk that minimize the sum of the squared differences between the yi and the ŷi.

Using Data Analysis Tools to Do Regression

Doing regression in Excel is very similar to using the other analysis tools. With regression, however, having the data in the right form is more important. First, all data should be entered in columns. Second, all independent variables should be next to each other (i.e., in a contiguous set of cells). Once the data are entered correctly, select "Regression" from the Tools / Data Analysis menu item in Excel. You will be presented with the Regression dialogue box.

In the Input Y Range, enter the cell range referring to the column containing the dependent variable. In the Input X Range, enter the range of cells containing all of the independent variables; this is why the X variables need to be next to each other. If your range of cells includes a row of labels, check the Labels box.

I never check the Constant is Zero box. In some physical systems it only makes sense for the intercept to be 0, so we can force it to be so, but in our examples that will never be the case. If you want a confidence interval other than the default 95% confidence interval, click in the Confidence Level box and enter a different confidence level. Next, indicate where you want the output to go. Finally, check the box next to "Residuals." I leave all other boxes blank, because I don't like the way that Excel does the rest of the residual analysis or the normal probability plot. Then hit enter.

Adequacy of Fit

We have discussed how we fit the line to the data, but we must remember that we are using sample values to estimate the hypothesized line. Since the results come from sample estimates, they are subject to error, and hence we need to be sure that the relationship we are seeing is really significant. There are a few measures and tests that we can use to look at how good the fitted relationship really is.

Example: We will use the pizza delivery example to motivate our discussion. Let's concentrate on the delivery time as our dependent variable. We want to predict delivery time from some of the other variables in the data set. Which ones might it make sense to include? We will try using distance, day of the week, and hour of the day as our independent variables.
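If you prefer to work outside Excel, here is a minimal sketch of the same fit using Python's statsmodels package. The file name and the column names (DeliveryTime, Day, Hour, Distance) are assumptions for illustration; match them to however the pizza data are actually stored.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names -- adjust to match the actual worksheet.
pizza = pd.read_excel("pizza.xlsx")

model = smf.ols("DeliveryTime ~ Day + Hour + Distance", data=pizza).fit()
print(model.summary())      # R Square, ANOVA F test, coefficients, t statistics, p-values
print(model.conf_int())     # 95% confidence intervals for the coefficients
```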
Below is the overall output from the computer for fitting these data. We will talk about every part of the printout.

Regression Statistics
Multiple R            0.936248
R Square              0.87656
Adjusted R Square     0.874991
Standard Error        0.670277
Observations          240

Analysis of Variance
              df    Sum of Squares    Mean Square    F           Significance F
Regression      3        752.919        250.973      558.6225       7.1E-107
Residual      236        106.028          0.449271
Total         239        858.947

            Coefficients   Standard Error   t Statistic   P-value     Lower 95%   Upper 95%
Intercept      1.156832        0.229887       5.032169    9.54E-07     0.703939    1.609725
Day           -0.02521         0.022013      -1.14541     0.253183    -0.06858     0.018153
Hour          -0.00592         0.018919      -0.31297     0.754578    -0.04319     0.031351
Distance       1.754525        0.042988      40.8147      1E-109       1.669837    1.839213

What is the estimated relationship?

Significance of the Overall Relationship:

Our first question to answer when looking at the goodness of fit of the model is whether or not the overall relationship is significant. In other words, is y related to any of the x's? We do this by testing a hypothesis.

H0:
Ha:

The hypothesis is tested by comparing the amount of variation explained by the independent variables to the amount of variation left unexplained. The unexplained variance is shown on the printout as the Residual Mean Square. The explained portion is referred to as the Regression Mean Square. What would be the appropriate test statistic? We use the Analysis of Variance (ANOVA) portion of the printout to test the hypothesis.

ANOVA
              df    Sum of Squares    Mean Square    F           Significance F
Regression      3        752.919        250.973      558.6225       7.1E-107
Residual      236        106.028          0.449271
Total         239        858.947

Testing Individual Contributions:

We can also test the marginal contribution of an individual independent variable when all other variables are included in the model. We again do this by testing a hypothesis.

H0:
Ha:

This turns out to be a t-test, very similar to the types we have done before. Here is the part of the printout which can be used to do this analysis.

            Coefficients   Standard Error   t Statistic   P-value     Lower 95%   Upper 95%
Intercept      1.156832        0.229887       5.032169    9.54E-07     0.703939    1.609725
Day           -0.02521         0.022013      -1.14541     0.253183    -0.06858     0.018153
Hour          -0.00592         0.018919      -0.31297     0.754578    -0.04319     0.031351
Distance       1.754525        0.042988      40.8147      1E-109       1.669837    1.839213

Note that we can also use this section of the printout to form confidence intervals around individual contributions.
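As a check on the printout, here is a small Python sketch using scipy that recomputes the overall F statistic from the ANOVA sums of squares and the marginal t statistic for Distance; the numbers are taken directly from the tables above.

```python
from scipy import stats

# Overall F test: explained vs. unexplained mean squares (values from the ANOVA table)
ms_regression = 752.919 / 3        # 250.973
ms_residual = 106.028 / 236        # 0.449271
F = ms_regression / ms_residual    # about 558.6
print(F, stats.f.sf(F, 3, 236))    # p-value is essentially zero (printout: 7.1E-107)

# Marginal t test for Distance: t = coefficient / standard error
t = 1.754525 / 0.042988            # about 40.8
print(t, 2 * stats.t.sf(abs(t), 236))   # two-sided p-value, essentially zero
```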
If we do simple linear regression, testing whether the one slope is equal to 0 is equivalent to testing whether the correlation is equal to 0. Here is the test:

    H0: ρ = 0          or equivalently          H0: β1 = 0
    Ha: ρ ≠ 0                                   Ha: β1 ≠ 0

For example, if we want to test whether the correlation between day of the week and preparation time is significant, we can run a regression and obtain the following results:

Regression Statistics
Multiple R            0.101878
R Square              0.010379
Adjusted R Square     0.006221
Standard Error        1.970273
Observations          240

Analysis of Variance
              df    Sum of Squares    Mean Square    F          Significance F
Regression      1          9.689971       9.689971    2.496145      0.115453
Residual      238        923.91           3.881975
Total         239        933.6

            Coefficients   Standard Error   t Statistic   P-value     Lower 95%   Upper 95%
Intercept      1.242385        1.813175       0.685199    0.493883    -2.32954     4.814311
Prep Time      0.191105        0.120959       1.579919    0.115448    -0.04718     0.429391

Conclusions:

Amount of Variation Explained by the Independent Variables:

Our next measure of goodness of fit is referred to as the coefficient of (multiple) determination, and is denoted R². R² is the proportion of the variation in the data that is explained by the model, compared to the overall variation. It is computed as the ratio of the regression sum of squares to the total sum of squares.

Adjusted R²:

One way of increasing the amount of variation explained is simply to increase the number of independent or explanatory variables. To adjust for this, there is a quantity called the adjusted coefficient of multiple determination, which is adjusted for the number of variables in the model. The portion of the printout which shows R² and the adjusted R² is repeated below.

Regression Statistics
Multiple R            0.936248
R Square              0.87656
Adjusted R Square     0.874991
Standard Error        0.670277
Observations          240

ONLY if we are doing simple linear regression, the square root of R² is the correlation coefficient between the dependent and independent variables. The other important number displayed in this portion of the printout is called "Standard Error." This is an estimate of the standard deviation of the observations about the line. The book refers to this quantity as the "standard error of the estimate," and denotes it se.

Indicator Variables

In many cases there is reason to believe that the dependent variable is related to qualitative variables. These variables are usually categorical and include gender, marital status, educational level, etc. These variables can also be handled in a regression model. To use categorical variables, we must define what are called indicator or dummy variables, which assign numbers to the categories. In general, if we have a total of m categories, we will need m − 1 indicator variables. Most of the time the qualitative variable will have only two categories, so we will need only one dummy variable.

To illustrate, consider a case where we want to relate a person's income to the number of months they have worked and to the person's gender. We define y to be income, x1 to be months on the job, and x2 to be an indicator variable for gender, where x2 = 1 if the person is male and x2 = 0 if the person is female. Our model is

    y = β0 + β1x1 + β2x2 + ε.

If we want to investigate the effect of gender, we can look at each case separately. Plugging in the appropriate value for x2 gives

    y = (β0 + β2) + β1x1 + ε    for males, and
    y = β0 + β1x1 + ε           for females.

The model then says that gender affects the intercept of the line, but does not affect the slope. Graphically we have

[Plot of Income versus Tenure (months on the job) showing two parallel fitted lines, one for males and one for females]

β2 then represents the differential influence on income of being male. If we want to know whether the difference is significant, we can test the hypothesis H0: β2 = 0. If we reject H0, then we conclude the difference is significant.
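Here is a minimal sketch of such a dummy-variable fit in Python with statsmodels. The data frame and its numbers are purely illustrative (invented for this sketch), with income in thousands of dollars, months on the job, and a 0/1 gender indicator.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Purely illustrative numbers -- not data from these notes.
salary = pd.DataFrame({
    "income": [24, 27, 31, 36, 22, 25, 29, 33],    # thousands of dollars
    "months": [20, 60, 100, 160, 20, 60, 100, 160],
    "male":   [1, 1, 1, 1, 0, 0, 0, 0],            # x2 = 1 for males, 0 for females
})

fit = smf.ols("income ~ months + male", data=salary).fit()
print(fit.params)             # b0, b1 (slope for months), b2 (intercept shift for males)
print(fit.pvalues["male"])    # p-value for the test of H0: beta2 = 0
```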
Testing Assumptions

The last thing we want to do to examine how well the model fits the data is to check the validity of the assumptions. Let us begin by stating the assumptions of linear regression, which will guide our analyses. There are five important assumptions:

1. The relationship is linear.
2. The error terms (ε) are normally distributed.
3. At every x value, the error terms have the same variance.
4. The error terms are independent of each other.
5. The independent variables (x1, x2, ..., xk) are independent of each other.

The first four assumptions are almost all concerned with the error terms of the model, and we can check them by analyzing the residuals.

Residual Analysis

To check most of the assumptions, we do something called a residual analysis, which involves looking at the differences between the actual values and the predicted values (the residuals). The residuals are estimates of the error terms.

To check for model linearity, error term independence, and equal variance, we will use scatter plots. First we plot the residuals against the predicted values. Second, we can make a time series plot of the residuals. If the data are nonlinear, we should see a systematic pattern in the first plot. If the error terms are dependent, we may also see a systematic pattern in the scatter plot, or in the time series plot. Finally, if the error terms have different variances, we should see the spread in the residuals changing as a function of the predicted values. If all the assumptions are met, we should see a random scatter with no identifiable patterns.

To check for normality, we will use a histogram of the residuals. If it looks close to a bell-shaped distribution, we can feel comfortable that the error terms are close to normally distributed. We can also use standardized residuals (described below) as another check. The book mentions a normal probability plot, and Excel presumably constructs such a plot, but Excel's is not consistent with the book's. We will ignore the normal probability plot for now. Later we will discuss a hypothesis test of whether the residuals follow a normal distribution.

The plots for the pizza example are shown below.

Linearity:
Variance:
Independence:
Normality:

[Scatter plot of the residuals versus the predicted values]

[Time series plot of the residuals]

[Histogram of the residuals]
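Here is a minimal matplotlib sketch of these three diagnostic plots. It continues from the earlier statsmodels sketch, so the object name `model` (the fitted pizza-delivery regression) is an assumption carried over from that sketch.

```python
import matplotlib.pyplot as plt

# `model` is the fitted OLS result from the earlier sketch, e.g.
# model = smf.ols("DeliveryTime ~ Day + Hour + Distance", data=pizza).fit()
resid = model.resid
fitted = model.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].scatter(fitted, resid)                 # residuals vs. predicted values
axes[0].axhline(0, linewidth=1)
axes[0].set(xlabel="Predicted values", ylabel="Residuals")

axes[1].plot(resid.values)                     # time series plot, in observation order
axes[1].set(xlabel="Observation", ylabel="Residuals")

axes[2].hist(resid, bins=15)                   # histogram, to eyeball normality
axes[2].set(xlabel="Residuals", ylabel="Frequency")

plt.tight_layout()
plt.show()
```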
Examples of identifiable patterns:

[Two example plots of residuals versus predicted values showing identifiable (non-random) patterns]

Multicollinearity

The final assumption in multiple linear regression is that the independent variables are truly independent of each other. If they are not, the model will exhibit multicollinearity. We will discuss the effects of multicollinearity, and how to detect it.

Effects of multicollinearity: If we are only interested in prediction, then multicollinearity does not represent a large problem. If we are trying to explain the relationships between the dependent and independent variables, however, it does cause problems. The main problem is that the standard errors of the regression coefficients are highly inflated; hence the estimated regression coefficients have large sampling variability, and tend to vary widely from one sample to the next when the independent variables are highly correlated. Another problem is in the interpretation of the estimated coefficients: when the explanatory variables are correlated, we cannot really vary one variable without the correlated variable(s) changing at the same time.

Detecting multicollinearity: There are a few ways of informally detecting multicollinearity, and there is one more formal way. Here are some informal indications:

1. Large changes in the estimated regression coefficients when a variable is added or deleted.
2. Nonsignificant results in individual tests on, or wide confidence intervals for, the regression coefficients of important independent variables.
3. Estimated regression coefficients with an algebraic sign opposite of that expected from theoretical considerations or prior experience.
4. Large coefficients of correlation between independent variables in the correlation matrix.

Variance Inflation Factors

A more formal way to test for multicollinearity is to calculate variance inflation factors (VIFs). These factors measure how much the variances of the estimated regression coefficients are inflated compared to when the independent variables are not correlated. The VIF for variable i is defined as

    VIFi = 1 / (1 − Ri²),

where Ri² is the coefficient of multiple determination when Xi is regressed on all of the other X's in the model.

Fortunately, to calculate the VIFs we do not need to run a regression of each independent variable on all of the others. Instead, we will use the correlation matrix of the independent variables only. After finding the correlation matrix using the CORRELATION analysis tool, fill in the blank values, and then invert the matrix. In Quattro Pro for Windows this can be done by using the NUMERIC TOOL called INVERT; the procedure is almost self-explanatory. To invert the matrix in Excel, you need to use the function MINVERSE(array), where "array" is the range of cells containing the correlation matrix. This function is an array function; when entering array functions you need to first block out all the cells where you want the result to go, and then enter the formula by simultaneously pressing CTRL SHIFT ENTER. We will illustrate this in the lab. The VIF values will be on the diagonal of the resulting matrix. If any VIF value exceeds 10, there is evidence of multicollinearity.
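The same inversion can be done in Python. Here is a minimal numpy sketch, using the filled-in correlation matrix of the three independent variables from the body-fat example that follows.

```python
import numpy as np

def vifs(corr_matrix):
    """The diagonal of the inverse correlation matrix gives the VIFs."""
    return np.diag(np.linalg.inv(corr_matrix))

# Filled-in (symmetric) correlation matrix of skinfold thickness,
# thigh circumference, and midarm circumference from the example below.
R = np.array([[1.0,      0.923843, 0.457777],
              [0.923843, 1.0,      0.084667],
              [0.457777, 0.084667, 1.0]])

print(vifs(R))   # roughly [708.8, 564.3, 104.6] -- all far larger than 10
```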
Example: To show the effects of multicollinearity, and to show how to detect it, consider a study of the relation of body fat to triceps skinfold thickness, thigh circumference, and midarm circumference, based on a sample of 20 healthy females 25-34 years old. The data are shown below.

Skinfold Thickness   Thigh Circumference   Midarm Circumference   Body Fat
      19.5                  43.1                  29.1              11.9
      24.7                  49.8                  28.2              22.8
      30.7                  51.9                  37.0              18.7
      29.8                  54.3                  31.1              20.1
      19.1                  42.2                  30.9              12.9
      25.6                  53.9                  23.7              21.7
      31.4                  58.5                  27.6              27.1
      27.9                  52.1                  30.6              25.4
      22.1                  49.9                  23.2              21.3
      25.5                  53.5                  24.8              19.3
      31.1                  56.6                  30.0              25.4
      30.4                  56.7                  28.3              27.2
      18.7                  46.5                  23.0              11.7
      19.7                  44.2                  28.6              17.8
      14.6                  42.7                  21.3              12.8
      29.5                  54.4                  30.1              23.9
      27.7                  55.3                  25.7              22.6
      30.2                  58.6                  24.6              25.4
      22.7                  48.2                  27.1              14.8
      25.2                  51.0                  27.5              21.1

The correlation matrix for all of the data is shown below.

                       Skinfold     Thigh           Midarm
                       Thickness    Circumference   Circumference   Body Fat
Skinfold Thickness     1
Thigh Circumference    0.923843     1
Midarm Circumference   0.457777     0.084667        1
Body Fat               0.843265     0.87809         0.142444        1

We can see that the skinfold thickness and thigh circumference are quite highly correlated. Hence we should suspect multicollinearity. First let's look at the regression results for body fat when only skinfold thickness and thigh circumference are used as independent variables.

Regression Statistics
Multiple R            0.882072
R Square              0.778052
Adjusted R Square     0.75194
Standard Error        2.543166
Observations          20

Analysis of Variance
              df    Sum of Squares    Mean Square    F          Significance F
Regression      2        385.4387       192.7194     29.79723      2.77E-06
Residual       17        109.9508         6.467694
Total          19        495.3895

                       Coefficients   Standard Error   t Statistic   P-value     Lower 95%   Upper 95%
Intercept               -19.1742          8.360641      -2.29339     0.033401    -36.8137    -1.53481
Skinfold Thickness        0.222353        0.303439       0.732775    0.47264      -0.41785    0.862554
Thigh Circumference       0.659422        0.291187       2.264597    0.035425      0.045069   1.273774

Suppose that we tested significance at the .01 level. What can we observe?

Now let's include the third variable.

Regression Statistics
Multiple R            0.895186
R Square              0.801359
Adjusted R Square     0.764113
Standard Error        2.479981
Observations          20

Analysis of Variance
              df    Sum of Squares    Mean Square    F          Significance F
Regression      3        396.9846       132.3282     21.51571      7.34E-06
Residual       16         98.40489        6.150306
Total          19        495.3895

                       Coefficients   Standard Error   t Statistic   P-value     Lower 95%   Upper 95%
Intercept               117.0847          99.7824        1.1734      0.255136    -94.4445    328.6139
Skinfold Thickness        4.334092         3.015511      1.437266    0.166908     -2.05851    10.72669
Thigh Circumference      -2.85685          2.582015     -1.10644     0.282348     -8.33047     2.616779
Midarm Circumference     -2.18606          1.595499     -1.37014     0.186616     -5.56837     1.196246

Now what do we see?

Here is what we would do to get the variance inflation factors. First we need the correlation matrix of the independent variables.

                       Skinfold     Thigh           Midarm
                       Thickness    Circumference   Circumference
Skinfold Thickness     1
Thigh Circumference    0.923843     1
Midarm Circumference   0.457777     0.084667        1

Next we fill in the missing values.

1           0.923843    0.457777
0.923843    1           0.084667
0.457777    0.084667    1

Finally, we invert the matrix and look at the diagonal elements.

 708.8429142   -631.9152231   -270.9894176
-631.9152231    564.3433857    241.4948157
-270.9894176    241.4948157    104.606

There is clearly multicollinearity present in this case. Note that even though the third independent variable (midarm circumference) was not highly correlated with either of the other two independent variables, it has a high VIF value. This is a case where one variable is strongly related to two others taken together, even though the pairwise correlations are small.

Making Predictions

One of the most common uses of fitting a linear model to a set of data is to make predictions at new levels of the independent variables. To make a prediction, we simply insert the desired values of the independent variables into the estimated relationship. The result is a point estimate. As with other statistical estimates, we can also make interval estimates.

For example, suppose a new subdivision is being constructed and we are interested in estimating how long it will take to deliver pizzas to homes in the neighborhood. The distance from the pizza parlor to the entrance to the subdivision is 5 miles. We will drop day of the week and time of day from the model, since they did not add to the model. If we run a new regression, we obtain the estimated relationship

    Estimated Delivery Time = 1.02815 + 1.74962 × Distance.

We will call the point estimate of y given a particular set of x values ŷp. ŷp is actually an estimate of two things. First, it is an estimate of the mean level of y given all the values of x, represented by E(yp). Second, it is an estimate of an individual value of y given all of the x values.
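As a quick check of the arithmetic, here is the point prediction for a 5-mile delivery using the coefficients reported above (the units of delivery time are presumably minutes).

```python
# Coefficients from the refit reported above (Distance as the only predictor)
b0, b1 = 1.02815, 1.74962

distance = 5                      # miles from the parlor to the subdivision entrance
y_p = b0 + b1 * distance
print(round(y_p, 2))              # 9.78 -- the point estimate of delivery time
```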
We can also find interval estimates of the prediction, but there is no built-in function in Excel to do it, so we will not discuss it in this class. If you are interested in how to do it, I have a macro function that can be used.

Curvilinear Regression

So far we have looked at cases where the relationship is linear in both the β's and the x's. The assumption on the x's is not as limiting as it seems: in multiple regression, we can substitute nonlinear functions for the x's and for y, which allows us to fit functions that exhibit nonlinearity. Hence, if the residual analysis shows that the linearity assumption is violated, we can try to fit a model which is nonlinear. The most common types of nonlinear models are polynomials and logarithmic models.

For example, in some physical systems there is a theoretical relationship that says y = a·x^b. If we take the natural logarithm of both sides, we obtain

    ln(y) = ln(a) + b·ln(x).

If we substitute y' = ln(y), x' = ln(x), and a' = ln(a), we have

    y' = a' + bx',

which is a linear model. We can then use y' and x' as inputs to the linear regression model, and find estimates for a and b, which usually have useful physical interpretations. Even when the model is linear, we can frequently use the logarithm as a way of fixing problems with heteroscedasticity or normality; for some reason, the transformation fixes the problem in many cases.

If we assume a polynomial relationship, the model may look like

    y = β0 + β1x + β2x² + β3x³ + ε.

We can let x1 = x, x2 = x², and x3 = x³. However, this has an inherent problem. What is it?

We can fix the multicollinearity by using the following substitutions:

    x1 = x − x̄,   x2 = (x − x̄)²,   and   x3 = (x − x̄)³.

Indicator Variables and Curvilinear Regression

In our example with indicator variables, we said that the slopes were assumed to be equal even when gender differed. We can use curvilinear regression, combined with indicator variables, to remove that assumption. The model is

    y = β0 + β1x1 + β2x2 + β3x1x2 + ε.

Model for the men:
Model for the women:

What do we do if we want to test whether the slope of income with respect to months worked is the same for men and women?
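Here is a minimal statsmodels sketch of that interaction model, using the same purely illustrative data frame as in the earlier dummy-variable sketch; the test of equal slopes is just the t test on the interaction coefficient.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Purely illustrative numbers -- not data from these notes.
salary = pd.DataFrame({
    "income": [24, 27, 31, 36, 22, 25, 29, 33],    # thousands of dollars
    "months": [20, 60, 100, 160, 20, 60, 100, 160],
    "male":   [1, 1, 1, 1, 0, 0, 0, 0],
})

# "months * male" expands to months + male + months:male, i.e.
#   y = b0 + b1*months + b2*male + b3*(months * male)
fit = smf.ols("income ~ months * male", data=salary).fit()
print(fit.params)
print(fit.pvalues["months:male"])   # t test of H0: beta3 = 0 (equal slopes)
```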