Author: Brenda Gunderson, Ph.D., 2012

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License: http://creativecommons.org/licenses/by-nc-sa/3.0/

The University of Michigan Open.Michigan initiative has reviewed this material in accordance with U.S. Copyright Law and has tried to maximize your ability to use, share, and adapt it. The attribution key provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to attribute these materials visit: http://open.umich.edu/education/about/terms-of-use.

Some materials are used with permission from the copyright holders. You may need to obtain new permission to use those materials for other uses. This includes all content from: Mind on Statistics, Utts/Heckard, 4th Edition, Cengage L, 2012. Text Only: ISBN 9781285135984. Bundled version: ISBN 9780538733489.

SPSS and its associated programs are trademarks of SPSS Inc. for its proprietary computer software. Other product names mentioned in this resource are used for identification purposes only and may be trademarks of their respective companies.

Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy

Content the copyright holder, author, or law permits you to use, share and adapt:
Creative Commons Attribution-NonCommercial-Share Alike License
Public Domain - Self Dedicated: Works that a copyright holder has dedicated to the public domain.

Make Your Own Assessment
Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright:
Public Domain - Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC §102(b)) *laws in your jurisdiction may differ.
Content Open.Michigan has used under a Fair Use determination:
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act (17 USC § 107) *laws in your jurisdiction may differ.
Our determination DOES NOT mean that all uses of this third-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair. To use this content you should conduct your own independent analysis to determine whether or not your use will be Fair.

Module 9: Simple Linear Regression

Objective: In this module, you will examine relationships between two quantitative variables using a graphical tool called a scatterplot. You will interpret scatterplots in terms of the form, direction, and strength of the relationship, and use them to assess whether a simple linear regression model is appropriate for describing the relationship between the two variables. If it is appropriate, you can then perform a simple linear regression analysis to produce an estimated model that can be used to predict the value of the response y for a given value of the predictor x. You will learn how to perform hypothesis tests and compute confidence intervals in regression, including how to check the assumptions needed for these inference procedures to be valid (see Supplement 8 for further details about these assumptions).

Overview: A regression model describes how the mean of one variable is thought to depend on the value of one or more other variables. If we think a variable x may explain changes in another variable y, we call x an explanatory variable (or predictor variable or independent variable) and we call y the response variable (or dependent variable). To start, we use a scatterplot to display the relationship between two quantitative variables, plotting the explanatory variable x on the x-axis and the response variable y on the y-axis.
In examining the relationship, look for an overall pattern showing the form (linear, curved, clusters), direction (positive or negative), and strength (how close the points are to the underlying form: weak, moderate, strong, etc.) of the relationship. Finally, check for outliers or other deviations from the overall pattern.

One of the many misconceptions about regression arises from the concept of association. Scatterplots can show the association between variables, but you need to remember that correlation does not imply causation. This can be seen in a very simple example: weekly flu medication sales and weekly sweater sales for an area with extreme seasons would exhibit a positive association because both tend to go up in the winter and down in the summer. However, it should be clear that neither of them causes the other.

There are four interpretations of an observed association. (Source: Mind on Statistics, 3rd edition, by J. M. Utts, R. F. Heckard. © 2007, page 176)
1. There is causation: the explanatory variable is indeed causing a change in the response variable.
2. There may be causation, but confounding factors contribute as well and make this causation difficult to prove.
3. There is no causation: the association is explained by how the explanatory and response variables are both affected by other variables.
4. The response variable is causing a change in the explanatory variable.

When a scatterplot suggests that the dependence of y on x can be summarized by a straight line, the least squares regression line of y on x can be calculated. The least squares regression line is the line that minimizes the sum of the squared vertical distances of the data points to the line, hence the name least squares. This fitted line can be used to describe the linear relationship between y and x and to predict y for a given value of x.
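The least squares idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the workbook itself uses SPSS, the data below are made up (they are not from any data set in this module), and the function names are hypothetical. The slope is computed as the sum of cross-products divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means; the intercept and slope are called b0 and b1 here. The last lines check numerically that nudging the fitted line in either direction increases the sum of squared vertical distances.

```python
def least_squares_line(x, y):
    """Return (b0, b1): intercept and slope of the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-products of deviations.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)
    return b0, b1


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical illustration data (an assumption, not from the workbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(x, y)
sse_fitted = sum_squared_residuals(x, y, b0, b1)
sse_steeper = sum_squared_residuals(x, y, b0, b1 + 0.1)   # slope nudged up
sse_shifted = sum_squared_residuals(x, y, b0 + 0.5, b1)   # intercept nudged up

print(f"fitted line: y-hat = {b0:.3f} + {b1:.3f} x")
print(f"SSE fitted = {sse_fitted:.4f}, steeper = {sse_steeper:.4f}, shifted = {sse_shifted:.4f}")
```

Any other line through the same points gives a larger sum of squared residuals than the fitted one, which is what "least squares" means.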
In this module we will focus on simple linear regression, which is based on a linear model relating a single explanatory variable x to the mean response as follows: $E(Y) = \beta_0 + \beta_1 x$. Here $\beta_0$ and $\beta_1$ are parameters: fixed but unknown constants. Specifically, $\beta_0$ is the population y-intercept (where the true regression line crosses the y-axis when x = 0) and $\beta_1$ is the population slope (the change in the mean response for a one-unit increase in x). These two values are unknown, but can be estimated using the least squares criterion. The resulting estimated regression line is generally written as: $\hat{y} = b_0 + b_1 x$. The estimates $b_0$ and $b_1$ are referred to as the least squares estimates of $\beta_0$ and $\beta_1$.

There are also several assumptions that must be checked in order for inferences to be valid. First, the response variable must have a normal distribution with a mean that varies linearly with the predictor variable and a standard deviation ($\sigma$) that does not depend on the predictor variable. We also assume that the error terms are normally distributed with mean 0 and standard deviation $\sigma$.

Formula Card: [formula card image]

Activity 1: Using a Scatterplot to Display the Relationship

Background: The U.S. Census Bureau collects many different kinds of data, including demographics, housing data, and economic indicators. Data are collected at the individual, county, state, and national levels. The data set poverty.sav contains poverty rates, teen birth rates, and violent crime rates for the 50 states and the District of Columbia for the year 2000. The poverty rate variable (PovPct) gives the percentage of the population living in households with income below the poverty level for each state and the District of Columbia (Location). The teen birth rate variable (TeenBrth) contains the birth rate for females 15 to 19 years old, recorded as number of births per 1,000 females in that age group.
The violent crime rate (ViolCrime), the birth rate for females 15 to 17 years old (Brth15to17), and the birth rate for females 18 to 19 years old (Brth18to19) are also recorded. (Source: Mind on Statistics, 3rd edition, by J. M. Utts, R. F. Heckard. © 2007)

Task: Is there a relationship between the poverty rate (PovPct) and the teen birth rate (TeenBrth) by state? Produce a scatterplot to help examine the plausibility of a linear relationship between the poverty rate and the teen birth rate. Note: You are interested in predicting teen birth rate from poverty rate.

Before conducting any test, here is a set of questions to ask yourself:
a. How many populations are there? (One / Two)
How many variables are there? (One / Two / More than two)
What is the response variable?
What type of variable is the response? (Categorical / Quantitative)
What is the explanatory variable?
What type of variable is the explanatory variable? (Categorical / Quantitative)
What type of parameter would be useful for summarizing this response, considering the explanatory variable? (See Supplement 3) (Proportion / Mean / Other)

Use your answers to these questions to guide you to the appropriate inference procedure. You may refer back to Supplement 3: Name that Scenario for assistance.
The appropriate inference procedure for this scenario is: _____________________________________ .

1. Open the data set and produce a scatterplot: Graphs> Legacy Dialogs> Scatter/Dot. Note that your explanatory variable should be on the x-axis (independent), and your response variable should be on the y-axis (dependent). Sketch the general pattern of the plot below. Make sure to label your axes.
2. Based on the scatterplot, does there appear to be a linear relationship between teen birth rate and poverty rate? Explain.
3. Are there any unusual observations or outliers present?
4. Which states have the highest poverty rates? Highest teen birth rates? Do states with low poverty rates tend to have low teen birth rates also?
What does this tell you about the direction of the association between these two variables? (Note: You can find this information quickly by choosing Location for Label Cases by and selecting Display chart with case labels under Options.)
5. Does there appear to be a strong relationship between poverty rates and teen birth rates?

Check Your Understanding: Circle the scatterplot(s) which indicate a linear regression analysis would be appropriate.
[Plot 1] [Plot 2] [Plot 3]

Activity 2: Describing a Linear Relationship with a Regression Line

If we are satisfied that a linear model seems to describe the relationship between the poverty rate and the teen birth rate, we are ready to estimate that model and use it to predict TeenBrth on the basis of PovPct.

Task: Fit a linear model to the data. Refer to Activity 1 for response/explanatory variables. If you have questions about the regression output after the activity, refer to Supplement 8 at the beginning of this workbook for more details.

1. Obtain the linear regression output using Analyze> Regression> Linear. (Make sure to enter the appropriate dependent and independent variables.) Report the estimated regression line (predicting equation): ________________________________________________________
2. How does the estimated regression line differ from the equation for the population regression line?
3. Interpret the estimated slope $b_1$. Clearly explain what the slope says about the change in the teen birth rate.
4. Report the coefficient of correlation, r, between TeenBrth and PovPct: r = ____________
5. Report the coefficient of determination, $r^2$, and interpret it: $r^2$ = ____________ Interpretation:
6. Use the regression line to predict the teen birth rate for New Mexico (with a poverty rate of 25.3%) and for Michigan (with a poverty rate of 12.2%). How do they compare to the observed TeenBrth values for the states of New Mexico and Michigan?
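The quantities requested in the list above (estimated line, r, the coefficient of determination, and a prediction) can also be computed by hand from sums of squares. The following Python sketch shows one way to do that. It is an illustration only: SPSS is the tool this activity uses, the data are hypothetical values chosen for the example (not the contents of poverty.sav), and the function names are assumptions made here.

```python
import math


def regression_summary(x, y):
    """Return the least squares estimates, r, and r-squared for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)          # total sum of squares
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    r = s_xy / math.sqrt(s_xx * s_yy)                  # correlation coefficient
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - sse / s_yy                         # proportion of variation explained
    return {"b0": b0, "b1": b1, "r": r, "r_squared": r_squared}


def predict(summary, x_new):
    """Predicted response y-hat at x_new from the estimated line."""
    return summary["b0"] + summary["b1"] * x_new


# Hypothetical illustration data (an assumption, not from poverty.sav).
x = [8.0, 10.0, 12.0, 15.0, 18.0, 21.0]
y = [30.0, 41.0, 44.0, 52.0, 61.0, 66.0]

fit = regression_summary(x, y)
print(f"y-hat = {fit['b0']:.3f} + {fit['b1']:.3f} x")
print(f"r = {fit['r']:.4f}, r^2 = {fit['r_squared']:.4f}")
print(f"prediction at x = 14: {predict(fit, 14.0):.3f}")
```

In simple linear regression the squared correlation equals the proportion of variation in y explained by the linear relationship with x, so `fit["r"] ** 2` and `fit["r_squared"]` agree up to rounding.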
Check Your Understanding: One of the important notes about interpreting $r^2$ is that it relies on the ____________________ relationship between the two quantitative variables.

Think About It: Would you use this model to predict the teen birth rate for a state that has a poverty rate of 35%? How about 2%? Why or why not?

Activity 3: Is There a Significant (Non-Zero) Linear Relationship Between Teen Birth Rate and Poverty Rate?

In this activity, you will assess whether the explanatory variable, poverty rate, is a useful linear predictor for the teen birth rate. In other words, you will test whether there is a significant, non-zero linear relationship between poverty rate (PovPct) and teen birth rate (TeenBrth).

Remember that another way to make inferences about the significance of the linear relationship is through a confidence interval for the population slope. Further, recall the basic form of a confidence interval: point estimate ± (a few) standard errors. Most standard computer regression output provides the slope estimate and its standard error, and the "few" will correspond to a t* value for the corresponding confidence level with n - 2 degrees of freedom for regression. Since a confidence interval provides a range of reasonable values for the parameter, it can be used to perform two-sided hypothesis tests by seeing whether the hypothesized value falls in the interval or not.

Task: Perform a test to assess if the explanatory variable PovPct is significant in the linear model.

Recall: Write out the Five Steps for conducting a test of hypotheses (reference page 53).
1.
2.
3.
4.
5.

Hypothesis Test: You can now implement the Five Steps for conducting a test of hypotheses.

1. State the Hypotheses: H0: ___________ = ___________ and Ha: ___________ ___________ , where __________ represents:
Remember: Your hypotheses and parameter definition should always be a statement about the population(s) under study.

2. Checking the Assumptions and Computing the Test Statistic
Assumptions: Covered in the next activity.
Test Statistic:
a. Using the SPSS output produced in Activity 2, which two test statistics could you use to test these hypotheses?
b. Give the value for each test statistic listed in part a.
c. Which of the test statistics in part a would not be appropriate for conducting a test with a one-sided alternative hypothesis?

3. Calculate the p-Value: What is the SPSS-reported p-value for each of the two test statistics? __________________________________ Are these the p-values you want? ______________________

4. Decision: What is your decision at a 5% significance level? Reject H0 / Fail to Reject H0
Remember:
Reject H0: Results statistically significant
Fail to Reject H0: Results not statistically significant

5. Conclusion: What is your conclusion in the context of the problem?
Note: Conclusions should always include a reference to the population parameter of interest. Be careful that your conclusion is not too strong; you can say that you have sufficient evidence or something equivalent, but do NOT say that we have proven anything.

6. Confidence Intervals (CI):
a. What is the formula for a confidence interval for the population slope?
b. Generate the appropriate confidence intervals using Analyze > Regression > Linear. Under Statistics, choose Confidence intervals, which has a default level of 95%. Give the 95% confidence interval for the population slope.
c. Provide an interpretation of the 95% confidence interval for the population slope.
d. Provide an interpretation of the 95% confidence level.
e. Based on the confidence interval, would you reject the null hypothesis at a 5% significance level? Circle one: Yes / No. Explain. Did your conclusion here match the one you made in part 4?

Check Your Understanding: Two proposed values for the population slope have been given by researchers in the field. One proposed value is 2 and the other is 3.
Based on your results in the activity, which proposed slope value is reasonable? Why? 2 / 3 because…

Think About It: If you were a parent and did not want your daughter to become a teen mom, should you consider moving to a state that has a low poverty rate (and only for this reason)? Explain.

Activity 4: Is the Linear Model Appropriate? Are the Assumptions Met for Inference?

In this activity, you will produce and examine the residuals from the regression line as well as create some plots to assess the fit of the linear model. This will also serve to evaluate the validity of the tests and confidence intervals computed in Activities 2 and 3.

Regression assumptions may be stated in terms of the response variable or in terms of the error terms. In general, the statistical model for simple linear regression assumes that for each value of x, the observed values of the response are normally distributed with some mean (that may depend on x in a linear way) and a standard deviation $\sigma$ that does not depend on x. That is, for each x, Y is $N(E(Y), \sigma)$, where $E(Y) = \beta_0 + \beta_1 x$. Thinking about the error terms, we can say the true error terms (those that we do not observe) are the differences between the response and the true mean (for a given x). These errors are assumed to have a normal distribution with mean 0 and standard deviation $\sigma$ (that does not depend on x).

Task: Obtain the residuals for the regression of TeenBrth on PovPct, and generate the appropriate plots to check the assumptions of the linear model. For additional information on checking the regression assumptions, refer to Supplement 8.

1. What assumptions have to hold for the inferences in Activity 3 to be valid? (Be sure you state them in the context of the problem.)
2. Write the expression for the residuals in terms of observed and predicted values.
3. Obtain the residuals in SPSS using Analyze> Regression> Linear, but instead of running it right away, click on the Save button.
In the box that will open, select Unstandardized under the Residuals heading, and click Continue. This saves the residuals as a new variable. Now, construct a scatterplot of the residuals against the explanatory variable, PovPct. Sketch the general pattern of the plot.
4. What assumption of the error terms in the linear model do you think the plot in part 3 is useful for assessing? What conclusion can you draw from this plot?
5. Construct a Q-Q plot of the residuals. Sketch the general pattern of the plot. What assumption of the error terms in the linear model do you think this plot is useful for assessing? Based on the plot, what is your conclusion about this assumption?
6. If there were an ordering present in the data, it would be appropriate to construct a time (or sequence) plot of the residuals. Although we have no order here to worry about, what assumption of the error terms in the linear model do you think such a time plot is useful for?

Check Your Understanding: A ______________ plot is used to check the assumption that the standard deviation of the population error terms is (constant / not constant), i.e., (does / does not) change with the value of the explanatory variable.

Activity 5: Constructing Prediction Intervals for an Individual Response and Confidence Intervals for a Mean Response

In this activity, you will be guided through the construction of a prediction interval for an individual response and a confidence interval for a mean response. These intervals may be constructed after a regression analysis to provide insight into possible response values for a given value of the explanatory variable, in terms of the mean or for an individual. The formulas for the two intervals are:

Confidence interval for a mean response: $\hat{y} \pm t^* \, \text{s.e.(fit)}$, where $\text{s.e.(fit)} = s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{XX}}}$

Prediction interval for an individual response: $\hat{y} \pm t^* \, \text{s.e.(pred)}$, where $\text{s.e.(pred)} = \sqrt{s^2 + (\text{s.e.(fit)})^2}$

Task: Compute both a confidence interval for the mean response for states like Michigan and a prediction interval for an individual response for the state of Michigan. Note that the poverty rate is 12.2% in Michigan.

1. What is the predicted teen birth rate for the state of Michigan (or any state with a poverty rate of 12.2%)?
2. Steps to compute the s.e.(fit).
a. The value of s from the output is (circle one): 4.032 / 0.292 / 8.84624 / 78.256
b. Note that the sample mean poverty rate is 13.12%, n is 51, and $S_{XX}$ is 917.8078. Compute the value of s.e.(fit) when the poverty rate of interest (value of x) is 12.2%.
3. For both a confidence interval for a mean response and a prediction interval for an individual response, the t* multiplier is based on n - 2 = ____ df, and for a 95% interval, t* = ______.
4. Compute a confidence interval for the mean response for states like Michigan with a poverty rate of 12.2%. You have already computed the values of s.e.(fit), $\hat{y}$, and t*.
5. Compute a prediction interval for an individual response for the state of Michigan. You will need to compute s.e.(pred) before you can finish the interval computation.
6. Without any computation, for a given value of x and fixed confidence level, which interval will always be wider? Confidence interval for a mean response / Prediction interval for an individual response

Example Exam Question on Regression

A doctor wanted to study the relationship between a male's age and HDL (so-called good) cholesterol. She randomly selects 18 male adult patients and records their x = age and y = HDL. A scatterplot of the data indicated an approximately linear relationship overall, so she performs a linear regression analysis using SPSS.

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .598   .358       .318                6.807

ANOVA
Model 1       Sum of Squares   df   Mean Square   F      Sig.
Regression    413.01           1    413.01        8.91   .009
Residual      741.43           16   46.34
Total         1154.44          17

Coefficients
              Unstandardized Coefficients   Standardized Coefficients
Model 1       B        Std. Error           Beta                        t       Sig.
(Constant)    61.05    6.08                                             10.05   .000
Age           -.38     .13                  -.598                       -2.99   .009

We also have: $\bar{x} = 46.22$ and $S_{XX} = \sum (x - \bar{x})^2 = 2883.131$.

a. What is the correlation between age and HDL cholesterol levels?
final answer: __________________________
b. What is the equation of the least squares regression line for predicting HDL from age?
final answer: _______________________________________________
c. Based on this model, predict the HDL cholesterol for a male who is 30 years old.
final answer: __________________________
d. Compute the residual for the observation (x = 30, y = 53).
final answer: __________________________
e. Use a 1% significance level to assess if there is a significant linear relationship between age and HDL cholesterol for male adults. State the hypotheses to be tested, the observed value of the test statistic, the corresponding p-value, and your decision.
Hypotheses: H0:______________________ Ha:____________________
Test Statistic Value: _____________________ p-value:________________
Decision: (circle) Fail to reject H0 / Reject H0
f. The standard error for the estimated slope is given as 0.13. Interpret this standard error in terms of repetitions of this study.
g. Calculate a 95% confidence interval for the mean HDL cholesterol level for all 30-year-old male adults.
final answer: ________________________
h. Consider the residual plot shown below. Does this plot support the conclusion that the linear regression model is appropriate? Yes / No
Explain:
[Residual plot: Residual (vertical axis, -20 to 20) versus Age (horizontal axis, 20 to 75)]
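
The interval computations in Activity 5 and in part g of the exam question follow the same pattern: a point estimate plus or minus t* times a standard error. The Python sketch below illustrates that pattern using the standard forms s.e.(fit) = s·sqrt(1/n + (x − x̄)²/S_XX) and s.e.(pred) = sqrt(s² + s.e.(fit)²). Everything numeric here is a hypothetical assumption chosen for illustration, not a value from the poverty or HDL outputs, and the function names are made up for this example. The multiplier t* is passed in as a number read from a t table rather than computed.

```python
import math


def se_fit(s, n, x_new, x_bar, s_xx):
    """Standard error of the estimated mean response at x_new."""
    return s * math.sqrt(1 / n + (x_new - x_bar) ** 2 / s_xx)


def se_pred(s, se_fit_value):
    """Standard error for predicting one individual response."""
    return math.sqrt(s ** 2 + se_fit_value ** 2)


def interval(y_hat, t_star, standard_error):
    """Return (lower, upper) for y_hat +/- t_star * standard_error."""
    margin = t_star * standard_error
    return (y_hat - margin, y_hat + margin)


# Hypothetical inputs (assumptions for illustration only).
s = 2.0          # estimate of the error standard deviation
n = 10           # sample size, so df = n - 2 = 8
x_bar = 5.0      # sample mean of x
s_xx = 40.0      # sum of squared deviations of x
x_new = 7.0      # x value of interest
y_hat = 12.0     # predicted response at x_new
t_star = 2.306   # t* for 95% confidence with 8 df, read from a t table

fit_se = se_fit(s, n, x_new, x_bar, s_xx)
pred_se = se_pred(s, fit_se)
ci_mean = interval(y_hat, t_star, fit_se)
pi_individual = interval(y_hat, t_star, pred_se)

print(f"s.e.(fit) = {fit_se:.4f}, s.e.(pred) = {pred_se:.4f}")
print(f"CI for mean response: ({ci_mean[0]:.3f}, {ci_mean[1]:.3f})")
print(f"PI for an individual: ({pi_individual[0]:.3f}, {pi_individual[1]:.3f})")
```

Because s.e.(pred) adds the individual variability s² to the uncertainty in the fitted mean, it is larger than s.e.(fit) whenever s is positive; the two printed intervals show the consequence for this made-up case.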