20-f12-bgunderson-wb-module9 - Open.Michigan

Author: Brenda Gunderson, Ph.D., 2012
License: Unless otherwise noted, this material is made available under the terms of the
Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License:
http://creativecommons.org/licenses/by-nc-sa/3.0/
The University of Michigan Open.Michigan initiative has reviewed this material in accordance
with U.S. Copyright Law and has tried to maximize your ability to use, share, and adapt it.
The attribution key provides information about how you may share and adapt this material.
Copyright holders of content included in this material should contact
open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of
content.
For more information about how to attribute these materials visit:
http://open.umich.edu/education/about/terms-of-use. Some materials are used with permission
from the copyright holders. You may need to obtain new permission to use those materials for
other uses. This includes all content from:
Mind on Statistics
Utts/Heckard, 4th Edition, Cengage L, 2012
Text Only: ISBN 9781285135984
Bundled version: ISBN 9780538733489
SPSS and its associated programs are trademarks of SPSS Inc. for its proprietary
computer software. Other product names mentioned in this resource are used for identification
purposes only and may be trademarks of their respective companies.
Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy
Content the copyright holder, author, or law permits you to use, share and adapt:
Creative Commons Attribution-NonCommercial-Share Alike License
Public Domain – Self Dedicated: Works that a copyright holder has
dedicated to the public domain.
Make Your Own Assessment
Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for
copyright.
Public Domain – Ineligible: Works that are ineligible for copyright
protection in the U.S. (17 USC §102(b)) *laws in your jurisdiction may
differ.
Content Open.Michigan has used under a Fair Use determination
Fair Use: Use of works that is determined to be Fair consistent with the
U.S. Copyright Act (17 USC § 107) *laws in your jurisdiction may differ.
Our determination DOES NOT mean that all uses of this third-party content are Fair Uses and
we DO NOT guarantee that your use of the content is Fair. To use this content you should
conduct your own independent analysis to determine whether or not your use will be Fair.
Module 9: Simple Linear Regression
Objective: In this module, you will examine relationships between two quantitative
variables using a graphical tool called a scatterplot. You will interpret scatterplots
in terms of form, direction, and strength of the relationship, and use them to assess
the appropriateness of using a simple linear regression model to describe the
relationship between the two variables. If appropriate, you can then perform a
simple linear regression analysis to produce an estimated model that can be used
to predict the value of the response y for a given value of the predictor x. You will
learn how to perform hypothesis tests and compute confidence intervals in
regression, including how to check the assumptions needed for these
inference procedures to be valid (see Supplement 8 for further details about these
assumptions).
Overview: A regression model describes how the mean of one variable is thought
to depend on the value of one or more other variables. If we think a variable x
may explain changes in another variable y, we call x an explanatory variable (or
predictor variable or independent variable) and the variable y is called the
response variable (or dependent variable).
To start, we use a scatterplot to display the relationship between two quantitative
variables, plotting the explanatory variable x on the x-axis and the response
variable y on the y-axis. In examining the relationship, look for an overall pattern
showing the form (linear, curved, clusters), direction (positive or negative), and
strength (how close the points are to the underlying form – weak, moderate,
strong, etc.) of the relationship. Finally, check for outliers or other deviations from
the overall pattern.
One of the many misconceptions about regression arises from the concept of
association. Scatterplots can show the association between variables, but you
need to remember that correlation does not imply causation. This can be seen in a
very simple example: weekly flu medication sales and weekly sweater sales for an
area with extreme seasons would exhibit a positive association because both tend
to go up in the winter and down in the summer. However, it should be clear that
neither of them causes the other.
There are four interpretations of an observed association. (Source: Mind on
Statistics, 3rd edition, by J. M. Utts, R. F. Heckard. © 2007, page 176)
1. There is causation – the explanatory variable is indeed causing a
change in the response variable.
2. There may be causation, but confounding factors contribute as well
and make this causation difficult to prove.
3. There is no causation – the association is explained by how the
explanatory and response variables are both affected by other
variables.
4. The response variable is causing a change in the explanatory variable.
When a scatterplot suggests that the dependence of y on x can be summarized by
a straight line, the least squares regression line of y on x can be calculated. The
least squares regression line is the line that minimizes the sum of the squared
vertical distances of the data points to the line – hence the name least squares.
This fitted line can be used to describe the linear relationship between y and x and
to predict y for a given value of x.
In this module we will focus on simple linear regression, which is based on a linear
model relating a single explanatory variable x to the mean response y as follows:
$E(Y) = \beta_0 + \beta_1 (x)$. Here $\beta_0$ and $\beta_1$ are parameters – fixed but unknown constants.
Specifically, $\beta_0$ is the population y-intercept (where the true regression line crosses
the y-axis when x = 0) and $\beta_1$ is the population slope (the change in the mean
response y for a one-unit increase in x). These two values are unknown, but can be
estimated using the least squares criterion. The resulting estimated regression line
is generally written as: $\hat{y} = b_0 + b_1 (x)$. The estimates $b_0$ and $b_1$ are referred to as
the least squares estimates of $\beta_0$ and $\beta_1$.
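As an illustration of the least squares criterion outside SPSS, the sketch below computes the intercept and slope estimates from the usual deviation sums. The function name `least_squares_fit` and the small data set are hypothetical (the data are not from poverty.sav); they were chosen so the fitted line is easy to verify by hand.

```python
from statistics import mean

def least_squares_fit(x, y):
    """Return (b0, b1), the least squares intercept and slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sxy: sum of cross-products of deviations; Sxx: sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx             # estimated slope
    b0 = y_bar - b1 * x_bar    # estimated intercept
    return b0, b1

# Hypothetical data lying exactly on the line y = 1 + 2x
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [3.0, 5.0, 7.0, 9.0, 11.0]
b0_demo, b1_demo = least_squares_fit(x_demo, y_demo)
print(f"y-hat = {b0_demo:.2f} + {b1_demo:.2f}(x)")
```

Because the illustrative points fall exactly on a line, the estimates recover that line; with real data the points scatter around the fitted line instead.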
There are also several assumptions that must be checked in order for inferences to
be valid. First, for each value of the predictor variable, the response variable must
have a normal distribution with a mean that varies linearly with the predictor variable
and a standard deviation (σ) that does not depend on the predictor variable. Equivalently,
we assume that the error terms are normally distributed with mean 0 and standard deviation σ.
Formula Card:
Activity 1: Using a Scatterplot to Display the Relationship
Background: The U.S. Census Bureau collects many different kinds of data,
including demographics, housing data, and economic indicators. Data are
collected at the individual, county, state, and national levels. The data set
poverty.sav contains poverty rates, teen birth rates, and violent crime rates for
the 50 states and the District of Columbia for the year 2000. The poverty rate
variable (PovPct) gives the percentage of the population living in households with
income below the poverty level for each state and the District of Columbia
(Location). The teen birth rate variable (TeenBrth) contains the birth rate for
females 15 to 19 years old, recorded as number of births per 1,000 females in that
age group. The violent crime rate (ViolCrime), the birth rate for females 15 to 17
years old (Brth15to17), and the birth rate for females 18 to 19 years old
(Brth18to19) are also recorded. (Source: Mind on Statistics, 3rd edition, by J. M.
Utts, R. F. Heckard. © 2007)
Task: Is there a relationship between the poverty rate (PovPct) and the teen birth
rate (TeenBrth) by state? Produce a scatterplot to help examine the plausibility of
a linear relationship between the poverty rate and the teen birth rate. Note: You
are interested in predicting teen birth rate from poverty rate.
Before conducting any test, here is a set of questions to ask yourself:
• How many populations are there?   One   Two
• How many variables are there?   One   Two   More than two
• What is the response variable?
• What type of variable is the response?   Categorical   Quantitative
• What is the explanatory variable?
• What type of variable is the explanatory variable?   Categorical   Quantitative
• What type of parameter would be useful for summarizing this response,
considering the explanatory variable? (See Supplement 3)   Proportion   Mean   Other
Use your answers to these questions to guide you to the appropriate inference
procedure. You may refer back to Supplement 3: Name that Scenario for
assistance.
The appropriate inference procedure for this scenario is:
_____________________________________ .
1. Open the data set and produce a scatterplot: Graphs> Legacy Dialogs>
Scatter/Dot. Note that your explanatory variable should be on the x-axis
(independent), and your response variable should be on the y-axis
(dependent). Sketch the general pattern of the plot below. Make sure to label
your axes.
2. Based on the scatterplot, does there appear to be a linear relationship
between teen birth rate and poverty rate? Explain.
3. Are there any unusual observations or outliers present?
4. Which states have the highest poverty rates? Highest teen birth rates? Do
states with low poverty rates tend to have low teen birth rates also? What
does this tell you about the direction of the association between these two
variables? (Note: You can find this information quickly by choosing Location
for Label Cases by and selecting Display chart with case labels under Options.)
5. Does there appear to be a strong relationship between poverty rates and teen
birth rates?
Check Your Understanding:
Circle the scatterplot(s) which indicate a linear regression analysis would be
appropriate.
Plot 1
Plot 2
Plot 3
Activity 2: Describing a Linear Relationship with a
Regression Line
If we are satisfied that a linear model seems to describe the relationship between
the poverty rate and the teen birth rate, we are ready to estimate that model and
use it to predict TeenBrth on the basis of PovPct.
Task: Fit a linear model to the data. Refer to Activity 1 for response/explanatory
variables. If you have questions about the regression output after the activity, refer
to Supplement 8 at the beginning of this workbook for more details.
1. Obtain the linear regression output using Analyze> Regression> Linear. (Make
sure to enter the appropriate dependent and independent variables.) Report
the estimated regression line (predicting equation):
________________________________________________________
2. How does the estimated regression line differ from the equation for the
population regression line?
3. Interpret the estimated slope b1. Clearly explain what the slope says about the
change in the teen birth rate.
4. Report the coefficient of correlation, r, between TeenBrth and PovPct:
r = ____________
5. Report the coefficient of determination, $r^2$, and interpret it: $r^2$ = ____________
Interpretation:
6. Use the regression line to predict the teen birth rate for New Mexico (with a
poverty rate of 25.3%) and for Michigan (with a poverty rate of 12.2%). How
do they compare to the observed TeenBrth values for the states of New
Mexico and Michigan?
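For readers who want to see how r and r² arise from the data rather than only reading them off the SPSS Model Summary, here is a minimal sketch. The function name `correlation` and the five data points are hypothetical and are not taken from poverty.sav.

```python
from math import sqrt
from statistics import mean

def correlation(x, y):
    """Return the sample correlation coefficient r between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, for illustration only
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.0, 4.0, 5.0, 4.0, 5.0]

r_demo = correlation(x_demo, y_demo)
r_squared_demo = r_demo ** 2   # proportion of variation in y explained by the line
print(f"r = {r_demo:.3f}, r^2 = {r_squared_demo:.3f}")
```

In simple linear regression the sign of r matches the sign of the estimated slope, so r can be recovered from the R Square value in the output only once the direction of the relationship is known.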
Check Your Understanding:
One of the important notes about interpreting $r^2$ is that it relies on
the ____________________ relationship between the two quantitative variables.
Think About It:
Would you use this model to predict the teen birth rate for a state that has a
poverty rate of 35%? How about 2%? Why or why not?
Activity 3: Is There a Significant (Non-Zero) Linear
Relationship Between Teen Birth Rate and
Poverty Rate?
In this activity, you will assess if the explanatory variable, poverty rate, is a useful
linear predictor for the teen birth rate. In other words, you will test to see if there
is a significant, non-zero linear relationship between poverty rate (PovPct) and
teen birth rate (TeenBrth). Remember that another way to make inferences about
the significance of the linear relationship is through a confidence interval for the
population slope. Further, recall the basic form of a confidence interval: point
estimate ± (a few) standard errors. Most standard computer regression output
provides the slope estimate and its standard error, and the “few” will correspond
to a t* value for the corresponding confidence level with degrees of freedom for
regression of n – 2. Since a confidence interval provides a range of reasonable
values for the parameter, it can be used to perform two-sided hypothesis tests by
seeing whether the hypothesized value falls in the interval or not.
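The "point estimate ± (a few) standard errors" recipe and its use for a two-sided test can be sketched as follows. The function names and all numbers are hypothetical; in particular, the t* value is assumed to have been looked up in a t table for n – 2 degrees of freedom rather than computed in code.

```python
def slope_confidence_interval(b1, se_b1, t_star):
    """Return (lower, upper) for the interval b1 +/- t* x s.e.(b1)."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

def interval_contains(interval, value):
    """True if the hypothesized value falls inside the interval."""
    lower, upper = interval
    return lower <= value <= upper

# Hypothetical regression output: slope estimate and its standard error
b1_demo = 0.80
se_b1_demo = 0.20
n_demo = 20            # so df = n - 2 = 18
t_star_demo = 2.101    # table value for 95% confidence with 18 df

ci_demo = slope_confidence_interval(b1_demo, se_b1_demo, t_star_demo)

# Two-sided test of H0: slope = 0 at the 5% level:
# reject H0 when 0 is NOT inside the 95% confidence interval.
reject_h0_demo = not interval_contains(ci_demo, 0.0)
print(ci_demo, reject_h0_demo)
```

The confidence level of the interval and the significance level of the test must add to 100% for this equivalence to hold (95% with 5% here).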
Task: Perform a test to assess if the explanatory variable PovPct is significant in
the linear model.
Recall: Write out the Five Steps for conducting a test of hypotheses (reference
page 53).
1.
2.
3.
4.
5.
Hypothesis Test: You can now implement the Five Steps for conducting a test of
hypotheses.
1. State the Hypotheses: H0: ___________ = ___________ and
Ha: ___________
___________ , where __________ represents:
Remember: Your hypotheses and parameter definition should always be a
statement about the population(s) under study.
2. Checking the Assumptions and Computing the Test Statistic
Assumptions: Covered in the next activity.
Test-Statistic:
a. Using the SPSS output produced in Activity 2, which two test statistics
could you use to test these hypotheses?
b. Give the value for each test statistic listed in part a.
c. Which of the test statistics in part a would not be appropriate for
conducting a one-sided version of the alternative hypothesis?
3. Calculate the p-Value:
What is the SPSS reported p-value for both test statistics?
__________________________________
Are these the p-values you want? ______________________
4. Decision:
What is your decision at a 5% significance level?   Reject H0   Fail to Reject H0
Remember: Reject H0 → Results statistically significant
Fail to Reject H0 → Results not statistically significant
5. Conclusion:
What is your conclusion in the context of the problem?
Note: Conclusions should always include a reference to the population
parameter of interest. Be careful that your conclusion is not too strong;
you can say that you have sufficient evidence or something equivalent,
but do NOT say that we have proven anything.
6. Confidence Intervals (CI):
a. What is the formula for a confidence interval for the population slope?
b. Generate the appropriate confidence intervals using Analyze > Regression
> Linear. Under Statistics, choose Confidence intervals, which has a
default level of 95%. Give the 95% confidence interval for the population
slope.
c. Provide an interpretation of the 95% confidence interval for the
population slope.
d. Provide an interpretation of the 95% confidence level.
e. Based on the confidence interval, would you reject the null hypothesis at a
5% significance level? Circle one: Yes No
Explain.
Did your conclusion here match the one you made in part 4?
Check Your Understanding:
Two proposed values for the population slope have been given by researchers in
the field. One proposed value is 2 and the other is 3. Based on your results in the
activity, which proposed slope value is reasonable? Why?
2
3
because…
Think About It:
If you were a parent and did not want your daughter to become a teen mom,
should you consider moving to a state that has a low poverty rate (and only for this
reason)? Explain.
Activity 4: Is the Linear Model Appropriate?
Are the Assumptions Met for Inference?
In this activity, you will produce and examine the residuals from the regression line
as well as create some plots to assess the fit of the linear model. This will also
serve to evaluate the validity of the testing and confidence intervals performed in
Activities 2 and 3. Regression assumptions may be stated in terms of the response
variable or in terms of the error terms. In general, the statistical model for simple
linear regression assumes that for each value of x, the observed values of the
response are normally distributed with some mean (that may depend on x in a
linear way) and a standard deviation σ that does not depend on x. That is, for each
x, Y is N(E(Y), σ), where $E(Y) = \beta_0 + \beta_1 x$. Thinking about the error terms, we can say
the true error terms (those that we do not observe) are the difference of the
response and the true mean (for a given x). These errors are assumed to have a normal
distribution with mean of 0 and standard deviation of σ (that doesn’t depend on x).
Task: Obtain the residuals for the regression of TeenBrth on PovPct, and generate
the appropriate plots to check the assumptions of the linear model. For additional
information on checking the regression assumptions, refer to Supplement 8.
1. What assumptions have to hold for the inferences in Activity 3 to be valid? (Be
sure you state them in the context of the problem.)
2. Write the expression for the residuals in terms of observed and predicted
values.
3. Obtain the residuals in SPSS using Analyze> Regression> Linear, but instead of
running it right away, click on the Save button. In the box that will open, select
Unstandardized under the Residuals heading, and click Continue. This saves
the residuals as a new variable. Now, construct a scatterplot of the residuals
against the explanatory variable, PovPct. Sketch the general pattern of the
plot.
4. What assumption of the error terms in the linear model do you think the plot
in part 3 is useful for assessing?
What conclusion can you draw from this plot?
5. Construct a Q-Q plot of the residuals. Sketch the general pattern of the plot.
What assumption of the error terms in the linear model do you think this plot
is useful for assessing?
Based on the plot what is your conclusion about this assumption?
6. If there were an ordering present in the data, it would be appropriate to
construct a time (or sequence) plot of the residuals. Although we have no
order here to worry about, what assumption of the error terms in the linear
model do you think such a time plot is useful for?
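The residuals that SPSS saves can also be computed directly, which may help make clear what the plots in this activity display. This is a minimal sketch: the data are hypothetical (not from poverty.sav), and the names `residuals`, `b0_demo`, and `b1_demo` are illustrative.

```python
from math import sqrt

def residuals(x, y, b0, b1):
    """Return observed y minus predicted y-hat = b0 + b1*x for each observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical data, for illustration only
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.0, 4.0, 5.0, 4.0, 5.0]
# Least squares estimates for these five points (intercept 2.2, slope 0.6)
b0_demo, b1_demo = 2.2, 0.6

resid_demo = residuals(x_demo, y_demo, b0_demo, b1_demo)

# Residuals from a least squares line with an intercept sum to zero
# (up to floating-point rounding).
print(sum(resid_demo))

# s estimates sigma: square root of (sum of squared residuals) / (n - 2)
sse_demo = sum(e ** 2 for e in resid_demo)
s_demo = sqrt(sse_demo / (len(x_demo) - 2))
print(s_demo)
```

Plotting `resid_demo` against `x_demo` gives the residual plot described in part 3; the same list is what a Q-Q plot of the residuals would be built from.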
Check Your Understanding:
A ______________ plot is used to check the assumption that the standard
deviation of the population error terms is (circle one: constant / not constant),
i.e., (circle one: does / does not) change with the value of the
explanatory variable.
Activity 5: Constructing Prediction Intervals
for an Individual Response and
Confidence Intervals for a Mean Response
In this activity, you will be guided through the construction of a prediction interval
for an individual response and a confidence interval for a mean response. These
intervals may be constructed after a regression analysis to provide insight into
possible response values for a given value of the explanatory variable in terms of
the mean or for an individual. The formulas for the two intervals are:
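As a reference sketch, assuming the standard simple linear regression forms and the notation used in the steps of this activity (s, n, SXX, s.e.(fit), s.e.(pred), and t* with n – 2 degrees of freedom), the two intervals can be written as:

```latex
% Confidence interval for the mean response at a given x
\hat{y} \pm t^{*}\,\text{s.e.(fit)},
\qquad
\text{s.e.(fit)} = s\,\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^{2}}{S_{XX}}}

% Prediction interval for an individual response at a given x
\hat{y} \pm t^{*}\,\text{s.e.(pred)},
\qquad
\text{s.e.(pred)} = \sqrt{s^{2} + \bigl(\text{s.e.(fit)}\bigr)^{2}}

% where S_{XX} = \sum (x_i - \bar{x})^{2} and \hat{y} = b_0 + b_1 x
```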
Task: Compute both a confidence interval for the mean response for states like
Michigan and a prediction interval for an individual response for the state of
Michigan. Note that the poverty rate is 12.2% in Michigan.
1. What is the predicted teen birth rate for the state of Michigan (or any
state with a poverty rate of 12.2%)?
2. Steps to compute the s.e.(fit).
a. The value of s from the output is (circle one):   4.032   0.292   8.84624   78.256
b. Note that the sample mean poverty rate is 13.12%, n is 51, and SXX is
917.8078. Compute the value of s.e.(fit) when the poverty rate of interest
(value of x) is 12.2%.
3. For both a confidence interval for a mean response and a prediction interval
for an individual response, the t* multiplier is based on n - 2 = ____ df, and
for a 95% interval, t* = ______.
4. Compute a confidence interval for the mean response for states like
Michigan with a poverty rate of 12.2%. You have already computed the
values of s.e.(fit), ŷ , and t*.
5. Compute a prediction interval for an individual response for the state of
Michigan. You will need to compute s.e.(pred) before you can finish the
interval computation.
6. Without any computation, for a given value of x and fixed confidence level,
which interval will always be wider?
Confidence interval for a mean response
Prediction interval for an individual response
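The sequence of computations in this activity (s.e.(fit), then the interval for the mean response, then s.e.(pred) and the prediction interval) can be sketched in a few lines. Everything numeric below is hypothetical and deliberately unrelated to the poverty data; the function names are illustrative, and t* is assumed to come from a t table for n – 2 degrees of freedom.

```python
from math import sqrt

def standard_error_fit(s, n, x, x_bar, sxx):
    """s.e.(fit) = s * sqrt(1/n + (x - x_bar)^2 / Sxx)."""
    return s * sqrt(1.0 / n + (x - x_bar) ** 2 / sxx)

def standard_error_pred(s, se_fit_value):
    """s.e.(pred) = sqrt(s^2 + s.e.(fit)^2)."""
    return sqrt(s ** 2 + se_fit_value ** 2)

def interval(y_hat, t_star, se):
    """Return (lower, upper) for y_hat +/- t* x se."""
    margin = t_star * se
    return y_hat - margin, y_hat + margin

# Hypothetical summary values
s_demo, n_demo = 2.0, 25
x_demo, x_bar_demo, sxx_demo = 12.0, 10.0, 100.0
y_hat_demo = 30.0          # predicted response at x = 12
t_star_demo = 2.069        # table value for 95% confidence with 23 df

fit_se_demo = standard_error_fit(s_demo, n_demo, x_demo, x_bar_demo, sxx_demo)
pred_se_demo = standard_error_pred(s_demo, fit_se_demo)

ci_mean_demo = interval(y_hat_demo, t_star_demo, fit_se_demo)    # mean response
pi_indiv_demo = interval(y_hat_demo, t_star_demo, pred_se_demo)  # individual response
print(ci_mean_demo, pi_indiv_demo)
```

Both intervals are centered at the same predicted value and use the same t*; they differ only in which standard error is used.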
Example Exam Question on Regression
A doctor wanted to study the relationship between a male’s age and HDL (so-called good)
cholesterol. She randomly selected 18 adult male patients and recorded
their x = age and y = HDL. A scatterplot of the data indicated an approximately
linear relationship overall, so she performed a linear regression analysis using SPSS.
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .598   .358       .318                6.807

ANOVA
Model 1      Sum of Squares   df   Mean Square   F      Sig.
Regression   413.01           1    413.01        8.91   .009
Residual     741.43           16   46.34
Total        1154.44          17

Coefficients
             Unstandardized Coefficients   Standardized Coefficients
Model 1      B        Std. Error           Beta                        t       Sig.
(Constant)   61.05    6.08                                             10.05   .000
Age          -.38     .13                  -.598                       -2.99   .009

We also have: $\bar{x} = 46.22$ and $S_{xx} = \sum (x - \bar{x})^2 = 2883.131$.
a. What is the correlation between age and HDL cholesterol levels?
final answer: __________________________
b. What is the equation of the least squares regression line for predicting HDL from
age?
final answer: _______________________________________________
c. Based on this model, predict the HDL cholesterol for a male who is 30 years
old.
final answer: __________________________
d. Compute the residual for the observation (x = 30, y = 53).
final answer: __________________________
e. Use a 1% significance level to assess if there is a significant linear relationship
between the age and HDL cholesterol for male adults. State the hypotheses to
be tested, the observed value of the test statistic, the corresponding p-value,
and your decision.
Hypotheses: H0:______________________
Ha:____________________
Test Statistic Value: _____________________ p-value:________________
Decision: (circle)   Fail to reject H0   Reject H0
f. The standard error for the estimated slope is given as 0.13. Interpret this
standard error in terms of repetitions of this study.
g. Calculate a 95% confidence interval for the mean HDL cholesterol level for all
30-year-old male adults.
final answer: ________________________
h. Consider the residual plot shown below. Does this plot support the conclusion
that the linear regression model is appropriate?
Yes   No
Explain:
[Figure: residual plot, Residual (vertical axis, -20 to 20) versus Age (horizontal axis, 20 to 75)]