Simple Linear Regression - Solutions
1 Relationship Between Eighth Grade IQ and Ninth Grade Math Score
For a statistics class project, students examined the relationship between x = 8th grade IQ and
y = 9th grade math scores for 20 students. The data are displayed below.
Student   IQ    Math Score   Abstract Reas
1         95    33           28
2         100   31           24
3         100   35           29
4         102   38           30
5         103   41           33
6         105   37           32
7         106   37           34
8         106   39           36
9         106   43           38
10        109   40           39
11        110   41           40
12        110   44           43
13        111   40           41
14        112   45           42
15        112   48           46
16        114   45           44
17        114   31           41
18        115   47           47
19        117   43           42
20        118   48           49
Use Minitab on the dataset Finals found in the Datasets folder in ANGEL. Go to
Stat > Regression > Regression, enter the variable math score in the Response window, and enter
IQ in the Predictors window. Click ‘Storage’ and then check ‘Residuals’ and ‘Fits’. These will be stored in
columns C3 and C4 and named RESI1 and FITS1. Your output should look as follows:
Regression Analysis: Math Score versus IQ

The regression equation is
Math Score = - 21.0 + 0.567 IQ

Predictor    Coef      SE Coef   T       P
Constant     -21.04    16.00     -1.32   0.205
IQ           0.5666    0.1475    3.84    0.001

S = 3.98537   R-Sq = 45.0%   R-Sq(adj) = 42.0%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression       1    234.30   234.30   14.75   0.001
Residual Error   18   285.90   15.88
Total            19   520.20
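a. Explain this equation. Discuss the slope as the change in Y per unit change in X, in the context of
the variables used in this problem.
The slope gives the amount and direction of the predicted change in Y for a one-unit increase in X. So
here, for each 1-point increase in eighth grade IQ, the predicted ninth grade math score increases by
0.567 points. The intercept of - 21.0 has no practical interpretation, because an IQ of 0 is far outside
the observed range of IQs (95 to 118).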
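As a cross-check outside Minitab, the slope and intercept can be recomputed from the data table with the usual least squares formulas. This is a minimal sketch in plain Python; the function name least_squares and the variable names are illustrative choices, not part of the original activity.

```python
# Data transcribed from the worksheet table (20 students).
iq = [95, 100, 100, 102, 103, 105, 106, 106, 106, 109,
      110, 110, 111, 112, 112, 114, 114, 115, 117, 118]
math_score = [33, 31, 35, 38, 41, 37, 37, 39, 43, 40,
              41, 44, 40, 45, 48, 45, 31, 47, 43, 48]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of x and sum of cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = least_squares(iq, math_score)
print(f"Math Score = {b0:.2f} + {b1:.4f} IQ")
# Slope interpretation: predicted difference in math score
# between two students whose IQs differ by 10 points.
print(f"A 10-point IQ difference corresponds to {10 * b1:.2f} math-score points")
```

The printed coefficients should agree with the Coef column of the Minitab output up to rounding.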
b. Create a scatter plot of the measurements by Graph > Scatter Plot > Simple, and select IQ as the
predictor (x-variable) and math score as the response (y-variable). Describe the relationship between
math score and IQ.
[Figure: Scatterplot of Math Score vs IQ. x-axis: IQ (95 to 120); y-axis: Math Score (30 to 50).]
There is a positive relationship between math score (the response variable) and IQ (the explanatory
variable): students with higher IQs tend to have higher math scores.
c. One of the students with a high IQ (number 17) appears to be an outlier. With a sample size of only 20,
this can affect our normality assumption. The constant variance assumption could also be compromised.
We can visually check for constant variance using a residual plot and test for normality using a probability
plot. To get a residual plot, go to Graph > Scatterplot > Simple and enter RESI1 as the y-variable and
FITS1 as the x-variable. Click OK. For the probability plot check of normality, go to Graph > Probability
Plot > Single and enter RESI1 in the Graph variables window. This provides a test of the null hypothesis that
the data follow a normal distribution. Based on these two graphs and what you have learned about
hypothesis testing, what conclusions do you reach regarding the assumptions of constant variance
and normality?
[Figure: Scatterplot of RESI1 vs FITS1. x-axis: FITS1 (32 to 46); y-axis: RESI1 (-15 to 5).]
From the residual plot, we can see the one outlier (lower right) while the remaining
residuals appear to be scattered randomly around 0. This indicates a possible violation of
constant variance. With a small sample size the effect of outliers can be more extreme.
[Figure: Probability Plot of RESI1, Normal - 95% CI. x-axis: RESI1 (-15 to 10); y-axis: Percent (1 to 99).
Legend: Mean = 2.486900E-15, StDev = 3.879, N = 20, AD = 0.723, P-Value = 0.050.]
From the graph legend we have a p-value of 0.050 which equals our usual level of
significance for hypothesis testing. Again, with such a small sample we might be concerned
that the normality assumption is not satisfied. Putting these two graph results together, we
might want to investigate this outlier more thoroughly (e.g. is this a data entry error, a real
score, etc.) [NOTE: the AD test statistic refers to Anderson-Darling which is a test of
normality. There are other such tests, however.]
d. The least squares regression line for predicting math score from IQ is given in the above output. What
is the fitted regression line (i.e. regression equation)?
The regression equation is Math Score = - 21.0 + 0.567 IQ
e. What do the values in the FITS1 and RESI1 columns represent?
The fits are the predicted values of the response (math score) obtained when the observed predictor
(IQ) values are entered into the regression equation.
The residuals (RESI1) are the observed response values, Y, minus the fitted values. For
example, the first student had an observed math score of 33 and a predicted math score of 32.7921,
which gives a residual of 33 - 32.7921 = 0.2079.
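The fits and residuals can also be reproduced by hand from the data table. The sketch below is plain Python with illustrative variable names (fits, residuals, worst), not Minitab's. It refits the line, computes each fitted value and residual, and then finds the student with the largest residual in absolute value.

```python
# Data transcribed from the worksheet table (20 students).
iq = [95, 100, 100, 102, 103, 105, 106, 106, 106, 109,
      110, 110, 111, 112, 112, 114, 114, 115, 117, 118]
math_score = [33, 31, 35, 38, 41, 37, 37, 39, 43, 40,
              41, 44, 40, 45, 48, 45, 31, 47, 43, 48]

n = len(iq)
x_bar = sum(iq) / n
y_bar = sum(math_score) / n
sxx = sum((x - x_bar) ** 2 for x in iq)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(iq, math_score))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Fit = value of the regression equation at each observed IQ.
fits = [intercept + slope * x for x in iq]
# Residual = observed math score minus fitted math score.
residuals = [y - f for y, f in zip(math_score, fits)]

print(f"Student 1: fit = {fits[0]:.4f}, residual = {residuals[0]:.4f}")

# Index of the observation with the largest absolute residual.
worst = max(range(n), key=lambda i: abs(residuals[i]))
print(f"Largest |residual|: student {worst + 1}, residual = {residuals[worst]:.2f}")
```

Least squares residuals always sum to zero (up to floating-point error), which is why the Mean in the probability plot legend is essentially 0.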
f. Based on the output, what is the test of the slope for this regression equation? That is, provide the null
and alternative hypotheses, the test statistic, p-value of the test, and state your decision and conclusion.
Ho: β1 = 0 Ha: β1 ≠ 0 The test statistic is t = 3.84 (with 18 degrees of freedom) and the p-value is
0.001. Since this p-value is less than 0.05, we reject Ho and conclude that eighth grade IQ is a
statistically significant linear predictor of ninth grade math scores.
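The test statistic in the output is the slope divided by its standard error. The sketch below is plain Python with illustrative variable names. It rebuilds S, SE Coef and T from the data table and compares |t| with 2.101, the standard two-sided 5% critical value for a t distribution with 18 degrees of freedom. It does not compute the p-value, which requires the t distribution itself.

```python
import math

# Data transcribed from the worksheet table (20 students).
iq = [95, 100, 100, 102, 103, 105, 106, 106, 106, 109,
      110, 110, 111, 112, 112, 114, 114, 115, 117, 118]
math_score = [33, 31, 35, 38, 41, 37, 37, 39, 43, 40,
              41, 44, 40, 45, 48, 45, 31, 47, 43, 48]

n = len(iq)
x_bar = sum(iq) / n
y_bar = sum(math_score) / n
sxx = sum((x - x_bar) ** 2 for x in iq)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(iq, math_score))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual (error) sum of squares and residual standard deviation S.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(iq, math_score))
df_error = n - 2
s = math.sqrt(sse / df_error)

# Standard error of the slope and the t statistic for Ho: beta1 = 0.
se_slope = s / math.sqrt(sxx)
t_stat = slope / se_slope

# Two-sided 5% critical value for 18 degrees of freedom (from a t table).
T_CRIT_18 = 2.101
print(f"S = {s:.5f}, SE(slope) = {se_slope:.4f}, t = {t_stat:.2f}")
print("Reject Ho at the 5% level:", abs(t_stat) > T_CRIT_18)
```

In simple linear regression the F statistic in the ANOVA table is the square of this t statistic, which is why both tests report the same p-value.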
2 Although outliers should never be deleted without a reason, there are several reasons why it may be
legitimate to conduct an analysis without them. Delete the IQ value for row 17 (click on the cell in row 17
with the IQ of 114, enter * and then click on any other cell; this “enters” the asterisk in that cell) and
re-calculate the regression line for the remainder of the data. You should obtain the following output:
(Student 17 deleted)
Regression Analysis: Math Score versus IQ

The regression equation is
Math Score = - 32.2 + 0.676 IQ

19 cases used, 1 cases contain missing values

Predictor    Coef      SE Coef   T       P
Constant     -32.18    10.51     -3.06   0.007
IQ           0.67601   0.09718   6.96    0.000

S = 2.56190   R-Sq = 74.0%   R-Sq(adj) = 72.5%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression       1    317.58   317.58   48.39   0.000
Residual Error   17   111.58   6.56
Total            18   429.16
a. Use the regression line with Student 17 deleted to estimate the math score for an individual who has
an eighth grade IQ of 114. Do you think this estimate could be achieved by anybody?
The fitted regression equation is Math Score = - 32.2 + 0.676 IQ. Substitute 114 for IQ
in this equation to get a predicted math score of about 44.9:
Math Score = - 32.2 + 0.676 * (114) = 44.86
(With Minitab's unrounded coefficients the prediction is 44.881.)
It is certainly possible given the range of math scores for someone to achieve this score.
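The prediction can be checked by refitting the line without Student 17 and evaluating it at IQ = 114. This is a minimal sketch in plain Python; the names kept, fit_line and predicted are illustrative only. It also confirms that 114 lies inside the range of the remaining IQ values, so the prediction is not an extrapolation.

```python
# Data transcribed from the worksheet table (20 students).
iq = [95, 100, 100, 102, 103, 105, 106, 106, 106, 109,
      110, 110, 111, 112, 112, 114, 114, 115, 117, 118]
math_score = [33, 31, 35, 38, 41, 37, 37, 39, 43, 40,
              41, 44, 40, 45, 48, 45, 31, 47, 43, 48]

# Drop student 17 (students are numbered from 1 in the worksheet).
kept = [(x, y) for student, (x, y) in enumerate(zip(iq, math_score), start=1)
        if student != 17]
xs = [x for x, _ in kept]
ys = [y for _, y in kept]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(xs, ys)
new_iq = 114
predicted = intercept + slope * new_iq

print(f"Math Score = {intercept:.2f} + {slope:.5f} IQ  ({len(xs)} cases used)")
print(f"Predicted math score at IQ {new_iq}: {predicted:.3f}")
# Check that the new IQ lies within the observed IQ range.
print("Within observed IQ range:", min(xs) <= new_iq <= max(xs))
```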
b. What does the value of R² represent (just use the latest output) and how does this compare to the
R² value from the first analysis in this activity? (Explain it using the variables from this data.)
R² is the coefficient of determination. In simple terms, it is the proportion of the
variation in the response (Y) variable that is explained by the predictor (X) variable. For our
example: with the observation deleted, 74.0% of the variability in ninth grade math score is
explained by eighth grade IQ, compared to 45.0% when the outlier is included.
c. What is the correlation between Math Score and IQ for both the data sets, including and
excluding the outlier?
In simple linear regression, the correlation is equal to the square root of R² and takes the sign of the
slope (so it can take any value in the range - 1 ≤ r ≤ 1). The correlation is commonly
represented as a decimal value. Thus, the correlation between ninth grade math score and
eighth grade IQ is equal to the square root of the coefficient of determination (R²).
Outlier deleted: r = √0.740 = 0.860, which is positive since the slope of the
regression equation is positive.
Outlier included: r = √0.450 = 0.671, again positive.
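The relationship between r and R² can be verified directly from the data. The sketch below is plain Python; pearson_r is an illustrative helper, not a Minitab command. It computes the Pearson correlation with and without Student 17 and squares each value for comparison with the R-Sq figures in the two outputs.

```python
import math

# Data transcribed from the worksheet table (20 students).
iq = [95, 100, 100, 102, 103, 105, 106, 106, 106, 109,
      110, 110, 111, 112, 112, 114, 114, 115, 117, 118]
math_score = [33, 31, 35, 38, 41, 37, 37, 39, 43, 40,
              41, 44, 40, 45, 48, 45, 31, 47, 43, 48]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# All 20 students.
r_all = pearson_r(iq, math_score)

# Student 17 removed (list index 16).
iq_wo = iq[:16] + iq[17:]
math_wo = math_score[:16] + math_score[17:]
r_wo = pearson_r(iq_wo, math_wo)

print(f"Outlier included: r = {r_all:.3f}, r squared = {r_all ** 2:.3f}")
print(f"Outlier deleted:  r = {r_wo:.3f}, r squared = {r_wo ** 2:.3f}")
```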
d. Use Minitab to find the correlation between Math Score and IQ (you can pick whether to
include the outlier or not) by going to Stat > Basic Statistics > Correlation and entering both variables into
the Variables box. Does this correlation value agree with the value you found in part c?
Yes, the values are the same.
e. How does the fit of the regression line to the original data (i.e. with the outlier) compare (visually and
statistically) to the fit of the regression line to the data with the outlier removed? To do this, use the
current data with the outlier removed and go to Stat > Regression > Fitted Line Plot. Select IQ as the
Predictor (x-variable) and math score as the Response (y-variable). Once the graph is created, you can
double-click on the title, which will open an “Edit title” box. Type in the box under Text: Outlier Deleted.
Now add the IQ of 114 back into the data (i.e. replace the * with 114) and repeat these steps, labeling the
graph Outlier Included. Now compare the fit of the regression line between the two sets of data. Pay
particular attention to the differences in R², the slope, and how the line fits each set of data. You may
want to repeat the residual plot and probability plot!




NOTE:
- R² increases from 45.0% to 74.0%.
- The regression equation changes: the slope becomes more positive (0.567 to 0.676).
- The scatter plot looks tighter around the regression line because the outlier is no longer there.
- The residual and probability plots are more consistent with the assumptions.
Special Note: Just because the removal of an outlier or outliers improves our results, this
does not give the researcher carte blanche to simply remove data until the results are what
one wants. You should always report any manipulations of the data such
as these, possibly providing two reports so the reader can decide: one with the outlier(s) and
another without.
[Figure: Fitted Line Plot, Outlier Removed. Math Score = - 32.18 + 0.6760 IQ; S = 2.56190,
R-Sq = 74.0%, R-Sq(adj) = 72.5%. x-axis: IQ (95 to 120); y-axis: Math Score (30 to 50).]
[Figure: Fitted Line Plot, Outlier Included. Math Score = - 21.04 + 0.5666 IQ; S = 3.98537,
R-Sq = 45.0%, R-Sq(adj) = 42.0%. x-axis: IQ (95 to 120); y-axis: Math Score (30 to 50).]
[Figure: Scatterplot of RESI2 vs FITS2. x-axis: FITS2 (30 to 48); y-axis: RESI2 (-5.0 to 5.0).]
[Figure: Probability Plot of RESI2, Normal - 95% CI. x-axis: RESI2 (-10 to 10); y-axis: Percent (1 to 99).
Legend: Mean = -8.97528E-15, StDev = 2.490, N = 19, AD = 0.144, P-Value = 0.962.]
f. Facts about correlation. Answer the following questions about correlation (r).
a) What is the strongest the correlation can ever be? 1.0 in absolute value (r = 1.0 or r = - 1.0)
b) If there is no linear relationship, r is equal to 0.
c) The correlation coefficient ranges from – 1.0 to 1.0
d) If the points fall in an almost perfect, negative linear pattern, r is close to: - 1.0
e) If the points fall in an almost perfect, positive linear pattern, r is close to: 1.0