Understanding Sum of Squares
Six facts about least squares regression:
1. The sum of the residuals is 0:
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right) = \sum_{i=1}^{n}\left[Y_i - (b_0 + b_1 X_i)\right] = 0$$
2. The sum of the squared residuals is minimized:
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n}\left[Y_i - (b_0 + b_1 X_i)\right]^2$$
3. The sum of the observed values equals the sum of the fitted values:
$$\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \hat{Y}_i$$
4. The sum of the residuals multiplied by each predictor value is 0:
$$\sum_{i=1}^{n} X_i e_i = \sum_{i=1}^{n} X_i\left(Y_i - \hat{Y}_i\right) = \sum_{i=1}^{n} X_i\left[Y_i - (b_0 + b_1 X_i)\right] = 0$$
5. The sum of the residuals multiplied by each fitted value is 0:
$$\sum_{i=1}^{n} \hat{Y}_i e_i = \sum_{i=1}^{n} \hat{Y}_i\left(Y_i - \hat{Y}_i\right) = \sum_{i=1}^{n} \hat{Y}_i\left[Y_i - (b_0 + b_1 X_i)\right] = 0$$
6. The regression line always goes through the point $(\bar{X}, \bar{Y})$.
These are found on page 24 of the text. I am not going to prove them; the proofs are not too difficult, just messy.
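As a quick numerical check (not part of the text), a sketch along the following lines, run against the studygpa data set introduced below, confirms facts 1, 3, and 4 up to rounding:

/* Fit GPA on hours of study and save fitted values and residuals */
proc reg data=mydata.studygpa;
   model gpa = study;
   output out=checks p=yhat r=e;   /* p= fitted values, r= residuals */
run;
quit;

/* Form X*e; the sums of e and xe should be essentially 0 (facts 1 and 4),
   and the sums of gpa and yhat should agree (fact 3) */
data checks;
   set checks;
   xe = study*e;
run;

proc means data=checks sum;
   var e xe gpa yhat;
run;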
Partitioning the Sum of Squares (page 63).
One idea of regression is to partition the inherent variation among the Y values into two quantities.
Let us look at our data using X = hours of study to predict Y = GPA (studygpa.sas7bdat). Open the program chap2b.sas.
/* Summary statistics for GPA and hours of study */
proc means data=mydata.studygpa;
var gpa study;
run;
quit;

/* Scatterplot with the fitted regression line (I=R) and a horizontal
   reference line at GPA = 3.24 (the mean GPA from proc means) */
symbol1 value=dot I=R;
proc gplot data=mydata.studygpa;
plot gpa*study / vref=3.24;
label gpa='GPA';
label study='Hours of study per week';
title 'GPA vs. Hours studied';
run;
quit;
We have:
$$Y_i - \bar{Y} = Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$
The Total Sum of Squares (SSTO) is just the total variation in the Y values:
$$SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
The error part we are already familiar with; it is just the sum of the squared residuals:
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
The Regression Sum of Squares (or Sum of Squares due to Regression) is the leftover piece:
$$SSR = SSTO - SSE = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 - \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2.$$
It is also true, because of the six facts we went over earlier, that
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,$$
that is,
$$SSTO = SSR + SSE.$$
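The middle equality holds because the cross-product term drops out, a step the notes skip over:
$$2\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) = 2\sum_{i=1}^{n} \hat{Y}_i e_i - 2\bar{Y}\sum_{i=1}^{n} e_i = 0,$$
using facts 5 and 1 from the list above.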
Analysis of Variance (ANOVA) Table
Degrees of Freedom (df)
The SSTO has $n - 1$ degrees of freedom (we lost a degree when we estimated $\bar{Y}$). The SSR has one degree of freedom because we are using one predictor. The SSE has $n - 2$ degrees of freedom (which I like to think of as because we estimated both the intercept and the slope).
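The degrees of freedom partition the same way the sums of squares do:
$$\underbrace{n - 1}_{SSTO} \;=\; \underbrace{1}_{SSR} \;+\; \underbrace{(n - 2)}_{SSE}.$$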
Mean Squares (MS)
The mean squares are just the Sum of Squares divided by the appropriate degrees of freedom.
We summarize these quantities in the Analysis of Variance Table.
| Source of Variation | Sum of Squares (SS) | Degrees of freedom (df) | Mean Square (MS) | F |
|---|---|---|---|---|
| Regression | $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ | $1$ | $MSR = \dfrac{SSR}{1}$ | $F = \dfrac{MSR}{MSE}$ |
| Error | $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ | $n - 2$ | $MSE = \dfrac{SSE}{n - 2}$ | |
| Total | $SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ | $n - 1$ | | |
The F-statistic in the final column is the t-test statistic squared and the p-value is exactly the
same as the two-sided p-value for the slope.
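A quick way to see this, using the standard facts (not derived in these notes) that $MSR = b_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$ and $s^2\{b_1\} = MSE \big/ \sum_{i=1}^{n} (X_i - \bar{X})^2$:
$$F = \frac{MSR}{MSE} = \frac{b_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2}{MSE} = \left(\frac{b_1}{s\{b_1\}}\right)^2 = (t^*)^2.$$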
For the study – GPA data, the ANOVA table appears in the proc reg output (not reproduced here).
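A minimal call that prints it, as a sketch reusing the mydata libref from earlier:

/* The default proc reg output includes the Analysis of Variance table */
proc reg data=mydata.studygpa;
   model gpa = study;
run;
quit;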
The MSR is estimating the quantity
$$\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2,$$
and the MSE is estimating $\sigma^2$. So when we calculate
$$F = \frac{MSR}{MSE},$$
we are estimating the ratio
$$\frac{\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2}.$$
Notice that if the slope $\beta_1 = 0$, then the ratio is just 1. So the larger the F statistic is, the more likely we are to reject the null hypothesis.
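To make the decision rule precise (a standard result, stated here for completeness): under $H_0\colon \beta_1 = 0$,
$$F^* = \frac{MSR}{MSE} \sim F(1,\; n-2),$$
and we reject $H_0$ at level $\alpha$ when $F^*$ exceeds the critical value $F(1-\alpha;\; 1,\; n-2)$.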
Coefficient of Determination R2
One measure of the goodness of model fit is the coefficient of determination $R^2$. This number always lies between 0 and 1 because it is defined as
$$R^2 = \frac{SSR}{SSTO} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}, \qquad 0 \le R^2 \le 1.$$
The closer the value of $R^2$ is to 1, the better the fit.
Interpretation: The way that we interpret $R^2$ is to say: (the $R^2 \times 100$ value) percent of the variation in Y is explained by variation in X.
So for the study – GPA data, we say that 11.77% of the variation in Y = GPA is explained by variation in X = hours of study.
In some sense the test of the hypothesis $H_0\colon \beta_1 = 0$ answers the question "Is X useful in predicting Y?" The $R^2$ value answers the question "How useful is X in predicting Y?" Now there are no hard and fast rules about how high $R^2$ has to be in order to be deemed a good fit. Higher is better, but in some fields low $R^2$ values are common, whereas in other fields values much closer to 1 are expected.
Correlation coefficient r
The literature is full of references to the correlation coefficient. It is the square root of $R^2$, with the same sign as the slope, so it is a number between -1 and 1 and it indicates direction. You can get it by taking $\sqrt{R^2}$ and attaching the sign of the slope, or you can have SAS compute it directly with proc corr.
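For the study – GPA data this gives $|r| = \sqrt{0.1177} \approx 0.343$, with the sign taken from the slope. As a sketch (reusing the studygpa data set and mydata libref from earlier), a minimal proc corr call that prints r and its p-value is:

proc corr data=mydata.studygpa;
   var gpa study;
run;
quit;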
Notice that the p-value is the same as for the test of slope we did earlier. The equivalent test of hypothesis is
$$H_0\colon \rho = 0 \qquad \text{versus} \qquad H_A\colon \rho \neq 0,$$
where $\rho$ is the correlation between X and Y.
A nice thing about the correlation and the coefficient of determination is that it does not matter which variable is X and which is Y.
You can play the Guess the Correlation game: http://istics.net/stat/correlations/
We can calculate a confidence interval for the correlation coefficient as well:
/* The FISHER option requests confidence limits for the correlation */
proc corr data=mydata.studygpa fisher;
var gpa study;
run;
quit;
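(For reference, a detail the notes do not spell out: the FISHER option builds this interval from Fisher's z transformation,
$$z' = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \qquad z' \;\dot\sim\; N\!\left(\tfrac{1}{2}\ln\frac{1+\rho}{1-\rho},\; \frac{1}{n-3}\right),$$
and then back-transforms the endpoints of the interval for $z'$ to the correlation scale.)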
Interpretation:
Spearman Rank Correlation: This is sometimes used when the data have severe departures from normality. Basically, we rank each variable from low to high (within each variable) and then calculate the correlation between the ranks.
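For example, in the data below the two women who smoke 20 cigarettes per day occupy sorted positions 5 and 6, so each receives the average rank $(5+6)/2 = 5.5$; the same averaging handles the tied birth weights.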
Example: cigweight.sas7bdat A study is conducted to investigate the relationship between
cigarette smoking during pregnancy and the weights of newborn infants. A sample of 15 women
smokers kept accurate records of the number of cigarettes smoked during their pregnancies, and
the weights of their children were recorded at birth.
| Woman | Cigs per day | Rank | Baby weight (lbs.) | Rank |
|---|---|---|---|---|
| 1 | 12 | 1 | 7.7 | 5 |
| 2 | 15 | 2 | 8.1 | 9 |
| 3 | 35 | 13 | 6.9 | 4 |
| 4 | 21 | 7 | 8.2 | 10 |
| 5 | 20 | 5.5 | 8.6 | 13.5 |
| 6 | 17 | 3 | 8.3 | 11.5 |
| 7 | 19 | 4 | 9.4 | 15 |
| 8 | 46 | 15 | 7.8 | 6 |
| 9 | 20 | 5.5 | 8.3 | 11.5 |
| 10 | 25 | 8.5 | 5.2 | 1 |
| 11 | 39 | 14 | 6.4 | 3 |
| 12 | 25 | 8.5 | 7.9 | 7 |
| 13 | 30 | 12 | 8.0 | 8 |
| 14 | 27 | 10 | 6.1 | 2 |
| 15 | 29 | 11 | 8.6 | 13.5 |
The procedure for this nonparametric test:
1. State the hypotheses: $H_0\colon \rho = 0$; $H_A\colon \rho \neq 0$, where $\rho$ is the population rank correlation coefficient.
2. Rank each column of values and break ties in the usual way.
3. Calculate the Pearson correlation coefficient with the ranks.
proc print data=mydata.cigweight;
run;
quit;
proc corr data=mydata.cigweight;
var rankcigs rankweight;
run;
quit;
Or we can have SAS calculate the Spearman rank correlation directly:
proc corr data=mydata.cigweight spearman;
var cigs weight;
run;
quit;
Or we can request the Pearson and Spearman correlations in the same call:
proc corr data=mydata.cigweight pearson spearman;
var cigs weight;
run;
quit;
Misunderstandings of high R2 or high correlation value:
1. Correlation does NOT mean causation. Just because two variables are highly related does not mean one causes the other. It might mean that one causes the other, but typically there is some other variable that is affecting both X and Y. Such a variable is called a confounder. For example, monthly ice cream sales and number of drownings are correlated, but that does not mean ice cream causes drowning; the underlying variable is temperature. Temperature is the confounder.
Other misunderstandings are listed in the text on page 75.
1. A high coefficient of determination indicates that useful predictions can be made. Often
the prediction intervals for Y are so large they are useless.
2. A high coefficient of determination indicates that the estimated regression line is a good fit. There may be curvature in the plot; we could be systematically under- or over-predicting in patterns. This often happens in time series data.
3. A coefficient of determination near zero indicates that X and Y are not related. This may happen when there is a relationship between X and Y, but it is not a linear relationship.
[Figure: scatterplot titled "Non-linear Relationship" showing a curved, non-linear pattern between X and Y]
Practice Homework
Examine the data ch1copier2.sas7bdat (Context explained on page 35).
1. Use proc reg to determine the 99% confidence interval for the slope for predicting minutes from number of calls.
2. Use proc reg to determine the coefficient of determination. Interpret its value in context.
3. Use proc corr to calculate the Pearson correlation coefficient. Give its p-value. Verify
that the p-value is the same as the test for slope with proc reg.
4. Give the 95% confidence interval for the Pearson correlation coefficient.
5. Use proc corr to calculate the Spearman correlation coefficient. Give its p-value. Does it lead to the same conclusion as the Pearson correlation coefficient?
6. Give the 95% confidence interval for the mean number of minutes when the number of
calls is 5.
7. Give the 95% prediction interval for the number of minutes when the number of calls is
5.
8. Explain the difference between these two intervals.
9. Make a graph that shows the scatterplot using number of calls to predict minutes, the
regression line, the 95% confidence interval for the mean minutes for given number of
calls and the 95% prediction interval for the minutes given the number of calls.
10. Examine the ANOVA table below. Fill in the missing values (using a calculator).
Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 2.11505 | | | 0.1961 |
| Error | | 14.80495 | 1.13884 | | |
| Corrected Total | 14 | 16.92000 | | | |
Use the data nc20101000.sas7bdat
1. Use the variable weeks (weeks of gestation) to predict the birth weight in pounds
(tpounds). Give and interpret the 95% confidence interval for the slope.
2. Give the coefficient of determination and interpret its value.
3. Make a graph that shows the scatterplot using weeks to predict tpounds, the regression
line, the 95% confidence interval for the mean pounds for given weeks of gestation and
the 95% prediction interval for the total pounds given the weeks of gestation.
4. Give the 95% prediction interval of birth weight for a mother who carries 35 weeks.
5. Give the Spearman correlation coefficient for weeks and tpounds.