Correlation and Regression

Ch 4
Why Regression and Correlation
 We need to be able to analyze the relationship between
two variables (up to now we have only looked at single
variables)
 We can use one variable to predict the other, but
regression and correlation do NOT imply causation.
 Example: in the late 1940s, analysts found a strong
correlation between the amount of ice cream consumed and
the number of new polio cases. Did ice cream cause polio?
No. Both peak in the summer months, which produces a
strong correlation.
Scatter Plots (pg 158)
[Figure: scatter plot, "Cost of Lots in Glen Ellyn, IL". x-axis: Square Footage (in 100's), 0 to 250. y-axis: Sale Price (in $1000's), 0 to 600.]
Possible Relationships
 Pg 159
 Positive Linear
 Negative Linear
 No apparent relationship
 Nonlinear relationship
Correlation Coefficient (r)
 How we quantify the strength and direction of the linear
relationship.
 Always between -1 and 1, inclusive
 1 or -1 is a perfect fit (data points lie on a perfectly
straight line)
 0 means no linear correlation at all
Correlation Coefficient
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$
where $s_x$ is the sample standard deviation of the x-values, and $s_y$ is the sample standard deviation of
the y-values.
It takes too long to calculate by hand. Use Excel ®:
=correl(array1, array2)
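For anyone who wants to see what is behind =correl, here is a minimal Python sketch of the formula for r. The five data points are made up for illustration (they are NOT the lot/sale price data), and the function name correlation is just a label chosen here.

```python
from statistics import mean, stdev

# Hypothetical data for illustration only (not the lot/sale price data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def correlation(xs, ys):
    """Sample correlation coefficient r, following the formula for r."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator: sum of products of the deviations from each mean.
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    # Denominator: (n - 1) times the two sample standard deviations.
    return numerator / ((n - 1) * stdev(xs) * stdev(ys))

r = correlation(x, y)
print(round(r, 4))  # 0.7746
```

A positive r near 0.77 says the made-up points trend upward, but not on a perfectly straight line.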
Test for Linear Correlation
 Find the absolute value of r
 In Table G, go to the row for the sample size n
 Compare the absolute value of r to the critical value in
Table G
 If the absolute value of r is GREATER than the critical
value from Table G, the variables ARE linearly correlated
(either positively or negatively)
 If the absolute value of r is LESS than OR EQUAL to the
critical value from Table G, the variables are NOT linearly
correlated
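The decision rule above can be sketched in a few lines of Python. The critical value still has to be looked up in Table G by hand, so it is passed in as an argument. The numbers in the example are placeholders, NOT values from Table G, and the function name is hypothetical.

```python
def is_linearly_correlated(r, critical_value):
    """Apply the rule: correlated only if |r| is strictly greater than
    the critical value looked up in Table G for the sample size n."""
    return abs(r) > critical_value

# Placeholder numbers for illustration; NOT taken from Table G.
example_critical_value = 0.70
print(is_linearly_correlated(-0.80, example_critical_value))  # True
print(is_linearly_correlated(0.70, example_critical_value))   # False
```

Note that the sign of r does not matter for the test, and a tie with the critical value counts as NOT correlated.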
Lot/sale price example
 Is there a linear correlation between the size of the lot
and the sale price for the data from the earlier slide?
Regression Lines: least-squares method
 Ŷ = mx + b (notice the notation for the y; it is called "y hat")
$$m = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
$$b = \bar{y} - (m \cdot \bar{x})$$
 The slope (m) is the estimated change in y per unit of
x (how much y increases or decreases when x goes up by one unit)
 The y-intercept (b) is the predicted value of y when x is zero
 Let's find the equation of the regression line for the
lot/sale price example, using Excel ®
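Here is a rough Python sketch of the same slope and intercept formulas, for comparison with what Excel reports. The data are made up for illustration (NOT the lot/sale price data), and least_squares is a name chosen here.

```python
from statistics import mean

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def least_squares(xs, ys):
    """Return slope m and intercept b of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    m = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
         / sum((xi - x_bar) ** 2 for xi in xs))
    # Intercept: forces the line through the point (x-bar, y-bar).
    b = y_bar - m * x_bar
    return m, b

m, b = least_squares(x, y)
print(f"y-hat = {m:.1f}x + {b:.1f}")  # y-hat = 0.6x + 2.2
```

For these points, each extra unit of x adds an estimated 0.6 to y.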
Excel for least squares method
 Enter data in Excel
 Insert chart: marked scatter (doesn't have any lines on the
points)
 Click the plus sign next to your chart
 Check trendline
 Click the arrow to the right of "trendline"
 Go to more options
 Select Type: Linear
 Options: check the display equation box and the show r
squared value box (we talk about r squared later)
 Pg 175, figure 4.15
Use of regression lines
 Use regression lines to predict
 Interpolation (within the range of the plotted data)
 Extrapolation (outside the range of the plotted data)
 There is possible error in both interpolation and
extrapolation (predicted value vs observed value)
 Prediction error (or residual) is y – Ŷ
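A small Python sketch of prediction and residuals. It assumes a regression line Ŷ = 0.6x + 2.2 that was fitted on x-values from 1 to 5. Those numbers are hypothetical, and predict and residual are names chosen here.

```python
# Assumed (hypothetical) regression line and the x-range it was fitted on.
m, b = 0.6, 2.2
x_min, x_max = 1, 5

def predict(x_new):
    """Return the predicted y and whether it is inside the data range."""
    y_hat = m * x_new + b
    kind = "interpolation" if x_min <= x_new <= x_max else "extrapolation"
    return y_hat, kind

def residual(y_observed, x_value):
    """Prediction error: observed y minus predicted y."""
    y_hat, _ = predict(x_value)
    return y_observed - y_hat

print(predict(3)[1])    # interpolation
print(predict(10)[1])   # extrapolation
print(round(residual(5, 3), 2))  # observed 5 vs predicted about 4
```

Extrapolated predictions deserve extra caution because nothing in the data says the line keeps going straight out there.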
SSE, SST, and SSR
 We are going to briefly talk about three different
measures that have ugly equations. Individually the
numbers are not that useful, but at the end we will put
them together to find a useful value, so just be patient.
 And, just so you know, we will use Excel ® to calculate
all these values too.
Sum of Squares Error (SSE)
 We want our prediction errors to be small
 We use SSE to measure the prediction error
$$SSE = \sum (y - \hat{y})^2$$
 When we use the Least-Squares Criterion our SSE will
be minimized. We will use Excel ® in just a little bit.
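To see the "minimized" claim in action, this Python sketch computes SSE for the least-squares line and for a slightly different line. The data are made up for illustration, and the nudged slope (m + 0.1) is an arbitrary choice.

```python
from statistics import mean

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sse(xs, ys, m, b):
    """Sum of squared prediction errors for the line y-hat = m*x + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(xs, ys))

# Least-squares slope and intercept.
x_bar, y_bar = mean(x), mean(y)
m = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
b = y_bar - m * x_bar

best = sse(x, y, m, b)         # SSE of the least-squares line
other = sse(x, y, m + 0.1, b)  # SSE of a nudged, non-optimal line
print(round(best, 4), round(other, 4))
```

Any line other than the least-squares line gives a larger SSE on the same data.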
Standard Error of the Estimate (s)
Gives the measure of a typical residual (typical prediction error, kind
of like the "average" error). We want it to be small.
$$s = \sqrt{\frac{SSE}{n - 2}}$$
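A quick Python sketch of the standard error of the estimate. The inputs (SSE = 2.4 from n = 5 points) are hypothetical, and standard_error is a name chosen here.

```python
from math import sqrt

def standard_error(sse, n):
    """Standard error of the estimate: s = sqrt(SSE / (n - 2))."""
    if n <= 2:
        raise ValueError("need more than 2 data points")
    return sqrt(sse / (n - 2))

# Hypothetical values for illustration only.
s = standard_error(2.4, 5)
print(round(s, 4))  # 0.8944
```

So for these made-up numbers a typical prediction misses by a little under 0.9 units of y.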

SST
 Is SSE = 12 "small" enough to indicate that our
regression line is useful? We have to find a couple of
other values that will help answer this:
 Total Sum of Squares (SST)
$$SST = \sum (y - \bar{y})^2$$
Or
$$SST = (n - 1)s^2$$
where $s^2$ is the sample variance of the y-values
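Both SST formulas should give the same number. This Python sketch checks that on made-up y-values (hypothetical data, for illustration only).

```python
from statistics import mean, variance

# Hypothetical y-values for illustration only.
y = [2, 4, 5, 4, 5]
n = len(y)
y_bar = mean(y)

# Direct definition: squared deviations of y from its mean.
sst_direct = sum((yi - y_bar) ** 2 for yi in y)

# Shortcut: (n - 1) times the sample variance of the y-values.
sst_from_variance = (n - 1) * variance(y)

print(sst_direct, sst_from_variance)
```

Notice that SST uses only the y-values, so it does not depend on the regression line at all.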
SSR
 Sum of Squares Regression (SSR) measures the
amount of improvement in the accuracy of our
estimates when using the regression equation
compared to relying only on the mean of the y-values.
$$SSR = \sum (\hat{y} - \bar{y})^2$$
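A minimal Python sketch of SSR, assuming the made-up data below (for illustration only). It fits the least-squares line first, then measures how far the predicted values sit from the mean of y.

```python
from statistics import mean

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares slope and intercept.
x_bar, y_bar = mean(x), mean(y)
m = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
b = y_bar - m * x_bar

# Predicted values, then squared distance of each prediction from y-bar.
y_hat = [m * xi + b for xi in x]
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
print(round(ssr, 4))  # 3.6
```

If the line were flat (m = 0), every prediction would equal the mean and SSR would be zero: no improvement over just using the mean.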
SST, SSR, SSE
 SST = SSR + SSE
 Pg 190 ex 4.16
 In Excel ®: File, Options, Add-Ins, Go, check
Analysis ToolPak, OK
 On the "DATA" tab, "Data Analysis" should now be on the right
 Select Data Analysis, select regression, ok
 Fill in y-values, x-values, ok
 In the table that appears, under ANOVA, the SS column is
where we get SST (total) SSR (regression) and SSE
(residual)
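As a cross-check on the SS column, this Python sketch computes all three sums of squares and confirms SST = SSR + SSE. The data are made up for illustration, and sums_of_squares is a name chosen here; it is not an Excel or library function.

```python
from statistics import mean

def sums_of_squares(xs, ys):
    """Return SSR, SSE and SST for the least-squares line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    m = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
         / sum((xi - x_bar) ** 2 for xi in xs))
    b = y_bar - m * x_bar
    y_hat = [m * xi + b for xi in xs]
    return {
        "SSR": sum((yh - y_bar) ** 2 for yh in y_hat),          # regression
        "SSE": sum((yi - yh) ** 2 for yi, yh in zip(ys, y_hat)),  # residual
        "SST": sum((yi - y_bar) ** 2 for yi in ys),             # total
    }

# Hypothetical data for illustration only.
ss = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in ss.items():
    print(name, round(value, 4))
```

The three printed values line up with the Regression, Residual, and Total rows of the ANOVA table.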
Coefficient of Determination
$$r^2 = \frac{SSR}{SST}$$
Measures the goodness of fit of the
regression equation to the data
(always between 0 and 1,
inclusive)
This was on our original trend line
graph!
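The name "r squared" is literal: SSR/SST equals the correlation coefficient r, squared. A Python sketch that checks this on made-up data (for illustration only):

```python
from statistics import mean

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # this is SST

# Least-squares line and its predicted values.
m = s_xy / s_xx
b = y_bar - m * x_bar
y_hat = [m * xi + b for xi in x]

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
r_squared = ssr / s_yy

# Correlation coefficient; (n - 1) * s_x * s_y equals sqrt(s_xx * s_yy).
r = s_xy / (s_xx * s_yy) ** 0.5

print(round(r_squared, 4), round(r ** 2, 4))
```

Here about 60% of the total variation in y is accounted for by the regression line.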
Coefficient of determination
 Read tan box below problem 4.17 pg 190
 The closer the coefficient of determination is to 1 the
better the fit of the regression equation to the data. (0
is a horrible fit)
 Pg 194 #43