Regression

Definition
The linear correlation coefficient r
measures the strength of the linear
relationship between the paired
quantitative x- and y-values in a sample.
Requirements
1. The sample of paired (x, y) data is a simple
random sample of quantitative data.
2. Visual examination of the scatterplot must
confirm that the points approximate a
straight-line pattern.
3. Any outliers must be removed if they are
known to be errors. The effects of any
other outliers should be considered by
calculating r with and without the outliers
included.
Notation for the
Linear Correlation Coefficient
$n$ = number of pairs of sample data.
$\sum$ denotes the addition of the items indicated.
$\sum x$ denotes the sum of all x-values.
$\sum x^2$ indicates that each x-value should be
squared and then those squares added.
$(\sum x)^2$ indicates that the x-values should be
added and then the total squared.
$\sum xy$ indicates that each x-value should first be
multiplied by its corresponding y-value.
After obtaining all such products, find
their sum.
$r$ = linear correlation coefficient for sample
data.
$\rho$ = linear correlation coefficient for
population data.
Formula
The linear correlation coefficient r measures the
strength of a linear relationship between the
paired values in a sample.
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$$
Computer software or calculators can compute r.
Properties of the
Linear Correlation Coefficient r
1. $-1 \le r \le 1$
2. If all values of either variable are converted
to a different scale, the value of r does not
change.
3. The value of r is not affected by the choice
of x and y. Interchange all x- and y-values
and the value of r will not change.
4. r measures the strength of a linear relationship.
5. r is very sensitive to outliers; they can
dramatically affect its value.
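The formula for r and several of the properties listed above can be checked numerically. The following is a minimal Python sketch, not the output of any particular calculator: the function name `linear_correlation` and the five data pairs are made up for illustration, and the rescaling uses a positive factor.

```python
import math

def linear_correlation(xs, ys):
    """Sample linear correlation coefficient r, computed from the sums in the formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up illustrative data (assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = linear_correlation(xs, ys)

# Property 2: converting x to a different scale (positive factor, shifted) leaves r unchanged.
r_rescaled = linear_correlation([10 * x + 3 for x in xs], ys)

# Property 3: interchanging the x- and y-values leaves r unchanged.
r_swapped = linear_correlation(ys, xs)

# Property 5: a single extreme point can change r dramatically.
r_outlier = linear_correlation(xs + [20], ys + [0])

print(f"r = {r:.4f}, rescaled = {r_rescaled:.4f}, swapped = {r_swapped:.4f}")
print(f"with the added point (20, 0): r = {r_outlier:.4f}")
```

In this toy data set the single added point is enough to turn a positive correlation into a negative one, which is why the requirements ask for r to be calculated with and without outliers.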
Example:
Using software or a calculator, r is
automatically calculated:
Example:
Using the pizza and subway fare costs, we have
found that the linear correlation coefficient
is r = 0.988. What proportion of the variation
in the subway fare can be explained by the
variation in the cost of a slice of pizza?
With r = 0.988, we get $r^2$ = 0.976.
We conclude that 0.976 (or about 98%) of the
variation in the cost of subway fares can be
explained by the linear relationship between the
costs of pizza and subway fares. This implies that
about 2% of the variation in the cost of subway fares
cannot be explained by the cost of pizza.
Common Errors
Involving Correlation
1. Causation: It is wrong to conclude that
correlation implies causality.
2. Averages: Averages suppress individual
variation and may inflate the correlation
coefficient.
3. Linearity: There may be some relationship
between x and y even when there is no
linear correlation.
Basic Concepts of Regression
Regression
The regression equation expresses a
relationship between x (called the
explanatory variable, predictor variable, or
independent variable) and $\hat{y}$ (called the
response variable or dependent variable).
The typical equation of a straight line,
y = mx + b, is expressed in the form
$\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$
is the slope.
Definitions
• Regression Equation
Given a collection of paired data, the regression
equation
$\hat{y} = b_0 + b_1 x$
algebraically describes the relationship
between the two variables.
• Regression Line
The graph of the regression equation is called
the regression line (or line of best fit, or
least-squares line).
Notation for
Regression Equation
                                      Population Parameter        Sample Statistic
y-intercept of regression equation    $\beta_0$                   $b_0$
Slope of regression equation          $\beta_1$                   $b_1$
Equation of the regression line       $y = \beta_0 + \beta_1 x$   $\hat{y} = b_0 + b_1 x$
Requirements
1. The sample of paired (x, y) data is a
random sample of quantitative data.
2. Visual examination of the scatterplot
shows that the points approximate a
straight-line pattern.
3. Any outliers must be removed if they are
known to be errors. Consider the effects
of any outliers that are not known errors.
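Formulas for b0 and b1
$b_1 = r \dfrac{s_y}{s_x}$ (slope)
$b_0 = \bar{y} - b_1 \bar{x}$ (y-intercept)
Here $s_x$ and $s_y$ are the sample standard deviations
of the x- and y-values, and $\bar{x}$ and $\bar{y}$ are their means.
Calculators or computers can
compute these values.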
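As a rough illustration of what such technology does, the two formulas can be coded directly. This is a sketch under stated assumptions: the data pairs and the function names `linear_correlation` and `regression_coefficients` are hypothetical, and the standard deviations are the sample versions from Python's `statistics.stdev`.

```python
import math
import statistics

def linear_correlation(xs, ys):
    """Sample linear correlation coefficient r from the sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    denominator = (math.sqrt(n * sum(x * x for x in xs) - sum_x ** 2)
                   * math.sqrt(n * sum(y * y for y in ys) - sum_y ** 2))
    return numerator / denominator

def regression_coefficients(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    r = linear_correlation(xs, ys)
    b1 = r * statistics.stdev(ys) / statistics.stdev(xs)
    b0 = statistics.mean(ys) - b1 * statistics.mean(xs)
    return b0, b1

# Made-up illustrative data (assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = regression_coefficients(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f}x")
```

For these five pairs the slope works out to 0.6 and the intercept to 2.2.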
Example:
Refer to the sample data given in the table below.
Use technology to find the equation of the
regression line in which the explanatory
variable (or x variable) is the cost of a slice of
pizza and the response variable (or y variable)
is the corresponding cost of a subway fare.
Example:
Requirements are satisfied: simple random
sample; scatterplot approximates a straight
line; no outliers.
Here are results from four different technologies:
Example:
All of these technologies show that the
regression equation can be expressed as
$\hat{y} = 0.0346 + 0.945x$, where $\hat{y}$ is the predicted
cost of a subway fare and x is the cost of a
slice of pizza.
Example:
Graph the regression equation
$\hat{y} = 0.0346 + 0.945x$
(from the preceding Example) on the
scatterplot of the pizza/subway fare data and
examine the graph to subjectively determine
how well the regression line fits the data.
Definition
For a pair of sample x and y values, the
residual is the difference between the
observed sample value of y and the y-value
that is predicted by using the
regression equation. That is,
residual = observed y – predicted y = $y - \hat{y}$
Residuals
Definitions
A straight line satisfies the least-squares
property if the sum of the squares of the
residuals is the smallest sum possible.
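The residual and the least-squares property can be illustrated with a short Python sketch. The data are made up, the helper names are hypothetical, and the coefficients 2.2 and 0.6 are the least-squares values for this toy data set only. The sketch compares the sum of squared residuals of that line with the sums for slightly nudged lines.

```python
def residuals(xs, ys, b0, b1):
    """Observed y minus predicted y (y - y-hat) for each pair."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of the squares of the residuals for the line y-hat = b0 + b1*x."""
    return sum(e * e for e in residuals(xs, ys, b0, b1))

# Made-up illustrative data (assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6  # least-squares intercept and slope for this toy data

best = sum_squared_residuals(xs, ys, b0, b1)

# Nudge the intercept and slope in every direction; each nudged line fits worse.
nudged = [
    sum_squared_residuals(xs, ys, b0 + d0, b1 + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.05, 0.0, 0.05)
    if (d0, d1) != (0.0, 0.0)
]

print("residuals:", [round(e, 2) for e in residuals(xs, ys, b0, b1)])
print("least-squares sum:", round(best, 4))
print("smallest nudged sum:", round(min(nudged), 4))
```

Trying a handful of nearby lines is only a spot check, not a proof, but it shows what "smallest sum possible" means in practice.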
Definitions
A residual plot is a scatterplot of the
(x, y) values after each of the
y-coordinate values has been replaced
by the residual value $y - \hat{y}$ (where $\hat{y}$
denotes the predicted value of y). That
is, a residual plot is a graph of the
points $(x, y - \hat{y})$.
Residual Plot Analysis
When analyzing a residual plot, look for a
pattern in the way the points are configured,
and use these criteria:
• The residual plot should not show any obvious
pattern other than a straight-line pattern.
• The residual plot should not become thicker
(or thinner) when viewed from left to right.
[Figure: residual plot for the pizza/subway fare data]
[Figures: example residual plots]
Definition
The coefficient of determination
is the proportion of the variation in y that
is explained by the regression line.
$$r^2 = \frac{\text{explained variation}}{\text{total variation}}$$
The value of $r^2$ is the proportion of the
variation in y that is explained by the linear
relationship between x and y.
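The ratio can be computed directly once a regression line has been fitted. The sketch below assumes the usual definitions: total variation is the sum of $(y - \bar{y})^2$, explained variation is the sum of $(\hat{y} - \bar{y})^2$, and unexplained variation is the sum of $(y - \hat{y})^2$. The data and variable names are made up for illustration.

```python
# Made-up illustrative data (assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept from sums of deviations.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in xs]

total_variation = sum((y - y_bar) ** 2 for y in ys)
explained_variation = sum((p - y_bar) ** 2 for p in predicted)
unexplained_variation = sum((y - p) ** 2 for y, p in zip(ys, predicted))

r_squared = explained_variation / total_variation
print(f"explained = {explained_variation:.2f}, total = {total_variation:.2f}")
print(f"r^2 = {r_squared:.2f}")
```

For this toy data set 60% of the variation in y is explained by the line, and the explained and unexplained parts add up to the total.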
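Alternate Formula for r
$$r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}}$$
$$SS(x) = \text{"sum of squares for } x\text{"} = \sum x^2 - \frac{(\sum x)^2}{n}$$
$$SS(y) = \text{"sum of squares for } y\text{"} = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$SS(xy) = \text{"sum of squares for } xy\text{"} = \sum xy - \frac{\sum x \sum y}{n}$$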
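The SS quantities translate almost line for line into code. This is a minimal Python sketch: the helper name `ss` is hypothetical, and the input is the ten weight and mileage pairs from the automobile example in these slides.

```python
import math

def ss(a_values, b_values):
    """SS(ab) = sum(a*b) - sum(a)*sum(b)/n. Passing the same list twice gives SS(a)."""
    n = len(a_values)
    return (sum(a * b for a, b in zip(a_values, b_values))
            - sum(a_values) * sum(b_values) / n)

# Weight (thousands of pounds) and gasoline mileage (mpg) for the ten automobiles.
weights = [2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2]
mileages = [40, 43, 30, 35, 42, 19, 32, 39, 15, 14]

ss_x = ss(weights, weights)
ss_y = ss(mileages, mileages)
ss_xy = ss(weights, mileages)
r = ss_xy / math.sqrt(ss_x * ss_y)

print(f"SS(x) = {ss_x:.3f}, SS(y) = {ss_y:.1f}, SS(xy) = {ss_xy:.2f}")
print(f"r = {r:.2f}")
```

SS(xy) is negative here, so r is negative: in this sample, heavier automobiles tend to have lower gasoline mileage.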
Example
• Example: The table below presents the weight (in thousands of
pounds) x and the gasoline mileage (miles per gallon)
y for ten different automobiles. Find the linear
correlation coefficient:

         x       y      x^2       xy      y^2
       2.5      40     6.25    100.0     1600
       3.0      43     9.00    129.0     1849
       4.0      30    16.00    120.0      900
       3.5      35    12.25    122.5     1225
       2.7      42     7.29    113.4     1764
       4.5      19    20.25     85.5      361
       3.8      32    14.44    121.6     1024
       2.9      39     8.41    113.1     1521
       5.0      15    25.00     75.0      225
       2.2      14     4.84     30.8      196
Sum   34.1     309   123.73   1010.9    10665

The column sums are $\sum x = 34.1$, $\sum y = 309$, $\sum x^2 = 123.73$,
$\sum xy = 1010.9$, and $\sum y^2 = 10665$.
Completing the Calculation
for r
$$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449$$
$$SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9$$
$$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79$$
$$r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$$
The Line of Best Fit Equation
• The equation is determined by:
b0: y-intercept
b1: slope
• Values that satisfy the least squares criterion:
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{SS(xy)}{SS(x)}$$
$$b_0 = \frac{\sum y - (b_1 \cdot \sum x)}{n} = \bar{y} - (b_1 \cdot \bar{x})$$
Example
• Example: A recent article measured the job satisfaction of
subjects with a 14-question survey. The data below
represent the job satisfaction scores, y, and the
salaries, x, for a sample of similar individuals:

x    31   33   22   24   35   29   23   37
y    17   20   13   15   18   17   12   21

1) Draw a scatter diagram for these data
2) Find the equation of the line of best fit
Line of Best Fit
$$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 7074 - \frac{(234)^2}{8} = 229.5$$
$$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 4009 - \frac{(234)(133)}{8} = 118.75$$
$$b_1 = \frac{SS(xy)}{SS(x)} = \frac{118.75}{229.5} = 0.5174$$
$$b_0 = \frac{\sum y - (b_1 \cdot \sum x)}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902$$
(The value 1.4902 is obtained with the unrounded slope.)
Solution 1) Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$
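The arithmetic for this example can be reproduced in a few lines of Python. This is only a check of the hand calculation, using the eight salary and score pairs from the example. The variable names and the `predict` helper are illustrative, and the salary of 30 used at the end is an arbitrary value inside the observed range.

```python
# Salary (x) and job satisfaction score (y) pairs from the example.
salaries = [31, 33, 22, 24, 35, 29, 23, 37]
scores = [17, 20, 13, 15, 18, 17, 12, 21]

n = len(salaries)
sum_x, sum_y = sum(salaries), sum(scores)

ss_x = sum(x * x for x in salaries) - sum_x ** 2 / n
ss_xy = sum(x * y for x, y in zip(salaries, scores)) - sum_x * sum_y / n

b1 = ss_xy / ss_x              # slope
b0 = (sum_y - b1 * sum_x) / n  # y-intercept

def predict(x):
    """Predicted job satisfaction score for salary x on the fitted line."""
    return b0 + b1 * x

print(f"SS(x) = {ss_x}, SS(xy) = {ss_xy}")
print(f"y-hat = {b0:.4f} + {b1:.4f}x")
print(f"predicted score at x = 30: {predict(30):.2f}")
```

Keeping `b1` unrounded until the end reproduces the intercept 1.4902 shown in the hand calculation.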
Solution 2) Scatter Diagram
[Figure: Job Satisfaction Survey, job satisfaction score (vertical axis) plotted against salary (horizontal axis)]