Chapter 10 Correlation and Regression

Chapter Problem: Can we predict the cost of subway fare from the
price of a slice of pizza?
Table 10-1 Cost of a Slice of Pizza, Subway Fare, and the CPI

Year            1960   1973   1986    1995    2002    2003
Cost of Pizza   0.15   0.35   1.00    1.25    1.75    2.00
Subway Fare     0.15   0.35   1.00    1.35    1.50    2.00
CPI             30.2   48.3   112.3   162.2   191.9   197.8
1
Chapter 10 Correlation and Regression
Chapter Problem: Can we predict the cost of subway fare from the price of a slice of pizza?
[Scatterplot: Subway fare (y-axis, 0 to 2.5) versus Pizza cost (x-axis, 0 to 2.5), with points labeled by year from 1960 to 2003.]
Figure 10-1 Scatterplot of Pizza Costs and Subway Costs
• If there is a correlation between two variables, how can it be described? Is there an equation that can be used to predict the cost of subway fare given the cost of a slice of pizza?
• If we can predict the cost of subway fare, how accurate is the prediction?
• Is there also a correlation between the CPI and the cost of a subway fare, and if so, is the CPI better for predicting the cost of subway fare?
2
10.1 Review and Preview
Ch. 9 presented methods for making inferences from two samples
• Sec 9-4 considered two dependent samples
• Sec 9-4 considered the differences between the paired values, and illustrated two methods
– Hypothesis test for a claim about the population of differences
– Confidence interval for estimating the mean of all such differences
Ch. 10 introduces methods for determining whether a correlation, or association, between two variables exists and whether the correlation is linear.
• Sec 10-2 introduces the correlation between two variables
• Sec 10-3 introduces regression analysis, which allows us to describe the relationship with an equation
3
10.2 Correlation
Key Concept
• Linear correlation coefficient r
– A numerical measure of the strength of the relationship between two variables
• Use paired sample data to compute r
• Use the value of r to decide whether there is a relationship between the two variables
• We consider linear relationships
4
10.2 Correlation
Part I: Basic Concepts of Correlation
Definition – A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.
• e.g. Table 10-1: x = cost of pizza, y = cost of subway fare
• Exploring the data
5
10.2 Correlation
Part I: Basic Concepts of Correlation – Exploring Data
[Four scatterplot panels: r = −0.721; r = 0.851; a nonlinear relationship with r = −0.087; r = 0]
Figure 10-2 Scatterplots
6
10.2 Correlation
Part I: Basic Concepts of Correlation
Linear Correlation Coefficient
Definition
The linear correlation coefficient r measures the strength of the linear relationship between the paired quantitative x- and y-values in a sample. Its value is computed by Formula 10-1 or Formula 10-2 (next pages).
The linear correlation coefficient r is sometimes referred to as the Pearson product moment correlation coefficient, in honor of Karl Pearson (1857 – 1936).
7
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
Requirements
• The paired (x, y) data are a simple random sample of independent quantitative data
• Visual examination of the scatterplot confirms a straight-line pattern
• Any outliers must be removed if they are known to be errors
Formally, the paired (x, y) data must have a bivariate normal distribution. This requirement is usually difficult to check; for now, we use requirements 2 and 3.
8
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
Notation for the Linear Correlation Coefficient

n        number of pairs of data present
Σ        denotes the addition of the items indicated
Σx       the sum of all x-values
Σx²      each x-value is squared, and then those squares are added
(Σx)²    the x-values are added, and the total is then squared. It is extremely important to avoid confusing Σx² and (Σx)²
Σxy      multiply each x-value by its paired y-value, and then add the products
r        the linear correlation coefficient for a sample
ρ        the linear correlation coefficient for a population
9
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
Formula 10-1:
    r = [n Σxy − (Σx)(Σy)] / [√(n Σx² − (Σx)²) · √(n Σy² − (Σy)²)]

Formula 10-2:
    r = Σ(z_x · z_y) / (n − 1)

Here z_x and z_y are the z-scores for the sample values x and y, respectively.
10
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
Interpreting r
• Using software, compute the P-value
– If the P-value is less than the significance level, conclude that there is a linear correlation
– Otherwise, conclude that there is no linear correlation
• Using Table A-6, find the critical value r0 (which depends on n and α)
– If |r| ≥ r0, conclude that there is a linear correlation
– Otherwise, conclude that there is no linear correlation
Round the linear correlation coefficient r to three decimal places (so that it can be compared to the critical values in Table A-6). Round only the final result, not intermediate values.
11
10.2 Correlation
Part I: Basic Concepts of Correlation
Properties of the Linear Correlation Coefficient (LCC) r
1. The value of r is always between −1 and 1. That is, −1 ≤ r ≤ 1.
2. The value of r does not change if all values of either variable are converted to a different scale.
3. The value of r is not affected by the choice of x and y. Interchange all x- and y-values and the value of r will not change.
4. The LCC measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.
5. The LCC is very sensitive to outliers.
12
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
e.g.1 Calculating r using software (TI-84 for P-value and r)
Sol.
Table 10-2 Costs of a Slice of Pizza and Subway Fare (in dollars)

Cost of Pizza   0.15   0.35   1.00   1.25   1.75   2.00
Subway Fare     0.15   0.35   1.00   1.35   1.50   2.00

1. Requirements: simple random data; linear pattern; no outliers.
2. TI-84 (LinRegTTest): P-value = 0.00022195436, r = 0.987811
13
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
e.g.2 Calculating r using Formula 10-1

x (pizza)   y (subway)   x²       y²       xy
0.15        0.15         0.0225   0.0225   0.0225
0.35        0.35         0.1225   0.1225   0.1225
1.00        1.00         1.0000   1.0000   1.0000
1.25        1.35         1.5625   1.8225   1.6875
1.75        1.50         3.0625   2.2500   2.6250
2.00        2.00         4.0000   4.0000   4.0000
Σx = 6.50   Σy = 6.35    Σx² = 9.77   Σy² = 9.2175   Σxy = 9.4575

Sol.
1. Requirements: done in e.g.1.
14
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
e.g.2 Calculating r using Formula 10-1
Sol.
    r = [n Σxy − (Σx)(Σy)] / [√(n Σx² − (Σx)²) · √(n Σy² − (Σy)²)]
      = [6(9.4575) − (6.50)(6.35)] / [√(6(9.77) − (6.50)²) · √(6(9.2175) − (6.35)²)]
      = 15.47 / (√16.37 · √14.9825)
      ≈ 0.988
15
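As a cross-check on the arithmetic in e.g.2 (this sketch is mine, not part of the slides, which use a TI-84), Formula 10-1 can be evaluated directly from the raw data:

```python
import math

# Paired sample data from Table 10-2 (dollars)
x = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]  # cost of pizza
y = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]  # subway fare

n = len(x)
sum_x = sum(x)                              # 6.50
sum_y = sum(y)                              # 6.35
sum_x2 = sum(v * v for v in x)              # 9.77
sum_y2 = sum(v * v for v in y)              # 9.2175
sum_xy = sum(a * b for a, b in zip(x, y))   # 9.4575

# Formula 10-1
r = (n * sum_xy - sum_x * sum_y) / (
    math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.988
```

The intermediate sums match the table in e.g.2, and r rounds to the same 0.988.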
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
e.g.3 Calculating r using Formula 10-2
Sol. First, x̄ = 1.083333, sx = 0.738693; ȳ = 1.058333, sy = 0.706694, and
    z_x = (x − x̄) / sx,    z_y = (y − ȳ) / sy

x (pizza)   y (subway)   z_x        z_y        z_x·z_y
0.15        0.15         −1.26349   −1.28533   1.62400
0.35        0.35         −0.99274   −1.00232   0.99504
1.00        1.00         −0.11281   −0.08254   0.00931
1.25        1.35         0.22562    0.41272    0.09312
1.75        1.50         0.90250    0.62498    0.56404
2.00        2.00         1.24093    1.33250    1.65354
                                    Σ(z_x·z_y) = 4.93905
16
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
e.g.3 Calculating r using Formula 10-2
Sol.
    r = Σ(z_x·z_y) / (n − 1) = 4.93905 / 5 ≈ 0.988
e.g.4 Interpreting r. Interpret the value of r = 0.988 found in e.g.1, 2, and 3. Use a significance level of 0.05. Is there sufficient evidence to support a claim that there is a linear correlation between the cost of a slice of pizza and subway fares?
Ans. Using Table A-6 with n = 6, the critical value is 0.811 (for α = 0.05).
Since |0.988| > 0.811, we conclude that there is a linear correlation between x and y.
17
10.2 Correlation
Part I: Basic Concepts of Correlation – Computing r
e.g.5 Interpreting r. Using a 0.05 significance level, interpret the value of r = 0.117 found using the 62 pairs of weights of discarded paper and glass listed in Data Set 22 in Appendix B. When the paired data are used with computer software, the P-value is found to be 0.364. Is there sufficient evidence to support the claim of a linear correlation between the weights of discarded paper and glass?
Ans.
1. Using technology: P-value = 0.364 > 0.05, so there is not sufficient evidence to support the claim that there is a linear correlation between the weights of discarded paper and glass.
2. Using Table A-6: n = 62, α = 0.05, C.V. ≈ 0.254. Since |r| = 0.117 < 0.254, we reach the same conclusion as above.
18
10.2 Correlation
Part I: Basic Concepts of Correlation
Interpreting r: Explained Variation
The value of r2 is the proportion of the variation in y that is explained by
the linear relationship between x and y.
19
10.2 Correlation
Part I: Basic Concepts of Correlation
Interpreting r: Explained Variation
e.g.6 Explained Variation. Using the pizza/subway fare costs in Table 10-2, it is found that the LCC is r = 0.988. What proportion of the variation in the subway fare can be explained by the variation in the costs of a slice of pizza?
Ans. r² = 0.976.
Interpretation. About 97.6% of the variation in the cost of subway fares can be explained by the linear relationship between the costs of pizza and subway fares. This implies that about 2.4% of the variation in the costs of subway fares cannot be explained by the costs of pizza.
20
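The explained-variation arithmetic in e.g.6 is easy to verify. A small Python sketch (not in the slides):

```python
# r for the pizza/subway data (found in e.g.1-3)
r = 0.988

r_squared = r ** 2              # proportion of variation in y explained
print(round(r_squared, 3))      # 0.976 -> about 97.6% explained
print(round(1 - r_squared, 3))  # 0.024 -> about 2.4% unexplained
```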
10.2 Correlation
Part I: Basic Concepts of Correlation
Common Errors Involving Correlation:
1. A common error is to conclude that correlation implies causality (e.g., pizza/subway).
2. Another error arises with data based on averages. (Averaging suppresses individual variation and may inflate the LCC.)
3. The third error involves the property of linearity: a relationship may exist between x and y even when there is no linear correlation.
21
10.2 Correlation
Part II: Formal Hypothesis Test
Method 1. Hypothesis Test for Correlation Using Test Statistic r
Notation:
– n: number of pairs of sample data
– r: LCC for a sample of paired data
– ρ: LCC for a population of paired data
Requirements: see the beginning of this section.
Hypotheses:
H0: ρ = 0 (There is no linear correlation)
H1: ρ ≠ 0 (There is a linear correlation)
22
10.2 Correlation
Part II: Formal Hypothesis Test
Method 1. Hypothesis Test for Correlation Using Test Statistic r
Test Statistic: r
Critical Values: Use Table A-6
Conclusion:
• If |r| > the critical value, reject H0 and conclude that there is a linear correlation.
• If |r| ≤ the critical value, fail to reject H0; there is not sufficient evidence to conclude that there is a linear correlation.
23
10.2 Correlation
Part II: Formal Hypothesis Test
e.g.7 Hypothesis Test with Pizza/Subway Fare Costs. Use the paired
pizza/subway fare data in Table 10-2 to test the claim that there
is a linear correlation between the costs of a slice of pizza and
the subway fares. Use a 0.05 significance level.
Solution. The requirements are met as discussed in e.g.1. Consider the following hypothesis test:
H0: ρ = 0 (There is no linear correlation)
H1: ρ ≠ 0 (There is a linear correlation)
r = 0.988. From Table A-6 with n = 6, α = 0.05, we find that the critical value is 0.811. Since |0.988| > 0.811, we reject H0. This indicates that there is a linear correlation between the two costs.
24
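The Method 1 decision rule reduces to one comparison. A Python sketch (the helper name is mine; the critical values still come from Table A-6):

```python
# Decision rule for Method 1 (test statistic r vs. Table A-6 critical value)
def linear_correlation(r, critical_value):
    """Return True if |r| exceeds the critical value (reject H0: rho = 0)."""
    return abs(r) > critical_value

print(linear_correlation(0.988, 0.811))  # True:  pizza/subway data (e.g.7)
print(linear_correlation(0.117, 0.254))  # False: paper/glass data (e.g.5)
```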
10.2 Correlation
Part II: Formal Hypothesis Test
Method 2. P-value Method for a Hypothesis Test for Correlation
Hypotheses:
H0: ρ = 0 (There is no linear correlation)
H1: ρ ≠ 0 (There is a linear correlation)
Test Statistic t:
    t = r / √[(1 − r²) / (n − 2)]
P-value: Use Table A-3 with df = n − 2
25
10.2 Correlation
Part II: Formal Hypothesis Test
Conclusion:
• If the P-value < significance level, reject H0 and conclude that
there is a linear correlation
• Otherwise, fail to reject H0 and conclude that there is not
sufficient evidence to support the claim of a linear correlation
26
10.2 Correlation - Part II: Formal Hypothesis Test
e.g.8 Hypothesis Test with Pizza/Subway Fare Costs. Use the paired
pizza/subway fare data in Table 10-2 and use the P-value
method to test the claim that there is a linear correlation
between the costs of a slice of pizza and the subway fares. Use
a 0.05 significance level.
Sol. Test Statistic t:
    t = r / √[(1 − r²) / (n − 2)] = 0.988 / √[(1 − 0.988²) / (6 − 2)] ≈ 12.793
Table A-3 with df = n − 2 = 4 shows that the P-value is less than 0.01. Because the P-value is < 0.05, reject H0 and conclude that there is a linear correlation between the two costs.
Note: TI-84 and Statdisk can calculate the P-value; on the TI-84, P-value = 0.00022.
27
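The test-statistic arithmetic in e.g.8 can be reproduced directly. A Python sketch (mine; the P-value itself still comes from Table A-3 or software):

```python
import math

# Test statistic for Method 2 with the pizza/subway data (e.g.8)
r, n = 0.988, 6
t = r / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t, 3))  # 12.793
```

With df = n − 2 = 4, Table A-3 (or a calculator) then gives a two-tailed P-value well below 0.01.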
10.2 Correlation - Part II: Formal Hypothesis Test
One-Tailed Tests
• The previous examples involve two-tailed tests
• A one-tailed test can occur with a claim of a positive correlation or a claim of a negative correlation:

Claim of Negative Correlation (Left-tailed test):    H0: ρ = 0,  H1: ρ < 0
Claim of Positive Correlation (Right-tailed test):   H0: ρ = 0,  H1: ρ > 0

• For both one-tailed tests, the P-value method can be handled as in earlier chapters
28
10.2 Correlation - Part II: Formal Hypothesis Test
Rationale
• Formula 10-1:
    r = [n Σxy − (Σx)(Σy)] / [√(n Σx² − (Σx)²) · √(n Σy² − (Σy)²)]
• Formula 10-2:
    r = Σ(z_x · z_y) / (n − 1)
• Other formulas:
    r = Σ(x − x̄)(y − ȳ) / [(n − 1) s_x s_y]
    r = s_xy / √(s_xx · s_yy)
• Rationale: self reading
29
10.3 Regression
Key Concept
• In 10.2 we computed the LCC r to determine whether there is a linear relationship between two variables x and y.
• If there is, in this section we find a linear equation, ŷ = b0 + b1x, that best represents the relationship
• This line is called the regression line, and the equation is called the regression equation
30
10.3 Regression – Part 1: Basic Concept
Requirements (same as in 10.2)
1. Sample of paired (x, y) data is a random sample of quantitative
data
2. Scatterplot shows linear pattern
3. Outliers must be removed if they are known to be errors.
Note: requirements 2 and 3 above are simplified attempts at
checking more formal requirements for regression analysis (book
page 537)
31
10.3 Regression – Part 1: Basic Concept
Notation for the Regression Equation
Given a collection of paired sample data,

                                         Population parameter   Sample statistic
y-intercept of the regression equation   β0                     b0
Slope of the regression equation         β1                     b1
Equation of the regression line          y = β0 + β1x           ŷ = b0 + b1x

Rounding the slope b1 and the y-intercept b0: round b1 and b0 to three significant digits.
x: predictor variable, or independent variable
ŷ: response variable, or dependent variable
32
10.3 Regression – Part 1: Basic Concept
Definition
Given a collection of paired sample data, the equation
    ŷ = b0 + b1x
is called the regression equation. It describes the relationship between x and y algebraically. Its graph is called the regression line (or line of best fit, or least-squares line).
Finding the slope b1 and the y-intercept b0 in the regression equation:
Formula 10-3 (Slope):       b1 = r · (s_y / s_x)
Formula 10-4 (Intercept):   b0 = ȳ − b1 · x̄
33
10.3 Regression – Part 1: Basic Concept
e.g.1 (from 10.2) Using technology to find the regression equation. Refer to the sample data in Table 10-2.

Table 10-2 Costs of a Slice of Pizza and Subway Fare (in dollars)

Cost of Pizza   0.15   0.35   1.00   1.25   1.75   2.00
Subway Fare     0.15   0.35   1.00   1.35   1.50   2.00

Sol.
1. Requirements: checked in 10.2.
2. Technology:
– TI-84 (LinRegTTest): a = 0.034560171, b0 ≈ 0.0346; b = 0.9450213806, b1 ≈ 0.945
– Statdisk
Equation of the regression line:
ŷ = 0.0346 + 0.945x, or Subway = 0.0346 + 0.945 Pizza
34
10.3 Regression – Part 1: Basic Concept
Example 2. Using manual calculations to find the regression equation. Use Formulas 10-3 and 10-4 to find the equation of the regression line. (Statistics from the TI-84.)
Sol. r = 0.987811, s_x = 0.738693, s_y = 0.706694, x̄ = 1.083333, ȳ = 1.058333

    b1 = r · (s_y / s_x) = 0.987811 · (0.706694 / 0.738693) ≈ 0.945
    b0 = ȳ − b1 · x̄ = 1.058333 − (0.945)(1.083333) ≈ 0.0346

The regression equation is: ŷ = 0.0346 + 0.945x
35
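Formulas 10-3 and 10-4 can be checked end to end from the raw data. A Python sketch (my own; it computes r via the z-score form of Formula 10-2 rather than a calculator):

```python
import statistics

# Paired sample data from Table 10-2
x = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]  # pizza
y = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]  # subway
n = len(x)

xbar, ybar = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Pearson r via Formula 10-2 (z-score form)
r = sum((a - xbar) / sx * (b - ybar) / sy for a, b in zip(x, y)) / (n - 1)

# Formula 10-3 (slope) and Formula 10-4 (intercept)
b1 = r * sy / sx
b0 = ybar - b1 * xbar

print(round(b1, 3))  # 0.945
print(round(b0, 4))  # 0.0346
```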
10.3 Regression – Part 1: Basic Concept
e.g.3 Graphing the regression equation. Plot the data and graph the line on the same xy-plane. (I don't have Minitab.)
Sol. Data:
Table 10-2 Costs of a Slice of Pizza and Subway Fare (in dollars)

Cost of Pizza   0.15   0.35   1.00   1.25   1.75   2.00
Subway Fare     0.15   0.35   1.00   1.35   1.50   2.00

The regression equation: ŷ = 0.0346 + 0.945x
How do I plot two groups of data on the same Excel graph?
36
10.3 Regression – Part 1: Basic Concept
Using the Regression Equation for Predictions.
Figure 10-5 Recommended Strategy for Predicting Values of y
37
10.3 Regression – Part 1: Basic Concept
e.g.4 Predicting Subway Fare. Table 10-5 includes the pizza/subway fare costs from the chapter problem, as well as the total number of runs scored in the baseball World Series for six different years. As of this writing, the cost of a slice of pizza in New York was $2.25, and 33 runs were scored in the last World Series.

Table 10-5 Costs of a Slice of Pizza, Runs Scored, and Subway Fare (in dollars)

Cost of Pizza        0.15   0.35   1.00   1.25   1.75   2.00
Runs Scored in WS    82     45     59     42     85     38
Subway Fare          0.15   0.35   1.00   1.35   1.50   2.00

a. Use the pizza/subway fare data to predict the cost of a subway fare given that a slice of pizza costs $2.25.
b. Use the runs/subway fare data to predict the subway fare in a year in which 33 runs were scored in the World Series.
38
10.3 Regression – Part 1: Basic Concept
a. Use the pizza/subway fare data to predict the cost of a subway fare given that a slice of pizza costs $2.25.
Sol. We already know that the regression equation is a good model:
    ŷ = 0.0346 + 0.945x = 0.0346 + 0.945(2.25) ≈ $2.16
39
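The prediction in part (a) is a single substitution into the regression equation. A Python sketch (the helper name `predict_subway_fare` is mine):

```python
# Regression equation from e.g.1/e.g.2: y-hat = 0.0346 + 0.945x
def predict_subway_fare(pizza_cost):
    return 0.0346 + 0.945 * pizza_cost

print(round(predict_subway_fare(2.25), 2))  # 2.16
```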
10.3 Regression – Part 1: Basic Concept
b. Use the runs/subway fare data to predict the subway fare in a year in which 33 runs were scored in the World Series.
Sol. (calculate r and the P-value using the TI-84)
– The regression line does not fit the data well (need help from Statdisk)
– LCC r = −0.332, which suggests that there is not a linear correlation
– The P-value is 0.520
– 33 runs is not too far beyond the scope of the available data
– Because the regression model is not a good model, the best predicted subway fare is the mean, ȳ = $1.06
Interpretation. Use the regression equation for predictions only if it is a good model. Otherwise, use ȳ as the predicted value.
40
10.3 Regression – Part 2: Beyond the Basic
Interpreting the Regression Equation: Marginal Change
(skip Part 2 if time presses)
Definition – In working with two variables related by a regression equation, the marginal change is the amount that one variable changes when the other variable changes by exactly one unit. The slope b1 in the regression equation represents the marginal change in y that occurs when x changes by one unit.
e.g. For the pizza/subway fare data in the chapter problem, b1 = 0.945. So, if x (the cost of a slice of pizza) increases by $1, the predicted cost of a subway fare will increase by $0.945. That is, for every additional $1 increase in the cost of a slice of pizza, we expect the subway fare to increase by 94.5¢.
41
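The marginal-change interpretation of b1 can be demonstrated numerically: increasing x by one unit changes ŷ by exactly b1, no matter where you start. A Python sketch (helper name is mine):

```python
# Regression equation from the chapter problem: y-hat = 0.0346 + 0.945x
def predict_subway_fare(pizza_cost):
    return 0.0346 + 0.945 * pizza_cost

# A $1 increase in pizza cost changes the prediction by the slope b1
marginal = predict_subway_fare(2.00) - predict_subway_fare(1.00)
print(round(marginal, 3))  # 0.945
```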
10.3 Regression – Part 2: Beyond the Basic
Outliers and Influential Points
Definition
In a scatterplot, an outlier is a point lying far away from the other
data points.
Paired sample data may include one or more influential points,
which are points that strongly affect the graph of the regression line.
An example using technology is given in the book.
42
10.3 Regression – Part 2: Beyond the Basic
Outliers and Influential Points
e.g.5
43
10.3 Regression – Part 2: Beyond the Basic
Residuals and the Least-Squares Property
Definition – For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y-value that is predicted by using the regression equation. That is,
    residual = observed y − predicted y = (y − ŷ)

Consider an example with data:

x    1    2    4    5
y    4    24   8    32

The regression equation is ŷ = 5 + 4x. When x = 5, the observed value of y is 32. On the other hand, the predicted value of y is ŷ = 5 + 4(5) = 25. So the residual is 32 − 25 = 7.
44
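The residual computation generalizes to all four points at once. A Python sketch (mine) reproducing the example's numbers:

```python
# Data and regression equation from the residual example: y-hat = 5 + 4x
x = [1, 2, 4, 5]
y = [4, 24, 8, 32]

predicted = [5 + 4 * v for v in x]                        # [9, 13, 21, 25]
residuals = [obs - pred for obs, pred in zip(y, predicted)]
print(residuals)  # [-5, 11, -13, 7]
```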
10.3 Regression – Part 2: Beyond the Basic
Residuals and the Least-Squares Property
[Scatterplot of the four data points and the line ŷ = 5 + 4x, showing the residuals −5, 11, −13, and 7 as vertical distances from the line.]
Figure 10-6 Residuals and Squares of Residuals
45
10.3 Regression – Part 2: Beyond the Basic
Residuals and the Least-Squares Property
The regression equation represents the line that fits the points "best" according to the following least-squares property.
Definition – A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.
From Figure 10-6, the residuals are −5, 11, −13, and 7, so the sum of their squares is
    (−5)² + 11² + (−13)² + 7² = 364
For any straight line other than the regression line, the sum of the squared residuals will be greater than 364.
46
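The least-squares property can be checked by comparing the regression line against any other candidate line. A Python sketch (mine; the alternative line y = 8 + 3x is an arbitrary example, not from the book):

```python
# Sum of squared residuals for a candidate line y = b0 + b1*x
def sse(b0, b1, xs, ys):
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]

print(sse(5, 4, xs, ys))  # 364 for the regression line
# A hypothetical alternative line y = 8 + 3x does worse:
print(sse(8, 3, xs, ys))  # 374
```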
10.3 Regression – Part 2: Beyond the Basic
Residual Plots (if time presses, skip this)
Definition – A residual plot is a graph of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y − ŷ (where ŷ denotes the predicted value of y). That is, a residual plot is a graph of the points (x, y − ŷ).
When analyzing a residual plot, look for a pattern in the way the points are configured, and use these criteria:
• The residual plot should not have an obvious pattern that is not a straight-line pattern. (This confirms that a scatterplot of the sample data is a straight-line pattern.)
• The residual plot should not become thicker (or thinner) when viewed from left to right. (This confirms the requirement that for the different fixed values of x, the distributions of the corresponding y-values all have the same standard deviation.)
47
10.3 Regression – Part 2: Beyond the Basic
Residual Plots
See the plot in ch10.xlsx.
[Residual plot "Residual Versus Pizza": Pizza on the x-axis (0 to 2.5), residuals on the y-axis (−0.25 to 0.2).]
48
10.3 Regression – Part 2: Beyond the Basic
Complete Regression Analysis
In Part 1, we identified simplified criteria for determining whether a regression equation is a good model. A more thorough analysis can be implemented with the following steps:
1. Construct a scatterplot and verify that the pattern of the points is approximately a straight line without outliers.
2. Construct a residual plot and verify that there is no pattern, and also verify that the residual plot does not become thicker (or thinner).
3. Use a histogram and/or normal quantile plot to confirm that the values of the residuals have a distribution that is approximately normal.
4. Consider any effects of a pattern over time.
49
10.3 Regression – Part 2: Beyond the Basic
Using Technology
• STATDISK
– Compute the LCC
– Graph a scatterplot with the regression line included (need to practice this!)
• Minitab (powerful, but we don't have it)
• Excel
• TI-83/84 Plus (can do a lot)
50