Chapter 4 Describing the Relation Between Two Variables

Learning Objectives
1. Construct and interpret scatter diagrams
2. Compute and interpret correlation coefficients
3. Compute and interpret least-squares regression lines
4. Interpret residual plots
Scatter Diagrams; Correlation
Bivariate data are data in which two variables are measured on each
individual. Often the purpose is
to study the relationship between the two variables (the correlation
problem),
or to predict one variable using the other (the least-squares
regression problem, also called the simple linear regression
problem).
For bivariate problems, we call one variable the response
variable, y (also called the dependent variable). Its value can be
explained or predicted from the value of the predictor variable, x
(also called the independent variable).
Examples of Bivariate Data
1. What is the relationship between hand-size and height?
One needs to collect two variables from each subject (hand-size and
height) in a sample of n subjects. It is a bivariate study.
We can conduct two studies: (a) To find out the relationship between
Hand-size and Height. (b) To predict Height using Hand-size and
investigate whether Hand-size is a good predictor of Height.
2. Is the weight of a car a good predictor of mileage?
One needs to collect the weight and mileage of a sample of n cars, in
order to study this problem.
We can conduct two studies: (a) To find out the relationship between
Weight and Mileage. (b) To predict Mileage using Weight and
investigate if Weight is a good predictor of Mileage or not.
Real-Time Activity
Is Hand-size a good predictor of Height?
How do you measure Hand-size?
Measure your Hand-size by measuring
Hand-Width and Hand Length as discussed in the class.
Now go to the Real-Time Online activity site at
http://stat.cst.cmich.edu/statact/
Go to Data Entry, select Activity: Hand-Size.
Use the Activity Code: To be provided in class
Is Hand-size a good predictor of Height?
Here are 20 cases from previous students (all measurements in inches).
How can we demonstrate the relationship between hand length and height?
Graphical method: scatter diagram.
Numerical method: correlation coefficient and least-squares regression.

Row  Gender  Length  Width  Height
  1  female    8.50   9.50    68.5
  2  female    8.40   9.00    68.0
  3  female    7.50   8.00    68.0
  4  female    7.25   8.00    68.0
  5  male      7.40   7.70    70.0
  6  male      7.50   8.75    71.0
  7  female    6.50   7.25    66.0
  8  male      8.00   7.00    68.0
  9  male      8.00   8.75    72.0
 10  male      8.75   9.50    76.8
 11  male      8.00   9.00    71.0
 12  female    6.00   7.50    62.0
 13  male      6.20  11.50    69.0
 14  female    6.50   7.50    61.5
 15  male      8.00   9.50    69.0
 16  female    7.00   8.00    69.0
 17  male      7.00   9.10    72.2
 18  female    6.50   7.50    63.0
 19  female    6.50   7.00    61.0
 20  female    7.25   7.50    63.5
How can we demonstrate the relationship
between Hand Length and Height?
Graphical method: A scatter diagram (scatter plot)
shows the relationship between two quantitative variables
measured on the same individual. Each individual in the data
set is represented by a point in the scatter diagram.
The predictor variable (independent variable) is plotted on
the horizontal axis and the response variable (dependent
variable) is plotted on the vertical axis.
Note: Points are not connected in a scatter diagram.
For the example of using Hand Length to predict Height
What is the response variable? ______________________
What is the predictor variable? _____________________
[Figure: Scatter plot of Height (vertical axis) vs. Hand Length (horizontal axis)]
Scatter plot using Minitab
Go to Graph, choose Scatterplot, choose Simple, select
variable name, OK.
[Figure: Scatterplot of height vs hand_length. Height axis runs from 60 to 78; hand_length axis runs from 6.0 to 9.0.]
(1) Is the relation positive?
(2) Is the relation strong?
[Figure: Example scatter plots of Y (response) vs. X (predictor) showing positive association. Panel labels: Positive, perfectly correlated; Positive, highly correlated; Positive, moderately correlated; Nonlinear, positive correlation (two panels); Nonlinear, no correlation.]
[Figure: Example scatter plots of Y vs. X showing negative association. Panel labels: Negative, perfectly correlated; Negative, highly correlated; Negative, moderately correlated; Nonlinear; No correlation (two panels); Negative correlation.]
How can we quantify the correlation?
Numerical Method: Pearson Correlation
The linear correlation coefficient or Pearson product moment
correlation coefficient is a measure of the strength of linear
relation between two quantitative variables.
NOTATION: We use
the Greek letter ρ (rho) to represent the population correlation
coefficient and
r to represent the sample correlation coefficient, where -1 ≤ r ≤ +1.
Properties of the Linear Correlation Coefficient
1. If r = +1 (or -1) there is a perfect positive (negative) linear
relation between the two variables.
2. The closer r is to +1 (or -1), the stronger the evidence of
positive (negative) association between the two variables.
3. If r is close to 0, there is no evidence of linear relation
between the two variables. Because the linear correlation
coefficient is a measure of the linear relation, r close to 0
does not imply no relation, just no linear relation.
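Property 3 can be checked numerically. The following is a minimal sketch, using made-up illustration data (not the class data) and a hypothetical helper named pearson_r: an exact straight line gives r = 1, while an exact U-shaped curve on symmetric x values gives r = 0 even though y is completely determined by x.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation computed from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Made-up illustration data (an assumption, not from the class data set)
x = [-3, -2, -1, 0, 1, 2, 3]
y_line = [2 * xi + 1 for xi in x]   # exact increasing straight line
y_curve = [xi ** 2 for xi in x]     # exact U-shaped (nonlinear) relation

r_line = pearson_r(x, y_line)
r_curve = pearson_r(x, y_curve)
print(f"line:  r = {r_line:.3f}")
print(f"curve: r = {r_curve:.3f}")
```

The curve case is why a scatter diagram should be inspected before r is interpreted.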
[Figure: The positive-association example scatter plots, now labeled with their correlation coefficients. r values shown on the panels: +1 (perfectly correlated), +.9, +.8, +.6, +.4, and r ≈ 0 (no correlation).]
[Figure: The negative-association example scatter plots, now labeled with their correlation coefficients. r values shown on the panels: -1.0 (perfectly correlated), -0.9, -0.8, -0.4, and r ≈ 0.0 (the two no-correlation panels).]
Important distinction between Association and a Cause-and-Effect
Relation between two variables
A strong relationship does not imply cause and effect!
Examples:
A study shows a strong correlation between Math IQ and foot size
for kindergarten children. Does this mean that foot size is the cause of Math
IQ for kindergarten children?
A study shows a positive correlation between CEO salary and stock
price. Does this mean that CEO salary is the cause of stock price?
In each of the above examples, the relationship is not causal. There
is a hidden variable that is related to both variables and drives the
association. This hidden variable is often called a ‘lurking variable’.
Can you identify the lurking variable for each case?
The following is the scatter plot of Height vs. Hand Length.
[Figure: Scatterplot of height vs hand_length]
How do we determine the Pearson correlation coefficient, r, for the
Hand Size and Height data?
Computing r
$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} $$
where
$SS_{xx} = (n-1)\, s_x^2$
$SS_{yy} = (n-1)\, s_y^2$
$SS_{xy} = \sum xy - n\,\bar{x}\,\bar{y}$
$s_x^2$ is the sample variance of the x variable.
$s_y^2$ is the sample variance of the y variable.
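As a check on the formula, here is a minimal Python sketch that applies it to the 20 hand-length and height cases listed earlier. The variable names are illustrative choices, and the standard-library statistics module stands in for Minitab.

```python
import math
import statistics

# The 20 cases from the class data (inches)
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

n = len(hand_length)
x_bar = statistics.mean(hand_length)
y_bar = statistics.mean(height)

# SSxx = (n-1) * sx^2 and SSyy = (n-1) * sy^2 (sample variances)
ss_xx = (n - 1) * statistics.variance(hand_length)
ss_yy = (n - 1) * statistics.variance(height)
# SSxy = sum(xy) - n * x_bar * y_bar
ss_xy = sum(x * y for x, y in zip(hand_length, height)) - n * x_bar * y_bar

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(f"r = {r:.3f}")  # r = 0.668
```

The result agrees with the value r = .668 reported by Minitab for these data.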
Compute the linear
correlation coefficient to
quantify the relationship
between Height and Hand
Length
EXAMPLE
Height & Hand Length: Compute the linear
correlation coefficient.
Sample Statistics
Variable          N    Mean     Standard Deviation
Hand Length (X)   20   7.338    0.808
Height (Y)        20   67.875   4.056
Use Minitab: Go to Stat, Basic Statistics, Correlation, select Height and
Hand-length. OK.
The computer result is r = .668. The correlation is moderately high.
Online Applet Activity:
Visualizing correlation using scatter plots
Go to the site:
http://bcs.whfreeman.com/scc/content/cat_040/spt/correlation/correlationregression.html
To review scatter plots and the correlation coefficient:
(a) Create a scatter plot using 10 pairs of data with near zero correlation.
(b) Create a scatter plot with a nonlinear relation and near zero correlation
using 10 pairs of data.
(c) Create a scatter plot with a nonlinear relation and correlation near .8 using
10 pairs of data.
(d) Create a scatter plot using 10 pairs of data with near zero correlation
and add one additional point that will greatly increase the correlation.
(e) Create a scatter plot using 10 pairs of data with correlation near one
and add one additional point that will greatly decrease the correlation.
Finding the Least-squares Regression Line
Recall: A mathematical line y = mx + b
m is the slope: the change in y when x increases by one unit.
b is the intercept: the y-value when x = 0.
Examples:
(1) Graph the line : y = 2x – 3 and determine the slope and intercept
(2) Graph the line y = (-.5)x +1 and determine the slope and intercept.
(3) Determine the line y = mx+b that passes through two points (1, 5) and (3,2)
Ans: determine the slope m = (y2-y1)/(x2-x1)=(2-5)/(3-1) = -1.5
The equation is y = (-1.5)x+b. Now, to determine the intercept b, apply a point, say,
(1,5) into the equation y=(-1.5)x+b: 5 = (-1.5)(1)+b then, solve for b = 6.5
So the equation is y=(-1.5)x + 6.5
(4) Exercise: determine the line passing through (2,3) and (4,9)
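The two-point method used in example (3) can be sketched in a few lines of Python. The function name line_through is a hypothetical helper chosen for illustration. The same function can be used to check your answer to exercise (4).

```python
def line_through(p1, p2):
    """Return (m, b) for the line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: the slope is undefined")
    m = (y2 - y1) / (x2 - x1)   # slope = rise over run
    b = y1 - m * x1             # solve y1 = m*x1 + b for b
    return m, b

# Example (3): the line through (1, 5) and (3, 2)
m, b = line_through((1, 5), (3, 2))
print(m, b)  # -1.5 6.5
```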
[Figure: Scatterplot of height vs hand_length]
How do we determine a line that can be used to predict Height using
Hand Length?
That is, we want to determine a line $\hat{y} = b_1 x + b_0$,
where $b_1$ is the slope and $b_0$ is the intercept.
An intuitive approach: draw your best-guess line, pick two
points on that line, and then obtain the equation of the line from the two
points you chose.
Use Fathom to Demonstrate
the Least-Squares Method:
Predictions, Residuals,
Sum of Squared Residuals
Problem: How well can Hand_size predict
Height?
Data: Hand_size_20cases
What is a Residual?
The difference between the observed value
of y and the predicted value of y is the error,
or residual. That is,
residual = observed – predicted
Notation:
$e_i = y_i - \hat{y}_i$
[Figure: Scatterplot of height vs hand_length with the line $\hat{y} = b_1 x + b_0$ drawn through the points and the residuals $e_1$, $e_2$, and $e_{20}$ marked]
b1 is the slope and b0 is the intercept.
One way to determine the best line is to find b1 and b0 so that the sum of
the squared residuals is the smallest.
NOTE: If we replace r by the formula for computing r,
the slope $b_1$ can be obtained by:
$$ b_1 = r\,\frac{s_y}{s_x} = \frac{SS_{xy}}{SS_{xx}} = \frac{SS_{xy}}{(n-1)\,s_x^2} $$
Predicting Height using Hand Length
(a) Find the least-squares regression line:
Sample Statistics
Variable          N    Mean                 Standard Deviation
Hand Length (X)   20   7.338 ($\bar{x}$)    0.808 ($s_x$)
Height (Y)        20   67.875 ($\bar{y}$)   4.056 ($s_y$)
$b_1 = r\, s_y / s_x$ = .668 (4.056)/(.808) = 3.353
$b_0 = \bar{y} - b_1 \bar{x}$ = 67.875 – (3.353)(7.338) = 43.27
The regression line is Height = 3.353(Hand Length) + 43.27
(b) Interpret the slope:
For each one-inch increase in Hand Length, the predicted Height increases by 3.353 inches, on average.
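The arithmetic in part (a) can be reproduced with a short sketch that uses only the rounded summary statistics above. The variable names are illustrative.

```python
# Rounded summary statistics from the Sample Statistics table
r = 0.668
x_bar, s_x = 7.338, 0.808     # hand length: mean, standard deviation
y_bar, s_y = 67.875, 4.056    # height: mean, standard deviation

b1 = r * s_y / s_x            # slope
b0 = y_bar - b1 * x_bar       # intercept

print(f"b1 = {b1:.3f}, b0 = {b0:.2f}")  # b1 = 3.353, b0 = 43.27
```

Software that works from the raw data rather than rounded summaries reports slightly different coefficients (Minitab gives 43.29 and 3.351 for these data); the small gap comes from rounding r, the means, and the standard deviations.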
Use Minitab to obtain the regression line:
Go to Stat, Regression, Fitted Line Plot, choose Y and X. OK.
[Figure: Fitted Line Plot, height = 43.29 + 3.351 hand_length; S = 3.10061, R-Sq = 44.6%, R-Sq(adj) = 41.6%]
Predicting Height using Hand Length
(c) Predict the height for individuals with Hand Length 7.5” and 6.5”,
respectively.
(d) Draw the least-squares regression line on the scatter diagram of the
data.
(e) Compute the residual, y-ŷ: In the data, there is an individual whose
Hand Length is 7.5” and whose height is 68”. Find the residual of the height
when using the model to predict the height. Do the same for (6.5”, 63”).
(f) Find the sum of the squared residuals.
(g) Does any other line yield a smaller sum of squared residuals?
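Parts (c), (e), (f), and (g) can be explored with a minimal sketch like the one below. It assumes the Minitab line ŷ = 3.351x + 43.29 and the 20 cases listed earlier. The alternative line in part (g) is an arbitrary guess chosen only for comparison.

```python
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

def predict(x, b1=3.351, b0=43.29):
    """Predicted height for hand length x using the fitted line."""
    return b1 * x + b0

def sse(b1, b0):
    """Sum of squared residuals of the line y = b1*x + b0 over the data."""
    return sum((y - (b1 * x + b0)) ** 2 for x, y in zip(hand_length, height))

# (c) predictions
pred_75 = predict(7.5)
pred_65 = predict(6.5)

# (e) residual = observed - predicted
res_75 = 68.0 - pred_75
res_65 = 63.0 - pred_65

# (f) sum of squared residuals for the fitted line
sse_fit = sse(3.351, 43.29)

# (g) an arbitrary alternative line (illustrative guess, not from the slides)
sse_alt = sse(3.0, 46.0)

print(f"predictions: {pred_75:.2f}, {pred_65:.2f}")
print(f"residuals:   {res_75:.2f}, {res_65:.2f}")
print(f"SSE fitted line: {sse_fit:.3f}, SSE alternative line: {sse_alt:.3f}")
```

Trying other slopes and intercepts in sse gives a feel for part (g): no line gives a smaller sum of squared residuals than the least-squares line.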
Predicting Height using Hand Length: ŷ = 3.351x + 43.29

Row  Gender  Length  Width  Height  ŷ (Predicted Height)  e = y - ŷ (Residual)
  1  female    8.50   9.50    68.5  ______                ______
  2  female    8.40   9.00    68.0  ______                ______
  3  female    7.50   8.00    68.0  ______                ______
  4  female    7.25   8.00    68.0  ______                ______
  5  male      7.40   7.70    70.0  ______                ______
  6  male      7.50   8.75    71.0  ______                ______
  7  female    6.50   7.25    66.0  ______                ______
  8  male      8.00   7.00    68.0  ______                ______
  9  male      8.00   8.75    72.0  ______                ______
 10  male      8.75   9.50    76.8  ______                ______
 11  male      8.00   9.00    71.0  ______                ______
 12  female    6.00   7.50    62.0  ______                ______
 13  male      6.20  11.50    69.0  ______                ______
 14  female    6.50   7.50    61.5  ______                ______
 15  male      8.00   9.50    69.0  ______                ______
 16  female    7.00   8.00    69.0  ______                ______
 17  male      7.00   9.10    72.2  ______                ______
 18  female    6.50   7.50    63.0  ______                ______
 19  female    6.50   7.00    61.0  ______                ______
 20  female    7.25   7.50    63.5  ______                ______

For each case, can you find the
predicted Height and the
corresponding residual?
Should the line be used to predict the Height
when the Hand Length = 3” or = 10”?
Do not use a least-squares regression line to make predictions
for x values far outside the scope of the model (in this case, the
observed hand lengths range from 6” to 8.75”), because we cannot be
sure the linear relation continues to hold when hand length is below 6”
or above 8.75”.
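One way to build this caution into a calculation is to have the prediction function refuse inputs outside the observed range. The sketch below is an illustration only: the function name and the strict cutoff at the observed minimum and maximum are assumptions (the slide warns about values far outside the range, so a strict cutoff is a conservative simplification).

```python
X_MIN, X_MAX = 6.0, 8.75   # observed range of hand length (inches)

def predict_height(hand_length, b1=3.351, b0=43.29):
    """Predict height from hand length, refusing to extrapolate."""
    if not (X_MIN <= hand_length <= X_MAX):
        raise ValueError(
            f"hand length {hand_length} is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the linear relation may not hold there"
        )
    return b1 * hand_length + b0

print(f"{predict_height(7.0):.2f}")   # within range: a prediction is returned

for x in (3.0, 10.0):                 # outside range: prediction is refused
    try:
        predict_height(x)
    except ValueError as err:
        print("refused:", err)
```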
Diagnostics on the Least-squares Regression Line
[Figure: Fitted Line Plot, height = 43.29 + 3.351 hand_length; S = 3.10061, R-Sq = 44.6%, R-Sq(adj) = 41.6%]
Questions:
(1) How do I know if this is a ‘good’ model? That is, how much
of the variation in Height can be explained by Hand Length?
(2) Is there any unusual Height, that is, an outlier in the response
variable Y?
(3) Is there any unusual X value that may dramatically affect the
model, that is, an influential case?
Suppose we are asked to predict Height using only the
Height data, without any information about the
relation between height and hand length. Our ‘typical guess’ is
the average height:
(1) Using the sample average of Height:
$\bar{y}$ = 67.875”
When we also have the Hand Length, and we are
asked to predict the Height of an individual whose hand length
is 7”, we can apply the model we derived:
(2) Using the least-squares regression line:
ŷ = 3.351(7) + 43.29 = 66.75”
The difference between the two predictions is the additional
information explained by the Hand Length: the explained deviation.
Total deviation: $y - \bar{y}$
Unexplained deviation: $y - \hat{y}$
Explained deviation: $\hat{y} - \bar{y}$
Total Deviation = Unexplained Deviation + Explained Deviation
$$ y - \bar{y} = (y - \hat{y}) + (\hat{y} - \bar{y}) $$
Total Deviation
= Unexplained Deviation + Explained Deviation
Total Variation
= Unexplained Variation + Explained Variation
which is computed as follows:
Total Sum of Squares = Error Sum of Squares + Regression Sum of Squares
$$ \sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2 $$
SS(Total) = SS(Error) + SS(Regression)
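The decomposition can be verified numerically. The following is a minimal sketch that fits the least-squares line to the 20 hand-length and height cases listed earlier and computes the three sums of squares directly from their definitions. Variable names are illustrative.

```python
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

n = len(hand_length)
x_bar = sum(hand_length) / n
y_bar = sum(height) / n

# Least-squares slope and intercept: b1 = SSxy / SSxx, b0 = y_bar - b1 * x_bar
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hand_length, height))
ss_xx = sum((x - x_bar) ** 2 for x in hand_length)
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

fitted = [b1 * x + b0 for x in hand_length]

ss_total = sum((y - y_bar) ** 2 for y in height)                 # total
ss_error = sum((y - f) ** 2 for y, f in zip(height, fitted))     # unexplained
ss_regression = sum((f - y_bar) ** 2 for f in fitted)            # explained

print(f"SS(Total)      = {ss_total:.2f}")
print(f"SS(Error)      = {ss_error:.2f}")
print(f"SS(Regression) = {ss_regression:.2f}")
print(f"Error + Regression = {ss_error + ss_regression:.2f}")
```

The identity holds exactly only for the least-squares line; for any other line the cross term does not vanish and the two pieces no longer add up to SS(Total).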
$R^2$: Coefficient of Determination
[Diagram: SS(Total) is split into two parts: the variation explained by the X variable (SS due to Regression) and the variation due to error (the sum of squared residuals). $R^2$ is the proportion of variation explained by the X variable.]
$$ 1 = \frac{SS(Total)}{SS(Total)} = \frac{SS(Error)}{SS(Total)} + \frac{SS(Regression)}{SS(Total)} $$
$$ R^2 = \frac{SS(Regression)}{SS(Total)} \times 100\% = \left(1 - \frac{SS(Error)}{SS(Total)}\right) \times 100\% $$
$R^2$ is called the coefficient of determination. It is the
percent of variation in the response variable that the predictor
variable can explain.
Where do I find SS(Total), SS(Error) and
SS(Regression)?
This information can be easily obtained from computer output.
The regression equation is
Height = 43.29 + 3.351 Hand_length
S = 3.10061   R-Sq = 44.6%   R-Sq(adj) = 41.6%
Analysis of Variance
Source       DF   SS        MS        F       P
Regression    1   139.469   139.469   14.51   0.001
Error        18   173.048     9.614
Total        19   312.518
SS(Total) = SS(Error) + SS(Regression)
The coefficient of determination $R^2$ is the percent of variation in
the response variable that is explained by variation in the
predictor variable.
$R^2$ = SS(Regression)/SS(Total)
= 1 - SS(Error)/SS(Total)
To determine $R^2$ for the simple linear regression model,
simply square the value of the linear correlation
coefficient: $R^2 = r^2 \times 100\%$. [NOTE:
This method does not work for regression equations that have more than
one predictor variable.]
Determining the Coefficient of Determination for
the model: Predicting Height using Hand Length
Find and interpret the coefficient of determination for the
model of predicting Height using Hand Length:
$R^2$ = (139.469/312.518) × 100% = 44.6%
OR: use r = .668
$R^2$ = (.668)^2 × 100% = 44.6%
Hand Length explains 44.6% of the variation in Height.
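The three routes to $R^2$ can be compared with a short sketch that uses the numbers from the Analysis of Variance table and the reported r. Variable names are illustrative.

```python
# Values taken from the Minitab output
ss_regression = 139.469
ss_error = 173.048
ss_total = 312.518
r = 0.668

r2_from_regression = ss_regression / ss_total * 100    # SS(Regression)/SS(Total)
r2_from_error = (1 - ss_error / ss_total) * 100        # 1 - SS(Error)/SS(Total)
r2_from_r = r ** 2 * 100                               # square of r

print(f"{r2_from_regression:.1f}%  {r2_from_error:.1f}%  {r2_from_r:.1f}%")
# 44.6%  44.6%  44.6%
```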
Some concept questions
Determine whether each of the following statements is true or false:
• If the Pearson coefficient r > 0, then the slope $b_1$ > 0.
• If r = 0, it is possible that $b_1$ > 0.
• If r < 0, then $R^2$ < 0.
• If $R^2$ = .64 and $b_1$ = 2.35, then r = .8.
• If $R^2$ = .64 and $b_1$ = -2.35, then r = .8.
NOTE:
(a) The slope must have the same sign as the correlation coefficient, r.
(b) $R^2$ cannot be negative.
(c) To compute r from $R^2$ for simple linear regression:
$r = \pm\sqrt{R^2}$
r can be positive or negative; the sign of r is the same as the
sign of $b_1$, the slope.
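Note (c) can be written as a small function. This is a sketch; the name r_from_r2 is a hypothetical helper, and $R^2$ is passed as a proportion between 0 and 1 rather than as a percent.

```python
import math

def r_from_r2(r_squared, slope):
    """Recover r from R^2 (as a proportion) using the sign of the slope b1."""
    if not 0 <= r_squared <= 1:
        raise ValueError("R^2 must be between 0 and 1")
    if slope == 0:
        return 0.0                      # a flat line means no linear relation
    magnitude = math.sqrt(r_squared)
    return magnitude if slope > 0 else -magnitude

r_pos = r_from_r2(0.64, 2.35)    # positive slope
r_neg = r_from_r2(0.64, -2.35)   # negative slope
print(r_pos, r_neg)
```

Comparing the two printed values with the last two concept questions shows why the sign of the slope matters.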
How do we know if the model is adequate?
• Is a linear model adequate ?
• Are there any outliers in response variable?
• Are there any influential cases?
All of these questions can be answered by analyzing the
residuals, $e_i$.
Online Applet Activity:
Visualizing the effects of outliers and influential cases
using scatter plots
Go to the site:
http://bcs.whfreeman.com/scc/content/cat_040/spt/correlation/correlationregression.html
(a) Create a scatter plot of 10 cases with high positive correlation. Add one case at the far
lower-right corner, and observe the change in the correlation coefficient and in
the regression line (pay special attention to the change in the slope). What do you find?
(b) Create 10 pairs of data points with one in the upper-right corner, such that the rest show a high
negative correlation and the regression line has almost zero slope. Delete the upper-right
corner case, and observe the effect of deleting it on the
slope and correlation.
(c) Create 10 pairs of cases with high negative correlation and one case (the outlier case) whose
x-value is around the middle and whose y-value is much higher than the rest. Now delete the
outlier case, and observe the change in the slope and correlation.
Useful Residual Plots for Model Diagnosis
1. The residuals vs. the order of the data: If the linear model is
adequate, this plot should look random, with no identifiable
pattern. A curved pattern
indicates that the relationship between x and y is not linear.
[Figure: Two plots of residuals vs. observation order (panel titles: "Scatterplot of residual vs order" and "Residuals Versus the Order of the Data (response is expor)"). One panel shows a clear nonlinear pattern of the residuals vs. data order: the linear model is not adequate. The other shows no specific pattern: the model is adequate.]
2. Plot of residuals vs. predicted Y
If the model is adequate, the residuals should show no specific
pattern along the zero line.
Two common problems can be identified from this plot:
(a) A curved pattern indicates that the relation between Y and X is
nonlinear.
(b) Residuals that are far away from zero indicate that there are
outliers in the response variable.
[Figure: Two plots of residuals vs. fitted (predicted) Y values, response is Y. One panel shows no unusual pattern: the model is adequate. The other shows a nonlinear pattern: the relation is not linear.]
3. A plot of residuals against the predictor variable may also
reveal outliers. These values will be easy to identify
because the residual will lie far from the rest of the plot.
[Figure: Residual plot with an outlier case marked]
The Effect of Influential Observations
An influential observation is one that has a
disproportionate effect on the value of the slope and y-intercept in the least-squares regression equation.
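The effect of a single influential case can be seen in a small experiment. The sketch below uses made-up data (not the class data): ten points with a strong positive linear relation, then one extra case placed far to the lower right. The helper name fit is illustrative.

```python
import math

def fit(x, y):
    """Return (slope, intercept, r) of the least-squares line for x, y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    return b1, b0, r

# Made-up illustration data: roughly y = 2x with small scatter
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

b1_before, b0_before, r_before = fit(x, y)

# Add one case far to the right with a very low y value
b1_after, b0_after, r_after = fit(x + [20], y + [2.0])

print(f"before: slope = {b1_before:.2f}, r = {r_before:.3f}")
print(f"after:  slope = {b1_after:.2f}, r = {r_after:.3f}")
```

One added point is enough to flatten the slope and pull r from near 1 down toward 0, which is the behavior the applet activity asks you to observe.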
Re-visit the Online Applet Activity:
Visualizing the effects of outliers and
influential cases using scatter plots
Open the activity worksheet:
Online Applet Activity-Outlier&InfluCases(Reg&Corr).
Work with your group to answer the questions asked in
the worksheet. We will work on some of the problems in
class, and your team will complete the rest.
It is due next class period.
If there are outliers or influential cases,
how do we deal with them?
As with outliers, influential observations should be removed only
if there is justification to do so.
When an influential observation occurs in a data set and its
removal is not warranted, there are two courses of action:
(1) Collect more data so that additional points near the influential
observation are obtained, or
(2) Use more advanced techniques, such as transformations
(for example, log transformations).
Activity: How well can Hand_length or
Hand_width predict height?
Open the activity worksheet:
Activity-Regression-Hand-size
Work with your group to answer the questions asked in the
worksheet.
We will work on some of the problems, and you will complete
the rest after class.
It is due next class period.