Chapter 5
Summarizing Bivariate Data
Terms

A multivariate data set consists of measurements or observations on each of two or more variables.

The classroom data set introduced in the slides for Chapter 3 is a multivariate data set. It includes observations on the variables age, weight, height, gender, vision (correction method), and smoke (status). Age, weight, and height are numerical variables, while gender, vision, and smoke are categorical variables.
Copyright (c) 2001 Brooks/Cole, a division of Thomson Learning, Inc.
Terms
• A bivariate data set consists of measurements or observations on each of two variables.

For the rest of this chapter we will concentrate on bivariate data sets where both variables are numerical.
Scatterplots
• A scatterplot is a plot of pairs of observed values (both quantitative) of two different variables.

When one of the variables is considered to be a response variable (y) and the other an explanatory variable (x), the explanatory variable is usually plotted on the x axis.
Example
A sample of one-way Greyhound bus fares from Rochester, NY to cities less than 750 miles away was taken from Greyhound's website. The following table gives the destination city, the distance, and the one-way fare. Distance should be on the x axis and fare on the y axis.
Destination City      Distance   Standard One-Way Fare
Albany, NY               240        $39
Baltimore, MD            430        $81
Buffalo, NY               69        $17
Chicago, IL              607        $96
Cleveland, OH            257        $61
Montreal, QC             480        $70.50
New York City, NY        340        $65
Ottawa, ON               467        $82
Philadelphia, PA         335        $67
Potsdam, NY              239        $47
Syracuse, NY              95        $20
Toronto, ON              178        $35
Washington, DC           496        $87
Example Scatterplot

[Scatterplot titled "Greyhound Bus Fares Vs. Distance": Standard One-Way Fare ($10 to $100) plotted against Distance from Rochester, NY (50 to 650 miles).]
Comments

• The axes need not intersect at (0,0).
• For each axis, the scale should be chosen so that the minimum and maximum values on the scale are convenient and the values to be plotted all lie between them.
• Notice that for this example:
  1. The x axis (distance) runs from 50 to 650 miles, where the data points are between 69 and 607.
  2. The y axis (fare) runs from $10 to $100, where the data points are between $17 and $96.
Further Comments

1. It is possible that two points might have the same x value with different y values. Notice that Potsdam (239) and Albany (240) come very close to having the same x value, but the y values are $8 apart. Clearly, the value of y is not determined solely by the x value (there are factors other than distance that affect the fare).
2. In this example, the y value tends to increase as x increases. We say that there is a positive relationship between the variables distance and fare.
3. It appears that the y value (fare) could be predicted reasonably well from the x value (distance) by finding a line that is close to the points in the plot.
Association

• Positive Association - Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values similarly tend to occur together. (I.e., generally speaking, the y values tend to increase as the x values increase.)
• Negative Association - Two variables are negatively associated when above-average values of one accompany below-average values of the other, and vice versa. (I.e., generally speaking, the y values tend to decrease as the x values increase.)
The Pearson Correlation Coefficient

A measure of the strength of the linear relationship between two variables is the Pearson correlation coefficient.

The Pearson sample correlation coefficient is defined by

r = Σ zx zy / (n - 1) = Σ [((x - x̄)/sx) ((y - ȳ)/sy)] / (n - 1)
Example Calculation

  x      y     (x-x̄)/sx   (y-ȳ)/sy   product
 240    39     -0.5214    -0.7856    0.4096
 430    81      0.6357     0.8610    0.5473
  69    17     -1.5627    -1.6481    2.5755
 607    96      1.7135     1.4491    2.4831
 257    61     -0.4178     0.0769   -0.0321
 480    70.5    0.9402     0.4494    0.4225
 340    65      0.0876     0.2337    0.0205
 467    82      0.8610     0.9002    0.7751
 335    67      0.0571     0.3121    0.0178
 239    47     -0.5275    -0.4720    0.2489
  95    20     -1.4044    -1.5305    2.1494
 178    35     -0.8989    -0.9424    0.8472
 496    87      1.0376     1.0962    1.1374
                          Sum:      11.6021

x̄ = 325.615, sx = 164.2125, ȳ = 59.0385, sy = 25.506

r = 11.6021 / (13 - 1) = 0.9668
Some Correlation Pictures

[Six slides of example scatterplots illustrating correlations of different strengths and signs; figures omitted.]
Properties of r

1. The value of r does not depend on the unit of measurement for each variable.
2. The value of r does not depend on which of the two variables is labeled x.
3. The value of r is between -1 and +1.
4. The correlation coefficient is
   a) -1 only when all the points lie on a downward-sloping line, and
   b) +1 only when all the points lie on an upward-sloping line.
5. The value of r is a measure of the extent to which x and y are linearly related.
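Properties 1 and 2 can be checked numerically. The sketch below (plain Python; the dollars-to-cents conversion is just an illustrative unit change, not part of the original data) recomputes r after rescaling y and after swapping the roles of x and y:

```python
# Demonstrate properties 1 and 2 of r on the Greyhound data:
# r is unaffected by the units of measurement and by which variable is x.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

miles = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
dollars = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

r = pearson_r(miles, dollars)
r_cents = pearson_r(miles, [100 * f for f in dollars])  # fares in cents
r_swapped = pearson_r(dollars, miles)                   # x and y exchanged

print(round(r, 4), round(r_cents, 4), round(r_swapped, 4))  # 0.9668 0.9668 0.9668
```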
Linear Relations

The relationship y = a + bx is the equation of a straight line. The value b, called the slope of the line, is the amount by which y increases when x increases by 1 unit. The value a, called the intercept (or sometimes the vertical intercept) of the line, is the height of the line above the value x = 0.
Example

[Graph of the line y = 7 + 3x: when x increases by 1, y increases by b = 3; the intercept is a = 7.]
Example

[Graph of the line y = 17 - 4x: when x increases by 1, y changes by b = -4 (i.e., decreases by 4); the intercept is a = 17.]
Least Squares Line

The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) is the sum of the squared deviations about the line:

Σ[y - (a + bx)]² = [y₁ - (a + bx₁)]² + [y₂ - (a + bx₂)]² + … + [yₙ - (a + bxₙ)]²

The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.
Coefficients a and b

The slope of the least squares line is

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

and the y intercept is

a = ȳ - bx̄

We write the equation of the least squares line as

ŷ = a + bx

where the ^ above y emphasizes that ŷ (read as y-hat) is a prediction of y resulting from the substitution of a particular x value into the equation.
Calculating Formula for b

b = [Σxy - (Σx)(Σy)/n] / [Σx² - (Σx)²/n]
Greyhound Example Continued

  x      y       x - x̄      (x - x̄)²     y - ȳ    (x - x̄)(y - ȳ)
 240    39      -85.615     7329.994   -20.038      1715.60
 430    81      104.385    10896.148    21.962      2292.45
  69    17     -256.615    65851.456   -42.038     10787.72
 607    96      281.385    79177.302    36.962     10400.41
 257    61      -68.615     4708.071     1.962      -134.59
 480    70.5    154.385    23834.609    11.462      1769.49
 340    65       14.385      206.917     5.962        85.75
 467    82      141.385    19989.609    22.962      3246.41
 335    67        9.385       88.071     7.962        74.72
 239    47      -86.615     7502.225   -12.038      1042.72
  95    20     -230.615    53183.456   -39.038      9002.87
 178    35     -147.615    21790.302   -24.038      3548.45
 496    87      170.385    29030.917    27.962      4764.22
4233   767.5               323589.08               48596.19
Calculations

From the previous slide we have

Σ(x - x̄)(y - ȳ) = 48596.19 and Σ(x - x̄)² = 323589.08

So

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 48596.19 / 323589.08 = 0.15018

Also n = 13, Σx = 4233, and Σy = 767.5, so

x̄ = 4233/13 = 325.615 and ȳ = 767.5/13 = 59.0385

This gives

a = ȳ - bx̄ = 59.0385 - 0.15018(325.615) = 10.138

The regression line is ŷ = 10.138 + 0.15018x.
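These hand calculations can be confirmed with a short script; the sketch below (plain Python) computes b and a from the raw Greyhound data using the definition formulas:

```python
# Least squares slope and intercept for the Greyhound data,
# using b = S_xy / S_xx and a = ybar - b * xbar.
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # 48596.19
sxx = sum((xi - xbar) ** 2 for xi in x)                       # 323589.08

b = sxy / sxx          # slope
a = ybar - b * xbar    # intercept

print(round(b, 5), round(a, 3))  # 0.15018 10.138
```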
Minitab Graph

The following is the output from a Minitab command to graph the regression line.

[Regression Plot: Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%. Standard Fare ($15 to $105) plotted against Distance (0 to 600 miles) with the fitted line.]
Greyhound Example Revisited

  x      y        x²        xy
 240    39      57600      9360
 430    81     184900     34830
  69    17       4761      1173
 607    96     368449     58272
 257    61      66049     15677
 480    70.5   230400     33840
 340    65     115600     22100
 467    82     218089     38294
 335    67     112225     22445
 239    47      57121     11233
  95    20       9025      1900
 178    35      31684      6230
 496    87     246016     43152
4233   767.5  1701919    298506
Greyhound Example Revisited

Using the calculating formula we have:

n = 13, Σx = 4233, Σy = 767.5, Σx² = 1701919, and Σxy = 298506

so

b = [Σxy - (Σx)(Σy)/n] / [Σx² - (Σx)²/n]
  = [298506 - (4233)(767.5)/13] / [1701919 - (4233)²/13]
  = 48596.19 / 323589.1
  = 0.15018

As before, a = ȳ - bx̄ = 59.0385 - 0.15018(325.615) = 10.138, and the regression line is ŷ = 10.138 + 0.15018x.

Notice that we get the same result.
Three Important Questions

To examine how useful or effective the line is in summarizing the relationship between x and y, we consider the following three questions.

1. Is a line an appropriate way to summarize the relationship between the two variables?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the regression line to make predictions?
3. If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?
Terminology

The predicted or fitted values result from substituting each sample x value into the equation for the least squares line. This gives

ŷ₁ = a + bx₁ = 1st predicted value
ŷ₂ = a + bx₂ = 2nd predicted value
…
ŷₙ = a + bxₙ = nth predicted value

The residuals for the least squares line are the values y₁ - ŷ₁, y₂ - ŷ₂, …, yₙ - ŷₙ.
Greyhound Example Continued

  x      y     Predicted value      Residual
               ŷ = 10.1 + 0.150x     y - ŷ
 240    39          46.18            -7.181
 430    81          74.72             6.285
  69    17          20.50            -3.500
 607    96         101.30            -5.297
 257    61          48.73            12.266
 480    70.5        82.22           -11.724
 340    65          61.20             3.801
 467    82          80.27             1.728
 335    67          60.45             6.552
 239    47          46.03             0.969
  95    20          24.41            -4.405
 178    35          36.87            -1.870
 496    87          84.63             2.373
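The fitted values and residuals in the table can be reproduced as follows (plain Python; the coefficients are recomputed from the data rather than taken from the rounded line ŷ = 10.1 + 0.150x):

```python
# Fitted values and residuals for the Greyhound least squares line.
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

yhat = [a + b * xi for xi in x]                 # predicted fares
resid = [yi - yh for yi, yh in zip(y, yhat)]    # observed minus predicted

# The extreme residuals flag unusual fares: Cleveland (+12.27) sits well
# above the line, Montreal (-11.72) well below it.
print(round(resid[0], 2))  # Albany: -7.18
```

Residuals from a least squares fit always sum to zero (up to rounding), which is a handy sanity check on the arithmetic.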
Residual Plot

A residual plot is a scatterplot of the data pairs (x, residual). The plot below was produced by Minitab from the Greyhound example. [Plot omitted.]
Residual Plot – What to Look For

• Isolated points or patterns indicate potential problems.
• Ideally the points should be randomly spread out above and below zero.

[Residual plot with points scattered randomly above and below the line residual = 0. A plot like this indicates no systematic bias in using the least squares line to predict the y value; generally this is the kind of pattern that you would like to see.]

Note:
1. Values below 0 indicate over-prediction.
2. Values above 0 indicate under-prediction.
The Greyhound example continued

[Residual plot for the Greyhound fit, with regions marked "Predicted fares are too high" (short distances) and "Predicted fares are too low" (middle distances).]

For the Greyhound example, it appears that the line systematically predicts fares that are too high for cities close to Rochester and fares that are too low for most cities between 200 and 500 miles away.
More Residual Plots

Another common type of residual plot is a scatterplot of the data pairs (ŷ, residual). The plot below was produced by Minitab for the Greyhound data. Notice that this residual plot shows the same type of systematic problems with the model. [Plot omitted.]
Coefficient of Determination

The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
Definition formulae

The total sum of squares, denoted by SSTo, is defined as

SSTo = (y₁ - ȳ)² + (y₂ - ȳ)² + … + (yₙ - ȳ)² = Σ(y - ȳ)²

The residual sum of squares, denoted by SSResid, is defined as

SSResid = (y₁ - ŷ₁)² + (y₂ - ŷ₂)² + … + (yₙ - ŷₙ)² = Σ(y - ŷ)²
Calculational formulae

SSTo and SSResid are generally found as part of the standard output from most statistical packages, or can be obtained using the following computational formulas:

SSTo = Σy² - (Σy)²/n

SSResid = Σy² - aΣy - bΣxy

The coefficient of determination, r², can be computed as

r² = 1 - SSResid/SSTo
Greyhound example revisited

n = 13, Σy = 767.5, Σy² = 53119.25, Σxy = 298506,
b = 0.150179, and a = 10.1380

SSTo = Σy² - (Σy)²/n = 53119.25 - (767.5)²/13 = 7807.23

SSResid = Σy² - aΣy - bΣxy
        = 53119.25 - 10.1380(767.5) - 0.150179(298506)
        = 509.117
Greyhound example revisited

r² = 1 - SSResid/SSTo = 1 - 509.117/7807.23 = 0.9348

We can say that 93.5% of the variation in fare (y) can be attributed to the approximate linear relationship between distance (x) and fare.
More on variability

The standard deviation about the least squares line is denoted se and given by

se = √(SSResid / (n - 2))

se is interpreted as the "typical" amount by which an observation deviates from the least squares line.
Greyhound example revisited

se = √(SSResid / (n - 2)) = √(509.117 / 11) = $6.80

The "typical" deviation of an actual fare from the prediction is $6.80.
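SSTo, SSResid, r², and se can all be checked together from the definition formulae; a minimal sketch in plain Python:

```python
# SSTo, SSResid, r^2, and s_e for the Greyhound data, computed
# directly from sums of squared deviations.
from math import sqrt

x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

ssto = sum((yi - ybar) ** 2 for yi in y)                        # total SS
ssresid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) # residual SS

r2 = 1 - ssresid / ssto          # coefficient of determination
se = sqrt(ssresid / (n - 2))     # standard deviation about the line

print(round(ssto, 1), round(ssresid, 1), round(r2, 4), round(se, 2))
# 7807.2 509.1 0.9348 6.8
```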
Minitab output for Regression

Regression Analysis: Standard Fare versus Distance

The regression equation is
Standard Fare = 10.1 + 0.150 Distance        <- least squares regression line

Predictor       Coef        SE Coef        T        P
Constant       10.138  (a)    4.327      2.34    0.039
Distance      0.15018  (b)  0.01196     12.56    0.000

S = 6.803 (se)    R-Sq = 93.5% (r²)    R-Sq(adj) = 92.9%

Analysis of Variance
Source            DF        SS        MS        F        P
Regression         1     7298.1    7298.1   157.68    0.000
Residual Error    11      509.1 (SSResid)   46.3
Total             12     7807.2 (SSTo)
The Greyhound problem with additional data
The sample of fares and mileages from
Rochester was extended to cover a total of 20
cities throughout the country. The resulting
data and a scatterplot are given on the next
few slides.
Extended Greyhound Fare Sample

                    Distance   Standard Fare
Buffalo, NY             69         17
New York City          340         65
Cleveland, OH          257         61
Baltimore, MD          430         81
Washington, DC         496         87
Atlanta, GA            998        115
Chicago, IL            607         96
San Francisco         2861        159
Seattle, WA           2848        159
Philadelphia, PA       335         67
Orlando, FL           1478        109
Phoenix, AZ           2569        149
Houston, TX           1671        129
New Orleans, LA       1381        119
Syracuse, NY            95         20
Albany, NY             240         39
Potsdam, NY            239         47
Toronto, ON            178         35
Ottawa, ON             467         82
Montreal, QC           480         70.5
Extended Greyhound Fare Sample

[Scatterplot: Standard Fare (0 to 150) plotted against Distance (0 to 3000 miles) for the extended sample; the pattern is curved rather than linear.]
Extended Greyhound Fare Sample

[Regression Plot: Standard Fare = 46.0582 + 0.0435354 Distance; S = 17.4230, R-Sq = 84.9%, R-Sq(adj) = 84.1%.]

Minitab reports the correlation coefficient r = 0.921, R² = 0.849, se = $17.42, and the regression line

Standard Fare = 46.058 + 0.043535 Distance

Notice that even though the correlation coefficient is reasonably high and 84.9% of the variation in fare is explained, the linear model is not very usable.
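The Minitab results for the extended sample can be double-checked with a short script (plain Python; the comparisons are approximate, since Minitab reports rounded output):

```python
# Linear fit and correlation for the extended 20-city sample,
# checking the Minitab output quoted above.
from math import sqrt

dist = [69, 340, 257, 430, 496, 998, 607, 2861, 2848, 335,
        1478, 2569, 1671, 1381, 95, 240, 239, 178, 467, 480]
fare = [17, 65, 61, 81, 87, 115, 96, 159, 159, 67,
        109, 149, 129, 119, 20, 39, 47, 35, 82, 70.5]
n = len(dist)

xbar, ybar = sum(dist) / n, sum(fare) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(dist, fare))
sxx = sum((x - xbar) ** 2 for x in dist)
syy = sum((y - ybar) ** 2 for y in fare)

b = sxy / sxx              # about 0.0435
a = ybar - b * xbar        # about 46.06
r = sxy / sqrt(sxx * syy)  # about 0.921

print(round(b, 4), round(a, 2), round(r, 3))
```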
Nonlinear Regression Example

                    Distance   Log10(Distance)   Standard Fare
Buffalo, NY             69        1.83885             17
New York City          340        2.53148             65
Cleveland, OH          257        2.40993             61
Baltimore, MD          430        2.63347             81
Washington, DC         496        2.69548             87
Atlanta, GA            998        2.99913            115
Chicago, IL            607        2.78319             96
San Francisco         2861        3.45652            159
Seattle, WA           2848        3.45454            159
Philadelphia, PA       335        2.52504             67
Orlando, FL           1478        3.16967            109
Phoenix, AZ           2569        3.40976            149
Houston, TX           1671        3.22298            129
New Orleans, LA       1381        3.14019            119
Syracuse, NY            95        1.97772             20
Albany, NY             240        2.38021             39
Potsdam, NY            239        2.37840             47
Toronto, ON            178        2.25042             35
Ottawa, ON             467        2.66932             82
Montreal, QC           480        2.68124             70.5
Extended Greyhound Fare Sample - Nonlinear Regression

From the earlier scatterplot we can see that the pattern does not look linear; it appears to have a curved shape. We sometimes replace one or more of the variables with a transformation of that variable and then perform a linear regression on the transformed variables. This can sometimes lead to a useful prediction equation.

For this particular data set, the shape of the curve is almost logarithmic, so we might try replacing the distance with log10(distance) [the logarithm of the distance to the base 10].
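Carrying out this transformed regression follows the same computing formulas as before; the sketch below (plain Python, using math.log10) fits fare against log10(distance) and should land close to the Minitab coefficients quoted on the next slide:

```python
# Fit Standard Fare against log10(Distance) for the 20-city sample,
# i.e. an ordinary least squares fit on the transformed predictor.
from math import log10

dist = [69, 340, 257, 430, 496, 998, 607, 2861, 2848, 335,
        1478, 2569, 1671, 1381, 95, 240, 239, 178, 467, 480]
fare = [17, 65, 61, 81, 87, 115, 96, 159, 159, 67,
        109, 149, 129, 119, 20, 39, 47, 35, 82, 70.5]

logd = [log10(d) for d in dist]   # transformed predictor
n = len(fare)
xbar, ybar = sum(logd) / n, sum(fare) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(logd, fare))
sxx = sum((x - xbar) ** 2 for x in logd)

b = sxy / sxx                     # slope, about 91.0
a = ybar - b * xbar               # intercept, about -163.25
ssto = sum((y - ybar) ** 2 for y in fare)
ssresid = sum((y - (a + b * x)) ** 2 for x, y in zip(logd, fare))
r2 = 1 - ssresid / ssto           # about 0.969

print(round(b, 2), round(a, 2), round(r2, 3))
```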
Nonlinear Regression Example

Minitab provides the following output. Note the high r²: 96.9% of the variation is attributed to the model, and the typical error is $7.87, which is reasonably good.

Regression Analysis: Standard Fare versus Log10(Distance)

The regression equation is
Standard Fare = - 163 + 91.0 Log10(Distance)

Predictor       Coef      SE Coef        T        P
Constant     -163.25        10.59    -15.41    0.000
Log10(Di      91.039        3.826     23.80    0.000

S = 7.869    R-Sq = 96.9%    R-Sq(adj) = 96.7%

Analysis of Variance
Source            DF       SS       MS        F        P
Regression         1    35068    35068   566.30    0.000
Residual Error    18     1115       62
Total             19    36183

Unusual Observations
Obs   Log10(Di   Standard      Fit   SE Fit   Residual   St Resid
 11       3.17     109.00   125.32     2.43     -16.32     -2.18R

R denotes an observation with a large standardized residual

The only outlier is Orlando and, as you'll see from the next two slides, it is not too bad.
Nonlinear Regression Example

Looking at the plot of the residuals against distance, we see some problems. The model overestimates fares for middle distances (1000 to 2000 miles) and underestimates for longer distances (more than 2000 miles). [Residual plot omitted.]
Nonlinear Regression Example

When we look at the prediction curve on a graph with Standard Fare and Log10(Distance) axes, the result looks reasonably linear.

[Regression Plot: Standard Fare = -163.246 + 91.0389 Log10(Distance); S = 7.86930, R-Sq = 96.9%, R-Sq(adj) = 96.7%. Standard Fare (0 to 150) plotted against Log10(Distance) (2.0 to 3.5).]
Nonlinear Regression Example

When we look at how the prediction curve looks on a graph with the Standard Fare and Distance axes, the result appears to work fairly well. By and large, this prediction model for the fares appears to work reasonably well.

[Plot: the fitted curve overlaid on the scatterplot of Standard Fare (0 to 150) versus Distance (0 to 3000 miles), labeled "Prediction Model".]