Chapter 2 Looking at Data - Relationships

Relations Among Variables
• Response variable - Outcome measurement (or
characteristic) of a study. Also called: dependent
variable, outcome, and endpoint. Labelled as y.
• Explanatory variable - Condition that explains or
causes changes in response variables. Also called:
independent variable and predictor. Labelled as x.
• Theories usually are generated about relationships
among variables and statistical methods can be used to
test them.
• Research questions are stated such as: Do changes in x
cause changes in y?
Scatterplots
• Identify the explanatory and response variables of
interest, and label them as x and y
• Obtain a set of n individuals and observe the pair
(x_i, y_i) for each individual. There will be n pairs.
• Statistical convention has the response variable (y)
placed on the vertical (up/down) axis and the
explanatory variable (x) placed on the horizontal
(left/right) axis. (Note: economists reverse axes in
price/quantity demand plots)
• Plot the n pairs of points (x,y) on the graph
France August 2003 Heat Wave Deaths
• Individuals: 13 cities in France
• Response: Excess Deaths (%), Aug 1-19, 2003 vs 1999-2002
• Explanatory Variable: Change in Mean Temp in period (°C)
• Data:

City         Dth03   Dth9902   %chng (y)   Degchg (x)
Lille          200     192.3          4          4.0
Marseilles     571     456.8         25          4.3
Grenoble       148     115.6         28          6.3
Rennes         156     114.7         36          5.6
Toulouse       315     231.6         36          6.6
Bordeaux       318     222.4         43          6.2
Strasbourg     253     167.5         51          5.9
Nice           341     222.9         53          4.3
Poitiers       184     102.8         79          7.3
Lyon           447     248.3         80          6.8
Le Mans        204     112.1         82          7.0
Dijon          168      87.0         93          7.4
Paris         1854     766.1        142          6.7
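The %chng column can be reproduced from the two death counts. The sketch below assumes that Dth9902 is the baseline number of deaths for the same period in 1999-2002 and that %chng is the percent excess of Dth03 over that baseline, rounded to a whole percent; the slide does not spell out either definition, and the variable names are illustrative.

```python
# Death counts in table order (first row to Paris).
dth03 = [200, 571, 148, 156, 315, 318, 253, 341, 184, 447, 204, 168, 1854]
# Assumed baseline: deaths for the same period in 1999-2002.
dth9902 = [192.3, 456.8, 115.6, 114.7, 231.6, 222.4, 167.5,
           222.9, 102.8, 248.3, 112.1, 87.0, 766.1]

# Percent excess mortality: 100 * (observed / baseline - 1), rounded.
pct_change = [round(100 * (obs / base - 1)) for obs, base in zip(dth03, dth9902)]

print(pct_change)
```

Under these assumptions the computed values agree with the %chng (y) column in the table.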
[Figure: 2003 France Heat Wave Mortality. Scatterplot of Excess Mortality (%) versus Change in Mean Temp (Celsius), with one point flagged as a possible outlier.]
Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Explanatory (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw Data and scatterplot of Score vs LSD concentration:

LSD Conc (x)   Score (y)
1.17           78.93
2.97           58.20
3.26           67.47
4.69           37.47
5.83           45.65
6.00           32.92
6.41           29.97

[Figure: Scatterplot of SCORE versus LSD_CONC.]
Source: Wagner, et al (1968)
Manufacturer Production/Cost Relation
x = Amount Produced, y = Total Cost, n = 48 months (not in order)
Month   Prod    Cost     Month   Prod    Cost     Month   Prod    Cost
1       46.75   92.64    17      36.54   91.56    33      32.26   66.71
2       42.18   88.81    18      37.03   84.12    34      30.97   64.37
3       41.86   86.44    19      36.60   81.22    35      28.20   56.09
4       43.29   88.80    20      37.58   83.35    36      24.58   50.25
5       42.12   86.38    21      36.48   82.29    37      20.25   43.65
6       41.78   89.87    22      38.25   80.92    38      17.09   38.01
7       41.47   88.53    23      37.26   76.92    39      14.35   31.40
8       42.21   91.11    24      38.59   78.35    40      13.11   29.45
9       41.03   81.22    25      40.89   74.57    41       9.50   29.02
10      39.84   83.72    26      37.66   71.60    42       9.74   19.05
11      39.15   84.54    27      38.79   65.64    43       9.34   20.36
12      39.20   85.66    28      38.78   62.09    44       7.51   17.68
13      39.52   85.87    29      36.70   61.66    45       8.35   19.23
14      38.05   85.23    30      35.10   77.14    46       6.25   14.92
15      39.16   87.75    31      33.75   75.47    47       5.45   11.44
16      38.59   92.62    32      34.29   70.37    48       3.79   12.69
[Figure: Production (x) / Cost (y) Relation. Scatterplot of Total Cost versus Total Production.]
Correlation
• Numerical measure to summarize the strength of the
linear (straight-line) association between two variables
• Bounded between -1 and +1 (Labelled as r)
– Values near -1 → Strong Negative association
– Values near 0 → Weak or no association
– Values near +1 → Strong Positive association
• Not affected by a change of units (linear transformation
with positive slope) of either x or y
• Does not distinguish between response and explanatory
variable (x and y can be interchanged)
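The symmetry and change-of-units properties listed above can be checked numerically. The following is a minimal sketch that computes r as the covariance divided by the product of the sample standard deviations, applied to the LSD data from the earlier example; the function name `pearson_r` and the rescaling `10 * x + 3` are hypothetical choices for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation: COV(x, y) / (s_x * s_y), using n - 1 divisors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)

# LSD example data from the slides.
lsd_conc = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
score = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

r_xy = pearson_r(lsd_conc, score)
r_yx = pearson_r(score, lsd_conc)                 # roles of x and y swapped
rescaled = [10 * x + 3 for x in lsd_conc]         # hypothetical change of units
r_rescaled = pearson_r(rescaled, score)

print(round(r_xy, 2), round(r_yx, 2), round(r_rescaled, 2))
```

All three values coincide (about -0.94 for these data), illustrating that r ignores which variable is called x and is unchanged by a positive-slope rescaling.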
$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{COV(x,y)}{s_x s_y} $$

$$ COV(x,y) = \frac{1}{n-1} \sum (x_i - \bar{x})(y_i - \bar{y}) $$
Excess French Heatwave Deaths
$\bar{x} = 6.03$, $s_x = 1.16$, $\bar{y} = 57.85$, $s_y = 36.46$, $n = 13$

City         Degchg(x)   %chng (y)   x-xbar   y-ybar   (x-xbar)(y-ybar)
Lille           4.0           4      -2.03    -53.85       109.3155
Marseilles      4.3          25      -1.73    -32.85        56.8305
Grenoble        6.3          28       0.27    -29.85        -8.0595
Rennes          5.6          36      -0.43    -21.85         9.3955
Toulouse        6.6          36       0.57    -21.85       -12.4545
Bordeaux        6.2          43       0.17    -14.85        -2.5245
Strasbourg      5.9          51      -0.13     -6.85         0.8905
Nice            4.3          53      -1.73     -4.85         8.3905
Poitiers        7.3          79       1.27     21.15        26.8605
Lyon            6.8          80       0.77     22.15        17.0555
Le Mans         7.0          82       0.97     24.15        23.4255
Dijon           7.4          93       1.37     35.15        48.1555
Paris           6.7         142       0.67     84.15        56.3805
Total          78.4       752.0       0.0       0.0        333.7

$$ COV(x,y) = \frac{333.7}{13-1} = 27.81 $$

$$ r = \frac{27.81}{(1.16)(36.46)} = \frac{27.81}{42.29} = 0.66 $$
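The arithmetic in this worked example can be checked with a short script. This is a sketch using only Python's standard `statistics` module; the variable names are illustrative. Because the slide rounds the deviations to two decimals before multiplying, individual cross-products may differ slightly from the unrounded ones computed here.

```python
from statistics import mean, stdev

degchg = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]  # x
pctchg = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]               # y

n = len(degchg)
x_bar, y_bar = mean(degchg), mean(pctchg)
s_x, s_y = stdev(degchg), stdev(pctchg)      # sample SDs (n - 1 divisor)

# Cross-products of deviations, as in the last column of the table.
cross_products = [(x - x_bar) * (y - y_bar) for x, y in zip(degchg, pctchg)]
cov_xy = sum(cross_products) / (n - 1)
r = cov_xy / (s_x * s_y)

print(f"x_bar={x_bar:.2f} s_x={s_x:.2f} y_bar={y_bar:.2f} s_y={s_y:.2f}")
print(f"COV(x,y)={cov_xy:.2f} r={r:.2f}")
```

The printed summaries agree with the slide values to two decimals.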
Least-Squares Regression
• Goal: Fit a line that “best fits” the relationship
between the response variable and the explanatory
variable
• Equation of a straight line: y = a + bx
– a - y-intercept (value of y when x = 0)
– b - slope (amount y increases as x increases by 1 unit)
• Prediction: Often want to predict what y will be at a
given level of x (e.g. how much will it cost to fill an
order of 1000 t-shirts?)
• Extrapolation: Using a fitted line outside the range of the
explanatory variable observed in the sample: BAD IDEA
Least-Squares Regression
• y = a + bx is a deterministic equation
• Sample data don’t fall on a straight line, but rather
around one
• Obtain equation that “best fits” a sample of data points
• Error - Difference between observed response and
predicted response (from equation)
• Least Squares criterion: Choose the line that minimizes
the sum of squared errors. Resulting regression line:

$$ \hat{y} = a + bx \qquad b = r \frac{s_y}{s_x} \qquad a = \bar{y} - b\bar{x} $$
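To make the least-squares criterion concrete, the sketch below fits the line to the LSD data from the earlier example and confirms that nudging either coefficient away from the fitted value increases the sum of squared errors. The helper names `fit_line` and `sse` and the nudge sizes are hypothetical; the slope is computed as the ratio of the cross-product sum to the sum of squared x deviations, which is algebraically the same as r·s_y/s_x.

```python
from statistics import mean

def fit_line(x, y):
    """Return (a, b) of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx                 # equals r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

def sse(a, b, x, y):
    """Sum of squared errors of the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# LSD example data from the slides.
lsd_conc = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
score = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

a, b = fit_line(lsd_conc, score)
best = sse(a, b, lsd_conc, score)

# Any other line should have a larger SSE than the fitted one.
nudged = [sse(a + da, b + db, lsd_conc, score)
          for da, db in [(1, 0), (-1, 0), (0, 0.5), (0, -0.5)]]

print(f"y-hat = {a:.2f} + ({b:.2f})x")
print(all(s > best for s in nudged))
```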
Excess French Heatwave Deaths
$\bar{x} = 6.03$, $s_x = 1.16$, $\bar{y} = 57.85$, $s_y = 36.46$, $r = 0.66$

$$ b = 0.66 \left( \frac{36.46}{1.16} \right) = 0.66(31.43) = 20.74 $$

$$ a = 57.85 - 20.74(6.03) = 57.85 - 125.06 = -67.21 $$

$$ \hat{y} = -67.21 + 20.74x $$

• For each 1°C increase in mean temp, excess mortality
increases by about 20 percentage points
[Figure: 2003 France Heat Wave Mortality. Scatterplot of Excess Mortality (%) versus Change in Mean Temp (Celsius), annotated with the fitted equation.]
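The hand calculation uses summary statistics rounded to two decimals, so the coefficients shift slightly when the same formulas are applied to the unrounded data. The sketch below (variable names are illustrative) computes both versions for comparison.

```python
from statistics import mean, stdev

degchg = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]  # x
pctchg = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]               # y

n = len(degchg)
x_bar, y_bar = mean(degchg), mean(pctchg)
s_x, s_y = stdev(degchg), stdev(pctchg)
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(degchg, pctchg)) / ((n - 1) * s_x * s_y)

# Unrounded fit: b = r * s_y / s_x, a = y_bar - b * x_bar.
b = r * s_y / s_x
a = y_bar - b * x_bar

# Same formulas with the rounded summaries quoted on the slide.
b_slide = round(0.66 * 36.46 / 1.16, 2)
a_slide = 57.85 - b_slide * 6.03

print(f"unrounded:      y-hat = {a:.2f} + {b:.2f}x")
print(f"rounded inputs: y-hat = {a_slide:.2f} + {b_slide:.2f}x")
```

The unrounded fit is roughly ŷ = -66.31 + 20.59x, which is the line that reproduces the fitted values listed in the residual table; the interpretation (about 20 percentage points per 1°C) is the same either way.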
Effect of an Outlier (Paris)
• Re-fitting the model without Paris, which had a very
high excess mortality (Using EXCEL):
$$ \hat{y}^* = -52.78 + 17.34x \qquad r^* = 0.76 $$

[Figure: Heat Wave Mortality (No Paris). Scatterplot of Excess Mortality versus Temp Change.]
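The effect of dropping Paris can be reproduced without a spreadsheet. In this sketch the helper `fit` is a hypothetical name, and Paris is assumed to be the last row of the data, as in the table.

```python
from math import sqrt
from statistics import mean

def fit(x, y):
    """Return intercept a, slope b and correlation r of the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / sqrt(sxx * syy)
    return a, b, r

degchg = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]  # x
pctchg = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]               # y

a_all, b_all, r_all = fit(degchg, pctchg)
a_np, b_np, r_np = fit(degchg[:-1], pctchg[:-1])   # drop the last row (Paris)

print(f"all 13 cities: y-hat = {a_all:.2f} + {b_all:.2f}x, r = {r_all:.2f}")
print(f"without Paris: y-hat = {a_np:.2f} + {b_np:.2f}x, r = {r_np:.2f}")
```

Removing the single outlying city lowers the slope and raises the correlation, matching the values reported above.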
Squared Correlation
• The squared correlation represents the fraction of the
variation in the response variable that is “explained” by
the explanatory variable
• Represents the improvement (reduction in sum of squared
errors) by using x (and fitted equation y-hat) to predict y
as opposed to ignoring x (and simply using the sample
mean y-bar) to predict y
• 0 ≤ r² ≤ 1
– Values near 0 → x does not help predict y (regression line flat)
– Values near 1 → x predicts y well (data near regression line)
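The "reduction in sum of squared errors" reading of r² can be verified on the heat wave data. This sketch (variable names are illustrative) compares the squared correlation with the proportional drop in squared error when predicting with the fitted line instead of the sample mean.

```python
from statistics import mean

degchg = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]  # x
pctchg = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]               # y

x_bar, y_bar = mean(degchg), mean(pctchg)
sxx = sum((x - x_bar) ** 2 for x in degchg)
syy = sum((y - y_bar) ** 2 for y in pctchg)   # squared error when predicting with y-bar
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(degchg, pctchg))

# Least-squares line and its sum of squared errors.
b = sxy / sxx
a = y_bar - b * x_bar
sse_line = sum((y - (a + b * x)) ** 2 for x, y in zip(degchg, pctchg))

r = sxy / (sxx * syy) ** 0.5
r2_reduction = (syy - sse_line) / syy         # fraction of variation "explained"

print(f"r^2 = {r ** 2:.2f}, SSE reduction = {r2_reduction:.2f}")
```

Both quantities come out at about 0.43, so temperature change accounts for roughly 43% of the variation in excess mortality across these 13 cities.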
$$ r^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} $$
Residual Analysis
• Residuals: Difference between observed
responses and their predicted values: $e = y - \hat{y}$
• Useful to plot the residuals versus the level of the
explanatory variable (x)
• Outliers: Large (positive or negative) residuals.
Values of y that are inconsistent with prediction
• Influential observations: Cases where the level
of the explanatory variable is far away from the
other individuals (extreme x values)
France Heatwave Mortality

x      y      yhat     e=y-yhat
4.0      4    16.04    -12.04
4.3     25    22.22      2.78
6.3     28    63.39    -35.39
5.6     36    48.98    -12.98
6.6     36    69.56    -33.56
6.2     43    61.33    -18.33
5.9     51    55.15     -4.15
4.3     53    22.22     30.78
7.3     79    83.98     -4.98
6.8     80    73.68      6.32
7.0     82    77.80      4.20
7.4     93    86.03      6.97
6.7    142    71.62     70.38    Paris (outlier)

[Figure: Residual Plot. Residual versus Temp Change (x).]
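The fitted values and residuals in the table can be regenerated from the data. The sketch below assumes the rows are in table order, so the last row is Paris; the variable names are illustrative.

```python
from statistics import mean

degchg = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]  # x
pctchg = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]               # y

# Least-squares fit on the unrounded data.
x_bar, y_bar = mean(degchg), mean(pctchg)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(degchg, pctchg))
     / sum((x - x_bar) ** 2 for x in degchg))
a = y_bar - b * x_bar

fitted = [a + b * x for x in degchg]
residuals = [y - f for y, f in zip(pctchg, fitted)]

# Index of the observation with the largest residual in absolute value.
largest = max(range(len(residuals)), key=lambda i: abs(residuals[i]))

for x, y, f, e in zip(degchg, pctchg, fitted, residuals):
    print(f"{x:4.1f} {y:5d} {f:8.2f} {e:8.2f}")
print("largest residual at row", largest + 1)
```

The largest residual belongs to the last row (Paris), and the residuals sum to zero up to floating-point error, as they must for a least-squares line with an intercept.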
Miscellaneous Topics
• Lurking Variable: Variable not included in the regression
analysis that may influence the association between y
and x. The resulting association between y and x is
sometimes referred to as spurious.
• Association does not imply causation (it is one of
various steps to demonstrating cause-and-effect)
• Do not extrapolate outside range of x observed in study
• Some relationships are not linear; the correlation may
be low even when the relationship is strong
• Correlations based on averages across individuals tend
to be higher than those based on individuals
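The point about nonlinear relationships can be illustrated with made-up data. In this sketch y is determined exactly by x through y = x², yet the correlation is zero because the relationship is not a straight line; the data are hypothetical and chosen to be symmetric around zero.

```python
from math import sqrt

# Hypothetical data: a perfect, but curved, relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)
print(r)
```

A scatterplot would reveal the curve immediately, which is one reason to plot the data before relying on r.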
Causation
• Association between x and y demonstrated
• Time order confirmed (x “occurs” before y)
• Alternative explanations are considered and explained
away:
– Lurking variables - Another variable causes both x and y
– Confounding - Two explanatory variables are highly related,
and which causes y cannot be determined
• Dose-Response Effect
• Plausible cause