Beware of Data

Which of these two relationships is “tighter”?

Left
Factor:  10, 9, 6, 1, -2, 3, 2, 5, 10, 7, 8, 2, 5, 1, 8, 8, 2, 10, 7, 10
Outcome: 11, 9, 6, 1, -2, 4, 2, 3, 9, 7, 10, 1, 4, 3, 7, 9, 4, 11, 6, 9

Right
Factor:  5, 5, 10, 2, 6, 1, 6, 1, 1, 9, 6, 7, 9, 5, 3, 8, 3, 2, 7, 9
Outcome: 10, 11, -5, 20, 8, 23, 7, 22, 21, -3, 8, 4, -3, 9, 17, 2, 17, 20, 5, -2
The relationship on the left appears “tighter” for three reasons:
1. Cognition bias. Simple linear relationships are easier to “eyeball” than complex relationships.
2. Information bias. Rounding masks information.
3. Confirmation bias. We tend to focus on observations that confirm our beliefs and ignore observations that contradict them.
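One way to check the eyeball judgement is to compute the correlation coefficient for each data set. The sketch below is only an illustration: it uses the two Factor/Outcome data sets from the slide, assumes the first table is the one called "left", and the names `pearson_r`, `r_left`, and `r_right` are made up for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# First data set from the slide (assumed to be the "left" one).
left_factor = [10, 9, 6, 1, -2, 3, 2, 5, 10, 7, 8, 2, 5, 1, 8, 8, 2, 10, 7, 10]
left_outcome = [11, 9, 6, 1, -2, 4, 2, 3, 9, 7, 10, 1, 4, 3, 7, 9, 4, 11, 6, 9]

# Second data set from the slide (assumed to be the "right" one).
right_factor = [5, 5, 10, 2, 6, 1, 6, 1, 1, 9, 6, 7, 9, 5, 3, 8, 3, 2, 7, 9]
right_outcome = [10, 11, -5, 20, 8, 23, 7, 22, 21, -3, 8, 4, -3, 9, 17, 2, 17, 20, 5, -2]

r_left = pearson_r(left_factor, left_outcome)
r_right = pearson_r(right_factor, right_outcome)
print(f"left:  r = {r_left:.3f}")
print(f"right: r = {r_right:.3f}")
```

Comparing the absolute values of the two coefficients, rather than the tables themselves, shows which relationship is tighter in the linear sense. For these data the right-hand set has the larger absolute correlation, even though its negative slope and wider range make it harder to read off the table.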
[Figures: plots of Outcome against Factor for each of the two data sets.]
Lesson #1
Never trust your eyes.
Corollary
Don’t trust summary statistics either.
Anscombe’s quartet
Four data sets that yield identical summary statistics.
Anscombe's quartet

                  I              II             III             IV
              x      y       x      y       x      y       x      y
             10   8.04      10   9.14      10   7.46       8   6.58
              8   6.95       8   8.14       8   6.77       8   5.76
             13   7.58      13   8.74      13  12.74       8   7.71
              9   8.81       9   8.77       9   7.11       8   8.84
             11   8.33      11   9.26      11   7.81       8   8.47
             14   9.96      14   8.10      14   8.84       8   7.04
              6   7.24       6   6.13       6   6.08       8   5.25
              4   4.26       4   3.10       4   5.39      19  12.50
             12  10.84      12   9.13      12   8.15       8   5.56
              7   4.82       7   7.26       7   6.42       8   7.91
              5   5.68       5   4.74       5   5.73       8   6.89

Mean       9.00   7.50    9.00   7.50    9.00   7.50    9.00   7.50
Stdev      3.32   2.03    3.32   2.03    3.32   2.03    3.32   2.03
Corr           0.82           0.82           0.82           0.82
alpha hat      3.00           3.00           3.00           3.00
beta hat       0.50           0.50           0.50           0.50
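The summary statistics for the quartet can be reproduced with a few lines of code. This is a minimal sketch using only the standard library; the function name `summarize` and the dictionary layout are choices made for this example, and the standard deviation is assumed to be the sample (n - 1) version.

```python
from math import sqrt

def summarize(x, y):
    """Means, sample standard deviations, correlation, and the OLS line y = alpha + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    return {
        "mean_x": mean_x,
        "stdev_x": sqrt(sxx / (n - 1)),
        "mean_y": mean_y,
        "stdev_y": sqrt(syy / (n - 1)),
        "corr": sxy / sqrt(sxx * syy),
        "alpha": mean_y - beta * mean_x,
        "beta": beta,
    }

# Data sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

To two decimal places the four rows print the same numbers, which is why the summary alone cannot distinguish the data sets; only looking at the individual points does.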
Lesson #1
Never trust your eyes.
(Don’t trust summary statistics either)
Lesson #2
Always employ sanity checks.
[Figure: Conventional Mortgage Rates (6.0% to 10.0%) and Mystery Variable from 2 Years Prior (1.5 to 2.5), 1991 to 2002.]
Mystery variable explains 57% of the variation in mortgage rates.
Relationship is:
$\text{Rate} = 0.03 + 0.02 \times \text{Mystery Variable}$
Mystery variable is Algeria’s GDP-relative-to-Trade
Spurious Results
An infinite number of factors can attempt to explain a given outcome.
Look hard enough and you are guaranteed to find a perfect predictor.
If the factor is “spurious,” what you are observing is random chance.
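A small simulation illustrates how searching through enough unrelated factors turns up one that fits well by chance. Everything here is an assumption made for the sketch: the outcome and the candidate factors are independent random draws, the series length of 12 mirrors a 12-year window, and the count of 1,000 candidates and the seed are arbitrary.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(0)        # arbitrary seed so the run is repeatable
n_years = 12          # length of each series
n_candidates = 1000   # number of unrelated "factors" to try

# The outcome and every candidate factor are pure noise, so no factor truly predicts anything.
outcome = [random.gauss(0, 1) for _ in range(n_years)]
candidates = [[random.gauss(0, 1) for _ in range(n_years)] for _ in range(n_candidates)]

# Keep the candidate whose correlation with the outcome is largest in absolute value.
best_r = max((pearson_r(c, outcome) for c in candidates), key=abs)
print(f"best |r| among {n_candidates} random factors: {abs(best_r):.2f}")
print(f"share of variation 'explained' (r squared): {best_r ** 2:.2f}")
```

Raising `n_candidates` tends to push the best correlation closer to 1, which is the sense in which looking hard enough produces a seemingly strong predictor.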
[Figure: Conventional Mortgage Rates (4.0% to 18.0%) and Mystery Variable from 2 Years Prior (1.5 to 4), 1977 to 2003. Annotation: By random chance, the mystery variable predicts mortgage rates over this period.]
If you wait long enough, randomness will tell you anything you want to hear.
[Diagram: successive mailings of letters predicting the DJIA. 200,000 letters say “DJIA will be down tomorrow!” and 200,000 say “DJIA will be up tomorrow!”; the next mailing is 100,000 letters each way, then 50,000 each way, then 25,000 each way.]
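The arithmetic behind the letters can be sketched in a few lines. The sketch assumes a starting list of 400,000 recipients (200,000 letters each way), that half of every mailing predicts "up" and half "down", and that the next mailing goes only to those whose previous letter turned out to be right. The function name is made up for this example.

```python
def recipients_with_perfect_record(initial_recipients, rounds):
    """Count recipients who have seen only correct predictions after each round.

    Assumes each mailing is split evenly between "up" and "down", and that only
    the half who received the correct prediction get the next letter.
    """
    remaining = initial_recipients
    history = []
    for _ in range(rounds):
        remaining //= 2  # whichever way the DJIA moves, half the letters were right
        history.append(remaining)
    return history

print(recipients_with_perfect_record(400_000, 4))  # [200000, 100000, 50000, 25000]
```

After four rounds, 25,000 people have received four correct predictions in a row, even though no letter contained any information.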
[Figure: Number of Sunspots in the Current Year (left axis) and Number of Republicans in the Senate 1 Year in the Future (right axis), 1960 to 1980.]
Source: ftp.ngdc.noaa.gov/stp/solar_data/sunspot_numbers/yearly
www.senate.gov/pagelayout/history/one_item_and_teasers/partydiv.htm
Counterargument:
Spurious or not, sunspots would have been useful for predicting the number of Republicans in the Senate.
Fallacy:
We see the correlation only in hindsight. To be useful, a correlation must be detected before it ceases to exist.
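The hindsight problem can be illustrated by choosing the best-fitting factor in one period and then scoring it in the next. This is a sketch under stated assumptions, not a reconstruction of the sunspot data: all series are independent random noise, and the period length, candidate count, repetition count, and seed are arbitrary.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def hindsight_experiment(rng, n_years=12, n_candidates=500):
    """Pick the factor that best fits the first period, then score it on the second."""
    outcome = [rng.gauss(0, 1) for _ in range(2 * n_years)]
    best_in, best_out = 0.0, 0.0
    for _ in range(n_candidates):
        factor = [rng.gauss(0, 1) for _ in range(2 * n_years)]
        r_in = abs(pearson_r(factor[:n_years], outcome[:n_years]))
        if r_in > best_in:
            # Remember how this apparent winner performs in the later period.
            best_in = r_in
            best_out = abs(pearson_r(factor[n_years:], outcome[n_years:]))
    return best_in, best_out

rng = random.Random(1)  # arbitrary seed so the run is repeatable
results = [hindsight_experiment(rng) for _ in range(20)]
mean_in = sum(r[0] for r in results) / len(results)
mean_out = sum(r[1] for r in results) / len(results)
print(f"average |r| in the period used to pick the factor: {mean_in:.2f}")
print(f"average |r| of the same factor in the next period: {mean_out:.2f}")
```

The factor that looks best in the first period is selected precisely because it fit that period, so its later performance is typically no better than that of any other random series.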
[Figures: Number of Sunspots in the Current Year (left axis) and Number of Republicans in the Senate 1 Year in the Future (right axis), shown separately for 1960 to 1980 and for 1981 to 2005.]
Lesson #1
Never trust your eyes.
(Don’t trust summary statistics either)
Lesson #2
Always employ sanity checks.
Lesson #3
An observation is meaningless.
Corollary
An anecdote is both meaningless and dangerous.
Left half of room: Don’t look.
Right half of room: Write what you read.
The average person in Benin earns an annual income of $750 (in U.S. dollars).
Right half of room: Don’t look.
Left half of room: Write what you read.
The average person in Andorra earns an annual income of $40,000 (in U.S. dollars).
The average person on planet Earth earns what annual income (in U.S. dollars)?
Anchoring
When we see a piece of information, we evaluate subsequent
information in light of the first piece of information.
Information
News interview of a single mother working three jobs to support her
family.
Policy Question
Do we need welfare reform?
Problem
How common is this example?
Left half of room: Don’t look.
Right half of room: Read and answer.
Should we require school districts to pay to install seat belts on school buses?
1 (Definitely not!)   2   3   4   5 (Absolutely!)
Right half of room: Don’t look.
Left half of room: Read and answer.
Every year in the U.S., 17,000 children are treated for injuries sustained in school bus accidents. Most of these injuries could have been avoided had the children been wearing seat belts.
Should we require school districts to pay to install seat belts on school buses?
1 (Definitely not!)   2   3   4   5 (Absolutely!)
Availability
It’s easier to see what’s in front of us than it is to see what isn’t.
Information
News report showing the benefit of school bus seat belts.
Policy Question
Should we require seat belts in school buses?
Problem
What is the expected benefit and what are the tradeoffs?
Lesson #1
Never trust your eyes.
Lesson #2
Always employ sanity checks.
Lesson #3
An observation is meaningless.
Corollary
An anecdote is both meaningless and dangerous.
Lesson #4
Not everything that appears random is.
$y = \alpha + \beta X_1 + u$
$\hat{\alpha} = 50.01 \; (8.65)$
$\hat{\beta} = 0.11 \; (0.14)$
$R^2 = 0.01$

$y = \alpha + \beta X_2 + u$
$\hat{\alpha} = 1.18 \; (7.56)$
$\hat{\beta} = 0.50 \; (0.06)$
$R^2 = 0.55$

$y = \alpha + \beta_1 X_1 + \beta_2 X_2 + u$
$\hat{\alpha} = 0.00 \; (0.00)$
$\hat{\beta}_1 = 1.00 \; (0.00)$
$\hat{\beta}_2 = 1.00 \; (0.00)$
$R^2 = 1.00$

[Figures: Y shown with X1 and X2 for each of the three regressions.]
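The pattern in the three regressions above, where one factor alone explains nothing yet two factors together explain everything, can be reproduced with a tiny constructed example. The numbers below are made up for illustration and are not the data behind the slides; `r_squared` is a helper written for this sketch.

```python
def r_squared(x, y):
    """R squared from a simple regression of y on x with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Made-up data: x1 is chosen to be uncorrelated with y, and x2 is defined so
# that y equals x1 + x2 exactly.
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
y = [1, 1, -1, -1, 2, 2, -2, -2]
x2 = [b - a for a, b in zip(x1, y)]

r2_x1 = r_squared(x1, y)  # x1 alone
r2_x2 = r_squared(x2, y)  # x2 alone

# With both factors, coefficients of 1 and 1 (and no intercept) leave no residual,
# so least squares on x1 and x2 together fits y perfectly.
residuals = [b - (a1 + a2) for a1, a2, b in zip(x1, x2, y)]

print(f"R squared using x1 alone: {r2_x1:.2f}")
print(f"R squared using x2 alone: {r2_x2:.2f}")
print(f"largest residual using x1 + x2: {max(abs(r) for r in residuals)}")
```

Judged on its own, x1 looks like noise, yet it is exactly the piece needed to turn a partial fit into a perfect one.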
Regression
Why do we do this?
A trucking company wants to be able to predict the round-trip travel time of
its trucks. Use the data below to predict the round-trip travel time for a truck
that will be travelling 200 miles and making 3 deliveries.

Miles Traveled   Deliveries   Travel Time (hours)
     500             4               11.3
     250             3                6.8
     500             4               10.9
     500             2                8.5
     250             2                6.2
     400             2                8.2
     375             3                9.4
     325             4                8.0
     450             3                9.6
     450             2                8.1
Approach #1: Calculate Average Time per Mile
Trucks in the data set required a total of 87 hours to travel a total of 4,000 miles. Dividing
hours by miles, we find an average of 0.02 hours per mile journeyed.
(0.02 hours per mile) (200 miles) = 4 hours
Approach #2: Calculate Average Time per Delivery
Trucks in the data set required a total of 87 hours to make 29 deliveries. Dividing hours
by deliveries, we find an average of 3 hours per delivery.
(3 hours per delivery) (3 deliveries) = 9 hours
Approach #3: Combine Average Time per Mile and Average Time per Delivery
Trucks in the data set required 0.02 hours per mile journeyed and 3 hours per delivery.
(0.02 hours per mile) (200 miles) + (3 hours per delivery) (3 deliveries) = 13 hours
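The three averaging approaches can be checked directly from the table. This sketch keeps the rates unrounded, whereas the slides round the per-mile rate to 0.02 before multiplying, so its figures differ slightly from 4 and 13 hours. Variable names are illustrative.

```python
# (miles, deliveries, hours) for the ten trips in the table.
trips = [
    (500, 4, 11.3), (250, 3, 6.8), (500, 4, 10.9), (500, 2, 8.5), (250, 2, 6.2),
    (400, 2, 8.2), (375, 3, 9.4), (325, 4, 8.0), (450, 3, 9.6), (450, 2, 8.1),
]

total_miles = sum(t[0] for t in trips)
total_deliveries = sum(t[1] for t in trips)
total_hours = sum(t[2] for t in trips)

hours_per_mile = total_hours / total_miles          # about 0.02 when rounded
hours_per_delivery = total_hours / total_deliveries

# The trip to predict: 200 miles and 3 deliveries.
miles, deliveries = 200, 3
approach_1 = hours_per_mile * miles            # average time per mile
approach_2 = hours_per_delivery * deliveries   # average time per delivery
approach_3 = approach_1 + approach_2           # both averages combined

print(f"hours per mile:     {hours_per_mile:.5f}")
print(f"hours per delivery: {hours_per_delivery:.2f}")
print(f"approach 1: {approach_1:.2f} hours")
print(f"approach 2: {approach_2:.2f} hours")
print(f"approach 3: {approach_3:.2f} hours")
```

Approach 3 attributes all 87 hours to miles and then attributes the same 87 hours to deliveries, which is why its estimate is so much larger than either of the others.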
Problems
1. Combining average time per delivery and average time per mile will double-count
time if delivery and miles are correlated.
2. We have ignored a possible fixed effect – an amount of “overhead” time that is
required regardless of the number of miles and deliveries.
$\text{Time}_i = \beta_0 + \beta_1 (\text{deliveries}_i) + u_i$
$\hat{\beta}_0 = 5.38$
$\hat{\beta}_1 = 1.14$
5.38 hours + (1.14 hours per delivery) (3 deliveries) = 8.8 hours
$\text{Time}_i = \beta_0 + \beta_1 (\text{miles}_i) + u_i$
$\hat{\beta}_0 = 3.27$
$\hat{\beta}_1 = 0.01$
3.27 hours + (0.01 hours per mile) (200 miles) = 5.27 hours
$\text{Time}_i = \beta_0 + \beta_1 (\text{miles}_i) + \beta_2 (\text{deliveries}_i) + u_i$
$\hat{\beta}_0 = 1.13$
$\hat{\beta}_1 = 0.01$
$\hat{\beta}_2 = 0.92$
1.13 hours + (0.01 hours per mile) (200 miles) + (0.92 hours per delivery) (3 deliveries)
= 5.89 hours
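The regression on miles and deliveries can be reproduced from the table with ordinary least squares. The sketch below solves the normal equations in plain Python; the helper names `solve` and `ols` are made up for this example, and a statistics package would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients of y on the columns of rows, via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# (miles, deliveries, hours) for the ten trips in the table.
trips = [
    (500, 4, 11.3), (250, 3, 6.8), (500, 4, 10.9), (500, 2, 8.5), (250, 2, 6.2),
    (400, 2, 8.2), (375, 3, 9.4), (325, 4, 8.0), (450, 3, 9.6), (450, 2, 8.1),
]
design = [(1.0, miles, deliveries) for miles, deliveries, _ in trips]  # leading 1.0 is the fixed term
hours = [t for _, _, t in trips]

b0, b1, b2 = ols(design, hours)
prediction = b0 + b1 * 200 + b2 * 3  # 200 miles, 3 deliveries

print(f"fixed hours:        {b0:.2f}")
print(f"hours per mile:     {b1:.4f}")
print(f"hours per delivery: {b2:.2f}")
print(f"predicted hours:    {prediction:.2f}")
```

The unrounded miles coefficient is a little above 0.01, so the prediction printed here is somewhat higher than the 5.89 hours obtained on the slide from the rounded coefficients.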
$\text{Time}_i = \beta_1 (\text{miles}_i) + \beta_2 (\text{deliveries}_i) + u_i$
$\hat{\beta}_1 = 0.01$
$\hat{\beta}_2 = 1.07$
(0.01 hours per mile) (200 miles) + (1.07 hours per delivery) (3 deliveries)
= 5.21 hours
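Approach                              Hours per Mile   Hours per Delivery   Fixed Hours   Estimated Hours
Average time per mile                      0.02                -                 -              4.00
Average time per delivery                   -                 3.00               -              9.00
Combined averages                          0.02               3.00               -             13.00
Regression on deliveries                    -                 1.14              5.38            8.80
Regression on miles                        0.01                -                3.27            5.27
Regression on miles and deliveries         0.01               0.92              1.13            5.89
Regression without a fixed term            0.01               1.07               -              5.21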