Using the LA Airline Data

Jo Hardin
Lab Assignment 7
Lab Goals:
1. To work with a large and real dataset that is publicly available (though the airline data
has already been cleaned up a bit).
2. To get used to working with quantitative data in R.
3. To understand the difference between inference (which is done on a sample of the
population) and descriptions of the entire population (census, find the actual
parameters).
Installing packages:
Make sure the mosaic package is installed on your computer (from the console) before
trying to knit.
install.packages("mosaic")
Accessing & Defining the data
I have created a dataset of all flights out of LA in March from 2011-2014. Given next week's
adventures, you might be interested in delays out of any of the various airports. [Thanks to:
Nicholas Horton at Amherst College.]
• Origin, Dest – the airports of origin and destination for each flight
• DepTime, ArrTime – actual departure and arrival time
• CRSDepTime, CRSArrTime – scheduled departure and arrival time (CRS is the
Computer Reservation System)
• DepDelay, ArrDelay – the departure and arrival delays based on scheduled departure and
arrival times
• Cancelled – a binary variable on whether a flight was cancelled (0 = no, 1 = yes)
All flying times are reported in a format of HHMM (double digit hours and minutes), e.g.
1802 would correspond to 6:02 pm. Times are reported in local time of the airport, but for
our purpose, we have transformed times to be all on the same scale, here Eastern Standard
Time (EST).
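Because HHMM values are clock readings rather than counts of minutes, subtracting them directly can mislead: 1802 - 1758 gives 44, although the two times are only 4 minutes apart. The sketch below shows one way to convert HHMM values to minutes after midnight. The helper name hhmm_to_minutes is hypothetical and is not part of the lab code.

```r
# Hypothetical helper (not part of the lab code): convert HHMM clock
# values to minutes after midnight.
hhmm_to_minutes <- function(hhmm) {
  hours   <- hhmm %/% 100   # integer division drops the last two digits
  minutes <- hhmm %% 100    # remainder keeps the last two digits
  hours * 60 + minutes
}

hhmm_to_minutes(1802)                           # 6:02 pm as minutes after midnight
hhmm_to_minutes(1802) - hhmm_to_minutes(1758)   # elapsed minutes between the two times
```

The conversion is only needed when you want differences or averages of times; for ordering flights within a day, the raw HHMM values already sort correctly.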
1. Entering the data. The first R chunk is something you should read through, but you
won't have to adjust it (or answer any questions on it). After you load in
airlinesLA.rda, you'll have a dataset called ds.
require(mosaic)
load(url("http://www.rossmanchance.com/iscam2/ISCAM.RData"))
options(digits=3)
trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice
load(url("http://pages.pomona.edu/~jsh04747/courses/math58/airlinesLA.rda"))
names(ds)
## [1] "DayofMonth"    "Month"         "Year"          "Origin"
## [5] "Dest"          "UniqueCarrier" "TailNum"       "CRSDepTime"
## [9] "ArrDelay"      "Cancelled"     "date"          "weekday"
# looking at the data filtered by March 11th, 2013
# and sorted by scheduled departure time
filter(ds, Year==2013 & DayofMonth==11 & Origin=='ONT') %>%
select(DayofMonth, Month, Year, Origin, Dest, CRSDepTime, ArrDelay,
Cancelled) %>%
arrange(CRSDepTime)
##    DayofMonth Month Year Origin Dest CRSDepTime ArrDelay Cancelled
##  1         11     3 2013    ONT  DEN        545       -1         0
##  2         11     3 2013    ONT  DFW        600       -4         0
##  3         11     3 2013    ONT  PHX        600       -6         0
##  4         11     3 2013    ONT  SMF        600      -15         0
##  5         11     3 2013    ONT  SFO        600      -12         0
##  6         11     3 2013    ONT  SLC        604       -3         0
##  7         11     3 2013    ONT  SJC        610       -9         0
##  8         11     3 2013    ONT  SEA        615      -14         0
##  9         11     3 2013    ONT  LAS        615       -6         0
## 10         11     3 2013    ONT  DEN        630        5         0
## 11         11     3 2013    ONT  PHX        630      -10         0
## 12         11     3 2013    ONT  MDW        645       16         0
## 13         11     3 2013    ONT  OAK        710      -11         0
## 14         11     3 2013    ONT  DFW        740      -20         0
## 15         11     3 2013    ONT  SMF        750       -7         0
## 16         11     3 2013    ONT  OAK        755       -1         0
## 17         11     3 2013    ONT  PHX        810        3         0
## 18         11     3 2013    ONT  SFO        846       56         0
## 19         11     3 2013    ONT  LAS        855       -3         0
## 20         11     3 2013    ONT  PHX        900       -9         0
## 21         11     3 2013    ONT  PDX        935      -20         0
## 22         11     3 2013    ONT  SMF        955       -8         0
## 23         11     3 2013    ONT  PHX       1000      -13         0
## 24         11     3 2013    ONT  SLC       1017       -6         0
## 25         11     3 2013    ONT  LAS       1040      -11         0
## 26         11     3 2013    ONT  SEA       1045      -14         0
## 27         11     3 2013    ONT  SJC       1050       -7         0
## 28         11     3 2013    ONT  OAK       1115      -13         0
## 29         11     3 2013    ONT  IAH       1126       -7         0
## 30         11     3 2013    ONT  DFW       1150       -5         0
## 31         11     3 2013    ONT  OAK       1205      -16         0
## 32         11     3 2013    ONT  LAS       1215       -3         0
## 33         11     3 2013    ONT  PHX       1235      -10         0
## 34         11     3 2013    ONT  SMF       1245      -10         0
## 35         11     3 2013    ONT  SFO       1309       20         0
## 36         11     3 2013    ONT  PHX       1310       -9         0
## 37         11     3 2013    ONT  SLC       1315       -4         0
## 38         11     3 2013    ONT  SJC       1325       -1         0
## 39         11     3 2013    ONT  SMF       1400        5         0
## 40         11     3 2013    ONT  PDX       1410        6         0
## 41         11     3 2013    ONT  DFW       1440       -2         0
## 42         11     3 2013    ONT  OAK       1455        4         0
## 43         11     3 2013    ONT  DEN       1500       -1         0
## 44         11     3 2013    ONT  DEN       1545       -7         0
## 45         11     3 2013    ONT  LAS       1600        2         0
## 46         11     3 2013    ONT  SLC       1621      -10         0
## 47         11     3 2013    ONT  PHX       1630      -12         0
## 48         11     3 2013    ONT  PHX       1645        1         0
## 49         11     3 2013    ONT  SMF       1720       -7         0
## 50         11     3 2013    ONT  OAK       1745      -16         0
## 51         11     3 2013    ONT  SJC       1800       -7         0
## 52         11     3 2013    ONT  SEA       1805       -5         0
## 53         11     3 2013    ONT  SFO       1836      -14         0
## 54         11     3 2013    ONT  PHX       1930      -19         0
## 55         11     3 2013    ONT  OAK       1945       18         0
## 56         11     3 2013    ONT  LAS       1955       -4         0
## 57         11     3 2013    ONT  SMF       2010        6         0
## 58         11     3 2013    ONT  PHX       2040        0         0
## 59         11     3 2013    ONT  SJC       2115       -8         0
## 60         11     3 2013    ONT  SMF       2145       -6         0
# assigning that same data above to a variable:
ds.Mar11 = filter(ds, Year==2013 & DayofMonth==11 & Origin=='ONT') %>%
select(DayofMonth, Month, Year, Origin, Dest, CRSDepTime, ArrDelay,
Cancelled) %>%
arrange(CRSDepTime)
# looking at just the first few rows of the created dataset
head(ds.Mar11)
##   DayofMonth Month Year Origin Dest CRSDepTime ArrDelay Cancelled
## 1         11     3 2013    ONT  DEN        545       -1         0
## 2         11     3 2013    ONT  DFW        600       -4         0
## 3         11     3 2013    ONT  PHX        600       -6         0
## 4         11     3 2013    ONT  SMF        600      -15         0
## 5         11     3 2013    ONT  SFO        600      -12         0
## 6         11     3 2013    ONT  SLC        604       -3         0
# adding a new variable that gives three time of day groups
ds = mutate(ds, TimeOfDay = cut(CRSDepTime, breaks=c(0, 1200, 1800, 2400),
labels=c("morning", "afternoon", "evening")))
# adding a new variable that gives a binary indicator of whether or not
# the flight was delayed or canceled.
ds = mutate(ds, delayorcancel = ifelse(is.na(ArrDelay) | ArrDelay > 15,
"yes", "no"))
attach(ds)
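It can help to check how cut() and ifelse() treat boundary and missing values before relying on the new variables. The sketch below uses made-up values (toy_times and toy_delay are hypothetical, not taken from ds). By default cut() builds intervals that are open on the left and closed on the right, so a time of exactly 1200 falls in "morning" and a time of 0 would fall in no group at all.

```r
# Toy scheduled departure times (HHMM), chosen to sit on the break points.
toy_times <- c(0, 1, 1200, 1201, 1800, 1801, 2359)
toy_groups <- cut(toy_times, breaks = c(0, 1200, 1800, 2400),
                  labels = c("morning", "afternoon", "evening"))
data.frame(toy_times, toy_groups)

# Toy arrival delays, including a missing value.
toy_delay <- c(-5, 15, 16, NA)
toy_flag <- ifelse(is.na(toy_delay) | toy_delay > 15, "yes", "no")
data.frame(toy_delay, toy_flag)
```

In the second data frame, the missing delay is flagged "yes" because is.na() is TRUE for it, which is how flights without a recorded arrival delay end up counted in delayorcancel.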
2. Now, look at the data and summarize some of the variables. Notice the names of the
variables, then summarize as many of the variables as are interesting to you.
• What is the difference between favstats and tally and iscamsummary?
• What does the ~ command do within tally?
• Does the ~ command work for favstats?
names(ds)
##  [1] "DayofMonth"    "Month"         "Year"          "Origin"
##  [5] "Dest"          "UniqueCarrier" "TailNum"       "CRSDepTime"
##  [9] "ArrDelay"      "Cancelled"     "date"          "weekday"
## [13] "TimeOfDay"     "delayorcancel"
favstats(~CRSDepTime, data=ds)
##  min  Q1 median   Q3  max mean  sd      n missing
##    1 905   1305 1740 2359 1339 507 107587       0
tally(~ TimeOfDay, data=ds)
##
##   morning afternoon   evening
##     46023     36986     24578
iscamsummary(CRSDepTime)
##       n    Min     Q1 Median     Q3    Max   Mean     SD
##  108000      1    905   1300   1740   2360   1340    507
iscamsummary(Cancelled)
##        n      Min       Q1   Median       Q3      Max     Mean       SD
## 1.08e+05 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 1.15e-02 1.07e-01
iscamsummary(TimeOfDay)
## Error in quantile.default(x, na.rm = TRUE): factors are not allowed
tally(~ Cancelled, format="percent", data=ds)
##
##     0     1
## 98.85  1.15
tally(~ delayorcancel, format="percent", data=ds)
##
##   no  yes
## 81.9 18.1
tally(Cancelled ~ TimeOfDay, format="percent", data=ds)
##          TimeOfDay
## Cancelled morning afternoon evening
##         0   98.90     98.85   98.73
##         1    1.10      1.15    1.27
tally(delayorcancel ~ TimeOfDay, format="percent", data=ds)
##              TimeOfDay
## delayorcancel morning afternoon evening
##           no     86.2      79.2    78.1
##           yes    13.8      20.8    21.9
favstats(CRSDepTime ~ Cancelled, data=ds)
##   .group min  Q1 median   Q3  max mean  sd      n missing
## 1      0   1 905   1305 1740 2359 1339 507 106346       0
## 2      1 135 945   1329 1805 2359 1368 491   1241       0
favstats(ArrDelay ~ TimeOfDay, data=ds)
##      .group  min  Q1 median Q3  max  mean   sd     n missing
## 1   morning  -58 -12     -5  5  842 0.808 30.1 45423     600
## 2 afternoon  -60  -9     -2 11 1104 6.240 32.3 36505     481
## 3   evening -108 -10     -2 11  720 6.003 31.6 24227     351
3. Look in your textbook at pages 197-198.
• What are quantiles?
• How do they determine the values associated with a boxplot?
• Can you change the limits of your boxplot so that you observe every single
observation?
• At what time of day are the flights most likely to be delayed?
favstats(~ ArrDelay, data=ds)
##   min  Q1 median Q3  max mean   sd      n missing
##  -108 -11     -3  8 1104 3.86 31.3 106155    1432
ds = mutate(ds, ActDelay = ifelse(ArrDelay < 0, 0, ArrDelay))
favstats(~ ActDelay, data=ds)
##  min Q1 median Q3  max mean   sd      n missing
##    0  0      0  8 1104 10.2 28.1 106155    1432
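ActDelay treats every early arrival as a delay of zero, which is why its mean is larger than the mean of ArrDelay while the count and the number of missing values stay the same. The sketch below checks that logic on a made-up vector (toy_delay is hypothetical, not taken from ds) and shows that base R's pmax() gives the same result as the ifelse() call used above.

```r
# Toy arrival delays in minutes: early, on time, late, missing, very late.
toy_delay <- c(-12, 0, 8, NA, 56)

# Same construction as ActDelay: negative delays are set to 0.
act_ifelse <- ifelse(toy_delay < 0, 0, toy_delay)

# Equivalent base R alternative: elementwise maximum of the delay and 0.
act_pmax <- pmax(toy_delay, 0)

rbind(toy_delay, act_ifelse, act_pmax)

# Truncating at zero can only raise the mean; missing values stay missing.
mean(toy_delay, na.rm = TRUE)
mean(act_ifelse, na.rm = TRUE)
```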
favstats(ActDelay ~ TimeOfDay, data=ds)
##      .group min Q1 median Q3  max  mean   sd     n missing
## 1   morning   0  0      0  5  842  8.03 26.8 45423     600
## 2 afternoon   0  0      0 11 1104 11.69 29.4 36505     481
## 3   evening   0  0      0 11  720 12.00 28.1 24227     351
bwplot(ActDelay ~ TimeOfDay, ylim=c(-10, 120),
       main="March flights from LA, 2011-2014",
       ylab="Actual arrival delay (in minutes)", data=ds)
bwplot(TimeOfDay ~ ActDelay, xlim=c(-10, 120),
       main="March flights from LA, 2011-2014",
       xlab="Actual arrival delay (in minutes)", data=ds)
4. Consider page 196 in your text.
• Are Arrival Delays symmetric, skewed right, or skewed left?
• What about the shape of the density for the departure time?
Create both density plots and histograms:
ds2014 = filter(ds, Year==2014)
ds2014 = mutate(ds2014, TimeOfDay = cut(CRSDepTime, breaks=c(0, 1200, 1800,
2400), labels=c("morning", "afternoon", "evening")))
densityplot(~ ArrDelay, groups=TimeOfDay, auto.key=TRUE,
xlab="Arrival delay (in minutes)", xlim=c(-35, 120), data=ds2014)
histogram(~ ArrDelay | TimeOfDay, auto.key=TRUE,
xlab="Arrival delay (in minutes)", xlim=c(-35, 120),
layout=c(1,3),data=ds2014)
densityplot(~ CRSDepTime, groups=TimeOfDay, auto.key=TRUE,
            xlab="Scheduled departure time (HHMM)", data=ds2014)
histogram(~ CRSDepTime | TimeOfDay, auto.key=TRUE,
          xlab="Scheduled departure time (HHMM)", data=ds2014)
To turn in
The goal of Lab 7 is to explore rates of fatal accidents for different ages. The data for this
project come from the federal government's database on fatal accidents, FARS (the
Fatality Analysis Reporting System). For more information, read
http://www.nhtsa.gov/FARS
The dataset consists of all fatal accidents in 2011. An observational unit is a driver involved
in a fatal accident. Note, there could be two observational units that refer to the same
accident (how?). The dataset allows us to look at the ages of drivers involved in fatal
crashes and distracted driving rates. (Data courtesy of Laura Kapitula at Grand Valley State
University.)
crashdata = read.table(
  "http://pages.pomona.edu/~jsh04747/courses/math58/DRIVERS_CREATED.csv",
  sep=",", header=TRUE, na.strings="NA")
attach(crashdata)
dim(crashdata) # how many rows and variables in the dataset?
## [1] 43758     9
crashdata[1:5,]   # it is always a good idea to look at the data
##     STATE ST_CASE      A_DIST type.crash VEH_NO PER_NO      AGE ageYrs    SEX
## 1 Alabama   10001 Other Crash          2      1      1 55 Years     55 Female
## 2 Alabama   10002 Other Crash          2      1      1 24 Years     24   Male
## 3 Alabama   10003 Other Crash          2      1      1 30 Years     30   Male
## 4 Alabama   10003 Other Crash          2      2      1 33 Years     33 Female
## 5 Alabama   10005 Other Crash          2      1      1 23 Years     23 Female
Notice that Other Crash is coded "2" and Involving a Distracted Driver is coded as "1".
.1. Create a histogram, density plot, and box plot of age using R. What features of the
distribution can you see in the histogram that you cannot see in the box plot?
.2. Looking at just the histogram or the density plot, how would you describe the shape of
the distribution?
.3. Give the mean, five number summary, IQR and standard deviation of age.
.4. Go to http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml. Go to Decennial
Census data and then go to Profile of General Population and Housing Characteristics: 2010
(SF means Summary File). What proportion of the population is between 15 and 24 years
old?
.5. Using the frequency numbers for age, find the proportion of drivers involved in fatal
accidents that are between 15 and 24 years old.
# the histogram
hist(ageYrs, breaks=seq(0, 100, by=5), right=FALSE, labels=TRUE)
# the cutoffs
hist(ageYrs, plot=FALSE, breaks=seq(0, 100, by=5), right=FALSE)$breaks
##  [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80
## [18]  85  90  95 100
# the frequencies in each bin
hist(ageYrs, plot=FALSE, breaks=seq(0, 100, by=5), right=FALSE)$counts
##  [1]    0    4   56 3212 5600 4560 3957 3444 3614 3762 3731 3052 2490 1654
## [15] 1293  991  819  501  190   21
# the total number of non-missing ages
sum(!is.na(ageYrs))
## [1] 42951
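To read the cutoffs and frequencies above correctly, it matters which end of each bin is included. With right=FALSE, hist() counts an age in the bin that starts at it, so 20 is counted in the bin from 20 up to (but not including) 25. The sketch below checks this on a handful of made-up ages (toy_ages is hypothetical, not taken from crashdata) and shows that missing ages are dropped from the counts.

```r
# Made-up ages sitting on bin edges, plus one missing value.
toy_ages <- c(15, 19, 20, 24, 25, NA)

toy_hist <- hist(toy_ages, breaks = seq(0, 100, by = 5),
                 right = FALSE, plot = FALSE)

# Label each count with its bin; the last bin also includes its right end.
toy_counts <- toy_hist$counts
names(toy_counts) <- paste0("[", head(toy_hist$breaks, -1), ",",
                            tail(toy_hist$breaks, -1), ")")
toy_counts[toy_counts > 0]

# The counts add up to the number of non-missing ages.
sum(toy_counts)
sum(!is.na(toy_ages))
```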
.6. Does the analysis in .5. lead us to believe, therefore, that younger drivers are more
accident prone? Explain.
.7. Next get a comparative box plot of ages for different levels of the variable type.crash.
Provide your boxplot.
.8. What differences and/or similarities do you see in the distribution of ages for accidents
involving distracted driving and those not involving distracted driving? Write a few
sentences.
Note: Given this is population data any difference we see IS a "real" difference, meaning we
know it is not due to sampling error (chance alone) since we are not taking a sample.
However, one question we might ask is "Is the difference large enough to be of practical
importance?"
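The distinction in the note can be made concrete with a small simulation. The sketch below builds a made-up "population" of ages in two groups (the group sizes, means, and standard deviations are invented for illustration and are not FARS values). The difference in population means is a single fixed number, while the difference computed from a sample of 100 changes every time a new sample is drawn.

```r
set.seed(58)  # arbitrary seed so the sketch is reproducible

# Simulated population of driver ages in two groups (invented numbers).
pop <- data.frame(
  group = rep(c("distracted", "other"), times = c(2000, 8000)),
  age   = c(rnorm(2000, mean = 38, sd = 15), rnorm(8000, mean = 41, sd = 17))
)

# Census: with the whole population in hand, the difference is a parameter.
pop_means <- tapply(pop$age, pop$group, mean)
pop_diff <- unname(pop_means["other"] - pop_means["distracted"])

# Sampling: the same difference computed from 1000 random samples of 100.
sample_diff <- replicate(1000, {
  s <- pop[sample(nrow(pop), 100), ]
  m <- tapply(s$age, s$group, mean)
  unname(m["other"] - m["distracted"])
})

pop_diff          # one fixed value, no sampling error
sd(sample_diff)   # spread that comes from sampling alone
```

Whether a fixed population difference of this size matters in practice is a separate judgment that no calculation here settles.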