Using the LA Airline Data
Jo Hardin
Lab Assignment 7

Lab Goals:
1. To work with a large, real dataset that is publicly available (though the airline data has already been cleaned up a bit).
2. To get used to working with quantitative data in R.
3. To understand the difference between inference (which is done on a sample from the population) and descriptions of the entire population (a census, where we find the actual parameters).

Installing packages:
Make sure the mosaic package is installed on your computer (from the console) before trying to knit.

install.packages("mosaic")

Accessing & Defining the data
I have created a dataset of all flights out of LA in March from 2011-2014. Given next week's adventures, you might be interested in delays out of any of the various airports. [Thanks to: Nicholas Horton at Amherst College.]

• Origin, Dest – the airports of origin and destination for each flight
• DepTime, ArrTime – actual departure and arrival time
• CRSDepTime, CRSArrTime – scheduled departure and arrival time (CRS is the Computer Reservation System)
• DepDelay, ArrDelay – the departure and arrival delays based on scheduled departure and arrival times
• Cancelled – a binary variable on whether a flight was cancelled (0 = no, 1 = yes)

All flying times are reported in HHMM format (double digit hours and minutes), e.g. 1802 corresponds to 6:02 pm. Because the times are stored as numbers, leading zeros are dropped, so 545 is 5:45 am. Times are reported in the local time of the airport, but for our purposes we have transformed all times to the same scale, here Eastern Standard Time (EST).

1. Entering the data. The first R chunk is something you should read through, but you won't have to adjust it (or answer any questions on it). After you load in airlinesLA.rda, you'll have a dataset called ds.
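Before reading the chunk, it may help to see how an HHMM value breaks into hours and minutes. The following is only a sketch: the helper name hhmm_to_minutes is hypothetical and is not part of the lab code or of any package used here.

```r
# Hypothetical helper (an assumption for illustration, not part of the lab):
# convert an HHMM number into minutes after midnight.
hhmm_to_minutes <- function(hhmm) {
  hours <- hhmm %/% 100    # integer division keeps the HH part
  minutes <- hhmm %% 100   # remainder keeps the MM part
  hours * 60 + minutes
}

six_oh_two_pm <- hhmm_to_minutes(1802)        # 18 hours and 2 minutes
several <- hhmm_to_minutes(c(545, 1200, 2359))
print(six_oh_two_pm)
print(several)
```

Note that HHMM values still sort correctly as plain numbers, which is why the lab can sort and cut CRSDepTime directly without converting it first.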
require(mosaic)
load(url("http://www.rossmanchance.com/iscam2/ISCAM.RData"))
options(digits=3)
trellis.par.set(theme=col.mosaic())   # get a better color scheme for lattice
load(url("http://pages.pomona.edu/~jsh04747/courses/math58/airlinesLA.rda"))
names(ds)

##  [1] "DayofMonth"    "Month"         "Year"          "Origin"
##  [5] "Dest"          "UniqueCarrier" "TailNum"       "CRSDepTime"
##  [9] "ArrDelay"      "Cancelled"     "date"          "weekday"

# looking at the data filtered by March 11th, 2013
# and sorted by scheduled departure time
filter(ds, Year==2013 & DayofMonth==11 & Origin=='ONT') %>%
  select(DayofMonth, Month, Year, Origin, Dest, CRSDepTime, ArrDelay, Cancelled) %>%
  arrange(CRSDepTime)

##    DayofMonth Month Year Origin Dest CRSDepTime ArrDelay Cancelled
## 1          11     3 2013    ONT  DEN        545       -1         0
## 2          11     3 2013    ONT  DFW        600       -4         0
## 3          11     3 2013    ONT  PHX        600       -6         0
## 4          11     3 2013    ONT  SMF        600      -15         0
## 5          11     3 2013    ONT  SFO        600      -12         0
## 6          11     3 2013    ONT  SLC        604       -3         0
## 7          11     3 2013    ONT  SJC        610       -9         0
## 8          11     3 2013    ONT  SEA        615      -14         0
## 9          11     3 2013    ONT  LAS        615       -6         0
## 10         11     3 2013    ONT  DEN        630        5         0
## 11         11     3 2013    ONT  PHX        630      -10         0
## 12         11     3 2013    ONT  MDW        645       16         0
## 13         11     3 2013    ONT  OAK        710      -11         0
## 14         11     3 2013    ONT  DFW        740      -20         0
## 15         11     3 2013    ONT  SMF        750       -7         0
## 16         11     3 2013    ONT  OAK        755       -1         0
## 17         11     3 2013    ONT  PHX        810        3         0
## 18         11     3 2013    ONT  SFO        846       56         0
## 19         11     3 2013    ONT  LAS        855       -3         0
## 20         11     3 2013    ONT  PHX        900       -9         0
## 21         11     3 2013    ONT  PDX        935      -20         0
## 22         11     3 2013    ONT  SMF        955       -8         0
## 23         11     3 2013    ONT  PHX       1000      -13         0
## 24         11     3 2013    ONT  SLC       1017       -6         0
## 25         11     3 2013    ONT  LAS       1040      -11         0
## 26         11     3 2013    ONT  SEA       1045      -14         0
## 27         11     3 2013    ONT  SJC       1050       -7         0
## 28         11     3 2013    ONT  OAK       1115      -13         0
## 29         11     3 2013    ONT  IAH       1126       -7         0
## 30         11     3 2013    ONT  DFW       1150       -5         0
## 31         11     3 2013    ONT  OAK       1205      -16         0
## 32         11     3 2013    ONT  LAS       1215       -3         0
## 33         11     3 2013    ONT  PHX       1235      -10         0
## 34         11     3 2013    ONT  SMF       1245      -10         0
## 35         11     3 2013    ONT  SFO       1309       20         0
## 36         11     3 2013    ONT  PHX       1310       -9         0
## 37         11     3 2013    ONT  SLC       1315       -4         0
## 38         11     3 2013    ONT  SJC       1325       -1         0
## 39         11     3 2013    ONT  SMF       1400        5         0
## 40         11     3 2013    ONT  PDX       1410        6         0
## 41         11     3 2013    ONT  DFW       1440       -2         0
## 42         11     3 2013    ONT  OAK       1455        4         0
## 43         11     3 2013    ONT  DEN       1500       -1         0
## 44         11     3 2013    ONT  DEN       1545       -7         0
## 45         11     3 2013    ONT  LAS       1600        2         0
## 46         11     3 2013    ONT  SLC       1621      -10         0
## 47         11     3 2013    ONT  PHX       1630      -12         0
## 48         11     3 2013    ONT  PHX       1645        1         0
## 49         11     3 2013    ONT  SMF       1720       -7         0
## 50         11     3 2013    ONT  OAK       1745      -16         0
## 51         11     3 2013    ONT  SJC       1800       -7         0
## 52         11     3 2013    ONT  SEA       1805       -5         0
## 53         11     3 2013    ONT  SFO       1836      -14         0
## 54         11     3 2013    ONT  PHX       1930      -19         0
## 55         11     3 2013    ONT  OAK       1945       18         0
## 56         11     3 2013    ONT  LAS       1955       -4         0
## 57         11     3 2013    ONT  SMF       2010        6         0
## 58         11     3 2013    ONT  PHX       2040        0         0
## 59         11     3 2013    ONT  SJC       2115       -8         0
## 60         11     3 2013    ONT  SMF       2145       -6         0

# assigning that same data above to a variable:
ds.Mar11 = filter(ds, Year==2013 & DayofMonth==11 & Origin=='ONT') %>%
  select(DayofMonth, Month, Year, Origin, Dest, CRSDepTime, ArrDelay, Cancelled) %>%
  arrange(CRSDepTime)

# looking at just the first few rows of the created dataset
head(ds.Mar11)

##   DayofMonth Month Year Origin Dest CRSDepTime ArrDelay Cancelled
## 1         11     3 2013    ONT  DEN        545       -1         0
## 2         11     3 2013    ONT  DFW        600       -4         0
## 3         11     3 2013    ONT  PHX        600       -6         0
## 4         11     3 2013    ONT  SMF        600      -15         0
## 5         11     3 2013    ONT  SFO        600      -12         0
## 6         11     3 2013    ONT  SLC        604       -3         0

# adding a new variable that gives three time of day groups
ds = mutate(ds, TimeOfDay = cut(CRSDepTime,
                                breaks=c(0, 1200, 1800, 2400),
                                labels=c("morning", "afternoon", "evening")))

# adding a new variable that gives a binary indicator of whether or not
# the flight was delayed or cancelled.
ds = mutate(ds, delayorcancel = ifelse(is.na(ArrDelay) | ArrDelay > 15, "yes", "no"))
attach(ds)

2. Now, look at the data and summarize some of the variables. Notice the names of the variables, then summarize as many of the variables as are interesting to you.
• What is the difference between favstats and tally and iscamsummary?
• What does the ~ command do within tally?
• Does the ~ command work for favstats?

names(ds)

##  [1] "DayofMonth"    "Month"         "Year"          "Origin"
##  [5] "Dest"          "UniqueCarrier" "TailNum"       "CRSDepTime"
##  [9] "ArrDelay"      "Cancelled"     "date"          "weekday"
## [13] "TimeOfDay"     "delayorcancel"

favstats(~CRSDepTime, data=ds)

##  min  Q1 median   Q3  max mean  sd      n missing
##    1 905   1305 1740 2359 1339 507 107587       0

tally(~ TimeOfDay, data=ds)

##
##   morning afternoon   evening
##     46023     36986     24578

iscamsummary(CRSDepTime)

##      n    Min     Q1 Median     Q3    Max   Mean     SD
## 108000      1    905   1300   1740   2360   1340    507

iscamsummary(Cancelled)

##        n      Min       Q1   Median       Q3      Max     Mean       SD
## 1.08e+05 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 1.15e-02 1.07e-01

iscamsummary(TimeOfDay)

## Error in quantile.default(x, na.rm = TRUE): factors are not allowed

tally(~ Cancelled, format="percent", data=ds)

##
##     0     1
## 98.85  1.15

tally(~ delayorcancel, format="percent", data=ds)

##
##   no  yes
## 81.9 18.1

tally(Cancelled ~ TimeOfDay, format="percent", data=ds)

##          TimeOfDay
## Cancelled morning afternoon evening
##         0   98.90     98.85   98.73
##         1    1.10      1.15    1.27

tally(delayorcancel ~ TimeOfDay, format="percent", data=ds)

##              TimeOfDay
## delayorcancel morning afternoon evening
##           no     86.2      79.2    78.1
##           yes    13.8      20.8    21.9

favstats(CRSDepTime ~ Cancelled, data=ds)

##   .group min  Q1 median   Q3  max mean  sd      n missing
## 1      0   1 905   1305 1740 2359 1339 507 106346       0
## 2      1 135 945   1329 1805 2359 1368 491   1241       0

favstats(ArrDelay ~ TimeOfDay, data=ds)

##      .group  min  Q1 median Q3  max  mean   sd     n missing
## 1   morning  -58 -12     -5  5  842 0.808 30.1 45423     600
## 2 afternoon  -60  -9     -2 11 1104 6.240 32.3 36505     481
## 3   evening -108 -10     -2 11  720 6.003 31.6 24227     351

3. Look in your textbook at pages 197-198.
• What are quantiles?
• How do they determine the values associated with a boxplot?
• Can you change the limits of your boxplot so that you observe every single observation?
• At what time of day are the flights most likely to be delayed?
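As a warm-up for the quantile and boxplot questions, here is a minimal sketch in base R. The numbers are made up for illustration and are not taken from ds. The 1.5 × IQR rule shown is the usual boxplot convention. R's own boxplot functions compute the box from hinges, which can differ slightly from quantile(), so treat the values below as an approximation of what a plot would show.

```r
# Made-up arrival delays in minutes (illustration only, not from ds)
delays <- c(-12, -8, -5, -3, 0, 2, 5, 9, 14, 120)

# Quartiles: the values below which 25%, 50%, and 75% of the data fall
q <- quantile(delays, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(delays)   # Q3 - Q1, the height of the box

# Conventional fences: observations beyond 1.5 * IQR from the box
# are drawn as individual points rather than inside the whiskers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- delays[delays < lower_fence | delays > upper_fence]

print(q)
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

With one very long delay in the data, the box stays near zero while the single large value is flagged separately, which is why the axis limits matter when you plot the real delays.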
favstats(~ ArrDelay, data=ds)   # formula interface, so ArrDelay is looked up in ds

##   min  Q1 median Q3  max mean   sd      n missing
##  -108 -11     -3  8 1104 3.86 31.3 106155    1432

# early arrivals (negative delays) are counted as a delay of 0 minutes
ds = mutate(ds, ActDelay = ifelse(ArrDelay < 0, 0, ArrDelay))
favstats(~ ActDelay, data=ds)

##  min Q1 median Q3  max mean   sd      n missing
##    0  0      0  8 1104 10.2 28.1 106155    1432

favstats(ActDelay ~ TimeOfDay, data=ds)

##      .group min Q1 median Q3  max  mean   sd     n missing
## 1   morning   0  0      0  5  842  8.03 26.8 45423     600
## 2 afternoon   0  0      0 11 1104 11.69 29.4 36505     481
## 3   evening   0  0      0 11  720 12.00 28.1 24227     351

bwplot(ActDelay ~ TimeOfDay, ylim=c(-10, 120),
       main="March flights from LA, 2011-2014",
       ylab="Actual arrival delay (in minutes)", data=ds)

bwplot(TimeOfDay ~ ActDelay, xlim=c(-10, 120),
       main="March flights from LA, 2011-2014",
       xlab="Actual arrival delay (in minutes)", data=ds)

4. Consider page 196 in your text.
• Are arrival delays symmetric, skewed right, or skewed left?
• What about the shape of the density for the departure time?
Create both density plots and histograms:

ds2014 = filter(ds, Year==2014)
ds2014 = mutate(ds2014, TimeOfDay = cut(CRSDepTime,
                                        breaks=c(0, 1200, 1800, 2400),
                                        labels=c("morning", "afternoon", "evening")))

densityplot(~ ArrDelay, groups=TimeOfDay, auto.key=TRUE,
            xlab="Arrival delay (in minutes)", xlim=c(-35, 120), data=ds2014)

histogram(~ ArrDelay | TimeOfDay, auto.key=TRUE,
          xlab="Arrival delay (in minutes)", xlim=c(-35, 120),
          layout=c(1,3), data=ds2014)

# CRSDepTime is an HHMM clock time, not a count of minutes
densityplot(~ CRSDepTime, groups=TimeOfDay, auto.key=TRUE,
            xlab="Scheduled departure time (HHMM)", data=ds2014)

histogram(~ CRSDepTime | TimeOfDay, auto.key=TRUE,
          xlab="Scheduled departure time (HHMM)", data=ds2014)

To turn in

The goal of Lab 7 is to explore rates of fatal accidents for different ages. In this project the data come from the federal government's records on fatal accidents. The name of the federal database is FARS (Fatality Analysis Reporting System). For more information read http://www.nhtsa.gov/FARS
The dataset consists of all fatal accidents in 2011.
An observational unit is a driver involved in a fatal accident. Note, there could be two observational units that refer to the same accident (how?). The dataset allows us to look at the ages of drivers involved in fatal crashes and distracted driving rates. (Data courtesy of Laura Kapitula at Grand Valley State University.)

crashdata = read.table("http://pages.pomona.edu/~jsh04747/courses/math58/DRIVERS_CREATED.csv",
                       sep=",", header=TRUE, na.strings="NA")
attach(crashdata)
dim(crashdata)   # how many rows and variables in the dataset?

## [1] 43758     9

crashdata[1:5,]   # it is always a good idea to look at the data

##     STATE ST_CASE      A_DIST type.crash VEH_NO PER_NO      AGE ageYrs
## 1 Alabama   10001 Other Crash          2      1      1 55 Years     55
## 2 Alabama   10002 Other Crash          2      1      1 24 Years     24
## 3 Alabama   10003 Other Crash          2      1      1 30 Years     30
## 4 Alabama   10003 Other Crash          2      2      1 33 Years     33
## 5 Alabama   10005 Other Crash          2      1      1 23 Years     23
##      SEX
## 1 Female
## 2   Male
## 3   Male
## 4 Female
## 5 Female

Notice that Other Crash is coded "2" and Involving a Distracted Driver is coded as "1".

.1. Create a histogram, density plot, and box plot of age using R. What features of the distribution can you see in the histogram that you cannot see in the box plot?
.2. Looking at just the histogram or the density plot, how would you describe the shape of the distribution?
.3. Give the mean, five number summary, IQR and standard deviation of age.
.4. Go to http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml. Go to Decennial Census data and then go to Profile of General Population and Housing Characteristics: 2010 (SF means Summary File). What proportion of the population is between 15 and 24 years old?
.5. Using the frequency numbers for age, find the proportion of drivers involved in fatal accidents that are between 15 and 24 years old.
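Question .5 depends on which ages land in which bin. The sketch below uses a handful of made-up ages (not the FARS data) to show how left-closed bins of width 5 behave in base R's hist() when right = FALSE, and how two adjacent bin counts combine into a proportion.

```r
# Made-up ages (illustration only, not the FARS data)
ages <- c(14, 15, 19, 20, 24, 25, 30)

# right = FALSE makes each bin closed on the left: [15, 20), [20, 25), ...
h <- hist(ages, breaks = seq(0, 100, by = 5), right = FALSE, plot = FALSE)

# Bin 4 is [15, 20) and bin 5 is [20, 25), so together they cover ages 15 to 24;
# age 25 falls in the next bin and is not counted
in_15_to_24 <- sum(h$counts[4:5])
prop_15_to_24 <- in_15_to_24 / sum(!is.na(ages))

print(h$counts[3:7])
print(prop_15_to_24)
```

The same indexing idea applies to the real frequencies: identify the bins that cover the age range, add their counts, and divide by the number of non-missing ages.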
# the histogram (bins of width 5, closed on the left, with counts printed on the bars)
hist(ageYrs, breaks=seq(0, 100, by=5), right=FALSE, labels=TRUE)

# the cutoffs
hist(ageYrs, plot=FALSE, breaks=seq(0, 100, by=5), right=FALSE)$breaks

##  [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80
## [18]  85  90  95 100

# the frequencies in each bin
hist(ageYrs, plot=FALSE, breaks=seq(0, 100, by=5), right=FALSE)$counts

##  [1]    0    4   56 3212 5600 4560 3957 3444 3614 3762 3731 3052 2490 1654
## [15] 1293  991  819  501  190   21

# the total number of non-missing ages
sum(!is.na(ageYrs))

## [1] 42951

.6. Does the analysis in .5. lead us to believe, therefore, that younger drivers are more accident prone? Explain.
.7. Next get a comparative box plot of ages for different levels of the variable type.crash. Provide your boxplot.
.8. What differences and/or similarities do you see in the distribution of ages for accidents involving distracted driving and those not involving distracted driving? Write a few sentences.

Note: Given that this is population data, any difference we see IS a "real" difference, meaning we know it is not due to sampling error (chance alone), since we are not taking a sample. However, one question we might ask is "Is the difference large enough to be of practical importance?"
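The distinction in the note can be illustrated with a small simulation. This is a sketch under stated assumptions: the "population" below is randomly generated for illustration and has nothing to do with the FARS data, and the seed and sample size are arbitrary choices.

```r
set.seed(58)   # arbitrary seed, only so the sketch is reproducible

# Made-up population of 10,000 driver ages (illustration only)
population <- sample(15:90, size = 10000, replace = TRUE)

# Census: computed from every unit, so this is the parameter itself
param <- mean(population <= 24)

# Samples: each sample of 200 gives a slightly different estimate,
# and that spread is the sampling error the note refers to
sample_props <- replicate(5, mean(sample(population, size = 200) <= 24))

print(param)
print(sample_props)
```

The census value does not change however many times it is recomputed, while the sample proportions scatter around it. That is why a difference seen in population data cannot be attributed to chance, though it may still be too small to matter in practice.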