Project III
• Presentation schedule is online
• If you can’t find your name on the list, contact me!
• If you can’t find your team on the list, submit the abstract! Now!

R and Databases
stat 579
Heike Hofmann

Outline
• dplyr again
• SQL statements
• denormalisation: joining tables

Connecting to the database with dplyr
> db <- src_mysql(host="headnode.stat.iastate.edu", user="2009Expo", port=3306, dbname="BRFSS", password="R R0cks")
Loading required package: RMySQL
Loading required package: DBI
> db
src: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/BRFSS]
tbls: brfss12

using sql queries
> brfss <- tbl(db, "brfss12")
> tbl(db, sql("select count(*) from brfss12"))
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/BRFSS]
From: <derived table> [?? x 1]

   count(*)
1    475687
..      ...
• in order to understand this we need to look a bit closer at databases …

Relational Databases

Student
ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

Keys link tables

Attendance
ID    Date   Status
1234  02-05  Absent
1234  02-07  Absent
4321  02-05  Present
4321  02-07  Absent
...   ...    ...

Table keys
Relationships are defined by key-key relationships:
1:1  one to one
1:m  one to many
m:n  many to many

Relational Database
• A database is a collection of normalized data tables
• A database system is a framework of data management functions

Patient        Smoker  Age
Able Peterson  No      45
Bertha Mathes  Yes     60

Surgeon          Years experience  Medical school
Pete Roberts     15                Harvard medical school
Ralf Countryman  10                University of Texas

relations: 1:n, 1:m

Date          Patient        Surgeon          Procedure         Result
Mar 17, 2006  Able Peterson  Pete Roberts     Appendectomy      Excellent
Mar 19, 2006  Bertha Mathes  Ralf Countryman  Heart transplant  Good

SQL
• Structured Query Language (builds on the relational model of E. F. Codd, 1970)
• Programming language used for accessing data in a database
• ANSI standard since 1986, ISO standard since 1987
• Still some portability issues between software and operating systems!
• We’ll mainly focus on SQL queries to access data

SQL Queries
• SELECT columns(, aggregate function(*))
  FROM table1(, table2)
  WHERE row_condition (AND table1.id = table2.id)
  (GROUP BY column)
  (ORDER BY order_by_columns)

Syntax
• SQL keywords are not case sensitive.
• Some systems require “;” at the end of each statement. The semicolon separates SQL statements in systems that allow multiple commands to be executed in one call to the server.

SELECT
• Selects data from the database
  SELECT column_name(s)
  FROM table_name

The examples on the following slides all use these two tables:

Student
ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

Attendance
ID    Date   Status
1234  02-05  Absent
1234  02-07  Absent
4321  02-05  Present
4321  02-07  Absent
...   ...    ...

  SELECT Name, Major
  FROM Student

Name        Major
Never Ever  Math
Some Times  CS
...         ...

SELECT All
  SELECT *
  FROM Student

ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

WHERE
  SELECT Name
  FROM Student
  WHERE Major='Math'

Name
Never Ever
...

Aggregating Functions
  SELECT ID, count(ID)
  FROM Attendance
  WHERE Status='Absent'
  GROUP BY ID

ID    count(ID)
1234  2
4321  1
...   ...

Functions
• COUNT
• AVG
• MAX
• MIN
• SUM
• ROUND
• LEN
• ...
What summary statistics do you want?
Very similar to ddply

Your Turn
• Go to the website http://www.w3schools.com/sql/ and click on “try for yourself”:
• What fields are in the table “customers”?
• Select the CustomerName and ContactName of customers that come from Germany
• Find a frequency breakdown of all customers by country.

Your Turn
• Go to http://www.w3schools.com/sql/sql_tryit.asp
• Read the help files for functions AVG(), COUNT() and MIN()

SQL commands
• Select command: from, where, group by, having, order by
• Aggregate functions: count, sum, min, max, avg
• Operators: =, <=, >=, !=, and, or, in (...), not in (...), like (wildcard pattern using % and _)

Example: US Flights
• Arrival/Departure details of all commercial flights in the US between Oct 1987 and Oct 2008
• Main interest: on-time performance
• more details: http://stat-computing.org/dataexpo/2009/

US Flights - Variables
    Name               Description
 1  Year               1987-2008
 2  Month              1-12
 3  DayofMonth         1-31
 4  DayOfWeek          1 (Monday) - 7 (Sunday)
 5  DepTime            actual departure time (local, hhmm)
 6  CRSDepTime         scheduled departure time (local, hhmm)
 7  ArrTime            actual arrival time (local, hhmm)
 8  CRSArrTime         scheduled arrival time (local, hhmm)
 9  UniqueCarrier      unique carrier code
10  FlightNum          flight number
11  TailNum            plane tail number
12  ActualElapsedTime  in minutes
13  CRSElapsedTime     in minutes
14  AirTime            in minutes
15  ArrDelay           arrival delay, in minutes
16  DepDelay           departure delay, in minutes
17  Origin             origin IATA airport code
18  Dest               destination IATA airport code
19  Distance           in miles
20  TaxiIn             taxi in time, in minutes
21  TaxiOut            taxi out time, in minutes
22  Cancelled          was the flight cancelled?
23  CancellationCode   reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24  Diverted           1 = yes, 0 = no
25  CarrierDelay       in minutes
26  WeatherDelay       in minutes
27  NASDelay           in minutes
28  SecurityDelay      in minutes
29  LateAircraftDelay  in minutes

Your Turn
• Connect to the data_expo_2009 database, and load the ontime table.
• How many flight records are there total?
• Select the first few records from the database
• Which airline carriers are in the database?
What questions are we interested in?

Your Turn
• Write out the SQL command to answer one of the posed questions.
• (don’t execute them! some statements might take very long or return too much data)
• Alternatively: Find the SQL command to determine
  - how many miles each carrier has flown over the time period,
  - how many delays occurred and
  - what percentage of flights was cancelled each year.

Delays over time
• ‘How do average delays evolve over time?’
• Specified to: monthly average departure delays for flights from Chicago O’Hare
• select Year, Month, avg(DepDelay) from ontime where Origin='ORD' group by Year, Month

Average Delays
> summary(delays)
      Year          Month         avg.DepDelay.
 Min.   :1987   Min.   : 1.000   Min.   : 2.593
 1st Qu.:1993   1st Qu.: 4.000   1st Qu.: 7.352
 Median :1998   Median : 7.000   Median :10.463
 Mean   :1998   Mean   : 6.559   Mean   :11.633
 3rd Qu.:2003   3rd Qu.:10.000   3rd Qu.:14.768
 Max.   :2008   Max.   :12.000   Max.   :32.763

Average Monthly Departure Delays from Chicago
[Figure: avg.DepDelay. by Month, lines colored by Year]
Yearly differences are hard to tell, but it seems that red lines show higher delays, with increased delays during the summer and at the beginning and the end of a year.

Avg Monthly Delay by Year
[Figure: avg.DepDelay. by Month, one panel per Year, 1987 to 2008]
2000 is different, seasonal pattern hard to see. Beginning and end of years have high delays.

[Figure: avg.DepDelay. by Year, one panel per Month, 1 to 12]
Delays seem to increase steeply for some months in recent years.

Your Turn
• Write SQL statements to retrieve data to answer the following questions (don’t execute them!
some statements might take very long or return too much data)
• How many flights take off each day?
• Which airlines cancel flights most often?
  - what percentage of flights gets cancelled?
• What is the main cause for flight cancellations? Is there a time trend?

SQL Sample Queries
• Select count(*) from ontime
• Select * from ontime limit 15
• Select distinct(UniqueCarrier) from ontime
• the following statement might take a while:
  select Year, UniqueCarrier, count(*), sum(Distance), sum(DepDelay), sum(ArrDelay-DepDelay), avg(Cancelled)
  from ontime
  group by Year, UniqueCarrier

Your Turn
• Log into the data_expo_09 database using the connection details as specified earlier.
• Download weather data for Des Moines (maybe just one year)
• How many tornadoes/hail storms/thunderstorms are there?
• Very Advanced: link the weather information with flight departures/arrivals: how does e.g. visibility affect delays?

Denormalization: Join
• reverse of normalization, joins tables
• easiest done by adding appropriate constraints in the WHERE clause
• Example
  SELECT * from ontime o, weather w
  WHERE o.Year=w.Year and o.Month=w.Month and DayofMonth=Day
    and DepTime div 100 = Hour and Origin=IATA

  select * from ontime o, weather w1, weather w2
  where UniqueCarrier = 'US'
    and w1.Year=o.Year and w1.Month=o.Month and w1.Day=DayofMonth
    and w1.Hour=CRSDepTime div 100 and w1.IATA=Origin
    and w2.Year=o.Year and w2.Month=o.Month and w2.Day=DayofMonth
    and w2.Hour=ArrTime div 100 and w2.IATA=Dest
  limit 10

Your Turn
• Construct an SQL statement to link 10 US Airways flights to the weather records at the airport of origin during departure time and the weather at the destination during arrival time (closest hour only)
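
The two-way weather join can be tried without access to the course server. The following is a minimal sketch, not the course's setup: it assumes an in-memory SQLite database (through Python's sqlite3 module) in place of the MySQL server, a handful of made-up rows, and a hypothetical Visibility column in weather. SQLite has no div operator, so integer division with / on integer columns stands in for MySQL's div.

```python
import sqlite3

# Toy stand-ins for the `ontime` and `weather` tables. All rows are invented,
# and `Visibility` is a hypothetical column name used only for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ontime (Year INT, Month INT, DayofMonth INT, CRSDepTime INT,
                     ArrTime INT, UniqueCarrier TEXT, Origin TEXT, Dest TEXT,
                     DepDelay INT);
CREATE TABLE weather (Year INT, Month INT, Day INT, Hour INT, IATA TEXT,
                      Visibility REAL);
INSERT INTO ontime VALUES
  (2008, 1, 3, 1430, 1655, 'US', 'DSM', 'ORD', 12),
  (2008, 1, 3,  905, 1010, 'AA', 'DSM', 'ORD',  0);
INSERT INTO weather VALUES
  (2008, 1, 3, 14, 'DSM', 10.0),
  (2008, 1, 3, 16, 'ORD',  2.5),
  (2008, 1, 4, 16, 'ORD',  9.0),
  (2008, 1, 3,  9, 'DSM',  8.0);
""")

# weather is joined twice: w1 at the origin for the scheduled departure hour,
# w2 at the destination for the actual arrival hour. hhmm / 100 gives the hour
# because both operands are integers (SQLite integer division).
query = """
SELECT o.Origin, o.Dest, o.DepDelay,
       w1.Visibility AS origin_vis, w2.Visibility AS dest_vis
FROM ontime o, weather w1, weather w2
WHERE o.UniqueCarrier = 'US'
  AND w1.Year = o.Year AND w1.Month = o.Month AND w1.Day = o.DayofMonth
  AND w1.Hour = o.CRSDepTime / 100 AND w1.IATA = o.Origin
  AND w2.Year = o.Year AND w2.Month = o.Month AND w2.Day = o.DayofMonth
  AND w2.Hour = o.ArrTime / 100 AND w2.IATA = o.Dest
LIMIT 10
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

Only the US flight is returned, paired with the 14:00 record at DSM and the 16:00 record at ORD on the same day. The ORD record from the following day is dropped by the day constraint on w2; matching on the departure date is itself a simplification, since flights that arrive after midnight would need the next day's weather.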