Project III
• Presentation schedule is online
• If you can’t find your name on the list,
contact me!
• If you can’t find your team on the list,
submit the abstract! Now!
R and Databases
stat 579
Heike Hofmann
Outline
• dplyr again
• SQL statements
• denormalisation: joining tables
Connecting to the
database with dplyr
> db <- src_mysql(host="headnode.stat.iastate.edu",
user="2009Expo", port=3306, dbname="BRFSS", password="R R0cks")
Loading required package: RMySQL
Loading required package: DBI
> db
src: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/BRFSS]
tbls: brfss12
using sql queries
> brfss <- tbl(db, "brfss12")
> tbl(db, sql("select count(*) from brfss12"))
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/BRFSS]
From: <derived table> [?? x 1]

   count(*)
1    475687
..      ...
• in order to understand this we need to
look a bit closer at databases …
Relational Databases
Student
ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

Attendance
ID    Date   Status
1234  02-05  Absent
1234  02-07  Absent
4321  02-05  Present
4321  02-07  Absent
...   ...    ...

Keys link tables
Table keys
Relationships between tables are defined through their keys:
1:1  one to one
1:m  one to many
m:n  many to many
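A minimal sketch of a 1:m relationship, using the Student and Attendance tables from the slides. It assumes Python's built-in sqlite3 module and an in-memory database as a stand-in for a real database server; only the SQL statement itself carries over to other systems.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a database server (assumption).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Student (ID INTEGER PRIMARY KEY, Name TEXT, Major TEXT);
    CREATE TABLE Attendance (ID INTEGER, Date TEXT, Status TEXT);
    INSERT INTO Student VALUES (1234, 'Never Ever', 'Math');
    INSERT INTO Student VALUES (4321, 'Some Times', 'CS');
    INSERT INTO Attendance VALUES (1234, '02-05', 'Absent');
    INSERT INTO Attendance VALUES (1234, '02-07', 'Absent');
    INSERT INTO Attendance VALUES (4321, '02-05', 'Present');
    INSERT INTO Attendance VALUES (4321, '02-07', 'Absent');
""")

# ID is unique in Student but repeats in Attendance: one student, many records.
linked = con.execute("""
    SELECT s.Name, a.Date, a.Status
    FROM Student s, Attendance a
    WHERE s.ID = a.ID
    ORDER BY s.ID, a.Date
""").fetchall()

for row in linked:
    print(row)
```

Each student appears once in Student but once per date in the linked result, which is what "one to many" means here.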
Relational Database
• Database is a collection of normalized data tables
• Database System is a framework of data management functions

Patient        Smoker  Age
Able Peterson  No      45
Bertha Mathes  Yes     60

Surgeon          Years experience  Medical school
Pete Roberts     15                Harvard medical school
Ralf Countryman  10                University of Texas

relations: 1:n, 1:m

Date          Patient        Surgeon          Procedure         Result
Mar 17, 2006  Able Peterson  Pete Roberts     Appendectomy      Excellent
Mar 19, 2006  Bertha Mathes  Ralf Countryman  Heart transplant  Good
SQL
• Structured Query Language (developed at IBM in the early 1970s, building on E. F. Codd's relational model of 1970)
• Programming language used for accessing data
in a database
• ANSI standard since 1986, ISO standard since
1987
• Still some portability issues between software
and operating systems!
• We’ll mainly focus on SQL queries to access
data
SQL Queries
• SELECT columns(, aggregate function(*))
FROM table1(, table2)
WHERE row_condition (AND table1.id = table2.id)
(GROUP BY column)
(ORDER BY order_by_columns)
Syntax
• SQL keywords are not case sensitive (table and column names may be, depending on the system).
• Some systems require “;” at the end of each
statement. The semicolon separates SQL statements
in systems that allow multiple commands to be
executed in one call to the server.
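Both syntax points can be checked with a small sketch. It assumes Python's built-in sqlite3 module with an in-memory database in place of the course server; other systems may treat semicolons and identifier case differently.

```python
import sqlite3

# In-memory SQLite database standing in for a real server (assumption).
con = sqlite3.connect(":memory:")

# executescript() sends several statements in one call; ';' separates them.
con.executescript("""
    CREATE TABLE Student (ID INTEGER PRIMARY KEY, Name TEXT, Major TEXT);
    INSERT INTO Student VALUES (1234, 'Never Ever', 'Math');
    INSERT INTO Student VALUES (4321, 'Some Times', 'CS');
""")

# Keywords are not case sensitive: both spellings return the same rows.
upper = con.execute("SELECT Name FROM Student ORDER BY ID").fetchall()
lower = con.execute("select Name from Student order by ID").fetchall()

print(upper == lower)
print(upper)
```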
SELECT
• Selects data from the database
SELECT column_name(s)
FROM table_name
SELECT Name, Major
FROM Student

Name        Major
Never Ever  Math
Some Times  CS
...         ...
SELECT
All columns
SELECT *
FROM Student

ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...
WHERE
SELECT Name
FROM Student
WHERE Major='Math'

Name
Never Ever
...
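The SELECT and WHERE slides can be tried out on the two-row Student table. This is a sketch only: it assumes Python's built-in sqlite3 module and an in-memory database rather than the course server.

```python
import sqlite3

# In-memory SQLite database standing in for a real server (assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Student (ID INTEGER, Name TEXT, Major TEXT)")
con.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [(1234, "Never Ever", "Math"), (4321, "Some Times", "CS")],
)

# Column subset: keep only Name and Major.
names_majors = con.execute(
    "SELECT Name, Major FROM Student ORDER BY ID"
).fetchall()

# Row subset: keep only the students whose major is Math.
math_names = con.execute(
    "SELECT Name FROM Student WHERE Major = 'Math'"
).fetchall()

print(names_majors)
print(math_names)
```

Note the straight single quotes around 'Math'; typographic quotes copied from a slide are not valid SQL string delimiters.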
Aggregating Functions
SELECT ID, count(ID)
FROM Attendance
WHERE Status='Absent'
GROUP BY ID

ID    count(ID)
1234  2
4321  1
...   ...
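The absence count above can be reproduced with the four Attendance rows shown on the slides. The sketch assumes Python's built-in sqlite3 module and an in-memory database; the ORDER BY clause is added here only to make the row order predictable.

```python
import sqlite3

# In-memory SQLite database standing in for a real server (assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Attendance (ID INTEGER, Date TEXT, Status TEXT)")
con.executemany(
    "INSERT INTO Attendance VALUES (?, ?, ?)",
    [
        (1234, "02-05", "Absent"),
        (1234, "02-07", "Absent"),
        (4321, "02-05", "Present"),
        (4321, "02-07", "Absent"),
    ],
)

# WHERE filters rows first, then GROUP BY forms one group per student ID,
# and count() is evaluated once per group.
absences = con.execute("""
    SELECT ID, count(ID)
    FROM Attendance
    WHERE Status = 'Absent'
    GROUP BY ID
    ORDER BY ID
""").fetchall()

print(absences)
```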
Functions
• COUNT
• AVG
• MAX
• MIN
• SUM
• ROUND
• LEN (LENGTH in MySQL)
• ...
What summary statistics do you want?
Very similar to ddply
Your Turn
• Go to website http://www.w3schools.com/sql/ and click on “try for yourself”:
• What fields are in the table “customers”?
• Select the CustomerName and ContactName of customers that come from Germany
• Find a frequency breakdown of all customers by country.
Your Turn
• Go to http://www.w3schools.com/sql/sql_tryit.asp
• Read the help files for functions AVG(),
COUNT() and MIN()
SQL commands
• Select command
from, where, group by, having, order by
• Aggregate functions
count, sum, min, max, avg
• Operators
=, <=, >=, !=, and, or, in (...), not in (...),
like (pattern matching with the wildcards % and _)
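A short sketch of the clauses and operators not yet shown in an example: having, in, not in and like. It reuses the Student and Attendance tables from the earlier slides and assumes Python's built-in sqlite3 module with an in-memory database; details such as the case sensitivity of like differ between systems.

```python
import sqlite3

# In-memory SQLite database standing in for a real server (assumption).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Student (ID INTEGER, Name TEXT, Major TEXT);
    CREATE TABLE Attendance (ID INTEGER, Date TEXT, Status TEXT);
    INSERT INTO Student VALUES (1234, 'Never Ever', 'Math');
    INSERT INTO Student VALUES (4321, 'Some Times', 'CS');
    INSERT INTO Attendance VALUES (1234, '02-05', 'Absent');
    INSERT INTO Attendance VALUES (1234, '02-07', 'Absent');
    INSERT INTO Attendance VALUES (4321, '02-05', 'Present');
    INSERT INTO Attendance VALUES (4321, '02-07', 'Absent');
""")

# HAVING filters groups after aggregation (WHERE filters rows before it).
frequent_absentees = con.execute("""
    SELECT ID, count(*)
    FROM Attendance
    WHERE Status = 'Absent'
    GROUP BY ID
    HAVING count(*) >= 2
""").fetchall()

# IN / NOT IN test membership in a list of values.
in_list = con.execute(
    "SELECT Name FROM Student WHERE Major IN ('Math', 'Stat')"
).fetchall()
not_in_list = con.execute(
    "SELECT Name FROM Student WHERE Major NOT IN ('Math', 'Stat')"
).fetchall()

# LIKE matches wildcard patterns: % is any run of characters, _ is one character.
starts_with_s = con.execute(
    "SELECT Name FROM Student WHERE Name LIKE 'S%'"
).fetchall()

print(frequent_absentees, in_list, not_in_list, starts_with_s)
```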
Example: US Flights
• Arrival/Departure details of all commercial
flights in the US between Oct 1987 and
Oct 2008
• Main interest: on-time performance
• more details: http://stat-computing.org/dataexpo/2009/
US Flights - Variables
    Name               Description
 1  Year               1987-2008
 2  Month              1-12
 3  DayofMonth         1-31
 4  DayOfWeek          1 (Monday) - 7 (Sunday)
 5  DepTime            actual departure time (local, hhmm)
 6  CRSDepTime         scheduled departure time (local, hhmm)
 7  ArrTime            actual arrival time (local, hhmm)
 8  CRSArrTime         scheduled arrival time (local, hhmm)
 9  UniqueCarrier      unique carrier code
10  FlightNum          flight number
11  TailNum            plane tail number
12  ActualElapsedTime  in minutes
13  CRSElapsedTime     in minutes
14  AirTime            in minutes
US Flights - Variables (continued)
    Name               Description
15  ArrDelay           arrival delay, in minutes
16  DepDelay           departure delay, in minutes
17  Origin             origin IATA airport code
18  Dest               destination IATA airport code
19  Distance           in miles
20  TaxiIn             taxi in time, in minutes
21  TaxiOut            taxi out time, in minutes
22  Cancelled          was the flight cancelled?
23  CancellationCode   reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24  Diverted           1 = yes, 0 = no
25  CarrierDelay       in minutes
26  WeatherDelay       in minutes
27  NASDelay           in minutes
28  SecurityDelay      in minutes
29  LateAircraftDelay  in minutes
Your Turn
• Connect to the data_expo_2009 database,
and load the ontime table.
• How many flight records are there total?
• Select the first few records from the database
• Which airline carriers are in the database?
What questions are we
interested in?
Your Turn
• Write out the SQL command to answer one
of the posed questions.
• (don’t execute them! some statements might
take very long or return too much data)
• Alternatively: find the SQL command to determine
- how many miles each carrier has flown over the time period,
- how many delays occurred and
- what percentage of flights was cancelled each year.
Delays over time
• ‘How do average delays evolve over time?’
• More specifically: monthly average departure
delays for flights from Chicago O’Hare
• select Year, Month, avg(DepDelay) from ontime
where Origin='ORD'
group by Year, Month
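To see what this query returns without touching the full ontime table, here is a sketch on four made-up flight records. It assumes Python's built-in sqlite3 module and an in-memory database; the delay values and the DSM rows are invented for illustration, and only four of the 29 columns are created.

```python
import sqlite3

# In-memory SQLite database standing in for the course server (assumption).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE ontime (Year INTEGER, Month INTEGER, DepDelay INTEGER, Origin TEXT)"
)
# Made-up records, not real flight data.
con.executemany(
    "INSERT INTO ontime VALUES (?, ?, ?, ?)",
    [
        (1987, 10, 5, "ORD"),
        (1987, 10, 15, "ORD"),
        (1987, 11, 20, "ORD"),
        (1987, 10, 100, "DSM"),  # excluded by the WHERE clause
    ],
)

# One output row per (Year, Month) combination among the ORD departures.
delays = con.execute("""
    SELECT Year, Month, avg(DepDelay)
    FROM ontime
    WHERE Origin = 'ORD'
    GROUP BY Year, Month
    ORDER BY Year, Month
""").fetchall()

print(delays)
```

The result has one row per month rather than one per flight, which is why the aggregated table is small enough to pull into R for plotting.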
Average Delays
> summary(delays)
      Year          Month         avg.DepDelay.
 Min.   :1987   Min.   : 1.000   Min.   : 2.593
 1st Qu.:1993   1st Qu.: 4.000   1st Qu.: 7.352
 Median :1998   Median : 7.000   Median :10.463
 Mean   :1998   Mean   : 6.559   Mean   :11.633
 3rd Qu.:2003   3rd Qu.:10.000   3rd Qu.:14.768
 Max.   :2008   Max.   :12.000   Max.   :32.763
Average Monthly Departure Delays from Chicago
[Figure: avg.DepDelay. (y-axis) against Month (x-axis), one line per year, coloured by Year]
Yearly differences are hard to tell, but red lines seem to show higher
delays; delays increase during the summer and at the beginning and the
end of a year
Avg Monthly Delay by Year: 2000 is
different, seasonal pattern hard to see.
Beginning and end of years have high delays
[Figure: avg.DepDelay. (y-axis) against Month (x-axis), one panel per year, 1987 to 2008]
Delays seem to increase steeply
for some months in recent years
[Figure: avg.DepDelay. (y-axis) against Year (x-axis), one panel per month, 1 to 12]
Your Turn
• Write SQL statements to retrieve data to answer
the following questions (don’t execute them! some statements might
take very long or return too much data)
• How many flights take off each day?
• Which airlines cancel flights most often? What
percentage of flights gets cancelled?
• What is the main cause for flight cancellations? Is there a time trend?
SQL Sample Queries
• Select count(*) from ontime
• Select * from ontime limit 15
• Select distinct(UniqueCarrier) from ontime
• the following statement might take a while:
select Year, UniqueCarrier, count(*),
sum(Distance), sum(DepDelay), sum(ArrDelay-DepDelay), avg(Cancelled)
from ontime
group by Year, UniqueCarrier
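The last statement uses avg(Cancelled) to get a cancellation rate: since Cancelled is coded 0/1, its average is the fraction of cancelled flights. A sketch on four made-up records, assuming Python's built-in sqlite3 module and an in-memory database (the distances and carrier rows are invented, and only four columns are created):

```python
import sqlite3

# In-memory SQLite database standing in for the course server (assumption).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE ontime (Year INTEGER, UniqueCarrier TEXT, "
    "Distance INTEGER, Cancelled INTEGER)"
)
# Made-up records, not real flight data.
con.executemany(
    "INSERT INTO ontime VALUES (?, ?, ?, ?)",
    [
        (2008, "US", 100, 0),
        (2008, "US", 300, 1),
        (2008, "AA", 500, 0),
        (2008, "AA", 500, 0),
    ],
)

# Flights, total miles and percentage cancelled per year and carrier.
summary = con.execute("""
    SELECT Year, UniqueCarrier, count(*), sum(Distance), 100.0 * avg(Cancelled)
    FROM ontime
    GROUP BY Year, UniqueCarrier
    ORDER BY Year, UniqueCarrier
""").fetchall()

print(summary)
```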
Your Turn
• Log into the data_expo_09 database using
the connection details as specified earlier.
• Download weather data for Des Moines
(maybe just one year)
• How many tornadoes/hail storms/thunderstorms are there?
• Very Advanced: link the weather information
with flight departures/arrivals: how does e.g.
visibility affect delays?
Denormalization: Join
• reverse of normalization, joins tables
• easiest done by adding the appropriate constraints to the where clause
• Example
SELECT * from ontime o, weather w
WHERE o.Year=w.Year and o.Month=w.Month and DayofMonth=Day
and DepTime div 100 = Hour and Origin=IATA
select * from ontime o, weather w1, weather w2
where UniqueCarrier = 'US'
and w1.Year=o.Year and w1.Month=o.Month and w1.Day=DayofMonth
and w1.Hour=CRSDepTime div 100 and w1.IATA=Origin
and w2.Year=o.Year and w2.Month=o.Month and w2.Day=DayofMonth
and w2.Hour=ArrTime div 100 and w2.IATA=Dest
limit 10
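The join logic can be tested on a toy version of both tables. This sketch assumes Python's built-in sqlite3 module and an in-memory database; all records are made up, and the Visibility column is a hypothetical name for a weather measurement. SQLite has no div operator, so integer division is written with / on integer columns, which has the same effect here.

```python
import sqlite3

# In-memory SQLite database standing in for the course server (assumption).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ontime (Year INTEGER, Month INTEGER, DayofMonth INTEGER,
                         DepTime INTEGER, Origin TEXT, UniqueCarrier TEXT);
    -- Visibility is a hypothetical column name; all rows are made up.
    CREATE TABLE weather (Year INTEGER, Month INTEGER, Day INTEGER,
                          Hour INTEGER, IATA TEXT, Visibility REAL);
    INSERT INTO ontime VALUES (2008, 1, 3, 1430, 'DSM', 'US');
    INSERT INTO ontime VALUES (2008, 1, 3,  905, 'ORD', 'AA');
    INSERT INTO weather VALUES (2008, 1, 3, 14, 'DSM', 10.0);
    INSERT INTO weather VALUES (2008, 1, 3, 15, 'DSM',  2.0);
    INSERT INTO weather VALUES (2008, 1, 3,  9, 'ORD',  5.0);
    INSERT INTO weather VALUES (2008, 1, 4, 14, 'DSM',  7.0);
""")

# DepTime is hhmm, so integer division by 100 gives the departure hour.
# Every key column (year, month, day, hour, airport) needs its own constraint;
# leaving one out matches a flight to weather records from the wrong time.
joined = con.execute("""
    SELECT o.UniqueCarrier, o.Origin, o.DepTime, w.Visibility
    FROM ontime o, weather w
    WHERE o.Year = w.Year AND o.Month = w.Month AND o.DayofMonth = w.Day
      AND o.DepTime / 100 = w.Hour AND o.Origin = w.IATA
    ORDER BY o.DepTime
""").fetchall()

print(joined)
```

Each flight is matched to exactly one weather record: the one for its origin airport in its departure hour.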
Your Turn
• Construct an SQL statement to link 10 US
Airways flights to the weather records at the
airport of origin during departure time and
the weather at the destination during arrival
time (closest hour only)