R and Databases stat 579 Heike Hofmann

advertisement
R and Databases
stat 579
Heike Hofmann
Outline
• Normalization & Relational Databases
• Select Query Language
• Weighted Displays
What is a database?
• A collection of data
• A set of rules to manipulate data
Why are they important?
• Efficient manipulation of large data sets
• Convenient processing of data
• Integration of multiple sources of data
• Access to a shared resource
Reasons
• More and more datasets are in a Gigabyte
range … R reads data into main memory
… most computers have a very limited
amount of main memory
• (Now, data on the order of some activities
today, terabytes on a frequent basis are
TOO BIG for databases! Then analysts
revert back to FLAT FILES.)
Relational Databases
• Database is collection of tables and links
(normal form)
• SQL (Structured Query Language)
for querying
• DBMS (Database Management System) and
managing data
Relational Databases
Student
ID
Name
Major
1234 Never Ever
4321 Some Times
...
...
Math
CS
...
Keys Link tables
Attendance
ID
Date
1234
1234
4321
4321
...
02-05
02-07
02-05
02-07
...
Status
Absent
Absent
Present
Absent
...
Table keys
Relationships are defined by
key-key relationships:
1:1 one to one
1:m one to many
m:n many to many
DRY Principle for Data
Don’t Repeat Yourself
• “Normal Forms” of data
• more consistency, less repetition
• harder to view & edit
• Distinction between measured data and
data indices (keys)
• “The key, the whole key & nothing but the key”
Why normalize?
• Reduce overall data size
• Easier to maintain
• Reduce redundancy to detect possible
errors
• Useful way of thinking about objects under
study
1st Normal Form
“the key”
• Data has rectangular shape
• header line contains
variable names (optional)
• some column(s) uniquely
describe a data record
The Key
• e.g. unique row names provide a key
• keys are id variables, fixed by the design
(as opposed to measured variables)
• multiple id variables are called a composite
key
2nd Normal Form
“the whole key”
• Violated, if non-key entry is
described by part of the key
• i.e. data sets in 1st NF with
single key are automatically
in 2nd NF
Not allowed:
Example:
Attendance in class
• Data is in 1st Normal Form, but not in 2nd
UnivID
UnivID
Name
Date
Date
Status
1234
1234
4321
4321
...
Never Ever
Never Ever
Some Times
Some Times
...
02-05
02-07
02-05
02-07
...
Absent
Absent
Present
Absent
...
Name is uniquely described by University ID already
Remedy: Normalization
• Normalize by splitting data sets
Student
Attendance
UnivID
UnivIDNameName UnivID
Date Date
Name Status
Status
1234 1234Never Never
Ever Ever
4321 1234Some Times
Never Ever
... 4321
...Some Times
4321
Some Times
...
...
123402-05
123402-07
432102-05
432102-07
... ...
Never
02-05 Absent
Absent
Ever Absent
Never
02-07
Absent
Ever Present
02-05
Some
Present
Times
02-07
Some Absent
Absent
Times
...
... ...
Both tables are now in normal form
3rd Normal Form
“and nothing but the key”
• Violated, if non-key entry is
identified by another nonkey entry
Not allowed:
Example:
Address
• in 2nd NF, but not 3rd NF
UnivID
Name
City
State
Zipcode
7340
Heike
Ames
IA
50014
...
...
...
...
Zipcode implies City and State,
normalization again by splitting table
Databases
• Database is collection of tables in normal
form
• SQL (Structured Query Language)
for querying and managing data in DB
SQL Queries
• SELECT columns(, aggregate function(*))
FROM table1(, table2)
WHERE row_condition
(AND table1.id = table2.id)
(GROUP BY column)
(ORDER BY order_by_columns)
SQL commands
• Select command
from, where, group by, having, order by
• Aggregate functions
count, sum, min, max, avg
• operators
=, <=, >=, !=, and, or, is in (...), is not in (...),
like (regular expression)
Good SQL Tutorial:
http://www.1keydata.com/sql/sql.html
Why normalize?
• Reduce overall data size
• Easier to maintain
• Reduce redundancy to detect possible
errors
• Useful way of thinking about objects under
study
SQL
• Structured Query Language (1970, E Codds)
• Programming language used for accessing data
in a database
• ANSI standard since 1986, ISO standard since
1987
• Still some portability issues between software
and operating systems!
• We’ll mainly focus on SQL queries to access
data
Syntax
• SQL is not case sensitive.
• Some systems require “;” at the end of each
line. The semi-colon can be used to
separate each SQL statement in a system
that allows multiple command to be
executed in a call to the server.
SELECT
• Selects data from the database
SELECT column_name(s)
FROM table_name
Student
Attendance
ID
Name
Major
ID
Date
Status
1234
4321
...
Never Ever
Some Times
...
Math
CS
...
1234
1234
4321
4321
...
02-05
02-07
02-05
02-07
...
Absent
Absent
Present
Absent
...
SELECT Name, Major
FROM Student
Name
Major
Never Ever
Some Times
...
Math
CS
...
SELECT
Student
Attendance
ID
Name
Major
ID
Date
Status
1234
4321
...
Never Ever
Some Times
...
Math
CS
...
1234
1234
4321
4321
...
02-05
02-07
02-05
02-07
...
Absent
Absent
Present
Absent
...
All
SELECT *
FROM Student
ID
Name
Major
1234
4321
...
Never Ever
Some Times
...
Math
CS
...
WHERE
Student
Attendance
ID
Name
Major
ID
Date
Status
1234
4321
...
Never Ever
Some Times
...
Math
CS
...
1234
1234
4321
4321
...
02-05
02-07
02-05
02-07
...
Absent
Absent
Present
Absent
...
SELECT Name
FROM Student
WHERE Major=‘Math’
Name
Never Ever
...
Aggregating Functions
Student
Attendance
ID
Name
Major
ID
Date
Status
1234
4321
...
Never Ever
Some Times
...
Math
CS
...
1234
1234
4321
4321
...
02-05
02-07
02-05
02-07
...
Absent
Absent
Present
Absent
...
SELECT ID, count(ID)
FROM Attendance
WHERE Status=‘Absent’
GROUP BY ID
ID
count(ID)
1234
4321
...
2
1
Functions
• COUNT
• AVG
• MAX
• MIN
• SUM
• ROUND
• LEN
• ...
What summary
statistics do you want?
Very similar to ddply
Your Turn
• Go to website http://www.w3schools.com/
sql/sql_tryit.asp to try for yourself:
• What fields are in the table “customers”?
• Select the CompanyName and
•
ContactName of customers that come
from Germany
Find a frequency breakdown of all
customers by country.
Example: US Flights
• Arrival/Departure details of all commercial
flights in the US between Oct 1985 and
Oct 2008
• Main interest: on-time performance
• more details:
http://stat-computing.org/dataexpo/2009/
US Flights - Variables
Name
1 Year
Description
1987-2008
2 Month
Jan 12, 2009
3 DayofMonth
Jan 31, 2009
4 DayOfWeek
1 (Monday) - 7 (Sunday)
5 DepTime
actual departure time (local, hhmm)
6 CRSDepTime
scheduled departure time (local, hhmm)
7 ArrTime
actual arrival time (local, hhmm)
8 CRSArrTime
scheduled arrival time (local, hhmm)
9 UniqueCarrier
unique carrier code
10 FlightNum
flight number
11 TailNum
plane tail number
12 ActualElapsedTime
in minutes
13 CRSElapsedTime
in minutes
14 AirTime
in minutes
US Flights - Variables
Name
Description
15 ArrDelay
arrival delay, in minutes
16 DepDelay
departure delay, in minutes
17 Origin
origin IATA airport code
18 Dest
destination IATA airport code
19 Distance
in miles
20 TaxiIn
taxi in time, in minutes
21 TaxiOut
taxi out time in minutes
22 Cancelled
was the flight cancelled?
23 CancellationCode
24 Diverted
reason for cancellation (A = carrier, B = weather, C = NAS,
D = security)
1 = yes, 0 = no
25 CarrierDelay
in minutes
26 WeatherDelay
in minutes
27 NASDelay
in minutes
28 SecurityDelay
in minutes
29 LateAircraftDelay
in minutes
Accessing Databases
• MySQL Workbench:
http://dev.mysql.com/downloads/
workbench/5.2.html
• Access local copy
of ontime database
R R0cks
SQL Sample Queries
• Select count(*) from ontime
• Select * from ontime limit 15
• Select distinct(UniqueCarrier) from ontime
• the following statement might take a while:
select Year, UniqueCarrier, count(*), sum
(Distance), sum(ArrDelay+DepDelay), avg
(Cancelled) from ontime group by Year,
UniqueCarrier
Your Turn
•
Download the MySQL Work Bench from http://
dev.mysql.com/downloads/workbench/5.2.html
•
Log into the data_expo_09 database using the
connection details as specified earlier.
•
Explore the weather database - i.e. how many
records are there? how many different weather
stations? how many years?
•
How many different Weather Events are there,
what is their frequency?
Denormalization: Join
•
•
•
reverse of normalization, joins tables
easiest by putting appropriate constraints
Example
SELECT * from ontime o, weather w
WHERE o.Year=w.Year and
o.Month=w.Month and
DayofMonth=Day and
DepTime div 100 = Hour and
Origin=IATA
select * from ontime o, weather w1,
weather w2
where UniqueCarrier = 'US' and
w1.Year=o.Year and w1.Month=o.Month and
w1.Day=DayofMonth and
w1.Hour=CRSDepTime div 100 and
w1.IATA=Origin and w2.Year=o.Year and
w2.Month=o.Month and w2.Hour=ArrTime
div 100 and w2.IATA=Dest limit 10
Your Turn
• Construct an SQL statement to link 10 US
Airway flights to the weather records at the
airport of origin during departure time and
the weather at the destination during arrival
time (closest hour only)
Accessing Databases
• Packages in R have Front-/Backend Set-up
• backend is the same for all database
systems (DBS): done by DBI
• frontend depends on DBS, there is
RMySQL, RSQLite, ROracle, ...
Packages DBI, RMySQL
• Install both packages
• You’ll need to install the mysql client in
order to run RMySQL
• You will need to Remote Login to ts1-
stat.stat.iastate.edu
(from Applications > Communications)
RMySQL
• Link to Database:
dbDriver, dbConnect, dbDisconnect
• Get Information:
dbListTables, dbListFields
• Get Records:
dbReadTable, dbGetQuery, dbSendQuery
Connecting to the DB
library(DBI)
library(RMySQL)
drv <- dbDriver("MySQL")
co <- dbConnect(drv, user="2009Expo", password="R R0cks", port=3306,
dbname="data_expo_2009", host="headnode.stat.iastate.edu")
dbGetQuery(co, "select count(*) from ontime")
The Expo Database
> # what variables are in the table?
> dbListFields(co, "airports")
[1] "iata"
"airport"
"city"
"state"
"country"
"latitude" "longitude" "id"
> dbListFields(co, "ontime")
[1] "Year"
"Month"
"DayofMonth"
"DayOfWeek"
"DepTime"
[6] "CRSDepTime"
"ArrTime"
"CRSArrTime"
"UniqueCarrier"
"FlightNum"
[11] "TailNum"
"ActualElapsedTime" "CRSElapsedTime"
"AirTime"
"ArrDelay"
[16] "DepDelay"
"Origin"
"Dest"
"Distance"
"TaxiIn"
[21] "TaxiOut"
"Cancelled"
"CancellationCode" "Diverted"
"CarrierDelay"
[26] "WeatherDelay"
"NASDelay"
"SecurityDelay"
"LateAircraftDelay" "id"
>
> # read the whole table (only suitable for smaller tables)
> airports <- dbReadTable(co, "airports")
> head (airports)
iata
airport
city state country latitude longitude id
1 00M
Thigpen
Bay Springs
MS
USA 31.95376 -89.23450 1
2 00R Livingston Municipal
Livingston
TX
USA 30.68586 -95.01793 2
3 00V
Meadow Lake Colorado Springs
CO
USA 38.94575 -104.56989 3
4 01G
Perry-Warsaw
Perry
NY
USA 42.74135 -78.05208 4
5 01J
Hilliard Airpark
Hilliard
FL
USA 30.68801 -81.90594 5
6 01M
Tishomingo County
Belmont
MS
USA 34.49167 -88.20111 6
>
US Airports
>
>
>
>
require(ggplot2)
qplot(longitude, latitude, data=airports)
qplot(longitude, latitude, data=airports, xlim=c(-180,-50))
qplot(longitude, latitude, data=airports, xlim=c(-180,-50), size=1)
>
>
>
>
# send an SQL statement to the server and extract data right away
df <- dbGetQuery(co, "select longitude, latitude from airports")
head(df)
qplot(longitude, latitude, data=df, xlim=c(-180,-50), size=I(1))
70
60
50
latitude
1
1
40
30
20
10
-180
-160
-140
-120
longitude
-100
-80
-60
Weighted Plots
• Datasets often in aggregated form
(particularly for large data)
• Use parameter weight=count in ggplot2
7e+06
6e+06
count
qplot(factor(Year), geom="bar",
weight=count, data=years)
5e+06
4e+06
3e+06
2e+06
1e+06
0e+00
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
factor(Year)
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
Your Turn
• Find all balloons (investigate the planes
table), get their flights, calculate their average
speed ... and be amazed!
Download