R and Databases
stat 579
Heike Hofmann

Outline
• Normalization & Relational Databases
• Structured Query Language
• Weighted Displays

What is a database?
• A collection of data
• A set of rules to manipulate data

Why are they important?
• Efficient manipulation of large data sets
• Convenient processing of data
• Integration of multiple sources of data
• Access to a shared resource

Reasons
• More and more datasets are in the gigabyte range … R reads data into main memory … most computers have a very limited amount of main memory
• (Now, for some activities, data on the order of terabytes arrive on a frequent basis and are TOO BIG for databases! Analysts then revert to FLAT FILES.)

Relational Databases
• Database is a collection of tables and links (normal form)
• SQL (Structured Query Language) for querying and managing data
• DBMS (Database Management System)

Relational Databases

Student
ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

Attendance
ID    Date   Status
1234  02-05  Absent
1234  02-07  Absent
4321  02-05  Present
4321  02-07  Absent
...   ...    ...

Keys link tables (here: ID)

Table keys
Relationships are defined by key-key relationships:
1:1  one to one
1:m  one to many
m:n  many to many

DRY Principle for Data
Don’t Repeat Yourself
• “Normal Forms” of data
• more consistency, less repetition
• harder to view & edit
• Distinction between measured data and data indices (keys)
• “The key, the whole key & nothing but the key”

Why normalize?
• Reduce overall data size
• Easier to maintain
• Reduce redundancy to detect possible errors
• Useful way of thinking about objects under study

1st Normal Form “the key”
• Data has rectangular shape
• header line contains variable names (optional)
• some column(s) uniquely describe a data record

The Key
• e.g. unique row names provide a key
• keys are id variables, fixed by the design (as opposed to measured variables)
• multiple id variables are called a composite key

2nd Normal Form “the whole key”
• Violated, if a non-key entry is described by part of the key
• i.e. data sets in 1st NF with a single key are automatically in 2nd NF

Example: Attendance in class
• Data is in 1st Normal Form, but not in 2nd

UnivID  Name        Date   Status
1234    Never Ever  02-05  Absent
1234    Never Ever  02-07  Absent
4321    Some Times  02-05  Present
4321    Some Times  02-07  Absent
...     ...         ...    ...

Name is uniquely described by University ID already

Remedy: Normalization
• Normalize by splitting data sets

Student
UnivID  Name
1234    Never Ever
4321    Some Times
...     ...

Attendance
UnivID  Date   Status
1234    02-05  Absent
1234    02-07  Absent
4321    02-05  Present
4321    02-07  Absent
...     ...    ...

Both tables are now in normal form

3rd Normal Form “and nothing but the key”
• Violated, if a non-key entry is identified by another non-key entry

Example: Address
• in 2nd NF, but not 3rd NF

UnivID  Name   City  State  Zipcode
7340    Heike  Ames  IA     50014
...     ...    ...   ...    ...

Zipcode implies City and State; normalize again by splitting the table

Databases
• Database is a collection of tables in normal form
• SQL (Structured Query Language) for querying and managing data in DB

SQL Queries
    SELECT columns(, aggregate function(*))
    FROM table1(, table2)
    WHERE row_condition (AND table1.id = table2.id)
    (GROUP BY column)
    (ORDER BY order_by_columns)

SQL commands
• Select command: from, where, group by, having, order by
• Aggregate functions: count, sum, min, max, avg
• Operators: =, <=, >=, !=, and, or, in (...), not in (...), like (pattern matching with the wildcards % and _)

Good SQL Tutorial: http://www.1keydata.com/sql/sql.html

Why normalize?
• Reduce overall data size
• Easier to maintain
• Reduce redundancy to detect possible errors
• Useful way of thinking about objects under study

SQL
• Structured Query Language (based on the relational model of E. F. Codd, 1970)
• Programming language used for accessing data in a database
• ANSI standard since 1986, ISO standard since 1987
• Still some portability issues between software and operating systems!
• We’ll mainly focus on SQL queries to access data

Syntax
• SQL keywords are not case sensitive.
• Some systems require “;” at the end of each statement. The semicolon separates SQL statements in systems that allow multiple commands to be executed in one call to the server.

SELECT
• Selects data from the database

    SELECT column_name(s) FROM table_name

Student
ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

Attendance
ID    Date   Status
1234  02-05  Absent
1234  02-07  Absent
4321  02-05  Present
4321  02-07  Absent
...   ...    ...

    SELECT Name, Major FROM Student

Name        Major
Never Ever  Math
Some Times  CS
...         ...

SELECT
• All columns:

    SELECT * FROM Student

ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

WHERE

    SELECT Name FROM Student WHERE Major='Math'

Name
Never Ever
...

Aggregating Functions

    SELECT ID, count(ID) FROM Attendance WHERE Status='Absent' GROUP BY ID

ID    count(ID)
1234  2
4321  1
...

Functions
• COUNT
• AVG
• MAX
• MIN
• SUM
• ROUND
• LEN
• ...
What summary statistics do you want?
Very similar to ddply

Your Turn
• Go to website http://www.w3schools.com/sql/sql_tryit.asp to try for yourself:
• What fields are in the table “customers”?
• Select the CompanyName and ContactName of customers that come from Germany
• Find a frequency breakdown of all customers by country.

Example: US Flights
• Arrival/Departure details of all commercial flights in the US between Oct 1987 and Oct 2008
• Main interest: on-time performance
• more details: http://stat-computing.org/dataexpo/2009/

US Flights - Variables

     Name               Description
 1   Year               1987-2008
 2   Month              1-12
 3   DayofMonth         1-31
 4   DayOfWeek          1 (Monday) - 7 (Sunday)
 5   DepTime            actual departure time (local, hhmm)
 6   CRSDepTime         scheduled departure time (local, hhmm)
 7   ArrTime            actual arrival time (local, hhmm)
 8   CRSArrTime         scheduled arrival time (local, hhmm)
 9   UniqueCarrier      unique carrier code
10   FlightNum          flight number
11   TailNum            plane tail number
12   ActualElapsedTime  in minutes
13   CRSElapsedTime     in minutes
14   AirTime            in minutes

US Flights - Variables

     Name               Description
15   ArrDelay           arrival delay, in minutes
16   DepDelay           departure delay, in minutes
17   Origin             origin IATA airport code
18   Dest               destination IATA airport code
19   Distance           in miles
20   TaxiIn             taxi in time, in minutes
21   TaxiOut            taxi out time, in minutes
22   Cancelled          was the flight cancelled?
23   CancellationCode   reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24   Diverted           1 = yes, 0 = no
25   CarrierDelay       in minutes
26   WeatherDelay       in minutes
27   NASDelay           in minutes
28   SecurityDelay      in minutes
29   LateAircraftDelay  in minutes

Accessing Databases
• MySQL Workbench: http://dev.mysql.com/downloads/workbench/5.2.html
• Access local copy of ontime database (password: R R0cks)

SQL Sample Queries
• select count(*) from ontime
• select * from ontime limit 15
• select distinct(UniqueCarrier) from ontime
• the following statement might take a while:

    select Year, UniqueCarrier, count(*), sum(Distance),
           sum(ArrDelay+DepDelay), avg(Cancelled)
    from ontime
    group by Year, UniqueCarrier

Your Turn
• Download the MySQL Workbench from http://dev.mysql.com/downloads/workbench/5.2.html
• Log into the data_expo_09 database using the connection details as specified earlier.
• Explore the weather database, i.e. how many records are there? How many different weather stations? How many years?
• How many different Weather Events are there, and what is their frequency?
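The sample queries above can also be tried without access to the course server. The following is only a minimal sketch under stated assumptions: it uses an in-memory SQLite database through the RSQLite package (not the MySQL set-up of the course), and the six-column toy table and all its values are made up for illustration. Only the table and column names are taken from the slides.

```r
library(DBI)

# Assumption: RSQLite is installed; an in-memory SQLite database stands in
# for the course's MySQL server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up miniature version of the ontime table (illustrative values only).
toy <- data.frame(
  Year          = c(2007, 2007, 2007, 2008, 2008),
  UniqueCarrier = c("US", "US", "AA", "US", "AA"),
  Distance      = c(500, 700, 1000, 300, 1200),
  ArrDelay      = c(10, -5, 0, 20, 15),
  DepDelay      = c(5, 0, 0, 10, 5),
  Cancelled     = c(0, 0, 1, 0, 0)
)
dbWriteTable(con, "ontime", toy)

# How many records are there?
n_rows <- dbGetQuery(con, "select count(*) as n from ontime")

# Which carriers occur?
carriers <- dbGetQuery(con,
  "select distinct UniqueCarrier from ontime order by UniqueCarrier")

# One summary row per Year and carrier.
summary_by_group <- dbGetQuery(con, "
  select Year, UniqueCarrier, count(*) as flights, sum(Distance) as distance,
         sum(ArrDelay + DepDelay) as delay, avg(Cancelled) as cancelled
  from ontime
  group by Year, UniqueCarrier
  order by Year, UniqueCarrier")

print(summary_by_group)
dbDisconnect(con)
```

The column aliases (`as flights`, `as distance`, ...) are a choice made here to give the result readable names; the slide's query works without them.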
Denormalization: Join
• reverse of normalization
• joins tables
• easiest by putting appropriate constraints

Example

    SELECT * from ontime o, weather w
    WHERE o.Year=w.Year and o.Month=w.Month and DayofMonth=Day
      and DepTime div 100 = Hour and Origin=IATA

    select * from ontime o, weather w1, weather w2
    where UniqueCarrier = 'US'
      and w1.Year=o.Year and w1.Month=o.Month and w1.Day=DayofMonth
      and w1.Hour=CRSDepTime div 100 and w1.IATA=Origin
      and w2.Year=o.Year and w2.Month=o.Month and w2.Day=DayofMonth
      and w2.Hour=ArrTime div 100 and w2.IATA=Dest
    limit 10

Your Turn
• Construct an SQL statement to link 10 US Airways flights to the weather records at the airport of origin during departure time and the weather at the destination during arrival time (closest hour only)

Accessing Databases
• Packages in R have a front-end/back-end set-up
• front end is the same for all database systems (DBS): provided by DBI
• back end depends on the DBS: there is RMySQL, RSQLite, ROracle, ...

Packages DBI, RMySQL
• Install both packages
• You’ll need to install the mysql client in order to run RMySQL
• You will need to Remote Login to ts1-stat.stat.iastate.edu (from Applications > Communications)

RMySQL
• Link to Database: dbDriver, dbConnect, dbDisconnect
• Get Information: dbListTables, dbListFields
• Get Records: dbReadTable, dbGetQuery, dbSendQuery

Connecting to the DB

    library(DBI)
    library(RMySQL)
    drv <- dbDriver("MySQL")
    co <- dbConnect(drv, user="2009Expo", password="R R0cks", port=3306,
                    dbname="data_expo_2009", host="headnode.stat.iastate.edu")
    dbGetQuery(co, "select count(*) from ontime")

The Expo Database
> # what variables are in the table?
> dbListFields(co, "airports")
[1] "iata"      "airport"   "city"      "state"     "country"   "latitude"  "longitude" "id"
> dbListFields(co, "ontime")
 [1] "Year"              "Month"             "DayofMonth"        "DayOfWeek"         "DepTime"
 [6] "CRSDepTime"        "ArrTime"           "CRSArrTime"        "UniqueCarrier"     "FlightNum"
[11] "TailNum"           "ActualElapsedTime" "CRSElapsedTime"    "AirTime"           "ArrDelay"
[16] "DepDelay"          "Origin"            "Dest"              "Distance"          "TaxiIn"
[21] "TaxiOut"           "Cancelled"         "CancellationCode"  "Diverted"          "CarrierDelay"
[26] "WeatherDelay"      "NASDelay"          "SecurityDelay"     "LateAircraftDelay" "id"
>
> # read the whole table (only suitable for smaller tables)
> airports <- dbReadTable(co, "airports")
> head(airports)
  iata              airport             city state country latitude  longitude id
1  00M              Thigpen      Bay Springs    MS     USA 31.95376  -89.23450  1
2  00R Livingston Municipal       Livingston    TX     USA 30.68586  -95.01793  2
3  00V          Meadow Lake Colorado Springs    CO     USA 38.94575 -104.56989  3
4  01G         Perry-Warsaw            Perry    NY     USA 42.74135  -78.05208  4
5  01J     Hilliard Airpark         Hilliard    FL     USA 30.68801  -81.90594  5
6  01M    Tishomingo County          Belmont    MS     USA 34.49167  -88.20111  6
>

US Airports
> require(ggplot2)
> qplot(longitude, latitude, data=airports)
> qplot(longitude, latitude, data=airports, xlim=c(-180,-50))
> qplot(longitude, latitude, data=airports, xlim=c(-180,-50), size=1)
>
> # send an SQL statement to the server and extract data right away
> df <- dbGetQuery(co, "select longitude, latitude from airports")
> head(df)
> qplot(longitude, latitude, data=df, xlim=c(-180,-50), size=I(1))

[Figure: scatterplot of airport locations, latitude against longitude]

Weighted Plots
• Datasets often in aggregated form (particularly for large data)
• Use parameter weight=count in ggplot2

    qplot(factor(Year), geom="bar", weight=count, data=years)

[Figure: bar chart of count by factor(Year), 1987 to 2008]

Your Turn
• Find all balloons (investigate the planes table), get their flights,
calculate their average speed ... and be amazed!
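
The DBI functions listed on the RMySQL slide, together with a join that reverses the normalization of the Student/Attendance example, can be tried without access to the course server. This is a minimal sketch under assumptions: it connects to an in-memory SQLite database through RSQLite (assumed to be installed) instead of MySQL, and it rebuilds the two small example tables from the slides by hand.

```r
library(DBI)

# Assumption: RSQLite is installed; only the connection line differs from
# the MySQL set-up, the DBI calls below stay the same.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# The two normalized example tables from the slides.
student <- data.frame(
  ID    = c(1234, 4321),
  Name  = c("Never Ever", "Some Times"),
  Major = c("Math", "CS")
)
attendance <- data.frame(
  ID     = c(1234, 1234, 4321, 4321),
  Date   = c("02-05", "02-07", "02-05", "02-07"),
  Status = c("Absent", "Absent", "Present", "Absent")
)
dbWriteTable(con, "Student", student)
dbWriteTable(con, "Attendance", attendance)

# Get information
tables <- dbListTables(con)
fields <- dbListFields(con, "Attendance")

# Get records: whole table (only suitable for smaller tables)
whole <- dbReadTable(con, "Student")

# Denormalize: join the tables on their common key ID
joined <- dbGetQuery(con, "
  select s.ID, s.Name, a.Date, a.Status
  from Student s, Attendance a
  where s.ID = a.ID
  order by s.ID, a.Date")

# Aggregate on the joined data: number of absences per student
absences <- dbGetQuery(con, "
  select s.Name, count(*) as absences
  from Student s, Attendance a
  where s.ID = a.ID and a.Status = 'Absent'
  group by s.Name
  order by s.Name")

print(joined)
print(absences)
dbDisconnect(con)
```

The join repeats each student's name once per attendance record, which is exactly the redundancy that normalization removed.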