Project III
• Due today:
  • Abstracts and description of the data (follow link from course website)
• schedule so far …

Databases & SQL
Stat 480
Heike Hofmann

Outline
• Why Databases?
• Linking R to a database
• Relational databases

What is a database?
• A collection of data
• A set of rules to manipulate data

Why are they important?
• Efficient manipulation of large data sets
• Convenient processing of data
• Integration of multiple sources of data
• Access to a shared resource

Reasons
• More and more data sets are in the gigabyte range … R reads data into main memory … most computers have a very limited amount of main memory
• (Some activities today produce data on the order of terabytes on a frequent basis, which is TOO BIG for databases! Analysts then revert to FLAT FILES.)

Relational Databases
• Database is a collection of tables and links (normal form)
• SQL (Structured Query Language) for querying
• DBMS (Database Management System) for managing data

dplyr
• dplyr package in R has several main functions: group_by, summarize, mutate (dplyr's counterpart to base R's transform), filter, arrange, select
• dplyr works (almost) the same for local data frames as for tables in a database

Connecting to the database
> library(dplyr)
> db <- src_mysql(host="headnode.stat.iastate.edu", user="2009Expo", port=3306, dbname="accidents", password="R R0cks")
Loading required package: RMySQL
Loading required package: DBI
> db
src:  mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
tbls: accidents, person, vehicle
• “accidents” database is part of FARS (Fatality Analysis Reporting System, http://www.nhtsa.gov/FARS)
• Documentation at http://www-nrd.nhtsa.dot.gov/Cats/listpublications.aspx?Id=J&ShowBy=DocType

Your Turn
• Using the same connection information, the department of Statistics hosts several databases (dbname).
  Among them are ‘accidents’, ‘data_expo_2009’, ‘baseball’, ‘recovery’ (by tonight there also should be ‘BRFSS’)
• connect to them and see what tables they contain
• pick one of the tables, and start working with it

• note: R does not load the data into the session
> fars <- tbl(db, "accidents")
> fars
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: accidents [333,578 x 56]

   row_names STATE COUNTY MONTH DAY HOUR MINUTE VE_TOTAL PERSONS PEDS NHS ROAD_FNC ROUTE SP_JUR HARM_EV MAN_COL
 1         1     1    127     1   2   18     15        2       2    0   1        2     2      0      12
 2         2     1      3     1   1   20      0        1       2    1   0       15     4      0       9
 3         3     1     15     1   1   15     59        2       3    0   0        3     3      0      12
 4         4     1     69     1   3    8     40        2       3    0   0        4     3      0      12
 5         5     1     77     1   3    0      5        1       1    0   0        6     4      0      34
 6         6     1      9     1   6   14     50        2       2    0   0        3     3      0      12
 7         7     1     75     1   7    5     30        1       1    0   0        3     3      0      42
 8         8     1     15     1   7   19     50        1       2    1   0       14     3      0       8
 9         9     1    125     1   8   17     48        1       1    0   0        6     4      0       1
10        10     1     97     1   9   21     30        2       6    0   1        2     2      0      12
..       ...   ...    ...   ... ...  ...    ...      ...     ...  ... ...      ...   ...    ...     ...
Variables not shown: REL_ROAD (dbl), TRAF_FLO (dbl), NO_LANES (dbl), SP_LIMIT (dbl), ALIGNMNT (dbl), PROFILE (d
  (dbl), SUR_COND (dbl), TRA_CONT (dbl), T_CONT_F (dbl), LGT_COND (dbl), WEATHER1 (dbl), WEATHER2 (dbl), WRK_ZO
  NOT_HOUR (dbl), NOT_MIN (dbl), ARR_HOUR (dbl), ARR_MIN (dbl), HOSP_HR (dbl), HOSP_MN (dbl), SCH_BUS (dbl), CF
  (int), CF3 (int), FATALS (int), DAY_WEEK (int), DRUNK_DR (int), ST_CASE (dbl), CITY (dbl), YEAR (dbl), MILEPT
  TWAY_ID (chr), TWAY_ID2 (chr), RAIL (chr), LATITUDE (dbl), LONGITUD (dbl), VE_FORMS (dbl), WEATHER (dbl), srs

• remember the chaining operator %>% ?!
> # how many accidents are there in a year?
> fars %>% group_by(YEAR) %>% summarise(n=n())
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: <derived table> [?? x 2]

   YEAR     n
1  2001 37862
2  2002 38491
3  2003 38477
4  2004 38444
5  2005 39252
6  2006 38648
7  2007 37435
8  2008 34172
9  2009 30797
..  ...   ...

> # are there some days of the week where more accidents happen than on others?
> fars %>% group_by(day_week) %>% summarise(n=n())
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: <derived table> [?? x 2]

   day_week     n
1         1 54047
2         2 41395
3         3 40049
4         4 41236
5         5 42795
6         6 52654
7         7 61329
8         9    73
..      ...   ...

Difference between local data frames and databases
• The result from a chaining operation is just a ‘preview’; we need to collect the result to be able to work with it:
> sql <- fars %>% group_by(YEAR) %>% summarise(n=n())
> collect(sql)
Source: local data frame [9 x 2]

  YEAR     n
1 2001 37862
2 2002 38491
3 2003 38477
4 2004 38444
5 2005 39252
6 2006 38648
7 2007 37435
8 2008 34172
9 2009 30797

> library(ggplot2)
> a09 <- collect(fars %>% filter(year==2009))
> qplot(LONGITUD, LATITUDE, data=a09)
[Figure: scatterplot of LATITUDE against LONGITUD for the accidents in 2009]

Your Turn
• Connect to the accidents data set
• DRUNK_DR gives the number of drunk drivers involved in an accident
• get a summary from the accidents table that gives an overview of the number of accidents at a particular time of day (hour), day of week (day_week) and number of drunk drivers (drunk_dr)
• introduce a variable ‘alcohol’ into the previous data frame to indicate whether there were any drunk drivers involved in the accident
• find a visualization summarising the relationship between alcohol-related accidents, time of the day, and day of the week.

select
> # for larger data sets like this one, it makes sense to use the function select:
> fars %>% select(DRUNK_DR, FATALS, PERSONS, VE_TOTAL, SP_LIMIT)
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: accidents [333,578 x 5]

   DRUNK_DR FATALS PERSONS VE_TOTAL SP_LIMIT
1         0      1       2        2       65
2         0      1       2        1       35
3         0      2       3        2       55
4         0      1       3        2       55
5         1      1       1        1       45
6         0      1       2        2       55
7         0      1       1        1       55
8         0      1       2        1       55
9         1      1       1        1       45
10        1      3       6        2       55
..      ...    ...     ...      ...      ...

using sql queries
> tbl(db, sql("select count(*) from accidents"))
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: <derived table> [?? x 1]

   count(*)
1    333578
..      ...
• in order to understand this we need to look a bit closer at databases …

Relational Databases

Student
ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

Attendance
ID    Date   Status
1234  02-05  Absent
1234  02-07  Absent
4321  02-05  Present
4321  02-07  Absent
...   ...    ...

• Keys (here: ID) link tables

Table keys
Relationships are defined by key-key relationships:
• 1:1 one to one
• 1:m one to many
• m:n many to many

Relational Database
• Database is a collection of normalized data tables
• Database system is a framework of data management functions

Patient        Smoker  Age
Able Peterson  No      45
Bertha Mathes  Yes     60

Surgeon          Years experience  Medical school
Pete Roberts     15                Harvard medical school
Ralf Countryman  10                University of Texas

Date          Patient        Surgeon          Procedure         Result
Mar 17, 2006  Able Peterson  Pete Roberts     Appendectomy      Excellent
Mar 19, 2006  Bertha Mathes  Ralf Countryman  Heart transplant  Good

• relations (1:n and 1:m) link the Patient and Surgeon tables to the table of procedures

SQL
• Structured Query Language (builds on the relational model of E. F. Codd, 1970)
• Programming language used for accessing data in a database
• ANSI standard since 1986, ISO standard since 1987
• Still some portability issues between software and operating systems!
• We’ll mainly focus on SQL queries to access data

SQL Queries
  SELECT columns(, aggregate function(*))
  FROM table1(, table2)
  WHERE row_condition (AND table1.id = table2.id)
  (GROUP BY column)
  (ORDER BY order_by_columns)

Syntax
• SQL keywords are not case sensitive.
• Some systems require “;” at the end of each statement. The semicolon separates SQL statements in systems that allow multiple commands to be executed in one call to the server.

SELECT
• Selects data from the database
  SELECT column_name(s)
  FROM table_name

Example (Student table from above):
  SELECT Name, Major
  FROM Student

Name        Major
Never Ever  Math
Some Times  CS
...         ...
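The Student and Attendance tables and the SELECT statement above can be tried without access to the department's MySQL server. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in database (not the course setup); it uses only the rows visible on the slides, and the ORDER BY clauses are added here just to make the output order predictable. The second query links the two tables through the key ID, following the `table1.id = table2.id` pattern from the SQL Queries template.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Student (ID INTEGER PRIMARY KEY, Name TEXT, Major TEXT);
    CREATE TABLE Attendance (ID INTEGER, Date TEXT, Status TEXT);
    INSERT INTO Student VALUES
        (1234, 'Never Ever', 'Math'),
        (4321, 'Some Times', 'CS');
    INSERT INTO Attendance VALUES
        (1234, '02-05', 'Absent'),
        (1234, '02-07', 'Absent'),
        (4321, '02-05', 'Present'),
        (4321, '02-07', 'Absent');
""")

# SELECT picks columns from one table.
name_major = con.execute(
    "SELECT Name, Major FROM Student ORDER BY ID"
).fetchall()

# The key ID links Student (one row per student) to Attendance (many rows).
joined = con.execute("""
    SELECT Student.Name, Attendance.Date, Attendance.Status
    FROM Student, Attendance
    WHERE Student.ID = Attendance.ID
    ORDER BY Student.ID, Attendance.Date
""").fetchall()

print(name_major)
for row in joined:
    print(row)
```

Each student appears once in Student but twice in the joined result, which is the 1:m relationship from the Table keys slide.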
SELECT
• * selects all columns
  SELECT *
  FROM Student

ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

WHERE
  SELECT Name
  FROM Student
  WHERE Major='Math'

Name
Never Ever
...

Aggregating Functions
  SELECT ID, count(ID)
  FROM Attendance
  WHERE Status='Absent'
  GROUP BY ID

ID    count(ID)
1234  2
4321  1
...

Functions
• COUNT
• AVG
• MAX
• MIN
• SUM
• ROUND
• LEN
• ...
What summary statistics do you want?
Very similar to ddply

Your Turn
• Go to website http://www.w3schools.com/sql/sql_tryit.asp to try for yourself:
• What fields are in the table “customers”?
• Select the CompanyName and ContactName of customers that come from Germany
• Find a frequency breakdown of all customers by country.

Your Turn
• Go to http://www.w3schools.com/sql/sql_tryit.asp
• Read the help files for functions AVG(), COUNT() and MIN()
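The WHERE and GROUP BY examples from the slides above can also be run locally. This is a sketch under the same assumptions as before: Python's built-in sqlite3 module stands in for the course's MySQL server, only the rows shown on the slides are loaded, and ORDER BY is added so the result order is fixed.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Student (ID INTEGER PRIMARY KEY, Name TEXT, Major TEXT);
    CREATE TABLE Attendance (ID INTEGER, Date TEXT, Status TEXT);
    INSERT INTO Student VALUES
        (1234, 'Never Ever', 'Math'),
        (4321, 'Some Times', 'CS');
    INSERT INTO Attendance VALUES
        (1234, '02-05', 'Absent'),
        (1234, '02-07', 'Absent'),
        (4321, '02-05', 'Present'),
        (4321, '02-07', 'Absent');
""")

# WHERE keeps only the rows that satisfy the condition.
math_names = [
    row[0]
    for row in con.execute("SELECT Name FROM Student WHERE Major = 'Math'")
]

# GROUP BY forms one group per ID; COUNT is then computed within each group.
absences = con.execute("""
    SELECT ID, COUNT(ID)
    FROM Attendance
    WHERE Status = 'Absent'
    GROUP BY ID
    ORDER BY ID
""").fetchall()

print(math_names)  # ['Never Ever']
print(absences)    # [(1234, 2), (4321, 1)]
```

The WHERE clause is applied before grouping, so the ‘Present’ row for student 4321 is dropped before the count is taken.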