Project III
• Today due:
• Abstracts and description of the data
(follow link from course website)
• schedule so far …
Databases & SQL
Stat 480
Heike Hofmann
Outline
• Why Databases?
• Linking R to a database
• Relational databases
What is a database?
• A collection of data
• A set of rules to manipulate data
Why are they important?
• Efficient manipulation of large data sets
• Convenient processing of data
• Integration of multiple sources of data
• Access to a shared resource
Reasons
• More and more data sets are in the gigabyte
range. R reads data into main memory,
and most computers have a very limited
amount of main memory.
• (Some activities today produce terabytes of
data on a frequent basis. Data of that size are
TOO BIG for databases, and analysts
revert to FLAT FILES.)
Relational Databases
• Database is a collection of tables and links
(normal form)
• DBMS (Database Management System)
• SQL (Structured Query Language)
for querying and managing data
dplyr
• dplyr package in R has several main
functions:
group_by, summarize, mutate, filter,
arrange, select
• dplyr works (almost) the same for local
data frames as for tables in a database
Connecting to the
database
> db <- src_mysql(host="headnode.stat.iastate.edu",
user="2009Expo", port=3306, dbname="accidents", password="R
R0cks")
Loading required package: RMySQL
Loading required package: DBI
> db
src: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
tbls: accidents, person, vehicle
• “accidents” database is part of FARS (Fatality Analysis
Reporting System, http://www.nhtsa.gov/FARS)
• Documentation at
http://www-nrd.nhtsa.dot.gov/Cats/listpublications.aspx?Id=J&ShowBy=DocType
Your Turn
• The department of Statistics hosts several
databases (dbname) that use the same
connection information.
Among them are ‘accidents’,
‘data_expo_2009’, ‘baseball’, ‘recovery’
(by tonight there also should be ‘BRFSS’)
• connect to them and see what tables they
contain
• pick one of the tables, and start working with
it
• note: R does not load the data into the session
> fars <- tbl(db, "accidents")
> fars
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: accidents [333,578 x 56]
   row_names STATE COUNTY MONTH DAY HOUR MINUTE VE_TOTAL PERSONS PEDS NHS ROAD_FNC ROUTE SP_JUR HARM_EV MAN_COL
1          1     1    127     1   2   18     15        2       2    0   1        2     2      0      12
2          2     1      3     1   1   20      0        1       2    1   0       15     4      0       9
3          3     1     15     1   1   15     59        2       3    0   0        3     3      0      12
4          4     1     69     1   3    8     40        2       3    0   0        4     3      0      12
5          5     1     77     1   3    0      5        1       1    0   0        6     4      0      34
6          6     1      9     1   6   14     50        2       2    0   0        3     3      0      12
7          7     1     75     1   7    5     30        1       1    0   0        3     3      0      42
8          8     1     15     1   7   19     50        1       2    1   0       14     3      0       8
9          9     1    125     1   8   17     48        1       1    0   0        6     4      0       1
10        10     1     97     1   9   21     30        2       6    0   1        2     2      0      12
..       ...   ...    ...   ... ...  ...    ...      ...     ...  ... ...      ...   ...    ...     ...
Variables not shown: REL_ROAD (dbl), TRAF_FLO (dbl), NO_LANES (dbl), SP_LIMIT (dbl), ALIGNMNT (dbl), PROFILE (d
(dbl), SUR_COND (dbl), TRA_CONT (dbl), T_CONT_F (dbl), LGT_COND (dbl), WEATHER1 (dbl), WEATHER2 (dbl), WRK_ZO
NOT_HOUR (dbl), NOT_MIN (dbl), ARR_HOUR (dbl), ARR_MIN (dbl), HOSP_HR (dbl), HOSP_MN (dbl), SCH_BUS (dbl), CF
(int), CF3 (int), FATALS (int), DAY_WEEK (int), DRUNK_DR (int), ST_CASE (dbl), CITY (dbl), YEAR (dbl), MILEPT
TWAY_ID (chr), TWAY_ID2 (chr), RAIL (chr), LATITUDE (dbl), LONGITUD (dbl), VE_FORMS (dbl), WEATHER (dbl), srs
• remember the chaining operator %>% ?!
> # how many accidents are there in a year?
> fars %>% group_by(YEAR) %>% summarise(n=n())
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: <derived table> [?? x 2]
   YEAR     n
1  2001 37862
2  2002 38491
3  2003 38477
4  2004 38444
5  2005 39252
6  2006 38648
7  2007 37435
8  2008 34172
9  2009 30797
..  ...   ...
> # are there some days of the week where more accidents happen than on others?
> fars %>% group_by(day_week) %>% summarise(n=n())
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: <derived table> [?? x 2]
   day_week     n
1         1 54047
2         2 41395
3         3 40049
4         4 41236
5         5 42795
6         6 52654
7         7 61329
8         9    73
..      ...   ...
Difference between
local df and DBS
• The result from a chaining operation is just
a ‘preview’ - we need to collect the
result to be able to work with it:
> sql <- fars %>% group_by(YEAR) %>% summarise(n=n())
> collect(sql)
Source: local data frame [9 x 2]
  YEAR     n
1 2001 37862
2 2002 38491
3 2003 38477
4 2004 38444
5 2005 39252
6 2006 38648
7 2007 37435
8 2008 34172
9 2009 30797
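A loose analogy outside R, as a minimal sketch: Python's built-in sqlite3 module also separates sending a query from pulling its rows into memory. The in-memory SQLite database and the table name yearly are assumptions for illustration; the counts are the nine yearly totals shown above.

```python
import sqlite3

# In-memory SQLite database (assumption: stands in for the MySQL server).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yearly (YEAR INTEGER, n INTEGER)")  # hypothetical table
con.executemany("INSERT INTO yearly VALUES (?, ?)", [
    (2001, 37862), (2002, 38491), (2003, 38477),
    (2004, 38444), (2005, 39252), (2006, 38648),
    (2007, 37435), (2008, 34172), (2009, 30797),
])

# Sending the query returns a cursor, not the rows themselves.
cur = con.execute("SELECT YEAR, n FROM yearly ORDER BY YEAR")

# Preview: pull only the first three rows.
preview = cur.fetchmany(3)
print(preview)

# "Collect": pull all remaining rows into a local list.
rest = cur.fetchall()
print(len(preview) + len(rest))
```

As with collect(), only the rows that were actually fetched exist as local objects that other code can work with.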
> library(ggplot2)
> a09 <- collect(fars %>% filter(year==2009))
> qplot(LONGITUD, LATITUDE, data=a09)
[Figure: scatterplot of LATITUDE (y axis) against LONGITUD (x axis)]
Your Turn
• Connect to the accidents data set
• DRUNK_DR gives the number of drunk drivers involved
in an accident
• get a summary from the accidents table that gives an
overview of the number of accidents at a particular time
of day (hour), day of week (day_week) and number of
drunk drivers (drunk_dr)
• introduce variable ‘alcohol’ into the previous data frame
to indicate whether there were any drunk drivers involved
in the accident
• find a visualization summarising the relationship between
alcohol related accidents, time of the day, and day of the
week.
select
> # for larger data sets like this one it makes sense to use the function select:
> fars %>% select(DRUNK_DR, FATALS, PERSONS, VE_TOTAL, SP_LIMIT)
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: accidents [333,578 x 5]
   DRUNK_DR FATALS PERSONS VE_TOTAL SP_LIMIT
1         0      1       2        2       65
2         0      1       2        1       35
3         0      2       3        2       55
4         0      1       3        2       55
5         1      1       1        1       45
6         0      1       2        2       55
7         0      1       1        1       55
8         0      1       2        1       55
9         1      1       1        1       45
10        1      3       6        2       55
..      ...    ...     ...      ...      ...
using sql queries
> tbl(db, sql("select count(*) from accidents"))
Source: mysql 5.0.95-log [2009Expo@headnode.stat.iastate.edu:/accidents]
From: <derived table> [?? x 1]
   count(*)
1    333578
..      ...
• in order to understand this we need to
look a bit closer at databases …
Relational Databases
Student
ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...

Keys link tables

Attendance
ID    Date   Status
1234  02-05  Absent
1234  02-07  Absent
4321  02-05  Present
4321  02-07  Absent
...   ...    ...
Table keys
Relationships are defined by
key-key relationships:
1:1  one to one
1:m  one to many
m:n  many to many
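A minimal sketch of how a key links the Student and Attendance tables in a 1:m relationship. It uses Python's built-in sqlite3 module with an in-memory database so that it runs without a server (an assumption; the course examples use MySQL). Only the rows shown on the slide are entered, and the REFERENCES constraint is one possible way to declare the key.

```python
import sqlite3

# In-memory SQLite database (assumption: stands in for a real DBMS).
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only if asked
con.execute("CREATE TABLE Student (ID INTEGER PRIMARY KEY, Name TEXT, Major TEXT)")
con.execute("""CREATE TABLE Attendance (
    ID INTEGER REFERENCES Student(ID),  -- foreign key: points at Student.ID
    Date TEXT,
    Status TEXT)""")
con.executemany("INSERT INTO Student VALUES (?, ?, ?)",
                [(1234, "Never Ever", "Math"), (4321, "Some Times", "CS")])
con.executemany("INSERT INTO Attendance VALUES (?, ?, ?)",
                [(1234, "02-05", "Absent"), (1234, "02-07", "Absent"),
                 (4321, "02-05", "Present"), (4321, "02-07", "Absent")])

# 1:m relationship: each student ID matches many attendance rows.
rows = con.execute("""
    SELECT Student.Name, Attendance.Date, Attendance.Status
    FROM Student, Attendance
    WHERE Student.ID = Attendance.ID
    ORDER BY Student.ID, Attendance.Date
""").fetchall()
for row in rows:
    print(row)

# An attendance row for an unknown student breaks the key relationship.
try:
    con.execute("INSERT INTO Attendance VALUES (9999, '02-05', 'Absent')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```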
Relational Database
• Database is a
collection of normalized data tables
• Database System is a
framework of data management functions
Patient        Smoker  Age
Able Peterson  No      45
Bertha Mathes  Yes     60

Surgeon          Years experience  Medical school
Pete Roberts     15                Harvard medical school
Ralf Countryman  10                University of Texas

relations: 1:n, 1:m

Date          Patient        Surgeon          Procedure         Result
Mar 17, 2006  Able Peterson  Pete Roberts     Appendectomy      Excellent
Mar 19, 2006  Bertha Mathes  Ralf Countryman  Heart transplant  Good
SQL
• Structured Query Language (built on the relational model of E. F. Codd, 1970)
• Programming language used for accessing data
in a database
• ANSI standard since 1986, ISO standard since
1987
• Still some portability issues between software
and operating systems!
• We’ll mainly focus on SQL queries to access
data
SQL Queries
• SELECT columns(, aggregate function(*))
FROM table1(, table2)
WHERE row_condition
(AND table1.id = table2.id)
(GROUP BY column)
(ORDER BY order_by_columns)
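The template can be filled in as follows. This is a sketch under assumptions: it runs the query through Python's built-in sqlite3 module on an in-memory database, uses the Student and Attendance rows from the earlier slide, and the alias n_absent is made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # assumption: SQLite instead of MySQL
con.executescript("""
CREATE TABLE Student (ID INTEGER, Name TEXT, Major TEXT);
CREATE TABLE Attendance (ID INTEGER, Date TEXT, Status TEXT);
INSERT INTO Student VALUES (1234, 'Never Ever', 'Math'), (4321, 'Some Times', 'CS');
INSERT INTO Attendance VALUES
  (1234, '02-05', 'Absent'), (1234, '02-07', 'Absent'),
  (4321, '02-05', 'Present'), (4321, '02-07', 'Absent');
""")

query = """
SELECT Student.Name, COUNT(*) AS n_absent   -- columns, aggregate function
FROM Student, Attendance                    -- table1, table2
WHERE Attendance.Status = 'Absent'          -- row_condition
  AND Student.ID = Attendance.ID            -- link the tables by their key
GROUP BY Student.Name                       -- one output row per student
ORDER BY n_absent DESC                      -- most absences first
"""
result = con.execute(query).fetchall()
print(result)
```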
Syntax
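• SQL keywords are not case sensitive (SELECT and select
are equivalent). Whether table names and string values
are case sensitive depends on the system.
• Some systems require “;” at the end of each
statement. The semicolon separates SQL statements
in systems that allow multiple commands to be
executed in one call to the server.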
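Both syntax points can be tried out in a small sketch. It assumes Python's built-in sqlite3 module and an in-memory database; other systems (MySQL among them) may treat the case of names and strings differently.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Several statements in one call, separated by semicolons.
con.executescript("""
    CREATE TABLE Student (ID INTEGER, Name TEXT, Major TEXT);
    INSERT INTO Student VALUES (1234, 'Never Ever', 'Math');
    INSERT INTO Student VALUES (4321, 'Some Times', 'CS');
""")

# Keywords and (in SQLite) table and column names ignore case.
upper = con.execute("SELECT Name FROM Student WHERE Major = 'Math'").fetchall()
lower = con.execute("select name from student where major = 'Math'").fetchall()
print(upper == lower)

# String values are compared exactly by default in SQLite,
# so 'math' does not match the stored value 'Math'.
no_match = con.execute("SELECT Name FROM Student WHERE Major = 'math'").fetchall()
print(no_match)
```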
SELECT
• Selects data from the database
SELECT column_name(s)
FROM table_name
SELECT Name, Major
FROM Student

Name        Major
Never Ever  Math
Some Times  CS
...         ...
SELECT
All columns:
SELECT *
FROM Student

ID    Name        Major
1234  Never Ever  Math
4321  Some Times  CS
...   ...         ...
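A minimal sketch of both forms of SELECT, run from Python's built-in sqlite3 module against an in-memory copy of the Student table (an assumption; only the two rows shown on the slide are entered). The ORDER BY clause is added only to make the row order predictable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Student (ID INTEGER, Name TEXT, Major TEXT)")
con.executemany("INSERT INTO Student VALUES (?, ?, ?)",
                [(1234, "Never Ever", "Math"), (4321, "Some Times", "CS")])

# Named columns: only Name and Major come back.
cur = con.execute("SELECT Name, Major FROM Student ORDER BY ID")
some_cols = [d[0] for d in cur.description]  # column names of the result
some_rows = cur.fetchall()
print(some_cols, some_rows)

# The star selects every column of the table.
cur = con.execute("SELECT * FROM Student ORDER BY ID")
all_cols = [d[0] for d in cur.description]
all_rows = cur.fetchall()
print(all_cols, all_rows)
```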
WHERE
SELECT Name
FROM Student
WHERE Major='Math'

Name
Never Ever
...
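A sketch of row conditions, again assuming Python's built-in sqlite3 module and an in-memory copy of the two Student rows shown on the slide. The "?" placeholder is sqlite3's way of passing a value separately from the query text; other database interfaces use different placeholder styles.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Student (ID INTEGER, Name TEXT, Major TEXT)")
con.executemany("INSERT INTO Student VALUES (?, ?, ?)",
                [(1234, "Never Ever", "Math"), (4321, "Some Times", "CS")])

# Literal condition: string values go in plain single quotes.
math_names = con.execute(
    "SELECT Name FROM Student WHERE Major = 'Math'").fetchall()
print(math_names)

# Conditions can be combined with AND / OR.
either = con.execute(
    "SELECT Name FROM Student WHERE Major = 'Math' OR Major = 'CS' ORDER BY ID"
).fetchall()
print(either)

# Passing the value as a parameter avoids quoting problems.
major = "CS"
cs_names = con.execute(
    "SELECT Name FROM Student WHERE Major = ?", (major,)).fetchall()
print(cs_names)
```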
Aggregating Functions
SELECT ID, count(ID)
FROM Attendance
WHERE Status='Absent'
GROUP BY ID

ID    count(ID)
1234  2
4321  1
...
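To see what GROUP BY does, the sketch below runs the query and then repeats the same computation by hand: filter the rows, split them by ID, count each group. It assumes Python's built-in sqlite3 module, an in-memory database, and only the four Attendance rows shown on the slide; ORDER BY is added to fix the row order.

```python
import sqlite3
from collections import Counter

attendance = [(1234, "02-05", "Absent"), (1234, "02-07", "Absent"),
              (4321, "02-05", "Present"), (4321, "02-07", "Absent")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Attendance (ID INTEGER, Date TEXT, Status TEXT)")
con.executemany("INSERT INTO Attendance VALUES (?, ?, ?)", attendance)

# The database filters, groups and counts in one statement.
by_sql = con.execute("""
    SELECT ID, count(ID)
    FROM Attendance
    WHERE Status = 'Absent'
    GROUP BY ID
    ORDER BY ID
""").fetchall()
print(by_sql)

# Same computation by hand on the local rows.
by_hand = Counter(sid for sid, _date, status in attendance if status == "Absent")
print(sorted(by_hand.items()))
```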
Functions
• COUNT
• AVG
• MAX
• MIN
• SUM
• ROUND
• LEN (LENGTH in MySQL)
• ...
What summary
statistics do you want?
Very similar to ddply
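A sketch of the aggregating functions at work. It assumes Python's built-in sqlite3 module and an in-memory table with the hypothetical name accidents_sample, filled with the ten rows of the select() output shown earlier in these slides, so the numbers describe only those ten rows.

```python
import sqlite3

# Ten rows copied from the select() output earlier in these slides:
# (DRUNK_DR, FATALS, PERSONS, VE_TOTAL, SP_LIMIT)
rows = [
    (0, 1, 2, 2, 65), (0, 1, 2, 1, 35), (0, 2, 3, 2, 55), (0, 1, 3, 2, 55),
    (1, 1, 1, 1, 45), (0, 1, 2, 2, 55), (0, 1, 1, 1, 55), (0, 1, 2, 1, 55),
    (1, 1, 1, 1, 45), (1, 3, 6, 2, 55),
]
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE accidents_sample (
    DRUNK_DR INTEGER, FATALS INTEGER, PERSONS INTEGER,
    VE_TOTAL INTEGER, SP_LIMIT INTEGER)""")
con.executemany("INSERT INTO accidents_sample VALUES (?, ?, ?, ?, ?)", rows)

# Without GROUP BY, the aggregates summarise the whole table in one row.
stats = con.execute("""
    SELECT COUNT(*), AVG(SP_LIMIT), MAX(SP_LIMIT), MIN(SP_LIMIT),
           SUM(FATALS), ROUND(AVG(PERSONS), 1)
    FROM accidents_sample
""").fetchone()
print(stats)

# With GROUP BY, there is one summary row per group.
by_drunk = con.execute("""
    SELECT DRUNK_DR, COUNT(*), SUM(FATALS)
    FROM accidents_sample
    GROUP BY DRUNK_DR
    ORDER BY DRUNK_DR
""").fetchall()
print(by_drunk)

# String length is LENGTH() in SQLite and MySQL (LEN() in SQL Server).
name_length = con.execute("SELECT LENGTH('Never Ever')").fetchone()[0]
print(name_length)
```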
Your Turn
• Go to website http://www.w3schools.com/sql/sql_tryit.asp
to try for yourself:
• What fields are in the table “customers”?
• Select the CompanyName and
ContactName of customers that come
from Germany
• Find a frequency breakdown of all
customers by country.
Your Turn
• Go to
http://www.w3schools.com/sql/sql_tryit.asp
• Read the help files for functions AVG(),
COUNT() and MIN()