Parts of the Project

• Data preparation (coding & write-up): 1
• Research question (coding & write-up): 4
• Introduction, conclusion, cohesiveness: 1
• Editing and proof-reading: 1
Factors, Reshaping
Stat 480
Heike Hofmann
Outline
• More on factor variables
• reshaping data: melt and cast
• ddply
Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical
variables are converted to factors
• factor creates a new factor with specified labels
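A minimal sketch of the bullets above, using made-up codes and labels (the vector `codes` and the labels are illustrative assumptions, not part of the course data):

```r
# Hypothetical numeric codes for a categorical variable
codes <- c(1, 2, 2, 3, 1)

# factor() stores integer codes together with the labels we specify
size <- factor(codes,
               levels = c(1, 2, 3),
               labels = c("small", "medium", "large"))

levels(size)       # the labels
as.integer(size)   # the underlying integer codes
```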
Practice
• Load fbi-incidences.csv data (use URL)
• Change the levels of Type to shorter names (e.g.
Motor.Vehicle.Theft to Car.Theft)
• Change Year to a factor variable, and then back to a
numeric variable (use as.numeric). What happened?
Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ... ) checks for specific type
• as.XXX casts to specific type
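A short sketch of these functions on a made-up vector (the object names `x` and `f` are assumptions for illustration):

```r
x <- c("a", "b", "a")   # hypothetical character vector
f <- as.factor(x)       # cast character to factor

str(f)           # compact description of the object
mode(f)          # storage mode: a factor is stored as numbers
is.factor(f)     # check for a specific type
is.character(f)
is.numeric(f)    # FALSE: R does not treat factors as numeric data
```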
Casting between types
• as.factor: casts to factor
• as.character: casts to character
• as.numeric: casts to numeric
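One thing worth checking when casting a factor to numeric: as.numeric works on the integer codes, not on the labels. A small sketch with made-up values (the vector `year` here is an illustrative assumption):

```r
year <- factor(c(1961, 1975, 2012))   # hypothetical values

as.numeric(year)                  # integer codes of the levels
as.numeric(as.character(year))    # go through character to recover the values
```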
Factors
• factor variables often have to be re-ordered for
ease of comparisons
• We can specify the order of the levels by explicitly
listing them, see help(factor)
• We can make the order of the levels in one variable
dependent on the summary statistic of another
variable, see help(reorder)
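Both ways of ordering levels can be sketched on a toy data frame (the data frame `d` and its values are assumptions for illustration):

```r
# Hypothetical data: three crime types with two rates each
d <- data.frame(
  crime = factor(c("A", "A", "B", "B", "C", "C")),
  rate  = c(5, 7, 1, 3, 9, 11)
)

# Order given explicitly by listing the levels
crime_manual <- factor(d$crime, levels = c("C", "A", "B"))

# Order derived from a summary statistic (median) of another variable
crime_by_rate <- reorder(d$crime, d$rate, FUN = median)

levels(crime_manual)
levels(crime_by_rate)
```

Plots such as boxplots draw the groups in level order, so the reordered factor changes the display without changing the data.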
Tables
• table(x) returns a frequency table of variable x
• Similarly: xtabs(~x, data) or xtabs(count~x, data)
• For more than one variable, the frequencies of the corresponding contingency table are returned
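A small sketch of these calls on a toy data frame (the data frame `d`, its column names, and its values are assumptions, not the fbi data):

```r
# Hypothetical data with a pre-computed count column
d <- data.frame(
  type  = c("Burglary", "Robbery", "Burglary", "Robbery"),
  state = c("Iowa", "Iowa", "Ohio", "Ohio"),
  count = c(10, 2, 30, 5)
)

rows_per_type  <- table(d$type)               # number of rows per type
rows_per_type2 <- xtabs(~ type, data = d)     # same counts via a formula
count_per_type <- xtabs(count ~ type, data = d)          # sums count per type
two_way        <- xtabs(count ~ type + state, data = d)  # contingency table

count_per_type
two_way
```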
Practice
• Introduce variable rate into fbi data
• Plot boxplots of rate by crime.
• Reorder rate by crime.
• For one type of crime (subset!) plot boxplots of rates by state, reorder crime rates
Reshaping Data
• Two step process:
  • melt: get data into a “convenient” shape, i.e. one that is particularly flexible
  • cast: cast data into new shape(s) that are better suited for analysis
Melt data first
• This gets it in a form useful for “casting” into new formats
• When melting, you need to specify the measured variables and the id variables
• melt(data, measure.vars=c(1,2,3), id.vars=5)
• key variables are fixed by design (or categorical variables), measured variables correspond to numeric measurements
melt.data.frame(data, id.vars, measure.vars, na.rm = FALSE, ...)
• id.vars: all identifiers (keys) and qualitative variables
• measure.vars: all quantitative variables
[Figure: original data (key plus columns X1 X2 X3 X4 X5) next to its molten form (“long & skinny”), in which the key is repeated and X1 to X5 are stacked]
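A minimal sketch of melting, assuming the reshape2 package is installed; the data frame `wide` and its column names are made up for illustration and are not the fbi files:

```r
library(reshape2)  # assumption: reshape2 provides melt() here

# Hypothetical wide data: one column per crime type
wide <- data.frame(
  state    = c("Iowa", "Ohio"),
  year     = c(2012, 2012),
  Burglary = c(10, 30),
  Robbery  = c(2, 5)
)

# id.vars stay as columns; measure.vars are stacked into variable/value
molten <- melt(wide,
               id.vars      = c("state", "year"),
               measure.vars = c("Burglary", "Robbery"))

molten   # one row per state, year and crime type
```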
Casting
• Function dcast: dcast(dataset, rows ~ columns, aggregate)
• rows and columns define the layout of the result; each cell holds aggregate(data)
Then, cast
• Row variables, column variables, and a summary
function (sum, mean, max, etc)
• dcast(molten, row ~ col, summary)
• dcast(molten, row1 + row2 ~ col, summary)
• dcast(molten, row ~ . , summary)
• dcast(molten, . ~ col, summary)
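The casting patterns above can be sketched on a small molten data frame, assuming the reshape2 package is installed (the data frame `molten` and its values are illustrative assumptions):

```r
library(reshape2)  # assumption: reshape2 provides dcast() here

# Hypothetical molten data: one row per state and crime type
molten <- data.frame(
  state    = c("Iowa", "Iowa", "Ohio", "Ohio"),
  variable = c("Burglary", "Robbery", "Burglary", "Robbery"),
  value    = c(10, 2, 30, 5)
)

# row ~ col: states in rows, crime types in columns
by_state_type <- dcast(molten, state ~ variable, sum, value.var = "value")

# row ~ . : one summary per state
by_state <- dcast(molten, state ~ ., sum, value.var = "value")

# . ~ col: one summary per crime type
by_type <- dcast(molten, . ~ variable, sum, value.var = "value")

by_state_type
by_state
by_type
```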
Practice
• Load the old fbi format (fbi-60-12.csv) and melt it into the
new fbi format (fbi-incidences.csv).
• Introduce variable rate into fbi incidences data
• Using cast, find:
• the number of overall crimes
• the number of incidences by state
• the number of incidences by state and crime type
• the number of incidences by state, crime type and year
Pivot tables in R
• functions in the dplyr package:
  • group_by
  • summarize
  • (transform, filter, select)
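A minimal sketch of a pivot-table style summary, assuming the dplyr package is installed; the data frame `toy` and its column names are made up for illustration and are not the fbi data:

```r
library(dplyr)  # assumption: dplyr is installed

# Hypothetical long-format data
toy <- data.frame(
  state = c("Iowa", "Iowa", "Ohio", "Ohio"),
  type  = c("Burglary", "Robbery", "Burglary", "Robbery"),
  count = c(10, 2, 30, 5)
)

# Split by state, then collapse each group to one row
by_state <- toy %>%
  group_by(state) %>%
  summarize(total = sum(count))

# Grouping by two variables gives one row per combination
by_state_type <- toy %>%
  group_by(state, type) %>%
  summarize(total = sum(count))

by_state
by_state_type
```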
Your Turn
• For the fbi-incidences data find:
• the number of overall crimes
• the number of incidences by state
• the number of incidences by state and crime type
• the number of incidences by state, crime type and
year
Checkpoint
• Submit all of your code for today’s Your Turns
so far at
http://hhofmann.wufoo.com/forms/check-point/