Parts of the Project
Data preparation coding & write-up        1
Research question coding & write-up       4
Introduction, conclusion, cohesiveness    1
Editing and proof-reading                 1

Factors, Reshaping
Stat 480
Heike Hofmann

Outline
• More on factor variables
• Reshaping data: melt and cast
• ddply

Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical variables are converted to factors
• factor() creates a new factor with specified labels

Practice
• Load fbi-incidences.csv data (use URL)
• Change the levels of Type to shorter names (e.g. Motor.Vehicle.Theft to Car.Theft)
• Change Year to a factor variable, and then back to a numeric variable (use as.numeric). What happened?

Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ...) checks for a specific type
• as.XXX casts to a specific type

Casting between types
[Diagram: factor, character, and numeric, connected by arrows labeled as.factor, as.character, and as.numeric]

Factors
• Factor variables often have to be re-ordered for ease of comparison
• We can specify the order of the levels by explicitly listing them, see help(factor)
• We can make the order of the levels in one variable dependent on a summary statistic of another variable, see help(reorder)

Tables
• table(x) returns a frequency table of variable x
• Similarly: xtabs(~x, data) or xtabs(count~x, data)
• With more than one variable, the frequencies of the corresponding contingency table are returned

Practice
• Introduce variable rate into fbi data
• Plot boxplots of rate by crime
• Reorder the crime types by rate
• For one type of crime (subset!), plot boxplots of rates by state, with the states reordered by crime rate

Reshaping Data
• Two step process:
  • melt: get data into a “convenient” shape, i.e. one that is particularly flexible
  • cast: cast data into new shape(s) that are better suited for analysis

Melt data first
• This gets it in a form useful for “casting” into new formats
• When melting, you need to specify the measured variables and the id variables
• melt(data, measure.vars=c(1,2,3), id.vars=5)
• Key variables are fixed by design (or are categorical variables); measured variables correspond to numeric measurements

melt.data.frame(data, id.vars, measure.vars, na.rm = FALSE, ...)
• id.vars: all identifiers (keys) and qualitative variables
• measure.vars: all quantitative variables
[Figure: original data (key, X1 to X5) next to its molten form, “long & skinny”]

Casting
• Function dcast
• dcast(dataset, rows ~ columns, aggregate)
[Figure: result table laid out with the row variables down the side, the column variables across the top, and aggregate(data) in the cells]

Then, cast
• Row variables, column variables, and a summary function (sum, mean, max, etc.)
• dcast(molten, row ~ col, summary)
• dcast(molten, row1 + row2 ~ col, summary)
• dcast(molten, row ~ . , summary)
• dcast(molten, . ~ col, summary)

Practice
• Load the old fbi format (fbi-60-12.csv) and melt it into the new fbi format (fbi-incidences.csv)
• Introduce variable rate into fbi incidences data
• Using cast, find:
  • the number of overall crimes
  • the number of incidences by state
  • the number of incidences by state and crime type
  • the number of incidences by state, crime type and year

Pivot tables in R
• Functions in the dplyr package:
  • group_by
  • summarize
  • (transform, filter, select)

Your Turn
• For the fbi-incidences data find:
  • the number of overall crimes
  • the number of incidences by state
  • the number of incidences by state and crime type
  • the number of incidences by state, crime type and year

Checkpoint
• Submit all of your code for today’s Your Turns so far at http://hhofmann.wufoo.com/forms/check-point/
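
Sketch: factors and casting

The first Practice asks what happens when Year is turned into a factor and back with as.numeric. A minimal sketch of that round trip follows. The vectors `year` and `type` are small made-up stand-ins, not the fbi-incidences data.

```r
# Hypothetical stand-in for the Year column (not the FBI data)
year <- c(1961, 1975, 2012, 1975)

year_f <- factor(year)        # levels are the labels "1961", "1975", "2012"
str(year_f)                   # shows the integer codes plus the labels
is.factor(year_f)             # TRUE

# as.numeric on a factor returns the integer level codes, not the labels
codes <- as.numeric(year_f)

# Going through character first recovers the original numbers
years <- as.numeric(as.character(year_f))

print(codes)
print(years)

# Shortening a level name, as in the Practice (toy Type column)
type <- factor(c("Motor.Vehicle.Theft", "Burglary", "Motor.Vehicle.Theft"))
levels(type)[levels(type) == "Motor.Vehicle.Theft"] <- "Car.Theft"
print(levels(type))
```

The point of the exercise is the difference between `codes` and `years`: the factor stores numbers plus labels, and as.numeric alone only sees the numbers.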
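
Sketch: re-ordering levels and tables

The slides on re-ordering and tables point to help(factor), help(reorder), table and xtabs. The sketch below tries each of them on a toy data frame. The data frame `crimes` and its values are invented for illustration; the column names Type, State and Count are assumptions about what the fbi data looks like.

```r
# Made-up data; column names are assumed, values are not real
crimes <- data.frame(
  Type  = c("Burglary", "Burglary", "Murder", "Murder", "Larceny", "Larceny"),
  State = c("Iowa", "Ohio", "Iowa", "Ohio", "Iowa", "Ohio"),
  Count = c(500, 900, 5, 12, 2000, 3100),
  stringsAsFactors = TRUE
)
levels(crimes$Type)   # default order is alphabetical

# Option 1: list the levels explicitly, see help(factor)
by_hand <- factor(crimes$Type, levels = c("Murder", "Burglary", "Larceny"))

# Option 2: order the levels by a summary statistic of another variable,
# here the median Count per Type, see help(reorder)
by_count <- reorder(crimes$Type, crimes$Count, FUN = median)
print(levels(by_count))

# Frequency table: number of rows per level
print(table(crimes$Type))

# xtabs with a left-hand side sums that variable within each cell
totals <- xtabs(Count ~ Type, data = crimes)
print(totals)
print(xtabs(Count ~ Type + State, data = crimes))
```

A boxplot of a variable against `by_count` would then show the boxes sorted by median rather than alphabetically.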
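
Sketch: melt, then cast

The two-step process from the Reshaping Data slides can be tried on a small wide table. This is only a sketch, assuming the reshape2 package is installed. The data frame `wide` is invented and is not the real layout of fbi-60-12.csv; the names Type and Count for the molten columns are also assumptions.

```r
library(reshape2)   # provides melt and dcast

# Made-up wide data: one column per crime type
wide <- data.frame(
  State    = c("Iowa", "Iowa", "Ohio", "Ohio"),
  Year     = c(2011, 2012, 2011, 2012),
  Burglary = c(500, 480, 900, 870),
  Murder   = c(5, 4, 12, 10)
)

# Step 1: melt. State and Year are keys, the crime columns are measured
molten <- melt(wide,
               id.vars = c("State", "Year"),
               measure.vars = c("Burglary", "Murder"),
               variable.name = "Type",
               value.name = "Count")
print(molten)       # long & skinny: one row per State, Year, Type

# Step 2: cast into new shapes, with sum as the summary function
by_state      <- dcast(molten, State ~ .,    sum, value.var = "Count")
by_state_type <- dcast(molten, State ~ Type, sum, value.var = "Count")
by_state_year <- dcast(molten, State + Year ~ Type, sum, value.var = "Count")

print(by_state)
print(by_state_type)
print(by_state_year)
```

Passing value.var explicitly avoids relying on dcast guessing which column holds the values.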
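
Sketch: pivot tables with dplyr

The Your Turn counts can be approached with group_by and summarise. The sketch below assumes the dplyr package (version 1.0.0 or later, for the .groups argument) and uses an invented data frame `incidences`; the column names State, Type, Year and Count are assumptions, not the confirmed layout of fbi-incidences.csv.

```r
suppressPackageStartupMessages(library(dplyr))

# Made-up long-format data (not the FBI data)
incidences <- data.frame(
  State = c("Iowa", "Iowa", "Iowa", "Ohio", "Ohio", "Ohio"),
  Type  = c("Burglary", "Murder", "Burglary", "Burglary", "Murder", "Murder"),
  Year  = c(2011, 2011, 2012, 2011, 2011, 2012),
  Count = c(500, 5, 480, 900, 12, 10)
)

# Overall total: no grouping
overall <- summarise(incidences, total = sum(Count))

# Total by state
by_state <- incidences %>%
  group_by(State) %>%
  summarise(total = sum(Count))

# Total by state and crime type; .groups = "drop" returns an ungrouped result
by_state_type <- incidences %>%
  group_by(State, Type) %>%
  summarise(total = sum(Count), .groups = "drop")

print(overall)
print(by_state)
print(by_state_type)
```

Adding Year to group_by gives the last breakdown in the list in the same way.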