Project I is due. You can email me your one-page comments on how your team worked.
Project II description is online.

Factors
Stat 480
Heike Hofmann

Outline
• Factor variables
• Reshaping data: melt and cast
• ddply

Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical variables are converted to factors
• factor creates a new factor with specified labels

Practice
• Load fbi-incidences.csv data (use URL)
• Make a summary of Year
• Alternatively, we can treat Year as a factor variable:
    fbi$Year <- factor(fbi$Year)
• Compare summary of Year to previous result
• Are there other variables that should be factors (or vice versa)?

Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ...) checks for a specific type
• as.XXX casts to a specific type

Casting between types
[Diagram: conversions among factor, character and numeric, labelled as.factor, as.character and as.numeric]

Factors
• Factor variables often have to be re-ordered for ease of comparisons
• We can specify the order of the levels by explicitly listing them, see help(factor)
• We can make the order of the levels in one variable dependent on the summary statistic of another variable, see help(reorder)

Practice
• Introduce variable rate into fbi data
• Plot boxplots of rate by crime.
• Reorder rate by crime.
• For one type of crime (subset!) plot boxplots of rates by state, reorder crime rates

Factors/Strings
• For changes to levels and label names, convert the factor variable to character, make changes, then convert back:
    > table(fbi$Crime)

                      Aggravated.assault                             Burglary
                                    2545                                 2545
                           Forcible.rape                        Larceny.theft
                                    2545                                 2545
                     Motor.vehicle.theft Murder.and.nonnegligent.Manslaughter
                                    2545                                 2545
                                 Robbery
                                    2545
    > fbi$Crime <- as.character(fbi$Crime)
    > summary(fbi$Crime)
       Length     Class      Mode
        17815 character character

Updating subsets
• Example:
    fbi$Crime <- as.character(fbi$Crime)
    idx <- which(fbi$Crime == "Aggravated.Assault")
    fbi$Crime[idx] <- "Assault"

    table(fbim$variable)

          Violent.Crime      Property.Crime              Murder
                   2698                2698                2698
               Burglary       Forcible.Rape             Robbery
                   2698                2698                2698
          Larceny.theft Motor.Vehicle.Theft             Assault
                   2698                2698                2698

Tables
• table(x) returns a frequency table of variable x
  Similarly: xtabs(~x, data) or xtabs(count~x, data)
• With more than one variable, the frequencies of the corresponding contingency table are returned

Re-ordering
• Factor variables often have to be re-ordered for ease of comparisons
• We can specify the order of the levels by explicitly listing them, see help(factor)
• We can make the order of the levels in one variable dependent on the summary statistic of another variable, see help(reorder)

Practice
• The file fbi-incidences.csv contains a different format of the fbi data. Load it into R and describe it.
• Introduce variable rate into fbi incidences data
• Plot boxplots of rate by crime.
• Reorder rate by crime.
• For one type of crime (subset!) plot boxplots of rates by state, reorder crime rates

Changing Levels’ names
• Example:
    > levels(fbi$Crime)
    [1] "Aggravated.assault"  "Burglary"            "Rape"                "Larceny.Theft"       "Murder"
    [6] "Property"            "Robbery"             "Motor.Vehicle.Theft" "Violent"

    > levels(fbi$Crime)[1] <- "Assault"
    > levels(fbi$Crime)[8] <- "Vehicle.Theft"

    > levels(fbi$Crime)
    [1] "Assault"       "Burglary"      "Rape"          "Larceny.Theft" "Murder"
    [6] "Property"      "Robbery"       "Vehicle.Theft" "Violent"

Reshaping Data
• Two step process:
  • get data into a “convenient” shape, i.e. one that is particularly flexible (melt)
  • cast data into new shape(s) that are better suited for analysis (cast)

Melt data first
• This gets it in a form useful for “casting” into new formats
• When melting, you need to specify the measured variables and the id variables
• melt(data, measure.var=c(1,2,3), id.var=5)
• Key variables are fixed by design (or categorical variables), measured variables correspond to numeric measurements

    melt.data.frame(data, id.vars, measure.vars, na.rm = FALSE, ...)

• id.vars: all identifiers (keys) and qualitative variables
• measure.vars: all quantitative variables
[Diagram: original data (key plus columns X1 to X5) next to its molten form, “long & skinny”]

Casting
• Function cast
    cast(dataset, rows ~ columns, aggregate)
[Diagram: table layout with rows, columns and aggregate(data) in the cells]

Then, cast
• Row variables, column variables, and a summary function (sum, mean, max, etc.)
• cast(molten, row ~ col, summary)
• cast(molten, row1 + row2 ~ col, summary)
• cast(molten, row ~ . , summary)
• cast(molten, . ~ col, summary)

Practice
• Load the old fbi format (fbi-60-12.csv) and melt it into the new fbi format (fbi-incidences.csv).
• Introduce variable rate into fbi incidences data
• Using cast, find:
  • the number of overall crimes
  • the number of incidences by state
  • the number of incidences by state and crime type
  • the number of incidences by state, crime type and year

Pivot tables in R
• Functions in the dplyr package:
  • group_by
  • summarize
  • (transform, filter, select)

Your Turn
• For the fbi-incidences data find:
  • the number of overall crimes
  • the number of incidences by state
  • the number of incidences by state and crime type
  • the number of incidences by state, crime type and year
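
The group_by / summarise pattern behind these pivot-table questions can be sketched on a small made-up data set. This is only an illustration: it assumes the dplyr package is installed, and the data frame toy and its column names (State, Crime, Year, Count) are hypothetical, not the actual layout of fbi-incidences.csv.

```r
library(dplyr)

# Hypothetical toy data: two states, two crime types, two years.
# Column names are assumptions for illustration only.
toy <- data.frame(
  State = rep(c("Iowa", "Ohio"), each = 4),
  Crime = rep(rep(c("Burglary", "Robbery"), each = 2), times = 2),
  Year  = rep(c(2011, 2012), times = 4),
  Count = c(10, 12, 3, 4, 20, 18, 7, 9)
)

# Overall total: no grouping, one summary row
overall <- toy %>%
  summarise(total = sum(Count))

# Total by state: one row per state
by_state <- toy %>%
  group_by(State) %>%
  summarise(total = sum(Count))

# Total by state and crime type: one row per combination
by_state_crime <- toy %>%
  group_by(State, Crime) %>%
  summarise(total = sum(Count), .groups = "drop")

# Total by state, crime type and year
by_state_crime_year <- toy %>%
  group_by(State, Crime, Year) %>%
  summarise(total = sum(Count), .groups = "drop")

print(overall)
print(by_state)
print(by_state_crime)
```

Each additional variable in group_by() adds one level of detail to the summary, much as adding a row variable does in a cast() formula.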