Project I is due
• You can email me your one-page comments on how your team worked
• Project II description is online
Factors
Stat 480
Heike Hofmann
Outline
• Factor variables
• reshaping data: melt and cast
• ddply
Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical
variables are converted to factors
• factor creates a new factor with specified labels
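A minimal sketch of "numbers + labels", using a hypothetical toy vector (the names codes and region are illustrative and not part of the fbi data):

```r
# Hypothetical toy data: numeric codes standing in for a categorical variable
codes <- c(1, 2, 2, 3, 1)

# factor() attaches a label to each code; levels and labels are matched by position
region <- factor(codes,
                 levels = c(1, 2, 3),
                 labels = c("North", "South", "West"))

levels(region)      # the labels
as.integer(region)  # the underlying integer codes
summary(region)     # counts per level rather than a numeric summary
```

The same call pattern applies to a column of a data frame after import.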
Practice
• Load fbi-incidences.csv data (use URL)
• Make a summary of Year
• Alternatively, we can treat Year as a factor variable
fbi$Year <- factor(fbi$Year)
• Compare summary of Year to previous result
• Are there other variables that should be factors (or vice
versa)?
Checking for, and
casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ...) checks for specific type
• as.XXX casts to specific type
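A short sketch of these checks and casts on a hypothetical character vector (x, y and f are illustrative names):

```r
# Hypothetical toy vector: numbers stored as text
x <- c("10", "20", "30")

str(x)            # compact description of the object
mode(x)           # storage mode: "character"
is.character(x)   # TRUE
is.numeric(x)     # FALSE

# as.XXX casts to the requested type
y <- as.numeric(x)
is.numeric(y)     # TRUE

# a factor is stored as integer codes, so its mode is "numeric"
f <- factor(x)
is.factor(f)      # TRUE
mode(f)
```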
Casting between types
• as.factor: converts to factor
• as.character: converts to character
• as.numeric: converts to numeric
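One point worth checking when casting: as.numeric applied directly to a factor returns the internal level codes, not the values shown in the labels. A small sketch on a hypothetical factor of years:

```r
# Hypothetical factor whose labels look like numbers
years <- factor(c("1990", "2000", "2010"))

# direct cast: returns the integer level codes 1, 2, 3
codes <- as.numeric(years)

# cast via character: recovers the values written in the labels
values <- as.numeric(as.character(years))

codes
values
```

Going through character first is the usual route from factor to numeric.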
Factors
• factor variables often have to be re-ordered for
ease of comparisons
• We can specify the order of the levels by explicitly
listing them, see help(factor)
• We can make the order of the levels in one variable
dependent on the summary statistic of another
variable, see help(reorder)
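Both ways of ordering levels can be sketched on a hypothetical toy data frame (d, crime and rate here are made up for illustration, not the fbi data):

```r
# Hypothetical toy data: three crime types with two rates each
d <- data.frame(
  crime = factor(c("A", "A", "B", "B", "C", "C")),
  rate  = c(5, 7, 1, 3, 9, 11)
)
levels(d$crime)   # default order is alphabetical

# 1. explicit order: list the levels in the order you want
d$crime_manual <- factor(d$crime, levels = c("C", "A", "B"))
levels(d$crime_manual)

# 2. order by a summary statistic of another variable (here: median rate)
#    medians are A = 6, B = 2, C = 10, so the order becomes B, A, C
d$crime_by_rate <- reorder(d$crime, d$rate, FUN = median)
levels(d$crime_by_rate)
```

A call such as boxplot(rate ~ crime_by_rate, data = d) then draws the boxes in the reordered sequence.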
Practice
• Introduce variable rate into fbi data
• Plot boxplots of rate by crime.
• Reorder rate by crime.
• For one type of crime (subset!) plot boxplots of rates by state, reorder crime rates
Factors/Strings
• For changes to levels and label names convert factor
variable to character, make changes, then convert
back:
> table(fbi$Crime)

                  Aggravated.assault                             Burglary
                                2545                                 2545
                       Forcible.rape                        Larceny.theft
                                2545                                 2545
                 Motor.vehicle.theft Murder.and.nonnegligent.Manslaughter
                                2545                                 2545
                             Robbery
                                2545
> fbi$Crime <- as.character(fbi$Crime)
> summary(fbi$Crime)
   Length     Class      Mode
    17815 character character
Updating subsets
• Example:
fbi$Crime <- as.character(fbi$Crime)
idx <- which(fbi$Crime == "Aggravated.assault")
fbi$Crime[idx] <- "Assault"
table(fbim$variable)

      Violent.Crime      Property.Crime              Murder
               2698                2698                2698
           Burglary       Forcible.Rape             Robbery
               2698                2698                2698
      Larceny.theft Motor.Vehicle.Theft             Assault
               2698                2698                2698
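The convert, change, convert back pattern can be tried on a hypothetical toy factor before applying it to the full data (crime and crime_chr are illustrative names):

```r
# Hypothetical toy factor with one label we want to shorten
crime <- factor(c("Aggravated.assault", "Burglary", "Robbery", "Burglary"))

# 1. convert to character
crime_chr <- as.character(crime)

# 2. update the matching subset; string comparison is case sensitive
crime_chr[crime_chr == "Aggravated.assault"] <- "Assault"

# 3. convert back to a factor
crime <- factor(crime_chr)

levels(crime)
table(crime)
```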
Tables
• table(x) returns a frequency table of variable x
• Similarly: xtabs(~x, data) or xtabs(count~x, data)
• For more than one variable, the frequencies of the corresponding contingency table are returned
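A small sketch contrasting the two functions, on a hypothetical data frame with one row per state and crime record and a count column (all names and values are made up):

```r
# Hypothetical toy data
d <- data.frame(
  state = c("IA", "IA", "MN", "MN", "MN"),
  crime = c("Burglary", "Robbery", "Burglary", "Burglary", "Robbery"),
  count = c(10, 4, 7, 3, 2)
)

# number of rows per state
rows_per_state <- table(d$state)

# same frequencies via the formula interface
rows_per_state_x <- xtabs(~ state, data = d)

# with a left-hand side, xtabs sums that variable within each cell
count_per_state <- xtabs(count ~ state, data = d)

# two variables give a two-way contingency table
count_state_crime <- xtabs(count ~ state + crime, data = d)

rows_per_state
count_per_state
count_state_crime
```

table counts rows, while xtabs with a left-hand side adds up an existing count column; that difference matters for data that is already aggregated.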
Practice
• The file fbi-incidences.csv contains a different format of the fbi data. Load it into R and describe it.
• Introduce variable rate into fbi incidences data
• Plot boxplots of rate by crime.
• Reorder rate by crime.
• For one type of crime (subset!) plot boxplots of rates by state, reorder crime rates
Changing Levels’ names
• Example:
> levels(fbi$Crime)
[1] "Aggravated.assault"  "Burglary"            "Rape"                "Larceny.Theft"       "Murder"
[6] "Property"            "Robbery"             "Motor.Vehicle.Theft" "Violent"
> levels(fbi$Crime)[1] <- "Assault"
> levels(fbi$Crime)[8] <- "Vehicle.Theft"
> levels(fbi$Crime)
[1] "Assault"       "Burglary"      "Rape"          "Larceny.Theft" "Murder"
[6] "Property"      "Robbery"       "Vehicle.Theft" "Violent"
Reshaping Data
• Two-step process:
• melt: get data into a “convenient” shape, i.e. one that is particularly flexible
• cast: cast data into new shape(s) that are better suited for analysis
Melt data first
• This gets it in a form useful for “casting” into new formats
• When melting, you need to specify the measured variables and the id variables
• melt(data, measure.var=c(1,2,3), id.var=5)
• key variables are fixed by design (or categorical variables), measured variables correspond to numeric measurements
melt.data.frame(data, id.vars, measure.vars, na.rm = F, ...)
• id.vars: all identifiers (keys) and qualitative variables
• measure.vars: all quantitative variables
[Figure: original data (columns key, X1, X2, X3, X4, X5) and its molten form, “long & skinny”]
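A minimal melt sketch, assuming the reshape package (which provides melt and cast) is installed; the data frame wide and its values are hypothetical:

```r
library(reshape)  # assumption: the reshape package is installed

# Hypothetical wide data: one row per state and year, one column per crime type
wide <- data.frame(
  state    = c("IA", "MN"),
  year     = c(2010, 2010),
  Burglary = c(10, 7),
  Robbery  = c(4, 2)
)

# id.vars: keys / categorical variables; measure.vars: numeric measurements
molten <- melt(wide,
               id.vars      = c("state", "year"),
               measure.vars = c("Burglary", "Robbery"))

# long & skinny: one row per (state, year, variable) with a single value column
molten
```

Each measured column becomes a level of the new variable column, with its numbers stacked in value.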
Casting
• Function cast
cast(dataset, rows ~ columns, aggregate)
[Figure: result table laid out with rows and columns, each cell holding aggregate(data)]
Then, cast
• Row variables, column variables, and a summary
function (sum, mean, max, etc)
• cast(molten, row ~ col, summary)
• cast(molten, row1 + row2 ~ col, summary)
• cast(molten, row ~ . , summary)
• cast(molten, . ~ col, summary)
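The formula patterns above can be sketched as follows, again assuming the reshape package is installed and using a hypothetical data frame (wide, molten and the result names are illustrative):

```r
library(reshape)  # assumption: the reshape package is installed

# Hypothetical wide data: two states, two years, two crime types
wide <- data.frame(
  state    = c("IA", "IA", "MN", "MN"),
  year     = c(2010, 2011, 2010, 2011),
  Burglary = c(10, 12, 7, 8),
  Robbery  = c(4, 5, 2, 3)
)
molten <- melt(wide, id.vars = c("state", "year"))

# row ~ col: states in rows, crime types in columns, summed over years
by_state_crime <- cast(molten, state ~ variable, sum)

# row1 + row2 ~ col: one row per state and year
by_state_year <- cast(molten, state + year ~ variable, sum)

# row ~ . : a single summary column per state
by_state <- cast(molten, state ~ ., sum)

# . ~ col : a single row with one total per crime type
by_crime <- cast(molten, . ~ variable, sum)

by_state_crime
by_state
```

Variables left out of the formula (here year in the first call) are aggregated over by the summary function.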
Practice
• Load the old fbi format (fbi-60-12.csv) and melt it into the
new fbi format (fbi-incidences.csv).
• Introduce variable rate into fbi incidences data
• Using cast, find:
• the number of overall crimes
• the number of incidences by state
• the number of incidences by state and crime type
• the number of incidences by state, crime type and year
Pivot tables in R
• functions in the dplyr package:
• group_by
• summarize
• (transform, filter, select)
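A hedged sketch of the group_by and summarize pattern, assuming dplyr (version 1.0 or later, for the .groups argument) is installed; the data frame incidents and its columns are hypothetical:

```r
library(dplyr)  # assumption: dplyr >= 1.0 is installed

# Hypothetical long-format data: one row per state and crime type
incidents <- data.frame(
  state = c("IA", "IA", "MN", "MN", "MN"),
  crime = c("Burglary", "Robbery", "Burglary", "Burglary", "Robbery"),
  count = c(10, 4, 7, 3, 2)
)

# overall total: summarize without grouping
overall <- summarize(incidents, total = sum(count))

# total by state: group first, then summarize within each group
by_state <- incidents %>%
  group_by(state) %>%
  summarize(total = sum(count))

# total by state and crime type
by_state_crime <- incidents %>%
  group_by(state, crime) %>%
  summarize(total = sum(count), .groups = "drop")

by_state
by_state_crime
```

Adding more grouping variables to group_by plays the same role as adding more row variables on the left side of a cast formula.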
Your Turn
• For the fbi-incidences data find:
• the number of overall crimes
• the number of incidences by state
• the number of incidences by state and crime type
• the number of incidences by state, crime type and
year