Subsets, Factors Stat 480
 Heike Hofmann

Outline
• Subsetting
• Logical Vectors & Data Updates
• Factor Variables
Practice
• a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)
• Get logical vector that is TRUE when number
is:
• less than 20
• squared value is at least 100 or less than 10
• equals 1 or 3
• a slightly tougher nut: even (look at a %% 2)
Review: Logical vectors
• Usually created by a logical comparison (or
combination thereof)
• <, >, ==, !=, <=, >=
• x %in% c(1, 4, 3, 7)
• subset
Logical expressions
• & and | are the logical AND and OR
• ! is the logical negation
• use parentheses () when linking
expressions to avoid mis-interpretation
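A minimal sketch of the operators reviewed above. The vector x is a made-up example, not the practice vector, so the exercise stays open.

```r
# Hypothetical example vector (an assumption, not from the slides)
x <- c(2, 7, 4, 11, 7)

x > 3                  # element-wise comparison: FALSE TRUE TRUE TRUE TRUE
x %in% c(4, 7)         # membership test:         FALSE TRUE TRUE FALSE TRUE
(x > 3) & (x < 10)     # logical AND:             FALSE TRUE TRUE FALSE TRUE
(x < 3) | (x > 10)     # logical OR:              TRUE FALSE FALSE TRUE FALSE
!(x == 7)              # negation:                TRUE FALSE TRUE TRUE FALSE

# Logical vectors can be counted and used for subsetting
sum(x > 3)             # 4 values are greater than 3
x[(x > 3) & (x < 10)]  # 7 4 7
```

The parentheses around each comparison are not always required, but they make the intended grouping explicit.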
Practice
• Go to our course website http://www.public.iastate.edu/~hofmann/stat480/lectures/11-factors.html
• Copy the link to the Crime Data (fbi-clean)
• Open R
• Load the data directly from the web by using fbi <- read.csv("paste the link here")
Your Turn:
Logical Expressions
• For the following subsets find a logical expression
and report dimension:
• how many data records do we have for California?
• how many states reported more than 10,000
violent crimes in 2010? how many in 1960?
• in how many states in 2012 were there more
vehicle thefts than robberies?
Useful Commands
• nrow(dataset)
  # number of records
• quantile(variable, probs=0.001, na.rm=TRUE)
  # retrieves 0.1 percentile of variable
• which(logical variable)
  # retrieves all indices for which the variable is TRUE
• which.max(variable)
  which.min(variable)
  # retrieve index of highest (lowest) value in variable
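The commands above can be tried on a small data frame. This is only a sketch: toy, state and count are made-up names and values, not the FBI data.

```r
# Hypothetical toy data (names and values are assumptions)
toy <- data.frame(
  state = c("A", "B", "C", "D", "E"),
  count = c(12, 40, 7, NA, 25),
  stringsAsFactors = FALSE
)

nrow(toy)                                       # 5 records
quantile(toy$count, probs = 0.5, na.rm = TRUE)  # median of the non-missing values: 18.5
which(toy$count > 10)                           # 1 2 5 (the NA is skipped)
which.max(toy$count)                            # 2
which.min(toy$count)                            # 3
toy$state[which.max(toy$count)]                 # "B": use the index to look up the state
```

Without na.rm = TRUE, quantile() stops with an error as soon as the variable contains a missing value.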
Your Turn
FBI Data
• Get a subset of all crimes in Iowa and Minnesota, i.e.:
...
two_states <- subset (…)
Pick one type of crime. Compute rates for it for each
year. Compare between the two states.
• Get a subset of all crimes in 2010, and plot one type
of crime compared to population.
• Get a subset of the data that includes the number of
homicides in the last five years. Find the rate of
homicides, extract all states that have a rate in the top
10% across all the states (= top 5 states), and plot.
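A sketch of the general pattern behind these tasks (subset, compute a rate, keep the top share), shown on made-up data. The column names State, Year, Count and Population are assumptions and may not match the actual file; check names(fbi) first.

```r
# Hypothetical toy data; column names and numbers are assumptions
toy <- data.frame(
  State      = c("Iowa", "Iowa", "Minnesota", "Minnesota", "Ohio"),
  Year       = c(2009, 2010, 2009, 2010, 2010),
  Count      = c(30, 36, 50, 45, 120),
  Population = c(3000000, 3000000, 5000000, 5000000, 11500000),
  stringsAsFactors = FALSE
)

# Keep only the rows for two states
two <- subset(toy, State %in% c("Iowa", "Minnesota"))

# Rate per 100,000 inhabitants
two$Rate <- two$Count / two$Population * 100000

# Rows whose rate is at or above the 90th percentile
cutoff <- quantile(two$Rate, probs = 0.9)
top <- subset(two, Rate >= cutoff)
top
```

Working with rates rather than raw counts keeps states of different size comparable.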
Checkpoint
• Submit all of your code for today’s Your Turns
so far at
http://hhofmann.wufoo.com/forms/check-point/
Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical
variables are converted to factors
• factor creates a new factor with specified labels
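A small sketch of "numbers + labels": the codes and the label names below are invented for illustration.

```r
# Hypothetical numeric codes for a categorical variable
codes <- c(1, 2, 2, 3, 1)

# Without conversion, summary() treats the codes as ordinary numbers
summary(codes)

# factor() attaches a label to each code
region <- factor(codes, levels = c(1, 2, 3),
                 labels = c("north", "central", "south"))

levels(region)      # "north" "central" "south"
as.integer(region)  # underlying integer codes: 1 2 2 3 1
summary(region)     # counts per level: north 2, central 2, south 1
```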
Practice
• Load fbi-incidences.csv data (use URL)
• Make a summary of Year
• Alternatively, we can treat Year as a factor variable
fbi$Year <- factor(fbi$Year)
• Compare summary of Year to previous result
• Are there other variables that should be factors (or vice
versa)?
Checking for, and
casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric,
logical, character, ...) checks for specific type
• as.XXX casts to specific type
Casting between types
[Diagram: conversions between factor, character, and numeric, using as.factor, as.character, and as.numeric]
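A short sketch of checking and casting, using a made-up factor of years. It also shows a common pitfall: as.numeric applied directly to a factor returns the integer level codes, not the original values.

```r
# Hypothetical factor of years (values are assumptions)
year <- factor(c("2010", "2012", "2010", "2011"))

str(year)          # structure: a factor with 3 levels
mode(year)         # "numeric": factors are stored as integer codes
is.factor(year)    # TRUE
is.numeric(year)   # FALSE

as.numeric(year)                # level codes: 1 3 1 2
as.numeric(as.character(year))  # original values: 2010 2012 2010 2011
```

Going through as.character first is the safe route from factor to numeric.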
Factors
• factor variables often have to be re-ordered for
ease of comparisons
• We can specify the order of the levels by explicitly
listing them, see help(factor)
• We can make the order of the levels in one variable
dependent on the summary statistic of another
variable, see help(reorder)
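Both ways of ordering levels, sketched on made-up data (the crime names and rates below are invented, not taken from the FBI file):

```r
# Hypothetical toy data
toy <- data.frame(
  crime = c("Robbery", "Burglary", "Murder", "Robbery", "Burglary", "Murder"),
  rate  = c(100, 700, 5, 120, 650, 6),
  stringsAsFactors = FALSE
)

# Default: levels are sorted alphabetically
f_default <- factor(toy$crime)
levels(f_default)   # "Burglary" "Murder" "Robbery"

# Explicit order: list the levels yourself
f_manual <- factor(toy$crime, levels = c("Murder", "Robbery", "Burglary"))
levels(f_manual)    # "Murder" "Robbery" "Burglary"

# Order by a summary statistic of another variable (median rate per crime)
f_by_rate <- reorder(f_default, toy$rate, FUN = median)
levels(f_by_rate)   # "Murder" "Robbery" "Burglary"
```

Boxplots and other grouped displays follow the level order, so reordering by a statistic makes the groups easier to compare.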
Practice
• Introduce variable rate into fbi data
• Plot boxplots of rate by crime.
• Reorder rate by crime.
• For one type of crime (subset!) plot boxplots of rates by
state, reorder crime rates
Factors/Strings
• For changes to levels and label names convert factor
variable to character, make changes, then convert
back:
> table(fbi$Crime)

                  Aggravated.assault                             Burglary
                                2545                                 2545
                       Forcible.rape                        Larceny.theft
                                2545                                 2545
                 Motor.vehicle.theft Murder.and.nonnegligent.Manslaughter
                                2545                                 2545
                             Robbery
                                2545
> fbi$Crime <- as.character(fbi$Crime)
> summary(fbi$Crime)
   Length     Class      Mode
    17815 character character
Tables
• table(x) returns the frequency table of variable x
Similarly: xtabs(~x, data) or xtabs(count~x, data)
• for more than one variable, the frequencies of the
corresponding contingency table are returned
• ggfluctuation(table(x,y)) renders the matrix
graphically (area proportional to cell count)
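A sketch of table and xtabs on made-up data (toy and its columns are assumptions). Note the difference between counting rows and summing a count variable.

```r
# Hypothetical toy data
toy <- data.frame(
  state = c("Iowa", "Iowa", "Ohio", "Ohio", "Ohio"),
  year  = c(2010, 2011, 2010, 2010, 2011),
  count = c(3, 4, 10, 2, 7)
)

table(toy$state)                   # rows per state: Iowa 2, Ohio 3
table(toy$state, toy$year)         # two-way contingency table
xtabs(~ state + year, data = toy)  # same counts via the formula interface
xtabs(count ~ state, data = toy)   # sums count within each state: Iowa 7, Ohio 19
```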
Updating subsets
• You can take a subset and update the
original data
• a <- 1:4
• a[2:3] <- 0
• a
• Very useful with logical subsetting
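A sketch of both kinds of update. The vector temp and the missing-value code -99 are made up for illustration.

```r
# Update by position
a <- 1:4
a[2:3] <- 0
a                          # 1 0 0 4

# Update by logical condition:
# hypothetical measurements where -99 codes a missing value
temp <- c(21.5, -99, 19.0, -99, 23.1)
temp[temp == -99] <- NA    # replace every -99 in place
temp                       # 21.5 NA 19.0 NA 23.1
mean(temp, na.rm = TRUE)   # mean of the three remaining values
```

The logical vector on the left-hand side selects which elements are overwritten; everything else stays untouched.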
Updating subsets
• Example:
fbi$Crime <- as.character(fbi$Crime)
idx <- which(fbi$Crime == "Aggravated.Assault")
fbi$Crime[idx] <- "Assault"

table(fbim$variable)

      Violent.Crime      Property.Crime              Murder
               2698                2698                2698
           Burglary       Forcible.Rape             Robbery
               2698                2698                2698
      Larceny.theft Motor.Vehicle.Theft             Assault
               2698                2698                2698
Factors/Strings
• Explicit use of factor(x) (re)-casts
variable x to a factor and updates number
of levels accordingly, e.g. compare:
two_states <- subset(fbi, State %in% c("Iowa","Minnesota"))
qplot(Year, Rate, colour=State, data=two_states)
qplot(Year, Rate, colour=factor(State), data=two_states)
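Why the two plots can differ: a subset of a factor keeps all of its original levels, and factor() drops the unused ones. A sketch on made-up data, assuming State is stored as a factor:

```r
# Hypothetical toy data with State as a factor
toy <- data.frame(
  State = factor(c("Iowa", "Minnesota", "Ohio", "Iowa")),
  Rate  = c(1.2, 0.9, 3.4, 1.1)
)

two <- subset(toy, State %in% c("Iowa", "Minnesota"))

levels(two$State)          # "Iowa" "Minnesota" "Ohio": Ohio is still a level
table(two$State)           # Ohio shows up with count 0

levels(factor(two$State))  # "Iowa" "Minnesota": unused level dropped
```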
Next time
• Automating Manipulations:
• Functions & Loops