Subsets, Factors
Stat 480
Heike Hofmann

Outline
• Subsetting
• Logical Vectors & Data Updates
• Factor Variables

Practice
• a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)
• Get a logical vector that is TRUE when the number:
  • is less than 20
  • has a squared value that is at least 100 or less than 10
  • equals 1 or 3
  • a bit more advanced: is even (look at a %% 2)

Review: Logical vectors
• Usually created by a logical comparison (or a combination thereof)
• <, >, ==, !=, <=, >=
• x %in% c(1, 4, 3, 7)
• subset

Logical expressions
• & and | are the logical "and" and "or"
• ! is the logical negation
• Use parentheses () when linking expressions to avoid misinterpretation

Practice
• Go to our course website http://www.public.iastate.edu/~hofmann/stat480/lectures/11-factors.html
• Copy the link to the Crime Data (fbi-clean)
• Open R
• Load the data directly from the Web by using
    fbi <- read.csv("paste the link here")

Your Turn: Logical Expressions
• For each of the following subsets, find a logical expression and report its dimension:
  • How many data records do we have for California?
  • How many states reported more than 10,000 violent crimes in 2010? How many in 1960?
  • In how many states in 2012 were there more vehicle thefts than robberies?

Useful Commands
• nrow(dataset)  # number of records
• quantile(variable, probs=0.001, na.rm=TRUE)  # retrieves the 0.1% quantile of variable
• which(logical variable)  # retrieves all indices for which the variable is TRUE
• which.max(variable), which.min(variable)  # retrieve index of highest (lowest) value in variable

Your Turn FBI Data
• Get a subset of all crimes in Iowa and Minnesota, i.e.:
    ...
    two_states <- subset (…)
  Pick one type of crime. Compute rates for it for each year. Compare between the two states.
• Get a subset of all crimes in 2010, and plot one type of crime compared to population.
• Get a subset of the data that includes number of Homicides in the last five years.
  Find the rate of homicides, extract all states that have a rate in the top 10% across all the states, and plot (= top 5 states).

Checkpoint
• Submit all of your code for today's Your Turns so far at http://hhofmann.wufoo.com/forms/check-point/

Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical variables are converted to factors
• factor creates a new factor with specified labels

Practice
• Load fbi-incidences.csv data (use URL)
• Make a summary of Year
• Alternatively, we can treat Year as a factor variable
    fbi$Year <- factor(fbi$Year)
• Compare summary of Year to previous result
• Are there other variables that should be factors (or vice versa)?

Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ...) checks for specific type
• as.XXX casts to specific type

Casting between types
[Diagram: factor, character, and numeric, connected by as.factor, as.character, and as.numeric]

Factors
• Factor variables often have to be re-ordered for ease of comparisons
• We can specify the order of the levels by explicitly listing them, see help(factor)
• We can make the order of the levels in one variable dependent on the summary statistic of another variable, see help(reorder)

Practice
• Introduce variable rate into fbi data
• Plot boxplots of rate by crime.
• Reorder rate by crime.
• For one type of crime (subset!) plot boxplots of rates by state, reorder crime rates

Factors/Strings
• For changes to levels and label names, convert the factor variable to character, make the changes, then convert back:
    > table(fbi$Crime)
                      Aggravated.assault                             Burglary
                                    2545                                 2545
                           Forcible.rape                        Larceny.theft
                                    2545                                 2545
                     Motor.vehicle.theft Murder.and.nonnegligent.Manslaughter
                                    2545                                 2545
                                 Robbery
                                    2545
    > fbi$Crime <- as.character(fbi$Crime)
    > summary(fbi$Crime)
       Length     Class      Mode
        17815 character character

Tables
• table(x) returns a frequency table of variable x
  Similarly: xtabs(~x, data) or xtabs(count~x, data)
• For a higher number of variables, frequencies of the corresponding contingency tables are returned
• ggfluctuation(table(x,y)) renders the matrix graphically (area proportional to cell count)

Updating subsets
• You can take a subset and update the original data
• a <- 1:4
• a[2:3] <- 0
• a
• Very useful with logical subsetting

Updating subsets
• Example:
    fbi$Crime <- as.character(fbi$Crime)
    idx <- which(fbi$Crime == "Aggravated.Assault")
    fbi$Crime[idx] <- "Assault"

    table(fbim$variable)

          Violent.Crime      Property.Crime
                   2698                2698
                 Murder            Burglary
                   2698                2698
          Forcible.Rape             Robbery
                   2698                2698
          Larceny.theft Motor.Vehicle.Theft
                   2698                2698
                Assault
                   2698

Factors/Strings
• Explicit use of factor(x) (re)-casts variable x to a factor and updates the number of levels accordingly, e.g. compare:
    two_states <- subset(fbi, State %in% c("Iowa","Minnesota"))
    qplot(Year, Rate, colour=State, data=two_states)
    qplot(Year, Rate, colour=factor(State), data=two_states)

Next time
• Automating Manipulations:
  • Functions & Loops
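
Recap sketch
• The subsetting and factor slides above can be tried out in one short script. This is only a minimal sketch in base R: the data frame toy and all of its values are made up for illustration (they are not the FBI data), and the column names merely mimic the ones used on the slides.

```r
# Toy data, invented for illustration only (NOT the FBI data from the slides).
toy <- data.frame(
  State = factor(c("Iowa", "Iowa", "Minnesota", "Minnesota", "Ohio", "Ohio")),
  Crime = factor(c("Burglary", "Aggravated.Assault",
                   "Burglary", "Aggravated.Assault",
                   "Burglary", "Aggravated.Assault")),
  Count = c(60, 120, 110, 250, 230, 500),
  Population = c(1000, 1000, 2000, 2000, 4000, 4000)
)

# Logical vector: combine two comparisons with & (logical "and").
big <- toy$Count > 100 & toy$State != "Ohio"
sum(big)     # number of rows for which the expression is TRUE

# Rename a level: factor -> character, update a logical subset, -> factor.
toy$Crime <- as.character(toy$Crime)
toy$Crime[toy$Crime == "Aggravated.Assault"] <- "Assault"
toy$Crime <- factor(toy$Crime)
levels(toy$Crime)

# subset() keeps every original factor level, even the unused ones;
# an explicit factor() call drops the levels that no longer occur.
two_states <- subset(toy, State %in% c("Iowa", "Minnesota"))
nlevels(two_states$State)           # still counts "Ohio"
nlevels(factor(two_states$State))   # only the levels actually present

# Reorder the Crime levels by the median of a second variable (the rate).
toy$Rate <- toy$Count / toy$Population
toy$Crime <- reorder(toy$Crime, toy$Rate, FUN = median)
levels(toy$Crime)                   # lowest median rate first
```

• The same pattern should carry over to the fbi data once it is loaded, with the real level names checked first via table(fbi$Crime), since string comparisons are case-sensitive.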