Factor Variables, Reshaping Data
Stat 579
Heike Hofmann

Outline
• Missing Values
• Investigating Data: Outlier Identification
• Factor Variables
• (Reshaping Data)

Economical Status
• The file economical-status.csv is an extract of the General Social Survey, with variables
• region, income, happy, age, finrela, marital, degree, health, wrkstat, partyid, polviews, sex, year
• Go to the GSS website and find out more about these variables

Advanced
• Load the economical-status.csv data into R
• Find subsets for two years of your choice
• Plot income. Why is the order of the labels so strange? Is there a difference in income between the two years?
• Plot age. Is there an age difference between the two years?

Identification of small subsets
Useful commands
• which
• quantile, max, min
• which.max, which.min

Practice
• Extract all records for 80 year old females (or males) in the midwest (or another region of your choosing). What is their general health?
• What is the % of individuals with an income of more than $25,000 in 1990?
• What is the record with the highest age? What does that mean? Check the codebook online and fix the value accordingly.

Useful Commands
• nrow(dataset) # number of records
• quantile(variable, probs=0.001, na.rm=TRUE) # retrieves the 0.1th percentile of variable
• which(logical variable) # retrieves all indices for which the variable is TRUE
• which.max(variable), which.min(variable) # retrieve index of highest (lowest) value in variable

More about missings
• NA + x = NA, NA * x = NA
• x == NA does not test for missingness: the result is itself NA
• is.na(x) returns a logical vector that is TRUE wherever x is missing
• na.omit removes all missing values from a vector (or all incomplete rows from a data.frame); complete.cases returns a logical vector marking the complete rows of a data.frame
• Many functions have a parameter na.rm that drops missing values before computing

Practice
• Check the website for a description of the levels for income.
• Make a new variable income2 in the gss data by copying all values of income. Replace all values in income2 that don’t correspond to intervals by the value NA.
• Advanced: what does the command levels() do? Use it to replace all intervals in the income2 variable by their midpoint, i.e. use 2000 instead of “$1000 TO $2999”; use the end point if the interval is one-sided

Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical variables are converted to factors
• factor creates a new factor with specified labels

Practice
• Make a summary of year
• Year should be treated as a factor variable, but isn’t. Turn year into a factor variable explicitly:
    gss$year <- factor(gss$year)
• Compare the summary of year to the previous result. Which year had the most participants?
• Are there other variables that should be factors (or vice versa)?

Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ...) checks for a specific type
• as.XXX casts to a specific type

Casting between types
• as.factor casts to factor
• as.character casts to character
• as.numeric casts to numeric

Factors
• Factor variables often have to be re-ordered for ease of comparisons
• We can specify the order of the levels by explicitly listing them, see help(factor)
• We can make the order of the levels in one variable dependent on the summary statistic of another variable, see help(reorder)

Practice
• Plot income by gender (facet or use color). Turn the order of gender around.
• Reorder the levels of degree by approximate years of education.
• Reorder income by average income2. Hint: use the command reorder.
• Reorder wrkstat by frequency.
• Make income2 a factor and then a numeric variable again.

Tables
• table(x) returns a frequency table of variable x
• For more than one variable, the frequencies of the corresponding contingency table are returned
• ggfluctuation(table(x,y)) renders the matrix graphically (area proportional to cell count)

Comparisons of States
• Goal: We want to find summary statistics of several variables for all years
• Concept: automate data aggregation

Aggregating data
• Many different ways to do this in R, but we’re going to focus on one: library(reshape2)

First, melt
• First need to “melt” the data. This gets it in a form useful for “casting” into new formats
• When melting, you need to specify the measured variables and the id variables
• melt(data, measure.vars=c(1,2,3), id.vars=5)
• Key variables are fixed by design (or categorical variables), measured variables correspond to numeric measurements

Your Turn
• Melt the gss data:
• Use year, sex, and happy as identifier variables
• Use income2 and age as measured variables
• Find the means and standard deviations for each year by sex and happiness level.
• Find the “poorest” and “richest” years

Aggregations in R
• Functions melt and cast (dcast in reshape2)
• Melting Data:
[Figure: a table with columns key, X1, X2, X3, X4, X5 being melted]

dplyr package for aggregation
• Main functionality: group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

group_by
• group_by(data, var1, ...) is a function that takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)
• Power combination: group_by and summarise. For a grouped data frame, the summary statistics will be calculated for every group

    library(dplyr)

    gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")

    summarise(gss,
      age=mean(age, na.rm=TRUE),
      men.pct=sum(sex=="MALE")/length(sex)*100,
      married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100)

           age  men.pct  married
    1 45.63128 43.85683 53.67281

    summarise(group_by(gss, region),
      age=mean(age, na.rm=TRUE),
      men.pct=sum(sex=="MALE")/length(sex)*100,
      married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100)

                region      age  men.pct  married
    1  E. NOR. CENTRAL 45.86872 43.71178 55.42193
    2  E. SOU. CENTRAL 46.77994 41.58812 54.44159
    3  MIDDLE ATLANTIC 45.73188 42.43971 52.34146
    4         MOUNTAIN 44.74539 43.72470 54.06415
    5      NEW ENGLAND 46.47334 43.47826 54.04307
    6     NOT ASSIGNED 47.62237 42.78768 44.73258
    7          PACIFIC 44.56465 46.33001 51.29445
    8   SOUTH ATLANTIC 46.19038 43.83937 54.22135
    9  W. NOR. CENTRAL 46.27819 45.35393 54.38867
    10 W. SOU. CENTRAL 43.93512 43.74745 54.13442

Chaining operator %>%
• x %>% f(y) is equivalent to f(x, y)
• gss %>% group_by(year) is equivalent to group_by(gss, year)
• Read %>% as ‘then’, i.e. “take data, then group it by year, then summarise it to …”

Chained version of example

    gss %>%
      group_by(region) %>%
      summarise(
        age=mean(age, na.rm=TRUE),
        men.pct=sum(sex=="MALE")/length(sex)*100,
        married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100
      )

Your Turn
• Use dplyr statements to get (a) the percent of married responders by year (b) the percent of men answering the survey in each year
• Plot the three variables.

filter
• filter(data, expr1, ...) is a function that takes a dataset and subsets it according to a set of expressions
• filter() works similarly to subset(), except that you can give it any number of filtering conditions, which are joined together with the logical ‘AND’ (&). Other boolean operators have to be written explicitly

Your Turn
• Use dplyr statements to get the number of responses in each year
• Has party affiliation changed over time? Summarize the data with dplyr routines first, then visualize.
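
Sketch: filter, group_by and summarise on toy data
The filter and group_by/summarise ideas from the last slides can be tried out on a small made-up data frame before running them on the full survey. This is only a sketch: the data frame toy and all of its values are hypothetical, and the column names year, sex and partyid are assumed to match the GSS extract.

```r
library(dplyr)

# Toy stand-in for the gss data (hypothetical values, not real survey records)
toy <- data.frame(
  year    = c(1990, 1990, 1990, 2000, 2000, 2000),
  sex     = c("MALE", "FEMALE", "FEMALE", "MALE", "MALE", "FEMALE"),
  partyid = c("INDEPENDENT", "STRONG DEMOCRAT", NA,
              "INDEPENDENT", "STRONG REPUBLICAN", "INDEPENDENT"),
  stringsAsFactors = FALSE
)

# Conditions separated by commas are joined with a logical AND
females_1990 <- filter(toy, year == 1990, sex == "FEMALE")

# Any other boolean operator has to be written out explicitly
male_or_2000 <- filter(toy, sex == "MALE" | year == 2000)

# Number of responses per year: group, then count the rows with n()
responses <- toy %>%
  group_by(year) %>%
  summarise(n = n())

# Share of each party affiliation within a year,
# after dropping the missing answers
party_share <- toy %>%
  filter(!is.na(partyid)) %>%
  group_by(year, partyid) %>%
  summarise(n = n()) %>%
  group_by(year) %>%
  mutate(pct = n / sum(n) * 100)
```

Replacing toy by gss should give the per-year counts and party shares asked for in the last Your Turn; party_share is then in a long format that can be handed to a plotting function.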