Data Aggregation
Stat 579
Heike Hofmann

Outline
• Factors
• Data aggregation with dplyr
• group_by, summarize, transform

Practice
  gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")
• Make a new variable income2 in the gss data by copying all values of income. Replace all values in income2 that don't correspond to intervals with NA.
• Advanced: what does the command levels() do? Use it to replace all intervals in the income2 variable by their midpoint, i.e. use 2000 instead of "$1000 TO $2999"; use the end point if the interval is one-sided.

Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical variables are converted to factors
• factor creates a new factor with specified labels

Practice
• Make a summary of year.
• Year should be treated as a factor variable, but isn't. Turn year into a factor variable explicitly:
  gss$year <- factor(gss$year)
• Compare the summary of year to the previous result. Which year had the most participants?
• Are there other variables that should be factors (or vice versa)?

Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ...) checks for a specific type
• as.XXX casts to a specific type

Casting between types
[Diagram: casting between factor, character, and numeric with as.factor, as.character, and as.numeric]

Factors
• Factor variables often have to be re-ordered for ease of comparisons
• We can specify the order of the levels by explicitly listing them, see help(factor)
• We can make the order of the levels in one variable dependent on the summary statistic of another variable, see help(reorder)

Practice
• Plot income by gender (facet or use color). Turn the order of gender around.
• Reorder the levels of degree by approximate years of education.
• Reorder income by average income2. Hint: use the command reorder.
• Reorder wrkstat by frequency.
• Make income2 a factor and then a numeric variable again.

Tables
• table(x) returns a frequency table of variable x
• For more than one variable, the frequencies of the corresponding contingency table are returned

Comparisons across levels of (multiple) variables
• Goal: we want to find summary statistics across different levels
• DRY principle?
• Concept: automate data aggregation

First, melt
• First we need to "melt" the data. This gets it into a form useful for "casting" into new formats.
• When melting, you need to specify the measured variables and the id variables
• melt(data, measure.var=c(1,2,3), id.var=5)
• Key variables are fixed by design (or are categorical variables); measured variables correspond to numeric measurements

Aggregations in R
• Functions melt and cast
• Melting Data:
[Diagram: a wide data set with a key column and measured columns X1 to X5 is melted into long form]

Your Turn
• Melt the gss data:
• Use year, sex, and happy as identifier variables
• Use income2 and age as measured variables
• Do a summary of the melted data

For aggregation: dplyr package
• Main functionality: group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

[Diagram: split, apply, combine]
  Split:    x = a: y = 2, 4      x = b: y = 0, 5      x = c: y = 5, 10
  Apply:    group mean 3         group mean 2.5       group mean 7.5
  Combine:  x    y
            a    3
            b    2.5
            c    7.5

group_by
• group_by(data, var1, ...) is a function that takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)
• Power combination: group_by and summarise. For a grouped data frame, the summary statistics are calculated for every group.

Summarizing by groups
  summarise(gss, avgincome=mean(income2, na.rm=T), avgage=mean(age, na.rm=T))
• summarise is the function, gss is the input data, and avgincome and avgage are the columns to calculate
  year.summary <- summarise(group_by(gss, year), avgincome=mean(income2, na.rm=T), avgage=mean(age, na.rm=T))
• group_by(gss, year) names the column(s) to group by
  qplot(year, avgincome, data=year.summary)
[Plot: avgincome (axis from 12500 to 20000) against year (1980 to 2010)]

  library(dplyr)
  gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")

  summarise(gss,
    age=mean(age, na.rm=T),
    men.pct=sum(sex=="MALE")/length(sex)*100,
    married=sum(marital=="MARRIED", na.rm=T)/length(marital)*100)

         age  men.pct  married
  1 45.63128 43.85683 53.67281

  summarise(group_by(gss, region),
    age=mean(age, na.rm=T),
    men.pct=sum(sex=="MALE")/length(sex)*100,
    married=sum(marital=="MARRIED", na.rm=T)/length(marital)*100)

              region      age  men.pct  married
  1  E. NOR. CENTRAL 45.86872 43.71178 55.42193
  2  E. SOU. CENTRAL 46.77994 41.58812 54.44159
  3  MIDDLE ATLANTIC 45.73188 42.43971 52.34146
  4         MOUNTAIN 44.74539 43.72470 54.06415
  5      NEW ENGLAND 46.47334 43.47826 54.04307
  6     NOT ASSIGNED 47.62237 42.78768 44.73258
  7          PACIFIC 44.56465 46.33001 51.29445
  8   SOUTH ATLANTIC 46.19038 43.83937 54.22135
  9  W. NOR. CENTRAL 46.27819 45.35393 54.38867
  10 W. SOU. CENTRAL 43.93512 43.74745 54.13442

Chaining operator %>%
• x %>% f(y) is equivalent to f(x, y)
• gss %>% group_by(year) is equivalent to group_by(gss, year)
• Read %>% as 'then', i.e. "take data, then group it by year, then summarise it to ..."

Chained version of example
  gss %>%
    group_by(region) %>%
    summarise(
      age=mean(age, na.rm=T),
      men.pct=sum(sex=="MALE")/length(sex)*100,
      married=sum(marital=="MARRIED", na.rm=T)/length(marital)*100
    )

Your Turn
• Use dplyr statements to get (a) the percent of married responders by year and (b) the percent of men answering the survey in each year
• Plot the three variables.

Your Turn
• For the gss data use dplyr statements to find
• the percent of respondents with $25,000 or more for each year
• the percent of respondents with $25,000 or more by gender and year
• the percent of respondents with $25,000 or more by party affiliation and year

Check-Point
• Submit the code for the last 'Your Turn' at http://hhofmann.wufoo.com/forms/checkpoint/

Your Turn
• What are the "richest" and "poorest" regions? Does it change over the years?
• Does a higher income make people happier? Is that stable across years?
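
Sketch: split, apply, combine on toy data
The small x/y table from the split-apply-combine diagram can be run through group_by and summarise to check the group means by hand. This is only a minimal sketch: the data frame toy and the result names toy.summary and toy.nested are made up for illustration and are not part of the gss data or the course code.

```r
library(dplyr)

# Toy data copied from the split-apply-combine diagram (hypothetical name `toy`)
toy <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(2, 4, 0, 5, 5, 10)
)

# Chained form: take toy, then group it by x, then summarise y by its mean
toy.summary <- toy %>%
  group_by(x) %>%
  summarise(y = mean(y, na.rm = TRUE))

# Nested form of the same computation, without the chaining operator
toy.nested <- summarise(group_by(toy, x), y = mean(y, na.rm = TRUE))

print(toy.summary)
```

Both forms return one row per level of x, with the group means 3, 2.5, and 7.5 shown in the diagram. The same pattern carries over to the gss exercises by swapping in the grouping variable and the summary expression.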