Data Aggregation Stat 579  Heike Hofmann

Outline
• Factors
• Data aggregation with dplyr
• group_by, summarize, transform

Practice
    gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")
• Make a new variable income2 in the gss data by copying all values of income. Replace all values in income2 that don't correspond to intervals with NA.
• Advanced: what does the command levels() do? Use it to replace every interval in income2 with its midpoint, e.g. use 2000 instead of "$1000 TO $2999". Use the end point if the interval is one-sided.

Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical variables are converted to factors
• factor() creates a new factor with specified labels

Practice
• Make a summary of year
• Year should be treated as a factor variable, but isn't. Turn year into a factor variable explicitly:
    gss$year <- factor(gss$year)
• Compare the summary of year to the previous result. Which year had the most participants?
• Are there other variables that should be factors (or vice versa)?

Checking for, and casting between, types
• str and mode provide information on the type
• is.XXX (with XXX one of factor, integer, numeric, logical, character, ...) checks for a specific type
• as.XXX casts to a specific type

Casting between types
[Diagram: conversions between factor, character, and numeric, labelled with as.factor, as.character, and as.numeric]

Factors
• Factor variables often have to be re-ordered for ease of comparison
• We can specify the order of the levels by listing them explicitly, see help(factor)
• We can make the order of the levels in one variable depend on a summary statistic of another variable, see help(reorder)

Practice
• Plot income by gender (facet or use color). Reverse the order of the gender levels.
• Reorder the levels of degree by approximate years of education.
• Reorder income by average income2. Hint: use the command reorder.
• Reorder wrkstat by frequency.
• Make income2 a factor and then a numeric variable again.

Tables
• table(x) returns a frequency table of variable x
• For more than one variable, the frequencies of the corresponding contingency table are returned

Comparisons across levels of (multiple) variables
• Goal: we want to find summary statistics across different levels
• DRY principle?
• Concept: automate data aggregation

First, melt
• First we need to "melt" the data. This gets it into a form useful for "casting" into new formats.
• When melting, you need to specify the measured variables and the id variables:
    melt(data, measure.var=c(1,2,3), id.var=5)
• Key variables are fixed by design (or are categorical variables); measured variables correspond to numeric measurements

Aggregations in R
• Functions melt and cast
• Melting data:
[Diagram: a wide table with a key column and measured columns X1 to X5 is melted into long format]

Your Turn
• Melt the gss data:
  • Use year, sex, and happy as identifier variables
  • Use income2 and age as measured variables
• Do a summary of the melted data

For aggregation: the dplyr package
• Main functionality: group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Split, Apply, Combine
[Diagram: a table with columns x and y is split by x, the mean of y is computed per group, and the results are combined]

    Input       Split            Apply (mean of y)   Combine
    x  y        a: 2, 4          3                   x  y
    a  2        b: 0, 5          2.5                 a  3
    a  4        c: 5, 10         7.5                 b  2.5
    b  0                                             c  7.5
    b  5
    c  5
    c  10

group_by
• group_by(data, var1, ...) takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)
• Power combination: group_by and summarise. For a grouped data frame, the summary statistics are calculated for every group.

Summarizing by groups
    summarise(gss,
              avgincome=mean(income2, na.rm=TRUE),
              avgage=mean(age, na.rm=TRUE))
• summarise: the function
• gss: the input data
• avgincome=..., avgage=...: the columns to calculate

    summarise(group_by(gss, year),
              avgincome=mean(income2, na.rm=TRUE),
              avgage=mean(age, na.rm=TRUE))
• group_by(gss, year): the column(s) to group by

    qplot(year, avgincome, data=year.summary)
[Plot: avgincome (y-axis, roughly 12500 to 20000) against year (x-axis, 1980 to 2010)]

    library(dplyr)
    gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")

    summarise(gss,
              age=mean(age, na.rm=TRUE),
              men.pct=sum(sex=="MALE")/length(sex)*100,
              married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100)

           age  men.pct  married
    1 45.63128 43.85683 53.67281

    summarise(group_by(gss, region),
              age=mean(age, na.rm=TRUE),
              men.pct=sum(sex=="MALE")/length(sex)*100,
              married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100)

                region      age  men.pct  married
    1  E. NOR. CENTRAL 45.86872 43.71178 55.42193
    2  E. SOU. CENTRAL 46.77994 41.58812 54.44159
    3  MIDDLE ATLANTIC 45.73188 42.43971 52.34146
    4         MOUNTAIN 44.74539 43.72470 54.06415
    5      NEW ENGLAND 46.47334 43.47826 54.04307
    6     NOT ASSIGNED 47.62237 42.78768 44.73258
    7          PACIFIC 44.56465 46.33001 51.29445
    8   SOUTH ATLANTIC 46.19038 43.83937 54.22135
    9  W. NOR. CENTRAL 46.27819 45.35393 54.38867
    10 W. SOU. CENTRAL 43.93512 43.74745 54.13442

Chaining operator %>%
• x %>% f(y) is equivalent to f(x, y)
• gss %>% group_by(year) is equivalent to group_by(gss, year)
• Read %>% as "then", i.e. "take the data, then group it by year, then summarise it to ..."

Chained version of example
    gss %>%
      group_by(region) %>%
      summarise(
        age=mean(age, na.rm=TRUE),
        men.pct=sum(sex=="MALE")/length(sex)*100,
        married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100
      )

Your Turn
• Use dplyr statements to get
  (a) the percent of married respondents by year
  (b) the percent of men answering the survey in each year
• Plot the three variables.

Your Turn
• For the gss data, use dplyr statements to find
  • the percent of respondents with $25,000 or more for each year
  • the percent of respondents with $25,000 or more by gender and year
  • the percent of respondents with $25,000 or more by party affiliation and year

Check-Point
• Submit the code for the last "Your Turn" at http://hhofmann.wufoo.com/forms/checkpoint/

Your Turn
• What are the "richest" and "poorest" regions? Does that change over the years?
• Does a higher income make people happier? Is that stable across years?
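The split-apply-combine diagram and the chaining slide can be tried out without downloading the gss data. The sketch below rebuilds the six-row x/y table from the diagram and computes the group means twice, once nested and once chained. It assumes dplyr is installed; the names toy, nested, chained, and avg_y are made up for this illustration and do not appear in the slides.

```r
library(dplyr)

# Toy data copied from the split-apply-combine diagram (hypothetical name: toy)
toy <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(2, 4, 0, 5, 5, 10),
  stringsAsFactors = FALSE
)

# Nested form: group first, then summarise the grouped data frame
nested <- summarise(group_by(toy, x), avg_y = mean(y))

# Chained form: "take toy, then group it by x, then summarise it"
chained <- toy %>%
  group_by(x) %>%
  summarise(avg_y = mean(y))

print(chained)
```

Both forms should give one row per level of x with avg_y equal to 3, 2.5, and 7.5, which matches the Combine step of the diagram. The same pattern applies to gss once year or region replaces x.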
