Data Aggregation Stat 579
 Heike Hofmann

Outline
• Factors
• Data Aggregation with dplyr
• group_by, summarize, transform
Practice
gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")
• Make a new variable income2 in the gss
data, by copying all values of income. Replace
all values in income2 that don’t correspond
to intervals by the value NA.
• Advanced: what does the command levels()
do? Use it to replace all intervals in the
income2 variable by their midpoint, i.e. use
2000 instead of “$1000 TO $2999”, use end
points, if the interval is one-sided
Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical
variables are converted to factors
• factor creates a new factor with specified labels
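The "numbers + labels" idea can be sketched on a small vector. This is a minimal illustration only; the codes and the label names below are made up for the example and are not taken from the gss data.

```r
# Hypothetical toy data: integer codes standing in for a categorical variable
codes <- c(1, 2, 2, 3, 1)

# factor() attaches one label to each listed level, in the given order
degree <- factor(codes,
                 levels = c(1, 2, 3),
                 labels = c("HIGH SCHOOL", "BACHELOR", "GRADUATE"))

levels(degree)      # the labels
as.integer(degree)  # the underlying integer codes
summary(degree)     # counts per level instead of a numeric five-number summary
```

Note that summary() treats the factor as categorical and reports frequencies, which is the behaviour wanted for variables such as year.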
Practice
• Make a summary of year
• Year should be treated as a factor variable, but isn’t. Turn
it into a factor variable explicitly:
gss$year <- factor(gss$year)
• Compare summary of year to previous result.
Which year had the most participants?
• Are there other variables that should be factors (or vice
versa)?
Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical,
character, ...) checks for specific type
• as.XXX casts to specific type
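A short sketch of checking and casting, using a made-up character vector. It also shows a common pitfall: casting a factor directly to numeric returns the level codes, so going through character first is usually what is wanted.

```r
# Hypothetical toy data: numbers stored as text
x <- c("10", "20", "20", "40")
f <- as.factor(x)             # levels "10", "20", "40"

str(f)                        # compact description of the object
mode(f)                       # "numeric": factors are stored as integer codes
is.factor(f)                  # TRUE
is.numeric(f)                 # FALSE, even though the storage is numeric

codes  <- as.numeric(f)                 # 1 2 2 3: level codes, not the values
values <- as.numeric(as.character(f))   # 10 20 20 40: the original numbers
```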
Casting between types
• as.factor casts to factor
• as.character casts to character
• as.numeric casts to numeric
Factors
• factor variables often have to be re-ordered for
ease of comparisons
• We can specify the order of the levels by explicitly
listing them, see help(factor)
• We can make the order of the levels in one variable
dependent on the summary statistic of another
variable, see help(reorder)
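Both ways of re-ordering levels can be sketched on toy data. The data frame d and its values are hypothetical; reorder uses the mean as its default summary function.

```r
# Hypothetical toy data
d <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(5, 7, 1, 3, 9, 11)
)

# 1. Explicit order: list the levels in the order wanted
g1 <- factor(d$group, levels = c("c", "b", "a"))
levels(g1)

# 2. Order by a summary statistic of another variable (default: mean)
g2 <- reorder(d$group, d$value)
levels(g2)   # ordered by group means of value: b (2), a (6), c (10)

# A different summary can be passed as the third argument, e.g. the median
g3 <- reorder(d$group, d$value, median)
```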
Practice
• Plot income by gender (facet or use color).
Turn the order of gender around.
• Reorder the levels of degree by approximate years of
education.
• Reorder income by average income2. Hint: use the command
reorder.
• Reorder wrkstat by frequency.
• Make income2 a factor and then a numeric variable again.
Tables
• table(x) returns a frequency table of variable x
• for more than one variable, the frequencies of the
corresponding contingency table are returned
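A small sketch of one-way and two-way tables. The vectors sex and happy are made-up toy data, not the gss variables of the same name.

```r
# Hypothetical toy data
sex   <- c("MALE", "FEMALE", "FEMALE", "MALE", "FEMALE")
happy <- c("yes", "yes", "no", "no", "yes")

t1 <- table(sex)          # frequency of each level of sex
t2 <- table(sex, happy)   # contingency table: rows are sex, columns are happy

t1
t2
```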
Comparisons across levels of (multiple) variables
• Goal: We want to find summary statistics across
different levels
• DRY principle?
• Concept: automate data aggregation
First, melt
• First need to “melt” the data
• When melting, you need to specify the measured
variables and the id variables
• This gets it in a form useful for “casting” into new
formats
• melt(data, measure.var=c(1,2,3), id.var=5)
• key variables are fixed by design (or categorical
variables), measured variables correspond to
numeric measurements
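A minimal sketch of melting, assuming that melt comes from the reshape2 package (the slides do not name the package) and that it is installed. The data frame wide is hypothetical; the full argument names are id.vars and measure.vars.

```r
library(reshape2)  # assumption: melt() is taken from reshape2

# Hypothetical toy data: two id variables, two measured variables
wide <- data.frame(
  id     = c("r1", "r2"),
  year   = c(2000, 2010),
  age    = c(30, 41),
  income = c(20000, 35000)
)

# One row per (id, year, measured variable) combination
long <- melt(wide,
             id.vars      = c("id", "year"),
             measure.vars = c("age", "income"))
long
```

The melted data has one column naming the measured variable (variable) and one holding its value (value), which is the form later aggregation steps work from.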
Aggregations in R
• Functions melt and cast
• Melting Data:
[Figure: melting a table with columns key, X1, X2, X3, X4, X5]
Your Turn
• Melt the gss data:
• Use year, sex, and happy as identifier variables
• use income2 and age as measured variables
• Do a summary of the melted data
For aggregation: dplyr package
• main functionality: group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
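Split, Apply, Combine
• Split: the data (x, y) = (a, 2), (a, 4), (b, 0), (b, 5), (c, 5), (c, 10)
is split by x into the groups a, b and c
• Apply: the mean of y is calculated within each group: 3, 2.5, 7.5
• Combine: the results are combined into one table:
x   y
a   3
b   2.5
c   7.5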
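The split, apply and combine steps can be sketched in base R with the toy values from the diagram. This is only one way to spell out the three steps; dplyr wraps them into group_by and summarise.

```r
# Toy data from the diagram
d <- data.frame(x = c("a", "a", "b", "b", "c", "c"),
                y = c(2, 4, 0, 5, 5, 10))

# Split: one vector of y values per level of x
pieces <- split(d$y, d$x)

# Apply: compute the mean within each piece
means <- sapply(pieces, mean)

# Combine: put the results back into one data frame
result <- data.frame(x = names(means), y = unname(means))
result
```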
group_by
• group_by(data, var1, ...) is a function that takes a dataset and
introduces a group for each (combination of) level(s) of the
grouping variable(s)
• Power combination: group_by and summarise.
For a grouped data frame, the summary statistics will be
calculated for every group
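A small sketch of grouping by a combination of two variables, assuming dplyr is installed. The data frame d and its columns are hypothetical stand-ins for gss variables.

```r
library(dplyr)

# Hypothetical toy data
d <- data.frame(
  sex  = c("F", "F", "M", "M", "F", "M"),
  year = c(2000, 2000, 2000, 2010, 2010, 2010),
  age  = c(30, 40, 50, 20, 60, 40)
)

# One group per observed combination of sex and year
grouped <- group_by(d, sex, year)

# summarise() then returns one row per group
by_both <- summarise(grouped, avgage = mean(age), n = n())
by_both
```

Recent dplyr versions print a message about the grouping of the result; that message is informational and does not affect the values.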
Summarizing by groups
summarise(gss, avgincome=mean(income2, na.rm=T),
          avgage=mean(age, na.rm=T))
• summarise: function
• gss: input data
• avgincome, avgage: columns to calculate

summarise(group_by(gss, year), avgincome=mean(income2, na.rm=T),
          avgage=mean(age, na.rm=T))
• year: column(s) to group by
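# year.summary: the result of the grouped summarise call above, stored in a variable
qplot(year, avgincome, data=year.summary)
[Figure: avgincome (y axis, 12500 to 20000) plotted against year (x axis, 1980 to 2010)]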
library(dplyr)

gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")

summarise(gss,
  age=mean(age, na.rm=T),
  men.pct=sum(sex=="MALE")/length(sex)*100,
  married=sum(marital=="MARRIED", na.rm=T)/length(marital)*100)

       age  men.pct  married
1 45.63128 43.85683 53.67281

summarise(group_by(gss, region),
  age=mean(age, na.rm=T),
  men.pct=sum(sex=="MALE")/length(sex)*100,
  married=sum(marital=="MARRIED", na.rm=T)/length(marital)*100)

            region      age  men.pct  married
1  E. NOR. CENTRAL 45.86872 43.71178 55.42193
2  E. SOU. CENTRAL 46.77994 41.58812 54.44159
3  MIDDLE ATLANTIC 45.73188 42.43971 52.34146
4         MOUNTAIN 44.74539 43.72470 54.06415
5      NEW ENGLAND 46.47334 43.47826 54.04307
6     NOT ASSIGNED 47.62237 42.78768 44.73258
7          PACIFIC 44.56465 46.33001 51.29445
8   SOUTH ATLANTIC 46.19038 43.83937 54.22135
9  W. NOR. CENTRAL 46.27819 45.35393 54.38867
10 W. SOU. CENTRAL 43.93512 43.74745 54.13442
Chaining operator %>%
• x %>% f(y) is equivalent to f(x, y)
• gss %>% group_by(year) is equivalent to
group_by(gss, year)
• Read %>% as ‘then’, i.e. “take data, then group it by year, then summarise it to …”
Chained version of example
gss %>%
  group_by(region) %>%
  summarise(
    age=mean(age, na.rm=T),
    men.pct=sum(sex=="MALE")/length(sex)*100,
    married=sum(marital=="MARRIED", na.rm=T)/length(marital)*100
  )
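To see that the nested and the chained forms describe the same computation, both can be run on a small data frame and compared. The data frame d is hypothetical toy data, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data
d <- data.frame(x = c("a", "a", "b", "b", "c", "c"),
                y = c(2, 4, 0, 5, 5, 10))

# Nested form: read from the inside out
nested <- summarise(group_by(d, x), avg_y = mean(y))

# Chained form: take d, then group it by x, then summarise it
chained <- d %>%
  group_by(x) %>%
  summarise(avg_y = mean(y))

# Both forms yield the same group means
all.equal(nested$avg_y, chained$avg_y)
```

The chained form tends to be easier to read once more than two steps are involved, because the steps appear in the order in which they are applied.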
Your Turn
• Use dplyr statements to get (a) the percent of married respondents by
year and (b) the percent of men answering the survey
in each year
• Plot the three variables.
Your Turn
• For the gss data use dplyr statements to find
• the percent of respondents with $25,000 or
more for each year
• the percent of respondents with $25,000 or
more by gender and year
• the percent of respondents with $25,000 or
more by party affiliation and year
Check-Point
• Submit the Code for the last ‘Your Turn’ at
http://hhofmann.wufoo.com/forms/checkpoint/
Your Turn
• What are the “richest” and “poorest” regions? - does it
change over the years?
• Does a higher income make people happier? - is that stable
across years?