Factor Variables, Reshaping Data Stat 579
 Heike Hofmann

Outline
• Missing Values
• Investigating Data: Outlier Identification
• Factor Variables
• (Reshaping Data)
Economical Status
• The file economical-status.csv is an extract
of the General Social Survey, with variables
• region, income, happy, age, finrela, marital,
degree, health, wrkstat, partyid, polviews,
sex, year
• Go on the GSS website and find out more
about these variables
Advanced
• Load economical-status.csv data into
R
• Find subsets for two years of your choice
• Plot income. Why is the order of the labels so
strange? Is there a difference in income between the
two years?
• Plot age. Is there an age difference between the
two years?
Identification of small
subsets
Useful commands
• which
• quantile, max, min
• which.max, which.min
Practice
• Extract all records for 80 year old females (or
males) in the midwest (or another region of
your choosing). What is their general health?
• What is the % of individuals with an income of
more than $25,000 in 1990?
• Which record has the highest age? What does that value mean? Check the codebook
online and fix the data accordingly.
Useful Commands
• nrow(dataset)
# number of records
• quantile(variable, probs=0.001, na.rm=T)
# retrieves the 0.1 percentile of variable
• which(logical variable)
# retrieves all indices for which the variable is TRUE
• which.max(variable)
which.min(variable)
# retrieve index of highest (lowest) value in variable
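A minimal sketch of these commands on a made-up vector (the values are illustrative only and do not come from the GSS data):

```r
# Toy ages, invented for illustration; one value is missing
age <- c(34, 80, 27, NA, 99, 45)
toy <- data.frame(age = age)

nrow(toy)                                  # number of records
quantile(age, probs = 0.5, na.rm = TRUE)   # median of the non-missing ages
idx <- which(age > 40)                     # indices where the condition is TRUE
idx                                        # NA comparisons are dropped by which()
which.max(age)                             # index of the largest age
which.min(age)                             # index of the smallest age
```

Note that which() silently skips positions where the condition is NA, which is usually what you want when extracting records.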
More about missings
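• NA + x = NA, NA * x = NA
• x == NA is always NA, so it cannot be used to test for missing values
• is.na returns a logical vector that flags the missing values of a vector
• na.omit removes all missing values from a
vector
• complete.cases returns a logical vector marking the rows of a
data.frame that have no missing values; use it to subset
• Many functions have parameter na.rm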
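A short sketch of how missing values behave, using made-up toy data (the vector x and data frame df are hypothetical):

```r
x <- c(1, NA, 3)

x + 1                        # arithmetic with NA gives NA in that position
x == NA                      # all NA: comparison with NA is never TRUE or FALSE
is.na(x)                     # the proper test for missingness
clean <- na.omit(x)          # drops the missing value
mean(x)                      # NA, because one value is missing
m <- mean(x, na.rm = TRUE)   # na.rm = TRUE ignores missing values

# Toy data frame with a missing value in each of two rows
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
cc <- complete.cases(df)     # TRUE only for rows without any NA
df_complete <- df[cc, ]      # keep only the complete rows
```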
Practice
• Check the website for a description of the levels
for income.
• Make a new variable income2 in the gss data, by
copying all values of income. Replace all values in
income2 that don’t correspond to intervals by the
value NA.
• Advanced: what does the command levels() do? Use
it to replace all intervals in the income2 variable by
their midpoint, i.e. use 2000 instead of “$1000 TO
$2999”; use end points if the interval is one-sided.
Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical
variables are converted to factors
• factor creates a new factor with specified
labels
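As a small illustration of "numbers + labels", here is a sketch that turns made-up numeric codes into a factor. The codes and the labels "low", "mid", "high" are hypothetical and not taken from the GSS codebook:

```r
# Numeric category codes as they might arrive after import (toy values)
codes <- c(1, 2, 2, 1, 3)

# factor() attaches a label to each code
f <- factor(codes, levels = c(1, 2, 3), labels = c("low", "mid", "high"))

levels(f)       # the labels, in level order
as.integer(f)   # the underlying integer codes
summary(f)      # for a factor, summary() gives counts per level
```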
Practice
• Make a summary of year
• Year should be treated as a factor variable, but isn’t. Turn
year into a factor variable explicitly:
gss$year <- factor(gss$year)
• Compare summary of year to previous result.
Which year had the most participants?
• Are there other variables that should be factors (or vice
versa)?
Checking for, and
casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor,
integer, numeric, logical,
character, ... ) checks for specific type
• as.XXX casts to specific type
Casting between types
• as.factor casts to factor
• as.character casts to character
• as.numeric casts to numeric
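A sketch of checking and casting types on a toy factor (values invented for illustration). It also shows a common pitfall: as.numeric applied directly to a factor returns the internal level codes, not the values shown in the labels.

```r
f <- factor(c("10", "20", "10", "30"))

is.factor(f)    # TRUE
is.numeric(f)   # FALSE, even though factors are stored as integers
mode(f)         # storage mode of the underlying data
str(f)          # compact description of type and levels

codes <- as.numeric(f)                 # level codes, not the labelled values
vals  <- as.numeric(as.character(f))   # go through character to recover the values
back  <- as.factor(vals)               # and cast back to a factor
```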
Factors
• factor variables often have to be re-ordered for
ease of comparisons
• We can specify the order of the levels by explicitly
listing them, see help(factor)
• We can make the order of the levels in one variable
dependent on the summary statistic of another
variable, see help(reorder)
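Both ways of re-ordering levels can be sketched on toy data. The level names and numbers below are made up for illustration:

```r
# By default, levels are sorted alphabetically
size <- factor(c("small", "large", "medium", "small"))
levels(size)

# Explicit order: list the levels in the order you want
size2 <- factor(size, levels = c("small", "medium", "large"))
levels(size2)

# Order depending on a summary statistic of another variable
grp <- factor(c("a", "b", "c", "a", "b", "c"))
val <- c(5, 1, 3, 7, 2, 4)
grp2 <- reorder(grp, val, FUN = mean)   # levels sorted by the mean of val
levels(grp2)
```

Here the group means are a = 6, b = 1.5 and c = 3.5, so reorder puts the levels in the order b, c, a.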
Practice
• Plot income by gender (facet or use color).
Turn the order of gender around.
• Reorder the levels of degree by approximate years of
education.
• Reorder income by average income2. Hint: use the command
reorder.
• Reorder wrkstat by frequency.
• Make income2 a factor and then a numeric variable again.
Tables
• table(x) returns a frequency table of
variable x
• for higher number of variables, frequencies
of corresponding contingency tables are
returned
• ggfluctuation(table(x,y)) renders
matrix graphically (area proportional to cell
count)
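A minimal sketch of one-way and two-way tables on made-up vectors (x and y are hypothetical toy variables):

```r
x <- c("a", "b", "a", "a", "b")
y <- c("u", "u", "v", "v", "v")

t1 <- table(x)      # frequency table of a single variable
t1

t2 <- table(x, y)   # contingency table of two variables
t2
t2["a", "v"]        # count for one cell, indexed by level names
```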
Comparisons of States
• Goal: We want to find summary statistics of
several variables for all years
• Concept: automate data aggregation
Aggregating data
• Many different ways to do this in R, but
we’re going to focus on one:
library(reshape2)
First, melt
• First need to “melt” the data
• This gets it in a form useful for “casting” into new
formats
• When melting, you need to specify the measured
variables and the id variables
• melt(data,
measure.vars=c(1,2,3),
id.vars=5)
• key variables are fixed by design (or categorical
variables), measured variables correspond to
numeric measurements
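A sketch of melting and then casting with reshape2, assuming the package is installed. The data frame toy and its values are made up; only the column names echo the gss variables:

```r
library(reshape2)

# Toy data: two id variables and two measured variables
toy <- data.frame(
  year    = c(1990, 1990, 2000, 2000),
  sex     = c("F", "M", "F", "M"),
  age     = c(30, 40, 50, 60),
  income2 = c(2000, 4000, 6000, 8000)
)

# Melt: one row per (id combination, measured variable)
long <- melt(toy, id.vars = c("year", "sex"),
             measure.vars = c("age", "income2"))
names(long)   # year, sex, variable, value

# Cast: aggregate the melted values, here the mean per year and variable
wide <- dcast(long, year ~ variable, fun.aggregate = mean,
              value.var = "value")
wide
```

The melted form has 4 records times 2 measured variables, so 8 rows; casting with a different formula or aggregation function gives a different summary from the same melted data.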
Your Turn
• Melt the gss data:
• Use year, sex, and happy as identifier variables
• use income2 and age as measured variables
• Find the means and standard deviations for each year by
sex and happiness level.
• Find the “poorest” and “richest” years
Aggregations in R
• Functions melt and cast
• Melting Data:
[Figure: melting a table with columns key, X1, X2, X3, X4, X5]
for aggregation:
dplyr package
• main functionality:
group_by, summarise, mutate,
filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
group_by
• group_by(data, var1, ...) is a function that takes a dataset and
introduces a group for each (combination
of) level(s) of the grouping variable(s)
• Power combination: group_by and summarise.
For a grouped data frame, the summary
statistics will be calculated for every group
library(dplyr)

gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")

# overall summary: mean age, percent men, percent married
summarise(gss,
  age=mean(age, na.rm=TRUE),
  men.pct=sum(sex=="MALE")/length(sex)*100,
  married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100)

       age  men.pct  married
1 45.63128 43.85683 53.67281

# the same summaries, computed separately for each region
summarise(group_by(gss, region),
  age=mean(age, na.rm=TRUE),
  men.pct=sum(sex=="MALE")/length(sex)*100,
  married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100)

            region      age  men.pct  married
1  E. NOR. CENTRAL 45.86872 43.71178 55.42193
2  E. SOU. CENTRAL 46.77994 41.58812 54.44159
3  MIDDLE ATLANTIC 45.73188 42.43971 52.34146
4         MOUNTAIN 44.74539 43.72470 54.06415
5      NEW ENGLAND 46.47334 43.47826 54.04307
6     NOT ASSIGNED 47.62237 42.78768 44.73258
7          PACIFIC 44.56465 46.33001 51.29445
8   SOUTH ATLANTIC 46.19038 43.83937 54.22135
9  W. NOR. CENTRAL 46.27819 45.35393 54.38867
10 W. SOU. CENTRAL 43.93512 43.74745 54.13442
Chaining operator %>%
• x %>% f(y) is equivalent to f(x, y)
• gss %>% group_by(year) is equivalent to
group_by(gss, year)
• Read %>% as ‘then’, i.e. “take data, then group it by year, then summarise it to …”
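The equivalence of the nested and the chained form can be checked on a small made-up data frame (toy is hypothetical; dplyr is assumed to be installed):

```r
library(dplyr)

# Toy data with one missing age
toy <- data.frame(
  year = c(1990, 1990, 2000, 2000, 2000),
  age  = c(30, 40, 50, NA, 70)
)

# Nested call: read from the inside out
nested <- summarise(group_by(toy, year), age = mean(age, na.rm = TRUE))

# Chained call: take toy, then group it by year, then summarise it
chained <- toy %>%
  group_by(year) %>%
  summarise(age = mean(age, na.rm = TRUE))

same <- isTRUE(all.equal(nested, chained))   # both forms give the same result
chained
```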
Chained version of
example
gss %>%
  group_by(region) %>%
  summarise(
    age=mean(age, na.rm=T),
    men.pct=sum(sex=="MALE")/length(sex)*100,
    married=sum(marital=="MARRIED", na.rm=T)/length(marital)*100
  )
Your Turn
• Use dplyr statements to get (a) the percent of married responders by
year (b) the percent of men answering the survey
in each year
• Plot the three variables.
filter
• filter(data, expr1, ...) is a function that takes a dataset and subsets
it according to a set of expressions
• filter() works similarly to subset()
except that you can give it any number of
filtering conditions, which are joined together
with the logical ‘AND’ (&). Other Boolean operators, such as |,
have to be written explicitly
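A sketch of filter() on made-up records (toy and its values are hypothetical; dplyr is assumed to be installed), showing that comma-separated conditions behave like & and that | has to be spelled out:

```r
library(dplyr)

toy <- data.frame(
  year = c(1990, 1990, 2000, 2000),
  sex  = c("FEMALE", "MALE", "FEMALE", "MALE"),
  age  = c(80, 35, 62, 41)
)

# Several conditions separated by commas are joined with AND
a <- filter(toy, year == 1990, sex == "FEMALE")

# The same subset with an explicit &
b <- filter(toy, year == 1990 & sex == "FEMALE")

# OR must be written explicitly with |
c_or <- filter(toy, year == 1990 | age > 60)
```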
Your Turn
• Use dplyr statements to get the number of
responses in each year
• Has party affiliation changed over time?
Summarize the data with dplyr routines first,
then visualize.