Factor Variables, Reshaping Data Stat 579  Heike Hofmann

Outline
• Missing Values
• Investigating Data: Outlier Identification
• Factor Variables
• (Reshaping Data)

Economical Status
• The file economical-status.csv is an extract of the General Social Survey, with variables
• region, income, happy, age, finrela, marital, degree, health, wrkstat, partyid, polviews, sex, year
• Go to the GSS website and find out more about these variables

Advanced
• Load the economical-status.csv data into R
• Find subsets for two years of your choice
• Plot income. Why is the order of the labels so strange? Is there a difference in income between the two years?
• Plot age. Is there an age difference between the two years?

Identification of small subsets
Useful commands
• which
• quantile, max, min
• which.max, which.min

Practice
• Extract all records for 80 year old females (or males) in the midwest (or another region of your choosing). What is their general health?
• What is the % of individuals with an income of more than $25,000 in 1990?
• What is the record with the highest age? What does that mean? Check the codebook online and fix the data correspondingly.

Useful Commands
• nrow(dataset) # number of records
• quantile(variable, probs=0.001, na.rm=TRUE) # retrieves the 0.1th percentile of variable
• which(logical variable) # retrieves all indices for which the variable is TRUE
• which.max(variable), which.min(variable) # retrieve the index of the highest (lowest) value in variable

More about missings
• NA + x = NA, NA * x = NA
• x == NA is itself NA, so it cannot be used to test for missing values
• is.na returns a logical vector that is TRUE for each missing element of a vector
• na.omit removes all missing values from a vector; complete.cases returns a logical vector marking the rows of a data.frame that have no missing values
• Many functions have a parameter na.rm

Practice
• Check the website for a description of the levels for income.
• Make a new variable income2 in the gss data by copying all values of income. Replace all values in income2 that don’t correspond to intervals by the value NA.
• Advanced: what does the command levels() do?
  Use it to replace all intervals in the income2 variable by their midpoint, i.e. use 2000 instead of “$1000 TO $2999”; use the end point if the interval is one-sided.

Factors
• A special type of numeric (integer) data
• Numbers + labels
• Used for categorical variables
• On import, make sure numeric categorical variables are converted to factors
• factor creates a new factor with specified labels

Practice
• Make a summary of year
• Year should be treated as a factor variable, but isn’t. Turn year into a factor variable explicitly:
    gss$year <- factor(gss$year)
• Compare the summary of year to the previous result. Which year had the most participants?
• Are there other variables that should be factors (or vice versa)?

Checking for, and casting between types
• str, mode provide info on type
• is.XXX (with XXX either factor, integer, numeric, logical, character, ...) checks for a specific type
• as.XXX casts to a specific type

Casting between types
• as.factor casts to factor, as.character casts to character, as.numeric casts to numeric

Factors
• Factor variables often have to be re-ordered for ease of comparisons
• We can specify the order of the levels by explicitly listing them, see help(factor)
• We can make the order of the levels in one variable dependent on the summary statistic of another variable, see help(reorder)

Practice
• Plot income by gender (facet or use color). Turn the order of gender around.
• Reorder the levels of degree by approximate years of education.
• Reorder income by average income2. Hint: use the command reorder.
• Reorder wrkstat by frequency.
• Make income2 a factor and then a numeric variable again.

Tables
• table(x) returns the frequency table of variable x
• For more than one variable, the frequencies of the corresponding contingency table are returned
• ggfluctuation(table(x,y)) renders the matrix graphically (area proportional to cell count)

Comparisons of States
• Goal: We want to find summary statistics of several variables for all years
• Concept: automate data aggregation

Aggregating data
• Many different ways to do this in R, but we’re going to focus on one: library(reshape2)

First, melt
• First need to “melt” the data
• When melting, you need to specify the measured variables and the id variables
• This gets it in a form useful for “casting” into new formats
• melt(data, measure.vars=c(1,2,3), id.vars=5)
• Key variables are fixed by design (or categorical variables), measured variables correspond to numeric measurements

Your Turn
• Melt the gss data:
  • Use year, sex, and happy as identifier variables
  • Use income2 and age as measured variables
• Find the means and standard deviations for each year by sex and happiness level.
• Find the “poorest” and “richest” years

Aggregations in R
• Functions melt and cast
• Melting Data: [Diagram: a table with columns key, X1, X2, X3, X4, X5 is melted into long form]

For aggregation: dplyr package
• Main functionality: group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

group_by
• group_by(data, var1, ...) is a function that takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)
• Power combination: group_by and summarise. For a grouped data frame, the summary statistics will be calculated for every group.

    library(dplyr)

    gss <- read.csv("http://www.hofroe.net/stat579/economical-status.csv")

    summarise(gss,
      age=mean(age, na.rm=TRUE),
      men.pct=sum(sex=="MALE")/length(sex)*100,
      married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100)

           age  men.pct  married
    1 45.63128 43.85683 53.67281

    summarise(group_by(gss, region),
      age=mean(age, na.rm=TRUE),
      men.pct=sum(sex=="MALE")/length(sex)*100,
      married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100)

                region      age  men.pct  married
    1  E. NOR. CENTRAL 45.86872 43.71178 55.42193
    2  E. SOU. CENTRAL 46.77994 41.58812 54.44159
    3  MIDDLE ATLANTIC 45.73188 42.43971 52.34146
    4         MOUNTAIN 44.74539 43.72470 54.06415
    5      NEW ENGLAND 46.47334 43.47826 54.04307
    6     NOT ASSIGNED 47.62237 42.78768 44.73258
    7          PACIFIC 44.56465 46.33001 51.29445
    8   SOUTH ATLANTIC 46.19038 43.83937 54.22135
    9  W. NOR. CENTRAL 46.27819 45.35393 54.38867
    10 W. SOU. CENTRAL 43.93512 43.74745 54.13442

Chaining operator %>%
• x %>% f(y) is equivalent to f(x, y)
• gss %>% group_by(year) is equivalent to group_by(gss, year)
• Read %>% as ‘then’, i.e. “take data, then group it by year, then summarise it to …”

Chained version of example

    gss %>%
      group_by(region) %>%
      summarise(
        age=mean(age, na.rm=TRUE),
        men.pct=sum(sex=="MALE")/length(sex)*100,
        married=sum(marital=="MARRIED", na.rm=TRUE)/length(marital)*100
      )

Your Turn
• Use dplyr statements to get (a) the percent of married responders by year (b) the percent of men answering the survey in each year
• Plot the three variables.

filter
• filter(data, expr1, ...) is a function that takes a dataset and subsets it according to a set of expressions
• filter() works similarly to subset() except that you can give it any number of filtering conditions, which are joined together with the logical ‘AND’ &. You can use other boolean operators explicitly.

Your Turn
• Use dplyr statements to get the number of responses in each year
• Has party affiliation changed over time? Summarize the data with dplyr routines first, then visualize.
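
As a rough illustration of how group_by, summarise and filter fit together, here is a minimal sketch on a small made-up data frame. The object toy and all of its values are invented for this example and are not taken from the GSS extract; only the column names mirror the gss data.

```r
library(dplyr)

# Hypothetical toy data standing in for gss (all values invented)
toy <- data.frame(
  year    = c(1990, 1990, 1990, 2000, 2000, 2000),
  sex     = c("MALE", "FEMALE", "FEMALE", "MALE", "MALE", "FEMALE"),
  marital = c("MARRIED", "MARRIED", NA, "DIVORCED", "MARRIED", "MARRIED"),
  age     = c(34, 51, 80, 45, NA, 29)
)

# Number of responses and percent of men in each year
by_year <- toy %>%
  group_by(year) %>%
  summarise(
    n       = n(),
    men.pct = sum(sex == "MALE") / n() * 100
  )

# filter() joins several conditions with a logical AND;
# rows where a condition evaluates to NA are dropped
older_women <- toy %>% filter(sex == "FEMALE", age >= 50)
```

Here by_year has one row per year, and older_women keeps only the records that satisfy both conditions. The same pattern should carry over to the real gss data once it is loaded.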

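Going back to the slides on missing values, casting between types and reordering factors, the following base R sketch shows the behaviour described there. The vectors x, year, degree and yrs are invented for illustration; in particular, the years of education assigned to each degree are an assumption, not values from the codebook.

```r
# Hypothetical numeric vector with one missing value
x <- c(12, NA, 7, 99)

# Comparing with == never detects a missing value: every result is NA
cmp <- x == NA
# is.na() is the right test
miss <- is.na(x)
# Many summary functions skip missing values when na.rm = TRUE
avg <- mean(x, na.rm = TRUE)

# A numeric variable stored as a factor
year <- factor(c(1990, 2000, 1990, 1974))
codes  <- as.numeric(year)                # integer level codes, not the years
values <- as.numeric(as.character(year))  # cast via character to recover the years

# reorder() sorts the levels of one variable by a summary of another
# (the default summary is the mean)
degree <- factor(c("HS", "BA", "HS", "BA", "PhD"))
yrs    <- c(12, 16, 12, 16, 21)           # assumed years of education
degree2 <- reorder(degree, yrs)
levels(degree2)
```

The detour through as.character matters because as.numeric applied directly to a factor returns the internal level codes rather than the labels.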