Contingency Tables
stat 557
Heike Hofmann

Outline
• Three R packages:
  • Data structures: reshape2, plyr
  • Visualizing contingency tables: productplots

Your Turn: Titanic Data
• Data on survival on board the Titanic is part of base R, in an object named Titanic. Investigate its structure and dimensions.
• Try to interpret the result of plot(Titanic)
• What does as.data.frame() do?
• How many men survived, how many women? ... in each class?

Contingency Tables
• Lots of different, specialized commands for cross-tabulation: table, xtabs, freq
• Unified frameworks: packages reshape2 or plyr
• reshape2 for a large number of variables with the same operation; plyr for a small number of variables with different summary statistics

Reshape Package
• from CRAN
    install.packages("reshape2")
    library(reshape2)
• Command dcast does cross-tabulation, yields a data frame:
    dcast(data, rows ~ columns, value.var = "value", fun.aggregate = function)

Casting
• tc <- as.data.frame(Titanic)
• dcast(tc, . ~ Class, value.var = "Freq", fun.aggregate = sum)
• dcast(tc, Sex ~ Survived, value.var = "Freq", fun.aggregate = sum)
• dcast(tc, Sex + Class ~ Survived, value.var = "Freq", fun.aggregate = sum)

Plyr Package
• from CRAN
    install.packages("plyr")
    library(plyr)
• Command ..ply computes summaries; each . is one of "a" (array), "d" (data frame), "l" (list), e.g.:
    ddply(data, .(var1, var2, ...), function)
  The first . is the data format of the input, the second . is the format of the output.

ddply
• data frame in, data frame out:
    ddply(tc, .(Sex, Class), summarize,
      Total = sum(Freq),
      Survived = sum(Freq[Survived == "Yes"]),
      SurvivedPct = sum(Freq[Survived == "Yes"]) / sum(Freq) * 100
    )
• summarize(data, ...) takes arguments of the form var = expression; each one creates a new column var in the resulting data frame, see help(summarize)

Your Turn: Titanic Data
• Write a function odds.ratio(x) that computes the odds ratio of a 2 by 2 table x as x1*x4/(x2*x3).
• Compute the odds ratio of Survival by Gender. ...
for each Class

Visualizing Contingency Tables

Productplots
• from CRAN
    install.packages("productplots")

One dimension
• Barcharts, Spineplots: area is proportional to marginal probability
    prodplot(tc, Freq~Class, "hbar")
    prodplot(tc, Freq~Class, "hspine")
  [Plots: one panel per call; Class levels 1st, 2nd, 3rd, Crew]

Increasing dimensionality
• Mosaicplots: combination of horizontal and vertical spine plots
• Productplots allow combinations of hbar, vbar, hspine, and vspine
    prodplot(tc, Freq~Survived+Class, divider=c("hbar","hbar"), subset=level == 2)
    prodplot(tc, Freq~Survived+Class, divider=c("vspine","hspine"), subset=level == 2)
  [Plots: one panel per call; Class levels 1st, 2nd, 3rd, Crew]

Special Case: Doubledecker Plot
• horizontal spines of all but the last variable, last variable in vertical spine plot
    prodplot(tc, Freq~Survived+Sex+Class, divider=c("vspine","hspine", "hspine"), subset=level == 3)
  [Plot: Class levels 1st, 2nd, 3rd, Crew]

Your Turn: Titanic Data
• Using the productplots package, draw barcharts for each of the variables.
• Change the previous plots to spineplots, include Survived vertically. Which plot shows the strongest association with Survival?
• Work out in which order variables are drawn. Draw & interpret a plot using all four variables.

Your Turn: GSS Data
• The General Social Survey is conducted every two years across the U.S., http://www.norc.org/GSS+Website/Browse+GSS+Variables/
• Based on the R script, extract 3 variables that might be related to ‘happiness’ & plot.
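
Cross-check in base R
A minimal sketch, assuming only the built-in Titanic object: the three cross-tabulations from the Casting slide can be cross-checked with xtabs, one of the specialized commands listed under Contingency Tables, without installing reshape2 or plyr. The names by_class, sex_by_survived and sex_class_by_survived are illustrative and do not come from the slides.

```r
# Base R only; Titanic ships with the datasets package.
# One row per Class x Sex x Age x Survived cell, with the count in Freq.
tc <- as.data.frame(Titanic)

# Total count per class
by_class <- xtabs(Freq ~ Class, data = tc)

# Sex (rows) by Survived (columns)
sex_by_survived <- xtabs(Freq ~ Sex + Survived, data = tc)

# Sex and Class (rows) by Survived (columns);
# ftable() flattens the three-way table for printing
sex_class_by_survived <- ftable(xtabs(Freq ~ Sex + Class + Survived, data = tc))

print(by_class)
print(sex_by_survived)
print(sex_class_by_survived)
```

Unlike dcast, xtabs returns a table object rather than a data frame; as.data.frame() converts it back to long form with a Freq column.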