Contingency Tables stat 557 Heike Hofmann

Outline
• Three R packages:
• Data structures: reshape2, plyr
• Visualizing contingency tables: productplots
Your Turn:
Titanic Data
• Data on survival on board the Titanic is
part of base R, in object named Titanic
Investigate structure and dimensions.
• Try to interpret result of plot(Titanic)
• What does as.data.frame() do?
• How many men survived, how many women?
... in each class?
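One way to start exploring the object is sketched below, using only base R. The name tc is just a convenient choice for the data frame version.

```r
# Titanic ships with base R (package datasets) as a 4-way table
dim(Titanic)        # number of levels per variable
dimnames(Titanic)   # variable names and their levels

# as.data.frame() turns the table into "long" format:
# one row per combination of levels, with the count in column Freq
tc <- as.data.frame(Titanic)
str(tc)
head(tc)
```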
Contingency Tables
• Lots of different, specialized commands for
cross-tabulation: table, xtabs, freq
• Unified frameworks: packages reshape2
or plyr
• reshape2: for a large number of variables,
all summarized with the same operation
• plyr: for a small number of variables, with
different summary statistics
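As a point of comparison for the packages that follow, here is a minimal base R sketch of cross-tabulation with xtabs and ftable. It assumes the Titanic data has been converted to a data frame named tc.

```r
tc <- as.data.frame(Titanic)

# xtabs() sums the Freq column over all variables not in the formula
sex_surv <- xtabs(Freq ~ Sex + Survived, data = tc)
sex_surv

# ftable() flattens a higher-dimensional table for printing
ftable(xtabs(Freq ~ Class + Sex + Survived, data = tc))
```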
reshape2 Package
• from CRAN
install.packages("reshape2")
library(reshape2)
• Command dcast does cross-tabulation, yields
dataframe:
dcast(data, rows~columns, value.var,
fun.aggregate)
Casting
• tc <- as.data.frame(Titanic)
• dcast(tc, .~Class, value.var="Freq",
fun.aggregate=sum)
• dcast(tc,
Sex~Survived, value.var="Freq",
fun.aggregate=sum)
• dcast(tc,
Sex+Class~Survived, value.var="Freq",
fun.aggregate=sum)
Plyr Package
• from CRAN
install.packages("plyr")
library(plyr)
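• Commands of the form XYply compute summaries
X and Y are each one of "a" (array), "d" (data frame), "l" (list), e.g.:
ddply(data, .(var1, var2, ...), function)
first letter is the data format of the input,
second letter is the data format of the output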
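The naming scheme can be illustrated with three calls that apply the same summary but return different formats. This is a sketch that assumes plyr is installed; the helper total is a hypothetical name introduced only for this example.

```r
library(plyr)   # assumes plyr is installed from CRAN

tc <- as.data.frame(Titanic)

# hypothetical helper: total count within one piece of the data
total <- function(piece) sum(piece$Freq)

# d-a: data frame in, array out (one dimension per grouping variable)
arr <- daply(tc, .(Sex, Survived), total)

# d-l: data frame in, list out (one element per group)
lst <- dlply(tc, .(Class), total)

# d-d: data frame in, data frame out
# (the summary column gets an automatic name)
df <- ddply(tc, .(Class), total)
```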
ddply
• dataframe in, dataframe out:
ddply(tc, .(Sex, Class), summarize,
Total = sum(Freq),
Survived = sum(Freq[Survived=="Yes"]),
SurvivedPct = sum(Freq[Survived=="Yes"])/sum(Freq) * 100
)
• summarize(data, ...) takes arguments of the form
var=function, creates new column var in dataframe,
see help(summarize)
Your Turn:
Titanic Data
• Write a function odds.ratio(x) that
computes the odds ratio of a 2 by 2 table x as
x1*x4/(x2*x3).
• Compute odds ratio of Survival by Gender.
... for each Class
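One possible sketch of this exercise in base R follows. It assumes the cells x1 to x4 are read column by column, as R stores a table, and the names or_all and or_class are chosen here for illustration.

```r
# odds ratio of a 2 by 2 table; cells are read column by column,
# so x[1], x[4] are the diagonal and x[2], x[3] the off-diagonal
odds.ratio <- function(x) {
  stopifnot(length(x) == 4)
  x <- as.numeric(x)
  x[1] * x[4] / (x[2] * x[3])
}

tc <- as.data.frame(Titanic)

# overall: Sex by Survived
or_all <- odds.ratio(xtabs(Freq ~ Sex + Survived, data = tc))

# within each class: split the data and apply the same steps
or_class <- sapply(split(tc, tc$Class), function(d) {
  odds.ratio(xtabs(Freq ~ Sex + Survived, data = d))
})
```

With rows ordered Male, Female and columns ordered No, Yes, a value above 1 means women had higher odds of survival than men.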
Visualizing Contingency
Tables
Productplots
• from CRAN
install.packages("productplots")
library(productplots)
One dimension
• Barcharts, Spineplots:
area is proportional to marginal probability
prodplot(tc, Freq~Class, "hbar")
prodplot(tc, Freq~Class, "hspine")
[Figures: barchart (hbar) and spineplot (hspine) of Class; levels 1st, 2nd, 3rd, Crew]
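The marginal probabilities that these plots encode as areas can be computed directly. A minimal base R sketch, with class_n and class_p as names chosen for this example:

```r
tc <- as.data.frame(Titanic)

# total count per class, then each class's share of all people on board
class_n <- xtabs(Freq ~ Class, data = tc)
class_p <- prop.table(class_n)
round(class_p, 3)
```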
Increasing dimensionality
• Mosaicplots:
combination of horizontal and vertical spine plots
• Productplots allow combinations of
hbar, vbar, hspine, and vspine
prodplot(tc, Freq~Survived+Class,
divider=c("hbar","hbar"),
subset=level == 2)
[Figure: barcharts of Survived within Class; panels 1st, 2nd, 3rd, Crew]
prodplot(tc, Freq~Survived+Class,
divider=c("vspine","hspine"),
subset=level == 2)
[Figure: mosaicplot of Survived by Class; columns 1st, 2nd, 3rd, Crew]
Special Case:
Doubledecker Plot
• horizontal spines of all but the last variable,
last variable in vertical spine plot
prodplot(tc, Freq~Survived+Sex+Class, divider=c("vspine","hspine", "hspine"), subset=level == 3)
[Figure: doubledecker plot of Survived by Sex and Class; columns 1st, 2nd, 3rd, Crew within each Sex]
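The quantities a doubledecker plot displays can also be computed numerically. The sketch below uses base R only; width and surv are names introduced here, standing for the horizontal share of each Sex-by-Class group and the survival proportion within it.

```r
tc <- as.data.frame(Titanic)
tab <- xtabs(Freq ~ Sex + Class + Survived, data = tc)

# horizontal widths: share of each Sex-by-Class group among all people
width <- prop.table(margin.table(tab, c(1, 2)))

# vertical split: proportion surviving within each Sex-by-Class group
surv <- prop.table(tab, margin = c(1, 2))[, , "Yes"]
round(surv, 2)
```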
Your Turn:
Titanic Data
• Using the productplots package, draw
barcharts for each of the variables.
• Change the previous plots to spineplots,
include Survived vertically. Which plot shows
the strongest association with Survival?
• Work out in which order variables are drawn.
• Draw & interpret a plot using all four
variables.
Your Turn:
GSS Data
• The General Social Survey is conducted
every two years across the U.S.,
http://www.norc.org/GSS+Website/Browse+GSS+Variables/
• Based on the R script, extract 3 variables
that might be related to ‘happiness’ & plot.