Mastering Metrics Ch 1 Hwk Q2
C Wassell
September 30, 2015

library(foreign)
## Warning: package 'foreign' was built under R version 3.2.2
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.1
library(sm)
## Warning: package 'sm' was built under R version 3.2.2
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
library(ggplot2)
bm <- read.dta("G:/ECON 424/Fall 2015/bm.dta")

Question 2a

We need to cross-tabulate computer skills by gender and by race. One way to do this is the xtabs command, coupled with ftable() and prop.table(). ftable() gives a flat table, a (relatively) easy way to read results. prop.table() gives proportions of the total, rather than counts.

table1 <- xtabs(~ female + black + computerskills, data=bm)
ftable(prop.table(table1))
##              computerskills          0          1
## female black
## 0      0                    0.04229979 0.07577002
##        1                    0.03572895 0.07700205
## 1      0                    0.05338809 0.32854209
##        1                    0.04804928 0.33921971

If you look at the different margins, you'll see that there are few differences by race. For example, non-black and black males with computer skills make up nearly identical shares of the sample: 7.58% and 7.70%.

For a different view, look at a 2-way table.

table2 <- xtabs(~ female + black, data=bm)
ftable(prop.table(table2))
##        black         0         1
## female
## 0            0.1180698 0.1127310
## 1            0.3819302 0.3872690

There is an even distribution of gender by race: 11.81% non-black males vs 11.27% black males, for example.

table3 <- xtabs(~ computerskills + black, data=bm)
ftable(prop.table(table3))
##                black          0          1
## computerskills
## 0                     0.09568789 0.08377823
## 1                     0.40431211 0.41622177

There is also an even distribution of computer skills by race; about the same percentage of blacks and non-blacks have, or lack, computer skills.

Question 2b

This is very similar to question 2a, but looking at education and number of previous jobs instead of computer skills.

Question 2c

For this problem we want to look at descriptive statistics by race.
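To make "descriptive statistics by race" concrete, here is a minimal sketch on a made-up data frame. The data frame toy and its values are invented for illustration; only the column names black and yearsexp mirror those in bm. It uses base R's tapply() to compute a statistic within each group, which is one possible approach and not necessarily the one used in the rest of this write-up.

```r
# Toy data: values are invented; column names mirror those in bm.
toy <- data.frame(
  black    = c(0, 0, 0, 1, 1, 1),
  yearsexp = c(2, 6, 10, 3, 6, 12)
)

# Mean and standard deviation of yearsexp within each value of black.
group_mean <- tapply(toy$yearsexp, toy$black, mean)
group_sd   <- tapply(toy$yearsexp, toy$black, sd)

print(group_mean)
print(group_sd)
```

Each result is a named vector with one entry per group, so the two groups can be compared side by side.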
I find the easiest way to handle this is to first partition my data using the subset() function, which is part of base R (no extra library is needed). subset() picks out all observations from a particular dataset that meet a specified criterion. In the following code, I'm creating two new data sets, both subsets of bm, based on whether the individuals are, or are not, black. Then I calculate the mean, standard deviation, etc.

bm_black <- subset(bm, black==1)
bm_nonblack <- subset(bm, black==0)
summary(bm_black$yearsexp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    5.00    6.00    7.83    9.00   44.00
summary(bm_nonblack$yearsexp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   5.000   6.000   7.856   9.000  26.000
sd(bm_black$yearsexp)
## [1] 5.010764
sd(bm_nonblack$yearsexp)
## [1] 5.079228

The mean and standard deviation look awfully similar. These summarize only the first two moments of the sample distributions, however; to see the whole thing, plot two histograms, or even better, two density functions.

DF_yearsexp <- rbind(data.frame(fill="blue", yearsexp=bm_black$yearsexp),
                     data.frame(fill="green", yearsexp=bm_nonblack$yearsexp))
ggplot(data=DF_yearsexp, aes(x=yearsexp, fill=fill)) +
  geom_histogram(binwidth=2, colour="black", position="dodge") +
  scale_fill_identity()
d1 <- density(bm_black$yearsexp)
plot(d1)
polygon(d1, col="red", border="blue")
d2 <- density(bm_nonblack$yearsexp)
plot(d2)
polygon(d2, col="red", border="blue")
sm.density.compare(bm$yearsexp, bm$black)

There are four separate plots here. The code is a bit above "beginner" level, though the density comparison plot is straightforward. That one uses the sm.density.compare function from the "sm" library.

Question 2d

You can answer this one yourself.

Question 2e

Here we want to see whether the "call" variable varies by race.

tapply(bm_black$call, bm_nonblack$call, summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.00000 0.00000 0.00000 0.03636 0.00000 1.00000
##
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
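##  0.0000  0.0000  0.0000  0.3277  1.0000  1.0000
tapply(bm_black$call, bm_nonblack$call, sd)
##         0         1
## 0.1872358 0.4703618
sm.density.compare(bm$call, bm$black)

The tapply() function is just a shortcut: tapply(X, INDEX, FUN) splits X into groups defined by INDEX and applies FUN to each group, without having to write out separate lines. Note that in the two calls above the grouping variable is bm_nonblack$call, so the groups labeled 0 and 1 are values of bm_nonblack$call, not race. To calculate summary statistics and the standard deviation of "call" separately for the black and non-black observations, group by race instead, as in tapply(bm$call, bm$black, summary) and tapply(bm$call, bm$black, sd).

The question of whether there are differences is tough in this case, since "call" is a binary variable. Beyond the mean, which is just the share of ones, the summary statistics are relatively uninformative, and the standard deviation is determined by that share. The density comparison suggests there are probably differences, however.

Question 2f

You can answer this yourself.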
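Returning to Question 2e, the following is a minimal sketch of comparing a binary outcome across two groups. The data frame toy and its values are invented for illustration; only the column names black and call mirror those in bm. It assumes the goal is the share of ones in each group, computed once with tapply() and once from a cross-tabulation with row proportions.

```r
# Toy data: values are invented; column names mirror those in bm.
toy <- data.frame(
  black = c(0, 0, 0, 0, 1, 1, 1, 1),
  call  = c(1, 0, 1, 0, 0, 0, 1, 0)
)

# For a 0/1 variable, the group mean is the share of ones in that group.
call_rate <- tapply(toy$call, toy$black, mean)
print(call_rate)

# The same shares from a cross-tabulation: margin = 1 makes each row
# (each value of black) sum to 1.
call_table <- prop.table(xtabs(~ black + call, data = toy), margin = 1)
print(call_table)
```

In this toy example the share of ones is 0.50 for black == 0 and 0.25 for black == 1. With the real data, the same two calls applied to bm would give the corresponding shares by race.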