mastering_metrics_ch_1_q_2_r_markdown

Mastering Metrics Ch 1 Hwk Q2
C Wassell
September 30, 2015
library(foreign)
## Warning: package 'foreign' was built under R version 3.2.2
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.1
library(sm)
## Warning: package 'sm' was built under R version 3.2.2
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
library(ggplot2)
bm <- read.dta("G:/ECON 424/Fall 2015/bm.dta")
Question 2a
We need to cross-tabulate computer skills by gender and by race. One way to do this is the
xtabs() command, coupled with ftable() and prop.table(). ftable() gives a flat table, which is
(relatively) easy to read. prop.table() reports each cell as a proportion of the total, rather
than as a count.
table1 <- xtabs(~ female + black + computerskills, data=bm)
ftable(prop.table(table1))
##              computerskills          0          1
## female black
## 0      0                    0.04229979 0.07577002
##        1                    0.03572895 0.07700205
## 1      0                    0.05338809 0.32854209
##        1                    0.04804928 0.33921971
If you compare the cells, you'll see that there are few differences by race. For example,
non-black males with computer skills and black males with computer skills make up nearly
identical shares of the sample: 7.58% and 7.70%, respectively.
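The proportions above are shares of the whole sample. To compare computer skills within each race group instead, one option is the margin argument of prop.table(). A minimal sketch on a small made-up data frame (toy and its values are hypothetical, not the bm data):

```r
# Hypothetical toy data; not the bm dataset
toy <- data.frame(
  female         = c(0, 0, 0, 0, 1, 1, 1, 1),
  black          = c(0, 0, 1, 1, 0, 0, 1, 1),
  computerskills = c(0, 1, 1, 1, 0, 1, 1, 1)
)

# Two-way table of race by computer skills (counts)
tab <- xtabs(~ black + computerskills, data = toy)

# margin = 1 divides each row by its row total, so every row sums to 1:
# the share with and without computer skills within each race group
row_shares <- prop.table(tab, margin = 1)
print(row_shares)
```

For a three-way table such as table1, passing margin = c(1, 2) should give shares within each female-by-black cell in the same way.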
For a different view, look at a 2-way table.
table2 <- xtabs(~ female + black, data=bm)
ftable(prop.table(table2))
##        black         0         1
## female
## 0            0.1180698 0.1127310
## 1            0.3819302 0.3872690
There is an even distribution of gender by race: 11.81% non-black males vs 11.27% black
males, for example.
table3 <- xtabs(~ computerskills + black, data=bm)
ftable(prop.table(table3))
##                black          0          1
## computerskills
## 0                    0.09568789 0.08377823
## 1                    0.40431211 0.41622177
There is also an even distribution of computer skills by race; about the same percentage of
blacks and non-blacks have, or lack, computer skills.
Question 2b
This is very similar to question 2a, but looking at education and number of previous jobs
instead of computer skills.
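A sketch of what that might look like, on made-up data. The column names education and ofjobs are assumptions for illustration; check names(bm) for the actual variable names.

```r
# Hypothetical toy data; column names are assumed, not taken from bm
toy <- data.frame(
  black     = c(0, 0, 0, 1, 1, 1),
  education = c(4, 3, 4, 4, 3, 4),
  ofjobs    = c(2, 4, 3, 2, 4, 3)
)

# Share of each education level within each race group (rows sum to 1)
edu_by_race <- prop.table(xtabs(~ black + education, data = toy), margin = 1)
print(edu_by_race)

# Mean number of previous jobs within each race group
jobs_by_race <- tapply(toy$ofjobs, toy$black, mean)
print(jobs_by_race)
```

A cross-tab suits a variable with a handful of levels; for a count like number of jobs, a group mean is often easier to read than a wide table.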
Question 2c
For this problem we want to look at descriptive statistics by race. I find the easiest way to
handle this is to first partition my data using the subset() function, which is part of base R
(no extra library is needed). subset() picks out all observations from a particular dataset
that meet a specified criterion. In the following code, I'm creating two new data sets, both
subsets of bm, based on whether the individuals are, or are not, black. Then I calculate mean,
std dev, etc.
bm_black <- subset(bm, black==1)
bm_nonblack <- subset(bm, black==0)
summary(bm_black$yearsexp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    5.00    6.00    7.83    9.00   44.00
summary(bm_nonblack$yearsexp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   5.000   6.000   7.856   9.000  26.000
sd(bm_black$yearsexp)
## [1] 5.010764
sd(bm_nonblack$yearsexp)
## [1] 5.079228
The mean and standard deviation look awfully similar. These are just the first two
moments from the sample distributions, however; to see the whole thing, plot two
histograms, or even better, two density functions.
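# Stack the two groups into one data frame; "fill" holds each group's bar colour
DF_yearsexp <- rbind(data.frame(fill="blue", yearsexp=bm_black$yearsexp),
                     data.frame(fill="green", yearsexp=bm_nonblack$yearsexp))
# Side-by-side histograms; scale_fill_identity() uses the colour names in "fill" as-is
ggplot(data=DF_yearsexp, aes(x=yearsexp, fill=fill)) +
  geom_histogram(binwidth=2, colour="black", position="dodge") +
  scale_fill_identity()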
d1 <- density(bm_black$yearsexp)
plot(d1)
polygon(d1, col="red", border="blue")
d2 <- density(bm_nonblack$yearsexp)
plot(d2)
polygon(d2, col="red", border="blue")
sm.density.compare(bm$yearsexp, bm$black)
There are four separate plots here. The code is a bit above "beginner" level, though the
density comparison plot is straightforward. That one uses the sm.density.compare function
from the "sm" library.
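If the sm package is not available, the two density curves can also be overlaid with base R graphics. This is only a sketch: the simulated vectors x_black and x_nonblack are hypothetical stand-ins for bm_black$yearsexp and bm_nonblack$yearsexp.

```r
set.seed(1)
# Hypothetical stand-ins for bm_black$yearsexp and bm_nonblack$yearsexp
x_black    <- rpois(200, lambda = 8)
x_nonblack <- rpois(200, lambda = 8)

d_black    <- density(x_black)
d_nonblack <- density(x_nonblack)

# Common axis limits so neither curve is clipped
xlim <- range(d_black$x, d_nonblack$x)
ylim <- range(0, d_black$y, d_nonblack$y)

# Draw the first curve, then add the second to the same plot
plot(d_black, xlim = xlim, ylim = ylim, col = "blue",
     main = "Years of experience by race (toy data)", xlab = "yearsexp")
lines(d_nonblack, col = "darkgreen", lty = 2)
legend("topright", legend = c("black", "non-black"),
       col = c("blue", "darkgreen"), lty = c(1, 2))
```

Putting both curves on one set of axes makes differences in shape easier to spot than two separate plot() calls.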
Question 2d
You can answer this one yourself.
Question 2e
Here we want to see whether the "call" variable varies by race.
tapply(bm_black$call, bm_nonblack$call, summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.00000 0.00000 0.00000 0.03636 0.00000 1.00000
##
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.3277  1.0000  1.0000
tapply(bm_black$call, bm_nonblack$call, sd)
##         0         1
## 0.1872358 0.4703618
sm.density.compare(bm$call, bm$black)
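The tapply() function is just a shortcut. tapply(X, INDEX, FUN) splits X into the groups
defined by INDEX and applies FUN to each group, so you don't have to write out a separate line
per group. Note that the calls above use bm_nonblack$call as the grouping variable, so the 0
and 1 labels in the output are values of call in the non-black subset, not race groups. What I
want here is summary statistics and the std deviation of call for black and non-black
applicants; for that, group by race instead, as in tapply(bm$call, bm$black, summary).
The question of whether there are differences needs some care in this case, since call is a
binary variable. Its mean is simply the share of ones (the callback rate), and its standard
deviation is determined by that share, so the standard deviation adds little information. The
density comparison suggests there are probably differences, however.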
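To make the grouping concrete, here is a minimal sketch of tapply() on a made-up data frame (toy and its values are hypothetical, not the bm data). It groups a 0/1 outcome by race and checks that, for a binary variable, the sample standard deviation follows from the mean and the group size.

```r
# Hypothetical toy data; not the bm dataset
toy <- data.frame(
  black = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  call  = c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0)
)

# tapply(X, INDEX, FUN): split X by the groups in INDEX, apply FUN to each piece
rate <- tapply(toy$call, toy$black, mean)    # callback rate within each group
s    <- tapply(toy$call, toy$black, sd)      # sample standard deviation
n    <- tapply(toy$call, toy$black, length)  # group sizes

print(rate)

# For a 0/1 variable the sample sd depends only on the mean and n
sd_from_rate <- sqrt(rate * (1 - rate) * n / (n - 1))
print(cbind(sd = s, sd_from_rate = sd_from_rate))
```

With the real data, the analogous comparison would use bm$call grouped by bm$black.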
Question 2f
You can answer this yourself.