Practice Midterm Exam

Statistics 103
May 5, 2005
Final Exam
Instructions: Write your answers on the exam in the spaces after the questions. For maximum
credit, show all work.
You are permitted to use two sheets of notes, front and back, and a calculator. Any other form of
aid is not permitted. If you need clarification on any part of the exam, contact Prof. Reiter.
Provide the information requested below in the adjacent empty spaces.
NAME (print):
LAB TIME:
Demographic Questions (not used for grading in any way):
1) Did you take AP Statistics in high school? No ___ Yes ____
2) What was your score on the AP Statistics exam? _____ or circle “Did not take it.”
Page      Points Possible    Score
4         20
5         12
6         12
7         12
8         10
9         12
10        10
11        12
Total     100
QUESTIONS 1 – 4 REFER TO THE DATASET DESCRIBED BELOW
In the 1970s, Harris Trust and Savings Bank was sued for sex discrimination. The lawsuit alleged that
the Bank systematically paid female employees lower salaries than male employees. The key
evidence in the case was data on salaries of employees. Both the prosecution and the defense
presented data analyses in attempts to support their cases.
In the problems below, we analyze a subset of the data for one type of employee: the skilled, entry-level clerical workers. You can assume that these data are a random sample of the population of
skilled, entry-level clerical workers who work at this Bank.
DESCRIPTION OF THE DATA
========================
There were 61 female and 32 male employees in the data set. The following are variables we
consider on this exam.
bsal:   Annual salary at time of hire.
sal77:  Annual salary in 1977 (the latest year in the study).
educ:   Years of education.
exper:  Number of months working at other companies prior to being hired at the Bank.
senior: Number of months worked at the Bank since being hired.
age:    Age in months.
There are no problems on this page. Starting below, the next two pages display output from
exploratory data analyses that you should use to answer exam questions. The questions begin
on page 4.
Correlations among selected variables, based on all 93 employees
          Sal77    Exper    Senior   Educ
Sal77     1.00     -0.37    0.13     0.42
Exper              1.00     -0.07    -0.10
Senior                      1.00     0.06
Educ                                 1.00
[Figure: Histogram of sal77 for females.]
[Figure: Histogram of sal77 for males.]
Scatterplot Matrix: Each graph is based on all 93 employees.
[Figure: scatterplot matrix of bsal, sal77, and age.]
EXAM PROBLEMS BEGIN HERE
1. (2 points per part) For parts 1a-1d, circle the answer that is closest to the truth.
a) Estimate the 75th percentile of sal77 for the male employees minus the 75th percentile of sal77
for the female employees.
2200
b) Estimate the standard deviation of sal77 for the females:
1200
c) Estimate the percentage of female employees whose sal77 exceeds $10,000.
30%
d) The standard error for the average of sal77 for males is ______________ the standard error for
the average of sal77 for females.
larger than (due to smaller sample size)
2. (2 points per part). For 2a – 2f, circle the appropriate answer.
a) Which one of the following three scatter plots displays the relationship between sal77 and
experience? Circle the letter of the correct plot.
The plot with the negative slope and not especially tight pattern.
b) Which variable has the weakest linear association with sal77?
senior
c) Which of the following lines is the fitted regression line for predicting sal77 (Y) from bsal (X)?
Circle the correct line.
Y = 4620 + 1.065 X.
You can verify this by plugging in values of bsal into each line, and
you’ll see that this is the only line that gives any reasonable predictions of sal77.
d) Which pair of variables has correlation closest to zero?
bsal, age
e) Which variable has the largest standard deviation?
sal77 (it has the biggest numbers)
f) True or false. In the regression of sal77 (Y) on bsal (X), the plot of residuals versus bsal
shows no evidence of violations of the regression assumptions.
True. The scatter plot of sal77 (Y) on bsal (X) shows no indications of non-linear relationships, so
the regression line would be a good fit to the data.
3. The differences between salaries for men and women.
a) (5 points) The sample average and sample variance of bsal for males equal 5937 and 477066,
respectively. The sample average and sample variance of bsal for females equal 5139 and 291460,
respectively. Give an interval for the difference in the population average bsal for male skilled,
entry-level clerical workers employed at the Bank and the population average bsal for female
skilled, entry-level clerical workers employed at the Bank. Use a 99% confidence level. Use 40
degrees of freedom to approximate the Welch-Satterthwaite degrees of freedom (it’s equal to 51).
$(5937 - 5139) \pm 2.704 \sqrt{477066/32 + 291460/61}$
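The interval can be evaluated numerically. The following is a minimal sketch using only the summary statistics quoted in the problem and the multiplier 2.704 from the solution; the variable names are illustrative and not part of the exam.

```python
import math

# Summary statistics quoted in the problem statement.
mean_m, var_m, n_m = 5937, 477066, 32   # males
mean_f, var_f, n_f = 5139, 291460, 61   # females

# t multiplier for 99% confidence with 40 degrees of freedom (from the solution).
t_mult = 2.704

diff = mean_m - mean_f
se = math.sqrt(var_m / n_m + var_f / n_f)
lower, upper = diff - t_mult * se, diff + t_mult * se

# Welch-Satterthwaite degrees of freedom, to compare with the value the problem quotes.
a, b = var_m / n_m, var_f / n_f
welch_df = (a + b) ** 2 / (a ** 2 / (n_m - 1) + b ** 2 / (n_f - 1))

print(f"difference = {diff}, standard error = {se:.1f}")
print(f"99% interval: ({lower:.1f}, {upper:.1f})")
print(f"Welch-Satterthwaite df: {welch_df:.1f}")
```

Because the whole interval lies above zero, it supports the answer to part b.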
b) (1 point) Based on your interval in part a, circle the choice that best completes the statement:
The confidence interval suggests that the population average bsal for male skilled, entry-level
clerical workers employed at the Bank ______________________ the population average bsal for
female skilled, entry-level clerical workers employed at the Bank.
is larger than
c) (6 points) Test the null hypothesis that the population percentages of men and women hired in
the bank are equal. Write your null and alternative hypotheses, the value of the test statistic, the p-value, and your conclusion. Use a two-sided alternative. Consider p-values in the .05 range as
small. Be sure to address the question of interest in your conclusion (write more than reject/not
reject the null hypothesis).
Let p = the population percentage of females hired by the bank. Then,
Ho: p=0.5. Ha: p not = 0.5
$\Pr(\hat{P} \ge 61/93) = \Pr\left(Z \ge \frac{.656 - .50}{\sqrt{.5(1 - .5)/93}}\right) = \Pr(Z \ge 3.0) \approx .0013$
We double this to get the p-value, since we have a two-sided hypothesis, and we get a p-value of
.003.
Assuming males and females are hired at equal rates, there is only about a 3 in 1000 chance that we’d get
a sample percentage of females at least as far from 50% as 65.6%. This is a small chance. Therefore, we reject the null
hypothesis. There is evidence that the Bank hires males and females in
differing percentages (at least for this type of worker).
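As a check on the arithmetic, here is a minimal sketch of the one-sample z-test for a proportion using only the Python standard library. The variable names are illustrative.

```python
import math
from statistics import NormalDist

n, females = 93, 61
p0 = 0.5                              # null value: equal hiring rates

p_hat = females / n
se0 = math.sqrt(p0 * (1 - p0) / n)    # standard error computed under the null
z = (p_hat - p0) / se0

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```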
4. Predicting salaries
In the regression of sal77 (Y) on educ (X), the following output is obtained.
Bivariate Fit of sal77 By educ
[Figure: scatterplot of sal77 (vertical axis, 7000 to 17000) against educ (horizontal axis, 7 to 17).]

Summary of Fit
RSquare                       0.177259
RSquare Adj                   0.168218
Root Mean Square Error        1632.19
Mean of Response              10392.9
Observations (or Sum Wgts)    93

Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    6264.513     947.6058     6.61       <.0001
Educ         330.12924    74.55742     4.43       <.0001
a) (5 points) Give a 90% confidence interval for the population regression slope.
$330.13 \pm 1.662(74.557)$
Any multiplier between 1.664 (df=80) and 1.660 (df=100) is acceptable. The degrees of freedom
equals 91.
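The endpoints of this interval can be computed directly from the Parameter Estimates output. This is a small sketch; the multiplier 1.662 is taken from the solution rather than looked up from a t distribution, and the variable names are illustrative.

```python
# Values from the Parameter Estimates table for the Educ term.
estimate = 330.12924
std_error = 74.55742

# Multiplier for 90% confidence (between the df=80 and df=100 table values).
t_mult = 1.662

margin = t_mult * std_error
lower, upper = estimate - margin, estimate + margin
t_ratio = estimate / std_error        # should match the reported t Ratio

print(f"t ratio: {t_ratio:.2f}")
print(f"90% CI for the slope: ({lower:.1f}, {upper:.1f})")
```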
b) (2 points) What happens to the estimated slope between sal77 and educ when a person with
sal77=17000 and educ=7 is added to the data? Circle the appropriate answer.
decreases slightly. The outlier pulls the line towards it.
c) (5 points) In the sample, on average the men have 13.5 years of education and the women have
12 years of education. The men’s sample average sal77 is about $2000 higher than the women’s
sample average sal77. Could this $2000 difference be explained away by the difference in
education levels? Give a numerical argument why or why not.
Using the regression line, we expect an additional 1.5 years of education to be worth
($330.13)(1.5), which equals about $495.20. This is much less than the $2000 gap. Hence, we cannot
explain the difference in average salaries entirely from differences in average education.
5. Probability Problems 1
Consider two random variables, X and Y. The sample space for X is {0, 1}, and the sample space
for Y is {0, 1}. Pr(X=1, Y=1) = .30 and Pr(X=1) = .80. Each part is worth 3 points.
a) Write the joint distribution in the table below so that Cov(X, Y) = 0.
          y=0      y=1      Total
x=0       .125     .075     .20
x=1       .50      .30      .80
Total     .625     .375
For two binary variables, Cov(X, Y) = 0 is equivalent to independence. To have independence, we want Pr(Y=1|X=1) = Pr(Y=1|X=0). Since Pr(Y=1|X=1) = .30/.80 = 3/8 and
Pr(Y=1|X=0) = Pr(X=0, Y=1)/Pr(X=0), we find that Pr(X=0, Y=1) = .2(3/8) = .075.
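The construction can be verified with exact fractions. The sketch below builds the joint distribution from the two given probabilities under independence and confirms that the covariance is zero; the dictionary layout and names are illustrative.

```python
from fractions import Fraction

p_x1 = Fraction(80, 100)       # given: Pr(X=1)
p_x1_y1 = Fraction(30, 100)    # given: Pr(X=1, Y=1)

# Under independence, Pr(Y=1) equals Pr(Y=1 | X=1).
p_y1 = p_x1_y1 / p_x1

# Joint probabilities keyed by (x, y), as products of the marginals.
joint = {
    (0, 0): (1 - p_x1) * (1 - p_y1),
    (0, 1): (1 - p_x1) * p_y1,
    (1, 0): p_x1 * (1 - p_y1),
    (1, 1): p_x1 * p_y1,
}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y

for cell, p in sorted(joint.items()):
    print(cell, float(p))
print("Cov(X, Y) =", cov)
```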
b) Suppose you took a simple random sample of 100,000 from a population that follows the joint
distribution in part a, and you did a chi-squared test of independence of X and Y. Select the true
statement from the choices below.
There is about a 5% chance that the chi-squared test statistic from the data will exceed 3.84.
3.84 is the value of the chi-squared test statistic associated with a p-value of .05 for one degree of
freedom. Since the null hypothesis is true in this problem, we’d expect to get a value of the chi-squared test statistic as or more extreme than 3.84 about 5% of the time.
c) Write the joint distribution in the table below so that E(X | Y = 1) = 0.60.
          y=0      y=1      Total
x=0       0        .20      .20
x=1       .50      .30      .80
Total     .50      .50
To have E(X|Y=1) = 0.6, we need (1)Pr(X=1|Y=1) = 0.6. Since, Pr(X=1|Y=1) = Pr(X=1,
Y=1)/Pr(Y=1) = .30/Pr(Y=1), we know that Pr(Y=1) = .30/.60 = 0.50. Hence, Pr(X=0, Y=1) =
0.20. Finally, since Pr(X=0) = 0.20, we have that Pr(X=0, Y=0) = 0.
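The same steps can be followed with exact fractions. This sketch solves for Pr(Y=1) from the target conditional mean, fills in the remaining cells, and checks E(X | Y = 1); the names are illustrative.

```python
from fractions import Fraction

p_x1 = Fraction(8, 10)        # given: Pr(X=1)
p_x1_y1 = Fraction(3, 10)     # given: Pr(X=1, Y=1)
target = Fraction(6, 10)      # required: E(X | Y=1) = Pr(X=1 | Y=1)

# Pr(X=1 | Y=1) = Pr(X=1, Y=1) / Pr(Y=1), so solve for Pr(Y=1).
p_y1 = p_x1_y1 / target

# Fill in the cells, keyed by (x, y).
joint = {
    (1, 1): p_x1_y1,
    (0, 1): p_y1 - p_x1_y1,
    (1, 0): p_x1 - p_x1_y1,
}
joint[(0, 0)] = 1 - sum(joint.values())

cond_mean = joint[(1, 1)] / (joint[(0, 1)] + joint[(1, 1)])

for cell, p in sorted(joint.items()):
    print(cell, float(p))
print("E(X | Y=1) =", float(cond_mean))
```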
d) Suppose you took a simple random sample of 100,000 from a population that follows the joint
distribution in part c, and you did a chi-squared test of independence of X and Y. Select the true
statement from the choices below.
There is much more than a 5% chance that the chi-squared test statistic from the data will exceed
3.84. This is because the variables clearly are not independent, so the chi-squared test would most
certainly reject the null hypothesis (i.e. the test statistic would be very large) with 100000 people.
6. Probability Problems 2
The probability distribution for the amount of time (in hours) it takes students to complete a three
hour final exam is described by the following probability density function:
$f(x) = (x^2 + 1)/12$ for $0 < x < 3$.
a) (2 points) Given that someone has already worked on the exam for one hour, what is the chance
that it will take him or her more than 2 hours total to complete the exam?
$\Pr(X > 2 \mid X > 1) = \frac{\Pr(X > 2, X > 1)}{\Pr(X > 1)} = \frac{\Pr(X > 2)}{\Pr(X > 1)} = \frac{\int_2^3 (x^2 + 1)/12 \, dx}{\int_1^3 (x^2 + 1)/12 \, dx} = \frac{9 + 3 - 8/3 - 2}{9 + 3 - 1/3 - 1} = .6875$
b) (2 points) What is the average amount of time it takes to complete the exam?
$E(X) = \int_0^3 x (x^2 + 1)/12 \, dx = (1/12)(81/4 + 9/2) = 99/48$
c) (2 points) What is the variance of the amount of time it takes to complete the exam?
$\mathrm{Var}(X) = \int_0^3 x^2 (x^2 + 1)/12 \, dx - (99/48)^2 = (1/12)(243/5 + 9) - (99/48)^2 = .546$
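The integrals in parts a through c are all integrals of polynomials, so they can be checked exactly. The sketch below uses a small helper, `integrate_poly`, which is a hypothetical name introduced here for illustration and not part of any library.

```python
from fractions import Fraction

def integrate_poly(coeffs, lo, hi):
    """Exact integral over [lo, hi] of sum(coeffs[k] * x**k)."""
    def antiderivative(x):
        return sum(Fraction(c) * Fraction(x) ** (k + 1) / (k + 1)
                   for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)

# f(x) = (x^2 + 1)/12 on (0, 3): coefficients of x^0, x^1, x^2.
pdf = [Fraction(1, 12), 0, Fraction(1, 12)]

total = integrate_poly(pdf, 0, 3)           # should equal 1 for a valid density
p_gt2 = integrate_poly(pdf, 2, 3)           # Pr(X > 2)
p_gt1 = integrate_poly(pdf, 1, 3)           # Pr(X > 1)
cond = p_gt2 / p_gt1                        # part a

mean = integrate_poly([0] + pdf, 0, 3)      # part b: integral of x * f(x)
second = integrate_poly([0, 0] + pdf, 0, 3) # integral of x^2 * f(x)
variance = second - mean ** 2               # part c

print("Pr(X > 2 | X > 1) =", float(cond))
print("E(X) =", mean, "=", float(mean))
print("Var(X) =", round(float(variance), 3))
```

Prepending zeros to the coefficient list multiplies the polynomial by x, which is how the first and second moments are obtained from the density.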
d) (4 points) In a class of 100 students whose times to completion are independent, what is the
chance that after 2 hours there will be less than 20 students still taking the exam (i.e., the chance
that the number of students who take more than 2 hours to complete the exam is less than 20)?
From the numerator of part a, for any student Pr(X>2) = .611. This is the chance that any random
student will take more than two hours to complete the exam.
Let P̂ be the percentage of students still left after 2 hours. We want to find the chance that the
sample percentage is less than 20% (20 out of 100).
Since 100 is a large sample size, we can use the central limit theorem to determine this chance.
$\Pr(\hat{P} < .20) = \Pr\left(Z < \frac{.20 - .611}{\sqrt{.611(1 - .611)/100}}\right) = \Pr(Z < -8.4)$
The chance of getting a z value less than -8.4 is practically zero.
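To see how small this probability is, the sketch below computes the z value and the normal approximation, and compares them with an exact binomial tail. The exact calculation is an addition for comparison only; the solution itself uses the central limit theorem. Names are illustrative.

```python
import math
from statistics import NormalDist

n = 100
p = 11 / 18        # Pr(X > 2) from part a, about .611

# Normal approximation for the sample proportion.
se = math.sqrt(p * (1 - p) / n)
z = (0.20 - p) / se
approx = NormalDist().cdf(z)

# Exact binomial probability that fewer than 20 students remain (0 through 19).
exact = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(20))

print(f"z = {z:.2f}")
print(f"normal approximation: {approx:.2e}")
print(f"exact binomial tail:  {exact:.2e}")
```

Both numbers are effectively zero, which agrees with the conclusion above.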
7. Probability Problems 3
You have four coins in your pocket: a penny (worth 0.01 dollars), a nickel (worth 0.05 dollars), a
dime (worth 0.10 dollars), and a quarter (worth 0.25 dollars). You pick three coins at random,
without replacing each coin.
a) (3 points) Write the probability distribution for the sample average of the three coins.
$\Pr(\bar{X} = .16/3) = 1/4$
$\Pr(\bar{X} = .31/3) = 1/4$
$\Pr(\bar{X} = .36/3) = 1/4$
$\Pr(\bar{X} = .40/3) = 1/4$
b) (3 points) Using part a, show mathematically whether the sample average is an unbiased or
biased estimator of the population average (which equals $0.1025). If you didn’t answer part a,
assume $\Pr(\bar{X} = .05) = .40$, $\Pr(\bar{X} = .10) = .25$, $\Pr(\bar{X} = .15) = .30$, $\Pr(\bar{X} = .20) = .05$ for parts b and c
(which is not correct and will mess you up for part d, so don’t use this if you answered part a).
$E(\bar{X}) = (.16/3)(.25) + (.31/3)(.25) + (.36/3)(.25) + (.40/3)(.25) = .1025$
It is unbiased. If your answer differed because of rounding, and you therefore said it was biased,
no points were taken off. Parts b and c were graded using whatever you did on part a, or the fake
distribution provided in part b.
c) (2 points) Compute the standard deviation of the sample average.
$\mathrm{Var}(\bar{X}) = (.16/3)^2(.25) + (.31/3)^2(.25) + (.36/3)^2(.25) + (.40/3)^2(.25) - (.1025)^2$
The SD is the square root of the above expression. You get .0303.
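Parts a through c can be reproduced by listing every possible sample. The following sketch enumerates the four equally likely sets of three coins and computes the distribution, mean, and standard deviation of the sample average; the names are illustrative.

```python
import math
from fractions import Fraction
from itertools import combinations

coins = {
    "penny": Fraction(1, 100),
    "nickel": Fraction(5, 100),
    "dime": Fraction(10, 100),
    "quarter": Fraction(25, 100),
}

# Drawing three coins without replacement: each set of three is equally likely.
samples = list(combinations(coins, 3))
prob = Fraction(1, len(samples))
means = [sum(coins[c] for c in s) / 3 for s in samples]

expected = sum(m * prob for m in means)
pop_mean = sum(coins.values()) / len(coins)
variance = sum(m ** 2 * prob for m in means) - expected ** 2
sd = math.sqrt(variance)

for s, m in zip(samples, means):
    print(s, "sample average =", m, "probability =", prob)
print("E(sample average) =", float(expected), "population mean =", float(pop_mean))
print("SD of sample average =", round(sd, 4))
```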
d) (4 points) Suppose you repeat this procedure two separate times (you put all coins back in your
pocket after the first time). Both times the sample average exceeds .1025. Given this information,
what is the chance that at least one of the six coins you picked was a nickel?
Pr(at least one nickel | sample average exceeds .1025 on two separate tries)
= 1 – Pr(no nickels | sample average exceeds .1025 on two separate tries)
Pr(no nickels | sample average exceeds .1025 on two separate tries)
= Pr(no nickels and sample average exceeds .1025 on two separate tries) / Pr(sample average exceeds .1025 on two separate tries)
For the denominator, there is a 75% chance we get a sample average greater than .1025 (all but the
penny, nickel, and dime combination qualify). Hence, the denominator equals (.75)(.75), since
each trial is independent. For the numerator, in any trial the only way to get no nickel and a sample
average exceeding .1025 is penny, dime, and quarter, which has a .25 chance of happening. Since
each trial is independent, we get (.25)(.25). So, the probability that we get no nickels is
(.25)(.25)/[(.75)(.75)] = 1/9. Hence, the probability we want equals 1 – 1/9 = 8/9.
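The conditional probability can also be found by brute-force enumeration. This sketch lists all 16 equally likely pairs of draws, keeps those in which both sample averages exceed .1025, and counts how many contain a nickel; the names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

values = {
    "penny": Fraction(1, 100),
    "nickel": Fraction(5, 100),
    "dime": Fraction(10, 100),
    "quarter": Fraction(25, 100),
}
threshold = Fraction(41, 400)   # population average, .1025

def sample_mean(draw):
    return sum(values[c] for c in draw) / 3

draws = list(combinations(values, 3))      # 4 equally likely draws per trial
pairs = list(product(draws, repeat=2))     # 16 equally likely pairs of trials

# Condition: the sample average exceeds .1025 on both trials.
conditioned = [pair for pair in pairs
               if all(sample_mean(d) > threshold for d in pair)]
with_nickel = [pair for pair in conditioned
               if any("nickel" in d for d in pair)]

answer = Fraction(len(with_nickel), len(conditioned))
print("Pr(at least one nickel | both averages exceed .1025) =", answer)
```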
8. Two other questions
a) (5 points) Exam scores on midterm 1 have a mean of 29 (out of 40) and an SD of 5.7. Exam
scores on midterm 2 have a mean of 34 (out of 50) and an SD of 7.6. The correlation of scores on
the two exams equals 0.35. Let X be the score on midterm 1 rescaled to be out of 100 (e.g., a 30/40
is a 75), and let Y be the score on midterm 2 rescaled to be out of 100 (e.g., a 40/50 is an 80).
Compute Var(X+Y).
Let M = score on midterm 1, and let N = score on midterm 2. Then, X = (100/40)M and Y =
(100/50)N.
So, Var(X+Y) = Var( 2.5M + 2N ) = 6.25Var(M) + 4Var(N) + 2(2.5)(2)Cov(M,N)
= 6.25(32.49)+4(57.76)+2(2.5)(2)[.35(5.7)(7.6)] = 585.7
Answers on a decimal scale (.05857) also were given full credit.
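The variance formula for a linear combination can be evaluated in a few lines. This is a small sketch with illustrative names, using only the numbers given in the problem.

```python
sd_m, sd_n = 5.7, 7.6      # SDs of the raw midterm scores M and N
corr = 0.35                # correlation between M and N

a, b = 100 / 40, 100 / 50  # rescaling factors: X = a*M, Y = b*N

cov_mn = corr * sd_m * sd_n
var_sum = a ** 2 * sd_m ** 2 + b ** 2 * sd_n ** 2 + 2 * a * b * cov_mn

print(f"Cov(M, N) = {cov_mn:.3f}")
print(f"Var(X + Y) = {var_sum:.1f}")
```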
b) (5 points) One of three people wrote you a love letter, but you don’t know which one. In
general, 75% of Person A’s words have more than two syllables; 25% of Person B’s words have
more than two syllables; and, 50% of Person C’s words have more than two syllables. Before
seeing the letter, you believe each person has a 1/3 chance of being the one who wrote you.
In the letter you received, 7 out of 20 words have more than two syllables. You can use a binomial
distribution to model the number of words with more than two syllables.
Given the data from the letter you received, what is the posterior probability that Person C wrote
the letter?
This is a Bayesian statistics problem. We want Pr(p=.50 | X=7), where p is the writer's proportion of words with more than two syllables and X is the number of words
that have more than two syllables. Using Bayes Rule, we have:
Person    p       Pr(p)    Pr(X=7 | p)    Pr(X=7, p)    Pr(p | X=7)
--------------------------------------------------------------------
B         0.25    1/3      .112           .03747        .6028
C         0.50    1/3      .0739          .02464        .3964
A         0.75    1/3      .00015         .00005        .0008

Pr(X=7) = .06216
Since Person C corresponds to the row for 50%, there is a 39.64% chance that Person C wrote the
letter. I hope you weren’t wishing it was Person A.
Remember, to get Pr(X=7) you have to add the joint probabilities, not the conditional probabilities.
This applies for any probability problem.
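The table can be reproduced with a short calculation. The sketch below computes the binomial likelihood for each person, multiplies by the prior to get the joint probability, sums the joints to get Pr(X=7), and divides to get the posteriors. The helper `binom_pmf` is defined here for illustration.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n_words, n_long = 20, 7
rates = {"A": 0.75, "B": 0.25, "C": 0.50}   # Pr(word has more than two syllables)
priors = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

# Joint probability Pr(X=7, person) = prior * likelihood.
joint = {person: priors[person] * binom_pmf(n_long, n_words, rates[person])
         for person in rates}

# Pr(X=7) is the sum of the joint probabilities, not of the conditionals.
marginal = sum(joint.values())
posterior = {person: joint[person] / marginal for person in joint}

print(f"Pr(X=7) = {marginal:.5f}")
for person in sorted(posterior):
    print(f"Person {person}: posterior = {posterior[person]:.4f}")
```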
9. True or False (3 points per part).
For each statement, if you think the statement is always true, just say it is true. If you think the
statement is always false or sometimes false, say it is false and explain why or when it is false in
two or fewer sentences.
a) When the p-value is 0.020, you should reject the null hypothesis because there is a 2.0% chance
that the null hypothesis is true (assume a significance level of 0.05).
False. Although we would reject the null hypothesis, it is not the case that there is a 2% chance the
null hypothesis is true. It is either true or not true.
b) In the sex discrimination data, the variable race is coded 1=white, 2=black, 3=Hispanic/Latin
American, 4=Asian, and 5=Native American. True or false: the association between salaries and
race is determined by finding the value of the correlation between sal77 and race.
False. Correlations involving a nominal variable are meaningless. The ordering of race is
arbitrary: e.g., you can’t meaningfully say that Asian is “four times larger” than white.
c) The senior survey at Duke is sent to all 1500 seniors, who are asked to respond to various
questions about their Duke experience. Out of the 500 seniors who return it, 400 cite parking as a
“serious problem that negatively affected my Duke experience.” True or false: A 95% confidence
interval for the percentage of Duke seniors who rate parking as a serious problem is: 0.80 ± .035.
False. Although the calculations are correct, there is non-response bias that could lead to an invalid
confidence interval.
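The margin of error quoted in the statement can be checked as follows. This is a minimal sketch that assumes the usual 1.96 multiplier for 95% confidence; it also prints the response rate, which is the source of the concern about non-response bias.

```python
import math

surveyed, returned, cited_parking = 1500, 500, 400

p_hat = cited_parking / returned
se = math.sqrt(p_hat * (1 - p_hat) / returned)
margin = 1.96 * se                     # assumed 95% normal multiplier
response_rate = returned / surveyed

print(f"p_hat = {p_hat:.2f}, margin of error = {margin:.3f}")
print(f"response rate = {response_rate:.1%}")
```

The arithmetic matches 0.80 ± .035, but only a third of the seniors responded, so the respondents may not represent all seniors.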
d) Using recent data from the U.S. Current Population Survey, a 95% confidence interval for the
difference in population average years of education for men and population average years of
education for women stretches from 0.25 to 0.29. True or false: about 95% of men have between
.25 and .29 more years of education than women do.
False. Confidence intervals give a range for the difference in population averages, not a range for
individuals’ differences.