Solution to Practice Problems for Midterm #1

Solution to Practice Problems for Applied Statistics Midterm #1
1) What is the difference between a population and a sample? What is our objective in
examining samples?
Answer: A population is the set of all items. A sample is a subset of the population.
Our objective is to infer things about populations by examining samples.
2) Using the unemployment data in the Excel spreadsheet P1DATA.XLS (linked to the
course website), create:
a) a frequency distribution
b) a relative frequency distribution
c) a percent frequency distribution
d) a cumulative relative frequency distribution
e) a cumulative percent frequency distribution
Answer: For brevity, I have eliminated the middle items.
                     Relative    Cumulative Relative   Percent     Cumulative Percent
Value    Frequency   Frequency   Frequency             Frequency   Frequency
3.4      9           0.0186      0.0186                1.86%       1.86%
3.5      8           0.0165      0.0351                1.65%       3.51%
3.6      1           0.0021      0.0372                0.21%       3.72%
3.7      8           0.0165      0.0537                1.65%       5.37%
3.8      16          0.0331      0.0868                3.31%       8.68%
3.9      7           0.0145      0.1012                1.45%       10.12%
…        …           …           …                     …           …
10.3     1           0.0021      0.9897                0.21%       98.97%
10.4     3           0.0062      0.9959                0.62%       99.59%
10.5     0           0.0000      0.9959                0.00%       99.59%
10.6     0           0.0000      0.9959                0.00%       99.59%
10.7     0           0.0000      0.9959                0.00%       99.59%
10.8     2           0.0041      1.0000                0.41%       100.00%
Total    492         1
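The five distributions in parts a) through e) all derive from one set of counts. The
following is a minimal sketch of that derivation in Python; the function name
frequency_table and the sample values are hypothetical and are not the course data
in P1DATA.XLS.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return rows of (value, freq, rel_freq, cum_rel_freq), sorted by value."""
    counts = Counter(values)
    n = len(values)
    keys = sorted(counts)
    freqs = [counts[k] for k in keys]
    cum_freqs = list(accumulate(freqs))  # running total of the frequencies
    return [(k, f, f / n, c / n) for k, f, c in zip(keys, freqs, cum_freqs)]

# Hypothetical sample for illustration only (NOT the unemployment data).
sample = [3.4, 3.4, 3.5, 3.6, 3.6, 3.6, 3.7, 3.7]
table = frequency_table(sample)

# Percent columns are the relative columns shown with percent formatting.
for value, freq, rel, cum_rel in table:
    print(f"{value:5.1f} {freq:4d} {rel:8.4f} {cum_rel:8.4f} {rel:8.2%} {cum_rel:8.2%}")
```

The last cumulative relative frequency is always 1, which is a quick check on the
table.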
f) a histogram
[Figure: histogram titled "U.S. Unemployment: 1960-2000"; horizontal axis: Unemployment
Rate (3.4 to 10.4); vertical axis: 0 to 30]
g) an ogive
[Figure: ogive titled "U.S. Unemployment: 1960-2000"; vertical axis: Cumulative Relative
Frequency (0.0000 to 1.0000); horizontal axis: Unemployment Rate (0 to 12)]
h) a stem and leaf display (use only the years 1981-1984 for this)
Answer: Note that I have used the integer portion as the stem.
 7 | 2 2 2 2 3 3 4 4 4 4 4 5 5 5 5 5 6 7 8 8 9
 8 | 0 3 3 5 5 6 8 9
 9 | 0 2 3 4 4 5 6 8 8
10 | 1 1 1 2 3 4 4 4 8 8
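A stem and leaf display with the integer portion as the stem can be built
mechanically. The sketch below is one possible way to do it; the function name
stem_and_leaf and the handful of rates are illustrative assumptions, not the full
1981-1984 series.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map integer-part stems to sorted lists of first-decimal leaves."""
    display = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)  # e.g. 10.8 -> 108; rounding avoids float error
        display[tenths // 10].append(tenths % 10)
    return dict(display)

# Illustrative values only (a hypothetical handful of monthly rates).
rates = [7.2, 7.5, 8.0, 8.3, 9.8, 10.1, 10.8, 7.4]
display = stem_and_leaf(rates)

for stem, leaves in display.items():
    print(f"{stem:2d} | {' '.join(str(leaf) for leaf in leaves)}")
```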
3) Using the unemployment data used in problem 2,
a) what is the mean?
b) what is the median?
c) what is the mode?
d) what is the range?
e) what is the interquartile range?
f) what is the five number summary?
g) what is the variance?
h) what is the standard deviation?
i) what is the z-score of the smallest observation?
j) what is the z-score of the largest observation?
Answer:
mean = 5.96
median = 5.7
mode = 5.4
range = 7.4
Q1 = 5.0 ; Q2 = 5.7 ; Q3 = 7.0
interquartile range = 2.0
five number summary = {3.4,5.0,5.7,7.0,10.8}
variance = 2.28
standard deviation = 1.51
z-score of 3.4 (smallest observation) = -1.69
z-score of 10.8 (largest observation) = 3.20
k) create a table that lists the z-score for every item
Answer: For brevity, I have eliminated the middle items.
Date               Civilian Unemployment Rate   Z-Score
1960 January       5.2                          -0.50
1960 February      4.8                          -0.77
1960 March         5.4                          -0.37
1960 April         5.2                          -0.50
1960 May           5.1                          -0.57
1960 June          5.4                          -0.37
…                  …                            …
2000 July          4.0                          -1.30
2000 August        4.1                          -1.23
2000 September     3.9                          -1.36
2000 October       3.9                          -1.36
2000 November      4.0                          -1.30
2000 December      4.0                          -1.30
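Each z-score is (observation - mean) / standard deviation. As a rough check, the
sketch below recomputes a few of them from the rounded mean (5.96) and standard
deviation (1.51) reported in problem 3. Because those inputs are rounded, results for
extreme observations can differ from the reported values in the last digit (for
example, 3.4 gives -1.70 here rather than -1.69). The names MEAN, STD_DEV, and z_score
are illustrative.

```python
MEAN = 5.96      # rounded mean reported in problem 3
STD_DEV = 1.51   # rounded standard deviation reported in problem 3

def z_score(x, mean=MEAN, sd=STD_DEV):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# A few of the monthly rates that appear in the z-score table.
rates = [5.2, 4.8, 5.4, 5.1, 4.0, 4.1, 3.9]
z_scores = [round(z_score(r), 2) for r in rates]

for rate, z in zip(rates, z_scores):
    print(f"{rate:4.1f}  {z:6.2f}")
```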
4) Are there any unemployment outliers in the data used in problem 2? Justify your
answer carefully.
Answer: The z-score of 3.20 for November and December 1982 is quite high, but we have
a large data set (492 observations). No other items have z-scores over 3, but 10 items
have z-scores over 2.7. Also note that the lowest z-score is only –1.69. This suggests
that the data are highly skewed to the right. In this scenario, the high z-scores are
not likely to be indicative of outliers.
5) Answer not provided due to its similarity to the take home exam.
6) Answer not provided due to its similarity to the take home exam.
7) Using the interest rate data and the classes formed in problem 5,
a) create a table of grouped interest rate data
Range            # of Observations
2.50 – 4.99      152
5.00 – 7.49      225
7.50 – 9.99      79
10.00 – 12.49    20
12.50 – 15.00    16
b) what is the mean of the grouped data?
c) what is the variance of the grouped data?
d) what is the standard deviation of the grouped data?
Answer: mean = 6.32; variance = 5.78; standard deviation = 2.40. Note that these
calculations used class midpoints, e.g., (2.5+4.99)/2 = 3.745 for the first class.
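The grouped calculation can be sketched as follows, treating every observation in a
class as if it sat at the class midpoint. The variable names are illustrative. With
these counts, dividing the sum of squares by n rounds to the 5.78 quoted above, while
dividing by n - 1 rounds to 5.79; which divisor the answer used is an inference from
the rounding, not something the answer states.

```python
from math import sqrt

# (class limits, number of observations) from the grouped table in part a)
classes = [((2.50, 4.99), 152), ((5.00, 7.49), 225), ((7.50, 9.99), 79),
           ((10.00, 12.49), 20), ((12.50, 15.00), 16)]

midpoints = [(low + high) / 2 for (low, high), _ in classes]
counts = [count for _, count in classes]
n = sum(counts)

# Weighted mean of the midpoints.
grouped_mean = sum(f * m for f, m in zip(counts, midpoints)) / n

# Weighted sum of squared deviations from the grouped mean.
sum_sq = sum(f * (m - grouped_mean) ** 2 for f, m in zip(counts, midpoints))
var_n = sum_sq / n          # divide by n
var_n1 = sum_sq / (n - 1)   # divide by n - 1

print(round(grouped_mean, 2), round(var_n, 2), round(sqrt(var_n), 2))
```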
8) Compare the values calculated in problem 7 to the mean, variance, and standard
deviation of the full interest rate data. Comment on the potential errors when we have
grouped data.
Answer: full data mean = 6.18; full data variance = 5.86; full data standard
deviation = 2.42. Notice that these are slightly different from what we found in
problem 7. This illustrates the basic problem with grouped data: replacing each
observation with its class midpoint discards information. As is typical (but not true
all the time), our grouped estimates of variability are slightly below the estimates
using the full data set. The means are also different, but this is likely due to the
skewed nature of the data.
9) Consider the following sales data:
Month       Sales     Month        Sales
January     $200      July         $140
February    $190      August       $150
March       $200      September    $140
April       $180      October      $120
May         $170      November     $110
June        $170      December     $90
a) Create a histogram of the data that might mislead people to believe that sales are
generally increasing.
[Figure: bar chart titled "Sales"; vertical axis: Sales (0 to 800); horizontal axis:
Time, with unequal-width categories Jan, Feb-Mar, Apr-Jun, July-Dec]
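One way a chart with the categories Jan, Feb-Mar, Apr-Jun, and July-Dec can suggest
growth is by summing sales over bins of increasing width. The sketch below shows the
arithmetic; that the original chart was built exactly this way is an assumption, and
the variable names are illustrative.

```python
sales = {"Jan": 200, "Feb": 190, "Mar": 200, "Apr": 180, "May": 170, "Jun": 170,
         "Jul": 140, "Aug": 150, "Sep": 140, "Oct": 120, "Nov": 110, "Dec": 90}

# Unequal-width bins: 1, 2, 3, and 6 months.
bins = {
    "Jan": ["Jan"],
    "Feb-Mar": ["Feb", "Mar"],
    "Apr-Jun": ["Apr", "May", "Jun"],
    "July-Dec": ["Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
}

# Bin totals rise only because each bin covers more months.
totals = {label: sum(sales[m] for m in months) for label, months in bins.items()}

# Dividing by bin width reveals the actual downward trend.
per_month = {label: totals[label] / len(months) for label, months in bins.items()}

print(totals)
print(per_month)
```

The totals increase from bin to bin while the average monthly sales within each bin
fall, which is why the bin widths on the axis deserve scrutiny.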
b) Create a time series plot that might mislead people to believe that sales are
generally increasing.
[Figure: line chart titled "Sales vs. Time"; vertical axis: Sales (0 to 250);
horizontal axis: Time, with tick labels Feb, Apr, Jun, Aug, Oct, Dec]
c) Based on your answers to a) and b), what advice would you give to people who
review business plans for potential investment?
Answer: We must be careful to examine the labels on any graphical and/or tabular
display. We should NEVER look at a presentation without carefully considering the
labels.
d) Briefly comment on the relationship between transparency, ethics, and legality.
Answer: A display might be perfectly accurate and complete, but not be
transparent. If the display is misleading, we certainly violate ethical standards.
Legislation typically addresses accuracy, so in many cases, one could create a
misleading display that is unethical yet legal.
10) Evaluate the validity of the following statement: “A cutoff rule of z = +/-3 should be
used to determine outliers”. If you disagree, comment on how one might determine
an appropriate outlier rule.
Answer: The statement is too general and is therefore incorrect. Any cutoff rule
must be determined on a case-by-case basis. In making the determination, the
sample size should be considered as well as potential skewness in the data. For
example, large samples commonly contain values with z-scores at or beyond +/-3. Such
values should not be discarded automatically.
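To see why sample size matters, consider how many observations would be expected
beyond z = +/-3 if the data were normally distributed. The sketch below assumes
normality purely for illustration (the unemployment data in problem 2 are skewed, so
this is only a benchmark) and uses n = 492, the size of that data set.

```python
from math import erfc, sqrt

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal random variable Z."""
    return erfc(z / sqrt(2))

n = 492                            # size of the data set from problem 2
p_beyond_3 = two_sided_tail(3.0)   # roughly 0.0027
expected_count = n * p_beyond_3    # expected number of |z| > 3 under normality

print(f"P(|Z| > 3) = {p_beyond_3:.4f}; expected count in {n} items = {expected_count:.2f}")
```

Under this benchmark, roughly one observation beyond +/-3 is expected in a sample of
this size even when nothing is wrong with the data.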
11) Evaluate the validity of the following statement: “Ethical behavior demands that we
present data in such a way that it is accurate and complete, but not transparent”.
Answer: The statement is decidedly incorrect. Transparency is important for both
communicative reasons and ethical reasons. If we present data (or descriptions of
data) in such a way that it is difficult to interpret, we are guilty of potentially
misleading our audience. Whether intentional or not, this could be considered an
ethical violation.
12) Briefly comment on how outliers might cause descriptive statistics to be misleading.
Answer: The existence of just one outlier can dramatically alter many of our
descriptive statistics. For example, suppose that we roll a die 10 times and come up
with {3,2,5,4,3,6,1,2,4,2}. The mean is 3.20, the standard deviation is 1.55, and the
five-number summary is {1,2,3,4.5,6}. Suppose, though, that we accidentally typed
in 66 instead of 6 in the data. The mean is then 9.20, the standard deviation is 19.99,
and the five-number summary is {1,2,3,4.5,66}. In this example, we would surely
catch the error. In many cases, however, such errors are difficult to detect.
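The effect of the typo can be reproduced with Python's statistics module. This is a
small sketch; it reports the mean, median, and sample standard deviation only, since
quartile values depend on the quartile convention used.

```python
from statistics import mean, median, stdev

rolls = [3, 2, 5, 4, 3, 6, 1, 2, 4, 2]
with_typo = [66 if x == 6 else x for x in rolls]  # 6 mistyped as 66

for label, data in (("original", rolls), ("with typo", with_typo)):
    # stdev() is the sample standard deviation (divides by n - 1).
    print(f"{label:10s} mean={mean(data):.2f} median={median(data):.1f} "
          f"sd={stdev(data):.2f}")
```

The median is unchanged by the typo while the mean and standard deviation move
sharply, which is one reason to report resistant statistics alongside the mean.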
13) Consider four sets A, B, C, and D such that A∩B≠∅, A∩C≠∅, A∩D=∅, B∩C≠∅,
B∩D=∅, C∩D=∅, and A∩B∩C=∅. Draw a Venn diagram depicting this situation.
Answer: [Figure: Venn diagram with regions labeled A, B, C, and D]
14) A door-to-door salesman has examined historical data on his success given the sex of
the person who answers the door. 74% of the time, a woman answers the door. He
has also noted the following:
P(sale)=0.3 (i.e., the man makes a sale at 30% of the houses he approaches)
P(sale ∩ woman)=0.19 (i.e., 19% of the time, a woman answers the door and a
sale follows).
What is the probability of getting a sale given that a man answers the door?
Answer: We want P(sale|male). We are tempted to use Bayes’ Rule, but it isn’t
necessary. We know that P(sale ∩ male) = 0.3-0.19 = 0.11 and that males answer the
door 26% of the time. From our rule for conditional probabilities,
P(sale|male)=P(sale ∩ male)/P(male) = 0.11/0.26 = 42.3%.
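The arithmetic can be checked with a few lines of Python. The variable names are
illustrative, and the sketch assumes, as the answer does, that either a woman or a man
answers the door.

```python
p_sale = 0.30             # P(sale)
p_sale_and_woman = 0.19   # P(sale and a woman answers)
p_woman = 0.74            # P(a woman answers)

p_man = 1 - p_woman                         # complement: 0.26
p_sale_and_man = p_sale - p_sale_and_woman  # remaining sale probability: 0.11

# Definition of conditional probability: P(A|B) = P(A and B) / P(B)
p_sale_given_man = p_sale_and_man / p_man

print(f"P(sale | man) = {p_sale_given_man:.1%}")
```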
15) A bank screens credit applicants based on three factors: current debt, income, and
prior payment history. 40% of all applicants are rejected. 15% of applicants fail the
debt test. 20% of applicants fail the income test. 5% of applicants fail the payment
history test. You know that a certain customer applied and was rejected. What is the
probability that the customer was rejected due to low income? Comment on your
ability to answer the question if 30% of all applicants are rejected (and the other
numbers are the same).
Answer: We want P(low income|rejection). We know from Venn diagrams and the
derivation of Bayes’ Rule that
P(income | rejection)
  = P(income ∩ rejection) / [P(income ∩ rejection) + P(debt ∩ rejection) + P(history ∩ rejection)]
  = 0.20 / (0.20 + 0.15 + 0.05)
  = 0.5
So there is a 50% chance that the applicant failed due to low income. We can use
this approach because the three failure rates sum exactly to the 40% rejection rate
(0.15 + 0.20 + 0.05 = 0.40), so no applicant failed more than one test; that is, the
failure events are mutually exclusive. If 30% of all applicants are rejected, then
there must be some applicants that were rejected for multiple reasons. To answer the
question, we must clarify whether we are interested in “rejected due only to low
income” or “rejected due to low income and/or other reasons”. If the latter, the
answer is 0.20/0.30 = 66.67%, because every applicant who fails the income test is
rejected. If the former, we cannot answer without additional information about the
overlaps.
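The 40% case can be checked numerically. The sketch below assumes that an applicant
is rejected exactly when at least one test is failed; under that assumption, failure
rates that sum to the rejection rate leave no room for overlap. Variable names are
illustrative.

```python
from math import isclose

fail_rate = {"debt": 0.15, "income": 0.20, "history": 0.05}
p_reject = 0.40

# If the individual failure rates add up to the rejection rate, no applicant
# can have failed more than one test (assuming rejection = failing any test).
total_fail = sum(fail_rate.values())
no_overlap = isclose(total_fail, p_reject)

# With no overlap, P(income and rejection) is just the income failure rate.
p_income_given_reject = fail_rate["income"] / total_fail

print(no_overlap, round(p_income_given_reject, 2))
```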
16) Suppose that telemarketing sales are dependent on two factors: weather (when it’s
raining, more people are home) and time of day (if you call during prime time, people
are less likely to answer the phone). Those factors are independent. It rains with
probability 0.1 and prime time constitutes 40% of the normal calling hours. A
telemarketer can make 10 calls per hour. The net profit (including everything except
telemarketer wages) per successful call is $9 and the probabilities of success on a
given call are as follows.
                  Raining    Not Raining
Prime Time        0.25       0.15
Not Prime Time    0.3        0.2
Telemarketers charge $15 per hour. What is the expected profit per hour of calling?
Should you implement a restricted calling plan? If so, what would you recommend?
What is the expected profit per hour of calling under the new plan?
Answer: The probabilities for the possible scenarios are
                  Raining             Not Raining
Prime Time        0.1 × 0.4 = 0.04    0.9 × 0.4 = 0.36
Not Prime Time    0.1 × 0.6 = 0.06    0.9 × 0.6 = 0.54
P(success on a randomly chosen call)
= P(prime time & raining) × P(success | prime time & raining)
+ P(not prime time & raining) × P(success | not prime time & raining)
+ P(prime time & not raining) × P(success | prime time & not raining)
+ P(not prime time & not raining) × P(success | not prime time & not raining)
= 0.04 × 0.25 + 0.06 × 0.3 + 0.36 × 0.15 + 0.54 × 0.2
= 0.19
The expected profit per hour of calling is then 10 × 0.19 × $9 - $15 = $2.10. On
average, 1.9 calls per hour are successful, giving a net profit of $17.10 less the $15
paid to the telemarketer.
The lowest probabilities of success occur during prime time, so we might consider
not making calls during prime time. To answer this, we consider whether prime
time calling is profitable or not.
P(success on a randomly chosen call during prime time) =
P(raining) × P(success | prime time & raining)
+ P(not raining) × P(success | prime time & not raining)
= 0.1 × 0.25 + 0.9 × 0.15 = 0.16
The expected profit per hour of calling during prime time is then
10 × 0.16 × $9 - $15 = -$0.60. So, we should not make calls during prime time.
P(success on a randomly chosen call not during prime time) = 0.1 × 0.3 + 0.9 × 0.2 =
0.21.
Expected profit under the new plan = 10 × 0.21 × $9 - $15 = $3.90.
One might also consider not calling unless it is raining, but that would be
difficult to implement. It also might result in low morale because the employees
would be on a very uncertain work schedule. I therefore chose not to consider
that possibility.
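The three expected-profit figures follow from the law of total probability and can be
recomputed with the short sketch below. The constant and function names are
illustrative, and the calculation relies on the stated independence of weather and
time of day.

```python
P_RAIN = 0.1            # probability that it is raining
P_PRIME = 0.4           # share of calling hours that are prime time
CALLS_PER_HOUR = 10
PROFIT_PER_SALE = 9.0   # dollars, before telemarketer wages
WAGE = 15.0             # dollars per hour

# P(success | prime time?, raining?), keyed as (is_prime, is_raining)
P_SUCCESS = {(True, True): 0.25, (True, False): 0.15,
             (False, True): 0.30, (False, False): 0.20}

def success_rate(is_prime):
    """P(success) for a fixed time of day, averaging over the weather."""
    return (P_RAIN * P_SUCCESS[(is_prime, True)]
            + (1 - P_RAIN) * P_SUCCESS[(is_prime, False)])

def hourly_profit(p_success):
    """Expected profit per hour of calling, net of wages."""
    return CALLS_PER_HOUR * p_success * PROFIT_PER_SALE - WAGE

p_prime = success_rate(True)
p_off_prime = success_rate(False)
p_overall = P_PRIME * p_prime + (1 - P_PRIME) * p_off_prime

for label, p in (("all hours", p_overall), ("prime time", p_prime),
                 ("not prime time", p_off_prime)):
    print(f"{label:15s} P(success)={p:.2f}  profit/hour=${hourly_profit(p):.2f}")
```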