Prof. David Draper
Department of Statistics
University of California, Santa Cruz
Fall 2022
STAT 17: Quiz 2 (155 total points)
Name:
1. [25 total points for this problem] You’re writing an article for the product quality
evaluation website Consumer Reports, based on a sample survey of the website’s readers,
on the reliability of various household appliances. 1,338 readers said they had Maytag
washing machines, and 294 of these people reported at least one episode in the last year in
which they had to call for someone to come and repair the machine. Only 192 of the 480
people who said they owned a General Electric (GE) washer reported one or more repair
calls in the same period.
(a) Person A says “There were 294 people with one or more repair episodes for the
Maytag machines and only 192 for GE, so GE machines are more reliable.” Person
B objects to this argument, noting that “( 1,338 – 294 ) = 1,044 readers reported
no problems with their Maytag washers, versus only ( 480 – 192 ) = 288 trouble-free
reports for GE, so Maytag machines are better.” Briefly explain why both of these
people’s reasoning is incorrect, compute a fair numerical summary of the reliability
of these two brands, and compare. (This is not a trick question.) 15 points
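A fair comparison for part (a) puts both brands on the same scale, for example the proportion of owners who reported at least one repair call. The following is a minimal sketch in Python (not the Stata used elsewhere in this quiz); the helper name repair_rate is hypothetical, and the counts are the ones given in the problem statement.

```python
def repair_rate(owners_with_repairs, total_owners):
    """Proportion of owners reporting at least one repair call."""
    return owners_with_repairs / total_owners

# Counts taken from the problem statement.
maytag_rate = repair_rate(294, 1338)
ge_rate = repair_rate(192, 480)

print(f"Maytag repair rate: {maytag_rate:.1%}")
print(f"GE repair rate:     {ge_rate:.1%}")
```

Raw counts of repairs (or of trouble-free machines) mostly reflect how many owners of each brand answered the survey; dividing by the number of owners removes that dependence.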
(b) Intuitively, which of these two reliability estimates is likely to be more accurate?
(That is, assuming that the Maytag and GE samples were representatively drawn
from the populations of all people owning those brands of washing machine, which
estimate should have the smaller amount of uncertainty attached to it as an educated
guess for the corresponding population quantity?) Explain briefly. 10 points
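One hedged way to put numbers on the intuition in part (b) is the usual standard error of a sample proportion, sqrt(p(1 - p)/n). This sketch assumes each sample behaves like a simple random sample from its population; the helper name is hypothetical, and this formula is not necessarily the reasoning the quiz intends.

```python
import math

def proportion_standard_error(successes, n):
    """Standard error of a sample proportion under simple random sampling (an assumption)."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

# Counts taken from the problem statement.
maytag_se = proportion_standard_error(294, 1338)
ge_se = proportion_standard_error(192, 480)

print(f"Maytag standard error: {maytag_se:.4f}")
print(f"GE standard error:     {ge_se:.4f}")
```

The sample size n sits in the denominator, so, other things being equal, the larger sample gives the smaller standard error.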
2. [130 total points for this problem] Example 1.13 in the Alwan book presents a
case study from the world of customer service in banking: we have data on “the length (in
seconds) of all n = 31,492 calls made to the customer service center (CSC) of a small bank
in a month.” Table 1.1 in the book presents the data for the first 80 calls listed; the full
data set is available in the Pages tab of the course Canvas page, under the name
stat-17-customer-service-center-data.csv
You don’t need to analyze this data set (all the questions about this case study are self-contained below); I provide it to you in case you wish to explore the data in Stata yourself.
Table 1 in this document gives the results from describe, list, count, and univar [1]
commands that I ran after reading the .csv file into Stata. Note: the list length in 1
/ 10 command was run before sorting the data from smallest to largest; the list length
in -10 / l command was run after sorting, and the syntax of this command means that
its output shows us the 10 largest length values.
(a) (variable types) List all of the type attributes of the variable length as coded in
the manner indicated in Table 1 (qualitative, quantitative, nominal, ordinal, dichotomous, discrete, continuous, interval-scale, ratio-scale), briefly explaining your choices.
(Hint: This variable is conceptually continuous but has been made discrete by taking
its ceiling: rounding the length of the call to the nearest integer greater than or equal
to the actual length [e.g., 10.4 seconds has been recorded as 11].) 10 points
(b) (data curation, part 1) Based on our LDS discussion on 3 Oct 2022 and looking at
Table 1, does the length variable have any missing values in it? Is that a good thing
or a bad thing? Explain briefly. 10 points
(c) (data curation, part 2) What are the conceptually possible values for the length
variable? How could you use this to find incorrect data values? Explain briefly, and
identify the information in Table 1 that proves that there are no such values in this
data set. 10 points
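The range check in part (c) can be sketched as follows, in Python rather than Stata. It assumes that every real call lasts longer than 0 seconds, so that a ceiling-rounded length must be a whole number of at least 1; the function name impossible_lengths is hypothetical. The first list holds the ten unsorted values shown in Table 1, and the second is made-up data used only to show what a flagged value looks like.

```python
def impossible_lengths(lengths):
    """Return values that cannot be ceiling-rounded call lengths in seconds.

    Assumption: a valid value is a whole number greater than or equal to 1.
    """
    return [x for x in lengths if x < 1 or x != int(x)]

# First ten values listed in Table 1 (before sorting).
first_ten = [77, 289, 128, 59, 19, 148, 157, 203, 126, 118]
print(impossible_lengths(first_ten))

# Made-up values, for illustration only.
print(impossible_lengths([5, 0, -3, 2.5]))
```

For the full data set, the same question can be answered from summary output alone: if the minimum is at least 1 and the storage type is an integer type, no value can fail this check.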
[1] univar is not part of the standard set of commands available in Stata; I used the web to figure out
how to gain access to it (I can show you how to do this if you’re interested; ask me in Discord if so).
Table 1: Numerical summaries of the length of calls in the CSC case study (Stata output
condensed).
. describe

Contains data
 Observations:        31,492
    Variables:             1
-------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------
length          int     %8.0g
-------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. list length in 1 / 10

       +--------+
       | length |
       |--------|
    1. |     77 |
    2. |    289 |
    3. |    128 |
    4. |     59 |
    5. |     19 |
       |--------|
    6. |    148 |
    7. |    157 |
    8. |    203 |
    9. |    126 |
   10. |    118 |
       +--------+

. list length in -10 / l

       +--------+
       | length |
       |--------|
31483. |   4165 |
31484. |   4277 |
31485. |   4394 |
31486. |   4577 |
31487. |   4845 |
       |--------|
31488. |   5222 |
31489. |   5764 |
31490. |   8514 |
31491. |  17904 |
31492. |  28739 |
       +--------+

. count if length == .
  0

. univar length
                                        -------------- Quantiles --------------
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
  length   31492   188.59   312.78     1.00    57.00   115.00   225.00 28739.00
-------------------------------------------------------------------------------
(d) (data curation, part 3) Identify the 5 largest values of length in Table 1. Recalling
that this variable is measured in seconds, convert each of those 5 values to hours by
dividing by (60 · 60) = 3,600. Considering each of the 3 largest values expressed in
hours, can you imagine yourself participating in a customer service call that long?
What do you conclude about the good data/bad data status of those 3 largest values?
Explain briefly. 10 points
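The unit conversion in part (d) is plain arithmetic; a minimal Python sketch (not Stata) is shown here. The five values are the five largest entries in the sorted listing in Table 1, and the variable names are illustrative only.

```python
SECONDS_PER_HOUR = 60 * 60  # 3,600, as stated in the question

# Five largest values of length from the sorted listing in Table 1.
five_largest = [28739, 17904, 8514, 5764, 5222]

hours = [seconds / SECONDS_PER_HOUR for seconds in five_largest]

for seconds, h in zip(five_largest, hours):
    print(f"{seconds:>6} seconds = {h:.2f} hours")
```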
(e) (data curation, part 4) It can be shown (you’re not asked to show it; ask me for
details in Discord if you wish) that the mean of the length variable drops from
188.59 seconds to 186.85 seconds when the 3 largest values of the variable are omitted
from the data set.
(i) With reference to the LDS discussion of practical significance, do you regard that
change (from 188.59 seconds to 186.85 seconds) as practically significant? Explain briefly. (Hint: Compute how large that change is in relative (percentage)
terms, as we did in LDS.) 10 points
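The relative (percentage) change mentioned in the hint can be computed as in this short Python sketch. The helper name relative_change is hypothetical; the two means are the ones quoted in part (e).

```python
def relative_change(old, new):
    """Change from old to new, expressed as a fraction of old."""
    return (new - old) / old

# Means quoted in part (e): with and without the 3 largest values.
change = relative_change(188.59, 186.85)

print(f"Relative change in the mean: {change:.2%}")
```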
(ii) Would you say that in this case the mean is relatively insensitive to a small
number of outliers in the right tail? How can this be true, given how extremely
large the outliers are here? Explain briefly. 10 points
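For part (e)(ii), it can help to see how little three values contribute to a total built from 31,492 observations. The Python sketch below reconstructs the total from the rounded mean in Table 1, so its results are approximate; the variable names are illustrative, and the three values are the largest entries in the sorted listing in Table 1.

```python
n = 31492
reported_mean = 188.59                 # rounded mean from Table 1
three_largest = [8514, 17904, 28739]   # largest values in Table 1

# Approximate total of all call lengths, reconstructed from the rounded mean.
approx_total = reported_mean * n

# Mean after dropping the three largest values (approximate, because of rounding).
trimmed_mean = (approx_total - sum(three_largest)) / (n - len(three_largest))

# Fraction of the overall total contributed by the three largest values.
share_of_total = sum(three_largest) / approx_total

print(f"Sum of the three largest values: {sum(three_largest)}")
print(f"Approximate trimmed mean: {trimmed_mean:.2f} seconds")
print(f"Share of the total: {share_of_total:.2%}")
```

The trimmed mean obtained this way agrees with the 186.85 seconds quoted in part (e) to within the rounding error of the reported mean.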
(f) (distributional shape) Without looking at any plots, what can you conclude about the
shape of the length distribution (symmetric, long left- or right-hand tail, unimodal,
multimodal) from the following facts (specify in each case [2])? (A) Your answer to part
(c) about possible values. (B) The relationship between the mean and the median for
this variable. (C) The relationship between the mean and the SD for this variable.
Explain briefly. 30 points
[2] For example, what numerical fact did you learn in answering part (c)? What is the relationship
between the mean and the median? And so on.
[Figure 1 image not reproduced. Axis labels recoverable from the original: Density (0 to .015),
length (0 to 30,000), truncated_length (0 to 1,000), and Customer Center Call Lengths.]
Figure 1: Four views of the shape of the length distribution. Upper left: default Stata
histogram of the entire data set; upper right: histogram (1,000 bars) of length distribution
truncated at 1,000 seconds; lower left: default Stata boxplot of the entire data set; lower
right: boxplot of length distribution truncated at 1,000 seconds.
(g) (graphical summaries of the length distribution) Figure 1 presents four views of
the shape of the length distribution. The upper left and lower left panels give
Stata’s default histogram and boxplot of this variable; I produced the upper right
(histogram) and lower right (boxplot) graphs by creating a new version of length
that was truncated at 1,000 seconds (this means that none of the data values above
1,000 were used in these plots, which involved setting aside the largest 429 of the 31,492
observations).
(i) Summarize the shape information provided by the upper left and lower left plots.
Does this summary agree with your conclusions in part (f) above? Explain
briefly. 10 points
(ii) What important feature of the distribution of the length of the customer service
calls is visible in the upper right plot? Can you see that feature in the boxplot
in the lower right panel? Explain briefly. 10 points
(iii) Table 2 presents a raw and relative frequency distribution of a version of the
length variable that’s been truncated at 40 seconds, to hold up a magnifying
glass to the left tail of the distribution; ignore the Percent and Cum. columns
and focus on the columns length and (raw) Freq. (this counts how many customer service calls were 1 second long [314], 2 seconds long [401], and so on).
What pattern in the frequencies do you see in scanning down the page from
1 second to 40 seconds? Does this pattern support the contextual story suggested by the authors of the Alwan et al. book on page 15? Explain briefly.
10 points
(h) Someone says, “Histograms can be tricky: if you don’t choose the number of bars
well, you can get a view of the distributional shape of the variable that can fail to
identify important features.” Do you agree with this statement? Explain briefly.
10 points
Table 2: Raw and relative frequency distribution of the length variable for its smallest 40
values.
     length |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        314        5.92        5.92
          2 |        401        7.56       13.48
          3 |        343        6.47       19.95
          4 |        334        6.30       26.24
          5 |        288        5.43       31.67
          6 |        200        3.77       35.44
          7 |        163        3.07       38.52
          8 |        151        2.85       41.37
          9 |        126        2.38       43.74
         10 |         82        1.55       45.29
         11 |         98        1.85       47.13
         12 |         87        1.64       48.77
         13 |         80        1.51       50.28
         14 |         69        1.30       51.58
         15 |         60        1.13       52.71
         16 |         67        1.26       53.98
         17 |         71        1.34       55.32
         18 |         75        1.41       56.73
         19 |         60        1.13       57.86
         20 |         61        1.15       59.01
         21 |         75        1.41       60.43
         22 |         70        1.32       61.75
         23 |         78        1.47       63.22
         24 |         85        1.60       64.82
         25 |         88        1.66       66.48
         26 |        103        1.94       68.42
         27 |         86        1.62       70.04
         28 |        104        1.96       72.00
         29 |        110        2.07       74.08
         30 |        109        2.06       76.13
         31 |        111        2.09       78.22
         32 |        136        2.56       80.79
         33 |        115        2.17       82.96
         34 |        129        2.43       85.39
         35 |        122        2.30       87.69
         36 |        137        2.58       90.27
         37 |        117        2.21       92.48
         38 |        126        2.38       94.85
         39 |        138        2.60       97.45
         40 |        135        2.55      100.00
------------+-----------------------------------
      Total |      5,304      100.00
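To make the pattern in the Freq. column of Table 2 easier to scan, the counts can be grouped into 5-second bands. The Python sketch below (not Stata) does this; the band width of 5 seconds is an arbitrary choice made here for illustration, and the frequencies are copied from Table 2.

```python
# Raw frequencies for call lengths of 1, 2, ..., 40 seconds, copied from Table 2.
freqs = [
    314, 401, 343, 334, 288, 200, 163, 151, 126, 82,
    98, 87, 80, 69, 60, 67, 71, 75, 60, 61,
    75, 70, 78, 85, 88, 103, 86, 104, 110, 109,
    111, 136, 115, 129, 122, 137, 117, 126, 138, 135,
]

total = sum(freqs)

# Sum the counts in consecutive 5-second bands (1-5, 6-10, ..., 36-40).
band_width = 5
band_totals = [sum(freqs[i:i + band_width]) for i in range(0, len(freqs), band_width)]

print(f"Total calls of 40 seconds or less: {total}")
for index, band_total in enumerate(band_totals):
    start = index * band_width + 1
    end = start + band_width - 1
    print(f"{start:>2}-{end:>2} seconds: {band_total}")
```

The total printed by the sketch should match the Total row of Table 2, which is a quick check that the frequencies were copied correctly.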