Prof. David Draper Department of Statistics University of California, Santa Cruz Fall 2022 STAT 17: Quiz 2 155 total points Name: 1. [25 total points for this problem] You’re writing an article for the product quality evaluation website Consumer Reports, based on a sample survey of the website’s readers, on the reliability of various household appliances. 1,338 readers said they had Maytag washing machines, and 294 of these people reported at least one episode in the last year in which they had to call for someone to come and repair the machine. Only 192 of the 480 people who said they owned a General Electric (GE) washer reported one or more repair calls in the same period. (a) Person A says “There were 294 people with one or more repair episodes for the Maytag machines and only 192 for GE, so GE machines are more reliable.” Person B objects to this argument, noting that “( 1,338 – 294 ) = 1,044 readers reported no problems with their Maytag washers, versus only ( 480 – 192 ) = 288 trouble-free reports for GE, so Maytag machines are better.” Briefly explain why both of these people’s reasoning is incorrect, compute a fair numerical summary of the reliability of these two brands, and compare. (This is not a trick question.) 15 points (b) Intuitively, which of these two reliability estimates is likely to be more accurate? (That is, assuming that the Maytag and GE samples were representatively drawn from the populations of all people owning those brands of washing machine, which estimate should have the smaller amount of uncertainty attached to it as an educated guess for the corresponding population quantity?) Explain briefly. 10 points 2. [130 total points for this problem] Example 1.13 in the Alwan book presents a case study from the world of customer service in banking: we have data on “the length (in 1 seconds) of all n = 31,492 calls made to the customer service center (CSC) of a small bank in a month.” Table 1.1 in the book presents the data for the first 80 calls listed; the full data set is available in the Pages tab of the course Canvas page, under the name stat-17-customer-service-center-data.csv You don’t need to analyze this data set (all the questions about this case study are selfcontained below); I provide it to you in case you wish to explore the data in Stata yourself. Table 1 in this document gives the results from describe, list, count, and univar1 commands that I ran after reading the .csv file into Stata. Note: the list length in 1 / 10 command was run before sorting the data from smallest to largest; the list length in -10 / l command was run after sorting, and the syntax of this command means that its output shows us the 10 largest length values. (a) (variable types) List all of the type attributes of the variable length as coded in the manner indicated in Table 1 (qualitative, quantitative, nominal, ordinal, dichotomous, discrete, continuous, interval-scale, ratio-scale), briefly explaining your choices. (Hint: This variable is conceptually continuous but has been made discrete by taking its ceiling: rounding the length of the call to the nearest integer greater than or equal to the actual length [e.g., 10.4 seconds has been recorded as 11].) 10 points (b) (data curation, part 1) Based on our LDS discussion on 3 Oct 2022 and looking at Table 1, does the length variable have any missing values in it? Is that a good thing or a bad thing? Explain briefly. 10 points (c) (data curation, part 2) What are the conceptually possible values for the length variable? How could you use this to find incorrect data values? Explain briefly, and identify the information in Table 1 that proves that there are no such values in this data set. 10 points 1 univar is not part of the standard set of commands available in Stata; I used the web to figure out how to gain access to it (I can show you how to do this if you’re interested; ask me in Discord if so). 2 Table 1: Numerical ummaries of the length of calls in the CSC case study (Stata output condensed). . describe Contains data Observations: 31,492 Variables: 1 -------------------------------------------------------------------------Variable Storage Display Value name type format label Variable label -------------------------------------------------------------------------length int %8.0g -------------------------------------------------------------------------Sorted by: Note: Dataset has changed since last saved. . list length in 1 / 10 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. . list length in -10 / l +--------+ | length | |--------| | 77 | | 289 | | 128 | | 59 | | 19 | |--------| | 148 | | 157 | | 203 | | 126 | | 118 | +--------+ 31483. 31484. 31485. 31486. 31487. 31488. 31489. 31490. 31491. 31492. +--------+ | length | |--------| | 4165 | | 4277 | | 4394 | | 4577 | | 4845 | |--------| | 5222 | | 5764 | | 8514 | | 17904 | | 28739 | +--------+ . count if length == . 0 . univar length -------------- Quantiles -------------Variable n Mean S.D. Min .25 Mdn .75 Max -------------------------------------------------------------------------length 31492 188.59 312.78 1.00 57.00 115.00 225.00 28739.00 -------------------------------------------------------------------------- 3 (d) (data curation, part 3) Identify the 5 largest values of length in Table 1. Recalling that this variable is measured in seconds, convert each of those 5 values to hours by dividing by (60 · 60) = 3,600. Considering each of the 3 largest values expressed in hours, can you imagine yourself participating in a customer service call that long? What do you conclude about the good data/bad data status of those 3 largest values? Explain briefly. 10 points (e) (data curation, part 4) It can be shown (you’re not asked to show it; ask me for details in Discord if you wish) that the mean of the length variable drops from 188.59 seconds to 186.85 seconds when the 3 largest values of the variable are omitted from the data set. (i) With reference to the LDS discussion of practical significance, do you regard that change (from 188.59 seconds to 186.85 seconds) as practically significant? Explain briefly. (Hint: Compute how large that change is in relative (percentage) terms, as we did in LDS.) 10 points (ii) Would you say that in this case the mean is relatively insensitive to a small number of outliers in the right tail? How can this be true, given how extremely large the outliers are here? Explain briefly. 10 points (f) (distributional shape) Without looking at any plots, what can you conclude about the shape of the length distribution (symmetric, long left- or right-hand tail, unimodal, multimodal) from the following facts (specify in each case2 )? (A) Your answer to part (c) about possible values. (B) The relationship between the mean and the median for 2 For example, what numerical fact did you learn in answering part (c)? What is the relationship between the mean and the median? And so on. 4 .015 .01 Density .005 0 0 .005 Density .01 .015 Figure 1: Four views of the shape of the length distribution. Upper left: default Stata histogram of the entire data set; upper right: histogram (1,000 bars) of length distribution truncated at 1,000 seconds; lower left: default Stata boxplot of the entire data set; lower right: boxplot of length distribution truncated at 1,000 seconds. 0 0 10000 20000 Customer Center Call Lengths 10,000 length 20,000 30000 0 30,000 0 200 200 400 600 Customer Center Call Lengths 400 600 truncated_length 800 800 this variable. (C) The relationship between the mean and the SD for this variable. Explain briefly. 30 points 5 1000 1,000 (g) (graphical summaries of the length distribution) Figure 1 presents four views of the shape of the length distribution. The upper left and lower left panels give Stata’s default histogram and boxplot of this variable; I produced the upper right (histogram) and lower right (boxplot) graphs by creating a new version of length that was truncated at 1,000 seconds (this means that none of the data values above 1,000 were used in these plots; this involved setting aside the largest 429 of the 31,492 observations. (i) Summarize the shape information provided by the upper left and lower left plots. Does this summary agree with your conclusions in part (f) above? Explain briefly. 10 points (ii) What important feature of the distribution of the length of the customer service calls is visible in the upper right plot? Can you see that feature in the boxplot in the lower right panel? Explain briefly. 10 points (iii) Table 2 presents a raw and relative frequency distribution of a version of the length variable that’s been truncated at 40 seconds, to hold up a magnifying glass to the left tail of the distribution; ignore the Percent and Cum. columns and focus on the columns length and (raw) Freq. (this counts how many customer service calls were 1 second long [314], 2 seconds long [401], and so on). What pattern in the frequencies do you see in scanning down the page from 1 second to 40 seconds? Does this pattern support the contextual story suggested by the authors of the Alwan et al. book on page 15? Explain briefly. 10 points (h) Someone says, “Histograms can be tricky: if you don’t choose the number of bars well, you can get a view of the distributional shape of the variable that can fail to identify important features.” Do you agree with this statement? Explain briefly. 10 points 6 Table 2: Raw and relative frequency distribution of the length variable for its smallest 40 values. length | Freq. Percent Cum. ------------+----------------------------------1 | 314 5.92 5.92 2 | 401 7.56 13.48 3 | 343 6.47 19.95 4 | 334 6.30 26.24 5 | 288 5.43 31.67 6 | 200 3.77 35.44 7 | 163 3.07 38.52 8 | 151 2.85 41.37 9 | 126 2.38 43.74 10 | 82 1.55 45.29 11 | 98 1.85 47.13 12 | 87 1.64 48.77 13 | 80 1.51 50.28 14 | 69 1.30 51.58 15 | 60 1.13 52.71 16 | 67 1.26 53.98 17 | 71 1.34 55.32 18 | 75 1.41 56.73 19 | 60 1.13 57.86 20 | 61 1.15 59.01 21 | 75 1.41 60.43 22 | 70 1.32 61.75 23 | 78 1.47 63.22 24 | 85 1.60 64.82 25 | 88 1.66 66.48 26 | 103 1.94 68.42 27 | 86 1.62 70.04 28 | 104 1.96 72.00 29 | 110 2.07 74.08 30 | 109 2.06 76.13 31 | 111 2.09 78.22 32 | 136 2.56 80.79 33 | 115 2.17 82.96 34 | 129 2.43 85.39 35 | 122 2.30 87.69 36 | 137 2.58 90.27 37 | 117 2.21 92.48 38 | 126 2.38 94.85 39 | 138 2.60 97.45 40 | 135 2.55 100.00 ------------+----------------------------------Total | 5,304 100.00 7