Stat 401 F/XW: HW 2 answers.

1) Radon. 7 pts.

a) 2 pt.
[Figure: box plot titled "Radon concentrations". Vertical axis: radon (0 to 40); horizontal axis: group.]
The distribution is skewed, with some extremely large values.
Note: if you use different software that identifies individual extreme values, you see two unusually large points. I would call these extreme values, not outliers, because there is nothing erroneous about these points. They are just houses with high radon concentrations. To a public health official, these would be the two most important points in the data set.

b) 2 pt.
mean = 4.54, median = 3.5
I would say these are not similar. I would expect this because the data are skewed. You could also argue these are similar, because the influence of one or two large values is diluted when you have a total of 42 observations.
Note: Although you could argue either way, I strongly favor 'not similar' because the difference between the two estimates is large relative to the variability in the bulk of the data. The 25th percentile is 2.2; the 75th percentile is 5.5. The difference (IQR) is 3.3. The difference between mean and median (4.54 - 3.5 = 1.04) is about 1/3 of that.

c) 1 pt.
The se of the mean is 0.76 (directly from SAS / JMP output) or calculated as: 4.896/sqrt(42) = 0.76.

d) 2 pt.
Yes, the se is appropriate because the data are a simple random sample. The skewness is irrelevant. The se does not assume any particular distribution.
Note: A lot of folks misunderstood this point. I talked about this briefly in the lecture, but it is not emphasized in the book. In many ways this is analogous to the mean of a skewed distribution. The sample average from a simple random sample is a valid and appropriate estimator of the mean of a population, no matter what shape the population has. Our discussion about mean or median was about which is more appropriate for a particular goal. If the mean is the right quantity, the sample average (and sample se) are appropriate because the sample is a simple random sample.
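The point that the se does not assume any particular distribution can be illustrated with a small simulation. The sketch below is not part of the original answers: it assumes an exponential population (strongly right-skewed, with sd 1) as a stand-in for skewed data, draws repeated samples of n = 42, and compares the standard deviation of the sample averages with sigma/sqrt(n). The function name simulate_se, the seed, and the number of replicates are illustrative choices. The last line rechecks the arithmetic in part (c).

```python
import math
import random
import statistics


def simulate_se(n=42, reps=5000, seed=1):
    """Compare the empirical sd of sample averages with sigma / sqrt(n).

    Assumption: the population is exponential with rate 1 (sd = 1),
    used only as an example of a skewed distribution.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        # Independent draws play the role of a simple random sample.
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    empirical = statistics.stdev(means)   # sd of the sample averages
    theoretical = 1.0 / math.sqrt(n)      # sigma / sqrt(n) with sigma = 1
    return empirical, theoretical


empirical, theoretical = simulate_se()
print(f"empirical sd of averages: {empirical:.4f}")
print(f"sigma / sqrt(n):          {theoretical:.4f}")

# Arithmetic from part (c): 4.896 / sqrt(42)
reported_se = 4.896 / math.sqrt(42)
print(f"se from part (c):         {reported_se:.2f}")
```

If the first two printed values agree closely, that is consistent with the claim that skewness does not invalidate the se. Whether a t-based confidence interval behaves well is a separate question that does depend on the shape of the distribution.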
Skewness doesn't matter for either the average or the se of the average. The distribution will matter when you use t-statistics to compute a confidence interval.
Note: when would the se not be an appropriate measure? When the sample is not a simple random sample. If you had a way to identify houses likely to have high radon and houses likely to have low radon (e.g. based on the underlying soil/rock), you could sample 21 houses from the 'likely high' group and 21 houses from the 'likely low' group. In that case the sample se we have talked about is not appropriate. The survey is a stratified random sample, for which different methods need to be used to estimate the population mean and the se of that estimate.

2) Mutagen and microsatellite nuclei. 7 pt.

a) 1 pt for treatments, 1 pt for randomly assigned and can make causal conclusions
Treatments are control or 80 mg/ml of mutagen. They are randomly assigned to a vial of cells. Yes, causal conclusions can be made, because treatments are randomly assigned.

b) 1 pt
1/126 = 0.0079 = 0.79%.
Explanation: Using the permutation distribution, there is one permutation with a difference of 8.2 or larger. There are 126 possible permutations, so P[diff >= 8.2] = 1/126.

c) 2 pt
3/126 = 0.024 = 2.4% (or 0.0238 = 2.38%)
The two-sided p-value is the probability of a difference >= 8.2 or <= -8.2. There is one value >= 8.2 and two <= -8.2, so three events in total.

d) 1 pt for p-value and correct interpretation of result, 1 pt for estimate of effect size
There are many possible (and reasonable) ways to word this. Here's mine.
There is evidence (p = 0.024) that the mutagen increases the number of microsatellite nuclei. The estimated increase with an 80 mg/ml dose of mutagen is 8.2 nuclei.
Note: When grading we were looking for an appropriate interpretation of the p-value (evidence of) and an estimate of the size of the effect. You could also include the means or medians of the two groups (full credit) or the difference in the medians (full credit).
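The counting behind the p-values in parts (b) and (c) of problem 2 can be sketched in Python. The homework data are not reproduced in these answers, so the values below are hypothetical. The split into 5 treated and 4 control vials is also an assumption, chosen only because it gives 126 possible assignments. The function name permutation_counts is illustrative, and it assumes integer responses so that exact fractions can be used.

```python
from fractions import Fraction
from itertools import combinations


def permutation_counts(treated, control):
    """Enumerate every reassignment of the pooled values to the two groups.

    Returns (number of assignments,
             count with diff >= observed,
             count with |diff| >= |observed|),
    where diff = mean(treated) - mean(control).
    Assumes integer data so the means are exact fractions.
    """
    pooled = list(treated) + list(control)
    k = len(treated)
    total = sum(pooled)

    def diff(treated_sum):
        return Fraction(treated_sum, k) - Fraction(total - treated_sum, len(pooled) - k)

    observed = diff(sum(treated))
    diffs = [
        diff(sum(pooled[i] for i in idx))
        for idx in combinations(range(len(pooled)), k)
    ]
    upper = sum(d >= observed for d in diffs)
    two_sided = sum(abs(d) >= abs(observed) for d in diffs)
    return len(diffs), upper, two_sided


# Hypothetical data, NOT the homework data.
treated = [10, 11, 12, 13, 14]
control = [1, 2, 3, 4]
n_perms, n_upper, n_two_sided = permutation_counts(treated, control)
print(n_perms, n_upper, n_two_sided)

# Arithmetic from the answers above: 1/126 and 3/126.
print(round(1 / 126, 4), round(3 / 126, 4))
```

With the real data, the one-sided p-value is the second count divided by the first, and the two-sided p-value is the third count divided by the first. The hypothetical data here will not reproduce the counts of 1 and 3 reported in the answers.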
3) Problem ./26: 6 points

We were looking for:
Statement of methods (1 pt)
An appropriate graphic (2 pts)
Appropriate numerical results (1 pt) that are consistent with your choice of method (1 pt)
A statement reporting your conclusion (1 pt)

One possible answer is:
The data were analyzed by plotting side-by-side box plots. Because these data are very skewed (Figure 1), we report the median percent pro-environment votes for each party. The Democratic Party (median = 92%) had a much higher pro-environment vote percent than the Republican Party (median = 7.1%). The one independent member of the House of Representatives voted 100% pro-environment.

[Figure: side-by-side box plots. Vertical axis: PctPro (0 to 100); horizontal axis: Party (D, I, R).]
Figure 1. Percent pro-environment votes for members of the Democratic (D) and Republican (R) parties and the one Independent (I).

Note: The statement about the Independent member of Congress is not required.