STT 825 F01
HOMEWORK #4 SOLUTIONS
Due Monday, Oct. 29, 2001

1. (7.5 points) We have a population of 500 accounts, stratified by balance of the account:
   Stratum 1: balance $0 up to $500
   Stratum 2: balance $500 up to $2000
   Stratum 3: balance $2000 and up.
Yesterday's fees were measured on all accounts, and population characteristics are

   Stratum      number of accounts   mean fee      standard deviation
   1            150                  $2.05         $2.07
   2            300                  $3.93         $2.75
   3            50                   $8.22         $4.03
   population   500                  $ (a) below   $3.21

a. Give optimal allocation (assuming sampling costs are equal across strata) of a sample of n = 100 accounts.

   h     N_h S_h                allocation n_h
   1     150 x 2.07 = 310.5     (310.5/1337) x 100 = 23.2   (use n1 = 23)
   2     300 x 2.75 = 825       (825/1337) x 100 = 61.7     (use n2 = 62)
   3     50 x 4.03 = 201.5      (201.5/1337) x 100 = 15.1   (use n3 = 15)
   sum   1337

b. Give the sampling weight for an observation from Stratum #1.
   N1/n1 = 150/23 = 6.52

c. Suppose we had another population of accounts with strata sizes 225, 225, 50 and stratum standard deviations $2.00, $2.00, $21.00.
   i. Find the optimal allocation of 100 observations.

   h     N_h S_h                n_h
   1     225 x 2.00 = 450       (450/1950) x 100 rounds to 23
   2     225 x 2.00 = 450       23
   3     50 x 21.00 = 1050      54
   sum   1950

   ii. Compare n3 with N3. What allocation would you recommend as "optimal?"
   N3 < n3, so as the optimal allocation take n3 = 50 (all of stratum 3), then split the remaining 50 observations equally between strata 1 and 2: n1 = n2 = 25.
----------------------------------------------------------------------
2. (6.5 points) A company has 800 employees who travel on company business. The employees are classified into 200 Level I employees and 600 Level II employees. To audit the amount of mileage claimed last month, a stratified random sample of 200 (70 from Level I and 130 from Level II) is taken.
a. What is the sampling weight for a Level I employee?
   N1/n1 = 200/70 = 2.86
b. Note the sample statistics reported below; y-measurements are in units of 100 miles.

   Level   sample size   sample mean   sample standard deviation
   I       70            11.20         3.467
   II      130           8.44          3.00

   i. Give an unbiased estimate of the total mileage claimed by only the 200 Level I employees.
   200 x 11.20 = 2240
   ii. What is the standard error of your estimate in (i)?
   Just use SRS theory on your stratum I sample.
   Estimated variance = N^2 (1 - n/N) s^2/n = 200^2 (1 - 70/200) (3.467)^2/70 = 4464.6, so the SE of the estimated total is 66.82.
   iii. Is there a problem with random n1 in question (i)?
   No, Level I employees are a stratum, and n1 is not random.
c. Management later realized that the employees came from two different locations, A and B, and thought travel amounts might differ by location. They decided to POST-stratify their stratified random sample by location. The numbers in the population and sample means are given below.

   Level   Area   Number in Population   Number in Sample   Sample Mean
   I       A      160                    56                 11.32
   I       B      40                     14                 11.08
   II      A      200                    20                 8.15
   II      B      400                    110                8.49

   i. Post-stratifying by location, estimate the total mileage claimed by the 800 employees (do NOT compute the standard error).
   Define 4 strata using the level*area combinations:

   Stratum       N_h    ybar_h   N_h ybar_h
   1 = I, A      160    11.32    1811.2
   2 = I, B      40     11.08    443.2
   3 = II, A     200    8.15     1630
   4 = II, B     400    8.49     3396
   sum = 7280.4 (the estimated total, in 100-mile units)

   ybar_str = 7280.4/800 = 9.10 (in 100-mile units per employee)
   ii. Is the number of Level I employees in Area A a random variable?
   YES, because you're post-stratifying on Area.
----------------------------------------------------------------------
3. (9 points) Do text problem #3, Chapter 5. (Hint: y_ij = 1 if there is an error in the jth field of the ith claim, = 0 if no error, and t_i = total number of field errors for the ith claim. The error rate is a rate PER FIELD.)
a. t_i = total number of field errors, ith claim. Note that y_ij = 1 or 0 (1 if the jth field of the ith claim is in error). The sum of the t_i over i in the sample = 37. Thus, t̂ = 828 (37/85) = 360.42.
You also have to compute (via describe) s_t^2 = .558263 (the sample variance of the 85 t_i values observed). The ERROR RATE = population mean at the ssu level (field level) because that's where the 0,1 measurements are taken.
   The estimated error rate is ŷ = t̂/K = 360.42/178020 = .002025, where K = 828 x 215 = 178,020 fields.
   SE(ŷ) = (equation 5.6) = .000357
b. t̂ = 828 (37/85) = 360.42, SE(t̂) = K (SE(ŷ)) = 63.55
c. Using an SRS of 85 x 215 = 18,275 fields from a population of 178,020 fields, the estimate would be the same: 37 field errors/18,275 = .00202.
   The estimated variance for the SRS = (equation 2.16) = 9.92 x 10^-8.
   The estimated variance in part (a) for cluster sampling is (.000357)^2 = 1.27 x 10^-7.
   Est. var. cluster / est. var. SRS = 1.29, so clustering increased the variance by 29%.
----------------------------------------------------------------------
4. (19 points) Refer to your handout with the 4 cluster designs. In all designs, we will select enough clusters to get 16 ssu's. (We have already worked with Design #1 in class.)
a. Find V[t̂], w_ij, and ICC for each of the following.
   Formulas used: equation (5.2) for V[t̂], w_ij = N/n, ICC = 1 - NM(pop. MSW)/[(NM-1) S^2].
   S^2 = population MSTot = 4308/39 = 110.46, the same for all cluster choices since it's an ssu characteristic.
   i. Design 2: M = 4, N = 10, n = 4
      S_t^2 = M x MSBet = 4 (83) = 332
      V[t̂] = 4980
      w_ij = 10/4 = 2.5
      ICC = 1 - 40(119)/[(39)(110.46)] = -.104
   ii. Design 3: M = 8, N = 5, n = 2
      S_t^2 = M x MSBet = 8 (173) = 1384
      V[t̂] = 10,380
      w_ij = 5/2 = 2.5
      ICC = 1 - 40(103)/[(39)(110.46)] = 1 - .956 = .044
   iii. Design 4: M = 2, N = 20, n = 8
      S_t^2 = M x MSBet = 2 x 203.1 = 406.2
      V[t̂] = 12,186
      w_ij = 20/8 = 2.5
      ICC = 1 - 40(22.5)/[(39)(110.46)] = 1 - .209 = .791
   iv. SRS of size 16 ssu's: n = 16, N = 40
      Using formula 2.13, and noting that S^2 = MSTot, which doesn't depend on the design,
      V[t̂] = 6,627.7
      w_ij = 40/16 = 2.5
      ICC = (make a good guess here) = 1 - 0 = 1, since M = 1, MSW = 0.
b.
Refer to your answers in (a), and recall that for Design 1, V[t̂] = 7,680. Rank the designs 1-4 and SRS according to V[t̂]. Which design gives the highest variance? The lowest variance?
   Highest variance to lowest variance: Design 4 (12,186), Design 3 (10,380), Design 1 (7,680), SRS (6,627.7), Design 2 (4,980).
c. Is S^2 affected by the choice of clusters? Is S_t^2 affected by the choice of clusters?
   S^2 is NOT affected since it's a characteristic of the ssu's; S_t^2 is affected since it depends on the cluster totals, which depend on the cluster choices.
d. For Design #2, we took an SRS of n = 4 clusters and selected clusters #3, 4, 7, 10.
   i. Use the computer to get the ANOVA on this sample.

   One-way ANOVA: yij versus cluster3
   Analysis of Variance for yij
   Source     DF    SS      MS     F      P
   cluster3   3     15      5      0.05   0.985
   Error      12    1260    105
   Total      15    1275

   Level   N   Mean    StDev
   3       4   57.08   6.77
   4       4   58.78   12.15
   7       4   56.22   12.98
   10      4   58.10   7.61
   Pooled StDev = 10.25
   [Plot: individual 95% CIs for each cluster mean, based on pooled StDev; axis from 49.0 to 70.0]

   ii. What is the total yield for Cluster #3?
   t_3 = 4 x 57.08 = 228.3
   iii. What is the variance in acre yields for the 4 acres in Cluster #3?
   s_3^2 = (6.77)^2 = 45.8
   iv. Use the data to estimate the total yield and its standard error.
   Using formula (5.1), t̂ = 2286.8; SE(t̂) = (formula 5.3) = 11.97, where s_t^2 = M (sample MSBet).
   v. Give an unbiased estimate of the ssu variance (variance in acre yields).
   Use the formula {N(M-1)(sample MSW) + (N-1)(sample MSBet)}/(NM-1) = 83.4
   Note that sample MSTot = 85.4 (not the same).
----------------------------------------------------------------------
5. (8 points) Refer to the data sheet attached. We will take a systematic sample of size 25 from this population. There are 200 claims for expenses by 8 sales representatives.
a.
How many possible systematic samples are there (size 25)?
   N/n = 200/25 = 8
b. What is the probability that unit #17 is selected? What is the sampling weight for unit #17?
   Probability = 1/8, sampling weight = 8
c. Using List #1, circle the expenses in a systematic sample of size 25 (period of 8) which begins with unit #5. Be sure to read across rows.
   (* indicates a measurement in the sample)

   EXPENSES (List #1) (listed by date)
    30.0   19.0   32.2   29.8  *43.7   19.2   21.8   28.8   15.9   28.1
    23.8   20.4  *21.7   13.7   24.2   33.2   18.4   15.8   17.1   25.6
   *20.6   25.1   11.3   19.4   30.6   37.4   29.8   28.9  *16.2   25.5
    24.1   23.3   39.7   34.2   31.6   15.2  *40.6   24.5   22.7   33.6
    33.0   30.1   31.1   19.2  *28.5   13.0   25.9   34.6   21.2   26.8
    34.1   47.2  *35.1   22.8   23.1   38.5   31.4   40.1   14.8   31.5
   *45.5   35.5   28.7   24.0   59.9   35.0   49.4   30.1  *32.7   22.5
    19.7   34.6   32.8   27.1   34.2   22.2  *13.7   20.2   12.5   13.6
    20.8   25.0   29.5   11.7  *17.8   16.8   22.7   32.8   13.0    9.6
    18.5   13.9   *8.8   12.8   14.0   12.5   20.9   13.1   19.5   17.5
   *17.6    9.3    5.9    8.4   15.0   13.2    9.2   23.5  *11.8   20.0
    12.8   22.3   19.6   16.1   14.8   11.6  *17.1   16.9   20.5   22.0
     8.9   18.4   25.1   12.9  *17.5   15.8   22.4   11.6   21.1   21.8
    24.6   14.0   *6.3   10.0   27.9   23.5   17.9   19.3   16.2   24.2
   *17.3    7.5   14.7   15.8   13.7    7.4   11.4    5.8  *15.7   21.3
    29.1   35.7   32.9   35.1   39.5   28.3  *28.5   37.5   31.5   25.6
    21.2   46.6   31.2   40.4  *33.6   19.2   29.1   25.5   29.5   40.5
    40.9   28.7  *39.7   18.5   25.3   39.6   34.1   26.7   39.9   30.9
   *37.7   30.4   15.0   43.2   31.4   37.9   32.9   34.3  *33.1   45.6
    35.2   24.4   25.8   33.5   36.8   32.5  *30.0   26.2   45.8   22.4

d. Using List #2, circle the expenses in a systematic sample of size 25 (period of 8) which begins with unit #5. Be sure to read across rows.
   The sample is all measurements in column 5 (under rep5).

   Expenses listed by rep (List #2)
   Row   rep1   rep2   rep3   rep4   rep5   rep6   rep7   rep8
   1     30.0   37.4   34.1   22.2   17.6   15.8   29.1   39.6
   2     19.0   29.8   47.2   13.7    9.3   22.4   35.7   34.1
   3     32.2   28.9   35.1   20.2    5.9   11.6   32.9   26.7
   4     29.8   16.2   22.8   12.5    8.4   21.1   35.1   39.9
   5     43.7   25.5   23.1   13.6   15.0   21.8   39.5   30.9
   6     19.2   24.1   38.5   20.8   13.2   24.6   28.3   37.7
   7     21.8   23.3   31.4   25.0    9.2   14.0   28.5   30.4
   8     28.8   39.7   40.1   29.5   23.5    6.3   37.5   15.0
   9     15.9   34.2   14.8   11.7   11.8   10.0   31.5   43.2
   10    28.1   31.6   31.5   17.8   20.0   27.9   25.6   31.4
   11    23.8   15.2   45.5   16.8   12.8   23.5   21.2   37.9
   12    20.4   40.6   35.5   22.7   22.3   17.9   46.6   32.9
   13    21.7   24.5   28.7   32.8   19.6   19.3   31.2   34.3
   14    13.7   22.7   24.0   13.0   16.1   16.2   40.4   33.1
   15    24.2   33.6   59.9    9.6   14.8   24.2   33.6   45.6
   16    33.2   33.0   35.0   18.5   11.6   17.3   19.2   35.2
   17    18.4   30.1   49.4   13.9   17.1    7.5   29.1   24.4
   18    15.8   31.1   30.1    8.8   16.9   14.7   25.5   25.8
   19    17.1   19.2   32.7   12.8   20.5   15.8   29.5   33.5
   20    25.6   28.5   22.5   14.0   22.0   13.7   40.5   36.8
   21    20.6   13.0   19.7   12.5    8.9    7.4   40.9   32.5
   22    25.1   25.9   34.6   20.9   18.4   11.4   28.7   30.0
   23    11.3   34.6   32.8   13.1   25.1    5.8   39.7   26.2
   24    19.4   21.2   27.1   19.5   12.9   15.7   18.5   45.8
   25    30.6   26.8   34.2   17.5   17.5   21.3   25.3   22.4

e. Comparing the population ANOVA's for List #1 and List #2, which list would give systematic samples more representative of the population?
   List #1. Its systematic samples (clusters) are more alike: their means all lie between 21.19 and 27.52, while the List #2 sample means are as low as 15.616 and as high as 33.212.
f. In designing a survey of the 200 claim amounts, and noting the population ANOVA for List #2 (by reps), would you use sales representative as a CLUSTER or a STRATUM?
   As a STRATUM: sales representative would be useful as a criterion for stratification.
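
The systematic-sampling answers in 5(a)-(e) can be checked with a short Python sketch. It is offered only as an illustration and is not part of the original solutions: the function name systematic_sample and all variable names are hypothetical, and the sketch assumes n divides N evenly (as with N = 200, n = 25). The rep5 values are copied from the rep5 column of List #2.

```python
def systematic_sample(N, n, start):
    """Return the 1-based unit numbers of a systematic sample.

    Assumes n divides N evenly, so the sampling period is k = N / n.
    """
    k = N // n  # sampling period (8 for N = 200, n = 25)
    if not 1 <= start <= k:
        raise ValueError("start must be between 1 and the period k")
    return [start + k * j for j in range(n)]


# Parts (c)/(d): sample of size 25 from 200 units, beginning with unit #5.
units = systematic_sample(N=200, n=25, start=5)

# List #2 is read across rows of 8 (one column per rep), so unit u falls in
# column ((u - 1) % 8) + 1. Every sampled unit lands in the same column.
columns = {(u - 1) % 8 + 1 for u in units}

# The 25 values under rep5 in List #2, i.e. the part (d) sample.
rep5 = [17.6, 9.3, 5.9, 8.4, 15.0, 13.2, 9.2, 23.5, 11.8, 20.0,
        12.8, 22.3, 19.6, 16.1, 14.8, 11.6, 17.1, 16.9, 20.5, 22.0,
        8.9, 18.4, 25.1, 12.9, 17.5]
rep5_mean = sum(rep5) / len(rep5)

print("units:", units)
print("columns hit in List #2:", columns)
print("mean of the rep5 sample:", round(rep5_mean, 3))
```

The sampled units are 5, 13, 21, ..., 197, which all fall in column 5 of List #2, and the rep5 sample mean works out to 15.616, the lowest List #2 mean quoted in part (e).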