Examination
Date: 2007-04-26
Statistics for Business and Economics
Module 3: Statistical survey methodology

Name: ..........................................
Personal code number: ..........................................
Time of examination: 9.00-13.00, ÖP sal (hall) 5
Aid: Pocket calculator. Formulae are handed out.

Write your answers on loose paper, except for the multiple-choice questions, where you mark your answer. The exercises are evaluated by the teacher. For the grade pass, 50% of the maximum mark is required. To pass with distinction, 75% is required. Note that an omitted or imperfect explanation leads to a reduction of marks.

Exercise   1   2   3   4   5   6   Sum
Points    10   9  10   6   5  10    50

Break a leg! /J

Note: Examination with plausible solutions
Author: Joakim Malmdin, 2007-04-30

Exercise 1, 10p

These statements are either true or false.

1) Stratification may produce a smaller bound on the error of estimation, B. This is especially true when the strata are heterogeneous.
True False

2) A drawback with systematic sampling is the risk of periodicity.
True False

3) If our findings are statistically significant, they are too unusual to occur often just by chance.
True False

4) The margin of error includes only random sampling error.
True False

5) A census is a sample survey that attempts to include the entire population in the sample.
True False

6) The use of a control group in an experiment allows us to control the effects of lurking variables.
True False

7) Clusters should be as heterogeneous as possible within, and one cluster should look very much like another, in order for the economic advantages of cluster sampling to pay off.
True False

8) The finite population correction (or correction factor) takes into account the fact that an estimate based on a sample of n = 10 from a population of N = 20 000 items contains more information about the population than a sample of n = 10 from a population of N = 20.
True False

9) Reliability has to do with the quality of measurement. In its everyday sense, reliability is the "repeatability" of your measures.
True False

10) We usually favour stem-plots when we have a small number of observations, and histograms for larger amounts of data.
True False

Exercise 2, 9p

For each of the following sample situations, identify the target population and the frame used. Comment on the coverage, and also identify the sampling technique used and any shortcomings. Finally, suggest another way of doing the investigation with the same target population.

1. A sociologist is interested in determining the extent to which tenth-graders in the USA are self-motivated. A sample of four high schools in Large City is taken and all tenth-graders in each school are interviewed. 3p

Answer [very short]:
Target population: Tenth-graders in the USA
Frame: High schools in Large City
Coverage: Undercoverage
Technique: Cluster sampling of high schools
Shortcomings: Bias due to nonsampling errors (undercoverage)
Suggestion: The risk of bias arises because only high schools from one city are selected. If the target population is tenth-graders in the USA, we have to include all tenth-graders in the frame. One way of doing this is to randomly select cities and then schools within each city.

2. The host of a local radio talk show in London wonders if people who are actively religious are happier than those who are not. He asks the listeners to call in, and the station receives calls from 48 listeners who voice their opinions. 3p

Answer [very short]:
Target population: London citizens (possibly others too, but unspecified)
Frame: Listeners to the show
Coverage: Undercoverage (possibly overcoverage)
Technique: Voluntary response
Shortcomings: Bias due to sampling errors (voluntary response) and nonsampling errors (undercoverage)
Suggestion: To base an investigation on voluntary answers (self-selection) is not a good idea.
A more serious attempt to reach a trustworthy result would be to randomly select London citizens, for example from the telephone book, i.e. simple random sampling.

3. Every tenth person passing outside the library at Umeå University between 2 pm and 4 pm on the first day of a term is asked whether he or she prefers written or oral exams this term. 3p

Answer [very short]:
Target population: Students at Umeå University in a specific term
Frame: People passing by the library between 2 pm and 4 pm on the first day of a specific term
Coverage: Over- and undercoverage
Technique: Systematic sampling
Shortcomings: Bias due to nonsampling errors (undercoverage)
Suggestion: The risk of bias in this survey is connected to the choice of time and place. We could instead wait until registration is completed and use lists of all students admitted to the University for the term in question. Systematic or simple random sampling could then be used.

Exercise 3, 10p

Figure 1: Newspapers sold displayed in two graphs (a pie chart and a bar chart of the newspapers News for you, Daily words, Time for news and World in words)

a) In these two graphs (Figure 1) the sales figures of the four biggest newspapers in News Island are displayed. Choose the more appropriate graph and give two reasons for your choice. 3p

Answer: The bar chart is the more appropriate graph. 1) The four newspapers do not sum to a whole (the pie chart gives the impression that the whole population is represented). 2) It is easier to compare the bars in order to see the differences in the number of papers sold.

b) Can we use the graph in Figure 2? Explain. 2p

Answer: No. With nominal data (the four newspapers) it does not make sense to connect the categories with a line.

c) Explain why a pictogram would have been improper to use. 2p

Answer: A pictogram (of newspapers) would be misleading, since both the height and the width must be increased in order to avoid distortion, i.e.
it is not only the height of the picture that gets larger; so does the width, and thereby the area. [An alternative is to keep the same width regardless of height, but then the pictures are distorted.]

Figure 2: Newspapers sold displayed with combining line (vertical axis: Count; categories: News for you, Daily words, Time for news, World in words)

d) Describe the data set displayed in Figure 3. 3p

Figure 3: Home run counts (boxplot of Barry Bonds's 19 home run counts; axis from 0 to 80)

Answer: This is a boxplot, which is a graph of the five-number summary. All observations (except one outlier) lie between approximately 18 and 50 (the minimum and the maximum), the range covered by the whiskers. The interquartile range stretches from approximately 30 (first quartile) to 45 (third quartile) and constitutes the box, which contains 50% of the observations. The measure of central location, the median, is also a measure of relative standing. Its value is approximately 37, meaning that half of the observations are larger than 37 and half are smaller. Since the difference between the first and second quartiles is approximately equal to the difference between the second and third quartiles, a good guess is that the distribution is approximately symmetric.

Exercise 4, 6p

A forester wants to estimate the total number of farm acres planted in trees for a state. Since the number of acres of trees varies considerably with the size of the farm, she decides to stratify on farm size. The 240 farms in the state are placed in one of four categories according to size. A stratified random sample of 40 farms, selected by using proportional allocation, yields the results shown in Table 1 on the number of acres planted in trees.
Table 1: Acres of trees on farms

Stratum   Farm size        N_i   n_i   Acres of trees on the sampled farms
I         0-200 acres       86    14   97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21
II        201-400 acres     72    12   125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190
III       401-600 acres     52     9   142, 256, 310, 440, 495, 510, 320, 396, 196
IV        Over 600 acres    30     5   167, 655, 220, 540, 780

Figure 4: Acres of trees (stratified random sample of 40 farms; vertical axis: Acres of trees, 0 to 800; one plot each for Stratum I to Stratum IV)

a) Estimate the total number of acres of trees on farms in the state by using the information given in Table 1. 3p

Answer: Use the formulae in A.2.1 and A.5 to find the answer. First calculate the mean value of the random variable of interest (Y = "number of farm acres planted in trees") for each stratum:

\[ \bar{Y}_1 = 63.3571, \quad \bar{Y}_2 = 183, \quad \bar{Y}_3 = 340.5555, \quad \bar{Y}_4 = 472.4 \]

Now we have that

\[ \bar{Y}_{st} = \frac{1}{N} \sum_{i=1}^{L} N_i \bar{Y}_i = \frac{1}{240} \sum_{i=1}^{4} N_i \bar{Y}_i = \frac{1}{240} \left( 86 \times 63.3571 + 72 \times 183 + 52 \times 340.5555 + 30 \times 472.4 \right) = 210.43999 \]

and the total is estimated as

\[ \hat{\tau} = N \bar{Y}_{st} = 240 \times 210.43999 = 50\,505.6 \approx 50\,506 \]

∴ The total number of acres of trees on farms in the state, $\hat{\tau}$, is estimated to be 50 506.

b) Place a bound on the error of estimation. 3p

Answer: The margin of error is given by the formula in A.6,

\[ B = 2 \sqrt{\hat{V}(\hat{\theta})} \]

where $\hat{\theta}$ (the estimate of interest) here is $\hat{\tau}$ (i.e. $N \bar{Y}_{st}$), so we have to estimate the variance of $\bar{Y}_{st}$ in order to estimate the margin of error of the total.
We use the formula in A.2.1 to estimate the variance of $\bar{Y}_{st}$:

\[ \hat{V}(\bar{Y}_{st}) = \frac{1}{N^2} \sum_{i=1}^{L} N_i^2 \, \frac{N_i - n_i}{N_i} \, \frac{s_i^2}{n_i} \]

where the estimated variance in each stratum is

\[ s_1^2 = 1071.786, \quad s_2^2 = 9054.182, \quad s_3^2 = 16794.28, \quad s_4^2 = 72376.3 \]

The estimated variance of $\bar{Y}_{st}$ is then

\[ \hat{V}(\bar{Y}_{st}) = \frac{1}{240^2} \left( 474\,035.6366 + 3\,259\,505.52 + 4\,172\,445.564 + 10\,856\,445 \right) = 325.7366618 \]

The variance of the estimated total is now estimated according to the formula in A.5,

\[ \hat{V}(N \bar{Y}_{st}) = N^2 \hat{V}(\bar{Y}_{st}) = 240^2 \times 325.7366618 = 18\,762\,431.72 \]

and, finally, the margin of error of $\hat{\tau}$ is calculated as

\[ B = 2 \sqrt{18\,762\,431.72} = 8663.1245 \approx 8663 \]

∴ The margin of error of $\hat{\tau}$ is 8663.

Exercise 5, 5p

Suppose we are interested in determining the average daily sales (income) for a chain of grocery stores. In Figure 5 we see the true sales figures for the last 12 days.

Figure 5: Daily sales (vertical axis: Income, 50 to 150; horizontal axis: Days, 1 to 12)

a) Suppose we want to sample days in order to estimate the average daily sales. Comment upon the use of systematic sampling. 2p

Answer: There is periodicity, since sales seem to peak every second or every third day. The effectiveness of a 1-in-k sample therefore depends on the value we choose for k. The risk is that we over- or underestimate the parameter of interest. [We could change the random starting point several times in order to reduce the possibility of choosing observations from the same relative position in a periodic population. Note that the corresponding terminology in time series analysis is seasonal variation, which refers to systematic patterns that occur over short repetitive calendar periods (with a duration of less than one year).]

b) Another task is to estimate the average number of customers per grocery store for the chain. The 300 stores are listed in 50 geographical clusters of 6 each, and a simple random sample of three clusters is selected (see Table 2). 3p

Answer: Use the formula in A.4.1 to estimate the population mean, $\mu$, with the sample mean $\bar{Y}$.
The random variable of interest is defined as Y = "number of customers per grocery store".

Table 2: Number of customers

Cluster   Number of customers
1         34, 56, 78, 56, 100, 87
2         47, 212, 220, 34, 68, 90
3         98, 67, 88, 99, 29, 58

The estimator is

\[ \bar{Y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \]

where $m_1 = m_2 = m_3 = m$ (the number of elements in each cluster) and $y_i$ is the total of all observations in the ith cluster. Here the total sample size is equal to $nm$ elements. We get

\[ \bar{Y} = \frac{\sum_{i=1}^{n} y_i}{nm} = \frac{411 + 671 + 439}{3(6)} = 84.5 \]

∴ The average number of customers per grocery store is estimated to be 84.5.

Exercise 6, 10p

Fill in the right answer.

a) The drawback of a web-survey with voluntary answers is that it... 2p
   may be biased.
   costs too much.
   is a very simple random sample.

b) What do we call the distribution in Figure 6? 2p
   unimodal.
   bimodal.
   multimodal.

Figure 6: Distribution of X (histogram; vertical axis: Frequency, 0 to 20; horizontal axis: X, 0 to 100)

c) When performing a significance test we would like to know about the sample size. Why? 2p
   The p-value depends on the sample size.
   The sample size depends on the p-value.
   The true value of the parameter depends on the sample size.

d) We can describe the overall pattern of a histogram or stem-plot by giving its shape, centre, and... 2p
   spread.
   height.
   stem.

e) When the respondent has not responded to any of the questions we call this... 2p
   random sampling error.
   nonresponse error.
   undercoverage.

A Formulae for estimation

A.1 Srs

A.1.1 The mean and the variance of the mean

\[ \bar{\theta} = \frac{\sum_{i=1}^{n} \theta_i}{n} \]

\[ \hat{V}(\bar{\theta}) = \frac{s^2}{n} \left( \frac{N - n}{N} \right) \]

where N is the size of the population, n the size of the sample, and

\[ s^2 = \frac{\sum_{i=1}^{n} (\theta_i - \bar{\theta})^2}{n - 1} \]

A.1.2 The proportion and the variance of the proportion

\[ \hat{p} = \frac{\sum_{i=1}^{n} \theta_i}{n} \]

\[ \hat{V}(\hat{p}) = \frac{\hat{p}\hat{q}}{n - 1} \left( \frac{N - n}{N} \right) \]

where $\hat{q} = 1 - \hat{p}$.

A.2 Strs

L = number of strata, $N_i$ = number of sampling units in stratum i, and $n_i$ = number of sampled units in stratum i.
A.2.1 The mean and the variance of the mean

\[ \bar{\theta}_{st} = \frac{1}{N} \sum_{i=1}^{L} N_i \bar{\theta}_i \]

\[ \hat{V}(\bar{\theta}_{st}) = \frac{1}{N^2} \sum_{i=1}^{L} N_i^2 \, \frac{N_i - n_i}{N_i} \, \frac{s_i^2}{n_i} \]

A.2.2 The proportion and the variance of the proportion

\[ \hat{p}_{st} = \frac{1}{N} \sum_{i=1}^{L} N_i \hat{p}_i \]

\[ \hat{V}(\hat{p}_{st}) = \frac{1}{N^2} \sum_{i=1}^{L} N_i^2 \, \hat{V}(\hat{p}_i) \]

A.2.3 Neyman allocation

\[ n_i = n \, \frac{N_i \sigma_i}{\sum_{k=1}^{L} N_k \sigma_k} \]

assuming that the cost per observation is equal for all strata. $\sigma_i$ is the standard deviation in stratum i (often estimated with $s_i$).

A.2.4 Proportional allocation

\[ n_i = n \, \frac{N_i}{\sum_{k=1}^{L} N_k} = n \, \frac{N_i}{N} \]

assuming that the cost per observation and the standard deviation are equal for all strata.

A.3 Sys

A.3.1 The mean and the variance of the mean

\[ \bar{\theta}_{sy} = \frac{\sum_{i=1}^{n} \theta_i}{n} \]

Assuming a randomly ordered population, we have

\[ \hat{V}(\bar{\theta}_{sy}) = \frac{s^2}{n} \left( \frac{N - n}{N} \right) \]

A.3.2 The proportion and the variance of the proportion

\[ \hat{p}_{sy} = \frac{\sum_{i=1}^{n} \theta_i}{n} \]

\[ \hat{V}(\hat{p}_{sy}) = \frac{\hat{p}_{sy}\hat{q}_{sy}}{n - 1} \left( \frac{N - n}{N} \right) \]

A.4 Clus

N = the number of clusters in the population, n = the number of clusters selected in a simple random sample, $m_i$ = the number of elements in cluster i, $\bar{m}$ = the average cluster size for the sample, M = the number of elements in the population, $\bar{M}$ = the average cluster size for the population ($M/N$), $\theta_i$ = the total of all observations in the ith cluster.

A.4.1 The mean and the variance of the mean

\[ \bar{\theta} = \frac{\sum_{i=1}^{n} \theta_i}{\sum_{i=1}^{n} m_i} \]

\[ \hat{V}(\bar{\theta}) = \frac{N - n}{N n \bar{M}^2} \cdot \frac{\sum_{i=1}^{n} (\theta_i - \bar{\theta} m_i)^2}{n - 1} \]

A.4.2 The proportion and the variance of the proportion

\[ \hat{p} = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} m_i} \]

where $a_i$ denotes the total number of elements in cluster i that possess the characteristic of interest.

\[ \hat{V}(\hat{p}) = \frac{N - n}{N n \bar{M}^2} \cdot \frac{\sum_{i=1}^{n} (a_i - \hat{p} m_i)^2}{n - 1} \]

A.5 The total and the variance of the total

\[ \hat{\tau} = N \bar{\theta} \]

\[ \hat{V}(N \bar{\theta}) = N^2 \hat{V}(\bar{\theta}) \]

A.6 Sample size and the margin of error

Sample size required to estimate $\mu$ with margin of error B when using Srs or Sys:

\[ n = \frac{N \sigma^2}{(N - 1) D + \sigma^2} \]

where $D = B^2 / 4$.

The margin of error is

\[ B = 2 \sqrt{\hat{V}(\hat{\theta})} \]
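
As a cross-check of the formulae in A.2.1, A.5 and A.6, the following Python sketch recomputes the stratified estimates of Exercise 4 from the data in Table 1. It is only an illustration and not part of the examination: the function name `stratified_total` and the dictionary layout are assumptions made here, and the stratum variances are computed with the n - 1 denominator, as in A.1.1.

```python
from math import sqrt
from statistics import mean, variance

# Table 1: stratum -> (N_i, acres of trees on the sampled farms).
# The dictionary layout is an assumption made for this sketch.
strata = {
    "I": (86, [97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21]),
    "II": (72, [125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190]),
    "III": (52, [142, 256, 310, 440, 495, 510, 320, 396, 196]),
    "IV": (30, [167, 655, 220, 540, 780]),
}


def stratified_total(strata):
    """Return (stratified mean, estimated total, bound B) for a stratified sample.

    Hypothetical helper: applies A.2.1 (mean and its variance),
    A.5 (total and its variance) and A.6 (B = 2 * sqrt(variance)).
    """
    N = sum(N_i for N_i, _ in strata.values())

    # A.2.1: weighted mean of the stratum sample means.
    mean_st = sum(N_i * mean(y) for N_i, y in strata.values()) / N

    # A.2.1: variance of the stratified mean, with the finite
    # population correction (N_i - n_i) / N_i in each stratum.
    var_mean = sum(
        N_i**2 * (N_i - len(y)) / N_i * variance(y) / len(y)
        for N_i, y in strata.values()
    ) / N**2

    total = N * mean_st                 # A.5: estimated total
    bound = 2 * sqrt(N**2 * var_mean)   # A.5 and A.6: bound on the total
    return mean_st, total, bound


mean_st, total, bound = stratified_total(strata)
print(round(mean_st, 2), round(total), round(bound))
```

Rounded to whole acres, the sketch should reproduce the values obtained in Exercise 4, namely an estimated total of about 50 506 and a margin of error of about 8663. Small differences in the decimals are expected, because the worked solution rounds the stratum means and variances before combining them.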