Chapter 4. Estimation: Two Populations

Parameters of Interest (pages 482, 487, 489)

1. Objective: compare the means of two populations
   Population 1: mean $\mu_X$
   Population 2: mean $\mu_Y$
   Parameter of interest: $\mu_X - \mu_Y$
   Interpretation:
   $\mu_X - \mu_Y = 0$: means of the two populations are equal
   $\mu_X - \mu_Y > 0$: mean of measurements in Population 1 is larger than mean of Population 2
   $\mu_X - \mu_Y < 0$: mean of measurements in Population 1 is smaller than mean of Population 2

2. Objective: compare the proportions of two populations
   Population 1: proportion $p_1$
   Population 2: proportion $p_2$
   Parameter of interest: $p_1 - p_2$
   Interpretation:
   $p_1 - p_2 = 0$: proportions of the two populations are equal
   $p_1 - p_2 > 0$: proportion of elements possessing the characteristic of interest is larger in Population 1 than in Population 2
   $p_1 - p_2 < 0$: proportion of elements possessing the characteristic of interest is smaller in Population 1 than in Population 2

3. Objective: compare the variances of two populations
   Population 1: variance $\sigma_X^2$
   Population 2: variance $\sigma_Y^2$
   Parameter of interest: $\sigma_X^2 / \sigma_Y^2$
   Interpretation:
   $\sigma_X^2 / \sigma_Y^2 = 1$: variances of the two populations are equal
   $\sigma_X^2 / \sigma_Y^2 > 1$: measures in Population 1 are more varied than measures in Population 2
   $\sigma_X^2 / \sigma_Y^2 < 1$: measures in Population 1 are less varied than measures in Population 2

Note: Variances are comparable when the means of the two populations are not too different from each other.

Two Approaches to Sampling

1. Select two independent samples
2. Select two related samples (matched samples)

Independent Sampling from Two Populations (pages 482-483)

In independent sampling, the selection of the random sample from one population does not affect the selection of the random sample from the other population.

Example: (Exercise 3d, page 485) The principal of a school wishes to determine if the Grade 6 boys are better in mathematics than the Grade 6 girls. A random sample of boys was selected.
Then a random sample of girls was selected. All of the students in the two random samples were asked to take a standardized test in mathematics and their scores were determined.

Population 1 (mean $\mu_X$, variance $\sigma_X^2$): Sample 1 of size $n_1$, $(X_1, X_2, \ldots, X_{n_1})$, with statistics $\bar{X}$, $S_X^2$
Population 2 (mean $\mu_Y$, variance $\sigma_Y^2$): Sample 2 of size $n_2$, $(Y_1, Y_2, \ldots, Y_{n_2})$, with statistics $\bar{Y}$, $S_Y^2$
Use these samples to infer on $\mu_X - \mu_Y$.

Matched Sampling/Paired Sampling (pages 483-484)

Recall: An experiment is a data collection method where the researcher intervenes by controlling the conditions that may affect the response variable by: (i) using a randomization mechanism in assigning the treatments and (ii) controlling the identified extraneous variables. By doing so, the researcher can isolate the effects of the explanatory variable on the response variable and clarify the direction and strength of their relationship.

In many experiments, the available experimental units may differ considerably with respect to extraneous variables. Failure to control these differences may mask the true difference between the population means of the response variable for the two treatments or, even worse, create an illusion of a difference between means when there is actually none.

In matched sampling, the elements of the samples drawn from the two populations are carefully matched in pairs so that the two elements in each pair are as similar as possible with respect to identified extraneous variables. The observations in each pair being compared are therefore related or associated by design.

Methods of Generating Paired Data (page 484)

Paired data: $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$

Forming paired data is beneficial when the two measures in the $i$th pair, $X_i$ and $Y_i$, exhibit a strong direct relationship, so that when $X_i$ is high then so is $Y_i$ as a result of sharing the same values on the extraneous variable/s.
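To illustrate why a strong direct relationship within pairs is helpful, here is a minimal sketch using made-up paired scores (hypothetical values, not taken from the textbook). It relies on the standard identity Var(X - Y) = Var(X) + Var(Y) - 2Cov(X, Y): when the covariance is positive, the differences vary less than two unrelated measurements would. The helper name `sample_cov` is an assumption introduced only for this illustration.

```python
from statistics import mean, variance

# Hypothetical paired measurements (illustration only, not from the text).
# Each pair shares the same experimental unit, so x and y rise and fall together.
x = [60.0, 72.0, 55.0, 80.0, 66.0]
y = [58.0, 70.0, 56.0, 77.0, 63.0]

def sample_cov(a, b):
    """Sample covariance with divisor n - 1 (same divisor as statistics.variance)."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)

# Differences within each pair
d = [xi - yi for xi, yi in zip(x, y)]

var_d = variance(d)                  # sample variance of the differences
var_sum = variance(x) + variance(y)  # what the variance would be if Cov(X, Y) were 0
cov_xy = sample_cov(x, y)

print(f"Var(X) + Var(Y)          = {var_sum:.3f}")
print(f"Cov(X, Y)                = {cov_xy:.3f}")
print(f"Var(D)                   = {var_d:.3f}")
print(f"Var(X)+Var(Y)-2Cov(X,Y)  = {var_sum - 2 * cov_xy:.3f}")
```

With positively related pairs, the variance of the differences is much smaller than the sum of the two variances, which is what makes the paired estimate of the mean difference more precise.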
Method 1: all experimental units in the sample receive both treatments

Example: (Exercise 3c, page 485) Two formulations of a new whitening soap are to be compared as to their whitening effect. A random sample of 40 potential users of the soap is selected. Each person uses a randomization mechanism to determine which formulation is applied on the left arm, so that the other formulation is applied on the right arm. After two weeks, the effect of each formulation is measured.

Experimental unit: person
Response variable: degree of fairness of a person
2 Treatments: Formulation A and Formulation B of whitening soap
$X_i$ = degree of fairness of arm of $i$th person where Formulation A was applied
$Y_i$ = degree of fairness of arm of $i$th person where Formulation B was applied
Parameter of interest: $\mu_X - \mu_Y$
Extraneous variables: original degree of fairness of person, biological characteristics that affect person's reaction to any treatment

Method 2: taking measurements before and after the treatment is applied to the experimental units (can be viewed as a special case of Method 1)

Example: (Exercise 3a, page 485) A police department wants to assess the effects of an obvious radar trap on the speeds of cars. Ten cars are randomly selected on a highway, and their speeds are measured just before a radar trap comes into view and right after they pass the obvious radar trap.

Experimental unit: car
Response variable: speed of car
Treatment: visible radar trap
$X_i$ = speed of $i$th car before seeing radar trap
$Y_i$ = speed of $i$th car after seeing radar trap
Parameter of interest: $\mu_X - \mu_Y$
Extraneous variables: driver, type of car, age of car, etc.

Method 3: use naturally occurring pairs such as twins, or husbands and wives, or siblings, etc.
Method 4: form pairs of experimental units that have the same values or levels of the extraneous variable

Example: (Exercise 3e, page 485) A science teacher has developed new teaching materials and wants to evaluate the effectiveness of these materials in improving the students' comprehension. Prior to sampling, the teacher formed pairs of students so that students belonging to the same pair received about the same final grade in science the previous term. The teacher then selected a sample of pairs of students. The teacher randomly selects which student in each pair will be taught using the new materials so that the other one will be taught using the old materials. At the end of the term, all the students in the sample were given a standardized test.

Experimental unit: student
Response variable: score in standardized test
2 Treatments: old teaching materials, new teaching materials
$X_i$ = score of $i$th student taught using the old teaching materials
$Y_i$ = score of $i$th student taught using the new teaching materials
Parameter of interest: $\mu_X - \mu_Y$
Extraneous variables: aptitude in science

Point Estimation (pages 486-490)

Parameter                       Point Estimator
$\mu_X - \mu_Y$                 $\bar{X} - \bar{Y}$
$p_1 - p_2$                     $\hat{P}_1 - \hat{P}_2$
$\sigma_X^2 / \sigma_Y^2$       $S_X^2 / S_Y^2$

Confidence Interval Estimators for $\mu_X - \mu_Y$ Based on 2 Independent Samples (page 495)

Case 1: $\sigma_X^2$ and $\sigma_Y^2$ are known
$\left( (\bar{X} - \bar{Y}) - z_{\alpha/2} \sqrt{\frac{\sigma_X^2}{n_1} + \frac{\sigma_Y^2}{n_2}},\ (\bar{X} - \bar{Y}) + z_{\alpha/2} \sqrt{\frac{\sigma_X^2}{n_1} + \frac{\sigma_Y^2}{n_2}} \right)$

Case 2: $\sigma_X^2$ and $\sigma_Y^2$ are unknown but $\sigma_X^2 = \sigma_Y^2$
$\left( (\bar{X} - \bar{Y}) - t_{\alpha/2}(v = n_1 + n_2 - 2) \sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)},\ (\bar{X} - \bar{Y}) + t_{\alpha/2}(v = n_1 + n_2 - 2) \sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \right)$
where $S_p^2 = \frac{(n_1 - 1) S_X^2 + (n_2 - 1) S_Y^2}{n_1 + n_2 - 2}$

Case 3: $\sigma_X^2$ and $\sigma_Y^2$ are unknown but $\sigma_X^2 \neq \sigma_Y^2$
$\left( (\bar{X} - \bar{Y}) - t_{\alpha/2}(v) \sqrt{\frac{S_X^2}{n_1} + \frac{S_Y^2}{n_2}},\ (\bar{X} - \bar{Y}) + t_{\alpha/2}(v) \sqrt{\frac{S_X^2}{n_1} + \frac{S_Y^2}{n_2}} \right)$
where $v = \frac{\left( \frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2} \right)^2}{\frac{(s_X^2 / n_1)^2}{n_1 - 1} + \frac{(s_Y^2 / n_2)^2}{n_2 - 1}}$

Case 4: $\sigma_X^2$ and $\sigma_Y^2$ are unknown, but $n_1 > 30$ and $n_2 > 30$
$\left( (\bar{X} - \bar{Y}) - z_{\alpha/2} \sqrt{\frac{S_X^2}{n_1} + \frac{S_Y^2}{n_2}},\ (\bar{X} - \bar{Y}) + z_{\alpha/2} \sqrt{\frac{S_X^2}{n_1} + \frac{S_Y^2}{n_2}} \right)$
Assumptions (pages 491-495)

All formulas were derived under the assumption that the two independent random samples come from normal distributions. These procedures are robust in the sense that they still provide good approximate $(1-\alpha)100\%$ confidence interval estimates even if there are slight deviations from the assumption of normality. Because of the Central Limit Theorem, the assumption of normality can be dropped as long as both sample sizes are greater than 30.

Formula 2 was derived under the additional assumption that the two unknown variances are equal to each other. However, the procedure is also robust in the sense that it still provides good approximate $(1-\alpha)100\%$ confidence interval estimates even if the 2 population variances are not equal to each other, so long as the sample sizes are equal to each other. This is one of the reasons why we consider using equal sample sizes when we design our experiment.

Formula 3 adjusts the degrees of freedom (downwards). The result of this is a longer interval estimate. Formula 3 also does not pool the information from the two samples to estimate a common variance, since the variances of the two populations are actually not equal. However, these two adjustments become negligible when both sample sizes are large.

The degrees of freedom in Formula 3 is a computed value based on the sample sizes and sample variances, so the resulting value will not always be an integer. Since our table presents the values for integral degrees of freedom only, we have to round off the computed value. We will take the more conservative approach of always rounding down instead of using the standard rules of rounding.

Formula 4 is relevant only when we cannot get the t-value from the t-table because the degrees of freedom is very large.
Again, we just replace t by z because, as the degrees of freedom approaches infinity, the t-distribution approaches the standard normal distribution.

Flowchart (page 496)

We still need to satisfy the assumption of normality for the two populations (or at least approximate normality) when at least one of the sample sizes is less than 30.

Interpretation

If the computed interval estimate contains 0, then we do not have sufficient evidence to conclude that the two means are different from each other.
If the computed interval estimate does not contain 0, then we can conclude with a $(1-\alpha)100\%$ degree of confidence that the two means are different from each other.
If the computed interval estimate contains positive values only, then we are highly confident that $\mu_X$ is greater than $\mu_Y$.
If the computed interval estimate contains negative values only, then we are highly confident that $\mu_X$ is less than $\mu_Y$.

Examples

Examples 14.7 and 14.8 (pages 496-498)

Exercise 1 for Section 14.4 (page 500). Suppose that company officials were concerned about the length of time a particular drug retained its potency. A random sample of $n_1 = 20$ bottles of the drug was drawn from the production line and analyzed for potency. A second sample of $n_2 = 25$ bottles was drawn and stored in a regulated environment for a period of one year. The readings obtained are shown below.

Sample 1: $\bar{X} = 10.37$, $S_X = 0.3234$
Sample 2: $\bar{Y} = 9.83$, $S_Y = 0.2406$

Estimate the difference between the mean potency for all bottles coming off the production line and the mean potency for all bottles retained for a period of one year using a 95% confidence interval, assuming (i) the population variances are equal and (ii) the population variances are unequal.
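The exercise can also be checked numerically. The following is a minimal sketch, not a definitive implementation: the function names (`pooled_ci`, `welch_df`, `unpooled_ci`) are hypothetical, and the critical values 2.017 (43 degrees of freedom) and 2.032 (34 degrees of freedom) are typed in by hand from a t-table, as the chapter does, rather than computed.

```python
from math import floor, sqrt

def pooled_ci(xbar, ybar, sx, sy, n1, n2, t_crit):
    """Case 2 interval (equal variances); t_crit is the table value at n1 + n2 - 2 d.f."""
    sp2 = ((n1 - 1) * sx**2 + (n2 - 1) * sy**2) / (n1 + n2 - 2)  # pooled variance
    margin = t_crit * sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = xbar - ybar
    return diff - margin, diff + margin

def welch_df(sx, sy, n1, n2):
    """Computed degrees of freedom v for Case 3 (generally not an integer)."""
    a, b = sx**2 / n1, sy**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

def unpooled_ci(xbar, ybar, sx, sy, n1, n2, t_crit):
    """Case 3 interval (unequal variances); t_crit is the table value at floor(v) d.f."""
    margin = t_crit * sqrt(sx**2 / n1 + sy**2 / n2)
    diff = xbar - ybar
    return diff - margin, diff + margin

# Summary statistics from the drug potency exercise
xbar, sx, n1 = 10.37, 0.3234, 20
ybar, sy, n2 = 9.83, 0.2406, 25

# (i) equal variances: t_{0.025}(v = 43) = 2.017 from the t-table
pooled = pooled_ci(xbar, ybar, sx, sy, n1, n2, t_crit=2.017)

# (ii) unequal variances: round v down, then t_{0.025}(v = 34) = 2.032 from the t-table
v = welch_df(sx, sy, n1, n2)
unpooled = unpooled_ci(xbar, ybar, sx, sy, n1, n2, t_crit=2.032)

print("pooled interval:  ", tuple(round(e, 2) for e in pooled))
print("computed v:       ", round(v, 3), "-> use", floor(v))
print("unpooled interval:", tuple(round(e, 2) for e in unpooled))
```

Passing the critical value as an argument keeps the sketch within the standard library and mirrors the table-lookup step; a statistics package could supply the t quantile instead.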
Assuming normality and equal variances:

$\bar{X} - \bar{Y} = 10.37 - 9.83 = 0.54$
$S_p^2 = \frac{(n_1 - 1) S_X^2 + (n_2 - 1) S_Y^2}{n_1 + n_2 - 2} = \frac{(20 - 1)(0.3234)^2 + (25 - 1)(0.2406)^2}{20 + 25 - 2} = 0.078523$
$t_{\alpha/2}(v = n_1 + n_2 - 2) = t_{0.05/2}(v = 20 + 25 - 2) = t_{0.025}(v = 43) = 2.017$
$\sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = \sqrt{0.078523 \left( \frac{1}{20} + \frac{1}{25} \right)} = 0.084066$
Interval estimate: $\left( 0.54 - (2.017)(0.084066),\ 0.54 + (2.017)(0.084066) \right) = (0.37, 0.71)$

Assuming normality but unequal variances:

$v = \frac{\left( \frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2} \right)^2}{\frac{(s_X^2 / n_1)^2}{n_1 - 1} + \frac{(s_Y^2 / n_2)^2}{n_2 - 1}} = \frac{\left( \frac{(0.3234)^2}{20} + \frac{(0.2406)^2}{25} \right)^2}{\frac{\left( (0.3234)^2 / 20 \right)^2}{20 - 1} + \frac{\left( (0.2406)^2 / 25 \right)^2}{25 - 1}} = 34.237$, rounded down to 34
$t_{0.025}(v = 34) = 2.032$
$\sqrt{\frac{S_X^2}{n_1} + \frac{S_Y^2}{n_2}} = \sqrt{\frac{0.3234^2}{20} + \frac{0.2406^2}{25}} = 0.086861455$
Interval estimate: $\left( 0.54 - (2.032)(0.08686),\ 0.54 + (2.032)(0.08686) \right) = (0.36, 0.72)$

Preliminaries on Inference on $\mu_X - \mu_Y$ Based on 2 Related Samples (page 498)

Sample data = $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$
Define: $D_i = X_i - Y_i$, $i = 1, 2, \ldots, n$ (Note: The $D_i$s are all random variables.)

Assumptions:
$(D_1, D_2, \ldots, D_n)$ is a random sample
$D_i \sim \text{Normal}(\mu_D, \sigma_D^2)$

Following the same procedure used to estimate the population mean based on a random sample from a normal distribution:
1. the point estimator for the mean $\mu_D$ is $\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n}$, the sample mean of the $D_i$s
2. the standard error of $\bar{D}$ is $\sigma_D / \sqrt{n}$
3. the estimator for the standard error is $S_D / \sqrt{n}$, where $S_D = \sqrt{\frac{\sum_{i=1}^{n} (D_i - \bar{D})^2}{n - 1}}$ is the standard deviation of the $D_i$s

Remarks on $\mu_D$ and $\sigma_D^2$

We defined $D_i = X_i - Y_i$, $i = 1, 2, \ldots, n$.
We assumed $(D_1, D_2, \ldots, D_n)$ is a random sample from a normal distribution with parameters $\mu_D$ and $\sigma_D^2$.
Since $D_i = X_i - Y_i$, then $\mu_D = \mu_X - \mu_Y$ and $\sigma_D^2 = \sigma_X^2 + \sigma_Y^2 - 2\,\text{Cov}(X, Y)$ where
$\mu_X$ = common mean of the $X_i$s
$\mu_Y$ = common mean of the $Y_i$s
$\sigma_X^2$ = common variance of the $X_i$s
$\sigma_Y^2$ = common variance of the $Y_i$s
$\text{Cov}(X, Y)$ = common covariance of the $(X_i, Y_i)$s

$\text{Cov}(X, Y)$ is a measure of the linear relationship of X and Y. If X and Y are independent, then $\text{Cov}(X, Y) = 0$. (The converse, though, is not always true.)
If the value of Y increases as X increases, then $\text{Cov}(X, Y) > 0$; but if the value of Y decreases as X increases, then $\text{Cov}(X, Y) < 0$.

Confidence Interval Estimator for $\mu_D = \mu_X - \mu_Y$ Based on 2 Related Samples (page 499)

A $(1-\alpha)100\%$ confidence interval estimator for the mean of the differences, $\mu_D = \mu_X - \mu_Y$, based on matched or paired samples is given by:

$\left( \bar{D} - t_{\alpha/2}(v = n - 1) \frac{S_D}{\sqrt{n}},\ \bar{D} + t_{\alpha/2}(v = n - 1) \frac{S_D}{\sqrt{n}} \right)$

where $t_{\alpha/2}(v = n - 1)$ is the $100(1 - \alpha/2)$th percentile of the t-distribution with $v = n - 1$ degrees of freedom.

Procedure:
Step 1: Compute $D_i = X_i - Y_i$, $i = 1, 2, \ldots, n$.
Step 2: Compute the mean and standard deviation of the $D_i$s.
Step 3: Use the t-table to determine $t_{\alpha/2}(v = n - 1)$, where n = number of pairs.
Step 4: Plug the values computed in Steps 2 and 3 into the formula.

Examples

Example 14.10 (pages 499-500)

Exercise 18, page 399, MGB. To test two promising new lines of hybrid corn under normal farming conditions, a seed company selected eight farms at random in Iowa and planted both lines in experimental plots on each farm. The yields (converted to bushels per acre) for the eight locations were:

Line A:  86  87  56  93  84  93  75  79
Line B:  80  79  58  91  77  82  74  66

Assuming that the two yields are jointly normally distributed, estimate the difference between the mean yields by a 95% confidence interval.

The parameter of interest is $\mu_D = \mu_X - \mu_Y$, where $\mu_X$ = mean yield using line A of hybrid corn and $\mu_Y$ = mean yield using line B of hybrid corn.

Step 1: Compute $D_i = X_i - Y_i$, $i = 1, 2, \ldots, 8$.
$D_1 = 86 - 80 = 6$      $D_5 = 84 - 77 = 7$
$D_2 = 87 - 79 = 8$      $D_6 = 93 - 82 = 11$
$D_3 = 56 - 58 = -2$     $D_7 = 75 - 74 = 1$
$D_4 = 93 - 91 = 2$      $D_8 = 79 - 66 = 13$

Step 2: Compute $\bar{D}$ and $S_D$. (Use the standard deviation mode of your calculator by entering the values of $D_i$, $i = 1, 2, \ldots, 8$.)
$\bar{D} = 5.75$ and $S_D = 5.1199888$

Step 3: Use the t-table to determine the value of $t_{0.05/2}(v = 8 - 1)$.
It is $t_{0.025}(v = 7) = 2.365$.

Step 4: Plug the computed values into the following formula:
$\left( \bar{D} - t_{\alpha/2}(v = n - 1) \frac{S_D}{\sqrt{n}},\ \bar{D} + t_{\alpha/2}(v = n - 1) \frac{S_D}{\sqrt{n}} \right)$
The 95% confidence interval estimate for the mean difference is (1.47, 10.03).

Assignment 11

Always show your solution. Present the confidence interval estimator used and show the plugged-in values. No immediate rounding. Whenever necessary, round off the final answer only, to 3 decimal places.

1. A manufacturer of office machines is considering the production of a new word processor. The decision to start large-scale production of the new machines will be based on the comparison of the mean operating speed using the standard machines ($\mu_X$) and the mean operating speed using the new machines ($\mu_Y$). Since operators of the machine have varying abilities, a random sample of 20 typists was selected and the speed of each typist in the sample was observed once using the new word processor and once using the standard word processor. The collected data on the speed (in minutes) are as follows:

Typist (i)   Standard Processor ($X_i$)   New Processor ($Y_i$)
 1           60.2                         57.2
 2           58.7                         57.4
 3           59.4                         56.4
 4           60.3                         58.5
 5           61.7                         60.1
 6           60.2                         61.4
 7           64.1                         61.9
 8           63.2                         60.4
 9           62.4                         60.0
10           57.8                         56.8
11           55.4                         50.2
12           61.2                         58.4
13           64.7                         63.5
14           64.1                         60.5
15           62.9                         62.2
16           65.8                         66.3
17           69.3                         68.5
18           56.4                         56.6
19           58.5                         58.3
20           63.7                         60.2

Assuming normality, compute a 95% confidence interval estimate for $\mu_X - \mu_Y$.

2. A study is conducted between high school students and college students to compare their proficiency at writing computer programs for microcomputers. For this study, the researchers wish to compare the mean time (in minutes) of high school students to write an error-free program ($\mu_X$) with the mean time (in minutes) of college students to write an error-free program ($\mu_Y$).
Data taken from two independent samples were summarized as follows:

Statistics            High school   College
Mean time             70            84
Standard deviation    10            12
Sample size           10            10

Assuming normality of both populations, compute a 90% confidence interval estimate for $\mu_X - \mu_Y$.

Confidence Interval Estimator for $p_1 - p_2$ Based on 2 Independent Samples (page 501)

An approximate $(1-\alpha)100\%$ confidence interval estimator for $p_1 - p_2$ when the sample sizes are large is given by:

$\left( (\hat{P}_1 - \hat{P}_2) - z_{\alpha/2} \sqrt{\frac{\hat{P}_1 (1 - \hat{P}_1)}{n_1} + \frac{\hat{P}_2 (1 - \hat{P}_2)}{n_2}},\ (\hat{P}_1 - \hat{P}_2) + z_{\alpha/2} \sqrt{\frac{\hat{P}_1 (1 - \hat{P}_1)}{n_1} + \frac{\hat{P}_2 (1 - \hat{P}_2)}{n_2}} \right)$

where $z_{\alpha/2}$ is the $(1 - \alpha/2)100$th percentile of the standard normal distribution.

Note: This confidence interval estimator will provide a good approximate $(1-\alpha)100\%$ confidence interval estimate for $p_1 - p_2$ when both sample sizes are large. Thus, we require that both sample sizes are at least 30. Furthermore, we have the condition that both $p_1$ and $p_2$ are not expected to be too close to 0 or 1.

Interpretation

Exercise 1 for Section 14.5 (page 503). Suppose that a 95% confidence interval estimate for the difference is constructed.
a) For what range of values is it not possible to conclude that the population proportions are different from one another?
b) For what range of values can you conclude, with 95% confidence, that the proportion in population 1 is statistically higher than the proportion in population 2?
c) For what range of values can you conclude, with 95% confidence, that the proportion in population 1 is statistically lower than the proportion in population 2?

Examples

Examples 14.11 and 14.12 (pages 501 to 503)

A company is considering the introduction of a new formulation of its Zippi Cola soft drink. It first conducts a series of taste tests comparing Zippi to the leading brand of cola. In the first test, based on the original formula of Zippi, 120 of 500 people who tried it preferred Zippi.
The test was repeated with a new group of 1000 tasters to compare the new formulation of Zippi Cola to the leading brand. This time, 300 of the 1000 tasters preferred the new Zippi to the leading brand. Compute an approximate 90% confidence interval estimate for the difference between the population proportions who prefer Zippi over the leading brand of cola.

Parameter of interest: $p_1 - p_2$ where
$p_1$ = proportion who prefer the original formula of Zippi over the leading brand
$p_2$ = proportion who prefer the new formulation of Zippi over the leading brand

Point estimates: $\hat{P}_1 = 120/500 = 0.24$ and $\hat{P}_2 = 300/1000 = 0.3$
Point estimate for $p_1 - p_2$: $\hat{P}_1 - \hat{P}_2 = 0.24 - 0.3 = -0.06$
$z_{0.1/2} = z_{0.05} = 1.645$

Interval estimator:
$\left( (\hat{P}_1 - \hat{P}_2) - z_{\alpha/2} \sqrt{\frac{\hat{P}_1 (1 - \hat{P}_1)}{n_1} + \frac{\hat{P}_2 (1 - \hat{P}_2)}{n_2}},\ (\hat{P}_1 - \hat{P}_2) + z_{\alpha/2} \sqrt{\frac{\hat{P}_1 (1 - \hat{P}_1)}{n_1} + \frac{\hat{P}_2 (1 - \hat{P}_2)}{n_2}} \right)$

Interval estimate:
$\left( -0.06 - (1.645) \sqrt{\frac{(0.24)(0.76)}{500} + \frac{(0.3)(0.7)}{1000}},\ -0.06 + (1.645) \sqrt{\frac{(0.24)(0.76)}{500} + \frac{(0.3)(0.7)}{1000}} \right) = (-0.099, -0.021)$
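The Zippi computation can be reproduced with a short sketch. The function name `two_proportion_ci` is hypothetical, and the z percentile is obtained from Python's standard-library `statistics.NormalDist` rather than from a printed table, so it is slightly more precise than the rounded 1.645 used by hand.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, confidence):
    """Approximate large-sample interval for p1 - p2 from two independent samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 100(1 - alpha/2)th percentile of N(0, 1)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se

# Zippi Cola taste tests: 120 of 500 (original formula), 300 of 1000 (new formulation)
lower, upper = two_proportion_ci(120, 500, 300, 1000, confidence=0.90)
print(f"90% interval for p1 - p2: ({lower:.3f}, {upper:.3f})")
```

Because the whole interval lies below 0, the data suggest that the proportion preferring the new formulation is higher than the proportion preferring the original formula.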