8 Testing Hypothesis for Two Population Parameters

SECTIONS
8.1 Testing Hypothesis for Two Population Means
  8.1.1 Two Independent Samples
  8.1.2 Paired Sample
8.2 Testing Hypothesis for Two Population Variances
8.3 Testing Hypothesis for Two Population Proportions

CHAPTER OBJECTIVES
In Chapter 7, we discussed how to test hypotheses about parameters of a single population. In this chapter, we discuss hypothesis testing to compare the parameters of two populations. Section 8.1 discusses the t-test for two population means, both when the samples are independent and when they are paired. Section 8.2 discusses the F-test for two population variances. Section 8.3 discusses the Z-test for two population proportions when the samples are large enough.

8.1 Testing Hypothesis for Two Population Means

• There are many examples of comparing the means of two populations:
- Is there a difference between the starting salaries of male and female graduates among this year's college graduates?
- Is there a difference in the weight of the products produced on two production lines?
- Did a special training course for typists actually increase their typing speed?

• A comparison of two population means ($\mu_1$ and $\mu_2$) can be made by testing whether the difference of the population means is greater than, less than, or equal to zero. The method of comparison differs depending on whether the samples are extracted independently from each population or not (the latter case is referred to as paired samples).

8.1.1 Two Independent Samples

• Testing hypotheses for two population means can be divided into three types, depending on the alternative hypothesis:

1) $H_0: \mu_1 - \mu_2 = D_0$,  $H_1: \mu_1 - \mu_2 > D_0$
2) $H_0: \mu_1 - \mu_2 = D_0$,  $H_1: \mu_1 - \mu_2 < D_0$
3) $H_0: \mu_1 - \mu_2 = D_0$,  $H_1: \mu_1 - \mu_2 \ne D_0$

Here $D_0$ is the hypothesized value of the difference of the population means.
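The tests below rest on the fact, developed next, that for independent samples the difference of sample means $\bar{X}_1 - \bar{X}_2$ has mean $\mu_1 - \mu_2$ and variance $\sigma_1^2/n_1 + \sigma_2^2/n_2$. This can be checked with a minimal simulation sketch; the population parameters here are made-up illustration values, not data from the text.

```python
import random
import statistics

random.seed(1)

# Hypothetical illustration values: population 1 ~ N(270, 12^2), population 2 ~ N(270, 10^2)
mu1, sigma1, n1 = 270.0, 12.0, 15
mu2, sigma2, n2 = 270.0, 10.0, 14

# Draw many independent pairs of samples and record the difference of the sample means
diffs = []
for _ in range(20000):
    xbar1 = statistics.fmean(random.gauss(mu1, sigma1) for _ in range(n1))
    xbar2 = statistics.fmean(random.gauss(mu2, sigma2) for _ in range(n2))
    diffs.append(xbar1 - xbar2)

print(statistics.fmean(diffs))      # close to mu1 - mu2 = 0
print(statistics.variance(diffs))   # close to sigma1^2/n1 + sigma2^2/n2 = 16.74
print(sigma1**2 / n1 + sigma2**2 / n2)
```

The simulated mean and variance of the differences agree closely with the theoretical values, which is what justifies the normal-based test statistics that follow.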
• When samples are selected independently from each population, the estimator of the difference of the population means $\mu_1 - \mu_2$ is the difference of the sample means, $\bar{X}_1 - \bar{X}_2$. The sampling distribution of all possible differences of sample means is approximately a normal distribution with mean $\mu_1 - \mu_2$ and variance $\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$ if both sample sizes are large enough. Since the population variances $\sigma_1^2$ and $\sigma_2^2$ are usually unknown, their estimates $s_1^2$ and $s_2^2$ are used to test the hypothesis. The test statistic differs slightly depending on the assumption about the two population variances.

• If the two populations follow normal distributions and their variances can be assumed equal, testing for the difference of the two population means uses the following statistic:

$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - D_0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

• $s_p^2$ is an estimator of the common population variance, called the pooled variance, which is a weighted average of the two sample variances $s_1^2$ and $s_2^2$ using the sample sizes as weights when the population variances are assumed to be the same:

$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

The statistic above follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom, and it is used to test the difference of the two population means as follows:

Table 8.1.1 Testing hypothesis of two population means - independent samples, populations are normal distributions, two population variances are assumed to be equal

Type of Hypothesis / Decision Rule
1) $H_1: \mu_1 - \mu_2 > D_0$: If $t > t_{\alpha;\,n_1+n_2-2}$, then reject $H_0$, else accept $H_0$
2) $H_1: \mu_1 - \mu_2 < D_0$: If $t < -t_{\alpha;\,n_1+n_2-2}$, then reject $H_0$, else accept $H_0$
3) $H_1: \mu_1 - \mu_2 \ne D_0$: If $|t| > t_{\alpha/2;\,n_1+n_2-2}$, then reject $H_0$, else accept $H_0$

※ If the sample sizes are large enough ($n_1 + n_2 - 2 \ge 30$), the t-distribution is approximately the standard normal distribution, and the decision rule may use the standard normal distribution instead.

Example 8.1.1
Two machines produce cookies at a factory, and the average weight of a cookie bag should be 270g. Cookie bags were sampled from each of the two machines to examine their weight.
The average weight of 15 cookie bags extracted from machine 1 was 275g with a standard deviation of 12g, and the average weight of 14 cookie bags extracted from machine 2 was 269g with a standard deviation of 10g. Test whether the weights of cookie bags produced by the two machines are different at the 1% significance level. Check the test result using『eStatU』.

Answer
The hypothesis of this problem is $H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \ne 0$. Hence, the decision rule is: 'If $|t| > t_{0.005;\,27}$, then reject $H_0$.'

The information in this example can be summarized as follows:
$n_1 = 15$, $\bar{x}_1 = 275$, $s_1 = 12$;  $n_2 = 14$, $\bar{x}_2 = 269$, $s_2 = 10$

Therefore,
$s_p^2 = \dfrac{(15-1)\,12^2 + (14-1)\,10^2}{15+14-2} = 122.81$
$t = \dfrac{275 - 269}{\sqrt{122.81}\sqrt{\frac{1}{15} + \frac{1}{14}}} = 1.457$,  $t_{0.005;\,27} = 2.7707$

Since 1.457 < 2.7707, $H_0$ cannot be rejected.

In the『eStatU』menu, select 'Testing Hypothesis $\mu_1, \mu_2$'. In the window shown in <Figure 8.1.1>, check the 'not equal' alternative hypothesis at [Hypothesis], check the equal-variance assumption at [Test Type], check the 1% significance level, check 'independent sample', and enter the sample sizes, sample means, and sample variances as in <Figure 8.1.1>.

<Figure 8.1.1> Testing hypothesis for two population means using『eStatU』

Clicking the [Execute] button will show the result of the hypothesis test as in <Figure 8.1.2>.

<Figure 8.1.2> Testing hypothesis for two population means – case of the same population variances

• If the variances of the two populations are different, the test statistic based on $\bar{X}_1 - \bar{X}_2$ does not follow a t-distribution even if the populations are normally distributed. Testing for two population means when the population variances are different is called the Behrens-Fisher problem, and several methods to solve this problem have been studied. The Satterthwaite method approximates the degrees of freedom of the t-distribution in the decision rules of Table 8.1.1 using the statistic

$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

with approximate degrees of freedom

$\nu = \dfrac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$

• Table 8.1.2 summarizes the decision rules when the two population variances are different.
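Both variance assumptions can be checked numerically for the cookie-bag summary statistics of Example 8.1.1. A minimal sketch computing the pooled-variance t statistic, the Welch-type statistic, and the Satterthwaite approximate degrees of freedom:

```python
import math

# Summary statistics from Example 8.1.1
n1, xbar1, s1 = 15, 275.0, 12.0
n2, xbar2, s2 = 14, 269.0, 10.0
D0 = 0.0  # hypothesized difference of the means

# Equal-variance case: pooled variance and t with n1 + n2 - 2 degrees of freedom
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_pooled = (xbar1 - xbar2 - D0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Unequal-variance case: t statistic and Satterthwaite approximate degrees of freedom
v1, v2 = s1**2 / n1, s2**2 / n2
t_welch = (xbar1 - xbar2 - D0) / math.sqrt(v1 + v2)
df_satt = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(round(sp2, 2), round(t_pooled, 3))    # 122.81 1.457
print(round(t_welch, 3), round(df_satt, 1)) # 1.466 26.7
```

Neither statistic exceeds the 1%-level critical value near 2.77, so $H_0$ is not rejected under either variance assumption, matching the conclusion of the example.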
Table 8.1.2 Testing hypothesis of two population means - independent samples, populations are normal distributions, two population variances are assumed to be different

Type of Hypothesis / Decision Rule
1) $H_1: \mu_1 - \mu_2 > D_0$: If $t > t_{\alpha;\,\nu}$, then reject $H_0$, else accept $H_0$
2) $H_1: \mu_1 - \mu_2 < D_0$: If $t < -t_{\alpha;\,\nu}$, then reject $H_0$, else accept $H_0$
3) $H_1: \mu_1 - \mu_2 \ne D_0$: If $|t| > t_{\alpha/2;\,\nu}$, then reject $H_0$, else accept $H_0$

Example 8.1.2
If the two population variances are assumed to be different in [Example 8.1.1], test whether the weights of cookie bags produced by the two machines are equal or not at the 1% significance level. Check the test result using『eStatU』.

Answer
Since the population variances are different, the degrees of freedom of the t-distribution are approximated as follows:

$\nu = \dfrac{\left(\frac{12^2}{15} + \frac{10^2}{14}\right)^2}{\frac{(12^2/15)^2}{14} + \frac{(10^2/14)^2}{13}} \approx 26.7$

$t = \dfrac{275 - 269}{\sqrt{\frac{12^2}{15} + \frac{10^2}{14}}} = 1.466$,  $t_{0.005;\,26.7} \approx 2.773$

Since 1.466 < 2.773, $H_0$ cannot be rejected.

To practice using『eStatU』, select the different-population-variances assumption at [Test Type] in the window of <Figure 8.1.1> and click the [Execute] button to see the result as shown in <Figure 8.1.3>.

<Figure 8.1.3> Testing hypothesis for two population means – case of two different population variances

Example 8.1.3 (Monthly wages by male and female)
Samples of 10 male and 10 female college graduates of this year were randomly taken and their monthly wages were examined as follows (unit: 10,000 KRW):

Male   272 255 278 282 296 312 356 296 302 312
Female 276 280 369 285 303 317 290 250 313 307

⇨ eBook ⇨ EX080103_WageByGender.csv

Using『eStat』, answer the following questions.
1) If the population variances are assumed to be the same, test at the 5% significance level whether the average monthly wages of males and females are the same.
2) If the population variances are assumed to be different, test at the 5% significance level whether the average monthly wages of males and females are the same.

Answer
1) In『eStat』, enter the raw data of gender (M or F) and income as shown in <Figure 8.1.4> on the sheet. This type of data input is similar in most statistical packages.
After entering the data, click the icon for testing two population means and select 'Analysis Var' as V2 and the 'By Group' variable as V1. A 95% confidence interval graph that compares the sample means of the two populations will be displayed as in <Figure 8.1.5>.

<Figure 8.1.4> Data input for testing two population means
<Figure 8.1.5> Dot graph and confidence intervals by gender for testing two population means

In the options window located below the Graph Area, as in <Figure 8.1.6>, enter the mean difference $D_0 = 0$ for the desired test, select the variance assumption $\sigma_1^2 = \sigma_2^2$, select the 5% significance level, and click the [t-test] button. The graphical result of the hypothesis test will be shown as in <Figure 8.1.7> and the test result as in <Figure 8.1.8>.

<Figure 8.1.6> Options to test for two population means
<Figure 8.1.7> Testing hypothesis for $\mu_1$ and $\mu_2$ – case of the same population variances
<Figure 8.1.8> Result of testing hypothesis for two population means if population variances are the same

2) Select the variance assumption $\sigma_1^2 \ne \sigma_2^2$ in the options window and click the [t-test] button under the graph to display the graph of the hypothesis test and the test result table as in <Figure 8.1.9> and <Figure 8.1.10>.

<Figure 8.1.9> Testing hypothesis for $\mu_1$ and $\mu_2$ – case of the different population variances
<Figure 8.1.10> Result of testing hypothesis for two population means if population variances are different

[Practice 8.1.1] (Oral Cleanliness by Brushing Method)
Oral cleanliness scores were examined for 8 subjects who use the basic brushing method (coded 1) and 7 subjects who use the rotation method (coded 2). The data are saved at the following location of『eStat』.
⇨ eBook ⇨ PR080101_ToothCleanByBrushMethod.csv
1) If the population variances are the same, test at the 5% significance level whether the scores for both brushing methods are the same, using『eStat』.
2) If the population variances are different, test at the 5% significance level whether the scores for both brushing methods are the same, using『eStat』.

8.1.2 Paired Sample

• The tests for two population means in the previous section are based on two samples extracted independently from each population. However, in some cases it is difficult to extract samples independently, or an analysis based on independent samples may be meaningless because the characteristics of the sampled units differ too much.

• For example, suppose you give typists special training to increase their typing speed and want to see whether this training has been effective. If different samples are taken before and after the training, it is difficult to measure the effectiveness of the training, because individual differences are large. To overcome the individual differences, measure the typing speed of the same typists before and after the training; then the effect of the training can be assessed properly. A hypothesis test that uses the same sampled units in both experiments to compare the means of two populations is called a paired comparison.

• In a paired comparison, we calculate the difference $d_i = x_{1i} - x_{2i}$ between the paired data as shown in Table 8.1.3 and obtain the mean of the differences ($\bar{d}$) and the variance of the differences ($s_d^2$).

Table 8.1.3 Data for a paired comparison
Sample of population 1 ($x_{1i}$): $x_{11}, x_{12}, \ldots, x_{1n}$
Sample of population 2 ($x_{2i}$): $x_{21}, x_{22}, \ldots, x_{2n}$
Difference of pair ($d_i$): $d_1, d_2, \ldots, d_n$
Mean of differences: $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$,  Variance of differences: $s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2$

• When the two normal populations have the same mean, the statistic

$t = \dfrac{\bar{d} - D_0}{s_d / \sqrt{n}}$

follows a t-distribution with $n-1$ degrees of freedom.
This statistic allows testing of the difference between two population means in the paired-comparison case as follows:

Table 8.1.4 Testing hypothesis of two population means (paired comparison) - two populations are normal distributions, paired sample case
Type of Hypothesis / Decision Rule
1) $H_1: \mu_1 - \mu_2 > D_0$: If $t > t_{\alpha;\,n-1}$, then reject $H_0$, else accept $H_0$
2) $H_1: \mu_1 - \mu_2 < D_0$: If $t < -t_{\alpha;\,n-1}$, then reject $H_0$, else accept $H_0$
3) $H_1: \mu_1 - \mu_2 \ne D_0$: If $|t| > t_{\alpha/2;\,n-1}$, then reject $H_0$, else accept $H_0$

Example 8.1.4
The following is the result of a special training to improve the typing speed of eight typists, measured before and after the training. Test at the 5% significance level whether or not the typing speed has increased. Assume that the typing speed follows a normal distribution. Check the test result using『eStat』and『eStatU』.

id: 1 2 3 4 5 6 7 8
Typing speed before training (words/min): 52 60 63 43 46 56 62 50
Typing speed after training (words/min): 58 62 62 48 50 55 68 57

Answer
This problem tests the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ against the alternative hypothesis $H_1: \mu_1 - \mu_2 < 0$ to compare the typing speed of typists before training (population 1) and after training (population 2) using paired samples. Therefore, the decision rule is: 'If $t < -t_{0.05;\,7}$, then reject $H_0$.'

The calculated differences ($d_i$) of the paired samples before and after training, and the mean ($\bar{d}$) and standard deviation ($s_d$) of the differences, are as follows:

id: 1 2 3 4 5 6 7 8
Before: 52 60 63 43 46 56 62 50
After: 58 62 62 48 50 55 68 57
Difference $d_i$: -6 -2 1 -5 -4 1 -6 -7

Mean $\bar{d} = -3.5$,  Standard deviation $s_d = 3.162$

The test statistic is:

$t = \dfrac{-3.5}{3.162/\sqrt{8}} = -3.13$,  $-t_{0.05;\,7} = -1.895$

Since $-3.13 < -1.895$, $H_0$ is rejected, and we conclude that the training increased the typing speed.
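The hand computation for the typing-speed data can be reproduced in a few lines; a minimal sketch of the paired t statistic of Table 8.1.4:

```python
import math

# Paired typing-speed data from Example 8.1.4 (before/after training, words/min)
before = [52, 60, 63, 43, 46, 56, 62, 50]
after  = [58, 62, 62, 48, 50, 55, 68, 57]

d = [b - a for b, a in zip(before, after)]  # differences d_i = x1_i - x2_i
n = len(d)
dbar = sum(d) / n
sd = math.sqrt(sum((x - dbar) ** 2 for x in d) / (n - 1))

# One-sided paired test of H0: mu1 - mu2 = 0 against H1: mu1 - mu2 < 0
t = (dbar - 0) / (sd / math.sqrt(n))
print(dbar, round(sd, 3), round(t, 2))  # -3.5 3.162 -3.13
# Compare with -t_{0.05;7} = -1.895 from a t table: t < -1.895, so H0 is rejected
```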
In the『eStatU』menu, select 'Testing Hypothesis $\mu_1, \mu_2$', select the alternative hypothesis $\mu_1 - \mu_2 < 0$ at [Hypothesis], check the 5% significance level, check 'paired sample' at [Test Type], and enter the data of sample 1 and sample 2 of the paired samples at [Sample Data] as in <Figure 8.1.11>.

<Figure 8.1.11> Testing hypothesis for two population means using『eStatU』- paired sample

Click the [Execute] button to calculate the sample mean and sample standard deviation of the differences ($\bar{d}$ and $s_d$) and to show the result of the hypothesis test as in <Figure 8.1.12>.

<Figure 8.1.12> Result of testing hypothesis for two population means using『eStatU』- paired sample

In『eStat』, the paired data are entered in two columns as shown in <Figure 8.1.13>. Click the icon for testing two population means and select 'Analysis Var' as V1 and 'by Group' as V2 to show the dot graph and the confidence interval for the differences of the paired data as in <Figure 8.1.14>.
⇨ eBook ⇨ EX080104_TypingSpeedEducation.csv

<Figure 8.1.13> Data input of paired sample
<Figure 8.1.14> Dot graph of difference data of paired sample

Enter the mean difference $D_0 = 0$ for the desired test in the options window below the graph, select the 5% significance level, and press the [t-test] button to display the result of the hypothesis test for paired samples as in <Figure 8.1.15> and <Figure 8.1.16>.

<Figure 8.1.15> Testing hypothesis for two population means using『eStat』- paired sample
<Figure 8.1.16> Result of testing hypothesis for two population means using『eStat』- paired sample

[Practice 8.1.2]
Randomly sampled data of (wife age, husband age) for 8 couples are as follows:
(28, 28) (29, 30) (18, 21) (29, 33) (22, 22) (18, 21) (40, 35) (24, 29)
⇨ eBook ⇨ PR080102_CoupleAge.csv
Test whether the population mean of the wife's age is the same as the population mean of the husband's age or not. Use the significance level of 0.05.

8.2 Testing Hypothesis for Two Population Variances

• Consider the following examples of comparing two population variances:
- When comparing two population means in the previous section, we saw that, for small samples, the decision rules differ depending on whether the two population variances are the same or different. So how can we test whether two population variances are the same?
- The quality of bolts used to assemble cars depends on strict specifications for their diameters. The average diameters of bolts produced by two factories are said to be the same, and the factory whose diameters have the smaller variance is considered the superior producer. How can you compare the variances of the diameters?

• When comparing the variances ($\sigma_1^2$ and $\sigma_2^2$) of two populations, the ratio of the variances ($\sigma_1^2/\sigma_2^2$) is used instead of the difference of the variances. If the ratio of the variances is greater than, smaller than, or equal to 1, then $\sigma_1^2$ is greater than, smaller than, or equal to $\sigma_2^2$, respectively. The reason for using the ratio of the variances instead of their difference is that the sampling distribution of the ratio is easy to find mathematically.

• If the two populations follow normal distributions and samples of sizes $n_1$ and $n_2$ are collected randomly from each population, then under $\sigma_1^2 = \sigma_2^2$ the ratio of the two sample variances $s_1^2$ and $s_2^2$,

$F = \dfrac{s_1^2}{s_2^2}$,

follows an F-distribution with numerator degrees of freedom $n_1 - 1$ and denominator degrees of freedom $n_2 - 1$. Using this fact, we can test hypotheses on the ratio of the population variances. The F-distribution is an asymmetric family of distributions with two parameters, the numerator degrees of freedom and the denominator degrees of freedom. <Figure 8.2.1> shows F-distributions for different parameters.

<Figure 8.2.1> F-distributions with different degrees of freedom
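As a concrete illustration of the F ratio, the wage data of Example 8.1.3 (revisited in Example 8.2.2 below) give the following sample variances and F statistic; a minimal sketch:

```python
# Monthly wage data of Example 8.1.3 (unit: 10,000 KRW)
male   = [272, 255, 278, 282, 296, 312, 356, 296, 302, 312]
female = [276, 280, 369, 285, 303, 317, 290, 250, 313, 307]

def sample_variance(xs):
    """Unbiased sample variance with n - 1 in the denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

s1sq = sample_variance(male)
s2sq = sample_variance(female)
F = s1sq / s2sq  # F-distributed with (9, 9) degrees of freedom under H0

print(round(s1sq, 2), round(s2sq, 2), round(F, 3))  # 769.43 1007.56 0.764
```

The ratio is close to 1, consistent with the conclusion of Example 8.2.2 that the two population variances are not significantly different.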
• Testing hypotheses for two population variances can be performed using the F-distribution as in Table 8.2.1.

Table 8.2.1 Testing hypothesis for two population variances - two populations are normally distributed
Type of Hypothesis / Decision Rule
1) $H_1: \sigma_1^2 > \sigma_2^2$: If $F > F_{\alpha;\,n_1-1,\,n_2-1}$, then reject $H_0$, else accept $H_0$
2) $H_1: \sigma_1^2 < \sigma_2^2$: If $F < F_{1-\alpha;\,n_1-1,\,n_2-1}$, then reject $H_0$, else accept $H_0$
3) $H_1: \sigma_1^2 \ne \sigma_2^2$: If $F > F_{\alpha/2;\,n_1-1,\,n_2-1}$ or $F < F_{1-\alpha/2;\,n_1-1,\,n_2-1}$, then reject $H_0$, else accept $H_0$

Example 8.2.1
A company that produces bolts has two plants. One day, 12 bolts produced in Plant 1 were sampled randomly and the variance $s_1^2$ of their diameters was calculated. 10 bolts produced in Plant 2 were sampled randomly and the variance $s_2^2$ of their diameters was calculated. Test whether the variances of the bolt diameters from the two plants are the same or not at the 5% significance level. Check the test result using『eStatU』.

Answer
The hypothesis of this problem is $H_0: \sigma_1^2 = \sigma_2^2$, $H_1: \sigma_1^2 \ne \sigma_2^2$, and its decision rule is: 'If $F > F_{0.025;\,11,\,9}$ or $F < F_{0.975;\,11,\,9}$, then reject $H_0$, else accept $H_0$.' The test statistic $F = s_1^2/s_2^2$ is calculated from the two sample variances and compared with the percentiles of the F-distribution. Hence, the hypothesis $H_0$ cannot be rejected, and we conclude that the two variances are equal.

In the『eStatU』menu, select 'Testing Hypothesis $\sigma_1^2, \sigma_2^2$'. In the window shown in <Figure 8.2.2>, enter $n_1$, $s_1^2$, $n_2$, $s_2^2$. Click the [Execute] button to reveal the hypothesis test result shown in <Figure 8.2.3>.

<Figure 8.2.2> Data input for testing hypothesis of two population variances using『eStatU』
<Figure 8.2.3> Testing hypothesis for two population variances using『eStatU』

Example 8.2.2 (Income of college graduates, data of [Example 8.1.3])
Samples of 10 male and 10 female graduates of the college this year were taken and their average monthly incomes were examined as follows (unit: 10,000 KRW). Test whether the variances of the two populations are equal.

Male   272 255 278 282 296 312 356 296 302 312
Female 276 280 369 285 303 317 290 250 313 307

⇨ eBook ⇨ EX080103_WageByGender.csv
Answer
In『eStat』, enter the gender and income in two columns on the sheet as shown in <Figure 8.2.4>. This type of data input is similar in most statistical packages. Once you have entered the data, click on the icon for testing two population variances and select 'Analysis Var' as V2 and 'By Group' as V1. A mean-standard deviation graph for each group will appear as in <Figure 8.2.5>.

<Figure 8.2.4> Data input for testing two population variances
<Figure 8.2.5> Dot graph and mean-standard deviation interval of each group

If you click the [F-Test] button in the options window below the graph, a test result graph using the F-distribution such as <Figure 8.2.6> appears in the Graph Area, and the result table as in <Figure 8.2.7> appears in the Log Area.

<Figure 8.2.6> Testing hypothesis for two population variances
<Figure 8.2.7> Result table of testing two population variances

[Practice 8.2.1]
Tire products from two companies are known to have the same average life span of 80,000 km. However, there seems to be a difference in the variance. Sixteen tires from each of the two companies were randomly selected and run under similar conditions to measure their life spans. The sample variances were 4,500 and 2,500, respectively. Using『eStatU』, test the null hypothesis that the variances of the tire life of the two products are the same at the 5% significance level.

8.3 Testing Hypothesis for Two Population Proportions

• Consider the following examples which compare two population proportions:
- Is there a gender gap in the approval rating of a particular candidate in this year's presidential election?
- A factory has two machines that make products. Do the two machines have different defect rates?
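Questions like these are answered in this section with a large-sample Z test on the difference of sample proportions. As a preview, a minimal sketch of the pooled-proportion statistic, applied to the survey counts that appear in Example 8.3.1 below:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Large-sample Z statistic for H0: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Example 8.3.1: 54 of 225 sampled males and 52 of 175 sampled females support candidate A
z = two_prop_z(54, 225, 52, 175)
print(round(z, 2))  # -1.28; |z| < 1.96, so H0: p1 = p2 is not rejected at the 5% level
```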
• Comparing the proportions $p_1$ and $p_2$ of two populations is possible by testing the difference of the two proportions, as in the comparison of two population means. The difference of the sample proportions $\hat{p}_1 - \hat{p}_2$ from the two populations follows approximately a normal distribution with mean $p_1 - p_2$ and variance $\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$ when the two sample sizes are large enough. Since we do not know the population proportions $p_1$ and $p_2$, to estimate this variance under the null hypothesis, the weighted average of the two sample proportions $\hat{p}_1$ and $\hat{p}_2$, using the sample sizes as weights, is used:

$\hat{p} = \dfrac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \dfrac{X_1 + X_2}{n_1 + n_2}$

• The test for two population proportions uses the following test statistic:

$Z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

Table 8.3.1 Testing hypothesis for two population proportions - two independent large samples
Type of Hypothesis / Decision Rule
1) $H_1: p_1 - p_2 > 0$: If $Z > z_\alpha$, then reject $H_0$, else accept $H_0$
2) $H_1: p_1 - p_2 < 0$: If $Z < -z_\alpha$, then reject $H_0$, else accept $H_0$
3) $H_1: p_1 - p_2 \ne 0$: If $|Z| > z_{\alpha/2}$, then reject $H_0$, else accept $H_0$

Example 8.3.1
A survey was conducted for a presidential election, and samples were selected independently from the male and female populations. 54 out of 225 samples from the male population supported candidate A, and 52 out of 175 samples from the female population supported candidate A. Test whether there is a difference in the approval ratings of the male and female populations at the 5% significance level. Check the result using『eStatU』.

Answer
The hypothesis of this problem is $H_0: p_1 = p_2$, $H_1: p_1 \ne p_2$, and its decision rule is: 'If $|Z| > z_{0.025}$, then reject $H_0$, else accept $H_0$.'

Since $\hat{p}_1$ = 54/225 = 0.240 and $\hat{p}_2$ = 52/175 = 0.297, the pooled proportion $\hat{p}$ and the test statistic can be calculated as follows:

$\hat{p}$ = (54 + 52) / (225 + 175) = 106 / 400 = 0.265

$|Z| = \dfrac{|0.240 - 0.297|}{\sqrt{0.265 \times 0.735 \left(\frac{1}{225} + \frac{1}{175}\right)}} = 1.28$,  $z_{0.025} = 1.96$

Since 1.28 < 1.96, the hypothesis $H_0$ cannot be rejected, and we conclude that there is not enough evidence that the approval ratings of males and females are different.

In the『eStatU』menu, select 'Testing Hypothesis $p_1, p_2$' and enter $n_1 = 225$, $X_1 = 54$, $n_2 = 175$, $X_2 = 52$ as shown in <Figure 8.3.1>. Clicking the [Execute] button will show the result of the hypothesis test as shown in <Figure 8.3.2>.
<Figure 8.3.1> Data input for testing two population proportions in『eStatU』
<Figure 8.3.2> Result of testing hypothesis for two population proportions using『eStatU』

Example 8.3.2
In 2000, a simple random sample of 1,000 people aged 15 to 29 across the country examined marital status, and 63.5 percent were single. In 2020, another 1,000 people were surveyed independently, and 69.8 percent of them were single. From this, can you say that there has been a tendency to marry later in recent years? In other words, test at the 5% significance level whether the population aged 15 to 29 in 2020 is more likely to be single than in 2000. What is the p-value of this test?

Answer
The hypothesis of this problem is $H_0: p_1 = p_2$, $H_1: p_1 < p_2$, and its decision rule is: 'If $Z < -z_{0.05}$, then reject $H_0$, else accept $H_0$.'

Since $\hat{p}_1 = 0.635$ and $\hat{p}_2 = 0.698$, the pooled proportion $\hat{p}$ and the test statistic are as follows:

$\hat{p} = \dfrac{1000 \times 0.635 + 1000 \times 0.698}{1000 + 1000} = 0.6665$

$Z = \dfrac{0.635 - 0.698}{\sqrt{0.6665 \times 0.3335 \left(\frac{1}{1000} + \frac{1}{1000}\right)}} = -2.989$,  $-z_{0.05} = -1.645$

Since $-2.989 < -1.645$, $H_0$ is rejected, and we conclude that the proportion of unmarried people in 2020 has increased. The p-value can be calculated as follows:

p-value = $P(Z < -2.989)$ = 0.0014

[Practice 8.3.1]
In a company, the labor union found that 63 percent of 200 salesmen who did not receive a college education wanted to take it even now. The company did a similar study 10 years ago, when only 58 percent of 100 salesmen wanted it. Test the null hypothesis that the desire for college education is not different from 10 years ago, using the significance level of 0.05. Samples were selected independently.

• In the previous two examples of comparing two population proportions, the two sample proportions were calculated from independent samples. Suppose two candidates ran in an election and one thousand samples were selected to test whether there was any difference in the candidates' approval ratings.
The approval ratings $\hat{p}_1$ and $\hat{p}_2$ of the two candidates obtained from the sample are not independent, because, unlike the two previous examples, they are calculated from one set of samples. So the test method should be different. The following statistic is used to test whether there is a difference in the approval ratings of the two candidates:

$Z = \dfrac{\hat{p}_1 - \hat{p}_2}{\widehat{SE}(\hat{p}_1 - \hat{p}_2)}$

where $\widehat{SE}(\hat{p}_1 - \hat{p}_2)$ is the standard error of $\hat{p}_1 - \hat{p}_2$. Assuming that the two population proportions are equal, its estimated value is as follows:

$\widehat{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{\dfrac{\hat{p}_1 + \hat{p}_2}{n}}$

• If the sample size is large, this test statistic follows approximately a standard normal distribution, which allows proper testing according to the form of the alternative hypothesis. As such, it is important to distinguish between sample proportions from independent samples and from non-independent samples when we compare two population proportions.

Exercises

8.1 An analyst studies two types of advertising methods (A and B) tried by retailers. The variable is the amount spent on advertising over the past year. The following are the sample statistics, extracted independently from retailers of each type (unit: million USD):

Type A: $n_1$ = 60, $\bar{x}_1$ = 14.8, $s_1^2$ = 0.180
Type B: $n_2$ = 70, $\bar{x}_2$ = 14.5, $s_2^2$ = 0.133

From these data, can you conclude that type A retailers have invested more in advertising than type B retailers? (Significance level = 0.05)

8.2 Paper making plants are looking to buy one of two forests. The following are statistics of the diameters of 50 trees sampled from each forest. From these data, test at the significance level of 0.05 whether the trees in area B are on average smaller than those in area A. What is the p-value of this test?

Area A: $\bar{x}_1$ = 28.25, $s_1^2$ = 25
Area B: $\bar{x}_2$ = 22.50, $s_2^2$ = 16

8.3 In order to check the period of residence at the current house in regions A and B, the following statistics were examined from simple random samples of 100 households in A and 150 households in B.
From these data, do households in region A live at their current house for a shorter time on average than those in region B? (Significance level = 0.05)

Region A: $\bar{x}_1$ = 33 months, $s_1^2$ = 900
Region B: $\bar{x}_2$ = 49 months, $s_2^2$ = 1,050

8.4 An advertising analyst surveyed how much working men and housewives were exposed to advertisements on radio, TV, newspapers and magazines. The survey item was the number of advertisements that each group encountered in a particular week, and the sample mean and standard deviation of each group are as shown in the table below. From these data, can you say that housewives are exposed to more advertisements on average than working men? (Significance level = 0.05)

Group / Sample size / Sample mean / Sample standard deviation
Working men: 100, 200, 50
Housewives: 144, 225, 60

8.5 One company wants to test whether female employees use the phone longer than male employees. A sample survey of 10 males and 10 females for one-day call time measurement is as follows (unit: minutes). Is there a difference in the average call time between males and females? Use the 5% significance level.

Male   8 6 4 6 2 2 8 10 10 4
Female 4 4 10 2 8 4 10 8 13 14

8.6 One factory wants to compare the adhesion of motor oil from two companies. Among the products of each company, 32 products were randomly selected and tested as follows. Based on these data, can you conclude that the adhesion means of the two companies' products are different? (Significance level = 0.05)

Company A / Company B (raw data):
13 52 46 74 21 25 52 73 60 11 66 43 35 11 65 70 38 55 71 51 10 44 67 72 36 25 47 65 24 41 48 45 35 16 58 76 35 47 42 48 45 50 66 56 19 42 11 35 39 25 17 51 25 18 69 60 80 45 47 69 75 43 46 64

8.7 An industrial psychologist thinks that a big factor in workers changing jobs is the self-esteem workers attach to their individual work. The psychologist thinks that workers who change jobs frequently (group A) have lower self-esteem than those who do not (group B). The following data measure the score of self-esteem, sampling each group independently.
Group A: 60 45 42 62 68 54 52 55 44 41
Group B: 70 72 74 74 76 91 71 78 78 83 50 52 66 65 53 52

Can these data support the psychologist's idea? Assume that the scores of the populations are normally distributed and that the population variances are unknown but equal. (Significance level = 0.01)

8.8 In the business administration department of a university, a debate arose over claims that men have more knowledge of the stock market than women. To calm the dispute, the instructor independently sampled 15 men and 15 women and tested them for knowledge of the stock market. The result is as follows:

Women: 73 96 74 55 91 50 46 82 79 79 50 46 81 83
Men: 57 78 42 44 91 65 63 60 97 85 92 42 86 81 64

According to the data, on average, can you say that men have more knowledge of the stock market than women? Use the significance level of 0.05. What assumptions do you need?

8.9 An oil company has developed a gasoline additive intended to improve fuel mileage. We used 16 pairs of cars to compare fuel mileage to see if it actually improved. Each pair of cars is matched on structure, model, engine size, and other relevant characteristics. When driving the test course, one car of each pair, selected randomly, used gasoline with the additive; the other car of the pair drove the same course using gasoline without the additive. The following table shows the km per liter for each pair. Do these data support the claim that the additive increases fuel mileage? Assume that fuel mileage is normally distributed. Use the 5% significance level.
(unit: km/liter)
pair:              1    2    3    4    5    6    7    8
Additive (X1):    17.1 12.7 11.6 15.8 14.0 17.8 14.7 16.3
No Additive (X2): 16.3 11.6 11.2 14.9 12.8 17.1 13.4 15.4

pair:              9   10   11   12   13   14   15   16
Additive (X1):    10.8 14.9 19.7 11.4 11.4  9.3 19.0 10.1
No Additive (X2): 10.1 13.7 18.3 11.0 10.5  8.7 17.9  9.4

8.10 A study deals with whether car accidents in a village can be reduced effectively by increasing the number of street lamps. The following table shows the average number of accidents per night, one year before and one year after putting street lamps at 12 locations. Do these data provide evidence that street lamps have reduced nightly car accidents? Use the 5% significance level.

Location: A  B  C  D  E  F  G  H  I  J  K  L
Before:   8 12  5  4  6  3  4  3  2  6  6  9
After:    5  2  1  4  2  2  3  4  3  5  4  3

8.11 The survey results of (wife's age, husband's age) from sampling 16 couples are as follows:
(28, 28) (29, 30) (18, 21) (29, 33) (22, 22) (18, 21) (40, 35) (24, 29)
(21, 31) (20, 24) (20, 34) (23, 25) (33, 39) (33, 35) (40, 29) (39, 40)
Test whether the wife's age is the same as the husband's age or not. Use the significance level of 0.05.

8.12 One person is considering a test to compare two population means. 16 samples are randomly taken from each of two populations, and their sample variances are 28.5 and 9.5. Do these data show evidence that the two population variances are the same? (Significance level = 0.05)

8.13 Studies have been planned to compare two relaxing drugs for office workers in stressful jobs. A medical team sampled eight workers for each of the two drugs and collected data on tension. The two sample variances are $s_1^2$ = 2916 and $s_2^2$ = 4624. Using the significance level of 0.05, can these data be said to show a difference in the two population variances of tension? Explain the necessary assumptions.

8.14 Let $X_1$ and $X_2$ be the number of days it takes for a plant to sprout its wide leaves and narrow leaves, respectively.
The measured data are as follows: , , , , , . If X ∼ N(μ1, σ1²) and Y ∼ N(μ2, σ2²), test the following hypothesis using the 5% significance level: H0: , H1: .

8.15 Both tire products are known to have an average life span of 80,000 km. However, there seems to be a difference in the variance. Sixteen tires from each of two companies were randomly selected and run under similar conditions to measure their life span. The sample variances were 4,500 and 2,200, respectively.
1) Test the null hypothesis that the variances of the tire life of the two products are the same at the significance levels of both 0.10 and 0.05.
2) Obtain 90% and 95% confidence intervals for the ratio σ1²/σ2².

8.16 A carpet manufacturer is looking for materials that can withstand temperatures above 250 degrees Fahrenheit. One of two materials is a natural material and the other is a cheap artificial material; both have the same properties except for their heat-resistant levels. In a heat-resistance experiment that independently selected 250 samples from each of the two materials, 36 samples of the natural material and 45 samples of the man-made material failed at temperatures above 250 degrees Fahrenheit. Is there a difference in the heat resistance of the two materials from this data, using the significance level of 0.05?

8.17 A labor union of a company found that 63 percent of 150 salespeople who did not receive a college education wished even now that they had received one. The company did a similar study 10 years ago, when only 58 percent of 160 people wanted it. Test the null hypothesis that the desire for a college education is not different from 10 years ago, using the significance level of 0.05. Samples were selected independently.

8.18 When we examined 200 companies of type A, we found that 12% of them spent more than 1% of their total sales on advertising. When another 200 companies of type B were independently selected and examined, we found that 15% of them spent more than 1% of their total sales on advertising.
Test the following hypotheses with the significance level of 0.05: H0: pA ≤ pB, H1: pA > pB.

8.19 In a company, a study was conducted on the leisure activities of sales staff and managing staff. 400 persons were selected independently from each of the sales and managing staffs. 288 sales staff and 260 managing staff answered that they usually spend their leisure time on sports activities. From this data, can you say that the percentages of the two groups who spend their leisure time on sports activities are the same? Use the significance level of 0.05.

8.20 In September 2013, a research institute surveyed 260 men and 263 women about a political issue, and the responses are as follows. Do you think there are significant differences in their way of thinking on the political issue? Specify the null hypothesis and the alternative hypothesis and test at the 5% significance level.
       Men   Women
Yes    57%   65%
No     43%   35%

8.21 In order to see whether the unemployment rates in two cities are different, samples of 500 people were randomly selected from each of the two cities, and the numbers of unemployed persons found were 35 and 25, respectively. Can you say that the unemployment rates in the two cities are different? Describe the necessary assumptions and calculate the p-value.

Multiple Choice Exercise

8.1 One professor claims that 'A student who studies in the morning will get a better math score than a student who studies in the evening.' Assume that μ1 is the average exam score of students who study in the morning and μ2 is the average exam score of students who study in the evening. What is the null hypothesis of this test?
① μ1 > μ2   ② μ1 ≧ μ2   ③ μ1 ≠ μ2   ④ μ1 = μ2

8.2 What is the alternative hypothesis of the test in question 8.1 above?
① μ1 > μ2   ② μ1 ≧ μ2   ③ μ1 ≠ μ2   ④ μ1 = μ2

8.3 A researcher claims that "After age 40 and over, there is no difference in weight between males and females." Assume the average weight of males aged 40 and over is μ1 and the average weight of females aged 40 and over is μ2.
What is the alternative hypothesis of the test?
① μ1 = μ2   ② μ1 ≠ μ2   ③ μ1 > μ2   ④ μ1 < μ2

8.4 We want to test whether two population means are equal or not using a t-test. Which one of the following is not a required assumption?
① Populations are normal distributions.
② The two population variances are the same.
③ Samples are selected independently.
④ Samples are collected using the cluster sampling method.

8.5 Which sampling distribution is used to test whether two population means are equal or not when sample sizes are small?
① Normal distribution   ② t-distribution   ③ Chi-square distribution   ④ F-distribution

8.6 16 couples are randomly selected to compare their ages as follows. What is the name of this kind of data?
(woman age, man age)
(28, 28) (29, 30) (18, 21) (29, 33) (22, 22) (18, 21) (40, 35) (24, 29)
(21, 31) (20, 24) (20, 34) (23, 25) (33, 39) (33, 35) (40, 29) (39, 40)
① independent data   ② paired data   ③ random data   ④ cluster data

8.7 Which sampling distribution is used to test whether two population variances are equal or not when populations are normally distributed?
① Normal distribution   ② t-distribution   ③ Chi-square distribution   ④ F-distribution

8.8 Which sampling distribution is used to test whether two population proportions are equal or not when sample sizes are large enough?
① Normal distribution   ② t-distribution   ③ Chi-square distribution   ④ F-distribution

8.9 In a company, a comparative study was conducted on the leisure activities of sales staff and managing staff. 400 staff members were selected independently from each of the sales and managing staffs and surveyed. We found that 288 sales staff and 260 managing staff answered that they usually spend their leisure time on sports activities. Which of the following is the null hypothesis for comparing the two groups?
① p1 = p2   ② p1 ≠ p2   ③ p1 > p2   ④ p1 < p2

8.10 Which of the following is the alternative hypothesis in question 8.9?
① p1 = p2   ② p1 ≠ p2   ③ p1 > p2   ④ p1 < p2

(Answers) 8.1 ④, 8.2 ①, 8.3 ②, 8.4 ④, 8.5 ②, 8.6 ②, 8.7 ④, 8.8 ①, 8.9 ①, 8.10 ②

9 Testing Hypothesis for Several Population Means

SECTIONS
9.1 Analysis of Variance for Experiments of Single Factor
  9.1.1 Multiple Comparison
  9.1.2 Residual Analysis
9.2 Design of Experiments for Sampling
  9.2.1 Completely Randomized Design
  9.2.2 Randomized Block Design
9.3 Analysis of Variance for Experiments of Two Factors

CHAPTER OBJECTIVES
In testing hypotheses about the population mean described in Chapters 7 and 8, the number of populations was one or two. However, many cases are encountered where there are three or more population means to compare. The analysis of variance (ANOVA) is used to test whether several population means are equal or not. The ANOVA was first published by the British statistician R. A. Fisher as a test method applied to the study of agriculture, but today its principles are applied in many experimental sciences, including economics, business administration, psychology and medicine. In Section 9.1, the one-way ANOVA for a single factor is introduced. In Section 9.2, experimental designs for sampling are introduced. In Section 9.3, the two-way ANOVA for experiments with two factors is introduced.

9.1 Analysis of Variance for Experiments of Single Factor

• In Section 8.1, we discussed how to compare the means of two populations using a hypothesis test. This chapter discusses how to compare the means of several populations. There are many examples of comparing means of several populations:
- Are average hours of library usage for each grade the same?
- Are the yields of three different rice seeds equal?
- In a chemical reaction, are the response rates the same at four different temperatures?
- Are the average monthly wages of college graduates the same in three different cities?
• The group variable used to distinguish groups of the population, such as the grade or the rice seed, is called a factor.
Factor
Definition: The group variable used to distinguish groups of the population is called a factor.

• This section describes the one-way analysis of variance (ANOVA), which compares population means when there is a single factor. Section 9.2 describes how an experiment is designed to collect sample data. Section 9.3 describes the two-way ANOVA, which compares several population means when there are two factors. Let's take a look at the following example.

Example 9.1.1  In order to compare the English proficiency of each grade at a university, samples were randomly selected from each grade to take the same English test, and the data are as in Table 9.1.1. The right column shows the calculated averages x̄1., x̄2., x̄3., x̄4. for each grade.

Table 9.1.1 English Proficiency Score by Grade
Grade   English Proficiency Score   Average
1       81 75 69 90 72 83           x̄1. = 78.3
2       65 80 73 79 81 69           x̄2. = 74.5
3       72 67 62 76 80              x̄3. = 71.4
4       89 94 79 88                 x̄4. = 87.5
⇨ eBook ⇨ EX090101_EnglishScoreByGrade.csv

1) Using『eStat』, draw a dot graph of test scores for each grade and compare their averages.
2) We want to test a hypothesis on whether the average scores of each grade are the same or not. Set up a null hypothesis and an alternative hypothesis.
3) Apply the one-way analysis of variance to test the hypothesis in question 2).
4) Use『eStat』to check the result of the ANOVA test.

Example 9.1.1 Answer
1) If you draw a dot graph of English scores by each grade, you can see whether the scores of each grade are similar. If you plot the 95% confidence interval of the population mean studied in Chapter 6 on each dot graph, you can see a more detailed comparison.
• In order to draw a dot graph with the data shown in Table 9.1.1 using『eStat』, enter the data on the sheet and set the variable names to 'Grade' and 'Score' as shown in <Figure 9.1.1>.
In the variable selection box which appears by clicking the ANOVA icon on the main menu of『eStat』, select 'Analysis Var' as 'Score' and 'By Group' as 'Grade'. The dot graph of English scores for each grade and the 95% confidence intervals are displayed as shown in <Figure 9.1.2>.

<Figure 9.1.1>『eStat』data input for ANOVA
<Figure 9.1.2> 95% confidence interval by grade

• To review the normality of the data, pressing the [Histogram] button under this graph (<Figure 9.1.3>) will draw the histogram and normal distribution together, as shown in <Figure 9.1.4>.

<Figure 9.1.3> Options of ANOVA
<Figure 9.1.4> Histogram of English score by grade

• <Figure 9.1.2> shows the sample means x̄1. = 78.3, x̄2. = 74.5, x̄3. = 71.4, x̄4. = 87.5. The sample mean of the 4th grade is relatively large, and the order of the sample means is x̄3. < x̄2. < x̄1. < x̄4.. The means x̄1., x̄2. and x̄3. are similar, but x̄4. is much greater than the other three. Therefore, it can be expected that the population means μ1, μ2 and μ3 would be the same, and that μ4 would differ from the three other population means. However, we need to test whether this difference in sample means is statistically significant.

2) In this example, the null hypothesis to test is that the population means of English scores of the four grades are all the same, and the alternative hypothesis is that the population means of the English scores are not all the same. In other words, if μ1, μ2, μ3, μ4 are the population means of English scores for each grade, the hypothesis to test can be written as follows:
Null hypothesis        H0: μ1 = μ2 = μ3 = μ4
Alternative hypothesis H1: at least one pair of μi is not the same

3) A measure that can be considered first as a basis for testing differences in multiple sample means would be the distance from each mean to the overall mean.
In other words, if the overall sample mean for all 21 students is expressed as x̄.., the squared distance from each sample mean to the overall mean, weighted by the number of samples in each grade, is as follows. This squared distance is called the between sum of squares (SSB) or the treatment sum of squares (SSTr).

SSTr = 6(x̄1. − x̄..)² + 6(x̄2. − x̄..)² + 5(x̄3. − x̄..)² + 4(x̄4. − x̄..)² = 643.633

If the squared distance SSTr is close to zero, all sample means of English scores for the four grades are similar.

Example 9.1.1 Answer (continued)
• However, this treatment sum of squares can become larger as the number of populations increases. It requires modification to become a test statistic for determining whether several population means are equal. The squared distance from each observation to the sample mean of its grade is called the within sum of squares (SSW) or the error sum of squares (SSE), defined below:

SSE = Σi Σj (xij − x̄i.)² = 839.033

• If the population distributions of English scores in each grade follow normal distributions and their variances are the same, the following test statistic has the F distribution with (k−1, n−k) degrees of freedom:

F = [SSTr / (k−1)] / [SSE / (n−k)]

This statistic can be used to test whether the population English scores of the four grades are the same or not. In the test statistic, the numerator SSTr/(k−1) is called the treatment mean square (MSTr), which implies a variance between grade means. The denominator SSE/(n−k) is called the error mean square (MSE), which implies a variance within each grade. Thus, the above test statistic is based on the ratio of two variances, which is why the test of multiple population means is called an analysis of variance (ANOVA).
• The calculated test statistic, i.e., the observed F value F0, using the data of English scores for each grade is as follows:

F0 = [SSTr / (4−1)] / [SSE / (21−4)] = (643.633 / 3) / (839.033 / 17) = 4.347

Since F3,17; 0.05 = 3.20 < 4.347, the null hypothesis that the population means of English scores of each grade are the same, H0: μ1 = μ2 = μ3 = μ4, is rejected at the 5% significance level.
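As a numerical cross-check (this is not part of the textbook, which carries out the test in『eStat』), the same F statistic can be computed with SciPy's `f_oneway`; the following is a sketch assuming SciPy is installed:

```python
# Cross-check of the F test for Example 9.1.1 using SciPy's one-way ANOVA.
# The four lists are the grade samples from Table 9.1.1.
from scipy import stats

grade1 = [81, 75, 69, 90, 72, 83]
grade2 = [65, 80, 73, 79, 81, 69]
grade3 = [72, 67, 62, 76, 80]
grade4 = [89, 94, 79, 88]

f_stat, p_value = stats.f_oneway(grade1, grade2, grade3, grade4)
print(round(f_stat, 3))  # 4.347, matching the hand computation above
```

The p-value returned alongside the statistic is below 0.05, which agrees with the rejection of H0 at the 5% significance level.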
In other words, there is a difference in the population means of English scores among the grades.
• The following ANOVA table provides a single view of the above calculation.

Factor      Sum of Squares    Degree of freedom   Mean Squares        F value
Treatment   SSTr = 643.633    4 − 1 = 3           MSTr = 643.633/3    F0 = 4.347
Error       SSE = 839.033     21 − 4 = 17         MSE = 839.033/17
Total       SST = 1482.666    20

4) In <Figure 9.1.3>, if you select the significance level of 5% and the confidence level of 95%, and click the [ANOVA F test] button, a graph showing the location of the test statistic in the F distribution appears as shown in <Figure 9.1.5>. Also, in the Log Area, the mean and confidence interval tables and the test result for each grade appear as in <Figure 9.1.6>.

Example 9.1.1 Answer (continued)
<Figure 9.1.5>『eStat』ANOVA F test
<Figure 9.1.6>『eStat』Basic statistics and ANOVA table

• The analysis of variance is also possible using『eStatU』. Entering the data as in <Figure 9.1.7> and clicking the [Execute] button will give the same result as in <Figure 9.1.5>.

<Figure 9.1.7> ANOVA data input at『eStatU』

• The above example refers to two variables, the English score and the grade. A variable such as the English score is called an analysis variable or a response variable. The response variable is mostly a continuous variable. A variable used to distinguish populations, such as the grade, is called a group variable or a factor variable, which is mostly a categorical variable. Each value of a factor variable is called a level of the factor, and the number of these levels is the number of populations to be compared. In the above example, the factor has four levels: 1st, 2nd, 3rd and 4th grade. The terms 'response' and 'factor' originated from analyzing data through experiments in engineering, agriculture, medicine and pharmacy.
The analysis of variance method that examines the effect of a single factor on the response variable is called the one-way ANOVA. Table 9.1.2 shows the typical data structure of the one-way ANOVA when the number of levels of a factor is k and the numbers of observations at each level are n1, n2, ⋯, nk.

Table 9.1.2 Notation of the one-way ANOVA
Factor     Observed values of sample   Average
Level 1    Y11 Y12 ⋯ Y1n1              Ȳ1.
Level 2    Y21 Y22 ⋯ Y2n2              Ȳ2.
⋯          ⋯                           ⋯
Level k    Yk1 Yk2 ⋯ Yknk              Ȳk.

• The statistical model for the one-way analysis of variance is given as follows:

Yij = μ + αi + εij,  i = 1, 2, ⋯, k;  j = 1, 2, ⋯, ni

Yij represents the jth observed value of the response variable at the ith level of the factor. The population mean of the ith level, μi, is represented as μi = μ + αi, where μ is the mean of the entire population and αi is the effect of the ith level on the response variable. εij denotes the error term of the jth observation at the ith level, and all error terms are assumed to be independent of each other and to follow the same normal distribution with mean 0 and variance σ². The error term is the random variation in the response variable due to reasons other than the levels of the factor. For example, in the English score example, differences in English performance for each grade can be caused by other variables besides the grade, such as individual study hours, gender and IQ. However, by assuming that these variations are relatively small compared to the variation due to differences in grade, the error term can be interpreted as the sum of these various causes.
• The hypothesis to test can be represented using αi instead of μi as follows:

Null hypothesis        H0: α1 = α2 = ⋯ = αk = 0
Alternative hypothesis H1: at least one αi is not equal to 0

In order to test the hypothesis, the analysis of variance table shown in Table 9.1.3 is used.
Table 9.1.3 Analysis of variance table of the one-way ANOVA
Factor      Sum of Squares   Degree of freedom   Mean Squares         F value
Treatment   SSTr             k − 1               MSTr = SSTr/(k−1)    F0 = MSTr/MSE
Error       SSE              n − k               MSE = SSE/(n−k)
Total       SST              n − 1

• The three sums of squares for the analysis of variance can be described as follows. For the explanation, first define the following statistics:

Ȳi. : mean of observations at the ith level
Ȳ.. : mean of the total observations

SST = Σ(i=1..k) Σ(j=1..ni) (Yij − Ȳ..)² : The sum of squared distances between the observed values of the response variable and the mean of the total observations is called the total sum of squares (SST).

SSTr = Σ(i=1..k) ni (Ȳi. − Ȳ..)² : The sum of squared distances between the mean of each level and the mean of the total observations is called the treatment sum of squares (SSTr). It represents the variation between level means.

SSE = Σ(i=1..k) Σ(j=1..ni) (Yij − Ȳi.)² : The sum of squared distances between the observations at each level and the mean of that level, referred to as the 'within variation', is called the error sum of squares (SSE).

• The degree of freedom of each sum of squares is determined by the following logic: The SST consists of n squares (Yij − Ȳ..)², but Ȳ.. must be calculated before SST can be calculated, and hence the degree of freedom of SST is n − 1. The SSE also consists of n squares (Yij − Ȳi.)², but the k values Ȳ1., ⋯, Ȳk. must be calculated before SSE can be calculated, and hence the degree of freedom of SSE is n − k. The degree of freedom of SSTr is the degree of freedom of SST minus the degree of freedom of SSE, which is k − 1.
• In the one-way analysis of variance, the following facts are always established:

Partition of sum of squares and degrees of freedom
• Sum of squares:      SST = SSTr + SSE
• Degrees of freedom:  n − 1 = (k − 1) + (n − k)

• The sum of squares divided by the corresponding degrees of freedom is referred to as the mean squares, and Table 9.1.3 defines the treatment mean squares (MSTr) and the error mean squares (MSE).
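The partition SST = SSTr + SSE can be verified numerically on the Table 9.1.1 scores; the following is an illustrative sketch in plain Python (the textbook itself uses『eStat』 for these computations):

```python
# From-scratch computation of the sums of squares in Table 9.1.3, using the
# English scores of Table 9.1.1, to illustrate the partition SST = SSTr + SSE.
groups = [
    [81, 75, 69, 90, 72, 83],
    [65, 80, 73, 79, 81, 69],
    [72, 67, 62, 76, 80],
    [89, 94, 79, 88],
]
all_obs = [y for g in groups for y in g]
n, k = len(all_obs), len(groups)
grand_mean = sum(all_obs) / n
level_means = [sum(g) / len(g) for g in groups]

sst = sum((y - grand_mean) ** 2 for y in all_obs)
sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, level_means))
sse = sum((y - m) ** 2 for g, m in zip(groups, level_means) for y in g)

assert abs(sst - (sstr + sse)) < 1e-6      # partition of the total sum of squares
assert n - 1 == (k - 1) + (n - k)          # partition of the degrees of freedom
print(round(sstr, 3), round(sse, 3))       # 643.633 839.033, as in Example 9.1.1
```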
As in the meaning of the sums of squares, the treatment mean square implies the average variation between the levels of the factor, and the error mean square implies the average variation within observations in each level. Therefore, if MSTr is relatively much larger than MSE, we can conclude that the population means of the levels, μi, are not all the same. So by what criterion can we say it is relatively much larger? The calculated value F0 in the last column of the ANOVA table represents the relative size of MSTr and MSE. If the assumptions based on statistical theory are satisfied, and if the null hypothesis H0: α1 = α2 = ⋯ = αk = 0 is true, then the test statistic below follows an F distribution with degrees of freedom k − 1 and n − k:

F0 = MSTr / MSE = [SSTr / (k−1)] / [SSE / (n−k)]

• Therefore, when the significance level of the test is α, if the calculated value F0 is greater than the value of Fk−1, n−k; α, then the null hypothesis is rejected. That is, it is determined that the population means of the factor levels are not all the same.

One-way analysis of variance test
Null hypothesis        H0: α1 = α2 = ⋯ = αk = 0
Alternative hypothesis H1: at least one αi is not equal to 0
Test statistic         F0 = MSTr / MSE
Decision rule          If F0 > Fk−1, n−k; α, then reject H0
(Note:『eStat』calculates the p-value of this test. Hence, if the p-value is smaller than the significance level α, then reject the null hypothesis.)

[Practice 9.1.1] (Plant Growth by Condition) Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control (labeled 'ctrl') and two different treatment conditions (labeled 'trt1' and 'trt2'). The weight data, with 30 observations on the control and two treatments ('ctrl', 'trt1', 'trt2'), are saved at the following location of『eStat』. Answer the following questions using『eStat』.
⇨ eBook ⇨ PR090101_Rdatasets_PlantGrowth.csv
1) Draw a dot graph of weights for each of the control and treatments.
2) Test a hypothesis on whether the weights are the same or not. Use the 5% significance level.
9.1.1 Multiple Comparisons

• If the F test of the one-way ANOVA does not show a significant difference between the levels of the factor, it can be concluded that there is no difference between the level populations. However, if you conclude that there are significant differences between the levels, as in [Example 9.1.1], you need to examine which levels are different from each other. The analysis of differences between population means after the ANOVA requires several tests for mean differences to be performed simultaneously, and it is called multiple comparisons. The hypothesis for the multiple comparisons to test whether the level means μi and μj are equal is as follows:

H0: μi = μj ;  H1: μi ≠ μj ,  for i ≠ j

• This means that there are k(k−1)/2 tests to be done simultaneously for the multiple comparisons if there are k levels of the factor. There are many multiple comparison tests, but Tukey's Honestly Significant Difference (HSD) test is most commonly used. The statistic for Tukey's HSD test to compare μi and μj is the sample mean difference Ȳi. − Ȳj., and the decision rule for the test is as follows:

If |Ȳi. − Ȳj.| > HSDij, then reject H0, where HSDij = qk, n−k; α · √( (MSE/2)(1/ni + 1/nj) )

Here ni and nj are the numbers of samples (repetitions) at the ith and jth levels, MSE is the mean squared error, and qk, n−k; α is the right-tail 100·α percentile of the studentized range distribution with parameter k and n − k degrees of freedom. (It can be found at『eStatU』 (<Figure 9.1.8>).)

<Figure 9.1.8>『eStatU』HSD percentile table

Example 9.1.2  In [Example 9.1.1], the analysis of variance of English scores by grade concluded that the null hypothesis was rejected and the average English scores for each grade were not all the same. Now let's apply the multiple comparisons to check where the differences exist among the school grades, with the significance level of 5%. Use『eStat』to check the result.
Example 9.1.2 Answer
• The hypothesis of the multiple comparisons is H0: μi = μj, H1: μi ≠ μj, and the decision rule is as follows:
'If |x̄i. − x̄j.| > HSDij, then reject H0.'
Since there are four school grades (k = 4), k(k−1)/2 = 6 multiple comparisons are possible as follows. The 5 percentile from the right tail of the studentized range distribution which is used for these tests is q4, 17; 0.05 = 4.02.

1) H0: μ1 = μ2, H1: μ1 ≠ μ2
HSD12 = q4, 17; 0.05 · √( (MSE/2)(1/n1 + 1/n2) ) = 4.02 · √( (49.355/2)(1/6 + 1/6) ) = 11.530
Since |x̄1. − x̄2.| = |78.3 − 74.5| = 3.8 < 11.530, accept H0.

2) H0: μ1 = μ3, H1: μ1 ≠ μ3
HSD13 = 4.02 · √( (49.355/2)(1/6 + 1/5) ) = 12.092
Since |x̄1. − x̄3.| = |78.3 − 71.4| = 6.9 < 12.092, accept H0.

3) H0: μ1 = μ4, H1: μ1 ≠ μ4
HSD14 = 4.02 · √( (49.355/2)(1/6 + 1/4) ) = 12.891
Since |x̄1. − x̄4.| = |78.3 − 87.5| = 9.2 < 12.891, accept H0.

4) H0: μ2 = μ3, H1: μ2 ≠ μ3
HSD23 = 4.02 · √( (49.355/2)(1/6 + 1/5) ) = 12.092
Since |x̄2. − x̄3.| = |74.5 − 71.4| = 3.1 < 12.092, accept H0.

5) H0: μ2 = μ4, H1: μ2 ≠ μ4
HSD24 = 4.02 · √( (49.355/2)(1/6 + 1/4) ) = 12.891
Since |x̄2. − x̄4.| = |74.5 − 87.5| = 13.0 > 12.891, reject H0.

6) H0: μ3 = μ4, H1: μ3 ≠ μ4
HSD34 = 4.02 · √( (49.355/2)(1/5 + 1/4) ) = 13.396
Since |x̄3. − x̄4.| = |71.4 − 87.5| = 16.1 > 13.396, reject H0.

• The result of the above multiple comparisons shows that there is a difference between μ4 and μ2, and between μ4 and μ3, as can be seen in the dot graph with averages in <Figure 9.1.2>. It also shows that μ1 has no significant difference from the other means.
• If you click [Multiple Comparison] in the options of the ANOVA as in <Figure 9.1.3>,『eStat』shows the result of Tukey's multiple comparisons as shown in <Figure 9.1.9>.『eStat』also shows the mean difference and the 95% HSD value for each combination of sample means, after rearranging the levels of rows and columns in ascending order of the sample means.
• In this table, if the HSD test result for a combination of two levels is significant at the 5% significance level, it is marked with *; if it is significant at the 1% significance level, it is marked with **; if it is not significant, the cell is left blank.

Example 9.1.2 Answer (continued)
<Figure 9.1.9> HSD multiple comparisons

• For the analysis of mean differences, confidence intervals for each level may also be used. <Figure 9.1.2> shows the 95% confidence interval of the mean for each level.
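The six HSD comparisons above can also be reproduced programmatically; the following is a sketch (not part of the textbook) that assumes SciPy ≥ 1.7 for the studentized range distribution:

```python
# Reproducing the six HSD comparisons of Example 9.1.2.
from itertools import combinations
from scipy.stats import studentized_range

means = {1: 78.333, 2: 74.5, 3: 71.4, 4: 87.5}  # level means from Table 9.1.1
sizes = {1: 6, 2: 6, 3: 5, 4: 4}                # sample sizes per grade
mse, k, df = 49.355, 4, 17                      # MSE and df from Example 9.1.1

q = studentized_range.ppf(0.95, k, df)          # right-tail 5% percentile, about 4.02
rejected = {}
for i, j in combinations(sorted(means), 2):
    hsd = q * ((mse / 2) * (1 / sizes[i] + 1 / sizes[j])) ** 0.5
    rejected[(i, j)] = abs(means[i] - means[j]) > hsd
    print(f"mu{i} vs mu{j}: HSD = {hsd:.3f}, reject = {rejected[(i, j)]}")
```

Only the pairs (μ2, μ4) and (μ3, μ4) are rejected, in agreement with the hand computation.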
This confidence interval is created using the formula described in Chapter 6; the only difference is that the estimate of the error variance, σ̂², is the pooled variance using the overall observations rather than the sample variance of the observed values at each level. In the ANOVA table, MSE is this pooled variance.
• In a post-hoc analysis using these confidence intervals, there is a difference between means if the confidence intervals do not overlap, so the same conclusion can be obtained as in the previous HSD test.

[Practice 9.1.2] Using the data of [Practice 9.1.1],
⇨ eBook ⇨ PR090101_Rdatasets_PlantGrowth.csv
apply the multiple comparisons to check where differences exist among the control and the two treatments, with the significance level of 5%. Use『eStat』.

9.1.2 Residual Analysis

• Another statistical analysis related to the ANOVA is the residual analysis. Hypothesis tests in the ANOVA are performed under assumptions about the error term εij. Assumptions about the error terms include independence (the εij are independent of each other), homoscedasticity (each variance of εij is constant, equal to σ²), normality (each εij is normally distributed), etc. The validity of these assumptions should always be investigated. However, since εij cannot be observed, the residual, as the estimate of εij, is used to check the assumptions. The residuals in the ANOVA are defined as the deviations used in the equation of the error sum of squares; for example, Yij − Ȳi. in the one-way analysis of variance.

Example 9.1.3  In [Example 9.1.1] of the English score comparison by grade, apply the residual analysis using『eStat』.

Answer
• If you click on [Standardized Residual Plot] of the ANOVA options in <Figure 9.1.3>, a scatter plot of residuals versus fitted values appears as shown in <Figure 9.1.10>.
In this scatter plot, if the residuals show no unusual tendency around zero and appear random, then the assumptions of independence and homoscedasticity are valid. There is no unusual tendency in this scatter plot. Normality of the residuals can be checked by drawing a histogram of the residuals.

<Figure 9.1.10> Residual plot of the ANOVA

[Practice 9.1.3] Using the data of [Practice 9.1.1],
⇨ eBook ⇨ PR090101_Rdatasets_PlantGrowth.csv
apply the residual analysis using『eStat』.

9.2 Design of Experiments for Sampling

• For data such as the English scores by grade in [Example 9.1.1], it is not so difficult to collect samples from each grade population. However, when obtaining samples through experiments, as in engineering, medicine, or agriculture, it is often difficult to collect a large number of samples due to the influence of many other external factors, and one should be very cautious about sampling. This section discusses how to design experiments for collecting a small number of data from experiments.

9.2.1 Completely Randomized Design

• In order to identify accurately the differences that may exist among the levels of a factor, you should design the experiments so that other factors have as little influence as possible. One method to do this is to randomize the whole experiment. For example, consider experiments to compare the fuel mileage per liter of gasoline for three types of cars, A, B and C. We want to measure the fuel mileage for five different cars of each type. One driver may try to drive all 15 cars. However, if only five cars can be measured per day, the measurement will take place over a total of three days. In this case, changes in daily weather, wind speed and wind direction can influence the fuel mileage, which raises the question of which cars should be measured for fuel mileage on each day.
If five drivers (1, 2, 3, 4, 5) plan to drive the cars to measure the fuel mileage of all the cars in one day, the fuel mileage may be affected by the driver. One solution would be to allocate the 15 cars randomly to the five drivers and then to randomize the sequence of the experiments as well. For example, each car is numbered from 1 to 15, and the fuel mileage experiment is conducted in the order of the numbers drawn at random. Such an experiment would reduce the likelihood of differences caused by external factors such as the driver, daily wind speed and wind direction, because randomized experiments make all external factors affect all the observed measurement values equally. This method of experimentation is called a completely randomized design of experiments. Table 9.2.1 shows an example allocation of experiments by this method. The symbols A, B and C represent the three types of cars.

Table 9.2.1 Example of completely randomized design of experiments
Driver     1       2       3       4       5
Car Type   B B C   A C B   B A A   C A B   A C C

• In general, in order to achieve the purpose of the analysis of variance, it is necessary to plan the experiments thoroughly in advance so that proper data are obtained. The completely randomized design method explained above is studied in detail in the Design of Experiments area of Statistics. From the standpoint of experimental design, the one-way analysis of variance technique is called an analysis of the single-factor design.

9.2.2 Randomized Block Design

• In the completely randomized design experiments for measuring the fuel mileage explained in the previous section, 15 cars were randomly allocated to five drivers. However, the example allocation in Table 9.2.1 shows a problem of this completely randomized design. For example, Driver 1 will only experiment with car types B and C, and Driver 3 will only experiment with car types A and B, so that the variation between drivers will not be averaged out in the test.
Thus, if there is a significant variation between drivers in measuring the fuel mileage, the error term of the analysis of variance may not be a simple experimental error. In order to eliminate this problem, each driver may be required to experiment with each type of car at least once, which is known as a randomized block design. Table 9.2.2 shows an example of a possible allocation in this case. In this table, the values in parentheses are the observed values of the fuel mileage.

Table 9.2.2 Example of randomized block design
           Car Type (fuel mileage)
Driver 1   A(22.4)  C(20.2)  B(16.3)
Driver 2   B(12.6)  C(15.2)  A(16.1)
Driver 3   C(18.7)  A(19.7)  B(15.9)
Driver 4   A(21.1)  B(17.8)  C(18.9)
Driver 5   A(24.5)  C(23.8)  B(21.0)

• Table 9.2.2 shows that the total observed values are divided into five groups by driver, called blocks, so that the observations within a block have the same characteristics. The variable representing the blocks, such as the driver, is referred to as a block variable. A block variable is generally considered if the experimental results are influenced significantly by this variable, which is different from the factor. For example, when examining the yield of rice varieties, if the fields of the rice paddy used in the experiment do not have the same fertility, divide the fields into several blocks which each have the same fertility, and then plant all varieties of rice in each block of the rice paddy. This would eliminate the influence of rice paddies having different fertility and would allow a more accurate examination of the differences in yield between rice varieties.
• The statistical model of the randomized block design with b blocks can be represented as follows:

Yij = μ + αi + βj + εij,  i = 1, 2, ⋯, k;  j = 1, 2, ⋯, b

In this equation, βj is the effect of the jth level of the block variable on the response variable.
• In the randomized block design, the variation resulting from differences between the levels of the block variable can be separated from the error term, independently of the variation of the factor. The total variation is divided as follows:

    (Y_ij − Ȳ..) = (Ȳi. − Ȳ..) + (Ȳ.j − Ȳ..) + (Y_ij − Ȳi. − Ȳ.j + Ȳ..)

• If you square both sides of the equation above and sum over all i and j, you obtain the following sums of squares, as in the one-way analysis of variance:

    Total sum of squares:      SST  = Σ_i Σ_j (Y_ij − Ȳ..)²,                 degrees of freedom kb − 1
    Treatment sum of squares:  SSTr = b Σ_i (Ȳi. − Ȳ..)²,                    degrees of freedom k − 1
    Block sum of squares:      SSB  = k Σ_j (Ȳ.j − Ȳ..)²,                    degrees of freedom b − 1
    Error sum of squares:      SSE  = Σ_i Σ_j (Y_ij − Ȳi. − Ȳ.j + Ȳ..)²,    degrees of freedom (k − 1)(b − 1)

9.2 Design of Experiments for Sampling / 45

• The following facts are always established in the randomized block design.

    Division of the sum of squares and degrees of freedom
    Sum of squares:       SST = SSTr + SSB + SSE
    Degrees of freedom:   kb − 1 = (k − 1) + (b − 1) + (k − 1)(b − 1)

• Table 9.2.3 shows the ANOVA table of the randomized block design. In this ANOVA table, if you combine the sums of squares and degrees of freedom of the block variable and the error variation, they become the sum of squares and degrees of freedom of the error term in the one-way ANOVA table 9.1.3.

Table 9.2.3 Analysis of variance table of the randomized block design

    Variation   Sum of Squares   Degrees of freedom   Mean Squares                F value
    Treatment   SSTr             k − 1                MSTr = SSTr/(k − 1)         MSTr/MSE
    Block       SSB              b − 1                MSB  = SSB/(b − 1)          MSB/MSE
    Error       SSE              (k − 1)(b − 1)       MSE  = SSE/((k − 1)(b − 1))
    Total       SST              kb − 1

• In the randomized block design, the entire set of experiments is not randomized as in the completely randomized design; only the experiments within each block are randomized. Another important thing to note is that, although the variation of the block variable is separated from the error variation, the main objective is still to test the difference between the levels of the factor, as in the one-way analysis of variance.
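The partition SST = SSTr + SSB + SSE above can be computed directly from a k × b data table. The following Python sketch (an illustration under the model just stated, not part of the text's 『eStat』 workflow) implements the four sums of squares:

```python
def rbd_anova(table):
    """Sum-of-squares partition SST = SSTr + SSB + SSE for a
    randomized block design: table[i][j] is the observation for
    treatment i (row) in block j (column), one observation per cell."""
    k, b = len(table), len(table[0])
    grand = sum(map(sum, table)) / (k * b)
    trt_m = [sum(row) / b for row in table]                          # treatment means
    blk_m = [sum(table[i][j] for i in range(k)) / k for j in range(b)]  # block means

    sst = sum((y - grand) ** 2 for row in table for y in row)   # df: kb - 1
    sstr = b * sum((m - grand) ** 2 for m in trt_m)             # df: k - 1
    ssb = k * sum((m - grand) ** 2 for m in blk_m)              # df: b - 1
    sse = sst - sstr - ssb                                      # df: (k - 1)(b - 1)
    return sst, sstr, ssb, sse
```

The F statistic for the treatment is then (SSTr/(k−1)) / (SSE/((k−1)(b−1))), exactly the MSTr/MSE ratio of Table 9.2.3.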
• The test for differences between the levels of the block variable is not of primary interest, because the block variable is used to reduce the error variation and thereby make the test for differences between the levels of the factor more accurate. In addition, the error mean square (MSE) does not always decrease: although the block variation is separated from the error variation of the one-way analysis of variance, the error degrees of freedom are also reduced.

Example 9.2.1
Table 9.2.4 is the rearrangement of the fuel mileage data in Table 9.2.2, measured by five drivers for three car types.

Table 9.2.4 Fuel mileage data by five drivers and three car types

    Car Type   Driver 1   Driver 2   Driver 3   Driver 4   Driver 5   Average (Ȳi.)
    A          22.4       16.1       19.7       21.1       24.5       20.76
    B          16.3       12.6       15.9       17.8       21.0       16.72
    C          20.2       15.2       18.7       18.9       23.8       19.36
    Average    19.63      14.63      18.10      19.27      23.10      18.947

⇨ eBook ⇨ EX090201_GasMilage.csv

1) Assuming that these data were measured by the completely randomized design, use 『eStat』 to carry out the analysis of variance to test whether the three car types have the same fuel mileage.
2) Assuming that these data were measured by the randomized block design, use 『eStat』 to carry out the analysis of variance to test whether the three car types have the same fuel mileage.

Answer
1) In 『eStat』, enter the data as shown in <Figure 9.2.1> and click the analysis of variance icon. Select 'Analysis Var' as Miles and 'By Group' as Car in the variable selection box; the confidence interval graph for each car type appears as in <Figure 9.2.2>.

<Figure 9.2.1> Data input for randomized block design for 『eStat』 ANOVA
<Figure 9.2.2> Dot graph and 95% confidence interval for the population mean of each car type

Click the [ANOVA F-test] button in the option below the graph to reveal the ANOVA graph as in <Figure 9.2.3> and the ANOVA table as in <Figure 9.2.4>. The result of the ANOVA is that there is no difference in fuel mileage between the car types.
The same is true for the multiple comparisons tests in <Figure 9.2.5>.

<Figure 9.2.3> ANOVA of gas mileage
<Figure 9.2.4> ANOVA table of gas mileage
<Figure 9.2.5> Multiple comparisons by car

2) If these data had been collected using the randomized block design, the block sum of squares would be separated from the error sum of squares. Adding the Driver variable to 'by Group' in the variable selection box of 『eStat』 gives a scatter plot of the fuel mileage per driver for each car type, as shown in <Figure 9.2.6>. This scatter plot shows a significant difference in fuel mileage between drivers.

<Figure 9.2.6> Fuel mileages for each driver

Click the [ANOVA F-Test] button in the options window below the graph to reveal the two-way mean table shown in <Figure 9.2.7> and the ANOVA table shown in <Figure 9.2.8>. This ANOVA table clearly shows a decrease in the error sum of squares, which significantly reduces the error mean square. This is due to the large variation between drivers being separated from the error variation. Factor B (driver) represents the block sum of squares separated from the error term, and its p-value shows that the block (driver) effect is statistically significant. The F value for the hypothesis of equal fuel mileage by Factor A (car type) is 43.447, which is greater than F₀.₀₅(2, 8) = 4.46, so you can reject H₀ at the significance level of 0.05. Consequently, significant differences in fuel mileage between car types can be found by removing the variation of the block from the error term.

<Figure 9.2.7> Two-way mean table by car and driver (there is no standard deviation for a single observation, denoted as NaN)
<Figure 9.2.8> ANOVA table for randomized block design

On average, car type A has better fuel mileage than the other car types.
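The randomized block result in this example can be reproduced numerically. The following Python sketch (an illustration; the text itself performs this in 『eStat』) recomputes the F value 43.447 from the Table 9.2.4 data:

```python
# Fuel-mileage data from Table 9.2.4: rows are car types A, B, C;
# columns are drivers 1-5 (the blocks).
data = [
    [22.4, 16.1, 19.7, 21.1, 24.5],   # car A
    [16.3, 12.6, 15.9, 17.8, 21.0],   # car B
    [20.2, 15.2, 18.7, 18.9, 23.8],   # car C
]
k, b = 3, 5
grand = sum(map(sum, data)) / (k * b)

sst = sum((y - grand) ** 2 for row in data for y in row)
sstr = b * sum((sum(row) / b - grand) ** 2 for row in data)
ssb = k * sum((sum(data[i][j] for i in range(k)) / k - grand) ** 2
              for j in range(b))
sse = sst - sstr - ssb

# F = MSTr / MSE with df (k-1) and (k-1)(b-1) = (2, 8)
f_car = (sstr / (k - 1)) / (sse / ((k - 1) * (b - 1)))
print(round(f_car, 3))   # 43.447, matching the ANOVA result in the text
```

Since 43.447 > F₀.₀₅(2, 8) = 4.46, the equal-mileage hypothesis is rejected, in agreement with <Figure 9.2.8>.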
In order to examine the differences between car types further, the multiple comparisons test of the previous section can be applied. In this example, a single HSD value can be used for all mean comparisons, because the number of repetitions at each level is the same. There is a significant difference in fuel mileage between all three types of cars, since the differences between the mean values (A−B = 4.04, A−C = 1.40, C−B = 2.64) are all greater than the critical value of 1.257.

The same analysis of the randomized block design can be done using 『eStatU』 by entering the data as follows and clicking the [Execute] button.

<Figure 9.2.9> Data input for 『eStatU』 RBD

[Practice 9.2.1]
The following is the result of an agronomist's survey of the yield of four varieties of wheat, using the randomized block design with three cultivated areas as blocks. Test whether the mean yields of the four wheat varieties are the same or not at the 5% significance level.

    Wheat Type   Cultivated Area 1   Cultivated Area 2   Cultivated Area 3
    A            50                  60                  56
    B            59                  52                  51
    C            55                  55                  52
    D            58                  58                  55

⇨ eBook ⇨ PR090201_WheatAreaYield.csv

50 / Chapter 9 Testing Hypothesis for Several Population Means

9.2.3 Latin Square Design

• In the randomized block design for measuring fuel mileage explained in the previous section, there is one extraneous source of block variation, the driver. If the researcher suspects an additional source of variation, such as road type, there are two identifiable sources of extraneous block variation, i.e., two block variables. In this case, the researcher needs a design that isolates and removes both sources of block variation from the residual. The Latin square design is such a design.

• In the Latin square design, we assign one source of extraneous variation to the columns of the square and the second source to the rows of the square. We then assign the treatments in such a way that each treatment occurs once and only once in each row and each column.
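The one-per-row, one-per-column property can be illustrated in code. The following Python sketch (our illustration, not from the text) builds a cyclic Latin square and randomizes it by shuffling rows and columns, each of which preserves the Latin property, together with a validity check:

```python
import random

def latin_square(r, seed=None):
    """Build a cyclic r x r Latin square, then randomize it by
    shuffling rows and columns (both shuffles preserve the
    one-per-row, one-per-column property)."""
    rng = random.Random(seed)
    letters = [chr(ord("A") + i) for i in range(r)]
    square = [[letters[(i + j) % r] for j in range(r)] for i in range(r)]
    rng.shuffle(square)                        # permute the rows
    cols = list(range(r))
    rng.shuffle(cols)                          # permute the columns
    return [[row[c] for c in cols] for row in square]

def is_latin(square):
    """True if every treatment occurs exactly once per row and column."""
    r = len(square)
    rows_ok = all(len(set(row)) == r for row in square)
    cols_ok = all(len({square[i][j] for i in range(r)}) == r
                  for j in range(r))
    return rows_ok and cols_ok
```

Note this samples only from the squares reachable by row/column permutations of the cyclic square, not from all possible Latin squares; it is a sketch of the randomization idea, not a complete enumeration.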
• The number of rows, the number of columns, and the number of treatments are therefore all equal. Table 9.2.5 shows a typical 3 × 3 Latin square with three rows, three columns and three treatments designated by the capital letters A, B, C.

Table 9.2.5 A 3 × 3 Latin square: three drivers (rows), three road types (columns), three car types (A, B, C)

                       Column 1   Column 2   Column 3
                       Road 1     Road 2     Road 3
    Row 1  Driver 1    A          B          C
    Row 2  Driver 2    B          C          A
    Row 3  Driver 3    C          A          B

Table 9.2.6 shows a typical 4 × 4 Latin square with four rows, four columns and four treatments designated by the capital letters A, B, C, D.

Table 9.2.6 A 4 × 4 Latin square: four drivers (rows), four road types (columns), four car types (A, B, C, D)

                       Column 1   Column 2   Column 3   Column 4
                       Road 1     Road 2     Road 3     Road 4
    Row 1  Driver 1    A          B          C          D
    Row 2  Driver 2    B          C          D          A
    Row 3  Driver 3    C          D          A          B
    Row 4  Driver 4    D          A          B          C

• In the Latin square design, treatments can be assigned randomly in such a way that each car type occurs once and only once in each row and each column. Therefore, there are many possible 3 × 3 and 4 × 4 Latin squares. We obtain randomization in the Latin square by randomly selecting a square of the desired dimension from all possible squares of that dimension. One method of doing this is to randomly assign a different treatment to each cell in each column, with the restriction that each treatment must appear once and only once in each row.

• Small Latin squares provide only a small number of degrees of freedom for the error mean square, so a minimum size of 5 × 5 is usually recommended.
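Once a Latin square layout and its observations are available, the analysis separates row, column and treatment variation from the residual. The following Python sketch (an illustration under the standard Latin-square partition; the text's own tool is 『eStatU』) computes the sums of squares directly from the square:

```python
def latin_anova(treat, y):
    """Sum-of-squares partition for an r x r Latin square design.
    treat[i][j] is the treatment label in row i, column j;
    y[i][j] is the corresponding observation."""
    r = len(y)
    grand = sum(map(sum, y)) / r ** 2
    row_m = [sum(row) / r for row in y]
    col_m = [sum(y[i][j] for i in range(r)) / r for j in range(r)]
    labels = sorted({t for row in treat for t in row})
    trt_m = {t: sum(y[i][j] for i in range(r) for j in range(r)
                    if treat[i][j] == t) / r for t in labels}

    sst = sum((v - grand) ** 2 for row in y for v in row)   # df: r*r - 1
    ssr = r * sum((m - grand) ** 2 for m in row_m)          # df: r - 1
    ssc = r * sum((m - grand) ** 2 for m in col_m)          # df: r - 1
    sstr = r * sum((m - grand) ** 2 for m in trt_m.values())  # df: r - 1
    sse = sst - ssr - ssc - sstr                            # df: (r-1)(r-2)
    return sst, ssr, ssc, sstr, sse
```

The treatment F statistic is then (SSTr/(r−1)) / (SSE/((r−1)(r−2))), matching the ANOVA table given below.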
• The hypothesis of the Latin square design with r treatments is as follows:

    Null hypothesis         H₀: τ₁ = τ₂ = ⋯ = τᵣ
    Alternative hypothesis  H₁: at least one pair of the τₖ is not equal

• The statistical model of the r × r Latin square design with r treatments can be represented as follows:

    Y_ij(k) = μ + ρ_i + γ_j + τ_k + ε_ij(k),    i = 1, ..., r;  j = 1, ..., r;  k = 1, ..., r

In this equation, ρ_i is the effect of level i of the row block variable on the response variable, γ_j is the effect of level j of the column block variable, and τ_k is the effect of treatment k on the response variable.

• Table 9.2.7 defines the notation for r × r Latin square data: Ȳi. denotes the average of row i, Ȳ.j the average of column j, Ȳ(k) the average of treatment k, and Ȳ.. the overall average.

• In the Latin square design, the variation resulting from differences between the levels of the two block variables can be separated from the error term, independently of the variation of the factor. The total variation is divided as follows:

    (Y_ij(k) − Ȳ..) = (Ȳi. − Ȳ..) + (Ȳ.j − Ȳ..) + (Ȳ(k) − Ȳ..) + (Y_ij(k) − Ȳi. − Ȳ.j − Ȳ(k) + 2Ȳ..)

If you square both sides of the equation above and sum over all cells, you obtain the following sums of squares:

    Total sum of squares:      SST  = Σ (Y_ij(k) − Ȳ..)²,                              degrees of freedom r² − 1
    Row sum of squares:        SSR  = r Σ_i (Ȳi. − Ȳ..)²,                              degrees of freedom r − 1
    Column sum of squares:     SSC  = r Σ_j (Ȳ.j − Ȳ..)²,                              degrees of freedom r − 1
    Treatment sum of squares:  SSTr = r Σ_k (Ȳ(k) − Ȳ..)²,                             degrees of freedom r − 1
    Error sum of squares:      SSE  = Σ (Y_ij(k) − Ȳi. − Ȳ.j − Ȳ(k) + 2Ȳ..)²,         degrees of freedom (r − 1)(r − 2)

• The following facts are always established in the Latin square design. Table 9.2.8 shows the ANOVA table of the Latin square design.
In this ANOVA table,

    Division of the sum of squares and degrees of freedom
    Sum of squares:      SST = SSE + SSR + SSC + SSTr
    Degrees of freedom:  r² − 1 = (r − 1)(r − 2) + (r − 1) + (r − 1) + (r − 1)

Table 9.2.8 ANOVA table of the Latin square design

    Variation   Sum of Squares   Degrees of freedom   Mean Squares                F value
    Treatment   SSTr             r − 1                MSTr = SSTr/(r − 1)         MSTr/MSE
    Row         SSR              r − 1                MSR  = SSR/(r − 1)          MSR/MSE
    Column      SSC              r − 1                MSC  = SSC/(r − 1)          MSC/MSE
    Error       SSE              (r − 1)(r − 2)       MSE  = SSE/((r − 1)(r − 2))
    Total       SST              r² − 1

Example 9.2.2
Table 9.2.9 shows the fuel mileage data of four car types (A, B, C, D) measured by four drivers and four road types with a Latin square design.

Table 9.2.9 Fuel mileage data by four drivers and four road types of four car types (A, B, C, D)

                       Column 1   Column 2   Column 3   Column 4
                       Road 1     Road 2     Road 3     Road 4
    Row 1  Driver 1    A(22)      B(16)      C(19)      D(21)
    Row 2  Driver 2    B(24)      C(16)      D(12)      A(15)
    Row 3  Driver 3    C(17)      D(21)      A(20)      B(15)
    Row 4  Driver 4    D(18)      A(18)      B(23)      C(22)

Use 『eStatU』 to carry out the analysis of variance to test whether the four car types have the same fuel mileage.

Answer
In 『eStatU』 - 'Testing Hypothesis ANOVA – Latin Square Design', select the number of treatments r = 4 and enter the data as shown in <Figure 9.2.10>.

<Figure 9.2.10> Data input for Latin square design in 『eStatU』

Click the [Execute] button to show the dot graph by car type as in <Figure 9.2.11> and the ANOVA table as in <Figure 9.2.12>. The dot graph and the result of the ANOVA show that there is no difference in fuel mileage between the car types.

<Figure 9.2.11> Dot graph by car type in Latin square design
<Figure 9.2.12> ANOVA table of Latin square design

[Practice 9.2.2]
To study the effect of packaging on the sales of a certain cereal, a researcher tries four different packaging methods (treatments) at four different times of the week (columns) in four different supermarket chains (rows). The variable of interest is daily sales. The following table shows the results of the study.
Do these data show a significant difference in shoppers' response to the different packaging methods? Let α = 0.05.

    Store   Time of week 1   Time of week 2   Time of week 3   Time of week 4
    1       A(50)            B(60)            C(56)            D(63)
    2       B(59)            C(52)            D(51)            A(57)
    3       C(55)            D(55)            A(52)            B(56)
    4       D(58)            A(58)            B(55)            C(61)

9.3 Analysis of Variance for Experiments of Two Factors

• If there are two factors affecting the response variable, the analysis is called a two-way analysis of variance. This technique is frequently used in experiments in engineering, medicine and agriculture. The response variable is observed at each combination of the levels of the two factors (denoted as A and B). In general, it is advisable to repeat the experiment at least twice at each combination of levels, if possible, in order to increase the reliability of the experimental results.

• When data are obtained from repeated experiments at each factor-level combination, the two-way ANOVA tests whether the population means of the levels of factor A are the same (called the main effect test of factor A), as in the one-way ANOVA, and whether the population means of the levels of factor B are the same (called the main effect test of factor B). In addition, the two-way ANOVA tests whether the effect of factor A is influenced by the level of the other factor B (called the interaction effect test). For example, in a chemical process, if higher pressure gives a greater amount of product when the temperature is low, while lower pressure gives a greater amount of product when the temperature is high, then an interaction effect exists between the two factors of temperature and pressure. An interaction effect exists when the effects of one factor change with changes in the level of the other factor.
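The temperature/pressure illustration above can be made concrete with a tiny numerical sketch. The cell means below are hypothetical values of our own choosing (not from the text); the code computes the effect of raising the pressure separately at each temperature, and an interaction is present exactly when those two effects differ:

```python
# Hypothetical cell means for the chemical-process illustration:
# product yield at each (temperature, pressure) combination.
yield_mean = {
    ("low", "low"): 60, ("low", "high"): 80,    # at low temperature, high pressure helps
    ("high", "low"): 85, ("high", "high"): 65,  # at high temperature, low pressure helps
}

# Effect of raising the pressure, computed separately at each temperature.
effect_low_temp = yield_mean[("low", "high")] - yield_mean[("low", "low")]     # +20
effect_high_temp = yield_mean[("high", "high")] - yield_mean[("high", "low")]  # -20

# If the two effects were equal, the factors would be additive (no interaction).
interaction_present = effect_low_temp != effect_high_temp
print(effect_low_temp, effect_high_temp, interaction_present)  # 20 -20 True
```

Graphically, this is why crossing (non-parallel) mean-profile lines in a plot like <Figure 9.3.2> suggest an interaction.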
Main effect and interaction effect
When data are obtained from repeated experiments at each factor-level combination, the two-way ANOVA tests whether the population means of the levels of factor A are the same (the main effect test of factor A), as in the one-way ANOVA, and whether the population means of the levels of factor B are the same (the main effect test of factor B). The two-way ANOVA also tests whether the effect of factor A is influenced by the level of the other factor B (the interaction effect test).

Example 9.3.1
Table 9.3.1 shows the yield data of three repeated agricultural experiments for each combination of four fertilizer levels and three rice types, collected to investigate the yield of rice.

Table 9.3.1 Yield of rice by fertilizers and types of rice (unit: kg)

    Fertilizer   Rice type 1   Rice type 2   Rice type 3
    1            64, 66, 70    72, 81, 64    74, 51, 65
    2            65, 63, 58    57, 43, 52    47, 58, 67
    3            59, 68, 65    66, 71, 59    58, 45, 42
    4            58, 50, 49    57, 61, 53    53, 59, 38

⇨ eBook ⇨ EX090301_YieldByRiceFertilzer.csv

1) Find the average yield for each combination of fertilizer and rice type.
2) Using 『eStat』, draw a scatter plot with the rice types (1, 2 and 3) on the X-axis and the yield on the Y-axis. Separate the colors of the dots in the scatter plot by the type of fertilizer. Then mark the average of each level combination on the scatter plot and connect the averages with lines for each type of fertilizer.
3) Test the main effects of fertilizer and rice type, and test the interaction effect of the two factors.
4) Using 『eStat』, check the result of the two-way analysis of variance.

Answer
1) For convenience, let us call the rice type factor A and the fertilizer factor B. The averages of the rice yield for each level combination of the two factors are shown in Table 9.3.2. Denote the rice yield as Y_ijk, and the average of the combination of level i of factor A and level j of factor B as Ȳij·.
Also denote the average of level i of factor A as Ȳi··, the average of level j of factor B as Ȳ·j·, and the overall average as Ȳ···.

Table 9.3.2 Average yield of rice by fertilizers and types of rice (unit: kg; cell values computed from Table 9.3.1)

    Fertilizer (Factor B)    Rice 1   Rice 2   Rice 3   Row average (Ȳ·j·)
    1                        66.67    72.33    63.33    67.44
    2                        62.00    50.67    57.33    56.67
    3                        64.00    65.33    48.33    59.22
    4                        52.33    57.00    50.00    53.11
    Column average (Ȳi··)    61.25    61.33    54.75    59.11

2) To draw a scatter plot for the two-way ANOVA using 『eStat』, enter the data as in <Figure 9.3.1>, where the fertilizer is variable 1, the rice type is variable 2 and the rice yield is variable 3.

<Figure 9.3.1> Data input for two-way ANOVA in 『eStat』

In the variable selection box which appears by clicking the ANOVA icon on the main menu, select 'Analysis Var' as Yield and 'By Group' as Rice and Fertilizer; the scatter plot of the yield by rice type appears as in <Figure 9.3.2>. In addition, the average yields for each rice type are marked as dots by fertilizer and linked with lines. In this graph, rice type 1 always yields more than rice type 3 regardless of the fertilizer used. Rice type 2 varies in yield depending on the type of fertilizer used, which suggests the existence of an interaction, and the use of fertilizer 1 usually results in a high yield regardless of the rice type.

<Figure 9.3.2> Yields by rice types and fertilizer types

3) Testing factor A, the main effect of rice types, means testing the following null hypothesis:

    H₀: the average yields of the three rice types are the same.

If the null hypothesis is rejected, we conclude that the main effect of rice types exists. In order to test the main effect of rice types, as in the one-way analysis of variance, we use the sum of squared distances from the average yield of each rice type, Ȳi··, to the overall average yield Ȳ···:

    SSA = 12 Σ_i (Ȳi·· − Ȳ···)²

where the weight 12 of each squared term is the number of observations for each rice type.
Since there are 3 rice types, the degrees of freedom of SSA is (3−1), and the sum of squares divided by (3−1), SSA/(3−1), is called the mean square of factor A, MSA.

Testing factor B, the main effect of fertilizer types, means testing the following null hypothesis:

    H₀: the average yields of the four fertilizer types are the same.

If the null hypothesis is rejected, we conclude that the main effect of fertilizer types exists. In order to test the main effect of fertilizer types, as in the one-way analysis of variance, we use the sum of squared distances from the average yield of each fertilizer type, Ȳ·j·, to the overall average yield Ȳ···:

    SSB = 9 Σ_j (Ȳ·j· − Ȳ···)²

where the weight 9 of each squared term is the number of observations for each fertilizer type. Since there are 4 fertilizer types, the degrees of freedom of SSB is (4−1), and SSB/(4−1) is called the mean square of factor B, MSB.

Testing the interaction effect of rice and fertilizer (represented as factor AB) means testing the following null hypothesis:

    H₀: there is no interaction effect between rice type and fertilizer type.

If the null hypothesis is rejected, we conclude that there is an interaction effect between rice types and fertilizer types. In order to test the interaction effect, we use the sum of squared terms obtained from each cell average Ȳij· by subtracting the average yield of rice type i, Ȳi··, subtracting the average yield of fertilizer type j, Ȳ·j·, and adding the overall average yield Ȳ···:

    SSAB = 3 Σ_i Σ_j (Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···)²

where the weight 3 of each squared term is the number of observations in each cell of rice and fertilizer type. The degrees of freedom of SSAB is (3−1)(4−1), and SSAB/((3−1)(4−1)) is called the mean square of the interaction AB, MSAB.

It is not possible to test each effect immediately using these sums of squares; the error sum of squares must also be calculated.
In order to calculate the error sum of squares, first calculate the total sum of squares, which is the sum of the squared distances from each observation to the overall average:

    SST = Σ_i Σ_j Σ_k (Y_ijk − Ȳ···)²

This total sum of squares can be shown mathematically to be the sum of the other sums of squares:

    SST = SSA + SSB + SSAB + SSE

Therefore, the error sum of squares can be calculated as follows:

    SSE = SST − SSA − SSB − SSAB

If the yields for each rice type and fertilizer type are assumed to be normal with equal variances, the statistic which divides each mean square by the error mean square follows an F distribution. Therefore, the main effects and the interaction effect can be tested using F distributions. If the interaction effect is separated, we test it first. The testing results at the 5% significance level are as follows:

① Testing the interaction effect of rice and fertilizer:

    F = MSAB / MSE = 1.77,    F₀.₀₅(6, 24) = 2.51

Since F = 1.77 < F₀.₀₅(6, 24) = 2.51, we conclude that there is no interaction. The interaction between rice and fertilizer seen in <Figure 9.3.2> is so small that it is not statistically significant; it may be due to random error. The p-value calculated using 『eStat』 is 0.1488.

② Testing the main effect of rice types (Factor A):

    F = MSA / MSE = 3.08,    F₀.₀₅(2, 24) = 3.40

Since F = 3.08 < F₀.₀₅(2, 24) = 3.40, we cannot reject the null hypothesis that the average yields of the rice types are the same. There is not enough statistical evidence that average yields differ depending on rice type. The p-value calculated using 『eStat』 is 0.0644.

③ Testing the main effect of fertilizer types (Factor B):

    F = MSB / MSE = 6.02,    F₀.₀₅(3, 24) = 3.01

Since F = 6.02 > F₀.₀₅(3, 24) = 3.01, we reject the null hypothesis that the average yields of the fertilizer types are the same. There is enough statistical evidence that average yields differ depending on fertilizer type. Since there is no interaction effect by ①, we can conclude that fertilizer 1 produces a higher yield than the other fertilizers.
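The sums of squares used in tests ①–③ can be reproduced directly from the raw data of Table 9.3.1. The following Python sketch (our illustration; the text computes these with 『eStat』) recovers the values that appear in the ANOVA table:

```python
# Rice-yield data from Table 9.3.1: data[i][j] is the list of three
# replicate yields for rice type i (factor A) and fertilizer j (factor B).
data = [
    [[64, 66, 70], [65, 63, 58], [59, 68, 65], [58, 50, 49]],   # rice 1
    [[72, 81, 64], [57, 43, 52], [66, 71, 59], [57, 61, 53]],   # rice 2
    [[74, 51, 65], [47, 58, 67], [58, 45, 42], [53, 59, 38]],   # rice 3
]
a, b, r = 3, 4, 3                     # levels of A, levels of B, replicates
all_y = [y for row in data for cell in row for y in cell]
grand = sum(all_y) / len(all_y)

mean_a = [sum(sum(cell) for cell in data[i]) / (b * r) for i in range(a)]
mean_b = [sum(sum(data[i][j]) for i in range(a)) / (a * r) for j in range(b)]
mean_ab = [[sum(data[i][j]) / r for j in range(b)] for i in range(a)]

ssa = b * r * sum((m - grand) ** 2 for m in mean_a)
ssb = a * r * sum((m - grand) ** 2 for m in mean_b)
ssab = r * sum((mean_ab[i][j] - mean_a[i] - mean_b[j] + grand) ** 2
               for i in range(a) for j in range(b))
sst = sum((y - grand) ** 2 for y in all_y)
sse = sst - ssa - ssb - ssab

f_a = (ssa / (a - 1)) / (sse / (a * b * (r - 1)))
print(round(ssa, 2), round(ssb, 2), round(ssab, 2), round(sse, 2))
# 342.39 1002.89 588.94 1333.33
```

These match Table 9.3.3 below, and F = MSA/MSE ≈ 3.08 agrees with test ②.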
The p-value calculated using 『eStat』 is 0.0033.

The result of the two-way analysis of variance is shown in Table 9.3.3.

Table 9.3.3 Two-way analysis of variance of yields by rice and fertilizer types

    Factor            Sum of Squares   Degrees of freedom   Mean Squares   F value   p-value
    Rice Type         342.3889         2                    171.1944       3.0815    0.0644
    Fertilizer Type   1002.8889        3                    334.2963       6.0173    0.0033
    Interaction       588.9444         6                    98.1574        1.7668    0.1488
    Error             1333.3333        24                   55.5556
    Total             3267.5556        35

4) If you press the [ANOVA F-test] button in the options window below <Figure 9.3.2> of 『eStat』, the two-dimensional table of means / standard deviations for each level combination, as in <Figure 9.3.3>, and the two-way analysis of variance table, as in <Figure 9.3.4>, will appear in the Log Area.

<Figure 9.3.3> Two-dimensional mean / standard deviation table
<Figure 9.3.4> Two-way analysis of variance table

• Let us generalize the theory of the two-way analysis of variance discussed in the example above. Let Y_ijk be the random variable representing the k-th observation at level i of factor A, which has a levels, and level j of factor B, which has b levels. A statistical model of the two-way analysis of variance is as follows:

    Y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk,    i = 1, ..., a;  j = 1, ..., b;  k = 1, ..., r

    μ: overall mean
    α_i: effect of level i of factor A
    β_j: effect of level j of factor B
    (αβ)_ij: interaction effect of level i of factor A and level j of factor B
    ε_ijk: error terms which are independent and follow N(0, σ²)

• Assume that the experiments are repeated r times at each combination of level i of factor A and level j of factor B; the total number of observations is then abr. The total sum of squared distances from each observation to the overall mean can be partitioned into the following sums of squares, similar to the one-way analysis of variance.
    Total sum of squares:       SST  = Σ_i Σ_j Σ_k (Y_ijk − Ȳ···)²,               degrees of freedom abr − 1
    Factor A sum of squares:    SSA  = br Σ_i (Ȳi·· − Ȳ···)²,                     degrees of freedom a − 1
    Factor B sum of squares:    SSB  = ar Σ_j (Ȳ·j· − Ȳ···)²,                     degrees of freedom b − 1
    Interaction sum of squares: SSAB = r Σ_i Σ_j (Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···)²,    degrees of freedom (a − 1)(b − 1)
    Error sum of squares:       SSE  = Σ_i Σ_j Σ_k (Y_ijk − Ȳij·)²,               degrees of freedom ab(r − 1)

    Partition of the sum of squares and degrees of freedom
    Sum of squares:      SST = SSA + SSB + SSAB + SSE
    Degrees of freedom:  abr − 1 = (a − 1) + (b − 1) + (a − 1)(b − 1) + ab(r − 1)

• The two-way analysis of variance is summarized in Table 9.3.4.

Table 9.3.4 Two-way analysis of variance table

    Factor        Sum of Squares   Degrees of freedom   Mean Squares                   F value
    Factor A      SSA              a − 1                MSA  = SSA/(a − 1)             MSA/MSE
    Factor B      SSB              b − 1                MSB  = SSB/(b − 1)             MSB/MSE
    Interaction   SSAB             (a − 1)(b − 1)       MSAB = SSAB/((a − 1)(b − 1))   MSAB/MSE
    Error         SSE              ab(r − 1)            MSE  = SSE/(ab(r − 1))
    Total         SST              abr − 1

Two-way analysis of variance without repetition of experiments
If there is no repeated observation at each level combination of the two factors, the interaction effect cannot be estimated, and the interaction row is deleted from the two-way ANOVA table above. In this case, the analysis of variance table is the same as that of the randomized block design, Table 9.2.3.

• The tests of hypotheses for the main effects and the interaction effect of factor A and factor B are as follows. If the interaction effect is separated, it is reasonable to test the interaction effect first, because, depending on the significance of the interaction effect, the interpretation of the main effect tests can differ.

1) Test for the interaction effect:
    H₀: (αβ)₁₁ = (αβ)₁₂ = ⋯ = (αβ)_ab = 0
    If MSAB/MSE > F_α((a − 1)(b − 1), ab(r − 1)), then reject H₀.

2) Test for the main effect of factor A:
    H₀: α₁ = α₂ = ⋯ = α_a = 0
    If MSA/MSE > F_α(a − 1, ab(r − 1)), then reject H₀.

3) Test for the main effect of factor B:
    H₀: β₁ = β₂ = ⋯ = β_b = 0
    If MSB/MSE > F_α(b − 1, ab(r − 1)), then reject H₀.

(『eStat』 calculates the p-value for each of these tests and tests them using it. That is, for each test, if the p-value is less than the significance level, the null hypothesis is rejected.)
• If the test for the interaction effect is not significant, tests of the main effects of each factor can be performed to look for significant differences between levels. However, if there is a significant interaction effect, the tests for the main effects of each factor are meaningless, and an analysis should instead be made of which level combinations of the factors show differences in the means.

• If you conclude that significant differences exist between the levels of a factor, as in the one-way analysis of variance, you can compare the confidence intervals at each level to see where the differences appear. A residual analysis is also necessary to investigate the validity of the assumptions.

[Practice 9.3.1]
The result of an experiment at a production plant of an electronic component to investigate the life of the product due to changes in temperature and humidity is as follows. Analyze the data using the analysis of variance at the 5% significance level. (Unit: time)

    6.29  6.38  6.25    5.80  5.92  5.78
    5.95  6.05  5.89    6.32  6.44  6.29

⇨ eBook ⇨ PR090301_LifeByTemperatureHumidity.csv

Design of experiments for the two-way analysis of variance
Even in the two-way analysis of variance, obtaining sample data at each level of the two factors in engineering or agriculture can be influenced by other factors, so one should be careful in sampling. In order to accurately identify the differences that may exist between the levels of a factor, it is advisable to minimize the influence of other factors. One of the most commonly used methods of doing this is the completely randomized design, which randomizes the entire set of experiments. There are many other experimental design methods; for more information, refer to references on the design of experiments with several factors.

Exercise

9.1 Complete the following ANOVA table.
    Factor      Sum of Squares   df    Mean Squares   F ratio
    Treatment   154.9199         4     ______          ______
    Error       ________         __    ______
    Total       200.4773         39

9.2 Answer the following questions based on this ANOVA table.

    Factor      Sum of Squares   df    Mean Squares   F ratio
    Treatment   5.05835          2     2.52917        1.0438
    Error       65.42090         27    2.42300

1) How many levels of treatment are compared?
2) How many total observations are there?
3) Can you conclude that the levels of treatment are significantly different at the 5% significance level? Why?

9.3 In order to test customers' responses to new products, four different exhibition methods (A, B, C and D) were used by a company. Each exhibition method was used in nine stores, selected from 36 stores that met the company's criteria. The total sales for the weekend are shown in the following table.

    Exhibition Method   Sales for the weekend in 9 stores (unit: 1000 USD)
    A                   5  6  7  7  8  6  7  7  6
    B                   2  2  2  3  3  2  3  3  2
    C                   2  2  3  3  2  2  2  3  3
    D                   6  6  7  8  8  8  6  6  6

1) Draw a scatter plot of sales (y-axis) and exhibition method (x-axis). Mark the average sales of each exhibition method and connect them with a line.
2) Test whether the sales differ by exhibition method at the 5% significance level. Can you conclude that one of the exhibition methods shows a significant effect on sales?

9.4 The following table shows mileages in km per liter obtained from experiments to compare three brands of gasoline. In this experiment, seven cars of the same type were used in a similar situation to reduce the variation between cars.

    Gasoline   Mileage in km / liter
    A          14  19  19  16  15  17  20
    B          20  21  18  20  19  19  18
    C          20  26  23  24  23  25  23

1) Calculate the average mileage of each gasoline brand. Draw a scatter plot of gas mileage (y-axis) and gasoline brand (x-axis) to compare.
2) From these data, test whether there are differences between gasoline brands in gas mileage at the 5% significance level.

9.5 The result of a survey on the job satisfaction of three companies (A, B and C) is as follows.
Test whether the averages of job satisfaction of the three companies are different at the 5% significance level.

    Company   Job satisfaction score
    A         69  59  66  55  57  70
    B         67  68  56  59  71  68
    C         65  61  63  52  72  74

9.6 Psychologists were asked to investigate the job satisfaction of salespeople in three companies: A, B and C. Ten salespeople were randomly selected from each company, and a test to measure job satisfaction was conducted. The test scores are as follows. From these data, can we claim that the average job satisfaction scores of the three companies are different at the significance level of 0.05?

    Company   Job satisfaction score
    A         67  65  59  59  58  61  66  53  51  64
    B         66  68  55  59  61  66  62  65  64  74
    C         87  80  67  89  80  84  78  65  72  85

9.7 An advertising agency experimented to find out the effects of five forms (A, B, C, D and E) of TV advertising. Fifty television viewers were shown the five forms of TV commercials for a cold medicine in random order, one by one. The effect of the advertising after viewing was measured and recorded as follows. Test an appropriate hypothesis at the 5% significance level.

    Form of TV Advertising
    A    20  23  21  23  26  24  26  23  20  24
    B    28  27  22  28  23  29  27  25  28  21
    C    33  25  27  25  25  33  31  27  26  33
    D    34  26  33  32  34  29  29  25  26  32
    E    49  41  41  39  41  48  43  43  46  35

9.8 The following is the result of an agronomist's survey of the yield of four varieties of wheat, using the randomized block design with three cultivated areas as blocks. Test whether the mean yields of the four wheat varieties are the same or not at the 5% significance level.

    Wheat Type   Area 1   Area 2   Area 3   Average (Ȳi.)
    A            60       61       56       59
    B            59       52       51       54
    C            55       55       52       54
    D            58       58       55       57

9.9 Answer the following questions based on the following ANOVA table.
    Factor   Sum of Squares   df   Mean Squares   F value   p-value
    A        12.3152          2    6.1575         29.4021   < 0.005
    B        19.7844          3    6.5948         31.4898   < 0.005
    AB       8.9416           6    1.4902         7.1159    < 0.005
    Error    10.0525          48   0.2094
    Total    51.0938          59

1) What method of analysis was used?
2) What conclusions can be obtained from the above analysis table? The significance level is 0.05.

9.10 Research was conducted to compare the job satisfaction of workers in the assembly process under different working conditions. Another concern is the relationship between job satisfaction and years of service. The observers would like to investigate the interaction effect between years of service and working conditions. The following table shows the levels of job satisfaction obtained from the survey. Analyze the data using an appropriate methodology.

    Years of service   Working condition: Good   Fair              Bad
    < 5                12 15 15 14 12            10 10 9 10 9      8 7 7 8 6
    5 - 10             12 14 12 10 11            10 10 14 14 10    10 11 12 10 14
    11 or more         9 10 9 9 10               10 11 10 10 12    12 14 15 15 15

9.11 The following table shows the degree of stress in the work and the level of anxiety among 27 workers classified by years of service. Analyze the data using the analysis of variance at the 5% significance level.

    Years of service (Factor A)   Job-induced pressure (Factor B): Good   Fair        Bad
    < 5                           25 28 22                                18 23 19    17 24 19
    5 - 10                        28 32 30                                16 24 20    18 22 20
    11 or more                    25 35 30                                14 16 15    10 8 12

9.12 A fertilizer manufacturer hired a research team to study the yields of three grain seeds (A, B, C) with three types of fertilizer (1, 2, 3). The three grain seeds were used in combination with the three types of fertilizer, and the experiment was repeated three times at each combination of treatments. Each combination of treatments was randomly assigned to 27 different regions. Analyze the data using the analysis of variance at the 5% significance level.
Seed type | Fertilizer type 1 | 2        | 3
A         | 5 8 7             | 8 8 10   | 10 9 10
B         | 6 8 6             | 10 12 11 | 15 14 14
C         | 7 8 10            | 12 12 14 | 16 10 18

9.13 The result of an experiment at a production plant of an electronic component, investigating the life of the product due to changes in temperature and humidity, is as follows. Analyze the data using the analysis of variance with the 5% significance level. (Unit: Time)

6.29 6.38 6.25 5.95 6.05 5.89 5.80 5.92 5.78 6.32 6.44 6.29

9.14 The result of a fertilizer manufacturer's experiment on the production of soybeans from two seeds using three types of fertilizer (A, B, and C) is as follows. Each fertilizer and seed combination was tested four times. Analyze the data using the analysis of variance with the 5% significance level.

Fertilizer | A        | B           | C
Seed 1     | 5 8 7 6  | 8 8 10 10   | 10 12 10 10
Seed 2     | 8 6 8 10 | 12 11 12 14 | 14 16 16 18

66 / Chapter 9 Testing Hypothesis for Several Population Means

Multiple Choice Exercise

9.1 Who first announced the ANOVA method?
① Laspeyres ② Paasche ③ Fisher ④ Edgeworth

9.2 What is the abbreviation of the analysis of variance?
① ANOVA ② ③ ④

9.3 Which area is not an application area of the analysis of variance?
① marketing survey ② quality control ③ economy forecasting ④ medical experiment

9.4 Which sampling distribution is used for the analysis of variance?
① distribution ② distribution ③ distribution ④ Normal distribution

9.5 Which is the correct process for the one-way ANOVA?
a. Calculate Total SS, Treatment SS, Error SS
b. Set the hypothesis
c. Test the hypothesis
d. Calculate the variance ratio in the ANOVA table
e. Find the value in the F distribution table
① a→b→c→d→e ② b→d→e→a→c ③ b→a→d→e→c ④ b→e→d→a→c

9.6 Which is the correct relationship among the total sum of squares (SST), between sum of squares (SSB), and error sum of squares (SSE)?
① SST = SSB + SSE ② SST = SSB − SSE ③ SST = SSE − SSB ④ SST = SSB * SSE

9.7 If and the observed F ratio is 6.90 in the ANOVA table, what is your conclusion with the 5% significance level?
① significantly different ② no significant difference ③ very similar ④ unknown

9.8 Which does not appear in an analysis of variance table?
① sum of squares ② F ratio ③ degrees of freedom ④ standard deviation

9.9 What is the name of a variable which affects the response variable in an experimental design?
① cause element ② independent variable ③ dependent variable ④ factor

9.10 In order to compare the fuel mileage of three types of cars, three drivers would like to drive the cars, but fuel mileage may be affected by the driver. What is a variable such as the driver called?
① block variable ② independent variable ③ dependent variable ④ factor

9.11 When we compare the fuel mileage of three types of cars, which experimental design is used to reduce the effect of drivers?
① completely randomized design ② Latin square method ③ two-way ANOVA ④ randomized block design

9.12 What is the effect of a factor A called when it varies depending on the level of factor B?
① main effect of factor A ② main effect of factor B ③ two-way ANOVA ④ interaction effect

(Answers) 9.1 ③, 9.2 ①, 9.3 ③, 9.4 ②, 9.5 ③, 9.6 ①, 9.7 ①, 9.8 ④, 9.9 ④, 9.10 ①, 9.11 ④, 9.12 ④

10 Nonparametric Testing Hypothesis

SECTIONS
10.1 Nonparametric Test for Location of Single Population
10.1.1 Sign Test
10.1.2 Wilcoxon Signed Rank Sum Test
10.2 Nonparametric Test for Comparing Locations of Two Populations
10.2.1 Independent Samples: Wilcoxon Rank Sum Test
10.2.2 Paired Samples: Wilcoxon Signed Rank Sum Test

CHAPTER OBJECTIVES
The hypothesis tests from Chapters 7 through 9 are based on assumptions such that the populations of continuous data follow normal distributions. However, in real-world data, such assumptions may not be satisfied. This chapter introduces nonparametric methods for testing hypotheses by converting data into forms such as ranks, which do not require assumptions on the population distribution.
10.3 Nonparametric Test for Comparing Locations of Several Populations
10.3.1 Completely Randomized Design: Kruskal-Wallis Test
10.3.2 Randomized Block Design: Friedman Test

Section 10.1 introduces tests for the location parameter of a single population, such as the Sign Test and the Signed Rank Test. Section 10.2 introduces tests for comparing the location parameters of two populations, such as the Wilcoxon Rank Sum Test. Section 10.3 introduces tests for comparing the location parameters of several populations, such as the Kruskal-Wallis Test and the Friedman Test.

10.1 Nonparametric Test for the Location Parameter of a Single Population

• The hypothesis test for a population mean in Chapter 7 can be done using the t distribution in the case of a small sample if the population is assumed to follow a normal distribution. If we make some assumptions about a population distribution and test a population parameter using sample data in this way, it is called a parametric test. The hypothesis tests for two population parameters in Chapter 8 and the analysis of variance in Chapter 9 are also parametric tests, because they assume that the populations follow normal distributions.
• However, for real-world data it may not be appropriate to assume that a population follows a normal distribution, or the sample may not be large enough to justify such an assumption. In some cases, the data collected are not continuous or are ordinal, such as ranks; then parametric tests are not appropriate. In such cases, methods that test population parameters by converting the data into signs or ranks, without assumptions on the population distribution, are called distribution-free or nonparametric tests.
• Since a nonparametric test utilizes converted data such as signs or ranks, there may be some loss of information about the data. Therefore, if a population can be assumed to follow a normal distribution, there is no reason to use a nonparametric test.
• In fact, when a population follows a normal distribution, a nonparametric test has a higher probability of a type 2 error at the same significance level. However, a nonparametric test is more appropriate if the data come from a population that does not follow a normal distribution.
• The hypothesis test for a population mean in Chapter 7 is based on the central limit theorem for the sampling distribution of all possible sample means. A nonparametric test, instead, uses signs obtained by examining whether data values are smaller or larger than the central location parameter of the population (the Sign Test of Section 10.1.1), or uses ranks calculated from the data (the Wilcoxon Signed Rank Test of Section 10.1.2). Here, the central location parameter can be the population mean or the population median, but it usually refers to the population median, which is not affected by extreme points of the data.
• Estimation of a population parameter can also be made using a nonparametric method, but this chapter only introduces nonparametric hypothesis tests. Those interested in nonparametric estimation should refer to the relevant literature.

10.1.1 Sign Test

• Let's take a look at the sign test with the following example.

Example 10.1.1
A bag of cookies is marked with a weight of 200g. Ten bags are randomly selected from several retailers and their weights are examined as follows. Can you say that the bags contain as many cookies as the marked weight?

203 204 197 195 201 205 198 199 194 207 ⇨ eBook ⇨ EX100101_CookieWeight.csv

1) Draw a histogram of the data to check whether a testing hypothesis using a parametric method can be performed.
2) Test the hypothesis using a nonparametric method which utilizes the sign data, by examining whether data values are smaller or larger than 200, with the significance level of 5%.
3) Check the result of the above test using『eStatU』.
Example 10.1.1 Answer
1) The null and alternative hypotheses to test the population mean can be written as follows:

H0: μ = 200, H1: μ ≠ 200

In order to test the hypothesis using the parametric t-test in Chapter 7, it is necessary to assume that the population is normally distributed, because the sample size of 10 is small. Let us check whether the sample data follow a normal distribution by using a histogram. Enter the data in『eStat』as shown in <Figure 10.1.1>.

<Figure 10.1.1> Data input for cookie weight

Click the icon of the testing hypothesis for the population mean and select ‘Weight’ as the analysis variable in the variable selection box. A dot graph with the 95% confidence interval will appear as in <Figure 10.1.2>. If you click the [Histogram] button in the options window below the graph, a histogram as shown in <Figure 10.1.3> will appear. If you look at the histogram, it is not sufficient to assume that the population follows a normal distribution. In such cases, applying a parametric hypothesis test may lead to errors.

<Figure 10.1.2> Dot graph of the cookie weight
<Figure 10.1.3> Histogram of the cookie weight

2) In this case, the sample data can be converted to sign data simply by examining whether the weight of a cookie bag is greater than 200g (marked +) or not (marked −).

sample data: 203 204 197 195 201 205 198 199 194 207
sign data:    +   +   −   −   +   +   −   −   −   +

If the numbers of + signs and − signs are similar, the weight of a cookie bag would be approximately 200g. If the number of + signs is larger than that of − signs, then the weight of a cookie bag is greater than 200g. If the number of − signs is larger than that of + signs, then the weight of a cookie bag is less than 200g.
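The sign conversion above can be checked with a few lines of code. Python is not part of the text's eStat workflow; this is only an illustrative sketch.

```python
# Convert each weight to a sign by comparing with the marked weight of 200g
data = [203, 204, 197, 195, 201, 205, 198, 199, 194, 207]
signs = ['+' if x > 200 else '-' for x in data]

n_plus = signs.count('+')
n_minus = signs.count('-')
print(signs)            # ['+', '+', '-', '-', '+', '+', '-', '-', '-', '+']
print(n_plus, n_minus)  # 5 5
```

With five + signs and five − signs, the sample gives no visible evidence against a median of 200g.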
Since the sign data above only record whether a data value is larger or smaller than 200, and never use the concept of the mean, the test can be considered a test for the population median (M) as follows:

H0: M = 200, H1: M ≠ 200

In the sign data above, ‘the number of + signs’ (denote it as S) or ‘the number of − signs’ follows a binomial distribution with parameters n = 10, p = 0.5 (<Figure 10.1.4>).

<Figure 10.1.4> Binomial distribution when n = 10, p = 0.5

Therefore, if H0 is correct, the number of + signs is most likely to be around 5, and values such as 0, 1 or 9, 10 are very unlikely. In order to test H0: M = 200 with the 5% significance level, since it is a two-sided test, the rejection region should have a probability of 2.5% at each end of the binomial distribution, so it is approximately as follows:

If the number of + signs (S) is either 0, 1 (cumulated probability from the left is 0.011) or 9, 10 (cumulated probability from the right is 0.011), then reject H0.

This rejection region has a total probability of 2·0.011 = 0.022, which is smaller than the significance level of 0.05. When we use a discrete distribution such as the binomial, it may be difficult to find a rejection region whose probability is exactly the significance level. If we include one more value in the rejection region, the decision rule is as follows:

If the number of + signs (S) is either 0, 1, 2 (cumulated probability from the left is 0.055) or 8, 9, 10 (cumulated probability from the right is 0.055), then reject H0.

This rejection region has a total probability of 2·0.055 = 0.110, which is greater than the significance level of 0.05. Therefore, the middle values 1.5 (of 1 and 2) and 8.5 (of 8 and 9) can be used in the decision rule as follows:

If the number of + signs (S) < 1.5 or S > 8.5, then reject H0.

This method is also approximate.
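The cumulative binomial probabilities quoted above (0.011 and 0.055) can be reproduced with the Python standard library; a sketch, not part of the text's eStatU workflow:

```python
from math import comb

n, p = 10, 0.5
# pmf of S ~ B(10, 0.5) under the null hypothesis
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
# cumulative probability from the left: P(S <= k)
cum = [sum(pmf[:k + 1]) for k in range(n + 1)]

print(round(cum[1], 3))  # 0.011 -> left tail of the region {0, 1}
print(round(cum[2], 3))  # 0.055 -> adding 2 already exceeds 2.5% per tail
```

By the symmetry of B(10, 0.5), the right-tail probabilities P(S ≥ 9) and P(S ≥ 8) are the same 0.011 and 0.055.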
In the case of testing using a discrete distribution, it is not possible to say which of the above decision rules is ‘right’; the analyst should select a critical value near the significance level. In this example, the number of + signs (S) is 5, so you cannot reject the null hypothesis H0. In other words, the median weight of the cookie bag can be regarded as 200g.

3) Enter the data as shown in <Figure 10.1.5> in『eStatU』and press the [Execute] button to show the test result as in <Figure 10.1.6>. It shows the critical lines for values containing the significance level of 5% (2.5% at each end). For a discrete distribution such as the binomial distribution, the choice of the final rejection region shall be determined by the analyst.

<Figure 10.1.5> Data input for sign test in『eStatU』
<Figure 10.1.6> Result of sign test using『eStatU』

[Practice 10.1.1]
A psychologist has randomly selected 9 handicapped workers from production workers employed at various factories in a large industrial complex, and their work competency scores are examined as follows. The psychologist wants to test whether the population median score is 40. Assume the population distribution is symmetric about the mean.

32, 52, 21, 39, 23, 55, 36, 27, 37 ⇨ eBook ⇨ PR100101_CompetencyScore.csv

1) Check whether a parametric test is possible.
2) Apply the sign test with the significance level of 5%.

• When the population median is M0, the sign test tests whether M = M0 or M > M0 (or M < M0, or M ≠ M0). However, if the population distribution is symmetric about the mean, the sign test is the same as a test of the population mean, because the mean and median coincide in this case. When there are n samples, the test statistic for the sign test uses the number of data values which are greater than M0, denoted as S.
• The sign test uses the random variable ‘the number of + signs (S)’, which follows the binomial distribution B(n, 0.5) when the null hypothesis is true. You can also use the number of data values which are less than M0, which is S′ = n − S; S′ also follows the binomial distribution B(n, 0.5). Let us use S in this section. Let b(α; n) denote the right-tail 100×α percentile of B(n, 0.5); the accurate percentile value may not exist, because it is a discrete distribution. In this case, the middle value of the two nearest percentiles is often used. Table 10.1.1 summarizes the decision rule for each type of hypothesis of the sign test.

Table 10.1.1 Decision rule of the sign test
Test Statistic S = ‘number of plus sign data’
Type of Hypothesis / Decision Rule
1) H0: M = M0 vs H1: M > M0 — If S ≥ b(α; n), then reject H0, else accept H0
2) H0: M = M0 vs H1: M < M0 — If S ≤ n − b(α; n), then reject H0, else accept H0
3) H0: M = M0 vs H1: M ≠ M0 — If S ≤ n − b(α/2; n) or S ≥ b(α/2; n), then reject H0, else accept H0

☞ What if an observed value is the same as M0? If any of the observations has the same value as M0, it is not used in the sign test; in other words, reduce n.

• As studied in Chapter 5, the binomial distribution can be approximated by the normal distribution if n is sufficiently large. Therefore, if the sample size is large, the test statistic S = ‘the number of plus sign data’ can be tested using the normal distribution N(n/2, n/4). Table 10.1.2 summarizes the decision rule for each hypothesis of the sign test in the case of large samples.

Table 10.1.2 Decision rule of the sign test (large sample case)
Test Statistic: Z = (S − n/2) / √(n/4)
Type of Hypothesis / Decision Rule
1) H1: M > M0 — If Z ≥ z(α), then reject H0, else accept H0
2) H1: M < M0 — If Z ≤ −z(α), then reject H0, else accept H0
3) H1: M ≠ M0 — If |Z| ≥ z(α/2), then reject H0, else accept H0

10.1.2 Wilcoxon Signed Rank Sum Test

• The sign test described in the previous section converted the sample data to either + or − symbols by examining whether the data were larger or smaller than the median M0. In this case, most of the information that the original sample data carry is lost.
In order to apply the Wilcoxon signed rank test, we first subtract M0 from the sample data and take the absolute values of these differences. We assign ranks to these absolute values and calculate the sum of the ranks of the data greater than M0 and the sum of the ranks of the data smaller than M0. If the two rank sums are similar, we conclude that the population median is equal to M0. This signed rank sum test is the most widely used nonparametric method for testing the central location parameter of a population. It takes into account the relative sizes of the sample data as well as whether they are larger or smaller than M0.

Example 10.1.2
Using the cookie weight data of [Example 10.1.1], apply the signed rank test to see whether the weight of the cookie bag is 200g or not with the significance level of 5%.

203 204 197 195 201 205 198 199 194 207 ⇨ eBook ⇨ EX100101_CookieWeight.csv

Check the result of the signed rank test using『eStatU』.

Example 10.1.2 Answer
The hypothesis for this problem is to test whether the population median (M) is 200g or not:

H0: M = 200, H1: M ≠ 200

The signed rank sum test examines not only whether the sample data are greater than M0 = 200g (+ sign) or not (− sign), but also the ranks of the values |data − 200|. If there are tied values, the average rank is assigned to each of the tied values. For example, since there are two tied values of ‘1’, which is the smallest among |data − 200|, the corresponding ranks 1 and 2 are averaged to 1.5, and this average rank is assigned to each value ‘1’.

Sample data    203  204  197  195  201  205  198  199  194  207
Sign           +    +    −    −    +    +    −    −    −    +
|data − 200|   3    4    3    5    1    5    2    1    6    7
Rank           4.5  6    4.5  7.5  1.5  7.5  3    1.5  9    10

Rank sum of ‘+’ signs: W+ = 4.5 + 6 + 1.5 + 7.5 + 10 = 29.5

The sum of all ranks is 1 + 2 + ⋯ + 10 = 55. If the rank sum of the + sign data (W+) and the rank sum of the − sign data (W−) are similar (each approximately 27.5 or so), the null hypothesis M = 200g would be accepted. In this example, W+ = 29.5 and W− = 25.5. Since W+ is greater than W−, the data greater than 200g appear to be dominant.
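The rank sums above can be verified programmatically. A minimal Python sketch (the text itself uses eStatU; this is only for illustration):

```python
data = [203, 204, 197, 195, 201, 205, 198, 199, 194, 207]
m0 = 200
diffs = [x - m0 for x in data if x != m0]     # values equal to M0 are dropped
abs_sorted = sorted(abs(d) for d in diffs)

def avg_rank(v):
    # average rank of v among the absolute differences (ties get the averaged rank)
    positions = [i + 1 for i, a in enumerate(abs_sorted) if a == v]
    return sum(positions) / len(positions)

w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
w_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
print(w_plus, w_minus)   # 29.5 25.5
```

The two rank sums always add to n(n+1)/2 = 55, so computing either one determines the other.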
But how large a difference is statistically significant?

To investigate this when the null hypothesis is true, the sampling distribution of the random variable W+ = ‘rank sum of + sign data’ (or W− = ‘rank sum of − sign data’) should be known. If H0 is true, the number of possible cases for W+ is shown in Table 10.1.3. It is not easy to examine all of these possible rankings by hand to create a distribution table.『eStatU』shows the distribution of the Wilcoxon signed rank sum as in <Figure 10.1.7> and its table as in Table 10.1.4.

Table 10.1.3 All possible cases of W+ = ‘rank sum of + sign data’ (n = 10)
Number of data with + sign | All possible combinations of ranks | All possible rank sums W+
0  | {}                                           | 0
1  | {1}, {2}, ⋯ , {10}                           | 1, 2, ⋯ , 10
2  | {1,2}, {1,3}, ⋯ , {1,10}, {2,3}, ⋯ , {9,10}  | 3, 4, ⋯ , 19
⋯  | ⋯                                            | ⋯
10 | {1, 2, ⋯ , 10}                               | 55

<Figure 10.1.7> Distribution of the Wilcoxon signed rank sum when n = 10

Table 10.1.4 Distribution of the Wilcoxon signed rank sum when n = 10
x    P(X = x)  P(X ≤ x)  P(X ≥ x)
0    0.0010    0.0010    1.0000
1    0.0010    0.0020    0.9990
2    0.0010    0.0029    0.9980
3    0.0020    0.0049    0.9971
4    0.0020    0.0068    0.9951
5    0.0029    0.0098    0.9932
6    0.0039    0.0137    0.9902
7    0.0049    0.0186    0.9863
8    0.0059    0.0244    0.9814
9    0.0078    0.0322    0.9756
⋯    ⋯         ⋯         ⋯
47   0.0059    0.9814    0.0244
48   0.0049    0.9863    0.0186
49   0.0039    0.9902    0.0137
50   0.0029    0.9932    0.0098
51   0.0020    0.9951    0.0068
52   0.0020    0.9971    0.0049
53   0.0010    0.9980    0.0029
54   0.0010    0.9990    0.0020
55   0.0010    1.0000    0.0010

Since it is a two-sided test with the 5% significance level, if you look for the 2.5% percentile at both ends, P(X ≤ 8) = 0.0244 and P(X ≥ 47) = 0.0244. In the case of a discrete distribution, we cannot find the exact 2.5 percentile at either end.
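The exact null distribution in Table 10.1.4 can be reproduced by brute force: under H0, each of the 2^10 = 1024 sign patterns is equally likely. A sketch:

```python
n = 10
counts = {}
for mask in range(1 << n):                 # each of the 2**10 sign patterns
    # rank sum of the ranks that received a + sign in this pattern
    s = sum(r + 1 for r in range(n) if (mask >> r) & 1)
    counts[s] = counts.get(s, 0) + 1

total = 1 << n                             # 1024
p_le_8 = sum(c for s, c in counts.items() if s <= 8) / total
p_ge_47 = sum(c for s, c in counts.items() if s >= 47) / total
print(round(p_le_8, 4), round(p_ge_47, 4))   # 0.0244 0.0244
```

The symmetry of the distribution about 27.5 explains why the two tail probabilities are identical.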
Therefore, the decision rule can be written as follows: ‘If W+ ≤ 8 or W+ ≥ 47, then reject H0.’ Since W+ = 29.5 in this problem, we cannot reject H0.

After entering the data in『eStatU』as in <Figure 10.1.8>, pressing the [Execute] button will calculate the sample statistics and show the test result as in <Figure 10.1.9>. The critical lines are the values containing the 5% significance level from both sides (the probability at each end is 2.5%). For a discrete distribution, the choice of the final rejection region should be determined by the analyst.

<Figure 10.1.8>『eStatU』Signed rank sum test
<Figure 10.1.9> Signed rank sum test in『eStatU』

The signed rank sum test may also be done using『eStat』. If you enter the data as shown in <Figure 10.1.10>, select 'Weight' as the analysis variable in the variable selection box and click the icon of testing the population mean, a dot graph with the 95% confidence interval for the population mean will appear as in <Figure 10.1.11>.

<Figure 10.1.10> Data input for cookie weight
<Figure 10.1.11> Dot graph and confidence interval of cookie weight

Enter a value of 200 in the options below the graph and click the [Wilcoxon Signed Rank Sum Test] button to display the same test result graph and a result table as in <Figure 10.1.12>.

<Figure 10.1.12> Result of the Wilcoxon Signed Rank Sum Test

• If we denote the population median as M, the signed rank sum test tests whether the population median is M0, or greater than (or less than, or not equal to) M0. However, if the population distribution is symmetric about the mean, the signed rank sum test becomes a test about the population mean, because the population median and mean are the same. The basic statistical model is as follows:

Xi = M + εi, i = 1, ⋯ , n

where the εi’s are independent, symmetric about the mean 0, and follow the same distribution.
• If X1, ⋯ , Xn are the sample data, the ranks of |Xi − M0| are calculated first, and the sum of the ranks for the data which are greater than M0 (the + sign data), denoted W+, is calculated. W+ is the test statistic for the signed rank sum test, and the sampling distribution of W+ is obtained for testing hypotheses by considering all possible cases.『eStatU』provides this distribution. Let w(α; n) denote the right-tail 100×α percentile of the distribution; it is not always possible to find the exact percentile, because the distribution is discrete, and the middle value of the two adjacent percentiles is usually used as an approximation. Table 10.1.5 summarizes the decision rule of the Wilcoxon signed rank sum test for each type of hypothesis.

Table 10.1.5 Decision rule of the Wilcoxon signed rank sum test
Test Statistic W+ = ‘rank sum of the + sign data of |Xi − M0|’
Type of Hypothesis / Decision Rule
1) H0: M = M0 vs H1: M > M0 — If W+ ≥ w(α; n), then reject H0, else accept H0
2) H0: M = M0 vs H1: M < M0 — If W+ ≤ w(1−α; n), then reject H0, else accept H0
3) H0: M = M0 vs H1: M ≠ M0 — If W+ ≤ w(1−α/2; n) or W+ ≥ w(α/2; n), then reject H0, else accept H0

☞ What if an observed value is the same as M0? If any of the observed values is equal to M0, it is not used in the test; in other words, reduce n.

[Practice 10.1.2]
A psychologist has randomly selected 9 handicapped workers from production workers employed at various factories in a large industrial complex, and their work competency scores are examined as follows. The psychologist wants to test whether the population median score is 45. Assume the population distribution is symmetric about the mean.

32, 52, 21, 39, 23, 55, 36, 27, 37 ⇨ eBook ⇨ PR100101_CompetencyScore.csv

1) Check whether a parametric test is possible.
2) Apply the Wilcoxon signed rank test with the significance level of 5%.
3) Compare this test result with the sign test of [Practice 10.1.1].

• If the sample size is large enough, the test statistic W+ is approximately normal, with the following mean and variance when the null hypothesis is true:

E(W+) = n(n + 1)/4,  V(W+) = n(n + 1)(2n + 1)/24
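As a quick numeric sketch of this approximation, using the standard formulas E(W+) = n(n+1)/4 and V(W+) = n(n+1)(2n+1)/24 (applied here, only for illustration, to the n = 10 cookie data even though n is small):

```python
from math import sqrt

n, w_plus = 10, 29.5
mean = n * (n + 1) / 4                 # 27.5
var = n * (n + 1) * (2 * n + 1) / 24   # 96.25
z = (w_plus - mean) / sqrt(var)
print(mean, var, round(z, 3))          # 27.5 96.25 0.204
```

A z-value of about 0.2 is far inside the usual ±1.96 bounds, consistent with the exact test's failure to reject H0.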
• Table 10.1.6 summarizes the decision rule of the signed rank sum test for each type of hypothesis in the large sample case.

Table 10.1.6 Decision rule of the Wilcoxon signed rank sum test (large sample case)
Test Statistic: Z = (W+ − n(n+1)/4) / √(n(n+1)(2n+1)/24), where W+ = ‘rank sum of the + sign data of |Xi − M0|’
Type of Hypothesis / Decision Rule
1) H1: M > M0 — If Z ≥ z(α), then reject H0, else accept H0
2) H1: M < M0 — If Z ≤ −z(α), then reject H0, else accept H0
3) H1: M ≠ M0 — If |Z| ≥ z(α/2), then reject H0, else accept H0

• The distribution of W+ is independent of the population distribution. In other words, the Wilcoxon signed rank sum test is a distribution-free test. For example, if n = 3, the distribution of W+ can be obtained as follows:

Signs of ranks 1, 2, 3 | Possible value of W+
− − − | 0
+ − − | 1
− + − | 2
− − + | 3
+ + − | 3
+ − + | 4
− + + | 5
+ + + | 6

Therefore, the distribution of W+ can be calculated as follows, independently of the population distribution:

w        0    1    2    3    4    5    6
P(W+=w)  1/8  1/8  1/8  2/8  1/8  1/8  1/8

• If there is a tie among the values of |Xi − M0|, the average rank is used when the ranking is obtained. In this case, the variance of W+ for the large sample case is calculated using the following modified formula:

V(W+) = n(n + 1)(2n + 1)/24 − Σj (tj³ − tj)/48

Here g = (number of tie groups) and tj = (size of the j-th tie group, i.e., the number of observations in the tie group); if there is no tie, the size of each tie group is 1 and tj = 1.

10.2 Nonparametric Test for Location Parameters of Two Populations

• The testing hypothesis for two population means in Chapter 8 used the t-distribution in the case of a small sample, if each population could be assumed to follow a normal distribution. However, the assumption that a population follows a normal distribution may not be appropriate for real-world data, or there may not be enough sample data to assume a normal distribution. Alternatively, if the collected data are ordinal, such as rankings, then the parametric t-test is not appropriate. In such cases, a nonparametric method is used to test parameters by converting the data to ranks, without assuming a distribution for the population.
This section introduces the Wilcoxon rank sum test. Nonparametric tests convert data into ranks, so there may be some loss of information about the data. Therefore, if the data are normally distributed, there is no reason to apply a nonparametric test. However, a nonparametric method is more appropriate if the data do not follow a normal distribution. As in Chapter 8, this section introduces nonparametric tests for the location parameters of two populations, both for samples drawn independently from each population and for paired samples.

10.2.1 Independent Samples: Wilcoxon Rank Sum Test

• Let's take a look at the Wilcoxon rank sum test with the following example.

Example 10.2.1
A professor teaches Statistics courses to students in the Department of Economics and the Department of Management. In order to compare the exam scores of students in the two departments, seven students were randomly sampled from the Economics Department and six students from the Management Department; their scores were as follows:

Department of Economics: 87 75 65 95 90 81 93
Department of Management: 57 85 90 83 87 71 ⇨ eBook ⇨ EX100201_ScoreByDepartment.csv

1) Draw a histogram of the data to verify whether the testing hypothesis can be performed using a parametric method.
2) Apply the Wilcoxon rank sum test with the significance level of 5%.
3) Check the result of the Wilcoxon rank sum test using『eStat』.

Example 10.2.1 Answer
1) The hypothesis of this problem, to test the two population means μ1 and μ2, is as follows:

H0: μ1 = μ2, H1: μ1 ≠ μ2

Since the sample sizes, n1 = 7 and n2 = 6, are small, it is necessary to assume that the populations are normally distributed in order to apply the parametric t-test. In order to check whether each sample follows a normal distribution, let us draw a histogram using『eStat』. Enter the data in『eStat』as shown in <Figure 10.2.1>.
<Figure 10.2.1> Data input at『eStat』

Click the icon for testing two population means in the main menu. Select ‘Score’ as the 'Analysis Var' and ‘Dept’ as the ‘By Group’ variable. Then two dot graphs, together with the 95% confidence intervals for each population mean, will appear as in <Figure 10.2.2>. The average score of students in the Economics Department appears to be higher than that of students in the Management Department, but this should be tested for statistical significance. Pressing the [Histogram] button in the options window below the graph will reveal the histograms and normal distribution curves for each department as in <Figure 10.2.3>.

<Figure 10.2.2> Dot graph and confidence interval by department
<Figure 10.2.3> Histogram by department

2) Looking at the histograms, the small number of data is not sufficient to assume that the populations follow normal distributions. In such a case, applying the parametric t-test may lead to errors. In a nonparametric test, we test a location parameter of the population, such as the median, which is not so sensitive to extreme values. The hypothesis for this problem is to test whether the median values M1 and M2 of the two populations are equal or not:

H0: M1 = M2, H1: M1 ≠ M2

The Wilcoxon rank sum test first calculates the rank of each data value in the combined sample and then calculates the sum of the ranks in each sample. If there is a tie, the averaged rank is used. To obtain the ranks of the combined sample, it is convenient to arrange each sample's data in ascending order, as shown in Table 10.2.1. The rank sums W1 and W2 of the two samples will be used as the test statistic for the Wilcoxon rank sum test.
Table 10.2.1 A table to calculate ranks in the combined sample

Sorted Data of Sample 1 | Ranks of Sample 1 | Sorted Data of Sample 2 | Ranks of Sample 2
        |         | 57 | 1
65      | 2       |    |
        |         | 71 | 3
75      | 4       |    |
81      | 5       |    |
        |         | 83 | 6
        |         | 85 | 7
87      | 8.5     | 87 | 8.5
90      | 10.5    | 90 | 10.5
93      | 12      |    |
95      | 13      |    |
Sum of ranks | W1 = 55 |  | W2 = 36

The sum of all ranks is 1 + 2 + ⋯ + 13 = 91. The sum of ranks in sample 1 is W1 = 55 and the sum of ranks in sample 2 is W2 = 36; note that W1 + W2 = 91. If W1 and W2 are similar, the null hypothesis that the two population medians are the same is accepted. In this example W1 is larger than W2, so the median of population 1 seems larger than the median of population 2. But how large a difference in the rank sums is statistically significant, considering the sample sizes?
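The rank sums, and the exact tail probability used below, can be checked with a short Python sketch (the text itself uses eStat/eStatU; this is only illustrative):

```python
from itertools import combinations

econ = [87, 75, 65, 95, 90, 81, 93]   # sample 1 (n1 = 7)
mgmt = [57, 85, 90, 83, 87, 71]       # sample 2 (n2 = 6)
pooled = sorted(econ + mgmt)

def avg_rank(v):
    # average rank in the combined sample (ties get the averaged rank)
    positions = [i + 1 for i, x in enumerate(pooled) if x == v]
    return sum(positions) / len(positions)

w1 = sum(avg_rank(x) for x in econ)
w2 = sum(avg_rank(x) for x in mgmt)
print(w1, w2)                          # 55.0 36.0

# Exact null distribution of the rank sum of sample 2: under H0, every
# choice of 6 ranks out of 1..13 is equally likely.
sums = [sum(c) for c in combinations(range(1, 14), 6)]
p_le_28 = sum(1 for s in sums if s <= 28) / len(sums)
print(len(sums), round(p_le_28, 4))    # 1716 0.0256
```

Enumerating all C(13, 6) = 1716 rank assignments is exactly how the distribution table for this test is built.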
Table 10.2.2 All possible rank combinations for the six data of sample 2 when n1 + n2 = 13
All possible combinations of ranks | Sum of ranks W2
{1,2,3,4,5,6}     | 21
{1,2,3,4,5,7}     | 22
⋯                 | ⋯
{8,9,10,11,12,13} | 63

<Figure 10.2.4> Wilcoxon rank sum distribution when n1 = 7, n2 = 6

Table 10.2.3 Wilcoxon rank sum distribution when n1 = 7, n2 = 6
x    P(X = x)  P(X ≤ x)  P(X ≥ x)
21   0.0006    0.0006    1.0000
22   0.0006    0.0012    0.9994
23   0.0012    0.0023    0.9988
24   0.0017    0.0041    0.9977
25   0.0029    0.0070    0.9959
26   0.0041    0.0111    0.9930
27   0.0064    0.0175    0.9889
28   0.0082    0.0256    0.9825
29   0.0111    0.0367    0.9744
⋯    ⋯         ⋯         ⋯
55   0.0111    0.9744    0.0367
56   0.0082    0.9825    0.0256
57   0.0064    0.9889    0.0175
58   0.0041    0.9930    0.0111
59   0.0029    0.9959    0.0070
60   0.0017    0.9977    0.0041
61   0.0012    0.9988    0.0023
62   0.0006    0.9994    0.0012
63   0.0006    1.0000    0.0006

Since the hypothesis requires a two-sided test with the significance level of 5%, looking for the 2.5 percentile at both ends gives P(X ≤ 28) = 0.0256 and P(X ≥ 56) = 0.0256. Since it is a discrete distribution, there is no exact 2.5 percentile. Therefore, the decision rule can be set as follows: ‘If W2 ≤ 28 or W2 ≥ 56, then reject H0.’ In this problem W2 = 36, and therefore we cannot reject H0, which means the difference between W1 and W2 is not statistically significant.

3) In『eStatU』, enter the data as in <Figure 10.2.5> and click the [Execute] button. It will calculate the sample statistics and show the test result graph as in <Figure 10.2.6>. The two critical lines, which correspond to 2.5% from each end, are shown here. For a discrete distribution such as this, the choice of the final rejection region should be determined by the analyst.

<Figure 10.2.5> Data input for the Wilcoxon rank sum test at『eStatU』
<Figure 10.2.6> Wilcoxon rank sum test using『eStatU』

The rank sum test can also be performed using『eStat』.
After you see <Figure 10.2.2>, click the [Wilcoxon Rank Sum Test] button in the options window below the graph. Then a test result graph as shown in <Figure 10.2.6> will appear in the Graph Area, and a test result table as in <Figure 10.2.7> will appear in the Log Area.

<Figure 10.2.7> Result table of the Wilcoxon rank sum test

• Let's generalize the Wilcoxon rank sum test described in [Example 10.2.1]. Denote the random samples selected independently from each of the two populations as follows. The sample sizes are n1 and n2 respectively, and n = n1 + n2.

Sample 1: X1, X2, ⋯ , Xn1
Sample 2: Y1, Y2, ⋯ , Yn2

For convenience, assume n1 ≥ n2. If n1 ≤ n2, you can swap sample 1 and sample 2.

• The statistical model of the Wilcoxon rank sum test is as follows:

Xi = M + εi, i = 1, ⋯ , n1
Yj = M + Δ + εn1+j, j = 1, ⋯ , n2

Here Δ is the difference between the location parameters. The εi’s are independent and follow the same continuous distribution, which is symmetric about 0.

• The test statistic for the Wilcoxon rank sum test is the sum of the ranks, W, of Y1, ⋯ , Yn2 in the combined sample X1, ⋯ , Xn1, Y1, ⋯ , Yn2. The distribution of the random variable W = ‘sum of the ranks of the Y sample’ can be obtained by investigating all possible cases of ranks for Y.『eStatU』provides the Wilcoxon rank sum distribution and its table. Let w(α; n1, n2) denote the right-tail 100×α percentile; the accurate percentile might not exist, because W has a discrete distribution. In this case, the middle value of the two percentiles near α is often used as an approximation. Table 10.2.4 summarizes the decision rule for each type of hypothesis.

Table 10.2.4 Wilcoxon rank sum test
Test Statistic: W = ‘sum of ranks assigned to the sample of Y’
Type of Hypothesis / Decision Rule
1) H0: Δ = 0 vs H1: Δ > 0 — If W ≥ w(α; n1, n2), then reject H0, else accept H0
2) H0: Δ = 0 vs H1: Δ < 0 — If W ≤ w(1−α; n1, n2), then reject H0, else accept H0
3) H0: Δ = 0 vs H1: Δ ≠ 0 — If W ≤ w(1−α/2; n1, n2) or W ≥ w(α/2; n1, n2), then reject H0, else accept H0

☞ If there is a tie in the combined sample, assign the average rank.

[Practice 10.2.1]
A company wants to compare two methods of obtaining information about a new product.
Among company employees, 17 employees were randomly selected and divided into two groups. The first group learned about the new product by the method A, and the second group learned by the method B. At the end of the experiment, the employees took a test to measure their knowledge of the new product and their test scores are as follows: Method A: 50 59 60 71 80 78 72 77 73 Method B: 52 54 58 78 65 61 60 72 ⇨ eBook ⇨ PR100201_ScoreByMethod.csv 1) Can we apply a parametric test to conclude that population means of the two groups are different? 2) Apply a nonparmetric test to conclude that the median values of the two groups are different. Test with the significance level of 0.05. 10.2 Nonparametric Test for Location Parameters of Two Populations / 87 Ÿ When the null hypothesis is true, if the sample is large enough, the test statistic is approximated to the normal distribution with the following mean and variance : Ÿ Table 10.2.5 summarizes the decision rule for each hypothesis type of the Wilcoxon rank sum test if the sample is large enough. Table 10.2.5 Wilcoxon rank sum test (large sample case) Decision Rule Test Statistic: = ‘Sum of ranks assigned samples of ’ Type of Hypothesis 1) 2) 3) ≠ Ÿ , then reject , else accept If , then reject , else accept If If , then reject , else accept The distribution of rank sum statistic, , is not dependent on the population distribution. That is, the rank sum test is a distribution free test. For example, if = 3 and = 2, the distribution can be found as follows. All possible cases of ranks for is . All possible ranks for combined sample Ÿ Value of 3 2 2 2 1 1 1 1 1 1 4 4 3 3 4 3 3 2 2 2 5 5 5 4 5 5 4 5 4 3 1 1 1 1 2 2 2 3 3 4 2 3 4 5 3 4 5 4 5 5 3 4 5 6 5 6 7 7 8 9 Therefore, the distribution distribution as follows: is given regardless of the 3 4 5 6 7 8 9 P population 88 / Chapter 10 Nonparametric Testing Hypothesis Ÿ If there is a tie in the combined sample, the average rank is assigned to each data. 
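The small distribution-free example above (sample sizes 3 and 2) can be checked mechanically; the same brute-force idea generates the rank sum distribution for any small sample sizes. A sketch in Python using only the standard library:

```python
from itertools import combinations

# Null distribution of W = rank sum of the smaller sample (size 2)
# when the combined sample has 5 observations: every 2-subset of
# the ranks {1, ..., 5} is equally likely under H0.
counts = {}
for ranks in combinations(range(1, 6), 2):
    w = sum(ranks)
    counts[w] = counts.get(w, 0) + 1

# 10 equally likely cases; W takes values 3 through 9
print(counts)  # {3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1}
```

Since only the ranks enter the computation, the resulting distribution is the same whatever the population distribution is, which is exactly the distribution-free property described above.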
In this case, the variance of should be modified in case of large sample as follows: Here = (number of tied groups) = (size of tie group, i.e., number of observations in the tie group) if there is no tie, size of tie group is 1 and =1 10.2.2 Paired Samples: Wilcoxon Signed Rank Sum Test Ÿ Ÿ Section 8.1.2 discussed the testing hypothesis for two population means using paired samples. Paired samples are used when it is difficult to extract samples independently from two populations, or if independently extracted, the characteristics of each sample object are so different that the resulting analysis is meaningless. If two populations are normally distributed, the t-test was applied for the difference data of the paired samples as described in Section 8.1.2. However, if the normality assumption of two populations can not be satisfied, the Wilcoxon signed rank sum test in Section 10.1.2, which is a nonparametric test, can be applied to the difference data of the paired samples. In case of the paired samples, first calculate the differences ( ) for each paired sample as shown in Table 10.2.6. For the data of differences, we examine the normality to check whether the parametric test can be applicable or not. If it is not applicable, we apply the Wilcoxon signed rank sum test on the differences. Table 10.2.6 Data of differences for paired samples Pair number 1 2 ... Ÿ Example 10.2.2 Sample of population 1 Sample of population 2 ... ... Difference ... Let's take a look at the next example. The following is the survey result of eight samples from young couples. The husband’s age and wife’s age of each couple are recorded. (28, 28) (30, 29) (34, 31) (29, 32) (28, 29) (31, 33) (39, 35) (34, 29) ⇨ eBook ⇨ EX100202_AgeOfCouple.csv 1) Calculate data of differences in each pair and draw their histogram to check whether a parametric test is applicable or not. 
2) Apply the Wilcoxon signed rank sum test to see whether the husband's age is greater than the wife's age with the significance level of 0.05.
3) Check the result of the above signed rank sum test using『eStat』.

Example 10.2.2 Answer

1) The data of age differences between husband and wife are as follows:

Table 10.2.7 Data of age differences between husband and wife

Number  Husband  Wife  Difference
1       28       28     0
2       30       29     1
3       34       31     3
4       29       32    -3
5       28       29    -1
6       31       33    -2
7       39       35     4
8       34       29     5

w The histogram of the differences drawn with『eStat』(using the testing hypothesis for a population mean) is shown in <Figure 10.2.8>. Looking at the histogram, there is not sufficient evidence that the differences follow a normal distribution, because the number of data is small. In such a case, applying a parametric hypothesis test may lead to errors. An appropriate nonparametric method for this problem is the Wilcoxon signed rank sum test on the differences.

<Figure 10.2.8> Histogram of age differences

2) The hypothesis to test is whether the population median of the husband's age (M1) is the same as the population median of the wife's age (M2):

H0: M1 = M2,  H1: M1 ≠ M2

Since it is a paired sample, the hypothesis can be written in terms of the population median of the differences (MD):

H0: MD = 0,  H1: MD ≠ 0

w In order to apply the signed rank sum test to the differences, we mark each difference greater than 0 with a '+' sign and each difference less than 0 with a '-' sign, and assign ranks to |difference - 0|. Then we calculate the sum of ranks with '+' sign and the sum of ranks with '-' sign. If a difference is 0, the data is omitted. If there are ties among the differences, the average rank is assigned.
Example 10.2.2 Answer (continued)

Difference data       1    3    -3   -1   -2   4    5
Sign                  +    +    -    -    -    +    +
|data - 0|            1    3    3    1    2    4    5
Rank of |data - 0|    1.5  4.5  4.5  1.5  3    6    7

(The zero difference of couple 1 is omitted.) The rank sum of the '+' signs is W+ = 1.5 + 4.5 + 6 + 7 = 19.

w In『eStatU』, the distribution of the Wilcoxon signed rank sum when n = 7 is shown in <Figure 10.2.9> and Table 10.2.8.

Table 10.2.8 Wilcoxon signed rank sum distribution when n = 7

x    P(X = x)  P(X ≤ x)  P(X ≥ x)
0    0.0078    0.0078    1.0000
1    0.0078    0.0156    0.9922
2    0.0078    0.0234    0.9844
3    0.0156    0.0391    0.9766
⋯    ⋯         ⋯         ⋯
25   0.0156    0.9766    0.0391
26   0.0078    0.9844    0.0234
27   0.0078    0.9922    0.0156
28   0.0078    1.0000    0.0078

w Since it is a two-sided test with the significance level of 5%, looking for the 2.5 percentile at both ends gives P(X ≤ 2) = 0.0234 and P(X ≥ 26) = 0.0234. Since the distribution is discrete, there is no exact 2.5 percentile. Therefore, the decision rule is as follows:

'If W+ ≤ 2 or W+ ≥ 26, then reject H0'

Since W+ = 19 in this problem, we cannot reject the null hypothesis; there is no significant evidence that the husband's age and the wife's age differ.

3) Enter the data as shown in <Figure 10.2.9> in『eStat』and click the icon for the test of a population mean. If you select the variable 'Difference' as the analysis variable, a dot graph with the 95% confidence interval for the population mean difference will appear. If you enter 0 as the testing value in the hypothesis option and click the [Execute] button, you will see the test results as in <Figure 10.2.10> and <Figure 10.2.11>. The two critical lines containing the 2.5 percentile on each side are shown there. For a discrete distribution, the choice of the final decision rule should be determined by the analyst.
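The hand computation of the rank sum with + sign (W+ = 19) can be verified in code. This is a sketch assuming NumPy and SciPy are available; `scipy.stats.rankdata` reproduces the average ranks for ties, and `scipy.stats.wilcoxon` runs the test itself (with tied absolute differences it falls back to a normal approximation for the p-value):

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Age differences (husband - wife) from Table 10.2.7; drop the zero.
d = np.array([0, 1, 3, -3, -1, -2, 4, 5])
d = d[d != 0]

# Rank |d| with average ranks for ties, then sum ranks of positive d.
ranks = rankdata(np.abs(d))
w_plus = ranks[d > 0].sum()
print(w_plus)  # 19.0

# scipy's built-in test; note that for a two-sided alternative its
# reported statistic is min(W+, W-), not W+ itself.
res = wilcoxon(d, zero_method="wilcox", alternative="two-sided")
```

The ranking step is the part worth checking by hand: the two |1|'s share rank 1.5 and the two |3|'s share rank 4.5, exactly as in the worked table.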
<Figure 10.2.9> Data difference 10.2 Nonparametric Test for Location Parameters of Two Populations / 91 Example 10.2.2 Answer (continued) <Figure 10.2.10> 『eStat』Signed rank sum test <Figure 10.2.11> Result of Wilcoxon signed rank sum test Ÿ The Wilcoxon signed rank test for the paired samples is to test whether the population median of the differences between two populations, , is zero or not. If we denote the paired samples as ⋯ , the Wilcoxon signed rank sum test calculates the difference first and assign ranks on . The sum of ranks of which has + sign, , is used as the test statistic. eStatU provides the distribution of , denoted as , up to . refers to the right tail 100 × percentile of this distribution which may not have an accurate percentile value, because it is a discrete distribution. In this case the average of two values near is used approximately. Table 10.2.9 summarizes the decision rule of the Wilcoxon signed rank sum test for paired samples by the type of hypothesis. 『 』 92 / Chapter 10 Nonparametric Testing Hypothesis Table 10.2.9 Wilcoxon signed rank sum test for paired samples Decision Rule Test Statistic: = ‘sum of ranks on with + sign’ Type of Hypothesis 1) 2) If , then reject , else accept If , then reject , else accept 3) If ≠ else accept or , then reject , If there is 0 on the differences of paired samples? ☞ If there is 0 on the differences of paired samples, the data is omitted for further analysis. That is, is decreased. [Practice 10.2.2] An oil company has developed a gasoline additive that will improve the fuel mileage of gasoline. We used 8 pairs of cars to compare the fuel mileage to see if it is actually improved. Each pair of cars has the same details as its structure, model, engine size, and other relationship characteristics. When driving the test course using gasoline, one of the pair selected randomly and added additives, the other of the pair was driving the same course using gasoline without additives. 
The following table shows the km per liter for each of pairs. pair 1 2 3 4 5 6 7 8 Additive (X1) 17.1 12.7 11.6 15.8 14.0 17.8 14.7 16.3 No Additive (X2) 16.3 11.6 11.2 14.9 12.8 17.1 13.4 15.4 Difference 0.8 1.1 0.4 0.9 1.2 0.7 1.3 0.9 ⇨ eBook ⇨ PR100202_DifferenceOfMileage.csv Apply a nonparametric test to check whether the additive increase fuel mileage. Use the significance level of 0.05. Ÿ If the sample size of the paired sample is large, use the normal distribution approximation formula shown in Table 10.1.6. 10.3 Nonparametric Test for Location Parameters of Several Populations Ÿ The testing hypothesis for several population means in Chapter 9 was possible if each population could be assumed to be a normal distribution and has the same population variance. However, the assumption that the population follows a normal distribution may not be true for real-world data, or that there may not 10.3 Nonparametric Test for Locations of Several Populations / 93 Ÿ be enough data to assume a normal distribution. Alternatively, if data are ordinal such as ranks, then the parametric test is not appropriate. In this case, a nonparametric test is used by converting data into ranks without making assumptions about the population distribution. This section introduces the Kruskal-Wallis test corresponding to the completely randomized design of experiments and the Friedman test corresponding to the randomized block design of experiments in Chapter 9. Since nonparametric tests are done by using the converted data such as ranks, there may be some loss of information about the data. Therefore, if data are normally distributed, there is no reason to apply a nonparametric test. However, a nonparametric test would be a more appropriate method if data were selected from a population that did not follow a normal distribution. 10.3.1 Completely Randomized Design: Kruskal-Wallis Test Ÿ Example 10.3.1 The Kruskal–Wallis test extends the Wilcoxon rank sum test for two populations. 
Consider the following example. The result of a survey of the job satisfaction by sampling employees of three companies are as follows. From this data, can you say that the three companies have different job satisfaction? (unit: points out of 100 scores) Company A 69 67 65 59 Company B 56 63 55 Company C 71 72 70 ⇨ eBook ⇨ EX100301_JobSatisfaction.csv 1) Draw a histogram of the data to see whether the comparison of the job satisfaction for the three companies can be made using a parametric test. 2) Using the Kruskal-Wallis test, which is a nonparametric test, find whether the three companies have the same job satisfaction or not with the significance level of 5% 3) Check the above result of the Kruskal-Wallis test using『eStat』. Answer 1) The parametric method for testing the hypothesis that three population means are the same is the one-way analysis of variance studied in Chapter 9 and it requires the assumption that the populations are normal distributions. Since the sample sizes are small, =4, =3, =3, in each of the population respectively we need to examine if each sample data satisfy the normality assumption. w Enter the data as shown in <Figure 10.3.1> in『eStat』. 10.3.1> 『<Figure eStat』data input 94 / Chapter 10 Nonparametric Testing Hypothesis Example 10.3.1 Answer (continued) w Click the ANOVA icon . Select ‘Score’ as ‘Analysis Var’ and ‘Company’ as ‘by Group’ variable in the variable selection box. Then a dot graph with the 95% confidence interval of each population mean will appear as in <Figure 10.3.2>. Company C has the highest average of satisfaction scores, followed by Company A and Company B. However, it should be tested if these differences are statistically significant. Clicking the [Histogram] button in the options window below the graph will reveal the histogram and its normal distribution curve for each company, as in<Figure 10.3.3>. 
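As a supplement to inspecting histograms (this check is not part of the textbook procedure), a formal normality test such as Shapiro-Wilk can be sketched with SciPy. With samples of size 3 or 4 it has very little power, which is consistent with preferring a nonparametric test here:

```python
from scipy.stats import shapiro

# Job satisfaction scores of the three companies (Example 10.3.1 data).
samples = {
    "A": [69, 67, 65, 59],
    "B": [56, 63, 55],
    "C": [71, 72, 70],
}

# Shapiro-Wilk normality check per company; with 3-4 observations,
# non-rejection is only very weak evidence of normality.
pvalues = {name: shapiro(data).pvalue for name, data in samples.items()}
print(pvalues)
```

Whatever the p-values turn out to be, samples this small cannot establish normality, so the Kruskal-Wallis test below remains the safer choice.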
<Figure 10.3.2> Dot graph and the confidence intervals by company

<Figure 10.3.3> Histogram by company

w Looking at the histograms, the data are not sufficient to assume that each population follows a normal distribution, because the number of data is so small. In such a case, applying a parametric hypothesis test such as the ANOVA F-test may lead to errors. The hypothesis for this problem is to test whether the location parameters θ1, θ2, θ3 of the three populations are the same:

H0: θ1 = θ2 = θ3
H1: At least one pair of location parameters is not the same.

w The Kruskal-Wallis test combines all three samples into a single set of data and calculates the ranks of this combined data. If there is a tie, the average rank is assigned. Then the sum of the ranks in each sample, Rj, is calculated. The test statistic of the Kruskal-Wallis test is similar to the F-test statistic applied to the sample data converted into ranks:

H = 12/(N(N+1)) × Σ(Rj²/nj) - 3(N+1), where N = n1 + n2 + n3

w To obtain the ranks of the combined sample, it is convenient to sort the data of each sample in ascending order and then rank the whole data, as shown in Table 10.3.1.

Table 10.3.1 A table to calculate the sum of ranks in each sample

Sorted data  Sample  Rank
55           2       1
56           2       2
59           1       3
63           2       4
65           1       5
67           1       6
69           1       7
70           3       8
71           3       9
72           3       10

Sum of ranks: R1 = 3 + 5 + 6 + 7 = 21, R2 = 1 + 2 + 4 = 7, R3 = 8 + 9 + 10 = 27

Example 10.3.1 Answer (continued)

w The total sum of ranks is 1 + 2 + ⋯ + 10 = 55. The sum of ranks for sample 1 is R1 = 21, for sample 2 is R2 = 7, and for sample 3 is R3 = 27. When the number of data in each sample is taken into account, if the average ranks R1/n1, R2/n2 and R3/n3 are similar, the null hypothesis that the three population location parameters are the same would be accepted. In this example, despite the small sample size of sample 3, R3 is larger than R1 or R2; also R1 is larger than R2. Based on these differences, can you conclude that the three population location parameters are statistically different?
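The rank sums of Table 10.3.1 and the resulting H statistic can be verified with SciPy. A sketch assuming numpy/scipy; `scipy.stats.kruskal` computes the same statistic (and applies a tie correction when ties exist, which this data set does not have):

```python
import numpy as np
from scipy.stats import rankdata, kruskal

a = [69, 67, 65, 59]   # sample 1, n1 = 4
b = [56, 63, 55]       # sample 2, n2 = 3
c = [71, 72, 70]       # sample 3, n3 = 3

combined = np.concatenate([a, b, c])
ranks = rankdata(combined)            # average ranks would handle ties
r1, r2, r3 = ranks[:4].sum(), ranks[4:7].sum(), ranks[7:].sum()
print(r1, r2, r3)                     # 21.0 7.0 27.0

# H = 12/(N(N+1)) * sum(Rj^2 / nj) - 3(N+1)
n = len(combined)
h = 12 / (n * (n + 1)) * (r1**2 / 4 + r2**2 / 3 + r3**2 / 3) - 3 * (n + 1)
print(round(h, 3))                    # 7.318

stat, p = kruskal(a, b, c)            # scipy agrees with the manual H
```

The p-value reported by `kruskal` uses the chi-square approximation, whereas the textbook example uses the exact small-sample distribution, so the two p-values need not match exactly.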
w In the above example, the statistic is as follows: If the null hypothesis is true, the distribution of the test statistic should be known to investigate how large a value of is statistically significant. If , the number of cases for ranking {1,2,3, ⋯, 10} is . It is not easy to examine all of these possible rankings to create a distribution table of .『eStat U』shows the distribution of the Kruskal–Wallis for =4, =3, and =3 as shown in <Figure 10.3.4>, and a part of the distribution table as in Table 10.3.2. As shown in the figure, the distribution of is an asymmetrical distribution. <Figure 10.3.4> Kruskal Wallis H distribution when =4, =3, =3 Table 10.3.2 Kruskal Wallis H distribution when =4, =3, =3 Kruskal Wallis H distribution =3 = 4 = 3 ≤ = 3 ≥ x P(X = x) 0.018 0.0162 0.0162 1.0000 0.045 0.0133 0.0295 0.9838 ⋯ ⋯ ⋯ ⋯ 5.727 0.0048 0.9543 0.0505 5.791 0.0095 0.9638 0.0457 5.936 0.0019 0.9657 0.0362 5.982 0.0076 0.9733 0.0343 6.018 0.0019 0.9752 0.0267 6.155 0.0019 0.9771 0.0248 6.300 0.0057 0.9829 0.0229 6.564 0.0033 0.9862 0.0171 6.664 0.0010 0.9871 0.0138 6.709 0.0029 0.9900 0.0129 6.745 0.0038 0.9938 0.0100 7.000 0.0019 0.9957 0.0062 7.318 0.0019 0.9976 0.0043 7.436 0.0010 0.9986 0.0024 8.018 0.0014 1.0000 0.0014 P(X x) P(X x) 96 / Chapter 10 Nonparametric Testing Hypothesis Example 10.3.1 Answer (continued) w test is a right tail test and the 5 percentile from the right corresponding to the significance level is approximately P(X ≥ 5.727) = 0.0505. Note that there is no exact 5 percentile in case of a discrete distribution. Hence, the decision rule to test the null hypothesis is as follows: ‘If , then reject ’ Since = 7.318 in this example, we reject . 3) In『eStatU』, enter data as <Figure 10.3.5> and click the [Execute] button. Then the sample statistics are calculated and the test result is shown as in <Figure 10.3.6>. The critical line for values containing 5 percentile of the significance level is shown here. 
For a discrete distribution, the choice of the final rejection region shall be determined by the analyst. <Figure 10.3.5> 『eStatU』 Kruskal-Wallis test <Figure 10.3.6> Kruskal-Wallis test 10.3 Nonparametric Test for Locations of Several Populations / 97 Example 10.3.1 Answer (continued) w 『eStat』may also be used to conduct the Kruskal–Wallis test. Enter data as <Figure 10.3.1> and click the ANOVA icon . Select ‘Score’ as ‘Analysis Var’ and ‘Company’ as ‘by Group’ variable in the variable selection box. Then a dot graph with the 95% confidence interval of the population mean in each company will appear as <Figure 10.3.2>. If you press the [Kruskal–Wallis test] button in the options window below the graph, the same test graph and test result table will appear as in <Figure 10.3.7>. <Figure 10.3.7> Result of the Kruskal-Wallis test Ÿ Let us generalize the Kruskal–Wallis test described so far with an example. Denote random samples collected independently from the populations (at each level of one factor) when their sample sizes are , , ..., as follows: ( ⋯ ). Table 10.3.3 Notation for random samples from each level Level 1 Level 2 ⋯ ⋯ Level 1 Mean Level 2 Mean Ÿ ⋅ ⋯ Level ⋯ ⋯ ⋯ Level Mean ⋅ ⋅ Total Mean ⋅⋅ The statistical model of the Kruskal-Wallis test is as follows: , ⋯; ⋯ Ÿ where . Here represents the effect of the level and ’s are independent and follow the same continuous distribution. The hypothesis of the Kruskal-Wallis test is as follows: ⋯ At least one pair of is not equal. 98 / Chapter 10 Nonparametric Testing Hypothesis Ÿ For the Kruskal–Wallis test, ranking data for the combined sample must be created. Table 10.3.4 is a notation of ranking data for each level. 
Table 10.3.4 Notation of ranking data in each level Level 1 Level 2 ⋯ Sum of ranks in level 1 Sum of ranks in level 2 ⋅ ⋅ Mean of ranks in level 2 ⋯ Mean of ranks in level 1 ⋅ Level ⋯ ⋯ Ÿ ⋯ Sum of ranks in level ⋅ ⋯ Mean of ranks in level ⋯ ⋅ ⋅ Total mean of ranks ⋅⋅ The sum of squares for the one-way analysis of variance studied in Chapter 9 by using the ranking data in Table 10.3.4 are as follows: ⋅⋅ · ⋅⋅ ⋅⋅ Ÿ Also, the statistic for the -test is as follows: Ÿ Since is a constant, the statistic for the -test is proportional to . The statistic for the Kruskal-Wallis test is proportional to as follows: Ÿ Ÿ ⋅ ⋅ ⋅⋅ The multiplication constant in the definition of statistics is intended to ensure that the statistic follows approximately the chi-square distribution with degrees of freedom. The distribution of the Kruskal-Wallis test statistic , denoted as ⋯ , can be obtained by considering all possible cases of ranks {1, 2, ⋯ , } which is . eStatU provides the table of ⋯ up to . ⋯ denotes the right tail 100 × percentile, but it might not have the exact value of this percentile, because ⋯ is a discrete distribution. In this case, the middle of two adjacent values of 100 × percentile is often used. The decision rule of the Kruskal-Wallis test is as Table 10.3.5. 『 』 10.3 Nonparametric Test for Locations of Several Populations / 99 Table 10.3.5 Kruskal-Wallis test Hypothesis Decision Rule Test Statistic: ⋯ If ⋯ , then reject , At least one pair of is not equal. else accept . ☞ If there are tied values in the combined sample, assign the average of ranks. Ÿ Ÿ The distribution of the Kruskal-Wallis statistic is independent of a population distribution. In other words, the Kruskal-Wallis test is a distribution-free test. If the null hypothesis is true and the sample size is large enough, the test statistic is approximated by the chi-square distribution with degrees of freedom. Table 10.3.6 summarizes the decision rule for the Kruskal-Wallis test in case of large samples. 
Table 10.3.6 Kruskal-Wallis test in case of large samples. Hypothesis Decision Rule Test Statistic: ⋯ If , then reject , At least one pair of is not equal. else accept Ÿ If there is a tie in the combined sample, the average rank is assigned to each data. In this case, the statistic shall be modified as follows: ′ Here = (number of tied groups) = (the size of the tie group, i.e., the number of observations in the tie group) if there is no tie, the size of the tie group is 1 and =1. [Practice 10.3.1] A bread maker wants to compare the three new mixing methods of ingredients. 15 breads were made by each mixing method (A, B, C) of 5 pieces, and a group of judges who did not know the difference in material mixing ratio gave the following points. Test the null hypothesis that there is no difference in taste according to the mixing methods at the significance level of 0.05. Mixing ratio: Method A: 72 88 70 87 Method B: 85 89 86 82 Method C: 94 94 88 87 ⇨ eBook ⇨ PR100301_ScoreByMixingMethod.csv 71 90 89 100 / Chapter 10 Nonparametric Testing Hypothesis 10.3.2 Randomized Block Design: Friedman Test Ÿ Ÿ Example 10.3.2 In Section 9.2, we studied the randomized block design to measure the fuel mileage of three types of cars which reduce the impact of the block factor, i.e., driver. If each population follows a normal distribution, sample data are analyzed using the F-test based on the two-way analysis of variance without the interaction. However, the assumption that a population follows a normal distribution may not be appropriate for real-world data, or that there may not be enough data to assume a normal distribution. Alternatively, if the data collected might not be continuous and are ordinal such as ranks, then the parametric test is not appropriate. In such cases, nonparametric tests are used to test parameters by converting data to ranks without assuming the distribution of the population. 
This section introduces the Friedman test corresponding to the randomized block design experiments in Section 9.2.2. Let us take a look at the Friedman test using [Example 9.2.1] which was the car fuel mileage measurement problem. The fuel mileage of the three types of cars (A, B and C) is measured using the randomized block design as Table 9.2.4 and it is rearranged in Table 10.3.7. Table 10.3.7 Fuel mileage of the three types of cars Driver (Block) 1 2 3 4 5 Car A 22.4 16.1 19.7 21.1 24.5 Car B 16.3 12.6 15.9 17.8 21.0 Car C 20.2 15.2 18.7 18.9 23.8 ⇨ eBook ⇨ EX090201_GasMileage.csv 1) Draw a histogram of the data to see if the fuel mileage of the three cars can be tested by a parametric method. 2) Using the Friedman test which is a nonparametric method of the randomized block design, test whether the fuel mileage of the three types of cars are different with the significance level of 5%. 3) Check the result of the above Friedman test using『eStatU』. Answer 1) Enter data in『eStat』as shown in <Figure 10.3.8>. 10.3.8> 『<Figure eStat』Data input 10.3 Nonparametric Test for Locations of Several Populations / 101 Example 10.3.2 Answer (continued) w Click icon of the analysis of variance. Select ‘Miles’ as 'Analysis Var' and ‘Car’ as ‘by Group’. Then the dot graph by car type and the 95% confidence interval for the population mean will appear. Again, clicking the [Histogram] button in the options window below the graph will show the histogram and normal distribution curve for each car type as shown in <Figure 10.3.9>. <Figure 10.3.9> Histogram of fuel mileage by car w Looking at the histogram, it is not sufficient to assume that each population follows a normal distribution, because of the small number of data. In such case, applying the parametric -test may lead to errors. w The hypothesis for this problem is to test whether or not the location parameters , , of the three populations are the same. At least one pair of location parameters is not equal. 
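Assuming SciPy is available, the whole test for the Table 10.3.7 data can be sketched with `scipy.stats.friedmanchisquare`, which computes the Friedman rank statistic and refers it to a chi-square approximation:

```python
from scipy.stats import friedmanchisquare

# Fuel mileage from Table 10.3.7; one list per car, ordered by driver
# (block), so the i-th entries of the three lists belong to driver i.
car_a = [22.4, 16.1, 19.7, 21.1, 24.5]
car_b = [16.3, 12.6, 15.9, 17.8, 21.0]
car_c = [20.2, 15.2, 18.7, 18.9, 23.8]

stat, p = friedmanchisquare(car_a, car_b, car_c)
print(round(stat, 1))  # 10.0
```

Note that SciPy's p-value comes from the chi-square approximation with k - 1 = 2 degrees of freedom, while the worked example below uses the exact small-sample distribution; with S this extreme both lead to rejection at the 5% level.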
w The Friedman test calculates the sum of ranks, Rj, for each of the three types of cars after ranks are assigned to the fuel mileages measured within each driver (block) (Table 10.3.8). If there is a tie, the average rank is assigned.

Table 10.3.8 Ranking within each block

Driver (Block)  Car A  Car B  Car C
1               3      1      2
2               3      1      2
3               3      1      2
4               3      1      2
5               3      1      2
Sum of ranks    RA = 15  RB = 5  RC = 10

The sum of ranks for Car A is RA = 15, for Car B RB = 5, and for Car C RC = 10. The sums of ranks look different. Are the differences statistically significant?

w The Friedman test statistic S can be considered as the F statistic of the two-way analysis of variance applied to these rank data:

S = 12/(nk(k+1)) × Σ Rj² - 3n(k+1), where k is the number of populations (treatments) and n is the number of blocks.

Example 10.3.2 Answer (continued)

In this example, n = 5 and k = 3, and the statistic is as follows:

S = 12/(5 × 3 × 4) × (15² + 5² + 10²) - 3 × 5 × 4 = 70 - 60 = 10

The distribution of the test statistic S, when the null hypothesis is true, should be known to investigate how large a value of S is statistically significant. Since the number of possible rank configurations when k = 3 and n = 5 is (3!)⁵ = 7776, it is not easy to examine all of these possible rankings to obtain the distribution.『eStatU』provides the distribution of the test statistic S in the case of k = 3, n = 5, as in <Figure 10.3.10> and its distribution table as Table 10.3.9. As shown in the graph, the distribution of S is asymmetrical.

<Figure 10.3.10> Friedman S distribution when k = 3, n = 5

Table 10.3.9 Friedman S distribution when k = 3, n = 5

x       P(X = x)  P(X ≤ x)  P(X ≥ x)
0.000   0.0463    0.0463    1.0000
0.400   0.2623    0.3086    0.9537
1.200   0.1698    0.4784    0.6914
1.600   0.1543    0.6327    0.5216
2.800   0.1852    0.8179    0.3673
3.600   0.0579    0.8758    0.1821
4.800   0.0309    0.9066    0.1242
5.200   0.0540    0.9606    0.0934
6.400   0.0154    0.9761    0.0394
7.600   0.0154    0.9915    0.0239
8.400   0.0077    0.9992    0.0085
10.000  0.0008    1.0000    0.0008

w The Friedman test is a right-sided test.
If we look for the five percentile from the right tail corresponding to significance level, the nearest value is P(X ≥ 6.4) = 0.0394. Since it is a discrete distribution, there is no exact value of five percentile. Hence, the rejection region with the significance level of 5% can be written as follows: ‘If ≥ , then reject ’ Since = 10 in this example, is rejected. 10.3 Nonparametric Test for Locations of Several Populations / 103 Example 10.3.2 Answer (continued) 3) Enter data in『eStatU』as in <Figure 10.3.11> and click the [Execute] button. The sample statistics and test graph will be shown as in <Figure 10.3.12>. The critical line which contains 5% of the significance level is shown here. For a discrete distribution, the choice of the final rejection region should be determined by the analyst. 『 <Figure 10.3.11> Data input for Friedman test at eStatU 『 <Figure 10.3.12> Result of Friedman test using eStatU Ÿ 』 』 Let's generalize the Friedman test described so far using the above example. Assume that there are number of levels and denote the rank of number of data as follows: 104 / Chapter 10 Nonparametric Testing Hypothesis Table 10.3.10 Notation of n random samples for k number of levels with randomized block design Treatment Block 1 2 ⋯ Level 1 Level 2 ⋯ ⋯ n Mean Ÿ ⋅ ⋯ Level ⋯ ⋯ ⋯ ⋅ Total ⋅⋅ Mean ⋅ A statistical model of the Friedman test is as follows: ⋯; ⋯ , Here is the effect of level which satisfies and is the effect of block which satisfies Ÿ . ’s are independent and follows the same continuous distribution. The hypothesis of the Friedman test is as follows: ⋯ At least one pair of is different Ÿ For the Friedman test, ranking data for each block must be created. Table 10.3.11 is the notation of ranking data for each level. 
Table 10.3.11 Notation of rank data in each level Treatment Block Level 1 Level 2 1 2 ⋯ ⋯ Level ⋯ ⋯ n Sum of ranks ⋅ ⋅ ⋯ ⋅ ⋯ Average of ranks Ÿ ⋯ ⋅ ⋅ ⋯ ⋅ Average of ranks ⋅⋅ If we apply the analysis of variance for the rank data of Table 10.3.11 instead of the observation data in Section 9.2, the total sum of squares, , and the block sum of squares are constants. The treatment sum of squares is as follows: · ⋅⋅ Therefore, the test statistic can be written as follows: 10.3 Nonparametric Test for Locations of Several Populations / 105 , Here is a constant. Ÿ That is, since is a constant, test statistic is proportional to . The Friedman test statistic is proportional to as follows: ⋅ ⋅⋅ ⋅ The reason statistic has the constant multiplication of is to make Ÿ which follows a chi-square distribution with degrees of freedom. The distribution of the Friedman test statistic is denoted as . eStatU provides the distribution of up to ≤ if and up to ≤ if . denotes the right tail 100 × percentile, but there might not be the exact percentile, because it is a discrete distribution. In this case, the middle value of two nearest is often used approximately. Table 10.3.12 is the summary of decision rule of the Friedman test. 『 』 Table 10.3.12 Friedman Test Hypothesis Decision Rule Test Statistic: ⋯ If , then reject , At least one pair of is different else accept ☞ If there are tied values on each block, use the average rank. Ÿ Ÿ The distribution of the Friedman statistic is independent of the population distribution. In other words, the Friedman test is a distribution-free test. If the null hypothesis is true and if the sample is large enough, the test statistic is approximated by the chi-square distribution with degrees of freedom. Table 10.3.13 summarizes the decision rule for the Friedman test in case of large sample. 
Table 10.3.13 Friedman Test – large sample case Hypothesis Decision Rule Test Statistic: ⋯ If , then reject , else At least one pair of is different accept Ÿ If there is a tie in the block, the average rank is assigned to each data. In this case, the statistic shall be modified as follows: 106 / Chapter 10 Nonparametric Testing Hypothesis ′ Here = (number of tied groups) = (the size of the tie group, i.e., the number of observations in the tie group). If there is no tie, the size of the tie group is 1 and =1 [Practice 10.3.2] The following is the result of an agronomist's survey of the yield of four varieties of wheat by using the randomized block design of the three cultivated areas (block). Apply the Friedman test whether the mean yields of the four wheats are the same or not with the 5% significance level. Cultivated Area Wheat Type A B C D 1 2 3 50 59 55 58 60 52 55 58 56 51 52 55 ⇨ eBook ⇨ PR100302_WheatAreaYield.csv Chapter 10 Exercise / 107 Exercise 10.1 A psychologist has selected 12 handicap workers randomly from production workers employed at various factories in a large industrial complex and their work competency scores are examined as follows. The psychologist wants to test whether the population average score is 45. Assume the population distribution is symmetrical about the mean. 32, 52, 21, 39, 23, 55, 36, 27, 37, 41, 34, 51 1) Check whether a parametric test is possible. 2) Apply the sign test with the significance level of 5%. 3) Apply the Wilcoxon signed rank test with the significance level of 5%. 10.2 A tire production company wants to test whether a new manufacturing process can produce a more durable tire than the existing process. The tire by a new process was tested to obtain the following data: (unit: 1000 ) Existing Process New Process 62 76 61 90 74 74 75 63 73 53 61 65 60 53 70 63 1) Check whether a parametric test is possible. 
2) Apply the Wilcoxon rank sum test of whether the new process and the existing process have the same durability or not with the significance level of 5%.

10.3 A company wants to compare two methods of delivering information about a new product. Among company employees, 19 were randomly selected and divided into two groups. The first group learned about the new product by method A, and the second group learned by method B. At the end of the experiment, the employees took a test to measure their knowledge of the new product, and their test scores are as follows. Can we conclude from these data that the median values of the two groups are different? Test with the significance level of 0.05.

 Method A   50 59 60 71 80 78 72 77 73 75
 Method B   52 54 58 78 65 61 60 72 60

10.4 10 men and 10 women working in the same profession were selected independently and their monthly salaries were surveyed. Can you say that a man in this profession earns more than a woman? Test with the significance level of 0.05. (Unit: 10 USD)

 Man     381 294 296 389 281 194 193 286 384 494
 Woman   284 279 288 383 489 287 496 393 277 371

10.5 To find out the fuel mileage improvement effect of a new gasoline additive, 10 cars in the same condition were selected. The gas mileage of each car was tested without the gasoline additive and with the additive, running the same road at the same speed, to obtain the following data. Test whether the new gasoline additive is effective in improving the fuel mileage with the significance level of 0.05.

gas mileage (unit: km/liter)
 With additives      11.7 13.8 11.2 7.7 8.2 16.3 14.2 19.4 13.9 15.5
 Without additives   10.3 12.9 12.5 9.5 11.2 14.6 15.9 18.5 12.0 15.1

10.6 In order to determine the efficacy of a new pain reliever, seven persons were tested with aspirin and the new pain reliever. The experiments with the two pain relievers were sufficiently spaced in time, and the order of the medication experiment was randomly determined.
The time (in minutes) until feeling pain relief was measured as follows. Do the data indicate that the new pain reliever gives faster pain relief than aspirin? Test with the significance level of 0.05.

 Person ID           1    2    3    4    5    6    7
 Aspirin             15   7    20   14   12   13   20
 New pain reliever   11   17   10   14   16   17   11

10.7 A person was asked to taste 15 coffee samples and rank them from 1 (least preferred) to 15 (most preferred). The 15 samples are taken from each of three types of coffee (A, B, C) and are tasted in random order. The following table shows the ranking of preference by coffee type. Test the null hypothesis that there is no difference in preference among the three types of coffee at the significance level of 0.05.

 Coffee Type A   9    14   2    10   1
 Coffee Type B   3    11   5    4    12
 Coffee Type C   7    15   13   8    6

10.8 A bread maker wants to compare four new mixes of ingredients. 5 breads were made with each mixing ratio of ingredients, a total of 20 breads, and a group of judges who did not know the difference in mixing ratios gave the following scores. Test the null hypothesis that there is no difference in taste according to the mixing ratio of ingredients at the significance level of 0.05.

 Scores   Method A   Method B   Method C   Method D
          72         85         94         91
          88         89         94         93
          70         86         88         92
          87         82         87         95
          71         90         89         96

Multiple Choice Exercise

10.1 What is NOT a reason to use a nonparametric test?
① The population is not normally distributed.
② The data are ordinal.
③ The data follow a normal distribution.
④ There is an extreme point in the sample.

10.2 Which of the following nonparametric tests is for testing the location parameter of a single population?
① Wilcoxon signed rank test   ② Wilcoxon rank sum test
③ Kruskal-Wallis test         ④ Friedman test

10.3 Which of the following nonparametric tests is for testing the location parameters of two populations?
① Wilcoxon signed rank test   ② Wilcoxon rank sum test
③ Kruskal-Wallis test         ④ Friedman test

10.4 Which of the following nonparametric tests is for testing the location parameters of multiple populations?
① Wilcoxon signed rank test   ② Wilcoxon rank sum test
③ Kruskal-Wallis test         ④ Friedman test

10.5 Which of the following nonparametric tests is appropriate for the randomized block design?
① Wilcoxon signed rank test   ② Wilcoxon rank sum test
③ Kruskal-Wallis test         ④ Friedman test

10.6 What is the sign test?
① A test for the location parameter of a single population
② A test for the location parameters of two populations
③ A test for the location parameters of multiple populations
④ A test for the randomized block design

10.7 What transformation of data is often used for nonparametric tests?
① log transformation          ② exponential transformation
③ (0-1) transformation        ④ ranking transformation

10.8 What test statistic is used for the sign test?
① rank                        ② (number of + signs)
③ degrees of freedom          ④ (number of + signs) − (number of − signs)

10.9 What test statistic is used for testing two location parameters of two populations using a nonparametric test?
① (number of + signs)
② sum of ranks in population 2
③ (number of − signs)
④ (sum of ranks in population 1) + (sum of ranks in population 2)

10.10 What is the theoretical basis for the statistic used for the Kruskal-Wallis test?
① Within sum of squares of rank data
② Error sum of squares of rank data
③ Total sum of squares of rank data
④ Treatment sum of squares of rank data

(Answers) 10.1 ③, 10.2 ①, 10.3 ②, 10.4 ③, 10.5 ④, 10.6 ①, 10.7 ④, 10.8 ④, 10.9 ②, 10.10 ④

11 Testing Hypothesis for Categorical Data

SECTIONS
11.1 Goodness of Fit Test
11.1.1 Goodness of Fit Test for Categorical Data
11.1.2 Goodness of Fit Test for Continuous Data
11.2 Testing Hypothesis for Contingency Table
11.2.1 Independence Test
11.2.2 Homogeneity Test

CHAPTER OBJECTIVES
The hypothesis tests that we have studied from Chapter 7 to Chapter 10 are for continuous data. In this chapter, we describe testing hypotheses for categorical data. Section 11.1 describes the goodness of fit test for the frequency table of a categorical variable. Section 11.2 describes the independence and homogeneity tests for the contingency table of two categorical variables.

11.1 Goodness of Fit Test

• The frequency table of categorical data discussed in Chapter 4 counts the frequency of each possible value of a categorical variable. If this frequency table is for sample data from a population, we are curious what the frequency distribution of the population would be. The goodness of fit test is a test of the hypothesis that the population follows a particular distribution, based on the sample frequency distribution. In this section, we discuss the goodness of fit test for categorical distributions (Section 11.1.1) and the goodness of fit test for continuous distributions (Section 11.1.2).

11.1.1 Goodness of Fit Test for Categorical Data

• Consider the goodness of fit test for a categorical distribution using the example below.

Example 11.1.1 The result of a survey of 150 people before a local election to find out the approval ratings of three candidates is as follows.
Looking at this frequency table alone, it seems that candidate A has a 40 percent approval rating, higher than the other candidates. Based on this sample survey, perform the goodness of fit test of whether the three candidates have the same approval rating or not. Use『eStatU』with the 5% significance level.

 Candidate   Number of Supporters   Percent
 A           60                     40.0%
 B           50                     33.3%
 C           40                     26.7%
 Total       150                    100%

Answer
w Assume the approval ratings of candidates A, B, and C are p_A, p_B, and p_C respectively. The hypothesis for this problem is as follows:

    H₀: The three candidates have the same approval rating. (i.e., p_A = p_B = p_C = 1/3)
    H₁: The three candidates have different approval ratings.

w If the null hypothesis is true that the three candidates have the same approval rating, each candidate will have 150 × (1/3) = 50 supporters out of the total 150 people. This is referred to as the 'expected frequency' of each candidate when H₀ is true. For each candidate, the number of observed supporters in the sample is called the 'observed frequency'. If H₀ is true, the observed and expected numbers of supporters can be summarized as the following table.

 Candidate   Observed frequency (Oᵢ)   Expected frequency (Eᵢ)
 A           O₁ = 60                   E₁ = 50
 B           O₂ = 50                   E₂ = 50
 C           O₃ = 40                   E₃ = 50
 Total       150                       150

w If H₀ is true, the observed frequency (Oᵢ) and the expected frequency (Eᵢ) should roughly coincide. Therefore, in order to test the hypothesis, a statistic which uses the squared difference between Oᵢ and Eᵢ is used. Specifically, the statistic to test the hypothesis is as follows:

    χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

If the observed value of this test statistic is close to zero, it can be considered that H₀ is true, because each Oᵢ is close to Eᵢ. If the observed value is large, H₀ will be rejected. The question is, 'How large a value of the test statistic would be considered statistically significant?' It can be shown that this test statistic approximately follows the chi-square distribution with k − 1 degrees of freedom if the expected frequencies are large enough.
Here k is the number of categories (i.e., candidates) in the table, and it is 3 in this example. Therefore, the decision rule to test the hypothesis is as follows:

    'If χ² > χ²_{k−1; α}, reject H₀, else do not reject H₀'

w The statistic can be calculated as follows:

    χ² = (60 − 50)²/50 + (50 − 50)²/50 + (40 − 50)²/50 = 2.0 + 0.0 + 2.0 = 4.0

Since the significance level is 5%, the critical value can be found from the chi-square distribution as follows:

    χ²_{2; 0.05} = 5.99

Therefore, H₀ cannot be rejected. In other words, although the above sample frequency table shows that the approval ratings of the three candidates differ, this difference does not provide sufficient evidence to conclude that the three candidates have different approval ratings.

w Using each candidate's sample approval rating, p̂_A = 60/150 = 0.400, p̂_B = 50/150 = 0.333, p̂_C = 40/150 = 0.267, the 95% confidence intervals for the population proportion of each candidate's approval rating using the formula p̂ ± 1.96·√(p̂(1−p̂)/n) (refer to Section 6.4) are as follows:

    A : 0.400 ± 1.96·√(0.400 · 0.600 / 150)  ⇔  (0.322, 0.478)
    B : 0.333 ± 1.96·√(0.333 · 0.667 / 150)  ⇔  (0.258, 0.409)
    C : 0.267 ± 1.96·√(0.267 · 0.733 / 150)  ⇔  (0.196, 0.337)

The overlapping of the confidence intervals of the three candidates' approval ratings means that no candidate's approval rating is clearly separated from the others.

w In the data input box that appears by selecting the 'Goodness of Fit Test' of『eStatU』, enter the 'Observed Frequency' and 'Expected Probability' data as shown in <Figure 11.1.1>. After entering the data, select the significance level and click the [Execute] button to calculate the 'Expected Frequency' and to see the result of the chi-square test. Note that this chi-square goodness of fit test should be applied only when the expected frequency of each category is at least 5.

<Figure 11.1.1> Goodness of fit test data input in『eStatU』
<Figure 11.1.2>『eStatU』chi-square goodness of fit test

• Consider a categorical variable X which has k possible values v₁, v₂, ..., v_k whose probabilities are p₁, p₂, ..., p_k respectively.
In other words, the probability distribution for the categorical variable X is as follows:

 X             v₁    v₂    ⋯   v_k   Total
 Probability   p₁    p₂    ⋯   p_k   1

When n random samples are collected from the population of the categorical random variable X and their observed frequencies are O₁, O₂, ..., O_k, the hypothesis to test whether the population probability distribution equals a specified distribution (p₁₀, p₂₀, ..., p_k₀) is as follows:

    H₀: pᵢ = pᵢ₀ for i = 1, 2, ..., k
    H₁: pᵢ ≠ pᵢ₀ for at least one i

• If the total number of samples is large enough, the above hypothesis can be tested using the chi-square test statistic as follows:

    χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
    'If χ² > χ²_{k−1−m; α}, then reject H₀'

Here Eᵢ = n·pᵢ₀, i = 1, 2, ..., k are the expected frequencies, and m is the number of population parameters estimated from the sample data. In [Example 11.1.1], since no population parameter was estimated from the sample, m = 0.

Goodness of Fit Test
Consider a categorical variable X which has k possible values v₁, v₂, ..., v_k whose probabilities are p₁, p₂, ..., p_k respectively. Let the observed frequencies for each value of X from n samples be O₁, O₂, ..., O_k, the expected frequencies be Eᵢ = n·pᵢ₀, i = 1, 2, ..., k, and the significance level be α.
Hypothesis:
    H₀: pᵢ = pᵢ₀ for i = 1, 2, ..., k
    H₁: pᵢ ≠ pᵢ₀ for at least one i
Decision Rule:
    'If χ² > χ²_{k−1−m; α}, then reject H₀'
m is the number of population parameters estimated from the samples.
☞ In order to use the chi-square goodness of fit test, all expected frequencies should be at least 5. A category which has an expected frequency less than 5 can be merged with another category.

[Practice 11.1.1] Market shares of toothpaste brands A, B, C and D are known to be 0.3, 0.6, 0.08, and 0.02 respectively. The result of a survey of 600 people on their toothpaste brand is as follows. Can you conclude from these data that the known market shares are incorrect? Use『eStatU』.
 Brand                 A     B     C    D    Total
 Number of Customers   192   342   44   22   600

11.1.2 Goodness of Fit Test for Continuous Data

• The goodness of fit test using the chi-square distribution can also be used for continuous data. The following is an example of the goodness of fit test of whether data are from a population with a normal distribution. The parametric statistical tests of Chapter 6 to Chapter 9 require the assumption that the population is normally distributed, and the goodness of fit test in this section can be used to test for normality.

Example 11.1.2 The ages of 30 people who visited a library in the morning are as follows. Test the hypothesis that the population is normally distributed at the significance level of 5%.

28 55 26 35 43 47 47 17 35 36 48 47 34 28 43 20 30 53 27 32 34 43 18 38 29 44 67 48 45 43

⇨ eBook ⇨ EX110102_AgeOfLibraryVisitor.csv

Answer
w Age is a continuous variable, but you can make a frequency distribution by dividing the possible values into intervals, as we studied for the histogram in Chapter 3. This is called a categorization of the continuous data.
w Let's find a frequency table which starts at the age of 10 with an interval size of 10. The histogram module of『eStat』makes this frequency table easy to obtain. If you enter the data as shown in <Figure 11.1.3>, click the histogram icon and select Age from the variable selection box, then the histogram of <Figure 11.1.4> will appear.

<Figure 11.1.3> Data input at『eStat』
<Figure 11.1.4> Default histogram of age

w If you specify 'start interval' as 10 and 'interval width' as 10 in the options window below the histogram, the histogram of <Figure 11.1.4> is adjusted as in <Figure 11.1.5>. If you click the [Frequency Table] button, the frequency table shown in <Figure 11.1.6> will appear in the Log Area. The designation of the interval size can be determined by the researcher.
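The categorization step itself can also be scripted outside of『eStat』. A minimal sketch in Python, assuming numpy is available; the bin edges mirror the 'start interval 10, interval width 10' setting (the first and last bins here are bounded, while the frequency table treats them as open-ended, which makes no difference for these 30 values):

```python
import numpy as np

# Ages of the 30 library visitors of Example 11.1.2
ages = [28, 55, 26, 35, 43, 47, 47, 17, 35, 36, 48, 47, 34, 28, 43,
        20, 30, 53, 27, 32, 34, 43, 18, 38, 29, 44, 67, 48, 45, 43]

# Interval edges 10, 20, ..., 70 give bins [10,20), [20,30), ..., [60,70]
edges = np.arange(10, 80, 10)
counts, _ = np.histogram(ages, bins=edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo} <= age < {hi}: {c}")
# counts: 2, 6, 8, 11, 2, 1, as in the frequency table
```

The same counts appear in the [Frequency Table] output of『eStat』described above.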
<Figure 11.1.5> Adjusted histogram of age
<Figure 11.1.6> Frequency table of the adjusted histogram

w Since the normal distribution is a continuous distribution defined on (−∞, ∞), the frequency table of <Figure 11.1.6> can be written as follows:

Table 11.1.2 Frequency table of age with adjusted intervals

 Interval id   Interval        Observed frequency
 1             X < 20          2
 2             20 ≤ X < 30     6
 3             30 ≤ X < 40     8
 4             40 ≤ X < 50     11
 5             50 ≤ X < 60     2
 6             60 ≤ X          1

w A frequency table of sample data such as Table 11.1.2 can be used to test the goodness of fit of whether the sample data follow a normal distribution, using the chi-square distribution. The hypothesis of this problem is as follows:

    H₀: The sample data follow a normal distribution
    H₁: The sample data do not follow a normal distribution

w This hypothesis does not specify which normal distribution, and therefore the population mean and the population variance should be estimated from the sample data. Pressing the 'Basic Statistics' icon on the main menu of『eStat』will display a table of basic statistics in the Log Area, as shown in <Figure 11.1.7>. The sample mean is 38.567 and the sample standard deviation is 12.982.

<Figure 11.1.7> Descriptive statistics of age

Hence, the above hypothesis can be written in detail as follows:

    H₀: The sample data follow N(μ̂, σ̂²)
    H₁: The sample data do not follow N(μ̂, σ̂²)

w In order to find the expected frequency of each interval when H₀ is true, the expected probability of each interval is calculated first using the normal distribution. The normal distribution module of『eStatU』makes it easy to calculate the probability of an interval. At the normal distribution module of『eStatU』, enter the mean of 38.000 and the standard deviation of 11.519. Click the second radio button of the P(X ≤ x) type and enter 20, then press the [Execute] button to calculate the probability as shown in <Figure 11.1.8>.
<Figure 11.1.8> Calculation of a normal probability using『eStatU』

Similarly you can calculate the following probabilities:

    P(X < 20) = 0.059
    P(20 ≤ X < 30) = 0.185
    P(30 ≤ X < 40) = 0.325
    P(40 ≤ X < 50) = 0.282
    P(50 ≤ X < 60) = 0.121
    P(X ≥ 60) = 0.028

w The expected frequency of each interval can be calculated by multiplying the sample size of 30 by the expected probability of the interval obtained above. The observed frequencies, expected probabilities, and expected frequencies of each interval can be summarized as the following table.

Table 11.1.3 Observed and expected frequencies of each interval

 Interval id   Interval        Observed frequency   Expected probability   Expected frequency
 1             X < 20          2                    0.059                  1.77
 2             20 ≤ X < 30     6                    0.185                  5.55
 3             30 ≤ X < 40     8                    0.325                  9.75
 4             40 ≤ X < 50     11                   0.282                  8.46
 5             50 ≤ X < 60     2                    0.121                  3.63
 6             60 ≤ X          1                    0.028                  0.84

w Since the expected frequencies of the 1st and 6th intervals are less than 5, these intervals should be combined with adjacent intervals for testing the goodness of fit using the chi-square distribution, as in Table 11.1.4. The expected frequency of the last interval is still less than 5, but if we combined this interval again there would be only three intervals, so we demonstrate the calculation as it is. Note that, due to rounding error, the sum of the expected probabilities may not be exactly equal to 1 and the sum of the expected frequencies may not be exactly 30 in Table 11.1.4.

Table 11.1.4 Revised table after combining intervals with small expected frequencies

 Interval id   Interval        Observed frequency   Expected probability   Expected frequency
 1             X < 30          8                    0.244                  7.32
 2             30 ≤ X < 40     8                    0.325                  9.75
 3             40 ≤ X < 50     11                   0.282                  8.46
 4             50 ≤ X          3                    0.149                  4.47
 Total                         30                   1.000                  30.00

w The test statistic for the goodness of fit test is as follows:

    χ² = (8 − 7.32)²/7.32 + (8 − 9.75)²/9.75 + (11 − 8.46)²/8.46 + (3 − 4.47)²/4.47 ≈ 1.62

Since the number of intervals is k = 4 and m = 2, because the two population parameters μ and σ² are estimated from the sample data, the degrees of freedom are k − 1 − m = 4 − 1 − 2 = 1.
Therefore, the critical value is as follows:

    χ²_{1; 0.05} = 3.84

Since the observed test statistic 1.62 is less than the critical value, we cannot reject the null hypothesis that the sample data follow a normal distribution.

w The test result can be verified using the 'Goodness of Fit Test' module of『eStatU』. In the input box that appears by selecting the 'Goodness of Fit Test' module, enter the data for 'Observed Frequency' and 'Expected Probability' of Table 11.1.4, as shown in <Figure 11.1.9>. After entering the data, select the significance level and press the [Execute] button to calculate the 'Expected Frequency' and produce the chi-square test result (<Figure 11.1.10>).

<Figure 11.1.9> Data input for the goodness of fit test in『eStatU』
<Figure 11.1.10> Chi-square goodness of fit test using『eStatU』

[Practice 11.1.2] (Otter length) Data of 30 otter lengths can be found at the following location of『eStat』.
⇨ eBook ⇨ PR110102_OtterLength.csv
Test the hypothesis that the population is normally distributed at the significance level of 5% using『eStat』.

11.2 Testing Hypothesis for Contingency Table

• The contingency table, or cross table, discussed in Chapter 4 places the possible values of two categorical variables in rows and columns respectively and counts the frequency of each cell in which the values of the two variables intersect. If this contingency table is for sample data taken from a population, we may ask what the contingency table of the population would look like. The test for a contingency table is usually an analysis of the relation between two categorical variables, and it can be divided into the independence test and the homogeneity test according to the sampling method used to obtain the data.
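Before turning to contingency tables, note that the goodness of fit calculations of Section 11.1 can also be reproduced outside of『eStatU』with a short script. A minimal sketch in Python, assuming scipy is available, using the candidate data of Example 11.1.1; the `ddof` argument is how the m estimated parameters of a test like Example 11.1.2 would be accounted for:

```python
from scipy.stats import chisquare, chi2

# Example 11.1.1: observed supporters and expected frequencies under H0
observed = [60, 50, 40]
expected = [50, 50, 50]

# ddof is set to m when m population parameters are estimated from the
# sample (m = 0 in Example 11.1.1, m = 2 in Example 11.1.2)
stat, p = chisquare(observed, f_exp=expected, ddof=0)

# 5% critical value of the chi-square distribution with k - 1 = 2 df
critical = chi2.ppf(0.95, df=len(observed) - 1)
print(stat, round(critical, 2))   # 4.0 and 5.99: H0 is not rejected
```

Since 4.0 < 5.99 (equivalently, the p-value exceeds 0.05), the conclusion matches the hand calculation of Example 11.1.1.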
11.2.1 Independence Test

• The independence test for a contingency table investigates whether two categorical variables are independent when samples are extracted from one population. Consider the independence test with the following example.

Example 11.2.1 In order to investigate whether wearing glasses is independent of gender among college students, a sample of 100 students was collected and the following contingency table was prepared:

Table 11.2.1 Wearing glasses by gender

          Wear Glasses   No Glasses   Total
 Men      40             10           50
 Women    20             30           50
 Total    60             40           100

⇨ eBook ⇨ EX110201_GlassesByGender.csv

1) Using『eStat』, draw a line graph of the use of eyeglasses by men and women.
2) Test the hypothesis at the 5% significance level to see whether the gender variable and the wearing of glasses are independent of, or related to, each other.
3) Check the result of the independence test using『eStatU』.

Answer
1) Enter the data in『eStat』as shown in <Figure 11.2.1>.

<Figure 11.2.1> Data input

w Select the 'Line Graph' icon from the main menu. If you click the variables 'Gender', 'Glasses', 'NoGlasses' one by one, then a line graph as shown in <Figure 11.2.2> will appear in the Graph Area. If you look at the line graph, you can see that the proportions of wearing glasses for men and women are different: 80% of men wear glasses, while only 40% of women do. In such cases, the gender variable and the wearing of glasses are considered related. When two variables are related like this, the two lines of the line graph cross each other.

<Figure 11.2.2> Line graph of wearing glasses by gender

2) If the two variables are not related (i.e., if the two variables are independent of each other), the proportion of students wearing glasses among men and among women should both be equal to 60%, which is the proportion of all students wearing glasses in Table 11.2.1.
In other words, if the two variables are independent, the contingency table should be as follows:

Table 11.2.2 Contingency table when gender and wearing glasses are independent

          Wear Glasses   No Glasses   Total
 Men      30             20           50
 Women    30             20           50
 Total    60             40           100

w If there is little difference between the observed contingency table and the contingency table expected under independence, the two categorical variables are said to be independent of each other. If the differences are very large, the two categorical variables are related to each other. The independence test is a statistical method for determining whether two categorical variables of the population are independent of each other by using the observed contingency table obtained from the sample. The independence test uses the chi-square distribution, and the hypothesis is as follows:

    H₀: The two variables of the contingency table are independent of each other.
    H₁: The two variables of the contingency table are related.

w The test statistic for testing this hypothesis utilizes the difference between the observed frequencies of the contingency table of the sample and the expected frequencies of the contingency table when the two variables are assumed to be independent, similar to the goodness of fit test. The test statistic in this example is as follows:

    χ² = (40 − 30)²/30 + (10 − 20)²/20 + (20 − 30)²/30 + (30 − 20)²/20 = 16.67

This test statistic follows a chi-square distribution with (r − 1)(c − 1) degrees of freedom, where r is the number of rows (number of possible values of the row variable) and c is the number of columns (number of possible values of the column variable). Therefore, the decision rule to test the hypothesis is as follows:

    'If χ² > χ²_{(r−1)(c−1); α}, then reject H₀.'

In this example, χ² = 16.67 is greater than the critical value χ²_{1; 0.05} = 3.84. Therefore, the null hypothesis that the two variables are independent of each other is rejected, and we conclude that gender and wearing glasses are related.
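The calculation in part 2) can be checked with a short script as well. A minimal sketch in Python, assuming scipy is available; `correction=False` is needed because scipy otherwise applies the Yates continuity correction to 2×2 tables, which this chapter does not use:

```python
from scipy.stats import chi2_contingency

# Observed table of Example 11.2.1: rows = men, women; columns = glasses, no glasses
observed = [[40, 10],
            [20, 30]]

stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2), dof)   # 16.67 with 1 degree of freedom
print(expected)              # [[30. 20.] [30. 20.]], as in Table 11.2.2
```

The function also returns the expected frequencies under independence, which match Table 11.2.2 exactly.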
3) In the independence test module of『eStatU』, enter the data as shown in <Figure 11.2.3> and press the [Execute] button to display the result of the chi-square test as shown in <Figure 11.2.4>.

<Figure 11.2.3>『eStatU』test of independence
<Figure 11.2.4>『eStatU』chi-square test of independence

• Assume that the variable A has r attributes A₁, A₂, ..., A_r and the variable B has c attributes B₁, B₂, ..., B_c. Let p_ij denote the probability of the cell of the iᵗʰ attribute of A and the jᵗʰ attribute of B in the contingency table of A and B, as in Table 11.2.3. Here p_i· = p_i1 + p_i2 + ··· + p_ic denotes the marginal probability of A_i, and p_·j = p_1j + p_2j + ··· + p_rj denotes the marginal probability of B_j.

Table 11.2.3 Notation of probabilities in an r × c contingency table

 A \ B    B₁     B₂     ⋯   B_c    Total
 A₁       p₁₁    p₁₂    ⋯   p_1c   p_1·
 A₂       p₂₁    p₂₂    ⋯   p_2c   p_2·
 ⋮        ⋮      ⋮          ⋮      ⋮
 A_r      p_r1   p_r2   ⋯   p_rc   p_r·
 Total    p_·1   p_·2   ⋯   p_·c   1

• If the two events A_i and B_j are independent, P(A_i ∩ B_j) = P(A_i)·P(B_j), and hence p_ij = p_i· ⋅ p_·j. If the two variables A and B are independent, all i and j should satisfy this property, and testing it is called the independence test.

    H₀: Variables A and B are independent, i.e., p_ij = p_i· ⋅ p_·j for i = 1, ..., r and j = 1, ..., c
    H₁: Variables A and B are not independent.

• In order to test whether the two variables of the population are independent, let the observed frequencies O_ij of the contingency table from the samples be as follows:

Table 11.2.4 Observed frequencies of an r × c contingency table

 A \ B    B₁     B₂     ⋯   B_c    Total
 A₁       O₁₁    O₁₂    ⋯   O_1c   O_1·
 A₂       O₂₁    O₂₂    ⋯   O_2c   O_2·
 ⋮        ⋮      ⋮          ⋮      ⋮
 A_r      O_r1   O_r2   ⋯   O_rc   O_r·
 Total    O_·1   O_·2   ⋯   O_·c   n

• If the null hypothesis is true, i.e., if the two variables are independent of each other, the expected frequency of cell (i, j) is n · p_i· ⋅ p_·j. Since we do not know the population marginal probabilities p_i· and p_·j, we use their estimates O_i·/n and O_·j/n, and the estimate of the expected frequency, Ê_ij, is as follows:

    Ê_ij = n · (O_i·/n) · (O_·j/n) = O_i· ⋅ O_·j / n

• The expected frequencies in the case of independence can be explained as maintaining the overall proportions of the attributes of the B variable, O_·1/n, ..., O_·c/n, within each attribute of the A variable.
Table 11.2.5 Expected frequencies of an r × c contingency table

 A \ B    B₁             B₂             ⋯   B_c
 A₁       O_1·O_·1 / n   O_1·O_·2 / n   ⋯   O_1·O_·c / n
 A₂       O_2·O_·1 / n   O_2·O_·2 / n   ⋯   O_2·O_·c / n
 ⋮        ⋮              ⋮                  ⋮
 A_r      O_r·O_·1 / n   O_r·O_·2 / n   ⋯   O_r·O_·c / n

• The test statistic utilizes the difference between O_ij and Ê_ij as follows:

    χ² = Σᵢ Σⱼ (O_ij − Ê_ij)² / Ê_ij

This test statistic follows approximately a chi-square distribution with (r − 1)(c − 1) degrees of freedom. Therefore, the decision rule to test the hypothesis with significance level α is as follows:

    'If χ² > χ²_{(r−1)(c−1); α}, then reject H₀'

Independence Test
Hypothesis:
    H₀: Variables A and B are independent, i.e., p_ij = p_i· ⋅ p_·j for all i, j
    H₁: Variables A and B are not independent.
Decision Rule:
    'If χ² > χ²_{(r−1)(c−1); α}, then reject H₀'
where r is the number of attributes of the row variable and c is the number of attributes of the column variable.
☞ In order to use the chi-square distribution for the independence test, all expected frequencies should be at least 5. If the expected frequency of a cell is smaller than 5, the cell is combined with an adjacent cell for analysis.

• Consider an example of the independence test with more rows and columns.

Example 11.2.2 A market research institute surveyed 500 people on how three beverage products (A, B and C) are preferred by region and obtained the following contingency table.

Table 11.2.6 Survey of preference of beverages by region

 Region        Beverage A   Beverage B   Beverage C   Total
 New York      52           64           24           140
 Los Angeles   60           59           52           171
 Atlanta       50           65           74           189
 Total         162          188          150          500

⇨ eBook ⇨ EX110202_BeverageByRegion.csv

1) Draw a line graph of beverage preference by region using『eStat』and analyze the graph.
2) Test whether the region and the beverage preference are independent of each other at the significance level of 5%.
3) Check the result of the independence test using『eStatU』.

Answer
1) Enter the data in『eStat』as shown in <Figure 11.2.5>.

<Figure 11.2.5> Data input

w Select 'Line Graph' and click the variables 'Region', 'A', 'B', and 'C' in order, then the line graph shown in <Figure 11.2.6> will appear.
If you look at the line graph, you can see that the lines cross from region to region, so the regional preference differs. Can you statistically conclude that the region and the beverage preference are related?

<Figure 11.2.6> Line graph by region and beverage

2) The hypothesis for the independence test is as follows:

    H₀: Region and beverage preference are independent.
    H₁: Region and beverage preference are not independent.

w In order to calculate the expected frequencies, we first calculate the proportions of preference of each beverage without considering the region:

    A: 162/500 = 0.324,   B: 188/500 = 0.376,   C: 150/500 = 0.300

w If the two variables are independent, these proportions should be maintained in each region. Hence, the expected frequencies in each region can be calculated as follows:

    New York:      140 × 0.324 = 45.36,   140 × 0.376 = 52.64,   140 × 0.300 = 42.00
    Los Angeles:   171 × 0.324 = 55.40,   171 × 0.376 = 64.30,   171 × 0.300 = 51.30
    Atlanta:       189 × 0.324 = 61.24,   189 × 0.376 = 71.06,   189 × 0.300 = 56.70

w The chi-square test statistic and the critical value are as follows:

    χ² = (52 − 45.36)²/45.36 + (64 − 52.64)²/52.64 + (24 − 42.00)²/42.00 + ··· + (74 − 56.70)²/56.70 ≈ 19.82
    χ²_{(3−1)(3−1); 0.05} = χ²_{4; 0.05} = 9.49

Therefore, the null hypothesis is rejected at the significance level of 5%, and we conclude that the region and the beverage preference are related.

3) In the independence test module of『eStatU』, enter the data as shown in <Figure 11.2.7> and click the [Execute] button to display the result of the chi-square test as shown in <Figure 11.2.8>.

<Figure 11.2.7> Data input for the independence test at『eStatU』
<Figure 11.2.8> Chi-square independence test at『eStatU』

• As described in Chapter 4, if a contingency table is made from raw data (<Figure 11.2.9>),『eStat』provides the result of the independence test as shown in <Figure 11.2.10>. In this case, if a cell of the contingency table has a small expected frequency, the test result should be interpreted carefully.

<Figure 11.2.9> Raw data input for the independence test
<Figure 11.2.10>『eStat』contingency table and independence test

[Practice 11.2.1] A guidance counselor surveyed 100 high school students about reading and watching TV.
The following table was obtained by classifying each item as high and low. Using the significance level of 0.05, are these data sufficient to claim that reading and TV viewing are related? Check the test result using『eStatU』.

                    Reading High   Reading Low   Total
 TV viewing High    40             18            58
 TV viewing Low     31             11            42
 Total              71             29            100

⇨ eBook ⇨ EX110201_TV_Reading.csv

11.2.2 Homogeneity Test

• The independence test described in the previous section was for the contingency table of two categorical variables based on sample data from one population. However, a similar contingency table may be made from several populations, where a separate sample is drawn from each population. This is often done when separate sampling is more efficient or when time and space constraints are imposed. For example, if you want to compare the English scores of freshman, sophomore, junior and senior students in a university, it is reasonable to take a sample from each grade and analyze them together. In this case, the contingency table is as follows:

Table 11.2.7 A contingency table of English score by grade level

 English score   Freshman   Sophomore   Junior   Senior
 A               -          -           -        -
 B               -          -           -        -
 C               -          -           -        -
 D               -          -           -        -

• If this contingency table is derived from a sample of each grade population, the question we are curious about is not the independence of the English score and the grade level, but whether the four distributions of English scores are equal. The hypothesis for a contingency table of samples drawn from multiple populations is as follows; testing it is called the homogeneity test.

    H₀: The distributions of the several populations for the categorical variable are homogeneous.
    H₁: The distributions of the several populations for the categorical variable are not homogeneous.

• The test statistic for the homogeneity test is the same as for the independence test:

    χ² = Σᵢ Σⱼ (O_ij − Ê_ij)² / Ê_ij

Here r is the number of attributes of the categorical variable and c is the number of populations.
Homogeneity Test
Hypothesis:
    H₀: The distributions of the several populations for the categorical variable are homogeneous.
    H₁: The distributions of the several populations for the categorical variable are not homogeneous.
Decision Rule:
    'If χ² > χ²_{(r−1)(c−1); α}, then reject H₀'
Here r is the number of attributes of the categorical variable and c is the number of populations.
☞ In order to use the chi-square distribution for the homogeneity test, all expected frequencies should be at least 5. If the expected frequency of a cell is smaller than 5, the cell is combined with an adjacent cell for analysis.

Example 11.2.3 In order to investigate whether the viewers of three TV programs (A, B and C) differ by age group, 200, 100 and 100 samples were taken separately from the populations of young people (20s), middle-aged people (30s and 40s), and older people (50s and over) respectively. Their preferences for the programs were summarized as follows. Test whether TV program preferences vary by age group at the significance level of 5%.

Table 11.2.8 Preference of TV program by age group

 TV Program   Young   Middle Aged   Older   Total
 A            120     10            10      140
 B            30      75            30      135
 C            50      15            60      125
 Total        200     100           100     400

Answer
w The hypothesis of this problem is as follows:

    H₀: TV program preferences of the different age groups are homogeneous.
    H₁: TV program preferences of the different age groups are not homogeneous.

w The proportions of the number of samples in each age group are as follows:

    Young: 200/400 = 0.50,   Middle Aged: 100/400 = 0.25,   Older: 100/400 = 0.25

Therefore, the expected frequencies of each program when H₀ is true are as follows:

    A: 140 × 0.50 = 70,     140 × 0.25 = 35,      140 × 0.25 = 35
    B: 135 × 0.50 = 67.5,   135 × 0.25 = 33.75,   135 × 0.25 = 33.75
    C: 125 × 0.50 = 62.5,   125 × 0.25 = 31.25,   125 × 0.25 = 31.25

w The test statistic and the critical value are as follows:

    χ² = (120 − 70)²/70 + (10 − 35)²/35 + (10 − 35)²/35 + ··· + (60 − 31.25)²/31.25 ≈ 180.5
    χ²_{(3−1)(3−1); 0.05} = χ²_{4; 0.05} = 9.49

Since χ² is greater than the critical value, H₀ is rejected. The TV programs have different preferences for the different age groups.
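Since the homogeneity test uses exactly the same computation as the independence test, the example above can be checked with the same routine. A minimal sketch in Python, assuming scipy is available; rows are the programs A, B, C and columns are the age groups:

```python
from scipy.stats import chi2_contingency, chi2

# Preference counts of Example 11.2.3:
# rows = programs A, B, C; columns = young, middle aged, older
observed = [[120, 10, 10],
            [30, 75, 30],
            [50, 15, 60]]

stat, p, dof, expected = chi2_contingency(observed)
critical = chi2.ppf(0.95, dof)   # 5% chi-square critical value
print(round(stat, 1), dof, round(critical, 2))
# statistic ≈ 180.5 with 4 degrees of freedom; 180.5 > 9.49, so H0 is rejected
```

The only difference from the independence test lies in the sampling design and the wording of the hypothesis, not in the arithmetic.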
[Practice 11.2.2] To evaluate the effectiveness of typing training, 100 documents by company employees who received typing training and 100 documents by employees who did not were evaluated. Each document is classified as good, normal or low. The following table classifies all 200 documents by evaluation and by whether the author received training. Test the null hypothesis that the distributions of the document evaluation are the same in both populations. Use α = 0.05 and check your test result using『eStatU』.

   Document Evaluation   Typing training   No typing training   Total
   Good                        48                 12               60
   Normal                      39                 26               65
   Low                         13                 62               75
   Total                      100                100              200

Exercise

11.1 300 randomly selected customers were asked on which day of the week they usually go to the grocery store, with the following result. Can you conclude that the percentages of preferred days differ? Use the 5% significance level. Check the test result using『eStatU』.

   Day                   Mon   Tue   Wed   Thr   Fri   Sat   Sun   Total
   Number of Customers    10    20    40    40    80    60    50     300

11.2 The market shares of toothpaste brands A, B, C and D are known to be 0.3, 0.6, 0.08 and 0.02 respectively. The result of a survey of 600 people on toothpaste brands is as follows. Can you conclude from these data that the known market shares are incorrect? Use α = 0.05 and check your test result using『eStatU』.

   Brand                   A     B    C    D   Total
   Number of Customers   192   342   44   22     600

11.3 The following table shows the distribution of scores on an aptitude test given to 223 workers at a plant. The mean and variance of the sample data are 75 and 386 respectively. Test whether the scores of the aptitude test follow a normal distribution. Use α = 0.05 and check your test result using『eStatU』.
   Score interval   Number of Workers
   X < 40                  10
   40 ≤ X < 50             12
   50 ≤ X < 60             17
   60 ≤ X < 70             37
   70 ≤ X < 80             55
   80 ≤ X < 90             51
   90 ≤ X < 100            34
   X ≥ 100                  7
   Total                  223

11.4 The following data show the highest daily temperature of a city during the month of August. Test whether the temperature data follow a normal distribution at the 5% significance level. (Unit: °C)
   29, 29, 34, 35, 35, 31, 32, 34, 38, 34, 33, 31, 31, 30, 34, 35, 34, 32, 32, 29, 28, 30, 29, 31, 29, 28, 30, 29, 29, 27, 28

11.5 For market research, a company obtained data on the educational level and socio-economic status of 375 housewives, summarized in the contingency table below. Test the null hypothesis that socio-economic status and educational level are independent at the significance level of 0.05. Check the test result using『eStatU』.

   Education Level      Socio-economic status
                        1     2     3     4     5     6    Total
   Elementary          10    14     9     7     3     2      45
   Middle               7    10    25     9     8     3      62
   High                 3     7    13    38    14     8      83
   College              4     4    18    44    18    10      98
   Above                1     2     3     6    62    13      87
   Total               25    37    68   104   105    36     375

11.6 Government agencies surveyed workers who wanted to get a job and classified 532 respondents by gender and technical level as follows. Do these data provide sufficient evidence that technical level and gender are related? Use α = 0.05 and check your test result using『eStatU』.

   Technical Level        Male   Female   Total
   Skilled worker          106      6      112
   Semi-skilled worker      93     39      132
   Unskilled worker        215     73      288
   Total                   414    118      532

11.7 A guidance counselor surveyed 110 high school students on reading and watching TV. The following table was obtained by classifying each item as high and low. At the significance level of 0.05, are these data sufficient to claim that reading and TV viewing are related? Check the test result using『eStatU』.
                      Reading High   Reading Low   Total
   TV viewing High         40             18         58
   TV viewing Low          41             11         52
   Total                   81             29        110

11.8 165 defective products produced in two plants operated by the same company were classified by plant and by whether the defect was due to low occupational awareness or low quality raw materials. Test the null hypothesis that the cause of the defect and the production plant are independent at the significance level of 0.05. Check the test result using『eStatU』.

   Cause of defect              Plant A   Plant B   Total
   Low occupational awareness      21        72        93
   Low quality raw materials       46        26        72
   Total                           67        98       165

11.9 To evaluate the effectiveness of typing training, 110 documents by company employees who received typing training and 120 documents by employees who did not were evaluated. Each document is classified as good, normal or low. The following table classifies all 230 documents by evaluation and training. Test the null hypothesis that typing training and document evaluation are independent. Use α = 0.05 and check your test result using『eStatU』.

   Training              Good   Normal   Low   Total
   Typing training        48      39      23    110
   No typing training     12      36      72    120
   Total                  60      75      95    230

11.10 A company with three large plants applied different working conditions and wage systems to the three plants and, six months later, asked the workers about their satisfaction with the new system. 250 workers were randomly selected from each of the three plants and the survey results are as follows. Is there sufficient evidence that workers at the plants have different satisfaction levels? Test at the significance level of 0.05. Check the test result using『eStatU』.
   Job Satisfaction   Plant 1   Plant 2   Plant 3   Total
   Very satisfied       135       145       140      420
   Satisfied             70        80        75      225
   Average               25        15        20       60
   Not satisfied         20        10        15       45
   Total                250       250       250      750

Multiple Choice Exercise

11.1 Which test do you need to investigate whether sample data follow a theoretical distribution?
   ① Goodness of fit test   ② Independence test   ③ Test for population proportion   ④ Test for two population means

11.2 In order to test whether sample data of a continuous variable follow a given distribution, what preparatory work is first necessary for the goodness of fit test?
   ① log transformation   ② frequency distribution of intervals   ③ [0,1] transformation   ④ frequency distribution

11.3 How do you test the hypothesis that two categorical variables of a sample from one population have no relation?
   ① Goodness of fit test   ② Independence test   ③ Test for population proportion   ④ Test for homogeneity

11.4 How do you test the hypothesis that samples from two categorical populations have the same distribution?
   ① Goodness of fit test   ② Independence test   ③ Test for population proportion   ④ Test for homogeneity

11.5 Which of the following statistical distributions is used for testing a contingency table?
   ① t distribution   ② χ² distribution   ③ binomial distribution   ④ normal distribution

(Answers) 11.1 ①, 11.2 ②, 11.3 ②, 11.4 ④, 11.5 ②

12 Correlation and Regression Analysis

SECTIONS
12.1 Correlation Analysis
12.2 Simple Linear Regression Analysis
   12.2.1 Simple Linear Regression Model
   12.2.2 Estimation of Regression Coefficient
   12.2.3 Goodness of Fit for Regression Line
   12.2.4 Analysis of Variance for Regression
   12.2.5 Inference for Regression
   12.2.6 Residual Analysis
12.3 Multiple Linear Regression Analysis
   12.3.1 Multiple Linear Regression Model
   12.3.2 Estimation of Regression Coefficient
   12.3.3 Goodness of Fit for Regression and Analysis of Variance
   12.3.4 Inference for Multiple Linear Regression

CHAPTER OBJECTIVES
From Chapter 7 to Chapter 10, we discussed estimation and testing hypotheses about parameters, such as the population mean and variance, of a single variable. This chapter describes correlation analysis for two or more variables. If variables are related to each other, regression analysis is then described to see how this association can be used. Simple linear regression analysis and multiple linear regression analysis are discussed.

12.1 Correlation Analysis

• The easiest way to observe the relation of two variables is to draw a scatter plot with one variable on the X axis and the other on the Y axis. If the two variables are related, the data gather together in a certain pattern; if not, the data are scattered around. Correlation analysis is a method of analyzing the degree of linear relationship between two variables: it investigates how linearly one variable increases or decreases as the other variable increases.

Example 12.1.1 Based on a survey of advertising costs and sales for 10 companies that make the same product, we obtained the data in Table 12.1.1. Using『eStat』, draw a scatter plot for these data and investigate the relation of the two variables.
Table 12.1.1 Advertising costs and sales (unit: 1 million USD)

   Company         1    2    3    4    5    6    7    8    9   10
   Advertise (X)   4    6    6    8    8    9    9   10   12   12
   Sales (Y)      39   42   45   47   50   50   52   55   57   60

⇨ eBook ⇨ EX120101_SalesByAdvertise.csv

Answer
w Using『eStat』, enter the data as shown in <Figure 12.1.1>. If you select Sales as 'Y Var' and Advertise as 'by X Var' in the variable selection box that appears when you click the scatter plot icon on the main menu, the scatter plot will appear as shown in <Figure 12.1.2>. As we might expect, the scatter plot shows that the more is invested in advertising, the more sales increase; moreover, the form of the increase is linear.

<Figure 12.1.1> Data input in『eStat』
<Figure 12.1.2> Scatter plot of sales by advertise

• The relation between two variables can be roughly investigated using a scatter plot like this. However, a numerical measure of the extent of the relation can be used together with it to provide a more accurate and objective view of the relation between the two variables. One such measure is the covariance. The population covariance of the two variables X and Y is denoted σ_XY. When the random samples of the two variables are given as (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), the estimate of the population covariance, called the sample covariance s_xy, is defined as follows:

   s_xy = (1/(n−1)) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)

• In the above equation, x̄ and ȳ represent the sample means of X and Y respectively. In order to understand the meaning of the covariance, consider a case where the value of Y increases as X increases. If the value of X is larger than x̄, then the value of Y also tends to be larger than ȳ, so (xᵢ − x̄)(yᵢ − ȳ) has a positive value. Also, if the value of X is smaller than x̄, the value of Y tends to be smaller than ȳ, and (xᵢ − x̄)(yᵢ − ȳ) again has a positive value. Therefore, their average, which is the covariance, tends to be positive. Conversely, if the value of the covariance is negative, the value of one variable decreases as the value of the other increases.
Hence, by calculating the covariance, we can see the direction of the relation between two variables: a positive correlation (increasing the value of one variable goes with an increase in the other) or a negative correlation (increasing the value of one variable goes with a decrease in the other).

• The covariance itself is a useful measure but, since it depends on the units of X and Y, its size is difficult to interpret and inconvenient to compare across data sets. The standardized covariance, obtained by dividing the covariance by the standard deviations of X and Y, σ_X and σ_Y, is a measure unrelated to the type of variable or its unit. It is called the population correlation coefficient and denoted ρ:

   Population Correlation Coefficient: ρ = σ_XY / (σ_X σ_Y)

• <Figure 12.1.3> shows different scatter plots and the corresponding values of the correlation coefficient.

<Figure 12.1.3> Different scatter plots and their correlation coefficients

• The correlation coefficient is interpreted as follows:
   1) ρ has a value between −1 and +1. A value closer to +1 indicates a strong positive linear relation and a value closer to −1 indicates a strong negative linear relation. The linear relationship weakens as the value of ρ approaches 0.
   2) If all corresponding values of X and Y are located on a straight line, ρ is either +1 (if the slope of the line is positive) or −1 (if the slope is negative).
   3) The correlation coefficient measures only the linear relationship between two variables. Therefore, when ρ = 0 there is no linear relationship between the two variables, but there may still be a relationship of a different kind (see scatter plot (f) in <Figure 12.1.3>).

•『eStatU』provides a simulation of scatter plot shapes for different correlations as in <Figure 12.1.4>.
<Figure 12.1.4> Simulation of the correlation coefficient in『eStatU』

• An estimate of the population correlation coefficient using samples of the two variables is called the sample correlation coefficient and denoted r. The formula for r is obtained by replacing each parameter in the formula for ρ with its estimate:

   r = s_xy / (s_x s_y)

where s_xy is the sample covariance and s_x, s_y are the sample standard deviations of X and Y. Writing S_xx = Σ(xᵢ − x̄)², S_yy = Σ(yᵢ − ȳ)² and S_xy = Σ(xᵢ − x̄)(yᵢ − ȳ), the formula can also be written as follows:

   r = S_xy / √(S_xx · S_yy)

Example 12.1.2 Find the sample covariance and correlation coefficient for the advertising costs and sales of [Example 12.1.1].

Answer
w To calculate the sample covariance and correlation coefficient, it is convenient to make the following table. This table can also be used for the calculations in regression analysis.

Table 12.1.2 A table for calculating the covariance

      i     xᵢ    yᵢ    xᵢ²     yᵢ²    xᵢyᵢ
      1      4    39     16    1521    156
      2      6    42     36    1764    252
      3      6    45     36    2025    270
      4      8    47     64    2209    376
      5      8    50     64    2500    400
      6      9    50     81    2500    450
      7      9    52     81    2704    468
      8     10    55    100    3025    550
      9     12    57    144    3249    684
     10     12    60    144    3600    720
   Sum      84   497    766   25097   4326
   Mean    8.4  49.7

w The terms needed to calculate the covariance and correlation coefficient are as follows:

   S_xx = 766 − 84²/10 = 60.4
   S_yy = 25097 − 497²/10 = 396.1
   S_xy = 4326 − (84 × 497)/10 = 151.2

These represent the sum of squares of x, the sum of squares of y, and the sum of cross products of x and y about their means. Hence, the covariance and correlation coefficient are as follows:

   s_xy = S_xy/(n−1) = 151.2/9 = 16.8
   r = 151.2 / √(60.4 × 396.1) ≈ 0.978

This value of the correlation coefficient is consistent with the scatter plot, which shows a strong positive correlation between the two variables.

• The sample correlation coefficient can be used for testing hypotheses about the population correlation coefficient. The main interest is H₀: ρ = 0, which tests the existence of a linear correlation.
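The hand computation of Example 12.1.2, and the t statistic for testing H₀: ρ = 0, can be checked with a few lines of numpy/scipy (a sketch; this tooling is my assumption and is not part of the『eStat』workflow):

```python
# Sample covariance, correlation and the t statistic for H0: rho = 0,
# using the advertising/sales data of Table 12.1.1.
import numpy as np
from scipy import stats

x = np.array([4, 6, 6, 8, 8, 9, 9, 10, 12, 12], dtype=float)        # advertise
y = np.array([39, 42, 45, 47, 50, 50, 52, 55, 57, 60], dtype=float)  # sales
n = len(x)

Sxx = np.sum((x - x.mean())**2)                 # 60.4
Syy = np.sum((y - y.mean())**2)                 # 396.1
Sxy = np.sum((x - x.mean()) * (y - y.mean()))   # 151.2

s_xy = Sxy / (n - 1)                            # sample covariance, 16.8
r = Sxy / np.sqrt(Sxx * Syy)                    # sample correlation, about 0.978

# t statistic for H0: rho = 0, and the two-sided p-value from pearsonr
t0 = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
r_check, p_value = stats.pearsonr(x, y)

print(round(s_xy, 1), round(r, 3), round(t0, 2))
```

With the unrounded r, the t statistic comes out near 13.1; a hand computation that rounds r to 0.978 first gives a slightly larger value. Either way it is far above the 5% critical value.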
This test can be done using the t distribution as follows:

Testing the population correlation coefficient ρ
   Null Hypothesis: H₀: ρ = 0
   Test Statistic: t₀ = r√(n−2) / √(1−r²), which follows a t distribution with n−2 degrees of freedom
   Rejection Region of H₀:
   1) H₁: ρ > 0 : Reject H₀ if t₀ > t_{n−2; α}
   2) H₁: ρ < 0 : Reject H₀ if t₀ < −t_{n−2; α}
   3) H₁: ρ ≠ 0 : Reject H₀ if |t₀| > t_{n−2; α/2}

Example 12.1.3 In Example 12.1.2, test the hypothesis that the population correlation coefficient between advertising cost and sales amount is zero at the significance level of 0.05. (Since the sample correlation coefficient, 0.978, is close to 1, this test would hardly be required in practice.)

Answer
w The value of the test statistic is as follows:

   t₀ = 0.978 × √(10−2) / √(1 − 0.978²) ≈ 13.26

Since it is greater than t_{8; 0.025} = 2.306, H₀ should be rejected.
w With the variables selected in『eStat』as in <Figure 12.1.1>, click the regression icon on the main menu; the scatter plot with a regression line will appear. Clicking the [Correlation and Regression] button below this graph shows the output of <Figure 12.1.5> in the Log Area, together with the result of the regression analysis. The values in this output differ slightly from the textbook because of rounding to a limited number of decimal digits. The same conclusion is obtained: the p-value for the correlation test, 0.0001, is less than the significance level of 0.05 and, therefore, the null hypothesis is rejected.

<Figure 12.1.5> Testing hypothesis of correlation using『eStat』

[Practice 12.1.1] A professor of statistics argues that a student's final exam score can be predicted from his/her mid-term score. Ten students were randomly selected and their mid-term and final exam scores are as follows:

   id            1    2    3    4    5    6    7    8    9   10
   Mid-term X   92   65   75   83   95   87   96   53   77   68
   Final Y      87   71   75   84   93   82   98   42   82   60

⇨ eBook ⇨ PR120101_MidtermFinal.csv

1) Draw a scatter plot of these data with the mid-term score on the X axis and the final score on the Y axis. What do you think the relationship between mid-term and final scores is?
2) Find the sample correlation coefficient and test the hypothesis that the population correlation coefficient is zero at the significance level of 0.05.

• If there are three or more variables in the analysis, their relationships can be viewed using scatter plots for each pair of variables, and the sample correlation coefficient of each pair can be obtained. To make it easier to see the relationships between the variables, the correlations can be arranged in a matrix format called a correlation matrix.『eStat』shows the correlation matrix together with the significance test for each of its values; the test result shows the t value and p-value.

Example 12.1.4 Draw a scatter plot matrix and a correlation coefficient matrix using the four variables of the iris data saved in the following location of『eStat』.
⇨ eBook ⇨ EX120104_Iris.csv
The variables are Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. Test the hypotheses that the correlation coefficients are equal to zero.

Answer
w From『eStat』, load the data and click the 'Regression' icon. When the variable selection box appears, select the four variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width; the scatter plot matrix will then be shown as <Figure 12.1.6>.
w It is observed that Sepal.Length and Petal.Length are related, and that Petal.Length and Petal.Width are related.

<Figure 12.1.6> Scatter plot matrix using『eStat』

w When selecting the [Regression Analysis] button from the options below the graph, the basic statistics and the correlation coefficient matrix appear in the Log Area with the test results, as in <Figure 12.1.7>. It can be seen that all correlations are significant except the correlation between Sepal.Length and Sepal.Width.
<Figure 12.1.7> Descriptive statistics and correlation matrix using『eStat』

[Practice 12.1.2] A health scientist randomly selected 20 people to determine the effects of smoking and obesity on physical strength and examined the average daily smoking rate (number/day), the ratio of weight by height (kg/m), and the time they could exercise at a certain intensity (in hours). Draw a scatter plot matrix and test whether there are correlations among smoking, obesity and exercise time.

   Smoking rate:      24   0  25   0   5  18  20   0  15   6   0  15  18   5  10   0  12   0  15  12
   Weight by height:  53  47  50  52  40  44  46  45  56  40  45  47  41  38  51  43  38  36  43  45
   Exercise time:     11  22   7  26  22  15   9  23  15  24  27  14  13  21  20  24  15  24  12  16

⇨ eBook ⇨ PR120102_SmokingObesityExercis.csv

12.2 Simple Linear Regression Analysis

Definition: Regression analysis is a statistical method that first establishes a reasonable mathematical model of the relationships between variables, estimates the model using measured values of the variables, and then uses the estimated model to describe the relationship between the variables, or to apply it in analysis such as forecasting.

• For example, a mathematical model of the relationship between sales (Y) and advertising costs (X) would not only explain the relationship between sales and advertising costs, but would also make it possible to predict the amount of sales resulting from a given investment.

• As such, regression analysis is intended to investigate and predict the degree and the shape of the relation between variables.
In regression analysis, the mathematical model of the relation between variables is called a regression equation, and the variable affected by the other related variables is called the dependent variable. The dependent variable is the variable we would like to describe, and it is usually observed in response to the other variables, so it is also called the response variable. The variables that affect the dependent variable are called independent variables. An independent variable is also referred to as an explanatory variable, because it is used to describe the dependent variable. In the previous example, if the objective is to analyze the change in sales resulting from increases and decreases in advertising costs, the sales amount is the dependent variable and the advertising cost is an independent variable. If the number of independent variables included in the regression equation is one, it is called simple linear regression; if there are two or more, it is called multiple linear regression.

12.2.1 Simple Linear Regression Model

• Simple linear regression analysis has only one independent variable and the regression equation is written as follows:

   y = β₀ + β₁x

• In other words, the regression equation is represented by a linear equation of the independent variable, where β₀ and β₁ are unknown parameters representing the intercept and the slope respectively; they are called the regression coefficients. The above equation represents the unknown linear relationship between Y and X in the population and is therefore referred to as the population regression equation. In order to estimate the regression coefficients β₀ and β₁, observations of the dependent and independent variables, i.e., samples, are required. In general, these observations will not all lie on a single line.
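The point that sample observations scatter around the population regression line can be illustrated by simulating the model yᵢ = β₀ + β₁xᵢ + εᵢ; the parameter values below are made-up illustrations, not quantities from the text.

```python
# Simulating a population regression model: observations deviate from the
# straight line exactly by the random error term.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 28.7, 2.5, 1.5       # hypothetical parameters
x = np.array([4, 6, 6, 8, 8, 9, 9, 10, 12, 12], dtype=float)

eps = rng.normal(0.0, sigma, size=x.size)  # errors: mean 0, variance sigma^2
y = beta0 + beta1 * x + eps                # simulated observations

# The deviation of each observation from the population line is exactly eps.
deviations = y - (beta0 + beta1 * x)
print(np.allclose(deviations, eps))
```

Re-running with a different seed gives a different scatter of points around the same line, which is exactly the situation the estimation problem of the next section addresses.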
This is because, even if Y and X have an exact linear relation, there may be measurement error in the observations, or there may not be an exact linear relationship between Y and X at all. Therefore, the regression equation can be written with these errors taken into account as follows:

   yᵢ = β₀ + β₁xᵢ + εᵢ,  i = 1, 2, …, n

where i is the subscript representing the observation, and εᵢ is a random variable indicating the error, with mean zero and variance σ², the errors being independent of each other. The error εᵢ indicates how far the observation lies from the population regression equation. The above equation includes the unknown population parameters β₀, β₁ and σ², and is therefore referred to as the population regression model.

• If b₀ and b₁ are the regression coefficients estimated from samples, the fitted regression equation, referred to as the sample regression equation, can be written as follows:

   ŷ = b₀ + b₁x

In this expression, ŷᵢ represents the value of y at xᵢ predicted by the fitted regression equation. These predicted values cannot all match the actual observed values yᵢ; the differences between the two are called residuals and denoted eᵢ:

   Residuals: eᵢ = yᵢ − ŷᵢ,  i = 1, 2, …, n

• Regression analysis makes some assumptions about the unobservable error εᵢ. Since the residuals calculated from the sample values have characteristics similar to those of εᵢ, they are used to investigate the validity of these assumptions (refer to Section 12.2.6 for residual analysis).

12.2.2 Estimation of Regression Coefficient

• When sample data (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) are given, a straight line representing them can be drawn in many ways. Since one of the main objectives of regression analysis is prediction, we would like the estimated regression line to make the residuals, the errors that occur when predicting the values of Y, as small as possible. However, it is not possible to minimize the residuals at all points simultaneously, and the line should be chosen to make the residuals small 'in total'.
The most widely used of such methods minimizes the total sum of squared residuals; it is called the method of least squares.

Definition: Method of Least Squares
A method of estimating the regression coefficients so that the total sum of the squared errors occurring at each observation is minimized, i.e., find β₀ and β₁ which minimize

   Σᵢ (yᵢ − β₀ − β₁xᵢ)²

• To obtain the estimates by the least squares method, the sum of squares above is differentiated partially with respect to β₀ and β₁, and the derivatives are set equal to zero. Writing the solutions as b₀ and b₁, the resulting equations can be written as follows:

   n·b₀ + b₁·Σxᵢ = Σyᵢ
   b₀·Σxᵢ + b₁·Σxᵢ² = Σxᵢyᵢ

• The above expressions are called the normal equations. The solutions b₀ and b₁ of the normal equations are called the least squares estimators of β₀ and β₁:

Least Squares Estimators of β₀ and β₁
   b₁ = S_xy / S_xx = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
   b₀ = ȳ − b₁x̄

If we divide both the numerator and the denominator of b₁ by n−1, b₁ can be written as s_xy/s_x². Since the correlation coefficient is r = s_xy/(s_x s_y), so that s_xy = r·s_x·s_y, the slope can also be calculated using the correlation coefficient as follows:

   b₁ = r · (s_y / s_x)

Example 12.2.1 In [Example 12.1.1], find the least squares estimates of the slope and intercept when the sales amount is the dependent variable and the advertising cost is the independent variable. Predict the amount of sales when 10 (million USD) is spent on advertising.

Answer
w The calculations required to obtain the intercept and slope have already been made in [Example 12.1.2]. Using them, the slope and intercept are as follows:

   b₁ = S_xy / S_xx = 151.2 / 60.4 = 2.5033
   b₀ = ȳ − b₁x̄ = 49.7 − 2.5033 × 8.4 = 28.672

Therefore, the fitted regression line is ŷ = 28.672 + 2.5033x.
w The meaning of the slope value 2.5033 is that, if the advertising cost increases by one unit (i.e., one million USD), sales increase by about 2.5 million USD.
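The least squares estimates of Example 12.2.1 can be reproduced directly from the closed-form formulas with numpy (a sketch;『eStat』is not needed for this check):

```python
# Least squares fit of sales on advertising cost: b1 = Sxy/Sxx, b0 = ybar - b1*xbar.
import numpy as np

x = np.array([4, 6, 6, 8, 8, 9, 9, 10, 12, 12], dtype=float)        # advertise
y = np.array([39, 42, 45, 47, 50, 50, 52, 55, 57, 60], dtype=float)  # sales

Sxx = np.sum((x - x.mean())**2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = Sxy / Sxx                  # slope, about 2.5033
b0 = y.mean() - b1 * x.mean()   # intercept, about 28.672

y_hat_10 = b0 + b1 * 10         # predicted sales at advertising cost 10
print(round(b1, 4), round(b0, 3), round(y_hat_10, 3))
```

The prediction at x = 10 comes out to about 53.705, matching the hand computation in the example.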
<Figure 12.2.1> Simple linear regression using『eStat』

w The prediction of the sales amount of a company with an advertising cost of 10 can be obtained from the fitted sample regression line as follows:

   ŷ = 28.672 + 2.5033 × 10 = 53.705

In other words, sales of 53.705 million USD are expected. That is not to say that every company with advertising costs of 10 million USD has sales of 53.705 million USD, but that the average of their sales amounts is about that much; individual companies may differ.

[Practice 12.2.1] Using the data of [Practice 12.1.1] for the mid-term and final exam scores, find the least squares estimates of the slope and intercept when the final exam score is the dependent variable and the mid-term score is the independent variable. Predict the final exam score for a mid-term score of 80.

12.2.3 Goodness of Fit for Regression Line

• After estimating the regression line, it should be investigated how valid the regression line is. Since the objective of regression analysis is to describe the dependent variable as a function of the independent variable, it is necessary to find out how much is actually explained. The residual standard error and the coefficient of determination are used for such validation. The residual standard error is a measure of the extent to which the observations are scattered around the estimated line. First, the sample variance of the residuals is defined as follows:

   s² = Σᵢ eᵢ² / (n−2) = Σᵢ (yᵢ − ŷᵢ)² / (n−2)

• The residual standard error s is defined as the square root of s². The s² is an estimate of σ², the extent to which the observations are spread around the population regression line. A small value of s² (or s) indicates that the observations are close to the estimated regression line, which in turn means the regression line represents the relationship between the two variables well. However, although a smaller residual standard error is better, it is not clear how small it should be.
In addition, the size of s depends on the unit of y. To eliminate this shortcoming, a relative measure called the coefficient of determination is defined. The coefficient of determination is the ratio of the variation described by the regression line to the total variation of the observations y, so it is a relative measure that can be used regardless of the type and unit of the variable.

• As in the analysis of variance in Chapter 9, the following partitions of the sum of squares and degrees of freedom hold in regression analysis:

Partition of the sum of squares and degrees of freedom
   Sum of squares: Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)², i.e., SST = SSR + SSE
   Degrees of freedom: n−1 = 1 + (n−2)

• The three sums of squares are described as follows:
   Total Sum of Squares (SST): the sum of squares indicating the total variation of the observed values yᵢ. It has n−1 degrees of freedom and, divided by its degrees of freedom, becomes the sample variance of y.
   Error Sum of Squares (SSE): the sum of squares of the residuals, representing the unexplained part of the total variation of y. Since its calculation requires the estimation of the two parameters β₀ and β₁, SSE has n−2 degrees of freedom. This is why, in the calculation of the sample variance of the residuals s², the divisor was n−2.
   Regression Sum of Squares (SSR): the sum of squares indicating the variation explained by the regression line among the total variation of y. This sum of squares has one degree of freedom.

• If the estimated regression equation fully explained the variation of all samples (i.e., if all observations were on the sample regression line), the unexplained variation SSE would be zero. Thus, if the portion of SSE in the total sum of squares SST is small, or equivalently if the portion of SSR is large, the estimated regression model is more suitable.
Therefore, the ratio of SSR to the total variation SST, called the coefficient of determination R², is defined as a measure of the suitability of the regression line:

   R² = SSR / SST = 1 − SSE / SST

The value of the coefficient of determination is always between 0 and 1 and, the closer it is to 1, the more concentrated the samples are around the regression line, which means that the estimated regression line explains the observations well.

Example 12.2.2 Calculate the residual standard error and the coefficient of determination for the data on advertising costs and sales.

Answer
w To obtain the residual standard error and the coefficient of determination, it is convenient to make the following Table 12.2.1. Here, the estimated value ŷᵢ of the sales for each value of xᵢ is obtained from the fitted regression line.

Table 12.2.1 Calculations for the residual standard error and coefficient of determination

      i     xᵢ    yᵢ      ŷᵢ      (yᵢ−ȳ)²   (ŷᵢ−ȳ)²   (yᵢ−ŷᵢ)²
      1      4    39    38.639    114.49    122.346     0.130
      2      6    42    43.645     59.29     36.663     2.706
      3      6    45    43.645     22.09     36.663     1.836
      4      8    47    48.651      7.29      1.100     2.726
      5      8    50    48.651      0.09      1.100     1.820
      6      9    50    51.154      0.09      2.114     1.332
      7      9    52    51.154      5.29      2.114     0.716
      8     10    55    53.657     28.09     15.658     1.804
      9     12    57    58.663     53.29     80.335     2.766
     10     12    60    58.663    106.09     80.335     1.788
   Sum      84   497   496.522    396.1     378.429    17.622
   Mean    8.4  49.7

w From Table 12.2.1, SST = 396.1, SSR = 378.429 and SSE = 17.622. Here the relationship SST = SSR + SSE does not hold exactly because of rounding in the number of digits carried in the calculation. The sample variance of the residuals is as follows:

   s² = SSE/(n−2) = 17.622/8 = 2.203

Hence, the residual standard error is s = √2.203 ≈ 1.484. The coefficient of determination is as follows:

   R² = SSR/SST = 378.429/396.1 ≈ 0.956

This means that 95.6% of the total variation of the 10 observed sales amounts can be explained by the simple linear regression model with advertising cost as the explanatory variable, so this regression line is quite useful.
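The same decomposition can be recomputed without intermediate rounding (a numpy sketch; the values differ slightly from the hand table, whose fitted values were rounded to three decimals):

```python
# Sums of squares, coefficient of determination and residual standard error
# for the advertising/sales regression, with full floating-point precision.
import numpy as np

x = np.array([4, 6, 6, 8, 8, 9, 9, 10, 12, 12], dtype=float)
y = np.array([39, 42, 45, 47, 50, 50, 52, 55, 57, 60], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean())**2)      # total sum of squares, 396.1
SSR = np.sum((y_hat - y.mean())**2)  # regression sum of squares, about 378.50
SSE = np.sum((y - y_hat)**2)         # error sum of squares, about 17.60

R2 = SSR / SST                       # coefficient of determination, about 0.956
s = np.sqrt(SSE / (n - 2))           # residual standard error, about 1.48
print(round(R2, 3), round(s, 3))
```

Without rounding, SST = SSR + SSE holds exactly, which is a useful sanity check on any hand computation.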
w Click the [Correlation and Regression] button in the options below the graph of <Figure 12.2.1> to see the coefficient of determination and the estimation errors shown in <Figure 12.2.2>.

<Figure 12.2.2> Correlation and descriptive statistics

[Practice 12.2.2] Using the data of [Practice 12.1.1] for the mid-term and final exam scores, calculate the residual standard error and the coefficient of determination.

12.2.4 Analysis of Variance for Regression

• If we divide the three sums of squares obtained above by their degrees of freedom, each becomes a kind of variance. For example, dividing SST by its degrees of freedom n−1 gives the sample variance of the observed values y₁, y₂, …, yₙ, and dividing SSE by its degrees of freedom n−2 gives s², which is an estimate of the error variance σ². For this reason, addressing the problems associated with the regression by means of the partition of the sum of squares is called the ANOVA (analysis of variance) of regression. The information required for the ANOVA, such as the sums of squares and degrees of freedom, is compiled in an ANOVA table as in Table 12.2.2.

Table 12.2.2 Analysis of variance table for simple linear regression

   Factor       Sum of squares   Degrees of freedom   Mean squares        F value
   Regression        SSR                1             MSR = SSR/1         F₀ = MSR/MSE
   Error             SSE               n−2            MSE = SSE/(n−2)
   Total             SST               n−1

• A sum of squares divided by its degrees of freedom is referred to as a mean square; Table 12.2.2 defines the regression mean square (MSR) and the error mean square (MSE) respectively. As the expression indicates, MSE is the same statistic as s², the estimate of σ².

• The F value given in the last column is used for testing the hypothesis H₀: β₁ = 0 against H₁: β₁ ≠ 0. If β₁ is not 0, the F value can be expected to be large, because the assumed regression line is valid and the variation of y is explained in large part by the regression line. Therefore, we may conversely decide that β₁ is not zero if the calculated ratio F₀ is large enough.
If the assumptions about the error terms mentioned in the population regression model are valid and the error terms follow a normal distribution, then the distribution of the F₀ value, when the null hypothesis H₀: β₁ = 0 is true, follows an F distribution with 1 and n-2 degrees of freedom. Therefore, if F₀ > F(1, n-2; α), we can reject H₀.

F test for simple linear regression:
  Hypothesis: H₀: β₁ = 0 versus H₁: β₁ ≠ 0
  Decision rule: If F₀ > F(1, n-2; α), then reject H₀
  (In 『eStat』, the p-value for this test is calculated and the decision can be made using this p-value. That is, if the p-value is less than the significance level, the null hypothesis is rejected.)

Example 12.2.3  Prepare an ANOVA table for the example of advertising cost and test it using the 5% significance level.

Answer
w Using the sums of squares calculated in [Example 12.2.2], the ANOVA table is prepared as follows:

  Factor      Sum of squares   Degrees of freedom   Mean squares              F value
  Regression  378.42           1                    MSR = 378.42/1 = 378.42   F₀ = 378.42/2.20 = 172.0
  Error        17.62           10-2                 MSE = 17.62/8 = 2.20
  Total       396.04           10-1

w Since the calculated F value of 172.0 is much greater than F(1, 8; 0.05) = 5.32, we reject the null hypothesis at the significance level α = 0.05.
w Click the [Correlation and Regression] button in the options window below the graph of <Figure 12.2.1> to show the result of the ANOVA as in <Figure 12.2.3>.

<Figure 12.2.3> Regression analysis of variance using 『eStat』

[Practice 12.2.3] Using the data in [Practice 12.1.1] for the mid-term and final exam scores, prepare an ANOVA table and test it using the 5% significance level.

12.2.5 Inference for Regression

Ÿ One assumption of the error term ε in the population regression model is that it follows a normal distribution with mean zero and variance σ². Under this assumption, the regression coefficients and other parameters can be estimated and tested.
Note that, under the assumption above, the response Yᵢ of the regression model follows a normal distribution with mean β₀ + β₁xᵢ and variance σ².

1) Inference for the parameter β₁

Ÿ The parameter β₁, which is the slope of the regression line, indicates the existence and extent of a linear relationship between the dependent and the independent variables. The inference for β₁ can be summarized as follows. In particular, the test of the hypothesis H₀: β₁ = 0 is used to decide whether the independent variable describes the dependent variable significantly. The F test for this hypothesis described in the ANOVA of regression is theoretically the same as the t test below. 『eStat』 calculates the p-value under the null hypothesis; if this p-value is less than the significance level, the null hypothesis is rejected and the regression line is said to be significant.

Inference for the parameter β₁
  Point estimate: b₁, with b₁ ~ N(β₁, σ²/Sxx)
  Standard error of the estimate: SE(b₁) = s/√Sxx
  Confidence interval of β₁: b₁ ± t(n-2; α/2)·SE(b₁)
  Testing hypothesis:
    Null hypothesis: H₀: β₁ = β₁₀
    Test statistic: t₀ = (b₁ - β₁₀)/SE(b₁)
    Rejection region:
      if H₁: β₁ > β₁₀, reject H₀ when t₀ > t(n-2; α)
      if H₁: β₁ < β₁₀, reject H₀ when t₀ < -t(n-2; α)
      if H₁: β₁ ≠ β₁₀, reject H₀ when |t₀| > t(n-2; α/2)

2) Inference for the parameter β₀

Ÿ The inference for the parameter β₀, which is the intercept of the regression line, can be summarized as below. The parameter β₀ is not of much interest in most analyses, because it represents the average value of the response variable when the independent variable is 0.

Inference for the parameter β₀
  Point estimate: b₀, with b₀ ~ N(β₀, σ²(1/n + x̄²/Sxx))
  Standard error of the estimate: SE(b₀) = s·√(1/n + x̄²/Sxx)
  Confidence interval of β₀: b₀ ± t(n-2; α/2)·SE(b₀)
  Testing hypothesis:
    Null hypothesis: H₀: β₀ = β₀₀
    Test statistic: t₀ = (b₀ - β₀₀)/SE(b₀)
    Rejection region:
      if H₁: β₀ > β₀₀, reject H₀ when t₀ > t(n-2; α)
      if H₁: β₀ < β₀₀, reject H₀ when t₀ < -t(n-2; α)
      if H₁: β₀ ≠ β₀₀, reject H₀ when |t₀| > t(n-2; α/2)

3) Inference for the average of Y

Ÿ At any point x₀ of X, the dependent variable Y has an average value μ(x₀) = β₀ + β₁x₀. Estimation of μ(x₀) is also considered important, because it means predicting the mean value of Y.
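The t inference for the slope can be sketched as follows, again with `scipy.stats` replacing 『eStat』, on the advertising data. Note that the resulting t statistic is the square root of the ANOVA F value, illustrating that the two tests agree.

```python
# Sketch of the t inference for the slope beta1 (advertising data).
from scipy.stats import t

x = [4, 6, 6, 8, 8, 9, 9, 10, 12, 12]
y = [39, 42, 45, 47, 50, 50, 52, 55, 57, 60]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx
SSE = sum((yi - ybar) ** 2 for yi in y) - b1 * Sxy
s = (SSE / (n - 2)) ** 0.5            # residual standard error

se_b1 = s / Sxx ** 0.5                # standard error of b1
t0 = b1 / se_b1                       # test statistic for H0: beta1 = 0
p_value = 2 * t.sf(abs(t0), n - 2)    # two-sided p-value

tc = t.ppf(0.975, n - 2)              # t(8; 0.025) = 2.306
ci = (b1 - tc * se_b1, b1 + tc * se_b1)
print(round(t0, 1))                   # 13.1, i.e., sqrt of F = 172.0
```

Since the p-value is far below 0.05 (and the confidence interval excludes 0), the slope is significant, as the example concludes.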
Inference for the average value μ(x₀) = β₀ + β₁x₀
  Point estimate: ŷ₀ = b₀ + b₁x₀
  Standard error of the estimate: SE(ŷ₀) = s·√(1/n + (x₀-x̄)²/Sxx)
  Confidence interval of μ(x₀): ŷ₀ ± t(n-2; α/2)·SE(ŷ₀)

Ÿ The confidence interval formula for the mean value depends on the value of x₀ through the standard error of the estimate, so the width of the confidence interval depends on the given x₀. As the formula for the standard error shows, this width is narrowest at x₀ = x̄, and the farther x₀ is from x̄, the wider it becomes. If we calculate the confidence interval for the mean value of Y at each point of X, and then connect the upper and lower limits to each other, we have a confidence band for the regression line above and below the sample regression line.

12.2 Simple Linear Regression Analysis / 157

Example 12.2.4  Let's make inferences about each parameter with the result of a regression analysis of the previous data for the sales amount and advertising costs. Use 『eStat』 to check the test result and confidence band.

Answer
1) Inference for β₁
w The point estimate of β₁ is b₁ = 2.503 and the standard error of b₁ is as follows:
     SE(b₁) = s/√Sxx = 1.483/√60.4 ≈ 0.191
w Hence, the 95% confidence interval of β₁ using t(8; 0.025) = 2.306 is as follows:
     2.503 ± 2.306 × 0.191, i.e., the interval (2.063, 2.944).
w The test statistic for the hypothesis H₀: β₁ = 0, H₁: β₁ ≠ 0 is as follows:
     t₀ = b₁/SE(b₁) = 2.503/0.191 ≈ 13.1
   Since |t₀| > t(8; 0.025) = 2.306, the null hypothesis is rejected at the significance level α = 0.05. (Note that t₀² ≈ 172, the F value of the ANOVA table.) This result of the two-sided test can also be obtained from the confidence interval: since the 95% confidence interval (2.063, 2.944) does not include 0, the null hypothesis can be rejected.

2) Inference for β₀
w The point estimate of β₀ is b₀ = 28.67 and its standard error is as follows:
     SE(b₀) = s·√(1/n + x̄²/Sxx) = 1.483·√(1/10 + 8.4²/60.4) = 1.670
   Since the value of the statistic is t₀ = 28.67/1.670 ≈ 17.2 and |t₀| > t(8; 0.025) = 2.306, this null hypothesis is also rejected at the significance level α = 0.05.

3) Inference for the average value of Y
w In 『eStat』, the standard error of ŷ₀, which is the estimate of μ(x₀), is calculated at each point x₀ of X. For example, the point estimate of μ(8) is ŷ = 28.67 + 2.503×8 ≈ 48.70 and its standard error is 0.475. Hence, the 95% confidence interval of μ(8) is as follows:
     48.70 ± 2.306 × 0.475, i.e., the interval is (47.60, 49.79).
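The standard-error formula for the mean response can be sketched as below for the point x₀ = 8 of the advertising data (SciPy again stands in for 『eStat』). Repeating the calculation over a grid of x₀ values and connecting the limits yields the confidence band of the next section.

```python
# Sketch: standard error and confidence interval for the mean response
# mu(x0) = beta0 + beta1*x0 at x0 = 8 (Example 12.2.4, part 3).
from scipy.stats import t

x = [4, 6, 6, 8, 8, 9, 9, 10, 12, 12]
y = [39, 42, 45, 47, 50, 50, 52, 55, 57, 60]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
s = ((sum((yi - ybar) ** 2 for yi in y) - b1 * Sxy) / (n - 2)) ** 0.5

x0 = 8
fit = b0 + b1 * x0
se = s * (1 / n + (x0 - xbar) ** 2 / Sxx) ** 0.5   # narrowest at x0 = xbar
tc = t.ppf(0.975, n - 2)
lower, upper = fit - tc * se, fit + tc * se
print(round(fit, 2), round(se, 3))                 # 48.7 0.475
```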
We can calculate the confidence interval for other values of x₀ in a similar way. As we discussed, the confidence interval becomes wider as x₀ moves farther from x̄.
w If you select the [Confidence Band] button from the options below the regression graph of <Figure 12.2.1>, you can see the confidence band graph on the scatter plot together with the regression line as in <Figure 12.2.4>. If you click the [Correlation and Regression] button, the inference result for each parameter will appear in the Log Area as shown in <Figure 12.2.5>.

<Figure 12.2.4> Confidence band using 『eStat』
<Figure 12.2.5> Testing hypothesis of regression coefficients

[Practice 12.2.4] Using the data in [Practice 12.1.1] for the mid-term and final exam scores, make inferences about each parameter using 『eStat』 and draw the confidence band.

12.2.6 Residual Analysis

Ÿ The inference for each regression parameter in the previous section is based on assumptions about the error term included in the population regression model. Therefore, the satisfaction of these assumptions is an important precondition for making a valid inference. However, because the error term is unobservable, the residuals, as estimates of the error terms, are used to investigate the validity of these assumptions; this is referred to as residual analysis.
Ÿ First, let's look at the assumptions in the regression model.

Assumptions in the regression model
  1. The assumed model is correct.
  2. The expectation of the error terms is 0.
  3. (Homoscedasticity) The variance of εᵢ is σ², which is the same for all X.
  4. (Independence) The error terms are independent.
  5. (Normality) The error terms εᵢ are normally distributed.

Ÿ Review the references for the meaning of these assumptions. The validity of these assumptions is generally investigated using scatter plots of the residuals.
The following scatter plots are used primarily for each assumption:
  1) Residuals versus predicted values (i.e., eᵢ vs ŷᵢ)
  2) Residuals versus independent variables (i.e., eᵢ vs xᵢ)
  3) Residuals versus the order of observations (i.e., eᵢ vs i)

Ÿ In the above scatter plots, if the residuals show no particular trend around zero and appear random, then each assumption is considered valid.
Ÿ The assumption that the error term follows a normal distribution can be investigated, in case of a large amount of data, by drawing a histogram of the residuals to see if the distribution is similar to the shape of the normal distribution. Another method is to use the quantile-quantile (Q-Q) scatter plot of the residuals. In general, if the Q-Q scatter plot of the residuals forms a straight line, the residuals can be considered normally distributed.
Ÿ Since residuals also depend on the unit of the dependent variable, standardized values of the residuals are used for consistent analysis of the residuals; these are called standardized residuals. Both the scatter plots of the residuals described above and the Q-Q scatter plot are created using the standardized residuals. In particular, if a standardized residual is large in absolute value (commonly, outside ±2.5 or ±3), an anomaly or an outlier can be suspected.

Example 12.2.5  Draw a scatter plot of residuals and a Q-Q scatter plot for the advertising cost example.

Answer
w When you click the [Residual Plot] button from the options below the regression graph of <Figure 12.2.1>, the scatter plot of the standardized residuals versus the predicted values appears as shown in <Figure 12.2.6>. If you click the [Residual Q-Q Plot] button, <Figure 12.2.7> appears. Although the scatter plot of the residuals has no significant pattern, the Q-Q plot deviates much from a straight line, so the normality of the error term is somewhat questionable.
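A simple residual check can be sketched as below for the advertising data. Note that dividing each residual by s is the simplest standardization; many packages (and possibly eStat) use internally studentized residuals instead, so treat this as an illustrative assumption. For the Q-Q plot itself, `scipy.stats.probplot` can supply the normal quantiles.

```python
# Sketch: standardized residuals e_i / s for the advertising data,
# flagging any value outside +-2.5 as a possible outlier.
x = [4, 6, 6, 8, 8, 9, 9, 10, 12, 12]
y = [39, 42, 45, 47, 50, 50, 52, 55, 57, 60]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = (sum(e * e for e in resid) / (n - 2)) ** 0.5
z = [e / s for e in resid]                  # standardized residuals

outliers = [i for i, zi in enumerate(z) if abs(zi) > 2.5]
print(outliers)                             # [] -- no suspected outliers
```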
In such cases, the values of the response variable need to be re-analyzed after taking a logarithmic or square root transformation.

<Figure 12.2.6> Residual plot
<Figure 12.2.7> Residual Q-Q plot

[Practice 12.2.5] Using the data in [Practice 12.1.1] for the mid-term and final exam scores, draw a scatter plot of the residuals and a Q-Q scatter plot.

Ÿ In 『eStatU』, it is possible to do experiments on how much a regression line is affected by an extreme point (<Figure 12.2.8>). A point can be created by clicking the mouse on the screen in the link below. If you create multiple dots, you can see how much the regression line changes each time. You can observe how sensitive the correlation coefficient and the coefficient of determination are as you move a point with the mouse.

<Figure 12.2.8> Simulation experiment of regression analysis at 『eStatU』

12.3 Multiple Linear Regression Analysis

Ÿ For actual applications of regression analysis, multiple regression models with two or more independent variables are used more frequently than the simple linear regression with one independent variable. This is because it is rare for a dependent variable to be sufficiently explained by a single independent variable; in most cases, a dependent variable is related to several independent variables. For example, sales may be significantly affected by advertising costs, as in the simple linear regression example, but also by product quality ratings and by the number and size of the stores that sell the product. The statistical model used to identify the relationship between one dependent variable and several independent variables is called multiple linear regression analysis.
However, simple linear regression and multiple linear regression analysis differ only in the number of independent variables involved; there is no difference in the method of analysis.

12.3.1 Multiple Linear Regression Model

Ÿ In the multiple linear regression model, it is assumed that the dependent variable Y and the k independent variables X₁, X₂, ..., X_k have the following relation:

    Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_k X_k + ε

Ÿ This means that the dependent variable is represented by a linear function of the independent variables plus a random variable ε that represents the error term, as in the simple linear regression model. The assumption on the error terms is the same as in simple linear regression. In the above equation, β₀ is the intercept on the Y axis and βⱼ is the slope with respect to Xⱼ, which indicates the effect of Xⱼ on Y when the other independent variables are held fixed.

Example 12.3.1  When logging trees in forest areas, it is necessary to investigate the amount of timber in those areas. Since it is difficult to measure the volume of a tree directly, we can think of ways to estimate the volume using the diameter and height of a tree, which are relatively easy to measure. The data in Table 12.3.1 are the measured diameter, height and volume of a sample of 15 trees in a region. (The diameter was measured at a point 1.5 meters above the ground.) Draw a scatter plot matrix of this data and consider a regression model for this problem.

Table 12.3.1 Diameter, height and volume of trees

  Diameter(cm)  Height(m)  Volume(m³)
  21.0          21.33      0.291
  21.8          19.81      0.291
  22.3          19.20      0.288
  26.6          21.94      0.464
  27.1          24.68      0.532
  27.4          25.29      0.557
  27.9          20.11      0.441
  27.9          22.86      0.515
  29.7          21.03      0.603
  32.7          22.55      0.628
  32.7          25.90      0.956
  33.7          26.21      0.775
  34.7          21.64      0.727
  35.0          19.50      0.704
  40.6          21.94      1.084

  ⇨ eBook ⇨ EX120301_TreeVolume.csv

Example 12.3.1 Answer
w Load the data saved at the following location of 『eStat』.
⇨ eBook ⇨ EX120301_TreeVolume.csv

w In the variable selection box which appears after selecting the regression icon, select volume as the 'Y variable' and select the diameter and height as the 'by X variable' to display a scatter plot matrix as shown in <Figure 12.3.1>. It can be observed that there is a high correlation between volume and diameter, and that volume and height, and diameter and height, are also somewhat related.

<Figure 12.3.1> Scatter plot matrix
<Figure 12.3.2> Correlation matrix

Example 12.3.1 Answer (continued)
w Since the volume is to be estimated using the diameter and height of the tree, the volume is the dependent variable Y, the diameter and height are the independent variables X₁ and X₂ respectively, and the following regression model can be considered:

    Y = β₀ + β₁X₁ + β₂X₂ + ε

[Practice 12.3.1] A health scientist randomly selected 20 people to determine the effect of smoking and obesity on their physical strength and examined the average daily smoking rate (X₁, number/day), the ratio of weight to height (X₂, kg/m), and the time of continued exercise at a certain intensity (Y, in hours). Draw a scatter plot matrix of this data and consider a regression model for this problem.

  Smoking rate (X₁):              24  0 25  0  5 18 20  0 15  6  0 15 18  5 10  0 12  0 15 12
  Ratio of weight to height (X₂): 53 47 50 52 40 44 46 45 56 40 45 47 41 38 51 43 38 36 43 45
  Time of continued exercise (Y): 11 22  7 26 22 15  9 23 15 24 27 14 13 21 20 24 15 24 12 16

  ⇨ eBook ⇨ PR120301_SmokingObesityExercis.csv

Ÿ In general, matrices and vectors are used to facilitate the expression and calculation of formulas.
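The correlation matrix of <Figure 12.3.2> can be computed in one call; this sketch uses NumPy's `corrcoef` on the tree data of Table 12.3.1 (the text itself produces the figure with 『eStat』).

```python
# Sketch: correlation matrix of diameter, height and volume
# for the tree data of Table 12.3.1.
import numpy as np

diameter = np.array([21.0, 21.8, 22.3, 26.6, 27.1, 27.4, 27.9, 27.9,
                     29.7, 32.7, 32.7, 33.7, 34.7, 35.0, 40.6])
height = np.array([21.33, 19.81, 19.20, 21.94, 24.68, 25.29, 20.11,
                   22.86, 21.03, 22.55, 25.90, 26.21, 21.64, 19.50, 21.94])
volume = np.array([0.291, 0.291, 0.288, 0.464, 0.532, 0.557, 0.441,
                   0.515, 0.603, 0.628, 0.956, 0.775, 0.727, 0.704, 1.084])

R = np.corrcoef([diameter, height, volume])   # 3x3 correlation matrix
print(np.round(R, 3))
```

The entry relating diameter and volume is above 0.9, matching the "high correlation" observed in the scatter plot matrix.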
For example, if there are k independent variables, the population multiple regression model at the observation points i = 1, 2, ..., n can be written compactly as:

    y = Xβ + ε

Here y, X, β and ε are defined as follows:
    y = (y₁, y₂, ..., yₙ)′ is the n×1 vector of observed values of the dependent variable,
    X is the n×(k+1) design matrix whose i-th row is (1, x_{i1}, x_{i2}, ..., x_{ik}),
    β = (β₀, β₁, ..., β_k)′ is the (k+1)×1 vector of regression coefficients,
    ε = (ε₁, ε₂, ..., εₙ)′ is the n×1 vector of error terms.

12.3.2 Estimation of Regression Coefficients

Ÿ In a multiple regression analysis, it is necessary to estimate the regression coefficients β₀, β₁, ..., β_k using samples. In this case, the least squares method, which minimizes the sum of squared errors, is also used. We find β which minimizes the following sum of squared errors:

    S(β) = ε′ε = (y - Xβ)′(y - Xβ)

Ÿ As in simple linear regression, the above sum of squared errors is differentiated with respect to β and then equated to zero; the resulting equation is called the normal equation. The solution of the equation, denoted b and called the least squares estimate of β, satisfies the normal equation:

    X′X b = X′y

Therefore, if the inverse matrix of X′X exists, the least squares estimator of β is:

    b = (X′X)⁻¹X′y

Ÿ (Note: statistical packages use a different computational formula, because the formula above can cause a large amount of numerical error.)
Ÿ If the estimated regression coefficients are b = (b₀, b₁, ..., b_k)′, the estimate of the response variable is:

    ŷᵢ = b₀ + b₁x_{i1} + ⋯ + b_k x_{ik}

The residuals are:

    eᵢ = yᵢ - ŷᵢ,  i = 1, 2, ..., n

Using vector notation, the residual vector can be defined as e = y - Xb.

12.3.3 Goodness of Fit for Regression and Analysis of Variance

Ÿ In order to investigate the validity of the estimated regression line in multiple regression analysis, the residual standard error and the coefficient of determination are also used.
Ÿ In simple linear regression analysis, the computational formulas for these measures were given as functions of the residuals, i.e., the observed values of Y and their predicted values, so they have nothing to do with the number of independent variables.
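The normal-equation formula b = (X′X)⁻¹X′y can be sketched on the tree data, and checked against NumPy's `lstsq`, which solves the same problem with a numerically safer decomposition (the "different formula" the note refers to).

```python
# Sketch: least squares estimate b = (X'X)^{-1} X'y for the tree data,
# compared with numpy.linalg.lstsq on the same design matrix.
import numpy as np

diameter = [21.0, 21.8, 22.3, 26.6, 27.1, 27.4, 27.9, 27.9,
            29.7, 32.7, 32.7, 33.7, 34.7, 35.0, 40.6]
height = [21.33, 19.81, 19.20, 21.94, 24.68, 25.29, 20.11,
          22.86, 21.03, 22.55, 25.90, 26.21, 21.64, 19.50, 21.94]
volume = [0.291, 0.291, 0.288, 0.464, 0.532, 0.557, 0.441,
          0.515, 0.603, 0.628, 0.956, 0.775, 0.727, 0.704, 1.084]

X = np.column_stack([np.ones(15), diameter, height])  # column of 1s first
y = np.array(volume)

b_normal = np.linalg.inv(X.T @ X) @ (X.T @ y)         # textbook formula
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)       # QR/SVD-based solver

print(np.round(b_normal, 3))
```

Both routes give the same coefficients here; the diameter coefficient comes out near the 0.037 quoted later in Example 12.3.2.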
Therefore, the same formulas can be used in multiple linear regression; there is only a difference in the degrees of freedom that each sum of squares has. In multiple linear regression analysis, the residual standard error is defined as follows:

    s = √( SSE / (n-k-1) )

Ÿ The difference from simple linear regression is that the degrees of freedom for the residuals is n-k-1, because k+1 regression coefficients must be estimated in order to calculate the residuals. As in simple linear regression, s² is the same statistic as the residual mean squares (MSE).
Ÿ The coefficient of determination is given by R² = SSR/SST and its interpretation is the same as in simple linear regression. The sums of squares are defined by the same formulas as in simple linear regression, and can be divided with the corresponding degrees of freedom as follows; the analysis of variance table is shown in Table 12.3.2.

    Sums of squares: SST = SSR + SSE
    Degrees of freedom: n-1 = k + (n-k-1)

Table 12.3.2 Analysis of variance table for multiple linear regression analysis

  Source      Sum of squares   Degrees of freedom   Mean squares        F value
  Regression  SSR              k                    MSR = SSR/k         F₀ = MSR/MSE
  Error       SSE              n-k-1                MSE = SSE/(n-k-1)
  Total       SST              n-1

Ÿ The F value in the above ANOVA table is used to test the significance of the regression equation, where the null hypothesis is that none of the independent variables is linearly related to the dependent variable:

    H₀: β₁ = β₂ = ⋯ = β_k = 0
    H₁: At least one of the k coefficients βⱼ is not equal to 0

Ÿ Since F₀ follows the F distribution with k and n-k-1 degrees of freedom under the null hypothesis, we can reject H₀ at the significance level α if F₀ > F(k, n-k-1; α). Each βⱼ can also be tested individually, which is described in the following sections. (Also, 『eStat』 calculates the p-value for this test, so use this p-value to test. That is, if the p-value is less than the significance level, the null hypothesis is rejected.)
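The overall F test for the tree-volume regression can be sketched as below, building the ANOVA quantities from the residuals of the least squares fit (SciPy stands in for 『eStat』 here).

```python
# Sketch: ANOVA F test for the tree-volume multiple regression,
# with k = 2 independent variables (diameter, height) and n = 15.
import numpy as np
from scipy.stats import f

diameter = [21.0, 21.8, 22.3, 26.6, 27.1, 27.4, 27.9, 27.9,
            29.7, 32.7, 32.7, 33.7, 34.7, 35.0, 40.6]
height = [21.33, 19.81, 19.20, 21.94, 24.68, 25.29, 20.11,
          22.86, 21.03, 22.55, 25.90, 26.21, 21.64, 19.50, 21.94]
volume = np.array([0.291, 0.291, 0.288, 0.464, 0.532, 0.557, 0.441,
                   0.515, 0.603, 0.628, 0.956, 0.775, 0.727, 0.704, 1.084])

n, k = 15, 2
X = np.column_stack([np.ones(n), diameter, height])
b, *_ = np.linalg.lstsq(X, volume, rcond=None)

SSE = float(np.sum((volume - X @ b) ** 2))
SST = float(np.sum((volume - volume.mean()) ** 2))
SSR = SST - SSE

F0 = (SSR / k) / (SSE / (n - k - 1))        # MSR / MSE
p_value = f.sf(F0, k, n - k - 1)
r_squared = SSR / SST
```

F₀ lands far above the 5% critical value F(2, 12; 0.05), in line with the F value of 73.12 reported in Example 12.3.2.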
12.3.4 Inference for Multiple Linear Regression

Ÿ The parameters of interest in multiple linear regression, as in simple linear regression, are the expected value of Y and the individual regression coefficients β₀, β₁, ..., β_k. The inference for these parameters is made possible by obtaining the probability distribution of the point estimates b₀, b₁, ..., b_k. Under the assumption that the error terms are independent and all follow the distribution N(0, σ²), it can be shown that the distribution of bⱼ is as follows:

    bⱼ ~ N(βⱼ, σ²·cⱼⱼ),  j = 0, 1, ..., k

The above cⱼⱼ is the j-th diagonal element of the (k+1)×(k+1) matrix (X′X)⁻¹. In addition, using the estimate s² instead of the parameter σ², you can make inferences about each regression coefficient using the t distribution.

Inference on the regression coefficient βⱼ
  Point estimate: bⱼ
  Standard error of the point estimate: SE(bⱼ) = s·√cⱼⱼ
  Confidence interval of βⱼ: bⱼ ± t(n-k-1; α/2)·SE(bⱼ)
  Testing hypothesis:
    Null hypothesis: H₀: βⱼ = βⱼ₀
    Test statistic: t₀ = (bⱼ - βⱼ₀)/SE(bⱼ)
    Rejection region:
      if H₁: βⱼ > βⱼ₀, reject H₀ when t₀ > t(n-k-1; α)
      if H₁: βⱼ < βⱼ₀, reject H₀ when t₀ < -t(n-k-1; α)
      if H₁: βⱼ ≠ βⱼ₀, reject H₀ when |t₀| > t(n-k-1; α/2)
  (Since 『eStat』 calculates the p-value under the null hypothesis H₀: βⱼ = 0, the p-value can be used for testing the hypothesis.)

Ÿ Residual analysis for multiple linear regression is the same as in simple linear regression.

Example 12.3.2  For the tree data of [Example 12.3.1], obtain the least squares estimate of each coefficient of the proposed regression equation using 『eStat』 and apply the analysis of variance, the test for goodness of fit and the tests for the regression coefficients.

Answer
w In the options window below the scatter plot matrix in <Figure 12.3.1>, click the [Regression Analysis] button. Then you can find the estimated regression line and the ANOVA table in the Log Area as shown in <Figure 12.3.3>. The estimated regression equation has the form:

    ŷ = b₀ + 0.037 x₁ + b₂ x₂  (all coefficient values are shown in <Figure 12.3.3>)

In the above equation, 0.037 represents the increase of the volume of the tree when the diameter increases by 1 cm, holding height fixed.
w The p-value calculated from the ANOVA table in <Figure 12.3.3> at the F value of 73.12 is less than 0.0001, so you can reject the null hypothesis at the significance level α = 0.05.
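The per-coefficient t tests above can be sketched for the tree data as follows, using SE(bⱼ) = s·√cⱼⱼ with cⱼⱼ taken from the diagonal of (X′X)⁻¹ (again, SciPy replaces 『eStat』).

```python
# Sketch: t statistics, p-values and 95% confidence intervals for
# the tree-volume regression coefficients (intercept, diameter, height).
import numpy as np
from scipy.stats import t

diameter = [21.0, 21.8, 22.3, 26.6, 27.1, 27.4, 27.9, 27.9,
            29.7, 32.7, 32.7, 33.7, 34.7, 35.0, 40.6]
height = [21.33, 19.81, 19.20, 21.94, 24.68, 25.29, 20.11,
          22.86, 21.03, 22.55, 25.90, 26.21, 21.64, 19.50, 21.94]
volume = np.array([0.291, 0.291, 0.288, 0.464, 0.532, 0.557, 0.441,
                   0.515, 0.603, 0.628, 0.956, 0.775, 0.727, 0.704, 1.084])

n, k = 15, 2
X = np.column_stack([np.ones(n), diameter, height])
b, *_ = np.linalg.lstsq(X, volume, rcond=None)

resid = volume - X @ b
s = np.sqrt(resid @ resid / (n - k - 1))   # residual standard error
C = np.linalg.inv(X.T @ X)                 # c_jj on the diagonal
se = s * np.sqrt(np.diag(C))

t0 = b / se                                # H0: beta_j = 0 for each j
p = 2 * t.sf(np.abs(t0), n - k - 1)        # two-sided p-values
tc = t.ppf(0.975, n - k - 1)
ci = np.column_stack([b - tc * se, b + tc * se])
```

Both slope p-values come out below 0.05, matching the conclusion of Example 12.3.2 that each null hypothesis is rejected.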
The coefficient of determination, R² = 0.924, implies that 92.4% of the total variation of the dependent variable is explained by the regression line. Based on the above two results, we can conclude that the diameter and height of the tree are quite useful in estimating the volume.

Example 12.3.2 Answer (continued)
<Figure 12.3.3> Result of multiple linear regression

w Using s and the diagonal elements of (X′X)⁻¹ from the result in <Figure 12.3.3>, the 95% confidence interval for each regression coefficient can be calculated as bⱼ ± t(12; 0.025)·SE(bⱼ). Any difference between this hand calculation and <Figure 12.3.3> is due to rounding below the decimal point.
w In the hypothesis tests of H₀: βⱼ = 0 versus H₁: βⱼ ≠ 0, each p-value is less than the significance level of 0.05, so you can reject each null hypothesis.
w The scatter plot of the standardized residuals is shown in <Figure 12.3.4> and the Q-Q scatter plot in <Figure 12.3.5>. There is no particular pattern in the scatter plot of the standardized residuals, but there is one outlier value, and the Q-Q scatter plot shows that the assumption of normality is somewhat satisfactory.

<Figure 12.3.4> Residual analysis of multiple linear regression
<Figure 12.3.5> Q-Q plot of multiple linear regression

[Practice 12.3.2] Apply a multiple regression model using 『eStat』 to the regression model of [Practice 12.3.1]. Obtain the least squares estimate of each coefficient of the proposed regression equation and apply the analysis of variance, the test for goodness of fit and the tests for the regression coefficients.

Chapter 12 Exercise / 169

Exercise

12.1 A survey was conducted on the level of education (X, the period of education after graduating from high school, unit: year) and the annual income (Y, unit: one thousand USD) of 10 businessmen.
  id:                  1   2   3   4   5   6   7   8   9  10
  Education period X:  4   2   0   3   4   4   5   5   2   2
  Annual income Y:    50  37  35  45  57  49  60  47  39  50

  1) Draw a scatter plot of the data and interpret it.
  2) Calculate the sample correlation coefficient.
  3) Apply the regression analysis with annual income as the dependent variable and the level of education as the independent variable.

12.2 The following data show the studying time for a week (X) and the grade (Y) of six students.

  Studying time X (hours):  15   28   13   20    4   10
  Grade Y:                 2.0  2.7  1.3  1.9  0.9  1.7

  1) Find the regression line and a 95% confidence interval for β₁ (the additional grade score that is expected to be gained when a student studies one more hour a week).
  2) Calculate a 99% confidence interval for the average grade of a student who studies an average of 12 hours a week.
  3) Test the hypothesis H₀: β₁ = 0 (significance level = 0.01).

12.3 A professor of statistics argues that a student's final exam score can be predicted from his/her mid-term score. Five students were randomly selected and their mid-term and final exam scores are as follows:

  id:          1   2   3   4   5
  Mid-term X: 92  65  75  83  95
  Final Y:    87  71  75  84  93

  1) Draw a scatter plot of this data with the mid-term score on the X axis and the final score on the Y axis. What do you think is the relationship between mid-term and final scores?
  2) Find the regression line and analyse the result.

12.4 An economist argues that there is a clear relationship between coffee and sugar prices: 'When people buy coffee, they will also buy sugar. Isn't it natural that the higher the demand, the higher the price?' We collected the following sample data to test his theory.

  Year:          1985   1986   1987   1988   1989   1990   1991
  Coffee price:  0.68   1.21   1.92   1.81   1.55   1.87   1.56
  Sugar price:   0.245  0.126  0.092  0.086  0.101  0.223  0.212

  1) Prepare a scatter plot with the coffee price on the X axis and the sugar price on the Y axis. Does this data support the economist's theory?
  2) Test the economist's theory by using a regression analysis.
12.5 A rope manufacturer thinks that the strength of a rope is proportional to its nylon content. Ten ropes are randomly selected and their data are as follows:

  % Nylon X:          0   10   20   20   30   30   40   50   60   70
  Strength (psi) Y: 260  360  490  510  600  600  680  820  910  990

  1) Draw a scatter plot with the % nylon on the X axis and the strength on the Y axis. Find a regression line using the least squares method and draw this estimated regression line on the scatter plot.
  2) Estimate the strength of a rope with 33% nylon.
  3) Estimate the strength of a rope with 66% nylon.
  4) The strengths of the two ropes with 20% nylon in the data are different. How can you explain this variation in a regression model?
  5) Estimate the strength of a rope with 0% nylon. Why is this estimate different from the observed value of 260?
  6) Obtain a 95% confidence interval for the strength of the 0% nylon rope.
  7) If the observed strength of the 0% nylon rope were outside the confidence interval in 6), how would you interpret this result?

12.6 A health scientist randomly selected 20 people to determine the effects of smoking and obesity on their physical strength and examined the average daily smoking rate (X₁, number/day), the ratio of weight to height (X₂, kg/m), and the time of continued exercise at a certain intensity (Y, in hours). Test whether smoking and obesity can affect the exercising time at a certain intensity. Apply a multiple regression model by using 『eStat』.

  Smoking rate (X₁):              24  0 25  0  5 18 20  0 15  6  0 15 18  5 10  0 12  0 15 12
  Ratio of weight to height (X₂): 53 47 50 52 40 44 46 45 56 40 45 47 41 38 51 43 38 36 43 45
  Time of continued exercise (Y): 11 22  7 26 22 15  9 23 15 24 27 14 13 21 20 24 15 24 12 16

12.7 The price of old watches in an antique auction is said to be determined by the year the watch was made and the number of bidders.
In order to see if this is true, 32 recently auctioned alarm clocks were examined for the elapsed period after manufacture (in years), the number of bidders and the auction price (in USD) as follows. Test the hypothesis that the auction price of an alarm clock increases with the number of bidders using the multiple linear regression model. (significance level: 0.05)

  Elapsed period:     127  115  127  150  156  182  156  132  137  113  137  117  137  153  117  126
  Number of bidders:   13   12    7    9    6   11   12   10    9    9   15   11    8    6   13   10
  Auction price:     1235 1080  845 1522 1047 1979 1822 1253 1297  946 1713 1024 1147 1092 1152 1336

  Elapsed period:     170  182  162  184  143  159  108  175  108  179  111  187  111  115  194  168
  Number of bidders:   14    8   11   10    6    9   14    8    6    9   15    8    7    7    5    7
  Auction price:     2131 1550 1884 2041  854 1483 1055 1545  729 1792 1175 1593  785  744 1356 1262

Multiple Choice Exercise

12.1 The variables X and Y have a strong relationship with a quadratic equation (Y = X²) as shown in the following table. What is their sample correlation coefficient?
  X: … -3 -2 -1 0 1 2 3 …
  Y: …  9  4  1 0 1 4 9 …
  ① 1  ② 0  ③ -1  ④

12.2 Which is a wrong description of the correlation coefficient r?
  ①  ② if r = -1, perfect negative correlation  ③ if r = 0, no linear correlation  ④ if r < 0, negative correlation

12.3 Which is a right description of the correlation coefficient r?
  ① if r is large, there is a strong positive correlation between X and Y.
  ② if |r| is close to 0, there exists a weak linear correlation between X and Y.
  ③ if r is negative, then Y is increasing when X increases.
  ④ if r is near 1, there exists a weak linear correlation between X and Y.

12.4 If the sample correlation coefficient between X and Y is r, what is the sample correlation coefficient between aX + b and cY + d (a > 0, c > 0)?
  ① r  ② 2r  ③ 5r + 3  ④ 10r + 2

12.5 If the sample correlation coefficient between X and Y is r, what is the sample correlation coefficient between X and 3Y + 1?
  ① r  ② 2r  ③ 3r  ④ 3r + 1

12.6 When not all points on a scatter plot tend to be linear, what is the sample correlation coefficient close to?
  ① r ≥ 0  ② r ≤ 0  ③ |r| is close to 1  ④ |r| is close to 0

12.7 Find the sample correlation coefficient between X and Y of the following data.
  X: 10 20 30 40
  Y:  2  4  6  8
  ① 1  ② 0.3  ③ 0.4  ④ 0.5

12.8 If the correlation coefficient of two variables is 0, which description is right?
  ① There is no linear relationship between the two variables.
  ② There is a linear relationship between the two variables.
  ③ The two variables have a strong relationship.
  ④ The two variables have a strong linear relationship.

12.9 Which one of the following descriptions of the sample correlation coefficient r is not right?
  ① r is a random variable.
  ② -1 ≤ r ≤ 1
  ③ r is a measure of the linear relationship between two variables.
  ④ The distribution of r is a normal distribution.

12.10 Find the sample correlation coefficient between X and Y of the following data.
  X: 1 2 3 4 5
  Y: 5 4 3 2 1
  ① -1  ②  ③ 0  ④

12.11 Find the sample correlation coefficient between X and Y of the following data.
  X:  1 2 3 4 5 6
  Y: -1 1 3 5 7 9
  ① -0.5  ② 0  ③ 0.5  ④ 1

12.12 If X and Y are independent, what is the sample correlation coefficient r?
  ① 1  ②  ③ 0  ④

12.13 Which one of the following is right for the range of the sample correlation coefficient r?
  ①  ②  ③ -1 ≤ r ≤ 1  ④ -∞ < r < ∞

12.14 Which one of the following is right as a description of the sample correlation coefficient r between X and Y?
  ① if r = -1, the value of Y is directly proportional to the value of X.
  ② if r = 1, the value of Y is directly proportional to the value of X.
  ③ if r = 0, the value of Y is inversely proportional to the value of X.
  ④ if r = -1, the value of Y is not related to the value of X.

12.15 Which one of the following is not right as a description of the sample correlation coefficient r between X and Y?
  ① -1 ≤ r ≤ 1
  ② The distribution of r is a normal distribution.
  ③ r is a random variable.
  ④ The formula to calculate r is r = Sxy / √(Sxx·Syy).
12.16 If two variables X and Y have a strong quadratic relation, what is the sample correlation coefficient r?
  ① r ≒ 1  ② r ≒ -1  ③ r ≒ 0  ④ cannot be known without more information on the data

12.17 Which one of the following pairs has a positive correlation?
  ① height of a mountain and air pressure
  ② weight and height
  ③ monthly income and Engel's coefficient
  ④ amount of production and price

12.18 If all points lie on a straight line in a scatter plot, what is the characteristic of the correlation?
  ① perfect correlation  ② strong correlation  ③ weak correlation  ④ no correlation

12.19 If the sample correlation coefficient r is negative, what is the characteristic of the correlation?
  ① inverse correlation  ② positive correlation  ③ weak correlation  ④ usual correlation

12.20 Find the sample correlation coefficient between X and Y of the following data.
  X: 1 2 3 4
  Y: 1 4 3 6
  ① 0.29  ② 0.53  ③ 0.87  ④ 0.98

12.21 Find the sample covariance between X and Y of the following data.
  X: 1 2 3 4
  Y: 5 5 5 5
  ① 1  ② 0  ③ 0.5  ④ -1

12.22 Find the sample covariance between X and Y of the following data.
  X:  1  2  3  4  5
  Y: 17 15 13 11  9
  ① 0  ② 1  ③  ④ -1

12.23 Find the sample covariance between X and Y of the following data.
  X: 1 2  3  4  5
  Y: 6 8 10 12 14
  ① 3  ② 4  ③ 10  ④ 20

12.24 Find the regression line between X and Y using the following data.
  X: 1 2 3  4  5
  Y: 1 4 7 10 13
  ① ŷ = -2 + 3x  ②  ③  ④

12.25 If the standard deviations of the X and Y variables are 4.06 and 2.65 respectively, and the covariance is 10.50, what is the sample correlation coefficient r?
  ① 10.759  ② 0.532  ③ 1.025  ④ 0.976

12.26 If we know the sample correlation coefficient r and the standard deviations of X and Y, s_X and s_Y respectively, what is the regression line equation?
  ① ŷ = ȳ + r·(s_Y/s_X)(x - x̄)  ②  ③  ④

12.27 If the sample correlation coefficient of two random variables X and Y is r, the sample means are x̄ and ȳ, and the sample standard deviations are s_X and s_Y, what is the regression line of Y on X?
  ① ŷ = ȳ + r·(s_Y/s_X)(x - x̄)  ②  ③  ④

12.28 Find the regression coefficient of the regression line using the following data.
      sample mean   sample standard deviation
x         40                  4
y         30                  3
correlation coefficient r = 0.75
① 0.56  ② 0.07  ③ 1.00  ④ 1.53

12.29 Which one of the following statements is true about the two regression lines of variables x and y, the regression line of y on x and the regression line of x on y?
① The two regression lines always coincide.
② The two regression lines are always parallel.
③ The two regression lines meet at one point and do not coincide.
④ The two regression lines are always perpendicular.

12.30 Find the regression coefficient of the regression line using the following data.
      sample mean   sample standard deviation
x         12                  3
y         13                  4
correlation coefficient r = 0.6
① 0.6  ② 0.7  ③ 0.8  ④ 0.9

12.31 Which one is a wrong explanation of the relation between the regression coefficient b1 and the sample correlation coefficient r?
① If b1 = 0, then r = 0 (no correlation).
② If b1 > 0, then r > 0 (positive correlation).
③ If b1 = 1, then r = 1 (perfect correlation).
④ If b1 < 0, then r < 0 (negative correlation).

12.32 If the slope of a regression line is 0.4 and the sample standard deviations of x and y are 4 and 2 respectively, what is the sample correlation coefficient r?
① 1  ② 0.8  ③ 0.5  ④ 0.4

(Answers) 12.1 ②, 12.2 ①, 12.3 ②, 12.4 ①, 12.5 ①, 12.6 ④, 12.7 ①, 12.8 ①, 12.9 ④, 12.10 ①, 12.11 ④, 12.12 ③, 12.13 ③, 12.14 ②, 12.15 ②, 12.16 ③, 12.17 ②, 12.18 ①, 12.19 ①, 12.20 ③, 12.21 ②, 12.22 ④, 12.23 ②, 12.24 ①, 12.25 ④, 12.26 ①, 12.27 ①, 12.28 ①, 12.29 ③, 12.30 ③, 12.31 ③, 12.32 ②

13 Time Series Analysis

SECTIONS
13.1 What is Time Series Analysis?
13.2 Smoothing of Time Series
13.3 Transformation of Time Series
13.4 Regression Model and Forecasting
13.5 Exponential Smoothing Model and Forecasting
13.6 Seasonal Model and Forecasting

CHAPTER OBJECTIVES
In this chapter, we study data observed over time, called time series, and find out about:
- What is time series analysis and what are the types of time series models?
- How to smooth a time series.
- How to transform a time series.
- Prediction method using a regression model.
- Prediction method using an exponential smoothing model.
- How to predict future values with models for seasonal time series.
We will focus mainly on descriptive methods and simple models; the Box-Jenkins model and other theoretical models will not be discussed.

180 / Chapter 13 Time Series Analysis

13.1 What is Time Series Analysis?

• Time series data refers to data recorded according to changes in time. In general, observations are made at regular time intervals such as year, quarter, month, or day, and such a series is called a discrete time series. There are also time series that are observed continuously, but this book deals only with the analysis of discrete time series. An example of a discrete time series is the population of Korea shown in [Table 13.1.1]. These data are from the census conducted every five years in Korea from 1925 to 2020 (except for 1944 and 1949).

[Table 13.1.1] Population of Korea (Source: Korea National Statistical Office; Census until 2010, Registered Census 2015, 2020)

Year  Population    Year  Population    Year  Population
1925  19,020,030    1960  24,989,241    1995  44,553,710
1930  20,438,108    1966  29,159,640    2000  45,985,289
1935  22,208,102    1970  31,435,252    2005  47,041,434
1940  23,547,465    1975  34,678,972    2010  47,990,761
1944  25,120,174    1980  37,406,815    2015  51,069,375
1949  20,166,756    1985  40,419,652    2020  51,829,136
1955  21,502,386    1990  43,390,374

• As the table above shows, it is not easy to grasp the overall shape of a time series displayed as numbers. The first step in time series analysis is to observe the time series by drawing a time series plot, with the X axis as time and the Y axis as the time series values. For example, the time series plot of the total population of Korea is shown in <Figure 13.1.1>.

<Figure 13.1.1> Time Series of Korea Population

• Observing this figure, Korea's population has an overall increasing trend, but the population decreased sharply in 1944-1949 due to World War II.
The population expanded rapidly after the Korean War in 1953, the growth slowed after 1990, and it has slowed further in the last 10 years. By observing a time series in this way, trends, change points, and outliers can be detected, which helps in selecting an analysis model or method suitable for the data.
• Time series that we frequently encounter include monthly sales of department stores and companies, the daily composite stock index, annual crop production, yearly export and import series, and yearly national income and economic growth rates. [Table 13.1.2] shows the percent increase in monthly sales of the US toy/game industry for the past 6 years, and <Figure 13.1.2> is a plot of this time series. Since it is the rate of change from the previous month, the series moves up and down around 0, and it is clearly seasonal, showing a large increase in November and December every year. However, May 2020 is an extreme value, with an increase rate of 211% unlike other years. Converting a raw time series into the rate of change in this way can make the characteristics of the data easier to examine.
[Table 13.1.2] Percent Increase, Monthly Sales of Toy/Game in US (%) (Source: Bureau of Census, US)

Year.Month  % Incr.   Year.Month  % Incr.   Year.Month  % Incr.
2016.01      -66.7    2018.01      -63.6    2020.01      -49.1
2016.02        2.5    2018.02        3.6    2020.02        2.2
2016.03       12.5    2018.03       39.8    2020.03      -28.2
2016.04       -9.0    2018.04      -21.0    2020.04      -58.2
2016.05       -0.6    2018.05        5.9    2020.05      211.1
2016.06       -4.4    2018.06      -12.4    2020.06       26.8
2016.07        4.3    2018.07      -16.9    2020.07       -0.8
2016.08        0.0    2018.08        5.2    2020.08        7.0
2016.09        6.1    2018.09        7.5    2020.09        4.9
2016.10        8.6    2018.10        8.5    2020.10        5.8
2016.11       56.4    2018.11       54.9    2020.11       44.1
2016.12       53.6    2018.12        5.8    2020.12        8.5
2017.01      -65.6    2019.01      -46.2    2021.01      -37.1
2017.02       -0.1    2019.02       -3.8    2021.02      -12.2
2017.03       14.7    2019.03       16.3    2021.03       37.0
2017.04       -5.7    2019.04       -8.4    2021.04      -10.3
2017.05       -2.4    2019.05        6.6    2021.05       -0.5
2017.06       -5.5    2019.06       -5.3    2021.06       -2.0
2017.07        1.3    2019.07        0.8    2021.07        4.6
2017.08        4.2    2019.08        7.7    2021.08        1.8
2017.09        8.4    2019.09       -1.2    2021.09        5.2
2017.10        7.2    2019.10       12.2    2021.10        6.4
2017.11       54.9    2019.11       46.7    2021.11       40.0
2017.12       45.5    2019.12       11.7    2021.12       10.6

<Figure 13.1.2> Percent Increase, Monthly Sales of Toy/Game in US (%)

• Most time series have four components: trend, seasonal, cyclical, and irregular factors. A trend exists when a time series follows a certain pattern over time, such as a line or a curve; there are various types of trends. Trends reflect long-term influences on a time series, such as consumption behavior, population change, and inflation. Seasonal factors are short-term, regular fluctuation factors that occur quarterly, monthly, or by day of the week. Time series such as monthly rainfall, average temperature, and ice cream sales have seasonal factors. Seasonal factors generally have a short cycle; fluctuations that recur over a long period, not tied to the season, are called cyclical factors.
By observing cyclical factors, it is possible to anticipate booms or recessions of the economy. <Figure 13.1.3> shows the US S&P 500 Index from 1997 to 2016, in which a roughly six-year cycle can be observed.

<Figure 13.1.3> US S&P 500 Index (1997-2016)

• Variations that cannot be explained by trend, seasonal, or cyclical factors are called irregular or random factors; they arise from random causes and show no regular movement over time.

13.1.1 Time Series Model

• By observing a time series, you can predict how it will change in the future by building a time series model that fits the probabilistic characteristics of the data. Because time series observed in practice take very diverse forms, time series models are also very diverse, from simple to very complex. In general, time series models for a single variable can be divided into the following four categories.

A. Regression Model
A model that explains data or predicts the future by expressing the time series as a function of time is the most intuitive and easiest to understand. That is, when a time series Y_1, Y_2, ..., Y_n is an observation of a random variable, it is expressed as the following model:

    Y_t = f(t) + ε_t,  t = 1, 2, ..., n

Here ε_t is the error of the time series that cannot be explained by the function f(t). In general, the ε_t are assumed to be independent with E(ε_t) = 0 and Var(ε_t) = σ², which is called a white noise. For example, the following models can be applied to a time series that is horizontal or has a linear trend.

    Horizontal:    Y_t = β0 + ε_t
    Linear trend:  Y_t = β0 + β1·t + ε_t

B. Decomposition Model
The model that decomposes the time series into four factors, i.e., trend (T_t), cycle (C_t), seasonal (S_t), and irregular (I_t) factors, is an analysis method that has long been used based on empirical facts. It can be divided into the additive model and the multiplicative model.

    Additive model:        Y_t = T_t + C_t + S_t + I_t
    Multiplicative model:  Y_t = T_t · C_t · S_t · I_t

Here T_t, C_t, S_t are deterministic functions and I_t is a random variable.
If we take the logarithm of a multiplicative model, it becomes an additive model. If the number of data points is not large enough, the cycle factor can be omitted from the model.

C. Exponential Smoothing Model
Time series data are often more strongly related to recent data than to distant past data. The two types of models above do not take the relationship between past and recent data into account. Models using moving averages and exponential smoothing are often used to explain and predict data, exploiting the fact that forecasts depend more on recent data.

D. Box-Jenkins ARIMA Model
The above models cannot be applied to all types of time series; the analyst selects and applies them according to the type of data. Box and Jenkins presented the following general ARIMA model, which can be applied to all time series of stationary or nonstationary type (shown here in its ARMA form after any differencing):

    Y_t = φ1·Y_{t-1} + ... + φp·Y_{t-p} + ε_t - θ1·ε_{t-1} - ... - θq·ε_{t-q}

• The ARIMA approach considers the observed time series as a sample extracted from a population time series, studies the probabilistic properties of each model, and establishes an appropriate time series model through parameter estimation and testing. For the ARIMA model, autocorrelation coefficients between time lags are used to identify a model. The ARIMA model is beyond the scope of this book; interested readers are encouraged to consult the bibliography.
• Among the time series models above, the regression model and the ARIMA model are systematic models based on statistical theory, while the decomposition model and the exponential smoothing model are methods based on experience and intuition. In general, regression models using mathematical functions and decomposition models are known to be suitable for predicting slowly changing time series, whereas exponential smoothing and ARIMA models are known to be effective for predicting rapidly changing time series.
No time series model can predict sudden changes. And because time series take so many different forms, it cannot be said that one time series model is always superior to another. Therefore, rather than applying only one model to a time series, it is necessary to build and compare several models, combine different models, or determine the final model by also drawing on the opinions of experts familiar with the time series.

13.1.2 Evaluation of Time Series Model

• Let the time series Y_1, Y_2, ..., Y_n be the observed values of random variables and Ŷ_1, Ŷ_2, ..., Ŷ_n be the values predicted by the model. If the model fits exactly, the observed and predicted values are the same and the model error is zero. In general, it is assumed that the errors ε_t of the time series model are independent random variables following the same normal distribution with mean 0 and variance σ². The accuracy of a time series model can be evaluated using the residuals, e_t = Y_t - Ŷ_t, obtained by subtracting the predicted value from the observed value. In general, the following mean squared error (MSE) is commonly used to measure the accuracy of a model; the smaller the MSE, the more appropriate the prediction model is judged to be.

    MSE = (1/n) Σ_{t=1}^{n} (Y_t - Ŷ_t)²

The mean squared error is used as an estimator of the variance of the error. Since the MSE can be large, its square root, the root mean squared error (RMSE), is also often used.

13.2 Smoothing of Time Series

• Raw time series data can be used to build a model by observing trends, but in many cases a time series is easier to understand after smoothing. In a time series such as a stock price, it is often difficult to find a trend because of temporary or short-term fluctuations due to chance events or cyclical factors. In this case, smoothing techniques are used to grasp the overall long-term trend effectively by removing temporary or short-term fluctuations.
The centered moving average method and the exponential smoothing method are widely used.

13.2.1 Centered Moving Average

• The time series in [Table 13.2.1] is the world crude oil price based on the closing price every year from 1987 to 2022. Looking at <Figure 13.2.1>, it can be seen that the short-term fluctuations in the time series are large. However, causes such as oil shocks are short-term rather than persistent, so if we are interested in the long-term trend of the oil price, it is more effective to remove the fluctuations caused by short-term causes.

[Table 13.2.1] Price of Crude Oil (End of Year Price, US$) and 5-point Moving Average

Year  Price   5-pt MA     Year  Price   5-pt MA
1987  16.74      -        2005  61.06   58.746
1988  17.12      -        2006  60.85   61.164
1989  21.84   20.666      2007  95.95   68.370
1990  28.48   21.216      2008  44.60   74.434
1991  19.15   20.630      2009  79.39   82.030
1992  19.49   19.816      2010  91.38   81.206
1993  14.19   18.028      2011  98.83   91.920
1994  17.77   19.378      2012  91.83   86.732
1995  19.54   19.010      2013  98.17   75.882
1996  25.90   18.600      2014  53.45   66.866
1997  17.65   20.198      2015  37.13   60.592
1998  12.14   21.634      2016  53.75   49.988
1999  25.76   20.446      2017  60.46   51.526
2000  26.72   23.158      2018  45.15   53.804
2001  19.96   27.232      2019  61.14   58.096
2002  31.21   30.752      2020  48.52   67.394
2003  32.51   37.620      2021  75.21      -
2004  43.36   45.798      2022 106.95      -

<Figure 13.2.1> Price of Crude Oil and 5-point Moving Average

• The N-point centered moving average of a time series is the average of the N data values centered at a single point in time. For example, in the crude oil price data, the 5-point centered moving average for a specific year is the average of the data for the two years before that year, that year itself, and the two years after. Expressed as a formula, if M_t is the moving average at time t, the 5-point centered moving average is:

    M_t = (Y_{t-2} + Y_{t-1} + Y_t + Y_{t+1} + Y_{t+2}) / 5

For example, the 5-point centered moving average for 1989 is:

    M_1989 = (16.74 + 17.12 + 21.84 + 28.48 + 19.15) / 5 = 20.666
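As a quick check of the moving-average arithmetic, the computation can be sketched in Python; the `prices` list copies the first ten values of [Table 13.2.1], and the function name is ours, not from the text:

```python
# 5-point centered moving average of the crude oil closing prices
# (first ten years of [Table 13.2.1]).
prices = [16.74, 17.12, 21.84, 28.48, 19.15, 19.49, 14.19, 17.77, 19.54, 25.90]

def centered_moving_average(y, n):
    """N-point centered moving average (N odd); the first and last
    (N-1)//2 points are lost, as noted in the text."""
    half = n // 2
    return [sum(y[t - half:t + half + 1]) / n
            for t in range(half, len(y) - half)]

ma5 = centered_moving_average(prices, 5)
print(round(ma5[0], 3))   # moving average for 1989 → 20.666, as in the table
```

The second value, 21.216, likewise reproduces the 1990 entry of the table.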
• [Table 13.2.1] shows the values of all 5-point centered moving averages obtained in this way, and <Figure 13.2.1> is the graph of the 5-point moving average. Note that the moving averages for the first two years and the last two years cannot be obtained. The graph of the moving average is better for grasping the long-term trend than the graph of the original data, because short-term fluctuations are removed.
• The choice of the value N for the N-point moving average is important. A large N provides a smoother moving average, but more points are lost at both ends and the average becomes insensitive to important trend changes. On the other hand, a small N loses fewer points at both ends, but short-term fluctuations may not be sufficiently removed and the smoothing effect may not be achieved. In general, try a few values of N so as to reflect important changes that should not be missed, while achieving a smoothing effect and not losing too many points at both ends.
• If N is an even number, there is a difficulty in obtaining a centered moving average with the same number of data on both sides of the base year. For example, the center of the 4-point moving average of 1987 to 1990 lies between 1988 and 1989. If we denote this as M_{1988.5}, it can be calculated as follows:

    M_{1988.5} = (Y_1987 + Y_1988 + Y_1989 + Y_1990) / 4

• The 4-point moving average obtained in this way is called a non-centered 4-point moving average. For even N, the non-centered moving average does not correspond to an observation year of the original data, which is inconvenient. In this case, the centered moving average is calculated as the average of two adjacent non-centered moving averages.
In other words, the centered 4-point moving average for 1989 is the average of M_{1988.5} and M_{1989.5}:

    M_1989 = (M_{1988.5} + M_{1989.5}) / 2

• If the time series is quarterly or monthly, a 4-point or 12-point centered moving average is an average over one full year, so it is often used to observe the data with the seasonality removed.

13.2.2 Exponential Smoothing

• A 3-point moving average can be considered the weighted average of three data values with equal weights 1/3. When the weights are w_1, w_2, ..., w_n, a weighted average M of the data is defined as:

    M = w_1·Y_1 + w_2·Y_2 + ... + w_n·Y_n,  where w_i ≥ 0, Σ_{i=1}^{n} w_i = 1, and n is the number of data.

• Various weighted averages with different weights can be used depending on the purpose. Among them, a smoothing method that gives more weight to data closer to the present, and smaller weights to data farther in the past, is called exponential smoothing. Exponential smoothing is determined by a smoothing constant α with a value between 0 and 1. The exponentially smoothed value E_t is calculated as:

    E_1 = Y_1
    E_t = α·Y_t + (1-α)·E_{t-1},  t = 2, 3, ...

Here an initial value is required; E_1 = Y_1 is usually used, but the average of the data can also be used. The exponentially smoothed value at time t gives weight α to the current data and weight (1-α) to the previous smoothed value. It can be represented in terms of the original data as:

    E_t = α·Y_t + α(1-α)·Y_{t-1} + α(1-α)²·Y_{t-2} + ...

• Therefore, exponential smoothing uses all data from the present and the past, but gives the current data the highest weight, and lower weights as the distance from the present increases. Exponential smoothing of the crude oil price in [Table 13.2.1] with initial value E_1987 = 16.740 and smoothing constant α = 0.3 begins as follows:

    E_1988 = 0.3 × 17.12 + 0.7 × 16.740 = 16.854

All data exponentially smoothed with α = 0.3 are given in [Table 13.2.2]. Note that with exponential smoothing there is no loss of data at the ends, unlike the moving average method.
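The recursion E_t = αY_t + (1-α)E_{t-1} can be sketched in Python; the `prices` list copies the first five values of [Table 13.2.1], and the function name is ours:

```python
# Exponential smoothing E_t = a*Y_t + (1-a)*E_(t-1), with E_1 = Y_1,
# applied to the first crude oil prices of [Table 13.2.1] with a = 0.3.
prices = [16.74, 17.12, 21.84, 28.48, 19.15]

def exponential_smoothing(y, alpha):
    e = [y[0]]                       # initial value E_1 = Y_1, as in the text
    for value in y[1:]:
        e.append(alpha * value + (1 - alpha) * e[-1])
    return e

smoothed = exponential_smoothing(prices, 0.3)
print([round(v, 3) for v in smoothed])
# → [16.74, 16.854, 18.35, 21.389, 20.717], matching [Table 13.2.2]
```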
The crude oil price series and the exponentially smoothed data are shown in <Figure 13.2.2>. The smoothed data are not significantly different from the original data. If the value of α is small, more weight is given to past data than to the present, making the smoothed series less sensitive to sudden changes in the current data. Conversely, the closer α is to 1, the more weight is given to the current data, so the smoothed data resemble the original data and the smoothing effect disappears.

[Table 13.2.2] Price of Crude Oil and Exponential Smoothing with α = 0.3

Year  Price   Smoothed    Year  Price   Smoothed
1987  16.74   16.740      2005  61.06   40.554
1988  17.12   16.854      2006  60.85   46.643
1989  21.84   18.350      2007  95.95   61.435
1990  28.48   21.389      2008  44.60   56.385
1991  19.15   20.717      2009  79.39   63.286
1992  19.49   20.349      2010  91.38   71.714
1993  14.19   18.501      2011  98.83   79.849
1994  17.77   18.282      2012  91.83   83.443
1995  19.54   18.659      2013  98.17   87.861
1996  25.90   20.832      2014  53.45   77.538
1997  17.65   19.877      2015  37.13   65.416
1998  12.14   17.556      2016  53.75   61.916
1999  25.76   20.017      2017  60.46   61.479
2000  26.72   22.028      2018  45.15   56.580
2001  19.96   21.408      2019  61.14   57.948
2002  31.21   24.348      2020  48.52   55.120
2003  32.51   26.797      2021  75.21   61.146
2004  43.36   31.766      2022 106.95   74.888

<Figure 13.2.2> Price of Crude Oil and Exponential Smoothing with α = 0.3

13.2.3 Filtering by Moving Median

• The N-point centered moving median of a time series is the median of the N data values centered at a single point in time. For example, in the crude oil price data, the 5-point moving median for a specific year is the median of the data for the two years before that year, that year itself, and the two years after. If the window data are denoted Y_1, ..., Y_N and sorted from smallest to largest as Y_(1), ..., Y_(N), the median for odd N is Y_((N+1)/2).
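The centered moving median just defined can be sketched in Python using the standard library; the `prices` list copies the first seven values of [Table 13.2.1], and the function name is ours:

```python
# 5-point centered moving median: median of each window of five values,
# illustrated on the first crude oil prices of [Table 13.2.1].
from statistics import median

prices = [16.74, 17.12, 21.84, 28.48, 19.15, 19.49, 14.19]

def centered_moving_median(y, n):
    half = n // 2
    return [median(y[t - half:t + half + 1])
            for t in range(half, len(y) - half)]

print(centered_moving_median(prices, 5))
# → [19.15, 19.49, 19.49], the 1989-1991 entries of [Table 13.2.3]
```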
For example, the 1989 centered moving median for the crude oil prices in [Table 13.2.1] is:

    median(16.74, 17.12, 21.84, 28.48, 19.15) = 19.15

• [Table 13.2.3] and <Figure 13.2.3> show all the 5-point moving median values obtained in this way and their graph. Note that the moving medians for the first two years and the last two years are not available. Because the centered moving median removes extreme values, it is called a filtering, and the filtered series is much smoother than the original data.

[Table 13.2.3] Price of Crude Oil and 5-point Centered Moving Median

Year  Price   Moving Median    Year  Price   Moving Median
1987  16.74       -            2005  61.06   60.85
1988  17.12       -            2006  60.85   60.85
1989  21.84    19.15           2007  95.95   61.06
1990  28.48    19.49           2008  44.60   79.39
1991  19.15    19.49           2009  79.39   91.38
1992  19.49    19.15           2010  91.38   91.38
1993  14.19    19.15           2011  98.83   91.83
1994  17.77    19.49           2012  91.83   91.83
1995  19.54    17.77           2013  98.17   91.83
1996  25.90    17.77           2014  53.45   53.75
1997  17.65    19.54           2015  37.13   53.75
1998  12.14    25.76           2016  53.75   53.45
1999  25.76    19.96           2017  60.46   53.75
2000  26.72    25.76           2018  45.15   53.75
2001  19.96    26.72           2019  61.14   60.46
2002  31.21    31.21           2020  48.52   61.14
2003  32.51    32.51           2021  75.21      -
2004  43.36    43.36           2022 106.95      -

<Figure 13.2.3> Price of Crude Oil and 5-point Centered Moving Median

• If N is an even number, there is a difficulty in obtaining a centered moving median with the same number of data on both sides of the base year. For example, the center of the 4-point moving median of 1987 to 1990 lies between 1988 and 1989. If we denote this as m_{1988.5}, it can be calculated as:

    m_{1988.5} = median(Y_1987, Y_1988, Y_1989, Y_1990)

The 4-point moving median obtained in this way is called the non-centered 4-point moving median. For even N, the non-centered moving median does not correspond to an observation year of the original data, which is inconvenient.
In this case, the centered value is calculated as the average of the two adjacent non-centered moving medians. In other words, the centered 4-point moving median for 1989 is the mean of m_{1988.5} and m_{1989.5}.

13.3 Transformation of Time Series

• A time series can be examined by plotting the raw data directly, but to examine various characteristics the percentage increase or decrease is often examined, as well as an index, i.e., a percentage with respect to a base period. In addition, to examine the relation with previous data, the series is compared with a time lag or converted into a horizontal series using differencing. When the variance of a time series increases with time, it is sometimes converted into a form suitable for modeling by a logarithmic, square root, or Box-Cox transformation.

13.3.1 Percent

A. Percentage Change
In a time series you can examine the raw increase or decrease of the values, but changes are easier to observe as percentages. When the time series is Y_1, Y_2, ..., Y_n, the percentage change compared to the previous value is:

    Z_t = (Y_t - Y_{t-1}) / Y_{t-1} × 100,  t = 2, 3, ..., n

[Table 13.3.1] shows the number of houses in Korea from 2010 to 2020, and <Figure 13.3.1> shows the percentage change from the previous year. Looking at this rate of change, the original series has an overall increasing trend, but the year-over-year rate of change varies considerably. For example, the number of houses increased 2.23% in 2014 over the previous year, and 2.48% in 2018.
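The percent-change formula can be sketched in Python on the first values of the housing series in [Table 13.3.1] (the variable names are ours; the table reports the same quantities to two decimals, with small rounding differences):

```python
# Percent change from the previous observation,
# Z_t = (Y_t - Y_(t-1)) / Y_(t-1) * 100,
# applied to the first housing counts of [Table 13.3.1] (unit: 1000 houses).
houses = [17738.8, 18082.1, 18414.4, 18742.1, 19161.2]

pct_change = [(houses[t] - houses[t - 1]) / houses[t - 1] * 100
              for t in range(1, len(houses))]
print([round(p, 2) for p in pct_change])
```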
[Table 13.3.1] Number of Houses in Korea and Percent Change (Korea National Statistical Office, unit: 1000)

Year  Number of Houses  % Change
2010     17738.8           -
2011     18082.1          1.93
2012     18414.4          1.83
2013     18742.1          1.77
2014     19161.2          2.23
2015     19559.1          2.07
2016     19877.1          1.62
2017     20313.4          2.19
2018     20818.0          2.48
2019     21310.1          2.36
2020     21673.5          1.70

<Figure 13.3.1> Number of Houses in Korea and Percent Change

B. Simple Index
Another way to use percentages to characterize changes over time is to calculate an index number. An index number indicates the change of a time series over time. The index number at a certain time point is the percentage of the series value relative to the value at a predetermined time point called the base period:

    I_t = Y_t / Y_base × 100

The most commonly used indices in the economic field are price indices and quantity indices. For example, the consumer price index is a price index indicating the price change of a set of goods that reflects overall consumer prices, and an index indicating the change in total yearly electricity consumption is a quantity index. There are several methods of calculating an index; broadly, a simple index covers a single item, and a composite index, such as the consumer price index, covers several items. [Table 13.3.2] is a simple index for the number of houses in Korea from 2010 to 2020, with 2010 as the base period. The plot of the index shows no significant change from the original time series and its trend. There was a 22.18% increase in the number of houses in 2020 compared to 2010.
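The simple index I_t = Y_t / Y_base × 100 can be sketched in Python on a few years of the housing series (the dict layout is ours; values are from [Table 13.3.1]):

```python
# Simple index with base year 2010, for the housing counts (unit: 1000).
houses = {2010: 17738.8, 2011: 18082.1, 2015: 19559.1, 2020: 21673.5}

base = houses[2010]
index = {year: round(y / base * 100, 2) for year, y in houses.items()}
print(index)
# → {2010: 100.0, 2011: 101.94, 2015: 110.26, 2020: 122.18},
#   matching the corresponding rows of [Table 13.3.2]
```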
[Table 13.3.2] Simple Index of Number of Houses in Korea (Korea National Statistical Office, unit: 1000)

Year  Number of Houses  Simple Index (Base: 2010)
2010     17738.8          100.00
2011     18082.1          101.94
2012     18414.4          103.81
2013     18742.1          105.66
2014     19161.2          108.02
2015     19559.1          110.26
2016     19877.1          112.05
2017     20313.4          114.51
2018     20818.0          117.36
2019     21310.1          120.13
2020     21673.5          122.18

<Figure 13.3.2> Simple Index of Number of Houses in Korea

C. Composite Index
A composite index measures the change in price or quantity of several goods: a specific time point is set as the base period, and the data at each time point are calculated as a percentage relative to the base period. The most widely used composite index is the consumer price index, which reflects price fluctuations of about 500 products in Korea that affect consumer prices. Another commonly used composite index is the composite stock index, which tracks the price fluctuations of all listed stocks traded in the stock market. For a composite index, a weighted composite index, in which the price of each product is weighted by the quantity consumed, is often used. When the quantity consumed at the base period is used as the weight, the index is called a Laspeyres index; when the quantity consumed at the current period is used, it is called a Paasche index. In general, the Laspeyres weighted composite index is widely used; the consumer price index is a representative example. The Paasche price index is used when the consumption of the goods used as weights varies greatly over time, and it can be used only when the consumption at each time point is known. It is expensive to survey the quantity consumed at every time point.
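The weighting scheme just described (base-period quantities for Laspeyres, current-period quantities for Paasche) can be sketched in Python on the January/February metal data of [Table 13.3.3]; the variable and function names are ours:

```python
# Laspeyres and Paasche composite indices for February with January as base,
# using the metal prices/quantities of [Table 13.3.3].
p0 = [1361.6, 213.0, 530.0]   # January prices (copper, iron, lead)
q0 = [100.7, 4311.0, 46.1]    # January quantities
p1 = [1399.0, 213.0, 520.0]   # February prices
q1 = [95.1, 4497.0, 47.0]     # February quantities

def weighted_index(prices, base_prices, weights):
    """(sum of p_t * q) / (sum of p_0 * q) * 100 for a fixed quantity vector q."""
    num = sum(p * q for p, q in zip(prices, weights))
    den = sum(p * q for p, q in zip(base_prices, weights))
    return num / den * 100

laspeyres = weighted_index(p1, p0, q0)   # base-period quantities as weights
paasche = weighted_index(p1, p0, q1)     # current-period quantities as weights
print(round(laspeyres, 2), round(paasche, 2))
# → 100.31 100.28, matching the February row of [Table 13.3.3]
```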
Assuming that p_{1t}, ..., p_{kt} are the prices of k products at time t, and q_{10}, ..., q_{k0} are the quantities of each product consumed at the base period (time 0), the composite indices are calculated as follows:

    Laspeyres index:  L_t = (p_{1t}·q_{10} + ... + p_{kt}·q_{k0}) / (p_{10}·q_{10} + ... + p_{k0}·q_{k0}) × 100
    Paasche index:    P_t = (p_{1t}·q_{1t} + ... + p_{kt}·q_{kt}) / (p_{10}·q_{1t} + ... + p_{k0}·q_{kt}) × 100

The data in [Table 13.3.3] show the monthly price and production quantity of three metals in 2020.

[Table 13.3.3] Composite Index of Three Metal Prices ($/ton) and Production Quantities (ton)

Month  Copper Price  Copper Qty  Iron Price  Iron Qty  Lead Price  Lead Qty  Laspeyres  Paasche
  1      1361.6        100.7        213        4311      530.0       46.1     100.00    100.00
  2      1399.0         95.1        213        4497      520.0       47.0     100.31    100.28
  3      1483.6        104.0        213        5083      529.0       51.0     101.13    101.01
  4      1531.6         95.6        213        5077      540.0       23.0     101.63    101.35
  5      1431.2        103.3        213        5166      531.0       26.5     100.65    100.57
  6      1383.8        106.9        213        4565      580.0       13.5     100.42    100.27
  7      1326.8         95.9        213        4329      642.8       27.4     100.16     99.98
  8      1328.8         96.7        213        4057      602.6       25.8     100.00     99.87
  9      1307.8         95.7        213        3473      513.6       20.5      99.43     99.38
 10      1278.4         89.1        213        3739      480.8       24.6      99.01     99.07
 11      1354.2        100.5        213        3817      528.4       21.5      99.92     99.92
 12      1305.2         96.9        213        3694      462.2       27.9      99.18     99.21

In [Table 13.3.3], the Laspeyres index for February with January as the base period is:

    L_2 = (1399.0×100.7 + 213×4311 + 520.0×46.1) / (1361.6×100.7 + 213×4311 + 530.0×46.1) × 100 = 100.31

Similarly, the Paasche index is:

    P_2 = (1399.0×95.1 + 213×4497 + 520.0×47.0) / (1361.6×95.1 + 213×4497 + 530.0×47.0) × 100 = 100.28

In [Table 13.3.3], the production quantities of iron and lead in the later months differ significantly from the quantities in January, the base period. When quantities fluctuate greatly and the quantity at each time point is known, the Paasche index reflects the price change at each time appropriately and can be considered the better index.

13.3.2 Time Lag and Difference

A. Time Lag
In a time series, current data are usually related to past data. A lag is a transformation for comparing the data at the present time with observations one or several time points in the past.
That is, when the observed time series is Y_1, Y_2, ..., Y_n, the lag-1 series is Y_1, Y_2, ..., Y_{n-1}, aligned with times 2, 3, ..., n. Note that, for lag k, the first k values are missing relative to the original series. The correlation coefficient between the lagged series and the original series is called the autocorrelation coefficient. If the mean of the time series is Ȳ, the lag-k autocorrelation is defined as:

    r_k = Σ_{t=1}^{n-k} (Y_t - Ȳ)(Y_{t+k} - Ȳ) / Σ_{t=1}^{n} (Y_t - Ȳ)²,  k = 0, 1, 2, ..., n-1

r_0, r_1, r_2, ... are called the autocorrelation function and are used to identify a time series model. [Table 13.3.4] shows the monthly consumer price index for the past two years together with lags 1 to 12 of this series, and the autocorrelation coefficients are shown in [Table 13.3.5]. <Figure 13.3.3> shows the original time series and the autocorrelation function.

[Table 13.3.4] Monthly Consumer Price Index and Lag 1, Lag 2, ..., Lag 12

t   Year/Month  CPI    Lag 1  Lag 2  ...  Lag 12
1   2020.01    102.3     -      -           -
2   2020.02    102.8   102.3    -           -
3   2020.03    103.7   102.8  102.3         -
4   2020.04    104.1   103.7  102.8         -
5   2020.05    104.0   104.1  103.7         -
6   2020.06    104.3   104.0  104.1         -
7   2020.07    104.5   104.3  104.0         -
8   2020.08    104.9   104.5  104.3         -
9   2020.09    104.8   104.9  104.5         -
10  2020.10    104.8   104.8  104.9         -
11  2020.11    104.2   104.8  104.8         -
12  2020.12    104.4   104.2  104.8         -
13  2021.01    105.0   104.4  104.2       102.3
14  2021.02    105.5   105.0  104.4       102.8
15  2021.03    106.1   105.5  105.0       103.7
16  2021.04    106.7   106.1  105.5       104.1
17  2021.05    107.1   106.7  106.1       104.0
18  2021.06    107.0   107.1  106.7       104.3
19  2021.07    106.7   107.0  107.1       104.5
20  2021.08    107.4   106.7  107.0       104.9
21  2021.09    108.0   107.4  106.7       104.8
22  2021.10    107.7   108.0  107.4       104.8
23  2021.11    107.8   107.7  108.0       104.2
24  2021.12    108.3   107.8  107.7       104.4

[Table 13.3.5] Autocorrelation Function

k    r_k
1    0.8318
2    0.6772
3    0.5651
4    0.4479
5    0.3333
6    0.2547
7    0.1647
8    0.0755
9   -0.0143
10  -0.0854
11  -0.1737

<Figure 13.3.3> Autocorrelation Function Graph
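The lag-k autocorrelation formula can be sketched in Python on the CPI values of [Table 13.3.4] (the function name is ours):

```python
# Lag-k autocorrelation
# r_k = sum_{t=1..n-k} (Y_t - Ybar)(Y_{t+k} - Ybar) / sum_{t=1..n} (Y_t - Ybar)^2,
# applied to the monthly CPI values of [Table 13.3.4].
cpi = [102.3, 102.8, 103.7, 104.1, 104.0, 104.3, 104.5, 104.9,
       104.8, 104.8, 104.2, 104.4, 105.0, 105.5, 106.1, 106.7,
       107.1, 107.0, 106.7, 107.4, 108.0, 107.7, 107.8, 108.3]

def autocorrelation(y, k):
    n = len(y)
    mean = sum(y) / n
    num = sum((y[t] - mean) * (y[t + k] - mean) for t in range(n - k))
    den = sum((v - mean) ** 2 for v in y)
    return num / den

print(round(autocorrelation(cpi, 1), 4))   # → 0.8318, as in [Table 13.3.5]
```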
B. Differencing
Since the price index in [Table 13.3.4] has a linear trend, a model for this trend can be built, but in some cases a model is built after transforming the time series into a horizontal one. The way to transform a linear trend into a horizontal series is differencing. When the time series is Y_1, Y_2, ..., Y_n, the first-order difference ∇Y_t is:

    ∇Y_t = Y_t - Y_{t-1},  t = 2, 3, ..., n

If the raw data have a linear trend, the first-order differenced series is horizontal, because each difference measures the change in level, i.e., the slope. Differencing the first-order differences ∇Y_t gives the second-order difference:

    ∇²Y_t = ∇Y_t - ∇Y_{t-1} = Y_t - 2Y_{t-1} + Y_{t-2},  t = 3, 4, ..., n

If the raw data have a quadratic trend, the second-order differenced series becomes horizontal. <Figure 13.3.4> shows the first-order differences of the series in [Table 13.3.4]; the differenced series is horizontal.

<Figure 13.3.4> 1st Order Differencing of Consumer Price Index

13.3.3 Mathematical Transformation

• If the original data of a time series are used as they are, modeling may not be easy or the model assumptions may not be satisfied. In this case, we can fit the desired model after an appropriate functional transformation, such as a log transformation. The functions commonly used for mathematical transformations are:

    Log:          log(Y_t)
    Square root:  sqrt(Y_t)
    Square:       Y_t²
    Box-Cox:      (Y_t^λ - 1)/λ if λ ≠ 0;  log(Y_t) if λ = 0

• [Table 13.3.6] shows a toy company's quarterly sales, and <Figure 13.3.5> is a plot of these data. The data are seasonal by quarter, but the dispersion of sales increases over time. It is not easy to apply a time series model to data whose dispersion increases with time. In this case, the log transformation log(Y_t) reduces the growth of dispersion over time, as shown in <Figure 13.3.5>, so that a model can be applied.
After predicting by applying a model to the log-transformed data, the exponential transformation exp(Ŷ_t) is applied to return the predictions to the original scale.

[Table 13.3.6] Quarterly Sales of a Toy Company (unit: million $)

  t   Year   Quarter     Sales
  1   2017   Quarter 1    38.0
  2          Quarter 2    53.6
  3          Quarter 3    57.5
  4          Quarter 4   200.0
  5   2018   Quarter 1    56.5
  6          Quarter 2    75.8
  7          Quarter 3    78.3
  8          Quarter 4   269.7
  9   2019   Quarter 1    70.2
 10          Quarter 2    92.7
 11          Quarter 3   101.8
 12          Quarter 4   332.6
 13   2020   Quarter 1    97.3
 14          Quarter 2   123.7
 15          Quarter 3   132.9
 16          Quarter 4   429.4
 17   2021   Quarter 1   138.3
 18          Quarter 2   167.6
 19          Quarter 3   189.9
 20          Quarter 4   545.9

<Figure 13.3.5> Log Transformation of Toy Sales

• The square root transformation is used for a similar purpose to the log transformation, and the square transformation can be used when the variance decreases with time. The Box-Cox transformation is a general family that includes these as special cases.

13.4 Regression Model and Forecasting

• If there is a trend factor that shows a continuous increase or decrease in the time series, the regression model studied in Chapter 12 can be applied. For example, if the time series shows a linear trend, the linear regression model is applied with the time series as the observed values of the random variable and time as t = 1, 2, ⋯ :

$$ Y_t = \beta_0 + \beta_1 \cdot t + \epsilon_t $$

Here ε_t is the error term with mean 0 and variance σ².

• A characteristic of the linear model is that it increases by a slope of constant magnitude over time. When the estimated regression coefficients are b_0 and b_1, the validity test of the linear regression model is the same as the method described in Chapter 12; the standard error of estimate and the coefficient of determination are often used. In a linear trend model, σ represents the degree to which observations are scattered around the estimated regression line at each time point. As an estimate of this σ, the following standard error is used:

$$ s = \sqrt{\frac{\sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2}{n-2}} $$
• A smaller standard error indicates that the observed values are close to the estimated regression line, which means that the regression model fits well. The coefficient of determination is the ratio of the regression sum of squares, RSS, which is the part of the total sum of squares, TSS, explained by the regression:

$$ R^2 = \frac{RSS}{TSS} $$

• The value of the coefficient of determination is always between 0 and 1, and the closer the value is to 1, the more densely the samples cluster around the regression line, which means that the estimated regression equation explains the observations well. As explained in Chapter 12, since it is difficult to set absolute criteria for the adequacy of the standard error or the coefficient of determination, a hypothesis test is used to determine whether the trend parameter β₁ is zero or not.

  Hypothesis:        H₀: β₁ = 0,   H₁: β₁ ≠ 0
  Test statistic:    t = b₁ / SE(b₁),   where SE(b₁) = s / √(Σ(t − t̄)²)
  Rejection region:  If |t| > t_{n−2; α/2}, reject H₀ with significance level α.

If the null hypothesis H₀ is not rejected, the model cannot be considered valid. The assumptions on the error term are tested using the residuals, the differences between the observed time series values and the predicted values; this is called residual analysis. Residual analysis usually examines whether the assumptions about the error terms, such as independence and equal variance, are satisfied by drawing a scatter plot of the residuals over time or a scatter plot of the residuals against the predicted values. If the residuals show no particular pattern around 0 and appear random in these scatter plots, the assumptions can be regarded as valid. To examine the normality assumption of the error term, draw a normal probability plot of the residuals; if the points form approximately a straight line, the assumption of a normal distribution is judged to be appropriate.
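The trend fit, standard error, R², and t statistic described above can be sketched as follows (a minimal illustration, not from the original text; the function name `linear_trend` is ours):

```python
import math

# Least-squares linear trend Y_t = b0 + b1*t with the standard error of
# estimate, R^2, and the t statistic for H0: beta1 = 0.
def linear_trend(y):
    n = len(y)
    t = list(range(1, n + 1))
    tbar, ybar = sum(t) / n, sum(y) / n
    sxx = sum((ti - tbar) ** 2 for ti in t)
    sxy = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * tbar
    sse = sum((yi - (b0 + b1 * ti)) ** 2 for ti, yi in zip(t, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    s = math.sqrt(sse / (n - 2))        # standard error of estimate
    r2 = 1 - sse / tss                  # coefficient of determination
    t_stat = b1 / (s / math.sqrt(sxx))  # b1 / SE(b1)
    return b0, b1, s, r2, t_stat

# toy data with an approximately linear trend of slope 2
b0, b1, s, r2, t_stat = linear_trend([2.1, 3.9, 6.2, 7.8, 10.1])
print(round(b1, 2), round(r2, 3))   # slope near 2, R^2 near 1
```

A large |t| (compared with t_{n−2; α/2}) leads to rejecting H₀: β₁ = 0, i.e., the trend is judged significant.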
• If the linear regression model is suitable, the predicted value Ŷ_{n+τ} = b₀ + b₁·(n+τ) can be interpreted as a point estimate for the mean of the random variable Y at time n+τ, and the confidence interval for this mean is given by the standard regression formula:

$$ \hat{Y}_{n+\tau} \pm t_{n-2;\,\alpha/2} \cdot SE(\hat{Y}_{n+\tau}), \qquad SE(\hat{Y}_{n+\tau}) = s\sqrt{\frac{1}{n} + \frac{\left((n+\tau) - \bar{t}\,\right)^2}{\sum_{t=1}^{n}(t - \bar{t})^2}} $$

• If the trend is in the form of a quadratic, cubic, or higher-order polynomial, the following multiple linear regression models can be assumed. The prediction method is similar to that of the simple linear regression model above.

  (Quadratic)  Y_t = β₀ + β₁·t + β₂·t² + ε_t
  (Cubic)      Y_t = β₀ + β₁·t + β₂·t² + β₃·t³ + ε_t

• If the trend is not a polynomial as above, the following models can also be considered. These are the same as the linear regression model if √t or log t is replaced with a new variable, and the prediction method is similar.

  (Square root)  Y_t = β₀ + β₁·√t + ε_t
  (Log)          Y_t = β₀ + β₁·log t + ε_t

• In addition, the function types to which the linear regression model can be applied after transformation are as follows:

  (Power)        Y_t = β₀ · t^{β₁} · ε_t
  (Exponential)  Y_t = β₀ · β₁^t · ε_t

In the case of these two models, the parameters should be estimated using a nonlinear regression model, but if the error term is ignored, an approximate linear model can be estimated as follows:

  (Power)        log Y_t = log β₀ + β₁ · log t
  (Exponential)  log Y_t = log β₀ + t · log β₁

• Korea's GDP from 1991 to 2020 is shown in [Table 13.4.1]. <Figure 13.4.1> shows the application of three regression models to this data. Among these models, the quadratic model has the largest value of R² = 0.9591, so it can be said to be the most suitable model for this time series. However, additional validation of the model is required.
[Table 13.4.1] GDP of Korea

  Year   GDP (billion $)    Year   GDP (billion $)
  1991       330.65         2006      1053.22
  1992       355.53         2007      1172.61
  1993       392.67         2008      1047.34
  1994       463.62         2009       943.67
  1995       566.58         2010      1143.98
  1996       610.17         2011      1253.16
  1997       569.76         2012      1278.43
  1998       383.33         2013      1370.80
  1999       497.51         2014      1484.32
  2000       576.18         2015      1465.77
  2001       547.66         2016      1499.36
  2002       627.25         2017      1623.07
  2003       702.72         2018      1725.37
  2004       793.18         2019      1651.42
  2005       934.90         2020      1638.26

<Figure 13.4.1> GDP of Korea and Three Regression Models

13.5 Exponential Smoothing Model and Forecasting

• When the time series moves along a trend, the future can be predicted well with the regression model. However, a regression model may not be appropriate for a time series that moves dynamically hour by hour, day by day, etc. In this case, a moving average model or an exponential smoothing model can be used. These models are explained for two cases: a stationary series and a linear trend.

13.5.1 Stationary Time Series

• A time series is called stationary if statistical properties such as the mean, variance, and covariance are constant over time. When a time series consists of the observed values of random variables Y_1, Y_2, ⋯, Y_n, a stationary time series follows the model below, which fluctuates around a constant value β₀:

$$ Y_t = \beta_0 + \epsilon_t, \qquad t = 1, 2, \cdots, n $$

Here β₀ is an unknown parameter and ε_t is an error term which is independent over time with mean 0 and variance σ².

A. Single Moving Average Model

In the stationary time series model, the estimate of β₀ is the mean of the data:

$$ b_0 = \bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n} $$

Using this model, the prediction τ time points ahead of the current time n, denoted Ŷ_{n+τ}, is as follows:

$$ \hat{Y}_{n+\tau} = \bar{Y}, \qquad \tau = 1, 2, \cdots $$

This is called the simple average model. The simple average model uses all observations up to the current time. However, the unknown parameter β₀ may shift slightly over time, so it would be reasonable to give more weight to recent data than to past data for prediction.
If weight is given only to the most recent m observations at the present time n and the weight of the remaining observations is set to 0, the estimate of β₀ is as follows:

$$ M_n = \frac{Y_n + Y_{n-1} + \cdots + Y_{n-m+1}}{m} $$

This is called the m-point single moving average at time n and is denoted by M_n. The single moving average is the average of the m observations adjacent to time n. Notice that the Y_t are independent of each other by assumption, but the moving averages M_t are not independent of each other; they are correlated, since adjacent averages share observations. The behavior of the single moving average varies depending on the size of m. When m is large, it becomes insensitive to fluctuations of the original time series, so it changes gradually; when m is small, it becomes sensitive to fluctuations. Therefore, when the fluctuation of the original time series is small, it is common to use a small value of m, and when the fluctuation is large, a large value of m.

Using the single moving average model at time n, the predicted value at time n+τ and its mean and variance are as follows:

$$ \hat{Y}_{n+\tau} = M_n, \qquad E(M_n) = \beta_0, \qquad Var(M_n) = \frac{\sigma^2}{m} $$

When the single moving average model is used, the 95% confidence interval of the predicted value is approximately as follows, with σ estimated by the root mean square error of the one-step forecasts:

$$ \hat{Y}_{n+\tau} \pm 1.96 \cdot \frac{\hat{\sigma}}{\sqrt{m}} $$

The monthly sales for the last two years of a furniture company are as shown in [Table 13.5.1]; the 6-point moving average was obtained and the residual between the raw data and the one-step-ahead predicted value was calculated. <Figure 13.5.1> shows the corresponding time series. This time series fluctuates up and down around approximately 95, and such a time series is called a stationary time series. When m = 6, the moving average for the first 5 time points cannot be obtained.
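The moving-average forecasts and residuals for this example can be sketched as follows (a supplementary illustration using the sales figures of [Table 13.5.1]; the helper names are ours):

```python
# 6-point single moving average forecast for the furniture sales of [Table 13.5.1]
sales = [95, 100, 87, 123, 90, 96, 75, 78, 106, 104, 89, 83,
         118, 86, 86, 112, 85, 101, 135, 120, 76, 115, 90, 92]
m = 6

def moving_avg(y, t, m):
    """M_t = (Y_t + ... + Y_{t-m+1}) / m, with t counted from 1."""
    return sum(y[t - m:t]) / m

# one-step forecasts Yhat_{t+1} = M_t and residuals e_{t+1} = Y_{t+1} - M_t
residuals = [sales[t] - moving_avg(sales, t, m) for t in range(m, len(sales))]
mse = sum(e * e for e in residuals) / len(residuals)

print(moving_avg(sales, 6, m))    # 98.5, the moving average at t = 6
print(residuals[0])               # -23.5, the residual at t = 7
print(round(mse, 2))              # about 331.2, the MSE quoted in the text
```

The last moving average, `moving_avg(sales, 24, 6)` = 104.67, is the forecast for each of the next three months.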
The moving average at time 6 is as follows:

$$ M_6 = \frac{95 + 100 + 87 + 123 + 90 + 96}{6} = 98.50 $$

Therefore, the one-step prediction made at time 6 is Ŷ₇ = M₆ = 98.50, and the residual at time 7 is as follows:

$$ e_7 = Y_7 - \hat{Y}_7 = 75 - 98.50 = -23.50 $$

In the same way, the moving averages of the remaining time points, the one-step-ahead predicted values, and the residuals are as shown in [Table 13.5.1], so the mean square error is as follows:

$$ MSE = \frac{1}{18}\sum_{t=7}^{24} e_t^2 = 331.22 $$

The forecast of sales for each of the next three months is the last moving average, M₂₄ = 104.67, and the 95% confidence interval for the forecast is as follows:

$$ 104.67 \pm 1.96 \cdot \frac{\sqrt{331.22}}{\sqrt{6}} \;\Rightarrow\; 104.67 \pm 14.56 \;\Rightarrow\; [90.10,\; 119.23] $$

※ Moving average at initial period
Since the m-point single moving average cannot be obtained before time m, the prediction model cannot be applied there. When the time series is long this may not be a big problem, but when the number of data is small, it can affect the prediction. To solve this problem, the moving average in the initial period can be obtained as follows, up to time m−1:

$$ M_t = \frac{Y_1 + Y_2 + \cdots + Y_t}{t}, \qquad t = 1, 2, \cdots, m-1 $$

[Table 13.5.1] Monthly Sales of a Furniture Company, 6-point Moving Average, One-step Forecast and Residuals

  t   Sales (million $)   6-pt Moving Average M_t   One-step Forecast   Residual
  1     95                       -                        -                -
  2    100                       -                        -                -
  3     87                       -                        -                -
  4    123                       -                        -                -
  5     90                       -                        -                -
  6     96                     98.50                      -                -
  7     75                     95.17                    98.50           -23.50
  8     78                     91.50                    95.17           -17.17
  9    106                     94.67                    91.50            14.50
 10    104                     91.50                    94.67             9.33
 11     89                     91.33                    91.50            -2.50
 12     83                     89.17                    91.33            -8.33
 13    118                     96.34                    89.17            28.83
 14     86                     97.67                    96.34           -10.33
 15     86                     94.34                    97.67           -11.67
 16    112                     95.67                    94.34            17.67
 17     85                     95.00                    95.67           -10.67
 18    101                     98.00                    95.00             6.00
 19    135                    100.84                    98.00            37.00
 20    120                    106.50                   100.84            19.17
 21     76                    104.84                   106.50           -30.50
 22    115                    105.34                   104.84            10.17
 23     90                    106.17                   105.34           -15.33
 24     92                    104.67                   106.17           -14.17

<Figure 13.5.1> Monthly Sales of a Furniture Company with 6-point Moving Average and One-step Forecast

B. Single Exponential Smoothing Model

In the single moving average model, the same weight is given only to the latest m observations, and the previous observations are completely ignored by setting their weight to 0.
The single exponential smoothing method compensates for this shortcoming of the moving average model by assigning weights to all observations when predicting future values from past observations, while giving more weight to recent data. The single exponential smoothing model uses the exponential smoothing value as the predicted value. It calculates a weighted average of the exponential smoothing value at the immediately preceding time point and the observation at the current time point. Assuming that the exponential smoothing value at time t−1 is S_{t−1} and α is a real number between 0 and 1, the single exponential smoothing value is defined as follows:

$$ S_t = \alpha Y_t + (1 - \alpha) S_{t-1}, \qquad t = 2, 3, \cdots, n $$

Here α is called the smoothing constant, and the single exponential smoothing value S_t is the weighted average giving weight α to the most recent observation and weight 1−α to the exponential smoothing value at time t−1. You can better understand the meaning of exponential smoothing by writing out the recursive equation:

$$ S_t = \alpha Y_t + \alpha(1-\alpha) Y_{t-1} + \alpha(1-\alpha)^2 Y_{t-2} + \cdots + \alpha(1-\alpha)^{t-2} Y_2 + (1-\alpha)^{t-1} S_1 $$

In other words, in the single exponential smoothing value S_t, the most recent observation is given weight α, the next most recent observation α(1−α), the next α(1−α)², and so on, gradually smaller weights. Therefore, if α is small, the current observation is given a small weight and the exponential smoothing value is insensitive to fluctuations of the time series; if α is large, the current observation is given a large weight and the exponential smoothing value is sensitive to fluctuations of the time series. In general, a value between 0.1 and 0.3 is often used for α. In order to compute the single exponential smoothing values, an initial smoothing value S₁ is required; the first observation, the sample average of several initial data, or the overall sample average can be used.
• The exponential smoothing method has the advantage of being less affected by extreme points or interventions than the ARIMA model and of being easy to use, although the selection of the smoothing constant is somewhat arbitrary and it is difficult to obtain an exact prediction interval. The predicted value at time n+τ using the single exponential smoothing model, and its approximate mean and variance, are as follows:

$$ \hat{Y}_{n+\tau} = S_n, \qquad E(S_n) = \beta_0, \qquad Var(S_n) \approx \sigma^2 \cdot \frac{\alpha}{2-\alpha} $$

Therefore, when the single exponential smoothing model is used, the 95% interval estimate of the predicted value is approximately as follows, with the error variance estimated from the mean square error of the one-step forecasts:

$$ \hat{Y}_{n+\tau} \pm 1.96\sqrt{MSE} $$

For the data of [Table 13.5.1], let us predict sales for the next three months with a single exponential smoothing model with smoothing constant α = 0.1. Using the first observed value as the initial smoothing value, that is, S₁ = Y₁ = 95, the exponential smoothing values for the first few time points are as follows:

  S₂ = 0.1 × 100 + 0.9 × 95.00 = 95.50
  S₃ = 0.1 × 87  + 0.9 × 95.50 = 94.65
  S₄ = 0.1 × 123 + 0.9 × 94.65 = 97.48

At each time point, the prediction one step ahead is Ŷ_{t+1} = S_t; hence the residuals using these estimated values are as follows:

  e₂ = Y₂ − S₁ = 100 − 95.00 =  5.00
  e₃ = Y₃ − S₂ =  87 − 95.50 = −8.50
  e₄ = Y₄ − S₃ = 123 − 94.65 = 28.35

In the same way, the single exponential smoothing values of the remaining time points, the one-step-ahead predicted values, and the residuals are as shown in [Table 13.5.2]. The mean square error is

$$ MSE = \frac{1}{n}\sum_{t} \left(Y_t - \hat{Y}_t\right)^2 $$

In terms of the mean square error, the MSE of the exponential smoothing model is smaller than the value 331.22 of the 6-point single moving average model, so it can be said that the exponential smoothing model fits better. The forecast of sales for each of the next three months is the last smoothed value, S₂₄ = 98.66, and the 95% confidence interval for the forecast is as follows:

$$ 98.66 \pm 33.39 \;\Rightarrow\; [65.27,\; 132.05] $$

[Table 13.5.2] summarizes these calculations, and <Figure 13.5.2> shows the one-step-ahead predictions and the prediction for the next 3 months using the single exponential smoothing model with α = 0.1.
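A minimal sketch of the recursion S_t = αY_t + (1−α)S_{t−1} on the same sales data (supplementary; the function name `exp_smooth` is ours), reproducing the values of [Table 13.5.2]:

```python
# Single exponential smoothing S_t = a*Y_t + (1-a)*S_{t-1} with a = 0.1,
# initialized with the first observation (S_1 = Y_1 = 95).
def exp_smooth(y, a, s1):
    s = [s1]
    for v in y[1:]:
        s.append(a * v + (1 - a) * s[-1])
    return s

sales = [95, 100, 87, 123, 90, 96, 75, 78, 106, 104, 89, 83,
         118, 86, 86, 112, 85, 101, 135, 120, 76, 115, 90, 92]
s = exp_smooth(sales, 0.1, sales[0])

print(round(s[1], 2))    # 95.5  = S_2 of [Table 13.5.2]
print(round(s[2], 2))    # 94.65 = S_3
print(round(s[-1], 2))   # 98.66 = S_24, the forecast for the next months
```

The one-step forecast at time t+1 is simply `s[t-1]` (i.e., S_t), so the residuals of [Table 13.5.2] are `sales[t] - s[t-1]`.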
[Table 13.5.2] Exponential Smoothing with α = 0.1, One-step Forecast and Residual

  t   Sales (million $)   Exponential Smoothing S_t   One-step Forecast   Residual
  1     95                      95.00                      95.00             0.00
  2    100                      95.50                      95.00             5.00
  3     87                      94.65                      95.50            -8.50
  4    123                      97.48                      94.65            28.35
  5     90                      96.74                      97.48            -7.48
  6     96                      96.66                      96.74            -0.74
  7     75                      94.50                      96.66           -21.66
  8     78                      92.85                      94.50           -16.50
  9    106                      94.16                      92.85            13.15
 10    104                      95.15                      94.16             9.84
 11     89                      94.53                      95.15            -6.15
 12     83                      93.38                      94.53           -11.53
 13    118                      95.84                      93.38            24.62
 14     86                      94.86                      95.84            -9.84
 15     86                      93.97                      94.86            -8.86
 16    112                      95.77                      93.97            18.03
 17     85                      94.70                      95.77           -10.77
 18    101                      95.33                      94.70             6.30
 19    135                      99.29                      95.33            39.67
 20    120                     101.36                      99.29            20.71
 21     76                      98.83                     101.36           -25.36
 22    115                     100.45                      98.83            16.17
 23     90                      99.40                     100.45           -10.45
 24     92                      98.66                      99.40            -7.40

<Figure 13.5.2> Exponential Smoothing with α = 0.1 and One-step Forecast

※ Initial value of exponential smoothing
Since the exponential smoothing value at the initial time point cannot be obtained from the recursion, the following three choices of S₁ are commonly used:
  1) The first observation, i.e., S₁ = Y₁
  2) A partial average of the first k observations, i.e., S₁ = (Y₁ + Y₂ + ⋯ + Y_k)/k
  3) The mean of the entire series, i.e., S₁ = (Y₁ + Y₂ + ⋯ + Y_n)/n

※ Initial smoothing constant
The same smoothing constant can be applied to the whole time series, but the following method is also used to reduce the effect of the initial value S₁: use α_t = 1/t in the early periods, until 1/t reaches the chosen constant α.

13.5.2 Linear Trend Time Series

A. Double Moving Average Model

In the previous section, we examined how the single moving average model can be applied to a stationary time series. What would happen if the single moving average model were applied to a time series with a linear trend? For a time series with linear trend Y_t = β₀ + β₁·t + ε_t, the m-point single moving average at time t is as follows:

$$ M_t = \frac{Y_t + Y_{t-1} + \cdots + Y_{t-m+1}}{m} $$

Its expected value can be shown to be:

$$ E(M_t) = \beta_0 + \beta_1 t - \beta_1 \frac{m-1}{2} $$

That is, in the case of a linear trend model, the single moving average is biased by β₁(m−1)/2.
For example, if the consumer price index, which has a linear trend, is predicted one step ahead using the 5-point single moving average, the result is as shown in <Figure 13.5.3>. It can be seen that the predicted values using M_t systematically underestimate the time series Y_t.

<Figure 13.5.3> 5-pt Moving Average of Consumer Price Index with Linear Trend

In the case of a linear trend, one way to eliminate the bias of the single moving average model is the double moving average, which takes the moving average of the single moving averages. The m-point double moving average at time t and its expected value are as follows:

$$ M_t^{(2)} = \frac{M_t + M_{t-1} + \cdots + M_{t-m+1}}{m}, \qquad E\!\left(M_t^{(2)}\right) = \beta_0 + \beta_1 t - \beta_1 (m-1) $$

Since E(M_t) and E(M_t^{(2)}) involve the same two parameters, the level and slope at time t can be estimated by solving this system of two equations:

$$ \hat{\beta}_0 + \hat{\beta}_1 t = 2M_t - M_t^{(2)}, \qquad \hat{\beta}_1 = \frac{2}{m-1}\left(M_t - M_t^{(2)}\right) $$

Therefore, the predicted value at time t+τ using the double moving average at time t is as follows:

$$ \hat{Y}_{t+\tau} = \left(2M_t - M_t^{(2)}\right) + \tau \cdot \frac{2}{m-1}\left(M_t - M_t^{(2)}\right) $$

Such a double moving average model can be regarded as a kind of heuristic method. That is, although logical, it is not based on any optimization such as the least squares method. However, it can be approximated by the least squares method, which we omit in this book.

[Table 13.5.3] is a calculation table for predicting the consumer price index using the 5-point double moving average model. Note that the third column is the 5-point single moving average M_t; the single moving average cannot be calculated for time points 1 to 4. The fourth column is the 5-point double moving average M_t^{(2)}, which cannot be calculated until 5 single moving averages are available, that is, for time points 1 to 8. Using M₉ = 104.50 and M₉^{(2)} = 104.028, the one-step prediction made at time 9 is as follows:

$$ \hat{Y}_{10} = 2 \times 104.50 - 104.028 + 1 \times \frac{2}{4}(104.50 - 104.028) = 105.208 $$

The predicted values calculated in the same way are shown in the fifth column.
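The double moving average forecast above can be sketched as follows, using the CPI data of [Table 13.5.3] (supplementary code; function names are ours):

```python
# m-point double moving average forecast for a linear-trend series,
# applied to the consumer price index of [Table 13.5.3] with m = 5.
cpi = [102.3, 102.8, 103.7, 104.1, 104.0, 104.3, 104.5, 104.9, 104.8, 104.8,
       104.2, 104.4, 105.0, 105.5, 106.1, 106.7, 107.1, 107.0, 106.7, 107.4,
       108.0, 107.7, 107.8, 108.3]
m = 5

def single_ma(y, m):
    # element i of the result is M_{i+m} (the first average needs m values)
    return [sum(y[t - m:t]) / m for t in range(m, len(y) + 1)]

M = single_ma(cpi, m)    # M_5 .. M_24
M2 = single_ma(M, m)     # M2_9 .. M2_24 (needs 5 single moving averages first)

def forecast(Mt, M2t, m, tau):
    """Yhat_{t+tau} = (2*M_t - M2_t) + tau * 2*(M_t - M2_t)/(m-1)."""
    return (2 * Mt - M2t) + tau * 2 * (Mt - M2t) / (m - 1)

# first possible one-step forecast: made at t = 9 for t = 10
print(round(forecast(M[4], M2[0], m, 1), 4))   # 105.208, as in [Table 13.5.3]
```

Sliding the same call along `M` and `M2` reproduces the whole forecast column of [Table 13.5.3].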
[Table 13.5.3] Double Moving Average of Consumer Price Index and One-step Forecast

  t   Consumer Price Index   5-pt Single MA M_t   5-pt Double MA M_t⁽²⁾   One-step Forecast
  1        102.3                   -                    -                     -
  2        102.8                   -                    -                     -
  3        103.7                   -                    -                     -
  4        104.1                   -                    -                     -
  5        104.0                 103.38                 -                     -
  6        104.3                 103.78                 -                     -
  7        104.5                 104.12                 -                     -
  8        104.9                 104.36                 -                     -
  9        104.8                 104.50              104.028                  -
 10        104.8                 104.66              104.284               105.2080
 11        104.2                 104.64              104.456               105.2240
 12        104.4                 104.62              104.556               104.9160
 13        105.0                 104.64              104.612               104.7160
 14        105.5                 104.78              104.668               104.6820
 15        106.1                 105.04              104.744               104.9480
 16        106.7                 105.54              104.924               105.4840
 17        107.1                 106.08              105.216               106.4640
 18        107.0                 106.48              105.584               107.3760
 19        106.7                 106.72              105.972               107.8240
 20        107.4                 106.98              106.360               107.8420
 21        108.0                 107.24              106.700               107.9100
 22        107.7                 107.36              106.956               108.0500
 23        107.8                 107.52              107.164               107.9660
 24        108.3                 107.84              107.388               108.0540

<Figure 13.5.4> Forecast using Double Moving Average of Consumer Price Index

B. Holt Double Exponential Smoothing Model

Holt proposed a model for a linear trend time series Y_t = β₀ + β₁·t + ε_t which uses a smoothing constant α for the level and a smoothing constant β for the trend. This is called Holt's linear trend exponential smoothing model, or the two-parameter double exponential smoothing model. Let S₀ and b₀ be the initial values of the intercept (level) and slope. The level, trend, and predicted values are as follows:

  Level:           S_t = α·Y_t + (1 − α)(S_{t−1} + b_{t−1}),    t = 1, 2, ⋯, n
  Trend:           b_t = β·(S_t − S_{t−1}) + (1 − β)·b_{t−1},   t = 1, 2, ⋯, n
  Predicted value: Ŷ_{t+1} = S_t + b_t

That is, the level is the weighted average of the current observation Y_t and the one-step-ahead prediction S_{t−1} + b_{t−1}, and the trend is the weighted average of the level difference between times t and t−1 (S_t − S_{t−1}) and the trend at time t−1. This model requires initial values for the level and slope; the intercept and slope of a simple linear regression of all observations are widely used as initial estimates. As with the single exponential smoothing model, values between 0.1 and 0.3 are often used for the smoothing constants α and β.
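A minimal sketch of Holt's recursions (supplementary; the function name `holt` is ours). As a sanity check it is run on an exactly linear series, where, given the correct initial level and slope, the method tracks and forecasts the line exactly, regardless of the smoothing constants:

```python
# Holt's two-parameter double exponential smoothing:
#   S_t = a*Y_t + (1-a)*(S_{t-1} + b_{t-1})
#   b_t = g*(S_t - S_{t-1}) + (1-g)*b_{t-1}
#   Yhat_{n+tau} = S_n + tau*b_n
def holt(y, a, g, s0, b0):
    s, b = s0, b0
    for v in y:
        s_prev = s
        s = a * v + (1 - a) * (s + b)       # level update
        b = g * (s - s_prev) + (1 - g) * b  # trend update
    return s, b

# sanity check on an exactly linear series Y_t = 100 + 2t
y = [100 + 2 * t for t in range(1, 11)]
s, b = holt(y, 0.1, 0.1, 100, 2)
print(s + 1 * b, s + 3 * b)   # forecasts for t = 11 and t = 13 (122 and 126)
```

Applying the same function to the CPI data with the initial values of [Table 13.5.4] reproduces that table's level and trend columns.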
The predicted value for time n+τ made at time n using the trend exponential smoothing model is as follows:

$$ \hat{Y}_{n+\tau} = S_n + \tau \cdot b_n $$

Such a trend exponential smoothing model is also a kind of heuristic method: although logical, it is not based on any optimization such as the least squares method. The simple linear regression fitted to all the data in [Table 13.5.4] gives the initial values S₀ = 102.57 and b₀ = 0.234. [Table 13.5.4] is the calculation table for predicting the consumer price index with the Holt double exponential smoothing model using these initial values. The third column is the level S_t, the fourth column is the trend b_t, and the fifth column is the prediction obtained one step before each time point. The forecast of the consumer price index for the next three months is therefore as follows:

  t = 25 :  Ŷ₂₅ = S₂₄ + 1 × b₂₄ = 108.20 + 1 × 0.237 ≈ 108.44
  t = 26 :  Ŷ₂₆ = S₂₄ + 2 × b₂₄ = 108.20 + 2 × 0.237 ≈ 108.67
  t = 27 :  Ŷ₂₇ = S₂₄ + 3 × b₂₄ = 108.20 + 3 × 0.237 ≈ 108.91

[Table 13.5.4] Forecasting using Holt Double Exponential Smoothing Model of Consumer Price Index

  t   Consumer Price Index   Level S_t   Trend b_t   One-step Forecast
  0          -                102.57       0.234          -
  1        102.3              102.76       0.229        102.81
  2        102.8              102.97       0.227        102.99
  3        103.7              103.25       0.232        103.20
  4        104.1              103.54       0.239        103.48
  5        104.0              103.80       0.241        103.78
  6        104.3              104.07       0.243        104.04
  7        104.5              104.33       0.245        104.31
  8        104.9              104.61       0.249        104.58
  9        104.8              104.85       0.248        104.86
 10        104.8              105.07       0.245        105.10
 11        104.2              105.20       0.234        105.31
 12        104.4              105.33       0.224        105.44
 13        105.0              105.50       0.218        105.56
 14        105.5              105.70       0.216        105.72
 15        106.1              105.93       0.218        105.91
 16        106.7              106.20       0.223        106.15
 17        107.1              106.49       0.230        106.43
 18        107.0              106.75       0.233        106.72
 19        106.7              106.96       0.230        106.98
 20        107.4              107.21       0.232        107.19
 21        108.0              107.50       0.238        107.44
 22        107.7              107.73       0.237        107.73
 23        107.8              107.95       0.236        107.97
 24        108.3              108.20       0.237        108.19

<Figure 13.5.5> shows the predicted values using Holt's double exponential smoothing model.
<Figure 13.5.5> Forecasting using Holt Double Exponential Smoothing Model of CPI

13.6 Seasonal Model and Forecasting

• As seasonal time series models, a multiplicative model using the centered moving average and the Holt-Winters model are introduced.

13.6.1 Seasonal Multiplicative Model

• Assume that a time series with seasonal period L can be expressed as the product of a trend component (T_t), a seasonal component (S_t), and an irregular component (I_t):

$$ Y_t = T_t \cdot S_t \cdot I_t $$

• The ratio-to-moving-average method removes the trend and irregular components to obtain the seasonal component as follows:

(Step 1) For the time series, find the L-point centered moving average. This moving average represents the trend component, with the seasonal and irregular components removed from the time series.
(Step 2) Divide the time series by the trend component obtained in Step 1. This value, Y_t / T_t = S_t · I_t, contains the seasonal and irregular components and is called the seasonal ratio.
(Step 3) Calculate the trimmed average of the seasonal ratios of each season obtained in Step 2. This gives the seasonal index, but a normalization should be performed so that the sum of the seasonal indices equals L.

• After obtaining the seasonal indices in this way, divide the original time series by the seasonal index. The result is called a deseasonalized time series:

  Deseasonalized time series:  Y_t / S_t = T_t · I_t

• An appropriate time series model is applied to this deseasonalized data to predict future values. Multiplying by the corresponding seasonal index then gives the final predicted value for the desired season.

• [Table 13.6.1] shows a company's quarterly sales. Since the seasonal period is 4, the 4-point centered moving average is shown in column ④ of the table. Dividing the original time series by the 4-point centered moving average yields the seasonal ratios in column ⑤.
[Table 13.6.1] Quarterly Sales of a Company

 ① t   Year  Quarter    ② Sales   ③ 4-pt MA   ④ Centered 4-pt MA   ⑤ Seasonal Ratio   ⑥ Deseasonalized Data
  1    2018  Quarter 1     75         -             -                   -                 61.292
  2          Quarter 2     60         -             -                   -                 64.915
  3          Quarter 3     54       62.000        63.375              0.852               63.759
  4          Quarter 4     59       64.750        65.375              0.902               58.700
  5    2019  Quarter 1     86       66.000        67.125              1.281               70.281
  6          Quarter 2     65       68.250        70.875              0.917               70.324
  7          Quarter 3     63       73.500        74.000              0.851               74.385
  8          Quarter 4     80       74.500        75.375              1.061               79.593
  9    2020  Quarter 1     90       76.250        76.625              1.175               73.550
 10          Quarter 2     72       77.000        77.625              0.928               77.898
 11          Quarter 3     66       78.250        79.500              0.830               77.928
 12          Quarter 4     85       80.750        81.500              1.043               84.567
 13    2021  Quarter 1    100       82.250        83.000              1.205               81.722
 14          Quarter 2     78       83.750        84.750              0.920               84.389
 15          Quarter 3     72       85.750          -                   -                 85.012
 16          Quarter 4     93         -             -                   -                 92.527

• [Table 13.6.2] arranges the seasonal ratios by year and quarter. If the maximum and minimum values are removed for each quarter and the average of the rest is taken (trimmed average), we obtain column ⑥. Since the sum of these values is 4.0197, the indices in column ⑥ are normalized (multiplied by 4/4.0197) to give the seasonal indices in column ⑦:

  S₁ = 1.1991,  S₂ = 0.9159,  S₃ = 0.8472,  S₄ = 1.0378

[Table 13.6.2] Seasonal Index

 ① Quarter      ② 2018   ③ 2019   ④ 2020   ⑤ 2021   ⑥ Trimmed Mean of Seasonal Ratio   ⑦ Seasonal Index S_t
 1st Quarter       -       1.281     1.175     1.205        1.2050                          1.1991
 2nd Quarter       -       0.917     0.928     0.920        0.9204                          0.9159
 3rd Quarter     0.852     0.851     0.830       -          0.8514                          0.8472
 4th Quarter     0.902     1.061     1.043       -          1.0429                          1.0378
                                                sum         4.0197                          4.0000

Column ⑥ of [Table 13.6.1] shows the deseasonalized data obtained by dividing the original data by the seasonal index of each quarter. The linear regression line fitted to this deseasonalized data (recomputed here from column ⑥) is approximately T̂_t = 58.72 + 1.92·t (<Figure 13.6.1>). Therefore, multiplying the trend forecast by the seasonal index of each quarter, the forecast for the next year is approximately as follows:

  Time 17 :  Ŷ₁₇ = (58.72 + 1.92 × 17) × 1.1991 ≈ 109.6
  Time 18 :  Ŷ₁₈ = (58.72 + 1.92 × 18) × 0.9159 ≈ 85.4
  Time 19 :  Ŷ₁₉ = (58.72 + 1.92 × 19) × 0.8472 ≈ 80.7
  Time 20 :  Ŷ₂₀ = (58.72 + 1.92 × 20) × 1.0378 ≈ 100.8

<Figure 13.6.2> is a graph of the seasonal forecasts.
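Steps 1 and 2 of the ratio-to-moving-average method can be sketched as follows for the sales of [Table 13.6.1] (supplementary code; Step 3, the trimmed average and normalization, follows the same pattern):

```python
# Steps 1-2 of the ratio-to-moving-average method for [Table 13.6.1]
sales = [75, 60, 54, 59, 86, 65, 63, 80, 90, 72, 66, 85, 100, 78, 72, 93]
L = 4   # seasonal period

# Step 1: L-point moving averages, then center them between adjacent averages
ma = [sum(sales[i:i + L]) / L for i in range(len(sales) - L + 1)]
cma = [(ma[i] + ma[i + 1]) / 2 for i in range(len(ma) - 1)]   # trend, t = 3..14

# Step 2: seasonal ratios Y_t / CMA_t (seasonal x irregular component)
ratios = [sales[i + L // 2] / cma[i] for i in range(len(cma))]

print(round(cma[0], 3))      # 63.375, the first centered moving average (t = 3)
print(round(ratios[0], 3))   # 0.852, the first seasonal ratio (Quarter 3, 2018)
```

Note that the centered moving average is defined only for t = 3, …, 14 here, so there are 12 seasonal ratios — three per quarter — from which the trimmed averages of [Table 13.6.2] are taken.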
<Figure 13.6.1> Forecasting Model of Deseasonalized Sales of a Company

<Figure 13.6.2> Forecasting of Quarterly Sales of a Company

13.6.2 Holt-Winters Model

• Assume that a time series with seasonal period L is observed over m cycles as follows:

             Season 1       Season 2      …    Season L
  --------------------------------------------------------
  Cycle 1    Y₁             Y₂            …    Y_L
  Cycle 2    Y_{L+1}        Y_{L+2}       …    Y_{2L}
  …          …              …             …    …
  Cycle m    Y_{(m−1)L+1}   Y_{(m−1)L+2}  …    Y_{mL}
  --------------------------------------------------------

• The Holt-Winters model is an extension of Holt's linear double exponential smoothing method studied in the previous section to a seasonal model. It consists of a level component S_t, a trend component b_t, and a seasonal component I_t. There are an additive model and a multiplicative model, but only the multiplicative model is introduced here:

  Level:    S_t = α·(Y_t / I_{t−L}) + (1 − α)(S_{t−1} + b_{t−1})
  Trend:    b_t = β·(S_t − S_{t−1}) + (1 − β)·b_{t−1}
  Seasonal: I_t = γ·(Y_t / S_t) + (1 − γ)·I_{t−L}
  Forecast: Ŷ_{t+τ} = (S_t + τ·b_t)·I_{t+τ−L},   τ = 1, 2, ⋯, L

(For τ > L, the seasonal index of the most recent corresponding season is used.) Here S_t is the time series level: the exponential smoothing of the current deseasonalized observation (Y_t / I_{t−L}) and the value predicted one step earlier (S_{t−1} + b_{t−1}). b_t is the slope: the exponential smoothing of the level change between the current and previous time points (S_t − S_{t−1}) and the previous slope b_{t−1}. I_t is the seasonal index: the exponential smoothing of the current seasonal component (Y_t / S_t) and the seasonal component of the previous cycle, I_{t−L}.

• [Table 13.6.3] calculates the exponential smoothing values of the level, slope, and seasonal indices by applying the Holt-Winters model with α = 0.3, β = 0.3, γ = 0.3 to the quarterly sales of the company. The last column is the one-step prediction at each time point, Ŷ_t = (S_{t−1} + b_{t−1})·I_{t−L}. The initial values S₀ and b₀ are the intercept and slope of the linear regression model fitted to all the data, and the initial values of the seasonal indices are the seasonal indices obtained by the multiplicative decomposition in Section 13.6.1.
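A sketch of the multiplicative Holt-Winters recursion (supplementary code; the function name is ours). One detail is inferred from [Table 13.6.3]: to reproduce the printed values, the seasonal update divides Y_t by the one-step-ahead level S_{t−1}+b_{t−1} rather than by S_t — a minor variant of the formulas above:

```python
# Multiplicative Holt-Winters with period L = 4 and a = b = g = 0.3, applied
# to the quarterly sales of [Table 13.6.3].  Initial level/slope (61.2, 1.61)
# and seasonal indices come from the table; the seasonal update divides Y_t
# by the one-step-ahead level, which reproduces the printed values.
def holt_winters(y, L, a, b, g, s0, b0, seas0):
    s, slope, seas = s0, b0, list(seas0)   # seas[t % L] holds the latest index
    for t, v in enumerate(y):
        pred = s + slope                              # S_{t-1} + b_{t-1}
        s_new = a * v / seas[t % L] + (1 - a) * pred  # level
        slope = b * (s_new - s) + (1 - b) * slope     # trend
        seas[t % L] = g * v / pred + (1 - g) * seas[t % L]  # seasonal index
        s = s_new
    return s, slope, seas

sales = [75, 60, 54, 59, 86, 65, 63, 80, 90, 72, 66, 85, 100, 78, 72, 93]
s, slope, seas = holt_winters(sales, 4, 0.3, 0.3, 0.3,
                              61.2, 1.61, [1.1991, 0.9159, 0.8472, 1.0378])

# forecasts for the next year: (S_16 + tau*b_16) * index of the matching season
for tau in range(1, 5):
    print(round((s + tau * slope) * seas[(len(sales) + tau - 1) % 4], 2))
```

The printed forecasts are approximately 109.0, 85.1, 79.1, and 99.5, matching the next-year forecasts of <Figure 13.6.3>.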
[Table 13.6.3] Holt-Winters Forecasting Model of Quarterly Sales

  t    Year  Quarter    Sales   Level S_t   Slope b_t   Seasonal I_t   One-step Forecast
 −3      -      -         -         -          -          1.1991            -
 −2      -      -         -         -          -          0.9159            -
 −1      -      -         -         -          -          0.8472            -
  0      -      -         -      61.2000     1.6100       1.0378            -
  1    2018  Quarter 1    75     62.7312     1.5863       1.1976          75.315
  2          Quarter 2    60     64.6753     1.6937       0.9210          58.908
  3          Quarter 3    54     65.5795     1.4568       0.8371          56.230
  4          Quarter 4    59     63.9809     0.5402       0.9905          69.570
  5    2019  Quarter 1    86     66.7081     1.1963       1.2382          77.270
  6          Quarter 2    65     68.7061     1.4368       0.9319          62.539
  7          Quarter 3    63     71.6766     1.8969       0.8555          58.720
  8          Quarter 4    80     75.7320     2.5445       1.0195          72.874
  9    2020  Quarter 1    90     76.5997     2.0414       1.2117          96.920
 10          Quarter 2    72     78.2283     1.9176       0.9270          73.282
 11          Quarter 3    66     79.2477     1.6481       0.8459          68.561
 12          Quarter 4    85     81.6382     1.8708       1.0289          82.477
 13    2021  Quarter 1   100     83.2158     1.7829       1.2074         101.184
 14          Quarter 2    78     84.7427     1.7061       0.9242          78.791
 15          Quarter 3    72     86.0501     1.5865       0.8420          73.124
 16          Quarter 4    93     88.4618     1.8341       1.0386          90.169

• <Figure 13.6.3> shows the Holt-Winters forecast for the next year, calculated as follows:

  Ŷ₁₇ = (S₁₆ + 1 × b₁₆) × I₁₃ = (88.4618 + 1 × 1.8341) × 1.2074 ≈ 109.02
  Ŷ₁₈ = (S₁₆ + 2 × b₁₆) × I₁₄ = (88.4618 + 2 × 1.8341) × 0.9242 ≈ 85.15
  Ŷ₁₉ = (S₁₆ + 3 × b₁₆) × I₁₅ = (88.4618 + 3 × 1.8341) × 0.8420 ≈ 79.12
  Ŷ₂₀ = (S₁₆ + 4 × b₁₆) × I₁₆ = (88.4618 + 4 × 1.8341) × 1.0386 ≈ 99.50

<Figure 13.6.3> Holt-Winters Forecasting of Quarterly Sales

Exercise

For the following exercises (13.1 – 13.4): draw a graph of the time series data, apply an appropriate smoothing method or transformation, and find an appropriate prediction model to predict the next year.

13.1 The following table provides data on the number of items damaged in shipment during 2001 – 2014 for a manufacturer.

  Year   Items    Year   Items
  2001    533     2008    291
  2002    373     2009    228
  2003    132     2010    204
  2004    555     2011    349
  2005    168     2012    234
  2006    281     2013    209
  2007    175     2014    176

13.2 The following table shows the sales volume (in thousands of dollars) of a retail store between 2001 and 2014.
  Year   Sales     Year   Sales
  2001     815     2008   12,529
  2002   1,276     2009   12,824
  2003   4,752     2010   13,777
  2004   7,535     2011   15,379
  2005  10,122     2012   18,705
  2006   9,642     2013   17,632
  2007  14,100     2014   16,571

13.3 The following table shows the number of items repaired during a company's warranty period between 2001 and 2014.

  Year   Items    Year   Items
  2001    749     2008    611
  2002    709     2009    600
  2003    700     2010    574
  2004    678     2011    559
  2005    611     2012    543
  2006    641     2013    534
  2007    631     2014    524

13.4 The following table shows the annual sales (unit: billion $) of a company for 11 years.

  Year   Sales    Year   Sales
  2012     12     2018     20
  2013     14     2019     22
  2014     18     2020     27
  2015     20     2021     24
  2016     18     2022     30
  2017     16

13.5 The following data show the prices of silver and crude oil between 2000 and 2015. Find the percentage changes of the silver and crude oil prices and the price indices, and overlay them in one figure.

  Year   Silver ($/ounce)   Crude Oil ($/barrel)    Year   Silver    Crude Oil
  2000       1.771                1.80              2008    5.440      15.40
  2001       1.546                2.18              2009   11.090      18.00
  2002       1.684                2.48              2010   20.633      28.00
  2003       2.558                5.18              2011   10.481      32.00
  2004       4.708               10.46              2012    7.950      34.00
  2005       4.419               11.51              2013   11.439      30.00
  2006       4.353               11.51              2014    8.141      26.00
  2007       4.620               12.70              2015    6.192      26.00

13.6 The following table shows the number of skis sold each month by a sports merchandise seller in 2017 – 2021.
  1) Predict the next year with a multiplicative seasonal model.
  2) Predict the next year using the Holt-Winters seasonal model.

  Month   2017   2018   2019   2020   2021
    1       0      3      9     13      4
    2       2      0      2      4     12
    3      10      5     46     56      6
    4       4      4     11     30     10
    5      89     14     14     90     17
    6      33     23     30     20     32
    7      11      7     22     15     24
    8       4     11      4     11      9
    9      17     11      7      6     10
   10       5      4      4      5      5
   11      17      4      0      1     17
   12       0      8      2      7      1