SSTS012: Intro to Statistical Inference Study Guide

SSTS012 2019 STUDY GUIDE

FACULTY OF SCIENCE AND AGRICULTURE
SCHOOL OF MATHEMATICAL AND COMPUTER SCIENCES
DEPARTMENT OF STATISTICS AND OPERATIONS RESEARCH

INTRODUCTION TO STATISTICAL INFERENCE (SSTS012)
STUDY GUIDE
SECOND SEMESTER: 2019

LECTURERS' INFORMATION
Name: Mr TH Chavalala and Mr H Maluleke
Office: Mathematical Sciences Building, Rooms 2017 and 2022
Telephone: 015 268 4769/2168
E-mail address: thembhani.chavalala@ul.ac.za and happy.maluleke@ul.ac.za

STUDY COMPONENTS

Purpose of the Module
The purpose of the module is to guide students to:
- Find point and interval estimates of the mean and proportion
- Perform hypothesis tests on the mean and proportion
- Perform hypothesis tests using the chi-square statistic
- Identify when to apply ANOVA as a hypothesis testing technique
- Fit and interpret a simple linear regression model
- Calculate and interpret the correlation coefficient
- Define and explain the purpose of index numbers
- Explain the purpose of time series analysis

STUDY UNITS
- SAMPLING DISTRIBUTION
- POINT AND INTERVAL ESTIMATION
- HYPOTHESIS TESTING
- CHI-SQUARE ANALYSIS
- ANALYSIS OF VARIANCE
- REGRESSION & CORRELATION
- INDEX NUMBERS
- TIME SERIES ANALYSIS

LECTURES AND LECTURE TIMES
You have to attend four lectures per week. The times are as follows:

DAY                  TIME             VENUE
MONDAY (Tutorial)    PERIOD 7 & 8     SMCS 1036 & 1037
MONDAY (Tutorial)    PERIOD 9 & 10    SMCS 1036 & 1037
TUESDAY (Lecture)    PERIOD 3 & 4     TC
THURSDAY (Lecture)   PERIOD 3 & 4     KA

ASSIGNMENTS AND QUIZZES
Two assignments (their average carries a weight of 15%) and at least four quizzes (their average carries a weight of 15%) will be written, and these averages contribute towards the module mark.

TESTS AND EXAMINATION
Two tests will also be written, and their average contributes 70% of the module mark. A student is admitted to the final exam on the basis of a module mark of at least 40%.
CALCULATION OF MARKS
MODULE MARK = 15% quizzes average + 15% assignments average + 70% test average
FINAL MARK = 60% MODULE MARK + 40% FINAL EXAM MARK

TENTATIVE SCHEDULE OF LECTURES

Week   Date                 Topic/Activity
1      01-05 July           Sampling Distributions
2      08-12 July           Point and Interval Estimation
3      15-19 July           Point and Interval Estimation
4      22-26 July           Hypothesis Testing
5      29-02 July/August    Hypothesis Testing
6      05-09 August         Chi-Square Analysis
7      12-16 August         Analysis of Variance
8      19-23 August         Regression Analysis
9      26-30 August         Correlation Analysis
10     02-06 September      Index Numbers
11     09-13 September      Index Numbers
12     16-20 September      Spring Recess
13     23-27 September      Time Series Analysis
14     30-04 October        Time Series Analysis
15     07-11 October        Revision
16     14-18 October        Study Week

Assessment: Test 1 (07 August), Test 2 (30 August)

TABLE OF CONTENTS

CHAPTER 1: THE SAMPLING DISTRIBUTION
  1.1 Introduction
  1.2 Population distribution
  1.3 Sampling distribution
  1.4 Sampling and Non-sampling errors
  1.5 The Mean and Standard deviation of the sample mean, $\bar{x}$
  1.6 Sampling from a Normally Distributed Population
  1.7 Sampling from a population that is not normally distributed
  1.6 Population and Sample Proportions
CHAPTER 2: POINT AND INTERVAL ESTIMATION
  2.1 Introduction
  2.2 Estimation
  2.3 Type of Estimates: Point and interval estimates
  2.4 Confidence interval
  2.5 Estimation of a population mean
  2.7 Confidence interval estimation for Population proportion
  2.8 Confidence Intervals for Variances and Standard Deviations
  2.9 Inferences about the difference between two population means for independent samples: $\sigma_1$ and $\sigma_2$ known
  2.10 Inferences about the difference between two population means for independent samples: $\sigma_1$ and $\sigma_2$ are unknown but equal
  2.11 Inferences about the difference between two population proportions for large and independent samples
CHAPTER 3: HYPOTHESIS TESTING
  3.1 Introduction
  3.2 Hypothesis testing procedure
  3.3 Type I and Type II errors
  3.4 Steps of hypothesis testing
  3.5 Test of hypothesis for the population mean $\mu$
  3.6 Hypothesis test about a population proportion
  3.7 The $P$-Value Approach
  3.8 Hypothesis Tests About the Population Variance
  3.9 Hypothesis testing for the difference between two population means $\mu_1 - \mu_2$
  3.10 Hypothesis testing for the equality of variances from two populations
  3.11 Hypothesis testing for the difference between two population proportions $P_1 - P_2$
CHAPTER 4: CHI-SQUARE HYPOTHESIS TESTING
  4.1 Introduction
  4.2 Chi-square goodness-of-fit test
  4.3 Chi-square test for independence of association
CHAPTER 5: ANALYSIS OF VARIANCE
  5.1 Introduction
  5.2 Terms and concepts
  5.3 The F-Distribution
    5.3.1 Basic Properties of F-curves
    5.3.2 Finding the $\chi^2$-value having the specified area to its right
  5.4 Performing a One-Way Anova
  5.5 One-Way ANOVA Table
  5.6 Pairwise comparisons of the treatments
    5.6.1 Tukey's pairwise comparison test
CHAPTER 6: SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS
  6.1 Introduction
  6.2 Simple Linear Regression Analysis
  6.3 Inference in regression analysis
  6.4 Correlation Analysis
  6.5 Inference about the correlation
  6.6 The Coefficient of Determination
CHAPTER 7: INDEX NUMBERS
  7.1 Introduction
  7.2 Price indexes
  7.3 Quantity Indexes
CHAPTER 8: TIME SERIES ANALYSIS
  8.1 Introduction
  8.2 Components of a Time Series
  8.3 Decomposition of a Time Series
  8.4 Trend Analysis
  8.5 Seasonal Analysis
TABLES

CHAPTER 1: THE SAMPLING DISTRIBUTION

1.1 Introduction
In the first semester, we studied sampling, descriptive statistics, probability, and the normal distribution. Now we will learn how these topics combine to lay the foundation for inferential statistics. This chapter introduces the concepts of population distribution and sampling distribution, and explains the essential role that these concepts play in the design of inferential studies. We should always keep in mind that we sample because we want to make an inference about a population. It is this inference that leads us to tools such as confidence intervals and hypothesis testing. A good picture to represent this situation follows:
[Figure not available]
The most significant objective of statistics is to draw conclusions about a population from the information contained in a sample.

1.2 Population distribution
The population distribution is the probability distribution derived from the information on all elements under consideration.

Definition 1.1: Population distribution is the probability distribution of the population data.

1.3 Sampling distribution
For any population data set, there is only one value of the population mean, $\mu$. However, we cannot say the same about the sample mean, $\bar{x}$. We would expect different samples of the same size drawn from the same population to yield different values of the sample mean, $\bar{x}$.
The value of the sample mean for any sample will depend on the elements included in that sample. Accordingly, the sample mean, $\bar{x}$, is a random variable. Therefore, like other random variables, the sample mean possesses a probability distribution, which is more commonly called the sampling distribution of $\bar{x}$. Other sample statistics, such as the median, mode, and standard deviation, also possess sampling distributions.

Definition 1.2: The probability distribution of $\bar{x}$ is called the sampling distribution of $\bar{x}$.

There are many ways to take a sample, and different methods suit different problems. Knowing more about the research problem helps us determine which sampling method makes the most sense. This brings us to sampling design, which is the procedure by which the sample is selected. There are two very broad categories of sampling designs: probability sampling and non-probability sampling.

How to construct a sampling distribution of $\bar{x}$
In a random sample, each member of the population has an equal chance of being selected. There are a number of ways in which a random sample may be taken:
i) Random sampling without replacement.
ii) Random sampling with replacement.

1.4 Sampling and Non-sampling errors
Usually, different samples selected from the same population will give different results because they contain different elements.

Definition 1.3: Sampling error is the difference between the value of a sample statistic and the value of the corresponding population parameter. In the case of the mean,
$$\text{Sampling error} = \bar{x} - \mu$$

It is important to remember that a sampling error occurs because of chance. Errors that occur for other reasons, such as errors made during the collection, recording, and tabulation of data, are called non-sampling errors. These errors occur because of human mistakes, not chance. Note that there is only one kind of sampling error, which is the error that occurs due to chance.
However, there is not just one non-sampling error; many non-sampling errors may occur, for different reasons.

Definition 1.4: Non-sampling errors are the errors that occur in the collection, recording, and tabulation of data.

The following are the main reasons for the occurrence of non-sampling errors:
i) If a sample is non-random, the sample results may differ too much from the census results.
ii) The questions may be phrased in such a way that they are not fully understood by the members of the sample or population. As a result, the answers obtained are not accurate.
iii) The respondents may intentionally give false information in response to some sensitive questions.
iv) The poll taker may make a mistake and enter a wrong number in the records, or make an error while entering the data on a computer.

Example 1: Suppose there were only five students in the supplementary exam of SSTS011 and the exam scores of these five students are: 70, 78, 80, 80, 95. Suppose that a simple random sample of size three is drawn without replacement.
a) List all samples that can be selected from this population.
b) Calculate the sample mean and sampling error for each of these samples.

1.5 The Mean and Standard deviation of the sample mean, $\bar{x}$
The mean and standard deviation calculated for the sampling distribution of $\bar{x}$ are called the mean and standard deviation of $\bar{x}$. In fact, the mean and standard deviation of $\bar{x}$ are, respectively, the mean and standard deviation of the means of all samples of the same size selected from a population. The standard deviation of $\bar{x}$ is also called the standard error of $\bar{x}$.

Mean of the sampling distribution of $\bar{x}$
The mean and standard deviation of the sampling distribution of $\bar{x}$ are called the mean and standard deviation of $\bar{x}$ and are denoted by $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$, respectively.
The mean of the sampling distribution of $\bar{x}$ is always equal to the mean of the population.
Thus,
$$\mu_{\bar{x}} = \mu$$

Standard deviation of the sampling distribution of $\bar{x}$
The standard deviation of the sampling distribution of $\bar{x}$ is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
where $\sigma$ is the standard deviation of the population and $n$ is the sample size.

Important observations regarding the sampling distribution of $\bar{x}$
i) The spread of the sampling distribution of $\bar{x}$ is smaller than the spread of the corresponding population distribution. In other words, $\sigma_{\bar{x}} < \sigma$. This is obvious from the formula for $\sigma_{\bar{x}}$. When $n$ is greater than 1, which is usually true, the denominator in $\sigma/\sqrt{n}$ is greater than 1. Hence, $\sigma_{\bar{x}}$ is smaller than $\sigma$.
ii) The standard deviation of the sampling distribution of $\bar{x}$ decreases as the sample size increases. This feature of the sampling distribution of $\bar{x}$ is also obvious from the formula $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
If the standard deviation of a sample statistic decreases as the sample size is increased, that statistic is said to be a consistent estimator.

1.6 Sampling from a Normally Distributed Population
When the population from which samples are drawn is normally distributed with its mean equal to $\mu$ and standard deviation equal to $\sigma$, then:
i) The mean of $\bar{x}$, $\mu_{\bar{x}}$, is equal to the mean of the population, $\mu$.
ii) The standard deviation of $\bar{x}$, $\sigma_{\bar{x}}$, is equal to $\sigma/\sqrt{n}$.
iii) The shape of the sampling distribution of $\bar{x}$ is normal, whatever the value of $n$.

Important remark: If the population from which the samples are drawn is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of the sample mean, $\bar{x}$, will also be normally distributed with the following mean and standard deviation, irrespective of the sample size:
$$\mu_{\bar{x}} = \mu \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

1.7 Sampling from a population that is not normally distributed
Most of the time the population from which the samples are selected is not normally distributed. In such cases, the shape of the sampling distribution of $\bar{x}$ is inferred from a very important theorem called the central limit theorem.
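Example 1 and the property that the mean of $\bar{x}$ equals the population mean can be checked by brute-force enumeration. The sketch below is only an illustration and not part of the original guide: it lists every sample of size three drawn without replacement from the five exam scores, then compares the mean of the sample means with the population mean. The function and variable names are assumptions chosen for this example.

```python
from itertools import combinations
from statistics import mean, pstdev

# Population from Example 1: the five supplementary exam scores.
population = [70, 78, 80, 80, 95]
mu = mean(population)        # population mean
sigma = pstdev(population)   # population standard deviation

def sampling_distribution(values, n):
    """Return (sample, sample mean, sampling error) for every sample of
    size n drawn without replacement (order ignored)."""
    pop_mean = mean(values)
    rows = []
    for sample in combinations(values, n):
        x_bar = mean(sample)
        rows.append((sample, x_bar, x_bar - pop_mean))
    return rows

rows = sampling_distribution(population, 3)
sample_means = [x_bar for _, x_bar, _ in rows]

# One line per sample: the sample, its mean, and its sampling error.
for sample, x_bar, error in rows:
    print(sample, round(x_bar, 2), round(error, 2))

print("population mean:", mu)
print("mean of sample means:", round(mean(sample_means), 4))
print("std. dev. of sample means:", round(pstdev(sample_means), 4))
```

The two scores of 80 are treated as different students, so ten samples are listed. The mean of the ten sample means matches the population mean, and their spread is smaller than the population standard deviation, in line with observation i) above.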
Central Limit Theorem
According to the central limit theorem, for a large sample size, the sampling distribution of $\bar{x}$ is approximately normal, irrespective of the shape of the population distribution. The mean and standard deviation of the sampling distribution of $\bar{x}$ are, respectively,
$$\mu_{\bar{x}} = \mu \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
The sample size is usually considered to be large if $n \ge 30$.

Important remark: When the population does not have a normal distribution, the shape of the sampling distribution is not exactly normal, but it is approximately normal for a large sample size. The approximation becomes more accurate as the sample size increases. Another point to remember is that the central limit theorem applies to large samples only. Usually, if the sample size is 30 or more, it is considered sufficiently large for the central limit theorem to be applied to the sampling distribution of $\bar{x}$. Thus, according to the central limit theorem:
i) When $n \ge 30$, the shape of the sampling distribution of $\bar{x}$ is approximately normal, irrespective of the shape of the population distribution.
ii) The mean of $\bar{x}$, $\mu_{\bar{x}}$, is equal to the mean of the population, $\mu$.
iii) The standard deviation of $\bar{x}$, $\sigma_{\bar{x}}$, is equal to $\sigma/\sqrt{n}$.

Example 3: The delivery times for all food orders at a fast-food restaurant during the lunch hour are normally distributed with a mean of 7.7 minutes and a standard deviation of 2.1 minutes. Let $\bar{x}$ be the mean delivery time for a random sample of 16 orders at this restaurant. Calculate the mean and standard deviation of $\bar{x}$.

Example 4: Suppose that the shape of the distribution of time spent working per week by University of Limpopo (UL) students who hold part-time jobs during the school year is unknown, with a mean of 20.20 hours and a standard deviation of 2.60 hours. Let $\bar{x}$ be the average time spent working per week for 36 randomly selected UL students who hold part-time jobs during the school year.
Calculate the mean and the standard deviation of the sampling distribution of $\bar{x}$.

Remark: Suppose that we take a random sample of size $n$ from a normal population with mean $\mu$ and variance $\sigma^2$. Then the sample mean $\bar{x}$ is also normal, with mean $\mu$ and variance $\sigma^2/n$, that is,
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Note that when the sample is not drawn from a normal population, the central limit theorem can be applied provided the sample size is large ($n \ge 30$).

Example 5: Assume that the weights of all packages of a certain brand of cookies are normally distributed with a mean of 32 ounces and a standard deviation of 0.3 ounces. Find the probability that the mean weight, $\bar{x}$, of a random sample of 20 packages of this brand of cookies will be between 31.8 and 31.9 ounces.

Example 6: The amounts of electricity bills for all households in a city have a skewed probability distribution with a mean of R140 and a standard deviation of R30. Find the probability that the mean amount of electricity bills for a random sample of 75 households selected from this city will be:
a) between R132 and R136.
b) within R6 of the population mean.

1.6 Population and Sample Proportions
The concept of proportion is the same as the concept of relative frequency discussed in the first semester and the concept of probability of success in a binomial experiment. The relative frequency of a category or class gives the proportion of the sample or population that belongs to that category or class. Similarly, the probability of success in a binomial experiment represents the proportion of the sample or population that possesses a given characteristic.

The population proportion, denoted by $p$, is obtained by taking the ratio of the number of elements in a population with a specific characteristic to the total number of elements in the population. The sample proportion, denoted by $\hat{p}$ (pronounced p hat), gives a similar ratio for a sample.
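Examples 5 and 6 above both reduce to finding an area under the normal curve of $\bar{x}$. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` is used in place of the printed standard normal table, so answers may differ slightly in the later decimals from table-based working. The helper name `prob_mean_between` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_mean_between(mu, sigma, n, low, high):
    """P(low < x_bar < high) when x_bar ~ N(mu, sigma^2 / n).

    Exact if the population is normal; approximate, by the central
    limit theorem, when the population is not normal but n >= 30.
    """
    x_bar_dist = NormalDist(mu, sigma / sqrt(n))  # sampling distribution of x_bar
    return x_bar_dist.cdf(high) - x_bar_dist.cdf(low)

# Example 5: normal population, mu = 32, sigma = 0.3, n = 20.
p5 = prob_mean_between(32, 0.3, 20, 31.8, 31.9)

# Example 6: skewed population, but n = 75 >= 30, so the CLT applies.
p6a = prob_mean_between(140, 30, 75, 132, 136)          # between R132 and R136
p6b = prob_mean_between(140, 30, 75, 140 - 6, 140 + 6)  # within R6 of the mean

print(round(p5, 4), round(p6a, 4), round(p6b, 4))
```

The only step that changes between the two examples is the justification for normality: in Example 5 the population itself is normal, while in Example 6 the large sample size is what allows the normal approximation.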
The population and sample proportions, denoted by $p$ and $\hat{p}$, respectively, are calculated as
$$p = \frac{X}{N} \quad \text{and} \quad \hat{p} = \frac{x}{n}$$
where
$N$ = total number of elements in the population
$n$ = total number of elements in the sample
$X$ = number of elements in the population that possess a specific characteristic
$x$ = number of elements in the sample that possess a specific characteristic

Example 7: Suppose a total of 789,654 families live in a particular city and 563,282 of them own homes. A sample of 240 families is selected from this city, and 158 of them own homes.
a) Find the proportion of families who own homes in the population.
b) Find the proportion of families who own homes in the sample.
c) Calculate the sampling error of the proportion.

Example 8: In a population of 9500 subjects, 75% possess a certain characteristic. In a sample of 400 subjects selected from this population, 78% possess the same characteristic. How many subjects in the population and sample, respectively, possess this characteristic? Calculate the sampling error of the proportion.

1.6.1 Sampling distribution of $\hat{p}$
Just like the sample mean $\bar{x}$, the sample proportion $\hat{p}$ is a random variable. Hence, it possesses a probability distribution, which is called its sampling distribution.

Sampling Distribution of the Sample Proportion, $\hat{p}$: The probability distribution of the sample proportion, $\hat{p}$, is called its sampling distribution. It gives the various values that $\hat{p}$ can assume and their probabilities.

1.6.2 Mean and Standard Deviation of $\hat{p}$
The mean of $\hat{p}$, which is the same as the mean of the sampling distribution of $\hat{p}$, is always equal to the population proportion, $p$, just as the mean of the sampling distribution of $\bar{x}$ is always equal to the population mean, $\mu$.
The mean of the sample proportion, $\hat{p}$, is denoted by $\mu_{\hat{p}}$ and is equal to the population proportion, $p$.
Thus,
$$\mu_{\hat{p}} = p$$
The standard deviation of the sample proportion, $\hat{p}$, is denoted by $\sigma_{\hat{p}}$ and is given by the formula
$$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$$
where $p$ is the population proportion, $q = 1 - p$, and $n$ is the sample size.

Example 9: A population of $N = 4000$ has a population proportion equal to 0.12. In each of the following cases, which formula will you use to calculate $\sigma_{\hat{p}}$ and why? Using the appropriate formula, calculate $\sigma_{\hat{p}}$ for each of these cases.
a) $n = 800$.
b) $n = 30$.

1.6.3 Shape of the sampling distribution of $\hat{p}$
The shape of the sampling distribution of $\hat{p}$ is inferred from the central limit theorem.

Central Limit Theorem for Sample Proportion: According to the central limit theorem, the sampling distribution of $\hat{p}$ is approximately normal for a sufficiently large sample size. In the case of a proportion, the sample size is considered to be sufficiently large if $np$ and $nq$ are both greater than 5, that is, if
$$np > 5 \quad \text{and} \quad nq > 5$$

Example 10: Maureen Webster, who is running for mayor in a large city, claims that she is favoured by 53% of all eligible voters of that city. Assume that this claim is true. What is the probability that in a random sample of 400 registered voters taken from this city, less than 49% will favour Maureen Webster?

Example 11: According to the BBMG Conscious Consumer Report, 51% of the adults surveyed said that they are willing to pay more for products with social and environmental benefits despite the current tough economic times (USA TODAY, June 8, 2009). Suppose this result is true for the current population of adult Americans. Let $\hat{p}$ be the proportion in a random sample of 1050 adult Americans who hold this opinion. Find the probability that the value of $\hat{p}$ is between 0.53 and 0.55.

CHAPTER 2: POINT AND INTERVAL ESTIMATION

2.1 Introduction
Statistical inference is the process of using sample results to estimate or draw conclusions about the characteristics or parameters of a population.
In this chapter we shall examine estimation procedures which attempt to measure particular characteristics of a population, such as the mean, proportion and variance. There are two major types of estimates: point estimates and interval estimates. A point estimate uses a single sample value to estimate the population parameter involved. An interval estimate, instead of relying on a single value, uses an interval to estimate the population parameter. This interval has a specified confidence, or probability, of correctly estimating the true value of the population parameter.

2.2 Estimation
Definition 2.1: The assignment of value(s) to a population parameter based on a value of the corresponding sample statistic is called estimation.

Definition 2.2: An estimate is the value(s) assigned to a population parameter based on the value of a sample statistic.

Definition 2.3: An estimator is the sample statistic used to estimate a population parameter.

Three Properties of a Good Estimator
1. The estimator should be an unbiased estimator. That is, the expected value or the mean of the estimates obtained from samples of a given size is equal to the parameter being estimated.
2. The estimator should be consistent. For a consistent estimator, as sample size increases, the value of the estimator approaches the value of the parameter estimated.
3. The estimator should be a relatively efficient estimator. That is, of all the statistics that can be used to estimate a parameter, the relatively efficient estimator has the smallest variance.

Unbiased estimator
A point estimator $\hat{\theta}$ is said to be an unbiased estimator of $\theta$ if $E(\hat{\theta}) = \theta$ for every possible value of $\theta$. If $\hat{\theta}$ is not unbiased, the difference $E(\hat{\theta}) - \theta$ is called the bias of $\hat{\theta}$. That is, an unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter.

Example 2.1: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^2 > 0$.
a) Is the sample mean $\bar{X}$ an unbiased estimator of the parameter $\mu$?
b) Show that
$$S^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$
is an unbiased estimator of the parameter $\sigma^2$.

2.3 Type of Estimates: Point and interval estimates
An estimate may be a point estimate or an interval estimate. These two types of estimates are described in this section.

2.3.1 Point estimate
Point Estimate: A point estimate is a specific numerical value used as an estimate of a parameter. The best point estimate of the population mean $\mu$ is the sample mean $\bar{X}$.

2.3.2 Interval estimate
Interval Estimate: An interval estimate of a parameter is an interval or a range of values used to estimate the parameter. This estimate may or may not contain the value of the parameter being estimated.

2.4 Confidence interval
Confidence Level and Confidence Interval
The confidence level of an interval estimate of a parameter is the probability that the interval estimate will contain the parameter, assuming that a large number of samples are selected and that the estimation process on the same parameter is repeated.
A confidence interval is a specific interval estimate of a parameter determined by using data obtained from a sample and by using the specific confidence level of the estimate.

2.5 Estimation of a population mean
This section explains how to construct a confidence interval for the population mean $\mu$. Here, there are two possible cases, summarised in the chart below.
[Chart not available]

Confidence Interval for $\mu$: $\sigma$ known, or $\sigma$ unknown but $n \ge 30$
The $(1 - \alpha)100\%$ confidence interval for $\mu$ under Cases I and II is
$$\bar{x} - Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$
where the value of $Z_{\alpha/2}$ used here is obtained from the standard normal distribution table for the given confidence level. When $\sigma$ is unknown and $n \ge 30$, the sample standard deviation $s$ is used in place of $\sigma$.

Example 2.2: A publishing company has just published a new university textbook. Before the company decides the price at which to sell this textbook, it wants to know the average price of all such textbooks in the market.
The research department at the company took a sample of 25 comparable textbooks and collected information on their prices. This information produced a mean price of R145 for this sample. It is known that the standard deviation of the prices of all such textbooks is R35 and that the population of such prices is normal. Construct a 90% confidence interval for the mean price of all such university textbooks.

Example 2.3: A researcher wishes to estimate the number of days it takes an automobile dealer to sell a Ford Ranger 3.2 double cab. A sample of 50 cars had a mean time on the dealer's lot of 54 days and a standard deviation of 6.0 days. Find the best point estimate of the population mean and the 95% confidence interval for the population mean.

Margin of Error
The margin of error for the estimate of $\mu$, denoted by $E$, is the quantity that is subtracted from and added to the value of $\bar{x}$ to obtain a confidence interval for $\mu$. Thus,
$$E = Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

Determining the Sample Size for the Estimation of $\mu$
Given the confidence level and the standard deviation of the population, the sample size that will produce a predetermined margin of error $E$ of the confidence interval estimate of $\mu$ is
$$n = \left(\frac{Z_{\alpha/2} \times \sigma}{E}\right)^2$$
If necessary, round the answer up to obtain a whole number. That is, if there is any fraction or decimal portion in the answer, use the next whole number for the sample size $n$.

Example 2.4: An alumni association wants to estimate the mean debt of this year's university graduates. It is known that the population standard deviation of the debts of this year's university graduates is R11,800. How large a sample should be selected so that the estimate with a 99% confidence level is within R800 of the population mean?

Confidence Interval for $\mu$ when $\sigma$ unknown and $n < 30$
Just as the population mean $\mu$ is usually not known, the population standard deviation $\sigma$ is also unlikely to be known, and in this case the sample size is small ($n < 30$).
Therefore, we need to obtain a confidence interval estimate of $\mu$ by using the sample statistics $\bar{X}$ and $S^2$. The distribution that has been developed for this situation is Student's t distribution.

The $(1 - \alpha)100\%$ confidence interval for $\mu$ is
$$\bar{x} - t_{n-1,\,\alpha/2} \times \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1,\,\alpha/2} \times \frac{s}{\sqrt{n}}$$
The value of $t$ is obtained from the t distribution table for $n - 1$ degrees of freedom and the given confidence level.

Example 2.5: Dr. Moore wanted to estimate the mean cholesterol level for all adult men living in Hartford. He took a sample of 25 adult men from Hartford and found that the mean cholesterol level for this sample is 186 mg/dL with a standard deviation of 12 mg/dL. Assume that the cholesterol levels for all adult men in Hartford are (approximately) normally distributed. Construct a 95% confidence interval for the population mean $\mu$.

Example 2.6: The data represent a sample of the number of home fires started by candles for the past several years. (Data are from the National Fire Protection Association.)
5460 5900 6090 6310 7160 8440 9930
Find the 99% confidence interval for the mean number of home fires started by candles each year.

2.7 Confidence interval estimation for Population proportion
The concept of estimation can be extended to qualitative data to estimate the proportion of successes in the population based only upon sample data. We noted in the previous chapter that when $np$ and $nq$ were at least 5, the binomial distribution generally could be approximated by the normal distribution. If we desire to estimate the population proportion $p$ from the sample proportion $\hat{p}$, we could set up the following $(1 - \alpha)100\%$ confidence interval estimate for the population proportion $p$.
To construct a confidence interval about a proportion, you must use the maximum error of the estimate, which is

$$E = Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

Solving this equation for $n$, we have

$$n = \frac{Z_{\alpha/2}^2\,\hat{p}\hat{q}}{E^2}$$

Confidence intervals about proportions must meet the criteria that $n\hat{p} \ge 5$ and $n\hat{q} \ge 5$.

Confidence Interval for the Population Proportion, $p$

The $(1 - \alpha)100\%$ confidence interval for the population proportion, $p$, is

$$\hat{p} - Z_{\alpha/2} \times \sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + Z_{\alpha/2} \times \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

The value of $Z_{\alpha/2}$ used here is obtained from the standard normal distribution table for the given confidence level.

Example 2.7: According to a survey conducted by Pew Research Centre in June 2009, 44% of people aged 18 to 29 years said that religion is very important to them. Suppose this result is based on a sample of 1000 people aged 18 to 29 years.
a) What is the point estimate of the corresponding population proportion?
b) Find, with a 99% confidence level, the percentage of all people aged 18 to 29 years who will say that religion is very important to them. What is the margin of error of this estimate?

Example 2.8: A survey of 1721 people found that 15.9% of individuals purchase religious books at a Christian bookstore. Find the 95% confidence interval of the true proportion of people who purchase their religious books at a Christian bookstore.

Example 2.9: Lombard Electronics Company has just installed a new machine that makes a part that is used in clocks. The company wants to estimate the proportion of these parts produced by this machine that are defective. The company manager wants this estimate to be within 0.02 of the population proportion for a 95% confidence level. What is the most conservative estimate of the sample size that will limit the margin of error to within 0.02 of the population proportion?

2.8 Confidence Intervals for Variances and Standard Deviations

In the previous sections confidence intervals were calculated for means and proportions.
This section will explain how to find confidence intervals for variances and standard deviations. In statistics, the variance and standard deviation of a variable are as important as the mean. For example, when products that fit together (such as pipes) are manufactured, it is important to keep the variations of the diameters of the products as small as possible; otherwise, they will not fit together properly and will have to be scrapped. In the manufacture of medicines, the variance and standard deviation of the medication in the pills play an important role in making sure patients receive the proper dosage. For these reasons, confidence intervals for variances and standard deviations are necessary. To calculate these confidence intervals, a new statistical distribution is needed. It is called the chi-square distribution. The formulas for the confidence intervals are shown here:

Confidence Interval for population variance

Assuming that the population from which the sample is selected is (approximately) normally distributed, we obtain the $(1 - \alpha)100\%$ confidence interval for the population variance $\sigma^2$ as

$$\frac{(n-1)S^2}{\chi^2_{\alpha/2}(n-1)} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}(n-1)}$$

Note that $n - 1$ is the degrees of freedom.

Confidence Interval for population standard deviation

Assuming that the population from which the sample is selected is (approximately) normally distributed, we obtain the $(1 - \alpha)100\%$ confidence interval for the population standard deviation $\sigma$ as

$$\sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2}(n-1)}} < \sigma < \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha/2}(n-1)}}$$

Note that $n - 1$ is the degrees of freedom.

Example 2.10: Find the 95% confidence interval for the variance and standard deviation of the nicotine content of cigarettes manufactured if a sample of 20 cigarettes has a standard deviation of 1.6 milligrams.

Example 2.11: Find the 90% confidence interval for the variance and standard deviation for the price in dollars of an adult single-day ski lift ticket.
The data represent a selected sample of nationwide ski resorts. Assume the variable is normally distributed.

59 54 53 52 51 39 49 46 49 48

2.9 Inferences about the difference between two population means for independent samples: $\sigma_1$ and $\sigma_2$ known

Two samples drawn from two populations are independent if the selection of one sample from one population does not affect the selection of the second sample from the second population. Otherwise, the samples are dependent.

2.9.1 Interval estimation of $\mu_1 - \mu_2$

Confidence interval for $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ known

When using the normal distribution, the $(1 - \alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is

$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2} \times \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2} \times \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

The value of $Z_{\alpha/2}$ is obtained from the normal distribution table for the given confidence level. Here, $\bar{x}_1 - \bar{x}_2$ is the point estimator of $\mu_1 - \mu_2$.

Example 2.12: A 2008 survey of low- and middle-income households conducted by Demos, a liberal public policy group, showed that consumers aged 65 years and older had an average credit card debt of R10,235 and consumers in the 50- to 64-year age group had an average credit card debt of R9342 at the time of the survey (USA TODAY, July 28, 2009). Suppose that these averages were based on random samples of 1200 and 1400 people for the two groups, respectively. Further assume that the population standard deviations for the two groups were R2800 and R2500, respectively. Let $\mu_1$ and $\mu_2$ be the respective population means for the two groups, people aged 65 years and older and people in the 50- to 64-year age group.
a) What is the point estimate of $\mu_1 - \mu_2$?
b) Construct a 97% confidence interval for $\mu_1 - \mu_2$.

Example 2.13: A survey found that the average hotel room rate in New Orleans is R88.42 and the average room rate in Phoenix is R80.61. Assume that the data were obtained from two samples of 50 hotels each and that the standard deviations of the populations are R5.62 and R4.83, respectively.
Find a 95% confidence interval for the difference between the means.

2.10 Inferences about the difference between two population means for independent samples: $\sigma_1$ and $\sigma_2$ are unknown but equal

Confidence Interval for $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ are unknown but equal

The $(1 - \alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is

$$(\bar{x}_1 - \bar{x}_2) - t_{n_1+n_2-2,\,\alpha/2} \times S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{n_1+n_2-2,\,\alpha/2} \times S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where

$$S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$

The value of $t$ is obtained from the t distribution table for the given confidence level and $n_1 + n_2 - 2$ degrees of freedom.

Example 2.14: A consumer agency wanted to estimate the difference in the mean amounts of caffeine in two brands of coffee. The agency took a sample of 15 one-pound jars of Brand I coffee that showed the mean amount of caffeine in these jars to be 80 milligrams per jar with a standard deviation of 5 milligrams. Another sample of 12 one-pound jars of Brand II coffee gave a mean amount of caffeine equal to 77 milligrams per jar with a standard deviation of 6 milligrams. Construct a 95% confidence interval for the difference between the mean amounts of caffeine in one-pound jars of these two brands of coffee. Assume that the two populations are normally distributed and that the standard deviations of the two populations are equal.

Example 2.15: The following information was obtained from two independent samples selected from two normally distributed populations with unknown but equal standard deviations.

$n_1 = 21$, $\bar{x}_1 = 13.97$, $s_1 = 3.78$

a) What is the point estimate of $\mu_1 - \mu_2$?
b) Construct a 95% confidence interval for $\mu_1 - \mu_2$.

2.11 Inferences about the difference between two population proportions for large and independent samples

The difference between two sample proportions $\hat{p}_1 - \hat{p}_2$ is the point estimator for the difference between two population proportions $p_1 - p_2$.
Because we do not know $p_1$ and $p_2$ when we are making a confidence interval for $p_1 - p_2$, we cannot calculate the value of $\sigma_{\hat{p}_1 - \hat{p}_2}$. Therefore, we use $s_{\hat{p}_1 - \hat{p}_2}$ as the point estimator of $\sigma_{\hat{p}_1 - \hat{p}_2}$ in the interval estimation. We construct the confidence interval for $p_1 - p_2$ using the following formula.

Confidence interval for $p_1 - p_2$

The $(1 - \alpha)100\%$ confidence interval for $p_1 - p_2$ is

$$(\hat{p}_1 - \hat{p}_2) - Z_{\alpha/2} \times \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} < p_1 - p_2 < (\hat{p}_1 - \hat{p}_2) + Z_{\alpha/2} \times \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$

where the value of $Z_{\alpha/2}$ is read from the normal distribution table for the given confidence level.

Example 2.16: A researcher wanted to estimate the difference between the percentages of users of two toothpastes who will never switch to another toothpaste. In a sample of 500 users of Toothpaste A taken by this researcher, 100 said that they will never switch to another toothpaste. In another sample of 400 users of Toothpaste B taken by the same researcher, 68 said that they will never switch to another toothpaste.
a) Let $p_1$ and $p_2$ be the proportions of all users of Toothpastes A and B, respectively, who will never switch to another toothpaste. What is the point estimate of $p_1 - p_2$?
b) Construct a 95% confidence interval for the difference between the proportions of all users of the two toothpastes who will never switch.

Example 2.17: In the nursing home study mentioned in the chapter-opening Statistics Today, the researchers found that 12 out of 34 small nursing homes had a resident vaccination rate of less than 80%, while 17 out of 24 large nursing homes had a vaccination rate of less than 80%.
Find the 95% confidence interval for the difference of proportions.

CHAPTER 3: HYPOTHESIS TESTING

3.1 Introduction

In chapter 2, the concept that a sample statistic such as the mean or a proportion would follow a particular distribution under various circumstances was used to develop the confidence interval as a way of making inference about the true value of the mean or proportion. In this chapter we will begin to focus on another phase of statistical inference called hypothesis testing, which is a decision-making process for evaluating claims about a population. In hypothesis testing, the researcher must define the population under study, state the particular hypotheses that will be investigated, give the significance level, select a sample from the population, collect the data, perform the calculations required for the statistical test, and reach a conclusion.

3.2 Hypothesis testing procedure

Every hypothesis-testing situation begins with the statement of a hypothesis. A statistical hypothesis is a conjecture about a population parameter. This conjecture may or may not be true. There are two types of statistical hypotheses for each situation: the null hypothesis and the alternative hypothesis.

The null hypothesis, symbolized by $H_0$, is a statistical hypothesis that states that there is no difference between a parameter and a specific value, or that there is no difference between two parameters.

The alternative hypothesis, symbolized by $H_1$, is a statistical hypothesis that states the existence of a difference between a parameter and a specific value, or states that there is a difference between two parameters.

To state hypotheses correctly, researchers must translate the conjecture or claim from words into mathematical symbols.
The basic symbols used are as follows:

Equal to          =
Not equal to      ≠
Greater than      >
Less than         <

The null and alternative hypotheses are stated together, and the null hypothesis contains the equals sign, as shown in the table below (where k represents a specified number).

Two-tailed test      Right-tailed test     Left-tailed test
H0: μ = k            H0: μ ≤ k             H0: μ ≥ k
H1: μ ≠ k            H1: μ > k             H1: μ < k

Hypothesis testing common phrases

>                    <                               =                       ≠
Is greater than      Is less than                    Is equal to             Is not equal to
Is above             Is below                        Is the same as          Is different from
Is higher than       Is lower than                   Has not changed from    Has changed from
Is longer than       Is shorter than                 Is the same as          Is not the same as
Is bigger than       Is smaller than
Is increased         Is decreased or reduced from

3.3 Type I and Type II errors

In using a sample to draw inferences about the population, the decision maker takes the risk that an incorrect conclusion will be reached. There are two types of errors that can occur in the hypothesis testing procedure.

The first, called a Type I error, occurs when the null hypothesis $H_0$ is rejected when, in fact, it is true. The value of $\alpha$ represents the probability of committing this type of error and is also called the level of significance; that is,

$$\alpha = P(H_0 \text{ is rejected} \mid H_0 \text{ is true})$$

The second, called a Type II error, occurs when the null hypothesis $H_0$ is not rejected when it is false and should be rejected. The value of $\beta$ represents the probability of committing a Type II error; that is,

$$\beta = P(H_0 \text{ is not rejected} \mid H_0 \text{ is false})$$

The value of $1 - \beta$ is called the power of the test. It represents the probability of not making a Type II error, that is, of rejecting a null hypothesis that is false.

In the hypothesis-testing situation, there are four possible outcomes. In reality, the null hypothesis may or may not be true, and a decision is made to reject or not reject it on the basis of the data obtained from a sample.
The four possible outcomes are shown in the table below. Notice that there are two possibilities for a correct decision and two possibilities for an incorrect decision.

                        Actual situation
Statistical decision    H0 true                      H0 false
Do not reject H0        Correct decision (1 − α)     Type II error (β)
Reject H0               Type I error (α)             Correct decision (1 − β)

3.4 Steps of hypothesis testing

• State the null and alternative hypotheses
• Specify the level of significance $\alpha$
• Calculate the value of the test statistic
• Set up the critical values that divide the rejection and nonrejection regions
• Determine the statistical decision
• Express the statistical decision in terms of the problem.

3.5 Test of hypothesis for the population mean $\mu$

This section explains how to perform a test of hypothesis for the population mean $\mu$. Here, there are two possible cases, as follows.

Case I. If either of the following two conditions is fulfilled:
1) The population standard deviation $\sigma$ is known
2) The sample size is large (i.e. $n \ge 30$)
then we use the normal distribution to perform the hypothesis testing of the population mean $\mu$. That is, if the standard deviation $\sigma$ is known or the sample size is large, then based on the central limit theorem, the sampling distribution of the sample mean $\bar{X}$ would follow a normal distribution, and the test statistic, which is based upon the difference between the sample mean $\bar{X}$ and the hypothesized mean $\mu$, would be found as follows:

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

The test statistic can be defined as a rule or criterion that is used to make the decision on whether or not to reject the null hypothesis.

Example 3.1: The TIV Telephone Company provides long-distance telephone service in an area. According to the company’s records, the average length of all long-distance calls placed through this company in 2009 was 12.44 minutes. The company’s management wanted to check if the mean length of the current long-distance calls is different from 12.44 minutes.
A sample of 150 such calls placed through this company produced a mean length of 13.71 minutes. The standard deviation of all such calls is 2.65 minutes. Using the 10% significance level, can you conclude that the mean length of all current long-distance calls is different from 12.44 minutes?

Example 3.2: A researcher reports that the average salary of assistant professors is more than R42,000. A sample of 30 assistant professors has a mean salary of R43,260 and a standard deviation of R5,230. At $\alpha = 0.05$, test the claim that assistant professors earn more than R42,000 per year.

Case II. If the population standard deviation is unknown and the sample size is small (i.e. $n < 30$), then we use the Student t distribution to perform the hypothesis testing of the population mean $\mu$. The test statistic for determining the difference between the sample mean $\bar{X}$ and the population mean $\mu$ when the sample size is small and the sample standard deviation $S$ is used is given by

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

If the population is assumed to be normal, the sampling distribution of the mean will follow a Student t distribution with $n - 1$ degrees of freedom.

Example 3.3: A psychologist claims that the mean age at which children start walking is 12.5 months. Carol wanted to check if this claim is true. She took a random sample of 18 children and found that the mean age at which these children started walking was 12.9 months with a standard deviation of 0.80 month. Using the 1% significance level, can you conclude that the mean age at which all children start walking is different from 12.5 months? Assume that the ages at which all children start walking have an approximately normal distribution.

Example 3.4: A medical investigation claims that the average number of infections per week at a hospital in south-western Pennsylvania is 16.3. A random sample of 10 weeks had a mean number of 17.7 infections and the sample standard deviation is 1.8 infections.
Is there enough evidence to reject the investigator’s claim at $\alpha = 0.05$?

3.6 Hypothesis test about a population proportion

In the preceding section we used hypothesis testing procedures for quantitative data (means). The concept of hypothesis testing can also be used to test hypotheses about qualitative data. The number of successes follows a binomial distribution. However, as we have seen previously when developing confidence intervals, if the sample size is large enough ($np \ge 5$ and $nq \ge 5$), the normal distribution gives a good approximation to the binomial distribution. The test statistic can be stated in two forms, in terms of either the proportion of successes or the number of successes:

$$Z = \frac{\hat{p} - p}{\sqrt{\dfrac{p(1-p)}{n}}} \qquad \text{or} \qquad Z = \frac{X - np}{\sqrt{np(1-p)}}$$

Example 3.5: According to a Nationwide Mutual Insurance Company Driving While Distracted Survey conducted in 2008, 81% of the drivers interviewed said that they have talked on their cell phones while driving (The New York Times, July 19, 2009). The survey included drivers aged 16 to 61 years selected from 48 states. Assume that this result holds true for the 2008 population of all such drivers in the United States. In a recent random sample of 1600 drivers aged 16 to 61 years selected from the United States, 83% said that they have talked on their cell phones while driving. Using the 5% significance level, can you conclude that the current percentage of such drivers who have talked on their cell phones while driving is different from 81%?

Example 3.6: A statistician read that at least 77% of the population oppose replacing R1 bills with R1 coins. To see if this claim is valid, the statistician selected a sample of 80 people and found that 55 were opposed to replacing the R1 bills. At $\alpha = 0.01$, test the claim that at least 77% of the population are opposed to the change.
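As a rough illustration of the proportion test statistic in Section 3.6, the following Python sketch works through the numbers of Example 3.6 using only the standard library. It assumes the left-tailed hypotheses $H_0: p \ge 0.77$ against $H_1: p < 0.77$ (one reading of the claim "at least 77%") and obtains the critical value from `statistics.NormalDist` instead of the printed normal table. The variable names are illustrative and are not part of the guide.

```python
from math import sqrt
from statistics import NormalDist

# Data from Example 3.6
n = 80        # sample size
x = 55        # number opposed to the change
p0 = 0.77     # hypothesized proportion under H0
alpha = 0.01  # significance level

# Assumed hypotheses: H0: p >= 0.77 versus H1: p < 0.77 (left-tailed)
p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Left-tailed critical value: the z-value with area alpha to its left
z_critical = NormalDist().inv_cdf(alpha)

reject_h0 = z < z_critical
print(f"p_hat = {p_hat:.4f}, z = {z:.3f}, critical value = {z_critical:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Under these assumptions the test statistic lies in the nonrejection region, so the sample does not give enough evidence at the 1% level to reject the claim.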
3.7 The P-Value Approach

In this procedure, we find a probability value such that a given null hypothesis is rejected for any $\alpha$ (significance level) greater than this value and is not rejected for any $\alpha$ less than this value. The probability-value approach, more commonly called the P-value approach, gives such a value. In this approach, we calculate the P-value for the test, which is defined as the smallest level of significance at which the given null hypothesis is rejected. Using this P-value, we state the decision. If we have a predetermined value of $\alpha$, then we compare the P-value with $\alpha$ and make a decision.

Definition 3.1: The P-value (or probability value) is the probability of getting a sample statistic (such as the mean) or a more extreme sample statistic in the direction of the alternative hypothesis when the null hypothesis is true.

Decision rule when using the P-value approach

We reject the null hypothesis if the P-value is less than or equal to the level of significance ($\alpha$). That is, reject $H_0$ when the P-value $\le \alpha$. Otherwise, we fail to reject the null hypothesis.

3.8 Hypothesis Tests About the Population Variance

A test of hypothesis about the population variance can be one-tailed or two-tailed. To make a test of hypothesis about $\sigma^2$, we perform the same steps we used earlier in hypothesis testing examples. The procedure to test a hypothesis about $\sigma^2$ discussed in this section is applied only when the population from which a sample is selected is (approximately) normally distributed.

The value of the test statistic $\chi^2$ is calculated as

$$\chi^2 = \frac{(n - 1)s^2}{\sigma^2}$$

where $s^2$ is the sample variance, $\sigma^2$ is the hypothesized value of the population variance, and $n - 1$ represents the degrees of freedom.

Example 3.7: The variance of scores on a standardized mathematics test for all high school seniors was 150 in 2009.
A sample of scores for 20 high school seniors who took this test this year gave a variance of 170. Test at the 5% significance level if the variance of current scores of all high school seniors on this test is different from 150. The population from which the sample is selected is assumed to be normally distributed.

Example 3.8: A sample of 21 observations selected from a normally distributed population produced a sample variance of 1.97.
a) Write the null and alternative hypotheses to test whether the population variance is greater than 1.75.
b) Using $\alpha = 0.025$, find the critical value of $\chi^2$. Show the rejection and nonrejection regions on a chi-square distribution curve.
c) Find the value of the test statistic $\chi^2$.
d) Using the 2.5% significance level, will you reject the null hypothesis stated in part a)? Support your answer.

3.9 Hypothesis testing for the difference between two population means $\mu_1 - \mu_2$

In the preceding sections we examined hypothesis testing procedures pertaining to whether a mean or proportion was equal to some specified value. These cases are usually referred to as one-sample tests, since a single sample is selected from a population of interest and a statistic computed from the sample is compared to a hypothesized value. In this section, we shall extend our discussion of hypothesis testing to consider additional procedures pertaining to quantitative and qualitative data.

Let us first extend the hypothesis testing concepts developed in the previous section to situations in which we would like to determine whether there is any difference between the means of two independent populations. Two populations are independent if the elements of one population have no relationship to the elements of the second population. If the elements of the two populations are somehow related, then the populations are said to be dependent.
Thus, for two independent populations, the selection of a sample from one population has no effect on the selection of a sample from the second population. Suppose then that we consider two independent populations, each having a mean and standard deviation, represented symbolically as follows:

Population I      Population II
μ1, σ1            μ2, σ2

The test to be performed can be either two-tailed or one-tailed, depending on whether we are testing if the two population means are merely different or if one mean is greater than the other mean.

Two-tailed test      One-tailed (left-tailed) test     One-tailed (right-tailed) test
H0: μ1 = μ2          H0: μ1 ≥ μ2                       H0: μ1 ≤ μ2
H1: μ1 ≠ μ2          H1: μ1 < μ2                       H1: μ1 > μ2

where
μ1 = mean of population 1
μ2 = mean of population 2

The statistic used to determine the difference between the population means is based upon the difference between the sample means $(\bar{X}_1 - \bar{X}_2)$. Because of the central limit theorem, the statistic will follow the normal distribution for large enough sample sizes. The test statistic is

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

The value of $\mu_1 - \mu_2$ in this formula is substituted from the null hypothesis.

Example 3.9: A 2008 survey of low- and middle-income households conducted by Demos, a liberal public policy group, showed that consumers aged 65 years and older had an average credit card debt of R10,235 and consumers in the 50- to 64-year age group had an average credit card debt of R9342 at the time of the survey (USA TODAY, July 28, 2009). Suppose that these averages were based on random samples of 1200 and 1400 people for the two groups, respectively. Further assume that the population standard deviations for the two groups were R2800 and R2500, respectively. Let $\mu_1$ and $\mu_2$ be the respective population means for the two groups, people aged 65 years and older and people in the 50- to 64-year age group. Test at the 1% significance level whether the population means for the 2008 credit card debts for the two groups are different.
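The two-sample z statistic above can be sketched in a few lines of Python using the figures given in Example 3.9. This is only an illustration under stated assumptions: the hypotheses are taken to be $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \ne \mu_2$, the critical value comes from the standard library's `statistics.NormalDist` rather than the printed table, and all variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Data from Example 3.9 (group 1: aged 65 and older, group 2: aged 50 to 64)
x1_bar, sigma1, n1 = 10235, 2800, 1200
x2_bar, sigma2, n2 = 9342, 2500, 1400
alpha = 0.01

# Under H0: mu1 = mu2, the hypothesized difference mu1 - mu2 is zero
hypothesized_diff = 0

# Standard deviation of the difference between the two sample means
standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
z = ((x1_bar - x2_bar) - hypothesized_diff) / standard_error

# Two-tailed test: area alpha/2 in each tail
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z) > z_critical
print(f"z = {z:.3f}, critical values = ±{z_critical:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Because the test is two-tailed, the sketch compares the absolute value of z with a single positive critical value instead of checking each tail separately.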
Example 3.10: A survey found that the average hotel room rate in New Orleans is R88.42 and the average room rate in Phoenix is R80.61. Assume that the data were obtained from two samples of 50 hotels each and that the standard deviations of the populations are R5.62 and R4.83, respectively. At $\alpha = 0.05$, can it be concluded that there is a significant difference in the rates?

However, as we mentioned previously, in most cases we do not know the standard deviation of either of the two populations ($\sigma_1$, $\sigma_2$). The only information usually available is the sample means and sample standard deviations. If the assumption is made that each population is normally distributed, a Student t test can be used to determine whether there is any difference between the means of the two populations. The Student t test statistic will be

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

The value of $\mu_1 - \mu_2$ in this formula is substituted from the null hypothesis. Note that the Student t test statistic above has the following degrees of freedom:

$$df = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{\left(S_1^2 / n_1\right)^2}{n_1 - 1} + \dfrac{\left(S_2^2 / n_2\right)^2}{n_2 - 1}}$$

The number given by this formula is always rounded down for $df$.

Example 3.11: A sample of 14 cans of Brand I diet soda gave the mean number of calories per can of 23 with a standard deviation of 3 calories. Another sample of 16 cans of Brand II diet soda gave the mean number of calories of 25 per can with a standard deviation of 4 calories. Test at the 1% significance level whether the mean numbers of calories per can of diet soda are different for these two brands. Assume that the calories per can of diet soda are normally distributed for each of these two brands and that the standard deviations for the two populations are not equal.

Example 3.12: A sample of 15 one-pound jars of coffee of Brand I showed that the mean amount of caffeine in these jars is 80 milligrams per jar with a standard deviation of 5 milligrams.
Another sample of 12 one-pound coffee jars of Brand II gave a mean amount of caffeine equal to 77 milligrams per jar with a standard deviation of 6 milligrams. Construct a 95% confidence interval for the difference between the mean amounts of caffeine in one-pound coffee jars of these two brands. Assume that the two populations are normally distributed and that the standard deviations of the two populations are not equal.

However, if the assumptions are made that each population is normally distributed and that the population variances are equal ($\sigma_1^2 = \sigma_2^2$), a Student t test can be used to determine whether there is any difference between the means of the two populations. Since we have assumed equal variances in the two populations, the variances of the two samples ($S_1^2$, $S_2^2$) can be pooled together to form one estimate ($S_p^2$) of the population variance. The Student t test statistic will be

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

The value of $\mu_1 - \mu_2$ in this formula is substituted from the null hypothesis, and the pooled variance $S_p^2$ for two samples is computed as

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$

where $n_1$ and $n_2$ are the sizes of the two samples and $S_1^2$ and $S_2^2$ are the variances of the two samples, respectively. Here $S_p^2$ is an estimator of $\sigma^2$.

Example 3.13: A consumer agency wanted to estimate the difference in the mean amounts of caffeine in two brands of coffee. The agency took a sample of 15 one-pound jars of Brand I coffee that showed the mean amount of caffeine in these jars to be 80 milligrams per jar with a standard deviation of 5 milligrams. Another sample of 12 one-pound jars of Brand II coffee gave a mean amount of caffeine equal to 77 milligrams per jar with a standard deviation of 6 milligrams. Construct a 95% confidence interval for the difference between the mean amounts of caffeine in one-pound jars of these two brands of coffee.
Assume that the two populations are normally distributed and that the standard deviations of the two populations are equal.

Example 3.14: A sample of 40 children from New York State showed that the mean time they spend watching television is 28.50 hours per week with a standard deviation of 4 hours. Another sample of 35 children from California showed that the mean time spent by them watching television is 23.25 hours per week with a standard deviation of 5 hours. Using a 2.5% significance level, can you conclude that the mean time spent watching television by children in New York State is greater than that for children in California? Assume that the standard deviations for the two populations are equal.

3.10 Hypothesis testing for the equality of variances from two populations

In many situations, we may also be interested in testing whether two populations have the same variability. Either we may be interested in testing the assumption of equal variances that we made for the t test in section 3.9, or we may be interested in studying the variances of two populations as an end in itself. In order to examine the equality of the variances of two independent populations, a statistical procedure has been devised that is based upon the ratio of the two sample variances. If the data from each population are assumed to be normally distributed, then the ratio of the two sample variances follows a distribution called the F distribution, which was named after the famous statistician R.A. Fisher. The test statistic for testing the ratio between two variances would be

$$F = \frac{s_1^2}{s_2^2}$$

where the larger of the two variances is placed in the numerator regardless of the subscripts. The F test has two terms for the degrees of freedom: that of the numerator, $n_1 - 1$, and that of the denominator, $n_2 - 1$, where $n_1$ is the sample size from which the larger variance was obtained.
In testing the ratio of two variances, either one-tailed or two-tailed tests can be employed as indicated in the table below.

Two-tailed test      One-tailed (left-tailed) test     One-tailed (right-tailed) test
H0: σ1² = σ2²        H0: σ1² ≥ σ2²                     H0: σ1² ≤ σ2²
H1: σ1² ≠ σ2²        H1: σ1² < σ2²                     H1: σ1² > σ2²

where
σ1² = variance of population 1
σ2² = variance of population 2

Example 3.15: A medical researcher wishes to see whether the variance of the heart rates (in beats per minute) of smokers is different from the variance of heart rates of people who do not smoke. Two samples are selected, and the data are as shown. Using $\alpha = 0.05$, is there enough evidence to support the claim?

Example 3.16: The standard deviation of the average waiting time to see a doctor for non-life-threatening problems in the emergency room at an urban hospital is 32 minutes. At a second hospital, the standard deviation is 28 minutes. A sample of 16 patients was used in the first case and 18 in the second case. Using $\alpha = 0.01$, is there enough evidence to conclude that the standard deviation of the waiting times in the first hospital is greater than the standard deviation of the waiting times in the second hospital?

3.11 Hypothesis testing for the difference between two population proportions $P_1 - P_2$

Rather than being concerned with the difference between two populations in terms of a quantitative variable, we could be interested in the difference in some qualitative characteristic. A test for the difference between two proportions based upon independent samples can be performed using the normal distribution. This test is based on the difference between the two sample proportions, which may be approximated by a normal distribution for large sample sizes.
For the two populations involved, we are interested in determining either whether there is any difference in the proportion of successes in the two groups (two-tailed test) or whether one group has a higher proportion of successes than the other group (one-tailed test).

Two-tailed test       One-tailed (left-tailed) test   One-tailed (right-tailed) test
$H_0: P_1 = P_2$      $H_0: P_1 \ge P_2$              $H_0: P_1 \le P_2$
$H_1: P_1 \ne P_2$    $H_1: P_1 < P_2$                $H_1: P_1 > P_2$

where
$P_1$ = proportion of successes in population 1
$P_2$ = proportion of successes in population 2

The test statistic is

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - (P_1 - P_2)}{\sqrt{\bar{P}(1 - \bar{P})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

The estimate of the population proportion that we use is based on the null hypothesis. Under the null hypothesis it is assumed that the two population proportions are equal. Therefore, we may obtain an overall estimate of the population proportion by pooling the two samples. The estimate $\bar{P}$ is simply the number of successes in the two samples combined divided by the total sample size. That is,

$$\bar{P} = \frac{X_1 + X_2}{n_1 + n_2}$$

Example 3.17: A researcher wanted to estimate the difference between the percentages of users of two toothpastes who will never switch to another toothpaste. In a sample of 500 users of Toothpaste $A$ taken by this researcher, 100 said that they will never switch to another toothpaste. In another sample of 400 users of Toothpaste $B$ taken by the same researcher, 68 said that they will never switch to another toothpaste. At the 10% significance level, can you conclude that the proportion of users of Toothpaste $A$ who will never switch to another toothpaste is higher than the proportion of users of Toothpaste $B$ who will never switch to another toothpaste?

Example 3.18: A company that has many department stores in the southern states wanted to find, at two such stores, the percentage of sales for which at least one of the items was returned.
A sample of 800 sales randomly selected from Store $A$ showed that for 280 of them at least one item was returned. Another sample of 900 sales randomly selected from Store $B$ showed that for 279 of them at least one item was returned. Using the 5% significance level, can you conclude that the proportion of all sales for which at least one item is returned is higher for Store $A$ than for Store $B$?

CHAPTER 4: CHI-SQUARE HYPOTHESIS TESTING

4.1 Introduction

The statistical inference techniques presented so far have dealt exclusively with hypothesis tests for population parameters such as the mean ($\mu$), variance ($\sigma^2$) and proportion ($P$). In this chapter, we consider inferential procedures that are not concerned with population parameters. These procedures are often called chi-square ($\chi^2$) procedures for the simple reason that they rely on a probability distribution called the chi-square distribution. A random variable has the chi-square distribution if its distribution has the shape of a special type of right-skewed curve, called the chi-square ($\chi^2$) curve.

4.1.1 Basic properties of $\chi^2$-curves
• The total area under a $\chi^2$-curve equals 1.
• A $\chi^2$-curve is right skewed.
• As the number of degrees of freedom becomes larger, a $\chi^2$-curve looks increasingly like a normal curve.
• A $\chi^2$-curve starts at zero on the horizontal axis and extends indefinitely to the right, approaching, but never touching, the horizontal axis.

4.1.2 Finding the $\chi^2$-value having the specified area to its right

For a $\chi^2$-curve with 8 degrees of freedom, find $\chi^2_{0.025}$; that is, find the $\chi^2$-value that has area 0.025 to its right.

To find this $\chi^2$-value, we use the chi-square distribution table. The number of degrees of freedom is 8, so we first go down the column labelled df to 8. Then, going across that row to the column labelled $\chi^2_{0.025}$, we reach 17.535. Therefore, for the $\chi^2$-curve with 8 degrees of freedom, $\chi^2_{0.025} = 17.535$.
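The table lookup above can be checked numerically. The sketch below is an illustration only, not part of the table method: it relies on the fact that, when the number of degrees of freedom is even, the right-tail area of a chi-square curve has a closed form, $e^{-x/2}\sum_{k=0}^{df/2-1}(x/2)^k/k!$. The function name `chi2_right_tail_even_df` is hypothetical, and the function deliberately refuses odd degrees of freedom, for which this formula does not apply.

```python
import math


def chi2_right_tail_even_df(x, df):
    """Area to the right of x under a chi-square curve with an EVEN number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers even, positive df")
    half = x / 2.0
    term = 1.0    # holds (x/2)^k / k!, starting at k = 0
    total = 0.0
    for k in range(df // 2):
        total += term
        term *= half / (k + 1)
    return math.exp(-half) * total


# Table value from Section 4.1.2: df = 8, chi-square value 17.535.
area = chi2_right_tail_even_df(17.535, 8)
print(round(area, 4))
```

The computed area should be very close to 0.025, which agrees with the table entry $\chi^2_{0.025} = 17.535$ for 8 degrees of freedom.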
Example 4.1: Use the chi-square distribution table to determine the required $\chi^2$-values. Illustrate your work graphically.
a) For a $\chi^2$-curve with 3 degrees of freedom, determine the $\chi^2$-values that have areas 0.025 and 0.95 to their right.
b) For a $\chi^2$-curve with df = 7, determine $\chi^2_{0.05}$ and $\chi^2_{0.975}$.
c) Consider $\chi^2$-curves with df = 12 and df = 20, respectively. Which one more closely resembles a normal curve? Explain your answer.

4.2 Chi-square goodness-of-fit test

The goodness-of-fit test is a chi-square procedure that can be used to perform a hypothesis test about the distribution of a qualitative (categorical) variable or of a discrete quantitative variable that has only finitely many possible values.

4.2.1 Distribution of the $\chi^2$-statistic for a goodness-of-fit test

For a chi-square goodness-of-fit test, the test statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

and it has approximately a chi-square distribution if the null hypothesis is correct. The number of degrees of freedom is one less than the number of possible values ($k$) of the variable of interest. In the chi-square statistic, $O_i$ represents the observed frequencies and $E_i$ represents the expected frequencies. The expected frequency for each possible value of the variable is obtained using the following formula:

$$E_i = n p_i$$

where $n$ is the sample size and $p_i$ is the relative frequency (or probability) given for the value in the null hypothesis.

4.2.2 Procedures for the Chi-square Goodness-of-fit Test

The purpose of the chi-square goodness-of-fit test is to perform a hypothesis test for the distribution of a variable.

Assumptions:
• The data are obtained from a random sample.
• The expected frequency of each category must be at least 5.
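The goodness-of-fit statistic and the expected-frequency check can be sketched in Python. This is a minimal illustration under stated assumptions: the function name `goodness_of_fit` is hypothetical, and the observed counts and hypothesised probabilities are made-up values, not data from this guide.

```python
def goodness_of_fit(observed, probs):
    """Return (chi2, df, expected) for observed counts and H0 probabilities."""
    n = sum(observed)
    expected = [n * p for p in probs]                      # E_i = n * p_i
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                                 # k - 1
    return chi2, df, expected


# Hypothetical data: 100 observations over three categories.
observed = [25, 25, 50]
probs = [0.2, 0.3, 0.5]

chi2, df, expected = goodness_of_fit(observed, probs)
assumption_ok = all(e >= 5 for e in expected)              # every expected frequency at least 5
print(round(chi2, 4), df, assumption_ok)
```

Here the expected frequencies are 20, 30 and 50, so the statistic is 25/20 + 25/30 + 0, roughly 2.08, with 2 degrees of freedom. It would be compared with the critical value from the chi-square table.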
Six steps to be followed when conducting a chi-square goodness-of-fit test:
• Step 1: The null and alternative hypotheses are, respectively,
  $H_0$: The variable has the specified distribution.
  $H_1$: The variable does not have the specified distribution.
• Step 2: Decide on the significance level, $\alpha$.
• Step 3: Compute the test statistic, $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$.
• Step 4: The critical value is $\chi^2_{\alpha}$ with $k - 1$ degrees of freedom, where $k$ is the number of possible values of the variable.
• Step 5: If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise, do not reject $H_0$.
• Step 6: Interpret the results of the hypothesis test.

Example 4.2: A simple random sample of 500 violent-crime reports from last year yielded the results in column 2 of Table 4.1. Column 3 gives the relative frequencies for 2016.

Table 4.1: Distribution of violent crimes in Polokwane.

Type of violent crime   Observed frequency   Relative frequency
Murder                  3                    0.011
Forcible rape           37                   0.063
Robbery                 154                  0.286
Assault                 306                  0.640

a) Identify the population and the variable of interest.
b) Check whether the two assumptions of the chi-square goodness-of-fit test are met.
c) At the 5% level of significance, do the data provide sufficient evidence to conclude that last year's violent-crime distribution is different from the 2016 distribution?

Example 4.3: Finger Lakes Homes manufactures four models of prefabricated homes: a two-story colonial, a log cabin, a split-level, and an A-frame. To help in production planning, management would like to determine whether previous customer purchases indicate that there is a preference in the style selected.

Table 4.2: The number of homes sold of each model for 100 sales over the past two years.

Model   Colonial   Log cabin   Split-level   A-frame
Sold    30         20          35            ____

a) Complete the table above.
b) Calculate the expected value for each category.
c) Test whether previous customer purchases indicate that there is a preference in the style selected. Use a 1% significance level.

Example 4.3: The Higher Education Research Institute of the University of Limpopo, South Africa, publishes information on the characteristics of incoming college freshmen in South Africa. In 2017, 27.7% of incoming freshmen characterised their political views as liberal, 51.9% as moderate, and 20.4% as conservative. For this year, a random sample of 250 incoming college freshmen produced the following frequency distribution for political views.

Table 4.3: Frequency distribution for political views.

Political view   Liberal   Moderate   Conservative
Frequency        80        123        47

a) Identify the population and the variable under consideration here.
b) Test whether the data provide sufficient evidence to conclude that this year's distribution of political views for incoming college freshmen has changed from the 2017 distribution.

4.3 Chi-square test for independence (association)

As indicated by the formula in the previous section, the chi-square test statistic measures how much the observed frequencies and the expected frequencies differ. The test establishes whether two categorical random variables are statistically related (dependent or independent). Statistical independence means that the outcome of one random variable in no way influences (or is influenced by) the outcome of the second random variable.

4.3.1 Distribution of the $\chi^2$-statistic for the independence test

The chi-square statistic that transforms the sample frequencies into a test statistic is expressed as follows:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

and it has approximately a chi-square distribution if the null hypothesis of non-association is correct. The number of degrees of freedom is $(r - 1)(c - 1)$, where $r$ and $c$ are the numbers of rows and columns, respectively. In the chi-square statistic, $O_i$ represents the observed frequencies and $E_i$ represents the expected frequencies.
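The independence statistic can be sketched for a contingency table stored as a list of rows. This is an illustration under stated assumptions: the function name `chi_square_independence` is hypothetical, the counts are made up, and the expected count of a cell is taken to be its row total times its column total divided by the sample size.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for each cell: (row total * column total) / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)     # (r - 1)(c - 1)
    return chi2, df, expected


# Hypothetical 2 x 2 table of counts (not data from this guide).
observed = [[30, 20],
            [20, 30]]
chi2, df, expected = chi_square_independence(observed)
print(chi2, df)
```

Every expected count in this made-up table is 25, so each cell contributes 25/25 = 1 and the statistic is 4 with 1 degree of freedom.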
The expected frequency for the cell in row $i$ and column $j$ of the contingency table is obtained using the following formula:

$$E_{ij} = \frac{R_i \times C_j}{n}$$

where $n$ is the sample size, $R_i$ is the sum of all the frequencies in row $i$ and $C_j$ is the sum of all the frequencies in column $j$. The sum in the test statistic runs over all the cells of the table.

4.3.2 Procedures for the Chi-square Independence Test

The purpose of the chi-square independence test is to perform a hypothesis test to decide whether two variables are associated.

Assumptions:
• The data are obtained from a random sample.
• The expected frequency of each category must be at least 5.

Six steps to be followed when conducting a chi-square independence test:
• Step 1: The null and alternative hypotheses are, respectively,
  $H_0$: The two variables are not associated.
  $H_1$: The two variables are associated.
• Step 2: Decide on the significance level, $\alpha$.
• Step 3: Compute the test statistic, $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$.
• Step 4: The critical value is $\chi^2_{\alpha}$ with $(r - 1)(c - 1)$ degrees of freedom.
• Step 5: If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise, do not reject $H_0$.
• Step 6: Interpret the results of the hypothesis test.

Example 4.4: Suppose that you are a marketing research analyst and you ask a random sample of 286 people whether they purchased Diet Pepsi or Diet Coke.

Table 4.4: Contingency table of Diet Pepsi and Diet Coke purchases.

                  Diet Pepsi
Diet Coke     No      Yes     Total
No            84      32      116
Yes           48      122     170
Total         132     154     286

a) Check whether the two assumptions of the chi-square independence test are met.
b) At the $\alpha = 0.01$ significance level, can one conclude that there exists a relationship between the two diet drinks?

Example 4.5: A national survey was conducted to obtain information on the alcohol consumption pattern of RSA adults by marital status. A random sample of 1772 residents aged 18 years and older yielded the data displayed in the table below.

Table 4.5: Four-by-three contingency table of alcohol consumption and marital status.
                   Drinks per month
Marital status   Abstain   1-60    Over 60
Single           67        213     74
Married          411       ____    129
Widowed          85        51      7
Divorced         27        60      15

a) Identify the population and the variables under consideration here.
b) Complete the contingency table above.
c) Check whether the two assumptions of the chi-square independence test are met.
d) Do the data provide sufficient evidence to conclude that an association exists between marital status and alcohol consumption?

CHAPTER 5: ANALYSIS OF VARIANCE

5.1 Introduction

Analysis of variance (ANOVA) is a statistical technique used to test whether there is a significant difference between two or more population means. This might seem strange because the technique is called "analysis of variance" rather than "analysis of population means". However, the name is appropriate because inference about the population means is made by analysing variances. In the context of ANOVA, populations are described as treatments, while observations are the results obtained after applying treatments to the experimental units. In this module we focus only on One-Way ANOVA.

5.2 Terms and concepts

Let us define some of the important terms and concepts in the design of experiments. We have already seen terms such as treatment, experimental unit, randomisation and response. However, we define them again here for completeness.

Definition 5.1: Treatments, sometimes called factors, are the different procedures (levels) that we want to investigate or compare, e.g. different kinds or amounts of fertiliser in agronomy.

Definition 5.2: Experimental units are the things to which we apply the treatments, e.g. patients in a hospital.

Definition 5.3: The response, sometimes called the dependent variable, is the outcome that we observe after applying a treatment to an experimental unit. That is, the response is what we measure to judge what happened in the experiment.
Definition 5.4: Randomization is the random allocation of the experimental units to the treatments or factor levels. That is, the experimental units are allocated to the treatments purely by chance.

Example 5.1: Suppose that a group of researchers from Rotterdam village in Giyani conducts a study to compare the mean caffeine content of three brands of tea leaves. They sampled 20 tea bags of each brand, analysed them for caffeine content and recorded the amount of caffeine in each tea bag in milligrams. From the study above:
a) What is the response variable?
b) Identify the treatment and the levels of interest in the study.
c) Identify the experimental units and the number of experimental units.

5.3 The F-Distribution

ANOVA procedures depend on a distribution called the F-distribution, which was named in honor of Sir Ronald Fisher. A variable is said to follow an F-distribution if its distribution has the shape of a special type of right-skewed curve called an F-curve. The F-distribution has two numbers of degrees of freedom instead of one. The first is called the degrees of freedom for the numerator and the second is called the degrees of freedom for the denominator.

5.3.1 Basic Properties of F-curves
• The total area under an F-curve equals 1.
• An F-curve is skewed to the right.
• An F-curve starts at zero on the horizontal axis and extends indefinitely to the right, approaching, but never touching, the horizontal axis as it does so.

5.3.2 Finding the F-value having the specified area to its right

For an F-curve with df = (4, 12), find $F_{0.05}$; that is, find the F-value having area 0.05 to its right.

To find this F-value, we use the F-distribution table above. In this case, $\alpha = 0.05$, the degrees of freedom for the numerator is 4, and the degrees of freedom for the denominator is 12. We first go down the dfd column to 12; then, going across that row to the column labelled 4, we reach 3.26.
Therefore, for the F-curve with df = (4, 12), $F_{0.05} = 3.26$.

Example 5.2: Use the F-distribution table to determine the required F-values. Illustrate your work graphically.
a) An F-curve has df = (8, 19). What are the degrees of freedom for the numerator and for the denominator?
b) F-curve that has df = (12, 5), with area 0.05 to its right.
c) F-curve with df = (20, 20), $F_{0.05}$.
d) F-curve with df = (23, 9), $F_{0.05}$.
e) F-curve with df = (35, 10), $F_{0.05}$.

5.4 Performing a One-Way ANOVA

To perform a one-way ANOVA, we need to determine three sums of squares: the total sum of squares (SST), the treatment sum of squares (SSTR) and the error sum of squares (SSE). For a one-way ANOVA with $t$ population means, the defining and computing formulas for the three sums of squares are as follows:

Sum of squares     Defining formula                               Computing formula
Total, SST         $\sum_{i=1}^{n} (x_i - \bar{x})^2$             $\sum_{i=1}^{n} x_i^2 - n\bar{x}^2$
Treatment, SSTR    $\sum_{j=1}^{t} n_j(\bar{x}_j - \bar{x})^2$    $\sum_{j=1}^{t} n_j\bar{x}_j^2 - n\bar{x}^2$
Error, SSE         $\sum_{j=1}^{t} (n_j - 1)s_j^2$                $\sum_{j=1}^{t} n_j s_j^2 - \sum_{j=1}^{t} s_j^2$

Note: The total sum of squares equals the treatment sum of squares plus the error sum of squares: $SST = SSTR + SSE$.

In the table above, we used the following notation:
$n$ = total number of observations
$\bar{x}$ = mean of all $n$ observations
and, for $j = 1, 2, \ldots, t$,
$n_j$ = size of the sample from population $j$
$\bar{x}_j$ = mean of the sample from population $j$
$s_j^2$ = variance of the sample from population $j$

5.5 One-Way ANOVA Table

To organize and summarize the quantities required for performing a one-way analysis of variance, we use a one-way ANOVA table. The general format of such a table is shown below:

Source of variation   SS       DF        MS       F-ratio
Treatment             $SSTR$   $t - 1$   $MSTR$   $MSTR/MSE$
Error                 $SSE$    $n - t$   $MSE$
Total                 $SST$    $n - 1$

Note: SS = sum of squares, DF = degrees of freedom and MS = mean squares. Each mean square is the corresponding sum of squares divided by its degrees of freedom, so $MSTR = SSTR/(t - 1)$ and $MSE = SSE/(n - t)$.

Procedures for One-Way ANOVA Test

The purpose of one-way ANOVA is to perform a hypothesis test to compare $t$ treatment or population means.
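The sums of squares and the entries of the one-way ANOVA table can be sketched in Python. This is a minimal illustration: the function name `one_way_anova` is hypothetical and the three small samples are made-up numbers, chosen so that the arithmetic is easy to follow by hand.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of samples (one list per treatment)."""
    all_values = [x for g in groups for x in g]
    n, t = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]

    sst = sum((x - grand_mean) ** 2 for x in all_values)                      # total
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))  # treatment
    # Squared deviations within each sample; this equals the sum of (n_j - 1) * s_j^2.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    mstr = sstr / (t - 1)
    mse = sse / (n - t)
    return {"SST": sst, "SSTR": sstr, "SSE": sse,
            "df": (t - 1, n - t), "MSTR": mstr, "MSE": mse, "F": mstr / mse}


# Hypothetical data: three treatments with three observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(result)
```

For these made-up samples the grand mean is 4, so SSTR = 42, SSE = 6 and SST = 48, which confirms SST = SSTR + SSE. The F-ratio is 21 with df = (2, 6).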
Assumptions:
• Simple random samples
• Independent samples
• Normal populations
• Equal population variances

Six steps to be followed when conducting a hypothesis test for comparing more than two population means:
• Step 1: The null and alternative hypotheses are, respectively,
  $H_0: \mu_1 = \mu_2 = \cdots = \mu_t$
  $H_1$: Not all the means are equal.
• Step 2: Decide on the significance level, $\alpha$.
• Step 3: Compute the value of the F-test statistic, $F = \frac{MSTR}{MSE}$.
• Step 4: The critical value is $F_{\alpha}$ with df = $(t - 1, n - t)$. Use the F-distribution table to find the critical value for the specified area.
• Step 5: If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise, do not reject $H_0$.
• Step 6: Interpret the results of the hypothesis test.

Example 5.3: A study was undertaken to compare the distance travelled, in kilometres per litre, for three competing brands of petrol. Fifteen identical cars were available for the experiment.

         Replications
Brand    1       2       3       4       5
A        9.5     11.0    13.0    15.0    18.0
B        10.5    12.0    14.0    16.0    10.5
C        10.0    10.5    13.5    14.5    10.0

a) What is the response variable of this study?
b) What is the sample size that was considered in this study?
c) Identify the treatment and the number of levels.
d) Identify the experimental units.
e) Compute the three sums of squares.
f) Use the answers in e) to construct the ANOVA table.
g) Is there a significant difference between the means of the three brands of petrol?

Example 5.4: Suppose that a company wishes to study the job satisfaction of employees according to their length of service. It plans to classify employees into five independent groups and select four employees at random from each group for intensive interviewing. Suppose that the results yielded $SST = 7872$ and $SSE = 1724$.
a) What is the sample size that was considered in this study?
b) Check the first two assumptions required for performing a one-way ANOVA test.
c) Use the information above to set up an appropriate ANOVA table.
d) Test at the $\alpha = 10\%$ level of significance whether the mean job satisfaction scores are equal for the five groups.

Example 5.5: Consider the summary statistics for former prisoners diagnosed with three different types of posttraumatic stress disorder (PTSD): $n_1 = 32$, $\bar{x}_1 = 73.0$, $s_1 = 19.2$; $n_2 = 20$, $\bar{x}_2 = 45.6$, $s_2 = 23.4$; $n_3 = 29$, $\bar{x}_3 = 34.5$ and $s_3 = 22.0$. Do the data provide sufficient evidence to conclude that the mean severity of PTSD differs among the three groups?

5.6 Pairwise comparisons of the treatments

In many practical situations, we will wish to compare only pairs of means. Frequently, we can determine which means differ by testing the differences between all pairs of treatment means. Suppose that we are interested in comparing all pairs of the $t$ treatment means and that the null hypotheses that we wish to test are $H_0: \mu_i = \mu_j$ for all $i \ne j$. There are numerous procedures available for this problem. We now present two popular methods for making such comparisons.

5.6.1 Tukey's pairwise comparison test

Suppose that, following an ANOVA in which we have rejected the null hypothesis of equal treatment means, we wish to test all pairwise mean comparisons: $H_0: \mu_i = \mu_j$ versus $H_1: \mu_i \ne \mu_j$ for all $i \ne j$. Tukey's test declares two means significantly different if the absolute value of the difference between their sample means exceeds

$$T = \frac{q_{\alpha}(t, n - t)}{\sqrt{2}} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where $q_{\alpha}(t, n - t)$ is the upper $\alpha$ point of the studentized range distribution. When all the samples have the same size $m$, this reduces to $q_{\alpha}(t, n - t)\sqrt{MSE/m}$. Equivalently, we could construct a set of $100(1 - \alpha)$ percent confidence intervals for all pairs of means as follows:

$$\bar{y}_i - \bar{y}_j \pm \frac{q_{\alpha}(t, n - t)}{\sqrt{2}} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

5.6.2 LSD method of pairwise comparison

Suppose that, following an ANOVA in which we have rejected the null hypothesis of equal treatment means, we wish to test all pairwise mean comparisons: $H_0: \mu_i = \mu_j$ versus $H_1: \mu_i \ne \mu_j$ for all $i \ne j$.
The Fisher Least Significant Difference (LSD) method declares two means significantly different if the absolute value of the difference between their sample means exceeds

$$LSD = t_{\alpha/2}(n - t) \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Equivalently, we could construct a set of $100(1 - \alpha)$ percent confidence intervals for all pairs of means as follows:

$$\bar{y}_i - \bar{y}_j \pm t_{\alpha/2}(n - t) \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

CHAPTER 6: SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS

6.1 Introduction

Regression analysis and correlation analysis are two statistical techniques that examine whether a relationship exists between two or more variables and measure the strength of that relationship. The relationship between any pair of variables $(x, y)$ can also be examined graphically by producing a scatter plot of their data values.

6.2 Simple Linear Regression Analysis

Simple linear regression analysis (SLRA) is the statistical approach for modelling the relationship between the dependent (response) variable $Y$ and one independent (explanatory) variable $X$. The simple linear regression model is expressed as follows:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $y$ is the dependent variable, $\beta_0$ and $\beta_1$ are the regression parameters called the $y$-intercept and the slope of the regression line, respectively, $x$ is the independent variable and $\varepsilon$ (epsilon) is the random error term.

The above model is said to be simple, linear in the parameters, and linear in the independent variable. It is simple because there is only one independent variable, linear in the parameters because no parameter appears as an exponent or is multiplied or divided by another parameter, and linear in the independent variable because this variable appears only to the first power.

6.2.1 The basic assumptions on the random error term
• The random error term has a mean value of zero, i.e. $E(\varepsilon) = 0$.
• The random error term has a constant variance $\sigma^2$, i.e. $Var(\varepsilon) = \sigma^2$.
• The random error term follows a normal distribution.

We note that, since the random error term follows a normal distribution with mean zero and constant variance $\sigma^2$, the dependent variable $y$ follows a normal distribution with mean $\beta_0 + \beta_1 x$ and variance $\sigma^2$; that is, $E(y) = \beta_0 + \beta_1 x$ and $Var(y) = \sigma^2$. The reason is that a linear function of a normally distributed variable also follows a normal distribution.

The fitted regression equation obtained using the least squares estimates is given by:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

This equation is used to predict the value of the response variable given the value of the explanatory variable.

6.2.2 Interpretation of the Regression Parameters

$\beta_1$ is the regression slope, which indicates the amount of change in the mean of the probability distribution of $y$ (or simply in the value of $y$) per unit change in the independent variable $x$.

$\beta_0$ is the $y$-intercept, the mean value of $y$ when $x = 0$. It often does not have any particular meaning as a separate term in the model.

6.2.3 Formulas for the Least Squares Estimates

Slope: $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$

$y$-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where:
$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$
$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$
$SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$
and $n$ = sample size.

Example 6.1: Consider the information from five observations on sales revenue ($y$) and advertising expenditure ($x$) in the table below:

Sales revenue (R'000)   Advertising expenditure (R'00)
1                       1
1                       2
2                       3
2                       4
4                       5

a) Draw a scatter plot of the data above. Comment on the relationship between the variables.
b) Estimate the regression parameters and fit the least squares equation.
c) Interpret the regression slope in terms of the sales revenue and advertising expenditure.
d) Estimate the values of $y$ and calculate the mean of the estimated values.

Example 6.2: Suppose that you are a statistician working for a certain company selling used cars in Polokwane.
Consider the summary data below, based on the age ($x$) and price ($y$) of eleven cars: $\sum x = 58$, $\sum y = 975$, $\sum xy = 4732$, $\sum x^2 = 326$, $\sum y^2 = 96129$.
a) Estimate the regression parameters for the eleven cars.
b) If you were to draw a scatter plot of the data, what kind of relationship would you expect? Explain your answer.
c) Interpret the regression slope in terms of the age and price of the cars.
d) Fit the least squares equation.
e) Predict the prices of a 3-year-old and a 4-year-old used car. Comment on the predicted prices.

6.3 Inference in regression analysis

In this section, we discuss inferences concerning the regression slope $\beta_1$, considering both confidence interval estimation and hypothesis testing of $\beta_1$.

Hypothesis testing for $\beta_1$

We use the Student t distribution to perform hypothesis tests on the regression slope $\beta_1$. The hypotheses take one of the following forms:

Two-tailed test            One-tailed (left-tailed) test   One-tailed (right-tailed) test
$H_0: \beta_1 = b_1$       $H_0: \beta_1 \ge b_1$          $H_0: \beta_1 \le b_1$
$H_1: \beta_1 \ne b_1$     $H_1: \beta_1 < b_1$            $H_1: \beta_1 > b_1$

where $b_1$ is the hypothesized value of $\beta_1$. The test statistic is given by

$$t = \frac{\hat{\beta}_1 - b_1}{S_{\hat{\beta}_1}} \quad \text{and} \quad S_{\hat{\beta}_1} = \frac{S_{yx}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}}$$

Here $S_{yx} = \sqrt{SSE/(n - 2)}$ denotes the standard error of estimate, with $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, and the test statistic has $n - 2$ degrees of freedom.

A second, equivalent method for testing the existence of a linear relationship between the variables is to set up a confidence interval estimate of $\beta_1$ and determine whether the hypothesised value ($\beta_1 = b_1$) is included in the interval. The confidence interval estimate of $\beta_1$ is obtained using the following formula:

$$\hat{\beta}_1 \pm t_{\alpha/2}(n - 2)\, S_{\hat{\beta}_1}$$

6.4 Correlation Analysis

The reliability of the estimate of the response variable ($y$) depends on the strength of the relationship between the independent variable ($x$) and the dependent variable ($y$). A strong relationship implies a more accurate and reliable estimate of the response variable.

Definition 6.1: The Pearson coefficient of correlation is a measure of the strength of the relationship between two variables, $x$ and $y$.
The following expression is used to calculate the sample Pearson correlation coefficient:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}} = \hat{\beta}_1 \sqrt{\frac{SS_{xx}}{SS_{yy}}}$$

We note that $SS_{xy}$ appears in the numerator of the expressions for both the correlation coefficient and the regression slope. Therefore, $SS_{xy}$, $\hat{\beta}_1$ and $r$ will always have the same sign (positive or negative). The Pearson correlation coefficient is a number between -1 and 1 inclusive, which measures the degree to which the two variables are linearly related.

6.4.1 The strength of correlations can be interpreted as follows

$r$ value (±)    Correlation   Relationship
0.00 to 0.09     Very low      Very weak
0.10 to 0.29     Low           Weak
0.30 to 0.49     Medium        Moderate
0.50 to 0.89     High          Strong
0.90 to 1.00     Very high     Very strong (perfect at 1.00)

6.4.2 Scatter diagrams which illustrate relationships between points are given below:

6.5 Inference about the correlation

Testing for the existence of a linear relationship between two variables is the same as determining whether there is any significant correlation between them. The population correlation coefficient $\rho$ is hypothesized to be equal to zero. Thus the null and alternative hypotheses are

$H_0: \rho = 0$ versus $H_1: \rho \ne 0$

The test statistic for determining the existence of correlation, which has $n - 2$ degrees of freedom, is given by

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

6.6 The Coefficient of Determination

Another way of measuring the utility of the regression model is to quantify the contribution of the independent variable ($x$) in predicting the values of the dependent variable ($y$). To do this, we measure how much the error of predicting the value of $y$ is reduced by using the information provided by the independent variable.

Definition 6.2: The coefficient of determination measures the proportion of variation in the dependent variable that is explained by the independent variable.
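The least squares formulas of Section 6.2.3, the correlation coefficient of Section 6.4 and the test statistic of Section 6.5 can be traced numerically. The sketch below is an illustration only: the function name `simple_linear_regression` is hypothetical, and the data are the five observations of Example 6.1, with advertising expenditure as $x$ and sales revenue as $y$.

```python
import math


def simple_linear_regression(x, y):
    """Return (slope, intercept, r, t) using the SS_xy, SS_xx and SS_yy computing formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    ss_xx = sum(a * a for a in x) - n * x_bar ** 2
    ss_yy = sum(b * b for b in y) - n * y_bar ** 2

    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    # Test statistic for H0: rho = 0; undefined when r is exactly +1 or -1.
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return slope, intercept, r, t_stat


# Example 6.1: advertising expenditure (x) and sales revenue (y).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
slope, intercept, r, t_stat = simple_linear_regression(x, y)
print(round(slope, 3), round(intercept, 3), round(r, 3), round(t_stat, 3))
```

For these data $SS_{xy} = 7$, $SS_{xx} = 10$ and $SS_{yy} = 6$, so the slope is 0.7, the intercept is -0.1 and $r$ is about 0.90. The slope and $r$ share the sign of $SS_{xy}$, as noted in Section 6.4.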
The following formula is used to compute the value of the coefficient of determination:

$$R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$$

where $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

In simple linear regression, the coefficient of determination can also be computed as the square of the correlation coefficient:

$$r^2 = \left(\frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}}\right)^2 = \hat{\beta}_1^2 \, \frac{SS_{xx}}{SS_{yy}}$$

Interpretation: about $100(r^2)\%$ of the sample variation in the dependent variable $y$ can be explained by (or attributed to) using the independent variable $x$ to predict the value of $y$ in the regression equation.

CHAPTER 7: INDEX NUMBERS

7.1 Introduction

An index is a summary value which reflects how business or economic activity has changed over time. The consumer price index is the most commonly understood economic index. Index numbers are used to measure either price or quantity changes over time. They play a vital role in the monitoring of business performance as well as in the preparation of business forecasts.

Definition 7.1: An index number is a summary measure of the overall change in the level of activity of a single item or a basket of related items from one time period to another.

Index numbers are most commonly used to monitor price and quantity changes over time. They can also monitor changes in business performance levels and are therefore a useful planning and control tool in business. The best known and most widely used index number in any country is the consumer price index (CPI), or inflation indicator.

An index number is constructed by dividing the value of an item in the current period by its value in the base period, expressed as a percentage:

$$\text{Index number} = \frac{\text{current period value}}{\text{base period value}} \times 100$$

An index number above 100 indicates an increase in the level of activity being monitored, while an index number below 100 reflects a decrease in activity relative to the base period. There are two major categories of index numbers.
Within each of the two categories, an index value can be computed for either a single item or a basket of related items.
• Price indexes
  • Single price index
  • Composite price index
• Quantity indexes
  • Single quantity index
  • Composite quantity index

The following notation is used in the construction of price and quantity index numbers:
$P_0$ = base period price
$q_0$ = base period quantity
$P_1$ = current period price
$q_1$ = current period quantity

7.2 Price indexes

A price index measures the percentage change in price between any two time periods, either for a single item or for a basket of related items.

7.2.1 Simple price index (Price relative)

The simple price index is the change in price from a base period to another time period for a single item. It is sometimes called the price relative. The price relative is computed as:

$$\text{Price relative} = \frac{P_1}{P_0} \times 100$$

Note that the ratio is multiplied by 100 in order to express it as a percentage.

Example 7.1: Consider the information about the prices of 95-unleaded fuel in Polokwane for each year from 2014 to 2017.

Year    Price/litre
2014    R10.22
2015    R11.78
2016    R12.28
2017    R13.49

Using 2014 as the base period, compute and interpret the price relatives for 95-unleaded fuel in Polokwane for these years:
a) 2015
b) 2016
c) 2017

7.2.3 Composite price index

A composite price index measures the average price change for a basket of related items (activities) from one time period (the base period) to another period (the current period). There are two techniques that can be used to compute the composite price index once the weighting method, Laspeyres or Paasche, has been chosen. The two techniques yield the same index value; however, the reasoning behind the calculations is not the same. The two computational techniques are:
• The method of weighted aggregates and
• The method of weighted average of price relatives.
Since the two methods produce the same index value, at this level we will only look at the method of weighted aggregates. The formula for the weighted aggregates index is easier to use than the one for the weighted average of price relatives.

7.2.3.1 Weighted Aggregates Method – using the Laspeyres Weighting Approach

The Laspeyres composite price index uses the base period quantities to weight the basket. Using the weighted aggregates method, it is constructed in the three steps listed below:

• Step 1: Compute the base period value for the basket of items:

\[ \text{Base period value} = \sum (P_0 \times q_0) \]

The base period value is what the basket of items would have cost in the base period.

• Step 2: Compute the current period value for the basket of items:

\[ \text{Current period value} = \sum (P_1 \times q_0) \]

The current period value is the cost of the basket of items in the current period, paying current prices but consuming base period quantities.

• Step 3: Calculate the composite price index:

\[ \text{Laspeyres price index} = \frac{\sum (P_1 \times q_0)}{\sum (P_0 \times q_0)} \times 100\% \]

7.2.3.2 Weighted Aggregates Method – using the Paasche Weighting Approach

The Paasche composite price index uses the current period quantities to weight the basket. The same three steps of the weighted aggregates method are followed:

• Step 1: Compute the base period value for the basket of items:

\[ \text{Base period value} = \sum (P_0 \times q_1) \]

The base period value is what the basket of items would have cost in the base period, but consuming current period quantities.

• Step 2: Compute the current period value for the basket of items:

\[ \text{Current period value} = \sum (P_1 \times q_1) \]

The current period value is the current cost of the basket of items, based on current prices and current consumption.

• Step 3: Calculate the composite price index:

\[ \text{Paasche price index} = \frac{\sum (P_1 \times q_1)}{\sum (P_0 \times q_1)} \times 100\% \]

Example 7.2: The data in Table 7.1 shows the usage of a basket of three toiletry items in two-person households in Giyani for 2016 and 2017 respectively.

Table 7.1: Annual Household Consumption of a Basket of Toiletries (2016-2017)

                  Base year (2016)          Current year (2017)
Toiletry item     Unit price   Quantity     Unit price   Quantity
Soap              R5.95        38           R6.10        41
Deodorant         R18.65       25           R19.95       19
Toothpaste        R8.29        15           R8.74        17

a) Calculate the price relative for soap. Interpret your answer.
b) Calculate the price relative for toothpaste in Giyani. What is the meaning of this value?
c) Compute the Laspeyres weighted aggregate composite price index for the basket of toiletries. Interpret the value.
d) Compute the Paasche weighted aggregate composite price index for the basket of toiletries. Interpret the answer.

Example 7.3: A printing company that specialises in business stationery has recorded its usage and cost of printer cartridges for its four different printers.

                     2015                     2016                     2017
Printer cartridge    Unit price   Quantity    Unit price   Quantity    Unit price   Quantity
HQ21                 145          24          155          28          149          36
HQ25                 172          37          165          39          160          44
HQ26                 236          12          255          12          262          14
HQ32                 314          10          306          8           299          11

a) Using 2015 as the base period, calculate the price relatives of the HQ26 and HQ32 printer cartridges for 2016 only. Interpret the meaning of these two price relatives.
b) Using 2016 as the base period, calculate the price relatives of the HQ21 and HQ25 printer cartridges for 2017 only. Interpret the meaning of these two price relatives.
c) Calculate the composite price indexes for 2016 and 2017, with 2015 as the base period, using each of the following methods:
   (i) The Laspeyres weighted aggregate method
   (ii) The Paasche weighted aggregate method
d) Interpret all the values obtained in c) above.
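Hand calculations for Example 7.2 can be checked with a short Python sketch of the three-step weighted aggregates method. The data are taken from Table 7.1; the function and variable names are illustrative choices, and the code is a minimal sketch rather than a prescribed solution.

```python
# Table 7.1 rows as (item, P0, q0, P1, q1): 2016 is the base year, 2017 the current year.
basket = [
    ("Soap", 5.95, 38, 6.10, 41),
    ("Deodorant", 18.65, 25, 19.95, 19),
    ("Toothpaste", 8.29, 15, 8.74, 17),
]


def price_relative(p0, p1):
    """Simple price index for a single item: P1 / P0 * 100."""
    return p1 / p0 * 100


def laspeyres_price_index(items):
    """Weight both periods' prices by the base period quantities q0."""
    base_value = sum(p0 * q0 for _, p0, q0, _, _ in items)      # Step 1
    current_value = sum(p1 * q0 for _, _, q0, p1, _ in items)   # Step 2
    return current_value / base_value * 100                     # Step 3


def paasche_price_index(items):
    """Weight both periods' prices by the current period quantities q1."""
    base_value = sum(p0 * q1 for _, p0, _, _, q1 in items)      # Step 1
    current_value = sum(p1 * q1 for _, _, _, p1, q1 in items)   # Step 2
    return current_value / base_value * 100                     # Step 3


for name, p0, _, p1, _ in basket:
    print(name, round(price_relative(p0, p1), 2))
print("Laspeyres:", round(laspeyres_price_index(basket), 2))
print("Paasche:", round(paasche_price_index(basket), 2))
```

Both composite indexes come out above 100, so under either weighting the basket of toiletries cost more in 2017 than in 2016. The two values differ slightly because the Paasche index gives less weight to deodorant, whose consumption fell from 25 to 19 units.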
7.3 Quantity Indexes

A quantity index measures the percentage change in consumption level, either for a single item or for a basket of items, from one time period to another.

7.3.1 Simple Quantity Index (Quantity Relative)

For a single item, the change in units consumed from a base period to another time period is found by calculating its quantity relative. The quantity relative is expressed as follows:

\[ \text{Quantity relative} = \frac{q_1}{q_0} \times 100\% \]

The ratio is multiplied by 100 to express it in percentage terms.

Example 7.4: In 2014, Baloyi TS hardware store sold 145 window frames. In 2015, window frame sales were only 125 units, while in 2016 sales of window frames rose to 175 units. Using 2014 as the base period, find the quantity relatives of window frames for 2015 and 2016. What is the meaning of the computed values?

7.3.2 Composite quantity index

A composite quantity index measures the average consumption (quantity) change for a basket of related items from one time period (the base period) to another time period (the current period).

7.3.2.1 Weighted Aggregates Method – Composite Quantity Index

This method compares the aggregate value of the basket of related items between the current period and the base period. The composite quantity index reflects the overall consumption changes while holding prices constant at either the base period level (Laspeyres approach) or the current period level (Paasche approach).

The Laspeyres approach holds prices constant at base period levels:

\[ \text{Laspeyres quantity index} = \frac{\sum (P_0 \times q_1)}{\sum (P_0 \times q_0)} \times 100\% \]

The Paasche approach holds prices constant at current period levels:

\[ \text{Paasche quantity index} = \frac{\sum (P_1 \times q_1)}{\sum (P_1 \times q_0)} \times 100\% \]

Example 7.5: The data in the table below refers to a basket of three carpentry items used by Ngoveni woodwork company in the manufacture of cupboards for 2015 and 2016, respectively.

                     Base year (2015)          Current year (2016)
Carpentry item       Unit price   Quantity     Unit price   Quantity
Cold glue (1 l)      R14          45           R17          55
Boards (m^2)         R65          125          R80          115
Paint (5 l)          R125         20           R130         25

a) Using 2015 as the base period, calculate the quantity relatives of the cold glue and the paint. Interpret the meaning of these two quantity relatives.
b) Using the Laspeyres weighted aggregates method, construct the composite quantity index for the average change of carpentry materials used between 2015 (as base period) and 2016 (as current period). Interpret the value.
c) Using the Paasche weighted aggregates method, construct the composite quantity index for the average change of carpentry materials used between 2015 (as base period) and 2016 (as current period). Interpret the value.

Example 7.6: A printing company that specialises in business stationery has recorded its usage and cost of printer cartridges for its four different printers.

                     2015                     2016                     2017
Printer cartridge    Unit price   Quantity    Unit price   Quantity    Unit price   Quantity
HQ21                 145          24          155          28          149          36
HQ25                 172          37          165          39          160          44
HQ26                 236          12          255          12          262          14
HQ32                 314          10          306          8           299          11

a) Using 2015 as the base period, calculate the quantity relatives of the HQ26 and HQ32 printer cartridges for 2016 only. Interpret the meaning of these two quantity relatives.
b) Using 2016 as the base period, calculate the quantity relatives of the HQ26 and HQ32 printer cartridges for 2017 only. Interpret the meaning of these two quantity relatives.
c) Calculate the composite quantity indexes for 2016 and 2017, with 2015 as the base period, using each of the following methods:
   (i) The Laspeyres weighted aggregate method
   (ii) The Paasche weighted aggregate method
d) Interpret all the values obtained in c) above.
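The quantity index formulas of Section 7.3.2.1 can be sketched in the same way for checking hand calculations. The data below are those of Example 7.5; the function and variable names are illustrative choices, and this is one possible way of organising the calculation rather than a prescribed solution.

```python
# Example 7.5 rows as (item, P0, q0, P1, q1): 2015 is the base year, 2016 the current year.
carpentry = [
    ("Cold glue", 14, 45, 17, 55),
    ("Boards", 65, 125, 80, 115),
    ("Paint", 125, 20, 130, 25),
]


def quantity_relative(q0, q1):
    """Simple quantity index for a single item: q1 / q0 * 100."""
    return q1 / q0 * 100


def laspeyres_quantity_index(items):
    """Prices held constant at base period levels P0."""
    current_value = sum(p0 * q1 for _, p0, _, _, q1 in items)
    base_value = sum(p0 * q0 for _, p0, q0, _, _ in items)
    return current_value / base_value * 100


def paasche_quantity_index(items):
    """Prices held constant at current period levels P1."""
    current_value = sum(p1 * q1 for _, _, _, p1, q1 in items)
    base_value = sum(p1 * q0 for _, _, q0, p1, _ in items)
    return current_value / base_value * 100


for name, _, q0, _, q1 in carpentry:
    print(name, round(quantity_relative(q0, q1), 2))
print("Laspeyres:", round(laspeyres_quantity_index(carpentry), 2))
print("Paasche:", round(paasche_quantity_index(carpentry), 2))
```

Although the individual quantity relatives for glue and paint are well above 100, both composite indexes are only slightly above 100. This is because boards carry by far the largest value weight in the basket and their usage fell from 125 to 115 units.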
CHAPTER 8: TIME SERIES ANALYSIS

8.1 Introduction

Most of the data used in statistical analysis is cross-sectional data, meaning that it is gathered from a sample survey at one point in time. However, data can also be collected over time. For example, when a company records its daily, weekly or monthly turnover, or when a household records its daily or monthly electricity usage, it is compiling time series data.

Definition 8.1: A time series is a set of numeric data of a random variable that is gathered over time at regular intervals and arranged in time order.

The purpose of time series analysis is to identify any recurring patterns in a time series, quantify these patterns by building a statistical model, and then use the statistical model to prepare forecasts of future values of the time series.

8.2 Components of a Time Series

Time series analysis assumes that the data values of a time series variable are determined by four underlying environmental forces that operate both individually and collectively over time. The four underlying environmental forces are:

• Trend (T): a long-term, smooth underlying movement in a time series. It measures the effect that long-term factors have on the time series.
• Cycles (C): the medium- to long-term deviations from the trend. They reflect alternating periods of relative expansion and contraction of economic activity.
• Seasonality (S): seasonal variations are fluctuations in a time series that are repeated at regular intervals within a year (daily, weekly, monthly).
• Irregular (random) influences (I): irregular fluctuations in a time series are attributed to unpredictable events, such as natural disasters (floods) or man-made disasters (strikes).

Time series analysis attempts to isolate each of these components and quantify them statistically. The process of doing this is called decomposition of the time series.
Once these components are identified and quantified, they are combined and used to estimate the future values of the time series variable.

8.3 Decomposition of a Time Series

Time series analysis aims to isolate the influence of each of the four components on the actual time series. The time series model used as the basis for analysing the influence of these four components assumes a multiplicative relationship between them. The multiplicative time series model is expressed as follows:

\[ \text{Actual } y = \text{trend} \times \text{cyclical} \times \text{seasonal} \times \text{irregular} = T \times C \times S \times I \]

Trend and seasonal components account for the most significant proportion of an actual value in a time series. By isolating them, most of the actual time series values will be explained. Therefore, we will examine the statistical approaches to quantifying trend and seasonal variation only.

8.4 Trend Analysis

The long-term trend in a time series can be isolated by removing the medium-term and short-term fluctuations (cyclical, seasonal and irregular) in the series. This results in either a smooth curve or a straight line, depending on the technique chosen. The two techniques which can be applied for trend isolation are:

• Moving average: the technique which produces a smooth curve.
• Regression analysis: the technique which produces a straight-line trend.

8.4.1 The Moving Average Technique

A moving average removes the short-term fluctuations in the time series by taking successive averages of groups of observations. Each time period's actual value is replaced by the average of the observations from the time periods that surround it. This results in a smoothed time series.

The four steps for calculating a k-period moving average, when k is odd, are as follows:

• Step 1: Sum the first k periods' observations and position the total opposite the middle time period.
• Step 2: Repeat the summing of k periods' observations by removing the first period's observation and including the next period's observation.
• Step 3: Continue producing these moving totals until the end of the time series is reached. The process of positioning each moving total opposite the middle time period of each sum of k observations is called centring.
• Step 4: Calculate the moving average series by dividing each moving total by k.

Example 8.1: The table below shows the number of fire insurance claims received by an insurance company in each four-month period from 2014 to 2017.

Year      2014            2015            2016            2017
Period    P1   P2   P3    P1   P2   P3    P1   P2   P3    P1   P2   P3
Claims    3    5    9     7    9    12    4    10   13    9    10   7

Calculate the three-period, five-period and seven-period moving averages for the number of insurance claims received.

8.4.1.1 Centring an Uncentred Moving Average

A moving average value must always be centred on a middle time period. When the number of periods averaged is odd, centring occurs directly when the moving average value is positioned at the middle time period of the k observations. However, when the moving average is calculated for an even number of time periods, the moving total will be uncentred, because it falls between two time periods. The three steps for centring an uncentred moving average are:

• Step 1: Calculate the uncentred moving totals.
• Step 2: Centre the uncentred moving totals. Calculate a second moving total series consisting of the sums of pairs of adjacent uncentred moving totals. Each second moving total value is centred between the two uncentred moving total values, which positions these second moving totals on an actual time period.
• Step 3: Calculate the centred moving averages. A centred moving average is calculated by dividing each centred moving total by 2 × k.

Example 8.2: A cycle shop recorded the quarterly sales of racing bicycles for the period 2014 to 2016, as shown in the table below.

Year         2014                 2015                 2016
Period       Q1   Q2   Q3   Q4    Q1   Q2   Q3   Q4    Q1   Q2   Q3   Q4
Sales (y)    17   13   15   19    17   19   22   14    20   23   19   20

Produce a four-period centred moving average for the quarterly sales of racing bicycles sold by the cycle shop during the period 2014 to 2016.

Example 8.3: The table below shows the number of quarterly orders received by a security company in each quarter from 2014 to 2017.

Quarter    2014   2015   2016   2017
Q1         20     20     23     23
Q2         16     22     26     27
Q3         18     25     22     20
Q4         21     17     23     22

Calculate a four-period centred moving average for the quarterly orders received by the security company in each quarter from 2014 to 2017.

8.4.2 Regression Analysis Technique

A trend line isolates the trend (T) component of the time series only. It shows the general direction (upward, downward or constant) in which the series is moving. It is therefore best represented by a simple linear regression line.

The method of least squares from regression analysis (Chapter 6) is used to estimate the regression parameters and so find the trend line of best fit to a time series of numeric data. The dependent variable, y, is the actual time series and the independent variable, x, is time. To use time as an independent variable in regression analysis, it must be numerically coded. Any sequential numbering system can be used; however, in this chapter we will use the set of natural numbers (x = 1, 2, ..., n), where n is the number of time periods in the time series.

Example 8.4: The number of houses sold quarterly by Valley Estates in the Cape peninsula is recorded for the 16 quarters from 2014 to 2017, as shown in the table below.

Quarter    2014   2015   2016   2017
Q1         20     20     23     23
Q2         16     22     26     27
Q3         18     25     22     20
Q4         21     17     23     22

a) Use the least squares method to estimate the regression parameters.
b) Construct the trend line for the quarterly house sales data for Valley Estates.
c) Use the regression trend line to estimate the level of house sales for the first and third quarters of 2018.
d) Interpret the meaning of the values obtained in c) above.

Example 8.5: Consider the dataset of the quarterly sales of racing bicycles for the period 2014 to 2016 shown in Example 8.2.

a) Construct the trend line for the quarterly sales of racing bicycles.
b) Interpret the meaning of the magnitude of the regression slope.
c) Estimate the sales of racing bicycles for the second and third quarters of 2018.

8.5 Seasonal Analysis

Seasonal analysis isolates the influence of seasonal forces on a time series. The ratio-to-moving-average method is used to measure and quantify these seasonal influences. This method expresses the seasonal influence as an index number. It measures the percentage deviation of the actual values of the time series, y, from a base value that excludes the short-term seasonal influences. These base values of a time series represent the trend/cyclical influences only.

8.5.1 Ratio to Moving Average Technique

• Step 1: Identify the trend/cyclical movement. The moving average approach, as described earlier, isolates the combined trend/cyclical components in a time series. The choice of an appropriate moving average term, k, is determined by the number of periods that span the short-term seasonal fluctuations. In most instances, the term k corresponds to the number of observations that span a one-year period. The table below shows the appropriate term to use to remove short-term seasonal fluctuations that occur annually in time series data.

Time interval     Appropriate term (k)
Weekly            52-period term
Monthly           12-period term
Bi-monthly        6-period term
Quarterly         4-period term
Four-monthly      3-period term
Half-yearly       2-period term

• Step 2: Find the seasonal ratios. A seasonal ratio for each period is found by dividing each actual time series value, y, by its corresponding moving average value.

\[ \text{Seasonal ratio} = \frac{\text{Actual } y}{\text{Moving average } y} \times 100 \]

A seasonal ratio is an index that measures the percentage deviation of each actual y from its moving average value.

• Step 3: Produce the median seasonal indexes. Average the seasonal ratios across the corresponding periods within the years, using the median as the average, to smooth out the irregular component inherent in the seasonal ratios.

• Step 4: Calculate the adjusted seasonal indexes. Each seasonal index has a base index of 100. Therefore, the sum of the k median seasonal indexes must equal 100 × k. If this is not the case, each median seasonal index must be adjusted to a base of 100 by multiplying it by the adjustment factor, which is determined as follows:

\[ \text{Adjustment factor} = \frac{k \times 100}{\sum (\text{median seasonal indexes})} \]

Example 8.6: Refer to the quarterly house sales by Valley Estates in the Cape peninsula from 2014 to 2017 (Example 8.4). Calculate the quarterly seasonal indexes for the house sales dataset.

Example 8.7: Refer to the number of fire insurance claims received by an insurance company in each four-month period from 2014 to 2017 (Example 8.1). Calculate the four-monthly seasonal indexes for the fire insurance claims dataset.

TABLES

Table H1: Standard Normal Probabilities

The values in the table are the areas between zero and the z-score.
That is, P(0 < Z < z-score).

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4   0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5   0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6   0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7   0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8   0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9   0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0   0.4987  0.4987  0.4987  0.4988  0.4988  0.4989  0.4989  0.4989  0.4990  0.4990

Table 2: Critical Values for Student's t

df     t.100    t.050    t.025     t.010     t.005     t.001      t.0005
1      3.078    6.314    12.706    31.821    63.657    318.310    636.620
2      1.886    2.920    4.303     6.965     9.925     22.326     31.598
3      1.638    2.353    3.182     4.541     5.841     10.213     12.924
4      1.533    2.132    2.776     3.747     4.604     7.173      8.610
5      1.476    2.015    2.571     3.365     4.032     5.893      6.869
6      1.440    1.943    2.447     3.143     3.707     5.208      5.959
7      1.415    1.895    2.365     2.998     3.499     4.785      5.408
8      1.397    1.860    2.306     2.896     3.355     4.501      5.041
9      1.383    1.833    2.262     2.821     3.250     4.297      4.781
10     1.372    1.812    2.228     2.764     3.169     4.144      4.587
11     1.363    1.796    2.201     2.718     3.106     4.025      4.437
12     1.356    1.782    2.179     2.681     3.055     3.930      4.318
13     1.350    1.771    2.160     2.650     3.012     3.852      4.221
14     1.345    1.761    2.145     2.624     2.977     3.787      4.140
15     1.341    1.753    2.131     2.602     2.947     3.733      4.073
16     1.337    1.746    2.120     2.583     2.921     3.686      4.015
17     1.333    1.740    2.110     2.567     2.898     3.646      3.965
18     1.330    1.734    2.101     2.552     2.878     3.610      3.922
19     1.328    1.729    2.093     2.539     2.861     3.579      3.883
20     1.325    1.725    2.086     2.528     2.845     3.552      3.850
21     1.323    1.721    2.080     2.518     2.831     3.527      3.819
22     1.321    1.717    2.074     2.508     2.819     3.505      3.792
23     1.319    1.714    2.069     2.500     2.807     3.485      3.767
24     1.318    1.711    2.064     2.492     2.797     3.467      3.745
25     1.316    1.708    2.060     2.485     2.787     3.450      3.725
26     1.315    1.706    2.056     2.479     2.779     3.435      3.707
27     1.314    1.703    2.052     2.473     2.771     3.421      3.690
28     1.313    1.701    2.048     2.467     2.763     3.408      3.674
29     1.311    1.699    2.045     2.462     2.756     3.396      3.659
30     1.310    1.697    2.042     2.457     2.750     3.385      3.646
40     1.303    1.684    2.021     2.423     2.704     3.307      3.551
60     1.296    1.671    2.000     2.390     2.660     3.232      3.460
120    1.289    1.658    1.980     2.358     2.617     3.160      3.373
∞      1.282    1.645    1.960     2.326     2.576     3.090      3.291

Table H3: Chi-Square Probabilities

The areas given across the top are the areas to the right of the critical value. To look up an area on the left, subtract it from one, and then look it up (i.e. 0.05 on the left is 0.95 on the right).

df     0.995     0.99      0.975     0.95      0.9       0.1        0.05       0.025      0.01       0.005
1      ---       ---       0.001     0.004     0.016     2.706      3.841      5.024      6.635      7.879
2      0.010     0.020     0.051     0.103     0.211     4.605      5.991      7.378      9.210      10.597
3      0.072     0.115     0.216     0.352     0.584     6.251      7.815      9.348      11.345     12.838
4      0.207     0.297     0.484     0.711     1.064     7.779      9.488      11.143     13.277     14.860
5      0.412     0.554     0.831     1.145     1.610     9.236      11.070     12.833     15.086     16.750
6      0.676     0.872     1.237     1.635     2.204     10.645     12.592     14.449     16.812     18.548
7      0.989     1.239     1.690     2.167     2.833     12.017     14.067     16.013     18.475     20.278
8      1.344     1.646     2.180     2.733     3.490     13.362     15.507     17.535     20.090     21.955
9      1.735     2.088     2.700     3.325     4.168     14.684     16.919     19.023     21.666     23.589
10     2.156     2.558     3.247     3.940     4.865     15.987     18.307     20.483     23.209     25.188
11     2.603     3.053     3.816     4.575     5.578     17.275     19.675     21.920     24.725     26.757
12     3.074     3.571     4.404     5.226     6.304     18.549     21.026     23.337     26.217     28.300
13     3.565     4.107     5.009     5.892     7.042     19.812     22.362     24.736     27.688     29.819
14     4.075     4.660     5.629     6.571     7.790     21.064     23.685     26.119     29.141     31.319
15     4.601     5.229     6.262     7.261     8.547     22.307     24.996     27.488     30.578     32.801
16     5.142     5.812     6.908     7.962     9.312     23.542     26.296     28.845     32.000     34.267
17     5.697     6.408     7.564     8.672     10.085    24.769     27.587     30.191     33.409     35.718
18     6.265     7.015     8.231     9.390     10.865    25.989     28.869     31.526     34.805     37.156
19     6.844     7.633     8.907     10.117    11.651    27.204     30.144     32.852     36.191     38.582
20     7.434     8.260     9.591     10.851    12.443    28.412     31.410     34.170     37.566     39.997
21     8.034     8.897     10.283    11.591    13.240    29.615     32.671     35.479     38.932     41.401
22     8.643     9.542     10.982    12.338    14.041    30.813     33.924     36.781     40.289     42.796
23     9.260     10.196    11.689    13.091    14.848    32.007     35.172     38.076     41.638     44.181
24     9.886     10.856    12.401    13.848    15.659    33.196     36.415     39.364     42.980     45.559
25     10.520    11.524    13.120    14.611    16.473    34.382     37.652     40.646     44.314     46.928
26     11.160    12.198    13.844    15.379    17.292    35.563     38.885     41.923     45.642     48.290
27     11.808    12.879    14.573    16.151    18.114    36.741     40.113     43.195     46.963     49.645
28     12.461    13.565    15.308    16.928    18.939    37.916     41.337     44.461     48.278     50.993
29     13.121    14.256    16.047    17.708    19.768    39.087     42.557     45.722     49.588     52.336
30     13.787    14.953    16.791    18.493    20.599    40.256     43.773     46.979     50.892     53.672
40     20.707    22.164    24.433    26.509    29.051    51.805     55.758     59.342     63.691     66.766
50     27.991    29.707    32.357    34.764    37.689    63.167     67.505     71.420     76.154     79.490
60     35.534    37.485    40.482    43.188    46.459    74.397     79.082     83.298     88.379     91.952
70     43.275    45.442    48.758    51.739    55.329    85.527     90.531     95.023     100.425    104.215
80     51.172    53.540    57.153    60.391    64.278    96.578     101.879    106.629    112.329    116.321
90     59.196    61.754    65.647    69.126    73.291    107.565    113.145    118.136    124.116    128.299
100    67.328    70.065    74.222    77.929    82.358    118.498    124.342    129.561    135.807    140.169

Table 4: Critical values for F statistic: F.05

Rows give the denominator degrees of freedom; columns give the numerator degrees of freedom.

Numerator degrees of freedom 1 to 10:

df     1        2        3        4        5        6        7        8        9        10
1      161.40   199.50   215.70   224.60   230.20   234.00   236.80   238.90   240.50   241.90
2      18.51    19.00    19.16    19.25    19.30    19.33    19.35    19.37    19.38    19.40
3      10.13    9.55     9.28     9.12     9.01     8.94     8.89     8.85     8.81     8.79
4      7.71     6.94     6.59     6.39     6.26     6.16     6.09     6.04     6.00     5.96
5      6.61     5.79     5.41     5.19     5.05     4.95     4.88     4.82     4.77     4.74
6      5.99     5.14     4.76     4.53     4.39     4.28     4.21     4.15     4.10     4.06
7      5.59     4.74     4.35     4.12     3.97     3.87     3.79     3.73     3.68     3.64
8      5.32     4.46     4.07     3.84     3.69     3.58     3.50     3.44     3.39     3.35
9      5.12     4.26     3.86     3.63     3.48     3.37     3.29     3.23     3.18     3.14
10     4.96     4.10     3.71     3.48     3.33     3.22     3.14     3.07     3.02     2.98
11     4.84     3.98     3.59     3.36     3.20     3.09     3.01     2.95     2.90     2.85
12     4.75     3.89     3.49     3.26     3.11     3.00     2.91     2.85     2.80     2.75
13     4.67     3.81     3.41     3.18     3.03     2.92     2.83     2.77     2.71     2.67
14     4.60     3.74     3.34     3.11     2.96     2.85     2.76     2.70     2.65     2.60
15     4.54     3.68     3.29     3.06     2.90     2.79     2.71     2.64     2.59     2.54
16     4.49     3.63     3.24     3.01     2.85     2.74     2.66     2.59     2.54     2.49
17     4.45     3.59     3.20     2.96     2.81     2.70     2.61     2.55     2.49     2.45
18     4.41     3.55     3.16     2.93     2.77     2.66     2.58     2.51     2.46     2.41
19     4.38     3.52     3.13     2.90     2.74     2.63     2.54     2.48     2.42     2.38
20     4.35     3.49     3.10     2.87     2.71     2.60     2.51     2.45     2.39     2.35
21     4.32     3.47     3.07     2.84     2.68     2.57     2.49     2.42     2.37     2.32
22     4.30     3.44     3.05     2.82     2.66     2.55     2.46     2.40     2.34     2.30
23     4.28     3.42     3.03     2.80     2.64     2.53     2.44     2.37     2.32     2.27
24     4.26     3.40     3.01     2.78     2.62     2.51     2.42     2.36     2.30     2.25
25     4.24     3.39     2.99     2.76     2.60     2.49     2.40     2.34     2.28     2.24
26     4.23     3.37     2.98     2.74     2.59     2.47     2.39     2.32     2.27     2.22
27     4.21     3.35     2.96     2.73     2.57     2.46     2.37     2.31     2.25     2.20
28     4.20     3.34     2.95     2.71     2.56     2.45     2.36     2.29     2.24     2.19
29     4.18     3.33     2.93     2.70     2.55     2.43     2.35     2.28     2.22     2.18
30     4.17     3.32     2.92     2.69     2.53     2.42     2.33     2.27     2.21     2.16
40     4.08     3.23     2.84     2.61     2.45     2.34     2.25     2.18     2.12     2.08
60     4.00     3.15     2.76     2.53     2.37     2.25     2.17     2.10     2.04     1.99
120    3.92     3.07     2.68     2.45     2.29     2.17     2.09     2.02     1.96     1.91
∞      3.84     3.00     2.60     2.37     2.21     2.10     2.01     1.94     1.88     1.83

Numerator degrees of freedom 12 to 120:

df     12       15       20       24       30       40       60       120
1      243.90   245.90   248.00   249.10   250.10   251.10   252.20   253.30
2      19.41    19.43    19.45    19.45    19.46    19.47    19.48    19.49
3      8.74     8.70     8.66     8.64     8.62     8.59     8.57     8.55
4      5.91     5.86     5.80     5.77     5.75     5.72     5.69     5.66
5      4.68     4.62     4.56     4.53     4.50     4.46     4.43     4.40
6      4.00     3.94     3.87     3.84     3.81     3.77     3.74     3.70
7      3.57     3.51     3.44     3.41     3.38     3.34     3.30     3.27
8      3.28     3.22     3.15     3.12     3.08     3.04     3.01     2.97
9      3.07     3.01     2.94     2.90     2.86     2.83     2.79     2.75
10     2.91     2.85     2.77     2.74     2.70     2.66     2.62     2.58
11     2.79     2.72     2.65     2.61     2.57     2.53     2.49     2.45
12     2.69     2.62     2.54     2.51     2.47     2.43     2.38     2.34
13     2.60     2.53     2.46     2.42     2.38     2.34     2.30     2.25
14     2.53     2.46     2.39     2.35     2.31     2.27     2.22     2.18
15     2.48     2.40     2.33     2.29     2.25     2.20     2.16     2.11
16     2.42     2.35     2.28     2.24     2.19     2.15     2.11     2.06
17     2.38     2.31     2.23     2.19     2.15     2.10     2.06     2.01
18     2.34     2.27     2.19     2.15     2.11     2.06     2.02     1.97
19     2.31     2.23     2.16     2.11     2.07     2.03     1.98     1.93
20     2.28     2.20     2.12     2.08     2.04     1.99     1.95     1.90
21     2.25     2.18     2.10     2.05     2.01     1.96     1.92     1.87
22     2.23     2.15     2.07     2.03     1.98     1.94     1.89     1.84
23     2.20     2.13     2.05     2.01     1.96     1.91     1.86     1.81
24     2.18     2.11     2.03     1.98     1.94     1.89     1.84     1.79
25     2.16     2.09     2.01     1.96     1.92     1.87     1.82     1.77
26     2.15     2.07     1.99     1.95     1.90     1.85     1.80     1.75
27     2.13     2.06     1.97     1.93     1.88     1.84     1.79     1.73
28     2.12     2.04     1.96     1.91     1.87     1.82     1.77     1.71
29     2.10     2.03     1.94     1.90     1.85     1.81     1.75     1.70
30     2.09     2.01     1.93     1.89     1.84     1.79     1.74     1.68
40     2.00     1.92     1.84     1.79     1.74     1.69     1.64     1.58
60     1.92     1.84     1.75     1.70     1.65     1.59     1.53     1.47
120    1.83     1.75     1.66     1.61     1.55     1.50     1.43     1.35
∞      1.75     1.67     1.57     1.52     1.46     1.39     1.32     1.22
