Study Guide - Introduction to Statistical Inference 2019

SSTS012
2019
STUDY GUIDE
FACULTY OF SCIENCE AND AGRICULTURE
SCHOOL OF MATHEMATICAL AND COMPUTER SCIENCES
DEPARTMENT OF STATISTICS AND OPERATIONS RESEARCH
INTRODUCTION TO STATISTICAL INFERENCE
(SSTS012)
STUDY GUIDE
SECOND SEMESTER: 2019
LECTURERS' INFORMATION:
Name: Mr TH Chavalala and Mr H Maluleke
Office: Mathematical Sciences Building, Rooms 2017 & 2022
Telephone: 015 268 4769/2168
E-mail address: [email protected] and [email protected]
STUDY COMPONENTS
Purpose of the Module
The purpose of the module is to guide the students to:

Find point and interval estimates of the mean and proportion

Perform hypothesis tests on the mean and proportion

Perform hypothesis tests using chi-square statistic

Identify when to apply ANOVA as a hypothesis testing technique

Fit and interpret a simple linear regression model

Calculate and interpret the correlation coefficient

Define and explain the purpose of index numbers

Explain the purpose of time series analysis
STUDY UNITS

SAMPLING DISTRIBUTION

POINT AND INTERVAL ESTIMATION

HYPOTHESIS TESTING

CHI-SQUARE ANALYSIS

ANALYSIS OF VARIANCE

REGRESSION & CORRELATION

INDEX NUMBERS

TIME SERIES ANALYSIS
LECTURES AND LECTURE TIMES:
You have to attend four lectures per week. The times are as follows:
DAY                  TIME            VENUE
MONDAY (Tutorial)    PERIOD 7 & 8    SMCS 1036 & 1037
MONDAY (Tutorial)    PERIOD 9 & 10   SMCS 1036 & 1037
TUESDAY (Lecture)    PERIOD 3 & 4    TC
THURSDAY (Lecture)   PERIOD 3 & 4    KA
ASSIGNMENTS AND QUIZZES
Two assignments (average weighted 15%) and at least four quizzes (average weighted
15%) shall be written, and the averages of these assessments will contribute
towards the module mark.
TESTS AND EXAMINATION
Two tests will also be written, and their average (weighted 70%) shall also contribute
towards the module mark. A student is admitted to the final exam based on a module mark of
at least 40%.
CALCULATION OF MARKS
MODULE MARK = 15% quizzes average + 15% assignments average + 70% test average
FINAL MARK = 60% MODULE MARK + 40% FINAL EXAM MARK
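As a quick illustration of the two formulas above, the following Python sketch computes a module mark and a final mark. The function names and the input averages are hypothetical and chosen only for illustration; all marks are assumed to be percentages.

```python
def module_mark(quiz_avg, assignment_avg, test_avg):
    """Module mark = 15% quizzes + 15% assignments + 70% tests (all in percent)."""
    return 0.15 * quiz_avg + 0.15 * assignment_avg + 0.70 * test_avg


def final_mark(module, exam):
    """Final mark = 60% module mark + 40% final exam mark (all in percent)."""
    return 0.60 * module + 0.40 * exam


# Hypothetical averages, for illustration only.
mm = module_mark(quiz_avg=60, assignment_avg=70, test_avg=50)
admitted = mm >= 40  # exam admission requires a module mark of at least 40%
fm = final_mark(mm, exam=55)
print(round(mm, 1), admitted, round(fm, 1))
```

Because the tests carry 70% of the module mark, a weak test average cannot be offset much by quizzes and assignments.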
TENTATIVE SCHEDULE OF LECTURES
Week  Date                      Topic/Activity                  Assessment
1     01-05 July                Sampling Distributions
2     08-12 July                Point and Interval Estimation
3     15-19 July                Point and Interval Estimation
4     22-26 July                Hypothesis Testing
5     29 July-02 August         Hypothesis Testing
6     05-09 August              Chi-Square Analysis             Test 1 (07 August)
7     12-16 August              Analysis of Variance
8     19-23 August              Regression Analysis
9     26-30 August              Correlation Analysis            Test 2 (30 August)
10    02-06 September           Index Numbers
11    09-13 September           Index Numbers
12    16-20 September           Spring Recess
13    23-27 September           Time Series Analysis
14    30 September-04 October   Time Series Analysis
15    07-11 October             Revision
16    14-18 October             Study Week
TABLE OF CONTENTS
CHAPTER 1: THE SAMPLING DISTRIBUTION
1.1 Introduction
1.2 Population distribution
1.3 Sampling distribution
1.4 Sampling and Non-sampling errors
1.5 The Mean and Standard deviation of the sample mean, x̅
1.6 Sampling from a Normally Distributed Population
1.7 Sampling from a population that is not normally distributed
1.6 Population and Sample Proportions
CHAPTER 2: POINT AND INTERVAL ESTIMATION
2.1 Introduction
2.2 Estimation
2.3 Type of Estimates: Point and interval estimates
2.4 Confidence interval
2.5 Estimation of a population mean
2.7 Confidence interval estimation for Population proportion
2.8 Confidence Intervals for Variances and Standard Deviations
2.9 Inferences about the difference between two population means for independent samples: σ1 and σ2 known
2.10 Inferences about the difference between two population means for independent samples: σ1 and σ2 are unknown but equal
2.11 Inferences about the difference between two population proportions for large and independent samples
CHAPTER 3: HYPOTHESIS TESTING
3.1 Introduction
3.2 Hypothesis testing procedure
3.3 Type I and Type II errors
3.4 Steps of hypothesis testing
3.5 Test of hypothesis for the population mean μ
3.6 Hypothesis test about a population proportion
3.7 The P-Value Approach
3.8 Hypothesis Tests About the Population Variance
3.9 Hypothesis testing for the difference between two population means μ1 − μ2
3.10 Hypothesis testing for the equality of variances from two populations
3.11 Hypothesis testing for the difference between two population proportions P1 − P2
CHAPTER 4: CHI-SQUARE HYPOTHESIS TESTING
4.1 Introduction
4.2 Chi-square goodness-of-fit test
4.3 Chi-square test for independence of association
CHAPTER 5: ANALYSIS OF VARIANCE
5.1 Introduction
5.2 Terms and concepts
5.3 The F-Distribution
5.3.1 Basic Properties of F-curves
5.3.2 Finding the χ²-value having the specified area to its right
5.4 Performing a One-Way ANOVA
5.5 One-Way ANOVA Table
5.6 Pairwise comparisons of the treatments
5.6.1 Tukey's pairwise comparison test
CHAPTER 6: SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS
6.1 Introduction
6.2 Simple Linear Regression Analysis
6.3 Inference in regression analysis
6.4 Correlation Analysis
6.5 Inference about the correlation
6.6 The Coefficient of Determination
CHAPTER 7: INDEX NUMBERS
7.1 Introduction
7.2 Price indexes
7.3 Quantity Indexes
CHAPTER 8: TIME SERIES ANALYSIS
8.1 Introduction
8.2 Components of a Time Series
8.3 Decomposition of a Time Series
8.4 Trend Analysis
8.5 Seasonal Analysis
TABLES
CHAPTER 1: THE SAMPLING DISTRIBUTION
1.1 Introduction
In the first semester, we studied sampling, descriptive statistics, probability, and
the normal distribution. Now we will learn how these various topics can be incorporated
to lay the foundation for inferential statistics. This chapter introduces the concepts of
population distribution and sampling distribution. Moreover, the essential role that
these concepts play in the design of inferential studies will be explained. We should
always keep in mind that we perform sampling because we want to make this
inference. Because of this inference we begin to talk about things like confidence
intervals and hypothesis testing. A good picture to represent this situation follows:
The most significant objective of statistics is to make conclusions about a population
from the information contained in a sample.
1.2 Population distribution
The population distribution is the probability distribution derived from the information
on all elements under consideration.
Definition 1.1: Population distribution is the probability distribution of the population
data.
1.3 Sampling distribution
For any population data set, there is only one value of the population mean, μ.
However, we cannot say the same about the sample mean, 𝑥̅ . We would expect
different samples of the same size drawn from the same population to yield different
values of the sample mean, x̅. The value of the sample mean for any sample will
depend on the elements included in that sample. Accordingly, the sample mean, x̅, is
a random variable. Therefore, like other random variables, the sample mean
possesses a probability distribution, which is more commonly called the sampling
distribution of x̅. Other sample statistics, such as the median, mode, and standard
deviation, also possess sampling distributions.
Definition 1.2: The probability distribution of x̅ is called the sampling distribution
of x̅.
There are many ways to take a sample, and different methods can be applied
depending on the given problem. Knowing more about the research problem
helps us determine which sampling method makes the most sense. Therefore, we will
talk about sampling design.
Sampling design is the procedure by which the sample is selected. There are two
very broad categories of sampling designs: Probability sampling and Non-probability
sampling.
How to construct a sampling distribution of x̅
In a random sample, each member of the population has an equal chance of being
selected. There are a number of ways in which a random sample may be taken:
i) Random sampling without replacement.
ii) Random sampling with replacement.
1.4 Sampling and Non-sampling errors
Usually, different samples selected from the same population will give different results
because they contain different elements.
Definition 1.3: Sampling error is the difference between the value of a sample
statistic and the value of the corresponding population parameter. In the case of the
mean,
$$\text{SE} = \bar{x} - \mu$$
It is important to remember that a sampling error occurs because of chance. The errors
that occur for other reasons, such as errors made during collection, recording, and
tabulation of data, are called non-sampling errors. These errors occur because of
human mistakes, and not chance. Note that there is only one kind of sampling error,
which is the error that occurs due to chance. However, there is not just one
non-sampling error; many non-sampling errors may occur for different reasons.
Definition 1.4: Non-sampling Errors are the errors that occur in the collection,
recording, and tabulation of data.
The following are the main reasons for the occurrence of non-sampling errors:
i) If a sample is non-random, the sample results may be very different from the
census results.
ii) The questions may be phrased in such a way that they are not fully understood
by the members of the sample or population. As a result, the answers obtained
are not accurate.
iii) The respondents may intentionally give false information in response to some
sensitive questions.
iv) The poll taker may make a mistake and enter a wrong number in the records or
make an error while entering the data on a computer.
Example 1: Suppose there were only five students in supplementary exam of
SSTS011 and the exam scores of these five students are: 70, 78, 80, 80, 95. Suppose
that a simple random sampling of size three is drawn without replacement.
a) List all samples that can be selected from this population.
b) Calculate the sample mean and sampling error for each of these samples.
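One way to check a hand solution of Example 1 is to enumerate the samples by computer. The Python sketch below is only an illustration: it assumes that the two scores of 80 belong to different students (so all ten samples are listed) and it takes the sampling error to be x̅ − μ as in Definition 1.3.

```python
from itertools import combinations
from statistics import mean

scores = [70, 78, 80, 80, 95]
mu = mean(scores)  # population mean

# The two scores of 80 are treated as different students,
# so there are C(5, 3) = 10 possible samples without replacement.
samples = list(combinations(scores, 3))

for s in samples:
    x_bar = mean(s)
    # sample, sample mean, sampling error (x-bar minus mu)
    print(s, round(x_bar, 2), round(x_bar - mu, 2))

# Average of all ten sample means, for comparison with mu.
mean_of_sample_means = mean(mean(s) for s in samples)
print(mu, round(mean_of_sample_means, 2))
```

The sampling errors differ from sample to sample, but the average of the ten sample means coincides with the population mean.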
1.5 The Mean and Standard deviation of the sample mean, x̅
The mean and standard deviation calculated for the sampling distribution of x̅ are
called the mean and standard deviation of x̅. Actually, the mean and standard
deviation of x̅ are, respectively, the mean and standard deviation of the means of all
samples of the same size selected from a population. The standard deviation of x̅ is
also called the standard error of x̅.
Mean of the sampling distribution of x̅
The mean and standard deviation of the sampling distribution of x̅ are called the
mean and standard deviation of x̅ and are denoted by $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$, respectively.
The mean of the sampling distribution of x̅ is always equal to the mean of the
population. Thus,
$$\mu_{\bar{x}} = \mu$$
Standard deviation of the sampling distribution of x̅
The standard deviation of the sampling distribution of x̅ is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
where σ is the standard deviation of the population and n is the sample size.
Important observations regarding the sampling distribution of x̅
i) The spread of the sampling distribution of x̅ is smaller than the spread of the
corresponding population distribution. In other words, $\sigma_{\bar{x}} < \sigma$. This is obvious
from the formula for $\sigma_{\bar{x}}$. When n is greater than 1, which is usually true, the
denominator in $\sigma/\sqrt{n}$ is greater than 1. Hence, $\sigma_{\bar{x}}$ is smaller than σ.
ii) The standard deviation of the sampling distribution of x̅ decreases as the
sample size increases. This feature of the sampling distribution of x̅ is also
obvious from the formula
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
If the standard deviation of a sample statistic decreases as the sample size
is increased, that statistic is said to be a consistent estimator.
1.6 Sampling from a Normally Distributed Population
When the population from which samples are drawn is normally distributed with its
mean equal to μ and standard deviation equal to σ, then:
i) The mean of x̅, $\mu_{\bar{x}}$, is equal to the mean of the population, μ.
ii) The standard deviation of x̅, $\sigma_{\bar{x}}$, is equal to $\sigma/\sqrt{n}$.
iii) The shape of the sampling distribution of x̅ is normal, whatever the value of n.
Important remark: If the population from which the samples are drawn is normally
distributed with mean μ and standard deviation σ, then the sampling distribution of the
sample mean, x̅, will also be normally distributed with the following mean and standard
deviation, irrespective of the sample size:
$$\mu_{\bar{x}} = \mu \quad\text{and}\quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
1.7 Sampling from a population that is not normally distributed.
Most of the time the population from which the samples are selected is not normally
distributed. In such cases, the shape of the sampling distribution of 𝑥̅ is inferred from
a very important theorem called the central limit theorem.
Central Limit Theorem
According to the central limit theorem, for a large sample size, the sampling
distribution of x̅ is approximately normal, irrespective of the shape of the
population distribution. The mean and standard deviation of the sampling
distribution of x̅ are, respectively,
$$\mu_{\bar{x}} = \mu \quad\text{and}\quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
The sample size is usually considered to be large if n ≥ 30.
Important remark: when the population does not have a normal distribution, the
shape of the sampling distribution is not exactly normal, but it is approximately normal
for a large sample size. The approximation becomes more accurate as the sample
size increases. Another point to remember is that the central limit theorem applies to
large samples only. Usually, if the sample size is 30 or more, it is considered
sufficiently large so that the central limit theorem can be applied to the sampling
distribution of x̅. Thus, according to the central limit theorem:
i) When n ≥ 30, the shape of the sampling distribution of x̅ is approximately normal
irrespective of the shape of the population distribution.
ii) The mean of x̅, $\mu_{\bar{x}}$, is equal to the mean of the population, μ.
iii) The standard deviation of x̅, $\sigma_{\bar{x}}$, is equal to $\sigma/\sqrt{n}$.
Example 3: The delivery times for all food orders at a fast-food restaurant during the
lunch hour are normally distributed with a mean of 7.7 minutes and a standard
deviation of 2.1 minutes. Let x̅ be the mean delivery time for a random sample of 16
orders at this restaurant. Calculate the mean and standard deviation of x̅.
Example 4: Suppose that the distribution of time spent working per week by University
of Limpopo (UL) students who hold part-time jobs during the school year is unknown
with a mean of 20.20 hours and a standard deviation of 2.60 hours. Let x̅ be the
average time spent working per week for 36 randomly selected UL students who hold
part-time jobs during the school year. Calculate the mean and the standard deviation
of the sampling distribution of x̅.
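The calculations asked for in Examples 3 and 4 follow directly from the two formulas for the mean and standard deviation of x̅. A minimal Python sketch, with a hypothetical helper name, is:

```python
from math import sqrt


def mean_and_se_of_sample_mean(mu, sigma, n):
    """Return (mean of x-bar, standard deviation of x-bar) = (mu, sigma / sqrt(n))."""
    return mu, sigma / sqrt(n)


# Example 3: normal population, mu = 7.7, sigma = 2.1, n = 16
mean3, se3 = mean_and_se_of_sample_mean(7.7, 2.1, 16)

# Example 4: unknown shape, mu = 20.20, sigma = 2.60, n = 36
mean4, se4 = mean_and_se_of_sample_mean(20.20, 2.60, 36)

print(mean3, round(se3, 3))
print(mean4, round(se4, 3))
```

In Example 4 the population shape is unknown, but n = 36 ≥ 30, so the central limit theorem also tells us that the sampling distribution of x̅ is approximately normal.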
Remark: Suppose that we take a random sample of size n from a normal population
with mean μ and variance σ². Then the sample mean x̅ is also normal with mean μ and
variance σ²/n, that is,
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Note that in situations where the sample is not drawn from a normal population, the
central limit theorem can be applied provided the sample size is large (n ≥ 30).
Example 5: Assume that the weights of all packages of a certain brand of cookies are
normally distributed with a mean of 32 ounces and a standard deviation of 0.3 ounces.
Find the probability that the mean weight, 𝑥̅ of a random sample of 20 packages of this
brand of cookies will be between 31.8 and 31.9 ounces.
Example 6: The amounts of electricity bills for all households in a city have a skewed
probability distribution with a mean of 𝑅140 and a standard deviation of 𝑅30. Find the
probability that the mean amount of electric bills for a random sample of 75 households
selected from this city will be:
a) between R132 and R136.
b) within R6 of the population mean.
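Probabilities such as those in Examples 5 and 6 are normally found by standardising and reading the standard normal table. As a cross-check, the sketch below uses Python's `statistics.NormalDist` instead of the table; the helper name is hypothetical, and small differences from table-based answers are to be expected because z-values in the table are rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_between(mu, sigma, n, low, high):
    """P(low < x-bar < high) when x-bar is (approximately) N(mu, sigma / sqrt(n))."""
    x_bar_dist = NormalDist(mu, sigma / sqrt(n))
    return x_bar_dist.cdf(high) - x_bar_dist.cdf(low)


# Example 5: normal population, so x-bar is normal for any n.
p5 = prob_mean_between(32, 0.3, 20, 31.8, 31.9)

# Example 6: skewed population, but n = 75 >= 30, so the CLT applies.
p6a = prob_mean_between(140, 30, 75, 132, 136)
p6b = prob_mean_between(140, 30, 75, 140 - 6, 140 + 6)  # within R6 of the mean

print(round(p5, 4), round(p6a, 4), round(p6b, 4))
```

The three probabilities come out at roughly 0.07, 0.11 and 0.92, respectively.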
1.6 Population and Sample Proportions
The concept of proportion is the same as the concept of relative frequency discussed
in the first semester and the concept of probability of success in a binomial experiment.
The relative frequency of a category or class gives the proportion of the sample or
population that belongs to that category or class. Similarly, the probability of success
in a binomial experiment represents the proportion of the sample or population that
possesses a given characteristic.
The population proportion, denoted by p, is obtained by taking the ratio of the
number of elements in a population with a specific characteristic to the total number
of elements in the population. The sample proportion, denoted by p̂ (pronounced p
hat), gives a similar ratio for a sample.
The population and sample proportions, denoted by p and p̂, respectively, are
calculated as
$$p = \frac{X}{N} \quad\text{and}\quad \hat{p} = \frac{x}{n}$$
where
N = total number of elements in the population
n = total number of elements in the sample
X = number of elements in the population that possess a specific characteristic
x = number of elements in the sample that possess a specific characteristic
Example 7: Suppose a total of 789,654 families live in a particular city and 563,282 of
them own homes. A sample of 240 families is selected from this city, and 158 of them
own homes.
a) Find the proportion of families who own homes in the population.
b) Find the proportion of families who own homes in the sample.
c) Calculate the sampling error of a proportion.
Example 8: In a population of 9500 subjects, 75% possess a certain characteristic. In
a sample of 400 subjects selected from this population, 78% possess the same
characteristic. How many subjects in the population and sample, respectively, possess
this characteristic? Calculate the sampling error of a proportion.
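The arithmetic in Examples 7 and 8 can be sketched in Python as follows. The sampling error of a proportion is taken here to be p̂ − p, by analogy with x̅ − μ in Definition 1.3; this is an assumption, since the guide does not write the formula out for proportions.

```python
def proportion(count, total):
    """Proportion of elements that possess the characteristic."""
    return count / total


# Example 7
p = proportion(563_282, 789_654)   # population proportion, X / N
p_hat = proportion(158, 240)       # sample proportion, x / n
error7 = p_hat - p                 # sampling error (assumed to be p-hat minus p)

# Example 8: counts recovered from the stated percentages
pop_count = round(0.75 * 9500)
sample_count = round(0.78 * 400)
error8 = 0.78 - 0.75

print(round(p, 4), round(p_hat, 4), round(error7, 4))
print(pop_count, sample_count, round(error8, 2))
```

A negative sampling error, as in Example 7, simply means that the sample proportion underestimates the population proportion.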
1.6.1 Sampling distribution of p̂
Just like the sample mean x̅, the sample proportion p̂ is a random variable. Hence,
it possesses a probability distribution, which is called its sampling distribution.
Sampling Distribution of the Sample Proportion, p̂: The probability distribution
of the sample proportion, p̂, is called its sampling distribution. It gives the various
values that p̂ can assume and their probabilities.
1.6.2 Mean and Standard Deviation of p̂
The mean of p̂, which is the same as the mean of the sampling distribution of p̂, is
always equal to the population proportion, p, just as the mean of the sampling
distribution of x̅ is always equal to the population mean, μ.
The mean of the sample proportion, p̂, is denoted by $\mu_{\hat{p}}$ and is equal to the
population proportion, p. Thus,
$$\mu_{\hat{p}} = p$$
The standard deviation of the sample proportion, p̂, is denoted by $\sigma_{\hat{p}}$ and is given
by the formula
$$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$$
where p is the population proportion, q = 1 − p, and n is the sample size.
Example 9: A population of N = 4000 has a population proportion equal to 0.12. In
each of the following cases, which formula will you use to calculate σp̂ and why? Using
the appropriate formula, calculate σp̂ for each of these cases.
a) n = 800.
b) n = 30.
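Example 9 asks which formula to use, but only one formula for the standard deviation of p̂ appears above. The sketch below assumes a common textbook convention that is not stated in this guide: when the sample is more than 5% of the population (n/N > 0.05), the standard deviation is multiplied by the finite population correction factor sqrt((N − n)/(N − 1)). The function name and the 5% threshold are therefore assumptions.

```python
from math import sqrt


def sigma_p_hat(p, n, N=None):
    """Standard deviation of p-hat, sqrt(pq/n).

    Assumption (not from the guide): if the population size N is given and
    n / N > 0.05, apply the finite population correction sqrt((N - n) / (N - 1)).
    """
    q = 1 - p
    sd = sqrt(p * q / n)
    if N is not None and n / N > 0.05:
        sd *= sqrt((N - n) / (N - 1))
    return sd


sd_a = sigma_p_hat(0.12, 800, N=4000)  # n/N = 0.20, correction applied
sd_b = sigma_p_hat(0.12, 30, N=4000)   # n/N = 0.0075, no correction
print(round(sd_a, 4), round(sd_b, 4))
```

Under this convention, case a) uses the corrected formula and case b) uses the formula exactly as given in this section.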
1.6.3 Shape of the sampling distribution of p̂
The shape of the sampling distribution of p̂ is inferred from the central limit theorem.
Central Limit Theorem for Sample Proportion: According to the central limit
theorem, the sampling distribution of p̂ is approximately normal for a sufficiently
large sample size. In the case of proportion, the sample size is considered to be
sufficiently large if np and nq are both greater than 5. That is, if
np > 5 and nq > 5
Example 10: Maureen Webster, who is running for mayor in a large city, claims that
she is favoured by 53% of all eligible voters of that city. Assume that this claim is true.
What is the probability that in a random sample of 400 registered voters taken from
this city, less than 49% will favour Maureen Webster?
Example 11: According to the BBMG Conscious Consumer Report, 51% of the adults
surveyed said that they are willing to pay more for products with social and
environmental benefits despite the current tough economic times (USA TODAY, June
8, 2009). Suppose this result is true for the current population of adult Americans. Let
p̂ be the proportion in a random sample of 1050 adult Americans who will hold the said
opinion. Find the probability that the value of p̂ is between 0.53 and 0.55.
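Examples 10 and 11 combine the mean and standard deviation of p̂ with the normal approximation. The following Python sketch, with a hypothetical helper name, checks the conditions np > 5 and nq > 5 before using the approximation and then evaluates both probabilities. Table-based answers may differ slightly in the last decimals.

```python
from math import sqrt
from statistics import NormalDist


def p_hat_distribution(p, n):
    """Approximate normal sampling distribution of p-hat: N(p, sqrt(pq/n))."""
    q = 1 - p
    if not (n * p > 5 and n * q > 5):
        raise ValueError("sample too small for the normal approximation")
    return NormalDist(p, sqrt(p * q / n))


# Example 10: P(p-hat < 0.49) when p = 0.53 and n = 400
prob10 = p_hat_distribution(0.53, 400).cdf(0.49)

# Example 11: P(0.53 < p-hat < 0.55) when p = 0.51 and n = 1050
dist11 = p_hat_distribution(0.51, 1050)
prob11 = dist11.cdf(0.55) - dist11.cdf(0.53)

print(round(prob10, 4), round(prob11, 4))
```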
CHAPTER 2: POINT AND INTERVAL ESTIMATION
2.1 Introduction
Statistical inference is the process of using sample results to estimate or draw
conclusions about the characteristics or parameters of a population. In this chapter we
shall examine estimation procedures which attempt to measure particular
characteristics of a population such as the mean, proportion and variance. There are
two major types of estimates, point and interval estimates. A point estimate uses a
single sample value to estimate the population parameter involved. Instead of having
an estimate based on a single value, an interval is used for estimating the population
parameter. This interval has a specified confidence or probability of correctly
estimating the true value of the population parameter.
2.2 Estimation
Definition 2.1: The assignment of value(s) to a population parameter based on a
value of the corresponding sample statistic is called estimation.
Definition 2.2: Estimate is the value(s) assigned to a population parameter based
on the value of a sample statistic.
Definition 2.3: Estimator is the sample statistic used to estimate a population
parameter.
Three Properties of a Good Estimator
1. The estimator should be an unbiased estimator. That is, the expected
value or the mean of the estimates obtained from samples of a given size
is equal to the parameter being estimated.
2. The estimator should be consistent. For a consistent estimator, as
sample size increases, the value of the estimator approaches the value of
the parameter estimated.
3. The estimator should be a relatively efficient estimator. That is, of all the
statistics that can be used to estimate a parameter, the relatively efficient
estimator has the smallest variance.
Unbiased estimator
A point estimator θ̂ is said to be an unbiased estimator of θ if E(θ̂) = θ for every
possible value of θ. If θ̂ is not unbiased, the difference E(θ̂) − θ is called the bias
of θ̂.
That is, an unbiased estimator of a population parameter is an estimator whose
expected value is equal to that parameter.
Example 2.1: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population with
mean μ and variance σ² > 0.
a) Is the sample mean X̅ an unbiased estimator of the parameter μ?
b) Show that
$$S^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1}$$
is an unbiased estimator of the parameter σ².
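Example 2.1 calls for an algebraic proof, which the code below does not replace. As a numerical illustration of what unbiasedness means, the sketch enumerates every ordered sample of size 2 drawn with replacement from a small finite population (the five scores from Example 1, reused here as an assumption) and averages X̅ and S² over all samples using exact fractions.

```python
from fractions import Fraction
from itertools import product

population = [70, 78, 80, 80, 95]  # illustrative population, not a normal one
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance


def sample_variance(sample):
    """S^2 = (sum of x^2 - n * x-bar^2) / (n - 1), the formula in Example 2.1."""
    n = len(sample)
    x_bar = Fraction(sum(sample), n)
    return (sum(x * x for x in sample) - n * x_bar ** 2) / (n - 1)


# All 25 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))
expected_x_bar = sum(Fraction(sum(s), 2) for s in samples) / len(samples)
expected_s2 = sum(sample_variance(s) for s in samples) / len(samples)

print(expected_x_bar == mu, expected_s2 == sigma2)
```

Individual samples give values of X̅ and S² that miss μ and σ², but their averages over all possible samples hit the parameters exactly, which is what E(θ̂) = θ expresses.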
2.3 Type of Estimates: Point and interval estimates
An estimate may be a point estimate or an interval estimate. These two types of
estimates are described in this section.
2.3.1 Point estimate
Point Estimate
A point estimate is a specific numerical value estimate of a parameter. The best
point estimate of the population mean 𝜇 is the sample mean 𝑋̅.
2.3.2 Interval estimate
Interval Estimation
An interval estimate of a parameter is an interval or a range of values used to
estimate the parameter. This estimate may or may not contain the value of the
parameter being estimated.
2.4 Confidence interval
Confidence Level and Confidence Interval
The confidence level of an interval estimate of a parameter is the probability that
the interval estimate will contain the parameter, assuming that a large number of
samples are selected and that the estimation process on the same parameter is
repeated.
A confidence interval is a specific interval estimate of a parameter determined
by using data obtained from a sample and by using the specific confidence level
of the estimate.
2.5 Estimation of a population mean
This section explains how to construct a confidence interval for the population mean
μ. Here, there are two possible cases, summarised in the chart below.
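Confidence Interval for μ: σ known or σ unknown but n ≥ 30
The (1 − α)100% confidence interval for μ under Cases I and II is
$$\bar{x} - Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$
where the value of $Z_{\alpha/2}$ used here is obtained from the standard normal distribution
table for the given confidence level. When σ is unknown but n ≥ 30, the sample standard
deviation s is used in place of σ.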
Example 2.2: A publishing company has just published a new University textbook.
Before the company decides the price at which to sell this textbook, it wants to know
the average price of all such textbooks in the market. The research department at the
company took a sample of 25 comparable textbooks and collected information on their
prices. This information produced a mean price of 𝑅145 for this sample. It is known
that the standard deviation of the prices of all such textbooks is 𝑅35 and the population
of such prices is normal. Construct a 90% confidence interval for the mean price of all
such college textbooks.
Example 2.3: A researcher wishes to estimate the number of days it takes an
automobile dealer to sell a Ford Ranger 3.2 double cab. A sample of 50 cars had a
mean time on the dealer’s lot of 54 days and a standard deviation of 6.0 days. Find
the best point estimate of the population mean and the 95% confidence interval of the
population mean.
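The interval formula above can be applied to Examples 2.2 and 2.3 as in the following Python sketch. The helper name is hypothetical, and the critical value is computed with `statistics.NormalDist` rather than read from the table, so the limits may differ slightly from answers based on rounded table values such as 1.65 or 1.96. In Example 2.3 the sample standard deviation is used in place of σ because n = 50 ≥ 30.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sd, n, confidence):
    """(1 - alpha)100% interval: x-bar -/+ Z_{alpha/2} * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
    margin = z * sd / sqrt(n)
    return x_bar - margin, x_bar + margin


# Example 2.2: sigma known (R35), n = 25, 90% confidence
ci_22 = z_interval(145, 35, 25, 0.90)

# Example 2.3: s = 6.0 used for sigma, n = 50, 95% confidence
ci_23 = z_interval(54, 6.0, 50, 0.95)

print([round(v, 2) for v in ci_22])
print([round(v, 2) for v in ci_23])
```

In Example 2.3 the best point estimate of the population mean is the sample mean itself, 54 days, which sits at the centre of the interval.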
Margin of Error
The margin of error for the estimate for μ, denoted by E, is the quantity that is
subtracted from and added to the value of x̅ to obtain a confidence interval for μ.
Thus,
$$E = Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$
Determining the Sample Size for the Estimation of μ
Given the confidence level and the standard deviation of the population, the
sample size that will produce a predetermined margin of error E of the confidence
interval estimate of μ is
$$n = \left(\frac{Z_{\alpha/2} \times \sigma}{E}\right)^2$$
If necessary, round the answer up to obtain a whole number. That is, if there
is any fraction or decimal portion in the answer, use the next whole number for
sample size n.
Example 2.4: An alumni association wants to estimate the mean debt of this year’s
university graduates. It is known that the population standard deviation of the debts of
this year’s college graduates is 𝑅11,800. How large a sample should be selected so
that the estimate with a 99% confidence level is within 𝑅800 of the population mean?
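The sample size formula, together with the rounding-up rule, can be sketched in Python for Example 2.4. The helper name is hypothetical. The critical value here is computed rather than taken from the table, so a rounded table value such as 2.58 leads to a slightly larger n.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence):
    """Smallest whole n with Z_{alpha/2} * sigma / sqrt(n) <= margin."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z * sigma / margin) ** 2)  # always round up


# Example 2.4: sigma = R11,800, E = R800, 99% confidence
n_24 = sample_size_for_mean(11_800, 800, 0.99)
print(n_24)
```

Rounding up rather than to the nearest whole number guarantees that the margin of error does not exceed the target E.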
Confidence Interval for 𝛍 when 𝛔 unknown and 𝒏 < 𝟑𝟎
As previously stated, the mean of the population 𝜇 is usually not known, and the
actual standard deviation of the population 𝜎 is also not likely to be known. When 𝜎 is
unknown and the sample size is small (𝑛 < 30), we need to obtain a confidence interval
estimate of 𝜇 by using the sample statistics 𝑋̅ and 𝑆². The distribution that has been
developed to be applied in this situation is Student’s t distribution.
The (1 − α)100% confidence interval for μ is
$$\bar{x} - t_{n-1,\,\alpha/2} \times \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1,\,\alpha/2} \times \frac{s}{\sqrt{n}}$$
The value of t is obtained from the t distribution table for n − 1 degrees of freedom
and the given confidence level.
Example 2.5: Dr. Moore wanted to estimate the mean cholesterol level for all adult
men living in Hartford. He took a sample of 25 adult men from Hartford and found that
the mean cholesterol level for this sample is 186 mg/dL with a standard deviation of
12 mg/dL. Assume that the cholesterol levels for all adult men in Hartford are
(approximately) normally distributed. Construct a 95% confidence interval for the
population mean 𝛍.
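As a hedged illustration of the t interval, the sketch below applies the formula to the figures in Example 2.5. The helper name t_interval is hypothetical, and the critical value 2.064 is the table value for 24 degrees of freedom and α/2 = 0.025, supplied by hand because the Python standard library has no t distribution.

```python
import math

def t_interval(xbar, s, n, t_crit):
    """Interval xbar -/+ t_crit * s / sqrt(n); t_crit must come from a t table."""
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Example 2.5: n = 25, xbar = 186, s = 12, t_{24, 0.025} = 2.064 (table value)
lower, upper = t_interval(xbar=186, s=12, n=25, t_crit=2.064)
print(round(lower, 2), round(upper, 2))
```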
Example 2.6: The data represent a sample of the number of home fires started by
candles for the past several years. (Data are from the National Fire Protection
Association).
5460 5900 6090 6310
7160 8440 9930
Find the 99% confidence interval for the mean number of home fires started by candles
each year.
2.7 Confidence interval estimation for Population proportion
The concept of estimation can be extended to qualitative data to estimate the
proportion of success in the population based only upon sample data. We noted in the
previous chapter that when np and nq were at least 5, the binomial distribution
generally could be approximated by the normal distribution. If we desire to estimate
the population proportion 𝑝 from the sample proportion 𝑝̂ we could set up the following
(1 − 𝛼)100% confidence interval estimate for the population proportion 𝑝.
To construct a confidence interval for a proportion, you must use the maximum
error of the estimate, which is
$$E = Z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$$
where q̂ = 1 − p̂. Solving this equation for n, we have
$$n = \frac{Z_{\alpha/2}^2 \, \hat{p}\hat{q}}{E^2}$$
Confidence intervals for proportions must meet the criteria that np̂ ≥ 5 and nq̂ ≥ 5.
Confidence Interval for the Population Proportion, 𝒑
The (1 − α)100% confidence interval for the population proportion, p, is
$$\hat{p} - Z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + Z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$$
The value of Z_{α/2} used here is obtained from the standard normal distribution table for
the given confidence level.
Example 2.7: According to a survey conducted by Pew Research Centre in June
2009, 44% of people aged 18 to 29 years said that religion is very important to them.
Suppose this result is based on a sample of 1000 people aged 18 to 29 years.
a) What is the point estimate of the corresponding population proportion?
b) Find, with a 99% confidence level, the percentage of all people aged 18 to 29
years who will say that religion is very important to them. What is the margin of
error of this estimate?
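The proportion interval can be sketched for the figures in Example 2.7. This is a minimal illustration under stated assumptions: the function name proportion_interval is hypothetical, and Z_{α/2} is computed with the standard library instead of being read from the normal table, so the last digits may differ slightly from a hand calculation.

```python
import math
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence):
    """Return (lower, upper, margin) for the large-sample interval for p."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_{alpha/2}
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)     # E = Z * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin, margin

# Example 2.7: p_hat = 0.44, n = 1000, 99% confidence
lower, upper, margin = proportion_interval(p_hat=0.44, n=1000, confidence=0.99)
print(round(lower, 4), round(upper, 4), round(margin, 4))
```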
Example 2.8: A survey of 1721 people found that 15.9% of individuals purchase
religious books at a Christian bookstore. Find the 95% confidence interval of the true
proportion of people who purchase their religious books at a Christian bookstore.
Example 2.9: Lombard Electronics Company has just installed a new machine that
makes a part that is used in clocks. The company wants to estimate the proportion of
these parts produced by this machine that are defective. The company manager wants
this estimate to be within 0.02 of the population proportion for a 95% confidence level.
What is the most conservative estimate of the sample size that will limit the margin of
error to within 0.02 of the population proportion?
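For a question such as Example 2.9, the sample-size formula for a proportion can be sketched as follows. The sketch assumes that "most conservative" means taking p̂ = q̂ = 0.5, the choice that makes p̂q̂ as large as possible; the function name sample_size_for_proportion is hypothetical, and Z_{α/2} is computed instead of being read from a table.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence, p_hat=0.5):
    """n = Z^2 * p_hat * q_hat / E^2, rounded up; p_hat = 0.5 is the conservative choice."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_{alpha/2}
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

# Example 2.9: E = 0.02, 95% confidence, no prior estimate of p
n = sample_size_for_proportion(margin=0.02, confidence=0.95)
print(n)
```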
2.8 Confidence Intervals for Variances and Standard Deviations
In the previous sections confidence intervals were calculated for means and
proportions. This section will explain how to find confidence intervals for variances and
standard deviations. In statistics, the variance and standard deviation of a variable are
as important as the mean. For example, when products that fit together (such as pipes)
are manufactured, it is important to keep the variations of the diameters of the products
as small as possible; otherwise, they will not fit together properly and will have to be
scrapped. In the manufacture of medicines, the variance and standard deviation of the
medication in the pills play an important role in making sure patients receive the proper
dosage. For these reasons, confidence intervals for variances and standard deviations
are necessary.
To calculate these confidence intervals, a new statistical distribution is needed. It is
called the chi-square distribution. The formulas for the confidence intervals are
shown here:
Confidence Interval for population variance
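Assuming that the population from which the sample is selected is (approximately)
normally distributed, we obtain the (1 − α)100% confidence interval for the
population variance σ² as
$$\frac{(n-1)S^2}{\chi^2_{\alpha/2}(n-1)} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}(n-1)}$$
Note that n − 1 is the degrees of freedom and that the subscript on χ² is the area to
the right of the critical value, so χ²_{α/2} is the larger of the two table values.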
Confidence Interval for population standard deviation
Assuming that the population from which the sample is selected is (approximately)
normally distributed, we obtain the (1 − α)100% confidence interval for the
population standard deviation σ as
$$\sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2}(n-1)}} < \sigma < \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha/2}(n-1)}}$$
Note that n − 1 is the degrees of freedom.
Example 2.10: Find the 95% confidence interval for the variance and standard
deviation of the nicotine content of cigarettes manufactured if a sample of 20 cigarettes
has a standard deviation of 1.6 milligrams.
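The two chi-square intervals can be sketched for the figures in Example 2.10. The helper name variance_interval is hypothetical, and the critical values 32.852 and 8.907 are chi-square table values for 19 degrees of freedom (right-tail areas 0.025 and 0.975), entered by hand because the standard library has no chi-square distribution.

```python
import math

def variance_interval(s, n, chi2_right, chi2_left):
    """Intervals for sigma^2 and sigma; chi2_right > chi2_left are table values."""
    ss = (n - 1) * s ** 2                   # (n - 1) * S^2
    var_low, var_high = ss / chi2_right, ss / chi2_left
    return (var_low, var_high), (math.sqrt(var_low), math.sqrt(var_high))

# Example 2.10: n = 20, s = 1.6, 95% confidence, df = 19
var_ci, sd_ci = variance_interval(s=1.6, n=20, chi2_right=32.852, chi2_left=8.907)
print([round(v, 2) for v in var_ci], [round(v, 2) for v in sd_ci])
```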
Example 2.11: Find the 90% confidence interval for the variance and standard
deviation for the price in dollars of an adult single-day ski lift ticket. The data represent
a selected sample of nationwide ski resorts. Assume the variable is normally
distributed.
59 54 53 52 51 39 49 46 49 48
2.9 Inferences about the difference between two population means for
independent samples: 𝛔𝟏 and 𝛔𝟐 known
Two samples drawn from two populations are independent if the selection of one
sample from one population does not affect the selection of the second sample from
the second population. Otherwise, the samples are dependent.
2.9.1 Interval estimation of 𝛍𝟏 − 𝛍𝟐
Confidence interval for 𝛍𝟏 − 𝛍𝟐 : 𝛔𝟏 and 𝛔𝟐 known
When using the normal distribution, the (1 − α)100% confidence interval for μ1 − μ2 is
$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
The value of Z_{α/2} is obtained from the normal distribution table for the given
confidence level. Here, x̄1 − x̄2 is the point estimator of μ1 − μ2.
Example 2.12: A 2008 survey of low- and middle-income households conducted by
Demos, a liberal public policy group, showed that consumers aged 65 years and older
had an average credit card debt of R10,235 and consumers in the 50- to 64-year age
group had an average credit card debt of R9342 at the time of the survey (USA
TODAY, July 28, 2009). Suppose that these averages were based on random samples
of 1200 and 1400 people for the two groups, respectively. Further assume that the
population standard deviations for the two groups were R2800 and R2500,
respectively. Let μ1 and μ2 be the respective population means for the two groups, people
aged 65 years and older and people in the 50- to 64-year age group.
a) What is the point estimate of μ1 − μ2?
b) Construct a 97% confidence interval for μ1 − μ2.
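The interval for μ1 − μ2 with known standard deviations can be sketched for the figures in Example 2.12. The function name two_mean_z_interval is an assumption made for illustration, and Z_{α/2} is computed with the standard library instead of being read from the normal table.

```python
import math
from statistics import NormalDist

def two_mean_z_interval(x1, sigma1, n1, x2, sigma2, n2, confidence):
    """Interval for mu1 - mu2 when sigma1 and sigma2 are known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # Z_{alpha/2}
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)   # std. deviation of x1 - x2
    diff = x1 - x2                                        # point estimate of mu1 - mu2
    return diff - z * se, diff + z * se

# Example 2.12: 97% confidence level
lower, upper = two_mean_z_interval(10235, 2800, 1200, 9342, 2500, 1400, 0.97)
print(round(lower, 2), round(upper, 2))
```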
Example 2.13: A survey found that the average hotel room rate in New Orleans is
R88.42 and the average room rate in Phoenix is R80.61. Assume that the data were
obtained from two samples of 50 hotels each and that the standard deviations of the
populations are R5.62 and R4.83, respectively. Find a 95% confidence interval for the
difference between the means.
2.10 Inferences about the difference between two population means for
independent samples: 𝛔𝟏 and 𝛔𝟐 are unknown but equal
Confidence Interval for 𝛍𝟏 − 𝛍𝟐 : 𝛔𝟏 and 𝛔𝟐 are unknown but equal
The (1 − α)100% confidence interval for μ1 − μ2 is
$$(\bar{x}_1 - \bar{x}_2) - t_{n_1+n_2-2,\,\alpha/2} \cdot S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{n_1+n_2-2,\,\alpha/2} \cdot S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
where
$$S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$
The value of t is obtained from the t distribution table for the given confidence level
and n1 + n2 − 2 degrees of freedom.
Example 2.14: A consumer agency wanted to estimate the difference in the mean
amounts of caffeine in two brands of coffee. The agency took a sample of 15 one-pound
jars of Brand I coffee that showed the mean amount of caffeine in these jars to
be 80 milligrams per jar with a standard deviation of 5 milligrams. Another sample of
12 one-pound jars of Brand II coffee gave a mean amount of caffeine equal to 77
milligrams per jar with a standard deviation of 6 milligrams. Construct a 95%
confidence interval for the difference between the mean amounts of caffeine in
one-pound jars of these two brands of coffee. Assume that the two populations are normally
distributed and that the standard deviations of the two populations are equal.
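The pooled interval can be sketched for the figures in Example 2.14. The function name pooled_t_interval is hypothetical, and the critical value 2.060 is the t table value for 25 degrees of freedom and α/2 = 0.025, entered by hand.

```python
import math

def pooled_t_interval(x1, s1, n1, x2, s2, n2, t_crit):
    """Interval for mu1 - mu2 assuming equal (unknown) population standard deviations."""
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    margin = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = x1 - x2
    return diff - margin, diff + margin

# Example 2.14: df = 15 + 12 - 2 = 25, t_{25, 0.025} = 2.060 (table value)
lower, upper = pooled_t_interval(80, 5, 15, 77, 6, 12, t_crit=2.060)
print(round(lower, 2), round(upper, 2))
```

An interval that contains 0, as this one does, is consistent with no difference between the two population means at the stated confidence level.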
Example 2.15: The following information was obtained from two independent samples
selected from two normally distributed populations with unknown but equal standard
deviations.
n1 = 21, x̄1 = 13.97, s1 = 3.78
a) What is the point estimate of μ1 − μ2?
b) Construct a 95% confidence interval for μ1 − μ2.
2.11 Inferences about the difference between two population proportions for
large and independent samples
The difference between two sample proportions p̂1 − p̂2 is the point estimator for the
difference between two population proportions p1 − p2 . Because we do not know p1
and p2 when we are making a confidence interval for p1 − p2 , we cannot calculate the
value of σp̂1 −p̂2 . Therefore, we use sp̂1 −p̂2 as the point estimator of σp̂1 −p̂2 in the interval
estimation. We construct the confidence interval for p1 − p2 using the following
formula.
Confidence interval for p1 − p2
The (1 − α)100% confidence interval for p1 − p2 is
$$(\hat{p}_1 - \hat{p}_2) - Z_{\alpha/2} \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} < p_1 - p_2 < (\hat{p}_1 - \hat{p}_2) + Z_{\alpha/2} \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
where the value of Z_{α/2} is read from the normal distribution table for the given
confidence level.
Example 2.16: A researcher wanted to estimate the difference between the
percentages of users of two toothpastes who will never switch to another toothpaste.
In a sample of 500 users of Toothpaste A taken by this researcher, 100 said that they
will never switch to another toothpaste. In another sample of 400 users of Toothpaste
B taken by the same researcher, 68 said that they will never switch to another
toothpaste.
a) Let p1 and p2 be the proportions of all users of Toothpastes A and B,
respectively, who will never switch to another toothpaste. What is the point
estimate of p1 − p2?
b) Construct a 95% confidence interval for the difference between the proportions
of all users of the two toothpastes who will never switch.
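The interval for p1 − p2 can be sketched for the counts in Example 2.16. The function name two_proportion_interval is hypothetical, and Z_{α/2} is computed with the standard library instead of being read from the normal table.

```python
import math
from statistics import NormalDist

def two_proportion_interval(x1, n1, x2, n2, confidence):
    """Large-sample interval for p1 - p2 from success counts x1 and x2."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)         # Z_{alpha/2}
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)    # estimated std. deviation
    diff = p1 - p2                                             # point estimate of p1 - p2
    return diff - z * se, diff + z * se

# Example 2.16: 100 of 500 users of A and 68 of 400 users of B, 95% confidence
lower, upper = two_proportion_interval(100, 500, 68, 400, 0.95)
print(round(lower, 3), round(upper, 3))
```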
Example 2.17: In the nursing home study mentioned in the chapter-opening Statistics
Today, the researchers found that 12 out of 34 small nursing homes had a resident
vaccination rate of less than 80%, while 17 out of 24 large nursing homes had a
vaccination rate of less than 80%. Find the 95% confidence interval for the difference
of proportions.
CHAPTER 3: HYPOTHESIS TESTING
3.1 Introduction
In chapter 2, the concept that a sample statistic such as the mean or a proportion
would follow a particular distribution under various circumstances was used to
develop the confidence interval as a way of making inference about the true value
of the mean or proportion. In this chapter we will begin to focus on another phase
of statistical inference called hypothesis testing, which is a decision-making
process for evaluating claims about a population. In hypothesis testing, the
researcher must define the population under study, state the particular hypotheses
that will be investigated, give the significance level, select a sample from the
population, collect the data, perform the calculations required for the statistical test,
and reach a conclusion.
3.2 Hypothesis testing procedure
Every hypothesis-testing situation begins with the statement of a hypothesis. A
statistical hypothesis is a conjecture about a population parameter. This
conjecture may or may not be true.
There are two types of statistical hypotheses for each situation: the null hypothesis
and the alternative hypothesis. The null hypothesis, symbolized by 𝑯𝟎 , is a
statistical hypothesis that states that there is no difference between a parameter
and a specific value, or that there is no difference between two parameters. The
alternative hypothesis, symbolized by 𝑯𝟏 , is a statistical hypothesis that states
the existence of a difference between a parameter and a specific value, or states
that there is a difference between two parameters.
To state hypotheses correctly, researchers must translate the conjecture or claim
from words into mathematical symbols. The basic symbols used are as follows:
Equal to        =
Not equal to    ≠
Greater than    >
Less than       <
The null and alternative hypotheses are stated together, and the null hypothesis
contains the equals sign, as shown in the table below (where 𝒌 represents a
specified number).
Two-tailed test      Right-tailed test      Left-tailed test
H0: μ = k            H0: μ ≤ k              H0: μ ≥ k
H1: μ ≠ k            H1: μ > k              H1: μ < k
Hypothesis testing common phrases
>                       <
Is greater than         Is less than
Is above                Is below
Is higher than          Is lower than
Is longer than          Is shorter than
Is bigger than          Is smaller than
Is increased            Is decreased or reduced from

=                       ≠
Is equal to             Is not equal to
Is the same as          Is different from
Has not changed from    Has changed from
Is the same as          Is not the same as
3.3 Type I and Type II errors
In using a sample to draw inferences about the population, the decision maker is taking
the risk that the incorrect conclusion will be reached. There are two types of errors that
can occur in the hypothesis testing procedure.
A Type I error occurs when the null hypothesis H0 is rejected when, in fact, it is true.
The probability of committing a Type I error is denoted by α and is also called the
level of significance; that is,
α = P(H0 is rejected | H0 is true)
A Type II error occurs when the null hypothesis H0 is not rejected when it is false and
should be rejected. The probability of committing a Type II error is denoted by β;
that is,
β = P(H0 is not rejected | H0 is false)
The value of 1 − β is called the power of the test. It represents the probability of not
making a Type II error.
In the hypothesis-testing situation, there are four possible outcomes. In reality, the null
hypothesis may or may not be true, and a decision is made to reject or not reject it on
the basis of the data obtained from a sample. The four possible outcomes are shown
in table below. Notice that there are two possibilities for a correct decision and two
possibilities for an incorrect decision.
Statistical Decision    Actual Situation
                        H0 True                     H0 False
Do not reject H0        Correct decision (1 − α)    Type II error (β)
Reject H0               Type I error (α)            Correct decision (1 − β)
3.4 Steps of hypothesis testing
 State the null and alternative hypothesis
 Specify the level of significance 𝛼
 Calculate the value of the test statistic
 Set up the critical values that divide the rejection and nonrejection regions
 Determine the statistical decision
 Express the statistical decision in terms of the problem.
3.5 Test of hypothesis for the population mean 𝝁
This section explains how to perform a test of hypothesis for the population mean 𝜇.
Here, there are two possible cases, as follows.
Case I. If the following two conditions are fulfilled:
1) The population standard deviation 𝜎 is known
2) The sample size is large (i.e. 𝑛 ≥ 30)
then we use the normal distribution to perform the hypothesis testing of the population
mean μ. That is, if the standard deviation σ is known or the sample size is large, then
based on the central limit theorem, the sampling distribution of the sample mean X̄
follows (approximately) a normal distribution, and the test statistic, which is based upon
the difference between the sample mean X̄ and the hypothesized mean μ, is found
as follows:
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
The test statistic can be defined as a rule or criterion that is used to make the
decision on whether or not to reject the null hypothesis.
Example 3.1: The TIV Telephone Company provides long-distance telephone service
in an area. According to the company’s records, the average length of all long-distance
calls placed through this company in 2009 was 12.44 minutes. The company’s
management wanted to check if the mean length of the current long-distance calls is
different from 12.44 minutes. A sample of 150 such calls placed through this company
produced a mean length of 13.71 minutes. The standard deviation of all such calls is
2.65 minutes. Using the 10% significance level, can you conclude that the mean length
of all current long-distance calls is different from 12.44 minutes?
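The steps of the z test can be sketched for the figures in Example 3.1. The function name z_test_mean is hypothetical, and the critical value is computed with the standard library instead of being read from the normal table.

```python
import math
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha):
    """Two-tailed z test of H0: mu = mu0 against H1: mu != mu0 (sigma known)."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))     # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value Z_{alpha/2}
    return z, z_crit, abs(z) > z_crit             # reject H0 if |z| exceeds z_crit

# Example 3.1: xbar = 13.71, mu0 = 12.44, sigma = 2.65, n = 150, alpha = 0.10
z, z_crit, reject = z_test_mean(13.71, 12.44, 2.65, 150, 0.10)
print(round(z, 2), round(z_crit, 3), reject)
```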
Example 3.2: A researcher reports that the average salary of assistant professors is
more than R42,000. A sample of 30 assistant professors has a mean salary of R43,260
and the standard deviation of R5,230. At α = 0.05, test the claim that assistant
professors earn more than R42,000 per year.
Case II. If the population standard deviation is unknown and the sample size is small
(i.e. n < 30), then we use the Student t distribution to perform the hypothesis testing
of the population mean μ. The test statistic for determining the difference between
the sample mean X̄ and the hypothesized population mean μ, when the sample size is
small and the sample standard deviation S is used in place of σ, is given by
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
If the population is assumed to be normal, this test statistic follows a Student t
distribution with n − 1 degrees of freedom.
Example 3.3: A psychologist claims that the mean age at which children start walking
is 12.5 months. Carol wanted to check if this claim is true. She took a random sample
of 18 children and found that the mean age at which these children started walking
was 12.9 months with a standard deviation of 0.80 month. Using the 1% significance
level, can you conclude that the mean age at which all children start walking is different
from 12.5 months? Assume that the ages at which all children start walking have an
approximately normal distribution.
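The t test can be sketched for the figures in Example 3.3. The function name t_test_mean is hypothetical, and the critical value 2.898 is the t table value for 17 degrees of freedom and α/2 = 0.005, entered by hand.

```python
import math

def t_test_mean(xbar, mu0, s, n, t_crit):
    """Two-tailed t test of H0: mu = mu0; t_crit must come from a t table (df = n - 1)."""
    t = (xbar - mu0) / (s / math.sqrt(n))  # test statistic
    return t, abs(t) > t_crit              # reject H0 if |t| exceeds t_crit

# Example 3.3: xbar = 12.9, mu0 = 12.5, s = 0.80, n = 18, alpha = 0.01
t_stat, reject = t_test_mean(12.9, 12.5, 0.80, 18, t_crit=2.898)
print(round(t_stat, 3), reject)
```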
Example 3.4: A medical investigation claims that the average number of infections
per week at a hospital in south-western Pennsylvania is 16.3. A random sample of 10
weeks had a mean number of 17.7 infections and the sample standard deviation is 1.8
infections. Is there enough evidence to reject the investigator’s claim at α = 0.05?
3.6 Hypothesis test about a population proportion
In the preceding section we used hypothesis testing procedures for quantitative data
(means). The concept of hypothesis testing can also be used to test hypotheses about
qualitative data. The number of successes follows a binomial distribution.
However, as we have seen previously when developing confidence intervals, if the
sample size is large enough (np ≥ 5 and nq ≥ 5), the normal distribution gives a good
approximation to the binomial distribution. The test statistic can be stated in two forms,
in terms of either the proportion of successes or the number of successes:
$$Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \quad \text{or} \quad Z = \frac{X - np}{\sqrt{np(1-p)}}$$
Here p is the value of the population proportion stated in the null hypothesis.
Example 3.5: According to a Nationwide Mutual Insurance Company Driving While
Distracted Survey conducted in 2008, 81% of the drivers interviewed said that they
have talked on their cell phones while driving (The New York Times, July 19, 2009).
The survey included drivers aged 16 to 61 years selected from 48 states. Assume that
this result holds true for the 2008 population of all such drivers in the United States. In
a recent random sample of 1600 drivers aged 16 to 61 years selected from the United
States, 83% said that they have talked on their cell phones while driving. Using the
5% significance level, can you conclude that the current percentage of such drivers
who have talked on their cell phones while driving is different from 81%?
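The proportion test can be sketched for the figures in Example 3.5. The function name z_test_proportion is hypothetical, and the critical value is computed with the standard library instead of being read from the normal table.

```python
import math
from statistics import NormalDist

def z_test_proportion(p_hat, p0, n, alpha):
    """Two-tailed z test of H0: p = p0 using the normal approximation."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)  # p0 comes from H0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # critical value Z_{alpha/2}
    return z, z_crit, abs(z) > z_crit

# Example 3.5: p_hat = 0.83, p0 = 0.81, n = 1600, alpha = 0.05
z, z_crit, reject = z_test_proportion(0.83, 0.81, 1600, 0.05)
print(round(z, 2), round(z_crit, 2), reject)
```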
Example 3.6: A statistician read that at least 77% of the population oppose replacing
R1 bills with R1 coins. To see if this claim is valid, the statistician selected a sample
of 80 people and found that 55 were opposed to replacing the R1 bills. At α = 0.01,
test the claim that at least 77% of the population are opposed to the change.
3.7 The 𝑷-Value Approach
In this procedure, we find a probability value such that a given null hypothesis is
rejected for any 𝛼 (significance level) greater than this value and it is not rejected for
any 𝛼 less than this value. The probability-value approach, more commonly called
the 𝑃-value approach, gives such a value. In this approach, we calculate the 𝑷-value
for the test, which is defined as the smallest level of significance at which the given
null hypothesis is rejected. Using this 𝑃-value, we state the decision. If we have a
predetermined value of 𝛼, then we compare the value of 𝑃 with 𝛼 and make a decision.
Definition 3.1: The 𝑃-value (or probability value) is the probability of getting a sample
statistic (such as the mean) or a more extreme sample statistic in the direction of the
alternative hypothesis when the null hypothesis is true.
Decision rule when using the 𝑷-value approach
We reject the null hypothesis if the P-value is less than or equal to the level of
significance (α). That is, reject H0 when the P-value ≤ α. Otherwise, we fail to reject
the null hypothesis.
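The decision rule above can be sketched for a z test statistic. This is a minimal illustration: the function name p_value_z and the tail labels are assumptions, and the value z = 2.04 is a hypothetical test statistic chosen only to show the comparison with α.

```python
from statistics import NormalDist

def p_value_z(z, tail):
    """P-value of a z statistic; tail is 'left', 'right' or 'two' (direction of H1)."""
    phi = NormalDist().cdf
    if tail == "left":
        return phi(z)
    if tail == "right":
        return 1 - phi(z)
    if tail == "two":
        return 2 * (1 - phi(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two'")

p = p_value_z(2.04, "two")      # hypothetical two-tailed test statistic
reject_at_5 = p <= 0.05         # decision rule: reject H0 when P-value <= alpha
reject_at_1 = p <= 0.01
print(round(p, 4), reject_at_5, reject_at_1)
```

The same P-value leads to rejection at α = 0.05 but not at α = 0.01, which is why the P-value is described as the smallest significance level at which H0 is rejected.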
3.8 Hypothesis Tests About the Population Variance
A test of hypothesis about the population variance can be one-tailed or two-tailed. To
make a test of hypothesis about σ2 , we perform the same steps we used earlier in
hypothesis testing examples. The procedure to test a hypothesis about σ2 discussed
in this section is applied only when the population from which a sample is selected is
(approximately) normally distributed.
The value of the test statistic χ² is calculated as
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$
where s2 is the sample variance, σ2 is the hypothesized value of the population
variance, and n − 1 represents the degrees of freedom. The population from
which the sample is selected is assumed to be (approximately) normally
distributed.
Example 3.7: The variance of scores on a standardized mathematics test for all high
school seniors was 150 in 2009. A sample of scores for 20 high school seniors who
took this test this year gave a variance of 170. Test at the 5% significance level if the
variance of current scores of all high school seniors on this test is different from 150.
The population from which the sample is selected is assumed to be normally
distributed.
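The chi-square test for a variance can be sketched for the figures in Example 3.7. The function name chi_square_variance_test is hypothetical, and the critical values 8.907 and 32.852 are chi-square table values for 19 degrees of freedom (right-tail areas 0.975 and 0.025), entered by hand.

```python
def chi_square_variance_test(s2, sigma0_sq, n, chi2_lower, chi2_upper):
    """Two-tailed test of H0: sigma^2 = sigma0_sq; critical values come from a table."""
    chi2 = (n - 1) * s2 / sigma0_sq                    # test statistic
    reject = chi2 < chi2_lower or chi2 > chi2_upper    # rejection region is in both tails
    return chi2, reject

# Example 3.7: s^2 = 170, hypothesized sigma^2 = 150, n = 20, alpha = 0.05
chi2, reject = chi_square_variance_test(170, 150, 20, chi2_lower=8.907, chi2_upper=32.852)
print(round(chi2, 3), reject)
```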
Example 3.8: A sample of 21 observations selected from a normally distributed
population produced a sample variance of 1.97.
a) Write the null and alternative hypotheses to test whether the population variance
is greater than 1.75.
b) Using α = 0.025, find the critical value of χ². Show the rejection and nonrejection
regions on a chi-square distribution curve.
c) Find the value of the test statistic χ².
d) Using the 2.5% significance level, will you reject the null hypothesis stated in part
a)? Support your answer.
3.9 Hypothesis testing for the difference between two population means 𝝁𝟏 − 𝝁𝟐
In the preceding sections we examined hypothesis testing procedures pertaining to
whether a mean or proportion was equal to some specified value. These cases are
usually referred to as one-sample tests, since a single sample is selected from a
population of interest and a computed statistic from the sample is compared to a
hypothesized value. In this section, we shall extend our discussion of hypothesis
testing to consider additional procedures pertaining to quantitative and qualitative
data.
Let us first extend the hypothesis testing concepts developed in the previous section
to situations in which we would like to determine whether there is any difference
between the means of two independent populations. Two populations are
independent if the elements of one population have no relationship to the elements
of the second population. If the elements of the two populations are somehow related,
then the populations are said to be dependent. Thus, for two independent populations,
the selection of a sample from one population has no effect on the selection of a sample
from the second population.
Suppose then that we consider two independent populations, each having a mean and
standard deviation, represented symbolically as follows:
Population I      Population II
μ1, σ1            μ2, σ2
The test to be performed can be either two-tailed or one-tailed, depending on whether
we are testing if the two population means are merely different or if one mean is greater
than the other mean.
Two-tailed test      One-tailed (left-tailed) test      One-tailed (right-tailed) test
H0: μ1 = μ2          H0: μ1 ≥ μ2                        H0: μ1 ≤ μ2
H1: μ1 ≠ μ2          H1: μ1 < μ2                        H1: μ1 > μ2
where μ1 = mean of population 1
      μ2 = mean of population 2
The statistic used to determine the difference between the population means is
based upon the difference between the sample means (X̄1 − X̄2). Because of the
central limit theorem, the statistic will follow the normal distribution for large enough
sample sizes. The test statistic is
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
The value of μ1 − μ2 in this formula is substituted from the null hypothesis.
Example 3.9: A 2008 survey of low- and middle-income households conducted by
Demos, a liberal public policy group, showed that consumers aged 65 years and older
had an average credit card debt of R10,235 and consumers in the 50- to 64-year age
group had an average credit card debt of R9342 at the time of the survey (USA
TODAY, July 28, 2009). Suppose that these averages were based on random samples
of 1200 and 1400 people for the two groups, respectively. Further assume that the
population standard deviations for the two groups were R2800 and R2500,
respectively. Let μ1 and μ2 be the respective population means for the two groups, people
aged 65 years and older and people in the 50- to 64-year age group. Test at the 1%
significance level whether the population means for the 2008 credit card debts for the
two groups are different.
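The two-sample z test can be sketched for the figures in Example 3.9. The function name two_mean_z_test is hypothetical; the hypothesized difference delta0 = 0 corresponds to H0: μ1 = μ2, and the critical value is computed with the standard library instead of being read from the normal table.

```python
import math
from statistics import NormalDist

def two_mean_z_test(x1, sigma1, n1, x2, sigma2, n2, alpha, delta0=0.0):
    """Two-tailed z test of H0: mu1 - mu2 = delta0 with known sigmas."""
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = ((x1 - x2) - delta0) / se                 # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value Z_{alpha/2}
    return z, z_crit, abs(z) > z_crit

# Example 3.9: alpha = 0.01
z, z_crit, reject = two_mean_z_test(10235, 2800, 1200, 9342, 2500, 1400, alpha=0.01)
print(round(z, 2), round(z_crit, 3), reject)
```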
Example 3.10: A survey found that the average hotel room rate in New Orleans is
R88.42 and the average room rate in Phoenix is R80.61. Assume that the data were
obtained from two samples of 50 hotels each and that the standard deviations of the
populations are R5.62 and R4.83, respectively. At α = 0.05, can it be concluded that
there is a significant difference in the rates?
However, as we mentioned previously, in most cases we do not know the standard
deviation of either of the two populations (σ1, σ2). The only information usually available
is the sample means and sample standard deviations. If the assumption is made
that each population is normally distributed, a Student t test can be used to determine
whether there is any difference between the means of the two populations. The
Student t test statistic will be
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$
The value of μ1 − μ2 in this formula is substituted from the null hypothesis.
Note that the Student t test statistic above has the following degrees of freedom:
$$df = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{\left( \frac{S_1^2}{n_1} \right)^2}{n_1 - 1} + \frac{\left( \frac{S_2^2}{n_2} \right)^2}{n_2 - 1}}$$
The number given by this formula is always rounded down for df.
Example 3.11: A sample of 14 cans of Brand I diet soda gave the mean number of
calories per can of 23 with a standard deviation of 3 calories. Another sample of 16
cans of Brand II diet soda gave the mean number of calories of 25 per can with a
standard deviation of 4 calories. Test at the 1% significance level whether the mean
numbers of calories per can of diet soda are different for these two brands. Assume
that the calories per can of diet soda are normally distributed for each of these two
brands and that the standard deviations for the two populations are not equal.
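The unequal-variance t statistic and its degrees of freedom can be sketched for the figures in Example 3.11. The function name unequal_variance_t is hypothetical; the critical value is left to the t table and is not computed here.

```python
import math

def unequal_variance_t(x1, s1, n1, x2, s2, n2, delta0=0.0):
    """t statistic and rounded-down df when sigma1 and sigma2 are not assumed equal."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2                  # S1^2/n1 and S2^2/n2
    t = ((x1 - x2) - delta0) / math.sqrt(v1 + v2)        # test statistic
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, math.floor(df)                             # df is always rounded down

# Example 3.11: Brand I (n = 14, mean 23, s = 3), Brand II (n = 16, mean 25, s = 4)
t_stat, df = unequal_variance_t(23, 3, 14, 25, 4, 16)
print(round(t_stat, 3), df)
```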
Example 3.12: A sample of 15 one-pound jars of coffee of Brand I showed that the
mean amount of caffeine in these jars is 80 milligrams per jar with a standard deviation
of 5 milligrams. Another sample of 12 one-pound coffee jars of Brand II gave a mean
amount of caffeine equal to 77 milligrams per jar with a standard deviation of 6
milligrams. Construct a 95% confidence interval for the difference between the mean
amounts of caffeine in one-pound coffee jars of these two brands. Assume that the
two populations are normally distributed and that the standard deviations of the two
populations are not equal.
However, if the assumptions are made that each population is normally distributed
and that the population variances are equal (σ1² = σ2²), a Student t test can be used
to determine whether there is any difference between the means of the two
populations. Since we have assumed equal variances in the two populations, the
variances of the two samples (S1²; S2²) can be pooled together to form one estimate
(Sp²) of the population variance. The Student t test statistic, which has n1 + n2 − 2
degrees of freedom, will be
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$
The value of μ1 − μ2 in this formula is substituted from the null hypothesis,
and the pooled variance Sp² for two samples is computed as:
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
where n1 and n2 are the sizes of the two samples and S1² and S2² are the variances
of the two samples, respectively. Here Sp² is an estimator of the common variance σ².
Example 3.13: A consumer agency wanted to estimate the difference in the mean
amounts of caffeine in two brands of coffee. The agency took a sample of 15 one-pound
jars of Brand I coffee that showed the mean amount of caffeine in these jars to
be 80 milligrams per jar with a standard deviation of 5 milligrams. Another sample of
12 one-pound jars of Brand II coffee gave a mean amount of caffeine equal to 77
milligrams per jar with a standard deviation of 6 milligrams. Construct a 95%
confidence interval for the difference between the mean amounts of caffeine in
one-pound jars of these two brands of coffee. Assume that the two populations are normally
distributed and that the standard deviations of the two populations are equal.
Example 3.14: A sample of 40 children from New York State showed that the mean
time they spend watching television is 28.50 hours per week with a standard deviation
of 4 hours. Another sample of 35 children from California showed that the mean time
spent by them watching television is 23.25 hours per week with a standard deviation
of 5 hours. Using a 2.5% significance level, can you conclude that the mean time spent
watching television by children in New York State is greater than that for children in
California? Assume that the standard deviations for the two populations are equal.
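The pooled t test can be sketched for the figures in Example 3.14. The function name pooled_t_test is hypothetical, and the right-tailed critical value 1.993 for 73 degrees of freedom and α = 0.025 is an approximate table value entered by hand; it should be checked against the t table in use.

```python
import math

def pooled_t_test(x1, s1, n1, x2, s2, n2, t_crit, delta0=0.0):
    """Right-tailed pooled t test of H0: mu1 - mu2 <= delta0 (equal sigmas assumed)."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)  # pooled variance
    t = ((x1 - x2) - delta0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))    # test statistic
    return t, n1 + n2 - 2, t > t_crit                                # reject if t > t_crit

# Example 3.14: New York (n = 40, mean 28.50, s = 4), California (n = 35, mean 23.25, s = 5)
t_stat, df, reject = pooled_t_test(28.50, 4, 40, 23.25, 5, 35, t_crit=1.993)
print(round(t_stat, 3), df, reject)
```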
3.10 Hypothesis testing for the equality of variances from two populations
In many situations, we may also be interested in testing whether two populations have
the same variability. Either we may be interested in testing the assumption of equal
variances that we had made for the t test in section 3.9, or we may be interested in
studying the variances for two populations as an end in itself.
In order to examine the equality of the variances of two independent populations, a
statistical procedure has been devised that is based upon the ratio of the two sample
variances. If the data from each population are assumed to be normally distributed, then
the ratio of the two sample variances follows a distribution called the F distribution,
which was named after the famous statistician R.A. Fisher. The test statistic for testing
the ratio between two variances is
$$F = \frac{s_1^2}{s_2^2}$$
where the larger of the two variances is placed in the numerator regardless of the
subscripts. The F test has two terms for the degrees of freedom: that of the
numerator, n1 − 1, and that of the denominator, n2 − 1, where n1 is the sample
size from which the larger variance was obtained.
In testing the ratio of two variances, either one-tailed or two-tailed tests can be
employed as indicated in the table below.
Two-tailed test        One-tailed (left-tailed) test    One-tailed (right-tailed) test
H0: σ1² = σ2²          H0: σ1² ≥ σ2²                    H0: σ1² ≤ σ2²
H1: σ1² ≠ σ2²          H1: σ1² < σ2²                    H1: σ1² > σ2²
where σ1² = variance of population 1
      σ2² = variance of population 2
Example 3.15: A medical researcher wishes to see whether the variance of the heart
rates (in beats per minute) of smokers is different from the variance of heart rates of
people who do not smoke. Two samples are selected, and the data are as shown.
Using α = 0.05, is there enough evidence to support the claim?
Example 3.16: The standard deviation of the average waiting time to see a doctor for
non-life-threatening problems in the emergency room at an urban hospital is 32
minutes. At a second hospital, the standard deviation is 28 minutes. A sample of 16
patients was used in the first case and 18 in the second case. Using α = 0.01, is there
enough evidence to conclude that the standard deviation of the waiting times in the
first hospital is greater than the standard deviation of the waiting times in the second
hospital?
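The F ratio described in this section can be sketched in a few lines of Python. This is only an illustration using the figures quoted in Example 3.16; the function name `f_statistic` and the variable names are assumptions made for the sketch, and the critical value still has to be read from the F-distribution table.

```python
# Minimal sketch: F statistic for comparing two sample variances.
# The helper name and its arguments are illustrative assumptions.

def f_statistic(s1, n1, s2, n2):
    """Return (F, df_numerator, df_denominator), larger variance on top."""
    var1, var2 = s1 ** 2, s2 ** 2
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1

# Figures from Example 3.16: s1 = 32 (n1 = 16), s2 = 28 (n2 = 18).
f_value, df_num, df_den = f_statistic(s1=32, n1=16, s2=28, n2=18)
print(f"F = {f_value:.3f} with df = ({df_num}, {df_den})")
```

The computed F is then compared with the tabulated critical value for the chosen α and these degrees of freedom.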
3.11 Hypothesis testing for the difference between two population proportions (P1 − P2)
Rather than being concerned with the difference between two populations in terms
of a quantitative variable, we could be interested in the difference in some qualitative
characteristic. A test for the difference between two proportions based upon
independent samples can be performed using the normal distribution. This test is
based on the difference between the two sample proportions, which may be
approximated by a normal distribution for large sample sizes.
For the two populations involved, we are interested in determining either whether
there is any difference in the proportion of successes in the two groups (two-tailed
test) or whether one group has a higher proportion of successes than the other group
(one-tailed test).
Two-tailed test        One-tailed (left-tailed) test    One-tailed (right-tailed) test
H0: P1 = P2            H0: P1 ≥ P2                      H0: P1 ≤ P2
H1: P1 ≠ P2            H1: P1 < P2                      H1: P1 > P2
where P1 = proportion of successes in population 1
      P2 = proportion of successes in population 2
The test statistic is
$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - (P_1 - P_2)}{\sqrt{\bar{P}(1 - \bar{P})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where P̂1 and P̂2 are the two sample proportions and n1 and n2 are the two sample sizes.
The estimate for the population proportion that we shall use is based upon the null
hypothesis. Under the null hypothesis it is assumed that the two population proportions
are equal. Therefore, we may obtain an overall estimate of the population proportion
by pooling the two samples. The estimate P̄ is simply the number
of successes in the two samples combined divided by the total sample size. That is,
$$\bar{P} = \frac{X_1 + X_2}{n_1 + n_2}$$
Example 3.17: A researcher wanted to estimate the difference between the
percentages of users of two toothpastes who will never switch to another toothpaste.
In a sample of 500 users of Toothpaste 𝐴 taken by this researcher, 100 said that they
will never switch to another toothpaste. In another sample of 400 users of Toothpaste
𝐵 taken by the same researcher, 68 said that they will never switch to another
toothpaste. At the 10% significance level, can you conclude that the proportion of
users of Toothpaste 𝐴 who will never switch to another toothpaste is higher than the
proportion of users of Toothpaste 𝐵 who will never switch to another toothpaste?
Example 3.18: A company that has many department stores in the southern states
wanted to find at two such stores the percentage of sales for which at least one of the
items was returned. A sample of 800 sales randomly selected from Store A showed that
for 280 of them at least one item was returned. Another sample of 900 sales randomly
selected from Store B showed that for 279 of them at least one item was returned.
Using the 5% significance level, can you conclude that the proportion of all sales for
which at least one item is returned is higher for Store A than for Store B?
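As a rough illustration of the pooled two-proportion Z test from this section, the sketch below works through the figures given in Example 3.17 (100 of 500 users versus 68 of 400 users, right-tailed test at the 10% level). The function name `two_proportion_z` is an assumption made for the sketch, and the hypothesised difference P1 − P2 is taken to be zero.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: P1 = P2, using the pooled estimate P-bar."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)  # pooled proportion of successes
    std_error = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / std_error

# Figures from Example 3.17 (Toothpaste A versus Toothpaste B).
z = two_proportion_z(x1=100, n1=500, x2=68, n2=400)

# Right-tailed critical value at alpha = 0.10 from the standard normal.
z_crit = NormalDist().inv_cdf(1 - 0.10)
reject_h0 = z > z_crit
print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```

For a left-tailed or two-tailed test, only the comparison with the critical value changes; the statistic itself is computed the same way.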
CHAPTER 4: CHI-SQUARE HYPOTHESIS TESTING
4.1 Introduction
The statistical inference techniques presented so far have dealt exclusively with
hypothesis tests for population parameters such as the mean (μ), variance (σ²) and
proportion (P). In this chapter, we consider inferential procedures that are not
concerned with population parameters. These procedures are often called chi-square
(χ²) procedures for the simple reason that they rely on a probability distribution called the chi-square distribution.
A random variable has the chi-square distribution if its distribution has the shape of a
special type of right-skewed curve, called the chi-square (χ2 ) curve.
4.1.1 Basic properties of χ²-curves
• The total area under a χ²-curve equals 1.
• A χ²-curve is right skewed.
• As the number of degrees of freedom becomes larger, a χ²-curve looks
increasingly like a normal curve.
• A χ²-curve starts at zero on the horizontal axis and extends indefinitely to the
right, approaching, but never touching, the horizontal axis.
4.1.2 Finding the 𝛘𝟐 -value having the specified area to its right
For a χ²-curve with 8 degrees of freedom, find χ²_{0.025}; that is, find the χ²-value that has
area 0.025 to its right.
To find this χ²-value, we use the chi-square distribution table. The number of degrees of freedom
is 8, so we first go down the column labelled df to 8. Then, going across that row to
the column labelled χ²_{0.025}, we reach 17.535. Therefore, for the χ²-curve with 8 degrees
of freedom, χ²_{0.025} = 17.535.
Example 4.1: Use the chi-square distribution table to determine the required χ²-values. Illustrate your work graphically.
a) For a χ²-curve with 3 degrees of freedom, determine the χ²-values that have areas
0.025 and 0.95 to their right.
b) For a χ²-curve with df = 7, determine χ²_{0.05} and χ²_{0.975}.
c) Consider χ²-curves with df = 12 and df = 20, respectively. Which one more closely
resembles a normal curve? Explain your answer.
4.2 Chi-square goodness-of-fit test
Goodness-of-fit test is a chi-square procedure which can be used to perform a
hypothesis test about the distribution of qualitative (categorical) variable or a discrete
quantitative variable that has only finitely many possible values.
4.2.1 Distribution of the 𝛘𝟐 -statistic for a goodness-of-fit test
For a chi-square goodness-of-fit test, the test statistic is algebraically expressed as:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
It has approximately a chi-square distribution if the null hypothesis is correct. The
number of degrees of freedom is one less than the number of possible values (k) for
the variable of interest.
Note that in the chi-square statistic, O_i represents the observed frequencies and E_i
represents the expected frequencies. The expected frequency for each possible value
of the variable is obtained using the following formula:
$$E_i = n p_i$$
where n is the sample size and p_i is the relative frequency (or probability) given for
that value in the null hypothesis.
4.2.2 Procedures for the Chi-square Goodness-of-fit Test
The purpose of the chi-square goodness-of-fit test is to perform a hypothesis test for
the distribution of a variable.
Assumptions:
• The data are obtained from a random sample.
• The expected frequency of each category must be at least 5.
Six steps to be followed when conducting a chi-square goodness-of-fit test:
• Step 1: The null and the alternative hypotheses are, respectively,
H0: The variable has the specified distribution.
H1: The variable does not have the specified distribution.
• Step 2: Decide on the significance level, α.
• Step 3: Compute the test statistic,
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
• Step 4: The critical value is χ²_α with k − 1 degrees of freedom, where k is the
number of possible values for the variable.
• Step 5: If the value of the test statistic falls in the rejection region, reject H0;
otherwise, do not reject H0.
• Step 6: Interpret the results of the hypothesis test.
EXAMPLE 4.2: A simple random sample of 500 violent-crime reports from last year
yielded the results in Table 4.1 column 2. Column 3 gives relative-frequency for 2016.
Table 4.1: Distribution of violent crimes in Polokwane.
Type of violent crime    Observed frequency    Relative frequency
Murder                   3                     0.011
Forcible rape            37                    0.063
Robbery                  154                   0.286
Assault                  306                   0.640
a) Identify the population and the variable of interest.
b) Check whether the two assumptions of the chi-square goodness-of-fit test are met.
c) At 5% level of significance, do the data provide sufficient evidence to conclude
that last year’s violent-crime distribution is different from the 2016 distribution?
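The goodness-of-fit steps can be sketched in Python as follows, using the observed frequencies and 2016 relative frequencies from Table 4.1. The variable names are illustrative assumptions, and the critical value 7.815 is the tabulated χ²_{0.05} for 3 degrees of freedom.

```python
# Minimal sketch of the chi-square goodness-of-fit calculation.
observed = [3, 37, 154, 306]              # murder, rape, robbery, assault
null_probs = [0.011, 0.063, 0.286, 0.640]  # 2016 relative frequencies

n = sum(observed)                          # sample size (500)
expected = [n * p for p in null_probs]     # E_i = n * p_i

# Assumption check: every expected frequency should be at least 5.
assumption_met = all(e >= 5 for e in expected)

# Test statistic: sum of (O_i - E_i)^2 / E_i over the k categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                     # k - 1 degrees of freedom
critical_value = 7.815                     # table value for alpha = 0.05, df = 3
reject_h0 = chi_sq > critical_value
print(f"chi-square = {chi_sq:.3f}, df = {df}, reject H0: {reject_h0}")
```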
Example 4.3: Finger Lakes Homes manufactures four models of prefabricated homes,
a two-story colonial, a log cabin, a split-level, and an A-frame. To help in production
planning, management would like to determine if previous customer purchases
indicate that there is a preference in the style selected.
Table 4.2: The number of homes sold of each model for 100 sales over the past two
years.
Model    Colonial    Log-cabin    Split-level    A-frame
Sold     30          20           35
a) Complete the table above.
b) Calculate the expected value for each category.
c) Test if previous customer purchases indicate that there is a preference in the style
selected. Use 1% significance level.
Example 4.3: The Higher Education Research Institute of the University of Limpopo,
South Africa, publishes information on the characteristics of incoming college freshmen in
South Africa. In 2017, 27.7% of incoming freshmen characterised their
political views as liberal, 51.9% as moderate, and 20.4% as conservative. For this
year, a random sample of 250 incoming college freshmen produced the following
frequency distribution for political views.
Table 4.3: Frequency distribution for political views.
Political view    Liberal    Moderate    Conservative
Frequency         80         123         47
a) Identify the population and variable under consideration here.
b) Test if the data provide sufficient evidence to conclude that this year’s distribution
of political views for incoming college freshmen has changed from the 2017 distribution.
4.3 Chi-square test for independence or association
As indicated by the formula in the previous section, the chi-square test statistic
measures how much the observed frequencies and the expected frequencies differ.
The test establishes whether two categorical random variables are statistically related
(dependent or independent). Statistical independence means that the outcome of one
random variable in no way influences (or is influenced by) the outcome of the second
random variable.
4.3.1 Distribution of the 𝛘𝟐 -statistic for independence test
The chi-square statistic that transforms the sample frequencies into a test statistic is
mathematically expressed as follows:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
where the sum runs over all k cells of the contingency table.
It has approximately a chi-square distribution if the null hypothesis of non-association is correct. The number of degrees of freedom is (r − 1)(c − 1), where r
and c are the number of rows and columns, respectively.
Note that in the chi-square statistic, O represents the observed frequencies and E
represents the expected frequencies. The expected frequency for the cell in row i and
column j is obtained using the following formula:
$$E_{ij} = \frac{R_i \times C_j}{n}$$
where n is the sample size, R_i is the sum of all the frequencies in row i and C_j is the
sum of all the frequencies in column j.
4.3.2 Procedures for the Chi-square of Independence Test
The purpose of the chi-square independence test is to perform a hypothesis test to
decide whether the two variables are associated.
Assumptions:
• The data are obtained from a random sample.
• The expected frequency of each cell must be at least 5.
Six steps to be followed when conducting a chi-square independence test:
• Step 1: The null and the alternative hypotheses are, respectively,
H0: The two variables are not associated.
H1: The two variables are associated.
• Step 2: Decide on the significance level, α.
• Step 3: Compute the test statistic,
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
• Step 4: The critical value is χ²_α with (r − 1)(c − 1) degrees of freedom.
• Step 5: If the value of the test statistic falls in the rejection region, reject H0;
otherwise, do not reject H0.
• Step 6: Interpret the results of the hypothesis test.
Example 4.4: Suppose that you are a marketing research analyst and you ask a
random sample of 286 people whether they purchased Diet Pepsi or Diet Coke.
Table 4.4: Contingency table of Diet Pepsi and Diet Coke purchases.
                  Diet Coke
Diet Pepsi    No      Yes     Total
No            84      32      116
Yes           48      122     170
Total         132     154     286
a) Check whether the two assumptions of the chi-square independence test are met.
b) At the α = 0.01 significance level, can one conclude that there exists a relationship
between the two diets?
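The independence test can be sketched in Python using the counts of Table 4.4. The layout (rows for Diet Pepsi, columns for Diet Coke) and all variable names are assumptions made for the illustration; the critical value 6.635 is the tabulated χ²_{0.01} for 1 degree of freedom.

```python
# Minimal sketch of the chi-square test for independence on a 2 x 2 table.
table = [
    [84, 32],    # Diet Pepsi: No  -> Diet Coke: No, Yes
    [48, 122],   # Diet Pepsi: Yes -> Diet Coke: No, Yes
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected frequency of each cell: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
assumption_met = all(e >= 5 for row in expected for e in row)

# Sum (O - E)^2 / E over every cell of the table.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(table) - 1) * (len(table[0]) - 1)   # (r - 1)(c - 1)
critical_value = 6.635                        # table value, alpha = 0.01, df = 1
reject_h0 = chi_sq > critical_value
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject_h0}")
```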
Example 4.5: A national survey was conducted to obtain information on the alcohol
consumption pattern of RSA adults by marital status. A random sample of 1772
residents of 18 years old and older yielded the data displayed in the table below.
Table 4.5: Four by three contingency table of alcohol consumption and marital status.
Drinks per month
Marital status
Abstain
1-60
Over 60
single
67
213
74
Married
411
Widowed
85
51
7
Divorced
27
60
15
129
a) Identify the population and variable under consideration here.
b) Complete the contingency table above.
c) Check whether the two assumptions of the chi-square independence test are met.
d) Do the data provide sufficient evidence to conclude that an association exists
between marital status and alcohol consumption?
CHAPTER 5: ANALYSIS OF VARIANCE
5.1 Introduction
Analysis of variance (ANOVA) is a statistical technique used to test whether there
exists a significant difference between two or more population means. This might
seem strange because the technique is called “analysis of variance” rather than
“analysis of population means”. However, the name is appropriate because inference
about the population means is made by analysing the variance. In the context of
ANOVA, populations are described as treatments, while observations are results
obtained after applying treatments on the experimental units. In this module we will
focus only on One-Way ANOVA.
5.2 Terms and concepts
Let’s define some of the important terms and concepts in design of experiments. We
have already seen the terms like, treatment, experimental unit, randomisation and
response. However, we define them again here for completeness.
Definition 5.1: Treatments, sometimes called factors, are the different procedures
(levels) that we want to investigate or compare, e.g. different kinds or amounts of
fertilisers in agronomy.
Definition 5.2: Experimental units are the things to which we apply the treatments,
e.g. patients in a hospital.
Definition 5.3: Response, sometimes called the dependent variable, is the outcome
that we observe after applying a treatment on an experimental unit. That is, the
response is what we measure to judge what happened in the experiment.
Definition 5.4: Randomization is the random allocation of the experimental units to
the treatments or factor levels. That is, it is the allocation of the experimental units to
the treatments in a haphazard way.
Example 5.1: Suppose that a group of researchers from Rotterdam village in Giyani
conducts a study to compare the mean caffeine content of three brands of tea leaves.
They sampled 20 tea bags of each brand, analysed them for caffeine content and
recorded the amount of caffeine in each tea bag in milligrams. From the study above:
a) What is the response variable?
b) Identify the treatment and levels of interest in the study.
c) Identify the experimental units and number of the experimental units.
5.3 The F-Distribution
ANOVA procedures depend on a distribution called the F-distribution, which was
named in honor of Sir Ronald Fisher. A variable is said to follow an F-distribution if its
distribution has the shape of a special type of right-skewed curve called an F-curve. The F-distribution has two numbers of degrees of freedom instead of one. The first number of degrees
of freedom for an F-curve is called the degrees of freedom for the numerator and the
second is called the degrees of freedom for the denominator.
5.3.1 Basic Properties of F-curves
• The total area under the F-curve equals 1.
• An F-curve is skewed to the right.
• An F-curve starts at zero on the horizontal axis and extends indefinitely to the
right, approaching, but never touching, the horizontal axis as it does so.
5.3.2 Finding the F-value having the specified area to its right
For an F-curve with df = (4, 12), find 𝐹0.05 ; that is, find the F-value having area 0.05 to
its right:
To find this F-value, we use the F-distribution table above. In this case, α = 0.05, the
degrees of freedom for the numerator are 4, and the degrees of freedom for the
denominator are 12. We first go down the dfd column to 12, then go across that row to
the column labelled 4 and reach 3.26. Therefore, for the F-curve with df = (4, 12), F_{0.05} = 3.26.
Example 5.2: Use the F-distribution table to determine the required F-values. Illustrate
your work graphically.
a) F-curve which has df = (8, 19). What is the degrees of freedom for the numerator
and for the denominator?
b) F-curve that has df = (12, 5) with 0.05 area to its right.
c) F-curve with df = (20, 20), 𝐹0.05 .
d) F-curve with df = (23, 9), 𝐹0.05
e) F-curve with df = (35, 10), 𝐹0.05
5.4 Performing a One-Way ANOVA
To perform a one-way ANOVA, we need to determine the three sums of squares, Total
sum of squares (SST), Treatment Sum of squares (SSTR) and Error sum of squares
(SSE). For a one-way ANOVA with 𝑡 population means, the defining and computing
formulas for the three sums of squares are as follows:
Sum of squares     Defining formula                              Computing formula
Total, SST         \sum_{i=1}^{n} (x_i - \bar{x})^2              \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
Treatment, SSTR    \sum_{j=1}^{t} n_j (\bar{x}_j - \bar{x})^2    \sum_{j=1}^{t} n_j \bar{x}_j^2 - n\bar{x}^2
Error, SSE         \sum_{j=1}^{t} (n_j - 1) s_j^2                \sum_{j=1}^{t} n_j s_j^2 - \sum_{j=1}^{t} s_j^2
Note: Total sum of squares equals the treatment sum of squares plus error sum of
squares: 𝑆𝑆𝑇 = 𝑆𝑆𝑇𝑅 + 𝑆𝑆𝐸.
In the table above, we used the following notations:
𝑛 =total number of observations
𝑥̅ = mean of all 𝑛 observations
and, for 𝑗 = 1,2, … , 𝑡,
𝑛𝑗 = size of sample from population 𝑗
𝑥̅𝑗 = mean of sample from population 𝑗
𝑠 2𝑗 = variance of sample from population 𝑗
5.5 One-Way ANOVA Table
To organize and summarize the quantities required for performing a one-way analysis
of variance, we use a one-way ANOVA table. The general format of such table is
shown in the table below:
Source of variation    SS      DF       MS      F-ratio
Treatment              SSTR    t − 1    MSTR    MSTR/MSE
Error                  SSE     n − t    MSE
Total                  SST     n − 1
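Note: SS = Sum of squares, DF = Degrees of freedom and MS = Mean squares. Each mean
square is its sum of squares divided by its degrees of freedom: MSTR = SSTR/(t − 1)
and MSE = SSE/(n − t).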
Procedures for One-Way ANOVA Test
The purpose of one-way ANOVA is to perform a hypothesis test to compare t
treatment or population means.
Assumptions:
• Simple random samples
• Independent samples
• Normal populations
• Equal population variances
Six steps to be followed when conducting a hypothesis test for comparing more than
two population means:
• Step 1: The null and alternative hypotheses are, respectively,
H0: μ1 = μ2 = ⋯ = μt
H1: Not all the means are equal.
• Step 2: Decide on the significance level, α.
• Step 3: Compute the value of the F-test statistic
$$F = \frac{MSTR}{MSE}$$
• Step 4: The critical value is F_α with df = (t − 1, n − t). Use the F-distribution table to find
the critical value for the specified area.
• Step 5: If the value of the test statistic falls in the rejection region, reject H0; otherwise
do not reject H0.
• Step 6: Interpret the results of the hypothesis test.
Example 5.3: A study was undertaken to compare distance travelled in kilometres per
litre of three competing brands of petrol. Fifteen identical cars were available for the
experiments.
             Replications
Brand    1       2       3       4       5
A        9.5     11.0    13.0    15.0    18.0
B        10.5    12.0    14.0    16.0    10.5
C        10.0    10.5    13.5    14.5    10.0
a) What is the response variable of this study?
b) What is the sample size that was considered in this study?
c) Identify the treatment and number of levels.
d) Identify the experimental units.
e) Compute the three sums of squares.
f) Use the answers in e) to construct the ANOVA table.
g) Is there a significant difference between the means of the three brands of petrol?
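The one-way ANOVA calculations can be sketched in Python with the data of Example 5.3. The sketch assumes that each row of the data table holds the five replications of one brand; the variable names are illustrative. The example does not state a significance level, so the comparison against F_{0.05}(2, 12) = 3.89 at the end is an assumption.

```python
# Minimal sketch of a one-way ANOVA using the defining formulas.
samples = {
    "A": [9.5, 11.0, 13.0, 15.0, 18.0],
    "B": [10.5, 12.0, 14.0, 16.0, 10.5],
    "C": [10.0, 10.5, 13.5, 14.5, 10.0],
}

n = sum(len(values) for values in samples.values())   # total observations
t = len(samples)                                      # number of treatments
grand_mean = sum(sum(values) for values in samples.values()) / n
means = {name: sum(values) / len(values) for name, values in samples.items()}

# Treatment, error and total sums of squares.
sstr = sum(len(v) * (means[k] - grand_mean) ** 2 for k, v in samples.items())
sse = sum((x - means[k]) ** 2 for k, v in samples.items() for x in v)
sst = sum((x - grand_mean) ** 2 for v in samples.values() for x in v)

# Mean squares and the F ratio for the ANOVA table.
mstr = sstr / (t - 1)
mse = sse / (n - t)
f_ratio = mstr / mse

critical_value = 3.89   # F table value for alpha = 0.05 (assumed), df = (2, 12)
reject_h0 = f_ratio > critical_value
print(f"SSTR = {sstr:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}, F = {f_ratio:.3f}")
```

The printed sums of squares can be checked against the identity SST = SSTR + SSE.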
Example 5.4: Suppose that a company wishes to study job satisfaction of employees
according to the length of service. It plans to classify employees into five independent
groups and select four employees at random from each group for intensive
interviewing. Suppose that the results yielded: 𝑆𝑆𝑇 = 7872 and 𝑆𝑆𝐸 = 1724.
a) What is the sample size that was considered in this study?
b) Check the first two assumptions required for performing a one-way
ANOVA test.
c) Use the information above to set-up an appropriate ANOVA table.
d) Test at 𝛼 = 10% level of significance whether the job satisfaction scores are equal
for the five groups.
Example 5.5: Consider the summary statistics of former prisoners diagnosed with
three different posttraumatic stress disorder (PTSD): 𝑛1 = 32, 𝑥̅1 = 73.0, 𝑠1 =
19.2; 𝑛2 = 20, 𝑥̅2 = 45.6, 𝑠2 = 23.4, 𝑛3 = 29, 𝑥̅ 3 = 34.5 and 𝑠3 = 22.0. Do the data
provide sufficient evidence to conclude that the mean severity of PTSD are equal?
5.6 Pairwise comparisons of the treatments
In many practical situations, we will wish to compare only pairs of means. Frequently,
we can determine which means differ by testing the differences between all pairs of
treatment means. Suppose that we are interested in comparing all pairs of treatment
means and that the null hypotheses that we wish to test are H0: μi = μj for all i ≠ j.
There are numerous procedures available for this problem. We now present two
popular methods for making such comparisons.
5.6.1 Tukey’s pairwise comparison test
Suppose that, following an ANOVA in which we have rejected the null hypothesis of
equal treatment means, we wish to test all pairwise mean comparisons:
𝐻0 : 𝜇𝑖 = 𝜇𝑗 versus 𝐻1 : 𝜇𝑖 ≠ 𝜇𝑗 for all 𝑖 ≠ 𝑗.
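Tukey’s test declares two means significantly different if the absolute value of the
difference between their sample means exceeds
$$T = \frac{q_\alpha(t, n - t)}{\sqrt{2}} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
where q_α(t, n − t) is the upper-α point of the studentized range distribution for t
means and n − t error degrees of freedom. The factor 1/√2 is needed when q_α is read
from a standard studentized range table; for equal sample sizes n_i = n_j = m, the
expression reduces to q_α(t, n − t)√(MSE/m).
Equivalently, we could construct a set of 100(1 − α) percent confidence intervals for
all pairs of means as follows:
$$\bar{x}_i - \bar{x}_j \pm \frac{q_\alpha(t, n - t)}{\sqrt{2}} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$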
5.6.2 LSD method of pairwise comparison
Suppose that, following an ANOVA in which we have rejected the null hypothesis of
equal treatment means, we wish to test all pairwise mean comparisons:
𝐻0 : 𝜇𝑖 = 𝜇𝑗 versus 𝐻1 : 𝜇𝑖 ≠ 𝜇𝑗 for all 𝑖 ≠ 𝑗.
The Fisher Least Significant Difference (LSD) method declares two means
significantly different if the absolute value of the difference between their sample means exceeds
$$LSD = t_{\alpha/2}(n - t) \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
Equivalently, we could construct a set of 100(1 − α) percent confidence intervals for
all pairs of means as follows:
$$\bar{x}_i - \bar{x}_j \pm t_{\alpha/2}(n - t) \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
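The LSD comparison from section 5.6.2 can be sketched as a small helper. The function names and the numbers used here are purely illustrative assumptions: three groups of five observations, MSE = 7.15 with 12 error degrees of freedom, and the table value t_{0.025}(12) = 2.179.

```python
from math import sqrt

def lsd_threshold(t_crit, mse, n_i, n_j):
    """Least significant difference for comparing treatments i and j."""
    return t_crit * sqrt(mse * (1 / n_i + 1 / n_j))

def means_differ(mean_i, mean_j, threshold):
    """Declare a difference when |mean_i - mean_j| exceeds the threshold."""
    return abs(mean_i - mean_j) > threshold

# Illustrative inputs (assumed): t_{0.025}(12) = 2.179, MSE = 7.15, n_i = n_j = 5.
threshold = lsd_threshold(t_crit=2.179, mse=7.15, n_i=5, n_j=5)
print(f"LSD = {threshold:.3f}")
print(means_differ(13.3, 11.7, threshold))   # sample means 13.3 and 11.7
```

In practice the pairwise comparisons are made only after the ANOVA F test has rejected the hypothesis of equal treatment means.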
CHAPTER 6: SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS
6.1 Introduction
Regression analysis and correlation analysis are two statistical techniques that
deal with examining the existence of a relationship between two or more variables and
measuring the strength of this relationship. The relationship between any pair of
variables (𝑥, 𝑦) can also be examined graphically by producing a scatter plot of their
data values.
6.2 Simple Linear Regression Analysis
Simple linear regression analysis (SLRA) is the statistical approach for modelling the
relationship between the dependent (response) variable 𝑌 and one independent
(explanatory) variable 𝑋.
The simple linear regression model is mathematically expressed as follows:
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀
Where 𝑦 is the dependent variable, 𝛽0 and 𝛽1 are the regression parameters called
the 𝑦-intercept and slope of the regression line, respectively, 𝑥 is called the
independent variable and 𝜀 (epsilon) is the random error term.
The above model is said to be simple, linear in the parameters, and linear in the
independent variable. The reason behind this is that it is simple because there is only
one independent variable, linear in the parameters because no parameter appears as
an exponent or is multiplied or divided by another parameter, and linear in the
independent variable because this variable appears only in the first power.
6.2.1 The basic assumptions on the random error term
• The random error term has a mean value of zero, i.e. E(ε) = 0.
• The random error term has a constant variance σ², i.e. Var(ε) = σ².
• The random error term follows a normal distribution.
We note that, since the random error term follows a normal distribution with mean zero
and constant variance σ², the dependent variable y follows a normal
distribution with mean β0 + β1x and variance σ²; that is, E(y) = β0 + β1x and Var(y) =
σ². The reason for this is that a linear function of a normally distributed variable also
follows a normal distribution.
The fitted regression equation obtained using the least squares estimates is given by:
𝑦̂ = 𝛽̂0 + 𝛽̂1 𝑥
This equation is used to predict the value of the response variable given the value of
the explanatory variable.
6.2.2 Interpretation of the Regression Parameters
β1 is the regression slope, which indicates the amount of change in the mean of the
probability distribution of y (or, more simply, in the value of y) per unit change in the independent
variable x. β0 is the y-intercept, which does not have any particular meaning as a
separate term in the model.
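6.2.3 Formulas for the Least Squares Estimates
Slope:
$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$$
y-intercept:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where
$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$
$$SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$$
$$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$
and n = sample size.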
Example 6.1: Consider the information of five observations about the sale revenue
(𝑦) and advertising expenditure (𝑥) in the table below:
Sales Revenue (R, 000)    Advertising Expenditure (R, 00)
1                         1
1                         2
2                         3
2                         4
4                         5
a) Draw a scatter plot of the data above. Comment on the relationship between the variables.
b) Estimate the regression parameters and fit the least squares equation.
c) Interpret the regression slope in terms of the sales revenue and advertising
expenditure.
d) Estimate the value of y and calculate the mean of the estimated values.
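The least squares formulas of section 6.2.3 can be sketched in Python with the five observations of Example 6.1. The sketch assumes each table row is one (sales revenue, advertising expenditure) pair; variable names are illustrative.

```python
# Minimal sketch of the least squares estimates for simple linear regression.
x = [1, 2, 3, 4, 5]   # advertising expenditure
y = [1, 1, 2, 2, 4]   # sales revenue

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Computing formulas for the sums of squares and cross products.
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2

slope = ss_xy / ss_xx                 # estimate of beta_1
intercept = y_bar - slope * x_bar     # estimate of beta_0

# Fitted values from the least squares equation.
fitted = [intercept + slope * xi for xi in x]
print(f"y-hat = {intercept:.2f} + {slope:.2f} x")
print("mean of fitted values:", round(sum(fitted) / n, 4))
```

The mean of the fitted values equals the mean of the observed y values, which is a convenient check on the arithmetic.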
Example 6.2: Suppose that you are a statistician working for a certain company selling
used cars in Polokwane. Consider the summary data below based on the age (x) and
price (y) of eleven cars: Σx = 58, Σy = 975, Σxy = 4732, Σx² = 326, Σy² = 96129.
a) Estimate the regression parameters for the eleven cars.
b) If you were to draw a scatter plot of the data, what kind of relationship would you
expect? Explain your answer.
c) Interpret the regression slope in terms of the age and price of the cars.
d) Fit the least squares equation.
e) Predict the prices of a 3-year-old and a 4-year-old used car. Comment on the
predicted prices.
6.3 Inference in regression analysis
In this section, we shall discuss inferences concerning the regression slope 𝛽1,
considering both confidence estimation and hypothesis testing of 𝛽1.
Hypothesis testing for 𝜷𝟏
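We use the Student t distribution to perform hypothesis tests on the regression
slope β1. The hypotheses take one of the following forms:
Two-tailed test    One-tailed (left-tailed) test    One-tailed (right-tailed) test
H0: β1 = b1        H0: β1 ≥ b1                      H0: β1 ≤ b1
H1: β1 ≠ b1        H1: β1 < b1                      H1: β1 > b1
where b1 is the hypothesized value for β1. The test statistic is given by
$$t = \frac{\hat{\beta}_1 - b_1}{S_{\hat{\beta}_1}}$$
with n − 2 degrees of freedom, and
$$S_{\hat{\beta}_1} = \frac{s}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}}$$
where s = \sqrt{SSE/(n - 2)} is the standard error of the estimate and
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.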
A second, equivalent method for testing the existence of a linear relationship between
variables is to set up a confidence interval estimate of β1 and determine whether the
hypothesised value (β1 = b1) is included in the interval. The confidence interval
estimate of β1 is obtained by using the following formula:
$$\hat{\beta}_1 \pm t_{\alpha/2}(n - 2)\, S_{\hat{\beta}_1}$$
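The slope test and confidence interval can be sketched in Python with the Example 6.1 data. The sketch takes the standard error of the slope in its usual textbook form, s/√SSxx with s = √(SSE/(n − 2)); this is stated as an assumption, as are the variable names, the hypothesised value b1 = 0 and the use of the table value t_{0.025}(3) = 3.182 for a 95% interval.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # advertising expenditure (Example 6.1)
y = [1, 1, 2, 2, 4]   # sales revenue (Example 6.1)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
ss_xx = sum(a * a for a in x) - n * x_bar ** 2
ss_yy = sum(b * b for b in y) - n * y_bar ** 2

slope = ss_xy / ss_xx
sse = ss_yy - slope * ss_xy          # error sum of squares of the fitted line
s = sqrt(sse / (n - 2))              # standard error of the estimate
se_slope = s / sqrt(ss_xx)           # standard error of the slope

b1 = 0                               # hypothesised slope (assumed)
t_stat = (slope - b1) / se_slope

t_crit = 3.182                       # t table value, alpha/2 = 0.025, df = 3
ci = (slope - t_crit * se_slope, slope + t_crit * se_slope)
print(f"t = {t_stat:.3f}, 95% CI for slope = ({ci[0]:.3f}, {ci[1]:.3f})")
```

If the hypothesised value lies outside the interval, the two-tailed test at the same α rejects H0, which is the equivalence described above.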
6.4 Correlation Analysis
The reliability of the estimate of the response variable (𝑦) depends on the strength of
the relationship between the independent variable (𝑥) and the dependent variable (𝑦).
A strong relationship implies a more accurate and reliable estimate of the response
variable.
Definition 6.1: The Pearson coefficient of correlation is a measure of the strength of
the relationship between two variables, 𝑥 and 𝑦.
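The following expression is used to calculate the sample Pearson correlation
coefficient:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}} = \hat{\beta}_1 \sqrt{\frac{SS_{xx}}{SS_{yy}}}$$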
We note that SSxy appears in the numerator of the expressions for both the
correlation coefficient and the regression slope. Therefore, SSxy, β̂1 and r will always
have the same sign (positive or negative).
The Pearson correlation coefficient is a number between −1 and 1 inclusive, which
measures the degree to which the two variables are linearly related.
6.4.1 The strength of correlations can be interpreted as follows
r value (±)     Correlation    Relationship
0.00 to 0.09    Very low       Very weak
0.10 to 0.29    Low            Weak
0.30 to 0.49    Medium         Moderate
0.50 to 0.89    High           Strong
0.90 to 1.00    Very high      Perfect
6.4.2 Scatter diagrams which illustrate relationships between points are given
below:
6.5 Inference about the correlation
Testing for the existence of a linear relationship between two variables is the same as
determining whether there is any significant correlation between them. The population
correlation coefficient 𝜌 is hypothesized as equal to zero. Thus the null and alternative
hypotheses would be
𝐻0 : 𝜌 = 0 versus 𝐻1 : 𝜌 ≠ 0
The test statistic for determining the existence of correlation is given by
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
which follows a t distribution with n − 2 degrees of freedom.
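A short Python sketch of the correlation coefficient and its t statistic follows, again using the Example 6.1 observations as illustrative data. The variable names are assumptions made for the sketch.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # advertising expenditure (Example 6.1)
y = [1, 1, 2, 2, 4]   # sales revenue (Example 6.1)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ss_xx = sum((a - x_bar) ** 2 for a in x)
ss_yy = sum((b - y_bar) ** 2 for b in y)

# Sample Pearson correlation coefficient.
r = ss_xy / sqrt(ss_xx * ss_yy)

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)
print(f"r = {r:.4f}, t = {t_stat:.3f} with {n - 2} degrees of freedom")
```

For these data the statistic coincides with the t statistic for testing a zero regression slope, since both tests address the same linear relationship.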
6.6 The Coefficient of Determination
Another way of measuring the utility of the regression model is to quantify the
contribution of the independent variable (x) in predicting the values of the
dependent variable (y). In order to do this, we measure how much the error of
predicting the value of y is reduced by using the information provided by the
independent variable.
Definition 6.2: The coefficient of determination measures the proportion of variation
in the dependent variable that is explained by the independent variable.
The following formula is used to compute the value of the coefficient of determination:
$$R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$$
Note that SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
In simple linear regression, the coefficient of determination can also be computed as
the square of the correlation coefficient:
$$r^2 = \left(\frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}}\right)^2 = \hat{\beta}_1^2 \, \frac{SS_{xx}}{SS_{yy}}$$
Interpretation: about 100(r²)% of the sample variation in the dependent variable y
can be explained by (or attributed to) using the independent variable x to predict the
value of y in the regression equation.
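The two routes to the coefficient of determination can be compared numerically. The sketch below uses the Example 6.1 observations as illustrative data and assumed variable names; it computes R² from the residuals and checks it against the squared correlation coefficient.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # advertising expenditure (Example 6.1)
y = [1, 1, 2, 2, 4]   # sales revenue (Example 6.1)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ss_xx = sum((a - x_bar) ** 2 for a in x)
ss_yy = sum((b - y_bar) ** 2 for b in y)

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

# SSE: squared differences between observed and fitted values.
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

r_squared = 1 - sse / ss_yy          # definition via SSE
r = ss_xy / sqrt(ss_xx * ss_yy)      # Pearson correlation coefficient
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```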
CHAPTER 7: INDEX NUMBERS
7.1 Introduction
An index is a summary value which reflects how business or economic activity has
changed over time. The consumer price index is the most commonly understood
economic index. Index numbers are used to measure either price or quantity changes
over time. They play a vital role in the monitoring of business performance as well as
in the preparation of business forecasts.
Definition 7.1: An index number is a summary measure of the overall change in the level of
activity of a single item or a basket of related items from one time period to another.
Index numbers are most commonly used to monitor price and quantity changes over
time. They can also monitor changes in business performance levels and are therefore
a useful planning and control tool in business. The best-known and most widely used index
number in any country is the consumer price index (CPI) or the inflation indicator.
An index number is constructed by dividing the value of an item in the current period
by its value in the base period, expressed as a percentage:
\text{Index number} = \frac{\text{current period value}}{\text{base period value}} \times 100
We note that an index number above 100 indicates an increase in the level of
activity being monitored, while an index number below 100 reflects a
decrease in activity relative to the base period.
There are two major categories of index numbers. Within each of the two categories,
an index value can be computed for either a single item or basket of correlated items.
 Price indexes
   Simple price index (price relative)
   Composite price index
 Quantity indexes
   Simple quantity index (quantity relative)
   Composite quantity index
The following notations are used in the construction of price and quantity index
numbers:
𝑃0 = base period price
𝑞0 = base period quantity
𝑃1 = Current period price
𝑞1 = Current period quantity
7.2 Price indexes
A price index measures the percentage change in price between any two time periods,
either for a single item or a basket of related items.
7.2.1 Simple price index (Price relative)
The simple price index is the change in price of a single item from a base period to
another time period. It is sometimes called the price relative. The price relative is
computed as:
\text{Price relative} = \frac{P_1}{P_0} \times 100
Note that the price relative is multiplied by 100 in order to express it in terms of
percentages.
Example 7.1: Consider the information about the prices of 95-Unleaded fuel in
Polokwane for each year from 2014 to 2017.
Year    Price/litre
2014    R10.22
2015    R11.78
2016    R12.28
2017    R13.49
Using 2014 as the base period, compute and interpret the price relatives for
95-Unleaded fuel in Polokwane for these years:
a) 2015
b) 2016
c) 2017
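One way to check hand calculations for Example 7.1 is a short Python sketch that applies the price relative formula to the fuel prices in the table. The function name `price_relative` and the dictionary layout are illustrative choices, not part of the guide.

```python
def price_relative(current_price, base_price):
    """Price relative: the current price as a percentage of the base price."""
    return current_price / base_price * 100

# Price per litre of 95-Unleaded fuel in Polokwane (Example 7.1)
fuel_prices = {2014: 10.22, 2015: 11.78, 2016: 12.28, 2017: 13.49}
base_year = 2014

relatives = {
    year: price_relative(price, fuel_prices[base_year])
    for year, price in fuel_prices.items()
}

for year, value in relatives.items():
    print(f"{year}: {value:.1f}")
```

The 2015 price relative of about 115.3 means that the price per litre was roughly 15.3% higher in 2015 than in the 2014 base year; the base year itself always has a price relative of 100.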
7.2.3 Composite price index
A composite price index measures the average price change for a basket of related
items (activities) from one time period (the base period) to another (the current
period).
There are two techniques that can be used to compute the composite price index once
the weighting method (Laspeyres or Paasche) has been chosen. The two techniques yield
the same index value; however, the reasoning behind the calculations differs. The two
computational techniques are:
 The method of weighted aggregates and
 The method of weighted average of price relatives.
Since the two methods produce the same index value, at this level we will only look at
the method of weighted aggregates. Its formula is easier to use than that of the
weighted average of price relatives.
7.2.3.1 Weighted Aggregates Method – using the Laspeyres Weighting
Approach
The construction of the Laspeyres composite price index using the weighted
aggregates method follows the three steps listed below:
 Step 1: Compute the base period value for the basket of items:
\text{Base period value} = \sum (P_0 \times q_0)
The base period value is what the basket of items would have cost in the base
period.
 Step 2: Compute the current period value for the basket of items:
\text{Current period value} = \sum (P_1 \times q_0)
The current period value is the cost of the basket of items in the current period,
paying current prices but consuming base period quantities.
 Step 3: Calculate the composite price index:
\text{Laspeyres price index} = \frac{\sum (P_1 \times q_0)}{\sum (P_0 \times q_0)} \times 100\%
7.2.3.2 Weighted Aggregates Method – using the Paasche Weighting Approach
The construction of the Paasche composite price index uses the current period
quantities to weight the basket. The same three steps in calculating the weighted
aggregates composite price index are followed:
 Step 1: Compute the base period value for the basket of items:
\text{Base period value} = \sum (P_0 \times q_1)
The base period value is what the basket of items would have cost in the base
period, but consuming current period quantities.
 Step 2: Compute the current period value for the basket of items:
\text{Current period value} = \sum (P_1 \times q_1)
The current period value is the current cost of the basket of items, based
on current prices and current consumption.
 Step 3: Calculate the composite price index:
\text{Paasche price index} = \frac{\sum (P_1 \times q_1)}{\sum (P_0 \times q_1)} \times 100\%
Example 7.2: The data in Table 7.1 shows the usage of a basket of three toiletry
items in two-person households in Giyani for 2016 and 2017 respectively.
Table 7.1: Annual Household Consumption of a Basket of Toiletries (2016-2017)
                  Base year (2016)          Current year (2017)
Toiletry items    Unit price    Quantity    Unit price    Quantity
Soap              R5.95         38          R6.10         41
Deodorant         R18.65        25          R19.95        19
Toothpaste        R8.29         15          R8.74         17
a) Calculate the price relative for soap. Interpret your answer.
b) Calculate the price relative for toothpaste in Giyani. What is the meaning of this
value?
c) Compute the Laspeyres weighted aggregate composite price index for the basket
of toiletries. Interpret the value.
d) Compute the Paasche weighted aggregate composite price index for the basket
of toiletries. Interpret the answer.
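The weighted aggregates steps for both weighting approaches can be sketched in Python using the Table 7.1 data. The function names below are illustrative assumptions; the only difference between the two functions is which period's quantities are used as weights.

```python
def laspeyres_price_index(p0, p1, q0):
    """Weighted aggregates price index using base period quantities as weights."""
    base_value = sum(p * q for p, q in zip(p0, q0))     # sum of P0 x q0
    current_value = sum(p * q for p, q in zip(p1, q0))  # sum of P1 x q0
    return current_value / base_value * 100

def paasche_price_index(p0, p1, q1):
    """Weighted aggregates price index using current period quantities as weights."""
    base_value = sum(p * q for p, q in zip(p0, q1))     # sum of P0 x q1
    current_value = sum(p * q for p, q in zip(p1, q1))  # sum of P1 x q1
    return current_value / base_value * 100

# Table 7.1, item order: soap, deodorant, toothpaste
p0 = [5.95, 18.65, 8.29]   # 2016 unit prices (base year)
q0 = [38, 25, 15]          # 2016 quantities
p1 = [6.10, 19.95, 8.74]   # 2017 unit prices (current year)
q1 = [41, 19, 17]          # 2017 quantities

laspeyres = laspeyres_price_index(p0, p1, q0)
paasche = paasche_price_index(p0, p1, q1)
print(f"Laspeyres price index: {laspeyres:.2f}")
print(f"Paasche price index:   {paasche:.2f}")
```

With these figures the Laspeyres index is about 105.5 and the Paasche index about 105.2, so under either weighting the basket of toiletries cost roughly 5% more in 2017 than in 2016.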
Example 7.3: A printing company that specialises in business stationery has recorded
its usage and cost of printer cartridges for its four different printers.
Printer      2015                    2016                    2017
cartridge    Unit price  Quantity    Unit price  Quantity    Unit price  Quantity
HQ21         145         24          155         28          149         36
HQ25         172         37          165         39          160         44
HQ26         236         12          255         12          262         14
HQ32         314         10          306         8           299         11
a) Using 2015 as the base year period, calculate the price relatives of the HQ26 and
HQ32 printer cartridges for 2016 only. Interpret the meaning of these two price
relatives.
b) Using 2016 as the base year period, calculate the price relatives of the HQ21 and
HQ25 printer cartridges for 2017 only. Interpret the meaning of these two price
relatives.
c) Calculate the composite price indexes for 2016 and 2017, with 2015 as the
base period, using each of the following methods:
(i) The Laspeyres weighted aggregate method
(ii) The Paasche weighted aggregate method
d) Interpret all the values obtained in c) above.
7.3 Quantity Indexes
A quantity index measures the percentage change in consumption level, either for a
single item or a basket of items, from one time period to another.
7.3.1 Simple Quantity index (Quantity Relative)
For a single item, the change in units consumed from a base period to another time
period is found by calculating its quantity relative. The quantity relative is
mathematically expressed as follows:
\text{Quantity relative} = \frac{q_1}{q_0} \times 100\%
This relative quantity change is multiplied by 100 to express it in percentage terms.
Example 7.4: In 2014, Baloyi TS hardware store sold 145 window frames. In 2015,
window frame sales were only 125 units, while in 2016, sales of window frames rose to
175 units. Find the quantity relatives of window frames for 2015 and 2016, using 2014
as the base period. What is the meaning of the computed values?
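A minimal Python sketch of the quantity relative calculation for the window frame sales in Example 7.4 is given below. The function name `quantity_relative` is an illustrative choice, not something defined in the guide.

```python
def quantity_relative(current_quantity, base_quantity):
    """Quantity relative: current quantity as a percentage of the base quantity."""
    return current_quantity / base_quantity * 100

# Window frames sold per year (Example 7.4); 2014 is the base period
frames_sold = {2014: 145, 2015: 125, 2016: 175}
base_quantity = frames_sold[2014]

quantity_relatives = {
    year: quantity_relative(units, base_quantity)
    for year, units in frames_sold.items()
}

for year, value in quantity_relatives.items():
    print(f"{year}: {value:.1f}")
```

A value of about 86.2 for 2015 indicates that roughly 13.8% fewer window frames were sold than in 2014, while about 120.7 for 2016 indicates an increase of roughly 20.7% over 2014.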
7.3.2 Composite quantity index
A composite quantity index measures the average consumption (quantity) change for
a basket of related items from one time period (the base period) to another time period
(the current period).
7.3.2.1 Weighted Aggregates Method – Composite Quantity Index
This method compares the aggregate value of the basket of related items between the
current period and the base period. The composite quantity index will reflect the overall
consumption changes while holding prices constant at either the base period
(Laspeyres approach) or current period (Paasche approach).
The Laspeyres approach holds prices constant at their base period levels:
\text{Laspeyres quantity index} = \frac{\sum (P_0 \times q_1)}{\sum (P_0 \times q_0)} \times 100\%
The Paasche approach holds prices constant at their current period levels:
\text{Paasche quantity index} = \frac{\sum (P_1 \times q_1)}{\sum (P_1 \times q_0)} \times 100\%
Example 7.5: The data in the table below refers to a basket of three carpentry items
used by Ngoveni woodwork company in the manufacture of cupboards for 2015 and
2016, respectively.
                   Base year (2015)          Current year (2016)
Carpentry items    Unit price    Quantity    Unit price    Quantity
Cold glue (1 l)    R14           45          R17           55
Boards (m²)        R65           125         R80           115
Paint (5 l)        R125          20          R130          25
a) Using 2015 as the base year period, calculate the quantity relatives of the cold
glue and paint. Interpret the meaning of these two quantity relatives.
b) Using the Laspeyres weighted aggregates method, construct the composite
quantity index for the average change of carpentry materials used between 2015
(as base period) and 2016 (as current period). Interpret the value.
c) Using the Paasche weighted aggregates method, construct the composite
quantity index for the average change of carpentry materials used between 2015
(as base period) and 2016 (as current period). Interpret the value.
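The composite quantity index formulas can be sketched in Python with the carpentry data from Example 7.5. The function names are illustrative assumptions. Note that the quantities change between numerator and denominator while the prices are held fixed, which is the reverse of the price index calculation.

```python
def laspeyres_quantity_index(p0, q0, q1):
    """Composite quantity index with prices held at base period levels."""
    return sum(p * q for p, q in zip(p0, q1)) / sum(p * q for p, q in zip(p0, q0)) * 100

def paasche_quantity_index(p1, q0, q1):
    """Composite quantity index with prices held at current period levels."""
    return sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p1, q0)) * 100

# Example 7.5, item order: cold glue, boards, paint
p0 = [14, 65, 125]   # 2015 unit prices (base year)
q0 = [45, 125, 20]   # 2015 quantities
p1 = [17, 80, 130]   # 2016 unit prices (current year)
q1 = [55, 115, 25]   # 2016 quantities

laspeyres_q = laspeyres_quantity_index(p0, q0, q1)
paasche_q = paasche_quantity_index(p1, q0, q1)
print(f"Laspeyres quantity index: {laspeyres_q:.2f}")
print(f"Paasche quantity index:   {paasche_q:.2f}")
```

Both indexes come out slightly above 100 (about 101.0 and 100.1), suggesting that overall usage of carpentry materials rose only marginally between 2015 and 2016, even though cold glue and paint usage rose markedly while board usage fell.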
Example 7.6: A printing company that specialises in business stationery has recorded
its usage and cost of printer cartridges for its four different printers.
Printer      2015                    2016                    2017
cartridge    Unit price  Quantity    Unit price  Quantity    Unit price  Quantity
HQ21         145         24          155         28          149         36
HQ25         172         37          165         39          160         44
HQ26         236         12          255         12          262         14
HQ32         314         10          306         8           299         11
a) Using 2015 as the base year period, calculate the quantity relatives of the HQ26
and HQ32 printer cartridges for 2016 only. Interpret the meaning of these two quantity
relatives.
b) Using 2016 as the base year period, calculate the quantity relatives of the HQ26
and HQ32 printer cartridges for 2017 only. Interpret the meaning of these two quantity
relatives.
c) Calculate the composite quantity indexes for 2016 and 2017, with 2015 as the
base period, using each of the following methods:
(i) The Laspeyres weighted aggregate method
(ii) The Paasche weighted aggregate method
d) Interpret all the values obtained in c) above.
CHAPTER 8: TIME SERIES ANALYSIS
8.1 Introduction
Most of the data used in statistical analysis is cross-sectional data, meaning that
it is gathered from a sample survey at one point in time. However, data can also be
collected over time. For example, when a company records its daily, weekly or monthly
turnover, or when a household records its daily or monthly electricity usage, they
are compiling a time series.
Definition 8.1: A time series is a set of numeric data of a random variable that is
gathered over time at regular intervals and arranged in time order.
The purpose of time series analysis is to identify any recurring patterns in a time series,
quantify these patterns through building a statistical model and then use the statistical
model to prepare forecasts to estimate future values of the time series.
8.2 Components of a Time Series
Time series analysis assumes that the data values of a time series variable are
determined by four underlying environmental forces that operate both individually and
collectively over time. The four underlying environmental forces are:
 Trend (T) – is defined as a long-term smooth underlying movement in a time
series. It measures the effect that long-term factors have on the time series.
 Cycles (C) – are the medium to long-term deviations from the trend. They reflect
alternating periods of relative expansion and contraction of economic activity.
 Seasonality (S) – seasonal variations are fluctuations in a time series that are
repeated at regular intervals within a year (daily, weekly, monthly).
 Irregular (random) influences (I) – irregular fluctuations in a time series are
attributed to unpredictable events, such as natural disasters (floods) or man-made
disasters (strikes).
Time series analysis attempts to isolate each of these components and quantify them
statistically. The process of doing this is called decomposition of the time series.
Once these components are identified and quantified, they are combined and used to
estimate the future values of the time series variable.
8.3 Decomposition of a Time Series
Time series analysis aims to isolate the influence of each of the four components on
the actual time series. The time series model used as the basis for analysing the
influence of these four components assumes a multiplicative relationship between
them. The multiplicative time series model is mathematically expressed as follows:
Actual y = trend × cyclical × seasonal × irregular
= T × C × S × I
Trend and seasonal components account for the most significant proportion of an
actual value in a time series. By isolating them, most of the actual time series values
will be explained. Therefore, we will examine the statistical approaches to quantify
trend and seasonal variation only.
8.4 Trend Analysis
The long-term trend in a time series can be isolated by removing the medium-term
and short-term fluctuations (cycles, seasonal and irregular) in the series. This will
result in either a smooth curve or a straight line, depending on the technique chosen.
The two techniques which can be applied for trend isolation are:
 Moving average – is the technique which produces a smooth curve.
 Regression analysis – is the technique which produces a straight-line trend.
8.4.1 The Moving Average Technique
A moving average removes the short-term fluctuations in the time series by taking
successive averages of groups of observations. Each time period’s actual value is
replaced by the average of the observations from the time periods that surround it. This
results in a smoothed time series. Thus the moving average technique smooths a
time series by removing short-term fluctuations. The four steps of calculating a
k-period moving average, when k is odd, are as follows:
 Step 1: Sum the first k periods’ observations and position the total opposite the
middle time period.
 Step 2: Repeat the summing of k periods’ observations by removing the first
period’s observation and including the next period’s observation.
 Step 3: Continue producing these moving totals until the end of the time series
is reached. The process of positioning each moving total opposite the middle time
period of each sum of k observations is called centring.
 Step 4: Calculate the moving average series by dividing each moving total by
k.
Example 8.1: The table below shows the number of fire insurance claims received by
an insurance company in each four-month period from 2014 to 2017.
Year      2014            2015            2016            2017
Period    P1   P2   P3    P1   P2   P3    P1   P2   P3    P1   P2   P3
Claims    3    5    9     7    9    12    4    10   13    9    10   7
Calculate the three-period, five-period and seven-period moving average for the
number of insurance claims received.
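The four steps for an odd-term moving average can be sketched in Python using the claims data from Example 8.1. This is one possible arrangement, not the guide's prescribed layout: the function name `moving_average` is an assumption, and periods without enough surrounding observations are marked with `None` so that every average stays aligned with its middle time period.

```python
def moving_average(series, k):
    """k-period moving average for odd k, centred on the middle period.

    Positions that lack enough neighbours on either side are left as None.
    """
    if k % 2 == 0:
        raise ValueError("this sketch only handles an odd number of periods")
    half = k // 2
    result = [None] * len(series)
    for mid in range(half, len(series) - half):
        window = series[mid - half: mid + half + 1]  # k observations around mid
        result[mid] = sum(window) / k                # moving total divided by k
    return result

# Fire insurance claims per four-month period, 2014 to 2017 (Example 8.1)
claims = [3, 5, 9, 7, 9, 12, 4, 10, 13, 9, 10, 7]

ma3 = moving_average(claims, 3)
ma5 = moving_average(claims, 5)
ma7 = moving_average(claims, 7)

for name, values in (("3-period", ma3), ("5-period", ma5), ("7-period", ma7)):
    print(name, [None if v is None else round(v, 2) for v in values])
```

The longer the term, the smoother the resulting series, but the more values are lost at each end: the 3-period series has 10 values, the 5-period series 8 and the 7-period series only 6.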
8.4.1.1 Centring an Uncentred Moving average
A moving average value must always be centred on the middle time period. When the
number of periods averaged is odd, centring occurs directly when the moving
average value is positioned at the middle time period of the k observations. However,
when the moving average is calculated for an even number of time periods, the moving
total will be uncentred, because the middle of the k observations falls between two
time periods. The three steps of centring an uncentred moving average are:
 Step 1: Calculate the uncentred moving totals.
 Step 2: Centre the uncentred moving totals – calculate a second moving total series
consisting of the sums of pairs of consecutive uncentred moving totals. Each second
moving total value is centred between the two uncentred moving total values. This
positions these second moving totals on an actual time period.
 Step 3: Calculate the centred moving averages – a centred moving average is
calculated by dividing each centred moving total by 2 × k.
Example 8.2: A cycle shop recorded the quarterly sales of racing bicycles for the
period of 2014 to 2016 as shown in the table below.
Year         2014                2015                2016
Period       Q1   Q2   Q3   Q4   Q1   Q2   Q3   Q4   Q1   Q2   Q3   Q4
Sales (y)    17   13   15   19   17   19   22   14   20   23   19   20
Produce a four-period centred moving average for the quarterly sales of racing
bicycles sold by the cycle shop during the period 2014 to 2016.
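The centring procedure for an even number of periods can be sketched in Python with the bicycle sales from Example 8.2. The function name `centred_moving_average` and the use of `None` for periods that have no centred value are illustrative assumptions.

```python
def centred_moving_average(series, k):
    """Centred k-period moving average for even k.

    Pairs of consecutive uncentred moving totals are summed and divided by
    2 * k. Periods with no centred value are left as None.
    """
    if k % 2 != 0:
        raise ValueError("this sketch only handles an even number of periods")
    # Uncentred moving totals of k observations each
    totals = [sum(series[i:i + k]) for i in range(len(series) - k + 1)]
    result = [None] * len(series)
    for i in range(len(totals) - 1):
        centred_total = totals[i] + totals[i + 1]      # second moving total
        result[i + k // 2] = centred_total / (2 * k)   # falls on an actual period
    return result

# Quarterly racing bicycle sales, Q1 2014 to Q4 2016 (Example 8.2)
sales = [17, 13, 15, 19, 17, 19, 22, 14, 20, 23, 19, 20]

cma4 = centred_moving_average(sales, 4)
print(cma4)
```

The first centred value (16.0) sits opposite Q3 of 2014 and the last (19.75) opposite Q2 of 2016; the first two and last two quarters have no centred moving average.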
Example 8.3: The table below shows the number of quarterly orders received by a
security company in each quarterly period from 2014 to 2017.
Quarter    2014    2015    2016    2017
Q1         20      20      23      23
Q2         16      22      26      27
Q3         18      25      22      20
Q4         21      17      23      22
Calculate a four-period centred moving average for the quarterly orders received by the
security company from 2014 to 2017.
8.4.2 Regression Analysis Technique
A trend line isolates the trend (T) component of the time series only. It shows the
general direction (upward, downward or constant) in which the series is moving. It is
therefore best represented by a simple linear regression line. The method of least
squares from regression analysis (Chapter 6) is used to estimate the regression
parameters and so find the trend line of best fit to a time series of numeric data. The
dependent variable, y, is the actual time series and the independent variable, x, is
time. To use time as an independent variable in regression analysis, it must be
numerically coded. Any sequential numbering system can be used; however, in this
chapter we will use the set of natural numbers (x = 1, 2, …, n), where n is the number
of time periods in the time series.
Example 8.4: The number of houses sold quarterly by Valley Estates in the Cape
peninsula is recorded for the 16 quarters from 2014 to 2017, as shown in the table
below.
Quarter    2014    2015    2016    2017
Q1         20      20      23      23
Q2         16      22      26      27
Q3         18      25      22      20
Q4         21      17      23      22
a) Use the least squares method to estimate the regression parameters.
b) Construct the trend line for the quarterly house sales data for Valley Estates.
c) Use the regression trend line to estimate the level of house sales for the first and
third quarters of 2018.
d) Interpret the meaning of the values obtained in c) above.
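A least squares trend line for the Example 8.4 house sales can be sketched in Python as follows. Time is coded x = 1, 2, …, 16 as described in this section, so the first and third quarters of 2018 correspond to x = 17 and x = 19. The function names `fit_trend_line` and `trend` are illustrative assumptions.

```python
def fit_trend_line(y):
    """Least squares intercept and slope with time coded as x = 1, 2, ..., n."""
    n = len(y)
    x = list(range(1, n + 1))
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx       # slope: average change in y per time period
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

# Quarterly house sales read column by column from the table:
# Q1-Q4 of 2014, then 2015, 2016 and 2017 (Example 8.4)
house_sales = [20, 16, 18, 21, 20, 22, 25, 17, 23, 26, 22, 23, 23, 27, 20, 22]

b0, b1 = fit_trend_line(house_sales)

def trend(x_code):
    """Trend estimate for the time period coded x_code."""
    return b0 + b1 * x_code

q1_2018 = trend(17)
q3_2018 = trend(19)
print(f"Trend line: y = {b0:.3f} + {b1:.4f}x")
print(f"Q1 2018: {q1_2018:.2f}, Q3 2018: {q3_2018:.2f}")
```

With these data the sketch gives a trend line of roughly ŷ = 18.7 + 0.337x, which suggests trend estimates of about 24.4 houses for the first quarter of 2018 and about 25.1 houses for the third quarter. These are trend values only; they make no allowance for seasonal influences.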
Example 8.5: Consider the dataset of the quarterly sales of racing bicycles for the
period 2014 to 2016 as shown in Example 8.2.
a) Construct the trend line for the quarterly sales of racing bicycles.
b) Interpret the meaning of the magnitude of the regression slope.
c) Estimate the sales of racing bicycles for the second and third quarters of 2018.
8.5 Seasonal Analysis
Seasonal analysis isolates the influence of seasonal forces on a time series. The ratio
to moving average method is utilized to measure and quantify these seasonal
influences. This method expresses the seasonal influence as an index number. It
measures the percentage deviation of the actual values of the time series, y, from a
base value that excludes the short-term seasonal influences. These base values of a
time series represent the trend/cyclical influences only.
8.5.1 Ratio to Moving Average Technique
 Step 1: Identify the trend/cyclical movement – The moving average approach,
as described earlier, isolates the combined trend/cyclical components in a time
series. The choice of an appropriate moving average term, k, is determined by
the number of periods that span the short-term seasonal fluctuations. In
most instances, the term k corresponds to the number of observations that
span a one-year period. The table below shows the appropriate term to use to
remove short-term seasonal fluctuations that occur annually in time series
data.
Time interval    Appropriate term (k)
Weekly           52-period term
Monthly          12-period term
Bi-monthly       6-period term
Quarterly        4-period term
Four-monthly     3-period term
Half-yearly      2-period term
 Step 2: Find the seasonal ratios – A seasonal ratio for each period is found by
dividing each actual time series value, y, by its corresponding moving average
value.
\text{Seasonal ratio} = \frac{\text{Actual } y}{\text{Moving average } y} \times 100
A seasonal ratio is an index that measures the percentage deviation of each actual
y from its moving average value.
 Step 3: Produce the median seasonal indexes – Average the seasonal ratios
across the corresponding periods within years, using the median as the average, to
smooth out the irregular component inherent in the seasonal ratios.
 Step 4: Calculate the adjusted seasonal indexes – Each seasonal index has a
base index of 100. Therefore, the sum of the k median seasonal indexes must
equal 100 × k. If this is not the case, each median seasonal index must be
adjusted to a base of 100 by multiplying it by the adjustment factor, which is
determined as follows:
\text{Adjustment factor} = \frac{k \times 100}{\sum (\text{median seasonal indexes})}
Example 8.6: Refer to the quarterly house sales by Valley Estates in the Cape
Peninsula from 2014 to 2017 (Example 8.4). Calculate the quarterly seasonal indexes
for the house sales dataset.
Example 8.7: Refer to the four-monthly number of fire insurance claims received by
an insurance company from 2014 to 2017 (Example 8.1). Calculate the
four-monthly seasonal indexes for the fire insurance claims dataset.
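The ratio-to-moving-average steps can be sketched in Python for quarterly data (k = 4), using the Valley Estates house sales. The sketch assumes the series starts in the first quarter, so that position i belongs to season i mod k, and it uses the median from Python's standard `statistics` module as the average in Step 3. The function names are illustrative choices.

```python
from statistics import median

def centred_moving_average(y, k):
    """Centred k-period moving average for even k; None where undefined."""
    totals = [sum(y[i:i + k]) for i in range(len(y) - k + 1)]
    cma = [None] * len(y)
    for i in range(len(totals) - 1):
        cma[i + k // 2] = (totals[i] + totals[i + 1]) / (2 * k)
    return cma

def seasonal_indexes(y, k):
    """Median and adjusted seasonal indexes by the ratio-to-moving-average method."""
    cma = centred_moving_average(y, k)                 # Step 1: trend/cyclical values
    ratios = [[] for _ in range(k)]
    for i, (actual, average) in enumerate(zip(y, cma)):
        if average is not None:
            ratios[i % k].append(actual / average * 100)   # Step 2: seasonal ratios
    medians = [median(r) for r in ratios]              # Step 3: median per season
    factor = k * 100 / sum(medians)                    # Step 4: adjustment factor
    adjusted = [m * factor for m in medians]
    return medians, adjusted

# Quarterly house sales, Q1 2014 to Q4 2017 (Valley Estates)
house_sales = [20, 16, 18, 21, 20, 22, 25, 17, 23, 26, 22, 23, 23, 27, 20, 22]

medians, adjusted = seasonal_indexes(house_sales, 4)
for quarter, (m, a) in enumerate(zip(medians, adjusted), start=1):
    print(f"Q{quarter}: median {m:.1f}, adjusted {a:.1f}")
```

For these data the median seasonal indexes are roughly 97.9, 114.3, 96.0 and 97.4 for Q1 to Q4. They sum to more than 400, so the adjustment factor is slightly below 1 and scales them down until the adjusted indexes sum to exactly 400. The second quarter stands out as the seasonal peak for house sales.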
TABLES
Table H1: Standard Normal Probabilities
The values in the table are the areas between zero and the z-score. That is, P(0 < Z < z-score)
z    0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0  0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1  0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2  0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3  0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4  0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5  0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6  0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7  0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8  0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9  0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0  0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1  0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2  0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3  0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4  0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5  0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6  0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7  0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8  0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9  0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0  0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1  0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2  0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3  0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4  0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5  0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6  0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7  0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8  0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9  0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0  0.4987  0.4987  0.4987  0.4988  0.4988  0.4989  0.4989  0.4989  0.4990  0.4990
Table 2: Critical Values for Student's t
df    t.100    t.050    t.025    t.010    t.005    t.001    t.0005
1     3.078    6.314    12.706   31.821   63.657   318.310  636.620
2     1.886    2.920    4.303    6.965    9.925    22.326   31.598
3     1.638    2.353    3.182    4.541    5.841    10.213   12.924
4     1.533    2.132    2.776    3.747    4.604    7.173    8.610
5     1.476    2.015    2.571    3.365    4.032    5.893    6.869
6     1.440    1.943    2.447    3.143    3.707    5.208    5.959
7     1.415    1.895    2.365    2.998    3.499    4.785    5.408
8     1.397    1.860    2.306    2.896    3.355    4.501    5.041
9     1.383    1.833    2.262    2.821    3.250    4.297    4.781
10    1.372    1.812    2.228    2.764    3.169    4.144    4.587
11    1.363    1.796    2.201    2.718    3.106    4.025    4.437
12    1.356    1.782    2.179    2.681    3.055    3.930    4.318
13    1.350    1.771    2.160    2.650    3.012    3.852    4.221
14    1.345    1.761    2.145    2.624    2.977    3.787    4.140
15    1.341    1.753    2.131    2.602    2.947    3.733    4.073
16    1.337    1.746    2.120    2.583    2.921    3.686    4.015
17    1.333    1.740    2.110    2.567    2.898    3.646    3.965
18    1.330    1.734    2.101    2.552    2.878    3.610    3.922
19    1.328    1.729    2.093    2.539    2.861    3.579    3.883
20    1.325    1.725    2.086    2.528    2.845    3.552    3.850
21    1.323    1.721    2.080    2.518    2.831    3.527    3.819
22    1.321    1.717    2.074    2.508    2.819    3.505    3.792
23    1.319    1.714    2.069    2.500    2.807    3.485    3.767
24    1.318    1.711    2.064    2.492    2.797    3.467    3.745
25    1.316    1.708    2.060    2.485    2.787    3.450    3.725
26    1.315    1.706    2.056    2.479    2.779    3.435    3.707
27    1.314    1.703    2.052    2.473    2.771    3.421    3.690
28    1.313    1.701    2.048    2.467    2.763    3.408    3.674
29    1.311    1.699    2.045    2.462    2.756    3.396    3.659
30    1.310    1.697    2.042    2.457    2.750    3.385    3.646
40    1.303    1.684    2.021    2.423    2.704    3.307    3.551
60    1.296    1.671    2.000    2.390    2.660    3.232    3.460
120   1.289    1.658    1.980    2.358    2.617    3.160    3.373
∞     1.282    1.645    1.960    2.326    2.576    3.090    3.291
Table H3: Chi-Square Probabilities
The areas given across the top are the areas to the right of the critical value. To look up an area
on the left, subtract it from one, and then look it up (i.e: 0.05 on the left is 0.95 on the right)
df   0.995    0.99     0.975    0.95     0.9      0.1      0.05     0.025    0.01     0.005
1    ---      ---      0.001    0.004    0.016    2.706    3.841    5.024    6.635    7.879
2    0.010    0.020    0.051    0.103    0.211    4.605    5.991    7.378    9.210    10.597
3    0.072    0.115    0.216    0.352    0.584    6.251    7.815    9.348    11.345   12.838
4    0.207    0.297    0.484    0.711    1.064    7.779    9.488    11.143   13.277   14.860
5    0.412    0.554    0.831    1.145    1.610    9.236    11.070   12.833   15.086   16.750
6    0.676    0.872    1.237    1.635    2.204    10.645   12.592   14.449   16.812   18.548
7    0.989    1.239    1.690    2.167    2.833    12.017   14.067   16.013   18.475   20.278
8    1.344    1.646    2.180    2.733    3.490    13.362   15.507   17.535   20.090   21.955
9    1.735    2.088    2.700    3.325    4.168    14.684   16.919   19.023   21.666   23.589
10   2.156    2.558    3.247    3.940    4.865    15.987   18.307   20.483   23.209   25.188
11   2.603    3.053    3.816    4.575    5.578    17.275   19.675   21.920   24.725   26.757
12   3.074    3.571    4.404    5.226    6.304    18.549   21.026   23.337   26.217   28.300
13   3.565    4.107    5.009    5.892    7.042    19.812   22.362   24.736   27.688   29.819
14   4.075    4.660    5.629    6.571    7.790    21.064   23.685   26.119   29.141   31.319
15   4.601    5.229    6.262    7.261    8.547    22.307   24.996   27.488   30.578   32.801
16   5.142    5.812    6.908    7.962    9.312    23.542   26.296   28.845   32.000   34.267
17   5.697    6.408    7.564    8.672    10.085   24.769   27.587   30.191   33.409   35.718
18   6.265    7.015    8.231    9.390    10.865   25.989   28.869   31.526   34.805   37.156
19   6.844    7.633    8.907    10.117   11.651   27.204   30.144   32.852   36.191   38.582
20   7.434    8.260    9.591    10.851   12.443   28.412   31.410   34.170   37.566   39.997
21   8.034    8.897    10.283   11.591   13.240   29.615   32.671   35.479   38.932   41.401
22   8.643    9.542    10.982   12.338   14.041   30.813   33.924   36.781   40.289   42.796
23   9.260    10.196   11.689   13.091   14.848   32.007   35.172   38.076   41.638   44.181
24   9.886    10.856   12.401   13.848   15.659   33.196   36.415   39.364   42.980   45.559
25   10.520   11.524   13.120   14.611   16.473   34.382   37.652   40.646   44.314   46.928
26   11.160   12.198   13.844   15.379   17.292   35.563   38.885   41.923   45.642   48.290
27   11.808   12.879   14.573   16.151   18.114   36.741   40.113   43.195   46.963   49.645
28   12.461   13.565   15.308   16.928   18.939   37.916   41.337   44.461   48.278   50.993
29   13.121   14.256   16.047   17.708   19.768   39.087   42.557   45.722   49.588   52.336
30   13.787   14.953   16.791   18.493   20.599   40.256   43.773   46.979   50.892   53.672
40   20.707   22.164   24.433   26.509   29.051   51.805   55.758   59.342   63.691   66.766
50   27.991   29.707   32.357   34.764   37.689   63.167   67.505   71.420   76.154   79.490
60   35.534   37.485   40.482   43.188   46.459   74.397   79.082   83.298   88.379   91.952
70   43.275   45.442   48.758   51.739   55.329   85.527   90.531   95.023   100.425  104.215
80   51.172   53.540   57.153   60.391   64.278   96.578   101.879  106.629  112.329  116.321
90   59.196   61.754   65.647   69.126   73.291   107.565  113.145  118.136  124.116  128.299
100  67.328   70.065   74.222   77.929   82.358   118.498  124.342  129.561  135.807  140.169
Table 4: Critical values for the F statistic: F.05
Rows: denominator degrees of freedom (df). Columns: numerator degrees of freedom.
df    1       2       3       4       5       6       7       8       9       10      12      15      20      24      30      40      60      120
1     161.40  199.50  215.70  224.60  230.20  234.00  236.80  238.90  240.50  241.90  243.90  245.90  248.00  249.10  250.10  251.10  252.20  253.30
2     18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38   19.40   19.41   19.43   19.45   19.45   19.46   19.47   19.48   19.49
3     10.13   9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81    8.79    8.74    8.70    8.66    8.64    8.62    8.59    8.57    8.55
4     7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00    5.96    5.91    5.86    5.80    5.77    5.75    5.72    5.69    5.66
5     6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77    4.74    4.68    4.62    4.56    4.53    4.50    4.46    4.43    4.40
6     5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10    4.06    4.00    3.94    3.87    3.84    3.81    3.77    3.74    3.70
7     5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68    3.64    3.57    3.51    3.44    3.41    3.38    3.34    3.30    3.27
8     5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39    3.35    3.28    3.22    3.15    3.12    3.08    3.04    3.01    2.97
9     5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18    3.14    3.07    3.01    2.94    2.90    2.86    2.83    2.79    2.75
10    4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02    2.98    2.91    2.85    2.77    2.74    2.70    2.66    2.62    2.58
11    4.84    3.98    3.59    3.36    3.20    3.09    3.01    2.95    2.90    2.85    2.79    2.72    2.65    2.61    2.57    2.53    2.49    2.45
12    4.75    3.89    3.49    3.26    3.11    3.00    2.91    2.85    2.80    2.75    2.69    2.62    2.54    2.51    2.47    2.43    2.38    2.34
13    4.67    3.81    3.41    3.18    3.03    2.92    2.83    2.77    2.71    2.67    2.60    2.53    2.46    2.42    2.38    2.34    2.30    2.25
14    4.60    3.74    3.34    3.11    2.96    2.85    2.76    2.70    2.65    2.60    2.53    2.46    2.39    2.35    2.31    2.27    2.22    2.18
15    4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59    2.54    2.48    2.40    2.33    2.29    2.25    2.20    2.16    2.11
16    4.49    3.63    3.24    3.01    2.85    2.74    2.66    2.59    2.54    2.49    2.42    2.35    2.28    2.24    2.19    2.15    2.11    2.06
17    4.45    3.59    3.20    2.96    2.81    2.70    2.61    2.55    2.49    2.45    2.38    2.31    2.23    2.19    2.15    2.10    2.06    2.01
18    4.41    3.55    3.16    2.93    2.77    2.66    2.58    2.51    2.46    2.41    2.34    2.27    2.19    2.15    2.11    2.06    2.02    1.97
19    4.38    3.52    3.13    2.90    2.74    2.63    2.54    2.48    2.42    2.38    2.31    2.23    2.16    2.11    2.07    2.03    1.98    1.93
20    4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39    2.35    2.28    2.20    2.12    2.08    2.04    1.99    1.95    1.90
21    4.32    3.47    3.07    2.84    2.68    2.57    2.49    2.42    2.37    2.32    2.25    2.18    2.10    2.05    2.01    1.96    1.92    1.87
22    4.30    3.44    3.05    2.82    2.66    2.55    2.46    2.40    2.34    2.30    2.23    2.15    2.07    2.03    1.98    1.94    1.89    1.84
23    4.28    3.42    3.03    2.80    2.64    2.53    2.44    2.37    2.32    2.27    2.20    2.13    2.05    2.01    1.96    1.91    1.86    1.81
24    4.26    3.40    3.01    2.78    2.62    2.51    2.42    2.36    2.30    2.25    2.18    2.11    2.03    1.98    1.94    1.89    1.84    1.79
25    4.24    3.39    2.99    2.76    2.60    2.49    2.40    2.34    2.28    2.24    2.16    2.09    2.01    1.96    1.92    1.87    1.82    1.77
26    4.23    3.37    2.98    2.74    2.59    2.47    2.39    2.32    2.27    2.22    2.15    2.07    1.99    1.95    1.90    1.85    1.80    1.75
27    4.21    3.35    2.96    2.73    2.57    2.46    2.37    2.31    2.25    2.20    2.13    2.06    1.97    1.93    1.88    1.84    1.79    1.73
28    4.20    3.34    2.95    2.71    2.56    2.45    2.36    2.29    2.24    2.19    2.12    2.04    1.96    1.91    1.87    1.82    1.77    1.71
29    4.18    3.33    2.93    2.70    2.55    2.43    2.35    2.28    2.22    2.18    2.10    2.03    1.94    1.90    1.85    1.81    1.75    1.70
30    4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21    2.16    2.09    2.01    1.93    1.89    1.84    1.79    1.74    1.68
40    4.08    3.23    2.84    2.61    2.45    2.34    2.25    2.18    2.12    2.08    2.00    1.92    1.84    1.79    1.74    1.69    1.64    1.58
60    4.00    3.15    2.76    2.53    2.37    2.25    2.17    2.10    2.04    1.99    1.92    1.84    1.75    1.70    1.65    1.59    1.53    1.47
120   3.92    3.07    2.68    2.45    2.29    2.17    2.09    2.02    1.96    1.91    1.83    1.75    1.66    1.61    1.55    1.50    1.43    1.35
∞     3.84    3.00    2.60    2.37    2.21    2.10    2.01    1.94    1.88    1.83    1.75    1.67    1.57    1.52    1.46    1.39    1.32    1.22