
Statistics 2 - Subject Guide 2019

Contents

1 Introduction
   1.1 Route map to the guide
   1.2 Introduction to the subject area
   1.3 Syllabus
   1.4 Aims of the course
   1.5 Learning outcomes for the course
   1.6 Overview of learning resources
      1.6.1 The subject guide
      1.6.2 Essential reading
      1.6.3 Further reading
      1.6.4 Online study resources (the Online Library and the VLE)
   1.7 Examination advice

2 Probability theory
   2.1 Synopsis of chapter
   2.2 Learning outcomes
   2.3 Introduction
   2.4 Set theory: the basics
   2.5 Axiomatic definition of probability
      2.5.1 Basic properties of probability
   2.6 Classical probability and counting rules
      2.6.1 Combinatorial counting methods
   2.7 Conditional probability and Bayes' theorem
      2.7.1 Independence of multiple events
      2.7.2 Independent versus mutually exclusive events
      2.7.3 Conditional probability of independent events
      2.7.4 Chain rule of conditional probabilities
      2.7.5 Total probability formula
      2.7.6 Bayes' theorem
   2.8 Overview of chapter
   2.9 Key terms and concepts
   2.10 Sample examination questions

3 Random variables
   3.1 Synopsis of chapter
   3.2 Learning outcomes
   3.3 Introduction
   3.4 Discrete random variables
      3.4.1 Probability distribution of a discrete random variable
      3.4.2 The cumulative distribution function (cdf)
      3.4.3 Properties of the cdf for discrete distributions
      3.4.4 General properties of the cdf
      3.4.5 Properties of a discrete random variable
      3.4.6 Expected value versus sample mean
   3.5 Continuous random variables
      3.5.1 Median of a random variable
   3.6 Overview of chapter
   3.7 Key terms and concepts
   3.8 Sample examination questions

4 Common distributions of random variables
   4.1 Synopsis of chapter content
   4.2 Learning outcomes
   4.3 Introduction
   4.4 Common discrete distributions
      4.4.1 Discrete uniform distribution
      4.4.2 Bernoulli distribution
      4.4.3 Binomial distribution
      4.4.4 Poisson distribution
      4.4.5 Connections between probability distributions
      4.4.6 Poisson approximation of the binomial distribution
      4.4.7 Some other discrete distributions
   4.5 Common continuous distributions
      4.5.1 The (continuous) uniform distribution
      4.5.2 Exponential distribution
      4.5.3 Normal (Gaussian) distribution
      4.5.4 Normal approximation of the binomial distribution
   4.6 Overview of chapter
   4.7 Key terms and concepts
   4.8 Sample examination questions

5 Multivariate random variables
   5.1 Synopsis of chapter
   5.2 Learning outcomes
   5.3 Introduction
   5.4 Joint probability functions
   5.5 Marginal distributions
   5.6 Conditional distributions
      5.6.1 Properties of conditional distributions
      5.6.2 Conditional mean and variance
   5.7 Covariance and correlation
      5.7.1 Covariance
      5.7.2 Correlation
      5.7.3 Sample covariance and correlation
   5.8 Independent random variables
      5.8.1 Joint distribution of independent random variables
   5.9 Sums and products of random variables
      5.9.1 Distributions of sums and products
      5.9.2 Expected values and variances of sums of random variables
      5.9.3 Expected values of products of independent random variables
      5.9.4 Distributions of sums of random variables
   5.10 Overview of chapter
   5.11 Key terms and concepts
   5.12 Sample examination questions

6 Sampling distributions of statistics
   6.1 Synopsis of chapter
   6.2 Learning outcomes
   6.3 Introduction
   6.4 Random samples
      6.4.1 Joint distribution of a random sample
   6.5 Statistics and their sampling distributions
      6.5.1 Sampling distribution of a statistic
   6.6 Sample mean from a normal population
   6.7 The central limit theorem
   6.8 Some common sampling distributions
      6.8.1 The χ² distribution
      6.8.2 (Student's) t distribution
      6.8.3 The F distribution
   6.9 Prelude to statistical inference
      6.9.1 Population versus random sample
      6.9.2 Parameter versus statistic
      6.9.3 Difference between 'Probability' and 'Statistics'
   6.10 Overview of chapter
   6.11 Key terms and concepts
   6.12 Sample examination questions

7 Point estimation
   7.1 Synopsis of chapter
   7.2 Learning outcomes
   7.3 Introduction
   7.4 Estimation criteria: bias, variance and mean squared error
   7.5 Method of moments (MM) estimation
   7.6 Least squares (LS) estimation
   7.7 Maximum likelihood (ML) estimation
   7.8 Overview of chapter
   7.9 Key terms and concepts
   7.10 Sample examination questions

8 Interval estimation
   8.1 Synopsis of chapter
   8.2 Learning outcomes
   8.3 Introduction
   8.4 Interval estimation for means of normal distributions
      8.4.1 An important property of normal samples
      8.4.2 Means of non-normal distributions
   8.5 Use of the chi-squared distribution
   8.6 Interval estimation for variances of normal distributions
   8.7 Overview of chapter
   8.8 Key terms and concepts
   8.9 Sample examination questions

9 Hypothesis testing
   9.1 Synopsis of chapter
   9.2 Learning outcomes
   9.3 Introduction
   9.4 Introductory examples
   9.5 Setting p-value, significance level, test statistic
      9.5.1 General setting of hypothesis tests
      9.5.2 Statistical testing procedure
      9.5.3 Two-sided tests for normal means
      9.5.4 One-sided tests for normal means
   9.6 t tests
   9.7 General approach to statistical tests
   9.8 Two types of error
   9.9 Tests for variances of normal distributions
   9.10 Summary: tests for µ and σ² in N(µ, σ²)
   9.11 Comparing two normal means with paired observations
      9.11.1 Power functions of the test
   9.12 Comparing two normal means
      9.12.1 Tests on µX − µY with known σX² and σY²
      9.12.2 Tests on µX − µY with σX² = σY² but unknown
   9.13 Tests for correlation coefficients
      9.13.1 Tests for correlation coefficients
   9.14 Tests for the ratio of two normal variances
   9.15 Summary: tests for two normal distributions
   9.16 Overview of chapter
   9.17 Key terms and concepts
   9.18 Sample examination questions

10 Analysis of variance (ANOVA)
   10.1 Synopsis of chapter
   10.2 Learning outcomes
   10.3 Introduction
   10.4 Testing for equality of three population means
   10.5 One-way analysis of variance
   10.6 From one-way to two-way ANOVA
   10.7 Two-way analysis of variance
   10.8 Residuals
   10.9 Overview of chapter
   10.10 Key terms and concepts
   10.11 Sample examination questions

A Linear regression (non-examinable)
   A.1 Synopsis of chapter
   A.2 Learning outcomes
   A.3 Introduction
   A.4 Introductory examples
   A.5 Simple linear regression
   A.6 Inference for parameters in normal regression models
   A.7 Regression ANOVA
   A.8 Confidence intervals for E(y)
   A.9 Prediction intervals for y
   A.10 Multiple linear regression models
   A.11 Regression using R
   A.12 Overview of chapter
   A.13 Key terms and concepts

B Non-examinable proofs
   B.1 Chapter 2 – Probability theory
   B.2 Chapter 3 – Random variables
   B.3 Chapter 5 – Multivariate random variables

C Solutions to Sample examination questions
   C.1 Chapter 2 – Probability theory
   C.2 Chapter 3 – Random variables
   C.3 Chapter 4 – Common distributions of random variables
   C.4 Chapter 5 – Multivariate random variables
   C.5 Chapter 6 – Sampling distributions of statistics
   C.6 Chapter 7 – Point estimation
   C.7 Chapter 8 – Interval estimation
   C.8 Chapter 9 – Hypothesis testing
   C.9 Chapter 10 – Analysis of variance (ANOVA)

D Examination formula sheet
Chapter 1
Introduction
1.1 Route map to the guide
This subject guide provides you with a framework for covering the syllabus of the
ST104b Statistics 2 half course and directs you to additional resources such as
readings and the virtual learning environment (VLE).
The following ten chapters will cover important aspects of elementary statistical theory,
upon which many applications in EC2020 Elements of econometrics draw heavily.
The chapters are not a series of self-contained topics, rather they build on each other
sequentially. As such, you are strongly advised to follow the subject guide in chapter
order. There is little point in rushing past material which you have only partially
understood in order to reach the final chapter. Once you have completed your work on
all of the chapters, you will be ready for examination revision. A good place to start is
the sample examination paper which you will find at the end of the subject guide.
ST104b Statistics 2 extends the work of ST104a Statistics 1 and provides a precise
and accurate treatment of probability, distribution theory and statistical inference. As
such there will be a strong emphasis on mathematical statistics as important discrete
and continuous probability distributions are covered and properties of these
distributions are investigated.
Point estimation techniques are discussed including method of moments, least squares
and maximum likelihood estimation. Confidence interval construction and statistical
hypothesis testing follow. Analysis of variance and a (non-examinable) treatment of
linear regression models, featuring the interpretation of computer-generated regression
output and implications for prediction, round off the course.
Collectively, these topics provide a solid training in statistical analysis. As such,
ST104b Statistics 2 is of considerable value to those intending to pursue further
study in statistics, econometrics and/or empirical economics. Indeed, the quantitative
skills developed in the subject guide are readily applicable to all fields involving real
data analysis.
1.2 Introduction to the subject area
Why study statistics?
By successfully completing this half course, you will understand the ideas of
randomness and variability, and the way in which they link to probability theory. This
will allow the use of a systematic and logical collection of statistical techniques of great
practical importance in many applied areas. The examples in this subject guide will
concentrate on the social sciences, but the methods are important for the physical
sciences too. This subject aims to provide a grounding in probability theory and some
of the most common statistical methods.
The material in ST104b Statistics 2 is necessary as preparation for other subjects
you may study later on in your degree. The full details of the ideas discussed in this
subject guide will not always be required in these other subjects, but you will need to
have a solid understanding of the main concepts. This can only be achieved by seeing
how the ideas emerge in detail.
How to study statistics
For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.
1.3 Syllabus
The syllabus of ST104b Statistics 2 is as follows:
Probability: Set theory: the basics; Axiomatic definition of probability; Classical
probability and counting rules; Conditional probability and Bayes’ theorem.
Random variables: Discrete random variables; Continuous random variables.
Common distributions of random variables: Common discrete distributions;
Common continuous distributions.
Multivariate random variables: Joint probability functions; Conditional
distributions; Covariance and correlation; Independent random variables; Sums and
products of random variables.
Sampling distributions of statistics: Random samples; Statistics and their
sampling distributions; Sampling distribution of a statistic; Sample mean from a
normal population; The central limit theorem; Some common sampling
distributions; Prelude to statistical inference.
Point estimation: Estimation criteria: bias, variance and mean squared error;
Method of moments estimation; Least squares estimation; Maximum likelihood
estimation.
Interval estimation: Interval estimation for means of normal distributions; Use of
the chi-squared distribution; Confidence intervals for normal variances.
Hypothesis testing: Setting p-value, significance level, test statistic; t tests;
General approach to statistical tests; Two types of error; Tests for normal variances;
Comparing two normal means with paired observations; Comparing two normal
means; Tests for correlation coefficients; Tests for the ratio of two normal variances.
Analysis of variance (ANOVA): One-way analysis of variance; Two-way
analysis of variance.
Linear regression (non-examinable): Simple linear regression; Inference for
parameters in normal regression models; Regression ANOVA; Confidence intervals
for E(y); Prediction intervals for y; Multiple linear regression models.
1.4 Aims of the course
The aim of this half course is to develop students’ knowledge of elementary statistical
theory. The emphasis is on topics that are of importance in applications to
econometrics, finance and the social sciences. Concepts and methods that provide the
foundation for more specialised courses in statistics are introduced.
1.5 Learning outcomes for the course
At the end of this half course, and having completed the Essential reading and
activities, you should be able to:
apply and be competent users of standard statistical operators and be able to recall
a variety of well-known distributions and their respective moments
explain the fundamentals of statistical inference and apply these principles to
justify the use of an appropriate model and perform hypothesis tests in a number
of different settings
demonstrate understanding that statistical techniques are based on assumptions
and the plausibility of such assumptions must be investigated when analysing real
problems.
1.6 Overview of learning resources
1.6.1 The subject guide
This course builds on the ideas encountered in ST104a Statistics 1. Although this
subject guide offers a complete treatment of the course material, students may wish to
consider purchasing a textbook. Apart from the textbooks recommended in this subject
guide, you may wish to look in bookshops and libraries for alternative textbooks which
may help you. A critical part of a good statistics textbook is the collection of problems
to solve, and you may want to look at several different textbooks just to see a range of
practice questions, especially for tricky topics. The subject guide is there mainly to
describe the syllabus and to show the level of understanding expected.
The subject guide is divided into chapters which should be worked through in the order
in which they appear. There is little point in rushing past material you only partly
understand to get to later chapters, as the presentation is somewhat sequential and not
a series of self-contained topics. You should be familiar with the earlier chapters and
have a solid understanding of them before moving on to the later ones.
The following procedure is recommended:
1. Read the introductory comments.
2. Consult the appropriate section of your textbook.
3. Study the chapter content, examples and learning activities.
4. Go through the learning outcomes carefully.
5. Attempt some of the problems from your textbook.
6. Refer back to this subject guide, or to the textbook, or to supplementary texts, to
improve your understanding until you are able to work through the problems
confidently.
The last two steps are the most important. It is easy to think that you have understood
the material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
Each chapter of the subject guide has suggestions for reading from the main textbook.
Usually, you will only need to read the material in the main textbook (see ‘Essential
reading’ below), but it may be helpful from time to time to look at others.
Basic notation
We often use the symbol ∎ to denote the end of a proof, where we have finished
explaining why a particular result is true. This is just to make it clear where the proof
ends and the following text begins.
Time management
About one-third of your self-study time should be spent reading and the rest should be
spent solving problems. An internal student would expect maybe 15 hours of formal
teaching and another 50 hours of private study to be enough to cover the subject. Of
the 50 hours of private study, about 17 hours should be spent on the initial study of the
textbook and subject guide. The remaining 33 hours should be spent on attempting
problems, which may well require more reading.
Calculators
A calculator may be used when answering questions on the examination paper for
ST104b Statistics 2. It must comply in all respects with the specification given in the
Regulations. You should also refer to the admission notice you will receive when
entering the examination and the ‘Notice on permitted materials’.
Make sure you accustom yourself to using your chosen calculator and feel comfortable
with it. Specifically, calculators must:
have no external wires
must be:
hand held
compact and portable
quiet in operation
non-programmable
and must:
not be capable of receiving, storing or displaying user-supplied non-numerical data.
The Regulations state: ‘The use of a calculator that communicates or displays textual
messages, graphical or algebraic information is strictly forbidden. Where a calculator is
permitted in the examination, it must be a non-scientific calculator. Where calculators
are permitted, only calculators limited to performing just basic arithmetic operations
may be used. This is to encourage candidates to show the examiners the steps taken in
arriving at the answer.’
Computers
If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package such as R.
It is not necessary for this course to have such software available, but if you do have
access to it you may benefit from using it in your study of the material.
1.6.2 Essential reading
This subject guide is ‘self-contained’ meaning that this is the only resource which is
essential reading for ST104b Statistics 2. Throughout the subject guide there are
many examples, activities and sample examination questions replicating resources
typically provided in statistical textbooks. You may, however, feel you could benefit
from reading textbooks, and a suggested list of these is provided below.
Statistical tables
In the examination you will be provided with relevant extracts of:
Lindley, D.V. and W.F. Scott, New Cambridge Statistical Tables. (Cambridge: Cambridge University Press, 1995) second edition [ISBN 978-0521484855].
As relevant extracts of these statistical tables are the same as those distributed for use
in the examination, it is advisable that you become familiar with them, rather than
those at the end of a textbook.
1.6.3 Further reading
As mentioned above, this subject guide is sufficient for study of ST104b Statistics 2.
Of course, you are free to read around the subject area in any text, paper or online
resource to support your learning and to think about how these principles apply in
the real world. To help you read extensively, you have free access to the virtual learning
environment (VLE) and University of London Online Library (see below).
Other useful texts for this course include:
Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and
Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060].
Johnson, R.A. and G.K. Bhattacharyya, Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].
Larsen, R.J. and M.L. Marx, Introduction to Mathematical Statistics and Its
Applications (Pearson, 2013) fifth edition [ISBN 9781292023557].
While Newbold et al. is the main recommended textbook for this course, there are many
which are just as good. You are encouraged to look at those listed above and at any
others you may find. It may be necessary to look at several textbooks for a single topic,
as you may find that the approach of one textbook suits you better than that of another.
1.6.4 Online study resources (the Online Library and the VLE)
In addition to the subject guide and the Essential reading, it is crucial that you take
advantage of the study resources that are available online for this course, including the
virtual learning environment (VLE) and the Online Library.
You can access the VLE, the Online Library and your University of London email
account via the Student Portal at:
http://my.londoninternational.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged in to the Student Portal in order to register! As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functional University of London email account.
If you forget your login details, please click on the ‘Forgotten your password’ link on the
login page.
The VLE
The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. It forms an
important part of your study experience with the University of London and you should
access it regularly.
The VLE provides a range of resources for EMFSS courses:
Self-testing activities: Doing these allows you to test your own understanding of the
subject material.
Electronic study materials: The printed materials that you receive from the
University of London are available to download, including updated reading lists
and references.
Past examination papers and Examiners’ commentaries: These provide advice on
how each examination question might best be answered.
A student discussion forum: This is an open space for you to discuss interests and
experiences, seek support from your peers, work collaboratively to solve problems
and discuss subject material.
Videos: There are recorded academic introductions to the subject, interviews and
debates and, for some courses, audio-visual tutorials and conclusions.
Recorded lectures: For some courses, where appropriate, the sessions from previous
years’ Study Weekends have been recorded and made available.
Study skills: Expert advice on preparing for examinations and developing your
digital literacy skills.
Feedback forms.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.
Making use of the Online Library
The Online Library contains a huge array of journal articles and other resources to help
you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login:
http://tinyurl.com/ollathens
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please see the online help pages:
www.external.shl.lon.ac.uk/summon/about.php
Additional material
There is a lot of computer-based teaching material available freely over the web. A
fairly comprehensive list can be found in the ‘Books & Manuals’ section of
http://statpages.org
Unless otherwise stated, all websites in this subject guide were accessed in August 2019.
We cannot guarantee, however, that they will stay current and you may need to
perform an internet search to find the relevant pages.
1.7 Examination advice
Important: the information and advice given here are based on the examination
structure used at the time this subject guide was written. Please note that subject
guides may be used for several years. Because of this we strongly advise you to always
check both the current Regulations for relevant information about the examination, and
the VLE where you should be advised of any forthcoming changes. You should also
carefully check the rubric/instructions on the paper you actually sit and follow those
instructions.
Remember, it is important to check the VLE for:
up-to-date information on examination and assessment arrangements for this course
where available, past examination papers and Examiners’ commentaries for the
course which give advice on how each question might best be answered.
The examination is by a two-hour unseen question paper. No books may be taken into
the examination, but the use of calculators is permitted, and statistical tables and a
formula sheet are provided (the formula sheet can be found in past examination papers
available on the VLE).
The examination paper has a variety of questions, some quite short and others longer.
All questions must be answered correctly for full marks. You may use your calculator
whenever you feel it is appropriate, always remembering that the examiners can give
marks only for what appears on the examination script. Therefore, it is important to
always show your working.
In terms of the examination, as always, it is important to manage your time carefully
and not to dwell on one question for too long – move on and focus on solving the easier
questions, coming back to harder ones later.
Chapter 2
Probability theory
2.1 Synopsis of chapter
Probability theory is very important for statistics because it provides the rules which
allow us to reason about uncertainty and randomness, and such reasoning is the basis of statistics.
Independence and conditional probability are profound ideas, but they must be fully
understood in order to think clearly about any statistical investigation.
2.2 Learning outcomes
After completing this chapter, you should be able to:
explain the fundamental ideas of random experiments, sample spaces and events
list the axioms of probability and be able to derive all the common probability
rules from them
list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
explain conditional probability and the concept of independent events
prove the law of total probability and apply it to problems where there is a
partition of the sample space
prove Bayes’ theorem and apply it to find conditional probabilities.
2.3 Introduction
Consider the following hypothetical example. A country will soon hold a referendum
about whether it should leave the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote ‘Yes’ or ‘No’ to leaving the EU?’ as follows:
          Answer
          Yes     No     Total
Count     513     437    950
%         54%     46%    100%
However, we are not interested in just this sample of 950 respondents, but in the
population which they represent, that is, all likely voters.
Statistical inference will allow us to say things like the following about the
population.
‘A 95% confidence interval for the population proportion, π, of ‘Yes’ voters is
(0.5083, 0.5717).’
‘The null hypothesis that π = 0.5, against the alternative hypothesis that π > 0.5,
is rejected at the 5% significance level.’
In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results.
Each response Xi is a realisation of a random variable from a Bernoulli
distribution with probability parameter π.
The responses X1 , X2 , . . . , Xn are independent of each other.
The sampling distribution of the sample mean (proportion) X̄ has expected
value π and variance π (1 − π)/n.
By use of the central limit theorem, the sampling distribution is approximately
a normal distribution.
In the next few chapters, we will learn about the terms in bold, among others.
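As a preview of where such numbers come from, the following short R sketch reproduces the 95% confidence interval quoted above directly from the poll counts, using the usual normal approximation for a sample proportion. The formula itself is justified later in the course; the code here is purely illustrative.

n <- 950                               # number of respondents
x <- 513                               # number of 'Yes' answers
p.hat <- x / n                         # sample proportion, 0.54
se <- sqrt(p.hat * (1 - p.hat) / n)    # estimated standard error of the proportion
p.hat + c(-1, 1) * qnorm(0.975) * se   # approximately (0.5083, 0.5717)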
The need for probability in statistics
In statistical inference, the data we have observed are regarded as a sample from a
broader population, selected with a random process.
Values in a sample are variable. If we collected a different sample we would not
observe exactly the same values again.
Values in a sample are also random. We cannot predict the precise values which
will be observed before we actually collect the sample.
Probability theory is the branch of mathematics which deals with randomness. So we
need to study this first.
A preview of probability
The first basic concepts in probability will be the following.
Experiment: for example, rolling a single die and recording the outcome.
Outcome of the experiment: for example, rolling a 3.
Sample space S: the set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.
Event: any subset A of the sample space, for example A = {4, 5, 6}.1
Probability of an event A, P (A), will be defined as a function which assigns
probabilities (real numbers) to events (sets). This uses the language and concepts of set
theory. So we need to study the basics of set theory first.
2.4 Set theory: the basics
A set is a collection of elements (also known as ‘members’ of the set).
Example 2.1 The following are all examples of sets:
A = {Amy, Bob, Sam}.
B = {1, 2, 3, 4, 5}.
C = {x | x is a prime number} = {2, 3, 5, 7, 11, . . .}.
D = {x | x ≥ 0} (that is, the set of all non-negative real numbers).
Activity 2.1 Why is S = {1, 1, 2}, not a sensible way to try to define a sample
space?
Solution
Because there is no need to list the elementary outcome ‘1’ twice. It is much clearer
to write S = {1, 2}.
Activity 2.2 Write out all the events for the sample space S = {a, b, c}. (There are
eight of them.)
Solution
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample space
S) and ∅.
Membership of sets and the empty set
x ∈ A means that object x is an element of set A.
x ∉ A means that object x is not an element of set A.
The empty set, denoted ∅, is the set with no elements, i.e. x ∉ ∅ is true for every
object x, and x ∈ ∅ is not true for any object x.
1. Strictly speaking not all subsets are events.
Example 2.2 If A = {1, 2, 3, 4, 5}, then:
1 ∈ A and 2 ∈ A.
6 ∉ A and 1.5 ∉ A.
The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.
Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ B c .
Figure 2.1: Venn diagram depicting A ∪ B (the total shaded area).
Subsets and equality of sets
A ⊂ B means that set A is a subset of set B, defined as:
A ⊂ B when x ∈ A ⇒ x ∈ B.
Hence A is a subset of B if every element of A is also an element of B. An example
is shown in Figure 2.2.
Figure 2.2: Venn diagram depicting a subset, where A ⊂ B.
Example 2.4 An example of the distinction between subsets and non-subsets is:
{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set
{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.
Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.
Unions of sets (‘or’)
The union, denoted ∪, of two sets is:
A ∪ B = {x | x ∈ A or x ∈ B}.
That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 2.3.
Figure 2.3: Venn diagram depicting the union of two sets.
Example 2.5 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:
A ∪ B = {1, 2, 3, 4}
A ∪ C = {1, 2, 3, 4, 5, 6}
B ∪ C = {2, 3, 4, 5, 6}.
Intersections of sets (‘and’)
The intersection, denoted ∩, of two sets is:
A ∩ B = {x | x ∈ A and x ∈ B}.
That is, the set of those elements which belong to both A and B. An example is
shown in Figure 2.4.
Example 2.6 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:
A ∩ B = {2, 3}
A ∩ C = {4}
B ∩ C = ∅.
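Set operations such as these are easy to experiment with on a computer. The following R sketch (purely illustrative) reproduces Examples 2.5 and 2.6 using the built-in union() and intersect() functions; R returns a vector of length zero where the text writes ∅.

A <- c(1, 2, 3, 4)
B <- c(2, 3)
C <- c(4, 5, 6)
union(A, B)       # 1 2 3 4
union(B, C)       # 2 3 4 5 6
intersect(A, B)   # 2 3
intersect(A, C)   # 4
intersect(B, C)   # empty vector, corresponding to the empty set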
Figure 2.4: Venn diagram depicting the intersection of two sets.
Unions and intersections of many sets
Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1 , A2 , . . . , An is:
⋃_{i=1}^{n} Ai = A1 ∪ A2 ∪ · · · ∪ An
and:
⋂_{i=1}^{n} Ai = A1 ∩ A2 ∩ · · · ∩ An.
These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.
Complement (‘not’)
Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
Ac = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.
We now consider some useful properties of set operators. In proofs and derivations
about sets, you can use the following results without proof.
Figure 2.5: Venn diagram depicting the complement of a set.
Properties of set operators
Commutativity:
A ∩ B = B ∩ A and A ∪ B = B ∪ A.
Associativity:
A ∩ (B ∩ C) = (A ∩ B) ∩ C
and A ∪ (B ∪ C) = (A ∪ B) ∪ C.
Distributive laws:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan’s laws:
(A ∩ B)c = Ac ∪ B c
and (A ∪ B)c = Ac ∩ B c .
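These identities can also be checked numerically for particular sets. In the R sketch below the sample space S is taken to be the digits 0 to 9 (an arbitrary choice for illustration), complements with respect to S are formed with setdiff(), and setequal() confirms both De Morgan's laws for these sets.

S <- 0:9                                          # sample space for this illustration
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)
setequal(setdiff(S, intersect(A, B)),             # (A intersect B) complement ...
         union(setdiff(S, A), setdiff(S, B)))     # ... equals Ac union Bc: TRUE
setequal(setdiff(S, union(A, B)),                 # (A union B) complement ...
         intersect(setdiff(S, A), setdiff(S, B))) # ... equals Ac intersect Bc: TRUE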
Further properties of set operators
If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:
∅c = S.
∅ ⊂ A, A ⊂ A and A ⊂ S.
A ∩ A = A and A ∪ A = A.
A ∩ Ac = ∅ and A ∪ Ac = S.
If B ⊂ A, A ∩ B = B and A ∪ B = A.
A ∩ ∅ = ∅ and A ∪ ∅ = A.
A ∩ S = A and A ∪ S = S.
∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.
Mutually exclusive events
Two sets A and B are disjoint or mutually exclusive if:
A ∩ B = ∅.
Sets A1 , A2 , . . . , An are pairwise disjoint if all pairs of sets from them are disjoint,
i.e. Ai ∩ Aj = ∅ for all i ≠ j.
Partition
The sets A1, A2, . . . , An form a partition of the set A if they are pairwise disjoint
and if ⋃_{i=1}^{n} Ai = A, that is, A1, A2, . . . , An are collectively exhaustive of A.
Therefore, a partition divides the entire set A into non-overlapping pieces Ai, as
shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A1, A2, . . . form
a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} Ai = A.
Figure 2.6: The partition of the set A into A1, A2 and A3.
Example 2.7
Suppose that A ⊂ B. Show that A and B ∩ Ac form a partition of B.
We have:
A ∩ (B ∩ Ac ) = (A ∩ Ac ) ∩ B = ∅ ∩ B = ∅
and:
A ∪ (B ∩ Ac ) = (A ∪ B) ∩ (A ∪ Ac ) = B ∩ S = B.
Hence A and B ∩ Ac are mutually exclusive and collectively exhaustive of B, and so
they form a partition of B.
Activity 2.3 For an event A, work out a simpler way to express the events A ∩ S,
A ∪ S, A ∩ ∅ and A ∪ ∅.
Solution
We have:
A ∩ S = A,
A ∪ S = S,
A ∩ ∅ = ∅ and A ∪ ∅ = A.
Activity 2.4 Use the rules of set operators to prove that the following represents a
partition of set A:
A = (A ∩ B) ∪ (A ∩ B c ).     (*)
In other words, prove that (*) is true, and also that (A ∩ B) ∩ (A ∩ B c ) = ∅.
Solution
We have:
(A ∩ B) ∩ (A ∩ B c ) = (A ∩ A) ∩ (B ∩ B c ) = A ∩ ∅ = ∅.
This uses the results of commutativity, associativity, A ∩ A = A, A ∩ Ac = ∅ and
A ∩ ∅ = ∅.
Similarly:
(A ∩ B) ∪ (A ∩ B c ) = A ∩ (B ∪ B c ) = A ∩ S = A
using the results of the distributive laws, A ∪ Ac = S and A ∩ S = A.
Activity 2.5 Find A1 ∪ A2 and A1 ∩ A2 of the two sets A1 and A2 , where:
(a) A1 = {0, 1, 2} and A2 = {2, 3, 4}
(b) A1 = {x | 0 < x < 2} and A2 = {x | 1 ≤ x < 3}
(c) A1 = {x | 0 ≤ x < 1} and A2 = {x | 2 < x ≤ 3}.
Solution
(a) We have:
A1 ∪ A2 = {0, 1, 2, 3, 4} and A1 ∩ A2 = {2}.
(b) We have:
A1 ∪ A2 = {x | 0 < x < 3} and A1 ∩ A2 = {x | 1 ≤ x < 2}.
(c) We have:
A1 ∪ A2 = {x | 0 ≤ x < 1 or 2 < x ≤ 3} and A1 ∩ A2 = ∅.
Activity 2.6 Let A, B and C be events in a sample space, S. Using only the
symbols ∪, ∩, () and c , find expressions for the following events:
(a) only A occurs
(b) none of the three events occurs
(c) exactly one of the three events occurs
(d) at least two of the three events occur
(e) exactly two of the three events occur.
Solution
There is more than one way to answer this question, because the sets can be
expressed in different, but logically equivalent, forms. One way to do so is the
following.
(a) A ∩ B c ∩ C c , i.e. A and not B and not C.
(b) Ac ∩ B c ∩ C c , i.e. not A and not B and not C.
(c) (A ∩ B c ∩ C c ) ∪ (Ac ∩ B ∩ C c ) ∪ (Ac ∩ B c ∩ C), i.e. only A or only B or only C.
(d) (A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C), i.e. A and B, or A and C, or B and C.
Note that this includes A ∩ B ∩ C as a subset, so we do not need to write
(A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C) ∪ (A ∩ B ∩ C) separately.
(e) ((A ∩ C) ∪ (A ∩ B) ∪ (B ∩ C)) ∩ (A ∩ B ∩ C)c , i.e. A and B, or A and C, or B
and C, but not A and B and C.
Activity 2.7 Let A and B be events in a sample space S. Use Venn diagrams to
convince yourself that the two De Morgan’s laws:
(A ∩ B)c = Ac ∪ B c     (1)
and:
(A ∪ B)c = Ac ∩ B c     (2)
are correct. For each of them, draw two Venn diagrams – one for the expression on
the left-hand side of the equation, and one for the right-hand side. Shade the areas
corresponding to each expression, and hence show that for both (1) and (2) the
left-hand and right-hand sides describe the same set.
Solution
For (A ∩ B)c = Ac ∪ B c we have:
For (A ∪ B)c = Ac ∩ B c we have:
2.5 Axiomatic definition of probability
First, we consider four basic concepts in probability.
An experiment is a process which produces outcomes and which can have several
different outcomes. The sample space S is the set of all possible outcomes of the
experiment. An event is any subset A of the sample space such that A ⊂ S.
Example 2.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.
The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows:
A ∩ B: both A and B happen.
A ∪ B: either A or B happens (or both happen).
Ac : A does not happen, i.e. something other than A happens.
Once we introduce probabilities of events, we can also say that:
the sample space, S, is a certain event
the empty set, ∅, is an impossible event.
Axioms of probability
‘Probability’ is formally defined as a function P (·) from subsets (events) of the sample
space S onto real numbers.2 Such a function is a probability function if it satisfies
the following axioms (‘self-evident truths’).
Axiom 1:
P (A) ≥ 0 for all events A.
Axiom 2:
P (S) = 1.
Axiom 3:
If events A1, A2, . . . are pairwise disjoint (i.e. Ai ∩ Aj = ∅ for all i ≠ j), then:
P (⋃_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P (Ai).
The axioms require that a probability function must always satisfy these requirements.
Axiom 1 requires that probabilities are always non-negative.
Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.
Axiom 3 states that if events A1 , A2 , . . . are mutually exclusive, the probability of
their union is simply the sum of their individual probabilities.
All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.
2.5.1 Basic properties of probability
Probability property
For the empty set, ∅, we have:
P (∅) = 0.     (2.1)
Probability property (finite additivity)
If A1 , A2 , . . . , An are pairwise disjoint, then:
P (⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} P (Ai).

2. The precise definition also requires a careful statement of which subsets of S are allowed as events, which we can skip on this course.
In pictures, the previous result means that in a situation like the one shown in Figure
2.7, the probability of the combined event A = A1 ∪ A2 ∪ A3 is simply the sum of the
probabilities of the individual events:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).
That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.
Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1, A2 and A3. Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.
Probability property
For any event A, we have:
P (Ac ) = 1 − P (A).
Proof : We have that A ∪ Ac = S and A ∩ Ac = ∅. Therefore:
1 = P (S) = P (A ∪ Ac ) = P (A) + P (Ac )
using the previous result, with n = 2, A1 = A and A2 = Ac. ∎
Probability property
For any event A, we have:
P (A) ≤ 1.
Proof (by contradiction): If it was true that P (A) > 1 for some A, then we would have:
P (Ac ) = 1 − P (A) < 0.
This violates Axiom 1, so cannot be true. Therefore, it must be that P (A) ≤ 1 for all A.
Putting this and Axiom 1 together, we get:
0 ≤ P (A) ≤ 1
for all events A. ∎
Probability property
For any two events A and B, if A ⊂ B, then P (A) ≤ P (B).
Proof : We proved in Example 2.7 that we can partition B as B = A ∪ (B ∩ Ac ) where
the two sets in the union are disjoint. Therefore:
P (B) = P (A ∪ (B ∩ Ac )) = P (A) + P (B ∩ Ac ) ≥ P (A)
since P (B ∩ Ac ) ≥ 0. ∎
Probability property
For any two events A and B, we have:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
Proof : Using partitions:
P (A ∪ B) = P (A ∩ B c ) + P (A ∩ B) + P (Ac ∩ B)
P (A) = P (A ∩ B c ) + P (A ∩ B)
P (B) = P (Ac ∩ B) + P (A ∩ B)
and hence:
P (A ∪ B) = [P (A) − P (A ∩ B)] + P (A ∩ B) + [P (B) − P (A ∩ B)]
= P (A) + P (B) − P (A ∩ B).
In summary, the probability function has the following properties.
P (S) = 1 and P (∅) = 0.
0 ≤ P (A) ≤ 1 for all events A.
If A ⊂ B, then P (A) ≤ P (B).
These show that the probability function has the kinds of values we expect of something
called a ‘probability’.
P (Ac ) = 1 − P (A).
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
These are useful for deriving probabilities of new events.
Example 2.9 Suppose that, on an average weekday, of all adults in a country:
86% spend at least 1 hour watching television (event A, with P (A) = 0.86)
19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19)
15% spend at least 1 hour watching television and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).
We select a member of the population for an interview at random. For example, we
then have:
P (Ac ) = 1 − P (A) = 1 − 0.86 = 0.14, which is the probability that the
respondent watches less than 1 hour of television
P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.86 + 0.19 − 0.15 = 0.90, which is the
probability that the respondent spends at least 1 hour watching television or
reading newspapers (or both).
Activity 2.8
(a) A, B and C are any three events in the sample space S. Prove that:
P (A∪B∪C) = P (A)+P (B)+P (C)−P (A∩B)−P (B∩C)−P (A∩C)+P (A∩B∩C).
(b) A and B are events in a sample space S. Show that:
P (A ∩ B) ≤ (P (A) + P (B))/2 ≤ P (A ∪ B).
Solution
(a) We know P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F ).
Consider A ∪ B ∪ C as (A ∪ B) ∪ C (i.e. as the union of the two sets A ∪ B and
C) and then apply the result above to obtain:
P (A ∪ B ∪ C) = P ((A ∪ B) ∪ C) = P (A ∪ B) + P (C) − P ((A ∪ B) ∩ C).
Now (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) – a Venn diagram can be drawn to check
this.
So:
P (A ∪ B ∪ C) = P (A ∪ B) + P (C) − (P (A ∩ C) + P (B ∩ C) − P ((A ∩ C) ∩ (B ∩ C)))
using the earlier result again for A ∩ C and B ∩ C.
Now (A ∩ C) ∩ (B ∩ C) = A ∩ B ∩ C and if we apply the earlier result once
more for A and B, we obtain:
P (A∪B∪C) = P (A)+P (B)−P (A∩B)+P (C)−P (A∩C)−P (B∩C)+P (A∩B∩C)
which is the required result.
(b) Use the result that if X ⊂ Y then P (X) ≤ P (Y ) for events X and Y .
Since A ⊂ A ∪ B and B ⊂ A ∪ B, we have P (A) ≤ P (A ∪ B) and
P (B) ≤ P (A ∪ B).
Adding these inequalities, P (A) + P (B) ≤ 2 × P (A ∪ B) so:
(P (A) + P (B))/2 ≤ P (A ∪ B).
Similarly, A ∩ B ⊂ A and A ∩ B ⊂ B, so P (A ∩ B) ≤ P (A) and
P (A ∩ B) ≤ P (B).
Adding, 2 × P (A ∩ B) ≤ P (A) + P (B) so:
P (A ∩ B) ≤ (P (A) + P (B))/2.
What does ‘probability’ mean?
Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined below. The mathematical theory of probability
and calculations on probabilities are the same whichever interpretation we assign to
‘probability’. So, in this course, we do not need to discuss the matter further.
Frequency interpretation of probability
This states that the probability of an outcome A of an experiment is the proportion
(relative frequency) of trials in which A would be the outcome if the experiment was
repeated a very large number of times under similar conditions.
Example 2.10 How should we interpret the following, as statements about the real
world of coins and babies?
‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.
‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.
How to find probabilities?
A key question is how to determine appropriate numerical values of P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.
Example 2.11 Consider the following.
If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.
Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in this population.
The estimation of probabilities of events from observed data is an important part of
statistics.
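The frequency interpretation can be mimicked by simulation. A minimal R sketch follows; the number of simulated tosses (10000) is an arbitrary illustrative choice.

set.seed(1)                                                   # for reproducibility
tosses <- sample(c("heads", "tails"), 10000, replace = TRUE)  # simulated fair coin tosses
mean(tosses == "heads")                                       # relative frequency, close to 0.5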
2.6 Classical probability and counting rules
Classical probability is a simple special case where values of probabilities can be
found by just counting outcomes. This requires that:
the sample space contains only a finite number of outcomes
all of the outcomes are equally likely.
Standard illustrations of classical probability are devices used in games of chance, such
as:
tossing a coin (heads or tails) one or more times
rolling one or more dice (each scored 1, 2, 3, 4, 5 or 6)
drawing one or more playing cards from a deck of 52 cards.
We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space, S, contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Therefore:
P (A) = k/m = (number of outcomes in A)/(total number of outcomes in the sample space, S).
That is, the probability of A is the proportion of outcomes which belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible
outcomes.
Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?
The sample space is the 36 ordered pairs:
S = {(1, 1), (1, 2), (1, 3), (1, 4) , (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3) , (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2) , (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1) , (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}.
The probability is P (A) = 4/36 = 1/9.
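Because classical probability is just a ratio of counts, results such as this can be checked by brute-force enumeration. A minimal illustrative Python sketch:

```python
from itertools import product

# Sample space: all 36 equally likely ordered pairs of scores from two dice.
sample_space = list(product(range(1, 7), repeat=2))

# Event A: the sum of the two scores is 5.
event_A = [outcome for outcome in sample_space if sum(outcome) == 5]

print(event_A)                           # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(len(event_A) / len(sample_space))  # 0.111... = 1/9
```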
Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.
Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?
The complement is that the sum is at most 3, i.e. the complementary event is
Ac = {(1, 1), (1, 2), (2, 1)}.
Therefore, P (A) = 1 − 3/36 = 33/36 = 11/12.
The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.
Example 2.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?
P (A) = 6/36, P (B) = 3/36 and P (A ∩ B) = 1/36.
So P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = (6 + 3 − 1)/36 = 8/36 = 2/9.
Activity 2.9 Assume that a calculator has a ‘random number’ key and that when
the key is pressed an integer between 0 and 999 inclusive is generated at random, all
numbers being generated independently of one another.
(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than
300?
(c) If two numbers are generated, what is the probability that the first number
exceeds the second number?
(d) If two numbers are generated, what is the probability that the first number
exceeds the second number, and their sum is exactly 300?
(e) If five numbers are generated, what is the probability that at least one number
occurs more than once?
Solution
(a) Simply 300/1000 = 0.3.
(b) Simply 0.3 × 0.3 = 0.09.
(c) Suppose P (first greater) = x, then by symmetry we have that
P (second greater) = x. However, the probability that both are equal is (by
counting):
1000/1000000 = 0.001, counting the outcomes {0, 0}, {1, 1}, . . . , {999, 999}.
Hence x + x + 0.001 = 1, so x = 0.4995.
(d) The following cases apply: {300, 0}, {299, 1}, . . . , {151, 149}, i.e. there are 150 possibilities out of 10^6 = 1,000,000. So the required probability is:
150/1000000 = 0.00015.
(e) The probability that they are all different is (noting that the first number can
be any number):
1 × (999/1000) × (998/1000) × (997/1000) × (996/1000) ≈ 0.990035.
Subtracting from 1 gives the required probability, i.e. 0.009965.
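As a rough numerical check of parts (c) to (e), the same counting arguments can be reproduced directly; this is an illustrative sketch only.

```python
# (c) P(first number exceeds second): count ordered pairs (a, b) with a > b.
total_pairs = 1000 * 1000
greater = sum(1 for a in range(1000) for b in range(1000) if a > b)
print(greater / total_pairs)        # 0.4995

# (d) P(first exceeds second and the sum is exactly 300).
favourable = sum(1 for a in range(1000) for b in range(1000) if a > b and a + b == 300)
print(favourable / total_pairs)     # 0.00015

# (e) P(at least one repeat among five generated numbers).
all_different = 1.0
for i in range(5):
    all_different *= (1000 - i) / 1000
print(1 - all_different)            # about 0.009965
```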
Activity 2.10 A box contains r red balls and b blue balls. One ball is selected at
random and its colour is observed. The ball is then returned to the box and k
additional balls of the same colour are also put into the box. A second ball is then
selected at random, its colour is observed, and it is returned to the box together
with k additional balls of the same colour. Each time another ball is selected, the
process is repeated. If four balls are selected, what is the probability that the first
three balls will be red and the fourth ball will be blue?
Hint: Your answer should be a function of r, b and k.
Solution
Let Ri be the event that a red ball is drawn on the ith draw, and let Bi be the event
that a blue ball is drawn on the ith draw, for i = 1, . . . , 4. Therefore, we have:
P (R1) = r/(r + b)
P (R2 | R1) = (r + k)/(r + b + k)
P (R3 | R1 ∩ R2) = (r + 2k)/(r + b + 2k)
P (B4 | R1 ∩ R2 ∩ R3) = b/(r + b + 3k)
where ‘|’ means ‘given’, notation which will be formally introduced later in the chapter with conditional probability. The required probability is the product of these four probabilities, namely:
r(r + k)(r + 2k)b / [(r + b)(r + b + k)(r + b + 2k)(r + b + 3k)].
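The product formula can be checked by simulating the urn scheme for particular values of r, b and k; the values below are arbitrary and the sketch is illustrative only.

```python
import random

def prob_rrrb_formula(r, b, k):
    """Exact probability of red, red, red, blue from the product of conditional probabilities."""
    return (r * (r + k) * (r + 2 * k) * b) / (
        (r + b) * (r + b + k) * (r + b + 2 * k) * (r + b + 3 * k))

def prob_rrrb_simulated(r, b, k, n_trials=200_000):
    """Estimate the same probability by simulating the draw-and-add-k scheme."""
    hits = 0
    for _ in range(n_trials):
        reds, blues, colours = r, b, []
        for _ in range(4):
            is_red = random.random() < reds / (reds + blues)
            colours.append(is_red)
            if is_red:
                reds += k      # ball returned plus k extra of the same colour
            else:
                blues += k
        hits += (colours == [True, True, True, False])
    return hits / n_trials

print(prob_rrrb_formula(3, 2, 1))    # exact value for r = 3, b = 2, k = 1
print(prob_rrrb_simulated(3, 2, 1))  # should be close to the exact value
```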
2.6.1
Combinatorial counting methods
A powerful set of counting methods answers the following question: how many ways are
there to select k objects out of n distinct objects?
The answer will depend on:
whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once)
whether the selected set is treated as ordered or unordered.
Ordered sets, with replacement
Suppose that the selection of k objects out of n needs to be:
ordered, so that the selection is an ordered sequence where we distinguish between
the 1st object, 2nd, 3rd etc.
with replacement, so that each of the n objects may appear several times in the
selection.
Therefore:
n objects are available for selection for the 1st object in the sequence
n objects are available for selection for the 2nd object in the sequence
. . . and so on, until n objects are available for selection for the kth object in the
sequence.
Therefore, the number of possible ordered sequences of k objects selected with
replacement from n objects is:
n × n × · · · × n (k times) = n^k.
Ordered sets, without replacement
Suppose that the selection of k objects out of n is again treated as an ordered sequence,
but that selection is now:
ordered, so that the selection is an ordered sequence where we distinguish between
the 1st object, 2nd, 3rd etc.
without replacement, so that if an object is selected once, it cannot be selected
again.
Now:
n objects are available for selection for the 1st object in the sequence
n − 1 objects are available for selection for the 2nd object
n − 2 objects are available for selection for the 3rd object
. . . and so on, until n − k + 1 objects are available for selection for the kth object.
Therefore, the number of possible ordered sequences of k objects selected without
replacement from n objects is:
n × (n − 1) × · · · × (n − k + 1).
(2.2)
An important special case is when k = n.
Factorials
The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.
Using factorials, (2.2) can be written as:
n × (n − 1) × · · · × (n − k + 1) = n!/(n − k)!.
Unordered sets, without replacement
Suppose now that the identities of the objects in the selection matter, but the order
does not.
For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.
The number of such unordered subsets (combinations) of k out of n objects is
determined as follows.
The number of ordered sequences is n!/(n − k)!.
Among these, every different combination of k distinct elements appears k! times,
in different orders.
Ignoring the ordering, there are:
nCk = n! / [(n − k)! k!]
different combinations, for each k = 0, 1, . . . , n.
The number nCk is known as the binomial coefficient. Note that because 0! = 1, we have nC0 = nCn = 1, so there is only 1 way of selecting 0 or n out of n objects.
Example 2.15 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending February 29th does not exist, so that n = 365) in the following cases?
1. It makes a difference who has which birthday (ordered ), i.e. Amy (January 1st),
Bob (May 5th) and Sam (December 5th) is different from Amy (May 5th), Bob
(December 5th) and Sam (January 1st), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:
(365)^3 = 48,627,125.
2. It makes a difference who has which birthday (ordered ), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365!/(365 − 3)! = 365 × 364 × 363 = 48,228,180.
3. Only the dates matter, but not who has which one (unordered ), i.e. Amy
(January 1st), Bob (May 5th) and Sam (December 5th) is treated as the same
as Amy (May 5th), Bob (December 5th) and Sam (January 1st), and different
people must have different birthdays (without replacement). The number of
different sets of birthdays is:
365C3 = 365!/[(365 − 3)! 3!] = (365 × 364 × 363)/(3 × 2 × 1) = 8,038,030.
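The three counts in Example 2.15 can be reproduced with standard library counting functions (Python 3.8 or later); an illustrative check:

```python
import math

n, k = 365, 3

print(n ** k)           # ordered, with replacement:      48,627,125
print(math.perm(n, k))  # ordered, without replacement:   48,228,180
print(math.comb(n, k))  # unordered, without replacement:  8,038,030
```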
Example 2.16 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following.
1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement) is (365)^r.
2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.
Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:
P (Ac) = [365!/(365 − r)!] / (365)^r = [365 × 364 × · · · × (365 − r + 1)] / (365)^r
and:
P (A) = 1 − P (Ac) = 1 − [365 × 364 × · · · × (365 − r + 1)] / (365)^r.
Probabilities, for P (A), of at least two people sharing a birthday, for different values
of the number of people r are given in the following table:
r    P(A)       r    P(A)       r    P(A)       r    P(A)
2    0.003     12    0.167     22    0.476     32    0.753
3    0.008     13    0.194     23    0.507     33    0.775
4    0.016     14    0.223     24    0.538     34    0.795
5    0.027     15    0.253     25    0.569     35    0.814
6    0.040     16    0.284     26    0.598     36    0.832
7    0.056     17    0.315     27    0.627     37    0.849
8    0.074     18    0.347     28    0.654     38    0.864
9    0.095     19    0.379     29    0.681     39    0.878
10   0.117     20    0.411     30    0.706     40    0.891
11   0.141     21    0.444     31    0.730     41    0.903
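The table can be reproduced directly from the formula for P (A). The illustrative sketch below also reports the smallest r with P (A) > 1/2, which the table suggests is r = 23.

```python
def prob_shared_birthday(r):
    """P(at least two of r people share a birthday), assuming 365 equally likely days."""
    p_all_different = 1.0
    for i in range(r):
        p_all_different *= (365 - i) / 365
    return 1 - p_all_different

smallest_r = next(r for r in range(2, 366) if prob_shared_birthday(r) > 0.5)
print(smallest_r)                          # 23
print(round(prob_shared_birthday(23), 3))  # 0.507, matching the table
```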
Activity 2.11 A box contains 18 light bulbs, of which two are defective. If a person
selects 7 bulbs at random, without replacement, what is the probability that both
defective bulbs will be selected?
Solution
The sample space consists of all (unordered) subsets of 7 out of the 18 light bulbs in the box. There are 18C7 such subsets. The number of subsets which contain the two defective bulbs is the number of subsets of size 5 out of the other 16 bulbs, 16C5, so the probability we want is:
16C5 / 18C7 = (7 × 6)/(18 × 17) = 0.1373.
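The same answer follows directly from binomial coefficients; a minimal illustrative check:

```python
import math

# P(both defective bulbs are among the 7 selected) = 16C5 / 18C7.
print(round(math.comb(16, 5) / math.comb(18, 7), 4))   # 0.1373
```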
2.7
Conditional probability and Bayes’ theorem
Next we introduce some of the most important concepts in probability:
independence
conditional probability
Bayes’ theorem.
These give us powerful tools for:
deriving probabilities of combinations of events
updating probabilities of events, after we learn that some other event has happened.
Independence
Two events A and B are (statistically) independent if:
P (A ∩ B) = P (A) P (B).
Independence is sometimes denoted A ⊥⊥ B. Intuitively, independence means that:
if A happens, this does not affect the probability of B happening (and vice versa)
if you are told that A has happened, this does not give you any new information
about the value of P (B) (and vice versa).
For example, independence is often a reasonable assumption when A and B
correspond to physically separate experiments.
Example 2.17 Suppose we roll two dice. We assume that all combinations of the
values of them are equally likely. Define the events:
A = ‘Score of die 1 is not 6’
B = ‘Score of die 2 is not 6’.
Therefore:
P (A) = 30/36 = 5/6
P (B) = 30/36 = 5/6
P (A ∩ B) = 25/36 = 5/6 × 5/6 = P (A) P (B), so A and B are independent.
Activity 2.12 A and B are independent events.
Suppose that P (A) = 2π, P (B) = π and P (A ∪ B) = 0.8. Evaluate π.
Solution
Using the probability property P (A ∪ B) = P (A) + P (B) − P (A ∩ B), and the
definition of independent events P (A ∩ B) = P (A) P (B), we have:
P (A ∪ B) = 0.8 = P (A) + P (B) − P (A ∩ B)
= P (A) + P (B) − P (A) P (B)
= 2π + π − 2π 2 .
Therefore, applying the quadratic formula from mathematics:
2π^2 − 3π + 0.8 = 0   ⇒   π = (3 ± √(9 − 6.4))/4.
Hence π = 0.346887, since the other root is > 1 which is impossible for a probability!
Activity 2.13 A and B are events such that P (A | B) > P (A). Prove that:
P (Ac | B c ) > P (Ac )
where Ac and B c are the complements of A and B, respectively, and P (B c ) > 0.
Solution
From the definition of conditional probability:
P (Ac | Bc) = P (Ac ∩ Bc)/P (Bc) = P ((A ∪ B)c)/P (Bc) = [1 − P (A) − P (B) + P (A ∩ B)] / [1 − P (B)].
However:
P (A | B) = P (A ∩ B)/P (B) > P (A)
i.e. P (A ∩ B) > P (A) P (B). Hence:
P (Ac | Bc) > [1 − P (A) − P (B) + P (A) P (B)] / [1 − P (B)] = 1 − P (A) = P (Ac).
Activity 2.14 A and B are any two events in the sample space S. The binary set
operator ∨ denotes an exclusive union, such that:
A ∨ B = (A ∪ B) ∩ (A ∩ B)c = {s | s ∈ A or B, and s ∉ (A ∩ B)}.
Show, from the axioms of probability, that:
(a) P (A ∨ B) = P (A) + P (B) − 2P (A ∩ B)
(b) P (A ∨ B | A) = 1 − P (B | A).
Solution
(a) We have:
A ∨ B = (A ∩ B c ) ∪ (B ∩ Ac ).
By axiom 3, noting that (A ∩ B c ) and (B ∩ Ac ) are disjoint:
P (A ∨ B) = P (A ∩ B c ) + P (B ∩ Ac ).
We can write A = (A ∩ B) ∪ (A ∩ B c ), hence (using axiom 3):
P (A ∩ B c ) = P (A) − P (A ∩ B).
Similarly, P (B ∩ Ac ) = P (B) − P (A ∩ B), hence:
P (A ∨ B) = P (A) + P (B) − 2P (A ∩ B).
(b) We have:
P (A ∨ B | A) = P ((A ∨ B) ∩ A)/P (A)
= P (A ∩ Bc)/P (A)
= [P (A) − P (A ∩ B)]/P (A)
= P (A)/P (A) − P (A ∩ B)/P (A)
= 1 − P (B | A).
Activity 2.15 Suppose that we toss a fair coin twice. The sample space is given by:
S = {HH, HT, T H, T T }
where the elementary outcomes are defined in the obvious way – for instance HT is
heads on the first toss and tails on the second toss. Show that if all four elementary
outcomes are equally likely, then the events ‘heads on the first toss’ and ‘heads on
the second toss’ are independent.
Solution
Note carefully here that we have equally likely elementary outcomes (due to the coin
being fair), so that each has probability 1/4, and the independence follows.
The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2,
because it is specified by two elementary outcomes. The event ‘heads on the second
toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss
and the second toss’ is A ∩ B = {HH} and has probability 1/4. So the
multiplication property P (A ∩ B) = 1/4 = 1/2 × 1/2 = P (A) P (B) is satisfied, and
the two events are independent.
2.7.1
Independence of multiple events
Events A1 , A2 , . . . , An are independent if the probability of the intersection of any subset
of these events is the product of the individual probabilities of the events in the subset.
This implies the important result that if events A1 , A2 , . . . , An are independent, then:
P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 ) · · · P (An ).
Note that there is a difference between pairwise independence and full independence.
The following example illustrates.
Example 2.18 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher
C only has a scarf and Teacher D only has gloves. One teacher out of the four is
selected at random. It is shown that although each pair of events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher
selected has gloves’ are independent, all three of these events are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:
P (H) = 2/4 = 1/2,   P (S) = 2/4 = 1/2   and   P (G) = 2/4 = 1/2.
Only one teacher has both a hat and a scarf, so:
P (H ∩ S) = 1/4
and similarly:
P (H ∩ G) = 1/4   and   P (S ∩ G) = 1/4.
From these results, we can verify that:
P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)
and so the events are pairwise independent. However, one teacher has a hat, a scarf
and gloves, so:
1
P (H ∩ S ∩ G) = 6= P (H) P (S) P (G).
4
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.
Activity 2.16 A, B and C are independent events. Prove that A and (B ∪ C) are
independent.
Solution
We need to show that the joint probability of A ∩ (B ∪ C) equals the product of the
probabilities of A and B ∪ C, i.e. we need to show that
P (A ∩ (B ∪ C)) = P (A) P (B ∪ C).
Using the distributive law:
P (A ∩ (B ∪ C)) = P ((A ∩ B) ∪ (A ∩ C))
= P (A ∩ B) + P (A ∩ C) − P (A ∩ B ∩ C)
= P (A) P (B) + P (A) P (C) − P (A) P (B) P (C)
= P (A)(P (B) + P (C) − P (B) P (C))
= P (A) P (B ∪ C).
Activity 2.17 Suppose that three components numbered 1, 2 and 3 have
probabilities of failure π1 , π2 and π3 , respectively. Determine the probability of a
system failure in each of the following cases where component failures are assumed
to be independent.
(a) Parallel system – the system fails if all components fail.
(b) Series system – the system fails unless all components do not fail.
(c) Mixed system – the system fails if component 1 fails or if both component 2 and
component 3 fail.
Solution
(a) Since the component failures are independent, the probability of system failure
is π1 π2 π3 .
(b) The probability that component i does not fail is 1 − πi , hence the probability
that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability
that the system fails is:
1 − (1 − π1 )(1 − π2 )(1 − π3 ).
(c) Components 2 and 3 may be combined to form a notional component 4 with
failure probability π2 π3 . So the system is equivalent to a component with failure
probability π1 and another component with failure probability π2 π3 , these being
connected in series. Therefore, the failure probability is:
1 − (1 − π1 )(1 − π2 π3 ) = π1 + π2 π3 − π1 π2 π3 .
Activity 2.18 Write down the condition for three events A, B and C to be
independent.
Solution
Applying the product rule, we must have:
P (A ∩ B ∩ C) = P (A) P (B) P (C).
Therefore, since all subsets of two events from A, B and C must be independent, we
must also have:
P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)
and:
P (B ∩ C) = P (B) P (C).
One must check that all four conditions hold to verify independence of A, B and C.
Activity 2.19 An electrical device contains 8 components connected in a sequence.
The device fails if any one of the components fails. For each component the
probability that it survives a year of use without failing is π, and the failures of
different components can be regarded as independent events.
(a) What is the probability that the device fails in a year of use?
(b) How large must π be for the probability of failure in (a) to be less than 0.05?
Solution
(a) It is often easier to evaluate the probability of the complement of the event
specified. Here, we calculate:
P (device does not fail) = P (every component works) = π^8
and hence P (device fails) = 1 − π^8.
It is always a good idea to do a quick ‘reality check’ of your answer. If you
calculated, say, the probability to be 8 (1 − π), this must be wrong because for
some values of π you would have a probability greater than 1!
(b) We require 1 − π^8 < 0.05, which is true if π > (0.95)^(1/8) ≈ 0.9936.
Activity 2.20 Suppose A and B are independent events, i.e. P (A ∩ B) =
P (A) P (B). Prove that:
(a) A and B c are independent
(b) Ac and B c are independent.
Solution
(a) Note that A = (A ∩ B) ∪ (A ∩ B c ) is a partition, and hence P (A) = P (A ∩ B)+
P (A ∩ B c ). It follows from this that:
P (A ∩ B c ) = P (A) − P (A ∩ B)
= P (A) − P (A) P (B) (due to independence of A and B)
= P (A)[1 − P (B)]
= P (A) P (B c ).
(b) Here we first use one of De Morgan’s laws such that:
P (Ac ∩ B c ) = P ((A ∪ B)c )
= 1 − P (A ∪ B)
= 1 − [P (A) + P (B) − P (A ∩ B)]
= 1 − P (A) − P (B) + P (A) P (B)
= [1 − P (A)][1 − P (B)]
= P (Ac ) P (B c ).
Activity 2.21 Hard question!
Two boys, James A and James B, throw a ball at a target. Suppose that the
probability that James A will hit the target on any throw is 1/4 and the probability
that James B will hit the target on any throw is 1/5. Suppose also that James A
throws first and the two boys take turns throwing.
(a) Determine the probability that the target will be hit for the first time on the
third throw of James A.
(b) Determine the probability that James A will hit the target before James B does.
Solution
(a) In order for the target to be hit for the first time on the third throw of James A,
all five of the following independent events must occur: (i) James A misses on
his first throw, (ii) James B misses on his first throw, (iii) James A misses on
his second throw, (iv) James B misses on his second throw, and (v) James A
hits the target on his third throw. The probability of all five events occurring is:
(3/4) × (4/5) × (3/4) × (4/5) × (1/4) = 9/100.
(b) Let A denote the event that James A hits the target before James B. There are
two methods of solving this problem.
1. The first method is to note that A can occur in two different ways. (i)
James A hits the target on the first throw, which occurs with probability
1/4. (ii) Both Jameses miss the target on their first throws, and then
subsequently James A hits the target before James B. The probability that
both Jameses miss on their first throws is:
(3/4) × (4/5) = 3/5.
When they do miss, the conditions of the game become exactly the same as
they were at the beginning of the game. In effect, it is as if the boys were
starting a new game all over again, and so the probability that James A
will subsequently hit the target before James B is again P (A). Therefore,
by considering these two ways in which the event A can occur, we have:
P (A) = 1/4 + (3/5) P (A)   ⇒   P (A) = 5/8.
2. The second method of solving the problem is to calculate the probabilities
that the target will be hit for the first time on James A’s first throw, on his
second throw, on his third throw etc. and then to sum these probabilities.
For the target to be hit for the first time on James A’s ith throw, both
Jameses must miss on each of their first i − 1 throws, and then James A
must hit the target on his next throw. The probability of this event is:
(3/4)^(i−1) × (4/5)^(i−1) × (1/4) = (1/4) (3/5)^(i−1).
Hence:
P (A) = (1/4) ∑_{i=1}^{∞} (3/5)^(i−1) = (1/4) × 1/(1 − 3/5) = 5/8
which uses the sum to infinity of a geometric series (with common ratio less
than 1 in absolute value) from mathematics.
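Both methods give P (A) = 5/8. A quick simulation of the alternating throws (illustrative only, with hit probabilities 1/4 and 1/5 as in the question) gives approximately the same value.

```python
import random

def james_a_wins():
    """Simulate alternating throws until someone hits; True if James A hits first."""
    while True:
        if random.random() < 1 / 4:   # James A throws
            return True
        if random.random() < 1 / 5:   # then James B throws
            return False

n_trials = 200_000
print(sum(james_a_wins() for _ in range(n_trials)) / n_trials)   # close to 5/8 = 0.625
```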
2.7.2
Independent versus mutually exclusive events
The idea of independent events is quite different from that of mutually exclusive
(disjoint) events, as shown in Figure 2.8.
For mutually exclusive events A ∩ B = ∅, and so, from (2.1), P (A ∩ B) = 0. For
independent events, P (A ∩ B) = P (A) P (B). So since P (A ∩ B) = 0 ≠ P (A) P (B) in
general (except in the uninteresting case when P (A) = 0 or P (B) = 0), then mutually
exclusive events and independent events are different.
In fact, mutually exclusive events are extremely non-independent (i.e. dependent). For
example, if you know that A has happened, you know for certain that B has not
happened. There is no particularly helpful way to represent independent events using a
Figure 2.8: Venn diagram depicting mutually exclusive events.
Venn diagram.
Conditional probability
Consider two events A and B. Suppose you are told that B has occurred. How does
this affect the probability of event A?
The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:
P (A | B) = P (A ∩ B)/P (B)
assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.
Example 2.19 Suppose we roll two independent fair dice again. Consider the
following events.
A = ‘at least one of the scores is 2’.
B = ‘the sum of the scores is greater than 7’.
These are shown in Figure 2.9. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and
P (A ∩ B) = 2/36. Therefore, the conditional probability of A given B is:
P (A | B) = P (A ∩ B)/P (B) = (2/36)/(15/36) = 2/15 ≈ 0.13.
Learning that B has occurred causes us to revise (update) the probability of A
downward, from 0.31 to 0.13.
One way to think about conditional probability is that when we condition on B, we
redefine the sample space to be B.
Figure 2.9: Events A, B and A ∩ B for Example 2.19.
Example 2.20 In Example 2.19, when we are told that the conditioning event B
has occurred, we know we are within the green line in Figure 2.9. So the 15
outcomes within it become the new sample space. There are 2 outcomes which
satisfy A and which are inside this new sample space, so:
P (A | B) = (number of cases of A within B)/(number of cases of B) = 2/15.
Activity 2.22 If all elementary outcomes are equally likely, S = {a, b, c, d},
A = {a, b, c} and B = {c, d}, find P (A | B) and P (B | A).
Solution
S has 4 elementary outcomes which are equally likely, so each elementary outcome
has probability 1/4.
We have:
P (A | B) = P (A ∩ B)/P (B) = P ({c})/P ({c, d}) = (1/4)/(1/4 + 1/4) = 1/2
and:
P (B | A) = P (B ∩ A)/P (A) = P ({c})/P ({a, b, c}) = (1/4)/(1/4 + 1/4 + 1/4) = 1/3.
Activity 2.23 Show that if A and B are disjoint events, and are also independent,
then P (A) = 0 or P (B) = 0. (Note that independence and disjointness are not
similar ideas.)
Solution
It is important to get the logical flow in the right direction here. We are told that A
and B are disjoint events, that is:
A ∩ B = ∅.
So:
P (A ∩ B) = 0.
We are also told that A and B are independent, that is:
P (A ∩ B) = P (A) P (B).
It follows that:
0 = P (A) P (B)
and so either P (A) = 0 or P (B) = 0.
Activity 2.24 Suppose A and B are events with P (A) = p, P (B) = 2p and
P (A ∪ B) = 0.75.
(a) Evaluate p and P (A | B) if A and B are independent events.
(b) Evaluate p and P (A | B) if A and B are mutually exclusive events.
Solution
(a) We know that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). For independent events A
and B, P (A ∩ B) = P (A) P (B), so P (A ∪ B) = P (A) + P (B) − P (A) P (B)
gives 0.75 = p + 2p − 2p2 , or 2p2 − 3p + 0.75 = 0.
Solving the quadratic equation gives:
p = (3 − √3)/4 ≈ 0.317
suppressing the irrelevant case for which p > 1.
Since A and B are independent, P (A | B) = P (A) = p = 0.317.
(b) For mutually exclusive events, P (A ∪ B) = P (A) + P (B), so 0.75 = p + 2p,
leading to p = 0.25.
Here P (A ∩ B) = 0, so P (A | B) = P (A ∩ B)/P (B) = 0.
Activity 2.25
(a) Show that if A and B are independent events in a sample space, then Ac and B c
are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then X c
and Y c are not in general mutually exclusive.
Solution
(a) We are given that A and B are independent, so P (A ∩ B) = P (A) P (B). We
need to show a similar result for Ac and B c , namely we need to show that
P (Ac ∩ B c ) = P (Ac ) P (B c ).
Now Ac ∩ B c = (A ∪ B)c from basic set theory (draw a Venn diagram), hence:
P (Ac ∩ B c ) = P ((A ∪ B)c )
= 1 − P (A ∪ B)
= 1 − [P (A) + P (B) − P (A ∩ B)]
= 1 − P (A) − P (B) + P (A ∩ B)
= 1 − P (A) − P (B) + P (A) P (B) (independence assumption)
= [1 − P (A)][1 − P (B)] (factorising)
= P (Ac ) P (B c ) (as required).
(b) To show that X c and Y c are not necessarily mutually exclusive when X and Y
are mutually exclusive, the best approach is to find a counterexample. Attempts
to ‘prove’ the result directly are likely to be logically flawed.
Look for a simple example. Suppose we roll a die. Let X = {6} be the event of
obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and
Y are mutually exclusive, but X c = {1, 2, 3, 4, 5} and Y c = {1, 2, 3, 4, 6} have
Xc ∩ Yc ≠ ∅, so Xc and Yc are not mutually exclusive.
Activity 2.26 If C1 , C2 , C3 , . . . are events in S which are pairwise mutually
exclusive (i.e. Ci ∩ Cj = ∅ for all i ≠ j), then, by the axioms of probability:
P (∪_{i=1}^{∞} Ci) = ∑_{i=1}^{∞} P (Ci).     (*)
Suppose that A1 , A2 , . . . are pairwise mutually exclusive events in S. Prove that a
property like (*) also holds for conditional probabilities given some event B, i.e.
prove that:
P (∪_{i=1}^{∞} Ai | B) = ∑_{i=1}^{∞} P (Ai | B).
You can assume that all unions and intersections of Ai and B are also events in S.
Solution
We have:
P (∪_{i=1}^{∞} Ai | B) = P ((∪_{i=1}^{∞} Ai) ∩ B) / P (B) = P (∪_{i=1}^{∞} [Ai ∩ B]) / P (B)
= ∑_{i=1}^{∞} P (Ai ∩ B) / P (B) = ∑_{i=1}^{∞} P (Ai | B)
where the equation on the second line follows from (*) in the question, since Ai ∩ B
are also events in S, and they are pairwise mutually exclusive (i.e. (Ai ∩ B)∩
(Aj ∩ B) = ∅ for all i ≠ j).
2.7.3
Conditional probability of independent events
If A ⊥⊥ B, i.e. P (A ∩ B) = P (A) P (B), and P (B) > 0 and P (A) > 0, then:
P (A | B) = P (A ∩ B)/P (B) = P (A) P (B)/P (B) = P (A)
and:
P (B | A) = P (A ∩ B)/P (A) = P (A) P (B)/P (A) = P (B).
In other words, if A and B are independent, learning that B has occurred does not
change the probability of A, and learning that A has occurred does not change the
probability of B. This is exactly what we would expect under independence.
2.7.4
Chain rule of conditional probabilities
Since P (A | B) = P (A ∩ B)/P (B), then:
P (A ∩ B) = P (A | B) P (B).
That is, the probability that both A and B occur is the probability that A occurs given
that B has occurred multiplied by the probability that B occurs. An intuitive graphical
version of this is a path diagram:
• ——— B ——— A
The path to A is to get first to B, and then from B to A.
It is also true that:
P (A ∩ B) = P (B | A) P (A)
and you can use whichever is more convenient. Very often some version of this chain
rule is much easier than calculating P (A ∩ B) directly.
The chain rule generalises to multiple events:
P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) · · · P (An | A1 , . . . , An−1 )
where, for example, P (A3 | A1 , A2 ) is shorthand for P (A3 | A1 ∩ A2 ). The events can be
taken in any order, as shown in Example 2.21.
Example 2.21 For n = 3, we have:
P (A1 ∩ A2 ∩ A3 ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 )
= P (A1 ) P (A3 | A1 ) P (A2 | A1 , A3 )
= P (A2 ) P (A1 | A2 ) P (A3 | A1 , A2 )
= P (A2 ) P (A3 | A2 ) P (A1 | A2 , A3 )
= P (A3 ) P (A1 | A3 ) P (A2 | A1 , A3 )
= P (A3 ) P (A2 | A3 ) P (A1 | A2 , A3 ).
Example 2.22 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are 52C4 = 270,725 possible subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore, P (A) = 1/270725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:
P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards
P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn
P (A3 | A1 , A2 ) = 2/50
P (A4 | A1 , A2 , A3 ) = 1/49.
Putting these together with the chain rule gives:
P (A) = P (A1) P (A2 | A1) P (A3 | A1, A2) P (A4 | A1, A2, A3)
= (4/52) × (3/51) × (2/50) × (1/49)
= 24/6497400
= 1/270725.
Here we could obtain the result in two ways. However, there are very many situations
where classical probability and counting rules are not usable, whereas conditional
probabilities and the chain rule are completely general and always applicable.
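As a numerical check on Example 2.22, the chain-rule product and the counting argument can both be evaluated exactly; this illustrative sketch confirms they agree.

```python
import math
from fractions import Fraction

# Chain rule: P(A1) P(A2 | A1) P(A3 | A1, A2) P(A4 | A1, A2, A3).
chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)

# Counting rule: one favourable hand out of 52C4 equally likely hands.
counting = Fraction(1, math.comb(52, 4))

print(chain, counting, chain == counting)   # 1/270725 1/270725 True
```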
More methods for summing probabilities
We now return to probabilities of partitions like the situation shown in Figure 2.10.
Figure 2.10: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3, and on the right the ‘paths’ to A.
Both diagrams in Figure 2.10 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 2.10,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).
2.7.5
Total probability formula
Suppose B1 , B2 , . . . , BK form a partition of the sample space. Therefore, A ∩ B1 ,
A ∩ B2 , . . ., A ∩ BK form a partition of A, as shown in Figure 2.11.
Figure 2.11: On the left, a Venn diagram depicting the set A and the partition of S, and on the right the ‘paths’ to A.
In other words, think of event A as the union of all the A ∩ Bi s, i.e. of ‘all the paths to
A via different intervening events Bi ’.
To get the probability of A, we now:
1. apply the chain rule to each of the paths:
P (A ∩ Bi ) = P (A | Bi ) P (Bi )
2. add up the probabilities of the paths:
P (A) = ∑_{i=1}^{K} P (A ∩ Bi) = ∑_{i=1}^{K} P (A | Bi) P (Bi).
This is known as the formula of total probability. It looks complicated, but it is
actually often far easier to use than trying to find P (A) directly.
Example 2.23 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c ) [1 − P (B)].
Example 2.24 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity. If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity.
If a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and
P (A | B c ) = 0.01. Therefore:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.
Activity 2.27 A man has two bags. Bag A contains five keys and bag B contains
seven keys. Only one of the twelve keys fits the lock which he is trying to open. The
man selects a bag at random, picks out a key from the bag at random and tries that
key in the lock. What is the probability that the key he has chosen fits the lock?
Solution
Define a partition {Ci }, such that:
C1 = key in bag A and bag A chosen  ⇒  P (C1) = (5/12) × (1/2) = 5/24
C2 = key in bag B and bag A chosen  ⇒  P (C2) = (7/12) × (1/2) = 7/24
C3 = key in bag A and bag B chosen  ⇒  P (C3) = (5/12) × (1/2) = 5/24
C4 = key in bag B and bag B chosen  ⇒  P (C4) = (7/12) × (1/2) = 7/24.
Hence we require, defining the event F = ‘key fits’:
P (F) = (1/5) × P (C1) + (1/7) × P (C4) = (1/5) × (5/24) + (1/7) × (7/24) = 1/12.
2.7.6
Bayes’ theorem
So far we have considered how to calculate P (A) for an event A which can happen in
different ways, ‘via’ different events B1 , B2 , . . . , BK .
Now we reverse the question. Suppose we know that A has occurred, as shown in Figure
2.12.
Figure 2.12: Paths to A indicating that A has occurred.
What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.13.
Figure 2.13: A being achieved via B1 .
So we need:
P (Bj | A) = P (A ∩ Bj)/P (A)
and we already know how to get this.
P (A ∩ Bj) = P (A | Bj) P (Bj) from the chain rule.
P (A) = ∑_{i=1}^{K} P (A | Bi) P (Bi) from the total probability formula.
Bayes’ theorem
Using the chain rule and the total probability formula, we have:
P (Bj | A) = P (A | Bj) P (Bj) / ∑_{i=1}^{K} P (A | Bi) P (Bi)
which holds for each Bj , j = 1, . . . , K. This is known as Bayes’ theorem.
Example 2.25 Continuing with Example 2.24, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are:
P (B) = 0.0001,   P (Bc) = 0.9999,   P (A | B) = 0.99   and   P (A | Bc) = 0.01.
Therefore:
P (B | A) = P (A | B) P (B) / [P (A | B) P (B) + P (A | Bc) P (Bc)] = (0.99 × 0.0001)/0.010098 ≈ 0.0098.
Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
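The same calculation is easy to repeat for other prevalences and test accuracies. The following illustrative sketch reproduces the value above; the function name and arguments are not from the guide.

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test), using Bayes' theorem with the partition {B, Bc}."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity
    p_pos = p_pos_given_disease * prevalence + p_pos_given_healthy * (1 - prevalence)
    return p_pos_given_disease * prevalence / p_pos

# Values from Examples 2.24 and 2.25.
print(round(posterior_positive(0.0001, 0.99, 0.99), 4))   # about 0.0098
```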
Activity 2.28 Prove the simplest version of Bayes’ theorem from first principles.
Solution
Applying the definition of conditional probability, we have:
P (B | A) = P (B ∩ A)/P (A) = P (A ∩ B)/P (A) = P (A | B) P (B)/P (A).
Activity 2.29 State and prove Bayes’ theorem.
Solution
Bayes’ theorem is:
P (Bj | A) = P (A | Bj) P (Bj) / ∑_{i=1}^{K} P (A | Bi) P (Bi).
By definition:
P (Bj | A) = P (Bj ∩ A)/P (A) = P (A | Bj) P (Bj)/P (A).
If {Bi }, for i = 1, . . . , K, is a partition of the sample space S, then:
P (A) = ∑_{i=1}^{K} P (A ∩ Bi) = ∑_{i=1}^{K} P (A | Bi) P (Bi).
Hence the result.
Activity 2.30 A statistics teacher knows from past experience that a student who
does their homework consistently has a probability of 0.95 of passing the
examination, whereas a student who does not do their homework has a probability
of 0.30 of passing.
(a) If 25% of students do their homework consistently, what percentage can expect
to pass?
(b) If a student chosen at random from the group gets a pass, what is the
probability that the student has done their homework consistently?
Solution
Here the random experiment is to choose a student at random, and to record
whether the student passes (P ) or fails (F ), and whether the student has done their
homework consistently (C) or has not (N ). (Notice that F = P c and N = C c .) The
sample space is S = {P C, P N, F C, F N }. We use the events Pass = {P C, P N }, and
Fail = {F C, F N }. We consider the sample space partitioned by Homework
= {P C, F C}, and No Homework = {P N, F N }.
(a) The first part of the example asks for the denominator of Bayes’ theorem:
P (Pass) = P (Pass | Homework) P (Homework)
+ P (Pass | No Homework) P (No Homework)
= 0.95 × 0.25 + 0.30 × (1 − 0.25)
= 0.2375 + 0.225
= 0.4625.
(b) Now applying Bayes’ theorem:
P (Homework | Pass) = P (Homework ∩ Pass)/P (Pass)
= P (Pass | Homework) P (Homework)/P (Pass)
= (0.95 × 0.25)/0.4625
= 0.5135.
Alternatively, we could arrange the calculations in a tree diagram as shown
below.
Activity 2.31 Plagiarism is a serious problem for assessors of coursework. One
check on plagiarism is to compare the coursework with a standard text. If the
coursework has plagiarised the text, then there will be a 95% chance of finding
exactly two phrases which are the same in both coursework and text, and a 5%
chance of finding three or more phrases. If the work is not plagiarised, then these
probabilities are both 50%.
Suppose that 5% of coursework is plagiarised. An assessor chooses some coursework
at random. What is the probability that it has been plagiarised if it has exactly two
phrases in the text? (Try making a guess before doing the calculation!)
What if there are three or more phrases? Did you manage to get a roughly correct
guess of these results before calculating?
Solution
Suppose that two phrases are the same. We use Bayes’ theorem:
P (plagiarised | two the same) = (0.95 × 0.05)/(0.95 × 0.05 + 0.5 × 0.95) = 0.0909.
Finding two phrases has increased the chance the work is plagiarised from 5% to
9.1%. Did you get anywhere near 9% when guessing? Now suppose that we find
three or more phrases:
P (plagiarised | three or more the same) = (0.05 × 0.05)/(0.05 × 0.05 + 0.5 × 0.95) = 0.0052.
It seems that no plagiariser is silly enough to keep three or more phrases the same,
so if we find three or more, the chance of the work being plagiarised falls from 5% to
0.5%! How close did you get by guessing?
Activity 2.32 Continuing with Activity 2.27, suppose the first key chosen does not
fit the lock. What is the probability that the bag chosen:
(a) is bag A?
(b) contains the required key?
Solution
(a) We require P (bag A | F c ) which is:
P (bag A | Fc) = [P (Fc | C1) P (C1) + P (Fc | C2) P (C2)] / ∑_{i=1}^{4} P (Fc | Ci) P (Ci).
The conditional probabilities are:
P (Fc | C1) = 4/5,   P (Fc | C2) = 1,   P (Fc | C3) = 1   and   P (Fc | C4) = 6/7.
Hence:
P (bag A | Fc) = (4/5 × 5/24 + 1 × 7/24) / (4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24) = 1/2.
(b) We require P (right bag | F c ) which is:
P (right bag | Fc) = [P (Fc | C1) P (C1) + P (Fc | C4) P (C4)] / ∑_{i=1}^{4} P (Fc | Ci) P (Ci)
= (4/5 × 5/24 + 6/7 × 7/24) / (4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24)
= 5/11.
Activity 2.33 Hard question!
A, B and C throw a die in that order until a six appears. The person who throws
the first six wins. What are their respective chances of winning?
Solution
We must assume that the game finishes with probability one (it would be proved in
a more advanced subject). If A, B and C all throw and fail to get a six, then their
respective chances of winning are as at the start of the game. We can call each
completed set of three throws a round. Let us denote the probabilities of winning by
P (A), P (B) and P (C) for A, B and C, respectively. Therefore:
P (A) = P (A wins on the 1st throw) + P (A wins in some round after the 1st round)
= 1/6 + P (A, B and C fail on the 1st throw and A wins after the 1st round)
= 1/6 + P (A, B and C fail in the 1st round) × P (A wins after the 1st round | A, B and C fail in the 1st round)
= 1/6 + P (No six in first 3 throws) P (A)
= 1/6 + (5/6)^3 P (A)
= 1/6 + (125/216) P (A).
So (1 − 125/216) P (A) = 1/6, and P (A) = 216/(91 × 6) = 36/91.
Similarly:
P (B) = P (B wins in the 1st round)
+ P (B wins after the 1st round)
= P (A fails with the 1st throw and B throws a six on the 1st throw)
+ P (All fail in the 1st round and B wins after the 1st round)
= P (A fails with the 1st throw) P (B throws a six with the 1st throw)
+ P (All fail in the 1st round) P (B wins after the 1st | All fail in the 1st)
= (5/6) × (1/6) + (5/6)^3 P (B).
So, (1 − 125/216) P (B) = 5/36, and P (B) = 5 × (216)/(91 × 36) = 30/91.
In the same way, P (C) = (5/6) × (5/6) × (1/6) × (216/91) = 25/91.
Notice that P (A) + P (B) + P (C) = 1. You may, on reflection, think that this rather
long solution could be shortened, by considering the relative winning chances of A,
B and C.
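A simulation of the game (illustrative only) gives relative frequencies close to 36/91 ≈ 0.396, 30/91 ≈ 0.330 and 25/91 ≈ 0.275.

```python
import random

def winner():
    """A, B, C throw a die in turn; return 0, 1 or 2 for whoever throws the first six."""
    while True:
        for player in range(3):
            if random.randint(1, 6) == 6:
                return player

n_trials = 300_000
counts = [0, 0, 0]
for _ in range(n_trials):
    counts[winner()] += 1

print([round(c / n_trials, 3) for c in counts])   # roughly [0.396, 0.330, 0.275]
```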
Activity 2.34 Hard question!
In men’s singles tennis, matches are played on the best-of-five-sets principle.
Therefore, the first player to win three sets wins the match, and a match may
consist of three, four or five sets. Assuming that two players are perfectly evenly
matched, and that sets are independent events, calculate the probabilities that a
match lasts three sets, four sets and five sets, respectively.
Solution
Suppose that the two players are A and B. We calculate the probability that A wins
a three-, four- or five-set match, and then, since the players are evenly matched,
double these probabilities for the final answer.
P (‘A wins in 3 sets’) = P (‘A wins 1st set’ ∩ ‘A wins 2nd set’ ∩ ‘A wins 3rd set’).
Since the sets are independent, we have:
P (‘A wins in 3 sets’) = P (‘A wins 1st set’) P (‘A wins 2nd set’) P (‘A wins 3rd set’) = (1/2) × (1/2) × (1/2) = 1/8.
=
Therefore, the total probability that the game lasts three sets is:
2 × 1/8 = 1/4.
If A wins in four sets, the possible winning patterns are:
BAAA,
ABAA and AABA.
Each of these patterns has probability (1/2)^4 by using the same argument as in the
case of 3 sets. So the probability that A wins in four sets is 3 × (1/16) = 3/16.
Therefore, the total probability of a match lasting four sets is 2 × (3/16) = 3/8.
The probability of a five-set match should be 1 − 3/8 − 1/4 = 3/8, but let us check
this directly. The winning patterns for A in a five-set match are:
BBAAA,
BABAA,
BAABA,
ABBAA,
ABABA and AABBA.
Each of these has probability (1/2)^5 because of the independence of the sets. So the
probability that A wins in five sets is 6 × (1/32) = 3/16. Therefore, the total
probability of a five-set match is 3/8, as before.
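The three probabilities can also be obtained by enumerating all 32 equally likely sequences of set winners; an illustrative sketch, assuming evenly matched players:

```python
from itertools import product
from collections import Counter

def sets_played(seq):
    """Number of sets played before one player reaches three set wins."""
    a = b = 0
    for i, w in enumerate(seq, start=1):
        a += (w == 'A')
        b += (w == 'B')
        if a == 3 or b == 3:
            return i

lengths = Counter()
for seq in product('AB', repeat=5):   # 32 equally likely sequences of set winners
    lengths[sets_played(seq)] += 1

print({length: count / 32 for length, count in sorted(lengths.items())})
# {3: 0.25, 4: 0.375, 5: 0.375}
```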
Activity 2.35 Hard question!
In a game of tennis, each point is won by one of the two players A and B. The usual
rules of scoring for tennis apply. That is, the winner of the game is the player who
first scores four points, unless each player has won three points, when deuce is called
and play proceeds until one player is two points ahead of the other and hence wins
the game.
A is serving and has a probability of winning any point of 2/3. The result of each
point is assumed to be independent of every other point.
(a) Show that the probability of A winning the game without deuce being called is
496/729.
(b) Find the probability of deuce being called.
(c) If deuce is called, show that A’s subsequent probability of winning the game is
4/5.
(d) Hence determine A’s overall chance of winning the game.
Solution
(a) A will win the game without deuce if he or she wins four points, including the
last point, before B wins three points. This can occur in three ways.
• A wins four straight points, i.e. AAAA with probability (2/3)^4 = 16/81.
• B wins just one point in the game. There are 4 C1 ways for this to happen,
namely BAAAA, ABAAA, AABAA and AAABA. Each has probability
(1/3) × (2/3)^4, so the probability of one of these outcomes is given by
4 × (1/3) × (2/3)^4 = 64/243.
• B wins just two points in the game. There are 5 C2 ways for this to happen,
namely BBAAAA, BABAAA, BAABAA, BAAABA, ABBAAA,
ABABAA, ABAABA, AABBAA, AABABA and AAABBA. Each has
probability (1/3)^2 × (2/3)^4, so the probability of one of these outcomes is
given by 10 × (1/3)^2 × (2/3)^4 = 160/729.
Therefore, the probability that A wins without a deuce must be the sum of
these, namely:
16/81 + 64/243 + 160/729 = (144 + 192 + 160)/729 = 496/729.
(b) We can mimic the above argument to find the probability that B wins the game
without a deuce. That is, the probability of four straight points to B is
(1/3)^4 = 1/81, the probability that A wins just one point in the game is
4 × (2/3) × (1/3)^4 = 8/243, and the probability that A wins just two points is
10 × (2/3)^2 × (1/3)^4 = 40/729. So the probability of B winning without a deuce
is 1/81 + 8/243 + 40/729 = 73/729 and so the probability of deuce is
1 − 496/729 − 73/729 = 160/729.
(c) Either: suppose deuce has been called. The probability that A wins the set
without further deuces is the probability that the next two points go AA – with
probability (2/3)^2.
The probability of exactly one further deuce is that the next four points go
ABAA or BAAA – with probability (2/3)^3 × (1/3) + (2/3)^3 × (1/3) = (2/3)^4.
The probability of exactly two further deuces is that the next six points go
ABABAA, ABBAAA, BAABAA or BABAAA – with probability
4 × (2/3)^4 × (1/3)^2 = (2/3)^6.
Continuing this way, the probability that A wins after three further deuces is
(2/3)^8 and the overall probability that A wins after deuce has been called is
(2/3)^2 + (2/3)^4 + (2/3)^6 + (2/3)^8 + · · · .
This is a geometric progression (GP) with first term a = (2/3)^2 and common ratio r = (2/3)^2, so the overall probability that A wins after deuce has been called is a/(1 − r) (sum to infinity of a GP) which is:
(2/3)^2 / (1 − (2/3)^2) = (4/9)/(5/9) = 4/5.
Or (quicker!): given a deuce, the next 2 balls can yield the following results. A
wins with probability (2/3)^2, B wins with probability (1/3)^2, and deuce with
probability 4/9.
Hence P (A wins | deuce) = (2/3)^2 + (4/9) P (A wins | deuce) and solving
immediately gives P (A wins | deuce) = 4/5.
(d) We have:
P (A wins the game) = P (A wins without deuce being called)
+ P (deuce is called) P (A wins | deuce is called)
= 496/729 + (160/729) × (4/5)
= 496/729 + 128/729
= 624/729.
Aside: so the probability of B winning the game is 1 − 624/729 = 105/729. It follows
that A is about six times as likely as B to win the game although the probability of
winning any point is only twice that of B. Another example of the counterintuitive
nature of probability.
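A direct simulation of the scoring rules (illustrative only, with p = 2/3 for A on every point) gives a winning probability for A close to 624/729 ≈ 0.856.

```python
import random

def a_wins_game(p=2/3):
    """Simulate one game; the winner needs at least four points and a two-point lead."""
    a = b = 0
    while True:
        if random.random() < p:
            a += 1
        else:
            b += 1
        if a >= 4 and a - b >= 2:
            return True
        if b >= 4 and b - a >= 2:
            return False

n_trials = 200_000
print(sum(a_wins_game() for _ in range(n_trials)) / n_trials)   # close to 624/729 = 0.856
```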
Example 2.26 You are waiting for your bag at the baggage reclaim carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags which come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = ‘your bag has been lost’ and x = ‘your bag is not among the first x bags
to arrive’. What we want to know is the conditional probability P (A | x) for any
x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are as follows.
P (x | A) = 1 for all x. If your bag has been lost, it will not arrive!
P (x | Ac ) = (200 − x)/200 if we assume that bags come out in a completely
random order.
Using Bayes’ theorem, we get:
P (A | x) = P (x | A) P (A) / [P (x | A) P (A) + P (x | Ac) P (Ac)] = P (A) / (P (A) + [(200 − x)/200] [1 − P (A)]).
Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it
has been lost!
For other values of x we need P (A). This is the general probability that a bag gets
lost, before you start observing the arrival of the bags from your particular flight.
This kind of probability is known as the prior probability of an event A.
Let us assign values to P (A) based on some empirical data. Statistics by the
Association of European Airlines (AEA) show how many bags were ‘mishandled’ per
1,000 passengers the airlines carried. This is not exactly what we need (since not all
passengers carry bags, and some have several), but we will use it anyway. In
particular, we will compare the results for the best and the worst of the AEA in 2006:
Air Malta: P (A) = 0.0044
British Airways: P (A) = 0.023.
Figure 2.14 shows a plot of P (A | x) as a function of x for these two airlines.
The probabilities are fairly small, even for large values of x.
For Air Malta, P (A | 199) = 0.469. So even when only 1 bag remains to arrive,
the probability is less than 0.5 that your bag has been lost.
For British Airways, P (A | 199) = 0.825. Also, we see that P (A | 197) = 0.541 is
the first probability over 0.5.
This is because the baseline probability of lost bags, P (A), is low.
So, the moral of the story is that even when nearly everyone else has collected their bags and left, do not despair!

Figure 2.14: Plot of P (A | x) as a function of x for the two airlines in Example 2.26, Air Malta and British Airways (BA).
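The curve in Figure 2.14 comes directly from the Bayes' theorem expression above. The illustrative sketch below reproduces the quoted values for x = 199; the function name is not from the guide.

```python
def prob_lost(x, prior, n_bags=200):
    """P(bag lost | your bag is not among the first x bags), from Bayes' theorem."""
    return prior / (prior + ((n_bags - x) / n_bags) * (1 - prior))

print(round(prob_lost(199, 0.0044), 3))   # Air Malta:       0.469
print(round(prob_lost(199, 0.023), 3))    # British Airways: 0.825
```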
2.8
Overview of chapter
This chapter introduced some formal terminology related to probability theory. The
axioms of probability were introduced, from which various other probability results were
derived. There followed a brief discussion of counting rules (using permutations and
combinations). The important concepts of independence and conditional probability
were discussed, and Bayes’ theorem was derived.
2.9
Key terms and concepts
Axiom
Chain rule
Collectively exhaustive
Complement
Element
Experiment
Factorial
Intersection
Outcome
Partition
Probability (theory)
Sample space
Subset
Union
With(out) replacement
Bayes’ theorem
Classical probability
Combination
Conditional probability
Empty set
Event
Independence
Mutually exclusive
Pairwise disjoint
Permutation
Relative frequency
Set
Total probability
Venn diagram
2.10
Sample examination questions
Solutions can be found in Appendix C.
1. For each one of the statements below say whether the statement is true or false,
explaining your answer. Throughout this question A and B are events such that
0 < P (A) < 1 and 0 < P (B) < 1.
(a) If A and B are independent, then P (A) + P (B) > P (A ∪ B).
(b) If P (A | B) = P (A | B c ) then A and B are independent.
(c) If A and B are disjoint events, then Ac and B c are disjoint.
2. Suppose that 10 people are seated in a random manner in a row of 10 lecture
theatre seats. What is the probability that two particular people, A and B, will be
seated next to each other?
3. A person tried by a three-judge panel is declared guilty if at least two judges cast
votes of guilty (i.e. a majority verdict). Suppose that when the defendant is in fact
guilty, each judge will independently vote guilty with probability 0.9, whereas when
the defendant is not guilty (i.e. innocent), this probability drops to 0.25. Suppose
70% of defendants are guilty.
(a) Compute the probability that judge 1 votes guilty.
(b) Given that both judge 1 and judge 2 vote not guilty, compute the probability
that judge 3 votes guilty.
Chapter 3
Random variables
3.1
Synopsis of chapter
This chapter introduces the concept of random variables and probability distributions.
These distributions are univariate, which means that they are used to model a single
numerical quantity. The concepts of expected value and variance are also discussed.
3.2
Learning outcomes
After completing this chapter, you should be able to:
define a random variable and distinguish it from the values which it takes
explain the difference between discrete and continuous random variables
find the mean and the variance of simple random variables whether discrete or
continuous
demonstrate how to proceed and use simple properties of expected values and
variances.
3.3
Introduction
In ST104a Statistics 1, we considered descriptive statistics for a sample of
observations of a variable X. Here we will represent the observations as a sequence of
variables, denoted as:
X1 , X2 , . . . , Xn
where n is the sample size.
In statistical inference, the observations will be treated as a sample drawn at random
from a population. We will then think of each observation Xi of a variable X as an
outcome of an experiment.
The experiment is ‘select a unit at random from the population and record its
value of X’.
The outcome is the observed value Xi of X.
Because variables X in statistical data are recorded as numbers, we can now focus on
experiments where the outcomes are also numbers – random variables.
Random variable
A random variable is an experiment for which the outcomes are numbers.1 This
means that for a random variable:
the sample space, S, is the set of real numbers R, or a subset of R
the outcomes are numbers in this sample space (instead of ‘outcomes’, we often
call them the values of the random variable)
events are sets of numbers (values) in this sample space.
Discrete and continuous random variables
There are two main types of random variables, depending on the nature of S, i.e. the
possible values of the random variable.
A random variable is continuous if S is all of R or some interval(s) of it, for
example [0, 1] or [0, ∞).
A random variable is discrete if it is not continuous.2 More precisely, a discrete
random variable takes a finite or countably infinite number of values.
Notation
A random variable is typically denoted by an upper-case letter, for example X (or Y ,
W etc.). A specific value of a random variable is often denoted by a lower-case letter,
for example x.
Probabilities of values of a random variable are written as follows.
P (X = x) denotes the probability that (the value of) X is x.
P (X > 0) denotes the probability that X is positive.
P (a < X < b) denotes the probability that X is between the numbers a and b.
Random variables versus samples
You will notice that many of the quantities we define for random variables are
analogous to sample quantities defined in ST104a Statistics 1.
1 This definition is a bit informal, but it is sufficient for this course.
2 Strictly speaking, a discrete random variable is not just a random variable which is not continuous, as there are many others, such as mixture distributions.
Random variable                Sample
Probability distribution       Sample distribution
Mean (expected value)          Sample mean (average)
Variance                       Sample variance
Standard deviation             Sample standard deviation
Median                         Sample median
This is no accident. In statistics, the population is represented as following a probability
distribution, and quantities for an observed sample are then used as estimators of the
analogous quantities for the population.
3.4
Discrete random variables
Example 3.1 The following two examples will be used throughout this chapter.
1. The number of people living in a randomly selected household in England.
• For simplicity, we use the value 8 to represent ‘8 or more’ (because 9 and
above are not reported separately in official statistics).
• This is a discrete random variable, with possible values of 1, 2, 3, 4, 5, 6, 7
and 8.
2. A person throws a basketball repeatedly from the free-throw line, trying to
make a basket. Consider the following random variable.
The number of unsuccessful throws before the first successful throw.
• The possible values of this are 0, 1, 2, . . ..
3.4.1
Probability distribution of a discrete random variable
The probability distribution (or just distribution) of a discrete random variable X
is specified by:
its possible values, x (i.e. its sample space, S)
the probabilities of the possible values, i.e. P (X = x) for all x ∈ S.
So we first need to develop a convenient way of specifying the probabilities.
Example 3.2 Consider the following probability distribution for the household size, X.3

Number of people in the household, x    1       2       3       4       5       6       7       8
P(X = x)                                0.3002  0.3417  0.1551  0.1336  0.0494  0.0145  0.0034  0.0021
Probability function
The probability function (pf) of a discrete random variable X, denoted by p(x),
is a real-valued function such that for any number x the function is:
p(x) = P (X = x).
We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.
Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).
Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
– especially when it is necessary to indicate clearly to which random variable the
function corresponds.
Necessary conditions for a probability function
To be a pf of a discrete random variable X with sample space S, a function p(x)
must satisfy the following conditions.
1. p(x) ≥ 0 for all real numbers x.
2. Σ_{xi ∈ S} p(xi) = 1, i.e. the sum of the probabilities of all possible values of X is 1.

The pf is defined for all real numbers x, but p(x) = 0 for any x ∉ S, i.e. for any value x which is not one of the possible values of X.
3 Source: ONS, National report for the 2001 Census, England and Wales. Table UV51.
Example 3.3 Continuing Example 3.2, here we can simply list all the values:

p(x) = 0.3002 for x = 1
p(x) = 0.3417 for x = 2
p(x) = 0.1551 for x = 3
p(x) = 0.1336 for x = 4
p(x) = 0.0494 for x = 5
p(x) = 0.0145 for x = 6
p(x) = 0.0034 for x = 7
p(x) = 0.0021 for x = 8
p(x) = 0      otherwise.

These are clearly all non-negative, and their sum is Σ_{x=1}^{8} p(x) = 1.
A graphical representation of the pf is shown in Figure 3.1.

Figure 3.1: Probability function for Example 3.3, plotting p(x) against x (number of people in the household).
For the next example, we need to remember the following results from mathematics,
concerning sums of geometric series. If r ≠ 1, then:

Σ_{x=0}^{n−1} a r^x = a (1 − r^n) / (1 − r)

and if |r| < 1, then:

Σ_{x=0}^{∞} a r^x = a / (1 − r).
Example 3.4 In the basketball example, the number of possible values is infinite,
so we cannot simply list the values of the pf. So we try to express it as a formula.
Suppose that:
the probability of a successful throw is π at each throw and, therefore, the
probability of an unsuccessful throw is 1 − π
outcomes of different throws are independent.
Hence the probability that the first success occurs after x failures is the probability
of a sequence of x failures followed by a success, i.e. the probability is:
(1 − π)x π.
So the pf of the random variable X (the number of failures before the first success)
is:
p(x) = (1 − π)^x π   for x = 0, 1, 2, . . ., and 0 otherwise     (3.1)
where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf.
Clearly, p(x) ≥ 0 for all x, since π ≥ 0 and 1 − π ≥ 0.
Using the sum to infinity of a geometric series, we get:
Σ_{x=0}^{∞} p(x) = Σ_{x=0}^{∞} (1 − π)^x π = π Σ_{x=0}^{∞} (1 − π)^x = π / (1 − (1 − π)) = π/π = 1.
The expression of the pf involves a parameter π (the probability of a successful
throw), a number for which we can choose different values. This defines a whole
‘family’ of individual distributions, one for each value of π. For example, Figure 3.2
shows values of p(x) for two values of π reflecting fairly good and pretty poor
free-throw shooters, respectively.
Figure 3.2: Probability function for Example 3.4, plotting p(x) against x (number of failures) for π = 0.7 and π = 0.3. π = 0.7 indicates a fairly good free-throw shooter; π = 0.3 indicates a pretty poor free-throw shooter.
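For readers who want a quick numerical check, the following short Python sketch (added here for illustration only; the guide itself contains no code, and the function name p_failures is just an illustrative label) evaluates this pf for the two values of π and confirms that the probabilities are non-negative and sum approximately to 1 once the infinite support is truncated.

def p_failures(x, pi):
    # pf of Example 3.4: P(X = x) = (1 - pi)^x * pi for x = 0, 1, 2, ...
    return (1 - pi) ** x * pi if x >= 0 else 0.0

for pi in (0.7, 0.3):
    probs = [p_failures(x, pi) for x in range(500)]          # truncate the infinite sum
    print(pi, all(p >= 0 for p in probs), round(sum(probs), 6))   # sum is close to 1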
Activity 3.1 Suppose that a box contains 12 green balls and 4 yellow balls. If 7
balls are selected at random, without replacement, determine the probability
function of X, the number of green balls which will be obtained.
Solution
Let the random variable X denote the number of green balls. As 7 balls are selected
without replacement, the sample space of X is S = {3, 4, 5, 6, 7} because the
maximum number of yellow balls which could be obtained is 4 (all selected), hence a
minimum of 3 green balls must be obtained, up to a maximum of 7 green balls. The number of possible combinations of 7 balls drawn from 16 is (16 choose 7). The x green balls chosen from 12 can occur in (12 choose x) ways, and the 7 − x yellow balls chosen from 4 can occur in (4 choose 7 − x) ways. Therefore, using classical probability:

p(x) = (12 choose x) (4 choose 7 − x) / (16 choose 7).

Therefore, the probability function is:

p(x) = (12 choose x) (4 choose 7 − x) / (16 choose 7)   for x = 3, 4, 5, 6, 7, and 0 otherwise.
Activity 3.2 Consider a sequence of independent tosses of a fair coin. Let the
random variable X denote the number of tosses needed to obtain the first head.
Determine the probability function of X and verify it satisfies the necessary
conditions for a valid probability function.
Solution
The sample space is clearly S = {1, 2, 3, . . .}. If the first head appears on toss x, then
the previous x − 1 tosses must have been tails. By independence of the tosses, and
the fact it is a fair coin:
P(X = x) = (1/2)^{x−1} × (1/2) = (1/2)^x.

Therefore, the probability function is:

p(x) = 1/2^x for x = 1, 2, 3, . . ., and 0 otherwise.

Clearly, p(x) ≥ 0 for all x and:

Σ_{x=1}^{∞} 1/2^x = 1/2 + (1/2)^2 + (1/2)^3 + · · · = (1/2) / (1 − 1/2) = 1
noting the sum to infinity of a geometric series with first term a = 1/2 and common
ratio r = 1/2.
Activity 3.3 Show that:
p(x) = 2x/(k (k + 1)) for x = 1, 2, . . . , k, and 0 otherwise

is a valid probability function for a discrete random variable X.

Hint: Σ_{i=1}^{n} i = n (n + 1)/2.
Solution
Since k > 0, then 2x/(k (k + 1)) ≥ 0 for x = 1, 2, . . . , k. Therefore, p(x) ≥ 0 for all
real x. Also, noting the hint in the question:
Σ_{x=1}^{k} 2x/(k (k + 1)) = 2/(k (k + 1)) + 4/(k (k + 1)) + · · · + 2k/(k (k + 1))
                           = (2/(k (k + 1))) × (1 + 2 + · · · + k)
                           = (2/(k (k + 1))) × k (k + 1)/2
                           = 1.
Hence p(x) is a valid probability function.
3.4.2
The cumulative distribution function (cdf)
Another way to specify a probability distribution is to give its cumulative
distribution function (cdf) (or just simply distribution function).
Cumulative distribution function (cdf)
The cdf is denoted F (x) (or FX (x)) and defined as:
F (x) = P (X ≤ x) for all real numbers x.
For a discrete random variable it is given by:
F(x) = Σ_{xi ∈ S, xi ≤ x} p(xi)
i.e. the sum of the probabilities of the possible values of X which are less than or
equal to x.
Example 3.5 Continuing with the household size example, values of F(x) at all possible values of X are:

Number of people in the household, x    1       2       3       4       5       6       7       8
p(x)                                    0.3002  0.3417  0.1551  0.1336  0.0494  0.0145  0.0034  0.0021
F(x)                                    0.3002  0.6419  0.7970  0.9306  0.9800  0.9945  0.9979  1.0000
These are shown in graphical form in Figure 3.3.

Figure 3.3: Cumulative distribution function for Example 3.5, plotting F(x) against x (number of people in the household).
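As a small illustration (not part of the original guide), the Python sketch below builds the cdf column of the table above by accumulating the probabilities; the dictionary name household_pf is simply a label chosen for this example.

household_pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
                5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

cdf = {}
running_total = 0.0
for value in sorted(household_pf):
    running_total += household_pf[value]   # F(x) accumulates p(x_i) for all x_i <= x
    cdf[value] = round(running_total, 4)

print(cdf)   # {1: 0.3002, 2: 0.6419, ..., 8: 1.0}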
Example 3.6 In the basketball example, p(x) = (1 − π)^x π for x = 0, 1, 2, . . .. We can calculate a simple formula for the cdf, using the sum of a geometric series. Since, for any non-negative integer y, we obtain:

Σ_{x=0}^{y} p(x) = Σ_{x=0}^{y} (1 − π)^x π = π Σ_{x=0}^{y} (1 − π)^x = π (1 − (1 − π)^{y+1}) / (1 − (1 − π)) = 1 − (1 − π)^{y+1}

we can write:

F(x) = 0 for x < 0, and F(x) = 1 − (1 − π)^{x+1} for x = 0, 1, 2, . . . .
The cdf is shown in graphical form in Figure 3.4.
Activity 3.4 Suppose that random variable X has the range {x1 , x2 , . . .}, where
x1 < x2 < · · · . Prove the following results:
Σ_{i=1}^{∞} p(xi) = 1,   p(xk) = F(xk) − F(xk−1)   and   F(xk) = Σ_{i=1}^{k} p(xi).
Solution
The events X = x1 , X = x2 , . . . are disjoint, so we can write:
Σ_{i=1}^{∞} p(xi) = Σ_{i=1}^{∞} P(X = xi) = P(X = x1 ∪ X = x2 ∪ · · · ) = P(S) = 1.
In words, this result states that the sum of the probabilities of all the possible values
X can take is equal to 1.
For the second equation, we have:
F (xk ) = P (X ≤ xk ) = P (X = xk ∪ X ≤ xk−1 ).
The two events on the right-hand side are disjoint, so:
F (xk ) = P (X = xk ) + P (X ≤ xk−1 ) = p(xk ) + F (xk−1 )
which immediately gives the required result.
For the final result, we can write:
F(xk) = P(X ≤ xk) = P(X = x1 ∪ X = x2 ∪ · · · ∪ X = xk) = Σ_{i=1}^{k} p(xi).
Activity 3.5 At a charity event, the organisers sell 100 tickets to a raffle. At the
end of the event, one of the tickets is selected at random and the person with that
number wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5.
What is the probability for each of them to win the prize?
Solution
Let X denote the number on the winning ticket. Since all values between 1 and 100
are equally likely, X has a discrete ‘uniform’ distribution such that:
P(‘Carol wins’) = P(X = 22) = p(22) = 1/100 = 0.01

and:

P(‘Janet wins’) = P(X ≤ 5) = F(5) = 5/100 = 0.05.
Figure 3.4: Cumulative distribution function for Example 3.6, plotting F(x) against x (number of failures) for π = 0.7 and π = 0.3.
3.4.3
Properties of the cdf for discrete distributions
The cdf F (x) of a discrete random variable X is a step function such that:
F (x) remains constant in all intervals between possible values of X
at a possible value xi of X, F (x) jumps up by the amount p(xi ) = P (X = xi )
at such an xi , the value of F (xi ) is the value at the top of the jump (i.e. F (x) is
right-continuous).
3.4.4
General properties of the cdf
These hold for both discrete and continuous random variables.
1. 0 ≤ F (x) ≤ 1 for all x (since F (x) is a probability).
2. F (x) → 0 as x → −∞, and F (x) → 1 as x → ∞.
3. F (x) is a non-decreasing function, i.e. if x1 < x2 , then F (x1 ) ≤ F (x2 ).
4. For any x1 < x2 , P (x1 < X ≤ x2 ) = F (x2 ) − F (x1 ).
Either the pf or the cdf can be used to calculate the probabilities of any events for a
discrete random variable.
Example 3.7 Continuing with the household size example (for the probabilities,
see Example 3.5), then:
P (X = 1) = p(1) = F (1) = 0.3002
P (X = 2) = p(2) = F (2) − F (1) = 0.3417
P (X ≤ 2) = p(1) + p(2) = F (2) = 0.6419
P (X = 3 or 4) = p(3) + p(4) = F (4) − F (2) = 0.2887
P (X > 5) = p(6) + p(7) + p(8) = 1 − F (5) = 0.0200
P (X ≥ 5) = p(5) + p(6) + p(7) + p(8) = 1 − F (4) = 0.0694.
3.4.5
Properties of a discrete random variable
Let X be a discrete random variable with sample space S and pf p(x).
Expected value of a discrete random variable
The expected value (or mean) of X is denoted E(X), and defined as:
E(X) = Σ_{xi ∈ S} xi p(xi).

This can also be written more concisely as E(X) = Σ x p(x) or E(X) = Σ_x x p(x).
We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’),
or µX , is often used.
Activity 3.6 Toward the end of the financial year, James is considering whether to
accept an offer to buy his stock option now, rather than wait until the normal
exercise time. If he sells now, his profit will be £120,000. If he waits until the
exercise time, his profit will be £200,000, provided that there is no crisis in the
markets before that time; if there is a crisis, the option will be worthless and he
would expect a net loss of £50,000. What action should he take to maximise his
expected profit if the probability of crisis is:
(a) 0.5?
(b) 0.1?
For what probability of a crisis would James be indifferent between the two courses
of action if he wishes to maximise his expected profit?
Solution
Let π = probability of crisis, then:
S = E(profit given James sells) = £120,000
and:
W = E(profit given James waits) = £200,000 (1 − π) + (−£50,000) π.
(a) If π = 0.5, then S = £120,000 and W = £75,000, so S > W , hence James
should sell now.
(b) If π = 0.1, then S = £120,000 and W = £175,000, so S < W , hence James
should wait until the exercise time.
To be indifferent, we require S = W , i.e. we have:
£200,000 − £250,000 π = £120,000
so π = 8/25 = 0.32.
Activity 3.7 What is the expectation of the random variable X if the only possible
value it can take is c? Also, show that E(X − E(X)) = 0.
Solution
We have p(c) = 1, so X is effectively a constant, even though it is called a random
variable. Its expectation is:
E(X) = Σ_{∀x} x p(x) = c p(c) = c × 1 = c.     (3.2)
This is intuitively correct; on average, a constant must be equal to itself!
We have:
E(X − E(X)) = E(X) − E(E(X))
Since E(X) is just a number, as opposed to a random variable, (3.2) tells us that its
expectation is equal to itself. Therefore, we can write:
E(X − E(X)) = E(X) − E(X) = 0.
Activity 3.8 If a probability function of a random variable X is given by:
(
1/2x for x = 1, 2, 3, . . .
p(x) =
0
otherwise
show that E(2X ) does not exist.
Solution
We have:
E(2^X) = Σ_{x=1}^{∞} 2^x p(x) = Σ_{x=1}^{∞} 2^x (1/2^x) = Σ_{x=1}^{∞} 1 = 1 + 1 + 1 + · · · = ∞.
Note that this is the famous ‘Petersburg paradox’, according to which a player’s
expectation is infinite (i.e. does not exist) if s/he is to receive 2X units of currency
when, in a series of tosses of a fair coin, the first head appears on the xth toss.
Activity 3.9 Suppose that on each play of a certain game James, a gambler, is
equally likely to win or to lose. Suppose that when he wins, his fortune is doubled,
and that when he loses, his fortune is cut in half. If James begins playing with a
given fortune c > 0, what is the expected value of his fortune after n independent
plays of the game?
Hint: If X1 , X2 , . . . , Xn are independent random variables, then:
E(X1 X2 · · · Xn ) = E(X1 ) × E(X2 ) × · · · × E(Xn ).
That is, for independent random variables the ‘expectation of the product’ is the
‘product of the expectations’. This will be introduced in Chapter 5: Multivariate
random variables.
Solution
For i = 1, . . . , n, let Xi = 2 if James’ fortune is doubled on the ith play of the game,
and let Xi = 1/2 if his fortune is cut in half on the ith play. Hence:
E(Xi) = 2 × (1/2) + (1/2) × (1/2) = 5/4.
After the first play of the game, James’ fortune will be cX1 , after the second play it
will be (cX1 )X2 , and by continuing in this way it is seen that after n plays James’
fortune will be cX1 X2 · · · Xn . Since X1 , . . . , Xn are independent, and noting the hint:
E(cX1 X2 · · · Xn) = c × E(X1) × E(X2) × · · · × E(Xn) = c (5/4)^n.
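A quick simulation can make this result tangible. The Python sketch below (added for illustration; all names and the choice of c = 100, n = 10 are assumptions for the example) simulates many independent sequences of plays and compares the average final fortune with c (5/4)^n.

import random

def simulate_mean_fortune(c=100.0, n=10, repetitions=100_000, seed=1):
    # Each play independently doubles or halves the fortune with probability 1/2 each.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repetitions):
        fortune = c
        for _ in range(n):
            fortune *= 2 if rng.random() < 0.5 else 0.5
        total += fortune
    return total / repetitions

print(simulate_mean_fortune())   # roughly 931, i.e. close to the theoretical value
print(100 * (5 / 4) ** 10)       # exact expected value c * (5/4)^n = 931.32...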
3.4.6
Expected value versus sample mean
The mean (expected value) E(X) of a probability distribution is analogous to the
sample mean (average) X̄ of a sample distribution.
This is easiest to see when the sample space is finite. Suppose the random variable X can have K different values x1, . . . , xK, and their frequencies in a sample are f1, . . . , fK, respectively. Therefore, the sample mean of X is:

X̄ = (f1 x1 + · · · + fK xK) / (f1 + · · · + fK) = x1 p̂(x1) + · · · + xK p̂(xK) = Σ_{i=1}^{K} xi p̂(xi)

where:

p̂(xi) = fi / Σ_{i=1}^{K} fi

are the sample proportions of the values xi.
The expected value of the random variable X is:

E(X) = x1 p(x1) + · · · + xK p(xK) = Σ_{i=1}^{K} xi p(xi).

So X̄ uses the sample proportions, p̂(xi), whereas E(X) uses the population probabilities, p(xi).
Example 3.8 Continuing with the household size example:

x         1       2       3       4       5       6       7       8       Sum
p(x)      0.3002  0.3417  0.1551  0.1336  0.0494  0.0145  0.0034  0.0021
x p(x)    0.3002  0.6834  0.4653  0.5344  0.2470  0.0870  0.0238  0.0168  2.3579 = E(X)
The expected number of people in a randomly selected household is 2.36.
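The calculation in Example 3.8, and the link between E(X) and the sample mean discussed above, can be checked numerically. The Python sketch below (illustrative only; the use of random.choices to draw a sample and the name household_pf are assumptions for the example) computes E(X) = Σ x p(x) and then shows that the sample mean of a large random sample from this distribution is close to it.

import random

household_pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
                5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

# Population mean: E(X) = sum of x * p(x)
expected_value = sum(x * p for x, p in household_pf.items())
print(round(expected_value, 4))              # 2.3579

# Sample mean of a large random sample drawn from this distribution
rng = random.Random(2)
values, weights = zip(*household_pf.items())
sample = rng.choices(values, weights=weights, k=100_000)
print(round(sum(sample) / len(sample), 3))   # close to 2.358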
Example 3.9 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . ., and
0 otherwise.
It can be shown that (see the appendix for a non-examinable proof):
E(X) = (1 − π)/π.
Hence, for example:
E(X) = 0.3/0.7 = 0.42 for π = 0.7
E(X) = 0.7/0.3 = 2.33 for π = 0.3.
So, before scoring a basket, a fairly good free-throw shooter (with π = 0.7) misses on
average about 0.42 shots, and a pretty poor free-throw shooter (with π = 0.3) misses
on average about 2.33 shots.
Expected values of functions of a random variable
Let g(X) be a function (‘transformation’) of a discrete random variable X. This is
also a random variable, and its expected value is:
E(g(X)) = Σ_x g(x) pX(x)
where pX (x) = p(x) is the probability function of X.
Example 3.10 The expected value of the square of X is:
E(X^2) = Σ_x x^2 p(x).

In general:

E(g(X)) ≠ g(E(X))

when g(X) is a non-linear function of X.

Example 3.11 Note that:

E(X^2) ≠ (E(X))^2   and   E(1/X) ≠ 1/E(X).
Expected values of linear transformations
Suppose X is a random variable and a and b are constants, i.e. known numbers
which are not random variables. Therefore:
E(aX + b) = a E(X) + b.
A special case of the result:
E(aX + b) = a E(X) + b
is obtained when a = 0, which gives:
E(b) = b.
That is, the expected value of a constant is the constant itself.
Variance and standard deviation of a discrete random variable
The variance of a discrete random variable X is defined as:
Var(X) = E[(X − E(X))^2] = Σ_x (x − E(X))^2 p(x).

The standard deviation of X is sd(X) = √Var(X).
Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation)
of the random variable X.
Alternative notation: the variance is often denoted σ 2 (‘sigma squared’) and the
standard deviation by σ (‘sigma’).
An alternative formula: the variance can also be calculated as:
Var(X) = E(X 2 ) − (E(X))2 .
Example 3.12 Continuing with the household size example:

x     p(x)     x p(x)    (x − E(X))^2   (x − E(X))^2 p(x)   x^2    x^2 p(x)
1     0.3002   0.3002    1.844          0.554               1      0.300
2     0.3417   0.6834    0.128          0.044               4      1.367
3     0.1551   0.4653    0.412          0.064               9      1.396
4     0.1336   0.5344    2.696          0.360               16     2.138
5     0.0494   0.2470    6.981          0.345               25     1.235
6     0.0145   0.0870    13.265         0.192               36     0.522
7     0.0034   0.0238    21.549         0.073               49     0.167
8     0.0021   0.0168    31.833         0.067               64     0.134
Sum            2.3579                   1.699                      7.259
               = E(X)                   = Var(X)                   = E(X^2)

Var(X) = E[(X − E(X))^2] = 1.699 = 7.259 − (2.358)^2 = E(X^2) − (E(X))^2 and
sd(X) = √Var(X) = √1.699 = 1.30.
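The same calculation can be reproduced in a few lines of Python (illustrative only; household_pf is simply a label for the distribution of Example 3.12), using both the direct definition of the variance and the alternative formula.

household_pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
                5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

mean = sum(x * p for x, p in household_pf.items())

# Direct definition: Var(X) = sum of (x - E(X))^2 * p(x)
var_direct = sum((x - mean) ** 2 * p for x, p in household_pf.items())

# Alternative formula: Var(X) = E(X^2) - (E(X))^2
e_x2 = sum(x ** 2 * p for x, p in household_pf.items())
var_alt = e_x2 - mean ** 2

print(round(var_direct, 3), round(var_alt, 3), round(var_alt ** 0.5, 2))   # 1.699 1.699 1.3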
Example 3.13 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . .,
and 0 otherwise. It can be shown (although the proof is beyond the scope of the
course) that for this distribution:
Var(X) = (1 − π)/π^2.

In the two cases we have used as examples:

Var(X) = 0.3/(0.7)^2 = 0.61 and sd(X) = 0.78 for π = 0.7

Var(X) = 0.7/(0.3)^2 = 7.78 and sd(X) = 2.79 for π = 0.3.
So the variation in how many free throws a pretty poor shooter misses before the
first success is much higher than the variation for a fairly good shooter.
Variances of linear transformations
If X is a random variable and a and b are constants, then:
Var(aX + b) = a2 Var(X).
If a = 0, this gives:
Var(b) = 0.
That is, the variance of a constant is 0. The converse also holds – if a random variable
has a variance of 0, it is actually a constant.
Example 3.14 For further practice, let us consider a discrete random variable X
which has possible values 0, 1, 2 . . . , n, where n is a known positive integer, and X
has the following probability function:
p(x) = (n choose x) π^x (1 − π)^{n−x} for x = 0, 1, 2, . . . , n, and 0 otherwise

where (n choose x) = n!/(x! (n − x)!) denotes the binomial coefficient, and π is a probability parameter such that 0 ≤ π ≤ 1.
A random variable like this follows the binomial distribution. We will discuss its motivation and uses in the next chapter.
Here, we consider the following tasks for this distribution.

Show that p(x) satisfies the conditions for a probability function.

Write down the cumulative distribution function, F(x).

To show that p(x) is a probability function, we need to show the following.

1. p(x) ≥ 0 for all x. This is clearly true, since (n choose x) ≥ 0, π ≥ 0 and 1 − π ≥ 0.

2. Σ_{x=0}^{n} p(x) = 1. This is easiest to show by using the binomial theorem, which states that, for any integer n ≥ 0 and any real numbers y and z:

(y + z)^n = Σ_{x=0}^{n} (n choose x) y^x z^{n−x}.     (3.3)

If we choose y = π and z = 1 − π in (3.3), we get:

1 = 1^n = [π + (1 − π)]^n = Σ_{x=0}^{n} (n choose x) π^x (1 − π)^{n−x} = Σ_{x=0}^{n} p(x).

The cdf does not simplify into a simple formula, so we just calculate its values from the definition, by summation. For the values x = 0, 1, . . . , n, the value of the cdf is:

F(x) = P(X ≤ x) = Σ_{y=0}^{x} (n choose y) π^y (1 − π)^{n−y}.
Since X is a discrete random variable, F (x) is a step function.
We note that:
E(X) = n π
and Var(X) = n π (1 − π).
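The conditions and moments above are easy to verify numerically for any particular n and π. The Python sketch below (illustrative only; the values n = 10 and π = 0.3 are arbitrary choices) checks that the probabilities sum to 1 and that the mean and variance agree with n π and n π (1 − π).

from math import comb

def binomial_pf(x, n, pi):
    # p(x) = (n choose x) * pi^x * (1 - pi)^(n - x) for x = 0, 1, ..., n
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)

n, pi = 10, 0.3
probs = [binomial_pf(x, n, pi) for x in range(n + 1)]
print(round(sum(probs), 10))                                            # 1.0
print(round(sum(x * p for x, p in enumerate(probs)), 6), n * pi)        # both 3.0
print(round(sum((x - n * pi) ** 2 * p for x, p in enumerate(probs)), 6),
      n * pi * (1 - pi))                                                # both 2.1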
Activity 3.10 Show that if Var(X) = 0 then p(µ) = 1. (We say in this case that X
is almost surely equal to its mean.)
Solution
From the definition of variance, we have:
Var(X) = E((X − µ)^2) = Σ_{∀x} (x − µ)^2 p(x) ≥ 0
because the squared term (x − µ)2 is non-negative (as is p(x)). The only case where
it is equal to 0 is when x − µ = 0, that is, when x = µ. Therefore, the random
variable X can only take the value µ, and we have p(µ) = P (X = µ) = 1.
Activity 3.11 Construct suitable examples to show that for a random variable X:
(a) E(X^2) ≠ (E(X))^2 in general

(b) E(1/X) ≠ 1/E(X) in general.
Solution
We require a counterexample. A simple one will suffice – there is no merit in
complexity. Let the discrete random variable X assume values 1 and 2 with
probabilities 1/3 and 2/3, respectively. (Obviously, there are many other examples
we could have chosen.) Therefore:
E(X) = 1 × (1/3) + 2 × (2/3) = 5/3

E(X^2) = 1 × (1/3) + 4 × (2/3) = 3

E(1/X) = 1 × (1/3) + (1/2) × (2/3) = 2/3

and, clearly, E(X^2) ≠ (E(X))^2 and E(1/X) ≠ 1/E(X) in this case. A single counterexample is enough to show that these equalities do not hold in general.
Activity 3.12
(a) Let X be a random variable. Show that:
Var(X) = E(X(X − 1)) − E(X)(E(X) − 1).
(b) Let X1 , X2 , . . . , Xn be independent random variables. Assume that all have a
mean of µ and a variance of σ 2 . Find expressions for the mean and variance of
the random variable (X1 + X2 + · · · + Xn )/n.
Solution
(a) Recall that Var(X) = E(X^2) − (E(X))^2. Now, working backwards:

E(X(X − 1)) − E(X)(E(X) − 1) = E(X^2 − X) − (E(X))^2 + E(X)
                             = E(X^2) − E(X) − (E(X))^2 + E(X)   (using standard properties of expectation)
                             = E(X^2) − (E(X))^2
                             = Var(X).
(b) We have:

E((X1 + X2 + · · · + Xn)/n) = E(X1 + X2 + · · · + Xn)/n
                            = (E(X1) + E(X2) + · · · + E(Xn))/n
                            = (µ + µ + · · · + µ)/n
                            = nµ/n
                            = µ.

Also, by independence:

Var((X1 + X2 + · · · + Xn)/n) = Var(X1 + X2 + · · · + Xn)/n^2
                              = (Var(X1) + Var(X2) + · · · + Var(Xn))/n^2
                              = (σ^2 + σ^2 + · · · + σ^2)/n^2
                              = nσ^2/n^2
                              = σ^2/n.
Activity 3.13 Let X be a random variable for which E(X) = µ and Var(X) = σ 2 ,
and let c be an arbitrary constant. Show that:
E((X − c)2 ) = σ 2 + (µ − c)2 .
Solution
We have:
E((X − c)^2) = E(X^2 − 2cX + c^2) = E(X^2) − 2c E(X) + c^2
             = Var(X) + (E(X))^2 − 2cµ + c^2
             = σ^2 + µ^2 − 2cµ + c^2
             = σ^2 + (µ − c)^2.
Activity 3.14 Y is a random variable with expected value zero, P (Y = 1) = 0.2
and P (Y = 2) = 0.1. It is known that Y takes just one other value besides 1 and 2.
(a) What is the other value that Y takes?
(b) What is the variance of Y ?
Solution
(a) Let the other value be θ, then:
E(Y) = Σ_y y P(Y = y) = (θ × 0.7) + (1 × 0.2) + (2 × 0.1) = 0

hence θ = −4/7.

(b) Var(Y) = E(Y^2) − (E(Y))^2 = E(Y^2), since E(Y) = 0. So:

Var(Y) = E(Y^2) = Σ_y y^2 P(Y = y) = ((−4/7)^2 × 0.7) + (1^2 × 0.2) + (2^2 × 0.1) = 0.8286.
Activity 3.15 James is planning to invest £1000 for two years. He will choose
between two savings accounts offered by a bank:
A standard fixed-term account which has a guaranteed interest rate of 5.5%
after the two years.
A ‘Deposit Plus’ account, for which the interest rate depends on the stock
prices of three companies as follows:
• if the stock prices of all three companies are higher two years after the
account is opened, the two-year interest rate is 8.1%
• if not, the two-year interest rate is 1.1%.
Denote by X the two-year interest rate of the Deposit Plus account, and by Y the
two-year interest rate of the standard account. Let π denote the probability that the
condition for the higher interest rate of the Deposit Plus account is satisfied at the
end of the period.
(a) Calculate the expected value and standard deviation of X, and the expected
value and standard deviation of Y .
(b) For which values of π is E(X) > E(Y )?
(c) Which account would you choose, and why? (There is no single right answer to
this question!)
Solution
(a) Since the interest rate of the standard account is guaranteed, the ‘random’
variable Y is actually a constant. So E(Y ) = 5.5 and Var(Y ) = sd(Y ) = 0. The
random variable X has two values, 8.1 and 1.1, with probabilities π and 1 − π
respectively. Therefore:
E(X) = 8.1 × π + 1.1 × (1 − π) = 1.1 + 7.0π
E(X^2) = (8.1)^2 × π + (1.1)^2 × (1 − π) = 1.21 + 64.4π

Var(X) = E(X^2) − (E(X))^2 = 49 π (1 − π)

and so sd(X) = 7 √(π (1 − π)).
(b) E(X) > E(Y ) if 1.1 + 7.0π > 5.5, i.e. if π > 0.6286. The expected interest rate
of the Deposit Plus account is higher than the guaranteed rate of the standard
account if the probability is higher than 0.6286 that all three stock prices are at
higher levels at the end of the reference period.
(c) If you focus solely on the expected interest rate, you would make your decision
based on your estimate of π. You would choose the Deposit Plus account if you
believe – based on whatever evidence on the companies and the world economy
you choose to use – that there is a probability of at least 0.6286 that the three
companies will all increase their share prices over the two years.
However, you might also consider the variances. The standard account has a
guaranteed rate, while the Deposit Plus account offers both a possibility of a
high rate and a risk of a low rate. So the choice could also depend on how
risk-averse you are.
Activity 3.16 Hard question!
In an investigation of animal behaviour, rats have to choose between four doors. One
of them, behind which is food, is ‘correct’. If an incorrect choice is made, the rat is
returned to the starting point and chooses again, continuing as long as necessary
until the correct choice is made. The random variable X is the serial number of the
trial on which the correct choice is made.
Find the probability function and expectation of X under each of the following
hypotheses:
(a) each door is equally likely to be chosen on each trial, and all trials are mutually
independent
(b) at each trial, the rat chooses with equal probability between the doors which it
has not so far tried
(c) the rat never chooses the same door on two successive trials, but otherwise
chooses at random with equal probabilities.
Solution
(a) For the ‘stupid’ rat:
P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/4)
...
P(X = r) = (3/4)^{r−1} × (1/4).
This is a ‘geometric distribution’ with π = 1/4, which gives E(X) = 1/π = 4.
(b) For the ‘intelligent’ rat:
P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/3) = 1/4
P(X = 3) = (3/4) × (2/3) × (1/2) = 1/4
P(X = 4) = (3/4) × (2/3) × (1/2) × 1 = 1/4.
Hence E(X) = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5.
(c) For the ‘forgetful’ rat (short-term, but not long-term, memory):
P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/3)
P(X = 3) = (3/4) × (2/3) × (1/3)
...
P(X = r) = (3/4) × (2/3)^{r−2} × (1/3)
(for r ≥ 2).
Therefore:

E(X) = 1/4 + (3/4) [2 × (1/3) + 3 × (2/3)(1/3) + 4 × (2/3)^2 (1/3) + · · ·]
     = 1/4 + (1/4) [2 + 3 × (2/3) + 4 × (2/3)^2 + · · ·].

There is more than one way to evaluate this sum. For example:

E(X) = 1/4 + (1/4) × ([1 + 2/3 + (2/3)^2 + · · ·] + [1 + 2 × (2/3) + 3 × (2/3)^2 + · · ·])
     = 1/4 + (1/4)(3 + 9)
     = 3.25.
Note that 2.5 < 3.25 < 4, so the intelligent rat needs the least trials on average,
while the stupid rat needs the most, as we would expect!
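These three expected values are easy to confirm by simulation. The Python sketch below is purely illustrative (the strategy labels and door numbering are choices made for this example, not part of the original guide): it simulates many rats under each hypothesis and reports the average number of trials needed.

import random

def trials_needed(strategy, rng):
    # Doors are labelled 0-3; door 0 is the 'correct' one. Returns the trial number of the first success.
    tried = []
    last = None
    for trial in range(1, 1000):
        if strategy == "stupid":
            choice = rng.randrange(4)                                   # any door, every time
        elif strategy == "intelligent":
            choice = rng.choice([d for d in range(4) if d not in tried])  # never retries a door
        else:  # "forgetful": never repeats the immediately preceding door
            choice = rng.choice([d for d in range(4) if d != last])
        if choice == 0:
            return trial
        tried.append(choice)
        last = choice

rng = random.Random(3)
for strategy in ("stupid", "intelligent", "forgetful"):
    mean = sum(trials_needed(strategy, rng) for _ in range(100_000)) / 100_000
    print(strategy, round(mean, 2))   # roughly 4, 2.5 and 3.25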
3.5
Continuous random variables
A random variable (and its probability distribution) is continuous if it can have an
uncountably infinite number of possible values.4
In other words, the set of possible values (the sample space) is the real numbers R,
or one or more intervals in R.
Example 3.15 An example of a continuous random variable, used here as an
approximating model, is the size of claim made on an insurance policy (i.e. a claim
by the customer to the insurance company), in £000s.
Suppose the policy has a deductible of £999, so all claims are at least £1,000.
Therefore, the possible values of this random variable are {x | x ≥ 1}.
Most of the concepts introduced for discrete random variables have exact or
approximate analogies for continuous random variables, and many results are the same
for both types. However, there are some differences in the details. The most obvious
difference is that wherever in the discrete case there are sums over the possible values of
the random variable, in the continuous case these are integrals.
Probability density function (pdf)
For a continuous random variable X, the probability function is replaced by the
probability density function (pdf), denoted as f (x) [or fX (x)].
4 Strictly speaking, having an uncountably infinite number of possible values does not necessarily imply that it is a continuous random variable. For example, the Cantor distribution (not covered in this course) is neither a discrete nor an absolutely continuous probability distribution, nor is it a mixture of these. However, we will not consider this matter any further in this course.
Example 3.16 Continuing the insurance example in Example 3.15, we consider a pdf of the following form:

f(x) = 0 for x < k, and f(x) = α k^α / x^{α+1} for x ≥ k

where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known number. In our example, k = 1 (due to the deductible). A probability distribution with this pdf is known as the Pareto distribution. A graph of this pdf when α = 2.2 is shown in Figure 3.5.

Figure 3.5: Probability density function for Example 3.16, plotting f(x) against x.
Unlike for probability functions of discrete random variables, in the continuous case
values of the probability density function are not probabilities of individual values, i.e.
f (x) 6= P (X = x). In fact, for a continuous random variable:
P (X = x) = 0 for all x.
(3.4)
That is, the probability that X has any particular value exactly is always 0.
Because of (3.4), with a continuous random variable we do not need to be very careful
about differences between < and ≤, and between > and ≥. Therefore, the following
probabilities are all equal:
P (a < X < b),
P (a ≤ X ≤ b),
P (a < X ≤ b) and P (a ≤ X < b).
Probabilities of intervals for continuous random variables
Integrals of the pdf give probabilities of intervals of values such that:
P(a < X ≤ b) = ∫_a^b f(x) dx

for any two numbers a < b.
In other words, the probability that the value of X is between a and b is the area under f(x) between a and b. Here a can also be −∞, and/or b can be +∞.

Example 3.17 In Figure 3.6, the shaded area is P(1.5 < X ≤ 3) = ∫_{1.5}^{3} f(x) dx.

Figure 3.6: Probability density function showing P(1.5 < X ≤ 3), plotting f(x) against x.
Properties of pdfs
The pdf f (x) of any continuous random variable must satisfy the following conditions.
1. f(x) ≥ 0 for all x.

2. ∫_{−∞}^{∞} f(x) dx = 1.

These are analogous to the conditions for probability functions of discrete distributions.

Example 3.18 Continuing with the insurance example, we check that the conditions hold for the pdf:

f(x) = 0 for x < k, and f(x) = α k^α / x^{α+1} for x ≥ k

where α > 0 and k > 0.

1. Clearly, f(x) ≥ 0 for all x, since α > 0, k^α > 0 and x^{α+1} ≥ k^{α+1} > 0.

2. We have:

∫_{−∞}^{∞} f(x) dx = ∫_k^∞ (α k^α / x^{α+1}) dx = α k^α ∫_k^∞ x^{−α−1} dx = α k^α [x^{−α}/(−α)]_k^∞ = (−k^α)(0 − k^{−α}) = 1.
Cumulative distribution function
The cumulative distribution function (cdf) of a continuous random variable X
is defined exactly as for discrete random variables, i.e. the cdf is:
F (x) = P (X ≤ x) for all real numbers x.
The general properties of the cdf stated previously also hold for continuous
distributions. The cdf of a continuous distribution is not a step function, so results
on discrete-specific properties do not hold in the continuous case. A continuous cdf
is a smooth, continuous function of x.
Relationship between the cdf and pdf
The cdf is obtained from the pdf through integration:

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt for all x.

The pdf is obtained from the cdf through differentiation:

f(x) = F′(x).
Activity 3.17
(a) Define the cumulative distribution function (cdf) of a random variable and state
the principal properties of such a function.
(b) Identify which, if any, of the following functions could be a cdf under suitable
choices of the constants a and b. Explain why (or why not) each function
satisfies the properties required of a cdf and the constraints which may be
required in respect of the constants a and b.
i. F (x) = a (b − x)2 for −1 ≤ x ≤ 1.
ii. F (x) = a (1 − xb ) for −1 ≤ x ≤ 1.
iii. F (x) = a − b exp (−x/2) for 0 ≤ x ≤ 2.
Solution
(a) We defined the cdf to be F (x) = P (X ≤ x) where:
• 0 ≤ F (x) ≤ 1
• F (x) is non-decreasing
• dF(x)/dx = f(x) and F(x) = ∫_{−∞}^{x} f(t) dt for continuous X
• F (x) → 0 as x → −∞ and F (x) → 1 as x → ∞.
(b)
i. Okay. a = 0.25 and b = −1.
ii. Not okay. At x = 1, F (x) = 0, which would mean a decreasing function.
iii. Okay. a = b > 0 and b = (1 − e−1 )−1 .
Example 3.19 Continuing the insurance example:
∫_{−∞}^{x} f(t) dt = ∫_k^x (α k^α / t^{α+1}) dt
                   = (−k^α) ∫_k^x (−α) t^{−α−1} dt
                   = (−k^α) [t^{−α}]_k^x
                   = (−k^α)(x^{−α} − k^{−α})
                   = 1 − k^α x^{−α}
                   = 1 − (k/x)^α.

Therefore:

F(x) = 0 for x < k, and F(x) = 1 − (k/x)^α for x ≥ k.     (3.5)

If we were given (3.5), we could obtain the pdf by differentiation, since F′(x) = 0 when x < k, and:

F′(x) = −k^α (−α) x^{−α−1} = α k^α / x^{α+1} for x ≥ k.

A plot of the cdf is shown in Figure 3.7.

Figure 3.7: Cumulative distribution function for Example 3.19, plotting F(x) against x.
Probabilities from cdfs and pdfs
Since P (X ≤ x) = F (x), it follows that P (X > x) = 1 − F (x). In general, for any
two numbers a < b, we have:
P(a < X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a).
Example 3.20 Continuing with the insurance example (with k = 1 and α = 2.2),
then:
P(X ≤ 1.5) = F(1.5) = 1 − (1/1.5)^{2.2} ≈ 0.59

P(X ≤ 3) = F(3) = 1 − (1/3)^{2.2} ≈ 0.91

P(X > 3) = 1 − F(3) ≈ 1 − 0.91 = 0.09

P(1.5 ≤ X ≤ 3) = F(3) − F(1.5) ≈ 0.91 − 0.59 = 0.32.
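These probabilities follow directly from the cdf, which makes them easy to reproduce in code. The Python sketch below (illustrative only; pareto_cdf is just a label for the cdf of Example 3.19 with k = 1 and α = 2.2) evaluates the same four probabilities.

def pareto_cdf(x, k=1.0, alpha=2.2):
    # F(x) = 1 - (k / x)^alpha for x >= k, and 0 otherwise
    return 0.0 if x < k else 1.0 - (k / x) ** alpha

print(round(pareto_cdf(1.5), 2))                      # about 0.59
print(round(pareto_cdf(3.0), 2))                      # about 0.91
print(round(1 - pareto_cdf(3.0), 2))                  # P(X > 3), about 0.09
print(round(pareto_cdf(3.0) - pareto_cdf(1.5), 2))    # P(1.5 <= X <= 3), about 0.32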
Example 3.21
Consider now a continuous random variable with the following pdf:
f(x) = λ e^{−λx} for x > 0, and f(x) = 0 for x ≤ 0     (3.6)

where λ > 0 is a parameter. This is the pdf of the exponential distribution. The uses of this distribution will be discussed in the next chapter.
Since:

∫_0^x λ e^{−λt} dt = [−e^{−λt}]_0^x = 1 − e^{−λx}

the cdf of the exponential distribution is:

F(x) = 0 for x ≤ 0, and F(x) = 1 − e^{−λx} for x > 0.

We now show that (3.6) satisfies the conditions for a pdf.

1. Since λ > 0 and e^a > 0 for any a, f(x) ≥ 0 for all x.

2. Since we have just done the integration to derive the cdf F(x), we can also use it to show that f(x) integrates to one. This follows from:

∫_{−∞}^{∞} f(x) dx = P(−∞ < X < ∞) = lim_{x→∞} F(x) − lim_{x→−∞} F(x)

which here is lim_{x→∞} (1 − e^{−λx}) − 0 = (1 − 0) − 0 = 1.
Expected value and variance of a continuous distribution
Suppose X is a continuous random variable with pdf f (x). Definitions of its expected
value, the expected value of any transformation g(X), the variance and standard
deviation are the same as for discrete distributions, except that summation is replaced
by integration:
E(X) = ∫_{−∞}^{∞} x f(x) dx

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

Var(X) = E[(X − E(X))^2] = ∫_{−∞}^{∞} (x − E(X))^2 f(x) dx = E(X^2) − (E(X))^2

sd(X) = √Var(X).
Example 3.22 Consider the exponential distribution introduced in Example 3.21. To find E(X) we can use integration by parts by considering x λ e^{−λx} as the product of the functions f = x and g′ = λ e^{−λx} (so that g = −e^{−λx}). Therefore:

E(X) = ∫_0^∞ x λ e^{−λx} dx = [−x e^{−λx}]_0^∞ − ∫_0^∞ (−e^{−λx}) dx = [−x e^{−λx}]_0^∞ − [(1/λ) e^{−λx}]_0^∞
     = [0 − 0] − (1/λ)[0 − 1]
     = 1/λ.

To obtain E(X^2), we choose f = x^2 and g′ = λ e^{−λx}, and use integration by parts:

E(X^2) = ∫_0^∞ x^2 λ e^{−λx} dx = [−x^2 e^{−λx}]_0^∞ + 2 ∫_0^∞ x e^{−λx} dx
       = 0 + (2/λ) ∫_0^∞ x λ e^{−λx} dx
       = 2/λ^2

where the last step follows because the last integral is simply E(X) = 1/λ again.
Finally:

Var(X) = E(X^2) − (E(X))^2 = 2/λ^2 − 1/λ^2 = 1/λ^2.
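A crude numerical check of these two results is possible without any calculus, by approximating the integrals with a Riemann sum. The Python sketch below is only a sanity check under assumed settings (λ = 0.5, the integration range and step count are arbitrary choices), not part of the original guide.

from math import exp

def expectation(g, lam, upper=100.0, steps=200_000):
    # Approximate the integral of g(x) * lam * exp(-lam * x) over (0, upper) by a midpoint sum.
    dx = upper / steps
    return sum(g((i + 0.5) * dx) * lam * exp(-lam * (i + 0.5) * dx) * dx for i in range(steps))

lam = 0.5
mean = expectation(lambda x: x, lam)
second_moment = expectation(lambda x: x * x, lam)
print(round(mean, 4), 1 / lam)                              # both about 2.0
print(round(second_moment - mean ** 2, 4), 1 / lam ** 2)    # both about 4.0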
Activity 3.18 A continuous random variable, X, has a probability density
function, f (x), defined by:
f(x) = ax + bx^2 for 0 ≤ x ≤ 1, and 0 otherwise
and E(X) = 1/2. Determine:
(a) the constants a and b
(b) the cumulative distribution function, F (x), of X
(c) the variance, Var(X).
Solution
(a) We have:
∫_0^1 f(x) dx = 1  ⇒  ∫_0^1 (ax + bx^2) dx = [ax^2/2 + bx^3/3]_0^1 = 1

i.e. we have a/2 + b/3 = 1.
Also, we know E(X) = 1/2, hence:

∫_0^1 x (ax + bx^2) dx = [ax^3/3 + bx^4/4]_0^1 = 1/2

i.e. we have:

a/3 + b/4 = 1/2  ⇒  a = 6 and b = −6.
Hence f (x) = 6x(1 − x) for 0 ≤ x ≤ 1, and 0 otherwise.
(b) We have:

F(x) = 0 for x < 0, F(x) = 3x^2 − 2x^3 for 0 ≤ x ≤ 1, and F(x) = 1 for x > 1.
(c) Finally:
E(X^2) = ∫_0^1 x^2 (6x(1 − x)) dx = ∫_0^1 (6x^3 − 6x^4) dx = [6x^4/4 − 6x^5/5]_0^1 = 0.3
and so the variance is:
Var(X) = E(X 2 ) − (E(X))2 = 0.3 − 0.25 = 0.05.
Activity 3.19 A continuous random variable X has the following pdf:
f(x) = x^3/4 for 0 ≤ x ≤ 2, and 0 otherwise.
(a) Explain why f (x) can serve as a pdf.
(b) Find the mean and mode of the distribution.
(c) Determine the cdf, F (x), of X.
(d) Find the variance, Var(X).
(e) Find the skewness of X, given by:
E[(X − E(X))3 ]
.
σ3
(f) If a sample of five observations is drawn at random from the distribution, find
the probability that all the observations exceed 1.5.
Solution
(a) Clearly, f (x) ≥ 0 for all x and:
Z 2
0
4 2
x
x3
dx =
= 1.
4
16 0
(b) The mean is:
Z
∞
2
Z
x f (x) dx =
E(X) =
−∞
0
5 2
x4
32
x
=
dx =
= 1.6
4
20 0 20
and the mode is 2 (where the density reaches a maximum).
(c) The cdf is:


for x < 0
0
F (x) = x4 /16 for 0 ≤ x ≤ 2


1
for x > 2.
(d) For the variance, we first find E(X 2 ), given by:
6 2
Z 2
Z 2 5
x
64
x
8
2
2
x f (x) dx =
E(X ) =
dx =
=
=
24 0 24
3
0
0 4
hence:
Var(X) = E(X 2 ) − (E(X))2 =
(e) The third ‘moment about zero’ is:
Z 2
Z
3
3
E(X ) =
x f (x) dx =
0
0
2
8 64
8
−
=
≈ 0.1067.
3 25
75
7 2
x
x6
128
dx =
=
≈ 4.5714.
4
28 0
28
Letting E(X) = µ, the numerator is:
E[(X − E(X))3 ] = E(X 3 ) − 3 µ E(X 2 ) + 3 µ2 E(X) − µ3
= 4.5714 − (3 × 1.6 × 2.6667) + (3 × (1.6)3 ) − (1.6)3
which is −0.0368, and the denominator is (0.1067)3/2 = 0.0349, hence the
skewness is −1.0544.
(f) The probability of a single observation exceeding 1.5 is:
2
Z
4 2
x3
x
dx =
= 1 − 0.3164 = 0.6836.
4
16 1.5
2
Z
f (x) dx =
1.5
1.5
So the probability of all five exceeding 1.5 is, by independence:
(0.6836)5 = 0.1493.
Activity 3.20 A random variable X has


1/4
f (x) = 3/4


0
the following pdf:
for 0 ≤ x ≤ 1
for 1 < x ≤ 2
otherwise.
(a) Explain why f (x) can serve as a pdf.
(b) Find the mean and median of the distribution.
(c) Find the variance, Var(X).
(d) Write down the cdf of X.
(e) Find P (X = 1) and P (X > 1.5 | X > 0.5).
Solution
R∞
(a) Clearly, f (x) ≥ 0 for all x and −∞ f (x) dx = 1. This can be seen geometrically,
since f (x) defines two rectangles, one with base 1 and height 1/4, the other
with base 1 and height 3/4, giving a total area of 1/4 + 3/4 = 1.
(b) We have:
∞
Z
E(X) =
1
Z
x f (x) dx =
−∞
0
x
dx+
4
Z
1
2
2 1 2 2
3x
x
3x
1 3 3
5
dx =
+
= + − = .
4
8 0
8 1 8 2 8
4
The median is most simply found geometrically. The area to the right of the
point x = 4/3 is 0.5, i.e. the rectangle with base 2 − 4/3 = 2/3 and height 3/4,
giving an area of 2/3 × 3/4 = 1/2. Hence the median is 4/3.
(c) For the variance, we proceed as follows:
2
Z
∞
E(X ) =
2
Z
x f (x) dx =
−∞
0
1
3 1 3 2
Z 2 2
x2
3x
x
x
1
1
11
dx+
dx =
+
= +2− = .
4
4
12 0
4 1 12
4
6
1
Hence the variance is:
Var(X) = E(X 2 ) − (E(X))2 =
94
11 25
88 75
13
−
=
−
=
≈ 0.2708.
6
16
48 48
48
3.5. Continuous random variables
(d) The cdf is:

0



x/4
F (x) =
3x/4 − 1/2



1
for
for
for
for
x<0
0≤x≤1
1<x≤2
x > 2.
(e) P (X = 1) = 0, since the cdf is continuous, and:
P (X > 1.5 | X > 0.5) =
P (X > 1.5)
P ({X > 1.5} ∩ {X > 0.5})
=
P (X > 0.5)
P (X > 0.5)
0.5 × 0.75
1 − 0.5 × 0.25
0.375
=
0.875
3
= ≈ 0.4286.
7
=
Activity 3.21 Hard question!
The waiting time, W , of a traveller queueing at a taxi rank is distributed according
to the cumulative distribution function, G(w), defined by:


for w < 0
0
G(w) = 1 − (2/3) exp(−w/2) for 0 ≤ w < 2


1
for w ≥ 2.
(a) Sketch the cumulative distribution function.
(b) Is the random variable W discrete, continuous or mixed?
(c) Evaluate P (W > 1), P (W = 2), P (W ≤ 1.5 | W > 0.5) and E(W ).
Solution
(a) A sketch of the cumulative distribution function is:
G (w )
1
1-(2/3)e -1
1/3
0
2
w
(b) We see the distribution is mixed, with discrete ‘atoms’ at 0 and 2.
(c) We have:
P (W > 1) = 1 − G(1) =
2 −1/2
e
,
3
P (W = 2) =
2 −1
e
3
and:
P (W ≤ 1.5 | W > 0.5) =
=
P (0.5 < W ≤ 1.5)
P (W > 0.5)
G(1.5) − G(0.5)
1 − G(0.5)
1 − (2/3)e−1.5/2 − 1 − (2/3)e−0.5/2
=
(2/3)e−0.5/2
= 1 − e−1/2 .
Finally, the mean is:
Z 2
2 −1
1
1
E(W ) = × 0 + e × 2 +
w e−w/2 dw
3
3
3
0
−w/2 2 Z 2
4 −1
we
2 −w/2
= e +
+
e
dw
3
3 −1/2 0
0 3
−w/2 2
4 −1 4 −1
2e
= e − e +
3
3
3 −1/2 0
4
= (1 − e−1 ).
3
96
3.5. Continuous random variables
Activity 3.22 Consider the function:
(
λ2 x e−λ x
f (x) =
0
for x > 0
otherwise.
(a) Show that this function has the characteristics of a probability density function.
(b) Evaluate E(X) and Var(X).
Solution
(a) Clearly, f (x) ≥ 0 for all x since λ2 > 0, x > 0 and e−λ x > 0.
R∞
To show, −∞ f (x) dx = 1, we have:
Z
∞
Z
∞
2
f (x) dx =
λ xe
−∞
0
−λx
∞ Z ∞
e−λx
e−λx
λ2
+
dx
dx = λ x
−λ 0
λ
0
Z ∞
λ e−λx dx
=0+
2
0
= 1 (provided λ > 0).
(b) For the mean:
Z
∞
E(X) =
x λ2 x e−λ x dx
0
= −x
=0+
2
∞
λ e−λ x 0
2
λ
∞
Z
2 x λ e−λ x dx
+
0
(from the exponential distribution).
For the variance:
Z ∞
Z
3 −λ x ∞
2
2 2
−λ x
E(X ) =
x λ xe
dx = −x λ e
+
0
0
∞
0
3 x2 λ e−λ x dx =
6
.
λ2
So, Var(X) = 6/λ2 − (2/λ)2 = 2/λ2 .
Activity 3.23 A random variable, X, has a
defined by:


0
F (x) = 1 − a e−x


1
cumulative distribution function, F (x),
for x < 0
for 0 ≤ x < 1
for x ≥ 1.
(a) Derive expressions for:
i. P (X = 0)
ii. P (X = 1)
iii. the pdf of X (where it is continuous)
iv. E(X).
(b) Suppose that E(X) = 0.75 (1 − e−1 ). Evaluate the median of X and Var(X).
Solution
(a) We have:
i. P (X = 0) = F (0) = 1 − a.
ii. P (X = 1) = lim (F (1) − F (x)) = 1 − (1 − a e−1 ) = a e−1 .
x→1
−x
iii. f (x) = a e , for 0 ≤ x < 1, and 0 otherwise.
iv. The mean is:
Z
−1
E(X) = 0 × (1 − a) + 1 × (a e ) +
1
x a e−x dx
0
= a e−1 + [−x a e−x ]10 +
1
Z
a e−x dx
0
= ae
−1
− ae
−1
+
[−a e−x ]10
= a (1 − e−1 ).
(b) The median, m, satisfies:
−m
F (m) = 0.5 = 1 − 0.75 e
⇒
2
m = − ln
= 0.4055.
3
Recall Var(X) = E(X 2 ) − (E(X))2 , so:
2
2
2
Z
−1
E(X ) = 0 × (1 − a) + 1 × (a e ) +
1
x2 a e−x dx
0
= a e−1 + [−x2 a e−x ]10 + 2
Z
1
x a e−x dx
0
= ae
−1
− ae
−1
+ 2(a − 2a e−1 )
= 2a − 4a e−1 .
Hence:
Var(X) = 2a − 4a e−1 − a2 (1 + e−2 − 2e−1 ) = 0.1716.
Activity 3.24 A continuous random variable, X, has a probability density
function, f (x), defined by:
(
k sin(x) for 0 ≤ x ≤ π
f (x) =
0
otherwise.
98
3.5. Continuous random variables
(a) Determine the constant k and derive the cumulative distribution function, F (x),
of X.
(b) Find E(X) and Var(X).
Solution
(a) We have:
Z
∞
Z
f (x) dx =
−∞
π
k sin(x) dx = 1.
0
Therefore:
[k (− cos(x))]π0 = 2k = 1
⇒
1
k= .
2
The cdf is hence:


for x < 0
0
F (x) = (1 − cos(x))/2 for 0 ≤ x ≤ π


1
for x > π.
(b) By symmetry, E(X) = π/2. Alternatively:
Z π
Z π
1
1
1
π 1
π
π
E(X) =
x sin(x) dx = [x(− cos(x))]0 +
cos(x) dx = + [sin(x)]π0 = .
2
2 2
2
0 2
0 2
Next:
2
Z
E(X ) =
0
π
π Z π
1
1 2
x
sin(x) dx =
x (− cos(x)) +
x cos(x) dx
2
2
0
0
Z π
π2
π
sin(x) dx
=
+ [x sin(x)]0 −
2
0
2
=
π2
− [− cos(x)]π0
2
π2
− 2.
=
2
Therefore, the variance is:
Var(X) = E(X 2 ) − (E(X))2 =
Activity 3.25 A random variable, X, has the


x/5
f (x) = (20 − 4x)/30


0
π2
π2
π2
−2−
=
− 2.
2
4
4
following pdf:
for 0 < x < 2
for 2 ≤ x ≤ 5
otherwise.
(a) Sketch the graph of f (x).
(b) Derive the cumulative distribution function, F (x), of X.
(c) Find the mean and the standard deviation of X.
Solution
0.2
0.0
0.1
f(x)
0.3
0.4
(a) The pdf of X has the following form:
0
1
2
3
4
5
x
(b) We determine the cdf by integrating the pdf over the appropriate range, hence:
F (x) =


0





x2 /10
for x ≤ 0
for 0 < x < 2


(10x − x2 − 10)/15 for 2 ≤ x ≤ 5




1
for x > 5.
This results from the following calculations. Firstly, for x ≤ 0, we have:
Z
x
x
Z
F (x) =
f (t) dt =
−∞
0 dt = 0.
−∞
For 0 < x < 2, we have:
Z
x
F (x) =
0
f (t) dt =
−∞
100
Z
Z
0 dt +
−∞
0
x
2 x
t
t
x2
dt =
= .
5
10 0
10
3.5. Continuous random variables
For 2 ≤ x ≤ 5, we have:
Z x
Z
F (x) =
f (t) dt =
Z x
t
20 − 4t
0 dt +
dt +
dt
30
−∞
0 5
2
x
2t
t2
4
+
−
=0+
10
3
15 2
4
2x x2
4
4
=
+
−
−
−
10
3
15
3 15
−∞
0
Z
2
=
2x x2 2
−
−
3
15 3
=
10x − x2 − 10
.
15
(c) To find the mean we proceed as follows:
Z ∞
Z
µ = E(X) =
x f (x) dx =
Z 5
x2
20x − 4x2
dx +
dx
30
0 5
2
3 2 2
5
2x3
x
x
−
=
+
15 0
3
45 2
25 250
4 16
8
+
−
−
=
−
15
3
45
3 45
−∞
2
7
= .
3
Similarly:
∞
2
Z 5
x3
20x2 − 4x3
E(X ) =
x f (x) dx =
dx +
dx
30
−∞
0 5
2
4 2 3
5
x
2x
x4
=
+
−
20 0
9
30 2
16
250 625
16 16
=
−
+
−
−
20
9
30
9
30
Z
2
Z
2
=
13
.
2
Hence the variance is:
2
7
117 98
19
=
−
=
≈ 1.0555.
3
18
18
18
√
Therefore, the standard deviation is σ = 1.0555 = 1.0274.
13
σ = E(X ) − (E(X)) =
−
2
2
2
2
101
3. Random variables
Activity 3.26 Let g(x) be defined as:


for 0 ≤ x < 3
x/3
g(x) = −x/3 + 2 for 3 ≤ x ≤ 6


0
otherwise.
(a) If f (x) = k g(x) is a probability density function, find the value of k.
(b) Let X be a random variable with probability density function f (x). Find E(X).
Solution
(a) If one draws it on a diagram, it is simply a triangle with base length 6 (from 0
to 6 on the x-axis) and height 1 (the highest point at x = 3). Integrating this
function
is just finding the area of it, which is (6 × 1)/2 = 3. Hence
R
f (x) dx = 3k, and so we must have 3k = 1, implying k = 1/3. Alternatively,
note that:
Z 3
Z 6
Z 6
Z 6
x
x
− + 2 dx .
dx +
g(x) dx = k
k g(x) dx = k
3
3
0 3
0
0
R
We must have f (x) dx = 1. Hence:
Z
1=k
0
3
x
dx + k
3
2
2 3
6
x
x
3k 3k
x
− + 2 dx = k
+ k − + 2x =
+
3
6 0
6
2
2
3
6
Z
3
giving k = 1/3.
(b) This can be done very quickly if one can realise that g(x), and hence f (x), is
symmetric aroundR 3, and hence the mean, E(X), must be 3. Otherwise, you
6
need to calculate 0 x f (x) dx, which can be written as the sum of two integrals:
Z
0
6
1
x f (x) dx =
3
Z
0
6
1
x g(x) dx =
3
Z
0
3
1
x2
dx +
3
3
Z
3
6
2
x
− + 2x dx.
3
Therefore:
3 3 3
6
Z 3 2
Z 6 2
x
x
2x
x
x
x2
E(X) =
dx+
− +
dx =
+ − +
= 1+(4−2) = 3.
9
3
27 0
27
3 3
0 9
3
3.5.1
Median of a random variable
Recall from ST104a Statistics 1 that the sample median is essentially the observation
‘in the middle’ of a set of data, i.e. where half of the observations in the sample are
smaller than the median and half of the observations are larger.
The median of a random variable (i.e. of its probability distribution) is similar in spirit.
Median of a random variable
The median, m, of a continuous random variable X is the value which satisfies:
F (m) = 0.5.
(3.7)
So once we know F (x), we can find the median by solving (3.7).
Example 3.23 For the Pareto distribution we have:
F(x) = 1 − (k/x)^α for x ≥ k.

So F(m) = 1 − (k/m)^α = 1/2 when:

(k/m)^α = 1/2  ⇔  k/m = 1/2^{1/α}  ⇔  m = k 2^{1/α}.

For example:

when k = 1 and α = 2.2, the median is m = 2^{1/2.2} = 1.37

when k = 1 and α = 0.8, the median is m = 2^{1/0.8} = 2.38.
Example 3.24 For the exponential distribution we have:
F(x) = 1 − e^{−λx} for x > 0.

So F(m) = 1 − e^{−λm} = 1/2 when:

e^{−λm} = 1/2  ⇔  −λm = −log 2  ⇔  m = (log 2)/λ.
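When F(m) = 0.5 cannot be solved so neatly, the median can still be found numerically. The Python sketch below (illustrative only; the bisection routine, the interval endpoints and the parameter values are assumptions made for this example) recovers both medians computed above.

from math import exp, log

def median_from_cdf(cdf, lo, hi, tol=1e-10):
    # Bisection: find m with cdf(m) = 0.5, assuming cdf is non-decreasing on [lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = 2.0
print(round(median_from_cdf(lambda x: 1 - exp(-lam * x), 0.0, 50.0), 6))  # (log 2)/lam
print(round(log(2) / lam, 6))

k, alpha = 1.0, 2.2
print(round(median_from_cdf(lambda x: 1 - (k / x) ** alpha, k, 50.0), 4))  # about 1.37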
3.6 Overview of chapter
This chapter has formally introduced random variables, making a distinction between
discrete and continuous random variables. Properties of probability distributions were
discussed, including the determination of expected values and variances.
3.7
Key terms and concepts
Constant
Continuous
Cumulative distribution function
Discrete
Estimators
Expected value
Experiment
Median
Outcome
Parameter
Probability density function
Probability distribution
Probability (mass) function
Random variable
Standard deviation
Step function
Variance

3.8 Sample examination questions
Solutions can be found in Appendix C.
1. Let X be a discrete random variable with expected value zero. Furthermore, it is
given that P (X = 1) = 0.2 and P (X = 2) = 0.5. Suppose X only takes one other
value besides 1 and 2.
(a) What is the other value X takes?
(b) What is the variance of X?
2. The random variable X has the probability density function given by:
f(x) = kx^2 for 0 < x < 1, and 0 otherwise
where k > 0 is a constant.
(a) Find the value of k.
(b) Compute E(X) and Var(X).
3. The random variable X has the probability density function given by
f (x) = kx2 (1 − x) for 0 < x < 1 (and 0 otherwise). Here k > 0 is a constant.
(a) Find the value of k.
(b) Compute Var(1/X).
Chapter 4
Common distributions of random
variables
4.1
Synopsis of chapter content
This chapter formally introduces common ‘families’ of probability distributions which
can be used to model various real-world phenomena.
4.2
Learning outcomes
After completing this chapter, you should be able to:
summarise basic distributions such as the uniform, Bernoulli, binomial, Poisson,
exponential and normal
calculate probabilities of events for these distributions using the probability
function, probability density function or cumulative distribution function
determine probabilities using statistical tables, where appropriate
state properties of these distributions such as the expected value and variance.
4.3
Introduction
In statistical inference we will treat observations:
X1 , X2 , . . . , Xn
(the sample) as values of a random variable X, which has some probability distribution
(the population distribution).
How to choose the probability distribution?
Usually we do not try to invent new distributions from scratch.
Instead, we use one of many existing standard distributions.
There is a large number of such distributions, such that for most purposes we can
find a suitable standard distribution.
This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:
continuous versus discrete
among discrete: a finite versus an infinite number of possible values
among continuous: different sets of possible values (for example, all real numbers x,
x > 0, or x ∈ [0, 1]); symmetric versus skewed distributions.
The ‘distributions’ discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it etc.
In the statistical analysis of a random variable X we typically:
select a family of distributions based on the basic characteristics of X
use observed data to choose (estimate) values for the parameters of that
distribution, and perform statistical inference on them.
Example 4.1 An opinion poll on a referendum, where each Xi is an answer to the
question ‘Will you vote ‘Yes’ or ‘No’ to leaving the European Union?’ has answers
recorded as Xi = 0 if ‘No’ and Xi = 1 if ‘Yes’. In a poll of 950 people, 513 answered
‘Yes’.
How do we choose a distribution to represent Xi ?
Here we need a family of discrete distributions with only two possible values (0
and 1). The Bernoulli distribution (discussed in the next section), which has
one parameter π (the probability that Xi = 1) is appropriate.
Within the family of Bernoulli distributions, we use the one where the value of
π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54.
4.4
Common discrete distributions
For discrete random variables, we will consider the following distributions.
Discrete uniform distribution.
Bernoulli distribution.
Binomial distribution.
Poisson distribution.
4.4.1
Discrete uniform distribution
Suppose a random variable X has k possible values 1, 2, . . . , k. X has a discrete
uniform distribution if all of these values have the same probability, i.e. if:
p(x) = P(X = x) = 1/k for x = 1, 2, . . . , k, and 0 otherwise.
Example 4.2 A simple example of the discrete uniform distribution is the
distribution of the score of a fair die, with k = 6.
The discrete uniform distribution is not very common in applications, but it is useful as
a reference point for more complex distributions.
Mean and variance of a discrete uniform distribution
Calculating directly from the definition,1 we have:
E(X) = Σ_{x=1}^{k} x p(x) = (1 + 2 + · · · + k)/k = (k + 1)/2     (4.1)

and:

E(X^2) = Σ_{x=1}^{k} x^2 p(x) = (1^2 + 2^2 + · · · + k^2)/k = (k + 1)(2k + 1)/6.     (4.2)

Therefore:

Var(X) = E(X^2) − (E(X))^2 = (k^2 − 1)/12.
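As a small illustration (not part of the original guide), the Python sketch below checks (4.1) and (4.2) for the fair-die case k = 6 by computing the mean and variance directly from the definition.

k = 6   # e.g. the score of a fair die

values = range(1, k + 1)
mean = sum(values) / k                        # should equal (k + 1)/2
second_moment = sum(x * x for x in values) / k
variance = second_moment - mean ** 2          # should equal (k^2 - 1)/12

print(mean, (k + 1) / 2)                      # 3.5 3.5
print(variance, (k ** 2 - 1) / 12)            # 2.9166... 2.9166...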
4.4.2 Bernoulli distribution
A Bernoulli trial is an experiment with only two possible outcomes. We will number
these outcomes 1 and 0, and refer to them as ‘success’ and ‘failure’, respectively.
Example 4.3 Examples of outcomes of Bernoulli trials are:
agree / disagree
male / female
employed / not employed
owns a car / does not own a car
business goes bankrupt / continues trading.
1 (4.1) and (4.2) make use, respectively, of Σ_{i=1}^{n} i = n(n + 1)/2 and Σ_{i=1}^{n} i^2 = n(n + 1)(2n + 1)/6.
The Bernoulli distribution is the distribution of the outcome of a single Bernoulli
trial. This is the distribution of a random variable X with the following probability
function:
p(x) = π^x (1 − π)^{1−x} for x = 0, 1, and 0 otherwise.

Therefore, P(X = 1) = π and P(X = 0) = 1 − P(X = 1) = 1 − π, and no other values are possible. Such a random variable X has a Bernoulli distribution with (probability) parameter π. This is often written as:

X ∼ Bernoulli(π).

If X ∼ Bernoulli(π), then:

E(X) = Σ_{x=0}^{1} x p(x) = 0 × (1 − π) + 1 × π = π     (4.3)

E(X^2) = Σ_{x=0}^{1} x^2 p(x) = 0^2 × (1 − π) + 1^2 × π = π

and:

Var(X) = E(X^2) − (E(X))^2 = π − π^2 = π (1 − π).     (4.4)
Activity 4.1 Suppose {Bi } is an infinite sequence of independent Bernoulli trials
with:
P (Bi = 0) = 1 − π and P (Bi = 1) = π
for all i.
(a) Derive the distribution of X_n = Σ_{i=1}^{n} B_i and the expected value and variance of X_n.

(b) Let Y = min{i : B_i = 1}. Derive the distribution of Y and obtain an expression for P(Y > y).
Solution
(a) X_n = Σ_{i=1}^{n} B_i takes the values 0, 1, . . . , n. Any sequence consisting of x 1s and n − x 0s has a probability π^x (1 − π)^(n−x) and gives a value X_n = x. There are C(n, x) = n!/(x! (n − x)!) such sequences, so:

P(X_n = x) = C(n, x) π^x (1 − π)^(n−x)  for x = 0, 1, . . . , n

and 0 otherwise. Hence E(B_i) = π and Var(B_i) = π (1 − π), which means E(X_n) = n π and Var(X_n) = n π (1 − π).

(b) Y = min{i : B_i = 1} takes the values 1, 2, 3, . . ., hence:

P(Y = y) = (1 − π)^(y−1) π  for y = 1, 2, 3, . . .

and 0 otherwise. It follows that P(Y > y) = (1 − π)^y.
4.4.3 Binomial distribution
Suppose we carry out n Bernoulli trials such that:
at each trial, the probability of success is π
different trials are statistically independent events.
Let X denote the total number of successes in these n trials. X follows a binomial
distribution with parameters n and π, where n ≥ 1 is a known integer and 0 ≤ π ≤ 1.
This is often written as:
X ∼ Bin(n, π).
The binomial distribution was first encountered in Example 3.14.
Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers.
James is taking the test, but has no idea at all about the correct answers. So he
guesses every answer and, therefore, has the probability of 1/4 of getting any
individual question correct.
Let X denote the number of correct answers in James’ test. X follows the binomial
distribution with n = 4 and π = 0.25, i.e. we have:
X ∼ Bin(4, 0.25).
For example, what is the probability that James gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability
π = 0.25 of being correct. The probability of any particular sequence of 3 correct
and 1 incorrect answers, for example 1110, is π 3 (1 − π)1 , where ‘1’ denotes a correct
answer and ‘0’ denotes an incorrect answer.
However, we do not care about the order of the 1s and 0s, only about the number of
1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these
also has the probability π 3 (1 − π)1 .
The total number of sequences with three 1s (and, therefore, one 0) is the number of locations for the three 1s which can be selected in the sequence of 4 answers. This is C(4, 3) = 4. Therefore, the probability of obtaining three 1s is:

C(4, 3) π³ (1 − π)¹ = 4 × (0.25)³ × (0.75)¹ ≈ 0.0469.
Binomial distribution probability function
In general, the probability function of X ∼ Bin(n, π) is:
p(x) = C(n, x) π^x (1 − π)^(n−x)  for x = 0, 1, . . . , n, and p(x) = 0 otherwise.    (4.5)
We have already shown that (4.5) satisfies the conditions for being a probability
function in the previous chapter (see Example 3.14).
Example 4.5 Continuing Example 4.4, where X ∼ Bin(4, 0.25), we have:
p(0) = C(4, 0) × (0.25)⁰ × (0.75)⁴ = 0.3164,    p(1) = C(4, 1) × (0.25)¹ × (0.75)³ = 0.4219,

p(2) = C(4, 2) × (0.25)² × (0.75)² = 0.2109,    p(3) = C(4, 3) × (0.25)³ × (0.75)¹ = 0.0469,

p(4) = C(4, 4) × (0.25)⁴ × (0.75)⁰ = 0.0039.
If X ∼ Bin(n, π), then:
E(X) = n π
and:
Var(X) = n π (1 − π).
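If you have access to Python with the SciPy library (an optional aid only; such software is not assumed by the guide and is not available in the examination), the probabilities in Example 4.5 and these mean and variance formulae can be checked directly.

from scipy.stats import binom    # assumes SciPy is installed

X = binom(4, 0.25)               # X ~ Bin(4, 0.25), as in Examples 4.4 and 4.5

for x in range(5):
    print(x, round(X.pmf(x), 4)) # 0.3164, 0.4219, 0.2109, 0.0469, 0.0039

print(X.mean(), X.var())         # n*pi = 1.0 and n*pi*(1 - pi) = 0.75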
Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4
possible answers. Consider again James who guesses each one of the answers. Let X
denote the number of correct answers by such a student, so that we have
X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is
E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, . . . , 20 we get (rounded to 2 decimal places):
x      0     1     2     3     4     5     6     7     8     9     10
p(x)   0.00  0.02  0.07  0.13  0.19  0.20  0.17  0.11  0.06  0.03  0.01

x      11    12    13    14    15    16    17    18    19    20
p(x)   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Calculating the cumulative probabilities, we find that F (7) = P (X < 8) = 0.898 and
F (8) = P (X < 9) = 0.959. Therefore, P (X ≥ 8) = 0.102 > 0.05 and also
P (X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
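The same search can be sketched in Python with SciPy (again purely an optional check, under the assumption that such software is available).

from scipy.stats import binom    # assumes SciPy is installed

X = binom(20, 0.25)              # correct answers when guessing every question

# Find the smallest pass mark x with P(X >= x) < 0.05.
x = 0
while 1 - X.cdf(x - 1) >= 0.05:  # P(X >= x) = 1 - P(X <= x - 1)
    x += 1
print(x, 1 - X.cdf(x - 1))       # 9, and P(X >= 9) is about 0.041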
More generally, consider a student who has the same probability π of the correct
answer for every question, so that X ∼ Bin(20, π). Figure 4.1 shows plots of the
probabilities for π = 0.25, 0.5, 0.7 and 0.9.
[Figure 4.1 has four panels, one for each of π = 0.25 (E(X) = 5), π = 0.5 (E(X) = 10), π = 0.7 (E(X) = 14) and π = 0.9 (E(X) = 18), each plotting probability against the number of correct answers.]
Figure 4.1: Probability plots for Example 4.6.
Activity 4.2 A binomial random variable X has probability function:
p(x) = C(n, x) π^x (1 − π)^(n−x)  for x = 0, 1, 2, . . . , n, and p(x) = 0 otherwise.
Consider this distribution in the case where n = 4 and π = 0.8. For this distribution,
calculate the expected value and variance of X.
(Note that E(X) = n π and Var(X) = n π (1 − π) for this distribution. Check that
your answer agrees with this.)
Solution
Substituting the values into the definitions we get:
E(X) = Σ_x x p(x) = 0 × 0.0016 + 1 × 0.0256 + · · · + 4 × 0.4096 = 3.2

E(X²) = Σ_x x² p(x) = 0 × 0.0016 + 1 × 0.0256 + · · · + 16 × 0.4096 = 10.88

and:

Var(X) = E(X²) − (E(X))² = 10.88 − (3.2)² = 0.64.

Note that E(X) = n π = 4 × 0.8 = 3.2 and Var(X) = n π (1 − π) = 4 × 0.8 × (1 − 0.8) = 0.64 for n = 4 and π = 0.8, as stated by the general formulae.
Activity 4.3 A certain electronic system contains 12 components. Suppose that the
probability that each individual component will fail is 0.3 and that the components
fail independently of each other. Given that at least two of the components have
failed, what is the probability that at least three of the components have failed?
Solution
Let X denote the number of components which will fail, hence X ∼ Bin(12, 0.3).
Therefore:
P(X ≥ 3 | X ≥ 2) = P(X ≥ 3)/P(X ≥ 2)

= [1 − P(X = 0) − P(X = 1) − P(X = 2)] / [1 − P(X = 0) − P(X = 1)]

= (1 − 0.0138 − 0.0712 − 0.1678)/(1 − 0.0138 − 0.0712)

= 0.7472/0.9150

= 0.8166.
Activity 4.4 A greengrocer has a very large pile of oranges on his stall. The pile of
fruit is a mixture of 50% old fruit with 50% new fruit; one cannot tell which are old
and which are new. However, 20% of old oranges are mouldy inside, but only 10% of
new oranges are mouldy. Suppose that you choose 5 oranges at random. What is the
distribution of the number of mouldy oranges in your sample?
Solution
For an orange chosen at random, the event ‘mouldy’ is the union of the disjoint
events ‘mouldy’ ∩ ‘new’ and ‘mouldy’ ∩ ‘old’. So:
P (‘mouldy’) = P (‘mouldy’ ∩ ‘new’) + P (‘mouldy’ ∩ ‘old’)
= P (‘mouldy’ | ‘new’) P (‘new’) + P (‘mouldy’ | ‘old’) P (‘old’)
= 0.1 × 0.5 + 0.2 × 0.5
= 0.15.
As the pile of oranges is very large, we can assume that the results for the five
oranges will be independent, so we have 5 independent trials each with probability of
‘mouldy’ equal to 0.15. The distribution of the number of mouldy oranges will be a
binomial distribution with n = 5 and π = 0.15.
Activity 4.5 Metro trains on a particular line have a probability 0.05 of failure
between two stations. Supposing that the failures are all independent, what is the
probability that out of 10 journeys between these two stations more than 8 do not
have a breakdown?
Solution
The probability of no breakdown on one journey is π = 1 − 0.05 = 0.95, so the
number of journeys without a breakdown, X, has a Bin(10, 0.95) distribution. We
want P (X > 8), which is:
P(X > 8) = p(9) + p(10)

= C(10, 9) × (0.95)⁹ × (0.05)¹ + C(10, 10) × (0.95)¹⁰ × (0.05)⁰

= 0.3151 + 0.5987

= 0.9138.
Activity 4.6 Hard question!
Show that, for a binomial random variable X ∼ Bin(n, π):

E(X) = n π Σ_{x=1}^{n} [(n − 1)!/((x − 1)! (n − x)!)] π^(x−1) (1 − π)^(n−x).
Hence find E(X) and Var(X). (The wording of the question implies that you use the
result which you have just proved. Other methods of derivation will not be accepted!)
Solution
For X ∼ Bin(n, π), P(X = x) = C(n, x) π^x (1 − π)^(n−x). So, for E(X), we have:

E(X) = Σ_{x=0}^{n} x C(n, x) π^x (1 − π)^(n−x)

= Σ_{x=1}^{n} x C(n, x) π^x (1 − π)^(n−x)

= Σ_{x=1}^{n} [n (n − 1)!/((x − 1)! [(n − 1) − (x − 1)]!)] π π^(x−1) (1 − π)^(n−x)

= n π Σ_{x=1}^{n} C(n − 1, x − 1) π^(x−1) (1 − π)^(n−x)

= n π Σ_{y=0}^{n−1} C(n − 1, y) π^y (1 − π)^((n−1)−y)

= n π × 1

= n π
where y = x − 1, and the last summation is over all the values of the pf of another
binomial distribution, this time with possible values 0, 1, . . . , n − 1 and probability
parameter π.
Similarly:

E(X(X − 1)) = Σ_{x=0}^{n} x (x − 1) C(n, x) π^x (1 − π)^(n−x)

= Σ_{x=2}^{n} [x (x − 1) n!/((n − x)! x!)] π^x (1 − π)^(n−x)

= n (n − 1) π² Σ_{x=2}^{n} [(n − 2)!/((n − x)! (x − 2)!)] π^(x−2) (1 − π)^(n−x)

= n (n − 1) π² Σ_{y=0}^{n−2} [(n − 2)!/((n − y − 2)! y!)] π^y (1 − π)^(n−y−2)

with y = x − 2. Now let m = n − 2, so:

E(X(X − 1)) = n (n − 1) π² Σ_{y=0}^{m} [m!/((m − y)! y!)] π^y (1 − π)^(m−y) = n (n − 1) π²

since the summation is 1, as before.

Finally:

Var(X) = E(X(X − 1)) − E(X) [E(X) − 1]

= n (n − 1) π² − n π (n π − 1)

= −n π² + n π

= n π (1 − π).
Activity 4.7 Hard question!
Suppose that the normal rate of infection for a certain disease in cattle is 25%. To
test a new serum which may prevent infection, three experiments are carried out.
The test for infection is not always valid for some particular cattle, so the
experimental results are incomplete – we cannot always tell whether a cow is
infected or not. The results of the three experiments are:
(a) 10 animals are injected; all 10 remain free from infection
(b) 17 animals are injected; more than 15 remain free from infection and there are 2
doubtful cases
(c) 23 animals are injected; more than 20 remain free from infection and there are three doubtful cases.
Which experiment provides the strongest evidence in favour of the serum?
Solution
These experiments involve tests on different cattle, which one might expect to
behave independently of one another. The probability of infection without injection
with the serum might also reasonably be assumed to be the same for all cattle. So
the distribution which we need here is the binomial distribution. If the serum has no
effect, then the probability of infection for each of the cattle is 0.25.
One way to assess the evidence of the three experiments is to calculate the
probability of the result of the experiment if the serum had no effect at all. If it has
an effect, then one would expect larger numbers of cattle to remain free from
infection, so the experimental results as given do provide some clue as to whether
the serum has an effect, in spite of their incompleteness.
Let X(n) be the number of cattle infected, out of a sample of n. We are assuming
that X(n) ∼ Bin(n, 0.25).
(a) With 10 trials, the probability of 0 infected if the serum has no effect is:
P(X(10) = 0) = C(10, 0) × (0.75)¹⁰ = (0.75)¹⁰ = 0.0563.
(b) With 17 trials, the probability of more than 15 remaining uninfected if the
serum has no effect is:
P(X(17) < 2) = P(X(17) = 0) + P(X(17) = 1)

= C(17, 0) × (0.75)¹⁷ + C(17, 1) × (0.25)¹ × (0.75)¹⁶

= (0.75)¹⁷ + 17 × (0.25)¹ × (0.75)¹⁶

= 0.0075 + 0.0426

= 0.0501.
(c) With 23 trials, the probability of more than 20 remaining free from infection if
the serum has no effect is:
P(X(23) < 3) = P(X(23) = 0) + P(X(23) = 1) + P(X(23) = 2)

= C(23, 0) × (0.75)²³ + C(23, 1) × (0.25)¹ × (0.75)²² + C(23, 2) × (0.25)² × (0.75)²¹

= (0.75)²³ + 23 × 0.25 × (0.75)²² + (23 × 22/2) × (0.25)² × (0.75)²¹

= 0.0013 + 0.0103 + 0.0376

= 0.0492.
The most surprising-looking result among these three experiments is that of experiment (c), and so we can say that this experiment provides the strongest evidence in favour of the serum.
4.4.4 Poisson distribution
The possible values of the Poisson distribution are the non-negative integers
0, 1, 2, . . ..
Poisson distribution probability function
The probability function of the Poisson distribution is:
p(x) = e^(−λ) λ^x / x!  for x = 0, 1, 2, . . ., and p(x) = 0 otherwise    (4.6)
where λ > 0 is a parameter.
If a random variable X has a Poisson distribution with parameter λ, this is often
denoted by:
X ∼ Poisson(λ) or X ∼ Pois(λ).
If X ∼ Poisson(λ), then:
E(X) = λ
and:
Var(X) = λ.
Poisson distributions are used for counts of occurrences of various kinds. To give a
formal motivation, suppose that we consider the number of occurrences of some
phenomenon in time, and that the process which generates the occurrences satisfies the
following conditions:
1. The numbers of occurrences in any two disjoint intervals of time are independent of
each other.
2. The probability of two or more occurrences at the same time is negligibly small.
3. The probability of one occurrence in any short time interval of length t is λ t for
some constant λ > 0.
In essence, these state that individual occurrences should be independent, sufficiently
rare, and happen at a constant rate λ per unit of time. A process like this is a Poisson
process.
If occurrences are generated by a Poisson process, then the number of occurrences in a
randomly selected time interval of length t = 1, X, follows a Poisson distribution with
mean λ, i.e. X ∼ Poisson(λ).
The single parameter λ of the Poisson distribution is, therefore, the rate of occurrences
per unit of time.
Example 4.7 Examples of variables for which we might use a Poisson distribution:
The number of telephone calls received at a call centre per minute.
The number of accidents on a stretch of motorway per week.
The number of customers arriving at a checkout per minute.
The number of misprints per page of newsprint.
Because λ is the rate per unit of time, its value also depends on the unit of time (that
is, the length of interval) we consider.
Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if
Y is the number of arrivals per two hours, Y ∼ Poisson(1.5 × 2) = Poisson(3).
λ is also the mean of the distribution, i.e. E(X) = λ.
Both motivations suggest that distributions with higher values of λ have higher
probabilities of large values of X.
Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for
X ∼ Poisson(2) and X ∼ Poisson(4).
Figure 4.2: Probability plots for Example 4.9.
Example 4.10 Customers arrive at a bank on weekday afternoons randomly at an
average rate of 1.6 customers per minute. Let X denote the number of arrivals per
minute and Y denote the number of arrivals per 5 minutes.
We assume a Poisson distribution for both, such that:
X ∼ Poisson(1.6)
and:
Y ∼ Poisson(1.6 × 5) = Poisson(8).
1. What is the probability that no customer arrives in a one-minute interval?
For X ∼ Poisson(1.6), the probability P(X = 0) is:

p_X(0) = e^(−λ) λ⁰/0! = e^(−1.6) (1.6)⁰/0! = e^(−1.6) = 0.2019.
2. What is the probability that more than two customers arrive in a one-minute
interval?
P (X > 2) = 1 − P (X ≤ 2) = 1 − [P (X = 0) + P (X = 1) + P (X = 2)] which is:
1 − p_X(0) − p_X(1) − p_X(2) = 1 − e^(−1.6) (1.6)⁰/0! − e^(−1.6) (1.6)¹/1! − e^(−1.6) (1.6)²/2!

= 1 − e^(−1.6) − 1.6 e^(−1.6) − 1.28 e^(−1.6)

= 1 − 3.88 e^(−1.6)

= 0.2167.
3. What is the probability that no more than 1 customer arrives in a five-minute
interval?
For Y ∼ Poisson(8), the probability P (Y ≤ 1) is:
p_Y(0) + p_Y(1) = e^(−8) 8⁰/0! + e^(−8) 8¹/1! = e^(−8) + 8 e^(−8) = 9 e^(−8) = 0.0030.
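The three probabilities above can also be checked with a few lines of Python using SciPy (an optional aid only; it is not assumed by the guide and is not available in the examination).

from scipy.stats import poisson   # assumes SciPy is installed

X = poisson(1.6)                  # arrivals per minute
Y = poisson(8)                    # arrivals per five minutes

print(X.pmf(0))                   # P(X = 0)  = 0.2019
print(1 - X.cdf(2))               # P(X > 2)  = 0.2166 (0.2167 above, after rounding)
print(Y.cdf(1))                   # P(Y <= 1) = 0.0030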
Activity 4.8 Cars independently pass a point on a busy road at an average rate of
150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(b) What is the expected number passing in two minutes?
(c) Find the probability that the expected number actually passes in a given
two-minute period.
Solution
(a) A rate of 150 cars per hour is a rate of 2.5 per minute. Using a Poisson distribution with λ = 2.5, P(none passes) = e^(−2.5) × (2.5)⁰/0! = e^(−2.5) = 0.0821.

(b) The expected number of cars passing in two minutes is 2 × 2.5 = 5.

(c) The probability of 5 cars passing in two minutes is e^(−5) × 5⁵/5! = 0.1755.
Activity 4.9 People entering an art gallery are counted by the attendant at the
door. Assume that people arrive in accordance with a Poisson distribution, with one
person arriving every 2 minutes. The attendant leaves the door unattended for 5
minutes.
(a) Calculate the probability that:
i. nobody will enter the gallery in this time
ii. 3 or more people will enter the gallery in this time.
(b) Find, to the nearest second, the length of time for which the attendant could
leave the door unattended for there to be a probability of 0.9 of no arrivals in
that time.
(c) Comment briefly on the assumption of a Poisson distribution in this context.
Solution
(a) λ = 1 for a two-minute interval, so λ = 2.5 for a five-minute interval. Therefore:
P (no arrivals) = e−2.5 = 0.0821
and:
P (≥ 3 arrivals) = 1 − p(0) − p(1) − p(2) = 1 − e−2.5 (1 + 2.5 + 3.125) = 0.4562.
(b) For an interval of N minutes, the parameter is N/2. We need p(0) = 0.9, so
e−N/2 = 0.9 giving N/2 = − ln(0.9) and N = 0.21 minutes, or 13 seconds.
(c) The arrival rate is unlikely to be constant: more people are likely to arrive at lunchtimes or in the early evening, for example. There are also likely to be several arrivals within a short period – couples, groups etc. – so arrivals are not independent. It is therefore quite unlikely that the Poisson distribution will provide a good model here.
Activity 4.10 In a large industrial plant there is an accident on average every two
days.
(a) What is the chance that there will be exactly two accidents in a given week?
(b) What is the chance that there will be two or more accidents in a given week?
(c) If James goes to work there for a four-week period, what is the probability that
no accidents occur while he is there?
Solution
Here we have counts of random events over time, which is a typical application for
the Poisson distribution. We are assuming that accidents are equally likely to occur
at any time and are independent. The mean for the Poisson distribution is 0.5 per
day.
Let X be the number of accidents in a week. The probability of exactly two
accidents in a given week is found by using the parameter λ = 5 × 0.5 = 2.5 (5
working days a week assumed).
(a) The probability of exactly two accidents in a week is:
p(2) = e^(−2.5) (2.5)²/2! = 0.2565.
(b) The probability of two or more accidents in a given week is:
P (X ≥ 2) = 1 − p(0) − p(1) = 0.7127.
(c) If James goes to the industrial plant and does not change the probability of an
accident simply by being there (he might bring bad luck, or be superbly
safety-conscious!), then over 4 weeks there are 20 working days, and the
probability of no accident comes from a Poisson random variable with mean 10.
If Y is the number of accidents while James is there, the probability of no
accidents is:
p_Y(0) = e^(−10) (10)⁰/0! = 0.0000454.
James is very likely to be there when there is an accident!
Activity 4.11 Arrivals at a post office may be modelled as following a Poisson
distribution with a rate parameter of 84 arrivals per hour.
(a) Find:
i. the probability of exactly seven arrivals in a period of two minutes
ii. the probability of more than three arrivals in 45 seconds
iii. the probability that the time to arrival of the next customer is less than
one minute.
(b) If T is the time to arrival of the next customer (in minutes), calculate:
P (T > 2.3 | T > 1).
Solution
(a) The rate is given as 84 per hour, but it is convenient to work in numbers of
minutes, so note that this is the same as λ = 1.4 arrivals per minute.
i. For two minutes, use λ = 1.4 × 2 = 2.8. Hence:
P(X = 7) = e^(−2.8) (2.8)⁷/7! = 0.0163.
ii. For 45 seconds, λ = 1.4 × 0.75 = 1.05. Hence:
P(X > 3) = 1 − P(X ≤ 3) = 1 − Σ_{x=0}^{3} e^(−1.05) (1.05)^x / x!

= 1 − e^(−1.05) (1 + 1.05 + 0.5513 + 0.1929)

= 0.0222.
iii. The probability that the time to arrival of the next customer is less than
one minute is 1 − P (no arrivals in one minute) = 1 − P (X = 0). For one
minute we use λ = 1.4, hence:
1 − P(X = 0) = 1 − e^(−1.4) (1.4)⁰/0! = 1 − e^(−1.4) = 1 − 0.2466 = 0.7534.
(b) The time to the next customer is more than t if there are no arrivals in the
interval from 0 to t, which means that we need to use λ = 1.4 × t. Now the
conditional probability formula yields:
P(T > 2.3 | T > 1) = P({T > 2.3} ∩ {T > 1}) / P(T > 1)

and, as in other instances, the two events {T > 2.3} and {T > 1} collapse to a single event, {T > 2.3}. Hence:

P(T > 2.3 | T > 1) = P(T > 2.3)/P(T > 1) = P(T > 2.3)/0.2466.

To calculate the numerator, use λ = 1.4 × 2.3 = 3.22, hence (by the same method as in iii. above):

P(T > 2.3) = e^(−3.22) (3.22)⁰/0! = e^(−3.22) = 0.0400.

Hence:

P(T > 2.3 | T > 1) = P(T > 2.3)/P(T > 1) = 0.0400/0.2466 = 0.1620.
Activity 4.12 A glacier in Greenland ‘calves’ (lets fall off into the sea) an iceberg
on average twice every five weeks. (Seasonal effects can be ignored for this question,
and so the calving process can be thought of as random, i.e. the calving of icebergs
can be assumed to be independent events.)
(a) Explain which distribution you would use to estimate the probabilities of
different numbers of icebergs being calved in different periods, justifying your
selection.
(b) What is the probability that no iceberg is calved in the next three weeks?
(c) What is the probability that no iceberg is calved in the three weeks after the
next three weeks?
(d) What is the probability that exactly five icebergs are calved in the next four
weeks?
(e) If exactly five icebergs are calved in the next four weeks, what is the probability
that exactly five more icebergs will be calved in the four-week period after the
next four weeks?
(f) Comment on the relationship between your answers to (d) and (e).
Solution
(a) If we assume that the calving process is random (as the remark about
seasonality hints) then we are counting events over periods of time (with, in
particular, no obvious upper maximum), and hence the appropriate distribution
is the Poisson distribution.
(b) The rate parameter for one week is 0.4, so for three weeks we use λ = 1.2, hence:
P(X = 0) = e^(−1.2) × (1.2)⁰/0! = e^(−1.2) = 0.3012.
(c) If it is correct to use the Poisson distribution then events are independent, and
hence:
P (none in weeks 1, 2 & 3) = P (none in weeks 4, 5 & 6) = 0.3012.
(d) The rate parameter for four weeks is λ = 1.6, hence:
P(X = 5) = e^(−1.6) × (1.6)⁵/5! = 0.0176.
(e) Bayes’ theorem tells us that:
P(5 in weeks 5 to 8 | 5 in weeks 1 to 4) = P(5 in weeks 5 to 8 ∩ 5 in weeks 1 to 4) / P(5 in weeks 1 to 4).
If it is correct to use the Poisson distribution then events are independent.
Therefore:
P (5 in weeks 5 to 8 ∩ 5 in weeks 1 to 4) = P (5 in weeks 5 to 8) P (5 in weeks 1 to 4).
So, cancelling, we get:
P (5 in weeks 5 to 8 | 5 in weeks 1 to 4) = P (5 in weeks 5 to 8)
= P (5 in weeks 1 to 4)
= 0.0176.
(f) The fact that the results are identical in the two cases is a consequence of the
independence built into the assumption that the Poisson distribution is the
appropriate one to use. A Poisson process does not ‘remember’ what happened
before the start of a period under consideration.
Activity 4.13 Hard question!
A discrete random variable X has possible values 0, 1, 2, . . ., and the probability
function:
p(x) = e^(−λ) λ^x / x!  for x = 0, 1, 2, . . ., and p(x) = 0 otherwise

where λ > 0 is a parameter. Show that E(X) = λ by determining Σ_x x p(x).
Solution
We have:
E(X) = Σ_{x=0}^{∞} x p(x) = Σ_{x=0}^{∞} x e^(−λ) λ^x / x! = Σ_{x=1}^{∞} e^(−λ) λ^x / (x − 1)!

= λ Σ_{x=1}^{∞} e^(−λ) λ^(x−1) / (x − 1)!

= λ Σ_{y=0}^{∞} e^(−λ) λ^y / y!

= λ × 1

= λ

where we replace x − 1 with y. The result follows from the fact that Σ_{y=0}^{∞} e^(−λ) λ^y / y! is the sum of all non-zero values of a probability function of this form.
For completeness, we also give here a derivation of the variance of this distribution.
Consider first:
E[X(X − 1)] = Σ_{x=0}^{∞} x (x − 1) p(x) = Σ_{x=2}^{∞} x (x − 1) e^(−λ) λ^x / x!

= λ² Σ_{x=2}^{∞} e^(−λ) λ^(x−2) / (x − 2)!

= λ² Σ_{y=0}^{∞} e^(−λ) λ^y / y!

= λ²

where y = x − 2. Also:

E[X(X − 1)] = E(X² − X) = Σ_x (x² − x) p(x) = Σ_x x² p(x) − Σ_x x p(x) = E(X²) − E(X) = E(X²) − λ.

Equating these and solving for E(X²) we get E(X²) = λ² + λ. Therefore:

Var(X) = E(X²) − (E(X))² = λ² + λ − λ² = λ.
Activity 4.14 Hard question!
James goes fishing every Saturday. The number of fish he catches follows a Poisson
distribution. On a proportion π of the days he goes fishing, he does not catch
anything. He makes it a rule to take home the first, and then every other, fish which
he catches, i.e. the first, third, fifth fish etc.
(a) Using a Poisson distribution, find the mean number of fish he catches.
(b) Show that the probability that he takes home the last fish he catches is
(1 − π²)/2.
Solution
(a) Let X denote the number of fish caught, such that X ∼ Poisson(λ).
We have P(X = x) = e^(−λ) λ^x / x!, where the parameter λ is as yet unknown, so P(X = 0) = e^(−λ) λ⁰/0! = e^(−λ).

However, we know P(X = 0) = π. So e^(−λ) = π, giving −λ = ln π and λ = ln(1/π). The mean number of fish he catches is therefore E(X) = λ = ln(1/π).
(b) James will take home the last fish caught if he catches 1, 3, 5, 7, . . . fish. So we
require:
P(X = 1) + P(X = 3) + P(X = 5) + · · · = e^(−λ) λ¹/1! + e^(−λ) λ³/3! + e^(−λ) λ⁵/5! + · · ·

= e^(−λ) (λ/1! + λ³/3! + λ⁵/5! + · · ·).

Now we know:

e^λ = 1 + λ + λ²/2! + λ³/3! + · · ·

and:

e^(−λ) = 1 − λ + λ²/2! − λ³/3! + · · · .

Subtracting gives:

e^λ − e^(−λ) = 2 (λ + λ³/3! + λ⁵/5! + · · ·).

Hence the required probability is:

e^(−λ) (e^λ − e^(−λ))/2 = (1 − e^(−2λ))/2 = (1 − π²)/2

since e^(−λ) = π above gives e^(−2λ) = π².
4.4.5 Connections between probability distributions
There are close connections between some probability distributions, even across
different families of them. Some connections are exact, i.e. one distribution is exactly
equal to another, for particular values of the parameters. For example, Bernoulli(π) is
the same distribution as Bin(1, π).
Some connections are approximate (or asymptotic), i.e. one distribution is closely
approximated by another under some limiting conditions. We next discuss one of these,
the Poisson approximation of the binomial distribution.
4.4.6 Poisson approximation of the binomial distribution
Suppose that:
X ∼ Bin(n, π)
n is large and π is small.
Under such circumstances, the distribution of X is well-approximated by a Poisson(λ)
distribution with λ = n π.
The connection is exact at the limit, i.e. Bin(n, π) → Poisson(λ) if n → ∞ and π → 0
in such a way that n π = λ remains constant.
This ‘law of small numbers’ provides another motivation for the Poisson distribution.
Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen) helps to remember the key elements of the ‘law of small numbers’.
Figure 4.3 shows the numbers of soldiers killed by horsekick in each of 14 army corps
of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is
X ∼ Bin(n, π), where:
n is large – the number of men in a corps (perhaps 50,000)
π is small – the probability that a man is killed by a horsekick.
X should be well-approximated by a Poisson distribution with some mean λ. The
sample frequencies and proportions of different counts are as follows:
Number killed    0      1      2      3     4     More
Count            144    91     32     11    2     0
%                51.4   32.5   11.4   3.9   0.7   0
The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson
distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure
4.4.
Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian army in each of the years spanning 1875–94. Source: Bortkiewicz (1898) Das Gesetz der kleinen Zahlen, Leipzig: Teubner.
Figure 4.4: Fit of Poisson distribution to the data in Example 4.11.
Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that,
on average, about 1% of customers who have bought tickets fail to arrive for the
flight. Because of this, the airline overbooks the flight by selling 200 tickets. What is
the probability that everyone who arrives for the flight will get a seat?
Let X denote the number of people who fail to turn up. Using the binomial
distribution, X ∼ Bin(200, 0.01). We have:
P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − 0.1340 − 0.2707 = 0.5953.
Using the Poisson approximation, X ∼ Poisson(200 × 0.01) = Poisson(2).
P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − e−2 − 2 e−2 = 1 − 3 e−2 = 0.5940.
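A brief Python/SciPy sketch (an optional checking aid only, not part of the guide) comparing the exact binomial answer with its Poisson approximation:

from scipy.stats import binom, poisson   # assumes SciPy is installed

X = binom(200, 0.01)                     # exact distribution of no-shows
Y = poisson(200 * 0.01)                  # Poisson approximation with lambda = 2

print(round(1 - X.cdf(1), 4))            # exact P(X >= 2)       = 0.5953
print(round(1 - Y.cdf(1), 4))            # approximate P(X >= 2) = 0.594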
Activity 4.15 The chance that a lottery ticket has a winning number is 0.0000001.
(a) If 10,000,000 people buy tickets which are independently numbered, what is the
probability there is no winner?
(b) What is the probability that there is exactly 1 winner?
(c) What is the probability that there are exactly 2 winners?
Solution
The number of winning tickets, X, will be distributed as:
X ∼ Bin(10000000, 0.0000001).
Since n is large and π is small, the Poisson distribution should provide a good
approximation. The Poisson parameter is:
λ = n π = 10000000 × 0.0000001 = 1
and so we set X ∼ Pois(1). We have:
p(0) = e^(−1) 1⁰/0! = 0.3679,    p(1) = e^(−1) 1¹/1! = 0.3679    and    p(2) = e^(−1) 1²/2! = 0.1839.
Using the exact binomial distribution of X, the results are:
p(0) = C(10^7, 0) × (10^(−7))⁰ × (1 − 10^(−7))^(10^7) = 0.3679

p(1) = C(10^7, 1) × (10^(−7))¹ × (1 − 10^(−7))^(10^7 − 1) = 0.3679

and:

p(2) = C(10^7, 2) × (10^(−7))² × (1 − 10^(−7))^(10^7 − 2) = 0.1839.
Notice that, in this case, the Poisson approximation is correct to at least 4 decimal
places.
4.4.7 Some other discrete distributions
Just their names and short comments are given here, so that you have an idea of what
else there is. You may meet some of these in future courses.
Geometric(π) distribution.
• Distribution of the number of failures in Bernoulli trials before the first success.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• See the basketball example in Chapter 3.
Negative binomial(r, π) distribution.
• Distribution of the number of failures in Bernoulli trials before r successes
occur.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• Negative binomial(1, π) is the same as Geometric(π).
4.5 Common continuous distributions
For continuous random variables, we will consider the following distributions.
Uniform distribution.
Exponential distribution.
Normal distribution.
4.5.1 The (continuous) uniform distribution
The (continuous) uniform distribution has non-zero probabilities only on an interval
[a, b], where a < b are given numbers. The probability that its value is in an interval
within [a, b] is proportional to the length of the interval. In other words, all intervals
(within [a, b]) which have the same length have the same probability.
Uniform distribution pdf
The pdf of the (continuous) uniform distribution is:
f(x) = 1/(b − a)  for a ≤ x ≤ b, and f(x) = 0 otherwise.
A random variable X with this pdf may be written as X ∼ Uniform[a, b].
The pdf is ‘flat’, as shown in Figure 4.5 (along with the cdf). Clearly, f (x) ≥ 0 for all x,
and:
∫_{−∞}^{∞} f(x) dx = ∫_a^b 1/(b − a) dx = (1/(b − a)) [x]_a^b = (b − a)/(b − a) = 1.
The cdf is:
F(x) = P(X ≤ x) = ∫_a^x f(t) dt, giving F(x) = 0 for x < a, (x − a)/(b − a) for a ≤ x ≤ b, and 1 for x > b.
Therefore, the probability of an interval [x1 , x2 ], where a ≤ x1 < x2 ≤ b, is:
P(x₁ ≤ X ≤ x₂) = F(x₂) − F(x₁) = (x₂ − x₁)/(b − a).
Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).
So the probability depends only on the length of the interval, x2 − x1 .
If X ∼ Uniform[a, b], we have:

E(X) = (a + b)/2 = median of X

and:

Var(X) = (b − a)²/12.

The mean and median also follow from the fact that the distribution is symmetric about (a + b)/2, i.e. the midpoint of the interval [a, b].
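For completeness, here is a short Python/SciPy sketch (assumed software, purely an optional check) of these results for one arbitrary choice of a and b; note that SciPy parameterises the uniform distribution by loc = a and scale = b − a.

from scipy.stats import uniform     # assumes SciPy is installed

a, b = 0, 10
X = uniform(loc=a, scale=b - a)     # X ~ Uniform[a, b]

print(X.mean(), (a + b) / 2)        # both 5.0
print(X.var(), (b - a)**2 / 12)     # both approximately 8.333
print(X.cdf(7) - X.cdf(3))          # P(3 <= X <= 7) = (7 - 3)/(b - a) = 0.4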
Activity 4.16 Suppose that X ∼ Uniform[0, 1]. Compute P (X > 0.2), P (X ≥ 0.2)
and P (X 2 > 0.04).
Solution
We have a = 0 and b = 1, and can use the formula for P (c < X ≤ d), for constants c
and d. Hence:
P(X > 0.2) = P(0.2 < X ≤ 1) = (1 − 0.2)/(1 − 0) = 0.8.
1−0
Also:
P (X ≥ 0.2) = P (X = 0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8.
Finally:
P(X² > 0.04) = P(X < −0.2) + P(X > 0.2) = 0 + P(X > 0.2) = 0.8.
Activity 4.17 A newsagent, James, has n newspapers to sell and makes £1.00
profit on each sale. Suppose the number of customers of these newspapers is a
random variable with a distribution which can be approximated by:
f(x) = 1/200  for 0 < x < 200, and f(x) = 0 otherwise.
If James does not have enough newspapers to sell to all customers, he figures he
loses £5.00 in goodwill from each unhappy (non-served) customer. However, if he
has surplus newspapers (which only have commercial value on the day of print), he
loses £0.50 on each unsold newspaper. What should n be (to the nearest integer) to
maximise profit?
Hint: If X ≤ n, James’ profit (in £) is X − 0.5(n − X). If X > n, James’ profit is
n − 5(X − n). Find the expected value of profit as a function of n, and then select n
to maximise this function. (There is no need to verify it is a maximum.)
Solution
We have:
E(profit) = ∫_0^n (x − 0.5(n − x)) (1/200) dx + ∫_n^200 (n − 5(x − n)) (1/200) dx

= (1/200) [x²/2 + (n − x)²/4]_0^n + (1/200) [6nx − 5x²/2]_n^200

= (1/200) (−3.25n² + 1200n − 100000).

Differentiating with respect to n, we have:

dE(profit)/dn = (1/200) (−6.5n + 1200).
Equating to zero and solving, we have:
n = 1200/6.5 ≈ 185.

4.5.2 Exponential distribution
Exponential distribution pdf
A random variable X has the exponential distribution with the parameter λ
(where λ > 0) if its probability density function is:
f(x) = λ e^(−λx)  for x > 0, and f(x) = 0 otherwise.
This is often denoted X ∼ Exponential(λ) or X ∼ Exp(λ).
It was shown in the previous chapter that this satisfies the conditions for a pdf (see
Example 3.21). The general shape of the pdf is that of ‘exponential decay’, as shown in
Figure 4.6 (hence the name).
Figure 4.6: Exponential distribution pdf.
The cdf of the Exponential(λ) distribution is:
F(x) = 0 for x ≤ 0, and F(x) = 1 − e^(−λx) for x > 0.
The cdf is shown in Figure 4.7 for λ = 1.6.
Figure 4.7: Exponential distribution cdf for λ = 1.6.
For X ∼ Exponential(λ), we have:
E(X) = 1/λ

and:

Var(X) = 1/λ².

These have been derived in the previous chapter (see Example 3.22). The median of the distribution, also previously derived (see Example 3.24), is:

m = (log 2)/λ = (log 2) × (1/λ) = (log 2) E(X) ≈ 0.69 × E(X).
Note that the median is always smaller than the mean, because the distribution is
skewed to the right.
Uses of the exponential distribution
The exponential is, among other things, a basic distribution of waiting times of
various kinds. This arises from a connection between the Poisson distribution – the
simplest distribution for counts – and the exponential.
If the number of events per unit of time has a Poisson distribution with parameter
λ, the time interval (measured in the same units of time) between two successive
events has an exponential distribution with the same parameter λ.
Note that the expected values of these behave as we would expect.
E(X) = λ for Poisson(λ), i.e. a large λ means many events per unit of time, on
average.
E(X) = 1/λ for Exponential(λ), i.e. a large λ means short waiting times between
successive events, on average.
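The following simulation sketch (illustrative only; it assumes Python with NumPy, which the guide does not use) shows this connection empirically: events are generated from a Poisson process and the gaps between them behave like Exponential(λ) waiting times.

import numpy as np   # assumes NumPy is installed

rng = np.random.default_rng(0)
lam, T = 1.6, 100000.0                         # rate per unit time, and a long horizon
n_events = rng.poisson(lam * T)                # Poisson count of events over [0, T]
times = np.sort(rng.uniform(0, T, n_events))   # given the count, event times are uniform
gaps = np.diff(times)                          # waiting times between successive events

print(gaps.mean())                             # close to E(X) = 1/lam = 0.625
print((gaps > 1).mean(), np.exp(-lam))         # both close to P(X > 1) = e^(-1.6) = 0.2019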
Example 4.13 Consider Example 4.10.
The number of customers arriving at a bank per minute has a Poisson
distribution with parameter λ = 1.6.
Therefore, the time X, in minutes, between the arrivals of two successive
customers follows an exponential distribution with parameter λ = 1.6.
From this exponential distribution, the expected waiting time between arrivals of
customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be
(log 2) × 0.625 = 0.433.
We can also calculate probabilities of waiting times between arrivals, using the
cumulative distribution function:
F(x) = 0 for x ≤ 0, and F(x) = 1 − e^(−1.6x) for x > 0.
For example:
P (X ≤ 1) = F (1) = 1 − e−1.6×1 = 1 − e−1.6 = 0.7981.
The probability is about 0.8 that two arrivals are at most a minute apart.
P(X > 3) = 1 − F(3) = e^(−1.6×3) = e^(−4.8) = 0.0082.
The probability of a gap of 3 minutes or more between arrivals is very small.
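In Python with SciPy (again, only an optional check), note that the exponential distribution is parameterised by scale = 1/λ:

from scipy.stats import expon     # assumes SciPy is installed

lam = 1.6
X = expon(scale=1 / lam)          # waiting time between arrivals, in minutes

print(X.mean(), X.median())       # 0.625 and about 0.433
print(X.cdf(1))                   # P(X <= 1) = 0.7981
print(1 - X.cdf(3))               # P(X > 3)  = 0.0082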
Activity 4.18 Suppose that the service time for a customer at a fast food outlet
has an exponential distribution with parameter 1/3 (customers per minute). What is
the probability that a customer waits more than 4 minutes?
Solution
The distribution of X is Exp(1/3), so the probability is:
P (X > 4) = 1 − F (4) = 1 − (1 − e−(1/3)×4 ) = 1 − 0.7364 = 0.2636.
Activity 4.19 Suppose that commercial aeroplane crashes in a certain country
occur at the rate of 2.5 per year.
(a) Is it reasonable to assume that such crashes are Poisson events? Briefly explain.
(b) What is the probability that two or more crashes will occur next year?
(c) What is the probability that the next two crashes will occur within six months
of one another?
Solution
(a) Yes, because the Poisson assumptions are probably satisfied – crashes are
independent events and the crash rate is likely to remain constant.
(b) Since λ = 2.5 crashes per year:
P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − Σ_{x=0}^{1} e^(−2.5) (2.5)^x / x! = 0.7127.
(c) Let Y = interval (in years) between the next two crashes. Therefore, we have
Y ∼ Exp(2.5). So:
P(Y < 0.5) = ∫_0^0.5 2.5 e^(−2.5y) dy = F(0.5) − F(0) = (1 − e^(−2.5(0.5))) − (1 − e^(−2.5(0))) = 1 − e^(−1.25) = 0.7135.
Activity 4.20 Let the random variable X have the following pdf:
f(x) = e^(−x)  for x ≥ 0, and f(x) = 0 otherwise.
Find the interquartile range (IQR) of X.
Solution
Note that X ∼ Exp(1). For x > 0, we have:

F(x) = ∫_0^x f(t) dt = ∫_0^x e^(−t) dt = [−e^(−t)]_0^x = 1 − e^(−x)

hence:

F(x) = 1 − e^(−x) for x > 0, and F(x) = 0 otherwise.
Denoting the first and third quartiles by Q1 and Q3 , respectively, we have:
F (Q1 ) = 1 − e−Q1 = 0.25 and F (Q3 ) = 1 − e−Q3 = 0.75.
Therefore:
Q1 = − ln(0.75) = 0.2877 and Q3 = − ln(0.25) = 1.3863
and so:
IQR = Q3 − Q1 = 1.3863 − 0.2877 = 1.0986.
Activity 4.21 The random variable Y , representing the life-span of an electronic
component, is distributed according to a probability density function f (y), where
y > 0.
The survivor function, S, is defined as S(y) = P(Y > y) and the age-specific failure rate, φ(y), is defined as f(y)/S(y).

Suppose f(y) = λ e^(−λy), i.e. Y ∼ Exp(λ).

(a) Derive expressions for S(y) and φ(y).
(b) Comment briefly on the implications of the age-specific failure rate you have
derived in the context of the exponentially-distributed component life-spans.
Solution
(a) The survivor function is:

S(y) = P(Y > y) = ∫_y^∞ λ e^(−λx) dx = [−e^(−λx)]_y^∞ = e^(−λy).

The age-specific failure rate is:

φ(y) = f(y)/S(y) = λ e^(−λy)/e^(−λy) = λ.
(b) The age-specific failure rate is constant, indicating it does not vary with age.
This is unlikely to be true in practice!
4.5.3 Normal (Gaussian) distribution
The normal distribution is by far the most important probability distribution in
statistics. This is for three broad reasons.
Many variables have distributions which are approximately normal, for example
heights of humans or animals, and weights of various products.
The normal distribution has extremely convenient mathematical properties, which
make it a useful default choice of distribution in many contexts.
Even when a variable is not itself even approximately normally distributed,
functions of several observations of the variable (‘sampling distributions’) are often
approximately normal, due to the central limit theorem. Because of this, the
normal distribution has a crucial role in statistical inference. This will be discussed
later in the course.
Normal distribution pdf
The pdf of the normal distribution is:
f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))  for −∞ < x < ∞
where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ 2 are
parameters, with −∞ < µ < ∞ and σ 2 > 0.
A random variable X with this pdf is said to have a normal distribution with mean
µ and variance σ 2 , denoted X ∼ N (µ, σ 2 ).
Clearly, f(x) ≥ 0 for all x. Also, it can be shown that ∫_{−∞}^{∞} f(x) dx = 1 (do not attempt to show this), so f(x) really is a pdf.
If X ∼ N (µ, σ 2 ), then:
E(X) = µ
and:
Var(X) = σ 2
and, therefore, the standard deviation is sd(X) = σ.
The mean can also be inferred from the observation that the normal pdf is symmetric
about µ. This also implies that the median of the normal distribution is µ.
The normal density is the so-called ‘bell curve’. The two parameters affect it as follows.
The mean µ determines the location of the curve.
The variance σ 2 determines the dispersion (spread) of the curve.
Example 4.14 Figure 4.8 shows that:
N (0, 1) and N (5, 1) have the same dispersion but different location: the
N (5, 1) curve is identical to the N (0, 1) curve, but shifted 5 units to the right
N (0, 1) and N (0, 9) have the same location but different dispersion: the
N (0, 9) curve is centered at the same value, 0, as the N (0, 1) curve, but spread
out more widely.
Figure 4.8: Various normal distributions.
Linear transformations of the normal distribution
We now consider one of the convenient properties of the normal distribution. Suppose
X is a random variable, and we consider the linear transformation Y = aX + b, where a
and b are constants.
Whatever the distribution of X, it is true that E(Y) = a E(X) + b and also that Var(Y) = a² Var(X).

Furthermore, if X is normally distributed, then so is Y. In other words, if X ∼ N(µ, σ²), then:

Y = aX + b ∼ N(aµ + b, a²σ²).    (4.7)
This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.
Let us apply (4.7) with a = 1/σ and b = −µ/σ, to get:
Z = (1/σ)X − µ/σ = (X − µ)/σ ∼ N((1/σ)µ − µ/σ, (1/σ)² σ²) = N(0, 1).

The transformed variable Z = (X − µ)/σ is known as a standardised variable or a z-score.
The transformed variable Z = (X − µ)/σ is known as a standardised variable or a
z-score.
The distribution of the z-score is N (0, 1), i.e. the normal distribution with mean µ = 0
and variance σ 2 = 1 (and, therefore, a standard deviation of σ = 1). This is known as
the standard normal distribution. Its density function is:
f(x) = (1/√(2π)) exp(−x²/2)  for −∞ < x < ∞.
The cumulative distribution function of the normal distribution is:
F(x) = ∫_{−∞}^{x} (1/√(2πσ²)) exp(−(t − µ)²/(2σ²)) dt.
In the special case of the standard normal distribution, the cdf is:
F(x) = Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−t²/2) dt.
Note, this is often denoted Φ(x).
Such integrals cannot be evaluated in a closed form, so we use statistical tables of them,
specifically a table of Φ(x) (or we could use a computer, but not in the examination).
In the examination, you will have a table of some values of Φ(z), the cdf of
Z ∼ N (0, 1). Specifically, Table 4 of the New Cambridge Statistical Tables shows values
of Φ(x) = P (Z ≤ x) for x ≥ 0. This table can be used to calculate probabilities of any
intervals for any normal distribution, but how? The table seems to be incomplete.
1. It is only for N (0, 1), not for N (µ, σ 2 ) for any other µ and σ 2 .
2. Even for N (0, 1), it only shows probabilities for x ≥ 0.
We next show how these are not really limitations, starting with ‘2.’.
The key to using the tables is that the standard normal distribution is symmetric about
0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the
same probability. Another way to justify these results is that if Z ∼ N (0, 1), then also
−Z ∼ N (0, 1). See ST104a Statistics 1 for a discussion of how to use Table 4 of the
New Cambridge Statistical Tables.
Probabilities for any normal distribution
How about a normal distribution X ∼ N (µ, σ 2 ), for any other µ and σ 2 ?
What if we want to calculate, for any a < b, P (a < X ≤ b) = F (b) − F (a)?
Remember that (X − µ)/σ = Z ∼ N (0, 1). If we apply this transformation to all parts
of the inequalities, we get:
P(a < X ≤ b) = P((a − µ)/σ < (X − µ)/σ ≤ (b − µ)/σ)

= P((a − µ)/σ < Z ≤ (b − µ)/σ)

= Φ((b − µ)/σ) − Φ((a − µ)/σ)
which can be calculated using Table 4 of the New Cambridge Statistical Tables. (Note
that this also covers the cases of the one-sided inequalities P (X ≤ b), with a = −∞,
and P (X > a), with b = ∞.)
Example 4.15 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87).
Suppose we want to know the probabilities of the following intervals:
X > 90 (high blood pressure)
X < 60 (low blood pressure)
60 ≤ X ≤ 90 (normal blood pressure).
These are calculated using standardisation with µ = 74.2, σ 2 = 127.87 and,
therefore, σ = 11.31. So here:
(X − 74.2)/11.31 = Z ∼ N(0, 1)
and we can refer values of this standardised variable to Table 4 of the New
Cambridge Statistical Tables.
P(X > 90) = P((X − 74.2)/11.31 > (90 − 74.2)/11.31)

= P(Z > 1.40)

= 1 − Φ(1.40)

= 1 − 0.9192

= 0.0808

and:

P(X < 60) = P((X − 74.2)/11.31 < (60 − 74.2)/11.31)

= P(Z < −1.26)

= P(Z > 1.26)

= 1 − Φ(1.26)

= 1 − 0.8962

= 0.1038.
Finally:
P (60 ≤ X ≤ 90) = P (X ≤ 90) − P (X < 60) = 0.8152.
These probabilities are shown in Figure 4.9.
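These calculations can be reproduced, to within the rounding of the z-values to two decimal places, in Python with SciPy, which evaluates Φ numerically. This is purely an optional check and, of course, not available in the examination.

from scipy.stats import norm        # assumes SciPy is installed

X = norm(74.2, 127.87 ** 0.5)       # diastolic blood pressure, sigma = 11.31

print(1 - X.cdf(90))                # P(X > 90),        about 0.081
print(X.cdf(60))                    # P(X < 60),        about 0.105
print(X.cdf(90) - X.cdf(60))        # P(60 <= X <= 90), about 0.814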
Activity 4.22 Suppose that the distribution of men’s heights in London, measured
in cm, is N (175, 62 ). Find the proportion of men whose height is:
(a) under 169 cm
Figure 4.9: Distribution of blood pressure for Example 4.15.
(b) over 190 cm
(c) between 169 cm and 190 cm.
Solution
The values of interest are 169 and 190. The corresponding z-values are:
z₁ = (169 − 175)/6 = −1 and z₂ = (190 − 175)/6 = 2.5.
Using values from statistical tables, we have:
P (X < 169) = P (Z < −1) = Φ(−1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587
also:
P (X > 190) = P (Z > 2.5) = 1 − Φ(2.5) = 1 − 0.9938 = 0.0062
and:
P (169 < X < 190) = P (−1 < Z < 2.5) = Φ(2.5)−Φ(−1) = 0.9938−0.1587 = 0.8351.
Activity 4.23 In javelin throwing competitions, the throws of athlete A are
normally distributed. It has been found that 15% of her throws exceed 43 metres,
while 3% exceed 45 metres. What distance will be exceeded by 90% of her throws?
Solution
Suppose X ∼ N (µ, σ 2 ) is the random variable for throws. P (X > 43) = 0.15 leads
to µ = 43 − 1.035 × σ (using statistical tables).
Similarly, P (X > 45) = 0.03 leads to µ = 45 − 1.88 × σ. Solving yields µ = 40.55 and
σ = 2.367, hence X ∼ N (40.55, (2.367)2 ). So:
P(X > x) = 0.9  ⇒  (x − 40.55)/2.367 = −1.28.
Hence x = 37.52 metres.
Activity 4.24 The life, in hours, of a light bulb is normally distributed with a mean
of 175 hours. If a consumer requires at least 95% of the light bulbs to have lives
exceeding 150 hours, what is the largest value that the standard deviation can have?
Solution
Let X be the random variable representing the lifetime of a light bulb (in hours), so
that for some value σ we have X ∼ N (175, σ 2 ). We want P (X > 150) = 0.95, such
that:
P(X > 150) = P(Z > (150 − 175)/σ) = P(Z > −25/σ) = 0.95.
Note that this is the same as P (Z > 25/σ) = 1 − 0.95 = 0.05, so 25/σ = 1.645,
giving σ = 15.20.
Activity 4.25 Two statisticians disagree about the distribution of IQ scores for a
population under study. Both agree that the distribution is normal, and that σ = 15,
but A says that 5% of the population have IQ scores greater than 134.6735, whereas
B says that 10% of the population have IQ scores greater than 109.224. What is the
difference between the mean IQ score as assessed by A and that as assessed by B?
Solution
The standardised z-value giving 5% in the upper tail is 1.6449, and for 10% it is
1.2816. So, converting to the scale for IQ scores, the values are:
1.6449 × 15 = 24.6735 and 1.2816 × 15 = 19.224.
Write the means according to A and B as µA and µB , respectively. Therefore:
µA + 24.6735 = 134.6735
so:
µA = 110
whereas:
µB + 19.224 = 109.224
so µB = 90. The difference µA − µB = 110 − 90 = 20.
Some probabilities around the mean
The following results hold for all normal distributions.
P (µ − σ < X < µ + σ) = 0.683. In other words, about 68.3% of the total
probability is within 1 standard deviation of the mean.
P (µ − 1.96 × σ < X < µ + 1.96 × σ) = 0.950.
P (µ − 2 × σ < X < µ + 2 × σ) = 0.954.
P (µ − 2.58 × σ < X < µ + 2.58 × σ) = 0.99.
P (µ − 3 × σ < X < µ + 3 × σ) = 0.997.
The first two of these are illustrated graphically in Figure 4.10.
Figure 4.10: Some probabilities around the mean for the normal distribution.
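Each of these values follows from the standard normal cdf, since P(µ − kσ < X < µ + kσ) = Φ(k) − Φ(−k). A two-line Python/SciPy check (an optional aid, not part of the guide) is:

from scipy.stats import norm    # assumes SciPy is installed

for k in (1, 1.96, 2, 2.58, 3):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 3))   # 0.683, 0.95, 0.954, 0.99, 0.997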
4.5.4 Normal approximation of the binomial distribution
For 0 < π < 1, the binomial distribution Bin(n, π) tends to the normal distribution
N (n π, n π (1 − π)) as n → ∞.
Less formally, the binomial distribution is well-approximated by the normal distribution
when the number of trials n is reasonably large.
For a given n, the approximation is best when π is not very close to 0 or 1. One
rule-of-thumb is that the approximation is good enough when n π > 5 and
n (1 − π) > 5. Illustrations of the approximation are shown in Figure 4.11 for different
values of n and π. Each plot shows values of the pf of Bin(n, π), and the pdf of the
normal approximation, N (n π, n π (1 − π)).
When the normal approximation is appropriate, we can calculate probabilities for
X ∼ Bin(n, π) using Y ∼ N (n π, n π (1 − π)) and Table 4 of the New Cambridge
Statistical Tables.
Unfortunately, there is one small caveat. The binomial distribution is discrete, but the
normal distribution is continuous. To see why this is problematic, consider the following.
Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, . . . , 40, then:
P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5)
Figure 4.11: Examples of the normal approximation of the binomial distribution, for (n, π) = (10, 0.5), (25, 0.5), (25, 0.25), (10, 0.9), (25, 0.9) and (50, 0.9).
since P (4 < X ≤ 4.5) = 0 and P (4.5 < X < 5) = 0 due to the ‘gaps’ in the probability
mass for this distribution. In contrast if Y ∼ N (16, 9.6), then:
P (Y ≤ 4) < P (Y ≤ 4.5) < P (Y < 5)
since P (4 < Y < 4.5) > 0 and P (4.5 < Y < 5) > 0 because this is a continuous
distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (n π, n π (1 − π)) distribution.
Continuity correction
This technique involves representing each discrete binomial value x, for 0 ≤ x ≤ n,
by the continuous interval (x − 0.5, x + 0.5). Great care is needed to determine which
x values are included in the required probability. Suppose we are approximating
X ∼ Bin(n, π) with Y ∼ N (n π, n π (1 − π)), then:
P (X < 4) = P (X ≤ 3) ⇒ P (Y < 3.5) (since 4 is excluded)
P (X ≤ 4) = P (X < 5) ⇒ P (Y < 4.5) (since 4 is included)
P (1 ≤ X < 6) = P (1 ≤ X ≤ 5) ⇒ P (0.5 < Y < 5.5) (since 1 to 5 are included).
Example 4.16 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ∼ Bin(1000, 0.361).
1. What is the probability that X ≥ 400?
Using the normal approximation, noting n = 1000 and π = 0.361, with
Y ∼ N (1000 × 0.361, 1000 × 0.361 × 0.639) = N (361, 230.68), we get:
P(X ≥ 400) ≈ P(Y ≥ 399.5)

= P((Y − 361)/√230.68 ≥ (399.5 − 361)/√230.68)

= P(Z ≥ 2.53)

= 1 − Φ(2.53)

= 0.0057.
The exact probability from the binomial distribution is P (X ≥ 400) = 0.0059.
Without the continuity correction, the normal approximation would give 0.0051.
2. What is the largest number x for which P (X ≤ x) < 0.01?
We need the largest x which satisfies:
P(X ≤ x) ≈ P(Y ≤ x + 0.5) = P(Z ≤ (x + 0.5 − 361)/√230.68) < 0.01.

According to Table 4 of the New Cambridge Statistical Tables, the smallest z which satisfies P(Z ≥ z) < 0.01 is z = 2.33, so the largest z which satisfies P(Z ≤ z) < 0.01 is z = −2.33. We then need to solve:

(x + 0.5 − 361)/√230.68 ≤ −2.33

which gives x ≤ 325.1. The largest integer value which satisfies this is x = 325.
Therefore, P (X ≤ x) < 0.01 for all x ≤ 325.
The sum of the exact binomial probabilities from 0 to x is 0.0093 for x = 325,
and 0.011 for x = 326. The normal approximation gives exactly the correct
answer in this instance.
3. Suppose that 300 respondents in the actual survey say they would vote for the
Conservative Party now. What do you conclude from this?
From the answer to Question 2, we know that P (X ≤ 300) < 0.01, if π = 0.361.
In other words, if the Conservatives’ support remains 36.1%, we would be very
unlikely to get a random sample where only 300 (or fewer) respondents would
say they would vote for the Conservative Party.
Now X = 300 is actually observed. We can then conclude one of two things (if
we exclude other possibilities, such as a biased sample or lying by the
respondents).
(a) The Conservatives’ true level of support is still 36.1% (or even higher), but
by chance we ended up with an unusual sample with only 300 of their
supporters.
(b) The Conservatives’ true level of support is currently less than 36.1% (in
which case getting 300 in the sample would be more probable).
Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.
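A Python/SciPy sketch (for checking only; such software is not assumed by the guide) comparing the exact binomial probability in Question 1 with the normal approximation, with and without the continuity correction:

from scipy.stats import binom, norm      # assumes SciPy is installed

X = binom(1000, 0.361)                   # exact distribution
Y = norm(361, 230.68 ** 0.5)             # approximating normal distribution

print(1 - X.cdf(399))                    # exact P(X >= 400),                 about 0.0059
print(1 - Y.cdf(399.5))                  # with the continuity correction,    about 0.0056 (0.0057 above)
print(1 - Y.cdf(400))                    # without the continuity correction, about 0.0051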
Activity 4.26 James enjoys playing Solitaire on his laptop. One day, he plays the
game repeatedly. He has found, from experience, that the probability of success in
any game is 1/3 and is independent of the outcomes of other games.
(a) What is the probability that his first success occurs in the fourth game he
plays? What is the expected number of games he needs to play to achieve his
first success?
(b) What is the probability of three successes in ten games? What is the expected
number of successes in ten games?
(c) Use a suitable approximation to find the probability of less than 25 successes in
100 games. You should justify the use of the approximation.
(d) What is the probability that his third success occurs in the tenth game he plays?
Solution
(a) P(first success in 4th game) = (2/3)³ × (1/3) = 8/81 ≈ 0.1. This is a geometric
distribution, for which E(X) = 1/π = 1/(1/3) = 3.
(b) Use X ∼ Bin(10, 1/3), such that E(X) = 10 × 1/3 = 3.33, and:
P(X = 3) = C(10, 3) (1/3)³ (2/3)⁷ ≈ 0.2601.
(c) Approximate Bin(100, 1/3) by:
N(100 × 1/3, 100 × (1/3) × (2/3)) = N(33.3, 200/9).

The approximation seems reasonable since n = 100 is ‘large’, π = 1/3 is quite close to 0.5, n π > 5 and n (1 − π) > 5. Using a continuity correction:

P(X ≤ 24.5) = P(Z ≤ (24.5 − 33.3)/√(200/9)) = P(Z ≤ −1.87) ≈ 0.0307.
(d) This is a negative binomial distribution (used for the trial number of the kth
success) with a pf given by:
p(x) = C(x − 1, k − 1) π^k (1 − π)^(x−k)  for x = k, k + 1, . . .

and 0 otherwise. Hence we require:

P(X = 10) = C(9, 2) (1/3)³ (2/3)⁷ ≈ 0.0780.
Alternatively, you could calculate the probability of 2 successes in 9 trials,
followed by a further success.
Activity 4.27 You may assume that 15% of individuals in a large population are
left-handed.
(a) If a random sample of 40 individuals is taken, find the probability that exactly 6
are left-handed.
(b) If a random sample of 400 individuals is taken, find the probability that exactly
60 are left-handed by using a suitable approximation. Briefly discuss the
appropriateness of the approximation.
(c) What is the smallest possible size of a randomly chosen sample if we wish to be
99% sure of finding at least one left-handed individual in the sample?
Solution
(a) Let X ∼ Bin(40, 0.15), hence:
P(X = 6) = C(40, 6) × (0.15)⁶ × (0.85)³⁴ = 0.1742.
(b) Use a normal approximation with a continuity correction. We require:
P (59.5 < X < 60.5)
where X ∼ N (60, 51) since X has mean n π and variance n π (1 − π) with
n = 400 and π = 0.15. Standardising, this is 2 × P (0 < Z ≤ 0.07) = 0.0558,
approximately.
Rules-of-thumb for use of the approximation are that n is ‘large’, π is close to
0.5, and n π and n (1 − π) are both at least 5. The first and last of these
definitely hold. There is some doubt whether a value of 0.15 can be considered
close to 0.5, so use with caution!
(c) Given a sample of size n, P(no left-handers) = (0.85)^n. Therefore:

P(at least 1 left-hander) = 1 − (0.85)^n.
We require 1 − (0.85)^n > 0.99, or (0.85)^n < 0.01.

This gives:

100 < (1/0.85)^n

or:

n > ln(100)/ln(1.1765) = 28.34.
Rounding up, this gives a sample size of 29.
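As a quick numerical check (not part of the guide, a sketch using only the standard library), the smallest sample size can be confirmed directly:

import math

n_exact = math.log(100) / math.log(1 / 0.85)     # 28.34...
n = math.ceil(n_exact)                           # round up to the next integer
print(n_exact, n)                                # about 28.34 and 29
print(1 - 0.85**28 > 0.99, 1 - 0.85**29 > 0.99)  # False, True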
Activity 4.28 For the binomial distribution with a probability of success of 0.25 in
an individual trial, calculate the probability that, in 50 trials, there are at least 8
successes:
(a) using the normal approximation without a continuity correction
(b) using the normal approximation with a continuity correction.
Compare these results with the exact probability of 0.9547 and comment.
Solution
We seek P (X ≥ 8) using the normal approximation Y ∼ N (12.5, 9.375).
(a) So, without a continuity correction:

P(Y ≥ 8) = P(Z ≥ (8 − 12.5)/√9.375) = P(Z ≥ −1.47) = 0.9292.

The required probability could have been expressed as P(X > 7), or indeed any
number in [7, 8), for example:

P(Y > 7) = P(Z ≥ (7 − 12.5)/√9.375) = P(Z ≥ −1.80) = 0.9641.

(b) With a continuity correction:

P(Y > 7.5) = P(Z ≥ (7.5 − 12.5)/√9.375) = P(Z ≥ −1.63) = 0.9484.
Compared to 0.9547, using the continuity correction yields the closer approximation.
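The same comparison can be reproduced numerically. The following sketch (not part of the guide; variable names are illustrative) uses scipy to compute the exact binomial probability and the two normal approximations:

from scipy.stats import binom, norm

n, p = 50, 0.25
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5          # 12.5 and sqrt(9.375)

exact = binom.sf(7, n, p)                            # P(X >= 8), about 0.9547
no_cc = norm.sf(8, loc=mu, scale=sigma)              # no continuity correction, about 0.929
with_cc = norm.sf(7.5, loc=mu, scale=sigma)          # with continuity correction, about 0.948
print(round(exact, 4), round(no_cc, 4), round(with_cc, 4))

The last decimal place may differ slightly from the table-based answers above because statistical tables round the z-value.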
Activity 4.29 We have found that the Poisson distribution can be used to
approximate a binomial distribution, and a normal distribution can be used to
approximate a binomial distribution. It should not be surprising that a normal
distribution can be used to approximate a Poisson distribution. It can be shown that
the approximation is suitable for large values of the Poisson parameter λ, and should
be adequate for practical purposes when λ ≥ 10.
(a) Suppose X is a Poisson random variable with parameter λ. If we approximate
X by a normal random variable Y ∼ N(µ, σ²), what are the values which should be
used for µ and σ²?
Hint: What are the mean and variance of a Poisson distribution?
(b) Use this approach to estimate P (X > 12) for a Poisson random variable with
λ = 15. Use a continuity correction.
Note: The exact value of this probability, from the Poisson distribution, is
0.7323890.
Solution
(a) The Poisson distribution with parameter λ has its expectation and variance
both equal to λ, so we should take µ = λ and σ 2 = λ in a normal approximation,
i.e. use a N (λ, λ) distribution as the approximating distribution.
(b) P(X > 12) ≈ P(Y > 12.5) using a continuity correction, where Y ∼ N(15, 15).
This is:

P(Y > 12.5) = P((Y − 15)/√15 > (12.5 − 15)/√15) = P(Z > −0.65) = 0.7422.
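A quick computational check (not from the guide; a sketch with scipy) of the exact Poisson probability against the normal approximation:

from scipy.stats import poisson, norm

lam = 15
exact = poisson.sf(12, lam)                          # P(X > 12), about 0.7324
approx = norm.sf(12.5, loc=lam, scale=lam ** 0.5)    # P(Y > 12.5), about 0.741 (table-based: 0.7422)
print(round(exact, 4), round(approx, 4))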
4.6 Overview of chapter
This chapter has introduced some common discrete and continuous probability
distributions. Their properties, uses and applications have been discussed. The
relationships between some of these distributions have also been covered.
4.7 Key terms and concepts

Bernoulli distribution
Binomial distribution
Central limit theorem
Continuity correction
Continuous uniform distribution
Discrete uniform distribution
Exponential distribution
Normal distribution
Parameter
Poisson distribution
Standard normal distribution
Standardised variable
z-score

4.8 Sample examination questions
Solutions can be found in Appendix C.
1. Find P (Y ≥ 2) when Y follows a binomial distribution with parameters n = 10 and
π = 0.25.
2. A random variable, X, has the following probability density function:
f(x) = e^(−x) for 0 < x < ∞, and f(x) = 0 otherwise.
The probability of being aged at least x0 + 1, given being aged at least x0 , is:
p = P (X > x0 + 1 | X > x0 ).
Calculate p.
3. Let X be a normal random variable with mean 1 and variance 4. Calculate:
P (X > 3 | X < 5).
Chapter 5
Multivariate random variables
5.1 Synopsis of chapter
Almost all applications of statistical methods deal with several measurements on the
same, or connected, items. To think statistically about several measurements on a
randomly selected item, you must understand some of the concepts for joint
distributions of random variables.
5.2 Learning outcomes
After completing this chapter, you should be able to:
arrange the probabilities for a discrete bivariate distribution in tabular form
define marginal and conditional distributions, and determine them for a discrete
bivariate distribution
recall how to define and determine independence for two random variables
define and compute expected values for functions of two random variables and
demonstrate how to prove simple properties of expected values
provide the definition of covariance and correlation for two random variables and
calculate these.
5.3 Introduction
So far, we have considered univariate situations, that is, one random variable at a time.
Now we will consider multivariate situations, that is, two or more random variables
considered together.
In particular, we consider two somewhat different types of multivariate situations.
1. Several different variables – such as the height and weight of a person.
2. Several observations of the same variable, considered together – such as the heights
of all n people in a sample.
Suppose that X1 , X2 , . . . , Xn are random variables, then the vector:
X = (X1, X2, . . . , Xn)′
is a multivariate random variable (here n-variate), also known as a random
vector. Its possible values are the vectors:
x = (x1, x2, . . . , xn)′
where each xi is a possible value of the random variable Xi , for i = 1, . . . , n.
The joint probability distribution of a multivariate random variable X is defined by
the possible values x, and their probabilities.
For now, we consider just the simplest multivariate case, a bivariate random variable
where n = 2. This is sufficient for introducing most of the concepts of multivariate
random variables.
For notational simplicity, we will use X and Y instead of X1 and X2 . A bivariate
random variable is then the pair (X, Y ).
Example 5.1 In this chapter, we consider the following example of a discrete
bivariate distribution – for a football match:
X = the number of goals scored by the home team
Y = the number of goals scored by the visiting (away) team.
5.4 Joint probability functions
When the random variables in (X1 , X2 , . . . , Xn ) are all discrete (or all continuous), we
also call the multivariate random variable discrete (or continuous, respectively).
For a discrete multivariate random variable, the joint probability distribution is
described by the joint probability function, defined as:
p(x1 , x2 , . . . , xn ) = P (X1 = x1 , X2 = x2 , . . . , Xn = xn )
for all vectors (x1 , x2 , . . . , xn ) of n real numbers. The value p(x1 , x2 , . . . , xn ) of the joint
probability function is itself a single number, not a vector.
In the bivariate case, this is:
p(x, y) = P (X = x, Y = y)
which we sometimes write as pX,Y (x, y) to make the random variables clear.
Example 5.2 Consider a randomly selected football match in the English Premier
League (EPL), and the two random variables:
X = the number of goals scored by the home team
Y = the number of goals scored by the visiting (away) team.
Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example
simple, we have recorded the small number of scores of 4 or greater also as 3).
Consider the joint distribution of (X, Y ). We use probabilities based on data from
the 2009–10 EPL season.
Suppose the values of pX,Y (x, y) = p(x, y) = P (X = x, Y = y) are the following:
                     Y = y
X = x       0       1       2       3
  0       0.100   0.031   0.039   0.031
  1       0.100   0.146   0.092   0.015
  2       0.085   0.108   0.092   0.023
  3       0.062   0.031   0.039   0.006
and p(x, y) = 0 for all other (x, y).
Note that this satisfies the conditions for a probability function.
1. p(x, y) ≥ 0 for all (x, y).
2. Σ_{x=0}^{3} Σ_{y=0}^{3} p(x, y) = 0.100 + 0.031 + · · · + 0.006 = 1.000.
The joint probability function gives probabilities of values of (X, Y ), for example:
A 1–1 draw, which is the most probable single result, has probability:
P (X = 1, Y = 1) = p(1, 1) = 0.146.
The match is a draw with probability:
P (X = Y ) = p(0, 0) + p(1, 1) + p(2, 2) + p(3, 3) = 0.344.
The match is won by the home team with probability:
P (X > Y ) = p(1, 0) + p(2, 0) + p(2, 1) + p(3, 0) + p(3, 1) + p(3, 2) = 0.425.
More than 4 goals are scored in the match with probability:
P (X + Y > 4) = p(2, 3) + p(3, 2) + p(3, 3) = 0.068.
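These event probabilities can be checked numerically. The sketch below (not part of the guide; it simply stores the joint pf as a NumPy array, with rows indexed by x and columns by y) reproduces them:

import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])
x = np.arange(4).reshape(-1, 1)   # home goals as a column
y = np.arange(4).reshape(1, -1)   # away goals as a row

print(p.sum())                    # 1.000, a valid joint pf
print(p[x == y].sum())            # P(X = Y)     = 0.344
print(p[x > y].sum())             # P(X > Y)     = 0.425
print(p[x + y > 4].sum())         # P(X + Y > 4) = 0.068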
5.5 Marginal distributions
Consider a multivariate discrete random variable X = (X1 , . . . , Xn ).
The marginal distribution of a subset of the variables in X is the (joint) distribution
of this subset. The joint pf of these variables (the marginal pf ) is obtained by
summing the joint pf of X over the variables which are not included in the subset.
Example 5.3 Consider X = (X1 , X2 , X3 , X4 ), and the marginal distribution of the
subset (X1 , X2 ). The marginal pf of (X1 , X2 ) is:
p1,2(x1, x2) = P(X1 = x1, X2 = x2) = Σ_{x3} Σ_{x4} p(x1, x2, x3, x4)
where the sum is of the values of the joint pf of (X1 , X2 , X3 , X4 ) over all possible
values of X3 and X4 .
The simplest marginal distributions are those of individual variables in the multivariate
random variable.
The marginal pf is then obtained by summing the joint pf over all the other variables.
The resulting marginal distribution is univariate, and its pf is a univariate pf.
Marginal distributions for discrete bivariate distributions
For the bivariate distribution of (X, Y ) the univariate marginal distributions are
those of X and Y individually. Their marginal pfs are:
pX(x) = Σ_y p(x, y)   and   pY(y) = Σ_x p(x, y).
Example 5.4 Continuing with the football example introduced in Example 5.2, the
joint and marginal probability functions are:
                     Y = y
X = x       0       1       2       3      pX(x)
  0       0.100   0.031   0.039   0.031    0.201
  1       0.100   0.146   0.092   0.015    0.353
  2       0.085   0.108   0.092   0.023    0.308
  3       0.062   0.031   0.039   0.006    0.138
pY(y)     0.347   0.316   0.262   0.075    1.000
and p(x, y) = pX (x) = pY (y) = 0 for all other (x, y).
For example:
pX(0) = Σ_{y=0}^{3} p(0, y) = p(0, 0) + p(0, 1) + p(0, 2) + p(0, 3)
      = 0.100 + 0.031 + 0.039 + 0.031
      = 0.201.
Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and
medians of individual variables are obtained from the univariate (marginal)
distributions of Xi , as defined in Chapter 3.
Example 5.5 Consider again the football example.
The expected number of goals scored by the home team is:
E(X) = Σ_x x pX(x) = 0 × 0.201 + 1 × 0.353 + 2 × 0.308 + 3 × 0.138 = 1.383.

The expected number of goals scored by the visiting team is:

E(Y) = Σ_y y pY(y) = 0 × 0.347 + 1 × 0.316 + 2 × 0.262 + 3 × 0.075 = 1.065.
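The marginal pfs and these expected values follow directly from the joint array. A short sketch (not from the guide; it reuses the same illustrative array p as above, defined again so the snippet is self-contained):

import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])
values = np.arange(4)

p_x = p.sum(axis=1)               # marginal pf of X: 0.201, 0.353, 0.308, 0.138
p_y = p.sum(axis=0)               # marginal pf of Y: 0.347, 0.316, 0.262, 0.075
print(values @ p_x)               # E(X) = 1.383
print(values @ p_y)               # E(Y) = 1.065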
Activity 5.1 Show that the marginal distributions of a bivariate distribution are
not enough to define the bivariate distribution itself.
Solution
Here we must show that there are two distinct bivariate distributions with the same
marginal distributions. It is easiest to think of the simplest case where X and Y
each take only two values, say 0 and 1.
Suppose the marginal distributions of X and Y are the same, with
p(0) = p(1) = 0.5. One possible bivariate distribution with these marginal
distributions is the one for which there is independence between X and Y . This has
pX,Y (x, y) = pX (x) pY (y) for all x, y. Writing it in full:
pX,Y (0, 0) = pX,Y (1, 0) = pX,Y (0, 1) = pX,Y (1, 1) = 0.5 × 0.5 = 0.25.
The table of probabilities for this choice of independence is shown in the first table
below.
Trying some other value for pX,Y (0, 0), like 0.2, gives the second table below.
X \ Y      0      1
  0      0.25   0.25
  1      0.25   0.25

X \ Y      0      1
  0      0.2    0.3
  1      0.3    0.2
The construction of these probabilities is done by making sure the row and column
totals are equal to 0.5, and so we now have a second distribution with the same
marginal distributions as the first.
This example is very simple, but one can almost always construct many bivariate
distributions with the same marginal distributions even for continuous random
variables.
5.6 Conditional distributions
Consider discrete variables X and Y , with joint pf p(x, y) = pX,Y (x, y) and marginal pfs
pX (x) and pY (y), respectively.
Conditional distributions of discrete bivariate distributions
Let x be one possible value of X, for which pX (x) > 0. The conditional
distribution of Y given that X = x is the discrete probability distribution with
the pf:
pY|X(y | x) = P(Y = y | X = x) = P(X = x and Y = y)/P(X = x) = pX,Y(x, y)/pX(x)
for any value y.
This is the conditional probability function of Y given X = x.
Example 5.6 Recall that in the football example the joint and marginal pfs were:
                     Y = y
X = x       0       1       2       3      pX(x)
  0       0.100   0.031   0.039   0.031    0.201
  1       0.100   0.146   0.092   0.015    0.353
  2       0.085   0.108   0.092   0.023    0.308
  3       0.062   0.031   0.039   0.006    0.138
pY(y)     0.347   0.316   0.262   0.075    1.000
We can now calculate the conditional pf of Y given X = x for each x, i.e. of away
goals given home goals. For example:
pY|X(y | 0) = pY|X(y | X = 0) = pX,Y(0, y)/pX(0) = pX,Y(0, y)/0.201.
So, for example, pY |X (1 | 0) = pX,Y (0, 1)/0.201 = 0.031/0.201 = 0.154.
Calculating these for each value of x gives:

           pY|X(y | x) when y is:
X = x       0       1       2       3      Sum
  0       0.498   0.154   0.194   0.154    1.00
  1       0.283   0.414   0.261   0.042    1.00
  2       0.276   0.351   0.299   0.075    1.00
  3       0.449   0.225   0.283   0.043    1.00
So, for example:
if the home team scores 0 goals, the probability that the visiting team scores 1
goal is pY |X (1 | 0) = 0.154
if the home team scores 1 goal, the probability that the visiting team wins the
match is pY |X (2 | 1) + pY |X (3 | 1) = 0.261 + 0.042 = 0.303.
5.6.1 Properties of conditional distributions
Each different value of x defines a different conditional distribution and conditional pf
pY |X (y | x). Each value of pY |X (y | x) is a conditional probability of the kind previously
defined. Defining events A = {Y = y} and B = {X = x}, then:
P(A | B) = P(A ∩ B)/P(B) = P(Y = y and X = x)/P(X = x) = P(Y = y | X = x)
         = pX,Y(x, y)/pX(x) = pY|X(y | x).
A conditional distribution is itself a probability distribution, and a conditional pf is a
pf. Clearly, pY |X (y | x) ≥ 0 for all y, and:
Σ_y pY|X(y | x) = (Σ_y pX,Y(x, y))/pX(x) = pX(x)/pX(x) = 1.
The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is
defined similarly, with the roles of X and Y reversed:
pX|Y(x | y) = pX,Y(x, y)/pY(y)

for any value x.
Conditional distributions are general and are not limited to the bivariate case. If X
and/or Y are vectors of random variables, the conditional pf of Y given X = x is:
pY|X(y | x) = pX,Y(x, y)/pX(x)
where pX,Y (x, y) is the joint pf of the random vector (X, Y), and pX (x) is the marginal
pf of the random vector X.
5.6.2 Conditional mean and variance
Since a conditional distribution is a probability distribution, it also has a mean
(expected value) and variance (and median etc.).
These are known as the conditional mean and conditional variance, and are
denoted, respectively, by:
EY |X (Y | x) and VarY |X (Y | x).
Example 5.7 In the football example, we have:
EY|X(Y | 0) = Σ_y y pY|X(y | 0) = 0 × 0.498 + 1 × 0.154 + 2 × 0.194 + 3 × 0.154 = 1.00.

So, if the home team scores 0 goals, the expected number of goals by the visiting
team is EY|X(Y | 0) = 1.00.

EY|X(Y | x) for x = 1, 2 and 3 are obtained similarly.

Here X is the number of goals by the home team, and Y is the number of goals by
the visiting team:

           pY|X(y | x) when y is:
X = x       0       1       2       3     EY|X(Y | x)
  0       0.498   0.154   0.194   0.154      1.00
  1       0.283   0.414   0.261   0.042      1.06
  2       0.276   0.351   0.299   0.075      1.17
  3       0.449   0.225   0.283   0.043      0.92
Plots of the conditional means are shown in Figure 5.1.
[Figure: the conditional means E(Y | x) plotted against home goals x, with axes 'Home goals x' and 'Expected away goals E(Y|x)'.]

Figure 5.1: Conditional means for Example 5.7.
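The conditional pfs and conditional means can be obtained row-by-row from the joint array. A sketch (not from the guide; the array p is the illustrative joint pf used earlier):

import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])
y = np.arange(4)

p_x = p.sum(axis=1, keepdims=True)    # marginal pf of X, as a column
p_y_given_x = p / p_x                 # each row is p(y | x) and sums to 1
print(p_y_given_x.round(3))           # first row: 0.498, 0.154, 0.194, 0.154
print(p_y_given_x @ y)                # E(Y | x) for x = 0, 1, 2, 3: about 1.00, 1.06, 1.17, 0.92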
5.7 Covariance and correlation
Suppose that the conditional distributions pY |X (y | x) of a random variable Y given
different values x of a random variable X are not all the same, i.e. the conditional
distribution of Y ‘depends on’ the value of X.
Therefore, there is said to be an association (or dependence) between X and Y .
If two random variables are associated (dependent), knowing the value of one (for
example, X) will help to predict the likely value of the other (for example, Y ).
We next consider two measures of association which are used to summarise the
strength of an association in a single number: covariance and correlation (scaled
covariance).
5.7.1 Covariance
Definition of covariance
The covariance of two random variables X and Y is defined as:
Cov(X, Y ) = Cov(Y, X) = E[(X − E(X))(Y − E(Y ))].
This can also be expressed as the more convenient formula:
Cov(X, Y ) = E(XY ) − E(X) E(Y ).
(Note that these involve expected values of products of two random variables, which
have not been defined yet. We will do so later in this chapter.)
Properties of covariance
Suppose X and Y are random variables, and a, b, c and d are constants.
The covariance of a random variable with itself is the variance of the random
variable:
Cov(X, X) = E(XX) − E(X) E(X) = E(X 2 ) − (E(X))2 = Var(X).
The covariance of a random variable and a constant is 0:
Cov(a, X) = E(aX) − E(a) E(X) = a E(X) − a E(X) = 0.
The covariance of linear transformations of random variables is:
Cov(aX + b, cY + d) = ac Cov(X, Y ).
Activity 5.2 Suppose that X and Y have a bivariate distribution. Find the
covariance of the new random variables W = aX + bY and V = cX + dY where a, b,
c and d are constants.
Solution
The covariance of W and V is:
E(WV) − E(W) E(V) = E[acX² + bdY² + (ad + bc)XY]
                    − [ac E(X)² + bd E(Y)² + (ad + bc) E(X) E(Y)]
                  = ac [E(X²) − E(X)²] + bd [E(Y²) − E(Y)²]
                    + (ad + bc) [E(XY) − E(X) E(Y)]
                  = ac σX² + bd σY² + (ad + bc) σXY.
5.7.2 Correlation
Definition of correlation
The correlation of two random variables X and Y is defined as:
Corr(X, Y) = Corr(Y, X) = Cov(X, Y)/√(Var(X) Var(Y)) = Cov(X, Y)/(sd(X) sd(Y)).
When Cov(X, Y ) = 0, then Corr(X, Y ) = 0. When this is the case, we say that X
and Y are uncorrelated.
Correlation and covariance are measures of the strength of the linear (‘straight-line’)
association between X and Y .
The further the correlation is from 0, the stronger is the linear association. The most
extreme possible values of correlation are −1 and +1, which are obtained when Y is an
exact linear function of X.
Corr(X, Y ) = +1 when Y = aX + b with a > 0.
Corr(X, Y ) = −1 when Y = aX + b with a < 0.
If Corr(X, Y ) > 0, we say that X and Y are positively correlated.
If Corr(X, Y ) < 0, we say that X and Y are negatively correlated.
Example 5.8 Recall the joint pf pX,Y (x, y) in the football example:
               Y = 0           Y = 1           Y = 2           Y = 3
            xy  p(x, y)     xy  p(x, y)     xy  p(x, y)     xy  p(x, y)
X = 0        0   0.100       0   0.031       0   0.039       0   0.031
X = 1        0   0.100       1   0.146       2   0.092       3   0.015
X = 2        0   0.085       2   0.108       4   0.092       6   0.023
X = 3        0   0.062       3   0.031       6   0.039       9   0.006
Here, the first number in each cell is the value of xy for that combination of x and y,
and the second is the probability pX,Y(x, y). From these, we can derive the probability
distribution of XY.
For example:
P (XY = 2) = pX,Y (1, 2) + pX,Y (2, 1) = 0.092 + 0.108 = 0.200.
The pf of the product XY is:
XY = xy          0       1       2       3       4       6       9
P(XY = xy)     0.448   0.146   0.200   0.046   0.092   0.062   0.006
Hence:
E(XY ) = 0 × 0.448 + 1 × 0.146 + 2 × 0.200 + · · · + 9 × 0.006 = 1.478.
From the marginal pfs pX (x) and pY (y) we get:
E(X) = 1.383 and E(Y ) = 1.065
also:
E(X²) = 2.827 and E(Y²) = 2.039

hence:

Var(X) = 2.827 − (1.383)² = 0.9143 and Var(Y) = 2.039 − (1.065)² = 0.9048.

Therefore, the covariance of X and Y is:

Cov(X, Y) = E(XY) − E(X) E(Y) = 1.478 − 1.383 × 1.065 = 0.00511

and the correlation is:

Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) = 0.00511/√(0.9143 × 0.9048) = 0.00562.
The numbers of goals scored by the home and visiting teams are very nearly
uncorrelated (i.e. not linearly associated).
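All of these quantities follow directly from the joint array. A sketch (not from the guide; p is the same illustrative joint pf as before):

import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])
x = np.arange(4).reshape(-1, 1)
y = np.arange(4).reshape(1, -1)

e_xy = (x * y * p).sum()                  # E(XY)  = 1.478
e_x = (x * p).sum()                       # E(X)   = 1.383
e_y = (y * p).sum()                       # E(Y)   = 1.065
var_x = (x**2 * p).sum() - e_x**2         # Var(X) = 0.9143
var_y = (y**2 * p).sum() - e_y**2         # Var(Y) = 0.9048
cov = e_xy - e_x * e_y                    # about 0.0051
corr = cov / (var_x * var_y) ** 0.5       # about 0.0056
print(round(cov, 4), round(corr, 4))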
Activity 5.3 X and Y are independent random variables with distributions as
follows:
X = x      0      1      2
pX(x)     0.4    0.2    0.4

Y = y      1      2
pY(y)     0.4    0.6
The random variables W and Z are defined by W = 2X and Z = Y − X,
respectively.
(a) Compute the joint distribution of W and Z.
(b) Evaluate P (W = 2 | Z = 1), E(W | Z = 0) and Cov(W, Z).
Solution
(a) The joint distribution (with marginal probabilities) is:

                    Z = z
W = w     −1       0       1       2      pW(w)
  0      0.00    0.00    0.16    0.24     0.40
  2      0.00    0.08    0.12    0.00     0.20
  4      0.16    0.24    0.00    0.00     0.40
pZ(z)    0.16    0.32    0.28    0.24     1.00
(b) It is straightforward to see that:

P(W = 2 | Z = 1) = P(W = 2 ∩ Z = 1)/P(Z = 1) = 0.12/0.28 = 3/7.

For E(W | Z = 0), we have:

E(W | Z = 0) = Σ_w w P(W = w | Z = 0) = 0 × 0/0.32 + 2 × 0.08/0.32 + 4 × 0.24/0.32 = 3.5.

We see E(W) = 2 (by symmetry), and:

E(Z) = −1 × 0.16 + 0 × 0.32 + 1 × 0.28 + 2 × 0.24 = 0.6.

Also:

E(WZ) = Σ_w Σ_z w z p(w, z) = −4 × 0.16 + 2 × 0.12 = −0.4

hence:

Cov(W, Z) = E(WZ) − E(W) E(Z) = −0.4 − 2 × 0.6 = −1.6.
Activity 5.4 The joint probability distribution of the random variables X and Y is:
                  X = x
Y = y     −1       0       1
 −1      0.05    0.15    0.10
  0      0.10    0.05    0.25
  1      0.10    0.05    0.15
(a) Identify the marginal distributions of X and Y and the conditional distribution
of X given Y = 1.
(b) Evaluate E(X | Y = 1) and the correlation coefficient of X and Y .
(c) Are X and Y independent random variables?
Solution
(a) The marginal and conditional distributions are, respectively:

X = x      −1      0      1
pX(x)     0.25   0.25   0.50

Y = y      −1      0      1
pY(y)     0.30   0.40   0.30

X = x | Y = 1          −1      0      1
pX|Y=1(x | Y = 1)     1/3    1/6    1/2
(b) From the conditional distribution we see:

E(X | Y = 1) = −1 × 1/3 + 0 × 1/6 + 1 × 1/2 = 1/6.

E(Y) = 0 (by symmetry), and so Var(Y) = E(Y²) = 0.6.

E(X) = 0.25 and:

Var(X) = E(X²) − (E(X))² = 0.75 − (0.25)² = 0.6875.

(Note that Var(X) and Var(Y) are not strictly necessary here!)

Next:

E(XY) = Σ_x Σ_y x y p(x, y)
      = (−1)(−1)(0.05) + (1)(−1)(0.1) + (−1)(1)(0.1) + (1)(1)(0.15)
      = 0.

So:

Cov(X, Y) = E(XY) − E(X) E(Y) = 0   ⇒   Corr(X, Y) = 0.

(c) X and Y are not independent random variables since, for example:

P(X = 1, Y = −1) = 0.1 ≠ P(X = 1) P(Y = −1) = 0.5 × 0.3 = 0.15.
Activity 5.5 The random variables X1 and X2 are independent and have the
common distribution given in the table below:
X = x      0      1      2      3
pX(x)     0.2    0.4    0.3    0.1
The random variables W and Y are defined by W = max(X1 , X2 ) and
Y = min(X1 , X2 ).
(a) Calculate the table of probabilities which defines the joint distribution of W
and Y .
(b) Find:
i. the marginal distribution of W
ii. the conditional distribution of Y given W = 2
iii. E(Y | W = 2) and Var(Y | W = 2)
iv. Cov(W, Y ).
Solution
(a) The joint distribution of W and Y is:

                               W = w
Y = y        0             1              2              3
  0       (0.2)²      2(0.2)(0.4)    2(0.2)(0.3)    2(0.2)(0.1)
  1          0         (0.4)(0.4)    2(0.4)(0.3)    2(0.4)(0.1)
  2          0             0          (0.3)(0.3)    2(0.3)(0.1)
  3          0             0              0          (0.1)(0.1)
Total     (0.2)²       (0.8)(0.4)     (1.5)(0.3)     (1.9)(0.1)

which is:

                  W = w
Y = y      0       1       2       3
  0      0.04    0.16    0.12    0.04
  1      0.00    0.16    0.24    0.08
  2      0.00    0.00    0.09    0.06
  3      0.00    0.00    0.00    0.01
pW(w)    0.04    0.32    0.45    0.19

(b)
i. Hence the marginal distribution of W is:
W = w       0       1       2       3
pW(w)     0.04    0.32    0.45    0.19
ii. The conditional distribution of Y | W = 2 is:

Y = y | W = 2               0              1             2        3
pY|W=2(y | W = 2)     4/15 = 0.26̇    8/15 = 0.53̇    3/15 = 0.2    0
iii. We have:

E(Y | W = 2) = 0 × 4/15 + 1 × 8/15 + 2 × 3/15 + 3 × 0 = 0.93̇

and:

Var(Y | W = 2) = E(Y² | W = 2) − (E(Y | W = 2))² = 1.3̇ − (0.93̇)² = 0.4622.
iv. E(W Y ) = 1.69, E(W ) = 1.79 and E(Y ) = 0.81, therefore:
Cov(W, Y ) = E(W Y ) − E(W ) E(Y ) = 1.69 − 1.79 × 0.81 = 0.2401.
Activity 5.6 Consider two random variables X and Y . X can take the values −1, 0
and 1, and Y can take the values 0, 1 and 2. The joint probabilities for each pair are
given by the following table:
           X = −1   X = 0   X = 1
Y = 0       0.10     0.20    0.10
Y = 1       0.10     0.05    0.10
Y = 2       0.10     0.05    0.20
(a) Calculate the marginal distributions and expected values of X and Y .
(b) Calculate the covariance of the random variables U and V , where U = X + Y
and V = X − Y .
(c) Calculate E(V | U = 1).
Solution
(a) The marginal distribution of X is:

X = x     −1      0      1
pX(x)     0.3    0.3    0.4

The marginal distribution of Y is:

Y = y      0       1       2
pY(y)    0.40    0.25    0.35
Hence:
E(X) = −1 × 0.3 + 0 × 0.3 + 1 × 0.4 = 0.1
and:
E(Y ) = 0 × 0.40 + 1 × 0.25 + 2 × 0.35 = 0.95.
(b) We have:

Cov(U, V) = Cov(X + Y, X − Y)
          = E((X + Y)(X − Y)) − E(X + Y) E(X − Y)
          = E(X² − Y²) − (E(X) + E(Y))(E(X) − E(Y)).

Now:

E(X²) = ((−1)² × 0.3) + (0² × 0.3) + (1² × 0.4) = 0.7
E(Y²) = (0² × 0.4) + (1² × 0.25) + (2² × 0.35) = 1.65

hence:

Cov(U, V) = (0.7 − 1.65) − (0.1 + 0.95)(0.1 − 0.95) = −0.0575.
(c) U = 1 is achieved for (X, Y ) pairs (−1, 2), (0, 1) or (1, 0). The corresponding
values of V are −3, −1 and 1. We have:
P (U = 1) = 0.1 + 0.05 + 0.1 = 0.25
P(V = −3 | U = 1) = 0.1/0.25 = 2/5

P(V = −1 | U = 1) = 0.05/0.25 = 1/5

P(V = 1 | U = 1) = 0.1/0.25 = 2/5

hence:

E(V | U = 1) = −3 × 2/5 + (−1) × 1/5 + 1 × 2/5 = −1.
Activity 5.7 Two refills for a ballpoint pen are selected at random from a box
containing three blue refills, two red refills and three green refills. Define the
following random variables:
X = the number of blue refills selected
Y = the number of red refills selected.
(a) Show that P (X = 1, Y = 1) = 3/14.
(b) Form the table showing the joint probability distribution of X and Y .
(c) Calculate E(X), E(Y ) and E(X | Y = 1).
(d) Find the covariance between X and Y .
(e) Are X and Y independent random variables? Give a reason for your answer.
Solution
(a) With the obvious notation B = blue and R = red:
P(X = 1, Y = 1) = P(BR) + P(RB) = 3/8 × 2/7 + 2/8 × 3/7 = 3/14.
(b) We have:

                 X = x
Y = y       0       1       2
  0       3/28    9/28    3/28
  1       3/14    3/14      0
  2       1/28      0       0
(c) The marginal distribution of X is:

X = x       0       1       2
pX(x)     10/28   15/28    3/28

Hence:

E(X) = 0 × 10/28 + 1 × 15/28 + 2 × 3/28 = 3/4.

The marginal distribution of Y is:

Y = y       0       1       2
pY(y)     15/28   12/28    1/28

Hence:

E(Y) = 0 × 15/28 + 1 × 12/28 + 2 × 1/28 = 1/2.

The conditional distribution of X given Y = 1 is:

X = x | Y = 1           0      1
pX|Y=1(x | y = 1)      1/2    1/2

Hence:

E(X | Y = 1) = 0 × 1/2 + 1 × 1/2 = 1/2.
(d) The distribution of XY is:

XY = xy        0       1
pXY(xy)      22/28    6/28

Hence:

E(XY) = 0 × 22/28 + 1 × 6/28 = 3/14

and:

Cov(X, Y) = E(XY) − E(X) E(Y) = 3/14 − 3/4 × 1/2 = −9/56.
(e) Since Cov(X, Y ) 6= 0, a necessary condition for independence fails to hold. The
random variables are not independent.
Activity 5.8 A fair coin is tossed four times. Let X be the number of heads
obtained on the first three tosses of the coin. Let Y be the number of heads on all
four tosses of the coin.
(a) Find the joint probability distribution of X and Y .
(b) Find the mean and variance of X.
(c) Find the conditional probability distribution of Y given that X = 2.
(d) Find the mean of the conditional probability distribution of Y given that X = 2.
Solution
(a) The joint probability distribution is:

                         Y = y
X = x       0       1       2       3       4
  0       1/16    1/16      0       0       0
  1         0     3/16    3/16      0       0
  2         0       0     3/16    3/16      0
  3         0       0       0     1/16    1/16
(b) The marginal distribution of X is:

X = x      0      1      2      3
p(x)      1/8    3/8    3/8    1/8

Hence:

E(X) = Σ_x x p(x) = 0 × 1/8 + · · · + 3 × 1/8 = 3/2

E(X²) = Σ_x x² p(x) = 0² × 1/8 + · · · + 3² × 1/8 = 3

and:

Var(X) = 3 − 9/4 = 3/4.
(c) We have:
P(Y = 0 | X = 2) = p(2, 0)/pX(2) = 0/(3/8) = 0
P(Y = 1 | X = 2) = p(2, 1)/pX(2) = 0/(3/8) = 0
P(Y = 2 | X = 2) = p(2, 2)/pX(2) = (3/16)/(3/8) = 1/2
P(Y = 3 | X = 2) = p(2, 3)/pX(2) = (3/16)/(3/8) = 1/2
P(Y = 4 | X = 2) = p(2, 4)/pX(2) = 0/(3/8) = 0.

Hence:

Y = y | X = 2       2      3
p(y | X = 2)       1/2    1/2
(d) We have:

E(Y | X = 2) = 2 × 1/2 + 3 × 1/2 = 5/2.
Activity 5.9 X and Y are discrete random variables which can assume values 0, 1
and 2 only.
P (X = x, Y = y) = A(x + y) for some constant A and x, y ∈ {0, 1, 2}.
(a) Draw up a table to describe the joint distribution of X and Y and find the
value of the constant A.
(b) Describe the marginal distributions of X and Y .
(c) Give the conditional distribution of X | Y = 1 and find E(X | Y = 1).
(d) Are X and Y independent? Give a reason for your answer.
Solution
(a) The joint distribution table is:

               X = x
Y = y      0      1      2
  0        0      A     2A
  1        A     2A     3A
  2       2A     3A     4A

Since Σ_x Σ_y pX,Y(x, y) = 1, we have A = 1/18.
(b) The marginal distribution of X (similarly of Y) is:

X = x            0           1           2
P(X = x)     3A = 1/6    6A = 1/3    9A = 1/2
(c) The distribution of X | Y = 1 is:

X = x | y = 1                    0              1              2
PX|Y=1(X = x | y = 1)       A/6A = 1/6    2A/6A = 1/3    3A/6A = 1/2

Hence:

E(X | Y = 1) = 0 × 1/6 + 1 × 1/3 + 2 × 1/2 = 4/3.
(d) Even though the distributions of X and X | Y = 1 are the same, X and Y are
not independent. For example, P(X = 0, Y = 0) = 0 although P(X = 0) ≠ 0
and P(Y = 0) ≠ 0.
Activity 5.10 X and Y are discrete random variables with the following joint
probability function:
                 X = x
Y = y      −1      0       1
  0       0.15   0.05    0.15
  1       0.30   0.25    0.10
(a) Obtain the marginal distributions of X and Y , respectively.
(b) Calculate E(X), Var(X), E(Y ) and Var(Y ).
(c) Obtain the conditional distributions of Y given X = −1, and of X given Y = 0.
(d) Calculate EY |X (Y | X = −1) and EX|Y (X | Y = 0).
(e) Calculate E(XY ), Cov(X, Y ) and Corr(X, Y ).
(f) Find P (X > Y ) and P (X 2 > Y 2 ).
(g) Are X and Y independent? Explain why or why not.
Solution
(a) The marginal distributions are found by adding across rows and columns:

X = x     −1       0       1
pX(x)    0.45    0.30    0.25

and:

Y = y      0       1
pY(y)    0.35    0.65
(b) We have:

E(X) = −1 × 0.45 + 0 × 0.30 + 1 × 0.25 = −0.20

and:

E(X²) = (−1)² × 0.45 + 0² × 0.30 + 1² × 0.25 = 0.70

so Var(X) = 0.70 − (−0.20)² = 0.66. Also:

E(Y) = 0 × 0.35 + 1 × 0.65 = 0.65

and:

E(Y²) = 0² × 0.35 + 1² × 0.65 = 0.65

so Var(Y) = 0.65 − (0.65)² = 0.2275.
(c) The conditional probability functions pY|X=−1(y | x = −1) and pX|Y=0(x | y = 0)
are given by, respectively:

Y = y | X = −1                  0                   1
pY|X=−1(y | x = −1)      0.15/0.45 = 0.3̇     0.30/0.45 = 0.6̇

and:

X = x | Y = 0                −1                      0                      1
pX|Y=0(x | y = 0)     0.15/0.35 = 0.4286     0.05/0.35 = 0.1429     0.15/0.35 = 0.4286
(d) We have EY |X (Y | X = −1) = 0 × 0.3̇ + 1 × 0.6̇ = 0.6̇.
Also, EX|Y (X | Y = 0) = −1 × 0.4286 + 0 × 0.1429 + 1 × 0.4286 = 0.
(e) We have E(XY) = Σ_x Σ_y x y p(x, y) = −1 × 0.30 + 0 × 0.60 + 1 × 0.10 = −0.20.

Also, Cov(X, Y) = E(XY) − E(X) E(Y) = −0.20 − (−0.20)(0.65) = −0.07 and:

Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) = −0.07/√(0.66 × 0.2275) = −0.1807.
(f) We have P(X > Y) = P(X = 1, Y = 0) = 0.15.

Also, P(X² > Y²) = P(X = −1, Y = 0) + P(X = 1, Y = 0) = 0.15 + 0.15 = 0.30.
(g) Since X and Y are (weakly) negatively correlated (as determined in (e)), they
cannot be independent.
While the non-zero correlation is a sufficient explanation in this case, for other
such bivariate distributions which are uncorrelated, i.e. when Corr(X, Y ) = 0, it
becomes necessary to check whether pX,Y (x, y) = pX (x) pY (y) for all pairs of
values of (x, y). Here, for example, pX,Y (0, 0) = 0.05, pX (0) = 0.30 and
pY (0) = 0.35. We then have that pX (0) pY (0) = 0.105, which is not equal to
pX,Y (0, 0) = 0.05. Hence X and Y cannot be independent.
Activity 5.11 A box contains 4 red balls, 3 green balls and 3 blue balls. Two balls
are selected at random without replacement. Let X represent the number of red
balls in the sample and Y the number of green balls in the sample.
(a) Arrange the different pairs of values of (X, Y ) as the cells in a table, each cell
being filled with the probability of that pair of values occurring, i.e. provide the
joint probability distribution.
(b) What does the random variable Z = 2 − X − Y represent?
(c) Calculate Cov(X, Y ).
(d) Calculate P (X = 1 | − 2 < X − Y < 2).
Solution
(a) We have:

P(X = 0, Y = 0) = 3/10 × 2/9 = 6/90 = 1/15
P(X = 0, Y = 1) = 2 × 3/10 × 3/9 = 18/90 = 3/15
P(X = 0, Y = 2) = 3/10 × 2/9 = 6/90 = 1/15
P(X = 1, Y = 0) = 2 × 4/10 × 3/9 = 24/90 = 4/15
P(X = 1, Y = 1) = 2 × 4/10 × 3/9 = 24/90 = 4/15
P(X = 2, Y = 0) = 4/10 × 3/9 = 12/90 = 2/15.

All other values have probability 0. We then construct the table of joint
probabilities:

          Y = 0    Y = 1    Y = 2
X = 0     1/15     3/15     1/15
X = 1     4/15     4/15       0
X = 2     2/15       0        0
(b) The number of blue balls in the sample.
(c) We have:

E(X) = 1 × (4/15 + 4/15) + 2 × 2/15 = 4/5

E(Y) = 1 × (3/15 + 4/15) + 2 × 1/15 = 3/5

and:

E(XY) = 1 × 1 × 4/15 = 4/15.

So:

Cov(X, Y) = 4/15 − 4/5 × 3/5 = −16/75.
(d) We have:

P(X = 1 | |X − Y| < 2) = (4/15 + 4/15)/(1/15 + 3/15 + 4/15 + 4/15) = 2/3.
Activity 5.12 Suppose that Var(X) = Var(Y) = 1, and that X and Y have
correlation coefficient ρ. Show that it follows from Var(X − ρY) ≥ 0 that ρ² ≤ 1.

Solution

We have (noting that Cov(X, Y) = ρ since both standard deviations equal 1):

0 ≤ Var(X − ρY) = Var(X) − 2ρ Cov(X, Y) + ρ² Var(Y) = 1 − 2ρ² + ρ² = 1 − ρ².

Hence 1 − ρ² ≥ 0, and so ρ² ≤ 1.
Activity 5.13 The distribution of a random variable X is:
X = x          −1    0    1
P(X = x)        a    b    a

Show that X and X² are uncorrelated.
Solution
This is an example of two random variables X and Y = X 2 which are uncorrelated,
but obviously dependent. The bivariate distribution of (X, Y ) in this case is singular
because of the complete functional dependence between them.
We have:
E(X) = −1 × a + 0 × b + 1 × a = 0
E(X²) = 1 × a + 0 × b + 1 × a = 2a
E(X³) = −1 × a + 0 × b + 1 × a = 0

and we must show that the covariance is zero:

Cov(X, Y) = E(XY) − E(X) E(Y) = E(X³) − E(X) E(X²) = 0 − 0 × 2a = 0.
There are many possible choices for a and b which give a valid probability
distribution, for instance a = 0.25 and b = 0.5.
Activity 5.14 A fair coin is thrown n times, each throw being independent of the
ones before. Let R = ‘the number of heads’, and S = ‘the number of tails’. Find the
covariance of R and S. What is the correlation of R and S?
Solution
One can go about this in a straightforward way. If Xi is the number of heads and Yi
is the number of tails on the ith throw, then the distribution of Xi and Yi is given by:
X \ Y      0      1
  0        0     0.5
  1       0.5     0
From this table, we compute the following:

E(Xi) = E(Yi) = 0 × 0.5 + 1 × 0.5 = 0.5
E(Xi²) = E(Yi²) = 0 × 0.5 + 1 × 0.5 = 0.5
Var(Xi) = Var(Yi) = 0.5 − (0.5)² = 0.25
E(Xi Yi) = 0 × 0.5 + 0 × 0.5 = 0
Cov(Xi, Yi) = E(Xi Yi) − E(Xi) E(Yi) = 0 − 0.25 = −0.25.

Now, since R = Σ_i Xi and S = Σ_i Yi, and the pairs (Xi, Yi) are independent across
throws, covariances add over i just like means and variances, so:

Cov(R, S) = −0.25n.
Since R + S = n is a fixed quantity, there is a complete linear dependence between R
and S. We have R = n − S, so the correlation between R and S should be −1. This
can be checked directly since:
Var(R) = Var(S) = 0.25n
(add the variances of the Xi s or Yi s). The correlation between R and S works out as
−0.25n/0.25n = −1.
Activity 5.15 Suppose that X and Y are random variables, and a, b, c and d are
constants.
(a) Show that:
Cov(aX + b, cY + d) = ac Cov(X, Y ).
(b) Derive Corr(aX + b, cY + d).
(c) Suppose that Z = cX + d, where c and d are constants. Using the result you
obtained in (b), or in some other way, show that:
Corr(X, Z) = 1 for c > 0
and:
Corr(X, Z) = −1 for c < 0.
Solution
(a) Note first that:
E(aX + b) = a E(X) + b and E(cY + d) = c E(Y ) + d.
Therefore, the covariance is:
Cov(aX + b, cY + d) = E[(aX + b)(cY + d)] − E(aX + b) E(cY + d)
= E(acXY + adX + bcY + bd) − [a E(X) + b] [c E(Y ) + d]
= ac E(XY ) + ad E(X) + bc E(Y ) + bd
− ac E(X) E(Y ) − ad E(X) − bc E(Y ) − bd
= ac E(XY ) − ac E(X) E(Y )
= ac [E(XY ) − E(X) E(Y )]
= ac Cov(X, Y )
as required.
(b) Note first that:
sd(aX + b) = |a| sd(X) and sd(cY + d) = |c| sd(Y ).
Therefore, the correlation is:
Corr(aX + b, cY + d) = Cov(aX + b, cY + d)/(sd(aX + b) sd(cY + d))
                     = ac Cov(X, Y)/(|ac| sd(X) sd(Y))
                     = (ac/|ac|) Corr(X, Y).
(c) First, note that the correlation of a random variable with itself is 1, since:
Corr(X, X) = Cov(X, X)/√(Var(X) Var(X)) = Var(X)/Var(X) = 1.

In the result obtained in (b), select a = 1, b = 0 and Y = X. This gives:

Corr(X, Z) = Corr(X, cX + d) = (c/|c|) Corr(X, X) = c/|c|.
This gives the two cases mentioned in the question.
• For c > 0, then Corr(X, cX + d) = 1.
• For c < 0, then Corr(X, cX + d) = −1.
5.7.3 Sample covariance and correlation
We have just introduced covariance and correlation, two new characteristics of
probability distributions (population distributions). We now discuss their sample
equivalents.
Let (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be a sample of n pairs of observed values of two
random variables X and Y .
We can use these observations to calculate sample versions of the covariance and
correlation between X and Y . These are measures of association in the sample, i.e.
descriptive statistics. They are also estimates of the corresponding population
quantities Cov(X, Y ) and Corr(X, Y ). The uses of these sample measures will be
discussed in more detail later in the course.
Sample covariance
The sample covariance of random variables X and Y is calculated as:
Ĉov(X, Y) = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ)
where X̄ and Ȳ are the sample means of X and Y , respectively.
Sample correlation
The sample correlation of random variables X and Y is calculated as:
r = Ĉov(X, Y)/(SX SY) = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √(Σ_{i=1}^{n} (Xi − X̄)² × Σ_{i=1}^{n} (Yi − Ȳ)²)
where SX and SY are the sample standard deviations of X and Y , respectively.
r is always between −1 and +1, and is equal to −1 or +1 only if X and Y are
perfectly linearly related in the sample.
r = 0 if X and Y are uncorrelated (not linearly related) in the sample.
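A short sketch (not from the guide; the data values are made up purely for illustration) showing the sample covariance and correlation computed both from the formulas above and with NumPy's built-in functions:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # illustrative data only
y = np.array([2.1, 1.9, 3.5, 4.0, 4.6])

n = len(x)
cov_hat = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)   # sample covariance
r = cov_hat / (x.std(ddof=1) * y.std(ddof=1))                 # sample correlation

print(round(cov_hat, 4), round(r, 4))
print(np.cov(x, y, ddof=1)[0, 1], np.corrcoef(x, y)[0, 1])    # the same values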
Example 5.9 Figure 5.2 shows different examples of scatterplots of observations of
X and Y , and different values of the sample correlation, r. The line shown in each
plot is the best-fitting (least squares) line for the scatterplot (which will be
introduced later in the course).
In (a), X and Y are perfectly linearly related, and r = 1.
Plots (b), (c) and (e) show relationships of different strengths.
In (c), the variables are negatively correlated.
In (d), there is no linear relationship, and r = 0.
Plot (f) shows that r can be 0 even if two variables are clearly related, if that
relationship is not linear.
[Figure: six scatterplots with best-fitting lines, panels (a) r = 1, (b) r = 0.85, (c) r = −0.5, (d) r = 0, (e) r = 0.92 and (f) r = 0.]

Figure 5.2: Scatterplots depicting various sample correlations as discussed in Example 5.9.
5.8 Independent random variables
Two discrete random variables X and Y are associated if pY |X (y | x) depends on x.
What if it does not, i.e. what if:
pY|X(y | x) = pX,Y(x, y)/pX(x) = pY(y)   for all x and y

so that knowing the value of X does not help to predict Y?

This implies that:

pX,Y(x, y) = pX(x) pY(y)   for all x, y.    (5.1)

X and Y are independent of each other if and only if (5.1) is true.
Independent random variables
In general, suppose that X1 , X2 , . . . , Xn are discrete random variables. These are
independent if and only if their joint pf is:
p(x1 , x2 , . . . , xn ) = p1 (x1 ) p2 (x2 ) · · · pn (xn )
for all numbers x1 , x2 , . . . , xn , where p1 (x1 ), . . . , pn (xn ) are the univariate marginal
pfs of X1 , . . . , Xn , respectively.
Similarly, continuous random variables X1 , X2 , . . . , Xn are independent if and only
if their joint pdf is:
f (x1 , x2 , . . . , xn ) = f1 (x1 ) f2 (x2 ) · · · fn (xn )
for all x1 , x2 , . . . , xn , where f1 (x1 ), . . . , fn (xn ) are the univariate marginal pdfs of
X1 , . . . , Xn , respectively.
If two random variables are independent, they are also uncorrelated, i.e. we have:
Cov(X, Y ) = 0 and Corr(X, Y ) = 0.
The reverse is not true, i.e. two random variables can be dependent even when their
correlation is 0. This can happen when the dependence is non-linear.
Example 5.10 The football example is an instance of this. The conditional
distributions pY |X (y | x) are clearly not all the same, but the correlation is very
nearly 0 (see Example 5.8).
Another example is plot (f) in Figure 5.2, where the dependence is not linear, but
quadratic.
5.8.1 Joint distribution of independent random variables
When random variables are independent, we can easily derive their joint pf or pdf as
the product of their univariate marginal distributions. This is particularly simple if all
the marginal distributions are the same.
Example 5.11 Suppose that X1 , X2 , . . . , Xn are independent, and each of them
follows the Poisson distribution with the same mean λ. Therefore, the marginal pf of
each Xi is:
p(xi) = e^(−λ) λ^(xi) / xi!

and the joint pf of the random variables is:

p(x1, x2, . . . , xn) = p(x1) p(x2) · · · p(xn) = Π_{i=1}^{n} p(xi)
                     = Π_{i=1}^{n} e^(−λ) λ^(xi)/xi!
                     = e^(−nλ) λ^(Σ_i xi) / Π_i xi!.
Example 5.12 For a continuous example, suppose that X1 , X2 , . . . , Xn are
independent, and each of them follows a normal distribution with the same mean µ
and same variance σ². Therefore, the marginal pdf of each Xi is:

f(xi) = (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))

and the joint pdf of the variables is:

f(x1, x2, . . . , xn) = f(x1) f(x2) · · · f(xn) = Π_{i=1}^{n} f(xi)
                     = Π_{i=1}^{n} (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))
                     = (1/(2πσ²))^(n/2) exp(−(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²).
Activity 5.16 X1 , . . . , Xn are independent Bernoulli random variables. The
probability function of Xi is given by:
p(xi) = πi^(xi) (1 − πi)^(1−xi) for xi = 0, 1, and 0 otherwise

where:

πi = e^(iθ) / (1 + e^(iθ))

for i = 1, 2, . . . , n. Derive the joint probability function, p(x1, x2, . . . , xn).
Solution
Since the Xi s are independent (but not identically distributed) random variables, we
have:

p(x1, x2, . . . , xn) = Π_{i=1}^{n} p(xi).

So, the joint probability function is:

p(x1, x2, . . . , xn) = Π_{i=1}^{n} [1/(1 + e^(iθ))]^(1−xi) [e^(iθ)/(1 + e^(iθ))]^(xi)
                     = Π_{i=1}^{n} e^(iθxi)/(1 + e^(iθ))
                     = e^(θ Σ_{i=1}^{n} i xi) / Π_{i=1}^{n} (1 + e^(iθ)).
Activity 5.17 X1 , . . . , Xn are independent random variables with the common
probability density function:
f(x) = λ²x e^(−λx) for x > 0, and 0 otherwise.
Derive the joint probability density function, f (x1 , x2 , . . . , xn ).
Solution
Since the Xi s are independent (and identically distributed) random variables, we
have:

f(x1, x2, . . . , xn) = Π_{i=1}^{n} f(xi).

So, the joint probability density function is:

f(x1, x2, . . . , xn) = Π_{i=1}^{n} λ² xi e^(−λxi)
                     = λ^(2n) (Π_{i=1}^{n} xi) e^(−λx1 − λx2 − ··· − λxn)
                     = λ^(2n) (Π_{i=1}^{n} xi) e^(−λ Σ_{i=1}^{n} xi).
Activity 5.18 X1 , . . . , Xn are independent random variables with the common
probability function:
p(x) = C(m, x) θ^x / (1 + θ)^m   for x = 0, 1, 2, . . . , m
and 0 otherwise. Derive the joint probability function, p(x1 , x2 , . . . , xn ).
Solution
Since the Xi s are independent (and identically distributed) random variables, we
have:

p(x1, x2, . . . , xn) = Π_{i=1}^{n} p(xi).

So, the joint probability function is:

p(x1, x2, . . . , xn) = Π_{i=1}^{n} C(m, xi) θ^(xi)/(1 + θ)^m
                     = (Π_{i=1}^{n} C(m, xi)) θ^(x1 + x2 + ··· + xn)/(1 + θ)^(nm)
                     = (Π_{i=1}^{n} C(m, xi)) θ^(Σ_{i=1}^{n} xi)/(1 + θ)^(nm).
Activity 5.19 Show that if:
P(X ≤ x ∩ Y ≤ y) = (1 − e^(−x))(1 − e^(−2y))
for all x, y > 0, then X and Y are independent random variables, each with an
exponential distribution.
Solution
The right-hand side of the result given is the product of the cdf of an exponential
random variable X with parameter 1 (mean 1) and the cdf of an exponential random
variable Y with parameter 2 (mean 1/2). So the result follows from the definition of
independent random variables.
Activity 5.20 The random variable X has a discrete uniform distribution with
values 1, 2 and 3, i.e. P (X = i) = 1/3 for i = 1, 2, 3. The random variable Y has a
discrete uniform distribution with values 1, 2, 3 and 4, i.e. P (Y = i) = 1/4 for
i = 1, 2, 3, 4. X and Y are independent.
(a) Derive the probability distribution of X + Y .
(b) What are E(X + Y ) and Var(X + Y )?
Solution
(a) The possible values of the sum are 2, 3, 4, 5, 6 and 7. Since X and Y are
independent, the probabilities of the different sums are:
P(X + Y = 2) = P(X = 1, Y = 1) = P(X = 1) P(Y = 1) = 1/3 × 1/4 = 1/12

P(X + Y = 3) = P(X = 1) P(Y = 2) + P(X = 2) P(Y = 1) = 2/12 = 1/6

P(X + Y = 4) = P(X = 1) P(Y = 3) + P(X = 2) P(Y = 2) + P(X = 3) P(Y = 1) = 3/12 = 1/4

P(X + Y = 5) = P(X = 1) P(Y = 4) + P(X = 2) P(Y = 3) + P(X = 3) P(Y = 2) = 3/12 = 1/4

P(X + Y = 6) = P(X = 2) P(Y = 4) + P(X = 3) P(Y = 3) = 2/12 = 1/6

P(X + Y = 7) = P(X = 3) P(Y = 4) = 1/12
and 0 for all other real numbers.
(b) You could find the expectation and variance directly from the distribution of
X + Y above. However, it is easier to use the expected value and variance of the
discrete uniform distribution for both X and Y , and then the results on the
expectation and variance of sums of independent random variables to get:
E(X + Y) = E(X) + E(Y) = (1 + 3)/2 + (1 + 4)/2 = 4.5

and:

Var(X + Y) = Var(X) + Var(Y) = (3² − 1)/12 + (4² − 1)/12 = 23/12 ≈ 1.92.
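The distribution of X + Y can also be obtained numerically by convolving the two pfs. A sketch (not from the guide; variable names are illustrative):

import numpy as np

p_x = np.array([1/3, 1/3, 1/3])          # X on 1, 2, 3
p_y = np.array([1/4, 1/4, 1/4, 1/4])     # Y on 1, 2, 3, 4

p_sum = np.convolve(p_x, p_y)            # pf of X + Y on 2, 3, ..., 7
values = np.arange(2, 8)
print(p_sum)                             # 1/12, 2/12, 3/12, 3/12, 2/12, 1/12
print(values @ p_sum)                                 # E(X + Y) = 4.5
print((values**2) @ p_sum - (values @ p_sum)**2)      # Var(X + Y), about 1.92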
Activity 5.21 Let X1 , . . . , Xk be independent random variables, and a1 , . . . , ak be
constants. Show that:
(a) E(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} ai E(Xi)

(b) Var(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} ai² Var(Xi).
Solution
(a) We have:

E(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} E(ai Xi) = Σ_{i=1}^{k} ai E(Xi).

(b) We have:

Var(Σ_{i=1}^{k} ai Xi) = E[(Σ_{i=1}^{k} ai Xi − Σ_{i=1}^{k} ai E(Xi))²]
                       = E[(Σ_{i=1}^{k} ai (Xi − E(Xi)))²]
                       = Σ_{i=1}^{k} ai² E((Xi − E(Xi))²) + Σ_{1≤i≠j≤k} ai aj E((Xi − E(Xi))(Xj − E(Xj)))
                       = Σ_{i=1}^{k} ai² Var(Xi) + Σ_{1≤i≠j≤k} ai aj E(Xi − E(Xi)) E(Xj − E(Xj))
                       = Σ_{i=1}^{k} ai² Var(Xi)

where the second-to-last step uses independence, and the last step uses the fact that
E(Xi − E(Xi)) = 0 for each i.
Additional note: remember there are two ways to compute the variance:
Var(X) = E((X − µ)2 ) and Var(X) = E(X 2 ) − (E(X))2 . The former is more
convenient for analytical derivations/proofs (see above), while the latter should be
used to compute variances for common distributions such as Poisson or exponential
distributions. Actually it is rather difficult to compute the variance for a Poisson
distribution using the formula Var(X) = E((X − µ)2 ) directly.
5.9 Sums and products of random variables
Suppose X1 , X2 , . . . , Xn are random variables. We now go from the multivariate setting
back to the univariate setting, by considering univariate functions of X1 , X2 , . . . , Xn . In
particular, we consider sums and products like:
Σ_{i=1}^{n} ai Xi + b = a1 X1 + a2 X2 + · · · + an Xn + b    (5.2)

and:

Π_{i=1}^{n} ai Xi = (a1 X1)(a2 X2) · · · (an Xn)
where a1 , a2 , . . . , an and b are constants.
Each such sum or product is itself a univariate random variable. The probability
distribution of such a function depends on the joint distribution of X1 , . . . , Xn .
Example 5.13 In the football example, the sum Z = X + Y is the total number of
goals scored in a match.
Its probability function is obtained from the joint pf pX,Y (x, y), that is:
Z = z      0       1       2       3       4       5       6
pZ(z)    0.100   0.131   0.270   0.293   0.138   0.062   0.006

For example, pZ(1) = pX,Y(0, 1) + pX,Y(1, 0) = 0.031 + 0.100 = 0.131. The mean of Z
is then E(Z) = Σ_z z pZ(z) = 2.448.
Another example is the distribution of XY (see Example 5.8).
However, what can we say about such distributions in general, in cases where we cannot
derive them as easily?
5.9.1 Distributions of sums and products
General results for the distributions of sums and products of random variables are
available as follows:
                       Sums                                        Products
Mean                   Yes                                         Only for independent random variables
Variance               Yes                                         No
Distributional form    Normal: Yes                                 No
                       Some other distributions: only for
                       independent random variables

5.9.2 Expected values and variances of sums of random variables
We state, without proof, the following important result.
If X1, X2, . . . , Xn are random variables with means E(X1), E(X2), . . . , E(Xn),
respectively, and a1, a2, . . . , an and b are constants, then:

E(Σ_{i=1}^{n} ai Xi + b) = E(a1 X1 + a2 X2 + · · · + an Xn + b)
                         = a1 E(X1) + a2 E(X2) + · · · + an E(Xn) + b
                         = Σ_{i=1}^{n} ai E(Xi) + b.    (5.3)
Two simple special cases of this, when n = 2, are:
E(X + Y ) = E(X) + E(Y ), obtained by choosing X1 = X, X2 = Y , a1 = a2 = 1
and b = 0
E(X − Y ) = E(X) − E(Y ), obtained by choosing X1 = X, X2 = Y , a1 = 1,
a2 = −1 and b = 0.
Example 5.14 In the football example, we have previously shown that
E(X) = 1.383, E(Y) = 1.065 and E(X + Y) = 2.448. So E(X + Y) = E(X) + E(Y),
as the theorem claims.
If X1, X2, . . . , Xn are random variables with variances Var(X1), Var(X2), . . . , Var(Xn),
respectively, and covariances Cov(Xi, Xj) for i ≠ j, and a1, a2, . . . , an and b are
constants, then:

Var(Σ_{i=1}^{n} ai Xi + b) = Σ_{i=1}^{n} ai² Var(Xi) + 2 Σ_{i<j} ai aj Cov(Xi, Xj).    (5.4)

In particular, for n = 2:

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y).
If X1, X2, . . . , Xn are independent random variables, then Cov(Xi, Xj) = 0 for all i ≠ j,
and so (5.4) simplifies to:

Var(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai² Var(Xi).    (5.5)

In particular, for n = 2, when X and Y are independent:

Var(X + Y) = Var(X) + Var(Y)
Var(X − Y) = Var(X) + Var(Y).

These results also hold whenever Cov(Xi, Xj) = 0 for all i ≠ j, even if the random
variables are not independent.
5.9.3 Expected values of products of independent random variables
If X1 , X2 , . . . , Xn are independent random variables and a1 , a2 , . . . , an are constants,
then:
!
n
n
Y
Y
E
ai Xi = E[(a1 X1 )(a2 X2 ) · · · (an Xn )] =
ai E(Xi ).
i=1
i=1
In particular, when X and Y are independent:
E(XY ) = E(X) E(Y ).
There is no corresponding simple result for the means of products of dependent random
variables. There is also no simple result for the variances of products of random
variables, even when they are independent.
5.9.4 Distributions of sums of random variables
We now know the expected value and variance of the sum:
a1 X1 + a2 X2 + · · · + an Xn + b
whatever the joint distribution of X1 , X2 , . . . , Xn . This is usually all we can say about
the distribution of this sum.
In particular, the form of the distribution of the sum (i.e. its pf/pdf) depends on the
joint distribution of X1 , X2 , . . . , Xn , and there are no simple general results about that.
For example, even if X and Y have distributions from the same family, the distribution
of X + Y is often not from that same family. However, such results are available for a
few special cases.
Sums of independent binomial and Poisson random variables
Suppose X1, X2, . . . , Xn are random variables, and we consider the unweighted sum:

Σ_{i=1}^{n} Xi = X1 + X2 + · · · + Xn.

That is, the general sum given by (5.2), with a1 = a2 = · · · = an = 1 and b = 0.

The following results hold when the random variables X1, X2, . . . , Xn are independent,
but not otherwise.

If Xi ∼ Bin(ni, π), then Σ_i Xi ∼ Bin(Σ_i ni, π).

If Xi ∼ Poisson(λi), then Σ_i Xi ∼ Poisson(Σ_i λi).
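The Poisson result can be verified numerically by convolving the two pfs and comparing with the Poisson pf of the summed parameter. A sketch (not from the guide; the parameter values 3 and 4.5 are chosen purely for illustration):

import numpy as np
from scipy.stats import poisson

k = np.arange(0, 60)                       # a grid of values long enough for both pfs
p1 = poisson.pmf(k, 3.0)
p2 = poisson.pmf(k, 4.5)

p_sum = np.convolve(p1, p2)[:len(k)]       # pf of the sum on 0, 1, 2, ...
print(np.allclose(p_sum, poisson.pmf(k, 7.5)))   # True, up to floating-point error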
Activity 5.22 Cars pass a point on a busy road at an average rate of 150 per hour.
Assume that the number of cars in an hour follows a Poisson distribution. Other
motor vehicles (lorries, motorcycles etc.) pass the same point at the rate of 75 per
hour. Assume a Poisson distribution for these vehicles too, and assume that the
number of other vehicles is independent of the number of cars.
(a) What is the probability that one car and one other motor vehicle pass in a
two-minute period?
(b) What is the probability that two motor vehicles of any type (cars, lorries,
motorcycles etc.) pass in a two-minute period?
Solution
(a) Let X denote the number of cars, and Y denote the number of other motor
vehicles in a two-minute period. We need the probability given by
P (X = 1, Y = 1), which is P (X = 1) P (Y = 1) since X and Y are independent.
A rate of 150 cars per hour is a rate of 5 per two minutes, so X ∼ Poisson(5).
The probability of one car passing in two minutes is
P(X = 1) = e^(−5) × 5¹/1! = 0.0337. The rate for other vehicles over two minutes is
2.5, so Y ∼ Poisson(2.5) and so P(Y = 1) = e^(−2.5) × (2.5)¹/1! = 0.2052. Hence the
probability for one vehicle of each type is 0.0337 × 0.2052 = 0.0069.

(b) Here we require P(Z = 2), where Z = X + Y. Since the sum of two independent
Poisson variables is again Poisson (see Section 5.9.4), then
Z ∼ Poisson(5 + 2.5) = Poisson(7.5). Therefore, the required probability is:

P(Z = 2) = e^(−7.5) × (7.5)²/2! = 0.0156.
Application to the binomial distribution
An easy proof that the mean and variance of X ∼ Bin(n, π) are E(X) = n π and
Var(X) = n π (1 − π) is as follows.
1. Let Z1 , . . . , Zn be independent random variables, each distributed as
Zi ∼ Bernoulli(π) = Bin(1, π).
2. It is easy to show that E(Zi ) = π and Var(Zi ) = π (1 − π) for each i = 1, . . . , n (see
(4.3) and (4.4)).
3. Also Σ_{i=1}^{n} Zi = X ∼ Bin(n, π) by the result above for sums of independent
binomial random variables.

4. Therefore, using the results (5.3) and (5.5), we have:

E(X) = Σ_{i=1}^{n} E(Zi) = n π   and   Var(X) = Σ_{i=1}^{n} Var(Zi) = n π (1 − π).
Sums of normally distributed random variables
All sums (linear combinations) of normally distributed random variables are also
normally distributed.
Suppose X1, X2, . . . , Xn are normally distributed random variables, with
Xi ∼ N(µi, σi²) for i = 1, . . . , n, and a1, . . . , an and b are constants, then:

Σ_{i=1}^{n} ai Xi + b ∼ N(µ, σ²)

where:

µ = Σ_{i=1}^{n} ai µi + b   and   σ² = Σ_{i=1}^{n} ai² σi² + 2 Σ_{i<j} ai aj Cov(Xi, Xj).

If the Xi s are independent (or just uncorrelated), i.e. if Cov(Xi, Xj) = 0 for all i ≠ j,
the variance simplifies to σ² = Σ_{i=1}^{n} ai² σi².
i=1
Example 5.15 Suppose that in the population of English people aged 16 or over:
the heights of men (in cm) follow a normal distribution with mean 174.9 and
standard deviation 7.39
the heights of women (in cm) follow a normal distribution with mean 161.3 and
standard deviation 6.85.
Suppose we select one man and one woman at random and independently of each
other. Denote the man’s height by X and the woman’s height by Y . What is the
probability that the man is at most 10 cm taller than the woman?
In other words, what is the probability that the difference between X and Y is at
most 10?
Since X and Y are independent we have:
D = X − Y ∼ N(µX − µY, σX² + σY²)
           = N(174.9 − 161.3, (7.39)² + (6.85)²)
           = N(13.6, (10.08)²).

The probability we need is:

P(D ≤ 10) = P((D − 13.6)/10.08 ≤ (10 − 13.6)/10.08)
          = P(Z ≤ −0.36)
          = P(Z ≥ 0.36)
          = 0.3594
using Table 4 of the New Cambridge Statistical Tables.
The probability that a randomly selected man is at most 10 cm taller than a
randomly selected woman is about 0.3594.
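The same probability can be checked with scipy instead of statistical tables. A sketch (not from the guide):

from scipy.stats import norm

mu = 174.9 - 161.3                          # mean of D = X - Y
sigma = (7.39**2 + 6.85**2) ** 0.5          # sd of D, about 10.08
print(round(norm.cdf(10, loc=mu, scale=sigma), 4))   # about 0.36 (table-based value: 0.3594)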
Activity 5.23 At one stage in the manufacture of an article a piston of circular
cross-section has to fit into a similarly-shaped cylinder. The distributions of
diameters of pistons and cylinders are known to be normal with parameters as
follows.
• Piston diameters:
mean 10.42 cm, standard deviation 0.03 cm.
• Cylinder diameters: mean 10.52 cm, standard deviation 0.04 cm.
If pairs of pistons and cylinders are selected at random for assembly, for what
proportion will the piston not fit into the cylinder (i.e. for which the piston diameter
exceeds the cylinder diameter)?
(a) What is the chance that in 100 pairs, selected at random:
i. every piston will fit?
ii. not more than two of the pistons will fail to fit?
(b) Calculate both of these probabilities:
i. exactly
ii. using a Poisson approximation.
Discuss the appropriateness of using this approximation.
Solution
Let P ∼ N(10.42, (0.03)²) for the pistons, and C ∼ N(10.52, (0.04)²) for the
cylinders. It follows that D = C − P ∼ N(0.1, (0.05)²) for the difference (adding the
variances, assuming independence). The piston will fit if D > 0. We require:

P(D > 0) = P(Z > (0 − 0.1)/0.05) = P(Z > −2) = 0.9772

so a proportion of 1 − 0.9772 = 0.0228 will not fit.
The number of pistons, N , failing to fit out of 100 will be a binomial random
variable such that N ∼ Bin(100, 0.0228).
(a) Calculating directly, we have the following.
i. P(N = 0) = (0.9772)^100 = 0.0996.

ii. P(N ≤ 2) = (0.9772)^100 + 100 (0.9772)^99 (0.0228) + C(100, 2) (0.9772)^98 (0.0228)² = 0.6005.
(b) Using the Poisson approximation with λ = 100 × 0.0228 = 2.28, we have the
following.

i. P(N = 0) ≈ e^(−2.28) = 0.1023.

ii. P(N ≤ 2) ≈ e^(−2.28) + e^(−2.28) × 2.28 + e^(−2.28) × (2.28)²/2! = 0.6013.
The approximations are good (note there will be some rounding error, but the
values are close with the two methods). It is not surprising that there is close
agreement since n is large, π is small and n π < 5.
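Both comparisons can be reproduced with scipy. A sketch (not from the guide):

from scipy.stats import binom, poisson

n, pi = 100, 0.0228
lam = n * pi

print(round(binom.pmf(0, n, pi), 4), round(poisson.pmf(0, lam), 4))   # about 0.0996 vs 0.1023
print(round(binom.cdf(2, n, pi), 4), round(poisson.cdf(2, lam), 4))   # about 0.600 vs 0.601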
5.10 Overview of chapter
This chapter has introduced how to deal with more than one random variable at a time.
Focusing mainly on discrete bivariate distributions, the relationships between joint,
marginal and conditional distributions were explored. Sums and products of random
variables concluded the chapter.
5.11 Key terms and concepts
Association
Bivariate
Conditional distribution
Conditional mean
Conditional variance
Correlation
Covariance
Dependence
Independence
Joint probability distribution
Joint probability (density) function
Marginal distribution
Multivariate
Uncorrelated

5.12 Sample examination questions
Solutions can be found in Appendix C.
1. Consider two random variables X and Y taking the values 0 and 1. The joint
probabilities for the pair are given by the following table:

             Y = 0      Y = 1
   X = 0    1/2 − α       α
   X = 1       α       1/2 − α
(a) What are the values α can take? Explain your answer.
Now let α = 1/4, and:
U = max(X, Y)/3   and   V = min(X, Y)
where max(X, Y ) means the larger of X and Y , and min(X, Y ) means the smaller
of X and Y . For example, max(0, 1) = 1, min(0, 1) = 0, and
min(0, 0) = max(0, 0) = 0.
(b) Compute the mean of U and the mean of V .
(c) Are U and V independent? Explain your answer.
2. The amount of coffee dispensed into a coffee cup by a coffee machine follows a
normal distribution with mean 150 ml and standard deviation 10 ml. The coffee is
sold at the price of £1 per cup. However, the coffee cups are marked at the 137 ml
level, and any cup with coffee below this level will be given away free of charge.
The amounts of coffee dispensed in different cups are independent of each other.
(a) Find the probability that the total amount of coffee in 5 cups exceeds 700 ml.
(b) Find the probability that the difference in the amounts of coffee in 2 cups is
smaller than 20 ml.
(c) Find the probability that one cup is filled below the level of 137 ml.
(d) Find the expected income from selling one cup of coffee.
3. There are six houses on Station Street, numbered 1 to 6. The postman has six
letters to deliver, one addressed to each house. As he is sloppy and in a hurry he
does not look at which letter he puts in which letterbox (one per house).
(a) Explain in words why the probability that the people living in the first house
receive the correct letter is equal to 1/6.
(b) Let Xi (for i = 1, . . . , 6) be the random variable which is equal to 1 if the
people living in house number i receive the correct letter, and equal to 0
otherwise. Show that E(Xi ) = 1/6.
(c) Show that X1 and X2 are not independent.
(d) Calculate Cov(X1 , X2 ).
Chapter 6
Sampling distributions of statistics
6.1 Synopsis of chapter
This chapter considers the idea of sampling and the concept of a sampling distribution
for a statistic (such as a sample mean) which must be understood by all users of
statistics.
6.2 Learning outcomes
After completing this chapter, you should be able to:
demonstrate how sampling from a population results in a sampling distribution for
a statistic
prove and apply the results for the mean and variance of the sampling distribution
of the sample mean when a random sample is drawn with replacement
state the central limit theorem and recall when the limit is likely to provide a good
approximation to the distribution of the sample mean.
6.3 Introduction
Suppose we have a sample of n observations of a random variable X:
{X1 , X2 , . . . , Xn }.
We have already stated that in statistical inference each individual observation Xi is
regarded as a value of a random variable X, with some probability distribution (that is,
the population distribution).
In this chapter we discuss how we define and work with:
the joint distribution of the whole sample {X1 , X2 , . . . , Xn }, treated as a
multivariate random variable
distributions of univariate functions of {X1 , X2 , . . . , Xn } (statistics).
6.4 Random samples
Many of the results discussed here hold for many (or even all) probability distributions,
not just for some specific distributions.
It is then convenient to use generic notation.
We use f (x) to denote both the pdf of a continuous random variable, and the pf of
a discrete random variable.
The parameter(s) of a distribution are generally denoted as θ. For example, for the
Poisson distribution θ stands for λ, and for the normal distribution θ stands for
(µ, σ 2 ).
Parameters are often included in the notation: f (x; θ) denotes the pf/pdf of a
distribution with parameter(s) θ, and F (x; θ) is its cdf.
For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution
F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the
cdf F (x; θ)’, respectively.
The simplest assumptions about the joint distribution of the sample are as follows.
1. {X1 , X2 , . . . , Xn } are independent random variables.
2. {X1 , X2 , . . . , Xn } are identically distributed random variables. Each Xi has the
same distribution f (x; θ), with the same value of the parameter(s) θ.
The random variables {X1 , X2 , . . . , Xn } are then called:
independent and identically distributed (IID) random variables from the
distribution (population) f (x; θ)
a random sample of size n from the distribution (population) f (x; θ).
We will assume this most of the time from now. So you will see many examples and
questions which begin something like:
‘Let {X1 , . . . , Xn } be a random sample from a normal distribution with mean
µ and variance σ 2 . . . ’.
6.4.1 Joint distribution of a random sample
The joint probability distribution of the random variables in a random sample is an
important quantity in statistical inference. It is known as the likelihood function.
You will hear more about it in the chapter on point estimation.
For a random sample the joint distribution is easy to derive, because the Xi s are
independent.
The joint pf/pdf of a random sample is:
f(x1, x2, . . . , xn) = f(x1; θ) f(x2; θ) · · · f(xn; θ) = ∏_{i=1}^{n} f(xi; θ).
Other assumptions about random samples
Not all problems can be seen as IID random samples of a single random variable. There
are other possibilities, which you will see more of in the future.
IID samples from multivariate population distributions. For example, a sample of pairs (Xi, Yi), with the joint distribution ∏_{i=1}^{n} f(xi, yi).

Independent but not identically distributed observations. For example, observations (Xi, Yi) where Yi (the 'response variable') is treated as random, but Xi (the 'explanatory variable') is not. Hence the joint distribution of the Yi s is ∏_{i=1}^{n} fY|X(yi | xi; θ) where fY|X(y | x; θ) is the conditional distribution of Y given X.
This is the starting point of regression modelling (introduced later in the course).
Non-independent observations. For example, a time series {Y1 , Y2 , . . . , YT } where
i = 1, 2, . . . , T are successive time points. The joint distribution of the series is, in
general:
f (y1 ; θ) f (y2 | y1 ; θ) f (y3 | y1 , y2 ; θ) · · · f (yT | y1 , . . . , yT −1 ; θ).
Random samples and their observed values
Here we treat {X1 , X2 , . . . , Xn } as random variables. Therefore, we consider what values
{X1 , X2 , . . . , Xn } might have in different samples.
Once a real sample is actually observed, the values of {X1 , X2 , . . . , Xn } in that specific
sample are no longer random variables, but realised values of random variables, i.e.
known numbers.
Sometimes this distinction is emphasised in the notation by using:
X1 , X2 , . . . , Xn for the random variables
x1 , x2 , . . . , xn for the observed values.
6.5 Statistics and their sampling distributions
A statistic is a known function of the random variables {X1 , X2 , . . . , Xn } in a random
sample.
Example 6.1 All of the following are statistics:
the sample mean X̄ = ∑_{i=1}^{n} Xi /n

the sample variance S² = ∑_{i=1}^{n} (Xi − X̄)²/(n − 1) and standard deviation S = √S²

the sample median, quartiles, minimum, maximum etc.

quantities such as:

∑_{i=1}^{n} Xi²   and   X̄/(S/√n).
Here we focus on single (univariate) statistics. More generally, we could also consider
vectors of statistics, i.e. multivariate statistics.
6.5.1 Sampling distribution of a statistic
A (simple) random sample is modelled as a sequence of IID random variables. A
statistic is a function of these random variables, so it is also a random variable, with a
distribution of its own.
In other words, if we collected several random samples from the same population, the
values of a statistic would not be the same from one sample to the next, but would vary
according to some probability distribution.
The sampling distribution is the probability distribution of the values which the
statistic would have in a large number of samples collected (independently) from the
same population.
Example 6.2 Suppose we collect a random sample of size n = 20 from a normal
population (distribution) X ∼ N (5, 1).
Consider the following statistics:
sample mean X̄, sample variance S 2 , and maxX = max(X1 , X2 , . . . , Xn ).
Here is one such random sample (with values rounded to 2 decimal places):
6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09
4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58
For this random sample, the values of our statistics are:
x̄ = 4.94
s2 = 0.90
maxx = 6.58.
Here is another such random sample (with values rounded to 2 decimal places):
5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90
5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27
For this sample, the values of our statistics are:
x̄ = 5.22 (the first sample had x̄ = 4.94)
s2 = 0.80 (the first sample had s2 = 0.90)
maxx = 6.62 (the first sample had maxx = 6.58).
Activity 6.1 Suppose that {X1 , X2 , . . . , Xn } is a random sample from a continuous
distribution with probability density function fX (x) and cumulative distribution
function FX (x). Here we consider the sampling distribution of the statistic
Y = X(n) = max{X1 , X2 , . . . , Xn }, i.e. the largest value of Xi in the random sample,
for i = 1, . . . , n.
(a) Write down the formula for the cumulative distribution function FY (y) of Y , i.e.
for the probability that all observations in the sample are ≤ y.
(b) From the result in (a), derive the probability density function fY (y) of Y .
(c) The heights (in cm) of men aged over 16 in England are approximately
normally distributed with a mean of 174.9 and a standard deviation of 7.39.
What is the probability that in a random sample of 60 men from this
population at least one man is more than 1.92 metres tall?
Solution
(a) The probability that a single randomly-selected observation of X is at most y is
P (Xi ≤ y) = FX (y). Since the Xi s are independent, the probability that they
are all at most y is:
FY(y) = P(X1 ≤ y, X2 ≤ y, . . . , Xn ≤ y) = [FX(y)]ⁿ.

(b) The pdf is the first derivative of the cdf, so:

fY(y) = F′Y(y) = n [FX(y)]ⁿ⁻¹ fX(y)

since fX(x) = F′X(x).
(c) Here Xi ∼ N(174.9, (7.39)²). Therefore:

FX(192) = P(X ≤ 192) = P(Z ≤ (192 − 174.9)/7.39) ≈ P(Z ≤ 2.31)

where Z ∼ N(0, 1). We have that P(Z ≤ 2.31) = 1 − 0.01044 = 0.98956. Therefore, the probability we need is:

P(Y > 192) = 1 − P(Y ≤ 192) = 1 − [FX(192)]⁶⁰ = 1 − (0.98956)⁶⁰ = 0.4672.
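The same answer can be obtained numerically; the sketch below is an optional aside (assuming scipy is available) which evaluates [FX(192)]⁶⁰ without intermediate rounding.

```python
from scipy.stats import norm

n = 60
p_single = norm.cdf(192, loc=174.9, scale=7.39)   # P(one man's height <= 192 cm)

# P(at least one of the 60 exceeds 192) = 1 - [F_X(192)]^60
print(1 - p_single**n)                            # approximately 0.467
```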
How to derive a sampling distribution?
The sampling distribution of a statistic is the distribution of the values of the statistic
in (infinitely) many repeated samples. However, typically we only have one sample
which was actually observed. Therefore, the sampling distribution seems like an
essentially hypothetical concept.
Nevertheless, it is possible to derive the forms of sampling distributions of statistics
under different assumptions about the sampling schemes and population distribution
f (x; θ).
There are two main ways of doing this.
Exactly or approximately through mathematical derivation. This is the most
convenient way for subsequent use, but is not always easy.
With simulation, i.e. by using a computer to generate (artificial) random samples
from a population distribution of a known form.
Example 6.3 Consider again a random sample of size n = 20 from the population
X ∼ N (5, 1), and the statistics X̄, S 2 and maxX .
We first consider deriving the sampling distributions of these by approximation
through simulation.
Here a computer was used to draw 10,000 independent random samples of
n = 20 from N (5, 1), and the values of X̄, S 2 and maxX for each of these
random samples were recorded.
Figures 6.1, 6.2 and 6.3 show histograms of the statistics for these 10,000
random samples.
We now consider deriving the exact sampling distribution. Here this is possible. For
a random sample of size n from N (µ, σ 2 ) we have:
(a) X̄ ∼ N(µ, σ²/n)

(b) (n − 1)S²/σ² ∼ χ²(n − 1)

(c) the sampling distribution of Y = maxX has the following pdf:

fY(y) = n [FX(y)]ⁿ⁻¹ fX(y)
where FX (x) and fX (x) are the cdf and pdf of X ∼ N (µ, σ 2 ), respectively.
Curves of the densities of these distributions are also shown in Figures 6.1, 6.2 and
6.3.
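The simulation described in this example is easy to reproduce. The sketch below is optional (it assumes Python with numpy, and plotting of the histograms in Figures 6.1–6.3 is omitted); it draws 10,000 samples of size 20 from N(5, 1) and records the three statistics.

```python
import numpy as np

rng = np.random.default_rng(2)
reps, n = 10_000, 20

samples = rng.normal(loc=5, scale=1, size=(reps, n))   # 10,000 random samples of size 20

xbar = samples.mean(axis=1)                 # sample means
s2 = samples.var(axis=1, ddof=1)            # sample variances (divisor n - 1)
mx = samples.max(axis=1)                    # sample maxima

# The simulated sampling distribution of Xbar should match N(5, 1/20) closely.
print(xbar.mean(), xbar.var())              # approximately 5 and 0.05
print(s2.mean(), mx.mean())                 # E(S^2) = 1; the maxima average roughly 6.9
```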
Figure 6.1: Simulation-generated sampling distribution of X̄ to accompany Example 6.3.
6.6 Sample mean from a normal population
Consider one very common statistic, the sample mean:
X̄ = (1/n) ∑_{i=1}^{n} Xi = X1/n + X2/n + · · · + Xn/n.
What is the sampling distribution of X̄?
We know from Section 5.9.2 that for independent {X1 , . . . , Xn } from any distribution:
E(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai E(Xi)

and:

Var(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai² Var(Xi).
For a random sample, all Xi s are independent and E(Xi) = E(X) is the same for all of them, since the Xi s are identically distributed. X̄ = ∑_{i=1}^{n} Xi /n is of the form ∑_{i=1}^{n} ai Xi, with ai = 1/n for all i = 1, . . . , n.

Therefore:

E(X̄) = ∑_{i=1}^{n} (1/n) E(X) = n × (1/n) E(X) = E(X)

and:

Var(X̄) = ∑_{i=1}^{n} (1/n²) Var(X) = n × (1/n²) Var(X) = Var(X)/n.
Figure 6.2: Simulation-generated sampling distribution of S 2 to accompany Example 6.3.
Figure 6.3: Simulation-generated sampling distribution of maxX to accompany Example
6.3.
So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random
sample from any population distribution of X. What about the form of the sampling
distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, we do know that the sampling distribution of X̄ is also
normal.
Suppose that {X1 , . . . , Xn } is a random sample from a normal distribution with mean µ
and variance σ 2 , then:
X̄ ∼ N(µ, σ²/n).
For example, the pdf drawn on the histogram in Figure 6.1 is that of N (5, 1/20).
We have E(X̄) = E(X) = µ.
In an individual sample, x̄ is not usually equal to µ, the expected value of the
population.
However, over repeated samples the values of X̄ are centred at µ.
We also have Var(X̄) = Var(X)/n = σ²/n, and hence also sd(X̄) = σ/√n.
The variation of the values of X̄ in different samples (the sampling variance) is
large when the population variance of X is large.
More interestingly, the sampling variance gets smaller when the sample size n
increases.
In other words, when n is large the distribution of X̄ is more tightly concentrated
around µ than when n is small.
Figure 6.4 shows sampling distributions of X̄ from N (5, 1) for different n.
Example 6.4 Suppose that the heights (in cm) of men (aged over 16) in a
population follow a normal distribution with some unknown mean µ and a known
standard deviation of 7.39.
We plan to select a random sample of n men from the population, and measure their
heights. How large should n be so that there is a probability of at least 0.95 that the
sample mean X̄ will be within 1 cm of the population mean µ?
Here X ∼ N(µ, (7.39)²), so X̄ ∼ N(µ, (7.39/√n)²). What we need is the smallest n
such that:
P (|X̄ − µ| ≤ 1) ≥ 0.95.
Figure 6.4: Sampling distributions of X̄ from N (5, 1) for different n.
So:

P(|X̄ − µ| ≤ 1) ≥ 0.95
P(−1 ≤ X̄ − µ ≤ 1) ≥ 0.95
P(−1/(7.39/√n) ≤ (X̄ − µ)/(7.39/√n) ≤ 1/(7.39/√n)) ≥ 0.95
P(−√n/7.39 ≤ Z ≤ √n/7.39) ≥ 0.95
P(Z > √n/7.39) < 0.05/2 = 0.025

where Z ∼ N(0, 1). From Table 4 of the New Cambridge Statistical Tables, we see that the smallest z which satisfies P(Z > z) < 0.025 is z = 1.97. Therefore:

√n/7.39 ≥ 1.97 ⇔ n ≥ (7.39 × 1.97)² = 211.9.

Therefore, n should be at least 212.
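A brute-force search reproduces this calculation (optional; assumes scipy). Note that it returns n = 210 because it uses the exact 97.5% point 1.96 of N(0, 1) rather than the table-rounded value 1.97 used above.

```python
from math import sqrt
from scipy.stats import norm

sigma, tol, conf = 7.39, 1.0, 0.95

n = 1
while 2 * norm.cdf(tol / (sigma / sqrt(n))) - 1 < conf:   # P(|Xbar - mu| <= 1)
    n += 1
print(n)    # 210 with the exact z = 1.96; using z = 1.97 as above gives 212
```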
Activity 6.2 Suppose that the heights of students are normally distributed with a
mean of 68.5 inches and a standard deviation of 2.7 inches. If 200 random samples of
size 25 are drawn from this population with means recorded to the nearest 0.1 inch,
find:
(a) the expected mean and standard deviation of the sampling distribution of the
mean
(b) the expected number of recorded sample means which fall between 67.9 and
69.2 inclusive
(c) the expected number of recorded sample means falling below 67.0.
Solution
(a) The sampling distribution of the mean of 25 observations has the same mean as the population, which is 68.5 inches. The standard deviation (standard error) of the sample mean is 2.7/√25 = 0.54.
(b) Notice that the samples are random, so we cannot be sure exactly how many
will have means between 67.9 and 69.2 inches. We can work out the probability
that the sample mean will lie in this interval using the sampling distribution:
X̄ ∼ N (68.5, (0.54)2 ).
We need to make a continuity correction, to account for the fact that the
recorded means are rounded to the nearest 0.1 inch. For example, the probability
that the recorded mean is ≥ 67.9 inches is the same as the probability that the
sample mean is > 67.85. Therefore, the probability we want is:
P(67.85 < X̄ < 69.25) = P((67.85 − 68.5)/0.54 < Z < (69.25 − 68.5)/0.54)
= P(−1.20 < Z < 1.39)
= Φ(1.39) − Φ(−1.20)
= 0.9177 − 0.1151
= 0.8026.
Since there are 200 independent random samples drawn, we can now think of
each as a single trial. The recorded mean lies between 67.9 and 69.2 with
probability 0.8026 at each trial. We are dealing with a binomial distribution
with n = 200 trials and probability of success π = 0.8026. The expected number
of successes is:
n π = 200 × 0.8026 = 160.52.
(c) The probability that the recorded mean is < 67.0 inches is:
P(X̄ < 66.95) = P(Z < (66.95 − 68.5)/0.54) = P(Z < −2.87) = Φ(−2.87) = 0.00205
so the expected number of recorded means below 67.0 out of a sample of 200 is:
200 × 0.00205 = 0.41.
Activity 6.3 Suppose that we plan to take a random sample of size n from a
normal distribution with mean µ and standard deviation σ = 2.
(a) Suppose µ = 4 and n = 20.
i. What is the probability that the mean X̄ of the sample is greater than 5?
ii. What is the probability that X̄ is smaller than 3?
iii. What is P (|X̄ − µ| ≤ 1) in this case?
(b) How large should n be in order that P (|X̄ − µ| ≤ 0.5) ≥ 0.95 for every possible
value of µ?
(c) It is claimed that the true value of µ is 5 in a population. A random sample of
size n = 100 is collected from this population, and the mean for this sample is
x̄ = 5.8. Based on the result in (b), what would you conclude from this value of
X̄?
Solution
(a) Let {X1 , . . . , Xn } denote the random sample. We know that the sampling
distribution of X̄ is N(µ, σ²/n), here N(4, 2²/20) = N(4, 0.2).

i. The probability we need is:

P(X̄ > 5) = P((X̄ − 4)/√0.2 > (5 − 4)/√0.2) = P(Z > 2.24) = 0.0126
where, as usual, Z ∼ N (0, 1).
ii. P (X̄ < 3) is obtained similarly. Note that this leads to
P (Z < −2.24) = 0.0126, which is equal to the P (X̄ > 5) = P (Z > 2.24)
result obtained above. This is because 5 is one unit above the mean µ = 4,
and 3 is one unit below the mean, and because the normal distribution is
symmetric around its mean.
iii. One way of expressing this is:
P (X̄ − µ > 1) = P (X̄ − µ < −1) = 0.0126
for µ = 4. This also shows that:
P (X̄ − µ > 1) + P (X̄ − µ < −1) = P (|X̄ − µ| > 1) = 2 × 0.0126 = 0.0252
and hence:
P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748.
In other words, the probability is 0.9748 that the sample mean is within
one unit of the true population mean, µ = 4.
(b) We can use the same ideas as in (a). Since X̄ ∼ N (µ, 4/n) we have:
P(|X̄ − µ| ≤ 0.5) = 1 − 2 × P(X̄ − µ > 0.5)
= 1 − 2 × P((X̄ − µ)/√(4/n) > 0.5/√(4/n))
= 1 − 2 × P(Z > 0.25√n)
≥ 0.95
which holds if:
P(Z > 0.25√n) ≤ 0.05/2 = 0.025.

Using Table 4 of the New Cambridge Statistical Tables, we see that this is true when 0.25√n ≥ 1.96, i.e. when n ≥ (1.96/0.25)² = 61.5. Rounding up to the nearest integer, we get n ≥ 62. The sample size should be at least 62 for us to be
95% confident that the sample mean will be within 0.5 units of the true mean, µ.
(c) Here n > 62, yet x̄ is further than 0.5 units from the claimed mean of µ = 5.
Based on the result in (b), this would be quite unlikely if µ is really 5. One
explanation of this apparent contradiction is that µ is not really equal to 5.
This kind of reasoning will be the basis of statistical hypothesis testing, which
will be discussed later in the course.
6.7 The central limit theorem
We have discussed the very convenient result that if a random sample comes from a
normally-distributed population, the sampling distribution of X̄ is also normal. How
about sampling distributions of X̄ from other populations?
For this, we can use a remarkable mathematical result, the central limit theorem
(CLT). In essence, the CLT states that the normal sampling distribution of X̄ which
holds exactly for random samples from a normal distribution, also holds approximately
for random samples from nearly any distribution.
The CLT applies to ‘nearly any’ distribution because it requires that the variance of the
population distribution is finite. If it is not (such as for some Pareto distributions,
introduced in Chapter 3), the CLT does not hold. However, such distributions are not
common.
Suppose that {X1 , X2 , . . . , Xn } is a random sample from a population distribution
which has mean E(Xi ) = µ < ∞ and variance Var(Xi ) = σ 2 < ∞, that is with a finite
mean and finite variance. Let X̄n denote the sample mean calculated from a random
sample of size n, then:
lim_{n→∞} P((X̄n − µ)/(σ/√n) ≤ z) = Φ(z)
for any z, where Φ(z) denotes the cdf of the standard normal distribution.
The ‘ lim ’ indicates that this is an asymptotic result, i.e. one which holds increasingly
n→∞
well as n increases, and exactly when the sample size is infinite.
In less formal language, the CLT says that for a random sample from nearly any
distribution with mean µ and variance σ 2 then:
X̄ ∼ N(µ, σ²/n)
approximately, when n is sufficiently large. We can then say that X̄ is asymptotically
normally distributed with mean µ and variance σ 2 /n.
The wide reach of the CLT
It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from random (IID) samples. However, this is not really true, for two
main reasons.
There are more general versions of the CLT which do not require the observations
Xi to be IID.
Even the basic version applies very widely, when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are
random variables in the sample, we can also apply the CLT to:
(1/n) ∑_{i=1}^{n} log(Xi)   or   (1/n) ∑_{i=1}^{n} Xi Yi.
Therefore, the CLT can also be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single random variable in an IID sample.
You may get to do this in future courses.
How large is ‘large n’?
The larger the sample size n, the better the normal approximation provided by the CLT
is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the
approximation to be ‘accurate enough’. This also depends on the population
distribution of Xi . For example:
for symmetric distributions, even small n is enough
for very skewed distributions, larger n is required.
For many distributions, n > 30 is sufficient for the approximation to be reasonably
accurate.
Example 6.5 In the first case, we simulate random samples of sizes:
n = 1, 5, 10, 30, 100 and 1000
from the Exponential(0.25) distribution (for which µ = 4 and σ 2 = 16). This is
clearly a skewed distribution, as shown by the histogram for n = 1 in Figure 6.5.
10,000 independent random samples of each size were generated. Histograms of the
values of X̄ in these random samples are shown in Figure 6.5. Each plot also shows
the pdf of the approximating normal distribution, N (4, 16/n). The normal
approximation is reasonably good already for n = 30, very good for n = 100, and
practically perfect for n = 1000.
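The figure can be reproduced with a few lines of code. The sketch below is optional (it assumes numpy; histogram plotting is omitted) and simulates the sample means for comparison with the N(4, 16/n) approximation. Note that numpy parameterises the exponential distribution by its mean, so Exponential(0.25) corresponds to scale = 4.

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 10_000

for n in (1, 5, 10, 30, 100, 1000):
    xbar = rng.exponential(scale=4, size=(reps, n)).mean(axis=1)
    # CLT approximation: mean 4 and variance 16/n.
    print(n, round(xbar.mean(), 3), round(xbar.var(), 4), 16 / n)
```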
Figure 6.5: Sampling distributions of X̄ for various n when sampling from the
Exponential(0.25) distribution.
Example 6.6 In the second case, we simulate 10,000 independent random samples
of sizes:
n = 1, 10, 30, 50, 100 and 1000
from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16).
Here the distribution of Xi itself is not even continuous, and has only two possible
values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very
well-approximated by the normal distribution, when n is large enough.
Note that since here Xi = 1 or Xi = 0 for all i, X̄ = ∑_{i=1}^{n} Xi /n = m/n, where m is the number of observations for which Xi = 1. In other words, X̄ is the sample proportion of the value X = 1.
The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50, as shown by the histograms in Figure 6.6.
Activity 6.4 A random sample of 25 audits is to be taken from a company’s total
audits, and the average value of these audits is to be calculated.
(a) Explain what is meant by the sampling distribution of this average and discuss
its relationship to the population mean.
(b) Is it reasonable to assume that this sampling distribution is normal?
(c) If the population of all audits has a mean of £54 and a standard deviation of
£10, find the probability that:
Figure 6.6: Sampling distributions of X̄ for various n when sampling from the
Bernoulli(0.2) distribution.
i. the sample mean will be greater than £60
ii. the sample mean will be within 5% of the population mean.
Solution
(a) The sample average is composed of 25 randomly sampled data which are
subject to sampling variability, hence the average is also subject to this
variability. Its sampling distribution describes its probability properties. If a
large number of such averages were independently sampled, then their
histogram would be the sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
CLT, although the sample size is rather small. If n = 25 and µ = 54 and σ = 10,
then the CLT says that:
X̄ ∼ N(µ, σ²/n) = N(54, 100/25).
(c) i. We have:

P(X̄ > 60) = P(Z > (60 − 54)/√(100/25)) = P(Z > 3) = 0.0013.
ii. We are asked for:
P(0.95 × 54 < X̄ < 1.05 × 54) = P(−0.05 × 54/2 < Z < 0.05 × 54/2)
= P(−1.35 < Z < 1.35)
= 0.8230.
Activity 6.5 A manufacturer of objects packages them in boxes of 200. It is known
that, on average, the objects weigh 1 kg with a standard deviation of 0.03 kg. The
manufacturer is interested in calculating:
P (200 objects weigh more than 200.5 kg)
which would help detect whether too many objects are being put in a box. Explain
how you would calculate the (approximate?) value of this probability. Mention any
relevant theorems or assumptions needed.
Solution
Let Xi denote the weight of the ith object, for i = 1, . . . , 200. The Xi s are assumed
to be independent and identically distributed with E(Xi ) = 1 and Var(Xi ) = (0.03)2 .
We require:
P(∑_{i=1}^{200} Xi > 200.5) = P(∑_{i=1}^{200} Xi /200 > 1.0025) = P(X̄ > 1.0025).
If the weights are not normally distributed, then by the central limit theorem:
P(X̄ > 1.0025) ≈ P(Z > (1.0025 − 1)/(0.03/√200)) = P(Z > 1.18) = 0.1190.
If the weights are normally distributed, then this is the exact (rather than
approximate) probability.
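As an optional numerical check (assuming scipy), the CLT approximation for the total weight can be evaluated directly:

```python
from math import sqrt
from scipy.stats import norm

n, mu, sigma = 200, 1.0, 0.03

# By the CLT the total weight is approximately N(n*mu, n*sigma^2).
print(norm.sf(200.5, loc=n * mu, scale=sigma * sqrt(n)))   # approximately 0.119
```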
Activity 6.6
(a) Suppose {X1, X2, X3, X4} is a random sample of size n = 4 from the Bernoulli(0.2) distribution. What is the distribution of ∑_{i=1}^{n} Xi in this case?

(b) Write down the sampling distribution of X̄ = ∑_{i=1}^{n} Xi /n for the sample considered in (a). In other words, write down the possible values of X̄ and their probabilities.

Hint: what are the possible values of ∑_{i=1}^{n} Xi, and their probabilities?
(c) Suppose we have a random sample of size n = 100 from the Bernoulli(0.2) distribution. What is the approximate sampling distribution of X̄ suggested by the central limit theorem in this case? Use this distribution to calculate an approximate value for the probability that X̄ > 0.3. (The true value of this probability is 0.0061.)
Solution
(a) The sum of n independent Bernoulli random variables, each with success probability π, is Bin(n, π). Here n = 4 and π = 0.2, so ∑_{i=1}^{4} Xi ∼ Bin(4, 0.2).
(b) The possible values of ∑i Xi are 0, 1, 2, 3 and 4, and their probabilities can be calculated from the binomial distribution. For example:

P(∑i Xi = 1) = C(4, 1) (0.2)¹ (0.8)³ = 4 × 0.2 × 0.512 = 0.4096.

The other probabilities are shown in the table below. Since X̄ = ∑i Xi /4, the possible values of X̄ are 0, 0.25, 0.5, 0.75 and 1. Their probabilities are the same as those of the corresponding values of ∑i Xi. For example, P(X̄ = 0.25) = P(∑i Xi = 1) = 0.4096. The values and their probabilities are:

X̄ = x̄        0.0      0.25     0.5      0.75     1.0
P(X̄ = x̄)     0.4096   0.4096   0.1536   0.0256   0.0016
(c) For Xi ∼ Bernoulli(π), E(Xi ) = π and Var(Xi ) = π (1 − π). Therefore, the
approximate normal sampling distribution of X̄, derived from the central limit
theorem, is N (π, π (1 − π)/n). Here this is:
N(0.2, 0.2 × 0.8/100) = N(0.2, 0.0016) = N(0.2, (0.04)²).

Therefore, the probability requested by the question is approximately:

P(X̄ > 0.3) = P((X̄ − 0.2)/0.04 > (0.3 − 0.2)/0.04) = P(Z > 2.5) = 0.0062.
This is very close to the probability obtained from the exact sampling
distribution, which is about 0.0061.
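The exact value quoted in the question and the CLT approximation can both be checked in a couple of lines (optional; assumes scipy):

```python
from math import sqrt
from scipy.stats import binom, norm

n, pi = 100, 0.2

exact = binom.sf(30, n, pi)       # P(sum of Xi > 30), i.e. P(Xbar > 0.3); about 0.0061
approx = norm.sf(0.3, loc=pi, scale=sqrt(pi * (1 - pi) / n))   # CLT value, about 0.0062
print(exact, approx)
```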
Activity 6.7 A country is about to hold a referendum about leaving the European
Union. A survey of a random sample of adult citizens of the country is conducted. In
the sample, n respondents say that they plan to vote in the referendum. These n
respondents are then asked whether they plan to vote ‘Yes’ or ‘No’. Define X = 1 if
such a person plans to vote ‘Yes’, and X = 0 if such a person plans to vote ‘No’.
Suppose that in the whole population 49% of those people who plan to vote are
currently planning to vote Yes, and hence the referendum result would show a (very
small) majority opposing leaving the European Union.
(a) Let X̄ = ∑_{i=1}^{n} Xi /n denote the proportion of the n voters in the sample who plan to vote Yes. What is the central limit theorem approximation of the sampling distribution of X̄ here?
(b) If there are n = 50 likely voters in the sample, what is the probability that
X̄ > 0.5? (Such an opinion poll would suggest a majority supporting leaving the
European Union in the referendum.)
(c) How large should n be so that there is less than a 1% chance that X̄ > 0.5 in
the random sample? (This means less than a 1% chance of the opinion poll
incorrectly predicting a majority supporting leaving the European Union in the
referendum.)
Solution
(a) Here the individual responses, the Xi s, follow a Bernoulli distribution with
probability parameter π = 0.49. The mean of this distribution is 0.49, and the
variance is 0.49 × 0.51. Therefore, the central limit theorem (CLT)
approximation of the sampling distribution of X̄ is:
X̄ ∼ N(0.49, 0.49 × 0.51/n) = N(0.49, (0.4999/√n)²).
(b) When n = 50, the CLT approximation from (a) is X̄ ∼ N(0.49, (0.0707)²). With this, we get:

P(X̄ > 0.5) = P((X̄ − 0.49)/0.0707 > (0.5 − 0.49)/0.0707) = P(Z > 0.14) = 0.4443.
(c) Here we need the smallest integer n such that:
P(X̄ > 0.5) = P((X̄ − 0.49)/(0.4999/√n) > (0.5 − 0.49)/(0.4999/√n)) = P(Z > 0.0200√n) < 0.01.
Using Table 4 of the New Cambridge Statistical Tables, the smallest z such that
P (Z > z) < 0.01 is z = 2.33. Therefore, we need:
0.0200√n ≥ 2.33   ⇔   n ≥ (2.33/0.0200)² = 13,572.25
which means that we need at least n = 13,573 likely voters in the sample –
which is a very large sample size! Of course, the reason for this is that the
population of likely voters is almost equally split between those supporting
leaving the European Union, and those opposing. Hence such a large sample
size is necessary to be very confident of obtaining a representative sample.
Activity 6.8 Suppose {X1 , X2 , . . . , Xn } is a random sample from the Poisson(λ)
distribution.
(a) What is the sampling distribution of ∑_{i=1}^{n} Xi?

(b) Write down the sampling distribution of X̄ = ∑_{i=1}^{n} Xi /n. In other words, write down the possible values of X̄ and their probabilities. (Assume n is not large.)

Hint: What are the possible values of ∑_{i=1}^{n} Xi and their respective probabilities?
(c) What are the mean and variance of the sampling distribution of X̄ when λ = 5
and n = 100?
Solution
(a) The sum of n independent Poisson(λ) random variables follows the Poisson(nλ)
distribution.
(b) Since ∑i Xi has possible values 0, 1, 2, . . ., the possible values of X̄ = ∑i Xi /n are 0/n, 1/n, 2/n, . . .. The probabilities of these values are determined by the probabilities of the values of ∑i Xi, which are obtained from the Poisson(nλ) distribution. Therefore, the probability function of X̄ = ∑i Xi /n is:

p(x̄) = e^(−nλ) (nλ)^(nx̄)/(nx̄)!   for x̄ = 0, 1/n, 2/n, . . .   and   p(x̄) = 0 otherwise.
(c) For Xi ∼ Poisson(λ) we have E(Xi ) = Var(Xi ) = λ, so the general results for X̄
give E(X̄) = λ and Var(X̄) = λ/n. When λ = 5 and n = 100, E(X̄) = 5 and
Var(X̄) = 5/100 = 0.05.
Activity 6.9 Suppose that a random sample of size n is to be taken from a
non-normal distribution for which µ = 4 and σ = 2. Use the central limit theorem to
determine, approximately, the smallest value of n for which:
P (|X̄n − µ| < 0.2) ≥ 0.95
where X̄n denotes the sample mean, which depends on n.
Solution
By the central limit theorem we have:

X̄ ∼ N(µ, σ²/n)

approximately, as n → ∞. Hence:

√n (X̄n − µ)/σ = √n (X̄n − 4)/2 → Z ∼ N(0, 1).

Therefore:

P(|X̄n − µ| < 0.2) ≈ P(|Z| < 0.1√n) = 2 × Φ(0.1√n) − 1.

However, 2 × Φ(0.1√n) − 1 ≥ 0.95 if and only if:

Φ(0.1√n) ≥ (1 + 0.95)/2 = 0.975

which is satisfied if:

0.1√n ≥ 1.96   ⇒   n ≥ 384.16.

Hence the smallest possible value of n is 385.
6.8 Some common sampling distributions
In the remaining chapters, we will make use of results like the following.
Suppose that {X1 , . . . , Xn } and {Y1 , . . . , Ym } are two independent random samples from
N (µ, σ 2 ), then:
(n − 1)SX²/σ² ∼ χ²(n − 1)   and   (m − 1)SY²/σ² ∼ χ²(m − 1)

√((n + m − 2)/((n − 1)SX² + (m − 1)SY²)) × (X̄ − Ȳ)/√(1/n + 1/m) ∼ t(n + m − 2)

and:

SX²/SY² ∼ F(n − 1, m − 1).
Here ‘χ2 ’, ‘t’ and ‘F ’ refer to three new families of probability distributions:
the χ2 (‘chi-squared’) distribution
the t distribution
the F distribution.
These are not often used as distributions of individual variables. Instead, they are used
as sampling distributions for various statistics. Each of them arises from the normal
distribution in a particular way.
We will now briefly introduce their main properties. This is in preparation for statistical
inference, where the uses of these distributions will be discussed at length.
6.8.1 The χ² distribution

Definition of the χ² distribution
Let Z1 , Z2 , . . . , Zk be independent N (0, 1) random variables. If:
X = Z1² + Z2² + · · · + Zk² = ∑_{i=1}^{k} Zi²

the distribution of X is the χ² distribution with k degrees of freedom. This is denoted by X ∼ χ²(k) or X ∼ χ²k.

The χ²(k) distribution is a continuous distribution, which can take values x ≥ 0. Its mean and variance are:

E(X) = k   and   Var(X) = 2k.
For reference, the probability density function of X ∼ χ²(k) is:

f(x) = (2^(k/2) Γ(k/2))⁻¹ x^(k/2 − 1) e^(−x/2)   for x > 0, and f(x) = 0 otherwise

where:

Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx

is the gamma function, which is defined for all α > 0. (Note the formula of the pdf of X ∼ χ²(k) is not examinable.)
The shape of the pdf depends on the degrees of freedom k, as illustrated in Figure 6.7.
In most applications of the χ2 distribution the appropriate value of k is known, in which
case it does not need to be estimated from data.
If X1, X2, . . . , Xm are independent random variables and Xi ∼ χ²(ki), then their sum is also χ²-distributed where the individual degrees of freedom are added, such that:

X1 + X2 + · · · + Xm ∼ χ²(k1 + k2 + · · · + km).
The uses of the χ2 distribution will be discussed later. One example though is if
{X1 , X2 , . . . , Xn } is a random sample from the population N (µ, σ 2 ), and S 2 is the
sample variance, then:
(n − 1)S²/σ² ∼ χ²(n − 1).
This result is used to derive basic tools of statistical inference for both µ and σ 2 for the
normal distribution.
Figure 6.7: χ2 pdfs for various degrees of freedom.
Tables of the χ2 distribution
In the examination, you will need a table of some probabilities for the χ2 distribution.
Table 8 of the New Cambridge Statistical Tables shows the following information.
The rows correspond to different degrees of freedom k (denoted as ν in Table 8).
The table shows values of k up to 100.
The columns correspond to the right-tail probability as a percentage, that is
P (X > x) = P/100, where X ∼ χ2k , for different values of P , ranging from 50 to
0.05 (that is, right-tail probabilities ranging from 0.5 to 0.0005).
The numbers in the table are values of z such that P (X > z) = P/100 for the k
and P in that row and column, respectively.
Example 6.7 Consider the ‘ν = 5’ row, the 9.236 in the ‘P = 10’ column and the
11.07 in the 'P = 5' column. These mean, for X ∼ χ²(5), that:
P (X > 9.236) = 0.10
[and hence P (X ≤ 9.236) = 0.90].
P (X > 11.07) = 0.05
[and hence P (X ≤ 11.07) = 0.95].
These also provide bounds for probabilities of other values. For example, since 10.00
is between 9.236 and 11.07, we can conclude that:
0.05 < P (X > 10.00) < 0.10.
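If you have access to software, the table values used in this example can be verified directly (optional aside, assuming scipy):

```python
from scipy.stats import chi2

# Chi-squared with 5 degrees of freedom, matching Table 8.
print(chi2.ppf(0.90, df=5))   # about 9.236, so P(X > 9.236) = 0.10
print(chi2.ppf(0.95, df=5))   # about 11.07, so P(X > 11.07) = 0.05
print(chi2.sf(10.0, df=5))    # about 0.075, indeed between 0.05 and 0.10
```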
Activity 6.10 If Z is a random variable with a standard normal distribution, what
is P (Z 2 < 3.841)?
Solution
We can compute the probability in two different ways. Working with the standard
normal distribution, we have:
P(Z² < 3.841) = P(−√3.841 < Z < √3.841)
= P(−1.96 < Z < 1.96)
= Φ(1.96) − Φ(−1.96)
= 0.9750 − (1 − 0.9750)
= 0.95.
Alternatively, we can use the fact that Z² follows a χ²(1) distribution. Using Table 8 of the New Cambridge Statistical Tables we can see that 3.841 is the 5% right-tail value for this distribution, and so P(Z² < 3.841) = 0.95, as before.
Activity 6.11 Suppose that X1 and X2 are independent N (0, 4) random variables.
Compute P(X1² < 36.84 − X2²).
Solution
Rearrange the inequality to obtain:

P(X1² < 36.84 − X2²) = P(X1² + X2² < 36.84)
= P((X1² + X2²)/4 < 36.84/4)
= P((X1/2)² + (X2/2)² < 9.21).
Since X1/2 and X2/2 are independent N(0, 1) random variables, the sum of their squares will follow a χ²(2) distribution. Using Table 8 of the New Cambridge Statistical
Tables, we see that 9.210 is the 1% right-tail value, so the probability we are looking
for is 0.99.
Activity 6.12 Suppose A, B and C are independent chi-squared random variables
with 5, 7 and 10 degrees of freedom, respectively. Calculate:
(a) P (B < 12)
(b) P (A + B + C < 14)
(c) P(A³ + B³ + C³ < 0).
In this question, you should use the closest value given in the available statistical
tables. Further approximation is not required.
Solution
(a) P(B < 12) ≈ 0.9, directly from Table 8 of the New Cambridge Statistical Tables, where B ∼ χ²(7).

(b) A + B + C ∼ χ²(5 + 7 + 10) = χ²(22), so P(A + B + C < 14) is the probability that such a random variable is less than 14, which is approximately 0.1 from Table 8.

(c) A chi-squared random variable only assumes non-negative values. Hence each of A, B and C is non-negative, so A³ + B³ + C³ ≥ 0, and:

P(A³ + B³ + C³ < 0) = 0.
6.8.2 (Student's) t distribution
Definition of Student’s t distribution
Suppose Z ∼ N(0, 1), X ∼ χ²(k), and Z and X are independent. The distribution of the random variable:

T = Z/√(X/k)
is the t distribution with k degrees of freedom. This is denoted T ∼ tk or
T ∼ t(k). The distribution is also known as ‘Student’s t distribution’.
The t(k) distribution is continuous with the pdf:

f(x) = [Γ((k + 1)/2)/(√(kπ) Γ(k/2))] (1 + x²/k)^(−(k+1)/2)
for all −∞ < x < ∞. Examples of f (x) for different k are shown in Figure 6.8. (Note
the formula of the pdf of tk is not examinable.)
Figure 6.8: Student’s t pdfs for various degrees of freedom.
From Figure 6.8, we see the following.
The distribution is symmetric around 0.
As k → ∞, the tk distribution tends to the standard normal distribution, so tk with
large k is very similar to N (0, 1).
For any finite value of k, the tk distribution has heavier tails than the standard
normal distribution, i.e. tk places more probability on values far from 0 than
N (0, 1) does.
For T ∼ t(k), the mean and variance of the distribution are:

E(T) = 0   for k > 1

and:

Var(T) = k/(k − 2)   for k > 2.

This means that for t(1) neither E(T) nor Var(T) exist, and for t(2), Var(T) does not exist.
Tables of the t distribution
In the examination, you will need a table of some probabilities for the t distribution.
Table 10 of the New Cambridge Statistical Tables shows the following information.
The rows correspond to different degrees of freedom k (denoted as ν in Table 10).
The table shows values of k up to 120, and then ‘∞’, which is N (0, 1).
• If you need a tk distribution for which k is not in the table, use the nearest
value or use interpolation.
The columns correspond to the right-tail probability P (T > z) = P/100, where
T ∼ tk , for various P ranging from 40 to 0.05.
The numbers in the table are values of t such that P (T > t) = P/100 for the k and
P in that row and column.
Example 6.8 Consider the ‘ν = 4’ row, and the ‘P = 5’ column. This means,
where T ∼ t4 , that:
P (T > 2.132) = 0.05
[and hence P (T ≤ 2.132) = 0.95].
The table also provides bounds for other probabilities. For example, the number in
the ‘P = 2.5’ column is 2.776, so P (T > 2.776) = 0.025. Since 2.132 < 2.5 < 2.776,
we know that 0.025 < P (T > 2.5) < 0.05.
Results for left-tail probabilities P (T < z) = P/100, where T ∼ tk , can also be
obtained, because the t distribution is symmetric around 0. This means that
P (T < t) = P (T > −t). Using T ∼ t4 , for example:
P (T < −2.132) = P (T > 2.132) = 0.05
and P (T < −2.5) < 0.05 [since P (T > 2.5) < 0.05].
This is the same trick that we used for the standard normal distribution.
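Again, the table look-ups can be checked numerically if software is available (optional aside, assuming scipy):

```python
from scipy.stats import t

print(t.ppf(0.95, df=4))      # about 2.132, so P(T > 2.132) = 0.05 for T ~ t(4)
print(t.sf(2.5, df=4))        # about 0.033, between 0.025 and 0.05 as argued above
print(t.cdf(-2.132, df=4))    # 0.05 again, using the symmetry of the t distribution
```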
Activity 6.13 The independent random variables X1 , X2 and X3 are each normally
distributed with a mean of 0 and a variance of 4. Find:
(a) P (X1 > X2 + X3 )
(b) P(X1 > 5(X2² + X3²)^(1/2)).
Solution
(a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, hence:
X1 − X2 − X3 ∼ N (0, 12).
So:
P (X1 > X2 + X3 ) = P (X1 − X2 − X3 > 0) = P (Z > 0) = 0.5
using Table 3 of the New Cambridge Statistical Tables.
(b) We have:

P(X1 > 5(X2² + X3²)^(1/2)) = P(X1/2 > 5 ((X2/2)² + (X3/2)²)^(1/2))
= P((X1/2)/√(((X2/2)² + (X3/2)²)/2) > 5√2)

i.e. P(T > 5√2) = P(T > 7.07), where T ∼ t(2) since X1/2 ∼ N(0, 1) and (X2/2)² + (X3/2)² ∼ χ²(2). From Table 10 of the New Cambridge Statistical Tables, this is approximately 0.01.
Activity 6.14 The independent random variables X1 , X2 , X3 and X4 are each
normally distributed with a mean of 0 and a variance of 4. Using statistical tables,
derive values for k in each of the following cases:
(a) P(3X1 + 4X2 > 5) = k

(b) P(X1 > k √(X3² + X4²)) = 0.025

(c) P(X1² + X2² + X3² < k) = 0.9
Solution
(a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, 4, hence 3X1 ∼ N (0, 36) and
4X2 ∼ N (0, 64). Therefore:
(3X1 + 4X2)/10 = Z ∼ N(0, 1).
So, P (3X1 + 4X2 > 5) = k = P (Z > 0.5) = 0.3085, using Table 4 of the New
Cambridge Statistical Tables.
(b) We have Xi/2 ∼ N(0, 1), for i = 1, 2, 3, 4, hence (X3² + X4²)/4 ∼ χ²(2). So:

P(X1 > k √(X3² + X4²)) = 0.025 = P(T > k√2)

where T ∼ t(2) and hence k√2 = 4.303, so k = 3.04268, using Table 10 of the New Cambridge Statistical Tables.
(c) We have (X1² + X2² + X3²)/4 ∼ χ²(3), so:

P(X1² + X2² + X3² < k) = 0.9 = P(X < k/4)

where X ∼ χ²(3). Therefore, k/4 = 6.251 using Table 8 of the New Cambridge Statistical Tables. Hence k = 25.004.
Activity 6.15 Suppose Xi ∼ N (0, 4), for i = 1, 2, 3, 4. Assume all these random
variables are independent. Derive the value of k in each of the following.
(a) P (X1 + 4X2 > 5) = k.
(b) P(X1² + X2² + X3² + X4² < k) = 0.99.

(c) P(X1 < k √(X2² + X3²)) = 0.01.
Solution
(a) Since X1 + 4X2 ∼ N (0, 68), then:
P(X1 + 4X2 > 5) = P((X1 + 4X2)/√68 > 5/√68) = P(Z > 0.61) = 0.2709
where Z ∼ N (0, 1).
(b) Xi²/4 ∼ χ²(1) for i = 1, 2, 3, 4, hence (X1² + X2² + X3² + X4²)/4 ∼ χ²(4), so:

P(X1² + X2² + X3² + X4² < k) = P(X < k/4) = 0.99

where X ∼ χ²(4). Hence k/4 = 13.277, so k = 53.108.
(c) X1/√4 ∼ N(0, 1) and (X2² + X3²)/4 ∼ χ²(2), hence:

(X1/√4)/√(((X2² + X3²)/4)/2) = √2 X1/√(X2² + X3²) ∼ t(2).

Therefore:

P(T < √2 k) = 0.01

where T ∼ t(2). Hence √2 k = −6.965, so k = −4.925.
6.8.3 The F distribution

Definition of the F distribution
Let U and V be two independent random variables, where U ∼ χ²(p) and V ∼ χ²(k). The distribution of:

F = (U/p)/(V/k)
is the F distribution with degrees of freedom (p, k), denoted F ∼ Fp, k or
F ∼ F (p, k).
The F distribution is a continuous distribution, with non-zero probabilities for x > 0.
The general shape of its pdf is shown in Figure 6.9.
Figure 6.9: F pdfs for various degrees of freedom.
For F ∼ F(p, k), E(F) = k/(k − 2), for k > 2. If F ∼ F(p, k), then 1/F ∼ F(k, p). If T ∼ t(k), then T² ∼ F(1, k).
Tables of F distributions will be needed for some purposes. They will be available in the
examination. We will postpone practice with them until later in the course.
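Although practice with the F tables is postponed, the two properties just stated are easy to check numerically (optional aside, assuming scipy):

```python
from scipy.stats import f, t

# E(F) = k/(k - 2) for F ~ F(p, k), illustrated with p = 10, k = 50.
print(f.mean(dfn=10, dfd=50), 50 / (50 - 2))

# If T ~ t(k) then T^2 ~ F(1, k): the squared t critical value equals the F one (k = 8).
print(t.ppf(0.975, df=8) ** 2, f.ppf(0.95, dfn=1, dfd=8))   # both about 5.32
```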
Activity 6.16 Let Xi , for i = 1, 2, 3, 4, be independent random variables such that
Xi ∼ N(i, i²). For each of the following situations, use the Xi s to construct a
statistic with the indicated distribution. Note there could be more than one possible
answer for each.
(a) χ²(3).

(b) t(2).

(c) F(1, 2).
Solution
The following are possible, but not exhaustive, solutions.
(a) We could have:

∑_{i=1}^{3} ((Xi − i)/i)² ∼ χ²(3).

(b) We could have:

(X1 − 1)/√(∑_{i=2}^{3} ((Xi − i)/i)²/2) ∼ t(2).

(c) We could have:

(X1 − 1)²/(∑_{i=2}^{3} ((Xi − i)/i)²/2) ∼ F(1, 2).
Activity 6.17 Suppose {Zi }, for i = 1, 2, . . . , k, are independent and identically
distributed standard normal random variables, i.e. Zi ∼ N (0, 1), for i = 1, 2, . . . , k.
State the distribution of:
(a) Z1²

(b) Z1²/Z2²

(c) Z1/√Z2²

(d) ∑_{i=1}^{k} Zi /k

(e) ∑_{i=1}^{k} Zi²

(f) 3/2 × (Z1² + Z2²)/(Z3² + Z4² + Z5²).
Solution
(a) Z1² ∼ χ²(1)

(b) Z1²/Z2² ∼ F(1, 1)

(c) Z1/√Z2² ∼ t(1)

(d) ∑_{i=1}^{k} Zi /k ∼ N(0, 1/k)

(e) ∑_{i=1}^{k} Zi² ∼ χ²(k)

(f) 3/2 × (Z1² + Z2²)/(Z3² + Z4² + Z5²) ∼ F(2, 3).
Activity 6.18 X1 , X2 , X3 and X4 are independent normally distributed random
variables each with a mean of 0 and a standard deviation of 3. Find:
(a) P (X1 + 2X2 > 9)
(b) P(X1² + X2² > 54)

(c) the distribution of (X1² + X2²)/(X3² + X4²).
Solution
(a) We have X1 ∼ N(0, 9) and X2 ∼ N(0, 9). Hence 2X2 ∼ N(0, 36) and X1 + 2X2 ∼ N(0, 45). So:

P(X1 + 2X2 > 9) = P(Z > 9/√45) = P(Z > 1.34) = 0.0901

using Table 3 of the New Cambridge Statistical Tables.

(b) We have X1/3 ∼ N(0, 1) and X2/3 ∼ N(0, 1). Hence X1²/9 ∼ χ²(1) and X2²/9 ∼ χ²(1). Therefore, X1²/9 + X2²/9 ∼ χ²(2). So:

P(X1² + X2² > 54) = P(Y > 6) = 0.05

where Y ∼ χ²(2), using Table 8 of the New Cambridge Statistical Tables.

(c) We have X1²/9 + X2²/9 ∼ χ²(2) and also X3²/9 + X4²/9 ∼ χ²(2). So:

(X1² + X2²)/(X3² + X4²) = [(X1² + X2²)/18]/[(X3² + X4²)/18] ∼ F(2, 2).
6.9 Prelude to statistical inference
We conclude Chapter 6 with a discussion of the preliminaries of statistical inference
before moving on to point estimation. The discussion below will review some key
concepts introduced previously.
So, just what is ‘Statistics’ ? It is a scientific subject of collecting and ‘making sense’ of
data.
Collection: designing experiments/questionnaires, designing sampling schemes, and
administration of data collection.
Making sense: estimation, testing and forecasting.
So, ‘Statistics’ is an application-oriented subject, particularly useful or helpful in
answering questions such as the following.
Does a certain new drug prolong life for AIDS sufferers?
Is global warming really happening?
Are GCSE and A-level examination standards declining?
Is the gap between rich and poor widening in Britain?
Is there still a housing bubble in London?
Is the Chinese yuan undervalued? If so, by how much?
These questions are difficult to study in a laboratory, and admit no self-evident axioms.
Statistics provides a way of answering these types of questions using data.
What should we learn in ‘Statistics’ ? The basic ideas, methods and theory. Some
guidelines for learning/applying statistics are the following.
Understand what data say in each specific context. All the methods are just tools
to help us to understand data.
Concentrate on what to do and why, rather than on concrete calculations and
graphing.
It may take a while to catch the basic idea of statistics – keep thinking!
6.9.1 Population versus random sample
Consider the following two practical examples.
Example 6.9 A new type of tyre was designed to increase its lifetime. The
manufacturer tested 120 new tyres and obtained the average lifetime (over these 120
tyres) of 35,391 miles. So the manufacturer claims that the mean lifetime of new
tyres is 35,391 miles.
Example 6.10 A newspaper sampled 1,000 potential voters, and 350 of them were
supporters of Party X. It claims that the proportion of Party X voters in the whole
country is 350/1000 = 0.35, i.e. 35%.
In both cases, the conclusion is drawn on a population (i.e. all the objects concerned)
based on the information from a sample (i.e. a subset of the population).
In Example 6.9, it is impossible to measure the whole population. In Example 6.10, it is
not economical to measure the whole population. Therefore, errors are inevitable!
The population is the entire set of objects concerned, and these objects are typically
represented by some numbers. We do not know the entire population in practice.
In Example 6.9, the population consists of the lifetimes of all tyres, including those to
be produced in the future. For the opinion poll in Example 6.10, the population consists
of many ‘1’s and ‘0’s, where each ‘1’ represents a voter for Party X, and each ‘0’
represents a voter for other parties.
A sample is a (randomly) selected subset of a population, and is known in practice. The
population is unknown. We represent a population by a probability distribution.
Why do we need a model for the entire population?
Because the questions we ask concern the entire population, not just the data we
have. Having a model for the population tells us that the remaining population is
not much different from our data or, in other words, that the data are
representative of the population.
Why do we need a random model?
Because the process of drawing a sample from a population is a bit like the process
of generating random variables. A different sample would produce different values.
Therefore, the population from which we draw a random sample is represented as a
probability distribution.
6.9.2 Parameter versus statistic
For a given problem, we typically assume a population to be a probability distribution
F (x; θ), where the form of distribution F is known (such as normal or Poisson), and θ
denotes some unknown characteristic (such as the mean or variance) and is called a
parameter.
Example 6.11 Continuing with Example 6.9, the population may be assumed to
be N (µ, σ 2 ) with θ = (µ, σ 2 ), where µ is the ‘true’ lifetime.
Let:
X = the lifetime of a tyre
then we can write X ∼ N (µ, σ 2 ).
Example 6.12 Continuing with Example 6.10, the population is a Bernoulli
distribution such that:
P (X = 1) = P (a Party X voter) = π
and:
P (X = 0) = P (a non-Party X voter) = 1 − π
where:
π = the proportion of Party X supporters in the UK
= the probability of a voter being a Party X supporter.
A sample: a set of data or random variables?
A sample of size n, {X1 , . . . , Xn }, is also called a random sample. It consists of n real
numbers in a practical problem. The word ‘random’ captures the fact that samples (of
the same size) taken by different people or at different times may be different, as they
are different subsets of a population.
Furthermore, a sample is also viewed as n independent and identically distributed
(IID) random variables, when we assess the performance of a statistical method.
Example 6.13 For the tyre lifetime in Example 6.9, suppose the realised sample
(of size n = 120) gives the sample mean:
x̄ = (1/n) ∑_{i=1}^{n} xi = 35,391.
A different sample may give a different sample mean, such as 36,721.
Is the sample mean X̄ a good estimator of the unknown ‘true’ lifetime µ? Obviously,
we cannot use the real number 35,391 to assess how good this estimator is, as a different
sample may give a different average value, such as 36,721.
By treating {X1 , . . . , Xn } as random variables, X̄ is also a random variable. If the
distribution of X̄ concentrates closely around (unknown) µ, X̄ is a good estimator of µ.
Definition of a statistic
Any known function of a random sample is called a statistic. Statistics are used for
statistical inference such as estimation and testing.
Example 6.14 Let {X1 , . . . , Xn } be a random sample from the population
N (µ, σ 2 ), then:
X̄ = (1/n) ∑_{i=1}^{n} Xi,   X1 + Xn²   and   sin(X3) + 6

are all statistics, but:

(X1 − µ)/σ

is not a statistic, as it depends on the unknown quantities µ and σ².
An observed random sample is often denoted as {x1 , . . . , xn }, indicating that they are n
real numbers. They are seen as a realisation of n IID random variables {X1 , . . . , Xn }.
The connection between a population and a sample is shown in Figure 6.10, where θ is
a parameter. A known function of {X1 , . . . , Xn } is called a statistic.
Figure 6.10: Representation of the connection between a population and a sample.
6.9.3 Difference between 'Probability' and 'Statistics'
‘Probability’ is a mathematical subject, while ‘Statistics’ is an application-oriented
subject (which uses probability heavily).
Example 6.15 Let:
X = the number of lectures attended by a student in a term with 20 lectures
then X ∼ Bin(20, π), i.e. the pf is:
P(X = x) = [20!/(x! (20 − x)!)] π^x (1 − π)^(20−x)   for x = 0, 1, . . . , 20

and 0 otherwise.
Some probability questions are as follows. Treating π as known:
what is E(X) (the average number of lectures attended)?
what is P (X ≥ 18) (the proportion of students attending at least 18 lectures)?
what is P (X < 10) (the proportion of students attending fewer than half of the
lectures)?
Some statistics questions are as follows.
What is π (the average attendance rate)?
Is π larger than 0.9?
Is π smaller than 0.5?
6.10 Overview of chapter
This chapter introduced sampling distributions of statistics which are the foundations
to statistical inference. The sampling distribution of the sample mean was derived
exactly when sampling from normal populations and also approximately for more
general distributions using the central limit theorem. Three new families of distributions
(χ2 , t and F ) were defined.
6.11 Key terms and concepts
Central limit theorem
Chi-squared (χ²) distribution
F distribution
IID random variables
Random sample
Sampling distribution
Sampling variance
Statistic
(Student's) t distribution

6.12 Sample examination questions
Solutions can be found in Appendix C.
1. Let X be the amount of money won or lost in betting $5 on red in roulette, such
that:

P(X = 5) = 18/38   and   P(X = −5) = 20/38.

If a gambler bets on red 100 times, use the central limit theorem to estimate the
probability that these wagers result in less than $50 in losses.
2. Suppose Z1 , Z2 , . . . , Z5 are independent standard normal random variables.
Determine the distribution of:
(a) Z1² + Z2²

(b) Z1 / √((∑_{i=2}^{5} Zi²)/4)

(c) Z1² / ((∑_{i=2}^{5} Zi²)/4).
3. Consider a sequence of random variables X1, X2, X3, . . . which are independent and
normally distributed with mean 0 and variance 1. Using as many of these random
variables as you like, construct a random variable which is a function of
X1, X2, X3, . . . and has:

(a) a t₁₁ distribution

(b) an F₆,₉ distribution.
Chapter 7
Point estimation
7.1 Synopsis of chapter
This chapter covers point estimation. Specifically, the properties of estimators are
considered and the attributes of a desirable estimator are discussed. Techniques for
deriving estimators are introduced.
7.2 Learning outcomes
After completing this chapter, you should be able to:
summarise the performance of an estimator with reference to its sampling
distribution
use the concepts of bias and variance of an estimator
define mean squared error and calculate it for simple estimators
find estimators using the method of moments, least squares and maximum
likelihood.
7.3 Introduction
The basic setting is that we assume a random sample {X1 , . . . , Xn } is observed from a
population F (x; θ). The goal is to make inference (i.e. estimation or testing) for the
unknown parameter(s) θ.
Statistical inference is based on two things.
1. A set of data/observations {X1 , . . . , Xn }.
2. An assumption of F (x; θ) for the joint distribution of {X1 , . . . , Xn }.
Inference is carried out using a statistic, i.e. a known function of {X1 , . . . , Xn }.
For estimation, we look for a statistic θ̂ = θ̂(X1, . . . , Xn) such that the value of θ̂ is
taken as an estimate (i.e. an estimated value) of θ. Such a θ̂ is called a point
estimator of θ.
For testing, we typically use a statistic to test if a hypothesis on θ (such as θ = 3) is
true or not.
Example 7.1 Let {X1, . . . , Xn} be a random sample from a population with mean
µ = E(Xi). Find an estimator of µ.

Since µ is the mean of the population, a natural estimator would be the sample
mean µ̂ = X̄, where:

X̄ = (1/n) ∑_{i=1}^{n} Xi = (X1 + · · · + Xn)/n.

We call µ̂ = X̄ a point estimator (or simply an estimator) of µ.

For example, if we have an observed sample of 9, 16, 15, 4 and 12, hence of size
n = 5, the sample mean is:

µ̂ = (9 + 16 + 15 + 4 + 12)/5 = 11.2.

The value 11.2 is a point estimate of µ. For an observed sample of 15, 16, 10, 8
and 9, we obtain µ̂ = 11.6.
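These point estimates are easily checked in R (a quick illustration, not from the guide):

mean(c(9, 16, 15, 4, 12))   # 11.2
mean(c(15, 16, 10, 8, 9))   # 11.6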
7.4 Estimation criteria: bias, variance and mean squared error
Estimators are random variables and, therefore, have probability distributions, known
as sampling distributions. As we know, two important properties of probability
distributions are the mean and variance. Our objective is to create a formal criterion
which combines both of these properties to assess the relative performance of different
estimators.
Bias of an estimator

Let θ̂ be an estimator of the population parameter θ.¹ We define the bias of an
estimator as:

Bias(θ̂) = E(θ̂) − θ.     (7.1)

An estimator is:

positively biased if E(θ̂) − θ > 0

unbiased if E(θ̂) − θ = 0

negatively biased if E(θ̂) − θ < 0.

A positively-biased estimator means the estimator would systematically overestimate
the parameter by the size of the bias, on average. An unbiased estimator means the
estimator would estimate the parameter correctly, on average. A negatively-biased
estimator means the estimator would systematically underestimate the parameter by
the size of the bias, on average.

¹ The ‘ˆ’ (hat) notation is often used by statisticians to denote an estimator of the parameter beneath
the ‘ˆ’. So, for example, λ̂ denotes an estimator of the Poisson rate parameter λ.
In words, the bias of an estimator is the difference between the expected (average) value
of the estimator and the true parameter being estimated. Intuitively, it would be
desirable, other things being equal, to have an estimator with zero bias, called an
unbiased estimator. Given the definition of bias in (7.1), an unbiased estimator would
satisfy:
b = θ.
E(θ)
In words, the expected value of the estimator is the true parameter being estimated, i.e.
on average, under repeated sampling, an unbiased estimator correctly estimates θ.
We view bias as a ‘bad’ thing, so, other things being equal, the smaller an estimator’s
bias the better.
Example 7.2 Since E(X̄) = µ, the sample mean X̄ is an unbiased estimator of µ
because:
E(X̄) − µ = 0.
Variance of an estimator

The variance of an estimator, denoted Var(θ̂), is obtained directly from the
estimator’s sampling distribution.

Example 7.3 For the sample mean, X̄, we have:

Var(X̄) = σ²/n.     (7.2)
It is clear that in (7.2) increasing the sample size n decreases the estimator’s variance
(and hence the standard error, i.e. the square root of the estimator’s variance), therefore
increasing the precision of the estimator.2 We conclude that variance is also a ‘bad’
thing so, other things being equal, the smaller an estimator’s variance the better.
Mean squared error (MSE)
The mean squared error (MSE) of an estimator is the average squared error.
Formally, this is defined as:
MSE(θ̂) = E[(θ̂ − θ)²].     (7.3)

² Remember, however, that this increased precision comes at a cost – namely the increased expenditure
on data collection.
It is possible to decompose the MSE into components involving the bias and variance of
an estimator. Recall that:
Var(X) = E(X²) − (E(X))²   ⇒   E(X²) = Var(X) + (E(X))².

Also, note that for any constant k, Var(X ± k) = Var(X), that is adding or subtracting
a constant has no effect on the variance of a random variable. Noting that the true
parameter θ is some (unknown) constant,³ it immediately follows, by setting
X = θ̂ − θ, that:

MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂ − θ) + (E(θ̂ − θ))² = Var(θ̂) + (Bias(θ̂))².     (7.4)
Expression (7.4) is more useful than (7.3) for practical purposes.
We have already established that both bias and variance of an estimator are ‘bad’
things, so the MSE (being the sum of a bad thing and a bad thing squared) can also be
viewed as a ‘bad’ thing.4 Hence when faced with several competing estimators, we
prefer the estimator with the smallest MSE.
So, although an unbiased estimator is intuitively appealing, it is perfectly possible that
a biased estimator might be preferred if the ‘cost’ of the bias is offset by a substantial
reduction in variance. Hence the MSE provides us with a formal criterion to assess the
trade-off between the bias and variance of different estimators of the same parameter.
Example 7.4 A population is known to be normally distributed, i.e.
X ∼ N (µ, σ 2 ). Suppose we wish to estimate the population mean, µ. We draw a
random sample {X1 , X2 , . . . , Xn } such that these random variables are IID. We have
three candidate estimators of µ, T1 , T2 and T3 , defined as:
T1 = X̄ = (1/n) ∑_{i=1}^{n} Xi,    T2 = (X1 + Xn)/2    and    T3 = X̄ + 3.

Which estimator should we choose?

We begin by computing the MSE for T1, noting:

E(T1) = E(X̄) = µ   and   Var(T1) = Var(X̄) = σ²/n.

Hence T1 is an unbiased estimator of µ. So the MSE of T1 is just the variance of T1,
since the bias is 0. Therefore, MSE(T1) = σ²/n.

³ Even though θ is an unknown constant, it is known to be a constant!
⁴ Or, for that matter, a ‘very bad’ thing!
Moving to T2, note:

E(T2) = E((X1 + Xn)/2) = (E(X1) + E(Xn))/2 = (µ + µ)/2 = µ

and:

Var(T2) = (Var(X1) + Var(Xn))/2² = (1/4) × 2σ² = σ²/2.

So T2 is also an unbiased estimator of µ, hence MSE(T2) = σ²/2.
Finally, consider T3, noting:

E(T3) = E(X̄ + 3) = E(X̄) + 3 = µ + 3

and:

Var(T3) = Var(X̄ + 3) = Var(X̄) = σ²/n.

So T3 is a positively-biased estimator of µ, with a bias of 3. Hence we have
MSE(T3) = σ²/n + 3² = σ²/n + 9.
We seek the estimator with the smallest MSE. Clearly, MSE(T1 ) < MSE(T3 ) so we
can eliminate T3 . Now comparing T1 with T2 , we note that:
for n = 2, MSE(T1 ) = MSE(T2 ), since the estimators are identical
for n > 2, MSE(T1 ) < MSE(T2 ), so T1 is preferred.
So T1 = X̄ is our preferred estimator of µ. Intuitively this should make sense. Note
for n > 2, T1 uses all the information in the sample (i.e. all observations are used),
unlike T2 which uses the first and last observations only. Of course, for n = 2, these
estimators are identical.
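The comparison of the three estimators can also be seen by simulation. The following R sketch is not part of the original example; the population values µ = 5, σ = 2, the sample size n = 10 and the seed are hypothetical choices.

# Estimating the MSEs of T1, T2 and T3 by repeated sampling from N(5, 4)
set.seed(1)
n <- 10; mu <- 5; sigma <- 2; reps <- 10000
T1 <- T2 <- T3 <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(n, mu, sigma)
  T1[r] <- mean(x)              # T1 = sample mean
  T2[r] <- (x[1] + x[n]) / 2    # T2 = average of first and last observations
  T3[r] <- mean(x) + 3          # T3 = sample mean plus 3
}
c(mean((T1 - mu)^2), mean((T2 - mu)^2), mean((T3 - mu)^2))
# The estimated MSEs should be close to sigma^2/n = 0.4, sigma^2/2 = 2 and sigma^2/n + 9 = 9.4.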
Some remarks are the following.
i. µ̂ = X̄ is a better estimator of µ than X1 as:

MSE(µ̂) = σ²/n < MSE(X1) = σ².
ii. As n → ∞, MSE(X̄) → 0, i.e. when the sample size tends to infinity, the error in
estimation goes to 0. Such an estimator is called a (mean-square) consistent
estimator.
Consistency is a reasonable requirement. It may be used to rule out some silly
estimators.
For µ̃ = (X1 + X4 )/2, MSE(µ̃) = σ 2 /2 which does not converge to 0 as n → ∞.
This is due to the fact that only a small portion of information (i.e. X1 and X4 )
is used in the estimation.
iii. For any random sample {X1, . . . , Xn} from a population with mean µ and variance
σ², it holds that:

E(X̄) = µ   and   Var(X̄) = σ²/n.
The derivation of the expected value and variance of the sample mean was covered
in Chapter 6.
Example 7.5 Bias by itself cannot be used to measure the quality of an estimator.
Consider two artificial estimators of θ, θ̂1 and θ̂2, such that θ̂1 takes only the two
values, θ − 100 and θ + 100, and θ̂2 takes only the two values θ and θ + 0.2, with the
following probabilities:

P(θ̂1 = θ − 100) = P(θ̂1 = θ + 100) = 0.5

and:

P(θ̂2 = θ) = P(θ̂2 = θ + 0.2) = 0.5.

Note that θ̂1 is an unbiased estimator of θ and θ̂2 is a positively-biased estimator of θ as:

Bias(θ̂2) = E(θ̂2) − θ = [(θ × 0.5) + ((θ + 0.2) × 0.5)] − θ = 0.1.

However:

MSE(θ̂1) = E[(θ̂1 − θ)²] = (−100)² × 0.5 + (100)² × 0.5 = 10000

and:

MSE(θ̂2) = E[(θ̂2 − θ)²] = 0² × 0.5 + (0.2)² × 0.5 = 0.02.

Hence θ̂2 is a much better (i.e. more accurate) estimator of θ than θ̂1.
Activity 7.1 Based on a random sample of two independent observations from a
population with mean µ and standard deviation σ, consider two estimators of µ, X
and Y , defined as:
X = X1/2 + X2/2   and   Y = X1/3 + 2X2/3.
Are X and Y unbiased estimators of µ?
Solution
We have:

E(X) = E(X1/2 + X2/2) = (1/2) × E(X1) + (1/2) × E(X2) = (1/2) × µ + (1/2) × µ = µ

and:

E(Y) = E(X1/3 + 2X2/3) = (1/3) × E(X1) + (2/3) × E(X2) = (1/3) × µ + (2/3) × µ = µ.
It follows that both estimators are unbiased estimators of µ.
Activity 7.2 Let {X1, X2, . . . , Xn}, where n > 2, be a random sample from an
unknown population with mean θ and variance σ². We want to choose between two
estimators of θ, θ̂1 = X̄ and θ̂2 = (X1 + X2)/2. Which is the better estimator of θ?
Solution
Let us consider the bias first. The estimator θ̂1 is just the sample mean, so we know
that it is unbiased. The estimator θ̂2 has expectation:

E(θ̂2) = E((X1 + X2)/2) = (E(X1) + E(X2))/2 = (θ + θ)/2 = θ

so it is also an unbiased estimator of θ.

Next, we consider the variances of the two estimators. We have:

Var(θ̂1) = Var(X̄) = σ²/n

and:

Var(θ̂2) = Var((X1 + X2)/2) = (Var(X1) + Var(X2))/4 = (σ² + σ²)/4 = σ²/2.

Since n > 2, we can see that θ̂1 has a lower variance than θ̂2, so it is a better
estimator. Unsurprisingly, we obtain a better estimator of θ by considering the whole
sample, rather than just the first two values.
Activity 7.3 Find the MSEs of the estimators in the previous activity. Are they
consistent estimators of θ?
Solution
The MSEs are:

MSE(θ̂1) = Var(θ̂1) + (Bias(θ̂1))² = σ²/n + 0 = σ²/n

and:

MSE(θ̂2) = Var(θ̂2) + (Bias(θ̂2))² = σ²/2 + 0 = σ²/2.

Note that the MSE of an unbiased estimator is equal to its variance.

The estimator θ̂1 has MSE equal to σ²/n, which converges to 0 as n → ∞. The
estimator θ̂2 has MSE equal to σ²/2, which stays constant as n → ∞. Therefore, θ̂1
is a (mean-square) consistent estimator of θ, whereas θ̂2 is not.
Activity 7.4 Let X1 and X2 be two independent random variables with the same
mean, µ, and the same variance, σ² < ∞. Let µ̂ = aX1 + bX2 be an estimator of µ,
where a and b are two non-zero constants.

(a) Identify the condition on a and b to ensure that µ̂ is an unbiased estimator of µ.

(b) Find the minimum mean squared error (MSE) among all unbiased estimators of µ.
Solution

(a) We have E(µ̂) = E(aX1 + bX2) = a E(X1) + b E(X2) = (a + b)µ. Hence a + b = 1 is
the condition for µ̂ to be an unbiased estimator of µ.

(b) Under this condition, noting that b = 1 − a, we have:

MSE(µ̂) = Var(µ̂) = a² Var(X1) + b² Var(X2) = (a² + b²) σ² = (2a² − 2a + 1) σ².

Setting d MSE(µ̂)/da = (4a − 2)σ² = 0, we have a = 0.5, and hence b = 0.5.
Therefore, among all unbiased linear estimators, the sample mean (X1 + X2)/2
has the minimum variance.

Remark: Let {X1, . . . , Xn} be a random sample from a population with finite
variance. The sample mean X̄ has the minimum variance among all unbiased linear
estimators of the form ∑_{i=1}^{n} ai Xi, hence it is the best linear unbiased estimator
(BLUE(!)).
Activity 7.5 Hard question!

Let {X1, . . . , Xn} be a random sample from a Bernoulli distribution where
P(Xi = 1) = π = 1 − P(Xi = 0) for all i = 1, . . . , n. Let π̂ = X̄ = (X1 + · · · + Xn)/n
be an estimator of π.

(a) Find the mean squared error of π̂, i.e. MSE(π̂). Is π̂ an unbiased estimator of π?
Is π̂ a consistent estimator of π?

(b) Let Y = X1 + · · · + Xn. Find the probability distribution of Y.

(c) Find the sampling distribution of π̂ = Y/n (which, recall, is simply the
probability distribution of π̂).
Solution

(a) We have E(Xi) = 0 × (1 − π) + 1 × π = π, E(Xi²) = E(Xi) = π (since Xi = Xi²
for the Bernoulli distribution), and Var(Xi) = π − π² = π(1 − π) for all
i = 1, . . . , n. Hence:

E(π̂) = E((1/n) ∑_{i=1}^{n} Xi) = (1/n) ∑_{i=1}^{n} E(Xi) = (1/n) × n × π = π.

Therefore, π̂ is an unbiased estimator of π. Furthermore, by independence:

MSE(π̂) = Var(π̂) = Var((1/n) ∑_{i=1}^{n} Xi) = (1/n²) ∑_{i=1}^{n} Var(Xi) = π(1 − π)/n

which converges to 0 as n → ∞. Hence π̂ is a consistent estimator of π.

(b) Y may only take the integer values 0, 1, . . . , n. For 0 ≤ y ≤ n, the event Y = y
occurs if and only if there are exactly y 1s and (n − y) 0s among the values of
X1, . . . , Xn. However, those y 1s may take any y out of the n positions. Hence:

P(Y = y) = [n!/(y! (n − y)!)] π^y (1 − π)^(n−y).

Therefore, Y ∼ Bin(n, π).

(c) Note π̂ = Y/n. Hence π̂ has a rescaled binomial distribution on the n + 1 points
{0, 1/n, 2/n, . . . , 1}.
Finding estimators
In general, how should we find an estimator of θ in a practical situation?
There are three conventional methods:
method of moments estimation
least squares estimation
maximum likelihood estimation.
7.5 Method of moments (MM) estimation
Method of moments estimation
Let {X1 , . . . , Xn } be a random sample from a population F (x; θ). Suppose θ has p
components (for example, for a normal population N (µ, σ 2 ), p = 2; for a Poisson
population with parameter λ, p = 1).
Let:
µk = µk (θ) = E(X k )
denote the kth population moment, for k = 1, 2, . . .. Therefore, µk depends on the
unknown parameter θ, as everything else about the distribution F (x; θ) is known.
Denote the kth sample moment by:

Mk = (1/n) ∑_{i=1}^{n} Xi^k = (X1^k + · · · + Xn^k)/n.

The MM estimator (MME) θ̂ of θ is the solution of the p equations:

µk(θ̂) = Mk   for k = 1, . . . , p.
Example 7.6 Let {X1, . . . , Xn} be a random sample from a population with mean
µ and variance σ² < ∞. Find the MM estimator of (µ, σ²).

There are two unknown parameters. Let:

µ̂1 = M1   and   µ̂2 = M2 = (1/n) ∑_{i=1}^{n} Xi².

This gives us µ̂ = M1 = X̄.

Since σ² = µ2 − µ1² = E(X²) − (E(X))², we have:

σ̂² = M2 − M1² = (1/n) ∑_{i=1}^{n} Xi² − X̄² = (1/n) ∑_{i=1}^{n} (Xi − X̄)².

Note we have:

E(σ̂²) = E((1/n) ∑_{i=1}^{n} Xi² − X̄²)
      = (1/n) ∑_{i=1}^{n} E(Xi²) − E(X̄²)
      = E(X²) − E(X̄²)
      = σ² + µ² − (σ²/n + µ²)
      = (n − 1)σ²/n.

Since:

E(σ̂²) − σ² = −σ²/n < 0

σ̂² is a negatively-biased estimator of σ².

The sample variance, defined as:

S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)²

is a more frequently-used estimator of σ² as it has zero bias, i.e. it is an unbiased
estimator since E(S²) = σ². This is why we use the n − 1 divisor when calculating
the sample variance.

A useful formula for computation of the sample variance is:

S² = (1/(n − 1)) (∑_{i=1}^{n} Xi² − nX̄²).

Note the MME does not use any information on F(x; θ) beyond the moments.
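The negative bias of σ̂² and the unbiasedness of S² can be checked by simulation in R (a sketch, not from the guide; the population N(0, 4), the sample size n = 5 and the seed are hypothetical choices).

# Comparing the n divisor and the n - 1 divisor over repeated samples
set.seed(2)
n <- 5; sigma2 <- 4; reps <- 50000
v_mme <- v_unbiased <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(n, 0, sqrt(sigma2))
  v_mme[r] <- mean((x - mean(x))^2)   # n divisor (the MME of sigma^2)
  v_unbiased[r] <- var(x)             # n - 1 divisor (the sample variance)
}
mean(v_mme)        # close to (n - 1) * sigma2 / n = 3.2
mean(v_unbiased)   # close to sigma2 = 4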
The idea is that Mk should be pretty close to µk when n is sufficiently large. In fact:

Mk = (1/n) ∑_{i=1}^{n} Xi^k

converges to:

µk = E(X^k)

as n → ∞. This is due to the law of large numbers (LLN). We illustrate this
phenomenon by simulation using R.
Example 7.7 For N (2, 4), we have µ1 = 2 and µ2 = 8. We use the sample
moments M1 and M2 as estimators of µ1 and µ2 , respectively. Note how the sample
moments converge to the population moments as the sample size increases.
For a sample of size n = 10, we obtained m1 = 0.5145838 and m2 = 2.171881.
> x <- rnorm(10,2,2)
> x
 [1]  0.70709403 -1.38416864 -0.01692815  2.51837989 -0.28518898  1.83541490
 [7] -1.53308559 -0.42573724  1.76006933  1.96998829
> mean(x)
[1] 0.5145838
> x2 <- x^2
> mean(x2)
[1] 2.171881
For a sample of size n = 100, we obtained m1 = 2.261542 and m2 = 8.973033.
> x <- rnorm(100,2,2)
> mean(x)
[1] 2.261542
> x2 <- x^2
> mean(x2)
[1] 8.973033
For a sample of size n = 500, we obtained m1 = 1.912112 and m2 = 7.456353.
> x <- rnorm(500,2,2)
> mean(x)
[1] 1.912112
> x2 <- x^2
> mean(x2)
[1] 7.456353
Example 7.8 For a Poisson distribution with λ = 1, we have µ1 = 1 and µ2 = 2.
With a sample of size n = 500, we obtained m1 = 1.09 and m2 = 2.198.
> x <- rpois(500,1)
> mean(x)
[1] 1.09
> x2 <- x^2
> mean(x2)
[1] 2.198
> x
[1] 1 2 2 1 0 0 0 0 0 0 2 2 1 2 1 1 1 2 ...
Activity 7.6 Let {X1 , . . . , Xn } be a random sample from the (continuous) uniform
distribution such that X ∼ Uniform[0, θ], where θ > 0. Find the method of moments
estimator (MME) of θ.
Solution

The pdf of Xi is:

f(xi; θ) = 1/θ   for 0 ≤ xi ≤ θ, and 0 otherwise.

Therefore:

E(Xi) = (1/θ) ∫₀^θ xi dxi = (1/θ) [xi²/2]₀^θ = θ/2.

Therefore, setting µ̂1 = M1, we have:

θ̂/2 = X̄   ⇒   θ̂ = 2X̄.
Activity 7.7 Suppose that we have a random sample {X1 , . . . , Xn } from a
Uniform[−θ, θ] distribution. Find the method of moments estimator of θ.
Solution
The mean of the Uniform[a, b] distribution is (a + b)/2. In our case, this gives
E(X) = (−θ + θ)/2 = 0. The first population moment does not depend on θ, so we
need to move to the next (i.e. second) population moment.
Recall that the variance of the Uniform[a, b] distribution is (b − a)²/12. Hence the
second population moment is:

E(X²) = Var(X) + (E(X))² = (θ − (−θ))²/12 + 0² = θ²/3.

We set this equal to the second sample moment to obtain:

(1/n) ∑_{i=1}^{n} Xi² = θ̂²/3.

Therefore, the method of moments estimator of θ is:

θ̂_MM = √((3/n) ∑_{i=1}^{n} Xi²).
Activity 7.8 Consider again the Uniform[−θ, θ] distribution from the previous
question. Suppose that we observe the following data:
1.8, 0.7, −0.2, −1.8, 2.8, 0.6, −1.3, −0.1.
Estimate θ using the method of moments.
Solution

The point estimate is:

θ̂_MM = √((3/8) ∑_{i=1}^{8} xi²) ≈ 2.518
which implies that the data came from a Uniform[−2.518, 2.518] distribution.
However, this clearly cannot be true since the observation x5 = 2.8 falls outside this
range!
The method of moments does not take into account that all of the observations need
to lie in the interval [−θ, θ], and so it fails to produce a useful estimate.
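The calculation, and the problem with it, can be reproduced in R (an illustration only):

x <- c(1.8, 0.7, -0.2, -1.8, 2.8, 0.6, -1.3, -0.1)
theta_mm <- sqrt(3 * mean(x^2))   # approximately 2.518
theta_mm
max(abs(x))                       # 2.8 > 2.518, so the estimate cannot be right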
Activity 7.9 Let X ∼ Bin(n, π), where n is known. Find the method of moments
estimator (MME) of π.

Solution

The pf of the binomial distribution is:

P(X = x) = [n!/(x! (n − x)!)] π^x (1 − π)^(n−x)   for x = 0, 1, . . . , n

and 0 otherwise. Therefore:

E(X) = ∑_{x=0}^{n} x P(X = x)
     = ∑_{x=1}^{n} x [n!/(x! (n − x)!)] π^x (1 − π)^(n−x)
     = ∑_{x=1}^{n} [n!/((x − 1)! (n − x)!)] π^x (1 − π)^(n−x).

Let m = n − 1 and write j = x − 1, then (n − x) = (m − j), and:

E(X) = ∑_{j=0}^{m} [n m!/(j! (m − j)!)] π^(j+1) (1 − π)^(m−j) = nπ ∑_{j=0}^{m} [m!/(j! (m − j)!)] π^j (1 − π)^(m−j).

Therefore, E(X) = nπ, and hence π̂ = X/n.
7.6 Least squares (LS) estimation
Given a random sample {X1 , . . . , Xn } from a population with mean µ and variance σ 2 ,
how can we estimate µ?
The MME of µ is the sample mean X̄ = ∑_{i=1}^{n} Xi/n.

Least squares estimator for µ

The estimator X̄ is also the least squares estimator (LSE) of µ, defined as:

µ̂ = X̄ = arg min_a ∑_{i=1}^{n} (Xi − a)².

Proof: Given that S = ∑_{i=1}^{n} (Xi − a)² = ∑_{i=1}^{n} (Xi − X̄)² + n(X̄ − a)², where all terms are
non-negative, then the value of a for which S is minimised is when n(X̄ − a)² = 0, i.e.
a = X̄.

Activity 7.10 Suppose that you are given observations y1, y2, y3 and y4 such that:
y1 = α + β + ε1
y2 = −α + β + ε2
y3 = α − β + ε3
y4 = −α − β + ε4 .
The random variables εi , for i = 1, 2, 3, 4, are independent and normally distributed
with mean 0 and variance σ 2 .
(a) Find the least squares estimators of the parameters α and β.
(b) Verify that the least squares estimators in (a) are unbiased estimators of their
respective parameters.
(c) Find the variance of the least squares estimator of α.
Solution

(a) We start off with the sum of squares function:

S = ∑_{i=1}^{4} εi² = (y1 − α − β)² + (y2 + α − β)² + (y3 − α + β)² + (y4 + α + β)².

Now take the partial derivatives:

∂S/∂α = −2(y1 − α − β) + 2(y2 + α − β) − 2(y3 − α + β) + 2(y4 + α + β)
      = −2(y1 − y2 + y3 − y4) + 8α

and:

∂S/∂β = −2(y1 − α − β) − 2(y2 + α − β) + 2(y3 − α + β) + 2(y4 + α + β)
      = −2(y1 + y2 − y3 − y4) + 8β.

The least squares estimators α̂ and β̂ are the solutions to ∂S/∂α = 0 and
∂S/∂β = 0. Hence:

α̂ = (y1 − y2 + y3 − y4)/4   and   β̂ = (y1 + y2 − y3 − y4)/4.

(b) α̂ is an unbiased estimator of α since:

E(α̂) = E((y1 − y2 + y3 − y4)/4) = (α + β + α − β + α − β + α + β)/4 = α.

β̂ is an unbiased estimator of β since:

E(β̂) = E((y1 + y2 − y3 − y4)/4) = (α + β − α + β − α + β + α + β)/4 = β.

(c) Due to independence, we have:

Var(α̂) = Var((y1 − y2 + y3 − y4)/4) = 4σ²/16 = σ²/4.
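As a numerical check (a sketch, not from the guide), the closed-form estimators agree with R's lm() applied to the corresponding design matrix; the values α = 2, β = −1 and the seed are hypothetical.

# Verifying the least squares solution of Activity 7.10 numerically
set.seed(3)
alpha <- 2; beta <- -1
A <- c(1, -1, 1, -1); B <- c(1, 1, -1, -1)    # coefficients of alpha and beta in y1, ..., y4
y <- alpha * A + beta * B + rnorm(4)
coef(lm(y ~ 0 + A + B))                       # least squares estimates of alpha and beta
c((y[1] - y[2] + y[3] - y[4]) / 4,            # closed-form estimators from the solution
  (y[1] + y[2] - y[3] - y[4]) / 4)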
Estimator accuracy

In order to assess the accuracy of µ̂ = X̄ as an estimator of µ we calculate its MSE:

MSE(µ̂) = E[(µ̂ − µ)²] = σ²/n.

In order to determine the distribution of µ̂ we require knowledge of the underlying
distribution. Even if the relevant knowledge is available, one may only compute the
exact distribution of µ̂ explicitly for a limited number of cases.

By the central limit theorem, as n → ∞, we have:

P((X̄ − µ)/(σ/√n) ≤ z) → Φ(z)

for any z, where Φ(z) is the cdf of N(0, 1), i.e. when n is large, X̄ ∼ N(µ, σ²/n)
approximately.

Hence when n is large:

P(|X̄ − µ| ≤ 1.96 × σ/√n) ≈ 0.95.

In practice, the standard deviation σ is unknown and so we replace it by the sample
standard deviation S, where S² is the sample variance, given by:

S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)².

This gives an approximation of:

P(|X̄ − µ| ≤ 1.96 × S/√n) ≈ 0.95.

To be on the safe side, the coefficient 1.96 is often replaced by 2. The estimated
standard error of X̄ is:

E.S.E.(X̄) = S/√n = [(1/(n(n − 1))) ∑_{i=1}^{n} (Xi − X̄)²]^(1/2).
Some remarks are the following.
i. The LSE is a geometrical solution – it minimises the sum of squared distances
between the estimated value and each observation. It makes no use of any
information about the underlying distribution.
ii. Taking the derivative of ∑_{i=1}^{n} (Xi − a)² with respect to a, and equating it to 0, we
obtain (after dividing through by −2):

∑_{i=1}^{n} (Xi − a) = ∑_{i=1}^{n} Xi − na = 0.

Hence the solution is µ̂ = â = X̄. This is another way to derive the LSE of µ.
Activity 7.11 A random sample of size n = 400 produced the sample sums
∑_i xi = 983 and ∑_i xi² = 4729.
(a) Calculate point estimates for the population mean and the population standard
deviation.
(b) Calculate the estimated standard error of the mean estimate.
Solution

(a) As before, we use the sample mean to estimate the population mean, i.e.
µ̂ = x̄ = 983/400 = 2.4575, and the sample variance to estimate the population
variance, i.e. we have:

s² = (1/(n − 1)) ∑_{i=1}^{400} (xi − x̄)² = (1/(n − 1)) (∑_{i=1}^{400} xi² − nx̄²)
   = (4729 − 400 × (2.4575)²)/399 = 5.7977.

Therefore, the estimate for the population standard deviation is s = √5.7977 = 2.4078.

(b) The estimated standard error is s/√n = 2.4078/√400 = 0.1204.

Note that the estimated standard error is rather small, indicating that the estimate
of the population mean is rather accurate. This is due to two factors: (i.) the
population variance is small, as evident from the small value of s², and (ii.) the
sample size of n = 400 is rather large.

Note also that using the n divisor (i.e. the method of moments estimator of σ²) we
have ∑_{i=1}^{n} (xi − x̄)²/n = 5.7832, which is pretty close to s².
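The same calculations can be reproduced in R from the given sums (an illustration only):

n <- 400; sum_x <- 983; sum_x2 <- 4729
xbar <- sum_x / n                       # 2.4575
s2 <- (sum_x2 - n * xbar^2) / (n - 1)   # 5.7977
c(xbar, sqrt(s2), sqrt(s2) / sqrt(n))   # mean estimate, sd estimate, estimated standard error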
7.7 Maximum likelihood (ML) estimation
We begin with an illustrative example. Maximum likelihood (ML) estimation
generalises the reasoning in the following example to arbitrary settings.
Example 7.9 Suppose we toss a coin 10 times, and record the number of ‘heads’ as
a random variable X. Therefore:
X ∼ Bin(10, π)
where π = P (heads) ∈ (0, 1) is the unknown parameter.
If x = 8, what is your best guess (i.e. estimate) of π? Obviously 0.8!
Is π = 0.1 possible? Yes, but very unlikely.
Is π = 0.5 possible? Yes, but not very likely.
Is π = 0.7 or 0.9 possible? Yes, very likely.
Nevertheless, π = 0.8 is the most likely, or ‘maximally’ likely value of the parameter.
Why do we think ‘π = 0.8’ is most likely?
Let:

L(π) = P(X = 8) = [10!/(8! 2!)] π⁸ (1 − π)².

Since x = 8 is the event which occurred in the experiment, this probability would be
very large. Figure 7.1 shows a plot of L(π) as a function of π.

The most likely value of π should make this probability as large as possible. This
value is taken as the maximum likelihood estimate of π.

Maximising L(π) is equivalent to maximising:

l(π) = log(L(π)) = 8 log π + 2 log(1 − π) + c

where c is the constant log(10!/(8! 2!)). Setting dl(π)/dπ = 0, we obtain the ML
estimate π̂ = 0.8.
Figure 7.1: Plot of the likelihood function in Example 7.9.
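The curve in Figure 7.1 can be reproduced, and its maximiser checked numerically, with a few lines of R (a sketch, not from the guide):

p <- seq(0.01, 0.99, by = 0.001)
L <- dbinom(8, size = 10, prob = p)                # L(pi) = P(X = 8)
plot(p, L, type = "l", xlab = "pi", ylab = "likelihood")
p[which.max(L)]                                    # approximately 0.8
optimize(function(q) dbinom(8, 10, q), c(0, 1), maximum = TRUE)$maximum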
Maximum likelihood definition
Let f (x1 , . . . , xn ; θ) be the joint probability density function (or probability function)
for random variables (X1 , . . . , Xn ). The maximum likelihood estimator (MLE) of θ
based on the observations {X1 , . . . , Xn } is defined as:
θ̂ = arg max_θ f(X1, . . . , Xn; θ).
Some remarks are the following.
i. The MLE depends only on the observations {X1, . . . , Xn}, such that:

θ̂ = θ̂(X1, . . . , Xn).

Therefore, θ̂ is a statistic (as it must be for an estimator of θ).

ii. If {X1, . . . , Xn} is a random sample from a population with probability density
function f(x; θ), the joint probability density function for (X1, . . . , Xn) is:

∏_{i=1}^{n} f(xi; θ).

The joint pdf is a function of (X1, . . . , Xn), while θ is a parameter.

The joint pdf describes the probability distribution of {X1, . . . , Xn}.

The likelihood function is defined as:

L(θ) = ∏_{i=1}^{n} f(Xi; θ).     (7.5)
The likelihood function is a function of θ, while {X1 , . . . , Xn } are treated as
constants (as given observations).
The likelihood function reflects the information about the unknown parameter θ in
the data {X1 , . . . , Xn }.
Some remarks are the following.
i. The likelihood function is a function of the parameter. It is defined up to positive
constant factors. A likelihood function is not a probability density function. It
contains all the information about the unknown parameter from the observations.
ii. The MLE is θ̂ = arg max_θ L(θ).

iii. It is often more convenient to use the log-likelihood function⁵ denoted as:

l(θ) = log L(θ) = ∑_{i=1}^{n} log(f(Xi; θ))

as it transforms the product in (7.5) into a sum. Note that:

θ̂ = arg max_θ l(θ).

iv. For a smooth likelihood function, the MLE is often the solution of the equation:

(d/dθ) l(θ) = 0.

v. If θ̂ is the MLE and φ = g(θ) is a function of θ, φ̂ = g(θ̂) is the MLE of φ (which is
known as the invariance principle of the MLE).
vi. Unlike the MME or LSE, the MLE uses all the information about the population
distribution. It is often more efficient (i.e. more accurate) than the MME or LSE.
vii. In practice, ML estimation should be used whenever possible.
Example 7.10 Let {X1, . . . , Xn} be a random sample from a distribution with pdf:

f(x; λ) = λ² x e^(−λx)   for x > 0, and 0 otherwise

where λ > 0 is unknown. Find the MLE of λ.

The joint pdf is f(x1, . . . , xn; λ) = ∏_{i=1}^{n} λ² xi e^(−λxi) if all xi > 0, and 0 otherwise.

The likelihood function is:

L(λ) = λ^(2n) exp(−λ ∑_{i=1}^{n} Xi) ∏_{i=1}^{n} Xi = λ^(2n) exp(−nλX̄) ∏_{i=1}^{n} Xi.

The log-likelihood function is l(λ) = 2n log λ − nλX̄ + c, where c = log ∏_{i=1}^{n} Xi is a
constant.

Setting dl(λ)/dλ = 2n/λ̂ − nX̄ = 0, we obtain λ̂ = 2/X̄.

Note the MLE λ̂ may be obtained from maximising L(λ) directly. However, it is
much easier to work with l(λ) instead.

By the invariance principle, the MLE of λ² would be λ̂² = 4/X̄².

⁵ Throughout where ‘log’ is used in log-likelihood functions, it will be assumed to be the logarithm to
the base e, i.e. the natural logarithm.
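The closed-form MLE can be checked numerically (a sketch, not from the guide). The pdf f(x; λ) = λ²x e^(−λx) is the Gamma(2, λ) density, so data can be simulated with rgamma(); λ = 1.5 and the seed are hypothetical choices.

# Numerical maximisation of the log-likelihood versus the closed-form MLE 2/xbar
set.seed(4)
x <- rgamma(200, shape = 2, rate = 1.5)
loglik <- function(lambda) 2 * length(x) * log(lambda) - lambda * sum(x)
optimize(loglik, c(0.01, 10), maximum = TRUE)$maximum   # numerical MLE
2 / mean(x)                                             # closed-form MLE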
Example 7.11 Consider a population with three types of individuals labelled 1, 2
and 3, and occurring according to the Hardy–Weinberg proportions:
p(1; θ) = θ²,   p(2; θ) = 2θ(1 − θ)   and   p(3; θ) = (1 − θ)²

where 0 < θ < 1. Note that p(1; θ) + p(2; θ) + p(3; θ) = 1.

A random sample of size n is drawn from this population with n1 observed values
equal to 1 and n2 observed values equal to 2 (therefore, there are n − n1 − n2 values
equal to 3). Find the MLE of θ.

Let us assume {X1, . . . , Xn} is the sample (i.e. n observed values). Among them,
there are n1 ‘1’s, n2 ‘2’s, and n − n1 − n2 ‘3’s. The likelihood function is (where ∝
means ‘proportional to’):

L(θ) = ∏_{i=1}^{n} p(Xi; θ) = p(1; θ)^n1 p(2; θ)^n2 p(3; θ)^(n−n1−n2)
     = θ^(2n1) (2θ(1 − θ))^n2 (1 − θ)^(2(n−n1−n2))
     ∝ θ^(2n1+n2) (1 − θ)^(2n−2n1−n2).

The log-likelihood is l(θ) ∝ (2n1 + n2) log θ + (2n − 2n1 − n2) log(1 − θ).

Setting dl(θ)/dθ = (2n1 + n2)/θ̂ − (2n − 2n1 − n2)/(1 − θ̂) = 0, that is:

(1 − θ̂)(2n1 + n2) = θ̂(2n − 2n1 − n2)

leads to the MLE:

θ̂ = (2n1 + n2)/(2n).

For example, for a sample with n = 4, n1 = 1 and n2 = 2, we obtain a point estimate
of θ̂ = 0.5.
Example 7.12 Let {X1 , . . . , Xn } be a random sample from the (continuous)
uniform distribution Uniform[0, θ], where θ > 0 is unknown.
(a) Find the MLE of θ.
(b) If n = 3, x1 = 0.9, x2 = 1.2 and x3 = 0.3, what is the maximum likelihood
estimate of θ?
(a) The pdf of Uniform[0, θ] is:

f(x; θ) = 1/θ   for 0 ≤ x ≤ θ, and 0 otherwise.

The joint pdf is:

f(x1, . . . , xn; θ) = θ^(−n)   for 0 ≤ x1, . . . , xn ≤ θ, and 0 otherwise.

As a function of θ, f(x1, . . . , xn; θ) is the likelihood function, L(θ). The
maximum likelihood estimator of θ is the value at which the likelihood function
L(θ) achieves its maximum. Note:

L(θ) = θ^(−n)   for X(n) ≤ θ, and 0 otherwise

where:

X(n) = max_i Xi.

Hence the MLE is θ̂ = X(n).

Note that this is a special case of a likelihood function which is not
‘well-behaved’, since it is not continuously differentiable at the maximum. This
is because the sample space of this distribution is defined by θ, i.e. we have that
0 ≤ x ≤ θ. Therefore, it is impossible for θ to be any value below the maximum
observed value of X. As such, although L(θ) increases as θ decreases, L(θ) falls
to zero for all θ less than the maximum observed value of X.

As such, we cannot use calculus to maximise the likelihood function (nor the
log-likelihood function), so instead we immediately deduce here that θ̂ = X(n).

(b) For the given data, the maximum observation is x(3) = 1.2. Therefore, the
maximum likelihood estimate is θ̂ = 1.2. The likelihood function is zero for
θ < 1.2 and decreases as θ^(−3) for θ ≥ 1.2, as sketched below.
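A quick sketch of this likelihood in R (an illustration, not from the guide) shows the jump to zero at θ = 1.2 and the decay for larger θ:

x <- c(0.9, 1.2, 0.3)
theta <- seq(0.01, 3, by = 0.01)
L <- ifelse(theta >= max(x), theta^(-length(x)), 0)   # L(theta) for Uniform[0, theta]
plot(theta, L, type = "l", xlab = "theta", ylab = "likelihood")
abline(v = max(x), lty = 2)                           # the MLE, theta-hat = 1.2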
Activity 7.12 Let {X1 , . . . , Xn } be a random sample from a Poisson distribution
with mean λ > 0. Find the MLE of λ.
Solution

The probability function is:

P(X = x) = e^(−λ) λ^x / x!.

The likelihood and log-likelihood functions are, respectively:

L(λ) = ∏_{i=1}^{n} e^(−λ) λ^Xi / Xi! = e^(−nλ) λ^(nX̄) / ∏_{i=1}^{n} Xi!

and:

l(λ) = log L(λ) = nX̄ log(λ) − nλ + C = n(X̄ log(λ) − λ) + C

where C is a constant (i.e. it may depend on Xi but cannot depend on the
parameter). Setting:

dl(λ)/dλ = n(X̄/λ̂ − 1) = 0

we obtain the MLE λ̂ = X̄, which is also the MME.
Activity 7.13 Let {X1 , . . . , Xn } be a random sample from an Exponential(λ)
distribution. Find the MLE of λ.
Solution

The likelihood function is:

L(λ) = ∏_{i=1}^{n} f(xi; λ) = ∏_{i=1}^{n} λ e^(−λXi) = λ^n e^(−λ ∑_i Xi) = λ^n e^(−λnX̄)

so the log-likelihood function is:

l(λ) = log(λ^n e^(−λnX̄)) = n log(λ) − λnX̄.

Differentiating and setting equal to zero gives:

(d/dλ) l(λ) = n/λ̂ − nX̄ = 0   ⇒   λ̂ = 1/X̄.

The second derivative of the log-likelihood function is:

(d²/dλ²) l(λ) = −n/λ²

which is always negative, hence the MLE λ̂ = 1/X̄ is indeed a maximum. This
happens to be the same as the method of moments estimator of λ.
Activity 7.14 Use the observed random sample x1 = 8.2, x2 = 10.6, x3 = 9.1 and
x4 = 4.9 to calculate the maximum likelihood estimate of λ in the exponential pdf:
f(x; λ) = λ e^(−λx)   for x ≥ 0, and 0 otherwise.

Solution

We derive a general formula with a random sample {X1, . . . , Xn} first. The joint pdf is:

f(x1, . . . , xn; λ) = λ^n e^(−λnx̄)   for x1, . . . , xn ≥ 0, and 0 otherwise.

With all xi ≥ 0, L(λ) = λ^n e^(−λnX̄), hence the log-likelihood function is:

l(λ) = log L(λ) = n log λ − λnX̄.

Setting:

(d/dλ) l(λ) = n/λ̂ − nX̄ = 0   ⇒   λ̂ = 1/X̄.

For the given sample, x̄ = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2. Therefore, λ̂ = 0.1220.
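The same estimate in R (an illustration only):

x <- c(8.2, 10.6, 9.1, 4.9)
1 / mean(x)   # approximately 0.1220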
Activity 7.15 Let {X1 , . . . , Xn } be a random sample from a population with the
probability distribution specified in (a) and (b) below, respectively. Find the MLEs
of the following parameters.
(a) λ, µ = 1/λ and θ = λ2 , when the population has an exponential distribution
with pdf f (x; λ) = λ e−λx for x > 0, and 0 otherwise.
(b) π and θ = π/(1 − π), when the population has a Bernoulli (two-point)
distribution, that is p(1; π) = π = 1 − p(0; π), and 0 otherwise.
Solution

(a) The joint pdf is:

f(x1, . . . , xn; λ) = λ^n exp(−λ ∑_{i=1}^{n} xi)   for all x1, . . . , xn > 0, and 0 otherwise.

Noting that ∑_{i=1}^{n} Xi = nX̄, the likelihood function is:

L(λ) = λ^n e^(−λnX̄).

The log-likelihood function is:

l(λ) = n log λ − λnX̄.

Setting:

(d/dλ) l(λ) = n/λ̂ − nX̄ = 0

we obtain the MLE λ̂ = 1/X̄. The MLE of µ is:

µ̂ = µ(λ̂) = 1/λ̂ = X̄

and the MLE of θ is:

θ̂ = θ(λ̂) = (λ̂)² = X̄^(−2)

making use of the invariance principle in each case.

(b) The joint probability function is:

∏_{i=1}^{n} p(xi; π) = π^y (1 − π)^(n−y)

where y = ∑_{i=1}^{n} xi. The likelihood function is:

L(π) = π^Y (1 − π)^(n−Y).

The log-likelihood function is:

l(π) = Y log π + (n − Y) log(1 − π).

Setting:

(d/dπ) l(π) = Y/π̂ − (n − Y)/(1 − π̂) = 0

we obtain the MLE π̂ = Y/n = X̄. The MLE of θ is:

θ̂ = θ(π̂) = π̂/(1 − π̂) = X̄/(1 − X̄)

making use of the invariance principle again.
Activity 7.16 Let {X1 , . . . , Xn } be a random sample from the distribution
N (µ, 1). Find the MLE of µ.
Solution

The joint pdf of the observations is:

f(x1, . . . , xn; µ) = ∏_{i=1}^{n} (1/√(2π)) exp(−(xi − µ)²/2) = (2π)^(−n/2) exp(−(1/2) ∑_{i=1}^{n} (xi − µ)²).

We write the above as a function of µ only:

L(µ) = C exp(−(1/2) ∑_{i=1}^{n} (Xi − µ)²)

where C > 0 is a constant. The MLE µ̂ maximises this function, and also maximises
the function:

l(µ) = log L(µ) = −(1/2) ∑_{i=1}^{n} (Xi − µ)² + log(C).

Therefore, the MLE effectively minimises ∑_{i=1}^{n} (Xi − µ)², i.e. the MLE is also the least
squares estimator (LSE), i.e. µ̂ = X̄.
7.8 Overview of chapter
This chapter introduced point estimation. Key properties of estimators were explored
and the characteristics of a desirable estimator were studied through the calculation of
the mean squared error. Methods for finding estimators of parameters were also
described, including method of moments, least squares and maximum likelihood
estimation.
7.9 Key terms and concepts
Bias
Consistent estimator
Invariance principle
Law of large numbers (LLN)
Least squares estimation
Likelihood function
Log-likelihood function
Maximum likelihood estimation
Mean squared error (MSE)
Method of moments estimation
Parameter
Point estimate
Point estimator
Population moment
Sample moment
Statistic
Unbiased

7.10 Sample examination questions
Solutions can be found in Appendix C.
1. Let {X1 , . . . , Xn } be a random sample from the (continuous) uniform distribution
such that X ∼ Uniform[0, θ], where θ > 0.
(a) Find the method of moments estimator (MME) of θ. (Note you should derive
any required population moments.)
(b) If n = 3, with the observed data x1 = 0.2, x2 = 3.6 and x3 = 1.1, use the MME
obtained in (a) to compute the point estimate of θ for this sample. Do you trust
this estimate? Justify your answer.
Hint: You may wish to make reference to the law of large numbers.
2. Suppose that you are given independent observations y1 , y2 and y3 such that:
y1 = α + β + ε1
y2 = α + 2β + ε2
y3 = α + 4β + ε3 .
The random variables εi , for i = 1, 2, 3, are normally distributed with a mean of 0
and a variance of 1.
(a) Find the least squares estimators of the parameters α and β, and verify that
they are unbiased estimators.
(b) Calculate the variance of the estimator of α.
3. A random sample {X1, X2, . . . , Xn} is drawn from the following probability
distribution:

p(x; λ) = λ^(2x) e^(−λ²)/x!   for x = 0, 1, 2, . . .

and 0 otherwise, where λ > 0.

(a) Derive the maximum likelihood estimator of λ.

(b) State the maximum likelihood estimator of θ = λ³.
Chapter 8
Interval estimation
8.1 Synopsis of chapter
This chapter covers interval estimation – a natural extension of point estimation. Due
to the almost inevitable sampling error, we wish to communicate the level of
uncertainty in our point estimate by constructing confidence intervals.
8.2 Learning outcomes
After completing this chapter, you should be able to:
explain the coverage probability of a confidence interval
construct confidence intervals for means of normal and non-normal populations
when the variance is known and unknown
construct confidence intervals for the variance of a normal population
explain the link between confidence intervals and distribution theory, and critique
the assumptions made to justify the use of various confidence intervals.
8.3 Introduction
Point estimation is simple but not informative enough, since a point estimator is
always subject to errors. A more scientific approach is to find an upper bound
U = U (X1 , . . . , Xn ) and a lower bound L = L(X1 , . . . , Xn ), and hope that the unknown
parameter θ lies between the two bounds L and U (life is not always as simple as that,
but it is a good start).
An intuitive guess for estimating the population mean would be:
L = X̄ − k × S.E.(X̄) and U = X̄ + k × S.E.(X̄)
where k > 0 is a constant and S.E.(X̄) is the standard error of the sample mean.
The (random) interval (L, U ) forms an interval estimator of θ. For estimation to be
as precise as possible, intuitively the width of the interval, U − L, should be small.
Typically, the coverage probability satisfies:

P(L(X1, . . . , Xn) < θ < U(X1, . . . , Xn)) < 1.
Ideally, we should choose L and U such that:
the width of the interval is as small as possible
the coverage probability is as large as possible.
Activity 8.1 Why do we not always choose a very high confidence level for a
confidence interval?
Solution
We do not always want to use a very high confidence level because the confidence
interval would be very wide. We have a trade-off between the width of the confidence
interval and the coverage probability.
8.4 Interval estimation for means of normal distributions
Let us consider a simple example. We have a random sample {X1 , . . . , Xn } from the
distribution N (µ, σ 2 ), with σ 2 known.
From Chapter 7, we have reason to believe that X̄ is a good estimator of µ. We also
know X̄ ∼ N (µ, σ 2 /n), and hence:
(X̄ − µ)/(σ/√n) ∼ N(0, 1).

Therefore, supposing a 95% coverage probability:

0.95 = P(|X̄ − µ|/(σ/√n) ≤ 1.96)
     = P(|µ − X̄| ≤ 1.96 × σ/√n)
     = P(−1.96 × σ/√n < µ − X̄ < 1.96 × σ/√n)
     = P(X̄ − 1.96 × σ/√n < µ < X̄ + 1.96 × σ/√n).

Therefore, the interval covering µ with probability 0.95 is:

(X̄ − 1.96 × σ/√n, X̄ + 1.96 × σ/√n)

which is called a 95% confidence interval for µ.
Example 8.1 Suppose σ = 1, n = 4, and x̄ = 2.25, then a 95% confidence interval
for µ is:

(2.25 − 1.96 × 1/√4, 2.25 + 1.96 × 1/√4) = (1.27, 3.23).

Instead of a simple point estimate of µ̂ = 2.25, we say µ is between 1.27 and 3.23 at
the 95% confidence level.
What is P (1.27 < µ < 3.23) = 0.95 in Example 8.1? Well, this probability does not
mean anything, since µ is an unknown constant!
We treat (1.27, 3.23) as one realisation of the random interval (X̄ − 0.98, X̄ + 0.98)
which covers µ with probability 0.95.
What is the meaning of ‘with probability 0.95’ ? If one repeats the interval estimation a
large number of times, about 95% of the time the interval estimator covers the true µ.
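This repeated-sampling interpretation is easy to demonstrate by simulation in R (a sketch, not from the guide; the values µ = 2, σ = 1, n = 4 and the seed are hypothetical choices).

# Proportion of intervals X-bar +/- 1.96*sigma/sqrt(n) that cover the true mu
set.seed(5)
mu <- 2; sigma <- 1; n <- 4; reps <- 10000
covered <- logical(reps)
for (r in 1:reps) {
  xbar <- mean(rnorm(n, mu, sigma))
  lower <- xbar - 1.96 * sigma / sqrt(n)
  upper <- xbar + 1.96 * sigma / sqrt(n)
  covered[r] <- (lower < mu) & (mu < upper)
}
mean(covered)   # close to 0.95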
Some remarks are the following.
i. The confidence level is often specified as 90%, 95% or 99%. Obviously the higher
the confidence level, the wider the interval.
For the normal distribution example:
0.90 = P(|X̄ − µ|/(σ/√n) ≤ 1.645) = P(X̄ − 1.645 × σ/√n < µ < X̄ + 1.645 × σ/√n)

0.95 = P(|X̄ − µ|/(σ/√n) ≤ 1.96) = P(X̄ − 1.96 × σ/√n < µ < X̄ + 1.96 × σ/√n)

0.99 = P(|X̄ − µ|/(σ/√n) ≤ 2.576) = P(X̄ − 2.576 × σ/√n < µ < X̄ + 2.576 × σ/√n).

The widths of the three intervals are 2 × 1.645 × σ/√n, 2 × 1.96 × σ/√n and
2 × 2.576 × σ/√n, corresponding to the confidence levels of 90%, 95% and 99%,
respectively.
To achieve a 100% confidence level in the normal example, the width of the interval
would have to be infinite!
ii. Among all the confidence intervals at the same confidence level, the one with the
smallest width gives the most accurate estimation and is, therefore, optimal.
iii. For a distribution with a symmetric unimodal density function, optimal confidence
intervals are symmetric, as depicted in Figure 8.1.
Figure 8.1: Symmetric unimodal density function showing that a given probability is
represented by the narrowest interval when symmetric about the mean.
Activity 8.2
(a) Find the length of a 95% confidence interval for the mean of a normal
distribution with known variance σ 2 .
(b) Find the minimum sample size such that the width of a 95% confidence interval
is not wider than d, where d > 0 is a prescribed constant.
Solution

(a) With an available random sample {X1, . . . , Xn} from the normal distribution
N(µ, σ²) with σ² known, a 95% confidence interval for µ is of the form:

(X̄ − 1.96 × σ/√n, X̄ + 1.96 × σ/√n).

Hence the width of the confidence interval is:

(X̄ + 1.96 × σ/√n) − (X̄ − 1.96 × σ/√n) = 2 × 1.96 × σ/√n = 3.92 × σ/√n.

(b) Let 3.92 × σ/√n ≤ d, and so we obtain the condition for the required sample size:

n ≥ (3.92 × σ/d)² = 15.37 × σ²/d².

Therefore, in order to achieve the required accuracy, the sample size n should be
at least as large as 15.37 × σ²/d².

Note that as the variance σ² increases, the confidence interval width increases, and as
the sample size n increases, the confidence interval width decreases. Also, note that
when σ² is unknown, the width of a confidence interval for µ depends on S. Therefore,
the width is a random variable.
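The sample size condition can be wrapped in a small R function (a sketch based on this solution; the inputs σ = 10 and d = 5 are hypothetical values):

min_n <- function(sigma, d) ceiling((2 * qnorm(0.975) * sigma / d)^2)
min_n(sigma = 10, d = 5)   # 62, the smallest n giving a 95% CI no wider than 5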
Activity 8.3 Assume that the random variable X is normally distributed and that
σ 2 is known. What confidence level would be associated with each of the following
intervals?
(a) The interval:

(x̄ − 1.645 × σ/√n, x̄ + 2.326 × σ/√n).

(b) The interval:

(−∞, x̄ + 2.576 × σ/√n).

(c) The interval:

(x̄ − 1.645 × σ/√n, x̄).

Solution

We have X̄ ∼ N(µ, σ²/n), hence √n(X̄ − µ)/σ ∼ N(0, 1).

(a) P(−1.645 < Z < 2.326) = 0.94, hence a 94% confidence level.

(b) P(−∞ < Z < 2.576) = 0.995, hence a 99.5% confidence level.

(c) P(−1.645 < Z < 0) = 0.45, hence a 45% confidence level.
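These confidence levels follow directly from the standard normal cdf in R (an illustration only):

pnorm(2.326) - pnorm(-1.645)   # (a) about 0.94
pnorm(2.576)                   # (b) about 0.995
pnorm(0) - pnorm(-1.645)       # (c) about 0.45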
Activity 8.4 A personnel manager has found that historically the scores on
aptitude tests given to applicants for entry-level positions are normally distributed
with σ = 32.4 points. A random sample of nine test scores from the current group of
applicants had a mean score of 187.9 points.
(a) Find an 80% confidence interval for the population mean score of the current
group of applicants.
(b) Based on these sample results, a statistician found for the population mean a
confidence interval extending from 165.8 to 210.0 points. Find the confidence
level of this interval.
Solution

(a) We have n = 9, x̄ = 187.9, σ = 32.4 and 1 − α = 0.8, hence α/2 = 0.1 and, from
Table 4 of the New Cambridge Statistical Tables, P(Z > 1.282) = 1 − Φ(1.282) = 0.1.
So an 80% confidence interval is:

187.9 ± 1.282 × 32.4/√9   ⇒   (174.05, 201.75).

(b) The half-width of the confidence interval is 210.0 − 187.9 = 22.1, which is equal
to the margin of error, i.e. we have:

22.1 = k × σ/√n = k × 32.4/√9   ⇒   k = 2.05.

P(Z > 2.05) = 1 − Φ(2.05) = 0.02018 = α/2 ⇒ α = 0.04036. Hence we have a
100(1 − α)% = 100(1 − 0.04036)% ≈ 96% confidence interval.
Activity 8.5 Five independent samples, each of size n, are to be drawn from a
normal distribution where σ² is known. For each sample, the interval:

(x̄ − 0.96 × σ/√n, x̄ + 1.06 × σ/√n)

will be constructed. What is the probability that at least four of the intervals will
contain the unknown µ?

Solution

The probability that the given interval will contain µ is:

P(−0.96 < Z < 1.06) = 0.6869.

The probability of four or five such intervals is binomial with n = 5 and π = 0.6869,
so let the number of such intervals be Y ∼ Bin(5, 0.6869). The required probability is:

P(Y ≥ 4) = (5 choose 4) (0.6869)⁴ (0.3131) + (5 choose 5) (0.6869)⁵ = 0.5014.
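The two steps of the solution can be checked in R (an illustration only):

p <- pnorm(1.06) - pnorm(-0.96)        # coverage probability of one interval, about 0.6869
sum(dbinom(4:5, size = 5, prob = p))   # P(Y >= 4), about 0.5014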
Dealing with unknown σ

In practice the standard deviation σ is typically unknown, and we replace it with the
sample standard deviation:

S = ((1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)²)^(1/2)

leading to a confidence interval for µ of the form:

(X̄ − k × S/√n, X̄ + k × S/√n)

where k is a constant determined by the confidence level and also by the distribution of
the statistic:

(X̄ − µ)/(S/√n).     (8.1)

However, the distribution of (8.1) is no longer normal – it is the Student’s t distribution.
8.4.1 An important property of normal samples

Let {X1, . . . , Xn} be a random sample from N(µ, σ²). Suppose:

X̄ = (1/n) ∑_{i=1}^{n} Xi,    S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)²    and    E.S.E.(X̄) = S/√n

where E.S.E.(X̄) denotes the estimated standard error of the sample mean.

i. X̄ ∼ N(µ, σ²/n) and (n − 1)S²/σ² ∼ χ²_{n−1}.

ii. X̄ and S² are independent, therefore:

(X̄ − µ)/(S/√n) = (X̄ − µ)/E.S.E.(X̄) = (√n(X̄ − µ)/σ)/√((n − 1)S²/((n − 1)σ²)) ∼ t_{n−1}.

An accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:

(X̄ − c × S/√n, X̄ + c × S/√n) = (X̄ − c × E.S.E.(X̄), X̄ + c × E.S.E.(X̄))

where c > 0 is a constant such that P(T > c) = α/2, where T ∼ t_{n−1}.
Activity 8.6 Suppose that 9 bags of sugar are selected from the supermarket shelf
at random and weighed. The weights in grammes are 812.0, 786.7, 794.1, 791.6,
811.1, 797.4, 797.8, 800.8 and 793.2. Construct a 95% confidence interval for the
mean weight of all the bags on the shelf. Assume the population is normal.
Solution
Here we have a random sample of size n = 9. The mean is 798.30. The sample
variance is s2 = 72.76, which gives a sample standard deviation s = 8.53. From Table
10 of the New Cambridge Statistical Tables, the top 2.5th percentile of the t
distribution with n − 1 = 8 degrees of freedom is 2.306. Therefore, a 95% confidence
interval is:
(798.30 − 2.306 × 8.53/√9, 798.30 + 2.306 × 8.53/√9) = (798.30 − 6.56, 798.30 + 6.56)
                                                      = (791.74, 804.86).
It is sometimes more useful to write this as 798.30 ± 6.56.
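The same interval can be obtained in R, either by hand or with t.test() (an illustration only):

weights <- c(812.0, 786.7, 794.1, 791.6, 811.1, 797.4, 797.8, 800.8, 793.2)
mean(weights) + c(-1, 1) * qt(0.975, df = 8) * sd(weights) / sqrt(9)
t.test(weights)$conf.int   # approximately (791.7, 804.9)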
Activity 8.7 Continuing the previous activity, suppose we are now told that σ, the
population standard deviation, is known to be 8.5 g. Construct a 95% confidence
interval using this information.
Solution

From Table 10 of the New Cambridge Statistical Tables, the top 2.5th percentile of
the standard normal distribution is z0.025 = 1.96 (recall t∞ = N(0, 1)) so a 95%
confidence interval for the population mean is:

(798.30 − 1.96 × 8.5/√9, 798.30 + 1.96 × 8.5/√9) = (798.30 − 5.55, 798.30 + 5.55)
                                                  = (792.75, 803.85).

Again, it may be more useful to write this as 798.30 ± 5.55. Note that this
confidence interval is less wide than the one in the previous question, even though
our initial estimate s turned out to be very close to the true value of σ.
Activity 8.8 A business requires an inexpensive check on the value of stock in its
warehouse. In order to do this, a random sample of 50 items is taken and valued.
The average value of these is computed to be £320.41 with a (sample) standard
deviation of £40.60. It is known that there are 9,875 items in the total stock.
Assume a normal distribution.
(a) Estimate the total value of the stock to the nearest £10,000.
(b) Construct a 95% confidence interval for the mean value of all items and hence
construct a 95% confidence interval for the total value of the stock.
(c) You are told that the confidence interval in (b) is too wide for decision-making
purposes and you are asked to assess how many more items would need to be
sampled to obtain a confidence interval with the same level of confidence, but
with half the width.
Solution
(a) The total value of the stock is 9875µ, where µ is the mean value of an item of
stock. From Chapter 7, X̄ is the obvious estimator of µ, so 9875X̄ is the obvious
estimator of 9875µ. Therefore, an estimate for the total value of the stock is
9875 × 320.41 = £3,160,000 (to the nearest £10,000).
(b) In this question n = 50 is large, and σ² is unknown so a 95% confidence interval
for µ is:

x̄ ± 1.96 × s/√n = 320.41 ± 1.96 × 40.6/√50 = 320.41 ± 11.25   ⇒   (£309.16, £331.66).
Note that because n is large we have used the standard normal distribution. It
is more accurate to use a t distribution with 49 degrees of freedom. This gives
an interval of (£308.87, £331.95) – not much of a difference.
To obtain a 95% confidence interval for the total value of the stock, 9875µ,
multiply the interval by 9875. This gives (to the nearest £10,000):
(£3,050,000, £3,280,000).
(c) Increasing the sample size by a factor of k reduces the width of the confidence
interval by a factor of √k. Therefore, increasing the sample size by a factor of 4
will reduce the width of the confidence interval by a factor of 2 (= √4). Hence
we need to increase the sample size from 50 to 4 × 50 = 200. So we should
collect another 150 observations.
Activity 8.9 In a survey of students, the number of hours per week of private
study is recorded. For a random sample of 23 students, the sample mean is 18.4
hours and the sample standard deviation is 3.9 hours. Treat the data as a random
sample from a normal distribution.
(a) Find a 99% confidence interval for the mean number of hours per week of
private study in the student population.
(b) Recompute your confidence interval in the case that the sample size is, in fact,
121, but the sample mean and sample standard deviation values are unchanged.
Comment on the two intervals.
Solution

We have x̄ = 18.4 and s = 3.9, so a 99% confidence interval is of the form:

x̄ ± t_{n−1, 0.005} × s/√n.

(a) When n = 23, t_{22, 0.005} = 2.819. Hence a 99% confidence interval is:

18.4 ± 2.819 × 3.9/√23   ⇒   (16.11, 20.69).

(b) When n = 121, t_{120, 0.005} = 2.617. Hence a 99% confidence interval is:

18.4 ± 2.617 × 3.9/√121   ⇒   (17.47, 19.33).
In spite of the same sample mean and sample standard deviation, the sample of
size n = 121 offers a much more accurate estimate as the interval width is
merely 19.33 − 17.47 = 1.86 hours, in contrast to the interval width of
20.69 − 16.11 = 4.58 hours with the sample size of n = 23.
Note that to derive a confidence interval for µ with σ 2 unknown, the formula used in
the calculation involves both n and n − 1. We then refer to the Student’s t
distribution with n − 1 degrees of freedom.
Also, note that t120, α ≈ zα , where P (Z > zα ) = α for Z ∼ N (0, 1). Therefore, it
would be acceptable to use z0.005 = 2.576 as an approximation for t120, 0.005 = 2.617.
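Both intervals are easily reproduced in R (an illustration only):

xbar <- 18.4; s <- 3.9
xbar + c(-1, 1) * qt(0.995, df = 22) * s / sqrt(23)     # (a) about (16.11, 20.69)
xbar + c(-1, 1) * qt(0.995, df = 120) * s / sqrt(121)   # (b) about (17.47, 19.33)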
8.4.2 Means of non-normal distributions
Let {X1, . . . , Xn} be a random sample from a non-normal distribution with mean µ and
variance σ² < ∞.

When n is large, √n(X̄ − µ)/σ is N(0, 1) approximately.

Therefore, we have an approximate 95% confidence interval for µ given by:

(X̄ − 1.96 × S/√n, X̄ + 1.96 × S/√n)

where S is the sample standard deviation. Note that it is a two-stage approximation.

1. Approximate the distribution of √n(X̄ − µ)/σ by N(0, 1).

2. Approximate σ by S.
Example 8.2 The salary data of 253 graduates from a UK business school (in
thousands of pounds) yield the following: n = 253, x̄ = 47.126, s = 6.843 and so
s/√n = 0.43.

A point estimate of the average salary µ is x̄ = 47.126.

An approximate 95% confidence interval for µ is:

47.126 ± 1.96 × 0.43   ⇒   (46.283, 47.969).
Activity 8.10 Suppose a random survey of 400 first-time home buyers finds that
the sample mean of annual household income is £36,000 and the sample standard
deviation is £17,000.
(a) An economist believes that the ‘true’ standard deviation is σ = £12,000. Based
on this assumption, find an approximate 90% confidence interval for µ, i.e. for
the average annual household income of all first-time home buyers.
(b) Without the assumption that σ is known, find an approximate 90% confidence
interval for µ.
(c) Are the two confidence intervals very different? Which one would you trust
more, and why?
Solution
(a) Based on the central limit theorem for the sample mean, an approximate 90%
confidence interval is:

x̄ ± z0.05 × σ/√n = 36000 ± 1.645 × 12000/√400 = 36000 ± 987   ⇒   (£35,013, £36,987).

We may interpret this result as follows. According to the assumption made by
the economist and the survey results, we may conclude at the 90% confidence
level that the average of all first-time home buyers’ incomes is between £35,013
and £36,987.

Note that it is wrong to conclude that 90% of all first-time home buyers’
incomes are between £35,013 and £36,987.

(b) Replacing σ = 12000 by s = 17000, we obtain an approximate 90% confidence
interval of:

x̄ ± z0.05 × s/√n = 36000 ± 1.645 × 17000/√400 = 36000 ± 1398   ⇒   (£34,602, £37,398).
Now, according to the survey results (only), we may conclude at the 90%
confidence level that the average of all first-time home buyers’ incomes is
between £34,602 and £37,398.
(c) The interval estimates are different. The first one gives a smaller range by £822.
This was due to the fact that the economist’s assumed σ of £12,000 is much
smaller than the sample standard deviation, s, of £17,000. With a sample size
as large as 400, we would think that we should trust the data more than an
assumption by an economist!
The key question is whether σ being £12,000 is a reasonable assumption. This
issue will be properly addressed using statistical hypothesis testing.
Activity 8.11 In a study of consumers’ views on guarantees for new products, 370
out of a random sample of 425 consumers agreed with the statement: ‘Product
guarantees are worded more for lawyers to understand than to be easily understood
by consumers.’
(a) Find an approximate 95% confidence interval for the population proportion of
consumers agreeing with this statement.
(b) Would a 99% confidence interval for the population proportion be wider or
narrower than that found in (a)? Explain your answer.
Solution
The population is a Bernoulli distribution on two points: 1 (agree) and 0 (disagree).
We have a random sample of size n = 425, i.e. {X1 , . . . , X425 }. Let π = P (Xi = 1),
hence E(Xi ) = π and Var(Xi ) = π (1 − π) for i = 1, . . . , 425. The sample mean and
variance are:
425
370
1 X
xi =
= 0.8706
x̄ =
425 i=1
425
and:
1
s2 =
424
425
X
i=1
!
x2i − 425x̄2
=
1
370 − 425 × (0.8706)2 = 0.1129.
424
(a) Based on the central limit theorem for the sample mean, an approximate 95%
confidence interval for π is:
x̄ ± z_{0.025} × s/√n = 0.8706 ± 1.96 × √(0.1129/425) = 0.8706 ± 0.0319 ⇒ (0.8387, 0.9025).
(b) For a 99% confidence interval, we use z0.005 = 2.576 instead of z0.025 = 1.96 in
the above formula. Therefore, the confidence interval becomes wider.
Note that the width of a confidence interval is a random variable, i.e. it varies from
sample to sample. The comparison in (b) above is with the understanding that the
same random sample is used to construct the two confidence intervals.
Be sure to pay close attention to how we interpret confidence intervals in the context
of particular practical problems.
Activity 8.12
(a) A sample of 954 adults in early 1987 found that 23% of them held shares. Given
a UK adult population of 41 million and assuming a proper random sample was
taken, construct a 95% confidence interval estimate for the number of
shareholders in the UK.
(b) A ‘similar’ survey the previous year had found a total of 7 million shareholders.
Assuming ‘similar’ means the same sample size, construct a 95% confidence
interval estimate of the increase in shareholders between the two years.
Solution
(a) Let π be the proportion of shareholders in the population. Start by estimating
π. We are estimating a proportion and n is large, so an approximate 95%
confidence interval for π is, using the central limit theorem:
π̂ ± 1.96 × √(π̂(1 − π̂)/n) ⇒ 0.23 ± 1.96 × √(0.23 × 0.77/954) = 0.23 ± 0.027 ⇒ (0.203, 0.257).
Therefore, a 95% confidence interval for the number (rather than the
proportion) of shareholders in the UK is obtained by multiplying the above
interval endpoints by 41 million and getting the answer 8.3 million to 10.5
million. An alternative way of expressing this is:
9,400,000 ± 1,100,000
⇒
(8,300,000, 10,500,000).
Therefore, we estimate there are about 9.4 million shareholders in the UK, with
a margin of error of 1.1 million.
(b) Let us start by finding a 95% confidence interval for the difference in the two
proportions. We use the formula:
π̂₁ − π̂₂ ± 1.96 × √(π̂₁(1 − π̂₁)/n₁ + π̂₂(1 − π̂₂)/n₂).
The estimates of the proportions π1 and π2 are 0.23 and 0.171, respectively. We
know n1 = 954 and although n2 is unknown we can assume it is approximately
equal to 954 (noting the ‘similar’ in the question), so an approximate 95%
confidence interval is:
0.23 − 0.171 ± 1.96 × √(0.23 × 0.77/954 + 0.171 × 0.829/954) = 0.059 ± 0.036 ⇒ (0.023, 0.094).
By multiplying by 41 million, we get a confidence interval of:
2,400,000 ± 1,500,000 ⇒ (900,000, 3,900,000).
We estimate that the number of shareholders has increased by about 2.4 million
in the two years. There is quite a large margin of error, i.e. 1.5 million, especially
when compared with a point estimate (i.e. interval midpoint) of 2.4 million.
8.5 Use of the chi-squared distribution
Let Y1 , . . . , Yn be independent N (µ, σ 2 ) random variables. Therefore:
(Yi − µ)/σ ∼ N(0, 1).
Hence:
(1/σ²) Σ_{i=1}^{n} (Yi − µ)² ∼ χ²_n.
Note that:
(1/σ²) Σ_{i=1}^{n} (Yi − µ)² = (1/σ²) Σ_{i=1}^{n} (Yi − Ȳ)² + n(Ȳ − µ)²/σ².    (8.2)
Proof : We have:
Σ_{i=1}^{n} (Yi − µ)² = Σ_{i=1}^{n} ((Yi − Ȳ) + (Ȳ − µ))²
= Σ_{i=1}^{n} (Yi − Ȳ)² + Σ_{i=1}^{n} (Ȳ − µ)² + 2 Σ_{i=1}^{n} (Yi − Ȳ)(Ȳ − µ)
= Σ_{i=1}^{n} (Yi − Ȳ)² + n(Ȳ − µ)² + 2(Ȳ − µ) Σ_{i=1}^{n} (Yi − Ȳ)
= Σ_{i=1}^{n} (Yi − Ȳ)² + n(Ȳ − µ)².
Hence:
(1/σ²) Σ_{i=1}^{n} (Yi − µ)² = (1/σ²) Σ_{i=1}^{n} (Yi − Ȳ)² + n(Ȳ − µ)²/σ².
Since Ȳ ∼ N(µ, σ²/n), then n(Ȳ − µ)²/σ² ∼ χ²₁. It can be proved that:
(1/σ²) Σ_{i=1}^{n} (Yi − Ȳ)² ∼ χ²_{n−1}.
Therefore, decomposition (8.2) is an instance of the relationship:
χ²_n = χ²_{n−1} + χ²₁.
8.6 Interval estimation for variances of normal distributions
Let {X1, . . . , Xn} be a random sample from a normal population with mean µ and variance σ² < ∞.
Let M = Σ_{i=1}^{n} (Xi − X̄)² = (n − 1)S², then M/σ² ∼ χ²_{n−1}.
For any given small α ∈ (0, 1), we can find 0 < k₁ < k₂ such that:
P(X < k₁) = P(X > k₂) = α/2
where X ∼ χ²_{n−1}. Therefore:
1 − α = P(k₁ < M/σ² < k₂) = P(M/k₂ < σ² < M/k₁).
Hence a 100(1 − α)% confidence interval for σ² is:
(M/k₂, M/k₁).
Example 8.3 Suppose n = 15 and the sample variance is s2 = 24.5. Let α = 0.05.
From Table 8 of the New Cambridge Statistical Tables, we find:
P(X < 5.629) = P(X > 26.119) = 0.025
where X ∼ χ²₁₄.
Hence a 95% confidence interval for σ² is:
(M/26.119, M/5.629) = (14 × s²/26.119, 14 × s²/5.629) = (0.536 × s², 2.487 × s²) = (13.132, 60.934).
In the above calculation, we have used the formula:
S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)² = M/(n − 1).
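For those who wish to check such an interval by computer, a minimal sketch in Python is given below (assuming scipy is available; chi2.ppf plays the role of Table 8 of the New Cambridge Statistical Tables):

from scipy.stats import chi2

n, s2 = 15, 24.5                   # sample size and sample variance from Example 8.3
m = (n - 1) * s2                   # M = (n - 1)S^2
k1 = chi2.ppf(0.025, df=n - 1)     # lower 2.5% point of chi-squared(14), approx 5.629
k2 = chi2.ppf(0.975, df=n - 1)     # upper 2.5% point of chi-squared(14), approx 26.119
print(m / k2, m / k1)              # approximately (13.132, 60.934)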
Activity 8.13 A random sample of size n = 16 drawn from a normal distribution
had a sample variance of s2 = 32.76. Construct a 99% confidence interval for σ 2 .
Solution
For a 99% confidence interval, we need the lower and upper half percentile values from the χ²_{n−1} = χ²_{15} distribution. These are χ²_{0.995, 15} = 4.601 and χ²_{0.005, 15} = 32.801,
respectively. Hence we obtain:
((n − 1)s²/χ²_{α/2, n−1}, (n − 1)s²/χ²_{1−α/2, n−1}) = (15 × 32.76/32.801, 15 × 32.76/4.601) = (14.98, 106.80).
Note that this is a very wide confidence interval due to (i.) a high level of confidence
(99%), and (ii.) a small sample size (n = 16).
Activity 8.14 A manufacturer is concerned about the variability of the levels of
impurity contained in consignments of raw materials from a supplier. A random
sample of 10 consignments showed a standard deviation of 2.36 in the concentration
of impurity levels. Assume normality.
(a) Find a 95% confidence interval for the population variance.
(b) Would a 99% confidence interval for this variance be wider or narrower than
that found in (a)?
Solution
(a) We have n = 10, s² = (2.36)² = 5.5696, χ²_{0.975, 9} = 2.700 and χ²_{0.025, 9} = 19.023. Hence a 95% confidence interval for σ² is:
((n − 1)s²/χ²_{0.025, n−1}, (n − 1)s²/χ²_{0.975, n−1}) = (9 × 5.5696/19.023, 9 × 5.5696/2.700) = (2.64, 18.57).
(b) A 99% confidence interval would be wider since:
χ²_{0.995, n−1} < χ²_{0.975, n−1} and χ²_{0.005, n−1} > χ²_{0.025, n−1}.
Activity 8.15 Construct a 90% confidence interval for the variance of the bags of
sugar in Activity 8.6. Does the given value of 8.5 g for the population standard
deviation seem plausible?
Solution
We have n = 9 and s2 = 72.76. For a 90% confidence interval, we need the bottom
and top 5th percentiles of the chi-squared distribution on n − 1 = 8 degrees of
freedom. These are:
χ²_{0.95, 8} = 2.733 and χ²_{0.05, 8} = 15.507.
A 90% confidence interval is:
((n − 1)s²/χ²_{α/2, n−1}, (n − 1)s²/χ²_{1−α/2, n−1}) = ((9 − 1) × 72.76/15.507, (9 − 1) × 72.76/2.733) = (37.536, 213.010).
The corresponding values for the standard deviation are:
(√37.536, √213.010) = (6.127, 14.595).
The given value falls well within this confidence interval, so we have no reason to
doubt it.
Activity 8.16 The data below are from a random sample of size n = 9 taken from
the distribution N (µ, σ 2 ):
3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34, 14.31.
(a) Assume σ 2 = 16. Find a 95% confidence interval for µ. If the width of such a
confidence interval must not exceed 2.5, at least how many observations do we
need?
(b) Suppose σ 2 is now unknown. Find a 95% confidence interval for µ. Compare the
result with that obtained in (a) and comment.
(c) Obtain a 95% confidence interval for σ 2 .
Solution
(a) We have x̄ = 6.74. For a 95% confidence interval, α = 0.05 so we need to find
the top 100α/2 = 2.5th percentile of N (0, 1), which is 1.96. Since σ = 4 and
n = 9, a 95% confidence interval for µ is:
x̄ ± 1.96 × σ/√n ⇒ (6.74 − 1.96 × 4/3, 6.74 + 1.96 × 4/3) = (4.13, 9.35).
In general, a 100(1 − α)% confidence interval for µ is:
(X̄ − z_{α/2} × σ/√n, X̄ + z_{α/2} × σ/√n)
where z_α denotes the top 100αth percentile of the standard normal distribution, i.e. such that:
P(Z > z_α) = α
where Z ∼ N(0, 1). Hence the width of the confidence interval is:
2 × z_{α/2} × σ/√n.
For this example, α = 0.05, z_{0.025} = 1.96 and σ = 4. Setting the width of the confidence interval to be at most 2.5, we have:
2 × 1.96 × σ/√n = 15.68/√n ≤ 2.5.
Hence:
n ≥ (15.68/2.5)² = 39.34.
So we need a sample of at least 40 observations in order to obtain a 95%
confidence interval with a width not greater than 2.5.
(b) When σ² is unknown, a 95% confidence interval for µ is:
(X̄ − t_{α/2, n−1} × S/√n, X̄ + t_{α/2, n−1} × S/√n)
where S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), and t_{α, k} denotes the top 100αth percentile of the Student's t_k distribution, i.e. such that:
P(T > t_{α, k}) = α
for T ∼ t_k. For this example, s² = 16, s = 4, n = 9 and t_{0.025, 8} = 2.306. Hence a 95% confidence interval for µ is:
6.74 ± 2.306 × 4/3 ⇒ (3.67, 9.81).
This confidence interval is much wider than the one obtained in (a). Since we do
not know σ 2 , we have less information available for our estimation. It is only
natural that our estimation becomes less accurate.
Note that although the sample size is n, the Student’s t distribution used has
only n − 1 degrees of freedom. The loss of 1 degree of freedom in the sample
variance is due to not knowing µ. Hence we estimate µ using the data, for which
we effectively pay a ‘price’ of one degree of freedom.
(c) Note (n − 1)S²/σ² ∼ χ²_{n−1} = χ²₈. From Table 8 of the New Cambridge Statistical Tables, for X ∼ χ²₈, we find that:
P(X < 2.180) = P(X > 17.535) = 0.025.
Hence:
P(2.180 < 8 × S²/σ² < 17.535) = 0.95.
Therefore, the lower bound for σ² is 8 × s²/17.535 = 7.298, and the upper bound is 8 × s²/2.180 = 58.716. Therefore, a 95% confidence interval for σ², noting s² = 16, is:
(7.30, 58.72).
Note that the estimation in this example is rather inaccurate. This is due to two
reasons.
i. The sample size is small.
ii. The population variance, σ 2 , is large.
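For readers who want to verify all three intervals directly from the raw data, a minimal sketch in Python (assuming numpy and scipy are available):

import numpy as np
from scipy.stats import norm, t, chi2

x = np.array([3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34, 14.31])
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)   # ddof=1 gives the sample variance

# (a) sigma^2 = 16 assumed known
half = norm.ppf(0.975) * 4 / n**0.5
print(xbar - half, xbar + half)                 # approximately (4.13, 9.35)

# (b) sigma^2 unknown: Student's t on n - 1 degrees of freedom
half = t.ppf(0.975, df=n - 1) * (s2 / n)**0.5
print(xbar - half, xbar + half)                 # approximately (3.67, 9.81)

# (c) interval for sigma^2 based on the chi-squared distribution
m = (n - 1) * s2
print(m / chi2.ppf(0.975, df=n - 1), m / chi2.ppf(0.025, df=n - 1))  # approximately (7.30, 58.72)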
8.7 Overview of chapter
This chapter covered interval estimation. A confidence interval converts a point
estimate of an unknown parameter into an interval estimate, reflecting the likely
sampling error. The chapter demonstrated how to construct confidence intervals for
means and variances of normal populations.
8.8 Key terms and concepts
Confidence interval
Coverage probability
Interval estimator
Interval width

8.9 Sample examination questions
Solutions can be found in Appendix C.
1. Let {X1 , . . . , Xn } be a random sample from N (µ, σ 2 ), where σ 2 is unknown. Derive
the endpoints of an accurate 100(1 − α)% confidence interval for µ in this situation,
where α ∈ (0, 1).
2. A country is considering joining the European Union. In a study of voters’ views on
a forthcoming referendum, 163 out of a random sample of 250 voters agreed with
the statement: ‘The government should seek membership of the European Union.’
Find an approximate 99% confidence interval for the population proportion of all
voters agreeing with this statement.
3. A random sample of size n = 10 drawn from a normal distribution had a sample
variance of s² = 21.05. Construct a 90% confidence interval for σ². Note that P(X < 3.325) = 0.05, where X ∼ χ²₉.
Chapter 9
Hypothesis testing
9.1 Synopsis of chapter
This chapter discusses hypothesis testing, which is used to answer questions about an
unknown parameter. We consider how to perform an appropriate hypothesis test for a
given problem, determine error probabilities and test power, and draw appropriate
conclusions from a hypothesis test.
9.2 Learning outcomes
After completing this chapter, you should be able to:
define and apply the terminology of hypothesis testing
conduct statistical tests of all the types covered in the chapter
calculate the power of some of the simpler tests
explain the construction of rejection regions as a consequence of prior distributional
results, with reference to the significance level and power.
9.3 Introduction
Hypothesis testing and statistical estimation are the two most frequently-used statistical inference methods. Hypothesis testing addresses a different type of practical question from statistical estimation.
Based on the data, a (statistical) test makes a binary decision about a hypothesis, denoted by H0:
reject H0 or do not reject H0.
Activity 9.1 Why does it make no sense to use a hypothesis like x̄ = 2?
Solution
We can see immediately if x̄ = 2 by calculating the sample mean. Inference is
concerned with the population from which the sample was taken. We are not very
interested in the sample mean in its own right.
9.4 Introductory examples
Example 9.1 Consider a simple experiment – toss a coin 20 times.
Let {X1 , . . . , X20 } be the outcomes where ‘heads’ → Xi = 1, and ‘tails’ → Xi = 0.
Hence the probability distribution is P (Xi = 1) = π = 1 − P (Xi = 0), for π ∈ (0, 1).
Estimation would involve estimating π, using π̂ = X̄ = (X1 + · · · + X20)/20.
Testing involves assessing if a hypothesis such as ‘the coin is fair’ is true or not. For
example, this particular hypothesis can be formally represented as:
H0 : π = 0.5.
We cannot be sure what the answer is just from the data.
If π̂ = 0.9, H0 is unlikely to be true.
If π̂ = 0.45, H0 may be true (and also may be untrue).
If π̂ = 0.7, what to do then?
Example 9.2 A customer complains that the amount of coffee powder in a coffee
tin is less than the advertised weight of 3 pounds.
A random sample of 20 tins is selected, resulting in an average weight of x̄ = 2.897
pounds. Is this sufficient to substantiate the complaint?
Again statistical estimation cannot provide a firm answer, due to random
fluctuations between different random samples. So we cast the problem into a
hypothesis testing problem as follows.
Let the weight of coffee in a tin be a normal random variable X ∼ N (µ, σ 2 ). We
need to test the hypothesis µ < 3. In fact, we use the data to test the hypothesis:
H0 : µ = 3.
If we could reject H0 , the customer complaint would be vindicated.
Example 9.3 Suppose one is interested in evaluating the mean income (in £000s)
of a community. Suppose income in the population is modelled as N (µ, 25) and a
random sample of n = 25 observations is taken, yielding the sample mean x̄ = 17.
Independently of the data, three expert economists give their own opinions as
follows.
Dr A claims the mean income is µ = 16.
Ms B claims the mean income is µ = 15.
Mr C claims the mean income is µ = 14.
How would you assess these experts’ statements?
X̄ ∼ N (µ, σ 2 /n) = N (µ, 1). We assess the statements based on this distribution.
If Dr A’s claim is correct, X̄ ∼ N (16, 1). The observed value x̄ = 17 is one standard
deviation away from µ, and may be regarded as a typical observation from the
distribution. Hence there is little inconsistency between the claim and the data
evidence. This is shown in Figure 9.1.
If Ms B’s claim is correct, X̄ ∼ N (15, 1). The observed value x̄ = 17 begins to look a
bit ‘extreme’, as it is two standard deviations away from µ. Hence there is some
inconsistency between the claim and the data evidence. This is shown in Figure 9.2.
If Mr C’s claim is correct, X̄ ∼ N (14, 1). The observed value x̄ = 17 is very extreme,
as it is three standard deviations away from µ. Hence there is strong inconsistency
between the claim and the data evidence. This is shown in Figure 9.3.
Figure 9.1: Comparison of claim and data evidence for Dr A in Example 9.3.
Figure 9.2: Comparison of claim and data evidence for Ms B in Example 9.3.
9.5 Setting p-value, significance level, test statistic
A measure of the discrepancy between the hypothesised (claimed) value of µ and the
observed value X̄ = x̄ is the probability of observing X̄ = x̄ or more extreme values
under the null hypothesis. This probability is called the p-value.
Figure 9.3: Comparison of claim and data evidence for Mr C in Example 9.3.
Example 9.4 Continuing Example 9.3:
under H0 : µ = 16, P (X̄ ≥ 17) + P (X̄ ≤ 15) = P (|X̄ − 16| ≥ 1) = 0.317
under H0 : µ = 15, P (X̄ ≥ 17) + P (X̄ ≤ 13) = P (|X̄ − 15| ≥ 2) = 0.046
under H0 : µ = 14, P (X̄ ≥ 17) + P (X̄ ≤ 11) = P (|X̄ − 14| ≥ 3) = 0.003.
In summary, we reject the hypothesis µ = 15 or µ = 14, as, for example, if the
hypothesis µ = 14 is true, the probability of observing x̄ = 17, or more extreme
values, would be as small as 0.003. We are comfortable with this decision, as a small
probability event would be very unlikely to occur in a single experiment.
On the other hand, we cannot reject the hypothesis µ = 16. However, this does not
imply that this hypothesis is necessarily true, as, for example, µ = 17 or 18 are at
least as likely as µ = 16. Remember:
not reject ≠ accept.
A statistical test is incapable of ‘accepting’ a hypothesis.
Definition of p-values
A p-value is the probability of the event that the test statistic takes the observed
value or more extreme (i.e. more unlikely) values under H0 . It is a measure of the
discrepancy between the hypothesis H0 and the data.
• A ‘small’ p-value indicates that H0 is not supported by the data.
• A ‘large’ p-value indicates that H0 is not inconsistent with the data.
So p-values may be seen as a risk measure of rejecting H0 , as shown in Figure 9.4.
Figure 9.4: Interpretation of p-values as a risk measure.
9.5.1 General setting of hypothesis tests
Let {X1 , . . . , Xn } be a random sample from a distribution with cdf F (x; θ). We are
interested in testing the hypotheses:
H0 : θ = θ0 vs. H1 : θ ∈ Θ1
where θ0 is a fixed value, Θ1 is a set, and θ0 ∉ Θ1.
H0 is called the null hypothesis.
H1 is called the alternative hypothesis.
The significance level is based on α, which is a small number between 0 and 1
selected subjectively. Often we choose α = 0.1, 0.05 or 0.01, i.e. tests are often
conducted at the significance levels of 10%, 5% or 1%, respectively. So we test at the
100α% significance level.
Our decision is to reject H0 if the p-value is ≤ α.
9.5.2 Statistical testing procedure
1. Find a test statistic T = T (X1 , . . . , Xn ). Denote by t the value of T for the given
sample of observations under H0 .
2. Compute the p-value:
p = Pθ0 (T = t or more ‘extreme’ values)
where Pθ0 denotes the probability distribution such that θ = θ0 .
3. If p ≤ α we reject H0 . Otherwise, H0 is not rejected.
Our understanding of ‘extremity’ is defined by the alternative hypothesis H1 . This will
become clear in subsequent examples. The significance level determines which p-values
are considered ‘small’.
Example 9.5 Let {X1 , . . . , X20 }, taking values either 1 or 0, be the outcomes of an
experiment of tossing a coin 20 times, where:
P (Xi = 1) = π = 1 − P (Xi = 0) for π ∈ (0, 1).
We are interested in testing:
H0 : π = 0.5 vs. H1 : π 6= 0.5.
Suppose there are 17 Xi s taking the value 1, and 3 Xi s taking the value 0. Will you
reject the null hypothesis at the 5% significance level?
Let T = X1 + · · · + X20 . Therefore, T ∼ Bin(20, π). We use T as the test statistic.
With the given sample, we observe t = 17. What are the more extreme values of T if
H0 is true?
Under H0 , E(T ) = n π0 = 10. Hence 3 is as extreme as 17, and the more extreme
values are:
0, 1, 2, 18, 19 and 20.
Therefore, the p-value is:
(Σ_{i=0}^{3} + Σ_{i=17}^{20}) P_{H0}(T = i) = (Σ_{i=0}^{3} + Σ_{i=17}^{20}) 20!/((20 − i)! i!) × (0.5)^i (1 − 0.5)^{20−i}
= 2 × (0.5)^{20} Σ_{i=0}^{3} 20!/((20 − i)! i!)
= 2 × (0.5)^{20} × (1 + 20 + 20 × 19/2! + 20 × 19 × 18/3!)
= 0.0026.
So we reject the null hypothesis of a fair coin at the 1% significance level.
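The p-value above can be reproduced directly from the binomial distribution. A minimal sketch in Python (assuming scipy is available):

from scipy.stats import binom

n, pi0 = 20, 0.5
# values at least as extreme as t = 17 under H0 (where E(T) = 10): 0, 1, 2, 3 and 17, ..., 20
p_value = binom.cdf(3, n, pi0) + binom.sf(16, n, pi0)   # sf(16) = P(T >= 17)
print(p_value)                                           # approximately 0.0026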
Activity 9.2 Let {X1 , . . . , X14 }, taking values either 1 or 0, be the outcomes of an
experiment of tossing a coin 14 times, where:
P (Xi = 1) = π = 1 − P (Xi = 0) for π ∈ (0, 1).
We are interested in testing:
H0 : π = 0.5 vs. H1 : π 6= 0.5.
Suppose there are 4 Xi s taking the value 1, and 10 Xi s taking the value 0. Will you
reject the null hypothesis at the 5% significance level?
Solution
Let T = X1 + · · · + X14 . Therefore, T ∼ Bin(14, π). We use T as the test statistic.
With the given sample, we observe t = 4. We now determine which are the more
extreme values of T if H0 is true.
Under H0 , E(T ) = n π0 = 7. Hence 10 is as extreme as 4, and the more extreme
values are:
0, 1, 2, 3, 11, 12, 13 and 14.
Therefore, the p-value is:
(Σ_{i=0}^{4} + Σ_{i=10}^{14}) P_{H0}(T = i) = (Σ_{i=0}^{4} + Σ_{i=10}^{14}) 14!/((14 − i)! i!) × (0.5)^i (1 − 0.5)^{14−i}
= 2 × (0.5)^{14} Σ_{i=0}^{4} 14!/((14 − i)! i!)
= 2 × (0.5)^{14} × (1 + 14 + 14 × 13/2! + 14 × 13 × 12/3! + 14 × 13 × 12 × 11/4!)
= 0.1796.
Since α = 0.05 < 0.1796, we do not reject the null hypothesis of a fair coin at the 5%
significance level. The observed data are consistent with the null hypothesis of a fair
coin.
Activity 9.3 You wish to test whether a coin is fair. In 400 tosses of a coin, 217
heads and 183 tails appear. Is it reasonable to assume that the coin is fair? Justify
your answer with an appropriate hypothesis test. Calculate the p-value of the test,
and assume a 5% significance level.
Solution
Let {X1 , . . . , X400 }, taking values either 1 or 0, be the outcomes of an experiment of
tossing a coin 400 times, where:
P (Xi = 1) = π = 1 − P (Xi = 0)
for π ∈ (0, 1), and 0 otherwise. We are interested in testing:
H0 : π = 0.5 vs. H1 : π 6= 0.5.
Let T = Σ_{i=1}^{400} Xi. Under H0, T ∼ Bin(400, 0.5) ≈ N(200, 100), using the normal approximation of the binomial distribution, with µ = nπ0 = 400 × 0.5 = 200 and σ² = nπ0(1 − π0) = 400 × 0.5 × 0.5 = 100. We observe t = 217, hence (using the continuity correction):
P(T ≥ 216.5) = P(Z ≥ (216.5 − 200)/√100) = P(Z ≥ 1.65) = 0.0495.
Therefore, the p-value is:
2 × P (Z ≥ 1.65) = 0.0990
which is far larger than α = 0.05, hence we do not reject H0 and conclude that there
is no evidence to suggest that the coin is not fair.
(Note that the test would be significant if we set H1 : π > 0.5, as the p-value would
be 0.0495 which is less than 0.05 (just). However, we have no a priori reason to
perform an upper-tailed test – we should not determine our hypotheses by observing
the sample data, rather the hypotheses should be set before any data are observed.)
Alternatively, one could apply the central limit theorem such that under H0 we have:
X̄ ∼ N(π, π(1 − π)/n) = N(0.5, 0.000625)
approximately, since n = 400. We observe x̄ = 217/400 = 0.5425, hence:
P(X̄ ≥ 0.5425) = P(Z ≥ (0.5425 − 0.5)/√0.000625) = P(Z ≥ 1.70) = 0.0446.
Therefore, the p-value is:
2 × P(Z ≥ 1.70) = 2 × 0.0446 = 0.0892
leading to the same conclusion.
Activity 9.4 In a given city, it is assumed that the number of car accidents in a
given week follows a Poisson distribution. In past weeks, the average number of
accidents per week was 9, and this week there were 3 accidents. Is it justified to
claim that the accident rate has dropped? Calculate the p-value of the test, and
assume a 5% significance level.
Solution
Let T be the number of car accidents per week such that T ∼ Poisson(λ). We are
interested in testing:
H0 : λ = 9 vs. H1 : λ < 9.
Under H0, T ∼ Poisson(9), and we observe t = 3. Hence the p-value is:
P(T ≤ 3) = Σ_{t=0}^{3} e^{−9} 9^t/t! = e^{−9} (1 + 9 + 9²/2! + 9³/3!) = 0.0212.
Since 0.0212 < 0.05, we reject H0 and conclude that there is evidence to suggest that
the accident rate has dropped.
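A quick numerical check of this Poisson p-value, as a minimal sketch in Python (assuming scipy is available):

from scipy.stats import poisson

p_value = poisson.cdf(3, 9)   # P(T <= 3) when T ~ Poisson(9)
print(p_value)                # approximately 0.0212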
9.5.3 Two-sided tests for normal means
Let {X1, . . . , Xn} be a random sample from N(µ, σ²). Assume σ² > 0 is known. We are interested in testing the hypotheses:
H0 : µ = µ0 vs. H1 : µ ≠ µ0
where µ0 is a given constant.
Intuitively, if H0 is true, X̄ = Σ_i Xi/n should be close to µ0. Therefore, large values of |X̄ − µ0| suggest a departure from H0.
Under H0, X̄ ∼ N(µ0, σ²/n), i.e. √n(X̄ − µ0)/σ ∼ N(0, 1). Hence the test statistic may be defined as:
T = √n(X̄ − µ0)/σ = (X̄ − µ0)/(σ/√n) ∼ N(0, 1)
and we reject H0 for sufficiently 'large' values of |T|.
How large is ‘large’ ? This is determined by the significance level.
Suppose µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897. Therefore, the observed value of T is t = √20 × (2.897 − 3)/0.148 = −3.112. Hence the p-value is:
P_{µ0}(|T| ≥ 3.112) = P(|Z| ≥ 3.112) = 0.0019
where Z ∼ N(0, 1). Therefore, the null hypothesis of µ = 3 will be rejected even at the 1% significance level.
Alternatively, for a given 100α% significance level we may find the critical value cα
such that Pµ0 (|T | > cα ) = α. Therefore, the p-value is ≤ α if and only if the observed
value of |T | ≥ cα .
Using this alternative approach, we do not need to compute the p-value.
For this example, cα = zα/2 , that is the top 100α/2th percentile of N (0, 1), i.e. the
z-value which cuts off α/2 probability in the upper tail of the standard normal
distribution.
For α = 0.1, 0.05 and 0.01, zα/2 = 1.645, 1.96 and 2.576, respectively. Since we observe
|t| = 3.112, the null hypothesis is rejected at all three significance levels.
9.5.4 One-sided tests for normal means
Let {X1 , . . . , Xn } be a random sample from N (µ, σ 2 ) with σ 2 > 0 known. We are
interested in testing the hypotheses:
H0 : µ = µ0
vs. H1 : µ < µ0
where µ0 is a known constant.
Under H0, T = √n(X̄ − µ0)/σ ∼ N(0, 1). We continue to use T as the test statistic. For H1 : µ < µ0 we should reject H0 when t ≤ c, where c < 0 is a constant.
For a given 100α% significance level, the critical value c should be chosen such that:
α = Pµ0 (T ≤ c) = P (Z ≤ c).
Therefore, c is the 100αth percentile of N (0, 1). Due to the symmetry of N (0, 1),
c = −zα , where zα is the top 100αth percentile of N (0, 1), i.e. P (Z > zα ) = α, where
Z ∼ N (0, 1). For α = 0.05, zα = 1.645. We reject H0 if t ≤ −1.645.
Example 9.6 Suppose µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897, then:
t = √20 × (2.897 − 3)/0.148 = −3.112 < −1.645.
So the null hypothesis of µ = 3 is rejected at the 5% significance level as there is
significant evidence from the data that the true mean is likely to be smaller than 3.
Some remarks are the following.
i. We use a one-tailed test when we are only interested in the departure from H0 in
one direction.
ii. The distribution of a test statistic under H0 must be known in order to calculate
p-values or critical values.
iii. A test may be carried out by either computing the p-value or determining the
critical value.
iv. The probability of incorrect decisions in hypothesis testing is typically positive. For
example, the significance level is the probability of rejecting a true H0 .
9.6 t tests
t tests are one of the most frequently-used statistical tests.
Let {X1 , . . . , Xn } be a random sample from N (µ, σ 2 ), where both µ and σ 2 > 0 are
unknown. We are interested in testing the hypotheses:
H0 : µ = µ0
vs. H1 : µ < µ0
where µ0 is known.
Now we cannot use √n(X̄ − µ0)/σ as a statistic, since σ is unknown. Naturally we replace σ by S, where:
S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².
The test statistic is then the famous t statistic:
T = √n(X̄ − µ0)/S = (X̄ − µ0)/(S/√n) = √n(X̄ − µ0) / ((1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²)^{1/2}.
We reject H0 if t < c, where c is the critical value determined by the significance level:
PH0 (T < c) = α
where PH0 denotes the distribution under H0 (with mean µ0 and unknown σ 2 ).
Under H0 , T ∼ tn−1 . Hence:
α = PH0 (T < c)
i.e. c is the 100αth percentile of the t distribution with n − 1 degrees of freedom. By
symmetry, c = −tα, n−1 , where tα, k denotes the top 100αth percentile of the tk
distribution.
Example 9.7 To deal with the customer complaint that the average amount of
coffee powder in a coffee tin is less than the advertised 3 pounds, 20 tins were
weighed, yielding the following observations:
2.82, 2.78, 3.01, 3.01, 3.11, 3.09, 2.71, 2.94, 2.93, 2.68,
2.82, 2.81, 3.02, 3.05, 3.01, 3.01, 2.93, 2.85, 2.56, 2.79.
The sample mean and standard deviation are, respectively:
x̄ = 2.897 and s = 0.148.
To test H0 : µ = 3 vs. H1 : µ < 3 at the 1% significance level, the critical value is
c = −t0.01, 19 = −2.539.
Since t = √20 × (2.897 − 3)/0.148 = −3.112 < −2.539, we reject the null hypothesis that µ = 3 at the 1% significance level.
We conclude that there is highly significant evidence which supports the claim that
the mean amount of coffee is less than 3 pounds.
Note the hypotheses tested are in fact:
H0 : µ = µ0 , σ 2 > 0 vs. H1 : µ 6= µ0 , σ 2 > 0.
Although H0 does not specify the population distribution completely (σ 2 > 0), the
distribution of the test statistic, T , under H0 is completely known. This enables us
to find the critical value or p-value.
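A numerical check of Example 9.7, as a minimal sketch in Python (assuming a reasonably recent version of scipy, whose ttest_1samp function accepts an 'alternative' argument; note it reports a p-value rather than a comparison with a critical value):

import numpy as np
from scipy.stats import ttest_1samp

x = np.array([2.82, 2.78, 3.01, 3.01, 3.11, 3.09, 2.71, 2.94, 2.93, 2.68,
              2.82, 2.81, 3.02, 3.05, 3.01, 3.01, 2.93, 2.85, 2.56, 2.79])
res = ttest_1samp(x, popmean=3, alternative='less')   # H1: mu < 3
print(res.statistic, res.pvalue)   # t is approximately -3.11; the p-value is below 0.01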
Activity 9.5 A doctor claims that the average European is more than 8.5 kg
overweight. To test this claim, a random sample of 12 Europeans were weighed, and
the difference between their actual weight and their ideal weight was calculated. The
data are:
14, 12, 8, 13, −1, 10, 11, 15, 13, 20, 7, 14.
Assuming the data follow a normal distribution, conduct a t test to infer at the 5%
significance level whether or not the doctor’s claim is true.
Solution
We have a random sample of size n = 12 from N (µ, σ 2 ), and we test H0 : µ = 8.5 vs.
H1 : µ > 8.5. The test statistic, under H0 , is:
T = (X̄ − 8.5)/(S/√n) = (X̄ − 8.5)/(S/√12) ∼ t₁₁.
We reject H0 if t > t_{0.05, 11} = 1.796. For the given data:
x̄ = (1/12) Σ_{i=1}^{12} xi = 11.333 and s² = (1/11) (Σ_{i=1}^{12} xi² − 12x̄²) = 26.606.
Hence:
t = (11.333 − 8.5)/√(26.606/12) = 1.903 > 1.796 = t_{0.05, 11}
so we reject H0 at the 5% significance level. There is significant evidence to support
the doctor’s claim.
Activity 9.6 A sample of seven is taken at random from a large batch of (nominally
12-volt) batteries. These are tested and their true voltages are shown below:
12.9, 11.6, 13.5, 13.9, 12.1, 11.9, 13.0.
(a) Test if the mean voltage of the whole batch is 12 volts.
(b) Test if the mean batch voltage is less than 12 volts.
Which test do you think is the more appropriate?
Solution
(a) We are to test H0 : µ = 12 vs. H1 : µ 6= 12. The key points here are that n is
small and that σ 2 is unknown. We can use the t test and this is valid provided
the data are normally distributed. The test statistic value is:
t=
x̄ − 12
12.7 − 12
√ =
√ = 2.16.
s/ 7
0.858/ 7
This is compared to a Student’s t distribution on 6 degrees of freedom. The
critical value corresponding to a 5% significance level is 2.447. Hence we cannot
reject the null hypothesis at the 5% significance level. (We can reject at the 10%
significance level, but the convention on this course is to regard such evidence
merely as casting doubt on H0 , rather than justifying rejection as such, i.e. such
a result would be ‘weakly significant’.)
(b) We are to test H0 : µ = 12 vs. H1 : µ < 12. There is no need to do a formal
statistical test. As the sample mean is 12.7, which is greater than 12, there is no
evidence whatsoever for the alternative hypothesis.
In (a) you are asked to do a two-sided test and in (b) it is a one-sided test. Which is
more appropriate will depend on the purpose of the experiment, and your suspicions
before you conduct it.
• If you suspected before collecting the data that the mean voltage was less than
12 volts, the one-sided test would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts you
would perform a two-sided test.
• General rule: decide on whether it is a one- or two-sided test before performing
the statistical test!
Activity 9.7 A random sample of 16 observations from the population N (µ, σ 2 )
yields the sample mean x̄ = 9.31 and the sample variance s2 = 0.375. At the 5%
significance level, test the following hypotheses by obtaining critical values:
(a) H0 : µ = 9 vs. H1 : µ > 9.
(b) H0 : µ = 9 vs. H1 : µ < 9.
(c) H0 : µ = 9 vs. H1 : µ 6= 9.
Repeat the above exercise with the additional assumption that σ 2 = 0.375. Compare
the results with those derived without this assumption and comment.
Solution
When σ² is unknown, we use the test statistic T = √n(X̄ − 9)/S. Under H0, T ∼ t₁₅. With α = 0.05, we reject H0 if:
(a) t > t0.05, 15 = 1.753, against H1 : µ > 9.
(b) t < −t0.05, 15 = −1.753, against H1 : µ < 9.
(c) |t| > t0.025, 15 = 2.131, against H1 : µ 6= 9.
For the given sample, t = 2.02. Hence we reject H0 against the alternative
H1 : µ > 9, but we will not reject H0 against the two other alternative hypotheses.
When σ² is known, we use the test statistic T = √n(X̄ − 9)/σ. Now under H0, T ∼ N(0, 1). With α = 0.05, we reject H0 if:
(a) t > z0.05 = 1.645, against H1 : µ > 9.
(b) t < −z0.05 = −1.645, against H1 : µ < 9.
(c) |t| > z0.025 = 1.960, against H1 : µ 6= 9.
For the given sample, t = 2.02. Hence we reject H0 against the alternative H1 : µ > 9
and H1 : µ 6= 9, but we will not reject H0 against H1 : µ < 9.
With σ 2 known, we should be able to perform inference better simply because we
have more information about the population. More precisely, for the given
significance level, we require less extreme values to reject H0 . Put another way, the
p-value of the test is reduced when σ 2 is given. Therefore, the risk of rejecting H0 is
also reduced.
9.7 General approach to statistical tests
Let {X1 , . . . , Xn } be a random sample from the distribution F (x; θ). We are interested
in testing:
H0 : θ ∈ Θ0
vs. H1 : θ ∈ Θ1
where Θ0 and Θ1 are two non-overlapping sets. A general approach to test the above
hypotheses at the 100α% significance level may be described as follows.
1. Find a test statistic T = T (X1 , . . . , Xn ) such that the distribution of T under H0 is
known.
2. Identify a critical region C such that:
PH0 (T ∈ C) = α.
3. If the observed value of T with the given sample is in the critical region C, H0 is
rejected. Otherwise, H0 is not rejected.
In order to make a test powerful in the sense that the chance of making an incorrect
decision is small, the critical region should consist of those values of T which are least
supportive of H0 (i.e. which lie in the direction of H1 ).
9.8 Two types of error
Statistical tests are often associated with two kinds of decision errors, which are
displayed in the following table:
                                    Decision made
True state of nature    H0 not rejected       H0 rejected
H0 true                 Correct decision      Type I error
H1 true                 Type II error         Correct decision
Some remarks are the following.
i. Ideally we would like to have a test which minimises the probabilities of making
both types of error, which unfortunately is not feasible.
ii. The probability of making a Type I error is the significance level, which is under
our control.
iii. We do not have explicit control over the probability of a Type II error. For a given
significance level, we try to choose a test statistic such that the probability of a
Type II error is small.
iv. The power function of the test is defined as:
β(θ) = Pθ (H0 is rejected) for θ ∈ Θ1
i.e. β(θ) = 1 − P (Type II error).
v. The null hypothesis H0 and the alternative hypothesis H1 are not treated equally in
a statistical test, i.e. there is an asymmetric treatment. The choice of H0 is based
on the subject matter concerned and/or technical convenience.
vi. It is more conclusive to end a test with H0 rejected, as the decision of ‘not reject
H0 ’ does not imply that H0 is accepted.
Activity 9.8
(a) Of 100 clinical trials, 5 have shown that wonder-drug ‘Zap2’ is better than the
standard treatment (aspirin). Should we be excited by these results?
(b) Of the 1,000 clinical trials of 1,000 different drugs this year, 30 trials found
drugs which seem better than the standard treatments with which they were
compared. The television news reports only the results of those 30 ‘successful’
trials. Should we believe these reports?
(c) A child welfare officer says that she has a test which always reveals when a child
has been abused, and she suggests it be put into general use. What is she saying
about Type I and Type II errors for her test?
Solution
(a) If 5 clinical trials out of 100 report that Zap2 is better, this is consistent with
there being no difference whatsoever between Zap2 and aspirin if a 5% Type I
error probability is being used for tests in these clinical trials. With a 5%
significance level we expect 5 trials in 100 to show spurious significant results.
(b) If the television news reports the 30 successful trials out of 1,000, and those
trials use tests with a significance level of 5%, we may well choose to be very
cautious about believing the results. We would expect 50 spuriously significant
results in the 1,000 trial results.
(c) The welfare officer is saying that the Type II error has probability zero. The
test is always positive if the null hypothesis of no abuse is false. On the other
hand, the welfare officer is saying nothing about the probability of a Type I
error. It may well be that the probability of a Type I error is high, which would
lead to many false accusations of abuse when no abuse had taken place. One
should always think about both types of error when proposing a test.
Activity 9.9 A manufacturer has developed a new fishing line which is claimed to
have an average breaking strength of 7 kg, with a standard deviation of 0.25 kg.
Assume that the standard deviation figure is correct and that the breaking strength
is normally distributed.
Suppose that we carry out a test, at the 5% significance level, of H0 : µ = 7 vs.
H1 : µ < 7. Find the sample size which is necessary for the test to have 90% power if
the true breaking strength is 6.95 kg.
Solution
The critical value for the test is z0.95 = −1.645 and the probability of rejecting H0
with this test is:
P((X̄ − 7)/(0.25/√n) < −1.645)
which we rewrite as:
P((X̄ − 6.95)/(0.25/√n) < (7 − 6.95)/(0.25/√n) − 1.645)
because X̄ ∼ N(6.95, (0.25)²/n).
To ensure power of 90% we need z_{0.10} = 1.282 since:
P(Z < 1.282) = 0.90.
Therefore:
(7 − 6.95)/(0.25/√n) − 1.645 = 1.282
0.2 × √n = 2.927
√n = 14.635
n = 214.18.
So to ensure that the test power is at least 90%, we should use a sample size of 215.
Remark: We see a rather large sample size is required. Hence investigators are
encouraged to use sample sizes large enough to come to rational decisions.
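The same sample size calculation can be written as a short numerical sketch in Python (assuming scipy is available):

from math import ceil
from scipy.stats import norm

sigma, mu0, mu1 = 0.25, 7.0, 6.95     # assumed sd, null mean and true mean
alpha, power = 0.05, 0.90

z_alpha = norm.ppf(1 - alpha)         # approximately 1.645
z_beta = norm.ppf(power)              # approximately 1.282
n = ((z_alpha + z_beta) * sigma / (mu0 - mu1))**2
print(ceil(n))                        # 215, as in the solution above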
Activity 9.10 A manufacturer has developed a fishing line that is claimed to have
a mean breaking strength of 15 kg with a standard deviation of 0.8 kg. Suppose that
the breaking strength follows a normal distribution. With a sample size of n = 30,
the null hypothesis that µ = 15 kg, against the alternative hypothesis of µ < 15 kg,
will be rejected if the sample mean x̄ < 14.8 kg.
(a) Find the probability of committing a Type I error.
(b) Find the power of the test if the true mean is 14.9 kg, 14.8 kg and 14.7 kg,
respectively.
Solution
(a) Under H0 : µ = 15, we have X̄ ∼ N(15, σ²/30) where σ = 0.8. The probability of committing a Type I error is:
P(H0 is rejected | µ = 15) = P(X̄ < 14.8 | µ = 15)
= P((X̄ − 15)/(σ/√30) < (14.8 − 15)/(σ/√30) | µ = 15)
= P(Z < (14.8 − 15)/(0.8/√30))
= P(Z < −1.37)
= 0.0853.
(b) If the true value is µ, then X̄ ∼ N(µ, σ²/30). The power of the test for a particular µ is:
Pµ(H0 is rejected) = Pµ(X̄ < 14.8) = Pµ((X̄ − µ)/(σ/√30) < (14.8 − µ)/(σ/√30)) = P(Z < (14.8 − µ)/(0.8/√30))
which is 0.2483 for µ = 14.9, 0.5 for µ = 14.8, and 0.7517 for µ = 14.7.
Activity 9.11 In a wire-based nail manufacturing process the target length for cut
wire is 22 cm. It is known that widths vary with a standard deviation equal to 0.08
cm. In order to monitor this process, a random sample of 50 separate wires is
accurately measured and the process is regarded as operating satisfactorily (the null
hypothesis) if the sample mean width lies between 21.97 cm and 22.03 cm so that
this is the decision procedure used (i.e. if the sample mean falls within this range
then the null hypothesis is not rejected, otherwise the null hypothesis is rejected).
(a) Determine the probability of a Type I error for this test.
(b) Determine the probability of making a Type II error when the process is
actually cutting to a length of 22.05 cm.
(c) Find the probability of rejecting the null hypothesis when the true cutting
length is 22.01 cm. (This is the power of the test when the true mean is 22.01
cm.)
Solution
(a) We have:
α = 1 − P(21.97 < X̄ < 22.03 | µ = 22)
= 1 − P((21.97 − 22)/(0.08/√50) < Z < (22.03 − 22)/(0.08/√50))
= 1 − P(−2.65 < Z < 2.65)
= 1 − 0.992
= 0.008.
(b) We have:
β = P(21.97 < X̄ < 22.03 | µ = 22.05)
= P((21.97 − 22.05)/(0.08/√50) < Z < (22.03 − 22.05)/(0.08/√50))
= P(−7.07 < Z < −1.77)
= P(Z < −1.77) − P(Z < −7.07)
= 0.0384.
(c) We have:
P(rejecting H0 | µ = 22.01) = 1 − P(21.97 < X̄ < 22.03 | µ = 22.01)
= 1 − P((21.97 − 22.01)/(0.08/√50) < Z < (22.03 − 22.01)/(0.08/√50))
= 1 − P(−3.53 < Z < 1.77)
= 1 − (P(Z < 1.77) − P(Z < −3.53))
= 1 − (0.9616 − 0.00023)
= 0.0386.
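These error probabilities are straightforward to verify numerically. A minimal sketch in Python (assuming scipy is available), using the set-up of Activity 9.11:

from math import sqrt
from scipy.stats import norm

se = 0.08 / sqrt(50)                  # standard error of the sample mean
lo, hi = 21.97, 22.03                 # 'do not reject' region for the sample mean

alpha = 1 - (norm.cdf(hi, 22, se) - norm.cdf(lo, 22, se))          # Type I error, approx 0.008
beta = norm.cdf(hi, 22.05, se) - norm.cdf(lo, 22.05, se)           # Type II error at mu = 22.05, approx 0.038
power = 1 - (norm.cdf(hi, 22.01, se) - norm.cdf(lo, 22.01, se))    # power at mu = 22.01, approx 0.039
print(alpha, beta, power)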
Activity 9.12 It may be assumed that the length of nails produced by a particular
machine is a normally distributed random variable, with a standard deviation of 0.02
cm. The lengths of a random sample of 6 nails are 4.63 cm, 4.59 cm, 4.64 cm, 4.62
cm, 4.66 cm and 4.69 cm.
(a) Test, at the 1% significance level, the hypothesis that the machine produces
nails with a mean length of 4.62 cm (a two-sided test).
(b) Find the probability of committing a Type II error when the true mean length
is 4.64 cm.
Solution
We test the null hypothesis H0 : µ = 4.62 vs. H1 : µ ≠ 4.62 with σ = 0.02. The test statistic is T = √n(X̄ − 4.62)/σ, which is N(0, 1) under H0. For the given sample, t = 2.25.
(a) At the 1% significance level, we reject H0 if |t| ≥ 2.576. Since t = 2.25, H0 is not
rejected.
(b) For any µ ≠ 4.62, E(T) = √n(E(X̄) − 4.62)/σ = √n(µ − 4.62)/σ ≠ 0, hence T does not follow N(0, 1). The probability of committing a Type II error is:
Pµ(H0 is not rejected) = Pµ(|T| < 2.576)
= Pµ(−2.576 < T < 2.576)
= Pµ(−2.576 < (X̄ − 4.62)/(σ/√n) < 2.576)
= Pµ(4.62 − 2.576 × σ/√n < X̄ < 4.62 + 2.576 × σ/√n)
= Pµ((4.62 − µ)/(σ/√n) − 2.576 < (X̄ − µ)/(σ/√n) < (4.62 − µ)/(σ/√n) + 2.576)
= P((4.62 − µ)/(σ/√n) − 2.576 < Z < (4.62 − µ)/(σ/√n) + 2.576).
Plugging in µ = 4.64, n = 6 and σ = 0.02 in the above expression, the probability of committing a Type II error is:
P(−√6 − 2.576 < Z < −√6 + 2.576) ≈ Φ(0.13) − 0 = 0.5517.
Note:
i. The power of the test to reject H0 when µ = 4.64 is 1 − 0.5517 = 0.4483.
The power increases when µ moves further away from µ0 = 4.62.
ii. We always express probabilities to be calculated in terms of some
‘standard’ distributions such as N (0, 1), tk , etc. We can then refer to the
relevant table in the New Cambridge Statistical Tables.
Activity 9.13 A random sample of fibres is known to come from one of two
environments, A or B. It is known from past experience that the lengths of fibres
from A have a log-normal distribution so that the log-length of an A-type fibre is
normally distributed about a mean of 0.80 with a standard deviation of 1.00.
(Original units are in microns.)
The log-lengths of B-type fibres are normally distributed about a mean of 0.65 with
a standard deviation of 1.00. In order to identify the environment from which the
given sample was taken a subsample of n fibres are to be measured and the
classification is to be made on the evidence of these measurements.
Do not be put off by the log-normal distribution. This simply means that it is the
logs of the data, rather than the original data, which have a normal distribution. If
X represents the log of a fibre length for fibres from A, then X ∼ N (0.8, 1).
(a) If n = 50 and the sample is attributed to type A if the sample mean of
log-lengths exceeds 0.75, determine the error probabilities.
(b) What sample size and decision procedures should be used if it is desired to have
error probabilities such that the chance of misclassifying as A is to be 5% and
the chance of misclassifying as B is to be 10%?
(c) If the sample is classified as A if the sample mean of log-lengths exceeds 0.75,
and the misclassification as A is to have a probability of 2%, what sample size
should be used and what is the probability of a B-type misclassification?
(d) If the sample comes from neither A nor B but from an environment with a
mean log-length of 0.70, what is the probability of classifying it as type A if the
decision procedure determined in (b) is applied?
Solution
(a) We have n = 50 and σ = 1. We wish to test:
H0 : µ = 0.65 (sample is from ‘B’) vs. H1 : µ = 0.80 (sample is from ‘A’).
The decision rule is that we reject H0 if x̄ > 0.75.
The probability of a Type I error is:
P(X̄ > 0.75 | H0) = P(Z > (0.75 − 0.65)/(1/√50)) = P(Z > 0.71) = 0.2389.
The probability of a Type II error is:
P(X̄ < 0.75 | H1) = P(Z < (0.75 − 0.80)/(1/√50)) = P(Z < −0.35) = 0.3632.
(b) To find the sample size n and the value a, we need to solve two conditions:
• α = P(X̄ > a | H0) = P(Z > (a − 0.65)/(1/√n)) = 0.05 ⇒ (a − 0.65)/(1/√n) = 1.645.
• β = P(X̄ < a | H1) = P(Z < (a − 0.80)/(1/√n)) = 0.10 ⇒ (a − 0.80)/(1/√n) = −1.28.
Solving these equations gives a = 0.734 and n = 381, remembering to round up!
(c) A sample is classified as being from A (i.e. we choose H1) if x̄ > 0.75. We have:
α = P(X̄ > 0.75 | H0) = P(Z > (0.75 − 0.65)/(1/√n)) = 0.02 ⇒ (0.75 − 0.65)/(1/√n) = 2.05.
Solving this equation gives n = 421, remembering to round up! Therefore:
β = P(X̄ < 0.75 | H1) = P(Z < (0.75 − 0.80)/(1/√421)) = P(Z < −1.026) = 0.1515.
(d) The rule in (b) is 'take n = 381 and reject H0 if x̄ > 0.734'. So:
P(X̄ > 0.734 | µ = 0.7) = P(Z > (0.734 − 0.7)/(1/√381)) = P(Z > 0.66) = 0.2546.
9.9 Tests for variances of normal distributions
Example 9.8 A container-filling machine is used to package milk cartons of 1 litre
(= 1,000 cm3 ). Ideally, the amount of milk should only vary slightly. The company
which produced the filling machine claims that the variance of the milk content is
not greater than 1 cm3 . To examine the veracity of the claim, a random sample of 25
cartons is taken, resulting in 25 measurements (in cm3 ) as follows:
1,000.3, 1,001.3, 999.5, 999.7, 999.3, 999.8, 998.3, 1,000.6, 999.7,
999.8, 1,001.0, 999.4, 999.5, 998.5, 1,000.7, 999.6, 999.8, 1,000.0,
998.2, 1,000.1, 998.1, 1,000.7, 999.8, 1,001.3, 1,000.7.
Do these data support the claim of the company?
Turning Example 9.8 into a statistical problem, we assume that the data form a random
sample from N (µ, σ 2 ). We are interested in testing the hypotheses:
H0 : σ² = σ0² vs. H1 : σ² > σ0².
Let S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), then (n − 1)S²/σ² ∼ χ²_{n−1}. Under H0 we have:
T = (n − 1)S²/σ0² = Σ_{i=1}^{n} (Xi − X̄)²/σ0² ∼ χ²_{n−1}.
Since we will reject H0 against an alternative hypothesis σ 2 > σ02 , we should reject H0
for large values of T .
H0 is rejected if t > χ²_{α, n−1}, where χ²_{α, n−1} denotes the top 100αth percentile of the χ²_{n−1} distribution, i.e. we have:
P(T ≥ χ²_{α, n−1}) = α.
For any σ² > σ0², the power of the test at σ is:
β(σ) = Pσ(H0 is rejected)
= Pσ(T > χ²_{α, n−1})
= Pσ((n − 1)S²/σ0² > χ²_{α, n−1})
= Pσ((n − 1)S²/σ² > (σ0²/σ²) × χ²_{α, n−1})
which is greater than α, as σ0²/σ² < 1, where (n − 1)S²/σ² ∼ χ²_{n−1} when σ² is the true variance, instead of σ0². Note that here 1 − β(σ) is the probability of a Type II error.
Suppose we choose α = 0.05. For n = 25, χ²_{α, n−1} = χ²_{0.05, 24} = 36.415.
With the given sample, s² = 0.8088 and σ0² = 1, so t = 24 × 0.8088 = 19.41 < χ²_{0.05, 24}. Hence we do not reject H0 at the 5% significance level. There is no significant evidence from the data against the company's claim that the variance is not greater than 1.
With σ0² = 1, the power function is:
β(σ) = P((n − 1)S²/σ² > χ²_{0.05, 24}/σ²) = P((n − 1)S²/σ² > 36.415/σ²)
where (n − 1)S²/σ² ∼ χ²_{24}.
For any given values of σ², we may compute β(σ). We list some specific values next.

σ²                    1        1.5      2        3        4
χ²_{0.05, 24}/σ²      36.415   24.277   18.208   12.138   9.104
β(σ)                  0.05     0.446    0.793    0.978    0.997
Approximate β(σ)      0.05     0.40     0.80     0.975    0.995
Clearly, β(σ) increases as σ² increases. Intuitively, it is easier to reject H0 : σ² = 1 if the true population, which generates the data, has a larger variance σ².
Due to the sparsity of the available χ2 tables, we may only obtain some approximate
values for β(σ) – see the entries in the last row in the above table. The more accurate
values of β(σ) were calculated using a computer.
Some remarks are the following.
i. The significance level is selected subjectively by the statistician. To make the conclusion more convincing in the above example, we may use α = 0.1 instead. As χ²_{0.1, 24} = 33.196, H0 is not rejected at the 10% significance level. In fact the p-value is:
P_{H0}(T ≥ 19.41) = 0.73
where T ∼ χ²_{24}.
ii. As σ 2 increases, the power function β(σ) also increases.
iii. For H1 : σ² ≠ σ0², we should reject H0 if:
t ≤ χ²_{1−α/2, n−1} or t ≥ χ²_{α/2, n−1}
where χ²_{α, k} denotes the top 100αth percentile of the χ²_k distribution.
Activity 9.14 A machine is designed to fill bags of sugar. The weight of the bags is
normally distributed with standard deviation σ. If the machine is correctly
calibrated, σ should be no greater than 20 g. We collect a random sample of 18 bags
and weigh them. The sample standard deviation is found to be equal to 32.48 g. Is
there any evidence that the machine is incorrectly calibrated?
Solution
This is a hypothesis test for the variance of a normal population, so we will use the
chi-squared distribution. Let:
X1 , . . . , X18 ∼ N (µ, σ 2 )
be the weights of the bags in the sample. An appropriate test has hypotheses:
H0 : σ 2 = 400 vs. H1 : σ 2 > 400.
This is a one-sided test, because we are interested in detecting an increase in
variance. We compute the value of the test statistic:
t = (n − 1)s²/σ0² = (18 − 1) × (32.48)²/(20)² = 44.835.
At the 5% significance level, the upper-tail value of the chi-squared distribution on ν = 18 − 1 degrees of freedom is χ²_{0.05, 17} = 27.587. Our test statistic exceeds this value, so we reject the null hypothesis.
We now move to the 1% significance level. The upper-tail value is χ²_{0.01, 17} = 33.409,
so we reject H0 again. We conclude that there is very strong evidence that the
machine is incorrectly calibrated.
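A numerical sketch of this variance test in Python (assuming scipy is available), which also returns the p-value rather than relying on tabulated critical values:

from scipy.stats import chi2

n, s, sigma0 = 18, 32.48, 20.0
t_stat = (n - 1) * s**2 / sigma0**2   # observed value of the test statistic
p_value = chi2.sf(t_stat, df=n - 1)   # upper-tail probability P(chi-squared(17) >= t)
print(t_stat, p_value)                # t is approximately 44.8; the p-value is far below 0.01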
Activity 9.15 {X1 , . . . , X21 } represents a random sample of size 21 from a normal
population with mean µ and variance σ 2 .
(a) Construct a test procedure with a 5% significance level to test the null
hypothesis that σ 2 = 8 against the alternative that σ 2 > 8.
(b) Evaluate the power of the test for the values of σ² given below.
σ² = 8.84, 10.04, 10.55, 11.03, 12.99, 15.45, 17.24
Solution
(a) We test:
H0 : σ² = 8 vs. H1 : σ² > 8.
The test statistic, under H0, is:
T = (n − 1)S²/σ0² = 20 × S²/8 ∼ χ²_{20}.
With a 5% significance level, we reject the null hypothesis if:
t ≥ 31.410
since χ²_{0.05, 20} = 31.410.
(b) To evaluate the power, we need the probability of rejecting H0 (which happens if t ≥ 31.410) conditional on the actual value of σ², that is:
P(T ≥ 31.410 | σ² = k) = P(T × 8/k ≥ 31.410 × 8/k)
where k is the true value of σ², noting that:
T × 8/k ∼ χ²_{20}.

σ² = k             8.84    10.04   10.55   11.03   12.99   15.45   17.24
31.410 × 8/k       28.4    25.0    23.8    22.8    19.3    16.3    14.6
β(σ²)              0.10    0.20    0.25    0.30    0.50    0.70    0.80
9.10 Summary: tests for µ and σ² in N(µ, σ²)
In the below table, X̄ = Σ_{i=1}^{n} Xi/n, S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), and {X1, . . . , Xn} is a random sample from N(µ, σ²).
Null hypothesis, H0             µ = µ0 (σ² known)      µ = µ0                 σ² = σ0²
Test statistic, T               (X̄ − µ0)/(σ/√n)       (X̄ − µ0)/(S/√n)       (n − 1)S²/σ0²
Distribution of T under H0      N(0, 1)                t_{n−1}                χ²_{n−1}
9.11 Comparing two normal means with paired observations
Suppose that the observations are paired:
(X1, Y1), (X2, Y2), . . . , (Xn, Yn)
where all Xi s and Yi s are independent, Xi ∼ N(µX, σX²), and Yi ∼ N(µY, σY²).
We are interested in testing the hypothesis:
H0 : µX = µY.    (9.1)
Example 9.9 The following are some practical examples.
Do husbands make more money than wives?
Is the increased marketing budget improving sales?
Are customers willing to pay more for the new product than the old one?
Does TV advertisement A have higher average effectiveness than advertisement
B?
Will promotion method A generate higher sales than method B?
Observations are paired together for good reasons: husband–wife, before–after,
A-vs.-B (from the same subject).
Let Zi = Xi − Yi, for i = 1, . . . , n, then {Z1, . . . , Zn} is a random sample from the population N(µ, σ²), where:
µ = µX − µY and σ² = σX² + σY².
The hypothesis (9.1) can also be expressed as:
H0 : µ = 0.
Therefore, we should use the test statistic T = √n Z̄/S, where Z̄ and S² denote, respectively, the sample mean and the sample variance of {Z1, . . . , Zn}.
At the 100α% significance level, for α ∈ (0, 1), we reject the hypothesis µX = µY when:
|t| > t_{α/2, n−1}, if the alternative is H1 : µX ≠ µY
t > t_{α, n−1}, if the alternative is H1 : µX > µY
t < −t_{α, n−1}, if the alternative is H1 : µX < µY
where P(T > t_{α, n−1}) = α, for T ∼ t_{n−1}.
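As a sketch of how such a paired comparison might be carried out by computer, the following Python fragment (assuming a reasonably recent version of scipy) uses hypothetical paired observations x and y, which are not data from this guide:

import numpy as np
from scipy.stats import ttest_rel

# hypothetical paired observations (for example, before/after measurements on the same subjects)
x = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.7])
y = np.array([4.7, 4.9, 5.6, 5.1, 4.6, 5.4])

res = ttest_rel(x, y)              # two-sided paired t test, equivalent to a one-sample t test on x - y
print(res.statistic, res.pvalue)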
9.11.1 Power functions of the test
Consider the case of testing H0 : µX = µY vs. H1 : µX > µY only. For µ = µX − µY > 0, we have:
β(µ) = Pµ(H0 is rejected)
= Pµ(T > t_{α, n−1})
= Pµ(√n Z̄/S > t_{α, n−1})
= Pµ(√n(Z̄ − µ)/S > t_{α, n−1} − √n µ/S)
where √n(Z̄ − µ)/S ∼ t_{n−1} under the distribution represented by Pµ.
Note that for µ > 0, β(µ) > α. Furthermore, β(µ) increases as µ increases.
9.12 Comparing two normal means
Let {X1, . . . , Xn} and {Y1, . . . , Ym} be two independent random samples drawn from, respectively, N(µX, σX²) and N(µY, σY²). We seek to test hypotheses on µX − µY.
We cannot pair the two samples together, because of the different sample sizes n and m.
Let the sample means be X̄ = Σ_{i=1}^{n} Xi/n and Ȳ = Σ_{i=1}^{m} Yi/m, and the sample variances be:
SX² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²  and  SY² = (1/(m − 1)) Σ_{i=1}^{m} (Yi − Ȳ)².
Some remarks are the following.
X̄, Ȳ, SX² and SY² are independent.
X̄ ∼ N(µX, σX²/n) and (n − 1)SX²/σX² ∼ χ²_{n−1}.
Ȳ ∼ N(µY, σY²/m) and (m − 1)SY²/σY² ∼ χ²_{m−1}.
Hence X̄ − Ȳ ∼ N(µX − µY, σX²/n + σY²/m). If σX² = σY², then:
{[X̄ − Ȳ − (µX − µY)] / √(σX²/n + σY²/m)} / √(((n − 1)SX²/σX² + (m − 1)SY²/σY²)/(n + m − 2))
= √((n + m − 2)/((n − 1)SX² + (m − 1)SY²)) × [X̄ − Ȳ − (µX − µY)]/√(1/n + 1/m) ∼ t_{n+m−2}.

9.12.1 Tests on µX − µY with known σX² and σY²
Suppose we are interested in testing:
H0 : µX = µY vs. H1 : µX ≠ µY.
Note that:
(X̄ − Ȳ − (µX − µY)) / √(σX²/n + σY²/m) ∼ N(0, 1).
Under H0, µX − µY = 0, so we have:
T = (X̄ − Ȳ) / √(σX²/n + σY²/m) ∼ N(0, 1).
At the 100α% significance level, for α ∈ (0, 1), we reject H0 if |t| > z_{α/2}, where P(Z > z_{α/2}) = α/2, for Z ∼ N(0, 1).
A 100(1 − α)% confidence interval for µX − µY is:
X̄ − Ȳ ± z_{α/2} × √(σX²/n + σY²/m).
Activity 9.16 Two random samples {X1, . . . , Xn} and {Y1, . . . , Ym} from two normally distributed populations with variances σX² = 41 and σY² = 15, respectively, produced the following summary statistics:
x̄ = 63, n = 50;  ȳ = 60, m = 45.
(a) At the 5% significance level, test if the two population means are the same. Find a 95% confidence interval for the difference between the two means.
(b) Repeat (a), but now with σX² = 85 and σY² = 42. Comment on the impact of increasing the variances.
(c) Repeat (a), but now with the sample sizes n = 20 and m = 14 (i.e. using the original variances). Comment on the impact of decreasing the sample sizes.
(d) Repeat (a), but now with x̄ = 61.5 (i.e. using the original variances and sample sizes), and comment.
Solution
(a) We test H0 : µX = µY vs. H1 : µX ≠ µY. Under H0, the test statistic is:
T = (X̄ − Ȳ) / √(σX²/n + σY²/m) ∼ N(0, 1).
At the 5% significance level we reject H0 if |t| > z_{0.025} = 1.96. With the given data, t = 2.79. Hence we reject H0 (the p-value is 2 × P(Z ≥ 2.79) = 0.00528 < 0.05 = α).
The 95% confidence interval for µX − µY obtained from the data is:
x̄ − ȳ ± 1.96 × √(σX²/n + σY²/m) = 3 ± 1.96 × √(41/50 + 15/45) = 3 ± 2.105 ⇒ (0.895, 5.105).
(b) With σX² = 85 and σY² = 42, now t = 1.85. So, since 1.85 < 1.96, we cannot reject H0 at the 5% significance level (the p-value is 2 × P(Z ≥ 1.85) = 0.0644 > 0.05 = α). The confidence interval is 3 ± 3.181 = (−0.181, 6.181) which is much wider and contains 0 – the hypothesised value under H0.
Comparing with the results in (a) above, the statistical inference become less
conclusive. This is due to the increase in the variances of the populations: as the
‘randomness’ increases, we are less certain about the parameters with the same
amount of information. This also indicates that it is not enough to look only at
the sample means, even if we are only concerned with the population means.
(c) With n = 20 and m = 14, now t = 1.70. Therefore, since 1.70 < 1.96, we cannot
reject H0 at the 5% significance level (the p-value is
2 × P (Z ≥ 1.70) = 0.0892 > 0.05 = α). The confidence interval is
3 ± 3.463 = (−0.463, 6.463) which is much wider than that obtained in (a), and
contains 0 as well. This indicates that the difference of 3 units between the
sample means is significant for the sample sizes (50, 45), but is not significant
for the sample sizes (20, 14).
(d) With x̄ = 61.5, now t = 1.40. Again, since 1.40 < 1.96, we cannot reject H0 at
the 5% significance level (the p-value is 2 × P (Z ≥ 1.40) = 0.1616 > 0.05 = α).
The confidence interval is 1.5 ± 2.105 ⇒ (−0.605, 3.605). Comparing with (a), the difference between the sample means is no longer large enough to reject H0, although everything else is unchanged.
Activity 9.17 Suppose that we have two independent samples from normal
populations with known variances. We want to test the H0 that the two population
means are equal against the alternative that they are different. We could use each
sample by itself to write down 95% confidence intervals and reject H0 if these
intervals did not overlap. What would be the significance level of this test?
Solution
Let us assume H0: µ_X = µ_Y is true, then the two 95% confidence intervals do not overlap if and only if:

X̄ − 1.96 × σ_X/√n ≥ Ȳ + 1.96 × σ_Y/√m   or   Ȳ − 1.96 × σ_Y/√m ≥ X̄ + 1.96 × σ_X/√n.

So we want the probability:

P( |X̄ − Ȳ| ≥ 1.96 × (σ_X/√n + σ_Y/√m) )

which is:

P( |X̄ − Ȳ| / √(σ²_X/n + σ²_Y/m) ≥ 1.96 × (σ_X/√n + σ_Y/√m) / √(σ²_X/n + σ²_Y/m) ).

So we have:

P( |Z| ≥ 1.96 × (σ_X/√n + σ_Y/√m) / √(σ²_X/n + σ²_Y/m) )

where Z ∼ N(0, 1). This does not reduce in general, but if we assume n = m and σ²_X = σ²_Y, then it reduces to:

P(|Z| ≥ 1.96 × √2) = 0.0056.
The significance level is about 0.6%, which is much smaller than the usual
conventions of 5% and 1%. Putting variability into two confidence intervals makes
them more likely to overlap than you might think, and so your chance of incorrectly
rejecting the null hypothesis is smaller than you might expect!
9.12.2  Tests on µ_X − µ_Y with σ²_X = σ²_Y but unknown

This time we consider the following hypotheses:

H0: µ_X − µ_Y = δ0   vs.   H1: µ_X − µ_Y > δ0

where δ0 is a given constant. Under H0, we have:

T = √[(n + m − 2) / ((n − 1)S²_X + (m − 1)S²_Y)] × (X̄ − Ȳ − δ0)/√(1/n + 1/m) ∼ t_{n+m−2}.

At the 100α% significance level, for α ∈ (0, 1), we reject H0 if t > t_{α, n+m−2}, where P(T > t_{α, n+m−2}) = α, for T ∼ t_{n+m−2}.

A 100(1 − α)% confidence interval for µ_X − µ_Y is:

X̄ − Ȳ ± t_{α/2, n+m−2} × √[ (1/n + 1/m)/(n + m − 2) × ((n − 1)S²_X + (m − 1)S²_Y) ].
Example 9.10 Two types of razor, A and B, were compared using 100 men in an
experiment. Each man shaved one side, chosen at random, of his face using one razor
and the other side using the other razor. The times taken to shave, Xi and Yi
minutes, for i = 1, . . . , 100, corresponding to the razors A and B, respectively, were
recorded, yielding:

x̄ = 2.84,   s²_X = 0.48,   ȳ = 3.02   and   s²_Y = 0.42.

Also available is the sample variance of the differences, Zi = Xi − Yi, which is s²_Z = 0.6.

Test, at the 5% significance level, if the two razors lead to different mean shaving times. State clearly any assumptions used in the test.

Assumption: suppose {X1, . . . , Xn} and {Y1, . . . , Yn} are two independent random samples from, respectively, N(µ_X, σ²_X) and N(µ_Y, σ²_Y).

The problem requires us to test the following hypotheses:

H0: µ_X = µ_Y   vs.   H1: µ_X ≠ µ_Y.

There are three approaches – a paired comparison method and two two-sample comparisons based on different assumptions. Since the data are recorded in pairs, the paired comparison is the most relevant and effective way to analyse these data.

Method I: paired comparison

We have Zi = Xi − Yi ∼ N(µ_Z, σ²_Z) with µ_Z = µ_X − µ_Y and σ²_Z = σ²_X + σ²_Y. We want to test:

H0: µ_Z = 0 vs. H1: µ_Z ≠ 0.

This is the standard one-sample t test, where:

√n (Z̄ − µ_Z)/S_Z = [X̄ − Ȳ − (µ_X − µ_Y)] / (S_Z/√n) ∼ t_{n−1}.

H0 is rejected if |t| > t_{0.025, 99} = 1.98, where under H0 we have:

T = √n Z̄/S_Z = √100 (X̄ − Ȳ)/S_Z.

With the given data, we observe t = 10(2.84 − 3.02)/√0.6 = −2.327. Hence we reject the hypothesis that the two razors lead to the same mean shaving time at the 5% significance level.

A 95% confidence interval for µ_X − µ_Y is:

x̄ − ȳ ± t_{0.025, n−1} × s_Z/√n = −0.18 ± 0.154 ⇒ (−0.334, −0.026).
Some remarks are the following.
i. Zero is not in the confidence interval for µX − µY .
ii. t0.025, 99 = 1.98 is pretty close to z0.025 = 1.96.
Method II: two-sample comparison with known variances

A further assumption is that σ²_X = 0.48 and σ²_Y = 0.42.

Note X̄ − Ȳ ∼ N(µ_X − µ_Y, σ²_X/100 + σ²_Y/100), i.e. we have:

[X̄ − Ȳ − (µ_X − µ_Y)] / √(σ²_X/100 + σ²_Y/100) ∼ N(0, 1).

Hence we reject H0 when |t| > 1.96 at the 5% significance level, where:

T = (X̄ − Ȳ) / √(σ²_X/100 + σ²_Y/100).

For the given data, t = −0.18/√0.009 = −1.9. Hence we cannot reject H0.

A 95% confidence interval for µ_X − µ_Y is:

x̄ − ȳ ± 1.96 × √(σ²_X/100 + σ²_Y/100) = −0.18 ± 0.186 ⇒ (−0.366, 0.006).
The value 0 is now contained in the confidence interval.
Method III: two-sample comparison with equal but unknown variance

A different additional assumption is that σ²_X = σ²_Y = σ².

Now X̄ − Ȳ ∼ N(µ_X − µ_Y, σ²/50) and 99(S²_X + S²_Y)/σ² ∼ χ²_{198}. Hence:

√50 [X̄ − Ȳ − (µ_X − µ_Y)] / √(99(S²_X + S²_Y)/198) = 10 × [X̄ − Ȳ − (µ_X − µ_Y)] / √(S²_X + S²_Y) ∼ t_{198}.

Hence we reject H0 if |t| > t_{0.025, 198} = 1.97 where:

T = 10(X̄ − Ȳ) / √(S²_X + S²_Y).

For the given data, t = −1.897. Hence we cannot reject H0 at the 5% significance level.

A 95% confidence interval for µ_X − µ_Y is:

x̄ − ȳ ± t_{0.025, 198} × √((s²_X + s²_Y)/100) = −0.18 ± 0.187 ⇒ (−0.367, 0.007)

which contains 0.
Some remarks are the following.
i. Different methods lead to different but not contradictory conclusions; remember that not rejecting H0 is not the same as accepting H0.
ii. The paired comparison is intuitively the most relevant, requires the least
assumptions, and leads to the most conclusive inference (i.e. rejection of H0 ). It
also produces the narrowest confidence interval.
iii. Methods II and III ignore the pairing of the data. Consequently, the inference is
less conclusive and less accurate.
iv. A general observation is that H0 is rejected at the 100α% significance level if and
only if the value hypothesised by H0 is not within the corresponding 100(1 − α)%
confidence interval.
v. It is much more challenging to compare two normal means with unknown and
unequal variances. This will not be discussed in this course.
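Only summary statistics are given in Example 9.10, so t.test cannot be called directly; the three test statistics can nevertheless be computed in R. A minimal sketch (the p-values shown are two-sided):

# Example 9.10: razors, working from summary statistics only
n <- 100
xbar <- 2.84; ybar <- 3.02
s2x <- 0.48; s2y <- 0.42; s2z <- 0.6

t1 <- sqrt(n)*(xbar - ybar)/sqrt(s2z)            # Method I: paired comparison
p1 <- 2*pt(abs(t1), df = n - 1, lower.tail = FALSE)

t2 <- (xbar - ybar)/sqrt(s2x/n + s2y/n)          # Method II: variances treated as known
p2 <- 2*pnorm(abs(t2), lower.tail = FALSE)

t3 <- 10*(xbar - ybar)/sqrt(s2x + s2y)           # Method III: pooled two-sample t
p3 <- 2*pt(abs(t3), df = 2*n - 2, lower.tail = FALSE)

rbind(c(t1, p1), c(t2, p2), c(t3, p3))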
Activity 9.18 The weights (in grammes) of a group of five-week-old chickens reared
on a high-protein diet are 336, 421, 310, 446, 390 and 434. The weights of a second
group of chickens similarly reared, except for their low-protein diet, are 224, 275,
393, 282 and 365. Is there evidence that the additional protein has increased the
average weight of the chickens? Assume normality.
Solution
Assuming normally-distributed populations with possibly different means, but the
same variance, we test:
H0 : µX = µY
vs. H1 : µX > µY .
The sample means and standard deviations are x̄ = 389.5, ȳ = 307.8, sX = 55.40 and
sY = 69.45. The test statistic and its distribution under H0 are:

T = √[(n + m − 2) / ((n − 1)S²_X + (m − 1)S²_Y)] × (X̄ − Ȳ)/√(1/n + 1/m) ∼ t_{n+m−2}

and we obtain, for the given data, t = 2.175 > 1.833 = t_{0.05, 9}, hence we reject H0 that the mean weights are equal and conclude that the mean weight for the high-protein diet is greater, at the 5% significance level.
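Since the raw weights are given, the same pooled-variance test can be run directly with t.test; a sketch:

# Activity 9.18: one-sided pooled two-sample t test
high <- c(336, 421, 310, 446, 390, 434)   # high-protein diet
low  <- c(224, 275, 393, 282, 365)        # low-protein diet
t.test(high, low, alternative = "greater", var.equal = TRUE)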
Activity 9.19 Hard question!
(a) Two independent random samples, of n1 and n2 observations, are drawn from
normal distributions with the same variance σ 2 . Let S12 and S22 be the sample
variances of the first and the second samples, respectively. Show that:
σ̂² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)

is an unbiased estimator of σ².
Hint: Remember the expectation of a chi-squared variable is its degrees of
freedom.
(b) Two makes of car safety belts, A and B, have breaking strengths which are
normally distributed with the same variance. A random sample of 140 belts of
make A and a random sample of 220 belts of make B were tested. The sample means, and the sums of squares about the means (i.e. Σᵢ(xᵢ − x̄)²), of the breaking strengths (in lbf units) were (2685, 19000) for make A, and (2680, 34000) for make B, respectively. Is there significant evidence to support the
hypothesis that belts of make A are stronger on average than belts of make B?
Assume a 1% significance level.
Solution
(a) We first note that (nᵢ − 1)Sᵢ²/σ² ∼ χ²_{nᵢ−1}. By the definition of χ² distributions, we have:

E[(nᵢ − 1)Sᵢ²] = (nᵢ − 1)σ²   for i = 1, 2.

Hence:

E(σ̂²) = E{ [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2) }
       = [E((n1 − 1)S1²) + E((n2 − 1)S2²)] / (n1 + n2 − 2)
       = [(n1 − 1)σ² + (n2 − 1)σ²] / (n1 + n2 − 2)
       = σ².
(b) Denote x̄ = 2685 and ȳ = 2680, then 139s²_X = 19000 and 219s²_Y = 34000.

We test H0: µ_X = µ_Y vs. H1: µ_X > µ_Y. Under H0 we have:

X̄ − Ȳ ∼ N(0, σ²(1/140 + 1/220)) = N(0, 0.01169σ²)

and:

(139S²_X + 219S²_Y)/σ² ∼ χ²_{358}.

Hence:

T = [(X̄ − Ȳ)/√0.01169] / √[(139S²_X + 219S²_Y)/358] ∼ t_{358}

under H0. We reject H0 if t > t_{0.01, 358} ≈ 2.326. Since we observe t = 3.801 we reject H0, i.e. there is significant evidence to suggest that belts of make A are stronger on average than belts of make B.
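Part (b) can be checked numerically in R from the summary figures; a minimal sketch:

# Activity 9.19(b): pooled two-sample t test from summary statistics
nA <- 140; nB <- 220
xbarA <- 2685; xbarB <- 2680
ssA <- 19000; ssB <- 34000                   # sums of squares about the means
s2 <- (ssA + ssB)/(nA + nB - 2)              # pooled variance estimate
t <- (xbarA - xbarB)/sqrt(s2*(1/nA + 1/nB))
pt(t, df = nA + nB - 2, lower.tail = FALSE)  # one-sided p-value for H1: mu_A > mu_B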
9.13  Tests for correlation coefficients

We now consider a test for the correlation coefficient of two random variables X and Y where:

ρ = Corr(X, Y) = Cov(X, Y) / [Var(X) Var(Y)]^{1/2} = E[(X − E(X))(Y − E(Y))] / [E((X − E(X))²) E((Y − E(Y))²)]^{1/2}.
Some remarks are the following.
i. ρ ∈ [−1, 1], and |ρ| = 1 if and only if Y = aX + b for some constants a and b.
Furthermore, a > 0 if ρ = 1, and a < 0 if ρ = −1.
ii. ρ measures only the linear relationship between X and Y. When ρ = 0, there is no linear relationship between X and Y, i.e. they are uncorrelated.
iii. If X and Y are independent (in the sense that the joint pdf is the product of the
two marginal pdfs), ρ = 0. However, if ρ = 0, X and Y are not necessarily
independent, as there may exist some non-linear relationship between X and Y .
iv. If ρ > 0, X and Y tend to increase (or decrease) together. If ρ < 0, X and Y tend
to move in opposite directions.
Sample correlation coefficient

Given paired observations (Xi, Yi), for i = 1, . . . , n, a natural estimator of ρ is defined as:

ρ̂ = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / [ Σ_{i=1}^{n} (Xi − X̄)² Σ_{i=1}^{n} (Yi − Ȳ)² ]^{1/2}

where X̄ = Σ_{i=1}^{n} Xi/n and Ȳ = Σ_{i=1}^{n} Yi/n.
Example 9.11 The measurements of height, X, and weight, Y , are taken from 69
students in a class. ρ should be positive, intuitively!
In Figure 9.5, the vertical line at x̄ and the horizontal line at ȳ divide the 69 points
into 4 quadrants: northeast (NE), southwest (SW), northwest (NW) and southeast
(SE). Most points are in either NE or SW.
In the NE quadrant, xi > x̄ and yi > ȳ, hence Σ_{i∈NE} (xi − x̄)(yi − ȳ) > 0.

In the SW quadrant, xi < x̄ and yi < ȳ, hence Σ_{i∈SW} (xi − x̄)(yi − ȳ) > 0.

In the NW quadrant, xi < x̄ and yi > ȳ, hence Σ_{i∈NW} (xi − x̄)(yi − ȳ) < 0.

In the SE quadrant, xi > x̄ and yi < ȳ, hence Σ_{i∈SE} (xi − x̄)(yi − ȳ) < 0.

Overall, Σ_{i=1}^{69} (xi − x̄)(yi − ȳ) > 0 and hence ρ̂ > 0.
Figure 9.5: Scatterplot of height and weight in Example 9.11.
Figure 9.6 shows examples of different sample correlation coefficients using scatterplots
of bivariate observations.
9.13.1  Tests for correlation coefficients

Let {(X1, Y1), . . . , (Xn, Yn)} be a random sample from a two-dimensional normal distribution. Let ρ = Corr(Xi, Yi). We are interested in testing:

H0: ρ = 0 vs. H1: ρ ≠ 0.

It can be shown that under H0 the test statistic is:

T = ρ̂ √[(n − 2)/(1 − ρ̂²)] ∼ t_{n−2}.

Hence we reject H0 at the 100α% significance level, for α ∈ (0, 1), if |t| > t_{α/2, n−2}, where:

P(T > t_{α/2, n−2}) = α/2.

Some remarks are the following.

i. |T| = |ρ̂| √[(n − 2)/(1 − ρ̂²)] increases as |ρ̂| increases.
Figure 9.6: Scatterplots of bivariate observations with different sample correlation
coefficients.
ii. For H1 : ρ > 0, we reject H0 if t > tα, n−2 .
iii. Two random variables X and Y are jointly normal if aX + bY is normal for any
constants a and b.
iv. For jointly normal random variables X and Y , if Corr(X, Y ) = 0, X and Y are also
independent.
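When the raw pairs are available, this test is implemented in R by cor.test, which reports exactly this t statistic. A minimal sketch with illustrative data (the values are invented, not taken from the guide):

# Test of H0: rho = 0 based on the sample correlation coefficient
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1)
y <- c(2.1, 2.0, 3.9, 4.2, 5.5, 5.9, 8.0, 7.6)
cor.test(x, y)                              # two-sided test
cor.test(x, y, alternative = "greater")     # one-sided H1: rho > 0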
Activity 9.20 The following table shows the number of salespeople employed by a
company and the corresponding value of sales (in £000s):
Number of salespeople (x)   210   209   219   225   232   221
Sales (y)                   206   200   204   215   222   216
Number of salespeople (x)   220   233   200   215   205   227
Sales (y)                   210   218   201   212   204   212
Compute the sample correlation coefficient for these data and carry out a formal test
for a (linear) relationship between the number of salespeople and sales.
Note that:

Σxᵢ = 2,616,   Σyᵢ = 2,520,   Σxᵢ² = 571,500,   Σyᵢ² = 529,746   and   Σxᵢyᵢ = 550,069.
Solution
We test:
H0 : ρ = 0 vs. H1 : ρ > 0.
The corresponding test statistic and its distribution under H0 are:
T = ρ̂ √[(n − 2)/(1 − ρ̂²)] ∼ t_{n−2}.
We find ρb = 0.8716 and obtain t = 5.62 > 2.764 = t0.01, 10 and so we reject H0 at the
1% significance level. Since the test is highly significant, there is overwhelming
evidence of a (linear) relationship between the number of salespeople and the value
of sales.
Activity 9.21 A random sample {(Xi, Yi), 1 ≤ i ≤ n} from a two-dimensional normal distribution yields:

x̄ = 6.31,   ȳ = 3.56,   s_X = 5.31,   s_Y = 12.92   and   Σ_{i=1}^{n} xᵢyᵢ/n = 14.78.
Let ρ = Corr(X, Y ).
(a) Test the null hypothesis H0 : ρ = 0 against the alternative hypothesis H1 : ρ < 0
at the 5% significance level with the sample size n = 10.
(b) Repeat (a) for n = 500.
Solution
We have:

ρ̂ = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / [ Σ_{i=1}^{n} (Xi − X̄)² Σ_{i=1}^{n} (Yi − Ȳ)² ]^{1/2} = ( Σ_{i=1}^{n} Xi Yi − nX̄Ȳ ) / ((n − 1) S_X S_Y).

Under H0: ρ = 0, the test statistic is:

T = ρ̂ √[(n − 2)/(1 − ρ̂²)] ∼ t_{n−2}.

Hence we reject H0 if t < −t_{0.05, n−2}.

(a) For n = 10, −t_{0.05, n−2} = −1.860, ρ̂ = −0.124 and t = −0.355. Hence we cannot reject H0, so there is no evidence that X and Y are correlated.

(b) For n = 500, −t_{0.05, n−2} ≈ −1.645, ρ̂ = −0.112 and t = −2.52. Hence we reject H0, so there is significant evidence that X and Y are correlated.

Note that the sample correlation coefficient ρ̂ = −0.124 is not significantly different from 0 when the sample size is 10. However, ρ̂ = −0.112 is significantly different from 0 when the sample size is 500!
9.14  Tests for the ratio of two normal variances

Let {X1, . . . , Xn} and {Y1, . . . , Ym} be two independent random samples from, respectively, N(µ_X, σ²_X) and N(µ_Y, σ²_Y). We are interested in testing:

H0: σ²_Y/σ²_X = k   vs.   H1: σ²_Y/σ²_X ≠ k

where k > 0 is a given constant. The case with k = 1 is of particular interest since this tests for equal variances.

Let the sample means be X̄ = Σ_{i=1}^{n} Xi/n and Ȳ = Σ_{i=1}^{m} Yi/m, and the sample variances be:

S²_X = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²   and   S²_Y = (1/(m − 1)) Σ_{i=1}^{m} (Yi − Ȳ)².

We have (n − 1)S²_X/σ²_X ∼ χ²_{n−1} and (m − 1)S²_Y/σ²_Y ∼ χ²_{m−1}. Therefore:

(σ²_Y/σ²_X) × (S²_X/S²_Y) = (S²_X/σ²_X) / (S²_Y/σ²_Y) ∼ F_{n−1, m−1}.

Under H0, T = kS²_X/S²_Y ∼ F_{n−1, m−1}. Hence H0 is rejected if:

t < F_{1−α/2, n−1, m−1}   or   t > F_{α/2, n−1, m−1}

where F_{α, p, k} denotes the top 100αth percentile of the F_{p, k} distribution, that is:

P(T > F_{α, p, k}) = α

available from Table 12 of the New Cambridge Statistical Tables.

Since:

P( F_{1−α/2, n−1, m−1} ≤ (σ²_Y/σ²_X) × (S²_X/S²_Y) ≤ F_{α/2, n−1, m−1} ) = 1 − α

a 100(1 − α)% confidence interval for σ²_Y/σ²_X is:

( F_{1−α/2, n−1, m−1} × S²_Y/S²_X ,  F_{α/2, n−1, m−1} × S²_Y/S²_X ).
Example 9.12 Here we practise use of Table 12 of the New Cambridge Statistical
Tables to obtain critical values for the F distribution.
Table 12 can be used to find the top 100αth percentile of the Fν1 , ν2 distribution for
α = 0.10, 0.05, 0.025, 0.01, 0.005 and 0.001 using Tables 12(a) to 12(f), respectively.
For example, for ν1 = 3 and ν2 = 5, then:
P (F3, 5 > 3.619) = 0.10 (using Table 12(a))
P (F3, 5 > 5.409) = 0.05 (using Table 12(b))
P (F3, 5 > 7.764) = 0.025 (using Table 12(c))
P (F3, 5 > 12.060) = 0.01 (using Table 12(d)).
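The same percentiles can be obtained in R with qf (and tail probabilities with pf); for example:

# Upper percentiles of the F distribution with 3 and 5 degrees of freedom
qf(0.10,  df1 = 3, df2 = 5, lower.tail = FALSE)   # 3.619
qf(0.05,  df1 = 3, df2 = 5, lower.tail = FALSE)   # 5.409
qf(0.025, df1 = 3, df2 = 5, lower.tail = FALSE)   # 7.764
qf(0.01,  df1 = 3, df2 = 5, lower.tail = FALSE)   # 12.060
pf(3.619, df1 = 3, df2 = 5, lower.tail = FALSE)   # 0.10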
Example 9.13 The daily returns (in percentages) of two assets, X and Y , are
recorded over a period of 100 trading days, yielding average daily returns of x̄ = 3.21
and ȳ = 1.41. Also available from the data are the following quantities:
Σ_{i=1}^{100} xᵢ² = 1989.24,   Σ_{i=1}^{100} yᵢ² = 932.78   and   Σ_{i=1}^{100} xᵢyᵢ = 661.11.
Assume the data are normally distributed. Are the two assets positively correlated
with each other, and is asset X riskier than asset Y ?
With n = 100 we have:

s²_X = (1/(n − 1)) Σ_{i=1}^{n} (xᵢ − x̄)² = (1/(n − 1)) ( Σ_{i=1}^{n} xᵢ² − nx̄² ) = 9.69

and:

s²_Y = (1/(n − 1)) Σ_{i=1}^{n} (yᵢ − ȳ)² = (1/(n − 1)) ( Σ_{i=1}^{n} yᵢ² − nȳ² ) = 7.41.

Therefore:

ρ̂ = Σ_{i=1}^{n} (xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) s_X s_Y) = ( Σ_{i=1}^{n} xᵢyᵢ − n x̄ ȳ ) / ((n − 1) s_X s_Y) = 0.249.
First we test:
H0 : ρ = 0 vs. H1 : ρ > 0.
Under H0 , the test statistic is:
T = ρ̂ √[(n − 2)/(1 − ρ̂²)] ∼ t_{98}.
Setting α = 0.01, we reject H0 if t > t0.01, 98 = 2.37. With the given data, t = 2.545
hence we reject the null hypothesis of ρ = 0 at the 1% significance level. We
conclude that there is highly significant evidence indicating that the two assets are
positively correlated.
We measure the risks in terms of variances, and test:
H0: σ²_X = σ²_Y   vs.   H1: σ²_X > σ²_Y.

Under H0, T = S²_X/S²_Y ∼ F_{99, 99}. Hence we reject H0 if t > F_{0.05, 99, 99} = 1.39 at the
5% significance level, using Table 12(b) of the New Cambridge Statistical Tables.
With the given data, t = 9.69/7.41 = 1.308. Therefore, we cannot reject H0 . As the
test is not significant at the 5% significance level, we may not conclude that the
variances of the two assets are significantly different. Therefore, there is no
significant evidence indicating that asset X is riskier than asset Y .
Strictly speaking, the test is valid only if the two samples are independent of each
other, which is not the case here.
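A sketch of the calculations in R, working from the given sums:

# Example 9.13: correlation and variance-ratio tests from summary sums
n <- 100
xbar <- 3.21; ybar <- 1.41
sumx2 <- 1989.24; sumy2 <- 932.78; sumxy <- 661.11
s2x <- (sumx2 - n*xbar^2)/(n - 1)                       # about 9.69
s2y <- (sumy2 - n*ybar^2)/(n - 1)                       # about 7.41
rho <- (sumxy - n*xbar*ybar)/((n - 1)*sqrt(s2x*s2y))    # about 0.249

t <- rho*sqrt((n - 2)/(1 - rho^2))                      # H0: rho = 0 vs. H1: rho > 0
pt(t, df = n - 2, lower.tail = FALSE)

f <- s2x/s2y                                            # H0: var_X = var_Y vs. H1: var_X > var_Y
pf(f, df1 = n - 1, df2 = n - 1, lower.tail = FALSE)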
Activity 9.22 Two independent samples from normal populations yield the following results:

                  Sample 1            Sample 2
Sample size       n = 5               m = 7
Sum of squares    Σ(xᵢ − x̄)² = 4.8   Σ(yᵢ − ȳ)² = 37.2

Test at the 5% significance level whether the population variances are the same based on the above data.
Solution
We test:
H0: σ₁² = σ₂²   vs.   H1: σ₁² ≠ σ₂².

Under H0, the test statistic is:

T = S₁²/S₂² ∼ F_{n−1, m−1} = F_{4, 6}.
Critical values are F0.975, 4, 6 = 1/F0.025, 6, 4 = 1/9.20 = 0.11 and F0.025, 4, 6 = 6.23,
using Table 12 of the New Cambridge Statistical Tables. The test statistic value is:
t = (4.8/4)/(37.2/6) = 0.1935
and since 0.11 < 0.1935 < 6.23 we do not reject H0 , which means there is no
evidence of a difference in the variances.
Activity 9.23 Class A was taught using detailed PowerPoint slides. The marks in
the final examination for a random sample of Class A students were:
74, 61, 67, 84, 41, 68, 57, 64, 46.
Students in Class B were required to read textbooks and answer questions in class
discussions. The marks in the final examination for a random sample of Class B
students were:
48, 50, 42, 53, 81, 59, 64, 45.
Assuming examination marks are normally distributed, can we infer that the
variances of the marks differ between the two classes? Test at the 5% significance
level.
Solution
We test H0: σ²_A = σ²_B vs. H1: σ²_A ≠ σ²_B. Under H0 we have:

T = S²_A/S²_B ∼ F_{n_A−1, n_B−1}.

Hence H0 is rejected if either t ≤ F_{1−α/2, n_A−1, n_B−1} or t ≥ F_{α/2, n_A−1, n_B−1}.
For the given data, nA = 9, s2A = 176.778, nB = 8 and s2B = 159.929. Setting
α = 0.05, F0.975, 8, 7 = 0.221 and F0.025, 8, 7 = 4.90. Since:
0.221 < t = 1.105 < 4.90
we cannot reject H0 , i.e. there is no significant evidence to indicate that the
variances of the marks in the two classes are different.
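With the raw marks available, R's var.test carries out this two-sided F test directly; a sketch:

# Activity 9.23: F test for equality of two normal variances
classA <- c(74, 61, 67, 84, 41, 68, 57, 64, 46)
classB <- c(48, 50, 42, 53, 81, 59, 64, 45)
var.test(classA, classB)   # statistic s_A^2/s_B^2 on (8, 7) degrees of freedom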
Activity 9.24 After the machine in Activity 9.14 is calibrated, we collect a new
sample of 21 bags. The sample standard deviation of their weights is 23.72 g. Based
on this sample, can you conclude that the calibration has reduced the variance of the
weights of the bags?
Solution

Let:

Y₁, . . . , Y₂₁ ∼ N(µ_Y, σ²_Y)

be the weights of the bags in the new sample, and use σ²_X to denote the variance of the distribution of the previous sample, to avoid confusion. We want to test for a reduction in variance, so we set:

H0: σ²_X/σ²_Y = 1   vs.   H1: σ²_X/σ²_Y > 1.

The value of the test statistic in this case is:

s²_X/s²_Y = (32.48)²/(23.72)² = 1.875.

If the null hypothesis is true, the test statistic will follow an F_{18−1, 21−1} = F_{17, 20} distribution.
At the 5% significance level, the upper-tail critical value of the F17, 20 distribution is
F0.05, 17, 20 = 2.17. Our test statistic does not exceed this value, so we cannot reject
the null hypothesis.
We move to the 10% significance level. The upper-tail critical value is
F0.10, 17, 20 = 1.821, so we can now reject the null hypothesis (if only barely). We
conclude that there is some evidence that the variance is reduced, but it is not very
strong evidence.
Notice the difference between the conclusions of these two tests. We have a much more powerful test when we compare our standard deviation of 32.48 g to a fixed standard deviation of 25 g than when we compare it to an estimated standard deviation of 23.72 g, even though the values are similar.
9.15  Summary: tests for two normal distributions

Let (X1, . . . , Xn) ∼ IID N(µ_X, σ²_X), (Y1, . . . , Ym) ∼ IID N(µ_Y, σ²_Y), and ρ = Corr(X, Y).

A summary table of tests for two normal distributions is:

Null hypothesis, H0                      Test statistic, T                                                   Distribution of T under H0
µ_X − µ_Y = δ (σ²_X, σ²_Y known)         (X̄ − Ȳ − δ)/√(σ²_X/n + σ²_Y/m)                                     N(0, 1)
µ_X − µ_Y = δ (σ²_X = σ²_Y unknown)      √[(n+m−2)/((n−1)S²_X + (m−1)S²_Y)] × (X̄ − Ȳ − δ)/√(1/n + 1/m)      t_{n+m−2}
ρ = 0 (n = m)                            ρ̂ √[(n−2)/(1 − ρ̂²)]                                                t_{n−2}
σ²_Y/σ²_X = k                            k S²_X/S²_Y                                                         F_{n−1, m−1}

9.16  Overview of chapter

This chapter has discussed hypothesis tests for parameters of normal distributions – specifically means and variances. In each case an appropriate test statistic was constructed whose distribution under the null hypothesis was known. Concepts of hypothesis testing errors and power were also discussed, as well as how to test correlation coefficients.

9.17  Key terms and concepts

Alternative hypothesis     Critical value
Decision                   Null hypothesis
p-value                    Paired comparison
Power function             Significance level
t test                     Test statistic
Type I error               Type II error

9.18  Sample examination questions
Solutions can be found in Appendix C.
1. Suppose that one observation, i.e. n = 1, is taken from the geometric distribution:
p(x; π) = (1 − π)^{x−1} π   for x = 1, 2, . . .,   and   p(x; π) = 0 otherwise,
to test H0 : π = 0.3 vs. H1 : π > 0.3. The null hypothesis is rejected if x ≥ 4.
(a) What is the probability that a Type II error will be committed when the true
parameter value is π = 0.4?
(b) What is the probability that a Type I error will be committed?
(c) If x = 4, what is the p-value of the test?
2. Let X have a Poisson distribution with mean λ. We want to test the null
hypothesis that λ = 1/2 against the alternative λ = 2. We reject the null
hypothesis if and only if x > 1. Calculate the size and power of the test. You may
use the approximate value e ≈ 2.718.
3. A random sample of size n = 10 is taken from N (µ, σ 2 ). Consider the following
hypothesis test:
H0 : σ 2 = 2.00 vs. H1 : σ 2 > 2.00
to be conducted at the 1% significance level.
Determine the power of the test for σ 2 = 2.00 and σ 2 = 2.56. (You may use the
closest available values in the statistical tables provided.)
Chapter 10
Analysis of variance (ANOVA)
10.1
Synopsis of chapter
This chapter introduces analysis of variance (ANOVA) which is a widely-used technique
for detecting differences between groups based on continuous dependent variables.
10.2
Learning outcomes
After completing this chapter, you should be able to:
explain the purpose of analysis of variance
restate and interpret the models for one-way and two-way analysis of variance
conduct small examples of one-way and two-way analysis of variance with a
calculator, reporting the results in an ANOVA table
perform hypothesis tests and construct confidence intervals for one-way and
two-way analysis of variance
explain how to interpret residuals from an analysis of variance.
10.3
Introduction
Analysis of variance (ANOVA) is a popular tool which has an applicability and power
which we can only start to appreciate in this course. The idea of analysis of variance is
to investigate how variation in structured data can be split into pieces associated with
components of that structure. We look only at one-way and two-way classifications,
providing tests and confidence intervals which are widely used in practice.
10.4
Testing for equality of three population means
We begin with an illustrative example to test the hypothesis that three population means are equal.
Example 10.1 To assess the teaching quality of class teachers, a random sample of
6 examination marks was selected from each of three classes. The examination marks
for each class are listed in the table below.
Can we infer from these data that there is no significant difference in the
examination marks among all three classes?
Class 1   Class 2   Class 3
   85        71        59
   75        75        64
   82        73        62
   76        74        69
   71        69        75
   85        82        67
Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for
j = 1, 2, 3. So we assume examination marks are normally distributed with the same
variance in each class, but possibly different means.
We need to test the hypothesis:
H0 : µ1 = µ2 = µ3 .
The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xij .
We compute the column means first where the jth column mean is:
X̄·j = (X_{1j} + X_{2j} + · · · + X_{n_j j}) / n_j
where nj is the sample size of group j (here nj = 6 for all j).
This leads to x̄·1 = 79, x̄·2 = 74 and x̄·3 = 66. Transposing the table, we get:
                    Observation
           1    2    3    4    5    6    Mean
Class 1   85   75   82   76   71   85     79
Class 2   71   75   73   74   69   82     74
Class 3   59   64   62   69   75   67     66
Note that similar problems arise from other practical situations. For example:
comparing the returns of three stocks
comparing sales using three advertising strategies
comparing the effectiveness of three medicines.
If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to
each other, i.e. all of them should be close to the overall sample mean, x̄, which is:
x̄ = (x̄·1 + x̄·2 + x̄·3)/3 = (79 + 74 + 66)/3 = 73
i.e. the mean value of all 18 observations.
So we wish to perform a hypothesis test based on the variation in the sample means
such that the greater the variation, the more likely we are to reject H0 . One possible
measure for the variation in the sample means X̄·j about the overall sample mean X̄,
for j = 1, 2, 3, is:
3
X
(X̄·j − X̄)2 .
(10.1)
j=1
However, (10.1) is not scale-invariant, so it would be difficult to judge whether the
realised value is large enough to warrant rejection of H0 due to the magnitude being
dependent on the units of measurement of the data. So we seek a scale-invariant test
statistic.
Just as we scaled the covariance between two random variables to give the
scale-invariant correlation coefficient, we can similarly scale (10.1) to give the
following possible test statistic:
T = Σ_{j=1}^{3} (X̄·j − X̄)² / (sum of the three sample variances).
Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which
would mean that there is no variation at all between the sample means. In this case
all the sample means would equal x̄.)
It remains to determine the distribution of T under H0 .
10.5
One-way analysis of variance
We now extend Example 10.1 to consider a general setting where there are k
independent random samples available from k normal distributions N (µj , σ 2 ), for
j = 1, . . . , k. (Example 10.1 corresponds to k = 3.)
Denote by X1j , X2j , . . . , Xnj j the random sample with sample size nj from N (µj , σ 2 ),
for j = 1, . . . , k.
Our goal is to test H0 : µ1 = · · · = µk vs. H1 : not all µj s are the same.
One-way analysis of variance (one-way ANOVA) involves a continuous dependent
variable and one categorical independent variable (sometimes called a factor, or
treatment), where the k different levels of the categorical variable are the k different
groups.
We now introduce statistics associated with one-way ANOVA.
Statistics associated with one-way ANOVA

The jth sample mean is:

X̄·j = (1/nj) Σ_{i=1}^{nj} Xij.

The overall sample mean is:

X̄ = (1/n) Σ_{j=1}^{k} Σ_{i=1}^{nj} Xij = (1/n) Σ_{j=1}^{k} nj X̄·j

where n = Σ_{j=1}^{k} nj is the total number of observations across all k groups.

The total variation is:

Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄)²

with n − 1 degrees of freedom.

The between-groups variation is:

B = Σ_{j=1}^{k} nj (X̄·j − X̄)²

with k − 1 degrees of freedom.

The within-groups variation is:

W = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²

with n − k = Σ_{j=1}^{k} (nj − 1) degrees of freedom.

The ANOVA decomposition is:

Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄)² = Σ_{j=1}^{k} nj (X̄·j − X̄)² + Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)².
We have already discussed the jth sample mean and overall sample mean. The total
variation is a measure of the overall (total) variability in the data from all k groups
about the overall sample mean. The ANOVA decomposition decomposes this into two
components: between-groups variation (which is attributable to the factor level) and
within-groups variation (which is attributable to the variation within each group and is
assumed to be the same σ 2 for each group).
Some remarks are the following.
i. B and W are also called, respectively, between-treatments variation and within-treatments variation. In fact W is effectively a residual (error) sum of squares, representing the variation which cannot be explained by the treatment or group factor.

ii. The ANOVA decomposition follows from the identity:

Σ_{i=1}^{m} (aᵢ − b)² = Σ_{i=1}^{m} (aᵢ − ā)² + m(ā − b)².

However, the actual derivation is not required for this course.

iii. The following are some useful formulae for manual computations.

• n = Σ_{j=1}^{k} nj.

• X̄·j = Σ_{i=1}^{nj} Xij/nj and X̄ = Σ_{j=1}^{k} nj X̄·j/n.

• Total variation = Total SS = B + W = Σ_{j=1}^{k} Σ_{i=1}^{nj} X²ij − nX̄².

• B = Σ_{j=1}^{k} nj X̄²·j − nX̄².

• Residual (Error) SS = W = Σ_{j=1}^{k} Σ_{i=1}^{nj} X²ij − Σ_{j=1}^{k} nj X̄²·j = Σ_{j=1}^{k} (nj − 1)S²j where S²j is the jth sample variance.
We now note, without proof, the following results.

i. B = Σ_{j=1}^{k} nj (X̄·j − X̄)² and W = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)² are independent of each other.

ii. W/σ² = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²/σ² ∼ χ²_{n−k}.

iii. Under H0: µ1 = · · · = µk, B/σ² = Σ_{j=1}^{k} nj (X̄·j − X̄)²/σ² ∼ χ²_{k−1}.

In order to test H0: µ1 = · · · = µk, we define the following test statistic:

F = [ Σ_{j=1}^{k} nj (X̄·j − X̄)²/(k − 1) ] / [ Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²/(n − k) ] = [B/(k − 1)] / [W/(n − k)].

Under H0, F ∼ F_{k−1, n−k}. We reject H0 at the 100α% significance level if:

f > F_{α, k−1, n−k}

where F_{α, k−1, n−k} is the top 100αth percentile of the F_{k−1, n−k} distribution, i.e. P(F > F_{α, k−1, n−k}) = α, and f is the observed test statistic value.
The p-value of the test is:
p-value = P (F > f ).
It is clear that f > Fα, k−1, n−k if and only if the p-value < α, as we must reach the same
conclusion regardless of whether we use the critical value approach or the p-value
approach to hypothesis testing.
One-way ANOVA table
Typically, one-way ANOVA results are presented in a table as follows:
Source   DF      SS      MS          F                       p-value
Factor   k − 1   B       B/(k − 1)   [B/(k−1)]/[W/(n−k)]     p
Error    n − k   W       W/(n − k)
Total    n − 1   B + W
Example 10.2 Continuing with Example 10.1, for the given data, k = 3, n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73. The sample variances are calculated to be s²1 = 34, s²2 = 20 and s²3 = 32. Therefore:

b = Σ_{j=1}^{3} 6(x̄·j − x̄)² = 6[(79 − 73)² + (74 − 73)² + (66 − 73)²] = 516

and:

w = Σ_{j=1}^{3} Σ_{i=1}^{6} (xij − x̄·j)² = Σ_{j=1}^{3} Σ_{i=1}^{6} x²ij − 6 Σ_{j=1}^{3} x̄²·j = Σ_{j=1}^{3} 5s²j = 5(34 + 20 + 32) = 430.

Hence:

f = [b/(k − 1)]/[w/(n − k)] = (516/2)/(430/15) = 9.

Under H0: µ1 = µ2 = µ3, F ∼ F_{k−1, n−k} = F_{2, 15}. Since F_{0.01, 2, 15} = 6.359 < 9, using Table 12(d) of the New Cambridge Statistical Tables, we reject H0 at the 1% significance level. In fact the p-value (using a computer) is P(F > 9) = 0.003. Therefore, we conclude that there is a significant difference among the mean examination marks across the three classes.
The one-way ANOVA table is as follows:

Source   DF   SS    MS      F   p-value
Class    2    516   258     9   0.003
Error    15   430   28.67
Total    17   946
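The same table can be reproduced in R by entering the marks from Example 10.1; a sketch:

# Example 10.2: one-way ANOVA of examination marks by class
marks <- c(85, 75, 82, 76, 71, 85,    # Class 1
           71, 75, 73, 74, 69, 82,    # Class 2
           59, 64, 62, 69, 75, 67)    # Class 3
class <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 6))
anova(lm(marks ~ class))   # F = 9 on (2, 15) degrees of freedom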
Example 10.3 A study performed by a Columbia University professor counted the
number of times per minute professors from three different departments said ‘uh’ or
‘ah’ during lectures to fill gaps between words. The data were derived from
observing 100 minutes from each of the three departments. If we assume that the
more frequent use of ‘uh’ or ‘ah’ results in more boring lectures, can we conclude
that some departments’ professors are more boring than others?
The counts for English, Mathematics and Political Science departments are stored.
As always in statistical analysis, we first look at the summary (descriptive) statistics
of these data, here using R.
> attach(UhAh)
> summary(UhAh)
Frequency
Department
Min.
: 0.00
English
:100
1st Qu.: 4.00
Mathematics
:100
Median : 5.00
Political Science:100
Mean
: 5.48
3rd Qu.: 7.00
Max.
:11.00
> xbar <- tapply(Frequency, Department, mean)
> s <- tapply(Frequency, Department, sd)
> n <- tapply(Frequency, Department, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
          English       Mathematics Political Science
             5.81              5.30              5.33

[[2]]
          English       Mathematics Political Science
         2.493203          2.012587          1.974867

[[3]]
          English       Mathematics Political Science
              100               100               100

[[4]]
          English       Mathematics Political Science
        0.2493203         0.2012587         0.1974867
Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in
Mathematics and Political Science (compare the sample means of 5.81, 5.30 and
5.33), but the difference seems small. However, we need to formally test whether the
(seemingly small) differences are statistically significant.
Using the data, R produces the following one-way ANOVA table:
> anova(lm(Frequency ~ Department))
Analysis of Variance Table
Response: Frequency
Df Sum Sq Mean Sq F value Pr(>F)
Department
2
16.38 8.1900 1.7344 0.1783
Residuals 297 1402.50 4.7222
Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis:
H0 : µ1 = µ2 = µ3 .
Therefore, there is no evidence of a difference in the mean number of ‘uh’s or ‘ah’s
said by professors across the three departments.
In addition to a one-way ANOVA table, we can also obtain the following.
An estimator of σ is:

σ̂ = S = √(W/(n − k)).

95% confidence intervals for µj are given by:

X̄·j ± t_{0.025, n−k} × S/√nj   for j = 1, . . . , k
where t0.025, n−k is the top 2.5th percentile of the Student’s tn−k distribution, which
can be obtained from Table 10 of the New Cambridge Statistical Tables.
Example 10.4 Assuming a common variance for each group, from the preceding output in Example 10.3 we see that:

σ̂ = s = √(1402.50/297) = √4.72 = 2.173.

Since t_{0.025, 297} ≈ t_{0.025, ∞} = 1.96, using Table 10 of the New Cambridge Statistical Tables, we obtain the following 95% confidence intervals for µ1, µ2 and µ3, respectively:

j = 1:   5.81 ± 1.96 × 2.173/√100 ⇒ (5.38, 6.24)

j = 2:   5.30 ± 1.96 × 2.173/√100 ⇒ (4.87, 5.73)

j = 3:   5.33 ± 1.96 × 2.173/√100 ⇒ (4.90, 5.76).
R can produce the following:
> stripchart(Frequency ~ Department,pch=16,vert=T)
> arrows(1:3,xbar+1.96*2.173/sqrt(n),1:3,xbar-1.96*2.173/sqrt(n),
angle=90,code=3,length=0.1)
> lines(1:3,xbar,pch=4,type="b",cex=2)
These 95% confidence intervals can be seen plotted in Figure 10.1 below. Note that these confidence intervals all overlap, which is consistent with our failure to reject the null hypothesis that all population means are equal.
Figure 10.1: Overlapping confidence intervals.
Example 10.5 In early 2001, the American economy was slowing down and
companies were laying off workers. A poll conducted during February 2001 asked a
random sample of workers how long (in months) it would be before they faced
significant financial hardship if they lost their jobs. They are classified into four
groups according to their incomes. Below is part of the R output of the descriptive
statistics of the classified data. Can we infer that income group has a significant
impact on the mean length of time before facing financial hardship?
    Hardship          Income.group
 Min.   : 0.00     $20 to 30K: 81
 1st Qu.: 8.00     $30 to 50K:114
 Median :15.00     Over $50K : 39
 Mean   :16.11     Under $20K: 67
 3rd Qu.:22.00
 Max.   :50.00
> xbar <- tapply(Hardship, Income.group, mean)
> s <- tapply(Hardship, Income.group, sd)
> n <- tapply(Hardship, Income.group, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
$20 to 30K $30 to 50K Over $50K Under $20K
15.493827 18.456140 22.205128
9.313433
[[2]]
$20 to 30K $30 to 50K
9.233260
9.507464
Over $50K Under $20K
11.029099
8.087043
[[3]]
$20 to 30K $30 to 50K
81
114
Over $50K Under $20K
39
67
[[4]]
$20 to 30K $30 to 50K
1.0259178 0.8904556
Over $50K Under $20K
1.7660693 0.9879896
Inspection of the sample means suggests that there is a difference between income
groups, but we need to conduct a one-way ANOVA test to see whether the
differences are statistically significant.
We apply one-way ANOVA to test whether the means in the k = 4 groups are equal,
i.e. H0 : µ1 = µ2 = µ3 = µ4 , from highest to lowest income groups.
We have n1 = 39, n2 = 114, n3 = 81 and n4 = 67, hence:
n = Σ_{j=1}^{k} nj = 39 + 114 + 81 + 67 = 301.
Also x̄·1 = 22.21, x̄·2 = 18.456, x̄·3 = 15.49, x̄·4 = 9.313 and:
x̄ = (1/n) Σ_{j=1}^{k} nj x̄·j = (39 × 22.21 + 114 × 18.456 + 81 × 15.49 + 67 × 9.313)/301 = 16.109.
Now:
b = Σ_{j=1}^{k} nj (x̄·j − x̄)² = 39(22.21 − 16.109)² + 114(18.456 − 16.109)² + 81(15.49 − 16.109)² + 67(9.313 − 16.109)² = 5205.097.
We have s²1 = (11.03)² = 121.661, s²2 = (9.507)² = 90.383, s²3 = (9.23)² = 85.193 and s²4 = (8.087)² = 65.400, hence:

w = Σ_{j=1}^{k} Σ_{i=1}^{nj} (xij − x̄·j)² = Σ_{j=1}^{k} (nj − 1)s²j = 38 × 121.661 + 113 × 90.383 + 80 × 85.193 + 66 × 65.400 = 25968.24.
Consequently:
f = [b/(k − 1)]/[w/(n − k)] = (5205.097/3)/(25968.24/(301 − 4)) = 19.84.
Under H0 , F ∼ Fk−1, n−k = F3, 297 . Since F0.01, 3, 297 ≈ 3.848 < 19.84, we reject H0 at
the 1% significance level, i.e. there is strong evidence that income group has a
significant impact on the mean length of time before facing financial hardship.
The pooled estimate of σ is:
s = √(w/(n − k)) = √(25968.24/(301 − 4)) = 9.351.
A 95% confidence interval for µj is:
x̄·j ± t_{0.025, 297} × s/√nj = x̄·j ± 1.96 × 9.351/√nj = x̄·j ± 18.328/√nj.

Hence, for example, a 95% confidence interval for µ1 is:

22.21 ± 18.328/√39 ⇒ (19.28, 25.14)

and a 95% confidence interval for µ4 is:

9.313 ± 18.328/√67 ⇒ (7.07, 11.55).
Notice that these two confidence intervals do not overlap, which is consistent with
our conclusion that there is a difference between the group means.
R output for the data is:
> anova(lm(Hardship ~ Income.group))
Analysis of Variance Table
Response: Hardship
Df Sum Sq Mean Sq F value
Pr(>F)
Income.group
3 5202.1 1734.03 19.828 9.636e-12 ***
Residuals
297 25973.3
87.45
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
Note that minor differences are due to rounding errors in calculations.
Activity 10.1 Show that under the one-way ANOVA assumptions, for any set of constants {a1, . . . , ak}, the quantity Σ_{j=1}^{k} aj X̄·j is normally distributed with mean Σ_{j=1}^{k} aj µj and variance σ² Σ_{j=1}^{k} a²j/nj.
Solution
Under the one-way ANOVA assumptions, Xij ∼ IID N(µj, σ²) within each j = 1, . . . , k. Therefore, since the Xij s are independent with a common variance, σ², we have:

X̄·j ∼ N(µj, σ²/nj)   for j = 1, . . . , k.

Hence:

aj X̄·j ∼ N(aj µj, a²j σ²/nj)   for j = 1, . . . , k.

Therefore:

Σ_{j=1}^{k} aj X̄·j ∼ N( Σ_{j=1}^{k} aj µj,  σ² Σ_{j=1}^{k} a²j/nj ).
Activity 10.2 Do the following data appear to violate the assumptions underlying
one-way analysis of variance? Explain why or why not.
            Treatment
   A       B       C       D
 1.78    8.41    0.57    9.45
 8.26    5.61    3.04    8.47
 3.57    3.90    2.67    7.69
 4.69    3.77    1.66    8.53
 2.13    1.08    2.09    9.04
 6.17    2.67    1.57    7.11
Solution
We have s2A = 6.1632, s2B = 6.4106, s2C = 0.7715 and s2D = 0.7400. So we observe that
although s2A ≈ s2B and s2C ≈ s2D , the sample variances for treatments are very
different across all groups, suggesting that the assumption that σ 2 is the same for all
treatment levels may not be true.
Activity 10.3 An indicator of the value of a stock relative to its earnings is its
price-earnings ratio: the average of a given year’s high and low selling prices divided
by its annual earnings. The following table provides the price-earnings ratios for a
sample of 27 stocks, nine each from the financial, industrial and utility sectors of the
New York Stock Exchange. Test at the 1% significance level whether the true mean
price-earnings ratios for the three market sectors are the same. Use the ANOVA
table format to summarise your calculations. You may exclude the p-value.
Financial   Industrial   Utility
   11.4         9.4        15.4
   12.3        18.4        16.3
   10.8        15.9        10.9
    9.8        21.6        19.3
   14.3        17.1        15.1
   16.1        20.2        12.7
   11.9        18.6        16.8
   12.4        22.9        14.3
   13.1        18.6        13.8
Solution
For these n = 27 observations and k = 3 groups, we have x̄·1 = 12.46, x̄·2 = 18.08,
x̄·3 = 14.96 and x̄ = 15.16. Also:
3 X
9
X
x2ij = 6548.3.
j=1 i=1
Hence the total variation is:
3 X
9
X
x2ij − nx̄2 = 6548.3 − 27 × (15.16)2 = 340.58.
j=1 i=1
The between-groups variation is:
b=
3
X
nj x̄2·j − n x̄2 = 9 × ((12.46)2 + (18.08)2 + (14.96)2 ) − 27 × (15.16)2
j=1
= 142.82.
Therefore, w = 340.58 − 142.82 = 197.76. Hence the ANOVA table is:
Source
Sector
Error
Total
DF
2
24
26
SS
MS
142.82 71.41
197.76 8.24
340.58
F
8.67
To test the null hypothesis that the three types of stocks have equal price-earnings
ratios, on average, we reject H0 if:
f > F0.01, 2, 24 = 5.61.
Since 5.61 < 8.67, we reject H0 and conclude that there is strong evidence of a
difference in the mean price-earnings ratios across the sectors.
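A sketch of the same analysis in R:

# Activity 10.3: one-way ANOVA of price-earnings ratios by sector
pe <- c(11.4, 12.3, 10.8, 9.8, 14.3, 16.1, 11.9, 12.4, 13.1,   # Financial
        9.4, 18.4, 15.9, 21.6, 17.1, 20.2, 18.6, 22.9, 18.6,   # Industrial
        15.4, 16.3, 10.9, 19.3, 15.1, 12.7, 16.8, 14.3, 13.8)  # Utility
sector <- factor(rep(c("Financial", "Industrial", "Utility"), each = 9))
anova(lm(pe ~ sector))   # compare f with F(0.01, 2, 24) = 5.61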
Activity 10.4 Three trainee salespeople were working on a trial basis. Salesperson
A went in the field for 5 days and made a total of 440 sales. Salesperson B was tried
for 7 days and made a total of 630 sales. Salesperson C was tried for 10 days and
made a total of 690 sales. Note that these figures are total sales, not daily averages. The sum of the squares of all 22 daily sales (Σ xᵢ²) is 146,840.
(a) Construct a one-way analysis of variance table.
(b) Would you say there is a difference between the mean daily sales of the three
salespeople? Justify your answer.
(c) Construct a 95% confidence interval for the mean difference between salesperson
B and salesperson C. Would you say there is a difference?
Solution
(a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a
one-way ANOVA. First, we calculate the overall mean. This is:
440 + 630 + 690
= 80.
22
We can now calculate the sum of squares between salespeople. This is:
5 × (88 − 80)2 + 7 × (90 − 80)2 + 10 × (69 − 80)2 = 2230.
The total sum of squares is:
146840 − 22 × (80)2 = 6040.
Here is the one-way ANOVA table:
Source        DF   SS     MS       F      p-value
Salesperson   2    2230   1115     5.56   ≈ 0.01
Error         19   3810   200.53
Total         21   6040
(b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19
distribution (interpolated from Table 12 of the New Cambridge Statistical
Tables), we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that
the means are not equal.
(c) We have:
90 − 69 ± 2.093 × √(200.53 × (1/7 + 1/10)) = 21 ± 14.61.
Here 2.093 is the top 2.5th percentile point of the t distribution with 19 degrees
of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not included,
there is evidence of a difference.
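Because only totals are reported here, an R check has to work from summary quantities rather than raw observations; a sketch:

# Activity 10.4: one-way ANOVA quantities from totals
n <- c(5, 7, 10); totals <- c(440, 630, 690)
means <- totals/n                         # 88, 90, 69
grand <- sum(totals)/sum(n)               # 80
b <- sum(n*(means - grand)^2)             # 2230
total.ss <- 146840 - sum(n)*grand^2       # 6040
w <- total.ss - b                         # 3810
f <- (b/2)/(w/19)                         # about 5.56
pf(f, df1 = 2, df2 = 19, lower.tail = FALSE)
# 95% confidence interval for mu_B - mu_C
(means[2] - means[3]) + c(-1, 1)*qt(0.975, df = 19)*sqrt((w/19)*(1/7 + 1/10))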
Activity 10.5 The total times spent by three basketball players on court were
recorded. Player A was recorded on three occasions and the times were 29, 25 and 33
minutes. Player B was recorded twice and the times were 16 and 30 minutes. Player
C was recorded on three occasions and the times were 12, 14 and 16 minutes. Use
analysis of variance to test whether there is any difference in the average times the
three players spend on court.
Solution
We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence:
3 × (29 − 21.875)2 + 2 × (23 − 21.875)2 + 3 × (14 − 21.875)2 = 340.875.
The total sum of squares is:
4307 − 8 × (21.875)2 = 478.875.
Here is the one-way ANOVA table:
Source    DF   SS        MS         F       p-value
Players   2    340.875   170.4375   6.175   ≈ 0.045
Error     5    138       27.6
Total     7    478.875
We test H0 : µ1 = µ2 = µ3 (i.e. the average times they play are the same) vs. H1 :
The average times they play are not the same.
As 6.175 > 5.79 = F0.05, 2, 5 , which is the top 5th percentile of the F2, 5 distribution,
we reject H0 and conclude that there is evidence of a difference between the means.
Activity 10.6 Three independent random samples were taken. Sample A consists
of 4 observations taken from a normal distribution with mean µA and variance σ 2 ,
sample B consists of 6 observations taken from a normal distribution with mean µB
and variance σ 2 , and sample C consists of 5 observations taken from a normal
distribution with mean µC and variance σ 2 .
The average value of the first sample was 24, the average value of the second sample
was 20, and the average value of the third sample was 18. The sum of the squared
observations (all of them) was 6,722.4. Test the hypothesis:
H0 : µA = µB = µC
against the alternative that this is not so.
Solution
We will perform a one-way ANOVA. First we calculate the overall mean:
4 × 24 + 6 × 20 + 5 × 18
= 20.4.
15
We can now calculate the sum of squares between groups:
4 × (24 − 20.4)2 + 6 × (20 − 20.4)2 + 5 × (18 − 20.4)2 = 81.6.
The total sum of squares is:
6722.4 − 15 × (20.4)2 = 480.
Here is the one-way ANOVA table:
Source   DF   SS      MS     F       p-value
Sample   2    81.6    40.8   1.229   ≈ 0.327
Error    12   398.4   33.2
Total    14   480
As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12 distribution,
we see that there is no evidence that the means are not equal.
Activity 10.7 An executive of a prepared frozen meals company is interested in the
amounts of money spent on such products by families in different income ranges. The
table below lists the monthly expenditures (in dollars) on prepared frozen meals from
15 randomly selected families divided into three groups according to their incomes.
Under $15,000   $15,000 – $30,000   Over $30,000
     45.2              53.2              52.7
     60.1              56.6              73.6
     52.8              68.7              63.3
     31.7              51.8              51.8
     33.6              54.2
     39.4
(a) Based on these data, can we infer at the 5% significance level that the
population mean expenditures on prepared frozen meals are the same for the
three different income groups?
(b) Produce a one-way ANOVA table.
(c) Construct 95% confidence intervals for the mean expenditures of the first (under
$15,000) and the third (over $30,000) income groups.
Solution
(a) For this example, k = 3, n1 = 6, n2 = 5, n3 = 4 and n = n1 + n2 + n3 = 15.
We have x̄·1 = 43.8, x̄·2 = 56.9, x̄·3 = 60.35 and x̄ = 52.58.
Also, Σ_{j=1}^{3} Σ_{i=1}^{nj} x²ij = 43387.85.

Total SS = Σ_{j=1}^{3} Σ_{i=1}^{nj} x²ij − nx̄² = 43387.85 − 41469.85 = 1918.

w = Σ_{j=1}^{3} Σ_{i=1}^{nj} x²ij − Σ_{j=1}^{3} nj x̄²·j = 43387.85 − 42267.18 = 1120.67.
Therefore, b = Total SS − w = 1918 − 1120.67 = 797.33.
To test H0 : µ1 = µ2 = µ3 , the test statistic value is:
f = [b/(k − 1)]/[w/(n − k)] = (797.33/2)/(1120.67/12) = 4.269.
Under H0 , F ∼ F2, 12 . Since F0.05, 2, 12 = 3.89 < 4.269, we reject H0 at the 5%
significance level, i.e. there exists evidence indicating that the population mean
expenditures on frozen meals are not the same for the three different income
groups.
(b) The ANOVA table is as follows:
Source   DF   SS        MS       F       p-value
Income   2    797.33    398.67   4.269   < 0.05
Error    12   1120.67   93.39
Total    14   1918.00
(c) A 95% confidence interval for µj is of the form:
X̄·j ± t_{0.025, n−k} × S/√nj = X̄·j ± t_{0.025, 12} × √93.39/√nj = X̄·j ± 21.056/√nj.

For j = 1, a 95% confidence interval is 43.8 ± 21.056/√6 ⇒ (35.20, 52.40).

For j = 3, a 95% confidence interval is 60.35 ± 21.056/√4 ⇒ (49.82, 70.88).
Activity 10.8 Does the level of success of publicly-traded companies affect the way
their board members are paid? The annual payments (in $000s) of randomly selected
publicly-traded companies to their board members were recorded. The companies
were divided into four quarters according to the returns in their stocks, and the
payments from each quarter were grouped together. Some summary statistics are
provided below.
Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter

Variable      N    Mean    SE Mean   StDev
1st quarter   30   74.10   2.89      15.81
2nd quarter   30   75.67   2.48      13.57
3rd quarter   30   78.50   2.79      15.28
4th quarter   30   81.30   2.85      15.59
(a) Can we infer that the amount of payment differs significantly across the four
groups of companies?
(b) Construct 95% confidence intervals for the mean payment of the 1st quarter
companies and the 4th quarter companies.
Solution
(a) Here k = 4 and n1 = n2 = n3 = n4 = 30. We have x̄·1 = 74.10, x̄·2 = 75.67, x̄·3 = 78.50, x̄·4 = 81.30, b = 909, w = 26403 and the pooled estimate of σ is s = 15.09.

Hence the test statistic value is:

f = [b/(k − 1)]/[w/(n − k)] = 1.33.

Under H0: µ1 = µ2 = µ3 = µ4, F ∼ F_{k−1, n−k} = F_{3, 116}. Since F_{0.05, 3, 116} = 2.68 > 1.33, we cannot reject H0 at the 5% significance level. Hence there is no evidence to support the claim that payments among the four groups are significantly different.

(b) A 95% confidence interval for µj is of the form:

X̄·j ± t_{0.025, n−k} × S/√nj = X̄·j ± t_{0.025, 116} × 15.09/√30 = X̄·j ± 5.46.

For j = 1, a 95% confidence interval is 74.10 ± 5.46 ⇒ (68.64, 79.56).

For j = 4, a 95% confidence interval is 81.30 ± 5.46 ⇒ (75.84, 86.76).
Activity 10.9 Proficiency tests are administered to a sample of 9-year-old children.
The test scores are classified into four groups according to the highest education
level achieved by at least one of their parents. The education categories used for the
grouping are: ‘less than high school’, ‘high school graduate’, ‘some college’, and
‘college graduate’.
(a) Find the missing values A1, A2, A3 and A4 in the one-way ANOVA table below.
Source   DF    SS       MS      F    P
Factor   A1    45496    15165   A4   0.000
Error    275   A2
Total    278   329896

S = 32.16   R-Sq = 13.79%   R-Sq(adj) = 12.85%

Level          N    Mean     StDev
Less than HS   41   196.83   30.23
HS grad        73   207.78   29.34
Some college   86   223.38   34.58
College grad   79   232.67   32.86

Pooled StDev = 32.16

(Individual 95% confidence intervals for each mean, based on the pooled standard deviation, are also shown in the output.)
(b) Test whether there are differences in mean test scores between children whose
parents have different highest education levels.
(c) State the required model conditions for the inference conducted in (b).
Solution
(a) We have A1 = 3, A2 = 284400, A3 = 1034 and A4 = 14.66.
(b) Since the p-value of the F test is 0.000, there exists strong evidence indicating
that the mean test scores are different for children whose parents have different
highest education levels.
(c) We need to assume that we have independent observations Xij ∼ N (µj , σ 2 ) for
i = 1, . . . , nj and j = 1, . . . , k.
Activity 10.10 Four different drinks A, B, C and D were assessed by 15 tasters.
Each taster assessed only one drink. Drink A was assessed by 3 tasters and the
scores x1A , x2A and x3A were recorded; drink B was assessed by 4 tasters and the
scores x1B , . . . , x4B were recorded; drink C was assessed by 5 tasters and the scores
x1C , . . . , x5C were recorded; drink D was assessed by 3 tasters and the scores x1D ,
x2D , and x3D were recorded.
Explain how you would use this information to construct a one-way analysis of
variance (ANOVA) table and use it to test whether the four drinks are equally good
against the alternative that they are not. The significance level should be 1% and
you should provide the critical value.
Solution
We need to calculate the following:

X̄_A = (1/3) Σ_{i=1}^{3} X_{iA},   X̄_B = (1/4) Σ_{i=1}^{4} X_{iB},   X̄_C = (1/5) Σ_{i=1}^{5} X_{iC},   X̄_D = (1/3) Σ_{i=1}^{3} X_{iD}

and:

X̄ = ( Σ_{i=1}^{3} X_{iA} + Σ_{i=1}^{4} X_{iB} + Σ_{i=1}^{5} X_{iC} + Σ_{i=1}^{3} X_{iD} ) / 15.

Alternatively:

X̄ = (3X̄_A + 4X̄_B + 5X̄_C + 3X̄_D)/15.

We then need the between-groups sum of squares:

B = 3(X̄_A − X̄)² + 4(X̄_B − X̄)² + 5(X̄_C − X̄)² + 3(X̄_D − X̄)²

and the within-groups sum of squares:

W = Σ_{i=1}^{3} (X_{iA} − X̄_A)² + Σ_{i=1}^{4} (X_{iB} − X̄_B)² + Σ_{i=1}^{5} (X_{iC} − X̄_C)² + Σ_{i=1}^{3} (X_{iD} − X̄_D)².

Alternatively, we could calculate only one of the two, and calculate the total sum of squares (TSS):

TSS = Σ_{i=1}^{3} (X_{iA} − X̄)² + Σ_{i=1}^{4} (X_{iB} − X̄)² + Σ_{i=1}^{5} (X_{iC} − X̄)² + Σ_{i=1}^{3} (X_{iD} − X̄)²

and use the relationship TSS = B + W to calculate the other. We then construct the ANOVA table:

Source   DF   SS      MS     F
Factor   3    b       b/3    11b/3w
Error    11   w       w/11
Total    14   b + w

At the 100α% significance level, we then compare f = 11b/3w to F_{α, 3, 11} using Table 12 of the New Cambridge Statistical Tables. For α = 0.01, we will reject the null hypothesis that there is no difference if f > 6.22.
10.6
From one-way to two-way ANOVA
One-way ANOVA: a review
We have independent observations Xij ∼ N (µj , σ 2 ) for i = 1, . . . , nj and j = 1, . . . , k.
We are interested in testing:
H 0 : µ1 = · · · = µk .
The variation of the Xij s is driven by a factor at different levels µ1 , . . . , µk , in addition
to random fluctuations (i.e. random errors). We test whether such a factor effect
exists or not. We can model a one-way ANOVA problem as follows:
Xij = µ + βj + εij
for i = 1, . . . , nj , j = 1, . . . , k
where εij ∼ N(0, σ²) and the εij s are independent. µ is the average effect and βj is the factor (or treatment) effect at the jth level. Note that Σ_{j=1}^{k} βj = 0. The null hypothesis (i.e. that the group means are all equal) can also be expressed as:
H0 : β1 = · · · = βk = 0.
10.7
Two-way analysis of variance
Two-way analysis of variance (two-way ANOVA) involves a continuous dependent
variable and two categorical independent variables (factors). Two-way ANOVA models
the observations as:
Xij = µ + γi + βj + εij   for i = 1, . . . , r, j = 1, . . . , c
where:
µ represents the average effect
β1 , . . . , βc represent c different treatment (column) levels
γ1 , . . . , γr represent r different block (row) levels
εij ∼ N (0, σ 2 ) and the εij s are independent.
In total, there are n = r × c observations. We now consider the conditions to make the
parameters µ, γi and βj identifiable for i = 1, . . . , r and j = 1, . . . , c. The conditions are:
γ1 + · · · + γr = 0 and β1 + · · · + βc = 0.
We will be interested in testing the following hypotheses.
The ‘no treatment (column) effect’ hypothesis of H0 : β1 = · · · = βc = 0.
The ‘no block (row) effect’ hypothesis of H0 : γ1 = · · · = γr = 0.
We now introduce statistics associated with two-way ANOVA.
Statistics associated with two-way ANOVA
The sample mean at the ith block level is:

X̄i· = (1/c) Σ_{j=1}^{c} Xij   for i = 1, . . . , r.

The sample mean at the jth treatment level is:

X̄·j = (1/r) Σ_{i=1}^{r} Xij   for j = 1, . . . , c.

The overall sample mean is:

X̄ = X̄·· = (1/n) Σ_{i=1}^{r} Σ_{j=1}^{c} Xij.

The total variation is:

Total SS = Σ_{i=1}^{r} Σ_{j=1}^{c} (Xij − X̄)²

with rc − 1 degrees of freedom.

The between-blocks (rows) variation is:

B_row = c Σ_{i=1}^{r} (X̄i· − X̄)²

with r − 1 degrees of freedom.

The between-treatments (columns) variation is:

B_col = r Σ_{j=1}^{c} (X̄·j − X̄)²

with c − 1 degrees of freedom.

The residual (error) variation is:

Residual SS = Σ_{i=1}^{r} Σ_{j=1}^{c} (Xij − X̄i· − X̄·j + X̄)²

with (r − 1)(c − 1) degrees of freedom.
The (two-way) ANOVA decomposition is:

Σ_{i=1}^{r} Σ_{j=1}^{c} (Xij − X̄)² = c Σ_{i=1}^{r} (X̄i· − X̄)² + r Σ_{j=1}^{c} (X̄·j − X̄)² + Σ_{i=1}^{r} Σ_{j=1}^{c} (Xij − X̄i· − X̄·j + X̄)².
The total variation is a measure of the overall (total) variability in the data and the
(two-way) ANOVA decomposition decomposes this into three components:
between-blocks variation (which is attributable to the row factor level),
between-treatments variation (which is attributable to the column factor level) and
residual variation (which is attributable to the variation not explained by the row and
column factors).
The following are some useful formulae for manual computations.
Row sample means: X̄i· = Σ_{j=1}^{c} Xij /c, for i = 1, . . . , r.

Column sample means: X̄·j = Σ_{i=1}^{r} Xij /r, for j = 1, . . . , c.

Overall sample mean: X̄ = Σ_{i=1}^{r} Σ_{j=1}^{c} Xij /n = Σ_{i=1}^{r} X̄i· /r = Σ_{j=1}^{c} X̄·j /c.

Total SS = Σ_{i=1}^{r} Σ_{j=1}^{c} X²ij − rcX̄².

Between-blocks (rows) variation: Brow = c Σ_{i=1}^{r} X̄²i· − rcX̄².

Between-treatments (columns) variation: Bcol = r Σ_{j=1}^{c} X̄²·j − rcX̄².

Residual SS = (Total SS) − Brow − Bcol = Σ_{i=1}^{r} Σ_{j=1}^{c} X²ij − c Σ_{i=1}^{r} X̄²i· − r Σ_{j=1}^{c} X̄²·j + rcX̄².
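These quantities are easily computed in R. The following is a minimal sketch, assuming the observations are held in an r × c matrix X (rows are blocks, columns are treatments); all object names are illustrative.

nr <- nrow(X); nc <- ncol(X)           # r blocks, c treatments
row.means  <- rowMeans(X)              # X-bar_i.
col.means  <- colMeans(X)              # X-bar_.j
grand.mean <- mean(X)                  # X-bar

total.SS    <- sum((X - grand.mean)^2)
B.row       <- nc * sum((row.means - grand.mean)^2)
B.col       <- nr * sum((col.means - grand.mean)^2)
residual.SS <- total.SS - B.row - B.col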
In order to test the ‘no block (row) effect’ hypothesis of H0 : γ1 = · · · = γr = 0, the test
statistic is defined as:
F = [Brow /(r − 1)] / {(Residual SS)/[(r − 1)(c − 1)]} = (c − 1)Brow / Residual SS.
Under H0 , F ∼ Fr−1, (r−1)(c−1) . We reject H0 at the 100α% significance level if:
f > Fα, r−1, (r−1)(c−1)
where Fα, r−1, (r−1)(c−1) is the top 100αth percentile of the Fr−1, (r−1)(c−1) distribution, i.e.
P (F > Fα, r−1, (r−1)(c−1) ) = α, and f is the observed test statistic value.
The p-value of the test is:
p-value = P (F > f ).
In order to test the ‘no treatment (column) effect’ hypothesis of H0 : β1 = · · · = βc = 0,
the test statistic is defined as:
F = [Bcol /(c − 1)] / {(Residual SS)/[(r − 1)(c − 1)]} = (r − 1)Bcol / Residual SS.
Under H0 , F ∼ Fc−1, (r−1)(c−1) . We reject H0 at the 100α% significance level if:
f > Fα, c−1, (r−1)(c−1) .
The p-value of the test is defined in the usual way.
Two-way ANOVA table
As with one-way ANOVA, two-way ANOVA results are presented in a table as follows:
Source          DF                SS            MS                              F                             p-value
Row factor      r − 1             Brow          Brow /(r − 1)                   (c − 1)Brow /Residual SS      p
Column factor   c − 1             Bcol          Bcol /(c − 1)                   (r − 1)Bcol /Residual SS      p
Residual        (r − 1)(c − 1)    Residual SS   Residual SS/[(r − 1)(c − 1)]
Total           rc − 1            Total SS
Activity 10.11 Four suppliers were asked to quote prices for seven different
building materials. The average quote of supplier A was 1,315.8. The average quote
of suppliers B, C and D were 1,238.4, 1,225.8 and 1,200.0, respectively. The following
is the calculated two-way ANOVA table with some entries missing.
Source       DF    SS        MS       F    p-value
Materials                    17800
Suppliers
Error
Total              358700
(a) Complete the table using the information provided above.
(b) Is there a significant difference between the quotes of different suppliers?
Explain your answer.
(c) Construct a 90% confidence interval for the difference between suppliers A and
D. Would you say there is a difference?
Solution
(a) The average quote of all suppliers is:
(1315.8 + 1238.4 + 1225.8 + 1200.0)/4 = 1245.
Hence the sum of squares (SS) due to suppliers is:
7×[(1315.8−1245)2 +(1238.4−1245)2 +(1225.8−1245)2 +(1200.0−1245)2 ] = 52148.88
and the MS due to suppliers is 52148.88/(4 − 1) = 17382.96.
The degrees of freedom are 7 − 1 = 6, 4 − 1 = 3, (7 − 1)(4 − 1) = 18 and
7 × 4 − 1 = 27 for materials, suppliers, error and total sum of squares,
respectively.
The SS for materials is 6 × 17800 = 106800. We have that the SS due to the
error is given by 358700 − 52148.88 − 106800 = 199751.12 and the MS is
199751.12/18 = 11097.28. The F values are:
17800/11097.28 = 1.604 and 17382.96/11097.28 = 1.567
for materials and suppliers, respectively. The two-way ANOVA table is:
Source      DF    SS          MS         F      p-value
Materials   6     106800      17800      1.604  ≈ 0.203
Suppliers   3     52148.88    17382.96   1.567  ≈ 0.232
Error       18    199751.12   11097.28
Total       27    358700
(b) We test H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between suppliers) vs.
H1 : There is a difference between suppliers. The F value is 1.567 and at a 5%
significance level the critical value from Table 12 (degrees of freedom 3 and 18)
is 3.16, hence we do not reject H0 and conclude that there is not enough
evidence that there is a difference.
(c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734
and the MS value is 11097.28. So a 90% confidence interval is:
1315.8 − 1200 ± 1.734 × √(11097.28 × (1/7 + 1/7)) = 115.8 ± 97.64
giving (18.16, 213.44). Since zero is not in the interval, there appears to be a
difference between suppliers A and D.
Activity 10.12 Blood alcohol content (BAC) is measured in milligrams per
decilitre of blood (mg/dL). A researcher is looking into the effects of alcoholic
drinks. Four different individuals tried five different brands of strong beer (A, B, C,
D and E) on different days, of course! Each individual consumed 1L of beer over a
30-minute period and their BAC was measured one hour later. The average BAC for
beers A, C, D and E were 83.25, 95.75, 79.25 and 99.25, respectively. The value for
beer B is not given. The following information is provided as well.
Source    DF    SS       MS      F      p-value
Drinker                          1.56
Beer                     303.5
Error           695.6
Total
(a) Complete the table using the information provided above.
(b) Is there a significant difference between the effects of different beers? What
about different drinkers?
(c) Construct a 90% confidence interval for the difference between the effects of
beers C and D. Would you say there is a difference?
Solution
(a) We have:
Source    DF    SS         MS       F      p-value
Drinker   3     271.284    90.428   1.56   ≈ 0.250
Beer      4     1214       303.5    5.236  ≈ 0.011
Error     12    695.6      57.967
Total     19    2180.884
(b) We test the hypothesis H0 : µ1 = µ2 = µ3 = µ4 = µ5 (i.e. there is no difference
between the effects of different beers) vs. the alternative H1 : There is a
difference between the effects of different beers. The F value is 5.236 and at a
5% significance level the critical value from Table 12 is F0.05, 4, 12 = 3.26, so since
5.236 > 3.26 we reject H0 and conclude that there is evidence of a difference.

For drinkers, we test the hypothesis H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no
difference between the effects on different drinkers) vs. the alternative H1 : There
is a difference between the effects on different drinkers. The F value is 1.56 and
at a 5% significance level the critical value from Table 12 is F0.05, 3, 12 = 3.49, so
since 1.56 < 3.49 we fail to reject H0 and conclude that there is no evidence of a
difference.
(c) The top 5th percentile of the t distribution with 12 degrees of freedom is 1.782.
So a 90% confidence interval is:
95.75 − 79.25 ± 1.782 × √(57.967 × (1/4 + 1/4)) = 16.5 ± 9.59
giving (6.91, 26.09). As the interval does not contain zero, there is evidence of a
difference between the effects of beers C and D.
Activity 10.13 A motor manufacturer operates five continuous-production plants:
A, B, C, D and E. The average rate of production has been calculated for the three
shifts of each plant and recorded in the table below. Does there appear to be a
difference in production rates in different plants or by different shifts?
              A     B    C    D     E
Early shift   102   93   85   110   72
Late shift    85    87   71   92    73
Night shift   75    80   75   77    76
Solution
Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows:
Source   DF    SS        MS       F
Shift    2     652.13    326.07   5.62
Plant    4     761.73    190.43   3.28
Error    8     463.87    57.98
Total    14    1877.73
Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62,
we can reject the null hypothesis at the 5% significance level. (Note the p-value is
0.030.)
Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28,
we cannot reject the null hypothesis at the 5% significance level. (Note the p-value is
0.072.)
Overall, the data collected show some evidence of a shift effect but little evidence of
a plant effect.
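The same table can be reproduced with R's built-in ANOVA machinery. A minimal sketch, entering the rates shift by shift (object and factor names are illustrative):

rate  <- c(102, 93, 85, 110, 72,    # early shift, plants A-E
            85, 87, 71,  92, 73,    # late shift
            75, 80, 75,  77, 76)    # night shift
shift <- factor(rep(c("Early", "Late", "Night"), each = 5))
plant <- factor(rep(c("A", "B", "C", "D", "E"), times = 3))
anova(lm(rate ~ shift + plant))     # reproduces the SS, MS and F values above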
Activity 10.14 Complete the two-way ANOVA table below. For the p-values, give a
bound of the form ‘< 0.01’, using the closest value you can find in the New Cambridge
Statistical Tables.
Source          DF    SS         MS       F      p-value
Row factor      4     ?          234.23   ?      ?
Column factor   6     270.84     45.14    1.53   ?
Residual        ?     708.00     ?
Total           34    1915.76
Solution
First, row factor SS = (row factor MS)×4 = 936.92.
The degrees of freedom for residual is 34 − 4 − 6 = 24. Therefore, residual MS
= 708.00/24 = 29.5.
Hence the F statistic for testing no row factor effect is 234.23/29.5 = 7.94. From
Table 12 of the New Cambridge Statistical Tables, F0.001, 4, 24 = 6.59 < 7.94.
Therefore, the corresponding p-value is smaller than 0.001.
Since F0.05, 6, 24 = 2.51 > 1.53, the p-value for testing the column factor effect is
greater than 0.05.
The complete ANOVA table is as follows:
Source          DF    SS         MS       F      p-value
Row factor      4     936.92     234.23   7.94   < 0.001
Column factor   6     270.84     45.14    1.53   > 0.05
Residual        24    708.00     29.50
Total           34    1915.76
Activity 10.15 The following table shows the audience shares (in %) of three
major networks’ evening news broadcasts in five major cities, with one observation
per cell so that there are 15 observations. Construct the two-way ANOVA table for
these data (without the p-value column). Is either factor statistically significant at
the 5% significance level?
City   BBC    ITV    Sky
A      21.3   17.8   20.2
B      20.6   17.5   20.1
C      24.1   16.1   19.4
D      23.6   18.3   20.8
E      21.8   17.0   28.7
Solution
We have r = 5 and c = 3.
The row sample means are calculated using X̄i· = Σ_{j=1}^{c} Xij /c, which gives 19.77, 19.40,
19.87, 20.90 and 22.50 for i = 1, 2, 3, 4, 5, respectively.

The column means are calculated using X̄·j = Σ_{i=1}^{r} Xij /r, which gives 22.28, 17.34 and
21.84 for j = 1, 2, 3, respectively.

The overall sample mean is:

x̄ = Σ_{i=1}^{r} x̄i· /r = 20.49.

The sum of the squared observations is:

Σ_{i=1}^{r} Σ_{j=1}^{c} x²ij = 6441.99.
Hence:

Total SS = Σ_{i=1}^{r} Σ_{j=1}^{c} x²ij − rcx̄² = 6441.99 − 15 × (20.49)² = 6441.99 − 6297.60 = 144.39.

brow = c Σ_{i=1}^{r} x̄²i· − rcx̄² = 3 × 2104.83 − 6297.60 = 16.88.

bcol = r Σ_{j=1}^{c} x̄²·j − rcx̄² = 5 × 1274.06 − 6297.60 = 72.70.

Residual SS = Total SS − brow − bcol = 144.39 − 16.88 − 72.70 = 54.81.
To test the no row effect hypothesis H0 : γ1 = · · · = γ5 = 0, the test statistic value is:
f = (c − 1)brow /Residual SS = (2 × 16.88)/54.81 = 0.62.
Under H0 , F ∼ Fr−1, (r−1)(c−1) = F4, 8 . Using Table 12 of the New Cambridge
Statistical Tables, since F0.05, 4, 8 = 3.84 > 0.62, we do not reject H0 at the 5%
significance level. We conclude that there is no evidence that the audience share
depends on the city.
To test the no column effect hypothesis H0 : β1 = β2 = β3 = 0, the test statistic
value is:
f = (r − 1)bcol /Residual SS = (4 × 72.70)/54.81 = 5.31.
Under H0 , F ∼ Fc−1, (r−1)(c−1) = F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.31, we reject H0 at
the 5% significance level. Therefore, there is evidence indicating that the audience
share depends on the network.
The results may be summarised in a two-way ANOVA table as follows:
Source     DF    SS       MS      F
City       4     16.88    4.22    0.61
Network    2     72.70    36.35   5.31
Residual   8     54.81    6.85
Total      14    144.39
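As a check, this table can be reproduced in R. A minimal sketch, entering the shares city by city (object and factor names are illustrative):

share   <- c(21.3, 17.8, 20.2,   # city A: BBC, ITV, Sky
             20.6, 17.5, 20.1,   # city B
             24.1, 16.1, 19.4,   # city C
             23.6, 18.3, 20.8,   # city D
             21.8, 17.0, 28.7)   # city E
city    <- factor(rep(c("A", "B", "C", "D", "E"), each = 3))
network <- factor(rep(c("BBC", "ITV", "Sky"), times = 5))
anova(lm(share ~ city + network))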
10.8 Residuals
Before considering an example of two-way ANOVA, we briefly consider residuals.
Recall the original two-way ANOVA model:
Xij = µ + γi + βj + εij .
We now decompose the observations as follows:
Xij = X̄ + (X̄i· − X̄) + (X̄·j − X̄) + (Xij − X̄i· − X̄·j + X̄)
for i = 1, . . . , r and j = 1, . . . , c, where we have the following point estimators.
µ̂ = X̄ is the point estimator of µ.

γ̂i = X̄i· − X̄ is the point estimator of γi , for i = 1, . . . , r.

β̂j = X̄·j − X̄ is the point estimator of βj , for j = 1, . . . , c.

It follows that the residual, i.e. the estimator of εij , is:

ε̂ij = Xij − X̄i· − X̄·j + X̄

for i = 1, . . . , r and j = 1, . . . , c.

The two-way ANOVA model assumes εij ∼ N (0, σ²) and so, if the model structure is
correct, then the ε̂ij s should behave like independent N (0, σ²) random variables.
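In R the point estimators and residuals can be computed directly from the data matrix. A minimal sketch, assuming X is the r × c matrix of observations (names illustrative):

mu.hat    <- mean(X)                  # estimate of mu
gamma.hat <- rowMeans(X) - mu.hat     # estimates of the block (row) effects
beta.hat  <- colMeans(X) - mu.hat     # estimates of the treatment (column) effects
# residuals: eps.hat[i, j] = X[i, j] - Xbar_i. - Xbar_.j + Xbar
eps.hat   <- X - outer(rowMeans(X), colMeans(X), "+") + mu.hat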
Example 10.6 The following table lists the percentage annual returns (calculated
four times per annum) of the Common Stock Index at the New York Stock
Exchange during 1981–85.
        1st quarter   2nd quarter   3rd quarter   4th quarter
1981    5.7           6.0           7.1           6.7
1982    7.2           7.0           6.1           5.2
1983    4.9           4.1           4.2           4.4
1984    4.5           4.9           4.5           4.5
1985    4.4           4.2           4.2           3.6
(a) Is the variability in returns from year to year statistically significant?
(b) Are returns affected by the quarter of the year?
Using two-way ANOVA, we test the no row effect hypothesis to answer (a), and test
the no column effect hypothesis to answer (b). We have r = 5 and c = 4.
The row sample means are calculated using X̄i· = Σ_{j=1}^{c} Xij /c, which gives 6.375, 6.375,
4.4, 4.6 and 4.1, for i = 1, . . . , 5, respectively.

The column sample means are calculated using X̄·j = Σ_{i=1}^{r} Xij /r, which gives 5.34,
5.24, 5.22 and 4.88, for j = 1, . . . , 4, respectively.
The overall sample mean is x̄ = Σ_{i=1}^{r} x̄i· /r = 5.17.

The sum of the squared observations is Σ_{i=1}^{r} Σ_{j=1}^{c} x²ij = 559.06.
Hence we have the following.
Total SS = Σ_{i=1}^{r} Σ_{j=1}^{c} x²ij − rcx̄² = 559.06 − 20 × (5.17)² = 559.06 − 534.578 = 24.482.

brow = c Σ_{i=1}^{r} x̄²i· − rcx̄² = 4 × 138.6112 − 534.578 = 19.867.

bcol = r Σ_{j=1}^{c} x̄²·j − rcx̄² = 5 × 107.036 − 534.578 = 0.602.

Residual SS = (Total SS) − brow − bcol = 24.482 − 19.867 − 0.602 = 4.013.
To test the no row effect hypothesis H0 : γ1 = · · · = γ5 = 0, the test statistic value is:
f = (c − 1)brow /Residual SS = (3 × 19.867)/4.013 = 14.852.

Under H0 , F ∼ Fr−1, (r−1)(c−1) = F4, 12 . Using Table 12(d) of the New Cambridge
Statistical Tables, since F0.01, 4, 12 = 5.412 < 14.852, we reject H0 at the 1%
significance level. We conclude that there is strong evidence that the return does
depend on the year.
To test the no column effect hypothesis H0 : β1 = · · · = β4 = 0, the test statistic
value is:
f = (r − 1)bcol /Residual SS = (4 × 0.602)/4.013 = 0.600.
Under H0 , F ∼ Fc−1, (r−1)(c−1) = F3, 12 . Since F0.10, 3, 12 = 2.606 > 0.600, we cannot
reject H0 even at the 10% significance level. Therefore, there is no significant
evidence indicating that the return depends on the quarter.
The results may be summarised in a two-way ANOVA table as follows:
Source     DF    SS        MS      F        p-value
Year       4     19.867    4.967   14.852   < 0.01
Quarter    3     0.602     0.201   0.600    > 0.10
Residual   12    4.013     0.334
Total      19    24.482
We could also provide 95% confidence interval estimates for each block and
treatment level by using the pooled estimator of σ 2 , which is:
S² = Residual SS/[(r − 1)(c − 1)] = Residual MS.

For the given data, s² = 0.334.
R produces the following output:
> anova(lm(Return ~ Year + Quarter))
Analysis of Variance Table

Response: Return
          Df Sum Sq Mean Sq F value    Pr(>F)
Year       4 19.867  4.9667  14.852 0.0001349 ***
Quarter    3  0.602  0.2007   0.600 0.6271918
Residuals 12  4.013  0.3344
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1  1
Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and
1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is
consistent with rejection of H0 in the no row effect test. In contrast, the confidence
intervals for each quarter all overlap, which is consistent with our failure to reject H0
in the no column effect test.
Finally, we may also look at the residuals:
ε̂ij = Xij − µ̂ − γ̂i − β̂j

for i = 1, . . . , r and j = 1, . . . , c.

If the assumed normal model (structure) is correct, the ε̂ij s should behave like
independent N (0, σ²) random variables.
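With the data held in ‘long’ format, the residuals can be extracted and inspected with standard R tools. A brief sketch, assuming a data frame with columns Return, Year and Quarter (both factors):

fit <- lm(Return ~ Year + Quarter)
e   <- resid(fit)              # the estimated residuals
qqnorm(e); qqline(e)           # check approximate normality
plot(fitted(fit), e)           # check for constant variance and lack of structure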
10.9 Overview of chapter
This chapter introduced analysis of variance as a statistical tool to detect differences
between group means. One-way and two-way analysis of variance frameworks were
presented depending on whether one or two independent variables were modelled,
respectively. Statistical inference in the form of hypothesis tests and confidence intervals
was conducted.
10.10 Key terms and concepts
ANOVA decomposition
Between-groups variation
One-way ANOVA
Residual
Total variation
Within-groups variation
Between-blocks (rows) variation
Between-treatments (columns) variation
Random errors
Sample mean
Two-way ANOVA
10.11 Sample examination questions
Solutions can be found in Appendix C.
1. Three call centre workers were being monitored for the average number of calls
they answer per daily shift. Worker A answered a total of 187 calls in 4 days.
Worker B answered a total of 347 calls in 6 days. Worker C answered a total of 461
calls in 10 days. Note that these figures are total calls, not daily averages. The sum
of the squares of all 20 days, Σ x²i , is 50,915.
(a) Construct a one-way analysis of variance table. (You may exclude the p-value.)
(b) Would you say there is a difference between the average daily calls answered of
the three workers? Justify your answer using a 5% significance level.
2. The audience shares (in %) of three major television networks’ evening news
broadcasts in four major cities were examined. The average audience share for the
three networks (A, B and C) were 21.35%, 17.28% and 20.18%, respectively. The
following is the calculated ANOVA table with some entries missing.
Source
City
Network
Error
Total
Degrees of freedom
Sum of squares
Mean square
1.95
F -value
51.52
(a) Complete the table using the information provided above.
(b) Test, at the 5% significance level, whether there is evidence of a difference in
audience shares between networks.
3. An experiment is conducted to study how long different external batteries for
laptops last (with the laptop on power saving mode). The aim is to find out
whether there is a difference in terms of battery life between four brands of
batteries using seven different laptops. Each battery was tried once with each
laptop. The total time the Brand A battery lasted was 43.86 hours. The total times
for brands B, C and D were 41.28, 40.86 and 40 hours respectively. The following is
the calculated ANOVA table with some entries missing.
Source
Degrees of freedom
Laptops
Batteries
Error
Total
Sum of squares
Mean square F -value
26
343
(a) Complete the table using the information provided above.
(b) Test whether there are significant differences between the expected battery
performance: (i) of different batteries, and (ii) of different laptops. Perform
both tests at the 5% significance level.
(c) Construct a 90% confidence interval for the expected difference between
brands A and D. Is there any evidence of a difference in the performance of
these brands?
Appendix A
Linear regression (non-examinable)
A.1 Synopsis of chapter
This chapter covers linear regression whereby the variation in a continuous dependent
variable is modelled as being explained by one or more continuous independent
variables.
A.2 Learning outcomes
After completing this chapter, you should be able to:
derive from first principles the least squares estimators of the intercept and slope in
the simple linear regression model
explain how to construct confidence intervals and perform hypothesis tests for the
intercept and slope in the simple linear regression model
demonstrate how to construct confidence intervals and prediction intervals and
explain the difference between the two
summarise the multiple linear regression model with several explanatory variables,
and explain its interpretation
provide the assumptions on which regression models are based
interpret typical output from a computer package fitting of a regression model.
A.3 Introduction
Regression analysis is one of the most frequently-used statistical techniques. It aims
to model an explicit relationship between one dependent variable, often denoted as y,
and one or more regressors (also called covariates, or independent variables), often
denoted as x1 , . . . , xp .
The goal of regression analysis is to understand how y depends on x1 , . . . , xp and to
predict or control the unobserved y based on the observed x1 , . . . , xp . We start with
some simple examples with p = 1.
A.4 Introductory examples
Example 1.1 In a university town, the sales, y, of 10 Armand’s Pizza Parlour
restaurants are closely related to the student population, x, in their neighbourhoods.
The data are the sales (in thousands of euros) in a period of three months together
with the numbers of students (in thousands) in their neighbourhoods.
We plot y against x, and draw a straight line through the middle of the data points:
y = β0 + β1 x + ε
where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
For a given student population, x, the predicted sales are yb = β0 + β1 x.
Example 1.2 Data were collected on the heights, x, and weights, y, of 69 students
in a class.
We plot y against x, and draw a straight line through the middle of the data cloud:
y = β0 + β1 x + ε
where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
For a given height, x, the predicted value yb = β0 + β1 x may be viewed as a kind of
‘standard weight’.
Example 1.3 Some other possible examples of y and x are shown in the following
table.
y                          x
Sales                      Price
Weight gain                Protein in diet
Present FTSE 100 index     Past FTSE 100 index
Consumption                Income
Salary                     Tenure
Daughter’s height          Mother’s height
In most cases, there are several x variables involved. We will consider such situations
later in this chapter.
Some questions to consider are the following.
How to draw a line through data clouds, i.e. how to estimate β0 and β1 ?
How accurate is the fitted line?
What is the error in predicting a future y?
A.5 Simple linear regression
We now present the simple linear regression model. Let the paired observations
(x1 , y1 ), . . . , (xn , yn ) be drawn from the model:
y i = β 0 + β 1 xi + εi
where:
E(εi ) = 0 and Var(εi ) = E(ε2i ) = σ 2 > 0.
Furthermore, suppose Cov(εi , εj ) = E(εi εj ) = 0 for all i ≠ j. That is, the εi s are
assumed to be uncorrelated (remembering that a zero covariance between two random
variables implies that they are uncorrelated).
So the model has three parameters: β0 , β1 and σ 2 .
For convenience, we will treat x1 , . . . , xn as constants.1 We have:
E(yi ) = β0 + β1 xi
and Var(yi ) = σ 2 .
Since the εi s are uncorrelated (by assumption), it follows that y1 , . . . , yn are also
uncorrelated with each other.
Sometimes we assume εi ∼ N (0, σ 2 ), in which case yi ∼ N (β0 + β1 xi , σ 2 ), and y1 , . . . , yn
are independent. (Remember that a linear transformation of a normal random variable
is also normal, and that for jointly normal random variables if they are uncorrelated
then they are also independent.)
1
If you study EC2020 Elements of econometrics, you will explore regression models in much more
detail than is covered here. For example, x1 , . . . , xn will be treated as random variables in econometrics.
Our tasks are two-fold.
Statistical inference for β0 , β1 and σ 2 , i.e. (point) estimation, confidence intervals
and hypothesis testing.
Prediction intervals for future values of y.
We derive estimators of β0 and β1 using least squares estimation (introduced in Chapter
7). The least squares estimators (LSEs) of β0 and β1 are the values of (β0 , β1 ) at which
the function:

L(β0 , β1 ) = Σ_{i=1}^{n} ε²i = Σ_{i=1}^{n} (yi − β0 − β1 xi )²

obtains its minimum.

We proceed to partially differentiate L(β0 , β1 ) with respect to β0 and β1 , respectively.
Firstly:

∂L(β0 , β1 )/∂β0 = −2 Σ_{i=1}^{n} (yi − β0 − β1 xi ).

Upon setting this partial derivative to zero, this leads to:

Σ_{i=1}^{n} yi − nβ̂0 − β̂1 Σ_{i=1}^{n} xi = 0, or β̂0 = ȳ − β̂1 x̄.

Secondly:

∂L(β0 , β1 )/∂β1 = −2 Σ_{i=1}^{n} xi (yi − β0 − β1 xi ).

Upon setting this partial derivative to zero, this leads to:

0 = Σ_{i=1}^{n} xi (yi − β̂0 − β̂1 xi ) = Σ_{i=1}^{n} xi (yi − ȳ − β̂1 (xi − x̄)) = Σ_{i=1}^{n} xi (yi − ȳ) − β̂1 Σ_{i=1}^{n} xi (xi − x̄).

Hence:

β̂1 = Σ_{i=1}^{n} xi (yi − ȳ) / Σ_{i=1}^{n} xi (xi − x̄) = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²   and   β̂0 = ȳ − β̂1 x̄.

The estimator β̂1 above is based on the fact that for any constant c, we have:

Σ_{i=1}^{n} xi (yi − ȳ) = Σ_{i=1}^{n} (xi − c)(yi − ȳ)

since:

Σ_{i=1}^{n} c (yi − ȳ) = c Σ_{i=1}^{n} (yi − ȳ) = 0.

Given that Σ_{i=1}^{n} (xi − x̄) = 0, it follows that Σ_{i=1}^{n} c (xi − x̄) = 0 for any constant c.

In order to calculate β̂1 numerically, often the following formula is convenient:

β̂1 = (Σ_{i=1}^{n} xi yi − nx̄ȳ) / (Σ_{i=1}^{n} x²i − nx̄²).
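A short R sketch of this formula for arbitrary data vectors x and y (names illustrative), with lm() as a check:

n <- length(x)
beta1.hat <- (sum(x * y) - n * mean(x) * mean(y)) / (sum(x^2) - n * mean(x)^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
coef(lm(y ~ x))    # should agree with c(beta0.hat, beta1.hat)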
An alternative derivation is as follows. Note L(β0 , β1 ) = Σ_{i=1}^{n} (yi − β0 − β1 xi )². For any β0
and β1 , we have:

L(β0 , β1 ) = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi + (β̂0 − β0 ) + (β̂1 − β1 )xi )²
           = L(β̂0 , β̂1 ) + Σ_{i=1}^{n} ((β̂0 − β0 ) + (β̂1 − β1 )xi )² + 2B          (A.1)

where:

B = Σ_{i=1}^{n} ((β̂0 − β0 ) + (β̂1 − β1 )xi )(yi − β̂0 − β̂1 xi )
  = (β̂0 − β0 ) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi ) + (β̂1 − β1 ) Σ_{i=1}^{n} xi (yi − β̂0 − β̂1 xi ).

Now let (β̂0 , β̂1 ) be the solution to the equations:

Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi ) = 0 and Σ_{i=1}^{n} xi (yi − β̂0 − β̂1 xi ) = 0          (A.2)

such that B = 0. By (A.1), we have:

L(β0 , β1 ) = L(β̂0 , β̂1 ) + Σ_{i=1}^{n} ((β̂0 − β0 ) + (β̂1 − β1 )xi )² ≥ L(β̂0 , β̂1 ).

Hence (β̂0 , β̂1 ) are the least squares estimators (LSEs) of β0 and β1 , respectively.

To find the explicit expression from (A.2), note the first equation can be written as:

nȳ − nβ̂0 − nβ̂1 x̄ = 0.

Hence β̂0 = ȳ − β̂1 x̄. Substituting this into the second equation, we have:

0 = Σ_{i=1}^{n} xi (yi − ȳ − β̂1 (xi − x̄)) = Σ_{i=1}^{n} xi (yi − ȳ) − β̂1 Σ_{i=1}^{n} xi (xi − x̄).

Therefore:

β̂1 = Σ_{i=1}^{n} xi (yi − ȳ) / Σ_{i=1}^{n} xi (xi − x̄) = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)².

This completes the derivation.

Remember Σ_{i=1}^{n} (xi − x̄) = 0. Hence Σ_{i=1}^{n} c (xi − x̄) = 0 for any constant c.

We also note the estimator of σ², which is:

σ̂² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )² / (n − 2).
We now explore the properties of the LSEs β̂0 and β̂1 , and proceed to show that their
means and variances are:

E(β̂0 ) = β0 and Var(β̂0 ) = σ² Σ_{i=1}^{n} x²i / [n Σ_{i=1}^{n} (xi − x̄)²]

for β̂0 , and:

E(β̂1 ) = β1 and Var(β̂1 ) = σ² / Σ_{i=1}^{n} (xi − x̄)²

for β̂1 .

Proof: Recall we treat the xi s as constants, and we have E(yi ) = β0 + β1 xi and also
Var(yi ) = σ². Hence:

E(ȳ) = E((1/n) Σ_{i=1}^{n} yi ) = (1/n) Σ_{i=1}^{n} E(yi ) = (1/n) Σ_{i=1}^{n} (β0 + β1 xi ) = β0 + β1 x̄.

Therefore:

E(yi − ȳ) = β0 + β1 xi − (β0 + β1 x̄) = β1 (xi − x̄).

Consequently, we have:

E(β̂1 ) = E[Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²] = Σ_{i=1}^{n} (xi − x̄) E(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = β1 Σ_{i=1}^{n} (xi − x̄)² / Σ_{i=1}^{n} (xi − x̄)² = β1 .

Now:

E(β̂0 ) = E(ȳ − β̂1 x̄) = β0 + β1 x̄ − β1 x̄ = β0 .

Therefore, the LSEs β̂0 and β̂1 are unbiased estimators of β0 and β1 , respectively.
To work out the variances, the key is to write β̂1 and β̂0 as linear estimators (i.e.
linear combinations of the yi s):

β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} (xi − x̄)yi / Σ_{k=1}^{n} (xk − x̄)² = Σ_{i=1}^{n} ai yi

where ai = (xi − x̄) / Σ_{k=1}^{n} (xk − x̄)², and:

β̂0 = ȳ − β̂1 x̄ = ȳ − Σ_{i=1}^{n} ai x̄ yi = Σ_{i=1}^{n} (1/n − ai x̄) yi .

Note that:

Σ_{i=1}^{n} ai = 0 and Σ_{i=1}^{n} a²i = 1 / Σ_{k=1}^{n} (xk − x̄)².

Now we note the following lemma, without proof. Let y1 , . . . , yn be uncorrelated random
variables, and b1 , . . . , bn be constants, then:

Var(Σ_{i=1}^{n} bi yi ) = Σ_{i=1}^{n} b²i Var(yi ).

By this lemma:

Var(β̂1 ) = Var(Σ_{i=1}^{n} ai yi ) = σ² Σ_{i=1}^{n} a²i = σ² / Σ_{k=1}^{n} (xk − x̄)²

and:

Var(β̂0 ) = σ² Σ_{i=1}^{n} (1/n − ai x̄)² = σ² (1/n + x̄² Σ_{i=1}^{n} a²i ) = (σ²/n) [1 + nx̄² / Σ_{k=1}^{n} (xk − x̄)²] = σ² Σ_{k=1}^{n} x²k / [n Σ_{k=1}^{n} (xk − x̄)²].

The last equality uses the fact that:

Σ_{k=1}^{n} x²k = Σ_{k=1}^{n} (xk − x̄)² + nx̄².
A.6 Inference for parameters in normal regression models
The normal simple linear regression model is yi = β0 + β1 xi + εi , where:
ε1 , . . . , εn ∼IID N (0, σ 2 ).
y1 , . . . , yn are independent (but not identically distributed) and:
yi ∼ N (β0 + β1 xi , σ 2 ).
Since any linear combination of normal random variables is also normal, the LSEs of β0
and β1 (as linear estimators) are also normal random variables. In fact:

β̂0 ∼ N(β0 , σ² Σ_{i=1}^{n} x²i / [n Σ_{i=1}^{n} (xi − x̄)²])   and   β̂1 ∼ N(β1 , σ² / Σ_{i=1}^{n} (xi − x̄)²).
Since σ² is unknown in practice, we replace σ² by its estimator:

σ̂² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )² / (n − 2)

and use the estimated standard errors:

E.S.E.(β̂0 ) = (σ̂/√n) × [Σ_{i=1}^{n} x²i / Σ_{i=1}^{n} (xi − x̄)²]^{1/2}

and:

E.S.E.(β̂1 ) = σ̂ / [Σ_{i=1}^{n} (xi − x̄)²]^{1/2} .
The following results all make use of distributional results introduced earlier in the
course. Statistical inference (confidence intervals and hypothesis testing) for the normal
simple linear regression model can then be performed.
i. We have:

(n − 2) σ̂²/σ² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )²/σ² ∼ χ²n−2 .

ii. β̂0 and σ̂² are independent, hence:

(β̂0 − β0 )/E.S.E.(β̂0 ) ∼ tn−2 .

iii. β̂1 and σ̂² are independent, hence:

(β̂1 − β1 )/E.S.E.(β̂1 ) ∼ tn−2 .
Confidence intervals for the simple linear regression model parameters
A 100(1 − α)% confidence interval for β0 is:
(β̂0 − tα/2, n−2 × E.S.E.(β̂0 ), β̂0 + tα/2, n−2 × E.S.E.(β̂0 ))

and a 100(1 − α)% confidence interval for β1 is:

(β̂1 − tα/2, n−2 × E.S.E.(β̂1 ), β̂1 + tα/2, n−2 × E.S.E.(β̂1 ))
where tα, k denotes the top 100αth percentile of the Student’s tk distribution, obtained
from Table 10 of the New Cambridge Statistical Tables.
Tests for the regression slope
The relationship between y and x in the regression model hinges on β1 . If β1 = 0,
then y ∼ N (β0 , σ²).

To validate the use of the regression model, we need to make sure that β1 ≠ 0, or
more practically that β̂1 is significantly non-zero. This amounts to testing:

H0 : β1 = 0 vs. H1 : β1 ≠ 0.

Under H0 , the test statistic is:

T = β̂1 / E.S.E.(β̂1 ) ∼ tn−2 .
At the 100α% significance level, we reject H0 if |t| > tα/2, n−2 , where t is the observed
test statistic value.
Alternatively, we could use H1 : β1 < 0 or H1 : β1 > 0 if there was a rationale for
doing so. In such cases, we would reject H0 if t < −tα, n−2 and t > tα, n−2 for the
lower-tailed and upper-tailed t tests, respectively.
Some remarks are the following.
i. For testing H0 : β1 = b for a given constant b, the above test still applies, but now
with the following test statistic:
T = (β̂1 − b) / E.S.E.(β̂1 ).
ii. Tests for the regression intercept β0 may be constructed in a similar manner,
replacing β1 and β̂1 with β0 and β̂0 , respectively.

In the normal regression model, the LSEs β̂0 and β̂1 are also the MLEs of β0 and β1 ,
respectively.
Since εi = yi − β0 − β1 xi ∼IID N (0, σ²), the likelihood function is:

L(β0 , β1 , σ²) = Π_{i=1}^{n} (2πσ²)^{−1/2} exp(−(yi − β0 − β1 xi )²/(2σ²)) ∝ (1/σ²)^{n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (yi − β0 − β1 xi )²).

Hence the log-likelihood function is:

l(β0 , β1 , σ²) = (n/2) log(1/σ²) − (1/(2σ²)) Σ_{i=1}^{n} (yi − β0 − β1 xi )² + c.

Therefore, for any β0 , β1 and σ² > 0, we have:

l(β0 , β1 , σ²) ≤ l(β̂0 , β̂1 , σ²).

Hence (β̂0 , β̂1 ) are the MLEs of (β0 , β1 ).

To find the MLE of σ², we need to maximise:

l(σ²) = l(β̂0 , β̂1 , σ²) = (n/2) log(1/σ²) − (1/(2σ²)) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )².

Setting u = 1/σ², it is equivalent to maximising:

g(u) = n log u − ub   where b = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )².

Setting dg(u)/du = n/u − b = 0 gives û = n/b, i.e. g(u) attains its maximum at u = û.
Hence the MLE of σ² is:

σ̃² = 1/û = b/n = (1/n) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )².

Note the MLE σ̃² is a biased estimator of σ². In practice, we often use the unbiased
estimator:

σ̂² = (1/(n − 2)) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )².
We now consider an empirical example of the normal simple linear regression model.
Example 1.4 A dataset contains the annual cigarette consumption, x, and the
corresponding mortality rate, y, due to coronary heart disease (CHD) of 21
countries. Some useful summary statistics calculated from the data are:
Σ_{i=1}^{21} xi = 45,110,  Σ_{i=1}^{21} yi = 3,042.2,  Σ_{i=1}^{21} x²i = 109,957,100,  Σ_{i=1}^{21} y²i = 529,321.58  and  Σ_{i=1}^{21} xi yi = 7,319,602.
Do these data support the suspicion that smoking contributes to CHD mortality?
(Note the assertion ‘smoking is harmful for health’ is largely based on statistical,
rather than laboratory, evidence.)
We fit the regression model y = β0 + β1 x + ε. Our least squares estimates of β1 and
β0 are, respectively:
β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² = (Σi xi yi − nx̄ȳ) / (Σi x²i − nx̄²) = (Σi xi yi − Σi xi Σj yj /n) / (Σi x²i − (Σi xi )²/n)
    = (7319602 − 45110 × 3042.2/21) / (109957100 − (45110)²/21)
    = 0.06

and:

β̂0 = ȳ − β̂1 x̄ = (3042.2 − 0.06 × 45110)/21 = 15.77.
Also:

σ̂² = Σi (yi − β̂0 − β̂1 xi )² / (n − 2)
    = [Σi y²i + nβ̂0² + β̂1² Σi x²i − 2β̂0 Σi yi − 2β̂1 Σi xi yi + 2β̂0 β̂1 Σi xi ] / (n − 2)
    = 2181.66.
We now proceed to test H0 : β1 = 0 vs. H1 : β1 > 0. (If indeed smoking contributes
to CHD mortality, then β1 > 0.)
We have calculated β̂1 = 0.06. However, is this deviation from zero due to sampling
error, or is it significantly different from zero? (The magnitude of β̂1 itself is not
important in determining if β1 = 0 or not – changing the scale of x may make β̂1
arbitrarily small.)
Under H0 , the test statistic is:

T = β̂1 / E.S.E.(β̂1 ) ∼ tn−2 = t19

where E.S.E.(β̂1 ) = σ̂ / [Σi (xi − x̄)²]^{1/2} = 0.01293.
Since t = 0.06/0.01293 = 4.64 > 2.54 = t0.01, 19 , we reject the hypothesis β1 = 0 at
the 1% significance level and we conclude that there is strong evidence that smoking
contributes to CHD mortality.
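These hand calculations can be reproduced from the summary statistics alone. A minimal R sketch using the totals given above (object names are illustrative):

n <- 21
sx <- 45110; sy <- 3042.2; sxx <- 109957100; syy <- 529321.58; sxy <- 7319602

Sxx <- sxx - sx^2 / n                              # sum of (x_i - x-bar)^2
b1  <- (sxy - sx * sy / n) / Sxx                   # approximately 0.06
b0  <- sy / n - b1 * sx / n                        # approximately 15.77
s2  <- ((syy - sy^2 / n) - b1^2 * Sxx) / (n - 2)   # approximately 2181.66
t   <- b1 / sqrt(s2 / Sxx)                         # approximately 4.64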
A.7 Regression ANOVA
In Chapter 10 we discussed ANOVA, whereby we decomposed the total variation of a
continuous dependent variable. In a similar way we can decompose the total variation of
y in the simple linear regression model. It can be shown that the regression ANOVA
decomposition is:
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} β̂1² (xi − x̄)² + Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )²
where, denoting sum of squares by ‘SS’, we have the following.
Total SS is Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} y²i − nȳ².

Regression (explained) SS is Σ_{i=1}^{n} β̂1² (xi − x̄)² = β̂1² (Σ_{i=1}^{n} x²i − nx̄²).

Residual (error) SS is Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )² = Total SS − Regression SS.
If εi ∼ N (0, σ²) and β1 = 0, then it can be shown that:

Σ_{i=1}^{n} (yi − ȳ)²/σ² ∼ χ²n−1

Σ_{i=1}^{n} β̂1² (xi − x̄)²/σ² ∼ χ²1

Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )²/σ² ∼ χ²n−2 .

Therefore, under H0 : β1 = 0, we have:

F = [(Regression SS)/1] / [(Residual SS)/(n − 2)] = (n − 2) β̂1² Σ_{i=1}^{n} (xi − x̄)² / Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )² = (β̂1 / E.S.E.(β̂1 ))² ∼ F1, n−2 .
We reject H0 at the 100α% significance level if f > Fα, 1, n−2 , where f is the observed
test statistic value and Fα, 1, n−2 is the top 100αth percentile of the F1, n−2 distribution,
obtained from Table 12 of the New Cambridge Statistical Tables.
A useful statistic is the coefficient of determination, denoted as R², defined as:

R² = Regression SS / Total SS = 1 − Residual SS / Total SS .
If we view Total SS as the total variation (or energy) of y, then R2 is the proportion of
the total variation of y explained by x. Note that R2 ∈ [0, 1]. The closer R2 is to 1, the
better the explanatory power of the regression model.
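A brief R sketch of the decomposition and of R² for data vectors x and y (names illustrative):

fit <- lm(y ~ x)
total.SS      <- sum((y - mean(y))^2)
residual.SS   <- sum(resid(fit)^2)
regression.SS <- total.SS - residual.SS

R2 <- regression.SS / total.SS                          # coefficient of determination
F  <- regression.SS / (residual.SS / (length(y) - 2))   # compare with F_{1, n-2}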
A.8 Confidence intervals for E(y)
Based on the observations (xi , yi ), for i = 1, . . . , n, we fit a regression model:

ŷ = β̂0 + β̂1 x.

Our goal is to predict the unobserved y corresponding to a known x. The point
prediction is:

ŷ = β̂0 + β̂1 x.
For the analysis to be more informative, we would like to have some ‘error bars’ for our
prediction. We introduce two methods as follows.
A confidence interval for µ(x) = E(y) = β0 + β1 x.
A prediction interval for y.
A confidence interval is an interval estimator of an unknown parameter (i.e. for a
constant) while a prediction interval is for a random variable. They are different and
serve different purposes.
We assume the model is normal, i.e. ε = y − β0 − β1 x ∼ N (0, σ²), and let
µ̂(x) = β̂0 + β̂1 x, such that µ̂(x) is an unbiased estimator of µ(x). We note without
proof that:

µ̂(x) ∼ N(µ(x), (σ²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²).

Standardising gives:

[µ̂(x) − µ(x)] / [(σ²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²]^{1/2} ∼ N (0, 1).

In practice σ² is unknown, but it can be shown that (n − 2) σ̂²/σ² ∼ χ²n−2 , where
σ̂² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi )²/(n − 2). Furthermore, µ̂(x) and σ̂² are independent. Hence:

[µ̂(x) − µ(x)] / [(σ̂²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²]^{1/2} ∼ tn−2 .
Confidence interval for µ(x)

A 100(1 − α)% confidence interval for µ(x) is:

µ̂(x) ± tα/2, n−2 × σ̂ × [Σ_{i=1}^{n} (xi − x)² / (n Σ_{j=1}^{n} (xj − x̄)²)]^{1/2} .

Such a confidence interval contains the true expectation E(y) = µ(x) with probability
1 − α over repeated samples. It does not cover y with probability 1 − α.
A.9 Prediction intervals for y
A 100(1 − α)% prediction interval is an interval which contains y with probability
1 − α.
We may assume that the y to be predicted is independent of y1 , . . . , yn used in the
estimation of the regression model.
Hence y − µ̂(x) is normal with mean 0 and variance:

Var(y) + Var(µ̂(x)) = σ² + (σ²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)² .

Therefore:

[y − µ̂(x)] / {σ̂² [1 + (1/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²]}^{1/2} ∼ tn−2 .
Prediction interval for y

A 100(1 − α)% prediction interval covering y with probability 1 − α is:

µ̂(x) ± tα/2, n−2 × σ̂ × [1 + (1/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²]^{1/2} .
Some remarks are the following.
i. It holds that:

P(y ∈ µ̂(x) ± tα/2, n−2 × σ̂ × [1 + (1/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²]^{1/2}) = 1 − α.
ii. The prediction interval for y is wider than the confidence interval for E(y). The
former contains the unobserved random variable y with probability 1 − α, the
latter contains the unknown constant E(y) with probability 1 − α over repeated
samples.
Example 1.5 A dataset contains the prices (y, in $000s) of 100 three-year-old Ford
Tauruses together with their mileages (x, in thousands of miles) when they were sold
at auction. Based on these data, a car dealer needs to make two decisions.
1. To prepare cash for bidding on one three-year-old Ford Taurus with a mileage
of x = 40.
2. To prepare buying several three-year-old Ford Tauruses with mileages close to
x = 40 from a rental company.
For the first task, a prediction interval would be more appropriate. For the second
task, the car dealer needs to know the average price and, therefore, a confidence
interval is appropriate. This can be easily done using R.
> reg <- lm(Price~ Mileage)
> summary(reg)
Call:
lm(formula = Price ~ Mileage)
Residuals:
     Min       1Q   Median       3Q      Max
-0.68679 -0.27263  0.00521  0.23210  0.70071

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.248727   0.182093   94.72   <2e-16 ***
Mileage     -0.066861   0.004975  -13.44   <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1  1
Residual standard error: 0.3265 on 98 degrees of freedom
Multiple R-squared: 0.6483,
Adjusted R-squared: 0.6447
F-statistic: 180.6 on 1 and 98 DF, p-value: < 2.2e-16
357
A. Linear regression (non-examinable)
> new.Mileage <- data.frame(Mileage = c(40))
> predict(reg, newdata = new.Mileage, int = "c")
       fit      lwr      upr
1 14.57429 14.49847 14.65011
> predict(reg, newdata = new.Mileage, int = "p")
       fit      lwr      upr
1 14.57429 13.92196 15.22662
We predict that a Ford Taurus will sell for between $13,922 and $15,227. The
average selling price of several three-year-old Ford Tauruses is estimated to be
between $14,498 and $14,650. Because predicting the selling price for one car is more
difficult, the corresponding prediction interval is wider than the confidence interval.
To produce the plots with confidence intervals for E(y) and prediction intervals for
y, we proceed as follows:
> pc <- predict(reg,int="c")
> pp <- predict(reg,int="p")
> plot(Mileage,Price,pch=16)
> matlines(Mileage,pc)
> matlines(Mileage,pp)

[Scatterplot of Price against Mileage, with the fitted line, confidence bands for E(y) and prediction bands for y.]
A.10 Multiple linear regression models
For most practical problems, the variable of interest, y, typically depends on several
explanatory variables, say x1 , . . . , xp , leading to the multiple linear regression
model. In this course we only provide a brief overview of the multiple linear regression
model. EC2020 Elements of econometrics will explore this model in much greater
depth.
Let (yi , xi1 , . . . , xip ), for i = 1, . . . , n, be observations from the model:
yi = β0 + β1 xi1 + · · · + βp xip + εi
where:
E(εi ) = 0, Var(εi ) = σ² > 0 and Cov(εi , εj ) = 0 for all i ≠ j.

The multiple linear regression model is a natural extension of the simple linear
regression model, just with more parameters: β0 , β1 , . . . , βp and σ².

Treating all of the xij s as constants as before, we have:

E(yi ) = β0 + β1 xi1 + · · · + βp xip and Var(yi ) = σ².

y1 , . . . , yn are uncorrelated with each other, again as before.

If in addition εi ∼ N (0, σ²), then:

yi ∼ N(β0 + Σ_{j=1}^{p} βj xij , σ²).
Estimation of the intercept and slope parameters is still performed using least squares
estimation. The LSEs β̂0 , β̂1 , . . . , β̂p are obtained by minimising:

Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij )²

leading to the fitted regression model:

ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp .

The residuals are expressed as:

ε̂i = yi − β̂0 − Σ_{j=1}^{p} β̂j xij .

Just as with the simple linear regression model, we can decompose the total variation of
y such that:

Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} ε̂²i

or, in words:

Total SS = Regression SS + Residual SS.

An unbiased estimator of σ² is:

σ̂² = (1/(n − p − 1)) Σ_{i=1}^{n} (yi − β̂0 − Σ_{j=1}^{p} β̂j xij )² = Residual SS / (n − p − 1).

We can test a single slope coefficient by testing:

H0 : βi = 0 vs. H1 : βi ≠ 0.
Under H0 , the test statistic is:

T = β̂i / E.S.E.(β̂i ) ∼ tn−p−1

and we reject H0 if |t| > tα/2, n−p−1 . However, note the slight difference in the
interpretation of the slope coefficient βj . In the multiple regression setting, βj is the
effect of xj on y, holding all other independent variables fixed – this is unfortunately
not always practical.
It is also possible to test whether all the regression coefficients are equal to zero. This is
known as a joint test of significance and can be used to test the overall significance
of the regression model, i.e. whether there is at least one significant explanatory
(independent) variable, by testing:
H0 : β1 = · · · = βp = 0 vs. H1 : At least one βi ≠ 0.
Indeed, it is preferable to perform this joint test of significance before conducting t tests
of individual slope coefficients. Failure to reject H0 would render the model useless and
hence the model would not warrant any further statistical investigation.
Provided εi ∼ N (0, σ²), under H0 : β1 = · · · = βp = 0, the test statistic is:

F = [(Regression SS)/p] / [(Residual SS)/(n − p − 1)] ∼ Fp, n−p−1 .

We reject H0 at the 100α% significance level if f > Fα, p, n−p−1 .

It may be shown that:

Regression SS = Σ_{i=1}^{n} (ŷi − ȳ)² = Σ_{i=1}^{n} (β̂1 (xi1 − x̄1 ) + · · · + β̂p (xip − x̄p ))².

Hence, under H0 , f should be very small.
We now conclude the chapter with worked examples of linear regression using R.
A.11 Regression using R
To solve practical regression problems, we need to use statistical computing packages.
All of them include linear regression analysis. In fact all statistical packages, such as R,
make regression analysis much easier to use.
Example 1.6 We illustrate the use of linear regression in R using the dataset
introduced in Example 1.1.
> reg <- lm(Sales ~ Student.population)
> summary(reg)
Call:
lm(formula = Sales ~ Student.population)
Residuals:
   Min     1Q Median     3Q    Max
-21.00  -9.75  -3.00  11.25  18.00

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         60.0000     9.2260   6.503 0.000187 ***
Student.population   5.0000     0.5803   8.617 2.55e-05 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1  1
Residual standard error: 13.83 on 8 degrees of freedom
Multiple R-squared: 0.9027,
Adjusted R-squared: 0.8906
F-statistic: 74.25 on 1 and 8 DF, p-value: 2.549e-05
The fitted line is ŷ = 60 + 5x. We have σ̂² = (13.83)². Also, β̂0 = 60 with
E.S.E.(β̂0 ) = 9.2260, and β̂1 = 5 with E.S.E.(β̂1 ) = 0.5803.

For testing H0 : β0 = 0 we have t = β̂0 /E.S.E.(β̂0 ) = 6.503. The p-value is
P (|T | > 6.503) = 0.000187, where T ∼ tn−2 .

For testing H0 : β1 = 0 we have t = β̂1 /E.S.E.(β̂1 ) = 8.617. The p-value is
P (|T | > 8.617) = 0.0000255, where T ∼ tn−2 .

The F test statistic value is 74.25 with a corresponding p-value of:

P (F > 74.25) = 0.00002549

where F ∼ F1, 8 .
Example 1.7 We apply the simple linear regression model to study the
relationship between two series of financial returns – a regression of Cisco Systems
stock returns, y, on S&P500 Index returns, x. This regression model is an example of
the capital asset pricing model (CAPM).
Stock returns are defined as:

return = (current price − previous price)/previous price ≈ log(current price/previous price)
when the difference between the two prices is small.
A dataset contains daily returns over the period 3 January – 29 December 2000 (i.e.
n = 252 observations). The dataset has 5 columns: Day, S&P500 return, Cisco
return, Intel return and Sprint return.
Daily prices are definitely not independent. However, daily returns may be seen as a
sequence of uncorrelated random variables.
> summary(S.P500)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-6.00451 -0.85028 -0.03791 -0.04242  0.79869  4.65458

> summary(Cisco)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-13.4387  -3.0819  -0.1150  -0.1336   2.6363  15.4151
For the S&P500, the average daily return is −0.04%, the maximum daily return is
4.65%, the minimum daily return is −6.01% and the standard deviation is 1.40%.
For Cisco, the average daily return is −0.13%, the maximum daily return is 15.42%,
the minimum daily return is −13.44% and the standard deviation is 4.23%.
We see that Cisco is much more volatile than the S&P500.
> sandpts <- ts(S.P500)
> ciscots <- ts(Cisco)
> ts.plot(sandpts,ciscots,col=c(1:2))

[Time series plot of the daily S&P500 and Cisco returns over the 252 trading days.]
There is clear synchronisation between the movements of the two series of returns,
as evident from examining the sample correlation coefficient.
> cor.test(S.P500,Cisco)
Pearson’s product-moment correlation
data: S.P500 and Cisco
t = 14.943, df = 250, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6155530 0.7470423
sample estimates:
cor
0.686878
We fit the regression model: Cisco = β0 + β1 S&P500 + ε.
Our rationale is that part of the fluctuation in Cisco returns was driven by the
fluctuation in the S&P500 returns.
R produces the following regression output.
> reg <- lm(Cisco ~ S.P500)
> summary(reg)
Call:
lm(formula = Cisco ~ S.P500)
Residuals:
     Min       1Q   Median       3Q      Max
-13.1175  -2.0238   0.0091   2.0614   9.9491

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04547    0.19433  -0.234    0.815
S.P500       2.07715    0.13900  14.943   <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1  1
Residual standard error: 3.083 on 250 degrees of freedom
Multiple R-squared: 0.4718,
Adjusted R-squared: 0.4697
F-statistic: 223.3 on 1 and 250 DF, p-value: < 2.2e-16
The estimated slope is β̂1 = 2.07715. The null hypothesis H0 : β1 = 0 is rejected with
a p-value of 0.000 (to three decimal places). Therefore, the test is extremely
significant.

Our interpretation is that when the market index goes up by 1%, Cisco stock goes
up by 2.07715%, on average. However, the error term ε in the model is large, with an
estimated σ̂ = 3.083%.

The p-value for testing H0 : β0 = 0 is 0.815, so we cannot reject the hypothesis that
β0 = 0. Recall β̂0 = ȳ − β̂1 x̄ and both ȳ and x̄ are very close to 0.

R² = 47.18%, hence 47.18% of the variation of Cisco stock may be explained by the
variation of the S&P500 index, or, in other words, 47.18% of the risk in Cisco stock
is the market-related risk.
The capital asset pricing model (CAPM) is a simple asset pricing model in finance
given by:
yi = β0 + β1 xi + εi
where yi is a stock return and xi is a market return at time i.
The total risk of the stock is:

(1/n) Σ_{i=1}^{n} (yi − ȳ)² = (1/n) Σ_{i=1}^{n} (ŷi − ȳ)² + (1/n) Σ_{i=1}^{n} (yi − ŷi )².

The market-related (or systematic) risk is:

(1/n) Σ_{i=1}^{n} (ŷi − ȳ)² = (β̂1²/n) Σ_{i=1}^{n} (xi − x̄)².

The firm-specific risk is:

(1/n) Σ_{i=1}^{n} (yi − ŷi )².
Some remarks are the following.
i. β1 measures the market-related (or systematic) risk of the stock.
ii. Market-related risk is unavoidable, while firm-specific risk may be ‘diversified
away’ through hedging.
iii. Variance is a simple measure (and one of the most frequently-used) of risk in
finance.
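The decomposition can be computed from the fitted regression. A minimal R sketch using the Cisco and S.P500 return vectors already loaded (names as in the output above):

fit <- lm(Cisco ~ S.P500)
n   <- length(Cisco)
total.risk      <- sum((Cisco - mean(Cisco))^2) / n
systematic.risk <- sum((fitted(fit) - mean(Cisco))^2) / n   # market-related risk
specific.risk   <- sum(resid(fit)^2) / n                    # firm-specific risk
systematic.risk / total.risk   # this ratio equals the R-squared of the regression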
Example 1.8 A dataset illustrates the effects of marketing instruments on the
weekly sales volume of a certain food product over a three-year period. Data are real
but transformed to protect the innocent!
There are observations on the following four variables:
y = LVOL: logarithms of weekly sales volume
x1 = PROMP : promotion price
x2 = FEAT : feature advertising
x3 = DISP : display measure.
R produces the following descriptive statistics.
> summary(Foods)
      LVOL           PROMP            FEAT            DISP
 Min.   :13.83   Min.   :3.075   Min.   : 2.84   Min.   :12.42
 1st Qu.:14.08   1st Qu.:3.330   1st Qu.:15.95   1st Qu.:20.59
 Median :14.24   Median :3.460   Median :22.99   Median :25.11
 Mean   :14.28   Mean   :3.451   Mean   :24.84   Mean   :25.31
 3rd Qu.:14.43   3rd Qu.:3.560   3rd Qu.:33.49   3rd Qu.:29.34
 Max.   :15.07   Max.   :3.865   Max.   :57.10   Max.   :45.94
n = 156. The values of FEAT and DISP are much larger than LVOL.
As always, first we plot the data to ascertain basic characteristics.
> LVOLts <- ts(LVOL)
> ts.plot(LVOLts)

[Time series plot of LVOL over the 156 weeks.]
The time series plot indicates momentum in the data.
Next we show scatterplots between y and each xi .
> plot(PROMP,LVOL,pch=16)

[Scatterplot of LVOL against PROMP.]
> plot(FEAT,LVOL,pch=16)

[Scatterplot of LVOL against FEAT.]
> plot(DISP,LVOL,pch=16)

[Scatterplot of LVOL against DISP.]
What can we observe from these pairwise plots?
There is a negative correlation between LVOL and PROMP.
There is a positive correlation between LVOL and FEAT.
There is little or no correlation between LVOL and DISP, but this might have
been blurred by the other input variables.
Therefore, we should regress LVOL on PROMP and FEAT first.
We run a multiple linear regression model using x1 and x2 as explanatory variables:
y = β0 + β1 x1 + β2 x2 + ε.
> reg <- lm(LVOL~PROMP + FEAT)
> summary(reg)
Call:
lm(formula = LVOL ~ PROMP + FEAT)
Residuals:
     Min       1Q   Median       3Q      Max
-0.32734 -0.08519 -0.01011  0.08471  0.30804

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.1500102  0.2487489   68.94   <2e-16 ***
PROMP       -0.9042636  0.0694338  -13.02   <2e-16 ***
FEAT         0.0100666  0.0008827   11.40   <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1  1

Residual standard error: 0.1268 on 153 degrees of freedom
Multiple R-squared: 0.756,     Adjusted R-squared: 0.7528
F-statistic: 237 on 2 and 153 DF,  p-value: < 2.2e-16
We begin by performing a joint test of significance by testing H0 : β1 = β2 = 0. The
test statistic value is given in the regression ANOVA table as f = 237, with a
corresponding p-value of 0.000 (to three decimal places). Hence H0 is rejected and we
have strong evidence that at least one slope coefficient is not equal to zero.
Next we consider individual t tests of H0 : β1 = 0 and H0 : β2 = 0. The respective
test statistic values are −13.02 and 11.40, both with p-values of 0.000 (to three
decimal places) indicating that both slope coefficients are non-zero.
Turning to the estimated coefficients, β̂1 = −0.904 (to three decimal places) which
indicates that LVOL decreases as PROMP increases, controlling for FEAT. Also,
β̂2 = 0.010 (to three decimal places) which indicates that LVOL increases as FEAT
increases, controlling for PROMP.

We could also compute 95% confidence intervals, given by:

β̂i ± t0.025, n−3 × E.S.E.(β̂i ).

Since n − 3 = 153 is large, t0.025, n−3 ≈ z0.025 = 1.96.
R² = 0.756. Therefore, 75.6% of the variation of LVOL can be explained (jointly)
by PROMP and FEAT. However, a large R² does not necessarily mean that the
fitted model is useful. For the estimation of coefficients and predicting y, the
absolute measure ‘Residual SS’ (or σ̂²) plays a critical role in determining the
accuracy of the model.
Consider now introducing DISP into the regression model to give three explanatory
variables:
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.
The reason for adding the third variable is that one would expect DISP to have an
impact on sales and we may wish to estimate its magnitude.
> reg <- lm(LVOL~PROMP + FEAT + DISP)
> summary(reg)
Call:
lm(formula = LVOL ~ PROMP + FEAT + DISP)
Residuals:
     Min       1Q   Median       3Q      Max
-0.33363 -0.08203 -0.00272  0.07927  0.33812

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.2372251  0.2490226  69.220   <2e-16 ***
PROMP       -0.9564415  0.0726777 -13.160   <2e-16 ***
FEAT         0.0101421  0.0008728  11.620   <2e-16 ***
DISP         0.0035945  0.0016529   2.175   0.0312 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1  1

Residual standard error: 0.1253 on 152 degrees of freedom
Multiple R-squared: 0.7633,    Adjusted R-squared: 0.7587
F-statistic: 163.4 on 3 and 152 DF,  p-value: < 2.2e-16
All the estimated coefficients have the right sign (according to commercial common
sense!) and are statistically significant. In particular, the relationship with DISP
seems real when the other inputs are taken into account. On the other hand, the
addition of DISP to the model has resulted in a very small reduction in σ̂, from
√0.0161 = 0.1268 to √0.0157 = 0.1253, and correspondingly a slightly higher R²
(0.7633, i.e. 76.33% of the variation of LVOL is explained by the model). Therefore,
DISP contributes very little to ‘explaining’ the variation of LVOL after the other
two explanatory variables, PROMP and FEAT, are taken into account.

Intuitively, we would expect a higher R² if we add a further explanatory variable to
the model. However, the model has become more complex as a result – there is an
additional parameter to estimate. Therefore, strictly speaking, we should consider
the ‘adjusted R²’ statistic, although this will not be considered in this course.
A.12 Overview of chapter
This chapter has covered the linear regression model with one or more explanatory
variables. Least squares estimators were derived for the simple linear regression model,
and statistical inference procedures were also covered. The multiple linear regression
model and applications using R concluded the chapter.
A.13 Key terms and concepts
ANOVA decomposition
Confidence interval
Independent variable
Least squares estimation
Multiple linear regression
Regression analysis
Residual
Slope coefficient
Coefficient of determination
Dependent variable
Intercept
Linear estimators
Prediction interval
Regressor
Simple linear regression
Appendix B
Non-examinable proofs
B.1 Chapter 2 – Probability theory
For the empty set, ∅, we have:

P (∅) = 0.

Proof : Since ∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅, Axiom 3 gives:

P (∅) = P (∅ ∪ ∅ ∪ · · · ) = Σ_{i=1}^{∞} P (∅).

However, the only real number for P (∅) which satisfies this is P (∅) = 0.

If A1 , A2 , . . . , An are pairwise disjoint, then:

P (∪_{i=1}^{n} Ai ) = Σ_{i=1}^{n} P (Ai ).

Proof : In Axiom 3, set An+1 = An+2 = · · · = ∅, so that:

P (∪_{i=1}^{∞} Ai ) = Σ_{i=1}^{∞} P (Ai ) = Σ_{i=1}^{n} P (Ai ) + Σ_{i=n+1}^{∞} P (Ai ) = Σ_{i=1}^{n} P (Ai )

since P (Ai ) = P (∅) = 0 for i = n + 1, n + 2, . . ..

B.2 Chapter 3 – Random variables
For the basketball example, p(x) = (1 − π)^x π for x = 0, 1, 2, . . ., and 0 otherwise.

The expected value of X is then:

E(X) = Σ_{xi∈S} xi p(xi ) = Σ_{x=0}^{∞} x (1 − π)^x π
     = Σ_{x=1}^{∞} x (1 − π)^x π                              (starting from x = 1)
     = (1 − π) Σ_{x=1}^{∞} x (1 − π)^{x−1} π
     = (1 − π) Σ_{y=0}^{∞} (y + 1)(1 − π)^y π                 (using y = x − 1)
     = (1 − π) [Σ_{y=0}^{∞} y (1 − π)^y π + Σ_{y=0}^{∞} (1 − π)^y π]
     = (1 − π) [E(X) + 1]
     = (1 − π) E(X) + (1 − π)

where the first sum in brackets equals E(X) and the second equals 1, from which we can solve:

E(X) = (1 − π)/(1 − (1 − π)) = (1 − π)/π.
Suppose X is a random variable and a and b are constants, i.e. known numbers which
are not random variables. Then:

E(aX + b) = a E(X) + b.

Proof : We have:

E(aX + b) = Σx (ax + b) p(x) = Σx ax p(x) + Σx b p(x) = a Σx x p(x) + b Σx p(x) = a E(X) + b

where the last step follows from:

i. Σx x p(x) = E(X), by definition of E(X)

ii. Σx p(x) = 1, by definition of the probability function.
If X is a random variable and a and b are constants, then:

Var(aX + b) = a² Var(X).

Proof:

Var(aX + b) = E[((aX + b) − E(aX + b))²]
            = E[(aX + b − a E(X) − b)²]
            = E[(aX − a E(X))²]
            = E[a² (X − E(X))²]
            = a² E[(X − E(X))²]
            = a² Var(X).

Therefore, sd(aX + b) = |a| sd(X).

B.3 Chapter 5 – Multivariate random variables
We can now prove some results which were stated earlier.
Recall:
Var(X) = E(X²) − (E(X))².
Proof:
\begin{align*}
\text{Var}(X) &= E[(X - E(X))^2] \\
&= E[X^2 - 2\,E(X)X + (E(X))^2] \\
&= E(X^2) - 2\,E(X)\,E(X) + (E(X))^2 \\
&= E(X^2) - 2\,(E(X))^2 + (E(X))^2 \\
&= E(X^2) - (E(X))^2
\end{align*}
using (5.3), with $X_1 = X^2$, $X_2 = X$, $a_1 = 1$, $a_2 = -2\,E(X)$ and $b = (E(X))^2$.

Recall:
Cov(X, Y ) = E(XY ) − E(X) E(Y ).
Proof:
\begin{align*}
\text{Cov}(X, Y) &= E[(X - E(X))(Y - E(Y))] \\
&= E[XY - E(Y)X - E(X)Y + E(X)\,E(Y)] \\
&= E(XY) - E(Y)\,E(X) - E(X)\,E(Y) + E(X)\,E(Y) \\
&= E(XY) - E(X)\,E(Y)
\end{align*}
using (5.3), with $X_1 = XY$, $X_2 = X$, $X_3 = Y$, $a_1 = 1$, $a_2 = -E(Y)$, $a_3 = -E(X)$ and $b = E(X)\,E(Y)$.
Recall that if X and Y are independent, then:
Cov(X, Y ) = Corr(X, Y ) = 0.
Proof:
Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(X) E(Y ) − E(X) E(Y ) = 0
since E(XY ) = E(X) E(Y ) when X and Y are independent.
Since Corr(X, Y) = Cov(X, Y)/[sd(X) sd(Y)], we have Corr(X, Y) = 0 whenever
Cov(X, Y) = 0.
Appendix C
Solutions to Sample examination
questions
C.1
Chapter 2 – Probability theory
1. (a) True. We have:
P (A) + P (B) > P (A) + P (B) − P (A) P (B) = P (A ∪ B).
(b) True. Note that P (A | B) = P (A | B c ) implies:
$$\frac{P(A \cap B)}{P(B)} = \frac{P(A \cap B^c)}{1 - P(B)}$$
or:
P (A ∩ B) (1 − P (B)) = (P (A) − P (A ∩ B)) P (B)
which implies P (A ∩ B) = P (A) P (B).
(c) False. Consider, for example, throwing a die and letting A be the event that
the throw results in a 1 and B be the event that the throw is 2.
2. There are $\binom{10}{2}$ ways in which A and B can choose their seats, of which 9 are pairs
of adjacent seats. Therefore, using classical probability, the probability
that A and B are adjacent is:
$$\frac{9}{\binom{10}{2}} = \frac{1}{5}.$$
3. (a) Judge 1 can either correctly vote guilty for a guilty defendant, or incorrectly vote
guilty for a not guilty defendant. Therefore, the probability is given by:
0.9 × 0.7 + 0.25 × 0.3 = 0.705.
(b) This conditional probability is given by:
\begin{align*}
P(\text{Judge 3 votes guilty} \mid \text{Judges 1 and 2 vote not guilty})
&= \frac{P(\text{Judge 3 votes guilty, and Judges 1 and 2 vote not guilty})}{P(\text{Judges 1 and 2 vote not guilty})} \\
&= \frac{0.7 \times 0.9 \times (0.1)^2 + 0.3 \times 0.25 \times (0.75)^2}{0.7 \times (0.1)^2 + 0.3 \times (0.75)^2} \\
&= 0.2759.
\end{align*}
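The same answers can be reproduced numerically in R, using the probabilities appearing in the solution above (P(guilty) = 0.7, P(vote guilty | guilty) = 0.9 and P(vote guilty | not guilty) = 0.25, with judges voting independently given the defendant's status):

```r
p_g <- 0.7; vg_g <- 0.9; vg_ng <- 0.25

vg_g * p_g + vg_ng * (1 - p_g)                                    # (a) 0.705

num <- p_g * vg_g * (1 - vg_g)^2 + (1 - p_g) * vg_ng * (1 - vg_ng)^2
den <- p_g * (1 - vg_g)^2 + (1 - p_g) * (1 - vg_ng)^2
num / den                                                         # (b) 0.2759
```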
C.2
Chapter 3 – Random variables
1. (a) Let the other value be x, then:
E(X) = 0 = 1 × 0.2 + 2 × 0.5 + x × 0.3
and hence the other value is −4.
(b) Since E(X) = 0, the variance of X is given by:
$$\text{Var}(X) = E(X^2) - (E(X))^2 = E(X^2) = (-4)^2 \times 0.3 + 1^2 \times 0.2 + 2^2 \times 0.5 = 7.$$
2. (a) Since $\int f(x)\, dx = 1$, we have:
$$\int_0^1 kx^2\, dx = \left[\frac{kx^3}{3}\right]_0^1 = \frac{k}{3} = 1$$
and so k = 3.
(b) We have:
$$E(X) = \int_0^1 x\, f(x)\, dx = \int_0^1 3x^3\, dx = \left[\frac{3x^4}{4}\right]_0^1 = \frac{3}{4}$$
and:
$$E(X^2) = \int_0^1 x^2\, f(x)\, dx = \int_0^1 3x^4\, dx = \left[\frac{3x^5}{5}\right]_0^1 = \frac{3}{5}.$$
Hence:
$$\text{Var}(X) = E(X^2) - (E(X))^2 = \frac{3}{5} - \left(\frac{3}{4}\right)^2 = \frac{3}{80} = 0.0375.$$
3. (a) We compute:
$$\int_0^1 x^2 (1 - x)\, dx = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$$
so k = 12.
(b) We first find:
$$E\left(\frac{1}{X}\right) = 12 \int_0^1 x(1 - x)\, dx = 12\left(\frac{1}{2} - \frac{1}{3}\right) = 2.$$
Also:
$$E\left(\frac{1}{X^2}\right) = 12 \int_0^1 (1 - x)\, dx = 6.$$
Therefore:
$$\text{Var}\left(\frac{1}{X}\right) = 6 - 2^2 = 2.$$
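These integrals are simple enough to do by hand, but they can also be checked numerically with R's integrate() function:

```r
# Numerical check of k and of Var(1/X) for f(x) = 12 x^2 (1 - x) on [0, 1]
integrate(function(x) x^2 * (1 - x), 0, 1)$value            # 1/12, hence k = 12
E1 <- integrate(function(x) 12 * x * (1 - x), 0, 1)$value   # E(1/X)   = 2
E2 <- integrate(function(x) 12 * (1 - x), 0, 1)$value       # E(1/X^2) = 6
E2 - E1^2                                                   # Var(1/X) = 2
```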
C.3
Chapter 4 – Common distributions of random variables
1. We have Y ∼ Bin(10, 0.25), hence:
$$P(Y \geq 2) = 1 - P(Y = 0) - P(Y = 1) = 1 - (0.75)^{10} - 10 \times 0.25 \times (0.75)^9 = 0.7560.$$
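As a check, the same probability is obtained directly from R's binomial cdf:

```r
# P(Y >= 2) for Y ~ Bin(10, 0.25)
1 - pbinom(1, size = 10, prob = 0.25)   # 0.7560
```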
2. X ∼ Exp(1), hence $1 - F(x) = e^{-x}$. Therefore:
$$p = P(X > x_0 + 1 \mid X > x_0) = \frac{P(\{X > x_0 + 1\} \cap \{X > x_0\})}{P(X > x_0)} = \frac{P(X > x_0 + 1)}{P(X > x_0)} = \frac{e^{-(x_0 + 1)}}{e^{-x_0}} = e^{-1}.$$
3. We have that X ∼ N (1, 4). Using the definition of conditional probability, and
standardising with Z = (X − 1)/2, we have:
$$P(X > 3 \mid X < 5) = \frac{P(3 < X < 5)}{P(X < 5)} = \frac{P(1 < Z < 2)}{P(Z < 2)} = \frac{0.9772 - 0.8413}{0.9772} = 0.1391.$$

C.4
Chapter 5 – Multivariate random variables
1. (a) All probabilities must be in the interval [0, 1], hence α ∈ [0, 1/2].
(b) From the definition of U , the only values U can take are 0 and 1/3. U = 0 only
when X = 0 and Y = 0. We have:
$$P(U = 0) = P(X = 0, Y = 0) = \frac{1}{4}$$
and:
$$P\left(U = \frac{1}{3}\right) = 1 - P(U = 0) = \frac{3}{4}$$
therefore:
$$E(U) = 0 \times \frac{1}{4} + \frac{1}{3} \times \frac{3}{4} = \frac{1}{4}.$$
Similarly, from the definition of V , the only values V can take are 0 and 1.
V = 1 only when X = 1 and Y = 1. We have:
$$P(V = 1) = P(X = 1, Y = 1) = \frac{1}{4}$$
and:
$$P(V = 0) = 1 - P(V = 1) = \frac{3}{4}$$
hence:
$$E(V) = 0 \times \frac{3}{4} + 1 \times \frac{1}{4} = \frac{1}{4}.$$
(c) U and V are not independent since not all joint probabilities are equal to the
product of the respective marginal probabilities. For example, one sufficient
case to disprove independence is noting that P (U = 0, V = 0) = 0 whereas
P (U = 0) P (V = 0) > 0.
2. (a) Due to independence, the amount of coffee in 5 cups, X, follows a normal
distribution with mean 5 × 150 = 750 and variance 5 × (10)² = 500, i.e. X ∼ N(750, 500). Therefore:
$$P(X > 700) = P\left(Z > \frac{-50}{\sqrt{500}}\right) = P(Z > -2.24) = 0.98745$$
using Table 4 of the New Cambridge Statistical Tables.
(b) Due to independence, the difference in the amounts between two cups, D,
follows a normal distribution with mean 150 − 150 = 0 and variance (10)² + (10)² = 200, i.e. D ∼ N(0, 200). Hence:
$$P(|D| < 20) = P\left(\frac{-20}{\sqrt{200}} < Z < \frac{20}{\sqrt{200}}\right) = P(-1.41 < Z < 1.41) = 0.9207 - (1 - 0.9207) = 0.8414$$
using Table 4 of the New Cambridge Statistical Tables.
(c) Let C denote the amount of coffee in one cup, hence C ∼ N (150, 100). We
require:
$$P(C < 137) = P\left(Z < -\frac{13}{10}\right) = P(Z < -1.30) = 0.0968$$
using Table 4 of the New Cambridge Statistical Tables.
(d) The expected income is:
0 × P (C < 137) + 1 × P (C ≥ 137) = 0 × 0.0968 + 1 × 0.9032 = 0.9032
i.e. £0.9032.
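All four answers can be checked with pnorm(); the small differences from the figures above arise from rounding the z-values to two decimal places when the statistical tables are used.

```r
pnorm(700, mean = 750, sd = sqrt(500), lower.tail = FALSE)   # (a) about 0.9873
pnorm(20, 0, sqrt(200)) - pnorm(-20, 0, sqrt(200))           # (b) about 0.8427
pnorm(137, mean = 150, sd = 10)                              # (c) 0.0968
1 - pnorm(137, mean = 150, sd = 10)                          # (d) expected income, about 0.9032
```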
3. (a) As the letters are delivered at random, the destination of letter 1 follows a
discrete uniform distribution among the six houses, i.e. the probability is equal
to 1/6.
(b) The random variable Xi is equal to 1 with probability 1/6 and 0 otherwise,
hence:
$$E(X_i) = 1 \times \frac{1}{6} + 0 \times \frac{5}{6} = \frac{1}{6}.$$
(c) If house 1 receives the correct letter, there are 5 letters still to be delivered.
Therefore, for example:
$$P(X_1 = 1 \cap X_2 = 1) = \frac{1}{6} \times \frac{1}{5} = \frac{1}{30} \neq \frac{1}{36} = P(X_1 = 1)\, P(X_2 = 1)$$
hence X1 and X2 are not independent.
(d) Following the previous part:
Cov(X1 , X2 ) = E(X1 X2 ) − E(X1 ) E(X2 ).
Note that X1 X2 is equal to 1 with probability 1/30 and 0 otherwise, hence:
$$\text{Cov}(X_1, X_2) = \frac{1}{30} - \frac{1}{36} = \frac{1}{180}.$$

C.5
Chapter 6 – Sampling distributions of statistics
1. We have:
$$E(X) = 5 \times \frac{18}{38} + (-5) \times \frac{20}{38} = -\frac{10}{38} = -0.2632$$
and:
$$\text{Var}(X) = E(X^2) - (E(X))^2 = 25 \times \frac{18}{38} + 25 \times \frac{20}{38} - \left(-\frac{10}{38}\right)^2 = 24.9308.$$
Since n = 100 is large, then $\sum_{i=1}^{100} X_i \sim N(-26.32,\, 2493.08)$, approximately, by the central limit theorem. We require:
$$P\left(\sum_{i=1}^{100} X_i > -50\right) \approx P\left(Z > \frac{-50 - (-26.32)}{\sqrt{2493.08}}\right) = P(Z > -0.47) = 0.6808.$$
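The calculation is easily reproduced in R; again, the small difference from 0.6808 comes from rounding the z-value to two decimal places in the solution.

```r
mu <- 5 * 18/38 + (-5) * 20/38           # E(X)   = -0.2632
v  <- 25 - mu^2                          # Var(X) = 24.9308 (since X^2 = 25 always)
pnorm(-50, mean = 100 * mu, sd = sqrt(100 * v), lower.tail = FALSE)   # about 0.68
```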
2. (a) Note $Z_i^2 \sim \chi^2_1$ for all $i = 1, \ldots, 5$. By independence, we have:
$$Z_1^2 + Z_2^2 \sim \chi^2_2.$$
(b) By independence, we have:
$$\frac{Z_1}{\sqrt{\sum_{i=2}^{5} Z_i^2 / 4}} \sim t_4.$$
(c) By independence, we have:
$$\frac{Z_1^2}{\sum_{i=2}^{5} Z_i^2 / 4} \sim F_{1,4}.$$
3. (a) The simplest answer is:
$$\frac{\sqrt{11}\, X_{12}}{\sqrt{\sum_{i=1}^{11} X_i^2}} \sim t_{11}$$
since $X_{12} \sim N(0, 1)$ and $\sum_{i=1}^{11} X_i^2 \sim \chi^2_{11}$.
(b) The simplest answer is:
$$\frac{\sum_{i=1}^{6} X_i^2 / 6}{\sum_{i=7}^{15} X_i^2 / 9} \sim F_{6,9}$$
since $\sum_{i=1}^{6} X_i^2 \sim \chi^2_6$ and $\sum_{i=7}^{15} X_i^2 \sim \chi^2_9$.

C.6
Chapter 7 – Point estimation
1. (a) The pdf of $X_i$ is:
$$f(x_i; \theta) = \begin{cases} \theta^{-1} & \text{for } 0 \leq x_i \leq \theta \\ 0 & \text{otherwise.} \end{cases}$$
Therefore:
$$E(X_i) = \frac{1}{\theta} \int_0^{\theta} x_i\, dx_i = \frac{1}{\theta} \left[\frac{x_i^2}{2}\right]_0^{\theta} = \frac{\theta}{2}.$$
Therefore, setting $\hat{\mu}_1 = M_1$, we have:
$$\frac{\hat{\theta}}{2} = \bar{X} \quad \Rightarrow \quad \hat{\theta} = 2\bar{X} = 2 \times \frac{\sum_{i=1}^{n} X_i}{n}.$$
(b) We have:
$$2\bar{x} = 2 \times \frac{0.2 + 3.6 + 1.1}{3} = 3.27.$$
The point estimate is not plausible since 3.27 < 3.6 = x₂, which would be impossible to observe if X ∼ Uniform[0, 3.27].
Due to the law of large numbers, sample moments should converge to the
corresponding population moments. Here, n = 3 is small, hence the poor
performance of the MME is not surprising.
2. (a) We have to minimise:
$$S = \sum_{i=1}^{3} \varepsilon_i^2 = (y_1 - \alpha - \beta)^2 + (y_2 - \alpha - 2\beta)^2 + (y_3 - \alpha - 4\beta)^2.$$
We have:
$$\frac{\partial S}{\partial \alpha} = -2(y_1 - \alpha - \beta) - 2(y_2 - \alpha - 2\beta) - 2(y_3 - \alpha - 4\beta) = 2(3\alpha + 7\beta - (y_1 + y_2 + y_3))$$
and:
$$\frac{\partial S}{\partial \beta} = -2(y_1 - \alpha - \beta) - 4(y_2 - \alpha - 2\beta) - 8(y_3 - \alpha - 4\beta) = 2(7\alpha + 21\beta - (y_1 + 2y_2 + 4y_3)).$$
The estimators $\hat{\alpha}$ and $\hat{\beta}$ are the solutions of the equations $\partial S/\partial \alpha = 0$ and $\partial S/\partial \beta = 0$. Hence:
$$3\hat{\alpha} + 7\hat{\beta} = y_1 + y_2 + y_3 \quad \text{and} \quad 7\hat{\alpha} + 21\hat{\beta} = y_1 + 2y_2 + 4y_3.$$
Solving yields:
$$\hat{\beta} = \frac{-4y_1 - y_2 + 5y_3}{14} \quad \text{and} \quad \hat{\alpha} = \frac{2y_1 + y_2 - y_3}{2}.$$
They are unbiased estimators since:
$$E(\hat{\beta}) = \frac{-4\alpha - 4\beta - \alpha - 2\beta + 5\alpha + 20\beta}{14} = \beta$$
and:
$$E(\hat{\alpha}) = \frac{2\alpha + 2\beta + \alpha + 2\beta - \alpha - 4\beta}{2} = \alpha.$$
(b) We have, by independence:
$$\text{Var}(\hat{\alpha}) = 1 + \left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^2 = \frac{3}{2}.$$
3. (a) By independence, the likelihood function is:
$$L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{2X_i}\, e^{-\lambda^2}}{X_i!} = \frac{\lambda^{2\sum_{i=1}^{n} X_i}\, e^{-n\lambda^2}}{\prod_{i=1}^{n} X_i!}.$$
The log-likelihood function is:
$$l(\lambda) = \ln L(\lambda) = \left(2\sum_{i=1}^{n} X_i\right) \ln \lambda - n\lambda^2 - \ln\left(\prod_{i=1}^{n} X_i!\right).$$
Differentiating:
$$\frac{d}{d\lambda}\, l(\lambda) = \frac{2\sum_{i=1}^{n} X_i}{\lambda} - 2n\lambda = \frac{2\sum_{i=1}^{n} X_i - 2n\lambda^2}{\lambda}.$$
Setting to zero, we re-arrange for the estimator:
$$2\sum_{i=1}^{n} X_i - 2n\hat{\lambda}^2 = 0 \quad \Rightarrow \quad \hat{\lambda} = \left(\frac{\sum_{i=1}^{n} X_i}{n}\right)^{1/2} = \bar{X}^{1/2}.$$
(b) By the invariance principle of maximum likelihood estimators:
$$\hat{\theta} = \hat{\lambda}^3 = \bar{X}^{3/2}.$$
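A numerical check of the maximum likelihood estimator is straightforward: for any data vector, maximising the log-likelihood over λ should return (approximately) the square root of the sample mean. The data below are purely illustrative.

```r
# Hypothetical observed counts from a Poisson distribution with mean lambda^2
x <- c(2, 0, 3, 1, 4, 2)
loglik <- function(lambda) sum(dpois(x, lambda^2, log = TRUE))
optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum   # about 1.414
sqrt(mean(x))                                                      # 1.414 = xbar^(1/2)
```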
C.7
Chapter 8 – Interval estimation
1. We have:
\begin{align*}
1 - \alpha &= P\left(-t_{\alpha/2,\, n-1} \leq \frac{\bar{X} - \mu}{S/\sqrt{n}} \leq t_{\alpha/2,\, n-1}\right) \\
&= P\left(-t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}} \leq \bar{X} - \mu \leq t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}}\right) \\
&= P\left(-t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}} < \mu - \bar{X} < t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}}\right) \\
&= P\left(\bar{X} - t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}}\right).
\end{align*}
Hence an accurate 100 (1 − α)% confidence interval for µ, where α ∈ (0, 1), is:
$$\left(\bar{X} - t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}},\; \bar{X} + t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}}\right).$$
2. The population is a Bernoulli distribution on two points – 1 (agree) and 0
(disagree). We have a random sample of size n = 250, i.e. {X1 , . . . , X250 }. Let
π = P (Xi = 1). Therefore, E(Xi ) = π and Var(Xi ) = π (1 − π). The sample mean
and variance are:
$$p = \bar{x} = \frac{1}{250} \sum_{i=1}^{250} x_i = \frac{163}{250} = 0.652$$
and:
$$s^2 = \frac{1}{249}\left(\sum_{i=1}^{250} x_i^2 - 250\bar{x}^2\right) = \frac{1}{249}\left(163 - 250 \times (0.652)^2\right) = 0.2278.$$
Note use of p (1 − p) = 0.652 × (1 − 0.652) = 0.2269 is also acceptable for the
sample variance.
Based on the central limit theorem for the sample mean, an approximate 99%
confidence interval for π is:
$$\bar{x} \pm z_{0.005} \times \frac{s}{\sqrt{n}} = 0.652 \pm 2.576 \times \sqrt{\frac{0.2278}{250}} = 0.652 \pm 0.078 \;\Rightarrow\; (0.574,\, 0.730).$$
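In R the same interval is obtained as follows:

```r
n <- 250; p <- 163/250
s2 <- (163 - n * p^2) / (n - 1)                  # sample variance, 0.2278
p + c(-1, 1) * qnorm(0.995) * sqrt(s2 / n)       # (0.574, 0.730)
```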
3. For a 90% confidence interval, we need the lower and upper 5% values from $\chi^2_{n-1} = \chi^2_9$. These are $\chi^2_{0.95,\, 9} = 3.325$ (given in the question) and $\chi^2_{0.05,\, 9} = 16.92$, using Table 8 of the New Cambridge Statistical Tables. Hence we obtain:
$$\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2,\, n-1}},\; \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\, n-1}}\right) = \left(\frac{9 \times 21.05}{16.92},\; \frac{9 \times 21.05}{3.325}\right) = (11.20,\, 56.98).$$
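The chi-squared quantiles (and hence the interval) can also be taken from R rather than from Table 8:

```r
n <- 10; s2 <- 21.05
(n - 1) * s2 / qchisq(c(0.95, 0.05), df = n - 1)   # (11.20, 56.98)
```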
C.8
Chapter 9 – Hypothesis testing
1. (a) We have:
$$P(\text{Type II error}) = P(\text{not reject } H_0 \mid H_1) = P(X \leq 3 \mid \pi = 0.4) = \sum_{x=1}^{3} (1 - 0.4)^{x-1} \times 0.4 = 0.784.$$
(b) We have:
$$P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0) = 1 - P(X \leq 3 \mid \pi = 0.3) = 1 - \sum_{x=1}^{3} (1 - 0.3)^{x-1} \times 0.3 = 0.343.$$
(c) The p-value is P (X ≥ 4 | π = 0.3) = 0.343 which, of course, is the same as the
probability of a Type I error.
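These probabilities can be checked with R's geometric distribution functions; note that pgeom() counts the number of failures before the first success, so the event {trial number X ≤ 3} corresponds to {at most 2 failures}.

```r
pgeom(2, prob = 0.4)        # P(Type II error) = P(X <= 3 | pi = 0.4) = 0.784
1 - pgeom(2, prob = 0.3)    # P(Type I error)  = 0.343
```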
2. The size is the probability we reject the null hypothesis when it is true:
$$P\left(X > 1 \,\Big|\, \lambda = \frac{1}{2}\right) = 1 - e^{-0.5} - 0.5e^{-0.5} \approx 1 - \frac{1.5}{\sqrt{2.718}} = 0.0902.$$
The power is the probability we reject the null hypothesis when the alternative is true:
$$P(X > 1 \mid \lambda = 2) = 1 - e^{-2} - 2e^{-2} \approx 1 - \frac{3}{(2.718)^2} = 0.5939.$$
3. The power of the test at σ² is:
\begin{align*}
\beta(\sigma) = P_{\sigma}(H_0 \text{ is rejected}) = P_{\sigma}(T > \chi^2_{\alpha,\, n-1}) &= P_{\sigma}\left(\frac{(n-1)S^2}{\sigma_0^2} > \chi^2_{\alpha,\, n-1}\right) \\
&= P_{\sigma}\left(\frac{(n-1)S^2}{\sigma^2} > \frac{\sigma_0^2}{\sigma^2} \times \chi^2_{\alpha,\, n-1}\right) \\
&= P\left(X > \frac{\sigma_0^2}{\sigma^2} \times \chi^2_{\alpha,\, n-1}\right)
\end{align*}
where $X \sim \chi^2_{n-1}$. Hence here, where n = 10, we have:
$$\beta(\sigma) = P\left(X > \frac{2.00}{\sigma^2} \times \chi^2_{0.01,\, 9}\right) = P\left(X > \frac{2.00}{\sigma^2} \times 21.666\right).$$
With any given values of σ², we may compute β(σ). For the σ² values requested, we obtain the following.

σ²       2.00 × 21.666/σ²     Approx. β(σ)
2.00     21.666               0.01
2.56     16.927               0.05
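The two tail probabilities in the table are immediate from pchisq():

```r
sigma2 <- c(2.00, 2.56)
pchisq(2.00 * 21.666 / sigma2, df = 9, lower.tail = FALSE)   # approximately 0.01 and 0.05
```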
C.9
Chapter 10 – Analysis of variance (ANOVA)
1. (a) The sample means are 187/4 = 46.75, 347/6 = 57.83 and 461/10 = 46.1 for
workers A, B and C, respectively. We will perform one-way ANOVA. We
calculate the overall sample mean to be:
$$\frac{187 + 347 + 461}{20} = 49.75.$$
We can now calculate the sum of squares between workers. This is:
$$4 \times (46.75 - 49.75)^2 + 6 \times (57.83 - 49.75)^2 + 10 \times (46.1 - 49.75)^2 = 561.27.$$
The total sum of squares is:
$$50915 - 20 \times (49.75)^2 = 1413.75.$$
Here is the one-way ANOVA table:
Source    Degrees of Freedom    Sum of Squares    Mean Square    F statistic
Worker    2                     561.27            280.64         5.60
Error     17                    852.48            50.15
Total     19                    1413.75
(b) At the 5% significance level, the critical value is $F_{0.05,\, 2,\, 17} = 3.59$. Since
3.59 < 5.60, we reject H0 : µA = µB = µC and conclude that there is evidence
of a difference in the average daily calls answered of the three workers.
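Since only summary figures (group totals, group sizes and the overall sum of squared observations) are given in the question, the ANOVA table can be rebuilt from them directly in R:

```r
totals <- c(A = 187, B = 347, C = 461)
sizes  <- c(4, 6, 10)
means  <- totals / sizes
grand  <- sum(totals) / sum(sizes)                 # 49.75
ss_between <- sum(sizes * (means - grand)^2)       # 561.27
ss_total   <- 50915 - sum(sizes) * grand^2         # 1413.75
ss_error   <- ss_total - ss_between                # 852.48
f_stat <- (ss_between / 2) / (ss_error / 17)       # 5.60
f_stat > qf(0.95, df1 = 2, df2 = 17)               # TRUE, so reject H0
```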
2. (a) The average audience share of all networks is:
$$\frac{21.35 + 17.28 + 20.18}{3} = 19.60.$$
Hence the sum of squares (SS) due to networks is:
$$4 \times \left[(21.35 - 19.60)^2 + (17.28 - 19.60)^2 + (20.18 - 19.60)^2\right] = 35.13$$
and the mean sum of squares (MS) due to networks is 35.13/(3 − 1) = 17.57.
The degrees of freedom are 4 − 1 = 3, 3 − 1 = 2, (4 − 1)(3 − 1) = 6 and
4 × 3 − 1 = 11 for cities, networks, error and total sum of squares, respectively.
The SS for cities is 3 × 1.95 = 5.85. We have that the SS due to residuals is
given by 51.52 − 5.85 − 35.13 = 10.54 and the MS is 10.54/6 = 1.76.
The F -values are 1.95/1.76 = 1.11 and 17.57/1.76 = 9.98 for cities and
networks, respectively.
Here is the two-way ANOVA table:
Source     Degrees of freedom    Sum of squares    Mean square    F-value
City       3                     5.85              1.95           1.11
Network    2                     35.13             17.57          9.98
Error      6                     10.54             1.76
Total      11                    51.52
(b) We test H0 : There is no difference between networks against H1 : There is a
difference between networks. The F -value is 9.98 and at a 5% significance level
the critical value is F0.05, 2, 6 = 5.14, hence we reject H0 and conclude that there
is evidence of a difference between networks.
3. (a) The average time for all batteries is 41.5. Hence the sum of squares for
batteries is:
$$7 \times \left[(43.86 - 41.5)^2 + (41.28 - 41.5)^2 + (40.86 - 41.5)^2 + (40 - 41.5)^2\right] = 57.94$$
and the mean sum of squares due to batteries is 57.94/(4 − 1) = 19.31. The
degrees of freedom are 7 − 1 = 6, 4 − 1 = 3, (7 − 1)(4 − 1) = 18 and
7 × 4 − 1 = 27 for laptops, batteries, error and total sum of squares,
respectively.
The sum of squares for laptops is 6 × 26 = 156. We have that the sum of
squares due to residuals is given by:
343 − 156 − 57.94 = 129.06
and hence the mean sum of squares is 129.06/18 = 7.17.
The F-values are 26/7.17 = 3.63 and 19.31/7.17 = 2.69 for laptops and batteries,
respectively. To summarise:
Source      Degrees of freedom    Sum of squares    Mean square    F-value
Laptops     6                     156               26             3.63
Batteries   3                     57.94             19.31          2.69
Error       18                    129.06            7.17
Total       27                    343
(b) We test the hypothesis H0 : There is no difference between different batteries
vs. H1 : There is a difference between different batteries. The F -value is 2.69
and at the 5% significance level the critical value (degrees of freedom 3 and 18)
is 3.16, hence we conclude that there is not enough evidence that there is a
difference.
Next, we test the hypothesis H0 : There is no difference between different
laptop brands vs. H1 : There is a difference between different laptop brands.
The F -value is 3.63 and at the 5% significance level the critical value (degrees
of freedom 6 and 18) is 2.66, hence we reject H0 and conclude that there is
evidence of a difference.
(c) The upper 5% point of the t distribution with 18 degrees of freedom is 1.734
and the estimate of σ² is 7.17. So the confidence interval is:
$$43.86 - 40 \pm 1.734 \times \sqrt{7.17 \times \left(\frac{1}{7} + \frac{1}{7}\right)} = 3.86 \pm 2.482 \;\Rightarrow\; (1.378,\, 6.342).$$
Since zero is not in the interval, we have evidence of a difference.
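Using qt() instead of the tabulated value 1.734 gives the same interval:

```r
3.86 + c(-1, 1) * qt(0.95, df = 18) * sqrt(7.17 * (1/7 + 1/7))   # (1.378, 6.342)
```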
Appendix D
Examination formula sheet
Formulae for Statistics
Discrete distributions
Distribution   p(x)                                                     E(X)         Var(X)
Uniform        1/k   for all x = 1, 2, . . . , k                        (k + 1)/2    (k² − 1)/12
Bernoulli      π^x (1 − π)^(1−x)   for x = 0, 1                         π            π (1 − π)
Binomial       (n choose x) π^x (1 − π)^(n−x)   for x = 0, 1, . . . , n    n π       n π (1 − π)
Geometric      (1 − π)^(x−1) π   for x = 1, 2, 3, . . .                 1/π          (1 − π)/π²
Poisson        e^(−λ) λ^x / x!   for x = 0, 1, 2, . . .                 λ            λ
Continuous distributions
Distribution   f(x)                                          F(x)                              E(X)        Var(X)
Uniform        1/(b − a)   for a ≤ x ≤ b                     (x − a)/(b − a)   for a ≤ x ≤ b   (a + b)/2   (b − a)²/12
Exponential    λ e^(−λx)   for x > 0                         1 − e^(−λx)   for x > 0           1/λ         1/λ²
Normal         (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))   for all x                                       µ           σ²
Sample quantities
Sample variance:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$

Sample covariance:
$$\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1} \left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)$$

Sample correlation:
$$\frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) \left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$$
Inference
Variance of sample mean:
$$\frac{\sigma^2}{n}$$

One-sample t statistic:
$$\frac{\bar{X} - \mu}{S/\sqrt{n}}$$

Two-sample t statistic:
$$\sqrt{\frac{n + m - 2}{(n-1)S_X^2 + (m-1)S_Y^2}} \times \frac{\bar{X} - \bar{Y} - \delta_0}{\sqrt{1/n + 1/m}}$$