Contents

1 Introduction
  1.1 Route map to the guide
  1.2 Introduction to the subject area
  1.3 Syllabus
  1.4 Aims of the course
  1.5 Learning outcomes for the course
  1.6 Overview of learning resources
    1.6.1 The subject guide
    1.6.2 Essential reading
    1.6.3 Further reading
    1.6.4 Online study resources (the Online Library and the VLE)
  1.7 Examination advice

2 Probability theory
  2.1 Synopsis of chapter
  2.2 Learning outcomes
  2.3 Introduction
  2.4 Set theory: the basics
  2.5 Axiomatic definition of probability
    2.5.1 Basic properties of probability
  2.6 Classical probability and counting rules
    2.6.1 Combinatorial counting methods
  2.7 Conditional probability and Bayes' theorem
    2.7.1 Independence of multiple events
    2.7.2 Independent versus mutually exclusive events
    2.7.3 Conditional probability of independent events
    2.7.4 Chain rule of conditional probabilities
    2.7.5 Total probability formula
    2.7.6 Bayes' theorem
  2.8 Overview of chapter
  2.9 Key terms and concepts
  2.10 Sample examination questions

3 Random variables
  3.1 Synopsis of chapter
  3.2 Learning outcomes
  3.3 Introduction
  3.4 Discrete random variables
    3.4.1 Probability distribution of a discrete random variable
    3.4.2 The cumulative distribution function (cdf)
    3.4.3 Properties of the cdf for discrete distributions
    3.4.4 General properties of the cdf
    3.4.5 Properties of a discrete random variable
    3.4.6 Expected value versus sample mean
  3.5 Continuous random variables
    3.5.1 Median of a random variable
  3.6 Overview of chapter
  3.7 Key terms and concepts
  3.8 Sample examination questions

4 Common distributions of random variables
  4.1 Synopsis of chapter content
  4.2 Learning outcomes
  4.3 Introduction
  4.4 Common discrete distributions
    4.4.1 Discrete uniform distribution
    4.4.2 Bernoulli distribution
    4.4.3 Binomial distribution
    4.4.4 Poisson distribution
    4.4.5 Connections between probability distributions
    4.4.6 Poisson approximation of the binomial distribution
    4.4.7 Some other discrete distributions
  4.5 Common continuous distributions
    4.5.1 The (continuous) uniform distribution
    4.5.2 Exponential distribution
    4.5.3 Normal (Gaussian) distribution
    4.5.4 Normal approximation of the binomial distribution
  4.6 Overview of chapter
  4.7 Key terms and concepts
  4.8 Sample examination questions

5 Multivariate random variables
  5.1 Synopsis of chapter
  5.2 Learning outcomes
  5.3 Introduction
  5.4 Joint probability functions
  5.5 Marginal distributions
  5.6 Conditional distributions
    5.6.1 Properties of conditional distributions
    5.6.2 Conditional mean and variance
  5.7 Covariance and correlation
    5.7.1 Covariance
    5.7.2 Correlation
    5.7.3 Sample covariance and correlation
  5.8 Independent random variables
    5.8.1 Joint distribution of independent random variables
  5.9 Sums and products of random variables
    5.9.1 Distributions of sums and products
    5.9.2 Expected values and variances of sums of random variables
    5.9.3 Expected values of products of independent random variables
    5.9.4 Distributions of sums of random variables
  5.10 Overview of chapter
  5.11 Key terms and concepts
  5.12 Sample examination questions

6 Sampling distributions of statistics
  6.1 Synopsis of chapter
  6.2 Learning outcomes
  6.3 Introduction
  6.4 Random samples
    6.4.1 Joint distribution of a random sample
  6.5 Statistics and their sampling distributions
    6.5.1 Sampling distribution of a statistic
  6.6 Sample mean from a normal population
  6.7 The central limit theorem
  6.8 Some common sampling distributions
    6.8.1 The χ² distribution
    6.8.2 (Student's) t distribution
    6.8.3 The F distribution
  6.9 Prelude to statistical inference
    6.9.1 Population versus random sample
    6.9.2 Parameter versus statistic
    6.9.3 Difference between 'Probability' and 'Statistics'
  6.10 Overview of chapter
  6.11 Key terms and concepts
  6.12 Sample examination questions

7 Point estimation
  7.1 Synopsis of chapter
  7.2 Learning outcomes
  7.3 Introduction
  7.4 Estimation criteria: bias, variance and mean squared error
  7.5 Method of moments (MM) estimation
  7.6 Least squares (LS) estimation
  7.7 Maximum likelihood (ML) estimation
  7.8 Overview of chapter
  7.9 Key terms and concepts
  7.10 Sample examination questions

8 Interval estimation
  8.1 Synopsis of chapter
  8.2 Learning outcomes
  8.3 Introduction
  8.4 Interval estimation for means of normal distributions
    8.4.1 An important property of normal samples
    8.4.2 Means of non-normal distributions
  8.5 Use of the chi-squared distribution
  8.6 Interval estimation for variances of normal distributions
  8.7 Overview of chapter
  8.8 Key terms and concepts
  8.9 Sample examination questions

9 Hypothesis testing
  9.1 Synopsis of chapter
  9.2 Learning outcomes
  9.3 Introduction
  9.4 Introductory examples
  9.5 Setting p-value, significance level, test statistic
    9.5.1 General setting of hypothesis tests
    9.5.2 Statistical testing procedure
    9.5.3 Two-sided tests for normal means
    9.5.4 One-sided tests for normal means
  9.6 t tests
  9.7 General approach to statistical tests
  9.8 Two types of error
  9.9 Tests for variances of normal distributions
  9.10 Summary: tests for µ and σ² in N(µ, σ²)
  9.11 Comparing two normal means with paired observations
    9.11.1 Power functions of the test
  9.12 Comparing two normal means
    9.12.1 Tests on µX − µY with known σ²X and σ²Y
    9.12.2 Tests on µX − µY with σ²X = σ²Y but unknown
  9.13 Tests for correlation coefficients
    9.13.1 Tests for correlation coefficients
  9.14 Tests for the ratio of two normal variances
  9.15 Summary: tests for two normal distributions
  9.16 Overview of chapter
  9.17 Key terms and concepts
  9.18 Sample examination questions

10 Analysis of variance (ANOVA)
  10.1 Synopsis of chapter
  10.2 Learning outcomes
  10.3 Introduction
  10.4 Testing for equality of three population means
  10.5 One-way analysis of variance
  10.6 From one-way to two-way ANOVA
  10.7 Two-way analysis of variance
  10.8 Residuals
  10.9 Overview of chapter
  10.10 Key terms and concepts
  10.11 Sample examination questions

A Linear regression (non-examinable)
  A.1 Synopsis of chapter
  A.2 Learning outcomes
  A.3 Introduction
  A.4 Introductory examples
  A.5 Simple linear regression
  A.6 Inference for parameters in normal regression models
  A.7 Regression ANOVA
  A.8 Confidence intervals for E(y)
  A.9 Prediction intervals for y
  A.10 Multiple linear regression models
  A.11 Regression using R
  A.12 Overview of chapter
  A.13 Key terms and concepts

B Non-examinable proofs
  B.1 Chapter 2 – Probability theory
  B.2 Chapter 3 – Random variables
  B.3 Chapter 5 – Multivariate random variables

C Solutions to Sample examination questions
  C.1 Chapter 2 – Probability theory
  C.2 Chapter 3 – Random variables
  C.3 Chapter 4 – Common distributions of random variables
  C.4 Chapter 5 – Multivariate random variables
  C.5 Chapter 6 – Sampling distributions of statistics
  C.6 Chapter 7 – Point estimation
  C.7 Chapter 8 – Interval estimation
  C.8 Chapter 9 – Hypothesis testing
  C.9 Chapter 10 – Analysis of variance (ANOVA)

D Examination formula sheet

Chapter 1
Introduction

1.1 Route map to the guide

This subject guide provides you with a framework for covering the syllabus of the ST104b Statistics 2 half course and directs you to additional resources such as readings and the virtual learning environment (VLE). The following ten chapters will cover important aspects of elementary statistical theory, upon which many applications in EC2020 Elements of econometrics draw heavily. The chapters are not a series of self-contained topics, rather they build on each other sequentially. As such, you are strongly advised to follow the subject guide in chapter order. There is little point in rushing past material which you have only partially understood in order to reach the final chapter. Once you have completed your work on all of the chapters, you will be ready for examination revision. A good place to start is the sample examination paper which you will find at the end of the subject guide.

ST104b Statistics 2 extends the work of ST104a Statistics 1 and provides a precise and accurate treatment of probability, distribution theory and statistical inference. As such there will be a strong emphasis on mathematical statistics as important discrete and continuous probability distributions are covered and properties of these distributions are investigated.
Point estimation techniques are discussed, including method of moments, least squares and maximum likelihood estimation. Confidence interval construction and statistical hypothesis testing follow. Analysis of variance and a (non-examinable) treatment of linear regression models, featuring the interpretation of computer-generated regression output and implications for prediction, round off the course. Collectively, these topics provide a solid training in statistical analysis. As such, ST104b Statistics 2 is of considerable value to those intending to pursue further study in statistics, econometrics and/or empirical economics. Indeed, the quantitative skills developed in the subject guide are readily applicable to all fields involving real data analysis.

1.2 Introduction to the subject area

Why study statistics?

By successfully completing this half course, you will understand the ideas of randomness and variability, and the way in which they link to probability theory. This will allow the use of a systematic and logical collection of statistical techniques of great practical importance in many applied areas. The examples in this subject guide will concentrate on the social sciences, but the methods are important for the physical sciences too. This subject aims to provide a grounding in probability theory and some of the most common statistical methods.

The material in ST104b Statistics 2 is necessary as preparation for other subjects you may study later on in your degree. The full details of the ideas discussed in this subject guide will not always be required in these other subjects, but you will need to have a solid understanding of the main concepts. This can only be achieved by seeing how the ideas emerge in detail.

How to study statistics

For statistics, you need some familiarity with abstract mathematical ideas, as well as the ability and common sense to apply these to real-life problems. The concepts you will encounter in probability and statistical inference are hard to absorb by just reading about them in a book. You need to read, then think a little, then try some problems, and then read and think some more. This procedure should be repeated until the problems are easy to do; you should not spend a long time reading and forget about solving problems.

1.3 Syllabus

The syllabus of ST104b Statistics 2 is as follows:

Probability: Set theory: the basics; Axiomatic definition of probability; Classical probability and counting rules; Conditional probability and Bayes' theorem.

Random variables: Discrete random variables; Continuous random variables.

Common distributions of random variables: Common discrete distributions; Common continuous distributions.

Multivariate random variables: Joint probability functions; Conditional distributions; Covariance and correlation; Independent random variables; Sums and products of random variables.

Sampling distributions of statistics: Random samples; Statistics and their sampling distributions; Sampling distribution of a statistic; Sample mean from a normal population; The central limit theorem; Some common sampling distributions; Prelude to statistical inference.

Point estimation: Estimation criteria: bias, variance and mean squared error; Method of moments estimation; Least squares estimation; Maximum likelihood estimation.

Interval estimation: Interval estimation for means of normal distributions; Use of the chi-squared distribution; Confidence intervals for normal variances.

Hypothesis testing: Setting p-value, significance level, test statistic; t tests; General approach to statistical tests; Two types of error; Tests for normal variances; Comparing two normal means with paired observations; Comparing two normal means; Tests for correlation coefficients; Tests for the ratio of two normal variances.

Analysis of variance (ANOVA): One-way analysis of variance; Two-way analysis of variance.

Linear regression (non-examinable): Simple linear regression; Inference for parameters in normal regression models; Regression ANOVA; Confidence intervals for E(y); Prediction intervals for y; Multiple linear regression models.

1.4 Aims of the course

The aim of this half course is to develop students' knowledge of elementary statistical theory. The emphasis is on topics that are of importance in applications to econometrics, finance and the social sciences. Concepts and methods that provide the foundation for more specialised courses in statistics are introduced.

1.5 Learning outcomes for the course

At the end of this half course, and having completed the Essential reading and activities, you should be able to:

apply and be competent users of standard statistical operators and be able to recall a variety of well-known distributions and their respective moments

explain the fundamentals of statistical inference and apply these principles to justify the use of an appropriate model and perform hypothesis tests in a number of different settings

demonstrate understanding that statistical techniques are based on assumptions and the plausibility of such assumptions must be investigated when analysing real problems.

1.6 Overview of learning resources

1.6.1 The subject guide

This course builds on the ideas encountered in ST104a Statistics 1. Although this subject guide offers a complete treatment of the course material, students may wish to consider purchasing a textbook. Apart from the textbooks recommended in this subject guide, you may wish to look in bookshops and libraries for alternative textbooks which may help you. A critical part of a good statistics textbook is the collection of problems to solve, and you may want to look at several different textbooks just to see a range of practice questions, especially for tricky topics. The subject guide is there mainly to describe the syllabus and to show the level of understanding expected.

The subject guide is divided into chapters which should be worked through in the order in which they appear. There is little point in rushing past material you only partly understand to get to later chapters, as the presentation is somewhat sequential and not a series of self-contained topics. You should be familiar with the earlier chapters and have a solid understanding of them before moving on to the later ones. The following procedure is recommended:

1. Read the introductory comments.
2. Consult the appropriate section of your textbook.
3. Study the chapter content, examples and learning activities.
4. Go through the learning outcomes carefully.
5. Attempt some of the problems from your textbook.
6. Refer back to this subject guide, or to the textbook, or to supplementary texts, to improve your understanding until you are able to work through the problems confidently.

The last two steps are the most important. It is easy to think that you have understood the material after reading it, but working through problems is the crucial test of understanding. Problem-solving should take up most of your study time.
Each chapter of the subject guide has suggestions for reading from the main textbook. Usually, you will only need to read the material in the main textbook (see ‘Essential reading’ below), but it may be helpful from time to time to look at others. Basic notation We often use the symbol to denote the end of a proof, where we have finished explaining why a particular result is true. This is just to make it clear where the proof ends and the following text begins. Time management About one-third of your self-study time should be spent reading and the rest should be spent solving problems. An internal student would expect maybe 15 hours of formal teaching and another 50 hours of private study to be enough to cover the subject. Of the 50 hours of private study, about 17 hours should be spent on the initial study of the textbook and subject guide. The remaining 33 hours should be spent on attempting problems, which may well require more reading. Calculators A calculator may be used when answering questions on the examination paper for ST104b Statistics 2. It must comply in all respects with the specification given in the 4 1.6. Overview of learning resources Regulations. You should also refer to the admission notice you will receive when entering the examination and the ‘Notice on permitted materials’. Make sure you accustom yourself to using your chosen calculator and feel comfortable with it. Specifically, calculators must: have no external wires must be: hand held compact and portable quiet in operation non-programmable and must: not be capable of receiving, storing or displaying user-supplied non-numerical data. The Regulations state: ‘The use of a calculator that communicates or displays textual messages, graphical or algebraic information is strictly forbidden. Where a calculator is permitted in the examination, it must be a non-scientific calculator. Where calculators are permitted, only calculators limited to performing just basic arithmetic operations may be used. This is to encourage candidates to show the examiners the steps taken in arriving at the answer.’ Computers If you are aiming to carry out serious statistical analysis (which is beyond the level of this course) you will probably want to use some statistical software package such as R. It is not necessary for this course to have such software available, but if you do have access to it you may benefit from using it in your study of the material. 1.6.2 Essential reading This subject guide is ‘self-contained’ meaning that this is the only resource which is essential reading for ST104b Statistics 2. Throughout the subject guide there are many examples, activities and sample examination questions replicating resources typically provided in statistical textbooks. You may, however, feel you could benefit from reading textbooks, and a suggested list of these is provided below. Statistical tables In the examination you will be provided with relevant extracts of: Lindley, D.V. and W.F. Scott, New Cambridge Statistical Tables.(Cambridge: Cambridge University Press, 1995) second edition [ISBN 978-0521484855]. 5 1. Introduction As relevant extracts of these statistical tables are the same as those distributed for use in the examination, it is advisable that you become familiar with them, rather than those at the end of a textbook. 1.6.3 Further reading As mentioned above, this subject guide is sufficient for study of ST104b Statistics 2. 
Of course, you are free to read around the subject area in any text, paper or online resource to support your learning and by thinking about how these principles apply in the real world. To help you read extensively, you have free access to the virtual learning environment (VLE) and University of London Online Library (see below). Other useful texts for this course include: Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060]. Johnson, R.A. and G.K. Bhattacharyya, Statistics: Principles and Methods. (New York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779]. Larsen, R.J. and M.L. Marx, Introduction to Mathematical Statistics and Its Applications (Pearson, 2013) fifth edition [ISBN 9781292023557]. While Newbold et al. is the main recommended textbook for this course, there are many which are just as good. You are encouraged to look at those listed above and at any others you may find. It may be necessary to look at several textbooks for a single topic, as you may find that the approach of one textbook suits you better than that of another. 1.6.4 Online study resources (the Online Library and the VLE) In addition to the subject guide and the Essential reading, it is crucial that you take advantage of the study resources that are available online for this course, including the virtual learning environment (VLE) and the Online Library. You can access the VLE, the Online Library and your University of London email account via the Student Portal at: http://my.londoninternational.ac.uk You should have received your login details for the Student Portal with your official offer, which was emailed to the address that you gave on your application form. You have probably already logged in to the Student Portal in order to register! As soon as you registered, you will automatically have been granted access to the VLE, Online Library and your fully functional University of London email account. If you forget your login details, please click on the ‘Forgotten your password’ link on the login page. The VLE The VLE, which complements this subject guide, has been designed to enhance your learning experience, providing additional support and a sense of community. It forms an 6 1.6. Overview of learning resources important part of your study experience with the University of London and you should access it regularly. The VLE provides a range of resources for EMFSS courses: Self-testing activities: Doing these allows you to test your own understanding of the subject material. Electronic study materials: The printed materials that you receive from the University of London are available to download, including updated reading lists and references. Past examination papers and Examiners’ commentaries: These provide advice on how each examination question might best be answered. A student discussion forum: This is an open space for you to discuss interests and experiences, seek support from your peers, work collaboratively to solve problems and discuss subject material. Videos: There are recorded academic introductions to the subject, interviews and debates and, for some courses, audio-visual tutorials and conclusions. Recorded lectures: For some courses, where appropriate, the sessions from previous years’ Study Weekends have been recorded and made available. Study skills: Expert advice on preparing for examinations and developing your digital literacy skills. Feedback forms. 
Some of these resources are available for certain courses only, but we are expanding our provision all the time and you should check the VLE regularly for updates. Making use of the Online Library The Online Library contains a huge array of journal articles and other resources to help you read widely and extensively. To access the majority of resources via the Online Library you will either need to use your University of London Student Portal login details, or you will be required to register and use an Athens login: http://tinyurl.com/ollathens The easiest way to locate relevant content and journal articles in the Online Library is to use the Summon search engine. If you are having trouble finding an article listed in a reading list, try removing any punctuation from the title, such as single quotation marks, question marks and colons. For further advice, please see the online help pages: www.external.shl.lon.ac.uk/summon/about.php 7 1. Introduction Additional material There is a lot of computer-based teaching material available freely over the web. A fairly comprehensive list can be found in the ‘Books & Manuals’ section of http://statpages.org Unless otherwise stated, all websites in this subject guide were accessed in August 2019. We cannot guarantee, however, that they will stay current and you may need to perform an internet search to find the relevant pages. 1.7 Examination advice Important: the information and advice given here are based on the examination structure used at the time this subject guide was written. Please note that subject guides may be used for several years. Because of this we strongly advise you to always check both the current Regulations for relevant information about the examination, and the VLE where you should be advised of any forthcoming changes. You should also carefully check the rubric/instructions on the paper you actually sit and follow those instructions. Remember, it is important to check the VLE for: up-to-date information on examination and assessment arrangements for this course where available, past examination papers and Examiners’ commentaries for the course which give advice on how each question might best be answered. The examination is by a two-hour unseen question paper. No books may be taken into the examination, but the use of calculators is permitted, and statistical tables and a formula sheet are provided (the formula sheet can be found in past examination papers available on the VLE). The examination paper has a variety of questions, some quite short and others longer. All questions must be answered correctly for full marks. You may use your calculator whenever you feel it is appropriate, always remembering that the examiners can give marks only for what appears on the examination script. Therefore, it is important to always show your working. In terms of the examination, as always, it is important to manage your time carefully and not to dwell on one question for too long – move on and focus on solving the easier questions, coming back to harder ones later. 8 Chapter 2 Probability theory 2.1 Synopsis of chapter Probability theory is very important for statistics because it provides the rules which allow us to reason about uncertainty and randomness, which is the basis of statistics. Independence and conditional probability are profound ideas, but they must be fully understood in order to think clearly about any statistical investigation. 
2.2 Learning outcomes

After completing this chapter, you should be able to:

explain the fundamental ideas of random experiments, sample spaces and events

list the axioms of probability and be able to derive all the common probability rules from them

list the formulae for the number of combinations and permutations of k objects out of n, and be able to routinely use such results in problems

explain conditional probability and the concept of independent events

prove the law of total probability and apply it to problems where there is a partition of the sample space

prove Bayes' theorem and apply it to find conditional probabilities.

2.3 Introduction

Consider the following hypothetical example. A country will soon hold a referendum about whether it should leave the European Union (EU). An opinion poll of a random sample of people in the country is carried out. 950 respondents say that they plan to vote in the referendum. They answer the question 'Will you vote 'Yes' or 'No' to leaving the EU?' as follows:

Answer   Count      %
Yes        513    54%
No         437    46%
Total      950   100%

However, we are not interested in just this sample of 950 respondents, but in the population which they represent, that is, all likely voters. Statistical inference will allow us to say things like the following about the population.

'A 95% confidence interval for the population proportion, π, of 'Yes' voters is (0.5083, 0.5717).'

'The null hypothesis that π = 0.5, against the alternative hypothesis that π > 0.5, is rejected at the 5% significance level.'

In short, the opinion poll gives statistically significant evidence that 'Yes' voters are in the majority among likely voters. Such methods of statistical inference will be discussed later in the course.

The inferential statements about the opinion poll rely on the following assumptions and results.

Each response Xi is a realisation of a random variable from a Bernoulli distribution with probability parameter π.

The responses X1, X2, . . . , Xn are independent of each other.

The sampling distribution of the sample mean (proportion) X̄ has expected value π and variance π(1 − π)/n.

By use of the central limit theorem, the sampling distribution is approximately a normal distribution.

In the next few chapters, we will learn about the terms in bold, among others.

The need for probability in statistics

In statistical inference, the data we have observed are regarded as a sample from a broader population, selected with a random process.

Values in a sample are variable. If we collected a different sample we would not observe exactly the same values again.

Values in a sample are also random. We cannot predict the precise values which will be observed before we actually collect the sample.

Probability theory is the branch of mathematics which deals with randomness. So we need to study this first.

A preview of probability

The first basic concepts in probability will be the following.

Experiment: for example, rolling a single die and recording the outcome.

Outcome of the experiment: for example, rolling a 3.

Sample space S: the set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.

Event: any subset A of the sample space, for example A = {4, 5, 6} (strictly speaking, not all subsets are events).

Probability of an event A, P(A), will be defined as a function which assigns probabilities (real numbers) to events (sets). This uses the language and concepts of set theory. So we need to study the basics of set theory first.
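As an aside, the 95% confidence interval quoted for the opinion poll above can be reproduced with a few lines of code. This is only a rough sketch, assuming R is available (the guide treats statistical software such as R as optional), and it uses the normal-approximation interval, sample proportion ± 1.96 × standard error, which is justified later in the course:

n <- 950                              # respondents planning to vote
x <- 513                              # 'Yes' responses
p_hat <- x / n                        # sample proportion, 0.54
se <- sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of the proportion
c(lower = p_hat - 1.96 * se,
  upper = p_hat + 1.96 * se)          # approximately (0.5083, 0.5717), as quoted above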
2.4 Set theory: the basics

A set is a collection of elements (also known as 'members' of the set).

Example 2.1 The following are all examples of sets:

A = {Amy, Bob, Sam}.

B = {1, 2, 3, 4, 5}.

C = {x | x is a prime number} = {2, 3, 5, 7, 11, . . .}.

D = {x | x ≥ 0} (that is, the set of all non-negative real numbers).

Activity 2.1 Why is S = {1, 1, 2} not a sensible way to try to define a sample space?

Solution
Because there is no need to list the elementary outcome '1' twice. It is much clearer to write S = {1, 2}.

Activity 2.2 Write out all the events for the sample space S = {a, b, c}. (There are eight of them.)

Solution
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample space S) and ∅.

Membership of sets and the empty set

x ∈ A means that object x is an element of set A.

x ∉ A means that object x is not an element of set A.

The empty set, denoted ∅, is the set with no elements, i.e. x ∉ ∅ is true for every object x, and x ∈ ∅ is not true for any object x.

Example 2.2 If A = {1, 2, 3, 4}, then:

1 ∈ A and 2 ∈ A.

6 ∉ A and 1.5 ∉ A.

The familiar Venn diagrams help to visualise statements about sets. However, Venn diagrams are not formal proofs of results in set theory.

Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total shaded area is A ∪ B, and the white area is (A ∪ B)^c = A^c ∩ B^c.

Figure 2.1: Venn diagram depicting A ∪ B (the total shaded area).

Subsets and equality of sets

A ⊂ B means that set A is a subset of set B, defined as:

A ⊂ B when x ∈ A ⇒ x ∈ B.

Hence A is a subset of B if every element of A is also an element of B. An example is shown in Figure 2.2.

Figure 2.2: Venn diagram depicting a subset, where A ⊂ B.

Example 2.4 An example of the distinction between subsets and non-subsets is:

{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set

{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.

Two sets A and B are equal (A = B) if they have exactly the same elements. This implies that A ⊂ B and B ⊂ A.

Unions of sets ('or')

The union, denoted ∪, of two sets is:

A ∪ B = {x | x ∈ A or x ∈ B}.

That is, the set of those elements which belong to A or B (or both). An example is shown in Figure 2.3.

Figure 2.3: Venn diagram depicting the union of two sets.

Example 2.5 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∪ B = {1, 2, 3, 4}

A ∪ C = {1, 2, 3, 4, 5, 6}

B ∪ C = {2, 3, 4, 5, 6}.

Intersections of sets ('and')

The intersection, denoted ∩, of two sets is:

A ∩ B = {x | x ∈ A and x ∈ B}.

That is, the set of those elements which belong to both A and B. An example is shown in Figure 2.4.

Example 2.6 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∩ B = {2, 3}

A ∩ C = {4}

B ∩ C = ∅.
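The finite-set results in Examples 2.5 and 2.6 are easy to verify by hand, but they can also be checked mechanically. The following is a small illustrative sketch, assuming R (mentioned in this guide only as optional software); the base R functions union() and intersect() mirror the ∪ and ∩ operators for finite sets:

A <- c(1, 2, 3, 4)
B <- c(2, 3)
C <- c(4, 5, 6)
union(A, B)      # 1 2 3 4          (A ∪ B)
union(B, C)      # 2 3 4 5 6        (B ∪ C)
intersect(A, C)  # 4                (A ∩ C)
intersect(B, C)  # empty result, corresponding to B ∩ C = ∅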
Figure 2.4: Venn diagram depicting the intersection of two sets.

Unions and intersections of many sets

Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C. Concise notation for the unions and intersections of sets A1, A2, . . . , An is:

⋃_{i=1}^{n} Ai = A1 ∪ A2 ∪ · · · ∪ An

and:

⋂_{i=1}^{n} Ai = A1 ∩ A2 ∩ · · · ∩ An.

These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.

Complement ('not')

Suppose S is the set of all possible elements which are under consideration. In probability, S will be referred to as the sample space. It follows that A ⊂ S for every set A we may consider.

The complement of A with respect to S is:

A^c = {x | x ∈ S and x ∉ A}.

That is, the set of those elements of S that are not in A. An example is shown in Figure 2.5.

Figure 2.5: Venn diagram depicting the complement of a set.

We now consider some useful properties of set operators. In proofs and derivations about sets, you can use the following results without proof.

Properties of set operators

Commutativity: A ∩ B = B ∩ A and A ∪ B = B ∪ A.

Associativity: A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.

Distributive laws: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

De Morgan's laws: (A ∩ B)^c = A^c ∪ B^c and (A ∪ B)^c = A^c ∩ B^c.

Further properties of set operators

If S is the sample space and A and B are any sets in S, you can also use the following results without proof:

∅^c = S.

∅ ⊂ A, A ⊂ A and A ⊂ S.

A ∩ A = A and A ∪ A = A.

A ∩ A^c = ∅ and A ∪ A^c = S.

If B ⊂ A, A ∩ B = B and A ∪ B = A.

A ∩ ∅ = ∅ and A ∪ ∅ = A.

A ∩ S = A and A ∪ S = S.

∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.

Mutually exclusive events

Two sets A and B are disjoint or mutually exclusive if:

A ∩ B = ∅.

Sets A1, A2, . . . , An are pairwise disjoint if all pairs of sets from them are disjoint, i.e. Ai ∩ Aj = ∅ for all i ≠ j.

Partition

The sets A1, A2, . . . , An form a partition of the set A if they are pairwise disjoint and if ⋃_{i=1}^{n} Ai = A, that is, A1, A2, . . . , An are collectively exhaustive of A.

Therefore, a partition divides the entire set A into non-overlapping pieces Ai, as shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A1, A2, . . . form a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} Ai = A.

Figure 2.6: The partition of the set A into A1, A2 and A3.

Example 2.7 Suppose that A ⊂ B. Show that A and B ∩ A^c form a partition of B.

We have:

A ∩ (B ∩ A^c) = (A ∩ A^c) ∩ B = ∅ ∩ B = ∅

and:

A ∪ (B ∩ A^c) = (A ∪ B) ∩ (A ∪ A^c) = B ∩ S = B.

Hence A and B ∩ A^c are mutually exclusive and collectively exhaustive of B, and so they form a partition of B.

Activity 2.3 For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅ and A ∪ ∅.

Solution
We have: A ∩ S = A, A ∪ S = S, A ∩ ∅ = ∅ and A ∪ ∅ = A.

Activity 2.4 Use the rules of set operators to prove that the following represents a partition of set A:

A = (A ∩ B) ∪ (A ∩ B^c).   (*)

In other words, prove that (*) is true, and also that (A ∩ B) ∩ (A ∩ B^c) = ∅.

Solution
We have:

(A ∩ B) ∩ (A ∩ B^c) = (A ∩ A) ∩ (B ∩ B^c) = A ∩ ∅ = ∅.

This uses the results of commutativity, associativity, A ∩ A = A, A ∩ A^c = ∅ and A ∩ ∅ = ∅. Similarly:

(A ∩ B) ∪ (A ∩ B^c) = A ∩ (B ∪ B^c) = A ∩ S = A

using the results of the distributive laws, A ∪ A^c = S and A ∩ S = A.

Activity 2.5 Find A1 ∪ A2 and A1 ∩ A2 of the two sets A1 and A2, where:

(a) A1 = {0, 1, 2} and A2 = {2, 3, 4}

(b) A1 = {x | 0 < x < 2} and A2 = {x | 1 ≤ x < 3}

(c) A1 = {x | 0 ≤ x < 1} and A2 = {x | 2 < x ≤ 3}.

Solution

(a) We have: A1 ∪ A2 = {0, 1, 2, 3, 4} and A1 ∩ A2 = {2}.

(b) We have: A1 ∪ A2 = {x | 0 < x < 3} and A1 ∩ A2 = {x | 1 ≤ x < 2}.

(c) We have: A1 ∪ A2 = {x | 0 ≤ x < 1 or 2 < x ≤ 3} and A1 ∩ A2 = ∅.

Activity 2.6 Let A, B and C be events in a sample space, S.
Using only the symbols ∪, ∩, ( ) and ^c (complement), find expressions for the following events:

(a) only A occurs

(b) none of the three events occurs

(c) exactly one of the three events occurs

(d) at least two of the three events occur

(e) exactly two of the three events occur.

Solution
There is more than one way to answer this question, because the sets can be expressed in different, but logically equivalent, forms. One way to do so is the following.

(a) A ∩ B^c ∩ C^c, i.e. A and not B and not C.

(b) A^c ∩ B^c ∩ C^c, i.e. not A and not B and not C.

(c) (A ∩ B^c ∩ C^c) ∪ (A^c ∩ B ∩ C^c) ∪ (A^c ∩ B^c ∩ C), i.e. only A or only B or only C.

(d) (A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C), i.e. A and B, or A and C, or B and C. Note that this includes A ∩ B ∩ C as a subset, so we do not need to write (A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C) ∪ (A ∩ B ∩ C) separately.

(e) ((A ∩ C) ∪ (A ∩ B) ∪ (B ∩ C)) ∩ (A ∩ B ∩ C)^c, i.e. A and B, or A and C, or B and C, but not A and B and C.

Activity 2.7 Let A and B be events in a sample space S. Use Venn diagrams to convince yourself that the two De Morgan's laws:

(A ∩ B)^c = A^c ∪ B^c   (1)

and:

(A ∪ B)^c = A^c ∩ B^c   (2)

are correct. For each of them, draw two Venn diagrams – one for the expression on the left-hand side of the equation, and one for the right-hand side. Shade the areas corresponding to each expression, and hence show that for both (1) and (2) the left-hand and right-hand sides describe the same set.

Solution
For (A ∩ B)^c = A^c ∪ B^c we have: [Venn diagrams omitted]

For (A ∪ B)^c = A^c ∩ B^c we have: [Venn diagrams omitted]

2.5 Axiomatic definition of probability

First, we consider four basic concepts in probability.

An experiment is a process which produces outcomes and which can have several different outcomes.

The sample space S is the set of all possible outcomes of the experiment.

An event is any subset A of the sample space such that A ⊂ S.

Example 2.8 If the experiment is 'select a trading day at random and record the % change in the FTSE 100 index from the previous trading day', then the outcome is the % change in the FTSE 100 index.

S = [−100, +∞) for the % change in the FTSE 100 index (in principle).

An event of interest might be A = {x | x > 0} – the event that the daily change is positive, i.e. the FTSE 100 index gains value from the previous trading day.

The sample space and events are represented as sets. For two events A and B, set operations are then interpreted as follows:

A ∩ B: both A and B happen.

A ∪ B: either A or B happens (or both happen).

A^c: A does not happen, i.e. something other than A happens.

Once we introduce probabilities of events, we can also say that:

the sample space, S, is a certain event

the empty set, ∅, is an impossible event.

Axioms of probability

'Probability' is formally defined as a function P(·) from subsets (events) of the sample space S onto real numbers. (The precise definition also requires a careful statement of which subsets of S are allowed as events, which we can skip on this course.) Such a function is a probability function if it satisfies the following axioms ('self-evident truths').

Axiom 1: P(A) ≥ 0 for all events A.

Axiom 2: P(S) = 1.

Axiom 3: If events A1, A2, . . . are pairwise disjoint (i.e. Ai ∩ Aj = ∅ for all i ≠ j), then:

P( ⋃_{i=1}^{∞} Ai ) = Σ_{i=1}^{∞} P(Ai).

The axioms require that a probability function must always satisfy these requirements. Axiom 1 requires that probabilities are always non-negative. Axiom 2 requires that the outcome is some element from the sample space with certainty (that is, with probability 1). In other words, the experiment must have some outcome. Axiom 3 states that if events A1, A2, . . .
are mutually exclusive, the probability of their union is simply the sum of their individual probabilities.

All other properties of the probability function can be derived from the axioms. We begin by showing that a result like Axiom 3 also holds for finite collections of mutually exclusive sets.

2.5.1 Basic properties of probability

Probability property

For the empty set, ∅, we have:

P(∅) = 0.   (2.1)

Probability property (finite additivity)

If A1, A2, . . . , An are pairwise disjoint, then:

P( ⋃_{i=1}^{n} Ai ) = Σ_{i=1}^{n} P(Ai).

In pictures, the previous result means that in a situation like the one shown in Figure 2.7, the probability of the combined event A = A1 ∪ A2 ∪ A3 is simply the sum of the probabilities of the individual events:

P(A) = P(A1) + P(A2) + P(A3).

That is, we can simply sum probabilities of mutually exclusive sets. This is very useful for deriving further results.

Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1, A2 and A3. Note although A2 and A3 have touching boundaries, there is no actual intersection and hence they are (pairwise) mutually exclusive.

Probability property

For any event A, we have:

P(A^c) = 1 − P(A).

Proof: We have that A ∪ A^c = S and A ∩ A^c = ∅. Therefore:

1 = P(S) = P(A ∪ A^c) = P(A) + P(A^c)

using the previous result, with n = 2, A1 = A and A2 = A^c.

Probability property

For any event A, we have:

P(A) ≤ 1.

Proof (by contradiction): If it was true that P(A) > 1 for some A, then we would have:

P(A^c) = 1 − P(A) < 0.

This violates Axiom 1, so cannot be true. Therefore, it must be that P(A) ≤ 1 for all A. Putting this and Axiom 1 together, we get:

0 ≤ P(A) ≤ 1

for all events A.

Probability property

For any two events A and B, if A ⊂ B, then P(A) ≤ P(B).

Proof: We proved in Example 2.7 that we can partition B as B = A ∪ (B ∩ A^c) where the two sets in the union are disjoint. Therefore:

P(B) = P(A ∪ (B ∩ A^c)) = P(A) + P(B ∩ A^c) ≥ P(A)

since P(B ∩ A^c) ≥ 0.

Probability property

For any two events A and B, then:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof: Using partitions:

P(A ∪ B) = P(A ∩ B^c) + P(A ∩ B) + P(A^c ∩ B)

P(A) = P(A ∩ B^c) + P(A ∩ B)

P(B) = P(A^c ∩ B) + P(A ∩ B)

and hence:

P(A ∪ B) = [P(A) − P(A ∩ B)] + P(A ∩ B) + [P(B) − P(A ∩ B)] = P(A) + P(B) − P(A ∩ B).

In summary, the probability function has the following properties.

P(S) = 1 and P(∅) = 0.

0 ≤ P(A) ≤ 1 for all events A.

If A ⊂ B, then P(A) ≤ P(B).

These show that the probability function has the kinds of values we expect of something called a 'probability'.

P(A^c) = 1 − P(A).

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

These are useful for deriving probabilities of new events.

Example 2.9 Suppose that, on an average weekday, of all adults in a country:

86% spend at least 1 hour watching television (event A, with P(A) = 0.86)

19% spend at least 1 hour reading newspapers (event B, with P(B) = 0.19)

15% spend at least 1 hour watching television and at least 1 hour reading newspapers (P(A ∩ B) = 0.15).

We select a member of the population for an interview at random.
For example, we then have:

P(A^c) = 1 − P(A) = 1 − 0.86 = 0.14, which is the probability that the respondent watches less than 1 hour of television

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.86 + 0.19 − 0.15 = 0.90, which is the probability that the respondent spends at least 1 hour watching television or reading newspapers (or both).

Activity 2.8

(a) A, B and C are any three events in the sample space S. Prove that:

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C).

(b) A and B are events in a sample space S. Show that:

P(A ∩ B) ≤ (P(A) + P(B))/2 ≤ P(A ∪ B).

Solution

(a) We know P(E ∪ F) = P(E) + P(F) − P(E ∩ F). Consider A ∪ B ∪ C as (A ∪ B) ∪ C (i.e. as the union of the two sets A ∪ B and C) and then apply the result above to obtain:

P(A ∪ B ∪ C) = P((A ∪ B) ∪ C) = P(A ∪ B) + P(C) − P((A ∪ B) ∩ C).

Now (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) – a Venn diagram can be drawn to check this. So:

P(A ∪ B ∪ C) = P(A ∪ B) + P(C) − (P(A ∩ C) + P(B ∩ C) − P((A ∩ C) ∩ (B ∩ C)))

using the earlier result again for A ∩ C and B ∩ C. Now (A ∩ C) ∩ (B ∩ C) = A ∩ B ∩ C and if we apply the earlier result once more for A and B, we obtain:

P(A ∪ B ∪ C) = P(A) + P(B) − P(A ∩ B) + P(C) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)

which is the required result.

(b) Use the result that if X ⊂ Y then P(X) ≤ P(Y) for events X and Y.

Since A ⊂ A ∪ B and B ⊂ A ∪ B, we have P(A) ≤ P(A ∪ B) and P(B) ≤ P(A ∪ B). Adding these inequalities, P(A) + P(B) ≤ 2 × P(A ∪ B) so:

(P(A) + P(B))/2 ≤ P(A ∪ B).

Similarly, A ∩ B ⊂ A and A ∩ B ⊂ B, so P(A ∩ B) ≤ P(A) and P(A ∩ B) ≤ P(B). Adding, 2 × P(A ∩ B) ≤ P(A) + P(B) so:

P(A ∩ B) ≤ (P(A) + P(B))/2.

What does 'probability' mean?

Probability theory tells us how to work with the probability function and derive 'probabilities of events' from it. However, it does not tell us what 'probability' really means. There are several alternative interpretations of the real-world meaning of 'probability' in this sense. One of them is outlined below. The mathematical theory of probability and calculations on probabilities are the same whichever interpretation we assign to 'probability'. So, in this course, we do not need to discuss the matter further.

Frequency interpretation of probability

This states that the probability of an outcome A of an experiment is the proportion (relative frequency) of trials in which A would be the outcome if the experiment was repeated a very large number of times under similar conditions.

Example 2.10 How should we interpret the following, as statements about the real world of coins and babies?

'The probability that a tossed coin comes up heads is 0.5.' If we tossed a coin a large number of times, and the proportion of heads out of those tosses was 0.5, the 'probability of heads' could be said to be 0.5, for that coin.

'The probability is 0.51 that a child born in the UK today is a boy.' If the proportion of boys among a large number of live births was 0.51, the 'probability of a boy' could be said to be 0.51.

How to find probabilities?

A key question is how to determine appropriate numerical values of P(A) for the probabilities of particular events.

This is usually done empirically, by observing actual realisations of the experiment and using them to estimate probabilities. In the simplest cases, this basically applies the frequency definition to observed data.

Example 2.11 Consider the following.
If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems that, approximately, P(heads) = 0.5, for that coin.

Of the 7,098,667 live births in England and Wales in the period 1999–2009, 51.26% were boys. So we could assign the value of about 0.51 to the probability of a boy in this population.

The estimation of probabilities of events from observed data is an important part of statistics.

2.6 Classical probability and counting rules

Classical probability is a simple special case where values of probabilities can be found by just counting outcomes. This requires that:

the sample space contains only a finite number of outcomes

all of the outcomes are equally likely.

Standard illustrations of classical probability are devices used in games of chance, such as:

tossing a coin (heads or tails) one or more times

rolling one or more dice (each scored 1, 2, 3, 4, 5 or 6)

drawing one or more playing cards from a deck of 52 cards.

We will use these often, not because they are particularly important but because they provide simple examples for illustrating various results in probability.

Suppose that the sample space, S, contains m equally likely outcomes, and that event A consists of k ≤ m of these outcomes. Therefore:

P(A) = k/m = (number of outcomes in A) / (total number of outcomes in the sample space, S).

That is, the probability of A is the proportion of outcomes which belong to A out of all possible outcomes. In the classical case, the probability of any event can be determined by counting the number of outcomes which belong to the event, and the total number of possible outcomes.

Example 2.12 Rolling two dice, what is the probability that the sum of the two scores is 5?

The sample space is the 36 ordered pairs:

S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
     (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
     (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
     (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
     (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}. The probability is P(A) = 4/36 = 1/9.

Now that we have a way of obtaining probabilities for events in the classical case, we can use it together with the rules of probability.

The formula P(A) = 1 − P(A^c) is convenient when we want P(A) but the probability of the complementary event A^c, i.e. P(A^c), is easier to find.

Example 2.13 When rolling two fair dice, what is the probability that the sum of the dice is greater than 3?

The complement is that the sum is at most 3, i.e. the complementary event is A^c = {(1, 1), (1, 2), (2, 1)}. Therefore, P(A) = 1 − 3/36 = 33/36 = 11/12.

The formula:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

says that the probability that A or B happens (or both happen) is the sum of the probabilities of A and B, minus the probability that both A and B happen.

Example 2.14 When rolling two fair dice, what is the probability that the two scores are equal (event A) or that the total score is greater than 10 (event B)?

P(A) = 6/36, P(B) = 3/36 and P(A ∩ B) = 1/36. So P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = (6 + 3 − 1)/36 = 8/36 = 2/9.

Activity 2.9 Assume that a calculator has a 'random number' key and that when the key is pressed an integer between 0 and 999 inclusive is generated at random, all numbers being generated independently of one another.

(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than 300? (c) If two numbers are generated, what is the probability that the first number exceeds the second number? (d) If two numbers are generated, what is the probability that the first number exceeds the second number, and their sum is exactly 300? (e) If five numbers are generated, what is the probability that at least one number occurs more than once? Solution (a) Simply 300/1000 = 0.3. (b) Simply 0.3 × 0.3 = 0.09. (c) Suppose P (first greater) = x, then by symmetry we have that P (second greater) = x. However, the probability that both are equal is (by counting): 1000 {0, 0}, {1, 1}, . . . , {999, 999} = = 0.001. 1000000 1000000 Hence x + x + 0.001 = 1, so x = 0.4995. (d) The following cases apply {300, 0}, {299, 1}, . . . , {151, 149}, i.e. there are 150 possibilities from (10)6 . So the required probability is: 150 = 0.00015. 1000000 (e) The probability that they are all different is (noting that the first number can be any number): 999 998 997 996 1× × × × . 1000 1000 1000 1000 Subtracting from 1 gives the required probability, i.e. 0.009965. Activity 2.10 A box contains r red balls and b blue balls. One ball is selected at random and its colour is observed. The ball is then returned to the box and k additional balls of the same colour are also put into the box. A second ball is then selected at random, its colour is observed, and it is returned to the box together with k additional balls of the same colour. Each time another ball is selected, the process is repeated. If four balls are selected, what is the probability that the first three balls will be red and the fourth ball will be blue? Hint: Your answer should be a function of r, b and k. Solution Let Ri be the event that a red ball is drawn on the ith draw, and let Bi be the event 27 2. Probability theory that a blue ball is drawn on the ith draw, for i = 1, . . . , 4. Therefore, we have: P (R1 ) = P (R2 | R1 ) = r r+b r+k r+b+k P (R3 | R1 ∩ R2 ) = r + 2k r + b + 2k P (B4 | R1 ∩ R2 ∩ R3 ) = b r + b + 3k where ‘|’ means ‘given’, notation which will be formally introduced later in the chapter with conditional probability. The required probability is the product of these four probabilities, namely: r(r + k)(r + 2k)b . (r + b)(r + b + k)(r + b + 2k)(r + b + 3k) 2.6.1 Combinatorial counting methods A powerful set of counting methods answers the following question: how many ways are there to select k objects out of n distinct objects? The answer will depend on: whether the selection is with replacement (an object can be selected more than once) or without replacement (an object can be selected only once) whether the selected set is treated as ordered or unordered. Ordered sets, with replacement Suppose that the selection of k objects out of n needs to be: ordered, so that the selection is an ordered sequence where we distinguish between the 1st object, 2nd, 3rd etc. with replacement, so that each of the n objects may appear several times in the selection. Therefore: n objects are available for selection for the 1st object in the sequence n objects are available for selection for the 2nd object in the sequence . . . and so on, until n objects are available for selection for the kth object in the sequence. 28 2.6. Classical probability and counting rules Therefore, the number of possible ordered sequences of k objects selected with replacement from n objects is: k times z }| { n × n × · · · × n = nk . 
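The counting arguments above are easy to check numerically. The following short Python sketch is an informal aside (not part of the guide's required material): it verifies parts (c), (d) and (e) of Activity 2.9 by treating the 1,000 × 1,000 ordered pairs of generated numbers, an 'ordered, with replacement' selection, as equally likely outcomes.

```python
# Activity 2.9: each press of the 'random number' key gives an integer
# from 0 to 999, all equally likely and independent of other presses.
N = 1000

# (c) P(first > second): count favourable ordered pairs out of N^2.
greater = sum(1 for a in range(N) for b in range(N) if a > b)
print(greater / N**2)                          # 0.4995

# (d) P(first > second and the sum is exactly 300).
target = sum(1 for a in range(N) for b in range(N) if a > b and a + b == 300)
print(target, target / N**2)                   # 150 pairs, 0.00015

# (e) P(at least one repeat among five numbers) = 1 - P(all five different),
#     using the 'without replacement' product for the complement.
p_all_different = 1.0
for i in range(5):
    p_all_different *= (N - i) / N
print(1 - p_all_different)                     # approximately 0.009965
```

Counting the complement in part (e), rather than the event itself, mirrors the use of P(A) = 1 − P(Ac) earlier in the chapter.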
Ordered sets, without replacement Suppose that the selection of k objects out of n is again treated as an ordered sequence, but that selection is now: ordered, so that the selection is an ordered sequence where we distinguish between the 1st object, 2nd, 3rd etc. without replacement, so that if an object is selected once, it cannot be selected again. Now: n objects are available for selection for the 1st object in the sequence n − 1 objects are available for selection for the 2nd object n − 2 objects are available for selection for the 3rd object . . . and so on, until n − k + 1 objects are available for selection for the kth object. Therefore, the number of possible ordered sequences of k objects selected without replacement from n objects is: n × (n − 1) × · · · × (n − k + 1). (2.2) An important special case is when k = n. Factorials The number of ordered sets of n objects, selected without replacement from n objects, is: n! = n × (n − 1) × · · · × 2 × 1. The number n! (read ‘n factorial’) is the total number of different ways in which n objects can be arranged in an ordered sequence. This is known as the number of permutations of n objects. We also define 0! = 1. Using factorials, (2.2) can be written as: n × (n − 1) × · · · × (n − k + 1) = n! . (n − k)! 29 2. Probability theory Unordered sets, without replacement Suppose now that the identities of the objects in the selection matter, but the order does not. For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are now all treated as the same, because they all contain the elements 1, 2 and 3. The number of such unordered subsets (combinations) of k out of n objects is determined as follows. The number of ordered sequences is n!/(n − k)!. Among these, every different combination of k distinct elements appears k! times, in different orders. Ignoring the ordering, there are: n n! = k (n − k)! k! different combinations, for each k = 0, 1, . . . , n. n The number is known as the binomial coefficient. Note that because 0! = 1, k n n = = 1, so there is only 1 way of selecting 0 or n out of n objects. 0 n Example 2.15 Suppose we have k = 3 people (Amy, Bob and Sam). How many different sets of birthdays can they have (day and month, ignoring the year, and pretending February 29th does not exist, so that n = 365) in the following cases? 1. It makes a difference who has which birthday (ordered ), i.e. Amy (January 1st), Bob (May 5th) and Sam (December 5th) is different from Amy (May 5th), Bob (December 5th) and Sam (January 1st), and different people can have the same birthday (with replacement). The number of different sets of birthdays is: (365)3 = 48,627,125. 2. It makes a difference who has which birthday (ordered ), and different people must have different birthdays (without replacement). The number of different sets of birthdays is: 365! = 365 × 364 × 363 = 48,228,180. (365 − 3)! 3. Only the dates matter, but not who has which one (unordered ), i.e. Amy (January 1st), Bob (May 5th) and Sam (December 5th) is treated as the same as Amy (May 5th), Bob (December 5th) and Sam (January 1st), and different people must have different birthdays (without replacement). The number of different sets of birthdays is: 365 365! 365 × 364 × 363 = = = 8,038,030. 3 (365 − 3)! 3! 3×2×1 30 2.6. Classical probability and counting rules Example 2.16 Consider a room with r people in it. What is the probability that at least two of them have the same birthday (call this event A)? 
In particular, what is the smallest r for which P (A) > 1/2? Assume that all days are equally likely. Label the people 1 to r, so that we can treat them as an ordered list and talk about person 1, person 2 etc. We want to know how many ways there are to assign birthdays to this list of people. We note the following. 1. The number of all possible sequences of birthdays, allowing repeats (i.e. with replacement) is (365)r . 2. The number of sequences where all birthdays are different (i.e. without replacement) is 365!/(365 − r)!. Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which satisfy Ac , the complement of the case in which we are interested. Therefore: P (Ac ) = 365 × 364 × · · · × (365 − r + 1) 365!/(365 − r)! = r (365) (365)r and: P (A) = 1 − P (Ac ) = 1 − 365 × 364 × · · · × (365 − r + 1) . (365)r Probabilities, for P (A), of at least two people sharing a birthday, for different values of the number of people r are given in the following table: r 2 3 4 5 6 7 8 9 10 11 P (A) 0.003 0.008 0.016 0.027 0.040 0.056 0.074 0.095 0.117 0.141 r 12 13 14 15 16 17 18 19 20 21 P (A) 0.167 0.194 0.223 0.253 0.284 0.315 0.347 0.379 0.411 0.444 r 22 23 24 25 26 27 28 29 30 31 P (A) 0.476 0.507 0.538 0.569 0.598 0.627 0.654 0.681 0.706 0.730 r P (A) 32 0.753 33 0.775 34 0.795 35 0.814 36 0.832 37 0.849 38 0.864 39 0.878 40 0.891 41 0.903 Activity 2.11 A box contains 18 light bulbs, of which two are defective. If a person selects 7 bulbs at random, without replacement, what is the probability that both defective bulbs will be selected? Solution The sample space consists of all (unordered) subsets of 7 out of the 18 light bulbs in 18 the box. There are such subsets. The number of subsets which contain the two 7 31 2. Probability theory 16 defective bulbs is the number of subsets of size 5 out of the other 16 bulbs, , so 5 the probability we want is: 16 7×6 5 = = 0.1373. 18 18 × 17 7 2.7 Conditional probability and Bayes’ theorem Next we introduce some of the most important concepts in probability: independence conditional probability Bayes’ theorem. These give us powerful tools for: deriving probabilities of combinations of events updating probabilities of events, after we learn that some other event has happened. Independence Two events A and B are (statistically) independent if: P (A ∩ B) = P (A) P (B). Independence is sometimes denoted A ⊥⊥ B. Intuitively, independence means that: if A happens, this does not affect the probability of B happening (and vice versa) if you are told that A has happened, this does not give you any new information about the value of P (B) (and vice versa). For example, independence is often a reasonable assumption when A and B correspond to physically separate experiments. Example 2.17 Suppose we roll two dice. We assume that all combinations of the values of them are equally likely. Define the events: A = ‘Score of die 1 is not 6’ B = ‘Score of die 2 is not 6’. 32 2.7. Conditional probability and Bayes’ theorem Therefore: P (A) = 30/36 = 5/6 P (B) = 30/36 = 5/6 P (A ∩ B) = 25/36 = 5/6 × 5/6 = P (A) P (B), so A and B are independent. Activity 2.12 A and B are independent events. Suppose that P (A) = 2π, P (B) = π and P (A ∪ B) = 0.8. Evaluate π. Solution Using the probability property P (A ∪ B) = P (A) + P (B) − P (A ∩ B), and the definition of independent events P (A ∩ B) = P (A) P (B), we have: P (A ∪ B) = 0.8 = P (A) + P (B) − P (A ∩ B) = P (A) + P (B) − P (A) P (B) = 2π + π − 2π 2 . 
Therefore, applying the quadratic formula from mathematics: √ 3 ± 9 − 6.4 2π 2 − 3π + 0.8 = 0 ⇒ π = . 4 Hence π = 0.346887, since the other root is > 1 which is impossible for a probability! Activity 2.13 A and B are events such that P (A | B) > P (A). Prove that: P (Ac | B c ) > P (Ac ) where Ac and B c are the complements of A and B, respectively, and P (B c ) > 0. Solution From the definition of conditional probability: P (Ac | B c ) = P (Ac ∩ B c ) P ((A ∪ B)c ) 1 − P (A) − P (B) + P (A ∩ B) = = . c c P (B ) P (B ) 1 − P (B) However: P (A | B) = P (A ∩ B) > P (A) P (B) i.e. P (A ∩ B) > P (A) P (B). Hence: P (Ac | B c ) > 1 − P (A) − P (B) + P (A) P (B) = 1 − P (A) = P (Ac ). 1 − P (B) 33 2. Probability theory Activity 2.14 A and B are any two events in the sample space S. The binary set operator ∨ denotes an exclusive union, such that: A ∨ B = (A ∪ B) ∩ (A ∩ B)c = {s | s ∈ A or B, and s 6∈ (A ∩ B)}. Show, from the axioms of probability, that: (a) P (A ∨ B) = P (A) + P (B) − 2P (A ∩ B) (b) P (A ∨ B | A) = 1 − P (B | A). Solution (a) We have: A ∨ B = (A ∩ B c ) ∪ (B ∩ Ac ). By axiom 3, noting that (A ∩ B c ) and (B ∩ Ac ) are disjoint: P (A ∨ B) = P (A ∩ B c ) + P (B ∩ Ac ). We can write A = (A ∩ B) ∪ (A ∩ B c ), hence (using axiom 3): P (A ∩ B c ) = P (A) − P (A ∩ B). Similarly, P (B ∩ Ac ) = P (B) − P (A ∩ B), hence: P (A ∨ B) = P (A) + P (B) − 2P (A ∩ B). (b) We have: P (A ∨ B | A) = P ((A ∨ B) ∩ A) P (A) = P (A ∩ B c ) P (A) = P (A) − P (A ∩ B) P (A) = P (A) P (A ∩ B) − P (A) P (A) = 1 − P (B | A). Activity 2.15 Suppose that we toss a fair coin twice. The sample space is given by: S = {HH, HT, T H, T T } where the elementary outcomes are defined in the obvious way – for instance HT is heads on the first toss and tails on the second toss. Show that if all four elementary outcomes are equally likely, then the events ‘heads on the first toss’ and ‘heads on the second toss’ are independent. 34 2.7. Conditional probability and Bayes’ theorem Solution Note carefully here that we have equally likely elementary outcomes (due to the coin being fair), so that each has probability 1/4, and the independence follows. The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2, because it is specified by two elementary outcomes. The event ‘heads on the second toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss and the second toss’ is A ∩ B = {HH} and has probability 1/4. So the multiplication property P (A ∩ B) = 1/4 = 1/2 × 1/2 = P (A) P (B) is satisfied, and the two events are independent. 2.7.1 Independence of multiple events Events A1 , A2 , . . . , An are independent if the probability of the intersection of any subset of these events is the product of the individual probabilities of the events in the subset. This implies the important result that if events A1 , A2 , . . . , An are independent, then: P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 ) · · · P (An ). Note that there is a difference between pairwise independence and full independence. The following example illustrates. Example 2.18 It can be cold in London. Four impoverished teachers dress to feel warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher C only has a scarf and Teacher D only has gloves. One teacher out of the four is selected at random. 
It is shown that although each pair of events H = ‘the teacher selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher selected has gloves’ are independent, all three of these events are not independent. Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so: P (H) = 1 2 = , 4 2 P (S) = 2 1 = 4 2 and P (G) = 2 1 = . 4 2 Only one teacher has both a hat and a scarf, so: P (H ∩ S) = 1 4 and similarly: 1 1 and P (S ∩ G) = . 4 4 From these results, we can verify that: P (H ∩ G) = P (H ∩ S) = P (H) P (S) P (H ∩ G) = P (H) P (G) P (S ∩ G) = P (S) P (G) and so the events are pairwise independent. However, one teacher has a hat, a scarf and gloves, so: 1 P (H ∩ S ∩ G) = 6= P (H) P (S) P (G). 4 35 2. Probability theory Hence the three events are not independent. If the selected teacher has a hat and a scarf, then we know that the teacher has gloves. There is no independence for all three events together. Activity 2.16 A, B and C are independent events. Prove that A and (B ∪ C) are independent. Solution We need to show that the joint probability of A ∩ (B ∪ C) equals the product of the probabilities of A and B ∪ C, i.e. we need to show that P (A ∩ (B ∪ C)) = P (A) P (B ∪ C). Using the distributive law: P (A ∩ (B ∪ C)) = P ((A ∩ B) ∪ (A ∩ C)) = P (A ∩ B) + P (A ∩ C) − P (A ∩ B ∩ C) = P (A) P (B) + P (A) P (C) − P (A) P (B) P (C) = P (A)(P (B) + P (C) − P (B) P (C)) = P (A) P (B ∪ C). Activity 2.17 Suppose that three components numbered 1, 2 and 3 have probabilities of failure π1 , π2 and π3 , respectively. Determine the probability of a system failure in each of the following cases where component failures are assumed to be independent. (a) Parallel system – the system fails if all components fail. (b) Series system – the system fails unless all components do not fail. (c) Mixed system – the system fails if component 1 fails or if both component 2 and component 3 fail. Solution (a) Since the component failures are independent, the probability of system failure is π1 π2 π3 . (b) The probability that component i does not fail is 1 − πi , hence the probability that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability that the system fails is: 1 − (1 − π1 )(1 − π2 )(1 − π3 ). (c) Components 2 and 3 may be combined to form a notional component 4 with failure probability π2 π3 . So the system is equivalent to a component with failure probability π1 and another component with failure probability π2 π3 , these being connected in series. Therefore, the failure probability is: 1 − (1 − π1 )(1 − π2 π3 ) = π1 + π2 π3 − π1 π2 π3 . 36 2.7. Conditional probability and Bayes’ theorem Activity 2.18 Write down the condition for three events A, B and C to be independent. Solution Applying the product rule, we must have: P (A ∩ B ∩ C) = P (A) P (B) P (C). Therefore, since all subsets of two events from A, B and C must be independent, we must also have: P (A ∩ B) = P (A) P (B) P (A ∩ C) = P (A) P (C) and: P (B ∩ C) = P (B) P (C). One must check that all four conditions hold to verify independence of A, B and C. Activity 2.19 An electrical device contains 8 components connected in a sequence. The device fails if any one of the components fails. For each component the probability that it survives a year of use without failing is π, and the failures of different components can be regarded as independent events. (a) What is the probability that the device fails in a year of use? 
(b) How large must π be for the probability of failure in (a) to be less than 0.05? Solution (a) It is often easier to evaluate the probability of the complement of the event specified. Here, we calculate: P (device does not fail) = P (every component works) = π 8 and hence P (device fails) = 1 − π 8 . It is always a good idea to do a quick ‘reality check’ of your answer. If you calculated, say, the probability to be 8 (1 − π), this must be wrong because for some values of π you would have a probability greater than 1! √ (b) We require 1 − π 8 < 0.05, which is true if π > 8 0.95 ≈ 0.9936. Activity 2.20 Suppose A and B are independent events, i.e. P (A ∩ B) = P (A) P (B). Prove that: (a) A and B c are independent 37 2. Probability theory (b) Ac and B c are independent. Solution (a) Note that A = (A ∩ B) ∪ (A ∩ B c ) is a partition, and hence P (A) = P (A ∩ B)+ P (A ∩ B c ). It follows from this that: P (A ∩ B c ) = P (A) − P (A ∩ B) = P (A) − P (A) P (B) (due to independence of A and B) = P (A)[1 − P (B)] = P (A) P (B c ). (b) Here we first use one of De Morgan’s laws such that: P (Ac ∩ B c ) = P ((A ∪ B)c ) = 1 − P (A ∪ B) = 1 − [P (A) + P (B) − P (A ∩ B)] = 1 − P (A) − P (B) + P (A) P (B) = [1 − P (A)][1 − P (B)] = P (Ac ) P (B c ). Activity 2.21 Hard question! Two boys, James A and James B, throw a ball at a target. Suppose that the probability that James A will hit the target on any throw is 1/4 and the probability that James B will hit the target on any throw is 1/5. Suppose also that James A throws first and the two boys take turns throwing. (a) Determine the probability that the target will be hit for the first time on the third throw of James A. (b) Determine the probability that James A will hit the target before James B does. Solution (a) In order for the target to be hit for the first time on the third throw of James A, all five of the following independent events must occur: (i) James A misses on his first throw, (ii) James B misses on his first throw, (iii) James A misses on his second throw, (iv) James B misses on his second throw, and (v) James A hits the target on his third throw. The probability of all five events occurring is: 9 3 4 3 4 1 × × × × = . 4 5 4 5 4 100 38 2.7. Conditional probability and Bayes’ theorem (b) Let A denote the event that James A hits the target before James B. There are two methods of solving this problem. 1. The first method is to note that A can occur in two different ways. (i) James A hits the target on the first throw, which occurs with probability 1/4. (ii) Both Jameses miss the target on their first throws, and then subsequently James A hits the target before James B. The probability that both Jameses miss on their first throws is: 3 3 4 × = . 4 5 5 When they do miss, the conditions of the game become exactly the same as they were at the beginning of the game. In effect, it is as if the boys were starting a new game all over again, and so the probability that James A will subsequently hit the target before James B is again P (A). Therefore, by considering these two ways in which the event A can occur, we have: P (A) = 1 3 + × P (A) 4 5 ⇒ 5 P (A) = . 8 2. The second method of solving the problem is to calculate the probabilities that the target will be hit for the first time on James A’s first throw, on his second throw, on his third throw etc. and then to sum these probabilities. 
For the target to be hit for the first time on James A’s ith throw, both Jameses must miss on each of their first i − 1 throws, and then James A must hit the target on his next throw. The probability of this event is: i−1 i−1 i−1 4 1 3 1 3 = . 4 5 4 5 4 Hence: ∞ 1X P (A) = 4 i=1 i−1 1 3 1 5 = × = 5 4 1 − 3/5 8 which uses the sum to infinity of a geometric series (with common ratio less than 1 in absolute value) from mathematics. 2.7.2 Independent versus mutually exclusive events The idea of independent events is quite different from that of mutually exclusive (disjoint) events, as shown in Figure 2.9. For mutually exclusive events A ∩ B = ∅, and so, from (2.1), P (A ∩ B) = 0. For independent events, P (A ∩ B) = P (A) P (B). So since P (A ∩ B) = 0 6= P (A) P (B) in general (except in the uninteresting case when P (A) = 0 or P (B) = 0), then mutually exclusive events and independent events are different. In fact, mutually exclusive events are extremely non-independent (i.e. dependent). For example, if you know that A has happened, you know for certain that B has not happened. There is no particularly helpful way to represent independent events using a 39 2. Probability theory A B Figure 2.8: Venn diagram depicting mutually exclusive events. Venn diagram. Conditional probability Consider two events A and B. Suppose you are told that B has occurred. How does this affect the probability of event A? The answer is given by the conditional probability of A given that B has occurred, or the conditional probability of A given B for short, defined as: P (A | B) = P (A ∩ B) P (B) assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0. Example 2.19 Suppose we roll two independent fair dice again. Consider the following events. A = ‘at least one of the scores is 2’. B = ‘the sum of the scores is greater than 7’. These are shown in Figure 2.10. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and P (A ∩ B) = 2/36. Therefore, the conditional probability of A given B is: P (A | B) = P (A ∩ B) 2/36 2 = = ≈ 0.13. P (B) 15/36 15 Learning that B has occurred causes us to revise (update) the probability of A downward, from 0.31 to 0.13. One way to think about conditional probability is that when we condition on B, we redefine the sample space to be B. 40 2.7. Conditional probability and Bayes’ theorem A (1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (2,1) (2,2) (2,3) (2,4) (2,5) (2,6) (3,1) (3,2) (3,3) (3,4) (3,5) (3,6) (4,1) (4,2) (4,3) (4,4) (4,5) (4,6) (5,1) (5,2) (5,3) (5,4) (5,5) (5,6) (6,1) (6,2) (6,3) (6,4) (6,5) (6,6) B A B Figure 2.9: Events A, B and A ∩ B for Example 2.19. Example 2.20 In Example 2.19, when we are told that the conditioning event B has occurred, we know we are within the green line in Figure 2.9. So the 15 outcomes within it become the new sample space. There are 2 outcomes which satisfy A and which are inside this new sample space, so: P (A | B) = 2 number of cases of A within B = . 15 number of cases of B Activity 2.22 If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and B = {c, d}, find P (A | B) and P (B | A). Solution S has 4 elementary outcomes which are equally likely, so each elementary outcome has probability 1/4. We have: P (A | B) = and: P (B | A) = P (A ∩ B) P ({c}) 1/4 1 = = = P (B) P ({c, d}) 1/4 + 1/4 2 P (B ∩ A) P ({c}) 1/4 1 = = = . P (A) P ({a, b, c}) 1/4 + 1/4 + 1/4 3 Activity 2.23 Show that if A and B are disjoint events, and are also independent, then P (A) = 0 or P (B) = 0. 
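As a brief computational aside, the conditional probability obtained in Example 2.19 can be verified by direct enumeration. The sketch below is illustrative only: it conditions by restricting the sample space to B, exactly as described in Example 2.20.

```python
from fractions import Fraction
from itertools import product

# Example 2.19: roll two fair dice; 36 equally likely ordered pairs.
S = list(product(range(1, 7), repeat=2))

A = [d for d in S if 2 in d]                  # at least one of the scores is 2
B = [d for d in S if d[0] + d[1] > 7]         # sum of the scores is greater than 7
A_and_B = [d for d in A if d in B]

P_A = Fraction(len(A), len(S))                # 11/36
P_B = Fraction(len(B), len(S))                # 15/36 = 5/12
P_A_given_B = Fraction(len(A_and_B), len(B))  # B becomes the new sample space
print(P_A, P_B, P_A_given_B)                  # 11/36, 5/12, 2/15
```

The last line reproduces P(A | B) = 2/15, the downward revision of P(A) = 11/36 after learning that B has occurred.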
(Note that independence and disjointness are not similar ideas.) 41 2. Probability theory Solution It is important to get the logical flow in the right direction here. We are told that A and B are disjoint events, that is: A ∩ B = ∅. So: P (A ∩ B) = 0. We are also told that A and B are independent, that is: P (A ∩ B) = P (A) P (B). It follows that: 0 = P (A) P (B) and so either P (A) = 0 or P (B) = 0. Activity 2.24 Suppose A and B are events with P (A) = p, P (B) = 2p and P (A ∪ B) = 0.75. (a) Evaluate p and P (A | B) if A and B are independent events. (b) Evaluate p and P (A | B) if A and B are mutually exclusive events. Solution (a) We know that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). For independent events A and B, P (A ∩ B) = P (A) P (B), so P (A ∪ B) = P (A) + P (B) − P (A) P (B) gives 0.75 = p + 2p − 2p2 , or 2p2 − 3p + 0.75 = 0. Solving the quadratic equation gives: √ 3− 3 p= ≈ 0.317 4 suppressing the irrelevant case for which p > 1. Since A and B are independent, P (A | B) = P (A) = p = 0.317. (b) For mutually exclusive events, P (A ∪ B) = P (A) + P (B), so 0.75 = p + 2p, leading to p = 0.25. Here P (A ∩ B) = 0, so P (A | B) = P (A ∩ B)/P (B) = 0. Activity 2.25 (a) Show that if A and B are independent events in a sample space, then Ac and B c are also independent. (b) Show that if X and Y are mutually exclusive events in a sample space, then X c and Y c are not in general mutually exclusive. 42 2.7. Conditional probability and Bayes’ theorem Solution (a) We are given that A and B are independent, so P (A ∩ B) = P (A) P (B). We need to show a similar result for Ac and B c , namely we need to show that P (Ac ∩ B c ) = P (Ac ) P (B c ). Now Ac ∩ B c = (A ∪ B)c from basic set theory (draw a Venn diagram), hence: P (Ac ∩ B c ) = P ((A ∪ B)c ) = 1 − P (A ∪ B) = 1 − [P (A) + P (B) − P (A ∩ B)] = 1 − P (A) − P (B) + P (A ∩ B) = 1 − P (A) − P (B) + P (A) P (B) (independence assumption) = [1 − P (A)][1 − P (B)] (factorising) = P (Ac ) P (B c ) (as required). (b) To show that X c and Y c are not necessarily mutually exclusive when X and Y are mutually exclusive, the best approach is to find a counterexample. Attempts to ‘prove’ the result directly are likely to be logically flawed. Look for a simple example. Suppose we roll a die. Let X = {6} be the event of obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and Y are mutually exclusive, but X c = {1, 2, 3, 4, 5} and Y c = {1, 2, 3, 4, 6} have X c ∩ Y c 6= ∅, so X c and Y c are not mutually exclusive. Activity 2.26 If C1 , C2 , C3 , . . . are events in S which are pairwise mutually exclusive (i.e. Ci ∩ Cj = ∅ for all i 6= j), then, by the axioms of probability: ! ∞ ∞ [ X P Ci = P (Ci ). i=1 (*) i=1 Suppose that A1 , A2 , . . . are pairwise mutually exclusive events in S. Prove that a property like (*) also holds for conditional probabilities given some event B, i.e. prove that: "∞ # ! ∞ [ X P Ai B = P (Ai | B). i=1 i=1 You can assume that all unions and intersections of Ai and B are also events in S. Solution We have: P "∞ [ i=1 # Ai P ! B = ∞ S Ai ∩ B i=1 P (B) P = ∞ S [Ai ∩ B] i=1 P (B) = ∞ X P (Ai ∩ B) i=1 P (B) = ∞ X P (Ai | B) i=1 where the equation on the second line follows from (*) in the question, since Ai ∩ B 43 2. Probability theory are also events in S, and they are pairwise mutually exclusive (i.e. (Ai ∩ B)∩ (Aj ∩ B) = ∅ for all i 6= j). 2.7.3 Conditional probability of independent events If A ⊥⊥ B, i.e. 
P (A ∩ B) = P (A) P (B), and P (B) > 0 and P (A) > 0, then: P (A | B) = P (A) P (B) P (A ∩ B) = = P (A) P (B) P (B) P (B | A) = P (A ∩ B) P (A) P (B) = = P (B). P (A) P (A) and: In other words, if A and B are independent, learning that B has occurred does not change the probability of A, and learning that A has occurred does not change the probability of B. This is exactly what we would expect under independence. 2.7.4 Chain rule of conditional probabilities Since P (A | B) = P (A ∩ B)/P (B), then: P (A ∩ B) = P (A | B) P (B). That is, the probability that both A and B occur is the probability that A occurs given that B has occurred multiplied by the probability that B occurs. An intuitive graphical version of this is: s B s As The path to A is to get first to B, and then from B to A. It is also true that: P (A ∩ B) = P (B | A) P (A) and you can use whichever is more convenient. Very often some version of this chain rule is much easier than calculating P (A ∩ B) directly. The chain rule generalises to multiple events: P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) · · · P (An | A1 , . . . , An−1 ) where, for example, P (A3 | A1 , A2 ) is shorthand for P (A3 | A1 ∩ A2 ). The events can be taken in any order, as shown in Example 2.21. 44 2.7. Conditional probability and Bayes’ theorem Example 2.21 For n = 3, we have: P (A1 ∩ A2 ∩ A3 ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) = P (A1 ) P (A3 | A1 ) P (A2 | A1 , A3 ) = P (A2 ) P (A1 | A2 ) P (A3 | A1 , A2 ) = P (A2 ) P (A3 | A2 ) P (A1 | A2 , A3 ) = P (A3 ) P (A1 | A3 ) P (A2 | A1 , A3 ) = P (A3 ) P (A2 | A3 ) P (A1 | A2 , A3 ). Example 2.22 Suppose you draw 4 cards from a deck of 52 playing cards. What is the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ? We could calculate this using counting rules. There are 52 = 270,725 possible 4 subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore, P (A) = 1/270725. Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are: P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51 playing cards from which the second card will be drawn P (A3 | A1 , A2 ) = 2/50 P (A4 | A1 , A2 , A3 ) = 1/49. Putting these together with the chain rule gives: P (A) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) P (A4 | A1 , A2 , A3 ) 4 3 2 1 × × × 52 51 50 49 24 = 6497400 1 = . 270725 = Here we could obtain the result in two ways. However, there are very many situations where classical probability and counting rules are not usable, whereas conditional probabilities and the chain rule are completely general and always applicable. More methods for summing probabilities We now return to probabilities of partitions like the situation shown in Figure 2.10. 45 2. Probability theory HH H A1 HH HHr rH A HH A2 H HH A3 H A2 A1 A3 Figure 2.10: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right the ‘paths’ to A. Both diagrams in Figure 2.10 represent the partition A = A1 ∪ A2 ∪ A3 . For the next results, it will be convenient to use diagrams like the one on the right in Figure 2.11, where A1 , A2 and A3 are symbolised as different ‘paths’ to A. We now develop powerful methods of calculating sums like: P (A) = P (A1 ) + P (A2 ) + P (A3 ). 2.7.5 Total probability formula Suppose B1 , B2 , . . . , BK form a partition of the sample space. 
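(A quick numerical check of Example 2.22 before developing the total probability formula: both routes to the answer, direct counting with a binomial coefficient and the chain rule of conditional probabilities, can be computed in a few lines. The sketch below is an informal illustration, not examinable material.)

```python
import math
from fractions import Fraction

# Example 2.22: probability that 4 cards drawn from 52 are the 4 aces.

# Route 1: classical counting. Only 1 of the C(52, 4) unordered hands works.
p_counting = Fraction(1, math.comb(52, 4))
print(p_counting)                              # 1/270725

# Route 2: chain rule, P(A1) P(A2|A1) P(A3|A1,A2) P(A4|A1,A2,A3).
p_chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)
print(p_chain)                                 # 1/270725

assert p_counting == p_chain                   # the two routes agree
```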
Therefore, A ∩ B1 , A ∩ B2 , . . ., A ∩ BK form a partition of A, as shown in Figure 2.11. r B1 B2 r HH HH HHr B 3 rH r A @H H @ HH Hr @ B4 @ @ @r B5 Figure 2.11: On the left, a Venn diagram depicting the set A and the partition of S, and on the right the ‘paths’ to A. In other words, think of event A as the union of all the A ∩ Bi s, i.e. of ‘all the paths to A via different intervening events Bi ’. To get the probability of A, we now: 1. apply the chain rule to each of the paths: P (A ∩ Bi ) = P (A | Bi ) P (Bi ) 2. add up the probabilities of the paths: P (A) = K X i=1 46 P (A ∩ Bi ) = K X i=1 P (A | Bi ) P (Bi ). 2.7. Conditional probability and Bayes’ theorem This is known as the formula of total probability. It looks complicated, but it is actually often far easier to use than trying to find P (A) directly. Example 2.23 Any event B has the property that B and its complement B c partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula of total probability, we get: P (A) = P (A | B) P (B) + P (A | B c ) P (B c ) = P (A | B) P (B) + P (A | B c ) [1 − P (B)]. r Bc H H HH H HHr r A H HH HH HH r B Example 2.24 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A diagnostic test for the disease has 99% sensitivity. If a person has the disease, the test will give a positive result with a probability of 0.99. The test has 99% specificity. If a person does not have the disease, the test will give a negative result with a probability of 0.99. Let B denote the presence of the disease, and B c denote no disease. Let A denote a positive test result. We want to calculate P (A). The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and P (A | B c ) = 0.01. Therefore: P (A) = P (A | B) P (B) + P (A | B c ) P (B c ) = 0.99 × 0.0001 + 0.01 × 0.9999 = 0.010098. Activity 2.27 A man has two bags. Bag A contains five keys and bag B contains seven keys. Only one of the twelve keys fits the lock which he is trying to open. The man selects a bag at random, picks out a key from the bag at random and tries that key in the lock. What is the probability that the key he has chosen fits the lock? Solution Define a partition {Ci }, such that: C1 = key in bag A and bag A chosen ⇒ C2 = key in bag B and bag A chosen ⇒ 5 1 5 × = 12 2 24 7 1 7 P (C2 ) = × = 12 2 24 P (C1 ) = 47 2. Probability theory 5 1 5 × = 12 2 24 7 1 7 C4 = key in bag B and bag B chosen ⇒ P (C4 ) = × = . 12 2 24 C3 = key in bag A and bag B chosen ⇒ P (C3 ) = Hence we require, defining the event F = ‘key fits’: P (F ) = 2.7.6 1 1 1 5 1 7 1 × P (C1 ) + × P (C4 ) = × + × = . 5 7 5 24 7 24 12 Bayes’ theorem So far we have considered how to calculate P (A) for an event A which can happen in different ways, ‘via’ different events B1 , B2 , . . . , BK . Now we reverse the question. Suppose we know that A has occurred, as shown in Figure 2.12. Figure 2.12: Paths to A indicating that A has occurred. What is the probability that we got there via, say, B1 ? In other words, what is the conditional probability P (B1 | A)? This situation is depicted in Figure 2.13. Figure 2.13: A being achieved via B1 . So we need: P (Bj | A) = P (A ∩ Bj ) P (A) and we already know how to get this. P (A ∩ Bj ) = P (A | Bj ) P (Bj ) from the chain rule. P (A) = K P i=1 48 P (A | Bi ) P (Bi ) from the total probability formula. 2.7. 
Conditional probability and Bayes’ theorem Bayes’ theorem Using the chain rule and the total probability formula, we have: P (Bj | A) = P (A | Bj ) P (Bj ) K P P (A | Bi ) P (Bi ) i=1 which holds for each Bj , j = 1, . . . , K. This is known as Bayes’ theorem. Example 2.25 Continuing with Example 2.24, let B denote the presence of the disease, B c denote no disease, and A denote a positive test result. We want to calculate P (B | A), i.e. the probability that a person has the disease, given that the person has received a positive test result. The probabilities we need are: P (B c ) = 0.9999 P (B) = 0.0001 P (A | B) = 0.99 and P (A | B c ) = 0.01. Therefore: P (B | A) = 0.99 × 0.0001 P (A | B) P (B) = ≈ 0.0098. c c P (A | B) P (B) + P (A | B ) P (B ) 0.010098 Why is this so small? The reason is because most people do not have the disease and the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most positive test results are actually false positives. Activity 2.28 Prove the simplest version of Bayes’ theorem from first principles. Solution Applying the definition of conditional probability, we have: P (B | A) = P (B ∩ A) P (A ∩ B) P (A | B) P (B) = = . P (A) P (A) P (A) Activity 2.29 State and prove Bayes’ theorem. Solution Bayes’ theorem is: P (Bj | A) = P (A | Bj ) P (Bj ) K P . P (A | Bi ) P (Bi ) i=1 By definition: P (Bj | A) = P (Bj ∩ A) P (A | Bj ) P (Bj ) = . P (A) P (A) 49 2. Probability theory If {Bi }, for i = 1, . . . , K, is a partition of the sample space S, then: P (A) = K X P (A ∩ Bi ) = i=1 K X P (A | Bi ) P (Bi ). i=1 Hence the result. Activity 2.30 A statistics teacher knows from past experience that a student who does their homework consistently has a probability of 0.95 of passing the examination, whereas a student who does not do their homework has a probability of 0.30 of passing. (a) If 25% of students do their homework consistently, what percentage can expect to pass? (b) If a student chosen at random from the group gets a pass, what is the probability that the student has done their homework consistently? Solution Here the random experiment is to choose a student at random, and to record whether the student passes (P ) or fails (F ), and whether the student has done their homework consistently (C) or has not (N ). (Notice that F = P c and N = C c .) The sample space is S = {P C, P N, F C, F N }. We use the events Pass = {P C, P N }, and Fail = {F C, F N }. We consider the sample space partitioned by Homework = {P C, F C}, and No Homework = {P N, F N }. (a) The first part of the example asks for the denominator of Bayes’ theorem: P (Pass) = P (Pass | Homework) P (Homework) + P (Pass | No Homework) P (No Homework) = 0.95 × 0.25 + 0.30 × (1 − 0.25) = 0.2375 + 0.225 = 0.4625. (b) Now applying Bayes’ theorem: P (Homework | Pass) = P (Homework ∩ Pass) P (Pass) = P (Pass | Homework) P (Homework) P (Pass) = 0.95 × 0.25 0.4625 = 0.5135. 50 2.7. Conditional probability and Bayes’ theorem Alternatively, we could arrange the calculations in a tree diagram as shown below. Activity 2.31 Plagiarism is a serious problem for assessors of coursework. One check on plagiarism is to compare the coursework with a standard text. If the coursework has plagiarised the text, then there will be a 95% chance of finding exactly two phrases which are the same in both coursework and text, and a 5% chance of finding three or more phrases. If the work is not plagiarised, then these probabilities are both 50%. Suppose that 5% of coursework is plagiarised. 
An assessor chooses some coursework at random. What is the probability that it has been plagiarised if it has exactly two phrases in the text? (Try making a guess before doing the calculation!) What if there are three or more phrases? Did you manage to get a roughly correct guess of these results before calculating? Solution Suppose that two phrases are the same. We use Bayes’ theorem: P (plagiarised | two the same) = 0.95 × 0.05 = 0.0909. 0.95 × 0.05 + 0.5 × 0.95 Finding two phrases has increased the chance the work is plagiarised from 5% to 9.1%. Did you get anywhere near 9% when guessing? Now suppose that we find three or more phrases: P (plagiarised | three or more the same) = 0.05 × 0.05 = 0.0052. 0.05 × 0.05 + 0.5 × 0.95 It seems that no plagiariser is silly enough to keep three or more phrases the same, so if we find three or more, the chance of the work being plagiarised falls from 5% to 0.5%! How close did you get by guessing? 51 2. Probability theory Activity 2.32 Continuing with Activity 2.27, suppose the first key chosen does not fit the lock. What is the probability that the bag chosen: (a) is bag A? (b) contains the required key? Solution (a) We require P (bag A | F c ) which is: P (bag A | F c ) = P (F c | C1 ) P (C1 ) + P (F c | C2 ) P (C2 ) . 4 P c P (F | Ci ) P (Ci ) i=1 The conditional probabilities are: 4 P (F c | C1 ) = , 5 P (F c | C2 ) = 1, 6 P (F c | C3 ) = 1 and P (F c | C4 ) = . 7 Hence: P (bag A | F c ) = 4/5 × 5/24 + 1 × 7/24 1 = . 4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24 2 (b) We require P (right bag | F c ) which is: P (right bag | F c ) = P (F c | C1 ) P (C1 ) + P (F c | C4 ) P (C4 ) 4 P P (F c | Ci ) P (Ci ) i=1 = 4/5 × 5/24 + 6/7 × 7/24 4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24 = 5 . 11 Activity 2.33 Hard question! A, B and C throw a die in that order until a six appears. The person who throws the first six wins. What are their respective chances of winning? Solution We must assume that the game finishes with probability one (it would be proved in a more advanced subject). If A, B and C all throw and fail to get a six, then their respective chances of winning are as at the start of the game. We can call each completed set of three throws a round. Let us denote the probabilities of winning by P (A), P (B) and P (C) for A, B and C, respectively. Therefore: 52 2.7. Conditional probability and Bayes’ theorem P (A) = P (A wins on the 1st throw) + P (A wins in some round after the 1st round) 1 + P (A, B and C fail on the 1st throw and A wins after the 1st round) 6 1 = + P (A, B and C fail in the 1st round) 6 × P (A wins after the 1st round | A, B and C fail in the 1st round) = 1 + P (No six in first 3 throws) P (A) 6 3 5 1 P (A) = + 6 6 1 125 = + P (A). 6 216 = So (1 − 125/216) P (A) = 1/6, and P (A) = 216/(91 × 6) = 36/91. Similarly: P (B) = P (B wins in the 1st round) + P (B wins after the 1st round) = P (A fails with the 1st throw and B throws a six on the 1st throw) + P (All fail in the 1st round and B wins after the 1st round) = P (A fails with the 1st throw) P (B throws a six with the 1st throw) + P (All fail in the 1st round) P (B wins after the 1st | All fail in the 1st) 3 1 5 5 = + P (B). 6 6 6 So, (1 − 125/216) P (B) = 5/36, and P (B) = 5 × (216)/(91 × 36) = 30/91. In the same way, P (C) = (5/6) × (5/6) × (1/6) × (216/91) = 25/91. Notice that P (A) + P (B) + P (C) = 1. You may, on reflection, think that this rather long solution could be shortened, by considering the relative winning chances of A, B and C. Activity 2.34 Hard question! 
In men’s singles tennis, matches are played on the best-of-five-sets principle. Therefore, the first player to win three sets wins the match, and a match may consist of three, four or five sets. Assuming that two players are perfectly evenly matched, and that sets are independent events, calculate the probabilities that a match lasts three sets, four sets and five sets, respectively. 53 2. Probability theory Solution Suppose that the two players are A and B. We calculate the probability that A wins a three-, four- or five-set match, and then, since the players are evenly matched, double these probabilities for the final answer. P (‘A wins in 3 sets’) = P (‘A wins 1st set’ ∩ ‘A wins 2nd set’ ∩ ‘A wins 3rd set’). Since the sets are independent, we have: P (‘A wins in 3 sets’) = P (‘A wins 1st set’) P (‘A wins 2nd set’) P (‘A wins 3rd set’) 1 1 1 × × 2 2 2 1 = . 8 = Therefore, the total probability that the game lasts three sets is: 2× 1 1 = . 8 4 If A wins in four sets, the possible winning patterns are: BAAA, ABAA and AABA. Each of these patterns has probability (1/2)4 by using the same argument as in the case of 3 sets. So the probability that A wins in four sets is 3 × (1/16) = 3/16. Therefore, the total probability of a match lasting four sets is 2 × (3/16) = 3/8. The probability of a five-set match should be 1 − 3/8 − 1/4 = 3/8, but let us check this directly. The winning patterns for A in a five-set match are: BBAAA, BABAA, BAABA, ABBAA, ABABA and AABBA. Each of these has probability (1/2)5 because of the independence of the sets. So the probability that A wins in five sets is 6 × (1/32) = 3/16. Therefore, the total probability of a five-set match is 3/8, as before. Activity 2.35 Hard question! In a game of tennis, each point is won by one of the two players A and B. The usual rules of scoring for tennis apply. That is, the winner of the game is the player who first scores four points, unless each player has won three points, when deuce is called and play proceeds until one player is two points ahead of the other and hence wins the game. A is serving and has a probability of winning any point of 2/3. The result of each point is assumed to be independent of every other point. (a) Show that the probability of A winning the game without deuce being called is 496/729. (b) Find the probability of deuce being called. 54 2.7. Conditional probability and Bayes’ theorem (c) If deuce is called, show that A’s subsequent probability of winning the game is 4/5. (d) Hence determine A’s overall chance of winning the game. Solution (a) A will win the game without deuce if he or she wins four points, including the last point, before B wins three points. This can occur in three ways. • A wins four straight points, i.e. AAAA with probability (2/3)4 = 16/81. • B wins just one point in the game. There are 4 C1 ways for this to happen, namely BAAAA, ABAAA, AABAA and AAABA. Each has probability (1/3) × (2/3)4 , so the probability of one of these outcomes is given by 4 × (1/3) × (2/3)4 = 64/243. • B wins just two points in the game. There are 5 C2 ways for this to happen, namely BBAAAA, BABAAA, BAABAA, BAAABA, ABBAAA, ABABAA, ABAABA, AABBAA, AABABA and AAABBA. Each has probability (1/3)2 × (2/3)4 , so the probability of one of these outcomes is given by 10 × (1/3)2 × (2/3)4 = 160/729. Therefore, the probability that A wins without a deuce must be the sum of these, namely: 64 160 144 + 192 + 160 496 16 + + = = . 
81 243 729 729 729 (b) We can mimic the above argument to find the probability that B wins the game without a deuce. That is, the probability of four straight points to B is (1/3)4 = 1/81, the probability that A wins just one point in the game is 4 × (2/3) × (1/3)4 = 8/243, and the probability that A wins just two points is 10 × (2/3)2 × (1/3)4 = 40/729. So the probability of B winning without a deuce is 1/81 + 8/243 + 40/729 = 73/729 and so the probability of deuce is 1 − 496/729 − 73/729 = 160/729. (c) Either: suppose deuce has been called. The probability that A wins the set without further deuces is the probability that the next two points go AA – with probability (2/3)2 . The probability of exactly one further deuce is that the next four points go ABAA or BAAA – with probability (2/3)3 × (1/3) + (2/3)3 × (1/3) = (2/3)4 . The probability of exactly two further deuces is that the next six points go ABABAA, ABBAAA, BAABAA or BABAAA – with probability 4 × (2/3)4 × (1/3)2 = (2/3)6 . Continuing this way, the probability that A wins after three further deuces is (2/3)8 and the overall probability that A wins after deuce has been called is (2/3)2 + (2/3)4 + (2/3)6 + (2/3)8 + · · · . 55 2. Probability theory This is a geometric progression (GP) with first term a = (2/3)2 and common ratio (2/3)2 , so the overall probability that A wins after deuce has been called is a/(1 − r) (sum to infinity of a GP) which is: (2/3)2 4 4/9 = . = 2 1 − (2/3) 5/9 5 Or (quicker!): given a deuce, the next 2 balls can yield the following results. A wins with probability (2/3)2 , B wins with probability (1/3)2 , and deuce with probability 4/9. Hence P (A wins | deuce) = (2/3)2 + (4/9) P (A wins | deuce) and solving immediately gives P (A wins | deuce) = 4/5. (d) We have: P (A wins the game) = P (A wins without deuce being called) + P (deuce is called) P (A wins | deuce is called) 496 + 729 496 + = 729 624 = . 729 = 160 4 × 729 5 128 729 Aside: so the probability of B winning the game is 1 − 624/729 = 105/729. It follows that A is about six times as likely as B to win the game although the probability of winning any point is only twice that of B. Another example of the counterintuitive nature of probability. Example 2.26 You are waiting for your bag at the baggage reclaim carousel of an airport. Suppose that you know that there are 200 bags to come from your flight, and you are counting the distinct bags which come out. Suppose that x bags have arrived, and your bag is not among them. What is the probability that your bag will not arrive at all, i.e. that it has been lost (or at least delayed)? Define A = ‘your bag has been lost’ and x = ‘your bag is not among the first x bags to arrive’. What we want to know is the conditional probability P (A | x) for any x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are as follows. P (x | A) = 1 for all x. If your bag has been lost, it will not arrive! P (x | Ac ) = (200 − x)/200 if we assume that bags come out in a completely random order. Using Bayes’ theorem, we get: P (A | x) = 56 P (x | A) P (A) P (A) = . c c P (x | A) P (A) + P (x | A ) P (A ) P (A) + [(200 − x)/200] [1 − P (A)] 2.7. Conditional probability and Bayes’ theorem Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it has been lost! For other values of x we need P (A). This is the general probability that a bag gets lost, before you start observing the arrival of the bags from your particular flight. 
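The formula for P(A | x) is straightforward to evaluate once a value of P(A) is chosen. The sketch below is purely illustrative: the function takes P(A) as an input, and the value 0.01 used here is a hypothetical figure rather than one quoted in this example.

```python
def p_lost(x, prior, n_bags=200):
    """P(bag lost | your bag is not among the first x bags to arrive).

    'prior' plays the role of P(A), the overall chance that a bag is lost;
    it must be supplied from outside, e.g. from airline mishandling data.
    """
    return prior / (prior + ((n_bags - x) / n_bags) * (1 - prior))

# Illustrative values for a hypothetical prior of 1%:
for x in (0, 100, 190, 199):
    print(x, round(p_lost(x, prior=0.01), 3))   # 0.01, 0.02, 0.168, 0.669
```

As expected, the conditional probability starts at the prior when x = 0 and climbs towards 1 as x approaches the total number of bags.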
This kind of probability is known as the prior probability of an event A. Let us assign values to P (A) based on some empirical data. Statistics by the Association of European Airlines (AEA) show how many bags were ‘mishandled’ per 1,000 passengers the airlines carried. This is not exactly what we need (since not all passengers carry bags, and some have several), but we will use it anyway. In particular, we will compare the results for the best and the worst of the AEA in 2006: Air Malta: P (A) = 0.0044 British Airways: P (A) = 0.023. Figure 2.14 shows a plot of P (A | x) as a function of x for these two airlines. The probabilities are fairly small, even for large values of x. For Air Malta, P (A | 199) = 0.469. So even when only 1 bag remains to arrive, the probability is less than 0.5 that your bag has been lost. For British Airways, P (A | 199) = 0.825. Also, we see that P (A | 197) = 0.541 is the first probability over 0.5. This is because the baseline probability of lost bags, P (A), is low. 1.0 So, the moral of the story is that even when nearly everyone else has collected their bags and left, do not despair! 0.6 0.4 0.0 0.2 P( Your bag is lost ) 0.8 BA Air Malta 0 50 100 150 200 Bags arrived Figure 2.14: Plot of P (A | x) as a function of x for the two airlines in Example 2.26, Air Malta and British Airways (BA). 57 2. Probability theory 2.8 Overview of chapter This chapter introduced some formal terminology related to probability theory. The axioms of probability were introduced, from which various other probability results were derived. There followed a brief discussion of counting rules (using permutations and combinations). The important concepts of independence and conditional probability were discussed, and Bayes’ theorem was derived. 2.9 Key terms and concepts Axiom Chain rule Collectively exhaustive Complement Element Experiment Factorial Intersection Outcome Partition Probability (theory) Sample space Subset Union With(out) replacement 2.10 Bayes’ theorem Classical probability Combination Conditional probability Empty set Event Independence Mutually exclusive Pairwise disjoint Permutation Relative frequency Set Total probability Venn diagram Sample examination questions Solutions can be found in Appendix C. 1. For each one of the statements below say whether the statement is true or false, explaining your answer. Throughout this question A and B are events such that 0 < P (A) < 1 and 0 < P (B) < 1. (a) If A and B are independent, then P (A) + P (B) > P (A ∪ B). (b) If P (A | B) = P (A | B c ) then A and B are independent. (c) If A and B are disjoint events, then Ac and B c are disjoint. 2. Suppose that 10 people are seated in a random manner in a row of 10 lecture theatre seats. What is the probability that two particular people, A and B, will be seated next to each other? 3. A person tried by a three-judge panel is declared guilty if at least two judges cast votes of guilty (i.e. a majority verdict). Suppose that when the defendant is in fact guilty, each judge will independently vote guilty with probability 0.9, whereas when 58 2.10. Sample examination questions the defendant is not guilty (i.e. innocent), this probability drops to 0.25. Suppose 70% of defendants are guilty. (a) Compute the probability that judge 1 votes guilty. (b) Given that both judge 1 and judge 2 vote not guilty, compute the probability that judge 3 votes guilty. 59 2. 
Chapter 3
Random variables

3.1 Synopsis of chapter

This chapter introduces the concept of random variables and probability distributions. These distributions are univariate, which means that they are used to model a single numerical quantity. The concepts of expected value and variance are also discussed.

3.2 Learning outcomes

After completing this chapter, you should be able to:

define a random variable and distinguish it from the values which it takes

explain the difference between discrete and continuous random variables

find the mean and the variance of simple random variables, whether discrete or continuous

demonstrate how to use simple properties of expected values and variances.

3.3 Introduction

In ST104a Statistics 1, we considered descriptive statistics for a sample of observations of a variable X. Here we will represent the observations as a sequence of variables, denoted as:

X1, X2, . . . , Xn

where n is the sample size. In statistical inference, the observations will be treated as a sample drawn at random from a population. We will then think of each observation Xi of a variable X as an outcome of an experiment. The experiment is 'select a unit at random from the population and record its value of X'. The outcome is the observed value Xi of X. Because variables X in statistical data are recorded as numbers, we can now focus on experiments where the outcomes are also numbers – random variables.

Random variable

A random variable is an experiment for which the outcomes are numbers. (This definition is a bit informal, but it is sufficient for this course.) This means that for a random variable:

the sample space, S, is the set of real numbers R, or a subset of R

the outcomes are numbers in this sample space (instead of 'outcomes', we often call them the values of the random variable)

events are sets of numbers (values) in this sample space.

Discrete and continuous random variables

There are two main types of random variables, depending on the nature of S, i.e. the possible values of the random variable.

A random variable is continuous if S is all of R or some interval(s) of it, for example [0, 1] or [0, ∞).

A random variable is discrete if it is not continuous. (Strictly speaking, a discrete random variable is not just a random variable which is not continuous, as there are many others, such as mixture distributions.) More precisely, a discrete random variable takes a finite or countably infinite number of values.

Notation

A random variable is typically denoted by an upper-case letter, for example X (or Y, W etc.). A specific value of a random variable is often denoted by a lower-case letter, for example x. Probabilities of values of a random variable are written as follows.

P(X = x) denotes the probability that (the value of) X is x.

P(X > 0) denotes the probability that X is positive.

P(a < X < b) denotes the probability that X is between the numbers a and b.

Random variables versus samples

You will notice that many of the quantities we define for random variables are analogous to sample quantities defined in ST104a Statistics 1.

Random variable             Sample
Probability distribution    Sample distribution
Mean (expected value)       Sample mean (average)
Variance                    Sample variance
Standard deviation          Sample standard deviation
Median                      Sample median

This is no accident.
In statistics, the population is represented as following a probability distribution, and quantities for an observed sample are then used as estimators of the analogous quantities for the population. 3.4 Discrete random variables Example 3.1 The following two examples will be used throughout this chapter. 1. The number of people living in a randomly selected household in England. • For simplicity, we use the value 8 to represent ‘8 or more’ (because 9 and above are not reported separately in official statistics). • This is a discrete random variable, with possible values of 1, 2, 3, 4, 5, 6, 7 and 8. 2. A person throws a basketball repeatedly from the free-throw line, trying to make a basket. Consider the following random variable. The number of unsuccessful throws before the first successful throw. • The possible values of this are 0, 1, 2, . . .. 3.4.1 Probability distribution of a discrete random variable The probability distribution (or just distribution) of a discrete random variable X is specified by: its possible values, x (i.e. its sample space, S) the probabilities of the possible values, i.e. P (X = x) for all x ∈ S. So we first need to develop a convenient way of specifying the probabilities. 63 3. Random variables Example 3.2 Consider the following probability distribution for the household size, X.3 Number of people in the household, x 1 2 3 4 5 6 7 8 P (X = x) 0.3002 0.3417 0.1551 0.1336 0.0494 0.0145 0.0034 0.0021 Probability function The probability function (pf) of a discrete random variable X, denoted by p(x), is a real-valued function such that for any number x the function is: p(x) = P (X = x). We can talk of p(x) both as the pf of the random variable X, and as the pf of the probability distribution of X. Both mean the same thing. Alternative terminology: the pf of a discrete random variable is also often called the probability mass function (pmf). Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x) – especially when it is necessary to indicate clearly to which random variable the function corresponds. Necessary conditions for a probability function To be a pf of a discrete random variable X with sample space S, a function p(x) must satisfy the following conditions. 1. p(x) ≥ 0 for all real numbers x. P 2. p(xi ) = 1, i.e. the sum of probabilities of all possible values of X is 1. xi ∈S The pf is defined for all real numbers x, but p(x) = 0 for any x ∈ / S, i.e. for any value x which is not one of the possible values of X. 3 Source: ONS, National report for the 2001 Census, England and Wales. Table UV51. 64 3.4. Discrete random variables Example 3.3 Continuing Example 3.2, here we can simply list all the values: 0.3002 for x = 1 0.3417 for x = 2 0.1551 for x = 3 0.1336 for x = 4 p(x) = 0.0494 for x = 5 0.0145 for x = 6 0.0034 for x = 7 0.0021 for x = 8 0 otherwise. These are clearly all non-negative, and their sum is 8 P p(x) = 1. x=1 p(x) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 A graphical representation of the pf is shown in Figure 3.1. 1 2 3 4 5 6 7 8 x (number of people in the household) Figure 3.1: Probability function for Example 3.3. For the next example, we need to remember the following results from mathematics, concerning sums of geometric series. If r 6= 1, then: n−1 X a (1 − rn ) ar = 1−r x=0 x and if |r| < 1, then: ∞ X x=0 a rx = a . 1−r 65 3. Random variables Example 3.4 In the basketball example, the number of possible values is infinite, so we cannot simply list the values of the pf. 
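A distribution given as a finite table like this is easy to manipulate directly. The sketch below is an informal illustration: it stores the probabilities of Example 3.2 in a dictionary, checks that they sum to 1, and evaluates the probabilities of two simple events.

```python
# Example 3.2: distribution of household size X (England and Wales, 2001 Census).
p = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
     5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

# The probabilities of all possible values must sum to 1.
print(round(sum(p.values()), 4))                     # 1.0

# P(X <= 2): probability that a household has at most two people.
print(round(p[1] + p[2], 4))                         # 0.6419

# P(X >= 5): probability of five or more people.
print(round(sum(prob for x, prob in p.items() if x >= 5), 4))   # 0.0694
```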
So we try to express it as a formula. Suppose that: the probability of a successful throw is π at each throw and, therefore, the probability of an unsuccessful throw is 1 − π outcomes of different throws are independent. Hence the probability that the first success occurs after x failures is the probability of a sequence of x failures followed by a success, i.e. the probability is: (1 − π)x π. So the pf of the random variable X (the number of failures before the first success) is: ( (1 − π)x π for x = 0, 1, 2, . . . p(x) = (3.1) 0 otherwise where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf. Clearly, p(x) ≥ 0 for all x, since π ≥ 0 and 1 − π ≥ 0. Using the sum to infinity of a geometric series, we get: ∞ X x=0 p(x) = ∞ ∞ X X (1 − π)x π = π (1 − π)x = π x=0 x=0 π 1 = = 1. 1 − (1 − π) π p(x) 0.4 0.5 0.6 0.7 The expression of the pf involves a parameter π (the probability of a successful throw), a number for which we can choose different values. This defines a whole ‘family’ of individual distributions, one for each value of π. For example, Figure 3.2 shows values of p(x) for two values of π reflecting fairly good and pretty poor free-throw shooters, respectively. 0.0 0.1 0.2 0.3 π = 0.7 π = 0.3 0 5 10 15 x (number of failures) Figure 3.2: Probability function for Example 3.4. π = 0.7 indicates a fairly good free-throw shooter. π = 0.3 indicates a pretty poor free-throw shooter. 66 3.4. Discrete random variables Activity 3.1 Suppose that a box contains 12 green balls and 4 yellow balls. If 7 balls are selected at random, without replacement, determine the probability function of X, the number of green balls which will be obtained. Solution Let the random variable X denote the number of green balls. As 7 balls are selected without replacement, the sample space of X is S = {3, 4, 5, 6, 7} because the maximum number of yellow balls which could be obtained is 4 (all selected), hence a minimum of 3 green balls must be obtained, up to a maximum of7 green balls. The 16 number of possible combinations of 7 balls drawn from 16 is . The x green 7 12 balls chosen from 12 can occur in ways, and the 7 − x yellow balls chosen from x 4 4 can occur in ways. Therefore, using classical probability: 7−x 12 4 x 7−x . p(x) = 16 7 Therefore, the probability function is: . 4 16 12 for x = 3, 4, 5, 6, 7 x 7−x 7 p(x) = 0 otherwise. Activity 3.2 Consider a sequence of independent tosses of a fair coin. Let the random variable X denote the number of tosses needed to obtain the first head. Determine the probability function of X and verify it satisfies the necessary conditions for a valid probability function. Solution The sample space is clearly S = {1, 2, 3, . . .}. If the first head appears on toss x, then the previous x − 1 tosses must have been tails. By independence of the tosses, and the fact it is a fair coin: x−1 x 1 1 1 × = . P (X = x) = 2 2 2 Therefore, the probability function is: ( 1/2x p(x) = 0 for x = 1, 2, 3, . . . otherwise. Clearly, p(x) ≥ 0 for all x and: 2 3 ∞ x X 1 1 1 1 1/2 = + + + ··· = =1 2 2 2 2 1 − 1/2 x=1 67 3. Random variables noting the sum to infinity of a geometric series with first term a = 1/2 and common ratio r = 1/2. Activity 3.3 Show that: ( 2x/(k (k + 1)) for x = 1, 2, . . . , k p(x) = 0 otherwise is a valid probability function for a discrete random variable X. n P Hint: i = n (n + 1)/2. i=1 Solution Since k > 0, then 2x/(k (k + 1)) ≥ 0 for x = 1, 2, . . . , k. Therefore, p(x) ≥ 0 for all real x. 
Also, noting the hint in the question: k X x=1 2x 2 4 2k = + + ··· + k (k + 1) k (k + 1) k (k + 1) k (k + 1) = 2 (1 + 2 + · · · + k) k (k + 1) = 2 k (k + 1) k (k + 1) 2 = 1. Hence p(x) is a valid probability function. 3.4.2 The cumulative distribution function (cdf) Another way to specify a probability distribution is to give its cumulative distribution function (cdf) (or just simply distribution function). Cumulative distribution function (cdf) The cdf is denoted F (x) (or FX (x)) and defined as: F (x) = P (X ≤ x) for all real numbers x. For a discrete random variable it is given by: X F (x) = p(xi ) xi ∈S, xi ≤x i.e. the sum of the probabilities of the possible values of X which are less than or equal to x. 68 3.4. Discrete random variables Example 3.5 Continuing with the household size example, values of F (x) at all possible values of X are: Number of people in the household, x 1 2 3 4 5 6 7 8 p(x) 0.3002 0.3417 0.1551 0.1336 0.0494 0.0145 0.0034 0.0021 F (x) 0.3002 0.6419 0.7970 0.9306 0.9800 0.9945 0.9979 1.0000 0.0 0.2 0.4 F(x) 0.6 0.8 1.0 These are shown in graphical form in Figure 3.3. 0 2 4 6 8 x (number of people in the household) Figure 3.3: Cumulative distribution function for Example 3.5. Example 3.6 In the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . .. We can calculate a simple formula for the cdf, using the sum of a geometric series. Since, for any non-negative integer y, we obtain: y X p(x) = x=0 we can write: y X x=0 x (1 − π) π = π y X (1 − π)x = π x=0 ( 0 F (x) = 1 − (1 − π)x+1 1 − (1 − π)y+1 = 1 − (1 − π)y+1 1 − (1 − π) for x < 0 for x = 0, 1, 2, . . . . The cdf is shown in graphical form in Figure 3.4. 69 3. Random variables Activity 3.4 Suppose that random variable X has the range {x1 , x2 , . . .}, where x1 < x2 < · · · . Prove the following results: ∞ X p(xi ) = 1, p(xk ) = F (xk ) − F (xk−1 ) and F (xk ) = i=1 k X p(xi ). i=1 Solution The events X = x1 , X = x2 , . . . are disjoint, so we can write: ∞ X i=1 p(xi ) = ∞ X P (X = xi ) = P (X = x1 ∪ X = x2 ∪ · · · ) = P (S) = 1. i=1 In words, this result states that the sum of the probabilities of all the possible values X can take is equal to 1. For the second equation, we have: F (xk ) = P (X ≤ xk ) = P (X = xk ∪ X ≤ xk−1 ). The two events on the right-hand side are disjoint, so: F (xk ) = P (X = xk ) + P (X ≤ xk−1 ) = p(xk ) + F (xk−1 ) which immediately gives the required result. For the final result, we can write: F (xk ) = P (X ≤ xk ) = P (X = x1 ∪ X = x2 ∪ · · · ∪ X = xk ) = k X p(xi ). i=1 Activity 3.5 At a charity event, the organisers sell 100 tickets to a raffle. At the end of the event, one of the tickets is selected at random and the person with that number wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5. What is the probability for each of them to win the prize? Solution Let X denote the number on the winning ticket. Since all values between 1 and 100 are equally likely, X has a discrete ‘uniform’ distribution such that: P (‘Carol wins’) = P (X = 22) = p(22) = and: P (‘Janet wins’) = P (X ≤ 5) = F (5) = 70 1 = 0.01 100 5 = 0.05. 100 0.4 F(x) 0.6 0.8 1.0 3.4. Discrete random variables 0.0 0.2 π = 0.7 π = 0.3 0 5 10 15 x (number of failures) Figure 3.4: Cumulative distribution function for Example 3.6. 
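The cumulative distribution functions above can also be checked numerically. The following short Python sketch is purely illustrative and not required for the course (the function and variable names are just for illustration); it recomputes the cdf values of Example 3.5 by accumulating the household-size probabilities, and evaluates the cdf of Example 3.6 from the formula F(x) = 1 − (1 − π)^(x+1).

# Illustrative sketch only: cdf of a discrete random variable obtained by
# summing the probability function over all values less than or equal to x.
household_pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
                5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

def cdf(x, pf):
    """F(x) = P(X <= x) for a discrete random variable with probability function pf."""
    return sum(p for value, p in pf.items() if value <= x)

for x in range(1, 9):
    print(x, round(cdf(x, household_pf), 4))
# Reproduces Example 3.5: F(1) = 0.3002, F(2) = 0.6419, ..., F(8) = 1.0000

# Basketball example (Example 3.6): F(x) = 1 - (1 - pi)^(x + 1) for x = 0, 1, 2, ...
pi = 0.7
print([round(1 - (1 - pi)**(x + 1), 4) for x in range(5)])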
3.4.3 Properties of the cdf for discrete distributions The cdf F (x) of a discrete random variable X is a step function such that: F (x) remains constant in all intervals between possible values of X at a possible value xi of X, F (x) jumps up by the amount p(xi ) = P (X = xi ) at such an xi , the value of F (xi ) is the value at the top of the jump (i.e. F (x) is right-continuous). 3.4.4 General properties of the cdf These hold for both discrete and continuous random variables. 1. 0 ≤ F (x) ≤ 1 for all x (since F (x) is a probability). 2. F (x) → 0 as x → −∞, and F (x) → 1 as x → ∞. 3. F (x) is a non-decreasing function, i.e. if x1 < x2 , then F (x1 ) ≤ F (x2 ). 4. For any x1 < x2 , P (x1 < X ≤ x2 ) = F (x2 ) − F (x1 ). Either the pf or the cdf can be used to calculate the probabilities of any events for a discrete random variable. Example 3.7 Continuing with the household size example (for the probabilities, see Example 3.5), then: P (X = 1) = p(1) = F (1) = 0.3002 P (X = 2) = p(2) = F (2) − F (1) = 0.3417 71 3. Random variables P (X ≤ 2) = p(1) + p(2) = F (2) = 0.6419 P (X = 3 or 4) = p(3) + p(4) = F (4) − F (2) = 0.2887 P (X > 5) = p(6) + p(7) + p(8) = 1 − F (5) = 0.0200 P (X ≥ 5) = p(5) + p(6) + p(7) + p(8) = 1 − F (4) = 0.0694. 3.4.5 Properties of a discrete random variable Let X be a discrete random variable with sample space S and pf p(x). Expected value of a discrete random variable The expected value (or mean) of X is denoted E(X), and defined as: X E(X) = xi p(xi ). xi ∈S This can also be written more concisely as E(X) = P x p(x) or E(X) = P x p(x). x We can talk of E(X) as the expected value of both the random variable X, and of the probability distribution of X. Alternative notation: instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’), or µX , is often used. Activity 3.6 Toward the end of the financial year, James is considering whether to accept an offer to buy his stock option now, rather than wait until the normal exercise time. If he sells now, his profit will be £120,000. If he waits until the exercise time, his profit will be £200,000, provided that there is no crisis in the markets before that time; if there is a crisis, the option will be worthless and he would expect a net loss of £50,000. What action should he take to maximise his expected profit if the probability of crisis is: (a) 0.5? (b) 0.1? For what probability of a crisis would James be indifferent between the two courses of action if he wishes to maximise his expected profit? Solution Let π = probability of crisis, then: S = E(profit given James sells) = £120,000 and: W = E(profit given James waits) = £200,000 (1 − π) + (−£50,000) π. 72 3.4. Discrete random variables (a) If π = 0.5, then S = £120,000 and W = £75,000, so S > W , hence James should sell now. (b) If π = 0.1, then S = £120,000 and W = £175,000, so S < W , hence James should wait until the exercise time. To be indifferent, we require S = W , i.e. we have: £200,000 − £250,000 π = £120,000 so π = 8/25 = 0.32. Activity 3.7 What is the expectation of the random variable X if the only possible value it can take is c? Also, show that E(X − E(X)) = 0. Solution We have p(c) = 1, so X is effectively a constant, even though it is called a random variable. Its expectation is: X E(X) = x p(x) = c p(x) = c p(c) = c × 1 = c. (3.2) ∀x This is intuitively correct; on average, a constant must be equal to itself! 
We have: E(X − E(X)) = E(X) − E(E(X)) Since E(X) is just a number, as opposed to a random variable, (3.2) tells us that its expectation is equal to itself. Therefore, we can write: E(X − E(X)) = E(X) − E(X) = 0. Activity 3.8 If a probability function of a random variable X is given by: ( 1/2x for x = 1, 2, 3, . . . p(x) = 0 otherwise show that E(2X ) does not exist. Solution We have: ∞ X ∞ X ∞ X 1 E(2 ) = 2 p(x) = 2 x = 1 = 1 + 1 + 1 + · · · = ∞. 2 x=1 x=1 x=1 X x x Note that this is the famous ‘Petersburg paradox’, according to which a player’s expectation is infinite (i.e. does not exist) if s/he is to receive 2X units of currency when, in a series of tosses of a fair coin, the first head appears on the xth toss. 73 3. Random variables Activity 3.9 Suppose that on each play of a certain game James, a gambler, is equally likely to win or to lose. Suppose that when he wins, his fortune is doubled, and that when he loses, his fortune is cut in half. If James begins playing with a given fortune c > 0, what is the expected value of his fortune after n independent plays of the game? Hint: If X1 , X2 , . . . , Xn are independent random variables, then: E(X1 X2 · · · Xn ) = E(X1 ) × E(X2 ) × · · · × E(Xn ). That is, for independent random variables the ‘expectation of the product’ is the ‘product of the expectations’. This will be introduced in Chapter 5: Multivariate random variables. Solution For i = 1, . . . , n, let Xi = 2 if James’ fortune is doubled on the ith play of the game, and let Xi = 1/2 if his fortune is cut in half on the ith play. Hence: E(Xi ) = 2 × 5 1 1 1 + × = . 2 2 2 4 After the first play of the game, James’ fortune will be cX1 , after the second play it will be (cX1 )X2 , and by continuing in this way it is seen that after n plays James’ fortune will be cX1 X2 · · · Xn . Since X1 , . . . , Xn are independent, and noting the hint: n 5 . E(cX1 X2 · · · Xn ) = c × E(X1 ) × E(X2 ) × · · · × E(Xn ) = c 4 3.4.6 Expected value versus sample mean The mean (expected value) E(X) of a probability distribution is analogous to the sample mean (average) X̄ of a sample distribution. This is easiest to see when the sample space is finite. Suppose the random variable X can have K different values X1 , . . . , XK , and their frequencies in a sample are f1 , . . . , fK , respectively. Therefore, the sample mean of X is: K X f 1 x1 + · · · + f K xK = x1 pb(x1 ) + · · · + xK pb(xK ) = xi pb(xi ) X̄ = f1 + · · · + fK i=1 where: pb(xi ) = fi K P fi i=1 are the sample proportions of the values xi . The expected value of the random variable X is: E(X) = x1 p(x1 ) + · · · + xK p(xK ) = K X i=1 74 xi p(xi ). 3.4. Discrete random variables So X̄ uses the sample proportions, pb(xi ), whereas E(X) uses the population probabilities, p(xi ). Example 3.8 Continuing with the household size example: Number of people in the household, x 1 2 3 4 5 6 7 8 Sum p(x) 0.3002 0.3417 0.1551 0.1336 0.0494 0.0145 0.0034 0.0021 x p(x) 0.3002 0.6834 0.4653 0.5344 0.2470 0.0870 0.0238 0.0168 2.3579 = E(X) The expected number of people in a randomly selected household is 2.36. Example 3.9 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . ., and 0 otherwise. It can be shown that (see the appendix for a non-examinable proof): E(X) = 1−π . π Hence, for example: E(X) = 0.3/0.7 = 0.42 for π = 0.7 E(X) = 0.7/0.3 = 2.33 for π = 0.3. 
So, before scoring a basket, a fairly good free-throw shooter (with π = 0.7) misses on average about 0.42 shots, and a pretty poor free-throw shooter (with π = 0.3) misses on average about 2.33 shots. Expected values of functions of a random variable Let g(X) be a function (‘transformation’) of a discrete random variable X. This is also a random variable, and its expected value is: X E(g(X)) = g(x) pX (x) where pX (x) = p(x) is the probability function of X. 75 3. Random variables Example 3.10 The expected value of the square of X is: X E(X 2 ) = x2 p(x). In general: E(g(X)) 6= g(E(X)) when g(X) is a non-linear function of X. Example 3.11 Note that: 2 2 E(X ) 6= (E(X)) and E 1 X 6= 1 . E(X) Expected values of linear transformations Suppose X is a random variable and a and b are constants, i.e. known numbers which are not random variables. Therefore: E(aX + b) = a E(X) + b. A special case of the result: E(aX + b) = a E(X) + b is obtained when a = 0, which gives: E(b) = b. That is, the expected value of a constant is the constant itself. Variance and standard deviation of a discrete random variable The variance of a discrete random variable X is defined as: X Var(X) = E (X − E(X))2 = (x − E(X))2 p(x). x The standard deviation of X is sd(X) = p Var(X). Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation) of the random variable X. Alternative notation: the variance is often denoted σ 2 (‘sigma squared’) and the standard deviation by σ (‘sigma’). 76 3.4. Discrete random variables An alternative formula: the variance can also be calculated as: Var(X) = E(X 2 ) − (E(X))2 . Example 3.12 Continuing with the household size example: x 1 2 3 4 5 6 7 8 P p(x) 0.3002 0.3417 0.1551 0.1336 0.0494 0.0145 0.0034 0.0021 x p(x) 0.3002 0.6834 0.4653 0.5344 0.2470 0.0870 0.0238 0.0168 2.3579 = E(X) (x − E(X))2 1.844 0.128 0.412 2.696 6.981 13.265 21.549 31.833 (x − E(X))2 p(x) 0.554 0.044 0.064 0.360 0.345 0.192 0.073 0.067 1.699 = Var(X) x2 1 4 9 16 25 36 49 64 x2 p(x) 0.300 1.367 1.396 2.138 1.235 0.522 0.167 0.134 7.259 = E(X 2 ) 2 2 2 2 Var(X) =pE[(X − E(X)) √ ] = 1.699 = 7.259 − (2.358) = E(X ) − (E(X)) and sd(X) = Var(X) = 1.699 = 1.30. Example 3.13 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . ., and 0 otherwise. It can be shown (although the proof is beyond the scope of the course) that for this distribution: Var(X) = 1−π . π2 In the two cases we have used as examples: Var(X) = 0.3/(0.7)2 = 0.61 and sd(X) = 0.78 for π = 0.7 Var(X) = 0.7/(0.3)2 = 7.78 and sd(X) = 2.79 for π = 0.3. So the variation in how many free throws a pretty poor shooter misses before the first success is much higher than the variation for a fairly good shooter. Variances of linear transformations If X is a random variable and a and b are constants, then: Var(aX + b) = a2 Var(X). If a = 0, this gives: Var(b) = 0. That is, the variance of a constant is 0. The converse also holds – if a random variable has a variance of 0, it is actually a constant. 77 3. Random variables Example 3.14 For further practice, let us consider a discrete random variable X which has possible values 0, 1, 2 . . . , n, where n is a known positive integer, and X has the following probability function: ( n π x (1 − π)n−x for x = 0, 1, 2, . . . , n x p(x) = 0 otherwise where nx = n!/(x! (n − x)!) denotes the binomial coefficient, and π is a probability parameter such that 0 ≤ π ≤ 1. A random variable like this follows the binomial distribution. 
We will discuss its motivation and uses later in the next chapter. Here, we consider the following tasks for this distribution. Show that p(x) satisfies the conditions for a probability function. Write down the cumulative distribution function, F (x). To show that p(x) is a probability function, we need to show the following. 1. p(x) ≥ 0 for all x. This is clearly true, since x ≥ 0, π ≥ 0 and 1 − π ≥ 0. 2. n P p(x) = 1. This is easiest to show by using the binomial theorem, which states x=0 that, for any integer n ≥ 0 and any real numbers y and z, then: n (y + z) = n X n x=0 x y x z n−x . (3.3) If we choose y = π and z = 1 − π in (3.3), we get: n n 1 = 1 = [π + (1 − π)] = n X n x=0 x x n−x π (1 − π) = n X p(x). x=0 This does not simplify into a simple formula, so we just calculate the values from the definition, by summation. For the values x = 0, 1, . . . , n, the value of the cdf is: F (x) = P (X ≤ x) = x X n y=0 y π y (1 − π)n−y . Since X is a discrete random variable, F (x) is a step function. We note that: E(X) = n π and Var(X) = n π (1 − π). Activity 3.10 Show that if Var(X) = 0 then p(µ) = 1. (We say in this case that X is almost surely equal to its mean.) 78 3.4. Discrete random variables Solution From the definition of variance, we have: Var(X) = E((X − µ)2 ) = X (x − µ)2 p(x) ≥ 0 ∀x because the squared term (x − µ)2 is non-negative (as is p(x)). The only case where it is equal to 0 is when x − µ = 0, that is, when x = µ. Therefore, the random variable X can only take the value µ, and we have p(µ) = P (X = µ) = 1. Activity 3.11 Construct suitable examples to show that for a random variable X: (a) E(X 2 ) 6= (E(X))2 in general (b) E(1/X) 6= 1/E(X) in general. Solution We require a counterexample. A simple one will suffice – there is no merit in complexity. Let the discrete random variable X assume values 1 and 2 with probabilities 1/3 and 2/3, respectively. (Obviously, there are many other examples we could have chosen.) Therefore: 1 2 5 +2× = 3 3 3 1 2 E(X 2 ) = 1 × + 4 × = 3 3 3 1 1 2 2 1 =1× + × = E X 3 2 3 3 E(X) = 1 × and, clearly, E(X 2 ) 6= (E(X))2 and E(1/X) 6= 1/E(X) in this case. So the result has been shown in general. Activity 3.12 (a) Let X be a random variable. Show that: Var(X) = E(X(X − 1)) − E(X)(E(X) − 1). (b) Let X1 , X2 , . . . , Xn be independent random variables. Assume that all have a mean of µ and a variance of σ 2 . Find expressions for the mean and variance of the random variable (X1 + X2 + · · · + Xn )/n. Solution (a) Recall that Var(X) = E(X 2 ) − (E(X)2 ). Now, working backwards: 79 3. Random variables E(X(X − 1)) − E(X)(E(X) − 1) = E(X 2 − X) − (E(X))2 + E(X) = E(X 2 ) − E(X) − E(X)2 + E(X) (using standard properties of expectation) = E(X 2 ) − (E(X))2 = Var(X). (b) We have: E X 1 + X2 + · · · + Xn n = E(X1 + X2 + · · · + Xn ) n E(X1 ) + E(X2 ) + · · · + E(Xn ) n µ + µ + ··· + µ = n nµ = n = = µ. Also: Var X1 + X2 + · · · + X n n (by independence) = Var(X1 + X2 + · · · + Xn ) n2 = Var(X1 ) + Var(X2 ) + · · · + Var(Xn ) n2 = σ2 + σ2 + · · · + σ2 n2 = n σ2 n2 = σ2 . n Activity 3.13 Let X be a random variable for which E(X) = µ and Var(X) = σ 2 , and let c be an arbitrary constant. Show that: E((X − c)2 ) = σ 2 + (µ − c)2 . Solution We have: E((X − c)2 ) = E(X 2 − 2cX + c2 ) = E(X 2 ) − 2c E(X) + c2 = Var(X) + (E(X))2 − 2cµ + c2 = σ 2 + µ2 − 2cµ + c2 = σ 2 + (µ − c)2 . 80 3.4. Discrete random variables Activity 3.14 Y is a random variable with expected value zero, P (Y = 1) = 0.2 and P (Y = 2) = 0.1. 
It is known that Y takes just one other value besides 1 and 2. (a) What is the other value that Y takes? (b) What is the variance of Y ? Solution (a) Let the other value be θ, then: X E(Y ) = y P (Y = y) = (θ × 0.7) + (1 × 0.2) + (2 × 0.1) = 0 y hence θ = −4/7. (b) Var(Y ) = E(Y 2 ) − (E(Y ))2 = E(Y 2 ), since E(Y ) = 0. So: X y 2 P (Y = y) Var(Y ) = E(Y 2 ) = y = ! 2 4 × 0.7 + (12 × 0.2) + (22 × 0.1) − 7 = 0.8286. Activity 3.15 James is planning to invest £1000 for two years. He will choose between two savings accounts offered by a bank: A standard fixed-term account which has a guaranteed interest rate of 5.5% after the two years. A ‘Deposit Plus’ account, for which the interest rate depends on the stock prices of three companies as follows: • if the stock prices of all three companies are higher two years after the account is opened, the two-year interest rate is 8.1% • if not, the two-year interest rate is 1.1%. Denote by X the two-year interest rate of the Deposit Plus account, and by Y the two-year interest rate of the standard account. Let π denote the probability that the condition for the higher interest rate of the Deposit Plus account is satisfied at the end of the period. (a) Calculate the expected value and standard deviation of X, and the expected value and standard deviation of Y . (b) For which values of π is E(X) > E(Y )? (c) Which account would you choose, and why? (There is no single right answer to this question!) 81 3. Random variables Solution (a) Since the interest rate of the standard account is guaranteed, the ‘random’ variable Y is actually a constant. So E(Y ) = 5.5 and Var(Y ) = sd(Y ) = 0. The random variable X has two values, 8.1 and 1.1, with probabilities π and 1 − π respectively. Therefore: E(X) = 8.1 × π + 1.1 × (1 − π) = 1.1 + 7.0π E(X 2 ) = (8.1)2 × π + (1.1)2 × (1 − π) = 1.21 + 64.4π Var(X) = E(X 2 ) − (E(X))2 = 49 π (1 − π) and so sd(X) = 7 p π (1 − π). (b) E(X) > E(Y ) if 1.1 + 7.0π > 5.5, i.e. if π > 0.6286. The expected interest rate of the Deposit Plus account is higher than the guaranteed rate of the standard account if the probability is higher than 0.6286 that all three stock prices are at higher levels at the end of the reference period. (c) If you focus solely on the expected interest rate, you would make your decision based on your estimate of π. You would choose the Deposit Plus account if you believe – based on whatever evidence on the companies and the world economy you choose to use – that there is a probabily of at least 0.6286 that the three companies will all increase their share prices over the two years. However, you might also consider the variances. The standard account has a guaranteed rate, while the Deposit Plus account offers both a possibility of a high rate and a risk of a low rate. So the choice could also depend on how risk-averse you are. Activity 3.16 Hard question! In an investigation of animal behaviour, rats have to choose between four doors. One of them, behind which is food, is ‘correct’. If an incorrect choice is made, the rat is returned to the starting point and chooses again, continuing as long as necessary until the correct choice is made. The random variable X is the serial number of the trial on which the correct choice is made. 
Find the probability function and expectation of X under each of the following hypotheses: (a) each door is equally likely to be chosen on each trial, and all trials are mutually independent (b) at each trial, the rat chooses with equal probability between the doors which it has not so far tried (c) the rat never chooses the same door on two successive trials, but otherwise chooses at random with equal probabilities. 82 3.4. Discrete random variables Solution (a) For the ‘stupid’ rat: 1 4 3 1 P (X = 2) = × 4 4 .. . r−1 1 3 × . P (X = r) = 4 4 P (X = 1) = This is a ‘geometric distribution’ with π = 1/4, which gives E(X) = 1/π = 4. (b) For the ‘intelligent’ rat: 1 4 3 1 1 P (X = 2) = × = 4 3 4 1 3 2 1 P (X = 3) = × × = 4 3 2 4 3 2 1 1 P (X = 4) = × × × 1 = . 4 3 2 4 P (X = 1) = Hence E(X) = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5. (c) For the ‘forgetful’ rat (short-term, but not long-term, memory): 1 4 3 1 P (X = 2) = × 4 3 3 2 1 P (X = 3) = × × 4 3 3 .. . r−2 3 2 1 P (X = r) = × × 4 3 3 P (X = 1) = (for r ≥ 2). Therefore: " ! # 2 1 3 1 2 1 2 1 E(X) = + × 2× + 3× × + 4× × + ··· 4 4 3 3 3 3 3 " # 2 ! 1 1 2 2 + 4× = + 2+ 3× + ··· . 4 4 3 3 83 3. Random variables There is more than one way to evaluate this sum. " # " #! 2 2 1 1 2 2 2 2 E(X) = + × 1+ + + ··· + 1 + 2 × + 3 × + ··· 4 4 3 3 3 3 1 1 (3 + 9) = + 4 4 = 3.25. Note that 2.5 < 3.25 < 4, so the intelligent rat needs the least trials on average, while the stupid rat needs the most, as we would expect! 3.5 Continuous random variables A random variable (and its probability distribution) is continuous if it can have an uncountably infinite number of possible values.4 In other words, the set of possible values (the sample space) is the real numbers R, or one or more intervals in R. Example 3.15 An example of a continuous random variable, used here as an approximating model, is the size of claim made on an insurance policy (i.e. a claim by the customer to the insurance company), in £000s. Suppose the policy has a deductible of £999, so all claims are at least £1,000. Therefore, the possible values of this random variable are {x | x ≥ 1}. Most of the concepts introduced for discrete random variables have exact or approximate analogies for continuous random variables, and many results are the same for both types. However, there are some differences in the details. The most obvious difference is that wherever in the discrete case there are sums over the possible values of the random variable, in the continuous case these are integrals. Probability density function (pdf) For a continuous random variable X, the probability function is replaced by the probability density function (pdf), denoted as f (x) [or fX (x)]. 4 Strictly speaking, having an uncountably infinite number of possible values does not necessarily imply that it is a continuous random variable. For example, the Cantor distribution (not covered in this course) is neither a discrete nor an absolutely continuous probability distribution, nor is it a mixture of these. However, we will not consider this matter any further in this course. 84 3.5. Continuous random variables Example 3.16 Continuing the insurance example in Example 3.18, we consider a pdf of the following form: ( 0 for x < k f (x) = α k α /xα+1 for x ≥ k 1.0 0.0 0.5 f(x) 1.5 2.0 where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known number. In our example, k = 1 (due to the deductible). A probability distribution with this pdf is known as the Pareto distribution. 
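As a quick numerical sanity check, the following Python sketch (illustrative only and not part of the course material; the grid width and upper limit are arbitrary choices) evaluates this Pareto density with k = 1 and α = 2.2 and confirms that it integrates to approximately 1:

# Illustrative only: crude Riemann-sum check that the Pareto pdf integrates to 1.
def pareto_pdf(x, k=1.0, alpha=2.2):
    """f(x) = alpha * k^alpha / x^(alpha + 1) for x >= k, and 0 otherwise."""
    return alpha * k**alpha / x**(alpha + 1) if x >= k else 0.0

width = 0.001                      # grid width (arbitrary choice)
grid_sum = sum(pareto_pdf(1 + i * width) * width for i in range(500_000))
print(round(grid_sum, 2))          # approximately 1 (the tail beyond x = 501 is negligible)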
A graph of this pdf when α = 2.2 is shown in Figure 3.5. 1.0 1.5 2.0 2.5 3.0 3.5 4.0 x Figure 3.5: Probability density function for Example 3.16. Unlike for probability functions of discrete random variables, in the continuous case values of the probability density function are not probabilities of individual values, i.e. f (x) 6= P (X = x). In fact, for a continuous random variable: P (X = x) = 0 for all x. (3.4) That is, the probability that X has any particular value exactly is always 0. Because of (3.4), with a continuous random variable we do not need to be very careful about differences between < and ≤, and between > and ≥. Therefore, the following probabilities are all equal: P (a < X < b), P (a ≤ X ≤ b), P (a < X ≤ b) and P (a ≤ X < b). 85 3. Random variables Probabilities of intervals for continuous random variables Integrals of the pdf give probabilities of intervals of values such that: Z b f (x) dx P (a < X ≤ b) = a for any two numbers a < b. In other words, the probability that the value of X is between a and b is the area under f (x) between a and b. Here a can also be −∞, and/or b can be +∞. R3 1.5 f (x) dx. 1.0 0.0 0.5 f(x) 1.5 2.0 Example 3.17 In Figure 3.6, the shaded area is P (1.5 < X ≤ 3) = 1.0 1.5 2.0 2.5 3.0 3.5 4.0 x Figure 3.6: Probability density function showing P (1.5 < X ≤ 3). 86 3.5. Continuous random variables Properties of pdfs The pdf f (x) of any continuous random variable must satisfy the following conditions. 1. f (x) ≥ 0 for all x. 2. ∞ Z f (x) dx = 1. −∞ These are analogous to the conditions for probability functions of discrete distributions. Example 3.18 Continuing with the insurance example, we check that the conditions hold for the pdf: ( 0 for x < k f (x) = α k α /xα+1 for x ≥ k where α > 0 and k > 0. 1. Clearly, f (x) ≥ 0 for all x, since α > 0, k α > 0 and xα+1 ≥ k α+1 > 0. 2. We have: Z ∞ Z f (x) dx = −∞ k ∞ α kα dx = α k α xα+1 = αk α ∞ Z x−α−1 dx k 1 −α x−α ∞ k = (−k α )(0 − k −α ) = 1. Cumulative distribution function The cumulative distribution function (cdf) of a continuous random variable X is defined exactly as for discrete random variables, i.e. the cdf is: F (x) = P (X ≤ x) for all real numbers x. The general properties of the cdf stated previously also hold for continuous distributions. The cdf of a continuous distribution is not a step function, so results on discrete-specific properties do not hold in the continuous case. A continuous cdf is a smooth, continuous function of x. 87 3. Random variables Relationship between the cdf and pdf The cdf is obtained from the pdf through integration: Z x f (t) dt for all x. F (x) = P (X ≤ x) = −∞ The pdf is obtained from the cdf through differentiation: f (x) = F 0 (x). Activity 3.17 (a) Define the cumulative distribution function (cdf) of a random variable and state the principal properties of such a function. (b) Identify which, if any, of the following functions could be a cdf under suitable choices of the constants a and b. Explain why (or why not) each function satisfies the properties required of a cdf and the constraints which may be required in respect of the constants a and b. i. F (x) = a (b − x)2 for −1 ≤ x ≤ 1. ii. F (x) = a (1 − xb ) for −1 ≤ x ≤ 1. iii. F (x) = a − b exp (−x/2) for 0 ≤ x ≤ 2. Solution (a) We defined the cdf to be F (x) = P (X ≤ x) where: • 0 ≤ F (x) ≤ 1 • F (x) is non-decreasing • dF (x)/dx = f (x) and F (x) = Rx −∞ f (t) dt for continuous X • F (x) → 0 as x → −∞ and F (x) → 1 as x → ∞. (b) i. Okay. a = 0.25 and b = −1. ii. Not okay. 
At x = 1, F (x) = 0, which would mean a decreasing function. iii. Okay. a = b > 0 and b = (1 − e−1 )−1 . 88 3.5. Continuous random variables Example 3.19 Continuing the insurance example: Z x Z x α kα f (t) dt = dt α+1 k t −∞ Z x α (−α) t−α−1 dt = (−k ) k x = (−k α ) t−α k = (−k α )(x−α − k −α ) = 1 − k α x−α α k =1− . x Therefore: ( 0 F (x) = 1 − (k/x)α for x < k for x ≥ k. (3.5) If we were given (3.5), we could obtain the pdf by differentiation, since F 0 (x) = 0 when x < k, and: F 0 (x) = −k α (−α) x−α−1 = α kα xα+1 for x ≥ k. 0.0 0.2 0.4 F(x) 0.6 0.8 1.0 A plot of the cdf is shown in Figure 3.7. 1 2 3 4 5 6 7 x Figure 3.7: Cumulative distribution function for Example 3.19. 89 3. Random variables Probabilities from cdfs and pdfs Since P (X ≤ x) = F (x), it follows that P (X > x) = 1 − F (x). In general, for any two numbers a < b, we have: Z b f (x) dx = F (b) − F (a). P (a < X ≤ b) = a Example 3.20 Continuing with the insurance example (with k = 1 and α = 2.2), then: P (X ≤ 1.5) = F (1.5) = 1 − (1/1.5)2.2 ≈ 0.59 P (X ≤ 3) = F (3) = 1 − (1/3)2.2 ≈ 0.91 P (X > 3) = 1 − F (3) ≈ 1 − 0.91 = 0.09 P (1.5 ≤ X ≤ 3) = F (3) − F (1.5) ≈ 0.91 − 0.59 = 0.32. Example 3.21 Consider now a continuous random variable with the following pdf: ( λ e−λx for x > 0 f (x) = (3.6) 0 for x ≤ 0 where λ > 0 is a parameter. This is the pdf of the exponential distribution. The uses of this distribution will be discussed in the next chapter. Since: Z 0 x x λ e−λt dt = − e−λt 0 = 1 − e−λx the cdf of the exponential distribution is: ( 0 F (x) = 1 − e−λx for x ≤ 0 for x > 0. We now show that (3.6) satisfies the conditions for a pdf. 1. Since λ > 0 and ea > 0 for any a, f (x) ≥ 0 for all x. 2. Since we have just done the integration to derive the cdf F (x), we can also use it to show that f (x) integrates to one. This follows from: Z ∞ f (x) dx = P (−∞ < X < ∞) = lim F (x) − lim F (x) −∞ x→∞ which here is lim (1 − e−λx ) − 0 = (1 − 0) − 0 = 1. x→∞ 90 x→−∞ 3.5. Continuous random variables Expected value and variance of a continuous distribution Suppose X is a continuous random variable with pdf f (x). Definitions of its expected value, the expected value of any transformation g(X), the variance and standard deviation are the same as for discrete distributions, except that summation is replaced by integration: Z ∞ x f (x) dx E(X) = −∞ Z ∞ g(x) f (x) dx E[g(X)] = −∞ 2 Z ∞ Var(X) = E[(X − E(X)) ] = (x − E(X))2 f (x) dx = E(X 2 ) − (E(X))2 −∞ sd(X) = p Var(X). Example 3.22 Consider the exponential distribution introduced in Example 3.21. To find E(X) we can use integration by parts by considering x λ e−λx as the product of the functions f = x and g 0 = λ e−λx (so that g = −e−λx ). Therefore: Z ∞ Z ∞ 1 −λx ∞ −λx −λx ∞ −λx −λx ∞ E(X) = xλe dx = −x e e − −e dx = −x e − 0 0 0 λ 0 0 = [0 − 0] − = 1 [0 − 1] λ 1 . λ To obtain E(X 2 ), we choose f = x2 and g 0 = λ e−λx , and use integration by parts: Z ∞ Z ∞ 2 −λx ∞ 2 2 −λx E(X ) = x λe dx = −x e +2 x e−λx dx 0 0 0 =0+ = 2 λ Z ∞ x λ e−λx dx 0 2 λ2 where the last step follows because the last integral is simply E(X) = 1/λ again. Finally: 2 1 1 Var(X) = E(X 2 ) − (E(X))2 = 2 − 2 = 2 . λ λ λ Activity 3.18 A continuous random variable, X, has a probability density function, f (x), defined by: ( ax + bx2 for 0 ≤ x ≤ 1 f (x) = 0 otherwise 91 3. Random variables and E(X) = 1/2. Determine: (a) the constants a and b (b) the cumulative distribution function, F (x), of X (c) the variance, Var(X). Solution (a) We have: Z 1 Z f (x) dx = 1 1 ⇒ 0 0 ax2 bx3 ax + bx dx = + 2 3 2 1 =1 0 i.e. 
we have a/2 + b/3 = 1. Also, we know E(X) = 1/2, hence: Z 0 1 ax3 bx4 x (ax + bx ) dx = + 3 4 2 1 = 0 1 2 i.e. we have: a b 1 + = ⇒ a = 6 and b = −6. 3 4 2 Hence f (x) = 6x(1 − x) for 0 ≤ x ≤ 1, and 0 otherwise. (b) We have: 0 F (x) = 3x2 − 2x3 1 for x < 0 for 0 ≤ x ≤ 1 for x > 1. (c) Finally: 2 Z 1 2 Z x (6x(1 − x)) dx = E(X ) = 0 0 1 6x4 6x5 6x − 6x dx = − 4 5 3 4 and so the variance is: Var(X) = E(X 2 ) − (E(X))2 = 0.3 − 0.25 = 0.05. Activity 3.19 A continuous random variable X has the following pdf: ( x3 /4 for 0 ≤ x ≤ 2 f (x) = 0 otherwise. (a) Explain why f (x) can serve as a pdf. (b) Find the mean and mode of the distribution. 92 1 = 0.3. 0 3.5. Continuous random variables (c) Determine the cdf, F (x), of X. (d) Find the variance, Var(X). (e) Find the skewness of X, given by: E[(X − E(X))3 ] . σ3 (f) If a sample of five observations is drawn at random from the distribution, find the probability that all the observations exceed 1.5. Solution (a) Clearly, f (x) ≥ 0 for all x and: Z 2 0 4 2 x x3 dx = = 1. 4 16 0 (b) The mean is: Z ∞ 2 Z x f (x) dx = E(X) = −∞ 0 5 2 x4 32 x = dx = = 1.6 4 20 0 20 and the mode is 2 (where the density reaches a maximum). (c) The cdf is: for x < 0 0 F (x) = x4 /16 for 0 ≤ x ≤ 2 1 for x > 2. (d) For the variance, we first find E(X 2 ), given by: 6 2 Z 2 Z 2 5 x 64 x 8 2 2 x f (x) dx = E(X ) = dx = = = 24 0 24 3 0 0 4 hence: Var(X) = E(X 2 ) − (E(X))2 = (e) The third ‘moment about zero’ is: Z 2 Z 3 3 E(X ) = x f (x) dx = 0 0 2 8 64 8 − = ≈ 0.1067. 3 25 75 7 2 x x6 128 dx = = ≈ 4.5714. 4 28 0 28 Letting E(X) = µ, the numerator is: E[(X − E(X))3 ] = E(X 3 ) − 3 µ E(X 2 ) + 3 µ2 E(X) − µ3 = 4.5714 − (3 × 1.6 × 2.6667) + (3 × (1.6)3 ) − (1.6)3 which is −0.0368, and the denominator is (0.1067)3/2 = 0.0349, hence the skewness is −1.0544. 93 3. Random variables (f) The probability of a single observation exceeding 1.5 is: 2 Z 4 2 x3 x dx = = 1 − 0.3164 = 0.6836. 4 16 1.5 2 Z f (x) dx = 1.5 1.5 So the probability of all five exceeding 1.5 is, by independence: (0.6836)5 = 0.1493. Activity 3.20 A random variable X has 1/4 f (x) = 3/4 0 the following pdf: for 0 ≤ x ≤ 1 for 1 < x ≤ 2 otherwise. (a) Explain why f (x) can serve as a pdf. (b) Find the mean and median of the distribution. (c) Find the variance, Var(X). (d) Write down the cdf of X. (e) Find P (X = 1) and P (X > 1.5 | X > 0.5). Solution R∞ (a) Clearly, f (x) ≥ 0 for all x and −∞ f (x) dx = 1. This can be seen geometrically, since f (x) defines two rectangles, one with base 1 and height 1/4, the other with base 1 and height 3/4, giving a total area of 1/4 + 3/4 = 1. (b) We have: ∞ Z E(X) = 1 Z x f (x) dx = −∞ 0 x dx+ 4 Z 1 2 2 1 2 2 3x x 3x 1 3 3 5 dx = + = + − = . 4 8 0 8 1 8 2 8 4 The median is most simply found geometrically. The area to the right of the point x = 4/3 is 0.5, i.e. the rectangle with base 2 − 4/3 = 2/3 and height 3/4, giving an area of 2/3 × 3/4 = 1/2. Hence the median is 4/3. (c) For the variance, we proceed as follows: 2 Z ∞ E(X ) = 2 Z x f (x) dx = −∞ 0 1 3 1 3 2 Z 2 2 x2 3x x x 1 1 11 dx+ dx = + = +2− = . 4 4 12 0 4 1 12 4 6 1 Hence the variance is: Var(X) = E(X 2 ) − (E(X))2 = 94 11 25 88 75 13 − = − = ≈ 0.2708. 6 16 48 48 48 3.5. Continuous random variables (d) The cdf is: 0 x/4 F (x) = 3x/4 − 1/2 1 for for for for x<0 0≤x≤1 1<x≤2 x > 2. (e) P (X = 1) = 0, since the cdf is continuous, and: P (X > 1.5 | X > 0.5) = P (X > 1.5) P ({X > 1.5} ∩ {X > 0.5}) = P (X > 0.5) P (X > 0.5) 0.5 × 0.75 1 − 0.5 × 0.25 0.375 = 0.875 3 = ≈ 0.4286. 7 = Activity 3.21 Hard question! 
The waiting time, W , of a traveller queueing at a taxi rank is distributed according to the cumulative distribution function, G(w), defined by: for w < 0 0 G(w) = 1 − (2/3) exp(−w/2) for 0 ≤ w < 2 1 for w ≥ 2. (a) Sketch the cumulative distribution function. (b) Is the random variable W discrete, continuous or mixed? (c) Evaluate P (W > 1), P (W = 2), P (W ≤ 1.5 | W > 0.5) and E(W ). 95 3. Random variables Solution (a) A sketch of the cumulative distribution function is: G (w ) 1 1-(2/3)e -1 1/3 0 2 w (b) We see the distribution is mixed, with discrete ‘atoms’ at 0 and 2. (c) We have: P (W > 1) = 1 − G(1) = 2 −1/2 e , 3 P (W = 2) = 2 −1 e 3 and: P (W ≤ 1.5 | W > 0.5) = = P (0.5 < W ≤ 1.5) P (W > 0.5) G(1.5) − G(0.5) 1 − G(0.5) 1 − (2/3)e−1.5/2 − 1 − (2/3)e−0.5/2 = (2/3)e−0.5/2 = 1 − e−1/2 . Finally, the mean is: Z 2 2 −1 1 1 E(W ) = × 0 + e × 2 + w e−w/2 dw 3 3 3 0 −w/2 2 Z 2 4 −1 we 2 −w/2 = e + + e dw 3 3 −1/2 0 0 3 −w/2 2 4 −1 4 −1 2e = e − e + 3 3 3 −1/2 0 4 = (1 − e−1 ). 3 96 3.5. Continuous random variables Activity 3.22 Consider the function: ( λ2 x e−λ x f (x) = 0 for x > 0 otherwise. (a) Show that this function has the characteristics of a probability density function. (b) Evaluate E(X) and Var(X). Solution (a) Clearly, f (x) ≥ 0 for all x since λ2 > 0, x > 0 and e−λ x > 0. R∞ To show, −∞ f (x) dx = 1, we have: Z ∞ Z ∞ 2 f (x) dx = λ xe −∞ 0 −λx ∞ Z ∞ e−λx e−λx λ2 + dx dx = λ x −λ 0 λ 0 Z ∞ λ e−λx dx =0+ 2 0 = 1 (provided λ > 0). (b) For the mean: Z ∞ E(X) = x λ2 x e−λ x dx 0 = −x =0+ 2 ∞ λ e−λ x 0 2 λ ∞ Z 2 x λ e−λ x dx + 0 (from the exponential distribution). For the variance: Z ∞ Z 3 −λ x ∞ 2 2 2 −λ x E(X ) = x λ xe dx = −x λ e + 0 0 ∞ 0 3 x2 λ e−λ x dx = 6 . λ2 So, Var(X) = 6/λ2 − (2/λ)2 = 2/λ2 . Activity 3.23 A random variable, X, has a defined by: 0 F (x) = 1 − a e−x 1 cumulative distribution function, F (x), for x < 0 for 0 ≤ x < 1 for x ≥ 1. (a) Derive expressions for: i. P (X = 0) ii. P (X = 1) 97 3. Random variables iii. the pdf of X (where it is continuous) iv. E(X). (b) Suppose that E(X) = 0.75 (1 − e−1 ). Evaluate the median of X and Var(X). Solution (a) We have: i. P (X = 0) = F (0) = 1 − a. ii. P (X = 1) = lim (F (1) − F (x)) = 1 − (1 − a e−1 ) = a e−1 . x→1 −x iii. f (x) = a e , for 0 ≤ x < 1, and 0 otherwise. iv. The mean is: Z −1 E(X) = 0 × (1 − a) + 1 × (a e ) + 1 x a e−x dx 0 = a e−1 + [−x a e−x ]10 + 1 Z a e−x dx 0 = ae −1 − ae −1 + [−a e−x ]10 = a (1 − e−1 ). (b) The median, m, satisfies: −m F (m) = 0.5 = 1 − 0.75 e ⇒ 2 m = − ln = 0.4055. 3 Recall Var(X) = E(X 2 ) − (E(X))2 , so: 2 2 2 Z −1 E(X ) = 0 × (1 − a) + 1 × (a e ) + 1 x2 a e−x dx 0 = a e−1 + [−x2 a e−x ]10 + 2 Z 1 x a e−x dx 0 = ae −1 − ae −1 + 2(a − 2a e−1 ) = 2a − 4a e−1 . Hence: Var(X) = 2a − 4a e−1 − a2 (1 + e−2 − 2e−1 ) = 0.1716. Activity 3.24 A continuous random variable, X, has a probability density function, f (x), defined by: ( k sin(x) for 0 ≤ x ≤ π f (x) = 0 otherwise. 98 3.5. Continuous random variables (a) Determine the constant k and derive the cumulative distribution function, F (x), of X. (b) Find E(X) and Var(X). Solution (a) We have: Z ∞ Z f (x) dx = −∞ π k sin(x) dx = 1. 0 Therefore: [k (− cos(x))]π0 = 2k = 1 ⇒ 1 k= . 2 The cdf is hence: for x < 0 0 F (x) = (1 − cos(x))/2 for 0 ≤ x ≤ π 1 for x > π. (b) By symmetry, E(X) = π/2. Alternatively: Z π Z π 1 1 1 π 1 π π E(X) = x sin(x) dx = [x(− cos(x))]0 + cos(x) dx = + [sin(x)]π0 = . 
2 2 2 2 0 2 0 2 Next: 2 Z E(X ) = 0 π π Z π 1 1 2 x sin(x) dx = x (− cos(x)) + x cos(x) dx 2 2 0 0 Z π π2 π sin(x) dx = + [x sin(x)]0 − 2 0 2 = π2 − [− cos(x)]π0 2 π2 − 2. = 2 Therefore, the variance is: Var(X) = E(X 2 ) − (E(X))2 = Activity 3.25 A random variable, X, has the x/5 f (x) = (20 − 4x)/30 0 π2 π2 π2 −2− = − 2. 2 4 4 following pdf: for 0 < x < 2 for 2 ≤ x ≤ 5 otherwise. 99 3. Random variables (a) Sketch the graph of f (x). (b) Derive the cumulative distribution function, F (x), of X. (c) Find the mean and the standard deviation of X. Solution 0.2 0.0 0.1 f(x) 0.3 0.4 (a) The pdf of X has the following form: 0 1 2 3 4 5 x (b) We determine the cdf by integrating the pdf over the appropriate range, hence: F (x) = 0 x2 /10 for x ≤ 0 for 0 < x < 2 (10x − x2 − 10)/15 for 2 ≤ x ≤ 5 1 for x > 5. This results from the following calculations. Firstly, for x ≤ 0, we have: Z x x Z F (x) = f (t) dt = −∞ 0 dt = 0. −∞ For 0 < x < 2, we have: Z x F (x) = 0 f (t) dt = −∞ 100 Z Z 0 dt + −∞ 0 x 2 x t t x2 dt = = . 5 10 0 10 3.5. Continuous random variables For 2 ≤ x ≤ 5, we have: Z x Z F (x) = f (t) dt = Z x t 20 − 4t 0 dt + dt + dt 30 −∞ 0 5 2 x 2t t2 4 + − =0+ 10 3 15 2 4 2x x2 4 4 = + − − − 10 3 15 3 15 −∞ 0 Z 2 = 2x x2 2 − − 3 15 3 = 10x − x2 − 10 . 15 (c) To find the mean we proceed as follows: Z ∞ Z µ = E(X) = x f (x) dx = Z 5 x2 20x − 4x2 dx + dx 30 0 5 2 3 2 2 5 2x3 x x − = + 15 0 3 45 2 25 250 4 16 8 + − − = − 15 3 45 3 45 −∞ 2 7 = . 3 Similarly: ∞ 2 Z 5 x3 20x2 − 4x3 E(X ) = x f (x) dx = dx + dx 30 −∞ 0 5 2 4 2 3 5 x 2x x4 = + − 20 0 9 30 2 16 250 625 16 16 = − + − − 20 9 30 9 30 Z 2 Z 2 = 13 . 2 Hence the variance is: 2 7 117 98 19 = − = ≈ 1.0555. 3 18 18 18 √ Therefore, the standard deviation is σ = 1.0555 = 1.0274. 13 σ = E(X ) − (E(X)) = − 2 2 2 2 101 3. Random variables Activity 3.26 Let g(x) be defined as: for 0 ≤ x < 3 x/3 g(x) = −x/3 + 2 for 3 ≤ x ≤ 6 0 otherwise. (a) If f (x) = k g(x) is a probability density function, find the value of k. (b) Let X be a random variable with probability density function f (x). Find E(X). Solution (a) If one draws it on a diagram, it is simply a triangle with base length 6 (from 0 to 6 on the x-axis) and height 1 (the highest point at x = 3). Integrating this function is just finding the area of it, which is (6 × 1)/2 = 3. Hence R f (x) dx = 3k, and so we must have 3k = 1, implying k = 1/3. Alternatively, note that: Z 3 Z 6 Z 6 Z 6 x x − + 2 dx . dx + g(x) dx = k k g(x) dx = k 3 3 0 3 0 0 R We must have f (x) dx = 1. Hence: Z 1=k 0 3 x dx + k 3 2 2 3 6 x x 3k 3k x − + 2 dx = k + k − + 2x = + 3 6 0 6 2 2 3 6 Z 3 giving k = 1/3. (b) This can be done very quickly if one can realise that g(x), and hence f (x), is symmetric aroundR 3, and hence the mean, E(X), must be 3. Otherwise, you 6 need to calculate 0 x f (x) dx, which can be written as the sum of two integrals: Z 0 6 1 x f (x) dx = 3 Z 0 6 1 x g(x) dx = 3 Z 0 3 1 x2 dx + 3 3 Z 3 6 2 x − + 2x dx. 3 Therefore: 3 3 3 6 Z 3 2 Z 6 2 x x 2x x x x2 E(X) = dx+ − + dx = + − + = 1+(4−2) = 3. 9 3 27 0 27 3 3 0 9 3 3.5.1 Median of a random variable Recall from ST104a Statistics 1 that the sample median is essentially the observation ‘in the middle’ of a set of data, i.e. where half of the observations in the sample are smaller than the median and half of the observations are larger. The median of a random variable (i.e. of its probability distribution) is similar in spirit. 102 3.6. 
Median of a random variable

The median, m, of a continuous random variable X is the value which satisfies:

F(m) = 0.5.    (3.7)

So once we know F(x), we can find the median by solving (3.7).

Example 3.23 For the Pareto distribution we have:

F(x) = 1 − (k/x)^α   for x ≥ k.

So F(m) = 1 − (k/m)^α = 1/2 when:

(k/m)^α = 1/2  ⇔  k/m = 1/2^(1/α)  ⇔  m = k · 2^(1/α).

For example:

• when k = 1 and α = 2.2, the median is m = 2^(1/2.2) = 1.37
• when k = 1 and α = 0.8, the median is m = 2^(1/0.8) = 2.38.

Example 3.24 For the exponential distribution we have:

F(x) = 1 − e^(−λx)   for x > 0.

So F(m) = 1 − e^(−λm) = 1/2 when:

e^(−λm) = 1/2  ⇔  −λm = −log 2  ⇔  m = (log 2)/λ.

3.6 Overview of chapter

This chapter has formally introduced random variables, making a distinction between discrete and continuous random variables. Properties of probability distributions were discussed, including the determination of expected values and variances.

3.7 Key terms and concepts

Constant                            Continuous
Cumulative distribution function    Discrete
Estimators                          Expected value
Experiment                          Median
Outcome                             Parameter
Probability density function        Probability distribution
Probability (mass) function         Random variable
Standard deviation                  Step function
Variance

3.8 Sample examination questions

Solutions can be found in Appendix C.

1. Let X be a discrete random variable with expected value zero. Furthermore, it is given that P(X = 1) = 0.2 and P(X = 2) = 0.5. Suppose X only takes one other value besides 1 and 2.
(a) What is the other value X takes?
(b) What is the variance of X?

2. The random variable X has the probability density function given by:

f(x) = kx^2 for 0 < x < 1, and 0 otherwise

where k > 0 is a constant.
(a) Find the value of k.
(b) Compute E(X) and Var(X).

3. The random variable X has the probability density function given by f(x) = kx^2 (1 − x) for 0 < x < 1 (and 0 otherwise). Here k > 0 is a constant.
(a) Find the value of k.
(b) Compute Var(1/X).

Chapter 4
Common distributions of random variables

4.1 Synopsis of chapter content

This chapter formally introduces common 'families' of probability distributions which can be used to model various real-world phenomena.

4.2 Learning outcomes

After completing this chapter, you should be able to:

• summarise basic distributions such as the uniform, Bernoulli, binomial, Poisson, exponential and normal
• calculate probabilities of events for these distributions using the probability function, probability density function or cumulative distribution function
• determine probabilities using statistical tables, where appropriate
• state properties of these distributions such as the expected value and variance.

4.3 Introduction

In statistical inference we will treat observations:

X1, X2, . . . , Xn

(the sample) as values of a random variable X, which has some probability distribution (the population distribution).

How to choose the probability distribution? Usually we do not try to invent new distributions from scratch. Instead, we use one of many existing standard distributions. There is a large number of such distributions, such that for most purposes we can find a suitable standard distribution.

This part of the course introduces some of the most common standard distributions for discrete and continuous random variables. Probability distributions may differ from each other in a broader or narrower sense.
In the broader sense, we have different families of distributions which may have quite different characteristics, for example: continuous versus discrete among discrete: a finite versus an infinite number of possible values among continuous: different sets of possible values (for example, all real numbers x, x > 0, or x ∈ [0, 1]); symmetric versus skewed distributions. The ‘distributions’ discussed in this chapter are really families of distributions in this sense. In the narrower sense, individual distributions within a family differ in having different values of the parameters of the distribution. The parameters determine the mean and variance of the distribution, values of probabilities from it etc. In the statistical analysis of a random variable X we typically: select a family of distributions based on the basic characteristics of X use observed data to choose (estimate) values for the parameters of that distribution, and perform statistical inference on them. Example 4.1 An opinion poll on a referendum, where each Xi is an answer to the question ‘Will you vote ‘Yes’ or ‘No’ to leaving the European Union?’ has answers recorded as Xi = 0 if ‘No’ and Xi = 1 if ‘Yes’. In a poll of 950 people, 513 answered ‘Yes’. How do we choose a distribution to represent Xi ? Here we need a family of discrete distributions with only two possible values (0 and 1). The Bernoulli distribution (discussed in the next section), which has one parameter π (the probability that Xi = 1) is appropriate. Within the family of Bernoulli distributions, we use the one where the value of π is our best estimate based on the observed data. This is π b = 513/950 = 0.54. 4.4 Common discrete distributions For discrete random variables, we will consider the following distributions. Discrete uniform distribution. Bernoulli distribution. Binomial distribution. Poisson distribution. 106 4.4. Common discrete distributions 4.4.1 Discrete uniform distribution Suppose a random variable X has k possible values 1, 2, . . . , k. X has a discrete uniform distribution if all of these values have the same probability, i.e. if: ( 1/k p(x) = P (X = x) = 0 for x = 1, 2, . . . , k otherwise. Example 4.2 A simple example of the discrete uniform distribution is the distribution of the score of a fair die, with k = 6. The discrete uniform distribution is not very common in applications, but it is useful as a reference point for more complex distributions. Mean and variance of a discrete uniform distribution Calculating directly from the definition,1 we have: E(X) = k X x p(x) = x=1 k+1 1 + 2 + ··· + k = k 2 (4.1) and: E(X 2 ) = k X x2 p(x) = x=1 12 + 22 + · · · + k 2 (k + 1)(2k + 1) = . k 6 (4.2) Therefore: Var(X) = E(X 2 ) − (E(X))2 = 4.4.2 k2 − 1 . 12 Bernoulli distribution A Bernoulli trial is an experiment with only two possible outcomes. We will number these outcomes 1 and 0, and refer to them as ‘success’ and ‘failure’, respectively. Example 4.3 Examples of outcomes of Bernoulli trials are: agree / disagree male / female employed / not employed owns a car / does not own a car business goes bankrupt / continues trading. 1 (4.1) and (4.2) make use, respectively, of n P i=1 i = n(n + 1)/2 and n P i2 = n(n + 1)(2n + 1)/6. i=1 107 4. Common distributions of random variables The Bernoulli distribution is the distribution of the outcome of a single Bernoulli trial. This is the distribution of a random variable X with the following probability function: ( π x (1 − π)1−x for x = 0, 1 p(x) = 0 otherwise. 
Therefore, P (X = 1) = π and P (X = 0) = 1 − P (X = 1) = 1 − π, and no other values are possible. Such a random variable X has a Bernoulli distribution with (probability) parameter π. This is often written as: X ∼ Bernoulli(π). If X ∼ Bernoulli(π), then: E(X) = 1 X x p(x) = 0 × (1 − π) + 1 × π = π (4.3) x=0 2 E(X ) = 1 X x2 p(x) = 02 × (1 − π) + 12 × π = π x=0 and: Var(X) = E(X 2 ) − (E(X))2 = π − π 2 = π (1 − π). (4.4) Activity 4.1 Suppose {Bi } is an infinite sequence of independent Bernoulli trials with: P (Bi = 0) = 1 − π and P (Bi = 1) = π for all i. (a) Derive the distribution of Xn = n P Bi and the expected value and variance of i=1 Xn . (b) Let Y = min{i : Bi = 1}. Derive the distribution of Y and obtain an expression for P (Y > y). Solution (a) Xn = n P Bi takes the values 0, 1, . . . , n. Any sequence consisting of x 1s and i=1 x n−x n and gives a value Xn = x. There are −x 0s has a probability π (1 − π) n such sequences, so: x n x P (Xn = x) = π (1 − π)n−x x and 0 otherwise. Hence E(Bi ) = π and Var(Bi ) = π (1 − π) which means E(Xn ) = n π and Var(Xn ) = n π (1 − π). (b) Y = min{i : Bi = 1} takes the values 1, 2, 3, . . ., hence: P (Y = y) = (1 − π)y−1 π and 0 otherwise. It follows that P (Y > y) = (1 − π)y . 108 4.4. Common discrete distributions 4.4.3 Binomial distribution Suppose we carry out n Bernoulli trials such that: at each trial, the probability of success is π different trials are statistically independent events. Let X denote the total number of successes in these n trials. X follows a binomial distribution with parameters n and π, where n ≥ 1 is a known integer and 0 ≤ π ≤ 1. This is often written as: X ∼ Bin(n, π). The binomial distribution was first encountered in Example 3.14. Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers. James is taking the test, but has no idea at all about the correct answers. So he guesses every answer and, therefore, has the probability of 1/4 of getting any individual question correct. Let X denote the number of correct answers in James’ test. X follows the binomial distribution with n = 4 and π = 0.25, i.e. we have: X ∼ Bin(4, 0.25). For example, what is the probability that James gets 3 of the 4 questions correct? Here it is assumed that the guesses are independent, and each has the probability π = 0.25 of being correct. The probability of any particular sequence of 3 correct and 1 incorrect answers, for example 1110, is π 3 (1 − π)1 , where ‘1’ denotes a correct answer and ‘0’ denotes an incorrect answer. However, we do not care about the order of the 1s and 0s, only about the number of 1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these also has the probability π 3 (1 − π)1 . The total number of sequences with three 1s (and, therefore, one 0) is the number of locations for the three 1s which can be selected in the sequence of 4 answers. This is 4 = 4. Therefore, the probability of obtaining three 1s is: 3 4 3 π (1 − π)1 = 4 × (0.25)3 × (0.75)1 ≈ 0.0469. 3 Binomial distribution probability function In general, the probability function of X ∼ Bin(n, π) is: ( n π x (1 − π)n−x for x = 0, 1, . . . , n x p(x) = 0 otherwise. (4.5) 109 4. Common distributions of random variables We have already shown that (4.5) satisfies the conditions for being a probability function in the previous chapter (see Example 3.14). 
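To make the formula concrete, binomial probabilities can be evaluated directly. The following Python sketch is illustrative only and not part of the course material (it assumes Python 3.8+ for math.comb); it reproduces the probability 0.0469 of exactly 3 correct guesses from Example 4.4 and checks that the probabilities in (4.5) sum to 1.

from math import comb

def binomial_pf(x, n, pi):
    """p(x) = C(n, x) * pi^x * (1 - pi)^(n - x) for x = 0, 1, ..., n."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# Example 4.4: James guesses 4 questions, each correct with probability 0.25.
print(round(binomial_pf(3, 4, 0.25), 4))                           # 0.0469
print(round(sum(binomial_pf(x, 4, 0.25) for x in range(5)), 10))   # 1.0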
Example 4.5 Continuing Example 4.4, where X ∼ Bin(4, 0.25), we have:

p(0) = \binom{4}{0} (0.25)^0 (0.75)^4 = 0.3164
p(1) = \binom{4}{1} (0.25)^1 (0.75)^3 = 0.4219
p(2) = \binom{4}{2} (0.25)^2 (0.75)^2 = 0.2109
p(3) = \binom{4}{3} (0.25)^3 (0.75)^1 = 0.0469
p(4) = \binom{4}{4} (0.25)^4 (0.75)^0 = 0.0039.

If X ∼ Bin(n, π), then:

E(X) = nπ   and   Var(X) = nπ(1 − π).

Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4 possible answers. Consider again James who guesses each one of the answers. Let X denote the number of correct answers by such a student, so that we have X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is E(X) = 20 × 0.25 = 5.

The teacher wants to set the pass mark of the examination so that, for such a student, the probability of passing is less than 0.05. What should the pass mark be? In other words, what is the smallest x such that P(X ≥ x) < 0.05, i.e. such that P(X < x) ≥ 0.95?

Calculating the probabilities of x = 0, 1, . . . , 20 we get (rounded to 2 decimal places):

x      0     1     2     3     4     5     6     7     8     9     10
p(x)   0.00  0.02  0.07  0.13  0.19  0.20  0.17  0.11  0.06  0.03  0.01

x      11    12    13    14    15    16    17    18    19    20
p(x)   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Calculating the cumulative probabilities, we find that F(7) = P(X < 8) = 0.898 and F(8) = P(X < 9) = 0.959. Therefore, P(X ≥ 8) = 0.102 > 0.05 and also P(X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.

More generally, consider a student who has the same probability π of the correct answer for every question, so that X ∼ Bin(20, π). Figure 4.1 shows plots of the probabilities for π = 0.25, 0.5, 0.7 and 0.9.

Figure 4.1: Probability plots for Example 4.6. (Four panels plotting probability against the number of correct answers for X ∼ Bin(20, π), with π = 0.25 (E(X) = 5), π = 0.5 (E(X) = 10), π = 0.7 (E(X) = 14) and π = 0.9 (E(X) = 18).)

Activity 4.2 A binomial random variable X has probability function:

p(x) = \binom{n}{x} π^x (1 − π)^(n−x) for x = 0, 1, 2, . . . , n, and 0 otherwise.

Consider this distribution in the case where n = 4 and π = 0.8. For this distribution, calculate the expected value and variance of X. (Note that E(X) = nπ and Var(X) = nπ(1 − π) for this distribution. Check that your answer agrees with this.)

Solution
Substituting the values into the definitions we get:

E(X) = Σ x p(x) = 0 × 0.0016 + 1 × 0.0256 + · · · + 4 × 0.4096 = 3.2

E(X^2) = Σ x^2 p(x) = 0 × 0.0016 + 1 × 0.0256 + · · · + 16 × 0.4096 = 10.88

and:

Var(X) = E(X^2) − (E(X))^2 = 10.88 − (3.2)^2 = 0.64.

Note that E(X) = nπ = 4 × 0.8 = 3.2 and Var(X) = nπ(1 − π) = 4 × 0.8 × (1 − 0.8) = 0.64 for n = 4 and π = 0.8, as stated by the general formulae.

Activity 4.3 A certain electronic system contains 12 components. Suppose that the probability that each individual component will fail is 0.3 and that the components fail independently of each other. Given that at least two of the components have failed, what is the probability that at least three of the components have failed?

Solution
Let X denote the number of components which will fail, hence X ∼ Bin(12, 0.3). Therefore:

P(X ≥ 3 | X ≥ 2) = P(X ≥ 3) / P(X ≥ 2)
                 = (1 − P(X = 0) − P(X = 1) − P(X = 2)) / (1 − P(X = 0) − P(X = 1))
                 = (1 − 0.0138 − 0.0712 − 0.1678) / (1 − 0.0138 − 0.0712)
                 = 0.7472 / 0.9150
                 = 0.8166.
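The conditional probability in Activity 4.3, and the pass-mark calculation in Example 4.6, are easy to verify numerically. The Python sketch below is illustrative only and not part of the course material (it assumes Python 3.8+ for math.comb).

from math import comb

def binomial_pf(x, n, pi):
    """Binomial probability function p(x) for X ~ Bin(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# Activity 4.3: X ~ Bin(12, 0.3), P(X >= 3 | X >= 2) = P(X >= 3) / P(X >= 2).
p_ge_2 = 1 - sum(binomial_pf(x, 12, 0.3) for x in range(2))
p_ge_3 = 1 - sum(binomial_pf(x, 12, 0.3) for x in range(3))
print(round(p_ge_3 / p_ge_2, 4))        # 0.8166, as in the solution above

# Example 4.6: smallest pass mark with P(X >= mark) < 0.05 when X ~ Bin(20, 0.25).
for mark in range(21):
    if 1 - sum(binomial_pf(x, 20, 0.25) for x in range(mark)) < 0.05:
        print(mark)                     # 9
        break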
Activity 4.4 A greengrocer has a very large pile of oranges on his stall. The pile of fruit is a mixture of 50% old fruit with 50% new fruit; one cannot tell which are old and which are new. However, 20% of old oranges are mouldy inside, but only 10% of new oranges are mouldy. Suppose that you choose 5 oranges at random. What is the distribution of the number of mouldy oranges in your sample? Solution For an orange chosen at random, the event ‘mouldy’ is the union of the disjoint events ‘mouldy’ ∩ ‘new’ and ‘mouldy’ ∩ ‘old’. So: P (‘mouldy’) = P (‘mouldy’ ∩ ‘new’) + P (‘mouldy’ ∩ ‘old’) = P (‘mouldy’ | ‘new’) P (‘new’) + P (‘mouldy’ | ‘old’) P (‘old’) = 0.1 × 0.5 + 0.2 × 0.5 = 0.15. As the pile of oranges is very large, we can assume that the results for the five oranges will be independent, so we have 5 independent trials each with probability of ‘mouldy’ equal to 0.15. The distribution of the number of mouldy oranges will be a binomial distribution with n = 5 and π = 0.15. Activity 4.5 Metro trains on a particular line have a probability 0.05 of failure between two stations. Supposing that the failures are all independent, what is the probability that out of 10 journeys between these two stations more than 8 do not have a breakdown? 112 4.4. Common discrete distributions Solution The probability of no breakdown on one journey is π = 1 − 0.05 = 0.95, so the number of journeys without a breakdown, X, has a Bin(10, 0.95) distribution. We want P (X > 8), which is: P (X > 8) = p(9) + p(10) 10 10 9 1 = × (0.95) × (0.05) + × (0.95)10 × (0.05)0 9 10 = 0.3151 + 0.5987 = 0.9138. Activity 4.6 Hard question! Show that for a binomial random variable X ∼ Bin(n, π), then: E(X) = n π n X x=1 (n − 1)! π x−1 (1 − π)n−x . (x − 1)! (n − x)! Hence find E(X) and Var(X). (The wording of the question implies that you use the result which you have just proved. Other methods of derivation will not be accepted!) Solution For X ∼ Bin(n, π), P (X = x) = n x π x (1 − π)n−x . So, for E(X), we have: n X n x x π (1 − π)n−x E(X) = x x=0 = n X x=1 = n X x=1 = nπ n x x π (1 − π)n−x x n (n − 1)! π π x−1 (1 − π)n−x (x − 1)! [(n − 1) − (x − 1)]! n X n−1 x=1 = nπ x−1 n−1 X n−1 y=0 y π x−1 (1 − π)n−x π y (1 − π)(n−1)−y = nπ × 1 = nπ where y = x − 1, and the last summation is over all the values of the pf of another binomial distribution, this time with possible values 0, 1, . . . , n − 1 and probability parameter π. 113 4. Common distributions of random variables Similarly: n X n x E(X(X − 1)) = x (x − 1) π (1 − π)n−x x x=0 = n X x (x − 1) n! x=2 (n − x)! x! = n (n − 1) π 2 π x (1 − π)n−x n X (n − 2)! π x−2 (1 − π)n−x (n − x)! (x − 2)! x=2 = n (n − 1) π 2 n−2 X (n − 2)! π y (1 − π)n−y−2 (n − y − 2)! y! y=0 with y = x − 2. Now let m = n − 2, so: E(X(X − 1)) = n (n − 1) π 2 m X y=0 m! π y (1 − π)m−y (m − y)! y! = n (n − 1) π 2 since the summation is 1, as before. Finally: Var(X) = E(X(X − 1)) − E(X) [E(X) − 1] = n (n − 1) π 2 − n π (n π − 1) = −n π 2 + n π = n π (1 − π). Activity 4.7 Hard question! Suppose that the normal rate of infection for a certain disease in cattle is 25%. To test a new serum which may prevent infection, three experiments are carried out. The test for infection is not always valid for some particular cattle, so the experimental results are incomplete – we cannot always tell whether a cow is infected or not. 
The results of the three experiments are:

(a) 10 animals are injected; all 10 remain free from infection

(b) 17 animals are injected; more than 15 remain free from infection and there are 2 doubtful cases

(c) 23 animals are injected; more than 20 remain free from infection and there are three doubtful cases.

Which experiment provides the strongest evidence in favour of the serum?

Solution

These experiments involve tests on different cattle, which one might expect to behave independently of one another. The probability of infection without injection with the serum might also reasonably be assumed to be the same for all cattle. So the distribution which we need here is the binomial distribution. If the serum has no effect, then the probability of infection for each of the cattle is 0.25.

One way to assess the evidence of the three experiments is to calculate the probability of the result of each experiment if the serum had no effect at all. If it has an effect, then one would expect larger numbers of cattle to remain free from infection, so the experimental results as given do provide some clue as to whether the serum has an effect, in spite of their incompleteness.

Let X(n) be the number of cattle infected, out of a sample of n. We are assuming that X(n) ∼ Bin(n, 0.25).

(a) With 10 trials, the probability of 0 infected if the serum has no effect is:

P(X(10) = 0) = \binom{10}{0} (0.75)^{10} = (0.75)^{10} = 0.0563.

(b) With 17 trials, the probability of more than 15 remaining uninfected if the serum has no effect is:

P(X(17) < 2) = P(X(17) = 0) + P(X(17) = 1) = \binom{17}{0} (0.75)^{17} + \binom{17}{1} (0.25)^1 (0.75)^{16} = (0.75)^{17} + 17 × (0.25)^1 × (0.75)^{16} = 0.0075 + 0.0426 = 0.0501.

(c) With 23 trials, the probability of more than 20 remaining free from infection if the serum has no effect is:

P(X(23) < 3) = P(X(23) = 0) + P(X(23) = 1) + P(X(23) = 2) = (0.75)^{23} + 23 × 0.25 × (0.75)^{22} + \frac{23 × 22}{2} × (0.25)^2 × (0.75)^{21} = 0.0013 + 0.0103 + 0.0376 = 0.0492.

The most surprising-looking result under the 'no effect' assumption is that of experiment (c), and so we can say that this experiment offers the most support for the use of the serum.

4.4.4 Poisson distribution

The possible values of the Poisson distribution are the non-negative integers 0, 1, 2, . . ..

Poisson distribution probability function

The probability function of the Poisson distribution is:

p(x) = e^{−λ} λ^x / x! for x = 0, 1, 2, . . ., and p(x) = 0 otherwise    (4.6)

where λ > 0 is a parameter.

If a random variable X has a Poisson distribution with parameter λ, this is often denoted by:

X ∼ Poisson(λ) or X ∼ Pois(λ).

If X ∼ Poisson(λ), then:

E(X) = λ and Var(X) = λ.

Poisson distributions are used for counts of occurrences of various kinds. To give a formal motivation, suppose that we consider the number of occurrences of some phenomenon in time, and that the process which generates the occurrences satisfies the following conditions:

1. The numbers of occurrences in any two disjoint intervals of time are independent of each other.

2. The probability of two or more occurrences at the same time is negligibly small.

3. The probability of one occurrence in any short time interval of length t is λt for some constant λ > 0.

In essence, these state that individual occurrences should be independent, sufficiently rare, and happen at a constant rate λ per unit of time. A process like this is a Poisson process.
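Before the Poisson distribution is used in the examples which follow, it is easy to check numerically that (4.6) behaves as claimed. A minimal sketch, assuming Python with numpy and scipy; the value λ = 2.5 is an arbitrary choice for illustration.

```python
# Sanity check of the Poisson pf (4.6): probabilities sum to 1 and E(X) = Var(X) = lambda.
# Assumes numpy and scipy; lambda = 2.5 is an arbitrary illustrative value.
import numpy as np
from scipy.stats import poisson

lam = 2.5
x = np.arange(200)                       # a generous truncation of the support 0, 1, 2, ...
p = poisson.pmf(x, lam)

print(p.sum())                           # approximately 1
mean = (x * p).sum()
print(mean)                              # approximately 2.5 = lambda
print((x**2 * p).sum() - mean**2)        # approximately 2.5 = lambda
```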
If occurrences are generated by a Poisson process, then the number of occurrences in a randomly selected time interval of length t = 1, X, follows a Poisson distribution with mean λ, i.e. X ∼ Poisson(λ). The single parameter λ of the Poisson distribution is, therefore, the rate of occurrences per unit of time. Example 4.7 Examples of variables for which we might use a Poisson distribution: The number of telephone calls received at a call centre per minute. 116 4.4. Common discrete distributions The number of accidents on a stretch of motorway per week. The number of customers arriving at a checkout per minute. The number of misprints per page of newsprint. Because λ is the rate per unit of time, its value also depends on the unit of time (that is, the length of interval) we consider. Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if Y is the number of arrivals per two hours, Y ∼ Poisson(1.5 × 2) = Poisson(3). λ is also the mean of the distribution, i.e. E(X) = λ. Both motivations suggest that distributions with higher values of λ have higher probabilities of large values of X. 0.25 Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for X ∼ Poisson(2) and X ∼ Poisson(4). 0.15 0.00 0.05 0.10 p(x) 0.20 λ=2 λ=4 0 2 4 6 8 10 x Figure 4.2: Probability plots for Example 4.9. Example 4.10 Customers arrive at a bank on weekday afternoons randomly at an average rate of 1.6 customers per minute. Let X denote the number of arrivals per minute and Y denote the number of arrivals per 5 minutes. We assume a Poisson distribution for both, such that: X ∼ Poisson(1.6) and: Y ∼ Poisson(1.6 × 5) = Poisson(8). 117 4. Common distributions of random variables 1. What is the probability that no customer arrives in a one-minute interval? For X ∼ Poisson(1.6), the probability P (X = 0) is: pX (0) = e−λ λ0 e−1.6 (1.6)0 = = e−1.6 = 0.2019. 0! 0! 2. What is the probability that more than two customers arrive in a one-minute interval? P (X > 2) = 1 − P (X ≤ 2) = 1 − [P (X = 0) + P (X = 1) + P (X = 2)] which is: 1 − pX (0) − pX (1) − pX (2) = 1 − e−1.6 (1.6)0 e−1.6 (1.6)1 e−1.6 (1.6)2 − − 0! 1! 2! = 1 − e−1.6 − 1.6 e−1.6 − 1.28 e−1.6 = 1 − 3.88 e−1.6 = 0.2167. 3. What is the probability that no more than 1 customer arrives in a five-minute interval? For Y ∼ Poisson(8), the probability P (Y ≤ 1) is: pY (0) + pY (1) = e−8 80 e−8 81 + = e−8 + 8 e−8 = 9 e−8 = 0.0030. 0! 1! Activity 4.8 Cars independently pass a point on a busy road at an average rate of 150 per hour. (a) Assuming a Poisson distribution, find the probability that none passes in a given minute. (b) What is the expected number passing in two minutes? (c) Find the probability that the expected number actually passes in a given two-minute period. Solution (a) A rate of 150 cars per hour is a rate of 2.5 per minute. Using a Poisson distribution with λ = 2.5, P (none passes) = e−2.5 × (2.5)0 /0! = e−2.5 = 0.0821. (b) The expected number of cars passing in two minutes is 2 × 2.5 = 5. (c) The probability of 5 cars passing in two minutes is e−5 × 55 /5! = 0.1755. Activity 4.9 People entering an art gallery are counted by the attendant at the door. Assume that people arrive in accordance with a Poisson distribution, with one person arriving every 2 minutes. The attendant leaves the door unattended for 5 118 4.4. Common discrete distributions minutes. (a) Calculate the probability that: i. nobody will enter the gallery in this time ii. 3 or more people will enter the gallery in this time. 
(b) Find, to the nearest second, the length of time for which the attendant could leave the door unattended for there to be a probability of 0.9 of no arrivals in that time. (c) Comment briefly on the assumption of a Poisson distribution in this context. Solution (a) λ = 1 for a two-minute interval, so λ = 2.5 for a five-minute interval. Therefore: P (no arrivals) = e−2.5 = 0.0821 and: P (≥ 3 arrivals) = 1 − p(0) − p(1) − p(2) = 1 − e−2.5 (1 + 2.5 + 3.125) = 0.4562. (b) For an interval of N minutes, the parameter is N/2. We need p(0) = 0.9, so e−N/2 = 0.9 giving N/2 = − ln(0.9) and N = 0.21 minutes, or 13 seconds. (c) The rate is unlikely to be constant: more people at lunchtimes or early evenings etc. Likely to be several arrivals in a small period – couples, groups etc. Quite unlikely the Poisson will provide a good model. Activity 4.10 In a large industrial plant there is an accident on average every two days. (a) What is the chance that there will be exactly two accidents in a given week? (b) What is the chance that there will be two or more accidents in a given week? (c) If James goes to work there for a four-week period, what is the probability that no accidents occur while he is there? Solution Here we have counts of random events over time, which is a typical application for the Poisson distribution. We are assuming that accidents are equally likely to occur at any time and are independent. The mean for the Poisson distribution is 0.5 per day. Let X be the number of accidents in a week. The probability of exactly two accidents in a given week is found by using the parameter λ = 5 × 0.5 = 2.5 (5 working days a week assumed). 119 4. Common distributions of random variables (a) The probability of exactly two accidents in a week is: p(2) = e−2.5 (2.5)2 = 0.2565. 2! (b) The probability of two or more accidents in a given week is: P (X ≥ 2) = 1 − p(0) − p(1) = 0.7127. (c) If James goes to the industrial plant and does not change the probability of an accident simply by being there (he might bring bad luck, or be superbly safety-conscious!), then over 4 weeks there are 20 working days, and the probability of no accident comes from a Poisson random variable with mean 10. If Y is the number of accidents while James is there, the probability of no accidents is: e−10 (10)0 = 0.0000454. pY (0) = 0! James is very likely to be there when there is an accident! Activity 4.11 Arrivals at a post office may be modelled as following a Poisson distribution with a rate parameter of 84 arrivals per hour. (a) Find: i. the probability of exactly seven arrivals in a period of two minutes ii. the probability of more than three arrivals in 45 seconds iii. the probability that the time to arrival of the next customer is less than one minute. (b) If T is the time to arrival of the next customer (in minutes), calculate: P (T > 2.3 | T > 1). Solution (a) The rate is given as 84 per hour, but it is convenient to work in numbers of minutes, so note that this is the same as λ = 1.4 arrivals per minute. i. For two minutes, use λ = 1.4 × 2 = 2.8. Hence: P (X = 7) = e−2.8 (2.8)7 = 0.0163. 7! ii. For 45 seconds, λ = 1.4 × 0.75 = 1.05. Hence: P (X > 3) = 1 − P (X ≤ 3) = 1 − 3 X e−1.05 (1.05)x x=0 x! = 1 − e−1.05 (1 + 1.05 + 0.5513 + 0.1929) = 0.0222. 120 4.4. Common discrete distributions iii. The probability that the time to arrival of the next customer is less than one minute is 1 − P (no arrivals in one minute) = 1 − P (X = 0). For one minute we use λ = 1.4, hence: e−1.4 (1.4)0 = 1 − e−1.4 = 1 − 0.2466 = 0.7534. 
1 − P (X = 0) = 1 − 0! (b) The time to the next customer is more than t if there are no arrivals in the interval from 0 to t, which means that we need to use λ = 1.4 × t. Now the conditional probability formula yields: P (T > 2.3 | T > 1) = P ({T > 2.3} ∩ {T > 1}) P (T > 1) and, as in other instances, the two events {T > 2.3} and {T > 1} collapse to a single event, {T > 2.3}. Hence: P (T > 2.3 | T > 1) = P (T > 2.3) P (T > 2.3) = . P (T > 1) 0.2466 To calculate the numerator, use λ = 1.4 × 2.3 = 3.22, hence (by the same method as in (iii.): P (T > 2.3) = e−3.22 (3.22)0 = e−3.22 = 0.0400. 0! Hence: P (T > 2.3 | T > 1) = P (T > 2.3) 0.0400 = = 0.1620. P (T > 1) 0.2466 Activity 4.12 A glacier in Greenland ‘calves’ (lets fall off into the sea) an iceberg on average twice every five weeks. (Seasonal effects can be ignored for this question, and so the calving process can be thought of as random, i.e. the calving of icebergs can be assumed to be independent events.) (a) Explain which distribution you would use to estimate the probabilities of different numbers of icebergs being calved in different periods, justifying your selection. (b) What is the probability that no iceberg is calved in the next three weeks? (c) What is the probability that no iceberg is calved in the three weeks after the next three weeks? (d) What is the probability that exactly five icebergs are calved in the next four weeks? (e) If exactly five icebergs are calved in the next four weeks, what is the probability that exactly five more icebergs will be calved in the four-week period after the next four weeks? (f) Comment on the relationship between your answers to (d) and (e). 121 4. Common distributions of random variables Solution (a) If we assume that the calving process is random (as the remark about seasonality hints) then we are counting events over periods of time (with, in particular, no obvious upper maximum), and hence the appropriate distribution is the Poisson distribution. (b) The rate parameter for one week is 0.4, so for three weeks we use λ = 1.2, hence: P (X = 0) = e−1.2 × (1.2)0 = e−1.2 = 0.3012. 0! (c) If it is correct to use the Poisson distribution then events are independent, and hence: P (none in weeks 1, 2 & 3) = P (none in weeks 4, 5 & 6) = 0.3012. (d) The rate parameter for four weeks is λ = 1.6, hence: P (X = 5) = e−1.6 × (1.6)5 = 0.0176. 5! (e) Bayes’ theorem tells us that: P (5 in weeks 5 to 8 | 5 in weeks 1 to 4) = P (5 in weeks 5 to 8 ∩ 5 in weeks 1 to 4) . P (5 in weeks 1 to 4) If it is correct to use the Poisson distribution then events are independent. Therefore: P (5 in weeks 5 to 8 ∩ 5 in weeks 1 to 4) = P (5 in weeks 5 to 8) P (5 in weeks 1 to 4). So, cancelling, we get: P (5 in weeks 5 to 8 | 5 in weeks 1 to 4) = P (5 in weeks 5 to 8) = P (5 in weeks 1 to 4) = 0.0176. (f) The fact that the results are identical in the two cases is a consequence of the independence built into the assumption that the Poisson distribution is the appropriate one to use. A Poisson process does not ‘remember’ what happened before the start of a period under consideration. Activity 4.13 Hard question! A discrete random variable X has possible values 0, 1, 2, . . ., and the probability function: ( e−λ λx /x! for x = 0, 1, 2, . . . p(x) = 0 otherwise 122 4.4. Common discrete distributions where λ > 0 is a parameter. Show that E(X) = λ by determining P x p(x). Solution We have: ∞ ∞ X e−λ λx X e−λ λx = x E(X) = x p(x) = x x! x! x=1 x=0 x=0 ∞ X =λ ∞ X e−λ λx−1 x=1 =λ (x − 1)! ∞ X e−λ λy y=0 y! 
=λ×1 =λ where we replace x − 1 with y. The result follows from the fact that ∞ P (e−λ λy )/y! is y=0 the sum of all non-zero values of a probability function of this form. For completeness, we also give here a derivation of the variance of this distribution. Consider first: E[X(X − 1)] = ∞ X x(x − 1) p(x) = x=0 ∞ X x(x − 1) x=2 =λ 2 ∞ X e−λ λx−2 x=2 =λ 2 e−λ λx x! (x − 2)! ∞ X e−λ λy y=0 y! = λ2 where y = x − 2. Also: E[X(X − 1)] = E(X 2 − X) = X X X (x2 − x) p(x) = x2 p(x) − x p(x) x x x = E(X 2 ) − E(X) = E(X 2 ) − λ. Equating these and solving for E(X 2 ) we get E(X 2 ) = λ2 + λ. Therefore: Var(X) = E(X 2 ) − (E(X))2 = λ2 + λ − (λ)2 = λ. 123 4. Common distributions of random variables Activity 4.14 Hard question! James goes fishing every Saturday. The number of fish he catches follows a Poisson distribution. On a proportion π of the days he goes fishing, he does not catch anything. He makes it a rule to take home the first, and then every other, fish which he catches, i.e. the first, third, fifth fish etc. (a) Using a Poisson distribution, find the mean number of fish he catches. (b) Show that the probability that he takes home the last fish he catches is (1 − π 2 )/2. Solution (a) Let X denote the number of fish caught, such that X ∼ Poisson(λ). P (X = 0) = e−λ λx /x! where the parameter λ is as yet unknown, so P (X = 0) = e−λ λ0 /0! = e−λ . However, we know P (X = 0) = π. So e−λ = π giving −λ = ln π and λ = ln(1/π). (b) James will take home the last fish caught if he catches 1, 3, 5, 7, . . . fish. So we require: e−λ λ1 e−λ λ3 e−λ λ5 + + + ··· 1! 3! 5! 1 λ3 λ5 λ −λ =e + + + ··· . 1! 3! 5! P (X = 1) + P (X = 3) + P (X = 5) + · · · = Now we know: eλ = 1 + λ + and: e −λ λ2 λ3 + + ··· 2! 3! λ2 λ3 =1−λ+ − + ··· . 2! 3! Subtracting gives: λ −λ e −e λ3 λ5 =2 λ+ + + ··· . 3! 5! Hence the required probability is: λ e − e−λ 1 − e−2λ 1 − π2 −λ e = = 2 2 2 since e−λ = π above gives e−2λ = π 2 . 124 4.4. Common discrete distributions 4.4.5 Connections between probability distributions There are close connections between some probability distributions, even across different families of them. Some connections are exact, i.e. one distribution is exactly equal to another, for particular values of the parameters. For example, Bernoulli(π) is the same distribution as Bin(1, π). Some connections are approximate (or asymptotic), i.e. one distribution is closely approximated by another under some limiting conditions. We next discuss one of these, the Poisson approximation of the binomial distribution. 4.4.6 Poisson approximation of the binomial distribution Suppose that: X ∼ Bin(n, π) n is large and π is small. Under such circumstances, the distribution of X is well-approximated by a Poisson(λ) distribution with λ = n π. The connection is exact at the limit, i.e. Bin(n, π) → Poisson(λ) if n → ∞ and π → 0 in such a way that n π = λ remains constant. This ‘law of small numbers’ provides another motivation for the Poisson distribution. Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen Zahlen) helps to remember the key elements of the ‘law of small numbers’. Figure 4.3 shows the numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian army in each of the years spanning 1875–94. Suppose that the number of men killed by horsekicks in one corps in one year is X ∼ Bin(n, π), where: n is large – the number of men in a corps (perhaps 50,000) π is small – the probability that a man is killed by a horsekick. 
X should be well-approximated by a Poisson distribution with some mean λ. The sample frequencies and proportions of different counts are as follows: Number killed Count % 0 144 51.4 1 91 32.5 2 32 11.4 3 11 3.9 4 2 0.7 More 0 0 The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure 4.4. 125 4. Common distributions of random variables Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian 0.5 army in each of the years spanning 1875–94. Source: Bortkiewicz (1898) Das Gesetz der kleinen Zahlen, Leipzig: Teubner. 0.3 0.0 0.1 0.2 Probability 0.4 Poisson(0.7) Sample proportion 0 1 2 3 4 5 6 Men killed Figure 4.4: Fit of Poisson distribution to the data in Example 4.11. Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that, on average, about 1% of customers who have bought tickets fail to arrive for the flight. Because of this, the airline overbooks the flight by selling 200 tickets. What is the probability that everyone who arrives for the flight will get a seat? Let X denote the number of people who fail to turn up. Using the binomial distribution, X ∼ Bin(200, 0.01). We have: P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − 0.1340 − 0.2707 = 0.5953. Using the Poisson approximation, X ∼ Poisson(200 × 0.01) = Poisson(2). P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − e−2 − 2 e−2 = 1 − 3 e−2 = 0.5940. 126 4.4. Common discrete distributions Activity 4.15 The chance that a lottery ticket has a winning number is 0.0000001. (a) If 10,000,000 people buy tickets which are independently numbered, what is the probability there is no winner? (b) What is the probability that there is exactly 1 winner? (c) What is the probability that there are exactly 2 winners? Solution The number of winning tickets, X, will be distributed as: X ∼ Bin(10000000, 0.0000001). Since n is large and π is small, the Poisson distribution should provide a good approximation. The Poisson parameter is: λ = n π = 10000000 × 0.0000001 = 1 and so we set X ∼ Pois(1). We have: p(0) = e−1 10 = 0.3679, 0! p(1) = e−1 11 e−1 12 = 0.3679 and p(2) = = 0.1839. 1! 2! Using the exact binomial distribution of X, the results are: (10)7 7 p(0) = × ((10)−7 )0 × (1 − (10)−7 )(10) = 0.3679 0 (10)7 7 p(1) = × ((10)−7 )1 × (1 − (10)−7 )(10) −1 = 0.3679 1 and: (10)7 7 p(2) = × ((10)−7 )2 × (1 − (10)−7 )(10) −2 = 0.1839. 2 Notice that, in this case, the Poisson approximation is correct to at least 4 decimal places. 4.4.7 Some other discrete distributions Just their names and short comments are given here, so that you have an idea of what else there is. You may meet some of these in future courses. Geometric(π) distribution. • Distribution of the number of failures in Bernoulli trials before the first success. • π is the probability of success at each trial. • The sample space is 0, 1, 2, . . .. • See the basketball example in Chapter 3. 127 4. Common distributions of random variables Negative binomial(r, π) distribution. • Distribution of the number of failures in Bernoulli trials before r successes occur. • π is the probability of success at each trial. • The sample space is 0, 1, 2, . . .. • Negative binomial(1, π) is the same as Geometric(π). 4.5 Common continuous distributions For continuous random variables, we will consider the following distributions. Uniform distribution. Exponential distribution. Normal distribution. 
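Before moving on to these continuous distributions, the Poisson approximation used in Example 4.12 above is easy to confirm numerically. A minimal sketch, assuming scipy is available; sf(1) denotes P(X > 1) = P(X ≥ 2).

```python
# Example 4.12 revisited: exact Bin(200, 0.01) versus the Poisson(2) approximation.
# Assumes scipy; sf(1) = P(X > 1) = P(X >= 2).
from scipy.stats import binom, poisson

print(binom.sf(1, 200, 0.01))     # exact: approximately 0.5953
print(poisson.sf(1, 2.0))         # approximation with lambda = n * pi = 2: approximately 0.5940
```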
4.5.1 The (continuous) uniform distribution The (continuous) uniform distribution has non-zero probabilities only on an interval [a, b], where a < b are given numbers. The probability that its value is in an interval within [a, b] is proportional to the length of the interval. In other words, all intervals (within [a, b]) which have the same length have the same probability. Uniform distribution pdf The pdf of the (continuous) uniform distribution is: ( 1/(b − a) for a ≤ x ≤ b f (x) = 0 otherwise. A random variable X with this pdf may be written as X ∼ Uniform[a, b]. The pdf is ‘flat’, as shown in Figure 4.5 (along with the cdf). Clearly, f (x) ≥ 0 for all x, and: Z ∞ Z b 1 1 1 f (x) dx = dx = [x]ba = [b − a] = 1. b−a b−a −∞ a b−a The cdf is: Z F (x) = P (X ≤ x) = a x for x < a 0 f (t) dt = (x − a)/(b − a) for a ≤ x ≤ b 1 for x > b. Therefore, the probability of an interval [x1 , x2 ], where a ≤ x1 < x2 ≤ b, is: P (x1 ≤ X ≤ x2 ) = F (x2 ) − F (x1 ) = 128 x2 − x1 . b−a 4.5. Common continuous distributions f(x) F(x) 1 0 a b a x b x Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right). So the probability depends only on the length of the interval, x2 − x1 . If X ∼ Uniform[a, b], we have: E(X) = a+b = median of X 2 and: (b − a)2 . 12 The mean and median also follow from the fact that the distribution is symmetric about (a + b)/2, i.e. the midpoint of the interval [a, b]. Var(X) = Activity 4.16 Suppose that X ∼ Uniform[0, 1]. Compute P (X > 0.2), P (X ≥ 0.2) and P (X 2 > 0.04). Solution We have a = 0 and b = 1, and can use the formula for P (c < X ≤ d), for constants c and d. Hence: 1 − 0.2 P (X > 0.2) = P (0.2 < X ≤ 1) = = 0.8. 1−0 Also: P (X ≥ 0.2) = P (X = 0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8. Finally: P (X 2 > 0.04) = P (X < −0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8. Activity 4.17 A newsagent, James, has n newspapers to sell and makes £1.00 profit on each sale. Suppose the number of customers of these newspapers is a random variable with a distribution which can be approximated by: ( 1/200 for 0 < x < 200 f (x) = 0 otherwise. 129 4. Common distributions of random variables If James does not have enough newspapers to sell to all customers, he figures he loses £5.00 in goodwill from each unhappy (non-served) customer. However, if he has surplus newspapers (which only have commercial value on the day of print), he loses £0.50 on each unsold newspaper. What should n be (to the nearest integer) to maximise profit? Hint: If X ≤ n, James’ profit (in £) is X − 0.5(n − X). If X > n, James’ profit is n − 5(X − n). Find the expected value of profit as a function of n, and then select n to maximise this function. (There is no need to verify it is a maximum.) Solution We have: Z 200 1 1 dx + (n − 5(x − n)) dx E(profit) = (x − 0.5(n − x)) 200 200 n 0 n 200 1 x2 (n − x)2 5x2 1 = + 6nx − + 200 2 4 200 2 n 0 Z = n 1 (−3.25n2 + 1200n − 100000). 200 Differentiating with respect to n, we have: 1 dE(profit) = (−6.5n + 1200). dn 200 Equating to zero and solving, we have: n= 4.5.2 1200 ≈ 185. 6.5 Exponential distribution Exponential distribution pdf A random variable X has the exponential distribution with the parameter λ (where λ > 0) if its probability density function is: ( λ e−λx for x > 0 f (x) = 0 otherwise. This is often denoted X ∼ Exponential(λ) or X ∼ Exp(λ). It was shown in the previous chapter that this satisfies the conditions for a pdf (see Example 3.21). The general shape of the pdf is that of ‘exponential decay’, as shown in Figure 4.6 (hence the name). 130 f(x) 4.5. 
Common continuous distributions 0 1 2 3 4 5 x Figure 4.6: Exponential distribution pdf. The cdf of the Exponential(λ) distribution is: ( 0 F (x) = 1 − e−λx for x ≤ 0 for x > 0. 0.0 0.2 0.4 F(x) 0.6 0.8 1.0 The cdf is shown in Figure 4.7 for λ = 1.6. 0 1 2 3 4 5 x Figure 4.7: Exponential distribution cdf for λ = 1.6. For X ∼ Exponential(λ), we have: E(X) = 1 λ and: 1 . λ2 These have been derived in the previous chapter (see Example 3.22). The median of the distribution, also previously derived (see Example 3.24), is: Var(X) = m= log 2 1 = (log 2) × = (log 2) E(X) ≈ 0.69 × E(X). λ λ 131 4. Common distributions of random variables Note that the median is always smaller than the mean, because the distribution is skewed to the right. Uses of the exponential distribution The exponential is, among other things, a basic distribution of waiting times of various kinds. This arises from a connection between the Poisson distribution – the simplest distribution for counts – and the exponential. If the number of events per unit of time has a Poisson distribution with parameter λ, the time interval (measured in the same units of time) between two successive events has an exponential distribution with the same parameter λ. Note that the expected values of these behave as we would expect. E(X) = λ for Poisson(λ), i.e. a large λ means many events per unit of time, on average. E(X) = 1/λ for Exponential(λ), i.e. a large λ means short waiting times between successive events, on average. Example 4.13 Consider Example 4.10. The number of customers arriving at a bank per minute has a Poisson distribution with parameter λ = 1.6. Therefore, the time X, in minutes, between the arrivals of two successive customers follows an exponential distribution with parameter λ = 1.6. From this exponential distribution, the expected waiting time between arrivals of customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be (log 2) × 0.625 = 0.433. We can also calculate probabilities of waiting times between arrivals, using the cumulative distribution function: ( 0 for x ≤ 0 F (x) = −1.6x 1−e for x > 0. For example: P (X ≤ 1) = F (1) = 1 − e−1.6×1 = 1 − e−1.6 = 0.7981. The probability is about 0.8 that two arrivals are at most a minute apart. P (X > 3) = 1 − F (3) = e−1.6×3 = e−4.8 = 0.0082. The probability of a gap of 3 minutes or more between arrivals is very small. 132 4.5. Common continuous distributions Activity 4.18 Suppose that the service time for a customer at a fast food outlet has an exponential distribution with parameter 1/3 (customers per minute). What is the probability that a customer waits more than 4 minutes? Solution The distribution of X is Exp(1/3), so the probability is: P (X > 4) = 1 − F (4) = 1 − (1 − e−(1/3)×4 ) = 1 − 0.7364 = 0.2636. Activity 4.19 Suppose that commercial aeroplane crashes in a certain country occur at the rate of 2.5 per year. (a) Is it reasonable to assume that such crashes are Poisson events? Briefly explain. (b) What is the probability that two or more crashes will occur next year? (c) What is the probability that the next two crashes will occur within six months of one another? Solution (a) Yes, because the Poisson assumptions are probably satisfied – crashes are independent events and the crash rate is likely to remain constant. (b) Since λ = 2.5 crashes per year: P (X ≥ 2) = 1 − P (X ≤ 1) = 1 − 1 X e−2.5 (2.5)x x=0 x! = 0.7127. (c) Let Y = interval (in years) between the next two crashes. Therefore, we have Y ∼ Exp(2.5). 
So: Z 0.5 P (Y < 0.5) = 2.5e−2.5y dy = F (0.5) − F (0) 0 = (1 − e−2.5(0.5) ) − (1 − e−2.5(0) ) = 1 − e−1.25 = 0.7135. Activity 4.20 Let the random variable X have the following pdf: ( e−x for x ≥ 0 f (x) = 0 otherwise. Find the interquartile range (IQR) of X. 133 4. Common distributions of random variables Solution Note that X ∼ Exp(1). For x > 0, we have: Z x Z x x f (t) dt = e−t dt = −e−t 0 = 1 − e−x 0 hence: 0 ( 1 − e−x F (x) = 0 for x > 0 otherwise. Denoting the first and third quartiles by Q1 and Q3 , respectively, we have: F (Q1 ) = 1 − e−Q1 = 0.25 and F (Q3 ) = 1 − e−Q3 = 0.75. Therefore: Q1 = − ln(0.75) = 0.2877 and Q3 = − ln(0.25) = 1.3863 and so: IQR = Q3 − Q1 = 1.3863 − 0.2877 = 1.0986. Activity 4.21 The random variable Y , representing the life-span of an electronic component, is distributed according to a probability density function f (y), where y > 0. The survivor function, =, is defined as =(y) = P (Y > y) and the age-specific failure rate, φ(y), is defined as f (y)/=(y). Suppose f (y) = λ e−λ y , i.e. Y ∼ Exp(λ). (a) Derive expressions for =(y) and φ(y). (b) Comment briefly on the implications of the age-specific failure rate you have derived in the context of the exponentially-distributed component life-spans. Solution (a) The survivor function is: Z =(y) = P (Y > y) = y ∞ ∞ λ e−λx dx = −e−λx y = e−λy . The age-specific failure rate is: φ(y) = λ e−λy f (y) = −λy = λ. =(y) e (b) The age-specific failure rate is constant, indicating it does not vary with age. This is unlikely to be true in practice! 134 4.5. Common continuous distributions 4.5.3 Normal (Gaussian) distribution The normal distribution is by far the most important probability distribution in statistics. This is for three broad reasons. Many variables have distributions which are approximately normal, for example heights of humans or animals, and weights of various products. The normal distribution has extremely convenient mathematical properties, which make it a useful default choice of distribution in many contexts. Even when a variable is not itself even approximately normally distributed, functions of several observations of the variable (‘sampling distributions’) are often approximately normal, due to the central limit theorem. Because of this, the normal distribution has a crucial role in statistical inference. This will be discussed later in the course. Normal distribution pdf The pdf of the normal distribution is: (x − µ)2 1 exp − f (x) = √ 2σ 2 2πσ 2 for − ∞ < x < ∞ where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ 2 are parameters, with −∞ < µ < ∞ and σ 2 > 0. A random variable X with this pdf is said to have a normal distribution with mean µ and variance σ 2 , denoted X ∼ N (µ, σ 2 ). Clearly, f (x) ≥ 0 for all x. Also, it can be shown that to show this), so f (x) really is a pdf. R∞ −∞ f (x) dx = 1 (do not attempt If X ∼ N (µ, σ 2 ), then: E(X) = µ and: Var(X) = σ 2 and, therefore, the standard deviation is sd(X) = σ. The mean can also be inferred from the observation that the normal pdf is symmetric about µ. This also implies that the median of the normal distribution is µ. The normal density is the so-called ‘bell curve’. The two parameters affect it as follows. The mean µ determines the location of the curve. The variance σ 2 determines the dispersion (spread) of the curve. Example 4.14 Figure 4.8 shows that: N (0, 1) and N (5, 1) have the same dispersion but different location: the N (5, 1) curve is identical to the N (0, 1) curve, but shifted 5 units to the right 135 4. 
Common distributions of random variables 0.3 0.4 N (0, 1) and N (0, 9) have the same location but different dispersion: the N (0, 9) curve is centered at the same value, 0, as the N (0, 1) curve, but spread out more widely. N(5, 1) 0.1 0.2 N(0, 1) 0.0 N(0, 9) −5 0 5 10 x Figure 4.8: Various normal distributions. Linear transformations of the normal distribution We now consider one of the convenient properties of the normal distribution. Suppose X is a random variable, and we consider the linear transformation Y = aX + b, where a and b are constants. Whatever the distribution of X, it is true that E(Y ) = a E(X) + b and also that Var(Y ) = a2 Var(X). Furthermore, if X is normally distributed, then so is Y . In other words, if X ∼ N (µ, σ 2 ), then: Y = aX + b ∼ N (aµ + b, a2 σ 2 ). (4.7) This type of result is not true in general. For other families of distributions, the distribution of Y = aX + b is not always in the same family as X. Let us apply (4.7) with a = 1/σ and b = −µ/σ, to get: 2 ! µ X −µ 1 µ 1 1 ∼N µ− , σ 2 = N (0, 1). Z= X− = σ σ σ σ σ σ The transformed variable Z = (X − µ)/σ is known as a standardised variable or a z-score. The distribution of the z-score is N (0, 1), i.e. the normal distribution with mean µ = 0 and variance σ 2 = 1 (and, therefore, a standard deviation of σ = 1). This is known as the standard normal distribution. Its density function is: 2 1 x f (x) = √ exp − for − ∞ < x < ∞. 2 2π 136 4.5. Common continuous distributions The cumulative distribution function of the normal distribution is: Z x 1 (t − µ)2 √ F (x) = dt. exp − 2σ 2 2πσ 2 −∞ In the special case of the standard normal distribution, the cdf is: 2 Z x t 1 √ F (x) = Φ(x) = dt. exp − 2 2π −∞ Note, this is often denoted Φ(x). Such integrals cannot be evaluated in a closed form, so we use statistical tables of them, specifically a table of Φ(x) (or we could use a computer, but not in the examination). In the examination, you will have a table of some values of Φ(z), the cdf of Z ∼ N (0, 1). Specifically, Table 4 of the New Cambridge Statistical Tables shows values of Φ(x) = P (Z ≤ x) for x ≥ 0. This table can be used to calculate probabilities of any intervals for any normal distribution, but how? The table seems to be incomplete. 1. It is only for N (0, 1), not for N (µ, σ 2 ) for any other µ and σ 2 . 2. Even for N (0, 1), it only shows probabilities for x ≥ 0. We next show how these are not really limitations, starting with ‘2.’. The key to using the tables is that the standard normal distribution is symmetric about 0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the same probability. Another way to justify these results is that if Z ∼ N (0, 1), then also −Z ∼ N (0, 1). See ST104a Statistics 1 for a discussion of how to use Table 4 of the New Cambridge Statistical Tables. Probabilities for any normal distribution How about a normal distribution X ∼ N (µ, σ 2 ), for any other µ and σ 2 ? What if we want to calculate, for any a < b, P (a < X ≤ b) = F (b) − F (a)? Remember that (X − µ)/σ = Z ∼ N (0, 1). If we apply this transformation to all parts of the inequalities, we get: X −µ b−µ a−µ < ≤ P (a < X ≤ b) = P σ σ σ a−µ b−µ =P <Z≤ σ σ b−µ a−µ −Φ =Φ σ σ which can be calculated using Table 4 of the New Cambridge Statistical Tables. (Note that this also covers the cases of the one-sided inequalities P (X ≤ b), with a = −∞, and P (X > a), with b = ∞.) 137 4. 
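This standardisation step is also exactly what statistical software evaluates. As a minimal sketch (assuming Python with scipy, which is not needed for the examination), using the blood-pressure figures of Example 4.15 below so the output can be compared with the table-based calculation there:

```python
# P(a < X <= b) for X ~ N(mu, sigma^2) via standardisation: Phi((b-mu)/sigma) - Phi((a-mu)/sigma).
# Assumes scipy; mu and sigma^2 are the values used in Example 4.15 below.
from scipy.stats import norm

mu, sigma = 74.2, 127.87 ** 0.5
a, b = 60, 90

print(norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma))        # approximately 0.815
# The same probability without standardising by hand:
print(norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma))
```

The small difference from the value obtained with statistical tables is only due to rounding the z-values to two decimal places.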
Common distributions of random variables Example 4.15 Let X denote the diastolic blood pressure of a randomly selected person in England. This is approximately distributed as X ∼ N (74.2, 127.87). Suppose we want to know the probabilities of the following intervals: X > 90 (high blood pressure) X < 60 (low blood pressure) 60 ≤ X ≤ 90 (normal blood pressure). These are calculated using standardisation with µ = 74.2, σ 2 = 127.87 and, therefore, σ = 11.31. So here: X − 74.2 = Z ∼ N (0, 1) 11.31 and we can refer values of this standardised variable to Table 4 of the New Cambridge Statistical Tables. 90 − 74.2 X − 74.2 > P (X > 90) = P 11.31 11.31 = P (Z > 1.40) = 1 − Φ(1.40) = 1 − 0.9192 = 0.0808 and: P (X < 60) = P X − 74.2 60 − 74.2 < 11.31 11.31 = P (Z < −1.26) = P (Z > 1.26) = 1 − Φ(1.26) = 1 − 0.8962 = 0.1038. Finally: P (60 ≤ X ≤ 90) = P (X ≤ 90) − P (X < 60) = 0.8152. These probabilities are shown in Figure 4.9. Activity 4.22 Suppose that the distribution of men’s heights in London, measured in cm, is N (175, 62 ). Find the proportion of men whose height is: (a) under 169 cm 138 0.04 4.5. Common continuous distributions Low: 0.10 High: 0.08 0.00 0.01 0.02 0.03 Mid: 0.82 40 60 80 100 120 Diastolic blood pressure Figure 4.9: Distribution of blood pressure for Example 4.15. (b) over 190 cm (c) between 169 cm and 190 cm. Solution The values of interest are 169 and 190. The corresponding z-values are: z1 = 169 − 175 190 − 175 = −1 and z2 = = 2.5. 6 6 Using values from statistical tables, we have: P (X < 169) = P (Z < −1) = Φ(−1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587 also: P (X > 190) = P (Z > 2.5) = 1 − Φ(2.5) = 1 − 0.9938 = 0.0062 and: P (169 < X < 190) = P (−1 < Z < 2.5) = Φ(2.5)−Φ(−1) = 0.9938−0.1587 = 0.8351. Activity 4.23 In javelin throwing competitions, the throws of athlete A are normally distributed. It has been found that 15% of her throws exceed 43 metres, while 3% exceed 45 metres. What distance will be exceeded by 90% of her throws? Solution Suppose X ∼ N (µ, σ 2 ) is the random variable for throws. P (X > 43) = 0.15 leads to µ = 43 − 1.035 × σ (using statistical tables). Similarly, P (X > 45) = 0.03 leads to µ = 45 − 1.88 × σ. Solving yields µ = 40.55 and 139 4. Common distributions of random variables σ = 2.367, hence X ∼ N (40.55, (2.367)2 ). So: P (X > x) = 0.9 ⇒ x − 40.55 = −1.28. 2.367 Hence x = 37.52 metres. Activity 4.24 The life, in hours, of a light bulb is normally distributed with a mean of 175 hours. If a consumer requires at least 95% of the light bulbs to have lives exceeding 150 hours, what is the largest value that the standard deviation can have? Solution Let X be the random variable representing the lifetime of a light bulb (in hours), so that for some value σ we have X ∼ N (175, σ 2 ). We want P (X > 150) = 0.95, such that: 25 150 − 175 =P Z>− = 0.95. P (X > 150) = P Z > σ σ Note that this is the same as P (Z > 25/σ) = 1 − 0.95 = 0.05, so 25/σ = 1.645, giving σ = 15.20. Activity 4.25 Two statisticians disagree about the distribution of IQ scores for a population under study. Both agree that the distribution is normal, and that σ = 15, but A says that 5% of the population have IQ scores greater than 134.6735, whereas B says that 10% of the population have IQ scores greater than 109.224. What is the difference between the mean IQ score as assessed by A and that as assessed by B? Solution The standardised z-value giving 5% in the upper tail is 1.6449, and for 10% it is 1.2816. 
So, converting to the scale for IQ scores, the values are: 1.6449 × 15 = 24.6735 and 1.2816 × 15 = 19.224. Write the means according to A and B as µA and µB , respectively. Therefore: µA + 24.6735 = 134.6735 so: µA = 110 whereas: µB + 19.224 = 109.224 so µB = 90. The difference µA − µB = 110 − 90 = 20. Some probabilities around the mean The following results hold for all normal distributions. P (µ − σ < X < µ + σ) = 0.683. In other words, about 68.3% of the total 140 4.5. Common continuous distributions probability is within 1 standard deviation of the mean. P (µ − 1.96 × σ < X < µ + 1.96 × σ) = 0.950. P (µ − 2 × σ < X < µ + 2 × σ) = 0.954. P (µ − 2.58 × σ < X < µ + 2.58 × σ) = 0.99. P (µ − 3 × σ < X < µ + 3 × σ) = 0.997. The first two of these are illustrated graphically in Figure 4.10. 0.683 µ −1.96σ µ−σ µ µ+σ µ +1.96σ <−−−−−−−−−− 0.95 −−−−−−−−−−> Figure 4.10: Some probabilities around the mean for the normal distribution. 4.5.4 Normal approximation of the binomial distribution For 0 < π < 1, the binomial distribution Bin(n, π) tends to the normal distribution N (n π, n π (1 − π)) as n → ∞. Less formally, the binomial distribution is well-approximated by the normal distribution when the number of trials n is reasonably large. For a given n, the approximation is best when π is not very close to 0 or 1. One rule-of-thumb is that the approximation is good enough when n π > 5 and n (1 − π) > 5. Illustrations of the approximation are shown in Figure 4.11 for different values of n and π. Each plot shows values of the pf of Bin(n, π), and the pdf of the normal approximation, N (n π, n π (1 − π)). When the normal approximation is appropriate, we can calculate probabilities for X ∼ Bin(n, π) using Y ∼ N (n π, n π (1 − π)) and Table 4 of the New Cambridge Statistical Tables. Unfortunately, there is one small caveat. The binomial distribution is discrete, but the normal distribution is continuous. To see why this is problematic, consider the following. Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, . . . , 40, then: P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5) 141 4. Common distributions of random variables n=10, π = 0.5 n=25, π = 0.5 n=25, π = 0.25 n=10, π = 0.9 n=25, π = 0.9 n=50, π = 0.9 Figure 4.11: Examples of the normal approximation of the binomial distribution. since P (4 < X ≤ 4.5) = 0 and P (4.5 < X < 5) = 0 due to the ‘gaps’ in the probability mass for this distribution. In contrast if Y ∼ N (16, 9.6), then: P (Y ≤ 4) < P (Y ≤ 4.5) < P (Y < 5) since P (4 < Y < 4.5) > 0 and P (4.5 < Y < 5) > 0 because this is a continuous distribution. The accepted way to circumvent this problem is to use a continuity correction which corrects for the effects of the transition from a discrete Bin(n, π) distribution to a continuous N (n π, n π (1 − π)) distribution. Continuity correction This technique involves representing each discrete binomial value x, for 0 ≤ x ≤ n, by the continuous interval (x − 0.5, x + 0.5). Great care is needed to determine which x values are included in the required probability. Suppose we are approximating X ∼ Bin(n, π) with Y ∼ N (n π, n π (1 − π)), then: P (X < 4) = P (X ≤ 3) ⇒ P (Y < 3.5) (since 4 is excluded) P (X ≤ 4) = P (X < 5) ⇒ P (Y < 4.5) (since 4 is included) P (1 ≤ X < 6) = P (1 ≤ X ≤ 5) ⇒ P (0.5 < Y < 5.5) (since 1 to 5 are included). Example 4.16 In the UK general election in May 2010, the Conservative Party received 36.1% of the votes. 
We carry out an opinion poll in November 2014, where we survey 1,000 people who say they voted in 2010, and ask who they would vote for 142 4.5. Common continuous distributions if a general election was held now. Let X denote the number of people who say they would now vote for the Conservative Party. Suppose we assume that X ∼ Bin(1000, 0.361). 1. What is the probability that X ≥ 400? Using the normal approximation, noting n = 1000 and π = 0.361, with Y ∼ N (1000 × 0.361, 1000 × 0.361 × 0.639) = N (361, 230.68), we get: P (X ≥ 400) ≈ P (Y ≥ 399.5) Y − 361 399.5 − 361 =P √ ≥ √ 230.68 230.68 = P (Z ≥ 2.53) = 1 − Φ(2.53) = 0.0057. The exact probability from the binomial distribution is P (X ≥ 400) = 0.0059. Without the continuity correction, the normal approximation would give 0.0051. 2. What is the largest number x for which P (X ≤ x) < 0.01? We need the largest x which satisfies: x + 0.5 − 361 P (X ≤ x) ≈ P (Y ≤ x + 0.5) = P Z ≤ √ < 0.01. 230.68 According to Table 4 of the New Cambridge Statistical Tables, the smallest z which satisfies P (Z ≥ z) < 0.01 is z = 2.33, so the largest z which satisfies P (Z ≤ z) < 0.01 is z = −2.33. We then need to solve: x + 0.5 − 361 √ ≤ −2.33 230.68 which gives x ≤ 325.1. The smallest integer value which satisfies this is x = 325. Therefore, P (X ≤ x) < 0.01 for all x ≤ 325. The sum of the exact binomial probabilities from 0 to x is 0.0093 for x = 325, and 0.011 for x = 326. The normal approximation gives exactly the correct answer in this instance. 3. Suppose that 300 respondents in the actual survey say they would vote for the Conservative Party now. What do you conclude from this? From the answer to Question 2, we know that P (X ≤ 300) < 0.01, if π = 0.361. In other words, if the Conservatives’ support remains 36.1%, we would be very unlikely to get a random sample where only 300 (or fewer) respondents would say they would vote for the Conservative Party. Now X = 300 is actually observed. We can then conclude one of two things (if we exclude other possibilities, such as a biased sample or lying by the respondents). 143 4. Common distributions of random variables (a) The Conservatives’ true level of support is still 36.1% (or even higher), but by chance we ended up with an unusual sample with only 300 of their supporters. (b) The Conservatives’ true level of support is currently less than 36.1% (in which case getting 300 in the sample would be more probable). Here (b) seems a more plausible conclusion than (a). This kind of reasoning is the basis of statistical significance tests. Activity 4.26 James enjoys playing Solitaire on his laptop. One day, he plays the game repeatedly. He has found, from experience, that the probability of success in any game is 1/3 and is independent of the outcomes of other games. (a) What is the probability that his first success occurs in the fourth game he plays? What is the expected number of games he needs to play to achieve his first success? (b) What is the probability of three successes in ten games? What is the expected number of successes in ten games? (c) Use a suitable approximation to find the probability of less than 25 successes in 100 games. You should justify the use of the approximation. (d) What is the probability that his third success occurs in the tenth game he plays? Solution (a) P (first success in 4th game) = (2/3)3 × (1/3) = 8/81 ≈ 0.1. This is a geometric distribution, for which E(X) = 1/π = 1/(1/3) = 3. (b) Use X ∼ Bin(10, 1/3), such that E(X) = 10 × 1/3 = 3.33, and: 3 7 2 10 1 ≈ 0.2601. 
P (X = 3) = 3 3 3 (c) Approximate Bin(100, 1/3) by: 1 1 2 200 = N 33.3, . N 100 × , 100 × × 3 3 3 9 The approximation seems reasonable since n = 100 is ‘large’, π = 1/3 is quite close to 0.5, n π > 5 and n (1 − π) > 5. Using a continuity correction: ! 24.5 − 33.3 P (X ≤ 24.5) = P Z ≤ p = P (Z ≤ −1.87) ≈ 0.0307. 200/9 144 4.5. Common continuous distributions (d) This is a negative binomial distribution (used for the trial number of the kth success) with a pf given by: x−1 k p(x) = π (1 − π)x−k for x = k, k + 1, . . . k−1 and 0 otherwise. Hence we require: 3 7 9 1 2 P (X = 10) = ≈ 0.0780. 2 3 3 Alternatively, you could calculate the probability of 2 successes in 9 trials, followed by a further success. Activity 4.27 You may assume that 15% of individuals in a large population are left-handed. (a) If a random sample of 40 individuals is taken, find the probability that exactly 6 are left-handed. (b) If a random sample of 400 individuals is taken, find the probability that exactly 60 are left-handed by using a suitable approximation. Briefly discuss the appropriateness of the approximation. (c) What is the smallest possible size of a randomly chosen sample if we wish to be 99% sure of finding at least one left-handed individual in the sample? Solution (a) Let X ∼ Bin(40, 0.15), hence: 40 P (X = 6) = × (0.15)6 × (0.85)34 = 0.1742. 6 (b) Use a normal approximation with a continuity correction. We require: P (59.5 < X < 60.5) where X ∼ N (60, 51) since X has mean n π and variance n π (1 − π) with n = 400 and π = 0.15. Standardising, this is 2 × P (0 < Z ≤ 0.07) = 0.0558, approximately. Rules-of-thumb for use of the approximation are that n is ‘large’, π is close to 0.5, and n π and n (1 − π) are both at least 5. The first and last of these definitely hold. There is some doubt whether a value of 0.15 can be considered close to 0.5, so use with caution! (c) Given a sample of size n, P (no left-handers) = (0.85)n . Therefore: P (at least 1 left-hander) = 1 − (0.85)n . 145 4. Common distributions of random variables We require 1 − (0.85)n > 0.99, or (0.85)n < 0.01. This gives: 100 < or: n> 1 0.85 n ln(100) = 28.34. ln(1.1765) Rounding up, this gives a sample size of 29. Activity 4.28 For the binomial distribution with a probability of success of 0.25 in an individual trial, calculate the probability that, in 50 trials, there are at least 8 successes: (a) using the normal approximation without a continuity correction (b) using the normal approximation with a continuity correction. Compare these results with the exact probability of 0.9547 and comment. Solution We seek P (X ≥ 8) using the normal approximation Y ∼ N (12.5, 9.375). (a) So, without a continuity correction: 8 − 12.5 = P (Z ≥ −1.47) = 0.9292. P (Y ≥ 8) = P Z ≥ √ 9.375 The required probability could have been expressed as P (X > 7), or indeed any number in [7, 8), for example: 7 − 12.5 P (Y > 7) = P Z ≥ √ = P (Z ≥ −1.80) = 0.9641. 9.375 (b) With a continuity correction: 7.5 − 12.5 = P (Z ≥ −1.63) = 0.9484. P (Y > 7.5) = P Z ≥ √ 9.375 Compared to 0.9547, using the continuity correction yields the closer approximation. Activity 4.29 We have found that the Poisson distribution can be used to approximate a binomial distribution, and a normal distribution can be used to approximate a binomial distribution. It should not be surprising that a normal distribution can be used to approximate a Poisson distribution. 
It can be shown that the approximation is suitable for large values of the Poisson parameter λ, and should be adequate for practical purposes when λ ≥ 10.

(a) Suppose X is a Poisson random variable with parameter λ. If we approximate X by a normal random variable Y ∼ N(µ, σ^2), what are the values which should be used for µ and σ^2? Hint: What are the mean and variance of a Poisson distribution?

(b) Use this approach to estimate P(X > 12) for a Poisson random variable with λ = 15. Use a continuity correction. Note: The exact value of this probability, from the Poisson distribution, is 0.7323890.

Solution

(a) The Poisson distribution with parameter λ has its expectation and variance both equal to λ, so we should take µ = λ and σ^2 = λ in a normal approximation, i.e. use a N(λ, λ) distribution as the approximating distribution.

(b) P(X > 12) ≈ P(Y > 12.5) using a continuity correction, where Y ∼ N(15, 15). This is:

P(Y > 12.5) = P((Y − 15)/√15 > (12.5 − 15)/√15) = P(Z > −0.65) = 0.7422.

4.6 Overview of chapter

This chapter has introduced some common discrete and continuous probability distributions. Their properties, uses and applications have been discussed. The relationships between some of these distributions have also been covered.

4.7 Key terms and concepts

Bernoulli distribution
Binomial distribution
Central limit theorem
Continuity correction
Continuous uniform distribution
Discrete uniform distribution
Exponential distribution
Normal distribution
Parameter
Poisson distribution
Standard normal distribution
Standardised variable
z-score

4.8 Sample examination questions

Solutions can be found in Appendix C.

1. Find P(Y ≥ 2) when Y follows a binomial distribution with parameters n = 10 and π = 0.25.

2. A random variable, X, has the following probability density function:

f(x) = e^{−x} for 0 < x < ∞, and f(x) = 0 otherwise.

The probability of being aged at least x_0 + 1, given being aged at least x_0, is:

p = P(X > x_0 + 1 | X > x_0).

Calculate p.

3. Let X be a normal random variable with mean 1 and variance 4. Calculate:

P(X > 3 | X < 5).

Chapter 5 Multivariate random variables

5.1 Synopsis of chapter

Almost all applications of statistical methods deal with several measurements on the same, or connected, items. To think statistically about several measurements on a randomly selected item, you must understand some of the concepts for joint distributions of random variables.

5.2 Learning outcomes

After completing this chapter, you should be able to:

arrange the probabilities for a discrete bivariate distribution in tabular form

define marginal and conditional distributions, and determine them for a discrete bivariate distribution

recall how to define and determine independence for two random variables

define and compute expected values for functions of two random variables and demonstrate how to prove simple properties of expected values

provide the definition of covariance and correlation for two random variables and calculate these.

5.3 Introduction

So far, we have considered univariate situations, that is one random variable at a time. Now we will consider multivariate situations, that is two or more random variables at once, and together. In particular, we consider two somewhat different types of multivariate situations.

1. Several different variables – such as the height and weight of a person.

2. Several observations of the same variable, considered together – such as the heights of all n people in a sample.
Suppose that X1 , X2 , . . . , Xn are random variables, then the vector: X = (X1 , X2 , . . . , Xn )0 149 5. Multivariate random variables is a multivariate random variable (here n-variate), also known as a random vector. Its possible values are the vectors: x = (x1 , x2 , . . . , xn )0 where each xi is a possible value of the random variable Xi , for i = 1, . . . , n. The joint probability distribution of a multivariate random variable X is defined by the possible values x, and their probabilities. For now, we consider just the simplest multivariate case, a bivariate random variable where n = 2. This is sufficient for introducing most of the concepts of multivariate random variables. For notational simplicity, we will use X and Y instead of X1 and X2 . A bivariate random variable is then the pair (X, Y ). Example 5.1 In this chapter, we consider the following example of a discrete bivariate distribution – for a football match: X = the number of goals scored by the home team Y = the number of goals scored by the visiting (away) team. 5.4 Joint probability functions When the random variables in (X1 , X2 , . . . , Xn ) are all discrete (or all continuous), we also call the multivariate random variable discrete (or continuous, respectively). For a discrete multivariate random variable, the joint probability distribution is described by the joint probability function, defined as: p(x1 , x2 , . . . , xn ) = P (X1 = x1 , X2 = x2 , . . . , Xn = xn ) for all vectors (x1 , x2 , . . . , xn ) of n real numbers. The value p(x1 , x2 , . . . , xn ) of the joint probability function is itself a single number, not a vector. In the bivariate case, this is: p(x, y) = P (X = x, Y = y) which we sometimes write as pX,Y (x, y) to make the random variables clear. Example 5.2 Consider a randomly selected football match in the English Premier League (EPL), and the two random variables: X = the number of goals scored by the home team Y = the number of goals scored by the visiting (away) team. Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example simple, we have recorded the small number of scores of 4 or greater also as 3). 150 5.5. Marginal distributions Consider the joint distribution of (X, Y ). We use probabilities based on data from the 2009–10 EPL season. Suppose the values of pX,Y (x, y) = p(x, y) = P (X = x, Y = y) are the following: X=x 0 1 2 3 0 0.100 0.100 0.085 0.062 Y =y 1 2 0.031 0.039 0.146 0.092 0.108 0.092 0.031 0.039 3 0.031 0.015 0.023 0.006 and p(x, y) = 0 for all other (x, y). Note that this satisfies the conditions for a probability function. 1. p(x, y) ≥ 0 for all (x, y). 2. 3 P 3 P p(x, y) = 0.100 + 0.031 + · · · + 0.006 = 1.000. x=0 y=0 The joint probability function gives probabilities of values of (X, Y ), for example: A 1–1 draw, which is the most probable single result, has probability: P (X = 1, Y = 1) = p(1, 1) = 0.146. The match is a draw with probability: P (X = Y ) = p(0, 0) + p(1, 1) + p(2, 2) + p(3, 3) = 0.344. The match is won by the home team with probability: P (X > Y ) = p(1, 0) + p(2, 0) + p(2, 1) + p(3, 0) + p(3, 1) + p(3, 2) = 0.425. More than 4 goals are scored in the match with probability: P (X + Y > 4) = p(2, 3) + p(3, 2) + p(3, 3) = 0.068. 5.5 Marginal distributions Consider a multivariate discrete random variable X = (X1 , . . . , Xn ). The marginal distribution of a subset of the variables in X is the (joint) distribution of this subset. 
The joint pf of these variables (the marginal pf ) is obtained by summing the joint pf of X over the variables which are not included in the subset. 151 5. Multivariate random variables Example 5.3 Consider X = (X1 , X2 , X3 , X4 ), and the marginal distribution of the subset (X1 , X2 ). The marginal pf of (X1 , X2 ) is: XX p1,2 (x1 , x2 ) = P (X1 = x1 , X2 = x2 ) = p(x1 , x2 , x3 , x4 ) x3 x4 where the sum is of the values of the joint pf of (X1 , X2 , X3 , X4 ) over all possible values of X3 and X4 . The simplest marginal distributions are those of individual variables in the multivariate random variable. The marginal pf is then obtained by summing the joint pf over all the other variables. The resulting marginal distribution is univariate, and its pf is a univariate pf. Marginal distributions for discrete bivariate distributions For the bivariate distribution of (X, Y ) the univariate marginal distributions are those of X and Y individually. Their marginal pfs are: X X pX (x) = p(x, y) and pY (y) = p(x, y). y x Example 5.4 Continuing with the football example introduced in Example 5.2, the joint and marginal probability functions are: Y =y X=x 0 1 2 3 pY (y) 0 0.100 0.100 0.085 0.062 0.347 1 0.031 0.146 0.108 0.031 0.316 2 0.039 0.092 0.092 0.039 0.262 3 0.031 0.015 0.023 0.006 0.075 pX (x) 0.201 0.353 0.308 0.138 1.000 and p(x, y) = pX (x) = pY (y) = 0 for all other (x, y). For example: pX (0) = 3 X p(0, y) = p(0, 0) + p(0, 1) + p(0, 2) + p(0, 3) y=0 = 0.100 + 0.031 + 0.039 + 0.031 = 0.201. Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and medians of individual variables are obtained from the univariate (marginal) distributions of Xi , as defined in Chapter 3. 152 5.6. Conditional distributions Example 5.5 Consider again the football example. The expected number of goals scored by the home team is: X E(X) = x pX (x) = 0 × 0.201 + 1 × 0.353 + 2 × 0.308 + 3 × 0.138 = 1.383. x The expected number of goals scored by the visiting team is: X E(Y ) = y pY (y) = 0 × 0.347 + 1 × 0.316 + 2 × 0.262 + 3 × 0.075 = 1.065. y Activity 5.1 Show that the marginal distributions of a bivariate distribution are not enough to define the bivariate distribution itself. Solution Here we must show that there are two distinct bivariate distributions with the same marginal distributions. It is easiest to think of the simplest case where X and Y each take only two values, say 0 and 1. Suppose the marginal distributions of X and Y are the same, with p(0) = p(1) = 0.5. One possible bivariate distribution with these marginal distributions is the one for which there is independence between X and Y . This has pX,Y (x, y) = pX (x) pY (y) for all x, y. Writing it in full: pX,Y (0, 0) = pX,Y (1, 0) = pX,Y (0, 1) = pX,Y (1, 1) = 0.5 × 0.5 = 0.25. The table of probabilities for this choice of independence is shown in the first table below. Trying some other value for pX,Y (0, 0), like 0.2, gives the second table below. X/Y 0 1 0 0.25 0.25 1 0.25 0.25 X/Y 0 1 0 0.2 0.3 1 0.3 0.2 The construction of these probabilities is done by making sure the row and column totals are equal to 0.5, and so we now have a second distribution with the same marginal distributions as the first. This example is very simple, but one can almost always construct many bivariate distributions with the same marginal distributions even for continuous random variables. 
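For readers who like to verify such tables by machine, the joint pf of the football example can be stored as a small array and the marginal pfs and expected values of Examples 5.4 and 5.5 recovered in a few lines. The Python sketch below is purely our own illustration (the variable names and the use of Python are not part of the formal development); it checks that the probabilities form a valid pf and reproduces pX(x), pY(y), E(X), E(Y) and two of the event probabilities from Example 5.2.

# Joint pf of Example 5.2: rows[x][y] = P(X = x, Y = y), with
# x = home goals (0, ..., 3) indexing rows and y = away goals indexing columns
rows = [
    [0.100, 0.031, 0.039, 0.031],   # x = 0
    [0.100, 0.146, 0.092, 0.015],   # x = 1
    [0.085, 0.108, 0.092, 0.023],   # x = 2
    [0.062, 0.031, 0.039, 0.006],   # x = 3
]

# A valid joint pf: non-negative entries which sum to 1
assert all(p >= 0 for r in rows for p in r)
assert abs(sum(p for r in rows for p in r) - 1.0) < 1e-9

# Marginal pfs, obtained by summing the joint pf over the other variable
p_X = [sum(rows[x]) for x in range(4)]                        # [0.201, 0.353, 0.308, 0.138]
p_Y = [sum(rows[x][y] for x in range(4)) for y in range(4)]   # [0.347, 0.316, 0.262, 0.075]

# Expected values from the marginal distributions (Example 5.5)
E_X = sum(x * p_X[x] for x in range(4))   # 1.383
E_Y = sum(y * p_Y[y] for y in range(4))   # 1.065

# Probabilities of events defined on (X, Y), as in Example 5.2
P_draw = sum(rows[x][x] for x in range(4))                                  # 0.344
P_home_win = sum(rows[x][y] for x in range(4) for y in range(4) if x > y)   # 0.425

print([round(p, 3) for p in p_X], [round(p, 3) for p in p_Y])
print(round(E_X, 3), round(E_Y, 3), round(P_draw, 3), round(P_home_win, 3))

The same pattern, summing the joint pf over the variables you do not want, extends directly to marginal distributions of more than two variables.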
5.6 Conditional distributions Consider discrete variables X and Y , with joint pf p(x, y) = pX,Y (x, y) and marginal pfs pX (x) and pY (y), respectively. 153 5. Multivariate random variables Conditional distributions of discrete bivariate distributions Let x be one possible value of X, for which pX (x) > 0. The conditional distribution of Y given that X = x is the discrete probability distribution with the pf: pY |X (y | x) = P (Y = y | X = x) = P (X = x and Y = y) pX,Y (x, y) = P (X = x) pX (x) for any value y. This is the conditional probability function of Y given X = x. Example 5.6 Recall that in the football example the joint and marginal pfs were: Y =y X=x 0 1 2 3 pY (y) 0 0.100 0.100 0.085 0.062 0.347 1 0.031 0.146 0.108 0.031 0.316 2 0.039 0.092 0.092 0.039 0.262 3 0.031 0.015 0.023 0.006 0.075 pX (x) 0.201 0.353 0.308 0.138 1.000 We can now calculate the conditional pf of Y given X = x for each x, i.e. of away goals given home goals. For example: pY |X (y | 0) = pY |X (y | X = 0) = pX,Y (0, y) pX,Y (0, y) = . pX (0) 0.201 So, for example, pY |X (1 | 0) = pX,Y (0, 1)/0.201 = 0.031/0.201 = 0.154. Calculating these for each value of x gives: X=x 0 1 2 3 pY |X (y | x) 0 1 0.498 0.154 0.283 0.414 0.276 0.351 0.449 0.225 when y 2 0.194 0.261 0.299 0.283 is: 3 0.154 0.042 0.075 0.043 Sum 1.00 1.00 1.00 1.00 So, for example: if the home team scores 0 goals, the probability that the visiting team scores 1 goal is pY |X (1 | 0) = 0.154 if the home team scores 1 goal, the probability that the visiting team wins the match is pY |X (2 | 1) + pY |X (3 | 1) = 0.261 + 0.042 = 0.303. 154 5.6. Conditional distributions 5.6.1 Properties of conditional distributions Each different value of x defines a different conditional distribution and conditional pf pY |X (y | x). Each value of pY |X (y | x) is a conditional probability of the kind previously defined. Defining events A = {Y = y} and B = {X = x}, then: P (A | B) = P (A ∩ B) P (Y = y and X = x) = P (B) P (X = x) = P (Y = y | X = x) = pX,Y (x, y) pX (x) = pY |X (y | x). A conditional distribution is itself a probability distribution, and a conditional pf is a pf. Clearly, pY |X (y | x) ≥ 0 for all y, and: P X y pY |X (y | x) = pX,Y (x, y) y pX (x) = pX (x) = 1. pX (x) The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is defined similarly, with the roles of X and Y reversed: pX|Y (x | y) = pX,Y (x, y) pY (y) for any value x. Conditional distributions are general and are not limited to the bivariate case. If X and/or Y are vectors of random variables, the conditional pf of Y given X = x is: pY|X (y | x) = pX,Y (x, y) pX (x) where pX,Y (x, y) is the joint pf of the random vector (X, Y), and pX (x) is the marginal pf of the random vector X. 5.6.2 Conditional mean and variance Since a conditional distribution is a probability distribution, it also has a mean (expected value) and variance (and median etc.). These are known as the conditional mean and conditional variance, and are denoted, respectively, by: EY |X (Y | x) and VarY |X (Y | x). 155 5. Multivariate random variables Example 5.7 In the football example, we have: X EY |X (Y | 0) = y pY |X (y | 0) = 0 × 0.498 + 1 × 0.154 + 2 × 0.194 + 3 × 0.154 = 1.00. y So, if the home team scores 0 goals, the expected number of goals by the visiting team is EY |X (Y | 0) = 1.00. EY |X (Y | x) for x = 1, 2 and 3 are obtained similarly. 
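One way to obtain them all at once is to let a computer do the division. The short Python sketch below is our own illustration: starting from the joint pf of the football example, it recomputes pY|X(y | x) and EY|X(Y | x) for each value of x, and its output matches the table which follows.

# Joint pf of the football example: rows indexed by x (home goals),
# columns by y (away goals)
rows = [
    [0.100, 0.031, 0.039, 0.031],   # x = 0
    [0.100, 0.146, 0.092, 0.015],   # x = 1
    [0.085, 0.108, 0.092, 0.023],   # x = 2
    [0.062, 0.031, 0.039, 0.006],   # x = 3
]

for x in range(4):
    p_X_x = sum(rows[x])                             # marginal probability pX(x)
    cond = [rows[x][y] / p_X_x for y in range(4)]    # pY|X(y | x) for y = 0, ..., 3
    cond_mean = sum(y * cond[y] for y in range(4))   # conditional mean E(Y | X = x)
    print(x, [round(c, 3) for c in cond], round(cond_mean, 2))

# Output (one line per x), matching the table below:
# 0 [0.498, 0.154, 0.194, 0.154] 1.0
# 1 [0.283, 0.414, 0.261, 0.042] 1.06
# 2 [0.276, 0.351, 0.299, 0.075] 1.17
# 3 [0.449, 0.225, 0.283, 0.043] 0.92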
Here X is the number of goals by the home team, and Y is the number of goals by the visiting team: X=x 0 1 2 3 pY |X (y | x) 0 1 0.498 0.154 0.283 0.414 0.276 0.351 0.449 0.225 when y 2 0.194 0.261 0.299 0.283 is: 3 0.154 0.042 0.075 0.043 EY |X (Y | x) 1.00 1.06 1.17 0.92 3.0 Plots of the conditional means are shown in Figure 5.1. 0.0 0.5 1.0 1.5 2.0 2.5 Home goals x Expected away goals E(Y|x) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Goals Figure 5.1: Conditional means for Example 5.7. 5.7 Covariance and correlation Suppose that the conditional distributions pY |X (y | x) of a random variable Y given different values x of a random variable X are not all the same, i.e. the conditional distribution of Y ‘depends on’ the value of X. Therefore, there is said to be an association (or dependence) between X and Y . 156 5.7. Covariance and correlation If two random variables are associated (dependent), knowing the value of one (for example, X) will help to predict the likely value of the other (for example, Y ). We next consider two measures of association which are used to summarise the strength of an association in a single number: covariance and correlation (scaled covariance). 5.7.1 Covariance Definition of covariance The covariance of two random variables X and Y is defined as: Cov(X, Y ) = Cov(Y, X) = E[(X − E(X))(Y − E(Y ))]. This can also be expressed as the more convenient formula: Cov(X, Y ) = E(XY ) − E(X) E(Y ). (Note that these involve expected values of products of two random variables, which have not been defined yet. We will do so later in this chapter.) Properties of covariance Suppose X and Y are random variables, and a, b, c and d are constants. The covariance of a random variable with itself is the variance of the random variable: Cov(X, X) = E(XX) − E(X) E(X) = E(X 2 ) − (E(X))2 = Var(X). The covariance of a random variable and a constant is 0: Cov(a, X) = E(aX) − E(a) E(X) = a E(X) − a E(X) = 0. The covariance of linear transformations of random variables is: Cov(aX + b, cY + d) = ac Cov(X, Y ). Activity 5.2 Suppose that X and Y have a bivariate distribution. Find the covariance of the new random variables W = aX + bY and V = cX + dY where a, b, c and d are constants. 157 5. Multivariate random variables Solution The covariance of W and V is: E(W V ) − E(W ) E(V ) = E[acX 2 + bdY 2 + (ad + bc)XY ] − [ac E(X)2 + bd E(Y )2 + (ad + bc) E(X) E(Y )] = ac [E(X 2 ) − E(X)2 ] + bd [E(Y 2 ) − E(Y )2 ] + (ad + bc) [E(XY ) − E(X) E(Y )] 2 = ac σX + bd σY2 + (ad + bc) σXY . 5.7.2 Correlation Definition of correlation The correlation of two random variables X and Y is defined as: Cov(X, Y ) Cov(X, Y ) . = Corr(X, Y ) = Corr(Y, X) = p sd(X) sd(Y ) Var(X) Var(Y ) When Cov(X, Y ) = 0, then Corr(X, Y ) = 0. When this is the case, we say that X and Y are uncorrelated. Correlation and covariance are measures of the strength of the linear (‘straight-line’) association between X and Y . The further the correlation is from 0, the stronger is the linear association. The most extreme possible values of correlation are −1 and +1, which are obtained when Y is an exact linear function of X. Corr(X, Y ) = +1 when Y = aX + b with a > 0. Corr(X, Y ) = −1 when Y = aX + b with a < 0. If Corr(X, Y ) > 0, we say that X and Y are positively correlated. If Corr(X, Y ) < 0, we say that X and Y are negatively correlated. 
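The covariance rules above are easy to confirm numerically on any fully specified joint distribution. The following Python sketch is our own illustration: it uses the second joint table from the solution to Activity 5.1 and arbitrary constants a, b, c and d, and checks by direct enumeration that Cov(aX + b, cY + d) = ac Cov(X, Y).

# A small joint pf for (X, Y), each taking values 0 and 1
# (the second table from the solution to Activity 5.1)
p = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.2}

def expectation(g):
    # E[g(X, Y)] by direct enumeration over the joint pf
    return sum(prob * g(x, y) for (x, y), prob in p.items())

def cov(f, g):
    # Cov(f(X, Y), g(X, Y)) = E(fg) - E(f) E(g)
    return (expectation(lambda x, y: f(x, y) * g(x, y))
            - expectation(f) * expectation(g))

a, b, c, d = 2.0, 1.0, 3.0, -4.0                                # arbitrary constants
cov_XY = cov(lambda x, y: x, lambda x, y: y)                    # Cov(X, Y) = -0.05
cov_lin = cov(lambda x, y: a * x + b, lambda x, y: c * y + d)   # Cov(aX + b, cY + d)
print(cov_XY, cov_lin, a * c * cov_XY)   # cov_lin equals ac Cov(X, Y) = -0.3

Changing the sign of a or c flips the sign of the covariance (and hence of the correlation) while the additive constants b and d have no effect, exactly as the formula says.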
Example 5.8 Recall the joint pf pX,Y (x, y) in the football example: Y =y X=x 0 1 2 3 158 0 0 0.100 0 0.100 0 0.085 0 0.062 1 0 0.031 1 0.146 2 0.108 3 0.031 2 0 0.039 2 0.092 4 0.092 6 0.039 3 0 0.031 3 0.015 6 0.023 9 0.006 5.7. Covariance and correlation Here, the numbers in bold are the values of xy for each combination of x and y. From these and their probabilities, we can derive the probability distribution of XY . For example: P (XY = 2) = pX,Y (1, 2) + pX,Y (2, 1) = 0.092 + 0.108 = 0.200. The pf of the product XY is: XY = xy P (XY = xy) 0 0.448 1 0.146 2 0.200 3 0.046 4 0.092 6 0.062 9 0.006 Hence: E(XY ) = 0 × 0.448 + 1 × 0.146 + 2 × 0.200 + · · · + 9 × 0.006 = 1.478. From the marginal pfs pX (x) and pY (y) we get: E(X) = 1.383 and E(Y ) = 1.065 also: E(X 2 ) = 2.827 and E(Y 2 ) = 2.039 hence: Var(X) = 2.827 − (1.383)2 = 0.9143 and Var(Y ) = 2.039 − (1.065)2 = 0.9048. Therefore, the covariance of X and Y is: Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 1.478 − 1.383 × 1.065 = 0.00511 and the correlation is: Cov(X, Y ) 0.00511 Corr(X, Y ) = p = 0.00562. =√ 0.9143 × 0.9048 Var(X) Var(Y ) The numbers of goals scored by the home and visiting teams are very nearly uncorrelated (i.e. not linearly associated). Activity 5.3 X and Y are independent random variables with distributions as follows: X=x pX (x) 0 0.4 1 0.2 2 0.4 Y =y pY (y) 1 0.4 2 0.6 The random variables W and Z are defined by W = 2X and Z = Y − X, respectively. (a) Compute the joint distribution of W and Z. (b) Evaluate P (W = 2 | Z = 1), E(W | Z = 0) and Cov(W, Z). 159 5. Multivariate random variables Solution (a) The joint distribution (with marginal probabilities) is: −1 0 1 2 pW (w) Z=z 0 0.00 0.00 0.16 0.24 0.40 W =w 2 0.00 0.08 0.12 0.00 0.20 4 0.16 0.24 0.00 0.00 0.40 pZ (z) 0.16 0.32 0.28 0.24 1.00 (b) It is straightforward to see that: P (W = 2 | Z = 1) = 0.12 3 P (W = 2 ∩ Z = 1) = = . P (Z = 1) 0.28 7 For E(W | Z = 0), we have: E(W | Z = 0) = X w P (W = w | Z = 0) = 0 × w 0 0.08 0.24 +2× +4× = 3.5. 0.32 0.32 0.32 We see E(W ) = 2 (by symmetry), and: E(Z) = −1 × 0.16 + 0 × 0.32 + 1 × 0.28 + 2 × 0.24 = 0.6. Also: E(W Z) = XX w w z p(w, z) = −4 × 0.16 + 2 × 0.12 = −0.4 z hence: Cov(W, Z) = E(W Z) − E(W ) E(Z) = −0.4 − 2 × 0.6 = −1.6. Activity 5.4 The joint probability distribution of the random variables X and Y is: Y =y −1 0 1 −1 0.05 0.10 0.10 X=x 0 1 0.15 0.10 0.05 0.25 0.05 0.15 (a) Identify the marginal distributions of X and Y and the conditional distribution of X given Y = 1. (b) Evaluate E(X | Y = 1) and the correlation coefficient of X and Y . (c) Are X and Y independent random variables? 160 5.7. Covariance and correlation Solution (a) The marginal and conditional distributions are, respectively: X=x pX (x) −1 0.25 0 0.25 Y =y pY (y) 1 0.50 X = x|Y = 1 pX|Y =1 (x | Y = 1) −1 1/3 −1 0.30 0 1/6 0 0.40 1 0.30 1 1/2 (b) From the conditional distribution we see: E(X | Y = 1) = −1 × 1 1 1 1 +0× +1× = . 3 6 2 6 E(Y ) = 0 (by symmetry), and so Var(Y ) = E(Y 2 ) = 0.6. E(X) = 0.25 and: Var(X) = E(X 2 ) − (E(X))2 = 0.75 − (0.25)2 = 0.6875. (Note that Var(X) and Var(Y ) are not strictly necessary here!) Next: E(XY ) = XX x x y p(x, y) y = (−1)(−1)(0.05) + (1)(−1)(0.1) + (−1)(1)(0.1) + (1)(1)(0.15) = 0. So: Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 0 ⇒ Corr(X, Y ) = 0. (c) X and Y are not independent random variables since, for example: P (X = 1, Y = −1) = 0.1 6= P (X = 1) P (Y = −1) = 0.5 × 0.3 = 0.15. 
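The arithmetic in Example 5.8 lends itself to a machine check. The Python sketch below is our own illustration: it recomputes E(XY), the covariance and the correlation directly from the joint pf of the football example, without first tabulating the distribution of XY, since summing x y p(x, y) over all cells gives E(XY) directly.

# Joint pf of the football example: rows indexed by x, columns by y
rows = [
    [0.100, 0.031, 0.039, 0.031],   # x = 0
    [0.100, 0.146, 0.092, 0.015],   # x = 1
    [0.085, 0.108, 0.092, 0.023],   # x = 2
    [0.062, 0.031, 0.039, 0.006],   # x = 3
]

def E(g):
    # E[g(X, Y)] by summing g(x, y) p(x, y) over all cells of the joint pf
    return sum(g(x, y) * rows[x][y] for x in range(4) for y in range(4))

E_X, E_Y = E(lambda x, y: x), E(lambda x, y: y)   # 1.383 and 1.065
E_XY = E(lambda x, y: x * y)                      # 1.478
var_X = E(lambda x, y: x**2) - E_X**2             # 0.9143
var_Y = E(lambda x, y: y**2) - E_Y**2             # 0.9048
cov_XY = E_XY - E_X * E_Y
corr_XY = cov_XY / (var_X * var_Y) ** 0.5

# About 0.0051 and 0.0056, matching Example 5.8 up to rounding
print(cov_XY, corr_XY)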
Activity 5.5 The random variables X1 and X2 are independent and have the common distribution given in the table below: X=x pX (x) 0 0.2 1 0.4 2 0.3 3 0.1 The random variables W and Y are defined by W = max(X1 , X2 ) and Y = min(X1 , X2 ). (a) Calculate the table of probabilities which defines the joint distribution of W and Y . 161 5. Multivariate random variables (b) Find: i. the marginal distribution of W ii. the conditional distribution of Y given W = 2 iii. E(Y | W = 2) and Var(Y | W = 2) iv. Cov(W, Y ). Solution (a) The joint distribution of W and Y is: Y =y 0 1 2 3 0 (0.2)2 0 0 0 (0.2)2 1 2(0.2)(0.4) (0.4)(0.4) 0 0 (0.8)(0.4) W =w 2 3 2(0.2)(0.3) 2(0.2)(0.1) 2(0.4)(0.3) 2(0.4)(0.1) (0.3)(0.3) 2(0.3)(0.1) 0 (0.1)(0.1) (1.5)(0.3) (1.9)(0.1) which is: Y =y (b) 0 1 2 3 W =w 1 2 0.16 0.12 0.16 0.24 0.00 0.09 0.00 0.00 0.32 0.45 0 0.04 0.00 0.00 0.00 0.04 3 0.04 0.08 0.06 0.01 0.19 i. Hence the marginal distribution of W is: W =w pW (w) 0 0.04 1 0.32 2 0.45 3 0.19 ii. The conditional distribution of Y | W = 2 is: Y = y|W = 2 pY |W =2 (y | W = 2) 0 4/15 = 0.26̇ 1 8/15 = 0.53̇ 2 2/10 = 0.2 3 0 0 iii. We have: E(Y | W = 2) = 0 × 4 8 2 +1× +2× + 3 × 0 = 0.93̇ 15 15 10 and: Var(Y | W = 2) = E(Y 2 | W = 2)−(E(Y | W = 2))2 = 1.3̇−(0.93̇)2 = 0.4622. iv. E(W Y ) = 1.69, E(W ) = 1.79 and E(Y ) = 0.81, therefore: Cov(W, Y ) = E(W Y ) − E(W ) E(Y ) = 1.69 − 1.79 × 0.81 = 0.2401. 162 5.7. Covariance and correlation Activity 5.6 Consider two random variables X and Y . X can take the values −1, 0 and 1, and Y can take the values 0, 1 and 2. The joint probabilities for each pair are given by the following table: X = −1 X = 0 X = 1 Y =0 0.10 0.20 0.10 Y =1 0.10 0.05 0.10 Y =2 0.10 0.05 0.20 (a) Calculate the marginal distributions and expected values of X and Y . (b) Calculate the covariance of the random variables U and V , where U = X + Y and V = X − Y . (c) Calculate E(V | U = 1). Solution (a) The marginal distribution of X is: X=x pX (x) −1 0.3 0 0.3 1 0.4 The marginal distribution of Y is: Y =y pY (y) 0 0.40 1 0.25 2 0.35 Hence: E(X) = −1 × 0.3 + 0 × 0.3 + 1 × 0.4 = 0.1 and: E(Y ) = 0 × 0.40 + 1 × 0.25 + 2 × 0.35 = 0.95. (b) We have: Cov(U, V ) = Cov(X + Y, X − Y ) = E((X + Y )(X − Y )) − E(X + Y )E(X − Y ) = E(X 2 − Y 2 ) − (E(X) + E(Y )) (E(X) − E(Y )) E(X 2 ) = ((−1)2 × 0.3) + (02 × 0.3) + (12 × 0.4) = 0.7 E(Y 2 ) = (02 × 0.4) + (12 × 0.25) + (22 × 0.35) = 1.65 hence: Cov(U, V ) = (0.7 − 1.65) − (0.1 + 0.95)(0.1 − 0.95) = −0.0575. 163 5. Multivariate random variables (c) U = 1 is achieved for (X, Y ) pairs (−1, 2), (0, 1) or (1, 0). The corresponding values of V are −3, −1 and 1. We have: P (U = 1) = 0.1 + 0.05 + 0.1 = 0.25 P (V = −3 | U = 1) = 2 0.1 = 0.25 5 P (V = −1 | U = 1) = 0.05 1 = 0.25 5 P (V = 1 | U = 1) = 0.1 2 = 0.25 5 hence: 1 2 2 + −1 × + 1× = −1. E(V | U = 1) = −3 × 5 5 5 Activity 5.7 Two refills for a ballpoint pen are selected at random from a box containing three blue refills, two red refills and three green refills. Define the following random variables: X = the number of blue refills selected Y = the number of red refills selected. (a) Show that P (X = 1, Y = 1) = 3/14. (b) Form the table showing the joint probability distribution of X and Y . (c) Calculate E(X), E(Y ) and E(X | Y = 1). (d) Find the covariance between X and Y . (e) Are X and Y independent random variables? Give a reason for your answer. Solution (a) With the obvious notation B = blue and R = red: P (X = 1, Y = 1) = P (BR) + P (RB) = 3 3 2 2 3 × + × = . 
8 7 8 7 14 (b) We have: Y =y 164 0 1 2 X=x 0 1 2 3/28 9/28 3/28 3/14 3/14 0 1/28 0 0 5.7. Covariance and correlation (c) The marginal distribution of X is: X=x pX (x) 0 10/28 1 15/28 2 3/28 Hence: 15 3 3 10 +1× +2× = . 28 28 28 4 The marginal distribution of Y is: E(X) = 0 × Y =y pY (y) 0 15/28 1 12/28 2 1/28 Hence: 15 12 1 1 +1× +2× = . 28 28 28 2 The conditional distribution of X given Y = 1 is: E(Y ) = 0 × X = x|Y = 1 pX|Y =1 (x | y = 1) Hence: E(X | Y = 1) = 0 × 0 1/2 1 1/2 1 1 1 +1× = . 2 2 2 (d) The distribution of XY is: XY = xy pXY (xy) Hence: E(XY ) = 0 × 0 22/28 1 6/28 22 6 3 +1× = 28 28 14 and: Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 3 3 1 9 − × =− . 14 4 2 56 (e) Since Cov(X, Y ) 6= 0, a necessary condition for independence fails to hold. The random variables are not independent. Activity 5.8 A fair coin is tossed four times. Let X be the number of heads obtained on the first three tosses of the coin. Let Y be the number of heads on all four tosses of the coin. (a) Find the joint probability distribution of X and Y . (b) Find the mean and variance of X. (c) Find the conditional probability distribution of Y given that X = 2. (d) Find the mean of the conditional probability distribution of Y given that X = 2. 165 5. Multivariate random variables Solution (a) The joint probability distribution is: Y =y \X=x 0 1 2 3 4 0 1/16 1/16 0 0 0 1 0 3/16 3/16 0 0 2 0 0 3/16 3/16 0 3 0 0 0 1/16 1/16 (b) The marginal distribution of X is: X=x p(x) 0 1/8 1 3/8 2 3/8 3 1/8 Hence: E(X) = X x p(x) = 0 × x E(X 2 ) = X 1 1 3 + ··· + 3 × = 8 8 2 x2 p(x) = 02 × 1 1 + · · · + 32 × = 3 8 8 Var(X) = 3 − 9 3 = . 4 4 x and: (c) We have: P (Y = 0 | X = 2) = p(2, 0) 0 = =0 pX (2) 3/8 P (Y = 1 | X = 2) = p(2, 1) 0 = =0 pX (2) 3/8 P (Y = 2 | X = 2) = p(2, 2) 3/16 1 = = pX (2) 3/8 2 P (Y = 3 | X = 2) = p(2, 3) 3/16 1 = = pX (2) 3/8 2 P (Y = 4 | X = 2) = p(2, 4) 0 = = 0. pX (2) 3/8 Hence: Y = y|X = 2 p(y | X = 2) 2 1/2 3 1/2 (d) We have: E(Y | X = 2) = 2 × 166 1 1 5 +3× = . 2 2 2 5.7. Covariance and correlation Activity 5.9 X and Y are discrete random variables which can assume values 0, 1 and 2 only. P (X = x, Y = y) = A(x + y) for some constant A and x, y ∈ {0, 1, 2}. (a) Draw up a table to describe the joint distribution of X and Y and find the value of the constant A. (b) Describe the marginal distributions of X and Y . (c) Give the conditional distribution of X | Y = 1 and find E(X | Y = 1). (d) Are X and Y independent? Give a reason for your answer. Solution (a) The joint distribution table is: Y =y Since PP 0 1 2 0 0 A 2A X=x 1 2 A 2A 2A 3A 3A 4A pX,Y (x, y) = 1, we have A = 1/18. ∀x ∀y (b) The marginal distribution of X (similarly of Y ) is: X=x P (X = x) 0 3A = 1/6 1 6A = 1/3 2 9A = 1/2 (c) The distribution of X | Y = 1 is: X = x|y = 1 PX|Y =1 (X = x | y = 1) Hence: 0 A/6A = 1/6 1 2A/6A = 1/3 2 3A/6A = 1/2 1 1 1 4 E(X | Y = 1) = 0 × + 1× + 2× = . 6 3 2 3 (d) Even though the distributions of X and X | Y = 1 are the same, X and Y are not independent. For example, P (X = 0, Y = 0) = 0 although P (X = 0) 6= 0 and P (Y = 0) 6= 0. Activity 5.10 X and Y are discrete random variables with the following joint probability function: 167 5. Multivariate random variables Y =y 0 1 −1 0.15 0.30 X=x 0 1 0.05 0.15 0.25 0.10 (a) Obtain the marginal distributions of X and Y , respectively. (b) Calculate E(X), Var(X), E(Y ) and Var(Y ). (c) Obtain the conditional distributions of Y given X = −1, and of X given Y = 0. (d) Calculate EY |X (Y | X = −1) and EX|Y (X | Y = 0). (e) Calculate E(XY ), Cov(X, Y ) and Corr(X, Y ). 
(f) Find P (X > Y ) and P (X 2 > Y 2 ). (g) Are X and Y independent? Explain why or why not. Solution (a) The marginal distributions are found by adding across rows and columns: X=x pX (x) −1 0.45 0 0.30 1 0.25 and: Y =y pY (y) 0 0.35 1 0.65 (b) We have: E(X) = −1 × 0.45 + 0 × 0.30 + 1 × 0.25 = −0.20 and: E(X 2 ) = (−1)2 × 0.45 + 02 × 0.30 + 12 × 0.25 = 0.70 so Var(X) = 0.70 − (−0.20)2 = 0.66. Also: E(Y ) = 0 × 0.35 + 1 × 0.65 = 0.65 and: E(Y 2 ) = 02 × 0.35 + 12 × 0.65 = 0.65 so Var(Y ) = 0.65 − (0.65)2 = 0.2275. (c) The conditional probability functions pY |X=−1 (y | x = −1) and pX|Y =0 (x | y = 0) are given by, respectively: 168 5.7. Covariance and correlation Y = y | X = −1 pY |X=−1 (y | x = −1) 0 0.15/0.45 = 0.3̇ 1 0.30/0.45 = 0.6̇ and: X = x|Y = 0 pX|Y =0 (x | y = 0) −1 0.15/0.35 = 0.4286 0 0.05/0.35 = 0.1429 1 0.15/0.35 = 0.4286 (d) We have EY |X (Y | X = −1) = 0 × 0.3̇ + 1 × 0.6̇ = 0.6̇. Also, EX|Y (X | Y = 0) = −1 × 0.4286 + 0 × 0.1429 + 1 × 0.4286 = 0. (e) We have E(XY ) = P P x y x y p(x, y) = −1 × 0.30 + 0 × 0.60 + 1 × 0.10 = −0.20. Also, Cov(X, Y ) = E(XY ) − E(X) E(Y ) = −0.20 − (−0.20)(0.65) = −0.07 and: −0.07 Cov(X, Y ) =√ = −0.1807. Corr(X, Y ) = p 0.66 × 0.2275 Var(X) Var(Y ) (f) We have P (X > Y ) = P (X = 1, Y = 0) = 0.15. Also, P (X 2 > Y 2 ) = P (X = −1, Y = 0) + P (X = 1, Y = 0) = 0.15 + 0.15 = 0.30. (g) Since X and Y are (weakly) negatively correlated (as determined in (e)), they cannot be independent. While the non-zero correlation is a sufficient explanation in this case, for other such bivariate distributions which are uncorrelated, i.e. when Corr(X, Y ) = 0, it becomes necessary to check whether pX,Y (x, y) = pX (x) pY (y) for all pairs of values of (x, y). Here, for example, pX,Y (0, 0) = 0.05, pX (0) = 0.30 and pY (0) = 0.35. We then have that pX (0) pY (0) = 0.105, which is not equal to pX,Y (0, 0) = 0.05. Hence X and Y cannot be independent. Activity 5.11 A box contains 4 red balls, 3 green balls and 3 blue balls. Two balls are selected at random without replacement. Let X represent the number of red balls in the sample and Y the number of green balls in the sample. (a) Arrange the different pairs of values of (X, Y ) as the cells in a table, each cell being filled with the probability of that pair of values occurring, i.e. provide the joint probability distribution. (b) What does the random variable Z = 2 − X − Y represent? (c) Calculate Cov(X, Y ). (d) Calculate P (X = 1 | − 2 < X − Y < 2). 169 5. Multivariate random variables Solution (a) We have: P (X = 0, Y = 0) = 3 2 6 1 × = = 10 9 90 15 P (X = 0, Y = 1) = 2 × P (X = 0, Y = 2) = 3 3 18 3 × = = 10 9 90 15 2 6 1 3 × = = 10 9 90 15 P (X = 1, Y = 0) = 2 × 4 3 24 4 × = = 10 9 90 15 P (X = 1, Y = 1) = 2 × 4 3 24 4 × = = 10 9 90 15 P (X = 2, Y = 0) = 4 3 12 2 × = = . 10 9 90 15 All other values have probability 0. We then construct the table of joint probabilities: Y =0 Y =1 Y =2 X = 0 1/15 3/15 1/15 4/15 0 X = 1 4/15 X = 2 2/15 0 0 (b) The number of blue balls in the sample. (c) We have: E(X) = 1 × E(Y ) = 1 × 4 4 + 15 15 3 4 + 15 15 and: E(XY ) = 1 × 1 × So: Cov(X, Y ) = +2× 2 4 = 15 5 +2× 1 3 = 15 5 4 4 = . 15 15 4 4 3 16 − × =− . 15 5 5 75 (d) We have: P (X = 1 | |X − Y | < 2) = 170 2 4/15 + 4/15 = . 1/15 + 3/15 + 4/15 + 4/15 3 5.7. Covariance and correlation Activity 5.12 Suppose that Var(X) = Var(Y ) = 1, and that X and Y have correlation coefficient ρ. Show that it follows from Var(X − ρY ) ≥ 0 that ρ2 ≤ 1. 
Solution We have: 0 ≤ Var(X − ρY ) = Var(X) − 2ρ Cov(X, Y ) + ρ2 Var(Y ) = 1 − 2ρ2 + ρ2 = (1 − ρ2 ). Hence 1 − ρ2 ≥ 0, and so ρ2 ≤ 1. Activity 5.13 The distribution of a random variable X is: X=x P (X = x) −1 a 0 b 1 a Show that X and X 2 are uncorrelated. Solution This is an example of two random variables X and Y = X 2 which are uncorrelated, but obviously dependent. The bivariate distribution of (X, Y ) in this case is singular because of the complete functional dependence between them. We have: E(X) = −1 × a + 0 × b + 1 × a = 0 E(X 2 ) = +1 × a + 0 × b + 1 × a = 2a E(X 3 ) = −1 × a + 0 × b + 1 × a = 0 and we must show that the covariance is zero: Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(X 3 ) − E(X) E(X 2 ) = 0 − 0 × 2a = 0. There are many possible choices for a and b which give a valid probability distribution, for instance a = 0.25 and b = 0.5. Activity 5.14 A fair coin is thrown n times, each throw being independent of the ones before. Let R = ‘the number of heads’, and S = ‘the number of tails’. Find the covariance of R and S. What is the correlation of R and S? Solution One can go about this in a straightforward way. If Xi is the number of heads and Yi is the number of tails on the ith throw, then the distribution of Xi and Yi is given by: X/Y 0 1 0 0 0.5 1 0.5 0 171 5. Multivariate random variables From this table, we compute the following: E(Xi ) = E(Yi ) = 0 × 0.5 + 1 × 0.5 = 0.5 E(Xi2 ) = E(Yi2 ) = 0 × 0.5 + 1 × 0.5 = 0.5 Var(Xi ) = Var(Yi ) = 0.5 − (0.5)2 = 0.25 E(Xi Yi ) = 0 × 0.5 + 0 × 0.5 = 0 Cov(Xi , Yi ) = E(Xi Yi ) − E(Xi ) E(Yi ) = 0 − 0.25 = −0.25. P P Now, since R = i Xi and S = i Yi , we can add covariances of independent Xi s and Yi s, just like means and variances, then: Cov(R, S) = −0.25n. Since R + S = n is a fixed quantity, there is a complete linear dependence between R and S. We have R = n − S, so the correlation between R and S should be −1. This can be checked directly since: Var(R) = Var(S) = 0.25n (add the variances of the Xi s or Yi s). The correlation between R and S works out as −0.25n/0.25n = −1. Activity 5.15 Suppose that X and Y are random variables, and a, b, c and d are constants. (a) Show that: Cov(aX + b, cY + d) = ac Cov(X, Y ). (b) Derive Corr(aX + b, cY + d). (c) Suppose that Z = cX + d, where c and d are constants. Using the result you obtained in (b), or in some other way, show that: Corr(X, Z) = 1 for c > 0 and: Corr(X, Z) = −1 for c < 0. Solution (a) Note first that: E(aX + b) = a E(X) + b and E(cY + d) = c E(Y ) + d. 172 5.7. Covariance and correlation Therefore, the covariance is: Cov(aX + b, cY + d) = E[(aX + b)(cY + d)] − E(aX + b) E(cY + d) = E(acXY + adX + bcY + bd) − [a E(X) + b] [c E(Y ) + d] = ac E(XY ) + ad E(X) + bc E(Y ) + bd − ac E(X) E(Y ) − ad E(X) − bc E(Y ) − bd = ac E(XY ) − ac E(X) E(Y ) = ac [E(XY ) − E(X) E(Y )] = ac Cov(X, Y ) as required. (b) Note first that: sd(aX + b) = |a| sd(X) and sd(cY + d) = |c| sd(Y ). Therefore, the correlation is: Corr(aX + b, cY + d) = Cov(aX + b, cY + d) sd(aX + b) sd(cY + d) ac Cov(X, Y ) |ac| sd(X) sd(Y ) ac Corr(X, Y ). = |ac| = (c) First, note that the correlation of a random variable with itself is 1, since: Cov(X, X) Var(X) Corr(X, X) = p = 1. = Var(X) Var(X) Var(X) In the result obtained in (b), select a = 1, b = 0 and Y = X. This gives: Corr(X, Z) = Corr(X, cX + d) = c c Corr(X, X) = . |c| |c| This gives the two cases mentioned in the question. • For c > 0, then Corr(X, cX + d) = 1. • For c < 0, then Corr(X, cX + d) = −1. 
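The conclusion of Activity 5.14 can also be seen in a simulation. The Python sketch below is our own illustration (the choice of n = 20 throws per sequence and 10,000 replications is arbitrary): it simulates many sequences of fair coin throws, records R and S for each, and computes their empirical covariance and correlation. The covariance comes out close to −0.25n and the correlation is −1 up to floating-point error, since S = n − R exactly. (The sample versions of covariance and correlation used here are defined formally in the next subsection.)

import random

random.seed(1)          # for reproducibility of this illustration
n, reps = 20, 10_000    # n throws per sequence, 10,000 simulated sequences

R = [sum(random.randint(0, 1) for _ in range(n)) for _ in range(reps)]  # heads
S = [n - r for r in R]                                                  # tails

mean_R, mean_S = sum(R) / reps, sum(S) / reps
cov_RS = sum((r - mean_R) * (s - mean_S) for r, s in zip(R, S)) / (reps - 1)
var_R = sum((r - mean_R) ** 2 for r in R) / (reps - 1)
var_S = sum((s - mean_S) ** 2 for s in S) / (reps - 1)
corr_RS = cov_RS / (var_R * var_S) ** 0.5

print(cov_RS)    # close to -0.25 * n = -5
print(corr_RS)   # -1 up to floating-point error, since S = n - R exactly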
5.7.3 Sample covariance and correlation We have just introduced covariance and correlation, two new characteristics of probability distributions (population distributions). We now discuss their sample equivalents. 173 5. Multivariate random variables Let (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be a sample of n pairs of observed values of two random variables X and Y . We can use these observations to calculate sample versions of the covariance and correlation between X and Y . These are measures of association in the sample, i.e. descriptive statistics. They are also estimates of the corresponding population quantities Cov(X, Y ) and Corr(X, Y ). The uses of these sample measures will be discussed in more detail later in the course. Sample covariance The sample covariance of random variables X and Y is calculated as: n d Cov(X, Y)= 1 X (Xi − X̄)(Yi − Ȳ ) n − 1 i=1 where X̄ and Ȳ are the sample means of X and Y , respectively. Sample correlation The sample correlation of random variables X and Y is calculated as: n P (Xi − X̄)(Yi − Ȳ ) d Cov(X, Y) i=1 r= =rn n P P SX SY (Xi − X̄)2 (Yi − Ȳ )2 i=1 i=1 where SX and SY are the sample standard deviations of X and Y , respectively. r is always between −1 and +1, and is equal to −1 or +1 only if X and Y are perfectly linearly related in the sample. r = 0 if X and Y are uncorrelated (not linearly related) in the sample. Example 5.9 Figure 5.2 shows different examples of scatterplots of observations of X and Y , and different values of the sample correlation, r. The line shown in each plot is the best-fitting (least squares) line for the scatterplot (which will be introduced later in the course). In (a), X and Y are perfectly linearly related, and r = 1. Plots (b), (c) and (e) show relationships of different strengths. In (c), the variables are negatively correlated. In (d), there is no linear relationship, and r = 0. Plot (f) shows that r can be 0 even if two variables are clearly related, if that relationship is not linear. 174 5.8. Independent random variables (a) r=1 (b) r=0.85 (c) r=-0.5 (d) r=0 (e) r=0.92 (f) r=0 Figure 5.2: Scatterplots depicting various sample correlations as discussed in Example 5.9. 5.8 Independent random variables Two discrete random variables X and Y are associated if pY |X (y | x) depends on x. What if it does not, i.e. what if: pX,Y (x, y) = pY (y) for all x and y pX (x) so that knowing the value of X does not help to predict Y ? pY |X (y | x) = This implies that: pX,Y (x, y) = pX (x) pY (y) for all x, y. (5.1) X and Y are independent of each other if and only if (5.1) is true. Independent random variables In general, suppose that X1 , X2 , . . . , Xn are discrete random variables. These are independent if and only if their joint pf is: p(x1 , x2 , . . . , xn ) = p1 (x1 ) p2 (x2 ) · · · pn (xn ) for all numbers x1 , x2 , . . . , xn , where p1 (x1 ), . . . , pn (xn ) are the univariate marginal pfs of X1 , . . . , Xn , respectively. 175 5. Multivariate random variables Similarly, continuous random variables X1 , X2 , . . . , Xn are independent if and only if their joint pdf is: f (x1 , x2 , . . . , xn ) = f1 (x1 ) f2 (x2 ) · · · fn (xn ) for all x1 , x2 , . . . , xn , where f1 (x1 ), . . . , fn (xn ) are the univariate marginal pdfs of X1 , . . . , Xn , respectively. If two random variables are independent, they are also uncorrelated, i.e. we have: Cov(X, Y ) = 0 and Corr(X, Y ) = 0. The reverse is not true, i.e. two random variables can be dependent even when their correlation is 0. 
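A concrete numerical instance of this: let X be uniform on {−1, 0, 1} and let Y = X², so that Y is completely determined by X and the two are certainly dependent. The short enumeration below, our own illustration, confirms that Cov(X, Y) = 0 all the same (this is Activity 5.13 with a = b = 1/3).

# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X
support = [(-1, 1), (0, 0), (1, 1)]   # possible (x, y) pairs
probs = [1/3, 1/3, 1/3]

E_X = sum(p * x for (x, y), p in zip(support, probs))        # 0
E_Y = sum(p * y for (x, y), p in zip(support, probs))        # 2/3
E_XY = sum(p * x * y for (x, y), p in zip(support, probs))   # E(X^3) = 0

print(E_XY - E_X * E_Y)   # Cov(X, Y) = 0, although Y = X^2 depends on X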
This can happen when the dependence is non-linear. Example 5.10 The football example is an instance of this. The conditional distributions pY |X (y | x) are clearly not all the same, but the correlation is very nearly 0 (see Example 5.8). Another example is plot (f) in Figure 5.2, where the dependence is not linear, but quadratic. 5.8.1 Joint distribution of independent random variables When random variables are independent, we can easily derive their joint pf or pdf as the product of their univariate marginal distributions. This is particularly simple if all the marginal distributions are the same. Example 5.11 Suppose that X1 , X2 , . . . , Xn are independent, and each of them follows the Poisson distribution with the same mean λ. Therefore, the marginal pf of each Xi is: e−λ λxi p(xi ) = xi ! and the joint pf of the random variables is: p(x1 , x2 , . . . , xn ) = p(x1 ) p(x2 ) · · · p(xn ) = n Y i=1 p(xi ) = n Y e−λ λxi i=1 xi ! P = e −nλ λi Q xi ! xi . i Example 5.12 For a continuous example, suppose that X1 , X2 , . . . , Xn are independent, and each of them follows a normal distribution with the same mean µ and same variance σ 2 . Therefore, the marginal pdf of each Xi is: 1 (xi − µ)2 f (xi ) = √ exp − 2σ 2 2πσ 2 176 5.8. Independent random variables and the joint pdf of the variables is: f (x1 , x2 , . . . , xn ) = f (x1 ) f (x2 ) · · · f (xn ) = n Y f (xi ) i=1 n Y (xi − µ)2 √ = exp − 2σ 2 2πσ 2 i=1 # " n 1 1 X n exp − 2 (xi − µ)2 . = √ 2σ 2 2πσ i=1 1 Activity 5.16 X1 , . . . , Xn are independent Bernoulli random variables. The probability function of Xi is given by: ( (1 − πi )1−xi πixi for xi = 0, 1 p(xi ) = 0 otherwise where: eiθ 1 + eiθ for i = 1, 2, . . . , n. Derive the joint probability function, p(x1 , x2 , . . . , xn ). πi = Solution Since the Xi s are independent (but not identically distributed) random variables, we have: n Y p(x1 , x2 , . . . , xn ) = p(xi ). i=1 So, the joint probability function is: p(x1 , x2 , . . . , xn ) = n Y i=1 1 1 + eiθ 1−xi eiθ 1 + eiθ xi = n Y i=1 eiθxi 1 + eiθ θ n P ixi e i=1 . = Q n (1 + eiθ ) i=1 Activity 5.17 X1 , . . . , Xn are independent random variables with the common probability density function: ( λ2 x e−λx for x > 0 f (x) = 0 otherwise. Derive the joint probability density function, f (x1 , x2 , . . . , xn ). Solution Since the Xi s are independent (and identically distributed) random variables, we 177 5. Multivariate random variables have: f (x1 , x2 , . . . , xn ) = n Y f (xi ). i=1 So, the joint probability density function is: f (x1 , x2 , . . . , xn ) = n Y 2 λ xi e −λxi =λ 2n n Y xi e −λx1 −λx2 −···−λxn =λ 2n xi e −λ n P xi i=1 . i=1 i=1 i=1 n Y Activity 5.18 X1 , . . . , Xn are independent random variables with the common probability function: m θx p(x) = for x = 0, 1, 2, . . . , m x (1 + θ)m and 0 otherwise. Derive the joint probability function, p(x1 , x2 , . . . , xn ). Solution Since the Xi s are independent (and identically distributed) random variables, we have: n Y p(x1 , x2 , . . . , xn ) = p(xi ). i=1 So, the joint probability function is: p(x1 , x2 , . . . , xn ) = n Y i=1 m θ xi = xi (1 + θ)m n Y i=1 ! x1 x2 m θ θ · · · θ xn = xi (1 + θ)nm i=1 Activity 5.19 Show that if: P (X ≤ x ∩ Y ≤ y) = (1 − e−x ) (1 − e−2y ) for all x, y > 0, then X and Y are independent random variables, each with an exponential distribution. Solution The right-hand side of the result given is the product of the cdf of an exponential random variable X with mean 1 and the cdf of an exponential random variable Y with mean 2. 
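When the observations are independent, the joint pf really is just the product of the marginal pfs, and the algebraic simplification in Example 5.11 can be checked numerically. The Python sketch below is our own illustration, with an arbitrary value of λ and an arbitrary small sample: it evaluates the joint pf of independent Poisson(λ) random variables both as a product of marginal pfs and via the closed form e^(−nλ) λ^(x1+···+xn) / (x1! · · · xn!), and confirms that the two agree.

from math import exp, factorial, prod

lam = 2.5                # arbitrary Poisson mean for the illustration
xs = [1, 0, 3, 2, 1]     # an arbitrary observed sample x1, ..., xn
n = len(xs)

def poisson_pf(x, lam):
    # Marginal pf of a single Poisson(lambda) observation
    return exp(-lam) * lam**x / factorial(x)

# Joint pf as a product of the marginal pfs (using independence)
joint_product = prod(poisson_pf(x, lam) for x in xs)

# Joint pf via the simplified expression in Example 5.11
joint_closed_form = exp(-n * lam) * lam**sum(xs) / prod(factorial(x) for x in xs)

print(joint_product, joint_closed_form)   # equal, up to floating-point rounding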
So the result follows from the definition of independent random variables. Activity 5.20 The random variable X has a discrete uniform distribution with values 1, 2 and 3, i.e. P (X = i) = 1/3 for i = 1, 2, 3. The random variable Y has a discrete uniform distribution with values 1, 2, 3 and 4, i.e. P (Y = i) = 1/4 for i = 1, 2, 3, 4. X and Y are independent. 178 n P xi ! m θi=1 . xi (1 + θ)nm n Y 5.8. Independent random variables (a) Derive the probability distribution of X + Y . (b) What are E(X + Y ) and Var(X + Y )? Solution (a) The possible values of the sum are 2, 3, 4, 5, 6 and 7. Since X and Y are independent, the probabilities of the different sums are: P (X + Y = 2) = P (X = 1, Y = 1) = P (X = 1) P (Y = 1) = 1 1 1 × = 3 4 12 P (X + Y = 3) = P (X = 1) P (Y = 2) + P (X = 2) P (Y = 1) = 2 1 = 12 6 P (X + Y = 4) = P (X = 1) P (Y = 3) + P (X = 2) P (Y = 2) 1 3 = + P (X = 3) P (Y = 1) = 12 4 P (X + Y = 5) = P (X = 1) P (Y = 4) + P (X = 2) P (Y = 3) 3 1 + P (X = 3) P (Y = 2) = = 12 4 P (X + Y = 6) = P (X = 2) P (Y = 4) + P (X = 3) P (Y = 3) = P (X + Y = 7) = P (X = 3) P (Y = 4) = 2 1 = 12 6 1 12 and 0 for all other real numbers. (b) You could find the expectation and variance directly from the distribution of X + Y above. However, it is easier to use the expected value and variance of the discrete uniform distribution for both X and Y , and then the results on the expectation and variance of sums of independent random variables to get: E(X + Y ) = E(X) + E(Y ) = 1+3 1+4 + = 4.5 2 2 and: 32 − 1 42 − 1 23 Var(X + Y ) = Var(X) + Var(Y ) = + = ≈ 1.92. 12 12 12 Activity 5.21 Let X1 , . . . , Xk be independent random variables, and a1 , . . . , ak be constants. Show that: k k P P ai X i = ai E(Xi ) (a) E i=1 (b) Var k P i=1 i=1 ai X i = k P a2i Var(Xi ). i=1 179 5. Multivariate random variables Solution (a) We have: E k X ! ai X i = i=1 k X E(ai Xi ) = i=1 k X ai E(Xi ). i=1 (b) We have: Var k X ! ai X i = E i=1 k X ai X i − i=1 = E k X k X !2 ai E(Xi ) i=1 !2 ai (Xi − E(Xi )) i=1 = k X a2i E((Xi − E(Xi ))2 )+ i=1 X ai aj E((Xi − E(Xi ))(Xj − E(Xj ))) 1≤i6=j≤n = k X a2i Var(Xi )+ i=1 X ai aj E(Xi − E(Xi )) E(Xj − E(Xj )) 1≤i6=j≤n = k X a2i Var(Xi ). i=1 Additional note: remember there are two ways to compute the variance: Var(X) = E((X − µ)2 ) and Var(X) = E(X 2 ) − (E(X))2 . The former is more convenient for analytical derivations/proofs (see above), while the latter should be used to compute variances for common distributions such as Poisson or exponential distributions. Actually it is rather difficult to compute the variance for a Poisson distribution using the formula Var(X) = E((X − µ)2 ) directly. 5.9 Sums and products of random variables Suppose X1 , X2 , . . . , Xn are random variables. We now go from the multivariate setting back to the univariate setting, by considering univariate functions of X1 , X2 , . . . , Xn . In particular, we consider sums and products like: n X i=1 180 ai X i + b = a1 X 1 + a2 X 2 + · · · + an X n + b (5.2) 5.9. Sums and products of random variables and: n Y ai Xi = (a1 X1 ) (a2 X2 ) · · · (an Xn ) i=1 where a1 , a2 , . . . , an and b are constants. Each such sum or product is itself a univariate random variable. The probability distribution of such a function depends on the joint distribution of X1 , . . . , Xn . Example 5.13 In the football example, the sum Z = X + Y is the total number of goals scored in a match. 
Its probability function is obtained from the joint pf pX,Y (x, y), that is: Z=z pZ (z) 0 0.100 1 0.131 2 0.270 3 0.293 4 0.138 5 0.062 6 0.006 For example, pZP (1) = pX,Y (0, 1) + pX,Y (1, 0) = 0.031 + 0.100 = 0.131. The mean of Z is then E(Z) = z pZ (z) = 2.448. z Another example is the distribution of XY (see Example 5.8). However, what can we say about such distributions in general, in cases where we cannot derive them as easily? 5.9.1 Distributions of sums and products General results for the distributions of sums and products of random variables are available as follows: Sums Mean Yes Variance Yes No Normal: Yes Some other distributions: only for independent random variables No Distributional form 5.9.2 Products Only for independent random variables Expected values and variances of sums of random variables We state, without proof, the following important result. If X1 , X2 , . . . , Xn are random variables with means E(X1 ), E(X2 ), . . . , E(Xn ), 181 5. Multivariate random variables respectively, and a1 , a2 , . . . , an and b are constants, then: ! n X E ai Xi + b = E(a1 X1 + a2 X2 + · · · + an Xn + b) i=1 = a1 E(X1 ) + a2 E(X2 ) + · · · + an E(Xn ) + b = n X ai E(Xi ) + b. (5.3) i=1 Two simple special cases of this, when n = 2, are: E(X + Y ) = E(X) + E(Y ), obtained by choosing X1 = X, X2 = Y , a1 = a2 = 1 and b = 0 E(X − Y ) = E(X) − E(Y ), obtained by choosing X1 = X, X2 = Y , a1 = 1, a2 = −1 and b = 0. Example 5.14 In the football example, we have previously shown that E(X) = 1.383,E(Y ) = 1.065 and E(X + Y ) = 2.448. So E(X + Y ) = E(X) + E(Y ), as the theorem claims. If X1 , X2 , . . . , Xn are random variables with variances Var(X1 ), Var(X2 ), . . . , Var(Xn ), respectively, and covariances Cov(Xi , Xj ) for i 6= j, and a1 , a2 , . . . , an and b are constants, then: ! n n X X XX Var ai Xi + b = a2i Var(Xi ) + 2 ai aj Cov(Xi , Xj ). (5.4) i=1 i=1 i<j In particular, for n = 2: Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ) Var(X − Y ) = Var(X) + Var(Y ) − 2Cov(X, Y ). If X1 , X2 , . . . , Xn are independent random variables, then Cov(Xi , Xj ) = 0 for all i 6= j, and so (5.4) simplifies to: ! n n X X Var ai X i = a2i Var(Xi ). (5.5) i=1 i=1 In particular, for n = 2, when X and Y are independent: Var(X + Y ) = Var(X) + Var(Y ) Var(X − Y ) = Var(X) + Var(Y ). These results also hold whenever Cov(Xi , Xj ) = 0 for all i 6= j, even if the random variables are not independent. 182 5.9. Sums and products of random variables 5.9.3 Expected values of products of independent random variables If X1 , X2 , . . . , Xn are independent random variables and a1 , a2 , . . . , an are constants, then: ! n n Y Y E ai Xi = E[(a1 X1 )(a2 X2 ) · · · (an Xn )] = ai E(Xi ). i=1 i=1 In particular, when X and Y are independent: E(XY ) = E(X) E(Y ). There is no corresponding simple result for the means of products of dependent random variables. There is also no simple result for the variances of products of random variables, even when they are independent. 5.9.4 Distributions of sums of random variables We now know the expected value and variance of the sum: a1 X 1 + a2 X 2 + · · · + an X n + b whatever the joint distribution of X1 , X2 , . . . , Xn . This is usually all we can say about the distribution of this sum. In particular, the form of the distribution of the sum (i.e. its pf/pdf) depends on the joint distribution of X1 , X2 , . . . , Xn , and there are no simple general results about that. 
For example, even if X and Y have distributions from the same family, the distribution of X + Y is often not from that same family. However, such results are available for a few special cases. Sums of independent binomial and Poisson random variables Suppose X1 , X2 , . . . , Xn are random variables, and we consider the unweighted sum: n X Xi = X1 + X 2 + · · · + X n . i=1 That is, the general sum given by (5.2), with a1 = a2 = · · · = an = 1 and b = 0. The following results hold when the random variables X1 , X2 , . . . , Xn are independent, but not otherwise. P P If Xi ∼ Bin(ni , π), then i Xi ∼ Bin( i ni , π). P P If Xi ∼ Poisson(λi ), then i Xi ∼ Poisson( i λi ). Activity 5.22 Cars pass a point on a busy road at an average rate of 150 per hour. Assume that the number of cars in an hour follows a Poisson distribution. Other motor vehicles (lorries, motorcycles etc.) pass the same point at the rate of 75 per hour. Assume a Poisson distribution for these vehicles too, and assume that the number of other vehicles is independent of the number of cars. 183 5. Multivariate random variables (a) What is the probability that one car and one other motor vehicle pass in a two-minute period? (b) What is the probability that two motor vehicles of any type (cars, lorries, motorcycles etc.) pass in a two-minute period? Solution (a) Let X denote the number of cars, and Y denote the number of other motor vehicles in a two-minute period. We need the probability given by P (X = 1, Y = 1), which is P (X = 1) P (Y = 1) since X and Y are independent. A rate of 150 cars per hour is a rate of 5 per two minutes, so X ∼ Poisson(5). The probability of one car passing in two minutes is P (X = 1) = e−5 (5)1 /1! = 0.0337. The rate for other vehicles over two minutes is 2.5, so Y ∼ Poisson(2.5) and so P (Y = 1) = e−2.5 (2.5)1 /1! = 0.2052. Hence the probability for one vehicle of each type is 0.0337 × 0.2055 = 0.0069. (b) Here we require P (Z = 2), where Z = X + Y . Since the sum of two independent Poisson variables is again Poisson (see Section 5.10.5), then Z ∼ Poisson(5 + 2.5) = Poisson(7.5). Therefore, the required probability is: P (Z = 2) = e−7.5 × (7.5)2 = 0.0156. 2! Application to the binomial distribution An easy proof that the mean and variance of X ∼ Bin(n, π) are E(X) = n π and Var(X) = n π (1 − π) is as follows. 1. Let Z1 , . . . , Zn be independent random variables, each distributed as Zi ∼ Bernoulli(π) = Bin(1, π). 2. It is easy to show that E(Zi ) = π and Var(Zi ) = π (1 − π) for each i = 1, . . . , n (see (4.3) and (4.4)). 3. Also n P Zi = X ∼ Bin(n, π) by the result above for sums of independent binomial i=1 random variables. 4. Therefore, using the results (5.2) and (5.5), we have: E(X) = n X E(Zi ) = n π and Var(X) = i=1 n X Var(Zi ) = n π (1 − π). i=1 Sums of normally distributed random variables All sums (linear combinations) of normally distributed random variables are also normally distributed. 184 5.9. Sums and products of random variables Suppose X1 , X2 , . . . , Xn are normally distributed random variables, with Xi ∼ N (µi , σi2 ) for i = 1, . . . , n, and a1 , . . . , an and b are constants, then: n X ai Xi + b ∼ N (µ, σ 2 ) i=1 where: µ= n X 2 ai µi + b and σ = i=1 n X a2i σi2 + 2 XX ai aj Cov(Xi , Xj ). i<j i=1 If the Xi s are independent (or just uncorrelated), i.e. if Cov(Xi , Xj ) = 0 for all i 6= j, n P the variance simplifies to σ 2 = a2i σi2 . 
i=1 Example 5.15 Suppose that in the population of English people aged 16 or over: the heights of men (in cm) follow a normal distribution with mean 174.9 and standard deviation 7.39 the heights of women (in cm) follow a normal distribution with mean 161.3 and standard deviation 6.85. Suppose we select one man and one woman at random and independently of each other. Denote the man’s height by X and the woman’s height by Y . What is the probability that the man is at most 10 cm taller than the woman? In other words, what is the probability that the difference between X and Y is at most 10? Since X and Y are independent we have: 2 D = X − Y ∼ N (µX − µY , σX + σY2 ) = N (174.9 − 161.3, (7.39)2 + (6.85)2 ) = N (13.6, (10.08)2 ). The probability we need is: P (D ≤ 10) = P 10 − 13.6 D − 13.6 ≤ 10.08 10.08 = P (Z ≤ −0.36) = P (Z ≥ 0.36) = 0.3594 using Table 4 of the New Cambridge Statistical Tables. The probability that a randomly selected man is at most 10 cm taller than a randomly selected woman is about 0.3594. 185 5. Multivariate random variables Activity 5.23 At one stage in the manufacture of an article a piston of circular cross-section has to fit into a similarly-shaped cylinder. The distributions of diameters of pistons and cylinders are known to be normal with parameters as follows. • Piston diameters: mean 10.42 cm, standard deviation 0.03 cm. • Cylinder diameters: mean 10.52 cm, standard deviation 0.04 cm. If pairs of pistons and cylinders are selected at random for assembly, for what proportion will the piston not fit into the cylinder (i.e. for which the piston diameter exceeds the cylinder diameter)? (a) What is the chance that in 100 pairs, selected at random: i. every piston will fit? ii. not more than two of the pistons will fail to fit? (b) Calculate both of these probabilities: i. exactly ii. using a Poisson approximation. Discuss the appropriateness of using this approximation. Solution Let P ∼ N (10.42, (0.03)2 ) for the pistons, and C ∼ N (10.52, (0.04)2 ) for the cylinders. It follows that D ∼ N (0.1, (0.05)2 ) for the difference (adding the variances, assuming independence). The piston will fit if D > 0. We require: 0 − 0.1 = P (Z > −2) = 0.9772 P (D > 0) = P Z > 0.05 so the proportion of 1 − 0.9772 = 0.0228 will not fit. The number of pistons, N , failing to fit out of 100 will be a binomial random variable such that N ∼ Bin(100, 0.0228). (a) Calculating directly, we have the following. i. P (N = 0) = (0.9772)100 = 0.0996. ii. P (N ≤ 2) = (0.9772)100 + 100(0.9772)99 (0.0228) + 100 2 (0.9772)98 (0.0228)2 = 0.6005. (b) Using the Poisson approximation with λ = 100 × 0.0228 = 2.28, we have the following. i. P (N = 0) ≈ e−2.28 = 0.1023. ii. P (N ≤ 2) ≈ e−2.28 + e−2.28 × 2.28 + e−2.28 × (2.28)2 /2! = 0.6013. The approximations are good (note there will be some rounding error, but the values are close with the two methods). It is not surprising that there is close agreement since n is large, π is small and n π < 5. 186 5.10. Overview of chapter 5.10 Overview of chapter This chapter has introduced how to deal with more than one random variable at a time. Focusing mainly on discrete bivariate distributions, the relationships between joint, marginal and conditional distributions were explored. Sums and products of random variables concluded the chapter. 
5.11 Key terms and concepts Association Conditional distribution Conditional variance Covariance Independence Joint probability (density) function Multivariate 5.12 Bivariate Conditional mean Correlation Dependence Joint probability distribution Marginal distribution Uncorrelated Sample examination questions Solutions can be found in Appendix C. 1. Consider two random variables X and Y taking the values 0 and 1. The joint probabilities for the pair are given by the following table Y =0 Y =1 X=0 1/2 − α α X=1 α 1/2 − α (a) What are the values α can take? Explain your answer. Now let α = 1/4, and: U= max(X, Y ) 3 and V = min(X, Y ) where max(X, Y ) means the larger of X and Y , and min(X, Y ) means the smaller of X and Y . For example, max(0, 1) = 1, min(0, 1) = 0, and min(0, 0) = max(0, 0) = 0. (b) Compute the mean of U and the mean of V . (c) Are U and V independent? Explain your answer. 2. The amount of coffee dispensed into a coffee cup by a coffee machine follows a normal distribution with mean 150 ml and standard deviation 10 ml. The coffee is sold at the price of £1 per cup. However, the coffee cups are marked at the 137 ml level, and any cup with coffee below this level will be given away free of charge. The amounts of coffee dispensed in different cups are independent of each other. 187 5. Multivariate random variables (a) Find the probability that the total amount of coffee in 5 cups exceeds 700 ml. (b) Find the probability that the difference in the amounts of coffee in 2 cups is smaller than 20 ml. (c) Find the probability that one cup is filled below the level of 137 ml. (d) Find the expected income from selling one cup of coffee. 3. There are six houses on Station Street, numbered 1 to 6. The postman has six letters to deliver, one addressed to each house. As he is sloppy and in a hurry he does not look at which letter he puts in which letterbox (one per house). (a) Explain in words why the probability that the people living in the first house receive the correct letter is equal to 1/6. (b) Let Xi (for i = 1, . . . , 6) be the random variable which is equal to 1 if the people living in house number i receive the correct letter, and equal to 0 otherwise. Show that E(Xi ) = 1/6. (c) Show that X1 and X2 are not independent. (d) Calculate Cov(X1 , X2 ). 188 Chapter 6 Sampling distributions of statistics 6.1 Synopsis of chapter This chapter considers the idea of sampling and the concept of a sampling distribution for a statistic (such as a sample mean) which must be understood by all users of statistics. 6.2 Learning outcomes After completing this chapter, you should be able to: demonstrate how sampling from a population results in a sampling distribution for a statistic prove and apply the results for the mean and variance of the sampling distribution of the sample mean when a random sample is drawn with replacement state the central limit theorem and recall when the limit is likely to provide a good approximation to the distribution of the sample mean. 6.3 Introduction Suppose we have a sample of n observations of a random variable X: {X1 , X2 , . . . , Xn }. We have already stated that in statistical inference each individual observation Xi is regarded as a value of a random variable X, with some probability distribution (that is, the population distribution). In this chapter we discuss how we define and work with: the joint distribution of the whole sample {X1 , X2 , . . . , Xn }, treated as a multivariate random variable distributions of univariate functions of {X1 , X2 , . . . 
, Xn } (statistics). 189 6. Sampling distributions of statistics 6.4 Random samples Many of the results discussed here hold for many (or even all) probability distributions, not just for some specific distributions. It is then convenient to use generic notation. We use f (x) to denote both the pdf of a continuous random variable, and the pf of a discrete random variable. The parameter(s) of a distribution are generally denoted as θ. For example, for the Poisson distribution θ stands for λ, and for the normal distribution θ stands for (µ, σ 2 ). Parameters are often included in the notation: f (x; θ) denotes the pf/pdf of a distribution with parameter(s) θ, and F (x; θ) is its cdf. For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the cdf F (x; θ)’, respectively. The simplest assumptions about the joint distribution of the sample are as follows. 1. {X1 , X2 , . . . , Xn } are independent random variables. 2. {X1 , X2 , . . . , Xn } are identically distributed random variables. Each Xi has the same distribution f (x; θ), with the same value of the parameter(s) θ. The random variables {X1 , X2 , . . . , Xn } are then called: independent and identically distributed (IID) random variables from the distribution (population) f (x; θ) a random sample of size n from the distribution (population) f (x; θ). We will assume this most of the time from now. So you will see many examples and questions which begin something like: ‘Let {X1 , . . . , Xn } be a random sample from a normal distribution with mean µ and variance σ 2 . . . ’. 6.4.1 Joint distribution of a random sample The joint probability distribution of the random variables in a random sample is an important quantity in statistical inference. It is known as the likelihood function. You will hear more about it in the chapter on point estimation. For a random sample the joint distribution is easy to derive, because the Xi s are independent. 190 6.5. Statistics and their sampling distributions The joint pf/pdf of a random sample is: f (x1 , x2 , . . . , xn ) = f (x1 ; θ) f (x2 ; θ) · · · f (xn ; θ) = n Y f (xi ; θ). i=1 Other assumptions about random samples Not all problems can be seen as IID random samples of a single random variable. There are other possibilities, which you will see more of in the future. IID samples from multivariate population distributions. For example, a sample of n Q (Xi , Yi ), with the joint distribution f (xi , yi ). i=1 Independent but not identically distributed observations. For example, observations (Xi , Yi ) where Yi (the ‘response variable’) is treated as random, but Xi (the ‘explanatory variable’) is not. Hence the joint distribution of the Yi s is n Q fY |X (yi | xi ; θ) where fY |X (y | x; θ) is the conditional distribution of Y given X. i=1 This is the starting point of regression modelling (introduced later in the course). Non-independent observations. For example, a time series {Y1 , Y2 , . . . , YT } where i = 1, 2, . . . , T are successive time points. The joint distribution of the series is, in general: f (y1 ; θ) f (y2 | y1 ; θ) f (y3 | y1 , y2 ; θ) · · · f (yT | y1 , . . . , yT −1 ; θ). Random samples and their observed values Here we treat {X1 , X2 , . . . , Xn } as random variables. Therefore, we consider what values {X1 , X2 , . . . , Xn } might have in different samples. Once a real sample is actually observed, the values of {X1 , X2 , . . . 
, Xn } in that specific sample are no longer random variables, but realised values of random variables, i.e. known numbers. Sometimes this distinction is emphasised in the notation by using: X1 , X2 , . . . , Xn for the random variables x1 , x2 , . . . , xn for the observed values. 6.5 Statistics and their sampling distributions A statistic is a known function of the random variables {X1 , X2 , . . . , Xn } in a random sample. 191 6. Sampling distributions of statistics Example 6.1 All of the following are statistics: the sample mean X̄ = n P Xi /n i=1 the sample variance S 2 = n P (Xi − X̄)2 /(n − 1) and standard deviation S = √ S2 i=1 the sample median, quartiles, minimum, maximum etc. quantities such as: n X i=1 Xi2 and X̄ √ . S/ n Here we focus on single (univariate) statistics. More generally, we could also consider vectors of statistics, i.e. multivariate statistics. 6.5.1 Sampling distribution of a statistic A (simple) random sample is modelled as a sequence of IID random variables. A statistic is a function of these random variables, so it is also a random variable, with a distribution of its own. In other words, if we collected several random samples from the same population, the values of a statistic would not be the same from one sample to the next, but would vary according to some probability distribution. The sampling distribution is the probability distribution of the values which the statistic would have in a large number of samples collected (independently) from the same population. Example 6.2 Suppose we collect a random sample of size n = 20 from a normal population (distribution) X ∼ N (5, 1). Consider the following statistics: sample mean X̄, sample variance S 2 , and maxX = max(X1 , X2 , . . . , Xn ). Here is one such random sample (with values rounded to 2 decimal places): 6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09 4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58 For this random sample, the values of our statistics are: x̄ = 4.94 s2 = 0.90 maxx = 6.58. 192 6.5. Statistics and their sampling distributions Here is another such random sample (with values rounded to 2 decimal places): 5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90 5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27 For this sample, the values of our statistics are: x̄ = 5.22 (the first sample had x̄ = 4.94) s2 = 0.80 (the first sample had s2 = 0.90) maxx = 6.62 (the first sample had maxx = 6.58). Activity 6.1 Suppose that {X1 , X2 , . . . , Xn } is a random sample from a continuous distribution with probability density function fX (x) and cumulative distribution function FX (x). Here we consider the sampling distribution of the statistic Y = X(n) = max{X1 , X2 , . . . , Xn }, i.e. the largest value of Xi in the random sample, for i = 1, . . . , n. (a) Write down the formula for the cumulative distribution function FY (y) of Y , i.e. for the probability that all observations in the sample are ≤ y. (b) From the result in (a), derive the probability density function fY (y) of Y . (c) The heights (in cm) of men aged over 16 in England are approximately normally distributed with a mean of 174.9 and a standard deviation of 7.39. What is the probability that in a random sample of 60 men from this population at least one man is more than 1.92 metres tall? Solution (a) The probability that a single randomly-selected observation of X is at most y is P (Xi ≤ y) = FX (y). Since the Xi s are independent, the probability that they are all at most y is: FY (y) = P (X1 ≤ y, X2 ≤ y, . . . , Xn ≤ y) = [FX (y)]n . 
(b) The pdf is the first derivative of the cdf, so:

fY(y) = F′Y(y) = n [FX(y)]^(n−1) fX(y)

since fX(x) = F′X(x).

(c) Here Xi ∼ N(174.9, (7.39)²). Therefore:

FX(192) = P(X ≤ 192) = P(Z ≤ (192 − 174.9)/7.39) ≈ P(Z ≤ 2.31)

where Z ∼ N(0, 1). We have that P(Z ≤ 2.31) = 1 − 0.01044 = 0.98956. Therefore, the probability we need is:

P(Y > 192) = 1 − P(Y ≤ 192) = 1 − [FX(192)]^60 = 1 − (0.98956)^60 = 0.4672.

How to derive a sampling distribution?

The sampling distribution of a statistic is the distribution of the values of the statistic in (infinitely) many repeated samples. However, typically we only have one sample which was actually observed. Therefore, the sampling distribution seems like an essentially hypothetical concept.

Nevertheless, it is possible to derive the forms of sampling distributions of statistics under different assumptions about the sampling schemes and population distribution f(x; θ). There are two main ways of doing this.

Exactly or approximately through mathematical derivation. This is the most convenient way for subsequent use, but is not always easy.

With simulation, i.e. by using a computer to generate (artificial) random samples from a population distribution of a known form.

Example 6.3 Consider again a random sample of size n = 20 from the population X ∼ N(5, 1), and the statistics X̄, S² and maxX.

We first consider deriving the sampling distributions of these by approximation through simulation. Here a computer was used to draw 10,000 independent random samples of size n = 20 from N(5, 1), and the values of X̄, S² and maxX for each of these random samples were recorded. Figures 6.1, 6.2 and 6.3 show histograms of the statistics for these 10,000 random samples.

We now consider deriving the exact sampling distribution. Here this is possible. For a random sample of size n from N(µ, σ²) we have:

(a) X̄ ∼ N(µ, σ²/n)

(b) (n − 1)S²/σ² ∼ χ²_(n−1)

(c) the sampling distribution of Y = maxX has the following pdf:

fY(y) = n [FX(y)]^(n−1) fX(y)

where FX(x) and fX(x) are the cdf and pdf of X ∼ N(µ, σ²), respectively.

Curves of the densities of these distributions are also shown in Figures 6.1, 6.2 and 6.3.

Figure 6.1: Simulation-generated sampling distribution of X̄ to accompany Example 6.3.

6.6 Sample mean from a normal population

Consider one very common statistic, the sample mean:

X̄ = (1/n)(X1 + X2 + · · · + Xn) = (1/n)X1 + (1/n)X2 + · · · + (1/n)Xn.

What is the sampling distribution of X̄?

We know from Section 5.9.2 that for independent {X1, . . . , Xn} from any distribution:

E(a1 X1 + a2 X2 + · · · + an Xn) = a1 E(X1) + a2 E(X2) + · · · + an E(Xn)

and:

Var(a1 X1 + a2 X2 + · · · + an Xn) = a1² Var(X1) + a2² Var(X2) + · · · + an² Var(Xn).

For a random sample, all Xi s are independent and E(Xi) = E(X) is the same for all of them, since the Xi s are identically distributed. X̄ = (X1 + · · · + Xn)/n is of this form, with ai = 1/n for all i = 1, . . . , n. Therefore:

E(X̄) = (1/n)E(X) + · · · + (1/n)E(X) = n × (1/n) × E(X) = E(X)

and:

Var(X̄) = (1/n²)Var(X) + · · · + (1/n²)Var(X) = n × (1/n²) × Var(X) = Var(X)/n.

Figure 6.2: Simulation-generated sampling distribution of S² to accompany Example 6.3.

Figure 6.3: Simulation-generated sampling distribution of maxX to accompany Example 6.3.

So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random sample from any population distribution of X.
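These two results, together with the simulation approach of Example 6.3, are easy to check numerically. Below is a minimal R sketch (R is the software used for the simulations later in this guide); it is an illustration rather than the exact code behind Figures 6.1–6.3, and the seed and plotting commands are arbitrary choices.

# Approximate the sampling distributions of X-bar, S^2 and max(X) by simulation:
# 10,000 random samples of size n = 20 from N(5, 1), as in Example 6.3
set.seed(123)
reps <- 10000
n <- 20
xbar <- numeric(reps)
s2 <- numeric(reps)
mx <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(n, mean = 5, sd = 1)
  xbar[r] <- mean(x)   # sample mean
  s2[r] <- var(x)      # sample variance (n - 1 divisor)
  mx[r] <- max(x)      # sample maximum
}
mean(xbar)   # close to E(X-bar) = mu = 5
var(xbar)    # close to Var(X-bar) = sigma^2/n = 1/20 = 0.05
# Histogram of the simulated means with the exact N(5, 1/20) density overlaid
hist(xbar, freq = FALSE, main = "Sample mean")
curve(dnorm(x, mean = 5, sd = sqrt(1/20)), add = TRUE)

Averaging the simulated values gives an empirical check that E(X̄) = 5 and Var(X̄) = 1/20, in line with the results just derived.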
What about the form of the sampling distribution of X̄? This depends on the distribution of X, and is not generally known. However, when the distribution of X is normal, we do know that the sampling distribution of X̄ is also normal. Suppose that {X1 , . . . , Xn } is a random sample from a normal distribution with mean µ and variance σ 2 , then: σ2 X̄ ∼ N µ, . n For example, the pdf drawn on the histogram in Figure 6.1 is that of N (5, 1/20). We have E(X̄) = E(X) = µ. In an individual sample, x̄ is not usually equal to µ, the expected value of the population. However, over repeated samples the values of X̄ are centred at µ. √ We also have Var(X̄) = Var(X)/n = σ 2 /n, and hence also sd(X̄) = σ/ n. The variation of the values of X̄ in different samples (the sampling variance) is large when the population variance of X is large. More interestingly, the sampling variance gets smaller when the sample size n increases. In other words, when n is large the distribution of X̄ is more tightly concentrated around µ than when n is small. Figure 6.4 shows sampling distributions of X̄ from N (5, 1) for different n. Example 6.4 Suppose that the heights (in cm) of men (aged over 16) in a population follow a normal distribution with some unknown mean µ and a known standard deviation of 7.39. We plan to select a random sample of n men from the population, and measure their heights. How large should n be so that there is a probability of at least 0.95 that the sample mean X̄ will be within 1 cm of the population mean µ? √ Here X ∼ N (µ, (7.39)2 ), so X̄ ∼ N (µ, (7.39/ n)2 ). What we need is the smallest n such that: P (|X̄ − µ| ≤ 1) ≥ 0.95. 197 6. Sampling distributions of statistics n=100 n=20 n=5 4.0 4.5 5.0 5.5 6.0 x Figure 6.4: Sampling distributions of X̄ from N (5, 1) for different n. So: P (|X̄ − µ| ≤ 1) ≥ 0.95 P (−1 ≤ X̄ − µ ≤ 1) ≥ 0.95 −1 X̄ − µ 1 √ ≤ √ ≤ √ P ≥ 0.95 7.39/ n 7.39/ n 7.39/ n √ √ n n ≤Z≤ ≥ 0.95 P − 7.39 7.39 √ 0.05 n P Z> = 0.025 < 7.39 2 where Z ∼ N (0, 1). From Table 4 of the New Cambridge Statistical Tables, we see that the smallest z which satisfies P (Z > z) < 0.025 is z = 1.97. Therefore: √ n ≥ 1.97 ⇔ n ≥ (7.39 × 1.97)2 = 211.9. 7.39 Therefore, n should be at least 212. Activity 6.2 Suppose that the heights of students are normally distributed with a mean of 68.5 inches and a standard deviation of 2.7 inches. If 200 random samples of size 25 are drawn from this population with means recorded to the nearest 0.1 inch, find: (a) the expected mean and standard deviation of the sampling distribution of the mean 198 6.6. Sample mean from a normal population (b) the expected number of recorded sample means which fall between 67.9 and 69.2 inclusive (c) the expected number of recorded sample means falling below 67.0. Solution (a) The sampling distribution of the mean of 25 observations has the same mean as the population, which is√68.5 inches. The standard deviation (standard error) of the sample mean is 2.7/ 25 = 0.54. (b) Notice that the samples are random, so we cannot be sure exactly how many will have means between 67.9 and 69.2 inches. We can work out the probability that the sample mean will lie in this interval using the sampling distribution: X̄ ∼ N (68.5, (0.54)2 ). We need to make a continuity correction, to account for the fact that the recorded means are rounded to the nearest 0.1 inch. For example, the probability that the recorded mean is ≥ 67.9 inches is the same as the probability that the sample mean is > 67.85. 
Therefore, the probability we want is: 69.25 − 68.5 67.85 − 68.5 <Z< P (67.85 < X < 69.25) = P 0.54 0.54 = P (−1.20 < Z < 1.39) = Φ(1.39) − Φ(−1.20) = 0.9177 − (1 − 0.1151) = 0.8026. Since there are 200 independent random samples drawn, we can now think of each as a single trial. The recorded mean lies between 67.9 and 69.2 with probability 0.8026 at each trial. We are dealing with a binomial distribution with n = 200 trials and probability of success π = 0.8026. The expected number of successes is: n π = 200 × 0.8026 = 160.52. (c) The probability that the recorded mean is < 67.0 inches is: 66.95 − 68.5 P (X < 66.95) = P Z < = P (Z < −2.87) = Φ(−2.87) = 0.00205 0.54 so the expected number of recorded means below 67.0 out of a sample of 200 is: 200 × 0.00205 = 0.41. Activity 6.3 Suppose that we plan to take a random sample of size n from a normal distribution with mean µ and standard deviation σ = 2. 199 6. Sampling distributions of statistics (a) Suppose µ = 4 and n = 20. i. What is the probability that the mean X̄ of the sample is greater than 5? ii. What is the probability that X̄ is smaller than 3? iii. What is P (|X̄ − µ| ≤ 1) in this case? (b) How large should n be in order that P (|X̄ − µ| ≤ 0.5) ≥ 0.95 for every possible value of µ? (c) It is claimed that the true value of µ is 5 in a population. A random sample of size n = 100 is collected from this population, and the mean for this sample is x̄ = 5.8. Based on the result in (b), what would you conclude from this value of X̄? Solution (a) Let {X1 , . . . , Xn } denote the random sample. We know that the sampling distribution of X̄ is N (µ, σ 2 /n), here N (4, 22 /20) = N (4, 0.2). i. The probability we need is: 5−4 X̄ − 4 > √ = P (Z > 2.24) = 0.0126 P (X̄ > 5) = P √ 0.2 0.2 where, as usual, Z ∼ N (0, 1). ii. P (X̄ < 3) is obtained similarly. Note that this leads to P (Z < −2.24) = 0.0126, which is equal to the P (X̄ > 5) = P (Z > 2.24) result obtained above. This is because 5 is one unit above the mean µ = 4, and 3 is one unit below the mean, and because the normal distribution is symmetric around its mean. iii. One way of expressing this is: P (X̄ − µ > 1) = P (X̄ − µ < −1) = 0.0126 for µ = 4. This also shows that: P (X̄ − µ > 1) + P (X̄ − µ < −1) = P (|X̄ − µ| > 1) = 2 × 0.0126 = 0.0252 and hence: P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748. In other words, the probability is 0.9748 that the sample mean is within one unit of the true population mean, µ = 4. (b) We can use the same ideas as in (a). Since X̄ ∼ N (µ, 4/n) we have: P (|X̄ − µ| ≤ 0.5) = 1 − 2 × P (X̄ − µ > 0.5) 0.5 X̄ − µ >p =1−2×P p 4/n 4/n √ = 1 − 2 × P (Z > 0.25 n) ≥ 0.95 200 ! 6.7. The central limit theorem which holds if: √ 0.05 P (Z > 0.25 n) ≤ = 0.025. 2 Using Table √ 4 of the New Cambridge Statistical2 Tables, we see that this is true when 0.25 n ≥ 1.96, i.e. when n ≥ (1.96/0.25) = 61.5. Rounding up to the nearest integer, we get n ≥ 62. The sample size should be at least 62 for us to be 95% confident that the sample mean will be within 0.5 units of the true mean, µ. (c) Here n > 62, yet x̄ is further than 0.5 units from the claimed mean of µ = 5. Based on the result in (b), this would be quite unlikely if µ is really 5. One explanation of this apparent contradiction is that µ is not really equal to 5. This kind of reasoning will be the basis of statistical hypothesis testing, which will be discussed later in the course. 
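For calculations of this kind, the normal probabilities and quantiles can also be obtained in R with pnorm and qnorm rather than read from the printed tables. The following sketch reworks Activity 6.3 and Example 6.4; the answers differ very slightly from the text wherever the text rounds z values taken from the tables.

# Activity 6.3(a): X-bar ~ N(4, 4/20) when mu = 4, sigma = 2 and n = 20
pnorm(5, mean = 4, sd = sqrt(4/20), lower.tail = FALSE)   # P(X-bar > 5), approx 0.0127
# Activity 6.3(b): smallest n with P(|X-bar - mu| <= 0.5) >= 0.95,
# i.e. the smallest n satisfying 0.25*sqrt(n) >= z, where P(Z > z) = 0.025
z <- qnorm(0.975)       # 1.959964...
ceiling((z / 0.25)^2)   # 62, as in the solution above
# Example 6.4: smallest n with P(|X-bar - mu| <= 1) >= 0.95 when sigma = 7.39
ceiling((7.39 * z)^2)   # 210 with the exact quantile; the text obtains 212 using z = 1.97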
6.7 The central limit theorem

We have discussed the very convenient result that if a random sample comes from a normally-distributed population, the sampling distribution of X̄ is also normal. How about sampling distributions of X̄ from other populations?

For this, we can use a remarkable mathematical result, the central limit theorem (CLT). In essence, the CLT states that the normal sampling distribution of X̄, which holds exactly for random samples from a normal distribution, also holds approximately for random samples from nearly any distribution.

The CLT applies to ‘nearly any’ distribution because it requires that the variance of the population distribution is finite. If it is not (such as for some Pareto distributions, introduced in Chapter 3), the CLT does not hold. However, such distributions are not common.

Suppose that {X1, X2, . . . , Xn} is a random sample from a population distribution which has mean E(Xi) = µ < ∞ and variance Var(Xi) = σ² < ∞, that is with a finite mean and finite variance. Let X̄n denote the sample mean calculated from a random sample of size n. Then:

lim (as n → ∞) P((X̄n − µ)/(σ/√n) ≤ z) = Φ(z)

for any z, where Φ(z) denotes the cdf of the standard normal distribution.

The ‘lim as n → ∞’ indicates that this is an asymptotic result, i.e. one which holds increasingly well as n increases, and exactly when the sample size is infinite.

In less formal language, the CLT says that for a random sample from nearly any distribution with mean µ and variance σ²:

X̄ ∼ N(µ, σ²/n)

approximately, when n is sufficiently large. We can then say that X̄ is asymptotically normally distributed with mean µ and variance σ²/n.

The wide reach of the CLT

It may appear that the CLT is still somewhat limited, in that it applies only to sample means calculated from random (IID) samples. However, this is not really true, for two main reasons.

There are more general versions of the CLT which do not require the observations Xi to be IID.

Even the basic version applies very widely, when we realise that the ‘X’ can also be a function of the original variables in the data. For example, if X and Y are random variables in the sample, we can also apply the CLT to:

(1/n)[log(X1) + · · · + log(Xn)]  or  (1/n)[X1 Y1 + · · · + Xn Yn].

Therefore, the CLT can also be used to derive sampling distributions for many statistics which do not initially look at all like X̄ for a single random variable in an IID sample. You may get to do this in future courses.

How large is ‘large n’?

The larger the sample size n, the better the normal approximation provided by the CLT is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the approximation to be ‘accurate enough’. This also depends on the population distribution of Xi. For example:

for symmetric distributions, even small n is enough

for very skewed distributions, larger n is required.

For many distributions, n > 30 is sufficient for the approximation to be reasonably accurate.

Example 6.5 In the first case, we simulate random samples of sizes n = 1, 5, 10, 30, 100 and 1000 from the Exponential(0.25) distribution (for which µ = 4 and σ² = 16). This is clearly a skewed distribution, as shown by the histogram for n = 1 in Figure 6.5.

10,000 independent random samples of each size were generated. Histograms of the values of X̄ in these random samples are shown in Figure 6.5. Each plot also shows the pdf of the approximating normal distribution, N(4, 16/n).
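The simulations behind Figure 6.5 can be sketched along the following lines in R. This is an illustration rather than the exact code used to produce the figure; the seed and plotting details are arbitrary.

# CLT illustration: sampling distribution of X-bar for samples from Exponential(0.25),
# which has mu = 1/0.25 = 4 and sigma^2 = 1/(0.25)^2 = 16
set.seed(123)
reps <- 10000
for (n in c(1, 5, 10, 30, 100, 1000)) {
  xbar <- replicate(reps, mean(rexp(n, rate = 0.25)))
  hist(xbar, freq = FALSE, main = paste("n =", n))
  # overlay the CLT approximation N(4, 16/n)
  curve(dnorm(x, mean = 4, sd = sqrt(16/n)), add = TRUE)
}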
The normal approximation is reasonably good already for n = 30, very good for n = 100, and practically perfect for n = 1000. 202 6.7. The central limit theorem n = 10 n=5 n=1 0 10 20 30 40 0 2 4 6 8 n = 30 2 3 4 5 6 10 12 14 2 4 6 n = 100 7 2.5 3.0 3.5 4.0 4.5 5.0 8 10 n = 1000 5.5 3.6 3.8 4.0 4.2 4.4 Figure 6.5: Sampling distributions of X̄ for various n when sampling from the Exponential(0.25) distribution. Example 6.6 In the second case, we simulate 10,000 independent random samples of sizes: n = 1, 10, 30, 50, 100 and 1000 from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16). Here the distribution of Xi itself is not even continuous, and has only two possible values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very well-approximated by the normal distribution, when n is large enough. n P Note that since here Xi = 1 or Xi = 0 for all i, X̄ = Xi /n = m/n, where m is the i=1 number of observations for which Xi = 1. In other words, X̄ is the sample proportion of the value X = 1. The normal approximation is clearly very bad for small n, but reasonably good already for n = 50, as shown by the histograms in Figure 6.6. Activity 6.4 A random sample of 25 audits is to be taken from a company’s total audits, and the average value of these audits is to be calculated. (a) Explain what is meant by the sampling distribution of this average and discuss its relationship to the population mean. (b) Is it reasonable to assume that this sampling distribution is normal? (c) If the population of all audits has a mean of £54 and a standard deviation of £10, find the probability that: 203 6. Sampling distributions of statistics n = 30 n = 10 n=1 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 0.0 0.1 0.2 0.3 0.4 0.5 n = 1000 n = 100 n = 50 0.0 0.1 0.2 0.3 0.4 0.50.05 0.10 0.15 0.20 0.25 0.30 0.35 0.16 0.18 0.20 0.22 0.24 Figure 6.6: Sampling distributions of X̄ for various n when sampling from the Bernoulli(0.2) distribution. i. the sample mean will be greater than £60 ii. the sample mean will be within 5% of the population mean. Solution (a) The sample average is composed of 25 randomly sampled data which are subject to sampling variability, hence the average is also subject to this variability. Its sampling distribution describes its probability properties. If a large number of such averages were independently sampled, then their histogram would be the sampling distribution. (b) It is reasonable to assume that this sampling distribution is normal due to the CLT, although the sample size is rather small. If n = 25 and µ = 54 and σ = 10, then the CLT says that: σ2 100 X̄ ∼ N µ, = N 54, . n 25 (c) i. We have: P (X̄ > 60) = P 204 60 − 54 Z>p 100/25 ! = P (Z > 3) = 0.0013. 6.7. The central limit theorem ii. We are asked for: P (0.95 × 54 < X̄ < 1.05 × 54) = P 0.05 × 54 −0.05 × 54 <Z< 2 2 = P (−1.35 < Z < 1.35) = 0.8230. Activity 6.5 A manufacturer of objects packages them in boxes of 200. It is known that, on average, the objects weigh 1 kg with a standard deviation of 0.03 kg. The manufacturer is interested in calculating: P (200 objects weigh more than 200.5 kg) which would help detect whether too many objects are being put in a box. Explain how you would calculate the (approximate?) value of this probability. Mention any relevant theorems or assumptions needed. Solution Let Xi denote the weight of the ith object, for i = 1, . . . , 200. The Xi s are assumed to be independent and identically distributed with E(Xi ) = 1 and Var(Xi ) = (0.03)2 . We require: ! ! 
200 200 X X Xi > 1.0025 = P (X̄ > 1.0025). P Xi > 200.5 = P 200 i=1 i=1 If the weights are not normally distributed, then by the central limit theorem: 1.0025 − 1 √ P (X̄ > 1.0025) ≈ P Z > = P (Z > 1.18) = 0.1190. 0.03/ 200 If the weights are normally distributed, then this is the exact (rather than approximate) probability. Activity 6.6 (a) Suppose {X1 , X2 , X3 , X4 } is a random sample of size n = 4 from the n P Bernoulli(0.2) distribution. What is the distribution of Xi in this case? i=1 (b) Write down the sampling distribution of X̄ = n P Xi /n for the sample considered i=1 in (a). In other words, write down the possible values of X̄ and their probabilities. P Hint: what are the possible values of i Xi , and their probabilities? (c) Suppose we have a random sample of size n = 100 from the Bernoulli(0.2) distribution. What is the approximate sampling distribution of X̄ suggested by 205 6. Sampling distributions of statistics the central limit theorem in this case? Use this distribution to calculate an approximate value for the probability that X̄ > 0.3. (The true value of this probability is 0.0061.) Solution (a) The sum of n independent Bernoulli random variables, each with success 4 P probability π, is Bin(n, π). Here n = 4 and π = 0.2, so Xi ∼ Bin(4, 0.2). i=1 P (b) The possible values of Xi are 0, 1, 2, 3 and 4, and their probabilities can be calculated from the binomial distribution. For example: X 4 P Xi = 1 = (0.2)1 (0.8)3 = 4 × 0.2 × 0.512 = 0.4096. 1 The other probabilities are shown in the table below. P Since X̄ = Xi /4, the possible values of X̄ are 0, 0.25, 0.5, 0.75 and P 1. Their probabilities are the same asP those of the corresponding values of Xi . For example, P (X̄ = 0.25) = P ( Xi = 1) = 0.4096. The values and their probabilities are: X̄ = x̄ P (X̄ = x̄) 0.0 0.4096 0.25 0.4096 0.5 0.1536 0.75 0.0256 1.0 0.0016 (c) For Xi ∼ Bernoulli(π), E(Xi ) = π and Var(Xi ) = π (1 − π). Therefore, the approximate normal sampling distribution of X̄, derived from the central limit theorem, is N (π, π (1 − π)/n). Here this is: 0.2 × 0.8 = N (0.2, 0.0016) = N (0.2, (0.04)2 ). N 0.2, 100 Therefore, the probability requested by the question is approximately: X̄ − 0.2 0.3 − 0.2 P (X̄ > 0.3) = P > = P (Z > 2.5) = 0.0062. 0.04 0.04 This is very close to the probability obtained from the exact sampling distribution, which is about 0.0061. Activity 6.7 A country is about to hold a referendum about leaving the European Union. A survey of a random sample of adult citizens of the country is conducted. In the sample, n respondents say that they plan to vote in the referendum. These n respondents are then asked whether they plan to vote ‘Yes’ or ‘No’. Define X = 1 if such a person plans to vote ‘Yes’, and X = 0 if such a person plans to vote ‘No’. Suppose that in the whole population 49% of those people who plan to vote are currently planning to vote Yes, and hence the referendum result would show a (very 206 6.7. The central limit theorem small) majority opposing leaving the European Union. (a) Let X̄ = n P Xi /n denote the proportion of the n voters in the sample who plan i=1 to vote Yes. What is the central limit theorem approximation of the sampling distribution of X̄ here? (b) If there are n = 50 likely voters in the sample, what is the probability that X̄ > 0.5? (Such an opinion poll would suggest a majority supporting leaving the European Union in the referendum.) (c) How large should n be so that there is less than a 1% chance that X̄ > 0.5 in the random sample? 
(This means less than a 1% chance of the opinion poll incorrectly predicting a majority supporting leaving the European Union in the referendum.) Solution (a) Here the individual responses, the Xi s, follow a Bernoulli distribution with probability parameter π = 0.49. The mean of this distribution is 0.49, and the variance is 0.49 × 0.51. Therefore, the central limit theorem (CLT) approximation of the sampling distribution of X̄ is: 2 ! 0.4999 0.49 × 0.51 √ = N 0.49, . X̄ ∼ N 0.49, n n (b) When n = 50, the CLT approximation from (a) is X̄ ∼ N (0.49, (0.0707)2 ). With this, we get: 0.5 − 0.49 X̄ − 0.49 > P (X̄ > 0.5) = P = P (Z > 0.14) = 0.4443. 0.0707 0.0707 (c) Here we need the smallest integer n such that: √ 0.5 − 0.49 X̄ − 0.49 √ > √ = P (Z > 0.0200 n) < 0.01. P (X̄ > 0.5) = P 0.4999/ n 0.4999/ n Using Table 4 of the New Cambridge Statistical Tables, the smallest z such that P (Z > z) < 0.01 is z = 2.33. Therefore, we need: √ 0.0200 n ≥ 2.33 ⇔ n≥ 2.33 0.0200 2 = 13572.25 which means that we need at least n = 13,573 likely voters in the sample – which is a very large sample size! Of course, the reason for this is that the population of likely voters is almost equally split between those supporting leaving the European Union, and those opposing. Hence such a large sample size is necessary to be very confident of obtaining a representative sample. 207 6. Sampling distributions of statistics Activity 6.8 Suppose {X1 , X2 , . . . , Xn } is a random sample from the Poisson(λ) distribution. (a) What is the sampling distribution of n P Xi ? i=1 (b) Write down the sampling distribution of X̄ = n P Xi /n. In other words, write i=1 down the possible values of X̄ and their probabilities. (Assume n is not large.) n P Hint: What are the possible values of Xi and their respective probabilities? i=1 (c) What are the mean and variance of the sampling distribution of X̄ when λ = 5 and n = 100? Solution (a) The sum of n independent Poisson(λ) random variables follows the Poisson(nλ) distribution. P P (b) Since i Xi has possible values 0, 1, 2, . . ., the possible values of X̄ = i Xi /n are 0/n, 1/n, 2/n, . . .. The probabilities of these values are determined by the P probabilities of the values of i Xi , which are obtained from P the Poisson(nλ) distribution. Therefore, the probability function of X̄ = i Xi /n is: ( e−nλ (nλ)nx̄ /(nx̄)! for x̄ = 0, 1/n, 2/n, . . . p(x̄) = 0 otherwise. (c) For Xi ∼ Poisson(λ) we have E(Xi ) = Var(Xi ) = λ, so the general results for X̄ give E(X̄) = λ and Var(X̄) = λ/n. When λ = 5 and n = 100, E(X̄) = 5 and Var(X̄) = 5/100 = 0.05. Activity 6.9 Suppose that a random sample of size n is to be taken from a non-normal distribution for which µ = 4 and σ = 2. Use the central limit theorem to determine, approximately, the smallest value of n for which: P (|X̄n − µ| < 0.2) ≥ 0.95 where X̄n denotes the sample mean, which depends on n. Solution By the central limit theorem we have: σ2 X̄ ∼ N µ, n approximately, as n → ∞. Hence: √ √ n(X̄n − µ) n(X̄n − 4) = → Z ∼ N (0, 1). σ 2 208 6.8. Some common sampling distributions Therefore: √ √ P (|X̄n − µ| < 0.2) ≈ P (|Z| < 0.1 n) = 2 × Φ(0.1 n) − 1. √ However, 2 × Φ(0.1 n) − 1 ≥ 0.95 if and only if: √ 1 + 0.95 Φ(0.1 n) ≥ = 0.975 2 which is satisfied if: √ 0.1 n ≥ 1.96 ⇒ n ≥ 384.16. Hence the smallest possible value of n is 385. 6.8 Some common sampling distributions In the remaining chapters, we will make use of results like the following. Suppose that {X1 , . . . , Xn } and {Y1 , . . . 
, Ym } are two independent random samples from N(µ, σ²), then:

(n − 1)S²X/σ² ∼ χ²_(n−1)  and  (m − 1)S²Y/σ² ∼ χ²_(m−1)

(X̄ − Ȳ)/√(1/n + 1/m) × √((n + m − 2)/[(n − 1)S²X + (m − 1)S²Y]) ∼ t_(n+m−2)

and:

S²X/S²Y ∼ F_(n−1, m−1).

Here ‘χ²’, ‘t’ and ‘F’ refer to three new families of probability distributions:

the χ² (‘chi-squared’) distribution

the t distribution

the F distribution.

These are not often used as distributions of individual variables. Instead, they are used as sampling distributions for various statistics. Each of them arises from the normal distribution in a particular way. We will now briefly introduce their main properties. This is in preparation for statistical inference, where the uses of these distributions will be discussed at length.

6.8.1 The χ² distribution

Definition of the χ² distribution

Let Z1, Z2, . . . , Zk be independent N(0, 1) random variables. If:

X = Z1² + Z2² + · · · + Zk²

the distribution of X is the χ² distribution with k degrees of freedom. This is denoted by X ∼ χ²(k) or X ∼ χ²k.

The χ²k distribution is a continuous distribution, which can take values of x ≥ 0. Its mean and variance are:

E(X) = k  and  Var(X) = 2k.

For reference, the probability density function of X ∼ χ²k is:

f(x) = (2^(k/2) Γ(k/2))^(−1) x^(k/2 − 1) e^(−x/2)  for x > 0, and 0 otherwise

where:

Γ(α) = ∫ from 0 to ∞ of x^(α−1) e^(−x) dx

is the gamma function, which is defined for all α > 0. (Note the formula of the pdf of X ∼ χ²k is not examinable.)

The shape of the pdf depends on the degrees of freedom k, as illustrated in Figure 6.7. In most applications of the χ² distribution the appropriate value of k is known, in which case it does not need to be estimated from data.

If X1, X2, . . . , Xm are independent random variables and Xi ∼ χ²_(ki), then their sum is also χ²-distributed where the individual degrees of freedom are added, such that:

X1 + X2 + · · · + Xm ∼ χ²_(k1 + k2 + ··· + km).

The uses of the χ² distribution will be discussed later. One example though is if {X1, X2, . . . , Xn} is a random sample from the population N(µ, σ²), and S² is the sample variance, then:

(n − 1)S²/σ² ∼ χ²_(n−1).

This result is used to derive basic tools of statistical inference for both µ and σ² for the normal distribution.

Figure 6.7: χ² pdfs for various degrees of freedom (k = 1, 2, 4, 6 and k = 10, 20, 30, 40).

Tables of the χ² distribution

In the examination, you will need a table of some probabilities for the χ² distribution. Table 8 of the New Cambridge Statistical Tables shows the following information.

The rows correspond to different degrees of freedom k (denoted as ν in Table 8). The table shows values of k up to 100.

The columns correspond to the right-tail probability as a percentage, that is P(X > x) = P/100, where X ∼ χ²k, for different values of P, ranging from 50 to 0.05 (that is, right-tail probabilities ranging from 0.5 to 0.0005).

The numbers in the table are values of z such that P(X > z) = P/100 for the k and P in that row and column, respectively.

Example 6.7 Consider the ‘ν = 5’ row, the 9.236 in the ‘P = 10’ column and the 11.07 in the ‘P = 5’ column. These mean, for X ∼ χ²5, that:

P(X > 9.236) = 0.10 [and hence P(X ≤ 9.236) = 0.90]

P(X > 11.07) = 0.05 [and hence P(X ≤ 11.07) = 0.95].

These also provide bounds for probabilities of other values.
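Outside the examination, such tail probabilities can be computed directly, for example with pchisq and qchisq in R (the software used for the simulations earlier in this chapter). A quick check of the Example 6.7 values:

# Right-tail probabilities and percentage points of the chi-squared distribution
pchisq(9.236, df = 5, lower.tail = FALSE)   # P(X > 9.236) = 0.10 for X ~ chi-squared, 5 df
pchisq(11.07, df = 5, lower.tail = FALSE)   # P(X > 11.07) = 0.05
qchisq(0.10, df = 5, lower.tail = FALSE)    # 9.236..., the 10% right-tail value
pchisq(10, df = 5, lower.tail = FALSE)      # approx 0.075, which lies between 0.05 and 0.10

The printed tables only give selected percentage points, so in hand calculations such probabilities are usually bounded rather than computed exactly.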
For example, since 10.00 is between 9.236 and 11.07, we can conclude that: 0.05 < P (X > 10.00) < 0.10. Activity 6.10 If Z is a random variable with a standard normal distribution, what is P (Z 2 < 3.841)? 211 6. Sampling distributions of statistics Solution We can compute the probability in two different ways. Working with the standard normal distribution, we have: √ √ P (Z 2 < 3.841) = P (− 3.841 < Z < 3.841) = P (−1.96 < Z < 1.96) = Φ(1.96) − Φ(−1.96) = 0.9750 − (1 − 0.9750) = 0.95. Alternatively, we can use the fact that Z 2 follows a χ21 distribution. Using Table 8 of the New Cambridge Statistical Tables we can see that 3.841 is the 5% right-tail value for this distribution, and so P (Z 2 < 3.84) = 0.95, as before. Activity 6.11 Suppose that X1 and X2 are independent N (0, 4) random variables. Compute P (X12 < 36.84 − X22 ). Solution Rearrange the inequality to obtain: P (X12 < 36.84 − X22 ) = P (X12 + X22 < 36.84) 2 36.84 X1 + X22 < =P 4 4 ! 2 2 X1 X2 =P + < 9.21 . 2 2 Since X1 /2 and X2 /2 are independent N (0, 1) random variables, the sum of their squares will follow a χ22 distribution. Using Table 8 of the New Cambridge Statistical Tables, we see that 9.210 is the 1% right-tail value, so the probability we are looking for is 0.99. Activity 6.12 Suppose A, B and C are independent chi-squared random variables with 5, 7 and 10 degrees of freedom, respectively. Calculate: (a) P (B < 12) (b) P (A + B + C < 14) (c) P (A3 + B 3 + C 3 < 0). In this question, you should use the closest value given in the available statistical tables. Further approximation is not required. 212 6.8. Some common sampling distributions Solution (a) P (B < 12) ≈ 0.9, directly from Table 8 of the New Cambridge Statistical Tables, where B ∼ χ27 . (b) A + B + C ∼ χ25+7+10 = χ222 , so P (A + B + C < 14) is the probability that such a random variable is less than 14, which is approximately 0.1 from Table 8. (c) A chi-squared random variable only assumes non-negative values. Hence each of A, B and C is non-negative, so A3 + B 3 + C 3 ≥ 0, and: P (A3 + B 3 + C 3 < 0) = 0. 6.8.2 (Student’s) t distribution Definition of Student’s t distribution Suppose Z ∼ N (0, 1), X ∼ χ2k , and Z and X are independent. The distribution of the random variable: Z T =p X/k is the t distribution with k degrees of freedom. This is denoted T ∼ tk or T ∼ t(k). The distribution is also known as ‘Student’s t distribution’. The tk distribution is continuous with the pdf: −(k+1)/2 Γ((k + 1)/2) x2 f (x) = √ 1+ k kπ Γ(k/2) 0.4 for all −∞ < x < ∞. Examples of f (x) for different k are shown in Figure 6.8. (Note the formula of the pdf of tk is not examinable.) 0.0 0.1 0.2 0.3 N(0,1) k=1 k=3 k=8 k=20 −2 0 2 Figure 6.8: Student’s t pdfs for various degrees of freedom. From Figure 6.8, we see the following. 213 6. Sampling distributions of statistics The distribution is symmetric around 0. As k → ∞, the tk distribution tends to the standard normal distribution, so tk with large k is very similar to N (0, 1). For any finite value of k, the tk distribution has heavier tails than the standard normal distribution, i.e. tk places more probability on values far from 0 than N (0, 1) does. For T ∼ tk , the mean and variance of the distribution are: E(T ) = 0 for k > 1 and: k for k > 2. k−2 This means that for t1 neither E(T ) nor Var(T ) exist, and for t2 , Var(T ) does not exist. Var(T ) = Tables of the t distribution In the examination, you will need a table of some probabilities for the t distribution. 
Table 10 of the New Cambridge Statistical Tables shows the following information. The rows correspond to different degrees of freedom k (denoted as ν in Table 10). The table shows values of k up to 120, and then ‘∞’, which is N (0, 1). • If you need a tk distribution for which k is not in the table, use the nearest value or use interpolation. The columns correspond to the right-tail probability P (T > z) = P/100, where T ∼ tk , for various P ranging from 40 to 0.05. The numbers in the table are values of t such that P (T > t) = P/100 for the k and P in that row and column. Example 6.8 Consider the ‘ν = 4’ row, and the ‘P = 5’ column. This means, where T ∼ t4 , that: P (T > 2.132) = 0.05 [and hence P (T ≤ 2.132) = 0.95]. The table also provides bounds for other probabilities. For example, the number in the ‘P = 2.5’ column is 2.776, so P (T > 2.776) = 0.025. Since 2.132 < 2.5 < 2.776, we know that 0.025 < P (T > 2.5) < 0.05. Results for left-tail probabilities P (T < z) = P/100, where T ∼ tk , can also be obtained, because the t distribution is symmetric around 0. This means that P (T < t) = P (T > −t). Using T ∼ t4 , for example: P (T < −2.132) = P (T > 2.132) = 0.05 and P (T < −2.5) < 0.05 [since P (T > 2.5) < 0.05]. This is the same trick that we used for the standard normal distribution. 214 6.8. Some common sampling distributions Activity 6.13 The independent random variables X1 , X2 and X3 are each normally distributed with a mean of 0 and a variance of 4. Find: (a) P (X1 > X2 + X3 ) (b) P (X1 > 5(X22 + X32 )1/2 ). Solution (a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, hence: X1 − X2 − X3 ∼ N (0, 12). So: P (X1 > X2 + X3 ) = P (X1 − X2 − X3 > 0) = P (Z > 0) = 0.5 using Table 3 of the New Cambridge Statistical Tables. (b) We have: P (X1 > 5(X22 + X32 )1/2 ) = P =P X1 >5 2 X22 X32 + 4 4 √ X1 >5 2 2 1/2 ! X22 X32 + 4 4 1/2 ! ! √ 2 p √ i.e. P (Y1 > 5 2Y2 ), where Y1 ∼ N (0, 1) and Y2 ∼ χ22 /2, or P (Y3 > 7.07), where Y3 ∼ t2 . From Table 10 of the New Cambridge Statistical Tables, this is approximately 0.01. Activity 6.14 The independent random variables X1 , X2 , X3 and X4 are each normally distributed with a mean of 0 and a variance of 4. Using statistical tables, derive values for k in each of the following cases: (a) P (3X1 + 4X2 > 5) = k p 2 2 (b) P X1 > k X3 + X4 = 0.025 (c) P (X12 + X22 + X32 < k) = 0.9 Solution (a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, 4, hence 3X1 ∼ N (0, 36) and 4X2 ∼ N (0, 64). Therefore: 3X1 + 4X2 = Z ∼ N (0, 1). 10 So, P (3X1 + 4X2 > 5) = k = P (Z > 0.5) = 0.3085, using Table 4 of the New Cambridge Statistical Tables. 215 6. Sampling distributions of statistics (b) We have Xi /2 ∼ N (0, 1), for i = 1, 2, 3, 4, hence (X32 + X42 )/4 ∼ χ22 . So: q √ P X1 > k X32 + X42 = 0.025 = P (T > k 2) √ where T ∼ t2 and hence k 2 = 4.303, so k = 3.04268, using Table 10 of the New Cambridge Statistical Tables. (c) We have (X12 + X22 + X32 )/4 ∼ χ23 , so: P (X12 + X22 + X32 k < k) = 0.9 = P X < 4 where X ∼ χ23 . Therefore, k/4 = 6.251 using Table 8 of the New Cambridge Statistical Tables. Hence k = 25.004. Activity 6.15 Suppose Xi ∼ N (0, 4), for i = 1, 2, 3, 4. Assume all these random variables are independent. Derive the value of k in each of the following. (a) P (X1 + 4X2 > 5) = k. (b) P (X12 + X22 + X32 + X42 < k) = 0.99. p (c) P X1 < k X22 + X32 = 0.01. Solution (a) Since X1 + 4X2 ∼ N (0, 68), then: X1 + 4X2 5 √ P (X1 + 4X2 > 5) = P >√ = P (Z > 0.61) = 0.2709 68 68 where Z ∼ N (0, 1). 
(b) Xi2 /4 ∼ χ21 for i = 1, 2, 3, 4, hence (X12 + X22 + X32 + X42 )/4 ∼ χ24 , so: k 2 2 2 2 P (X1 + X2 + X3 + X4 < k) = P X < = 0.99 4 where X ∼ χ24 . Hence k/4 = 13.277, so k = 53.108. √ (c) X1 / 4 ∼ N (0, 1) and (X22 + X32 )/4 ∼ χ22 , hence: √ √ 2X1 X1 / 4 q p = ∼ t2 . (X22 +X32 )/4 X22 + X32 2 Therefore: where T ∼ t2 . Hence 216 √ P (T < √ 2k) = 0.01 2k = −6.965, so k = −4.925. 6.8. Some common sampling distributions 6.8.3 The F distribution Definition of the F distribution Let U and V be two independent random variables, where U ∼ χ2p and V ∼ χ2k . The distribution of: U/p F = V /k is the F distribution with degrees of freedom (p, k), denoted F ∼ Fp, k or F ∼ F (p, k). The F distribution is a continuous distribution, with non-zero probabilities for x > 0. The general shape of its pdf is shown in Figure 6.9. f(x) (10,50) (10,10) (10,3) 0 1 2 3 4 x Figure 6.9: F pdfs for various degrees of freedom. For F ∼ Fp, k , E(F ) = k/(k − 2), for k > 2. If F ∼ Fp, k , then 1/F ∼ Fk, p . If T ∼ tk , then T 2 ∼ F1, k . Tables of F distributions will be needed for some purposes. They will be available in the examination. We will postpone practice with them until later in the course. Activity 6.16 Let Xi , for i = 1, 2, 3, 4, be independent random variables such that Xi ∼ N (i, i2 ). For each of the following situations, use the Xi s to construct a statistic with the indicated distribution. Note there could be more than one possible answer for each. (a) χ23 . (b) t2 . (c) F1, 2 . 217 6. Sampling distributions of statistics Solution The following are possible, but not exhaustive, solutions. (a) We could have: 2 3 X Xi − i i i=1 ∼ χ23 . (b) We could have: X1 − 1 s 3 P i=2 X −i 2 i i ∼ t2 . /2 (c) We could have: (X1 − 1)2 ∼ F1, 2 . 3 P Xi −i 2 /2 i i=2 Activity 6.17 Suppose {Zi }, for i = 1, 2, . . . , k, are independent and identically distributed standard normal random variables, i.e. Zi ∼ N (0, 1), for i = 1, 2, . . . , k. State the distribution of: (a) Z12 (b) Z12 /Z22 p (c) Z1 / Z22 (d) k P Zi /k i=1 (e) k P Zi2 i=1 (f) 3/2 × (Z12 + Z22 )/(Z32 + Z42 + Z52 ). Solution (a) Z12 ∼ χ21 (b) Z12 /Z22 ∼ F1, 1 p (c) Z1 / Z22 ∼ t1 (d) k P Zi /k ∼ N (0, 1/k) i=1 (e) k P Zi2 ∼ χ2k i=1 (f) 3/2 × (Z12 + Z22 )/(Z32 + Z42 + Z52 ) ∼ F2, 3 . 218 6.9. Prelude to statistical inference Activity 6.18 X1 , X2 , X3 and X4 are independent normally distributed random variables each with a mean of 0 and a standard deviation of 3. Find: (a) P (X1 + 2X2 > 9) (b) P (X12 + X22 > 54) (c) the distribution of (X12 + X22 )/(X32 + X42 ). Solution (a) We have X1 ∼ N (0, 9) and X2 ∼ N (0, 9). Hence 2X2 ∼ N (0, 36) and X1 + 2X2 ∼ N (0, 45). So: 9 P (X1 + 2X2 > 9) = P Z > √ = P (Z > 1.34) = 0.0901 45 using Table 3 of the New Cambridge Statistical Tables. (b) We have X1 /3 ∼ N (0, 1) and X2 /3 ∼ N (0, 1). Hence X12 /9 ∼ χ21 and X22 /9 ∼ χ21 . Therefore, X12 /9 + X22 /9 ∼ χ22 . So: P (X12 + X22 > 54) = P (Y > 6) = 0.05 where Y ∼ χ22 , using Table 8 of the New Cambridge Statistical Tables. (c) We have X12 /9 + X22 /9 ∼ χ22 and also X32 /9 + X42 /9 ∼ χ22 . So: (X12 + X22 )/18 X12 + X22 = ∼ F2, 2 . (X32 + X42 )/18 X32 + X42 6.9 Prelude to statistical inference We conclude Chapter 6 with a discussion of the preliminaries of statistical inference before moving on to point estimation. The discussion below will review some key concepts introduced previously. So, just what is ‘Statistics’ ? It is a scientific subject of collecting and ‘making sense’ of data. 
Collection: designing experiments/questionnaires, designing sampling schemes, and administration of data collection. Making sense: estimation, testing and forecasting. So, ‘Statistics’ is an application-oriented subject, particularly useful or helpful in answering questions such as the following. Does a certain new drug prolong life for AIDS sufferers? 219 6. Sampling distributions of statistics Is global warming really happening? Are GCSE and A-level examination standards declining? Is the gap between rich and poor widening in Britain? Is there still a housing bubble in London? Is the Chinese yuan undervalued? If so, by how much? These questions are difficult to study in a laboratory, and admit no self-evident axioms. Statistics provides a way of answering these types of questions using data. What should we learn in ‘Statistics’ ? The basic ideas, methods and theory. Some guidelines for learning/applying statistics are the following. Understand what data say in each specific context. All the methods are just tools to help us to understand data. Concentrate on what to do and why, rather than on concrete calculations and graphing. It may take a while to catch the basic idea of statistics – keep thinking! 6.9.1 Population versus random sample Consider the following two practical examples. Example 6.9 A new type of tyre was designed to increase its lifetime. The manufacturer tested 120 new tyres and obtained the average lifetime (over these 120 tyres) of 35,391 miles. So the manufacturer claims that the mean lifetime of new tyres is 35,391 miles. Example 6.10 A newspaper sampled 1,000 potential voters, and 350 of them were supporters of Party X. It claims that the proportion of Party X voters in the whole country is 350/1000 = 0.35, i.e. 35%. In both cases, the conclusion is drawn on a population (i.e. all the objects concerned) based on the information from a sample (i.e. a subset of the population). In Example 6.9, it is impossible to measure the whole population. In Example 6.10, it is not economical to measure the whole population. Therefore, errors are inevitable! The population is the entire set of objects concerned, and these objects are typically represented by some numbers. We do not know the entire population in practice. In Example 6.9, the population consists of the lifetimes of all tyres, including those to be produced in the future. For the opinion poll in Example 6.10, the population consists of many ‘1’s and ‘0’s, where each ‘1’ represents a voter for Party X, and each ‘0’ represents a voter for other parties. 220 6.9. Prelude to statistical inference A sample is a (randomly) selected subset of a population, and is known in practice. The population is unknown. We represent a population by a probability distribution. Why do we need a model for the entire population? Because the questions we ask concern the entire population, not just the data we have. Having a model for the population tells us that the remaining population is not much different from our data or, in other words, that the data are representative of the population. Why do we need a random model? Because the process of drawing a sample from a population is a bit like the process of generating random variables. A different sample would produce different values. Therefore, the population from which we draw a random sample is represented as a probability distribution. 
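The analogy with generating random variables can be made concrete with a small simulation. The sketch below is a hypothetical illustration (the 35% support figure echoes Example 6.10, and the sample size of 1,000 is an arbitrary choice): two surveys drawn from the same population give different sample proportions.

# Each respondent's answer is modelled as a Bernoulli(0.35) random variable;
# two independent samples of 1,000 voters from the same population
sample1 <- rbinom(1000, size = 1, prob = 0.35)
sample2 <- rbinom(1000, size = 1, prob = 0.35)
mean(sample1)   # sample proportion of Party X supporters in the first survey
mean(sample2)   # sample proportion in the second survey, typically a different value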
6.9.2 Parameter versus statistic For a given problem, we typically assume a population to be a probability distribution F (x; θ), where the form of distribution F is known (such as normal or Poisson), and θ denotes some unknown characteristic (such as the mean or variance) and is called a parameter. Example 6.11 Continuing with Example 6.9, the population may be assumed to be N (µ, σ 2 ) with θ = (µ, σ 2 ), where µ is the ‘true’ lifetime. Let: X = the lifetime of a tyre then we can write X ∼ N (µ, σ 2 ). Example 6.12 Continuing with Example 6.10, the population is a Bernoulli distribution such that: P (X = 1) = P (a Party X voter) = π and: P (X = 0) = P (a non-Party X voter) = 1 − π where: π = the proportion of Party X supporters in the UK = the probability of a voter being a Party X supporter. 221 6. Sampling distributions of statistics A sample: a set of data or random variables? A sample of size n, {X1 , . . . , Xn }, is also called a random sample. It consists of n real numbers in a practical problem. The word ‘random’ captures the fact that samples (of the same size) taken by different people or at different times may be different, as they are different subsets of a population. Furthermore, a sample is also viewed as n independent and identically distributed (IID) random variables, when we assess the performance of a statistical method. Example 6.13 For the tyre lifetime in Example 6.9, suppose the realised sample (of size n = 120) gives the sample mean: n x̄ = 1X xi = 35391. n i=1 A different sample may give a different sample mean, such as 36,721. Is the sample mean X̄ a good estimator of the unknown ‘true’ lifetime µ? Obviously, we cannot use the real number 35,391 to assess how good this estimator is, as a different sample may give a different average value, such as 36,721. By treating {X1 , . . . , Xn } as random variables, X̄ is also a random variable. If the distribution of X̄ concentrates closely around (unknown) µ, X̄ is a good estimator of µ. Definition of a statistic Any known function of a random sample is called a statistic. Statistics are used for statistical inference such as estimation and testing. Example 6.14 Let {X1 , . . . , Xn } be a random sample from the population N (µ, σ 2 ), then: n 1X X̄ = Xi , n i=1 X1 + Xn2 and sin(X3 ) + 6 are all statistics, but: X1 − µ σ is not a statistic, as it depends on the unknown quantities µ and σ 2 . An observed random sample is often denoted as {x1 , . . . , xn }, indicating that they are n real numbers. They are seen as a realisation of n IID random variables {X1 , . . . , Xn }. The connection between a population and a sample is shown in Figure 6.10, where θ is a parameter. A known function of {X1 , . . . , Xn } is called a statistic. 222 6.9. Prelude to statistical inference Figure 6.10: Representation of the connection between a population and a sample. 6.9.3 Difference between ‘Probability’ and ‘Statistics’ ‘Probability’ is a mathematical subject, while ‘Statistics’ is an application-oriented subject (which uses probability heavily). Example 6.15 Let: X = the number of lectures attended by a student in a term with 20 lectures then X ∼ Bin(20, π), i.e. the pf is: P (X = x) = 20! π x (1 − π)20−x x! (20 − x)! for x = 0, 1, . . . , 20 and 0 otherwise. Some probability questions are as follows. Treating π as known: what is E(X) (the average number of lectures attended)? what is P (X ≥ 18) (the proportion of students attending at least 18 lectures)? 
what is P (X < 10) (the proportion of students attending fewer than half of the lectures)? Some statistics questions are as follows. What is π (the average attendance rate)? Is π larger than 0.9? Is π smaller than 0.5? 223 6. Sampling distributions of statistics 6.10 Overview of chapter This chapter introduced sampling distributions of statistics which are the foundations to statistical inference. The sampling distribution of the sample mean was derived exactly when sampling from normal populations and also approximately for more general distributions using the central limit theorem. Three new families of distributions (χ2 , t and F ) were defined. 6.11 Key terms and concepts Central limit theorem F distribution Random sample Sampling variance (Student’s) t distribution 6.12 Chi-squared (χ2 ) distribution IID random variables Sampling distribution Statistic Sample examination questions Solutions can be found in Appendix C. 1. Let X be the amount of money won or lost in betting $5 on red in roulette, such that: 20 18 and P (X = −5) = . P (X = 5) = 38 38 If a gambler bets on red 100 times, use the central limit theorem to estimate the probability that these wagers result in less than $50 in losses. 2. Suppose Z1 , Z2 , . . . , Z5 are independent standard normal random variables. Determine the distribution of: (a) Z12 + Z22 Z1 (b) s 5 P Zi2 /4 i=2 (c) Z12 5 P . Zi2 /4 i=2 3. Consider a sequence of random variables X1 , X2 , X3 , . . . which are independent and normally distributed with mean 0 and variance 1. Using as many of these random variables as you like construct a random variable which is a function of X1 , X2 , X3 , . . . and has: (a) a t11 distribution (b) an F6, 9 distribution. 224 Chapter 7 Point estimation 7.1 Synopsis of chapter This chapter covers point estimation. Specifically, the properties of estimators are considered and the attributes of a desirable estimator are discussed. Techniques for deriving estimators are introduced. 7.2 Learning outcomes After completing this chapter, you should be able to: summarise the performance of an estimator with reference to its sampling distribution use the concepts of bias and variance of an estimator define mean squared error and calculate it for simple estimators find estimators using the method of moments, least squares and maximum likelihood. 7.3 Introduction The basic setting is that we assume a random sample {X1 , . . . , Xn } is observed from a population F (x; θ). The goal is to make inference (i.e. estimation or testing) for the unknown parameter(s) θ. Statistical inference is based on two things. 1. A set of data/observations {X1 , . . . , Xn }. 2. An assumption of F (x; θ) for the joint distribution of {X1 , . . . , Xn }. Inference is carried out using a statistic, i.e. a known function of {X1 , . . . , Xn }. b 1 , . . . , Xn ) such that the value of θb is For estimation, we look for a statistic θb = θ(X taken as an estimate (i.e. an estimated value) of θ. Such a θb is called a point estimator of θ. For testing, we typically use a statistic to test if a hypothesis on θ (such as θ = 3) is true or not. 225 7. Point estimation Example 7.1 Let {X1 , . . . , Xn } be a random sample from a population with mean µ = E(Xi ). Find an estimator of µ. Since µ is the mean of the population, a natural estimator would be the sample mean µ b = X̄, where: n 1 1X Xi = (X1 + · · · + Xn ). X̄ = n i=1 n We call µ b = X̄ a point estimator (or simply an estimator) of µ. 
For example, if we have an observed sample of 9, 16, 15, 4 and 12, hence of size n = 5, the sample mean is: 1 µ b = (9 + 16 + 15 + 4 + 12) = 11.2. 5 The value 11.2 is a point estimate of µ. For an observed sample of 15, 16, 10, 8 and 9, we obtain µ b = 11.6. 7.4 Estimation criteria: bias, variance and mean squared error Estimators are random variables and, therefore, have probability distributions, known as sampling distributions. As we know, two important properties of probability distributions are the mean and variance. Our objective is to create a formal criterion which combines both of these properties to assess the relative performance of different estimators. Bias of an estimator Let θb be an estimator of the population parameter θ.1 We define the bias of an estimator as: b = E(θ) b − θ. Bias(θ) (7.1) An estimator is: positively biased if b −θ >0 E(θ) unbiased if b −θ =0 E(θ) negatively biased if b − θ < 0. E(θ) A positively-biased estimator means the estimator would systematically overestimate the parameter by the size of the bias, on average. An unbiased estimator means the estimator would estimate the parameter correctly, on average. A negatively-biased 1 The ‘b’ (hat) notation is often used by statisticians to denote an estimator of the parameter beneath b denotes an estimator of the Poisson rate parameter λ. the ‘b’. So, for example, λ 226 7.4. Estimation criteria: bias, variance and mean squared error estimator means the estimator would systematically underestimate the parameter by the size of the bias, on average. In words, the bias of an estimator is the difference between the expected (average) value of the estimator and the true parameter being estimated. Intuitively, it would be desirable, other things being equal, to have an estimator with zero bias, called an unbiased estimator. Given the definition of bias in (7.1), an unbiased estimator would satisfy: b = θ. E(θ) In words, the expected value of the estimator is the true parameter being estimated, i.e. on average, under repeated sampling, an unbiased estimator correctly estimates θ. We view bias as a ‘bad’ thing, so, other things being equal, the smaller an estimator’s bias the better. Example 7.2 Since E(X̄) = µ, the sample mean X̄ is an unbiased estimator of µ because: E(X̄) − µ = 0. Variance of an estimator b is obtained directly from the The variance of an estimator, denoted Var(θ), estimator’s sampling distribution. Example 7.3 For the sample mean, X̄, we have: Var(X̄) = σ2 . n (7.2) It is clear that in (7.2) increasing the sample size n decreases the estimator’s variance (and hence the standard error, i.e. the square root of the estimator’s variance), therefore increasing the precision of the estimator.2 We conclude that variance is also a ‘bad’ thing so, other things being equal, the smaller an estimator’s variance the better. Mean squared error (MSE) The mean squared error (MSE) of an estimator is the average squared error. Formally, this is defined as: 2 b =E MSE(θ) θb − θ . (7.3) 2 Remember, however, that this increased precision comes at a cost – namely the increased expenditure on data collection. 227 7. Point estimation It is possible to decompose the MSE into components involving the bias and variance of an estimator. Recall that: Var(X) = E(X 2 ) − (E(X))2 E(X 2 ) = Var(X) + (E(X))2 . ⇒ Also, note that for any constant k, Var(X ± k) = Var(X), that is adding or subtracting a constant has no effect on the variance of a random variable. 
Noting that the true parameter θis some (unknown) constant,3 it immediately follows, by setting X = θb − θ , that: b =E MSE(θ) θb − θ 2 2 = Var(θb − θ) + E θb − θ 2 b + Bias(θ) b . = Var(θ) (7.4) Expression (7.4) is more useful than (7.3) for practical purposes. We have already established that both bias and variance of an estimator are ‘bad’ things, so the MSE (being the sum of a bad thing and a bad thing squared) can also be viewed as a ‘bad’ thing.4 Hence when faced with several competing estimators, we prefer the estimator with the smallest MSE. So, although an unbiased estimator is intuitively appealing, it is perfectly possible that a biased estimator might be preferred if the ‘cost’ of the bias is offset by a substantial reduction in variance. Hence the MSE provides us with a formal criterion to assess the trade-off between the bias and variance of different estimators of the same parameter. Example 7.4 A population is known to be normally distributed, i.e. X ∼ N (µ, σ 2 ). Suppose we wish to estimate the population mean, µ. We draw a random sample {X1 , X2 , . . . , Xn } such that these random variables are IID. We have three candidate estimators of µ, T1 , T2 and T3 , defined as: n 1X T1 = X̄ = Xi , n i=1 T2 = X 1 + Xn 2 and T3 = X̄ + 3. Which estimator should we choose? We begin by computing the MSE for T1 , noting: E(T1 ) = E(X̄) = µ and: σ2 . n Hence T1 is an unbiased estimator of µ. So the MSE of T1 is just the variance of T1 , since the bias is 0. Therefore, MSE(T1 ) = σ 2 /n. Var(T1 ) = Var(X̄) = 3 4 Even though θ is an unknown constant, it is known to be a constant! Or, for that matter, a ‘very bad’ thing! 228 7.4. Estimation criteria: bias, variance and mean squared error Moving to T2 , note: E(T2 ) = E X 1 + Xn 2 = 1 1 (E(X1 ) + E(Xn )) = (µ + µ) = µ 2 2 and: σ2 1 1 2 × (2σ ) = . (Var(X ) + Var(X )) = 1 n 22 4 2 So T2 is also an unbiased estimator of µ, hence MSE(T2 ) = σ 2 /2. Var(T2 ) = Finally, consider T3 , noting: E(T3 ) = E X̄ + 3 = E(X̄) + 3 = µ + 3 and: σ2 . n So T3 is a positively-biased estimator of µ, with a bias of 3. Hence we have MSE(T3 ) = σ 2 /n + 32 = σ 2 /n + 9. Var(T3 ) = Var(X̄ + 3) = Var(X̄) = We seek the estimator with the smallest MSE. Clearly, MSE(T1 ) < MSE(T3 ) so we can eliminate T3 . Now comparing T1 with T2 , we note that: for n = 2, MSE(T1 ) = MSE(T2 ), since the estimators are identical for n > 2, MSE(T1 ) < MSE(T2 ), so T1 is preferred. So T1 = X̄ is our preferred estimator of µ. Intuitively this should make sense. Note for n > 2, T1 uses all the information in the sample (i.e. all observations are used), unlike T2 which uses the first and last observations only. Of course, for n = 2, these estimators are identical. Some remarks are the following. i. µ b = X̄ is a better estimator of µ than X1 as: MSE (b µ) = σ2 < MSE(X1 ) = σ 2 . n ii. As n → ∞, MSE(X̄) → 0, i.e. when the sample size tends to infinity, the error in estimation goes to 0. Such an estimator is called a (mean-square) consistent estimator. Consistency is a reasonable requirement. It may be used to rule out some silly estimators. For µ̃ = (X1 + X4 )/2, MSE(µ̃) = σ 2 /2 which does not converge to 0 as n → ∞. This is due to the fact that only a small portion of information (i.e. X1 and X4 ) is used in the estimation. iii. For any random sample {X1 , . . . , Xn } from a population with mean µ and variance σ 2 , it holds that: σ2 E(X̄) = µ and Var(X̄) = . n 229 7. Point estimation The derivation of the expected value and variance of the sample mean was covered in Chapter 6. 
Example 7.5 Bias by itself cannot be used to measure the quality of an estimator. Consider two artificial estimators of θ, θb1 and θb2 , such that θb1 takes only the two values, θ − 100 and θ + 100, and θb2 takes only the two values θ and θ + 0.2, with the following probabilities: b b P θ1 = θ − 100 = P θ1 = θ + 100 = 0.5 and: b b P θ2 = θ = P θ2 = θ + 0.2 = 0.5. Note that θb1 is an unbiased estimator of θ and θb2 is a positively-biased estimator of θ as: Bias(θb2 ) = E(θb2 ) − θ = [(θ × 0.5) + ((θ + 0.2) × 0.5)] − θ = 0.1. However: MSE(θb1 ) = E[(θb1 − θ)2 ] = (−100)2 × 0.5 + (100)2 × 0.5 = 10000 and: MSE(θb2 ) = E[(θb2 − θ)2 ] = 02 × 0.5 + (0.2)2 × 0.5 = 0.02. Hence θb2 is a much better (i.e. more accurate) estimator of θ than θb1 . Activity 7.1 Based on a random sample of two independent observations from a population with mean µ and standard deviation σ, consider two estimators of µ, X and Y , defined as: X1 X 2 X1 2X2 X= + and Y = + . 2 2 3 3 Are X and Y unbiased estimators of µ? Solution We have: X1 X2 + 2 2 X1 2X2 + 3 3 E(X) = E = 1 1 1 1 × E(X1 ) + × E(X2 ) = × µ + × µ = µ 2 2 2 2 = 1 2 1 2 × E(X1 ) + × E(X2 ) = × µ + × µ = µ. 3 3 3 3 and: E(Y ) = E It follows that both estimators are unbiased estimators of µ. Activity 7.2 Let {X1 , X2 , . . . , Xn }, where n > 2, be a random sample from an unknown population with mean θ and variance σ 2 . We want to choose between two estimators of θ, θb1 = X̄ and θb2 = (X1 + X2 )/2. Which is the better estimator of θ? 230 7.4. Estimation criteria: bias, variance and mean squared error Solution Let us consider the bias first. The estimator θb1 is just the sample mean, so we know that it is unbiased. The estimator θb2 has expectation: X1 + X2 E(X1 ) + E(X2 ) θ+θ b E(θ2 ) = E = = =θ 2 2 2 so it is also an unbiased estimator of θ. Next, we consider the variances of the two estimators. We have: Var(θb1 ) = Var(X̄) = and: Var(θb2 ) = Var X 1 + X2 2 = σ2 n Var(X1 ) + Var(X2 ) σ2 + σ2 σ2 = = . 4 4 2 Since n > 2, we can see that θb1 has a lower variance than θb2 , so it is a better estimator. Unsurprisingly, we obtain a better estimator of θ by considering the whole sample, rather than just the first two values. Activity 7.3 Find the MSEs of the estimators in the previous activity. Are they consistent estimators of θ? Solution The MSEs are: 2 σ 2 σ2 +0= MSE(θb1 ) = Var(θb1 ) + Bias(θb1 ) = n n and: 2 σ 2 σ2 +0= . MSE(θb2 ) = Var(θb2 ) + Bias(θb2 ) = 2 2 Note that the MSE of an unbiased estimator is equal to its variance. The estimator θb1 has MSE equal to σ 2 /n, which converges to 0 as n → ∞. The estimator θb2 has MSE equal to σ 2 /2, which stays constant as n → ∞. Therefore, θb1 is a (mean-square) consistent estimator of θ, whereas θb2 is not. Activity 7.4 Let X1 and X2 be two independent random variables with the same mean, µ, and the same variance, σ 2 < ∞. Let µ b = aX1 + bX2 be an estimator of µ, where a and b are two non-zero constants. (a) Identify the condition on a and b to ensure that µ b is an unbiased estimator of µ. (b) Find the minimum mean squared error (MSE) among all unbiased estimators of µ. 231 7. Point estimation Solution (a) Let E(b µ) = E(aX1 + bX2 ) = a E(X1 ) + b E(X2 ) = (a + b)µ. Hence a + b = 1 is the condition for µ b to be an unbiased estimator of µ. (b) Under this condition, noting that b = 1 − a, we have: MSE(b µ) = Var(b µ) = a2 Var(X1 ) + b2 Var(X2 ) = (a2 + b2 ) σ 2 = (2a2 − 2a + 1) σ 2 . Setting d MSE(b µ)/da = (4a − 2)σ 2 = 0, we have a = 0.5, and hence b = 0.5. 
Therefore, among all unbiased linear estimators, the sample mean $(X_1 + X_2)/2$ has the minimum variance.

Remark: Let $\{X_1, \ldots, X_n\}$ be a random sample from a population with finite variance. The sample mean $\bar X$ has the minimum variance among all unbiased linear estimators of the form $\sum_{i=1}^n a_i X_i$, hence it is the best linear unbiased estimator (BLUE!).

Activity 7.5  Hard question! Let $\{X_1, \ldots, X_n\}$ be a random sample from a Bernoulli distribution where $P(X_i = 1) = \pi = 1 - P(X_i = 0)$ for all $i = 1, \ldots, n$. Let $\hat\pi = \bar X = (X_1 + \cdots + X_n)/n$ be an estimator of $\pi$.

(a) Find the mean squared error of $\hat\pi$, i.e. $\mathrm{MSE}(\hat\pi)$. Is $\hat\pi$ an unbiased estimator of $\pi$? Is $\hat\pi$ a consistent estimator of $\pi$?

(b) Let $Y = X_1 + \cdots + X_n$. Find the probability distribution of $Y$.

(c) Find the sampling distribution of $\hat\pi = Y/n$ (which, recall, is simply the probability distribution of $\hat\pi$).

Solution

(a) We have $E(X_i) = 0 \times (1 - \pi) + 1 \times \pi = \pi$, $E(X_i^2) = E(X_i) = \pi$ (since $X_i = X_i^2$ for the Bernoulli distribution), and $\mathrm{Var}(X_i) = \pi - \pi^2 = \pi(1 - \pi)$ for all $i = 1, \ldots, n$. Hence:
\[
E(\hat\pi) = E\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n} \times n \times \pi = \pi.
\]
Therefore, $\hat\pi$ is an unbiased estimator of $\pi$. Furthermore, by independence:
\[
\mathrm{MSE}(\hat\pi) = \mathrm{Var}(\hat\pi) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{\pi(1 - \pi)}{n}
\]
which converges to 0 as $n \to \infty$. Hence $\hat\pi$ is a consistent estimator of $\pi$.

(b) $Y$ may only take the integer values $0, 1, \ldots, n$. For $0 \le y \le n$, the event $Y = y$ occurs if and only if there are exactly $y$ 1s and $(n - y)$ 0s among the values of $X_1, \ldots, X_n$. However, those $y$ 1s may take any $y$ out of the $n$ positions. Hence:
\[
P(Y = y) = \binom{n}{y}\pi^y(1 - \pi)^{n-y} = \frac{n!}{y!\,(n - y)!}\pi^y(1 - \pi)^{n-y}.
\]
Therefore, $Y \sim \mathrm{Bin}(n, \pi)$.

(c) Note $\hat\pi = Y/n$. Hence $\hat\pi$ has a rescaled binomial distribution on the $n + 1$ points $\{0, 1/n, 2/n, \ldots, 1\}$.

Finding estimators

In general, how should we find an estimator of $\theta$ in a practical situation? There are three conventional methods:

method of moments estimation

least squares estimation

maximum likelihood estimation.

7.5 Method of moments (MM) estimation

Method of moments estimation

Let $\{X_1, \ldots, X_n\}$ be a random sample from a population $F(x; \theta)$. Suppose $\theta$ has $p$ components (for example, for a normal population $N(\mu, \sigma^2)$, $p = 2$; for a Poisson population with parameter $\lambda$, $p = 1$). Let:
\[
\mu_k = \mu_k(\theta) = E(X^k)
\]
denote the $k$th population moment, for $k = 1, 2, \ldots$. Therefore, $\mu_k$ depends on the unknown parameter $\theta$, as everything else about the distribution $F(x; \theta)$ is known. Denote the $k$th sample moment by:
\[
M_k = \frac{1}{n}\sum_{i=1}^n X_i^k = \frac{1}{n}(X_1^k + \cdots + X_n^k).
\]
The MM estimator (MME) $\hat\theta$ of $\theta$ is the solution of the $p$ equations:
\[
\mu_k(\hat\theta) = M_k \quad \text{for } k = 1, \ldots, p.
\]

Example 7.6  Let $\{X_1, \ldots, X_n\}$ be a random sample from a population with mean $\mu$ and variance $\sigma^2 < \infty$. Find the MM estimator of $(\mu, \sigma^2)$.

There are two unknown parameters. Let:
\[
\hat\mu = \hat\mu_1 = M_1 \quad \text{and} \quad \hat\mu_2 = M_2 = \frac{1}{n}\sum_{i=1}^n X_i^2.
\]
This gives us $\hat\mu = M_1 = \bar X$. Since $\sigma^2 = \mu_2 - \mu_1^2 = E(X^2) - (E(X))^2$, we have:
\[
\hat\sigma^2 = M_2 - M_1^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.
\]
Note we have:
\[
E(\hat\sigma^2) = E\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2\right) = \frac{1}{n}\sum_{i=1}^n E(X_i^2) - E(\bar X^2) = E(X^2) - E(\bar X^2) = \sigma^2 + \mu^2 - \left(\frac{\sigma^2}{n} + \mu^2\right) = \frac{(n - 1)\sigma^2}{n}.
\]
Since:
\[
E(\hat\sigma^2) - \sigma^2 = -\frac{\sigma^2}{n} < 0
\]
$\hat\sigma^2$ is a negatively-biased estimator of $\sigma^2$. The sample variance, defined as:
\[
S^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar X)^2
\]
is a more frequently-used estimator of $\sigma^2$ as it has zero bias, i.e. it is an unbiased estimator since $E(S^2) = \sigma^2$. This is why we use the $n - 1$ divisor when calculating the sample variance. A useful formula for computation of the sample variance is:
\[
S^2 = \frac{1}{n - 1}\left(\sum_{i=1}^n X_i^2 - n\bar X^2\right).
\]
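The negative bias of the $n$-divisor estimator $\hat\sigma^2$ in Example 7.6 can be seen numerically. The R sketch below is illustrative only; the choice of a $N(0, 4)$ population, $n = 5$ and the number of replications are arbitrary. It averages both estimators over many simulated samples: the $n$-divisor average should be close to $(n - 1)\sigma^2/n = 3.2$, while the $(n - 1)$-divisor average should be close to $\sigma^2 = 4$.

# Simulation comparing the n-divisor and (n-1)-divisor estimators of sigma^2
# (illustrative sketch: the population, n and the number of replications are arbitrary)
set.seed(2)
sigma2 <- 4; n <- 5; R <- 20000

v.biased <- v.unbiased <- numeric(R)
for (r in 1:R) {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  v.unbiased[r] <- var(x)                  # divisor n - 1 (sample variance S^2)
  v.biased[r]   <- var(x) * (n - 1) / n    # divisor n (MME of sigma^2)
}
c(mean.biased = mean(v.biased), mean.unbiased = mean(v.unbiased))
# Expected values: (n-1)*sigma2/n = 3.2 and sigma2 = 4, respectively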
Note the MME does not use any information on $F(x; \theta)$ beyond the moments. The idea is that $M_k$ should be pretty close to $\mu_k$ when $n$ is sufficiently large. In fact:
\[
M_k = \frac{1}{n}\sum_{i=1}^n X_i^k
\]
converges to $\mu_k = E(X^k)$ as $n \to \infty$. This is due to the law of large numbers (LLN). We illustrate this phenomenon by simulation using R.

Example 7.7  For $N(2, 4)$, we have $\mu_1 = 2$ and $\mu_2 = 8$. We use the sample moments $M_1$ and $M_2$ as estimators of $\mu_1$ and $\mu_2$, respectively. Note how the sample moments converge to the population moments as the sample size increases.

For a sample of size $n = 10$, we obtained $m_1 = 0.5145838$ and $m_2 = 2.171881$.

> x <- rnorm(10,2,2)
> x
 [1]  0.70709403 -1.38416864 -0.01692815  2.51837989 -0.28518898  1.83541490
 [7] -1.53308559 -0.42573724  1.76006933  1.96998829
> mean(x)
[1] 0.5145838
> x2 <- x^2
> mean(x2)
[1] 2.171881

For a sample of size $n = 100$, we obtained $m_1 = 2.261542$ and $m_2 = 8.973033$.

> x <- rnorm(100,2,2)
> mean(x)
[1] 2.261542
> x2 <- x^2
> mean(x2)
[1] 8.973033

For a sample of size $n = 500$, we obtained $m_1 = 1.912112$ and $m_2 = 7.456353$.

> x <- rnorm(500,2,2)
> mean(x)
[1] 1.912112
> x2 <- x^2
> mean(x2)
[1] 7.456353

Example 7.8  For a Poisson distribution with $\lambda = 1$, we have $\mu_1 = 1$ and $\mu_2 = 2$. With a sample of size $n = 500$, we obtained $m_1 = 1.09$ and $m_2 = 2.198$.

> x <- rpois(500,1)
> mean(x)
[1] 1.09
> x2 <- x^2
> mean(x2)
[1] 2.198
> x
  [1] 1 2 2 1 0 0 0 0 0 0 2 2 1 2 1 1 1 2 ...

Activity 7.6  Let $\{X_1, \ldots, X_n\}$ be a random sample from the (continuous) uniform distribution such that $X \sim \mathrm{Uniform}[0, \theta]$, where $\theta > 0$. Find the method of moments estimator (MME) of $\theta$.

Solution
The pdf of $X_i$ is:
\[
f(x_i; \theta) = \begin{cases} \theta^{-1} & \text{for } 0 \le x_i \le \theta \\ 0 & \text{otherwise.} \end{cases}
\]
Therefore:
\[
E(X_i) = \frac{1}{\theta}\int_0^\theta x_i\,dx_i = \frac{1}{\theta}\left[\frac{x_i^2}{2}\right]_0^\theta = \frac{\theta}{2}.
\]
Therefore, setting $\hat\mu_1 = M_1$, we have:
\[
\frac{\hat\theta}{2} = \bar X \quad \Rightarrow \quad \hat\theta = 2\bar X.
\]

Activity 7.7  Suppose that we have a random sample $\{X_1, \ldots, X_n\}$ from a $\mathrm{Uniform}[-\theta, \theta]$ distribution. Find the method of moments estimator of $\theta$.

Solution
The mean of the $\mathrm{Uniform}[a, b]$ distribution is $(a + b)/2$. In our case, this gives $E(X) = (-\theta + \theta)/2 = 0$. The first population moment does not depend on $\theta$, so we need to move to the next (i.e. second) population moment. Recall that the variance of the $\mathrm{Uniform}[a, b]$ distribution is $(b - a)^2/12$. Hence the second population moment is:
\[
E(X^2) = \mathrm{Var}(X) + (E(X))^2 = \frac{(\theta - (-\theta))^2}{12} + 0^2 = \frac{\theta^2}{3}.
\]
We set this equal to the second sample moment to obtain:
\[
\frac{1}{n}\sum_{i=1}^n X_i^2 = \frac{\hat\theta^2}{3}.
\]
Therefore, the method of moments estimator of $\theta$ is:
\[
\hat\theta_{MM} = \sqrt{\frac{3}{n}\sum_{i=1}^n X_i^2}.
\]

Activity 7.8  Consider again the $\mathrm{Uniform}[-\theta, \theta]$ distribution from the previous question. Suppose that we observe the following data:
\[
1.8,\ 0.7,\ -0.2,\ -1.8,\ 2.8,\ 0.6,\ -1.3,\ -0.1.
\]
Estimate $\theta$ using the method of moments.

Solution
The point estimate is:
\[
\hat\theta_{MM} = \sqrt{\frac{3}{8}\sum_{i=1}^8 x_i^2} \approx 2.518
\]
which implies that the data came from a $\mathrm{Uniform}[-2.518, 2.518]$ distribution. However, this clearly cannot be true since the observation $x_5 = 2.8$ falls outside this range! The method of moments does not take into account that all of the observations need to lie in the interval $[-\theta, \theta]$, and so it fails to produce a useful estimate.
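The calculation in Activity 7.8 is easy to reproduce in R. The sketch below is illustrative only; it computes the MME for the data above and confirms that the resulting estimate is smaller than the largest observation in absolute value, which is exactly what makes it unusable here.

# MME for Uniform[-theta, theta] applied to the data of Activity 7.8
x <- c(1.8, 0.7, -0.2, -1.8, 2.8, 0.6, -1.3, -0.1)
theta.mm <- sqrt(3 * mean(x^2))   # square root of 3 times the second sample moment
theta.mm                          # approximately 2.518
max(abs(x))                       # 2.8 > 2.518, so x5 lies outside [-theta.mm, theta.mm]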
Activity 7.9  Let $X \sim \mathrm{Bin}(n, \pi)$, where $n$ is known. Find the method of moments estimator (MME) of $\pi$.

Solution
The pf of the binomial distribution is:
\[
P(X = x) = \frac{n!}{x!\,(n - x)!}\pi^x(1 - \pi)^{n-x}
\]
for $x = 0, 1, \ldots, n$ and 0 otherwise. Therefore:
\[
E(X) = \sum_{x=0}^n x\,P(X = x) = \sum_{x=1}^n x\,\frac{n!}{x!\,(n - x)!}\pi^x(1 - \pi)^{n-x} = \sum_{x=1}^n \frac{n!}{(x - 1)!\,(n - x)!}\pi^x(1 - \pi)^{n-x}.
\]
Let $m = n - 1$ and write $j = x - 1$, then $(n - x) = (m - j)$, and:
\[
E(X) = \sum_{j=0}^m \frac{n\,m!}{j!\,(m - j)!}\pi^{j+1}(1 - \pi)^{m-j} = n\pi\sum_{j=0}^m \frac{m!}{j!\,(m - j)!}\pi^j(1 - \pi)^{m-j} = n\pi.
\]
Therefore, $E(X) = n\pi$, and hence $\hat\pi = X/n$.

7.6 Least squares (LS) estimation

Given a random sample $\{X_1, \ldots, X_n\}$ from a population with mean $\mu$ and variance $\sigma^2$, how can we estimate $\mu$? The MME of $\mu$ is the sample mean $\bar X = \sum_{i=1}^n X_i/n$.

Least squares estimator for $\mu$

The estimator $\bar X$ is also the least squares estimator (LSE) of $\mu$, defined as:
\[
\hat\mu = \bar X = \arg\min_a \sum_{i=1}^n (X_i - a)^2.
\]
Proof: Given that $S = \sum_{i=1}^n (X_i - a)^2 = \sum_{i=1}^n (X_i - \bar X)^2 + n(\bar X - a)^2$, where all terms are non-negative, the value of $a$ for which $S$ is minimised is when $n(\bar X - a)^2 = 0$, i.e. $a = \bar X$.

Activity 7.10  Suppose that you are given observations $y_1$, $y_2$, $y_3$ and $y_4$ such that:
\[
y_1 = \alpha + \beta + \varepsilon_1, \quad y_2 = -\alpha + \beta + \varepsilon_2, \quad y_3 = \alpha - \beta + \varepsilon_3, \quad y_4 = -\alpha - \beta + \varepsilon_4.
\]
The random variables $\varepsilon_i$, for $i = 1, 2, 3, 4$, are independent and normally distributed with mean 0 and variance $\sigma^2$.

(a) Find the least squares estimators of the parameters $\alpha$ and $\beta$.

(b) Verify that the least squares estimators in (a) are unbiased estimators of their respective parameters.

(c) Find the variance of the least squares estimator of $\alpha$.

Solution

(a) We start off with the sum of squares function:
\[
S = \sum_{i=1}^4 \varepsilon_i^2 = (y_1 - \alpha - \beta)^2 + (y_2 + \alpha - \beta)^2 + (y_3 - \alpha + \beta)^2 + (y_4 + \alpha + \beta)^2.
\]
Now take the partial derivatives:
\[
\frac{\partial S}{\partial \alpha} = -2(y_1 - \alpha - \beta) + 2(y_2 + \alpha - \beta) - 2(y_3 - \alpha + \beta) + 2(y_4 + \alpha + \beta) = -2(y_1 - y_2 + y_3 - y_4) + 8\alpha
\]
and:
\[
\frac{\partial S}{\partial \beta} = -2(y_1 - \alpha - \beta) - 2(y_2 + \alpha - \beta) + 2(y_3 - \alpha + \beta) + 2(y_4 + \alpha + \beta) = -2(y_1 + y_2 - y_3 - y_4) + 8\beta.
\]
The least squares estimators $\hat\alpha$ and $\hat\beta$ are the solutions to $\partial S/\partial\alpha = 0$ and $\partial S/\partial\beta = 0$. Hence:
\[
\hat\alpha = \frac{y_1 - y_2 + y_3 - y_4}{4} \quad \text{and} \quad \hat\beta = \frac{y_1 + y_2 - y_3 - y_4}{4}.
\]

(b) $\hat\alpha$ is an unbiased estimator of $\alpha$ since:
\[
E(\hat\alpha) = E\left(\frac{y_1 - y_2 + y_3 - y_4}{4}\right) = \frac{\alpha + \beta + \alpha - \beta + \alpha - \beta + \alpha + \beta}{4} = \alpha.
\]
$\hat\beta$ is an unbiased estimator of $\beta$ since:
\[
E(\hat\beta) = E\left(\frac{y_1 + y_2 - y_3 - y_4}{4}\right) = \frac{\alpha + \beta - \alpha + \beta - \alpha + \beta + \alpha + \beta}{4} = \beta.
\]

(c) Due to independence, we have:
\[
\mathrm{Var}(\hat\alpha) = \mathrm{Var}\left(\frac{y_1 - y_2 + y_3 - y_4}{4}\right) = \frac{4\sigma^2}{16} = \frac{\sigma^2}{4}.
\]
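The closed-form estimators in Activity 7.10 can be checked numerically. The R sketch below is illustrative only; the parameter values and error draws are arbitrary. It sets up the design of Activity 7.10 and fits it by least squares with lm(), which should reproduce $\hat\alpha = (y_1 - y_2 + y_3 - y_4)/4$ and $\hat\beta = (y_1 + y_2 - y_3 - y_4)/4$.

# Least squares fit for the design of Activity 7.10
# (illustrative sketch: alpha, beta and the simulated errors are arbitrary)
set.seed(3)
alpha <- 1; beta <- 2
x1 <- c( 1, -1,  1, -1)                    # coefficient of alpha in each observation
x2 <- c( 1,  1, -1, -1)                    # coefficient of beta in each observation
y  <- alpha * x1 + beta * x2 + rnorm(4)    # y_i = alpha*x1_i + beta*x2_i + eps_i

coef(lm(y ~ 0 + x1 + x2))                  # least squares estimates via lm()
c(alpha.hat = (y[1] - y[2] + y[3] - y[4]) / 4,   # closed-form solutions from the activity
  beta.hat  = (y[1] + y[2] - y[3] - y[4]) / 4)

Because the two columns of the design are orthogonal, the lm() coefficients coincide exactly with the closed-form expressions derived above.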
Estimator accuracy

In order to assess the accuracy of $\hat\mu = \bar X$ as an estimator of $\mu$ we calculate its MSE:
\[
\mathrm{MSE}(\hat\mu) = E[(\hat\mu - \mu)^2] = \frac{\sigma^2}{n}.
\]
In order to determine the distribution of $\hat\mu$ we require knowledge of the underlying distribution. Even if the relevant knowledge is available, one may only compute the exact distribution of $\hat\mu$ explicitly for a limited number of cases. By the central limit theorem, as $n \to \infty$, we have:
\[
P\left(\frac{\bar X - \mu}{\sigma/\sqrt{n}} \le z\right) \to \Phi(z)
\]
for any $z$, where $\Phi(z)$ is the cdf of $N(0, 1)$, i.e. when $n$ is large, $\bar X \sim N(\mu, \sigma^2/n)$ approximately. Hence when $n$ is large:
\[
P\left(|\bar X - \mu| \le 1.96 \times \frac{\sigma}{\sqrt{n}}\right) \approx 0.95.
\]
In practice, the standard deviation $\sigma$ is unknown and so we replace it by the sample standard deviation $S$, where $S^2$ is the sample variance, given by:
\[
S^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar X)^2.
\]
This gives an approximation of:
\[
P\left(|\bar X - \mu| \le 1.96 \times \frac{S}{\sqrt{n}}\right) \approx 0.95.
\]
To be on the safe side, the coefficient 1.96 is often replaced by 2. The estimated standard error of $\bar X$ is:
\[
\mathrm{E.S.E.}(\bar X) = \frac{S}{\sqrt{n}} = \left[\frac{1}{n(n - 1)}\sum_{i=1}^n (X_i - \bar X)^2\right]^{1/2}.
\]
Some remarks are the following.

i. The LSE is a geometrical solution – it minimises the sum of squared distances between the estimated value and each observation. It makes no use of any information about the underlying distribution.

ii. Taking the derivative of $\sum_{i=1}^n (X_i - a)^2$ with respect to $a$, and equating it to 0, we obtain (after dividing through by $-2$):
\[
\sum_{i=1}^n (X_i - a) = \sum_{i=1}^n X_i - na = 0.
\]
Hence the solution is $\hat\mu = \hat a = \bar X$. This is another way to derive the LSE of $\mu$.

Activity 7.11  A random sample of size $n = 400$ produced the sample sums $\sum_i x_i = 983$ and $\sum_i x_i^2 = 4729$.

(a) Calculate point estimates for the population mean and the population standard deviation.

(b) Calculate the estimated standard error of the mean estimate.

Solution

(a) As before, we use the sample mean to estimate the population mean, i.e. $\hat\mu = \bar x = 983/400 = 2.4575$, and the sample variance to estimate the population variance, i.e. we have:
\[
s^2 = \frac{1}{n - 1}\sum_{i=1}^{400} (x_i - \bar x)^2 = \frac{1}{n - 1}\left(\sum_{i=1}^{400} x_i^2 - n\bar x^2\right) = \frac{1}{399}\left(4729 - 400 \times (2.4575)^2\right) = 5.7977.
\]
Therefore, the estimate for the population standard deviation is $s = \sqrt{5.7977} = 2.4078$.

(b) The estimated standard error is $s/\sqrt{n} = 2.4078/\sqrt{400} = 0.1204$.

Note that the estimated standard error is rather small, indicating that the estimate of the population mean is rather accurate. This is due to two factors: (i.) the population variance is small, as evident from the small value of $s^2$, and (ii.) the sample size of $n = 400$ is rather large. Note also that using the $n$ divisor (i.e. the method of moments estimator of $\sigma^2$) we have $\sum_{i=1}^n (x_i - \bar x)^2/n = 5.7832$, which is pretty close to $s^2$.

7.7 Maximum likelihood (ML) estimation

We begin with an illustrative example. Maximum likelihood (ML) estimation generalises the reasoning in the following example to arbitrary settings.

Example 7.9  Suppose we toss a coin 10 times, and record the number of 'heads' as a random variable $X$. Therefore:
\[
X \sim \mathrm{Bin}(10, \pi)
\]
where $\pi = P(\text{heads}) \in (0, 1)$ is the unknown parameter. If $x = 8$, what is your best guess (i.e. estimate) of $\pi$? Obviously 0.8!

Is $\pi = 0.1$ possible? Yes, but very unlikely. Is $\pi = 0.5$ possible? Yes, but not very likely. Is $\pi = 0.7$ or $0.9$ possible? Yes, very likely. Nevertheless, $\pi = 0.8$ is the most likely, or 'maximally' likely, value of the parameter.

Why do we think '$\pi = 0.8$' is most likely? Let:
\[
L(\pi) = P(X = 8) = \frac{10!}{8!\,2!}\pi^8(1 - \pi)^2.
\]
Since $x = 8$ is the event which occurred in the experiment, this probability ought to be large. Figure 7.1 shows a plot of $L(\pi)$ as a function of $\pi$. The most likely value of $\pi$ should make this probability as large as possible. This value is taken as the maximum likelihood estimate of $\pi$. Maximising $L(\pi)$ is equivalent to maximising:
\[
l(\pi) = \log(L(\pi)) = 8\log\pi + 2\log(1 - \pi) + c
\]
where $c$ is the constant $\log(10!/(8!\,2!))$. Setting $dl(\pi)/d\pi = 0$, we obtain the ML estimate $\hat\pi = 0.8$.

Figure 7.1: Plot of the likelihood function in Example 7.9.
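Since Figure 7.1 is not reproduced here, the likelihood in Example 7.9 can be examined directly in R. The sketch below is illustrative only; it evaluates $L(\pi)$ on a grid and also maximises it numerically with optimise(), and both approaches should point to $\hat\pi = 0.8$.

# Likelihood of pi in Example 7.9: X ~ Bin(10, pi) with observed x = 8
L <- function(p) dbinom(8, size = 10, prob = p)

p.grid <- seq(0.01, 0.99, by = 0.01)
p.grid[which.max(L(p.grid))]                     # grid maximiser, 0.8

optimise(L, interval = c(0, 1), maximum = TRUE)  # numerical maximiser, approximately 0.8
# plot(p.grid, L(p.grid), type = "l")            # reproduces the shape of Figure 7.1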
Maximum likelihood definition

Let $f(x_1, \ldots, x_n; \theta)$ be the joint probability density function (or probability function) for random variables $(X_1, \ldots, X_n)$. The maximum likelihood estimator (MLE) of $\theta$ based on the observations $\{X_1, \ldots, X_n\}$ is defined as:
\[
\hat\theta = \arg\max_\theta f(X_1, \ldots, X_n; \theta).
\]
Some remarks are the following.

i. The MLE depends only on the observations $\{X_1, \ldots, X_n\}$, such that:
\[
\hat\theta = \hat\theta(X_1, \ldots, X_n).
\]
Therefore, $\hat\theta$ is a statistic (as it must be for an estimator of $\theta$).

ii. If $\{X_1, \ldots, X_n\}$ is a random sample from a population with probability density function $f(x; \theta)$, the joint probability density function for $(X_1, \ldots, X_n)$ is:
\[
\prod_{i=1}^n f(x_i; \theta).
\]
The joint pdf is a function of $(X_1, \ldots, X_n)$, while $\theta$ is a parameter. The joint pdf describes the probability distribution of $\{X_1, \ldots, X_n\}$.

The likelihood function is defined as:
\[
L(\theta) = \prod_{i=1}^n f(X_i; \theta). \quad (7.5)
\]
The likelihood function is a function of $\theta$, while $\{X_1, \ldots, X_n\}$ are treated as constants (as given observations). The likelihood function reflects the information about the unknown parameter $\theta$ in the data $\{X_1, \ldots, X_n\}$.

Some remarks are the following.

i. The likelihood function is a function of the parameter. It is defined up to positive constant factors. A likelihood function is not a probability density function. It contains all the information about the unknown parameter from the observations.

ii. The MLE is $\hat\theta = \arg\max_\theta L(\theta)$.

iii. It is often more convenient to use the log-likelihood function (throughout, 'log' in log-likelihood functions denotes the logarithm to the base $e$, i.e. the natural logarithm), denoted as:
\[
l(\theta) = \log L(\theta) = \sum_{i=1}^n \log(f(X_i; \theta))
\]
as it transforms the product in (7.5) into a sum. Note that $\hat\theta = \arg\max_\theta l(\theta)$.

iv. For a smooth likelihood function, the MLE is often the solution of the equation:
\[
\frac{d}{d\theta}l(\theta) = 0.
\]

v. If $\hat\theta$ is the MLE and $\phi = g(\theta)$ is a function of $\theta$, then $\hat\phi = g(\hat\theta)$ is the MLE of $\phi$ (which is known as the invariance principle of the MLE).

vi. Unlike the MME or LSE, the MLE uses all the information about the population distribution. It is often more efficient (i.e. more accurate) than the MME or LSE.

vii. In practice, ML estimation should be used whenever possible.

Example 7.10  Let $\{X_1, \ldots, X_n\}$ be a random sample from a distribution with pdf:
\[
f(x; \lambda) = \begin{cases} \lambda^2 x\,e^{-\lambda x} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases}
\]
where $\lambda > 0$ is unknown. Find the MLE of $\lambda$.

The joint pdf is $f(x_1, \ldots, x_n; \lambda) = \prod_{i=1}^n \lambda^2 x_i\,e^{-\lambda x_i}$ if all $x_i > 0$, and 0 otherwise. The likelihood function is:
\[
L(\lambda) = \lambda^{2n}\exp\left(-\lambda\sum_{i=1}^n X_i\right)\prod_{i=1}^n X_i = \lambda^{2n}\exp(-n\lambda\bar X)\prod_{i=1}^n X_i.
\]
The log-likelihood function is $l(\lambda) = 2n\log\lambda - n\lambda\bar X + c$, where $c = \log\prod_{i=1}^n X_i$ is a constant. Setting $dl(\lambda)/d\lambda = 2n/\hat\lambda - n\bar X = 0$, we obtain $\hat\lambda = 2/\bar X$.

Note the MLE $\hat\lambda$ may be obtained from maximising $L(\lambda)$ directly. However, it is much easier to work with $l(\lambda)$ instead. By the invariance principle, the MLE of $\lambda^2$ would be $\hat\lambda^2 = 4/\bar X^2$.

Example 7.11  Consider a population with three types of individuals labelled 1, 2 and 3, and occurring according to the Hardy–Weinberg proportions:
\[
p(1; \theta) = \theta^2, \quad p(2; \theta) = 2\theta(1 - \theta) \quad \text{and} \quad p(3; \theta) = (1 - \theta)^2
\]
where $0 < \theta < 1$. Note that $p(1; \theta) + p(2; \theta) + p(3; \theta) = 1$. A random sample of size $n$ is drawn from this population with $n_1$ observed values equal to 1 and $n_2$ observed values equal to 2 (therefore, there are $n - n_1 - n_2$ values equal to 3). Find the MLE of $\theta$.

Let us assume $\{X_1, \ldots, X_n\}$ is the sample (i.e. $n$ observed values). Among them, there are $n_1$ '1's, $n_2$ '2's, and $n - n_1 - n_2$ '3's. The likelihood function is (where $\propto$ means 'proportional to'):
\[
L(\theta) = \prod_{i=1}^n p(X_i; \theta) = p(1; \theta)^{n_1}\,p(2; \theta)^{n_2}\,p(3; \theta)^{n - n_1 - n_2} = \theta^{2n_1}(2\theta(1 - \theta))^{n_2}(1 - \theta)^{2(n - n_1 - n_2)} \propto \theta^{2n_1 + n_2}(1 - \theta)^{2n - 2n_1 - n_2}.
\]
The log-likelihood is $l(\theta) \propto (2n_1 + n_2)\log\theta + (2n - 2n_1 - n_2)\log(1 - \theta)$.
Setting $dl(\theta)/d\theta = (2n_1 + n_2)/\hat\theta - (2n - 2n_1 - n_2)/(1 - \hat\theta) = 0$, that is:
\[
(1 - \hat\theta)(2n_1 + n_2) = \hat\theta(2n - 2n_1 - n_2)
\]
leads to the MLE:
\[
\hat\theta = \frac{2n_1 + n_2}{2n}.
\]
For example, for a sample with $n = 4$, $n_1 = 1$ and $n_2 = 2$, we obtain a point estimate of $\hat\theta = 0.5$.

Example 7.12  Let $\{X_1, \ldots, X_n\}$ be a random sample from the (continuous) uniform distribution $\mathrm{Uniform}[0, \theta]$, where $\theta > 0$ is unknown.

(a) Find the MLE of $\theta$.

(b) If $n = 3$, $x_1 = 0.9$, $x_2 = 1.2$ and $x_3 = 0.3$, what is the maximum likelihood estimate of $\theta$?

(a) The pdf of $\mathrm{Uniform}[0, \theta]$ is:
\[
f(x; \theta) = \begin{cases} \theta^{-1} & \text{for } 0 \le x \le \theta \\ 0 & \text{otherwise.} \end{cases}
\]
The joint pdf is:
\[
f(x_1, \ldots, x_n; \theta) = \begin{cases} \theta^{-n} & \text{for } 0 \le x_1, \ldots, x_n \le \theta \\ 0 & \text{otherwise.} \end{cases}
\]
As a function of $\theta$, $f(x_1, \ldots, x_n; \theta)$ is the likelihood function, $L(\theta)$. The maximum likelihood estimator of $\theta$ is the value at which the likelihood function $L(\theta)$ achieves its maximum. Note:
\[
L(\theta) = \begin{cases} \theta^{-n} & \text{for } X_{(n)} \le \theta \\ 0 & \text{otherwise} \end{cases}
\]
where $X_{(n)} = \max_i X_i$. Hence the MLE is $\hat\theta = X_{(n)}$.

Note that this is a special case of a likelihood function which is not 'well-behaved', since it is not continuously differentiable at the maximum. This is because the sample space of this distribution is defined by $\theta$, i.e. we have that $0 \le x \le \theta$. Therefore, it is impossible for $\theta$ to be any value below the maximum observed value of $X$. As such, although $L(\theta)$ increases as $\theta$ decreases, $L(\theta)$ falls to zero for all $\theta$ less than the maximum observed value of $X$. Consequently, we cannot use calculus to maximise the likelihood function (nor the log-likelihood function), so instead we immediately deduce here that $\hat\theta = X_{(n)}$.

(b) For the given data, the maximum observation is $x_{(3)} = 1.2$. Therefore, the maximum likelihood estimate is $\hat\theta = 1.2$. The likelihood function is zero for $\theta < 1.2$, jumps to its largest value at $\theta = 1.2$, and then decreases like $\theta^{-3}$ as $\theta$ increases beyond 1.2.

Activity 7.12  Let $\{X_1, \ldots, X_n\}$ be a random sample from a Poisson distribution with mean $\lambda > 0$. Find the MLE of $\lambda$.

Solution
The probability function is:
\[
P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}.
\]
The likelihood and log-likelihood functions are, respectively:
\[
L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{X_i}}{X_i!} = \frac{e^{-n\lambda}\lambda^{n\bar X}}{\prod_{i=1}^n X_i!}
\]
and:
\[
l(\lambda) = \log L(\lambda) = n\bar X\log(\lambda) - n\lambda + C = n(\bar X\log(\lambda) - \lambda) + C
\]
where $C$ is a constant (i.e. it may depend on the $X_i$ but cannot depend on the parameter). Setting:
\[
\frac{dl(\lambda)}{d\lambda} = n\left(\frac{\bar X}{\hat\lambda} - 1\right) = 0
\]
we obtain the MLE $\hat\lambda = \bar X$, which is also the MME.

Activity 7.13  Let $\{X_1, \ldots, X_n\}$ be a random sample from an Exponential($\lambda$) distribution. Find the MLE of $\lambda$.

Solution
The likelihood function is:
\[
L(\lambda) = \prod_{i=1}^n f(x_i; \lambda) = \prod_{i=1}^n \lambda e^{-\lambda X_i} = \lambda^n e^{-\lambda\sum_i X_i} = \lambda^n e^{-\lambda n\bar X}
\]
so the log-likelihood function is:
\[
l(\lambda) = \log\left(\lambda^n e^{-\lambda n\bar X}\right) = n\log(\lambda) - \lambda n\bar X.
\]
Differentiating and setting equal to zero gives:
\[
\frac{d}{d\lambda}l(\lambda) = \frac{n}{\hat\lambda} - n\bar X = 0 \quad \Rightarrow \quad \hat\lambda = \frac{1}{\bar X}.
\]
The second derivative of the log-likelihood function is:
\[
\frac{d^2}{d\lambda^2}l(\lambda) = -\frac{n}{\lambda^2}
\]
which is always negative, hence the MLE $\hat\lambda = 1/\bar X$ is indeed a maximum. This happens to be the same as the method of moments estimator of $\lambda$.
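As a quick numerical illustration (not part of the original activity), the R sketch below maximises the exponential log-likelihood of Activity 7.13 with optimise() for some simulated data and compares the result with the closed-form MLE $1/\bar x$; the assumed rate, sample size and search interval are arbitrary choices.

# Numerical check of the exponential MLE from Activity 7.13
set.seed(4)
x <- rexp(200, rate = 0.5)                       # illustrative data with true lambda = 0.5

loglik <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))
optimise(loglik, interval = c(0.001, 10), maximum = TRUE)$maximum   # numerical MLE
1 / mean(x)                                                         # closed-form MLE 1/xbar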
Activity 7.14  Use the observed random sample $x_1 = 8.2$, $x_2 = 10.6$, $x_3 = 9.1$ and $x_4 = 4.9$ to calculate the maximum likelihood estimate of $\lambda$ in the exponential pdf:
\[
f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}
\]

Solution
We derive a general formula with a random sample $\{X_1, \ldots, X_n\}$ first. The joint pdf is:
\[
f(x_1, \ldots, x_n; \lambda) = \begin{cases} \lambda^n e^{-\lambda n\bar x} & \text{for } x_1, \ldots, x_n \ge 0 \\ 0 & \text{otherwise.} \end{cases}
\]
With all $x_i \ge 0$, $L(\lambda) = \lambda^n e^{-\lambda n\bar X}$, hence the log-likelihood function is:
\[
l(\lambda) = \log L(\lambda) = n\log\lambda - \lambda n\bar X.
\]
Setting:
\[
\frac{d}{d\lambda}l(\lambda) = \frac{n}{\hat\lambda} - n\bar X = 0 \quad \Rightarrow \quad \hat\lambda = \frac{1}{\bar X}.
\]
For the given sample, $\bar x = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2$. Therefore, $\hat\lambda = 1/8.2 = 0.1220$.

Activity 7.15  Let $\{X_1, \ldots, X_n\}$ be a random sample from a population with the probability distribution specified in (a) and (b) below, respectively. Find the MLEs of the following parameters.

(a) $\lambda$, $\mu = 1/\lambda$ and $\theta = \lambda^2$, when the population has an exponential distribution with pdf $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x > 0$, and 0 otherwise.

(b) $\pi$ and $\theta = \pi/(1 - \pi)$, when the population has a Bernoulli (two-point) distribution, that is $p(1; \pi) = \pi = 1 - p(0; \pi)$, and 0 otherwise.

Solution

(a) The joint pdf is:
\[
f(x_1, \ldots, x_n; \lambda) = \begin{cases} \lambda^n\exp\left(-\lambda\sum_{i=1}^n x_i\right) & \text{for all } x_1, \ldots, x_n > 0 \\ 0 & \text{otherwise.} \end{cases}
\]
Noting that $\sum_{i=1}^n X_i = n\bar X$, the likelihood function is $L(\lambda) = \lambda^n e^{-\lambda n\bar X}$. The log-likelihood function is $l(\lambda) = n\log\lambda - \lambda n\bar X$. Setting:
\[
\frac{d}{d\lambda}l(\lambda) = \frac{n}{\hat\lambda} - n\bar X = 0
\]
we obtain the MLE $\hat\lambda = 1/\bar X$. The MLE of $\mu$ is:
\[
\hat\mu = \mu(\hat\lambda) = \frac{1}{\hat\lambda} = \bar X
\]
and the MLE of $\theta$ is:
\[
\hat\theta = \theta(\hat\lambda) = (\hat\lambda)^2 = \bar X^{-2}
\]
making use of the invariance principle in each case.

(b) The joint probability function is:
\[
\prod_{i=1}^n p(x_i; \pi) = \pi^y(1 - \pi)^{n-y}
\]
where $y = \sum_{i=1}^n x_i$. The likelihood function is $L(\pi) = \pi^Y(1 - \pi)^{n-Y}$. The log-likelihood function is:
\[
l(\pi) = Y\log\pi + (n - Y)\log(1 - \pi).
\]
Setting:
\[
\frac{d}{d\pi}l(\pi) = \frac{Y}{\hat\pi} - \frac{n - Y}{1 - \hat\pi} = 0
\]
we obtain the MLE $\hat\pi = Y/n = \bar X$. The MLE of $\theta$ is:
\[
\hat\theta = \theta(\hat\pi) = \frac{\hat\pi}{1 - \hat\pi} = \frac{\bar X}{1 - \bar X}
\]
making use of the invariance principle again.

Activity 7.16  Let $\{X_1, \ldots, X_n\}$ be a random sample from the distribution $N(\mu, 1)$. Find the MLE of $\mu$.

Solution
The joint pdf of the observations is:
\[
f(x_1, \ldots, x_n; \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x_i - \mu)^2\right) = \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\right).
\]
We write the above as a function of $\mu$ only:
\[
L(\mu) = C\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - \mu)^2\right)
\]
where $C > 0$ is a constant. The MLE $\hat\mu$ maximises this function, and also maximises the function:
\[
l(\mu) = \log L(\mu) = -\frac{1}{2}\sum_{i=1}^n (X_i - \mu)^2 + \log(C).
\]
Therefore, the MLE effectively minimises $\sum_{i=1}^n (X_i - \mu)^2$, i.e. the MLE is also the least squares estimator (LSE), i.e. $\hat\mu = \bar X$.

7.8 Overview of chapter

This chapter introduced point estimation. Key properties of estimators were explored and the characteristics of a desirable estimator were studied through the calculation of the mean squared error. Methods for finding estimators of parameters were also described, including method of moments, least squares and maximum likelihood estimation.

7.9 Key terms and concepts

Bias, Consistent estimator, Invariance principle, Law of large numbers (LLN), Least squares estimation, Likelihood function, Log-likelihood function, Maximum likelihood estimation, Mean squared error (MSE), Method of moments estimation, Parameter, Point estimate, Point estimator, Population moment, Sample moment, Statistic, Unbiased.

7.10 Sample examination questions

Solutions can be found in Appendix C.

1. Let $\{X_1, \ldots, X_n\}$ be a random sample from the (continuous) uniform distribution such that $X \sim \mathrm{Uniform}[0, \theta]$, where $\theta > 0$.

(a) Find the method of moments estimator (MME) of $\theta$. (Note you should derive any required population moments.)

(b) If $n = 3$, with the observed data $x_1 = 0.2$, $x_2 = 3.6$ and $x_3 = 1.1$, use the MME obtained in (a) to compute the point estimate of $\theta$ for this sample. Do you trust this estimate? Justify your answer. Hint: You may wish to make reference to the law of large numbers.

2. Suppose that you are given independent observations $y_1$, $y_2$ and $y_3$ such that:
\[
y_1 = \alpha + \beta + \varepsilon_1, \quad y_2 = \alpha + 2\beta + \varepsilon_2, \quad y_3 = \alpha + 4\beta + \varepsilon_3.
\]
The random variables $\varepsilon_i$, for $i = 1, 2, 3$, are normally distributed with a mean of 0 and a variance of 1.

(a) Find the least squares estimators of the parameters $\alpha$ and $\beta$, and verify that they are unbiased estimators.

(b) Calculate the variance of the estimator of $\alpha$.

3. A random sample $\{X_1, X_2, \ldots, X_n\}$ is drawn from the following probability distribution:
\[
p(x; \lambda) = \frac{\lambda^{2x}e^{-\lambda^2}}{x!} \quad \text{for } x = 0, 1, 2, \ldots
\]
and 0 otherwise, where $\lambda > 0$.

(a) Derive the maximum likelihood estimator of $\lambda$.

(b) State the maximum likelihood estimator of $\theta = \lambda^3$.

Chapter 8  Interval estimation

8.1 Synopsis of chapter

This chapter covers interval estimation – a natural extension of point estimation. Due to the almost inevitable sampling error, we wish to communicate the level of uncertainty in our point estimate by constructing confidence intervals.

8.2 Learning outcomes

After completing this chapter, you should be able to:

explain the coverage probability of a confidence interval

construct confidence intervals for means of normal and non-normal populations when the variance is known and unknown

construct confidence intervals for the variance of a normal population

explain the link between confidence intervals and distribution theory, and critique the assumptions made to justify the use of various confidence intervals.

8.3 Introduction

Point estimation is simple but not informative enough, since a point estimator is always subject to errors. A more scientific approach is to find an upper bound $U = U(X_1, \ldots, X_n)$ and a lower bound $L = L(X_1, \ldots, X_n)$, and hope that the unknown parameter $\theta$ lies between the two bounds $L$ and $U$ (life is not always as simple as that, but it is a good start). An intuitive guess for estimating the population mean would be:
\[
L = \bar X - k \times \mathrm{S.E.}(\bar X) \quad \text{and} \quad U = \bar X + k \times \mathrm{S.E.}(\bar X)
\]
where $k > 0$ is a constant and $\mathrm{S.E.}(\bar X)$ is the standard error of the sample mean. The (random) interval $(L, U)$ forms an interval estimator of $\theta$. For estimation to be as precise as possible, intuitively the width of the interval, $U - L$, should be small. Typically, the coverage probability:
\[
P(L(X_1, \ldots, X_n) < \theta < U(X_1, \ldots, X_n)) < 1.
\]
Ideally, we should choose $L$ and $U$ such that:

the width of the interval is as small as possible

the coverage probability is as large as possible.

Activity 8.1  Why do we not always choose a very high confidence level for a confidence interval?

Solution
We do not always want to use a very high confidence level because the confidence interval would be very wide. We have a trade-off between the width of the confidence interval and the coverage probability.

8.4 Interval estimation for means of normal distributions

Let us consider a simple example. We have a random sample $\{X_1, \ldots, X_n\}$ from the distribution $N(\mu, \sigma^2)$, with $\sigma^2$ known. From Chapter 7, we have reason to believe that $\bar X$ is a good estimator of $\mu$. We also know $\bar X \sim N(\mu, \sigma^2/n)$, and hence:
\[
\frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).
\]
Therefore, supposing a 95% coverage probability:
\[
0.95 = P\left(\frac{|\bar X - \mu|}{\sigma/\sqrt{n}} \le 1.96\right) = P\left(|\mu - \bar X| \le 1.96 \times \frac{\sigma}{\sqrt{n}}\right) = P\left(\bar X - 1.96 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar X + 1.96 \times \frac{\sigma}{\sqrt{n}}\right).
\]
Therefore, the interval covering $\mu$ with probability 0.95 is:
\[
\left(\bar X - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \bar X + 1.96 \times \frac{\sigma}{\sqrt{n}}\right)
\]
which is called a 95% confidence interval for $\mu$.
Example 8.1  Suppose $\sigma = 1$, $n = 4$, and $\bar x = 2.25$, then a 95% confidence interval for $\mu$ is:
\[
\left(2.25 - 1.96 \times \frac{1}{\sqrt{4}},\ 2.25 + 1.96 \times \frac{1}{\sqrt{4}}\right) = (1.27, 3.23).
\]
Instead of a simple point estimate of $\hat\mu = 2.25$, we say $\mu$ is between 1.27 and 3.23 at the 95% confidence level.

What is $P(1.27 < \mu < 3.23) = 0.95$ in Example 8.1? Well, this probability does not mean anything, since $\mu$ is an unknown constant! We treat $(1.27, 3.23)$ as one realisation of the random interval $(\bar X - 0.98, \bar X + 0.98)$ which covers $\mu$ with probability 0.95. What is the meaning of 'with probability 0.95'? If one repeats the interval estimation a large number of times, about 95% of the time the interval estimator covers the true $\mu$.

Some remarks are the following.

i. The confidence level is often specified as 90%, 95% or 99%. Obviously the higher the confidence level, the wider the interval. For the normal distribution example:
\[
0.90 = P\left(\frac{|\bar X - \mu|}{\sigma/\sqrt{n}} \le 1.645\right) = P\left(\bar X - 1.645 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar X + 1.645 \times \frac{\sigma}{\sqrt{n}}\right)
\]
\[
0.95 = P\left(\frac{|\bar X - \mu|}{\sigma/\sqrt{n}} \le 1.96\right) = P\left(\bar X - 1.96 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar X + 1.96 \times \frac{\sigma}{\sqrt{n}}\right)
\]
\[
0.99 = P\left(\frac{|\bar X - \mu|}{\sigma/\sqrt{n}} \le 2.576\right) = P\left(\bar X - 2.576 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar X + 2.576 \times \frac{\sigma}{\sqrt{n}}\right).
\]
The widths of the three intervals are $2 \times 1.645 \times \sigma/\sqrt{n}$, $2 \times 1.96 \times \sigma/\sqrt{n}$ and $2 \times 2.576 \times \sigma/\sqrt{n}$, corresponding to the confidence levels of 90%, 95% and 99%, respectively. To achieve a 100% confidence level in the normal example, the width of the interval would have to be infinite!

ii. Among all the confidence intervals at the same confidence level, the one with the smallest width gives the most accurate estimation and is, therefore, optimal.

iii. For a distribution with a symmetric unimodal density function, optimal confidence intervals are symmetric, as depicted in Figure 8.1.

Figure 8.1: Symmetric unimodal density function showing that a given probability is represented by the narrowest interval when symmetric about the mean.

Activity 8.2
(a) Find the length of a 95% confidence interval for the mean of a normal distribution with known variance $\sigma^2$.

(b) Find the minimum sample size such that the width of a 95% confidence interval is not wider than $d$, where $d > 0$ is a prescribed constant.

Solution
(a) With an available random sample $\{X_1, \ldots, X_n\}$ from the normal distribution $N(\mu, \sigma^2)$ with $\sigma^2$ known, a 95% confidence interval for $\mu$ is of the form:
\[
\left(\bar X - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \bar X + 1.96 \times \frac{\sigma}{\sqrt{n}}\right).
\]
Hence the width of the confidence interval is:
\[
\left(\bar X + 1.96 \times \frac{\sigma}{\sqrt{n}}\right) - \left(\bar X - 1.96 \times \frac{\sigma}{\sqrt{n}}\right) = 2 \times 1.96 \times \frac{\sigma}{\sqrt{n}} = 3.92 \times \frac{\sigma}{\sqrt{n}}.
\]
(b) Let $3.92 \times \sigma/\sqrt{n} \le d$, and so we obtain the condition for the required sample size:
\[
n \ge \left(\frac{3.92 \times \sigma}{d}\right)^2 = \frac{15.37 \times \sigma^2}{d^2}.
\]
Therefore, in order to achieve the required accuracy, the sample size $n$ should be at least as large as $15.37 \times \sigma^2/d^2$. Note that as the variance $\sigma^2$ increases, the confidence interval width increases, and as the sample size $n$ increases, the confidence interval width decreases. Also, note that when $\sigma^2$ is unknown, the width of a confidence interval for $\mu$ depends on $S$. Therefore, the width is a random variable.

Activity 8.3  Assume that the random variable $X$ is normally distributed and that $\sigma^2$ is known. What confidence level would be associated with each of the following intervals?

(a) The interval $\left(\bar x - 1.645 \times \sigma/\sqrt{n},\ \bar x + 2.326 \times \sigma/\sqrt{n}\right)$.

(b) The interval $\left(-\infty,\ \bar x + 2.576 \times \sigma/\sqrt{n}\right)$.

(c) The interval $\left(\bar x - 1.645 \times \sigma/\sqrt{n},\ \bar x\right)$.

Solution
We have $\bar X \sim N(\mu, \sigma^2/n)$, hence $\sqrt{n}(\bar X - \mu)/\sigma \sim N(0, 1)$.

(a) $P(-1.645 < Z < 2.326) = 0.94$, hence a 94% confidence level.

(b) $P(-\infty < Z < 2.576) = 0.995$, hence a 99.5% confidence level.

(c) $P(-1.645 < Z < 0) = 0.45$, hence a 45% confidence level.
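The repeated-sampling interpretation of 'with probability 0.95' discussed after Example 8.1 can be illustrated by simulation. The R sketch below is illustrative only; the population values $\mu = 2$, $\sigma = 1$, the sample size $n = 4$ and the number of replications are arbitrary. It draws many samples, builds the interval $\bar x \pm 1.96\sigma/\sqrt{n}$ each time, and records how often the interval covers the true mean; the proportion should be close to 0.95.

# Coverage simulation for the 95% confidence interval with known sigma
set.seed(5)
mu <- 2; sigma <- 1; n <- 4; R <- 10000

covered <- logical(R)
for (r in 1:R) {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  lower <- xbar - 1.96 * sigma / sqrt(n)
  upper <- xbar + 1.96 * sigma / sqrt(n)
  covered[r] <- (lower < mu) & (mu < upper)
}
mean(covered)   # proportion of intervals covering mu; should be close to 0.95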
Activity 8.4  A personnel manager has found that historically the scores on aptitude tests given to applicants for entry-level positions are normally distributed with $\sigma = 32.4$ points. A random sample of nine test scores from the current group of applicants had a mean score of 187.9 points.

(a) Find an 80% confidence interval for the population mean score of the current group of applicants.

(b) Based on these sample results, a statistician found for the population mean a confidence interval extending from 165.8 to 210.0 points. Find the confidence level of this interval.

Solution
(a) We have $n = 9$, $\bar x = 187.9$, $\sigma = 32.4$ and $1 - \alpha = 0.8$, hence $\alpha/2 = 0.1$ and, from Table 4 of the New Cambridge Statistical Tables, $P(Z > 1.282) = 1 - \Phi(1.282) = 0.1$. So an 80% confidence interval is:
\[
187.9 \pm 1.282 \times \frac{32.4}{\sqrt{9}} \quad \Rightarrow \quad (174.05, 201.75).
\]
(b) The half-width of the confidence interval is $210.0 - 187.9 = 22.1$, which is equal to the margin of error, i.e. we have:
\[
22.1 = k \times \frac{\sigma}{\sqrt{n}} = k \times \frac{32.4}{\sqrt{9}} \quad \Rightarrow \quad k = 2.05.
\]
Since $P(Z > 2.05) = 1 - \Phi(2.05) = 0.02018 = \alpha/2$, we have $\alpha = 0.04036$. Hence we have a $100(1 - \alpha)\% = 100(1 - 0.04036)\% \approx 96\%$ confidence interval.

Activity 8.5  Five independent samples, each of size $n$, are to be drawn from a normal distribution where $\sigma^2$ is known. For each sample, the interval:
\[
\left(\bar x - 0.96 \times \frac{\sigma}{\sqrt{n}},\ \bar x + 1.06 \times \frac{\sigma}{\sqrt{n}}\right)
\]
will be constructed. What is the probability that at least four of the intervals will contain the unknown $\mu$?

Solution
The probability that the given interval will contain $\mu$ is:
\[
P(-0.96 < Z < 1.06) = 0.6869.
\]
The probability of four or five such intervals is binomial with $n = 5$ and $\pi = 0.6869$, so let the number of such intervals be $Y \sim \mathrm{Bin}(5, 0.6869)$. The required probability is:
\[
P(Y \ge 4) = \binom{5}{4}(0.6869)^4(0.3131) + \binom{5}{5}(0.6869)^5 = 0.5014.
\]

Dealing with unknown $\sigma$

In practice the standard deviation $\sigma$ is typically unknown, and we replace it with the sample standard deviation:
\[
S = \left(\frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar X)^2\right)^{1/2}
\]
leading to a confidence interval for $\mu$ of the form:
\[
\left(\bar X - k \times \frac{S}{\sqrt{n}},\ \bar X + k \times \frac{S}{\sqrt{n}}\right)
\]
where $k$ is a constant determined by the confidence level and also by the distribution of the statistic:
\[
\frac{\bar X - \mu}{S/\sqrt{n}}. \quad (8.1)
\]
However, the distribution of (8.1) is no longer normal – it is the Student's $t$ distribution.

8.4.1 An important property of normal samples

Let $\{X_1, \ldots, X_n\}$ be a random sample from $N(\mu, \sigma^2)$. Suppose:
\[
\bar X = \frac{1}{n}\sum_{i=1}^n X_i, \quad S^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar X)^2 \quad \text{and} \quad \mathrm{E.S.E.}(\bar X) = \frac{S}{\sqrt{n}}
\]
where $\mathrm{E.S.E.}(\bar X)$ denotes the estimated standard error of the sample mean.

i. $\bar X \sim N(\mu, \sigma^2/n)$ and $(n - 1)S^2/\sigma^2 \sim \chi^2_{n-1}$.

ii. $\bar X$ and $S^2$ are independent, therefore:
\[
\frac{\bar X - \mu}{S/\sqrt{n}} = \frac{\sqrt{n}(\bar X - \mu)/\sigma}{\sqrt{(n - 1)S^2/(n - 1)\sigma^2}} = \frac{\bar X - \mu}{\mathrm{E.S.E.}(\bar X)} \sim t_{n-1}.
\]
An accurate $100(1 - \alpha)\%$ confidence interval for $\mu$, where $\alpha \in (0, 1)$, is:
\[
\left(\bar X - c \times \frac{S}{\sqrt{n}},\ \bar X + c \times \frac{S}{\sqrt{n}}\right) = (\bar X - c \times \mathrm{E.S.E.}(\bar X),\ \bar X + c \times \mathrm{E.S.E.}(\bar X))
\]
where $c > 0$ is a constant such that $P(T > c) = \alpha/2$, where $T \sim t_{n-1}$.
Activity 8.6  Suppose that 9 bags of sugar are selected from the supermarket shelf at random and weighed. The weights in grammes are 812.0, 786.7, 794.1, 791.6, 811.1, 797.4, 797.8, 800.8 and 793.2. Construct a 95% confidence interval for the mean weight of all the bags on the shelf. Assume the population is normal.

Solution
Here we have a random sample of size $n = 9$. The mean is $\bar x = 798.30$. The sample variance is $s^2 = 72.76$, which gives a sample standard deviation $s = 8.53$. From Table 10 of the New Cambridge Statistical Tables, the top 2.5th percentile of the $t$ distribution with $n - 1 = 8$ degrees of freedom is 2.306. Therefore, a 95% confidence interval is:
\[
\left(798.30 - 2.306 \times \frac{8.53}{\sqrt{9}},\ 798.30 + 2.306 \times \frac{8.53}{\sqrt{9}}\right) = (798.30 - 6.56,\ 798.30 + 6.56) = (791.74, 804.86).
\]
It is sometimes more useful to write this as $798.30 \pm 6.56$.

Activity 8.7  Continuing the previous activity, suppose we are now told that $\sigma$, the population standard deviation, is known to be 8.5 g. Construct a 95% confidence interval using this information.

Solution
The top 2.5th percentile of the standard normal distribution is $z_{0.025} = 1.96$ (recall $t_\infty = N(0, 1)$), so a 95% confidence interval for the population mean is:
\[
\left(798.30 - 1.96 \times \frac{8.5}{\sqrt{9}},\ 798.30 + 1.96 \times \frac{8.5}{\sqrt{9}}\right) = (798.30 - 5.55,\ 798.30 + 5.55) = (792.75, 803.85).
\]
Again, it may be more useful to write this as $798.30 \pm 5.55$. Note that this confidence interval is narrower than the one in the previous question, even though our initial estimate $s$ turned out to be very close to the true value of $\sigma$.

Activity 8.8  A business requires an inexpensive check on the value of stock in its warehouse. In order to do this, a random sample of 50 items is taken and valued. The average value of these is computed to be £320.41 with a (sample) standard deviation of £40.60. It is known that there are 9,875 items in the total stock. Assume a normal distribution.

(a) Estimate the total value of the stock to the nearest £10,000.

(b) Construct a 95% confidence interval for the mean value of all items and hence construct a 95% confidence interval for the total value of the stock.

(c) You are told that the confidence interval in (b) is too wide for decision-making purposes and you are asked to assess how many more items would need to be sampled to obtain a confidence interval with the same level of confidence, but with half the width.

Solution

(a) The total value of the stock is $9875\mu$, where $\mu$ is the mean value of an item of stock. From Chapter 7, $\bar X$ is the obvious estimator of $\mu$, so $9875\bar X$ is the obvious estimator of $9875\mu$. Therefore, an estimate for the total value of the stock is $9875 \times 320.41 = £3{,}160{,}000$ (to the nearest £10,000).

(b) In this question $n = 50$ is large, and $\sigma^2$ is unknown, so a 95% confidence interval for $\mu$ is:
\[
\bar x \pm 1.96 \times \frac{s}{\sqrt{n}} = 320.41 \pm 1.96 \times \frac{40.6}{\sqrt{50}} = 320.41 \pm 11.25 \quad \Rightarrow \quad (£309.16,\ £331.66).
\]
Note that because $n$ is large we have used the standard normal distribution. It is more accurate to use a $t$ distribution with 49 degrees of freedom. This gives an interval of (£308.87, £331.95) – not much of a difference. To obtain a 95% confidence interval for the total value of the stock, $9875\mu$, multiply the interval by 9,875. This gives (to the nearest £10,000): (£3,050,000, £3,280,000).

(c) Increasing the sample size by a factor of $k$ reduces the width of the confidence interval by a factor of $\sqrt{k}$. Therefore, increasing the sample size by a factor of 4 will reduce the width of the confidence interval by a factor of $2\ (= \sqrt{4})$. Hence we need to increase the sample size from 50 to $4 \times 50 = 200$. So we should collect another 150 observations.
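The intervals in Activities 8.6 and 8.7 are easy to verify in R. The sketch below is illustrative only; it computes the $t$-based interval with qt() (and, as a cross-check, with t.test()), and the $z$-based interval under the assumption $\sigma = 8.5$.

# Confidence intervals for the sugar data of Activities 8.6 and 8.7
x <- c(812.0, 786.7, 794.1, 791.6, 811.1, 797.4, 797.8, 800.8, 793.2)
n <- length(x)

# sigma unknown: t-based 95% interval (Activity 8.6)
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
t.test(x, conf.level = 0.95)$conf.int        # same interval via t.test()

# sigma assumed known to be 8.5 g: z-based 95% interval (Activity 8.7)
mean(x) + c(-1, 1) * qnorm(0.975) * 8.5 / sqrt(n)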
Activity 8.9  In a survey of students, the number of hours per week of private study is recorded. For a random sample of 23 students, the sample mean is 18.4 hours and the sample standard deviation is 3.9 hours. Treat the data as a random sample from a normal distribution.

(a) Find a 99% confidence interval for the mean number of hours per week of private study in the student population.

(b) Recompute your confidence interval in the case that the sample size is, in fact, 121, but the sample mean and sample standard deviation values are unchanged. Comment on the two intervals.

Solution
We have $\bar x = 18.4$ and $s = 3.9$, so a 99% confidence interval is of the form:
\[
\bar x \pm t_{n-1,\,0.005} \times \frac{s}{\sqrt{n}}.
\]
(a) When $n = 23$, $t_{22,\,0.005} = 2.819$. Hence a 99% confidence interval is:
\[
18.4 \pm 2.819 \times \frac{3.9}{\sqrt{23}} \quad \Rightarrow \quad (16.11, 20.69).
\]
(b) When $n = 121$, $t_{120,\,0.005} = 2.617$. Hence a 99% confidence interval is:
\[
18.4 \pm 2.617 \times \frac{3.9}{\sqrt{121}} \quad \Rightarrow \quad (17.47, 19.33).
\]
In spite of the same sample mean and sample standard deviation, the sample of size $n = 121$ offers a much more accurate estimate as the interval width is merely $19.33 - 17.47 = 1.86$ hours, in contrast to the interval width of $20.69 - 16.11 = 4.58$ hours with the sample size of $n = 23$.

Note that to derive a confidence interval for $\mu$ with $\sigma^2$ unknown, the formula used in the calculation involves both $n$ and $n - 1$. We then refer to the Student's $t$ distribution with $n - 1$ degrees of freedom. Also, note that $t_{120,\,\alpha} \approx z_\alpha$, where $P(Z > z_\alpha) = \alpha$ for $Z \sim N(0, 1)$. Therefore, it would be acceptable to use $z_{0.005} = 2.576$ as an approximation for $t_{120,\,0.005} = 2.617$.

8.4.2 Means of non-normal distributions

Let $\{X_1, \ldots, X_n\}$ be a random sample from a non-normal distribution with mean $\mu$ and variance $\sigma^2 < \infty$. When $n$ is large, $\sqrt{n}(\bar X - \mu)/\sigma$ is $N(0, 1)$ approximately. Therefore, we have an approximate 95% confidence interval for $\mu$ given by:
\[
\left(\bar X - 1.96 \times \frac{S}{\sqrt{n}},\ \bar X + 1.96 \times \frac{S}{\sqrt{n}}\right)
\]
where $S$ is the sample standard deviation. Note that it is a two-stage approximation.

1. Approximate the distribution of $\sqrt{n}(\bar X - \mu)/\sigma$ by $N(0, 1)$.

2. Approximate $\sigma$ by $S$.

Example 8.2  The salary data of 253 graduates from a UK business school (in thousands of pounds) yield the following: $n = 253$, $\bar x = 47.126$, $s = 6.843$ and so $s/\sqrt{n} = 0.43$. A point estimate of the average salary $\mu$ is $\bar x = 47.126$. An approximate 95% confidence interval for $\mu$ is:
\[
47.126 \pm 1.96 \times 0.43 \quad \Rightarrow \quad (46.283, 47.969).
\]

Activity 8.10  Suppose a random survey of 400 first-time home buyers finds that the sample mean of annual household income is £36,000 and the sample standard deviation is £17,000.

(a) An economist believes that the 'true' standard deviation is $\sigma = £12{,}000$. Based on this assumption, find an approximate 90% confidence interval for $\mu$, i.e. for the average annual household income of all first-time home buyers.

(b) Without the assumption that $\sigma$ is known, find an approximate 90% confidence interval for $\mu$.

(c) Are the two confidence intervals very different? Which one would you trust more, and why?

Solution

(a) Based on the central limit theorem for the sample mean, an approximate 90% confidence interval is:
\[
\bar x \pm z_{0.05} \times \frac{\sigma}{\sqrt{n}} = 36000 \pm 1.645 \times \frac{12000}{\sqrt{400}} = 36000 \pm 987 \quad \Rightarrow \quad (£35{,}013,\ £36{,}987).
\]
We may interpret this result as follows. According to the assumption made by the economist and the survey results, we may conclude at the 90% confidence level that the average of all first-time home buyers' incomes is between £35,013 and £36,987. Note that it is wrong to conclude that 90% of all first-time home buyers' incomes are between £35,013 and £36,987.

(b) Replacing $\sigma = 12000$ by $s = 17000$, we obtain an approximate 90% confidence interval of:
\[
\bar x \pm z_{0.05} \times \frac{s}{\sqrt{n}} = 36000 \pm 1.645 \times \frac{17000}{\sqrt{400}} = 36000 \pm 1398 \quad \Rightarrow \quad (£34{,}602,\ £37{,}398).
\]
Now, according to the survey results (only), we may conclude at the 90% confidence level that the average of all first-time home buyers' incomes is between £34,602 and £37,398.

(c) The interval estimates are different. The first one gives a smaller range by £822. This was due to the fact that the economist's assumed $\sigma$ of £12,000 is much smaller than the sample standard deviation, $s$, of £17,000. With a sample size as large as 400, we would think that we should trust the data more than an assumption by an economist! The key question is whether $\sigma$ being £12,000 is a reasonable assumption. This issue will be properly addressed using statistical hypothesis testing.

Activity 8.11  In a study of consumers' views on guarantees for new products, 370 out of a random sample of 425 consumers agreed with the statement: 'Product guarantees are worded more for lawyers to understand than to be easily understood by consumers.'

(a) Find an approximate 95% confidence interval for the population proportion of consumers agreeing with this statement.

(b) Would a 99% confidence interval for the population proportion be wider or narrower than that found in (a)? Explain your answer.

Solution
The population is a Bernoulli distribution on two points: 1 (agree) and 0 (disagree). We have a random sample of size $n = 425$, i.e. $\{X_1, \ldots, X_{425}\}$. Let $\pi = P(X_i = 1)$, hence $E(X_i) = \pi$ and $\mathrm{Var}(X_i) = \pi(1 - \pi)$ for $i = 1, \ldots, 425$. The sample mean and variance are:
\[
\bar x = \frac{1}{425}\sum_{i=1}^{425} x_i = \frac{370}{425} = 0.8706
\]
and:
\[
s^2 = \frac{1}{424}\left(\sum_{i=1}^{425} x_i^2 - 425\bar x^2\right) = \frac{1}{424}\left(370 - 425 \times (0.8706)^2\right) = 0.1129.
\]
(a) Based on the central limit theorem for the sample mean, an approximate 95% confidence interval for $\pi$ is:
\[
\bar x \pm z_{0.025} \times \frac{s}{\sqrt{n}} = 0.8706 \pm 1.96 \times \sqrt{\frac{0.1129}{425}} = 0.8706 \pm 0.0319 \quad \Rightarrow \quad (0.8387, 0.9025).
\]
(b) For a 99% confidence interval, we use $z_{0.005} = 2.576$ instead of $z_{0.025} = 1.96$ in the above formula. Therefore, the confidence interval becomes wider.

Note that the width of a confidence interval is a random variable, i.e. it varies from sample to sample. The comparison in (b) above is with the understanding that the same random sample is used to construct the two confidence intervals. Be sure to pay close attention to how we interpret confidence intervals in the context of particular practical problems.
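The proportion interval in Activity 8.11 can be reproduced directly from the counts. The R sketch below is illustrative only; it uses the usual large-sample formula based on the sample proportion. Note that $\hat p(1 - \hat p)$ differs from the $s^2$ used in the solution above only by the factor $n/(n - 1)$, so the two versions agree to the accuracy shown.

# Approximate 95% confidence interval for the proportion in Activity 8.11
x <- 370; n <- 425
p.hat <- x / n

p.hat + c(-1, 1) * qnorm(0.975) * sqrt(p.hat * (1 - p.hat) / n)
# Using the sample variance with the n - 1 divisor instead, as in the solution above:
s2 <- (x - n * p.hat^2) / (n - 1)
p.hat + c(-1, 1) * qnorm(0.975) * sqrt(s2 / n)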
Activity 8.12
(a) A sample of 954 adults in early 1987 found that 23% of them held shares. Given a UK adult population of 41 million and assuming a proper random sample was taken, construct a 95% confidence interval estimate for the number of shareholders in the UK.

(b) A 'similar' survey the previous year had found a total of 7 million shareholders. Assuming 'similar' means the same sample size, construct a 95% confidence interval estimate of the increase in shareholders between the two years.

Solution

(a) Let $\pi$ be the proportion of shareholders in the population. Start by estimating $\pi$. We are estimating a proportion and $n$ is large, so an approximate 95% confidence interval for $\pi$ is, using the central limit theorem:
\[
\hat\pi \pm 1.96 \times \sqrt{\frac{\hat\pi(1 - \hat\pi)}{n}} = 0.23 \pm 1.96 \times \sqrt{\frac{0.23 \times 0.77}{954}} = 0.23 \pm 0.027 \quad \Rightarrow \quad (0.203, 0.257).
\]
Therefore, a 95% confidence interval for the number (rather than the proportion) of shareholders in the UK is obtained by multiplying the above interval endpoints by 41 million, giving 8.3 million to 10.5 million. An alternative way of expressing this is:
\[
9{,}400{,}000 \pm 1{,}100{,}000 \quad \Rightarrow \quad (8{,}300{,}000,\ 10{,}500{,}000).
\]
Therefore, we estimate there are about 9.4 million shareholders in the UK, with a margin of error of 1.1 million.

(b) Let us start by finding a 95% confidence interval for the difference in the two proportions. We use the formula:
\[
\hat\pi_1 - \hat\pi_2 \pm 1.96 \times \sqrt{\frac{\hat\pi_1(1 - \hat\pi_1)}{n_1} + \frac{\hat\pi_2(1 - \hat\pi_2)}{n_2}}.
\]
The estimates of the proportions $\pi_1$ and $\pi_2$ are 0.23 and 0.171, respectively. We know $n_1 = 954$ and although $n_2$ is unknown we can assume it is approximately equal to 954 (noting the 'similar' in the question), so an approximate 95% confidence interval is:
\[
0.23 - 0.171 \pm 1.96 \times \sqrt{\frac{0.23 \times 0.77}{954} + \frac{0.171 \times 0.829}{954}} = 0.059 \pm 0.036 \quad \Rightarrow \quad (0.023, 0.094).
\]
By multiplying by 41 million, we get a confidence interval of:
\[
2{,}400{,}000 \pm 1{,}500{,}000 \quad \Rightarrow \quad (900{,}000,\ 3{,}900{,}000).
\]
We estimate that the number of shareholders has increased by about 2.4 million in the two years. There is quite a large margin of error, i.e. 1.5 million, especially when compared with a point estimate (i.e. interval midpoint) of 2.4 million.

8.5 Use of the chi-squared distribution

Let $Y_1, \ldots, Y_n$ be independent $N(\mu, \sigma^2)$ random variables. Therefore:
\[
\frac{Y_i - \mu}{\sigma} \sim N(0, 1).
\]
Hence:
\[
\frac{1}{\sigma^2}\sum_{i=1}^n (Y_i - \mu)^2 \sim \chi^2_n.
\]
Note that:
\[
\frac{1}{\sigma^2}\sum_{i=1}^n (Y_i - \mu)^2 = \frac{1}{\sigma^2}\sum_{i=1}^n (Y_i - \bar Y)^2 + \frac{n(\bar Y - \mu)^2}{\sigma^2}. \quad (8.2)
\]
Proof: We have:
\[
\sum_{i=1}^n (Y_i - \mu)^2 = \sum_{i=1}^n ((Y_i - \bar Y) + (\bar Y - \mu))^2 = \sum_{i=1}^n (Y_i - \bar Y)^2 + n(\bar Y - \mu)^2 + 2(\bar Y - \mu)\sum_{i=1}^n (Y_i - \bar Y) = \sum_{i=1}^n (Y_i - \bar Y)^2 + n(\bar Y - \mu)^2
\]
since $\sum_{i=1}^n (Y_i - \bar Y) = 0$. Dividing through by $\sigma^2$ gives (8.2).

Since $\bar Y \sim N(\mu, \sigma^2/n)$, then $n(\bar Y - \mu)^2/\sigma^2 \sim \chi^2_1$. It can be proved that:
\[
\frac{1}{\sigma^2}\sum_{i=1}^n (Y_i - \bar Y)^2 \sim \chi^2_{n-1}.
\]
Therefore, decomposition (8.2) is an instance of the relationship:
\[
\chi^2_n = \chi^2_{n-1} + \chi^2_1.
\]

8.6 Interval estimation for variances of normal distributions

Let $\{X_1, \ldots, X_n\}$ be a random sample from a population with mean $\mu$ and variance $\sigma^2 < \infty$. Let $M = \sum_{i=1}^n (X_i - \bar X)^2 = (n - 1)S^2$, then $M/\sigma^2 \sim \chi^2_{n-1}$. For any given small $\alpha \in (0, 1)$, we can find $0 < k_1 < k_2$ such that:
\[
P(X < k_1) = P(X > k_2) = \frac{\alpha}{2}
\]
where $X \sim \chi^2_{n-1}$. Therefore:
\[
1 - \alpha = P\left(k_1 < \frac{M}{\sigma^2} < k_2\right) = P\left(\frac{M}{k_2} < \sigma^2 < \frac{M}{k_1}\right).
\]
Hence a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$ is:
\[
\left(\frac{M}{k_2},\ \frac{M}{k_1}\right).
\]

Example 8.3  Suppose $n = 15$ and the sample variance is $s^2 = 24.5$. Let $\alpha = 0.05$. From Table 8 of the New Cambridge Statistical Tables, we find:
\[
P(X < 5.629) = P(X > 26.119) = 0.025
\]
where $X \sim \chi^2_{14}$. Hence a 95% confidence interval for $\sigma^2$ is:
\[
\left(\frac{M}{26.119},\ \frac{M}{5.629}\right) = \left(\frac{14 \times S^2}{26.119},\ \frac{14 \times S^2}{5.629}\right) = (0.536 \times S^2,\ 2.487 \times S^2) = (13.132, 60.934).
\]
In the above calculation, we have used the formula:
\[
S^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n - 1} \times M.
\]

Activity 8.13  A random sample of size $n = 16$ drawn from a normal distribution had a sample variance of $s^2 = 32.76$. Construct a 99% confidence interval for $\sigma^2$.

Solution
For a 99% confidence interval, we need the lower and upper half-percentile values from the $\chi^2_{n-1} = \chi^2_{15}$ distribution. These are $\chi^2_{0.995,\,15} = 4.601$ and $\chi^2_{0.005,\,15} = 32.801$, respectively. Hence we obtain:
\[
\left(\frac{(n - 1)s^2}{\chi^2_{\alpha/2,\,n-1}},\ \frac{(n - 1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}\right) = \left(\frac{15 \times 32.76}{32.801},\ \frac{15 \times 32.76}{4.601}\right) = (14.98, 106.80).
\]
Note that this is a very wide confidence interval due to (i.) a high level of confidence (99%), and (ii.) a small sample size ($n = 16$).
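These variance intervals are straightforward to compute with qchisq(). The R sketch below is illustrative only; it reproduces Example 8.3 and Activity 8.13. Note that qchisq() takes lower-tail probabilities, so the 'upper' table value corresponds to qchisq(1 − α/2, df).

# Chi-squared confidence intervals for a normal variance
var.ci <- function(s2, n, conf = 0.95) {
  alpha <- 1 - conf
  k1 <- qchisq(alpha / 2, df = n - 1)        # lower percentile (e.g. 5.629 for df = 14)
  k2 <- qchisq(1 - alpha / 2, df = n - 1)    # upper percentile (e.g. 26.119 for df = 14)
  c(lower = (n - 1) * s2 / k2, upper = (n - 1) * s2 / k1)
}

var.ci(24.5, n = 15, conf = 0.95)    # Example 8.3: approximately (13.13, 60.93)
var.ci(32.76, n = 16, conf = 0.99)   # Activity 8.13: approximately (14.98, 106.80)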
Activity 8.14  A manufacturer is concerned about the variability of the levels of impurity contained in consignments of raw materials from a supplier. A random sample of 10 consignments showed a standard deviation of 2.36 in the concentration of impurity levels. Assume normality.

(a) Find a 95% confidence interval for the population variance.

(b) Would a 99% confidence interval for this variance be wider or narrower than that found in (a)?

Solution
(a) We have $n = 10$, $s^2 = (2.36)^2 = 5.5696$, $\chi^2_{0.975,\,9} = 2.700$ and $\chi^2_{0.025,\,9} = 19.023$. Hence a 95% confidence interval for $\sigma^2$ is:
\[
\left(\frac{(n - 1)s^2}{\chi^2_{0.025,\,n-1}},\ \frac{(n - 1)s^2}{\chi^2_{0.975,\,n-1}}\right) = \left(\frac{9 \times 5.5696}{19.023},\ \frac{9 \times 5.5696}{2.700}\right) = (2.64, 18.57).
\]
(b) A 99% confidence interval would be wider since:
\[
\chi^2_{0.995,\,n-1} < \chi^2_{0.975,\,n-1} \quad \text{and} \quad \chi^2_{0.005,\,n-1} > \chi^2_{0.025,\,n-1}.
\]

Activity 8.15  Construct a 90% confidence interval for the variance of the bags of sugar in Activity 8.6. Does the given value of 8.5 g for the population standard deviation seem plausible?

Solution
We have $n = 9$ and $s^2 = 72.76$. For a 90% confidence interval, we need the bottom and top 5th percentiles of the chi-squared distribution on $n - 1 = 8$ degrees of freedom. These are:
\[
\chi^2_{0.95,\,8} = 2.733 \quad \text{and} \quad \chi^2_{0.05,\,8} = 15.507.
\]
A 90% confidence interval is:
\[
\left(\frac{(n - 1)S^2}{\chi^2_{\alpha/2,\,n-1}},\ \frac{(n - 1)S^2}{\chi^2_{1-\alpha/2,\,n-1}}\right) = \left(\frac{(9 - 1) \times 72.76}{15.507},\ \frac{(9 - 1) \times 72.76}{2.733}\right) = (37.536, 213.010).
\]
The corresponding values for the standard deviation are:
\[
\left(\sqrt{37.536},\ \sqrt{213.010}\right) = (6.127, 14.595).
\]
The given value falls well within this confidence interval, so we have no reason to doubt it.

Activity 8.16  The data below are from a random sample of size $n = 9$ taken from the distribution $N(\mu, \sigma^2)$:
\[
3.75,\ 5.67,\ 3.14,\ 7.89,\ 3.40,\ 9.32,\ 2.80,\ 10.34,\ 14.31.
\]
(a) Assume $\sigma^2 = 16$. Find a 95% confidence interval for $\mu$. If the width of such a confidence interval must not exceed 2.5, at least how many observations do we need?

(b) Suppose $\sigma^2$ is now unknown. Find a 95% confidence interval for $\mu$. Compare the result with that obtained in (a) and comment.

(c) Obtain a 95% confidence interval for $\sigma^2$.

Solution

(a) We have $\bar x = 6.74$. For a 95% confidence interval, $\alpha = 0.05$ so we need to find the top $100\alpha/2 = 2.5$th percentile of $N(0, 1)$, which is 1.96. Since $\sigma = 4$ and $n = 9$, a 95% confidence interval for $\mu$ is:
\[
\bar x \pm 1.96 \times \frac{\sigma}{\sqrt{n}} \quad \Rightarrow \quad \left(6.74 - 1.96 \times \frac{4}{3},\ 6.74 + 1.96 \times \frac{4}{3}\right) = (4.13, 9.35).
\]
In general, a $100(1 - \alpha)\%$ confidence interval for $\mu$ is:
\[
\left(\bar X - z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}},\ \bar X + z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\right)
\]
where $z_\alpha$ denotes the top $100\alpha$th percentile of the standard normal distribution, i.e. such that $P(Z > z_\alpha) = \alpha$ where $Z \sim N(0, 1)$. Hence the width of the confidence interval is:
\[
2 \times z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}.
\]
For this example, $\alpha = 0.05$, $z_{0.025} = 1.96$ and $\sigma = 4$. Setting the width of the confidence interval to be at most 2.5, we have:
\[
2 \times 1.96 \times \frac{\sigma}{\sqrt{n}} = \frac{15.68}{\sqrt{n}} \le 2.5.
\]
Hence:
\[
n \ge \left(\frac{15.68}{2.5}\right)^2 = 39.34.
\]
So we need a sample of at least 40 observations in order to obtain a 95% confidence interval with a width not greater than 2.5.

(b) When $\sigma^2$ is unknown, a 95% confidence interval for $\mu$ is:
\[
\left(\bar X - t_{\alpha/2,\,n-1} \times \frac{S}{\sqrt{n}},\ \bar X + t_{\alpha/2,\,n-1} \times \frac{S}{\sqrt{n}}\right)
\]
where $S^2 = \sum_{i=1}^n (X_i - \bar X)^2/(n - 1)$, and $t_{\alpha,\,k}$ denotes the top $100\alpha$th percentile of the Student's $t_k$ distribution, i.e. such that $P(T > t_{\alpha,\,k}) = \alpha$ for $T \sim t_k$. For this example, $s^2 = 16$, $s = 4$, $n = 9$ and $t_{0.025,\,8} = 2.306$. Hence a 95% confidence interval for $\mu$ is:
\[
6.74 \pm 2.306 \times \frac{4}{3} \quad \Rightarrow \quad (3.67, 9.81).
\]
This confidence interval is much wider than the one obtained in (a).
Since we do not know $\sigma^2$, we have less information available for our estimation. It is only natural that our estimation becomes less accurate. Note that although the sample size is $n$, the Student's $t$ distribution used has only $n - 1$ degrees of freedom. The loss of 1 degree of freedom in the sample variance is due to not knowing $\mu$. Hence we estimate $\mu$ using the data, for which we effectively pay a 'price' of one degree of freedom.

(c) Note $(n - 1)S^2/\sigma^2 \sim \chi^2_{n-1} = \chi^2_8$. From Table 8 of the New Cambridge Statistical Tables, for $X \sim \chi^2_8$, we find that:
\[
P(X < 2.180) = P(X > 17.535) = 0.025.
\]
Hence:
\[
P\left(2.180 < \frac{8 \times S^2}{\sigma^2} < 17.535\right) = 0.95.
\]
Therefore, the lower bound for $\sigma^2$ is $8 \times s^2/17.535 = 7.30$, and the upper bound is $8 \times s^2/2.180 = 58.72$. Therefore, a 95% confidence interval for $\sigma^2$, noting $s^2 = 16$, is $(7.30, 58.72)$.

Note that the estimation in this example is rather inaccurate. This is due to two reasons.

i. The sample size is small.

ii. The population variance, $\sigma^2$, is large.

8.7 Overview of chapter

This chapter covered interval estimation. A confidence interval converts a point estimate of an unknown parameter into an interval estimate, reflecting the likely sampling error. The chapter demonstrated how to construct confidence intervals for means and variances of normal populations.

8.8 Key terms and concepts

Confidence interval, Coverage probability, Interval estimator, Interval width.

8.9 Sample examination questions

Solutions can be found in Appendix C.

1. Let $\{X_1, \ldots, X_n\}$ be a random sample from $N(\mu, \sigma^2)$, where $\sigma^2$ is unknown. Derive the endpoints of an accurate $100(1 - \alpha)\%$ confidence interval for $\mu$ in this situation, where $\alpha \in (0, 1)$.

2. A country is considering joining the European Union. In a study of voters' views on a forthcoming referendum, 163 out of a random sample of 250 voters agreed with the statement: 'The government should seek membership of the European Union.' Find an approximate 99% confidence interval for the population proportion of all voters agreeing with this statement.

3. A random sample of size $n = 10$ drawn from a normal distribution had a sample variance of $s^2 = 21.05$. Construct a 90% confidence interval for $\sigma^2$. Note that $P(X < 3.325) = 0.05$, where $X \sim \chi^2_9$.

Chapter 9  Hypothesis testing

9.1 Synopsis of chapter

This chapter discusses hypothesis testing, which is used to answer questions about an unknown parameter. We consider how to perform an appropriate hypothesis test for a given problem, determine error probabilities and test power, and draw appropriate conclusions from a hypothesis test.

9.2 Learning outcomes

After completing this chapter, you should be able to:

define and apply the terminology of hypothesis testing

conduct statistical tests of all the types covered in the chapter

calculate the power of some of the simpler tests

explain the construction of rejection regions as a consequence of prior distributional results, with reference to the significance level and power.

9.3 Introduction

Hypothesis testing, together with statistical estimation, are the two most frequently-used statistical inference methods. Hypothesis testing addresses a different type of practical question from statistical estimation. Based on the data, a (statistical) test makes a binary decision on a hypothesis, denoted by $H_0$: reject $H_0$ or do not reject $H_0$.

Activity 9.1  Why does it make no sense to use a hypothesis like $\bar x = 2$?

Solution
We can see immediately if $\bar x = 2$ by calculating the sample mean.
Inference is concerned with the population from which the sample was taken. We are not very interested in the sample mean in its own right.

9.4 Introductory examples

Example 9.1  Consider a simple experiment – toss a coin 20 times. Let $\{X_1, \ldots, X_{20}\}$ be the outcomes where 'heads' $\to X_i = 1$, and 'tails' $\to X_i = 0$. Hence the probability distribution is $P(X_i = 1) = \pi = 1 - P(X_i = 0)$, for $\pi \in (0, 1)$.

Estimation would involve estimating $\pi$, using $\hat\pi = \bar X = (X_1 + \cdots + X_{20})/20$. Testing involves assessing if a hypothesis such as 'the coin is fair' is true or not. For example, this particular hypothesis can be formally represented as:
\[
H_0: \pi = 0.5.
\]
We cannot be sure what the answer is just from the data.

If $\hat\pi = 0.9$, $H_0$ is unlikely to be true.

If $\hat\pi = 0.45$, $H_0$ may be true (and also may be untrue).

If $\hat\pi = 0.7$, what to do then?

Example 9.2  A customer complains that the amount of coffee powder in a coffee tin is less than the advertised weight of 3 pounds. A random sample of 20 tins is selected, resulting in an average weight of $\bar x = 2.897$ pounds. Is this sufficient to substantiate the complaint?

Again statistical estimation cannot provide a firm answer, due to random fluctuations between different random samples. So we cast the problem into a hypothesis testing problem as follows. Let the weight of coffee in a tin be a normal random variable $X \sim N(\mu, \sigma^2)$. We need to test the hypothesis $\mu < 3$. In fact, we use the data to test the hypothesis:
\[
H_0: \mu = 3.
\]
If we could reject $H_0$, the customer complaint would be vindicated.

Example 9.3  Suppose one is interested in evaluating the mean income (in £000s) of a community. Suppose income in the population is modelled as $N(\mu, 25)$ and a random sample of $n = 25$ observations is taken, yielding the sample mean $\bar x = 17$. Independently of the data, three expert economists give their own opinions as follows.

Dr A claims the mean income is $\mu = 16$.

Ms B claims the mean income is $\mu = 15$.

Mr C claims the mean income is $\mu = 14$.

How would you assess these experts' statements?

We have $\bar X \sim N(\mu, \sigma^2/n) = N(\mu, 1)$. We assess the statements based on this distribution.

If Dr A's claim is correct, $\bar X \sim N(16, 1)$. The observed value $\bar x = 17$ is one standard deviation away from $\mu$, and may be regarded as a typical observation from the distribution. Hence there is little inconsistency between the claim and the data evidence. This is shown in Figure 9.1.

If Ms B's claim is correct, $\bar X \sim N(15, 1)$. The observed value $\bar x = 17$ begins to look a bit 'extreme', as it is two standard deviations away from $\mu$. Hence there is some inconsistency between the claim and the data evidence. This is shown in Figure 9.2.

If Mr C's claim is correct, $\bar X \sim N(14, 1)$. The observed value $\bar x = 17$ is very extreme, as it is three standard deviations away from $\mu$. Hence there is strong inconsistency between the claim and the data evidence. This is shown in Figure 9.3.

Figure 9.1: Comparison of claim and data evidence for Dr A in Example 9.3.

Figure 9.2: Comparison of claim and data evidence for Ms B in Example 9.3.

Figure 9.3: Comparison of claim and data evidence for Mr C in Example 9.3.

9.5 Setting p-value, significance level, test statistic

A measure of the discrepancy between the hypothesised (claimed) value of $\mu$ and the observed value $\bar X = \bar x$ is the probability of observing $\bar X = \bar x$ or more extreme values under the null hypothesis. This probability is called the p-value.
Example 9.4 Continuing Example 9.3: under H0 : µ = 16, P (X̄ ≥ 17) + P (X̄ ≤ 15) = P (|X̄ − 16| ≥ 1) = 0.317 under H0 : µ = 15, P (X̄ ≥ 17) + P (X̄ ≤ 13) = P (|X̄ − 15| ≥ 2) = 0.046 under H0 : µ = 14, P (X̄ ≥ 17) + P (X̄ ≤ 11) = P (|X̄ − 14| ≥ 3) = 0.003. In summary, we reject the hypothesis µ = 15 or µ = 14, as, for example, if the hypothesis µ = 14 is true, the probability of observing x̄ = 17, or more extreme values, would be as small as 0.003. We are comfortable with this decision, as a small probability event would be very unlikely to occur in a single experiment. On the other hand, we cannot reject the hypothesis µ = 16. However, this does not imply that this hypothesis is necessarily true, as, for example, µ = 17 or 18 are at least as likely as µ = 16. Remember: not reject 6= accept. A statistical test is incapable of ‘accepting’ a hypothesis. Definition of p-values A p-value is the probability of the event that the test statistic takes the observed value or more extreme (i.e. more unlikely) values under H0 . It is a measure of the discrepancy between the hypothesis H0 and the data. • A ‘small’ p-value indicates that H0 is not supported by the data. • A ‘large’ p-value indicates that H0 is not inconsistent with the data. So p-values may be seen as a risk measure of rejecting H0 , as shown in Figure 9.4. 272 9.5. Setting p-value, significance level, test statistic Figure 9.4: Interpretation of p-values as a risk measure. 9.5.1 General setting of hypothesis tests Let {X1 , . . . , Xn } be a random sample from a distribution with cdf F (x; θ). We are interested in testing the hypotheses: H0 : θ = θ0 vs. H1 : θ ∈ Θ1 where θ0 is a fixed value, Θ1 is a set, and θ0 6∈ Θ1 . H0 is called the null hypothesis. H1 is called the alternative hypothesis. The significance level is based on α, which is a small number between 0 and 1 selected subjectively. Often we choose α = 0.1, 0.05 or 0.01, i.e. tests are often conducted at the significance levels of 10%, 5% or 1%, respectively. So we test at the 100α% significance level. Our decision is to reject H0 if the p-value is ≤ α. 9.5.2 Statistical testing procedure 1. Find a test statistic T = T (X1 , . . . , Xn ). Denote by t the value of T for the given sample of observations under H0 . 2. Compute the p-value: p = Pθ0 (T = t or more ‘extreme’ values) where Pθ0 denotes the probability distribution such that θ = θ0 . 3. If p ≤ α we reject H0 . Otherwise, H0 is not rejected. Our understanding of ‘extremity’ is defined by the alternative hypothesis H1 . This will become clear in subsequent examples. The significance level determines which p-values are considered ‘small’. 273 9. Hypothesis testing Example 9.5 Let {X1 , . . . , X20 }, taking values either 1 or 0, be the outcomes of an experiment of tossing a coin 20 times, where: P (Xi = 1) = π = 1 − P (Xi = 0) for π ∈ (0, 1). We are interested in testing: H0 : π = 0.5 vs. H1 : π 6= 0.5. Suppose there are 17 Xi s taking the value 1, and 3 Xi s taking the value 0. Will you reject the null hypothesis at the 5% significance level? Let T = X1 + · · · + X20 . Therefore, T ∼ Bin(20, π). We use T as the test statistic. With the given sample, we observe t = 17. What are the more extreme values of T if H0 is true? Under H0 , E(T ) = n π0 = 10. Hence 3 is as extreme as 17, and the more extreme values are: 0, 1, 2, 18, 19 and 20. Therefore, the p-value is: ! 3 20 X X + PH0 (T = i) = i=0 i=17 3 X + i=0 20 X ! i=17 = 2 × (0.5)20 20! (0.5)i (1 − 0.5)20−i (20 − i)! i! 3 X i=0 20! (20 − i)! i! 
20 × 19 20 × 19 × 18 = 2 × (0.5) × 1 + 20 + + 2! 3! 20 = 0.0026. So we reject the null hypothesis of a fair coin at the 1% significance level. Activity 9.2 Let {X1 , . . . , X14 }, taking values either 1 or 0, be the outcomes of an experiment of tossing a coin 14 times, where: P (Xi = 1) = π = 1 − P (Xi = 0) for π ∈ (0, 1). We are interested in testing: H0 : π = 0.5 vs. H1 : π 6= 0.5. Suppose there are 4 Xi s taking the value 1, and 10 Xi s taking the value 0. Will you reject the null hypothesis at the 5% significance level? Solution Let T = X1 + · · · + X14 . Therefore, T ∼ Bin(14, π). We use T as the test statistic. With the given sample, we observe t = 4. We now determine which are the more extreme values of T if H0 is true. 274 9.5. Setting p-value, significance level, test statistic Under H0 , E(T ) = n π0 = 7. Hence 10 is as extreme as 4, and the more extreme values are: 0, 1, 2, 3, 11, 12, 13 and 14. Therefore, the p-value is: ! 4 14 X X + PH0 (T = i) = i=0 i=10 4 X + i=0 14 X ! i=10 14 = 2 × (0.5) 14! (0.5)i (1 − 0.5)14−i (14 − i)! i! 4 X i=0 14! (14 − i)! i! 14 × 13 14 × 13 × 12 = 2 × (0.5) × 1 + 14 + + 2! 3! 14 × 13 × 12 × 11 + 4! 14 = 0.1796. Since α = 0.05 < 0.1796, we do not reject the null hypothesis of a fair coin at the 5% significance level. The observed data are consistent with the null hypothesis of a fair coin. Activity 9.3 You wish to test whether a coin is fair. In 400 tosses of a coin, 217 heads and 183 tails appear. Is it reasonable to assume that the coin is fair? Justify your answer with an appropriate hypothesis test. Calculate the p-value of the test, and assume a 5% significance level. Solution Let {X1 , . . . , X400 }, taking values either 1 or 0, be the outcomes of an experiment of tossing a coin 400 times, where: P (Xi = 1) = π = 1 − P (Xi = 0) for π ∈ (0, 1), and 0 otherwise. We are interested in testing: H0 : π = 0.5 vs. H1 : π 6= 0.5. Let T = 400 P Xi . Under H0 , then T ∼ Bin(400, 0.5) ≈ N (200, 100), using the normal i=1 approximation of the binomial distribution, with µ = n π0 = 400 × 0.5 = 200 and σ 2 = n π0 (1 − π0 ) = 400 × 0.5 × 0.5 = 100. We observe t = 217, hence (using the continuity correction): 216.5 − 200 √ = P (Z ≥ 1.65) = 0.0495. P (T ≥ 216.5) = P Z ≥ 100 Therefore, the p-value is: 2 × P (Z ≥ 1.65) = 0.0990 275 9. Hypothesis testing which is far larger than α = 0.05, hence we do not reject H0 and conclude that there is no evidence to suggest that the coin is not fair. (Note that the test would be significant if we set H1 : π > 0.5, as the p-value would be 0.0495 which is less than 0.05 (just). However, we have no a priori reason to perform an upper-tailed test – we should not determine our hypotheses by observing the sample data, rather the hypotheses should be set before any data are observed.) Alternatively, one could apply the central limit theorem such that under H0 we have: π (1 − π) = N (0.5, 0.000625) X̄ ∼ N π, n approximately, since n = 400. We observe x̄ = 217/400 = 0.5425, hence: 0.5425 − 0.5 P (X̄ ≥ 0.5425) = P Z ≥ √ = P (Z ≥ 1.70) = 0.0446. 0.000625 Therefore, the p-value is: 2 × P (Z ≥ 1.70) = 2 × 0.0446 = 0.0892 leading to the same conclusion. Activity 9.4 In a given city, it is assumed that the number of car accidents in a given week follows a Poisson distribution. In past weeks, the average number of accidents per week was 9, and this week there were 3 accidents. Is it justified to claim that the accident rate has dropped? Calculate the p-value of the test, and assume a 5% significance level. 
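The exact binomial p-value in Example 9.5 and the normal-approximation p-value in Activity 9.3 can be checked numerically. The Python sketch below is an added illustration (not part of the guide); it reproduces both calculations with scipy.stats.

# Illustrative check (not from the subject guide) of two p-value calculations.
from scipy.stats import binom, norm

# Example 9.5: T ~ Bin(20, 0.5) under H0, observed t = 17 (3 is equally extreme).
p_exact = binom.cdf(3, 20, 0.5) + binom.sf(16, 20, 0.5)   # P(T <= 3) + P(T >= 17)
print(round(p_exact, 4))                                  # approx. 0.0026

# Activity 9.3: T ~ Bin(400, 0.5), approximately N(200, 100), observed t = 217.
z = (216.5 - 200) / 10                                    # continuity correction
p_approx = 2 * norm.sf(z)                                 # two-sided p-value
print(round(p_approx, 4))                                 # approx. 0.099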
Solution Let T be the number of car accidents per week such that T ∼ Poisson(λ). We are interested in testing: H0 : λ = 9 vs. H1 : λ < 9. Under H0 , then T ∼ Poisson(9), and we observe t = 3. Hence the p-value is: P (T ≤ 3) = 3 X e−9 9t t=0 t! −9 =e 92 93 1+9+ + 2! 3! = 0.0212. Since 0.0212 < 0.05, we reject H0 and conclude that there is evidence to suggest that the accident rate has dropped. 9.5.3 Two-sided tests for normal means Let {X1 , . . . , Xn } be a random sample from N (µ, σ 2 ). Assume σ 2 > 0 is known. We are interested in testing the hypotheses: H0 : µ = µ0 where µ0 is a given constant. 276 vs. H1 : µ 6= µ0 9.5. Setting p-value, significance level, test statistic P Intuitively if H0 is true, X̄ = i Xi /n should be close to µ0 . Therefore, large values of |X̄ − µ0 | suggest a departure from H0 . √ Under H0 , X̄ ∼ N (µ0 , σ 2 /n), i.e. n(X̄ − µ0 )/σ ∼ N (0, 1). Hence the test statistic may be defined as: √ n(X̄ − µ0 ) X̄ − µ0 √ ∼ N (0, 1) T = = σ σ/ n and we reject H0 for sufficiently ‘large’ values of |T |. How large is ‘large’ ? This is determined by the significance level. Suppose √ µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897. Therefore, the observed value of T is t = 20 × (2.897 − 3)/0.148 = −3.112. Hence the p-value is: Pµ0 (|T | ≥ 3.112) = P (|Z| > 3.112) = 0.0019 where Z ∼ N (0, 1). Therefore, the null hypothesis of µ = 3 will be rejected even at the 1% significance level. Alternatively, for a given 100α% significance level we may find the critical value cα such that Pµ0 (|T | > cα ) = α. Therefore, the p-value is ≤ α if and only if the observed value of |T | ≥ cα . Using this alternative approach, we do not need to compute the p-value. For this example, cα = zα/2 , that is the top 100α/2th percentile of N (0, 1), i.e. the z-value which cuts off α/2 probability in the upper tail of the standard normal distribution. For α = 0.1, 0.05 and 0.01, zα/2 = 1.645, 1.96 and 2.576, respectively. Since we observe |t| = 3.112, the null hypothesis is rejected at all three significance levels. 9.5.4 One-sided tests for normal means Let {X1 , . . . , Xn } be a random sample from N (µ, σ 2 ) with σ 2 > 0 known. We are interested in testing the hypotheses: H0 : µ = µ0 vs. H1 : µ < µ0 where µ0 is a known constant. √ Under H0 , T = n(X̄ − µ0 )/σ ∼ N (0, 1). We continue to use T as the test statistic. For H1 : µ < µ0 we should reject H0 when t ≤ c, where c < 0 is a constant. For a given 100α% significance level, the critical value c should be chosen such that: α = Pµ0 (T ≤ c) = P (Z ≤ c). Therefore, c is the 100αth percentile of N (0, 1). Due to the symmetry of N (0, 1), c = −zα , where zα is the top 100αth percentile of N (0, 1), i.e. P (Z > zα ) = α, where Z ∼ N (0, 1). For α = 0.05, zα = 1.645. We reject H0 if t ≤ −1.645. Example 9.6 Suppose µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897, then: √ 20 × (2.897 − 3) t= = −3.112 < −1.645. 0.148 277 9. Hypothesis testing So the null hypothesis of µ = 3 is rejected at the 5% significance level as there is significant evidence from the data that the true mean is likely to be smaller than 3. Some remarks are the following. i. We use a one-tailed test when we are only interested in the departure from H0 in one direction. ii. The distribution of a test statistic under H0 must be known in order to calculate p-values or critical values. iii. A test may be carried out by either computing the p-value or determining the critical value. iv. The probability of incorrect decisions in hypothesis testing is typically positive. 
For example, the significance level is the probability of rejecting a true H0 . 9.6 t tests t tests are one of the most frequently-used statistical tests. Let {X1 , . . . , Xn } be a random sample from N (µ, σ 2 ), where both µ and σ 2 > 0 are unknown. We are interested in testing the hypotheses: H0 : µ = µ0 vs. H1 : µ < µ0 where µ0 is known. √ Now we cannot use n(X̄ − µ0 )/σ as a statistic, since σ is unknown. Naturally we replace it by S, where: n 1 X 2 S = (Xi − X̄)2 . n − 1 i=1 The test statistic is then the famous t statistic: √ T = . n(X̄ − µ0 ) X̄ − µ0 √ √ = n(X̄ − µ0 ) = S S/ n n 1 X (Xi − X̄)2 n − 1 i=1 !1/2 . We reject H0 if t < c, where c is the critical value determined by the significance level: PH0 (T < c) = α where PH0 denotes the distribution under H0 (with mean µ0 and unknown σ 2 ). Under H0 , T ∼ tn−1 . Hence: α = PH0 (T < c) i.e. c is the 100αth percentile of the t distribution with n − 1 degrees of freedom. By symmetry, c = −tα, n−1 , where tα, k denotes the top 100αth percentile of the tk distribution. 278 9.6. t tests Example 9.7 To deal with the customer complaint that the average amount of coffee powder in a coffee tin is less than the advertised 3 pounds, 20 tins were weighed, yielding the following observations: 2.82, 2.78, 3.01, 3.01, 3.11, 3.09, 2.71, 2.94, 2.93, 2.68, 2.82, 2.81, 3.02, 3.05, 3.01, 3.01, 2.93, 2.85, 2.56, 2.79. The sample mean and standard deviation are, respectively: x̄ = 2.897 and s = 0.148. To test H0 : µ = 3 vs. H1 : µ < 3 at the 1% significance level, the critical value is c = −t0.01, 19 = −2.539. √ Since t = 20 × (2.897 − 3)/0.148 = −3.112 < −2.539, we reject the null hypothesis that µ = 3 at the 1% significance level. We conclude that there is highly significant evidence which supports the claim that the mean amount of coffee is less than 3 pounds. Note the hypotheses tested are in fact: H0 : µ = µ0 , σ 2 > 0 vs. H1 : µ 6= µ0 , σ 2 > 0. Although H0 does not specify the population distribution completely (σ 2 > 0), the distribution of the test statistic, T , under H0 is completely known. This enables us to find the critical value or p-value. Activity 9.5 A doctor claims that the average European is more than 8.5 kg overweight. To test this claim, a random sample of 12 Europeans were weighed, and the difference between their actual weight and their ideal weight was calculated. The data are: 14, 12, 8, 13, −1, 10, 11, 15, 13, 20, 7, 14. Assuming the data follow a normal distribution, conduct a t test to infer at the 5% significance level whether or not the doctor’s claim is true. Solution We have a random sample of size n = 12 from N (µ, σ 2 ), and we test H0 : µ = 8.5 vs. H1 : µ > 8.5. The test statistic, under H0 , is: T = X̄ − 8.5 X̄ − 8.5 √ = √ ∼ t11 . S/ n S/ 12 We reject H0 if t > t0.05, 11 = 1.796. For the given data: 12 1 X 1 x̄ = xi = 11.333 and s2 = 12 i=1 11 Hence: 12 X ! x2i − 12x̄2 = 26.606. i=1 11.333 − 8.5 t= p = 1.903 > 1.796 = t0.05, 11 26.606/12 279 9. Hypothesis testing so we reject H0 at the 5% significance level. There is significant evidence to support the doctor’s claim. Activity 9.6 A sample of seven is taken at random from a large batch of (nominally 12-volt) batteries. These are tested and their true voltages are shown below: 12.9, 11.6, 13.5, 13.9, 12.1, 11.9, 13.0. (a) Test if the mean voltage of the whole batch is 12 volts. (b) Test if the mean batch voltage is less than 12 volts. Which test do you think is the more appropriate? Solution (a) We are to test H0 : µ = 12 vs. H1 : µ 6= 12. 
The key points here are that n is small and that σ 2 is unknown. We can use the t test and this is valid provided the data are normally distributed. The test statistic value is: t= x̄ − 12 12.7 − 12 √ = √ = 2.16. s/ 7 0.858/ 7 This is compared to a Student’s t distribution on 6 degrees of freedom. The critical value corresponding to a 5% significance level is 2.447. Hence we cannot reject the null hypothesis at the 5% significance level. (We can reject at the 10% significance level, but the convention on this course is to regard such evidence merely as casting doubt on H0 , rather than justifying rejection as such, i.e. such a result would be ‘weakly significant’.) (b) We are to test H0 : µ = 12 vs. H1 : µ < 12. There is no need to do a formal statistical test. As the sample mean is 12.7, which is greater than 12, there is no evidence whatsoever for the alternative hypothesis. In (a) you are asked to do a two-sided test and in (b) it is a one-sided test. Which is more appropriate will depend on the purpose of the experiment, and your suspicions before you conduct it. • If you suspected before collecting the data that the mean voltage was less than 12 volts, the one-sided test would be appropriate. • If you had no prior reason to believe that the mean was less than 12 volts you would perform a two-sided test. • General rule: decide on whether it is a one- or two-sided test before performing the statistical test! Activity 9.7 A random sample of 16 observations from the population N (µ, σ 2 ) yields the sample mean x̄ = 9.31 and the sample variance s2 = 0.375. At the 5% 280 9.7. General approach to statistical tests significance level, test the following hypotheses by obtaining critical values: (a) H0 : µ = 9 vs. H1 : µ > 9. (b) H0 : µ = 9 vs. H1 : µ < 9. (c) H0 : µ = 9 vs. H1 : µ 6= 9. Repeat the above exercise with the additional assumption that σ 2 = 0.375. Compare the results with those derived without this assumption and comment. Solution When σ 2 is unknown, we use the test statistic T = T ∼ t15 . With α = 0.05, we reject H0 if: √ n(X̄ − 9)/S. Under H0 , (a) t > t0.05, 15 = 1.753, against H1 : µ > 9. (b) t < −t0.05, 15 = −1.753, against H1 : µ < 9. (c) |t| > t0.025, 15 = 2.131, against H1 : µ 6= 9. For the given sample, t = 2.02. Hence we reject H0 against the alternative H1 : µ > 9, but we will not reject H0 against the two other alternative hypotheses. √ When σ 2 is known, we use the test statistic T = n(X̄ − 9)/σ. Now under H0 , T ∼ N (0, 1). With α = 0.05, we reject H0 if: (a) t > z0.05 = 1.645, against H1 : µ > 9. (b) t < −z0.05 = −1.645, against H1 : µ < 9. (c) |t| > z0.025 = 1.960, against H1 : µ 6= 9. For the given sample, t = 2.02. Hence we reject H0 against the alternative H1 : µ > 9 and H1 : µ 6= 9, but we will not reject H0 against H1 : µ < 9. With σ 2 known, we should be able to perform inference better simply because we have more information about the population. More precisely, for the given significance level, we require less extreme values to reject H0 . Put another way, the p-value of the test is reduced when σ 2 is given. Therefore, the risk of rejecting H0 is also reduced. 9.7 General approach to statistical tests Let {X1 , . . . , Xn } be a random sample from the distribution F (x; θ). We are interested in testing: H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 281 9. Hypothesis testing where Θ0 and Θ1 are two non-overlapping sets. A general approach to test the above hypotheses at the 100α% significance level may be described as follows. 1. Find a test statistic T = T (X1 , . . . 
, Xn ) such that the distribution of T under H0 is known. 2. Identify a critical region C such that: PH0 (T ∈ C) = α. 3. If the observed value of T with the given sample is in the critical region C, H0 is rejected. Otherwise, H0 is not rejected. In order to make a test powerful in the sense that the chance of making an incorrect decision is small, the critical region should consist of those values of T which are least supportive of H0 (i.e. which lie in the direction of H1 ). 9.8 Two types of error Statistical tests are often associated with two kinds of decision errors, which are displayed in the following table: True state of nature H0 true H1 true Decision made H0 not rejected H0 rejected Correct decision Type I error Type II error Correct decision Some remarks are the following. i. Ideally we would like to have a test which minimises the probabilities of making both types of error, which unfortunately is not feasible. ii. The probability of making a Type I error is the significance level, which is under our control. iii. We do not have explicit control over the probability of a Type II error. For a given significance level, we try to choose a test statistic such that the probability of a Type II error is small. iv. The power function of the test is defined as: β(θ) = Pθ (H0 is rejected) for θ ∈ Θ1 i.e. β(θ) = 1 − P (Type II error). v. The null hypothesis H0 and the alternative hypothesis H1 are not treated equally in a statistical test, i.e. there is an asymmetric treatment. The choice of H0 is based on the subject matter concerned and/or technical convenience. vi. It is more conclusive to end a test with H0 rejected, as the decision of ‘not reject H0 ’ does not imply that H0 is accepted. 282 9.8. Two types of error Activity 9.8 (a) Of 100 clinical trials, 5 have shown that wonder-drug ‘Zap2’ is better than the standard treatment (aspirin). Should we be excited by these results? (b) Of the 1,000 clinical trials of 1,000 different drugs this year, 30 trials found drugs which seem better than the standard treatments with which they were compared. The television news reports only the results of those 30 ‘successful’ trials. Should we believe these reports? (c) A child welfare officer says that she has a test which always reveals when a child has been abused, and she suggests it be put into general use. What is she saying about Type I and Type II errors for her test? Solution (a) If 5 clinical trials out of 100 report that Zap2 is better, this is consistent with there being no difference whatsoever between Zap2 and aspirin if a 5% Type I error probability is being used for tests in these clinical trials. With a 5% significance level we expect 5 trials in 100 to show spurious significant results. (b) If the television news reports the 30 successful trials out of 1,000, and those trials use tests with a significance level of 5%, we may well choose to be very cautious about believing the results. We would expect 50 spuriously significant results in the 1,000 trial results. (c) The welfare officer is saying that the Type II error has probability zero. The test is always positive if the null hypothesis of no abuse is false. On the other hand, the welfare officer is saying nothing about the probability of a Type I error. It may well be that the probability of a Type I error is high, which would lead to many false accusations of abuse when no abuse had taken place. One should always think about both types of error when proposing a test. 
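The link between the significance level (the Type I error probability) and the power function β(θ) can also be seen by simulation. The sketch below is an added illustration with made-up parameter values (µ0 = 0, σ = 1, n = 25), not an example from the guide; it repeatedly draws normal samples and records how often a one-sided z test rejects H0, first under H0 and then under an alternative.

# Illustrative simulation (not from the subject guide): estimated size and power
# of the one-sided z test of H0: mu = 0 vs. H1: mu > 0 with known sigma.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, sigma, alpha = 25, 1.0, 0.05
z_crit = norm.ppf(1 - alpha)              # reject H0 if z > 1.645

def rejection_rate(mu, reps=100_000):
    x = rng.normal(mu, sigma, size=(reps, n))
    z = (x.mean(axis=1) - 0) / (sigma / np.sqrt(n))   # 0 is the hypothesised mean
    return np.mean(z > z_crit)

print(rejection_rate(0.0))   # approx. alpha = 0.05, the Type I error probability
print(rejection_rate(0.5))   # power beta(0.5); the Type II error prob. is 1 minus this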
Activity 9.9 A manufacturer has developed a new fishing line which is claimed to have an average breaking strength of 7 kg, with a standard deviation of 0.25 kg. Assume that the standard deviation figure is correct and that the breaking strength is normally distributed. Suppose that we carry out a test, at the 5% significance level, of H0 : µ = 7 vs. H1 : µ < 7. Find the sample size which is necessary for the test to have 90% power if the true breaking strength is 6.95 kg. Solution The critical value for the test is z0.95 = −1.645 and the probability of rejecting H0 with this test is: X̄ − 7 √ < −1.645 P 0.25/ n 283 9. Hypothesis testing which we rewrite as: P X̄ − 6.95 7 − 6.95 √ < √ − 1.645 0.25/ n 0.25/ n because X̄ ∼ N (6.95, (0.25)2 /n). To ensure power of 90% we need z0.10 = 1.282 since: P (Z < 1.282) = 0.90. Therefore: 7 − 6.95 √ − 1.645 = 1.282 0.25/ n √ 0.2 × n = 2.927 √ n = 14.635 n = 214.1832. So to ensure that the test power is at least 90%, we should use a sample size of 215. Remark: We see a rather large sample size is required. Hence investigators are encouraged to use sample sizes large enough to come to rational decisions. Activity 9.10 A manufacturer has developed a fishing line that is claimed to have a mean breaking strength of 15 kg with a standard deviation of 0.8 kg. Suppose that the breaking strength follows a normal distribution. With a sample size of n = 30, the null hypothesis that µ = 15 kg, against the alternative hypothesis of µ < 15 kg, will be rejected if the sample mean x̄ < 14.8 kg. (a) Find the probability of committing a Type I error. (b) Find the power of the test if the true mean is 14.9 kg, 14.8 kg and 14.7 kg, respectively. Solution (a) Under H0 : µ = 15, we have X̄ ∼ N (15, σ 2 /30) where σ = 0.8. The probability of committing a Type I error is: P (H0 is rejected | µ = 15) = P (X̄ < 14.8 | µ = 15) 14.8 − 15 X̄ − 15 √ < √ µ = 15 =P σ/ 30 σ/ 30 14.8 − 15 √ =P Z< 0.8/ 30 = P (Z < −1.37) = 0.0853. 284 9.8. Two types of error (b) If the true value is µ, then X̄ ∼ N (µ, σ 2 /30). The power of the test for a particular µ is: 14.8 − µ X̄ − µ 14.8 − µ √ < √ √ =P Z< Pµ (H0 is rejected) = Pµ (X̄ < 14.8) = Pµ σ/ 30 σ/ 30 0.8/ 30 which is 0.2483 for µ = 14.9, 0.5 for µ = 14.8, and 0.7517 for µ = 14.7. Activity 9.11 In a wire-based nail manufacturing process the target length for cut wire is 22 cm. It is known that widths vary with a standard deviation equal to 0.08 cm. In order to monitor this process, a random sample of 50 separate wires is accurately measured and the process is regarded as operating satisfactorily (the null hypothesis) if the sample mean width lies between 21.97 cm and 22.03 cm so that this is the decision procedure used (i.e. if the sample mean falls within this range then the null hypothesis is not rejected, otherwise the null hypothesis is rejected). (a) Determine the probability of a Type I error for this test. (b) Determine the probability of making a Type II error when the process is actually cutting to a length of 22.05 cm. (c) Find the probability of rejecting the null hypothesis when the true cutting length is 22.01 cm. (This is the power of the test when the true mean is 22.01 cm.) Solution (a) We have: α = 1 − P (21.97 < X̄ < 22.03 | µ = 22) 22.03 − 22 21.97 − 22 √ √ <Z< =1−P 0.08/ 50 0.08/ 50 = 1 − P (−2.65 < Z < 2.65) = 1 − 0.992 = 0.008. (b) We have: β = P (21.97 < X̄ < 22.03 | µ = 22.05) 22.03 − 22.05 21.97 − 22.05 √ √ <Z< =P 0.08/ 50 0.08/ 50 = P (−7.07 < Z < −1.77) = P (Z < −1.77) − P (Z < −7.07) = 0.0384. 285 9. 
Hypothesis testing (c) We have: P (rejecting H0 | µ = 22.01) = 1 − P (21.97 < X̄ < 22.03 | µ = 22.01) 22.03 − 22.01 21.97 − 22.01 √ √ =1−P <Z< 0.08/ 50 0.08/ 50 = 1 − P (−3.53 < X < 1.77) = 1 − (P (Z < 1.77) − P (Z < −3.53)) = 1 − (0.9616 − 0.00023) = 0.0386. Activity 9.12 It may be assumed that the length of nails produced by a particular machine is a normally distributed random variable, with a standard deviation of 0.02 cm. The lengths of a random sample of 6 nails are 4.63 cm, 4.59 cm, 4.64 cm, 4.62 cm, 4.66 cm and 4.69 cm. (a) Test, at the 1% significance level, the hypothesis that the machine produces nails with a mean length of 4.62 cm (a two-sided test). (b) Find the probability of committing a Type II error when the true mean length is 4.64 cm. Solution We test the null√hypothesis H0 : µ = 4.62 vs. H1 : µ 6= 4.62 with σ = 0.02. The test statistic is T = n(X̄ − 4.62)/σ, which is N (0, 1) under H0 . For the given sample, t = 2.25. (a) At the 1% significance level, we reject H0 if |t| ≥ 2.576. Since t = 2.25, H0 is not rejected. √ √ (b) For any µ 6= 4.62, E(T ) = n(E(X̄) − 4.62)/σ = n(µ − 4.62)/σ 6= 0, hence T 6∼ N (0, 1). The probability of committing a Type II error is: Pµ (H0 is not rejected) = Pµ (|T | < 2.576) = Pµ (−2.576 < T < 2.576) X̄ − 4.62 √ < 2.576 = Pµ −2.576 < σ/ n σ σ = Pµ 4.62 − 2.576 × √ < X̄ < 4.62 + 2.576 × √ n n 4.62 − µ X̄ − µ 4.62 − µ √ − 2.576 < √ < √ + 2.576 = Pµ σ/ n σ/ n σ/ n 4.62 − µ 4.62 − µ √ − 2.576 < Z < √ + 2.576 . =P σ/ n σ/ n 286 9.8. Two types of error Plugging in µ = 4.64, n = 6 and σ = 0.02 in the above expression, the probability of committing a Type II error is: √ √ P (− 6 − 2.576 < Z < − 6 + 2.576) ≈ Φ(0.13) − 0 = 0.5517. Note: i. The power of the test to reject H0 when µ = 4.64 is 1 − 0.5517 = 0.4483. The power increases when µ moves further away from µ0 = 4.62. ii. We always express probabilities to be calculated in terms of some ‘standard’ distributions such as N (0, 1), tk , etc. We can then refer to the relevant table in the New Cambridge Statistical Tables. Activity 9.13 A random sample of fibres is known to come from one of two environments, A or B. It is known from past experience that the lengths of fibres from A have a log-normal distribution so that the log-length of an A-type fibre is normally distributed about a mean of 0.80 with a standard deviation of 1.00. (Original units are in microns.) The log-lengths of B-type fibres are normally distributed about a mean of 0.65 with a standard deviation of 1.00. In order to identify the environment from which the given sample was taken a subsample of n fibres are to be measured and the classification is to be made on the evidence of these measurements. Do not be put off by the log-normal distribution. This simply means that it is the logs of the data, rather than the original data, which have a normal distribution. If X represents the log of a fibre length for fibres from A, then X ∼ N (0.8, 1). (a) If n = 50 and the sample is attributed to type A if the sample mean of log-lengths exceeds 0.75, determine the error probabilities. (b) What sample size and decision procedures should be used if it is desired to have error probabilities such that the chance of misclassifying as A is to be 5% and the chance of misclassifying as B is to be 10%? (c) If the sample is classified as A if the sample mean of log-lengths exceeds 0.75, and the misclassification as A is to have a probability of 2%, what sample size should be used and what is the probability of a B-type misclassification? 
(d) If the sample comes from neither A nor B but from an environment with a mean log-length of 0.70, what is the probability of classifying it as type A if the decision procedure determined in (b) is applied? Solution (a) We have n = 50 and σ = 1. We wish to test: H0 : µ = 0.65 (sample is from ‘B’) vs. H1 : µ = 0.80 (sample is from ‘A’). The decision rule is that we reject H0 if x̄ > 0.75. 287 9. Hypothesis testing The probability of a Type I error is: 0.75 − 0.65 √ P (X̄ > 0.75 | H0 ) = P Z > = P (Z > 0.71) = 0.2389. 1/ 50 The probability of a Type II error is: 0.75 − 0.80 √ P (X̄ < 0.75 | H1 ) = P Z < = P (Z < −0.35) = 0.3632. 1/ 50 (b) To find the sample size n and the value a, we need to solve two conditions: √ • α = P (X̄ > a |√H0 ) = P (Z > (a − 0.65)/(1/ n)) = 0.05 ⇒ (a − 0.65)/(1/ n) = 1.645. √ • β = P (X̄ < a |√ H1 ) = P (Z < (a − 0.80)/(1/ n)) = 0.10 ⇒ (a − 0.80)/(1/ n) = −1.28. Solving these equations gives a = 0.734 and n = 381, remembering to round up! (c) A sample is classified as being from A if H1 if x̄ > 0.75. We have: 0.75 − 0.65 0.75 − 0.65 √ √ = 2.05. α = P (X̄ > 0.75 | H0 ) = P Z > = 0.02 ⇒ 1/ n 1/ n Solving this equation gives n = 421, remembering to round up! Therefore: 0.75 − 0.80 √ β = P (X̄ < 0.75 | H1 ) = P Z < = P (Z < −1.026) = 0.1515. 1/ 421 (d) The rule in (b) is ‘take n = 381 and reject H0 if x̄ > 0.734’. So: 0.734 − 0.7 √ P (X̄ > 0.734 | µ = 0.7) = P Z > = P (Z > 0.66) = 0.2546. 1/ 381 9.9 Tests for variances of normal distributions Example 9.8 A container-filling machine is used to package milk cartons of 1 litre (= 1,000 cm3 ). Ideally, the amount of milk should only vary slightly. The company which produced the filling machine claims that the variance of the milk content is not greater than 1 cm3 . To examine the veracity of the claim, a random sample of 25 cartons is taken, resulting in 25 measurements (in cm3 ) as follows: 1,000.3, 1,001.3, 999.5, 999.7, 999.3, 999.8, 998.3, 1,000.6, 999.7, 999.8, 1,001.0, 999.4, 999.5, 998.5, 1,000.7, 999.6, 999.8, 1,000.0, 998.2, 1,000.1, 998.1, 1,000.7, 999.8, 1,001.3, 1,000.7. Do these data support the claim of the company? 288 9.9. Tests for variances of normal distributions Turning Example 9.8 into a statistical problem, we assume that the data form a random sample from N (µ, σ 2 ). We are interested in testing the hypotheses: H0 : σ 2 = σ02 Let S 2 = n P vs. H1 : σ 2 > σ02 . (Xi − X̄)2 /(n − 1), then (n − 1)S 2 /σ 2 ∼ χ2n−1 . Under H0 we have: i=1 2 (n − 1)S T = = σ02 n P (Xi − X̄)2 i=1 ∼ χ2n−1 . σ02 Since we will reject H0 against an alternative hypothesis σ 2 > σ02 , we should reject H0 for large values of T . H0 is rejected if t > χ2α, n−1 , where χ2α, n−1 denotes the top 100αth percentile of the χ2n−1 distribution, i.e. we have: P (T ≥ χ2α, n−1 ) = α. For any σ 2 > σ02 , the power of the test at σ is: β(σ) = Pσ (H0 is rejected) = Pσ (T > χ2α, n−1 ) (n − 1)S 2 2 = Pσ > χα, n−1 σ02 (n − 1)S 2 σ02 2 = Pσ > 2 × χα, n−1 σ2 σ which is greater than α, as σ02 /σ 2 < 1, where (n − 1)S 2 /σ 2 ∼ χ2n−1 when σ 2 is the true variance, instead of σ02 . Note that here 1 − β(σ) is the probability of a Type II error. Suppose we choose α = 0.05. For n = 25, χ2α, n−1 = χ20.05, 24 = 36.415. With the given sample, s2 = 0.8088 and σ02 = 1, t = 24 × 0.8088 = 19.41 < χ20.05, 24 . Hence we do not reject H0 at the 5% significance level. There is no significant evidence from the data against the company’s claim that the variance is not beyond 1. 
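The chi-squared calculation for Example 9.8 can be reproduced directly from the 25 measurements. The Python sketch below is an added illustration (not part of the guide); it computes the sample variance, the test statistic, the 5% critical value and the p-value.

# Illustrative check (not from the subject guide) of the variance test in Example 9.8.
import numpy as np
from scipy.stats import chi2

x = np.array([1000.3, 1001.3, 999.5, 999.7, 999.3, 999.8, 998.3, 1000.6, 999.7,
              999.8, 1001.0, 999.4, 999.5, 998.5, 1000.7, 999.6, 999.8, 1000.0,
              998.2, 1000.1, 998.1, 1000.7, 999.8, 1001.3, 1000.7])
n, sigma0_sq = len(x), 1.0

s_sq = x.var(ddof=1)                      # sample variance, approx. 0.8088
t = (n - 1) * s_sq / sigma0_sq            # test statistic, approx. 19.41
crit = chi2.ppf(0.95, df=n - 1)           # upper 5% point of chi-squared(24), 36.415
p_value = chi2.sf(t, df=n - 1)            # approx. 0.73

print(t, crit, p_value)                   # t < crit, so H0 is not rejected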
With σ02 = 1, the power function is: β(σ) = P χ20.05, 24 (n − 1)S 2 > σ2 σ2 =P (n − 1)S 2 36.415 > 2 σ σ2 where (n − 1)S 2 /σ 2 ∼ χ224 . For any given values of σ 2 , we may compute β(σ). We list some specific values next. σ2 χ20.05, 24 /σ 2 β(σ) Approximate β(σ) 1 36.415 0.05 0.05 1.5 24.277 0.446 0.40 2 18.208 0.793 0.80 3 12.138 0.978 0.975 4 9.104 0.997 0.995 289 9. Hypothesis testing Clearly, β(σ) % as σ 2 %. Intuitively, it is easier to reject H0 : σ 2 = 1 if the true population, which generates the data, has a larger variance σ 2 . Due to the sparsity of the available χ2 tables, we may only obtain some approximate values for β(σ) – see the entries in the last row in the above table. The more accurate values of β(σ) were calculated using a computer. Some remarks are the following. i. The significance level is selected subjectively by the statistician. To make the conclusion more convincing in the above example, we may use α = 0.1 instead. As χ20.1, 24 = 33.196, H0 is not rejected at the 10% significance level. In fact the p-value is: PH0 (T ≥ 19.41) = 0.73 where T ∼ χ224 . ii. As σ 2 increases, the power function β(σ) also increases. iii. For H1 : σ 2 6= σ02 , we should reject H0 if: t ≤ χ21−α/2, n−1 or t ≥ χ2α/2, n−1 where χ2α, k denotes the top 100αth percentile of the χ2k distribution. Activity 9.14 A machine is designed to fill bags of sugar. The weight of the bags is normally distributed with standard deviation σ. If the machine is correctly calibrated, σ should be no greater than 20 g. We collect a random sample of 18 bags and weigh them. The sample standard deviation is found to be equal to 32.48 g. Is there any evidence that the machine is incorrectly calibrated? Solution This is a hypothesis test for the variance of a normal population, so we will use the chi-squared distribution. Let: X1 , . . . , X18 ∼ N (µ, σ 2 ) be the weights of the bags in the sample. An appropriate test has hypotheses: H0 : σ 2 = 400 vs. H1 : σ 2 > 400. This is a one-sided test, because we are interested in detecting an increase in variance. We compute the value of the test statistic: t= (18 − 1) × (32.48)2 (n − 1)s2 = = 44.385. σ02 (20)2 At the 5% significance level, the upper-tail value of the chi-squared distribution on ν = 18 − 1 degrees of freedom is χ20.05, 17 = 27.587. Our test statistic exceeds this value, so we reject the null hypothesis. We now move to the 1% significance level. The upper-tail value is χ20.01, 17 = 33.409, so we reject H0 again. We conclude that there is very strong evidence that the machine is incorrectly calibrated. 290 9.10. Summary: tests for µ and σ 2 in N (µ, σ 2 ) Activity 9.15 {X1 , . . . , X21 } represents a random sample of size 21 from a normal population with mean µ and variance σ 2 . (a) Construct a test procedure with a 5% significance level to test the null hypothesis that σ 2 = 8 against the alternative that σ 2 > 8. (b) Evaluate the power of the test for the values of σ 2 given below. σ2 = 8.84 10.04 10.55 11.03 12.99 15.45 17.24 Solution (a) We test: H0 : σ 2 = 8 vs. H1 : σ 2 > 8. The test statistic, under H0 , is: T = (n − 1)S 2 20 × S 2 = ∼ χ220 . σ02 8 With a 5% significance level, we reject the null hypothesis if: t ≥ 31.410 since χ20.05, 20 = 31.410. 
(b) To evaluate the power, we need the probability of rejecting H0 (which happens if t ≥ 31.410) conditional on the actual value of σ 2 , that is: 8 8 2 P (T ≥ 31.410 | σ = k) = P T × ≥ 31.410 × k k where k is the true value of σ 2 , noting that: T× σ2 = k 31.410 × 8/k β(σ 2 ) 9.10 8.84 28.4 0.10 10.04 25.0 0.20 8 ∼ χ220 . k 10.55 23.8 0.25 11.03 22.8 0.30 12.99 19.3 0.50 15.45 16.3 0.70 17.24 14.6 0.80 Summary: tests for µ and σ 2 in N (µ, σ 2) In the below table, X̄ = n P Xi /n, S 2 = i=1 n P (Xi − X̄)2 /(n − 1), and {X1 , . . . , Xn } is a i=1 random sample from N (µ, σ 2 ). 291 9. Hypothesis testing Null hypothesis, H0 µ = µ0 σ 2 = σ02 X̄−µ √0 σ/ n X̄−µ √0 S/ n (n−1)S 2 σ02 N (0, 1) tn−1 χ2n−1 µ = µ0 (σ 2 known) Test statistic, T Distribution of T under H0 9.11 Comparing two normal means with paired observations Suppose that the observations are paired: (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) 2 ), and Yi ∼ N (µY , σY2 ). where all Xi s and Yi s are independent, Xi ∼ N (µX , σX We are interested in testing the hypothesis: H0 : µX = µY . (9.1) Example 9.9 The following are some practical examples. Do husbands make more money than wives? Is the increased marketing budget improving sales? Are customers willing to pay more for the new product than the old one? Does TV advertisement A have higher average effectiveness than advertisement B? Will promotion method A generate higher sales than method B? Observations are paired together for good reasons: husband–wife, before–after, A-vs.-B (from the same subject). Let Zi = Xi − Yi , for i = 1, . . . , n, then {Z1 , . . . , Zn } is a random sample from the population N (µ, σ 2 ), where: µ = µX − µY 2 and σ 2 = σX + σY2 . The hypothesis (9.1) can also be expressed as: H0 : µ = 0. 292 9.12. Comparing two normal means √ Therefore, we should use the test statistic T = nZ̄/S, where Z̄ and S 2 denote, respectively, the sample mean and the sample variance of {Z1 , . . . , Zn }. At the 100α% significance level, for α ∈ (0, 1), we reject the hypothesis µX = µY when: |t| > tα/2, n−1 , if the alternative is H1 : µX 6= µY t > tα, n−1 , if the alternative is H1 : µX > µY t < −tα, n−1 , if the alternative is H1 : µX < µY where P (T > tα, n−1 ) = α, for T ∼ tn−1 . 9.11.1 Power functions of the test Consider the case of testing H0 : µX = µY vs. H1 : µX > µY only. For µ = µX − µY > 0, we have: β(µ) = Pµ (H0 is rejected) = Pµ (T > tα, n−1 ) √ nZ̄ = Pµ > tα, n−1 S √ √ n(Z̄ − µ) nµ > tα, n−1 − = Pµ S S where √ n(Z̄ − µ)/S ∼ tn−1 under the distribution represented by Pµ . Note that for µ > 0, β(µ) > α. Furthermore, β(µ) increases as µ increases. 9.12 Comparing two normal means Let {X1 , . . . , Xn } and {Y1 , . . . , Ym } be two independent random samples drawn from, 2 ) and N (µY , σY2 ). We seek to test hypotheses on µX − µY . respectively, N (µX , σX We cannot pair the two samples together, because of the different sample sizes n and m. n m P P Let the sample means be X̄ = Xi /n and Ȳ = Yi /m, and the sample variances be: i=1 i=1 n 2 SX 1 X = (Xi − X̄)2 n − 1 i=1 m and SY2 1 X = (Yi − Ȳ )2 . m − 1 i=1 Some remarks are the following. 2 X̄, Ȳ , SX and SY2 are independent. 2 2 2 X̄ ∼ N (µX , σX /n) and (n − 1)SX /σX ∼ χ2n−1 . Ȳ ∼ N (µY , σY2 /m) and (m − 1)SY2 /σY2 ∼ χ2m−1 . 293 9. Hypothesis testing 2 2 Hence X̄ − Ȳ ∼ N (µX − µY , σX /n + σY2 /m). If σX = σY2 , then: X̄ − Ȳ − (µX − µY ) p 2 2 ((n − 1)SX /σX + (m − 1)SY2 /σY2 ) /(n + m − 2) s = 9.12.1 p 2 σX /n + σY2 /m n+m−2 X̄ − Ȳ − (µX − µY ) ∼ tn+m−2 . 
×p 2 1/n + 1/m (n − 1)SX + (m − 1)SY2 2 Tests on µX − µY with known σX and σY2 Suppose we are interested in testing: H0 : µX = µY vs. H1 : µX 6= µY . Note that: X̄ − Ȳ − (µX − µY ) p ∼ N (0, 1). 2 σX /n + σY2 /m Under H0 , µX − µY = 0, so we have: T =p X̄ − Ȳ 2 σX /n + σY2 /m ∼ N (0, 1). At the 100α% significance level, for α ∈ (0, 1), we reject H0 if |t| > zα/2 , where P (Z > zα/2 ) = α/2, for Z ∼ N (0, 1). A 100(1 − α)% confidence interval for µX − µY is: X̄ − Ȳ ± zα/2 × q 2 /n + σY2 /m. σX Activity 9.16 Two random samples {X1 , . . . , Xn } and {Y1 , . . . , Ym } from two 2 normally distributed populations with variances of σX = 41 and σY2 = 15, respectively, produced the following summary statistics: x̄ = 63, n = 50; ȳ = 60, m = 45. (a) At the 5% significance level, test if the two population means are the same. Find a 95% confidence interval for the difference between the two means. 2 (b) Repeat (a), but now with σX = 85 and σY2 = 42. Comment on the impact of increasing the variances. (c) Repeat (a), but now with the sample sizes n = 20 and m = 14 (i.e. using the original variances). Comment on the impact of decreasing the sample sizes. (d) Repeat (a), but now with x̄ = 61.5 (i.e. using the original variances and sample sizes), and comment. 294 9.12. Comparing two normal means Solution (a) We test H0 : µX = µY vs. H1 : µX 6= µY . Under H0 , the test statistic is: X̄ − Ȳ T =p 2 ∼ N (0, 1). σX /n + σY2 /m At the 5% significance level we reject H0 if |t| > z0.025 = 1.96. With the given data, t = 2.79. Hence we reject H0 (the p-value is 2 × P (Z ≥ 2.79) = 0.00528 < 0.05 = α). The 95% confidence interval for µX − µY obtained from the data is: r r 2 σX σY2 41 15 + = 3 ± 1.96 × + = 3 ± 2.105 ⇒ (0.895, 5.105). x̄ − ȳ ± 1.96 × n m 50 45 2 (b) With σX = 85 and σY2 = 42, now t = 1.85. So, since 1.85 < 1.96, we cannot reject H0 at the 5% significance level (the p-value is 2 × P (Z ≥ 1.85) = 0.0644 > 0.05 = α). The confidence interval is 3 ± 3.181 = (−0.181, 6.181) which is much wider and contains 0 – the hypothesised valued under H0 . Comparing with the results in (a) above, the statistical inference become less conclusive. This is due to the increase in the variances of the populations: as the ‘randomness’ increases, we are less certain about the parameters with the same amount of information. This also indicates that it is not enough to look only at the sample means, even if we are only concerned with the population means. (c) With n = 20 and m = 14, now t = 1.70. Therefore, since 1.70 < 1.96, we cannot reject H0 at the 5% significance level (the p-value is 2 × P (Z ≥ 1.70) = 0.0892 > 0.05 = α). The confidence interval is 3 ± 3.463 = (−0.463, 6.463) which is much wider than that obtained in (a), and contains 0 as well. This indicates that the difference of 3 units between the sample means is significant for the sample sizes (50, 45), but is not significant for the sample sizes (20, 14). (d) With x̄ = 61.5, now t = 1.40. Again, since 1.40 < 1.96, we cannot reject H0 at the 5% significance level (the p-value is 2 × P (Z ≥ 1.40) = 0.1616 > 0.05 = α). The confidence interval is 1.5 ± 2.105 = (−0.605, 3.605). Comparing with (a), the difference between the samples means is not significant enough to reject H0 , although everything else is unchanged. Activity 9.17 Suppose that we have two independent samples from normal populations with known variances. We want to test the H0 that the two population means are equal against the alternative that they are different. 
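The test statistic and confidence interval of Section 9.12.1 depend only on the two sample means, the known variances and the sample sizes, so they are simple to script. The following Python sketch is an added illustration; the numbers passed in the final line are hypothetical values chosen for demonstration, not data from the guide.

# Illustrative sketch (not from the subject guide): two-sample z test and
# confidence interval for mu_X - mu_Y when both variances are known.
import math
from scipy.stats import norm

def two_sample_z(xbar, ybar, var_x, var_y, n, m, alpha=0.05):
    se = math.sqrt(var_x / n + var_y / m)
    t = (xbar - ybar) / se                     # N(0, 1) under H0: mu_X = mu_Y
    p_value = 2 * norm.sf(abs(t))              # two-sided p-value
    z = norm.ppf(1 - alpha / 2)
    ci = (xbar - ybar - z * se, xbar - ybar + z * se)
    return t, p_value, ci

# Hypothetical numbers, for illustration only.
print(two_sample_z(xbar=10.2, ybar=9.5, var_x=4.0, var_y=3.0, n=40, m=35))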
We could use each sample by itself to write down 95% confidence intervals and reject H0 if these intervals did not overlap. What would be the significance level of this test? 295 9. Hypothesis testing Solution Let us assume H0 : µX = µY is true, then the two 95% confidence intervals do not overlap if and only if: σX σY X̄ − 1.96 × √ ≥ Ȳ + 1.96 × √ n m σY σX or Ȳ − 1.96 × √ ≥ X̄ + 1.96 × √ . m n So we want the probability: σX σY P |X̄ − Ȳ | ≥ 1.96 × √ + √ n m which is: P √ ! √ X̄ − Ȳ σX / n + σY / m p ≥ 1.96 × p 2 . 2 2 σX /n + σY /m σX /n + σY2 /m So we have: P √ ! √ σX / n + σY / m |Z| ≥ 1.96 × p 2 σX /n + σY2 /m where Z ∼ N (0, 1). This does not reduce in general, but if we assume n = m and 2 σX = σY2 , then it reduces to: √ P (|Z| ≥ 1.96 × 2) = 0.0056. The significance level is about 0.6%, which is much smaller than the usual conventions of 5% and 1%. Putting variability into two confidence intervals makes them more likely to overlap than you might think, and so your chance of incorrectly rejecting the null hypothesis is smaller than you might expect! 9.12.2 2 Tests on µX − µY with σX = σY2 but unknown This time we consider the following hypotheses: H0 : µX − µY = δ0 vs. H1 : µX − µY > δ0 where δ0 is a given constant. Under H0 , we have: s T = n+m−2 X̄ − Ȳ − δ0 ×p ∼ tn+m−2 . 2 1/n + 1/m (n − 1)SX + (m − 1)SY2 At the 100α% significance level, for α ∈ (0, 1), we reject H0 if t > tα, n+m−2 , where P (T > tα, n+m−2 ) = α, for T ∼ tn+m−2 . A 100(1 − α)% confidence interval for µX − µY is: r X̄ − Ȳ ± tα/2, n+m−2 × 296 1/n + 1/m 2 ((n − 1)SX + (m − 1)SY2 ). n+m−2 9.12. Comparing two normal means Example 9.10 Two types of razor, A and B, were compared using 100 men in an experiment. Each man shaved one side, chosen at random, of his face using one razor and the other side using the other razor. The times taken to shave, Xi and Yi minutes, for i = 1, . . . , 100, corresponding to the razors A and B, respectively, were recorded, yielding: x̄ = 2.84, s2X = 0.48, ȳ = 3.02 and s2Y = 0.42. Also available is the sample variance of the differences, Zi = Xi − Yi , which is s2Z = 0.6. Test, at the 5% significance level, if the two razors lead to different mean shaving times. State clearly any assumptions used in the test. Assumption: suppose {X1 , . . . , Xn } and {Y1 , . . . , Yn } are two independent random 2 samples from, respectively, N (µX , σX ) and N (µY , σY2 ). The problem requires us to test the following hypotheses: H0 : µX = µY vs. H1 : µX 6= µY . There are three approaches – a paired comparison method and two two-sample comparisons based on different assumptions. Since the data are recorded in pairs, the paired comparison is most relevant and effective to analyse these data. Method I: paired comparison 2 + σY2 . We want We have Zi = Xi − Yi ∼ N (µZ , σZ2 ) with µZ = µX − µY and σZ2 = σX to test: H0 : µZ = 0 vs. H1 : µZ 6= 0. This is the standard one-sample t test, where: √ n(Z̄ − µZ ) X̄ − Ȳ − (µX − µY ) √ ∼ tn−1 . = SZ SZ / n H0 is rejected if |t| > t0.025, 99 = 1.98, where under H0 we have: √ √ nZ̄ 100(X̄ − Ȳ ) T = = . SZ SZ √ With the given data, we observe t = 10(2.84 − 3.02)/ 0.6 = −2.327. Hence we reject the hypothesis that the two razors lead to the same mean shaving time at the 5% significance level. A 95% confidence interval for µX − µY is: √ x̄ − ȳ ± t0.025, n−1 × sZ / n = −0.18 ± 0.154 ⇒ (−0.334, −0.026). Some remarks are the following. i. Zero is not in the confidence interval for µX − µY . ii. t0.025, 99 = 1.98 is pretty close to z0.025 = 1.96. 297 9. 
Hypothesis testing Method II: two-sample comparison with known variances 2 A further assumption is that σX = 0.48 and σY2 = 0.42. 2 Note X̄ − Ȳ ∼ N (µX − µY , σX /100 + σY2 /100), i.e. we have: X̄ − Ȳ − (µX − µY ) p ∼ N (0, 1). 2 σX /100 + σY2 /100 Hence we reject H0 when |t| > 1.96 at the 5% significance level, where: X̄ − Ȳ T =p 2 . σX /100 + σY2 /100 √ For the given data, t = −0.18/ 0.009 = −1.9. Hence we cannot reject H0 . A 95% confidence interval for µX − µY is: q 2 x̄ − ȳ ± 1.96 × σX /100 + σY2 /100 = −0.18 ± 0.186 ⇒ (−0.366, 0.006). The value 0 is now contained in the confidence interval. Method III: two-sample comparison with equal but unknown variance 2 = σY2 = σ 2 . A different additional assumption is that σX 2 Now X̄ − Ȳ ∼ N (µX − µY , σ 2 /50) and 99(SX + SY2 )/σ 2 ∼ χ2198 . Hence: √ 50 X̄ − Ȳ − (µX − µY ) X̄ − Ȳ − (µX − µY ) p p = 10 × ∼ t198 . 2 2 2 99(SX + SY )/198 SX + SY2 Hence we reject H0 if |t| > t0.025, 198 = 1.97 where: 10(X̄ − Ȳ ) T =p 2 . SX + SY2 For the given data, t = −1.897. Hence we cannot reject H0 at the 5% significance level. A 95% confidence interval for µX − µY is: q x̄ − ȳ ± t0.025, 198 × (s2X + s2Y )/100 = −0.18 ± 0.1870 ⇒ (−0.367, 0.007) which contains 0. Some remarks are the following. i. Different methods lead to different but not contradictory conclusions, as remember: not reject 6= accept. ii. The paired comparison is intuitively the most relevant, requires the least assumptions, and leads to the most conclusive inference (i.e. rejection of H0 ). It also produces the narrowest confidence interval. 298 9.12. Comparing two normal means iii. Methods II and III ignore the pairing of the data. Consequently, the inference is less conclusive and less accurate. iv. A general observation is that H0 is rejected at the 100α% significance level if and only if the value hypothesised by H0 is not within the corresponding 100(1 − α)% confidence interval. v. It is much more challenging to compare two normal means with unknown and unequal variances. This will not be discussed in this course. Activity 9.18 The weights (in grammes) of a group of five-week-old chickens reared on a high-protein diet are 336, 421, 310, 446, 390 and 434. The weights of a second group of chickens similarly reared, except for their low-protein diet, are 224, 275, 393, 282 and 365. Is there evidence that the additional protein has increased the average weight of the chickens? Assume normality. Solution Assuming normally-distributed populations with possibly different means, but the same variance, we test: H0 : µX = µY vs. H1 : µX > µY . The sample means and standard deviations are x̄ = 389.5, ȳ = 307.8, sX = 55.40 and sY = 69.45. The test statistic and its distribution under H0 are: s n+m−2 X̄ − Ȳ ∼ tn+m−2 ×p T = 2 1/n + 1/m (n − 1)SX + (m − 1)SY2 and we obtain, for the given data, t = 2.175 > 1.833 = t0.05, 9 hence we reject H0 that the mean weights are equal and conclude that the mean weight for the high-protein diet is greater at the 5% significance level. Activity 9.19 Hard question! (a) Two independent random samples, of n1 and n2 observations, are drawn from normal distributions with the same variance σ 2 . Let S12 and S22 be the sample variances of the first and the second samples, respectively. Show that: σ b2 = (n1 − 1)S12 + (n2 − 1)S22 n1 + n2 − 2 is an unbiased estimator of σ 2 . Hint: Remember the expectation of a chi-squared variable is its degrees of freedom. 
(b) Two makes of car safety belts, A and B, have breaking strengths which are normally distributed with the same variance. A random sample of 140 belts of make A and a random sample of 220 belts of make B were tested. The sample 299 9. Hypothesis testing P means, and the sums of squares about the means (i.e. i (xi − x̄)2 ), of the breaking strengths (in lbf units) were (2685, 19000) for make A, and (2680, 34000) for make B, respectively. Is there significant evidence to support the hypothesis that belts of make A are stronger on average than belts of make B? Assume a 1% significance level. Solution (a) We first note that (ni − 1)Si2 /σ 2 ∼ χ2ni −1 . By the definition of χ2 distributions, we have: E (ni − 1)Si2 = (ni − 1)σ 2 for i = 1, 2. Hence: E σ b 2 =E (n1 − 1)S12 + (n2 − 1)S22 n1 + n2 − 2 = 1 E((n1 − 1)S12 ) + E((n2 − 1)S22 ) n1 + n2 − 2 = (n1 − 1)σ 2 + (n2 − 1)σ 2 n1 + n2 − 2 = σ2. (b) Denote x̄ = 2685 and ȳ = 2680, then 139s2X = 19000 and 219s2Y = 34000. We test H0 : µX = µY vs. H1 : µX > µY . Under H0 we have: 1 1 2 X̄ − Ȳ ∼ N 0, σ + = N (0, 0.01169σ 2 ) 140 220 and: Hence: 2 + 219SY2 139SX ∼ χ2358 . σ2 √ (X̄ − Ȳ )/ 0.01169 T =p ∼ t358 2 (139SX + 219SY2 )/358 under H0 . We reject H0 if t > t0.01, 358 ≈ 2.326. Since we observe t = 3.801 we reject H0 , i.e. there is significant evidence to suggest that belts of make A are stronger on average than belts of make B. 9.13 Tests for correlation coefficients We now consider a test for the correlation coefficient of two random variables X and Y where: ρ = Corr(X, Y ) = 300 Cov(X, Y ) 1/2 [Var(X) Var(Y )] = E [(X − E(X))(Y − E(Y ))] . [E ((X − E(X))2 ) E ((Y − E(Y ))2 )]1/2 9.13. Tests for correlation coefficients Some remarks are the following. i. ρ ∈ [−1, 1], and |ρ| = 1 if and only if Y = aX + b for some constants a and b. Furthermore, a > 0 if ρ = 1, and a < 0 if ρ = −1. ii. ρ measures only the linear relationship between X and Y . When ρ = 0, X and Y are linearly independent, that is uncorrelated. iii. If X and Y are independent (in the sense that the joint pdf is the product of the two marginal pdfs), ρ = 0. However, if ρ = 0, X and Y are not necessarily independent, as there may exist some non-linear relationship between X and Y . iv. If ρ > 0, X and Y tend to increase (or decrease) together. If ρ < 0, X and Y tend to move in opposite directions. Sample correlation coefficient Given paired observations (Xi , Yi ), for i = 1, . . . , n, a natural estimator of ρ is defined as: n P (Xi − X̄)(Yi − Ȳ ) i=1 ρb = !1/2 n n P P (Xi − X̄)2 (Yj − Ȳ )2 i=1 where X̄ = n P i=1 Xi /n and Ȳ = n P j=1 Yi /n. i=1 Example 9.11 The measurements of height, X, and weight, Y , are taken from 69 students in a class. ρ should be positive, intuitively! In Figure 9.5, the vertical line at x̄ and the horizontal line at ȳ divide the 69 points into 4 quadrants: northeast (NE), southwest (SW), northwest (NW) and southeast (SE). Most points are in either NE or SW. In the NE quadrant, xi > x̄ and yi > ȳ, hence: X (xi − x̄)(yi − ȳ) > 0. i∈NE In the SW quadrant, xi < x̄ and yi < ȳ, hence: X (xi − x̄)(yi − ȳ) > 0. i∈SW In the NW quadrant, xi < x̄ and yi > ȳ, hence: X (xi − x̄)(yi − ȳ) < 0. i∈NW 301 9. Hypothesis testing In the SE quadrant, xi > x̄ and yi < ȳ, hence: X (xi − x̄)(yi − ȳ) < 0. i∈SE Overall, 69 P (xi − x̄)(yi − ȳ) > 0 and hence ρb > 0. i=1 Figure 9.5: Scatterplot of height and weight in Example 9.11. Figure 9.6 shows examples of different sample correlation coefficients using scatterplots of bivariate observations. 
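The sample correlation coefficient ρ̂ defined above can be computed either from the defining formula or with a library routine, and the two agree. The sketch below is an added illustration using a small made-up data set (not the height and weight data of Example 9.11, which are not reproduced in the guide).

# Illustrative sketch (not from the subject guide): the sample correlation
# coefficient computed from its definition and via numpy, on made-up data.
import numpy as np

x = np.array([1.2, 2.0, 2.8, 3.5, 4.1, 5.0])   # hypothetical paired observations
y = np.array([2.1, 2.9, 3.2, 4.8, 4.6, 6.1])

num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
rho_hat = num / den

print(rho_hat, np.corrcoef(x, y)[0, 1])        # the two values coincide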
9.13.1 Tests for correlation coefficients Let {(X1 , Y1 ), . . . , (Xn , Yn )} be a random sample from a two-dimensional normal distribution. Let ρ = Corr(Xi , Yi ). We are interested in testing: H0 : ρ = 0 vs. H1 : ρ 6= 0. It can be shown that under H0 the test statistic is: r n−2 ∼ tn−2 . T = ρb 1 − ρb2 Hence we reject H0 at the 100α% significance level, for α ∈ (0, 1), if |t| > tα/2, n−2 , where: α P (T > tα/2, n−2 ) = . 2 Some remarks are the following. p i. |T | = |b ρ| (n − 2)/(1 − ρb2 ) increases as |b ρ| increases. 302 9.13. Tests for correlation coefficients Figure 9.6: Scatterplots of bivariate observations with different sample correlation coefficients. ii. For H1 : ρ > 0, we reject H0 if t > tα, n−2 . iii. Two random variables X and Y are jointly normal if aX + bY is normal for any constants a and b. iv. For jointly normal random variables X and Y , if Corr(X, Y ) = 0, X and Y are also independent. Activity 9.20 The following table shows the number of salespeople employed by a company and the corresponding value of sales (in £000s): Number of salespeople (x) Sales (y) Number of salespeople (x) Sales (y) 210 206 220 210 209 200 233 218 219 204 200 201 225 215 215 212 232 222 205 204 221 216 227 212 Compute the sample correlation coefficient for these data and carry out a formal test for a (linear) relationship between the number of salespeople and sales. Note that: X X X xi = 2,616, yi = 2,520, x2i = 571,500, X X yi2 = 529,746 and xi yi = 550,069. 303 9. Hypothesis testing Solution We test: H0 : ρ = 0 vs. H1 : ρ > 0. The corresponding test statistic and its distribution under H0 are: √ ρb n − 2 T = p ∼ tn−2 . 1 − ρb2 We find ρb = 0.8716 and obtain t = 5.62 > 2.764 = t0.01, 10 and so we reject H0 at the 1% significance level. Since the test is highly significant, there is overwhelming evidence of a (linear) relationship between the number of salespeople and the value of sales. Activity 9.21 A random sample {(Xi , Yi ), 1 ≤ i ≤ n} from a two-dimensional normal distribution yields: n P x̄ = 6.31, ȳ = 3.56, sX = 5.31, sY = 12.92 and xi y i i=1 = 14.78. n Let ρ = Corr(X, Y ). (a) Test the null hypothesis H0 : ρ = 0 against the alternative hypothesis H1 : ρ < 0 at the 5% significance level with the sample size n = 10. (b) Repeat (a) for n = 500. Solution We have: n P ρb = s i=1 n P n P (Xi − X̄)(Yi − Ȳ ) n P (Xi − X̄)2 (Yj − Ȳ )2 i=1 = Xi Yi − nX̄ Ȳ i=1 (n − 1)SX SY . j=1 Under H0 : ρ = 0, the test statistic is: r n−2 T = ρb ∼ tn−2 . 1 − ρb2 Hence we reject H0 if t < −t0.05, n−2 . (a) For n = 10, −t0.05, n−2 = −1.860, ρb = −0.124 and t = −0.355. Hence we cannot reject H0 , so there is no evidence that X and Y are correlated. (b) For n = 500, −t0.05, n−2 ≈ −1.645, ρb = −0.112 and t = −2.52. Hence we reject H0 , so there is significant evidence that X and Y are correlated. Note that the sample correlation coefficient ρb = −0.124 is not significantly different from 0 when the sample size is 10. However, ρb = −0.112 is significantly different from 0 when the sample size is 500! 304 9.14. Tests for the ratio of two normal variances 9.14 Tests for the ratio of two normal variances Let {X1 , . . . , Xn } and {Y1 , . . . , Ym } be two independent random samples from, 2 respectively, N (µX , σX ) and N (µY , σY2 ). We are interested in testing: H0 : σY2 =k 2 σX vs. H1 : σY2 6= k 2 σX where k > 0 is a given constant. The case with k = 1 is of particular interest since this tests for equal variances. 
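Returning briefly to the correlation test of Section 9.13.1, the sample-size effect in Activity 9.21 can be reproduced numerically. The Python sketch below is an added illustration (not part of the guide); it recomputes ρ̂, the test statistic and the lower-tail critical value for n = 10 and n = 500 from the given summary statistics.

# Illustrative check (not from the subject guide) of Activity 9.21: the same
# summary statistics give a non-significant test for n = 10 but a significant
# test for n = 500.
import math
from scipy.stats import t as t_dist

xbar, ybar, s_x, s_y, mean_xy = 6.31, 3.56, 5.31, 12.92, 14.78

for n in (10, 500):
    rho_hat = (n * mean_xy - n * xbar * ybar) / ((n - 1) * s_x * s_y)
    t_stat = rho_hat * math.sqrt((n - 2) / (1 - rho_hat**2))
    crit = t_dist.ppf(0.05, df=n - 2)          # lower 5% point, for H1: rho < 0
    print(n, round(rho_hat, 3), round(t_stat, 2), round(crit, 3))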
n m P P Let the sample means be X̄ = Xi /n and Ȳ = Yi /m, and the sample variances be: i=1 2 SX We have (n − 1 = n−1 2 2 1)SX /σX n X i=1 m 2 (Xi − X̄) and i=1 SY2 1 X = (Yi − Ȳ )2 . m − 1 i=1 ∼ χ2n−1 and (m − 1)SY2 /σY2 ∼ χ2m−1 . Therefore: 2 2 2 σY2 SX SX /σX × = ∼ Fn−1, m−1 . 2 σX SY2 SY2 /σY2 2 Under H0 , T = kSX /SY2 ∼ Fn−1, m−1 . Hence H0 is rejected if: t < F1−α/2, n−1, m−1 or t > Fα/2, n−1, m−1 where Fα, p, k denotes the top 100αth percentile of the Fp, k distribution, that is: P (T > Fα, p, k ) = α available from Table 12 of the New Cambridge Statistical Tables. Since: P F1−α/2, n−1, m−1 S2 σ2 ≤ Y2 × X2 ≤ Fα/2, n−1, m−1 σX SY =1−α 2 a 100(1 − α)% confidence interval for σY2 /σX is: SY2 SY2 F1−α/2, n−1, m−1 × 2 , Fα/2, n−1, m−1 × 2 . SX SX Example 9.12 Here we practise use of Table 12 of the New Cambridge Statistical Tables to obtain critical values for the F distribution. Table 12 can be used to find the top 100αth percentile of the Fν1 , ν2 distribution for α = 0.10, 0.05, 0.025, 0.01, 0.005 and 0.001 using Tables 12(a) to 12(f), respectively. For example, for ν1 = 3 and ν2 = 5, then: P (F3, 5 > 3.619) = 0.10 (using Table 12(a)) P (F3, 5 > 5.409) = 0.05 (using Table 12(b)) P (F3, 5 > 7.764) = 0.025 (using Table 12(c)) P (F3, 5 > 12.060) = 0.01 (using Table 12(d)). 305 9. Hypothesis testing Example 9.13 The daily returns (in percentages) of two assets, X and Y , are recorded over a period of 100 trading days, yielding average daily returns of x̄ = 3.21 and ȳ = 1.41. Also available from the data are the following quantities: 100 X x2i = 1989.24, i=1 100 X yi2 100 X = 932.78 and i=1 xi yi = 661.11. i=1 Assume the data are normally distributed. Are the two assets positively correlated with each other, and is asset X riskier than asset Y ? With n = 100 we have: n 1 1 X (xi − x̄)2 = s2X = n − 1 i=1 n−1 and: n s2Y 1 X 1 (yi − ȳ)2 = = n − 1 i=1 n−1 Therefore: n P ρb = n P (xi − x̄)(yi − ȳ) i=1 = (n − 1) sX sY n X ! x2i − nx̄2 = 9.69 i=1 n X ! yi2 − nȳ 2 = 7.41. i=1 xi yi − n x̄ ȳ i=1 (n − 1) sX sY = 0.249. First we test: H0 : ρ = 0 vs. H1 : ρ > 0. Under H0 , the test statistic is: r T = ρb n−2 ∼ t98 . 1 − ρb2 Setting α = 0.01, we reject H0 if t > t0.01, 98 = 2.37. With the given data, t = 2.545 hence we reject the null hypothesis of ρ = 0 at the 1% significance level. We conclude that there is highly significant evidence indicating that the two assets are positively correlated. We measure the risks in terms of variances, and test: 2 H0 : σX = σY2 2 vs. H1 : σX > σY2 . 2 Under H0 , T = SX /SY2 ∼ F99, 99 . Hence we reject H0 if t > F0.05, 99, 99 = 1.39 at the 5% significance level, using Table 12(b) of the New Cambridge Statistical Tables. With the given data, t = 9.69/7.41 = 1.308. Therefore, we cannot reject H0 . As the test is not significant at the 5% significance level, we may not conclude that the variances of the two assets are significantly different. Therefore, there is no significant evidence indicating that asset X is riskier than asset Y . Strictly speaking, the test is valid only if the two samples are independent of each other, which is not the case here. 306 9.14. Tests for the ratio of two normal variances Activity 9.22 Two independent samples from normal populations yield the following results: Sample 1 Sample 2 n=5 m=7 P 2 P (xi − x̄)2 = 4.8 (yi − ȳ) = 37.2 Test at the 5% signficance level whether the population variances are the same based on the above data. Solution We test: H0 : σ12 = σ22 vs. H1 : σ12 6= σ22 . 
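The summary-statistic calculations in the worked example on the two assets (Example 9.13) can be reproduced numerically. The Python sketch below is an added illustration (not part of the guide); it recovers the sample variances, the sample correlation and both test statistics, together with the critical values quoted in the example.

# Illustrative check (not from the subject guide) of Example 9.13.
import math
from scipy.stats import t as t_dist, f as f_dist

n = 100
xbar, ybar = 3.21, 1.41
sum_x2, sum_y2, sum_xy = 1989.24, 932.78, 661.11

s2_x = (sum_x2 - n * xbar**2) / (n - 1)                    # approx. 9.69
s2_y = (sum_y2 - n * ybar**2) / (n - 1)                    # approx. 7.41
rho_hat = (sum_xy - n * xbar * ybar) / ((n - 1) * math.sqrt(s2_x * s2_y))

t_stat = rho_hat * math.sqrt((n - 2) / (1 - rho_hat**2))   # approx. 2.545
t_crit = t_dist.ppf(0.99, df=n - 2)                        # approx. 2.37

f_stat = s2_x / s2_y                                       # approx. 1.31
f_crit = f_dist.ppf(0.95, dfn=n - 1, dfd=n - 1)            # approx. 1.39

print(round(t_stat, 3), round(t_crit, 2), round(f_stat, 2), round(f_crit, 2))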
Under H0 , the test statistic is: T = S12 ∼ Fn−1, m−1 = F4, 6 . S22 Critical values are F0.975, 4, 6 = 1/F0.025, 6, 4 = 1/9.20 = 0.11 and F0.025, 4, 6 = 6.23, using Table 12 of the New Cambridge Statistical Tables. The test statistic value is: t= 4.8/4 = 0.1935 37.2/6 and since 0.11 < 0.1935 < 6.23 we do not reject H0 , which means there is no evidence of a difference in the variances. Activity 9.23 Class A was taught using detailed PowerPoint slides. The marks in the final examination for a random sample of Class A students were: 74, 61, 67, 84, 41, 68, 57, 64, 46. Students in Class B were required to read textbooks and answer questions in class discussions. The marks in the final examination for a random sample of Class B students were: 48, 50, 42, 53, 81, 59, 64, 45. Assuming examination marks are normally distributed, can we infer that the variances of the marks differ between the two classes? Test at the 5% significance level. Solution We test H0 : σA2 = σB2 vs. H1 : σA2 6= σB2 . Under H0 we have: T = SA2 ∼ FnA −1, nB −1 . SB2 Hence H0 is rejected if either t ≤ F1−α/2, nA −1, nB −1 or t ≥ Fα/2, nA −1, nB −1 . 307 9. Hypothesis testing For the given data, nA = 9, s2A = 176.778, nB = 8 and s2B = 159.929. Setting α = 0.05, F0.975, 8, 7 = 0.221 and F0.025, 8, 7 = 4.90. Since: 0.221 < t = 1.105 < 4.90 we cannot reject H0 , i.e. there is no significant evidence to indicate that the variances of the marks in the two classes are different. Activity 9.24 After the machine in Activity 9.14 is calibrated, we collect a new sample of 21 bags. The sample standard deviation of their weights is 23.72 g. Based on this sample, can you conclude that the calibration has reduced the variance of the weights of the bags? Solution Let: Y1 , . . . , Y21 ∼ N (µY , σY2 ) 2 to denote the variance of be the weights of the bags in the new sample, and use σX the distribution of the previous sample, to avoid confusion. We want to test for a reduction in variance, so we set: H0 : 2 2 σX σX = 1 vs. H : > 1. 1 σY2 σY2 The value of the test statistic in this case is: (32.48)2 s2X = = 1.875. s2Y (23.72)2 If the null hypothesis is true, the test statistic will follow an F18−1, 21−1 = F17, 20 distribution. At the 5% significance level, the upper-tail critical value of the F17, 20 distribution is F0.05, 17, 20 = 2.17. Our test statistic does not exceed this value, so we cannot reject the null hypothesis. We move to the 10% significance level. The upper-tail critical value is F0.10, 17, 20 = 1.821, so we can now reject the null hypothesis (if only barely). We conclude that there is some evidence that the variance is reduced, but it is not very strong evidence. Notice the difference between the conclusions of these two tests. We have a much more powerful test when we compare our standard deviation of 32.48 g to a fixed standard deviation of 25 g, than when we compare it to an estimated standard deviation of 23.78 g, even though the values are similar. 9.15 Summary: tests for two normal distributions 2 Let (X1 , . . . , Xn ) ∼IID N (µX , σX ), (Y1 , . . . , Ym ) ∼IID N (µY , σY2 ), and ρ = Corr(X, Y ). 308 9.16. 
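In practice these tests are usually carried out with statistical software rather than by hand. The following is a minimal R sketch using the built-in functions t.test(), var.test() and cor.test(); for illustration it uses the examination marks from Activity 9.23, and each function reports the test statistic, degrees of freedom and p-value directly.

classA <- c(74, 61, 67, 84, 41, 68, 57, 64, 46)
classB <- c(48, 50, 42, 53, 81, 59, 64, 45)
t.test(classA, classB, var.equal = TRUE)   # two-sample t test assuming equal variances
var.test(classA, classB)                   # F test for the ratio of two variances;
                                           # the statistic is var(classA)/var(classB), about 1.11,
                                           # as found in Activity 9.23
# For paired data (x, y), cor.test(x, y) performs the correlation test of Section 9.13.1.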
A summary table of tests for two normal distributions is:

Null hypothesis, H0                  Test statistic, T                                                        Distribution of T under H0
µX − µY = δ (σX², σY² known)         (X̄ − Ȳ − δ) / √(σX²/n + σY²/m)                                           N(0, 1)
µX − µY = δ (σX² = σY² unknown)      √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − δ) / √((n − 1)SX² + (m − 1)SY²)    t_{n+m−2}
ρ = 0 (n = m)                        ρ̂ √((n − 2)/(1 − ρ̂²))                                                    t_{n−2}
σY²/σX² = k                          k SX²/SY²                                                                F_{n−1, m−1}

9.16 Overview of chapter

This chapter has discussed hypothesis tests for parameters of normal distributions – specifically means and variances. In each case an appropriate test statistic was constructed whose distribution under the null hypothesis was known. Concepts of hypothesis testing errors and power were also discussed, as well as how to test correlation coefficients.

9.17 Key terms and concepts

Alternative hypothesis      Critical value
Decision                    Null hypothesis
p-value                     Paired comparison
Power function              Significance level
t test                      Test statistic
Type I error                Type II error

9.18 Sample examination questions

Solutions can be found in Appendix C.

1. Suppose that one observation, i.e. n = 1, is taken from the geometric distribution:

   p(x; π) = (1 − π)^(x−1) π  for x = 1, 2, . . .  and  p(x; π) = 0  otherwise

   to test H0: π = 0.3 vs. H1: π > 0.3. The null hypothesis is rejected if x ≥ 4.
   (a) What is the probability that a Type II error will be committed when the true parameter value is π = 0.4?
   (b) What is the probability that a Type I error will be committed?
   (c) If x = 4, what is the p-value of the test?

2. Let X have a Poisson distribution with mean λ. We want to test the null hypothesis that λ = 1/2 against the alternative λ = 2. We reject the null hypothesis if and only if x > 1. Calculate the size and power of the test. You may use the approximate value e ≈ 2.718.

3. A random sample of size n = 10 is taken from N(µ, σ²). Consider the following hypothesis test:

   H0: σ² = 2.00  vs.  H1: σ² > 2.00

   to be conducted at the 1% significance level. Determine the power of the test for σ² = 2.00 and σ² = 2.56. (You may use the closest available values in the statistical tables provided.)

Chapter 10
Analysis of variance (ANOVA)

10.1 Synopsis of chapter

This chapter introduces analysis of variance (ANOVA), which is a widely-used technique for detecting differences between groups based on continuous dependent variables.

10.2 Learning outcomes

After completing this chapter, you should be able to:

explain the purpose of analysis of variance
restate and interpret the models for one-way and two-way analysis of variance
conduct small examples of one-way and two-way analysis of variance with a calculator, reporting the results in an ANOVA table
perform hypothesis tests and construct confidence intervals for one-way and two-way analysis of variance
explain how to interpret residuals from an analysis of variance.

10.3 Introduction

Analysis of variance (ANOVA) is a popular tool with an applicability and power which we can only start to appreciate in this course. The idea of analysis of variance is to investigate how variation in structured data can be split into pieces associated with components of that structure. We look only at one-way and two-way classifications, providing tests and confidence intervals which are widely used in practice.

10.4 Testing for equality of three population means

We begin with an illustrative example to test the hypothesis that three population means are equal.
Analysis of variance (ANOVA) Example 10.1 To assess the teaching quality of class teachers, a random sample of 6 examination marks was selected from each of three classes. The examination marks for each class are listed in the table below. Can we infer from these data that there is no significant difference in the examination marks among all three classes? Class 1 85 75 82 76 71 85 Class 2 71 75 73 74 69 82 Class 3 59 64 62 69 75 67 Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for j = 1, 2, 3. So we assume examination marks are normally distributed with the same variance in each class, but possibly different means. We need to test the hypothesis: H0 : µ1 = µ2 = µ3 . The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xij . We compute the column means first where the jth column mean is: X̄·j = X1j + X2j + · · · + Xnj j nj where nj is the sample size of group j (here nj = 6 for all j). This leads to x̄·1 = 79, x̄·2 = 74 and x̄·3 = 66. Transposing the table, we get: Class 1 Class 2 Class 3 1 85 71 59 Observation 2 3 4 5 75 82 76 71 75 73 74 69 64 62 69 75 6 85 82 67 Mean 79 74 66 Note that similar problems arise from other practical situations. For example: comparing the returns of three stocks comparing sales using three advertising strategies comparing the effectiveness of three medicines. If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to each other, i.e. all of them should be close to the overall sample mean, x̄, which is: x̄ = 312 x̄·1 + x̄·2 + x̄·3 79 + 74 + 66 = = 73 3 3 10.5. One-way analysis of variance i.e. the mean value of all 18 observations. So we wish to perform a hypothesis test based on the variation in the sample means such that the greater the variation, the more likely we are to reject H0 . One possible measure for the variation in the sample means X̄·j about the overall sample mean X̄, for j = 1, 2, 3, is: 3 X (X̄·j − X̄)2 . (10.1) j=1 However, (10.1) is not scale-invariant, so it would be difficult to judge whether the realised value is large enough to warrant rejection of H0 due to the magnitude being dependent on the units of measurement of the data. So we seek a scale-invariant test statistic. Just as we scaled the covariance between two random variables to give the scale-invariant correlation coefficient, we can similarly scale (10.1) to give the following possible test statistic: 3 P T = (X̄·j − X̄)2 j=1 sum of the three sample variances . Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which would mean that there is no variation at all between the sample means. In this case all the sample means would equal x̄.) It remains to determine the distribution of T under H0 . 10.5 One-way analysis of variance We now extend Example 10.1 to consider a general setting where there are k independent random samples available from k normal distributions N (µj , σ 2 ), for j = 1, . . . , k. (Example 10.1 corresponds to k = 3.) Denote by X1j , X2j , . . . , Xnj j the random sample with sample size nj from N (µj , σ 2 ), for j = 1, . . . , k. Our goal is to test H0 : µ1 = · · · = µk vs. H1 : not all µj s are the same. One-way analysis of variance (one-way ANOVA) involves a continuous dependent variable and one categorical independent variable (sometimes called a factor, or treatment), where the k different levels of the categorical variable are the k different groups. We now introduce statistics associated with one-way ANOVA. 313 10. 
Statistics associated with one-way ANOVA

The jth sample mean is:

X̄·j = (1/nj) Σ_{i=1}^{nj} Xij.

The overall sample mean is:

X̄ = (1/n) Σ_{j=1}^{k} Σ_{i=1}^{nj} Xij = (1/n) Σ_{j=1}^{k} nj X̄·j

where n = Σ_{j=1}^{k} nj is the total number of observations across all k groups.

The total variation is:

Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄)²

with n − 1 degrees of freedom.

The between-groups variation is:

B = Σ_{j=1}^{k} nj (X̄·j − X̄)²

with k − 1 degrees of freedom.

The within-groups variation is:

W = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²

with n − k = Σ_{j=1}^{k} (nj − 1) degrees of freedom.

The ANOVA decomposition is:

Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄)² = Σ_{j=1}^{k} nj (X̄·j − X̄)² + Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)².

We have already discussed the jth sample mean and overall sample mean. The total variation is a measure of the overall (total) variability in the data from all k groups about the overall sample mean. The ANOVA decomposition decomposes this into two components: between-groups variation (which is attributable to the factor level) and within-groups variation (which is attributable to the variation within each group and is assumed to be the same, σ², for each group).

Some remarks are the following.

i. B and W are also called, respectively, between-treatments variation and within-treatments variation. In fact W is effectively a residual (error) sum of squares, representing the variation which cannot be explained by the treatment or group factor.

ii. The ANOVA decomposition follows from the identity:

Σ_{i=1}^{m} (ai − b)² = Σ_{i=1}^{m} (ai − ā)² + m(ā − b)².

However, the actual derivation is not required for this course.

iii. The following are some useful formulae for manual computations.

• n = Σ_{j=1}^{k} nj.
• X̄·j = Σ_{i=1}^{nj} Xij / nj and X̄ = Σ_{j=1}^{k} nj X̄·j / n.
• Total variation = Total SS = B + W = Σ_{j=1}^{k} Σ_{i=1}^{nj} Xij² − nX̄².
• B = Σ_{j=1}^{k} nj X̄·j² − nX̄².
• Residual (Error) SS = W = Σ_{j=1}^{k} Σ_{i=1}^{nj} Xij² − Σ_{j=1}^{k} nj X̄·j² = Σ_{j=1}^{k} (nj − 1)Sj², where Sj² is the jth sample variance.

We now note, without proof, the following results.

i. B = Σ_{j=1}^{k} nj (X̄·j − X̄)² and W = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)² are independent of each other.

ii. W/σ² = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²/σ² ∼ χ²_{n−k}.

iii. Under H0: µ1 = · · · = µk, then B/σ² = Σ_{j=1}^{k} nj (X̄·j − X̄)²/σ² ∼ χ²_{k−1}.

In order to test H0: µ1 = · · · = µk, we define the following test statistic:

F = [Σ_{j=1}^{k} nj (X̄·j − X̄)²/(k − 1)] / [Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²/(n − k)] = [B/(k − 1)] / [W/(n − k)].

Under H0, F ∼ F_{k−1, n−k}. We reject H0 at the 100α% significance level if:

f > F_{α, k−1, n−k}

where F_{α, k−1, n−k} is the top 100αth percentile of the F_{k−1, n−k} distribution, i.e. P(F > F_{α, k−1, n−k}) = α, and f is the observed test statistic value.

The p-value of the test is:

p-value = P(F > f).

It is clear that f > F_{α, k−1, n−k} if and only if the p-value < α, as we must reach the same conclusion regardless of whether we use the critical value approach or the p-value approach to hypothesis testing.

One-way ANOVA table

Typically, one-way ANOVA results are presented in a table as follows:

Source   DF      SS      MS           F                        p-value
Factor   k − 1   B       B/(k − 1)    [B/(k−1)] / [W/(n−k)]    p
Error    n − k   W       W/(n − k)
Total    n − 1   B + W

Example 10.2 Continuing with Example 10.1, for the given data, k = 3, n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73. The sample variances are calculated to be s1² = 34, s2² = 20 and s3² = 32.
Therefore: b= 3 X 6(x̄·j − x̄)2 = 6[(79 − 73)2 + (74 − 73)2 + (66 − 73)2 ] = 516 j=1 and: w= 3 X 6 3 X 6 3 X X X (xij − x̄·j )2 = x2ij − 6 x̄2·j j=1 i=1 j=1 i=1 = 3 X j=1 5s2j j=1 = 5(34 + 20 + 32) = 430. Hence: f= 516/2 b/(k − 1) = = 9. w/(n − k) 430/15 Under H0 : µ1 = µ2 = µ3 , F ∼ Fk−1, n−k = F2, 15 . Since F0.01, 2, 15 = 6.359 < 9, using Table 12(d) of the New Cambridge Statistical Tables, we reject H0 at the 1% significance level. In fact the p-value (using a computer) is P (F > 9) = 0.003. Therefore, we conclude that there is a significant difference among the mean examination marks across the three classes. 316 10.5. One-way analysis of variance The one-way ANOVA table is as follows: Source Class Error Total DF 2 15 17 SS MS F 516 258 9 430 28.67 946 p-value 0.003 Example 10.3 A study performed by a Columbia University professor counted the number of times per minute professors from three different departments said ‘uh’ or ‘ah’ during lectures to fill gaps between words. The data were derived from observing 100 minutes from each of the three departments. If we assume that the more frequent use of ‘uh’ or ‘ah’ results in more boring lectures, can we conclude that some departments’ professors are more boring than others? The counts for English, Mathematics and Political Science departments are stored. As always in statistical analysis, we first look at the summary (descriptive) statistics of these data, here using R. > attach(UhAh) > summary(UhAh) Frequency Department Min. : 0.00 English :100 1st Qu.: 4.00 Mathematics :100 Median : 5.00 Political Science:100 Mean : 5.48 3rd Qu.: 7.00 Max. :11.00 > xbar <- tapply(Frequency, Department, mean) > s <- tapply(Frequency, Department, sd) > n <- tapply(Frequency, Department, length) > sem <- s/sqrt(n) > list(xbar,s,n,sem) [[1]] English Mathematics Political Science 5.81 5.30 5.33 [[2]] English 2.493203 Mathematics Political Science 2.012587 1.974867 English 100 Mathematics Political Science 100 100 English 0.2493203 Mathematics Political Science 0.2012587 0.1974867 [[3]] [[4]] 317 10. Analysis of variance (ANOVA) Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in Mathematics and Political Science (compare the sample means of 5.81, 5.30 and 5.33), but the difference seems small. However, we need to formally test whether the (seemingly small) differences are statistically significant. Using the data, R produces the following one-way ANOVA table: > anova(lm(Frequency ~ Department)) Analysis of Variance Table Response: Frequency Df Sum Sq Mean Sq F value Pr(>F) Department 2 16.38 8.1900 1.7344 0.1783 Residuals 297 1402.50 4.7222 Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis: H0 : µ1 = µ2 = µ3 . Therefore, there is no evidence of a difference in the mean number of ‘uh’s or ‘ah’s said by professors across the three departments. In addition to a one-way ANOVA table, we can also obtain the following. An estimator of σ is: r σ b=S= W . n−k 95% confidence intervals for µj are given by: S X̄·j ± t0.025, n−k × √ nj for j = 1, . . . , k where t0.025, n−k is the top 2.5th percentile of the Student’s tn−k distribution, which can be obtained from Table 10 of the New Cambridge Statistical Tables. Example 10.4 Assuming a common variance for each group, from the preceding output in Example 10.3 we see that: r 1402.50 √ = 4.72 = 2.173. 
σ b=s= 297 Since t0.025, 297 ≈ t0.025, ∞ = 1.96, using Table 10 of the New Cambridge Statistical Tables, we obtain the following 95% confidence intervals for µ1 , µ2 and µ3 , respectively: 2.173 j = 1 : 5.81 ± 1.96 × √ ⇒ (5.38, 6.24) 100 318 j=2: 2.173 5.30 ± 1.96 × √ 100 ⇒ (4.87, 5.73) j=3: 2.173 5.33 ± 1.96 × √ 100 ⇒ (4.90, 5.76). 10.5. One-way analysis of variance R can produce the following: > stripchart(Frequency ~ Department,pch=16,vert=T) > arrows(1:3,xbar+1.96*2.173/sqrt(n),1:3,xbar-1.96*2.173/sqrt(n), angle=90,code=3,length=0.1) > lines(1:3,xbar,pch=4,type="b",cex=2) 6 0 2 4 Frequency 8 10 These 95% confidence intervals can be seen plotted in the R output below. Note that these confidence intervals all overlap, which is consistent with our failure to reject the null hypothesis that all population means are equal. English Mathematics Political Science Figure 10.1: Overlapping confidence intervals. Example 10.5 In early 2001, the American economy was slowing down and companies were laying off workers. A poll conducted during February 2001 asked a random sample of workers how long (in months) it would be before they faced significant financial hardship if they lost their jobs. They are classified into four groups according to their incomes. Below is part of the R output of the descriptive statistics of the classified data. Can we infer that income group has a significant impact on the mean length of time before facing financial hardship? Hardship Min. : 0.00 1st Qu.: 8.00 Median :15.00 Mean :16.11 3rd Qu.:22.00 Max. :50.00 Income.group $20 to 30K: 81 $30 to 50K:114 Over $50K : 39 Under $20K: 67 319 10. Analysis of variance (ANOVA) > xbar <- tapply(Hardship, Income.group, mean) > s <- tapply(Hardship, Income.group, sd) > n <- tapply(Hardship, Income.group, length) > sem <- s/sqrt(n) > list(xbar,s,n,sem) [[1]] $20 to 30K $30 to 50K Over $50K Under $20K 15.493827 18.456140 22.205128 9.313433 [[2]] $20 to 30K $30 to 50K 9.233260 9.507464 Over $50K Under $20K 11.029099 8.087043 [[3]] $20 to 30K $30 to 50K 81 114 Over $50K Under $20K 39 67 [[4]] $20 to 30K $30 to 50K 1.0259178 0.8904556 Over $50K Under $20K 1.7660693 0.9879896 Inspection of the sample means suggests that there is a difference between income groups, but we need to conduct a one-way ANOVA test to see whether the differences are statistically significant. We apply one-way ANOVA to test whether the means in the k = 4 groups are equal, i.e. H0 : µ1 = µ2 = µ3 = µ4 , from highest to lowest income groups. We have n1 = 39, n2 = 114, n3 = 81 and n4 = 67, hence: n= k X nj = 39 + 114 + 81 + 67 = 301. j=1 Also x̄·1 = 22.21, x̄·2 = 18.456, x̄·3 = 15.49, x̄·4 = 9.313 and: k x̄ = 39 × 22.21 + 114 × 18.456 + 81 × 15.49 + 67 × 9.313 1X nj X̄·j = = 16.109. n j=1 301 Now: b= k X nj (x̄·j − x̄)2 j=1 = 39(22.21 − 16.109)2 + 114(18.456 − 16.109)2 + 81(15.49 − 16.109)2 + 67(9.313 − 16.109)2 = 5205.097. We have s21 = (11.03)2 = 121.661, s22 = (9.507)2 = 90.383, s23 = (9.23)2 = 85.193 and 320 10.5. One-way analysis of variance s24 = (8.087)2 = 65.400, hence: w= nj k X X 2 (xij − x̄·j ) = j=1 i=1 k X (nj − 1)s2j j=1 = 38 × 121.661 + 113 × 90.383 + 80 × 85.193 + 66 × 65.400 = 25968.24. Consequently: f= 5205.097/3 b/(k − 1) = = 19.84. w/(n − k) 25968.24/(301 − 4) Under H0 , F ∼ Fk−1, n−k = F3, 297 . Since F0.01, 3, 297 ≈ 3.848 < 19.84, we reject H0 at the 1% significance level, i.e. there is strong evidence that income group has a significant impact on the mean length of time before facing financial hardship. 
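These hand calculations can be reproduced from the group summary statistics alone. The following is a minimal R sketch; the group sizes, means and standard deviations are copied (rounded) from the output above, so the result agrees with the value of 19.84 up to rounding.

n <- c(39, 114, 81, 67)                      # group sizes (over $50K, $30-50K, $20-30K, under $20K)
xbar <- c(22.205, 18.456, 15.494, 9.313)     # group sample means
s <- c(11.029, 9.507, 9.233, 8.087)          # group sample standard deviations
k <- length(n); N <- sum(n)
grand <- sum(n * xbar) / N                   # overall sample mean, approximately 16.11
B <- sum(n * (xbar - grand)^2)               # between-groups variation, approximately 5205
W <- sum((n - 1) * s^2)                      # within-groups variation, approximately 25968
f <- (B / (k - 1)) / (W / (N - k))           # F statistic, approximately 19.8
pf(f, k - 1, N - k, lower.tail = FALSE)      # p-value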
The pooled estimate of σ is: r s= w = n−k r 25968.24 = 9.351. 301 − 4 A 95% confidence interval for µj is: s 9.351 18.328 x̄·j ± t0.025, 297 × √ = x̄·j ± 1.96 × √ = x̄·j ± √ . nj nj nj Hence, for example, a 95% confidence interval for µ1 is: 18.328 22.21 ± √ 39 ⇒ (19.28, 25.14) ⇒ (7.07, 11.55). and a 95% confidence interval for µ4 is: 18.328 9.313 ± √ 67 Notice that these two confidence intervals do not overlap, which is consistent with our conclusion that there is a difference between the group means. R output for the data is: > anova(lm(Hardship ~ Income.group)) Analysis of Variance Table Response: Hardship Df Sum Sq Mean Sq F value Pr(>F) Income.group 3 5202.1 1734.03 19.828 9.636e-12 *** Residuals 297 25973.3 87.45 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Note that minor differences are due to rounding errors in calculations. 321 10. Analysis of variance (ANOVA) Activity 10.1 Show that under the one-way ANOVA assumptions, for any set of k P constants {a1 , . . . , ak }, the quantity aj X̄·j is normally distributed with mean j=1 k P aj µj and variance σ j=1 k P 2 a2j /nj . j=1 Solution Under the one-way ANOVA assumptions, Xij ∼IID N (µj , σ 2 ) within each j = 1, . . . , k. Therefore, since the Xij s are independent with a common variance, σ 2 , we have: σ2 for j = 1, . . . , k. X̄·j ∼ N µj , nj Hence: Therefore: a2j σ 2 aj X̄·j ∼ N aj µj , nj k X aj X̄·j ∼ N j=1 k X for j = 1, . . . , k. aj µ j , σ 2 j=1 k X a2j j=1 nj ! . Activity 10.2 Do the following data appear to violate the assumptions underlying one-way analysis of variance? Explain why or why not. A 1.78 8.26 3.57 4.69 2.13 6.17 Treatment B C 8.41 0.57 5.61 3.04 3.90 2.67 3.77 1.66 1.08 2.09 2.67 1.57 D 9.45 8.47 7.69 8.53 9.04 7.11 Solution We have s2A = 6.1632, s2B = 6.4106, s2C = 0.7715 and s2D = 0.7400. So we observe that although s2A ≈ s2B and s2C ≈ s2D , the sample variances for treatments are very different across all groups, suggesting that the assumption that σ 2 is the same for all treatment levels may not be true. Activity 10.3 An indicator of the value of a stock relative to its earnings is its price-earnings ratio: the average of a given year’s high and low selling prices divided by its annual earnings. The following table provides the price-earnings ratios for a sample of 27 stocks, nine each from the financial, industrial and utility sectors of the New York Stock Exchange. Test at the 1% significance level whether the true mean price-earnings ratios for the three market sectors are the same. Use the ANOVA table format to summarise your calculations. You may exclude the p-value. 322 10.5. One-way analysis of variance Financial 11.4 12.3 10.8 9.8 14.3 16.1 11.9 12.4 13.1 Industrial 9.4 18.4 15.9 21.6 17.1 20.2 18.6 22.9 18.6 Utility 15.4 16.3 10.9 19.3 15.1 12.7 16.8 14.3 13.8 Solution For these n = 27 observations and k = 3 groups, we have x̄·1 = 12.46, x̄·2 = 18.08, x̄·3 = 14.96 and x̄ = 15.16. Also: 3 X 9 X x2ij = 6548.3. j=1 i=1 Hence the total variation is: 3 X 9 X x2ij − nx̄2 = 6548.3 − 27 × (15.16)2 = 340.58. j=1 i=1 The between-groups variation is: b= 3 X nj x̄2·j − n x̄2 = 9 × ((12.46)2 + (18.08)2 + (14.96)2 ) − 27 × (15.16)2 j=1 = 142.82. Therefore, w = 340.58 − 142.82 = 197.76. Hence the ANOVA table is: Source Sector Error Total DF 2 24 26 SS MS 142.82 71.41 197.76 8.24 340.58 F 8.67 To test the null hypothesis that the three types of stocks have equal price-earnings ratios, on average, we reject H0 if: f > F0.01, 2, 24 = 5.61. 
Since 5.61 < 8.67, we reject H0 and conclude that there is strong evidence of a difference in the mean price-earnings ratios across the sectors. 323 10. Analysis of variance (ANOVA) Activity 10.4 Three trainee salespeople were working on a trial basis. Salesperson A went in the field for 5 days and made a total of 440 sales. Salesperson B was tried for 7 days and made a total of 630 sales. Salesperson C was tried for 10 days and made a total of 690 sales. Note that these figures P are total sales, not daily averages. The sum of the squares of all 22 daily sales ( x2i ) is 146,840. (a) Construct a one-way analysis of variance table. (b) Would you say there is a difference between the mean daily sales of the three salespeople? Justify your answer. (c) Construct a 95% confidence interval for the mean difference between salesperson B and salesperson C. Would you say there is a difference? Solution (a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a one-way ANOVA. First, we calculate the overall mean. This is: 440 + 630 + 690 = 80. 22 We can now calculate the sum of squares between salespeople. This is: 5 × (88 − 80)2 + 7 × (90 − 80)2 + 10 × (69 − 80)2 = 2230. The total sum of squares is: 146840 − 22 × (80)2 = 6040. Here is the one-way ANOVA table: Source Salesperson Error Total DF 2 19 21 SS 2230 3810 6040 MS F 1115 5.56 200.53 p-value ≈ 0.01 (b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19 distribution (interpolated from Table 12 of the New Cambridge Statistical Tables), we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that the means are not equal. (c) We have: s 90 − 69 ± 2.093 × 200.53 × 1 1 + 7 10 = 21 ± 14.61. Here 2.093 is the top 2.5th percentile point of the t distribution with 19 degrees of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not included, there is evidence of a difference. 324 10.5. One-way analysis of variance Activity 10.5 The total times spent by three basketball players on court were recorded. Player A was recorded on three occasions and the times were 29, 25 and 33 minutes. Player B was recorded twice and the times were 16 and 30 minutes. Player C was recorded on three occasions and the times were 12, 14 and 16 minutes. Use analysis of variance to test whether there is any difference in the average times the three players spend on court. Solution We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence: 3 × (29 − 21.875)2 + 2 × (23 − 21.875)2 + 3 × (14 − 21.875)2 = 340.875. The total sum of squares is: 4307 − 8 × (21.875)2 = 478.875. Here is the one-way ANOVA table: Source Players Error Total DF 2 5 7 SS 340.875 138 478.875 MS 170.4375 27.6 F 6.175 p-value ≈ 0.045 We test H0 : µ1 = µ2 = µ3 (i.e. the average times they play are the same) vs. H1 : The average times they play are not the same. As 6.175 > 5.79 = F0.05, 2, 5 , which is the top 5th percentile of the F2, 5 distribution, we reject H0 and conclude that there is evidence of a difference between the means. Activity 10.6 Three independent random samples were taken. Sample A consists of 4 observations taken from a normal distribution with mean µA and variance σ 2 , sample B consists of 6 observations taken from a normal distribution with mean µB and variance σ 2 , and sample C consists of 5 observations taken from a normal distribution with mean µC and variance σ 2 . The average value of the first sample was 24, the average value of the second sample was 20, and the average value of the third sample was 18. 
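A quick way to check the arithmetic in this activity is with a few lines of R. The sketch below uses only the totals given in the question, and the values noted in the comments are those obtained in the solution.

n <- c(5, 7, 10)                       # days worked by salespeople A, B and C
totals <- c(440, 630, 690)             # total sales for each salesperson
sumsq <- 146840                        # sum of squared daily sales
means <- totals / n                    # 88, 90 and 69
grand <- sum(totals) / sum(n)          # overall mean, 80
B <- sum(n * (means - grand)^2)        # between-salespeople SS, 2230
totalSS <- sumsq - sum(n) * grand^2    # total SS, 6040
W <- totalSS - B                       # error SS, 3810
f <- (B / 2) / (W / (sum(n) - 3))      # F statistic, approximately 5.56
pf(f, 2, sum(n) - 3, lower.tail = FALSE)   # p-value, approximately 0.01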
The sum of the squared observations (all of them) was 6,722.4. Test the hypothesis: H0 : µA = µB = µC against the alternative that this is not so. Solution We will perform a one-way ANOVA. First we calculate the overall mean: 4 × 24 + 6 × 20 + 5 × 18 = 20.4. 15 We can now calculate the sum of squares between groups: 4 × (24 − 20.4)2 + 6 × (20 − 20.4)2 + 5 × (18 − 20.4)2 = 81.6. 325 10. Analysis of variance (ANOVA) The total sum of squares is: 6722.4 − 15 × (20.4)2 = 480. Here is the one-way ANOVA table: Source Sample Error Total DF 2 12 14 SS MS F p-value 81.6 40.8 1.229 ≈ 0.327 398.4 33.2 480 As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12 distribution, we see that there is no evidence that the means are not equal. Activity 10.7 An executive of a prepared frozen meals company is interested in the amounts of money spent on such products by families in different income ranges. The table below lists the monthly expenditures (in dollars) on prepared frozen meals from 15 randomly selected families divided into three groups according to their incomes. Under $15,000 45.2 60.1 52.8 31.7 33.6 39.4 $15,000 – $30,000 53.2 56.6 68.7 51.8 54.2 Over $30,000 52.7 73.6 63.3 51.8 (a) Based on these data, can we infer at the 5% significance level that the population mean expenditures on prepared frozen meals are the same for the three different income groups? (b) Produce a one-way ANOVA table. (c) Construct 95% confidence intervals for the mean expenditures of the first (under $15,000) and the third (over $30,000) income groups. Solution (a) For this example, k = 3, n1 = 6, n2 = 5, n3 = 4 and n = n1 + n2 + n3 = 15. We have x̄·1 = 43.8, x̄·2 = 56.9, x̄·3 = 60.35 and x̄ = 52.58. nj 3 P P Also, x2ij = 43387.85. j=1 i=1 Total SS = nj 3 P P x2ij − nx̄2 = 43387.85 − 41469.85 = 1918. j=1 i=1 w= nj 3 P P j=1 i=1 326 x2ij − P3 j=1 nj x̄2·j = 43387.85 − 42267.18 = 1120.67. 10.5. One-way analysis of variance Therefore, b = Total SS − w = 1918 − 1120.67 = 797.33. To test H0 : µ1 = µ2 = µ3 , the test statistic value is: f= b/(k − 1) 797.33/2 = = 4.269. w/(n − k) 1120.67/12 Under H0 , F ∼ F2, 12 . Since F0.05, 2, 12 = 3.89 < 4.269, we reject H0 at the 5% significance level, i.e. there exists evidence indicating that the population mean expenditures on frozen meals are not the same for the three different income groups. (b) The ANOVA table is as follows: Source Income Error Total DF 2 12 14 SS 797.33 1120.67 1918.00 MS 398.67 93.39 F 4.269 P <0.05 (c) A 95% confidence interval for µj is of the form: √ 21.056 93.39 S = X̄·j ± √ . X̄·j ± t0.025, n−k × √ = X̄·j ± t0.025, 12 × √ nj nj nj √ For j = 1, a 95% confidence interval is 43.8 ± 21.056/ 6 ⇒ (35.20, 52.40). √ For j = 3, a 95% confidence interval is 60.35 ± 21.056/ 4 ⇒ (49.82, 70.88). Activity 10.8 Does the level of success of publicly-traded companies affect the way their board members are paid? The annual payments (in $000s) of randomly selected publicly-traded companies to their board members were recorded. The companies were divided into four quarters according to the returns in their stocks, and the payments from each quarter were grouped together. Some summary statistics are provided below. Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter Variable 1st quarter 2nd quarter 3rd quarter 4th quarter N 30 30 30 30 Mean 74.10 75.67 78.50 81.30 SE Mean 2.89 2.48 2.79 2.85 StDev 15.81 13.57 15.28 15.59 (a) Can we infer that the amount of payment differs significantly across the four groups of companies? 
(b) Construct 95% confidence intervals for the mean payment of the 1st quarter companies and the 4th quarter companies. 327 10. Analysis of variance (ANOVA) Solution (a) Here k = 4 and n1 = n2 = n3 = n4 = 30. We have x̄·1 = 74.10, x̄·2 = 75.67, x̄·3 = 78.50, x̄·4 = 81.30, b = 909, w = 26403 and the pooled estimate of σ is s = 15.09. Hence the test statistic value is: f= b/(k − 1) = 1.33. w/(n − k) Under H0 : µ1 = µ2 = µ3 = µ4 , F ∼ Fk−1, n−k = F3, 116 . Since F0.05, 3, 116 = 2.68 > 1.33, we cannot reject H0 at the 5% significance level. Hence there is no evidence to support the claim that payments among the four groups are significantly different. (b) A 95% confidence interval for µj is of the form: 15.09 S = X̄·j ± 5.46. X̄·j ± t0.025, n−k × √ = X̄·j ± t0.025, 116 × √ nj 30 For j = 1, a 95% confidence interval is 74.10 ± 5.46 ⇒ (68.64, 79.56). For j = 4, a 95% confidence interval is 81.30 ± 5.46 ⇒ (75.84, 86.76). Activity 10.9 Proficiency tests are administered to a sample of 9-year-old children. The test scores are classified into four groups according to the highest education level achieved by at least one of their parents. The education categories used for the grouping are: ‘less than high school’, ‘high school graduate’, ‘some college’, and ‘college graduate’. (a) Find the missing values A1, A2, A3 and A4 in the one-way ANOVA table below. Source Factor Error Total DF A1 275 278 S = 32.16 Level Less than HS HS grad Some college College grad SS 45496 A2 329896 R-Sq = 13.79% N 41 73 86 79 Mean 196.83 207.78 223.38 232.67 Pooled StDev = 32.16 328 MS 15165 A3 F A4 P 0.000 R-Sq(adj) = 12.85% StDev 30.23 29.34 34.58 32.86 Individual 95% CIs For Mean Based on Pooled StDev -----+---------+---------+---------+---(-----*------) (----*---) (----*---) (----*----) -----+---------+---------+---------+---195 210 225 240 10.5. One-way analysis of variance (b) Test whether there are differences in mean test scores between children whose parents have different highest education levels. (c) State the required model conditions for the inference conducted in (b). Solution (a) We have A1 = 3, A2 = 284400, A3 = 1034 and A4 = 14.66. (b) Since the p-value of the F test is 0.000, there exists strong evidence indicating that the mean test scores are different for children whose parents have different highest education levels. (c) We need to assume that we have independent observations Xij ∼ N (µj , σ 2 ) for i = 1, . . . , nj and j = 1, . . . , k. Activity 10.10 Four different drinks A, B, C and D were assessed by 15 tasters. Each taster assessed only one drink. Drink A was assessed by 3 tasters and the scores x1A , x2A and x3A were recorded; drink B was assessed by 4 tasters and the scores x1B , . . . , x4B were recorded; drink C was assessed by 5 tasters and the scores x1C , . . . , x5C were recorded; drink D was assessed by 3 tasters and the scores x1D , x2D , and x3D were recorded. Explain how you would use this information to construct a one-way analysis of variance (ANOVA) table and use it to test whether the four drinks are equally good against the alternative that they are not. The significance level should be 1% and you should provide the critical value. Solution We need to calculate the following: 3 X̄A = 1X XiA , 3 i=1 4 X̄B = and: 3 P X̄ = 1X XiB , 4 i=1 XiA + i=1 4 P 5 X̄C = XiB + i=1 5 P 1X XiC , 5 i=1 XiC + i=1 3 P i=1 15 3 X̄D = 1X XiD 3 i=1 XiD . Alternatively: 3X̄A + 4X̄B + 5X̄C + 3X̄D . 
15 We then need the between-groups sum of squares: X̄ = B = 3(X̄A − X̄)2 + 4(X̄B − X̄)2 + 5(X̄C − X̄)2 + 3(X̄D − X̄)2 and the within-groups sum of squares: 3 4 5 3 X X X X 2 2 2 W = (XiA − X̄A ) + (XiB − X̄B ) + (XiC − X̄C ) + (XiD − X̄D )2 . i=1 i=1 i=1 i=1 329 10. Analysis of variance (ANOVA) Alternatively, we could calculate only one of the two, and calculate the total sum of squares (TSS): TSS = 3 X 2 (XiA − X̄) + i=1 4 X 2 (XiB − X̄) + i=1 5 X (XiC 3 X − X̄) + (XiD − X̄)2 2 i=1 i=1 and use the relationship TSS = B + W to calculate the other. We then construct the ANOVA table: Source Factor Error Total DF 3 11 14 SS b w b+w MS b/3 w/11 F 11b/3w At the 100α% significance level, we then compare f = 11b/3w to Fα, 3, 11 using Table 12 of the New Cambridge Statistical Tables. For α = 0.01, we will reject the null hypothesis that there is no difference if f > 6.22. 10.6 From one-way to two-way ANOVA One-way ANOVA: a review We have independent observations Xij ∼ N (µj , σ 2 ) for i = 1, . . . , nj and j = 1, . . . , k. We are interested in testing: H 0 : µ1 = · · · = µk . The variation of the Xij s is driven by a factor at different levels µ1 , . . . , µk , in addition to random fluctuations (i.e. random errors). We test whether such a factor effect exists or not. We can model a one-way ANOVA problem as follows: Xij = µ + βj + εij for i = 1, . . . , nj , j = 1, . . . , k where εij ∼ N (0, σ 2 ) and the εij s are independent. µ is the average effect and βj is the k P factor (or treatment) effect at the jth level. Note that βj = 0. The null hypothesis j=1 (i.e. that the group means are all equal) can also be expressed as: H0 : β1 = · · · = βk = 0. 10.7 Two-way analysis of variance Two-way analysis of variance (two-way ANOVA) involves a continuous dependent variable and two categorical independent variables (factors). Two-way ANOVA models the observations as: Xij = µ + γi + βj + εij 330 for i = 1, . . . , r, j = 1, . . . , c 10.7. Two-way analysis of variance where: µ represents the average effect β1 , . . . , βc represent c different treatment (column) levels γ1 , . . . , γr represent r different block (row) levels εij ∼ N (0, σ 2 ) and the εij s are independent. In total, there are n = r × c observations. We now consider the conditions to make the parameters µ, γi and βj identifiable for i = 1, . . . , r and j = 1, . . . , c. The conditions are: γ1 + · · · + γr = 0 and β1 + · · · + βc = 0. We will be interested in testing the following hypotheses. The ‘no treatment (column) effect’ hypothesis of H0 : β1 = · · · = βc = 0. The ‘no block (row) effect’ hypothesis of H0 : γ1 = · · · = γr = 0. We now introduce statistics associated with two-way ANOVA. Statistics associated with two-way ANOVA The sample mean at the ith block level is: c P X̄i· = Xij j=1 for i = 1, . . . , r. c The sample mean at the jth treatment level is: r P X̄·j = Xij i=1 for j = 1, . . . , c. r The overall sample mean is: r P c P X̄ = X̄·· = i=1 j=1 n Xij . The total variation is: Total SS = r X c X (Xij − X̄)2 i=1 j=1 with rc − 1 degrees of freedom. 331 10. Analysis of variance (ANOVA) The between-blocks (rows) variation is: Brow = c r X (X̄i· − X̄)2 i=1 with r − 1 degrees of freedom. The between-treatments (columns) variation is: Bcol = r c X (X̄·j − X̄)2 j=1 with c − 1 degrees of freedom. The residual (error) variation is: Residual SS = r X c X (Xij − X̄i· − X̄·j + X̄)2 i=1 j=1 with (r − 1)(c − 1) degrees of freedom. 
The (two-way) ANOVA decomposition is: r X c X r c r X c X X X 2 2 (Xij − X̄) = c (X̄i· − X̄) +r (X̄·j − X̄) + (Xij − X̄i· − X̄·j + X̄)2 . 2 i=1 j=1 i=1 j=1 i=1 j=1 The total variation is a measure of the overall (total) variability in the data and the (two-way) ANOVA decomposition decomposes this into three components: between-blocks variation (which is attributable to the row factor level), between-treatments variation (which is attributable to the column factor level) and residual variation (which is attributable to the variation not explained by the row and column factors). The following are some useful formulae for manual computations. Row sample means: X̄i· = c P Xij /c, for i = 1, . . . , r. j=1 Column sample means: X̄·j = r P Xij /r, for j = 1, . . . , c. i=1 Overall sample mean: X̄ = r P c P Xij /n = i=1 j=1 Total SS = r P c P r P i=1 X̄i· /r = c P j=1 Xij2 − rcX̄ 2 . i=1 j=1 Between-blocks (rows) variation: Brow = c r P i=1 332 X̄i·2 − rcX̄ 2 . X̄·j /c. 10.7. Two-way analysis of variance Between-treatments (columns) variation: Bcol = r c P X̄·j2 − rcX̄ 2 . j=1 r P c P Residual SS = (Total SS) − Brow − Bcol = Xij2 − c i=1 j=1 r P X̄i·2 − r i=1 c P X̄·j2 + rcX̄ 2 . j=1 In order to test the ‘no block (row) effect’ hypothesis of H0 : γ1 = · · · = γr = 0, the test statistic is defined as: F = (c − 1)Brow Brow /(r − 1) = . (Residual SS)/[(r − 1)(c − 1)] Residual SS Under H0 , F ∼ Fr−1, (r−1)(c−1) . We reject H0 at the 100α% significance level if: f > Fα, r−1, (r−1)(c−1) where Fα, r−1, (r−1)(c−1) is the top 100αth percentile of the Fr−1, (r−1)(c−1) distribution, i.e. P (F > Fα, r−1, (r−1)(c−1) ) = α, and f is the observed test statistic value. The p-value of the test is: p-value = P (F > f ). In order to test the ‘no treatment (column) effect’ hypothesis of H0 : β1 = · · · = βc = 0, the test statistic is defined as: F = (r − 1)Bcol Bcol /(c − 1) = . (Residual SS)/[(r − 1)(c − 1)] Residual SS Under H0 , F ∼ Fc−1, (r−1)(c−1) . We reject H0 at the 100α% significance level if: f > Fα, c−1, (r−1)(c−1) . The p-value of the test is defined in the usual way. Two-way ANOVA table As with one-way ANOVA, two-way ANOVA results are presented in a table as follows: Source DF SS MS F p-value Row factor r−1 Brow Brow /(r − 1) (c−1)Brow Residual SS p Column factor c−1 Bcol Bcol /(c − 1) (r−1)Bcol Residual SS p (r − 1)(c − 1) Residual SS Residual SS (r−1)(c−1) rc − 1 Total SS Residual Total Activity 10.11 Four suppliers were asked to quote prices for seven different building materials. The average quote of supplier A was 1,315.8. The average quote of suppliers B, C and D were 1,238.4, 1,225.8 and 1,200.0, respectively. The following is the calculated two-way ANOVA table with some entries missing. 333 10. Analysis of variance (ANOVA) Source Materials Suppliers Error Total DF SS MS F 17800 p-value 358700 (a) Complete the table using the information provided above. (b) Is there a significant difference between the quotes of different suppliers? Explain your answer. (c) Construct a 90% confidence interval for the difference between suppliers A and D. Would you say there is a difference? Solution (a) The average quote of all suppliers is: 1315.8 + 1238.4 + 1225.8 + 1200.0 = 1245. 4 Hence the sum of squares (SS) due to suppliers is: 7×[(1315.8−1245)2 +(1238.4−1245)2 +(1225.8−1245)2 +(1200.0−1245)2 ] = 52148.88 and the MS due to suppliers is 52148.88/(4 − 1) = 17382.96. 
The degrees of freedom are 7 − 1 = 6, 4 − 1 = 3, (7 − 1)(4 − 1) = 18 and 7 × 4 − 1 = 27 for materials, suppliers, error and total sum of squares, respectively. The SS for materials is 6 × 17800 = 106800. We have that the SS due to the error is given by 358700 − 52148.88 − 106800 = 199751.12 and the MS is 199751.12/18 = 11097.28. The F values are: 17800 = 1.604 and 11097.28 17382.96 = 1.567 11097.28 for materials and suppliers, respectively. The two-way ANOVA table is: Source DF SS MS F p-value Materials 6 106800 17800 1.604 ≈ 0.203 Suppliers 3 52148.88 17382.96 1.567 ≈ 0.232 Error 18 199751.12 11097.28 Total 27 358700 (b) We test H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between suppliers) vs. H1 : There is a difference between suppliers. The F value is 1.567 and at a 5% significance level the critical value from Table 12 (degrees of freedom 3 and 18) is 3.16, hence we do not reject H0 and conclude that there is not enough evidence that there is a difference. 334 10.7. Two-way analysis of variance (c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734 and the MS value is 11097.28. So a 90% confidence interval is: s 1 1 + = 115.8 ± 97.64 1315.8 − 1200 ± 1.734 × 11097.28 7 7 giving (18.16, 213.44). Since zero is not in the interval, there appears to be a difference between suppliers A and D. Activity 10.12 Blood alcohol content (BAC) is measured in milligrams per decilitre of blood (mg/dL). A researcher is looking into the effects of alcoholic drinks. Four different individuals tried five different brands of strong beer (A, B, C, D and E) on different days, of course! Each individual consumed 1L of beer over a 30-minute period and their BAC was measured one hour later. The average BAC for beers A, C, D and E were 83.25, 95.75, 79.25 and 99.25, respectively. The value for beer B is not given. The following information is provided as well. Source Drinker Beer Error Total DF SS MS F p-value 1.56 303.5 695.6 (a) Complete the table using the information provided above. (b) Is there a significant difference between the effects of different beers? What about different drinkers? (c) Construct a 90% confidence interval for the difference between the effects of beers C and D. Would you say there is a difference? Solution (a) We have: Source DF SS MS F p-value Drinker 3 271.284 90.428 1.56 ≈ 0.250 Beer 4 1214 303.5 5.236 ≈ 0.011 Error 12 695.6 57.967 Total 19 2180.884 (b) We test the hypothesis H0 : µ1 = µ2 = µ3 = µ4 = µ5 (i.e. there is no difference between the effects of different beers) vs. the alternative H1 : There is a difference between the effects of different beers. The F value is 5.236 and at a 5% significance level the critical value from Table 9 is F0.05, 4, 12 = 3.26, so since 5.236 > 3.26 we reject H0 and conclude that there is evidence of a difference. For drinkers, we test the hypothesis H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between the effects on different drinkers) vs. the alternative H1 : There 335 10. Analysis of variance (ANOVA) is a difference between the effects on different drinkers. The F value is 1.56 and at a 5% significance level the critical value from Table 9 is F0.05, 3, 12 = 3.49, so since 1.56 < 3.49 we fail to reject H0 and conclude that there is no evidence of a difference. (c) The top 5th percentile of the t distribution with 12 degrees of freedom is 1.782. So a 90% confidence interval is: s 1 1 + = 16.5 ± 9.59 95.75 − 79.25 ± 1.782 × 57.967 4 4 giving (6.91, 26.09). 
As the interval does not contain zero, there is evidence of a difference between the effects of beers C and D. Activity 10.13 A motor manufacturer operates five continuous-production plants: A, B, C, D and E. The average rate of production has been calculated for the three shifts of each plant and recorded in the table below. Does there appear to be a difference in production rates in different plants or by different shifts? Early shift Late shift Night shift A 102 85 75 B C D E 93 85 110 72 87 71 92 73 80 75 77 76 Solution Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows: Source Shift Plant Error Total DF 2 4 8 14 SS 652.13 761.73 463.87 1877.73 MS 326.07 190.43 57.98 F 5.62 3.28 Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62, we can reject the null hypothesis at the 5% significance level. (Note the p-value is 0.030.) Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28, we cannot reject the null hypothesis at the 5% significance level. (Note the p-value is 0.072.) Overall, the data collected show some evidence of a shift effect but little evidence of a plant effect. Activity 10.14 Complete the two-way ANOVA table below. In the places of p-values, indicate in the form such as ‘< 0.01’ appropriately and use the closest value which you may find from the New Cambridge Statistical Tables. 336 10.7. Two-way analysis of variance Source DF SS MS F p-value Row factor Column factor Residual Total 4 6 ? 34 ? 270.84 708.00 1915.76 234.23 45.14 ? ? 1.53 ? ? Solution First, row factor SS = (row factor MS)×4 = 936.92. The degrees of freedom for residual is 34 − 4 − 6 = 24. Therefore, residual MS = 708.00/24 = 29.5. Hence the F statistic for testing no row factor effect is 234.23/29.5 = 7.94. From Table 12 of the New Cambridge Statistical Tables, F0.001, 4, 24 = 6.59 < 7.94. Therefore, the corresponding p-value is smaller than 0.001. Since F0.05, 6, 24 = 2.51 > 1.53, the p-value for testing the column factor effect is greater than 0.05. The complete ANOVA table is as follows: Source DF Row factor Column factor Residual Total 4 6 24 34 SS MS F p-value 936.92 234.23 7.94 < 0.001 270.84 45.14 1.53 > 0.05 708.00 29.50 1915.76 Activity 10.15 The following table shows the audience shares (in %) of three major networks’ evening news broadcasts in five major cities, with one observation per cell so that there are 15 observations. Construct the two-way ANOVA table for these data (without the p-value column). Is either factor statistically significant at the 5% significance level? City A B C D E BBC 21.3 20.6 24.1 23.6 21.8 ITV 17.8 17.5 16.1 18.3 17.0 Sky 20.2 20.1 19.4 20.8 28.7 Solution We have r = 5 and c = 3. The row sample means are calculated using X̄i· = c P Xij /c, which gives 19.77, 19.40, j=1 19.87, 20.90 and 22.50 for i = 1, 2, 3, 4, 5, respectively. 337 10. Analysis of variance (ANOVA) The column means are calculated using X̄·j = r P Xij /r, which gives 22.28, 17.34 and i=1 21.84 for j = 1, 2, 3, respectively. The overall sample mean is: x̄ = r X x̄i· i=1 r = 20.49. The sum of the squared observations is: r X c X x2ij = 6441.99. i=1 j=1 Hence: Total SS = r X c X x2ij − rcx̄2 = 6441.99 − 15 × (20.49)2 = 6441.99 − 6297.60 = 144.39. i=1 j=1 brow = c r X x̄2i· − rcx̄2 = 3 × 2104.83 − 6297.60 = 16.88. i=1 bcol = r c X x̄2·j − rcx̄2 = 5 × 1274.06 − 6297.60 = 72.70. j=1 Residual SS = Total SS − brow − bcol = 144.39 − 16.88 − 72.70 = 54.81. 
To test the no row effect hypothesis H0 : γ1 = · · · = γ5 = 0, the test statistic value is: f= (c − 1)brow 2 × 16.88 = = 0.62. Residual SS 54.81 Under H0 , F ∼ Fr−1, (r−1)(c−1) = F4, 8 . Using Table 12 of the New Cambridge Statistical Tables, since F0.05, 4, 8 = 3.84 > 0.62, we do not reject H0 at the 5% significance level. We conclude that there is no evidence that the audience share depends on the city. To test the no column effect hypothesis H0 : β1 = β2 = β3 = 0, the test statistic value is: 4 × 72.70 (r − 1)bcol = = 5.31. f= Residual SS 54.81 Under H0 , F ∼ Fc−1, (r−1)(c−1) = F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.31, we reject H0 at the 5% significance level. Therefore, there is evidence indicating that the audience share depends on the network. The results may be summarised in a two-way ANOVA table as follows: Source City Network Residual Total 338 DF 4 2 8 14 SS MS 16.88 4.22 72.70 36.35 54.81 6.85 144.39 F 0.61 5.31 10.8. Residuals 10.8 Residuals Before considering an example of two-way ANOVA, we briefly consider residuals. Recall the original two-way ANOVA model: Xij = µ + γi + βj + εij . We now decompose the observations as follows: Xij = X̄ + (X̄i· − X̄) + (X̄·j − X̄) + (Xij − X̄i· − X̄·j + X̄) for i = 1, . . . , r and j = 1, . . . , c, where we have the following point estimators. µ b = X̄ is the point estimator of µ. γ bi = X̄i· − X̄ is the point estimator of γi , for i = 1, . . . , r. βbj = X̄·j − X̄ is the point estimator of βj , for j = 1, . . . , c. It follows that the residual, i.e. the estimator of εij , is: εbij = Xij − X̄i· − X̄·j + X̄ for i = 1, . . . r and j = 1, . . . , c. The two-way ANOVA model assumes εij ∼ N (0, σ 2 ) and so, if the model structure is correct, then the εbij s should behave like independent N (0, σ 2 ) random variables. Example 10.6 The following table lists the percentage annual returns (calculated four times per annum) of the Common Stock Index at the New York Stock Exchange during 1981–85. 1981 1982 1983 1984 1985 1st quarter 5.7 7.2 4.9 4.5 4.4 2nd quarter 6.0 7.0 4.1 4.9 4.2 3rd quarter 7.1 6.1 4.2 4.5 4.2 4th quarter 6.7 5.2 4.4 4.5 3.6 (a) Is the variability in returns from year to year statistically significant? (b) Are returns affected by the quarter of the year? Using two-way ANOVA, we test the no row effect hypothesis to answer (a), and test the no column effect hypothesis to answer (b). We have r = 5 and c = 4. The row sample means are calculated using X̄i· = c P Xij /c, which gives 6.375, 6.375, j=1 4.4, 4.6 and 4.1, for i = 1, . . . , 5, respectively. The column sample means are calculated using X̄·j = r P Xij /r, which gives 5.34, i=1 5.24, 5.22 and 4.88, for j = 1, . . . , 4, respectively. 339 10. Analysis of variance (ANOVA) The overall sample mean is x̄ = r P x̄i· /r = 5.17. i=1 The sum of the squared observations is r P c P x2ij = 559.06. i=1 j=1 Hence we have the following. Total SS = r X c X x2ij − rcx̄2 = 559.06 − 20 × (5.17)2 = 559.06 − 534.578 = 24.482. i=1 j=1 brow = c r X x̄2i· − rcx̄2 = 4 × 138.6112 − 534.578 = 19.867. i=1 bcol = r c X x̄2·j − rcx̄2 = 5 × 107.036 − 534.578 = 0.602. j=1 Residual SS = (Total SS) − brow − bcol = 24.482 − 19.867 − 0.602 = 4.013. To test the no row effect hypothesis H0 : γ1 = · · · = γ5 = 0, the test statistic value is: (c − 1)brow 3 × 19.867 = = 14.852. Residual SS 4.013 Under H0 , F ∼ Fr−1, (r−1)(c−1) = F4, 12 . Using Table 12(d) of the New Cambridge Statistical Tables, since F0.01, 4, 12 = 5.412 < 14.852, we reject H0 at the 1% significance level. 
We conclude that there is strong evidence that the return does depend on the year. f= To test the no column effect hypothesis H0 : β1 = · · · = β4 = 0, the test statistic value is: 4 × 0.602 (r − 1)bcol = = 0.600. f= Residual SS 4.013 Under H0 , F ∼ Fc−1, (r−1)(c−1) = F3, 12 . Since F0.10, 3, 12 = 2.606 > 0.600, we cannot reject H0 even at the 10% significance level. Therefore, there is no significant evidence indicating that the return depends on the quarter. The results may be summarised in a two-way ANOVA table as follows: Source DF Year Quarter Residual Total 4 3 12 19 SS MS F 19.867 4.967 14.852 0.602 0.201 0.600 4.013 0.334 24.482 p-value < 0.01 > 0.10 We could also provide 95% confidence interval estimates for each block and treatment level by using the pooled estimator of σ 2 , which is: S2 = 340 Residual SS = Residual MS. (r − 1)(c − 1) 10.9. Overview of chapter For the given data, s2 = 0.334. R produces the following output: > anova(lm(Return ~ Year + Quarter)) Analysis of Variance Table Response: Return Df Sum Sq Mean Sq F value Pr(>F) Year 4 19.867 4.9667 14.852 0.0001349 *** Quarter 3 0.602 0.2007 0.600 0.6271918 Residuals 12 4.013 0.3344 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and 1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is consistent with rejection of H0 in the no row effect test. In contrast, the confidence intervals for each quarter all overlap, which is consistent with our failure to reject H0 in the no column effect test. Finally, we may also look at the residuals: εbij = Xij − µ b−γ bi − βbj for i = 1, . . . r, and j = 1, . . . , c. If the assumed normal model (structure) is correct, the εbij s should behave like independent N (0, σ 2 ) random variables. 10.9 Overview of chapter This chapter introduced analysis of variance as a statistical tool to detect differences between group means. One-way and two-way analysis of variance frameworks were presented depending on whether one or two independent variables were modelled, respectively. Statistical inference in the form of hypothesis tests and confidence intervals was conducted. 10.10 Key terms and concepts ANOVA decomposition Between-groups variation One-way ANOVA Residual Total variation Within-groups variation Between-blocks (rows) variation Between-treatments (columns) variation Random errors Sample mean Two-way ANOVA 341 10. Analysis of variance (ANOVA) 10.11 Sample examination questions Solutions can be found in Appendix C. 1. Three call centre workers were being monitored for the average number of calls they answer per daily shift. Worker A answered a total of 187 calls in 4 days. Worker B answered a total of 347 calls in 6 days. Worker C answered a total of 461 calls in 10 days. Note that these are total sales, not daily averages. The sum P figures 2 of the squares of all 20 days, xi , is 50,915. (a) Construct a one-way analysis of variance table. (You may exclude the p-value.) (b) Would you say there is a difference between the average daily calls answered of the three workers? Justify your answer using a 5% significance level. 2. The audience shares (in %) of three major television networks’ evening news broadcasts in four major cities were examined. The average audience share for the three networks (A, B and C) were 21.35%, 17.28% and 20.18%, respectively. The following is the calculated ANOVA table with some entries missing. 
Source City Network Error Total Degrees of freedom Sum of squares Mean square 1.95 F -value 51.52 (a) Complete the table using the information provided above. (b) Test, at the 5% significance level, whether there is evidence of a difference in audience shares between networks. 3. An experiment is conducted to study how long different external batteries for laptops last (with the laptop on power saving mode). The aim is to find out whether there is a difference in terms of battery life between four brands of batteries using seven different laptops. Each battery was tried once with each laptop. The total time the Brand A battery lasted was 43.86 hours. The total times for brands B, C and D were 41.28, 40.86 and 40 hours respectively. The following is the calculated ANOVA table with some entries missing. Source Degrees of freedom Laptops Batteries Error Total Sum of squares Mean square F -value 26 343 (a) Complete the table using the information provided above. (b) Test whether there are significant differences between the expected battery performance: (i) of different batteries, and (ii) of different laptops. Perform both tests at the 5% significance level. (c) Construct a 90% confidence interval for the expected difference between brands A and D. Is there any evidence of a difference in the performance of these brands? 342 Appendix A Linear regression (non-examinable) A.1 Synopsis of chapter This chapter covers linear regression whereby the variation in a continuous dependent variable is modelled as being explained by one or more continuous independent variables. A.2 Learning outcomes After completing this chapter, you should be able to: derive from first principles the least squares estimators of the intercept and slope in the simple linear regression model explain how to construct confidence intervals and perform hypothesis tests for the intercept and slope in the simple linear regression model demonstrate how to construct confidence intervals and prediction intervals and explain the difference between the two summarise the multiple linear regression model with several explanatory variables, and explain its interpretation provide the assumptions on which regression models are based interpret typical output from a computer package fitting of a regression model. A.3 Introduction Regression analysis is one of the most frequently-used statistical techniques. It aims to model an explicit relationship between one dependent variable, often denoted as y, and one or more regressors (also called covariates, or independent variables), often denoted as x1 , . . . , xp . The goal of regression analysis is to understand how y depends on x1 , . . . , xp and to predict or control the unobserved y based on the observed x1 , . . . , xp . We start with some simple examples with p = 1. 343 A. Linear regression (non-examinable) A.4 Introductory examples Example 1.1 In a university town, the sales, y, of 10 Armand’s Pizza Parlour restaurants are closely related to the student population, x, in their neighbourhoods. The data are the sales (in thousands of euros) in a period of three months together with the numbers of students (in thousands) in their neighbourhoods. We plot y against x, and draw a straight line through the middle of the data points: y = β0 + β1 x + ε where ε stands for a random error term, β0 is the intercept and β1 is the slope of the straight line. For a given student population, x, the predicted sales are yb = β0 + β1 x. 
Example 1.2 Data were collected on the heights, x, and weights, y, of 69 students in a class. We plot y against x, and draw a straight line through the middle of the data cloud: y = β0 + β1 x + ε where ε stands for a random error term, β0 is the intercept and β1 is the slope of the straight line. For a given height, x, the predicted value yb = β0 + β1 x may be viewed as a kind of ‘standard weight’. 344 A.5. Simple linear regression Example 1.3 Some other possible examples of y and x are shown in the following table. y Sales Weight gain Present FTSE 100 index Consumption Salary Daughter’s height x Price Protein in diet Past FTSE 100 index Income Tenure Mother’s height In most cases, there are several x variables involved. We will consider such situations later in this chapter. Some questions to consider are the following. How to draw a line through data clouds, i.e. how to estimate β0 and β1 ? How accurate is the fitted line? What is the error in predicting a future y? A.5 Simple linear regression We now present the simple linear regression model. Let the paired observations (x1 , y1 ), . . . , (xn , yn ) be drawn from the model: y i = β 0 + β 1 xi + εi where: E(εi ) = 0 and Var(εi ) = E(ε2i ) = σ 2 > 0. Furthermore, suppose Cov(εi , εj ) = E(εi εj ) = 0 for all i 6= j. That is, the εi s are assumed to be uncorrelated (remembering that a zero covariance between two random variables implies that they are uncorrelated). So the model has three parameters: β0 , β1 and σ 2 . For convenience, we will treat x1 , . . . , xn as constants.1 We have: E(yi ) = β0 + β1 xi and Var(yi ) = σ 2 . Since the εi s are uncorrelated (by assumption), it follows that y1 , . . . , yn are also uncorrelated with each other. Sometimes we assume εi ∼ N (0, σ 2 ), in which case yi ∼ N (β0 + β1 xi , σ 2 ), and y1 , . . . , yn are independent. (Remember that a linear transformation of a normal random variable is also normal, and that for jointly normal random variables if they are uncorrelated then they are also independent.) 1 If you study EC2020 Elements of econometrics, you will explore regression models in much more detail than is covered here. For example, x1 , . . . , xn will be treated as random variables in econometrics. 345 A. Linear regression (non-examinable) Our tasks are two-fold. Statistical inference for β0 , β1 and σ 2 , i.e. (point) estimation, confidence intervals and hypothesis testing. Prediction intervals for future values of y. We derive estimators of β0 and β1 using least squares estimation (introduced in Chapter 7). The least squares estimators (LSEs) of β0 and β1 are the values of (β0 , β1 ) at which the function: n n X X 2 L(β0 , β1 ) = εi = (yi − β0 − β1 xi )2 i=1 i=1 obtains its minimum. We proceed to partially differentiate L(β0 , β1 ) with respect to β0 and β1 , respectively. Firstly: n X ∂L(β0 , β1 ) = −2 (yi − β0 − β1 xi ). ∂β0 i=1 Upon setting this partial derivative to zero, this leads to: n X yi − nβb0 − βb1 i=1 n X xi = 0 or βb0 = ȳ − βb1 x̄. i=1 Secondly: n X ∂L(β0 , β1 ) = −2 xi (yi − β0 − β1 xi ). ∂β1 i=1 Upon setting this partial derivative to zero, this leads to: 0= n X xi (yi − βb0 − βb1 xi ) i=1 = n X b b xi yi − ȳ − (β1 xi −β1 x̄) i=1 = n X xi (yi − ȳ) − βb1 i=1 Hence: n P i=1 βb1 = P n = xi (xi − x̄) i=1 xi (xi − x̄). i=1 n P xi (yi − ȳ) n X (xi − x̄)(yi − ȳ) i=1 n P and βb0 = ȳ − βb1 x̄. (xi − x̄)2 i=1 The estimator βb1 above is based on the fact that for any constant c, we have: n X i=1 346 xi (yi − ȳ) = n X i=1 (xi − c)(yi − ȳ) A.5. 
Simple linear regression since: n X c (yi − ȳ) = c n X i=1 Given that n P (yi − ȳ) = 0. i=1 (xi − x̄) = 0, it follows that i=1 n P c (xi − x̄) = 0 for any constant c. i=1 In order to calculate βb1 numerically, often the following formula is convenient: n P xi yi − nx̄ȳ i=1 b . β1 = P n x2i − nx̄2 i=1 An alternative derivation is as follows. Note L(β0 , β1 ) = n P (yi − β0 − β1 xi )2 . For any β0 i=1 and β1 , we have: n 2 X b b b b L(β0 , β1 ) = yi − β0 − β1 xi + β0 − β0 + β1 − β1 xi i=1 n 2 X βb0 − β0 + (βb1 − β1 )xi + 2B = L βb0 , βb1 + (A.1) i=1 where: B= n X βb0 − β0 + (βb1 − β1 )xi b b yi − β0 − β1 xi i=1 = βb0 − β0 n X n X b b b yi − β0 − β1 xi + β1 − β1 xi yi − βb0 − βb1 xi . i=1 i=1 Now let βb0 , βb1 be the solution to the equations: n X yi − βb0 − βb1 xi = 0 and n X xi yi − βb0 − βb1 xi = 0 (A.2) i=1 i=1 such that B = 0. By (A.1), we have: n X 2 b b L (β0 , β1 ) = L β0 , β1 + βb0 − β0 + βb1 − β1 xi ≥ L βb0 , βb1 . i=1 Hence βb0 , βb1 are the least squares estimators (LSEs) of β0 and β1 , respectively. To find the explicit expression from (A.2), note the first equation can be written as: nȳ − nβb0 − nβb1 x̄ = 0. Hence βb0 = ȳ − βb1 x̄. Substituting this into the second equation, we have: 0= n X i=1 xi n n X X b b yi − ȳ − β1 (xi − x̄) = xi (yi − ȳ) − β1 xi (xi − x̄). i=1 i=1 347 A. Linear regression (non-examinable) Therefore: n P i=1 βb1 = P n n P xi (yi − ȳ) = (xi − x̄)(yi − ȳ) i=1 n P xi (xi − x̄) i=1 . (xi − x̄)2 i=1 This completes the derivation. n n P P Remember (xi − x̄) = 0. Hence c (xi − x̄) = 0 for any constant c. i=1 i=1 2 We also note the estimator of σ , which is: n P σ b2 = (yi − βb0 − βb1 xi )2 i=1 . n−2 We now explore the properties of the LSEs βb0 and βb1 . We now proceed to show that the means and variances of these LSEs are: n P x2i σ i=1 and Var(βb0 ) = n P n (xi − x̄)2 2 E(βb0 ) = β0 i=1 for βb0 , and: E(βb1 ) = β1 and Var(βb1 ) = P n σ2 (xi − x̄)2 i=1 for βb1 . Proof: Recall we treat the xi s as constants, and we have E(yi ) = β0 + β1 xi and also Var(yi ) = σ 2 . Hence: n E(ȳ) = E 1X yi n i=1 ! n n 1X 1X = E(yi ) = (β0 + β1 xi ) = β0 + β1 x̄. n i=1 n i=1 Therefore: E(yi − ȳ) = β0 + β1 xi − (β0 + β1 x̄) = β1 (xi − x̄). Consequently, we have: n n n P P P (x (xi − x̄)2 β1 (x i − x̄)E(yi − ȳ) i − x̄)(yi − ȳ) i=1 i=1 i=1 = E(βb1 ) = E = P = β1 . n n n P P 2 (xi − x̄) (xi − x̄)2 (xi − x̄)2 i=1 i=1 i=1 Now: E(βb0 ) = E(ȳ − βb1 x̄) = β0 + β1 x̄ − β1 x̄ = β0 . Therefore, the LSEs βb0 and βb1 are unbiased estimators of β0 and β1 , respectively. 348 A.5. Simple linear regression To work out the variances, the key is to write βb1 and βb0 as linear estimators (i.e. linear combinations of the yi s): n P n P (xi − x̄)(yi − ȳ) i=1 βb1 = n P = (xi − x̄)2 i=1 where ai = (xi − x̄) n P (xi − x̄)yi i=1 n P n X = (xk − x̄)2 ai y i i=1 k=1 (xk − x̄)2 and: k=1 βb0 = ȳ − βb1 x̄ = ȳ − n X ai x̄yi = i=1 n X 1 − ai x̄ yi . n i=1 Note that: n X n X ai = 0 and i=1 i=1 a2i = P n 1 . (xk − x̄)2 k=1 Now we note the following lemma, without proof. Let y1 , . . . , yn be uncorrelated random variables, and b1 , . . . , bn be constants, then: Var n X ! bi y i = i=1 n X b2i Var(yi ). i=1 By this lemma: Var(βb1 ) = Var n X ! ai y i =σ 2 i=1 n X i=1 a2i = P n σ2 (xk − x̄)2 k=1 and: Var(βb0 ) = σ 2 n X 1 i=1 n n 2 − ai x̄ =σ 2 1 X 2 2 + a x̄ n i=1 i ! = σ2 nx̄2 1 + n P n (xk − x̄)2 k=1 n P x2k σ2 k=1 = . n n P 2 (xk − x̄) k=1 The last equality uses the fact that: n X k=1 x2k n X = (xk − x̄)2 + nx̄2 . k=1 349 A. 
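The closed-form expressions above are easy to check numerically. The following R sketch uses simulated data (illustrative only, not from the guide) to compute the least squares estimates and the estimate of σ² directly from the formulas, and then compares the results with lm().

# Sketch: least squares estimates from the closed-form formulas, checked against lm().
set.seed(1)
x <- runif(30, 0, 10)                    # simulated explanatory variable
y <- 2 + 0.5 * x + rnorm(30)             # simulated response

beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
sigma2.hat <- sum((y - beta0.hat - beta1.hat * x)^2) / (length(x) - 2)   # divisor n - 2

c(beta0.hat, beta1.hat)   # manual estimates
coef(lm(y ~ x))           # should agree with the formulas above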
Linear regression (non-examinable) A.6 Inference for parameters in normal regression models The normal simple linear regression model is yi = β0 + β1 xi + εi , where: ε1 , . . . , εn ∼IID N (0, σ 2 ). y1 , . . . , yn are independent (but not identically distributed) and: yi ∼ N (β0 + β1 xi , σ 2 ). Since any linear combination of normal random variables is also normal, the LSEs of β0 and β1 (as linear estimators) are also normal random variables. In fact: n P 2 x i σ2 σ2 i=1 b . βb0 ∼ N β , and β ∼ N β , 0 1 1 n n P n P 2 2 (xi − x̄) (xi − x̄) i=1 i=1 Since σ 2 is unknown in practice, we replace σ 2 by its estimator: n P σ b2 = (yi − βb0 − βb1 xi )2 i=1 n−2 and use the estimated standard errors: n P 1/2 x2i σ b i=1 E.S.E.(βb0 ) = √ n n P 2 (xi − x̄) i=1 and: E.S.E.(βb1 ) = σ b n P 1/2 . (xi − x̄)2 i=1 The following results all make use of distributional results introduced earlier in the course. Statistical inference (confidence intervals and hypothesis testing) for the normal simple linear regression model can then be performed. i. We have: n P 2 (n − 2) σ b = σ2 (yi − βb0 − βb1 xi )2 i=1 σ2 ii. βb0 and σ b2 are independent, hence: βb0 − β0 ∼ tn−2 . E.S.E.(βb0 ) 350 ∼ χ2n−2 . A.6. Inference for parameters in normal regression models iii. βb1 and σ b2 are independent, hence: βb1 − β1 ∼ tn−2 . E.S.E.(βb1 ) Confidence intervals for the simple linear regression model parameters A 100(1 − α)% confidence interval for β0 is: βb0 − tα/2, n−2 × E.S.E.(βb0 ), βb0 + tα/2, n−2 × E.S.E.(βb0 ) and a 100(1 − α)% confidence interval for β1 is: βb1 − tα/2, n−2 × E.S.E.(βb1 ), βb1 + tα/2, n−2 × E.S.E.(βb1 ) where tα, k denotes the top 100αth percentile of the Student’s tk distribution, obtained from Table 10 of the New Cambridge Statistical Tables. Tests for the regression slope The relationship between y and x in the regression model hinges on β1 . If β1 = 0, then y ∼ N (β0 , σ 2 ). To validate the use of the regression model, we need to make sure that β1 6= 0, or more practically that βb1 is significantly non-zero. This amounts to testing: H0 : β1 = 0 vs. H1 : β1 6= 0. Under H0 , the test statistic is: T = βb1 E.S.E.(βb1 ) ∼ tn−2 . At the 100α% significance level, we reject H0 if |t| > tα/2, n−2 , where t is the observed test statistic value. Alternatively, we could use H1 : β1 < 0 or H1 : β1 > 0 if there was a rationale for doing so. In such cases, we would reject H0 if t < −tα, n−2 and t > tα, n−2 for the lower-tailed and upper-tailed t tests, respectively. Some remarks are the following. i. For testing H0 : β1 = b for a given constant b, the above test still applies, but now with the following test statistic: T = βb1 − b . E.S.E.(βb1 ) 351 A. Linear regression (non-examinable) ii. Tests for the regression intercept β0 may be constructed in a similar manner, replacing β1 and βb1 with β0 and βb0 , respectively. In the normal regression model, the LSEs βb0 and βb1 are also the MLEs of β0 and β1 , respectively. Since εi = yi − β0 − β1 xi ∼IID N (0, σ 2 ), the likelihood function is: 2 L(β0 , β1 , σ ) = n Y i=1 ∝ 1 1 √ exp − 2 (yi − β0 − β1 xi )2 2σ 2πσ 2 1 σ2 n/2 " # n 1 X exp − 2 (yi − β0 − β1 xi )2 . 2σ i=1 Hence the log-likelihood function is: n l(β0 , β1 , σ ) = log 2 2 1 σ2 − n 1 X (yi − β0 − β1 xi )2 + c. 2σ 2 i=1 Therefore, for any β0 , β1 and σ 2 > 0, we have: l β0 , β1 , σ 2 ≤ l βb0 , βb1 , σ 2 . b b Hence β0 , β1 are the MLEs of (β0 , β1 ). To find the MLE of σ 2 , we need to maximise: n n 1 1 X 2 b b l(σ ) = l β0 , β1 , σ = log (yi − βb0 − βb1 xi )2 . 
− 2 2 σ2 2σ i=1 2 Setting u = 1/σ 2 , it is equivalent to maximising: g(u) = n log u − ub where b = n P yi − βb0 − βb1 xi 2 . i=1 Setting dg(u)/du = n/b u − b = 0, u b = n/b, i.e. g(u) attains its maximum at u = u b. Hence the MLE of σ 2 is: n σ e2 = 2 1 b 1 X = = yi − βb0 − βb1 xi . u b n n i=1 Note the MLE σ e2 is a biased estimator of σ 2 . In practice, we often use the unbiased estimator: n 2 1 X 2 b b yi − β0 − β1 xi . σ b = n − 2 i=1 We now consider an empirical example of the normal simple linear regression model. 352 A.6. Inference for parameters in normal regression models Example 1.4 A dataset contains the annual cigarette consumption, x, and the corresponding mortality rate, y, due to coronary heart disease (CHD) of 21 countries. Some useful summary statistics calculated from the data are: 21 X xi = 45,110, i=1 21 X 21 X yi = 3,042.2, i=1 21 X yi2 x2i = 109,957,100, i=1 = 529,321.58 and i=1 21 X xi yi = 7,319,602. i=1 Do these data support the suspicion that smoking contributes to CHD mortality? (Note the assertion ‘smoking is harmful for health’ is largely based on statistical, rather than laboratory, evidence.) We fit the regression model y = β0 + β1 x + ε. Our least squares estimates of β1 and β0 are, respectively: P P P P P x y − x y − nx̄ȳ (x − x̄)(y − ȳ) i i i i i i i i xi j yj /n i P P P = = βb1 = i P 2 2 2 2 2 i (xi − x̄) i xi − nx̄ i xi − ( i xi ) /n = 7319602 − 45110 × 3042.2/21 109957100 − (45110)2 /21 = 0.06 and: 3042.2 − 0.06 × 45110 βb0 = ȳ − βb1 x̄ = = 15.77. 21 Also: − βb0 − βb1 xi )2 σ b = n−2 X X X X X 2 2 2 2 b b b b b b = yi + nβ0 + β1 xi − 2β0 yi − 2β1 xi y i + 2 β 0 β 1 xi /(n − 2) 2 P i (yi = 2181.66. We now proceed to test H0 : β1 = 0 vs. H1 : β1 > 0. (If indeed smoking contributes to CHD mortality, then β1 > 0.) We have calculated βb1 = 0.06. However, is this deviation from zero due to sampling error, or is it significantly different from zero? (The magnitude of βb1 itself is not important in determining if β1 = 0 or not – changing the scale of x may make βb1 arbitrarily small.) Under H0 , the test statistic is: T = βb1 E.S.E.(βb1 ) ∼ tn−2 = t19 P 1/2 where E.S.E.(βb1 ) = σ b/ ( i (xi − x̄)2 ) = 0.01293. Since t = 0.06/0.01293 = 4.64 > 2.54 = t0.01, 19 , we reject the hypothesis β1 = 0 at the 1% significance level and we conclude that there is strong evidence that smoking contributes to CHD mortality. 353 A. Linear regression (non-examinable) A.7 Regression ANOVA In Chapter 10 we discussed ANOVA, whereby we decomposed the total variation of a continuous dependent variable. In a similar way we can decompose the total variation of y in the simple linear regression model. It can be shown that the regression ANOVA decomposition is: n X 2 (yi − ȳ) = i=1 n X βb12 (xi − x̄)2 + i=1 n X yi − βb0 − βb1 xi 2 i=1 where, denoting sum of squares by ‘SS’, we have the following. Total SS is n P (yi − ȳ)2 = i=1 n P yi2 − nȳ 2 . i=1 n n P P 2 2 2 2 2 b b Regression (explained) SS is β1 (xi − x̄) = β1 xi − nx̄ . i=1 Residual (error) SS is n P yi − βb0 − βb1 xi i=1 2 = Total SS − Regression SS. i=1 If εi ∼ N (0, σ 2 ) and β1 = 0, then it can be shown that: n P (yi − ȳ)2 /σ 2 ∼ χ2n−1 i=1 n P βb12 (xi − x̄)2 /σ 2 ∼ χ21 i=1 n P yi − βb0 − βb1 xi 2 /σ 2 ∼ χ2n−2 . i=1 Therefore, under H0 : β1 = 0, we have: n P (n − 2) βb12 (xi − x̄)2 (Regression SS)/1 i=1 F = = P 2 = n (Residual SS)/(n − 2) yi − βb0 − βb1 xi βb1 E.S.E.(βb1 ) !2 ∼ F1, n−2 . 
i=1 We reject H0 at the 100α% significance level if f > Fα, 1, n−2 , where f is the observed test statistic value and Fα, 1, n−2 is the top 100αth percentile of the F1, n−2 distribution, obtained from Table 12 of the New Cambridge Statistical Tables. A useful statistic is the coefficient of determination, denoted as R2 , defined as: R2 = Residual SS Regression SS =1− . Total SS Total SS If we view Total SS as the total variation (or energy) of y, then R2 is the proportion of the total variation of y explained by x. Note that R2 ∈ [0, 1]. The closer R2 is to 1, the better the explanatory power of the regression model. 354 A.8. Confidence intervals for E(y) A.8 Confidence intervals for E(y) Based on the observations (xi , yi ), for i = 1, . . . , n, we fit a regression model: yb = βb0 + βb1 x. Our goal is to predict the unobserved y corresponding to a known x. The point prediction is: yb = βb0 + βb1 x. For the analysis to be more informative, we would like to have some ‘error bars’ for our prediction. We introduce two methods as follows. A confidence interval for µ(x) = E(y) = β0 + β1 x. A prediction interval for y. A confidence interval is an interval estimator of an unknown parameter (i.e. for a constant) while a prediction interval is for a random variable. They are different and serve different purposes. We assume the model is normal, i.e. ε = y − β0 − β1 x ∼ N (0, σ 2 ) and let µ b(x) = βb0 + βb1 x, such that µ b(x) is an unbiased estimator of µ(x). We note without proof that: n P 2 (xi − x) σ 2 i=1 . µ b(x) ∼ N µ(x), n n P (xj − x̄)2 j=1 Standardising gives: µ b(x) − µ(x) v u u t(σ 2 /n) n P (xi − x)2 / i=1 n P ! ∼ N (0, 1). (xj − x̄)2 j=1 In practice σ 2 is unknown, but it can be shown that (n − 2) σ b2 /σ 2 ∼ χ2n−2 , where n P σ b2 = (yi − βb0 − βb1 xi )2 /(n − 2). Furthermore, µ b(x) and σ b2 are independent. Hence: i=1 µ b(x) − µ(x) v u u t(b σ 2 /n) n P (xi − x)2 / i=1 n P ! ∼ tn−2 . (xj − x̄)2 j=1 355 A. Linear regression (non-examinable) Confidence interval for µ(x) A 100(1 − α)% confidence interval for µ(x) is: 1/2 2 (x − x) i=1 i . µ b(x) ± tα/2, n−2 × σ b× n P 2 n (xj − x̄) n P j=1 Such a confidence interval contains the true expectation E(y) = µ(x) with probability 1 − α over repeated samples. It does not cover y with probability 1 − α. A.9 Prediction intervals for y A 100(1 − α)% prediction interval is an interval which contains y with probability 1 − α. We may assume that the y to be predicted is independent of y1 , . . . , yn used in the estimation of the regression model. Hence y − µ b(x) is normal with mean 0 and variance: n P (xi − x)2 2 σ i=1 Var(y) + Var (b µ(x)) = σ 2 + . n n P 2 (xj − x̄) j=1 Therefore: . (y − µ b(x)) b2 σ 1 + n P 2 1/2 (xi − x) (xj − x̄)2 i=1 n P n ∼ tn−2 . j=1 Prediction interval for y A 100(1 − α)% prediction interval covering y with probability 1 − α is: µ b(x) ± tα/2, n−2 × σ b× 1 + n P j=1 356 1/2 (xi − x) (xj − x̄)2 i=1 n P n 2 . A.9. Prediction intervals for y Some remarks are the following. i. It holds that: P y 1/2 (xi − x) i=1 ∈µ b(x) ± tα/2, n−2 × σ b× = 1 − α. n 1 + P 2 n (xj − x̄) n P 2 j=1 ii. The prediction interval for y is wider than the confidence interval for E(y). The former contains the unobserved random variable y with probability 1 − α, the latter contains the unknown constant E(y) with probability 1 − α over repeated samples. Example 1.5 A dataset contains the prices (y, in $000s) of 100 three-year-old Ford Tauruses together with their mileages (x, in thousands of miles) when they were sold at auction. 
Based on these data, a car dealer needs to make two decisions. 1. To prepare cash for bidding on one three-year-old Ford Taurus with a mileage of x = 40. 2. To prepare buying several three-year-old Ford Tauruses with mileages close to x = 40 from a rental company. For the first task, a prediction interval would be more appropriate. For the second task, the car dealer needs to know the average price and, therefore, a confidence interval is appropriate. This can be easily done using R. > reg <- lm(Price~ Mileage) > summary(reg) Call: lm(formula = Price ~ Mileage) Residuals: Min 1Q -0.68679 -0.27263 Median 0.00521 3Q 0.23210 Max 0.70071 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 17.248727 0.182093 94.72 <2e-16 *** Mileage -0.066861 0.004975 -13.44 <2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 0.3265 on 98 degrees of freedom Multiple R-squared: 0.6483, Adjusted R-squared: 0.6447 F-statistic: 180.6 on 1 and 98 DF, p-value: < 2.2e-16 357 A. Linear regression (non-examinable) > new.Mileage <- data.frame(Mileage = c(40)) > predict(reg, newdata = new.Mileage, int = "c") fit lwr upr 1 14.57429 14.49847 14.65011 > predict(reg, newdata = new.Mileage, int = "p") fit lwr upr 1 14.57429 13.92196 15.22662 We predict that a Ford Taurus will sell for between $13,922 and $15,227. The average selling price of several three-year-old Ford Tauruses is estimated to be between $14,498 and $14,650. Because predicting the selling price for one car is more difficult, the corresponding prediction interval is wider than the confidence interval. To produce the plots with confidence intervals for E(y) and prediction intervals for y, we proceed as follows: pc <- predict(reg,int="c") pp <- predict(reg,int="p") plot(Mileage,Price,pch=16) matlines(Mileage,pc) matlines(Mileage,pp) 15.0 13.5 14.0 14.5 Price 15.5 16.0 16.5 > > > > > 20 25 30 35 40 45 50 Mileage A.10 Multiple linear regression models For most practical problems, the variable of interest, y, typically depends on several explanatory variables, say x1 , . . . , xp , leading to the multiple linear regression model. In this course we only provide a brief overview of the multiple linear regression model. EC2020 Elements of econometrics will explore this model in much greater depth. 358 A.10. Multiple linear regression models Let (yi , xi1 , . . . , xip ), for i = 1, . . . , n, be observations from the model: yi = β0 + β1 xi1 + · · · + βp xip + εi where: E(εi ) = 0, Var(εi ) = σ 2 > 0 and Cov(εi , εj ) = 0 for all i 6= j. The multiple linear regression model is a natural extension of the simple linear regression model, just with more parameters: β0 , β1 , . . . , βp and σ 2 . Treating all of the xij s as constants as before, we have: and Var(yi ) = σ 2 . E(yi ) = β0 + β1 xi1 + · · · + βp xip y1 , . . . , yn are uncorrelated with each other, again as before. If in addition εi ∼ N (0, σ 2 ), then: yi ∼ N β0 + p X ! βj xij , σ 2 . j=1 Estimation of the intercept and slope parameters is still performed using least squares estimation. The LSEs βb0 , βb1 , . . . , βbp are obtained by minimising: n X yi − β0 − p X i=1 !2 βj xij j=1 leading to the fitted regression model: yb = βb0 + βb1 x1 + · · · + βbp xp . The residuals are expressed as: εbi = yi − βb0 − p X βbj xij . j=1 Just as with the simple linear regression model, we can decompose the total variation of y such that: n n n X X X 2 2 (yi − ȳ) = (b yi − ȳ) + εb2i i=1 i=1 i=1 or, in words: Total SS = Regression SS + Residual SS. 
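This decomposition can be checked numerically for any fitted model. The following R sketch uses simulated data (illustrative only) together with the fitted() and residuals() extractor functions.

# Sketch: verifying Total SS = Regression SS + Residual SS for a fitted
# multiple regression, using simulated data.
set.seed(2)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

total.ss      <- sum((y - mean(y))^2)
regression.ss <- sum((fitted(fit) - mean(y))^2)
residual.ss   <- sum(residuals(fit)^2)

all.equal(total.ss, regression.ss + residual.ss)   # TRUE: the decomposition holds
regression.ss / total.ss                           # R^2, as reported by summary(fit)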
An unbiased estimator of σ 2 is: n p X X 1 σ b2 = yi − βb0 − βbj xij n − p − 1 i=1 j=1 !2 = Residual SS . n−p−1 We can test a single slope coefficient by testing: H0 : βi = 0 vs. H1 : βi 6= 0. 359 A. Linear regression (non-examinable) Under H0 , the test statistic is: T = βbi E.S.E.(βbi ) ∼ tn−p−1 and we reject H0 if |t| > tα/2, n−p−1 . However, note the slight difference in the interpretation of the slope coefficient βj . In the multiple regression setting, βj is the effect of xj on y, holding all other independent variables fixed – this is unfortunately not always practical. It is also possible to test whether all the regression coefficients are equal to zero. This is known as a joint test of significance and can be used to test the overall significance of the regression model, i.e. whether there is at least one significant explanatory (independent) variable, by testing: H0 : β1 = · · · = βp = 0 vs. H1 : At least one βi 6= 0. Indeed, it is preferable to perform this joint test of significance before conducting t tests of individual slope coefficients. Failure to reject H0 would render the model useless and hence the model would not warrant any further statistical investigation. Provided εi ∼ N (0, σ 2 ), under H0 : β1 = · · · = βp = 0, the test statistic is: F = (Regression SS)/p ∼ Fp, n−p−1 . (Residual SS)/(n − p − 1) We reject H0 at the 100α% significance level if f > Fα, p, n−p−1 . It may be shown that: Regression SS = n X i=1 2 (b yi − ȳ) = n X 2 b b β1 (xi1 − x̄1 ) + · · · + βp (xip − x̄p ) . i=1 Hence, under H0 , f should be very small. We now conclude the chapter with worked examples of linear regression using R. A.11 Regression using R To solve practical regression problems, we need to use statistical computing packages. All of them include linear regression analysis. In fact all statistical packages, such as R, make regression analysis much easier to use. Example 1.6 We illustrate the use of linear regression in R using the dataset introduced in Example 1.1. 360 A.11. Regression using R > reg <- lm(Sales ~ Student.population) > summary(reg) Call: lm(formula = Sales ~ Student.population) Residuals: Min 1Q Median -21.00 -9.75 -3.00 3Q 11.25 Max 18.00 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 60.0000 9.2260 6.503 0.000187 *** Student.population 5.0000 0.5803 8.617 2.55e-05 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 13.83 on 8 degrees of freedom Multiple R-squared: 0.9027, Adjusted R-squared: 0.8906 F-statistic: 74.25 on 1 and 8 DF, p-value: 2.549e-05 The fitted line is yb = 60 + 5x. We have σ b2 = (13.83)2 . Also, βb0 = 60 and E.S.E.(βb0 ) = 9.2260. βb1 = 5 and E.S.E.(βb1 ) = 0.5803. For testing H0 : β0 = 0 we have t = βb0 /E.S.E.(βb0 ) = 6.503. The p-value is P (|T | > 6.503) = 0.000187, where T ∼ tn−2 . For testing H0 : β1 = 0 we have t = βb1 /E.S.E.(βb1 ) = 8.617. The p-value is P (|T | > 8.617) = 0.0000255, where T ∼ tn−2 . The F test statistic value is 74.25 with a corresponding p-value of: P (F > 74.25) = 0.00002549 where F1, 8 . Example 1.7 We apply the simple linear regression model to study the relationship between two series of financial returns – a regression of Cisco Systems stock returns, y, on S&P500 Index returns, x. This regression model is an example of the capital asset pricing model (CAPM). Stock returns are defined as: current price − previous price current price return = ≈ log previous price previous price when the difference between the two prices is small. 
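As a quick illustration of this approximation, the following R sketch computes simple and log returns from a short, made-up price series; the two are nearly identical when the day-to-day changes are small.

# Sketch: simple versus log returns from an illustrative price series.
price <- c(100.0, 101.2, 100.5, 102.0, 101.1)

simple.return <- 100 * diff(price) / head(price, -1)   # (current - previous) / previous, in %
log.return    <- 100 * diff(log(price))                # log(current / previous), in %

round(cbind(simple.return, log.return), 3)   # nearly identical for small changes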
A dataset contains daily returns over the period 3 January – 29 December 2000 (i.e. n = 252 observations). The dataset has 5 columns: Day, S&P500 return, Cisco return, Intel return and Sprint return. Daily prices are definitely not independent. However, daily returns may be seen as a sequence of uncorrelated random variables. 361 A. Linear regression (non-examinable) > summary(S.P500) Min. 1st Qu. Median Mean -6.00451 -0.85028 -0.03791 -0.04242 3rd Qu. 0.79869 Max. 4.65458 > summary(Cisco) Min. 1st Qu. -13.4387 -3.0819 3rd Qu. 2.6363 Max. 15.4151 Median -0.1150 Mean -0.1336 For the S&P500, the average daily return is −0.04%, the maximum daily return is 4.46%, the minimum daily return is −6.01% and the standard deviation is 1.40%. For Cisco, the average daily return is −0.13%, the maximum daily return is 15.42%, the minimum daily return is −13.44% and the standard deviation is 4.23%. We see that Cisco is much more volatile than the S&P500. −10 −5 0 5 10 15 > sandpts <- ts(S.P500) > ciscots <- ts(Cisco) > ts.plot(sandpts,ciscots,col=c(1:2)) 0 50 100 150 200 250 Time There is clear synchronisation between the movements of the two series of returns, as evident from examining the sample correlation coefficient. > cor.test(S.P500,Cisco) Pearson’s product-moment correlation data: S.P500 and Cisco t = 14.943, df = 250, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 362 A.11. Regression using R 95 percent confidence interval: 0.6155530 0.7470423 sample estimates: cor 0.686878 We fit the regression model: Cisco = β0 + β1 S&P500 + ε. Our rationale is that part of the fluctuation in Cisco returns was driven by the fluctuation in the S&P500 returns. R produces the following regression output. > reg <- lm(Cisco ~ S.P500) > summary(reg) Call: lm(formula = Cisco ~ S.P500) Residuals: Min 1Q -13.1175 -2.0238 Median 0.0091 3Q 2.0614 Max 9.9491 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.04547 0.19433 -0.234 0.815 S.P500 2.07715 0.13900 14.943 <2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 3.083 on 250 degrees of freedom Multiple R-squared: 0.4718, Adjusted R-squared: 0.4697 F-statistic: 223.3 on 1 and 250 DF, p-value: < 2.2e-16 The estimated slope is βb1 = 2.07715. The null hypothesis H0 : β1 = 0 is rejected with a p-value of 0.000 (to three decimal places). Therefore, the test is extremely significant. Our interpretation is that when the market index goes up by 1%, Cisco stock goes up by 2.07715%, on average. However, the error term ε in the model is large with an estimated σ b = 3.083%. The p-value for testing H0 : β0 = 0 is 0.815, so we cannot reject the hypothesis that β0 = 0. Recall βb0 = ȳ − βb1 x̄ and both ȳ and x̄ are very close to 0. R2 = 47.18%, hence 47.18% of the variation of Cisco stock may be explained by the variation of the S&P500 index, or, in other words, 47.18% of the risk in Cisco stock is the market-related risk. The capital asset pricing model (CAPM) is a simple asset pricing model in finance given by: yi = β0 + β1 xi + εi where yi is a stock return and xi is a market return at time i. 363 A. Linear regression (non-examinable) The total risk of the stock is: n n n 1X 1X 1X (yi − ȳ)2 = (b yi − ȳ)2 + (yi − ybi )2 . n i=1 n i=1 n i=1 The market-related (or systematic) risk is: n n 1X 1 b2 X 2 (b yi − ȳ) = β1 (xi − x̄)2 . n i=1 n i=1 The firm-specific risk is: n 1X (yi − ybi )2 . n i=1 Some remarks are the following. i. β1 measures the market-related (or systematic) risk of the stock. ii. 
Market-related risk is unavoidable, while firm-specific risk may be ‘diversified away’ through hedging. iii. Variance is a simple measure (and one of the most frequently-used) of risk in finance. Example 1.8 A dataset illustrates the effects of marketing instruments on the weekly sales volume of a certain food product over a three-year period. Data are real but transformed to protect the innocent! There are observations on the following four variables: y = LVOL: logarithms of weekly sales volume x1 = PROMP : promotion price x2 = FEAT : feature advertising x3 = DISP : display measure. R produces the following descriptive statistics. > summary(Foods) LVOL Min. :13.83 1st Qu.:14.08 Median :14.24 Mean :14.28 3rd Qu.:14.43 Max. :15.07 PROMP Min. :3.075 1st Qu.:3.330 Median :3.460 Mean :3.451 3rd Qu.:3.560 Max. :3.865 FEAT Min. : 2.84 1st Qu.:15.95 Median :22.99 Mean :24.84 3rd Qu.:33.49 Max. :57.10 DISP Min. :12.42 1st Qu.:20.59 Median :25.11 Mean :25.31 3rd Qu.:29.34 Max. :45.94 n = 156. The values of FEAT and DISP are much larger than LVOL. 364 A.11. Regression using R As always, first we plot the data to ascertain basic characteristics. 14.4 13.8 14.0 14.2 LVOLts 14.6 14.8 15.0 > LVOLts <- ts(LVOL) > ts.plot(LVOLts) 0 50 100 150 Time The time series plot indicates momentum in the data. Next we show scatterplots between y and each xi . 14.4 14.2 14.0 13.8 LVOL 14.6 14.8 15.0 > plot(PROMP,LVOL,pch=16) 3.2 3.4 3.6 3.8 PROMP 365 A. Linear regression (non-examinable) 14.4 13.8 14.0 14.2 LVOL 14.6 14.8 15.0 > plot(FEAT,LVOL,pch=16) 10 20 30 40 50 FEAT 14.4 13.8 14.0 14.2 LVOL 14.6 14.8 15.0 > plot(DISP,LVOL,pch=16) 15 20 25 30 35 40 45 DISP What can we observe from these pairwise plots? There is a negative correlation between LVOL and PROMP. There is a positive correlation between LVOL and FEAT. There is little or no correlation between LVOL and DISP, but this might have been blurred by the other input variables. 366 A.11. Regression using R Therefore, we should regress LVOL on PROMP and FEAT first. We run a multiple linear regression model using x1 and x2 as explanatory variables: y = β0 + β1 x1 + β2 x2 + ε. > reg <- lm(LVOL~PROMP + FEAT) > summary(reg) Call: lm(formula = LVOL ~ PROMP + FEAT) Residuals: Min 1Q Median -0.32734 -0.08519 -0.01011 3Q 0.08471 Max 0.30804 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 17.1500102 0.2487489 68.94 <2e-16 PROMP -0.9042636 0.0694338 -13.02 <2e-16 FEAT 0.0100666 0.0008827 11.40 <2e-16 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 *** *** *** 1 Residual standard error: 0.1268 on 153 degrees of freedom Multiple R-squared: 0.756, Adjusted R-squared: 0.7528 F-statistic: 237 on 2 and 153 DF, p-value: < 2.2e-16 We begin by performing a joint test of significance by testing H0 : β1 = β2 = 0. The test statistic value is given in the regression ANOVA table as f = 237, with a corresponding p-value of 0.000 (to three decimal places). Hence H0 is rejected and we have strong evidence that at least one slope coefficient is not equal to zero. Next we consider individual t tests of H0 : β1 = 0 and H0 : β2 = 0. The respective test statistic values are −13.02 and 11.40, both with p-values of 0.000 (to three decimal places) indicating that both slope coefficients are non-zero. Turning to the estimated coefficients, βb1 = −0.904 (to three decimal places) which indicates that LVOL decreases as PROMP increases controlling for FEAT. Also, βb2 = 0.010 (to three decimal places) which indicates that LVOL increases as FEAT increases, controlling for PROMP. 
We could also compute 95% confidence intervals, given by: βbi ± t0.025, n−3 × E.S.E.(βbi ). Since n − 3 = 153 is large, t0.025, n−3 ≈ z0.025 = 1.96. R2 = 0.756. Therefore, 75.6% of the variation of LVOL can be explained (jointly) with PROMP and FEAT. However, a large R2 does not necessarily mean that the fitted model is useful. For the estimation of coefficients and predicting y, the absolute measure ‘Residual SS’ (or σ b2 ) plays a critical role in determining the accuracy of the model. 367 A. Linear regression (non-examinable) Consider now introducing DISP into the regression model to give three explanatory variables: y = β0 + β1 x1 + β2 x2 + β3 x3 + ε. The reason for adding the third variable is that one would expect DISP to have an impact on sales and we may wish to estimate its magnitude. > reg <- lm(LVOL~PROMP + FEAT + DISP) > summary(reg) Call: lm(formula = LVOL ~ PROMP + FEAT + DISP) Residuals: Min 1Q Median -0.33363 -0.08203 -0.00272 3Q 0.07927 Max 0.33812 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 17.2372251 0.2490226 69.220 <2e-16 PROMP -0.9564415 0.0726777 -13.160 <2e-16 FEAT 0.0101421 0.0008728 11.620 <2e-16 DISP 0.0035945 0.0016529 2.175 0.0312 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 *** *** *** * 1 Residual standard error: 0.1253 on 152 degrees of freedom Multiple R-squared: 0.7633, Adjusted R-squared: 0.7587 F-statistic: 163.4 on 3 and 152 DF, p-value: < 2.2e-16 All the estimated coefficients have the right sign (according to commercial common sense!) and are statistically significant. In particular, the relationship with DISP seems real when the other inputs are taken into account. On the other hand, the addition of DISP to the b, from √ model has resulted in a very small reduction in σ √ 0.0161 = 0.1268 to 0.0157 = 0.1253, and correspondingly a slightly higher R2 (0.7633, i.e. 76.33% of the variation of LVOL is explained by the model). Therefore, DISP contributes very little to ‘explaining’ the variation of LVOL after the other two explanatory variables, PROMP and FEAT, are taken into account. Intuitively, we would expect a higher R2 if we add a further explanatory variable to the model. However, the model has become more complex as a result – there is an additional parameter to estimate. Therefore, strictly speaking, we should consider the ‘adjusted R2 ’ statistic, although this will not be considered in this course. 368 A.12. Overview of chapter A.12 Overview of chapter This chapter has covered the linear regression model with one or more explanatory variables. Least squares estimators were derived for the simple linear regression model, and statistical inference procedures were also covered. The multiple linear regression model and applications using R concluded the chapter. A.13 Key terms and concepts ANOVA decomposition Confidence interval Independent variable Least squares estimation Multiple linear regression Regression analysis Residual Slope coefficient Coefficient of determination Dependent variable Intercept Linear estimators Prediction interval Regressor Simple linear regression 369 A. Linear regression (non-examinable) 370 Appendix B Non-examinable proofs B.1 Chapter 2 – Probability theory For the empty set, ∅, we have: P (∅) = 0. Proof : Since ∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅, Axiom 3 gives: P (∅) = P (∅ ∪ ∅ ∪ · · · ) = ∞ X P (∅). i=1 However, the only real number for P (∅) which satisfies this is P (∅) = 0. If A1 , A2 , . . . , An are pairwise disjoint, then: P n [ ! Ai i=1 = n X P (Ai ). 
i=1 Proof : In Axiom 3, set An+1 = An+2 = · · · = ∅, so that: P ∞ [ i=1 ! Ai = ∞ X i=1 P (Ai ) = n X P (Ai ) + i=1 ∞ X i=n+1 P (Ai ) = n X P (Ai ) i=1 since P (Ai ) = P (∅) = 0 for i = n + 1, n + 2, . . .. B.2 Chapter 3 – Random variables For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . ., and 0 otherwise. 371 B. Non-examinable proofs The expected value of X is then: E(X) = X xi p(xi ) = xi ∈S ∞ X x (1 − π)x π x=0 (starting from x = 1) = ∞ X x (1 − π)x π x=1 = (1 − π) ∞ X x (1 − π)x−1 π x=1 (using y = x − 1) = (1 − π) ∞ X (y + 1)(1 − π)y π y=0 ∞ ∞ X X y y = (1 − π) y (1 − π) π + (1 − π) π y=0 y=0 | {z } | {z } =1 = E(X) = (1 − π) [E(X) + 1] = (1 − π) E(X) + (1 − π) from which we can solve: E(X) = 1−π 1−π = . 1 − (1 − π) π Suppose X is a random variable and a and b are constants, i.e. known numbers which are not random variables. Therefore: E(aX + b) = a E(X) + b. Proof : We have: E(aX + b) = X (ax + b) p(x) x = X ax p(x) + x =a X b p(x) x X x p(x) + b x X p(x) x = a E(X) + b where the last step follows from: P i. x p(x) = E(X), by definition of E(X) x ii. P x 372 p(x) = 1, by definition of the probability function. B.3. Chapter 5 – Multivariate random variables If X is a random variable and a and b are constants, then: Var(aX + b) = a2 Var(X). Proof: Var(aX + b) = E ((aX + b) − E(aX + b))2 = E (aX + b − a E(X) − b)2 = E (aX − a E(X))2 = E a2 (X − E(X))2 = a2 E (X − E(X))2 = a2 Var(X). Therefore, sd(aX + b) = |a| sd(X). B.3 Chapter 5 – Multivariate random variables We can now prove some results which were stated earlier. Recall: Var(X) = E(X 2 ) − (E(X))2 . Proof: Var(X) = E[(X − E(X))2 ] = E[X 2 − 2 E(X)X + (E(X))2 ] = E(X 2 ) − 2 E(X) E(X) + (E(X))2 = E(X 2 ) − 2 (E(X))2 + (E(X))2 = E(X 2 ) − (E(X))2 using (5.3), with X1 = X 2 , X2 = X, a1 = 1, a2 = −2 E(X) and b = (E(X))2 . Recall: Cov(X, Y ) = E(XY ) − E(X) E(Y ). Proof: Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))] = E[XY − E(Y )X − E(X)Y + E(X) E(Y )] = E(XY ) − E(Y ) E(X) − E(X) E(Y ) + E(X) E(Y ) = E(XY ) − E(X) E(Y ) using (5.3), with X1 = XY , X2 = X, X3 = Y , a1 = 1, a2 = −E(Y ), a3 = −E(X) and b = E(X) E(Y ). 373 B. Non-examinable proofs Recall that if X and Y are independent, then: Cov(X, Y ) = Corr(X, Y ) = 0. Proof: Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(X) E(Y ) − E(X) E(Y ) = 0 since E(XY ) = E(X) E(Y ) when X and Y are independent. Since Corr(X, Y ) = Cov(X, Y )/[sd(X) sd(Y )], Corr(X, Y ) = 0 whenever Cov(X, Y ) = 0. 374 Appendix C Solutions to Sample examination questions C.1 Chapter 2 – Probability theory 1. (a) True. We have: P (A) + P (B) > P (A) + P (B) − P (A) P (B) = P (A ∪ B). (b) True. Note that P (A | B) = P (A | B c ) implies: P (A ∩ B) P (A ∩ B c ) = P (B) 1 − P (B) or: P (A ∩ B) (1 − P (B)) = (P (A) − P (A ∩ B)) P (B) which implies P (A ∩ B) = P (A) P (B). (c) False. Consider, for example, throwing a die and letting A be the event that the throw results in a 1 and B be the event that the throw is 2. 2. There are 10 ways in which A and B can choose their seats, of which 9 are pairs 2 of adjacent seats. Therefore, using the total probability formula, the probability that A and B are adjacent is: 9 1 10 = . 5 2 3. (a) Judge 1 can either correctly vote guilty a guilty defendant, or incorrectly vote guilty a not guilty defendant. Therefore, the probability is given by: 0.9 × 0.7 + 0.25 × 0.3 = 0.705. 
(b) This conditional probability is given by: P (Judge 3 votes guilty | Judges 1 and 2 vote not guilty) P (Judge 3 votes guilty, and Judges 1 and 2 vote not guilty) = P (Judges 1 and 2 vote not guilty) = 0.7 × 0.9 × (0.1)2 + 0.3 × 0.25 × (0.75)2 0.7 × (0.1)2 + 0.3 × (0.75)2 = 0.2759. 375 C. Solutions to Sample examination questions C.2 Chapter 3 – Random variables 1. (a) Let the other value be x, then: E(X) = 0 = 1 × 0.2 + 2 × 0.5 + x × 0.3 and hence the other value is −4. (b) Since E(X) = 0, the variance of X is given by: Var(X) = E(X 2 ) − (E(X))2 = E(X 2 ) = (−4)2 × 0.3 + 12 × 0.2 + 22 × 0.5 = 7. 2. (a) Since R f (x) dx = 1, we have: 1 Z 0 1 kx3 kx dx = 3 2 = 0 k =1 3 and so k = 3. (b) We have: 1 Z 1 Z x f (x) dx = E(X) = 0 0 3x4 3x dx = 4 3 1 = 0 3 4 and: 1 Z 2 Z 2 E(X ) = 1 x f (x) dx = 0 0 3x5 3x dx = 5 4 1 0 3 = . 5 Hence: 3 Var(X) = E(X ) − (E(X)) = − 5 2 2 2 3 3 = = 0.0375. 4 80 3. (a) We compute: Z 1 x2 (1 − x) dx = 0 1 1 1 − = 3 4 12 so k = 12. (b) We first find: E 1 X 1 Z x(1 − x) dx = 12 = 12 0 1 1 − 2 3 Also: E 1 X2 Z 1 (1 − x) dx = 6. = 12 0 Therefore: Var 376 1 X = 6 − 22 = 2. = 2. C.3. Chapter 4 – Common distributions of random variables C.3 Chapter 4 – Common distributions of random variables 1. We have Y ∼ Bin(10, 0.25), hence: P (Y ≥ 2) = 1 − P (Y = 0) − P (Y = 1) = 1 − (0.75)10 − 10 × 0.25 × (0.75)9 = 0.7560. 2. X ∼ Exp(1), hence 1 − F (x) = e−x . Therefore: p = P (X > x0 + 1 | X > x0 ) = P ({X > x0 + 1} ∩ {X > x0 }) P (X > x0 ) = P (X > x0 + 1) P (X > x0 ) = e−(x0 +1) e−x0 = e−1 . 3. We have that X ∼ N (1, 4). Using the definition of conditional probability, and standardising with Z = (X − 1)/2, we have: P (X > 3 | X < 5) = C.4 P (3 < X < 5) P (1 < Z < 2) 0.9772 − 0.8413 = = = 0.1391. P (X < 5) P (Z < 2) 0.9772 Chapter 5 – Multivariate random variables 1. (a) All probabilities must be in the interval [0, 1], hence α ∈ [0, 1/2]. (b) From the definition of U , the only values U can take are 0 and 1/3. U = 0 only when X = 0 and Y = 0. We have: 1 P (U = 0) = P (X = 0, Y = 0) = 4 and: 1 3 P U= = 1 − P (U = 0) = 3 4 therefore: 1 1 3 1 E(U ) = 0 × + × = . 4 3 4 4 Similarly, from the definition of V , the only values V can take are 0 and 1. V = 1 only when X = 1 and Y = 1. We have: P (V = 1) = P (X = 1, Y = 1) = and: P (V = 0) = 1 − P (V = 1) = hence: E(V ) = 0 × 1 4 3 4 3 1 1 +1× = . 4 4 4 377 C. Solutions to Sample examination questions (c) U and V are not independent since not all joint probabilities are equal to the product of the respective marginal probabilities. For example, one sufficient case to disprove independence is noting that P (U = 0, V = 0) = 0 whereas P (U = 0) P (V = 0) > 0. 2. (a) Due to independence, the amount of coffee in 5 cups, X, follows a normal distribution with mean 5 × 150 = 750 and variance 5 × (10)2 = 500, i.e. X ∼ N (750, 500). Therefore: −50 P (X > 700) = P Z > √ = P (Z > −2.24) = 0.98745 500 using Table 4 of the New Cambridge Statistical Tables. (b) Due to independence, the difference in the amounts between two cups, D, follows a normal distribution with mean 150 − 150 = 0 and variance (10)2 + (10)2 = 200, i.e. D ∼ N (0, 200). Hence: 20 −20 <Z< √ = P (−1.41 < Z < 1.41) P (|D| < 20) = P √ 200 200 = 0.9207 − (1 − 0.9207) = 0.8414 using Table 4 of the New Cambridge Statistical Tables. (c) Let C denote the amount of coffee in one cup, hence C ∼ N (150, 100). We require: 13 = P (Z < −1.30) = 0.0968. P (C < 137) = P Z < − 10 using Table 4 of the New Cambridge Statistical Tables. 
(d) The expected income is: 0 × P (C < 137) + 1 × P (C ≥ 137) = 0 × 0.0968 + 1 × 0.9032 = 0.9032 i.e. £0.9032. 3. (a) As the letters are delivered at random, the destination of letter 1 follows a discrete uniform distribution among the six houses, i.e. the probability is equal to 1/6. (b) The random variable Xi is equal to 1 with probability 1/6 and 0 otherwise, hence: 1 5 1 E(Xi ) = 1 × + 0 × = . 6 6 6 (c) If house 1 receives the correct letter, there are 5 letters still to be delivered. Therefore, for example: P (X1 = 1 ∩ X2 = 1) = 1 1 1 × 6= = P (X1 = 1) P (X2 = 1) 6 5 36 hence X1 and X2 are not independent. 378 C.5. Chapter 6 – Sampling distributions of statistics (d) Following the previous part: Cov(X1 , X2 ) = E(X1 X2 ) − E(X1 ) E(X2 ). Note that X1 X2 is equal to 1 with probability 1/30 and 0 otherwise, hence: Cov(X1 , X2 ) = C.5 1 1 1 − = . 30 36 180 Chapter 6 – Sampling distributions of statistics 1. We have: E(X) = 5 × 20 10 18 + (−5) × = − = −0.2632 38 38 38 and: 2 18 20 10 Var(X) = E(X ) − (E(X)) = 25 × + 25 × − − = 24.9308. 38 38 38 2 Since n = 100 is large, then 2 100 X Xi ∼ N (−26.32, 2493.08), approximately, by the i=1 central limit theorem. We require: ! 100 X −50 − (−26.32) √ = P (Z > −0.47) = 0.6808. P Xi > −50 ≈ P Z > 2493.08 i=1 2. (a) Note Zi2 ∼ χ21 for all i = 1, . . . , 5. By independence, we have: Z12 + Z22 ∼ χ22 . (b) By independence, we have: Z1 s 5 P ∼ t4 . Zi2 /4 i=2 (c) By independence, we have: Z12 5 P ∼ F1, 4 . Zi2 /4 i=2 3. (a) The simplest answer is: √ 11X12 s ∼ t11 11 P Xi2 i=1 since X12 ∼ N (0, 1) and 11 P Xi2 ∼ χ211 . i=1 379 C. Solutions to Sample examination questions (b) The simplest answer is: 6 P 9 i=1 15 P 6 Xi2 ∼ F6, 9 Xi2 i=7 since 6 P Xi2 ∼ χ26 and Xi2 ∼ χ29 . i=7 i=1 C.6 15 P Chapter 7 – Point estimation 1. (a) The pdf of Xi is: ( θ−1 f (xi ; θ) = 0 Therefore: 1 E(Xi ) = θ Z 0 θ for 0 ≤ xi ≤ θ otherwise. θ θ 1 x2i = . xi dxi = θ 2 0 2 Therefore, setting µ b1 = M1 , we have: n P θb = X̄ 2 ⇒ θb = 2X̄ = 2 × Xi i=1 n . (b) We have: 0.2 + 3.6 + 1.1 = 3.27. 3 The point estimate is not plausible since 3.27 < 3.6 = x2 which must be impossible to observe if X ∼ Uniform[0, 3.27]. 2 × x̄ = 2 × Due to the law of large numbers, sample moments should converge to the corresponding population moments. Here, n = 3 is small hence poor performance of the MME is not surprising. 2. (a) We have to minimise: S= 3 X ε2i = (y1 − α − β)2 + (y2 − α − 2β)2 + (y3 − α − 4β)2 . i=1 We have: ∂S = −2(y1 − α − β) − 2(y2 − α − 2β) − 2(y3 − α − 4β) ∂α = 2(3α + 7β − (y1 + y2 + y3 )) and: ∂S = −2(y1 − α − β) − 4(y2 − α − 2β) − 8(y3 − α − 4β) ∂β = 2(7α + 21β − (y1 + 2y2 + 4y3 )). 380 C.6. Chapter 7 – Point estimation The estimators α b and βb are the solutions of the equations ∂S/∂α = 0 and ∂S/∂β = 0. Hence: 3b α + 7βb = y1 + y2 + y3 and 7b α + 21βb = y1 + 2y2 + 4y3 . Solving yields: −4y1 − y2 + 5y3 βb = 14 and α b= 2y1 + y2 − y3 . 2 They are unbiased estimators since: b = E(β) −4α − 4β − α − 2β + 5α + 20β =β 14 and: E(b α) = 2α + 2β + α + 2β − α − 4β = α. 2 (b) We have, by independence: 2 2 1 1 3 Var(b α) = 1 + + = . 2 2 2 2 3. (a) By independence, the likelihood function is: L(λ) = 2 n Y λ2Xi e−λ Xi ! i=1 = λ n P 2 Xi i=1 n Q 2 e−nλ . Xi ! i=1 The log-likelihood function is: l(λ) = ln L(λ) = 2 n X ! Xi (ln λ) − nλ2 − ln i=1 n Y ! Xi ! . i=1 Differentiating: d l(λ) = dλ 2 n P i=1 λ Xi 2 n P Xi − 2nλ2 i=1 − 2nλ = λ . Setting to zero, we re-arrange for the estimator: 2 n X i=1 b2 = 0 Xi − 2nλ ⇒ n P 1/2 i=1 Xi b λ= n = X̄ 1/2 . 
(b) By the invariance principle of maximum likelihood estimators: b3 = X̄ 3/2 . θb = λ 381 C. Solutions to Sample examination questions C.7 Chapter 8 – Interval estimation 1. We have: 1−α=P −tα/2, n−1 X̄ − µ √ ≤ tα/2, n−1 ≤ S/ n S S −tα/2, n−1 × √ ≤ X̄ − µ ≤ tα/2, n−1 × √ n n S S −tα/2, n−1 × √ < µ − X̄ < tα/2, n−1 × √ n n S S X̄ − tα/2, n−1 × √ < µ < X̄ + tα/2, n−1 × √ n n =P =P =P . Hence an accurate 100 (1 − α)% confidence interval for µ, where α ∈ (0, 1), is: S S X̄ − tα/2, n−1 × √ , X̄ + tα/2, n−1 × √ . n n 2. The population is a Bernoulli distribution on two points – 1 (agree) and 0 (disagree). We have a random sample of size n = 250, i.e. {X1 , . . . , X250 }. Let π = P (Xi = 1). Therefore, E(Xi ) = π and Var(Xi ) = π (1 − π). The sample mean and variance are: 250 163 1 X = 0.652 xi = p = x̄ = 250 i=1 250 and: 1 s2 = 249 250 X ! x2i − 250x̄2 = i=1 1 163 − 250 × (0.652)2 = 0.2278. 259 Note use of p (1 − p) = 0.652 × (1 − 0.652) = 0.2269 is also acceptable for the sample variance. Based on the central limit theorem for the sample mean, an approximate 99% confidence interval for π is: r 0.2278 s x̄ ± z0.005 × √ = 0.652 ± 2.576 × = 0.652 ± 0.078 ⇒ (0.574, 0.730). 250 n 3. For a 90% confidence interval, we need the lower and upper 5% values from χ2n−1 = χ29 . These are χ20.95, 9 = 3.325 (given in the question) and χ20.05, 9 = 16.92, using Table 8 of the New Cambridge Statistical Tables. Hence we obtain: (n − 1)s2 (n − 1)s2 , χ2α/2, n−1 χ21−α/2, n−1 382 ! = 9 × 21.05 9 × 21.05 , 16.92 3.325 = (11.20, 56.98). C.8. Chapter 9 – Hypothesis testing C.8 Chapter 9 – Hypothesis testing 1. (a) We have: P (Type II error) = P (not reject H0 | H1 ) = P (X ≤ 3 | π = 0.4) = 3 X (1 − 0.4)x−1 × 0.4 x=1 = 0.784. (b) We have: P (Type I error) = P (reject H0 | H0 ) = 1 − P (X ≤ 3 | π = 0.3) =1− 3 X (1 − 0.3)x−1 × 0.3 x=1 = 0.343. (c) The p-value is P (X ≥ 4 | π = 0.3) = 0.343 which, of course, is the same as the probability of a Type I error. 2. The size is the probability we reject the null hypothesis when it is true: 1 1.5 P X > 1|λ = = 0.0902. = 1 − e−0.5 − 0.5e−0.5 ≈ 1 − √ 2 2.718 The power is the probability we reject the null hypothesis when the alternative is true: 3 P (X > 1 | λ = 2) = 1 − e−2 − 2e−2 ≈ 1 − = 0.5939. (2.718)2 3. The power of the test at σ 2 is: β(σ) = Pσ (H0 is rejected) = Pσ (T > χ2α, n−1 ) (n − 1)S 2 2 > χα, n−1 = Pσ σ02 (n − 1)S 2 σ02 2 = Pσ > 2 × χα, n−1 σ2 σ σ02 2 = P X > 2 × χα, n−1 σ where X ∼ χ2n−1 . Hence here, where n = 10, we have: 2.00 2.00 2 β(σ) = P X > 2 × χ0.01, 9 = P X > 2 × 21.666 . σ σ With any given values of σ 2 , we may compute β(σ). For the σ 2 values requested, we obtain the following. σ2 2.00 × 21.666/σ 2 Approx. β(σ) 2.00 21.666 0.01 2.56 16.927 0.05 383 C. Solutions to Sample examination questions C.9 Chapter 10 – Analysis of variance (ANOVA) 1. (a) The sample means are 187/4 = 46.75, 347/6 = 57.83 and 461/10 = 46.1 for workers A, B and C, respectively. We will perform one-way ANOVA. We calculate the overall sample mean to be: 187 + 347 + 461 = 49.75. 20 We can now calculate the sum of squares between workers. This is: 4 × (46.75 − 49.75)2 + 6 × (57.83 − 49.75)2 + 10 × (46.1 − 49.75)2 = 561.27. The total sum of squares is: 50915 − 20 × (49.75)2 = 1413.75. Here is the one-way ANOVA table: Source Worker Error Total Degrees of Freedom 2 17 19 Sum of Squares 561.27 852.48 1413.75 Mean Square 280.64 50.15 F statistic 5.60 (b) At the 5% significance level, the critical value is F0.05, 2 17 = 3.59. 
Since 3.59 < 5.60, we reject H0 : µA = µB = µC and conclude that there is evidence of a difference in the average daily calls answered of the three workers. 2. (a) The average audience share of all networks is: 21.35 + 17.28 + 20.18 = 19.60. 3 Hence the sum of squares (SS) due to networks is: 4 × (21.35 − 19.60)2 + (17.28 − 19.60)2 + (20.18 − 19.60)2 = 35.13 and the mean sum of squares (MS) due to networks is 35.13/(3 − 1) = 17.57. The degrees of freedom are 4 − 1 = 3, 3 − 1 = 2, (4 − 1)(3 − 1) = 6 and 4 × 3 − 1 = 11 for cities, networks, error and total sum of squares, respectively. The SS for cities is 3 × 1.95 = 5.85. We have that the SS due to residuals is given by 51.52 − 5.85 − 35.13 = 10.54 and the MS is 10.54/6 = 1.76. The F -values are 1.95/1.76 = 1.11 and 17.57/1.76 = 9.98 for cities and networks, respectively. Here is the two-way ANOVA table: Source City Network Error Total 384 Degrees of freedom 3 2 6 11 Sum of squares 5.85 35.13 10.54 51.52 Mean square 1.95 17.57 1.76 F -value 1.11 9.98 C.9. Chapter 10 – Analysis of variance (ANOVA) (b) We test H0 : There is no difference between networks against H1 : There is a difference between networks. The F -value is 9.98 and at a 5% significance level the critical value is F0.05, 2, 6 = 5.14, hence we reject H0 and conclude that there is evidence of a difference between networks. 3. (a) The average time for all batteries is 41.5. Hence the sum of squares for batteries is: 7 × (43.86 − 41.5)2 + (41.28 − 41.5)2 + (40.86 − 41.5)2 + (40 − 41.5)2 = 57.94 and the mean sum of squares due to batteries is 57.94/(4 − 1) = 19.31. The degrees of freedom are 7 − 1 = 6, 4 − 1 = 3, (7 − 1)(4 − 1) = 18 and 7 × 4 − 1 = 27 for laptops, batteries, error and total sum of squares, respectively. The sum of squares for laptops is 6 × 26 = 156. We have that the sum of squares due to residuals is given by: 343 − 156 − 57.94 = 129.06 and hence the mean sum of squares is 129.06/18 = 7.17. The F -value is 26/7.17 = 3.63 and 19.31/7.17 = 2.69 for laptops and batteries, respectively. To summarise: Source Laptops Batteries Error Total Degrees of freedom 6 3 18 27 Sum of squares 156 57.94 129.06 343 Mean square 26 19.31 7.17 F -value 3.63 2.69 (b) We test the hypothesis H0 : There is no difference between different batteries vs. H1 : There is a difference between different batteries. The F -value is 2.69 and at the 5% significance level the critical value (degrees of freedom 3 and 18) is 3.16, hence we conclude that there is not enough evidence that there is a difference. Next, we test the hypothesis H0 : There is no difference between different laptop brands vs. H1 : There is a difference between different laptop brands. The F -value is 3.63 and at the 5% significance level the critical value (degrees of freedom 6 and 18) is 2.66, hence we reject H0 and conclude that there is evidence of a difference. (c) The upper 5% point of the t distribution with 18 degrees of freedom is 1.734 and the estimate of σ 2 is 7.17. So the confidence interval is: s 1 1 + = 3.86 ± 2.482 ⇒ (1.378, 6.342). 43.86 − 40 ± 1.734 × 7.17 × 7 7 Since zero is not in the interval, we have evidence of a difference. 385 C. Solutions to Sample examination questions 386 Appendix D Examination formula sheet Formulae for Statistics Discrete distributions Distribution p(x) 1 k Uniform π x (1 − π)1−x Bernoulli Binomial for all x = 1, 2, . . . , k n x π x (1 − π)n−x (1 − π)x−1 π Geometric e−λ λx x! Poisson for x = 0, 1 for x = 0, 1, . . . , n for x = 1, 2, 3, . . . for x = 0, 1, 2, . . . 
E(X) Var(X) k+1 2 k2 − 1 12 π π (1 − π) nπ n π (1 − π) 1 π 1−π π2 λ λ Continuous distributions Distribution f (x) 1 b−a Uniform λ e−λx Exponential Normal √ 1 2πσ 2 for a ≤ x ≤ b for x > 0 2 /2σ 2 e−(x−µ) for all x x−a b−a F (x) E(X) Var(X) for a ≤ x ≤ b a+b 2 (b − a)2 12 1 λ 1 λ2 µ σ2 1 − e−λx for x > 0 387 D. Examination formula sheet Sample quantities n n 1 X 1 X 2 (xi − x̄)2 = x − nx̄2 n − 1 i=1 n − 1 i=1 i Sample variance s2 = Sample covariance 1 X 1 X (xi − x̄)(yi − ȳ) = xi yi − nx̄ȳ n − 1 i=1 n − 1 i=1 n n n P Sample correlation xi yi − nx̄ȳ i=1 s n P x2i − nx̄2 i=1 n P yi2 − nȳ 2 i=1 Inference Variance of sample mean σ2 n One-sample t statistic X̄ − µ √ S/ n s Two-sample t statistic 388 n+m−2 X̄ − Ȳ − δ0 ×p 2 1/n + 1/m + (m − 1)SY2 (n − 1)SX