Uploaded by soconn21

Rice J.A. Mathematical statistics and data analysis (3rd) (1)

advertisement
THIRD EDITION
Mathematical
Statistics and
Data Analysis
John A. Rice
University of California, Berkeley
Australia • Brazil • Canada • Mexico • Singapore • Spain
United Kingdom • United States
Mathematical Statistics and Data Analysis, Third Edition
John A. Rice
Acquisitions Editor: Carolyn Crockett
Assistant Editor: Ann Day
Editorial Assistant: Elizabeth Gershman
Technology Project Manager: Fiona Chong
Marketing Manager: Joe Rogove
Marketing Assistant: Brian Smith
Marketing Communications Manager: Darlene Amidon-Brent
Project Manager, Editorial Production: Kelsey McGee
Creative Director: Rob Hugel
Art Director: Lee Friedman
Print Buyer: Karen Hunt
© 2007 Duxbury, an imprint of Thomson Brooks/Cole, a part
of The Thomson Corporation. Thomson, the Star logo, and
Brooks/Cole are trademarks used herein under license.
ALL RIGHTS RESERVED. No part of this work covered by
the copyright hereon may be reproduced or used in any form or
by any means—graphic, electronic, or mechanical, including
photocopying, recording, taping, Web distribution, information
storage and retrieval systems, or in any other manner—without
the written permission of the publisher.
Printed in the United States of America
1 2 3 4 5 6 7 10 09 08 07 06
For more information about our products, contact us at:
Thomson Learning Academic Resource Center
1-800-423-0563
For permission to use material from this text or product,
submit a request online at
http://www.thomsonrights.com.
Any additional questions about permissions can be submitted
by e-mail to thomsonrights@thomson.com.
Permissions Editor: Bob Kauser
Production Service: Interactive Composition
Corporation
Text Designer: Roy Neuhaus
Copy Editor: Victoria Thurman
Illustrator: Interactive Composition Corporation
Cover Designer: Denise Davidson
Cover Printer: Coral Graphic Services
Compositor: Interactive Composition
Corporation
Printer: R.R. Donnelley/Crawfordsville
Thomson Higher Education
10 Davis Drive
Belmont, CA 94002-3098
USA
Library of Congress Control Number: 2005938314
Student Edition: ISBN 0-534-39942-8
We must be careful not to confuse data with the
abstractions we use to analyze them.
WILLIAM JAMES (1842–1910)
Contents
Preface
xi
1 Probability
1
1.1
Introduction
1.2
Sample Spaces
1.3
Probability Measures
1.4
Computing Probabilities: Counting Methods
1
2
4
1.4.1 The Multiplication Principle
1.4.2 Permutations and Combinations
1.5
Conditional Probability
1.6
Independence
1.7
Concluding Remarks
1.8
Problems
9
16
23
26
26
2 Random Variables
2.1
6
7
35
Discrete Random Variables
35
2.1.1 Bernoulli Random Variables
2.1.2 The Binomial Distribution
37
38
2.1.3 The Geometric and Negative Binomial Distributions
2.1.4 The Hypergeometric Distribution
2.1.5 The Poisson Distribution
2.2
Continuous Random Variables
47
2.2.1 The Exponential Density
2.2.2 The Gamma Density
iv
42
53
50
42
40
Contents
2.2.3 The Normal Distribution
2.2.4 The Beta Density
58
2.3
Functions of a Random Variable
2.4
Concluding Remarks
2.5
Problems
64
71
3.1
Introduction
3.2
Discrete Random Variables
3.3
Continuous Random Variables
3.4
Independent Random Variables
3.5
Conditional Distributions
71
72
75
84
87
3.5.1 The Discrete Case
87
3.5.2 The Continuous Case
88
Functions of Jointly Distributed Random Variables
3.6.1 Sums and Quotients
3.6.2 The General Case
3.7
Extrema and Order Statistics
3.8
Problems
96
96
99
104
107
4 Expected Values
4.1
58
64
3 Joint Distributions
3.6
54
116
The Expected Value of a Random Variable
116
4.1.1 Expectations of Functions of Random Variables
121
4.1.2 Expectations of Linear Combinations of Random Variables
4.2
Variance and Standard Deviation
130
4.2.1 A Model for Measurement Error
4.3
Covariance and Correlation
4.4
Conditional Expectation and Prediction
138
4.4.1 Definitions and Examples
4.4.2 Prediction
4.5
The Moment-Generating Function
4.6
Approximate Methods
4.7
Problems
166
147
152
161
135
155
147
124
v
vi
Contents
5 Limit Theorems
177
5.1
Introduction
5.2
The Law of Large Numbers
5.3
Convergence in Distribution and the Central Limit Theorem
5.4
Problems
177
177
181
188
6 Distributions Derived from the Normal Distribution
6.1
Introduction
6.2
χ , t, and F Distributions
6.3
The Sample Mean and the Sample Variance
6.4
Problems
192
192
2
192
195
198
7 Survey Sampling
199
7.1
Introduction
7.2
Population Parameters
7.3
Simple Random Sampling
199
200
202
7.3.1 The Expectation and Variance of the Sample Mean
7.3.2 Estimation of the Population Variance
203
210
7.3.3 The Normal Approximation to the Sampling Distribution of X
7.4
Estimation of a Ratio
7.5
Stratified Random Sampling
220
227
7.5.1 Introduction and Notation
227
7.5.2 Properties of Stratified Estimates
7.5.3 Methods of Allocation
7.6
Concluding Remarks
7.7
Problems
214
228
232
238
239
8 Estimation of Parameters and Fitting of Probability Distributions
8.1
Introduction
8.2
Fitting the Poisson Distribution to Emissions of Alpha Particles
8.3
Parameter Estimation
8.4
The Method of Moments
8.5
The Method of Maximum Likelihood
255
257
260
267
255
255
Contents
8.5.1 Maximum Likelihood Estimates of Multinomial Cell Probabilities
8.5.2 Large Sample Theory for Maximum Likelihood Estimates
8.5.3 Confidence Intervals from Maximum Likelihood Estimates
8.6
The Bayesian Approach to Parameter Estimation
8.6.1 Further Remarks on Priors
8.7
274
279
285
294
8.6.2 Large Sample Normal Approximation to the Posterior
8.6.3 Computational Aspects
272
296
297
Efficiency and the Cramér-Rao Lower Bound
298
8.7.1 An Example: The Negative Binomial Distribution 302
8.8
Sufficiency
305
8.8.1 A Factorization Theorem
306
8.8.2 The Rao-Blackwell Theorem
8.9
8.10
Concluding Remarks
Problems
310
311
312
9 Testing Hypotheses and Assessing Goodness of Fit
9.1
Introduction
9.2
The Neyman-Pearson Paradigm
329
329
331
9.2.1 Specification of the Significance Level and the Concept of a p-value
9.2.2 The Null Hypothesis
335
9.2.3 Uniformly Most Powerful Tests
336
9.3
The Duality of Confidence Intervals and Hypothesis Tests
9.4
Generalized Likelihood Ratio Tests
9.5
Likelihood Ratio Tests for the Multinomial Distribution
9.6
The Poisson Dispersion Test
9.7
Hanging Rootograms
9.8
Probability Plots
9.9
Tests for Normality
337
339
341
347
349
352
9.10
Concluding Remarks
9.11
Problems
358
361
362
10 Summarizing Data
377
10.1
Introduction
10.2
Methods Based on the Cumulative Distribution Function
377
378
334
vii
viii
Contents
10.2.1 The Empirical Cumulative Distribution Function
10.2.2 The Survival Function
380
10.2.3 Quantile-Quantile Plots
385
10.3
Histograms, Density Curves, and Stem-and-Leaf Plots
10.4
Measures of Location
393
395
10.4.3 The Trimmed Mean
10.4.4 M Estimates
389
392
10.4.1 The Arithmetic Mean
10.4.2 The Median
378
397
397
10.4.5 Comparison of Location Estimates
398
10.4.6 Estimating Variability of Location Estimates by the Bootstrap
10.5
Measures of Dispersion
10.6
Boxplots
10.7
Exploring Relationships with Scatterplots
10.8
Concluding Remarks
10.9
Problems
401
402
404
407
408
11 Comparing Two Samples
420
11.1
Introduction
11.2
Comparing Two Independent Samples
420
421
11.2.1 Methods Based on the Normal Distribution
11.2.2 Power
421
433
11.2.3 A Nonparametric Method—The Mann-Whitney Test
11.2.4 Bayesian Approach
11.3
Comparing Paired Samples
443
444
11.3.1 Methods Based on the Normal Distribution
446
11.3.2 A Nonparametric Method—The Signed Rank Test
11.3.3 An Example—Measuring Mercury Levels in Fish
11.4
Experimental Design
452
11.4.1 Mammary Artery Ligation 452
11.4.2 The Placebo Effect
453
11.4.3 The Lanarkshire Milk Experiment
11.4.4 The Portacaval Shunt
11.4.5 FD&C Red No. 40
435
453
454
455
11.4.6 Further Remarks on Randomization
456
448
450
399
ix
Contents
11.4.7 Observational Studies, Confounding, and Bias in Graduate Admissions
11.4.8 Fishing Expeditions
11.5
Concluding Remarks
11.6
Problems
458
459
459
12 The Analysis of Variance
12.1
Introduction
12.2
The One-Way Layout
477
477
477
12.2.1 Normal Theory; the F Test
478
12.2.2 The Problem of Multiple Comparisons
485
12.2.3 A Nonparametric Method—The Kruskal-Wallis Test
12.3
The Two-Way Layout
489
12.3.1 Additive Parametrization
489
12.3.2 Normal Theory for the Two-Way Layout
12.3.3 Randomized Block Designs
492
500
12.3.4 A Nonparametric Method—Friedman’s Test
12.4
Concluding Remarks
12.5
Problems
488
503
504
505
13 The Analysis of Categorical Data
514
13.1
Introduction
13.2
Fisher’s Exact Test
13.3
The Chi-Square Test of Homogeneity
516
13.4
The Chi-Square Test of Independence
520
13.5
Matched-Pairs Designs
13.6
Odds Ratios
13.7
Concluding Remarks
13.8
Problems
514
514
523
526
530
530
14 Linear Least Squares
542
14.1
Introduction
14.2
Simple Linear Regression
542
547
14.2.1 Statistical Properties of the Estimated Slope and Intercept
547
457
x
Contents
14.2.2 Assessing the Fit
550
14.2.3 Correlation and Regression
560
14.3
The Matrix Approach to Linear Least Squares
14.4
Statistical Properties of Least Squares Estimates
14.4.1 Vector-Valued Random Variables
564
567
567
14.4.2 Mean and Covariance of Least Squares Estimates
14.4.3 Estimation of σ
2
573
575
14.4.4 Residuals and Standardized Residuals
14.4.5 Inference about β
576
577
14.5
Multiple Linear Regression—An Example
14.6
Conditional Inference, Unconditional Inference, and the Bootstrap
14.7
Local Linear Smoothing
14.8
Concluding Remarks
14.9
Problems
580
587
591
591
Appendix A Common Distributions
Appendix B Tables A4
Bibliography A25
Answers to Selected Problems A32
Author Index A48
Applications Index A51
Subject Index A54
A1
585
Preface
Intended Audience
This text is intended for juniors, seniors, or beginning graduate students in statistics,
mathematics, natural sciences, and engineering as well as for adequately prepared
students in the social sciences and economics. A year of calculus, including Taylor
Series and multivariable calculus, and an introductory course in linear algebra are
prerequisites.
This Book’s Objectives
This book reflects my view of what a first, and for many students a last, course in
statistics should be. Such a course should include some traditional topics in mathematical statistics (such as methods based on likelihood), topics in descriptive statistics
and data analysis with special attention to graphical displays, aspects of experimental
design, and realistic applications of some complexity. It should also reflect the integral role played by computers in statistics. These themes, properly interwoven, can
give students a view of the nature of modern statistics. The alternative of teaching
two separate courses, one on theory and one on data analysis, seems to me artificial.
Furthermore, many students take only one course in statistics and do not have time
for two or more.
Analysis of Data and the Practice
of Statistics
In order to draw the above themes together, I have endeavored to write a book closely
tied to the practice of statistics. It is in the analysis of real data that one sees the roles
played by both formal theory and informal data analytic methods. I have organized
this book around various kinds of problems that entail the use of statistical methods
and have included many real examples to motivate and introduce the theory. Among
xi
xii
Preface
the advantages of such an approach are that theoretical constructs are presented in
meaningful contexts, that they are gradually supplemented and reinforced, and that
they are integrated with more informal methods. This is, I think, a fitting approach
to statistics, the historical development of which has been spurred on primarily by
practical needs rather than by abstract or aesthetic considerations. At the same time,
I have not shied away from using the mathematics that the students are supposed to
know.
The Third Edition
Eighteen years have passed since the first edition of this book was published and
eleven years since the second. Although the basic intent and stucture of the book
have not changed, the new editions reflect developments in the discipline of statistics,
primarily the computational revolution.
The most significant change in this edition is the treatment of Bayesian inference. I moved the material from the last chapter, a point that was never reached by
many instructors, and integrated it into earlier chapters. Bayesian inference is now
first previewed in Chapter 3, in the context of conditional distributions. It is then
placed side-by-side with frequentist methods in Chapter 8, where it complements the
material on maximum likelihood estimation very naturally. The introductory section
on hypothesis testing in Chapter 9 now begins with a Bayesian formulation before
moving on to the Neyman-Pearson paradigm. One advantage of this is that the fundamental importance of the likelihood ratio is now much more apparent. In applications,
I stress uninformative priors and show the similarity of the qualitative conclusions
that would be reached by frequentist and Bayesian methods.
Other new material includes the use of examples from genomics and financial
statistics in the probability chapters. In addition to its value as topically relevant, this
material naturally reinforces basic concepts. For example, the material on copulas
underscores the relationships of marginal and joint distributions. Other changes include the introduction of scatterplots and correlation coefficients within the context
of exploratory data analysis in Chapter 10 and a brief introduction to nonparametric
smoothing via local linear least squares in Chapter 14. There are nearly 100 new
problems, mainly in Chapters 7–14, including several new data sets. Some of the data
sets are sufficiently substantial to be the basis for computer lab assignments. I also
elucidated many passages that were obscure in earlier editions.
Brief Outline
A complete outline can be found, of course, in the Table of Contents. Here I will just
highlight some points and indicate various curricular options for the instructor.
The first six chapters contain an introduction to probability theory, particularly
those aspects most relevant to statistics. Chapter 1 introduces the basic ingredients
of probability theory and elementary combinatorial methods from a non measure
theoretic point of view. In this and the other probability chapters, I tried to use realworld examples rather than balls and urns whenever possible.
Preface
xiii
The concept of a random variable is introduced in Chapter 2. I chose to discuss
discrete and continuous random variables together, instead of putting off the continuous case until later. Several common distributions are introduced. An advantage of
this approach is that it provides something to work with and develop in later chapters.
Chapter 3 continues the treatment of random variables by going into joint distributions. The instructor may wish to skip lightly over Jacobians; this can be done
with little loss of continuity, since they are rarely used in the rest of the book. The
material in Section 3.7 on extrema and order statistics can be omitted if the instructor
is willing to do a little backtracking later.
Expectation, variance, covariance, conditional expectation, and moment-generating functions are taken up in Chapter 4. The instructor may wish to pass lightly
over conditional expectation and prediction, especially if he or she does not plan to
cover sufficiency later. The last section of this chapter introduces the δ method, or
the method of propagation of error. This method is used several times in the statistics
chapters.
The law of large numbers and the central limit theorem are proved in Chapter 5
under fairly strong assumptions.
Chapter 6 is a compendium of the common distributions related to the normal and
sampling distributions of statistics computed from the usual normal random sample.
I don’t spend a lot of time on this material here but do develop the necessary facts
as they are needed in the statistics chapters. It is useful for students to have these
distributions collected in one place.
Chapter 7 is on survey sampling, an unconventional, but in some ways natural,
beginning to the study of statistics. Survey sampling is an area of statistics with
which most students have some vague familiarity, and a set of fairly specific, concrete
statistical problems can be naturally posed. It is a context in which, historically, many
important statistical concepts have developed, and it can be used as a vehicle for
introducing concepts and techniques that are developed further in later chapters, for
example:
•
•
•
•
•
The idea of an estimate as a random variable with an associated sampling distribution
The concepts of bias, standard error, and mean squared error
Confidence intervals and the application of the central limit theorem
An exposure to notions of experimental design via the study of stratified estimates
and the concept of relative efficiency
Calculation of expectations, variances, and covariances
One of the unattractive aspects of survey sampling is that the calculations are rather
grubby. However, there is a certain virtue in this grubbiness, and students are given
practice in such calculations. The instructor has quite a lot of flexibility as to how
deeply to cover the concepts in this chapter. The sections on ratio estimation and
stratification are optional and can be skipped entirely or returned to at a later time
without loss of continuity.
Chapter 8 is concerned with parameter estimation, a subject that is motivated
and illustrated by the problem of fitting probability laws to data. The method of
moments, the method of maximum likelihood, and Bayesian inference are developed.
The concept of efficiency is introduced, and the Cramér-Rao Inequality is proved.
Section 8.8 introduces the concept of sufficiency and some of its ramifications. The
xiv
Preface
material on the Cramér-Rao lower bound and on sufficiency can be skipped; to my
mind, the importance of sufficiency is usually overstated. Section 8.7.1 (the negative
binomial distribution) can also be skipped.
Chapter 9 is an introduction to hypothesis testing with particular application
to testing for goodness of fit, which ties in with Chapter 8. (This subject is further
developed in Chapter 11.) Informal, graphical methods are presented here as well.
Several of the last sections of this chapter can be skipped if the instructor is pressed
for time. These include Section 9.6 (the Poisson dispersion test), Section 9.7 (hanging
rootograms), and Section 9.9 (tests for normality).
A variety of descriptive methods are introduced in Chapter 10. Many of these
techniques are used in later chapters. The importance of graphical procedures is
stressed, and notions of robustness are introduced. The placement of a chapter on
descriptive methods this late in a book may seem strange. I chose to do so because descriptive procedures usually have a stochastic side and, having been through
the three chapters preceding this one, students are by now better equipped to study the
statistical behavior of various summary statistics (for example, a confidence interval
for the median). When I teach the course, I introduce some of this material earlier.
For example, I have students make boxplots and histograms from samples drawn in
labs on survey sampling. If the instructor wishes, the material on survival and hazard
functions can be skipped.
Classical and nonparametric methods for two-sample problems are introduced
in Chapter 11. The concepts of hypothesis testing, first introduced in Chapter 9,
are further developed. The chapter concludes with some discussion of experimental
design and the interpretation of observational studies.
The first eleven chapters are the heart of an introductory course; the theoretical
constructs of estimation and hypothesis testing have been developed, graphical and
descriptive methods have been introduced, and aspects of experimental design have
been discussed.
The instructor has much more freedom in selecting material from Chapters 12
through 14. In particular, it is not necessary to proceed through these chapters in the
order in which they are presented.
Chapter 12 treats the one-way and two-way layouts via analysis of variance and
nonparametric techniques. The problem of multiple comparisons, first introduced at
the end of Chapter 11, is discussed.
Chapter 13 is a rather brief treatment of the analysis of categorical data. Likelihood ratio tests are developed for homogeneity and independence. McNemar’s test
is presented and finally, estimation of the odds ratio is motivated by a discussion of
prospective and retrospective studies.
Chapter 14 concerns linear least squares. Simple linear regression is developed
first and is followed by a more general treatment using linear algebra. I chose to
employ matrix algebra but keep the level of the discussion as simple and concrete as
possible, not going beyond concepts typically taught in an introductory one-quarter
course. In particular, I did not develop a geometric analysis of the general linear model
or make any attempt to unify regression and analysis of variance. Throughout this
chapter, theoretical results are balanced by more qualitative data analytic procedures
based on analysis of residuals. At the end of the chapter, I introduce nonparametric
regression via local linear least squares.
Preface
xv
Computer Use and Problem Solving
Computation is an integral part of contemporary statistics. It is essential for data
analysis and can be an aid to clarifying basic concepts. My students use the opensource package R, which they can install on their own computers. Other packages
could be used as well but I do not discuss any particular programs in the text. The
data in the text are available on the CD that is bound in the U.S. edition or can be
downloaded from www.thomsonedu.com/statistics.
This book contains a large number of problems, ranging from routine reinforcement of basic concepts to some that students will find quite difficult. I think that
problem solving, especially of nonroutine problems, is very important.
Acknowledgments
I am indebted to a large number of people who contributed directly and indirectly
to the first edition. Earlier versions were used in courses taught by Richard Olshen,
Yosi Rinnot, Donald Ylvisaker, Len Haff, and David Lane, who made many helpful
comments. Students in their classes and in my own had many constructive comments.
Teaching assistants, especially Joan Staniswalis, Roger Johnson, Terri Bittner, and
Peter Kim, worked through many of the problems and found numerous errors. Many
reviewers provided useful suggestions: Rollin Brant, University of Toronto; George
Casella, Cornell University; Howard B. Christensen, Brigham Young University;
David Fairley, Ohio State University; Peter Guttorp, University of Washington; Hari
Iyer, Colorado State University; Douglas G. Kelly, University of North Carolina;
Thomas Leonard, University of Wisconsin; Albert S. Paulson, Rensselaer Polytechnic
Institute; Charles Peters, University of Houston, University Park; Andrew Rukhin,
University of Massachusetts, Amherst; Robert Schaefer, Miami University; and Ruth
Williams, University of California, San Diego. Richard Royall and W. G. Cumberland
kindly provided the data sets used in Chapter 7 on survey sampling. Several other data
sets were brought to my attention by statisticians at the National Bureau of Standards,
where I was fortunate to spend a year while on sabbatical. I deeply appreciate the
patience, persistence, and faith of my editor, John Kimmel, in bringing this project to
fruition.
The candid comments of many students and faculty who used the first edition of
the book were influential in the creation of the second edition. In particular I would like
to thank Ian Abramson, Edward Bedrick, Jon Frank, Richard Gill, Roger Johnson,
Torgny Lindvall, Michael Martin, Deb Nolan, Roger Pinkham, Yosi Rinott, Philip
Stark, and Bin Yu; I apologize to any individuals who have inadvertently been left
off this list. Finally, I would like to thank Alex Kugushev for his encouragement and
support in carrying out the revision and the work done by Terri Bittner in carefully
reading the manuscript for accuracy and in the solutions of the new problems.
Many people contributed to the third edition. I would like to thank the reviewers
of this edition: Marten Wegkamp, Yale University; Aparna Huzurbazar, University of
New Mexico; Laura Bernhofen, Clark University; Joe Glaz, University of Connecticut; and Michael Minnotte, Utah State University. I deeply appreciate many readers
xvi
Preface
for generously taking the time to point out errors and make suggestions on improving
the exposition. In particular, Roger Pinkham sent many helpful email messages and
Nick Cox provided a very long list of grammatical errors. Alice Hsiaw made detailed
comments on Chapters 7–14. I also wish to thank Ani Adhikari, Paulo Berata, Patrick
Brewer, Sang-Hoon Cho Gier Eide, John Einmahl, David Freedman, Roger Johnson,
Paul van der Laan, Patrick Lee, Yi Lin, Jim Linnemann, Rasaan Moshesh, Eugene
Schuster, Dylan Small, Luis Tenorio, Richard De Veaux, and Ping Zhang. Bob Stine
contributed financial data, Diane Cook provided data on Italian olive oils, and Jim
Albert provided a baseball data set that nicely illustrates regression toward the mean.
Rainer Sachs provided the lovely data on chromatin separations. I thank my editor,
Carolyn Crockett, for her graceful persistence and patience in bringing about this
revision, and also the energetic production team. I apologize to any others whose
names have inadvertently been left off this list.
CHAPTER
1
Probability
1.1 Introduction
The idea of probability, chance, or randomness is quite old, whereas its rigorous
axiomatization in mathematical terms occurred relatively recently. Many of the ideas
of probability theory originated in the study of games of chance. In this century, the
mathematical theory of probability has been applied to a wide variety of phenomena;
the following are some representative examples:
•
•
•
•
•
•
•
•
•
Probability theory has been used in genetics as a model for mutations and ensuing
natural variability, and plays a central role in bioinformatics.
The kinetic theory of gases has an important probabilistic component.
In designing and analyzing computer operating systems, the lengths of various
queues in the system are modeled as random phenomena.
There are highly developed theories that treat noise in electrical devices and communication systems as random processes.
Many models of atmospheric turbulence use concepts of probability theory.
In operations research, the demands on inventories of goods are often modeled as
random.
Actuarial science, which is used by insurance companies, relies heavily on the tools
of probability theory.
Probability theory is used to study complex systems and improve their reliability,
such as in modern commercial or military aircraft.
Probability theory is a cornerstone of the theory of finance.
The list could go on and on.
This book develops the basic ideas of probability and statistics. The first part
explores the theory of probability as a mathematical model for chance phenomena.
The second part of the book is about statistics, which is essentially concerned with
1
2
Chapter 1
Probability
procedures for analyzing data, especially data that in some vague sense have a random
character. To comprehend the theory of statistics, you must have a sound background
in probability.
1.2 Sample Spaces
Probability theory is concerned with situations in which the outcomes occur randomly.
Generically, such situations are called experiments, and the set of all possible outcomes
is the sample space corresponding to an experiment. The sample space is denoted by
, and an element of is denoted by ω. The following are some examples.
EXAMPLE A
Driving to work, a commuter passes through a sequence of three intersections with
traffic lights. At each light, she either stops, s, or continues, c. The sample space is
the set of all possible outcomes:
= {ccc, ccs, css, csc, sss, ssc, scc, scs}
where csc, for example, denotes the outcome that the commuter continues through
the first light, stops at the second light, and continues through the third light.
■
EXAMPLE B
The number of jobs in a print queue of a mainframe computer may be modeled as
random. Here the sample space can be taken as
= {0, 1, 2, 3, . . .}
that is, all the nonnegative integers. In practice, there is probably an upper limit, N ,
on how large the print queue can be, so instead the sample space might be defined as
= {0, 1, 2, . . . , N }
EXAMPLE C
■
Earthquakes exhibit very erratic behavior, which is sometimes modeled as random.
For example, the length of time between successive earthquakes in a particular region
that are greater in magnitude than a given threshold may be regarded as an experiment.
Here is the set of all nonnegative real numbers:
= {t | t ≥ 0}
■
We are often interested in particular subsets of , which in probability language
are called events. In Example A, the event that the commuter stops at the first light is
the subset of denoted by
A = {sss, ssc, scc, scs}
1.2 Sample Spaces
3
(Events, or subsets, are usually denoted by italic uppercase letters.) In Example B,
the event that there are fewer than five jobs in the print queue can be denoted by
A = {0, 1, 2, 3, 4}
The algebra of set theory carries over directly into probability theory. The union
of two events, A and B, is the event C that either A occurs or B occurs or both occur:
C = A ∪ B. For example, if A is the event that the commuter stops at the first light
(listed before), and if B is the event that she stops at the third light,
B = {sss, scs, ccs, css}
then C is the event that she stops at the first light or stops at the third light and consists
of the outcomes that are in A or in B or in both:
C = {sss, ssc, scc, scs, ccs, css}
The intersection of two events, C = A ∩ B, is the event that both A and B occur.
If A and B are as given previously, then C is the event that the commuter stops at
the first light and stops at the third light and thus consists of those outcomes that are
common to both A and B:
C = {sss, scs}
The complement of an event, Ac , is the event that A does not occur and thus
consists of all those elements in the sample space that are not in A. The complement
of the event that the commuter stops at the first light is the event that she continues at
the first light:
Ac = {ccc, ccs, css, csc}
You may recall from previous exposure to set theory a rather mysterious set called
the empty set, usually denoted by ∅. The empty set is the set with no elements; it
is the event with no outcomes. For example, if A is the event that the commuter
stops at the first light and C is the event that she continues through all three lights,
C = {ccc}, then A and C have no outcomes in common, and we can write
A∩C =∅
In such cases, A and C are said to be disjoint.
Venn diagrams, such as those in Figure 1.1, are often a useful tool for visualizing
set operations.
The following are some laws of set theory.
Commutative Laws:
A∪B = B∪ A
A∩B = B∩ A
Associative Laws:
(A ∪ B) ∪ C = A ∪ (B ∪ C)
(A ∩ B) ∩ C = A ∩ (B ∩ C)
4
Chapter 1
Probability
Distributive Laws:
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)
Of these, the distributive laws are the least intuitive, and you may find it instructive
to illustrate them with Venn diagrams.
A
B
A
A傼B
B
A傽B
F I G U R E 1.1 Venn diagrams of A ∪ B and A ∩ B.
1.3 Probability Measures
A probability measure on is a function P from subsets of to the real numbers
that satisfies the following axioms:
1. P() = 1.
2. If A ⊂ , then P(A) ≥ 0.
3. If A1 and A2 are disjoint, then
P(A1 ∪ A2 ) = P(A1 ) + P(A2 ).
More generally, if A1 , A2 , . . . , An , . . . are mutually disjoint, then
∞ ∞
Ai =
P(Ai )
P
i=1
i=1
The first two axioms are obviously desirable. Since consists of all possible outcomes, P() = 1. The second axiom simply states that a probability is nonnegative.
The third axiom states that if A and B are disjoint—that is, have no outcomes in
common—then P(A ∪ B) = P(A) + P(B) and also that this property extends to
limits. For example, the probability that the print queue contains either one or three
jobs is equal to the probability that it contains one plus the probability that it contains
three.
The following properties of probability measures are consequences of the axioms.
Property A P(Ac ) = 1 − P(A). This property follows since A and Ac are disjoint
with A ∪ Ac = and thus, by the first and third axioms, P(A) + P(Ac ) = 1. In
words, this property says that the probability that an event does not occur equals one
minus the probability that it does occur.
Property B P(∅) = 0. This property follows from Property A since ∅ = c . In
words, this says that the probability that there is no outcome at all is zero.
1.3 Probability Measures
5
Property C If A ⊂ B, then P(A) ≤ P(B). This property states that if B occurs
whenever A occurs, then P(A) ≤ P(B). For example, if whenever it rains (A) it is
cloudy (B), then the probability that it rains is less than or equal to the probability
that it is cloudy. Formally, it can be proved as follows: B can be expressed as the
union of two disjoint sets:
B = A ∪ (B ∩ Ac )
Then, from the third axiom,
P(B) = P(A) + P(B ∩ Ac )
and thus
P(A) = P(B) − P(B ∩ Ac ) ≤ P(B)
Property D Addition Law P(A ∪ B) = P(A) + P(B) − P(A ∩ B). This property
is easy to see from the Venn diagram in Figure 1.2. If P(A) and P(B) are added
together, P(A ∩ B) is counted twice. To prove it, we decompose A ∪ B into three
disjoint subsets, as shown in Figure 1.2:
C = A ∩ Bc
D = A∩B
E = Ac ∩ B
A
B
C
E
D
F I G U R E 1.2 Venn diagram illustrating the addition law.
We then have, from the third axiom,
P(A ∪ B) = P(C) + P(D) + P(E)
Also, A = C ∪ D, and C and D are disjoint; so P(A) = P(C) + P(D). Similarly,
P(B) = P(D) + P(E). Putting these results together, we see that
P(A) + P(B) = P(C) + P(E) + 2P(D)
= P(A ∪ B) + P(D)
or
P(A ∪ B) = P(A) + P(B) − P(D)
6
Chapter 1
Probability
EXAMPLE A
Suppose that a fair coin is thrown twice. Let A denote the event of heads on the first
toss, and let B denote the event of heads on the second toss. The sample space is
= {hh, ht, th, tt}
We assume that each elementary outcome in is equally likely and has probability
1
. C = A ∪ B is the event that heads comes up on the first toss or on the second toss.
4
Clearly, P(C) = P(A) + P(B) = 1. Rather, since A ∩ B is the event that heads
comes up on the first toss and on the second toss,
P(C) = P(A) + P(B) − P(A ∩ B) = .5 + .5 − .25 = .75
EXAMPLE B
■
An article in the Los Angeles Times (August 24, 1987) discussed the statistical risks
of AIDS infection:
Several studies of sexual partners of people infected with the virus show that
a single act of unprotected vaginal intercourse has a surprisingly low risk of
infecting the uninfected partner—perhaps one in 100 to one in 1000. For an
average, consider the risk to be one in 500. If there are 100 acts of intercourse
with an infected partner, the odds of infection increase to one in five.
Statistically, 500 acts of intercourse with one infected partner or 100 acts
with five partners lead to a 100% probability of infection (statistically, not
necessarily in reality).
Following this reasoning, 1000 acts of intercourse with one infected partner would
lead to a probability of infection equal to 2 (statistically, but not necessarily in reality).
To see the flaw in the reasoning that leads to this conclusion, consider two acts of
intercourse. Let A1 denote the event that infection occurs on the first act and let A2
denote the event that infection occurs on the second act. Then the event that infection
occurs is B = A1 ∪ A2 and
2
P(B) = P(A1 ) + P(A2 ) − P(A1 ∩ A2 ) ≤ P(A1 ) + P(A2 ) =
■
500
1.4 Computing Probabilities:
Counting Methods
Probabilities are especially easy to compute for finite sample spaces. Suppose that
= {ω1 , ω2 , . . . , ω N } and that P(ωi ) = pi . To find the probability of an event A,
we simply add the probabilities of the ωi that constitute A.
EXAMPLE A
Suppose that a fair coin is thrown twice and the sequence of heads and tails is recorded.
The sample space is
= {hh, ht, th, tt}
1.4 Computing Probabilities: Counting Methods
7
As in Example A of the previous section, we assume that each outcome in has
probability .25. Let A denote the event that at least one head is thrown. Then A =
■
{hh, ht, th}, and P(A) = .75.
This is a simple example of a fairly common situation. The elements of all
have equal probability; so if there are N elements in , each of them has probability
1/N . If A can occur in any of n mutually exclusive ways, then P(A) = n/N , or
P(A) =
number of ways A can occur
total number of outcomes
Note that this formula holds only if all the outcomes are equally likely. In Example A, if only the number of heads were recorded, then would be {0, 1, 2}. These
■
outcomes are not equally likely, and P(A) is not 23 .
EXAMPLE B
Simpson’s Paradox
A black urn contains 5 red and 6 green balls, and a white urn contains 3 red and 4
green balls. You are allowed to choose an urn and then choose a ball at random from
the urn. If you choose a red ball, you get a prize. Which urn should you choose to
draw from? If you draw from the black urn, the probability of choosing a red ball is
5
= .455 (the number of ways you can draw a red ball divided by the total number
11
of outcomes). If you choose to draw from the white urn, the probability of choosing
a red ball is 37 = .429, so you should choose to draw from the black urn.
Now consider another game in which a second black urn has 6 red and 3 green
balls, and a second white urn has 9 red and 5 green balls. If you draw from the black
urn, the probability of a red ball is 69 = .667, whereas if you choose to draw from the
9
= .643. So, again you should choose to draw from
white urn, the probability is 14
the black urn.
In the final game, the contents of the second black urn are added to the first black
urn, and the contents of the second white urn are added to the first white urn. Again,
you can choose which urn to draw from. Which should you choose? Intuition says
choose the black urn, but let’s calculate the probabilities. The black urn now contains
= .55.
11 red and 9 green balls, so the probability of drawing a red ball from it is 11
20
The white urn now contains 12 red and 9 green balls, so the probability of drawing a red
= .571. So, you should choose the white urn. This counterintuitive
ball from it is 12
21
result is an example of Simpson’s paradox. For an example that occurred in real life,
■
see Section 11.4.7. For more amusing examples, see Gardner (1976).
In the preceding examples, it was easy to count the number of outcomes and
calculate probabilities. To compute probabilities for more complex situations, we
must develop systematic ways of counting outcomes, which are the subject of the
next two sections.
1.4.1 The Multiplication Principle
The following is a statement of the very useful multiplication principle.
8
Chapter 1
Probability
MULTIPLICATION PRINCIPLE
If one experiment has m outcomes and another experiment has n outcomes, then
there are mn possible outcomes for the two experiments.
Proof
Denote the outcomes of the first experiment by a1 , . . . , am and the outcomes of
the second experiment by b1 , . . . , bn . The outcomes for the two experiments are
the ordered pairs (ai , b j ). These pairs can be exhibited as the entries of an m × n
rectangular array, in which the pair (ai , b j ) is in the ith row and the jth column.
■
There are mn entries in this array.
EXAMPLE A
Playing cards have 13 face values and 4 suits. There are thus 4 × 13 = 52 face■
value/suit combinations.
EXAMPLE B
A class has 12 boys and 18 girls. The teacher selects 1 boy and 1 girl to act as
representatives to the student government. She can do this in any of 12 × 18 = 216
different ways.
■
EXTENDED MULTIPLICATION PRINCIPLE
If there are p experiments and the first has n 1 possible outcomes, the second
n 2 , . . . , and the pth n p possible outcomes, then there are a total of n 1 × n 2 × · · · ×
n p possible outcomes for the p experiments.
Proof
This principle can be proved from the multiplication principle by induction.
We saw that it is true for p = 2. Assume that it is true for p = q—that is, that
there are n 1 × n 2 × · · · × n q possible outcomes for the first q experiments. To
complete the proof by induction, we must show that it follows that the property holds for p = q + 1. We apply the multiplication principle, regarding
the first q experiments as a single experiment with n 1 × · · · × n q outcomes,
and conclude that there are (n 1 × · · · × n q ) × n q+1 outcomes for the q + 1
■
experiments.
EXAMPLE C
An 8-bit binary word is a sequence of 8 digits, of which each may be either a 0 or a 1.
How many different 8-bit words are there?
9
1.4 Computing Probabilities: Counting Methods
There are two choices for the first bit, two for the second, etc., and thus there are
2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 28 = 256
■
such words.
EXAMPLE D
A DNA molecule is a sequence of four types of nucleotides, denoted by A, G, C, and T.
The molecule can be millions of units long and can thus encode an enormous amount
of information. For example, for a molecule 1 million (106 ) units long, there are
6
410 different possible sequences. This is a staggeringly large number having nearly a
million digits. An amino acid is coded for by a sequence of three nucleotides; there are
43 = 64 different codes, but there are only 20 amino acids since some of them can be
coded for in several ways. A protein molecule is composed of as many as hundreds of
amino acid units, and thus there are an incredibly large number of possible proteins.
■
For example, there are 20100 different sequences of 100 amino acids.
1.4.2 Permutations and Combinations
A permutation is an ordered arrangement of objects. Suppose that from the set
C = {c1 , c2 , . . . , cn } we choose r elements and list them in order. How many ways
can we do this? The answer depends on whether we are allowed to duplicate items
in the list. If no duplication is allowed, we are sampling without replacement. If
duplication is allowed, we are sampling with replacement. We can think of the
problem as that of taking labeled balls from an urn. In the first type of sampling, we
are not allowed to put a ball back before choosing the next one, but in the second, we
are. In either case, when we are done choosing, we have a list of r balls ordered in
the sequence in which they were drawn.
The extended multiplication principle can be used to count the number of different
ordered samples possible for a set of n elements. First, suppose that sampling is done
with replacement. The first ball can be chosen in any of n ways, the second in any
of n ways, etc., so that there are n × n × · · · × n = n r samples. Next, suppose that
sampling is done without replacement. There are n choices for the first ball, n − 1
choices for the second ball, n − 2 for the third, . . . , and n − r + 1 for the r th. We
have just proved the following proposition.
PROPOSITION A
For a set of size n and a sample of size r , there are n r different ordered samples with replacement and n(n − 1)(n − 2) · · · (n − r + 1) different ordered
■
samples without replacement.
COROLLARY A
The number of orderings of n elements is n(n − 1)(n − 2) · · · 1 = n!.
■
10
Chapter 1
Probability
EXAMPLE A
How many ways can five children be lined up?
This corresponds to sampling without replacement. According to Corollary A,
there are 5! = 5 × 4 × 3 × 2 × 1 = 120 different lines.
■
EXAMPLE B
Suppose that from ten children, five are to be chosen and lined up. How many different
lines are possible?
From Proposition A, there are 10 × 9 × 8 × 7 × 6 = 30,240 different lines. ■
EXAMPLE C
In some states, license plates have six characters: three letters followed by three
numbers. How many distinct such plates are possible?
This corresponds to sampling with replacement. There are 263 = 17,576 different
ways to choose the letters and 103 = 1000 ways to choose the numbers. Using
the multiplication principle again, we find there are 17,576 × 1000 = 17,576,000
■
different plates.
EXAMPLE D
If all sequences of six characters are equally likely, what is the probability that the
license plate for a new car will contain no duplicate letters or numbers?
Call the desired event A; consists of all 17,576,000 possible sequences. Since
these are all equally likely, the probability of A is the ratio of the number of ways
that A can occur to the total number of possible outcomes. There are 26 choices for
the first letter, 25 for the second, 24 for the third, and hence 26 × 25 × 24 = 15,600
ways to choose the letters without duplication (doing so corresponds to sampling
without replacement), and 10 × 9 × 8 = 720 ways to choose the numbers without
duplication. From the multiplication principle, there are 15,600 × 720 = 11,232,000
nonrepeating sequences. The probability of A is thus
P(A) =
EXAMPLE E
11,232,000
= .64
17,576,000
■
Birthday Problem
Suppose that a room contains n people. What is the probability that at least two of
them have a common birthday?
This is a famous problem with a counterintuitive answer. Assume that every day
of the year is equally likely to be a birthday, disregard leap years, and denote by A
the event that at least two people have a common birthday. As is sometimes the case,
finding P(Ac ) is easier than finding P(A). This is because A can happen in many
ways, whereas Ac is much simpler. There are 365n possible outcomes, and Ac can
happen in 365 × 364 × · · · × (365 − n + 1) ways. Thus,
P(Ac ) =
365 × 364 × · · · × (365 − n + 1)
365n
1.4 Computing Probabilities: Counting Methods
11
and
365 × 364 × · · · × (365 − n + 1)
365n
The following table exhibits the latter probabilities for various values of n:
P(A) = 1 −
n
P(A)
4
16
23
32
40
56
.016
.284
.507
.753
.891
.988
From the table, we see that if there are only 23 people, the probability of at least one
match exceeds .5. The probabilities in the table are larger than one might intuitively
guess, showing that the coincidence is not unlikely. Try it in your class.
■
EXAMPLE F
How many people must you ask to have a 50 : 50 chance of finding someone who
shares your birthday?
Suppose that you ask n people; let A denote the event that someone’s birthday is
the same as yours. Again, working with Ac is easier. The total number of outcomes
is 365n , and the total number of ways that Ac can happen is 364n . Thus,
P(Ac ) =
364n
365n
and
364n
365n
For the latter probability to be .5, n should be 253, which may seem counterintuitive.
P(A) = 1 −
■
We now shift our attention from counting permutations to counting combinations. Here we are no longer interested in ordered samples, but in the constituents
of the samples regardless of the order in which they were obtained. In particular,
we ask the following question: If r objects are taken from a set of n objects without
replacement and disregarding order, how many different samples are possible? From
the multiplication principle, the number of ordered samples equals the number of
unordered samples multiplied by the number of ways to order each sample. Since the
number of ordered samples is n(n − 1) · · · (n − r + 1), and since a sample of size r
can be ordered in r ! ways (Corollary A), the number of unordered samples is
n(n − 1) · · · (n − r + 1)
n!
=
r!
(n − r )!r !
This number is also denoted as nr . We have proved the following proposition.
12
Chapter 1
Probability
PROPOSITION B
The number of unordered
samples of r objects selected from n objects without
replacement is nr .
The numbers nk , called the binomial coefficients, occur in the expansion
n n k n−k
n
(a + b) =
a b
k
k=0
In particular,
2 =
n
n n
k=0
k
This latter result can be interpreted as the number of subsets of a set of n objects.
We just add the number of subsets of size 0 (with the usual convention that
0! = 1), and the number of subsets of size 1, and the number of subsets of
■
size 2, etc.
EXAMPLE G
Up until 1991, a player of the California state lottery could win the jackpot prize by
choosing the 6 numbers from1 to 49 that were subsequently chosen at random by
= 13,983,816 possible ways to choose 6 numbers
the lottery officials. There are 49
6
from 49, and so the probability of winning was about 1 in 14 million. If there were
no winners, the funds thus accumulated were rolled over (carried over) into the next
round of play, producing a bigger jackpot. In 1991, the rules were
changed so that
= 22,957,480,
a winner had to correctly select 6 numbers from 1 to 53. Since 53
6
the probability of winning decreased to about 1 in 23 million. Because of the ensuing
rollover, the jackpot accumulated to a record of about $120 million. This produced a
fever of play—people were buying tickets at the rate of between 1 and 2 million per
■
hour and state revenues burgeoned.
EXAMPLE H
In the practice of quality control, only a fraction of the output of a manufacturing
process is sampled and examined, since it may be too time-consuming and expensive
to examine each item, or because sometimes the testing is destructive.
Suppose that
n items are in a lot and a sample of size r is taken. There are nr such samples. Now
suppose that the lot contains k defective items. What is the probability that the sample
contains exactly m defectives?
Clearly, this question is relevant to the efficacy of the sampling scheme, and the
most desirable sample size can be determined by computing such probabilities for
various values of r . Call the event in question A. The probability of A is the number
of ways A can occur divided by the total number of outcomes. Tofind
the number of
ways A can occur, we use the multiplication principle. There are mk ways to choose
m defective items in the sample from the k defectives in the lot, and there are
the
n−k
ways to choose the r − m nondefective items in the sample from the n − k
r −m
ways. Thus, P(A) is the
nondefectives in the lot. Therefore, A can occur in mk rn−k
−m
1.4 Computing Probabilities: Counting Methods
13
ratio of the number of ways A can occur to the total number of outcomes, or
k n−k P(A) =
m
nr−m
■
r
Capture/Recapture Method
The so-called capture/recapture method is sometimes used to estimate the size of a
wildlife population. Suppose that 10 animals are captured, tagged, and released. On
a later occasion, 20 animals are captured, and it is found that 4 of them are tagged.
How large is the population?
We assume that there are n animals in the population, of which
n 10 are tagged.
possible groups
If the 20 animals captured later are taken in such a way that all 20
are equally likely (this is a big assumption), then the probability that 4 of them are
tagged is (using the technique of the previous example)
10n−10
4 16
n
20
Clearly, n cannot be precisely determined from the information at hand, but it can be
estimated. One method of estimation, called maximum likelihood, is to choose that
value of n that makes the observed outcome most probable. (The method of maximum
likelihood is one of the main subjects of a later chapter in this text.) The probability
of the observed outcome as a function of n is called the likelihood. Figure 1.3 shows
the likelihood as a function of n; the likelihood is maximized at n = 50.
0.35
0.30
0.25
Likelihood
EXAMPLE I
0.20
0.15
0.10
0.05
0
20
30
40
50
60
n
70
80
90
100
F I G U R E 1.3 Likelihood for Example I.
To find the maximum likelihood estimate, suppose that, in general, t animals are
tagged. Then, of a second sample of size m, r tagged animals are recaptured. We
estimate n by the maximizer of the likelihood
t n−t Ln =
r
nm−r
m
14
Chapter 1
Probability
To find the value of n that maximizes L n , consider the ratio of successive terms, which
after some algebra is found to be
(n − t)(n − m)
Ln
=
L n−1
n(n − t − m + r )
This ratio is greater than 1, i.e., L n is increasing, if
(n − t)(n − m) > n(n − t − m + r )
n − nm − nt + mt > n 2 − nt − nm − nr
mt > nr
mt
>n
r
2
Thus, L n increases for n < mt/r and decreases for n > mt/r ; so the value of n that
maximizes L n is the greatest integer not exceeding mt/r .
Applying this result to the data given previously, we see that the maximum
likelihood estimate of n is mtr = 204·10 = 50. This estimate has some intuitive appeal,
as it equates the proportion of tagged animals in the second sample to the proportion
in the population:
10
4
=
20
n
■
Proposition B has the following extension.
PROPOSITION C
The number of ways that n objects can be grouped into r classes with n i in the
ith class, i = 1, . . . , r , and ri=1 n i = n is
n!
n
=
n 1 !n 2 ! · · · n r !
n 1 n 2 · · · nr
Proof
This can be seen by using Proposition B and the multiplication principle. (Note
that Proposition B is the special case for which r = 2.) There are nn1 ways
1
to choose the objects for the first class. Having done that, there are n−n
n2
ways of choosing the objects for the second class. Continuing in this manner,
there are
(n − n 1 − n 2 − · · · − n r −1 )!
n!
(n − n 1 )!
···
n 1 !(n − n 1 )! (n − n 1 − n 2 )!n 2 !
0!n r !
choices in all. After cancellation, this yields the desired result.
■
1.4 Computing Probabilities: Counting Methods
EXAMPLE J
15
A committee of seven members is to be divided into three subcommittees of size
three, two, and two. This can be done in
7
7!
= 210
=
322
3!2!2!
■
ways.
EXAMPLE K
In how many ways can the set of nucleotides {A, A, G, G, G, G, C, C, C} be arranged
in a sequence of nine letters? Proposition C can be applied by realizing that this
problem can be cast as determining the number of ways that the nine positions in the
sequence can be divided into subgroups of sizes two, four, and three (the locations of
the letters A, G, and C):
9
9!
= 1260
=
■
2!4!3!
243
EXAMPLE L
In how many ways can n = 2m people be paired and assigned to m courts for the first
round of a tennis tournament?
In this problem, n i = 2, i = 1, . . . , m, and, according to Proposition C, there
are
(2m)!
2m
assignments.
One has to be careful with problems such as this one. Suppose we were asked
how many ways 2m people could be arranged in pairs without assigning the pairs to
courts. Since there are m! ways to assign the m pairs to m courts, the preceding result
should be divided by m!, giving
(2m)!
m!2m
■
pairs in all.
The numbers
expansion
n
n 1 n 2 ···nr
are called multinomial coefficients. They occur in the
(x1 + x2 + · · · + xr ) =
n
n
x n 1 x n 2 · · · xrnr
n 1 n 2 · · · nr 1 2
where the sum is over all nonnegative integers n 1 , n 2 , . . . , n r such that n 1 + n 2 +
· · · + n r = n.
16
Chapter 1
Probability
1.5 Conditional Probability
We introduce the definition and use of conditional probability with an example. Digitalis therapy is often beneficial to patients who have suffered congestive heart failure,
but there is the risk of digitalis intoxication, a serious side effect that is difficult to
diagnose. To improve the chances of a correct diagnosis, the concentration of digitalis
in the blood can be measured. Bellar et al. (1971) conducted a study of the relation
of the concentration of digitalis in the blood to digitalis intoxication in 135 patients.
Their results are simplified slightly in the following table, where this notation is used:
T + = high blood concentration (positive test)
T − = low blood concentration (negative test)
D+ = toxicity (disease present)
D− = no toxicity (disease absent)
D+
D−
Total
T+
T−
25
18
14
78
39
96
Total
43
92
135
Thus, for example, 25 of the 135 patients had a high blood concentration of digitalis
and suffered toxicity.
Assume that the relative frequencies in the study roughly hold in some larger
population of patients. (Making inferences about the frequencies in a large population
from those observed in a small sample is a statistical problem, which will be taken
up in a later chapter of this book.) Converting the frequencies in the preceding table
to proportions (relative to 135), which we will regard as probabilities, we obtain the
following table:
D+
D−
Total
T+
T−
.185
.133
.104
.578
.289
.711
Total
.318
.682
1.000
From the table, P(T +) = .289 and P(D+) = .318, for example. Now if a doctor
knows that the test was positive (that there was a high blood concentration), what is the
probability of disease (toxicity) given this knowledge? We can restrict our attention
to the first row of the table, and we see that of the 39 patients who had positive tests,
25 suffered from toxicity. We denote the probability that a patient shows toxicity given
that the test is positive by P(D + | T +), which is called the conditional probability
of D + given T +.
P(D + | T +) =
25
= .640
39
1.5 Conditional Probability
17
Equivalently, we can calculate this probability as
P(D + ∩ T +)
P(T +)
.185
= .640
=
.289
P(D + | T +) =
In summary, we see that the unconditional probability of D + is .318, whereas
the conditional probability D + given T + is .640. Therefore, knowing that the test is
positive makes toxicity more than twice as likely. What if the test is negative?
P(D − | T −) =
.578
= .848
.711
For comparison, P(D−) = .682. Two other conditional probabilities from this example are of interest: The probability of a false positive is P(D − | T +) = .360, and
the probability of a false negative is P(D + | T −) = .187.
In general, we have the following definition.
DEFINITION
Let A and B be two events with P(B) = 0. The conditional probability of A
given B is defined to be
P(A | B) =
P(A ∩ B)
P(B)
■
The idea behind this definition is that if we are given that event B occurred,
the relevant sample space becomes B rather than , and conditional probability is a
probability measure on B. In the digitalis example, to find P(D + | T +), we restricted
our attention to the 39 patients who had positive tests. For this new measure to be a
probability measure, it must satisfy the axioms, and this can be shown.
In some situations, P(A | B) and P(B) can be found rather easily, and we can
then find P(A ∩ B).
MULTIPLICATION LAW
Let A and B be events and assume P(B) = 0. Then
P(A ∩ B) = P(A | B)P(B)
■
The multiplication law is often useful in finding the probabilities of intersections,
as the following examples illustrate.
EXAMPLE A
An urn contains three red balls and one blue ball. Two balls are selected without
replacement. What is the probability that they are both red?
18
Chapter 1
Probability
Let R1 and R2 denote the events that a red ball is drawn on the first trial and on
the second trial, respectively. From the multiplication law,
P(R1 ∩ R2 ) = P(R1 )P(R2 | R1 )
P(R1 ) is clearly 34 , and if a red ball has been removed on the first trial, there are two
red balls and one blue ball left. Therefore, P(R2 | R1 ) = 23 . Thus, P(R1 ∩ R2 ) = 12 .
■
EXAMPLE B
Suppose that if it is cloudy (B), the probability that it is raining (A) is .3, and that
the probability that it is cloudy is P(B) = .2 The probability that it is cloudy and
raining is
P(A ∩ B) = P(A | B)P(B) = .3 × .2 = .06
■
Another useful tool for computing probabilities is provided by the following law.
LAW OF TOTAL PROBABILITY
n
Let B1 , B2 , . . . , Bn be such that i=1
Bi = and Bi ∩ B j = ∅ for i = j, with
P(Bi ) > 0 for all i. Then, for any event A,
P(A) =
n
P(A | Bi )P(Bi )
i=1
Proof
Before going through a formal proof, it is helpful to state the result in words. The
Bi are mutually disjoint events whose union is . To find the probability of an
event A, we sum the conditional probabilities of A given Bi , weighted by P(Bi ).
Now, for the proof, we first observe that
P(A) = P(A ∩ )
n
Bi
= P A∩
=P
n
i=1
(A ∩ Bi )
i=1
Since the events A ∩ Bi are disjoint,
n
n
(A ∩ Bi ) =
P(A ∩ Bi )
P
i=1
i=1
=
n
i=1
P(A | Bi )P(Bi )
■
1.5 Conditional Probability
19
The law of total probability is useful in situations where it is not obvious how to
calculate P(A) directly but in which P(A | Bi ) and P(Bi ) are more straightforward,
such as in the following example.
EXAMPLE C
Referring to Example A, what is the probability that a red ball is selected on the
second draw?
The answer may or may not be intuitively obvious—that depends on your intuition. On the one hand, you could argue that it is “clear from symmetry” that
P(R2 ) = P(R1 ) = 34 . On the other hand, you could say that it is obvious that a red
ball is likely to be selected on the first draw, leaving fewer red balls for the second
draw, so that P(R2 ) < P(R1 ). The answer can be derived easily by using the law of
total probability:
P(R2 ) = P(R2 | R1 )P(R1 ) + P(R2 | B1 )P(B1 )
1
3
2 3
= × +1× =
3 4
4
4
where B1 denotes the event that a blue ball is drawn on the first trial.
■
As another example of the use of conditional probability, we consider a model
that has been used for occupational mobility.
EXAMPLE D
Suppose that occupations are grouped into upper (U ), middle (M), and lower (L)
levels. U1 will denote the event that a father’s occupation is upper-level; U2 will
denote the event that a son’s occupation is upper-level, etc. (The subscripts index
generations.) Glass and Hall (1954) compiled the following statistics on occupational
mobility in England and Wales:
U1
M1
L1
U2
M2
L2
.45
.05
.01
.48
.70
.50
.07
.25
.49
Such a table, which is called a matrix of transition probabilities, is to be read in
the following way: If a father is in U, the probability that his son is in U is .45, the
probability that his son is in M is .48, etc. The table thus gives conditional probabilities:
for example, P(U2 | U1 ) = .45. Examination of the table reveals that there is more
upward mobility from L into M than from M into U . Suppose that of the father’s
generation, 10% are in U , 40% in M, and 50% in L. What is the probability that a
son in the next generation is in U ?
Applying the law of total probability, we have
P(U2 ) = P(U2 | U1 )P(U1 ) + P(U2 | M1 )P(M1 ) + P(U2 | L 1 )P(L 1 )
= .45 × .10 + .05 × .40 + .01 × .50 = .07
P(M2 ) and P(L 2 ) can be worked out similarly.
■
20
Chapter 1
Probability
Continuing with Example D, suppose we ask a different question: If a son has
occupational status U2 , what is the probability that his father had occupational status
U1 ? Compared to the question asked in Example D, this is an “inverse” problem; we
are given an “effect” and are asked to find the probability of a particular “cause.” In
situations like this, Bayes’ rule, which we state shortly, is useful. Before stating the
rule, we will see what it amounts to in this particular case.
We wish to find P(U1 | U2 ). By definition,
P(U1 | U2 ) =
=
P(U1 ∩ U2 )
P(U2 )
P(U2 | U1 )P(U1 )
P(U2 | U1 )P(U1 ) + P(U2 | M1 )P(M1 ) + P(U2 | L 1 )P(L 1 )
Here we used the multiplication law to reexpress the numerator and the law of
total probability to restate the denominator. The value of the numerator is
P(U2 | U1 )P(U1 ) = .45 × .10 = .045, and we calculated the denominator in Example D to be .07, so we find that P(U1 | U2 ) = .64. In other words, 64% of the sons who
are in upper-level occupations have fathers who were in upper-level occupations.
We now state Bayes’ rule.
BAYES' RULE
Let A and B1 , . . . , Bn be events where the Bi are disjoint,
P(Bi ) > 0 for all i. Then
P(B j | A) =
n
i=1
Bi = , and
P(A | B j )P(B j )
n
P(A | Bi )P(Bi )
i=1
The proof of Bayes’ rule follows exactly as in the preceding discussion.
EXAMPLE E
■
Diamond and Forrester (1979) applied Bayes’ rule to the diagnosis of coronary artery
disease. A procedure called cardiac fluoroscopy is used to determine whether there
is calcification of coronary arteries and thereby to diagnose coronary artery disease.
From the test, it can be determined if 0, 1, 2, or 3 coronary arteries are calcified. Let
T0 , T1 , T2 , T3 denote these events. Let D+ or D− denote the event that disease is
present or absent, respectively. Diamond and Forrester presented the following table,
based on medical studies:
i
P(Ti | D+)
P(Ti | D−)
0
1
2
3
.42
.24
.20
.15
.96
.02
.02
.00
1.5 Conditional Probability
21
According to Bayes’ rule,
P(D+ | Ti ) =
P(Ti | D+)P(D+)
P(Ti | D+)P(D+) + P(Ti | D−)P(D−)
Thus, if the initial probabilities P(D+) and P(D−) are known, the probability that
a patient has coronary artery disease can be calculated.
Let us consider two specific cases. For the first, suppose that a male between
the ages of 30 and 39 suffers from nonanginal chest pain. For such a patient, it is
known from medical statistics that P(D+) ≈ .05. Suppose that the test shows that
no arteries are calcified. From the preceding equation,
P(D+ | T0 ) =
.42 × .05
= .02
.42 × .05 + .96 × .95
It is unlikely that the patient has coronary artery disease. On the other hand, suppose
that the test shows that one artery is calcified. Then
P(D+ | T1 ) =
.24 × .05
= .39
.24 × .05 + .02 × .95
Now it is more likely that this patient has coronary artery disease, but by no means
certain.
As a second case, suppose that the patient is a male between ages 50 and 59 who
suffers typical angina. For such a patient, P(D+) = .92. For him, we find that
P(D+ | T0 ) =
.42 × .92
= .83
.42 × .92 + .96 × .08
P(D+ | T1 ) =
.24 × .92
= .99
.24 × .92 + .02 × .08
Comparing the two patients, we see the strong influence of the prior probability,
P(D+).
■
EXAMPLE F
Polygraph tests (lie-detector tests) are often routinely administered to employees
or prospective employees in sensitive positions. Let + denote the event that the
polygraph reading is positive, indicating that the subject is lying; let T denote the
event that the subject is telling the truth; and let L denote the event that the subject is
lying. According to studies of polygraph reliability (Gastwirth 1987),
P(+ | L) = .88
from which it follows that P(− | L) = .12 also
P(− | T ) = .86
from which it follows that P(+ | T ) = .14. In words, if a person is lying, the probability that this is detected by the polygraph is .88, whereas if he is telling the truth,
the polygraph indicates that he is telling the truth with probability .86. Now suppose
that polygraphs are routinely administered to screen employees for security reasons,
and that on a particular question the vast majority of subjects have no reason to lie so
22
Chapter 1
Probability
that P(T ) = .99, whereas P(L) = .01. A subject produces a positive response on
the polygraph. What is the probability that the polygraph is incorrect and that she is
in fact telling the truth? We can evaluate this probability with Bayes’ rule:
P(+ | T )P(T )
P(+ | T )P(T ) + P(+ | L)P(L)
(.14)(.99)
=
(.14)(.99) + (.88)(.01)
= .94
P(T | +) =
Thus, in screening this population of largely innocent people, 94% of the positive
polygraph readings will be in error. Most of those placed under suspicion because of
the polygraph result will, in fact, be innocent. This example illustrates some of the
dangers in using screening procedures on large populations.
■
Bayes’ rule is the fundamental mathematical ingredient of a subjective, or
“Bayesian,” approach to epistemology, theories of evidence, and theories of learning.
According to this point of view, an individual’s beliefs about the world can be coded
in probabilities. For example, an individual’s belief that it will hail tomorrow can be
represented by a probability P(H ). This probability varies from individual to individual. In principle, each individual’s probability can be ascertained, or elicited, by
offering him or her a series of bets at different odds.
According to Bayesian theory, our beliefs are modified as we are confronted with
evidence. If, initially, my probability for a hypothesis is P(H ), after seeing evidence
E (e.g., a weather forecast), my probability becomes P(H |E). P(E|H ) is often easier
to evaluate than P(H |E). In this case, the application of Bayes’ rule gives
P(H |E) =
P(E|H )P(H )
P(E|H )P(H ) + P(E| H̄ )P( H̄ )
where H̄ is the event that H does not hold. This point can be illustrated by the
preceding polygraph example. Suppose an investigator is questioning a particular
suspect and that the investigator’s prior opinion that the suspect is telling the truth
is P(T ). Then, upon observing a positive polygraph reading, his opinion becomes
P(T |+). Note that different investigators will have different prior probabilities P(T )
for different suspects, and thus different posterior probabilities.
As appealing as this formulation might be, a long line of research has demonstrated that humans are actually not very good at doing probability calculations in
evaluating evidence. For example, Tversky and Kahneman (1974) presented subjects
with the following question: “If Linda is a 31-year-old single woman who is outspoken on social issues such as disarmament and equal rights, which of the following
statements is more likely to be true?
•
•
Linda is bank teller.
Linda is a bank teller and active in the feminist movement.”
More than 80% of those questioned chose the second statement, despite Property C
of Section 1.3.
1.6 Independence
23
Even highly trained professionals are not good at doing probability calculations,
as illustrated by the following example of Eddy (1982), regarding interpreting the
results from mammogram screening. One hundred physicians were presented with
the following information:
•
•
•
In the absence of any special information, the probability that a woman (of the age
and health status of this patient) has breast cancer is 1%.
If the patient has breast cancer, the probability that the radiologist will correctly
diagnose it is 80%.
If the patient has a benign lesion (no breast cancer), the probability that the radiologist will incorrectly diagnose it as cancer is 10%.
They were then asked, “What is the probability that a patient with a positive mammogram actually has breast cancer?”
Ninety-five of the 100 physicians estimated the probability to be about 75%. The
correct probability, as given by Bayes’ rule, is 7.5%. (You should check this.) So even
experts radically overestimate the strength of the evidence provided by a positive
outcome on the screening test.
Thus the Bayesian probability calculus does not describe the way people actually
assimilate evidence. Advocates for Bayesian learning theory might assert that the
theory describes the way people “should think.” A softer point of view is that Bayesian
learning theory is a model for learning, and it has the merit of being a simple model
that can be programmed on computers. Probability theory in general, and Bayesian
learning theory in particular, are part of the core of artificial intelligence.
1.6 Independence
Intuitively, we would say that two events, A and B, are independent if knowing that
one had occurred gave us no information about whether the other had occurred; that
is, P(A | B) = P(A) and P(B | A) = P(B). Now, if
P(A ∩ B)
P(A) = P(A | B) =
P(B)
then
P(A ∩ B) = P(A)P(B)
We will use this last relation as the definition of independence. Note that it is symmetric
in A and in B, and does not require the existence of a conditional probability, that is,
P(B) can be 0.
DEFINITION
A and B are said to be independent events if P(A ∩ B) = P(A)P(B).
EXAMPLE A
■
A card is selected randomly from a deck. Let A denote the event that it is an ace
and D the event that it is a diamond. Knowing that the card is an ace gives no
24
Chapter 1
Probability
information about its suit. Checking formally that the events are independent, we
4
1
= 13
and P(D) = 14 . Also, A ∩ D is the event that the card is the ace
have P(A) = 52
1
1
1
of diamonds and P(A ∩ D) = 52
. Since P(A)P(D) = ( 14 ) × ( 13
) = 52
, the events
■
are in fact independent.
EXAMPLE B
A system is designed so that it fails only if a unit and a backup unit both fail. Assuming
that these failures are independent and that each unit fails with probability p, the
system fails with probability p 2 . If, for example, the probability that any unit fails
during a given year is .1, then the probability that the system fails is .01, which
represents a considerable improvement in reliability.
■
Things become more complicated when we consider more than two events. For
example, suppose we know that events A, B, and C are pairwise independent (any
two are independent). We would like to be able to say that they are all independent
based on the assumption that knowing something about two of the events does not tell
us anything about the third, for example, P(C | A ∩ B) = P(C). But as the following
example shows, pairwise independence does not guarantee mutual independence.
EXAMPLE C
A fair coin is tossed twice. Let A denote the event of heads on the first toss, B the
event of heads on the second toss, and C the event that exactly one head is thrown. A
and B are clearly independent, and P(A) = P(B) = P(C) = .5. To see that A and
C are independent, we observe that P(C | A) = .5. But
P(A ∩ B ∩ C) = 0 = P(A)P(B)P(C)
■
To encompass situations such as that in Example C, we define a collection
of events, A1 , A2 , . . . , An , to be mutually independent if for any subcollection,
Ai 1 , . . . , Ai m ,
P(Ai1 ∩ · · · ∩ Aim ) = P(Ai1 ) · · · P(Aim )
EXAMPLE D
We return to Example B of Section 1.3 (infectivity of AIDS). Suppose that virus
transmissions in 500 acts of intercourse are mutually independent events and that
the probability of transmission in any one act is 1/500. Under this model, what is the
probability of infection? It is easier to first find the probability of the complement
of this event. Let C1 , C2 , . . . , C500 denote the events that virus transmission does not
occur during encounters 1, 2, . . . , 500. Then the probability of no infection is
1 500
= .37
P(C1 ∩ C2 ∩ · · · ∩ C500 ) = 1 −
500
so the probability of infection is 1 − .37 = .63, not 1, which is the answer produced
by incorrectly adding probabilities.
■
1.6 Independence
EXAMPLE E
25
Consider a circuit with three relays (Figure 1.4). Let Ai denote the event that the ith
relay works, and assume that P(Ai ) = p and that the relays are mutually independent.
If F denotes the event that current flows through the circuit, then F = A3 ∪ (A1 ∩ A2 )
and, from the addition law and the assumption of independence,
P(F) = P(A3 ) + P(A1 ∩ A2 ) − P(A1 ∩ A2 ∩ A3 ) = p + p 2 − p 3
■
3
1
2
F I G U R E 1.4 Circuit with three relays.
EXAMPLE F
Suppose that a system consists of components connected in a series, so the system
fails if any one component fails. If there are n mutually independent components and
each fails with probability p, what is the probability that the system will fail?
It is easier to find the probability of the complement of this event; the system
works if and only if all the components work, and this situation has probability
(1 − p)n . The probability that the system fails is then 1 − (1 − p)n . For example, if
n = 10 and p = .05, the probability that the system works is only .9510 = .60, and
the probability that the system fails is .40.
Suppose, instead, that the components are connected in parallel, so the system
fails only when all components fail. In this case, the probability that the system fails
■
is only .0510 = 9.8 × 10−14 .
Calculations like those in Example F are made in reliability studies for systems consisting of quite complicated networks of components. The absolutely crucial
assumption is that the components are independent of one another. Theoretical studies
of the reliability of nuclear power plants have been criticized on the grounds that they
incorrectly assume independence of the components.
EXAMPLE G
Matching DNA Fragments
Fragments of DNA are often compared for similarity, for example, across species.
A simple way to make a comparison is to count the number of locations, or sites,
at which these fragments agree. For example, consider these two sequences, which
agree at three sites: fragment 1: AGATCAGT; and fragment 2: TGGATACT.
Many such comparisons are made, and to sort the wheat from the chaff, a probability model is often used. A comparison is deemed interesting if the number of
26
Chapter 1
Probability
matches is much larger than would be expected by chance alone. This requires a
chance model; a simple one stipulates that the nucleotide at each site of fragment 1
occurs randomly with probabilities p A1 , pG1 , pC1 , pT 1 , and that the second fragment
is similarly composed with probabilities p A2 , . . . , pT 2 . What is the chance that the
fragments match at a particular site if in fact the identity of the nucleotide on fragment 1 is independent of that on fragment 2? The match probability can be calculated
using the law of total probability:
P(match) = P(match|A on fragment 1)P(A on fragment 1) +
. . . + P(match|T on fragment 1)P(T on fragment 1)
= p A2 p A1 + pG2 pG1 + pC2 pC1 + pT 2 pT 1
The problem of determining the probability that they match at k out of a total of
■
n sites is discussed later.
1.7 Concluding Remarks
This chapter provides a simple axiomatic development of the mathematical theory of
probability. Some subtle issues that arise in a careful analysis of infinite sample spaces
have been neglected. Such issues are typically addressed in graduate-level courses
in measure theory and probability theory. Certain philosophical questions have also
been avoided. One might ask what is meant by the statement “The probability that
this coin will land heads up is 12 .” Two commonly advocated views are the frequentist approach and the Bayesian approach. According to the frequentist approach,
the statement means that if the experiment were repeated many times, the long-run
average number of heads would tend to 12 . According to the Bayesian approach, the
statement is a quantification of the speaker’s uncertainty about the outcome of the
experiment and thus is a personal or subjective notion; the probability that the coin
will land heads up may be different for different speakers, depending on their experience and knowledge of the situation. There has been vigorous and occasionally
acrimonious debate among proponents of various versions of these points of view.
In this and ensuing chapters, there are many examples of the use of probability
as a model for various phenomena. In any such modeling endeavor, an idealized
mathematical theory is hoped to provide an adequate match to characteristics of the
phenomenon under study. The standard of adequacy is relative to the field of study
and the modeler’s goals.
1.8 Problems
1. A coin is tossed three times and the sequence of heads and tails is recorded.
a. List the sample space.
b. List the elements that make up the following events: (1) A = at least two
heads, (2) B = the first two tosses are heads, (3) C = the last toss is a tail.
c. List the elements of the following events: (1) Ac , (2) A ∩ B, (3) A ∪ C.
1.8 Problems
27
2. Two six-sided dice are thrown sequentially, and the face values that come up are
recorded.
a. List the sample space.
b. List the elements that make up the following events: (1) A = the sum of the
two values is at least 5, (2) B = the value of the first die is higher than the
value of the second, (3) C = the first value is 4.
c. List the elements of the following events: (1) A∩C, (2) B ∪C, (3) A∩(B ∪C).
3. An urn contains three red balls, two green balls, and one white ball. Three balls
are drawn without replacement from the urn, and the colors are noted in sequence.
List the sample space. Define events A, B, and C as you wish and find their unions
and intersections.
4. Draw Venn diagrams to illustrate De Morgan’s laws:
(A ∪ B)c = Ac ∩ B c
(A ∩ B)c = Ac ∪ B c
5. Let A and B be arbitrary events. Let C be the event that either A occurs or B
occurs, but not both. Express C in terms of A and B using any of the basic
operations of union, intersection, and complement.
6. Verify the following extension of the addition rule (a) by an appropriate Venn
diagram and (b) by a formal argument using the axioms of probability and the
propositions in this chapter.
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B)
− P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)
7. Prove Bonferroni’s inequality:
P(A ∩ B) ≥ P(A) + P(B) − 1
8. Prove that
P
n
i=1
Ai
≤
n
P(Ai )
i=1
9. The weather forecaster says that the probability of rain on Saturday is 25% and
that the probability of rain on Sunday is 25%. Is the probability of rain during
the weekend 50%? Why or why not?
10. Make up another example of Simpson’s paradox by changing the numbers in
Example B of Section 1.4.
11. The first three digits of a university telephone exchange are 452. If all the sequences of the remaining four digits are equally likely, what is the probability
that a randomly selected university phone number contains seven distinct digits?
12. In a game of poker, five players are each dealt 5 cards from a 52-card deck. How
many ways are there to deal the cards?
28
Chapter 1
Probability
13. In a game of poker, what is the probability that a five-card hand will contain (a)
a straight (five cards in unbroken numerical sequence), (b) four of a kind, and (c)
a full house (three cards of one value and two cards of another value)?
14. The four players in a bridge game are each dealt 13 cards. How many ways are
there to do this?
15. How many different meals can be made from four kinds of meat, six vegetables,
and three starches if a meal consists of one selection from each group?
16. How many different letter arrangements can be obtained from the letters of the
word statistically, using all the letters?
17. In acceptance sampling, a purchaser samples 4 items from a lot of 100 and rejects
the lot if 1 or more are defective. Graph the probability that the lot is accepted as
a function of the percentage of defective items in the lot.
18. A lot of n items contains k defectives, and m are selected randomly and inspected.
How should the value of m be chosen so that the probability that at least one
defective item turns up is .90? Apply your answer to (a) n = 1000, k = 10, and
(b) n = 10,000, k = 100.
19. A committee consists of five Chicanos, two Asians, three African Americans,
and two Caucasians.
a. A subcommittee of four is chosen at random. What is the probability that all
the ethnic groups are represented on the subcommittee?
b. Answer the question for part (a) if a subcommittee of five is chosen.
20. A deck of 52 cards is shuffled thoroughly. What is the probability that the four
aces are all next to each other?
21. A fair coin is tossed five times. What is the probability of getting a sequence of
three heads?
22. A standard deck of 52 cards is shuffled thoroughly, and n cards are turned up.
What is the probability that a face card turns up? For what value of n is this
probability about .5?
23. How many ways are there to place n indistinguishable balls in n urns so that
exactly one urn is empty?
24. If n balls are distributed randomly into k urns, what is the probability that the
last urn contains j balls?
25. A woman getting dressed up for a night out is asked by her significant other to
wear a red dress, high-heeled sneakers, and a wig. In how many orders can she
put on these objects?
26. The game of Mastermind starts in the following way: One player selects four
pegs, each peg having six possible colors, and places them in a line. The second player then tries to guess the sequence of colors. What is the probability of
guessing correctly?
1.8 Problems
29
27. If a five-letter word is formed at random (meaning that all sequences of five letters
are equally likely), what is the probability that no letter occurs more than once?
28. How many ways are there to encode the 26-letter English alphabet into 8-bit
binary words (sequences of eight 0s and 1s)?
29. A poker player is dealt three spades and two hearts. He discards the two hearts and
draws two more cards. What is the probability that he draws two more spades?
30. A group of 60 second graders is to be randomly assigned to two classes of 30 each.
(The random assignment is ordered by the school district to ensure against any
bias.) Five of the second graders, Marcelle, Sarah, Michelle, Katy, and Camerin,
are close friends. What is the probability that they will all be in the same class?
What is the probability that exactly four of them will be? What is the probability
that Marcelle will be in one class and her friends in the other?
31. Six male and six female dancers perform the Virginia reel. This dance requires
that they form a line consisting of six male/female pairs. How many such arrangements are there?
32. A wine taster claims that she can distinguish four vintages of a particular Cabernet. What is the probability that she can do this by merely guessing? (She is
confronted with four unlabeled glasses.)
33. An elevator containing five people can stop at any of seven floors. What is the
probability that no two people get off at the same floor? Assume that the occupants
act independently and that all floors are equally likely for each occupant.
34. Prove the following identity:
n n
m−n
k
n−k
k=0
=
m
n
(Hint: How can each of the summands be interpreted?)
35. Prove the following two identities both algebraically and by interpreting their
meaning combinatorially.
n a. nr = n−r
n−1
+ r
b. nr = n−1
r −1
36. What is the coefficient of x 3 y 4 in the expansion of (x + y)7 ?
37. What is the coefficient of x 2 y 2 z 3 in the expansion of (x + y + z)7 ?
38. A child has six blocks, three of which are red and three of which are green. How
many patterns can she make by placing them all in a line? If she is given three white
blocks, how many total patterns can she make by placing all nine blocks in a line?
39. A monkey at a typewriter types each of the 26 letters of the alphabet exactly once,
the order being random.
a. What is the probability that the word Hamlet appears somewhere in the string
of letters?
30
Chapter 1
Probability
b. How many independent monkey typists would you need in order that the
probability that the word appears is at least .90?
40. In how many ways can two octopi shake hands? (There are a number of ways to
interpret this question—choose one.)
41. A drawer of socks contains seven black socks, eight blue socks, and nine green
socks. Two socks are chosen in the dark.
a. What is the probability that they match?
b. What is the probability that a black pair is chosen?
42. How many ways can 11 boys on a soccer team be grouped into 4 forwards,
3 midfielders, 3 defenders, and 1 goalie?
43. A software development company has three jobs to do. Two of the jobs require
three programmers, and the other requires four. If the company employs ten
programmers, how many different ways are there to assign them to the jobs?
44. In how many ways can 12 people be divided into three groups of 4 for an evening
of bridge? In how many ways can this be done if the 12 consist of six pairs of
partners?
45. Show that if the conditional probabilities exist, then
P(A1 ∩ A2 ∩ · · · ∩ An )
= P(A1 )P(A2 | A1 )P(A3 | A1 ∩ A2 ) · · · P(An | A1 ∩ A2 ∩ · · · ∩ An−1 )
46. Urn A has three red balls and two white balls, and urn B has two red balls and
five white balls. A fair coin is tossed. If it lands heads up, a ball is drawn from
urn A; otherwise, a ball is drawn from urn B.
a. What is the probability that a red ball is drawn?
b. If a red ball is drawn, what is the probability that the coin landed heads up?
47. Urn A has four red, three blue, and two green balls. Urn B has two red, three
blue, and four green balls. A ball is drawn from urn A and put into urn B, and
then a ball is drawn from urn B.
a. What is the probability that a red ball is drawn from urn B?
b. If a red ball is drawn from urn B, what is the probability that a red ball was
drawn from urn A?
48. An urn contains three red and two white balls. A ball is drawn, and then it and
another ball of the same color are placed back in the urn. Finally, a second ball
is drawn.
a. What is the probability that the second ball drawn is white?
b. If the second ball drawn is white, what is the probability that the first ball
drawn was red?
49. A fair coin is tossed three times.
a. What is the probability of two or more heads given that there was at least one
head?
b. What is the probability given that there was at least one tail?
1.8 Problems
31
50. Two dice are rolled, and the sum of the face values is six. What is the probability
that at least one of the dice came up a three?
51. Answer Problem 50 again given that the sum is less than six.
52. Suppose that 5 cards are dealt from a 52-card deck and the first one is a king.
What is the probability of at least one more king?
53. A fire insurance company has high-risk, medium-risk, and low-risk clients, who
have, respectively, probabilities .02, .01, and .0025 of filing claims within a given
year. The proportions of the numbers of clients in the three categories are .10,
.20, and .70, respectively. What proportion of the claims filed each year come
from high-risk clients?
54. This problem introduces a simple meteorological model, more complicated
versions of which have been proposed in the meteorological literature. Consider
a sequence of days and let Ri denote the event that it rains on day i. Suppose
c
) = β. Suppose further that only today’s
that P(Ri | Ri−1 ) = α and P(Ric | Ri−1
weather is relevant to predicting tomorrow’s; that is, P(Ri | Ri−1 ∩ Ri−2 ∩ · · · ∩
R0 ) = P(Ri | Ri−1 ).
a. If the probability of rain today is p, what is the probability of rain tomorrow?
b. What is the probability of rain the day after tomorrow?
c. What is the probability of rain n days from now? What happens as n approaches
infinity?
55. This problem continues Example D of Section 1.5 and concerns occupational
mobility.
a. Find P(M1 | M2 ) and P(L 1 | L 2 ).
b. Find the proportions that will be in the three occupational levels in the third
generation. To do this, assume that a son’s occupational status depends on
his father’s status, but that given his father’s status, it does not depend on his
grandfather’s.
56. A couple has two children. What is the probability that both are girls given that
the oldest is a girl? What is the probability that both are girls given that one of
them is a girl?
57. There are three cabinets, A, B, and C, each of which has two drawers. Each
drawer contains one coin; A has two gold coins, B has two silver coins, and C
has one gold and one silver coin. A cabinet is chosen at random, one drawer is
opened, and a silver coin is found. What is the probability that the other drawer
in that cabinet contains a silver coin?
58. A teacher tells three boys, Drew, Chris, and Jason, that two of them will have
to stay after school to help her clean erasers and that one of them will be able to
leave. She further says that she has made the decision as to who will leave and who
will stay at random by rolling a special three-sided Dungeons and Dragons die.
Drew wants to leave to play soccer and has a clever idea about how to increase his
chances of doing so. He figures that one of Jason and Chris will certainly stay and
asks the teacher to tell him the name of one of the two who will stay. Drew’s idea
32
Chapter 1
Probability
is that if, for example, Jason is named, then he and Chris are left and they each
have a probability .5 of leaving; similarly, if Chris is named, Drew’s probability
of leaving is still .5. Thus, by merely asking the teacher a question, Drew will
increase his probability of leaving from 13 to 12 . What do you think of this scheme?
59. A box has three coins. One has two heads, one has two tails, and the other is a
fair coin with one head and one tail. A coin is chosen at random, is flipped, and
comes up heads.
a. What is the probability that the coin chosen is the two-headed coin?
b. What is the probability that if it is thrown another time it will come up heads?
c. Answer part (a) again, supposing that the coin is thrown a second time and
comes up heads again.
60. A factory runs three shifts. In a given day, 1% of the items produced by the first
shift are defective, 2% of the second shift’s items are defective, and 5% of the
third shift’s items are defective. If the shifts all have the same productivity, what
percentage of the items produced in a day are defective? If an item is defective,
what is the probability that it was produced by the third shift?
61. Suppose that chips for an integrated circuit are tested and that the probability
that they are detected if they are defective is .95, and the probability that they are
declared sound if in fact they are sound is .97. If .5% of the chips are faulty, what
is the probability that a chip that is declared faulty is sound?
62. Show that if P(A | E) ≥ P(B | E) and P(A | E c ) ≥ P(B | E c ), then P(A) ≥
P(B).
63. Suppose that the probability of living to be older than 70 is .6 and the probability
of living to be older than 80 is .2. If a person reaches her 70th birthday, what is
the probability that she will celebrate her 80th?
64. If B is an event, with P(B) > 0, show that the set function Q(A) = P(A | B)
satisfies the axioms for a probability measure. Thus, for example,
P(A ∪ C | B) = P(A | B) + P(C | B) − P(A ∩ C | B)
65. Show that if A and B are independent, then A and B c as well as Ac and B c are
independent.
66. Show that ∅ is independent of A for any A.
67. Show that if A and B are independent, then
P(A ∪ B) = P(A) + P(B) − P(A)P(B)
68. If A is independent of B and B is independent of C, then A is independent of C.
Prove this statement or give a counterexample if it is false.
69. If A and B are disjoint, can they be independent?
70. If A ⊂ B, can A and B be independent?
71. Show that if A, B, and C are mutually independent, then A ∩ B and C are
independent and A ∪ B and C are independent.
1.8 Problems
33
72. Suppose that n components are connected in series. For each unit, there is a
backup unit, and the system fails if and only if both a unit and its backup fail.
Assuming that all the units are independent and fail with probability p, what is
the probability that the system works? For n = 10 and p = .05, compare these
results with those of Example F in Section 1.6.
73. A system has n independent units, each of which fails with probability p. The
system fails only if k or more of the units fail. What is the probability that
the system fails?
74. What is the probability that the following system works if each unit fails independently with probability p (see Figure 1.5)?
F I G U R E 1.5
75. This problem deals with an elementary aspect of a simple branching process. A
population starts with one member; at time t = 1, it either divides with probability p or dies with probability 1 − p. If it divides, then both of its children
behave independently with the same two alternatives at time t = 2. What is the
probability that there are no members in the third generation? For what value of
p is this probability equal to .5?
76. Here is a simple model of a queue. The queue runs in discrete time (t =
0, 1, 2, . . .), and at each unit of time the first person in the queue is served with
probability p and, independently, a new person arrives with probability q. At
time t = 0, there is one person in the queue. Find the probabilities that there are
0, 1, 2, 3 people in line at time t = 2.
77. A player throws darts at a target. On each trial, independently of the other trials,
he hits the bull’s-eye with probability .05. How many times should he throw so
that his probability of hitting the bull’s-eye at least once is .5?
78. This problem introduces some aspects of a simple genetic model. Assume that
genes in an organism occur in pairs and that each member of the pair can be either
of the types a or A. The possible genotypes of an organism are then A A, Aa, and
aa (Aa and a A are equivalent). When two organisms mate, each independently
contributes one of its two genes; either one of the pair is transmitted with probability .5.
a. Suppose that the genotypes of the parents are A A and Aa. Find the possible
genotypes of their offspring and the corresponding probabilities.
34
Chapter 1
Probability
b. Suppose that the probabilities of the genotypes A A, Aa, and aa are p, 2q,
and r, respectively, in the first generation. Find the probabilities in the second
and third generations, and show that these are the same. This result is called
the Hardy-Weinberg Law.
c. Compute the probabilities for the second and third generations as in part (b)
but under the additional assumption that the probabilities that an individual of
type A A, Aa, or aa survives to mate are u, v, and w, respectively.
79. Many human diseases are genetically transmitted (for example, hemophilia or
Tay-Sachs disease). Here is a simple model for such a disease. The genotype
aa is diseased and dies before it mates. The genotype Aa is a carrier but is not
diseased. The genotype AA is not a carrier and is not diseased.
a. If two carriers mate, what are the probabilities that their offspring are of each
of the three genotypes?
b. If the male offspring of two carriers is not diseased, what is the probability
that he is a carrier?
c. Suppose that the nondiseased offspring of part (b) mates with a member of the
population for whom no family history is available and who is thus assumed
to have probability p of being a carrier ( p is a very small number). What are
the probabilities that their first offspring has the genotypes AA, Aa, and aa?
d. Suppose that the first offspring of part (c) is not diseased. What is the probability that the father is a carrier in light of this evidence?
80. If a parent has genotype Aa, he transmits either A or a to an offspring (each with
a 12 chance). The gene he transmits to one offspring is independent of the one
he transmits to another. Consider a parent with three children and the following
events: A = {children 1 and 2 have the same gene}, B = {children 1 and 3 have
the same gene}, C = {children 2 and 3 have the same gene}. Show that these
events are pairwise independent but not mutually independent.
CHAPTER
2
Random Variables
2.1 Discrete Random Variables
A random variable is essentially a random number. As motivation for a definition, let
us consider an example. A coin is thrown three times, and the sequence of heads and
tails is observed; thus,
= {hhh, hht, htt, hth, ttt, tth, thh, tht}
Examples of random variables defined on are (1) the total number of heads, (2) the
total number of tails, and (3) the number of heads minus the number of tails. Each of
these is a real-valued function defined on ; that is, each is a rule that assigns a real
number to every point ω ∈ . Since the outcome in is random, the corresponding
number is random as well.
In general, a random variable is a function from to the real numbers. Because
the outcome of the experiment with sample space is random, the number produced
by the function is random as well. It is conventional to denote random variables by
italic uppercase letters from the end of the alphabet. For example, we might define
X to be the total number of heads in the experiment described above. A discrete
random variable is a random variable that can take on only a finite or at most a
countably infinite number of values. The random variable X just defined is a discrete
random variable since it can take on only the values 0, 1, 2, and 3. For an example of
a random variable that can take on a countably infinite number of values, consider an
experiment that consists of tossing a coin until a head turns up and defining Y to be
the total number of tosses. The possible values of Y are 0, 1, 2, 3, . . . . In general, a
countably infinite set is one that can be put into one-to-one correspondence with the
integers.
If the coin is fair, then each of the outcomes in above has probability 18 ,
from which the probabilities that X takes on the values 0, 1, 2, and 3 can be easily
35
Chapter 2
Random Variables
computed:
P(X = 0) =
P(X = 1) =
P(X = 2) =
P(X = 3) =
1
8
3
8
3
8
1
8
Generally, the probability measure on the sample space determines the probabilities
of the various values of X ; if those values are denoted by x1 , x2 , . . . , then there is a
function p such that p(xi ) = P(X = xi ) and i p(xi ) = 1. This function is called
the probability mass function, or the frequency function, of the random variable
X . Figure 2.1 shows a graph of p(x) for the coin tossing experiment. The frequency
function describes completely the probability properties of the random variable.
.4
.3
p(x)
36
.2
.1
0
0
1
2
3
x
F I G U R E 2.1 A probability mass function.
In addition to the frequency function, it is sometimes convenient to use the
cumulative distribution function (cdf) of a random variable, which is defined to be
F(x) = P(X ≤ x),
−∞ < x < ∞
Cumulative distribution functions are usually denoted by uppercase letters and frequency functions by lowercase letters. Figure 2.2 is a graph of the cumulative distribution function of the random variable X of the preceding paragraph. Note that the cdf
jumps wherever p(x) > 0 and that the jump at xi is p(xi ). For example, if 0 < x <
1, F(x) = 18 ; at x = 1, F(x) jumps to F(1) = 48 = 12 . The jump at x = 1 is p(1) = 38 .
The cumulative distribution function is non-decreasing and satisfies
lim F(x) = 0
x→−∞
and
lim F(x) = 1.
x→∞
Chapter 3 will cover in detail the joint frequency functions of several random
variables defined on the same sample space, but it is useful to define here the concept
2.1 Discrete Random Variables
37
1.0
.8
F(x)
.6
.4
.2
0
1
0
1
2
3
4
x
F I G U R E 2.2 The cumulative distribution function corresponding to Figure 2.1.
of independence of random variables. In the case of two discrete random variables X
and Y , taking on possible values x1 , x2 , . . . , and y1 , y2 , . . . , X and Y are said to be
independent if, for all i and j,
P(X = xi and Y = y j ) = P(X = xi )P(Y = y j )
The definition is extended to collections of more than two discrete random variables
in the obvious way; for example, X, Y, and Z are said to be mutually independent if,
for all i, j, and k,
P(X = xi , Y = y j , Z = z k ) = P(X = xi )P(Y = y j )P(Z = z k )
We next discuss some common discrete distributions that arise in applications.
2.1.1 Bernoulli Random Variables
A Bernoulli random variable takes on only two values: 0 and 1, with probabilities
1 − p and p, respectively. Its frequency function is thus
p(1) = p
p(0) = 1 − p
p(x) = 0,
if x = 0 and x = 1
An alternative and sometimes useful representation of this function is
p(x) =
p x (1 − p)1−x ,
0,
if x = 0 or x = 1
otherwise
If A is an event, then the indicator random variable, I A , takes on the value 1 if
A occurs and the value 0 if A does not occur:
1, if ω ∈ A
I A (ω) =
0, otherwise
38
Chapter 2
Random Variables
I A is a Bernoulli random variable. In applications, Bernoulli random variables often
occur as indicators. A Bernoulli random variable might take on the value 1 or 0
according to whether a guess was a success or a failure.
2.1.2 The Binomial Distribution
Suppose that n independent experiments, or trials, are performed, where n is a fixed
number, and that each experiment results in a “success” with probability p and a
“failure” with probability 1 − p. The total number of successes, X , is a binomial
random variable with parameters n and p. For example, a coin is tossed 10 times and
the total number of heads is counted (“head” is identified with “success”).
The probability that X = k, or p(k), can be found in the following way: Any
particular sequence of k successes occurs with probability p k (1− p)n−k , from the
multiplication
principle. The total number of such sequences is nk , since there are
n ways to assign k successes to n trials. P(X = k) is thus the probability of any
k
particular sequence times the number of such sequences:
n k
p(k) =
p (1 − p)n−k
k
Two binomial frequency functions are shown in Figure 2.3. Note how the shape varies
as a function of p.
p(x)
.4
.2
0
0
1
2
3
4
5
x
6
7
8
9
10
0
1
2
3
4
5
x
6
7
8
9
10
(a)
p(x)
.20
.10
0
(b)
F I G U R E 2.3 Binomial frequency functions, (a) n = 10 and p = .1 and (b) n = 10
and p = .5.
EXAMPLE A
Tay-Sachs disease is a rare but fatal disease of genetic origin occurring chiefly in
infants and children, especially those of Jewish or eastern European extraction. If a
couple are both carriers of Tay-Sachs disease, a child of theirs has probability .25 of
being born with the disease. If such a couple has four children, what is the frequency
function for the number of children who will have the disease?
2.1 Discrete Random Variables
39
We assume that the four outcomes are independent of each other, so, if X denotes
the number of children with the disease, its frequency function is
4
k = 0, 1, 2, 3, 4
p(k) =
.25k × .754−k ,
k
These probabilities are given in the following table:
EXAMPLE B
k
p(k)
0
1
2
3
4
.316
.422
.211
.047
.004
If a single bit (0 or 1) is transmitted over a noisy communications channel, it has
probability p of being incorrectly transmitted. To improve the reliability of the transmission, the bit is transmitted n times, where n is odd. A decoder at the receiving
end, called a majority decoder, decides that the correct message is that carried by a
majority of the received bits. Under a simple noise model, each bit is independently
subject to being corrupted with the same probability p. The number of bits that is
in error, X , is thus a binomial random variable with n trials and probability p of
success on each trial (in this case, and frequently elsewhere, the word success is used
in a generic sense; here a success is an error). Suppose, for example, that n = 5 and
p = .1. The probability that the message is correctly received is the probability of
two or fewer errors, which is
2 n k
p (1 − p)n−k = p 0 (1 − p)5 + 5 p(1 − p)4 + 10 p 2 (1 − p)3 = .9914
k
k=0
The result is a considerable improvement in reliability.
EXAMPLE C
■
■
DNA Matching
We continue Example G of Section 1.6. There we derived the probability p that two
fragments agree at a particular site under the assumption that the nucleotide probabilities were the same at every site and the identities on fragment 1 were independent
of those on fragment 2. To find the probability of the total number of matches, further
assumptions must be made. Suppose that the fragments are each of length n and that
the nucleotide identities are independent from site to site as well as between fragments. Thus, the identity of the nucleotide at site 1 of fragment 1 is independent of the
identity at site 2, etc. We did not make this assumption in Example G in Section 1.6;
in that case, the identity at site 2 could have depended on the identity at site 1, for
example. Now, under the current assumption, the two fragments agree at each site
with probability p as calculated in Example G of Section 1.6, and agreement is independent from site to site. So, the total number of agreements is a binomial random
variable with n trials and probability p of success.
■
40
Chapter 2
Random Variables
A random variable with a binomial distribution can be expressed in terms of independent Bernoulli random variables, a fact that will be quite useful for analyzing some
properties of binomial random variables in later chapters of this book. Specifically,
let X 1 , X 2 , . . . , X n be independent Bernoulli random variables with p(X i = 1) = p.
Then Y = X 1 + X 2 + · · · + X n is a binomial random variable.
2.1.3 The Geometric and Negative Binomial Distributions
The geometric distribution is also constructed from independent Bernoulli trials,
but from an infinite sequence. On each trial, a success occurs with probability p, and
X is the total number of trials up to and including the first success. So that X = k,
there must be k − 1 failures followed by a success. From the independence of the
trials, this occurs with probability
k = 1, 2, 3, . . .
p(k) = P(X = k) = (1 − p)k−1 p,
Note that these probabilities sum to 1:
∞
(1 − p)k−1 p = p
k=1
(1 − p) j = 1
j=0
The probability of winning in a certain state lottery is said to be about 19 . If it is exactly
1
, the distribution of the number of tickets a person must purchase up to and including
9
the first winning ticket is a geometric random variable with p = 19 . Figure 2.4 shows
■
the frequency function.
.12
.10
.08
p(x)
EXAMPLE A
∞
.06
.04
.02
0
0
10
20
30
40
50
x
F I G U R E 2.4 The probability mass function of a geometric random variable with
p = 19 .
2.1 Discrete Random Variables
41
The negative binomial distribution arises as a generalization of the geometric
distribution. Suppose that a sequence of independent trials, each with probability of
success p, is performed until there are r successes in all; let X denote the total number
of trials. To find P(X = k), we can argue in the following way: Any particular such
sequence has probability pr (1 − p)k−r , from the independence assumption. The last
trial is a success,
the remaining r − 1 successes can be assigned to the remaining
and
ways. Thus,
k − 1 trials in rk−1
−1
k−1 r
P(X = k) =
p (1 − p)k−r
r −1
It is sometimes helpful in analyzing properties of the negative binomial distribution to note that a negative binomial random variable can be expressed as the sum of
r independent geometric random variables: the number of trials up to and including
the first success plus the number of trials after the first success up to and including
the second success, . . . plus the number of trials from the (r − 1)st success up to and
including the r th success.
Continuing Example A, the distribution of the number of tickets purchased up to and
including the second winning ticket is negative binomial:
p(k) = (k − 1) p 2 (1 − p)k−2
■
This frequency function is shown in Figure 2.5.
.05
.04
.03
p(x)
EXAMPLE B
.02
.01
0
0
10
20
30
40
50
x
F I G U R E 2.5 The probability mass function of a negative binomial random variable
with p = 19 and r = 2.
The definitions of the geometric and negative binomial distributions vary slightly
from one textbook to another; for example, instead of X being the total number of
trials in the definition of the geometric distribution, X is sometimes defined as the
total number of failures.
42
Chapter 2
Random Variables
2.1.4 The Hypergeometric Distribution
The hypergeometric distribution was introduced in Chapter 1 but was not named
there. Suppose that an urn contains n balls, of which r are black and n−r are white. Let
X denote the number of black balls drawn when taking m balls without replacement.
Following the line of reasoning of Examples H and I of Section 1.4.2,
r
n −r
k
m−k
P(X = k) =
n
m
X is a hypergeometric random variable with parameters r, n, and m.
EXAMPLE A
As explained in Example G of Section 1.4.2, a player in the California lottery chooses
6 numbers from 53 and the lottery officials later choose 6 numbers at random. Let X
equal the number of matches. Then
6
47
k
6−k
P(X = k) =
53
6
The probability mass function of X is displayed in the following table:
k
0
1
2
3
4
5
6
p(k)
.468
.401
.117
.014
7.06 × 10−4
1.22 × 10−5
4.36 × 10−8
■
2.1.5 The Poisson Distribution
The Poisson frequency function with parameter λ (λ > 0) is
λk −λ
e ,
k = 0, 1, 2, . . .
k!
k
Since eλ = ∞
k=0 (λ /k!), it follows that the frequency function sums to 1. Figure 2.6
shows four Poisson frequency functions. Note how the shape varies as a function of λ.
The Poisson distribution can be derived as the limit of a binomial distribution as
the number of trials, n, approaches infinity and the probability of success on each trial,
p, approaches zero in such a way that np = λ. The binomial frequency function is
P(X = k) =
n!
p k (1 − p)n−k
k!(n − k)!
Setting np = λ, this expression becomes
n!
λ n−k
λ k
p(k) =
1−
k!(n − k)! n
n
λk
n!
1
λ n
λ
=
1
−
1−
k
k! (n − k)! n
n
n
p(k) =
−k
2.1 Discrete Random Variables
43
p(x)
.8
.4
0
0
1
2
3
4
5
3
4
5
x
(a)
p(x)
.4
.2
0
0
1
2
x
(b)
p(x)
.20
.10
0
0
1
2
3
4
5
6
7
8
9 10 11 12 13 14 15 16 17 18 19 20
x
0
1
2
3
4
5
6
7
8
9 10 11 12 13 14 15 16 17 18 19 20
x
(c)
p(x)
.12
.08
.04
0
(d)
F I G U R E 2.6 Poisson frequency functions, (a) λ = .1, (b) λ = 1, (c) λ = 5, (d) λ = 10.
As n → ∞,
λ
→0
n
n!
→1
(n − k)!n k
λ n
1−
→ e−λ
n
44
Chapter 2
Random Variables
and
1−
λ
n
−k
→1
We thus have
p(k) →
λk e−λ
k!
which is the Poisson frequency function.
EXAMPLE A
Two dice are rolled 100 times, and the number of double sixes, X , is counted. The
1
distribution of X is binomial with n = 100 and p = 36
= .0278. Since n is large and p
is small, we can approximate the binomial probabilities by Poisson probabilities with
λ = np = 2.78. The exact binomial probabilities and the Poisson approximations are
shown in the following table:
k
Binomial
Probability
Poisson
Approximation
0
1
2
3
4
5
6
7
8
9
10
11
.0596
.1705
.2414
.2255
.1564
.0858
.0389
.0149
.0050
.0015
.0004
.0001
.0620
.1725
.2397
.2221
.1544
.0858
.0398
.0158
.0055
.0017
.0005
.0001
The approximation is quite good.
■
The Poisson frequency function can be used to approximate binomial probabilities for large n and small p. This suggests how Poisson distributions can arise in
practice. Suppose that X is a random variable that equals the number of times some
event occurs in a given interval of time. Heuristically, let us think of dividing the
interval into a very large number of small subintervals of equal length, and let us
assume that the subintervals are so small that the probability of more than one event
in a subinterval is negligible relative to the probability of one event, which is itself
very small. Let us also assume that the probability of an event is the same in each
subinterval and that whether an event occurs in one subinterval is independent of what
happens in the other subintervals. The random variable X is thus nearly a binomial
random variable, with the subintervals consitituting the trials, and, from the limiting
result above, X has nearly a Poisson distribution.
The preceding argument is not formal, of course, but merely suggestive. But, in
fact, it can be made rigorous. The important assumptions underlying it are (1) what
2.1 Discrete Random Variables
45
happens in one subinterval is independent of what happens in any other subinterval,
(2) the probability of an event is the same in each subinterval, and (3) events do not
happen simultaneously. The same kind of argument can be made if we are concerned
with an area or a volume of space rather than with an interval on the real line.
The Poisson distribution is of fundamental theoretical and practical importance.
It has been used in many areas, including the following:
•
•
•
•
EXAMPLE B
The Poisson distribution has been used in the analysis of telephone systems. The
number of calls coming into an exchange during a unit of time might be modeled
as a Poisson variable if the exchange services a large number of customers who act
more or less independently.
One of the earliest uses of the Poisson distribution was to model the number of
alpha particles emitted from a radioactive source during a given period of time.
The Poisson distribution has been used as a model by insurance companies. For
example, the number of freak acidents, such as falls in the shower, for a large population of people in a given time period might be modeled as a Poisson distribution,
because the accidents would presumably be rare and independent (provided there
was only one person in the shower).
The Poisson distribution has been used by traffic engineers as a model for light
traffic. The number of vehicles that pass a marker on a roadway during a unit of
time can be counted. If traffic is light, the individual vehicles act independently
of each other. In heavy traffic, however, one vehicle’s movement may influence
another’s, so the approximation might not be good.
This amusing classical example is from von Bortkiewicz (1898). The number of
fatalities that resulted from being kicked by a horse was recorded for 10 corps of
Prussian cavalry over a period of 20 years, giving 200 corps-years worth of data.
These data and the probabilities from a Poisson model with λ = .61 are displayed
in the following table. The first column of the table gives the number of deaths per
year, ranging from 0 to 4. The second column lists how many times that number of
deaths was observed. Thus, for example, in 65 of the 200 corps-years, there was one
death. In the third column of the table, the observed numbers are converted to relative
frequencies by dividing them by 200. The fourth column of the table gives Poisson
probabilities with the parameter λ = .61. In Chapters 8 and 9, we discuss how to
choose a parameter value to fit a theoretical probability model to observed frequencies
and methods for testing goodness of fit. For now, we will just remark that the value
λ = .61 was chosen to match the average number of deaths per year.
Number of Deaths
per Year
Observed
Relative
Frequency
Poisson
Probability
0
1
2
3
4
109
65
22
3
1
.545
.325
.110
.015
.005
.543
.331
.101
.021
.003
■
46
Chapter 2
Random Variables
The Poisson distribution often arises from a model called a Poisson process for
the distribution of random events in a set S, which is typically one-, two-, or threedimensional, corresponding to time, a plane, or a volume of space. Basically, this
model states that if S1 , S2 , . . . , Sn are disjoint subsets of S, then the numbers of events
in these subsets, N1 , N2 , . . . , Nn , are independent random variables that follow Poisson distributions with parameters λ|S1 |, λ|S2 |, . . . , λ|Sn |, where |Si | denotes the measure of Si (length, area, or volume, for example). The crucial assumptions here are that
events in disjoint subsets are independent of each other and that the Poisson parameter
for a subset is proportional to the subset’s size. Later, we will see that this latter assumption implies that the average number of events in a subset is proportional to its size.
EXAMPLE C
Suppose that an office receives telephone calls as a Poisson process with λ = .5
per min. The number of calls in a 5-min. interval follows a Poisson distribution with
parameter ω = 5λ = 2.5. Thus, the probability of no calls in a 5-min. interval is
e−2.5 = .082. The probability of exactly one call is 2.5e−2.5 = .205.
■
EXAMPLE D
Figure 2.7 shows four realizations of a Poisson process with λ = 25 in the unit square,
0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. It is interesting that the eye tends to perceive patterns, such
y
y
1
0
1
x
0
1
0
1
y
y
1
1
0
x
0
x
0
1
0
x
0
F I G U R E 2.7 Four realizations of a Poisson process with λ = 25.
1
2.2 Continuous Random Variables
47
as clusters of points and large blank spaces. But by the nature of a Poisson process,
the locations of the points have no relationship to one another, and these patterns are
■
entirely a result of chance.
2.2 Continuous Random Variables
In applications, we are often interested in random variables that can take on a continuum of values rather than a finite or countably infinite number. For example, a model
for the lifetime of an electronic component might be that it is random and can be
any positive real number. For a continuous random variable, the role of the frequency
function is taken by a density function, f (x), which has the properties that f (x) ≥ 0,
∞
f is piecewise continuous, and −∞ f (x) d x = 1. If X is a random variable with
a density function f , then for any a < b, the probability that X falls in the interval
(a, b) is the area under the density function between a and b:
b
f (x) d x
P(a < X < b) =
a
EXAMPLE A
A uniform random variable on the interval [0, 1] is a model for what we mean when
we say “choose a number at random between 0 and 1.” Any real number in the interval
is a possible outcome, and the probability model should have the property that the
probability that X is in any subinterval of length h is equal to h. The following density
function does the job:
f (x) =
1, 0 ≤ x ≤ 1
0, x < 0 or x > 1
This is called the uniform density on [0, 1]. The uniform density on a general interval
[a, b] is
f (x) =
1/(b − a), a ≤ x ≤ b
0,
x < a or x > b
■
One consequence of this definition is that the probability that a continuous random
variable X takes on any particular value is 0:
c
f (x) d x = 0
P(X = c) =
c
Although this may seem strange initially, it is really quite natural. If the uniform
random variable of Example A had a positive probability of being any particular
number, it should have the same probability for any number in [0, 1], in which case
the sum of the probabilities of any countably infinite subset of [0, 1] (for example,
the rational numbers) would be infinite. If X is a continuous random variable, then
P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b)
Note that this is not true for a discrete random variable.
48
Chapter 2
Random Variables
For small δ, if f is continuous at x,
x+δ/2
δ
δ
P x− ≤X≤x+
f (u) du ≈ δ f (x)
=
2
2
x−δ/2
Therefore, the probability of a small interval around x is proportional to f (x). It is
sometimes useful to employ differential notation: P(x ≤ X ≤ x + d x) = f (x) d x.
The cumulative distribution function of a continuous random variable X is defined
in the same way as for a discrete random variable:
F(x) = P(X ≤ x)
F(x) can be expressed in terms of the density function:
x
f (u) du
F(x) =
−∞
From the fundamental theorem of calculus, if f is continuous at x, f (x) = F (x).
The cdf can be used to evaluate the probability that X falls in an interval:
b
f (x) d x = F(b) − F(a)
P(a ≤ X ≤ b) =
a
EXAMPLE B
From this definition, we see that the cdf
(Example A) is
0,
F(x) = x,
1,
of a uniform random variable on [0, 1]
x ≤0
0≤x ≤1
x ≥1
■
Suppose that F is the cdf of a continuous random variable and is strictly increasing
on some interval I , and that F = 0 to the left of I and F = 1 to the right of I ; I
may be unbounded. Under this assumption, the inverse function F −1 is well defined;
x = F −1 (y) if y = F(x). The pth quantile of the distribution F is defined to be that
value x p such that F(x p ) = p, or P(X ≤ x p ) = p. Under the preceding assumption
stated, x p is uniquely defined as x p = F −1 ( p); see Figure 2.8. Special cases are
p = 12 , which corresponds to the median of F; and p = 14 and p = 34 , which
correspond to the lower and upper quartiles of F.
EXAMPLE C
Suppose that F(x) = x 2 for 0 ≤ x ≤ 1. This statement is shorthand for the more
explicit statement
0,
x ≤0
F(x) = x 2 , 0 ≤ x ≤ 1
1,
x ≥1
√
−1
2
To find F , we solve y = F(x) = x for x, obtaining x = F −1 (y) = y. The
median is F −1 (.5) = .707, the lower quartile is F −1 (.25) = .50, and the upper
■
quartile is F −1 (.75) = .866.
2.2 Continuous Random Variables
49
1.0
.8
F(x)
.6
p
.4
.2
0
3
2
1
xp 0
1
2
3
x
F I G U R E 2.8 A cdf F and F −1 .
EXAMPLE D
Value at Risk
Financial firms need to quantify and monitor the risk of their investments. Value at
Risk (VaR) is a widely used measure of potential losses. It involves two parameters:
a time horizon and a level of confidence. For example, if the VaR of an institution is
$10 million with a one-day horizon and a level of confidence of 95%, the interpretation
is that there is a 5% chance of losses exceeding $10 million. Such a loss should be
anticipated about once in 20 days.
To see how VaR is computed, suppose the current value of the investment is V0
and the future value is V1 . The return on the investment is R = (V1 − V0 )/V0 , which
is modeled as a continuous random variable with cdf FR (r ). Let the desired level of
confidence be denoted by 1 − α. We want to find v ∗ , the VaR. Then
α = P(V0 − V1 ≥ v ∗ )
v∗
V1 − V0
=P
≤−
V0
V0
∗
v
= FR −
V0
Thus, −v ∗ /V0 is the α quantile, rα ; and v ∗ = −V0rα . The VaR is minus the current
■
value times the α quantile of the return distribution.
We next discuss some density functions that commonly arise in practice.
Chapter 2
Random Variables
2.2.1 The Exponential Density
The exponential density function is
λe−λx ,
0,
f (x) =
x ≥0
x <0
Like the Poisson distribution, the exponential density depends on a single parameter,
λ > 0, and it would therefore be more accurate to refer to it as the family of exponential densities. Several exponential densities are shown in Figure 2.9. Note that as
λ becomes larger, the density drops off more rapidly.
2.0
1.5
f(x)
50
1.0
.5
0
0
2
4
6
8
10
x
F I G U R E 2.9 Exponential densities with λ = .5 (solid), λ = 1 (dotted), and λ = 2
(dashed).
The cumulative distribution function is easily found:
x
1 − e−λx ,
f (u) du =
F(x) =
0,
−∞
x ≥0
x <0
The median of an exponential distribution, η, say, is readily found from the cdf. We
solve F(η) = 12 :
1 − e−λη =
1
2
from which we have
log 2
λ
The exponential distribution is often used to model lifetimes or waiting times, in
which context it is conventional to replace x by t. Suppose that we consider modeling
the lifetime of an electronic component as an exponential random variable, that the
η=
2.2 Continuous Random Variables
51
component has lasted a length of time s, and that we wish to calculate the probability
that it will last at least t more time units; that is, we wish to find P(T > t +s | T > s):
P(T > t + s and T > s)
P(T > s)
P(T > t + s)
=
P(T > s)
−λ(t+s)
e
=
e−λs
−λt
=e
P(T > t + s | T > s) =
We see that the probability that the unit will last t more time units does not depend
on s. The exponential distribution is consequently said to be memoryless; it is clearly
not a good model for human lifetimes, since the probability that a 16-year-old will
live at least 10 more years is not the same as the probability that an 80-year-old
will live at least 10 more years. It can be shown that the exponential distribution
is characterized by this memoryless property—that is, the memorylessness implies
that the distribution is exponential. It may be somewhat surprising that a qualitative
characterization, the property of memorylessness, actually determines the form of
this density function.
The memoryless character of the exponential distribution follows directly from
its relation to a Poisson process. Suppose that events occur in time as a Poisson process
with parameter λ and that an event occurs at time t0 . Let T denote the length of time
until the next event. The density of T can be found as follows:
P(T > t) = P(no events in (t0 , t0 + t))
Since the number of events in the interval (t0 , t0 + t), which is of length t, follows a
Poisson distribution with parameter λt, this probability is e−λt , and thus T follows an
exponential distribution with parameter λ. We can continue in this fashion. Suppose
that the next event occurs at time t1 ; the distribution of time until the third event is
again exponential by the same analysis and, from the independence property of the
Poisson process, is independent of the length of time between the first two events.
Generally, the times between events of a Poisson process are independent, identically
distributed, exponential random variables.
Proteins and other biologically important molecules are regulated in various
ways. Some undergo aging and are thus more likely to degrade when they are old
than when they are young. If a molecule was not subject to aging, but its chance
of degradation was the same at any age, its lifetime would follow an exponential
distribution.
EXAMPLE A
Muscle and nerve cell membranes contain large numbers of channels through which
selected ions can pass when the channels are open. Using sophisticated experimental
techniques, neurophysiologists can measure the resulting current that flows through
a single channel, and experimental records often indicate that a channel opens and
closes at seemingly random times. In some cases, simple kinetic models predict that
the duration of the open time should be exponentially distributed.
Random Variables
A
B
Frequency ( per 0.5 ms)
300
10 M-Sux
1.18 ms
250
200
150
100
50
200
20 M-Sux
873 s
150
100
50
0 1 2 3 4 5 6 7 8 9 10
Open time (ms)
0 1 2 3 4 5 6 7 8 9 10
Open time (ms)
C
D
100
Frequency ( per 0.15 ms)
Frequency ( per 0.5 ms)
350
Frequency ( per 0.15 ms)
50 M-Sux
573 s
75
50
25
0
300
200
150
100
50
0
200 M-Sux
117 s
200
150
100
50
0
0.5 1.0 1.5 2.0 2.5 3.0
Open time (ms)
F
E
250
100 M-Sux
307 s
250
0.5 1.0 1.5 2.0 2.5 3.0
Open time (ms)
Frequency ( per 0.25 ms)
Chapter 2
Frequency ( per 0.05 ms)
52
0.25 0.50 0.75
Open time (ms)
1.0
500 M-Sux
38 s
150
100
50
0
0.1
0.2
0.3
Open time (ms)
0.4
F I G U R E 2.10 Histograms of open times at varying concentrations of
suxamethonium and fitted exponential densities.
Marshall et al. (1990) studied the action of a channel-blocking agent (suxamethonium) on a channel (the nicotinic receptor of frog muscle). Figure 2.10 displays
histograms of open times and fitted exponential distributions at a range of concentrations of suxamethonium. In this example, the exponential distribution is parametrized
as f (t) = (1/τ )exp(−t/τ ). τ is thus in units of time, whereas λ is in units of the
reciprocal of time. From the figure, we see that the intervals become shorter and that
the parameter τ decreases with increasing concentrations of the blocker. It can also
be seen, especially at higher concentrations, that very short intervals are not recorded
■
because of limitations of the instrumentation.
53
2.2 Continuous Random Variables
2.2.2 The Gamma Density
The gamma density function depends on two parameters, α and λ:
λα α−1 −λt
t ≥0
t e ,
g(t) =
(α)
For t < 0, g(t) = 0. So that the density be well defined and integrate to 1, α > 0 and
λ > 0. The gamma function, (x), is defined as
∞
(x) =
u x−1 e−u du,
x >0
0
Some properties of the gamma function are developed in the problems at the end of
this chapter.
Note that if α = 1, the gamma density coincides with the exponential density.
The parameter α is called a shape parameter for the gamma density, and λ is called
a scale parameter. Varying α changes the shape of the density, whereas varying λ
corresponds to changing the units of measurement (say, from seconds to minutes) and
does not affect the shape of the density.
Figure 2.11 shows several gamma densities. Gamma densities provide a fairly
flexible class for modeling nonnegative random variables.
2.5
.20
1.5
.15
g(t)
g(t)
2.0
1.0
.5
0
.10
.05
0
(a)
.5
1.0
t
1.5
2.0
0
0
(b)
5
10
t
15
20
F I G U R E 2.11 Gamma densities, (a) α = .5 (solid) and α = 1 (dotted) and (b) α = 5
(solid) and α = 10 (dotted); λ = 1 in all cases.
EXAMPLE A
The patterns of occurrence of earthquakes in terms of time, space, and magnitude
are very erratic, and attempts are sometimes made to construct probabilistic models
for these events. The models may be used in a purely descriptive manner or, more
ambitiously, for purposes of predicting future occurrences and consequent damage.
Figure 2.12 shows the fit of a gamma density and an exponential density to the
observed times separating a sequence of small earthquakes (Udias and Rice, 1975).
The gamma density clearly gives a better fit (α = .509 and λ = .00115). Note that an
Chapter 2
Random Variables
1369
(1331)
19681971
Time intervals
N 4764
600
500
Frequency
54
400
300
200
100
0
0
5
10
15
20
25
30
35
40
45
50
Hours
F I G U R E 2.12 Fit of gamma density (triangles) and of exponential density (circles) to
times between microearthquakes.
exponential model for interoccurrence times would be memoryless; that is, knowing
that an earthquake had not occurred in the last t time units would tell us nothing
about the probability of occurrence during the next s time units. The gamma model
does not have this property. In fact, although we will not show this, the gamma model
with these parameter values has the character that there is a large likelihood that the
next earthquake will immediately follow any given one and this likelihood decreases
■
monotonically with time.
2.2.3 The Normal Distribution
The normal distribution plays a central role in probability and statistics, for reasons
that will become apparent in later chapters of this book. This distribution is also called
the Gaussian distribution after Carl Friedrich Gauss, who proposed it as a model for
measurement errors. The central limit theorem, which will be discussed in Chapter 6,
justifies the use of the normal distribution in many applications. Roughly, the central
limit theorem says that if a random variable is the sum of a large number of independent
random variables, it is approximately normally distributed. The normal distribution
has been used as a model for such diverse phenomena as a person’s height, the distribution of IQ scores, and the velocity of a gas molecule. The density function of the normal
distribution depends on two parameters, μ and σ (where −∞ < μ < ∞, σ > 0):
1
2
2
−∞ < x < ∞
f (x) = √ e−(x−μ) /2σ ,
σ 2π
2.2 Continuous Random Variables
55
The parameters μ and σ are called the mean and standard deviation of the normal
density.
The cdf cannot be evaluated in closed form from this density function (the integral
that defines the cdf cannot be evaluated by an explicit formula but must be found
numerically). A problem at the end of this chapter asks you to show that the normal
density just given integrates to one.
As shorthand for the statement “X follows a normal distribution with parameters
μ and σ ,” it is convenient to use X ∼ N (μ, σ 2 ). From the form of the density function,
we see that the density is symmetric about μ, f (μ − x) = f (μ + x), where it has a
maximum, and that the rate at which it falls off is determined by σ . Figure 2.13 shows
several normal densities. Normal densities are sometimes referred to as bell-shaped
curves. The special case for which μ = 0 and σ = 1 is called the standard normal
density. Its cdf is denoted by and its density by φ (not to be confused with the
empty set). The relationship between the general normal density and the standard
normal density will be developed in the next section.
.8
f(x)
.6
.4
.2
0
6
4
2
0
x
2
4
6
F I G U R E 2.13 Normal densities, μ = 0 and σ = .5 (solid), μ = 0 and σ = 1
(dotted), and μ = 0 and σ = 2 (dashed).
EXAMPLE A
Acoustic recordings made in the ocean contain substantial background noise. To detect sonar signals of interest, it is useful to characterize this noise as accurately as
possible. In the Arctic, much of the background noise is produced by the cracking
and straining of ice. Veitch and Wilks (1985) studied recordings of Arctic undersea
noise and characterized the noise as a mixture of a Gaussian component and occasional large-amplitude bursts. Figure 2.14 is a trace of one recording that includes a
burst. Figure 2.15 shows a Gaussian distribution fit to observations from a “quiet”
(nonbursty) period of this noise.
■
56
Chapter 2
Random Variables
Standardized Amplitude
2
0
2
4
6
8
20
0
40
60
Time (milliseconds)
80
100
F I G U R E 2.14 A record of undersea noise containing a large burst.
.5
.4
.3
.2
.1
0
4
2
0
2
4
F I G U R E 2.15 A histogram from a “quiet” period of undersea noise with a fitted
normal density.
EXAMPLE B
Turbulent air flow is sometimes modeled as a random process. Since the velocity of
the flow at any point is subject to the influence of a large number of random eddies
in the neighborhood of that point, one might expect from the central limit theorem
that the velocity would be normally distributed. Van Atta and Chen (1968) analyzed
data gathered in a wind tunnel. Figure 2.16, taken from their paper, shows a normal
distribution fit to 409,600 observations of one component of the velocity; the fit is
■
remarkably good.
EXAMPLE C
S&P 500
The Standard and Poors 500 is an index of important U.S. stocks; each stock’s weight
in the index is proportional to its market value. Individuals can invest in mutual funds
that track the index. The top panel of Figure 2.17 shows the sequential values of the
2.2 Continuous Random Variables
57
.40
.35
.30
␴p (u /␴)
.25
.20
.15
.10
.05
0
3
2
1
0
u /␴
1
2
3
F I G U R E 2.16 A normal density (solid line) fit to 409,600 measurements of one
component of the velocity of a turbulent wind flow. The dots show the values from a
histogram.
Return
0.03
0
0.03
0
50
100
150
200
250
Time
Density
40
30
20
10
0
0.04
0.02
0
Returns
0.02
0.04
F I G U R E 2.17 Returns on the S&P 500 during 2003 (top panel) and a normal curve
fitted to their histogram (bottom panel). The region area to the left of the 0.05 quantile
is shaded.
58
Chapter 2
Random Variables
returns during 2003. The average return during this period was 0.1% per day, and we
can see from the figure that daily fluctuations were as large as 3% or 4%. The lower
panel of the figure shows a histogram of the returns and a fitted normal density with
μ = 0.001 and σ = 0.01.
A financial company could use the fitted normal density in calculating its Value at
Risk (see Example D of Section 2.2). Using a time horizon of one day and a confidence
level of 95%, the VaR is the current investment in the index, V0 , multiplied by the
negative of the 0.05 quantile of the distribution of returns. In this case, the quantile
can be calculated to be −0.0165, so the VaR is .0165V0 . Thus, if V0 is $10 million,
the VaR is $165,000. The company can have 95% “confidence” that its losses will
not exceed that amount on a given day. However, it should not be surprised if that
■
amount is exceeded about once in every 20 trading days.
2.2.4 The Beta Density
The beta density is useful for modeling random variables that are restricted to the
interval [0, 1]:
f (u) =
(a + b) a−1
u (1 − u)b−1 ,
(a) (b)
0≤u≤1
Figure 2.18 shows beta densities for various values of a and b. Note that the case
a = b = 1 is the uniform distribution. The beta distribution is important in Bayesian
statistics, as you will see later.
2.3 Functions of a Random Variable
Suppose that a random variable X has a density function f (x). We often need to find
the density function of Y = g(X ) for some given function g. For example, X might
be the velocity of a particle of mass m, and we might be interested in the probability
density function of the particle’s kinetic energy, Y = 12 m X 2 . Often, the density and
cdf of X are denoted by f X and FX ; and those of Y , by f Y and FY . To illustrate
techniques for solving such a problem, we first develop some useful facts about the
normal distribution.
Suppose that X ∼ N (μ, σ 2 ) and that Y = a X + b, where a > 0. The cumulative
distribution function of Y is
FY (y) = P(Y ≤ y)
= P(a X + b ≤ y)
y−b
=P X≤
a
y−b
= FX
a
1.2
0.8
0.4
0
.4
2.0
1.0
0
0
.2
.4
.6
.8
1.0
0
Beta density
2.0
1.0
0
.2
.4
.6
.8
p
(b)
1.0
12
10
8
6
4
2
0
0
.6
.8
1.0
.6
.8
1.0
p
(c)
3.0
0
.2
p
(a)
Beta density
59
3.0
Beta density
Beta density
1.6
2.3 Functions of a Random Variable
.2
(d)
.4
p
F I G U R E 2.18 Beta density functions for various values of a and b: (a) a = 2, b = 2; (b) a = 6, b = 2;
(c) a = 6, b = 6; and (d) a = .5, b = 4.
Thus,
d
y−b
FX
f Y (y) =
dy
a
1
y−b
= fX
a
a
Up to this point, we have not used the assumption of normality at all, so this result
holds for a general continuous random variable, provided that FX is appropriately
differentiable. If f X is a normal density function with parameters μ and σ , we find
that, after substitution,
1
1 y − b − aμ 2
√ exp −
f Y (y) =
2
aσ
aσ 2π
From this, we see that Y follows a normal distribution with parameters aμ + b
and aσ .
The case for which a < 0 can be analyzed similarly (see Problem 57 in the
end-of-chapter problems), yielding the following proposition.
PROPOSITION A
If X ∼ N (μ, σ 2 ) and Y = a X + b, then Y ∼ N (aμ + b, a 2 σ 2 ).
■
This proposition is quite useful for finding probabilities from the normal distribution. Suppose that X ∼ N (μ, σ 2 ) and we wish to find P(x0 < X < x1 ) for
60
Chapter 2
Random Variables
some numbers x0 and x1 . Consider the random variable
X
μ
X −μ
=
−
σ
σ
σ
Applying Proposition A with a = 1/σ and b = −μ/σ , we see that Z ∼ N (0, 1),
that is, Z follows a standard normal distribution. Therefore,
Z=
FX (x) = P(X ≤ x)
x −μ
X −μ
≤
=P
σ
σ
x −μ
=P Z≤
σ
x −μ
=
σ
We thus have
P(x0 < X < x1 ) = FX (x1 ) − FX (x0 )
x1 − μ
x0 − μ
=
−
σ
σ
Thus, probabilities for general normal random variables can be evaluated in terms of
probabilities for standard normal random variables. This is quite useful, since tables
need to be made up only for the standard normal distribution rather than separately
for every μ and σ .
EXAMPLE A
Scores on a certain standardized test, IQ scores, are approximately normally distributed with mean μ = 100 and standard deviation σ = 15. Here we are referring
to the distribution of scores over a very large population, and we approximate that
discrete cumulative distribution function by a normal continuous cumulative distribution function. An individual is selected at random. What is the probability that his
score X satisfies 120 < X < 130?
We can calculate this probability by using the standard normal distribution as
follows:
X − 100
130 − 100
120 − 100
<
<
P(120 < X < 130) = P
15
15
15
= P(1.33 < Z < 2)
where Z follows a standard normal distribution. Using a table of the standard normal
distribution (Table 2 of Appendix B), this probability is
P(1.33 < Z < 2) = (2) − (1.33)
= .9772 − .9082
= .069
Thus, approximately 7% of the population will have scores in this range.
■
2.3 Functions of a Random Variable
EXAMPLE B
61
Let X ∼ N (μ, σ 2 ), and find the probability that X is less than σ away from μ; that
is, find P(|X − μ| < σ ).
This probability is
X −μ
P(−σ < X − μ < σ ) = P −1 <
<1
σ
= P(−1 < Z < 1)
where Z follows a standard normal distribution. From tables of the standard normal
distribution, this last probability is
(1) − (−1) = .68
Thus, a normal random variable is within 1 standard deviation of its mean with
probability .68.
■
We now turn to another example involving the normal distribution.
EXAMPLE C
Find the density of X = Z 2 , where Z ∼ N (0, 1).
Here, we have
FX (x) = P(X ≤ x)
√
√
= P(− x ≤ Z ≤ x)
√
√
= ( x) − (− x)
We find the density of X by differentiating the cdf. Since (x) = φ(x), the chain
rule gives
√
√
f X (x) = 12 x −1/2 φ( x) + 12 x −1/2 φ(− x)
√
= x −1/2 φ( x)
In the last step, we used the symmetry of φ. Evaluating the last expression, we
find
x −1/2
f X (x) = √ e−x/2 ,
x ≥0
2π
We can recognize this as a gamma density by making use of a principle of general
utility. Suppose two densities are of the forms k1 h(x) and k2 h(x); then, because they
both integrate to 1, k1 = k2 . Now, comparing the form of f (x) given here to that of
1
, we recognize by this reasoning that f (x) is
the gamma density with α = λ = √
2
1
=
a gamma density and that
π . This density is also called the chi-square
2
■
density with 1 degree of freedom.
As another example, let us consider the following.
62
Chapter 2
EXAMPLE D
Random Variables
Let U be a uniform random variable on [0, 1], and let V = 1/U . To find the density
of V , we first find the cdf:
FV (v) = P(V ≤ v)
1
≤v
=P
U
1
=P U≥
v
= 1−
1
v
This expression is valid for v ≥ 1; for v < 1, FV (v) = 0. We can now find the density
by differentiation:
f V (v) =
1
,
v2
1≤v<∞
■
Looking back over these examples, we see that we have gone through the same
basic steps in each case: first finding the cdf of the transformed variable, then differentiating to find the density, and then specifying in what region the result holds.
These same steps can be used to prove the following general result.
PROPOSITION B
Let X be a continuous random variable with density f (x) and let Y = g(X ) where
g is a differentiable, strictly monotonic function on some interval I . Suppose that
f (x) = 0 if x is not in I . Then Y has the density function
d −1 −1
f Y (y) = f X (g (y)) g (y)
dy
for y such that y = g(x) for some x, and f Y (y) = 0 if y = g(x) for any x in I .
■
Here g −1 is the inverse function of g; that is, g −1 (y) = x if y = g(x).
For any specific problem, proceeding from scratch is usually easier than deciphering the notation and applying the proposition.
We conclude this section by developing some results relating the uniform distribution to other continuous distributions. Throughout, we consider a random variable X , with density f and cdf F, where F is strictly increasing on some interval
I, F = 0 to the left of I , and F = 1 to the right of I . I may be a bounded interval
or an unbounded interval such as the whole real line. F −1 (x) is then well defined
for x ∈ I .
2.3 Functions of a Random Variable
63
PROPOSITION C
Let Z = F(X ); then Z has a uniform distribution on [0, 1].
Proof
P(Z ≤ z) = P(F(X ) ≤ z) = P(X ≤ F −1 (z)) = F(F −1 (z)) = z
■
This is the uniform cdf.
PROPOSITION D
Let U be uniform on [0, 1], and let X = F −1 (U ). Then the cdf of X is F.
Proof
P(X ≤ x) = P(F −1 (U ) ≤ x) = P(U ≤ F(x)) = F(x)
■
This last proposition is quite useful in generating pseudorandom numbers with
a given cdf F. Many computer packages have routines for generating pseudorandom
numbers that are uniform on [0, 1]. These numbers are called pseudorandom because
they are generated according to some rule or algorithm and thus are not “really”
random. Proposition D tells us that to generate random variables with cdf F, we just
apply F −1 to uniform random numbers. This is quite practical as long as F −1 can be
calculated easily.
EXAMPLE E
Suppose that, as part of simulation study, we want to generate random variables from
an exponential distribution. For example, the performance of large queueing networks
is often assessed by simulation. One aspect of such a simulation involves generating
random time intervals between customer arrivals, which might be assumed to be
exponentially distributed. If we have access to a uniform random number generator,
we can apply Proposition D to generate exponential random numbers. The cdf is
F(t) = 1 − e−λt . F −1 can be found by solving x = 1 − e−λt for t:
e−λt = 1 − x
−λt = log(1 − x)
t = − log(1 − x)/λ
Thus, if U is uniform on [0, 1], then T = − log(1 − U )/λ is an exponential random
variable with parameter λ. This can be simplified slightly by noting that V = 1 − U
is also uniform on [0, 1] since
P(V ≤ v) = P(1 − U ≤ v) = P(U ≥ 1 − v) = 1 − (1 − v) = v
We may thus take T = − log(V )/λ, where V is uniform on [0, 1].
■
64
Chapter 2
Random Variables
2.4 Concluding Remarks
This chapter introduced the concept of a random variable, one of the fundamental
ideas of probability theory. A fully rigorous discussion of random variables requires
a background in measure theory. The development here is sufficient for the needs of
this course.
Discrete and continuous random variables have been defined, and it should be
mentioned that more general random variables can also be defined and are useful on
occasion. In particular, it makes sense to consider random variables that have both a
discrete and a continuous component. For example, the lifetime of a transistor might
be 0 with some probability p > 0 if it does not function at all; if it does function, the
lifetime could be modeled as a continuous random variable.
2.5 Problems
1. Suppose that X is a discrete random variable with P(X = 0) = .25, P(X = 1) =
.125, P(X = 2) = .125, and P(X = 3) = .5. Graph the frequency function and
the cumulative distribution function of X .
2. An experiment consists of throwing a fair coin four times. Find the frequency
function and the cumulative distribution function of the following random variables: (a) the number of heads before the first tail, (b) the number of heads
following the first tail, (c) the number of heads minus the number of tails, and
(d) the number of tails times the number of heads.
3. The following table shows the cumulative distribution function of a discrete
random variable. Find the frequency function.
k
F(k)
0
1
2
3
4
5
0
.1
.3
.7
.8
1.0
4. If X is an integer-valued random variable, show that the frequency function is
related to the cdf by p(k) = F(k) − F(k − 1).
5. Show that P(u < X ≤ v) = F(v) − F(u) for any u and v in the cases that (a)
X is a discrete random variable and (b) X is a continuous random variable.
6. Let A and B be events, and let I A and I B be the associated indicator random
variables. Show that
I A∩B = I A I B = min(I A , I B )
and
I A∪B = max(I A , I B )
2.5 Problems
65
7. Find the cdf of a Bernoulli random variable.
8. Show that the binomial probabilities sum to 1.
9. For what values of p is a two-out-of-three majority decoder better than transmission of the message once?
10. Appending three extra bits to a 4-bit word in a particular way (a Hamming code)
allows detection and correction of up to one error in any of the bits. If each
bit has probability .05 of being changed during communication, and the bits
are changed independently of each other, what is the probability that the word
is correctly received (that is, 0 or 1 bit is in error)? How does this probability
compare to the probability that the word will be transmitted correctly with no
check bits, in which case all four bits would have to be transmitted correctly for
the word to be correct?
11. Consider the binomial distribution with n trials and probability p of success on
each trial. For what value of k is P(X = k) maximized? This value is called the
mode of the distribution. (Hint: Consider the ratio of successive terms.)
12. Which is more likely: 9 heads in 10 tosses of a fair coin or 18 heads in 20 tosses?
13. A multiple-choice test consists of 20 items, each with four choices. A student is
able to eliminate one of the choices on each question as incorrect and chooses
randomly from the remaining three choices. A passing grade is 12 items or more
correct.
a. What is the probability that the student passes?
b. Answer the question in part (a) again, assuming that the student can eliminate
two of the choices on each question.
14. Two boys play basketball in the following way. They take turns shooting and
stop when a basket is made. Player A goes first and has probability p1 of making a basket on any throw. Player B, who shoots second, has probability p2 of
making a basket. The outcomes of the successive trials are assumed to be independent.
a. Find the frequency function for the total number of attempts.
b. What is the probability that player A wins?
15. Two teams, A and B, play a series of games. If team A has probability .4 of
winning each game, is it to its advantage to play the best three out of five games
or the best four out of seven? Assume the outcomes of successive games are
independent.
16. Show that if n approaches ∞ and r/n approaches p and m is fixed, the hypergeometric frequency function tends to the binomial frequency function. Give a
heuristic argument for why this is true.
17. Suppose that in a sequence of independent Bernoulli trials, each with probability
of success p, the number of failures up to the first success is counted. What is
the frequency function for this random variable?
18. Continuing with Problem 17, find the frequency function for the number of
failures up to the r th success.
66
Chapter 2
Random Variables
19. Find an expression for the cumulative distribution function of a geometric random
variable.
20. If X is a geometric random variable with p = .5, for what value of k is
P(X ≤ k) ≈ .99?
21. If X is a geometric random variable, show that
P(X > n + k − 1|X > n − 1) = P(X > k)
In light of the construction of a geometric distribution from a sequence of independent Bernoulli trials, how can this be interpreted so that it is “obvious”?
22. Three identical fair coins are thrown simultaneously until all three show the same
face. What is the probability that they are thrown more than three times?
23. In a sequence of independent trials with probability p of success, what is the
probability that there are r successes before the kth failure?
24. (Banach Match Problem) A pipe smoker carries one box of matches in his left
pocket and one box in his right. Initially, each box contains n matches. If he
needs a match, the smoker is equally likely to choose either pocket. What is the
frequency function for the number of matches in the other box when he first
discovers that one box is empty?
25. The probability of being dealt a royal straight flush (ace, king, queen, jack, and
ten of the same suit) in poker is about 1.3 × 10−8 . Suppose that an avid poker
player sees 100 hands a week, 52 weeks a year, for 20 years.
a. What is the probability that she is never dealt a royal straight flush dealt?
b. What is the probability that she is dealt exactly two royal straight flushes?
26. The university administration assures a mathematician that he has only 1 chance in
10,000 of being trapped in a much-maligned elevator in the mathematics building.
If he goes to work 5 days a week, 52 weeks a year, for 10 years, and always rides
the elevator up to his office when he first arrives, what is the probability that
he will never be trapped? That he will be trapped once? Twice? Assume that
the outcomes on all the days are mutually independent (a dubious assumption in
practice).
27. Suppose that a rare disease has an incidence of 1 in 1000. Assuming that members
of the population are affected independently, find the probability of k cases in a
population of 100,000 for k = 0, 1, 2.
28. Let p0 , p1 , . . . , pn denote the probability mass function of the binomial distribution with parameters n and p. Let q = 1− p. Show that the binomial probabilities
can be computed recursively by p0 = q n and
pk =
(n − k + 1) p
pk−1 ,
kq
k = 1, 2, . . . , n
Use this relation to find P(X ≤ 4) for n = 9000 and p = .0005.
2.5 Problems
67
29. Show that the Poisson probabilities p0 , p1 , . . . can be computed recursively by
p0 = exp(−λ) and
λ
k = 1, 2, . . .
pk−1 ,
k
Use this scheme to find P(X ≤ 4) for λ = 4.5 and compare to the results of
Problem 28.
pk =
30. Suppose that in a city, the number of suicides can be approximated by a Poisson
process with λ = .33 per month.
a. Find the probability of k suicides in a year for k = 0, 1, 2, . . . . What is the
most probable number of suicides?
b. What is the probability of two suicides in one week?
31. Phone calls are received at a certain residence as a Poisson process with parameter
λ = 2 per hour.
a. If Diane takes a 10-min. shower, what is the probability that the phone rings
during that time?
b. How long can her shower be if she wishes the probability of receiving no
phone calls to be at most .5?
32. For what value of k is the Poisson frequency function with parameter λ maximized? (Hint: Consider the ratio of consecutive terms.)
33. Let F(x) = 1 − exp(−αx β ) for x ≥ 0, α > 0, β > 0, and F(x) = 0 for x < 0.
Show that F is a cdf, and find the corresponding density.
34. Let f (x) = (1 + αx)/2 for −1 ≤ x ≤ 1 and f (x) = 0 otherwise, where
−1 ≤ α ≤ 1. Show that f is a density, and find the corresponding cdf. Find the
quartiles and the median of the distribution in terms of α.
35. Sketch the pdf and cdf of a random variable that is uniform on [−1, 1].
36. If U is a uniform random variable on [0, 1], what is the distribution of the random
variable X = [nU ], where [t] denotes the greatest integer less than or equal to t?
37. A line segment of length 1 is cut once at random. What is the probability that the
longer piece is more than twice the length of the shorter piece?
38. If f and g are densities, show that α f + (1 − α)g is a density, where 0 ≤ α ≤ 1.
39. The Cauchy cumulative distribution function is
1
1
+ tan−1 (x),
2 π
a. Show that this is a cdf.
b. Find the density function.
c. Find x such that P(X > x) = .1.
F(x) =
−∞ < x < ∞
40. Suppose that X has the density function f (x) = cx 2 for 0 ≤ x ≤ 1 and f (x) = 0
otherwise.
a. Find c.
b. Find the cdf.
c. What is P(.1 ≤ X < .5)?
68
Chapter 2
Random Variables
41. Find the upper and lower quartiles of the exponential distribution.
42. Find the probability density for the distance from an event to its nearest neighbor
for a Poisson process in the plane.
43. Find the probability density for the distance from an event to its nearest neighbor
for a Poisson process in three-dimensional space.
44. Let T be an exponential random variable with parameter λ. Let X be a discrete
random variable defined as X = k if k ≤ T < k + 1, k = 0, 1, . . . . Find the
frequency function of X .
45. Suppose that the lifetime of an electronic component follows an exponential
distribution with λ = .1.
a. Find the probability that the lifetime is less than 10.
b. Find the probability that the lifetime is between 5 and 15.
c. Find t such that the probability that the lifetime is greater than t is .01.
46. T is an exponential random variable, and P(T < 1) = .05. What is λ?
47. If α > 1, show that the gamma density has a maximum at (α − 1)/λ.
48. Show that the gamma density integrates to 1.
49. The gamma function is a generalized factorial function.
a. Show that (1) = 1.
b. Show that (x + 1) = x (x). (Hint: Use integration by parts.)
c. Conclude that (n) = (n √
− 1)!, for n = 1, 2, 3, . . . .
d. Use the fact that ( 12 ) = π to show that, if n is an odd integer,
n 2
√
=
π (n − 1)!
n−1 !
2
2n−1
50. Show by a change of variables that
(x) = 2
=
∞
0
∞
t 2x−1 e−t dt
2
e xt e−e dt
t
−∞
51. Show that the normal density integrates to 1. (Hint: First make a change of
variables to reduce the integral to that for√the standard normal. The problem is
∞
then to show that −∞ exp(−x 2 /2) d x = 2π . Square both sides and reexpress
the problem as that of showing
∞
−∞
exp(−x 2 /2) d x
∞
−∞
exp(−y 2 /2) dy
= 2π
Finally, write the product of integrals as a double integral and change to polar
coordinates.)
2.5 Problems
69
52. Suppose that in a certain population, individuals’ heights are approximately normally distributed with parameters μ = 70 and σ = 3 in.
a. What proportion of the population is over 6 ft. tall?
b. What is the distribution of heights if they are expressed in centimeters? In
meters?
53. Let X be a normal random variable with μ = 5 and σ = 10. Find (a) P(X > 10),
(b) P(−20 < X < 15), and (c) the value of x such that P(X > x) = .05.
54. If X ∼ N (μ, σ 2 ), show that P(|X − μ| ≤ .675σ ) = .5.
55. X ∼ N (μ, σ 2 ), find the value of c in terms of σ such that P(μ − c ≤ X ≤
μ + c) = .95.
56. If X ∼ N (0, σ 2 ), find the density of Y = |X |.
57. X ∼ N (μ, σ 2 ) and Y = a X + b, where a < 0, show that Y ∼ N (aμ + b, a 2 σ 2 ).
√
58. If U is uniform on [0, 1], find the density function of U .
59. If U is uniform on [−1, 1], find the density function of U 2 .
60. Find the density function of Y = e Z , where Z ∼ N (μ, σ 2 ). This is called the
lognormal density, since log Y is normally distributed.
61. Find the density of cX when X follows a gamma distribution. Show that only λ
is affected by such a transformation, which justifies calling λ a scale parameter.
62. Show that if X has a density function f X and Y = a X + b, then
1
y−b
fX
f Y (y) =
|a|
a
63. Suppose that follows a uniform distribution on the interval [−π/2, π/2]. Find
the cdf and density of tan .
64. A particle of mass m has a random velocity, V , which is normally distributed
with parameters μ = 0 and σ . Find the density function of the kinetic energy,
E = 12 mV 2 .
65. How could random variables with the following density function be generated
from a uniform random number generator?
f (x) =
1 + αx
,
2
−1 ≤ x ≤ 1,
−1 ≤ α ≤ 1
66. Let f (x) = αx −α−1 for x ≥ 1 and f (x) = 0 otherwise, where α is a positive
parameter. Show how to generate random variables from this density from a
uniform random number generator.
67. The Weibull cumulative distribution function is
β
F(x) = 1 − e−(x/α) ,
a. Find the density function.
x ≥ 0,
α > 0,
β>0
70
Chapter 2
Random Variables
b. Show that if W follows a Weibull distribution, then X = (W/α)β follows an
exponential distribution.
c. How could Weibull random variables be generated from a uniform random
number generator?
68. If the radius of a circle is an exponential random variable, find the density function
of the area.
69. If the radius of a sphere is an exponential random variable, find the density
function of the volume.
70. Let U be a uniform random variable. Find the density function of V = U −α ,
α > 0. Compare the rates of decrease of the tails of the densities as a function of
α. Does the comparison make sense intuitively?
71. This problem shows one way to generate discrete random variables from a uniform random number generator. Suppose that F is the cdf of an integer-valued
random variable; let U be uniform on [0, 1]. Define a random variable Y = k if
F(k − 1) < U ≤ F(k). Show that Y has cdf F. Apply this result to show how
to generate geometric random variables from uniform random variables.
72. One of the most commonly used (but not one of the best) methods of generating pseudorandom numbers is the linear congruential method, which works
as follows. Let x0 be an initial number (the “seed”). The sequence is generated
recursively as
xn = (axn−1 + c) mod m
a. Choose values of a, c, and m, and try this out. Do the sequences “look”
random?
b. Making good choices of a, c, and m involves both art and theory. The following are some values that have been proposed: (1) a = 69069, c = 0, m = 231 ;
(2) a = 65539, c = 0, m = 231 . The latter is an infamous generator called
RANDU. Try out these schemes, and examine the results.
CHAPTER
3
Joint Distributions
3.1 Introduction
This chapter is concerned with the joint probability structure of two or more random
variables defined on the same sample space. Joint distributions arise naturally in many
applications, of which the following are illustrative:
•
•
•
•
In ecological studies, counts of several species, modeled as random variables, are
often made. One species is often the prey of another; clearly, the number of predators
will be related to the number of prey.
The joint probability distribution of the x, y, and z components of wind velocity
can be experimentally measured in studies of atmospheric turbulence.
The joint distribution of the values of various physiological variables in a population
of patients is often of interest in medical studies.
A model for the joint distribution of age and length in a population of fish can be used
to estimate the age distribution from the length distribution. The age distribution is
relevant to the setting of reasonable harvesting policies.
The joint behavior of two random variables, X and Y, is determined by the
cumulative distribution function
F(x, y) = P(X ≤ x, Y ≤ y)
regardless of whether X and Y are continuous or discrete. The cdf gives the probability
that the point (X, Y ) belongs to a semi-infinite rectangle in the plane, as shown in
Figure 3.1. The probability that (X, Y ) belongs to a given rectangle is, from Figure 3.2,
P(x1 < X ≤ x2 , y1 < Y ≤ y2 ) = F(x2 , y2 ) − F(x2 , y1 ) − F(x1 , y2 )
+ F(x1 , y1 )
71
72
Chapter 3
Joint Distributions
y
y
y2
b
y1
a
x
x1
F I G U R E 3.1 F (a, b) gives the probability of the
shaded rectangle.
x2
x
F I G U R E 3.2 The probability of the shaded
rectangle can be found by subtracting from the
probability of the (semi-infinite) rectangle having
the upper-right corner (x2 , y2 ) the probabilities of
the (x1 , y2 ) and (x2 , y1 ) rectangles, and then adding
back in the probability of the (x1 , y1 ) rectangle.
The probability that (X, Y ) belongs to a set A, for a large enough class of sets
for practical purposes, can be determined by taking limits of intersections and unions
of rectangles. In general, if X 1 , . . . , X n are jointly distributed random variables, their
joint cdf is
F(x1 , x2 , . . . , xn ) = P(X 1 ≤ x1 , X 2 ≤ x2 , . . . , X n ≤ xn )
Two- and higher-dimensional versions of density functions and frequency functions exist. We will start with a detailed description of such functions for the discrete
case, since it is the easier one to understand.
3.2 Discrete Random Variables
Suppose that X and Y are discrete random variables defined on the same sample
space and that they take on values x1 , x2 , . . . , and y1 , y2 , . . . , respectively. Their
joint frequency function, or joint probability mass function, p(x, y), is
p(xi , y j ) = P(X = xi , Y = y j )
A simple example will illustrate this concept. A fair coin is tossed three times; let X
denote the number of heads on the first toss and Y the total number of heads. From
the sample space, which is
= {hhh, hht, hth, htt, thh, tht, tth, ttt}
3.2 Discrete Random Variables
73
we see that the joint frequency function of X and Y is as given in the following table:
y
x
0
1
2
3
0
1
8
0
1
8
2
8
0
1
2
8
1
8
1
8
Thus, for example, p(0, 2) = P(X = 0, Y = 2) = 18 . Note that the probabilities
in the preceding table sum to 1.
Suppose that we wish to find the frequency function of Y from the joint frequency
function. This is straightforward:
pY (0) = P(Y = 0)
= P(Y = 0, X = 0) + P(Y = 0, X = 1)
1
= +0
8
1
=
8
pY (1) = P(Y = 1)
= P(Y = 1, X = 0) + P(Y = 1, X = 1)
3
=
8
In general, to find the frequency function of Y , we simply sum down the appropriate
column of the table. For this reason, pY is called the marginal frequency function
of Y . Similarly, summing across the rows gives
p(x, yi )
p X (x) =
i
which is the marginal frequency function of X .
The case for several random variables is analogous. If X 1 , . . . , X m are discrete
random variables defined on the same sample space, their joint frequency function is
p(x1 , . . . , xm ) = P(X 1 = x1 , . . . , X m = xm )
The marginal frequency function of X 1 , for example, is
p X 1 (x1 ) =
p(x1 , x2 , . . . , xm )
x2 ···xm
The two-dimensional marginal frequency function of X 1 and X 2 , for example, is
p(x1 , x2 , . . . , xm )
p X 1 X 2 (x1 , x2 ) =
x3 ···xm
EXAMPLE A
Multinomial Distribution
The multinomial distribution, an important generalization of the binomial distribution,
arises in the following way. Suppose that each of n independent trials can result in
Joint Distributions
one of r types of outcomes and that on each trial the probabilities of the r outcomes
are p1 , p2 , . . . , pr . Let Ni be the total number of outcomes of type i in the n trials,
i = 1, . . . , r . To calculate the joint frequency function, we observe that any particular
sequence of trials giving rise to N1 = n 1 , N2 = n 2 , . . . , Nr = n r occurs with
probability p1n 1 p2n 2 · · · prnr . From Proposition C in Section 1.4.2, we know that there
are n!/(n 1 !n 2 ! · · · n r !) such sequences, and thus the joint frequency function is
n
p(n 1 , . . . , n r ) =
p n 1 p n 2 · · · prnr
n 1 · · · nr 1 2
The marginal distribution of any particular Ni can be obtained by summing the joint
frequency function over the other n j . This formidable algebraic task can be avoided,
however, by noting that Ni can be interpreted as the number of successes in n trials,
each of which has probability pi of success and 1 − pi of failure. Therefore, Ni is a
binomial random variable, and
n
p Ni (n i ) =
p ni (1 − pi )n−ni
ni i
The multinomial distribution is applicable in considering the probabilistic properties of a histogram. As a concrete example, suppose that 100 independent observations are taken from a uniform distribution on [0, 1], that the interval [0, 1] is
partitioned into 10 equal bins, and that the counts n 1 , . . . , n 10 in each of the 10 bins are
recorded and graphed as the heights of vertical bars above the respective bins. The joint
distribution of the heights is multinomial with n = 100 and pi = .1, i = 1, . . . , 10.
Figure 3.3 shows four histograms constructed in this manner from a pseudorandom
16
20
12
15
Count
Count
8
4
0
10
5
0
.2
.4
.6
.8
0
1.0
x
(a)
16
20
12
15
8
4
0
0
.2
.4
.6
.8
1.0
.6
.8
1.0
x
( b)
Count
Chapter 3
Count
74
10
5
0
(c)
.2
.4
.6
x
.8
1.0
0
0
(d)
.2
.4
x
F I G U R E 3.3 Four histograms, each formed from 100 independent uniform random
numbers.
3.3 Continuous Random Variables
75
number generator; the figure illustrates the sort of random fluctuations that can be
■
expected in histograms.
3.3 Continuous Random Variables
Suppose that X and Y are continuous random variables with a joint cdf, F(x, y). Their
joint density function is a piecewise continuous function of two variables, f (x, y).
∞
∞
The density function f (x, y) is nonnegative and −∞ −∞ f (x, y) d y d x = 1. For
any “reasonable” two-dimensional set A
P((X, Y ) ∈ A) =
f (x, y) d y d x
A
In particular, if A = {(X, Y )|X ≤ x and Y ≤ y},
x y
f (u, v) dv du
F(x, y) =
−∞
−∞
From the fundamental theorem of multivariable calculus, it follows that
f (x, y) =
∂2
F(x, y)
∂ x∂ y
wherever the derivative is defined.
For small δx and δ y , if f is continuous at (x, y),
x+δx P(x ≤ X ≤ x + δx , y ≤ Y ≤ y + δ y ) =
x
y+δ y
f (u, v) dv du
y
≈ f (x, y)δx δ y
Thus, the probability that (X, Y ) is in a small neighborhood of (x, y) is proportional
to f (x, y). Differential notation is sometimes useful:
P(x ≤ X ≤ x + d x, y ≤ Y ≤ y + dy) = f (x, y) d x d y
EXAMPLE A
Consider the bivariate density function
12 2
(x + x y),
0 ≤ x ≤ 1,
0≤y≤1
7
which is plotted in Figure 3.4. P(X > Y ) can be found by integrating f over the set
{(x, y)|0 ≤ y ≤ x ≤ 1}:
12 1 x 2
(x + x y) d y d x
P(X > Y ) =
7 0 0
f (x, y) =
=
9
14
■
76
Chapter 3
Joint Distributions
3
f (x, y) 2
1
0
0
1
0.8
0.2
0.4
0.6
y
0.4
x
0.2
0.6
0.8
1 0
F I G U R E 3.4 The density function f (x, y) =
12
(x 2
7
+ x y), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
The marginal cdf of X , or FX , is
FX (x) = P(X ≤ x)
= lim F(x, y)
y→∞
x ∞
=
f (u, y) dy du
−∞
−∞
From this, it follows that the density function of X alone, known as the marginal
density of X , is
∞
f X (x) = FX (x) =
f (x, y) dy
−∞
In the discrete case, the marginal frequency function was found by summing the joint
frequency function over the other variable; in the continuous case, it is found by
integration.
EXAMPLE B
Continuing Example A, the marginal density of X is
12 1 2
(x + x y) dy
f X (x) =
7 0
12 2 x =
x +
7
2
A similar calculation shows that the marginal density of Y is f Y (y)
12 1
( + y/2).
7 3
=
■
For several jointly continuous random variables, we can make the obvious generalizations. The joint density function is a function of several variables, and the
marginal density functions are found by integration. There are marginal density
3.3 Continuous Random Variables
77
functions of various dimensions. Suppose that X , Y , and Z are jointly continuous
random variables with density function f (x, y, z). The one-dimensional marginal
distribution of X is
∞ ∞
f (x, y, z) dy dz
f X (x) =
−∞
−∞
and the two-dimensional marginal distribution of X and Y is
∞
f (x, y, z) dz
f X Y (x, y) =
−∞
EXAMPLE C
Farlie-Morgenstern Family
If F(x) and G(y) are one-dimensional cdfs, it can be shown that, for any α for which
|α| ≤ 1,
H (x, y) = F(x)G(y){1 + α[1 − F(x)][1 − G(y)]}
is a bivariate cumulative distribution function. Because lim F(x) = lim F(y) = 1,
x→∞
y→∞
the marginal distributions are
H (x, ∞) = F(x)
H (∞, y) = G(y)
In this way, an infinite number of different bivariate distributions with given marginals
can be constructed.
As an example, we will construct bivariate distributions with marginals that are
uniform on [0, 1] [F(x) = x, 0 ≤ x ≤ 1, and G(y) = y, 0 ≤ y ≤ 1]. First, with
α = −1, we have
H (x, y) = x y[1 − (1 − x)(1 − y)]
= x 2 y + y2 x − x 2 y2,
0 ≤ x ≤ 1,
0≤y≤1
The bivariate density is
∂2
H (x, y)
∂ x∂ y
= 2x + 2y − 4x y,
h(x, y) =
0 ≤ x ≤ 1,
0≤y≤1
The density is shown in Figure 3.5. Perhaps you can imagine integrating over y
(pushing all the mass onto the x axis) to produce a marginal uniform density for x.
Next, if α = 1,
H (x, y) = x y[1 + (1 − x)(1 − y)]
= 2x y − x 2 y − y 2 x + x 2 y 2 ,
0 ≤ x ≤ 1,
0≤y≤1
The density is
h(x, y) = 2 − 2x − 2y + 4x y,
0 ≤ x ≤ 1,
0≤y≤1
This density is shown in Figure 3.6.
We just constructed two different bivariate distributions, both of which have
uniform marginals.
■
78
Chapter 3
Joint Distributions
1
0.8
y 0.6
0.4
0.2
0
2
1.5
h(x, y)
1
0.5
0
0
0.2
0.4
x
0.6
0.8
1
F I G U R E 3.5 The joint density h(x, y) = 2x + 2y − 4x y, where 0 ≤ x ≤ 1 and
0 ≤ y ≤ 1, which has uniform marginal densities.
1
y
0.4
0.2
0
2
0.8
0.6
1.5
h(x, y)
1
0.5
0
0
0.2
0.4
x
0.6
0.8
1
F I G U R E 3.6 The joint density h(x, y) = 2 − 2x − 2y + 4x y, where 0 ≤ x ≤ 1 and
0 ≤ y ≤ 1, which has uniform marginal densities.
A copula is a joint cumulative distribution function of random variables that have
uniform marginal distributions. The functions H (x, y) in the preceding example are
copulas. Note that a copula C(u, v) is nondecreasing in each variable, because it
is a cdf. Also, P(U ≤ u) = C(u, 1) = u and C(1, v) = v, since the marginal
distributions are uniform. We will restrict ourselves to copulas that have densities, in
which case the density is
c(u, v) =
∂2
C(u, v) ≥ 0
∂u∂v
3.3 Continuous Random Variables
79
Now, suppose that X and Y are continuous random variables with cdfs FX (x)
and FY (y). Then U = FX (x) and V = FY (y) are uniform random variables (Proposition 2.3C). For a copula C(u, v), consider the joint distribution defined by
FX Y (x, y) = C(FX (x), FY (y))
Since C(FX (x), 1) = FX (x), the marginal cdfs corresponding to FX Y are FX (x) and
FY (y). Using the chain rule, the corresponding density is
f X Y (x, y) = c(FX (x), FY (y)) f X (x) f Y (y)
This construction points out that from the ingredients of two marginal distributions
and any copula, a joint distribution with those marginals can be constructed. It is
thus clear that the marginal distributions do not determine the joint distribution. The
dependence between the random variables is captured in the copula. Copulas are not
just academic curiousities—they have been extensively used in financial statistics in
recent years to model dependencies in the returns of financial instruments.
EXAMPLE D
Consider the following joint density:
f (x, y) =
λ2 e−λy ,
0,
0 ≤ x ≤ y, λ > 0
elsewhere
This joint density is plotted in Figure 3.7. To find the marginal densities, it is helpful
to draw a picture showing where the density is nonzero to aid in determining the limits
of integration (see Figure 3.8).
1
.75
.5 f (x, y)
.25
0
3
0
2
1
x
2
1
y
3 0
F I G U R E 3.7 The joint density of Example D.
80
Chapter 3
Joint Distributions
y
x
y
x
F I G U R E 3.8 The joint density of Example D is nonzero over the shaded region of
the plane.
∞
First consider the marginal density f X (x) = −∞ f X Y (x, y)dy. Since f (x, y) = 0
for x ≥ y,
∞
λ2 e−λy dy = λe−λx ,
x ≥0
f X (x) =
x
and we see that the marginal distribution of X is exponential. Next, because
f X Y (x, y) = 0 for x ≤ 0 and x > y,
y
f Y (y) =
λ2 e−λy d x = λ2 ye−λy ,
y≥0
0
The marginal distribution of Y is a gamma distribution.
■
In some applications, it is useful to analyze distributions that are uniform over
some region of space. For example, in the plane, the random point (X, Y ) is uniform
over a region, R, if for any A ⊂ R,
P((X, Y ) ∈ A) =
|A|
|R|
where | | denotes area.
EXAMPLE E
A point is chosen randomly in a disk of radius 1. Since the area of the disk is π,
1
,
if x 2 + y 2 ≤ 1
f (x, y) = π
0,
otherwise
3.3 Continuous Random Variables
81
We can calculate the distribution of R, the distance of the point from the origin. R ≤ r
if the point lies in a disk of radius r . Since this disk has area πr 2 ,
πr 2
= r2
π
The density function of R is thus f R (r ) = 2r, 0 ≤ r ≤ 1.
Let us now find the marginal density of the x coordinate of the random point:
∞
f (x, y) dy
f X (x) =
FR (r ) = P(R ≤ r ) =
−∞
1
=
π
=
√1−x 2
√
−
dy
1−x 2
2
1 − x 2,
π
−1 ≤ x ≤ 1
Note that we chose the limits of integration carefully; outside these limits the joint
density is zero. (Draw a picture of the region over which f (x, y) > 0 and indicate
the preceding limits of integration.) By symmetry, the marginal density of Y is
f Y (y) =
EXAMPLE F
2
1 − y2,
π
−1 ≤ y ≤ 1
■
Bivariate Normal Density
The bivariate normal density is given by the complicated expression
1
(y − μY )2
1
(x − μ X )2
exp −
+
f (x, y) =
2(1 − ρ 2 )
σ X2
σY2
2πσ X σY 1 − ρ 2
2ρ(x − μ X )(y − μY )
−
σ X σY
One of the earliest uses of this bivariate density was as a model for the joint distribution
of the heights of fathers and sons. The density depends on five parameters:
−∞ < μ X < ∞
σX > 0
−∞ < μY < ∞
σY > 0
−1 < ρ < 1
The contour lines of the density are the lines in the x y plane on which the joint density
is constant. From the preceding equation, we see that f (x, y) is constant if
(y − μY )2
2ρ(x − μ X )(y − μY )
(x − μ X )2
+
−
= constant
2
σX
σY2
σ X σY
The locus of such points is an ellipse centered at (μ X , μY ). If ρ = 0, the axes of the
ellipse are parallel to the x and y axes, and if ρ = 0, they are tilted. Figure 3.9 shows
several bivariate normal densities, and Figure 3.10 shows the corresponding elliptical
contours.
82
Chapter 3
Joint Distributions
.4
.4
.3
.3
.2
.2
2
.1
0
0
0
2
0
2
2
.1
0
2
0
2
2
(a)
2
(b)
.4
.4
.3
.3
.2
.2
2
.1
0
0
2
0
2
2
.1
0
0
2
0
2
2
2
(d)
(c)
F I G U R E 3.9 Bivariate normal densities with μ X = μY = 0 and σ X = σY = 1 and (a) ρ = 0, (b) ρ = .3,
(c) ρ = .6, (d) ρ = .9.
The marginal distributions of X and Y are N (μ X , σ X2 ) and N (μY , σY2 ), respectively, as we will now demonstrate. The marginal density of X is
∞
f X (x) =
f X Y (x, y) dy
−∞
Making the changes of variables u = (x − μ X )/σ X and v = (y − μY )/σY gives us
∞
1
1
2
2
f X (x) =
exp −
(u + v − 2ρuv) dv
2(1 − ρ 2 )
2πσ X 1 − ρ 2 −∞
3.3 Continuous Random Variables
2
2
1
1
0
0
1
1
2
2
(a)
1
0
1
2
2
(b)
2
2
2
1
1
0
0
1
1
2
2
(c)
2
1
0
1
2
(d)
2
1
0
1
2
1
0
1
2
83
F I G U R E 3.10 The elliptical contours of the bivariate normal densities of Figure 3.9.
To evaluate this integral, we use the technique of completing the square. Using the
identity
u 2 + v 2 − 2ρuv = (v − ρu)2 + u 2 (1 − ρ 2 )
we have
f X (x) =
2πσ X
1
1 − ρ2
e
−u 2 /2
∞
−∞
exp −
1
2
(v − ρu) dv
2(1 − ρ 2 )
Finally, recognizing the integral as that of a normal density with mean ρu and variance
(1 − ρ 2 ), we obtain
2
2
1
√ e−(1/2) (x−μ X ) /σ X
f X (x) =
σ X 2π
which is a normal density, as was to be shown. Thus, for example, the marginal
distributions of x and y in Figure 3.9 are all standard normal, even though the joint
■
distributions of (a)–(d) are quite different from each other.
We saw in our discussion of copulas earlier in this section that marginal densities
do not determine joint densities. For example, we can take both marginal densities
to be normal with parameters μ = 0 and σ = 1 and use the Farlie-Morgenstern
84
Chapter 3
Joint Distributions
0.2
0.15
f (x, y)
0.1
0.05
0
3
2
1
y
0
1
2
3
2
1
0
1
2
3
x
F I G U R E 3.11 A bivariate density that has normal marginals but is not bivariate
normal. The contours of the density shown in the x y plane are not elliptical.
copula with density c(u, v) = 2 − 2u − 2v + 4uv. Denoting the normal density and
cumulative distribution functions by φ(x) and (x), the bivariate density is
f (x, y) = (2 − 2(x) − 2(y) + 4(x)(y))φ(x)φ(y)
This density and its contours are shown in Figure 3.11. Note that the contours are not
elliptical. This bivariate density has normal marginals, but it is not a bivariate normal
density.
3.4 Independent Random Variables
DEFINITION
Random variables X 1 , X 2 , . . . , X n are said to be independent if their joint cdf
factors into the product of their marginal cdf’s:
F(x1 , x2 , . . . , xn ) = FX 1 (x1 )FX 2 (x2 ) · · · FX n (xn )
for all x1 , x2 , . . . , xn .
■
The definition holds for both continuous and discrete random variables. For
discrete random variables, it is equivalent to state that their joint frequency function
factors; for continuous random variables, it is equivalent to state that their joint density
function factors. To see why this is true, consider the case of two jointly continuous
random variables, X and Y . If they are independent, then
F(x, y) = FX (x)FY (y)
3.4 Independent Random Variables
85
and taking the second mixed partial derivative makes it clear that the density function
factors. On the other hand, if the density function factors, then the joint cdf can be
expressed as a product:
x y
F(x, y) =
f X (u) f Y (v) dv du
−∞
=
−∞
x
−∞
f X (u) du
y
−∞
f Y (v) dv = FX (x)FY (y)
It can be shown that the definition implies that if X and Y are independent, then
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)
It can also be shown that if g and h are functions, then Z = g(X ) and W = h(Y ) are
independent as well. A sketch of an argument goes like this (the details are beyond
the level of this course): We wish to find P(Z ≤ z, W ≤ w). Let A(z) be the set of
x such that g(x) ≤ z, and let B(w) be the set of y such that h(y) ≤ w. Then
P(Z ≤ z, W ≤ w) = P(X ∈ A(z), Y ∈ B(w))
= P(X ∈ A(z))P(Y ∈ B(w))
= P(Z ≤ z)P(W ≤ w)
EXAMPLE A
Suppose that the point (X, Y ) is uniformly distributed on the square S = {(x, y) |
− 1/2 ≤ x ≤ 1/2, −1/2 ≤ y ≤ 1/2}: f X Y (x, y) = 1 for (x, y) in S and 0 elsewhere.
Make a sketch of this square. You can visualize that the marginal distributions of X
and Y are uniform on [−1/2, 1/2]. For example, the marginal density at a point x,
−1/2 ≤ x ≤ 1/2 is found by integrating (summing) the joint density over the vertical
line that meets the horizontal axis at x. Thus, f X (x) = 1, −1/2 ≤ x ≤ 1/2 and
f Y (y) = 1, and − 1/2 ≤ y ≤ 1/2. The joint density is equal to the product of the
marginal densities, so X and Y are independent. You should be able to see from our
sketch that knowing the value of X gives no information about the possible values
■
of Y .
EXAMPLE B
Now consider rotating the square of the previous example by 90◦ to form a diamond.
Sketch this diamond. From the sketch, you can see that the marginal density of X is
nonnegative for −1/2 ≤ x ≤ 1/2 as before, but it is not uniform, and similarly for
the marginal density of Y . Thus, for example, f X (.9) > 0 and f Y (.9) > 0. But from
the sketch you can also see that f X Y (.9, .9) = 0. Thus, X and Y are not independent.
Finally, the sketch shows you that knowing the value of X — for example, X = .9—
constrains the possible values of Y .
■
86
Chapter 3
Joint Distributions
EXAMPLE C
Farlie-Morgenstern Family
From Example C in Section 3.3, we see that X and Y are independent only if α = 0,
since only in this case does the joint cdf H factor into the product of the marginals F
and G.
■
EXAMPLE D
If X and Y follow a bivariate normal distribution (Example F from Section 3.3)
and ρ = 0, their joint density factors into the product of two normal densities, and
■
therefore X and Y are independent.
EXAMPLE E
Suppose that a node in a communications network has the property that if two packets
of information arrive within time τ of each other, they “collide” and then have to be
retransmitted. If the times of arrival of the two packets are independent and uniform
on [0, T ], what is the probability that they collide?
The times of arrival of two packets, T1 and T2 , are independent and uniform on
[0, T ], so their joint density is the product of the marginals, or
1
T2
for t1 and t2 in the square with sides [0, T ]. Therefore, (T1 , T2 ) is uniformly distributed
over the square. The probability that the two packets collide is proportional to the
area of the shaded strip in Figure 3.12. Each of the unshaded triangles of the figure
has area (T − τ )2 /2, and thus the area of the shaded area is T 2 − (T − τ )2 . Integrating
■
f (t1 , t2 ) over this area gives the desired probability: 1 − (1 − τ/T )2 .
f (t1 , t2 ) =
t2
T
T
t1
F I G U R E 3.12 The probability that the two packets collide is proportional to the
area of the shaded region |t1 − t2 | < τ
3.5 Conditional Distributions
87
3.5 Conditional Distributions
3.5.1 The Discrete Case
If X and Y are jointly distributed discrete random variables, the conditional probability
that X = xi given that Y = y j is, if pY (y j ) > 0,
P(X = xi , Y = y j )
P(Y = y j )
p X Y (xi , y j )
=
pY (y j )
P(X = xi |Y = y j ) =
This probability is defined to be zero if pY (y j ) = 0. We will denote this conditional
probability by p X |Y (x|y). Note that this function of x is a genuine frequency function
since it is nonnegative and sums to 1 and that pY |X (y|x) = pY (y) if X and Y are
independent.
EXAMPLE A
We return to the simple discrete distribution considered in Section 3.2, reproducing
the table of values for convenience here:
y
x
0
1
2
3
0
1
8
0
1
8
2
8
0
1
2
8
1
8
1
8
The conditional frequency function of X given Y = 1 is
p X |Y (0|1) =
2
8
3
8
=
2
3
p X |Y (1|1) =
1
8
3
8
=
1
3
■
The definition of the conditional frequency function just given can be reexpressed
as
p X Y (x, y) = p X |Y (x|y) pY (y)
(the multiplication law of Chapter 1). This useful equation gives a relationship between
the joint and conditional frequency functions. Summing both sides over all values of
y, we have an extremely useful application of the law of total probability:
p X (x) =
p X |Y (x|y) pY (y)
y
88
Chapter 3
EXAMPLE B
Joint Distributions
Suppose that a particle counter is imperfect and independently detects each incoming
particle with probability p. If the distribution of the number of incoming particles in
a unit of time is a Poisson distribution with parameter λ, what is the distribution of
the number of counted particles?
Let N denote the true number of particles and X the counted number. From the
statement of the problem, the conditional distribution of X given N = n is binomial,
with n trials and probability p of success. By the law of total probability,
P(X = k) =
∞
P(N = n)P(X = k|N = n)
n=0
=
∞
λn e−λ n
n=k
n!
k
p k (1 − p)n−k
=
(λp)k −λ n−k (1 − p)n−k
e
λ
k!
(n − k)!
n=k
=
(λp)k −λ λ j (1 − p) j
e
k!
j!
j=0
∞
∞
(λp)k −λ λ(1− p)
e e
k!
(λp)k −λp
e
=
k!
We see that the distribution of X is a Poisson distribution with parameter λp. This
model arises in other applications as well. For example, N might denote the number
of traffic accidents in a given time period, with each accident being fatal or nonfatal;
X would then be the number of fatal accidents.
■
=
3.5.2 The Continuous Case
In analogy with the definition in the preceding section, if X and Y are jointly continuous random variables, the conditional density of Y given X is defined to be
f Y |X (y|x) =
f X Y (x, y)
f X (x)
if 0 < f X (x) < ∞, and 0 otherwise. This definition is in accord with the result to
which a differential argument would lead. We would define f Y |X (y|x) dy as P(y ≤
Y ≤ y + dy|x ≤ X ≤ x + d x) and calculate
P(y ≤ Y ≤ y + dy|x ≤ X ≤ x + d x) =
f X Y (x, y)
f X Y (x, y) d x d y
=
dy
f X (x) d x
f X (x)
Note that the rightmost expression is interpreted as a function of y, x being fixed.
The numerator is the joint density f X Y (x, y), viewed as a function of y for fixed x:
you can visualize it as the curve formed by slicing through the joint density function
3.5 Conditional Distributions
89
perpendicular to the x axis. The denominator normalizes that curve to have unit
area.
The joint density can be expressed in terms of the marginal and conditional
densities as follows:
f X Y (x, y) = f Y |X (y|x) f X (x)
Integrating both sides over x allows the marginal density of Y to be expressed as
∞
f Y |X (y|x) f X (x) d x
f Y (y) =
−∞
which is the law of total probability for the continuous case.
EXAMPLE A
In Example D in Section 3.3, we saw that
f X Y (x, y) = λ2 e−λy ,
0≤x≤y
f X (x) = λe−λx ,
f Y (y) = λ ye
2
−λy
x≥0
,
y≥0
Let us find the conditional densities. Before doing the formal calculations, it is informative to examine the joint density for x and y, respectively, held constant. If x
is constant, the joint density decays exponentially in y for y ≥ x; if y is constant,
the joint density is constant for 0 ≤ x ≤ y. (See Figure 3.7.) Now let us find the
conditional densities according to the preceding definition. First,
f Y |X (y|x) =
λ2 e−λy
= λe−λ(y−x) ,
λe−λx
y≥x
The conditional density of Y given X = x is exponential on the interval [x, ∞).
Expressing the joint density as
f X Y (x, y) = f Y |X (y|x) f X (x)
we see that we could generate X and Y according to f X Y in the following way: First,
generate X as an exponential random variable ( f X ), and then generate Y as another
exponential random variable ( f Y |X ) on the interval [x, ∞). From this representation,
we see that Y may be interpreted as the sum of two independent exponential random
variables and that the distribution of this sum is gamma, a fact that we will derive
later by a different method.
Now,
f X |Y (x|y) =
λ2 e−λy
1
= ,
2
−λy
λ ye
y
0≤x ≤y
The conditional density of X given Y = y is uniform on the interval [0, y]. Finally,
expressing the joint density as
f X Y (x, y) = f X |Y (x|y) f Y (y)
90
Chapter 3
Joint Distributions
we see that alternatively we could generate X and Y according to the density f X Y by
first generating Y from a gamma density and then generating X uniformly on [0, y].
Another interpretation of this result is that, conditional on the sum of two independent
exponential random variables, the first is uniformly distributed.
■
EXAMPLE B
Stereology
In metallography and other applications of quantitative microscopy, aspects of a threedimensional structure are deduced from studying two-dimensional cross sections.
Concepts of probability and statistics play an important role (DeHoff and Rhines
1968). In particular, the following problem arises. Spherical particles are dispersed
in a medium (grains in a metal, for example); the density function of the radii of
the spheres can be denoted as f R (r ). When the medium is sliced, two-dimensional,
circular cross sections of the spheres are observed; let the density function of the radii
of these circles be denoted by f X (x). How are these density functions related?
y
x
H
r
x
F I G U R E 3.13 A plane slices a sphere of radius r at a distance H from its center,
producing a circle of radius x.
To derive the relationship, we assume that the cross-sectioning plane is chosen
at random, fix R = r , and find the conditional density f X |R (x|r ). As shown in
Figure 3.13, let H denote the distance from the center of the sphere to the planar
cross
√
section. By our assumption, H is uniformly distributed on [0, r ], and X = r 2 − H 2 .
We can thus find the conditional distribution of X given R = r :
FX |R (x|r ) = P(X ≤ x)
= P( r 2 − H 2 ≤ x)
= P(H ≥ r 2 − x 2 )
√
r2 − x2
,
0≤x ≤r
= 1−
r
3.5 Conditional Distributions
91
Differentiating, we find
x
f X |R (x|r ) = √
,
2
r r − x2
0≤x ≤r
The marginal density of X is, from the law of total probability,
f X (x) =
∞
−∞
∞
=
x
f X |R (x|r ) f R (r ) dr
x
√
f R (r ) dr
r r2 − x2
[The limits of integration are x and ∞ since for r ≤ x, f X |R (x|r ) = 0.] This equation
is called Abel’s equation. In practice, the marginal density f X can be approximated
by making measurements of the radii of cross-sectional circles. Then the problem
becomes that of trying to solve for an approximation to f R , since it is the distribution
■
of spherical radii that is of real interest.
EXAMPLE C
Bivariate Normal Density
The conditional density of Y given X is the ratio of the bivariate normal density to a
univariate normal density. After some messy algebra, this ratio simplifies to
⎛
f Y |X (y|x) =
σY
⎜ 1
⎜
exp ⎜−
2
⎝ 2
2π(1 − ρ )
1
2 ⎞
σY
(x − μ X )
σX
σY2 (1 − ρ 2 )
y − μY − ρ
⎟
⎟
⎟
⎠
This is a normal density with mean μY + ρ(x − μ X )σY /σ X and variance σY2 (1 − ρ 2 ).
The conditional distribution of Y given X is a univariate normal distribution.
In Example B in Section 2.2.3, the distribution of the velocity of a turbulent
wind flow was shown to be approximately normally distributed. Van Atta and Chen
(1968) also measured the joint distribution of the velocity at a point at two different
times, t and t + τ . Figure 3.14 shows the measured conditional density of the velocity, v2 , at time t + τ , given various values of v1 . There is a systematic departure
from the normal distribution. Therefore, it appears that, even though the velocity
is normally distributed, the joint distribution of v1 and v2 is not bivariate normal.
This should not be totally unexpected, since the relation of v1 and v2 must conform to equations of motion and continuity, which may not permit a joint normal
■
distribution.
Example C illustrates that even when two random variables are marginally normally distributed, they need not be jointly normally distributed.
92
Chapter 3
Joint Distributions
.25
1 0.055
.20
.15
1 1.069
1 1.018
2 2 p ( 1, 2 )
.10
.05
0
.10
.05
0
.04
1 1.50
1 1.551
1 2.032
1 1.981
0
.01
0
1 2.514
3
2
1 2.461
1
0
2
1
2
3
F I G U R E 3.14 The conditional densities of v2 given v1 for selected values of v1 ,
where v1 and v2 are components of the velocity of a turbulent wind flow at different
times. The solid lines are the conditional densities according to a normal fit, and the
triangles and squares are empirical values determined from 409,600 observations.
EXAMPLE D
Rejection Method
The rejection method is commonly used to generate random variables from a density
function, especially when the inverse of the cdf cannot be found in closed form
and therefore the inverse cdf method, Proposition D in Section 2.3, cannot be used.
Suppose that f is a density function that is nonzero on an interval [a, b] and zero
outside the interval (a and b may be infinite). Let M(x) be a function such that
M(x) ≥ f (x) on [a, b], and let
m(x) =
M(x)
b
a
M(x) d x
3.5 Conditional Distributions
93
be a probability density function. As we will see, the idea is to choose M so that it is
easy to generate random variables from m. If [a, b] is finite, m can be chosen to be
the uniform distribution on [a, b]. The algorithm is as follows:
Step 1: Generate T with the density m.
Step 2: Generate U , uniform on [0, 1] and independent of T . If M(T ) ×U ≤ f (T ),
then let X = T (accept T ). Otherwise, go to Step 1 (reject T ).
See Figure 3.15. From the figure, we can see that a geometrical interpretation of this
algorithm is as follows: Throw a dart that lands uniformly in the rectangular region of
the figure. If the dart lands below the curve f (x), record its x coordinate; otherwise,
reject it.
y
accept reject
M
f
T
a
b
x
F I G U R E 3.15 Illustration of the rejection method.
We must check that the density function of the random variable X thus obtained
is in fact f :
P(x ≤ X ≤ x + d x) = P(x ≤ T ≤ x + d x | accept)
=
P(x ≤ T ≤ x + d x and accept)
P(accept)
=
P(accept|x ≤ T ≤ x + d x)P(x ≤ T ≤ x + d x)
P(accept)
First consider the numerator of this expression. We have
P(accept|x ≤ T ≤ x + d x) = P(U ≤ f (x)/M(x)) =
f (x)
M(x)
so that the numerator is
m(x) d x f (x)
=
M(x)
f (x) d x
b
a
M(x) d x
From the law of total probability, the denominator is
P(accept) = P(U ≤ f (T )/M(T ))
b
f (t)
m(t) dt =
=
M(t)
a
1
b
a
M(t) dt
where the last two steps follow from the definition of m and since f integrates to 1.
Finally, we see that the numerator over the denominator is f (x) d x.
■
94
Chapter 3
Joint Distributions
In order for the rejection method to be computationally efficient, the algorithm
should lead to acceptance with high probability; otherwise, many rejection steps may
have to be looped through for each acceptance.
EXAMPLE E
Bayesian Inference
A freshly minted coin has a certain probability of coming up heads if it is spun on its
edge, but that probability is not necessarily equal to 12 . Now suppose it is spun n times
and comes up heads X times. What has been learned about the chance the coin comes
up heads? We will go through a Bayesian treatment of this problem. Let denote the
probability that the coin will come up heads. We represent our knowledge about before gathering any data by a probability density on [0, 1], called the prior density.
If we are totally ignorant about , we might represent our state of knowledge by a
uniform density on [0, 1]:
f (θ ) = 1, 0 ≤ θ ≤ 1.
We will see how observing X changes our knowledge about , transforming the prior
distribution into a “posterior” distribution.
Given a value θ, X follows a binomial distribution with n trials and probability
of success θ:
n x
x = 0, 1, . . . , n
f X | (x|θ) =
θ (1 − θ )n−x ,
x
Now is continuous and X is discrete, and they have a joint probability distribution:
f ,X (θ, x) = f X | (x|θ) f (θ )
n x
=
θ (1 − θ )n−x ,
x
x = 0, 1, . . . , n, 0 ≤ θ ≤ 1
This is a density function in θ and a probability mass function in x, an object of a
kind we have not seen before. We can calculate the marginal density X by integrating
the joint over θ:
1
n x
f X (x) =
θ (1 − θ )n−x dθ
x
0
We can calculate this formidable looking integral by a trick. First write
(n + 1)
n
n!
=
=
x!(n − x)!
(x + 1) (n − x + 1)
x
(If k is an integer, (k) = (k − 1)!; see Problem 49 in Chapter 2). Recall the beta
density (Section 2.2.4)
(a + b) a−1
u (1 − u)b−1 ,
0≤u≤1
(a) (b)
The fact that this density integrates to 1 tells us that
1
(a) (b)
u a−1 (1 − u)b−1 du =
(a + b)
0
g(u) =
3.5 Conditional Distributions
95
Thus, identifying u with θ, a − 1 with x, and b − 1 with n − x,
f X (x) =
=
=
(n + 1)
(x + 1) (n − x + 1)
1
θ x (1 − θ )n−x dθ
0
(n + 1)
(x + 1) (n − x + 1)
(x + 1) (n − x + 1)
(n + 2)
1
,
n+1
x = 0, 1, . . . , n
Thus, if our prior on θ is uniform, each outcome of X is a priori equally likely.
Our knowledge about having observed X = x is quantified in the conditional
density of given X = x:
f ,X (θ, x)
f X (x)
n x
= (n + 1)
θ (1 − θ )n−x
x
f |X (θ|x) =
= (n + 1)
=
(n + 1)
θ x (1 − θ )n−x
(x + 1) (n − x + 1)
(n + 2)
θ x (1 − θ )n−x
(x + 1) (n − x + 1)
The relationship x (x) = (x + 1) has been used in the second step (see Problem 49,
Chapter 2). Bear in mind that for each fixed x, this is a function of θ—the posterior
density of θ given x—which quantifies our opinion about having observed x heads
in n spins. The posterior density is a beta density with parameters a = x + 1,
b = n − x + 1.
A one-Euro coin has the number 1 on one face and a bird on the other face. I spun
such a coin 20 times: the 1 came up 13 of the 20 times. Using the prior, ∼ U [0, 1],
the posterior is beta with a = x + 1 = 14 and b = n − x + 1 = 8. Figure 3.16 shows
this posterior, which represents my opinion if I was initially totally ignorant of θ and
then observed thirteen 1s in 20 spins. From the figure, it is extremely unlikely that
θ < 0.25, for example. My probability, or belief, that θ is greater than 12 is the area
under the density to the right of 12 , which can be calculated to be 0.91. I can be 91%
certain that θ is greater than 12 .
We need to distinguish between the steps of the preceding probability calculations, which are are mathematically straightforward; and the interpretation of the
results, which goes beyond the mathematics and requires a model that belief can
be expressed in terms of probability and revised using the laws of probability. See
Figure 3.16.
■
96
Chapter 3
Joint Distributions
4
3
2
1
0
0.0
0.2
0.4
0.6
0.8
1.0
p
F I G U R E 3.16 Beta density with parameters a = 14 and b = 8.
3.6 Functions of Jointly Distributed
Random Variables
The distribution of a function of a single random variable was developed in Section 2.3.
In this section, that development is extended to several random variables, but first some
important special cases are considered.
3.6.1 Sums and Quotients
Suppose that X and Y are discrete random variables taking values on the integers
and having the joint frequency function p(x, y), and let Z = X + Y . To find the
frequency function of Z , we note that Z = z whenever X = x and Y = z − x, where
x is an integer. The probability that Z = z is thus the sum over all x of these joint
probabilities, or
∞
p(x, z − x)
p Z (z) =
x=−∞
If X and Y are independent so that p(x, y) = p X (x) pY (y), then
p Z (z) =
∞
p X (x) pY (z − x)
x=−∞
This sum is called the convolution of the sequences p X and pY .
3.6 Functions of Jointly Distributed Random Variables
97
y
(0, z)
x
y
z
(z, 0)
z
Rz
F I G U R E 3.17
X + Y ≤ z whenever (X , Y ) is in the shaded region Rz.
The continuous case is very similar. Supposing that X and Y are continuous random variables, we first find the cdf of Z and then differentiate to find the density. Since
Z ≤ z whenever the point (X, Y ) is in the shaded region Rz shown in Figure 3.17,
we have
FZ (z) =
f (x, y) d x d y
Rz
=
∞
−∞
z−x
−∞
f (x, y) d y d x
In the inner integral, we make the change of variables y = v − x to obtain
∞ z
f (x, v − x) dv d x
FZ (z) =
−∞
z
=
Differentiating, we have, if
−∞
−∞
∞
−∞
f (x, v − x) d x dv
∞
−∞
f (x, z − x) d x is continuous at z,
∞
f Z (z) =
f (x, z − x) d x
−∞
which is the obvious analogue of the result for the discrete case.
If X and Y are independent,
∞
f X (x) f Y (z − x) d x
f Z (z) =
−∞
This integral is called the convolution of the functions f X and f Y .
98
Chapter 3
EXAMPLE A
Joint Distributions
Suppose that the lifetime of a component is exponentially distributed and that an
identical and independent backup component is available. The system operates as
long as one of the components is functional; therefore, the distribution of the life of
the system is that of the sum of two independent exponential random variables. Let
T1 and T2 be independent exponentials with parameter λ, and let S = T1 + T2 .
s
λe−λt λe−λ(s−t) dt
f S (s) =
0
It is important to note the limits of integration. Beyond these limits, one of the two
component densities is zero. When dealing with densities that are nonzero only on
some subset of the real line, we must always be careful. Continuing, we have
s
e−λs dt
f S (s) = λ2
0
= λ se−λs
2
This is a gamma distribution with parameters 2 and λ (compare with Example A in
■
Section 3.5.2).
Let us next consider the quotient of two continuous random variables. The derivation is very similar to that for the sum of such variables, given previously: We first find
the cdf and then differentiate to find the density. Suppose that X and Y are continuous
with joint density function f and that Z = Y / X . Then FZ (z) = P(Z ≤ z) is the
probability of the set of (x, y) such that y/x ≤ z. If x > 0, this is the set y ≤ x z; if
x < 0, it is the set y ≥ x z. Thus,
0 ∞
∞ xz
FZ (z) =
f (x, y) d y d x +
f (x, y) d y d x
−∞
xz
−∞
0
To remove the dependence of the inner integrals on x, we make the change of variables y = xv in the inner integrals and obtain
0 −∞
∞ z
x f (x, xv) dv d x +
x f (x, xv) dv d x
FZ (z) =
−∞
=
=
0
z
0
z
−∞
z
−∞
∞
−∞
−∞
(−x) f (x, xv) dv d x +
0
−∞
∞
z
−∞
x f (x, xv) dv d x
|x| f (x, xv) d x dv
Finally, differentiating (again under an assumption of continuity), we find
∞
|x| f (x, x z) d x
f Z (z) =
−∞
In particular, if X and Y are independent,
∞
f Z (z) =
|x| f X (x) f Y (x z) d x
−∞
3.6 Functions of Jointly Distributed Random Variables
EXAMPLE B
99
Suppose that X and Y are independent standard normal random variables and that
Z = Y / X . We then have
∞
|x| −x 2 /2 −x 2 z 2 /2
e
e
dx
f Z (z) =
2π
−∞
From the symmetry of the integrand about zero,
1 ∞ −x 2 ((z 2 +1)/2)
f Z (z) =
xe
dx
π 0
To simplify this, we make the change of variables u = x 2 to obtain
∞
1
2
f Z (z) =
e−u((z +1)/2) du
2π 0
Next, using the fact that
∞
0
λ exp(−λx) d x = 1 with λ = (z 2 + 1)/2, we get
f Z (z) =
1
,
+ 1)
π(z 2
−∞ < z < ∞
This density is called the Cauchy density. Like the standard normal density, the
Cauchy density is symmetric about zero and bell-shaped, but the tails of the Cauchy
tend to zero very slowly compared to the tails of the normal. This can be interpreted
as being because of a substantial probability that X in the quotient Y / X is near
zero.
■
Example B indicates one method of generating Cauchy random variables—we
can generate independent standard normal random variables and form their quotient.
The next section shows how to generate standard normals.
3.6.2 The General Case
The following example illustrates the concepts that are important to the general case
of functions of several random variables and is also interesting in its own right.
EXAMPLE A
Suppose that X and Y are independent standard normal random variables, which
means that their joint distribution is the standard bivariate normal distribution, or
f X Y (x, y) =
1 −(x 2 /2)−(y 2 /2)
e
2π
We change to polar coordinates and then reexpress the density in this new coordinate
system (R ≥ 0, 0 ≤ ≤ 2π):
R = X2 + Y 2
100
Chapter 3
Joint Distributions
=
⎧ −1 Y tan
,
if X > 0
⎪
X
⎪
⎪
⎪
⎪
⎪
⎪
⎪
−1 Y
⎪
+ π, if X < 0
⎨tan
X
⎪
π
⎪
⎪
sgn(Y ),
⎪
2
⎪
⎪
⎪
⎪
⎪
⎩
0,
if X = 0, Y = 0
if X = 0, Y = 0
(The range of the inverse tangent function is taken to be − π2 < < π2 .) The inverse
transformation is
X = R cos Y = R sin The joint density of R and is
f R (r, θ) dr dθ = P(r ≤ R ≤ r + dr, θ ≤ ≤ θ + dθ )
This probability is equal to the area of the shaded patch in Figure 3.18 times
f X Y [x(r, θ), y(r, θ)]. The area in question is clearly r dr dθ , so
P(r ≤ R ≤ r + dr, θ ≤ ≤ θ + dθ ) = f X Y (r cos θ, r sin θ )r dr dθ
and
f R (r, θ) = r f X Y (r cos θ, r sin θ )
Thus,
r [−(r 2 cos2 θ )/2−(r 2 sin2 θ )/2]
e
2π
1 −r 2 /2
re
=
2π
From this, we see that the joint density factors implying that R and are independent
random variables, that is uniform on [0, 2π ], and that R has the density
f R (r, θ) =
f R (r ) = r e−r
2 /2
,
r ≥0
which is called the Rayleigh density.
rd
dr
d
r
r dr
F I G U R E 3.18 The area of the shaded patch is r dr dθ.
3.6 Functions of Jointly Distributed Random Variables
101
An interesting relationship can be found by changing variables again, letting
T = R 2 . Using the standard techniques for finding the density of a function of a
single random variable, we obtain
f T (t) =
1 −t/2
e ,
2
t ≥0
This is an exponential distribution with parameter 12 . Because R and are independent, so are T and , and the joint density of the latter pair is
1 1 −t/2
f T (t, θ) =
e
2π 2
We have thus arrived at a characterization of the standard bivariate normal distribution: is uniform on [0, 2π], and R 2 is exponential with parameter 12 . (Also, from
Example B in Section 3.6.1, tan follows a Cauchy distribution.)
These relationships can be used to construct an algorithm for generating standard
normal random variables, which is quite useful since , the cdf, and −1 cannot be
expressed in closed form. First, generate U1 and U2 , which are independent and
uniform on [0, 1]. Then −2 log U1 is exponential with parameter 12 , and 2πU2 is
uniform on [0, 2π]. It follows that
X = −2 log U1 cos(2πU2 )
and
Y =
−2 log U1 sin(2πU2 )
are independent standard normal random variables. This method of generating nor■
mally distributed random variables is sometimes called the polar method.
For the general case, suppose that X and Y are jointly distributed continuous
random variables, that X and Y are mapped onto U and V by the transformation
u = g1 (x, y)
v = g2 (x, y)
and that the transformation can be inverted to obtain
x = h 1 (u, v)
y = h 2 (u, v)
Assume that g1 and g2 have continuous partial derivatives and that the Jacobian
⎡ ∂g ∂g ⎤
1
1
⎢ ∂x ∂y ⎥
∂g1
∂g2
∂g2
∂g1
⎢
⎥
=
J (x, y) = det ⎣
−
=0
⎦
∂x
∂y
∂x
∂y
∂g2 ∂g2
∂x ∂y
for all x and y. This leads directly to the following result.
102
Chapter 3
Joint Distributions
PROPOSITION A
Under the assumptions just stated, the joint density of U and V is
fU V (u, v) = f X Y (h 1 (u, v), h 2 (u, v))|J −1 (h 1 (u, v), h 2 (u, v))|
for (u, v) such that u = g1 (x, y) and v = g2 (x, y) for some (x, y) and 0
elsewhere.
■
We will not prove Proposition A here. It follows from the formula established
in advanced calculus for a change of variables in multiple integrals. The essential
elements of the proof follow the discussion in Example A.
EXAMPLE B
To illustrate the formalism, let us redo Example A. The roles of u and v are played
by r and θ:
r = x 2 + y2
y
θ = tan−1
x
The inverse transformation is
x = r cos θ
y = r sin θ
After some algebra, we obtain the partial derivatives:
∂r
x
=
2
∂x
x + y2
y
∂r
=
2
∂y
x + y2
−y
∂θ
= 2
∂x
x + y2
x
∂θ
= 2
∂y
x + y2
The Jacobian is the determinant of the matrix of these expressions, or
J (x, y) = 1
x2
+
y2
=
1
r
Proposition A therefore says that
f R (r, θ) = r f X Y (r cos θ, r sin θ )
for r ≥ 0, 0 ≤ θ ≤ 2π, and 0 elsewhere, which is the same as the result we obtained
by a direct argument in Example A.
■
Proposition A extends readily to transformations of more than two random variables. If X 1 , . . . , X n have the joint density function f X 1 ···X n and
Yi = gi (X 1 , . . . , X n ),
X i = h i (Y1 , . . . , Yn ),
i = 1, . . . , n
i = 1, . . . , n
3.6 Functions of Jointly Distributed Random Variables
103
and if J (x1 , . . . , xn ) is the determinant of the matrix with the i j entry ∂gi /∂ x j , then
the joint density of Y1 , . . . , Yn is
f Y1 ···Yn (y1 , . . . , yn ) = f X 1 ···X n (x1 , . . . , xn )|J −1 (x1 , . . . , xn )|
wherein each xi is expressed in terms of the y’s; xi = h i (y1 , . . . , yn ).
EXAMPLE C
Suppose that X 1 and X 2 are independent standard normal random variables and that
Y1 = X 1
Y2 = X 1 + X 2
We will show that the joint distribution of Y1 and Y2 is bivariate normal. The Jacobian
of the transformation is simply
10
J (x, y) = det
=1
11
Since the inverse transformation is x1 = y1 and x2 = y2 − y1 , from Proposition A the
joint density of Y1 and Y2 is
1 2
1
2
f Y1 Y2 (y1 , y2 ) =
exp − y1 + (y2 − y1 )
2π
2
1
1
exp − 2y12 + y22 − 2y1 y2
=
2π
2
This can be recognized to be a bivariate normal density, the parameters of which can
be identified by comparing the constants in this expression with the general form of the
bivariate normal (see Example F of Section 3.3). First, since the exponential contains
only quadratic terms in y1 and y2 , we have μY1 = μY2 = 0. (If μY1 were nonzero,
for example, examination of the equation for the bivariate density in Example F of
Section 3.3 shows that there would be a term y1 μY1 .) Next, from the constant that
occurs in front of the exponential, we have
σY1 σY2 1 − ρ 2 = 1
From the coefficient of y1 we have
1
2
Dividing the second relationship into the square of the first gives σY22 = 2. From the
coefficient of y2 , we have
σY21 (1 − ρ 2 ) =
σY22 (1 − ρ 2 ) = 1
from which it follows that ρ 2 = 12 .
√
From the sign of the cross product, we see that ρ = 1/ 2. Finally, we have
σY21 = 1. We thus see that this linear transformation of two independent standard
normal random variables follows a bivariate normal distribution. This is a special
case of a more general result: A nonsingular linear transformation of two random
variables whose joint distribution is bivariate normal yields two random variables
104
Chapter 3
Joint Distributions
whose joint distribution is still bivariate normal, although with different parameters.
■
(See Problem 58.)
3.7 Extrema and Order Statistics
This section is concerned with ordering a collection of independent continuous
random variables. In particular, let us assume that X 1 , X 2 , . . . , X n are independent
random variables with the common cdf F and density f . Let U denote the maximum
of the X i and V the minimum. The cdfs of U and V , and therefore their densities,
can be found by a simple trick.
First, we note that U ≤ u if and only if X i ≤ u for all i. Thus,
FU (u) = P(U ≤ u)
= P(X 1 ≤ u)P(X 2 ≤ u) · · · P(X n ≤ u)
= [F(u)]n
Differentiating, we find the density,
fU (u) = n f (u)[F(u)]n−1
Similarly, V ≥ v if and only if X i ≥ v for all i. Thus,
1 − FV (v) = [1 − F(v)]n
and
FV (v) = 1 − [1 − F(v)]n
The density function of V is therefore
f V (v) = n f (v)[1 − F(v)]n−1
EXAMPLE A
Suppose that n system components are connected in series, which means that the system fails if any one of them fails, and that the lifetimes of the components, T1 , . . . , Tn ,
are independent random variables that are exponentially distributed with parameter λ:
F(t) = 1−e−λt . The random variable that represents the length of time the system operates is V , which is the minimum of the Ti and by the preceding result has the density
f V (v) = nλe−λv (e−λv )n−1
= nλe−nλv
We see that V is exponentially distributed with parameter nλ.
EXAMPLE B
■
Suppose that a system has components as described in Example A but connected
in parallel, which means that the system fails only when they all fail. The system’s
lifetime is thus the maximum of n exponential random variables and has the
3.7 Extrema and Order Statistics
105
density
fU (u) = nλe−λu (1 − e−λu )n−1
By expanding the last term using the binomial theorem, we see that this density is a
■
weighted sum of exponential terms rather than a simple exponential density.
We will now derive the preceding results once more, by the differential technique,
and generalize them. To find fU (u), we observe that u ≤ U ≤ u + du if one of the
n X i falls in the interval (u, u + du) and the other (n − 1)X i fall to the left of u.
The probability of any particular such arrangement is [F(u)]n−1 f (u)du, and because
there are n such arrangements,
fU (u) = n[F(u)]n−1 f (u)
Now we again assume that X 1 , . . . , X n are independent continuous random variables with density f (x). We sort the X i and denote by X (1) < X (2) < · · · < X (n) the
order statistics. Note that X 1 is not necessarily equal to X (1) . (In fact, this equality
holds with probability n −1 .) Thus, X (n) is the maximum, and X (1) is the minimum. If
n is odd, say, n = 2m + 1, then X (m+1) is called the median of the X i .
THEOREM A
The density of X (k) , the kth-order statistic, is
f k (x) =
n!
f (x)F k−1 (x)[1 − F(x)]n−k
(k − 1)!(n − k)!
Proof
We will use a differential argument to derive this result heuristically. (The alternative approach of first deriving the cdf and then differentiating is developed in
Problem 66 at the end of this chapter.) The event x ≤ X (k) ≤ x + d x occurs if
k − 1 observations are less than x, one observation is in the interval [x, x + d x],
and n − k observations are greater than x + d x. The probability of any particular
arrangement of this type is f (x)F k−1 (x)[1 − F(x)]n−k d x, and, by the multinomial theorem, there are n!/[(k − 1)!1!(n − k)!] such arrangements, which
completes the argument.
■
EXAMPLE C
For the case where the X i are uniform on [0, 1], the density of the kth-order statistic
reduces to
n!
x k−1 (1 − x)n−k ,
(k − 1)!(n − k)!
0≤x ≤1
This is the beta density. An interesting by-product of this result is that since the
106
Chapter 3
Joint Distributions
density integrates to 1,
1
x k−1 (1 − x)n−k d x =
0
(k − 1)!(n − k)!
n!
■
Joint distributions of order statistics can also be worked out. For example, to find
the joint density of the minimum and maximum, we note that x ≤ X (1) ≤ x + d x
and y ≤ X (n) ≤ y + dy if one X i falls in [x, x + d x], one falls in [y, y + dy], and
n − 2 fall in [x, y]. There are n(n − 1) ways to choose the minimum and maximum,
and thus V = X (1) and U = X (n) have the joint density
f (u, v) = n(n − 1) f (v) f (u)[F(u) − F(v)]n−2 ,
u≥v
For example, for the uniform case,
f (u, v) = n(n − 1)(u − v)n−2 ,
1≥u≥v≥0
The range of X (1) , . . . , X (n) is R = X (n) − X (1) . Using the same kind of analysis we
used in Section 3.6.1 to derive the distribution of a sum, we find
∞
f (v + r, v) dv
f R (r ) =
−∞
EXAMPLE D
Find the distribution of the range, U − V , for the uniform [0, 1] case. The integrand
is f (v + r, v) = n(n − 1)r n−2 for 0 ≤ v ≤ v + r ≤ 1 or, equivalently, 0 ≤ v ≤ 1 − r .
Thus,
1−r
n(n − 1)r n−2 dv = n(n − 1)r n−2 (1 − r ),
0≤r ≤1
f R (r ) =
0
The corresponding cdf is
FR (r ) = nr n−1 − (n − 1)r n ,
EXAMPLE E
0≤r ≤1
■
Tolerance Interval
If a large number of independent random variables having the common density function f are observed, it seems intuitively likely that most of the probability mass of
the density f (x) is contained in the interval (X (1) , X (n) ) and unlikely that a future
observation will lie outside this interval. In fact, very precise statements can be made.
For example, the amount of the probability mass in the interval is F(X (n) ) − F(X (1) ),
a random variable that we will denote by Q. From Proposition C of Section 2.3, the
distribution of F(X i ) is uniform; therefore, the distribution of Q is the distribution
of U(n) − U(1) , which is the range of n independent uniform random variables. Thus,
P(Q > α), the probability that more than 100α% of the probability mass is contained
in the range is from Example D,
P(Q > α) = 1 − nα n−1 + (n − 1)α n
3.8 Problems
107
For example, if n = 100 and α = .95, this probability is .96. In words, this means
that the probability is .96 that the range of 100 independent random variables covers
95% or more of the probability mass, or, with probability .96, 95% of all further
observations from the same distribution will fall between the minimum and maximum.
This statement does not depend on the actual form of the distribution.
■
3.8 Problems
1. The joint frequency function of two discrete random variables, X and Y , is given
in the following table:
x
y
1
2
3
4
1
2
3
4
.10
.05
.02
.02
.05
.20
.05
.02
.02
.05
.20
.04
.02
.02
.04
.10
a. Find the marginal frequency functions of X and Y .
b. Find the conditional frequency function of X given Y = 1 and of Y given
X = 1.
2. An urn contains p black balls, q white balls, and r red balls; and n balls are
chosen without replacement.
a. Find the joint distribution of the numbers of black, white, and red balls in the
sample.
b. Find the joint distribution of the numbers of black and white balls in the
sample.
c. Find the marginal distribution of the number of white balls in the sample.
3. Three players play 10 independent rounds of a game, and each player has probability 13 of winning each round. Find the joint distribution of the numbers of
games won by each of the three players.
4. A sieve is made of a square mesh of wires. Each wire has diameter d, and the holes
in the mesh are squares whose side length is w. A spherical particle of radius r is
dropped on the mesh. What is the probability that it passes through? What is the
probability that it fails to pass through if it is dropped n times? (Calculations such
as these are relevant to the theory of sieving for analyzing the size distribution of
particulate matter.)
5. (Buffon’s Needle Problem) A needle of length L is dropped randomly on a plane
ruled with parallel lines that are a distance D apart, where D ≥ L. Show that
the probability that the needle comes to rest crossing a line is 2L/(π D). Explain
how this gives a mechanical means of estimating the value of π .
108
Chapter 3
Joint Distributions
6. A point is chosen randomly in the interior of an ellipse:
y2
x2
+
=1
a2
b2
Find the marginal densities of the x and y coordinates of the point.
7. Find the joint and marginal densities corresponding to the cdf
F(x, y) = (1−e−αx )(1−e−βy ),
x ≥ 0,
y ≥ 0,
α > 0,
β>0
8. Let X and Y have the joint density
6
(x + y)2 ,
0 ≤ x ≤ 1,
0≤y≤1
7
a. By integrating over the appropriate regions, find (i) P(X > Y ),
(ii) P(X + Y ≤ 1), (iii) P(X ≤ 12 ).
b. Find the marginal densities of X and Y .
c. Find the two conditional densities.
f (x, y) =
9. Suppose that (X, Y ) is uniformly distributed over the region defined by 0 ≤ y ≤
1 − x 2 and −1 ≤ x ≤ 1.
a. Find the marginal densities of X and Y .
b. Find the two conditional densities.
10. A point is uniformly distributed in a unit sphere in three dimensions.
a. Find the marginal densities of the x, y, and z coordinates.
b. Find the joint density of the x and y coordinates.
c. Find the density of the x y coordinates conditional on Z = 0.
11. Let U1 , U2 , and U3 be independent random variables uniform on [0, 1]. Find the
probability that the roots of the quadratic U1 x 2 + U2 x + U3 are real.
12. Let
f (x, y) = c(x 2 − y 2 )e−x ,
0 ≤ x < ∞,
−x ≤ y < x
a. Find c.
b. Find the marginal densities.
c. Find the conditional densities.
13. A fair coin is thrown once; if it lands heads up, it is thrown a second time. Find
the frequency function of the total number of heads.
14. Suppose that
f (x, y) = xe−x(y+1) ,
0 ≤ x < ∞,
0≤y<∞
a. Find the marginal densities of X and Y . Are X and Y independent?
b. Find the conditional densities of X and Y .
15. Suppose that X and Y have the joint density function
f (x, y) = c 1 − x 2 − y 2 ,
x 2 + y2 ≤ 1
a. Find c.
3.8 Problems
109
b. Sketch the joint density.
c. Find P(X 2 + Y 2 ) ≤ 12 .
d. Find the marginal densities of X and Y . Are X and Y independent random
variables?
e. Find the conditional densities.
16. What is the probability density of the time between the arrival of the two packets
of Example E in Section 3.4?
17. Let (X, Y ) be a random point chosen uniformly on the region R = {(x, y) :
|x| + |y| ≤ 1}.
a. Sketch R.
b. Find the marginal densities of X and Y using your sketch. Be careful of the
range of integration.
c. Find the conditional density of Y given X .
18. Let X and Y have the joint density function
f (x, y) = k(x − y),
0≤y≤x ≤1
and 0 elsewhere.
a. Sketch the region over which the density is positive and use it in determining
limits of integration to answer the following questions.
b. Find k.
c. Find the marginal densities of X and Y .
d. Find the conditional densities of Y given X and X given Y .
19. Suppose that two components have independent exponentially distributed lifetimes, T1 and T2 , with parameters α and β, respectively. Find (a) P(T1 > T2 ) and
(b) P(T1 > 2T2 ).
20. If X 1 is uniform on [0, 1], and, conditional on X 1 , X 2 , is uniform on [0, X 1 ], find
the joint and marginal distributions of X 1 and X 2 .
21. An instrument is used to measure very small concentrations, X , of a certain
chemical in soil samples. Suppose that the values of X in those soils in which the
chemical is present is modeled as a random variable with density function f (x).
The assay of a soil reports a concentration only if the chemical is first determined
to be present. At very low concentrations, however, the chemical may fail to
be detected even if it is present. This phenomenon is modeled by assuming that
if the concentration is x, the chemical is detected with probability R(x). Let Y
denote the concentration of a chemical in a soil in which it has been determined
to be present. Show that the density function of Y is
g(y) =
∞
0
R(y) f (y)
R(x) f (x) d x
22. Consider a Poisson process on the real line, and denote by N (t1 , t2 ) the number
of events in the interval (t1 , t2 ). If t0 < t1 < t2 , find the conditional distribution of
N (t0 , t1 ) given that N (t0 , t2 ) = n. (Hint: Use the fact that the numbers of events
in disjoint subsets are independent.)
110
Chapter 3
Joint Distributions
23. Suppose that, conditional on N , X has a binomial distribution with N trials and
probability p of success, and that N is a binomial random variable with m trials
and probability r of success. Find the unconditional distribution of X .
24. Let P have a uniform distribution on [0, 1], and, conditional on P = p, let X
have a Bernoulli distribution with parameter p. Find the conditional distribution
of P given X .
25. Let X have the density function f , and let Y = X with probability 12 and Y = −X
with probability 12 . Show that the density of Y is symmetric about zero—that is,
f Y (y) = f Y (−y).
26. Spherical particles whose radii have the density function f R (r ) are dropped on a
mesh as in Problem 4. Find an expression for the density function of the particles
that pass through.
27. Prove that X and Y are independent if and only if f X |Y (x|y) = f X (x) for all x
and y.
28. Show that C(u, v) = uv is a copula. Why is it called “the independence copula”?
29. Use the Farlie-Morgenstern copula to construct a bivariate density whose marginal
densities are exponential. Find an expression for the joint density.
30. For 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1, show that C(u, v) = min(u 1−α v, uv 1−β ) is a
copula (the Marshall-Olkin copula). What is the joint density?
31. Suppose that (X, Y ) is uniform on the disk of radius 1 as in Example E of Section 3.3. Without doing any calculations, argue that X and Y are not independent.
32. Continuing Example E of Section 3.5.2, suppose you had to guess a value of θ .
One plausible guess would be the value of θ that maximizes the posterior density.
Find that value. Does the result make intuitive sense?
33. Suppose that, as in Example E of Section 3.5.2, your prior opinion that the coin
will land with heads up is represented by a uniform density on [0, 1]. You now
spin the coin repeatedly and record the number of times, N , until a heads comes
up. So if heads comes up on the first spin, N = 1, etc.
a. Find the posterior density of given N .
b. Do this with a newly minted penny and graph the posterior density.
34. This problem continues Example E of Section 3.5.2. In that example, the prior
opinion for the value of was represented by the uniform density. Suppose that
the prior density had been a beta density with parameters a = b = 3, reflecting
a stronger prior belief that the chance of a 1 was near 12 . Graph this prior density.
Following the reasoning of the example, find the posterior density, plot it, and
compare it to the posterior density shown in the example.
35. Find a newly minted penny. Place it on its edge and spin it 20 times. Following
Example E of Section 3.5.2, calculate and graph the posterior distribution. Spin
another 20 times, and calculate and graph the posterior based on all 40 spins.
What happens as you increase the number of spins?
3.8 Problems
111
36. Let f (x) = (1 + αx)/2, for −1 ≤ x ≤ 1 and −1 ≤ α ≤ 1.
a. Describe an algorithm to generate random variables from this density using
the rejection method.
b. Write a computer program to do so, and test it out.
37. Let f (x) = 6x 2 (1 − x)2 , for −1 ≤ x ≤ 1.
a. Describe an algorithm to generate random variables from this density using
the rejection method. In what proportion of the trials will the acceptance step
be taken?
b. Write a computer program to do so, and test it out.
38. Show that the number of iterations necessary to generate a random variable using
the rejection method is a geometric random variable, and evaluate the parameter
of the geometric frequency function. Show that in order to keep the number of
iterations small, M(x) should be chosen to be close to f (x).
39. Show that the following method of generating discrete random variables works
(D. R. Fredkin). Suppose, for concreteness, that X takes on values 0, 1, 2, . . . with
probabilities p0 , p1 , p2 , . . . . Let U be a uniform random variable. If U < p0 ,
return X = 0. If not, replace U by U − p0 , and if the new U is less than p1 ,
return X = 1. If not, decrement U by p1 , compare U to p2 , etc.
40. Suppose that X and Y are discrete random variables with a joint probability
mass function p X Y (x, y). Show that the following procedure generates a random
variable X ∼ p X |Y (x|y).
a. Generate X ∼ p X (x).
b. Accept X with probability p(y|X ).
c. If X is accepted, terminate and return X . Otherwise go to Step a.
Now suppose that X is uniformly distributed on the integers 1, 2, . . . , 100 and
that given X = x, Y is uniform on the integers 1, 2, . . . , x. You observe Y = 44.
What does this tell you about X ? Simulate the distribution of X , given Y = 44,
1000 times and make a histogram of the value obtained. How would you estimate
E(X |Y = 44)?
41. How could you extend the procedure of the previous problem in the case that X
and Y are continuous random variables?
42. a. Let T be an exponential random variable with parameter λ; let W be a random
variable independent of T , which is ±1 with probability 12 each; and let X =
W T . Show that the density of X is
f X (x) =
λ −λ|x|
e
2
which is called the double exponential density.
b. Show that for some constant c,
1
2
√ e−x /2 ≤ ce−|x|
2π
112
Chapter 3
Joint Distributions
Use this result and that of part (a) to show how to use the rejection method to
generate random variables from a standard normal density.
43. Let U1 and U2 be independent and uniform on [0, 1]. Find and sketch the density
function of S = U1 + U2 .
44. Let N1 and N2 be independent random variables following Poisson distributions
with parameters λ1 and λ2 . Show that the distribution of N = N1 + N2 is Poisson
with parameter λ1 + λ2 .
45. For a Poisson distribution, suppose that events are independently labeled A and
B with probabilities p A + p B = 1. If the parameter of the Poisson distribution is
λ, show that the number of events labeled A follows a Poisson distribution with
parameter p A λ.
46. Let X and Y be jointly continuous random variables. Find an expression for the
density of Z = X − Y .
47. Let X and Y be independent standard normal random variables. Find the density
of Z = X + Y , and show that Z is normally distributed as well. (Hint: Use the
technique of completing the square to help in evaluating the integral.)
48. Let T1 and T2 be independent exponentials with parameters λ1 and λ2 . Find the
density function of T1 + T2 .
49. Find the density function of X + Y , where X and Y have a joint density as given
in Example D in Section 3.3.
50. Suppose that X and Y are independent discrete random variables and each assumes the values 0, 1, and 2 with probability 13 each. Find the frequency function
of X + Y .
51. Let X and Y have the joint density function f (x, y), and let Z = X Y . Show that
the density function of Z is
∞ z
1
dy
f y,
f Z (z) =
y |y|
−∞
52. Find the density of the quotient of two independent uniform random variables.
53. Consider forming a random rectangle in two ways. Let U1 , U2 , and U3 be independent random variables uniform on [0, 1]. One rectangle has sides U1 and U2 ,
and the other is a square with sides U3 . Find the probability that the area of the
square is greater than the area of the other rectangle.
54. Let X , Y , and Z be independent N (0, σ 2 ). Let , , and R be the corresponding
random variables that are the spherical coordinates of (X, Y, Z ):
x = r sin φ cos θ
y = r sin φ sin θ
z = r cos φ
0 ≤ φ ≤ π,
0 ≤ θ ≤ 2π
3.8 Problems
113
Find the joint and marginal densities of , , and R. (Hint: d x dy dz = r 2 sin φ
dr dθ dφ.)
55. A point is generated on a unit disk in the following way: The radius, R, is uniform
on [0, 1], and the angle is uniform on [0, 2π ] and is independent of R.
a. Find the joint density of X = R cos and Y = R sin .
b. Find the marginal densities of X and Y .
c. Is the density uniform over the disk? If not, modify the method to produce a
uniform density.
56. If X and Y are independent exponential random variables, find the joint density
of the polar coordinates R and of the point (X, Y ). Are R and independent?
57. Suppose that Y1 and Y2 follow a bivariate normal
√ distribution with parameters
μY1 = μY2 = 0, σY21 = 1, σY22 = 2, and ρ = 1/ 2. Find a linear transformation
x1 = a11 y1 + a12 y2 , x2 = a21 y1 + a22 y2 such that x1 and x2 are independent
standard normal random variables. (Hint: See Example C of Section 3.6.2.)
58. Show that if the joint distribution of X 1 and X 2 is bivariate normal, then the joint
distribution of Y1 = a1 X 1 + b1 and Y2 = a2 X 2 + b2 is bivariate normal.
59. Let X 1 and X 2 be independent standard normal random variables. Show that the
joint distribution of
Y1 = a11 X 1 + a12 X 2 + b1
Y2 = a21 X 1 + a22 X 2 + b2
is bivariate normal.
60. Using the results of the previous problem, describe a method for generating pseudorandom variables that have a bivariate normal distribution from independent
pseudorandom uniform variables.
61. Let X and Y be jointly continuous random variables. Find an expression for the
joint density of U = a + bX and V = c + dY .
62. If X and Y are independent standard normal random variables, find P(X 2 +
Y 2 ≤ 1).
63. Let X and Y be jointly continuous random variables.
a. Develop an expression for the joint density of X + Y and X − Y .
b. Develop an expression for the joint density of X Y and Y / X .
c. Specialize the expressions from parts (a) and (b) to the case where X and Y
are independent.
64. Find the joint density of X + Y and X/Y , where X and Y are independent
exponential random variables with parameter λ. Show that X + Y and X/Y are
independent.
65. Suppose that a system’s components are connected in series and have lifetimes
that are independent exponential random variables with parameters λi . Show that
the lifetime of the system is exponential with parameter λi .
114
Chapter 3
Joint Distributions
66. Each component of the following system (Figure 3.19) has an independent exponentially distributed lifetime with parameter λ. Find the cdf and the density of
the system’s lifetime.
F I G U R E 3.19
67. A card contains n chips and has an error-correcting mechanism such that the card
still functions if a single chip fails but does not function if two or more chips fail.
If each chip has a lifetime that is an independent exponential with parameter λ,
find the density function of the card’s lifetime.
68. Suppose that a queue has n servers and that the length of time to complete a job
is an exponential random variable. If a job is at the top of the queue and will be
handled by the next available server, what is the distribution of the waiting time
until service? What is the distribution of the waiting time until service of the next
job in the queue?
69. Find the density of the minimum of n independent Weibull random variables,
each of which has the density
β
f (t) = βα −β t β−1 e−(t/α) ,
t ≥0
70. If five numbers are chosen at random in the interval [0, 1], what is the probability
that they all lie in the middle half of the interval?
71. Let X 1 , . . . , X n be independent random variables, each with the density function f. Find an expression for the probability that the interval (−∞, X (n) ]
encompasses at least 100ν% of the probability mass of f.
72. Let X 1 , X 2 , . . . , X n be independent continuous random variables each with cumulative distribution function F. Show that the joint cdf of X (1) and X (n) is
F(x, y) = F n (y) − [F(y) − F(x)]n ,
x≤y
73. If X 1 , . . . , X n are independent random variables, each with the density function
f , show that the joint density of X (1) , . . . , X (n) is
n! f (x1 ) f (x2 ) · · · f (xn ),
x1 < x2 < · · · < xn
74. Let U1 , U2 , and U3 be independent uniform random variables.
a. Find the joint density of U(1) , U(2) , and U(3) .
b. The locations of three gas stations are independently and randomly placed
along a mile of highway. What is the probability that no two gas stations are
less than 13 mile apart?
3.8 Problems
115
75. Use the differential method to find the joint density of X (i) and X ( j) , where i < j.
76. Prove Theorem A of Section 3.7 by finding the cdf of X (k) and differentiating.
(Hint: X (k) ≤ x if and only if k or more of the X i are less than or equal to x. The
number of X i less than or equal to x is a binomial random variable.)
77. Find the density of U(k) − U(k−1) if the Ui , i = 1, . . . , n are independent uniform
random variables. This is the density of the spacing between adjacent points
chosen uniformly in the interval [0, 1].
78. Show that
1
y
1
(n
+
1)(n
+ 2)
0
0
79. If T1 and T2 are independent exponential random variables, find the density
function of R = T(2) − T(1) .
(y − x)n d x d y =
80. Let U1 , . . . , Un be independent uniform random variables, and let V be uniform
and independent of the Ui .
a. Find P(V ≤ U(n) ).
b. Find P(U(1) < V < U(n) ).
81. Do both parts of Problem 80 again, assuming that the Ui and V have the density
function f and the cdf F, with F −1 uniquely defined. Hint: F(Ui ) has a uniform
distribution.
CHAPTER
4
Expected Values
4.1 The Expected Value of a Random Variable
The concept of the expected value of a random variable parallels the notion of a
weighted average. The possible values of the random variable are weighted by their
probabilities, as specified in the following definition.
DEFINITION
If X is a discrete random variable with frequency function p(x), the expected
value of X , denoted by E(X ), is
xi p(xi )
E(X ) =
i
provided that
fined.
i |x i | p(x i ) < ∞. If the sum diverges, the expectation is unde-
■
E(X ) is also referred to as the mean of X and is often denoted by μ or μ X .
It might be helpful to think of the expected value of X as the center of mass of the
frequency function. Imagine placing the masses p(xi ) at the points xi on a beam; the
balance point of the beam is the expected value of X .
EXAMPLE A
116
Roulette
A roulette wheel has the numbers 1 through 36, as well as 0 and 00. If you bet $1
that an odd number comes up, you win or lose $1 according to whether that event
occurs. If X denotes your net gain, X = 1 with probability 18
and X = −1 with
38
4.1 The Expected Value of a Random Variable
probability
20
.
38
117
The expected value of X is
E(X ) = 1 ×
20
1
18
+ (−1) ×
=−
38
38
19
Thus, your expected loss is about $.05. In Chapter 5, it will be shown that this coincides
in the limit with the actual average loss per game if you play a long sequence of
■
independent games.
EXAMPLE B
Expectation of a Geometric Random Variable
Suppose that items produced in a plant are independently defective with probability
p. Items are inspected one by one until a defective item is found. On the average, how
many items must be inspected?
The number of items inspected, X , is a geometric random variable, with P(X =
k) = q k−1 p, where q = 1 − p. Therefore,
E(X ) =
∞
kpq k−1 = p
k=1
∞
kq k−1
k=1
d k
q , we interchange the operWe use a trick to calculate the sum. Since kq k−1 =
dq
ations of summation and differentiation to obtain
∞
d k
d
q
E(X ) = p
q =p
dq k=1
dq 1 − q
=
1
p
=
(1 − q)2
p
It can be shown that the interchange of differentiation and summation is justified.
Thus, for example, if 10% of the items are defective, an average of 10 items must be
■
examined to find one that is defective, as might have been guessed.
EXAMPLE C
Poisson Distribution
The expected value of a Poisson random variable is
∞
kλk −λ
e
E(X ) =
k!
k=0
= λe−λ
∞
k=1
= λe−λ
λk−1
(k − 1)!
∞
λj
j!
j=0
j
λ
Since ∞
j=0 (λ /j!) = e , we have E(X ) = λ. The parameter λ of the Poisson
■
distribution can thus be interpreted as the average count.
118
Chapter 4
EXAMPLE D
Expected Values
St. Petersburg Paradox
A gambler has the following strategy for playing a sequence of games: He starts off
betting $1; if he loses, he doubles his bet; and he continues to double his bet until he
finally wins. To analyze this scheme, suppose that the game is fair and that he wins
or loses the amount he bets. At trial 0, he bets $1; if he loses, he bets $2 at trial 1;
and if he has not won by the kth trial, he bets $2k . When he finally wins, he will be
$1 ahead, which can be checked by going through the scheme for the first few values
of k. This seems like a foolproof way to win $1. What could be wrong with it?
Let X denote the amount of money bet on the very last game (the game he wins).
Because the probability that k losses are followed by one win is 2−(k+1) ,
P(X = 2k ) =
1
2k+1
and
E(X ) =
∞
n P(X = n)
n=0
=
∞
2k
k=0
1
2k+1
=∞
Formally, E(X ) is not defined. Practically, the analysis shows a flaw in this scheme,
which is that it does not take into account the enormous amount of capital
■
required.
The definition of expectation for a continuous random variable is a fairly obvious
extension of the discrete case—summation is replaced by integration.
DEFINITION
If X is a continuous random variable with density f (x), then
∞
x f (x) d x
E(X ) =
−∞
provided that
defined.
|x| f (x)d x < ∞. If the integral diverges, the expectation is un■
Again E(X ) can be regarded as the center of mass of the density. We next consider
some examples.
EXAMPLE E
Gamma Density
If X follows a gamma density with parameters α and λ,
∞
λα α −λx
E(X ) =
x e
dx
(α)
0
4.1 The Expected Value of a Random Variable
119
This integral is easy to evaluate once we realize that λα+1 x α e−λx / (α +1) is a gamma
density and therefore integrates to 1. We thus have
∞
(α + 1)
x α e−λx d x =
λα+1
0
from which it follows that
λα
E(X ) =
(α)
(α + 1)
λα+1
Finally, using the relation (α + 1) = α (α), we find
α
E(X ) =
λ
For the exponential density, α = 1, so E(X ) = 1/λ. This may be contrasted to
the median of the exponential density, which was found in Section 2.2.1 to be log 2/λ.
The mean and the median can both be interpreted as “typical” values of X , but they
measure different attributes of the probability distribution.
■
EXAMPLE F
Normal Distribution
From the definition of the expectation, we have
∞
1 (x−μ)2
1
E(X ) = √
xe− 2 σ 2 d x
σ 2π −∞
Making the change of variables z = x − μ changes this equation to
∞
∞
1
μ
2
2
2
2
E(X ) = √
ze−z /2σ dz + √
e−z /2σ dz
σ 2π −∞
σ 2π −∞
The first integral is 0 since the contributions from z < 0 cancel those from z > 0,
and the second integral is μ because the normal density integrates to 1. Thus,
E(X ) = μ
The parameter μ of the normal density is the expectation, or mean value. We could
have made the derivation much shorter by claiming that it was “obvious” that since
■
the center of symmetry of the density is μ, the expectation must be μ.
EXAMPLE G
Cauchy Density
Recall that the Cauchy density is
1
1
f (x) =
π 1 + x2
,
−∞ < x < ∞
The density is symmetric about zero, so it would seem that E(X ) = 0. However,
∞
|x|
=∞
2
−∞ 1 + x
120
Chapter 4
Expected Values
Therefore, the expectation does not exist. The reason that it fails to exist is, that the
density decreases so slowly that very large values of X can occur with substantial
■
probability.
The expected value can be interpreted as a long-run average. In Chapter 5, it will
be shown that if E(X ) exists and if X 1 , X 2 , . . . is a sequence of independent random
n
X i , then, as n → ∞,
variables with the same distribution as X , and if Sn = i=1
Sn
→ E(X )
n
This statement will be made more precise in Chapter 5. For now, a simple empirical
demonstration will be sufficient.
Using a pseudorandom number generator, a sequence X 1 , X 2 , . . . of independent
standard normal random variables was generated, as well as a sequence Y1 , Y2 , . . .
of independent Cauchy random variables. Figure 4.1 shows the graphs of
G(n) =
n
1 Xi
n i=1
and C(n) =
n
1 Yi
n i=1
n = 1, 2, . . . , 500
Note how G(n) appears to be tending to a limit, whereas C(n) does not.
■
G (n )
.8
.4
0
(a)
C (n )
EXAMPLE H
4
3
2
1
0
1
2
0
(b)
100
200
300
400
500
n
F I G U R E 4.1 The average of n independent random variables as a function of n for
(a) normal random variables and (b) Cauchy random variables.
4.1 The Expected Value of a Random Variable
121
We conclude this section with a simple result that is of great utility in probability
theory.
THEOREM A Markov’s Inequality
If X is a random variable with P(X ≥ 0) = 1 and for which E(X ) exists, then
P(X ≥ t) ≤ E(X )/t.
Proof
We will prove this for the discrete case; the continuous case is entirely analogous.
x p(x)
E(X ) =
x
=
x p(x) +
x<t
x p(x)
x≥t
All the terms in the sums are nonnegative because X takes on only nonnegative
values. Thus
E(X ) ≥
x p(x)
x≥t
≥
t p(x) = t P(X ≥ t)
■
x≥t
This result says that the probability that X is much bigger than E(X ) is small.
Suppose that in the theorem, we let t = k E(X ); then according to the result, P(X >
k E(X )) ≤ k −1 . This holds for any nonnegative random variable, regardless of its
probability distribution.
4.1.1 Expectations of Functions of Random Variables
We often need to find E[g(X )], where X is a random variable and g is a fixed function.
For example, according to the kinetic theory of gases, the magnitude of the velocity
of a gas molecule is random and its probability density is given by
√
f X (x) =
2/π 2 − 12 x 22
x e σ
σ3
(This is Maxwell’s distribution: the parameter σ depends on the temperature of the
gas.) From this density, we can find the average velocity, but suppose that we are
interested in finding the average kinetic energy Y = 12 m X 2 , where m is the mass of
the molecule. The straightforward way to do this would seem to be the following: Let
Y = g(X ); find the density or frequency function of Y , say, f Y ; and then compute
E(Y ) from the definition. It turns out, fortunately, that the process is much simpler
than that.
122
Chapter 4
Expected Values
THEOREM A
Suppose that Y = g(X ).
a. If X is discrete with frequency function p(x), then
E(Y ) =
g(x) p(x)
x
provided that |g(x)| p(x) < ∞.
b. If X is continuous with density function f (x), then
∞
E(Y ) =
g(x) f (x) d x
−∞
provided that
|g(x)| f (x) d x < ∞.
Proof
We will prove this result for the discrete case. The basic argument is the same
for the continuous case, but making that proof rigorous requires some advanced
theory of integration. By definition,
yi pY (yi )
E(Y ) =
i
Let Ai denote the set of x’s mapped to yi by g; that is, x ∈ Ai if g(x) = yi . Then
p(x)
pY (yi ) =
x ∈ Ai
and
E(Y ) =
i
=
=
=
yi
i
x ∈ Ai
i
x ∈ Ai
p(x)
x ∈ Ai
yi p(x)
g(x) p(x)
g(x) p(x)
x
This last step follows because the Ai are disjoint and every x belongs to
■
some Ai .
It is worth pointing out explicitly that E[g(X )] = g[E(X )]; that is, the average
value of the function is not equal to the function of the average value. Suppose, for
example, that X takes on values 1 and 2, each with probability 12 , so E(X ) = 32 . Let
Y = 1/ X . Then E(Y ) is clearly 1 × .5 + .5 × .5 = .75, but 1/E(X ) = 23 .
4.1 The Expected Value of a Random Variable
EXAMPLE A
123
Let us now apply Theorem A to find the average kinetic energy of a gas molecule.
∞
E(Y ) =
0
m
=
2
1 2
mx f X (x) d x
2
√
∞
1 x2
2/π
4 − 2 σ2
x
e
dx
σ3
0
To evaluate the integral, we make the change of variables u = x 2 /2σ 2 to reduce
it to
∞
5
2mσ 2
2mσ 2
√
u 3/2 e−u du = √
π
π
2
0
Finally, using the facts ( 12 ) =
√
π and (α + 1) = α (α), we have
E(Y ) = 32 mσ 2
■
Now suppose that Y = g(X 1 , . . . , X n ), where X i have a joint distribution, and
that we want to find E(Y ). We do not have to find the density or frequency function
of Y , which again could be a formidable task.
THEOREM B
Suppose that X 1 , . . . , X n are jointly distributed random variables and Y =
g(X 1 , . . . , X n ).
a. If the X i are discrete with frequency function p(x1 , . . . , xn ), then
E(Y ) =
g(x1 , . . . , xn ) p(x1 , . . . , xn )
x1 ,...,xn
provided that x1 ,...,xn |g(x1 , . . . , xn )| p(x1 , . . . , xn ) < ∞.
b. If the X i are continuous with joint density function f (x1 , . . . , xn ), then
E(Y ) =
· · · g(x1 , . . . , xn ) f (x1 , . . . , xn )d x1 · · · d xn
provided that the integral with |g| in place of g converges.
Proof
The proof is similar to that of Theorem A.
■
124
Chapter 4
EXAMPLE B
Expected Values
A stick of unit length is broken randomly in two places. What is the average length
of the middle piece?
We interpret this question to mean that the locations of the two break points are
independent uniform random variables U1 and U2 . Therefore, we need to compute
E|U1 − U2 |. Theorem B tells us that we do not need to find the density function of
|U1 − U2 | and that we can just integrate |u 1 − u 2 | against the joint density of U1 and
U2 , f (u 1 , u 2 ) = 1, 0 ≤ u 1 ≤ 1, 0 ≤ U2 ≤ 1. Thus,
1
E|U1 − U2 | =
1
|u 1 − u 2 | du 1 du 2
0
0
1
=
0
0
u1
1
1
(u 1 − u 2 ) du 2 du 1 +
(u 2 − u 1 ) du 2 du 1
0
u1
With some care, we find the expectation to be 13 . This is in accord with the intuitive argument that the smaller of U1 and U2 should be 13 on the average and the
larger should be 23 on the average, which means that the average difference should
be 13 .
■
We note the following immediate consequence of Theorem B.
COROLLARY A
If X and Y are independent random variables and g and h are fixed functions,
then E[g(X )h(Y )] = {E[g(X )]}{E[h(Y )]}, provided that the expectations on
■
the right-hand side exist.
In particular, if X and Y are independent, E(X Y ) = E(X )E(Y ). The proof of
this corollary is left to Problem 29 of the end-of-chapter problems.
4.1.2 Expectations of Linear Combinations
of Random Variables
One of the most useful properties of the expectation is that it is a linear operation.
Suppose that you were told that the average temperature on July 1 in a certain location
was 70◦ F, and you were asked what the average temperature in degrees Celsius was.
You can simply convert to degrees Celsius and obtain 59 × 70 − 17.7 = 21.2◦ C.
The notion of the average value of a random variable, which we have defined as the
expected value of a random variable, behaves in the same fashion. If Y = a X +b, then
E(Y ) = a E(X ) + b. More generally, this property extends to linear combinations of
random variables.
4.1 The Expected Value of a Random Variable
125
THEOREM A
If X 1 , . . . , X n are jointly distributed random variables with expectations E(X i )
n
bi X i , then
and Y is a linear function of the X i , Y = a + i=1
E(Y ) = a +
n
bi E(X i )
i=1
Proof
We will prove this for the continuous case. The proof in the discrete case is parallel
and is left to Problem 24 at the end of this chapter. For notational simplicity, we
take n = 2. From Theorem B of Section 4.1.1, we have
E(Y ) =
(a + b1 x1 + b2 x2 ) f (x1 , x2 ) d x1 d x2
=a
f (x1 , x2 ) d x1 d x2 + b1
x1 f (x1 , x2 ) d x1 d x2
+ b2
x2 f (x1 , x2 ) d x1 d x2
The first double integral of the last expression is merely the integral of the bivariate
density, which is equal to 1. The second double integral can be evaluated as
follows:
f (x1 , x2 ) d x2 d x1
x1 f (x1 , x2 ) d x1 d x2 = x1
=
x1 f x1 (x1 ) d x1
= E(X 1 )
A similar evaluation for the third double integral brings us to
E(Y ) = a + b1 E(X 1 ) + b2 E(X 2 )
This proves the theorem once we check that the expectation is well defined, or
that
|a + b1 x1 + b2 x2 | f (x1 , x2 ) d x1 d x2 < ∞
This can be verified using the inequality
|a + b1 x1 + b2 x2 | ≤ |a| + |b1 ||x1 | + |b2 ||x2 |
and the assumption that the E(X i ) exist.
■
Theorem A is extremely useful. We will illustrate its utility with several examples.
126
Chapter 4
EXAMPLE A
Expected Values
Suppose that we wish to find the expectation of a binomial random variable, Y . From
the binomial frequency function,
n n
E(Y ) =
kp k (1 − p)n−k
k
k=0
It is not immediately obvious how to evaluate this sum. We can, however, represent
Y as the sum of Bernoulli random variables, X i , which equal 1 or 0 depending on
whether there is success or failure on the ith trial,
n
Y =
Xi
i=1
Because E(X i ) = 0 × (1 − p) + 1 × p = p, it follows immediately that E(Y ) = np.
An application of the binomial distribution and its expectation occurs in “shotgun
sequencing” in genomics, a method of trying to figure out the sequence of letters that
make up a long segment of DNA. It is technically too difficult to sequence the entire
segment at once if it is very long. The basic idea of shotgun sequencing is to chop
the DNA randomly into many small fragments, sequence each fragment, and then
somehow assemble the fragments into one long “contig.” The hope is that if there are
many fragments, their overlaps can be used to assemble the contig.
Suppose, then, that the length of the DNA sequence is G and that there are N
fragments each of length L. G might be at least 100,000 and L about 500. Assume that
the left end of each fragment is equally likely to be at positions 1, 2, . . . , G − L + 1.
What is the probability that a particular location x ∈ {L , L + 1, . . . , G} is covered
by at least one fragment? How many fragments are expected to cover a particular
location? (The positions {1, 2, . . . , L − 1} are not included in this discussion because
the boundary effect makes them a little different; for example, the only fragment
that covers position 1 has its left end at position 1.) To answer these questions, first
consider a single fragment. The chance that it covers x equals the chance that its
left end is in one of the L locations {x − L + 1, x − L , . . . , x}, and because the
location of the left end is uniform, this probability is
L
L
≈
G−L +1
G
where the approximation holds because G L. Thus, the distribution of W , the
number of fragments that cover a particular location, is binomial with parameters N
and p.
From the binomial probability formula, the chance of coverage is
L N
P(W > 0) = 1 − P(W = 0) = 1 − 1 −
G
p=
Since N is large and p is small, the distribution of W is nearly Poisson with parameter
λ = N p = N L/G. From the Poisson probability formula, P(W = 0) ≈ e−N L/G ,
so the probability that a particular location is covered is approximately 1 − e−N L/G .
Observe that N L is the total length of all the fragments; the ratio N L/G is called the
coverage. Calculations of this kind are thus useful in deciding how many fragments
to use. If the coverage is 8, for example, the chance that a site is covered is .9997.
4.1 The Expected Value of a Random Variable
127
Overlap of fragments is important when trying to assemble them. Since W is a
binomial random variable, the expected number of fragments that cover a given site
is N p = N L/G, precisely the coverage.
We can also now answer this closely related question: How many sites do we
expect to be entirely missed? We will calculate this using indicator random variables:
let Ix equal 1 if site x is missed and 0 elsewhere. Then
E(Ix ) = 1 × P(Ix = 1) + 0 × P(Ix = 0) = e−N L/G .
The number of sites that are not covered is
G
V =
Ix
x=1
and from the linearity of expectation
E(V ) =
G
E(Ix ) ≈ Ge−N L/G .
x=L
The length of the human genome is approximately G = 3 × 109 , so with eight times
■
coverage, we would expect about a million sites to be missed.
EXAMPLE B
Coupon Collection
Suppose that you collect coupons, that there are n distinct types of coupons, and that
on each trial you are equally likely to get a coupon of any of the types. How many trials
would you expect to go through until you had a complete set of coupons? (This might
be a model for collecting baseball cards or for certain grocery store promotions.)
The solution of this problem is greatly simplified by representing the number of
trials as a sum. Let X 1 be the number of trials up to and including the trial on which
the first coupon is collected: X 1 = 1. Let X 2 be the number of trials from that point up
to and including the trial on which the next coupon different from the first is obtained;
let X 3 be the number of trials from that point up to and including the trial on which
the third distinct coupon is collected; and so on, up to X n . Then the total number of
trials, X , is the sum of the X i , i = 1, 2, . . . , n.
We now find the distribution of X r . At this point, r − 1 of n coupons have been
collected, so on each trial the probability of success is (n − r + 1)/n. Therefore, X r
is a geometric random variable, with E(X r ) = n/(n − r + 1). (See Example B of
Section 4.1.) Thus,
n
E(X r )
E(X ) =
r =1
n
n
n
n
+
+ ··· +
= +
n
n−1 n−2
1
n
1
=n
r
r =1
For example, if there are 10 types of coupons, the expected number of trials necessary
to obtain at least one of each kind is 29.3.
128
Chapter 4
Expected Values
Finally, we note the following famous approximation:
n
1
= log n + γ + εn
r
r =1
where log is the natural log or loge (unless otherwise specified, log means natural
log throughout this text), γ is Euler’s constant, γ = .57. . . , and εn approaches zero
as n approaches infinity. Using this approximation for n = 10, we find that the
approximate expected number of trials is 28.8. Generally, we see that the expected
■
number of trials grows at the rate n log n, or slightly faster than n.
EXAMPLE C
Group Testing
Suppose that a large number, n, of blood samples are to be screened for a relatively
rare disease. If each sample is assayed individually, n tests will be required. On the
other hand, if each sample is divided in half and one of the halves is put into a pool
with all the other halves, the pooled lot can be tested. Then, provided that the test
method is sensitive enough, if this test is negative, no further assays are necessary
and only one test has to be performed. If the test on the pooled blood is positive, each
reserved half-sample can be tested individually. In this case, a total of n + 1 tests
will be required. It is therefore plausible, assuming that the disease is rare, that some
savings can be achieved through this pooling procedure.
To analyze this more quantitatively, let us first generalize the scheme and suppose
that the n samples are first grouped into m groups of k samples each, or n = mk.
Each group is then tested; if a group tests positively, each individual in the group is
tested. If X i is the number of tests run on the ith group, the total number of tests run
m
X i , and the expected total number of tests is
is N = i=1
E(N ) =
m
E(X i )
i=1
Let us find E(X i ). If the probability of a negative on any individual sample is p, then
the X i take on the value 1 with probability p k or the value k + 1 with probability
1 − p k . Thus,
E(X i ) = p k + (k + 1)(1 − p k )
= k + 1 − kp k
We now have
1
E(N ) = m(k + 1) − mkp k = n 1 + − p k
k
Recalling that n tests are necessary with no pooling, we see that the factor (1 + 1/k −
p k ) is the average number of samples used in group testing as a proportion of n.
Figure 4.2 shows this proportion as a function of k for p = .99. From the figure, we
4.1 The Expected Value of a Random Variable
129
1.2
1.0
Proportion
.8
.6
.4
.2
0
0
5
10
k
15
20
F I G U R E 4.2 The proportion of n in the average number of samples tested using
group testing as a function of k.
see that for group testing with a group size of about 10, only 20% of the number of
■
tests used with the straightforward method are needed on the average.
EXAMPLE D
Counting Word Occurrences in DNA Sequences
Here we consider another example from genomics, and one that again illustrates
the power of using indicator random variables. In searching for patterns in DNA
sequences, there might be reason to expect that a “word” such as TATA would occur
more frequently than expected in a random sequence. Or suppose we want to identify
regions of a DNA sequence in which the occurrence of the word is unusually large.
To quantify these ideas, we need to specify the meaning of random. In this example,
we will take it to mean that the sequence is randomly composed of the letters A,C,G,
and T in the sense that the letters at sites are independent and, at every site, each letter
has probability 14 .
We also need to be careful to specify how we count. Consider the following
sequence
ACTATATAGATATA
We will count overlaps, so in the preceding sequence, TATA occurs three times. Now
suppose that the sequence is of length N and that the word is of length q. Let In
be an indicator random variable taking
qon the value 1 if the word begins at position
n and 0 otherwise: P(In = 1) = 14 from the assumption of independence and
E(In ) = P(In = 1). Now the total number of times the word occurs is
N −q+1
W =
n=1
In
130
Chapter 4
Expected Values
and
N −q+1
E(W ) =
E(In ) = (N − q + 1)
n=1
1
4
q
Note that the In are not independent—for example, in the case of the word TATA,
if I1 = 1, then I2 = 0. Thus W is not a binomial random variable. But despite the
lack of independence, we can find E(W ) by expressing W as a linear combination of
■
indicator variables.
EXAMPLE E
Investment Portfolios
An investor plans to apportion an amount of capital, C0 , between two investments,
placing a fraction π, 0 ≤ π ≤ 1, in one investment and a fraction 1 − π in the
other for a fixed period of time. Denoting the returns (final value divided by initial
value) on the investments by R1 and R2 , her capital at the end of the period will be
C1 = πC0 R1 + (1 − π)C0 R2 . Her return will then be
R=
C1
C0
= π R1 + (1 − π )R2
Suppose that the returns are unknown, as would be the case if they were stocks, for
example, and that they are hence modeled as random variables, with expected values
E(R1 ) and E(R2 ). Then her expected return is
E(R) = π E(R1 ) + (1 − π )E(R2 )
How should she choose π? A simple solution would apparently be to choose π = 1
if E(R1 ) > E(R2 ) and π = 0 otherwise. But there is more to the story as we will see
■
later.
4.2 Variance and Standard Deviation
The expected value of a random variable is its average value and can be viewed as
an indication of the central value of the density or frequency function. The expected
value is therefore sometimes referred to as a location parameter. The median of
a distribution is also a location parameter, one that does not necessarily equal the
mean. This section introduces another parameter, the standard deviation of a random
variable, which is an indication of how dispersed the probability distribution is about
its center, of how spread out on the average are the values of the random variable about
its expectation. We first define the variance of a random variable and then define the
standard deviation in terms of the variance.
4.2 Variance and Standard Deviation
131
DEFINITION
If X is a random variable with expected value E(X ), the variance of X is
Var(X ) = E{[X − E(X )]2 }
provided that the expectation exists. The standard deviation of X is the square
root of the variance.
■
If X is a discrete random variable with frequency function p(x) and expected
value μ = E(X ), then according to the definition and Theorem A of Section 4.1.1,
(xi − μ)2 p(xi )
Var(X ) =
i
whereas if X is a continuous random variable with density function f (x) and
E(X ) = μ
∞
(x − μ)2 f (x) d x
Var(X ) =
−∞
The variance is often denoted by σ 2 and the standard deviation by σ . From
the preceding definition, the variance of X is the average value of the squared
deviation of X from its mean. If X has units of meters, for example, the variance has units of meters squared, and the standard deviation has units of meters.
Although we are often interested ultimately in the standard deviation rather than
the variance, it is usually easier to find the variance first and then take the square
root.
The variance of a random variable changes in a simple way under linear transformations.
THEOREM A
If Var(X ) exists and Y = a + bX , then Var(Y ) = b2 Var(X ).
Proof
Since E(Y ) = a + bE(X ),
E[(Y − E(Y ))2 ] = E{[a + bX − a − bE(X )]2 }
= E{b2 [X − E(X )]2 }
= b2 E{[X − E(X )]2 }
= b2 Var(X )
■
This result seems reasonable once you realize that the addition of a constant does
not affect the variance, since the variance is a measure of the spread around a center
and the center has merely been shifted.
132
Chapter 4
Expected Values
The standard deviation transforms in a natural way: σY = |b|σ X . Thus, if the units
of measurement are changed from meters to centimeters, for example, the standard
deviation is simply multiplied by 100.
EXAMPLE A
Bernoulli Distribution
If X has a Bernoulli distribution—that is, X takes on values 0 and 1 with probability
1 − p and p, respectively—then we have seen (Example A of Section 4.1.2) that
E(X ) = p. By the definition of variance,
Var(X ) = (0 − p)2 × (1 − p) + (1 − p)2 × p
= p2 − p3 + p − 2 p2 + p3
= p(1 − p)
Note that the expression p(1 − p) is a quadratic with a maximum at p = 12 . If p
is 0 or 1, the variance is 0, which makes sense since the probability distribution is
concentrated at a single point and the random variable is not variable at all. The
■
distribution is most dispersed when p = 12 .
EXAMPLE B
Normal Distribution
We have seen that E(X ) = μ. Then
Var(X ) = E[(X − μ)2 ] =
1
√
σ 2π
∞
−∞
1 (x−μ)2
σ2
(x − μ)2 e− 2
dx
Making the change of variables z = (x − μ)/σ changes the right-hand side to
σ2
√
2π
∞
z 2 e−z
2 /2
dz
−∞
Finally, making the change of variables u = z 2 /2 reduces the integral to a gamma
■
function, and we find that Var(X ) = σ 2 .
The following theorem gives an alternative way of calculating the variance.
THEOREM B
The variance of X , if it exists, may also be calculated as follows:
Var(X ) = E(X 2 ) − [E(X )]2
4.2 Variance and Standard Deviation
133
Proof
Denote E(X ) by μ.
Var(X ) = E[(X − μ)2 ]
= E(X 2 − 2μX + μ2 )
By the linearity of the expectation, this becomes
Var(X ) = E(X 2 ) − 2μE(X ) + μ2
= E(X 2 ) − 2μ2 + μ2
= E(X 2 ) − μ2
■
as was to be shown.
According to Theorem B, the variance of X can be found in two steps: First find
E(X ), and then find E(X 2 ).
EXAMPLE C
Uniform Distribution
Let us apply Theorem B to find the variance of a random variable that is uniform on
[0, 1]. We know that E(X ) = 12 ; next we need to find E(X 2 ):
1
1
x2 dx =
E(X 2 ) =
3
0
We thus have
2
1
1
1
=
Var(X ) = −
■
3
2
12
It was stated earlier that the variance or standard deviation of a random variable
gives an indication as to how spread out its possible values are. A famous inequality,
Chebyshev’s inequality, lends a quantitative aspect to this indication.
THEOREM C Chebyshev’s Inequality
Let X be a random variable with mean μ and variance σ 2 . Then, for any t > 0,
P(|X − μ| > t) ≤
σ2
t2
Proof
Let Y = (X −μ)2 . Then E(Y ) = σ 2 , and the result follows by applying Markov’s
inequality to Y .
■
134
Chapter 4
Expected Values
Theorem C says that if σ 2 is very small, there is a high probability that X will
not deviate much from μ. For another interpretation, we can set t = kσ so that the
inequality becomes
1
k2
For example, the probability that X is more than 4σ away from μ is less than or
1
equal to 16
. These results hold for any random variable with any distribution provided
the variance exists. In particular cases, the bounds are often much narrower. For
example, if X is normally distributed, we find from tables of the normal distribution
that P(|X − μ| > 2σ ) = .05 (compared to 14 obtained from Chebyshev’s inequality).
Chebyshev’s inequality has the following consequence.
P(|X − μ| ≥ kσ ) ≤
COROLLARY A
If Var(X ) = 0, then P(X = μ) = 1.
Proof
We will give a proof by contradiction. Suppose that P(X = μ) < 1. Then, for
some ε > 0, P(|X − μ| ≥ ε) > 0. However, by Chebyshev’s inequality, for any
ε > 0,
P(|X − μ| ≥ ε) = 0
EXAMPLE D
■
Investment Portfolios
We continue Example E in Section 4.1.2. Suppose that one of the two investments
is risky and the other is risk free. The first might be a stock and the other an insured
savings account. The stock has a return R1 , which is modeled as a random variable with
expectation μ1 = 0.10 and standard deviation σ1 = 0.075. The standard deviation is
a measure of risk—a large standard deviation means that the returns fluctuate a lot
so that the investor might be lucky and get a large return, but might also be unlucky
and lose a lot. Suppose that the savings account has a certain return R2 = 0.03. The
expected value of this return is μ2 = 0.03 and its standard deviation is 0—it is risk
free. If the investor places a fraction π1 in the stock and a fraction π2 = 1 − π1 in the
savings account, her return is
R = π1 R1 + (1 − π1 )R2
and her expected return is
E(R) = π1 μ1 + (1 − π1 )μ2
Since μ1 > μ2 , her expected return is maximized by π1 = 1, putting all her money
in the stock. However, this point of view is too narrow; it does not take into account
the risk of the stock. By Theorem A
Var(R) = π12 σ12
4.2 Variance and Standard Deviation
135
and the standard deviation of the return is σ R = π1 σ1 . The larger π1 , the larger
the expected return, but also the larger the risk. In choosing π1 , the investor has to
balance the risk she is willing to take against the expected gain; the desired balance
will be different for different investors. If she is risk averse, she will choose a small
value of π1 , being leery of volatile investments. By tracing out the expected return
and the standard deviation as functions of π1 , she can strike a balance with which she
■
is comfortable.
4.2.1 A Model for Measurement Error
Values of physical constants are not precisely known but must be determined by
experimental procedures. Such seemingly simple operations as weighting an object,
determining a voltage, or measuring an interval of time are actually quite complicated
when all the details and possible sources of error are taken into account. The National
Institute of Standards and Technology (NIST) in the United States and similar agencies in other countries are charged with developing and maintaining measurement
standards. Such agencies employ probabilists and statisticians as well as physical
scientists in this endeavor.
A distinction is usually made between random and systematic measurement
errors. A sequence of repeated independent measurements made with no deliberate
change in the apparatus or experimental procedure may not yield identical values,
and the uncontrollable fluctuations are often modeled as random. At the same time,
there may be errors that have the same effect on every measurement; equipment may
be slightly out of calibration, for example, or there may be errors associated with the
theory underlying the method of measurement. If the true value of the quantity being
measured is denoted by x0 , the measurement, X , is modeled as
X = x0 + β + ε
where β is the constant, or systematic, error and ε is the random component of the
error; ε is a random variable with E(ε) = 0 and Var(ε) = σ 2 . We then have
E(X ) = x0 + β
and
Var(X ) = σ 2
β is often called the bias of the measurement procedure. The two factors affecting the
size of the error are the bias and the size of the variance, σ 2 . A perfect measurement
would have β = 0 and σ 2 = 0.
EXAMPLE A
Measurement of the Gravity Constant
This and the next example are taken from an interesting and readable paper by Youden
(1972), a statistician at NIST. Measurement of the acceleration due to gravity at Ottawa
was done 32 times with each of two different methods (Preston-Thomas et al. 1960).
The results are displayed as histograms in Figure 4.3. There is clearly some systematic
difference between the two methods as well as some variation within each method. It
Expected Values
980.616
980.615
Mean 980.6139 cm/sec2
Standard deviation 0.9 mgal
Maximum spread 4.1 mgal
980.614
Aug 1958
32 Drops
Rule No. 1
980.613
Mean 980.6124 cm/sec2
Standard deviation 0.6 mgal
Maximum spread 2.9 mgal
980.612
Dec 1959
32 Drops
Rule No. 2
980.611
Chapter 4
980.610
136
cm/sec2
F I G U R E 4.3 Histograms of two sets of measurements of the acceleration due to
gravity.
appears that the two biases are unequal. The results from Rule 1 are more scattered
■
than those of Rule 2, and their standard deviation is larger.
An overall measure of the size of the measurement error that is often used is the
mean squared error, which is defined as
MSE = E[(X − x0 )2 ]
The mean squared error, which is the expected squared deviation of X from x0 , can
be decomposed into contributions from the bias and the variance.
THEOREM A
MSE = β 2 + σ 2 .
Proof
From Theorem B of Section 4.2,
E[(X − x0 )2 ] = Var(X − x0 ) + [E(X − x0 )]2
= Var(X ) + β 2
= σ 2 + β2
■
4.2 Variance and Standard Deviation
137
Measurements are often reported in the form 102 ± 1.6, for example. Although it
is not always clear what precisely is meant by such notation, 102 is the experimentally
determined value and 1.6 is some measure of the error. It is often claimed or hoped that
β is negligible relative to σ , and in that case 1.6 represents σ or some multiple of σ .
In the graphical presentation of experimentally obtained data, error bars, usually of
width σ or some multiple of σ , are placed around measured values. In some cases,
efforts are made to bound the magnitude of β, and the bound is incorporated into the
error bars in some fashion.
EXAMPLE B
Measurement of the Speed of Light
Figure 4.4, taken from McNish (1962) and discussed by Youden (1972), shows 24
independent determinations of the speed of light, c, with error bars. The right column of the figure contains codes for the experimental methods used to obtain the
measurements; for example, G denotes a method called the geodimeter method.
The range of values for c is about 3.5 km/sec, and many of the errors are less than
.5 km/sec. Examination of the figure makes it clear that the error bars are too small and
that the spread of values cannot be accounted for by different experimental techniques
alone—the geodimeter method produced both the smallest and the next to largest value
for c. Youden remarks, “Surely the evidence suggests that individual investigators are
unable to set realistic limits of error to their reported values.” He goes on to suggest
■
that the differences are largely a result of calibration errors for equipment.
Designation
US
CUTKOSKY-THOMAS
GB
US
US
GB
AU
AU
AU
ESSEN
FROOME
FROOME
SW
FROOME
SW
SW
AU
SW
US
AU
US
US
US
FLORMAN
299790
91
Method
G
Ru
G
Sh
G
G
G
G
G
CR
MI
MI
G
MI
G
G
G
G
Sh
G
G
Sh
G
RI
92
93
km/sec
94
95
96
F I G U R E 4.4 A plot of 24 independent determinations of the speed of light with the
reported error bars. The investigator or country is listed in the left column, and the
experimental method is coded in the right column.
138
Chapter 4
Expected Values
4.3 Covariance and Correlation
The variance of a random variable is a measure of its variability, and the covariance
of two random variables is a measure of their joint variability, or their degree of association. After defining covariance, we will develop some of its properties and discuss
a measure of association called correlation, which is defined in terms of covariance.
You may find this material somewhat formal and abstract at first, but as you use them,
covariance, correlation, and their properties will begin to seem natural and familiar.
DEFINITION
If X and Y are jointly distributed random variables with expectations μ X and μY ,
respectively, the covariance of X and Y is
Cov(X, Y ) = E[(X − μ X )(Y − μY )]
■
provided that the expectation exists.
The covariance is the average value of the product of the deviation of X from
its mean and the deviation of Y from its mean. If the random variables are positively
associated—that is, when X is larger than its mean, Y tends to be larger than its mean
as well—the covariance will be positive. If the association is negative—that is, when
X is larger than its mean, Y tends to be smaller than its mean—the covariance is
negative. These statements will be expanded in the discussion of correlation.
By expanding the product and using the linearity of the expectation, we obtain
an alternative expression for the covariance, paralleling Theorem B of Section 4.2:
Cov(X, Y ) = E(X Y − X μY − Y μ X + μ X μY )
= E(X Y ) − E(X )μY − E(Y )μ X + μ X μY
= E(X Y ) − E(X )E(Y )
In particular, if X and Y are independent, then E(X Y ) = E(X )E(Y ) and Cov(X, Y ) =
0 (but the converse is not true). See Problems 59 and 60 at the end of this chapter for
examples.
EXAMPLE A
Let us return to the bivariate uniform distributions of Example C in Section 3.3. First,
note that since the marginal distributions are uniform, E(X ) = E(Y ) = 12 . For the
case α = −1, the joint density of X and Y is f (x, y) = (2x + 2y − 4x y), 0 ≤ x ≤ 1,
0 ≤ y ≤ 1.
E(X Y ) =
x y f (x, y) d x d y
1
0
=
1
x y(2x + 2y − 4x y) d x d y
=
2
9
0
4.3 Covariance and Correlation
139
Thus,
Cov(X, Y ) =
2
9
−
1 1
2
2
1
= − 36
The covariance is negative, indicating a negative relationship between X and Y . In
fact, from Figure 3.5, we see that if X is less than its mean, 12 , then Y tends to be
larger than its mean, and vice versa. A similar analysis shows that when α = 1,
1
.
■
Cov(X, Y ) = 36
We will now develop an expression for the covariance of linear combinations of
random variables, proceeding in a number of small steps. First, since E(a + X ) =
a + E(X ),
Cov(a + X, Y ) = E{[a + X − E(a + X )][Y − E(Y )]}
= E{[X − E(X )][Y − E(Y )]}
= Cov(X, Y )
Next, since E(a X ) = a E(X ),
Cov(a X, bY ) = E{[a X − a E(X )][bY − bE(Y )]}
= E{ab[X − E(X )][Y − E(Y )]}
= abE{[X − E(X )][Y − E(Y )]}
= ab Cov(X, Y )
Next, we consider Cov(X, Y + Z ):
Cov(X, Y + Z ) = E([X − E(X )]{[Y − E(Y )] + [Z − E(Z )]})
= E{[X − E(X )][Y − E(Y )] + [X − E(X )][Z − E(Z )]}
= E{[X − E(X )][Y − E(Y )]}
+ E{[X − E(X )][Z − E(Z )]}
= Cov(X, Y ) + Cov(X, Z )
We can now put these results together to find Cov(aW + bX, cY + d Z ):
Cov(aW + bX, cY + d Z ) = Cov(aW + bX, cY ) + Cov(aW + bX, d Z )
= Cov(aW, cY ) + Cov(bX, cY ) + Cov(aW, d Z )
+ Cov(bX, d Z )
= ac Cov(W, Y ) + bc Cov(X, Y ) + ad Cov(W, Z )
+ bd Cov(X, Z )
In general, the same kind of argument gives the following important bilinear
property of covariance.
140
Chapter 4
Expected Values
THEOREM A
Suppose that U = a +
n
i=1
bi X i and V = c +
Cov(U, V ) =
m
n i=1
m
j=1
d j Y j . Then
bi d j Cov(X i , Y j )
■
j=1
This theorem has many applications. In particular, since Var(X ) = Cov(X, X ),
Var(X + Y ) = Cov(X + Y, X + Y )
= Var(X ) + Var(Y ) + 2Cov(X, Y )
More generally, we have the following result for the variance of a linear combination
of random variables.
COROLLARY A
Var(a +
n
i=1
bi X i ) =
n
i=1
n
j=1
bi b j Cov(X i , X j ).
■
If the X i are independent, then Cov(X i , X j ) = 0 for i = j, and we have another
corollary.
COROLLARY B
Var(
n
i=1
Xi ) =
n
i=1
Var(X i ), if the X i are independent.
■
Corollary B is very useful. Note that E(
Xi ) =
E(X i ) whether or not the
Xi ) =
Var(X i ).
X i are independent, but it is generally not the case that Var(
EXAMPLE B
Finding the variance of a binomial random variable from the definition of variance and
the frequency function of the binomial distribution is not easy (try it). But expressing
a binomial random variable as a sum of independent Bernoulli random variables
makes the computation of the variance trivial. Specifically, if Y is a binomial random
variable, it can be expressed as Y = X 1 + X 2 +· · ·+ X n , where the X i are independent
Bernoulli random variables with P(X i = 1) = p. We saw earlier (Example A in
Section 4.2) that Var(X i ) = p(1 − p), from which it follows from Corollary B that
■
Var(Y ) = np(1 − p).
EXAMPLE C
Random Walk
A drunken walker starts out at a point x0 on the real line. He takes a step on length X 1 ,
which is a random variable with expected value μ and variance σ , and his position
4.3 Covariance and Correlation
141
at that time is S(1) = x0 + X 1 . He then takes another step of length X 2 , which is
independent of X 1 with the same mean and standard deviation. His position after n
n
X i . Then
such steps is S(n) = x0 + i=1
n
X i = x0 + nμ
E(S(n)) = x0 + E
Var(S(n)) = Var
n
i=1
Xi
= nσ 2
i=1
He thus expects to be at
√the position x0 + nμ with an uncertainty as measured by the
standard deviation of nσ . Note that if μ > 0, for example, for large values of n
he will be to the right of the point x0 with very high probability (using Chebyshev’s
inequality).
Random walks have found applications in many areas of science. Brownian motion is a continuous time version of a random walk with the steps being normally
distributed random variables. The name derives from observations of the biologist
Robert Brown in 1827 of the apparently spontaneous motion of pollen grains suspended in water. This was later explained by Einstein to be due to the collisions of
the grains with randomly moving water molecules.
The theory of Brownian motion was developed by Louis Bachelier in 1900 in his
PhD thesis “The theory of speculation,” which related random walks to the evolution
of stock prices. If the value of a stock evolves through time as a random walk, its
short-term behavior is unpredictable. The efficient market hypothesis states that stock
prices already reflect all known information so that the future price is random and
unknowable. The solid line in Figure 4.5 shows the value of the S&P 500 during
1300
1200
Value
1100
1000
900
800
0
50
100
150
200
250
Time
F I G U R E 4.5 The solid line is the value of the S&P 500 during 2003. The dashed
lines are simulations of random walks.
142
Chapter 4
Expected Values
2003. The average of the increments (steps) was 0.81 and the standard deviation was
9.82. The dashed lines are simulations of random walks with the same intial value
and increments that were normally distributed random variables with μ = 0.81 and
σ = 9.82. Notice the long stretches of upturns and downturns that occurred in the
random walks as the markets reacted in ways that would have been explained ex post
facto by analysts. See Malkiel (2004) for a popular exposition of the implications of
random walk theory for stock market investors.
■
The correlation coefficient is defined in terms of the covariance.
DEFINITION
If X and Y are jointly distributed random variables and the variances and covariances of both X and Y exist and the variances are nonzero, then the correlation
of X and Y , denoted by ρ, is
Cov(X, Y )
ρ=√
Var(X )Var(Y )
■
Note that because of the way the ratio is formed, the correlation is a dimensionless quantity (it has no units, such as inches, since the units in the numerator and
denominator cancel). From the properties of the variance and covariance that we have
established, it follows easily that if X and Y are both subjected to linear transformations (such as changing their units from inches to meters), the correlation coefficient
does not change. Since it does not depend on the units of measurement, ρ is in many
cases a more useful measure of association than is the covariance.
EXAMPLE D
Let us return to the bivariate uniform distribution of Example A. Because X and Y
1
. In the one case (α = −1), we found
are marginally uniform, Var(X ) = Var(Y ) = 12
1
Cov(X, Y ) = − 36 , so
1
× 12 = − 13
ρ = − 36
In the other case (α = 1), the covariance was
1
,
36
so the correlation is 13 .
■
The following notation and relationship are often useful. The standard deviations
of X and Y are denoted by σ X and σY and their covariance by σ X Y . We thus have
σX Y
ρ=
σ X σY
and
σ X Y = ρσ X σY
The following theorem states some further properties of ρ.
4.3 Covariance and Correlation
143
THEOREM B
−1 ≤ ρ ≤ 1. Furthermore, ρ = ±1 if and only if P(Y = a + bX ) = 1 for some
constants a and b.
Proof
Since the variance of a random variable is nonnegative,
X
Y
0 ≤ Var
+
σX
σY
X
Y
X Y
,
= Var
+ Var
+ 2Cov
σX
σY
σ X σY
=
Var(X ) Var(Y ) 2Cov(X, Y )
+
+
σ X2
σY2
σ X σY
= 2(1 + ρ)
From this, we see that ρ ≥ −1. Similarly,
X
Y
0 ≤ Var
−
σX
σY
= 2(1 − ρ)
implies that ρ ≤ 1. Suppose that ρ = 1. Then
X
Y
Var
−
=0
σX
σY
which by Corollary A of Section 4.2 implies that
X
Y
P
−
=c =1
σX
σY
for some constant, c. This is equivalent to P(Y = a + bX ) = 1 for some a and
■
b. A similar argument holds for ρ = −1.
EXAMPLE E
Investment Portfolio
We are now in a position to further develop the investment theory discussed in Section 4.1.2, Example E, and Section 4.2, Example D. Please review those examples before continuing. We first consider the simple example of two securities, assuming that
they have the same expected returns μ1 = μ2 = μ and their returns are uncorrelated:
σi j = Cov(Ri , R j ) = 0. For a portfolio (π, 1 − π ), the expected return is
E(R(π)) = π μ + (1 − π )μ = μ
so that when considering expected return only, the choice of π makes no difference.
However, taking risk into account,
Var(R(π)) = π 2 σ12 + (1 − π )2 σ22 .
144
Chapter 4
Expected Values
Minimizing this with respect to π gives the optimal portfolio
σ2
πopt = 2 2 2
σ1 + σ2
For example, if the investments are equally risky, σ1 = σ2 = σ , then π = 1/2, so the
best strategy is to split her total investment equally between the two securities. If she
does so, the variance of her return is, by Theorem A,
σ 2
Var R 12 =
2
whereas if she put all her money in one security, the variance of her return would
be σ 2 . The expected return is the same in both cases. This is a particularly simple
example of the value of diversification of investments.
Suppose now that the two securities do not have the same expected returns,
μ1 < μ2 . Let the standard deviations of the returns be σ1 and σ2 ; usually less risky
investments have lower expected returns, σ1 < σ2 . Furthermore, the two returns may
be correlated: Cov(R1 , R2 ) = ρσ1 σ2 . Corresponding to the portfolio (π, 1 − π ), we
have expected return
E(R(π)) = π μ1 + (1 − π )μ2
and the variance of the return is
Var(R(π)) = π 2 σ12 + 2π(1 − π )ρσ1 σ2 + (1 − π )2 σ22
Comparing this to the result when the returns were independent, we see the risk is
lower when the returns are independent than when they are positively correlated. It
would thus be better to invest in two unrelated or weakly related market sectors than
to make two investments in the same sector. In deciding the choice of the portfolio
vector, the investor can study how the risk (the standard deviation of R(π )) changes
as the expected return increases, and balance expected return versus risk.
In actual investment decisions, many more than two possible investments are
involved, but the basic idea remains the same. Suppose there are n possible investments. Let the portfolio weights be denoted by the vector π = (π1 , π2 , . . . , πn ). Let
E(Ri ) = μi , Cov(Ri , R j ) = σi j (so, in particular, Var(Ri ) is denoted by σii ), then
E(R(π )) =
πi μi
and
Var(R(π )) =
n
n πi πi σi j .
i=1 j=1
The investment decision, the choice of the portfolio vector π, is often couched as
that of maximizing expected return subject to the risk being less than some value the
individual investor is willing to tolerate. Some investors are more risk averse than
others, so the portfolio vectors will differ from investor to investor. Equivalently, the
decision may be phrased as that of finding the portfolio vector with the minimum risk
subject to a desired return; there may well be many portfolio choices that give the
4.3 Covariance and Correlation
145
5
Philipines
Monthly return (%)
4
Thailand
Malaysia
3
Brazil
INDEX
Chile
2
Taiwan
Indonesia
Turkey
Mexico
Korea
1
Argentina
Portugal
0
S&P 500
0
5
Greece
10
Standard deviation
15
20
F I G U R E 4.6 The benefit of diversification. The monthly average return from January
1992 to June 1994 of 13 stock markets, plotted against their standard deviations. The
performance of the Standard and Poor's 500 index of U.S. stocks is plotted for
comparison.
same expected return, and the wise investor would choose the one among them that
had the lowest risk.
As a general rule, risk is reduced by diversification and can be decreased with
only a small sacrifice of returns. Figure 4.6 from Bernstein (1996, p. 254) illustrates
this point empirically. The point labeled “Index” shows the monthly average versus
standard deviation for an investment that was equally weighted across all the markets.
A reasonably high return with relatively little risk would thus have been obtained by
spreading investments equally over the 13 stock markets. In fact, the risk is less than
that of any of the individual markets. Note that these emerging markets were riskier
than the U.S. market, but that they were more profitable.
■
EXAMPLE F
Bivariate Normal Distribution
We will show that the covariance of X and Y when they follow a bivariate normal
distribution is ρσ X σY , which means that ρ is the correlation coefficient. The covariance is
∞ ∞
Cov(X, Y ) =
(x − μ X )(y − μY ) f (x, y) d x d y
−∞
−∞
Making the changes of variables u = (x − μ X )/σ X and v = (y − μY )/σY changes
the right-hand side to
∞ ∞
1
σ X σY
2
2
(u + v − 2ρuv) du dv
uv exp −
2(1 − ρ 2 )
2π 1 − ρ 2 −∞ −∞
146
Chapter 4
Expected Values
As in Example F in Section 3.3, we use the technique of completing the square to
rewrite this expression as
∞
∞
σ X σY
1
2
(u
−
ρv)
v exp(−v 2 /2)
u exp −
du dv
2(1 − ρ 2 )
2π 1 − ρ 2 −∞
−∞
The inner integral is the mean of an N [ρv, (1 − ρ 2 )] random variable, lacking only
the normalizing constant [2π(1 − ρ 2 )]−1/2 , and we thus have
∞
ρσ X σY
2
v 2 e−v /2 dv = ρσ X σY
Cov(X, Y ) = √
2π
−∞
as was to be shown.
■
The correlation coefficient ρ measures the strength of the linear relationship
between X and Y (compare with Figure 3.9). Correlation also affects the appearance
of a scatterplot, which is constructed by generating n independent pairs (X i , Yi ),
where i = 1, . . . , n, and plotting the points. Figure 4.7 shows scatterplots of 100
pairs of pseudorandom bivariate normal random variables for various values of ρ.
Note that the clouds of points are roughly elliptical in shape.
y
y
3
3
2
2
1
1
0
0
1
1
2
2
3
2 1
(a)
0
1
2
3
4
x
3
2 1
(b)
y
1
2
3
4
0
1
2
3
4
x
y
4
4
3
3
2
2
1
1
0
0
1
1
2
3
2 1
(c)
0
0
1
2
3
4
x
2
3
2 1
(d)
x
F I G U R E 4.7 Scatterplots of 100 independent pairs of bivariate normal random
variables, (a) ρ = 0, (b) ρ = .3, (c) ρ = .6, (d) ρ = .9.
4.4 Conditional Expectation and Prediction
147
4.4 Conditional Expectation and Prediction
4.4.1 Definitions and Examples
In Section 3.5, conditional frequency functions and density functions were defined.
We noted that these had the properties of ordinary frequency and density functions. In
particular, associated with a conditional distribution is a conditional mean. Suppose
that Y and X are discrete random variables and that the conditional frequency function
of Y given x is pY |X (y|x). The conditional expectation of Y given X = x is
E(Y |X = x) =
ypY |X (y|x)
y
For the continuous case, we have
E(Y |X = x) =
y f Y |X (y|x) dy
More generally, the conditional expectation of a function h(Y ) is
E[h(Y )|X = x] = h(y) f Y |X (y|x) dy
in the continuous case. A similar equation holds in the discrete case.
EXAMPLE A
Consider a Poisson process on [0, 1] with mean λ, and let N be the number of points
in [0, 1]. For p < 1, let X be the number of points in [0, p]. Find the conditional
distribution and conditional mean of X given N = n.
We first find the joint distribution: P(X = x, N = n), which is the probability
of x events in [0, p] and n − x events in [ p, 1]. From the assumption of a Poisson
process, the counts in the two intervals are independent Poisson random variables
with parameters pλ and (1 − p)λ, so
p X N (x, n) =
( pλ)x e− pλ [(1 − p)λ]n−x e−(1− p)λ
x!
(n − x)!
The marginal distribution of N is Poisson, so the conditional frequency function of
X is, after some algebra,
p X N (x, n)
p N (n)
n!
p x (1 − p)n−x
=
x!(n − x)!
p X |N (x|n) =
This is the binomial distribution with parameters n and p. The conditional expectation
■
is thus by Example A of Section 4.1.2, np.
148
Chapter 4
EXAMPLE B
Expected Values
Bivariate Normal Distribution
From Example C in Section 3.5.2, if Y and X follow a bivariate normal distribution,
the conditional density of Y given X is
⎛
2 ⎞
σY
⎜ 1 y − μY − ρ (x − μ X ) ⎟
1
σX
⎜
⎟
exp ⎜−
f Y |X (y|x) = ⎟
⎝ 2
⎠
σY2 (1 − ρ 2 )
σY 2π(1 − ρ 2 )
This is a normal density with mean μY + ρ(x − μ X )σY /σ X and variance σY2 (1 − ρ 2 ).
The former is the conditional mean and the latter the conditional variance of Y given
X = x.
Note that the conditional mean is a linear function of X and that as |ρ| increases,
the conditional variance decreases; both of these facts are suggested by the elliptical
contours of the joint density. To see this more exactly, consider the case in which
σ X = σY = 1 and μ X = μY = 0. The contours then are ellipses satisfying
ρ 2 x 2 − 2ρx y + y 2 = constant
The major and minor axes of such an ellipse are at 45◦ and 135◦ . The conditional
expectation of Y given X = x is the line y = ρx; note that this line does not lie along
the major axis of the ellipse. Figure 4.8 shows such a bivariate normal distribution
with ρ = 0.5. The curved lines of the bivariate density correspond to the conditional
density of Y given various values of x, but they are not normalized to integrate to
1. The contours of the bivariate normal are the ellipses shown in the x y plane as
dashed curves, with the major axis shown by the straight dashed line. The conditional
expectation of Y given X = x is shown as a function of x by the solid line in the
plane. Note that it is not the major axis of the ellipse.
This phenomenon was noted by Sir Francis Galton (1822–1911) who studied
the relationship of the heights of sons to that of their fathers. He observed that
0.2
0.18
0.16
0.14
0.12
f (x, y) 0.1
0.08
0.06
0.04
0.02
0
4
3
2
1
x 01
2
3
4
1
3 2
0
1
2
3
4
y
F I G U R E 4.8 Bivariate normal density with correlation, ρ = 0.5. The conditional
expectation of Y given X = x is shown as the solid line in the x y plane.
4.4 Conditional Expectation and Prediction
149
sons of very tall fathers were shorter on average than their fathers and that sons
of very short fathers were on average taller. The empirical relationship is shown in
■
Figure 14.19.
Assuming that the conditional expectation of Y given X = x exists for every
x in the range of X , it is a well-defined function of X and hence is a random
variable, which we write as E(Y |X ). For instance, in Example A we found that
E(X |N = n) = np; thus, E(X |N ) = N p is a random variable that is a function of
N . Provided that the appropriate sums or integrals converge, this random variable has
an expectation and a variance. Its expectation is E[E(Y |X )]; for this expression, note
that since E(Y |X ) is a random variable that is a function of X , the outer expectation
can be taken with respect to the distribution of X (Theorem A of Section 4.1.1). The
following theorem says that the average (expected) value of Y can be found by first
conditioning on X , finding E(Y |X ), and then averaging this quantity with respect to X .
THEOREM A
E(Y ) = E[E(Y |X )].
Proof
We will prove this for the discrete case. The continuous case is proved similarly.
Using Theorem 4.1.1A we need to show that
E(Y |X = x) p X (x)
E(Y ) =
x
where
E(Y |X = x) =
ypY |X (y|x)
y
Interchanging the order of summation gives us
E(Y |X = x) p X (x) =
y
pY |X (y|x) p X (x)
x
y
x
(It can be shown that this interchange can be made.) From the law of total
probability, we have
pY (y) =
pY |X (y|x) p X (x)
x
Therefore,
y
y
x
pY |X (y|x) p X (x) =
ypY (y) = E(Y )
■
y
Theorem A gives what might be called a law of total expectation: The expectation of a random variable Y can be calculated by weighting the conditional
expectations appropriately and summing or integrating.
150
Chapter 4
EXAMPLE C
Expected Values
Suppose that in a system, a component and a backup unit both have mean lifetimes
equal to μ. If the component fails, the system automatically substitutes the backup
unit, but there is probability p that something will go wrong and it will fail to do so.
Let T be the total lifetime, and let X = 1 if the substitution of the backup takes place
successfully, and X = 0 if it does not. Thus the total lifetime is the lifetime of the
first component if the backup fails and is the sum of the lifetimes of the original and
the backup units if the backup is successfully made. Then
E(T |X = 1) = 2μ
E(T |X = 0) = μ
Thus,
E(T ) = E(T |X = 1)P(X = 1) + E(T |X = 0)P(X = 0) = μ(2 − p)
EXAMPLE D
■
Random Sums
This example introduces sums of the type
T =
N
Xi
i=1
where N is a random variable with a finite expectation and the X i are random variables
that are independent of N and have the common mean E(X ). Such sums arise in a
variety of applications. An insurance company might receive N claims in a given
period of time, and the amounts of the individual claims might be modeled as random
variables X 1 , X 2 , . . . . The random variable N could denote the number of customers
entering a store and X i the expenditure of the ith customer, or N could denote the
number of jobs in a single-server queue and X i the service time for the ith job.
For this last case, T is the time to serve all the jobs in the queue. According to
Theorem A,
E(T ) = E[E(T |N )]
Since E(T |N = n) = n E(X ), E(T |N ) = N E(X ) and thus
E(T ) = E[N E(X )] = E(N )E(X )
This agrees with the intuitive guess that the average time to complete N jobs, where
N is random, is the average value of N times the average amount of time to complete
■
a job.
We have seen that the expectation of the random variable E(Y |X ) is E(Y ). We
now find its variance.
4.4 Conditional Expectation and Prediction
151
THEOREM B
Var(Y ) = Var[E(Y |X )] + E[Var(Y |X )].
Proof
We will explain what is meant by the notation in the course of the proof. First,
Var(Y |X = x) = E(Y 2 |X = x) − [E(Y )|X = x)]2
which is defined for all values of x. Thus, just as we defined E(Y |X ) to be a
random variable by letting X be random, we can define Var(Y |X ) as a random
variable. In particular, Var(Y |X ) has the expectation E[Var(Y |X )]. Since
Var(Y |X ) = E(Y 2 |X ) − [E(Y |X )]2
E[Var(Y |X )] = E[E(Y 2 |X )] − E{[E(Y |X )]2 }
Also,
Var[E(Y |X )] = E{[E(Y |X )]2 } − {E[E(Y |X )]}2
The final piece that we need is
Var(Y ) = E(Y 2 ) − [E(Y )]2
= E[E(Y 2 |X )] − {E[E(Y |X )]}2
by the law of total expectation. Now we can put all the pieces together:
Var(Y ) = E[E(Y 2 |X )] − {E[E(Y |X )]}2
= E[E(Y 2 |X )] − E{[E(Y |X )]2 } + E{[E(Y |X )]2 } − {E[E(Y |X )]}2
= E[Var(Y |X )] + Var[E(Y |X )]
■
EXAMPLE E
Random Sums
Let us continue Example D but with the additional assumptions that the X i are independent random variables with the same mean, E(X ), and the same variance, Var(X ),
and that Var(N ) < ∞. According to Theorem B,
Var(T ) = E[Var(T |N )] + Var[E(T |N )]
Because E(T |N ) = N E(X ),
Var[E(T |N )] = [E(X )]2 Var(N )
Also, since Var(T |N = n) = Var(
n
i=1
X i ) = n Var(X ),
Var(T |N ) = N Var(X )
and
E[Var(T |N )] = E(N )Var(X )
152
Chapter 4
Expected Values
We thus have
Var(T ) = [E(X )]2 Var(N ) + E(N )Var(X )
If N is fixed, say, N = n, then Var(T ) = n Var(X ). Thus, we see from the preceding
equation that extra variability occurs in T because N is random.
As a concrete example, suppose that the number of insurance claims in a certain
time period has expected value equal to 900 and standard deviation equal to 30, as
would be the case if the number were a Poisson random variable with expected value
900. Suppose that the average claim value is $1000 and the standard deviation is $500.
Then the expected value of the total, T , of the claims is E(T ) = $900,000 and the
variance of T is
Var (T ) = 10002 × 900 + 900 × 5002
= 1.125 × 109
The standard deviation of T is the square root of the variance, $33,541. The insurance
company could then plan on total claims of $900,000 plus or minus a few standard
deviations (by Chebyshev’s inequality). Observe that if the total number of claims
were not variable but were fixed at N = 900, the variance of the total claims would
be given by E(N )Var(X ) in the preceding expression. The result would be a standard
deviation equal to $15,000. The variability in the number of claims thus contributes
■
substantially to the uncertainty in the total.
4.4.2 Prediction
This section treats the problem of predicting the value of one random variable from
another. We might wish, for example, to measure the value of some physical quantity, such as pressure, using an instrument. The actual pressures to be measured are
unknown and variable, so we might model them as values of a random variable, Y.
Assume that measurements are to be taken by some instrument that produces a response, X , related to Y in some fashion but corrupted by random noise as well; X
might represent current flow, for example. Y and X have some joint distribution, and
we wish to predict the actual pressure, Y , from the instrument response, X .
As another example, in forestry, the volume of a tree is sometimes estimated from
its diameter, which is easily measured. For a whole forest, it is reasonable to model
diameter (X ) and volume (Y ) as random variables with some joint distribution, and
then attempt to predict Y from X .
Let us first consider a relatively trivial situation: the problem of predicting Y by
means of a constant value, c. If we wish to choose the “best” value of c, we need some
measure of the effectiveness of a prediction. One that is amenable to mathematical
analysis and that is widely used is the mean squared error:
MSE = E[(Y − c)2 ]
This is the average squared error of prediction, the averaging being done with respect
to the distribution of Y . The problem then becomes finding the value of c that minimizes the mean squared error. To solve this problem, we denote E(Y ) by μ and
4.4 Conditional Expectation and Prediction
153
observe that (see Theorem A of Section 4.2.1)
E[(Y − c)2 ] = Var(Y − c) + [E(Y − c)]2
= Var(Y ) + (μ − c)2
The first term of the last expression does not depend on c, and the second term is
minimized for c = μ, which is the optimal choice of c.
Now let us consider predicting Y by some function h(X ) in order to minimize
MSE = E{[Y − h(X )]2 }. From Theorem A of Section 4.4.1, the right-hand side can
be expressed as
E{[Y − h(X )]2 } = E(E{[Y − h(X )]2 |X })
The outer expectation is with respect to X . For every x, the inner expectation is
minimized by setting h(x) equal to the constant E(Y |X = x), from the result of the
preceding paragraph. We thus have that the minimizing function h(X ) is
h(X ) = E(Y |X )
EXAMPLE A
For the bivariate normal distribution, we found that
σY
E(Y |X ) = μY + ρ (X − μ X )
σX
This linear function of X is thus the minimum mean squared error predictor of Y
■
from X .
A practical limitation of the optimal prediction scheme is that its implementation
depends on knowing the joint distribution of Y and X in order to find E(Y |X ), and
often this information is not available, not even approximately. For this reason, we
can try to attain the more modest goal of finding the optimal linear predictor of Y . (In
Example A, it turned out that the best predictor was linear, but this is not generally
the case.) That is, rather than finding the best function h among all functions, we try
to find the best function of the form h(x) = α + βx. This merely requires optimizing
over the two parameters α and β. Now
E[(Y − α − β X )2 ] = Var(Y − α − β X ) + [E(Y − α − β X )]2
= Var(Y − β X ) + [E(Y − α − β X )]2
The first term of the last expression does not depend on α, so α can be chosen so as
to minimize the second term. To do this, note that
E(Y − α − β X ) = μY − α − βμ X
and that the right-hand side is zero, and hence its square is minimized, if
α = μY − βμ X
As for the first term,
Var(Y − β X ) = σY2 + β 2 σ X2 − 2βσ X Y
154
Chapter 4
Expected Values
where σ X Y = Cov(X, Y ). This is a quadratic function of β, and the minimum is
found by setting the derivative with respect to β equal to zero, which yields
σX Y
σY
β= 2 =ρ
σX
σX
ρ is the correlation coefficient. Substituting in these values of α and β, we find that
the minimum mean squared error predictor, which we denote by Ŷ , is
Ŷ = α + β X
σX Y
= μY + 2 (X − μ X )
σX
The mean squared prediction error is then
Var(Y − β X ) = σY2 +
= σY2 −
σ X2 Y 2
σX Y
σX − 2 2 σX Y
4
σX
σX
σ X2 Y
σ X2
= σY2 − ρ 2 σY2
= σY2 (1 − ρ 2 )
Note that the optimal linear predictor depends on the joint distribution of X
and Y only through their means, variances, and covariance. Thus, in practice, it is
generally easier to construct the optimal linear predictor or an approximation to it
than to construct the general optimal predictor E(Y |X ). Second, note that the form
of the optimal linear predictor is the same as that of E(Y |X ) for the bivariate normal
distribution. Third, note that the mean squared prediction error depends only on σY
and ρ and that it is small if ρ is close to +1 or −1. Here we see again, from a different
point of view, that the correlation coefficient is a measure of the strength of the linear
relationship between X and Y .
EXAMPLE B
Suppose that two examinations are given in a course. As a probability model, we
regard the scores of a student on the first and second examinations as jointly distributed
random variables X and Y . Suppose for simplicity that the exams are scaled to have
the same means μ = μ X = μY and standard deviations σ = σ X = σY . Then,
the correlation between X and Y is ρ = σ X Y /σ 2 and the best linear predictor is
Ŷ = μ + ρ(X − μ), so
Ŷ − μ = ρ(X − μ)
Notice that by this equation we predict the student’s score on the second examination
to differ from the overall mean μ by less than did the score on the first examination.
If the correlation ρ is positive, this is encouraging for a student who scores below the
mean on the first exam, since our best prediction is that his score on the next exam
will be closer to the mean. On the other hand, it’s bad news for the student who scored
above the mean on the first exam, since our best prediction is that she will score closer
to the mean on the next exam. This phenomenon is often referred to as regression to
■
the mean.
4.5 The Moment-Generating Function
155
4.5 The Moment-Generating Function
This section develops and applies some of the properties of the moment-generating
function. It turns out, despite its unlikely appearance, to be a very useful tool that can
dramatically simplify certain calculations.
The moment-generating function (mgf) of a random variable X is M(t) =
E(et X ) if the expectation is defined. In the discrete case,
M(t) =
et x p(x)
x
and in the continuous case,
M(t) =
∞
−∞
et x f (x) d x
The expectation, and hence the moment-generating function, may or may not exist
for any particular value of t. In the continuous case, the existence of the expectation
depends on how rapidly the tails of the density decrease; for example, because the
tails of the Cauchy density die down at the rate x −2 , the expectation does not exist
for any t and the moment-generating function is undefined. The tails of the normal
2
density die down at the rate e−x , so the integral converges for all t.
PROPERTY A
If the moment-generating function exists for t in an open interval containing
zero, it uniquely determines the probability distribution.
■
We cannot prove this important property here—its proof depends on properties
of the Laplace transform. Note that Property A says that if two random variables have
the same mgf in an open interval containing zero, they have the same distribution.
For some problems, we can find the mgf and then deduce the unique probability
distribution corresponding to it.
The r th moment of a random variable is E(X r ) if the expectation exists. We
have already encountered the first and second moments earlier in this chapter, that is,
E(X ) and E(X 2 ). Central moments rather than ordinary moments are often used: The
r th central moment is E{[X − E(X )]r }. The variance is the second central moment
and is a measure of dispersion about the mean. The third central moment, called the
skewness, is used as a measure of the asymmetry of a density or a frequency function
about its mean; if a density is symmetric about its mean, the skewness is zero (see
Problem 78 at the end of this chapter). As its name implies, the moment-generating
function has something to do with moments. To see this, consider the continuous
case:
∞
et x f (x) d x
M(t) =
−∞
156
Chapter 4
Expected Values
The derivative of M(t) is
d
M (t) =
dt
∞
−∞
et x f (x) d x
It can be shown that differentiation and integration can be interchanged, so that
∞
M (t) =
xet x f (x) d x
−∞
and
M (0) =
∞
−∞
x f (x) d x = E(X )
Differentiating r times, we find
M (r ) (0) = E(X r )
It can further be argued that if the moment-generating function exists in an interval
containing zero, then so do all the moments. We thus have the following property.
PROPERTY B
If the moment-generating function exists in an open interval containing zero,
then M (r ) (0) = E(X r ).
■
To find the moments of a random variable from the definition of expectation, we
must sum a series or carry out an integration. The utility of Property B is that, if the
mgf can be found, the process of integration or summation, which may be difficult, can
be replaced by the process of differentiation, which is mechanical. We now illustrate
these concepts using some familiar distributions.
EXAMPLE A
Poisson Distribution
By definition,
M(t) =
∞
etk
k=0
=
λk −λ
e
k!
∞
(λet )k −λ
e
k!
k=0
= e−λ eλe
t
= eλ(e −1)
t
The sum converges for all t. Differentiating, we have
M (t) = λet eλ(e −1)
t
t
M (t) = λet eλ(e −1) + λ2 e2t eλ(e −1)
t
4.5 The Moment-Generating Function
157
Evaluating these derivatives at t = 0, we find
E(X ) = λ
E(X 2 ) = λ2 + λ
from which it follows that
Var(X ) = E(X 2 ) − [E(X )]2 = λ
We have found that the mean and the variance of a Poisson distribution are
■
equal.
EXAMPLE B
Gamma Distribution
The mgf of a gamma distribution is
∞
λα α−1 −λx
x e
M(t) =
et x
dx
(α)
0
∞
λα
x α−1 e x(t−λ) d x
=
(α) 0
The latter integral converges for t < λ and can be evaluated by relating it to the
gamma density having parameters α and λ − t. We thus obtain
α
λα
λ
(α)
M(t) =
=
(α) (λ − t)α
λ−t
Differentiating, we find
α
λ
α(α + 1)
M (0) = E(X 2 ) =
λ2
M (0) = E(X ) =
From these equations, we find that
Var(X ) = E(X 2 ) − [E(X )]2
α(α + 1) α 2
=
− 2
λ2
λ
α
= 2
λ
EXAMPLE C
Standard Normal Distribution
For the standard normal distribution, we have
∞
1
2
M(t) = √
et x e−x /2 d x
2π −∞
■
158
Chapter 4
Expected Values
The integral converges for all t and can be evaluated using the technique of completing
the square. Since
x2
t2
1
− t x = (x 2 − 2t x + t 2 ) −
2
2
2
=
t2
1
(x − t)2 −
2
2
therefore,
et /2
M(t) = √
2π
2
∞
e−(x−t)
2 /2
dx
−∞
Making the change of variables u = x − t and using the fact that the standard normal
density integrates to 1, we find that
M(t) = et
2 /2
From this result, we easily see that E(X ) = 0 and Var(X ) = 1.
■
Let us continue with the development of the properties of the moment-generating
function.
PROPERTY C
If X has the mgf M X (t) and Y = a +bX , then Y has the mgf MY (t) = eat M X (bt).
Proof
MY (t) = E(etY )
= E(eat+bt X )
= E(eat ebt X )
= eat E(ebt X )
= eat M X (bt)
EXAMPLE D
■
General Normal Distribution
If Y follows a general normal distribution with parameters μ and σ , the distribution
of Y is the same as that of μ + σ X , where X follows a standard normal distribution.
Thus, from Example C and Property C,
MY (t) = eμt M X (σ t) = eμt eσ
2 t 2 /2
■
4.5 The Moment-Generating Function
159
PROPERTY D
If X and Y are independent random variables with mgf’s M X and MY and Z =
X + Y , then M Z (t) = M X (t)MY (t) on the common interval where both mgf’s
exist.
Proof
M Z (t) = E(et Z )
= E(et X +tY )
= E(et X etY )
From the assumption of independence,
M Z (t) = E(et X )E(etY )
= M X (t)MY (t)
■
By induction, Property D can be extended to sums of several independent random
variables. This is one of the most useful properties of the moment-generating function.
The next three examples show how it can be used to easily derive results that would
take a lot more work to achieve without recourse to the mgf.
EXAMPLE E
The sum of independent Poisson random variables is a Poisson random variable: If
X is Poisson with parameter λ and Y is Poisson with parameter μ, then X + Y is
Poisson with parameter λ + μ, since
eλ(e −1) eμ(e −1) = e(λ+μ)(e −1)
t
EXAMPLE F
t
t
■
If X follows a gamma distribution with parameters α1 and λ and Y follows a gamma
distribution with parameters α2 and λ, the mgf of X + Y is
α1 α2
α1 + α2
λ
λ
λ
=
λ−t
λ−t
λ−t
where t < λ. The right-hand expression is the mgf of a gamma distribution with
parameters λ and α1 +α2 . It follows from this that the sum of n independent exponential
random variables with parameter λ follows a gamma distribution with parameters n
and λ. Thus, the time between n consecutive events of a Poisson process in time follows
a gamma distribution. Assuming that the service times in a queue are independent
exponential random variables, the length of time to serve n customers follows a
gamma distribution.
■
160
Chapter 4
EXAMPLE G
Expected Values
If X ∼ N (μ, σ 2 ) and, independent of X, Y ∼ N (v, τ 2 ), then the mgf of X + Y is
eμt et
2 σ 2 /2
evt et
2 τ 2 /2
= e(μ+v)t et
2 (σ 2 +τ 2 )/2
which is the mgf of a normal distribution with mean μ + v and variance σ 2 + τ 2 . The
■
sum of independent normal random variables is thus normal.
The preceding three examples are atypical. In general, if two independent random variables follow some type of distribution, it is not necessarily true that their
sum follows the same type of distribution. For example, the sum of two gamma random variables having different values for the parameter λ does not follow a gamma
distribution, as can be easily seen from the mgf.
We now apply moment-generating functions to random sums of the type introduced in Section 4.4.1. Suppose that
S=
N
Xi
i=1
where the X i are independent and have the same mgf, M X , and where N has the mgf
M N and is independent of the X i . By conditioning, we have
M S (t) = E[E(et S |N )]
Given N = n, M S (t) = [M X (t)]n from Property D. We thus have
M S (t) = E[M X (t) N ]
= E(e N log M X (t) )
= M N [log M X (t)]
(We must carefully note the values of t for which this is defined.)
EXAMPLE H
Compound Poisson Distribution
This example presents a model that occurs for certain chain reactions, or “cascade”
processes. When a single primary electron, having been accelerated in an electrical
field, hits a plate, several secondary electrons are produced. In a multistage multiplying
tube, each of these secondary electrons hits another plate and thereby produces a
number of tertiary electrons. The process can continue through several stages in this
manner. Woodward (1948) considered models of this type in which the number of
electrons produced by the impact of a single electron on the plate is random and, in
particular, in which the number of secondary electrons follows a Poisson distribution.
The number of electrons produced at the third stage is described by a random sum
of the type just described, where N is the number of secondary electrons and X i is
the number of electrons produced by the ith secondary electron. Suppose that the X i
are independent Poisson random variables with parameter λ and that N is a Poisson
random variable with parameter μ. According to the preceding result, the mgf of S,
4.6 Approximate Methods
161
the total number of particles, is
M S (t) = exp[μ(eλ(e −1) − 1)]
t
■
Example H illustrates the utility of the mgf. It would have been more difficult
to find the probability mass function of the number of particles at the third stage. By
differentiating the mgf, we can find the moments of the probability mass function
(see Problem 98 at the end of this chapter).
If X and Y have a joint distribution, their joint moment-generating function is
defined as
M X Y (s, t) = E(es X +tY )
which is a function of two variables, s and t. If the joint mgf is defined on an open
set containing the origin, it uniquely determines the joint distribution. The mgf of the
marginal distribution of X alone is
M X (s) = M X Y (s, 0)
and similarly for Y . It can be shown that X and Y are independent if and only if
their joint mgf factors into the product of the mgf’s of the marginal distributions.
E(X Y ) and other higher-order joint moments can be obtained from the joint mgf
by differentiation. Analogous properties hold for the joint mgf of several random
variables.
The major limitation of the mgf is that it may not exist. The characteristic
function of a random variable X is defined to be
where i =
√
φ(t) = E(eit X )
−1. In the continuous case,
∞
eit x f (x) d x
φ(t) =
−∞
This integral converges for all values of t, since |eit x | ≤ 1. The characteristic function is thus defined for all distributions. Its properties are similar to those of the
mgf: Moments can be obtained by differentiation, the characteristic function changes
simply under linear transformations, and the characteristic function of a sum of independent random variables is the product of their characteristic functions. But using
the characteristic function requires some familiarity with the techniques of complex
variables.
4.6 Approximate Methods
In many applications, only the first two moments of a random variable, and not
the entire probability distribution, are known, and even these may be known only
approximately. We will see in Chapter 5 that repeated independent observations of a
random variable allow reliable estimates to be made of its mean and variance. Suppose
that we know the expectation and the variance of a random variable X but not the
162
Chapter 4
Expected Values
entire distribution, and that we are interested in the mean and variance of Y = g(X )
for some fixed function g. For example, we might be able to measure X and determine
its mean and variance, but really be interested in Y , which is related to X in a known
way. We might want to know Var(Y ), at least approximately, in order to assess the
accuracy of the indirect measurement process. From the results given in this chapter,
we cannot in general find E(Y ) = μY and Var(Y ) = σY2 from E(X ) = μ X and
Var(X ) = σ X2 , unless the function g is linear. However, if g is nearly linear in a range
in which X has high probability, it can be approximated by a linear function and
approximate moments of Y can be found.
In proceeding as just described, we follow a tack often taken in applied mathematics: When confronted with a nonlinear problem that we cannot solve, we linearize. In probability and statistics, this method is called propagation of error, or the
δ method. Linearization is carried out through a Taylor series expansion of g about
μ X . To the first order,
Y = g(X ) ≈ g(μ X ) + (X − μ X )g (μ X )
We have expressed Y as approximately equal to a linear function of X . Recalling that
if U = a + bV , then E(U ) = a + bE(V ) and Var(U ) = b2 Var(V ), we find
μY ≈ g(μ X )
σY2 ≈ σ X2 [g (μ X )]2
We know that in general E(Y ) = g(E(X )), as given by the approximation. In fact,
we can carry out the Taylor series expansion to the second order to get an improved
approximation of μY :
Y = g(X ) ≈ g(μ X ) + (X − μ X )g (μ X ) + 12 (X − μ X )2 g (μ X )
Taking the expectation of the right-hand side, we have, since E(X − μ X ) = 0,
E(Y ) ≈ g(μ X ) + 12 σ X2 g (μ X )
How good such approximations are depends on how nonlinear g is in a neighborhood of μ X and on the size of σ X . From Chebyshev’s inequality, we know that X is
unlikely to be many standard deviations away from μ X ; if g can be reasonably well
approximated in this range by a linear function, the approximations for the moments
will be reasonable as well.
EXAMPLE A
The relation of voltage, current, and resistance is V = I R. Suppose that the voltage
is held constant at a value V0 across a medium whose resistance fluctuates randomly
as a result, say, of random fluctuations at the molecular level. The current therefore
also varies randomly. Suppose that it can be determined experimentally to have mean
μ I = 0 and variance σ I2 . We wish to find the mean and variance of the resistance, R,
and since we do not know the distribution of I , we must resort to an approximation.
4.6 Approximate Methods
163
We have
R = g(I ) =
g (μ I ) = −
V0
I
V0
2V0
g (μ I ) = 3
2
μI
μI
Thus,
μR ≈
V0
V0
+ 3 σ I2
μI
μI
σ R2 ≈
V02 2
σ
μ4I I
We see that the variability of R depends on both the mean level of I and the variance
of I . This makes sense, since if I is quite small, small variations in I will result in
large variations in R = V0 /I , whereas if I is large, small variations will not affect R
as much. The second-order correction factor for μ R also depends on μ I and is large if
μ I is small. In fact, when I is near zero, the function g(I ) = V0 /I is quite nonlinear,
■
and the linearization is not a good approximation.
This example examines the accuracy
√ of the approximations using a simple test case.
We choose the function g(x) = x and consider two cases: X uniform on [0, 1],
and X uniform on [1, 2]. The graph of g(x) in Figure 4.9 shows that g is more nearly
linear in the latter
√ case, so we would expect the approximations to work better there.
Let Y = X ; because X is uniform on [0, 1],
1
√
E(Y ) =
x d x = 23
0
1.6
1.4
1.2
1.0
g (x)
EXAMPLE B
.8
.6
.4
.2
0
0
.5
1.0
x
1.5
F I G U R E 4.9 The function g(x) =
than over the interval [0, 1].
2.0
√
x is more nearly linear over the interval [1, 2]
164
Chapter 4
Expected Values
and
1
E(Y ) =
x dx =
2
so Var(Y ) =
1
2
−
2 2
3
0
=
1
18
1
2
and σY = .236. These results are exact.
Using the approximation method, we first calculate
g (x) = 12 x −1/2
g (x) = − 14 x −3/2
Since X is uniform on [0, 1], μ X = 12 , and evaluating the derivatives at this value
gives us
√
2
g (μ X ) =
2√
2
g (μ X ) = −
2
We know that Var(X ) =
mations are
1
12
for a random variable uniform on [0, 1], so the approxi*
E(Y ) ≈
Var(Y ) ≈
1
2
1 1
−
2 2
×
1
12
√ 2
= .678
12 × 2
= .042
σY ≈ .204
The approximation to the mean is .678, and compared to the actual value of .667, it is
off by about 1.6%. The approximation to the standard deviation is .204, and compared
to the actual value of .236, it is off by 13%.
Now let us consider√the case in which X is uniform on [1, 2]. Proceeding as
before, we find that y = x has mean 1.219. The variance and standard deviation are
.0142 and .119, respectively. To compare these to the approximations, we note that
1
μ X = 32 and Var(X ) = 12
(the random variable uniform on [1, 2] can be obtained by
adding the constant 1 to a random variable uniform on [0, 1]; compare with Theorem A
in Section 4.2). We find
g (μ X ) = .408
g (μ X ) = −.136
so the approximations are
*
3 1 .136
−
E(Y ) ≈
2 2
12
2
.408
= .0138
Var(Y ) ≈
12
σY ≈ .118
= 1.219
These values are much closer to the actual values than are the approximations for the
■
first case.
4.6 Approximate Methods
165
Suppose that we have Z = g(X, Y ), a function of two variables. We can again
carry out Taylor series expansions to approximate the mean and variance of Z . To the
first order, letting μ denote the point (μ X , μY ),
Z = g(X, Y ) ≈ g(μ) + (X − μ X )
∂g(μ)
∂g(μ)
+ (Y − μY )
∂x
∂y
The notation ∂g(μ)/∂ x means that the derivative is evaluated at the point μ. Here Z
is expressed as approximately equal to a linear function of X and Y , and the mean
and variance of this linear function are easily calculated to be
E(Z ) ≈ g(μ)
and
Var(Z ) ≈ σ X2
∂g(μ)
∂x
2
+ σY2
∂g(μ)
∂y
2
+ 2σ X Y
∂g(μ)
∂x
∂g(μ)
∂y
(For the latter calculation, see Corollary A in Section 4.3.) As is the case with a single
variable, a second-order expansion can be used to obtain an improved estimate of
E(Z ):
∂g(μ)
∂g(μ)
+ (Y − μY )
Z = g(X, Y ) ≈ g(μ) + (X − μ X )
∂x
∂y
∂ 2 g(μ) 1
∂ 2 g(μ)
1
+ (X − μ X )2
+ (Y − μY )2
2
2
∂x
2
∂ y2
+ (X − μ X )(Y − μY )
∂ 2 g(μ)
∂ x∂ y
Taking expectations term by term on the right-hand side yields
1 ∂ 2 g(μ) 1 2 ∂ 2 g(μ)
∂ 2 g(μ)
E(Z ) ≈ g(μ) + σ X2
σ
+
+
σ
X
Y
Y
2
∂x2
2
∂ y2
∂ x∂ y
The general case of a function of n variables can be worked out similarly; the basic
concepts are illustrated by the two-variable case.
EXAMPLE C
Expectation and Variance of a Ratio
Let us consider the case where Z = Y / X , which arises frequently in practice. For
example, a chemist might measure the concentrations of two substances, both with
some measurement error that is indicated by their standard deviations, and then report
the relative concentrations in the form of a ratio. What is the approximate standard
deviation of the ratio, Z ?
Using the method of propagation of error derived above, for g(x, y) = y/x, we
have
−y
1
∂g
∂g
= 2
=
∂x
x
∂y
x
2y
∂2g
= 3
∂x2
x
∂2g
=0
∂ y2
−1
∂2g
= 2
∂ x∂ y
x
166
Chapter 4
Expected Values
Evaluating these derivatives at (μ X , μY ) and using the preceding result, we find, if
μ X = 0,
μY
μY
σX Y
+ σ X2 3 − 2
μX
μX
μX
μY
μY
1
=
+ 2 σ X2
− ρσ X σY
μX
μX
μX
E(Z ) ≈
From this equation, we see that the difference between E(Z ) and μY /μ X depends on
several factors. If σ X and σY are small—that is, if the two concentrations are measured
quite accurately—the difference is small. If μ X is small, the difference is relatively
large. Finally, correlation between X and Y affects the difference.
We now consider the variance. Again using the preceding result and evaluating
the partial derivatives at (μ X , μY ), we find
μ2Y
μY
σY2
+
− 2σ X Y 3
4
2
μX
μX
μX
2
μ
μY
1
= 2 σ X2 2Y + σY2 − 2ρσ X σY
μX
μX
μX
Var(Z ) ≈ σ X2
From this equation, we see that the ratio is quite variable when μ X is small, paralleling
the results in Example A, and that correlation between X and Y , if of the same sign
■
as μY /μ X , decreases Var(Z ).
4.7 Problems
1. Show that if a random variable is bounded—that is, |X | < M < ∞—then
E(X ) exists.
2. If X is a discrete uniform random variable—that is, P(X = k) = 1/n for k =
1, 2, . . . , n—find E(X ) and Var(X ).
3. Find E(X ) and Var(X ) for Problem 3 in Chapter 2.
4. Let X have the cdf F(x) = 1 − x −α , x ≥ 1.
a. Find E(X ) for those values of α for which E(X ) exists.
b. Find Var(X ) for those values of α for which it exists.
5. Let X have the density
f (x) =
1 + αx
,
2
Find E(X ) and Var(X ).
−1 ≤ x ≤ 1,
−1 ≤ α ≤ 1
4.7 Problems
167
6. Let X be a continuous random variable with probability density function
f (x) = 2x, 0 ≤ x ≤ 1.
a. Find E(X ).
b. Let Y = X 2 . Find the probability mass function of Y and use it to find E(Y ).
c. Use Theorem A in Section 4.1.1 to find E(X 2 ) and compare to your answer
in part (b).
d. Find Var(X ) according to the definition of variance given in Section 4.2.
Also find Var(X ) by using Theorem B of Section 4.2.
7. Let X be a discrete random variable that takes on values 0, 1, 2 with probabilities
1 3 1
, , , respectively.
2 8 8
a. Find E(X ).
b. Let Y = X 2 . Find the probability mass function of Y and use it to find E(Y ).
c. Use Theorem A of Section 4.1.1 to find E(X 2 ) and compare to your answer
in part (b).
d. Find Var(X ) according to the definition of variance given in Section 4.2.
Also find Var(X ) by using Theorem B in Section 4.2.
8. Show that if X is a discrete random variable, taking values on the positive
integers, then E(X ) = ∞
k=1 P(X ≥ k). Apply this result to find the expected
value of a geometric random variable.
9. This is a simplified inventory problem. Suppose that it costs c dollars to stock an
item and that the item sells for s dollars. Suppose that the number of items that
will be asked for by customers is a random variable with the frequency function
p(k). Find a rule for the number of items that should be stocked in order to
maximize the expected income. (Hint: Consider the difference of successive
terms.)
10. A list of n items is arranged in random order; to find a requested item, they
are searched sequentially until the desired item is found. What is the expected
number of items that must be searched through, assuming that each item is
equally likely to be the one requested? (Questions of this nature arise in the
design of computer algorithms.)
11. Referring to Problem 10, suppose that the items are not equally likely to be
requested but have known probabilities p1 , p2 , . . . , pn . Suggest an alternative
searching procedure that will decrease the average number of items that must
be searched through, and show that in fact it does so.
12. If X is a continuous random variable with a density that is symmetric about
some point, ξ , show that E(X ) = ξ , provided that E(X ) exists.
13. If X is a nonnegative continuous random variable, show that
E(X ) =
∞
[1 − F(x)] d x
0
Apply this result to find the mean of the exponential distribution.
168
Chapter 4
Expected Values
14. Let X be a continuous random variable with the density function
f (x) = 2x,
0≤x ≤1
a. Find E(X ).
b. Find E(X 2 ) and Var(X ).
15. Suppose that two lotteries each have n possible numbers and the same payoff.
In terms of expected gain, is it better to buy two tickets from one of the lotteries
or one from each?
16. Suppose that E(X ) = μ and Var(X ) = σ 2 . Let Z = (X − μ)/σ . Show that
E(Z ) = 0 and Var(Z ) = 1.
17. Find (a) the expectation and (b) the variance of the kth-order statistic of a sample
of n independent random variables uniform on [0, 1]. The density function is
given in Example C in Section 3.7.
18. If U1 , . . . , Un are independent uniform random variables, find E(U(n) − U(1) ).
19. Find E(U(k) − U(k−1) ), where the U(i) are as in Problem 18.
20. A stick of unit length is broken into two pieces. Find the expected ratio of the
length of the longer piece to the length of the shorter piece.
21. A random square has a side length that is a uniform [0, 1] random variable.
Find the expected area of the square.
22. A random rectangle has sides the lengths of which are independent uniform
random variables. Find the expected area of the rectangle, and compare this
result to that of Problem 21.
23. Repeat Problems 21 and 22 assuming that the distribution of the lengths is
exponential.
24. Prove Theorem A of Section 4.1.2 for the discrete case.
25. If X 1 and X 2 are independent random variables following a gamma distribution
with parameters α and λ, find E(R 2 ), where R 2 = X 12 + X 22 .
26. Referring to Example B in Section 4.1.2, what is the expected number of
coupons needed to collect r different types, where r < n?
27. If n men throw their hats into a pile and each man takes a hat at random, what
is the expected number of matches? (Hint: Express the number as a sum.)
28. Suppose that n enemy aircraft are shot at simultaneously by m gunners, that
each gunner selects an aircraft to shoot at independently of the other gunners,
and that each gunner hits the selected aircraft with probability p. Find the
expected number of aircraft hit by the gunners.
29. Prove Corollary A of Section 4.1.1.
30. Find E[1/(X + 1)], where X is a Poisson random variable.
4.7 Problems
169
31. Let X be uniformly distributed on the interval [1, 2]. Find E(1/ X ). Is E(1/ X ) =
1/E(X )?
32. Let X have a gamma distribution with parameters α and λ. For those values of
α and λ for which it is defined, find E(1/ X ).
33. Prove Chebyshev’s inequality in the discrete case.
√
34. Let X be uniform on [0, 1], and let Y = X . Find E(Y ) by (a) finding the
density of Y and then finding the expectation and (b) using Theorem A of
Section 4.1.1.
35. Find the mean of a negative binomial random variable. (Hint: Express the
random variable as a sum.)
36. Consider the following scheme for group testing. The original lot of samples is
divided into two groups, and each of the subgroups is tested as a whole. If either
subgroup tests positive, it is divided in two, and the procedure is repeated. If
any of the groups thus obtained tests positive, test every member of that group.
Find the expected number of tests performed, and compare it to the number
performed with no grouping and with the scheme described in Example C in
Section 4.1.2.
37. For what values of p is the group testing of Example C in Section 4.1.2 inferior
to testing every individual?
38. This problem continues Example A of Section 4.1.2.
a. What is the probability that a fragment is the leftmost member of a
contig?
b. What is the expected number of fragments that are leftmost members of
contigs?
c. What is the expected number of contigs?
39. Suppose that a segment of DNA of length 1,000,000 is to be shotgun sequenced
with fragments of length 1000.
a. How many fragment would be needed so that the chance of an individual
site being covered is greater than 0.99?
b. With this choice, how many sites would you expect to be missed?
40. A child types the letters Q, W, E, R, T, Y, randomly producing 1000 letters in
all. What is the expected number of times that the sequence QQQQ appears,
counting overlaps?
41. Continuing with the previous problem, how many times would we expect the
word “TRY” to appear? Would we be surprised if it occurred 100 times? (Hint:
Consider Markov’s inequality.)
42. Let X be an exponential random variable with standard deviation σ . Find
P(|X − E(X )| > kσ ) for k = 2, 3, 4, and compare the results to the bounds
from Chebyshev’s inequality.
43. Show that Var(X − Y ) = Var(X ) + Var(Y ) − 2Cov(X, Y ).
170
Chapter 4
Expected Values
44. If X and Y are independent random variables with equal variances, find
Cov(X + Y, X − Y ).
45. Find the covariance and the correlation of Ni and N j , where N1 , N2 , . . . , Nr
are multinomial random variables. (Hint: Express them as sums.)
46. If U = a + bX and V = c + dY , show that |ρU V | = |ρ X Y |.
47. If X and Y are independent random variables and Z = Y − X , find expressions
for the covariance and the correlation of X and Z in terms of the variances of
X and Y .
48. Let U and V be independent
random variables with means μ and variances σ 2 .
√
2
Let Z = αU + V 1 − α . Find E(Z ) and ρU Z .
49. Two independent measurements, X and Y , are taken of a quantity μ. E(X ) =
E(Y ) = μ, but σ X and σY are unequal. The two measurements are combined
by means of a weighted average to give
Z = α X + (1 − α)Y
where α is a scalar and 0 ≤ α ≤ 1.
a. Show that E(Z ) = μ.
b. Find α in terms of σ X and σY to minimize Var(Z ).
c. Under what circumstances is it better to use the average (X + Y )/2 than
either X or Y alone?
50. Suppose that X i , where i = 1, . . . , n, are independent random variables with
n
E(X i ) = μ and Var(X i ) = σ 2 . Let X = n −1
i=1 X i . Show that E(X ) = μ
and Var(X ) = σ 2 /n.
51. Continuing Example E in Section 4.3, suppose there are n securities, each with
the same expected return, that all the returns have the same standard deviations,
and that the returns are uncorrelated. What is the optimal portfolio vector? Plot
the risk of the optimal portfolio versus n. How does this risk compare to that
incurred by putting all your money in one security?
52. Consider two securities, the first having μ1 = 1 and σ1 = 0.1, and the second
having μ2 = 0.8 and σ2 = 0.12. Suppose that they are negatively correlated,
with ρ = −0.8.
a. If you could only invest in one security, which one would you choose, and
why?
b. Suppose you invest 50% of your money in each of the two. What is your
expected return and what is your risk?
c. If you invest 80% of your money in security 1 and 20% in security 2, what
is your expected return and your risk?
d. Denote the expected return and its standard deviation as functions of π by
μ(π) and σ (π). The pair (μ(π ), σ (π )) trace out a curve in the plane as π
varies from 0 to 1. Plot this curve.
e. Repeat b–d if the correlation is ρ = 0.1.
√
53. Show that Cov(X, Y ) ≤ Var(X )Var(Y ).
4.7 Problems
171
54. Let X , Y , and Z be uncorrelated random variables with variances σ X2 , σY2 , and
σ Z2 , respectively. Let
U =Z+X
V = Z +Y
Find Cov(U, V ) and ρU V .
n
55. Let T =
k=1 k X k , where the X k are independent random variables with
means μ and variances σ 2 . Find E(T ) and Var(T ).
56. Let S = nk=1 X k , where the X k are as in Problem 55. Find the covariance and
the correlation of S and T .
57. If X and Y are independent random variables, find Var(X Y ) in terms of the
means and variances of X and Y .
58. A function is measured at two points with some error (for example, the position
of an object is measured at two times). Let
X 1 = f (x) + ε1
X 2 = f (x + h) + ε2
where ε1 and ε2 are independent random variables with mean zero and variance
σ 2 . Since the derivative of f is
lim
h→0
f (x + h) − f (x)
h
it is estimated by
Z=
X2 − X1
h
a. Find E(Z ) and Var(Z ). What is the effect of choosing a value of h that is
very small, as is suggested by the definition of the derivative?
b. Find an approximation to the mean squared error of Z as an estimate of f (x)
using a Taylor series expansion. Can you find the value of h that minimizes
the mean squared error?
c. Suppose that f is measured at three points with some error. How could you
construct an estimate of the second derivative of f , and what are the mean
and the variance of your estimate?
59. Let (X, Y ) be a random point uniformly distributed on a unit disk. Show that
Cov(X, Y ) = 0, but that X and Y are not independent.
60. Let Y have a density that is symmetric about zero, and let X = SY , where S is an
independent random variable taking on the values +1 and −1 with probability
1
each. Show that Cov(X, Y ) = 0, but that X and Y are not independent.
2
61. In Section 3.7, the joint density of the minimum and maximum of n independent
uniform random variables was found. In the case n = 2, this amounts to X
and Y , the minimum and maximum, respectively, of two independent random
172
Chapter 4
Expected Values
variables uniform on [0, 1], having the joint density
f (x, y) = 2,
0≤x ≤y
a. Find the covariance and the correlation of X and Y . Does the sign of the
correlation make sense intuitively?
b. Find E(X |Y = y) and E(Y |X = x). Do these results make sense intuitively?
c. Find the probability density functions of the random variables E(X |Y ) and
E(Y |X ).
d. What is the linear predictor of Y in terms of X (denoted by Ŷ = a + bX )
that has minimal mean squared error? What is the mean square prediction
error?
e. What is the predictor of Y in terms of X [Ŷ = h(X )] that has minimal mean
squared error? What is the mean square prediction error?
62. Let X and Y have the joint distribution given in Problem 1 of Chapter 3.
a. Find the covariance and correlation of X and Y .
b. Find E(Y |X = x) for x = 1, 2, 3, 4. Find the probability mass function of
the random variable E(Y |X ).
63. Let X and Y have the joint distribution given in Problem 8 of Chapter 3.
a. Find the covariance and correlation of X and Y .
b. Find E(Y |X = x) for 0 ≤ x ≤ 1.
64. Let X and Y be jointly distributed random variables with correlation ρ X√Y ; define
the standardized random
√ variables X̃ and Ỹ as X̃ = (X − E(X ))/ Var(X )
and Ỹ = (Y − E(Y ))/ Var(Y ). Show that Cov( X̃ , Ỹ ) = ρ X Y .
65. How has the assumption that N and the X i are independent been used in
Example D of Section 4.4.1?
66. A building contains two elevators, one fast and one slow. The average waiting
time for the slow elevator is 3 min. and the average waiting time of the fast
elevator is 1 min. If a passenger chooses the fast elevator with probability 23 and
the slow elevator with probability 13 , what is the expected waiting time? (Use
the law of total expectation, Theorem A of Section 4.4.1, defining appropriate
random variables X and Y .)
67. A random rectangle is formed in the following way: The base, X , is chosen
to be a uniform [0, 1] random variable and after having generated the base,
the height is chosen to be uniform on [0, X ]. Use the law of total expectation,
Theorem A of Section 4.4.1, to find the expected circumference and area of the
rectangle.
68. Show that E[Var(Y |X )] ≤ Var(Y ).
69. Suppose that a bivariate normal distribution has μ X = μY = 0 and σ X =
σY = 1. Sketch the contours of the density and the lines E(Y |X = x) and
E(X |Y = y) for ρ = 0, .5, and .9.
4.7 Problems
173
70. If X and Y are independent, show that E(X |Y = y) = E(X ).
71. Let X be a binomial random variable representing the number of successes in
n independent Bernoulli trials. Let Y be the number of successes in the first m
trials, where m < n. Find the conditional frequency function of Y given X = x
and the conditional mean.
72. An item is present in a list of n items with probability p; if it is present, its position in the list is uniformly distributed. A computer program searches through
the list sequentially. Find the expected number of items searched through before
the program terminates.
73. A fair coin is tossed n times, and the number of heads, N , is counted. The coin
is then tossed N more times. Find the expected total number of heads generated
by this process.
74. The number of offspring of an organism is a discrete random variable with
mean μ and variance σ 2 . Each of its offspring reproduces in the same manner. Find the expected number of offspring in the third generation and its
variance.
75. Let T be an exponential random variable, and conditional on T , let U be uniform
on [0, T ]. Find the unconditional mean and variance of U .
76. Let the point (X, Y ) be uniformly distributed over the half disk x 2 + y 2 ≤ 1,
where y ≥ 0. If you observe X , what is the best prediction for Y ? If you observe
Y , what is the best prediction for X ? For both questions, “best” means having
the minimum mean squared error.
77. Let X and Y have the joint density
f (x, y) = e−y ,
0≤x ≤y
a. Find Cov(X, Y ) and the correlation of X and Y .
b. Find E(X |Y = y) and E(Y |X = x).
c. Find the density functions of the random variables E(X |Y ) and E(Y |X ).
78. Show that if a density is symmetric about zero, its skewness is zero.
79. Let X be a discrete random variable that takes on values 0, 1, 2 with probabilities
1 3 1
, , , respectively. Find the moment-generating function of X , M(t), and
2 8 8
verify that E(X ) = M (0) and that E(X 2 ) = M (0).
80. Let X be a continuous random variable with density function f (x) = 2x,
0 ≤ x ≤ 1. Find the moment-generating function of X , M(t), and verify that
E(X ) = M (0) and that E(X 2 ) = M (0).
81. Find the moment-generating function of a Bernoulli random variable, and use
it to find the mean, variance, and third moment.
82. Use the result of Problem 81 to find the mgf of a binomial random variable and
its mean and variance.
174
Chapter 4
Expected Values
83. Show that if X i follows a binomial distribution with n i trials and probability of
n
Xi
success pi = p, where i = 1, . . . , n and the X i are independent, then i=1
follows a binomial distribution.
84. Referring to Problem 83, show that if the pi are unequal, the sum does not
follow a binomial distribution.
85. Find the mgf of a geometric random variable, and use it to find the mean and
the variance.
86. Use the result of Problem 85 to find the mgf of a negative binomial random
variable and its mean and variance.
87. Under what conditions is the sum of independent negative binomial random
variables also negative binomial?
88. Let X and Y be independent random variables, and let α and β be scalars. Find
an expression for the mgf of Z = α X + βY in terms of the mgf’s of X and Y .
89. Let X 1 , X 2 , . . . , X n be independent normal random variables with means
n
αi X i , where the αi are scalars,
μi and variances σi2 . Show that Y = i=1
is normally distributed, and find its mean and variance. (Hint: Use momentgenerating functions.)
90. Assuming that X ∼ N (0, σ 2 ), use the mgf to show that the odd moments are
zero and the even moments are
μ2n =
(2n)!σ 2n
2n (n!)
91. Use the mgf to show that if X follows an exponential distribution, cX (c > 0)
does also.
92. Suppose that is a random variable that follows a gamma distribution with parameters λ and α, where α is an integer, and suppose that, conditional on
, X follows a Poisson distribution with parameter . Find the unconditional distribution of α + X . (Hint: Find the mgf by using iterated conditional
expectations.)
93. Find the distribution of a geometric sum of exponential random variables by
using moment-generating functions.
94. If X is a nonnegative integer-valued random variable, the probabilitygenerating function of X is defined to be
G(s) =
∞
s k pk
k=0
where pk = P(X = k).
a. Show that
1 dk
G(s)
pk =
k! ds k
s=0
4.7 Problems
b. Show that
175
dG = E(X )
ds s=1
d 2 G = E[X (X − 1)]
ds 2 s=1
c. Express the probability-generating function in terms of moment-generating
function.
d. Find the probability-generating function of the Poisson distribution.
95. Show that if X and Y are independent, their joint moment-generating function
factors.
96. Show how to find E(X Y ) from the joint moment-generating function of X
and Y .
97. Use moment-generating functions to show that if X and Y are independent,
then
Var(a X + bY ) = a 2 Var(X ) + b2 Var(Y )
98. Find the mean and variance of the compound Poisson distribution (Example H
in Section 4.5).
99. Find expressions
for the approximate mean and variance of Y = g(X ) for (a)
√
g(x) = x, (b) g(x) = log x, and (c) g(x) = sin−1 x.
100. If X is uniform on [10, 20], find the approximate and exact mean and variance
of Y = 1/ X , and compare them.
√
101. Find the approximate mean and variance of Y = X , where X is a random
variable following a Poisson distribution.
102. Two sides, x0 and y0 , of a right triangle are independently measured as X and
Y , where E(X ) = x0 and E(Y ) = y0 and Var(X ) = Var(Y ) = σ 2 . The angle
between the two sides is then determined as
Y
−1
= tan
X
Find the approximate mean and variance of .
103. The volume of a bubble is estimated by measuring its diameter and using the
relationship
π
V = D3
6
Suppose that the true diameter is 2 mm and that the standard deviation of the
measurement of the diameter is .01 mm. What is the approximate standard
deviation of the estimated volume?
104. The position of an aircraft relative to an observer on the ground is estimated
by measuring its distance r from the observer and the angle θ that the line of
176
Chapter 4
Expected Values
sight from the observer to the aircraft makes with the horizontal. Suppose that
the measurements, denoted by R and , are subject to random errors and are
independent of each other. The altitude of the aircraft is then estimated to be
Y = R sin .
a. Find an approximate expression for the variance of Y .
b. For given r , at what value of θ is the estimated altitude most variable?
CHAPTER
5
Limit Theorems
5.1 Introduction
This chapter is principally concerned with the limiting behavior of the sum of independent random variables as the number of summands becomes large. The results
presented here are both intrinsically interesting and useful in statistics, since many
commonly computed statistical quantities, such as averages, can be represented as
sums.
5.2 The Law of Large Numbers
It is commonly believed that if a fair coin is tossed many times and the proportion of
heads is calculated, that proportion will be close to 12 . John Kerrich, a South African
mathematician, tested this belief empirically while detained as a prisoner during
World War II. He tossed a coin 10,000 times and observed 5067 heads. The law of
large numbers is a mathematical formulation of this belief. The successive tosses of
the coin are modeled as independent random trials. The random variable X i takes on
the value 0 or 1 according to whether the ith trial results in a tail or a head, and the
proportion of heads in n trials is
Xn =
n
1
Xi
n i=1
The law of large numbers states that X n approaches
the following theorem.
1
2
in a sense that is specified by
177
178
Chapter 5
Limit Theorems
THEOREM A
Law of Large Numbers
Let X 1 , X 2 , . . . , X i . . . be a sequence of independent random variables with
n
X i . Then, for any ε > 0,
E(X i ) = μ and Var(X i ) = σ 2 . Let X n = n −1 i=1
P(|X n − μ| > ε) → 0
as n → ∞
Proof
We first find E(X n ) and Var(X n ):
E(X n ) =
n
1
E(X i ) = μ
n i=1
Since the X i are independent,
Var(X n ) =
n
1 σ2
Var(X i ) =
2
n i=1
n
The desired result now follows immediately from Chebyshev’s inequality, which
states that
Var(X n )
σ2
=
→ 0,
as n → ∞
■
P(|X n − μ| > ε) ≤
ε2
nε 2
In the case of a fair coin toss, the X i are Bernoulli random variables with p = 1/2,
E(X i ) = 1/2 and Var(X i ) = 1/4. If tossed 10,000 times
Var(X 10,000 ) = 2.5 × 10−5
and the standard deviation of the average is the square root of the variance, 0.005.
The proportion observed by Kerrich, 0.5067, is thus a little more than one standard
deviation away from its expected value of 0.5, consistent with Chebyshev’s inequality.
(Recall from Section 4.2 that Chebyshev’s inequality can be written in the form
P(|X n − μ| > kσ ) ≤ 1/k 2 .)
If a sequence of random variables, {Z n }, is such that P(|Z n − α| > ε) approaches
zero as n approaches infinity, for any ε > 0 and where α is some scalar, then Z n is
said to converge in probability to α. There is another mode of convergence, called
strong convergence or almost sure convergence, which asserts more. Z n is said to
converge almost surely to α if for every ε > 0, |Z n − α| > ε only a finite number
of times with probability 1; that is, beyond some point in the sequence, the difference
is always less than ε, but where that point is random. The version of the law of large
numbers stated and proved earlier asserts that X n converges to μ in probability. This
version is usually called the weak law of large numbers. Under the same assumptions,
a strong law of large numbers, which asserts that X n converges almost surely to μ,
can also be proved, but we will not do so.
We now consider some examples that illustrate the utility of the law of large
numbers.
5.2 The Law of Large Numbers
EXAMPLE A
179
Monte Carlo Integration
Suppose that we wish to calculate
I( f ) =
1
f (x) d x
0
where the integration cannot be done by elementary means or evaluated using tables of integrals. The most common approach is to use a numerical method in which
the integral is approximated by a sum; various schemes and computer packages exist for doing this. Another method, called the Monte Carlo method, works in the
following way. Generate independent uniform random variables on [0, 1]—that is,
X 1 , X 2 , . . . , X n —and compute
n
1
f (X i )
n i=1
Î ( f ) =
By the law of large numbers, this should be close to E[ f (X )], which is simply
1
f (x) d x = I ( f )
E[ f (X )] =
0
This simple scheme can be easily modified in order to change the range of integration
and in other ways. Compared to the standard numerical methods, it is not especially
efficient in one dimension, but becomes increasingly efficient as the dimensionality
of the integral grows.
As a concrete example, let us consider the evaluation of
1
1
2
e−x /2 d x
I( f ) = √
2π 0
The integral is that of the standard normal density, which cannot be evaluated in closed
form. From the table of the normal distribution (Table 2 in Appendix B), an accurate
numerical approximation is I ( f ) = .3413. If 1000 points, X 1 , . . . , X 1000 , uniformly
distributed over the interval 0 ≤ x ≤ 1, are generated using a pseudorandom number
generator, the integral is then approximated by
1
Î ( f ) =
1000
1
√
2π
1000
e−X i /2
2
i=1
which produced for one realization of the X i the value .3417.
EXAMPLE B
■
Repeated Measurements
Suppose that repeated independent unbiased measurements, X 1 , . . . , X n , of a quantity
are made. If n is large, the law of large numbers says that X will be close to the true
value, μ, of the quantity, but how close X is depends not only on n but on the variance
of the measurement error, σ 2 , as can be seen in the proof of Theorem A.
180
Chapter 5
Limit Theorems
Fortunately, σ 2 can be estimated and therefore
Var(X ) =
σ2
n
n
X i2
can be estimated from the data to assess the precision of X . First, note that n −1 i=1
2
converges to E(X ), from the law of large numbers. Second, it can be shown that if
Z n converges to α in probability and g is a continuous function, then
g(Z n ) → g(α)
which implies that
2
X → [E(X )]2
2
n
Finally, since n −1 i=1
X i2 converges to E(X 2 ) and X converges to [E(X )]2 , with a
little additional argument it can be shown that
n
1 2
2
X i − X → E(X 2 ) − [E(X )]2 = Var(X )
n i=1
More generally, it follows from the law of large numbers that the sample moments,
n
n −1 i=1
X ir , converge in probability to the moments of X, E(X r ).
■
EXAMPLE C
A muscle or nerve cell membrane contains a very large number of channels; when
open, these channels allow ions to pass through. Individual channels seem to open and
close randomly, and it is often assumed that in an equilibrium situation the channels
open and close independently of each other and that only a very small fraction are open
at any one time. Suppose then that the probability that a channel is open is p, a very
small number, that there are m channels in all, and that the amount of current flowing
through an individual channel is c. The number of channels open at any particular
time is N , a binomial random variable with m trials and probability p of success on
each trial. The total amount of current is S = cN and can be measured. We then
have
E(S) = cE(N ) = cmp
Var(S) = c2 mp(1 − p)
and
Var(S)
= c(1 − p) ≈ c
E(S)
since p is small. Thus, through independent measurements, S1 , . . . , Sn , we can estimate E(S) and Var(S) and therefore c, the amount of current flowing through a single
■
channel, without knowing how many channels there are.
5.3 Convergence in Distribution and the Central Limit Theorem
181
5.3 Convergence in Distribution
and the Central Limit Theorem
In applications, we often want to find P(a < X < b) when we do not know the
cdf of X precisely; it is sometimes possible to do this by approximating FX . The
approximation is often arrived at by some sort of limiting argument. The most famous
limit theorem in probability theory is the central limit theorem, which is the main
topic of this section. Before discussing the central limit theorem, we develop some
introductory terminology, theory, and examples.
DEFINITION
Let X 1 , X 2 , . . . be a sequence of random variables with cumulative distribution
functions F1 , F2 , . . . , and let X be a random variable with distribution function
F. We say that X n converges in distribution to X if
lim Fn (x) = F(x)
n→∞
at every point at which F is continuous.
■
Moment-generating functions are often useful for establishing the convergence
of distribution functions. We know from Property A of Section 4.5 that a distribution function Fn is uniquely determined by its mgf, Mn . The following theorem,
which we give without proof, states that this unique determination holds for limits
as well.
THEOREM A Continuity Theorem
Let Fn be a sequence of cumulative distribution functions with the corresponding
moment-generating function Mn . Let F be a cumulative distribution function with
the moment-generating function M. If Mn (t) → M(t) for all t in an open interval
■
containing zero, then Fn (x) → F(x) at all continuity points of F.
EXAMPLE A
We will show that the Poisson distribution can be approximated by the normal distribution for large values of λ. This is suggested by examining Figure 2.6, which shows
that as λ increases, the probability mass function of the Poisson distribution becomes
more symmetric and bell-shaped.
Let λ1 , λ2 , . . . be an increasing sequence with λn → ∞, and let {X n } be a
sequence of Poisson random variables with the corresponding parameters. We know
that E(X n ) = Var(X n ) = λn . If we wish to approximate the Poisson distribution
function by a normal distribution function, the normal must have the same mean and
182
Chapter 5
Limit Theorems
variance as the Poisson does. In addition, if we wish to prove a limiting result, we run
into the difficulty that the mean and variance are tending to infinity. This difficulty is
dealt with by standardizing the random variables—that is, by letting
X n − E(X n )
Zn = √
Var(X n )
=
X n − λn
√
λn
We then have E(Z n ) = 0 and Var(Z n ) = 1, and we will show that the mgf of Z n
converges to the mgf of the standard normal distribution.
The mgf of X n is
M X n (t) = eλn (e −1)
t
By Property C of Section 4.5, the mgf of Z n is
M Z n (t) = e−t
= e−t
√
√
λn
MXn
t
√
λn
√
λn λn (et/ λn −1)
e
It will be easier to work with the log of this expression.
log M Z n (t) = −t
Using the power series expansion e x =
√
λn + λn (et/ λn − 1)
∞ xk
k=0 k! ,
lim log M Z n (t) =
n→∞
we see that
t2
2
or
lim M Z n (t) = et
2 /2
n→∞
The last expression is the mgf of the standard normal distribution.
We have shown that a standardized Poisson random variable converges in distribution to a standard normal variable as λ approaches infinity. Practically, we wish to
use this limiting result as a basis for an approximation for large but finite values of λn .
How adequate the approximation is for λ = 100, say, is a matter for theoretical and/or
empirical investigation. It turns out that the approximation is increasingly good for
large values of λ and that λ does not have to be all that large. (See Problem 8 at the
end of this chapter.)
■
5.3 Convergence in Distribution and the Central Limit Theorem
183
The next example shows how the approximation of the Poisson distribution can
be applied in a specific case.
EXAMPLE B
A certain type of particle is emitted at a rate of 900 per hour. What is the probability
that more than 950 particles will be emitted in a given hour if the counts form a
Poisson process?
Let X be a Poisson random variable with mean 900. We find P(X > 950) by
standardizing:
950 − 900
X − 900
√
> √
900
900
5
≈ 1− 3
P(X > 950) = P
= .04779
where is the standard normal cdf. For comparison, the exact probability is
■
.04712.
We now turn to the central limit theorem, which is concerned with a limiting
property of sums of random variables. If X 1 , X 2 , . . . is a sequence of independent
random variables with mean μ and variance σ 2 , and if
Sn =
n
Xi
i=1
we know from the law of large numbers that Sn /n converges to μ in probability. This
followed from the fact that
Var
Sn
n
=
σ2
1
→0
Var(S
)
=
n
n2
n
The central limit theorem is concerned not with the fact that the ratio Sn /n converges
to μ but with how it fluctuates around μ. To analyze these fluctuations, we standardize:
Zn =
Sn − nμ
√
σ n
You should verify that Z n has mean 0 and variance 1. The central limit theorem states
that the distribution of Z n converges to the standard normal distribution.
184
Chapter 5
Limit Theorems
THEOREM B Central Limit Theorem
Let X 1 , X 2 , . . . be a sequence of independent random variables having mean 0
and variance σ 2 and the common distribution function F and moment-generating
function M defined in a neighborhood of zero. Let
n
Xi
Sn =
i=1
Then
lim P
n→∞
Sn
√ ≤x
σ n
= (x),
−∞ < x < ∞
Proof
√
Let Z n = Sn /(σ n). We will show that the mgf of Z n tends to the mgf of the
standard normal distribution. Since Sn is a sum of independent random variables,
M Sn (t) = [M(t)]n
and
M Z n (t) = M
t
√
n
σ n
M(s) has a Taylor series expansion about zero:
M(s) = M(0) + s M (0) + 12 s 2 M (0) + εs
2
where εs /s 2 →
√ 0 as s → 0. Since E(X ) = 0, M (0) = 0, and M (0) = σ . As
n → ∞, t/(σ n) → 0, and
2
t
t
1
√
√
M
+ εn
= 1 + σ2
σ n
2
σ n
where εn /(t 2 /(nσ 2 )) → 0 as n → ∞. We thus have
n
t2
+ εn
M Z n (t) = 1 +
2n
It can be shown that if an → a, then
a n n
lim 1 +
= ea
n→∞
n
From this result, it follows that
M Z n (t) → et
2 /2
as n → ∞
where exp(t 2 /2) is the mgf of the standard normal distribution, as was to be
■
shown.
Theorem B is one of the simplest versions of the central limit theorem; there are
many central limit theorems of various degrees of abstraction and generality. We have
proved Theorem B under the assumption that the moment-generating functions exist,
which is a rather strong assumption. By using characteristic functions instead, we
5.3 Convergence in Distribution and the Central Limit Theorem
185
could modify the proof so that it would only be necessary that first and second moments exist. Further generalizations weaken the assumption that the X i have the same
distribution and apply to linear combinations of independent random variables. Central limit theorems have also been proved that weaken the independence assumption
and allow the X i to be dependent but not “too” dependent.
For practical purposes, especially for statistics, the limiting result in itself is not of
primary interest. Statisticians are more interested in its use as an approximation with
finite values of n. It is impossible to give a concise and definitive statement of how
good the approximation is, but some general guidelines are available, and examining
special cases can give insight. How fast the approximation becomes good depends
on the distribution of the summands, the X i . If the distribution is fairly symmetric
and has tails that die off rapidly, the approximation becomes good for relatively small
values of n. If the distribution is very skewed or if the tails die down very slowly, a
larger value of n is needed for a good approximation. The following examples deal
with two special cases.
1
Because the uniform distribution on [0, 1] has mean 12 and variance 12
, the sum of
12 uniform random variables, minus 6, has mean 0 and variance 1. The distribution
of this sum is quite close to normal; in fact, before better algorithms were developed,
it was commonly used in computers for generating normal random variables from
uniform ones. It is possible to compare the real and approximate distributions analytically, but we will content ourselves with a simple demonstration. Figure 5.1 shows
a histogram of 1000 such sums with a superimposed normal density function. The fit
is surprisingly good, especially considering that 12 is not usually regarded as a large
■
value of n.
250
200
150
Count
EXAMPLE C
100
50
0
4
2
0
2
Value
4
6
F I G U R E 5.1 A histogram of 1000 values, each of which is the sum of 12 uniform
[− 12 , 12 ] pseudorandom variables, with an approximating standard normal density.
186
Chapter 5
EXAMPLE D
Limit Theorems
The sum of n independent exponential random variables with parameter λ = 1
follows a gamma distribution with λ = 1 and α = n (Example F in Section 4.5). The
exponential density is quite skewed; therefore, a good approximation of a standardized
gamma by a standardized normal would not be expected for small n. Figure 5.2 shows
the cdf’s of the standard normal and standardized gamma distributions for increasing
values of n. Note how the approximation improves as n increases.
■
1.0
Cumulative probability
.8
.6
.4
.2
0
3
2
1
0
x
1
2
3
F I G U R E 5.2 The standard normal cdf (solid line) and the cdf s of standardized
gamma distributions with α = 5 (long dashes), α = 10 (short dashes), and α = 30 (dots).
Let us now consider some applications of the central limit theorem.
EXAMPLE E
Measurement Error
Suppose that X 1 , . . . , X n are repeated, independent measurements of a quantity, μ,
and that E(X i ) = μ and Var(X i ) = σ 2 . The average of the measurements, X , is
used as an estimate of μ. The law of large numbers tells us that X converges to
μ in probability, so we can hope that X is close to μ if n is large. Chebyshev’s
inequality allows us to bound the probability of an error of a given size, but the
central limit theorem gives a much sharper approximation to the actual error. Suppose
that we wish to find P(|X − μ| < c) for some constant c. To use the central limit
theorem to approximate this probability, we first standardize, using E(X ) = μ and
Var(X ) = σ 2 /n:
P(|X − μ| < c) = P(−c < X − μ < c)
c
X −μ
−c
√ <
√ < √
=P
σ/ n
σ/ n
σ/ n
√
√
c n
c n
≈
− −
σ
σ
5.3 Convergence in Distribution and the Central Limit Theorem
187
For example, suppose that 16 measurements are taken with σ = 1. The probability that the average deviates from μ by less than .5 is approximately
P(|X − μ| < .5) = (.5 × 4) − (−.5 × 4) = .954
This sort of reasoning can be turned around. That is, given c and γ , n can be found
such that
P(|X − μ| < c) ≥ γ
EXAMPLE F
■
Normal Approximation to the Binomial Distribution
Since a binomial random variable is the sum of independent Bernoulli random variables, its distribution can be approximated by a normal distribution. The approximation is best when the binomial distribution is symmetric—that is, when p = 12 . A
frequently used rule of thumb is that the approximation is reasonable when np > 5
and n(1 − p) > 5. The approximation is especially useful for large values of n, for
which tables are not readily available.
Suppose that a coin is tossed 100 times and lands heads up 60 times. Should we
be surprised and doubt that the coin is fair?
To answer this question, we note that if the coin is fair, the number of heads, X , is
a binomial random variable with n = 100 trials and probability of success p = 12 , so
that E(X ) = np = 50 (see Example A of Section 4.1) and Var(X ) = np(1 − p) = 25
(see Example B of Section 4.3). We could calculate P(X = 60), which would be a
small number. But because there are so many possible outcomes, P(X = 50) is also
a small number, so this calculation would not really answer the question. Instead, we
calculate the probability of a deviation as extreme as or more extreme than 60 if the
coin is fair; that is, we calculate P(X ≥ 60). To approximate this probability from
the normal distribution, we standardize:
60 − 50
X − 50
≥
P(X ≥ 60) = P
5
5
≈ 1 − (2)
= .0228
The probability is rather small, so the fairness of the coin is called into question.
EXAMPLE G
■
Particle Size Distribution
The distribution of the sizes of grains of particulate matter is often found to be quite
skewed, with a slowly decreasing right tail. A distribution called the lognormal is
sometimes fit to such a distribution, and X is said to follow a lognormal distribution if
log X has a normal distribution. The central limit theorem gives a theoretical rationale
for the use of the lognormal distribution in some situations.
Suppose that a particle of initial size y0 is subjected to repeated impacts, that on
each impact a proportion, X i , of the particle remains, and that the X i are modeled as
independent random variables having the same distribution. After the first impact, the
188
Chapter 5
Limit Theorems
size of the particle is Y1 = X 1 y0 ; after the second impact, the size is Y2 = X 2 X 1 y0 ;
and after the nth impact, the size is
Yn = X n X n−1 · · · X 2 X 1 y0
Then
log Yn = log y0 +
n
log X i
i=1
■
and the central limit theorem applies to log Yn .
A similar construction is relevant to the theory of finance. Suppose that an initial
investment of value v0 is made and that returns occur in discrete time, for example,
daily. If the return on the first day is R1 , then the value becomes V1 = R1 v0 . After
day two the value is V2 = R2 R1 v0 , and after day n the value is
Vn = Rn Rn−1 · · · R1 v0
The log value is thus
log Vn = log v0 +
n
log Ri
i=1
If the returns are independent random variables with the same distribution, then the
distribution of log Vn is approximately normally distributed.
5.4 Problems
1. Let X 1 , X 2 , . . . be a sequence of independent random variables with E(X i ) = μ
n
σi2 → 0, then X → μ in probability.
and Var(X i ) = σi2 . Show that if n −2 i=1
2. Let X i be as in Problem 1 but with E(X i ) = μi and n −1
that X → μ in probability.
n
i=1
μi → μ. Show
3. Suppose that the number of insurance claims, N , filed in a year is Poisson
distributed with E(N ) = 10,000. Use the normal approximation to the Poisson
to approximate P(N > 10,200).
4. Suppose that the number of traffic accidents, N , in a given period of time is distributed as a Poisson random variable with E(N ) = 100. Use the normal approximation to the Poisson to find such that P(100 − < N < 100 + ) ≈ .9.
5. Using moment-generating functions, show that as n → ∞, p → 0, and
np → λ, the binomial distribution with parameters n and p tends to the Poisson
distribution.
6. Using moment-generating functions, show that as α → ∞ the gamma distribution with parameters α and λ, properly standardized, tends to the standard
normal distribution.
7. Show that if X n → c in probability and if g is a continuous function, then
g(X n ) → g(c) in probability.
5.4 Problems
189
8. Compare the Poisson cdf and the normal approximation for (a) λ = 10,
(b) λ = 20, and (c) λ = 40.
9. Compare the binomial cdf and the normal approximation for (a) n = 20 and
p = .2, and (b) n = 40 and p = .5.
10. A six-sided die is rolled 100 times. Using the normal approximation, find the
probability that the face showing a six turns up between 15 and 20 times.
Find the probability that the sum of the face values of the 100 trials is less
than 300.
11. A skeptic gives the following argument to show that there must be a flaw in
the central limit theorem: “We know that the sum of independent Poisson random variables follows a Poisson distribution with a parameter that is the sum
of the parameters of the summands. In particular, if n independent Poisson
random variables, each with parameter n −1 , are summed, the sum has a Poisson distribution with parameter 1. The central limit theorem says that as n
approaches infinity, the distribution of the sum tends to a normal distribution,
but the Poisson with parameter 1 is not the normal.” What do you think of this
argument?
12. The central limit theorem can be used to analyze round-off error. Suppose that
the round-off error is represented as a uniform random variable on [− 12 , 12 ]. If
100 numbers are added, approximate the probability that the round-off error
exceeds (a) 1, (b) 2, and (c) 5.
13. A drunkard executes a “random walk” in the following way: Each minute he
takes a step north or south, with probability 12 each, and his successive step
directions are independent. His step length is 50 cm. Use the central limit theorem to approximate the probability distribution of his location after 1 h. Where
is he most likely to be?
14. Answer Problem 13 under the assumption that the drunkard has some idea of
where he wants to go so that he steps north with probability 23 and south with
probability 13 .
15. Suppose that you bet $5 on each of a sequence of 50 independent fair games.
Use the central limit theorem to approximate the probability that you will lose
more than $75.
16. Suppose that X 1 , . . . , X 20 are independent random variables with density functions
f (x) = 2x,
0≤x ≤1
Let S = X 1 + · · · + X 20 . Use the central limit theorem to approximate
P(S ≤ 10).
17. Suppose that a measurement has mean μ and variance σ 2 = 25. Let X be the
average of n such independent measurements. How large should n be so that
P(|X − μ| < 1) = .95?
190
Chapter 5
Limit Theorems
18. Suppose that a company ships packages that are variable in weight, with an average weight of 15 lb and a standard deviation of 10. Assuming that the packages
come from a large number of different customers so that it is reasonable to
model their weights as independent random variables, find the probability that
100 packages will have a total weight exceeding 1700 lb.
19. a. Use the Monte Carlo method with n = 100 and n = 1000 to estimate
1
cos(2π x) d x. Compare the estimates to the exact answer.
0
1
b. Use Monte Carlo to evaluate 0 cos(2π x 2 ) d x. Can you find the exact
answer?
20. What is the variance of the estimate of an integral by the Monte Carlo method
(Example A of Section 5.2)? [Hint: Find E( Î 2 ( f )).] Compare the standard
deviations of the estimates of part (a) of previous problem to the actual errors
you made.
21. This problem introduces a variation on the Monte Carlo integration technique
of Example A of Section 5.2. Suppose that we wish to evaluate
b
I( f ) =
f (x) d x
a
Let g be a density function on [a, b]. Generate X 1 , · · · , X n from g and estimate
I by
Î ( f ) =
n
1 f (X i )
n i=1 g(X i )
a. Show that E( Î ( f )) = I ( f ).
b. Find an expression for Var( Î ( f )). Give an example for which it is finite and
an example for which it is infinite. Note that if it is finite, the law of large
numbers implies that Î ( f ) → I ( f ) as n → ∞.
c. Show that if a = 0, b = 1, and g is uniform, this is the same Monte Carlo
estimate as that of Example A of Section 5.2. Can this estimate be improved
by choosing g to be other than uniform? (Hint: Compare variances.)
22. Use the central limit theorem to find such that P(| Î ( f ) − I ( f )| ≤ ) =
1
.05, where Î ( f ) is the Monte Carlo estimate of 0 cos(2π x) d x based on
1000 points.
23. An irregularly shaped object of unknown area A is located in the unit square
0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Consider a random point distributed uniformly over the
square; let Z = 1 if the point lies inside the object and Z = 0 otherwise. Show
that E(Z ) = A. How could A be estimated from a sequence of n independent
points uniformly distributed on the square?
24. How could the central limit theorem be used to gauge the probable size of the
error of the estimate of the previous problem? Denoting the estimate by Â, if
A = .2, how large should n be so that P(| Â − A| < .01) ≈ .99?
25. Let X be a continuous random variable with density function f (x) = 32 x 2 , −1 ≤
x ≤ 1. Sketch this density function. Use the central limit theorem to sketch
5.4 Problems
191
the approximate density function of S = X 1 + · · · + X 50 , where the X i are
independent random variables with density
f . Similarly, sketch the approxi√
mate density functions of S/50 and S/ 50. For each sketch, label at least three
points on the horizontal axis.
26. Suppose that a basketball player can score on a particular shot with probability .3. Use the central limit theorem to find the approximate distribution of S,
the number of successes out of 25 independent shots. Find the approximate
probabilities that S is less than or equal to 5, 7, 9, and 11 and compare these to
the exact probabilities.
27. Prove that if an → a, then (1 + an /n)n → ea .
28. Let f n be a sequence of frequency functions with f n (x) = 12 if x = ±( 12 )n
and f n (x) = 0 otherwise. Show that lim f n (x) = 0 for all x, which means that
the frequency functions do not converge to a frequency function, but that there
exists a cdf F such that lim Fn (x) = F(x).
29. In addition to limit theorems that deal with sums, there are limit theorems that
deal with extreme values such as maxima or minima. Here is an example. Let
U1 , . . . , Un be independent uniform random variables on [0, 1], and let U(n) be
the maximum. Find the cdf of U(n) and a standardized U(n) , and show that the
cdf of the standardized variable tends to a limiting value.
30. Generate a sequence U1 , U2 , . . . , U1000 of independent uniform random varin
Ui for n = 1, 2, . . . , 1000. Plot each of
ables on a computer. Let Sn = i=1
the following versus n:
a. Sn
b. Sn /n
c. Sn − n/2
d. (Sn − n/2)/n
√
e. (Sn − n/2)/ n
Explain the shapes of the resulting graphs using the concepts of this chapter.
CHAPTER
6
Distributions Derived
from the Normal
Distribution
6.1 Introduction
This chapter assembles some results concerning three probability distributions derived
from the normal distribution—the χ 2 , t, and F distributions. These distributions occur
in many statistical problems and will be used in later chapters.
6.2 χ 2, t, and F Distributions
DEFINITION
If Z is a standard normal random variable, the distribution of U = Z 2 is called
■
the chi-square distribution with 1 degree of freedom.
We have already encountered the chi-square distribution in Section 2.3, where
we saw that it is a special case of the gamma distribution with parameters 12 and 12 .
The chi-square distribution with 1 degree of freedom is denoted χ12 . It is useful to note
that if X ∼ N (μ, σ 2 ), then (X − μ)/σ ∼ N (0, 1), and therefore [(X − μ)/σ ]2 ∼ χ12 .
DEFINITION
If U1 , U2 , . . . , Un are independent chi-square random variables with 1 degree of
freedom, the distribution of V = U1 + U2 + · · · + Un is called the chi-square
■
distribution with n degrees of freedom and is denoted by χn2 .
192
6.2 χ 2 , t, and F Distributions
193
From Example F in Section 4.5, we know that the sum of independent gamma
random variables that have the same value of λ follows a gamma distribution, and
therefore the chi-square distribution with n degrees of freedom is a gamma distribution
with α = n/2 and λ = 12 . Its density is
f (v) =
1
v (n/2)−1 e−v/2 ,
2n/2 (n/2)
v≥0
Its moment-generating function is
M(t) = (1 − 2t)−n/2
Also, E(V ) = n and Var(V ) = 2n. To indicate that V follows a chi-square distribution
with n degrees of freedom, we write V ∼ χn2 . A notable consequence of the definition
of the chi-square distribution is that if U and V are independent and U ∼ χn2 and
2
.
V ∼ χm2 , then U + V ∼ χm+n
We now turn to the t distribution.
DEFINITION
If Z ∼√N (0, 1) and U ∼ χn2 and Z and U are independent, then the distribution
■
of Z / U/n is called the t distribution with n degrees of freedom.
PROPOSITION A
The density function of the t distribution with n degrees of freedom is
[(n + 1)/2]
f (t) = √
nπ (n/2)
1+
t2
n
−(n+1)/2
Proof
√
This is proved by a standard method. The density function of U/n is straightforward to obtain, and the density function of the quotient of two independent
random variables was derived in Section 3.6.1. The details of the proof are left
■
as an end-of-chapter problem.
From the density function of Proposition A, f (t) = f (−t), so the t distribution
is symmetric about zero. As the number of degrees of freedom approaches infinity,
the t distribution tends to the standard normal distribution; in fact, for more than 20
or 30 degrees of freedom, the distributions are very close. Figure 6.1 shows several t
densities. Note that the tails become lighter as the degrees of freedom increase.
Chapter 6
Distributions Derived from the Normal Distribution
.4
.3
Density
194
.2
.1
0
3
2
1
0
x
1
2
3
F I G U R E 6.1 Three t densities with 5 (long dashes), 10 (short dashes), and 30 (dots)
degrees of freedom and the standard normal density (solid line).
DEFINITION
Let U and V be independent chi-square random variables with m and n degrees
of freedom, respectively. The distribution of
U/m
W =
V /n
is called the F distribution with m and n degrees of freedom and is denoted by
■
Fm,n .
PROPOSITION B
The density function of W is given by
[(m + n)/2] m m/2 m/2−1 m −(m+n)/2
f (w) =
w
,
1+ w
(m/2) (n/2) n
n
w≥0
Proof
W is the ratio of two independent random variables, and its density follows from
the results given in Section 3.6.1.
■
It can be shown that, for n > 2, E(W ) exists and equals n/(n − 2). From the
definitions of the t and F distributions, it follows that the square of a tn random
variable follows an F1,n distribution (see Problem 6 at the end of this chapter).
6.3 The Sample Mean and the Sample Variance
195
6.3 The Sample Mean and the Sample Variance
Let X 1 , . . . , X n be independent N (μ, σ 2 ) random variables; we sometimes refer to
them as a sample from a normal distribution. In this section, we will find the joint
and marginal distributions of
X =
n
1
Xi
n i=1
1 (X i − X )2
n − 1 i=1
n
S2 =
These are called the sample mean and the sample variance, respectively. First note
that because X is a linear combination of independent normal random variables, it is
normally distributed with
E(X ) = μ
Var(X ) =
σ2
n
As a preliminary to showing that X and S 2 are independently distributed, we
establish the following theorem.
THEOREM A
The random variable X and the vector of random variables (X 1 − X , X 2 −
X , . . . , X n − X ) are independent.
Proof
At the level of this course, it is difficult to give a proof that provides sufficient
insight into why this result is true; a rigorous proof essentially depends on geometric properties of the multivariate normal distribution, which this book does not
cover. We present a proof based on moment-generating functions; in particular,
we will show that the joint moment-generating function
M(s, t1 , . . . , tn ) = E{exp[s X + t1 (X 1 − X ) + · · · + tn (X n − X )]}
factors into the product of two moment-generating functions—one of X and the
other of (X 1 − X ), . . . , (X n − X ). The factoring implies (Section 4.5) that the
random variables are independent of each other and is accomplished through
some algebraic trickery. First we observe that since
n
i=1
ti (X i − X ) =
n
i=1
ti X i − n X t̄
196
Chapter 6
Distributions Derived from the Normal Distribution
then
sX +
n
ti (X i − X ) =
i=1
n +
s
n
i=1
=
n
,
+ (ti − t̄ ) X i
ai X i
i=1
where
s
+ (ti − t̄ )
n
ai =
Furthermore, we observe that
n
ai = s
i=1
n
s2 +
(ti − t̄ )2
n
i=1
n
ai2 =
i=1
Now we have
M(s, t1 , . . . , tn ) = M X 1 ···X n (a1 , . . . , an )
and since the X i are independent normal random variables, we have
M(s, t1 , . . . , tn ) =
n
-
M X i (ai )
i=1
=
n
-
exp μai +
i=1
= exp μ
n
i=1
σ2 2
a
2 i
n
σ2 2
ai +
a
2 i=1 i
σ2
= exp μs +
2
s2
n
n
σ2 (ti − t̄ )2
+
2 i=1
n
σ2 2
σ2 s exp
= exp μs +
(ti − t̄ )2
2n
2 i=1
The first factor is the mgf of X . Since the mgf of the vector (X 1 − X , . . . , X n − X )
can be obtained by setting s = 0 in M, the second factor is this mgf.
■
6.3 The Sample Mean and the Sample Variance
197
COROLLARY A
X and S 2 are independently distributed.
Proof
This follows immediately since S 2 is a function of the vector (X 1 − X , . . . ,
■
X n − X ), which is independent of X .
The next theorem gives the marginal distribution of S 2 .
THEOREM B
The distribution of (n − 1)S 2 /σ 2 is the chi-square distribution with n − 1 degrees
of freedom.
Proof
We first note that
n n
1 Xi − μ
2
(X
−
μ)
=
i
σ 2 i=1
σ
i=1
2
∼ χn2
Also,
n
n
1 1 2
(X
−
μ)
=
[(X i − X ) + (X − μ)]2
i
σ 2 i=1
σ 2 i=1
Expanding the square and using the fact that
n
i=1 (X i
− X ) = 0, we obtain
n
n
1 X −μ
1 2
2
√
(X
−
μ)
=
(X
−
X
)
+
i
i
σ 2 i=1
σ 2 i=1
σ/ n
2
This is a relation of the form W = U + V . Since U and V are independent
by Corollary A, MW (t) = MU (t)MV (t). W and V both follow chi-square distributions, so
MW (t)
MU (t) =
MV (t)
=
(1 − 2t)−n/2
(1 − 2t)−1/2
= (1 − 2t)−(n−1)/2
2
The last expression is the mgf of a random variable with a χn−1
distribution.
One final result concludes this chapter’s collection.
■
198
Chapter 6
Distributions Derived from the Normal Distribution
COROLLARY B
Let X and S 2 be as given at the beginning of this section. Then
X −μ
√ ∼ tn−1
S/ n
Proof
We simply express the given ratio in a different form:
X −μ
√
X −μ
σ/ n
√ = S/ n
S 2 /σ 2
The latter is the ratio of an N (0, 1) random variable to the square root of an
2
distribution divided by its degrees of
independent random variable with a χn−1
freedom. Thus, from the definition in Section 6.2, the ratio follows a t distribution
■
with n − 1 degrees of freedom.
6.4 Problems
1. Prove Proposition A of Section 6.2.
2. Prove Proposition B of Section 6.2.
3. Let X be the average of a sample of 16 independent normal random variables
with mean 0 and variance 1. Determine c such that
P(|X | < c) = .5
4. If T follows a t7 distribution, find t0 such that (a) P(|T | < t0 ) = .9 and
(b) P(T > t0 ) = .05.
5. Show that if X ∼ Fn,m , then X −1 ∼ Fm,n .
6. Show that if T ∼ tn , then T 2 ∼ F1,n .
7. Show that the Cauchy distribution and the t distribution with 1 degree of freedom are the same.
8. Show that if X and Y are independent exponential random variables with λ = 1,
then X/Y follows an F distribution. Also, identify the degrees of freedom.
9. Find the mean and variance of S 2 , where S 2 is as in Section 6.3.
10. Show how to use the chi-square distribution to calculate P(a < S 2 /σ 2 < b).
11. Let X 1 , . . . , X n be a sample from an N (μ X , σ 2 ) distribution and Y1 , . . . , Ym be
an independent sample from an N (μY , σ 2 ) distribution. Show how to use the
F distribution to find P(S X2 /SY2 > c).
CHAPTER
7
Survey Sampling
7.1 Introduction
Resting on the probabilistic foundations of the preceding chapters, this chapter marks
the beginning of our study of statistics by introducing the subject of survey sampling.
As well as being of considerable intrinsic interest and practical utility, the development
of the elementary theory of survey sampling serves to introduce several concepts and
techniques that will recur and be amplified in later chapters.
Sample surveys are used to obtain information about a large population by examining only a small fraction of that population. Sampling techniques have been used
in many fields, such as the following:
•
•
•
•
•
Governments survey human populations; for example, the U.S. government conducts health surveys and census surveys.
Sampling techniques have been extensively employed in agriculture to estimate
such quantities as the total acreage of wheat in a state by surveying a sample of
farms.
The Interstate Commerce Commission has carried out sampling studies of rail and
highway traffic. In one such study, records of shipments of household goods by
motor carriers were sampled to evaluate the accuracy of preshipment estimates of
charges, claims for damages, and other variables.
In the practice of quality control, the output of a manufacturing process may be
sampled in order to examine the items for defects.
During audits of the financial records of large companies, sampling techniques may
be used when examination of the entire set of records is impractical.
The sampling techniques discussed here are probabilistic in nature—each member of the population has a specified probability of being included in the sample, and
the actual composition of the sample is random. Such techniques differ markedly from
199
200
Chapter 7
Survey Sampling
the type of sampling scheme in which particular population members are included
in the sample because the investigator thinks they are typical in some way. Such a
scheme may be effective in some situations, but there is no way mathematically to
guarantee its unbiasedness (a term that will be precisely defined later) or to estimate
the magnitude of any error committed, such as that arising from estimating the population mean by the sample mean. We will see that using a random sampling technique
has a consequence that estimates can be guaranteed to be unbiased and probabilistic
bounds on errors can be calculated. Among the advantages of using random sampling
are the following:
•
•
•
•
•
The selection of sample units at random is a guard against investigator biases, even
unconscious ones.
A small sample costs far less and is much faster to survey than a complete enumeration.
The results from a small sample may actually be more accurate than those from a
complete enumeration. The quality of the data in a small sample can be more easily
monitored and controlled, and a complete enumeration may require a much larger,
and therefore perhaps more poorly trained, staff.
Random sampling techniques make possible the calculation of an estimate of the
error due to sampling.
In designing a sample, it is frequently possible to determine the sample size necessary to obtain a prescribed error level.
Peck et al. (2005) contains several interesting papers about applications of
sampling.
7.2 Population Parameters
This section defines those numerical characteristics, or parameters, of the population
that we will estimate from a sample. We will assume that the population is of size
N and that associated with each member of the population is a numerical value of
interest. These numerical values will be denoted by x1 , x2 , · · ·, x N . The variable xi
may be a numerical variable such as age or weight, or it may take on the value 1 or
0 to denote the presence or absence of some characteristic such as gender. We will
refer to the latter situation as the dichotomous case.
EXAMPLE A
This is the first of many examples in this chapter in which we will illustrate ideas
by using a study by Herkson (1976). The population consists of N = 393 shortstay hospitals. We will let xi denote the number of patients discharged from the ith
hospital during January 1968. A histogram of the population values is shown in Figure 7.1. The histogram was constructed in the following way: The number of hospitals
that discharged 0–200, 201– 400, . . . , 2801–3000 patients were graphed as horizontal lines above the respective intervals. For example, the figure indicates that about
7.2 Population Parameters
201
80
Count
60
40
20
0
0
500
1000 1500 2000 2500 3000 3500
Number of discharges
F I G U R E 7.1 Histogram of the numbers of patients discharged during January 1968
from 393 short-stay hospitals.
40 hospitals discharged from 601 to 800 patients. The histogram is a convenient
graphical representation of the distribution of the values in the population, being
■
more quickly assimilated than would a list of 393 values.
We will be particularly interested in the population mean, or average,
μ=
N
1 xi
N i=1
For the population of 393 hospitals, the mean number of discharges is 814.6. Note
the location of this value in Figure 7.1. In the dichotomous case, where the presence
or absence of a characteristic is to be determined, μ equals the proportion, p, of
individuals in the population having the particular characteristic, because in the sum
above, each xi is either 0 or 1. The sum thus reduces to the number of 1s and when
divided by N , gives the proportion, p.
The population total is
τ=
N
xi = N μ
i=1
The total number of people discharged from the population of hospitals is τ =
320,138. In the dichotomous case, the population total is the total number of members
of the population possessing the characteristic of interest.
We will also need to consider the population variance,
σ2 =
N
1 (xi − μ)2
N i=1
202
Chapter 7
Survey Sampling
A useful identity can be obtained by expanding the square in this equation:
1
σ =
N
2
1
=
N
N
xi2
− 2μ
i=1
N
N
xi + N μ
2
i=1
xi2
− 2N μ + N μ
2
2
i=1
N
1 2
=
x − μ2
N i=1 i
In the dichotomous case, the population variance reduces to p(1 − p):
σ2 =
N
1 2
x − μ2
N i=1 i
= p − p2
= p(1 − p)
Here we used the fact that because each xi is 0 or 1, each xi2 is also 0 or 1.
The population standard deviation is the square root of the population variance
and is used as a measure of how spread out, dispersed, or scattered the individual values
are. The standard deviation is given in the same units (for example, inches) as are the
population values, whereas the variance is given in those units squared. The variance
of the discharges is 347,766, and the standard deviation is 589.7; examination of
the histogram in Figure 7.1 makes it clear that the latter number is the more reasonable
description of the spread of the population values.
7.3 Simple Random Sampling
The most elementary form of sampling is simple random sampling (s.r.s.): Each
particular
sample of size n has the same probability of occurrence; that is, each of the
N possible
samples of size n taken without replacement has the same probability.
n
We assume that sampling is done without replacement so that each member of the
population will appear in the sample at most once. The actual composition of the
sample is usually determined by using a table of random numbers or a random number
generator on a computer. Conceptually, we can regard the population members as
balls in an urn, a specified number of which are selected for inclusion in the sample
at random and without replacement.
Because the composition of the sample is random, the sample mean is random.
An analysis of the accuracy with which the sample mean approximates the population
mean must therefore be probabilistic in nature. In this section, we will derive some
statistical properties of the sample mean.
7.3 Simple Random Sampling
203
7.3.1 The Expectation and Variance of the Sample Mean
We will denote the sample size by n (n is less than N ) and the values of the sample
members by X 1 , X 2 , . . . , X n . It is important to realize that each X i is a random variable. In particular, X i is not the same as xi : X i is the value of the ith member of the sample, which is random and xi is that of the ith member of the population, which is fixed.
We will consider the sample mean,
X=
n
1
Xi
n i=1
as an estimate of the population mean. As an estimate of the population total, we will
consider
T = NX
Properties of T will follow readily from those of X . Since each X i is a random
variable, so is the sample mean; its probability distribution is called its sampling
distribution. In general, any numerical value, or statistic, computed from a random
sample is a random variable and has an associated sampling distribution. The sampling
distribution of X determines how accurately X estimates μ; roughly speaking, the
more tightly the sampling distribution is centered on μ, the better the estimate.
EXAMPLE A
To illustrate the concept of a sampling distribution, let us look again at the population
of 393 hospitals. In practice, of course, the population would not be known, and only
one sample would be drawn. For pedagogical purposes here, we can consider the
sampling distribution of the sample mean from this known population. Suppose, for
example, that we want to find the sampling
distribution of the mean of a sample of size
samples and compute the mean of each one—
16. In principle, we could form all 393
16
this would give the sampling distribution. But because the number of such samples is
of the order 1033 , this is clearly not practical. We will thus employ a technique known
as simulation. We can estimate the sampling distribution of the mean of a sample of
size n by drawing many samples of size n, computing the mean of each sample, and
then forming a histogram of the collection of sample means. Figure 7.2 shows the
results of such a simulation for sample sizes of 8, 16, 32, and 64 with 500 replications
for each sample size. Three features of Figure 7.2 are noteworthy:
1. All the histograms are centered about the population mean, 814.6.
2. As the sample size increases, the histograms become less spread out.
3. Although the shape of the histogram of population values (Figure 7.1) is not
symmetric about the mean, the histograms in Figure 7.2 are more nearly so.
These features will be explained quantitatively.
■
As we have said, X is a random variable whose distribution is determined by
that of the X i . We thus examine the distribution of a single sample element, X i . It
should be noted that the following lemma holds whether sampling is with or without
replacement.
Count
Survey Sampling
120
100
80
60
40
20
0
200 400
(a)
600
800 1000 1200 1400 1600 1800
600
800 1000 1200 1400 1600 1800
600
800 1000 1200 1400 1600 1800
600
800 1000 1200 1400 1600 1800
140
Count
100
60
20
0
200 400
(b)
Count
Chapter 7
120
100
80
60
40
20
0
200 400
(c)
140
100
Count
204
60
20
0
200 400
(d)
F I G U R E 7.2 Histograms of the values of the mean number of discharges in 500
simple random samples from the population of 393 hospitals. Sample sizes: (a) n = 8,
(b) n = 16, (c) n = 32, (d) n = 64.
We need to be careful about the values that the random variable X i can assume.
The i th sample member is equally likely to be any of the N population members. If
all the population values were distinct, we would then have P(X 1 = x j ) = 1/N .
But the population values may not be distinct (for example, in the dichotomous case
7.3 Simple Random Sampling
205
there are only two values, 0 and 1). If k members of the population have the same
value ζ , then P(X i = ζ ) = k/N . We use this construction in proving the following
lemma.
LEMMA A
Denote the distinct values assumed by the population members by ζ1 , ζ2 , . . . , ζm ,
and denote the number of population members that have the value ζ j by n j , j =
1, 2, . . . , m. Then X i is a discrete random variable with probability mass
function
nj
P(X i = ζ j ) =
N
Also,
E(X i ) = μ
Var(X i ) = σ 2
Proof
The only possible values that X i can assume are ζ1 , ζ2 , . . . , ζm . Since each member of the population is equally likely to be the ith member of the sample, the
probability that X i assumes the value ζ j is thus n j /N . The expected value of the
random variable X i is then
E(X i ) =
m
ζ j P(X i = ζ j ) =
j=1
m
1 n jζj = μ
N j=1
The last equation follows because n j population members have the value ζ j
and the sum is thus equal to the sum of the values of all the population members.
Finally,
Var(X i ) = E X i2 − [E(X i )]2
=
m
1 n j ζ j2 − μ2
N j=1
= σ2
N
2
Here we have used the fact that
i=1 x i =
population variance derived in Section 7.2.
m
2
j=1 n j ζ j
and the identity for the
■
As a measure of the center of the sampling distribution, we will use E(X ). As a
measure of the dispersion of the sampling distribution about this center, we will use
the standard deviation of X . The key results that will be obtained shortly are that the
sampling distribution is centered at μ and that its spread is inversely proportional to
the square root of the sample size, n. We first show that the sampling distribution is
centered at μ.
206
Chapter 7
Survey Sampling
THEOREM A
With simple random sampling, E(X ) = μ.
Proof
Since, from Lemma A, E(X i ) = μ, it follows from Theorem A in Section 4.1.2
that
n
1
E(X ) =
E(X i ) = μ
■
n i=1
From Theorem A, we have the following corollary.
COROLLARY A
With simple random sampling, E(T ) = τ.
Proof
E(T ) = E(N X )
= N E(X )
= Nμ
=τ
■
In the dichotomous case, μ = p, and X is the proportion of the sample that
possesses the characteristic of interest. In this case, X will be denoted by p̂. We have
shown that E( p̂) = p.
It is important to keep in mind that X is random. The result E(X ) = μ can be
interpreted to mean that “on the average” X = μ. In general, if we wish to estimate
a population parameter, θ say, by a function θ̂ of the sample, X 1 , X 2 , . . . , X n , and
E(θ̂) = θ, whatever the value of θ may be, we say that θ̂ is unbiased. Thus, X
and T are unbiased estimates of μ and τ . On average they are correct. We next
investigate how variable they are, by deriving their variances and standard deviations.
Section 4.2.1 introduced the concepts of bias and variance in the context of a model
of measurement error, and these concepts are also relevant in this new context. In
Chapter 4, it was shown that
Mean squared error = variance + bias2
Since X and T are unbiased, their mean squared errors are equal to their variances.
n
We next find Var(X ). Since X = n −1 i=1
X i , it follows from Corollary A of
Section 4.3 that
n
n
1 Cov(X i , X j )
Var(X ) = 2
n i=1 j=1
7.3 Simple Random Sampling
207
Suppose that sampling were done with replacement. Then the X i would be independent, and for i = j we would have Cov(X i , X j ) = 0, whereas Cov(X i , X i ) =
Var(X i ) = σ 2 . It would then follow that
Var X =
=
n
1 Var(X i )
n 2 i=1
σ2
n
and that the standard deviation of X , also called its standard error, would be
σ
σX = √
n
Sampling without replacement induces dependence among the X i , which complicates this simple result. However, we will see that if the sample size n is small
relative to the population size N , the dependence is weak and this simple result holds
to a good approximation.
To find the variance of the sample mean in sampling without replacement we
need to find Cov(X i , X j ) for i = j.
LEMMA B
For simple random sampling without replacement,
Cov(X i , X j ) = −σ 2 /(N − 1)
if i = j
Using the identity for covariance established at the beginning of Section 4.3,
Cov(X i , X j ) = E(X i X j ) − E(X i )E(X j )
and
E(X i X j ) =
m
m ζk ζl P(X i = ζk and X j = ζl )
k=1 l=1
=
m
ζk P(X i = ζk )
k=1
m
ζl P(X j = ζl |X i = ζk )
l=1
from the multiplication law for conditional probability. Now,
P(X j = ζl |X i = ζk ) =
n l /(N − 1),
(n l − 1)/(N − 1),
if k = l
if k = l
Now if we express
m
ζl P(X j = ζl |X i = ζk ) =
ζl
nl
nk − 1
+ ζk
N −1
N −1
ζl
nl
1
− ζk
N −1
N −1
l =k
l=1
=
m
l=1
208
Chapter 7
Survey Sampling
the expression for E(X i X j ) becomes
m
m
m
ζk
1
nk nl
2
2
−
=
τ −
ζk
ζl
ζk n k
N l=1 N − 1
N −1
N (N − 1)
k=1
k=1
τ2
1
−
ζ 2nk
N (N − 1)
N (N − 1) k=1 k
m
=
1
N μ2
−
(μ2 + σ 2 )
N −1
N −1
σ2
= μ2 −
N −1
2
Finally, subtracting E(X i )E(X j ) = μ from the last equation, we have
=
Cov(X i , X j ) = −
σ2
N −1
for i = j.
■
(Alternative proofs of Lemma B are outlined in Problems 25 and 26 at the end of
this chapter.) This lemma shows that X i and X j are not independent of each other for
i = j, but that the covariance is very small for large values of N . We are now able to
derive the following theorem.
THEOREM B
With simple random sampling,
σ2
Var(X ) =
n
=
σ2
n
N −n
N −1
1−
n−1
N −1
Proof
From Corollary A of Section 4.3,
Var(X ) =
n
n
1 Cov(X i , X j )
n 2 i=1 j=1
n
n
1 1 Var(X i ) + 2
Cov(X i , X j )
= 2
n i=1
n i=1 j =i
1
σ2
σ2
− 2 n(n − 1)
n
n
N −1
After some algebra, this gives the desired result.
=
■
7.3 Simple Random Sampling
209
Notice that the variance of the sample mean in sampling without replacement
differs from that in sampling with replacement by the factor
n−1
1−
N −1
which is called the finite population correction. The ratio n/N is called the sampling
fraction. Frequently, the sampling fraction is very small, in which case the standard
error (standard deviation) of X is
σ
σX ≈ √
n
We see that, apart from the usually small finite population correction, the spread of the
sampling distribution and therefore the precision of X are determined by the sample
size (n) and not by the population size (N ). As will be made more explicit later,
the appropriate measure of the precision of the sample mean is its standard error,
which is inversely proportional to the square root of the sample size. Thus, in order
to double the accuracy, the sample size must be quadrupled. (You might examine
Figure 7.2 with this in mind.) The other factor that determines the accuracy of the
sample mean is the population standard deviation, σ . If σ is small, the population
values are not very dispersed and a small sample will be fairly accurate. But if the
values are widely dispersed, a much larger sample will be required in order to attain
the same accuracy.
EXAMPLE B
If the population of hospitals is sampled without replacement and the sample size is
n = 32,
*
σ
n−1
1−
σX = √
n
N −1
*
589.7
31
= √
1−
392
32
= 104.2 × .96
= 100.0
Notice that because the sampling fraction is small, the finite population correction
makes little difference. To see that σ X = 100.0 is a reasonable measure of accuracy,
examine part (b) of Figure 7.2 and observe that the vast majority of sample means
differed from the population mean (814) by less than two standard errors; i.e., the
■
vast majority of sample means were in the interval (614, 1014).
EXAMPLE C
Let us apply this result to the problem of estimating a proportion. In the population of
hospitals, a proportion p = .654 had fewer than 1000 discharges. If this proportion
were estimated from a sample as the sample proportion p̂, the standard error of p̂
210
Chapter 7
Survey Sampling
could be found by applying Theorem B to this dichotomous case:
*
σ p̂ =
p(1 − p)
n
*
1−
n−1
N −1
For example, for n = 32, the standard error of p̂ is
*
.654 × .346
σ p̂ =
32
= .08
*
1−
31
392
■
The precision of the estimate of the population total does depend on the population
size, N .
COROLLARY B
With simple random sampling,
Var(T ) = N 2
σ2
n
N −n
N −1
Proof
Since T = N X ,
Var(T ) = N 2 Var(X )
■
7.3.2 Estimation of the Population Variance
A sample survey is used to estimate population parameters, and it is desirable also
to assess and quantify the variability of the estimates. In the previous section, we
saw how the standard error of an estimate may be determined from the sample size
and the population variance. In practice, however, the population variance will not
be known, but as we will show in this section, it can be estimated from the sample.
Since the population variance is the average squared deviation from the population
mean, estimating it by the average squared deviation from the sample mean seems
natural:
σ̂ 2 =
n
1
(X i − X )2
n i=1
7.3 Simple Random Sampling
211
The following theorem shows that this estimate is biased.
THEOREM A
With simple random sampling,
E(σ̂ 2 ) = σ 2
n−1
n
N
N −1
Proof
Expanding the square and proceeding as in the identity for the population variance
in Section 7.2, we find
n
1 2
σ̂ 2 =
X − X2
n i=1 i
Thus,
E(σ̂ 2 ) =
Now, we know that
n
1 2
E X i − E(X 2 )
n i=1
E X i2 = Var(X i ) + [E(X i )]2
= σ 2 + μ2
Similarly, from Theorems A and B of Section 7.3.1,
E(X 2 ) = Var(X ) + [E(X )]2
σ2
n−1
=
1−
+ μ2
n
N −1
Substituting these expressions for E(X i2 ) and E(X 2 ) in the preceding equation
for E(σ̂ 2 ) gives the desired result.
■
Because N > n, it follows with a little algebra that
n−1 N
<1
n N −1
so that E(σ̂ 2 ) < σ 2 ; σ̂ 2 thus tends to underestimate σ 2 . From Theorem A, we see
that an unbiased estimate of σ 2 may be obtained by multiplying σ̂ 2 by the factor
n
1
n(N −1)/[(n −1)N ]. Thus, an unbiased estimate of σ 2 is n−1
(1− N1 ) i=1
(X i − X )2 .
We also have the following corollary.
212
Chapter 7
Survey Sampling
COROLLARY A
An unbiased estimate of Var(X ) is
σ̂ 2
n
N −1
N −n
s X2 =
n n−1
N
N −1
s2 n
=
1−
n
N
where
n
1 2
s =
(X i − X )2
n − 1 i=1
Proof
Since
σ2
Var(X ) =
n
N −n
N −1
an unbiased estimate of Var(X ) may be obtained by substituting in an unbiased
■
estimate of σ 2 . Algebra then yields the desired result.
Similarly, an unbiased estimate of the variance of T , the estimator of the population total, is
sT2 = N 2 s X2
For the dichotomous case, in which each X i is 0 or 1, note that
n
n
1
1 2
2
(X i − X )2 =
Xi − X
n i=1
n i=1
= p̂(1 − p̂)
Therefore,
n
p̂(1 − p̂)
n−1
Thus, as a special case of Corollary A, we have the following corollary.
s2 =
COROLLARY B
An unbiased estimate of Var( p̂) is
p̂(1 − p̂) n
s 2p̂ =
1−
n−1
N
■
In many cases, the sampling fraction, n/N , is small and may be neglected. Furthermore, it often makes little difference whether n − 1 or n is used as the divisor.
7.3 Simple Random Sampling
213
The quantities s X , sT , and s p̂ are called estimated standard errors. If we knew
them, the actual standard errors, σ X , σT and σ p̂ , would be used to gauge the accuracy
of the estimates X , T and p̂. If they are not known, which is the typical case, the
estimated standard errors are used in their place.
EXAMPLE A
A simple random sample of 50 of the 393 hospitals was taken. From this sample,
X = 938.5 (recall that, in fact, μ = 814.6) and s = 614.53 (σ = 590). An estimate
of the variance of X is
n
s2 1−
= 6592
s X2 =
n
N
The estimated standard error of X is
s X = 81.19
.
49
= 78.) This estimated standard error
(Note that the true value is σ X = √σ50 1 − 392
gives a rough idea of how accurate the value of X is; in this case, we see that the
magnitude of the error is of the order 80, as opposed to 8 or 800, say. In fact, the error
■
was 123.9, or about 1.5 s X .
EXAMPLE B
From the same sample, the estimate of the total number of discharges in the population
of hospitals is
T = N X = 368,831
Recall that the true value of the population total is 320,139. The estimated standard
error of T is
sT = N s X = 31,908
Again, this estimated standard error can be used as a rough gauge of the estimation
error.
■
EXAMPLE C
Let p be the proportion of hospitals that had fewer than 1000 discharges—that is,
p = .654. In the sample of Example A, 26 of 50 hospitals had fewer than 1000
discharges, so
p̂ =
26
= .52
50
The variance of p̂ is estimated by
p̂(1 − p̂) n
1−
= .0045
n−1
N
Thus, the estimated standard error of p̂ is
s 2p̂ =
s p̂ = .067
214
Chapter 7
Survey Sampling
Crudely, this tells us that the error of p̂ is in the second or first decimal place—that
we are probably not so fortunate as to have an error only in the third decimal place.
■
In fact, the error was .134 or about 2 × s p̂ .
These examples show how, in simple random sampling, we can not only form
estimates of unknown population parameters, but can also gauge the likely size of the
errors of the estimates, by estimating their standard errors from the data in the sample.
We have covered a lot of ground, and the presence of the finite population correction complicates the expressions we have derived. It is thus useful to summarize
our results in the following table:
Population
Parameter
Estimate
σ2
n
p̂ = sample proportion
σ p̂2 =
p(1− p)
n
T = NX
σT2 = N 2 σ X2
X=
p
σ2
N −n σ X2 =
μ
τ
Variance of Estimate
1−
1
n
1
N
n
i=1
Xi
N −1
N −n N −1
Estimated Variance
s X2 =
s2
n
s 2p̂ =
p̂(1− p̂)
n−1
1−
n
N
1−
n
N
sT2 = N 2 s X2
s2
n
1
2
where s 2 = n−1
i=1 (X i − X ) .
The square roots of the entries in the third column are called standard errors,
and the square roots of the entries in the fourth column are called estimated standard
errors. The former depend on unknown population parameters, so the latter are used
to gauge the accuracy of the parameter estimates. When the population is large relative
to the sample size, the finite population correction can be ignored, simplifying the
preceding expressions.
7.3.3 The Normal Approximation to the Sampling
Distribution of X
We have found the mean and the standard deviation of the sampling distribution of X .
Ideally, we would like to know the sampling distribution, since it would tell us everything we could hope to know about the accuracy of the estimate. Without knowledge
of the population itself, however, we cannot determine the sampling distribution. In
this section, we will use the central limit theorem to deduce an approximation to
the sampling distribution—the normal, or Gaussian, distribution. This approximation
will be used to find probabilistic bounds for the estimation error.
In Section 5.3, we considered a sequence of independent and identically distributed (i.i.d.) random variables, X 1 , X 2 , . . . having the common mean and variance
μ and σ 2 . The sample mean of X 1 , X 2 , . . . , X n is
Xn =
n
1
Xi
n i=1
7.3 Simple Random Sampling
215
This sample mean has the properties
E(X n ) = μ
and
σ2
n
The central limit theorem says that, for a fixed number z,
Xn − μ
√ ≤ z → (z)
as n → ∞
P
σ/ n
Var(X n ) =
where is the cumulative distribution function of the standard normal distribution.
Using a more compact and suggestive notation, we have
Xn − μ
P
≤ z → (z)
σX n
The context of survey sampling is not exactly like that of the central limit theorem
as stated above—as we have seen, in sampling without replacement, the X i are not
independent of each other, and it makes no sense to have n tend to infinity while N
remains fixed. But other central limit theorems have been proved that are appropriate
to the sampling context. These show that if n is large, but still small relative to N ,
then X n , the mean of a simple random sample, is approximately normally distributed.
To demonstrate the use of the central limit theorem, we will apply it to approximate P(|X − μ| ≤ δ), the probability that the error made in estimating μ by X is
less than some constant δ
P(|X − μ| ≤ δ) = P(−δ ≤ X − μ ≤ δ)
δ
X −μ
δ
=P −
≤
≤
σX
σX
σX
δ
δ
≈
− −
σX
σX
δ
= 2
−1
σX
since (−z) = 1 − (z), from the symmetry of the standard normal distribution
about zero.
EXAMPLE A
Let us again consider the population of 393 hospitals. The standard deviation of the
mean of a sample of size n = 64 is, using the finite population correction,
*
σ
n−1
1−
σX = √
n
N −1
*
589.7
63
= 67.5
=
1−
8
392
We can use the central limit theorem to approximate the probability that the
sample mean differs from the population mean by more than 100 in absolute value; i.e.,
216
Chapter 7
Survey Sampling
P(|X − μ| > 100). First, from the symmetry of the normal distribution,
P(|X − μ| > 100) ≈ 2P(X − μ > 100)
and
P(X − μ > 100) = 1 − P(X − μ < 100)
X −μ
100
= 1− P
<
σX
σX
100
≈ 1−
67.5
= .069
Thus the probability that the sample mean differs from the population mean by more
than 100 is approximately .14. In fact, among the 500 samples of size 64 in Example
A in Section 7.3.1, 82, or 16.4%, differed by more than 100 from the population mean.
Similarly, the central limit theorem approximation gives .026 as the probability of
deviations of more than 150 from the population mean. In the simulation in Example
A in Section 7.3.1, 11 of 500, or 2.2%, differed by more than 150. If we are not too
finicky, the central limit theorem gives us reasonable and useful approximations. ■
EXAMPLE B
For a sample of size 50, the standard error of the sample mean number of discharges
is
σ X = 78
For the particular sample of size 50 discussed in Example A in Section 7.3.2, we
found X = 938.35, so X − μ = 123.9. We now calculate an approximation of the
probability of an error this large or larger:
P(|X − μ| ≥ 123.9) = 1 − P(|X − μ| < 123.9)
123.9
≈ 1 − 2
−1
78
= 2 − 2(1.59)
= .11
Thus, we can expect an error this large or larger to occur about 11% of the time.
EXAMPLE C
■
In Example C in Section 7.3.2, we found from the sample of size 50 an estimate
p̂ = .52 of the proportion of hospitals that discharged fewer than 1000 patients; in
fact, the actual proportion in the population is .65. Thus, | p̂ − p | = .13. What is the
probability that an estimate will be off by an amount this large or larger?
We have
*
*
p(1 − p)
n−1
1−
σ p̂ =
n
N −1
= .068 × .94 = .064
7.3 Simple Random Sampling
217
We can therefore calculate
P(| p − p̂| > .13) = 1 − P(| p − p̂| ≤ .13)
.13
| p − p̂|
≤
= 1− P
σ p̂
σ p̂
≈ 2[1 − (2.03)] = .04
We see that the sample was rather “unlucky”—an error this large or larger would
■
occur only about 4% of the time.
We can now derive a confidence interval for the population mean, μ. A confidence interval for a population parameter, θ, is a random interval, calculated from the
sample, that contains θ with some specified probability. For example, a 95% confidence interval for μ is a random interval that contains μ with probability .95; if we
were to take many random samples and form a confidence interval from each one,
about 95% of these intervals would contain μ. If the coverage probability is 1 − α,
the interval is called a 100(1 − α)% confidence interval. Confidence intervals are
frequently used in conjunction with point estimates to convey information about the
uncertainty of the estimates.
For 0 ≤ α ≤ 1, let z(α) be that number such that the area under the standard
normal density function to the right of z(α) is α (Figure 7.3). Note that the symmetry
of the standard normal density function about zero implies that z(1 − α) = −z(α).
If Z follows a standard normal distribution, then, by definition of z(α),
P(−z(α/2) ≤ Z ≤ z(α/2)) = 1 − α
From the central limit theorem, (X − μ)/σ X has approximately a standard normal
distribution, so
X −μ
≤ z(α/2) ≈ 1 − α
P −z(α/2) ≤
σX
.4
f (z)
.3
.2
.1
0
3
2
1
0
z
1
2
3
z ()
F I G U R E 7.3 A standard normal density showing α and z(α).
Survey Sampling
Elementary manipulation of the inequalities gives
P(X − z(α/2)σ X ≤ μ ≤ X + z(α/2)σ X ) ≈ 1 − α
800
Number of discharges
1000
1200
That is, the probability that μ lies in the interval X ± z(α/2)σ X is approximately
1 − α. The interval is thus called a 100(1 − α)% confidence interval. It is important
to understand that this interval is random and that the preceding equation states that
the probability that this random interval covers μ is 1 − α. In practice, α is assigned a
small value, such as .1, .05, or .01, so that the probability that the interval covers μ will
be large. Also, since the population variance is typically not known, s X is substituted
for σ X . For large samples, it can be shown that the effect of this substitution is
practically negligible. It is impossible to give a precise answer to the question “How
large is large?” As a rule of thumb, a value of n greater than 25 or 30 is usually
adequate.
To illustrate the concept of a confidence interval, 20 samples each of size n = 25
were drawn from the population of hospital discharges. From each of these 20 samples,
an approximate 95% confidence interval for μ, the mean number of discharges, was
computed. These 20 confidence intervals are displayed as vertical lines in Figure 7.4;
the dashed line in the figure is drawn at the true value, μ = 814.6. Notice that it so
600
Chapter 7
400
218
F I G U R E 7.4 Vertical lines are 20 approximate 95% confidence intervals for μ. The
horizontal line is the true value of μ.
7.3 Simple Random Sampling
219
happened that all the confidence intervals included μ; since these are 95% intervals,
on the average 5%, or 1 out of 20, would not include μ.
The following example illustrates the procedure for calculating confidence
intervals.
EXAMPLE D
A particular area contains 8000 condominium units. In a survey of the occupants, a
simple random sample of size 100 yields the information that the average number of
motor vehicles per unit is 1.6, with a sample standard deviation of .8. The estimated
standard error of X is thus
*
s
n
1−
sX = √
n
N
*
.8
100
=
1−
10
8000
= .08
Note that the finite population correction makes almost no difference. Since z(.025) =
1.96, a 95% confidence interval for the population average is X ± 1.96s X , or (1.44,
1.76).
An estimate of the total number of motor vehicles is T = 8000 × 1.6 = 12,800.
The estimated standard error of T is
sT = N s X = 640
A 95% confidence interval for the total number of motor vehicles is T ± 1.96sT , or
(11,546, 14,054).
In the same survey, 12% of the respondents said they planned to sell their condos
within the next year; p̂ = .12 is an estimate of the population proportion p. The
estimated standard error is
*
*
p̂(1 − p̂)
100
= .03
1−
s p̂ =
n−1
8000
A 95% confidence interval for p is p̂ ± 1.96s p̂ , or (.06, .18).
The total number of owners planning to sell is estimated as T = N p̂ = 960. The
estimated standard error of T is sT = N s p̂ = 240. A 95% confidence interval for the
number in the population planning to sell is T ± 1.96sT , or (490, 1430). The proper
interpretation of this interval, (490, 1430), is a little subtle. We cannot state that the
probability is 0.95 and that the number of owners planning to sell is between 490 and
1430, because that number is either in this interval or not. What is true is that 95% of
intervals formed in this way will contain the true number in the long run. This interval
is like one of those shown in Figure 7.4; in the long run, 95% of those intervals will
contain the true number of discharges, but in the figure any particular interval either
■
does or doesn’t contain the true number.
The width of a confidence interval is determined by the sample size n and the
population standard deviation σ . If σ is known approximately, perhaps from earlier
220
Chapter 7
Survey Sampling
samples of the population, n can be chosen so as to obtain a confidence interval close
to some desired length. Such analysis is usually an important aspect of planning the
design of a sample survey.
EXAMPLE E
The interval for the total number of owners planning to sell in Example D might be
considered too wide for practical purposes; reducing its width would require a larger
sample size. Suppose that an interval with a half-width of 200 is desired. Neglecting
the finite population correction, the half-width is
*
p̂(1 − p̂)
5095
=√
1.96sT = 1.96N
n−1
n−1
Setting the last expression equal to 200 and solving for n yields n = 650 as the
necessary sample size.
■
Let us summarize: The fundamental result of this section is that the sampling
distribution of the sample mean is approximately Gaussian. This approximation can be
used to quantify the error committed in estimating the population mean by the sample
mean, thus giving us a good understanding of the accuracy of estimates produced
by a simple random sample. We next introduced the idea of a confidence interval,
a random interval that contains a population parameter with a specified probability
and thus provides an assessment of the accuracy of the corresponding estimate of that
parameter. We have seen in our examples that the width of the confidence interval is a
multiple of the estimated standard deviation of the estimate; for example, a confidence
interval for μ is X ± ks X , where the constant k depends on the coverage probability
of the interval.
7.4 Estimation of a Ratio
The foundations of the theory of survey sampling have been laid in the preceding sections on simple random sampling. This and the next section build on that foundation,
developing some advanced topics in survey sampling.
In this section, we consider the estimation of a ratio. Suppose that for each member
of a population, two values, x and y, may be measured. The ratio of interest is
N
yi
r=
i=1
N
=
xi
μy
μx
i=1
Ratios arise frequently in sample surveys; for example, if households are sampled,
the following ratios might be calculated:
•
If y is the number of unemployed males aged 20–30 in a household and x is the
number of males aged 20–30 in a household, then r is the proportion of unemployed
males aged 20–30.
7.4 Estimation of a Ratio
•
•
221
If y is weekly food expenditure and x is number of inhabitants, then r is weekly
food cost per inhabitant.
If y is the number of motor vehicles and x is the number of inhabitants of driving
age, then r is the number of motor vehicles per inhabitant of driving age.
In a survey of farms, y might be the acres of wheat planted and x the total acreage.
In an inventory audit, y might be the audited value of an item and x the book value.
In this section, we first consider directly the problem of estimating a ratio. Later,
we will use the estimation of a ratio as a technique for estimating μ y . We will produce
a new estimate, the ratio estimate, which we will compare to the ordinary estimate, Y .
Before continuing, we note the elementary but sometimes overlooked fact that
r =
N
1 yi
N i=1 xi
Suppose that a sample is drawn consisting of the pairs (X i , Yi ); the natural
estimate of r is R = Y /X . We wish to derive expressions for E(R) and Var(R), but
since R is a nonlinear function of the random variables X and Y , we cannot do this
in closed form. We will therefore employ the approximate methods of Section 4.6.
In order to calculate the approximate variance of R, we need to know Var(X ),
Var(Y ), and Cov(X , Y ). The first two quantities we know from Theorem B of Section
7.3.1. For the last quantity, we define the population covariance of x and y to be
σx y =
N
1 (xi − μx )( yi − μ y )
N i=1
It can then be shown, in a manner entirely analogous to the proof of Theorem B in
Section 7.3.1, that
σx y
n−1
Cov(X , Y ) =
1−
n
N −1
From Example C in Section 4.6, we have the following theorem.
THEOREM A
With simple random sampling, the approximate variance of R = Y /X is
1 Var(R) ≈ 2 r 2 σ X2 + σY2 − 2r σ X Y
μx
1
n−1
1 2 2
r σx + σ y2 − 2r σx y
=
1−
2
n
N − 1 μx
■
The population correlation coefficient is defined as
σx y
ρ=
σx σ y
and is used as a measure of the strength of the linear relationship between the x and
y values in the population. It can be shown that −1 ≤ ρ ≤ 1; large values of ρ
222
Chapter 7
Survey Sampling
indicate a strong positive relationship between x and y, and small values indicate a
strong negative relationship. (See Figure 4.7 for some illustrations of correlation.)
The equation in Theorem A can be expressed in terms of the population correlation
coefficient as follows:
1
n−1
1 2 2
r σx + σ y2 − 2rρσx σ y
Var(R) ≈
1−
2
n
N − 1 μx
From this expression, we see that strong correlation of the same sign as r decreases the
variance. We also note that the variance is affected by the size of μx —if μx is small,
the variance is large, essentially because small values of X in the ratio R = Y /X
cause R to fluctuate wildly.
We now consider the approximate expectation of R. From Example C in Section
4.6 and the preceding calculations, we have the following theorem.
THEOREM B
With simple random sampling, the expectation of R is given approximately by
1
n−1
1 2
r
σ
E(R) ≈ r +
−
ρσ
σ
■
1−
x
y
x
n
N − 1 μ2x
From the equation in Theorem B, we see that strong correlation of the same
sign as r decreases the bias and that the bias is large if μx is small. Furthermore,
note that the bias is of the order 1/n, so its contribution to the mean squared error is
of the order 1/n 2 . In comparison, the contribution of the variance is of the order 1/n.
Therefore, for large samples, the bias is negligible compared to the standard error of
the estimate.
For large samples, truncating the Taylor series after the linear term provides a
good approximation, since the deviations X − μ X and Y − μY are likely to be small.
To this order of approximation, R is expressed as a linear combination of X and Y ,
and an argument based on the central limit theorem can be used to show that R is
approximately normally distributed. Approximate confidence intervals can thus be
formed for r by using the normal distribution.
In order to estimate the standard error of R, we substitute R for r in the formula
of Theorem A. The x and y population variances are estimated by sx2 and s y2 . The
population covariance is estimated by
1 (X i − X )(Yi − Y )
n − 1 i=1
n
1
X i Yi − n X Y
=
n − 1 i=1
n
sx y =
(as can be seen by expanding the product), and the population correlation is estimated
by
sx y
ρ̂ =
sx s y
7.4 Estimation of a Ratio
The estimated variance of R is thus
1
n−1
s R2 =
1−
n
N −1
223
1
(R 2 sx2 + s y2 − 2Rsx y )
2
X
An approximate 100(1 − α)% confidence interval for r is R ± z(α/2)s R .
EXAMPLE A
Suppose that 100 people who recently bought houses are surveyed, and the monthly
mortgage payment and gross income of each buyer are determined. Let y denote the
mortgage payment and x the gross income. Suppose that
X = $3100
s y = $250
ρ̂ = .85
Y = $868
sx = $1200
R = .28
Neglecting the finite population correction, the estimated standard error of R is
1
1
.282 × 12002 + 2502 − 2 × .28 × .85 × 250 × 1200
sR =
10 3100
= .006
An approximate 95% confidence interval for r is .28 ±(1.96) × (.006), or .28 ± .012.
Note that the high correlation between x and y causes the standard error of R to be
small. We can use the observed values for the variances, covariances, and means to
gauge the order of magnitude of the bias by substituting them in place of the population
parameters in the formula of Theorem B. Doing so, and again neglecting the finite
population correction, gives the value .00015 for the bias, which is negligible relative
to s R . Note that the large value of X and the large positive correlation coefficient
■
cause the bias to be small.
Ratios may also be used as tools for estimating population means and totals.
To illustrate the concept, we return to the example of hospital discharges. For this
population, the number of beds in each hospital is also known; let us denote the number
of beds in the ith hospital by xi and the number of discharges by yi . Suppose that
all the xi are known, perhaps from an earlier enumeration, before a sample has been
taken to estimate the number of discharges, and that we would like to take advantage
of this information. One way to do this is to form a ratio estimate of μ y :
μx
YR =
Y = μx R
X
where X is the average number of beds and Y is the average number of discharges in
the sample. The idea is fairly simple: We expect xi and yi to be closely related in the
population, since a hospital with a large number of beds should tend to have a large
number of discharges. This is borne out by Figure 7.5, a scatterplot of the number
of discharges versus the number of beds. If X < μx , the sample underestimates the
number of beds and probably the number of discharges as well; multiplying Y by
μx /X increases Y to Y R .
Survey Sampling
3000
2500
Discharges
2000
1500
1000
500
0
0
200
400
600
800
1000
Beds
F I G U R E 7.5 Scatterplot of the number of discharges versus the number of beds for
the 393 hospitals.
120
Count
Chapter 7
80
40
0
500
600
(a)
700
800
900
1000
Mean of simple random sample
1100
1000
1100
120
Count
224
80
40
0
500
(b)
600
700
800
900
Ratio estimate
F I G U R E 7.6 (a) A histogram of the means of 500 simple random samples of size 64
from the population of discharges; (b) a histogram of the values of 500 ratio estimates
of the mean number of discharges from samples of size 64.
To see how this ratio estimate works in practice, it was simulated from 500 samples of size 64 from the population of hospitals. The histogram of the results is shown
in Figure 7.6 along with the histogram of the means of 500 simple random samples
of size 64. The comparison shows dramatically how effective the ratio estimate is at
reducing variability.
7.4 Estimation of a Ratio
225
Two more examples will illustrate the scope of the ratio estimation method.
EXAMPLE B
Suppose that we want to estimate the total number of unemployed males aged 20–30
from a sample of households and that we know τx , the total number of males aged
20–30, from census data. The ratio estimate is
Y
X
where Y is the average number of unemployed males aged 20–30 per household in
the sample, and X is the sample average number of males aged 20–30 per household.
■
TR = τx
EXAMPLE C
A sample of items in an inventory is taken to estimate the total value of the inventory.
Let Yi be the audited value of the ith sample item, and let X i be its book value. We
assume that τx , the total book value of the inventory, is known, and we estimate the
total audited value by
TR = τx
Y
X
■
We will now analyze the observed success of the ratio estimate. Since Y R = μ X R,
Var(Y R ) = μ2X Var(R). From Theorem A, we thus have the following.
COROLLARY A
The approximate variance of the ratio estimate of μ y is
1
n−1 2 2
r σx + σ y2 − 2rρσx σ y
Var(Y R ) ≈
1−
n
N −1
■
Similarly, from Theorem B, we have another corollary.
COROLLARY B
The approximate bias of the ratio estimate of μ y is
1
n−1
1 2
r σx − ρσx σ y
E(Y R ) − μY ≈
1−
n
N − 1 μx
■
When will the ratio estimate Y R be better than the ordinary estimate Y ? In the following, the finite population correction is neglected for simplicity. Since the variance
of the ordinary estimate Y is
Var(Y ) =
σ y2
n
226
Chapter 7
Survey Sampling
the ratio estimate has a smaller variance if
r 2 σx2 − 2rρσx σ y < 0
or (provided r > 0, for example)
2ρσ y > r σx
Letting C x = σx /μx and C y = σ y /μ y , this last inequality is equivalent to
1 Cx
ρ>
2 Cy
C x and C y are called coefficients of variation and give the standard deviation as a
proportion of the mean. (Coefficients of variation are often more meaningful than
standard deviations. For example, a standard deviation of 10 means one thing if the
true value of the quantity being measured is 100 and something entirely different if
the true value is 10,000.)
In order to assess the accuracy of Y R , Var(Y R ) can be estimated from the sample.
COROLLARY C
The variance of Y R can be estimated by
1
n−1 2 2
R sx + s y2 − 2Rsx y
sY2 R =
1−
n
N −1
and an approximate 100(1 − α)% confidence interval for μ y is (Y R ±
■
z( α2 )sY R ).
EXAMPLE D
For the population of 393 hospitals, we have
μx = 274.8
μ y = 814.6
r = 2.96
σx = 213.2
σ y = 589.7
ρ = .91
Thus,
1
(2.962 × 213.22 + 589.72 − 2 × 2.96 × .91 × 213.2 × 589.7)
n
68,697.4
=
n
Var(Y R ) ≈
and
262.1
σY R ≈ √
n
Including the finite population correction, the linearized approximation predicts that,
with n = 64,
*
1
63
= 30.0
σY R = (262.1) 1 −
8
392
7.5 Stratified Random Sampling
227
The actual standard deviation of the 500 sample values displayed in Figure 7.6 is
29.9, which is remarkably close. The mean of the 500 values is 816.2, compared to
the population mean of 814.6; the slight apparent bias is consistent with Corollary B.
In contrast, the standard deviation of Y from a simple random sample of size
n = 64 is
*
σ
n−1
1−
σY = √
n
N −1
*
589.7
63
=
1−
8
329
= 66.3
The comparison of σY to σY R is consistent with the substantial reduction in variability
accomplished by using a ratio estimate of μ y shown in Figure 7.6.
The following is another way of interpreting this comparison. If a simple random
sample of size n 1 is taken, the variance of the estimate is Var(Y ) = 589.72 /n 1 . A
ratio estimate from a sample of size n 2 will have the same variance if
589.72
262.12
=
n2
n1
or
n2 = n1
262.1
589.7
2
= .1975n 1
Thus, in this case, we can obtain the same precision from a ratio estimate using a
sample about 80% smaller than the simple random sample. Note that this comparison
neglects the bias of the ratio estimate, which is justifiable in this case because the bias
is quite small. Here is a case in which a biased estimate performs substantially better
than an unbiased estimate, the bias being quite small and the reduction in variance
■
being quite large.
7.5 Stratified Random Sampling
7.5.1 Introduction and Notation
In stratified random sampling, the population is partitioned into subpopulations, or
strata, which are then independently sampled. The results from the strata are then
combined to estimate population parameters, such as the mean.
Following are some examples that suggest the range of situations in which stratification is natural:
•
•
•
In auditing financial transactions, the transactions may be grouped into strata on
the basis of their nominal values. For example, high-value, medium-value, and
low-value strata might be formed.
In samples of human populations, geographical areas often form natural strata.
In a study of records of shipments of household goods by motor carriers, the carriers
were grouped into three strata: large carriers, medium carriers, and small carriers.
228
Chapter 7
Survey Sampling
Stratified samples are used for a variety of reasons. We are often interested in
obtaining information about each of a number of natural subpopulations in addition
to information about the population as a whole. The subpopulations might be defined
by geographical areas or age groups. In an industrial application in which the population consists of items produced by a manufacturing process, relevant subpopulations
might consist of items produced during different shifts or from different lots of raw
material. The use of a stratified random sample guarantees a prescribed number of
observations from each subpopulation, whereas the use of a simple random sample
can result in underrepresentation of some subpopulations. A second reason for using
stratification is that, as will be shown below, the stratified sample mean can be considerably more precise than the mean of a simple random sample, especially if the
population members within each stratum are relatively homogeneous and if there is
considerable variation between strata.
In the next section, properties of the stratified sample mean are derived. Since
a simple random sample is taken within each stratum, the results will follow easily
from the derivations of earlier sections. The section after that takes up the problem
of how to allocate the total number of observations, n, among the various strata.
Comparisons will be made of the efficiencies of different allocation schemes and
also of the precisions of these allocation schemes relative to that of a simple random
sample of the same total size.
7.5.2 Properties of Stratified Estimates
Suppose there are L strata in all. Let the number of population elements in stratum
1 be denoted by N1 , the number in stratum 2 be N2 , etc. The total population size
is N = N1 + N2 + . . . + N L . The population mean and variance of the lth stratum
are denoted by μl and σl2 . The overall population mean can be expressed in terms of
the μl as follows. Let xil denote the ith population value in the lth stratum and let
Wl = Nl /N denote the fraction of the population in the lth stratum. Then
μ=
=
=
Nl
L
1 xil
N l=1 i=1
L
1 Nl μl
N l=1
L
Wl μl
l=1
Within each stratum, a simple random sample of size n l is taken. The sample
mean in stratum l is denoted by
Xl =
nl
1 X il
n l i=1
Here X il denotes the ith sample value in the lth stratum. Note that X l is the mean of
a simple random sample from the population consisting of the lth stratum, so from
Theorem A of Section 7.3.1, E(X l ) = μl . By analogy with the preceding relationship
7.5 Stratified Random Sampling
229
between the overall population mean and the population means of the various strata,
the obvious estimate of μ is
Xs =
=
L
Nl X l
N
l=1
L
Wl X l
l=1
THEOREM A
The stratified estimate, X s , of the population mean is unbiased.
Proof
E(X s ) =
L
Wl E(X l )
l=1
=
L
1 Nl μl
N l=1
=μ
■
Since we assume that the samples from different strata are independent of one
another and that within each stratum a simple random sample is taken, the variance
of X s can be easily calculated.
THEOREM B
The variance of the stratified sample mean is given by
L
1
nl − 1
Wl2
Var(X s ) =
1−
nl
Nl − 1
l=1
σl2
Proof
Since the X l are independent,
Var(X s ) =
L
Wl2 Var(X l )
l=1
From Theorem B of Section 7.3.1, we have
1
nl − 1
Var(X l ) =
1−
nl
Nl − 1
Therefore, the desired result follows.
σl2
■
230
Chapter 7
Survey Sampling
If the sampling fractions within all strata are small,
L
Wl2 σl2
nl
l=1
Var(X s ) ≈
EXAMPLE A
We again consider the population of hospitals. As we did in the discussion of ratio
estimates, we assume that the number of beds in each hospital is known but that the
number of discharges is not. We will try to make use of this knowledge by stratifying
the hospitals according to the number of beds. Let stratum A consist of the 98 smallest
hospitals, stratum B of the 98 next larger, stratum C of the 98 next larger, and stratum
D of the 99 largest. The following table shows the results of this stratification of
hospitals by size:
Stratum
Nl
Wl
μl
σl
A
B
C
D
98
98
98
99
.249
.249
.249
.251
182.9
526.5
956.3
1591.2
103.4
204.8
243.5
419.2
Suppose that we use a sample of total size n and let
n
n1 = n2 = n3 = n4 =
4
so that we have equal sample sizes in each stratum. Then, from Theorem B, neglecting
the finite population corrections and using the numerical values in the preceding table,
we have
4
Wl2 σl2
Var(X s ) =
n1
l=1
and
=
4
4 2 2
W σ
n l=1 l l
=
72, 042.6
n
268.4
σX s = √
n
The standard deviation of the mean of a simple random sample is
587.7
σX = √
n
Comparing the two standard deviations, we see that a tremendous gain in precision
has resulted from the stratification. The ratio of the variances is .20; thus a stratified
estimate based on a total sample size of n/5 is as precise as a simple random sample
of size n. The reduction in variance due to stratification is comparable to that achieved
7.5 Stratified Random Sampling
231
by using a ratio estimate (Example D in Section 7.4). In later parts of this section, we
will look more analytically at why the stratification done here produced such dramatic
■
improvement.
Let us next consider the stratified estimate of the population total, Ts = N X s .
From Theorem B, we have the following corollary.
COROLLARY A
The expectation and variance of the stratified estimate of the population total are
E(Ts ) = τ
and
Var(Ts ) = N 2 Var(X s )
L
1
Nl2
=
n
l
l=1
1−
nl − 1
Nl − 1
σl2
■
In order to estimate the standard errors of X s and Ts , the variances of the individual
strata must be separately estimated and substituted into the preceding formulae. The
estimate of σl2 is given by
l
1 (X il − X l )2
n l − 1 i=1
n
sl2 =
Var(X s ) is estimated by
s X2 s
=
L
Wl2
l=1
1
nl
1−
nl
Nl
sl2
The next example illustrates how this variance estimate can be used to find
approximate confidence intervals for μ based on X s .
EXAMPLE B
A sample of size 10 was drawn from each of the four strata of hospitals described in
Example A, yielding the following:
X 1 = 240.6
s12 = 6827.6
X 2 = 507.4
s22 = 23,790.7
X 3 = 865.1
X 4 = 1716.5
s32 = 42,573.0
s42 = 152,099.6
232
Chapter 7
Survey Sampling
Therefore, X s = 832.5. The variance of the stratified sample mean is estimated by
s X2 s =
4
1 2
nl − 1
Wl 1 −
10 l=1
Nl − 1
sl2
= 1282.0
Thus,
s X s = 35.8
An approximate 95% confidence interval for the population mean number of discharges is X s ± 1.96sx̄s , or (762.4, 902.7).
The total number of discharges is estimated by Ts = 393X s = 327,172. The
standard error of Ts is estimated by sTs = 393s X s = 14,069. An approximate
95% confidence interval for the population total is Ts ± 1.96sTs , or (299,596, 354,
■
748).
7.5.3 Methods of Allocation
In Section 7.5.2, it was shown that, neglecting the finite population correction,
Var(X s ) =
L
Wl2 σl2
nl
l=1
If the resources of a survey allow only a total of n units to be sampled, the question
arises of how to choose n 1 , . . . , n L to minimize Var(X s ) subject to the constraint
n 1 + · · · + n L = n.
For the sake of simplicity, the calculations in this section ignore the finite population correction within each stratum. The analysis may be extended to include these
corrections, but at the cost of some additional algebra. More complete results are
contained in Cochran (1977).
THEOREM A
The sample sizes n 1 , . . . , n L that minimize Var(X s ) subject to the constraint
n 1 + · · · + n L = n are given by
Wl σl
nl = n L
Wk σk
k=1
where l = 1, . . . , L .
7.5 Stratified Random Sampling
233
Proof
We introduce a Lagrange multiplier, and we must then minimize
L
L
Wl2 σl2
L(n 1 , . . . , n L , λ) =
+λ
nl − n
nl
l=1
l=1
For l = 1, . . . , L , we have
W 2σ 2
∂L
=− l2l +λ
∂n l
nl
Setting these partial derivatives equal to zero, we have the system of equations
Wl σl
nl = √
λ
for l = 1, . . . , L. To determine λ, we first sum these equations over l:
1 Wl σl
n= √
λ l=1
L
Thus,
1
√ =
λ
n
L
Wl σl
l=1
and
nl = n
Wl σl
L
Wl σl
l=1
which proves the theorem.
■
This theorem shows that those strata for which Wl σl is large should be sampled
heavily. This makes sense intuitively. If Wl is large, the stratum contains a large
fraction of the population; if σl is large, the population values in the stratum are
quite variable, and in order to obtain a good determination of the stratum’s mean, a
relatively large sample size must be used. This optimal allocation scheme is called
Neyman allocation.
Substituting the optimal values of n l as given in Theorem A into the equation for
Var(X s ) given in Theorem B in Section 7.5.2 gives us the following corollary.
COROLLARY A
Denoting by X so , the stratified estimate using the optimal allocations as given in
Theorem A and neglecting the finite population correction,
L
2
Wl σl
Var(X so ) = l=1
■
n
234
Chapter 7
EXAMPLE A
Survey Sampling
For the population of hospitals, the weights for optimal allocation, Wl σl /
are, from the table of Example A of Section 7.5.2,
Wl σl ,
Stratum
Weight
A
.106
B
.210
C
.250
D
.434
Note that, because of its larger standard deviation, stratum D is sampled more than
■
four times as heavily as stratum A.
The optimal allocations depend on the individual variances of the strata, which
generally will not be known. Furthermore, if a survey measures several attributes
for each population member, it is usually impossible to find an allocation that is
simultaneously optimal for all of those variables. A simple and popular alternative
method of allocation is to use the same sampling fraction in each stratum,
n2
nL
n1
=
=···=
N1
N2
NL
which holds if
Nl
= nWl
N
for l = 1, . . . , L. This method is called proportional allocation. The estimate of the
population mean based on proportional allocation is
nl = n
X sp =
L
Wl X l
l=1
=
L
l=1
Wl
nl
1 X il
n l i=1
nl
L
1 =
X il
n l=1 i=1
since Wl /n l = 1/n. This estimate is simply the unweighted mean of the sample
values.
THEOREM B
With stratified sampling based on proportional allocation, ignoring the finite
population correction,
L
1
Wl σl2
Var(X sp ) =
n l=1
7.5 Stratified Random Sampling
235
Proof
From Theorem B of Section 7.5.2, we have
Var(X sp ) =
L
Wl2 Var(X l )
l=1
=
L
l=1
Wl2
σl2
nl
Using n l = nWl , the result follows.
■
We now compare Var(X sp ) and Var(X so ) in order to discover the circumstances
under which optimal allocation is substantially better than proportional allocation.
THEOREM C
With stratified random sampling, the difference between the variance of the
estimate of the population mean based on proportional allocation and the variance
of that estimate based on optimal allocation is, ignoring the finite population
correction,
L
1
Var(X sp ) − Var(X so ) =
Wl (σl − σ̄ )2
n l=1
where
σ̄ =
L
Wl σl
l=1
Proof
⎡
L
2 ⎤
L
1
Var(X sp ) − Var(X so ) = ⎣
Wl σl2 −
Wl σl ⎦
n l=1
l=1
L
The term within the large brackets equals l=1
Wl (σl − σ̄ )2 , which may be
verified by expanding the square and collecting terms.
■
According to Theorem C, if the variances of the strata are all the same, proportional allocation yields the same results as optimal allocation. The more variable these
variances are, the better it is to use optimal allocation.
236
Chapter 7
EXAMPLE B
Survey Sampling
Let us calculate how much better optimal allocation is than proportional allocation
for the population of hospitals. From Theorem C and Corollary A, we have
1
Wl (σl − σ̄ )2
Var(X sp ) = Var(X so ) +
n
Therefore,
1
Var(X sp )
= 1+ n
Var(X so )
= 1+
Wl (σl − σ̄ )2
Var(X so )
Wl (σl − σ̄ )2
( Wl σl )2
= 1 + .218
Thus, under proportional allocation, the variance of the mean is about 20% larger
■
than it is under optimal allocation.
We can also compare the variance under simple random sampling with the variance under proportional allocation. The variance under simple random sampling is,
neglecting the finite population correction,
σ2
n
In order to compare this equation with that for the variance under proportional allocation, we need a relationship between the overall population variance, σ 2 , and the
strata variances, σl2 . The overall population variance may be expressed as
Var(X ) =
Nl
L
1 (xil − μ)2
N l=1 i=1
σ2 =
Also,
(xil − μ)2 = [(xil − μl ) + (μl − μ)]2
= (xil − μl )2 + 2(xil − μl )(μl − μ) + (μl − μ)2
When both sides of this last equation are summed over l, the middle term on the
Nl
xil , so we have
right-hand side becomes zero since Nl μl = l=1
Nl
(xil − μ) =
2
i=1
nl
(xil − μl )2 + Nl (μl − μ)2
i=1
= Nl σl2 + Nl (μl − μ)2
Dividing both sides by N and summing over l, we have
σ2 =
L
l=1
Wl σl2 +
L
l=1
Wl (μl − μ)2
7.5 Stratified Random Sampling
237
Substituting this expression for σ 2 into Var(X ) = σ 2 /n and using the formula for
Var(X sp ) given in Theorem B completes a proof of the following theorem.
THEOREM D
The difference between the variance of the mean of a simple random sample and
the variance of the mean of a stratified random sample based on proportional
allocation is, neglecting the finite population correction,
L
1
Wl (μl − μ)2
Var(X ) − Var(X sp ) =
n l=1
■
Thus, stratified random sampling with proportional allocation always gives a
smaller variance than does simple random sampling, providing that the finite population correction is ignored. Comparing the equations for the variances under simple
random sampling, proportional allocation, and optimal allocation, we see that stratification with proportional allocation is better than simple random sampling if the
strata means are quite variable and that stratification with optimal allocation is even
better than stratification with proportional allocation if the strata standard deviations
are variable.
EXAMPLE C
We calculate the improvement that would result from using stratification with proportional allocation rather than simple random sampling for the population of hospitals.
From Theorems B and D, we have
Var(X sr s )
Wl (μl − μ̄)2
= 1+
Wl σl2
Var(X sp )
= 1 + 3.83
As is frequently the case, the gain from using stratification with proportional allocation
rather than simple random sampling is much greater than the gain from using optimal
allocation rather than proportional allocation. Furthermore, proportional allocation
requires knowledge only of the sizes of the strata, whereas optimal allocation requires
knowledge of the standard deviations of the strata, and such knowledge is usually
unavailable.
■
Typically, stratified random sampling can result in substantial increases in precision for populations containing values that vary greatly in size. For example, a population of transactions, a sample of which is to be audited for errors, might contain
transactions in the hundreds of thousands of dollars and transactions in the hundreds
of dollars. If such a population were divided into several strata according to the dollar
amounts of the transactions, there might well be considerable variation in the mean
transaction errors between the strata, since there may be rather large errors on large
238
Chapter 7
Survey Sampling
transactions and small errors on small transactions. The variability of the errors might
also be larger in the former strata as well.
We have not addressed the question of how many strata to form and how to
define the strata. In order to construct the optimal number of strata, the population
values themselves, which are of course unknown, would have to be used. Stratification
must therefore be done on the basis of some related variable that is known (such as
transaction amount in the preceding paragraph) or on the results of earlier samples.
In practice, it usually turns out that such relationships are not strong enough to make
it worthwhile constructing more than a few strata.
7.6 Concluding Remarks
This chapter introduced survey sampling. It first covered the most elementary method
of probability sampling—simple random sampling. The theory of this method underlies the theory of more complex sampling techniques. Stratified sampling was also introduced and shown to increase the precision of estimates substantially in many cases.
Several concepts and techniques introduced here recur throughout statistics: the
concept of a random estimate of a population parameter, such as the population mean;
bias; the standard error of an estimate; confidence intervals based on the central limit
theorem; and linearization, or propagation of error.
The theory and technique of survey sampling go far beyond the material in
this introduction. One method that deserves mention because of its widespread use
is systematic sampling. The population members are given in a list. If, say, a 10%
sample is desired, every tenth member of the list is sampled starting from some random
point among the first ten. If the list is in totally random order, this method is similar
to simple random sampling. If, however, there is some correlation or relationship
between successive members, the method is more similar to stratified sampling. The
clear danger of this method is that there may be some periodic structure in the list, in
which case bias can ensue.
Another commonly used method is cluster sampling. In sampling residential
households, a survey might choose blocks randomly and then either sample every
dwelling on each chosen block or further subsample the dwellings. Because one
would expect dwellings within a single block to be relatively homogeneous, this
method is typically less precise than a simple random sample of the same size.
We have developed a mathematical model for survey sampling and have deduced
consequences of that model, including probabilistic error bounds for the estimates.
As is always the case, reality never quite matches the mathematical model. The
basic assumptions of the model are (1) that every population member appears in
the sample with a specified probability and (2) that an exact measurement or response
is obtained from every sample member. In practice, neither assumption will hold precisely. Converse and Traugott (1986) provide an interesting discussion of the practical
difficulties of polls and surveys and consequences for the variability of the estimates.
The first assumption may fail because of the difficulty of obtaining an exact enumeration of the population or because of imprecision in its definition. For
example, political surveys can be putatively based on all adults, all registered voters,
or all “likely” voters. However, the most serious problem with respect to the first
7.7 Problems
239
assumption is that of nonresponse. Response levels of only 60% to 70% are common
in surveys of human populations. The possibility of substantial bias clearly arises if
there is a relationship of potential answers to survey questions to the propensity to
respond to those questions. For example, adults living in families are easier to contact
by a telephone survey than those living alone, and the opinions of these two groups
may well differ on certain issues. It is important to realize that the standard errors
of estimates that we have developed earlier in this chapter account only for random
variability in sample composition, not for systematic biases.
The Literary Digest poll of 1936, which predicted a 57% to 43% victory for
Republican Alfred Landon over incumbent president Franklin Roosevelt, is one of
the most famous of flawed surveys. Questionnaires were mailed to about 10 million
voters, who were selected from lists such as telephone books and club memberships,
and approximately 2.4 million of the questionnaires were returned. There were two
intrinsic problems: (1) nonresponse—those who did not respond may have voted differently from those who did—and (2) selection bias—even if all 10 million voters
had responded, they would not have constituted a random sample; those in lower
socioeconomic classes (who were more likely to vote for Roosevelt) were less likely
to have telephone service or belong to clubs and thus less likely to be included in
the sample than were wealthier voters. The assumption that an exact measurement is
obtained from every member of the sample may also be in error. In surveys conducted
by interviewers, the interviewer’s approach and personality may affect the response.
In surveys that use questionnaires, the wording of the questions and the context within
which they are lodged can have an effect. An interesting example is a poll conducted
by Stanley Presser, (New Yorker, Oct 18, 2004). Half of the sample was asked, “Do
you think the United States should allow public speeches against democracy?” The
other half was asked, “Do you think the United States should forbid public speeches
against democracy?” 56% said no to the first question, and 39% said yes to the second.
The interesting paper by Hansen in Tanur et al. (1972) reports on efforts of the U.S.
Bureau of the Census to investigate these sorts of problems.
7.7 Problems
1. Consider a population consisting of five values—1, 2, 2, 4, and 8. Find the
population mean and variance. Calculate the sampling distribution of the mean
of a sample of size 2 by generating all possible such samples. From them, find
the mean and variance of the sampling distribution, and compare the results to
Theorems A and B in Section 7.3.1.
2. Suppose that a sample of size n = 2 is drawn from the population of the preceding
problem and that the proportion of the sample values that are greater than 3 is
recorded. Find the sampling distribution of this statistic by listing all possible
such samples. Find the mean and variance of the sampling distribution.
3. Which of the following is a random variable?
a. The population mean
b. The population size, N
240
Chapter 7
Survey Sampling
c.
d.
e.
f.
g.
h.
The sample size, n
The sample mean
The variance of the sample mean
The largest value in the sample
The population variance
The estimated variance of the sample mean
4. Two populations are surveyed with simple random samples. A sample of size n 1
is used for population I, which has a population standard deviation σ1 ; a sample of
size n 2 = 2n 1 is used for population II, which has a population standard deviation
σ2 = 2σ1 . Ignoring finite population corrections, in which of the two samples
would you expect the estimate of the population mean to be more accurate?
5. How would you respond to a friend who asks you, “How can we say that the
sample mean is a random variable when it is just a number, like the population
mean? For example, in Example A of Section 7.3.2, a simple random sample of size 50 produced x̄ = 938.5; how can the number 938.5 be a random
variable?”
6. Suppose that two populations have equal population variances but are of different
sizes: N1 = 100,000 and N2 = 10,000,000. Compare the variances of the sample
means for a sample of size n = 25. Is it substantially easier to estimate the mean
of the smaller population?
7. Suppose that a simple random sample is used to estimate the proportion of families
in a certain area that are living below the poverty level. If this proportion is roughly
.15, what sample size is necessary so that the standard error of the estimate is .02?
8. A sample of size n = 100 is taken from a population that has a proportion
p = 1/5.
a. Find δ such that P(| p̂ − p| ≥ δ) = 0.025.
b. If, in the sample, p̂ = 0.25, will the 95% confidence interval for p contain
the true value of p?
9. In a simple random sample of 1,500 voters, 55% said they planned to vote for a
particular proposition, and 45% said they planned to vote against it. The estimated
margin of victory for the proposition is thus 10%. What is the standard error of
this estimated margin? What is an approximate 95% confidence interval for the
margin?
10. True or false (and state why):
If a sample from a population is large, a histogram of the values in the sample
will be approximately normal, even if the population is not normal.
11. Consider a population of size four, the members of which have values x1 , x2 , x3 , x4 .
a. If simple random sampling were used, how many samples of size two are
there?
b. Suppose that rather than simple random sampling, the following sampling
scheme is used. The possible samples of size two are
{x1 , x2 }, {x2 , x3 }, {x3 , x4 }, {x1 , x4 }
7.7 Problems
241
and the sampling is done in such a way that each of these four possible samples
is equally likely. Is the sample mean unbiased?
12. Consider simple random sampling with replacement.
a. Show that
1 (X i − X )2
n − 1 i=1
n
s2 =
b.
c.
d.
e.
is an unbiased estimate of σ 2 .
Is s an unbiased estimate of σ ?
Show that n −1 s 2 is an unbiased estimate of σ X2 .
Show that n −1 N 2 s 2 is an unbiased estimate of σT2 .
Show that p̂(1 − p̂)/(n − 1) is an unbiased estimate of σ p̂2 .
13. Suppose that the total number of discharges, τ , in Example A of Section 7.2 is
estimated from a simple random sample of size 50. Denoting the estimate by T ,
use the central limit theorem to sketch the approximate probability density of the
error T − τ .
14. The proportion of hospitals in Example A of Section 7.2 that had fewer than 1000
discharges is p = .654. Suppose that the total number of hospitals having fewer
than 1000 discharges is estimated from a simple random sample of size 25. Use
the central limit theorem to sketch the approximate sampling distribution of the
estimate.
15. Consider estimating the mean of the population of hospital discharges (Example A of Section 7.2) from a simple random sample of size n. Use the normal
approximation to the distribution of X in answering the following:
a. Sketch P(|X − μ| > 200) as a function of n for 20 ≤ n ≤ 100.
b. For n = 20, 40, and 80, find such that P(|X − μ| > ) ≈ .10. Similarly,
find such that P(|X − μ| > ) ≈ .50.
16. True or false?
a. The center of a 95% confidence interval for the population mean is a random
variable.
b. A 95% confidence interval for μ contains the sample mean with probability
.95.
c. A 95% confidence interval contains 95% of the population.
d. Out of one hundred 95% confidence intervals for μ, 95 will contain μ.
17. A 90% confidence interval for the average number of children per household
based on a simple random sample is found to be (.7, 2.1). Can we conclude that
90% of households have between .7 and 2.1 children?
18. From independent surveys of two populations, 90% confidence intervals for the
population means are constructed. What is the probability that neither interval
contains the respective population mean? That both do?
19. This problem introduces the concept of a one-sided confidence interval. Using
the central limit theorem, how should the constant k be chosen so that the interval
242
Chapter 7
Survey Sampling
(−∞, X + ks X ) is a 90% confidence interval for μ—i.e., so that P(μ ≤ X +
ks X ) = .9? This is called a one-sided confidence interval. How should k be
chosen so that (X − ks X , ∞) is 95% one-sided confidence interval?
20. In Example D of Section 7.3.3, a 95% confidence interval for μ was found to be
(1.44, 1.76). Because μ is some fixed number, it either lies in this interval or it
doesn’t, so it doesn’t make any sense to claim that P(1.44 ≤ μ ≤ 1.76) = .95.
What do we mean, then, by saying this is a “95% confidence interval?”
21. In order to halve the width of a 95% confidence interval for a mean, by what factor
should the sample size be increased? Ignore the finite population correction.
22. An investigator quantifies her uncertainty about the estimate of a population mean
by reporting X ± s X . What size confidence interval is this?
23. a. Show that the standard error of an estimated proportion is largest when p =
1/2.
b. Use this result and Corollary B of Section 7.3.2 to conclude that the
quantity
/
N −n
1
2 N (n − 1)
is a conservative estimate of the standard error of p̂ no matter what the value
of p may be.
c. Use the central limit theorem to conclude that the interval
/
N −n
p̂ ±
N (n − 1)
contains p with probability at least .95.
24. For a random sample of size n from a population of size N , consider the following
as an estimate of μ:
Xc =
n
ci X i
i=1
where the ci are fixed numbers and X 1 , . . . , X n is the sample.
a. Find a condition on the ci such that the estimate is unbiased.
b. Show that the choice of ci that minimizes the variances of the estimate subject
to this condition is ci = 1/n, where i = 1, . . . , n.
25. Here is an alternative proof of Lemma B in Section 7.3.1. Consider a random
permutation Y1 , Y2 , . . . , Y N of x1 , x2 , . . . , x N . Argue that the joint distribution of
any subcollection, Yi1 , . . . , Yin , of the Yi is the same as that of a simple random
sample, X 1 , . . . , X n . In particular,
Var(Yi ) = Var(X k ) = σ 2
and
Cov(Yi , Y j ) = Cov(X k , X l ) = γ
7.7 Problems
243
if i = j and k = l. Since Y1 + Y2 + · · · + Y N = τ ,
N
Var
Yi = 0
i=1
N
Yi ) in terms of σ 2 and the unknown covariance, γ .
(Why?) Express Var( i=1
Solve for γ , and conclude that
σ2
γ =−
N −1
for i = j.
26. This is another proof of Lemma B in Section 7.3.1. Let Ui be a random variable with Ui = 1 if the ith population member is in the sample and equal to 0
otherwise.
N
Ui xi .
a. Show that the sample mean X = n −1 i=1
b. Show that P(Ui = 1) = n/N . Find E(Ui ), using the fact that Ui is a Bernoulli
random variable.
c. What is the variance of the Bernoulli random variable Ui ?
d. Noting that Ui U j is a Bernoulli random variable, find E(Ui U j ), i = j. (Be
careful to take into account that the sample is drawn without replacement.)
e. Find Cov(Ui , U j ), i = j.
f. Using the representation of X above, find Var(X ).
27. Suppose that the population size N is not known, but it is known that n ≤ N .
Show that the following procedure will generate a simple random sample of
size n. Imagine that the population is arranged in a long list that you can read
sequentially.
a. Let the sample initially consist of the the first n elements in the list.
b. For k = 1, 2, . . . , as long as the end of the list has not been encountered:
i. Read the (n + k)-th element in the list.
ii. Place it in the sample with probability n/(n + k) and, if it is placed in the
sample, randomly drop one of the exisiting sample members.
28. In surveys, it is difficult to obtain accurate answers to sensitive questions such as
“Have you ever used heroin?” or “Have you ever cheated on an exam?” Warner
(1965) introduced the method of randomized response to deal with such situations. A respondent spins an arrow on a wheel or draws a ball from an urn
containing balls of two colors to determine which of two statements to respond
to: (1) “I have characteristic A,” or (2) “I do not have characteristic A.” The interviewer does not know which statement is being responded to but merely records
a yes or a no. The hope is that an interviewee is more likely to answer truthfully
if he or she realizes that the interviewer does not know which statement is being
responded to. Let R be the proportion of a sample answering Yes. Let p be the
probability that statement 1 is responded to ( p is known from the structure of
the randomizing device), and let q be the proportion of the population that has
characteristic A. Let r be the probability that a respondent answers Yes.
a. Show that r = (2 p−1)q +(1− p). [Hint: P(yes) = P(yes given question 1) ×
P(question 1) + P(yes given question 2) × P(question 2).]
244
Chapter 7
Survey Sampling
b. If r were known, how could q be determined?
c. Show that E(R) = r , and propose an estimate, Q, for q. Show that the estimate
is unbiased.
d. Ignoring the finite population correction, show that
Var(R) =
r (1 − r )
n
where n is the sample size.
e. Find an expression for Var(Q).
29. A variation of the method described in Problem 28 has been proposed. Instead
of responding to statement 2, the respondent answers an unrelated question for
which the probability of a “yes” response is known, for example, “Were you born
in June?”
a. Propose an estimate of q for this method.
b. Show that the estimate is unbiased.
c. Obtain an expression for the variance of the estimate.
30. Compare the accuracies of the methods of Problems 28 and 29 by comparing their
standard deviations. You may do this by substituting some plausible numerical
values for p and q.
31. Referring to Example D in Section 7.3.3, how large should the sample be in order
that the 95% confidence interval for the total number of owners planning to sell
will have a width of 500?
32. Referring again to Example D in Section 7.3.3, suppose that a survey is done of
another condominium project of 12,000 units. The sample size is 200, and the
proportion planning to sell in this sample is .18.
a. What is the standard error of this estimate? Give a 90% confidence interval.
b. Suppose we use the notation p̂1 = .12 and p̂2 = .18 to refer to the proportions
in the two samples. Let dˆ = p̂1 − p̂2 be an estimate of the difference, d, of
the two population proportions p1 and p2 . Using the fact that p̂1 and p̂2 are
independent random variables, find expressions for the variance and standard
ˆ
error of d.
ˆ Use this
c. Because p̂1 and p̂2 are approximately normally distributed, so is d.
fact to construct 99%, 95%, and 90% confidence intervals for d. Is there clear
evidence that p1 is really different from p2 ?
33. Two populations are independently surveyed using simple random samples of
size n, and two proportions, p1 and p2 , are estimated. It is expected that both
population proportions are close to .5. What should the sample size be so that the
standard error of the difference, p̂1 − p̂2 , will be less than .02?
34. In a survey of a very large population, the incidences of two health problems are
to be estimated from the same sample. It is expected that the first problem will
affect about 3% of the population and the second about 40%. Ignore the finite
population correction in answering the following questions.
7.7 Problems
245
a. How large should the sample be in order for the standard errors of both estimates to be less than .01? What are the actual standard errors for this sample
size?
b. Suppose that instead of imposing the same limit on both standard errors, the
investigator wants the standard error to be less than 10% of the true value in
each case. What should the sample size be?
35. A simple random sample of a population of size 2000 yields the following
25 values:
104
86
91
104
79
109
80
103
98
87
111
119
99
98
94
109
88
108
83
92
87
122
96
107
97
a. Calculate an unbiased estimate of the population mean.
b. Calculate unbiased estimates of the population variance and Var(X ).
c. Give approximate 95% confidence intervals for the population mean and total.
2
36. With simple random sampling, is X an unbiased estimate of μ2 ? If not, what is
the bias?
37. Two surveys were independently conducted to estimate a population mean, μ.
Denote the estimates and their standard errors by X 1 and X 2 and σ X 1 and σ X 2 .
Assume that X 1 and X 2 are unbiased. For some α and β, the two estimates can
be combined to give a better estimator:
X = αX1 + β X2
a. Find the conditions on α and β that make the combined estimate unbiased.
b. What choice of α and β minimizes the variances, subject to the condition of
unbiasedness?
n
1 3
38. Let X 1 , . . . , X n be a simple random sample. Show that
X is an unbiased
n i=1 i
N
1
estimate of
x 3.
N i=1 i
39. Suppose that of a population of N items, k are defective in some way. For example, the items might be documents, a small proportion of which are fraudulent.
How large should a sample be so that with a specified probability it will contain
at least one of the defective items? For example, if N = 10,000, k = 50, and
p = .95, what should the sample size be? Such calculations are useful in planning
sample sizes for acceptance sampling.
40. This problem presents an algorithm for drawing a simple random sample from a
population in a sequential manner. The members of the population are considered
for inclusion in the sample one at a time in some prespecified order (for example,
the order in which they are listed). The ith member of the population is included
246
Chapter 7
Survey Sampling
in the sample with probability
n − ni
N −i +1
where n i is the number of population members already in the sample before the
ith member is examined. Show that the sample selected in this way is in fact
a simple random sample; that is, show that every possible sample occurs with
probability
1
N
n
41. In accounting and auditing, the following sampling method is sometimes used to
estimate a population total. In estimating the value of an inventory, suppose that
a book value exists for each item and is readily accessible. For each item in the
sample, the difference D, audited value minus book value, is determined. The
inventory value is estimated by the sum of the book values of the population and
N D, where N is the population size.
a. Show that the estimate is unbiased.
b. Find an expression for the variance of the estimate.
c. Compare the expression obtained in part (b) to the variance of the usual estimate, which is the product of N and the average audited value. Under what
circumstances would the proposed method be more accurate?
d. How could a ratio estimate be employed in this situation? Would there be any
advantage or disadvantage to using a ratio estimate rather than the proposed
method?
42. Show that the population correlation coefficient is less than or equal to 1 in
absolute value.
43. Suppose that for Example D in Section 7.3.3, the average number of occupants
per condominium unit in the sample is 2.2 with a sample standard deviation of
.7 and the sample correlation coefficient between the number of occupants and
the number of motor vehicles is .85. Estimate the population ratio of the number
of motor vehicles per occupant and its standard error. Find an approximate 95%
confidence interval for the estimate.
44. Show that
Var(Y R )
Cx
≈1+
Cy
Var(Y )
Cx
− 2ρ
Cy
Sketch the graph of this ratio as a function of C x /C y .
45. In the population of hospitals, the correlation of the number of beds and the number of discharges is ρ = .91 (Example D of Section 7.4). To see how Var(Y R )
would be different if the correlation were different, plot Var(Y R ) for n = 64 as
a function of ρ for −1 < ρ < 1.
7.7 Problems
247
46. Use the central limit theorem to sketch the approximate sampling distribution
of Y R for n = 64 for the population of hospitals. Compare to the approximate
sampling distribution of Y .
47. For the population of hospitals and a sample size of n = 64, find the approximate bias of Y R by applying Corollary B of Section 7.4 and compare it to the
approximate standard deviation of the estimate. Repeat for n = 128.
48. A simple random sample of 100 households located in a city recorded the number
of people living in the household, X , and the weekly expenditure for food, Y . It
is known that there are 100,000 households in the city. In the sample
X i = 320
Yi = 10,000
X i2 = 1250
Yi2 = 1,100,000
X i Yi = 36,000
Neglect the finite population correction in answering the following.
a. Estimate the ratio r = μ y /μx .
b. Form an approximate 95% confidence interval for μ y /μx .
c. Using only the data on Y estimate the total weekly food expenditure, τ , for
households in the city and form a 90% confidence interval.
49. In a wildlife survey, an area of desert land was divided into 1000 squares, or
“quadrats,” a simple random sample of 50 of which were surveyed. In each surveyed quadrat, the number of birds, Y , and the area covered by vegetation, X ,
were determined. It was found that
X i = 3000
Yi = 150
X i2 = 225,000
Yi2 = 650
X i Yi = 11,000
a. Estimate the ratio of the average number of birds per quadrat to the average
vegetation cover per quadrat.
b. Estimate the standard error of your estimate and find an approximate 90%
confidence interval for the population average.
c. Estimate the total number of birds and find an approximate 95% confidence
interval for the population total.
d. Suppose that from an aerial survey, the total area covered by vegetation could
easily be determined. How could this information be used to provide another
248
Chapter 7
Survey Sampling
estimate of the number of birds? Would you expect this estimate to be better
than or worse than that found in part (c)?
50. Hartley and Ross (1954) derived the following exact bound on the relative size
of the bias and standard error of a ratio estimate:
/ |E(R) − r |
σX
σx 1
n−1
≤
=
1−
σR
μx
μx n
N −1
a. Derive this bound from the relation
Y
X
Cov(R, X ) = E
X
−E
Y
X
E(X )
b. Apply the bound to Problem 43 using sample estimates in place of the given
population parameters.
51. This problem introduces a technique called the “jackknife,” originally proposed
by Quenouille (1956) for reducing bias. Many nonlinear estimates, including the
ratio estimator, have the property that
E(θ̂ ) = θ +
b1
b2
+ 2 +···
n
n
where θ̂ is an estimate of θ. The jackknife forms an estimate θ̂ J , which has a
leading bias term of the order n −2 rather than n −1 . Thus, for sufficiently large
n, the bias of θ̂ J is substantially smaller than that of θ̂. The technique involves
splitting the sample into several subsamples, computing the estimate for each
subsample, and then combining the several estimates. The sample is split into p
groups of size m, where n = mp. For j = 1, . . . , p, the estimate θ̂ j is calculated
from the m( p − 1) observations left after the jth group has been deleted. From
the preceding expression,
E(θ̂ j ) = θ +
b2
b1
+
+···
m( p − 1) [m( p − 1)]2
Now, p “pseudovalues” are defined:
V j = p θ̂ − ( p − 1)θ̂ j
The jackknife estimate, θ̂ J , is defined as the average of the pseudovalues:
θ̂ J =
p
1
Vj
p j=1
Show that the bias of θ̂ J is of the order n −2 .
52. A population consists of three strata with N1 = N2 = 1000 and N3 = 500.
A stratified random sample with 10 observations in each stratum yields the
7.7 Problems
249
following data:
Stratum 1 94
Stratum 2 183
Stratum 3 343
99
183
302
106
179
286
106
211
317
101
178
289
102
179
284
122
192
357
104
192
288
97
201
314
97
177
276
Estimate the population mean and total and give a 90% confidence interval.
53. The following table (Cochran 1977) shows the stratification of all farms in a
county by farm size and the mean and standard deviation of the number of acres
of corn in each stratum.
Farm Size
Nl
μl
σl
0–40
41–80
81–120
121–160
161–200
201–240
241 +
394
461
391
334
169
113
148
5.4
16.3
24.3
34.5
42.1
50.1
63.8
8.3
13.3
15.1
19.8
24.5
26.0
35.2
a. For a sample size of 100 farms, compute the sample sizes from each stratum
for proportional and optimal allocation, and compare them.
b. Calculate the variances of the sample mean for each allocation and compare
them to each other and to the variance of an estimate formed from simple
random sampling.
c. What are the population mean and variance?
d. Suppose that ten farms are sampled per stratum. What is Var(X s )? How large
a simple random sample would have to be taken to attain the same variance?
Ignore the finite population correction.
e. Repeat part (d) using proportional allocation of the 70 samples.
54. a. Suppose that the cost of a survey is C = C0 + C1 n, where C0 is a startup
cost and C1 is the cost per observation. For a given cost C, find the allocation n 1 , . . . , n L to L strata that is optimal in the sense that it minimizes the variance of the estimate of the population mean subject to the cost
constraint.
b. Suppose that the cost of an observation varies from stratum to stratum—in
some strata the observations might be relatively cheap and in others relatively
expensive. The cost of a survey with an allocation n 1 , . . . , n L is
C = C0 +
L
Cl n l
l=1
For a fixed total cost C, what choice of n 1 , · · ·, n L minimizes the variance?
c. Assuming that the cost function is as given in part (b), for a fixed variance,
find n l to minimize cost.
250
Chapter 7
Survey Sampling
55. The designer of a sample survey stratifies a population into two strata, H and L.
H contains 100,000 people, and L contains 500,000. He decides to allocate 100
samples to stratum H and 200 to stratum L, taking a simple random sample in
each stratum.
a. How should the designer estimate the population mean?
b. Suppose that the population standard deviation in stratum H is 20 and the
standard deviation in stratum L is 10. What will be the standard error of his
estimate?
c. Would it be better to allocate 200 samples to stratum H and 100 to stratum L?
d. Would it be better to use proportional allocation?
56. How might stratification be used in each of the following sampling problems?
a. A survey of household expenditures in a city.
b. A survey to examine the lead concentration in the soil in a large plot of land.
c. A survey to estimate the number of people who use elevators in a large building
with a single bank of elevators.
d. A survey of programs on a television station, taken to estimate the proportion
of time taken up by advertising on Monday through Friday from 6 P.M. until
10 P.M. Assume that 52 weeks of recorded broadcasts are available for analysis.
57. Consider stratifying the population of Problem 1 into two strata: (1, 2, 2) and (4,
8). Assuming that one observation is taken from each stratum, find the sampling
distribution of the estimate of the population mean and the mean and standard
deviation of the sampling distribution. Compare to Theorems A and B in Section
7.5.2 and the results of Problem 1.
58. (Computer Exercise) Construct a population consisting of the integers from 1 to
100. Simulate the sampling distribution of the sample mean of a sample of size
12 by drawing 100 samples of size 12 and making a histogram of the results.
59. (Computer Exercise) Continuing with Problem 58, divide the population into
two strata of equal size, allocate six observations per stratum, and simulate
the distribution of the stratified estimate of the population mean. Do the same
thing with four strata. Compare the results to each other and to the results of
Problem 58.
60. A population consists of two strata, H and L, of sizes 100,000 and 500,000 and
standard deviations 20 and 12, respectively. A stratified sample of size 100 is to
be taken.
a. Find the optimal allocation for estimating the population mean.
b. Find the optimal allocation for estimating the difference of the means of the
strata, μ H − μ L .
61. The value of a population mean increases linearly through time: μ(t) = α + βt
while the variance remains constant. Independent simple random samples of size
n are taken at times t = 1, 2, and 3.
a. Find conditions on w1 , w2 , and w3 such that
β̂ = w1 X 1 + w2 X 2 + w3 X 3
7.7 Problems
251
is an unbiased estimate of the rate of change, β. Here X i denotes the sample
mean at time ti .
b. What values of the wi minimize the variance subject to the constraint that the
estimate is unbiased?
62. In Example B of Section 7.5.2, the standard error of X s was estimated to be
s X s = 35.8. How good is this estimate—what is the actual standard error of X s ?
63. (Open-ended) Monte Carlo evaluation of an integral was introduced in Example
A of Section 5.2. Refer to that example for the following notation. Try to interpret
that method from the point of view of survey sampling by considering an “infinite
population” of numbers in the interval [0, 1], each population member x having
a value f (x). Interpret Î ( f ) as the mean of a simple random sample. What is
the standard error of Î ( f )? How could it be estimated? How could a confidence
interval for I ( f ) be formed? Do you think that anything could be gained by
stratifying the “population?” For example, the strata could be the intervals [0, .5)
and [.5, 1]. You might find it helpful to consider some examples.
64. The value of an inventory is to be estimated by sampling. The items are stratified
by book value in the following way:
Stratum
$1000 +
$200–1000
$1–200
Nl
μl
70
500
10,000
3000
500
90
σl
1250
100
30
a. What should the relative sampling fraction in each stratum be for proportional
and for optimal allocation? Ignore the finite population correction.
b. How do the variances under each type of allocation compare to each other and
to the variance under simple random sampling?
65. The disk file cancer contains values for breast cancer mortality from 1950 to
1960 (y) and the adult white female population in 1960 (x) for 301 counties in
North Carolina, South Carolina, and Georgia.
a. Make a histogram of the population values for cancer mortality.
b. What are the population mean and total cancer mortality? What are the population variance and standard deviation?
c. Simulate the sampling distribution of the mean of a sample of 25 observations
of cancer mortality.
d. Draw a simple random sample of size 25 and use it to estimate the mean and
total cancer mortality.
e. Estimate the population variance and standard deviation from the sample of
part (d).
f. Form 95% confidence intervals for the population mean and total from the
sample of part (d). Do the intervals cover the population values?
g. Repeat parts (d) through (f) for a sample of size 100.
h. Suppose that the size of the total population of each county is known and that
this information is used to improve the cancer mortality estimates by forming
a ratio estimator. Do you think this will be effective? Why or why not?
252
Chapter 7
Survey Sampling
i. Simulate the sampling distribution of ratio estimators of mean cancer mortality based on a simple random sample of size 25. Compare this result to that
of part (c).
j. Draw a simple random sample of size 25 and estimate the population mean and
total cancer mortality by calculating ratio estimates. How do these estimates
compare to those formed in the usual way in part (d) from the same data?
k. Form confidence intervals about the estimates obtained in part ( j).
l. Stratify the counties into four strata by population size. Randomly sample six
observations from each stratum and form estimates of the population mean
and total mortality.
m. Stratify the counties into four strata by population size. What are the sampling fractions for proportional allocation and optimal allocation? Compare
the variances of the estimates of the population mean obtained using simple
random sampling, proportional allocation, and optimal allocation.
n. How much better than those in part (m) will the estimates of the population
mean be if 8, 16, 32, or 64 strata are used instead?
66. A photograph of a large crowd on a beach is taken from a helicopter. The photo
is of such high resolution that when sections are magnified, individual people
can be identified, but to count the entire crowd in this way would be very timeconsuming. Devise a plan to estimate the number of people on the beach by using
a sampling procedure.
67. The data set families contains information about 43,886 families living in
the city of Cyberville. The city has four regions: the Northern region has 10,149
families, the Eastern region has 10,390 families, the Southern region has 13,457
families, and the Western region has 9,890. For each family, the following information is recorded:
1. Family type
1: Husband-wife family
2: Male-head family
3: Female-head family
2. Number of persons in family
3. Number of children in family
4. Family income
5. Region
1: North
2: East
3: South
4: West
6. Education level of head of household
31: Less than 1st grade
32: 1st, 2nd, 3rd, or 4th grade
33: 5th or 6th grade
34: 7th or 8th grade
35: 9th grade
36: 10th grade
37: 11th grade
7.7 Problems
253
38: 12th grade, no diploma
39: High school graduate, high school diploma, or equivalent
40: Some college but no degree
41: Associate degree in college (occupation/vocation program)
42: Associate degree in college (academic program)
43: Bachelor’s degree (e.g., B.S., B.A., A.B.)
44: Master’s degree (e.g., M.S., M.A., M.B.A.)
45: Professional school degree (e.g., M.D., D.D.S., D.V.M., LL.B., J.D.)
46: Doctoral degree (e.g., Ph.D., Ed.D.)
In these exercises, you will try to learn about the families of Cyberville by using
sampling.
a. Take a simple random sample of 500 families. Estimate the following population parameters, calculate the estimated standard errors of these estimates, and
form 95% confidence intervals:
i. The proportion of female-headed families
ii. The average number of children per family
iii. The proportion of heads of households who did not receive a high school
diploma
iv. The average family income
Repeat the preceding parameters for five different simple random samples of
size 500 and compare the results.
b. Take 100 samples of size 400.
i. For each sample, find the average family income.
ii. Find the average and standard deviation of these 100 estimates and make
a histogram of the estimates.
iii. Superimpose a plot of a normal density with that mean and standard deviation of the histogram and comment on how well it appears to fit.
iv. Plot the empirical cumulative distribution function (see Section 10.2). On
this plot, superimpose the normal cumulative distribution function with
mean and standard deviation as earlier. Comment on the fit.
v. Another method for examining a normal approximation is via a normal
probability plot (Section 9.9). Make such a plot and comment on what it
shows about the approximation.
vi. For each of the 100 samples, find a 95% confidence interval for the population average income. How many of those intervals actually contain the
population target?
vii. Take 100 samples of size 100. Compare the averages, standard deviations,
and histograms to those obtained for a sample of size 400 and explain how
the theory of simple random sampling relates to the comparisons.
c. For a simple random sample of 500, compare the incomes of the three family
types by comparing histograms and boxplots (see Chapter 10.6).
d. Take simple random samples of size 400 from each of the four regions.
i. Compare the incomes by region by making parallel boxplots.
ii. Does it appear that some regions have larger families than others?
iii. Are there differences in education level among the four regions?
254
Chapter 7
Survey Sampling
e. Formulate a question of your choice and attempt to answer it with a simple
random sample of size 400.
f. Does stratification help in estimating the average family income? From a simple
random sample of size 400, estimate the average income and also the standard
error of your estimate. Form a 95% confidence interval. Next, allocate the 400
observations proportionally to the four regions and estimate the average income
from the stratified sample. Estimate the standard error and form a 95% confidence interval. Compare your results to the results of the simple random sample.
CHAPTER
8
Estimation of Parameters
and Fitting of Probability
Distributions
8.1 Introduction
In this chapter, we discuss fitting probability laws to data. Many families of probability
laws depend on a small number of parameters; for example, the Poisson family depends on the parameter λ (the mean number of counts), and the Gaussian family
depends on two parameters, μ and σ . Unless the values of parameters are known in
advance, they must be estimated from data in order to fit the probability law.
After parameter values have been chosen, the model should be compared to the
actual data to see if the fit is reasonable; Chapter 9 is concerned with measures and
tests of goodness of fit.
In order to introduce and illustrate some of the ideas and to provide a concrete
basis for later theoretical discussions, we will first consider a classical example—the
fitting of a Poisson distribution to radioactive decay. The concepts introduced in this
example will be elaborated in this and the next chapter.
8.2 Fitting the Poisson Distribution to Emissions
of Alpha Particles
Records of emissions of alpha particles from radioactive sources show that the number of emissions per unit of time is not constant but fluctuates in a seemingly random
fashion. If the underlying rate of emission is constant over the period of observation
(which will be the case if the half-life is much longer than the time period of observation) and if the particles come from a very large number of independent sources
(atoms), the Poisson model seems appropriate. For this reason, the Poisson distribution is frequently used as a model for radioactive decay. You should recall that the
255
256
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
Poisson distribution as a model for random counts in space or time rests on three
assumptions: (1) the underlying rate at which the events occur is constant in space
or time, (2) events in disjoint intervals of space or time occur independently, and (3)
there are no multiple events.
Berkson (1966) conducted a careful analysis of data obtained from the National
Bureau of Standards. The source of the alpha particles was americium 241. The
experimenters recorded 10,220 times between successive emissions. The observed
mean emission rate (total number of emissions divided by total time) was .8392
emissions per sec. The clock used to record the times was accurate to .0002 sec.
The first two columns of the following table display the counts, n, that were
observed in 1207 intervals, each of length 10 sec. In 18 of the 1207 intervals, there
were 0, 1, or 2 counts; in 28 of the intervals there were 3 counts, etc.
n
0–2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17+
Observed
Expected
18
28
56
105
126
146
164
161
123
101
74
53
23
15
9
5
12.2
27.0
56.5
94.9
132.7
159.1
166.9
155.6
130.6
99.7
69.7
45.0
27.0
15.1
7.9
7.1
1207
1207
In fitting a Poisson distribution to the counts shown in the table, we view the
1207 counts as 1207 independent realizations of Poisson random variables, each of
which has the probability mass function
πk = P(X = k) =
λk e−λ
k!
In order to fit the Poisson distribution, we must estimate a value for λ from the
observed data. Since the average count in a 10-second interval was 8.392, we take
this as an estimate of λ (recall that the E(X ) = λ) and denote it by λ̂.
Before continuing, we want to mention some issues that will be explored in
depth in subsequent sections of this chapter. First, observe that if the experiment
8.3 Parameter Estimation
257
were to be repeated, the counts would be different and the estimate of λ would be
different; it is thus appropriate to regard the estimate of λ as a random variable which
has a probability distribution referred to as its sampling distribution. The situation
is entirely analogous to tossing a coin 10 times and regarding the number of heads
as a binomially distributed random variable. Doing so and observing 6 heads generates
one realization of this random variable; in the same sense 8.392 is a realization of a
random variable. The question thus arises: what is the sampling distribution? This is
of some practical interest, since the spread of the sampling distribution reflects the
variability of the estimate. We could ask crudely, to what decimal place is the estimate
8.392 accurate? Second, later in this chapter we will discuss the rationale for choosing
to estimate λ as we have done. Although estimating λ as the observed mean count is
quite reasonable on its face, in principle there might be better procedures.
We now turn to assessing goodness of fit, a subject that will be taken up in depth
in the next chapter. Consider the 16 cells into which the counts are grouped. Under
the hypothesized model, the probability that a random count falls in any one of the
cells may be calculated from the Poisson probability law. The probability that an
observation falls in the first cell (0, 1, or 2 counts) is
p1 = π0 + π1 + π2
The probability that an observation falls in the second cell is p2 = π3 . The probability
that an observation falls in the 16th cell is
∞
πk
p16 =
k=17
Under the assumption that X 1 , . . . , X 1207 are independent Poisson random variables,
the number of observations out of 1207 falling in a given cell follows a binomial
distribution with a mean, or expected value, of 1207 pk , and the joint distribution of the
counts in all the cells is multinomial with n = 1207 and probabilities p1 , p2 , . . . , p16 .
The third column of the preceding table gives the expected number of counts in each
cell; for example, because p4 = .0786, the expected count in the corresponding cell
is 1207 × .0786 = 94.9. Qualitatively, there is good agreement between the expected
and observed counts. Quantitative measures will be presented in Chapter 9.
8.3 Parameter Estimation
As was illustrated in the example of alpha particle emissions, in order to fit a probability law to data, one typically has to estimate parameters associated with the probability
law from the data. The following examples further illustrate this point.
EXAMPLE A
Normal Distribution
The normal, or Gaussian, distribution involves two parameters, μ and σ , where μ is
the mean of the distribution and σ 2 is the variance:
2
1 (x−μ)
1
f (x|μ, σ ) = √ e− 2 σ 2 ,
−∞ < x < ∞
σ 2π
258
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
.45
.40
.35
P (x )
.30
.25
.20
.15
.10
.05
0
3.0 2.0 1.0
0
x
1.0
2.0
3.0
F I G U R E 8.1 Gaussian fit of current flow across a cell membrane to a frequency
polygon.
The use of the normal distribution as a model is usually justified using some
version of the central limit theorem, which says that the sum of a large number of
independent random variables is approximately normally distributed. For example,
Bevan, Kullberg, and Rice (1979) studied random fluctuations of current across a
muscle cell membrane. The cell membrane contained a large number of channels,
which opened and closed at random and were assumed to operate independently. The
net current resulted from ions flowing through open channels and was therefore the
sum of a large number of roughly independent currents. As the channels opened and
closed, the net current fluctuated randomly. Figure 8.1 shows a smoothed histogram
of values obtained from 49,152 observations of the net current and an approximating Gaussian curve. The fit of the Gaussian distribution is quite good, although the
smoothed histogram seems to show a slight skewness. In this application, information about the characteristics of the individual channels, such as conductance, was
■
extracted from the estimated parameters μ and σ 2 .
EXAMPLE B
Gamma Distribution
The gamma distribution depends on two parameters, α and λ:
1 α α−1 −λx
λ x e ,
0≤x ≤∞
(α)
The family of gamma distributions provides a flexible set of densities for nonnegative
random variables.
Figure 8.2 shows how the gamma distribution fits to the amounts of rainfall
from different storms (Le Cam and Neyman 1967). Gamma distributions were fit
f (x|α, λ) =
8.3 Parameter Estimation
259
60
Frequency
50
40
Seeded
30
20
10
0
0
(a)
10 20 30 40 50 60 70 80 90 100 110 120
Amount (mm)
80
70
Frequency
60
50
Unseeded
40
30
20
10
0
0
(b)
10 20 30 40 50 60 70 80 90 100 110
Amount (mm)
F I G U R E 8.2 Fit of gamma densities to amounts of rainfall for (a) seeded and
(b) unseeded storms.
to rainfall amounts from storms that were seeded and unseeded in an experiment to
determine the effects, if any, of seeding. Differences in the distributions between the
seeded and unseeded conditions should be reflected in differences in the parameters
■
α and λ.
As these examples illustrate, there are a variety of reasons for fitting probability
laws to data. A scientific theory may suggest the form of a probability distribution
and the parameters of that distribution may be of direct interest to the scientific investigation; the examples of alpha particle emission and Example A are of this character.
Example B is typical of situations in which a model is fit for essentially descriptive
purposes as a method of data summary or compression. A probability model may
play a role in a complex modeling situation; for example, utility companies interested
in projecting patterns of consumer demand find it useful to model daily temperatures
as random variables from a distribution of a particular form. This distribution may
then be used in simulations of the effects of various pricing and generation schemes.
In a similar way, hydrologists planning uses of water resources use stochastic models
of rainfall in simulations.
260
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
We will take the following basic approach to the study of parameter estimation.
The observed data will be regarded as realizations of random variables X 1 , X 2 , . . . , X n ,
whose joint distribution depends on an unknown parameter θ . Note that θ may be a
vector, such as (α, λ) in Example B. Usually the X i will be modeled as independent
random variables all having the same distribution f (x|θ ), in which case their joint distribution is f (x1 |θ) f (x2 |θ) · · · f (xn |θ ). We will refer to such X i as independent and
identically distributed, or i.i.d. An estimate of θ will be a function of X 1 , X 2 , . . . , X n
and will hence be a random variable with a probability distribution called its sampling
distribution. We will use approximations to the sampling distribution to assess the
variability of the estimate, most frequently through its standard deviation, which is
commonly called its standard error.
It is desirable to have general procedures for forming estimates so that each new
problem does not have to be approached ab initio. We will develop two such procedures, the method of moments and the method of maximum likelihood, concentrating
primarily on the latter, because it is the more generally useful.
The advanced theory of statistics is heavily concerned with “optimal estimation,”
and we will touch lightly on this topic. The essential idea is that given a choice of many
different estimation procedures, we would like to use that estimate whose sampling
distribution is most concentrated around the true parameter value.
Before going on to the method of moments, let us note that there are strong
similarities of the subject matter of this and the previous chapter. In Chapter 7 we were
concerned with estimating population parameters, such as the mean and total, and the
process of random sampling created random variables whose probability distributions
depended on those parameters. We were concerned with the sampling distributions
of the estimates and with assessing variability via standard errors and confidence
intervals. In this chapter we consider models in which the data are generated from a
probability distribution. This distribution usually has a more hypothetical status than
that of Chapter 7, where the distribution was induced by deliberate randomization. In
this chapter we will also be concerned with sampling distributions and with assessing
variability through standard errors and confidence intervals.
8.4 The Method of Moments
The kth moment of a probability law is defined as
μk = E(X k )
where X is a random variable following that probability law (of course, this is defined
only if the expectation exists). If X 1 , X 2 , . . . , X n are i.i.d. random variables from that
distribution, the kth sample moment is defined as
μ̂k =
n
1 k
X
n i=1 i
We can view μ̂k as an estimate of μk . The method of moments estimates parameters
by finding expressions for them in terms of the lowest possible order moments and
then substituting sample moments into the expressions.
8.4 The Method of Moments
261
Suppose, for example, that we wish to estimate two parameters, θ1 and θ2 . If θ1
and θ2 can be expressed in terms of the first two moments as
θ1 = f 1 (μ1 , μ2 )
θ2 = f 2 (μ1 , μ2 )
then the method of moments estimates are
θ̂1 = f 1 (μ̂1 , μ̂2 )
θ̂2 = f 2 (μ̂1 , μ̂2 )
The construction of a method of moments estimate involves three basic steps:
1. Calculate low order moments, finding expressions for the moments in terms of the
parameters. Typically, the number of low order moments needed will be the same
as the number of parameters.
2. Invert the expressions found in the preceding step, finding new expressions for the
parameters in terms of the moments.
3. Insert the sample moments into the expressions obtained in the second step, thus
obtaining estimates of the parameters in terms of the sample moments.
To illustrate this procedure, we consider some examples.
EXAMPLE A
Poisson Distribution
The first moment for the Poisson distribution is the parameter λ = E(X ). The first
sample moment is
μ̂1 = X =
n
1
Xi
n i=1
which is, therefore, the method of moments estimate of λ: λ̂ = X .
As a concrete example, let us consider a study done at the National Institute of
Science and Technology (Steel et al. 1980). Asbestos fibers on filters were counted
as part of a project to develop measurement standards for asbestos concentration.
Asbestos dissolved in water was spread on a filter, and 3-mm diameter punches were
taken from the filter and mounted on a transmission electron microscope. An operator
counted the number of fibers in each of 23 grid squares, yielding the following counts:
31
34
26
28
29
27
27
24
19
34
27
21
18
30
18
17
31
16
24
24
28
18
22
The Poisson distribution would be a plausible model for describing the variability
from grid square to grid square in this situation and could be used to characterize the
inherent variability in future measurements. The method of moments estimate of λ is
simply the arithmetic mean of the counts listed above, these or λ̂ = 24.9.
If the experiment were to be repeated, the counts—and therefore the estimate—
would not be exactly the same. It is thus natural to ask how stable this estimate is.
262
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
A standard statistical technique for addressing this question is to derive the sampling
distribution of the estimate or an approximation to that distribution. The statistical
model stipulates that the individual counts X i are independent Poisson random variables with parameter λ0 . Letting S =
X i , the parameter estimate λ̂ = S/n is a
random variable, the distribution of which is called its sampling distribution. Now
from Example E in Section 4.5, the distribution of the sum of independent Poisson
random variables is Poisson distributed, so the distribution of S is Poisson (nλ0 ).
Thus the probability mass function of λ̂ is
P(λ̂ = v) = P(S = nv)
=
(nλ0 )nv e−nλ0
(nv)!
for v such that nv is a nonnegative integer.
Since S is Poisson, its mean and variance are both nλ0 , so
1
E(λ̂) = E(S) = λ0
n
1
λ0
Var(λ̂) = 2 Var(S) =
n
n
From Example A in Section 5.3, if nλ0 is large, the distribution of S is approximately
normal; hence, that of λ̂ is approximately normal as well, with mean and variance
given above. Because E(λ̂) = λ0 , we say that the estimate is unbiased: the sampling
distribution is centered at λ0 . The second equation shows that the sampling distribution
becomes more concentrated about λ0 as n increases. The standard deviation of this
distribution is called the standard error of λ̂ and is
*
λ0
σλ̂ =
n
Of course, we can’t know the sampling distribution or the standard error of λ̂ because
they depend on λ0 , which is unknown. However, we can derive an approximation by
substituting λ̂ and λ0 and use it to assess the variability of our estimate. In particular,
we can calculate the estimated standard error of λ̂ as
/
λ̂
sλ̂ =
n
For this example, we find
*
sλ̂ =
24.9
= 1.04
23
At the end of this section, we will present a justification for using λ̂ in place of λ0 .
In summary, we have found that the sampling distribution of λ̂ is approximately
normal, centered at the true value λ0 with standard deviation 1.04. This gives us
a reasonable assessment of the variability of our parameter estimate. For example,
because a normally distributed random variable is unlikely to be more than two
standard deviations away from its mean, the error in our estimate of λ is unlikely to
be more than 2.08. We thus have not only an estimate of λ0 , but also an understanding
of the inherent variability of that estimate.
8.4 The Method of Moments
263
In Chapter 9, we will address the question of whether the Poisson distribution
really fits these data. Clearly, we could calculate the average of any batch of numbers,
■
whether or not they were well fit by the Poisson distribution.
EXAMPLE B
Normal Distribution
The first and second moments for the normal distribution are
μ1 = E(X ) = μ
μ2 = E(X 2 ) = μ2 + σ 2
Therefore,
μ = μ1
σ 2 = μ2 − μ21
The corresponding estimates of μ and σ 2 from the sample moments are
μ̂ = X
n
n
1
1
2
2
2
σ̂ =
X −X =
(X i − X )2
n i=1 i
n i=1
From Section 6.3, the sampling distribution of X is N (μ, σ 2 /n) and n σ̂ 2 /σ 2 ∼
Furthermore, X and σ̂ 2 are independently distributed. We will return to these
■
sampling distributions later in the chapter.
2
.
χn−1
EXAMPLE C
Gamma Distribution
The first two moments of the gamma distribution are
α
μ1 =
λ
α(α + 1)
μ2 =
λ2
(see Example B in Section 4.5). To apply the method of moments, we must express
α and λ in terms of μ1 and μ2 . From the second equation,
μ1
μ2 = μ21 +
λ
or
μ1
λ=
μ2 − μ21
Also, from the equation for the first moment given here,
α = λμ1 =
μ21
μ2 − μ21
The method of moments estimates are, since σ̂ 2 = μ̂2 − μ̂21 ,
λ̂ =
X
σ̂ 2
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
160
140
120
100
Count
264
80
60
40
20
0
0
.5
1.0
1.5
Precipitation
2.0
2.5
F I G U R E 8.3 Gamma densities fit by the methods of moments and by the method of
maximum likelihood to amounts of precipitation; the solid line shows the method of
moments estimate and the dotted line the maximum likelihood estimate.
and
2
X
σ̂ 2
As a concrete example, let us consider the fit of the amounts of precipitation
during 227 storms in Illinois from 1960 to 1964 to a gamma distribution (Le Cam and
Neyman 1967). The data, listed in Problem 42 at the end of Chapter 10, were gathered
and analyzed in an attempt to characterize the natural variability in precipitation
from storm to storm. A histogram shows that the distribution is quite skewed, so a
gamma distribution is a natural candidate for a model. For these data, X = .224 and
σ̂ 2 = .1338, and therefore α̂ = .375 and λ̂ = 1.674.
The histogram with the fitted density is shown in Figure 8.3. Note that, in order
to make visual comparison easy, the density was normalized to have a total area equal
to the total area under the histogram, which is the number of observations times the
bin width of the histogram, or 227 × .2 = 45.4. Alternatively, the histogram could
have been normalized to have a total area of 1. Qualitatively, the fit in Figure 8.3 looks
■
reasonable; we will examine it in more detail in Example C in Section 9.9.
α̂ =
We now turn to a discussion of the sampling distributions of α̂ and λ̂. In the previous two examples, we were able to use known theoretical results in deriving sampling
distributions, but it appears that it would be difficult to derive the exact forms of the
sampling distributions of λ̂ and α̂, because they are each rather complicated functions
of the sample values X 1 , X 2 , . . . , X n . However, the problem can be approached by
simulation. Imagine for the moment that we knew the true values λ0 and α0 . We could
generate many, many samples of size n = 227 from the gamma distribution with
8.4 The Method of Moments
265
these parameter values, and from each of these samples we could calculate estimates
of λ and α. A histogram of the values of the estimates of λ, for example, should then
give us a good idea of the sampling distribution of λ̂.
The only problem with this idea is that it requires knowing the true parameter
values. (Notice that we faced a problem very much like this in Example A.) So we
substitute our estimates of λ and α for the true values; that is we draw many, many
samples of size n = 227 from a gamma distribution with parameters α = .375 and
λ = 1.674. The results of drawing 1000 such samples of size n = 227 are displayed
in Figure 8.4. Figure 8.4(a) is a histogram of the 1000 estimates of α so obtained and
Figure 8.4(b) shows the corresponding histogram for λ. These histograms indicate the
variability that is inherent in estimating the parameters from a sample of this size. For
example, we see that if the true value of α is .375, then it would not be very unusual
for the estimate to be in error by .1 or more. Notice that the shapes of the histograms
suggest that they might be approximated by normal densities.
The variability shown by the histograms can be summarized by calculating the
standard deviations of the 1000 estimates, thus providing estimated standard errors of
α̂ and λ̂. To be precise, if the 1000 estimates of α are denoted by αi∗ , i = 1, 2, . . . , 1000,
the standard error of α̂ is estimated as
0
1
1000
1 1 sα̂ = 2
(α ∗ − α)2
1000 i=1 i
where α is the mean of the 1000 values. The results of this calculation and the
corresponding one for λ̂ are sα̂ = .06 and sλ̂ = .34. These standard errors are concise
quantifications of the amount of variability of the estimates α̂ = .375 and λ̂ = 1.674
displayed in Figure 8.4.
250
200
200
150
150
100
100
50
50
0
0.2
(a)
0.3
0.4
0.5
0.6
0
1.0
1.5
2.0
2.5
3.0
(b)
F I G U R E 8.4 Histogram of 1000 simulated method of moment estimates of (a) α
and (b) λ.
266
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
Our use of simulation (or Monte Carlo) here is an example of what in statistics
is called the bootstrap. We will see more examples of this versatile method later.
EXAMPLE D
An Angular Distribution
The angle θ at which electrons are emitted in muon decay has a distribution with the
density
1 + αx
,
−1 ≤ x ≤ 1
and
−1≤α ≤1
f (x|α) =
2
where x = cos θ. The parameter α is related to polarization. Physical considerations
dictate that |α| ≤ 13 , but we note that f (x|α) is a probability density for |α| ≤ 1.
The method of moments may be applied to estimate α from a sample of experimental
measurements, X 1 , . . . , X n . The mean of the density is
1
α
1 + αx
dx =
x
μ=
2
3
−1
Thus, the method of moments estimate of α is α̂ = 3X . Consideration of the sampling
distribution of α̂ is left as an exercise (Problem 13).
■
Under reasonable conditions, method of moments estimates have the desirable
property of consistency. An estimate, θ̂, is said to be a consistent estimate of a
parameter, θ, if θ̂ approaches θ as the sample size approaches infinity. The following
states this more precisely.
DEFINITION
Let θ̂n be an estimate of a parameter θ based on a sample of size n. Then θ̂n is said
to be consistent in probability if θ̂n converges in probability to θ as n approaches
infinity; that is, for any > 0,
P(|θ̂n − θ| > ) → 0
as n → ∞
■
The weak law of large numbers implies that the sample moments converge in
probability to the population moments. If the functions relating the estimates to the
sample moments are continuous, the estimates will converge to the parameters as the
sample moments converge to the population moments.
The consistency of method of moments estimates can be used to provide a justification for a procedure that we used in estimating standard errors in the previous
examples. We were interested in the variance (or its square root—the standard error)
of a parameter estimate θ̂. Denoting the true parameter by θ0 , we had a relationship
of the form
1
σθ̂ = √ σ (θ0 )
n
√
√
(In Example A, σλ̂ = λ0 /n, so that σ (λ) = λ.) We approximated this by the
8.5 The Method of Maximum Likelihood
267
estimated standard error
1
sθ̂ = √ σ (θ̂ )
n
We now claim that the consistency of θ̂ implies that sθ̂ ≈ σθ̂ . More precisely,
s
lim θ̂ = 1
n→∞ σθ̂
provided that the function σ (θ) is continuous in θ. The result follows since if θ̂ → θ0 ,
then σ (θ̂) → σ (θ0 ). Of course, this is just a limiting result and we always have a
finite value of n in practice, but it does provide some hope that the ratio will be close
to 1 and that the estimated standard error will be a reasonable indication of variability.
Let us summarize the results of this section. We have shown how the method
of moments can provide estimates of the parameters of a probability distribution
based on a “sample” (an i.i.d. collection) of random variables from that distribution.
We addressed the question of variability or reliability of the estimates by observing
that if the sample is random, the parameter estimates are random variables having
distributions that are referred to as their sampling distributions. The standard deviation
of the sampling distribution is called the standard error of the estimate. We then faced
the problem of how to ascertain the variability of an estimate from the sample itself.
In some cases the sampling distribution was of an explicit form depending upon
the unknown parameters (Examples A and B); in these cases we could substitute
our estimates for the unknown parameters in order to approximate the sampling
distribution. In other cases the form of the sampling distribution was not so obvious,
but we realized that even if we didn’t know it explicitly, we could simulate it. By
using the bootstrap we avoided doing perhaps difficult analytic calculations by sitting
back and instructing a computer to generate random numbers.
8.5 The Method of Maximum Likelihood
As well as being a useful tool for parameter estimation in our current context, the
method of maximum likelihood can be applied to a great variety of other statistical
problems, such as curve fitting, for example. This general utility is one of the major
reasons for the importance of likelihood methods in statistics. We will later see that
maximum likelihood estimates have nice theoretical properties as well.
Suppose that random variables X 1 , . . . , X n have a joint density or frequency
function f (x1 , x2 , . . . , xn |θ). Given observed values X i = xi , where i = 1, . . . , n,
the likelihood of θ as a function of x1 , x2 , . . . , xn is defined as
lik(θ) = f (x1 , x2 , . . . , xn |θ )
Note that we consider the joint density as a function of θ rather than as a function of
the xi . If the distribution is discrete, so that f is a frequency function, the likelihood
function gives the probability of observing the given data as a function of the parameter θ. The maximum likelihood estimate (mle) of θ is that value of θ that maximizes the likelihood—that is, makes the observed data “most probable” or “most
likely.”
268
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
If the X i are assumed to be i.i.d., their joint density is the product of the marginal
densities, and the likelihood is
n
-
lik(θ ) =
f (X i |θ )
i=1
Rather than maximizing the likelihood itself, it is usually easier to maximize its natural
logarithm (which is equivalent since the logarithm is a monotonic function). For an
i.i.d. sample, the log likelihood is
l(θ) =
n
log[ f (X i |θ )]
i=1
(In this text, “log” will always mean the natural logarithm.)
Let us find the maximum likelihood estimates for the examples first considered
in Section 8.4.
Poisson Distribution
If X follows a Poisson distribution with parameter λ, then
λx e−λ
x!
If X 1 , . . . , X n are i.i.d. and Poisson, their joint frequency function is the product of
the marginal frequency functions. The log likelihood is thus
P(X = x) =
l(λ) =
n
(X i log λ − λ − log X i !)
i=1
= log λ
n
X i − nλ −
i=1
n
log X i !
i=1
70
72
74
Log likelihood
EXAMPLE A
76
78
80
82
84
86
20
22
24
26
28
30
F I G U R E 8.5 Plot of the log likelihood function of λ for asbestos data.
8.5 The Method of Maximum Likelihood
269
Figure 8.5 is a graph of l(λ) for the asbestos counts of Example A in Section 8.4.
Setting the first derivative of the log likelihood equal to zero, we find
1
Xi − n = 0
λ i=1
n
l (λ) =
The mle is then
λ̂ = X
We can check that this is indeed a maximum (in fact, l(λ) is a concave function
of λ; see Figure 8.5). The maximum likelihood estimate agrees with the method of
moments for this case and thus has the same sampling distribution.
■
EXAMPLE B
Normal Distribution
If X 1 , X 2 , . . . , X n are i.i.d. N (μ, σ 2 ), their joint density is the product of their marginal
densities:
n
1
1 xi − μ 2
√ exp −
f (x1 , x2 , . . . , xn |μ, σ ) =
2
σ
σ 2π
i=1
Regarded as a function of μ and σ , this is the likelihood function. The log likelihood
is thus
l(μ, σ ) = −n log σ −
n
1 n
log 2π −
(X i − μ)2
2
2σ 2 i=1
The partials with respect to μ and σ are
n
1 ∂l
= 2
(X i − μ)
∂μ
σ i=1
∂l
n
= − + σ −3
(X i − μ)2
∂σ
σ
i=1
n
Setting the first partial equal to zero and solving for the mle, we obtain
μ̂ = X
Setting the second partial equal to zero and substituting the mle for μ, we find that
the mle for σ is
0
1 n
11 σ̂ = 2
(X i − X )2
n i=1
Again, these estimates and their sampling distributions are the same as those obtained
by the method of moments.
■
270
Chapter 8
EXAMPLE C
Estimation of Parameters and Fitting of Probability Distributions
Gamma Distribution
Since the density function of a gamma distribution is
1 α α−1 −λx
λ x e ,
(α)
the log likelihood of an i.i.d. sample, X i , . . . , X n , is
f (x|α, λ) =
l(α, λ) =
n
0≤x <∞
[α log λ + (α − 1) log X i − λX i − log (α)]
i=1
= nα log λ + (α − 1)
n
log X i − λ
i=1
n
X i − n log (α)
i=1
The partial derivatives are
(α)
∂l
= n log λ +
log X i − n
∂α
(α)
i=1
n
nα ∂l
=
−
Xi
∂λ
λ
i=1
n
Setting the second partial equal to zero, we find
λ̂ =
α̂
n α̂
=
n
X
Xi
i=1
But when this solution is substituted into the equation for the first partial, we obtain
a nonlinear equation for the mle of α:
n log α̂ − n log X +
n
i=1
log X i − n
(α̂)
=0
(α̂)
This equation cannot be solved in closed form; an iterative method for finding the
roots has to be employed. To start the iterative procedure, we could use the initial
value obtained by the method of moments.
For this example, the two methods do not give the same estimates. The mle’s
are computed from the precipitation data of Example C in Section 8.4 by an iterative
procedure (a combination of the secant method and the method of bisection) using the
method of moments estimates as starting values. The resulting estimates are α̂ = .441
and λ̂ = 1.96. In Example C in Section 8.4, the method of moments estimates were
found to be α̂ = .375 and λ̂ = 1.674. Figure 8.3 shows fitted densities from both
types of estimates of α and λ. There is clearly little practical difference, especially if
we keep in mind that the gamma distribution is only a possible model and should not
be taken as being literally true.
Because the maximum likelihood estimates are not given in closed form,
obtaining their exact sampling distribution would appear to be intractable. We thus
use the bootstrap to approximate these distributions, just as we did to approximate
the sampling distributions of the method of moments estimates. The underlying rationale is the same: If we knew the “true” values, α0 and λ0 , say, we could approximate
8.5 The Method of Maximum Likelihood
200
271
300
250
150
200
150
100
100
50
50
0
0
0.40 0.45 0.50 0.55 0.60
1.5
2.0
2.5
3.0
3.5
(b)
(a)
F I G U R E 8.6 Histograms of 1000 simulated maximum likelihood estimates of (a) α
and (b) λ.
the sampling distribution of their maximum likelihood estimates by generating many,
many samples of size n = 227 from a gamma distribution with parameters α0 and
λ0 , forming the maximum likelihood estimates from each sample, and displaying the
results in histograms. Since, of course, we don’t know the true values, we let our
maximum likelihood estimates play their role: We generated 1000 samples each of
size n = 227 of gamma distributed random variables with α = .471 and λ = 1.97.
For each of these samples, the maximum likelihood estimates of α and λ were calculated. Histograms of these 1000 estimates are shown in Figure 8.6; we regard these
histograms as approximations to the sampling distribution of the maximum likelihood
estimates α̂ and λ̂.
Comparison of Figures 8.6 and 8.4 is interesting. We see that the sampling distributions of the maximum likelihood estimates are substantially less dispersed than
those of the method of moments estimates, which indicates that in this situation, the
method of maximum likelihood is more precise than the method of moments. The
standard deviations of the values displayed in the histograms are the estimated standard errors of the maximum likelihood estimates; we find sα̂ = .03 and sλ̂ = .26.
Recall that in Example C of Section 8.4 the corresponding estimated standard errors
■
for the method of moments estimates were found to be .06 and .34.
EXAMPLE D
Muon Decay
From the form of the density given in Example D in Section 8.4, the log likelihood is
l(α) =
n
log(1 + α X i ) − n log 2
i=1
Setting the derivative equal to zero, we see that the mle of α satisfies the following
272
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
nonlinear equation:
n
i=1
Xi
=0
1 + α̂ X i
Again, we would have to use an iterative technique to solve for α̂. The method of
■
moments estimate could be used as a starting value.
In Examples C and D, in order to find the maximum likelihood estimate, we
would have to solve a nonlinear equation. In general, in some problems involving
several parameters, systems of nonlinear equations must be solved to find the mle’s.
We will not discuss numerical methods here; a good discussion is found in Chapter 6
of Dahlquist and Bjorck (1974).
8.5.1 Maximum Likelihood Estimates of Multinomial
Cell Probabilities
The method of maximum likelihood is often applied to problems involving multinomial cell probabilities. Suppose that X 1 , . . . , X m , the counts in cells 1, . . . , m, follow
a multinomial distribution with a total count of n and cell probabilities p1 , . . . , pm .
We wish to estimate the p’s from the x’s. The joint frequency function of X 1 , . . . , X m
is
m
n! - xi
pi
f (x1 , . . . , xm | p1 , . . . , pm ) = m
i=1
xi !
i=1
Note that the marginal distribution of each X i is binomial (n, pi ), and that since
the X i are not independent (they are constrained to sum to n), their joint frequency
function is not the product of the marginal frequency functions, as it was in the
examples considered in the preceding section. We can, however, still use the method
of maximum likelihood since we can write an expression for the joint distribution.
We assume n is given, and we wish to estimate p1 , . . . , pm with the constraint that
the pi sum to 1. From the joint frequency function just given, the log likelihood is
l( p1 , . . . , pm ) = log n! −
m
log xi ! +
i=1
m
xi log pi
i=1
To maximize this likelihood subject to the constraint, we introduce a Lagrange multiplier and maximize
m
m
m
log xi ! +
xi log pi + λ
pi − 1
L( p1 , . . . , pm , λ) = log n! −
i=1
i=1
i=1
Setting the partial derivatives equal to zero, we have the following system of
equations:
xj
j = 1, . . . , m
p̂ j = − ,
λ
8.5 The Method of Maximum Likelihood
273
Summing both sides of this equation, we have
1=
−n
λ
or
λ = −n
Therefore,
xj
n
which is an obvious set of estimates. The sampling distribution of p̂ j is determined
by the distribution of x j , which is binomial.
In some situations, such as frequently occur in the study of genetics, the multinomial cell probabilities are functions of other unknown parameters θ; that is, pi =
pi (θ). In such cases, the log likelihood of θ is
p̂ j =
l(θ) = log n! −
m
log xi ! +
i=1
EXAMPLE A
m
xi log pi (θ )
i=1
Hardy-Weinberg Equilibrium
If gene frequencies are in equilibrium, the genotypes A A, Aa, and aa occur in a
population with frequencies (1 − θ )2 , 2θ (1 − θ ), and θ 2 , according to the HardyWeinberg law. In a sample from the Chinese population of Hong Kong in 1937,
blood types occurred with the following frequencies, where M and N are erythrocyte
antigens:
Blood Type
Frequency
M
342
MN
500
N
187
Total
1029
There are several possible ways to estimate θ from the observed frequencies. For example, if we equate θ 2 with 187/1029, we obtain .4263 as an estimate of θ. Intuitively,
however, it seems that this procedure ignores some of the information in the other
cells. If we let X 1 , X 2 , and X 3 denote the counts in the three cells and let n = 1029,
the log likelihood of θ is (you should check this):
l(θ) = log n! −
3
log X i ! + X 1 log(1 − θ )2 + X 2 log 2θ (1 − θ ) + X 3 log θ 2
i=1
= log n! −
3
log X i ! + (2X 1 + X 2 ) log(1 − θ )
i=1
+ (2X 3 + X 2 ) log θ + X 2 log 2
In maximizing l(θ), we do not need to explicitly incorporate the constraint that the cell
3
probabilities sum to 1 since the functional form of pi (θ ) is such that i=1
pi (θ ) = 1.
274
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
Setting the derivative equal to zero, we have
−
2X 1 + X 2
2X 3 + X 2
+
=0
1−θ
θ
Solving this, we obtain the mle:
2X 3 + X 2
2X 1 + 2X 2 + 2X 3
2X 3 + X 2
=
2n
2 × 187 + 500
=
= .4247
2 × 1029
θ̂ =
How precise is this estimate? Do we have faith in the accuracy of the first, second,
third, or fourth decimal place? We will address these questions by using the bootstrap to estimate the sampling distribution and the standard error of θ̂. The bootstrap
logic is as follows: If θ were known, then the three multinomial cell probabilities,
(1 − θ)2 , 2θ(1 − θ), and θ 2 , would be known. To find the sampling distribution of θ̂,
we could simulate many multinomial random variables with these probabilities and
n = 1029, and for each we could form an estimate of θ . A histogram of these estimates
would be an approximation to the sampling distribution. Since, of course, we don’t
know the actual value of θ to use in such a simulation, the bootstrap principle tells us
to use θ̂ = .4247 in its place. With this estimated value of θ the three cell probabilities
(M, MN, N ) are .331, .489, and .180. One thousand multinomial random counts, each
with total count 1029, were simulated with these probabilities (see problem 35 at the
end of the chapter for the method of generating these random counts). From each of
these 1000 computer “experiments,” a value θ ∗ was determined. A histogram of the
estimates (Figure 8.7) can be regarded as an estimate of the sampling distribution of
θ̂. The estimated standard error of θ̂ is the standard deviation of these 1000 values:
■
sθ̂ = .011.
8.5.2 Large Sample Theory for Maximum Likelihood Estimates
In this section we develop approximations to the sampling distribution of maximum
likelihood estimates by using limiting arguments as the sample size increases. The
theory we shall sketch shows that under reasonable conditions, maximum likelihood
estimates are consistent. We also develop a useful and important approximation for
the variance of a maximum likelihood estimate and argue that for large sample sizes,
the sampling distribution is approximately normal.
The rigorous development of this large sample theory is quite technical; we will
simply state some results and give very rough, heuristic arguments for the case of
an i.i.d. sample and a one-dimensional parameter. (The arguments for Theorems A
and B may be skipped without loss of continuity. Rigorous proofs may be found in
Cramér (1946).)
8.5 The Method of Maximum Likelihood
275
150
100
50
0
0.38
0.40
0.42
0.44
0.46
F I G U R E 8.7 Histogram of 1000 simulated maximum likelihood estimates of θ
described in Example A.
For an i.i.d. sample of size n, the log likelihood is
l(θ) =
n
log f (xi |θ )
i=1
We denote the true value of θ by θ0 . It can be shown that under reasonable conditions
θ̂ is a consistent estimate of θ0 ; that is, θ̂ converges to θ0 in probability as n approaches
infinity.
THEOREM A
Under appropriate smoothness conditions on f , the mle from an i.i.d. sample is
consistent.
Proof
The following is merely a sketch of the proof. Consider maximizing
n
1
1
l(θ) =
log f (X i |θ )
n
n i=1
276
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
As n tends to infinity, the law of large numbers implies that
1
l(θ) → E log f (X |θ )
n
=
log f (x|θ ) f (x|θ0 ) d x
It is thus plausible that for large n, the θ that maximizes l(θ ) should be close
to the θ that maximizes E log f (X |θ ). (An involved argument is necessary to
establish this.) To maximize E log f (X |θ ), we consider its derivative:
∂
f (x|θ )
∂
∂θ
f (x|θ0 ) d x
log f (x|θ) f (x|θ0 ) d x =
∂θ
f (x|θ )
If θ = θ0 , this equation becomes
∂
∂
∂
f (x|θ0 ) d x =
(1) = 0
f (x|θ0 ) d x =
∂θ
∂θ
∂θ
which shows that θ0 is a stationary point and hopefully a maximum. Note that
we have interchanged differentiation and integration and that the assumption of
■
smoothness on f must be strong enough to justify this.
We will now derive a useful intermediate result.
LEMMA A
Define I (θ) by
2
∂
log f (X |θ )
I (θ) = E
∂θ
Under appropriate smoothness conditions on f, I (θ ) may also be expressed as
2
∂
I (θ) = −E
log f (X |θ )
∂θ 2
Proof
First, we observe that since
f (x|θ ) d x = 1,
∂
f (x|θ ) d x = 0
∂θ
Combining this with the identity
∂
∂
f (x|θ) =
log f (x|θ ) f (x|θ )
∂θ
∂θ
8.5 The Method of Maximum Likelihood
we have
∂
0=
∂θ
f (x|θ) d x =
277
∂
log f (x|θ ) f (x|θ ) d x
∂θ
where we have interchanged differentiation and integration (some assumptions
must be made in order to do this). Taking second derivatives of the preceding
expressions, we have
∂
∂
log f (x|θ) f (x|θ ) d x
0=
∂θ
∂θ
2
∂2
∂
log f (x|θ ) f (x|θ ) d x
=
log f (x|θ) f (x|θ ) d x +
∂θ 2
∂θ
From this, the desired result follows.
■
The large sample distribution of a maximum likelihood estimate is approximately
normal with mean θ0 and variance 1/[n I (θ0 )]. Since this is merely a limiting result,
which holds as the sample size tends to infinity, we say that the mle is asymptotically unbiased and refer to the variance of the limiting normal distribution as the
asymptotic variance of the mle.
THEOREM B
√
Under smoothness conditions on f , the probability distribution of n I (θ0 )(θ̂− θ0 )
tends to a standard normal distribution.
Proof
The following is merely a sketch of the proof; the details of the argument are
beyond the scope of this book. From a Taylor series expansion,
0 = l (θ̂) ≈ l (θ0 ) + (θ̂ − θ0 )l (θ0 )
(θ̂ − θ0 ) ≈
−l (θ0 )
l (θ0 )
−n −1/2l (θ0 )
n −1l (θ0 )
First, we consider the numerator of this last expression. Its expectation is
n
∂
−1/2
−1/2
log f (X i |θ0 )
l (θ0 )] = n
E
E[n
∂θ
i=1
n 1/2 (θ̂ − θ0 ) ≈
=0
278
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
as in Theorem A. Its variance is
Var[n −1/2l (θ0 )] =
2
n
1
∂
log f (X i |θ0 )
E
n i=1
∂θ
= I (θ0 )
Next, we consider the denominator:
n
1
1 ∂2
l (θ0 ) =
log f (xi |θ0 )
n
n i=1 ∂θ 2
By the law of large numbers, the latter expression converges to
2
∂
E
log
f
(X
|θ
)
= −I (θ0 )
0
∂θ 2
from Lemma A.
We thus have
n 1/2 (θ̂ − θ0 ) ≈
n −1/2l (θ0 )
I (θ0 )
Therefore,
E[n 1/2 (θ̂ − θ0 )] ≈ 0
Furthermore,
I (θ0 )
I 2 (θ0 )
1
=
I (θ0 )
Var[n 1/2 (θ̂ − θ0 )] ≈
and thus
1
n I (θ0 )
The central limit theorem may be applied to l (θ0 ), which is a sum of i.i.d.
random variables:
n
∂
l (θ0 ) =
log f (X i |θ )
■
∂θ
0
i=1
Var(θ̂ − θ0 ) ≈
Another interpretation of the result of Theorem B is as follows. For an i.i.d. sample, the maximum likelihood estimate is the maximizer of the log likelihood function,
l(θ) =
n
log f (X i |θ )
i=1
The asymptotic variance is
1
1
=−
n I (θ0 )
El (θ0 )
8.5 The Method of Maximum Likelihood
279
when El (θ0 ) is large, l(θ) is, on average, changing very rapidly in a vicinity of θ0
and the variance of the maximizer is small.
A corresponding result can be proved from the multidimensional case. The vector
of maximum likelihood estimates is asymptotically normally distributed. The mean
of the asymptotic distribution is the vector of true parameters, θ0 . The covariance of
the estimates θ̂i and θ̂ j is given by the i j entry of the matrix n −1 I −1 (θ0 ), where I (θ )
is the matrix with i j component
∂
∂
∂2
E
log f (X |θ)
log f (X |θ ) = −E
log f (X |θ )
∂θi
∂θ j
∂θi ∂θ j
Since we do not wish to delve deeply into technical details, we do not specify the
conditions under which the results obtained in this section hold. It is worth mentioning,
however, that the true parameter value, θ0 , is required to be an interior point of the set of
all parameter values. Thus the results would not be expected to apply in Example D of
Section 8.5 if α0 = 1, for example. It is also required that the support of the density
or frequency function f (x|θ) [the set of values for which f (x|θ ) > 0] does not depend
on θ. Thus, for example, the results would not be expected to apply to estimating θ from
a sample of random variables that were uniformly distributed on the interval [0, θ].
The following sections will apply these results in several examples.
8.5.3 Confidence Intervals from Maximum
Likelihood Estimates
In Chapter 7, confidence intervals for the population mean μ were introduced. Recall that the confidence interval for μ was a random interval that contained μ with
some specified probability. In the current context, we are interested in estimating the
parameter θ of a probability distribution. We will develop confidence intervals for θ
based on θ̂; these intervals serve essentially the same function as they did in Chapter 7
in that they express in a fairly direct way the degree of uncertainty in the estimate θ̂ . A
confidence interval for θ is an interval based on the sample values used to estimate θ.
Since these sample values are random, the interval is random and the probability that
it contains θ is called the coverage probability of the interval. Thus, for example, a
90% confidence interval for θ is a random interval that contains θ with probability .9.
A confidence interval quantifies the uncertainty of a parameter estimate.
We will discuss three methods for forming confidence intervals for maximum
likelihood estimates: exact methods, approximations based on the large sample properties of maximum likelihood estimates, and bootstrap confidence intervals. The construction of confidence intervals for parameters of a normal distribution illustrates the
use of exact methods.
EXAMPLE A
We found in Example B of Section 8.5 that the maximum likelihood estimates of μ
and σ 2 from an i.i.d. normal sample are
μ̂ = X
n
1
(X i − X )2
σ̂ 2 =
n i=1
280
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
A confidence interval for μ is based on the fact that
√
n(X − μ)
∼ tn−1
S
where tn−1 denotes the t distribution with n − 1 degrees of freedom and
1 (X i − X )2
n − 1 i=1
n
S2 =
(see Section 6.3). Let tn−1 (α/2) denote that point beyond which the t distribution with
n − 1 degrees of freedom has probability α/2. Since the t distribution is symmetric
about 0, the probability to the left of −tn−1 (α/2) is also α/2. Then, by definition,
√
n(X − μ)
≤ tn−1 (α/2) = 1 − α
P −tn−1 (α/2) ≤
S
The inequality can be manipulated to yield
S
S
P X − √ tn−1 (α/2) ≤ μ ≤ X + √ tn−1 (α/2)
n
n
=1−α
According to this equation, the probability that μ lies in the interval X ± Stn−1 (α/2)/
√
n is 1 − α. Note that this interval is random: The center is at the random point X
and the width is proportional to S, which is also random.
Now let us turn to a confidence interval for σ 2 . From Section 6.3,
n σ̂ 2
2
∼ χn−1
σ2
2
where χn−1
denotes the chi-squared distribution with n − 1 degrees of freedom. Let
2
χm (α) denote the point beyond which the chi-square distribution with m degrees of
freedom has probability α. It then follows by definition that
n σ̂ 2
2
2
(1 − α/2) ≤ 2 ≤ χn−1
(α/2) = 1 − α
P χn−1
σ
Manipulation of the inequalities yields
n σ̂ 2
n σ̂ 2
2
≤
σ
≤
P
2
2
χn−1
(α/2)
χn−1
(1 − α/2)
=1−α
Therefore, a 100(1 − α)% confidence interval for σ 2 is
n σ̂ 2
n σ̂ 2
,
2
2
χn−1
(α/2)
χn−1
(1 − α/2)
Note that this interval is not symmetric about σ̂ 2 —it is not of the form σ̂ 2 ± c, unlike
the previous example.
A simulation illustrates these ideas: The following experiment was done on a
computer 20 times. A random sample of size n = 11 from normal distribution with
mean μ = 10 and variance σ 2 = 9 was generated. From the sample, X and σ̂ 2 were
8.5 The Method of Maximum Likelihood
12
50
11
40
10
30
9
281
20
8
10
7
0
F I G U R E 8.8 20 confidence intervals for μ (left panel) and for σ 2 (right panel) as
described in Example A. Horizontal lines indicate the true values.
calculated and 90% confidence intervals for μ and σ 2 were constructed, as described
before. Thus at the end there were 20 intervals for μ and 20 intervals for σ 2 . The 20
intervals for μ are shown as vertical lines in the left panel of Figure 8.8 and the
20 intervals for σ 2 are shown in the right panel. Horizontal lines are drawn at the
true values μ = 10 and σ 2 = 9. Since these are 90% confidence intervals, we expect
the true parameter values to fall outside the intervals 10% of the time; thus on the
average we would expect 2 of 20 intervals to fail to cover the true parameter value.
From the figure, we see that all the intervals for μ actually cover μ, whereas four of
■
the intervals of σ 2 failed to contain σ 2 .
Exact methods such as that illustrated in the previous example are the exception
rather than the rule in practice. To construct an exact interval requires detailed knowledge of the sampling distribution as well as some cleverness. A second method of
constructing confidence intervals is based on the large sample theory
√ of the previous
section. According to the results of that section, the distribution of n I (θ0 )(θ̂ − θ0 )
is approximately the standard normal distribution. Since θ0 is unknown, we will use
I (θ̂) in place of I (θ0 ); we have employed similar substitutions a number of times
before—for example, in finding an approximate standard error
in Example A of Section 8.4. It can be further argued that the distribution of n I (θ̂ )(θ̂ − θ0 ) is also
approximately standard normal. Since the standard normal distribution is symmetric
about 0,
.
P −z(α/2) ≤ n I (θ̂ )(θ̂ − θ0 ) ≤ z(α/2) ≈ 1 − α
Manipulation of the inequalities yields
θ̂ ± z(α/2) 1
n I (θ̂ )
282
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
as an approximate 100(1 − α)% confidence interval. We now illustrate this procedure
with an example.
EXAMPLE B
Poisson Distribution
The mle of λ from a sample of size n from a Poisson distribution is
λ̂ = X
Since the sum of independent Poisson random variables follows a Poisson distribution,
the parameter of which is the sum of the parameters of the individual summands,
n
n λ̂ =
i=1 X i follows a Poisson distribution with mean nλ. Also, the sampling
distribution of λ̂ is known, although it depends on the true value of λ, which is
unknown. Exact confidence intervals for λ may be obtained by using this fact, and
special tables are available (Pearson and Hartley 1966).
For large samples, confidence intervals may be derived as follows. First, we
need to calculate I (λ). Let f (x|λ) denote the probability mass function of a Poisson
random variable with parameter λ. There are two ways to do this. We may use the
definition
2
∂
log f (X |λ)
I (λ) = E
∂λ
We know that
log f (x|λ) = x log λ − λ − log x!
and thus
I (λ) = E
X
−1
λ
2
Rather than evaluate this quantity, we may use the alternative expression for I (λ)
given by Lemma A of Section 8.5.2:
2
∂
I (λ) = −E
log f (X |λ)
∂λ2
Since
∂2
X
log f (X |λ) = − 2
2
∂λ
λ
I (λ) is simply
E(X )
1
=
λ2
λ
Thus, an approximate 100(1 − α)% confidence interval for λ is
/
X
X ± z(α/2)
n
Note that the asymptotic variance is in fact the exact variance in this case. The
confidence interval, however, is only approximate, since the sampling distribution of
X is only approximately normal.
8.5 The Method of Maximum Likelihood
283
As a concrete example, let us return to the study that involved counting asbestos
fibers on filters, discussed earlier. In Example A in Section 8.4, we found λ̂ = 24.9.
The estimated standard error of λ̂ is thus (n = 23)
/
λ̂
sλ̂ =
= 1.04
n
An approximate 90% confidence interval for λ is
λ̂ ± 1.65sλ̂
or (23.2, 26.6). This interval gives a good indication of the uncertainty inherent in
the determination of the average asbestos level using the model that the counts in the
■
grid squares are independent Poisson random variables.
In a similar way, approximate confidence intervals can be obtained for parameters
estimated from random multinomial counts. The counts are not i.i.d., so the variance
of the parameter estimate is not of the form 1/[n I (θ )]. However, it can be shown that
Var(θ̂) ≈
1
1
=−
E[l (θ0 )2 ]
E[l (θ0 )]
and the maximum likelihood estimate is approximately normally distributed. Example C illustrates this concept.
EXAMPLE C
Hardy-Weinberg Equilibrium
Let us return to the example of Hardy-Weinberg equilibrium discussed in Example A
in Section 8.5.1. There we found θ̂ = .4247. Now,
l (θ) = −
2X 3 + X 2
2X 1 + X 2
+
1−θ
θ
In order to calculate E[l (θ)2 ], we would have to deal with the variances and covariances of the X i . This does not look too inviting; it turns out to be easier to calculate
E[l (θ)].
l (θ) = −
2X 1 + X 2
2X 3 + X 2
−
2
(1 − θ )
θ2
Since the X i are binomially distributed, we have
E(X 1 ) = n(1 − θ )2
E(X 2 ) = 2nθ (1 − θ )
E(X 3 ) = nθ 2
We find, after some algebra, that
E[l (θ )] = −
2n
θ (1 − θ )
Since θ is unknown, we substitute θ̂ in its place and obtain the estimated standard
284
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
error of θ̂:
sθ̂ = /
=
1
−I (θ̂ )
θ̂ (1 − θ̂ )
= .011
2n
An approximate 95% confidence interval for θ is θ̂ ± 1.96sθ̂ , or (.403, .447). (Note
that this estimated standard error of θ̂ agrees with that obtained by the bootstrap in
■
Example 8.5.1A.)
Finally, we describe the use of the bootstrap for finding approximate confidence
intervals. Suppose that θ̂ is an estimate of a parameter θ—the true, unknown value
of which is θ0 —and suppose for the moment that the distribution of = θ̂ − θ0 is
known. Denote the α/2 and 1 − α/2 quantiles of this distribution by δ and δ; i.e.,
α
P(θ̂ − θ0 ≤ δ) =
2
α
P(θ̂ − θ0 ≤ δ) = 1 −
2
Then
P(δ ≤ θ̂ − θ0 ≤ δ) = 1 − α
and from manipulation of the inequalities,
P(θ̂ − δ ≤ θ0 ≤ θ̂ − δ) = 1 − α
The preceding assumed that the distribution of θ̂ − θ0 was known, which is
typically not the case. If θ0 were known, this distribution could be approximated
arbitrarily well by simulation: Many, many samples of observations could be randomly
generated on a computer with the true value θ0 ; for each sample, the difference θ̂ − θ0
could be recorded; and the two quantiles δ and δ could, consequently, be determined
as accurately as desired. Since θ0 is not known, the bootstrap principle suggests using
θ̂ in its place: Generate many, many samples (say, B in all) from a distribution with
value θ̂; and for each sample construct an estimate of θ, say θ ∗j , j = 1, 2, . . . , B. The
distribution of θ̂ −θ0 is then approximated by that of θ ∗ − θ̂, the quantiles of which are
used to form an approximate confidence interval. Examples may make this clearer.
EXAMPLE D
We first apply this technique to the Hardy-Weinberg equilibrium problem; we will
find an approximate 95% confidence interval based on the bootstrap and compare the
result to the interval obtained in Example C, where large-sample theory for maximum likelihood estimates was used. The 1000 bootstrap estimates of θ of Example A
of Section 8.5.1 provide an estimate of the distribution of θ ∗ ; in particular the 25th
largest is .403 and the 975th largest is .446, which are our estimates of the .025 and
8.6 The Bayesian Approach to Parameter Estimation
285
.975 quantiles of the distribution. The distribution of θ ∗ − θ̂ is approximated by subtracting θ̂ = .425 from each θi∗ , so the .025 and .975 quantiles of this distribution are
estimated as
δ = .403 − .425 = −.022
δ = .446 − .425 = .021
Thus our approximate 95% confidence interval is
(θ̂ − δ, θ̂ − δ) = (.404, .447)
Since the uncertainty in θ̂ is in the second decimal place, this interval and that found
in Example C are identical for all practical purposes.
■
EXAMPLE E
Finally, we apply the bootstrap to find approximate confidence intervals for the
parameters of the gamma distribution fit in Example C of Section 8.5. Recall that
the estimates were α̂ = .471 and λ̂ = 1.97. Of the 1000 bootstrap values of
∗
α ∗ , α1∗ , α2∗ , . . . , α1000
, the 50th largest was .419 and the 950th largest was .538; the
.05 and .95 quantiles of the distribution of α ∗ − α̂ are approximated by subtracting α̂
from these values, giving
δ = .419 − .471 = −.052
δ = .538 − .471 = .067
Our approximate 90% confidence interval for α0 is thus
(α̂ − δ, α̂ − δ) = (.404, .523)
The 50th and 950th largest values of λ∗ were 1.619 and 2.478, and the corresponding
■
approximate 90% confidence interval for λ0 is (1.462, 2.321).
We caution the reader that there are a number of different methods of using the
bootstrap to find approximate confidence intervals. We have chosen to present the
preceding method largely because the reasoning leading to its development is fairly
direct. Another popular method, the bootstrap percentile method, uses the quantiles
of the bootstrap distribution of θ̂ directly. Using this method in the previous example,
the confidence interval for α would be (.419, .538). Although this direct equation
of quantiles of the bootstrap sampling distribution with confidence limits may seem
initially appealing, its rationale is somewhat obscure. If the bootstrap distribution is
symmetric, the two methods are equivalent (see Problem 38).
8.6 The Bayesian Approach
to Parameter Estimation
A preview of the Bayesian approach was given in Example E of Section 3.5.2, which
should be reviewed before continuing.
286
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
In the Bayesian approach, the unknown parameter θ is treated as a random variable, with “prior distribution” f (θ ) representing what we know about the parameter
before observing data, X . In the following, we assume is a continuous random
variable; the discrete case is entirely analogous. This model is in contrast to the approaches described in the previous sections, in which θ was treated as an unknown
constant. For a given value, = θ, the data have the probability distribution (density
or probability mass function) f X | (x|θ ). The joint distribution of X and is thus
f X, (x, θ) = f X | (x|θ ) f (θ )
and the marginal distribution of X is
f X, (x, θ)dθ
f X (x) =
=
f X | (x|θ ) f (θ )dθ
The distribution of given the data X is thus
f X, (x, θ)
f X (x)
f X | (x|θ ) f (θ )
=
f X | (x|θ ) f (θ )dθ
f |X (θ|x) =
This is called the posterior distribution; it represents what is known about having
observed data X . Note that the likelihood is f X | (x|θ), viewed as a function of θ , and
we may usefully summarize the preceding result as
f |X (θ|x) ∝ f X | (x|θ ) × f (θ )
Posterior density ∝ Likelihood × Prior density
The Bayes paradigm has an appealing formal simplicity as it involves elementary
probability operations. We will now see what it amounts to for examples we considered
earlier.
EXAMPLE A
Fitting a Poisson Distribution
Here the unknown parameter is λ, which has a prior distribution f (λ), and the data
are n i.i.d. observations X 1 , X 2 , . . . , X n , which for a given value λ are Poisson random
variables with
f X i | (xi |λ) =
λxi e−λ
,
xi !
xi = 0, 1, 2, . . .
Their joint distribution given λ is (from independence) the product of their marginal
distributions given λ
λi=1 xi e−nλ
f X | (x|λ) = 3n
i=1 x i !
n
8.6 The Bayesian Approach to Parameter Estimation
287
where X denotes (X 1 , X 2 , . . . , X n ). The posterior distribution of given X is then
λi=1 xi e−nλ f (λ)
n
λi=1 xi e−nλ f (λ) dλ
n
f |X (λ|x) =
3n
(the term i=1
xi ! has cancelled out).
Thus, to evaluate the posterior distribution, we have to do two things: specify the prior distribution f (λ) and carry out the integration in the denominator of
the preceding expression. For illustration, we consider the data of Examples 8.4A
and 8.5A.
We will consider two approaches to specifying the prior distribution. The first
is that of an orthodox Bayesian who takes very seriously the model that the prior
distribution specifies his prior opinion. Note that this specification should be done
before seeing the data, X , and he is required to provide the probability density f (λ)
through introspection. This is not an easy task to carry out, and even the orthodox often
compromise for convenience. He thus decides to quantify his opinion by specifying a
prior mean μ1 = 15 and standard deviation σ = 5 and to use, because the math works
out nicely as we will see, a gamma density with that mean and standard deviation.
This choice could be aided by plotting gamma densities for various parameter values.
The prior density is shown in Figure 8.9. Using the relationships developed in Example C in Section 8.4, the second moment is μ2 = μ21 + σ 2 = 250 and the parameters
of the gamma density are
ν=
μ1
= 0.6
μ2 − μ21
α = νμ1 = 9
0.4
0.3
0.2
0.1
0.0
0
5
10
15
20
25
30
F I G U R E 8.9 First statistician s prior (solid) and posterior (dashed). Second
statistician s posterior (dotted).
288
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
(We denote the parameter by ν rather than by the usual λ since λ has already been
used for the parameter of the Poisson distribution.) The prior distribution for is then
ν α α−1 −νλ
λ e
(α)
f (λ) =
After some cancellation, the posterior density is
f |X (λ|x) =
λxi +α−1 e−(n+ν)λ
λxi +α−1 e−(n+ν)λ dλ
∞
0
Now, consider this an important trick that is used time and again in Bayesian calculations: the denominator is a constant that makes the expression integrate to 1. We can
deduce from the form of the numerator that the ratio must be a gamma density with
parameters
α =
xi + α = 582
ν = n + ν = 23.6
This standard trick allows the statistician to avoid having to do any explicit integration. (Make sure you understand it, because it will occur again several times.) The
posterior density is shown in Figure 8.9. Compare it to the prior distribution to observe
how observation of the data, X , has drastically changed his state of knowledge about
. Notice that the posterior density is much more symmetric and looks like a normal
■
density (that this is no accident will be shown later).
According to the Bayesian paradigm, all the information about is contained in
the posterior distribution. The mean of this distribution (the posterior mean) is
α
= 24.7
ν
The most probable value of , the posterior mode, is 24.6. (Verify that the gamma
density is maximized at (α − 1)/ν.) Either of these two values could be used as a
point estimate of the unknown mean of the Poisson distribution, if a single number is
required.
The variance of the posterior distribution is
μpost =
α
= 1.04
ν2
and the posterior standard deviation is σpost = 1.02, which is a simple measure of
variability—the posterior distribution of has mean 24.7 and standard deviation
1.02. A Bayesian analogue of a 90% confidence interval is the interval from the 5th
percentile to the 95th percentile of the posterior, which can be found numerically to
be [23.02, 26.34]. A common alternative to this interval is a high posterior density
(HPD) interval, formed as follows: Imagine placing a horizontal line at the high point
of the posterior density and moving it downward until the interval of λ formed below
where the line cuts the density contained 90% probability. If the posterior density is
symmetric and unimodal, as is nearly the case in Figure 8.9, the HPD interval will
coincide with the interval between the percentiles.
2
=
σpost
8.6 The Bayesian Approach to Parameter Estimation
289
The second statistician takes a more utilitarian, noncommittal approach. She
believes that it is implausible that the mean count λ could be larger than 100, and
uses a simple prior that is uniform on [0, 100], without trying to quantify her opinion
more precisely. The posterior density is thus
1
λi=1 xi e−nλ 100
f |X (λ|x) =
,
100
1
n
λi=1 xi e−nλ dλ
100 0
n
0 ≤ λ ≤ 100
The denominator has to be integrated numerically, but this is easy to do for such
a smooth function. The resulting posterior density is shown in Figure 8.9. Using
numerical evaluations, she finds that the posterior mode is 24.9, the posterior mean
is 25.0, and the posterior standard deviation is 1.04. The interval from the 5th to the
95th percentile is [23.3, 26.7].
We now compare these two results to each other and to the results of maximum
likelihood analysis.
Estimate
mode
mean
standard deviation
upper limit
lower limit
Bayes 1
Bayes 2
Maximum Likelihood
24.6
24.7
1.02
26.3
23.0
24.9
25.0
1.04
26.7
23.3
24.9
—
1.04
26.6
23.2
Comparing the results of the second Bayesian to those of maximum likelihood,
it is important to realize that her posterior density is directly proportional to the likelihood for 0 ≤ λ ≤ 100, because the prior is flat over this range and the posterior is
proportional to the prior times the likelihood. Thus, her posterior mode and the maximum likelihood estimate are identical. There is no such guarantee that her posterior
standard deviation and the approximate standard error of the maximum likelihood
estimate are identical, but they turn out to be, to the number of significant figures
displayed in the table. The two 90% intervals are very close.
Now compare the results of the first and second Bayesians. Observe that although
his prior opinion was not in accord with the data, the data strongly modified the prior,
to produce a posterior that is close to hers. Even though they start with quite different
assumptions, the data forces them to very similar conclusions. His prior opinion has
indeed influenced the results: his posterior mean and mode are less than hers, but
the influence has been mild. (If there had been less data or if his prior opinions
had been much more biased to low values, the results would have been in greater
conflict.) The fundamental result that the posterior is proportional to the prior times
the likelihood helps us to understand the difference: the likelihood is substantial only
in the region approximately between λ = 22 and λ = 28. (This can be seen in the
figure, because the second statistician’s posterior is proportional to the likelihood.
See Figure 8.5, also). In this region, his prior decreases slowly, so the posterior is
proportional to a weighted version of the likelihood, with slowly decreasing weight.
290
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
The first Bayesian’s posterior thus differs from the second by being pushed up slightly
on the left and pulled down on the right.
Although they are very similar numerically, there is an important difference
between the Bayesian and frequentist interpretation of the confidence intervals. In
the Bayesian framework, is a random variable and it makes perfect sense to say,
“Given the observations, the probability that is in the interval [23.3, 26.7] is 0.90.”
Under the frequentist framework, such a statement makes no sense, because λ is a
constant, albeit unknown, and it either lies in the interval [23.3, 26.7] or doesn’t—no
probability is involved. Before the data are observed, the interval is random, and it
makes sense to state that the probability that the interval contains the true parameter
value is 0.90, but after the data are observed, nothing is random anymore. One way to
understand the difference of interpretation is to realize that in the Bayesian analysis
the interval refers to the state of knowledge about λ and not to λ itself.
Finally, we note that an alternative for the second statistician would have been to
use a gamma prior because of its analytical convenience, but to make the prior very
flat. This can be accomplished by setting α and λ to be very small.
EXAMPLE B
Normal Distribution
It is convenient to reparametrize the normal distribution, replacing σ 2 by ξ = 1/σ 2 ;
ξ is called the precision. We will also use θ in place of μ. The density is then
ξ 1/2
1
exp − ξ(x − θ )2
f (x|θ, ξ ) =
2π
2
The normal distribution has two parameters, and we will consider cases of Bayesian
■
analysis depending on which of them are known and unknown.
Case of Unknown Mean and Known Variance
We first consider the case in which the precision is known, ξ = ξ0 and the mean, θ ,
is unknown. In the Bayesian treatment, the mean is a random variable, . It is mathe−1
). This prior
matically convenient to use a prior distribution for , which is N (θ0 , ξprior
is very flat, or uninformative, when ξprior is very small, i.e., when the prior variance
is very large. Thus, if X = (X 1 , X 2 , . . . , X n ) are independent given θ
f |X (θ|x) ∝ f X | (x|θ) × f (θ )
n
−ξ0
ξprior
ξ0 n/2 (xi − θ )2 ×
exp
=
2π
2
2π
i=1
−ξprior
(θ − θ0 )2
× exp
2
n
1
2
2
(xi − θ ) + ξprior (θ − θ0 )
∝ exp − ξ0
2
i=1
1/2
Here we have exhibited only the terms in the posterior density that depend upon θ ;
the last expression above shows the shape of the posterior density as a function of θ .
The posterior density itself is proportional to this expression, with a proportionality
constant that is determined by the requirement that the posterior density integrates to 1.
8.6 The Bayesian Approach to Parameter Estimation
291
We will now manipulate the expression for the numerator to cast it in a form so
that we can recognize that the posterior density is normal. Expressing (xi − θ )2 =
(xi − x̄)2 + n(θ − x̄)2 , and absorbing more terms that do not depend on θ into the
constant of proportionality (a typical move in Bayesian calculations), we find
1
f |X (θ|x) ∝ exp − [nξ0 (θ − x̄)2 + ξprior (θ − θ0 )2 ]
2
Now, observe that this is of the form exp(−(1/2)Q(θ )), where Q(θ ) is a quadratic
polynomial. We can find expressions ξpost and θpost , and write
Q(θ) = ξpost (θ − θpost )2 + terms that do not depend on θ
and conclude that the posterior density is normal with posterior mean θpost and posterior precision ξpost . Again, terms that do not depend on θ do not affect the shape of
the posterior density and are absorbed in the normalization constant that makes the
posterior density integrate to 1. Thus we expand Q(θ ) and identify the coefficient of
θ 2 as the posterior precision and the coefficient of −θ as twice the posterior mean
times the posterior precision. Doing so, we find
ξpost = nξ0 + ξprior
nξ0 x̄ + θ0 ξprior
nξ0 + ξprior
ξprior
nξ0
= x̄
+ θ0
nξ0 + ξprior
nξ0 + ξprior
θpost =
The posterior density of θ is thus normal with this mean and precision. Note that the
precision has increased and that the posterior mean is a weighted combination of the
sample mean and the prior mean.
To interpret these results, consider what happens when ξprior nξ0 , which would
be the case if n were sufficiently large of if ξprior were small (as for a very flat prior).
Then the posterior mean would be
θpost ≈ x̄
which is the maximum likelihood estimate, and
ξpost ≈ nξ0
2
This last equation can be written as σpost
= σ02 /n, which is just the variance of X in
the non-Bayesian setting. In summary, if the flat prior with very small ξprior is used,
the posterior density of θ is very close to normal with mean x̄ and variance σ02 /n.
■
Case of Known Mean and Unknown Variance
In this case, the precision is unknown and is treated as a random variable , with
prior distribution f (ξ ). Given ξ , the X i are independent N (θ0 , ξ −1 ). Let X =
(X 1 , X 2 , . . . , X n ). Then
f |X (ξ |x) ∝ f X | (x|ξ ) f (ξ )
1 (xi − θ0 )2
∝ ξ n/2 exp − ξ
2
f (ξ )
292
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
Observing how the density depends on ξ , we realize that it is analytically convenient
to specify the prior to be a gamma density: ∼ (α, λ). Then
1 n/2
f |X (ξ |x) ∝ ξ exp − ξ
(xi − θ0 )2 ξ α−1 e−λξ
2
which is a gamma density with parameters,
n
αpost = α +
2
1
λpost = λ +
(xi − θ0 )2
2
In the case of a flat prior (small α and λ), the posterior mean and mode are
1
Posterior mean ≈
(xi − θ0 )2
n
1 Posterior mode ≈
(xi − θ0 )2
n−2
The former is the maximum likelihood estimate of σ 2 . In the limit, λ → 0, α → 0,
1 (xi − θ0 )2
■
f |X (ξ |x) ∝ ξ n/2−1 exp − ξ
2
Case of Unknown Mean and Unknown Variance
In this case, there are two unknown parameters, and a Bayesian approach requires
the specification of a joint two-dimensional prior distribution. We follow a path of
mathematical convenience and take the priors to be independent:
−1
∼ N θ0 , ξprior
∼ (α, λ)
We then have
f ,|X (θ, ξ |x) ∝ f X |, (x|θ, ξ ) f (θ ) f (ξ )
ξ
∝ ξ n/2 exp −
(xi − θ )2
2
ξprior
(θ − θ0 )2 ξ α−1 exp(−λξ )
× exp −
2
From the manner in which θ and ξ occur in the first exponential, it appears that the
two variables are not independent in the posterior even though they were in the prior.
To evaluate this joint posterior density, we would have to find the constant of proportionality that makes it integrate to 1—the normalization constant. Two dimensional
numerical integration could be used.
Often the primary interest is in the mean, θ , and one useful aspect of Bayesian
analysis is that information about θ can be “marginalized” by integrating out ξ :
∞
f ,|X (θ, ξ |x)dξ
f |X (θ|x) =
0
8.6 The Bayesian Approach to Parameter Estimation
293
Examining the preceding expression for f ,|X (θ, ξ |x) as a function of ξ , we see
that it is of the form of a gamma density, with parameters α̃ = α + n/2 and λ̃ =
λ + (1/2) (xi − θ)2 , so we can evaluate the integral. We thus find
ξprior
(α + n/2)
(θ − θ0 )2 )
f |X (θ|x) ∝ exp −
2
[λ + 12 (xi − θ )2 ]α+n/2
This is not a density that we recognize, but it could be evaluated numerically. Doing
so would again entail finding the normalizing constant, which could be done by
numerical integration. Some simplifications occur when n is large or when the prior
is quite flat (α, λ, ξprior are small). Then
−n/2
(xi − θ )2
f |X (θ|x) ∝
This posterior is maximized when (xi − θ )2 is minimized, which occurs at θ =
x̄. We can relate this to the result we found for maximum likelihood analysis by
expressing
(xi − x̄)2 + n(θ − x̄)2
(xi − θ)2 =
= (n − 1)s 2 + n(θ − x̄)2
n(θ − x̄)2
= (n − 1)s 2 1 +
(n − 1)s 2
Substituting this above and absorbing terms that do not depend on θ into the proportionality constant, we find
−n/2
1 n(θ − x̄)2
f |X (θ|x) ∝ 1 +
n−1
s2
Now comparing this to the definition of the t distribution (Section 6.2), we see that
√
n( − x̄)
∼ tn−1
s
corresponding to the result from √
maximum likelihood analysis.
The interval x̄ ±tn−1 (α/2)s/ n was earlier derived as a 100(1−α)% confidence
interval centered about the maximum likelihood estimate, and here it has reappeared
in the Bayesian analysis as an interval with posterior probability 1 − α. There are
differences of interpretation, however, just as there were for the earlier Poisson case.
The Bayesian interval is a probability statement referring to the state of knowledge
about θ given the observed data, regarding θ as a random variable. The frequentist
confidence interval is based on a probability statement about the possible values of
■
the observations, regarding θ as a constant, albeit unknown.
EXAMPLE C
Hardy-Weinberg Equilibrium
We now turn to a Bayesian treatment of Example A in Section 8.5.1. We use the
multinomial likelihood function and a prior for θ , which is uniform on [0, 1]. The
posterior density is thus proportional to the likelihood, and is shown in Figure 8.10.
Note that it looks very much like a normal density, a phenomenon that will be
explored in a later section. Since f X | (x|θ ) is a polynomial in θ (of high degree),
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
30
Posterior density
294
20
10
0
0.30
0.35
0.40
0.45
0.50
0.55
0.60
F I G U R E 8.10 Posterior distribution of .
the normalization constant can in principle be computed analytically. (Alternatively,
all the computations can be done numerically.)
Because the prior is flat, the posterior is directly proportional to the likelihood
and the maximum of the posterior density is the maximum likelihood estimate, θ̂ =
0.4247. The 0.025 percentile of the density is 0.404, and the 0.975 percentile is 0.446.
These results agree with the approximate confidence interval found for the maximum
likelihood estimate in Example C in Section 8.5.3.
■
8.6.1 Further Remarks on Priors
In the previous section, we saw that if the prior for a Poisson parameter is chosen
to be a gamma density, then the posterior is also a gamma density. Similarly, when
the prior for a normal mean with known variance is chosen to be normal, then the
posterior is normal as well. Earlier, in Example E in Section 3.5.2, a beta prior was
used for a binomial parameter, and the posterior turned out to be beta as well. These
are examples of conjugate priors: if the prior distribution belongs to a family G
and, conditional on the parameters of G, the data have a distribution H , then G is
said to be conjugate to H if the posterior is in the family G. Other conjugate priors
will be the subject of problems at the end of the chapter. Conjugate priors are used
for mathematical convenience (required integrations can be done in closed form)
and because they can assume a variety of shapes as the parameters of the prior are
varied.
In scientific applications, it is usually desirable to use a flat, or “uninformative,”
prior so that the data can speak for themselves. Even if a scientific investigator actually
had a strong prior opinion, he or she might want to present an “objective” analysis.
This is accomplished by using a flat prior so that the conclusions, as summarized in
the posterior density, are those of one who is initially unopinionated or unprejudiced.
8.6 The Bayesian Approach to Parameter Estimation
295
If an informative prior were used, it would have to be justified to the larger scientific
community. The objective prior thus has a hypothetical, or “what if,” status: if one
was initially indifferent to parameter values in the range in which the likelihood is
large, then one’s opinion after observing the data would be expressed as a posterior
proportional to the likelihood.
Attempts have been made to formalize more precisely what the notion of an uninformative prior means. One problem that is addressed is caused by reparametrization.
For example, suppose that the prior density of the precision ξ is taken to be uniform on
an interval [a, b], which might seem to be a reasonable way to quantify the notion of
being uniformative. However, if the variance σ 2 = 1/ξ , rather than the precison, was
used, the prior density of σ 2 would not be uniform on [b−1 , a −1 ]. We will not delve
further into these issues here, except to note that the parametrization θ or g(θ ) would
make a difference only if the difference in the shapes of the priors was substantial in
the region in which the likelihood was large.
We saw in the Poisson example that if α and ν are very small, the gamma prior
is quite flat and the posterior is proportional to the likelihood function. Formally, if α
and ν are set equal to zero, then the prior is
f |α,ν (λ) = λ−1 , 0 ≤ λ < ∞
But this function does not integrate to 1—it is not a probability density. A similar
phenomena occurs in the normal case with unknown mean and known precision, if
the prior precision is set equal to 0. The prior is then
f (θ) ∝ 1, −∞ < θ < ∞
and not a probability density either. Such priors are called improper priors (priors
that lack propriety).
In general, if an improper prior is formally used, the posterior may not be a
density either, because the denominator of the expression for the posterior density,
f X | (x|θ) f (θ) dθ may not converge. (Note that it is integrated with respect to
θ , not x.) This has not been the case in our examples. For the Poisson example, if
f (λ) ∝ λ−1 , then the denominator is
∞
λ xi −1 e−nλ dλ < ∞
0
In the normal case, too, the integral is defined, and thus there is a well-defined posterior
density.
Let us revisit some examples using the device of an improper prior. In the Poisson
example, using the improper prior f (λ) = λ−1 results in a (proper) posterior
f |X (λ|x) ∝ λ
xi −1 −nλ
e
which can be recognized as a gamma density.
In the normal example with unknown mean and variance, we can take θ and ξ to
be independent with improper priors f (θ ) = 1 and f (ξ ) = ξ −1 . The joint posterior
of θ and ξ is then
ξ
(xi − θ )2
f ,|X (θ, ξ |x) ∝ ξ n/2−1 exp −
2
296
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
Expressing
n
i=1 (x i
− θ)2 = (n − 1)s 2 + n(θ − x̄)2 , we have
f ,|X (θ, ξ |x) ∝ ξ
n/2−1
ξ
nξ
2
exp − (n − 1)s exp − (θ − x̄)2
2
2
For fixed ξ , this expression is proportional to the conditional density of θ given ξ .
(Why?) From the form of the dependence on θ , we see that conditional on ξ , θ is
normal with mean x̄ and precision nξ . By integrating out ξ , we can find the marginal
distribution of θ and relate it to the t distribution as was done earlier.
Since improper priors are not actually probability densities, they are difficult to
interpret literally. However, the resulting posteriors can be viewed as approximations
to those that would have arisen with extreme values of the parameters of proper
priors. The priors corresponding to such extreme values are very flat, so the posterior
is dominated by the likelihood. Then it is only in the range in which the likelihood is
large that the prior makes any practical difference—truncating the improper prior well
outside this range to produce a proper prior will not appreciably change the posterior.
8.6.2 Large Sample Normal Approximation to the Posterior
We have seen in several examples that the posterior distribution is nearly normal with
the mean equal to the maximum likelihood estimate, and that the posterior standard
deviation is close to the asymptotic standard deviation of the maximum likelihood
estimate. The two methods thus often give quite comparable results. We will not give
a formal proof here, but rather will sketch an argument that the posterior distribution
is approximately normal with the mean equal the the maximum likelihood estimate,
θ̂ , and variance approximately equal to −[l (θ̂ )]−1 .
Denoting the observations generically by x, the posterior distribution is
f |X (θ|x) ∝ f (θ ) f X | (x|θ )
= exp[log f (θ )] exp[log f X | (x|θ )]
= exp[log f (θ )] exp[l(θ )]
Now, if the sample is large, the posterior is dominated by the likelihood, and in the
region where the likelihood is large, the prior is nearly constant. Thus, to an approximation,
1
2
f |X (θ|x) ∝ exp l(θ̂) + (θ − θ̂ )l (θ̂ ) + (θ − θ̂ ) l (θ̂ )
2
1
∝ exp (θ − θ̂ )2l (θ̂ )
2
In the last step, we used the fact that since θ̂ is the maximum likelihood estimate
l (θ̂) = 0. The term l(θ̂) was absorbed into a proportionality constant, since we are
evaluating the posterior as a function of θ . Finally, observe that the last expression is
proportional to a normal density with mean θ̂ and variance −[l (θ̂ )]−1 .
8.6 The Bayesian Approach to Parameter Estimation
297
8.6.3 Computational Aspects
Contemporary computational resources have had an enormous impact on Bayesian
inference. As we have seen in several examples, the computationally difficult part
of Bayesian inference is the calculation of the normalizing constant that makes the
posterior density integrate to 1. Traditionally, such calculations were performed analytically, often using conjugate priors so that the integrations could be done explicitly.
The numerical integration of a well-behaved function of a small number of variables
is now trivial.
Difficulties do arise in high dimensional problems, however, and the integrations
are often done by sophisticated Monte Carlo methods. We will not go into these
sorts of methods in this book, but will hint at their nature in the following example of a method called Gibbs Sampling. Consider, as a simple example, inference
for a normal distribution with unknown mean and variance. From Example B in
Section 8.6
ξ
(xi − θ )2
f ,|X (θ, ξ |x) ∝ ξ n/2 exp −
2
ξprior
(θ − θ0 )2 ξ α−1 exp(−λξ )
× exp −
2
For simplicity, suppose that an improper prior is used: ξprior → 0, α → 0, λ → 0.
Then
ξ
(xi − θ )2
f ,|X (θ, ξ |x) ∝ ξ n/2−1 exp −
2
nξ
n/2−1
(θ − x̄)2
exp
∝ξ
2
Here we expressed
(xi − θ)2 =
(xi − x̄)2 + n(θ − x̄)2
and absorbed terms that do not involve θ into the constant of proportionality. To study
the posterior distribution of ξ and θ by Monte Carlo, we would draw many pairs
(ξk , θk ) from this joint density; the problem is how to actually do this.
Gibbs Sampling would accomplish this in the following way. Observe that the
expression f ,|X (θ, ξ |x) shows that for given ξ , θ is normally distributed with mean
x̄ and precision nξ . (Fix ξ in the expression and recognize a normal density in θ.)
Also, if θ is fixed, the density of ξ is a gamma density. Gibbs Sampling alternates
back and forth between the two conditional distributions:
1.
2.
3.
4.
5.
Choose an initial value θ0 ; x̄ would be a natural choice.
Generate ξ0 from a gamma density with parameter θ0 .
Generate θ1 from a normal distribution with parameter ξ0 .
Generate ξ1 from a gamma density with parameter θ1 .
Continue on in this fashion.
The analysis of the algorithm and why it works is beyond the scope of this book. A
“burn-in” period is required so that we might run this scheme for a few hundred steps
before beginning to record pairs (ξk , θk ), k = 1, . . . , N , which would be regarded
298
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
as simulated pairs from the posterior. A further complication is that these pairs are
not independent of one another. But, nonetheless, a histogram of the collection of θk
could be used as an estimate of the marginal posterior distribution of . The posterior
mean of can be estimated as
E(|X ) ≈
N
1 θk
N k=1
8.7 Efficiency and the Cramér-Rao
Lower Bound
In most statistical estimation problems, there are a variety of possible parameter
estimates. For example, in Chapter 7 we considered both the sample mean and a
ratio estimate, and in this chapter we considered the method of moments and the
method of maximum likelihood. Given a variety of possible estimates, how would we
choose which to use? Qualitatively, it would be sensible to choose that estimate whose
sampling distribution was most highly concentrated about the true parameter value.
To define this aim operationally, we would need to specify a quantitative measure
of such concentration. Mean squared error is the most commonly used measure of
concentration, largely because of its analytic simplicity. The mean squared error of θ̂
as an estimate of θ0 is
M S E(θ̂) = E(θ̂ − θ0 )2
= Var(θ̂ ) + (E(θ̂ ) − θ0 )2
(See Theorem A of Section 4.2.1.) If the estimate θ̂ is unbiased [E(θ̂ )= θ0 ], MSE(θ̂ )=
Var(θ̂). When the estimates under consideration are unbiased, comparison of their
mean squared errors reduces to comparison of their variances, or equivalently, standard errors.
Given two estimates, θ̂ and θ̃, of a parameter θ, the efficiency of θ̂ relative to θ̃
is defined to be
Var(θ̃ )
eff(θ̂ , θ̃ ) =
Var(θ̂ )
Thus, if the efficiency is smaller than 1, θ̂ has a larger variance than θ̃ has. This
comparison is most meaningful when both θ̂ and θ̃ are unbiased or when both have
the same bias. Frequently, the variances of θ̂ and θ̃ are of the form
c1
Var(θ̂ ) =
n
c2
Var(θ̃ ) =
n
where n is the sample size. If this is the case, the efficiency can be interpreted as
the ratio of sample sizes necessary to obtain the same variance for both θ̂ and θ̃ . (In
Chapter 7, we compared the efficiencies of estimates of a population mean from a
simple random sample, a stratified random sample with proportional allocation, and
a stratified random sample with optimal allocation.)
8.7 Efficiency and the Cramér-Rao Lower Bound
EXAMPLE A
299
Muon Decay
Two estimates have been derived for α in the problem of muon decay. The method of
moments estimate is
α̃ = 3X
The maximum likelihood estimate is the solution of the nonlinear equation
n
i=1
Xi
=0
1 + α̂ X i
We need to find the variances of these two estimates.
Since the variance of a sample mean is σ 2 /n, we compute σ 2 :
σ 2 = E(X 2 ) − [E(X )]2
1
α2
1 + αx
dx −
=
x2
2
9
−1
1 α2
−
3
9
Thus, the variance of the method of moments estimate is
=
Var(α̃) = 9 Var(X ) =
3 − α2
n
The exact variance of the mle, θ̂, cannot be computed in closed form, so we approximate it by the asymptotic variance,
Var(α̂) ≈
1
n I (α)
and then compare this asymptotic variance to the variance of α̃. The ratio of the former
to the latter is called the asymptotic relative efficiency. By definition,
2
∂
log f (x|α)
I (α) = E
∂α
1
x2
1 + αx
=
dx
2
2
−1 (1 + αx)
1+α
log
− 2α
1−α
,
−1 < α < 1, α = 0
=
2α 3
1
α=0
= ,
3
The asymptotic relative efficiency is thus (for α = 0)
⎤
⎡
Var(α̂)
2α 3 ⎢
⎢
=
Var(α̃)
3 − α2 ⎣
log
1
1+α
1−α
− 2α
⎥
⎥
⎦
300
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
The following table gives this efficiency for various values of α between 0 and 1;
symmetry would yield the values between −1 and 0.
α
Efficiency
0.0
.1
.2
.3
.4
.5
.6
.7
.8
.9
.95
1.0
.997
.989
.975
.953
.931
.878
.817
.727
.582
.464
As α tends to 1, the efficiency tends to 0. Thus, the mle is not much better than the
method of moments estimate for α close to 0 but does increasingly better as α tends
to 1.
It must be kept in mind that we used the asymptotic variance of the mle, so we
calculated an asymptotic relative efficiency, viewing this as an approximation to the
actual relative efficiency. To gain more precise information for a given sample size,
a simulation of the sampling distribution of the mle could be conducted. This might
be especially interesting for α = 1, a case for which the formula for the asymptotic
variance given above does not appear to make much sense. With a simulation study,
the behavior of the bias as n and α vary could be analyzed (we showed that the mle
is asymptotically unbiased, but there may be bias for a finite sample size), and the
■
actual distribution could be compared to the approximating normal.
In searching for an optimal estimate, we might ask whether there is a lower bound
for the MSE of any estimate. If such a lower bound existed, it would function as a
benchmark against which estimates could be compared. If an estimate achieved this
lower bound, we would know that it could not be improved upon. In the case in which
the estimate is unbiased, the Cramér-Rao inequality provides such a lower bound. We
now state and prove the Cramér-Rao inequality.
THEOREM A
Cramér-Rao Inequality
Let X 1 , . . . , X n be i.i.d. with density function f (x|θ ). Let T = t (X 1 , . . . , X n )
be an unbiased estimate of θ. Then, under smoothness assumptions on f (x|θ ),
1
Var(T ) ≥
n I (θ )
8.7 Efficiency and the Cramér-Rao Lower Bound
301
Proof
Let
Z =
=
n
∂
log f (X i |θ )
∂θ
i=1
n
i=1
∂
f (X i |θ )
∂θ
f (X i |θ )
In Section 8.5.2, we showed that E(Z ) = 0. Because the correlation coefficient
of Z and T is less than or equal to 1 in absolute value
Cov2 (Z , T ) ≤ Var(Z )Var(T )
It was also shown in Section 8.5.2 that
∂
log f (X |θ ) = I (θ )
Var
∂θ
Therefore,
Var(Z ) = n I (θ )
The proof will be completed by showing that Cov(Z , T ) = 1. Since Z has
mean 0,
Cov(Z , T ) = E(Z T )
=
···
⎤
⎡
∂
n
n
f
(x
|θ
)
i
⎥⎢ ∂θ
t (x1 , . . . , xn ) ⎣
f (x j |θ ) d x j
⎦
f (xi |θ )
i=1
j=1
Noting that
n
i=1
∂
n
n
f (xi |θ) ∂ ∂θ
f (x j |θ ) =
f (xi |θ )
f (xi |θ) j=1
∂θ i=1
we rewrite the expression for the covariance of Z and T as
n
∂ Cov(Z , T ) = · · · t (x1 , . . . , xn )
f (xi |θ ) d xi
∂θ i=1
n
∂
=
f (xi |θ ) d xi
· · · t (x1 , . . . , xn )
∂θ
i=1
∂
∂
E(T ) =
(θ ) = 1
∂θ
∂θ
which proves the inequality. [Note the interchange of differentiation and integration that must be justified by the smoothness assumptions on f (x|θ ).]
■
=
302
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
Theorem A gives a lower bound on the variance of any unbiased estimate. An
unbiased estimate whose variance achieves this lower bound is said to be efficient.
Since the asymptotic variance of a maximum likelihood estimate is equal to the
lower bound, maximum likelihood estimates are said to be asymptotically efficient.
For a finite sample size, however, a maximum likelihood estimate may not be efficient, and maximum likelihood estimates are not the only asymptotically efficient
estimates.
EXAMPLE B
Poisson Distribution
In Example B in Section 8.5.3, we found that for the Poisson distribution
1
λ
Therefore, by Theorem A, for any unbiased estimate T of λ, based on a sample of
independent Poisson random variables, X 1 , . . . , X n ,
λ
Var(T ) ≥
n
I (λ) =
The mle of λ was found to be X = S/n, where S = X 1 + · · · + X n . Since S follows a
Poisson distribution with parameter nλ, Var(S) = nλ and Var(X ) = λ/n. Therefore,
X attains the Cramér-Rao lower bound, and we know that no unbiased estimator of λ
can have a smaller variance. In this sense, X is optimal for the Poisson distribution.
But note that the theorem does not preclude the possibility that there is a biased
■
estimator of λ that has a smaller mean squared error than X does.
8.7.1 An Example: The Negative Binomial Distribution
The Poisson distribution is often the first model considered for random counts; it
has the property that the mean of the distribution is equal to the variance. When it is
found that the variance of the counts is substantially larger than the mean, the negative
binomial distribution is sometimes instead considered as a model. We consider a
reparametrization and generalization of the negative binomial distribution introduced
in Section 2.1.3, which is a discrete distribution on the nonnegative integers with a
frequency function depending on the parameters m and k:
x
m −k (k + x)
m
f (x|m, k) = 1 +
k
x! (k)
m+k
The mean and variance of the negative binomial distribution can be shown
to be
μ=m
m2
k
It is apparent that this distribution is overdispersed (σ 2 > μ) relative to the Poisson.
We will not derive the mean and variance. (They are most easily obtained by using
moment-generating functions.)
σ2 = m +
8.7 Efficiency and the Cramér-Rao Lower Bound
303
The negative binomial distribution can be used as a model in several cases:
•
•
•
•
If k is an integer, the distribution of the number of successes up to the kth failure in a
sequence of independent Bernoulli trials with probability of success p = m/(m + k)
is negative binomial.
Suppose that is a random variable following a gamma distribution and that for λ,
a given value of , X follows a Poisson distribution with mean λ. It can be shown
that the unconditional distribution of X is negative binomial. Thus, for situations in
which the rate varies randomly over time or space, the negative binomial distribution
might tentatively be considered as a model.
The negative binomial distribution also arises with a particular type of clustering.
Suppose that counts of colonies, or clusters, follow a Poisson distribution and that
each colony has a random number of individuals. If the probability distribution
of the number of individuals per colony is of a particular form (the logarithmic
series distribution), it can be shown that the distribution of counts of individuals is
negative binomial. The negative binomial distribution might be a plausible model
for the distribution of insect counts if the insects hatch from depositions, or clumps,
of larvae.
The negative binomial distribution can be applied to model population size in a
certain birth/death process, the assumption being that the birth rate and death rate
per individual are constant and that there is a constant rate of immigration.
Anscombe (1950) discusses estimation of the parameters m and k and compares
the efficiencies of several methods of estimation. The simplest method is the method
of moments; from the relations of m and k to μ and σ 2 given previously, the method
of moments estimates of m and k are
m̂ = X
X2
σ̂ 2 − X
Another relatively simple method of estimation of m and k is based on the number
of zeros. The probability of the count being zero is
m −k
p0 = 1 +
k
If m is estimated by the sample mean and there are n 0 zeros out of a sample size of
n, then k is estimated by k̂, where k̂ satisfies
−k̂
X
n0
= 1+
n
k̂
k̂ =
Although the solution cannot be obtained in closed form, it is not difficult to find by
iteration.
Figure 8.11, from Anscombe (1950), shows the asymptotic efficiencies of the two
methods of estimation of the negative binomial parameters relative to the maximum
likelihood estimate. In the figure, the method of moments is method 1 and the method
based on the number of zeros is method 2. Method 2 is quite efficient when the mean
is small—that is, when there are a large number of zeros. Method 1 becomes more
efficient as k increases.
304
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
100
40
98%
20
Exponent k
10
90%
4
75%
2
50%
1
50%
75%
90%
0-4
0-2
0-1
0-04 0-1 0-2 0-4
98%
1
2 4
10 20 40
Mean m
100 200 400
Method 1
Method 2
F I G U R E 8.11 Asymptotic efficiencies of estimates of negative binomial parameters.
The maximum likelihood estimate is asymptotically efficient but is somewhat
more difficult to compute. The equations will not be written out here. Bliss and Fisher
(1953) discuss computational methods and give several examples. The maximum
likelihood estimate of m is the sample mean, but that of k is the solution of a nonlinear
equation.
EXAMPLE A
Insect Counts
Let us consider an example from Bliss and Fisher (1953). From each of 6 apple trees
in an orchard that was sprayed, 25 leaves were selected. On each of the leaves, the
number of adult female red mites was counted. Intuitively, we might conclude that
this situation was too heterogeneous for a Poisson model to fit; the rates of infestation
might be different on different trees and at different locations on the same tree.
The following table shows the observed counts and the expected counts from fitting
Poisson and negative binomial distributions. The mle’s for k and m were k̂ = 1.025
and m̂ = 1.146.
Number
per Leaf
Observed
Count
Poisson
Distribution
Negative Binomial
Distribution
0
1
2
3
4
5
6
7
8+
70
38
17
10
9
3
2
1
0
47.7
54.6
31.3
12.0
3.4
.75
.15
.03
.00
69.5
37.6
20.1
10.7
5.7
3.0
1.6
.85
.95
8.8 Sufficiency
305
Casual inspection of this table makes it clear that the Poisson does not fit; there
are many more small and large counts observed than are expected for a Poisson
■
distribution.
A recursive relation is useful in fitting the negative binomial distribution:
m −k
p0 = 1 +
k
m
k+n−1
pn =
n
k+m
pn−1
8.8 Sufficiency
This section introduces the concept of sufficiency and some of its theoretical implications. Suppose that X 1 , . . . , X n is a sample from a probability distribution with the
density or frequency function f (x|θ ). The concept of sufficiency arises as an attempt
to answer the following question: Is there a statistic, a function T (X 1 , . . . , X n ), that
contains all the information in the sample about θ ? If so, a reduction of the original
data to this statistic without loss of information is possible. For example, consider
a sequence of independent Bernoulli trials with unknown probability of success, θ .
We may have the intuitive feeling that the total number of successes contains all
the information about θ that there is in the sample, that the order in which the successes occurred, for example, does not give any additional information. The following
definition formalizes this idea.
DEFINITION
A statistic T (X 1 , . . . , X n ) is said to be sufficient for θ if the conditional distribution of X 1 , . . . , X n , given T = t, does not depend on θ for any value
■
of t.
In other words, given the value of T , which is called a sufficient statistic, we can
gain no more knowledge about θ from knowing more about the probability distribution
of X 1 , . . . , X n . (Formally, we could envision keeping only T and throwing away all
the X i without any loss of information. Informally, and more realistically, this would
make no sense at all. The values of the X i might indicate that the model did not fit or
that something was fishy about the data. What would you think, for example, if you
saw 50 ones followed by 50 zeros in a sequence of supposedly independent Bernoulli
trials?)
306
Chapter 8
EXAMPLE A
Estimation of Parameters and Fitting of Probability Distributions
Let X 1 , . . . , X n be a sequence of independent Bernoulli random variables with
n
X i is sufficient for θ.
P(X i = 1) = θ. We will verify that T = i=1
P(X 1 = x1 , . . . , X n = xn |T = t) =
P(X 1 = x1 , . . . , X n = xn , T = t)
P(T = t)
Bearing in mind that the X i can take on only the values 0s and 1s, the probability
in the numerator is the probability that some particular set of t X i are equal to 1s
and the other n − t are 0s. Since the X i are independent, the probability of this is
the product of the marginal probabilities, or θ t (1 − θ )n−t . To find the denominator
note that the distribution of T , the total number of ones, is binomial with n trials and
probability of success θ. The ratio in question is thus
θ t (1 − θ )n−t
1
=
n t
n
θ (1 − θ )n−t
t
t
The conditional distribution thus does not involve θ at all. Given the total number of
ones, the probability that they occur on any particular set of t trials is the same for
■
any value of θ so that set of trials contains no additional information about θ .
8.8.1 A Factorization Theorem
The preceding definition of sufficiency is hard to work with, because it does not
indicate how to go about finding a sufficient statistic, and given a candidate statistic,
T , it would typically be very hard to conclude whether it was sufficient because of
the difficulty in evaluating the conditional distribution. The following factorization
theorem provides a convenient means of identifying sufficient statistics.
THEOREM A
A necessary and sufficient condition for T (X 1 , . . . , X n ) to be sufficient for a
parameter θ is that the joint probability function (density function or frequency
function) factors in the form
f (x1 , . . . , xn |θ) = g[T (x1 , . . . , xn ), θ]h(x1 , . . . , xn )
8.8 Sufficiency
307
Proof
We give a proof for the discrete case. (The proof for the general case is more
subtle and requires regularity conditions, but the basic ideas are the same.) First,
suppose that the frequency function factors as given in the theorem. To simplify
notation, we will let X denote (X 1 , . . . , X n ) and x denote (x1 , . . . , xn ). We have
P(X = x)
P(T = t) =
T (x)=t
= g(t, θ)
h(x)
T (x)=t
Here the notation indicates that the sum is over all x such that T (x) = t. We then
have
P(X = x, T = t)
P(X = x|T = t) =
P(T = t)
h(x)
=
h(x)
T (x)=t
This conditional distribution does not depend on θ , as was to be shown.
To show that the conclusion holds in the other direction, suppose that the
conditional distribution of X given T is independent of θ . Let
g(t, θ) = P(T = t|θ )
h(x) = P(X = x|T = t)
We then have
P(X = x|θ) = P(T = t|θ )P(X = x|T = t)
= g(t, θ)h(x)
■
as was to be shown.
We can demonstrate the utility of Theorem A by applying it to some examples.
More examples are included in the problems at the end of this chapter.
EXAMPLE A
Consider a sequence of independent Bernoulli random variables, X 1 , . . . , X n , where
P(X i = x) = θ x (1 − θ )1−x ,
then
f (x|θ) =
n
-
x = 0 or x = 1
θ xi (1 − θ )1−xi
i=1
= θ i=1 xi (1 − θ )n−i=1 xi
n
xi
i=1
θ
=
(1 − θ )n
1−θ
n
n
308
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
We see that f (x|θ) depends only on x1 , x2 , . . . , xn through the sufficient statistic
n
n
xi and f (x|θ) is of the form g( i=1
xi , θ)h(x), where h(x) = 1 and
t = i=1
t
θ
g(t, θ) =
(1 − θ )n
■
1−θ
EXAMPLE B
Consider a random sample from a normal distribution that has an unknown mean and
variance. We have
n
−1
1
2
√ exp
(xi − μ)
f (x|μ, σ ) =
2σ 2
σ 2π
i=1
n
−1 1
2
exp
(xi − μ)
= n
σ (2π)n/2
2σ 2 i=1
n
n
−1 2
1
2
exp
x − 2μ
xi + nμ
= n
σ (2π)n/2
2σ 2 i=1 i
i=1
n
n
xi and i=1
xi2 , which are therefore
This expression is just a function of i=1
sufficient statistics. In this example we have a two-dimensional sufficient statistic.
Although Theorem A was stated explicitly for a one-dimensional sufficient statistic,
■
the multidimensional analogue holds also.
Because the likelihood,
f (x1 , . . . , xn ; θ) = g[T (x1 , . . . , xn ), θ]h(x1 , . . . , xn )
it depends only on the data through T (x1 , . . . , xn ). The maximum likelihood estimate is found by maximizing g[T (x1 , . . . , xn ), θ]. In Example A, the likelihood is a
n
xi , and the maximum likelihood estimate is θ̂ = t/n.
function of t = i=1
Similarly, in a Bayesian framework, the posterior distribution of θ is proportional
to the product of the prior distribution of θ and the likelihood. As a function of θ, the
posterior distribution thus depends only on the data through g[T (x1 , . . . , xn ), θ]—the
posterior probability of θ is the same for all {x1 , . . . , xn } which have a common value
of T (x1 , . . . , xn ). The sufficient statistic carries all the information about θ that is
contained in the data x1 , x2 , . . . , xn .
A study of the properties of probability distributions that have sufficient statistics
of the same dimension as the parameter space regardless of sample size led to the
development of what is called the exponential family of probability distributions.
Many common distributions, including the normal, the binomial, the Poisson, and
the gamma, are members of this family. One-parameter members of the exponential
family have density or frequency functions of the form
f (x|θ) = exp[c(θ)T (x) + d(θ ) + S(x)],
= 0,
x∈A
x∈A
8.8 Sufficiency
309
where the set A does not depend on θ. Suppose that X 1 , . . . , X n is a sample from a
member of the exponential family; the joint probability function is
f (x|θ) =
n
i=1
exp[c(θ )T (xi ) + d(θ ) + S(xi )]
= exp c(θ)
n
T (xi ) + nd(θ ) exp
i=1
n
S(xi )
i=1
From this result, it is apparent by the factorization theorem that
sufficient statistic.
EXAMPLE C
n
i=1
T (X i ) is a
The frequency function of the Bernoulli distribution is
x = 0 or x = 1
P(X = x) = θ x (1 − θ )1−x ,
θ
= exp x log
+ log(1 − θ )
1−θ
This is a member of the exponential family with T (x) = x, and we have already seen
n
X i , is a sufficient statistic for a sample from the Bernoulli distribution. ■
that i=1
A k-parameter member of the exponential family has a density or frequency
function of the form
k
ci (θ )Ti (x) + d(θ ) + S(x) ,
x∈A
f (x|θ) = exp
i=1
= 0,
x∈A
where the set A does not depend on θ.
The normal distribution is of this form. A great deal of theoretical work has
centered around the exponential family; further discussion of this family can be found
in Bickel and Doksum (2001).
We conclude this section with the following corollary of Theorem A.
COROLLARY A
If T is sufficient for θ, the maximum likelihood estimate is a function of T .
Proof
From Theorem A, the likelihood is g(T, θ)h(x), which depends on θ only through
T . To maximize this quantity, we need only maximize g(T, θ).
■
310
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
Corollary A and the Rao-Blackwell theorem of the next section may be interpreted
as giving some theoretical support to the use of maximum likelihood estimates.
8.8.2 The Rao-Blackwell Theorem
In the preceding section, we argued for the importance of sufficient statistics on essentially qualitative grounds. The Rao-Blackwell theorem gives a quantitative rationale
for basing an estimator of a parameter θ on a sufficient statistic if one exists.
THEOREM A
Rao-Blackwell Theorem
Let θ̂ be an estimator of θ with E(θ̂ 2 ) < ∞ for all θ . Suppose that T is sufficient
for θ, and let θ̃ = E(θ̂|T ). Then, for all θ ,
E(θ̃ − θ )2 ≤ E(θ̂ − θ )2
The inequality is strict unless θ̂ = θ̃.
Proof
We first note that, from the property of iterated conditional expectation
(Theorem A of Section 4.4.1),
E(θ̃) = E[E(θ̂ |T )] = E(θ̂ )
Therefore, to compare the mean squared error of the two estimators, we need
only compare their variances. From Theorem B of Section 4.4.1, we have
Var(θ̂) = Var[E(θ̂|T )] + E[Var(θ̂|T )]
or
Var(θ̂) = Var(θ̃ ) + E[Var(θ̂|T )]
Thus, Var(θ̂) > Var(θ̃) unless Var(θ̂ |T ) = 0, which is the case only if θ̂ is a
function of T , which would imply θ̂ = θ̃.
■
Since E(θ̂|T ) is a function of the sufficient statistic T , the Rao-Blackwell theorem
gives a strong rationale for basing estimators on sufficient statistics if they exist. If an
estimator is not a function of a sufficient statistic, it can be improved.
Suppose that there are two estimates, θ̂1 and θ̂2 , having the same expectation.
Assuming that a sufficient statistic T exists, we may construct two other estimates, θ̃1
and θ̃2 , by conditioning on T . The theory we have developed so far gives no clues as
to which one of these two is better. If the probability distribution of T has the property
called completeness, θ̃1 and θ̃2 are identical, by a theorem of Lehmann and Scheffé.
We will not define completeness or pursue this topic further; Lehmann and Casella
(1998) and Bickel and Doksum (2001) discuss this concept.
8.9 Concluding Remarks
311
8.9 Concluding Remarks
Certain key ideas first introduced in the context of survey sampling in Chapter 7 have
recurred in this chapter. We have viewed an estimate as a random variable having a
probability distribution called its sampling distribution. In Chapter 7, the estimate was
of a parameter, such as the mean, of a finite population; in this chapter, the estimate
was of a parameter of a probability distribution. In both cases, characteristics of
the sampling distribution, such as the bias and the variance and the large sample
approximate form, have been of interest. In both chapters, we studied confidence
intervals for the true value of the unknown parameter. The method of propagation of
error, or linearization, has been a useful tool in both chapters. These key ideas will
be important in other contexts in later chapters as well.
Important concepts and techniques in estimation theory were introduced in this
chapter. We discussed two general methods of estimation—the method of moments
and the method of maximum likelihood. The latter especially has great general utility
in statistics. We developed and applied some approximate distribution theory for
maximum likelihood estimates. Other theoretical developments included the concept
of efficiency, the Cramér-Rao lower bound, and the concept of sufficiency and some
of its consequences.
Bayesian inference was introduced in this chapter. The point of view contrasts
rather sharply with that of frequentist inference in that the Bayesian formalism allows
uncertainty statements about parameter values to be probabilistic, for example,“After
seeing the data, the probability is 95% that 1.8 ≤ θ ≤ 6.3.” In frequentist inference,
θ is not a random variable, and a statement like this would literally make no sense; it
would be replaced by, “A 95% confidence interval for θ is [1.8, 6.3],” perhaps followed
by a long convoluted explication of the meaning of a confidence interval. Despite this
apparently sharp philosophical difference, Bayesian and frequentist procedures have a
great deal in common and typically lead to similar conclusions. Despite the distinction
between the two statements above, the statements may well mean essentially the
same thing operationally to a practitioner who has analyzed the data. The likelihood
function is fundamental for both frequentist and Bayesian inference. In an application,
the choice of a model, that is, the choice of a likelihood function, will typically
be much more important than whether on subsequently multiplies it be a prior or
just maximizes it. This is especially true if flat priors are used; in fact, one might
regard a flat prior as a device that allows the likelihood to be treated as a probability
density.
In this chapter, we introduced the bootstrap method for assessing the variability
of an estimate. Such uses of simulation have become increasingly widespread as
computers have become faster and cheaper; the bootstrap as a general method has
been developed only quite recently and has rapidly become one of the most important
statistical tools. We will see other situations in which the bootstrap is useful in later
chapters. Efron and Tibshirani (1993) give an excellent introduction to the theory and
applications of the bootstrap.
The context in which we have introduced the bootstrap is often referred to as the
parametric bootstrap. The nonparametric bootstrap will be introduced in Chapter 10.
The parametric bootstrap can be thought about somewhat abstractly in the following
312
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
way. We have data x that we regard as being generated by a probability distribution
F(x|θ), which depends on a parameter θ . We wish to know Eh(X, θ) for some
function h( ). For example, if θ itself is estimated from the data as θ̂ (x) and h(X, θ) =
[θ̂ (X) − θ]2 , then Eh(X, θ) is the mean square error of the estimate. As another
example, if
1 if |θ̂ (X) − θ| > 0 otherwise
h(X, θ) =
then Eh(X, θ) is the probability that |θ̂ (X) − θ | > . We realize that if θ were known,
we could use the computer to generate independent random variables X1 , X2 , . . . , X B
from F(x|θ) and then appeal to the law of large numbers:
Eh(X, θ) ≈
B
1 h(Xi , θ)
B i=1
This approximation could be made arbitrarily precise by choosing B sufficiently large.
The parametric bootstrap principle is to perform this Monte Carlo simulation using θ̂
in place of the unknown θ—that is, using F(x|θ̂ ) to generate the Xi . It is difficult to
give a concise answer to the natural question: How much error is introduced by using
θ̂ in place of θ? The answer depends on the continuity of Eh(X, θ) as a function of
θ —if small changes in θ can give rise to large changes in Eh(X, θ), the parametric
bootstrap will not work well.
8.10 Problems
1. The following table gives the observed counts in 1-second intervals for
Berkson’s data (Section 8.2). What are the expected counts from a Poisson distribution? Do they match the observed counts?
n
0
1
2
3
4
5+
Observed
5267
4436
1800
534
111
21
2. The Poisson distribution has been used by traffic engineers as a model for light
traffic, based on the rationale that if the rate is approximately constant and the
traffic is light (so the individual cars move independently of each other), the
distribution of counts of cars in a given time interval or space area should be nearly
Poisson (Gerlough and Schuhl 1955). The following table shows the number of
right turns during 300 3-min intervals at a specific intersection. Fit a Poisson
distribution. Comment on the fit by comparing observed and expected counts. It
is useful to know that the 300 intervals were distributed over various hours of the
day and various days of the week.
8.10 Problems
n
Frequency
0
1
2
3
4
5
6
7
8
9
10
11
12
13+
14
30
36
68
43
43
30
14
10
6
4
1
1
0
313
3. One of the earliest applications of the Poisson distribution was made by Student
(1907) in studying errors made in counting yeast cells or blood corpuscles with
a haemacytometer. In this study, yeast cells were killed and mixed with water
and gelatin; the mixture was then spread on a glass and allowed to cool. Four
different concentrations were used. Counts were made on 400 squares, and the
data are summarized in the following table:
Number
of Cells
Concentration
1
Concentration
2
Concentration
3
Concentration
4
0
1
2
3
4
5
6
7
8
9
10
11
12
213
128
37
18
3
1
0
0
0
0
0
0
0
103
143
98
42
8
4
2
0
0
0
0
0
0
75
103
121
54
30
13
2
1
0
1
0
0
0
0
20
43
53
86
70
54
37
18
10
5
2
2
a. Estimate the parameter λ for each of the four sets of data.
b. Find an approximate 95% confidence interval for each estimate.
c. Compare observed and expected counts.
4. Suppose that X is a discrete random variable with
P(X = 0) =
2
θ
3
314
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
1
θ
3
2
P(X = 2) = (1 − θ )
3
1
P(X = 3) = (1 − θ )
3
P(X = 1) =
where 0 ≤ θ ≤ 1 is a parameter. The following 10 independent observations
were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1).
a. Find the method of moments estimate of θ.
b. Find an approximate standard error for your estimate.
c. What is the maximum likelihood estimate of θ?
d. What is an approximate standard error of the maximum likelihood estimate?
e. If the prior distribution of is uniform on [0, 1], what is the posterior density?
Plot it. What is the mode of the posterior?
5. Suppose that X is a discrete random variable with P(X = 1) = θ and P(X = 2)
= 1 − θ. Three independent observations of X are made: x1 = 1, x2 = 2, x3 = 2.
a. Find the method of moments estimate of θ .
b. What is the likelihood function?
c. What is the maximum likelihood estimate of θ ?
d. If has a prior distribution that is uniform on [0, 1], what is its posterior
density?
6. Suppose that X ∼ bin(n, p).
a. Show that the mle of p is p̂ = X/n.
b. Show that mle of part (a) attains the Cramér-Rao lower bound.
c. If n = 10 and X = 5, plot the log likelihood function.
7. Suppose that X follows a geometric distribution,
P(X = k) = p(1 − p)k−1
and assume an i.i.d. sample of size n.
a. Find the method of moments estimate of p.
b. Find the mle of p.
c. Find the asymptotic variance of the mle.
d. Let p have a uniform prior distribution on [0, 1]. What is the posterior distribution of p? What is the posterior mean?
8. In an ecological study of the feeding behavior of birds, the number of hops
between flights was counted for several birds. For the following data, (a) fit a
geometric distribution, (b) find an approximate 95% confidence interval for p, (c)
8.10 Problems
315
examine goodness of fit. (d) If a uniform prior is used for p, what is the posterior
distribution and what are the posterior mean and standard deviation?
Number of Hops
Frequency
1
2
3
4
5
6
7
8
9
10
11
12
48
31
20
9
6
5
4
2
1
1
2
1
9. How would you respond to the following argument? This talk of sampling distributions is ridiculous! Consider Example A of Section 8.4. The experimenter
found the mean number of fibers to be 24.9. How can this be a “random variable”
with an associated “probability distribution” when it’s just a number? The author
of this book is guilty of deliberate mystification!
10. Use the normal approximation of the Poisson distribution to sketch the approximate sampling distribution of λ̂ of Example A of Section 8.4. According to this
approximation, what is P(|λ0 − λ̂| > δ) for δ = .5, 1, 1.5, 2, and 2.5, where λ0
denotes the true value of λ?
11. In Example A of Section 8.4, we used knowledge of the exact form of the sampling
distribution of λ̂ to estimate its standard error by
/
λ̂
sλ̂ =
n
This was arrived at by realizing that
X i follows a Poisson distribution with
parameter nλ0 . Now suppose we hadn’t realized this but had used the bootstrap,
letting the computer do our work for us by generating B samples of size n = 23
of Poisson random variables with parameter λ = 24.9, forming the mle of λ from
each sample, and then finally computing the standard deviation of the resulting
collection of estimates and taking this as an estimate of the standard error of λ̂.
Argue that as B → ∞, the standard error estimated in this way will tend to sλ̂ .
12. Suppose that you had to choose either the method of moments estimates
or the maximum likelihood estimates in Example C of Section 8.4 and C of
Section 8.5. Which would you choose and why?
13. In Example D of Section 8.4, the method of moments estimate was found to be
α̂ = 3X . In this problem, you will consider the sampling distribution of α̂.
a. Show that E(α̂) = α—that is, that the estimate is unbiased.
316
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
b. Show that Var(α̂) = (3 − α 2 )/n. [Hint: What is Var(X )?]
c. Use the central limit theorem to deduce a normal approximation to the sampling distribution of α̂. According to this approximation, if n = 25 and α = 0,
what is the P(|α̂| > .5)?
14. In Example C of Section 8.5, how could you use the bootstrap to estimate the
following measures of the accuracy of α̂: (a) P(|α̂ − α0 | > .05), (b) E(|α̂ − α0 |),
(c) that number such that P(|α̂ − α0 | > ) = .5.
15. The upper quartile of a distribution with cumulative distribution F is that point
q.25 such that F(q.25 ) = .75. For a gamma distribution, the upper quartile depends
on α and λ, so denote it as q(α, λ). If a gamma distribution is fit to data as in
Example C of Section 8.5 and the parameters α and λ are estimated by α̂ and λ̂,
the upper quartile could then be estimated by q̂ = q(α̂, λ̂). Explain how to use
the bootstrap to estimate the standard error of q̂.
16. Consider an i.i.d. sample of random variables with density function
|x|
1
exp −
f (x|σ ) =
2σ
σ
a.
b.
c.
d.
Find the method of moments estimate of σ .
Find the maximum likelihood estimate of σ .
Find the asymptotic variance of the mle.
Find a sufficient statistic for σ .
17. Suppose that X 1 , X 2 , . . . , X n are i.i.d. random variables on the interval [0, 1]
with the density function
f (x|α) =
(2α)
[x(1 − x)]α−1
(α)2
where α > 0 is a parameter to be estimated from the sample. It can be shown
that
E(X ) =
Var(X ) =
a.
b.
c.
d.
e.
1
2
1
4(2α + 1)
How does the shape of the density depend on α?
How can the method of moments be used to estimate α?
What equation does the mle of α satisfy?
What is the asymptotic variance of the mle?
Find a sufficient statistic for α.
18. Suppose that X 1 , X 2 , . . . , X n are i.i.d. random variables on the interval [0, 1]
with the density function
f (x|α) =
(3α)
x α−1 (1 − x)2α−1
(α) (2α)
where α > 0 is a parameter to be estimated from the sample. It can be shown
8.10 Problems
317
that
E(X ) =
Var(X ) =
a.
b.
c.
d.
1
3
2
9(3α + 1)
How could the method of moments be used to estimate α?
What equation does the mle of α satisfy?
What is the asymptotic variance of the mle?
Find a sufficient statistic for α.
19. Suppose that X 1 , X 2 , . . . , X n are i.i.d. N (μ, σ 2 ).
a. If μ is known, what is the mle of σ ?
b. If σ is known, what is the mle of μ?
c. In the case above (σ known), does any other unbiased estimate of μ have
smaller variance?
20. Suppose that X 1 , X 2 , . . . , X 25 are i.i.d. N (μ, σ 2 ), where μ = 0 and σ = 10. Plot
the sampling distributions of X and σ̂ 2 .
21. Suppose that X 1 , X 2 , . . . , X n are i.i.d. with density function
f (x|θ ) = e−(x−θ ) ,
x ≥θ
and f (x|θ) = 0 otherwise.
a. Find the method of moments estimate of θ .
b. Find the mle of θ. (Hint: Be careful, and don’t differentiate before thinking.
For what values of θ is the likelihood positive?)
c. Find a sufficient statistic for θ .
22. The Weibull distribution was defined in Problem 67 of Chapter 2. This distribution
is sometimes fit to lifetimes. Describe how to fit this distribution to data and how
to find approximate standard errors of the parameter estimates.
23. A company has manufactured certain objects and has printed a serial number
on each manufactured object. The serial numbers start at 1 and end at N , where
N is the number of objects that have been manufactured. One of these objects
is selected at random, and the serial number of that object is 888. What is the
method of moments estimate of N ? What is the mle of N ?
24. Find a very new shiny penny. Hold it on its edge and spin it. Do this 20 times
and count the number of times it comes to rest heads up. Letting π denote the
probability of a head, graph the log likelihood of π. Next, repeat the experiment
in a slightly different way: This time spin the coin until 10 heads come up. Again,
graph the log likelihood of π.
25. If a thumbtack is tossed in the air, it can come to rest on the ground with either
the point up or the point touching the ground. Find a thumbtack. Before doing
any experiment, what do you think π , the probability of it landing point up, is?
Next, toss the thumbtack 20 times and graph the log likelihood of π . Then do
318
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
another experiment: Toss the thumbtack until it lands point up 5 times, and graph
the log likelihood of π based on this experiment.
Find and graph the posterior distribution arising from a uniform prior on π .
Find the posterior mean and standard deviation and compare the posterior with
a normal distribution with that mean and standard deviation. Finally, toss the
thumbtack 20 more times and compare the posterior distribution based on all 40
tosses to that based on the first 20.
26. In an effort to determine the size of an animal population, 100 animals are captured
and tagged. Some time later, another 50 animals are captured, and it is found
that 20 of them are tagged. How would you estimate the population size? What
assumptions about the capture/recapture process do you need to make? (See
Example I of Section 1.4.2.)
27. Suppose that certain electronic components have lifetimes that are exponentially
distributed: f (t|τ ) = (1/τ ) exp(−t/τ ), t ≥ 0. Five new components are put on
test, the first one fails at 100 days, and no further observations are recorded.
a. What is the likelihood function of τ ?
b. What is the mle of τ ?
c. What is the sampling distribution of the mle?
d. What is the standard error of the mle?
(Hint: See Example A of Section 3.7.)
28. Why do the intervals in the left panel of Figure 8.8 have different centers? Why
do they have different lengths?
29. Are the estimates of σ 2 at the centers of the confidence intervals shown in the
right panel of Figure 8.8? Why are some intervals so short and others so long? For
which of the samples that produced these confidence intervals was σ̂ 2 smallest?
30. The exponential distribution is f (x; λ) = λe−λx and E(X ) = λ−1 . The cumulative distribution function is F(x) = P(X ≤ x) = 1 − e−λx . Three observations
are made by an instrument that reports x1 = 5 and x2 = 3, but x3 is too large for
the instrument to measure and it reports only that x3 > 10. (The largest value the
instrument can measure is 10.0.)
a. What is the likelihood function?
b. What is the mle of λ?
31. George spins a coin three times and observes no heads. He then gives the coin to
Hilary. She spins it until the first head occurs, and ends up spinning it four times
total. Let θ denote the probability the coin comes up heads.
a. What is the likelihood of θ?
b. What is the MLE of θ?
32. The following 16 numbers came from normal random number generator on a
computer:
5.3299
6.5941
4.1547
4.2537
3.5281
2.2799
3.1502
4.7433
3.7032
0.1077
1.6070
1.5977
6.3923
5.4920
3.1181
1.7220
319
8.10 Problems
a. What would you guess the mean and variance (μ and σ 2 ) of the generating
normal distribution were?
b. Give 90%, 95%, and 99% confidence intervals for μ and σ 2 .
c. Give 90%, 95%, and 99% confidence intervals for σ .
d. How much larger a sample do you think you would need to halve the length
of the confidence interval for μ?
33. Suppose that X 1 , X 2 , . . . , X n are i.i.d. N (μ, σ 2 ), where μ and σ are unknown.
How should the constant c be chosen so that the interval (−∞, X + c) is a 95%
confidence interval for μ; that is, c should be chosen so that P(−∞ < μ ≤
X + c) = .95.
34. Suppose that X 1 , X 2 , . . . , X n are i.i.d. N (μ0 , σ02 ) and μ and σ 2 are estimated by
the method of maximum likelihood, with resulting estimates μ̂ and σ̂ 2 . Suppose
the bootstrap is used to estimate the sampling distribution of μ̂.
a. Explain why the bootstrap estimate of the distribution of μ̂ is N (μ̂, σ̂n ).
2
b. Explain why the bootstrap estimate of the distribution of μ̂ − μ0 is N (0, σ̂n ).
c. According to the result of the previous part, what is the form of the bootstrap
confidence interval for μ, and how does it compare to the exact confidence
interval based on the t distribution?
2
35. (Bootstrap in Example A of Section 8.5.1) Let U1 , U2 , . . . , U1029 be independent
uniformly distributed random variables. Let X 1 equal the number of Ui less than
.331, X 2 equal the number between .331 and .820, and X 3 equal the number
greater than .820. Why is the joint distribution of X 1 , X 2 , and X 3 multinomial
with probabilities .331, .489, and .180 and n = 1029?
36. How do the approximate 90% confidence intervals in Example E of Section 8.5.3
compare to those that would be obtained approximating the sampling distributions
of α̂ and λ̂ by normal distributions with standard deviations given by sα̂ and sλ̂
as in Example C of Section 8.5?
37. Using the notation of Section 8.5.3, suppose that θ and θ are lower and upper
quantiles of the distribution of θ ∗ . Show that the bootstrap confidence interval
for θ can be written as (2θ̂ − θ, 2θ̂ − θ).
38. Continuing Problem 37, show that if the sampling distribution of θ ∗ is symmetric
about θ̂, then the bootstrap confidence interval is (θ , θ ).
39. In Section 8.5.3, the bootstrap confidence interval was derived from consideration
of the sampling distribution of θ̂ −θ0 . Suppose that we had started with considering
the distribution of θ̂/θ. How would the argument have proceeded, and would the
bootstrap interval that was finally arrived at have been different?
40. In Example A of Section 8.5.1, how could you use the bootstrap to estimate the
following measures of the accuracy of θ̂ : (a) P(|θ̂ − θ0 | > .01), (b) E(|θ̂ − θ0 |),
(c) that number such that P(|θ̂ − θ0 | > ) = .5?
41. What are the relative efficiencies of the method of moments and maximum likelihood estimates of α and λ in Example C of Section 8.4 and Example C of
Section 8.5?
320
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
42. The file gamma-ray contains a small quantity of data collected from the Compton Gamma Ray Observatory, a satellite launched by NASA in 1991
(http://cossc.gsfc.nasa.gov/). For each of 100 sequential time
intervals of variable lengths (given in seconds), the number of gamma rays
originating in a particular area of the sky was recorded. Assuming a model
that the arrival times are a Poisson process with constant emission rate (λ =
events per second), estimate λ. What is the estimated standard error? How might
you informally check the assumption that the emission rate is constant? What is
the posterior distribution of if an improper gamma prior is used?
43. The file gamma-arrivals contains another set of gamma-ray data, this one
consisting of the times between arrivals (interarrival times) of 3,935 photons
(units are seconds).
a. Make a histogram of the interarrival times. Does it appear that a gamma
distribution would be a plausible model?
b. Fit the parameters by the method of moments and by maximum likelihood.
How do the estimates compare?
c. Plot the two fitted gamma densities on top of the histogram. Do the fits look
reasonable?
d. For both maximum likelihood and the method of moments, use the bootstrap to
estimate the standard errors of the parameter estimates. How do the estimated
standard errors of the two methods compare?
e. For both maximum likelihood and the method of moments, use the bootstrap
to form approximate confidence intervals for the parameters. How do the
confidence intervals for the two methods compare?
f. Is the interarrival time distribution consistent with a Poisson process model
for the arrival times?
44. The file bodytemp contains normal body temperature readings (degrees Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and 65 females
(coded by 2) from Shoemaker (1996). Assuming that the population distributions
are normal (an assumption that will be investigated in a later chapter), estimate the
means and standard deviations of the males and females. Form 95% confidence
intervals for the means. Standard folklore is that the average body temperature is
98.6 degrees Fahrenheit. Does this appear to be the case?
45. A Random Walk Model for Chromatin. A human chromosome is a very large
molecule, about 2 or 3 centimeters long, containing 100 million base pairs (Mbp).
The cell nucleus, where the chromosome is contained, is in contrast only about a
thousandth of a centimeter in diameter. The chromosome is packed in a series of
coils, called chromatin, in association with special proteins (histones), forming
a string of microscopic beads. It is a mixture of DNA and protein. In the G0/G1
phase of the cell cycle, between mitosis and the onset of DNA replication, the
mitotic chromosomes diffuse into the interphase nucleus. At this stage, a number
of important processes related to chromosome function take place. For example, DNA is made accessible for transcription and is duplicated, and repairs are
made of DNA strand breaks. By the time of the next mitosis, the chromosomes
have been duplicated. The complexity of these and other processes raises many
8.10 Problems
321
questions about the large-scale spatial organization of chromosomes and how
this organization relates to cell function. Fundamentally, it is puzzling how these
processes can unfold in such a spatially restricted environment.
At a scale of about 10−3 Mbp, the DNA forms a chromatin fiber about 30
nm in diameter; at a scale of about 10−1 Mbp the chromatin may form loops.
Very little is known about the spatial organization beyond this scale. Various
models have been proposed, ranging from highly random to highly organized,
including irregularly folded fibers, giant loops, radial loop structures, systematic
organization to make the chromatin readily accessible to transcription and replication machinery, and stochastic configurations based on random walk models
for polymers.
A series of experiments (Sachs et al., 1995; Yokota et al., 1995) were conducted to learn more about spatial organization on larger scales. Pairs of small
DNA sequences (size about 40 kbp) at specified locations on human chromosome 4 were flourescently labeled in a large number of cells. The distances
between the members of these pairs were then determined by flourescence microscopy. (The distances measured were actually two-dimensional distances between the projections of the paired locations onto a plane.) The empirical distribution of these distances provides information about the nature of large-scale
organization.
There has long been a tradition in chemistry of modeling the configurations
of polymers by the theory of random walks. As a consequence of such a model,
the two-dimensional distance should follow a Rayleigh distribution
r
f (r |θ ) = 2 exp
θ
−r 2
2θ 2
Basically, the reason for this is as follows: The random walk model implies that
the joint distribution of the locations of the pair in R 3 is multivariate Gaussian; by
properties of the multivariate Gaussian, it can be shown the joint distribution of
the locations of the projections onto a plane is bivariate Gaussian. As in Example
A of Section 3.6.2 of the text, it can be shown that the distance between the points
follows a Rayleigh distribution.
In this exercise, you will fit the Rayleigh distribution to some of the experimental results and examine the goodness of fit. The entire data set comprises 36
experiments in which the separation between the pairs of flourescently tagged
locations ranged from 10 Mbp to 192 Mbp. In each such experimental condition, about 100–200 measurements of two-dimensional distances were determined. This exercise will be concerned just with the data from three experiments
(short, medium, and long separation). The measurements from these experiments is contained in the files Chromatin/short, Chromatin/medium,
Chromatin/long.
a. What is the maximum likelihood estimate of θ for a sample from a Rayleigh
distribution?
b. What is the method of moments estimate?
c. What are the approximate variances of the mle and the method of moments
estimate?
322
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
d. For each of the three experiments, plot the likelihood functions and find the
mle’s and their approximate variances.
e. Find the method of moments estimates and the approximate variances.
f. For each experiment, make a histogram (with unit area) of the measurements
and plot the fitted densities on top. Do the fits look reasonable? Is there any
appreciable difference between the maximum likelihood fits and the method
of moments fits?
g. Does there appear to be any relationship between your estimates and the
genomic separation of the points?
h. For one of the experiments, compare the asymptotic variances to the results
obtained from a parametric bootstrap. In order to do this, you will have to
generate random variables from a Rayleigh distribution with parameter θ.
Show that if X follows a Rayleigh distribution with θ = 1, then Y = θ X
follows a Rayleigh distribution with parameter θ. Thus it is sufficient to figure
out how to generate random variables that are Rayleigh, θ = 1. Show how
Proposition D of Section 2.3 of the text can be applied to accomplish this.
B = 100 bootstrap samples should suffice for this problem. Make a
histogram of the values of the θ ∗ . Does the distribution appear roughly normal?
Do you think that the large sample theory can be reasonably applied here?
Compare the standard deviation calculated from the bootstrap to the standard
errors you found previously.
i. For one of the experiments, use the bootstrap to construct an approximate 95%
confidence interval for θ using B = 1000 bootstrap samples. Compare this
interval to that obtained using large sample theory.
46. The data of this exercise were gathered as part of a study to estimate the population
size of the bowhead whale (Raftery and Zeh 1993). The statistical procedures
for estimating the population size along with an assessment of the variability of
the estimate were quite involved, and this problem deals with only one aspect
of the problem—a study of the distribution of whale swimming speeds. Pairs
of sightings and corresponding locations that could be reliably attributed to the
same whale were collected, thus providing an estimate of velocity for each whale.
The velocities, v1 , v2 , . . . , v210 (km/h), were converted into times t1 , t2 , . . . , t210
to swim 1 km—ti = 1/vi . The distribution of the ti was then fit by a gamma
distribution. The times are contained in the file whales.
a. Make a histogram of the 210 values of ti . Does it appear that a gamma distribution would be a plausible model to fit?
b. Fit the parameters of the gamma distribution by the method of moments.
c. Fit the parameters of the gamma distribution by maximum likelihood. How
do these values compare to those found before?
d. Plot the two gamma densities on top of the histogram. Do the fits look reasonable?
e. Estimate the sampling distributions and the standard errors of the parameters
fit by the method of moments by using the bootstrap.
f. Estimate the sampling distributions and the standard errors of the parameters
fit by maximum likelihood by using the bootstrap. How do they compare to
the results found previously?
8.10 Problems
323
g. Find approximate confidence intervals for the parameters estimated by maximum likelihood.
47. The Pareto distribution has been used in economics as a model for a density
function with a slowly decaying tail:
f (x|x0 , θ) = θ x0θ x −θ −1 ,
x ≥ x0 ,
θ >1
Assume that x0 > 0 is given and that X 1 , X 2 , . . . , X n is an i.i.d. sample.
a. Find the method of moments estimate of θ.
b. Find the mle of θ.
c. Find the asymptotic variance of the mle.
d. Find a sufficient statistic for θ.
48. Consider the following method of estimating λ for a Poisson distribution.
Observe that
p0 = P(X = 0) = e−λ
Letting Y denote the number of zeros from an i.i.d. sample of size n, λ might be
estimated by
Y
λ̃ = − log
n
Use the method of propagation of error to obtain approximate expressions for
the variance and the bias of this estimate. Compare the variance of this estimate
to the variance of the mle, computing relative efficiencies for various values of
λ. Note that Y ∼ bin(n, p0 ).
49. For the example on muon decay in Section 8.4, suppose that instead of recording
x = cos θ, only whether the electron goes backward (x < 0) or forward (x > 0)
is recorded.
a. How could α be estimated from n independent observations of this type?
(Hint: Use the binomial distribution.)
b. What is the variance of this estimate and its efficiency relative to the method
of moments estimate and the mle for α = 0, .1, .2, .3, .4, .5, .6, .7, .8, .9?
50. Let X 1 , . . . , X n be an i.i.d. sample from a Rayleigh distribution with parameter
θ > 0:
x
2
2
x ≥0
f (x|θ) = 2 e−x /(2θ ) ,
θ
(This is an alternative parametrization of that of Example A in Section 3.6.2.)
a. Find the method of moments estimate of θ .
b. Find the mle of θ.
c. Find the asymptotic variance of the mle.
51. The double exponential distribution is
1
−∞ < x < ∞
f (x|θ) = e−|x−θ | ,
2
For an i.i.d. sample of size n = 2m + 1, show that the mle of θ is the median
of the sample. (The observation such that half of the rest of the observations are
324
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
smaller and half are larger.) [Hint: The function g(x) = |x| is not differentiable.
Draw a picture for a small value of n to try to understand what is going on.]
52. Let X 1 , . . . , X n be i.i.d. random variables with the density function
f (x|θ) = (θ + 1)x θ ,
a.
b.
c.
d.
0≤x ≤1
Find the method of moments estimate of θ .
Find the mle of θ.
Find the asymptotic variance of the mle.
Find a sufficient statistic for θ .
53. Let X 1 , . . . , X n be i.i.d. uniform on [0, θ ].
a. Find the method of moments estimate of θ and its mean and variance.
b. Find the mle of θ.
c. Find the probability density of the mle, and calculate its mean and variance.
Compare the variance, the bias, and the mean squared error to those of the
method of moments estimate.
d. Find a modification of the mle that renders it unbiased.
54. Suppose that an i.i.d. sample of size 15 from a normal distribution gives X = 10
and s 2 = 25. Find 90% confidence intervals for μ and σ 2 .
55. For two factors—starchy or sugary, and green base leaf or white base leaf—the
following counts for the progeny of self-fertilized heterozygotes were observed
(Fisher 1958):
Type
Count
Starchy green
Starchy white
Sugary green
Sugary white
1997
906
904
32
According to genetic theory, the cell probabilities are .25(2 + θ ), .25(1 − θ ),
.25(1 − θ), and .25θ, where θ(0 < θ < 1) is a parameter related to the linkage
of the factors.
a. Find the mle of θ and its asymptotic variance.
b. Form an approximate 95% confidence interval for θ based on part (a).
c. Use the bootstrap to find the approximate standard deviation of the mle and
compare to the result of part (a).
d. Use the bootstrap to find an approximate 95% confidence interval and compare
to part (b).
56. Referring to Problem 55, consider two other estimates of θ . (1) The expected
number of counts in the first cell is n(2 + θ )/4; if this expected number is
equated to the count X 1 , the following estimate is obtained:
θ̃1 =
4X 1
−2
n
8.10 Problems
325
(2) The same procedure done for the last cell yields
4X 4
n
Compute these estimates. Using that X 1 and X 4 are binomial random variables,
show that these estimates are unbiased, and obtain expressions for their variances. Evaluate the estimated standard errors and compare them to the estimated
standard error of the mle.
θ̃2 =
57. This problem is concerned with the estimation of the variance of a normal distribution with unknown mean from a sample X 1 , . . . , X n of i.i.d. normal random
variables. In answering the following questions, use the fact that (from Theorem B
of Section 6.3)
(n − 1)s 2
2
∼ χn−1
σ2
and that the mean and variance of a chi-square random variable with r df are
r and 2r , respectively.
a. Which of the following estimates is unbiased?
1 (X i − X )2
n − 1 i=1
n
s2 =
σ̂ 2 =
n
1
(X i − X )2
n i=1
b. Which of the estimates given in part (a) has the smaller MSE?
n
(X i − X )2 have the minimal MSE?
c. For what value of ρ does ρ i=1
58. If gene frequencies are in equilibrium, the genotypes A A, Aa, and aa occur
with probabilities (1 − θ)2 , 2θ(1 − θ ), and θ 2 , respectively. Plato et al. (1964)
published the following data on haptoglobin type in a sample of 190 people:
Haptoglobin Type
Hp1-1
10
Hp1-2
68
Hp2-2
112
Find the mle of θ.
Find the asymptotic variance of the mle.
Find an approximate 99% confidence interval for θ.
Use the bootstrap to find the approximate standard deviation of the mle and
compare to the result of part (b).
e. Use the bootstrap to find an approximate 99% confidence interval and compare
to part (c).
a.
b.
c.
d.
59. Suppose that in the population of twins, males (M) and females (F) are equally
likely to occur and that the probability that twins are identical is α. If twins are
not identical, their genes are independent.
a. Show that
1+α
1−α
P(MM) = P(FF) =
P(MF) =
4
2
326
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
b. Suppose that n twins are sampled. It is found that n 1 are MM, n 2 are FF, and
n 3 are MF, but it is not known which twins are identical. Find the mle of α
and its variance.
60. Let X 1 , . . . , X n be an i.i.d. sample from an exponential distribution with the
density function
f (x|τ ) =
1 −x/τ
e
,
τ
0≤x <∞
a. Find the mle of τ .
b. What is the exact sampling distribution of the mle?
c. Use the central limit theorem to find a normal approximation to the sampling
distribution.
d. Show that the mle is unbiased, and find its exact variance. (Hint: The sum of
the X i follows a gamma distribution.)
e. Is there any other unbiased estimate with smaller variance?
f. Find the form of an approximate confidence interval for τ .
g. Find the form of an exact confidence interval for τ .
61. Laplace’s rule of succession. Laplace claimed that when an event happens n times
in a row and never fails to happen, the probability that the event will occur the
next time is (n + 1)/(n + 2). Can you suggest a rationale for this claim?
62. Show that the gamma distribution is a conjugate prior for the exponential distribution. Suppose that the waiting time in a queue is modeled as an exponential
random variable with unknown parameter λ, and that the average time to serve a
random sample of 20 customers is 5.1 minutes. A gamma distribution is used as
a prior. Consider two cases: (1) the mean of the gamma is 0.5 and the standard
deviation is 1, and (2) the mean is 10 and the standard deviation is 20. Plot the
two posterior distributions and compare them. Find the two posterior means and
compare them. Explain the differences.
63. Suppose that 100 items are sampled from a manufacturing process and 3 are found
to be defective. A beta prior is used for the unknown proportion θ of defective
items. Consider two cases: (1) a = b = 1, and (2) a = 0.5 and b = 5. Plot the
two posterior distributions and compare them. Find the two posterior means and
compare them. Explain the differences.
64. This is a continuation of the previous problem. Let X = 0 or 1 according to
whether an item is defective. For each choice of the prior, what is the marginal
distribution of X before the sample is taken? What are the marginal distributions after the sample is taken? (Hint: for the second question, use the posterior
distribution of θ.)
65. Suppose that a random sample of size 20 is taken from a normal distribution
with unknown mean and known variance equal to 1, and the mean is found to
be x̄ = 10. A normal distribution was used as the prior for the mean, and it was
found that the posterior mean was 15 and the posterior standard deviation was
0.1. What were the mean and standard deviation of the prior?
8.10 Problems
327
66. Let the unknown probability that a basketball player makes a shot successfully
be θ. Suppose your prior on θ is uniform on [0, 1] and that she then makes two
shots in a row. Assume that the outcomes of the two shots are independent.
a. What is the posterior density of θ?
b. What would you estimate the probability that she makes a third shot to be?
67. Evans (1953) considered fitting the negative binomial distribution and other distributions to a number of data sets that arose in ecological studies. Two of these
sets will be used in this problem. The first data set gives frequency counts of
Glaux maritima made in 500 contiguous 20-cm2 quadrants. For the second data
set, a plot of potato plants 48 rows wide and 96 ft long was examined. The area
was split into 2304 sampling units consisting of 2-ft lengths of row and in each
unit the number of potato beetles was counted. Fit Poisson and negative binomial
distributions, and comment on the goodness of fit. For these data, the method of
moments should be fairly efficient.
Count
Glaux maritima
Potato Beetles
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
1
15
27
42
77
77
89
57
48
24
14
16
9
3
1
190
264
304
260
294
219
183
150
104
90
60
46
29
36
19
12
11
6
10
2
4
1
3
4
1
1
0
0
1
328
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
68. Let X 1 , . . . , X n be an i.i.d. sample from a Poisson distribution with mean λ, and
n
Xi .
let T = i=1
a. Show that the distribution of X 1 , . . . , X n given T is independent of λ, and
conclude that T is sufficient for λ.
b. Show that X 1 is not sufficient.
c. Use Theorem A of Section 8.8.1 to show that T is sufficient. Identify the
functions g and h of that theorem.
69. Use the factorization theorem (Theorem A in Section 8.8.1) to conclude that
n
X i is a sufficient statistic when the X i are an i.i.d. sample from a
T = i=1
geometric distribution.
70. Use the factorization theorem to find a sufficient statistic for the exponential
distribution.
71. Let X 1 , . . . , X n be an i.i.d. sample from a distribution with the density function
θ
,
0 < θ < ∞ and 0 ≤ x < ∞
f (x|θ) =
(1 + x)θ+1
Find a sufficient statistic for θ.
3n
n
X i and i=1
X i are sufficient statistics for the gamma distribu72. Show that i=1
tion.
73. Find a sufficient statistic for the Rayleigh density,
x
2
2
f (x|θ) = 2 e−x /(2θ ) ,
x ≥0
θ
74. Show that the binomial distribution belongs to the exponential family.
75. Show that the gamma distribution belongs to the exponential family.
CHAPTER
9
Testing Hypotheses and
Assessing Goodness of Fit
9.1 Introduction
We will introduce some of the basic concepts of this chapter by means of a simple
artificial example; it is important that you read the example carefully. Suppose that I
have two coins, coin 0 has probability of heads equal to 0.5 and coin 1 has probability
of heads equal to 0.7. I choose one of the coins, toss it 10 times and tell you the
number of heads, but do not tell you whether it was coin 0 or coin 1. On the basis
of the number of heads, your task is to decide which coin it was. How should your
decision rule be?
Let X denote the number of heads. Figure 9.1 gives p(x) for each of the coins.
x
0
1
2
3
4
5
6
7
8
9
10
coin 0
coin 1
.0010
.0000
.0098
.0001
.0439
.0014
.1172
.0090
.2051
.0368
.2461
.1029
.2051
.2001
.1172
.2668
.0439
.2335
.0098
.1211
.0010
.0282
FIGURE 9.1
Suppose that you observed two heads. Then P0 (2)/P1 (2) is about 30, which we
will call the likelihood ratio—coin 0 was about 30 times more likely to produce this
result than was coin 1. This result would favor coin 0. On the other hand, if there were
8 heads, the likelihood ratio would be .0439/.2335 = 0.19, which would favor coin
1. The likelihood ratio will play a central role in the procedures we develop.
We specify two hypotheses, H0 and H1 , according to whether coin 0 or coin 1 was
tossed. We first develop a Bayesian methodology for assessing the evidence for each of
the hypotheses. This approach requires the specification of prior probabilities P(H0 )
and P(H1 ) for each of the hypotheses before observing any data. If you believed that
I have no reason to choose coin 0 over coin 1, you would take P(H0 ) = P(H1 ) = 1/2.
329
330
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
After observing the number of heads, your posterior probabilities would be P(H0 |x)
and P(H1 |x). The former would be
P(H0 , x)
P(H0 |x) =
P(x)
P(x|H0 )P(H0 )
=
P(x)
The ratio
P(H0 |x)
P(H0 ) P(x|H0 )
=
P(H1 |x)
P(H1 ) P(x|H1 )
is the product of the ratio of prior probabilities and the likelihood ratio. Thus, the
evidence provided by the data is contained in the likelihood ratio, which is multiplied
by the ratio of prior probabilities to produce the ratio of posterior probabilities.
The likelihood ratio is evaluated in Figure 9.2.
x
0
1
2
3
4
5
6
7
8
9
10
P(x|H0 )
P(x|H1 )
165.4
70.88
30.38
13.02
5.579
2.391
1.025
0.4392
0.1882
0.0807
0.0346
FIGURE 9.2
(The numbers in Figure 9.2 do not precisely agree with the ratios of the numbers in
the Figure 9.1 because the former are truncated to four decimal places.) The evidence
x is increasingly favorable to H0 as x decreases, i.e., the likelihood ratio is monotonic
in x. If one’s prior probabilities were equal, then for zero to six heads, H0 would be
more probable and for seven to ten heads, H1 would be more probable. If the prior
probabilities change, the breakpoint changes. If you were asked to choose H0 or H1
on the basis of the data x, it seems reasonable that you would choose the hypothesis
which had larger posterior probability. You would choose H0 if
P(H0 ) P(x|H0 )
P(H0 |x)
=
>1
P(H1 |x)
P(H1 ) P(x|H1 )
or equivalently if
P(x|H0 )
>c
P(x|H1 )
where the critical value c depends upon your prior probability. Your decision would
be based on the likelihood ratio: you accept H0 if the likelihood ratio is greater than
c, and you reject H0 if the likelihood ratio is less than c.
Let us now further examine the consequences of a particular decision rule, i.e., a
particular specification of the constant c. First suppose that c = 1; then H0 is accepted
as long as X ≤ 6 and is rejected in favor of H1 if X > 6. We can make two possible
errors: reject H0 when it is true, or accept H0 when it is false. The probabilities of
these two possible errors can be evaluated as follows:
P(reject H0 |H0 ) = P(X > 6|H0 )
= 0.18
Here we used the binomial probabilities above corresponding to H0 . Similarly, the
9.2 The Neyman-Pearson Paradigm
331
probability of the other kind of error is
P(accept H0 |H1 ) = P(X ≤ 6|H1 )
= 0.35
Now suppose that c = 0.1, which corresponds to P(H0 )/P(H1 ) = 10. Then
from Figure 9.2, H0 is accepted when X ≤ 8. Compared to equal odds, more extreme
evidence is required to reject H0 because the prior probabilities greatly favor H0 .
Then the probabilities of the two types of errors are
P(reject H0 |H0 ) = P(X > 8|H0 )
= 0.01
P(accept H0 |H1 ) = P(X ≤ 8|H1 )
= 0.85
In this way, we see that there is a correspondence between the prior probabilities
and the probabilities of the two types of errors. The constant c controls the tradeoff
between the probabilities of the two types of errors.
9.2 The Neyman-Pearson Paradigm
Rather than using a Bayesian approach, Neyman and Pearson formulated their theory
of hypothesis testing by casting it as a decision problem and making the probabilities
of the two types of errors central, thus bypassing the necessity of specifying prior
probabilities. In doing so, this approach introduced an asymmetry: one hypothesis is
singled out as the null hypothesis and the other as the alternative hypothesis, the
former usually denoted by H0 and the latter by H1 or H A . We will see later through
examples how this specification is naturally made, but for now we will continue with
the example of the previous section and arbitrarily declare H0 to the null hypothesis.
The following terminology is standard:
•
•
•
•
•
•
•
Rejecting H0 when it is true is called a type I error.
The probability of a type I error is called the significance level of the test and is
usually denoted by α.
Accepting the null hypothesis when it is false is called a type II error and its
probability is usually denoted by β.
The probability that the null hypothesis is rejected when it is false is called the
power of the test, and equals 1 − β.
We have seen in this example how rejecting H0 when the likelihood ratio is less
than a constant c is equivalent to rejecting when the number of heads is greater than
some value x0 . The likelihood ratio, or equivalently, the number of heads, is called
the test statistic.
The set of values of the test statistic that leads to rejection of the null hypothesis is
called the rejection region, and the set of values that leads to acceptance is called
the acceptance region.
The probability distribution of the test statistic when the null hypothesis is true is
called the null distribution.
332
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
In the example in the introduction to this chapter, the null and alternative hypotheses each completely specify the probability distribution of the number of heads,
as binomial(10,0.5) or binomial(10,0.7), respectively. These are called simple
hypotheses. The Neyman-Pearson Lemma shows that basing the test on the likelihood ratio as we did is optimal:
NEYMAN-PEARSON LEMMA
Suppose that H0 and H1 are simple hypotheses and that the test that rejects H0
whenever the likelihood ratio is less than c and significance level α. Then any
other test for which the significance level is less than or equal to α has power
■
less than or equal to that of the likelihood ratio test.
The point is that there are many possible tests. Any partition of the set of possible
outcomes of the observations into a set that has probability less than or equal to α when
the null hypothesis is true and its complement, and that rejects when the observations
are in the complement has significance level less than or equal to α by construction.
Among all such possible partitions, that based on the likelihood ratio maximizes the
power.
Proof
Let f (x) denote the probability density function or frequency function of the
observations. A test of H0 : f (x) = f 0 (x) versus H1 : f (x) = f A (x) amounts to
using a decision function d(x), where d(x) = 0 if H0 is accepted and d(x) = 1
if H0 is rejected. Since d(X ) is a Bernoulli random variable, E(d(X )) = P
(d(X ) = 1). The significance level of the test is thus α = P0 (d(X ) = 1)
= E 0 (d(X )), and the power is PA (d(X ) = 0) = E A (d(X )). Here E 0 denotes
expectation under the probability law specified by H0 , etc.
Let d(X ) correspond to the likelihood ratio test: d(x) = 1 if f 0 (x) < c f A (x)
and E 0 (d(X )) = α. Let d ∗ (x) be the decision function of another test satisfying
E 0 (d ∗ (X )) ≤ E 0 (d(X )) = α. We will show that E A (d ∗ (X )) ≤ E A (d(X )). This
will follow from the key inequality
d ∗ (x)[c f A (x) − f 0 (x)] ≤ d(x)[c f A (x) − f 0 (x)]
which holds since if d(x) = 1, c f A (x) − f 0 (x) > 0 and if d(x) = 0, c f A (x) −
f 0 (x) ≤ 0. Now integrating (or summing) both sides of the inequality above with
respect to x gives
cE A (d ∗ (X )) − E 0 (d ∗ (X )) ≤ cE A (d(X )) − E 0 (d(X ))
and thus
E 0 (d(X )) − E 0 (d ∗ (X )) ≤ c[E A (d(X )) − E A (d ∗ (X ))]
The conclusion follows since the left-hand side of this inequality is nonnegative
by assumption.
■
9.2 The Neyman-Pearson Paradigm
EXAMPLE A
333
Let X 1 , . . . , X n be a random sample from a normal distribution having known variance
σ 2 . Consider two simple hypotheses:
H0 : μ = μ0
H A : μ = μ1
where μ0 and μ1 are given constants. Let the significance level α be prescribed. The
Neyman-Pearson Lemma states that among all tests with significance level α, the test
that rejects for small values of the likelihood ratio is most powerful. We thus calculate
the likelihood ratio, which is
n
−1 2
exp
(X i − μ0 )
2σ 2 i=1
f 0 (X)
=
n
f 1 (X)
−1 exp
(X i − μ1 )2
2σ 2 i=1
since the multipliers of the exponentials cancel. Small values of this statistic corren
n
(X i − μ1 )2 − i=1
(X i − μ0 )2 . Expanding the squares,
spond to small values of i=1
we see that the latter expression reduces to
2n X (μ0 − μ1 ) + nμ21 − nμ20
Now, if μ0 − μ1 > 0, the likelihood ratio is small if X is small; if μ0 − μ1 < 0, the
likelihood ratio is small if X is large. To be concrete, let us assume the latter case.
We then know that the likelihood ratio is a function of X and is small when X is
large. The Neyman-Pearson lemma thus tells us that the most powerful test rejects
for X > x0 for some x0 , and we choose x0 so as to give the test the desired level α.
That is, x0 is chosen so that P(X > x0 ) = α if H0 is true. Under H0 in this example,
the null distribution of X is a normal distribution with mean μ0 and variance σ 2 /n,
so x0 can be chosen from tables of the standard normal distribution. Since
x 0 − μ0
X − μ0
√ >
√
P(X > x0 ) = P
σ/ n
σ/ n
we can solve
x0 − μ0
√ = z(α)
σ/ n
for x0 in order to find the rejection region for a level α test. Here, as usual, z(α) denotes
the upper α point of the standard normal distribution; that is, if Z is a standard normal
random variable, P(Z > z(α)) = α.
■
This example is typical of the way that the Neyman-Pearson Lemma is used. We
write down the likelihood ratio and observe that small values of it correspond in a
one-to-one manner with extreme values of a test statistic, in this case X . Knowing
the null distribution of the test statistic makes it possible to choose a critical level that
produces a desired significance level α.
334
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
Unfortunately, the Neyman-Pearson Lemma is of little direct utility in most
practical problems, because the case of testing a simple null hypothesis versus a
simple alternative is rarely encountered. If a hypothesis does not completely specify
the probability distribution, the hypothesis is called a composite hypothesis. Here
are some examples:
EXAMPLE B
Goodness-of-Fit Test
Let X 1 , X 2 , . . . , X n be a sample from a discrete probability distribution. The null
hypothesis could be that the distribution is Poisson with some unspecified mean, and
the alternative could be that the distribution is not Poisson. For example, we might
want to test whether a Poisson model is reasonable for the data of Example A in
Section 8.4. Since the null hypothesis does not completely specify the distribution
of the X i ’s, it is composite. If the null hypothesis were refined to state that the
distribution was Poisson with some specified mean, then it would be simple. The
alternative hypothesis does not completely specify the distribution, so it is composite.
■
We will take up the subject of testing for goodness of fit later in this chapter.
EXAMPLE C
Testing for ESP
Consider a hypothetical experiment in which a subject is asked to identify, without
looking, the suits of 20 cards drawn randomly with replacement from a 52 card deck.
Let T be the number of correct identifications. The null hypothesis states that the
person is purely guessing, and the alternative states that the person has extrasensory ability. The null hypothesis is simple because then T is binomial(20,0.25). The
alternative does not completely specify the distribution of T , so it is composite; note
■
that it does not even specify that the distribution is binomial.
This example is useful for further illustrating two other issues that arise in hypothesis testing: the specification of the significance level and the choice of the null
hypothesis.
9.2.1 Specification of the Significance Level
and the Concept of a p-value
One of the strengths of the Neyman-Pearson approach is that only the distribution
under the null hypothesis is needed in order to construct a test. In Example C above,
it would be conventional and convenient to take the null hypothesis to be that of pure
guessing; we discuss this further in the next section. In this case, the null distribution
of T is binomial(20,0.25). Because large values of T tend to lend credence to the
alternative, the rejection region would be of the form {T ≥ t0 } where t0 is chosen
so that P(T ≥ t0 |H0 ) = α, the desired significance level of the test. For example,
from calculating binomial probabilities, we find P(T ≥ 12) = .0009, so for this
choice of the critical region, the null hypothesis of no ESP ability would be falsely
rejected only with probability about one in a thousand. Note that we did not need to
9.2 The Neyman-Pearson Paradigm
335
specify the form of the probability distribution under the alternative, but used only
the notion that if the alternative is true, the subject would be expected to correctly
identify more suits than if purely guessing. In comparison, a fully Bayesian treatment would have to specify the distribution under the alternative as well as prior
probabilities.
The theory requires the specification of the significance level, α, in advance
of analyzing the data, but gives no guidance about how to make this choice. In
practice it is almost always the case that the choice of α is essentially arbitrary, but
is heavily influenced by custom. Small values, such as 0.01 and 0.05, are commonly
used. Another criticism of the paradigm is that it is built on the assumption that one
must either reject or not reject a hypothesis, when typically no such decision is actually
required. The theory is thus often applied in a hypothetical manner. For example,
suppose that the subject above guessed the suit correctly nine times. Since P(T ≥
9|H0 ) = .041, the null hypothesis would have been “rejected” at the significance
level α = .05, if one were actually “rejecting” or “not rejecting.” Thus, the evidence
is often summarized as a p-value, which is defined to be the smallest significance
level at which the null hypothesis would be rejected. If nine suits were identified
correctly, the p-value would be 0.041. If ten suits were identified, it would be 0.014,
since P(T ≥ 10|H0 ) = .014, etc.
The use of a p-value to summarize the evidence against the null hypothesis was
advocated by the eminent statistician Sir Ronald Fisher. But rather than casting it
within a hypothetical framework of “rejection,” he thought of the p-value as being
the probability under the null hypothesis of a result as or more extreme than that
actually observed. So for example, in the case that ten suits are identified, the p-value
is the chance of someone getting at least that many correct by purely guessing. The
smaller the p-value, the stronger the evidence against the null hypothesis.
The Bayesian paradigm summarizes the evidence for and against the null hypothesis as a posterior probability. Its application depends on specifying probability
models under both the null and the alternative and on assigning meaningful prior probabilities. It is important to understand that a p-value is not the posterior probability
that the null hypothesis is true. To reiterate, the p-value is the probability of a result
as or more extreme than that actually observed if the null hypothesis were true. This
is a probability, but it is not the posterior probability that the null hypothesis is true;
the latter depends on the specification of prior probabilities. Consider the example of
Section 9.1. If x = 8 heads are observed, the p-value is .0439 + .0098 + .0010 =
.0546, or about 5%. Suppose that the prior probabilities were equal. The likelihood
ratio is 0.1882 = P(H0 |x)/(1 − P(H0 |x)) from which it follows that P(H0 |x) =
0.1584, or about 16%.
9.2.2 The Null Hypothesis
As should be clear by now, there is an asymmetry in the Neyman-Pearson paradigm
between the null and alternative hypotheses. The decision as to which is the null
and which is the alternative hypothesis is not a mathematical one, and depends on
scientific context, custom, and convenience. This will gradually become clearer as
we see more real examples in this and later chapters, and for now we will make only
the following remarks:
336
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
•
•
•
In Example B of Section 9.2, we chose as the null hypothesis the hypothesis that
the distribution was Poisson and as the alternative hypothesis the hypothesis that
the distribution was not Poisson. In this case, the null hypothesis is simpler than the
alternative, which in a sense contains more distributions than does the null. It is
conventional to choose the simpler of two hypotheses as the null.
The consequences of incorrectly rejecting one hypothesis may be graver than those
of incorrectly rejecting the other. In such a case, the former should be chosen as the
null hypothesis, because the probability of falsely rejecting it could be controlled
by choosing α. Examples of this kind arise in screening new drugs; frequently, it
must be documented rather conclusively that a new drug has a positive effect before
it is accepted for general use.
In scientific investigations, the null hypothesis is often a simple explanation that
must be discredited in order to demonstrate the presence of some physical phenomenon or effect. The hypothetical ESP experiment referred to earlier falls in this
category; the null hypothesis states that the subject is merely guessing, that there
is no ESP. The validity of the null hypothesis would not be cast in doubt unless the
results would be extremely unlikely under the null. We will see other examples of
this type beginning in Chapter 11.
9.2.3 Uniformly Most Powerful Tests
The optimality result of the Neyman-Pearson Lemma requires that both hypotheses be
simple. In some cases, the theory can be extended to include composite hypotheses. If
the alternative H1 is composite, a test that is most powerful for every simple alternative
in H1 is said to be uniformly most powerful.
EXAMPLE A
Continuing with Example A of Section 9.2, consider testing H0 : μ = μ0 versus
H1 : μ > μ0 . In Example A, we saw that for a particular alternative μ1 > μ0 , the most
powerful test rejects for X > x0 , where x0 depends on μ0 , σ, n, but not on μ1 . Because
this test is most powerful and is the same for every alternative, it is uniformly most
■
powerful.
It can also be argued that the test is uniformly most powerful for testing H0 : μ ≤
μ0 versus H1 : μ > μ0 . But it is not uniformly most powerful for testing H0 : μ = μ0
versus H1 : μ = μ0 . This follows from further examination of the example, which
shows that the test that is most powerful against the alternative that μ > μ0 rejects for
large values of X − μ0 , whereas the test that is most powerful against the alternative
μ < μ0 rejects for small values of X − μ0 . The most powerful test is thus not the
same for every alternative.
In typical composite situations, there is no uniformly most powerful test. The alternatives H1 : μ < μ0 and H1 : μ > μ0 are called one-sided alternatives. The
alternative H1 : μ = μ0 is a two-sided alternative.
9.3 The Duality of Confidence Intervals and Hypothesis Tests
337
9.3 The Duality of Confidence Intervals
and Hypothesis Tests
There is a duality between confidence intervals (more generally, confidence sets) and
hypothesis tests. In this section, we will show that a confidence set can be obtained by
“inverting” a hypothesis test, and vice versa. Before presenting the general structure,
we consider an example.
EXAMPLE A
Let X 1 , . . . , X n be a random sample from a normal distribution having unknown
mean μ and known variance σ 2 . We consider testing the following hypotheses:
H0 : μ = μ0
H A : μ = μ0
Consider a test at a specific level α that rejects for |X − μ0 | > x0 , where x0 is
H0 is true: x0 = σ X z(α/2). Here the
determined so that P(|X − μ0 | > x0 ) = α if √
standard deviation of X is denoted by σ X = σ/ n. The test thus accepts when
|X − μ0 | < σ X z(α/2)
or
−σ X z(α/2) < X − μ0 < σ X z(α/2)
or
X − σ X z(α/2) < μ0 < X + σ X z(α/2)
A 100(1 − α)% confidence interval for μ0 is
X − σ X z(α/2), X + σ X z(α/2)
Comparing the acceptance region of the test to the confidence interval, we see that
μ0 lies in the confidence interval for μ if and only if the hypothesis test accepts. In
other words, the confidence interval consists precisely of all those values of μ0 for
■
which the null hypothesis H0 : μ = μ0 is accepted.
We now demonstrate that this duality holds more generally. Let θ be a parameter
of a family of probability distributions, and denote the set of all possible values of θ
by . Denote the random variables constituting the data by X.
338
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
THEOREM A
Suppose that for every value θ0 in there is a test at level α of the hypothesis
H0 : θ = θ0 . Denote the acceptance region of the test by A(θ0 ). Then the set
C(X) = {θ : X ∈ A(θ )}
is a 100(1 − α)% confidence region for θ.
Proof
Because A is the acceptance region of a test at level α,
P[X ∈ A(θ0 )|θ = θ0 ] = 1 − α
Now,
P[θ0 ∈ C(X)|θ = θ0 ] = P[X ∈ A(θ0 )|θ = θ0 ]
= 1−α
■
by the definition of C(X).
It is helpful to state Theorem A in words: A 100(1 − α)% confidence region for
θ consists of all those values of θ0 for which the hypothesis that θ equals θ0 will not
be rejected at level α.
THEOREM B
Suppose that C(X) is a 100(1 − α)% confidence region for θ; that is, for every
θ0 ,
P[θ0 ∈ C(X)|θ = θ0 ] = 1 − α
Then an acceptance region for a test at level α of the hypothesis H0 : θ = θ0 is
A(θ0 ) = {X|θ0 ∈ C(X)}
Proof
The test has level α because
P(X ∈ A(θ0 )|θ = θ0 ) = P(θ0 ∈ C(X)|θ = θ0 ) = 1 − α
■
In words, Theorem B says that the hypothesis that θ equals θ0 is accepted if θ0
lies in the confidence region.
This duality can be quite useful. In some situations, it is possible to form confidence intervals for parameters of probability distributions and then use those intervals
to test hypotheses about the values of those parameters. In other situations, it may
9.4 Generalized Likelihood Ratio Tests
339
be relatively easy to test hypotheses and then determine the acceptance regions for
the test to form confidence intervals that might have been quite difficult to derive
in a more direct manner. We will see examples of both types of situations in later
chapters.
9.4 Generalized Likelihood Ratio Tests
The likelihood ratio test is optimal for testing a simple hypothesis versus a simple
hypothesis. In this section, we will develop a generalization of this test for use in
situations in which the hypotheses are not simple. Such tests are not generally optimal,
but they are typically nonoptimal in situations for which no optimal test exists, and
they usually perform reasonably well. Generalized likelihood ratio tests have wide
utility; they play the same role in testing as mle’s do in estimation.
It is frequently the case that the hypotheses under consideration specify, or partially specify, the values of parameters of the probability distribution that has generated
the data. Specifically, suppose that the observations X = (X 1 , . . . , X n ) have a joint
density or frequency function f (x|θ ). Then H0 may specify that θ ∈ ω0 , where ω0 is
a subset of the set of all possible values of θ, and H1 may specify that θ ∈ ω1 , where
ω1 is disjoint from ω0 . Let = ω0 ∪ ω1 . Based on the data, a plausible measure of the
relative tenability of the hypotheses is the ratio of their likelihoods. If the hypotheses are composite, each likelihood is evaluated at that value of θ that maximizes it,
yielding the generalized likelihood ratio
∗ =
max[lik(θ )]
θ∈ω0
max[lik(θ )]
θ∈ω1
Small values of ∗ tend to discredit H0 .
It is preferable for certain technical reasons to use the test statistic
=
max[lik(θ )]
θ ∈ω0
max[lik(θ )]
θ∈
rather than ∗ . Note that = min(∗ , 1) so small values of ∗ correspond to small
values of . The rejection region for a likelihood ratio test consists of small values of
, for example, all ≤ λ0 . The threshhold λ0 is chosen so that P( ≤ λ0 |H0 ) = α,
the desired significance level of the test.
We now illustrate the construction of a likelihood ratio test with a simple example.
EXAMPLE A
Testing a Normal Mean
Let X 1 , . . . , X n be i.i.d. and normally distributed with mean μ and variance σ 2 ,
where σ is known. We wish to test H0 : μ = μ0 against H1 : μ = μ0 , where μ0 is a
prescribed number. The role of θ is played by μ, and ω0 = {μ0 }, ω1 = {μ|μ = μ0 },
and = {−∞ < μ < ∞}.
340
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
Since ω0 consists of only one point, the numerator of the likelihood ratio statistic
is
1
1
√
e− 2σ 2
n
(σ 2π )
n
i=1
(X i −μ0 )2
For the denominator, we have to maximize the likelihood for μ ∈ , which is achieved
when μ is the mle X . The denominator is the likelihood of X 1 , X 2 , . . . , X n evaluated
with μ = X :
1
1
√
e− 2σ 2
n
(σ 2π )
n
i=1
(X i −X )2
The likelihood ratio statistic is, therefore,
n
n
1 2
2
= exp − 2
(X i − μ0 ) −
(X i − X )
2σ
i=1
i=1
Rejecting for small values of is equivalent to rejecting for large values of
n
n
1 −2 log = 2
(X i − μ0 )2 −
(X i − X )2
σ
i=1
i=1
Using the identity
n
i=1
(X i − μ0 )2 =
n
(X i − X )2 + n(X − μ0 )2
i=1
we see that the likelihood ratio test rejects for large values of −2 log = n(X −
μ0 )2 /σ 2 . The distribution of this statistic under H0 is chi-square with 1 degree
of freedom. This follows, since under H0 , X ∼ N (μ0 , σ 2 /n), which implies that
√
n(X − μ0 )/σ ∼ N (0, 1) and hence its square, −2 log ∼ χ12 . Knowing the null
distribution of the test statistic makes possible the construction of a rejection region
for any significance level α: The test rejects when
n
(X − μ0 )2 > χ12 (α)
σ2
Again using the fact that a chi-square random variable with 1 degree of freedom is
the square of a standard normal random variable, we can rewrite this relation to show
that the rejection region for the test is
σ
|X − μ0 | ≥ √ z(α/2)
■
n
The preceding derivation has been rather formal, but upon examination the result
looks perfectly reasonable or perhaps even so obvious as to make us doubt the value of
H1 : μ = μ0 rejects when |X√− μ0 |
the formal exercise: The test of H0 : μ = μ0 versus √
n ≤ X − μ0√≤ σ z(α/2)/ n or,
is large. The test does not reject when
−σ
z(α/2)/
√
equivalently, when X − σ z(α/2)/ n ≤ μ0 ≤ X + σ z(α/2)/ n. That is, the test
does not reject when μ0 lies in a 100(1 − α)% confidence interval for μ. Compare to
Example A of Section 9.3.
9.5 Likelihood Ratio Tests for the Multinomial Distribution
341
In order for the likelihood ratio test to have the significance level α, λ0 must be
chosen so that P( ≤ λ0 ) = α if H0 is true. If the sampling distribution of under
H0 is known, we can determine λ0 . Generally, the sampling distribution is not of a
simple form, but in many situations the following theorem provides the basis for an
approximation to the null distribution.
THEOREM A
Under smoothness conditions on the probability density or frequency functions
involved, the null distribution of −2 log tends to a chi-square distribution
with degrees of freedom equal to dim − dim ω0 as the sample size tends to
■
infinity.
The proof, which is beyond the scope of this book, is based on a second-order
Taylor series expansion.
In the statement of Theorem A, dim and dim ω0 are the numbers of free parameters under and ω0 , respectively. In Example A, the null hypothesis completely
specifies μ and σ 2 ; there are no free parameters under ω0 , so dim ω0 = 0. Under
, σ is fixed but μ is free, so dim = 1. For this example, the null distribution of
−2 log is exactly χ12 .
9.5 Likelihood Ratio Tests for the
Multinomial Distribution
In this section we will develop a generalized likelihood ratio test of the goodness of fit
of a model for multinomial cell probabilities. Under the model, the vector of cell probabilities p is described by a hypothesis H0 , which specifies that p = p(θ ), θ ∈ ω0 ,
where θ is a parameter that may be unknown. For example, in Section 8.2 we considered fitting Poisson probabilities that depended on an unknown parameter (there
called λ, which played the role of θ ) to the cell counts in a table. We want to
judge the plausibility of the model relative to a model H1 in which the cell probabilities are free except for the constraints that they are nonnegative and sum to 1.
If there are m cells, is thus the set consisting of m nonnegative numbers that
sum to 1.
The numerator of the likelihood ratio is
n!
p1 (θ )x1 · · · pm (θ )xm
max
p∈ω0
x1 ! · · · xm !
where the xi are the observed counts in the m cells. By the definition of the maximum
likelihood estimate, this likelihood is maximized when θ̂ is the maximum likelihood
estimate of θ. The corresponding probabilities will be denoted by pi (θ̂ ).
342
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
Since the probabilities are unrestricted under , the denominator is maximized
by the unrestricted mle’s, or
xi
p̂i =
n
The likelihood ratio is, therefore,
n!
p1 (θ̂ )x1 · · · pm (θ̂ )xm
x1 ! · · · xm !
=
n!
p̂ x1 · · · p̂mxm
x1 ! · · · xm ! 1
xi
m
pi (θ̂ )
=
p̂i
i=1
Also, since xi = n p̂i ,
−2 log = −2n
m
p̂i log
i=1
=2
m
i=1
Oi log
pi (θ̂ )
p̂i
Oi
Ei
where Oi = n p̂i and E i = npi (θ̂) denote the observed and expected counts, respectively.
Under , the cell probabilities are allowed to be free, with the constraint that
they sum to 1, so dim = m − 1. If, under H0 , the probabilities pi (θ̂ ) depend on a
k-dimensional parameter θ that has been estimated from the data, dim ω0 = k. The
large sample distribution of −2 log is thus a chi-square distribution with m − k − 1
degrees of freedom (the number of cells minus the number of estimated parameters
minus 1).
Pearson’s chi-square statistic is commonly used to test for goodness of fit
X2 =
m
[xi − npi (θ̂ )]2
npi (θ̂ )
i=1
Pearson’s statistic and the likelihood ratio are asymptotically equivalent under H0 .
To indicate heuristically why this is so, we will go through a Taylor series argument.
To begin,
m
p̂i
p̂i log
−2 log = 2n
pi (θ̂ )
i=1
If H0 is true and n is large, p̂i ≈ pi (θ̂ ). The Taylor series expansion of the function
x
f (x) = x log
x0
about x0 is
1
1
f (x) = (x − x0 ) + (x − x0 )2 + · · ·
2
x0
9.5 Likelihood Ratio Tests for the Multinomial Distribution
343
Thus,
−2 log ≈ 2n
m
[ p̂i − pi (θ̂ )] + n
i=1
m
[ p̂i − pi (θ̂ )]2
pi (θ̂ )
i=1
The first term on the right-hand side is equal to 0 since the probabilities sum to 1, and
the second term on the right-hand side may be expressed as
m
[xi − npi (θ̂ )]2
npi (θ̂ )
i=1
since xi , the observed count, equals n p̂i for i = 1, . . . , m.
We have argued for the approximate equivalence of two test statistics. Pearson’s
test has been more commonly used than the likelihood ratio test, because it is somewhat easier to calculate without the use of a computer.
Let us consider some examples.
EXAMPLE A
Hardy-Weinberg Equilibrium
Hardy-Weinberg equilibrium was first introduced in Example A in Section 8.5.1.
We will now test whether this model fits the observed data. Recall that the HardyWeinberg equilibrium model says that the cell probabilities are (1 − θ )2 , 2θ (1 − θ ),
and θ 2 . Using the maximum likelihood estimate for θ, θ̂ = .4247, and multiplying the
resulting probabilities by the sample size n = 1029, we calculate expected counts,
which are compared with observed counts in the following table:
Blood Type
M
Observed
Expected
342
340.6
MN
N
500
502.8
187
185.6
The null hypothesis will be that the multinomial distribution is as specified by the
Hardy-Weinberg equilibrium frequencies, with unknown parameter θ. The alternative
hypothesis will be that the multinomial distribution does not have probabilities of that
specified form. We first choose a value for α, the significance level for the test (recall
that the significance level is the probability of falsely rejecting the hypothesis that the
multinomial cell, probabilities are as specified by genetic theory). In this application,
there is no compelling reason to choose any particular value of α, so we will follow
convention and let α = .05. This means that our decision rule will falsely reject H0
in only 5% of the cases.
We will use Pearson’s chi-square test, and therefore X 2 as our test statistic. The
null distribution of X 2 is approximately chi-square with 1 degree of freedom. (There
are two independent cells, and one parameter has been estimated from the data.)
Since, from Table 3 in Appendix B, the point defining the upper 5% of the chi-square
distribution with 1 degree of freedom is 3.84, the test rejects if X 2 > 3.84. We next
344
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
calculate X 2 :
(O − E)2
= .0319
E
Thus, the null hypothesis is not rejected.
There is a certain unnecessary rigidity in this procedure, because it is not clear
that such a decision (to reject or not) has to be made at all. There is also a certain
arbitrariness: There was no strong reason to let α = .05, but that choice essentially
determined our decision. If we had let α = .01, the decision would have been the
same since χ 2 (.01) > χ 2 (.05), but what if we had let α = .10, or .20? It is here that
the concept of the p-value becomes useful. Recall that the p-value is the smallest
significance level at which the null hypothesis would be rejected. From a computer
calculation of the chi-square distribution (or from a table of the normal distribution,
since a chi-square distribution with 1 degree of freedom is the square of a standard
normal random variable), the probability that a chi-square random variable with 1
degree of freedom is greater than or equal to .0319 is .86, which is the p-value.
Another interpretation of this p-value is that if the model were correct, deviations
this large or larger would occur 86% of the time. Thus, the data give us no reason to
doubt the model.
In comparison, the likelihood ratio test statistic is
3
Oi
= .0319
Oi log
−2 log = 2
Ei
i=1
X2 =
The two tests lead to the same conclusion.
Finally, we note that the actual maximized likelihood ratio is = exp(−.0319/2)
= .98. Thus the Hardy-Weinberg model is almost as likely as the most general possible
■
model.
EXAMPLE B
Bacterial Clumps
In testing milk for bacterial contamination, 0.01 ml of milk is spread over an area
of 1 cm2 on a slide. The slide is mounted on a microscope, and counts of bacterial
clumps within grid squares are made. The Poisson model appears quite reasonable
for the distribution of the clumps at first glance: The clumps are presumably mixed
uniformly throughout the milk, and there is no reason to suspect that the clumps
bunch together. However, on closer examination, two possible problems are noted.
First, bacteria held by surface tension on the lower surface of the drop may adhere to
the glass slide on contact, producing increased concentrations in that area of the film.
Second, the film is not of uniform thickness, being thicker in the center and thinner at
the edges, giving rise to nonuniform concentrations of bacteria. The following table,
taken from Bliss and Fisher (1953), summarizes the counts of clumps on 400 grid
squares.
Number per Square
Frequency
0
1
2
3
4
5
6
7
8
9
10
19
56
104
80
62
42
27
9
9
5
3
2
1
9.5 Likelihood Ratio Tests for the Multinomial Distribution
345
To fit a Poisson distribution to these data, we compute the mle, λ̂, which is the
mean of the 400 counts:
0 × 56 + 1 × 104 + 2 × 80 + · · · + 19 × 1
400
= 2.44
λ̂ =
The following table shows the observed and expected counts and the components
of chi-square test statistic. (The last several cells were grouped together so that the
minimum expected count would be nearly 5.)
Observed
56
104
Expected
34.9
85.1
Component of X 2
12.8
4.2
80
62
42
27
9
103.8
84.4
51.5
25.1
5.5
5.9
1.8
.14
20
10.2
.14
5.0
45.0
The chi-square statistic is X 2 = 75.4. With 6 degrees of freedom (there are eight
cells, and one parameter has been estimated from the data), the null hypothesis is
conclusively rejected [χ62 (.005) = 18.55, so the p-value is less than .005]. When a
goodness-of-fit test rejects, it is instructive to find out why; where does the model fail
to fit? This can be seen by looking at the cells that make large contributions to X 2 and
the signs of the observed minus the expected counts for those cells. We see here that
the greatest contributions to X 2 come from the first and last cells of the table—there
are too many small counts and too many large counts relative to what is expected for
■
a Poisson distribution.
EXAMPLE C
Fisher’s Reexamination of Mendel’s Data
In one of his famous experiments, Mendel crossed 556 smooth, yellow male peas
with wrinkled, green female peas. According to now established genetic theory, the
relative frequencies of the progeny should be as given in the following table.
Type
Smooth yellow
Smooth green
Wrinkled yellow
Wrinkled green
Frequency
9
16
3
16
3
16
1
16
The counts that Mendel recorded and the expected counts are given in the following
table:
346
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
Type
Observed Count
Expected Count
Smooth yellow
Smooth green
Wrinkled yellow
Wrinkled green
315
108
102
31
312.75
104.25
104.25
34.75
Calculating the likelihood ratio test statistic, we obtain
4
Oi
Oi log
= .618
−2 log = 2
Ei
i=1
Comparing this value with the chi-square distribution with 3 degrees of freedom (three
independent parameters are estimated under and none under ω0 ), we have a p-value
of slightly less than .9. Pearson’s chi-square statistic is .604, which is quite close to
the value from the likelihood ratio test. We interpret the p-value as meaning that,
even if the model were correct, discrepancies this large or larger would be expected
to occur on the basis of chance about 90% of the time. There is thus no reason
to reject the hypothesis that the counts come from a multinomial distribution with
the prescribed probabilities. We would tend to doubt this hypothesis for only small
p-values.
The p-value can also be interpreted to mean that on the basis of chance we would
expect agreement this close or closer about only 10% of the time. There is some
validity to the suggestion that the data agree with the model too well; if the p-value
had been .999, for example, we would definitely be suspicious.
Fisher pooled the results of all of Mendel’s experiments in the following way.
Suppose that two independent experiments give chi-square statistics with p and r
degrees of freedom, respectively. Then, under the null hypothesis that the models were
correct, the sum of those two test statistics would follow a chi-square distribution with
p + r degrees of freedom. Fisher added the chi-square statistics for all the independent
experiments and compared the result with the chi-square distribution with degrees of
freedom equal to the sum of all the degrees of freedom. The resulting p-value was
.99996. Such close agreement would only occur 4 times out of 100,000 on the basis
of chance!
What happened? Did Mendel deliberately or unconsciously fudge the data? Did
he have an overzealous lab technician who was hoping for a recommendation to
medical school? Was there divine intervention? Perhaps the best explanation is that
Mendel continued experimenting until the results looked good and then he stopped.
The statistical analysis here assumes that the sample size is fixed before the data are
■
collected.
Mendel is not the only scientist whose data are too good to be true. Cyril
Burt was an English psychologist whose work had a great impact on the debate
concerning the genetic basis for intelligence. His many papers and extensive data
argue for such a basis. In 1946, Burt became the first psychologist to be knighted;
however, during the 1970s, his work came under increasing attack, and he was
9.6 The Poisson Dispersion Test
347
accused of actually fabricating data. One of his most famous studies was of the intelligence and occupational status of 40,000 fathers and sons. Dorfman (1978) studied
the goodness of fit of these intelligence scores to a normal distribution, using Pearson’s chi-square test. The p-values for fathers and sons were greater than 1 − 10−7
and 1 − 10−6 , respectively! Dorfman concluded that “it may well be that Burt’s frequency distributions are the most normally distributed in the history of anthropometric
measurement.”
9.6 The Poisson Dispersion Test
The likelihood ratio test and Pearson’s chi-square test are carried out with respect to
the general alternative hypothesis that the cell probabilities are completely free. If one
has a specific alternative hypothesis in mind, better power can usually be obtained
by testing against that alternative rather than against a more general alternative. Such
a test is developed in this section for the hypothesis that a distribution is Poisson. The
test is quite useful, and its construction affords another illustration of a generalized
likelihood ratio test.
The two key assumptions underlying the Poisson distribution are that the rate
is constant and that the counts in one interval of time or space are independent of
the counts in disjoint intervals. These assumptions are often not met. For example, suppose that insects are counted on leaves of plants. The leaves are of different
sizes and occur at various locations on different plants; the rate of infestation may
well not be constant over the different locations. Furthermore, if the insects hatched
from eggs that were deposited in groups, there might be clustering of the insects
and the independence assumption might fail. If counts occurring over time are being
recorded, the underlying rate of the phenomenon being studied might not be constant. Motor vehicle counts for traffic studies, for example, typically vary cyclically
over time.
Given counts x1 , . . . , xn , we consider testing the null hypothesis that the counts
are Poisson with the common parameter λ versus the alternative hypothesis that
they are Poisson but have different rates, λ1 , . . . , λn . Under , there are n different rates; ω0 ⊂ is the special case that they are all equal. Under ω0 , the
maximum likelihood estimate of λ is λ̂ = X . Under , the maximum likelihood
estimates of the λi are x1 , . . . , xn ; we denote these estimates by λ̃i . The likelihood ratio
is thus
n
λ̂xi e−λ̂ /xi !
=
i=1
n
-
λ̃ixi e−λ̃i /xi !
i=1
=
n x̄
i=1
xi
xi
e xi −x̄
348
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
The likelihood ratio test statistic is
n x̄
−2 log = −2
xi log
x
i
i=1
=2
n
xi log
i=1
+ (xi − x̄)
x i
x̄
A nearly equivalent form of this statistic is produced using the Taylor series argument
given in Section 9.5:
−2 log ≈
n
1
(xi − x̄)2
x̄ i=1
Under , there are n independent parameters, λ1 , . . . , λn , so dim = n. Under
ω0 , there is only one parameter, λ, so dim ω0 = 1, and the degrees of freedom are
n − 1.
The last equation given for the test statistic may be interpreted as being the ratio of
n times the estimated variance to the estimated mean. For the Poisson distribution, the
variance equals the mean; for the types of alternatives discussed above, the variance
is typically greater than the mean. For this reason the test is often called the Poisson
dispersion test. It is sensitive to—that is, has high power against—alternatives that
are overdispersed relative to the Poisson, such as the negative binomial distribution.
The ratio σ̂ 2 /x̄ is sometimes used as a measure of clustering.
EXAMPLE A
Asbestos Fibers
In Example A in Section 8.4, we considered whether counts of asbestos fibers on grid
squares could be modeled as a Poisson distribution. Applying the Poisson dispersion
test, we find that
1
(xi − x̄)2 = 26.56
x̄
or, if the likelihood ratio statistic is used,
x i
2
xi log
= 27.11
x̄
Since there are 23 observations, there are 22 degrees of freedom. From a computer
calculation, the p-value for the likelihood ratio statistic is .21. The evidence against
the null hypothesis is not persuasive; however, the sample size is small and the test
may have low power.
■
EXAMPLE B
Bacterial Clumps
In Example B in Section 9.5, we applied Pearson’s chi-square test to test whether
counts of bacteria clumps in milk were fit by a Poisson distribution. There we found
9.7 Hanging Rootograms
349
x̄ = 2.44. The sample variance is
02 × 56 + 12 × 104 + · · · + 192 × 1
− x̄ 2
400
= 4.59
σ̂ 2 =
The ratio of the variance to the mean is 1.88 rather than 1; the test statistic is
n σ̂ 2
x̄
400 × 4.59
= 752.7
=
2.40
T =
Under the null hypothesis, the statistic approximately follows a chi-square distribution
with 399 degrees of freedom. Since a chi-square random variable with m degrees of
freedom is the sum of the squares of m independent N (0, 1) random variables, the
central limit theorem implies that for large values of m the chi-square distribution
with m degrees of freedom is approximately normal. For a chi-square distribution, the
mean equals the number of degrees of freedom and variance equals twice the number
of degrees of freedom. The p-value can thus be found by standardizing the statistic
and using tables of the standard normal distribution:
P(T ≥ 752.7) = P
T − 399
752.7 − 399
√
≥ √
2 × 399
2 × 399
≈ 1 − (12.5) ≈ 0
Thus, there is almost no doubt that the Poisson distribution fails to fit the data.
■
9.7 Hanging Rootograms
In this and the next section, we develop additional informal techniques for assessing
goodness of fit. The first of these is the hanging rootogram. Hanging rootograms are a
graphical display of the differences between observed and fitted values in histograms.
To illustrate the construction and interpretation of hanging rootograms, we will use
a set of data from the field of clinical chemistry (Martin, Gudzinowicz, and Fanger
1975). The following table gives the empirical distribution of 152 serum potassium
levels. In clinical chemistry, distributions such as this are often tabulated to establish
a range of “normal” values against which the level of the chemical found in a patient
can be compared to determine whether it is abnormal. The tabulated distributions are
often fit to parametric forms such as the normal distribution.
350
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
Serum Potassium Levels
Interval Midpoint
Frequency
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
4.0
4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
5.0
5.1
5.2
5.3
2
1
3
2
7
8
8
14
14
18
16
15
10
8
8
6
4
1
1
1
4
1
Figure 9.3(a) is a histogram of the frequencies. The plot looks roughly bellshaped, but the normal distribution is not the only bell-shaped distribution. In order to
evaluate their distribution more exactly, we must compare the observed frequencies
to frequencies fit by the normal distribution. This can be done in the following way.
Suppose that the parameters μ and σ of the normal distribution are estimated from the
data by x̄ and σ̂ . If the jth interval has the left boundary x j−1 and the right boundary x j ,
then according to the model, the probability that an observation falls in that interval is
x j − x̄
x j−1 − x̄
p̂ j = −
σ̂
σ̂
If the sample is of size n, the predicted, or fitted, count in the jth interval is
n̂ j = n p̂ j
which may be compared to the observed counts, n j .
Figure 9.3(b) is a hanging histogram of the differences: observed count (n j )
minus expected count (n̂ j ). These differences are difficult to interpret since the variability is not constant from cell to cell. If we neglect the variability in the estimated
expected counts, we have
Var(n j − n̂ j ) = Var(n j ) = np j (1 − p j )
= np j − np 2j
9.7 Hanging Rootograms
Histogram
351
Hanging histogram
20
4
3
2
1
0
1
2
15
10
5
0
(a)
(b)
Hanging rootogram
Hanging chi-gram
5
4
3
2
1
0
1
2
1.0
0.0
1.0
(d)
(c)
F I G U R E 9.3 (a) Histogram, (b) hanging histogram, (c) hanging rootogram, and
(d) hanging chi-gram for normal fit to serum potassium data.
In this case, the p j are small, so
Var(n j − n̂ j ) ≈ np j
Thus, cells with large values of p j (equivalent to large values of n j if the model is at
all close) have more variable differences, n j − n̂ j . In a hanging histogram, we expect
larger fluctuations in the center than in the tails. This unequal variability makes it
difficult to assess and compare the fluctuations, since a large deviation may indicate
real misfit of the model or may be merely caused by large random variability.
To put the differences between observed and expected values on a scale on which
they all have equal variability, a variance-stabilizing transformation may be used.
(Such transformations will be used in later chapters as well.) Suppose that a random
variable X has mean μ and variance σ 2 (μ), which depends on μ. If Y = f (X ), the
method of propagation of error (Section 4.6) shows that
Var(Y ) ≈ σ 2 (μ)[ f (μ)]2
Thus if f is chosen so that σ 2 (μ)[ f (μ)]2 is constant, the variance of Y will not depend
on μ. The transformation f that accomplishes this is called a variance-stabilizing
transformation.
Let us apply this idea to the case we have been considering:
E(n j ) = np j = μ
Var(n j ) ≈ np j = σ 2 (μ)
352
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
That is, σ 2 (μ) = μ, so f will be √
a variance-stabilizing transformation if μ[ f (μ)]2
is constant. The function f (x) = x does the job, and
√
√
E( n j ) ≈ np j
√
Var( n j ) ≈ 14
if the model is correct.
Figure 9.3(c) shows a hanging rootogram, a display showing
√
n j − n̂ j
The advantage of the hanging rootogram is that the deviations from cell to cell have
approximately the same statistical variability. To assess the deviations, we may use the
rough rule of thumb that a deviation of more than 2 or 3 standard deviations (more than
1.0 or 1.5 in this case) is “large.” The most striking feature of the hanging rootogram in
Figure 9.3(c) is the large deviation in the right tail. Generally, deviations in the center
have been down-weighted and those in the tails emphasized by the transformation.
Also, it is noteworthy that although the deviations other than the one in the right
tail are not especially large, they have a certain systematic character: Note the run
of positive deviations followed by the run of negative deviations and then the large
positive deviation in the extreme right tail. This may indicate some asymmetry in the
distribution.
A possible alternative to the rootogram is what can be called a hanging chi-gram,
a plot of the components of Pearson’s chi-square statistic:
n j − n̂ j
n̂ j
Neglecting the variability in the expected counts, as before, Var(n j − n̂ j ) ≈ np j = n̂ j ,
so
n j − n̂ j
≈1
Var
n̂ j
so this technique also stabilizes variance. Figure 9.3(d) is a hanging chi-gram for the
case we have been considering; it is quite similar in overall character to the hanging
rootogram, but the deviation in the right tail is emphasized even more.
9.8 Probability Plots
Probability plots are an extremely useful graphical tool for qualitatively assessing the
fit of data to a theoretical distribution. Consider a sample of size n from a uniform
distribution on [0, 1]. Denote the ordered sample values by X (1) < X (2) · · · < X (n) .
These values are called order statistics. It can be shown (see Problem 17 at the end
of Chapter 4) that
j
n+1
This suggests plotting the ordered observations X (1) , . . . , X (n) against their expected
values 1/(n + 1), . . . , n/(n + 1). If the underlying distribution is uniform, the plot
E(X ( j) ) =
9.8 Probability Plots
353
1.0
Ordered observations
.8
.6
.4
.2
0
0
.2
.4
.6
Uniform quantiles
.8
1.0
F I G U R E 9.4 Uniform-uniform probability plot.
should look roughly linear. Figure 9.4 is such a plot for a sample of size 100 from a
uniform distribution.
Now suppose that a sample Y1 , . . . , Y100 is generated in which each Y is half the
sum of two independent uniform random variables. The distribution of Y is no longer
uniform but triangular:
f ( y) =
4y,
4 − 4y,
0 ≤ y ≤ 12
1
≤y≤1
2
The ordered observations Y(1) , . . . , Y(n) are plotted against the points 1/(n + 1), . . . ,
n/(n + 1). The graph in Figure 9.5 shows a clear deviation from linearity and enables
us to describe qualitatively the deviation of the distribution of the Y ’s from the uniform
distribution. Note that in the left tail of the plotted distribution (near 0), the order statistics are larger than expected for a uniform distribution, and in the right tail (near 1),
they are smaller, indicating that the tails of the distribution of the Y ’s decrease more
quickly (are “lighter”) than the tails of the uniform distribution.
The technique can be extended to other continuous probability laws by means of
Proposition C of Section 2.3, which states that if X is a continuous random variable
with a strictly increasing cumulative distribution function, FX , and if Y = FX (X ),
then Y has a uniform distribution on [0, 1]. The transformation Y = FX (X ) is known
as the probability integral transform.
The following procedure is suggested by the proposition just referred to. Suppose that it is hypothesized that X follows a certain distribution, F. Given a sample
X 1 , . . . , X n , we plot
F(X (k) )
vs.
k
n+1
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
1.0
.8
Ordered observations
354
.6
.4
.2
0
0
.2
.4
.6
Uniform quantiles
.8
1.0
F I G U R E 9.5 Uniform-triangular probability plot.
or equivalently
X (k)
vs.
F −1
k
n+1
In some cases, F is of the form
F(x) = G
x −μ
σ
where μ and σ are called location and scale parameters, respectively. The normal
distribution is of this form. We could plot
k
X (k) − μ
vs.
G −1
σ
n+1
or if we plotted
X (k)
vs.
G −1
k
n+1
the result would be approximately a straight line if the model were correct:
k
−1
X (k) ≈ σ G
+μ
n+1
Slight modifications of this procedure are sometimes used. For example, rather than
G −1 [k/(n + 1)], E(X (k) ), the expected value of the kth smallest observation can be
9.8 Probability Plots
355
used. But it can be argued that
k
n+1
k
= σ G −1
n+1
E(X (k) ) ≈ F
−1
+μ
so this modification yields very similar results to the original procedure.
The procedure can be viewed from another perspective. Recall from Section 2.2
that F −1 [k/(n + 1)] is the k/(n + 1) quantile of the distribution F; that is, it is the
point such that the probability that a random variable with distribution function F is
less than it is k/(n + 1). We are thus plotting the ordered observations (which may be
viewed as the observed or empirical quantiles) versus the quantiles of the theoretical
distribution.
EXAMPLE A
We can illustrate the procedure just described using a set of 100 observations, which
are Michelson’s determinations of the velocity of light made from June 5, 1879 to
July 2, 1879; 299,000 has been subtracted from the determinations to give the values
listed [data from Stigler (1977)]:
850
940
880
820
760
950
880
970
760
870
930
880
840
720
950
960
800
960
880
810
810
850
850
860
750
850
980
810
870
880
940
1000
760
840
880
810
780
930
800
620
740
820
1000
790
850
860
810
1000
790
840
780
890
840
1070
880
720
760
810
880
830
910
890
740
810
800
840
850
870
890
900
940
720
770
790
980
840
880
920
810
760
830
850
850
810
740
960
860
800
810
980
900
950
910
870
650
880
840
840
800
960
Figure 9.6 shows the normal probability plot. The plot looks straight, showing that
the normal distribution gives a reasonable fit.
A word of caution is in order here: Probability plots are by nature monotoneincreasing and they all tend to look fairly straight. Some experience is necessary
in gauging “straightness.” Simulations, which are easily done, are very helpful in
sharpening one’s judgment. Some find it useful to hold the plot so that they are
looking down the plotted line as if it were a roadway; this often makes curvature
■
much more apparent.
356
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
1100
Ordered observations
1000
900
800
700
600
3
2
1
0
1
Normal quantiles
2
3
F I G U R E 9.6 Normal probability plot of Michelson's data.
In order to be able to interpret probability plots, it is useful to see how they are shaped
for samples from nonnormal distributions. Figure 9.7 is a normal probability plot of
500 pseudorandom variables from a double exponential distribution:
f (x) = 12 e−|x| ,
−∞ < x < ∞
6
4
Ordered observations
EXAMPLE B
2
0
2
4
6
8
4
3
2
1
0
1
Normal quantiles
2
3
4
F I G U R E 9.7 Normal probability plot of 500 pseudorandom variables from a double
exponential distribution.
9.8 Probability Plots
357
14
Ordered observations
12
10
8
6
4
2
0
4
3
2
1
0
1
Normal quantiles
2
3
4
F I G U R E 9.8 Normal probability plot of 500 pseudorandom variables from a
gamma distribution with the shape parameter α = 5.
This density is symmetric about zero, but its tails die off at the rate exp(−|x|). This
rate is slower than that for the tails of the normal distribution, which decay at the rate
exp(−x 2 ). Note how the plot in Figure 9.7 bends down at the left and up at the right,
indicating that the observations in the left tail were more negative than expected for a
normal distribution and the observations in the right tail were more positive. In other
words, the extreme observations were larger in magnitude than extreme observations
from a normal distribution would be. This effect results because the tails of the double
exponential are “heavier” than those of a normal distribution.
Figure 9.8 is a normal probability plot of 500 pseudorandom numbers from a
gamma distribution with the shape parameter α = 5 and the scale parameter λ = 1.
As can be seen in Figure 2.11, the gamma density with α = 5 is nonsymmetric,
or skewed, and this is reflected by the bowlike appearance of the probability
■
plot.
EXAMPLE C
As an example for a nonnormal distribution, Figure 9.9 is a gamma probability plot
of the precipitation amounts of Example C in Section 8.4.
The parameter λ of a gamma distribution is a scale parameter and so, as we have
seen before, affects only the slope of a probability plot, not its straightness. Thus in
constructing a probability plot we can take λ = 1 without loss. A computer was used
to find the quantiles of a gamma distribution with parameter α = .471 and λ = 1,
and Figure 9.9 was produced by plotting the observed sorted values of precipitation
versus the quantiles. Qualitatively, the fit looks reasonable, because there is no gross
■
systematic deviation from a straight line.
358
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
Stored values of precipitation
2.5
2.0
1.5
1.0
.5
0
0
1
2
3
4
5
Ordered quantiles of gamma distribution ( .471)
F I G U R E 9.9 Gamma probability plot of rainfall distribution.
Probability plots can also be constructed for grouped data, such as the data on
serum potassium levels in Section 9.7. Because the ordered observations are not all
available in such a case, the procedure must be modified. Suppose that the grouping
gives the points x1 , x2 , . . . , xm+1 for the histogram’s bin boundaries and that in the
interval [xi , xi+1 ) there are n i counts, where i = 1, . . . , m. We denote the cumulative
j
frequencies by N j = i=1
n i . Then N1 < N2 < · · · < Nm and Nm = n, which is the
total sample size. We thus plot
x j+1
EXAMPLE D
vs.
G
−1
Nj
n+1
,
j = 1, . . . , m
Figure 9.10 shows a normal probability plot for the serum potassium data of Section
9.7. The cumulative frequencies are found by summing the frequencies in each bin.
■
The deviations in the right tail are immediately apparent.
9.9 Tests for Normality
A wide variety of tests are available for testing goodness of fit to the normal distribution. We discuss some of them in this section; more discussion may be found in the
works referred to.
9.9 Tests for Normality
359
5.5
Ordered observations
5.0
4.5
4.0
3.5
3.0
3
2
1
0
1
Normal quantiles
2
3
F I G U R E 9.10 Normal probability plot of serum potassium data.
If the data are grouped into bins, with several counts in each bin, Pearson’s chisquare test for goodness of fit may be applied. But if the parameters are estimated
from ungrouped data and the expected counts in each bin are calculated using the
estimated parameters, the limiting distribution of the test statistic is no longer chisquare. In order for the limiting distribution to be chi-square, the parameters must
be estimated from the grouped data. This was pointed out by Chernoff and Lehmann
(1954) and is further discussed by Dahiya and Gurland (1972). Generally speaking,
it seems rather artificial and wasteful of information to group continuous data.
Departures from normality often take the form of asymmetry, or skewness. For a
∞
normal distribution, the third central moment is −∞ (x − μ)3 ϕ(x)d x, which equals 0
since the density is symmetric about μ. Suppose that we wish to test the null hypothesis
that X 1 , . . . , X n are independent normally distributed random variables with the same
mean and variance. A goodness-of-fit test can be based on the coefficient of skewness
for the sample,
1
(X i − X )3
n i=1
n
b1 =
s3
The test rejects for large values of |b1 |.
Symmetric distributions can depart from normality by being heavy-tailed or lighttailed or too peaked or too flat in the center. These forms of departures may be detected
by the coefficient of kurtosis for the sample,
1
(X i − X )4
n i=1
n
b2 =
s4
360
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
If either of these measures is to be used as a test statistic, its sampling distribution
when the distribution generating the data is normal must be determined. The hypothesis test rejects normality when the observed value of the statistic is in the tails of the
sampling distribution. These sampling distributions are difficult to evaluate in closed
form but can be approximated by simulations.
A goodness-of-fit test may also be based on the linearity of the probability
plot, as measured by the correlation coefficient, r , of the x and y coordinates of
the points of the probability plot. The test rejects for small values of r . The sampling distribution of r under normality has been approximated by simulations and
is tabled in Filliben (1975). Ryan and Joiner (unpublished) give a short table for
the null sampling distribution of r from normal probability plots with critical values for the correlation coefficient corresponding to significance levels .1, .05,
and .01:
n
.1
.05
.01
4
5
10
15
20
25
30
40
50
60
75
.8951
.9033
.9347
.9506
.9600
.9662
.9707
.9767
.9807
.9836
.9865
.8734
.8804
.9180
.9383
.9503
.9582
.9639
.9715
.9764
.9799
.9835
.8318
.8320
.8804
.9110
.9290
.9408
.9490
.9597
.9664
.9710
.9752
They also report the results of some simulations of the power of r as the test statistic for
certain alternative distributions. For example, the power against a uniform distribution
with a significance level of .1 is .13 for n = 10 and .20 for n = 20. This is somewhat
discouraging—the test rejects only 13% of the time and 20% of the time for the given
sample sizes if the real underlying distribution is uniform. The moral is that it may be
quite difficult to detect departure from normality in small samples. On a more positive
note, the power of r against an exponential distribution is 53% for n = 10 and 89%
for n = 20.
Pearson, D’Agostino, and Bowman (1977) report the results of quite extensive
simulations of the power for several alternative distributions and give further references.
For Michelson’s data (see Example A in Section 9.8), the correlation coefficient
is .995. From the tables in Filliben (1975), this falls between the 50th and 75th
percentiles of the null sampling distribution, giving no reason to reject the hypothesis
of normality. It may not be very realistic, however, to model the 100 observations
of the velocity of light as a sample of 100 independent random variables from some
probability distribution and to use this model to test goodness of fit. We have little
information about how these data were collected and processed. For example, since
the observations were made sequentially, it is quite possible that the measurement
9.10 Concluding Remarks
361
process drifted in time or that successive errors were correlated. It is also possible
that Michelson discarded some obviously bad data.
9.10 Concluding Remarks
Two very important concepts, estimation and hypothesis testing, have been introduced
in this chapter and the last. They have been introduced here in the context of fitting
probability distributions but will recur throughout the rest of this book in various
other contexts. Generally, observations are taken from a probability law that depends
on a parameter, θ. Estimation theory is concerned with estimating θ from the data;
the theory of hypothesis testing is concerned with testing hypotheses about the value
of θ. Methods based on likelihood, maximum likelihood estimation and likelihood
ratio tests, have also been introduced. These methods are much more generally useful
than has been demonstrated by the specific purposes to which they have been put in
these chapters. The likelihood and the likelihood ratio are key concepts of statistics,
from both Bayesian and frequentist perspectives.
The fundamental concepts and techniques of hypothesis testing have been introduced in this chapter. We have seen how to test a null hypothesis by choosing a test
statistic and a rejection region such that, under the null hypothesis, the probability
that the test statistic falls in the rejection region is α, the significance level of the
test. The choice of this region is determined by knowing, at least approximately, the
null distribution of the test statistic. The test statistic is frequently, but not always, a
likelihood ratio; when the exact distribution of the likelihood ratio cannot be found,
we can use the chi-square distribution as a large-sample approximation. We have also
explored the relation of the p-value of the test statistic to the significance level. In
some situations, the p-value is a less rigid summary of the evidence than is a decision
whether to reject the null hypothesis.
With the increasing availability of flexible computer programs and inexpensive
computers, graphical methods are being used more and more in statistics. The last part
of this chapter introduced two graphical techniques: hanging rootograms and probability plots. Other graphical techniques will be introduced in Chapter 10. Such informal
techniques are often of more practical use than are more formal techniques, such as
hypothesis testing. Literally testing for goodness of fit is often rather artificial—a
parametric distribution is usually entertained only as a model for the distribution of
data values, and it is clear a priori that the data do not really come from that distribution. If enough data were available, the goodness-of-fit test would certainly reject.
Rather than test a hypothesis that no one literally believes could hold, it is usually more
useful to ascertain qualitatively where the model fits and where and how it fails to fit.
Some concepts introduced in Chapters 7 and 8 have been elaborated on in this
chapter. In Chapter 7, we introduced confidence intervals for the parameters of finite populations; in Chapter 8, we considered confidence intervals for parameters of
probability distributions. In this chapter, we have introduced hypothesis testing and
developed a relation between hypothesis tests and confidence intervals. The method of
propagation of error, used in Chapter 7 as a tool for analyzing the statistical behavior
of ratio estimates, has been used in this chapter in connection with variance-stabilizing
transformations.
362
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
9.11 Problems
1. A coin is thrown independently 10 times to test the hypothesis that the probability
of heads is 12 versus the alternative that the probability is not 12 . The test rejects
if either 0 or 10 heads are observed.
a. What is the significance level of the test?
b. If in fact the probability of heads is .1, what is the power of the test?
2. Which of the following hypotheses are simple, and which are composite?
a. X follows a uniform distribution on [0, 1].
b. A die is unbiased.
c. X follows a normal distribution with mean 0 and variance σ 2 > 10.
d. X follows a normal distribution with mean μ = 0.
3. Suppose that X ∼ bin(100, p). Consider the test that rejects H0 : p = .5 in favor
of H A : p = .5 for |X − 50| > 10. Use the normal approximation to the binomial
distribution to answer the following:
a. What is α?
b. Graph the power as a function of p.
4. Let X have one of the following distributions:
X
H0
HA
x1
x2
x3
x4
.2
.3
.3
.2
.1
.4
.1
.4
a. Compare the likelihood ratio, , for each possible value X and order the xi
according to .
b. What is the likelihood ratio test of H0 versus H A at level α = .2? What is the
test at level α = .5?
c. If the prior probabilities are P(H0 ) = P(H A ), which outcomes favor H0 ?
d. What prior probabilities correspond to the decision rules with α = .2 and
α = .5?
5. True or false, and state why:
a. The significance level of a statistical test is equal to the probability that the
null hypothesis is true.
b. If the significance level of a test is decreased, the power would be expected to
increase.
c. If a test is rejected at the significance level α, the probability that the null
hypothesis is true equals α.
d. The probability that the null hypothesis is falsely rejected is equal to the power
of the test.
e. A type I error occurs when the test statistic falls in the rejection region of the
test.
9.11 Problems
363
f. A type II error is more serious than a type I error.
g. The power of a test is determined by the null distribution of the test statistic.
h. The likelihood ratio is a random variable.
6. Consider the coin tossing example of Section 9.1. Suppose that instead of tossing
the coin 10 times, the coin was tossed until a head came up and the total number
of tosses, X , was recorded.
a. If the prior probabilities are equal, which outcomes favor H0 and which favor
H1 ?
b. Suppose P(H0 )/P(H1 ) = 10. What outcomes favor H0 ?
c. What is the significance level of a test that rejects H0 if X ≥ 8?
d. What is the power of this test?
7. Let X 1 , . . . , X n be a sample from a Poisson distribution. Find the likelihood ratio
for testing H0 : λ = λ0 versus H A : λ = λ1 , where λ1 > λ0 . Use the fact that the
sum of independent Poisson random variables follows a Poisson distribution to
explain how to determine a rejection region for a test at level α.
8. Show that the test of Problem 7 is uniformly most powerful for testing H0 : λ = λ0
versus H A : λ > λ0 .
9. Let X 1 , . . . , X 25 be a sample from a normal distribution having a variance of
100. Find the rejection region for a test at level α = .10 of H0 : μ = 0 versus
H A : μ = 1.5. What is the power of the test? Repeat for α = .01.
10. Suppose that X 1 , . . . , X n form a random sample from a density function, f (x|θ ),
for which T is a sufficient statistic for θ. Show that the likelihood ratio test of
H0 : θ = θ0 versus H A : θ = θ1 is a function of T . Explain how, if the distribution
of T is known under H0 , the rejection region of the test may be chosen so that
the test has the level α.
11. Suppose that X 1 , . . . , X 25 form a random sample from a normal distribution having a variance of 100. Graph the power of the likelihood ratio test of H0 : μ = 0
versus H A : μ = 0 as a function of μ, at significance levels .10 and .05. Do the
same for a sample size of 100. Compare the graphs and explain what
you see.
12. Let X 1 , . . . , X n be a random sample from an exponential distribution with the
density function f (x|θ) = θ exp[−θ x]. Derive a likelihood ratio test of H0 : θ =
θ0 versus H A : θ = θ0 , and show that the rejection region is of the form
{X exp[−θ0 X ] ≤ c}.
13. Suppose, to be specific, that in Problem 12, θ0 = 1, n = 10, and that α = .05. In
order to use the test, we must find the appropriate value of c.
a. Show that the rejection region is of the form {X ≤ x0 } ∪ {X ≥ x1 }, where x0
and x1 are determined by c.
b. Explain why c should be chosen so that P(X exp(−X ) ≤ c) = .05 when
θ0 = 1.
10
X i and hence X follow gamma distributions when θ0 = 1.
c. Explain why i=1
How could this knowledge be used to choose c?
364
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
d. Suppose that you hadn’t thought of the preceding fact. Explain how you could
determine a good approximation to c by generating random numbers on a
computer (simulation).
14. Suppose that under H0 , a measurement X is N (0, σ 2 ), and that under H1 , X is
N (1, σ 2 ) and that the prior probability P(H0 ) = 2× P(H1 ). As in Section 9.1, the
hypothesis H0 will be chosen if P(H0 |x) > P(H1 |x). For σ 2 = 0.1, 0.5, 1.0, 5.0:
a. For what values of X will H0 be chosen?
b. In the long run, what proportion of the time will H0 be chosen if H0 is true 23
of the time?
15. Suppose that under H0 , a measurement X is N (0, σ 2 ), and that under H1 , X
is N (1, σ 2 ) and that the prior probability P(H0 ) = P(H1 ). For σ = 1 and
x ∈ [0, 3], plot and compare (1) the p-value for the test of H0 and (2) P(H0 |x).
Can the p-value be interpreted as the probability that H0 is true? Choose another
value of σ and repeat.
16. In the previous problem, with σ = 1, what is the probability that the p-value is
less than 0.05 if H0 is true? What is the probability if H1 is true?
17. Let X ∼ N (0, σ 2 ), and consider testing H0 : σ1 = σ0 versus H A : σ = σ1 , where
σ1 > σ0 . The values σ0 and σ1 are fixed.
a. What is the likelihood ratio as a function of x? What values favor H0 ? What
is the rejection region of a level α test?
b. For a sample, X 1 , X 2 , . . . , X n distributed as above, repeat the previous question.
c. Is the test in the previous question uniformly most powerful for testing
H0 : σ = σ0 versus H1 : σ > σ0 ?
18. Let X 1 , X 2 , . . . , X n be i.i.d. random variables from a double exponential distribution with density f (x) = 12 λ exp(−λ|x|). Derive a likelihood ratio test of the
hypothesis H0 : λ = λ0 versus H1 : λ = λ1 , where λ0 and λ1 > λ0 are specified
numbers. Is the test uniformly most powerful against the alternative H1 : λ > λ0 ?
19. Under H0 , a random variable has the cumulative distribution function F0 (x) = x 2 ,
0 ≤ x ≤ 1; and under H1 , it has the cumulative distribution function F1 (x) = x 3 ,
0 ≤ x ≤ 1.
a. If the two hypotheses have equal prior probability, for what values of x is the
posterior probability of H0 greater than that of H1 ?
b. What is the form of the likelihood ratio test of H0 versus H1 ?
c. What is the rejection region of a level α test?
d. What is the power of the test?
20. Consider two probability density functions on [0, 1]: f 0 (x) = 1, and f 1 (x) = 2x.
Among all tests of the null hypothesis H0 : X ∼ f 0 (x) versus the alternative X ∼
f 1 (x), with significance level α = 0.10, how large can the power possibly be?
21. Suppose that a single observation X is taken from a uniform density on [0, θ ],
and consider testing H0 : θ = 1 versus H1 : θ = 2.
9.11 Problems
365
a. Find a test that has significance level α = 0. What is its power?
b. For 0 < α < 1, consider the test that rejects when X ∈ [0, α]. What is its
significance level and power?
c. What is the significance level and power of the test that rejects when X ∈
[1 − α, 1]?
d. Find another test that has the same significance level and power as the previous
one.
e. Does the likelihood ratio test determine a unique rejection region?
f. What happens if the null and alternative hypotheses are interchanged—H0 :
θ = 2 versus H1 : θ = 1?
22. In Example A of Section 8.5.3 a confidence interval for the variance of a normal
distribution was derived. Use Theorem B of Section 9.3 to derive an acceptance
region for testing the hypothesis H0 : σ 2 = σ02 at the significance level α based on
a sample X 1 , X 2 , . . . , X n . Precisely describe the rejection region if σ0 = 1, n =
15, α = .05.
23. Suppose that a 99% confidence interval for the mean μ of a normal distribution
is found to be (−2.0, 3.0). Would a test of H0 : μ = −3 versus H A : μ = −3 be
rejected at the .01 significance level?
24. Let X be a binomial random variable with n trials and probability p of success.
a. What is the generalized likelihood ratio for testing H0 : p = .5 versus
H A : p = .5?
b. Show that the test rejects for large values of |X − n/2|.
c. Using the null distribution of X , show how the significance level corresponding
to a rejection region |X − n/2| > k can be determined.
d. If n = 10 and k = 2, what is the significance level of the test?
e. Use the normal approximation to the binomial distribution to find the significance level if n = 100 and k = 10.
This analysis is the basis of the sign test, a typical application of which would be
something like this: An experimental drug is to be evaluated on laboratory rats.
In n pairs of litter mates, one animal is given the drug and the other is given a
placebo. A physiological measure of benefit is made after some time has passed.
Let X be the number of pairs for which the animal receiving the drug benefited
more than its litter mate. A simple model for the distribution of X if there is no
drug effect is binomial with p = .5. This is then the null hypothesis that must
be made untenable by the data before one could conclude that the drug had an
effect.
25. Calculate the likelihood ratio for Example B of Section 9.5 and compare the
results of a test based on the likelihood ratio to those of one based on Pearson’s
chi-square statistic.
26. True or false:
a. The generalized likelihood ratio statistic is always less than or equal to 1.
b. If the p-value is .03, the corresponding test will reject at the significance
level .02.
366
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
c. If a test rejects at significance level .06, then the p-value is less than or equal
to .06.
d. The p-value of a test is the probability that the null hypothesis is correct.
e. In testing a simple versus simple hypothesis via the likelihood ratio, the
p-value equals the likelihood ratio.
f. If a chi-square test statistic with 4 degrees of freedom has a value of 8.5, the
p-value is less than .05.
27. What values of a chi-square test statistic with 7 degrees of freedom yield a p-value
less than or equal to .10?
28. Suppose that a test statistic T has a standard normal null distribution.
a. If the test rejects for large values of |T |, what is the p-value corresponding to
T = 1.50?
b. Answer the same question if the test rejects for large T .
29. Suppose that a level α test based on a test statistic T rejects if T > t0 . Suppose
that g is a monotone-increasing function and let S = g(T ). Is the test that rejects
if S > g(t0 ) a level α test?
30. Suppose that the null hypothesis is true, that the distribution of the test statistic,
T say, is continuous with cdf F and that the test rejects for large values of T .
Let V denote the p-value of the test.
a. Show that V = 1 − F(T ).
b. Conclude that the null distribution of V is uniform. (Hint: See Proposition C
of Section 2.3.)
c. If the null hypothesis is true, what is the probability that the p-value is greater
than .1?
d. Show that the test that rejects if V < α has significance level α.
31. What values of the generalized likelihood ratio are necessary to reject the null
hypothesis at the significance level α = .1 if the degrees of freedom are 1, 5, 10,
and 20?
32. The intensity of light reflected by an object is measured. Suppose there are two
types of possible objects, A and B. If the object is of type A, the measurement is
normally distributed with mean 100 and standard deviation 25; if it is of type B,
the measurement is normally distributed with mean 125 and standard deviation
25. A single measurement is taken with the value X = 120.
a. What is the likelihood ratio?
b. If the prior probabilities of A and B are equal ( 12 each), what is the posterior
probability that the item is of type B?
c. Suppose that a decision rule has been formulated that declares the object to be
of type B if X > 125. What is the significance level associated with this rule?
d. What is the power of this test?
e. What is the p-value when X = 120?
33. It has been suggested that dying people may be able to postpone their death until
after an important occasion, such as a wedding or birthday. Phillips and King
9.11 Problems
367
(1988) studied the patterns of death surrounding Passover, an important Jewish
holiday, in California during the years 1966–1984. They compared the number of
deaths during the week before Passover to the number of deaths during the week
after Passover for 1919 people who had Jewish surnames. Of these, 922 occurred
in the week before Passover and 997, in the week after Passover. The significance
of this discrepancy can be assessed by statistical calculations. We can think of
the counts before and after as constituting a table with two cells. If there is no
holiday effect, then a death has probability 12 of falling in each cell. Thus, in
order to show that there is a holiday effect, it is necessary to show that this simple
model does not fit the data. Test the goodness of fit of the model by Pearson’s
X 2 test or by a likelihood ratio test. Repeat this analysis for a group of males of
Chinese and Japanese ancestry, of whom 418 died in the week before Passover
and 434 died in the week after. What is the relevance of this latter analysis to the
former?
34. Test the goodness of fit of the data to the genetic model given in Problem 55 of
Chapter 8.
35. Test the goodness of fit of the data to the genetic model given in Problem 58 of
Chapter 8.
36. The National Center for Health Statistics (1970) gives the following data on
distribution of suicides in the United States by month in 1970. Is there any
evidence that the suicide rate varies seasonally, or are the data consistent with
the hypothesis that the rate is constant? (Hint: Under the latter hypothesis, model
the number of suicides in each month as a multinomial random variable with the
appropriate probabilities and conduct a goodness-of-fit test. Look at the signs of
the deviations, Oi − E i , and see if there is a pattern.)
Month
Number
of Suicides
Days/Month
Jan.
Feb.
Mar.
Apr.
May
June
July
Aug.
Sept.
Oct.
Nov.
Dec.
1867
1789
1944
2094
2097
1981
1887
2024
1928
2032
1978
1859
31
28
31
30
31
30
31
31
30
31
30
31
37. The following table gives the number of deaths due to accidental falls for each
month during 1970. Is there any evidence for a departure from uniformity in the
368
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
rate over time? That is, is there a seasonal pattern to this death rate? If so, describe
its pattern and speculate as to causes.
Month
Number of Deaths
Jan.
Feb.
Mar.
Apr.
May
June
July
Aug.
Sept.
Oct.
Nov.
Dec.
1668
1407
1370
1309
1341
1338
1406
1446
1332
1363
1410
1526
38. Yip et al. (2000) studied seasonal variations in suicide rates in England and Wales
during 1982–1996, collecting counts shown in the following table:
Month
Jan
Feb
Mar
Apr
May
June
July
Aug
Sept
Oct
Nov
Dec
Male
3755
3251
3777
3706
3717
3660
3669
3626
3481
3590
3605
3392
Female
1362
1244
1496
1452
1448
1376
1370
1301
1337
1351
1416
1226
Do either the male or female data show seasonality?
39. There is a great deal of folklore about the effects of the full moon on humans and
other animals. Do animals bite humans more during a full moon? In an attempt
to study this question, Bhattacharjee et al. (2000) collected data on admissions to
a medical facility for treatment of bites by animals: cats, rats, horses, and dogs.
95% of the bites were by man’s best friend, the dog. The lunar cycle was divided
into 10 periods, and the number of bites in each period is shown in the following
table. Day 29 is the full moon. Is there a temporal trend in the incidence of bites?
Lunar Day
Number of Bites
16,17,18 19,20,21 22,23,24 25,26,27 28,29,1 2,3,4 5,6,7 8,9,10 11,12,13 14,15
137
150
163
201
269
155
142
146
148
110
40. Consider testing goodness of fit for a multinomial distribution with two cells.
Denote the number of observations in each cell by X 1 and X 2 and let the hypothesized probabilities be p1 and p2 . Pearson’s chi-square statistic is equal to
2
(X i − npi )2
npi
i=1
9.11 Problems
369
Show that this may be expressed as
(X 1 − np1 )2
np1 (1 − p1 )
Because X 1 is binomially distributed, the following holds approximately under
the null hypothesis:
√
X 1 − np1
∼ N (0, 1)
np1 (1 − p1 )
Thus, the square of the quantity on the left-hand side is approximately distributed
as a chi-square random variable with 1 degree of freedom.
41. Let X i ∼ bin(n i , pi ), for i = 1, . . . , m, be independent. Derive a likelihood ratio
test for the hypothesis
H0 : p1 = p2 = · · · = pm
against the alternative hypothesis that the pi are not all equal. What is the largesample distribution of the test statistic?
42. Nylon bars were tested for brittleness (Bennett and Franklin 1954). Each of 280
bars was molded under similar conditions and was tested in five places. Assuming
that each bar has uniform composition, the number of breaks on a given bar should
be binomially distributed with five trials and an unknown probability p of failure.
If the bars are all of the same uniform strength, p should be the same for all of
them; if they are of different strengths, p should vary from bar to bar. Thus, the
null hypothesis is that the p’s are all equal. The following table summarizes the
outcome of the experiment:
Breaks/Bar
Frequency
0
1
2
3
4
5
157
69
35
17
1
1
a. Under the given assumption, the data in the table consist of 280 observations
of independent binomial random variables. Find the mle of p.
b. Pooling the last three cells, test the agreement of the observed frequency
distribution with the binomial distribution using Pearson’s chi-square test.
c. Apply the test procedure derived in the previous problem.
43. a. In 1965, a newspaper carried a story about a high school student who reported
getting 9207 heads and 8743 tails in 17,950 coin tosses. Is this a significant
discrepancy from the null hypothesis H0 : p = 12 ?
b. Jack Youden, a statistician at the National Bureau of Standards, contacted the
student and asked him exactly how he had performed the experiment (Youden
370
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
1974). To save time, the student had tossed groups of five coins at a time, and
a younger brother had recorded the results, shown in the following table:
Number of Heads
Frequency
0
1
2
3
4
5
100
524
1080
1126
655
105
Are the data consistent with the hypothesis that all the coins were fair
( p = 12 )?
c. Are the data consistent with the hypothesis that all five coins had the same
probability of heads but that this probability was not necessarily 12 ? (Hint: Use
the binomial distribution.)
44. Derive and carry out a likelihood ratio test of the hypothesis H0 : θ =
H1 : θ = 12 for Problem 58 of Chapter 8.
1
2
versus
45. In a classic genetics study, Geissler (1889) studied hospital records in Saxony
and compiled data on the gender ratio. The following table shows the number
of male children in 6115 families with 12 children. If the genders of successive
children are independent and the probabilities remain constant over time, the
number of males born to a particular family of 12 children should be a binomial
random variable with 12 trials and an unknown probability p of success. If the
probability of a male child is the same for each family, the table represents the
occurrence of 6115 binomial random variables. Test whether the data agree with
this model. Why might the model fail?
Number
Frequency
0
1
2
3
4
5
6
7
8
9
10
11
12
7
45
181
478
829
1112
1343
1033
670
286
104
24
3
46. Show that the transformation Y = sin−1
where X ∼ bin(n, p).
√
p̂ is variance-stabilizing if p̂ = X/n,
9.11 Problems
371
47. Let X√follow a Poisson distribution with mean λ. Show that the transformation
Y = X is variance-stabilizing.
48. Suppose that E(X ) = μ and Var(X ) = cμ2 , where c is a constant. Find a
variance-stabilizing transformation.
49. An English naturalist collected data on the lengths of cuckoo eggs, measuring to
the nearest .5 mm. Examine the normality of this distribution by (a) constructing
a histogram and superposing a normal density, (b) plotting on normal probability
paper, and (c) constructing a hanging rootogram.
Length
Frequency
18.5
19.0
19.5
20.0
20.5
21.0
21.5
22.0
22.5
23.0
23.5
24.0
24.5
25.0
25.5
26.0
26.5
0
1
3
33
39
156
152
392
288
286
100
86
21
12
2
0
1
50. Burr (1974) gives the following data on the percentage of manganese in iron
made in a blast furnace. For 24 days, a single analysis was made on each of five
casts. Examine the normality of this distribution by making a normal probability
plot and a hanging rootogram. (As a prelude to topics that will be taken up in
later chapters, you might also informally examine whether the percentage of
manganese is roughly constant from one day to the next or whether there are
significant trends over time.)
Day
1
Day
2
Day
3
Day
4
Day
5
Day
6
Day
7
Day
8
Day
9
Day
10
Day
11
Day
12
1.40
1.28
1.36
1.38
1.44
1.40
1.34
1.54
1.44
1.46
1.80
1.44
1.46
1.50
1.38
1.54
1.50
1.48
1.52
1.58
1.52
1.46
1.42
1.58
1.70
1.62
1.58
1.62
1.76
1.68
1.58
1.64
1.62
1.72
1.60
1.62
1.46
1.38
1.42
1.38
1.60
1.44
1.46
1.38
1.34
1.38
1.34
1.36
1.58
1.38
1.34
1.28
1.08
1.08
1.36
1.50
1.46
1.28
1.18
1.28
(Continued)
372
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
Day
13
Day
14
Day
15
Day
16
Day
17
Day
18
Day
19
Day
20
Day
21
Day
22
Day
23
Day
24
1.26
1.50
1.52
1.38
1.50
1.52
1.50
1.46
1.34
1.40
1.50
1.42
1.38
1.36
1.38
1.42
1.32
1.48
1.36
1.38
1.32
1.40
1.40
1.26
1.26
1.16
1.34
1.40
1.16
1.54
1.24
1.22
1.20
1.30
1.36
1.30
1.48
1.28
1.18
1.28
1.30
1.52
1.76
1.16
1.28
1.48
1.46
1.48
1.42
1.36
1.32
1.22
1.72
1.18
1.36
1.44
1.28
1.10
1.06
1.10
51. Examine the probability plot in Figure 9.6 and explain why there are several sets
of horizontal bands of points.
52. The following table gives values of two abundance ratios for different isotopes
of potassium from several samples of minerals (H. Ku, private communication).
Examine whether each of the ratios appears normally distributed by first making
histograms and superposing normal densities and then making probability plots.
39
K/41 K
13.8645
13.8695
13.8659
13.8622
13.8696
13.8604
13.8672
13.8598
13.8641
13.8673
13.8597
13.8604
13.8591
13.8472
13.863
13.8566
13.8503
13.8553
13.8642
13.8613
13.8706
13.8601
13.866
13.8655
13.8612
13.8598
41
K/40 K
576.369
578.012
575.597
575.244
575.567
576.836
576.236
575.291
576.478
576.992
578.335
576.767
576.571
576.617
575.885
576.651
575.974
577.255
574.664
576.405
574.306
577.095
576.957
576.434
575.211
576.630
39
K/41 K
13.8689
13.8593
13.8742
13.8703
13.8472
13.8555
13.8439
13.8646
13.8702
13.8606
13.8622
13.8588
13.8547
13.8597
13.8663
13.8597
13.8604
13.8634
13.8658
13.8547
13.8519
13.863
13.8581
13.8644
13.8665
13.8648
41
K/40 K
578.277
574.708
573.630
576.069
575.637
575.971
576.403
576.179
575.129
577.084
576.749
576.669
575.869
577.793
577.770
577.697
576.299
575.903
574.773
577.391
577.057
577.286
575.510
576.509
574.300
575.846
39
K/41 K
13.8724
13.8665
13.8566
13.8555
13.8534
13.8685
13.8694
13.8599
13.8605
13.8619
13.9641
13.8597
13.8617
13.861
13.8615
13.8469
13.8582
13.8645
13.8713
13.8593
13.8522
13.8489
13.8609
13.857
13.8566
13.864
41
K/40 K
576.017
574.881
578.508
576.796
580.394
576.772
576.501
574.950
577.614
574.506
576.317
575.665
575.815
576.109
576.144
576.820
576.672
576.169
575.390
575.108
576.663
578.358
575.371
575.851
575.644
574.462
53. Hoaglin (1980) suggested a “Poissonness plot”—a simple visual method for
assessing goodness of fit. The expected frequencies for a sample of size n from
9.11 Problems
a Poisson distribution are
E k = n P(X = k) = ne−λ
or
373
λk
k!
log E k = log n − λ + k log λ − log k!
Thus, a plot of log(Ok ) + log k! versus k should yield nearly a straight line with
a slope of log λ and an intercept of log n − λ. Construct such plots for the data
of Problems 1, 2, and 3 of Chapter 8. Comment on how straight they are.
54. A random variable X is said to follow a lognormal distribution if Y = log(X )
follows a normal distribution. The lognormal is sometimes used as a model for
heavy-tailed skewed distributions.
a. Calculate the density function of the lognormal distribution.
b. Examine whether the lognormal roughly fits the following data (Robson 1929),
which are the dorsal lengths in millimeters of taxonomically distinct octopods.
110
190
55
65
43
15
40
32
11
44
76
23
28
12
15
57
19
63
70
12
17
33
49
16
150
29
18
21
60
43
23
36
22
11
26
29
82
6
21
64
84
73
54
44
82
16
95
29
30
27
85
35
5
22
52
19
18
175
10
20
29
16
16
20
17
6
47
130
115
37
50
17
41
61
116
55
67
26
51
9
50
73
43
80
52
17
22
28
8
27
32
75
10
45
55. a. Generate samples of size 25, 50, and 100 from a normal distribution. Construct
probability plots. Do this several times to get an idea of how probability plots
behave when the underlying distribution is really normal.
b. Repeat part (a) for a chi-square distribution with 10 df.
c. Repeat part (a) for Y = Z /U , where Z ∼ N (0, 1) and U ∼ U [0, 1] and Z
and U are independent.
d. Repeat part (a) for a uniform distribution.
e. Repeat part (a) for an exponential distribution.
f. Can you distinguish between the normal distribution of part (a) and the subsequent nonnormal distributions?
56. Suppose that a sample is taken from a symmetric distribution whose tails decrease
more slowly than those of the normal distribution. What would be the qualitative
shape of a normal probability plot of this sample?
57. The Cauchy distribution has the probability density function
1
1
,
−∞ < x < ∞
f (x) =
π 1 + x2
374
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
What would be the qualitative shape of a normal probability plot of a sample
from this distribution?
58. Show how probability plots for the exponential distribution, F(x) = 1 − e−λx ,
may be constructed. Berkson (1966) recorded times between events and fit them
to an exponential distribution. (The times between events in a Poisson process
are exponentially distributed.) The following table comes from Berkson’s paper.
Make an exponential probability plot, and evaluate its “straightness.”
Time Interval (sec)
Observed Frequency
0–60
60–120
120–181
181–243
243–306
306–369
369–432
432–497
497–562
562–628
628–698
689–1130
1130–1714
1714–2125
2125–2567
2567–3044
3044–3562
3562–4130
4130–4758
4758–5460
5460–6255
6255–7174
7174–8260
8260–9590
9590–11,304
11,304–13,719
13,719–14,347
14,347–15,049
15,049–15,845
15,845–16,763
16,763–17,849
17,849–19,179
19,179–20,893
20,893–23,309
23,309–27,439
27,439+
115
104
99
106
113
104
101
106
104
96
512
524
468
531
461
526
506
509
520
540
542
499
494
500
550
465
104
97
101
104
92
102
103
110
112
100
59. Construct a hanging rootogram from the data of the previous problem in order to
compare the observed distribution to an exponential distribution.
9.11 Problems
375
60. The exponential distribution is widely used in studies of reliability as a model
for lifetimes, largely because of its mathematical simplicity. Barlow, Toland, and
Freeman (1984) analyzed data on the strength of Kevlar 49/epoxy, a material
used in the space shuttle. The times to failure (in hours) of 76 strands tested at a
stress level of 90% are given in the following table.
Times to Failure at 90% Stress Level
.01
.03
.07
.10
.13
.24
.36
.52
.63
.72
.80
.92
1.02
1.11
1.31
1.45
1.54
1.63
2.02
3.03
7.89
.01
.03
.07
.10
.18
.24
.38
.54
.65
.72
.80
.95
1.03
1.15
1.33
1.50
1.54
1.64
2.05
3.03
.02
.04
.08
.11
.19
.29
.40
.56
.67
.73
.83
.99
1.05
1.18
1.34
1.51
1.55
1.80
2.14
3.24
.02
.05
.09
.11
.20
.34
.42
.60
.68
.79
.85
1.00
1.10
1.20
1.40
1.52
1.58
1.80
2.17
4.20
.02
.06
.09
.12
.23
.35
.43
.60
.72
.79
.90
1.01
1.10
1.29
1.43
1.53
1.60
1.81
2.33
4.69
a. Construct a probability plot of the data against the quantiles of an exponential
distribution to assess qualitatively whether the exponential is a reasonable
model. Can you explain the peculiar appearance of the plot?
b. Compare the data to the exponential distribution by means of a hanging rootogram.
61. The files haliburton and macdonalds give the monthly returns on the
stocks of these two companies from 1975 through 1999.
a. Make histograms of the returns and superimpose fitted normal densities. Comment on the quality of the fit. Which stock is more volatile?
b. Make normal probability plots and again comment on the quality of the fit.
62. Apply the Poisson dispersion test to the data on gamma-ray counts—Problem 42
of Chapter 8. You will have to modify the development of the likelihood ratio
test in Section 9.5 to take account of the time intervals being of different lengths.
63. Construct a gamma probability plot for the data of Problem 46 of Chapter 8.
376
Chapter 9
Testing Hypotheses and Assessing Goodness of Fit
64. The file bodytemp contains normal body temperature readings (degrees Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and 65 females
(coded by 2) from Shoemaker (1996).
a. Assess the normality of the male and female body temperatures by making normal probability plots. In order to judge the inherent variability of these plots,
simulate several samples from normal distributions with matching means and
standard deviations, and make normal probability plots. What do you conclude?
b. Repeat the preceding problem for heart rates.
c. For the males, test the null hypothesis that the mean body temperature is 98.6◦
versus the alternative that the mean is not equal to 98.6◦ . Do the same for the
females. What do you conclude?
65. This problem continues the analysis of the chromatin data from Problem 45 of
Chapter 8 and is concerned with further examining goodness of fit.
a. Goodness of fit can also be examined via probability plots in which the quantiles of a theoretical distribution are plotted against those of the empirical
distribution. Following the discussion in Section 9.8, show that it is sufficient
to plot the observed order statistics, X (k) , versus the quantiles of the Rayleigh
distribution with θ = 1. Construct three such probability plots and comment
on any systematic lack of fit that you observe. To get an idea of what sort of
variability could be expected due to chance, simulate several sets of data from
a Rayleigh distribution and make corresponding probability plots.
b. Formally test goodness of fit by performing a chi-squared goodness of fit test,
comparing histogram counts to those predicted from the Rayleigh model. You
may need to combine cells of the histograms so that the expected counts in
each cell are at least 5.
CHAPTER
10
Summarizing Data
10.1 Introduction
This chapter deals with methods of describing and summarizing data that are in the
form of one or more samples, or batches. These procedures, many of which generate
graphical displays, are useful in revealing the structure of data that are initially in
the form of numbers printed in columns on a page or recorded on a tape or disk
as a computer file. In the absence of a stochastic model, the methods are useful for
purely descriptive purposes. If it is appropriate to entertain a stochastic model, the
implications of that model for the method are of interest. For example, the arithmetic
mean x̄ is often used as a summary of a collection of numbers x1 , x2 , . . . , xn ; it
indicates a “typical value.” (We discuss some of its strengths and weaknesses in this
regard in Section 10.4.) In some situations, it may be useful to model the collection
of numbers as a realization of n independent random variables X 1 , X 2 , . . . , X n with
common mean μ and variance σ 2 . The question of variability of x̄ can be addressed
with such a model—the mean x̄ is regarded as an estimate of μ, and we know from
previous work that the stochastic model implies E(X ) = μ and Var(X ) = σ 2 /n.
We will first discuss methods that are data analogues of the cumulative distribution function of a random variable. These methods are useful in displaying the
distribution of data values. Next, we will discuss the histogram and related graphical
displays that play the role for data that the probability density or frequency function
plays for a random variable, giving a different view of the distribution of data values
than that provided by the cumulative distribution function. We then discuss simpler
numerical summaries of data, numbers that indicate a typical or central value of the
data and a quantification of the spread. Such statistics provide a more condensed
summary than do the cumulative distribution function and the histogram. We will
pay particular attention to the effect of extreme data points on these measures. Next,
we will introduce boxplots, graphical summaries that combine in a simple form
information about the central values, spread, and shape of a distribution. Finally,
377
378
Chapter 10
Summarizing Data
scatterplots are introduced as a method for displaying information about relationships of variables.
10.2 Methods Based on the Cumulative
Distribution Function
10.2.1 The Empirical Cumulative Distribution Function
Suppose that x1 , . . . , xn is a batch of numbers. (The word sample is often used in the
case that the xi are independently and identically distributed with some distribution
function; the word batch will imply no such commitment to a stochastic model.) The
empirical cumulative distribution function (ecdf) is defined as
1
(#xi ≤ x)
n
(With this definition, Fn is right-continuous; in the former Soviet Union and Eastern
Europe, the ecdf is usually defined to be left-continuous.)
Denote the ordered batch of numbers by x(1) ≤ x(2) ≤ · · · ≤ x(n) . Then if x < x(1) ,
Fn (x) = 0, if x(1) ≤ x < x(2) , Fn (x) = 1/n , if x(k) ≤ x < x(k+1) , Fn (x) = k/n, and
so on. If there is a single observation with value x, Fn has a jump of height 1/n at x;
if there are r observations with the same value x, Fn has a jump of height r/n at x.
The ecdf is the data analogue of the cumulative distribution function of a random
variable: F(x) gives the probability that X ≤ x and Fn (x) gives the proportion of the
collection of numbers less than or equal to x.
Fn (x) =
EXAMPLE A
As an example of the use of the ecdf, let us consider data taken from a study by White,
Riethof, and Kushnir (1960) of the chemical properties of beeswax. The aim of the
study was to investigate chemical methods for detecting the presence of synthetic
waxes that had been added to beeswax. For example, the addition of microcrystalline
wax raises the melting point of beeswax. If all pure beeswax had the same melting
point, its determination would be a reasonable way to detect dilutions. The melting
point and other chemical properties of beeswax, however, vary from one beehive to
another. The authors obtained samples of pure beeswax from 59 sources, measured
several chemical properties, and examined the variability of the measurements. The
59 melting points (in ◦ C) are listed here. As a summary of these measurements, the
ecdf is plotted in Figure 10.1.
63.78
63.34
63.36
63.51
63.92
63.50
63.68
63.31
63.45
63.50
63.86
63.84
63.56
63.30
63.13
63.66
63.58
63.83
63.34
64.27
63.43
63.86
63.41
63.60
63.08
63.63
63.92
63.50
64.21
63.93
63.60
63.40
63.27
63.88
63.56
64.24
63.43
63.13
64.42
63.30
63.36
63.39
64.12
64.40
63.69
63.27
63.83
63.36
63.78
63.92
63.61
63.05
63.10
63.50
63.51
63.92
63.53
63.03
62.85
10.2 Methods Based on the Cumulative Distribution Function
379
1.0
Cumulative frequency
.8
.6
.4
.2
.0
62.8
63.2
63.6
64.0
Melting point (C)
64.4
F I G U R E 10.1 The empirical cumulative distribution function of the melting points
of beeswax.
Figure 10.1 conveniently summarizes the natural variability in melting points.
For example, we can see from the graph that about 90% of the samples had melting
points less than 64.2◦ C and that about 12% had melting points less than 63.2◦ C.
White, Riethof, and Kushnir showed that the addition of 5% microcrystalline
wax raised the melting point of beeswax by .85◦ C and the addition of 10% raised
the melting point by 2.22◦ C. From Figure 10.1, we can see that an addition of 5%
microcrystalline wax might well be difficult to detect, especially if it was made to
beeswax that had a low melting point, but that an addition of 10% would be detectable.
In further calculations, the investigators modeled the distribution of melting points as
Gaussian. How reasonable does this model appear to be?
■
Let us briefly consider some of the elementary statistical properties of the ecdf
in the case in which X 1 , . . . , X n is a random sample from a continuous distribution
function, F. For purposes of analysis, it is convenient to express Fn in the following
way:
Fn (x) =
n
1
I(−∞,x] (X i )
n i=1
where
I(−∞,x] (X i ) =
1,
0,
if X i ≤ x
if X i > x
380
Chapter 10
Summarizing Data
The random variables I(−∞,x] (X i ) are independent Bernoulli random variables:
1,
0,
I(−∞,x] (X i ) =
with probability F(x)
with probability 1 − F(x)
Thus, n Fn (x) is a binomial random variable (n trials, probability F(x) of success)
and so
E[Fn (x)] = F(x)
1
Var[Fn (x)] = F(x)[1 − F(x)]
n
As an estimate of F(x), Fn (x) is unbiased and has a maximum variance at that value
of x such that F(x) = .5, that is, at the median. As x becomes very large or very
small, the variance tends to zero.
In the preceding paragraph, we considered Fn (x) for fixed x; the results can be
applied to form a confidence interval for F(x) for any given value of x. Much deeper
analysis focuses on the stochastic behavior of Fn as a random function; that is, all
values of x are considered simultaneously. It turns out, somewhat surprisingly, that
the distribution of
max |Fn (x) − F(x)|
−∞<x<∞
does not depend on F if F is continuous. This result makes possible the construction
of a simultaneous confidence band about Fn , which can be used to test goodness
of fit. [For further details, refer to Section 9.6 of Bickel and Doksum (1977).] It is
important to realize the difference between the simultaneous confidence band and the
individual confidence intervals that may be constructed using the binomial distribution. Each such individual confidence interval covers F at one point with a certain
probability, say, 1 − α, but the probability that all such intervals cover F simultaneously is not necessarily 1 − α. We will encounter other phenomena of this type in later
chapters.
10.2.2 The Survival Function
The survival function is equivalent to the cumulative distribution function and is
defined as
S(t) = P(T > t) = 1 − F(t)
where T is a random variable with cdf F. In applications where the data consist of
times until failure or death and are thus nonnegative, it is often customary to work
with the survival function rather than the cumulative distribution function, although
the two give equivalent information. Data of this type occur in medical and reliability
studies. In these cases, S(t) is simply the probability that the lifetime will be longer
than t. We will be concerned with the sample analogue of S,
Sn (t) = 1 − Fn (t)
which gives the proportion of the data greater than t.
10.2 Methods Based on the Cumulative Distribution Function
EXAMPLE A
381
As an example, let us consider the use of the survival function with a study of the
lifetimes of guinea pigs infected with varying doses of tubercle bacilli (Bjerkdal
1960). In one study, five groups of 72 animals each were inoculated with the bacilli
at increasing dosages, and a control group of 107 animals was used. We denote the
inoculated groups by I, II, III, IV, and V, in order of increasing dose. The animals
were observed over a 2-year period, and their times of death (in days) were recorded.
The data are given here. Note that not all the animals in the lower-dosage regimens
died.
Control Lifetimes
18
102
149
189
341
446
576
637
735
36
105
160
209
355
455
590
638
50
114
165
212
367
463
603
641
52
114
166
216
380
474
607
650
86
115
167
273
382
506
608
663
87
118
167
278
421
515
621
665
89
119
173
279
421
546
634
688
91
120
178
292
432
559
634
725
113
154
179
213
259
311
397
114
154
181
216
265
315
398
119
160
181
220
268
326
406
99
119
142
164
187
228
267
334
99
123
144
165
192
238
267
335
110
124
145
167
196
242
270
358
Dose I Lifetimes
76
136
164
183
225
268
326
452
93
137
164
185
225
270
361
466
97
138
166
194
244
283
373
592
107
139
168
198
253
289
373
598
108
152
178
212
256
291
376
Dose II Lifetimes
72
113
131
154
171
211
248
286
409
72
113
133
156
176
214
256
303
473
78
114
135
157
177
216
257
309
550
83
114
137
162
181
216
262
324
85
118
140
162
182
218
264
326
382
Chapter 10
Summarizing Data
Dose III Lifetimes
10
92
107
116
136
163
197
240
327
33
93
108
120
139
168
202
245
342
44
96
108
121
144
171
213
251
347
56
100
108
122
146
172
215
253
361
59
100
109
122
153
176
216
254
402
72
102
112
124
159
183
222
254
432
74
105
113
130
160
195
230
278
458
77
107
115
134
163
196
231
293
555
57
80
88
99
104
123
147
198
511
58
81
89
100
107
126
156
211
522
66
81
91
100
108
128
162
214
598
32
48
58
62
72
85
110
175
341
32
52
58
63
73
87
121
175
341
33
53
59
65
75
91
127
211
376
Dose IV Lifetimes
43
67
81
91
101
109
137
174
243
45
73
82
92
102
113
138
178
249
53
74
83
92
102
114
139
179
329
56
79
83
97
102
118
144
184
380
56
80
84
99
103
121
145
191
403
Dose V Lifetimes
12
34
54
60
65
76
95
129
233
15
38
54
60
67
76
96
131
258
22
38
55
60
68
81
98
143
258
24
43
56
60
70
83
99
146
263
24
44
57
61
70
84
109
146
297
A plot (Figure 10.2) of the empirical survival functions provides a convenient
summary of the data. The proportions surviving beyond given times are plotted; it is
not necessary to know the actual lifetimes of the animals that survived beyond the
termination of the study. The graph is a much more effective presentation of the data
than the tabular listings.
One of Bjerkdahl’s primary interests was comparing the effect increased exposure
had on guinea pigs that had different levels of resistivity. Comparing groups III and
V, for example, we see that the difference in lifetimes of the weakest guinea pigs (say
the 10% weakest) from the two groups was about 50 days, whereas the difference in
■
lifetimes for stronger animals increases to about 100 days.
10.2 Methods Based on the Cumulative Distribution Function
1.0
Control
I
II
III
IV
V
.8
Proportion of live animals
383
.6
.4
.2
0
0
200
400
Days elapsed
600
800
F I G U R E 10.2 Survival functions for guinea pig lifetimes. For purposes of visual clarity, the points have been joined by lines: The solid line corresponds to the control group,
the dotted line to group I, the short-dash line to group II, the long-dash line to group III,
the dot-and-long-dash line to group IV, and the short-and-long-dash line to group V.
Survival plots may also be used for informal examinations of the hazard function, which may be interpreted as the instantaneous death rate for individuals who
have survived up to a given time. If an individual is alive at time t, the probability
that that individual will die in the time interval (t, t + δ) is, assuming that the density
function f is continuous at t,
P(t ≤ T ≤ t + δ)
P(t ≤ T ≤ t + δ|T ≥ t) =
P(T ≥ t)
F(t + δ) − F(t)
=
1 − F(t)
δ f (t)
≈
1 − F(t)
The hazard function is defined as
f (t)
h(t) =
1 − F(t)
and may be thought of as the instantaneous rate of mortality for an individual alive
at time t. If T is the lifetime of a manufactured component, it may be natural to think
of h(t) as the instantaneous or age-specific failure rate. It may also be expressed as
d
d
log[1 − F(t)] = − log S(t)
dt
dt
which reveals that it is the negative of the slope of the log of the survival function.
h(t) = −
384
Chapter 10
Summarizing Data
Consider, for example, the exponential distribution:
F(t) = 1 − e−λt
S(t) = e−λt
f (t) = λe−λt
h(t) = λ
The instantaneous mortality rate is constant. If the exponential distribution were used
as a model for the time until failure of a component, it would imply that the probability
of the component failing did not depend on its age. This is a consequence of the
“memoryless” property of the exponential distribution (Section 2.2.1). An alternative
model might have a hazard function that is U-shaped, the rate of failure being high
for very new components because of flaws in the manufacturing process that show
up very quickly, declining for components of intermediate age, and then increasing
for older components as they wear out.
The empirical survival function and its logarithm can be expressed in terms of
the ordered observations. For simplicity, suppose that there are no ties and that the
ordered failure times are T(1) < T(2) < · · · < T(n) . Then if t = T(i) , Fn (t) = i/n and
Sn (t) = 1 − i/n. Since log Sn (t) is then undefined for t ≥ T(n) , it is often defined as
Sn (t) = 1 − i/(n + 1) for T(i) ≤ t < T(i+1) .
EXAMPLE B
For the data of Example A, Figure 10.3 is a plot of the log of the empirical survival
functions. We plotted log[1 − i/(n + 1)] versus the ordered survival times T(i) . From
the slopes of these curves, we see that the hazard functions are initially fairly small.
As the dosage level increases, the instantaneous mortality rates both increase more
quickly and reach higher levels. The increased mortality rate sets in at an earlier age
for the high-dosage group and seems greater (the slope is greater). (To see this, hold
■
the figure at an angle so that you are “looking down” the curves.)
When interpreting plots such as that presented in Figure 10.3, we will find it
useful to keep in mind the variability of the empirical log survival function. Using the
method of propagation of error (Section 4.6), we have
Var[1 − Fn (t)]
[1 − F(t)]2
1 F(t)[1 − F(t)]
=
n
[1 − F(t)]2
1
F(t)
=
n 1 − F(t)
Var{log[1 − Fn (t)]} ≈
10.2 Methods Based on the Cumulative Distribution Function
Proportion of live animals
5
385
Control
I
II
III
IV
V
1
.5
.1
.05
.01
0
200
400
Days elapsed
600
800
F I G U R E 10.3 Log survival functions for guinea pig lifetimes. For purposes of visual
clarity, the points have been joined by lines: The solid line corresponds to the control
group, the dotted line to group I, the short-dash line to group II, the long-dash line to
group III, the dot-and-long-dash line to group IV, and the short-and-long-dash line
to group V.
From this expression, we see that for large values of t, the empirical log survival
function is extremely unreliable, because 1− F(t) is then very small. Thus, in practice,
the last few data points are disregarded. (Note the large fluctuations of the log survival
functions in Figure 10.3 for large times.)
10.2.3 Quantile-Quantile Plots
Quantile-quantile (Q-Q) plots are useful for comparing distribution functions. If X
is a continuous random variable with a strictly increasing distribution function, F,
the pth quantile of the distribution was defined in Section 2.2 to be that value of x
such that
F(x) = p
or
x p = F −1 ( p)
In a Q-Q plot, the quantiles of one distribution are plotted against those of another.
Suppose, for purposes of discussion, that one cdf (F) is a model for observations of
a control group and another (G) is a model for observations of a group that has
Chapter 10
Summarizing Data
received some treatment. Let the observations of the control group be denoted by x
with cdf F, and let the observations of the treatment group be denoted by y with
cdf G. The simplest effect that the treatment could have would be to increase the
expected response of every member of the treatment group by the same amount,
say h units. That is, both the weakest and the strongest individuals would have their
responses changed by h. Then y p = x p + h, and the Q-Q plot would be a straight
line with slope 1 and intercept h. We will now show that this relationship between
the quantiles implies that the cumulative distribution functions have the relationship
G(y) = F(y − h). This follows, because for every 0 ≤ p ≤ 1,
p = G(y p )
= F(x p )
= F(y p − h)
as in Figure 10.4.
1.0
.8
.6
cdf
386
.4
.2
0
3
2
1
0
1
2
3
4
y
F I G U R E 10.4 An additive treatment effect. The solid line is F(y), and the dotted
line is G(y) = F (y − h).
Another possible effect of a treatment would be multiplicative: The response
(such as lifetime or strength) is multiplied by a constant, c. The quantiles would then
be related as y p = cx p , and the Q-Q plot would be a straight line with slope c and
intercept 0. The cdf’s would be related as G( y) = F( y/c) (see Figure 10.5).
A simple summary of a treatment effect for the additive model would be of the
form “the treatment increases lifetime by 2 mo.” For the multiplicative model, one
might say something like “the treatment increases lifetime by 25%.”
10.2 Methods Based on the Cumulative Distribution Function
387
1.0
.8
cdf
.6
.4
.2
0
0
5
10
y
15
20
F I G U R E 10.5 A multiplicative treatment effect. The solid line is F ( y), and the
dotted line is G( y) = F ( y/c).
The effect of a treatment can, of course, be much more complicated than either of
these two simple models. For example, a treatment could benefit weaker individuals
and be to the detriment of stronger individuals. An educational program that places
very heavy emphasis on elementary or basic skills might be expected to have this sort
of effect relative to a regular program.
Given a batch of numbers, or a sample from a probability distribution, quantiles
are constructed from the order statistics. Given n observations and the order statistics
X (1) , . . . , X (n) , the k/(n + 1) quantile of data is assigned to X (k) . (This convention
is not unique; sometimes, for example, the quantile assigned to X (k) is defined as
(k − .5)/n. For descriptive purposes, it makes little difference which definition we
use.) In constructing probability plots in Chapter 9, we plotted sample quantiles
defined as just described versus the quantiles of a theoretical distribution, such as the
normal, and used these plots to informally assess goodness of fit.
To compare two batches of n numbers with order statistics X (1) , . . . , X (n) and
Y(1) , . . . , Y(n) , a Q-Q plot is simply constructed by plotting the points (X (i) , Y(i) ). If
the batches are of unequal size, an interpolation process can be used. A procedure for
interpolating intermediate quantiles is described in the end-of-chapter problems.
EXAMPLE A
Cleveland et al. (1974) used Q-Q plots in a study of air pollution. They plotted the
quantiles of distributions of the values of various variables on Sunday against the
quantiles for weekdays (Figure 10.6). The Q-Q plot of the ozone maxima shows that
the very highest quantiles occur on weekdays but that all the other quantiles are larger
on Sundays. For carbon monoxide, nitrogen oxide, and aerosols, the differences in the
Chapter 10
Sundays
388
Summarizing Data
0.12
6
0.08
4
O3 max
Linden
(ppm)
0.04
Sundays
0.00
0.00 0.04
(a)
0.08
0
0.12
480
2.0
320
1.0
Nonmethane
hydrocarbons 160
Linden
(ppm)
0
(d)
1
2
3
Weekdays
0.15
CO
Elizabeth
(ppm)
2
3.0
0.0
0.25
4
0
1
(b)
3
5
NO
Elizabeth
(ppm)
0.05
7
0.0 0.1
(c)
0.2
0.3
0.4
1.6
1.2
0
(e)
0.8
Solar radiation
Central Park 0.4
(langleys)
0.0
160
320
480
0.0
Weekdays
(f)
Aerosols
Elizabeth
(ruds)
1.2
2.0
Weekdays
2.8
F I G U R E 10.6 Q-Q plots of air pollution variables: (a) ozone maxima (ppm), (b) carbon
monoxide concentration (ppm), (c) nitrogen oxide concentration (ppm), (d) nonmethane
hydrocarbons (ppm), (e) solar radiation (langleys), (f) aerosols (ruds).
quantiles increase with increasing concentration. The very high and very low quantiles
of solar radiation are about the same on Sundays and weekdays (presumably corresponding to very clear days and days with heavy cloud cover), but for intermediate
quantiles, the Sunday quantiles are larger.
■
EXAMPLE B
Figure 10.7 is a Q-Q plot for groups III and V of Bjerkdahl (see Example A in
Section 10.2.2). It shows that the difference in the quantiles increases for the larger
quantiles; this is consistent with the observations we made earlier. From his analysis
of the data, Bjerkdahl concluded that the increases were proportionally the same for
animals with little, average, or great resistance—that is, that the treatment effect is
multiplicative in the sense defined earlier. If this were the case, the Q-Q plot would
be a straight line. For times up to about 200 days, the animals in group III live
approximately twice as long as those in group V, but beyond 100 days the difference
is roughly constant. The Q-Q plot thus provides a simple and effective means of
■
comparing the lifetimes in the two groups.
Further discussion and examples of Q-Q plots can be found in Wilk and
Gnanadesikan (1968).
10.3 Histograms, Density Curves, and Stem-and-Leaf Plots
389
600
Quantiles of group III
500
400
300
200
100
0
0
100
200
300
Quantiles of group V
400
F I G U R E 10.7 Q-Q plots of groups III and V from Bjerkdahl (1960). For reference,
the line y = x has been added.
10.3 Histograms, Density Curves,
and Stem-and-Leaf Plots
The histogram, a time-honored method of displaying data, has already been introduced. It displays the shape of the distribution of data values in the same sense that
a density function displays probabilities. The range of the data is divided into intervals, or bins, and the number or proportion of the observations falling in each bin
is plotted. If the bins are not of equal size, the resulting histogram can be misleading. A procedure that is often recommended is to plot the proportion of observations
falling in the bin divided by the bin width; if this procedure is used, the area under
the histogram is 1.
Figure 10.8 shows three histograms of the melting points of beeswax from Example A in Section 10.2.1 with increasingly larger bin width. If the bin width is too
small, the histogram is too ragged; if the bin is too wide, the shape is oversmoothed
and obscured. The choice of bin width is usually made subjectively in an attempt to
strike a balance between a histogram that is too ragged and one that oversmooths.
Rudemo (1982) discusses automatic methods for choosing the bin width.
Histograms are frequently used to display data for which there is no assumption
of any stochastic model—for example, populations of U.S. cities. If the data are
modeled as a random sample from some continuous distribution, the histogram may
be viewed as an estimate of the probability density. Regarded in this light, it suffers
from not being smooth.
A smooth probability density estimate can be constructed in the following
way. Let w(x) be a nonnegative, symmetric weight function, centered at zero and
Count
Summarizing Data
8
6
4
2
0
62.5
63.0
63.5
64.0
Melting point (C)
64.5
65.0
63.0
63.5
64.0
Melting point (C)
64.5
65.0
63.0
63.5
64.0
Melting point (C)
64.5
65.0
(a)
Count
Chapter 10
16
12
8
4
0
62.5
(b)
Count
390
40
30
20
10
0
62.5
(c)
F I G U R E 10.8 Histograms of melting points of beeswax: (a) bin width = .1, (b) bin
width = .2, (c) bin width = .5.
integrating to 1. For example, w(x) can be the standard normal density. The function
x
1
wh (x) = w
h
h
is a rescaled version of w. As h approaches zero, wh becomes more concentrated
and peaked about zero. As h approaches infinity, wh becomes more spread out and
flatter. If w(x) is the standard normal density, then wh (x) is the normal density with
standard deviation h. If X 1 , . . . , X n is a sample from a probability density function,
f , an estimate of f is
f h (x) =
n
1
wh (x − X i )
n i=1
This estimate, called a kernel probability density estimate, consists of the superposition of “hills” centered on the observations. In the case where w(x) is the standard normal density, wh (x − X i ) is the normal density with mean X i and standard
deviation h.
The parameter h, the bandwidth of the estimating function, controls its smoothness and corresponds to the bin width of the histogram. If h is too small, the estimate
10.3 Histograms, Density Curves, and Stem-and-Leaf Plots
391
fh ( x)
1.6
0.8
0.0
58
60
62
64
66
Melting point (C)
68
70
60
62
64
66
Melting point (C)
68
70
60
62
64
66
Melting point (C)
68
70
(a)
fh (x )
1.2
0.6
0.0
58
(b)
fh (x )
0.30
0.15
0.0 58
(c)
F I G U R E 10.9 Probability density estimates from melting point data. The kernel w is
the standard normal density with standard deviation (a) .025, (b) .125, and (c) 1.25.
Note that the vertical scales are different.
is too rough; if it is too large, the shape of f is smeared out too much. Figure 10.9
shows estimates of the probability density of the melting points of beeswax (from
Example A in Section 10.2.1) for various values of h. Making a reasonable choice
of the bandwidth is important, just as is choosing the bin width for a histogram.
From Figure 10.9, we see that too small a bandwidth yields a ragged curve and too
large a bandwidth obscures the shape and spreads the probability mass out too much.
Scott (1992) contains extensive discussion of probability density estimation, including methods for automatic, data-driven bandwidth choice and estimation of densities
in more than one dimension.
One disadvantage of a histogram or a probability density estimate is that information is lost; neither allows the reconstruction of the original data. Furthermore,
a histogram does not allow one to calculate a statistic such as a median; one can
tell from a histogram only in which bin the median lies and not the median’s actual
value.
Stem-and-leaf plots (Tukey 1977) convey information about shape while retaining the numerical information. It is easiest to define this type of plot by an example,
a stem-and-leaf plot of the beeswax melting-point data (the decimal point is one
392
Chapter 10
Summarizing Data
place to the left of the colon):
1
1
4
7
9
18
23
26
19
17
11
6
6
5
2
2
1
0
3
3
2
9
5
10
7
2
6
5
0
1
3
0
2
STEM
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
LEAF
:5
:
:358
:033
:77
:001446669
:01335
:0000113668
:0013689
:88
:334668
:22223
:
:2
:147
:
:02
The first three digits of the melting points have been selected to form the stem and are
listed in the third column. The leaves on each stem are the fourth digit of all numbers
with that stem. For example, the first stem is 628, and its leaf indicates the presence of
the number 62.85 in the data. The third stem is 630, and its leaves indicate the presence
of the numbers 63.03, 63.05, and 63.08. This stem-and-leaf plot was constructed by
a computer, but they are very easy to make by hand. The second column of numbers
gives the number of leaves on each stem. The first column of numbers facilitates
finding order statistics, such as quartiles and the median; starting at the top of the
plot and continuing down to the stem containing the median, the cumulative numbers
of observations out to the smallest observation are listed. The numbering process
is then extended symmetrically from the stem containing the median to the largest
observation of the data.
Straightforward stem-and-leaf plots do not work well for data that range over
several orders of magnitude. In such a situation, it is better to make a stem-and-leaf
plot of the logarithms of the data.
10.4 Measures of Location
Sections 10.2 and 10.3 were concerned with data analogues of the cumulative distribution and density functions and with related curves, which convey visual information
about the shape of the distribution of the data. Here and in Section 10.5, we discuss
simple numerical summaries of data that are useful when there is not enough data
to justify constructing a histogram or an ecdf, or when a more concise summary is
desired.
A measure of location is a measure of the center of a batch of numbers. If
the numbers result from different measurements of the same quantity, a measure of
10.4 Measures of Location
393
location is often used in the hope that it is more accurate than any single measurement. In other situations, a measure of location is used as a simple summary of the
numbers—for example, “the average grade on the exam was 72.” In this section, we
will discuss several common measures of location and their relative advantages and
disadvantages.
10.4.1 The Arithmetic Mean
The most commonly used measure of location is the arithmetic mean,
x̄ =
n
1
xi
n i=1
For illustration, we consider a set of 26 measurements of the heat of sublimation of
platinum from an experiment done by Hampson and Walker (1961). The data are
listed here:
Heats of Sublimation of Platinum (kcal/mol)
136.3
147.8
134.8
134.3
136.6
148.8
135.8
135.2
135.8
134.8
135.0
135.4
135.2
133.7
134.7
134.9
134.4
135.0
146.5
134.9
134.1
141.2
134.8
143.3
.
135.4
134.5
The 26 measurements are all attempts to measure the “true” heat of sublimation,
and we see that there is variability among them. Intuitively, it may seem that a measure
of location or center for this batch of numbers would give a more accurate estimate
of the heat of sublimation than any one of the numbers alone.
A common statistical model for the variability of a measurement process is the
following:
X i = μ + β + εi
(See Section 4.2.1.) Here, X i is the value of the ith measurement, μ is the true value
of the heat of sublimation, β represents bias in the measurement procedure, and εi
is the random error. The εi are usually assumed to be independent and identically
distributed random variables with mean 0 and variance σ 2 . The efficacy of measures
of location is often judged by comparing their performances (mean squared error, for
example) with this model. Note that with this model the data alone tell us nothing
about β, the bias in the measurement procedure, which in some cases may be as or
more important than the random variability.
The observations are listed across rows in the order in which the experiments were
done. When observations are acquired sequentially, it is often informative to plot them
in order, as in Figure 10.10. From this plot, we see that the first few observations were
somewhat high. The most striking aspect of the plot is the presence of five extreme
observations that occurred in groups of three and two. Such observations, which are
quite far from the bulk of the data, are called outliers. Outliers occur all too frequently,
Chapter 10
Summarizing Data
150
148
Heat of sublimation (kcal/mole)
394
146
144
142
140
138
136
134
132
0
5
10
15
20
Order of experiment
25
30
F I G U R E 10.10 Plot showing time sequence of measurements of heat of
sublimation of platinum.
even in carefully conducted studies. The outliers in this case might have been caused
by improperly calibrated equipment, for example. Outliers can also be caused by
recording and transcription errors or by equipment malfunctions. It is important to
detect outliers, since they may have an undue influence on subsequent calculations.
Graphical presentation is an effective means of detection. Careful reexamination of the
data and the circumstances under which they were obtained can sometimes uncover
the causes behind the outliers. Although outliers are often unexplainable aberrations,
an examination of them and their causes can sometimes deepen an investigator’s
understanding of the phenomenon under study.
Figure 10.10 also makes us doubt that the model for measurement error given
above is appropriate for this set of data. The fact that the outliers occur in groups of
two and three, rather than being randomly scattered, makes the independence model
somewhat implausible.
A stem-and-leaf plot provides another summary of this data (the decimal point
is at the colon):
1
4
11
1 133:7
3 134:134
7 134:5788899
6 135:002244
9 2 135:88
7 1 136:3
6 1 136:6
High: 141.2 143.3 146.5 147.8
148.8
10.4 Measures of Location
395
On this stem-and-leaf plot, the outlying observations have been isolated and
flagged as high.
In their analysis, Hampson and Walker set aside the seven largest observations
and the smallest observation and found the average of the remaining observations to
be 134.9. Calculated from all the observations, the arithmetic mean is 137.05. Note
from the stem-and-leaf plot and from Figure 10.10 that this number is larger than the
bulk of the data and is clearly not a good descriptive measure of the “center” of this
batch of numbers. We would not be satisfied with it as an estimate of the true heat of
sublimation.
If the data are modeled as a sample from a probability law, as with the measurement error model described above, an approximate 100(1 − α)% confidence interval
for the population mean can be obtained from the central limit theorem as in Chapter 7.
The interval is of the form
x̄ ± z(α/2)sx̄
Blindly applying this formula to the platinum data, with α = .05, we obtain the
interval 137.05 ± 1.71, or (135.3, 138.8). Note where this interval falls on the stemand-leaf plot!
Although the example presented here may be somewhat extreme, it illustrates the
sensitivity of the sample mean to outlying observations. In fact, by changing a single
number, the arithmetic mean of a batch of numbers can be made arbitrarily large or
small. Thus, if used blindly, without careful attention to the data, the arithmetic mean
can produce misleading results. When the data are automatically acquired, stored
as files on disks or tapes, and not visually examined, this danger increases. For this
reason, measures of location that are robust, or insensitive to outliers, are important.
10.4.2 The Median
If the sample size is an odd number, the median is defined to be the middle value
of the ordered observations; if the sample size is even, the median is the average of
the two middle values. Clearly, moving the extreme observations does not affect the
sample median at all, so the median is quite robust. The median of the platinum data
is 135.1, which, as can be seen from the stem-and-leaf plot, is more reasonable than
the mean as a measure of the center.
When the data are a sample from a continuous probability law, the sample median can be viewed as an estimate of the population median, η, for which a simple
confidence interval can be formed. We will now demonstrate that this interval is of
the form
(X (k) , X (n−k+1) )
The coverage probability of this interval is
P(X (k) ≤ η ≤ X (n−k+1) ) = 1 − P(η < X (k) or η > X (n−k+1) )
= 1 − P(η < X (k) ) − P(η > X (n−k+1) )
396
Chapter 10
Summarizing Data
since the events are mutually exclusive. To evaluate these terms, we first note that
P(η > X (n−k+1) ) =
k−1
P( j observations are greater than η)
j=0
P(η < X (k) ) =
k−1
P( j observations are less than η)
j=0
Since, by definition, the median satisfies
P(X i > η) = P(X i < η) =
1
2
and since the n observations X 1 , . . . , X n are independent and identically distributed,
the distribution of the number of observations greater than the median is binomial
with n trials and probability 12 of success on each trial. Thus,
1 n
P(exactly j observations are greater than η) = n
2 j
and
k−1 1 n
P(η > X (n−k+1) ) = n
2 j=0 j
From symmetry, we then have that the coverage probability of the interval in question
is
k−1 1 n
1 − n−1
2
j
j=0
These probabilities can be found from tables of the cumulative binomial distribution
since
k−1 1 n
= P(Y ≤ k − 1)
2n j=0 j
where Y is a binomial random variable with n trials and probability of success equal
to 12 .
EXAMPLE A
As a concrete example, with n = 26, we have the following cumulative binomial
probabilities:
k
P(Y ≤ k)
5
6
7
8
9
.0012
.0047
.0145
.0378
.0843
10.4 Measures of Location
397
If we choose k = 8,
P(Y < k) = .0145
and since P(Y < k) = P(Y > n − k + 1), P(Y > 19) = .0145. Since 2 × .0145 =
.029, the interval (X (8) , X (19) ) is a 97% confidence interval. Note that this confidence
interval is exact, not approximate, and does not depend on the form of the underlying
cdf but only on the assumption that the cdf is continuous and that the observations
are independent.
For the platinum data, this confidence interval is (134.8, 135.8). Compare this
interval to the interval based on the sample mean. (But as we noted, there is reason
to doubt the independence assumption for the platinum data, so these calculations
■
should be viewed as an illustrative numerical exercise.)
10.4.3 The Trimmed Mean
Another simple and robust measure of location is the trimmed mean. The 100α%
trimmed mean is easy to calculate: Order the data, discard the lowest 100α% and the
highest 100α%, and take the arithmetic mean of the remaining data. It is generally
recommended that the value chosen for α be from .1 to .2. Formally, we may write
the trimmed mean as
x([nα]+1) + · · · + x(n−[nα])
x̄α =
n − 2[nα]
where [nα] denotes the greatest integer less than or equal to nα. Note that the median
can be regarded as a 50% trimmed mean.
The 20% trimmed mean for the platinum data listed in Section 10.4.1 is formed
by discarding the highest and lowest five observations (.2 × 26 = 5.2) and averaging
the rest. The result is 135.29; for the same data, the median was 135.1 and the mean
was 137.05.
10.4.4 M Estimates
The sample mean is the mle of μ, the location parameter, when the underlying distribution is normal. Equivalently, the sample mean minimizes the negative log likelihood,
or
n Xi − μ 2
i=1
σ
This is the simplest case of a least squares estimate. (We will discuss least squares
estimates in more detail in the context of curve fitting.) Outliers have a great effect
on this estimate, since the deviation of μ from X i is measured by the square of
their difference. In contrast, the median is the minimizer of (see Problem 34 of the
end-of-chapter problems)
n Xi − μ σ i=1
398
Chapter 10
Summarizing Data
Here, large deviations are not weighted as heavily, and it is this property that causes
the median to be robust.
Huber (1981) proposed a class of estimates, M estimates, which are the minimizers of
n
Xi − μ
σ
i=1
where the weight function is a compromise between the weight functions for the
mean and the median. A wide variety of weight functions have been proposed. Huber
discusses weight functions that are quadratic near zero and are linear beyond a cutoff
point, k. Thus, k = ∞ corresponds to the mean and k = 0 to the median. A common
choice is k = 1.5. With this choice, the influence of observations more than 1.5σ
away from the center is reduced. In practice, a robust estimate of σ , such as those
discussed in Section 10.5, must be used.
The computation of an M estimate is a nonlinear minimization problem and must
be done iteratively (using the Newton-Raphson method, for example). If M is a convex
function, the minimizer will be unique. Fairly simple computer programs that do this
are common in statistical packages. The M estimate (k = 1.5) for the platinum data
we have been considering is 135.38, close to the median (135.1) and the trimmed
mean (135.29) but quite different from the mean (137.05).
10.4.5 Comparison of Location Estimates
We introduced several location estimates (and there are many others). Which one
is best? There is no simple answer to this question. It is always important to bear
in mind what is being estimated by the location estimates and to what purpose the
estimate is being put. If the underlying distribution is symmetric, the trimmed mean,
the sample mean, the sample median, and an M estimate all estimate the center of
symmetry. If the underlying distribution is not symmetric, however, the four statistics
estimate four different population parameters: the population mean, the population
median, the population trimmed mean, and a functional of the cdf determined by the
weight function . Moreover, there is no single estimate that is best for all symmetric
distributions. Life isn’t that simple. Simulations have been done to compare estimates
for a variety of distributions. Andrews et al. (1972) report the results of a large number
of simulations from symmetric distributions. Their results show that the 10% or 20%
trimmed mean is overall quite an effective estimate: Its variance is never much larger
than the variance of the ordinary mean (even in the Gaussian case for which the
mean is optimal) and can be quite a lot smaller when the underlying distribution
is heavy-tailed relative to the Gaussian. The median, although quite robust, has a
substantially larger variance in the Gaussian case than does the trimmed mean. The
trimmed mean and the median have a certain appealing simplicity and are easy to
explain to someone who has little formal statistical training. M estimates performed
quite well in the simulations of the Andrews et al. study, and they do generalize
more naturally to other problems such as curve fitting. But they are somewhat more
difficult to compute and have less immediate intuitive appeal. For the purpose of
simply summarizing data, it is often useful to compute more than one measure of
location and compare the results.
10.4 Measures of Location
399
10.4.6 Estimating Variability of Location Estimates
by the Bootstrap
If we view the observations x1 , x2 , . . . , xn as realizations of independent random
variables with common distribution function F, it is appropriate to investigate the
variability and sampling distribution of a location estimate calculated from a sample
of size n. Suppose we denote the location estimate as θ̂ ; it is important to keep in
mind that θ̂ is a function of the random variables X 1 , X 2 , . . . , X n and hence has a
probability distribution, its sampling distribution, which is determined by n and F. We
would like to know this sampling distribution, but we are faced with two problems:
(1) We don’t know F, and (2) even if we knew F, θ̂ may be such a complicated
function of X 1 , X 2 , . . . , X n that finding its distribution would exceed our analytic
abilities.
First, we address the second problem. Suppose then, for the moment, that we
knew F. How could we find the probability distribution of θ̂ without going through
incredibly complicated analytic calculations? The computer comes to our rescue—we
can do it by simulation. We generate many, many samples, say B in number, of size
n from F; from each sample we calculate the value of θ̂ . The empirical distribution
of the resulting values θ1∗ , θ2∗ , . . . , θ B∗ is an approximation to the distribution function
of θ̂, which is good if B is very large. If we want to know the standard deviation of
θ̂ , we can find a good approximation to it by calculating the standard deviation of
the collection of values θ1∗ , θ2∗ , . . . , θ B∗ . We can make these approximations arbitrarily
accurate by taking B to be arbitrarily large.
All this would be well and good if we knew F, but we don’t. So what do we do?
The bootstrap solution is to view the empirical cdf Fn as an approximation to F and
sample from Fn . That is, Fn would be used in place of F in the previous paragraph.
How do we go about sampling from Fn ? Fn is a discrete probability distribution
that gives probability 1/n to each observed value x1 , x2 , . . . , xn . A sample of size
n from Fn is thus a sample of size n drawn with replacement from the collection
x1 , x2 , . . . , xn . We thus draw B samples of size n with replacement from the observed
data, producing θ1∗ , θ2∗ , . . . , θ B∗ . The standard deviation of θ̂ is then estimated by
0
1
B
11 2
(θ ∗ − θ̄ ∗ )2
sθ̂ =
B i=1 i
where θ̄ ∗ is the mean of θ1∗ , θ2∗ , . . . , θ B∗ .
EXAMPLE A
We illustrate this idea on the platinum data by using the bootstrap to approximate the
sampling distribution of the 20% trimmed mean and its standard error. To this end,
1000 samples of size n = 26 were drawn randomly with replacement from the collection of 26 values. A histogram of the 1000 trimmed means is displayed in Figure 10.11.
The standard deviation of the 1000 values was .64, which is the estimated standard
error of the 20% trimmed mean. The histogram is interesting—note the skewed tail to
the right. We see that some of the trimmed means were far from the bulk of the data.
This happened because some of the samples drawn with replacement included several
400
Chapter 10
Summarizing Data
500
400
300
200
100
0
135
136
137
kcal/mol
138
139
F I G U R E 10.11 Histogram of 1000 bootstrap 20% trimmed means.
replicates of the five outliers (see Figure 10.10). The computer calculation is telling
us that if we sample from Fn , the 20% trimmed mean is not as robust as we might
like; this is an extremely heavy-tailed distribution and a sample of 26 may contain a
large number of outliers.
As in Chapter 8, we can use the bootstrap distribution to form an approximate
90% confidence interval. We proceed as in Examples D and E of Section 8.5.3, which
you may want to review at this time. Denote the trimmed mean of the sample by
∗
∗
≤ θ(2)
≤
θ̂ = 135.29, and denote the 1000 ordered bootstrap trimmed means by θ(1)
∗
∗
· · · ≤ θ(1000) . Then the .05 quantile of the bootstrap distribution is θ = θ(50) =
∗
= 136.93. Following the notation of the
134.00, and the .95 quantile is θ̄ = θ(950)
examples of Section 8.5.3, the approximate 90% confidence interval is (θ̂ − δ̄, θ̂ − δ),
where
θ̂ − δ̄ = θ̂ − (θ̄ − θ̂ )
= 2θ̂ − θ̄
= 133.65
and
θ̂ − δ = θ̂ − (θ − θ̂ )
= 2θ̂ − θ
= 135.58
10.5 Measures of Dispersion
401
400
300
200
100
0
135.0
135.5
136.0
136.5
F I G U R E 10.12 Histogram of 1000 bootstrap medians.
Figure 10.12 is a histogram of 1000 bootstrapped medians. It is less dispersed
than the histogram of trimmed means; the standard deviation of the medians is .24,
considerably less than that of the trimmed mean. The bootstrap simulation is telling
us that when sampling from a distribution like this, the median is more robust than
the 20% trimmed mean.
■
How accurate are these bootstrap estimates? It is difficult to answer this question
in a useful, explicit manner. Essentially, the accuracy depends on two factors: (1) the
accuracy of Fn as an estimate of F, and (2) the dependence of the distribution of the
statistic θ̂ on F. For example, if the distribution of θ̂ changes little if F changes, then
Fn need not be a very good estimate of F, whereas if the distribution of θ̂ is extremely
sensitive to F, Fn will have to be a good estimate of F, and hence the sample size
will have to be large, in order for the bootstrap approximation to be accurate.
10.5 Measures of Dispersion
A measure of dispersion, or scale, gives a numerical indication of the “scatteredness”
of a batch of numbers. Simple summaries of data often consist of a measure of location
and a measure of dispersion. The most commonly used measure is the sample standard
402
Chapter 10
Summarizing Data
deviation, s, which is the square root of the sample variance,
1 (X i − X )2
n − 1 i=1
n
s2 =
Using n − 1 as the divisor rather than the more obvious divisor n is based on the
rationale that s 2 is an unbiased estimate of the population variance if the observations
are independent and identically distributed with variance σ 2 . (But s is not an unbiased
estimate of σ because the square root is a nonlinear function.) If n is of moderate to
large size, it makes little difference whether n or n − 1 is used.
If the observations are a sample from a normal distribution with variance σ 2 ,
(n − 1)s 2
2
∼ χn−1
σ2
This distributional result may be used to construct confidence intervals for σ 2 in the
normal case (compare with Example A in Section 8.5.3), but the result is not robust
against deviations from normality.
Like the sample mean, the sample standard deviation is sensitive to outlying
observations. Two simple robust measures of dispersion are the interquartile range
(IQR), or the difference between the two sample quartiles; (the 25th and 75th percentiles) and the median absolute deviation from the median (MAD). If the data
are x1 , . . . , xn with median x̃, the MAD is defined to be the median of the numbers
|xi − x̃|. These two measures of dispersion, the IQR and the MAD, can be converted
into estimates of σ for a normal distribution by dividing them by 1.35 and .675,
respectively. David (1981) discusses a method for finding a confidence interval for
the population interquartile range, using reasoning similar to that used in Section
10.4.2 for developing a confidence interval for the population median.
Let us compare all three measures of dispersion for the platinum data:
s = 4.45
IQR
= 1.26
1.35
MAD
= .934
.675
The two robust estimates are similar. From the stem-and-leaf plot of the platinum
values presented earlier, we can see that both the IQR and the MAD give measures of
the spread of the central portion of the data, whereas the standard deviation is heavily
influenced by the outliers.
10.6 Boxplots
A boxplot is a graphical display invented by Tukey that shows a measure of location
(the median), a measure of dispersion (the interquartile range), and the presence of
possible outliers and also gives an indication of the symmetry or skewness of the
distribution. Figure 10.13 is a boxplot of the platinum data.
10.6 Boxplots
403
150
Heat of sublimation (kcal/mole)
148
146
144
142
140
138
Upper quartile
136
134
Median
Lower quartile
132
F I G U R E 10.13 Boxplot of the platinum data.
We outline the construction of a boxplot:
1. Horizontal lines are drawn at the median and at the upper and lower quartiles
and are joined by vertical lines to produce the box.
2. A vertical line is drawn up from the upper quartile to the most extreme data point
that is within a distance of 1.5 (IQR) of the upper quartile. A similarly defined
vertical line is drawn down from the lower quartile. Short horizontal lines are
added to mark the ends of these vertical lines.
3. Each data point beyond the ends of the vertical lines is marked with an asterisk
or dot (* or ·).
Boxplots are not uniformly standardized, but the basic structure is as outlined
above, perhaps with additional embellishments or small variations. A boxplot thus
gives an indication of the center of the data (the median), the spread of the data
(the interquartile range), and the presence of outliers, and indicates the symmetry or
asymmetry of the distribution of data values (the location of the median relative to the
quartiles). In Figure 10.13, the five outliers of the platinum data are clearly displayed,
and we see an indication that the central part of the distribution is somewhat skewed
toward high values.
EXAMPLE A
Figure 10.14 is taken from Chambers et al. (1983). The data plotted are daily maximum
concentrations in parts per billion of sulfur dioxide in Bayonne, N.J., from November
1969 to October 1972 grouped by month. There are thus 36 batches, each of size
about 30. The investigators concluded:
The boxplots . . . show many properties of the data rather strikingly. There is
a general reduction in sulphur dioxide concentration through time due to the
gradual conversion to low sulphur fuels in the region. The decline is most
dramatic for the highest quantiles. Also, there are higher concentrations
404
Chapter 10
Summarizing Data
Sulfur dioxide concentration (ppb)
250
200
150
100
50
0
NDJFMAMJJASONDJFMAMJ JASONDJFMAMJ JASO
Month
F I G U R E 10.14 Boxplots of daily maximum concentrations of sulfur dioxide.
during the winter months due to the use of heating oil. In addition, the
boxplots show that the distributions are skewed toward high values and
that the spread of the distributions . . . is larger when the general level of
concentration is higher.
The boxplot is clearly a very effective method of presenting and summarizing
these data. As they are in this example, boxplots are generally useful for comparing
batches of numbers, a purpose to which they will be put in the next two chapters. ■
10.7 Exploring Relationships with Scatterplots
Many interesting questions in statistics involve trying to understand the relationships among variables. The scatterplot is a basic method for displaying the empirical
relationship between two variables based on a collection of pairs (xi , yi ): one merely
plots the points in the x y plane. This basic display can be augmented in various ways,
as we will illustrate with some examples.
EXAMPLE A
Allison and Cicchetti (1976) examined the relationships of possible correlates of sleep
behavior in mammals. Figure 10.15 is a scatterplot of total sleep versus brain weight.
Other than that two mammals with very large brains slept very little, no relationships
are apparent in the plot. There is in fact a relationship, but it is obscured in the plot
because brain weights vary over orders of magnitude—the brain of the lesser shorttailed shrew weighs 0.14 grams, and at the other extreme the brain of the African
elephant weighs 5,712 grams. It is thus much more informative to plot sleep versus
the logarithm of brain weight, and annotating the plot helps further—as shown in
Figure 10.16. It is now clear that mammals with heavier brains tend to sleep less.
405
10.7 Exploring Relationships with Scatterplots
20
Total sleep
15
10
5
0
1000
2000
3000
4000
Brain weight
5000
F I G U R E 10.15 Sleep versus brain weight for a collection of mammals.
20
Big brown bat
Little brown bat
Water opossum
N. American opossum
Giant armadillo
Ninebanded armadillo
Owl monkey
Arctic ground squirrel
Tree shrew
15
Cat
Golden hamster
Total sleep
Mouse
Ground squirrel
Rat
Tenrec
Arctic fox
Chinchilla
Musk shrew
10
Phanlanger
Gray wolf
Raccoon
Gorilla
Mountain beaver Slow loris
Patas monkey
European hedgehog
Jaguar
Galago
Vervet
Star nosed mole
Mole rat
Baboon Chimpanzee
Desert hedgehog
Red fox
Rhesus monkey
Lesser shorttailed shrew
African giant pouched rat
Echidna
Eastern American mole
Pig
Rabbit
Man
Guinea pig
Rock hyrax (Hetero. b)
Genet
Rock hyrax (Procavia hab)
5
Brazilian tapir
Gray seal
Tree hyrax
Sheep
Asian elephant
Cow
Goat
Donkey
Roe deer
0.01
1
100
Brain weight
F I G U R E 10.16 Sleep versus logarithm of brain weight.
African elephant
Horse
10000
406
Chapter 10
Summarizing Data
Data on these and other variables (how much do elephants dream?) can be found at
■
http://lib.stat.cmu.edu/datasets/sleep.
Correlation coefficients are often used as a simple numerical summary of the
strength of a relationship. The Pearson correlation coefficient corresponding to the
pairs (xi , yi ) is
r=
(xi − x̄)(yi − ȳ)
(xi − x̄)2
(yi − ȳ)2
This statistic measures the strength of a linear relationship. The correlation of brain
weight and sleep is −0.36, and the correlation between the logarithm of brain weight
and sleep is −0.56. These are different because a nonlinear transformation has been
applied and the correlation coefficient measures the strength of a linear relationship.
An alternative to the Pearson correlation coefficient is the rank correlation coefficient: the brain weights are replaced by their ordered ranks (1, 2, . . .), the sleeping times are replaced by their ranks, and then the Pearson correlation coefficient
of the pairs of ranks is computed. The rank correlation turns out to be −0.39 in
our example. Some advantages of the rank correlation coefficient are that it is insensitive to outliers and is invariant under any monotone transformation (thus the
rank correlation does not depend on whether brain weight or log brain weight is
used).
Arrays of scatterplots are useful for examining the relationships among more
than two variables, as illustrated in the following example.
EXAMPLE B
Inductive loop detectors are wire loops embedded in the pavement of roadways. They
operate by detecting the change in inductance caused by the metal in vehicles that
pass over them. During successive intervals of time, a detector reports the number
of passing vehicles, and the percentage of time that it was covered by a vehicle. The
number of vehicles is called flow, the percentage of coverage is called the occupancy.
Such detectors are widely used to measure freeway traffic but are subject to various
kinds of malfunction. Faulty detectors must be identified by traffic management centers. One key to detecting malfunction is knowing that measurements in the several
freeway lanes at a particular location should be highly related—the increases and
decreases of traffic flow in one lane should tend to be mirrored in other lanes. Figure 10.17 shows an array of scatterplots of occupancy measured by detectors in four
lanes at a particular location (Bickel et al. 2004). The detectors in lanes three and
four were closely related to each other at all times and were correlated with measurements in lanes one and two some, but not all, of the time. Apparently the detectors
in lanes 1 and 2 malfunctioned some of the time while this set of measurements was
■
taken.
10.8 Concluding Remarks
0.0 0.1 0.2 0.3 0.4 0.5 0.6
407
0.0 0.05 0.10 0.15 0.20 0.25
0.6
0.5
0.4
0.3
0.2
0.1
0.0
lane 1
0.6
0.5
0.4
0.3
0.2
0.1
0.0
lane 2
0.25
0.15
lane 3
0.05
0.0
0.25
0.20
0.15
lane 4
0.10
0.05
0.0
0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.0 0.05 0.10 0.15 0.20 0.25
F I G U R E 10.17 Occupancy measurements by adjacent loops in four lanes.
10.8 Concluding Remarks
This chapter introduced several tools for summarizing data, some of which are graphical in nature. Under the assumption of a stochastic model for the data, some aspects
of the sampling distributions of these summaries have been discussed. Summaries are
very important in practice; an intelligent summary of data is often sufficient to fulfill
the purposes for which the data were gathered, and more formal techniques such
as confidence intervals or hypothesis tests sometimes add little to an investigator’s
understanding. Effective summaries can also point to “bad” data or to unexpected
aspects of data that might have gone unnoticed if the data had been blindly crunched
by a computer.
We saw the bootstrap appear again as a method for approximating a sampling
distribution and functionals of it such as its standard deviation. The bootstrap, a
relatively recent development in statistical methodology, relies on the availability of
powerful and inexpensive computing resources. Our development of approximate
408
Chapter 10
Summarizing Data
confidence intervals based on the bootstrap followed that of Chapter 8, where we
motivated the construction by using the bootstrap distribution of θ ∗ − θ̂ to approximate
the distribution of θ̂ −θ0 . We note that another popular method, known as the bootstrap
percentile method, gives the interval (θ , θ̄ ) (see Example A of Section 8.5.3 for
definition of the notation). The rationale for this is harder to understand. More accurate
methods for constructing bootstrap confidence intervals have been proposed and are
under study, but we will not pursue these developments.
10.9 Problems
1. Plot the ecdf of this batch of numbers: 1, 14, 10, 9, 11, 9.
2. Suppose that X 1 , X 2 , . . . , X n are independent U [0, 1] random variables.
a. Sketch F(x) and the standard deviation of Fn (x).
b. Generate many samples of size 16 on a computer; for each sample, plot Fn (x)
and Fn (x) − F(x). Relate what you see to your answer to (a).
3. From Figure 10.1, roughly what are the upper and lower quartiles and the median
of the distribution of melting points?
4. In Section 10.2.1, it was claimed that the random variables I(−∞,x] (X i ) are independent. Why is this so?
5. Let X 1 , . . . , X n be a sample (i.i.d.) from a distribution function, F, and let Fn
denote the ecdf. Show that
1
Cov [Fn (u), Fn (v)] = [F(m) − F(u)F(v)]
n
where m = min(u, v). Conclude that Fn (u) and Fn (v) are positively correlated:
If Fn (u) overshoots F(u), then Fn (v) will tend to overshoot F(v).
6. Various chemical tests were conducted on beeswax by White, Riethof, and
Kushnir (1960). In particular, the percentage of hydrocarbons in each sample
of wax was determined.
a. Plot the ecdf, a histogram, and a normal probability plot of the percentages of
hydrocarbons given in the following table. Find the .90, .75, .50, .25, and .10
quantiles. Does the distribution appear Gaussian?
14.27
15.15
13.98
15.40
14.04
14.10
13.75
14.23
14.80
13.98
14.47
14.68
13.68
15.47
14.87
14.44
12.28
14.90
14.65
13.33
15.31
13.73
15.28
14.57
17.09
15.91
14.73
14.41
14.32
13.65
14.43
15.10
14.52
15.18
14.19
13.64
15.02
13.96
12.92
15.63
14.49
15.21
14.77
14.01
14.57
15.56
13.83
14.56
14.75
14.30
14.92
15.49
15.38
13.66
15.03
14.41
14.62
15.47
15.13
b. The average percentage of hydrocarbons in microcrystalline wax (a synthetic
commercial wax) is 85%. Suppose that beeswax was diluted with 1% micro-
10.9 Problems
409
crystalline wax. Could this be detected? What about a 3% or a 5% dilution?
(Such questions were one of the main concerns of the beeswax study.)
7. Compare group I to group V in Figure 10.2. Roughly, what are the differences in
lifetimes for the animals that are the 10% weakest, median, and 10% strongest?
8. Consider a sample of size 100 from an exponential distribution with parameter
λ = 1.
a. Sketch the approximate standard deviation of the empirical log survival function, log Sn (t), as a function of t.
b. Generate several such samples of size 100 on a computer and for each sample
plot the empirical log survival function. Relate the plots to your answer to (a).
9. Use the method of propagation of error to derive an approximation to the bias of
the log survival function. Where is this bias large, and what is its sign?
10. Let X 1 , . . . , X n be a sample from cdf F and denote the order statistics by
X (1) , X (2) , . . . , X (n) . We will assume that F is continuous, with density function f . From Theorem A in Section 3.7, the density function of X (k) is
f k (x) = n
n−1
[F(x)]k−1 [1 − F(x)]n−k f (x)
k−1
a. Find the mean and variance of X (k) from a uniform distribution on [0, 1]. You
will need to use the fact that the density of X (k) integrates to 1. Show that
Mean =
k
n+1
Variance =
1
n+2
k
n+1
1−
k
n+1
b. Find the approximate mean and variance of Y(k) , the kth-order statistic of a
sample of size n from F. To do this, let
X i = F(Yi )
or
Yi = F −1 (X i )
The X i are a sample from a U [0, 1] distribution (why?). Use the propagation
of error formula,
Y(k) = F −1 (X (k) )
k
≈ F −1
n+1
+
X (k) −
k
n+1
d −1 F (x) k/(n+1)
dx
410
Chapter 10
Summarizing Data
and argue that
EY(k) ≈ F −1
k
Var (Y(k) ) ≈
n+1
k
n+1
1−
1
−1
( f {F [k/(n + 1)]})2
k
n+1
1
n+2
c. Use the results of parts (a) and (b) to show that the variance of the pth sample
quantile is approximately
1
nf
2 (x
p)
p(1 − p)
where x p is the pth quantile.
d. Use the result of part (c) to find the approximate variance of the median of a
sample of size n from a N (μ, σ 2 ) distribution. Compare to the variance of the
sample mean.
11. Calculate the hazard function for
β
F(t) = 1 − e−αt ,
t ≥0
12. Let f denote the density function and h the hazard function of a nonnegative
random variable. Show that
f (t) = h(t)e
−
t
0
h(s)ds
that is, that the hazard function uniquely determines the density.
13. Give an example of a probability distribution with increasing failure rate.
14. Give an example of a probability distribution with decreasing failure rate.
15. A prisoner is told that he will be released at a time chosen uniformly at random
within the next 24 hours. Let T denote the time that he is released. What is the
hazard function for T ? For what values of t is it smallest and largest? If he has
been waiting for 5 hours, is it more likely that he will be released in the next few
minutes than if he has been waiting for 1 hour?
16. Suppose that F is N (0, 1) and G is N (1, 1). Sketch a Q-Q plot. Repeat for G
being N (1, 4).
17. Suppose that F is an exponential distribution with parameter λ = 1 and that
G is exponential with λ = 2. Sketch a Q-Q plot.
18. A certain chemotherapy treatment for cancer tends to lengthen the lifetimes of
very seriously ill patients and decrease the lifetimes of the least ill patients.
Suppose that an experiment is done that compares this treatment to a placebo.
Draw a sketch showing the qualitative behavior of a Q-Q plot.
19. Consider the two cdfs:
F(x) = x,
G(x) = x 2 ,
Sketch a Q-Q plot of F versus G.
0≤x ≤1
0≤x ≤1
10.9 Problems
411
20. Sketch what you would expect the qualitative shape of the hazard function of
human mortality to look like.
21. Make Q-Q plots for other pairs of treatment groups from Bjerkdahl’s data (see
Example A in Section 10.2.2). Does the model of a multiplicative effect appear
reasonable?
22. By examining the survival function of group V of Bjerkdahl’s data (see Example A in Section 10.2.2), make a rough sketch of the qualitative shape of a
histogram. Then make a histogram, and compare it to your guess.
23. In the examples of Q-Q plots in the text, we only discussed the case in which
quantiles of equal size batches are compared. From two batches of size n the
k/(n + 1) quantiles are estimated as X (k) and Y(k) , so one merely has to plot X (k)
vs. Y(k) . Write down a linear interpolation formula for the pth quantile where
k/(n + 1) ≤ p ≤ (k + 1)/(n + 1). Now suppose that the batch sizes are not the
same, being m and n, m < n say. A Q-Q plot may be constructed by fixing the
quantiles k/(m + 1) of the smaller data set and interpolating these quantiles for
the larger data set.
Interpolate to find the upper and lower quartiles of the following batch of
numbers: 1, 2, 3, 4, 5, 6.
24. Show that the probability plots discussed in Section 9.9 are Q-Q plots of the
empirical distribution Fn versus a theoretical distribution F.
25. In Section 10.2.3, it was claimed that if y p = cx p , then G( y) = F( y/c). Justify
this claim.
26. Hampson and Walker also made measurements of the heats of sublimation of
rhodium and iridium. Do the following calculations for each of the two given
sets of data:
a.
b.
c.
d.
e.
f.
g.
h.
i.
j.
k.
Make a histogram.
Make a stem-and-leaf plot.
Make a boxplot.
Plot the observations in the order of the experiment.
Does that statistical model of independent and identically distributed measurement errors seem reasonable?
Find the mean, 10% and 20% trimmed means, and median and compare them.
Find the standard error of the sample mean and a corresponding approximate
90% confidence interval.
Find a confidence interval based on the median that has as close to 90%
coverage as possible.
Use the bootstrap to approximate the sampling distributions of the 10% and
20% trimmed means and their standard errors and compare.
Use the bootstrap to approximate the sampling distribution of the median and
its standard error. Compare to the corresponding results for trimmed means
above.
Find approximate 90% confidence intervals based on the trimmed means and
compare to the intervals for the mean and median found previously.
412
Chapter 10
Summarizing Data
Iridium (kcal/mol)
136.6
160.4
160.0
145.2
161.1
160.2
151.5
160.6
160.1
162.7
160.2
160.0
159.1
159.5
159.7
159.8
160.3
159.5
160.8
159.2
159.5
173.9
159.3
159.6
160.1
159.6
159.5
Rhodium (kcal/mol)
126.4
133.3
131.3
134.2
132.6
135.7
132.5
131.2
133.8
133.3
132.9
133.0
132.1
133.3
133.5
131.5
133.0
131.1
133.5
133.5
131.1
132.4
131.4
133.4
132.3
131.1
131.6
131.2
133.5
132.7
131.9
132.6
131.1
133.0
132.9
132.7
132.2
131.1
132.8
134.1
27. Demographers often refer to the hazard function as the “age specific mortality
rate,” or death rate. Until recently, most researchers in the field of gerontology
thought that a death rate increasing with age was a universal fact in the biological
world. There has been heavy debate over whether there is a genetically programmed upper limit to lifespan. Using a facility in which sterilized medflies are
bred to be released to fight medfly infestations in California, James Carey and coworkers (Carey et al. 1992) bred more than a million medflies and recorded their
pattern of mortality. The data file medflies, contains the number of medflies
alive from an initial population of 1,203,646 as a function of age in days. Using
these data, estimate and plot the age specific mortality rate. Does it increase with
age?
28. For a sample of size n = 3 from a continuous probability distribution, what
is P(X (1) < η < X (2) ), where η is the median of the distribution? What is
P(X (1) < η < X (3) )?
29. Of the 26 measurements of the heat of sublimation of platinum, 5 are outliers
(see Figure 10.10). Let N denote the number of these outliers that occur in a
bootstrap sample (sample with replacement) of the 26 measurements.
a. Explain why the distribution of N is binomial.
b. Find P(N ≥ 10).
c. In 1000 bootstrap samples, how many would you expect to contain 10 or more
of these outliers?
d. What is the probability that a bootstrap sample is composed entirely of these
outliers?
30. In Example A of Section 10.4.6, a 90% bootstrap confidence interval based on
the trimmed mean was found to be (133.65, 135.58). Compare these values to the
list of data values given in Section 10.4.1 and observe that 133.65 is smaller than
the smallest observation. Explain why the bootstrap confidence interval extends
so far in this direction.
31. We have seen that the bootstrap entails sampling with replacement from the
original observations.
10.9 Problems
413
a. If the original sample is of size n, how many samples with replacement are
there?
b. Suppose for pedagogical purposes that n = 3 and we have the following
observations: 1, 3, 4. List all the possible samples with replacement.
c. Now suppose that we want to find the bootstrap distribution of the sample
mean. For each of the preceding samples, calculate the mean and use these
results to construct the bootstrap distribution of the sample mean.
d. Based on the bootstrap distribution, what is the standard error of the sample
mean? Compare this to the usual estimated standard error, s X .
32. Explain how the bootstrap could be used to approximate the sampling distribution
of the MAD.
33. Which of the following statistics can be made arbitrarily large by making one
number out of a batch of 100 numbers arbitrarily large: the mean, the median, the
10% trimmed mean, the standard deviation, the MAD, the interquartile range?
34. Show that the median is an M estimate if (x) = |x|. For what symmetric density
function is this the mle of the mean?
35. What proportion of the observations from a normal sample would you expect to
be marked by an asterisk on a boxplot?
36. Explain why the IQR and the MAD are divided by 1.35 and .675, respectively,
to estimate σ for a normal sample.
37. For the data of Problem 6:
a. Find the mean, median, and 10% and 20% trimmed means.
b. Find an approximate 90% confidence interval for the mean.
c. Find a confidence interval with coverage near 90% for the median.
d. Use the bootstrap to find approximate standard errors of the trimmed means.
e. Use the bootstrap to find approximate 90% confidence intervals for the trimmed
means.
f. Find and compare the standard deviation of the measurements, the interquartile
range, and the MAD.
g. Use the bootstrap to find the approximate sampling distribution and standard
error of the upper quartile.
38. The Cauchy distribution has the density function
1
f (x) =
π
1
1 + x2
,
−∞ < x < ∞
which is symmetric about zero. This distribution has very heavy tails, which
cause the arithmetic mean to be a very poor estimate of location. Simulate the
distribution of the arithmetic mean and of the median from a sample of size 25
from the Cauchy distribution by drawing 100 samples of size 25 and compare.
From Example B in Section 3.6.1, if Z 1 and Z 2 are independent and N (0, 1),
then their quotient follows a Cauchy distribution. (This gives a simple way of
generating Cauchy random variables.)
414
Chapter 10
Summarizing Data
39. Simiu and Filliben (1975), in a statistical analysis of extreme winds, analyzed the
data contained in the file windspeed 10.1. Construct boxplots to examine
and compare the forms of the distributions across cities and across years.
40. Olson, Simpson, and Eden (1975) discuss the analysis of data obtained from a
cloud seeding experiment. A cloud was deemed “seedable” if it satisfied certain
criteria; for each seedable cloud a decision was made at random whether to actually seed. The nonseeded clouds are referred to as control clouds. The following
table presents the rainfall from 26 seeded and 26 control clouds. Make Q-Q plots
for rainfall versus rainfall and log rainfall versus log rainfall. What do these plots
suggest about the effect, if any, of seeding?
Seeded Clouds
129.6
92.4
198.6
32.7
31.4
17.5
703.4
40.6
2745.6
200.7
1697.8
489.1
274.7
334.1
430.0
274.7
118.3
302.8
7.7
255.0
119.0
1656.0
115.3
0.01
28.6
29.0
17.3
830.1
163.0
4.1
978.0
242.5
Control Clouds
26.1
11.5
1202.6
147.8
26.3
321.2
36.6
21.7
87.0
68.5
4.9
95.0
81.2
4.9
372.4
47.3
41.1
24.4
345.5
244.3
Based on your results, how would you expect boxplots of precipitation from
seeded and unseeded clouds to compare? How would you expect boxplots of log
precipitation to compare? Make the boxplots and see whether your predictions
are confirmed.
41. Construct a nonparametric confidence interval for a quantile x p by using the same
reasoning as in the derivation of a confidence interval for a median.
42. In a study of the natural variability of rainfall, the rainfall of summer storms
was measured by a network of rain gauges in southern Illinois for the years
1960–1964 (Changnon and Huff, in LeCam and Neyman 1967). The average
amount of rainfall (in inches) from each storm, by year, is contained in the files,
Illinois60,...,Illinois64.
a. Is the form of the distribution of rainfall per storm skewed or symmetric?
b. What is the average rainfall per storm? What is the median rainfall per storm?
Explain why these measures differ, using the results of part (a).
c. You may have read statements like “10% of the storms account for 90% of
the rain.” Construct a graph that shows such a relationship for these data.
d. Compare the years using boxplots.
e. Which years were wet and which were dry? Are the wet years wet because
there were more storms, because individual storms produced more rain, or for
both of these reasons?
10.9 Problems
415
43. Barlow, Toland, and Freeman (1984) studied the lifetimes of Kevlar 49/epoxy
strands subjected to sustained stress. (The space shuttle uses Kevlar/epoxy spherical vessels in an environment of sustained pressure.) The files kevlar70,
kevlar80, and kevlar90 contain the times to failure (in hours) of strands
tested at 70%, 80%, and 90% stress levels. What do these data indicate about the
nature of the distribution of lifetimes and the effect of increasing stress?
44. Hopper and Seeman (1994) studied the relationship between bone density and
smoking among 41 pairs of middle-aged female twins. In each pair, one twin
was a lighter smoker and one a heavier smoker, as measured by pack-years, the
number of packages of cigarettes consumed in a year. Bone mineral density was
measured at the lumbar spine, the femoral (hip) neck, and the femoral shaft. As
well as smoking, other variables, such as alcohol consumption and tea and coffee
consumption, were recorded. The data are contained in the file bonden and documentation is in the file bonedendoc. Use graphical methods to compare bone
densities of the heavy and light smoking twins. Do any of the other variables bear
a relationship to bone density? After completing your analysis, you may wish to
compare your conclusions to those in the paper.
45. The 2000 U.S. Presidential election was very close and hotly contested. George
W. Bush was ultimately appointed to the Presidency by the U.S. Supreme Court.
Among the issues was a confusing ballot in Palm Beach County, Florida, the
so-called Butterfly Ballot, shown in the following figure.
ELECTORS
for
PRESIDENT
and
VICE PRESIDENT
(REPUBLICAN)
GEORGE W. BUSH-President
DICK CHENEY-Vice President
3
(DEMOCRATIC)
AL GORE-President
JOE LIBERMAN-Vice President
5
(LIBERTARIAN)
HARRY BROWNE-President
ART OLIVER-Vice President
7
4
(REFORM)
PAT BUCHANAN-President
EZOLA FOSTER-Vice President
6
(SOCIALIST)
DAVID McREYNOLDS-President
MARY CAL HOLLIS-Vice President
8
9
10
11
Notice that on this ballot, although the Democrats are listed in the second row
on the left, a voter wishing to specify them would have to punch the third
hole—punching the second hole would result in a vote for the Reform Party
(Pat Buchanan). After the election, many distraught Democratic voters claimed
that they had inadvertently voted for Buchanan, a right-wing candidate.
The file PalmBeach contains relevant data: vote counts by county in Florida
for Buchanan and for four other presidential candidates in 2000, the total vote
416
Chapter 10
Summarizing Data
counts in 2000, the presidential vote counts for three presidential candidates in
1996, the vote count for Buchanan in the 1996 Republican primary, the registration in Buchanan’s Reform Party, and the total registration in the county. Does
this data support voters’ claims that they were misled by the form of the ballot?
Start by making two scatterplots: a plot of Buchanan’s votes versus Bush’s votes
in 2000, and a plot of Buchanan’s votes in 2000 versus his votes in the 1996
primary.
46. The file bodytemp contains normal body temperature readings (degrees Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and 65 females
(coded by 2) from Shoemaker (1996).
a. For both males and females, make scatterplots of heart rate versus body temperature. Comment on the relationship or lack thereof.
b. Quantify the strengths of the relationships by calculating Pearson and rank
correlation coefficients.
c. Does the relationship for males appear to be the same as that for females? Examine this question graphically, by making a scatterplot showing both females
and males and identifying females and males by different plotting symbols.
47. Old Faithful geyser in Yellowstone National Park, Wyoming, derives its name
from the regularity of its eruptions. The file oldfaithful contains measurements on eight successive days of the durations of the eruptions (in minutes) and
the subsequent time interval before the next eruption.
a. Use histograms of durations and time intervals as well as other graphical
methods to examine the fidelity of Old Faithful, and summarize your findings.
b. Is there a relationship between the durations of eruptions and the time intervals
between them?
48. In 1970, Congress instituted a lottery for the military draft to support the unpopular war in Vietnam. All 366 possible birth dates were placed in plastic capsules
in a rotating drum and were selected one by one. Eligible males born on the first
day drawn were first in line to be drafted followed by those born on the second
day drawn, etc. The results were criticized by some who claimed that government
incompetence at running a fair lottery resulted in a tendency of men born later
in the year being more likely to be drafted. Indeed, later investigation revealed
that the birthdates were placed in the drum by month and were not thoroughly
mixed. The columns of the file 1970lottery are month, month number, day
of the year, and draft number.
a. Plot draft number versus day number. Do you see any trend?
b. Calculate the Pearson and rank correlation coefficients. What do they
suggest?
c. Is the correlation statistically significant? One way to assess this is via a
permutation test. Randomly permute the draft numbers and find the correlation of this random permutation with the day numbers. Do this 100 times
and see how many of the resulting correlation coefficients exceed the one
observed in the data. If you are not satisfied with 100 times, do it 1,000
times.
10.9 Problems
417
d. Make parallel boxplots of the draft numbers by month. Do you see any
pattern?
49. Olive oil from Spain, Tunisia, and other countries is imported into Italy and is
then repackaged and exported with the label “Imported from Italy.” Olive oils
from different places have distinctive tastes. Can the oils from different regions
and areas in Italy be distinguished based on their combinations of fatty acids?
This question was considered by Forina et al. (1983). The data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic,
linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive
oils. There are 9 collection areas, 4 from southern Italy (North and South Apulia,
Calabria, Sicily), two from Sardinia (Inland and Coastal), and 3 from northern
Italy (Umbria, East and West Liguria). The file olive contains the following
variables for each of the 572 samples:
•
•
•
•
•
•
•
•
•
•
Region: South, North, or Sardinia
Area (subregions within the larger regions): North and South Apulia, Calabria,
Sicily, Inland and Coastal Sardinia, Umbria, East and West Liguria
Palmitic Acid Percentage
Palmitoleic Acid Percentage
Stearic Acid Percentage
Oleic Acid Percentage
Linoleic Acid Percentage
Linolenic Acid Percentage
Arachidic Acid Percentage
Eicosenoic Acid Percentage
Examine this data with the aim of distinguishing between regions and areas
by using fatty acid composition.
a. Make a table of the mean and median values of percentages for each area,
grouping the areas within regions.
b. Complement the analysis by making parallel boxplots. Which variables look
promising for separating the regions?
c. It is possible that the regions can be more clearly separated by considering
pairs of variables. Use the variables that appear to be informative from the
analysis up to this point to make scatterplots. How well can the regions be
separated based on the scatterplots?
d. How well can the areas within regions be distinguished?
e. By interactively rotating point clouds, one can examine relationships among
more than two variables at a time. Try this with the software ggobi available
at http://www.ggobi.org/.
50. The file flow-occ contains data collected by loop detectors at a particular location of eastbound Interstate 80 in Sacramento, California, from March 14–20,
2003. (Source: http://pems.eecs.berkeley.edu/) For each of three
lanes, the flow (the number of cars) and the occupancy (the percentage of time
a car was over the loop) were recorded in successive five minute intervals. (See
Example B of Section 10.7 for background information.) There were 1740 such
418
Chapter 10
Summarizing Data
five-minute intervals. Lane 1 is the farthest left lane, lane 2 is in the center, and
lane 3 is the farthest right.
a. For each station, plot flow and occupancy versus time. Explain the patterns
you see. Can you deduce from the plots what the days of the week were?
b. Compare the flows in the three lanes by making parallel boxplots. Which lane
typically serves the most traffic?
c. Examine the relationships of the flows in the three lanes by making scatterplots.
Can you explain the patterns you see? Are statements of the form, “The flow
in lane 2 is typically about 50% higher than in lane 3,” accurate descriptions
of the relationships?
d. Occupancy can be viewed as a measure of congestion. Find the mean and
median occupancy in each of the three lanes. Do you think that the distributions
of occupancy are symmetric or skewed? Why?
e. Make histograms of the occupancies, varying the number of bins. What number of bins seems to give good representations for the shapes of the distributions? Are there any unusual features, and if so, how might they be explained?
f. Make plots to support or refute the statement, “When one lane is congested,
the others are, too.”
g. Flow can be regarded as a measure of the throughput of the system. How does
this throughput depend on congestion? Consider the following conjecture:
“When very few cars are on the road, flow is small and so is congestion.
Adding a few more cars may increase congestion but not enough so that
velocity is decreased, so flow will also increase. Beyond some point, increasing
occupancy (congestion) will decrease velocity, but since there will then be
more cars in total, flow will still continue to increase.” Does this seem plausible
to you? Plot flow versus occupancy for each of the three lanes. Does this
conjecture appear to be true? Can you explain what you see? Is the relationship
of flow to occupancy the same in all lanes?
h. This and the following exercises require the use of dynamic graphics, e.g.,
http://www.ggobi.org/. Make time series plots of all the variables.
Consider lane 1. Make a one-dimensional display of occupancy and vary the
smoothness until you can see some distinct modes. Use brushing to determine
when in the time series plots those modes occured. Do the same for flow and
then examine some other lanes.
i. Choose a lane and make one-dimensional displays for flow and occupancy
and a scatterplot of flow versus occupancy. Use brushing to simultaneously
identify regions in the three plots. Does what you see make sense?
j. From scatterplots of flow versus occupancy, examine when different regions
of this scatterplot occur in time. In particular, identify when in the time series
plots the flow breaks down because a critical point is reached.
k. You have now seen that all these variables, flow and occupancy in each of the
three lanes, are closely related, but because scatterplots are two-dimensional,
you have been able to examine only those relationships between pairs of variables. In these scatterplots, the points tend to lie along curves. What happens
in higher dimensions?
10.9 Problems
419
i. Examine the relationship of the three flows. In three dimensions, do the
points tend to lie along a curve (a one-dimensional object), or do they tend
to concentrate on a two-dimensional manifold, or are they scattered over
three dimensions?
ii. Examine the relationships of the three occupancies. In three dimensions,
do the points tend to lie along a curve (a one-dimensional object), or
do they tend to concentrate on a two-dimensional manifold, or are they
scattered over three dimensions?
iii. How do the points lie in six dimensions (three flows and three occupancies)? When do different regions occur in time?
l. A taxi driver claims that when traffic breaks down, the fast lane breaks down
first so he moves immediately to the right lane. Can you see any such phenomena in the data?
CHAPTER
11
Comparing Two Samples
11.1 Introduction
This chapter is concerned with methods for comparing samples from distributions
that may be different and especially with methods for making inferences about how
the distributions differ. In many applications, the samples are drawn under different
conditions, and inferences must be made about possible effects of these conditions.
We will be primarily concerned with effects that tend to increase or decrease the
average level of response.
For example, in the end-of-chapter problems, we will consider some experiments
performed to determine to what degree, if any, cloud seeding increases precipitation.
In cloud-seeding experiments, some storms are selected for seeding, other storms are
left unseeded, and the amount of precipitation from each storm is measured. This
amount varies widely from storm to storm, and in the face of this natural variability,
it is difficult to tell whether seeding has a systematic effect. The average precipitation
from the seeded storms might be slightly higher than that from the unseeded storms,
but a skeptic might not be convinced that the difference was due to anything but
chance. We will develop statistical methods to deal with this type of problem based
on a stochastic model that treats the amounts of precipitation as random variables.
We will also see how a process of randomization allows us to make inferences about
treatment effects even in the case where the observations are not modeled as samples
from populations or probability laws.
This chapter will be concerned with analyzing measurements that are continuous
in nature (such as temperature); Chapter 13 will take up the analysis of qualitative
data. This chapter will conclude with some general discussion of the design and
interpretation of experimental studies.
420
11.2 Comparing Two Independent Samples
421
11.2 Comparing Two Independent Samples
In many experiments, the two samples may be regarded as being independent of each
other. In a medical study, for example, a sample of subjects may be assigned to a
particular treatment, and another independent sample may be assigned to a control
(or placebo) treatment. This is often accomplished by randomly assigning individuals
to the placebo and treatment groups. In later sections, we will discuss methods that
are appropriate when there is some pairing, or dependence, between the samples, such
as might occur if each person receiving the treatment were paired with an individual
of similar weight in the control group.
Many experiments are such that if they were repeated, the measurements would
not be exactly the same. To deal with this problem, a statistical model is often employed: The observations from the control group are modeled as independent random
variables with a common distribution, F, and the observations from the treatment
group are modeled as being independent of each other and of the controls and as
having their own common distribution function, G. Analyzing the data thus entails
making inferences about the comparison of F and G. In many experiments, the primary effect of the treatment is to change the overall level of the responses, so that
analysis focuses on the difference of means or other location parameters of F and G.
When only a small amount of data is available, it may not be practical to do much
more than this.
11.2.1 Methods Based on the Normal Distribution
In this section, we will assume that a sample, X 1 , . . . , X n , is drawn from a normal distribution that has mean μ X and variance σ 2 , and that an independent sample,
Y1 , . . . , Ym , is drawn from another normal distribution that has mean μY and the same
variance, σ 2 . If we think of the X ’s as having received a treatment and the Y ’s as
being the control group, the effect of the treatment is characterized by the difference
μ X − μY . A natural estimate of μ X − μY is X − Y ; in fact, this is the mle. Since
X − Y may be expressed as a linear combination of independent normally distributed
random variables, it is normally distributed:
1
1
+
X − Y ∼ N μ X − μY , σ 2
n
m
If σ 2 were known, a confidence interval for μ X − μY could be based on
Z=
(X − Y ) − (μ X − μY )
.
σ n1 + m1
which follows a standard normal distribution. The confidence interval would be of
the form
*
1
1
+
(X − Y ) ± z(α/2)σ
n
m
This confidence interval is of the same form as those introduced in Chapters 7 and
8—a statistic (X − Y in this case) plus or minus a multiple of its standard deviation.
422
Chapter 11
Comparing Two Samples
Generally, σ 2 will not be known and must be estimated from the data by calculating the pooled sample variance,
s 2p =
(n − 1)s X2 + (m − 1)sY2
m+n−2
n
where s X2 = (n − 1) i=1
(X i − X )2 and similarly for sY2 . Note that s 2p is a weighted
average of the sample variances of the X ’s and Y ’s, with the weights proportional
to the degrees of freedom. This weighting is appropriate since if one sample is
much larger than the other, the estimate of σ 2 from that sample is more reliable
and should receive greater weight. The following theorem gives the distribution of a
statistic that will be used for forming confidence intervals and performing hypothesis
tests.
THEOREM A
Suppose that X 1 , . . . , X n are independent and normally distributed random variables with mean μ X and variance σ 2 , and that Y1 , . . . , Ym are independent and
normally distributed random variables with mean μY and variance σ 2 , and that
the Yi are independent of the X i . The statistic
(X − Y ) − (μ X − μY )
*
1
1
+
sp
n
m
follows a t distribution with m + n − 2 degrees of freedom.
t=
Proof
According to the definition of the t distribution in Section 6.2, we have to
show that the statistic is the quotient of a standard normal random variable and
the square root of an independent chi-square random variable divided by its
n + m − 2 degrees of freedom. First, we note from Theorem B in Section 6.3 that
(n − 1)s X2 /σ 2 and (m − 1)sY2 /σ 2 are distributed as chi-square random variables
with n − 1 and m − 1 degrees of freedom, respectively, and are independent since
the X i and Yi are. Their sum is thus chi-square with m + n − 2 df. Now, we
express the statistic as the ratio U/V , where
(X − Y ) − (μ X − μY )
*
1
1
+
σ
n
m
/
(m − 1)sY2
(n − 1)s X2
1
+
V =
σ2
σ2
m+n−2
U =
U follows a standard normal distribution and from the preceding argument V has
the distribution of the square root of a chi-square random variable divided by its
degrees of freedom. The independence of U and V follows from Corollary A in
■
Section 6.3.
11.2 Comparing Two Independent Samples
423
It is convenient and suggestive to use the following notation for the estimated
standard deviation (or standard error) of X − Y :
*
1
1
+
s X −Y = s p
n
m
A confidence interval for μ X − μY follows as a corollary to Theorem A.
COROLLARY A
Under the assumptions of Theorem A, a 100(1 − α)% confidence interval for
μ X − μY is
(X − Y ) ± tm+n−2 (α/2)s X −Y
EXAMPLE A
■
Two methods, A and B, were used in a determination of the latent heat of fusion of
ice (Natrella 1963). The investigators wanted to find out by how much the methods
differed. The following table gives the change in total heat from ice at −.72◦ C to
water 0◦ C in calories per gram of mass:
Method A
Method B
79.98
80.04
80.02
80.04
80.03
80.03
80.04
79.97
80.05
80.03
80.02
80.00
80.02
80.02
79.94
79.98
79.97
79.97
80.03
79.95
79.97
It is fairly obvious from the table and from boxplots (Figure 11.1) that there is a
difference between the two methods (we will test this more formally later). If we
assume the conditions of Theorem A, we can form a 95% confidence interval to
estimate the magnitude of the average difference between the two methods. From the
table, we calculate
X A = 80.02
Sa = .024
X B = 79.98
Sb = .031
12 × Sa2 + 7 × Sb2
= .0007178
19
s p = .027
s 2p =
Chapter 11
Comparing Two Samples
80.06
80.04
Heat of fusion of ice (cal/g)
424
80.02
80.00
79.98
79.96
79.94
Method A
Method B
F I G U R E 11.1 Boxplots of measurements of heat of fusion obtained by methods A
and B.
Our estimate of the average difference of the two methods is X A − X B = .04 and its
estimated standard error is
*
1
1
+
s X A −X B = s p
13 8
= .012
From Table 4 of Appendix B, the .975 quantile of the t distribution with 19 df
is 2.093, so t19 (.025) = 2.093 and the 95% confidence interval is (X A − X B ) ±
t19 (.025)s X A −X B , or (.015, .065). The estimated standard error and the confidence
interval quantify the uncertainty in the point estimate X A − X B = .04.
■
We will now discuss hypothesis testing for the two-sample problem. Although
the hypotheses under consideration are different from those of Chapter 9, the general
conceptual framework is the same (you should review that framework at this time).
In the current case, the null hypothesis to be tested is
H0 : μ X = μY
This asserts that there is no difference between the distributions of the X ’s and Y ’s. If
one group is a treatment group and the other a control, for example, this hypothesis
asserts that there is no treatment effect. In order to conclude that there is a treatment
effect, the null hypothesis must be rejected.
There are three common alternative hypotheses for the two-sample case:
H1 : μ X = μY
H2 : μ X > μY
H3 : μ X < μY
11.2 Comparing Two Independent Samples
425
The first of these is a two-sided alternative, and the other two are one-sided
alternatives. The first hypothesis is appropriate if deviations could in principle go
in either direction, and one of the latter two is appropriate if it is believed that any
deviation must be in one direction or the other. In practice, such a priori information is not usually available, and it is more prudent to conduct two-sided tests, as in
Example A.
The test statistic that will be used to make a decision whether or not to reject the
null hypothesis is
t=
X −Y
s X −Y
The t-statistic equals the multiple of its estimated standard deviation that X − Y
differs from zero. It plays the same role in the comparison of two samples as is
played by the chi-square statistic in testing goodness of fit. Just as we rejected for
large values of the chi-square statistic, we will reject in this case for extreme values
of t. The distribution of t under H0 , its null distribution, is, from Theorem A, the t
distribution with m + n − 2 degrees of freedom. Knowing this null distribution allows
us to determine a rejection region for a test at level α, just as knowing that the null
distribution of the chi-square statistic was chi-square with the appropriate degrees of
freedom allowed the determination of a rejection region for testing goodness of fit.
The rejection regions for the three alternatives just listed are
For H1 , |t| > tn+m−2 (α/2)
For H2 , t > tn+m−2 (α)
For H3 , t < −tn+m−2 (α)
Note how the rejection regions are tailored to the particular alternatives and how
knowing the null distribution of t allows us to determine the rejection region for any
value of α.
EXAMPLE B
Let us continue Example A. To test H0 : μ A = μ B versus a two-sided alternative, we
form and calculate the following test statistic:
XA − XB
*
1
1
+
sp
n
m
= 3.33
t=
From Table 4 in Appendix B, t19 (.005) = 2.861 < 3.33. The two-sided test would
thus reject at the level α = .01. If there were no difference in the two conditions,
differences as large or larger than that observed would occur only with probability
less than .01—that is, the p-value is less than .01. There is little doubt that there is a
difference between the two methods.
■
In Chapter 9, we developed a general duality between hypothesis tests and confidence intervals. In the case of the testing and confidence interval methods considered
426
Chapter 11
Comparing Two Samples
in this section, the t test rejects if and only if the confidence interval does not include
zero (see Problem 10 at the end of this chapter).
We will now demonstrate that the test of H0 versus H1 is equivalent to a likelihood
ratio test. (The rather long argument is sketched here and should be read with paper
and pencil in hand.) is the set of all possible parameter values:
= {−∞ < μ X < ∞, −∞ < μY < ∞, 0 < σ < ∞}
The unknown parameters are θ = (μ X , μY , σ ). Under H0 , θ ∈ ω0 , where ω0 =
{μ X = μY , 0 < σ < ∞}. The likelihood of the two samples X 1 , . . . , X n and
Y1 , . . . , Ym is
m
n
1
1
2
2
2
2
√
√
lik μ X , μY , σ 2 =
e−(1/2)[(X i −μ X ) /σ ]
e−(1/2)[(Y j −μY ) /σ ]
2
2
2πσ
2π σ
i=1
j=1
and the log likelihood is
(m + n)
(m + n)
log 2π −
log σ 2
l μ X , μY , σ 2 = −
2
2
n
m
1 − 2
(X i − μ X )2 +
(Y j − μY )2
2σ
i=1
j=1
We must maximize the likelihood under ω0 and under and then calculate the ratio
of the two maximized likelihoods, or the difference of their logarithms.
Under ω0 , we have a sample of size m + n from a normal distribution with
unknown mean μ0 and unknown variance σ02 . The mle’s of μ0 and σ02 are thus
n
m
1
Xi +
Yj
μ̂0 =
m + n i=1
j=1
n
m
1
2
2
2
σ̂0 =
(X i − μ̂0 ) +
(Y j − μ̂0 )
m + n i=1
j=1
The corresponding value of the maximized log likelihood is, after some cancellation,
m+n
m+n
m+n
log 2π −
log σ̂02 −
l μ̂0 , σ̂02 = −
2
2
2
To find the mle’s μ̂ X , μ̂Y , and σ̂12 under , we first differentiate the log likelihood and
obtain the equations
n
(X i − μ̂ X ) = 0
i=1
m
(Y j − μ̂Y ) = 0
j=1
n
m
m+n
1 2
2
−
+
(X i − μ̂ X ) +
(Y j − μ̂Y ) = 0
2σ̂12
2σ̂14 i=1
j=1
11.2 Comparing Two Independent Samples
427
The mle’s are, therefore,
μ̂ X = X
μ̂Y = Y
n
m
1
(X i − μ̂ X )2 +
(Y j − μ̂Y )2
σ̂12 =
m + n i=1
j=1
When these are substituted into the log likelihood, we obtain
m+n
m+n
m+n
log 2π −
log σ̂12 −
l μ̂ X , μ̂Y , σ̂12 = −
2
2
2
The log of the likelihood ratio is thus
m+n
log
2
σ̂12
σ̂02
and the likelihood ratio test rejects for large values of
n
σ̂02
σ̂12
=
(X i − μ̂0 )2 +
i=1
n
(X i −
X )2
i=1
+
m
(Y j − μ̂0 )2
j=1
m
(Y j − Y )2
j=1
We now find an alternative expression for the numerator of this ratio, by using
the identities
n
(X i − μ̂0 )2 =
i=1
m
n
(X i − X )2 + n(X − μ̂0 )2
i=1
(Y j − μ̂0 )2 =
j=1
m
(Y j − Y )2 + m(Y − μ̂0 )2
j=1
We obtain
1
(n X + mY )
m+n
n
m
=
X+
Y
m+n
m+n
μ̂0 =
Therefore,
X − μ̂0 =
m(X − Y )
m+n
Y − μ̂0 =
n(Y − X )
m+n
The alternative expression for the numerator of the ratio is thus
n
i=1
(X i − X )2 +
m
j=1
(Y j − Y )2 +
mn
(X − Y )2
m+n
428
Chapter 11
Comparing Two Samples
and the test rejects for large values of
⎛
1+
mn ⎜
⎜
m+n ⎝
⎞
(X − Y )2
n
(X i − X )2 +
i=1
m
(Y j − Y )2
⎟
⎟
⎠
j=1
or, equivalently, for large values of
/
|X − Y |
n
(X i − X )2 +
i=1
m
(Y j − Y )2
j=1
which is the t statistic apart from constants that do not depend on the data. Thus, the
likelihood ratio test is equivalent to the t test, as claimed.
We have used the assumption that the two populations have the same variance.
If the two variances are not assumed to be equal, a natural estimate of Var(X − Y ) is
s2
s X2
+ Y
n
m
If this estimate is used in the denominator of the t statistic, the distribution of that
statistic is no longer the t distribution. But it has been shown that its distribution can
be closely approximated by the t distribution with degrees of freedom calculated in
the following way and then rounded to the nearest integer:
df =
EXAMPLE C
[(s X2 /n) + (sY2 /m)]2
(s 2 /m)2
(s X2 /n)2
+ Y
n−1
m−1
Let us rework Example B, but without the assumption that the variances are equal.
Using the preceding formula, we find the degrees of freedom to be 12 rather than 19.
The t statistic is 3.12. Since the .995 quantile of the t distribution with 12 df is 3.055
(Table 4 of Appendix B), the test still rejects at level α = .01.
■
If the underlying distributions are not normal and the sample sizes are large, the
use of the t distribution or the normal distribution is justified by the central limit
theorem, and the probability levels of confidence intervals and hypothesis tests are
approximately valid. In such a case, however, there is little difference between the t
and normal distributions. If the sample sizes are small, however, and the distributions
are not normal, conclusions based on the assumption of normality may not be valid.
Unfortunately, if the sample sizes are small, the assumption of normality cannot be
tested effectively unless the deviation is quite gross, as we saw in Chapter 9.
11.2.1.1 An Example—A Study of Iron Retention An experiment was performed to determine whether two forms of iron (Fe2+ and Fe3+ ) are retained differently. (If one form of iron were retained especially well, it would be the better
dietary supplement.) The investigators divided 108 mice randomly into 6 groups of
18 each; 3 groups were given Fe2+ in three different concentrations, 10.2, 1.2, and
11.2 Comparing Two Independent Samples
429
.3 millimolar, and 3 groups were given Fe3+ at the same three concentrations. The
mice were given the iron orally; the iron was radioactively labeled so that a counter
could be used to measure the initial amount given. At a later time, another count was
taken for each mouse, and the percentage of iron retained was calculated. The data for
the two forms of iron are listed in the following table. We will look at the data for the
concentration 1.2 millimolar. (In Chapter 12, we will discuss methods for analyzing
all the groups simultaneously.)
Fe3+
Fe2+
10.2
1.2
.3
10.2
1.2
.3
.71
1.66
2.01
2.16
2.42
2.42
2.56
2.60
3.31
3.64
3.74
3.74
4.39
4.50
5.07
5.26
8.15
8.24
2.20
2.93
3.08
3.49
4.11
4.95
5.16
5.54
5.68
6.25
7.25
7.90
8.85
11.96
15.54
15.89
18.3
18.59
2.25
3.93
5.08
5.82
5.84
6.89
8.50
8.56
9.44
10.52
13.46
13.57
14.76
16.41
16.96
17.56
22.82
29.13
2.20
2.69
3.54
3.75
3.83
4.08
4.27
4.53
5.32
6.18
6.22
6.33
6.97
6.97
7.52
8.36
11.65
12.45
4.04
4.16
4.42
4.93
5.49
5.77
5.86
6.28
6.97
7.06
7.78
9.23
9.34
9.91
13.46
18.4
23.89
26.39
2.71
5.43
6.38
6.38
8.32
9.04
9.56
10.01
10.08
10.62
13.80
15.99
17.90
18.25
19.32
19.87
21.60
22.25
As a summary of the data, boxplots (Figure 11.2) show that the data are quite skewed to
the right. This is not uncommon with percentages or other variables that are bounded
below by zero. Three observations from the Fe2+ group are flagged as possible outliers.
The median of the Fe2+ group is slightly larger than the median of the Fe3+ groups,
but the two distributions overlap substantially.
Another view of these data is provided by normal probability plots (Figure 11.3).
These plots also indicate the skewness of the distributions. We should obviously
doubt the validity of using normal distribution theory (for example, the t test) for this
problem even though the combined sample size is fairly large (36).
The mean and standard deviation of the Fe2+ group are 9.63 and 6.69; for the
3+
Fe group, the mean is 8.20 and the standard deviation is 5.45. To test the hypothesis
that the two means are equal, we can use a t test without assuming that the population
standard deviations are equal. The approximate degrees of freedom, calculated as
described at the end of Section 11.2.1, are 32. The t statistic is .702, which corresponds
to a p-value of .49 for a two-sided test; if the two populations had the same mean,
values of the t statistic this large or larger would occur 49% of the time. There is thus
insufficient evidence to reject the null hypothesis. A 95% confidence interval for the
Comparing Two Samples
30
25
Percentage retained
Chapter 11
20
15
10
5
0
Fe2+
Fe3+
F I G U R E 11.2 Boxplots of the percentages of iron retained for the two forms.
30
2
25
Ordered observations
430
2
20
2
3
2
15
3
3
3
3
10
5
2
3
2
3
23
2 3
2
2 3
22 3
2 2 3 33
3
2
2
3
3 3
2
0
2.0 1.5 1.0 .5
.5
0
Normal quantiles
1.0
1.5
2.0
F I G U R E 11.3 Normal probability plots of iron retention data.
difference of the two population means is (−2.7, 5.6). But the t test assumes that the
underlying populations are normally distributed, and we have seen there is reason to
doubt this assumption.
It is sometimes advocated that skewed data be transformed to a more symmetric
shape before normal theory is applied. Transformations such as taking the log or
11.2 Comparing Two Independent Samples
431
the square root can be effective in symmetrizing skewed distributions because they
spread out small values and compress large ones. Figures 11.4 and 11.5 show boxplots
and normal probability plots for the natural logs of the iron retention data we have
been considering. The transformation was fairly successful in symmetrizing these
distributions, and the probability plots are more linear than those in Figure 11.3,
although some curvature is still evident.
Natural log of percentage retained
3.5
3.0
2.5
2.0
1.5
1.0
.5
Fe2+
Fe3+
F I G U R E 11.4 Boxplots of natural logs of percentages of iron retained.
3.5
2
Ordered observations
3.0
2.5
2.0
1.5
2
1.0
3
2
2
23
23
2 23
2 3
2 33
2
2 3
3
2
2
3
2
3
3 3
2
3 3
2
3
2
3
3
.5
2.0 1.5 1.0 .5
0
.5
Normal quantiles
1.0
1.5
2.0
F I G U R E 11.5 Normal probability plots of natural logs of iron retention data.
432
Chapter 11
Comparing Two Samples
The following model is natural for the log transformation:
i = 1, . . . , n
X i = μ X (1 + εi ),
j = 1, . . . , m
Y j = μY (1 + δ j ),
log X i = log μ X + log(1 + εi )
log Y j = log μY + log(1 + δ j )
Here the εi and δ j are independent random variables with mean zero. This model
implies that if the variances of the errors are σ 2 , then
E(X i ) = μ X
E(Y j ) = μY
σX = μX σ
σY = μY σ
or that
σY
σX
=
μX
μY
If the εi and δ j have the same distribution, Var(log X ) = Var(log Y ). The ratio of the
standard deviation of a distribution to the mean is called the coefficient of variation
(CV); it expresses the standard deviation as a fraction of the mean. Coefficients of
variation are sometimes expressed as percentages. For the iron retention data we have
been considering, the CV’s are .69 and .67 for the Fe2+ and Fe3+ groups; these values
are quite close. These data are quite “noisy”—the standard deviation is nearly 70%
of the mean for both groups.
For the transformed iron retention data, the means and standard deviations are
given in the following table:
Mean
Standard Deviation
Fe2+
Fe3+
2.09
.659
1.90
.574
For the transformed data, the t statistic is .917, which gives a p-value of .37.
Again, there is no reason to reject the null hypothesis. A 95% confidence interval is
(−.61, .23). Using the preceding model, this is a confidence interval for
μX
log μ X − log μY = log
μY
The interval is
−.61 ≤ log
or
μX
μY
≤ .23
μX
≤ 1.26
μY
Other transformations, such as raising all values to some power, are sometimes
used. Attitudes toward the use of transformations vary: Some view them as a very
.54 ≤
11.2 Comparing Two Independent Samples
433
useful tool in statistics and data analysis, and others regard them as questionable
manipulation of the data.
11.2.2 Power
Calculations of power are an important part of planning experiments in order to
determine how large sample sizes should be. The power of a test is the probability
of rejecting the null hypothesis when it is false. The power of the two-sample t test
depends on four factors:
1. The real difference, = |μ X − μY |. The larger this difference, the greater the
power.
2. The significance level α at which the test is done. The larger the significance level,
the more powerful the test.
3. The population standard deviation σ , which is the amplitude of the “noise” that
hides the “signal.” The smaller the standard deviation, the larger the power.
4. The sample sizes n and m. The larger the sample sizes, the greater the power.
Before continuing, you should try to understand intuitively why these statements are
true. We will express them quantitatively below.
The necessary sample sizes can be determined from the significance level of the
test, the standard deviation, and the desired power against an alternative hypothesis,
H1 : μ X − μY = To calculate the power of a t test exactly, special tables of the noncentral t
distribution are required. But if the sample sizes are reasonably large, one can perform
approximate power calculations based on the normal distribution, as we will now
demonstrate.
Suppose that σ , α, and are given and that the samples are both of size n. Then
Var(X − Y ) = σ
=
2
1
1
+
n
n
2σ 2
n
The test at level α of H0 : μ X = μY against the alternative H1 : μ X = μY is based on
the test statistic
Z=
X −Y
√
σ 2/n
The rejection region for this test is |Z | > z(α/2), or
*
|X − Y | > z(α/2)σ
2
n
434
Chapter 11
Comparing Two Samples
The power of the test if μ X − μY = is the probability that the test statistic falls in
the rejection region, or
* 2
P |X − Y | > z(α/2)σ
n
* * 2
2
+ P X − Y < −z(α/2)σ
= P X − Y > z(α/2)σ
n
n
since the two events are mutually exclusive. Both probabilities on the right-hand side
are calculated by standardizing. For the first one, we have
* √
2
z(α/2)σ 2/n − (X − Y ) − √
√
P X − Y > z(α/2)σ
>
=P
n
σ 2/n
σ 2/n
* n
= 1 − z(α/2) −
σ 2
where is the standard normal cdf. Similarly, the second probability is
* n
− z(α/2) −
σ 2
Thus, the probability that the test statistic falls in the rejection region is equal to
* * n
n
1 − z(α/2) −
+ − z(α/2) −
σ 2
σ 2
Typically, as moves away from zero, one of these terms will be negligible with
respect to the other. For example, if is greater than zero, the first term will be
dominant. For fixed n, this expression can be evaluated as a function of ; or for
fixed , it can be evaluated as a function of n.
EXAMPLE A
As an example, let us consider a situation similar to an idealized form of the iron
retention experiment. Assume that we have samples of size 18 from two normal
distributions whose standard deviations are both 5, and we calculate the power for
various values of when the null hypothesis is tested at a significance level of .05.
The results of the calculations are displayed in Figure 11.6. We see from the plot that
if the mean difference in retention is only 1%, the probability of rejecting the null
hypothesis is quite small, only 9%. A mean difference of 5% in retention rate gives
a more satisfactory power of 85%.
Suppose that we wanted to be able to detect a difference of = 1 with probability
.9. What sample size would be necessary? Using only the dominant term in the
expression for the power, the sample size should be such that
* n
= .1
1.96 −
σ 2
11.2 Comparing Two Independent Samples
435
1.0
Power
.8
.6
.4
.2
0
0
2
4
6
8
F I G U R E 11.6 Plot of power versus .
From the tables for the normal distribution, .1 = (−1.28), so
*
n
= −1.28
1.96 −
σ 2
Solving for n, we find that the necessary sample size would be 525! This is clearly
unfeasible; if in fact the experimenters wanted to detect such a difference, some
modification of the experimental technique to reduce σ would be necessary.
■
11.2.3 A Nonparametric Method—The Mann-Whitney Test
Nonparametric methods do not assume that the data follow any particular distributional form. Many of them are based on replacement of the data by ranks. With this
replacement, the results are invariant under any monotonic transformation; in comparison, we saw that the p-value of a t test may change if the log of the measurements
is analyzed rather than the measurements on the original scale. Replacing the data by
ranks also has the effect of moderating the influence of outliers.
For purposes of discussion, we will develop the Mann-Whitney test (also sometimes called the Wilcoxon rank sum test) in a specific context. Suppose that we have
m + n experimental units to assign to a treatment group and a control group. The
assignment is made at random: n units are randomly chosen and assigned to
the control, and the remaining m units are assigned to the treatment. We are interested in testing the null hypothesis that the treatment has no effect. If the null
hypothesis is true, then any difference in the outcomes under the two conditions is
due to the randomization.
436
Chapter 11
Comparing Two Samples
A test statistic is calculated in the following way. First, we group all m + n
observations together and rank them in order of increasing size (we will assume for
simplicity that there are no ties, although the argument holds even in the presence
of ties). We next calculate the sum of the ranks of those observations that came
from the control group. If this sum is too small or too large, we will reject the null
hypothesis.
It is easiest to see how the procedure works by considering a very small example.
Suppose that a treatment and a control are to be compared: Of four subjects, two are
randomly assigned to the treatment and the other two to the control, and the following
responses are observed (the ranks of the observations are shown in parentheses):
Treatment
Control
1 (1)
3 (2)
6 (4)
4 (3)
The sum of the ranks of the control group is R = 7, and the sum of the ranks
of the treatment group is 3. Does this discrepancy provide convincing evidence of a
systematic difference between treatment and control, or could it be just due to chance?
To answer this question, we calculate the probability of such a discrepancy if the
treatment had no effect at all, so that the difference was entirely due to the particular
randomization—this is the null hypothesis. The key idea of the Mann-Whitney test is
that we can explicitly calculate the distribution of R under the null hypothesis, since
under this hypothesis every assignment of ranks to observations is equally likely
and
we can enumerate all 4! = 24 such assignments. In particular, each of the 72 = 6
assignments of ranks to the control group shown in the following table is equally
likely:
Ranks
R
{1, 2}
{1, 3}
{1, 4}
{2, 3}
{2, 4}
{3, 4}
3
4
5
5
6
7
From this table, we see that under the null hypothesis, the distribution of R (its null
distribution) is:
r
3
4
5
6
7
P(R = r )
1
6
1
6
1
3
1
6
1
6
In particular, P(R = 7) = 16 , so this discrepancy would occur one time out of six
purely on the basis of chance.
11.2 Comparing Two Independent Samples
437
The small example of the previous paragraph has been laid out for pedagogical reasons, the point being that we could in principle go through similar calculations for any sample sizes m and n. Suppose that there are n observations in the
treatment group and m in the control group. If the null hypothesis holds, every assignment
of ranks to the m + n observations is equally likely, and hence each of
possible assignments of ranks to the control group is equally likely. For
the m+n
m
each of these assignments, we can calculate the sum of the ranks and thus determine the null distribution of the test statistic—the sum of the ranks of the control
group.
It is important to note that we have not made any assumption that the observations
from the control and treatment groups are samples from a probability distribution.
Probability has entered in only as a result of the random assignment of experimental
units to treatment and control groups (this is similar to the way that probability
enters into survey sampling). We should also note that, although we chose the sum
of control ranks as the test statistic, any other test statistic could have been used and
its null distribution computed in the same fashion. The rank sum is easy to compute
and is sensitive to a treatment effect that tends to make responses larger or smaller.
Also, its null distribution has to be computed only once and tabled; if we worked with
the actual numerical values, the null distribution would depend on those particular
values.
Tables of the null distribution of the rank sum are widely available and vary in
format. Note that because the sum of the two rank sums is the sum of the integers
from 1 to m + n, which is [(m + n)(m + n + 1)/2], knowing one rank sum tells us
the other. Some tables are given in terms of the rank sum of the smaller of the two
groups, and some are in terms of the smaller of the two rank sums (the advantage of
the latter scheme is that only one tail of the distribution has to be tabled). Table 8 of
Appendix B makes use of additional symmetries. Let n 1 be the smaller sample size
and let R be the sum of the ranks from that sample. Let R = n 1 (m + n + 1) − R
and R ∗ = min(R, R ). The table gives critical values for R ∗ . (Fortunately, such fussy
tables are largely obsolete with the increasing use of computers.)
When it is more appropriate to model the control values, X 1 , . . . , X n , as a sample
from some probability distribution F and the experimental values, Y1 , . . . , Ym , as a
sample from some distribution G, the Mann-Whitney test is a test of the null hypothesis
H0 : F = G. The reasoning is exactly the same: Under H0 , any assignment of ranks
to the pooled m + n observations is equally likely, etc.
We have assumed here that there are no ties among the observations. If there
are only a small number of ties, tied observations are assigned average ranks (the
average of the ranks for which they are tied); the significance levels are not greatly
affected.
EXAMPLE A
Let us illustrate the Mann-Whitney test by referring to the data on latent heats of fusion
of ice considered earlier (Example A in Section 11.2.1). The sample sizes are fairly
small (13 and 8), so in the absence of any prior knowledge concerning the adequacy
of the assumption of a normal distribution, it would seem safer to use a nonparametric
438
Chapter 11
Comparing Two Samples
method. The following table exhibits the ranks given to the measurements for each
method (refer to Example A in Section 11.2.1 for the original data):
Method A
Method B
7.5
19.0
11.5
19.0
15.5
15.5
19.0
4.5
21.0
15.5
11.5
9.0
11.5
11.5
1.0
7.5
4.5
4.5
15.5
2.0
4.5
Note how the ties were handled. For example, the four observations with the value
79.97 tied for ranks 3, 4, 5, and 6 were each assigned the rank of 4.5 = (3 + 4 +
5 + 6)/4.
Table 8 of Appendix B is used as follows. The sum of the ranks of the smaller
sample is R = 51.
R = 8(8 + 13 + 1) − R
= 125
Thus, R ∗ = 51. From the table, 53 is the critical value for a two-tailed test with
α = .01, and 60 is the critical value for α = .05. The Mann-Whitney test thus rejects
■
at the .01 significance level.
Let TY denote the sum of the ranks of Y1 , Y2 , . . . , Ym . Using results from Chapter 7, we can easily find E(TY ) and Var(TY ) under the null hypothesis F = G.
THEOREM A
If F = G,
m(m + n + 1)
2
mn(m + n + 1)
Var(TY ) =
12
E(TY ) =
11.2 Comparing Two Independent Samples
439
Proof
Under the null hypothesis, TY is the sum of a random sample of size m drawn
without replacement from a population consisting of the integers {1, 2, . . . ,
m + n}. TY thus equals m times the average of such a sample. From Theorems A
and B of Section 7.3.1,
E(TY ) = mμ
Var(TY ) = mσ 2
N −m
N −1
where N = m + n is the size of the population, and μ and σ 2 are the population
mean and variance. Now, using the identities
N
k=
k=1
N
k=1
k2 =
N (N + 1)
2
N (N + 1)(2N + 1)
6
we find that for the population {1, 2, . . . , m + n}
N +1
μ=
2
2
−1
N
σ2 =
12
The result then follows after algebraic simplification.
■
Unlike the t test, the Mann-Whitney test does not depend on an assumption of
normality. Since the actual numerical values are replaced by their ranks, the test is
insensitive to outliers, whereas the t test is sensitive. It has been shown that even
when the assumption of normality holds, the Mann-Whitney test is nearly as powerful as the t test and it is thus generally preferable, especially for small sample
sizes.
The Mann-Whitney test can also be derived starting from a different point of
view. Suppose that the X ’s are a sample from F and the Y ’s a sample from G, and
consider estimating, as a measure of the effect of the treatment,
π = P(X < Y )
where X and Y are independently distributed with distribution functions F and G,
respectively. The value π is the probability that an observation from the distribution
F is smaller than an independent observation from the distribution G.
If, for example, F and G represent lifetimes of components that have been manufactured according to two different conditions, π is the probability that a component
of one type will last longer than a component of the other type. An estimate of π can
be obtained by comparing all n values of X to all m values of Y and calculating the
440
Chapter 11
Comparing Two Samples
proportion of the comparisons for which X was less than Y :
π̂ =
n
m
1 Zi j
mn i=1 j=1
where
if X i < Y j
otherwise
1,
0,
Zi j =
To see the relationship of π̂ to the rank sum introduced earlier, we will find it convenient to work with
Vi j =
if X (i) < Y( j)
otherwise
1,
0,
Clearly,
m
n Zi j =
i=1 j=1
m
n Vi j
i=1 j=1
since the Vi j are just a reordering of the Z i j . Also,
n m
i=1 j=1
Vi j = (number of X ’s that are less than Y(1) )
+ (number of X ’s that are less than Y(2) )
+ · · · + (number of X ’s that are less than Y(m) )
If the rank of Y(k) in the combined sample is denoted by R yk , then the number of X ’s
less than Y(1) is R y1 − 1, the number of X ’s less than Y(2) is R y2 − 2, etc. Therefore,
n m
Vi j = (R y1 − 1) + (R y2 − 2) + · · · + (R ym − m)
i=1 j=1
=
m
R yi −
i=1
=
m
m
i
i=1
R yi −
i=1
= Ty −
m(m + 1)
2
m(m + 1)
2
Thus, π̂ may be expressed in terms of the rank sum of the Y ’s (or in terms of the rank
sum of the X ’s, since the two rank sums add up to a constant).
11.2 Comparing Two Independent Samples
441
From Theorem A, we have
COROLLARY A
Under the null hypothesis H0 : F = G,
mn
E(UY ) =
2
mn(m + n + 1)
Var(UY ) =
12
■
For m and n both greater than 10, the null distribution of UY is quite well approximated by a normal distribution,
UY − E(UY )
√
∼ N (0, 1)
Var(UY )
(Note that this does not follow immediately from the ordinary central limit theorem;
although UY is a sum of random variables, they are not independent.) Similarly, the
distribution of the rank sum of the X ’s or Y ’s may be approximated by a normal
distribution, since these rank sums differ from UY only by constants.
EXAMPLE B
Referring to Example A, let us use a normal approximation to the distribution of the
rank sum from method B. For n = 13 and m = 8, we have from Corollary A that
under the null hypothesis,
8(8 + 13 + 1)
= 88
2
*
8 × 13(8 + 13 + 1)
σT =
= 13.8
12
E(T ) =
T is the sum of the ranks from method B, or 51, and the normalized test statistic is
T − E(T )
= −2.68
σT
From the tables of the normal distribution, this corresponds to a p-value of .007 for
a two-sided test, so the null hypothesis is rejected at level α = .01, just as it was
when we used the exact distribution. For this set of data, we have seen that the t
test with the assumption of equal variances, the t test without that assumption, the
exact Mann-Whitney test, and the approximate Mann-Whitney test all reject at level
■
α = .01.
442
Chapter 11
Comparing Two Samples
The Mann-Whitney test can be inverted to form confidence intervals. Let us
consider a “shift” model: G(x) = F(x − ). This model says that the effect of the
treatment (the Y ’s) is to add a constant to what the response would have been with
no treatment (the X ’s). (This is a very simple model, and we have already seen cases
for which it is not appropriate.) We now derive a confidence interval for . To test
H0 : F = G, we used the statistic UY equal to the number of the X i − Y j that are
less than zero. To test the hypothesis that the shift parameter is , we can similarly
use
UY () = #[X i − (Y j − ) < 0] = #(Y j − X i > )
It can be shown that the null distribution of UY () is symmetric about mn/2:
mn
mn
+ k = P UY () =
−k
P UY () =
2
2
for all integers k. Suppose that k = k(α) is such that P(k ≤ UY () ≤ mn − k) =
1 − α; the level α test then accepts for such UY (). By the duality of confidence
intervals and hypothesis tests, a 100(1 − α)% confidence interval for is thus
C = { | k ≤ UY () ≤ mn − k}
C consists of the set of values for which the null hypothesis would not be rejected.
We can find an explicit form for this confidence interval. Let D(1) , D(2) , . . . , D(mn)
denote the ordered mn differences Y j − X i . We will show that
C = [D(k) , D(mn−k+1) )
To see this, first suppose that = D(k) . Then
UY () = #(X i − Y j + < 0)
= #(Y j − X i > )
= mn − k
Similarly, if = D(mn−k+1) ,
UY () = #(Y j − X i > )
=k
(You might find it helpful to consider the case m = 3, n = 2, k = 2.)
EXAMPLE C
We return to the data on iron retention (Section 11.2.1.1). The earlier analysis using
the t test rested on the assumption that the populations were normally distributed,
which, in fact, seemed rather dubious. The Mann-Whitney test does not make this
assumption. The sum of the ranks of the Fe2+ group is used as a test statistic (we
could have as easily used the U statistic). The rank sum is 362. Using the normal
approximation to the null distribution of the rank sum, we get a p-value of .36. Again,
there is insufficient evidence to reject the null hypothesis that there is no differential
retention. The 95% confidence interval for the shift between the two distributions is
(−1.6, 3.7), which overlaps zero substantially. Note that this interval is shorter than
11.2 Comparing Two Independent Samples
443
the interval based on the t distribution; the latter was inflated by the contributions of
■
the large observations to the sample variance.
We close this section with an illustration of the use of the bootstrap in a twosample problem. As before, suppose that X 1 , X 2 , . . . , X n and Y1 , Y2 , . . . , Ym are
two independent samples from distributions F and G, respectively, and that π =
P(X < Y ) is estimated by π̂. How can the standard error of π̂ be estimated and how
can an approximate confidence interval for π be constructed? (Note that the calculations of Theorem A are not directly relevant, since they are done under the assumption that F = G.)
The problem can be approached in the following way: First suppose for the
moment that F and G were known. Then the sampling distribution of π̂ and its
standard error could be estimated by simulation. A sample of size n would be generated
from F, an independent sample of size m would be generated from G, and the resulting
value of π̂ would be computed. This procedure would be repeated many times, say B
times, producing π̂1 , π̂2 , . . . , π̂ B . A histogram of these values would be an indication
of the sampling distribution of π̂ and their standard deviation would be an estimate
of the standard error of π̂ .
Of course, this procedure cannot be implemented, because F and G are not
known. But as in the previous chapter, an approximation can be obtained by using the
empirical distributions Fn and G n in their places. This means that a bootstrap value of
π̂ is generated by randomly selecting n values from X 1 , X 2 , . . . , X n with replacement,
m values from Y1 , Y2 , . . . , Ym with replacement and calculating the resulting value
of π̂. In this way, a bootstrap sample π̂1 , π̂2 , . . . , π̂ B is generated.
11.2.4 Bayesian Approach
We consider a Bayesian approach to the model, which stipulates that the X i are i.i.d.
normal with mean μ X and precision ξ ; and the Y j are i.i.d. normal with mean μY ,
precision ξ , and independent of the X i . In general, a prior joint distribution assigned
to (μ X , μY , ξ ) would be multiplied by the likelihood and normalized to integrate
to 1 to produce a three-dimensional joint posterior distribution for (μ X , μY , ξ ). The
marginal joint distribution of (μ X , μY ) could be obtained by integrating out ξ . The
marginal distribution of μ X − μY could then be obtained by another integration as in
Section 3.6.1. Several integrations would thus have to be done, either analytically or
numerically. Special Monte Carlo methods have been devised for high dimensional
Bayesian problems, but we will not consider them here.
An approximate result can be obtained using improper priors. We take (μ X , μY , ξ )
to be independent. The means μ X and μY are given improper priors that are constant
on (−∞, ∞), and ξ is given the improper prior f (ξ ) = ξ −1 . The posterior is thus
proportional to the likelihood multiplied by ξ −1 :
f post (μ X , μY , ξ ) ∝ ξ
n+m
2 −1
n
m
ξ m+n 2
2
exp −
(xi − μ X ) +
(y j − μY )
2
i=1
j=1
444
Chapter 11
Comparing Two Samples
n
Next, using i=1
(xi − μ X )2 = (n − 1)sx2 + n(μ X − x̄)2 and the analogous expression
for the y j , we have
ξ
n+m
f post (μ X , μY , ξ ) ∝ ξ 2 −1 exp − (n − 1)sx2 + (m − 1)s y2
2
mξ
nξ
2
(μY − ȳ)2
× exp − (μ X − x̄) exp −
2
2
From the form of this expression as a function of μ X and μY , we see that for fixed ξ ,
μ X and μY are independent normally distributed with means x̄ and ȳ and precisions
nξ and mξ . Their difference, μ X − μY , is thus normally distributed with mean x̄ − ȳ
and variance ξ −1 (n −1 + m −1 ).
With further analysis similar to that of Section 8.6, it can be shown that the
marginal posterior distribution of = μ X − μY can be related to the t distribution:
− (x̄ − ȳ)
√
∼ tn+m−2
s p n −1 + m −1
Although formally similar to Theorem A of Section 11.2.1, the interpretation is different: x̄ − ȳ and s p are random in Theorem A but are fixed here, and = μ X − μY
is random here but fixed in Theorem A. The Bayesian formalism makes probability
statements about given the observed data.
The posterior probability that > 0 can thus be found using the t distribution. Let
T denote a random variable with a tm+n−2 distribution. Then, denoting the observations
by X and Y
− (x̄ − ȳ)
−(x̄ − ȳ)
√
P( > 0 | X, Y ) = P
≥ √
| X, Y
s p n −1 + m −1
s p n −1 + m −1
ȳ − x̄
=P T ≥ √
s p n −1 + m −1
Letting X denote the measurements of method A, and Y denote the measurements
of method B in Example A of Section 11.2.1, we find that for that example,
P( > 0|X, Y ) = t19 (−3.33) = .998
This posterior probability is very close to 1.0, and there is thus little doubt that the
mean of method A is larger than the mean of method B.
The confidence interval calculated in Section 11.2.1 is formally similar but has
a different interpretation under the Bayesian model, which concludes that
P(.015 ≤ ≤ .065|X, Y ) = .95
by integration of the posterior t distribution over a region containing 95% of the
probability.
11.3 Comparing Paired Samples
In Section 11.2, we considered the problem of analyzing two independent samples.
In many experiments, the samples are paired. In a medical experiment, for example,
11.3 Comparing Paired Samples
445
subjects might be matched by age or weight or severity of condition, and then one
member of each pair randomly assigned to the treatment group and the other to the
control group. In a biological experiment, the paired subjects might be littermates.
In some applications, the pair consists of a “before” and an “after” measurement on
the same object. Since pairing causes the samples to be dependent, the analysis of
Section 11.2 does not apply.
Pairing can be an effective experimental technique, as we will now demonstrate
by comparing a paired design and an unpaired design. First, we consider the paired
design. Let us denote the pairs as (X i , Yi ), where i = 1, . . . , n, and assume the X ’s
and Y ’s have means μ X and μY and variances σ X2 and σY2 . We will assume that different
pairs are independently distributed and that Cov(X i , Yi ) = σ X Y . We will work with
the differences Di = X i − Yi , which are independent with
E(Di ) = μ X − μY
Var(Di ) = σ X2 + σY2 − 2σ X Y
= σ X2 + σY2 − 2ρσ X σY
when ρ is the correlation of members of a pair. A natural estimate of μ X − μY is
D = X − Y , the average difference. From the properties of Di , it follows that
E(D) = μ X − μY
1 2
σ X + σY2 − 2ρσ X σY
Var(D) =
n
Suppose, on the other hand, that an experiment had been done by taking a sample
of n X ’s and an independent sample of n Y ’s. Then μ X − μY would be estimated by
X − Y and
E(X − Y ) = μ X − μY
1 2
σ + σY2
Var(X − Y ) =
n X
Comparing the variances of the two estimates, we see that the variance of D is
smaller if the correlation is positive—that is, if the X ’s and Y ’s are positively correlated. In this circumstance, pairing is the more effective experimental design. In
the simple case in which σ X = σY = σ , the two variances may be more simply
expressed as
Var(D) =
2σ 2 (1 − ρ)
n
in the paired case and as
Var(X − Y ) =
2σ 2
n
in the unpaired case, and the relative efficiency is
Var(D)
=1−ρ
Var(X − Y )
446
Chapter 11
Comparing Two Samples
If the correlation coefficient is .5, for example, a paired design with n pairs of subjects
yields the same precision as an unpaired design with 2n subjects per treatment. This
additional precision results in shorter confidence intervals and more powerful tests if
the degrees of freedom for estimating σ 2 are sufficiently large.
We next present methods based on the normal distribution for analyzing data
from paired designs and then a nonparametric, rank-based method.
11.3.1 Methods Based on the Normal Distribution
In this section, we assume that the differences are a sample from a normal distribution
with
E(Di ) = μ X − μY = μ D
Var(Di ) = σ D2
Generally, σ D will be unknown, and inferences will be based on
t=
D − μD
sD
which follows a t distribution with n − 1 degrees of freedom. Following familiar
reasoning, a 100(1 − α)% confidence interval for μ D is
D ± tn−1 (α/2)s D
A two-sided test of the null hypothesis H0 : μ D = 0 (the natural null hypothesis for
testing no treatment effect) at level α has the rejection region
|D| > tn−1 (α/2)s D
If the sample size n is large, the approximate validity of the confidence interval
and hypothesis test follows from the central limit theorem. If the sample size is small
and the true distribution of the differences is far from normal, the stated probability
levels may be considerably in error.
EXAMPLE A
To study the effect of cigarette smoking on platelet aggregation, Levine (1973) drew
blood samples from 11 individuals before and after they smoked a cigarette and
measured the extent to which the blood platelets aggregated. Platelets are involved
in the formation of blood clots, and it is known that smokers suffer more often
from disorders involving blood clots than do nonsmokers. The data are shown in
the following table, which gives the maximum percentage of all the platelets that
aggregated after being exposed to a stimulus.
11.3 Comparing Paired Samples
Before
After
Difference
25
25
27
44
30
67
53
53
52
60
28
27
29
37
56
46
82
57
80
61
59
43
2
4
10
12
16
15
4
27
9
−1
15
447
From the column of differences, D = 10.27 and s D = 2.40. The uncertainty
in D is quantified in s D or in a confidence interval. Since t10 (.05) = 1.812, a 90%
confidence interval is D ± 1.812s D , or (5.9, 14.6). We can also formally test the null
hypothesis that means before and after are the same. The t statistic is 10.27/2.40 =
4.28, and since t10 (.005) = 3.169, the p-value of a two-sided test is less than .01.
There is little doubt that smoking increases platelet aggregation.
The experiment was actually more complex than we have indicated. Some subjects also smoked cigarettes made of lettuce leaves and “smoked” unlit cigarettes.
(You should reflect on why these additional experiments were done.)
Figure 11.7 is a plot of the after values versus the before values. They are correlated, with a correlation coefficient of .90. Pairing was a natural and effective exper■
imental design in this case.
90
80
70
After
60
50
40
30
20
20
30
40
50
60
70
Before
F I G U R E 11.7 Plot of platelet aggregation after smoking versus aggregation before
smoking.
448
Chapter 11
Comparing Two Samples
11.3.2 A Nonparametric Method—The Signed Rank Test
A nonparametric test based on ranks can be constructed for paired samples. We
illustrate the calculation with a very small example. Suppose there are four pairs,
corresponding to “before” and “after” measurements listed in the following table:
Before
After
Difference
|Difference|
Rank
Signed Rank
25
29
60
27
27
25
59
37
2
−4
−1
10
2
4
1
10
2
3
1
4
2
−3
−1
4
The test statistic is calculated by the following steps:
1. Calculate the differences, Di , and the absolute values of the differences and rank
the latter.
2. Restore the signs of the differences to the ranks, obtaining signed ranks.
3. Calculate W+ , the sum of those ranks that have positive signs. For the table, this
sum is W+ = 2 + 4 = 6.
The idea behind the signed rank test (sometimes called the Wilcoxon signed rank
test) is intuitively simple. If there is no difference between the two paired conditions,
we expect about half the Di to be positive and half negative, and W+ will not be too
small or too large. If one condition tends to produce larger values than the other, W+
will tend to be more extreme. We therefore can use W+ as a test statistic and reject
for extreme values.
Before continuing, we need to specify more precisely the null hypothesis we are
testing with the signed rank test: H0 states that the distribution of the Di is symmetric
about zero. This will be true if the members of pairs of experimental units are assigned
randomly to treatment and control conditions, and the treatment has no effect at all.
As usual, in order to define a rejection region for a test at level α, we need to
know the sampling distribution of W+ if the null hypothesis is true. The rejection
region will be located in the tails of this null distribution in such a way that the
test has level α. The null distribution may be calculated in the following way. If H0
is true, it makes no difference which member of the pair corresponds to treatment
and which to control. The difference X i − Yi = Di has the same distribution as the
difference Yi − X i = −Di , so the distribution of Di is symmetric about zero. The kth
largest value of D is thus equally likely to be positive or negative, and any particular
assignment of signs to the integers 1, . . . , n (the ranks) is equally likely. There are 2n
such assignments, and for each we can calculate W+ . We obtain a list of 2n values (not
all distinct) of W+ , each of which occurs with probability 1/2n . The probability of
each distinct value of W+ may thus be calculated, giving the desired null distribution.
The preceding argument has assumed that the Di are a sample from some continuous probability distribution. If we do not wish to regard the X i and Yi as random
variables and if the assignments to treatment and control have been made at random,
the hypothesis that there is no treatment effect may be tested in exactly the same
11.3 Comparing Paired Samples
449
manner, except that inferences are based on the distribution induced by the randomization, as was done for the Mann-Whitney test.
The null distribution of W+ is calculated by many computer packages, and tables
are also available.
The signed rank test is a nonparametric version of the paired sample t test.
Unlike the t test, it does not depend on an assumption of normality. Since differences
are replaced by ranks, it is insensitive to outliers, whereas the t test is sensitive. It
has been shown that even when the assumption of normality holds, the signed rank
test is nearly as powerful as the t test. The nonparametric method is thus generally
preferable, especially for small sample sizes.
EXAMPLE A
The signed rank test can be applied to the data on platelet aggregation considered
previously (Example A in Section 11.3.1). In this case, it is easier to work with W−
rather than W+ , since W− is clearly 1. From Table 9 of Appendix B, the two-sided
■
test is significant at α = .01.
If the sample size is greater than 20, a normal approximation to the null distribution can be used. To find this, we calculate the mean and variance of W+ .
THEOREM A
Under the null hypothesis that the Di are independent and symmetrically distributed about zero,
n(n + 1)
E(W+ ) =
4
n(n + 1)(2n + 1)
Var(W+ ) =
24
Proof
To facilitate the calculation, we represent W+ in the following way:
W+ =
n
k Ik
k=1
where
Ik =
1,
0,
if the kth largest |Di | has Di > 0
otherwise
Under H0 , the Ik are independent Bernoulli random variables with p = 12 , so
1
2
1
Var(Ik ) =
4
E(Ik ) =
450
Chapter 11
Comparing Two Samples
We thus have
1
n(n + 1)
k=
2 k=1
4
n
E(W+ ) =
1 2
n(n + 1)(2n + 1)
k =
4 k=1
24
n
Var(W+ ) =
■
as was to be shown.
If some of the differences are equal to zero, the most common technique is to
discard those observations. If there are ties, each |Di | is assigned the average value of
the ranks for which it is tied. If there are not too many ties, the significance level of
the test is not greatly affected. If there are a large number of ties, modifications must
be made. For further information on these matters, see Hollander and Wolfe (1973)
or Lehmann (1975).
11.3.3 An Example—Measuring Mercury Levels in Fish
Kacprzak and Chvojka (1976) compared two methods of measuring mercury levels
in fish. A new method, which they called “selective reduction,” was compared to
an established method, referred to as “the permanganate method.” One advantage
of selective reduction is that it allows simultaneous measurement of both inorganic
mercury and methyl mercury. The mercury in each of 25 juvenile black marlin was
measured by both techniques. The 25 measurements for each method (in ppm of
mercury) and the differences are given in the following table.
Fish
Selective Reduction
Permanganate
Difference
Signed Rank
1
2
3
4
5
6
7
8
9
10
11
12
.32
.40
.11
.47
.32
.35
.32
.63
.50
.60
.38
.46
.39
.47
.11
.43
.42
.30
.43
.98
.86
.79
.33
.45
.07
.07
.00
−.04
.10
−.05
.11
.35
.36
.19
−.05
−.01
+15.5
+15.5
−11
+19
−13.5
+20
+23
+24
+22
−13.5
−2.5
(Continued)
11.3 Comparing Paired Samples
Fish
Selective Reduction
Permanganate
Difference
Signed Rank
13
14
15
16
17
18
19
20
21
22
23
24
25
.20
.31
.62
.52
.77
.23
.30
.70
.41
.53
.19
.31
.48
.22
.30
.60
.53
.85
.21
.33
.57
.43
.49
.20
.35
.40
.02
−.01
−.02
.01
.08
−.02
.03
−.13
.02
−.04
.01
.04
−.08
+6.5
−2.5
−6.5
+2.5
+17.5
−6.5
+9.0
−21
+6.5
−11
+2.5
+11
−17.5
451
In analyzing such data, it is often informative to check whether the differences
depend in some way on the level or size of the quantity being measured. The differences versus the permanganate values are plotted in Figure 11.8. This plot is quite
interesting. It appears that the differences are small for low permanganate values and
larger for higher permanganate values. It is striking that the differences are all positive and large for the highest four values. The investigators do not comment on these
phenomena. It is not uncommon for the size of fluctuations to increase as the value
being measured increases; the percent error may remain nearly constant but the actual
error does not. For this reason, data of this nature are often analyzed on a log scale.
.4
.3
Differences
.2
.1
0
.1
.2
0
.2
.4
.6
Permanganate value
.8
1.0
F I G U R E 11.8 Plot of differences versus permanganate values.
452
Chapter 11
Comparing Two Samples
Because the observations are paired (two measurements on each fish), we will
use the paired t test for a parametric test. The sample size is large enough that the test
should be robust against nonnormality. The mean difference is .04, and the standard
deviation of the differences is .116. The t statistic is 1.724; with 24 degrees of freedom,
this corresponds to a p-value of .094 for a two-sided test. Although this p-value is
fairly small, the evidence against H0 : μ D = 0 is not overwhelming. The test does not
reject at the significance level .05.
The signed ranks are shown in the last column of the table above. Note that the
single zero difference was set aside, and also note how the tied ranks were handled.
The test statistic W+ is 194.5. Under H0 , its mean and variance are
24 × 25
= 150
4
24 × 25 × 49
= 1225
Var(W+ ) =
24
Since n is greater than 20, we use the normalized test statistic, or
E(W+ ) =
W+ − E(W+ )
Z= √
= 1.27
Var(W+ )
The p-value for a two-sided test from the normal approximation is .20, which is not
strong evidence against the null hypothesis. It is possible to correct for the presence
of ties, but in this case the correction only amounts to changing the standard deviation
of W+ from 35 to 34.95.
Neither the parametric nor the nonparametric test gives conclusive evidence that
there is any systematic difference between the two methods of measurement. The
informal graphical analysis does suggest, however, that there may be a difference for
high concentrations of mercury.
11.4 Experimental Design
This section covers some basic principles of the interpretation and design of experimental studies and illustrates them with case studies.
11.4.1 Mammary Artery Ligation
A person with coronary artery disease suffers from chest pain during exercise because
the constricted arteries cannot deliver enough oxygen to the heart. The treatment
of ligating the mammary arteries enjoyed a brief vogue; the basic idea was that
ligating these arteries forced more blood to flow into the heart. This procedure had the
advantage of being quite simple surgically, and it was widely publicized in an article
in Reader’s Digest (Ratcliffe 1957). Two years later, the results of a more careful study
(Cobb et al. 1959) were published. In this study, a control group and an experimental
group were established in the following way. When a prospective patient entered
surgery, the surgeon made the necessary preliminary incisions prior to tying off the
mammary artery. At that point, the surgeon opened a sealed envelope that contained
instructions about whether to complete the operation by tying off the artery. Neither
11.4 Experimental Design
453
the patient nor his attending physician knew whether the operation had actually been
carried out. The study showed essentially no difference after the operation between
the control group (no ligation) and the experimental group (ligation), although there
was some suggestion that the control group had done better.
The Ratcliffe and Cobb studies differ in that in the earlier one there was no
control group and thus no benchmark by which to gauge improvement. The reported
improvement of the patients in this earlier study could have been due to the placebo
effect, which we discuss next. The design of the later study protected against possible
unconscious biases by randomly assigning the control and experimental groups and
by concealing from the patients and their physicians the actual nature of the treatment.
Such a design is called a double-blind, randomized controlled experiment.
11.4.2 The Placebo Effect
The placebo effect refers to the effect produced by any treatment, including dummy
pills (placebos), when the subject believes that he or she has been given an effective
treatment. The possibility of a placebo effect makes the use of a blind design necessary
in many experimental investigations.
The placebo effect may not be due entirely to psychological factors, as was
shown in an interesting experiment by Levine, Gordon, and Fields (1978). A group
of subjects had teeth extracted. During the extraction, they were given nitrous oxide
and local anesthesia. In the recovery room, they rated the amount of pain they were
experiencing on a numerical scale. Two hours after surgery, the subjects were given
a placebo and were again asked to rate their pain. An hour later, some of the subjects
were given a placebo and some were given naloxone, a morphine antagonist. It is
known that there are specific receptors to morphine in the brain and that the body
can also release endorphins that bind to these sites. Naloxone blocks the morphine
receptors. In the study, it was found that when those subjects who responded positively
to the placebo received naloxone, they experienced an increase in pain that made their
pain levels comparable to those of the patients who did not respond to the placebo.
The implication is that those who responded to the placebo had produced endorphins,
the actions of which were subsequently blocked by the naloxone.
An instance of the placebo effect was demonstrated by a psychologist, Claude
Steele (2002), who gave a math exam to a group of male and female undergraduates
at Stanford University. One group (treatment) was told that the exam was genderneutral, and the other group (controls) was not so informed. The men outperformed
the women in the control group. In the treatment group, men and women performed
equally well. Men in the treatment group did worse than men in the control group.
(Economist Feb 21, 2002).
11.4.3 The Lanarkshire Milk Experiment
The importance of the randomized assignment of individuals (or other experimental
units) to treatment and control groups is illustrated by a famous study known as the
Lanarkshire milk experiment. In the spring of 1930, an experiment was carried out in
Lanarkshire, Scotland, to determine the effect of providing free milk to schoolchildren.
In each participating school, some children (treatment group) were given free milk
454
Chapter 11
Comparing Two Samples
and others (controls) were not. The assignment of children to control or treatment
was initially done at random; however, teachers were allowed to use their judgment
in switching children between treatment and control to obtain a better balance of
undernourished and well-nourished individuals in the groups.
A paper by Gosset (1931), who published under the name Student (as in Student’s t test), is a very interesting critique of the experiment. An examination of the
data revealed that at the start of the experiment the controls were heavier and taller.
Student conjectured that the teachers, perhaps unconsciously, had adjusted the initial
randomization in a manner that placed more of the undernourished children in the
treatment group. A further complication was caused by weighing the children with
their clothes on. The experimental data were weight gains measured in late spring
relative to early spring or late winter. The more well-to-do children probably tended to
be better nourished and may have had heavier winter clothing than the poor children.
Thus, the well-to-do children’s weight gains were vitiated as a result of differences in
clothing, which may have influenced comparisons between the treatment and control
groups.
11.4.4 The Portacaval Shunt
Cirrhosis of the liver, to which alcoholics are prone, is a condition in which resistance
to blood flow causes blood pressure in the liver to build up to dangerously high levels.
Vessels may rupture, which may cause death. Surgeons have attempted to relieve this
condition by connecting the portal artery, which feeds the liver, to the vena cava,
one of the main veins returning to the heart, thus reducing blood flow through the
liver. This procedure, called the Portacaval shunt, had been used for more than 20
years when Grace, Muench, and Chalmers (1966) published an examination of 51
studies of the method. They examined the design of each study (presence or absence
of a control group and presence or absence of randomization) and the investigators’
conclusions (categorized as markedly enthusiastic, moderately enthusiastic, or not
enthusiastic). The results are summarized in the following table, which speaks for
itself:
Enthusiasm
Design
Marked
Moderate
None
No controls
Nonrandomized controls
Randomized controls
24
10
0
7
3
1
1
2
3
The differences between the experiments that used controls and those that did
not is not entirely surprising, because the placebo effect was probably operating. The
importance of randomized assignment to treatment and control groups is illustrated
by comparing the conclusions for the randomized and nonrandomized controlled
experiments. Randomization can help to ensure against subtle unconscious biases that
may creep into an experiment. For example, a physician might tend to recommend
surgery for patients who are somewhat more robust than the average. Articulate
11.4 Experimental Design
455
patients might be more likely to have an influence on the decision as to which group
they are assigned to.
11.4.5 FD&C Red No. 40
This discussion follows Lagakos and Mosteller (1981). During the middle and late
1970s, experiments were conducted to determine possible carcinogenic effects of a
widely used food coloring, FD&C Red No. 40. One of the experiments involved
500 male and 500 female mice. Both genders were divided into five groups: two
control groups, a low-dose group, a medium-dose group, and a high-dose group. The
mice were bred in the following way: Males and females were paired and before
and during mating were given their prescribed dose of Red No. 40. The regime was
continued during gestation and weaning of the young. From litters that had at least
three pups of each sex, three of each sex were selected randomly and continued
on their parents’ dosage throughout their lives. After 109–111 weeks, all the mice
still living were killed. The presence or absence of reticuloendothelial tumors was
of particular interest. Although there were significant differences between some of
the treatment groups, the results were rather confusing. For example, there was a
significant difference between the incidence rates for the two male control groups,
and among the males the medium-dose group had the lowest incidence.
Several experts were asked to examine the results of this and other experiments.
Among them were Lagakos and Mosteller, who requested information on how the
cages that housed the mice were arranged. There were three racks of cages, each
containing five rows of seven cages in the front and five rows of seven cages in the
back. Five mice were housed in each cage. The mice were assigned to the cages in a
systematic way: The first male control group was in the top of the front of rack 1; the
first female control group was in the bottom of the front of rack 1; and so on, ending
with the high-dose females in the bottom of the back of rack 3 (Figure 11.9). Lagakos
and Mosteller showed that there were effects due to cage position that could not be
explained by gender or by dosage group. A random assignment of cage positions
would have eliminated this confounding. Lagakos and Mosteller also suggested some
experimental designs to systematically control for cage position.
Rack 1
Rack 2
Rack 3
Female–C2
Female–M
Male–C1
Front
Male–L
Female–C1
Female–C1
Back
Female–L
Female–L
Male–C2
Male–H
Male–H
Female–H
Male–M
Female–C2
F I G U R E 11.9 Location of mice cages in racks.
456
Chapter 11
Comparing Two Samples
It was also possible that a litter effect might be complicating the analysis, since
littermates received the same treatment and littermates of the same sex were housed
in the same or contiguous cages. In the presence of a litter effect, mice from the same
litter might show less variability than that present among mice from different litters.
This reduces the effective sample size—in the extreme case in which littermates react
identically, the effective sample size is the number of litters, not the total number of
mice. One way around this problem would have been to use only one mouse from
each litter.
The presence of a possible selection bias is another problem. Because mice were
included in the experiment only if they came from a litter with at least three males
and three females, offspring of possibly less healthy parents were excluded. This
could be a serious problem since exposure to Red No. 40 might affect the parents’
health and the birth process. If, for example, among the high-dose mice, only the
most hardy produced large enough litters, their offspring might be hardier than the
controls’ offspring.
11.4.6 Further Remarks on Randomization
As well as guarding against possible biases on the part of the experimenter, the process of randomization tends to balance any factors that may be influential but are
not explicitly controlled in the experiment. Time is often such a factor; background
variables such as temperature, equipment calibration, line voltage, and chemical composition can change slowly with time. In experiments that are run over some period of
time, therefore, it is important to randomize the assignments to treatment and control
over time. Time is not the only factor that should be randomized, however. In agricultural experiments, the positions of test plots in a field are often randomly assigned.
In biological experiments with test animals, the locations of the animals’ cages may
have an effect, as illustrated in the preceding section.
Although rarer than in other areas, randomized experiments have been carried
out in the social sciences as well (Economist Feb 28, 2002). Randomized trials have
been used to evaluate such programs as driver training, as well as the criminal justice
system and reduced classroom size. In evaluations of “whole-language” approaches to
reading (in which children are taught to read by evaluating contextual clues rather than
breaking down words), 52 randomized studies carried out by the National Reading
Panel in 2000 showed that effective reading instruction requires phonics. Randomized
studies of “scared straight” programs, in which juvenile delinquents are introduced
to prison inmates, suggested that the likelihood of subsequent arrests is actually
increased by such programs.
Generally, if it is anticipated that a variable will have a significant effect, that
variable should be included as one of the controlled factors in the experimental design.
The matched-pairs design of this chapter can be used to control for a single factor.
To control for more than one factor, factorial designs, which are briefly introduced in
the next chapter, may be used.
11.4 Experimental Design
457
11.4.7 Observational Studies, Confounding,
and Bias in Graduate Admissions
It is not always possible to conduct controlled experiments or use randomization.
In evaluating some medical therapies, for example, a randomized, controlled experiment would be unethical if one therapy was strongly believed to be superior. For many
problems of psychological interest (effects of parental modes of discipline, for example), it is impossible to conduct controlled experiments. In such situations, recourse
is often made to observational studies. Hospital records may be examined to compare
the outcomes of different therapies, or psychological records of children raised in
different ways may be analyzed. Although such studies may be valuable, the results
are seldom unequivocal. Because there is no randomization, it is always possible that
the groups under comparison differ in respects other than their “treatments.”
As an example, let us consider a study of gender bias in admissions to graduate
school at the University of California at Berkeley (Bickel and O’Connell 1975). In
the fall of 1973, 8442 men applied for admission to graduate studies at Berkeley, and
44% were admitted; 4321 women applied, and 35% were admitted. If the men and
women were similar in every respect other than sex, this would be strong evidence
of sex bias. This was not a controlled, randomized experiment, however; sex was not
randomly assigned to the applicants. As will be seen, the male and female applicants
differed in other respects, which influenced admission.
The following table shows admission rates for the six most popular majors on
the Berkeley campus.
Men
Women
Major
Number of
Applicants
Percentage
Admitted
Number of
Applicants
Percentage
Admitted
A
B
C
D
E
F
825
560
325
417
191
373
62
63
37
33
28
6
108
25
593
375
393
341
82
68
34
35
34
7
If the percentages admitted are compared, women do not seem to be unfavorably
treated. But when the combined admission rates for all six majors are calculated, it
is found that 44% of the men and only 30% of the women were admitted, which
seems paradoxical. The resolution of the paradox lies in the observation that the
women tended to apply to majors that had low admission rates (C through F) and
the men to majors that had relatively high admission rates (A and B). This factor
was not controlled for, because the study was observational in nature; it was also
“confounded” with the factor of interest, sex; randomization, had it been possible,
would have tended to balance out the confounded factor.
Confounding also plays an important role in studies of the effect of coffee
drinking. Several studies have claimed to show a significant association of coffee
458
Chapter 11
Comparing Two Samples
consumption with coronary disease. Clearly, randomized, controlled trials are not
possible here—a randomly selected individual cannot be told that he or she is in the
treatment group and must drink 10 cups of coffee a day for the next five years. Also, it
is known that heavy coffee drinkers also tend to smoke more than average, so smoking
is confounded with coffee drinking. Hennekens et al. (1976) review several studies
in this area.
11.4.8 Fishing Expeditions
Another problem that sometimes flaws observational studies, and controlled experiments as well, is that they engage in “fishing expeditions.” For example, consider
a hypothetical study of the effects of birth control pills. In such a case, it would be
impossible to assign women to a treatment or a placebo at random, but a nonrandomized study might be conducted by carefully matching controls to treatments on such
factors as age and medical history. The two groups might be followed up on for some
time, with many variables being recorded for each subject such as blood pressure,
psychological measures, and incidences of various medical problems. After termination of the study, the two groups might be compared on each of these variables, and
it might be found, say, that there was a “significant difference” in the incidence of
melanoma. The problem with this “significant finding” is the following. Suppose that
100 independent two-sample t tests are conducted at the .05 level and that, in fact, all
the null hypotheses are true. We would expect that five of the tests would produce a
“significant” result. Although each of the tests has probability .05 of type I error, as a
collection they do not simultaneously have α = .05. The combined significance level
is the probability that at least one of the null hypotheses is rejected:
α = P{at least one H0 rejected}
= 1 − P{no H0 rejected}
= 1 − .95100 = .994
Thus, with very high probability, at least one “significant” result will be found, even
if all the null hypotheses are true.
There are no simple cures for this problem. One possibility is to regard the
results of a fishing expedition as merely providing suggestions for further experiments.
Alternatively, and in the same spirit, the data could be split randomly into two halves,
one half for fishing in and the other half to be locked safely away, unexamined.
“Significant” results from the first half could then be tested on the second half. A
third alternative is to conduct each individual hypothesis test at a small significance
level. To see how this works, suppose that all null hypotheses are true and that each
of n null hypotheses is tested at level α. Let Ri denote the event that the ith null
hypothesis is rejected, and let α ∗ denote the overall probability of a type I error. Then
α ∗ = P{R1 or R2 or · · · or Rn }
≤ P{R1 } + P{R2 } + · · · + P{Rn }
= nα
Thus, if each of the n null hypotheses is tested at level α/n, the overall significance
level is less than or equal to α. This is often called the Bonferroni method.
11.6 Problems
459
11.5 Concluding Remarks
This chapter was concerned with the problem of comparing two samples. Within this
context, the fundamental statistical concepts of estimation and hypothesis testing,
which were introduced in earlier chapters, were extended and utilized. The chapter
also showed how informal descriptive and data analytic techniques are used in supplementing more formal analysis of data. Chapter 12 will extend the techniques of
this chapter to deal with multisample problems. Chapter 13 is concerned with similar
problems that arise in the analysis of qualitative data.
We considered two types of experiments, those with two independent samples
and those with matched pairs. For the case of independent samples, we developed
the t test, based on an assumption of normality, as well as a modification of the t test
that takes into account possibly unequal variances. The Mann-Whitney test, based on
ranks, was presented as a nonparametric method, that is, a method that is not based
on an assumption of a particular distribution. Similarly, for the matched-pairs design,
we developed a parametric t test and a nonparametric test, the signed rank test.
We discussed methods based on an assumption of normality and rank methods,
which do not make this assumption. It turns out, rather surprisingly, that even if the
normality assumption holds, the rank methods are quite powerful relative to the t test.
Lehmann (1975) shows that the efficiency of the rank tests relative to that of the t
test—that is, the ratio of sample sizes required to attain the same power—is typically
around .95 if the distributions are normal. Thus, a rank test using a sample of size
100 is as powerful as a t test based on 95 observations. Collecting the extra 5 pieces
of data is a small price to pay for a safeguard against nonnormality.
The bootstrap appeared again in this chapter. Indeed, uses of this recently developed technique are finding applications in a great variety of statistical problems.
In contrast with earlier chapters, where bootstrap samples were generated from one
distribution, here we have bootstrapped from two empirical distributions.
The chapter concluded with a discussion of experimental design, which emphasized the importance of incorporating controls and randomization in investigations.
Possible problems associated with observational studies were discussed. Finally, the
difficulties encountered in making many comparisons from a single data set were
pointed out; such problems of multiplicity will come up again in Chapter 12.
11.6 Problems
1. A computer was used to generate four random numbers from a normal distribution
with a set mean and variance: 1.1650, .6268, .0751, .3516. Five more random
normal numbers with the same variance but perhaps a different mean were then
generated (the mean may or may not actually be different): .3035, 2.6961, 1.0591,
2.7971, 1.2641.
a. What do you think the means of the random normal number generators were?
What do you think the difference of the means was?
b. What do you think the variance of the random number generator was?
c. What is the estimated standard error of your estimate of the difference of the
means?
460
Chapter 11
Comparing Two Samples
d. Form a 90% confidence interval for the difference of the means of the random
number generators.
e. In this situation, is it more appropriate to use a one-sided test or a two-sided
test of the equality of the means?
f. What is the p-value of a two-sided test of the null hypothesis of equal means?
g. Would the hypothesis that the means were the same versus a two-sided alternative be rejected at the significance level α = .1?
h. Suppose you know that the variance of the normal distribution was σ 2 = 1.
How would your answers to the preceding questions change?
2. The difference of the means of two normal distributions with equal variance is to
be estimated by sampling an equal number of observations from each distribution.
If it were possible, would it be better to halve the standard deviations of the
populations or double the sample sizes?
3. In Section 11.2.1, we considered two methods of estimating Var(X − Y ). Under
the assumption that the two population variances were equal, we estimated this
quantity by
1
1
s 2p
+
n
m
and without this assumption by
s2
s X2
+ Y
n
m
Show that these two estimates are identical if m = n.
4. Respond to the following:
Using the t distribution is absolutely ridiculous—another example of deliberate mystification! It’s valid when the populations are normal and have
equal variance. If the sample sizes were so small that the t distribution were
practically different from the normal distribution, you would be unable to
check these assumptions.
5. Respond to the following:
Here is another example of deliberate mystification—the idea of formulating
and testing a null hypothesis. Let’s take Example A of Section 11.2.1. It
seems to me that it is inconceivable that the expected values of any two
methods of measurement could be exactly equal. It is certain that there will
be subtle differences at the very least. What is the sense, then, in testing
H0 : μ X = μY ?
6. Respond to the following:
I have two batches of numbers and I have a corresponding x̄ and ȳ. Why
should I test whether they are equal when I can just see whether they are or
not?
7. In the development of Section 11.2.1, where are the following assumptions used?
(1) X 1 , X 2 , . . . , X n are independent random variables; (2) Y1 , Y2 , . . . , Yn are
independent random variables; (3) the X ’s and Y ’s are independent.
11.6 Problems
461
8. An experiment to determine the efficacy of a drug for reducing high blood pressure
is performed using four subjects in the following way: two of the subjects are
chosen at random for the control group and two for the treatment group. During
the course of treatment with the drug, the blood pressure of each of the subjects in
the treatment group is measured for ten consecutive days as is the blood pressure
of each of the subjects in the control group.
a. In order to test whether the treatment has an effect, do you think it is appropriate
to use the two-sample t test with n = m = 20?
b. Do you think it is appropriate to use the Mann-Whitney test with n = m = 20?
9. Referring to the data in Section 11.2.1.1, compare iron retention at concentrations of 10.2 and .3 millimolar using graphical procedures and parametric and
nonparametric tests. Write a brief summary of your conclusions.
10. Verify that the two-sample t test at level α of H0 : μ X = μY versus H A : μ X = μY
rejects if and only if the confidence interval for μ X − μY does not contain zero.
11. Explain how to modify the t test of Section 11.2.1 to test H0 : μ X = μY + versus H A : μ X = μY + where is specified.
12. An equivalence between hypothesis tests and confidence intervals was demonstrated in Chapter 9. In Chapter 10, a nonparametric confidence interval for the
median, η, was derived. Explain how to use this confidence interval to test the
hypothesis H0 : η = η0 . In the case where η0 = 0, show that using this approach
on a sample of differences from a paired experiment is equivalent to the sign
test. The sign test counts the number of positive differences and uses the fact
that in the case that the null hypothesis is true, the distribution of the number of
positive differences is binomial with (n, .5). Apply the sign test to the data from
the measurement of mercury levels, listed in Section 11.3.3.
13. Let X 1 , . . . , X 25 be i.i.d. N (.3, 1). Consider testing the null hypothesis H0 : μ = 0
versus H A : μ > 0 at significance level α = .05. Compare the power of the sign
test and the power of the test based on normal theory assuming that σ is known.
14. Suppose that X 1 , . . . , X n are i.i.d. N (μ, σ 2 ). To test the null hypothesis H0 : μ =
μ0 , the t test is often used:
X − μ0
t=
sX
Under H0 , t follows a t distribution with n − 1 df. Show that the likelihood ratio
test of this H0 is equivalent to the t test.
15. Suppose that n measurements are to be taken under a treatment condition and
another n measurements are to be taken independently under a control condition. It is thought that the standard deviation of a single observation is about 10
under both conditions. How large should n be so that a 95% confidence interval for μ X − μY has a width of 2? Use the normal distribution rather than the t
distribution, since n will turn out to be rather large.
16. Referring to Problem 15, how large should n be so that the test of H0 : μ X = μY
against the one-sided alternative H A : μ X > μY has a power of .5 if μ X − μY = 2
and α = .10?
462
Chapter 11
Comparing Two Samples
17. Consider conducting a two-sided test of the null hypothesis H0 : μ X = μY as
described in Problem 16. Sketch power curves for (a) α = .05, n = 20; (b) α =
.10, n = 20; (c) α = .05, n = 40; (d) α = .10, n = 40. Compare the curves.
18. Two independent samples are to be compared to see if there is a difference in the
population means. If a total of m subjects are available for the experiment, how
should this total be allocated between the two samples in order to (a) provide the
shortest confidence interval for μ X − μY and (b) make the test of H0 : μ X = μY
as powerful as possible? Assume that the observations in the two samples are
normally distributed with the same variance.
19. An experiment is planned to compare the mean of a control group to the mean
of an independent sample of a group given a treatment. Suppose that there are to
be 25 samples in each group. Suppose that the observations are approximately
normally distributed and that the standard deviation of a single measurement in
either group is σ = 5.
a. What will the standard error of Y − X be?
b. With a significance level α = .05, what is the rejection region of the test of
the null hypothesis H0: μY = μ X versus the alternative H A: μY > μ X ?
c. What is the power of the test if μY = μ X + 1?
d. Suppose that the p-value of the test turns out to be 0.07. Would the test reject
at significance level α = .10?
e. What is the rejection region if the alternative is H A : μY = μ X ? What is the
power if μY = μ X + 1?
20. Consider Example A of Section 11.3.1 using a Bayesian model. As in the example, use a normal model for the differences and also use an improper prior
for the expected difference and the precision (as in the case of unknown mean
and variance in Section 8.6). Find the posterior probability that the expected
difference is positive. Find a 90% posterior credibility interval for the expected
difference.
21. A study was done to compare the performances of engine bearings made
of different compounds (McCool 1979). Ten bearings of each type were tested.
The following table gives the times until failure (in units of millions of
cycles):
Type I
Type II
3.03
5.53
5.60
9.30
9.92
12.51
12.95
15.21
16.04
16.84
3.19
4.26
4.47
4.53
4.67
4.69
12.78
6.79
9.37
12.75
11.6 Problems
463
a. Use normal theory to test the hypothesis that there is no difference between
the two types of bearings.
b. Test the same hypothesis using a nonparametric method.
c. Which of the methods—that of part (a) or that of part (b)—do you think is
better in this case?
d. Estimate π, the probability that a type I bearing will outlast a type II bearing.
e. Use the bootstrap to estimate the sampling distribution of π̂ and its standard
error.
f. Use the bootstrap to find an approximate 90% confidence interval for π .
22. An experiment was done to compare two methods of measuring the calcium
content of animal feeds. The standard method uses calcium oxalate precipitation
followed by titration and is quite time-consuming. A new method using flame
photometry is faster. Measurements of the percent calcium content made by each
method of 118 routine feed samples (Heckman 1960) are contained in the file
calcium. Analyze the data to see if there is any systematic difference between
the two methods. Use both parametric and nonparametric tests and graphical
methods.
23. Let X 1 , . . . , X n be i.i.d. with cdf F, and let Y1 , . . . , Ym be i.i.d. with cdf G. The
hypothesis to be tested is that F = G. Suppose for simplicity that m + n is even
so that in the combined sample of X ’s and Y ’s, (m + n)/2 observations are less
than the median and (m + n)/2 are greater.
a. As a test statistic, consider T , the number of X ’s less than the median of the
combined sample. Show that T follows a hypergeometric distribution under
the null hypothesis:
(m + n)/2 (m + n)/2
t
n−t
P(T = t) =
m+n
n
Explain how to form a rejection region for this test.
b. Show how to find a confidence interval for the difference between the median
of F and the median of G under the shift model, G(x) = F(x − ). (Hint:
Use the order statistics.)
c. Apply the results (a) and (b) to the data of Problem 21.
24. Find the exact null distribution of the Mann-Whitney statistic, UY , in the case
where m = 3 and n = 2.
25. Referring to Example A in Section 11.2.1, (a) if the smallest observation for
method B (79.94) is made arbitrarily small, will the t test still reject? (b) If the
largest observation for method B (80.03) is made arbitrarily large, will the t test
still reject? (c) Answer the same questions for the Mann-Whitney test.
26. Let X 1 , . . . , X n be a sample from an N (0, 1) distribution and let Y1 , . . . , Yn be
an independent sample from an N (1, 1) distribution.
a. Determine the expected rank sum of the X ’s.
b. Determine the variance of the rank sum of the X ’s.
464
Chapter 11
Comparing Two Samples
27. Find the exact null distribution of W+ in the case where n = 4.
28. For n = 10, 20, and 30, find the .05 and .01 critical values for a two-sided signed
rank test from the tables and then by using the normal approximation. Compare
the values.
29. (Permutation Test for Means) Here is another view on hypothesis testing that
we will illustrate with Example A of Section 11.2.1. We ask whether the measurements produced by methods A and B are identical or exchangeable in the
,
following sense. There are 13 + 8 = 21 measurements in all and there are 21
8
or about 2 × 105 , ways that 8 of these could be assigned to method B. Is the
particular assignment we have observed unusual among these in the sense that
the means of the two samples are unusually different?
a. It’s
21not inconceivable, but it may be asking too much for you to generate all
partitions. So just choose a random sample of these partitions, say of size
8
1000, and make a histogram of the resulting values of X A − X B . Where on
this distribution does the value of X A − X B that was actually observed fall?
Compare to the result of Example B of Section 11.2.1.
b. In what way is this procedure similar to the Mann-Whitney test?
30. Use the bootstrap to estimate the standard error of and a confidence interval for
X A − X B and compare to the result of Example A of Section 11.2.1.
31. In Section 11.2.3, if F = G, what are E(π̂ ) and Var(π̂ )? Would there be any
advantage in using equal sample sizes m = n in estimating π or does it make no
difference?
32. If X ∼ N (μ X , σ X2 ) and Y is independent N (μY , σY2 ), what is π = P(X < Y ) in
terms of μ X , μY , σ X , and σY ?
33. To compare two variances in the normal case, let X 1 , . . . , X n be i.i.d. N (μ X , σ X2 ),
and let Y1 , . . . , Ym be i.i.d. N (μY , σY2 ), where the X ’s and Y ’s are independent
samples. Argue that under H0 : σ X = σY ,
s X2
∼ Fn−1, m−1
sY2
a. Construct rejection regions for one- and two-sided tests of H0 .
b. Construct a confidence interval for the ratio σ X2 /σY2 .
c. Apply the results of parts (a) and (b) to Example A in Section 11.2.1. (Caution: This test and confidence interval are not robust against violations of the
assumption of normality.)
34. This problem contrasts the power functions of paired and unpaired designs. Graph
and compare the power curves for testing H0 : μ X = μY for the following two
designs.
a. Paired: Cov(X i , Yi ) = 50, σ X = σY = 10, i = 1, . . . , 25.
b. Unpaired: X 1 , . . . , X 25 and Y1 , . . . , Y25 are independent with variance as in
part (a).
11.6 Problems
465
35. An experiment was done to measure the effects of ozone, a component of smog.
A group of 22 seventy-day-old rats were kept in an environment containing ozone
for 7 days, and their weight gains were recorded. Another group of 23 rats of a
similar age were kept in an ozone-free environment for a similar time, and their
weight gains were recorded. The data (in grams) are given below. Analyze the
data to determine the effect of ozone. Write a summary of your conclusions.
[This problem is from Doksum and Sievers (1976) who provide an interesting
analysis.]
Controls
41.0
25.9
13.1
−16.9
15.4
22.4
29.4
26.0
38.4
21.9
27.3
17.4
27.4
17.7
21.4
26.6
Ozone
24.9
18.3
28.5
21.8
19.2
26.0
22.7
10.1
7.3
−9.9
17.9
6.6
39.9
−14.7
−9.0
6.1
14.3
6.8
−12.9
12.1
−15.9
44.1
20.4
15.5
28.2
14.0
15.7
54.6
−9.0
36. Lin, Sutton, and Qurashi (1979) compared microbiological and hydroxylamine
methods for the analysis of ampicillin dosages. In one series of experiments, pairs
of tablets were analyzed by the two methods. The data in the following table give
the percentages of claimed amount of ampicillin found by the two methods in
several pairs of tablets. What are X − Y and s X −Y ? If the pairing had been erroneously ignored and it had been assumed that the two samples were independent,
what would have been the estimate of the standard deviation of X − Y ? Analyze the data to determine if there is a systematic difference between the two
methods.
Microbiological Method
Hydroxylamine Method
97.2
105.8
99.5
100.0
93.8
79.2
72.0
72.0
69.5
20.5
95.2
90.8
96.2
96.2
91.0
97.2
97.8
96.2
101.8
88.0
74.0
75.0
67.5
65.8
21.2
94.8
95.8
98.0
99.0
100.2
466
Chapter 11
Comparing Two Samples
37. Stanley and Walton (1961) ran a controlled clinical trial to investigate the effect
of the drug stelazine on chronic schizophrenics. The trials were conducted on
chronic schizophrenics in two closed wards. In each of the wards, the patients were
divided into two groups matched for age, length of time in the hospital, and score
on a behavior rating sheet. One member of each pair was given stelazine, and the
other a placebo. Only the hospital pharmacist knew which member of each pair
received the actual drug. The following table gives the behavioral rating scores
for the patients at the beginning of the trial and after 3 mo. High scores are good.
Ward A
Stelazine
Before
Placebo
After
2.3
2.0
1.9
3.1
2.2
2.3
2.8
1.9
1.1
Before
After
2.4
2.2
2.1
2.9
2.2
2.4
2.7
1.9
1.3
2.0
2.6
2.0
2.0
2.4
3.18
3.0
2.54
1.72
3.1
2.1
2.45
3.7
2.54
3.72
4.54
1.61
1.63
Ward B
Stelazine
Placebo
Before
After
Before
After
1.9
2.3
2.0
1.6
1.6
2.6
1.7
1.45
2.45
1.81
1.72
1.63
2.45
2.18
1.9
2.4
2.0
1.5
1.5
2.7
1.7
1.91
2.54
1.45
1.45
1.54
1.54
1.54
a. For each of the wards, test whether stelazine is associated with improvement
in the patients’ scores.
b. Test if there is any difference in improvement between the wards. [These data
are also presented in Lehmann (1975), who discusses methods of combining
the data from the wards.]
38. Bailey, Cox, and Springer (1978) used high-pressure liquid chromatography to
measure the amounts of various intermediates and by-products in food dyes. The
following table gives the percentages added and found for two substances in the
dye FD&C Yellow No. 5. Is there any evidence that the amounts found differ
systematically from the amounts added?
11.6 Problems
Sulfanilic Acid
467
Pyrazolone-T
Percentage Added
Percentage Found
Percentage Added
Percentage Found
.048
.096
.20
.19
.096
.18
.080
.24
0
.040
.060
.060
.091
.16
.16
.091
.19
.070
.23
0
.042
.056
.035
.087
.19
.19
.16
.032
.060
.13
.080
0
.031
.084
.16
.17
.15
.040
.076
.11
.082
0
39. An experiment was done to test a method for reducing faults on telephone lines
(Welch 1987). Fourteen matched pairs of areas were used. The following table
shows the fault rates for the control areas and for the test areas:
Test
Control
676
206
230
256
280
433
337
466
497
512
794
428
452
512
88
570
605
617
653
2913
924
286
1098
982
2346
321
615
519
a. Plot the differences versus the control rate and summarize what you see.
b. Calculate the mean difference, its standard deviation, and a confidence interval.
c. Calculate the median difference and a confidence interval and compare to the
previous result.
d. Do you think it is more appropriate to use a t test or a nonparametric method to
test whether the apparent difference between test and control could be due to
chance? Why? Carry out both tests and compare.
40. Biological effects of magnetic fields are a matter of current concern and research.
In an early study of the effects of a strong magnetic field on the development of
mice (Barnothy 1964), 10 cages, each containing three 30-day-old albino female
468
Chapter 11
Comparing Two Samples
mice, were subjected for a period of 12 days to a field with an average strength
of 80 Oe/cm. Thirty other mice housed in 10 similar cages were not placed in
a magnetic field and served as controls. The following table shows the weight
gains, in grams, for each of the cages.
a. Display the data graphically with parallel dotplots. (Draw two parallel number lines and put dots on one corresponding to the weight gains of the controls and on the other at points corresponding to the gains of the treatment
group.)
b. Find a 95% confidence interval for the difference of the mean weight
gains.
c. Use a t test to assess the statistical significance of the observed difference.
What is the p-value of the test?
d. Repeat using a nonparametric test.
e. What is the difference of the median weight gains?
f. Use the bootstrap to estimate the standard error of the difference of median
weight gains.
g. Form a confidence interval for the difference of median weight gains based
on the bootstrap approximation to the sampling distribution.
Field Present
Field Absent
22.8
10.2
20.8
27.0
19.2
9.0
14.2
19.8
14.5
14.8
23.5
31.0
19.5
26.2
26.5
25.2
24.5
23.8
27.8
22.0
ˆ = median(X i − Y j ),
41. The Hodges-Lehmann shift estimate is defined to be where X 1 , X 2 , . . . , X n are independent observations from a distribution F and
Y1 , Y2 , . . . , Ym are independent observations from a distribution G and are independent of the X i .
ˆ = μ X − μY .
a. Show that if F and G are normal distributions, then E()
ˆ
b. Why is robust to outliers?
ˆ for the previous problem and how does it compare to the differences
c. What is of the means and of the medians?
d. Use the bootstrap to approximate the sampling distribution and the standard
ˆ
error of .
e. From the bootstrap approximation to the sampling distribution, form an apˆ
proximate 90% confidence interval for .
11.6 Problems
469
42. Use the data of Problem 40 of Chapter 10.
a. Estimate π, the probability that more rain will fall from a randomly selected
seeded cloud than from a randomly selected unseeded cloud.
b. Use the bootstrap to estimate the standard error of π̂ .
c. Use the bootstrap to form an approximate confidence interval for π.
43. Suppose that X 1 , X 2 , . . . , X n and Y1 , Y2 , . . . , Ym are two independent samples.
As a measure of the difference in location of the two samples, the difference
of the 20% trimmed means is used. Explain how the bootstrap could be used to
estimate the standard error of this difference.
44. Interest in the role of vitamin C in mental illness in general and schizophrenia
in particular was spurred by a paper of Linus Pauling in 1968. This exercise
takes its data from a study of plasma levels and urinary vitamin C excretion in
schizophrenic patients (Subotic̆anec et al. 1986). Twenty schizophrenic patients
and 15 controls with a diagnosis of neurosis of different origin who had been
patients at the same hospital for a minimum of 2 months were selected for the
study. Before the experiment, all the subjects were on the same basic hospital
diet. A sample of 2 ml of venous blood for vitamin C determination was drawn
from each subject before breakfast and after the subjects had emptied their bladders. Each subject was then given 1 g ascorbic acid dissolved in water. No foods
containing ascorbic acid were available during the test. For the next 6 h all urine
was collected from the subjects for assay of vitamin C. A second blood sample
was also drawn 2 h after the dose of vitamin C.
The following two tables show the plasma concentrations (mg/dl).
Schizophrenics
Nonschizophrenics
0h
2h
0h
.55
.60
.21
.09
1.01
.24
.37
1.01
.26
.30
.26
.10
.42
.11
.14
.20
.09
.32
.24
.25
1.22
1.54
.97
.45
1.54
.75
1.12
1.31
.92
1.27
1.08
1.19
.64
.30
.24
.89
.24
1.68
.99
.67
1.27
.09
1.64
.23
.18
.12
.85
.69
.78
.63
.50
.62
.19
.66
.91
2h
2.00
.41
2.37
.41
.79
.94
1.72
1.75
1.60
1.80
2.08
1.58
.86
1.92
1.54
470
Chapter 11
Comparing Two Samples
a. Graphically compare the two groups at the two times and for the difference
in concentration at the two times.
b. Use the t test to assess the strength of the evidence for differences between
the two groups at 0 h, at 2 h, and the difference 2 h − 0 h.
c. Use the Mann-Whitney test to test the hypotheses of (b).
The following tables show the amounts of urinary vitamin C, both total
and milligrams per kilogram of body weight, for the two groups:
Schizophrenics
Nonschizophrenics
Total
mg/kg
Total
mg/kg
16.6
33.3
34.1
0.0
119.8
.1
25.3
359.3
6.6
.4
62.8
.2
13.0
0.0
0.0
5.9
.1
6.0
32.1
0.0
.19
.44
.39
.00
1.75
.01
.27
5.99
.10
.01
.68
.01
.15
0.00
0.00
.10
.01
.07
.42
0.00
289.4
0.0
620.4
0.0
8.5
5.5
43.2
91.7
200.9
113.8
102.2
108.2
36.9
122.0
101.9
3.96
0.00
7.95
0.00
.10
.09
.91
1.00
3.46
2.01
1.50
1.98
.49
1.72
1.52
d. Use descriptive statistics and graphical presentations to compare the two
groups with respect to total excretion and mg/kg body weight. Do the data
look normally distributed?
e. Use a t test to compare the two groups on both variables. Is the normality
assumption reasonable?
f. Use the Mann-Whitney test to compare the two groups. How do the results
compare with those obtained in part (e)?
The lower levels of plasma vitamin C in the schizophrenics before administration of ascorbic acid could be attributed to several factors. Interindividual
differences in the intake of meals cannot be excluded, despite the fact that
all patients were offered the same food. A more interesting possibility is
that the differences are the result of poorer resorption or of higher ascorbic
acid utilization in schizophrenics. In order to answer this question, another
11.6 Problems
471
experiment was run on 15 schizophrenics and 15 controls. All subjects were
given 70 mg of ascorbic acid daily for 4 weeks before the ascorbic acid loading test. The following table shows the concentration of plasma vitamin C
(mg/dl) and the 6-h urinary excretion (mg) after administration of 1 g ascorbic
acid.
Schizophrenics
Controls
Plasma
Urine
Plasma
Urine
.72
1.11
.96
1.23
.76
.75
1.26
.64
.67
1.05
1.28
.54
.77
1.11
.51
86.20
21.55
182.07
88.28
76.58
18.81
50.02
107.74
.09
113.23
34.38
8.44
109.03
144.44
172.09
1.02
.86
.78
1.38
.95
1.00
.47
.60
1.15
.86
.61
1.01
.77
.77
.94
190.14
149.76
285.27
244.93
184.45
135.34
157.74
125.65
164.98
99.65
86.29
142.23
144.60
265.40
28.26
g. Use graphical methods and descriptive statistics to compare the two groups
with respect to plasma concentrations and urinary excretion.
h. Use the t test to compare the two groups on the two variables. Does the
normality assumption look reasonable?
i. Compare the two groups using the Mann-Whitney test.
45. This and the next two problems are based on discussions and data in Le Cam
and Neyman (1967), which is devoted to the analysis of weather modification
experiments. The examples illustrate some ways in which principles of experimental design have been used in this field. During the summers of 1957 through
1960, a series of randomized cloud-seeding experiments were carried out in the
mountains of Arizona. Of each pair of successive days, one day was randomly
selected for seeding to be done. The seeding was done during a two-hour to
four-hour period starting at midday, and rainfall during the afternoon was measured by a network of 29 gauges. The data for the four years are given in the
following table (in inches). Observations in this table are listed in chronological
order.
a. Analyze the data for each year and for the years pooled together to see if there
appears to be any effect due to seeding. You should use graphical descriptive
methods to get a qualitative impression of the results and hypothesis tests to
assess the significance of the results.
472
Chapter 11
Comparing Two Samples
b. Why should the day on which seeding is to be done be chosen at random rather
than just alternating seeded and unseeded days? Why should the days be paired
at all, rather than just deciding randomly which days to seed?
1957
1958
Seeded Unseeded
0
.154
.154
.003
.084
.002
.157
.010
0
.002
.078
.101
.169
.139
.172
0
.008
.033
.035
.007
.140
.022
0
0
0
Seeded Unseeded
.074
.002
.318
.096
0
0
.050
.152
0
0
.002
.007
.013
.161
0
.274
.001
.122
.101
.012
.002
.066
.040
.013
0
.445
0
.079
.006
.008
.001
.001
.025
.046
.007
.019
0
0
.012
1959
Seeded Unseeded
.015
0
0
.021
0
.004
.010
0
.055
.004
.053
0
0
.090
.028
0
.032
.133
.083
0
0
.086
.006
.115
.090
0
0
0
.076
.090
0
.078
.121
1.027
.104
.023
.172
.002
0
1960
Seeded Unseeded
0
0
.042
0
0
0
.152
0
0
0
0
0
.008
.040
.003
.011
.010
0
.057
0
.093
.183
0
0
0
0
0
0
0
.060
.102
.041
0
46. The National Weather Bureau’s ACN cloud-seeding project was carried out in
the states of Oregon and Washington. Cloud seeding was accomplished by dispersing dry ice from an aircraft; only clouds that were deemed “ripe” for seeding
were candidates for seeding. On each occasion, a decision was made at random
whether to seed, the probability of seeding being 23 . This resulted in 22 seeded and
13 control cases. Three types of targets were considered, two of which are dealt
with in this problem. Type I targets were large geographical areas downwind from
the seeding; type II targets were sections of type I targets located so as to have,
theoretically, the greatest sensitivity to cloud seeding. The following table gives
the average target rainfalls (in inches) for the seeded and control cases, listed in
chronological order. Is there evidence that seeding has an effect on either type of
target? In what ways is the design of this experiment different from that of the
one in Problem 45?
11.6 Problems
Control Cases
473
Seeded Cases
Type I
Type II
Type I
Type II
.0080
.0046
.0549
.1313
.0587
.1723
.3812
.1720
.1182
.1383
.0106
.2126
.1435
.0000
.0000
.0053
.0920
.0220
.1133
.2880
.0000
.1058
.2050
.0100
.2450
.1529
.1218
.0403
.1166
.2375
.1256
.1400
.2439
.0072
.0707
.1036
.1632
.0788
.0365
.2409
.0408
.2204
.1847
.3332
.0676
.1097
.0952
.2095
.0200
.0163
.1560
.2885
.1483
.1019
.1867
.0233
.1067
.1011
.2407
.0666
.0133
.2897
.0425
.2191
.0789
.3570
.0760
.0913
.0400
.1467
47. During 1963 and 1964, an experiment was carried out in France; its design differed somewhat from those of the previous two problems. A 1500-km target area
was selected, and an adjacent area of about the same size was designated as the
control area; 33 ground generators were used to produce silver iodide to seed
the target area. Precipitation was measured by a network of gauges for each suitable “rainy period,” which was defined as a sequence of periods of continuous
precipitation between dry spells of a specified length. When a forecaster determined that the situation was favorable for seeding, he telephoned an order to
a service agent, who then opened a sealed envelope that contained an order to
actually seed or not. The envelopes had been prepared in advance, using a table
of random numbers. The following table gives precipitation (in inches) in the
target and control areas for the seeded and unseeded periods.
a. Analyze the data, which are listed in chronological order, to see if there is an
effect of seeding.
b. The analysis done by the French investigators used the square root transformation in order to make normal theory more applicable. Do you think that
taking the square root was an effective transformation for this purpose?
c. Reflect on the nature of this design. In particular, what advantage is there to
using the control area? Why not just compare seeded and unseeded periods
on the target area?
474
Chapter 11
Comparing Two Samples
Seeded
Unseeded
Target
Control
Target
Control
1.6
28.1
7.8
4.0
9.6
0.2
18.7
16.5
4.6
9.3
3.5
0.1
11.5
0.0
9.3
5.5
70.2
0.7
38.6
11.3
3.3
8.9
11.1
64.3
16.6
7.3
3.2
23.9
0.6
1.0
27.0
.3
6.0
12.6
0.5
8.7
21.5
13.9
6.7
4.5
0.7
8.7
0.0
10.7
4.7
29.1
1.9
34.7
10.2
2.7
2.8
4.3
38.7
11.1
6.5
3.0
13.6
0.1
1.1
3.5
2.6
2.6
9.8
5.6
.1
0.0
17.7
19.4
8.9
10.6
10.2
16.0
9.7
21.4
6.1
24.3
20.9
60.2
15.2
2.7
0.3
12.2
2.2
23.3
9.9
2.2
5.2
0.0
2.0
4.9
8.5
3.5
1.1
11.0
19.8
5.3
8.9
4.5
13.0
21.1
15.9
19.5
16.3
6.3
47.0
10.8
4.8
0.0
5.7
5.1
30.6
3.7
48. Proteinuria, the presence of excess protein in urine, is a symptom of renal (kidney)
distress among diabetics. Taguma et al. (1985) studied the effects of captopril for
treating proteinuria in diabetics. Urinary protein was measured for 12 patients
before and after eight weeks of captopril therapy. The amounts of urinary protein
(in g/24 hrs) before and after therapy are shown in the following table. What
can you conclude about the effect of captopril? Consider using parametric or
nonparametric methods and analyzing the data on the original scale or on a log
scale.
11.6 Problems
Before
After
24.6
17.0
16.0
10.4
8.2
7.9
8.2
7.9
5.8
5.4
5.1
4.7
10.1
5.7
5.6
3.4
6.5
0.7
6.5
0.7
6.1
4.7
2.0
2.9
475
49. Egyptian researchers, Kamal et al. (1991), took a sample of 126 police officers
subject to inhalation of vehicle exhaust in downtown Cairo and found an average
blood level concentration of lead equal to 29.2 g/dl with a standard deviation
of 7.5 g/dl. A sample of 50 policemen from a suburb, Abbasia, had an average
concentration of 18.2 g/dl and a standard deviation of 5.8 g/dl. Form a confidence interval for the population difference and test the null hypothesis that there
is no difference in the populations.
50. The file bodytemp contains normal body temperature readings (degrees
Fahrenheit) and heart rates (beats per minute) of 65 males (coded by 1) and
65 females (coded by 2) from Shoemaker (1996).
a. Using normal theory, form a 95% confidence interval for the difference of
mean body temperatures between males and females. Is the use of the normal
approximation reasonable?
b. Using normal theory, form a 95% confidence interval for the difference of
mean heart rates between males and females. Is the use of the normal approximation reasonable?
c. Use both parametric and nonparametric tests to compare the body temperatures and heart rates. What do you conclude?
51. A common symptom of otitis-media (inflamation of the middle ear) in young
children is the prolonged presence of fluid in the middle ear, called middle-ear
effusion. It is hypothesized that breast-fed babies tend to have less prolonged
effusions than do bottle-fed babies. Rosner (2006) presents the results of a study
of 24 pairs of infants who were matched according to sex, socioeconomic status,
and type of medication taken. One member of each pair was bottle-fed and the
other was breast-fed. The file ears gives the durations (in days) of middle-ear
effusions after the first episode of otitis-media.
a. Examine the data using graphical methods and summarize your conclusions.
b. In order to test the hypothesis of no difference, do you think it is more appropriate to use a parametric or a nonparametric test? Carry out a test. What do
you conclude?
476
Chapter 11
Comparing Two Samples
52. The media often present short reports of the results of experiments. To the critical reader or listener, such reports often raise more questions than they answer.
Comment on possible pitfalls in the interpretation of each of the following.
a. It is reported that patients whose hospital rooms have a window recover faster
than those whose rooms do not.
b. Nonsmoking wives whose husbands smoke have a cancer rate twice that of
wives whose husbands do not smoke.
c. A 2-year study in North Carolina found that 75% of all industrial accidents in
the state happened to workers who had skipped breakfast.
d. A school integration program involved busing children from minority schools
to majority (primarily white) schools. Participation in the program was voluntary. It was found that the students who were bused scored lower on standardized tests than did their peers who chose not to be bused.
e. When a group of students were asked to match pictures of newborns with
pictures of their mothers, they were correct 36% of the time.
f. A survey found that those who drank a moderate amount of beer were healthier
than those who totally abstained from alcohol.
g. A 15-year study of more than 45,000 Swedish soldiers revealed that heavy
users of marijuana were six times more likely than nonusers to develop
schizophrenia.
h. A University of Wisconsin study showed that within 10 years of the wedding,
38% of those who had lived together before marriage had split up, compared
to 27% of those who had married without a “trial period.”
i. A study of nearly 4,000 elderly North Carolinians has found that those who
attended religious services every week were 46% less likely to die over a
six-year period than people who attended less often or not at all, according to
researchers at Duke University Medical Center.
53. Explain why in Levine’s experiment (Example A in Section 11.3.1) subjects also
smoked cigarettes made of lettuce leaves and unlit cigarettes.
54. This example is taken from an interesting article by Joiner (1981) and from data in
Ryan, Joiner, and Ryan (1976). The National Institute of Standards and Technology supplies standard materials of many varieties to manufacturers and other parties, who use these materials to calibrate their own testing equipment. Great pains
are taken to make these reference materials as homogeneous as possible. In an experiment, a long homogeneous steel rod was cut into 4-inch lengths, 20 of which
were randomly selected and tested for oxygen content. Two measurements were
made on each piece. The 40 measurements were made over a period of 5 days,
with eight measurements per day. In order to avoid possible bias from time-related
trends, the sequence of measurements was randomized. The file steelrods
contains the measurements. There is an unexpected systematic source of variability in these data. Can you find it by making an appropriate plot? Would this effect
have been detectable if the measurements had not been randomized over time?
CHAPTER
12
The Analysis of Variance
12.1 Introduction
Chapter 11 was concerned with the analysis of data arising from experimental designs
with two samples. Experiments frequently involve more than two samples; they may
compare several treatments, such as different drugs, and perhaps other factors, such
as sex, at the same time. This chapter is an introduction to the statistical analysis
of such experiments. The methods we will discuss are called analysis of variance.
Contrary to what this phrase seems to imply, we will be primarily concerned with the
comparison of the means of the data, not their variances. We will consider the two
most elementary multisample designs: the one-way and two-way layouts. Methods
based on the normal distribution and nonparametric methods will be developed.
12.2 The One-Way Layout
A one-way layout is an experimental design in which independent measurements
are made under each of several treatments. The techniques we will introduce are thus
generalizations of the techniques for comparing two independent samples that were
covered in Chapter 11.
In this section, we will use as an example data from Kirchhoefer (1979), who
studied the measurement of chlorpheniramine maleate in tablets. Measurements of
composites that had nominal dosages equal to 4 mg were made by seven laboratories,
each laboratory making 10 measurements. Data is shown in the following table. There
are two possible sources of variability in the data: variability within labs and variability
between labs.
477
Chapter 12
The Analysis of Variance
Lab 1
Lab 2
Lab 3
Lab 4
Lab 5
Lab 6
Lab 7
4.13
4.07
4.04
4.07
4.05
4.04
4.02
4.06
4.10
4.04
3.86
3.85
4.08
4.11
4.08
4.01
4.02
4.04
3.97
3.95
4.00
4.02
4.01
4.01
4.04
3.99
4.03
3.97
3.98
3.98
3.88
3.88
3.91
3.95
3.92
3.97
3.92
3.90
3.97
3.90
4.02
3.95
4.02
3.89
3.91
4.01
3.89
3.89
3.99
4.00
4.02
3.86
3.96
3.97
4.00
3.82
3.98
3.99
4.02
3.93
4.00
4.02
4.03
4.04
4.10
3.81
3.91
3.96
4.05
4.06
Figure 12.1, a boxplot of these data, shows some variation in the medians among
the seven labs, as well as some variation in the interquartile ranges. It appears from
the figure that there may be some systematic differences between the labs and that
there is less variability in some labs than in others. We will discuss the following
question: Are the differences in the means of the measurements from the various labs
significant, or might they be due to chance?
4.15
Amount of chlorpheniramine maleate (mg)
478
4.10
4.05
4.00
3.95
3.90
3.85
3.80
Lab 1 Lab 2 Lab 3 Lab 4 Lab 5 Lab 6 Lab 7
F I G U R E 12.1 Boxplots of determinations of amounts of chlorpheniramine maleate
in tablets by seven laboratories.
12.2.1 Normal Theory; the F Test
We first discuss the analysis of variance and the F test in the case of I groups, each
containing J samples. The I groups will be referred to generically as treatments, or
levels. (In the preceding example, I = 7 and J = 10. We will discuss the case of
unequal sample sizes later.)
12.2 The One-Way Layout
479
We first define some notation and introduce the basic model. Let
Yi j = the jth observation of the ith treatment
Our model is that the observations are corrupted by random errors and that the error in
one observation is independent of the errors in the other observations. The statistical
model is
Yi j = μ + αi + εi j
Here μ is the overall mean level, αi is the differential effect of the ith treatment, and
εi j is the random error in the jth observation under the ith treatment. The errors are
assumed to be independent, normally distributed with mean zero and variance σ 2 .
The αi are normalized:
I
αi = 0
i=1
The expected response to the ith treatment is E(Yi j ) = μ + αi . Thus, if αi = 0, for
i = 1, . . . , I , all treatments have the same expected response, and, in general, αi − α j
is the difference between the expected values under treatments i and j. We will derive
a test for the null hypothesis, which is that all the means are equal.
The analysis of variance is based on the following identity:
J
I
i=1
(Yi j − Y .. )2 =
j=1
J
I
i=1
(Yi j − Y i. )2 + J
j=1
I
(Y i. − Y .. )2
i=1
where
J
1 Y i. =
Yi j
J j=1
is the average of the observations under the ith treatment and
Y .. =
J
I
1 Yi j
I J i=1 j=1
is the overall average. The terms appearing in the first identity above are called sums
of squares, and the identity may be symbolically expressed as
SSTOT = SSW + SS B
In words, this means that the total sum of squares equals the sum of squares within
groups plus the sum of squares between groups. The terminology reflects that S SW
is a measure of the variation of the data within the treatment groups and that SS B is
a measure of the variation of the treatment means among or between treatments.
To establish the identity, we express the left-hand side as
J
I
i=1
j=1
(Yi j − Y .. )2 =
J
I
i=1
j=1
[(Yi j − Y i. ) + (Y i. − Y .. )]2
480
Chapter 12
The Analysis of Variance
=
J
I
i=1
+2
=
j=1
J
i=1
j=1
+2
j=1
I
J
I
i=1
I
J
I
i=1
(Yi j − Y i. )2 +
(Y i. − Y .. )2
j=1
(Yi j − Y i. )(Y i. − Y .. )
(Yi j − Y i. ) +
2
J
I
i=1
(Y i. − Y .. )
i=1
J
(Y i. − Y .. )2
j=1
(Yi j − Y i. )
j=1
The last term of the final expression vanishes because the sum of deviations from a
mean is zero.
As we will see, the basic idea underlying analysis of variance is the comparison
of the sizes of various sums of squares. We can calculate the expected values of the
sums of squares defined previously using the following lemma.
LEMMA A
Let X i , where i = 1, . . . , n, be independent random variables with E(X i ) = μi
and Var(X i ) = σ 2 . Then
n−1 2
σ
E(X i − X )2 = (μi − μ)2 +
n
where
n
1
μ̄ =
μi
n i=1
Proof
We use the fact that E(U 2 ) = [E(U )]2 + Var(U ) for any random variable U with
finite variance. The first term on the right-hand side of the equation in the lemma
follows immediately. For the second term, we have to calculate Var(X i − X ):
Var(X i − X ) = Var(X i ) + Var(X ) − 2Cov(X i , X )
and
Var(X i ) = σ 2
1
Var(X ) = σ 2
n Cov(X i , X ) = Cov
n
1
Xi ,
Xj
n j=1
=
1 2
σ
n
(Here we have used Cov(X i , X j ) = 0 if i = j, since the X ’s are independent.) Putting these results together proves the lemma.
■
12.2 The One-Way Layout
481
Lemma A may be applied to the sums of squares discussed before, yielding the
following theorem.
THEOREM A
Under the assumptions for the model stated at the beginning of this section,
E(SSW ) =
I
J
i=1
=
E(Yi j − Y i. )2
j=1
I
J
J −1 2
σ
J
j=1
i=1
= I (J − 1)σ 2
Here we have used Lemma A, with the role of X i being played by Yi j and that
of X being played by Y i. . The second line then follows since E(Yi j ) = E(Y i ) =
μ + αi . To find E(SS B ), we again use the lemma with Y i. and Y .. in place of X i
and X :
I
E(Y i. − Y .. )2
E(SS B ) = J
i=1
I
(I − 1)σ 2
2
=J
αi +
IJ
i=1
=J
I
αi2 + (I − 1)σ 2
■
i=1
SSW may be used to estimate σ 2 ; the estimate is
s 2p =
SSW
I (J − 1)
which is unbiased. The subscript p stands for pooled. Estimates of σ 2 from the I
treatments are pooled together, since SSW can be written as
SSW =
I
(J − 1)si2
i=1
where si2 is the sample variance in the ith group.
If all the αi are equal to zero, then the expectation of SS B /(I − 1) is also σ 2 .
Thus, in this case, SSW /[I (J − 1)] and SS B /(I − 1) should be about equal. If some
of the αi are nonzero, SS B will be inflated. We next develop a method of comparing
the two sums of squares to find a test statistic for testing the null hypothesis that all
the αi are equal. Under the assumption that the errors are normally distributed, the
probability distributions of the sums of squares can be calculated.
482
Chapter 12
The Analysis of Variance
THEOREM B
If the errors are independent and normally distributed with means 0 and variances
σ 2 , then SSW /σ 2 follows a chi-square distribution with I (J − 1) degrees of
freedom. If, additionally, the αi are all equal to zero, then SS B /σ 2 follows a chisquare distribution with I − 1 degrees of freedom and is independent of SSW .
Proof
We first consider SSW . From Theorem B of Section 6.3,
J
1 (Yi j − Y i. )2
σ 2 j=1
follows a chi-square distribution with J − 1 degrees of freedom. There are I such
sums in SSW , and they are independent of each other since the observations are
independent. The sum of I independent chi-square random variables that each
have J − 1 degrees of freedom follows a chi-square distribution with I (J − 1)
degrees of freedom. Theorem B of Section 6.3 can also be applied to SS B , noting
that Var(Y i. ) = σ 2 /J .
We next prove that the two sums of squares are independent of each other.
SSW is a function of the vector U, which has elements Yi j − Y i. , where i =
1, . . . , I and j = 1, . . . , J . SS B is a function of the vector V, whose elements
are Y i. , where i = 1, . . . , I , since Y .. can be obtained from the Y i. . Thus, it is
sufficient to show that these two vectors are independent of each other. First, if
i = i , Yi j − Y i. and Y i . are independent since they are functions of different
observations. Second, Yi j − Y i. and Y i. are independent by Theorem A of Section
■
6.3. This completes the proof of the theorem.
The statistic
F=
SS B /(I − 1)
SSW /[I (J − 1)]
is used to test the following null hypothesis:
H0 : α1 = α2 = · · · = α I = 0
By Theorem A, the denominator of the F statistic has expected value equal to σ 2 ,
I
αi2 + σ 2 . Thus, if the null
and the expectation of the numerator is J (I − 1)−1 i=1
hypothesis is true, the F statistic should be close to 1, whereas if it is false, the statistic
should be larger. If the null hypothesis is false, the numerator reflects variation between
the different groups as well as variation within groups, whereas the denominator
reflects only variation within groups. The hypothesis is thus rejected for large values
of F. As usual, in order to apply this test, we must know the null distribution of the
test statistic.
12.2 The One-Way Layout
483
THEOREM C
Under the assumption that the errors are normally distributed, the null distribution
of F is the F distribution with (I − 1) and I (J − 1) degrees of freedom.
Proof
The theorem follows from Theorem B and from the definition of the F distribution
(Section 6.2), since, under H0 , F is the ratio of two independent chi-square
■
random variables divided by their degrees of freedom.
Percentage points of the F distribution are widely tabled. It can show that, under
the normality assumption, the F test is equivalent to the likelihood ratio test.
EXAMPLE A
We can illustrate the use of the F statistic by applying it to the tablet data from Section
12.2. In doing so, we adopt an explicit statistical model for the variability seen in Figure
12.1. According to this model, there is an unknown mean level associated with each
laboratory, and the deviations from this mean level of the 10 measurements within a
laboratory are independent, normally distributed, random variables. With the aid of
this model, we will see whether it is plausible that the unknown laboratory means are
all equal, so that the variability between labs displayed in Figure 12.1 is entirely due
to chance.
The sums of squares defined previously are calculated and presented in a table
called the analysis of variance table:
Source
df
SS
MS
F
Labs
Error
Total
6
63
69
.125
.231
.356
.021
.0037
5.66
In the table, SSW is the sum of squares due to error, and SS B is the sum of squares
due to labs. M S stands for mean square and equals the sum of squares divided by
the degrees of freedom. The column headed F gives the F statistic for testing the
null hypothesis that there is no systematic difference among the seven labs. The F
statistic has 6 and 63 df and a value of 5.66. This particular combination of degrees
of freedom is not included in Table 5 of Appendix B, but upon examining the entries with 6 and 60 df, it is clear that the p-value is less than .01. We may thus
conclude that the means of the measurements from the various labs are significantly
different.
Figure 12.2 is a normal probability plot of the residuals from the analysis of
variance model (the residuals are formed by simply subtracting from the measurements of each lab the mean value for that lab). There is some indication of deviation
Chapter 12
The Analysis of Variance
.15
.10
.05
Ordered residuals
484
0
.05
.10
.15
.20
3
2
1
0
1
Normal quantiles
2
3
F I G U R E 12.2 Normal probability plot of residuals from one-way analyses of
variance of tablet data.
from normality in the lower tail of the distribution, but the data do not appear grossly
■
nonnormal.
We now outline the procedure for the case in which the numbers of observations
under the various treatments are not necessarily equal. The only difficulties with this
case are algebraic; conceptually, the analysis is the same as for the case of equal sample
sizes. Suppose that there are Ji observations under treatment i, for i = 1, . . . , I . The
basic identity still holds; that is, we have
Ji
I
i=1
j=1
(Yi j − Y .. ) =
2
Ji
I
i=1
(Yi j − Y i. ) +
2
j=1
I
Ji (Y i − Y .. )2
i=1
By reasoning similar to that used here for the simple case, it can be shown that
E(SSW ) = σ 2
I
(Ji − 1)
i=1
E(SS B ) = (I − 1)σ 2 +
I
Ji αi2
i=1
I
The degrees of freedom for these sums of squares are i=1
Ji − I and I − 1,
respectively. It may be argued, as in the proof of Theorem B, that the normalized
sums of squares follow chi-square distributions and that the ratio of mean squares
follows an F distribution under the null hypothesis of no treatment differences.
12.2 The One-Way Layout
485
To conclude this section, let us review the basic assumptions of the model and
comment on their importance. The model is
Yi j = μ + αi + εi j
We assume the following:
1. The εi j are normally distributed. The F test, like the t test, remains approximately
valid for moderate to large samples from moderately nonnormal distributions.
2. The error variance, σ 2 , is constant. In many applications, the error variances may
be different in different groups. For example, Figure 12.1 suggests that some
labs may be more precise in their measurements than others. Fortunately, if there
are an equal number of observations in each group, the F test is not strongly
affected.
3. The εi j are independent. This assumption is very important, both for normal theory
and for the nonparametric analysis we will present later.
12.2.2 The Problem of Multiple Comparisons
The application of the F test in Example A in Section 12.2.1 has an anticlimactic
character. We concluded that the means of measurements from different labs are not
all equal, but the test gives no information about how they differ, in particular about
which pairs are significantly different. In many applications, the null hypothesis is
a “straw man” that is not seriously entertained. Real interest may be focused on
comparing pairs or groups of treatments and estimating the treatment means and
their differences. A naive approach would be to compare all pairs of treatment means
using t tests. The difficulty with such a procedure was pointed out in the section on
experimental design in Chapter 11: Although each individual comparison would have
a type I error rate of α, the collection of all comparisons considered simultaneously
would not. In this section, we discuss two solutions to this problem—Tukey’s method
and the Bonferroni method. More discussion can be found in Miller (1981).
12.2.2.1 Tukey's Method Tukey’s method is used to construct confidence intervals for the differences of all pairs of means in such a way that the intervals simultaneously have a set coverage probability. The duality of confidence intervals and tests
can then be used to determine which particular pairs are significantly different.
If the sample sizes are all equal and the errors are normally distributed with
a constant variance, the centered sample means, Y i. − μi , are independent and
normally distributed with means 0 and variances σ 2 /J , which may be estimated by
s 2p /J . Tukey’s method is based on the probability distribution of the random variable
max
i 1 ,i 2
|(Y i1 . − μi1 ) − (Y i2 . − μi2 )|
√
sp/ J
where the maximum is taken over all pairs i 1 and i 2 . This distribution is called the
studentized range distribution with parameters I (the number of samples being
compared) and I (J − 1) (the degrees of freedom in s p ). The upper 100α percentage
486
Chapter 12
The Analysis of Variance
point of the distribution is denoted by q I,I (J −1) (α). Now,
sp
P |(Y i1 . − μi1 ) − (Y i2 . − μi2 )| ≤ q I,I (J −1) (α) √ , for all i 1 and i 2
J
sp
= P max |(Y i1 . − μi1 ) − (Y i2 . − μi2 )| ≤ q I,I (J −1) (α) √
i 1 ,i 2
J
By definition, this latter probability equals 1 − α. The idea is that all the differences
are less than some number if and only if the largest difference is. The above probability statement can be converted directly into a set of confidence intervals that hold
simultaneously for all differences μi1 − μi2 with confidence level 100(1 − α)%. The
intervals are
sp
(Y i1 . − Y i2 . ) ± q I,I (J −1) (α) √
J
By the duality of confidence intervals and hypothesis tests, if the 100(1 − α)%
confidence interval for (Y i1 . − Y i2 . ) does not include zero—that is, if
sp
|Y i1 . − Y i2 . | > q I,I (J −1) (α) √
J
the null hypothesis that there is no difference between u i1 and u i2 may be rejected
at level α. Also, all such hypothesis tests considered collectively have level α.
EXAMPLE A
We can illustrate Tukey’s method by applying it to the tablet data of Section 12.2. We
list the labs in decreasing order of the mean of their measurements:
Lab
Mean
1
3
7
2
5
6
4
4.062
4.003
3.998
3.997
3.957
3.955
3.920
s p is the square root of the mean square for error in the analysis of variance table of
Example A of Section 12.2.1: s p = .06. The degrees of freedom of the appropriate
studentized range distribution are 7 and 63. Using 7 and 60 df in Table 6 of Appendix
B as an approximation, q7,60 (.05) = 4.31, two of the means in the preceding table are
significantly different at the .05 level if they differ by more than
sp
q7,63 (.05) √ = .082
J
The mean from lab 1 is thus significantly different from those from labs 4, 5, and 6;
The mean of lab 3 is significantly greater than that of lab 4. No other comparisons
are significant at the .05 level.
12.2 The One-Way Layout
487
At the 95% confidence level, the other differences in mean level that are seen
in Figure 12.1 cannot be judged to be significantly different from zero. Although
differences between these labs must certainly exist, we cannot reliably establish the
signs of the differences.
It is interesting to note that a price is paid here for performing multiple comparisons simultaneously. If separate t tests had been conducted using the pooled sample
variance, labs would have been declared significantly different if their means had
differed by more than
*
2
= .053
t63 (.025)s p
■
J
12.2.2.2 The Bonferroni Method The Bonferroni method was briefly introduced
in Section 11.4.8. The idea is very simple. If k null hypotheses are to be tested, a
desired overall type I error rate of at most α can be guaranteed by testing each null
hypothesis at level α/k. Equivalently, if k confidence intervals are each formed to
have confidence level 100(1 − α/k)%, they all hold simultaneously with confidence
level at least 100(1 − α)%.
The method is simple and versatile and, although crude, gives surprisingly good
results if k is not too large.
EXAMPLE A
To7 apply the Bonferroni method to the data on tablets, we note that there are k =
= 21 pairwise comparisons among the seven labs. A set of simultaneous 95%
2
confidence intervals for the pairwise comparisons is
(Y i1 . − Y i2 . ) ± s p
t63 (.025/21)
√
5
Special tables for such values of the t distribution have been prepared; from Table 7
of Appendix B, we find
.025
t60
= 3.16
20
which we will use as an approximation to t63 (.025/21), giving confidence intervals
(Y i1 . − Y i2 . ) ± .085
that we will use as an approximation. Given the crude nature of the Bonferroni method,
these are surprisingly close to the intervals produced by Tukey’s method, which have
a half-width of .082. Here, too, we conclude that lab 1 produced significantly higher
measurements than those of labs 4, 5, and 6.
■
A significant advantage of the Bonferroni method over Tukey’s method is that it
does not require equal sample sizes in each treatment.
488
Chapter 12
The Analysis of Variance
12.2.3 A Nonparametric Method—The Kruskal-Wallis Test
The Kruskal-Wallis test is a generalization of the Mann-Whitney test that is conceptually quite simple. The observations are assumed to be independent, but no particular
distributional form, such as the normal, is assumed. The observations are pooled
together and ranked. Let
Ri j = the rank of Yi j in the combined sample
Let
Ji
1 R i. =
Ri j
Ji j=1
be the average rank in the ith group. Let
R .. =
=
Ji
I
1 Ri j
N i=1 j=1
N +1
2
where N is the total number of observations. As in the analysis of variance, let
SS B =
I
Ji (R i. − R .. )2
i=1
be a measure of the dispersion of the R i. . SS B may be used to test the null hypothesis that the probability distributions generating the observations under the various
treatments are identical. The larger SS B is, the stronger is the evidence against the
null hypothesis. The exact null distribution of this statistic for various combinations
of I and Ji can be enumerated, as for the Mann-Whitney test. The null distribution
is commonly available in computer packages. Tables are given in Lehmann (1975)
and in references therein. For I = 3 and Ji ≥ 5 or I > 3 and Ji ≥ 4, a chisquare approximation to a normalized version of SS B is fairly accurate. Under the
null hypothesis that the probability distributions of the I groups are identical, the
statistic
K =
12
SS B
N (N + 1)
is approximately distributed as a chi-square random variable with I − 1 degrees of
freedom. The value of K can be found by running the ranks through an analysis of
variance program and multiplying SS B by 12/[N (N + 1)]. It can be shown that K
can also be expressed as
I
12
2
K =
Ji R i. − 3(N + 1)
N (N + 1) i=1
which is easier to compute by hand.
12.3 The Two-Way Layout
EXAMPLE A
489
For the data on the tablets, K = 29.51. Referring to Table 3 of Appendix B with 6 df,
we see that the p-value is less than .005. The nonparametric analysis, too, indicates
that there is a systematic difference among the labs.
■
Multiple comparison procedures for nonparametric methods are discussed in
detail in Miller (1981). The Bonferroni method requires no special discussion; it can
be applied to all comparisons tested by Mann-Whitney tests.
Like the Mann-Whitney test, the Kruskal-Wallis test makes no assumption of
normality and thus has a wider range of applicability than does the F test. It is
especially useful in small-sample situations. Also, because data are replaced by their
ranks, outliers will have less influence on this nonparametric test than on the F test.
In some applications, the data consist of ranks—for example, in a wine tasting, judges
usually rank the wines—which makes the use of the Kruskal-Wallis test natural.
12.3 The Two-Way Layout
A two-way layout is an experimental design involving two factors, each at two or
more levels. The levels of one factor might be various drugs, for example, and the
levels of the other factor might be genders. If there are I levels of one factor and
J of the other, there are I × J combinations. We will assume that K independent
observations are taken for each of these combinations. (The last section of this chapter
will outline the advantages of such an experimental design.)
The next section defines the parameters that we might want to estimate from a
two-way layout. Later sections present statistical methods based on normal theory
and nonparametric methods.
12.3.1 Additive Parametrization
To develop and illustrate the ideas in this section, we will use a portion of the data
contained in a study of electric range energy consumption (Fechter and Porter 1978).
The following table shows the mean number of kilowatt-hours used by three electric
ranges in cooking on each of three menu days (means are over several cooks).
Menu Day
Range 1
Range 2
Range 3
1
2
3
3.97
2.39
2.76
4.24
2.61
2.75
4.44
2.82
3.01
We wish to describe the variation in the numbers in this table in terms of the
effects of different ranges and different menu days. Denoting the number in the ith
490
Chapter 12
The Analysis of Variance
row and jth column by Yi j , we first calculate a grand average
3
3
1 Yi j = 3.22
μ̂ = Y .. =
9 i=1 j=1
This gives a measure of typical energy consumption per menu day.
The menu day means, averaged over the ranges, are
Y 1. = 4.22
Y 2. = 2.61
Y 3. = 2.84
We will define the differential effect of a menu day as the difference between the
mean for that day and the overall mean; we will denote these differential effects by
α̂i , where i = 1, 2, or 3.
α̂1 = Y 1. − Y .. = 1.00
α̂2 = Y 2. − Y .. = −.61
α̂3 = Y 3. − Y .. = −.38
(Note that, except for rounding error, the αi would sum to zero.) In words, on menu
day 1, 1 kwh more than the average is consumed, and so on.
The range means, averaged over the menu days, are
Y .1 = 3.04
Y .2 = 3.20
Y .3 = 3.42
The differential effects of the ranges are
β̂1 = Y .1 − Y .. = −.18
β̂2 = Y .2 − Y .. = −.02
β̂3 = Y .3 − Y .. = .20
The effects of the ranges are smaller than the effects of the menu days.
The preceding description of the values in the table incorporates an overall average level plus differential effects of ranges and menu days. This is a simple additive
model.
Ŷi j = μ̂ + α̂i + β̂ j
Here we use Ŷi j to denote the fitted or predicted values of Yi j from the additive model.
According to this additive model, the differences between the three ranges are the
same on all menu days. For example, for i = 1, 2, 3,
Ŷi1 − Ŷi2 = (μ̂ + α̂i + β̂1 ) − (μ̂ + α̂i + β̂2 )
= β̂1 − β̂2
Figure 12.3 shows that this is not quite the case. If the differences were exactly the same
on all menu days, the three lines would be exactly parallel. The differences between
menu days 1 and 2 appear nearly the same—the lines are nearly parallel. But on menu
day 3, the difference between ranges 2 and 3 increased and the difference between
12.3 The Two-Way Layout
491
4.5
Energy consumed (kWh)
4.0
3.5
3.0
2.5
2.0
1
2
Menu day
3
F I G U R E 12.3 Plot of energy consumption versus menu day for three electric
ranges. The dashed line corresponds to range 3, the dotted line to range 2, and the
solid line to range 1.
ranges 1 and 2 decreased. This phenomenon is called an interaction between menu
days and ranges—it is as if there were something about menu day 3 that especially
affected adversely the energy consumption of range 1 relative to range 2.
The differences of the observed values and the fitted values, Yi j − Ŷi j , are the
residuals from the additive model and are shown in the following table:
Menu Day
Range 1
Range 2
Range 3
1
2
3
−.07
−.04
.10
.04
.02
−.07
.02
.01
−.03
The residuals are small relative to the main effects, with the possible exception of
those for menu day 3.
Interactions can be incorporated into the model to make it fit the data exactly.
The residual in cell i j is
Yi j − μ̂ − α̂i − β̂ j = Yi j − Y.. − (Y i. − Y .. ) − (Y . j − Y .. )
= Yi j − Y i. − Y . j + Y ..
= δ̂i j
Note that
3
i=1
δ̂i j =
3
j=1
δ̂i j = 0
492
Chapter 12
The Analysis of Variance
For example,
3
δ̂i j =
3
i=1
(Yi j − Y i. − Y . j + Y .. )
i=1
= 3Y . j − 3Y .. − 3Y . j + 3Y ..
=0
In the preceding table of residuals, the row and column sums are not exactly zero
because of rounding errors. The model
Yi j = μ̂ + α̂i + β̂ j + δ̂i j
thus fits the data exactly; it is merely another way of expressing the numbers listed in
the table.
An additive model is simple and easy to interpret, especially in the absence of
interactions. Transformations of the data are sometimes used to improve the adequacy
of an additive model. The logarithmic transformation, for example, converts a multiplicative model into an additive one. Transformations are also used to stabilize the
variance (to make the variance independent of the mean) and to make normal theory
more applicable. There is no guarantee, of course, that a given transformation will
accomplish all these aims.
The discussion in this section has centered on the parametrization and interpretation of the additive model as used in the analysis of variance. We have not taken
into account the possibility of random errors and their effects on the inferences about
the parameters, but will do so in the next section.
12.3.2 Normal Theory for the Two-Way Layout
In this section, we will assume that there are K > 1 observations per cell in a two-way
layout. A design with an equal number of observations per cell is called balanced.
Let Yi jk denote the kth observation in cell i j; the statistical model is
Yi jk = μ + αi + β j + δi j + εi jk
We will assume that the random errors, εi jk , are independent and normally distributed
with mean zero and common variance σ 2 . Thus, E(Yi jk ) = μ + αi + β j + δi j . The
parameters satisfy the following constraints:
I
i=1
J
αi = 0
βj = 0
j=1
I
i=1
δi j =
J
j=1
δi j = 0
12.3 The Two-Way Layout
493
We now find the mle’s of the unknown parameters. Since the observations in cell
i j are normally distributed with mean μ + αi + β j + δi j and variance σ 2 , and since
all the observations are independent, the log likelihood is
l=−
J
K
I
I JK
1 log(2πσ 2 ) −
(Yi jk − μ − αi − β j − δi j )2
2
2σ 2 i=1 j=1 k=1
Maximizing the likelihood subject to the constraints given above yields the following
estimates (see Problem 17 at the end of this chapter):
μ̂ = Y ...
α̂i = Y i.. − Y ... ,
i = 1, . . . , I
β̂ j = Y . j. − Y ... ,
j = 1, . . . , J
δ̂i j = Y i j. − Y i.. − Y . j. + Y ...
as is expected from the discussion in Section 12.3.1.
Like one-way analysis of variance, two-way analysis of variance is conducted
by comparing various sums of squares. The sums of squares are as follows:
SS A = J K
I
(Y i.. − Y ... )2
i=1
J
SS B = I K
(Y . j. − Y ... )2
j=1
SS AB = K
I
J
i=1
SS E =
SST O T =
I
J
(Y i j. − Y i.. − Y . j. + Y ... )2
j=1
K
i=1
j=1 k=1
I
J K
i=1
j=1 k=1
(Yi jk − Y i j. )2
(Yi jk − Y ... )2
The sums of squares satisfy this algebraic identity:
SST O T = SS A + SS B + SS AB + SS E
This identity may be proved by writing
Yi jk − Y ... = (Yi jk − Y i j. ) + (Y i.. − Y ... ) + (Y . j. − Y ... )
+ (Y i j. − Y i.. − Y . j. + Y ... )
and then squaring both sides, summing, and verifying that the cross products vanish.
494
Chapter 12
The Analysis of Variance
The following theorem gives the expectations of these sums of squares.
THEOREM A
Under the assumption that the errors are independent with mean zero and variance
σ 2,
E(SS A ) = (I − 1)σ 2 + JK
I
αi2
i=1
E(SS B ) = (J − 1)σ 2 + IK
J
β 2j
j=1
E(SS AB ) = (I − 1)(J − 1)σ 2 + K
I
J
i=1
E(SS E ) = I J (K − 1)σ
δi2j
j=1
2
Proof
The results for S S A , S S B , and S S E follow from Lemma A of Section 12.2.1.
Applying the lemma to SST O T , we have
E(SST O T ) = E
J K
I
i=1
(Yi jk − Y ... )2
j=1 k=1
= (IJK − 1)σ 2 +
J K
I
i=1
= (IJK − 1)σ 2 + JK
(αi + β j + δi j )2
j=1 k=1
I
αi2 + I K
i=1
J
β 2j + K
j=1
J
I
i=1
δi2j
j=1
In the last step, we used the constraints on the parameters. For example, the cross
product involving αi and β j is
I
J
J K
I
αi β j = K
αi
βj
i=1
j=1 k=1
i=1
j=1
=0
The desired expression for E(SS AB ) now follows, since
E(SST O T ) = E(SS A ) + E(SS B ) + E(SS AB ) + E(SS E )
■
The distributions of these sums of squares are given by the following theorem.
12.3 The Two-Way Layout
495
THEOREM B
Assume that the errors are independent and normally distributed with means zero
and variances σ 2 . Then
a. SS E /σ 2 follows a chi-square distribution with IJ (K − 1) degrees of freedom.
b. Under the null hypothesis
H A : αi = 0, i = 1, . . . , I
SS A /σ 2 follows a chi-square distribution with I − 1 degrees of freedom.
c. Under the null hypothesis
HB : β j = 0, j = 1, . . . , J
SS B /σ 2 follows a chi-square distribution with J − 1 degrees of freedom.
d. Under the null hypothesis
H AB : δi j = 0, i = 1, . . . , I, j = 1, . . . , J
SS AB /σ 2 follows a chi-square distribution with (I − 1)(J − 1) degrees of
freedom.
e. The sums of squares are independently distributed.
Proof
We will not give a full proof of this theorem. The results for SS A , SS B , and SS E
follow from arguments similar to those used in proving Theorem B of Section
■
12.2.1. The result for SS AB requires some additional argument.
F tests of the various null hypotheses are conducted by comparing the appropriate
sums of squares to the sum of squares for error, as was done for the simpler case of the
one-way layout. The mean squares are the sums of squares divided by their degrees of
freedom and the F statistics are ratios of mean squares. When such a ratio is substantially larger than 1, the presence of an effect is suggested. Note, for example, that from
Theorem A, E(MS A ) = σ 2 + (JK /(I − 1)) i αi2 and that E(M S E ) = σ 2 . So if the
ratio MS A /MS E is large, it suggests that some of the αi are nonzero. The null distribution of this F statistic is the F distribution with (I − 1) and IJ (K − 1) degrees of
freedom, and knowing this null distribution allows us to assess the significance of the
ratio.
EXAMPLE A
As an example, we return to the experiment on iron retention discussed in Section
11.2.1.1. In the complete experiment, there were I = 2 forms of iron, J = 3 dosage
levels, and K = 18 observations per cell. In Section 11.2.1.1, we discussed a logarithmic transformation of the data to make it more nearly normal and to stabilize the
variance. Figure 12.4 shows boxplots of the data on the original scale; boxplots of the
log data are given in Figure 12.5. The distribution of the log data is more symmetrical,
and the interquartile ranges are less variable. Figure 12.6 is a plot of cell standard
The Analysis of Variance
30
25
Percentage retained
Chapter 12
20
15
10
5
0
High
High Medium Medium Low
Low
dosage dosage dosage dosage dosage dosage
of
of
of
of
of
of
Fe3+
Fe2+ Fe3+
Fe2+
Fe3+
Fe2+
F I G U R E 12.4 Boxplots of iron retention for two forms of iron at three dosage levels.
4
Natural log of percentage retained
496
3
2
1
0
1
High High Medium Medium Low
Low
dosage dosage dosage dosage dosage dosage
of
of
of
of
of
of
Fe3+
Fe2+
Fe3+
Fe2+
Fe3+
Fe2+
F I G U R E 12.5 Boxplots of log data on iron retention.
12.3 The Two-Way Layout
497
8
Standard deviation
7
6
5
4
3
2
2
4
6
8
Mean
12
10
14
F I G U R E 12.6 Plot of cell standard deviations versus cell means for iron retention
data.
deviations versus cell means for the untransformed data; it shows that the error variance increases with the mean. Figure 12.7 is a plot of cell standard deviations versus
means for the log data; it shows that the transformation is successful in stabilizing
the variance. Note that one of the assumptions of Theorem B is that the errors have
equal variance.
.70
Standard deviation
.65
.60
.55
.50
.45
1.0
1.2
1.4
1.6
1.8
Mean
2.0
2.2
2.4
2.6
F I G U R E 12.7 Plot of cell standard deviations versus cell means for log data on iron
retention.
Chapter 12
The Analysis of Variance
2.6
2.4
2.2
Mean response
498
2.0
1.8
1.6
1.4
1.2
1.0
0
2
4
6
8
Dosage level
10
12
F I G U R E 12.8 Plot of cell means of log data versus dosage level. The dashed line
corresponds to Fe2+ and the solid line to Fe3+ .
Figure 12.8 is a plot of the cell means of the transformed data versus the dosage
levels for the two forms of iron. It suggests that Fe2+ may be retained more than Fe3+ . If
there is no interaction, the two curves should be parallel except for random variation.
This appears to be roughly the case, although there is a hint that the difference in
retention of the two forms of iron increases with dosage level. To check this, we will
perform a quantitative test for interaction.
In the following analysis of variance table, SS A is the sum of squares due to
the form of iron, SS B is the sum of squares due to dosage, and SS AB is the sum of
squares due to interaction. The F statistics were found by dividing the appropriate
mean square by the mean square for error.
Analysis of Variance Table
Source
df
SS
MS
F
Iron form
Dosage
Interaction
Error
Total
1
2
2
102
107
2.074
15.588
.810
35.296
53.768
2.074
7.794
.405
.346
5.99
22.53
1.17
To test the effect of the form of iron, we test
H A : α1 = α2 = 0
12.3 The Two-Way Layout
499
using the statistic
F=
SSIRON /1
= 5.99
SS E /102
From computer evaluation of the F distribution with 1 and 102 df, the p-value is less
than .025. There is an effect due to the form of iron. An estimate of the difference
α1 − α2 is
Y 1.. − Y 2.. = .28
and a confidence interval for the difference may be obtained by noting that Y1.. and
Y2.. are uncorrelated since they are averages over different observations and that
Var(Y 1.. ) = Var(Y 2.. ) =
σ2
JK
Thus,
Var(Y 1.. − Y 2.. ) =
2σ 2
JK
Estimating σ 2 by the mean square for error, Var(Y 1.. − Y 2.. ) is estimated by
2 × .346
= .0128
54
A confidence interval can be constructed using the t distribution with IJ(K − 1)
degrees of freedom. The interval is of the form
sY2 1. −Y 2. =
(Y 1.. − Y 2.. ) ± t I J (K −1) (α/2)sY 1.. −Y 2..
There are 102 df; to form a 95% confidence interval we use t120 (.025) = 1.98
from√Table 4 of Appendix B as an approximation, producing the interval .28 ±
1.98 .0128, or (.06, .5).
Recall that we are working on a log scale. The additive effect of .28 on the log
scale corresponds to a multiplicative effect of e.28 = 1.32 on a linear scale and the
interval (.06, .50) transforms to (e.06 , e.50 ), or (1.06, 1.65). Thus, we estimate that
Fe2+ increases retention by a factor of 1.32, and the uncertainty in this factor is
expressed in the confidence interval (1.06, 1.65).
The F statistic for testing the effect of dosage is significant, but this effect is
expected and is not of major interest.
To test the hypothesis H AB which states that there is no interaction, we consider
the following F statistic:
F=
SS AB /(i − 1)(J − 1)
= 1.17
SS E /I J (K − 1)
From computer evaluation of the F distribution with 2 and 102 df, the p-value is .31,
so there is insufficient evidence to reject this hypothesis. Thus, the deviation of the
lines of Figure 12.8 from parallelism could easily be due to chance.
In conclusion, it appears that there is a difference of 6–65% in the ratio of
percentage retained between the two forms of iron and that there is little evidence
that this difference depends on dosage.
■
500
Chapter 12
The Analysis of Variance
12.3.3 Randomized Block Designs
Randomized block designs originated in agricultural experiments. To compare the
effects of I different fertilizers, J relatively homogeneous plots of land, or blocks,
are selected, and each is divided into I plots. Within each block, the assignment of
fertilizers to plots is made at random. By comparing fertilizers within blocks, the
variability between blocks, which would otherwise contribute “noise” to the results,
is controlled. This design is a multisample generalization of a matched-pairs design.
A randomized block design might be used by a nutritionist who wants to compare
the effects of three different diets on experimental animals. To control for genetic
variation in the animals, the nutritionist might select three animals from each of
several litters and randomly determine their assignments to the diets. Randomized
block designs are used in many areas. If an experiment is to be carried out over a
substantial period of time, the blocks may consist of stretches of time. In industrial
experiments, the blocks are often batches of raw material.
Randomization helps ensure against unintentional bias and can form a basis
for inference. In principle, the null distribution of a test statistic can be derived by
permutation arguments, just as we derived the null distribution of the Mann-Whitney
test statistic in Section 11.2.3. Parametric procedures often give a good approximation
to the permutation distribution.
As a model for the responses in the randomized block design, we will use
Yi j = μ + αi + β j + εi j
where αi is the differential effect of the ith treatment, β j is the differential effect
of the jth block, and the εi j are independent random errors. This is the model of
Section 12.3.2 but with the additional assumption of no interactions between blocks
and treatments. Interest is focused on the αi .
From Theorem A of Section 12.3.2, if there is no interaction,
E(MS A ) = σ 2 +
I
J 2
α
I − 1 i=1 i
E(MS B ) = σ 2 +
J
I
β2
J − 1 j=1 j
E(MSAB ) = σ 2
Thus, σ 2 can be estimated from M S AB . Also, since these mean squares are independently distributed, F tests can be performed to test H A or HB . For example, to
test
H A : αi = 0, i = 1, . . . , I
this statistic can be used:
F=
MS A
MSAB
From Theorem B in Section 12.3.2, under H A , the statistic follows an F distribution
with I − 1 and (I − 1)(J − 1) degrees of freedom. HB may be tested similarly but is
12.3 The Two-Way Layout
501
not usually of interest. Note that if, contrary to the assumption, there is an interaction,
then
E(MSAB ) = σ 2 +
I
J
1
δ2
(I − 1)(J − 1) i=1 j=1 i j
MS AB will tend to overestimate σ 2 . This will cause the F statistic to be smaller than
it should be and will result in a test that is conservative; that is, the actual probability
of type I error will be smaller than desired.
EXAMPLE A
Let us consider an experimental study of drugs to relieve itching (Beecher 1959). Five
drugs were compared to a placebo and no drug with 10 volunteer male subjects aged
20–30. (Note that this set of subjects limits the scope of inference; from a statistical
point of view, one cannot extrapolate the results of the experiment to older women,
for example. Any such extrapolation could be justified only on grounds of medical
judgment.) Each volunteer underwent one treatment per day, and the time-order was
randomized. Thus, individuals were “blocks.” The subjects were given a drug (or
placebo) intravenously, and then itching was induced on their forearms with cowage,
an effective itch stimulus. The subjects recorded the duration of the itching. More
details are in Beecher (1959). The following table gives the durations of the itching
(in seconds):
Subject
No
Drug
Placebo
Papaverine
Morphine
Aminophylline
Pentobarbital
Tripelennamine
BG
JF
BS
SI
BW
TS
GM
SS
MU
OS
174
224
260
255
165
237
191
100
115
189
263
213
231
291
168
121
137
102
89
433
105
103
145
103
144
94
35
133
83
237
199
143
113
225
176
144
87
120
100
173
141
168
78
164
127
114
96
222
165
168
108
341
159
135
239
136
140
134
185
188
141
184
125
227
194
155
121
129
79
317
Average
191.0
204.8
118.2
148.0
144.3
176.5
167.2
Figure 12.9 shows boxplots of the responses to the six treatments and to the
control (no drugs). Although the boxplot is probably not the ideal visual display of
these data, since it takes no account of the blocking, Figure 12.9 does show some
interesting aspects of the data. There is a suggestion that all the drugs had some effect
and that papaverine was the most effective. There is a lot of scatter relative to the
differences between the medians, and there are some outliers. It is interesting that
the placebo responses have the greatest spread; this might be because some subjects
responded to the placebo and some did not.
Chapter 12
The Analysis of Variance
500
400
Duration of itching (sec)
502
300
200
100
0
No drug
Papaverine
Placebo
Aminophylline
Morphine
Tripelennamine
Pentobarbital
F I G U R E 12.9 Boxplots of durations of itching under seven treatments.
We next construct an analysis of variance table for this experiment:
Source
df
SS
MS
F
Drugs
Subjects
Interaction
Total
6
9
54
69
53013
103280
167130
323422
8835
11476
3095
2.85
3.71
The F statistic for testing differences between drugs is 2.85 with 6 and 54 df, corresponding to a p-value less than .025. The null hypothesis that there is no difference
between subjects is not experimentally interesting.
Figure 12.10 is a probability plot of the residuals from the two-way analysis of
variance model. The residual in cell i j is
ri j = Yi j − μ̂ − α̂i − β̂ j
= Yi j − Y i. − Y . j + Y ..
There is a slightly bowed character to the probability plot, indicating some skewness
in the distribution of the residuals. But because the F test is robust against moderate
deviations from normality, we should not be overly concerned.
Tukey’s method may be applied to make multiple comparisons. Suppose that
we want to compare the drug means, Y 1. , . . . , Y 7. (I = 7). These have expectations
μ + αi , where i = 1, . . . , I , and each is an average over J = 10 independent observations. The error variance is estimated by MSAB with 54 df. Simultaneous 95%
12.3 The Two-Way Layout
503
150
Ordered residuals
100
50
0
50
100
3
2
1
0
1
Normal quantiles
2
3
F I G U R E 12.10 Normal probability plot of residuals from two-way analysis of
variance of data on duration of itching.
confidence intervals for all differences between drug means have half-widths of
*
q7,54 (.05)s
3095
√
= 4.31
10
J
= 75.8
[Here we have used q7,60 (.05) from Table 6 of Appendix B as an approximation to
q7,54 (.05).] Examining the table of means, we see that, at the 95% confidence level,
we can conclude only that papaverine achieves a reduction of itching over the effect
■
of a placebo.
12.3.4 A Nonparametric Method—Friedman s Test
This section presents a nonparametric method for the randomized block design. Like
other nonparametric methods we have discussed, Friedman’s test relies on ranks and
does not make an assumption of normality. The test is very simple. Within each of
the J blocks, the observations are ranked. To test the hypothesis that there is no effect
due to the factor corresponding to treatments (I ), the following statistic is calculated:
SS A = J
I
(R i. − R .. )2
i=1
just as in the ordinary analysis of variance. Under the null hypothesis that there
is no treatment effect and that the only effect is due to the randomization within
blocks, the permutation distribution of the statistic can, in principle, be calculated.
504
Chapter 12
The Analysis of Variance
For sample sizes such as that of the itching experiment, a chi-square approximation
to this distribution is perfectly adequate. The null distribution of
I
12J
Q=
(R i. − R .. )2
I (I + 1) i=1
is approximately chi-square with I − 1 degrees of freedom.
EXAMPLE A
To carry out Friedman’s test on the data from the experiment on itching, we first
construct the following table by ranking durations of itching for each subject:
No
Drug
Placebo
Papaverine
Morphine
Aminophylline
Pentobarbital
Tripelennamine
BG
JF
BS
SI
BW
TS
GM
SS
MU
OS
5
6
7
6
3
7
7
1
5
4
7
5
6
7
4
3
5
2
3
7
1
1
4
1
2
1
1
5
2
5
6
2
2
4
5
5
2
3
4
2
3.5
3
1
3
1
2
3
7
6
1
2
7
5
2
7
4
6
6
7
3
3.5
4
3
5
6
6
4
4
1
6
Average
5.10
4.90
2.30
3.50
3.05
4.90
4.25
Note that we have handled ties in the usual way by assigning average ranks. From the
preceding table, no drug, placebo, and pentobarbitol have the highest average ranks.
From these average ranks, we find R = 4, (R i. − R .. )2 = 6.935 and Q = 14.86.
From Table 3 of Appendix B with 6 df, the p-value is less than .025. The nonparametric
analysis also rejects the hypothesis that there is no drug effect.
■
Procedures for using Friedman’s test for multiple comparisons are discussed
by Miller (1981). When these methods are applied to the data from the experiment
on itching, the conclusions reached are identical to those reached by the parametric
analysis.
12.4 Concluding Remarks
The most complicated experimental design considered in this chapter was the twoway layout; more generally, a factorial design incorporates several factors with one
or more observations per cell. With such a design, the concept of interaction becomes more complicated—there are interactions of various orders. For instance, in a
three-factor experiment, there are two-factor and three-factor interactions. It is both
12.5 Problems
505
interesting and useful that the two-factor interactions in a three-way layout can be
estimated using only one observation per cell.
To gain some insight into why factorial designs are effective, we can begin by
considering a two-way layout, with each factor at five levels, no interaction, and one
observation per cell. With this design, comparisons of two levels of any factor are
based on 10 observations. A traditional alternative to this design is to do first an
experiment comparing the levels of factor A and then another experiment comparing
the levels of factor B. To obtain the same precision as is achieved by the two-way
layout in this case, 25 observations in each experiment, or a total of 50 observations,
would be needed. The factorial design achieves its economy by using the same observations to compare the levels of factor A as are used to compare the levels of
factor B.
The advantages of factorial designs become greater as the number of factors
increases. For example, in an experiment with four factors, with each factor at two
levels (which might be the presence or absence of some chemical, for example) and one
observation per cell, there are 16 observations that may be used to compare the levels
of each factor. Furthermore, it can be shown that two- and three-factor interactions
can be estimated. By comparison, if each of the four factors were investigated in a
separate experiment, 64 observations would be required to attain the same precision.
As the number of factors increases, the number of observations necessary for
a factorial experiment with only one observation per cell grows very rapidly. To
decrease the cost of an experiment, certain cells, designated in a systematic way, can
be left empty, and the main effects and some interactions can still be estimated. Such
arrangements are called fractional factorial designs.
Similarly, with a randomized block design, the individual blocks may not be
able to accommodate all the treatments. For example, in a chemical experiment that
compares a large number of treatments, the blocks of the experiment, batches of raw
material of uniform quality, may not be large enough. In such situations, incomplete
block designs may be used to retain the advantages of blocking.
The basic theoretical assumptions underlying the analysis of variance are that the
errors are independent and normally distributed with constant variance. Because we
cannot fully check the validity of these assumptions in practice and can probably detect
only gross violations, it is natural to ask how robust the procedures are with respect
to violations of the assumptions. It is impossible to give a complete and conclusive
answer to this question. Generally speaking, the independence assumption is probably
the most important (and this is true for nonparametric procedures as well). The F test
is robust against moderate departures from normality; if the design is balanced, the
F test is also robust against unequal error variance.
For further reading, Box, Hunter, and Hunter (1978) is recommended.
12.5 Problems
1. Simulate observations like those of Figure 12.1 under the null hypothesis of no
treatment effects. That is, simulate seven batches of ten normally distributed
random numbers with mean 4 and variance .0037. Make parallel boxplots of
these seven batches like those of Figure 12.1. Do this several times. Your figures
506
Chapter 12
The Analysis of Variance
display the kind of variability that random fluctuations can cause; do you see any
pairs of labs that appear quite different in either mean level or dispersion?
2. Verify that if I = 2, the estimate s 2p of Theorem A of Section 11.2.1 is the s 2p
given in Section 12.2.1.
3. For a one-way analysis of variance with I = 2 treatment groups, show that the
F statistic is t 2 , where t is the usual t statistic for a two-sample case.
4. Prove the analogues of Theorems A and B in Section 12.2.1 for the case of
unequal numbers of observations in the cells of a one-way layout.
5. Derive the likelihood ratio test for the null hypothesis of the one-way layout, and
show that it is equivalent to the F test.
6. Prove this version of the Bonferroni inequality:
P
n
4
i=1
Ai
≥1−
n
P Aic
i=1
(Use Venn diagrams if you wish.) In the context of simultaneous confidence
intervals, what is Ai and what is Aic ?
7. Show that, as claimed in Theorem B of Section 12.2.1, SS B /σ 2 ∼ χ I2−1 .
8. Form simultaneous confidence intervals for the difference of the mean of lab 1
and those of labs 4, 5, and 6 in Example A of Section 12.2.2.1.
9. Compare the tables of the t distribution and the studentized range in Appendix
B. For example, consider
the column corresponding to t.95 ; multiply the numbers
√
in that column by 2 and observe that you get the numbers in the column t = 2
of the table of q.90 . Why is this?
10. Suppose that in a one-way layout there are 10 treatments and seven observations
under each treatment. What is the ratio of the length of a simultaneous confidence
interval for the difference of two means formed by Tukey’s method to that of one
formed by the Bonferroni method? How do both of these compare in length to
an interval based on the t distribution that does not take account of multiple
comparisons?
11. Consider a hypothetical two-way layout with four factors (A, B, C, D) each at
three levels (I, III, III). Construct a table of cell means for which there is no
interaction.
12. Consider a hypothetical two-way layout with three factors (A, B, C) each at two
levels (I, II). Is it possible for there to be interactions but no main effects?
13. Show that for comparing two groups the Kruskal-Wallis test is equivalent to the
Mann-Whitney test.
14. Show that for comparing two groups Friedman’s test is equivalent to the sign
test.
12.5 Problems
507
15. Show the equality of the two forms of K given in Section 12.2.3:
I
12
Ji (R i. − R .. )2
N (N + 1) i=1
I
12
2
=
Ji R i. − 3(N + 1)
N (N + 1) i=1
K =
16. Prove the sums of squares identity for the two-way layout:
SSTOT = SS A + SS B + SSAB + SS E
17. Find the mle’s of the parameters αi , β j , δi j , and μ of the model for the two-way
layout.
18. The table below gives the energy use of five gas ranges for seven menu days. (The
units are equivalent kilowatt-hours; .239 kwh = 1 ft3 of natural gas.) Estimate
main effects and discuss interaction, paralleling the discussion of Section 12.3.
Menu Day
Range 1
Range 2
Range 3
Range 4
Range 5
1
2
3
4
5
6
7
8.25
5.12
5.32
8.00
6.97
7.65
7.86
8.26
4.81
4.37
6.50
6.26
5.84
7.31
6.55
3.87
3.76
5.38
5.03
5.23
5.87
8.21
4.81
4.67
6.51
6.40
6.24
6.64
6.69
3.99
4.37
5.60
5.60
5.73
6.03
19. Develop a parametrization for a balanced three-way layout. Define main effects
and two-factor and three-factor interactions, and discuss their interpretation. What
linear constraints do the parameters satisfy?
20. This problem introduces a random effects model for the one-way layout. Consider a balanced one-way layout in which the I groups being compared are
regarded as being a sample from some larger population. The random effects
model is
Yi j = μ + Ai + εi j
where the Ai are random and independent of each other with E(Ai ) = 0 and
Var(Ai ) = σ A2 . The εi j are independent of the Ai and of each other, and E(εi j ) = 0
and Var(εi j ) = σε2 .
To fix these ideas, we can consider an example from Davies (1960). The
variation of the strength (coloring power) of a dyestuff from one manufactured
batch to another was studied. Strength was measured by dyeing a square of cloth
with a standard concentration of dyestuff under carefully controlled conditions
and visually comparing the result with a standard. The result was numerically
scored by a technician. Large samples were taken from six batches of a dyestuff;
each sample was well mixed, and from each six subsamples were taken. These
508
Chapter 12
The Analysis of Variance
36 subsamples were submitted to the laboratory in random order over a period of
several weeks for testing as described. The percentage strengths of the dyestuff
are given in the following table.
Batch
I
II
III
IV
V
VI
Subsample Subsample Subsample Subsample Subsample Subsample
1
2
3
4
5
6
94.5
89.0
88.5
100.0
91.5
98.5
93.0
90.0
93.5
99.0
93.0
100.0
91.0
92.5
93.5
100.0
90.0
98.0
89.0
88.5
88.0
98.0
92.5
100.0
96.5
91.5
92.5
95.0
89.0
96.5
88.0
91.5
91.5
97.5
91.0
98.0
There are two sources of variability in these numbers: batch-to-batch variability and measurement variability. It is hoped that variability between subsamples
has been eliminated by the mixing. We will consider the random effects model,
Yi j = μ + Ai + εi j
Here, μ is the overall mean level, Ai is the random effect of the ith batch, and εi j
is the measurement error on the jth subsample from the ith batch. We assume
that the Ai are independent of each other and of the measurement errors, with
E(Ai ) = 0 and Var(Ai ) = σ A2 . The εi j are assumed to be independent of each
other and to have mean 0 and variance σε2 . Thus,
Var(Yi j ) = σ A2 + σε2
Large variability in the Yi j could be caused by large variability among batches,
large measurement error, or both. The former could be decreased by changing the
manufacturing process to make the batches more homogeneous, and the latter by
controlling the scoring process more carefully.
a. Show that for this model
E(M SW ) = σε2
E(M S B ) = σε2 + J σ A2
and that therefore σε2 and σ A2 can be estimated from the data. Calculate these
estimates.
b. Suppose that the samples had not been mixed, but that duplicate measurements
had been made on each subsample. Formulate a model that also incorporates
variability between subsamples. How could the parameters of this model be
estimated?
21. During each of four experiments on the use of carbon tetrachloride as a worm
killer, ten rats were infested with larvae (Armitage 1983). Eight days later, five
rats were treated with carbon tetrachloride; the other five were kept as controls.
After two more days, all the rats were killed and the numbers of worms were
counted. The table below gives the counts of worms for the four control groups.
Significant differences, although not expected, might be attributable to changes in
12.5 Problems
509
experimental conditions. A finding of significant differences could result in more
carefully controlled experimentation and thus greater precision in later work.
Use both graphical techniques and the F test to test whether there are significant
differences among the four groups. Use a nonparametric technique as well.
Group I
Group II
Group III
Group IV
279
338
334
198
303
378
275
412
265
286
172
335
335
282
250
381
346
340
471
318
22. Referring to Section 12.2, the file tablets gives the measurements on chlorpheniramine maleate tablets from another manufacturer. Are there systematic
differences between the labs? If so, which pairs differ significantly? How do
these data compare to those given for the other manufacturer in Section 12.2?
23. For a study of the release of luteinizing hormone (LH), male and female rats
kept in constant light were compared to male and female rats in a regime of 14 h
of light and 10 h of darkness. Various dosages of luteinizing releasing factor
(LRF) were given: control (saline), 10, 50, 250, and 1250 ng. Levels of LH (in
nanograms per milliliter of serum) were measured in blood samples at a later time.
Analyze the data given in file LHfemale, LHmale to determine the effects of
light regime and LRF on release of LH for both males and females. Use both
graphical techniques and more formal analyses.
24. A collaborative study was conducted to study the precision and homogeneity of
a method of determining the amount of niacin in cereal products (Campbell and
Pelletier 1962). Homogenized samples of bread and bran flakes were enriched
with 0, 2, 4, or 8 mg of niacin per 100 g. Portions of the samples were sent
to 12 labs, which were asked to carry out the specified procedures on each of
three separate days. The data (in milligrams per 100 g) are given in the file
niacin. Conduct two-way analyses of variance for both the bread and bran
data and discuss the results. (Two data points are missing. Substitute for them
the corresponding cell means.)
25. This problem deals with an example from Youden (1962). An ingot of magnesium
alloy was drawn into a square rod about 100 m long with a cross section of about
4.5 cm on a side. The rod was then cut into 100 bars, each a meter long. Five of
these were selected at random, and a test piece 1.2 cm thick was cut from each.
From each of these five specimens, 10 test points were selected in a particular
geometric pattern. Two determinations of the magnesium content were made
at each test point (the analyst ran all 50 points once and then made a set of
repeat measurements). The overall purpose of the experiment was to test for
homogeneity of magnesium content in the different bars and different locations.
Analyze the data in the file magnesium (giving percentage of magnesium times
1000) to determine if there is significant variability between bars and between
510
Chapter 12
The Analysis of Variance
locations. There are a couple of unexpected aspects of these data—can you find
them?
26. The concentrations (in nanograms per milliliter) of plasma epinephrine were
measured for 10 dogs under isofluorane, halothane, and cyclopropane anesthesia; the measurements are given in the following table (Perry et al. 1974).
Is there a difference in treatment effects? Use a parametric and a nonparametric
analysis.
Dog
1
Isofluorane
Halothane
Cyclopropane
Dog
2
Dog
3
Dog
4
Dog
5
Dog
6
Dog
7
Dog
8
Dog
9
Dog
10
.28 .51
.30 .39
1.07 1.35
1.00
.63
.69
.39
.68
.28
.29
.36
.38
.21
1.24 1.53
.32
.88
.49
.69
.39
.56
.17
.51
1.02
.33
.32
.30
27. Three species of mice were tested for “aggressiveness.” The species were A/J,
C57, and F2 (a cross of the first two species). A mouse was placed in a 1-m2
box, which was marked off into 49 equal squares. The mouse was let go on the
center square, and the number of squares traversed in a 5-min period was counted.
Analyze the file C57, AJ, F2, using the Bonferroni method, to determine if there
is a significant difference among species.
28. Samples of each of three types of stopwatches were tested. The following table
gives thousands of cycles (on-off-restart) survived until some part of the mechanism failed (Natrella 1963). Test whether there is a significant difference among
the types, and if there is, determine which types are significantly different. Use
both a parametric and a nonparametric technique.
Type I
Type II
Type III
1.7
1.9
6.1
12.5
16.5
25.1
30.5
42.1
82.5
13.6
19.8
25.2
46.2
46.2
61.1
13.4
20.9
25.1
29.7
46.9
29. The performance of a semiconductor depends upon the thickness of a layer of
silicon dioxide. In an experiment (Czitrom and Reece, 1997), layer thicknesses
were measured at three furnace locations for three types of wafers (virgin wafers,
recycled in-house wafers, and recycled wafers from an external source). The data
are contained in the file waferlayers. Conduct a two-way analysis of variance
and test for significance of main effects and interactions. Construct a graph such
as that shown in Figure 12.3. Does the comparison of layer thicknesses depend
on furnace location?
12.5 Problems
511
30. Ten varieties of linseed were grown on six different plots (Adguna and
Labuschagne, 2002). The file linseed contains the yields (kg per hectare).
Can you conclude that the varieties have different yields? Use Tukey’s method
to compare the varieties.
31. Problem 39 of Chapter 10 involved a table of maximum windspeeds for 35 years
at each of 21 cities. Would you expect an additive model (no interaction) to
provide a good fit to the numbers in this table? Why or why not? Check it out.
32. It is known that increased reproductions leads to reduced longevity for female
fruitflies. Patridge and Farquhar (1981) studied whether the same phenomenon
held for male fruitflies. The data are also discussed in Hanley and Shapiro (1994).
The experiment set up five treatment groups, each consisting of 25 randomly
assigned male fruitflies. The males in one treatment were housed with eight
virgin females per day. In another treatment, the males were housed with one
virgin female day. There were three control groups: males housed with eight
newly impregnated females, housed with one newly impregnated female, and
housed alone. (Newly inseminated females will not usually mate within two
days).
The data are contained in the file fruitfly, with a row for each male in the
following format:
Column 1: the number of females
Column 2: the type of female—0 denoting newly pregnant, 1 denoting virgin,
and 9 when there were no females
Column 3: lifespan in days
Column 4: length of thorax (mm), which is fixed at birth
Column 5: percentage of time spent sleeping
a. Calculate summary statistics for lifespan in each group and compare. Display
the data in parallel boxplots. Qualitatively, what do you conclude?
b. Do the same for percentage of time spent sleeping.
c. Make a scatterplot of lifespan versus thorax length. Is thorax length predictive of lifespan and did the randomization balance thorax length between the
groups?
d. Use the F test to test for differences in longevity between the groups. Use both
Tukey’s method and the Bonferroni method to compare all pairs of means.
Summarize your conclusions.
e. Repeat the analysis using the Kruskal-Wallis test and the Bonferroni method.
f. How does the availability of virgin females affect the sleep of male fruitflies?
33. How does diet affect longevity? Studies on animals have shown that restricting caloric intake can increase lifespan. Weindruch et al. did an experiment
involving six treatment groups of female mice. The data, contained in the file
diet-and-longevity, are also discussed in Ramsey and Shafer (2002).
The groups were:
1. NP: mice ate as much as they wished of a standard diet.
2. N/N85: mice were fed normally before and after weaning. After weaning,
their caloric intake was 85 kcal per week, which is the normal average level.
512
Chapter 12
The Analysis of Variance
3. N/R50: mice were fed normally before weaning; and after weaning, their
caloric intake was restricted to 50 kcal per week.
4. R/R50: mice were fed 50 kcal per week before and after weaning.
5. lopro: mice were fed normally before weaning, a restricted diet of 50 kcal per
week after weaning, and the dietary protein content decreased as they got older.
6. N/R40: mice were fed normally before weaning and were given 40 kcal per
week after weaning.
As well as making parallel boxplots and conducting an overall test for equality of means, scientific questions of interest involve some specific comparisons.
For example, to determine whether reducing from a normal 85 kcal per week to
50 kcal per week, groups N/N85 and N/R50 would be compared. Which groups
would you compare to answer the following questions:
a. Do preweaning dietary restrictions have an effect?
b. Does reduction in protein have an effect?
c. Does reduction to 40 kcal per week have an effect?
Formulate the comparisons you wish to make and carry them out using an appropriate Bonferroni correction. What is the purpose of including the group NP?
34. The following table gives the survival times (in hours) for animals in an experiment whose design consisted of three poisons, four treatments, and four
observations per cell.
a. Conduct a two-way analysis of variance to test the effects of the two main
factors and their interaction.
b. Box and Cox (1964) analyzed the reciprocals of the data, pointing out that the
reciprocal of a survival time can be interpreted as the rate of death. Conduct a
two-way analysis of variance, and compare to the results of part (a). Comment
on how well the standard two-way analysis of variance model fits and on the
interaction in both analyses.
Treatment
Poison
I
II
III
A
3.1
4.6
3.6
4.0
2.2
1.8
B
4.5
4.3
2.9
2.3
2.1
2.3
8.2
8.8
9.2
4.9
3.0
3.8
C
11.0
7.2
6.1
12.4
3.7
2.9
4.3
6.3
4.4
3.1
2.3
2.4
D
4.5
7.6
3.5
4.0
2.5
2.2
4.5
6.6
5.6
7.1
3.0
3.1
7.1
6.2
10.0
3.8
3.6
3.3
35. The concentration of follicle stimulating hormone (FSH) can be measured through
a bioassay. The basic idea is that when FSH is added to a certain culture, a proportional amount of estrogen is produced; hence, after calibration, the amount of
FSH can be found by measuring estrogen production. However, determining FSH
levels in serum samples is difficult because some factor(s) in the serum inhibit
estrogen production and thus screw up the bioassay. An experiment was done to
see if it would be effective to pretreat the serum with polyethyleneglycol (PEG)
which, it was hoped, precipitates some of the inhibitory substance(s).
12.5 Problems
513
Three treatments were applied to prepared cultures: no serum, PEG-treated
FSH free serum, and untreated FSH free serum. Each culture had one of eight
doses of FSH: 4, 2, 1, .5, .25, .125, .06, or 0.0 mIU/μl. For each serum-dose
combination, there were three cultures, and after incubation for three days, each
culture was assayed for estrogen by radioimmunoassay. The table that follows
gives the results (units are nanograms of estrogen per milliliter). Analyze these
data with a view to determining to what degree PEG treatment is successful in
removing the inhibitory substances from the serum. Write a brief report summarizing and documenting your conclusions.
Dose
No Serum
PEG Serum
Untreated Serum
.00
.00
.00
.06
.06
.06
.12
.12
.12
.25
.25
.25
.50
.50
.50
1.00
1.00
1.00
2.00
2.00
2.00
4.00
4.00
4.00
1,814.4
3,043.2
3,857.1
2,447.9
3,320.9
3,387.6
4,887.8
5,171.2
3,370.7
10,255.6
9,431.8
10,961.2
14,538.8
14,214.3
16,934.5
19,719.8
20,801.4
32,740.7
16,453.8
28,793.8
19,148.5
17,967.0
18,768.6
19,946.9
372.7
350.1
426.0
628.3
655.0
700.0
1,701.8
2,589.4
1,117.1
4,114.6
2,761.5
1,975.8
6,074.3
12,273.9
14,240.9
17,889.9
11,685.7
11,342.4
11,843.5
18,320.7
23,580.6
12,380.0
20,039.0
15,135.6
1,745.3
2,470.0
1,700.0
1,919.2
1,605.1
2,796.0
1,929.7
1,537.3
1,692.7
1,149.1
743.4
948.5
4,471.9
2,772.1
5,782.3
11,588.7
8,249.5
18,481.5
10,433.5
8,181.0
11,104.0
10,020.0
8,448.5
10,482.8
CHAPTER
13
The Analysis of
Categorical Data
13.1 Introduction
This chapter introduces the analysis of data that are in the form of counts in various
categories. We will deal primarily with two-way tables, the rows and columns of
which represent categories. Suppose that the rows of such a table represent various
hair colors and its columns various eye colors and that each cell contains a count
of the number of people who fall in that particular cross-classification. We might be
interested in dependencies between the row and column classifications—that is, is
hair color related to eye color?
We emphasize that the data considered in this chapter are counts, rather than
continuous measurements as they were in Chapter 12. Thus, in this chapter, we will
make heavy use of the multinomial and chi-square distributions.
13.2 Fisher s Exact Test
We will develop Fisher’s exact test in the context of the following example. Rosen and
Jerdee (1974) conducted several experiments, using as subjects male bank supervisors
attending a management institute. As part of their training, the supervisors had to make
decisions on items in an in-basket. The investigators embedded their experimental
materials in the contents of the in-baskets. In one experiment, the supervisors were
given a personnel file and had to decide whether to promote the employee or to hold the
file and interview additional candidates. By random selection, 24 of the supervisors
examined a file labeled as being that of a male employee and 24 examined a file
labeled as being that of a female employee; the files were otherwise identical. The
results are summarized in the following table:
514
13.2 Fisher s Exact Test
Male
Female
Promote
21
14
Hold File
3
10
515
From the results, it appears that there is a sex bias—21 of 24 males were promoted,
but only 14 of 24 females were. But someone who was arguing against the presence of
sex bias could claim that the results occurred by chance; that is, even if there were no
bias and the supervisors were completely indifferent to sex, discrepancies like those
observed could occur with fairly large probability by chance alone. To rephrase this
argument, the claim is that 35 of the 48 supervisors chose to promote the employee
and 13 chose not to, and that 21 of the 35 promotions were of a male employee merely
because of the random assignment of supervisors to male and female files.
The strength of the argument against sex bias must be assessed by a calculation
of probability. If it is likely that the randomization could result in such an imbalance,
the argument is difficult to refute; however, if only a small proportion of all possible
randomizations would give such an imbalance, the argument has less force. We take
as the null hypothesis that there is no sex bias and that any differences observed are
due to the randomization. We denote the counts in the table and on the margins as
follows:
N11
N12
n 1.
N21
N22
n 2.
n .1
n .2
n ..
According to the null hypothesis, the margins of the table are fixed: There are 24
females, 24 males, 35 supervisors who choose to promote, and 13 who choose not
to. Also, the process of randomization determines the counts in the interior of the
table (denoted by capital letters since they are random) subject to the constraints of
the margins. With these constraints, there is only 1 degree of freedom in the interior
of the table; if any interior count is fixed, the others may be determined.
Consider the count N11 , the number of males who are promoted. Under the null
hypothesis, the distribution of N11 is that of the number of successes in 24 draws
without replacement from a population of 35 successes and 13 failures; that is, the
distribution of N11 induced by the randomization is hypergeometric. The probability
that N11 = n 11 is
n 1. n 2. p(n 11 ) =
n .. n 21
n 11
n .1
We will use N11 as the test statistic for testing the null hypothesis. The preceding
hypergeometric probability distribution is the null distribution of N11 and is tabled
here. A two-sided test rejects for extreme values of N11 .
516
Chapter 13
The Analysis of Categorical Data
n 11
p(n 11 )
11
.000
12
.000
13
.004
14
.021
15
.072
16
.162
17
.241
18
.241
19
.162
20
.072
21
.021
22
.004
23
.000
24
.000
From this table, a rejection region for a two-sided test with α = .05 consists of the
following values for N11 : 11, 12, 13, 14, 21, 22, 23, and 24. The observed value of N11
falls in this region, so the test would reject at level .05. An imbalance in promotions
as or more extreme than that observed would occur only by chance with probability
.05, so there is fairly strong evidence of gender bias.
13.3 The Chi-Square Test of Homogeneity
Suppose that we have independent observations from J multinomial distributions,
each of which has I cells, and that we want to test whether the cell probabilities
of the multinomials are equal—that is, to test the homogeneity of the multinomial
distributions.
As an example, we will consider a quantitative study of an aspect of literary
style. Several investigators have used probabilistic models of word counts as indices
of literary style, and statistical techniques applied to such counts have been used in
controversies about disputed authorship. An interesting account is given by Morton
(1978), from whom we take the following example.
When Jane Austen died, she left the novel Sanditon only partially completed,
but she left a summary of the remainder. A highly literate admirer finished the novel,
13.3 The Chi-Square Test of Homogeneity
517
attempting to emulate Austen’s style, and the hybrid was published. Morton counted
the occurrences of various words in several works: Chapters 1 and 3 of Sense and
Sensibility, Chapters 1, 2, and 3 of Emma, Chapters 1 and 6 of Sanditon (written by
Austen); and Chapters 12 and 24 of Sanditon (written by her admirer). The counts
Morton obtained for six words are given in the following table:
Word
Sense and Sensibility
Emma
Sanditon I
Sanditon II
a
147
186
101
83
an
25
26
11
29
this
32
39
15
15
that
94
105
37
22
with
59
74
28
43
without
18
10
10
4
375
440
202
196
Total
We will compare the relative frequencies with which these words appear and will
examine the consistency of Austen’s usage of them from book to book and the degree
to which her admirer was successful in imitating this aspect of her style. A stochastic
model will be used for this purpose: The six counts for Sense and Sensibility will
be modeled as a realization of a multinomial random variable with unknown cell
probabilities and total count 375; the counts for the other works will be similarly
modeled as independent multinomial random variables.
Thus, we must consider comparing J multinomial distributions each having I
categories. If the probability of the ith category of the jth multinomial is denoted πi j ,
the null hypothesis to be tested is
H0 : πi1 = πi2 = · · · = πi J ,
i = 1, . . . , I
We may view this as a goodness-of-fit test: Does the model prescribed by the null
hypothesis fit the data? To test goodness of fit, we will compare observed values with
expected values as in Chapter 9, using likelihood ratio statistics or Pearson’s chisquare statistic. We will assume that the data consist of independent samples from
each multinomial distribution, and we will denote the count in the ith category of the
jth multinomial as n i j .
Under H0 , each of the J multinomials has the same probability for the ith category, say πi . The following theorem shows that the mle of πi is simply n i. /n .. , which
is an obvious estimate. Here, n i. is the total count in the ith category, n .. is the grand
total count, n . j is the total count for the jth multinomial.
518
Chapter 13
The Analysis of Categorical Data
THEOREM A
Under H0 , the mle’s of the parameters π1 , π2 , . . . , π I are
n i.
,
i = 1, . . . , I
π̂i =
n ..
where n i. is the total number of responses in the ith category and n .. is the grand
total number of responses.
Proof
Since the multinomial distributions are independent,
J n. j
n
n
n
π1 1 j π 2 2 j · · · π I I j
lik(π1 , π2 , . . . , π I ) =
n1 j n2 j · · · n I j
j=1
J n. j
n 1. n 2.
n I.
= π 1 π2 · · · π I
n
n
1j 2j · · · nI j
j=1
I
πi =
Let us consider maximizing the log likelihood subject to the constraint i=1
1. Introducing a Lagrange multiplier, we have to maximize
I
J
I
n. j
l(π, λ) =
log
n i. log πi + λ
πi − 1
+
n1 j n2 j · · · n I j
j=1
i=1
i=1
Now,
n i.
∂l
=
+ λ,
∂πi
πi
i = 1, . . . , I
or
n i.
λ
Summing over both sides and applying the constraint, we find λ = −n .. and
■
π̂i = n i. /n .. , as was to be proved.
π̂i = −
For the jth multinomial, the expected count in the ith category is the estimated
probability of that cell times the total number of observations for the jth multinomial,
or
n . j n i.
Ei j =
n ..
Pearson’s chi-square statistic is therefore
X2 =
=
J
I (Oi j − E i j )2
Ei j
i=1 j=1
J
I (n i j − n i. n . j /n .. )2
n i. n . j /n ..
i=1 j=1
13.3 The Chi-Square Test of Homogeneity
519
For large sample sizes, the approximate null distribution of this statistic is chi-square.
(The usual recommendation concerning the sample size necessary for this approximation to be reasonable is that the expected counts should all be greater than 5.)
The degrees of freedom are the number of independent counts minus the number of
independent parameters estimated from the data. Each multinomial has I − 1 independent counts, since the totals are fixed, and I − 1 independent parameters have
been estimated. The degrees of freedom are therefore
d f = J (I − 1) − (I − 1) = (I − 1)(J − 1)
We now apply this method to the word counts from Austen’s works. First, we
consider Austen’s consistency from one work to another. The following table gives
the observed count and, below it, the expected count in each cell of the table.
Word
Sense and Sensibility
Emma
Sanditon I
a
147
160.0
186
187.8
101
86.2
an
25
22.9
26
26.8
11
12.3
this
32
31.7
39
37.2
15
17.1
that
94
87.0
105
102.1
37
46.9
with
59
59.4
74
69.7
28
32.0
without
18
14.0
10
16.4
10
7.5
The observed counts look fairly close to the expected counts, and the chi-square
statistic is 12.27. The 10% point of the chi-square distribution with 10 degrees of
freedom is 15.99, and the 25% point is 12.54. The data are thus consistent with the
model that the word counts in the three works are realizations of multinomial random
variables with the same underlying probabilities. The relative frequencies with which
Austen used these words did not change from work to work.
To compare Austen and her imitator, we can pool all Austen’s work together
in light of the above findings. The following table shows the observed and expected
frequencies for the imitator and Austen:
520
Chapter 13
The Analysis of Categorical Data
Word
Imitator
Austen
a
83
83.5
434
433.5
an
29
14.7
62
76.3
this
15
16.3
86
84.7
that
22
41.7
236
216.3
with
43
33.0
161
171.0
4
6.8
38
35.2
without
The chi-square statistic is 32.81 with 5 degrees of freedom, giving a p-value of
less than .001. The imitator was not successful in imitating this aspect of Austen’s
style. To see which discrepancies are large, it is helpful to examine the contributions
to the chi-square statistic cell by cell, as tabulated here:
Word
Imitator
Austen
a
0.00
0.00
an
13.90
2.68
this
0.11
0.02
that
9.30
1.79
with
3.06
0.59
without
1.14
0.22
Inspecting this and the preceding table, we see that the relative frequency with which
Austen used the word an was much smaller than that with which her imitator used it,
and that the relative frequency with which she used that was much larger.
13.4 The Chi-Square Test of Independence
This section develops a chi-square test that is very similar to the one of the preceding
section but is aimed at answering a slightly different question. We will again use an
example.
13.4 The Chi-Square Test of Independence
521
In a demographic study of women who were listed in Who’s Who, Kiser and
Schaefer (1949) compiled the following table for 1436 women who were married at
least once:
Education
Married Once
Married More Than Once
Total
College
550
61
611
No College
681
144
825
1231
205
1436
Total
Is there a relationship between marital status and educational level? Of the women
61
= 10% were married more than once; of those who
who had a college degree, 611
144
had no college degree, 825 = 17% were married more than once. Alternatively, the
question might be addressed by noting that of those women who were married more
61
= 30% had a college degree, whereas of those married only once,
than once, 205
550
=
45%
had
a college degree. For this sample of 1436 women, having a college
1231
degree is positively associated with being married only once, but it is impossible
to make causal inferences from the data. Marital stability could be influenced by
educational level, or both characteristics could be influenced by other factors, such
as social class.
A critic of the study could in any case claim that the relationship between marital
status and educational level is “statistically insignificant.” Since the data are not a
sample from any population, and since no randomization has been performed, the
role of probability and statistics is not clear. One might respond to such a criticism
by saying that the data speak for themselves and that there is no chance mechanism
at work. The critic might then rephrase his argument: “If I were to take a sample of
1436 people cross-classified into two categories which were in fact unrelated in the
population from which the sample was drawn, I might find associations as strong or
stronger than those observed in this table. Why should I believe that there is any real
association in your table?” Even though this argument may not seem compelling,
statistical tests are often carried out in situations in which stochastic mechanisms are
at best hypothetical.
We will discuss statistical analysis of a sample of size n cross-classified in a
table with I rows and J columns. Such a configuration is called a contingency table.
The joint distribution of the counts n i j , where i = 1, . . . , I and j = 1, . . . , J , is
multinomial with cell probabilities denoted as πi j . Let
πi. =
J
πi j
j=1
π. j =
I
πi j
i=1
denote the marginal probabilities that an observation will fall in the ith row and
jth column, respectively. If the row and column classifications are independent of
522
Chapter 13
The Analysis of Categorical Data
each other,
πi j = πi. π. j
We thus consider testing the following null hypothesis:
H0 : πi j = πi. π. j ,
i = 1, . . . , I,
j = 1, . . . , J
versus the alternative that the πi j are free. Under H0 , the mle of πi j is
π̂i j = π̂i. π̂. j
n. j
n i.
×
=
n
n
(see Problem 10 at the end of this chapter). Under the alternative, the mle of πi j is
simply
π̃i j =
ni j
n
These estimates can be used to form a likelihood ratio test or an asymptotically
equivalent Pearson’s chi-square test,
X2 =
J
I (Oi j − E i j )2
Ei j
i=1 j=1
Here the Oi j are the observed counts (n i j ). The expected counts, the E i j , are the fitted
counts:
E i j = n π̂i j =
n i. n . j
n
Pearson’s chi-square statistic is, therefore,
X2 =
J
I (n i j − n i. n . j /n)2
n i. n . j /n
i=1 j=1
The degrees of freedom for the chi-square statistic are calculated as in Section 9.5. Under , the cell probabilities sum to 1 but are otherwise free and there
are thus IJ − 1 independent parameters. Under the null hypothesis, the marginal
probabilities, are estimated from the data and are specified by (I − 1) + (J − 1)
independent parameters. Thus,
df = IJ − 1 − (I − 1) − (J − 1) = (I − 1)(J − 1)
Returning to the data on 1436 women from the demographic study, we calculate
expected values and construct the following table:
13.5 Matched-Pairs Designs
Education
Married Once
More Than Once
College
550
523.8
61
87.2
No College
681
707.2
144
117.8
523
The chi-square statistic is 16.01 with 1 degree of freedom, giving a p-value less than
.001. We would reject the hypothesis of independence and conclude that there is a
relationship between marital status and educational level.
The chi-square statistic used here to test independence is identical in form and
degrees of freedom to that used in the preceding section to test homogeneity; however, the hypotheses are different and the sampling schemes are different. The test
of homogeneity was derived under the assumption that the column (or row) margins
were fixed, and the test of independence was derived under the assumption that only
the grand total was fixed. Because the test statistics are computed in an identical
fashion and have the same number of degrees of freedom, the distinction between
them is often slurred over. Furthermore, the notions of homogeneity and independence are closely related and easily confused. Independence can be thought of as
homogeneity of conditional distributions; for example, if education level and marital
status are independent, then the conditional probabilities of marital status given educational level are homogeneous—P(Married Once | College) = P(Married Once |
No College).
13.5 Matched-Pairs Designs
Matched-pairs designs can be effective for experiments involving categorical data;
as with experiments involving continuous data, pairing can control for extraneous
sources of variability and can increase the power of a statistical test. Appropriate
techniques, however, must be used in the analysis of the data. This section begins
with an extended example illustrating these concepts.
EXAMPLE A
Vianna, Greenwald, and Davies (1971) collected data comparing the percentages
of tonsillectomies for a group of patients suffering from Hodgkin’s disease and a
comparable control group:
Tonsillectomy
No Tonsillectomy
Hodgkin’s
67
34
Control
43
64
The table shows that 66% of the Hodgkin’s sufferers had had a tonsillectomy, compared to 40% of the control group. The chi-square test for homogeneity gives a
524
Chapter 13
The Analysis of Categorical Data
chi-square statistic of 14.26 with 1 degree of freedom, which is highly significant.
The investigators conjectured that the tonsils act as a protective barrier in some fashion
against Hodgkin’s disease.
Johnson and Johnson (1972) selected 85 Hodgkin’s patients who had a sibling
of the same sex who was free of the disease and whose age was within 5 years of the
patient’s. These investigators presented the following table:
Tonsillectomy
No Tonsillectomy
Hodgkin’s
41
44
Control
33
52
They calculated a chi-square statistic of 1.53, which is not significant. Their findings
thus appeared to be at odds with those of Vianna, Greenwald, and Davies.
Several letters to the editor of the journal that published Johnson and Johnson’s
results pointed out that those investigators had made an error in their analysis by
ignoring the pairings. The assumption behind the chi-square test of homogeneity
is that independent multinomial samples are compared, and Johnson and Johnson’s
samples were not independent, because siblings were paired. An appropriate analysis
of Johnson and Johnson’s data is suggested once we set up a table that exhibits the
pairings:
Sibling
Patient
No Tonsillectomy
Tonsillectomy
No Tonsillectomy
37
7
Tonsillectomy
15
26
Viewed in this way, the data are a sample of size 85 from a multinomial distribution
with four cells. We can represent the probabilities in the table as follows:
π11
π12
π1.
π21
π22
π2.
π.1
π.2
1
The appropriate null hypothesis states that the probabilities of tonsillectomy
and no tonsillectomy are the same for patients and siblings—that is, π1. = π.1 and
π2. = π.2 , or
π11 + π12 = π11 + π21
π12 + π22 = π21 + π22
These equations simplify to π12 = π21 .
13.5 Matched-Pairs Designs
525
The relevant null hypothesis is thus
H0 : π12 = π21
Under the null hypothesis, the off-diagonal probabilities are equal, and under the
alternative they are not. The diagonal probabilities do not distinguish the null and
alternative hypotheses. We will derive a test, called McNemar’s test, of this hypothesis. Under H0 , the mle’s of the cell probabilities are (see Problem 10 at the end of
this chapter)
n 11
n
n 22
=
n
n 12 + n 21
=
2n
π̂11 =
π̂22
π̂12 = π̂21
The contributions to the chi-square statistic from the n 11 and n 22 cells are equal to
zero; the remainder of the statistic is
X2 =
=
[n 21 − (n 12 + n 21 )/2]2
[n 12 − (n 12 + n 21 )/2]2
+
(n 12 + n 21 )/2
(n 12 + n 21 )/2
(n 12 − n 21 )2
n 12 + n 21
Let us count degrees of freedom: Under there are three free parameters since there
are four cell probabilities which are constrained to sum to 1. Under the null hypothesis,
there is the additional constraint, π12 = π21 , and there are two free parameters.
The chi-square statistic thus has 1 degree of freedom. For the data table exhibiting
the pairings, X 2 = 2.91, with a corresponding p-value of .09. This casts doubt on the
■
null hypothesis, contrary to Johnson and Johnson’s original analysis.
EXAMPLE B
Cell Phones and Driving
Does the use of cell phones while driving cause accidents? This is a difficult question
to study empirically. An observational study comparing accident rates of users and
nonusers would be subject to numerous sources of confounding, such as age, gender,
and time and place of driving. A randomized, controlled experiment in which drivers
were randomly assigned to use or not use cell phones is infeasible, partly because it
would be unethical to deliberately expose people to a potentially hazardous condition.
Double blinding would clearly be impossible. Redelmeier and Tibshirani (1997) conducted a clever study, designed in the following way. They identified 699 drivers who
owned cell phones and who had been involved in motor vehicle collisions. They then
used billing records to determine whether each individual used a cell phone during
the 10 minutes preceding the collision and also at the same time during the previous
week. (For more details, see the cited paper.) Each person thus served as his own
control, eliminating various sources of confounding. The results are laid out in the
526
Chapter 13
The Analysis of Categorical Data
following table:
Collision
Before Collision
On Phone
Not on Phone
On Phone
13
24
Not on Phone
157
505
Total
170
529
Total
37
662
699
From the table, on the day of the collision, 24% of the drivers had been on the
phone as compared to 5% the day before the collision. McNemar’s test can be applied
to test the null hypothesis of no association:
(157 − 24)2
157 + 24
= 97.7
X2 =
There is thus no doubt that the association is statistically significant. However, the
authors pointed out that this result does not necessarily imply that cell phone use
while driving causes more accidents—for example, it is possible that during times
of emotional stress, drivers are more likely to use cell phones, and because of the
emotional stress are also less attentive to their driving.
■
13.6 Odds Ratios
If an event A has probability P(A) of occurring, the odds of A occurring are defined
to be
P(A)
odds(A) =
1 − P(A)
Since this implies that
odds(A)
P(A) =
1 + odds(A)
odds of 2 (or 2 to 1), for example, correspond to P(A) = 2/3.
Now suppose that X denotes the event that an individual is exposed to a potentially
harmful agent and that D denotes the event that the individual becomes diseased. We
denote the complementary events as X and D. The odds of an individual contracting
the disease given that he is exposed are
P(D|X )
odds(D|X ) =
1 − P(D|X )
and the odds of contracting the disease given that he is not exposed are
odds(D|X ) =
P(D|X )
1 − P(D|X )
13.6 Odds Ratios
527
The odds ratio
odds(D|X )
odds(D|X )
is a measure of the influence of exposure on subsequent disease.
We will consider how the odds and odds ratio could be estimated by sampling
from a population with joint and marginal probabilities defined as in the following
table:
=
X
X
D
D
π00
π10
π01
π11
π0.
π1.
π.0
π.1
1
With this notation,
π11
π10 + π11
π01
P(D|X ) =
π00 + π01
P(D|X ) =
so that
π11
π10
π01
odds(D|X ) =
π00
odds(D|X ) =
and the odds ratio is
π11 π00
π01 π10
the product of the diagonal probabilities in the preceding table divided by the product
of the off-diagonal probabilities.
Now we will consider three possible ways to sample this population to study
the relationship of disease and exposure. First, we might consider drawing a random
sample from the entire population; from such a sample we could estimate all the
probabilities directly. However, if the disease is rare, the total sample size would have
to be quite large to guarantee that a substantial number of diseased individuals was
included.
A second method of sampling is called a prospective study—a fixed number of
exposed and nonexposed individuals are sampled, and the incidences of disease in
those two groups are compared. In this case the data allow us to estimate and compare
P(D|X ) and P(D|X ) and, hence, the odds ratio. For example, P(D|X ) would be
estimated by the proportion of exposed individuals who had the disease. However,
note that the individual probabilities πi j cannot be estimated from the data, because
the marginal counts of exposed and unexposed individuals have been fixed arbitrarily
by the sampling design.
A third method of sampling—a retrospective study—is one in which a fixed
number of diseased and undiseased individuals are sampled and the incidences of
=
528
Chapter 13
The Analysis of Categorical Data
exposure in the two groups are compared. The study of Vianna, Greenwald, and
Davies (1971) discussed in the previous section was of this type. From such data,
we can directly estimate P(X |D) and P(X |D) by the proportions of diseased and
nondiseased individuals who were exposed. Because the marginal counts of diseased
and nondiseased are fixed, we cannot estimate the joint probabilities or the important
conditional probabilities P(D|X ) and P(D|X ). However, as will be shown, we can
estimate the odds ratio . Observe that
π11
P(X |D) =
π01 + π11
π01
1 − P(X |D) =
π01 + π11
π11
odds(X |D) =
π01
Similarly,
π10
odds(X |D) =
π00
We thus see that the odds ratio, , defined previously, can also be expressed as
=
odds(X |D)
odds(X |D)
Specifically, suppose that the counts in such a study are denoted as in the following
table:
X
X
D
D
n 00
n 10
n 01
n 11
n .0
n .1
Then the conditional probabilities and the odds ratios are estimated as
n 11
P̂(X |D) =
n .1
n 01
1 − P̂(X |D) =
n .1
n
5 |D) = 11
odds(X
n 01
Similarly,
5 |D) = n 10
odds(X
n 00
so that the estimate of the odds ratio is
ˆ = n 00 n 11
n 01 n 10
the product of the diagonal counts divided by the product of the off-diagonal counts.
13.6 Odds Ratios
529
As an example, consider the table in the previous section that displays the data
of Vianna, Greenwald, and Davies. The odds ratio is estimated to be
ˆ = 67 × 64 = 2.93
43 × 34
According to this study, the odds of contracting Hodgkin’s disease are increased by
about a factor of three by undergoing a tonsillectomy.
ˆ = 2.93, it would be useful to attach
As well as having a point estimate ˆ
an approximate standard error to the estimate to indicate its uncertainty. Since is a nonlinear function of the counts, it appears that an analytical derivation of its
standard error would be difficult. Once again, however, the convenience of simulation
ˆ by
(the bootstrap) comes to our aid. In order to approximate the distribution of simulation, we need to generate random numbers according to a statistical model
for the counts in the table of Vianna, Greenwald, and Davies. The model is that the
count in the first row and first column, N11 , is binomially distributed with n = 101 and
probability π11 . The count in the second row and second column, N22 , is independently
binomially distributed with n = 107 and probability π22 . The distribution of the
random variable
ˆ =
N11 N22
(101 − N11 )(107 − N22 )
is thus determined by the two binomial distributions, and we could approximate it
arbitrarily well by drawing a large number of samples from them.
Since the probabilities π11 and π22 are unknown, they are estimated from the
observed counts by π̂11 = 67/101 = .663 and π̂22 = 64/107 = .598. One thousand
realizations of binomial random variables N11 and N22 were generated on a computer
ˆ The standard
and Figure 13.1 shows a histogram of the resulting 1000 values of .
deviation of these 1000 values was .89, which can be used as an
Download